FlowLens

ElevenLabs Integration

How ElevenLabs powers transcription, voice selection, TTS playback, and the Matrix voice visualization loop.

Why voice is central

FlowLens uses voice where voice is faster than writing another prompt:

  • asking what is wrong with a visible error
  • describing the goal for a visible prompt
  • requesting a rewrite while staying in the current app
  • answering one clarifying question without switching into chat

Voice is not a decoration. It is the input path that keeps the user in the current workflow.

Speech-to-text with scribe_v2

The renderer records microphone audio with MediaRecorder and streams chunks to the main process. After the final chunk is delivered, the renderer emits flowlens:audio-stop, and the main process sends the completed audio buffer to ElevenLabs Scribe:

client.speechToText.convert({
  file: audioBuffer,
  modelId: 'scribe_v2',
})

The result becomes the transcript that anchors the multimodal provider request.
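The main-process side of this pipeline can be sketched as below. The handler names `onAudioChunk` and `onAudioStop` are illustrative, not FlowLens identifiers; only the `flowlens:audio-stop` event name comes from the app itself.

```typescript
// Illustrative sketch: buffer MediaRecorder chunks in the main process,
// then join them into one audio buffer when flowlens:audio-stop arrives.
const pendingChunks: Buffer[] = [];

function onAudioChunk(chunk: Buffer): void {
  // MediaRecorder chunks stream in every 250ms; hold them until the turn ends.
  pendingChunks.push(chunk);
}

function onAudioStop(): Buffer {
  // Join the chunks into one contiguous buffer for the Scribe call above.
  const audioBuffer = Buffer.concat(pendingChunks);
  pendingChunks.length = 0; // reset for the next recording turn
  return audioBuffer;
}
```

The completed buffer is what gets passed as `file` to `client.speechToText.convert`.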

Recording behavior

The current recording path is tuned for natural speech:

  • MicCapture streams chunks every 250ms
  • a safety cap prevents indefinite recording
  • speech-turn detection waits for speech to begin
  • trailing silence is around three seconds by default
  • short pauses do not immediately end the turn

This is why the overlay can stay in listening mode long enough for a user to think between phrases.
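The turn-ending behavior above can be sketched as a small state machine. The thresholds and function names here are illustrative, not FlowLens constants; they only mirror the documented 250ms chunk interval and ~3-second trailing silence.

```typescript
// Hypothetical sketch of the speech-turn logic: wait for speech to begin,
// tolerate short pauses, and end the turn only after sustained silence.
const CHUNK_MS = 250;             // MicCapture chunk interval
const TRAILING_SILENCE_MS = 3000; // default end-of-turn silence

type TurnState = { speaking: boolean; silentMs: number; done: boolean };

function nextTurnState(state: TurnState, chunkHasSpeech: boolean): TurnState {
  if (!state.speaking) {
    // Turn detection waits for speech before any silence counting starts.
    return chunkHasSpeech ? { speaking: true, silentMs: 0, done: false } : state;
  }
  if (chunkHasSpeech) {
    // A short pause followed by more speech resets the silence counter,
    // so thinking between phrases does not end the turn.
    return { speaking: true, silentMs: 0, done: false };
  }
  const silentMs = state.silentMs + CHUNK_MS;
  return { speaking: true, silentMs, done: silentMs >= TRAILING_SILENCE_MS };
}
```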

Matrix visualization

The Matrix component itself is a vendored UI primitive left unchanged; FlowLens drives it by feeding it changing inputs:

  • recording mode uses live microphone VU levels
  • processing mode uses animated frames
  • response/TTS mode can react to playback state

The important implementation detail is the amplitude normalization. Web Audio time-domain data uses 128 as silence, so FlowLens computes:

Math.abs(sample - 128)

That makes silence render as low/no levels and loud input render as a brighter matrix.

Text-to-speech

After a structured response arrives, the main process calls ElevenLabs TTS with a short speech summary unless voice playback is disabled. FlowLens does not read the entire provider output aloud; code blocks, markdown tables, long lists, and oversized answers are reduced to the useful spoken explanation first.

client.textToSpeech.convert(voiceId, {
  text: prepareSpeechSummary(spokenSummary),
  modelId: config.tts.model,
  outputFormat: 'mp3_44100_128',
})
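A hypothetical sketch of what `prepareSpeechSummary` might do; FlowLens's actual reduction rules are not shown here, and the character cap is an invented constant. The idea is only that code blocks and table rows never reach the voice:

```typescript
// Illustrative reduction: drop fenced code blocks and markdown table rows,
// collapse blank runs, and cap the spoken length.
const MAX_SPOKEN_CHARS = 600; // illustrative cap, not a FlowLens constant

function prepareSpeechSummary(text: string): string {
  const withoutCode = text.replace(/```[\s\S]*?```/g, "");
  const lines = withoutCode
    .split("\n")
    .filter((line) => !line.trimStart().startsWith("|")); // drop table rows
  const spoken = lines.join("\n").replace(/\n{2,}/g, "\n").trim();
  return spoken.length > MAX_SPOKEN_CHARS
    ? spoken.slice(0, MAX_SPOKEN_CHARS).trimEnd() + "…"
    : spoken;
}
```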

Audio chunks stream to the renderer over flowlens:tts-stream, where the renderer assembles the MP3 Blob and plays it without exposing the ElevenLabs API key. The renderer CSP explicitly allows local audio Blob playback so packaged builds can play generated speech.

Voice picker

The settings UI includes the ElevenLabs Voice Picker. Voice listing happens in the main process:

  1. renderer asks for voices through IPC
  2. main process reads the stored ElevenLabs secret
  3. main process calls ElevenLabs voices API
  4. renderer receives voice IDs, names, labels, descriptions, and preview URLs

The key never crosses into the renderer.
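The boundary can be sketched like this. The `RawVoice` shape is an assumption loosely based on the ElevenLabs voices API (`voice_id`, `preview_url` and friends); the IPC channel name and `toVoiceSummary` are illustrative:

```typescript
// Assumed raw shape from the ElevenLabs voices API; field names may differ.
type RawVoice = {
  voice_id: string;
  name: string;
  labels?: Record<string, string>;
  description?: string;
  preview_url?: string;
};

// Only these safe fields cross the IPC boundary; the API key never does.
function toVoiceSummary(voice: RawVoice) {
  return {
    id: voice.voice_id,
    name: voice.name,
    labels: voice.labels ?? {},
    description: voice.description ?? "",
    previewUrl: voice.preview_url ?? "",
  };
}

// Main process, roughly (handler name is hypothetical):
//   ipcMain.handle("flowlens:list-voices", async () => {
//     const voices = await fetchVoicesWithStoredKey();
//     return voices.map(toVoiceSummary);
//   });
```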

TTS model choices

| Model | Why use it |
| --- | --- |
| eleven_flash_v2_5 | Fastest fit for short overlay summaries |
| eleven_turbo_v2_5 | Balanced speed and voice quality |
| eleven_v3 | Richer expressive playback |
| eleven_multilingual_v2 | Stable multilingual speech |

Failure behavior

| Failure | FlowLens behavior |
| --- | --- |
| Missing ElevenLabs key | setup/settings show the key as not configured and connection tests fail clearly |
| STT failure | overlay receives a retryable API error |
| Empty transcript | pipeline treats it as no usable content |
| TTS failure | visual response remains available and the overlay shows a non-blocking playback status |
| Cancel or dismiss | in-progress TTS is aborted |

Why ElevenLabs fits FlowLens

FlowLens is designed around short loops:

  1. speak the request
  2. see the matrix react
  3. receive a structured answer
  4. hear the short summary
  5. copy the final output if needed

ElevenLabs covers both ends of that loop with one voice stack while the app keeps secrets and network calls in the main process.
