# ElevenLabs Integration

How ElevenLabs powers transcription, voice selection, TTS playback, and the Matrix voice visualization loop.
## Why voice is central
FlowLens uses voice where voice is faster than writing another prompt:
- asking what is wrong with a visible error
- describing the goal for a visible prompt
- requesting a rewrite while staying in the current app
- answering one clarifying question without switching into chat
Voice is not a decoration. It is the input path that keeps the user in the current workflow.
## Speech-to-text with scribe_v2
The renderer records microphone audio with MediaRecorder and streams chunks to the main process. After the final chunk is delivered, the renderer emits flowlens:audio-stop, and the main process sends the completed audio buffer to ElevenLabs Scribe:
```ts
client.speechToText.convert({
  file: audioBuffer,
  modelId: 'scribe_v2',
})
```

The result becomes the transcript that anchors the multimodal provider request.
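The renderer side of that capture loop can be sketched as below. The `RecorderLike` interface and `bridge` helper are assumptions standing in for `MediaRecorder` and the real preload IPC bridge; the 250 ms timeslice and the `flowlens:audio-stop` channel come from the text:

```ts
// Minimal sketch of the renderer capture wiring (names other than
// flowlens:audio-stop are assumptions, not FlowLens source).
interface RecorderLike {
  start(timesliceMs: number): void;
  ondataavailable: ((e: { data: Uint8Array }) => void) | null;
  onstop: (() => void) | null;
}

function wireCapture(
  recorder: RecorderLike,
  bridge: { sendChunk(chunk: Uint8Array): void; emit(channel: string): void },
): void {
  // Forward each audio chunk to the main process as it arrives.
  recorder.ondataavailable = (e) => bridge.sendChunk(e.data);
  // After the final chunk, signal the main process to run transcription.
  recorder.onstop = () => bridge.emit('flowlens:audio-stop');
  recorder.start(250); // MediaRecorder timeslice: deliver a chunk every 250 ms
}
```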
### Recording behavior
The current recording path is tuned for natural speech:
- MicCapture streams chunks every 250 ms
- a safety cap prevents indefinite recording
- speech-turn detection waits for speech to begin
- trailing silence is around three seconds by default
- short pauses do not immediately end the turn
This is why the overlay can stay in listening mode long enough for a user to think between phrases.
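Those rules can be sketched as a small state machine. The class name and frame-based API are assumptions; the three-second trailing silence mirrors the default above, and the cap value here is an illustrative placeholder:

```ts
// Sketch of the speech-turn rules: wait for speech, tolerate short pauses,
// end on sustained trailing silence, and enforce a safety cap.
class TurnDetector {
  private speechStarted = false;
  private silenceMs = 0;
  private elapsedMs = 0;

  constructor(
    private trailingSilenceMs = 3000, // end turn after ~3 s of silence
    private maxTurnMs = 60_000,       // safety cap on total recording time
  ) {}

  // Feed one analysis frame; returns true when the turn should end.
  update(isSpeech: boolean, frameMs: number): boolean {
    this.elapsedMs += frameMs;
    if (this.elapsedMs >= this.maxTurnMs) return true; // safety cap hit
    if (isSpeech) {
      this.speechStarted = true; // detection waits for speech to begin
      this.silenceMs = 0;        // a short pause resets, rather than ends, the turn
      return false;
    }
    if (!this.speechStarted) return false; // still waiting for first speech
    this.silenceMs += frameMs;
    return this.silenceMs >= this.trailingSilenceMs;
  }
}
```

This is why a pause between phrases keeps the overlay listening while sustained silence ends the turn.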
## Matrix visualization
The Matrix component is left as the vendored UI primitive. FlowLens feeds it changing inputs:
- recording mode uses live microphone VU levels
- processing mode uses animated frames
- response/TTS mode can react to playback state
The important implementation detail is the amplitude normalization. Web Audio time-domain data uses 128 as silence, so FlowLens computes:
```ts
Math.abs(sample - 128)
```

That makes silence render as low/no levels and loud input render as a brighter matrix.
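A sketch of that normalization over one analyser frame (the function name is an assumption; the input is the `Uint8Array` filled by `AnalyserNode.getByteTimeDomainData()`):

```ts
// Web Audio byte time-domain data centers silence at 128, so the deviation
// from 128 is the signal amplitude for each 8-bit sample.
function vuLevel(samples: Uint8Array): number {
  let sum = 0;
  for (const s of samples) sum += Math.abs(s - 128); // 128 = silence midpoint
  // Normalize the average deviation to 0..1 (127 is the max deviation).
  return sum / samples.length / 127;
}
```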
## Text-to-speech
After a structured response arrives, the main process calls ElevenLabs TTS with a short speech summary unless voice playback is disabled. FlowLens does not read the entire provider output aloud; code blocks, markdown tables, long lists, and oversized answers are reduced to the useful spoken explanation first.
```ts
client.textToSpeech.convert(voiceId, {
  text: prepareSpeechSummary(spokenSummary),
  modelId: config.tts.model,
  outputFormat: 'mp3_44100_128',
})
```

Audio chunks stream to the renderer over flowlens:tts-stream, where the renderer assembles the MP3 Blob and plays it without exposing the ElevenLabs API key. The renderer CSP explicitly allows local audio Blob playback so packaged builds can play generated speech.
## Voice picker
The settings UI includes the ElevenLabs Voice Picker. Voice listing happens in the main process:
- renderer asks for voices through IPC
- main process reads the stored ElevenLabs secret
- main process calls ElevenLabs voices API
- renderer receives voice IDs, names, labels, descriptions, and preview URLs
The key never crosses into the renderer.
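The main-process end of that flow can be sketched as a pure mapping step. The field names below are assumptions about the SDK response shape; the point is that only display data is serialized back over IPC:

```ts
// Assumed shape of one voice entry from the ElevenLabs voices API.
interface RawVoice {
  voiceId: string;
  name?: string;
  labels?: Record<string, string>;
  description?: string;
  previewUrl?: string;
}

// Strip the SDK response down to the display fields the renderer needs;
// anything else (including credentials) stays in the main process.
function toVoiceSummaries(voices: RawVoice[]) {
  return voices.map(({ voiceId, name, labels, description, previewUrl }) => ({
    voiceId, name, labels, description, previewUrl,
  }));
}
```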
## TTS model choices
| Model | Why use it |
|---|---|
| eleven_flash_v2_5 | Fastest fit for short overlay summaries |
| eleven_turbo_v2_5 | Balanced speed and voice quality |
| eleven_v3 | Richer expressive playback |
| eleven_multilingual_v2 | Stable multilingual speech |
## Failure behavior
| Failure | FlowLens behavior |
|---|---|
| Missing ElevenLabs key | setup/settings show the key as not configured and connection tests fail clearly |
| STT failure | overlay receives a retryable API error |
| Empty transcript | pipeline treats it as no usable content |
| TTS failure | visual response remains available and the overlay shows a non-blocking playback status |
| Cancel or dismiss | in-progress TTS is aborted |
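The non-blocking TTS posture in the table can be sketched as a best-effort wrapper (all names here are assumptions, not FlowLens source):

```ts
// Speech is best-effort: a TTS failure yields a playback status instead of
// an error, so the visual response always remains available.
async function speakSummary(
  synthesize: (text: string) => Promise<Uint8Array[]>,
  text: string,
): Promise<{ spoken: boolean; status?: string }> {
  try {
    await synthesize(text);
    return { spoken: true };
  } catch {
    return { spoken: false, status: 'voice playback unavailable' };
  }
}
```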
## Why ElevenLabs fits FlowLens
FlowLens is designed around short loops:
- speak the request
- see the matrix react
- receive a structured answer
- hear the short summary
- copy the final output if needed
ElevenLabs covers both ends of that loop with one voice stack while the app keeps secrets and network calls in the main process.