# ElevenLabs Integration

How ElevenLabs powers transcription, voice selection, TTS playback, and the Matrix voice visualization loop.
## Why voice is central
FlowLens uses voice where voice is faster than writing another prompt:
- asking what is wrong with a visible error
- describing the goal for a visible prompt
- requesting a rewrite while staying in the current app
- answering one clarifying question without switching into chat
Voice is not a decoration. It is the input path that keeps the user in the current workflow.
## Speech-to-text with scribe_v2
The renderer records microphone audio with MediaRecorder and streams chunks to the main process. After the final chunk is delivered, the renderer emits flowlens:audio-stop, and the main process sends the completed audio buffer to ElevenLabs Scribe:
```ts
client.speechToText.convert({
  file: audioBuffer,
  modelId: 'scribe_v2',
})
```

The result becomes the transcript that anchors the multimodal provider request.
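The renderer side of that capture loop can be sketched as below. The `RecorderLike` interface and `bridge` helper are assumptions standing in for `MediaRecorder` and the real preload IPC bridge; the 250 ms timeslice and the `flowlens:audio-stop` channel come from the text:

```ts
// Minimal sketch of the renderer capture wiring (names other than
// flowlens:audio-stop are assumptions, not FlowLens source).
interface RecorderLike {
  start(timesliceMs: number): void;
  ondataavailable: ((e: { data: Uint8Array }) => void) | null;
  onstop: (() => void) | null;
}

function wireCapture(
  recorder: RecorderLike,
  bridge: { sendChunk(chunk: Uint8Array): void; emit(channel: string): void },
): void {
  // Forward each audio chunk to the main process as it arrives.
  recorder.ondataavailable = (e) => bridge.sendChunk(e.data);
  // After the final chunk, signal the main process to run transcription.
  recorder.onstop = () => bridge.emit('flowlens:audio-stop');
  recorder.start(250); // MediaRecorder timeslice: deliver a chunk every 250 ms
}
```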
### Recording behavior
The current recording path is tuned for natural speech:
- MicCapture streams chunks every 250 ms
- a safety cap prevents indefinite recording
- speech-turn detection waits for speech to begin
- trailing silence is around three seconds by default
- short pauses do not immediately end the turn
This is why the overlay can stay in listening mode long enough for a user to think between phrases.
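Those rules can be sketched as a small state machine. The class name and frame-based API are assumptions; the three-second trailing silence mirrors the default above, and the cap value here is an illustrative placeholder:

```ts
// Sketch of the speech-turn rules: wait for speech, tolerate short pauses,
// end on sustained trailing silence, and enforce a safety cap.
class TurnDetector {
  private speechStarted = false;
  private silenceMs = 0;
  private elapsedMs = 0;

  constructor(
    private trailingSilenceMs = 3000, // end turn after ~3 s of silence
    private maxTurnMs = 60_000,       // safety cap on total recording time
  ) {}

  // Feed one analysis frame; returns true when the turn should end.
  update(isSpeech: boolean, frameMs: number): boolean {
    this.elapsedMs += frameMs;
    if (this.elapsedMs >= this.maxTurnMs) return true; // safety cap hit
    if (isSpeech) {
      this.speechStarted = true; // detection waits for speech to begin
      this.silenceMs = 0;        // a short pause resets, rather than ends, the turn
      return false;
    }
    if (!this.speechStarted) return false; // still waiting for first speech
    this.silenceMs += frameMs;
    return this.silenceMs >= this.trailingSilenceMs;
  }
}
```

This is why a pause between phrases keeps the overlay listening while sustained silence ends the turn.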
## Matrix visualization
The Matrix component is left as the vendored UI primitive. FlowLens feeds it changing inputs:
- recording mode uses live microphone VU levels
- processing mode uses animated frames
- response/TTS mode can react to playback state
The important implementation detail is the amplitude normalization. Web Audio time-domain data uses 128 as silence, so FlowLens computes:
```ts
Math.abs(sample - 128)
```

That makes silence render as low/no levels and loud input render as a brighter matrix.
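A sketch of that normalization over one analyser frame (the function name is an assumption; the input is the `Uint8Array` filled by `AnalyserNode.getByteTimeDomainData()`):

```ts
// Web Audio byte time-domain data centers silence at 128, so the deviation
// from 128 is the signal amplitude for each 8-bit sample.
function vuLevel(samples: Uint8Array): number {
  let sum = 0;
  for (const s of samples) sum += Math.abs(s - 128); // 128 = silence midpoint
  // Normalize the average deviation to 0..1 (127 is the max deviation).
  return sum / samples.length / 127;
}
```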
## Text-to-speech
After a structured response arrives, the main process calls ElevenLabs TTS with a short speech summary unless voice playback is disabled. FlowLens does not read the entire provider output aloud; code blocks, markdown tables, long lists, and oversized answers are reduced to the useful spoken explanation first.
```ts
client.textToSpeech.convert(voiceId, {
  text: prepareSpeechSummary(spokenSummary),
  modelId: config.tts.model,
  outputFormat: 'mp3_44100_128',
})
```

Audio chunks stream to the renderer over flowlens:tts-stream, where the renderer assembles the MP3 Blob and plays it without exposing the ElevenLabs API key. The renderer CSP explicitly allows local audio Blob playback so packaged builds can play generated speech.
## Voice picker
The settings UI includes the ElevenLabs Voice Picker. Voice listing happens in the main process:
- renderer asks for voices through IPC
- main process reads the stored ElevenLabs secret
- main process calls ElevenLabs voices API
- renderer receives voice IDs, names, labels, descriptions, and preview URLs
The key never crosses into the renderer.
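The main-process end of that flow can be sketched as a pure mapping step. The field names below are assumptions about the SDK response shape; the point is that only display data is serialized back over IPC:

```ts
// Assumed shape of one voice entry from the ElevenLabs voices API.
interface RawVoice {
  voiceId: string;
  name?: string;
  labels?: Record<string, string>;
  description?: string;
  previewUrl?: string;
}

// Strip the SDK response down to the display fields the renderer needs;
// anything else (including credentials) stays in the main process.
function toVoiceSummaries(voices: RawVoice[]) {
  return voices.map(({ voiceId, name, labels, description, previewUrl }) => ({
    voiceId, name, labels, description, previewUrl,
  }));
}
```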
## TTS model choices
| Model | Why use it |
|---|---|
| eleven_flash_v2_5 | Fastest fit for short overlay summaries |
| eleven_turbo_v2_5 | Balanced speed and voice quality |
| eleven_v3 | Richer expressive playback |
| eleven_multilingual_v2 | Stable multilingual speech |
## Failure behavior
| Failure | FlowLens behavior |
|---|---|
| Missing ElevenLabs key | setup/settings show the key as not configured and connection tests fail clearly |
| STT failure | overlay receives a retryable API error |
| Empty transcript | pipeline treats it as no usable content |
| TTS failure | visual response remains available and the overlay shows a non-blocking playback status |
| Cancel or dismiss | in-progress TTS is aborted |
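The non-blocking TTS posture in the table can be sketched as a best-effort wrapper (all names here are assumptions, not FlowLens source):

```ts
// Speech is best-effort: a TTS failure yields a playback status instead of
// an error, so the visual response always remains available.
async function speakSummary(
  synthesize: (text: string) => Promise<Uint8Array[]>,
  text: string,
): Promise<{ spoken: boolean; status?: string }> {
  try {
    await synthesize(text);
    return { spoken: true };
  } catch {
    return { spoken: false, status: 'voice playback unavailable' };
  }
}
```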
## Why ElevenLabs fits FlowLens
FlowLens is designed around short loops:
- speak the request
- see the matrix react
- receive a structured answer
- hear the short summary
- copy the final output if needed
ElevenLabs covers both ends of that loop with one voice stack while the app keeps secrets and network calls in the main process.