How FlowLens Works
The runtime flow from hotkey press to overlay answer, including setup gating, screen capture, STT, multimodal analysis, and TTS playback.
Global hotkey
The user presses the configured shortcut. If onboarding is incomplete, FlowLens opens setup. If complete, the overlay invocation begins.
Overlay and capture
The overlay appears, then the main process hides it just long enough to capture the primary display through Electron's desktopCapturer, so the screenshot never includes the overlay itself.
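The hide-capture-show sequencing can be sketched as follows. The real code would call `BrowserWindow.hide()`/`show()` and `desktopCapturer.getSources()`; here those operations are injected, since the ordering guarantee is the interesting part:

```typescript
// Sketch of the hide-capture-show dance, with Electron calls injected.
type CaptureDeps = {
  hideOverlay: () => Promise<void>;
  captureScreen: () => Promise<string>; // e.g. a data-URL screenshot
  showOverlay: () => Promise<void>;
};

async function captureWithoutOverlay(deps: CaptureDeps): Promise<string> {
  await deps.hideOverlay(); // overlay must not appear in the shot
  try {
    return await deps.captureScreen();
  } finally {
    await deps.showOverlay(); // restore the overlay even if capture fails
  }
}
```

The `try`/`finally` matters: if capture throws, the overlay still comes back instead of leaving the user with a hidden window.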
Microphone capture
The renderer starts MicCapture with the saved microphone device ID. It streams audio chunks to main and exposes analyser data for the Matrix visualizer.
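The analyser data can be reduced to a single VU level per chunk, the kind of value the Matrix visualizer consumes. This is an illustrative sketch; the real renderer would read samples from a Web Audio `AnalyserNode`:

```typescript
// Compute a display-ready VU level (0..1) from one chunk of audio samples.
function vuLevel(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  // Root mean square of the chunk, clamped to [0, 1] for the visualizer.
  return Math.min(1, Math.sqrt(sumSquares / samples.length));
}
```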
Speech turn detection
FlowLens waits for speech to start, tolerates short pauses, and ends the turn after sustained trailing silence or a max-duration cap.
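The turn logic above can be modeled as a small state machine. The thresholds here are illustrative defaults, not FlowLens's actual tuning:

```typescript
// Minimal turn-detection state machine. Thresholds are assumptions.
type TurnConfig = {
  trailingSilenceMs: number; // end the turn after this much silence once speech began
  maxDurationMs: number;     // hard cap on a single turn
  speechThreshold: number;   // VU level above which a frame counts as speech
};

type TurnState = { speaking: boolean; silenceMs: number; elapsedMs: number };

const initialTurn: TurnState = { speaking: false, silenceMs: 0, elapsedMs: 0 };

// Feed one audio frame (level + duration); returns the new state and whether the turn ended.
function stepTurn(
  state: TurnState,
  level: number,
  frameMs: number,
  cfg: TurnConfig
): { state: TurnState; done: boolean } {
  const elapsedMs = state.elapsedMs + frameMs;
  if (elapsedMs >= cfg.maxDurationMs) {
    return { state: { ...state, elapsedMs }, done: true }; // max-duration cap
  }
  if (level >= cfg.speechThreshold) {
    // Speech resets the trailing-silence counter, tolerating short pauses.
    return { state: { speaking: true, silenceMs: 0, elapsedMs }, done: false };
  }
  if (!state.speaking) {
    // Still waiting for speech to start; leading silence never ends the turn.
    return { state: { ...state, elapsedMs }, done: false };
  }
  const silenceMs = state.silenceMs + frameMs;
  return {
    state: { ...state, silenceMs, elapsedMs },
    done: silenceMs >= cfg.trailingSilenceMs,
  };
}
```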
Speech-to-text
The main process sends the completed audio buffer to ElevenLabs scribe_v2 and receives transcript text plus duration metadata.
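Assembling that request might look like the sketch below. The endpoint path, header, and field names are assumptions about the ElevenLabs speech-to-text API shape, not verified against FlowLens's code; only the `scribe_v2` model name comes from the flow above:

```typescript
// Build (but do not send) the speech-to-text request for a completed turn.
function buildSttRequest(
  audio: Blob,
  apiKey: string
): { url: string; method: string; headers: Record<string, string>; body: FormData } {
  const form = new FormData();
  form.append("model_id", "scribe_v2");    // model named in the flow above
  form.append("file", audio, "turn.webm"); // completed turn audio; filename is illustrative
  return {
    url: "https://api.elevenlabs.io/v1/speech-to-text", // assumed endpoint
    method: "POST",
    headers: { "xi-api-key": apiKey },
    body: form,
  };
}
```

Keeping request assembly separate from the `fetch` call makes this step unit-testable without network access.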
Multimodal request assembly
FlowLens sends the screenshot, transcript, selected mode, and optional prior turn state to the configured OpenAI-compatible provider.
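In OpenAI-compatible chat format, that payload can be assembled roughly as follows. The system-prompt wording and the model name are placeholders, not FlowLens's actual values:

```typescript
// Sketch of the multimodal payload. Prior-turn carryover enables follow-ups.
type PriorTurn = { question: string; answer: string };

function buildAnalysisBody(
  screenshotDataUrl: string,
  transcript: string,
  mode: string,
  prior?: PriorTurn
): { model: string; messages: any[] } {
  const messages: any[] = [
    { role: "system", content: `You are FlowLens in ${mode} mode.` }, // placeholder prompt
  ];
  if (prior) {
    // Replay the previous turn so the provider has context for follow-ups.
    messages.push({ role: "user", content: prior.question });
    messages.push({ role: "assistant", content: prior.answer });
  }
  messages.push({
    role: "user",
    content: [
      { type: "image_url", image_url: { url: screenshotDataUrl } },
      { type: "text", text: transcript },
    ],
  });
  return { model: "gpt-4o-mini", messages }; // model name is a placeholder
}
```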
Structured response
The provider response is parsed and validated into spoken_summary, card_content, clarifying_question, and actionable_output.
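Validation might look like the sketch below. Which fields are required versus optional is an assumption here; rejecting malformed responses keeps bad provider output from reaching the overlay:

```typescript
// Parse and validate the provider's JSON into the four structured fields.
type StructuredResponse = {
  spoken_summary: string;
  card_content: string;
  clarifying_question?: string; // assumed optional
  actionable_output?: string;   // assumed optional
};

function parseStructuredResponse(raw: string): StructuredResponse {
  const data = JSON.parse(raw);
  if (typeof data.spoken_summary !== "string" || typeof data.card_content !== "string") {
    throw new Error("provider response missing spoken_summary or card_content");
  }
  return {
    spoken_summary: data.spoken_summary,
    card_content: data.card_content,
    clarifying_question:
      typeof data.clarifying_question === "string" ? data.clarifying_question : undefined,
    actionable_output:
      typeof data.actionable_output === "string" ? data.actionable_output : undefined,
  };
}
```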
Overlay rendering
The overlay shows the answer in a scrollable card and exposes copy, retry, cancel, dismiss, and follow-up paths where relevant.
Spoken playback
If voice playback is enabled, ElevenLabs TTS reads the short summary and streams audio chunks back to the renderer.
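The chunk-forwarding loop, including the cancel behavior noted below, can be sketched with illustrative interfaces:

```typescript
// Forward TTS audio chunks to the renderer, checking for cancellation
// between chunks so cancel/dismiss stops playback promptly.
async function pumpTts(
  chunks: AsyncIterable<Uint8Array>,
  sendToRenderer: (chunk: Uint8Array) => void,
  isCancelled: () => boolean
): Promise<"completed" | "cancelled"> {
  for await (const chunk of chunks) {
    if (isCancelled()) return "cancelled"; // stop streaming on cancel or dismiss
    sendToRenderer(chunk);
  }
  return "completed";
}
```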
Runtime properties that matter
- The screen is captured only on explicit invocation.
- The microphone is opened only for the active request.
- Audio chunks stream to the main process during recording, arriving before the final stop signal rather than as one buffer at the end.
- The Matrix visualizer receives live VU levels instead of static frames during recording.
- Config is read per invocation, so changed provider, model, voice, and microphone settings take effect without restarting.
- TTS stops on cancel or dismiss.
- Setup must be valid before the overlay can invoke the pipeline.
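The per-invocation config property can be sketched as follows; the file reader is injected for testability, and the config shape is illustrative, not FlowLens's actual schema:

```typescript
// Re-read settings on every hotkey invocation instead of caching at startup,
// so provider, model, voice, and microphone changes apply without a restart.
type FlowConfig = {
  provider: string;
  model: string;
  voiceId: string;
  micDeviceId: string;
};

function loadConfig(readFile: (path: string) => string, path: string): FlowConfig {
  // Hypothetical shape: the real app would read its own settings file here.
  return JSON.parse(readFile(path)) as FlowConfig;
}
```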
Why this flow works well in a demo
Every step is visible or explainable. The trigger is obvious, the Matrix visualizer confirms listening, the overlay shows processing, the structured response is easy to scan, and the spoken summary makes the result feel immediate.