Here's what you get at the end: a browser app where you click a button, ask a question aloud, and hear the answer back in a cloned voice. Speech recognition, LLM response, and text-to-speech: all Mistral, all on the free plan. This article walks through how the pipeline fits together, shows the code for the part most tutorials skip (the STT relay), and covers the cost and compliance angles that are worth knowing before you pick a stack.

How the pipeline fits together

Browser mic → [WebSocket] → Voxtral STT → Mistral LLM → Voxtral TTS → Browser audio

The browser never talks to Mistral directly. It relays audio over WebSocket to a FastAPI backend, which handles all three API calls. There are two reasons for this: you can't expose your API key in browser JavaScript, and Voxtral's realtime speech recognition requires a persistent connection that has to stay open for the full duration of the audio stream.…
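To make the data flow concrete, here is a minimal sketch of the backend's orchestration, using only the Python standard library. It assumes the WebSocket handler pushes incoming audio chunks onto an `asyncio.Queue` and signals end of stream with `None`. The names `drain`, `run_pipeline`, and the three `fake_*` stages are hypothetical placeholders, not Mistral or FastAPI APIs; in a real backend the stand-ins would be swapped for the actual Voxtral STT, Mistral LLM, and Voxtral TTS calls.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

# Hypothetical stage signatures. The real Mistral calls would sit behind these,
# on the server, so the API key never reaches browser JavaScript.
Transcribe = Callable[[AsyncIterator[bytes]], Awaitable[str]]
Respond = Callable[[str], Awaitable[str]]
Synthesize = Callable[[str], Awaitable[bytes]]


async def drain(queue: asyncio.Queue) -> AsyncIterator[bytes]:
    """Yield audio chunks until the None sentinel marks end of stream."""
    while True:
        chunk = await queue.get()
        if chunk is None:
            return
        yield chunk


async def run_pipeline(
    audio_chunks: AsyncIterator[bytes],
    transcribe: Transcribe,
    respond: Respond,
    synthesize: Synthesize,
) -> bytes:
    """Run the three stages in order: STT, then LLM, then TTS."""
    question = await transcribe(audio_chunks)  # consumes the whole stream
    answer = await respond(question)
    return await synthesize(answer)


# Stand-in stages so the sketch runs without network access or an API key.
async def fake_transcribe(chunks: AsyncIterator[bytes]) -> str:
    total = 0
    async for chunk in chunks:
        total += len(chunk)
    return f"question from {total} bytes"


async def fake_respond(question: str) -> str:
    return f"answer to: {question}"


async def fake_synthesize(answer: str) -> bytes:
    return answer.encode("utf-8")


async def demo() -> bytes:
    # Simulate the browser sending two audio chunks, then closing the stream.
    queue: asyncio.Queue = asyncio.Queue()
    for chunk in (b"\x00\x01", b"\x02\x03\x04", None):
        queue.put_nowait(chunk)
    return await run_pipeline(
        drain(queue), fake_transcribe, fake_respond, fake_synthesize
    )


if __name__ == "__main__":
    print(asyncio.run(demo()))  # b'answer to: question from 5 bytes'
```

Passing the stages in as callables keeps the orchestration testable on its own: the same `run_pipeline` works with the stand-ins above or with real API clients, and the transcription stage reads from the stream for as long as the browser keeps sending audio.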