Why all this interest in realtime transcription It all started with the collaboration with PyCon IT. At PyCon IT 2025 they set up live transcription with local Whisper on a Graphics Processing Unit (GPU), based on the repo realtime-transcription-fastrtc . With the YouTube videos used as tests, all good. With the real audio of a conference room, Whisper started hallucinating: a generative model, if you give it a signal it doesn't recognize, doesn't leave a blank, it writes something anyway. For PyCon IT 2026 a different path was needed, on a non-negotiable anchor: no hallucinations. If the model doesn't hear, ok, skip a word. If it hears badly, ok, transcribe badly. But it must not write sentences I didn't say. Fixing Whisper's hallucinations directly (Voice Activity Detection, tuning decoding parameters, logprob filters, fine-tuning, ..) would have been a separate effort: I didn't have the time, with everything else to build. A bigger Whisper I haven't tested.…