So my dad needed to transcribe an interview. Simple enough right? Except he refused to upload his voice to any cloud service which honestly makes total sense. I went looking for local options and everything required installing Python, managing dependencies, running terminal commands. An hour of setup minimum. Not happening. So instead of doing the setup I just built it. Took about 5 hours. Here's how the whole thing works under the hood. The core architecture The fundamental insight is that you can run Whisper which is the same model powering most cloud transcription services directly in the browser using WebAssembly. No server needed. Transformers.js by Hugging Face handles all the heavy lifting: model downloading, caching, ONNX inference, and audio chunking. Here's the high level flow: Why a Web Worker is non-negotiable This is the most important architectural decision. Whisper is computationally heavy.…