Cohere released their new ASR model on March 26 with a 5.42% Word Error Rate on the LibriSpeech test-clean benchmark. That's a noticeable improvement over Whisper-large-v3 (~5.7%), and given it's open-source under a permissive license, I spent the last two weeks running it through real-world audio to see if the benchmark numbers translate. The short answer: yes for clean studio audio, partially for noisy real-world recordings, and not yet for code-switched conversations.

What's actually new

Cohere's transcribe model is built on a different architecture than Whisper (encoder-decoder transformer with a lighter decoder). Key claims from the release notes:

- 5.42% WER on LibriSpeech test-clean
- Roughly 30% faster inference than Whisper-large-v3 at similar batch sizes
- Released with weights + inference code (not API-only)
- Supports streaming via chunked inference

The "30% faster" caveat: this assumes you're running on the same hardware Cohere benchmarked.…
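
Since the WER figures above carry most of the comparison, it helps to be concrete about what the metric measures: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model's output, divided by the number of reference words. Below is a minimal sketch for checking numbers like these on your own audio. The function name `word_error_rate` is hypothetical, and the normalization (lowercasing and splitting on whitespace) is a simplifying assumption; benchmark results are usually computed after heavier text normalization, so this will not reproduce published figures exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count.

    Assumption: text is normalized only by lowercasing and whitespace
    splitting. Punctuation and number formatting are left untouched.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words consumed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 100% when the hypothesis contains many inserted words, and that corpus-level WER is normally computed by summing errors and reference words across all utterances rather than averaging per-utterance scores.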