One Open Source Project a Day (No.51): VibeVoice - Microsoft's Speech AI That Processes 90 Minutes of Audio in a Single Pass

DEV Community·WonderLab·about 1 month ago

Introduction

"The fundamental limit of traditional speech AI isn't model quality; it's architecture. They were never designed for long audio."

This is article No.51 in the "One Open Source Project a Day" series. Today's project is VibeVoice (GitHub).

In August 2025, Microsoft Research quietly pushed a repository to GitHub. No launch event. No press release. The capability it demonstrated: synthesizing 90 minutes of natural multi-speaker conversation (4 speakers, consistent voices throughout) in a single model pass.

For context, ElevenLabs tops out around 5 minutes per call. OpenAI's TTS has similar constraints. The open-source alternatives before this couldn't produce an hour of audio without stitching segments together.

The mechanism behind this is a single architectural decision: a 7.5 Hz ultra-low frame rate tokenizer that compresses 90 minutes of audio into ~40,500 tokens (90 × 60 s × 7.5 tokens/s), small enough to fit inside an LLM's context window. That's a 3,200x compression ratio compared to the raw audio samples.…
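The token count and compression ratio quoted above can be checked with a few lines of arithmetic. The sketch below is only a sanity check, not code from the VibeVoice repository. The 7.5 Hz frame rate and 90-minute duration come from the article. The 24 kHz sample rate is an assumption, chosen because it is the rate at which 7.5 Hz yields exactly the stated 3,200x figure. The function names are hypothetical.

```python
# Values stated in the article.
FRAME_RATE_HZ = 7.5   # tokenizer frames (tokens) per second of audio
DURATION_MIN = 90     # length of the synthesized conversation

# Assumption: raw audio sampled at 24 kHz (not stated in the excerpt).
ASSUMED_SAMPLE_RATE_HZ = 24_000


def token_count(duration_min: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover the given duration."""
    return round(duration_min * 60 * frame_rate_hz)


def compression_ratio(sample_rate_hz: float, frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Raw audio samples per tokenizer frame."""
    return sample_rate_hz / frame_rate_hz


tokens = token_count(DURATION_MIN)
ratio = compression_ratio(ASSUMED_SAMPLE_RATE_HZ)

print(tokens)  # 40500
print(ratio)   # 3200.0
```

Under that sample-rate assumption, both figures in the article are mutually consistent: 5,400 seconds of audio at 7.5 frames per second gives 40,500 tokens, and each frame stands in for 3,200 raw samples.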
