If you've spent any time looking at AI music separation in the last twelve months, you've probably run into the same three names: Spleeter, htdemucs (Hybrid Transformer Demucs), and BS-RoFormer. They show up in every comparison post, every research paper, and every "how to extract vocals" tutorial, but the way they're compared is usually wrong. Most posts cite a single SDR number from a 2019 paper and call it a day. That's not useful if you're trying to ship a product, build a pipeline, or pick a model for real audio.

This post compares the three on the dimensions that actually matter when you're deploying audio separation:

- Quality: SDR scores from peer-reviewed sources, not vibes
- Inference speed: what you'll actually wait for in production
- Cost per song: running on commodity GPUs at 2026 prices
- Output flexibility: 2 stems vs 4 stems vs 6 stems
- When each one is the right choice, and when it isn't

Everything below is based on published benchmarks plus our own production deployment of htdemucs at…