Most people benchmark inference engines on throughput. Tokens per second, batch size limits, latency percentiles. But when you're training agents with reinforcement learning, there's a metric that matters more: correctness. A silent bug in your inference stack doesn't just slow you down. It poisons your training data, and you won't know for weeks.

The vLLM team just shipped V1, and buried in the release notes is a fix that should make anyone running RL training take notice. They found and corrected subtle correctness issues in how V0 handled certain token sequences under grouped query attention. The kind of bugs that don't crash your job but subtly shift your reward model's understanding of what "good" looks like.

Why RL is Unforgiving

Supervised fine-tuning is forgiving. If your inference engine produces slightly different logits for 0.1% of tokens, the gradient updates average out. RL is different. You're generating rollouts, computing advantages, updating policy and value networks in tight loops.…