Over 36 hours we ran four DPO training iterations against Qwen2.5-Coder-7B-Instruct, trying to push HumanEval pass@1 above the base model's 87.20%. The first three iterations failed in different ways (-9.15pp, -1.22pp, two NO-GO calls). The fourth recovered to +0.61pp. Each failure revealed a different class of bug in our chosen-sample generation pipeline — bugs the existing certification gates were not catching. This post walks through the four iterations and what we ended up building to fix them. We're sharing this because the same gate-blindness probably affects most teams running DPO on autopilot-generated data. The bugs we found were not exotic; the gates that missed them were not naive. Δpp vs base (Qwen2.5-Coder-7B-Instruct, 4-bit, 87.20% pass@1). Each bar represents one full DPO training run.…