TL;DR: Our DPO pipeline used a single LLM as the preference judge. Training reward climbed every run. Production accuracy fell 4 points. The judge was flipping its own labels 28% of the time at temperature 0. The setup Nexus Labs ships agents that book travel, file expenses, process insurance claims. Eight engineers on my fine-tuning team. We run DPO on Qwen2.5-32B, target latency under 800ms p95 on a single H100. Our preference data pipeline: 2,400 prompts sampled from production traces per cycle 4 completions per prompt from the current checkpoint GPT-4o-mini grades pairwise preferences against a 6-axis rubric TRL DPO, 3 epochs, lr 5e-7, beta 0.1 Standard recipe. Worked fine for two months. What we saw Week 9. Training loss curves looked clean. Reward margins grew run over run. Held-out eval reward climbed 0.62 → 0.71. Internal dashboards were green. Then product filed tickets. Latency was fine. Tool use accuracy on our production traffic mirror was down 4 points against the pre-DPO baseline.…