LLM-as-judge variance broke our DPO training signal for 3 weeks


DEV Community: pytorch·Marcus Chen·4 days ago




TL;DR: Our DPO pipeline used a single LLM as the preference judge. Training reward climbed every run. Production accuracy fell 4 points. The judge was flipping its own labels 28% of the time at temperature 0.

The setup

Nexus Labs ships agents that book travel, file expenses, and process insurance claims. Eight engineers on my fine-tuning team. We run DPO on Qwen2.5-32B, target latency under 800 ms p95 on a single H100.

Our preference data pipeline:

- 2,400 prompts sampled from production traces per cycle
- 4 completions per prompt from the current checkpoint
- GPT-4o-mini grades pairwise preferences against a 6-axis rubric
- TRL DPO, 3 epochs, lr 5e-7, beta 0.1

Standard recipe. Worked fine for two months.

What we saw

Week 9. Training loss curves looked clean. Reward margins grew run over run. Held-out eval reward climbed 0.62 → 0.71. Internal dashboards were green.

Then product filed tickets. Latency was fine. Tool-use accuracy on our production traffic mirror was down 4 points against the pre-DPO baseline.…
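One way to quantify the label flipping mentioned in the TL;DR is to call the judge several times on the same pair and count how many pairs get inconsistent labels. The sketch below is a minimal illustration, not the measurement used in this post: `flip_rate`, `make_noisy_judge`, the `repeats` parameter, and the "any disagreement counts as a flip" definition are all assumptions, and the noisy judge is a hypothetical stand-in for a real LLM call.

```python
import random
from typing import Callable, Sequence

Pair = tuple[str, str, str]             # (prompt, completion_a, completion_b)
Judge = Callable[[str, str, str], str]  # returns "A" or "B"


def flip_rate(judge: Judge, pairs: Sequence[Pair], repeats: int = 5) -> float:
    """Fraction of pairs on which repeated judge calls do not all agree."""
    if repeats < 2:
        raise ValueError("need at least two calls per pair to detect a flip")
    if not pairs:
        raise ValueError("pairs must be non-empty")
    flipped = 0
    for prompt, a, b in pairs:
        # Collect the distinct labels seen across repeated calls.
        labels = {judge(prompt, a, b) for _ in range(repeats)}
        if len(labels) > 1:
            flipped += 1
    return flipped / len(pairs)


def make_noisy_judge(noise: float, seed: int = 0) -> Judge:
    """Hypothetical stand-in for an LLM judge (assumption, not a real API).

    Prefers the longer completion, then returns the opposite label
    with probability `noise`.
    """
    rng = random.Random(seed)

    def judge(prompt: str, a: str, b: str) -> str:
        label = "A" if len(a) >= len(b) else "B"
        if rng.random() < noise:
            label = "B" if label == "A" else "A"
        return label

    return judge


if __name__ == "__main__":
    # Synthetic pairs, for illustration only.
    pairs = [(f"prompt {i}", "x" * (i + 2), "y") for i in range(50)]
    print("stable judge:", flip_rate(make_noisy_judge(0.0), pairs))
    print("noisy judge: ", flip_rate(make_noisy_judge(0.5), pairs))
```

In a real pipeline the `judge` callable would wrap the grading request, and it may be worth also swapping the A/B order between calls, since position bias can show up as flips that this simple check does not separate from sampling noise.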
