From -9.15pp to +0.61pp: An engineering journey through four DPO iteration failures

DEV Community·namakoo [IDFU]·25 days ago

Over 36 hours we ran four DPO training iterations against Qwen2.5-Coder-7B-Instruct, trying to push HumanEval pass@1 above the base model's 87.20%. The first three iterations failed in different ways (-9.15pp, -1.22pp, two NO-GO calls). The fourth recovered to +0.61pp.

Each failure revealed a different class of bug in our chosen-sample generation pipeline, bugs that the existing certification gates were not catching. This post walks through the four iterations and what we ended up building to fix them.

We're sharing this because the same gate-blindness probably affects most teams running DPO on autopilot-generated data. The bugs we found were not exotic; the gates that missed them were not naive.

[Figure: Δpp vs base (Qwen2.5-Coder-7B-Instruct, 4-bit, 87.20% pass@1). Each bar represents one full DPO training run.]

…
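To make the percentage-point deltas concrete, the sketch below converts them into whole problem counts. It assumes the standard 164-problem HumanEval set and that every delta was measured on the full benchmark; the post does not state either explicitly. The helper name `pp_to_problems` is hypothetical and is not part of the authors' pipeline.

```python
# Assumption: the standard HumanEval benchmark with 164 problems.
HUMANEVAL_PROBLEMS = 164

# Figures quoted in the post.
BASE_PASS_AT_1 = 87.20                 # base model pass@1, in percent
REPORTED_DELTAS_PP = [-9.15, -1.22, 0.61]


def pp_to_problems(delta_pp: float, n: int = HUMANEVAL_PROBLEMS) -> int:
    """Convert a pass@1 delta in percentage points to a whole number of problems."""
    return round(delta_pp / 100 * n)


# Number of problems the base model solves under the 164-problem assumption.
base_solved = round(BASE_PASS_AT_1 / 100 * HUMANEVAL_PROBLEMS)

if __name__ == "__main__":
    print(f"base: {base_solved}/{HUMANEVAL_PROBLEMS} problems solved")
    for delta in REPORTED_DELTAS_PP:
        change = pp_to_problems(delta)
        print(f"{delta:+.2f}pp -> {change:+d} problems "
              f"({base_solved + change}/{HUMANEVAL_PROBLEMS})")
```

Under these assumptions the base score corresponds to 143 of 164 problems, and the reported deltas correspond to 15 fewer, 2 fewer, and 1 more solved problem respectively.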
