From -9.15pp to +0.61pp: An engineering journey through four DPO iteration failures

DEV Community·namakoo [IDFU]·25 days ago
#iter #machinelearning #ai #chosen #samples #model

Over 36 hours we ran four DPO training iterations against Qwen2.5-Coder-7B-Instruct, trying to push HumanEval pass@1 above the base model's 87.20%. The first three iterations failed in different ways (-9.15pp, -1.22pp, two NO-GO calls). The fourth recovered to +0.61pp. Each failure revealed a different class of bug in our chosen-sample generation pipeline, bugs that the existing certification gates were not catching. This post walks through the four iterations and what we ended up building to fix them. We're sharing this because the same gate-blindness probably affects most teams running DPO on autopilot-generated data. The bugs we found were not exotic; the gates that missed them were not naive.

Figure: Δpp vs base (Qwen2.5-Coder-7B-Instruct, 4-bit, 87.20% pass@1). Each bar represents one full DPO training run.…
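To put the percentage-point figures in concrete terms, here is a minimal sketch that converts them into problem counts. It assumes the standard 164-problem HumanEval set and one scored completion per problem; the post excerpt does not state either, and the function names are hypothetical. Under those assumptions, 87.20% corresponds to 143 of 164 problems, and a single problem is worth about 0.61pp.

```python
# Assumption: standard HumanEval benchmark size (164 problems),
# one scored completion per problem. Not stated in the post excerpt.
HUMANEVAL_PROBLEMS = 164


def pass_at_1(solved: int, total: int = HUMANEVAL_PROBLEMS) -> float:
    """pass@1 as a percentage, given the number of problems solved."""
    return 100.0 * solved / total


def delta_pp(solved: int, base_solved: int, total: int = HUMANEVAL_PROBLEMS) -> float:
    """Difference in pass@1 versus the base model, in percentage points."""
    return pass_at_1(solved, total) - pass_at_1(base_solved, total)


def problems_from_delta(delta: float, total: int = HUMANEVAL_PROBLEMS) -> int:
    """Nearest whole number of problems that a pp delta corresponds to."""
    return round(delta * total / 100.0)


# Hypothetical base count chosen so that pass@1 rounds to the reported 87.20%.
BASE_SOLVED = 143

for reported in (-9.15, -1.22, 0.61):
    n = problems_from_delta(reported)
    recomputed = delta_pp(BASE_SOLVED + n, BASE_SOLVED)
    print(f"{reported:+.2f}pp ~ {n:+d} problems (recomputed {recomputed:+.2f}pp)")
```

Read this way, the reported deltas line up with whole-problem differences from the base model, which is worth keeping in mind when judging how much signal a sub-1pp change carries on a benchmark of this size.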
