What 500 curated failure pairs actually fix: a breakdown across 3 seeds


DEV Community·namakoo [IDFU]·about 1 month ago


#ai #python #model #seeds #idfu #humaneval


In the previous post, I described the curation philosophy for IDFU's rejected-side dataset: why I avoid synthetic bug generation, why stub detection matters, why "honest failures" are hard to come by. A few people asked the obvious question: does it work? This post is the answer. Not as a marketing pitch, but as a breakdown. Aggregate scores hide more than they reveal, and I want to show what's actually changing under the hood.

The setup

Base model: Qwen2.5-Coder-3B-Instruct (trained on 92 programming languages)
Method: DPO via TRL with LoRA
Data: 500 preference pairs from the IDFU dataset
Eval: HumanEval, pass@1, three random seeds (42, 123, 7)
Hardware: RTX 4060, single-GPU, ~3-4 hours of training per seed

A note on the data shape, since I left this implicit in the previous post. Each preference pair has a chosen side and a rejected side, both produced inside IDFU. The rejected side is what Part 1 was about: honest failures, curated rather than synthesized.…
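To make the setup concrete, here is a minimal sketch of two pieces it describes: the shape of one preference pair and a per-seed pass@1 summary. The field names `prompt`, `chosen`, and `rejected` follow the preference format commonly used with TRL's DPO trainer; that IDFU stores its pairs this way is an assumption. `PreferencePair`, `pass_at_1`, and `summarize` are hypothetical names, and the booleans in the usage example are toy values, not results from the post.

```python
from dataclasses import dataclass
from statistics import mean, stdev

SEEDS = (42, 123, 7)  # the three seeds named in the setup


@dataclass(frozen=True)
class PreferencePair:
    """One DPO training record (field names are an assumption, see lead-in)."""
    prompt: str
    chosen: str    # the preferred completion
    rejected: str  # the curated "honest failure"

    def __post_init__(self):
        if not self.prompt.strip():
            raise ValueError("prompt must be non-empty")
        if self.chosen == self.rejected:
            raise ValueError("chosen and rejected must differ")


def pass_at_1(passed):
    """Fraction of problems whose single sampled completion passed its tests."""
    results = list(passed)
    if not results:
        raise ValueError("need at least one problem result")
    return sum(results) / len(results)


def summarize(per_seed):
    """Return per-seed pass@1 plus the mean and sample std dev across seeds."""
    scores = {seed: pass_at_1(results) for seed, results in per_seed.items()}
    values = list(scores.values())
    return scores, mean(values), stdev(values)


# Toy usage: four made-up problems per seed, purely illustrative.
pair = PreferencePair(
    prompt="Write add(a, b).",
    chosen="def add(a, b):\n    return a + b",
    rejected="def add(a, b):\n    return a - b",
)
toy = {
    42: [True, True, False, True],
    123: [True, False, False, True],
    7: [True, True, True, True],
}
scores, avg, spread = summarize(toy)
```

Keeping the per-seed scores next to the mean matters for the breakdown this post is after: a single averaged number would hide exactly the seed-to-seed variation that three runs are meant to expose.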
