What 500 curated failure pairs actually fix: a breakdown across 3 seeds

DEV Community·namakoo [IDFU]·about 1 month ago
#ai #python #model #seeds #idfu #humaneval

In the previous post, I described the curation philosophy for IDFU's rejected-side dataset: why I avoid synthetic bug generation, why stub detection matters, and why "honest failures" are hard to come by. A few people asked the obvious question: does it work?

This post is the answer, written as a breakdown rather than a marketing pitch. Aggregate scores hide more than they reveal, and I want to show what's actually changing under the hood.

The setup

Base model: Qwen2.5-Coder-3B-Instruct (trained on 92 programming languages)
Method: DPO via TRL with LoRA
Data: 500 preference pairs from the IDFU dataset
Eval: HumanEval, pass@1, three random seeds (42, 123, 7)
Hardware: RTX 4060, single GPU, ~3-4 hours of training per seed

A note on the data shape, since I left this implicit in the previous post. Each preference pair has a chosen side and a rejected side, both produced inside IDFU. The rejected side is what Part 1 was about: honest failures, curated rather than synthesized.…
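To make the "aggregate scores hide more than they reveal" point concrete, here is a minimal sketch of a per-problem breakdown across seeds. It assumes one completion per problem, so pass@1 is simply the fraction of problems that pass. The function names, the `toy/N` task IDs, and every outcome below are hypothetical placeholders for illustration, not results or code from the IDFU runs.

```python
def pass_at_1(results: dict[str, bool]) -> float:
    """Fraction of problems whose single sampled completion passed its tests."""
    return sum(results.values()) / len(results)


def breakdown(base: dict[str, bool], tuned: dict[str, bool]) -> dict[str, list[str]]:
    """Group problems by how their outcome changed from the base to the tuned model."""
    groups: dict[str, list[str]] = {
        "fixed": [],       # failed before, passes after
        "broken": [],      # passed before, fails after
        "still_pass": [],  # passes in both
        "still_fail": [],  # fails in both
    }
    for task_id in sorted(base):
        before, after = base[task_id], tuned[task_id]
        if after and not before:
            groups["fixed"].append(task_id)
        elif before and not after:
            groups["broken"].append(task_id)
        elif before:
            groups["still_pass"].append(task_id)
        else:
            groups["still_fail"].append(task_id)
    return groups


# Toy placeholder outcomes (hypothetical, NOT measurements from the post).
base = {"toy/0": True, "toy/1": False, "toy/2": False, "toy/3": True}
tuned_by_seed = {
    42: {"toy/0": True, "toy/1": True, "toy/2": False, "toy/3": True},
    123: {"toy/0": True, "toy/1": True, "toy/2": False, "toy/3": False},
    7: {"toy/0": True, "toy/1": False, "toy/2": True, "toy/3": True},
}

if __name__ == "__main__":
    print(f"base pass@1 = {pass_at_1(base):.2f}")
    for seed, tuned in tuned_by_seed.items():
        groups = breakdown(base, tuned)
        print(
            f"seed {seed}: pass@1 = {pass_at_1(tuned):.2f}, "
            f"fixed = {groups['fixed']}, broken = {groups['broken']}"
        )
```

In the toy data, seed 123 lands on the same pass@1 as the base model even though one problem was fixed and another broke. That is the kind of movement a single aggregate number cannot show.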
