In the previous post, I described the curation philosophy for IDFU's rejected-side dataset: why I avoid synthetic bug generation, why stub detection matters, and why "honest failures" are hard to come by. A few people asked the obvious question: does it work?

This post is the answer. Not as a marketing pitch, but as a breakdown. Aggregate scores hide more than they reveal, and I want to show what's actually changing under the hood.

The setup

Base model: Qwen2.5-Coder-3B-Instruct (trained on 92 programming languages)
Method: DPO via TRL with LoRA
Data: 500 preference pairs from the IDFU dataset
Eval: HumanEval, pass@1, three random seeds (42, 123, 7)
Hardware: RTX 4060, single GPU, ~3-4 hours of training per seed

A note on the data shape, since I left this implicit in the previous post. Each preference pair has a chosen side and a rejected side, both produced inside IDFU. The rejected side is what Part 1 was about: honest failures, curated rather than synthesized.…
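To make the pair shape concrete, here is a minimal sketch of what one record might look like. It assumes the explicit-prompt preference layout that TRL's DPO training commonly consumes (`prompt`, `chosen`, `rejected` columns). The `PreferencePair` class, the `to_dpo_record` helper, and the example strings are hypothetical placeholders written only to show the shape; they are not taken from the IDFU dataset or its tooling.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PreferencePair:
    """One preference pair: a prompt plus a chosen and a rejected completion.

    Hypothetical container for illustration; the real IDFU schema may differ.
    """
    prompt: str
    chosen: str
    rejected: str


def to_dpo_record(pair: PreferencePair) -> dict:
    """Validate a pair and return a plain dict with prompt/chosen/rejected keys."""
    if not pair.prompt.strip():
        raise ValueError("prompt must not be empty")
    if pair.chosen == pair.rejected:
        # A pair whose two sides are identical carries no preference signal.
        raise ValueError("chosen and rejected must differ")
    return asdict(pair)


# Placeholder content, invented purely to show the record layout.
example = PreferencePair(
    prompt="Write a function that returns the sum of a list of integers.",
    chosen="def total(xs):\n    return sum(xs)\n",
    rejected="def total(xs):\n    acc = 0\n    for x in xs[1:]:\n        acc += x\n    return acc\n",
)

record = to_dpo_record(example)
print(sorted(record.keys()))
```

A list of such dicts could then be loaded into whatever dataset container the training script uses; how IDFU actually stores and loads its 500 pairs is not specified here.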