# Curating Python failures for DPO: notes from the rejected side

DEV Community·namakoo [IDFU]·about 1 month ago

Most of the work in DPO training data is on the rejected side. The chosen side has gold-standard reference implementations everywhere: production code, peer-reviewed libraries, official examples. The rejected side is harder. You need code that someone could plausibly write, that fails for a real reason, and that fails in a way the model can actually learn from.

I tried a few approaches before settling on one that works. Hand-curating from production code review is honest, but slow. After about fifty samples I'm tired and my judgments start drifting. Public failure datasets exist but tend to be sparse and narrow: toy bugs in toy domains, or syntactic typos that don't really teach anything. Asking GPT-4 to "write a buggy version of X" is fast but expensive, and the bugs come out so obviously fabricated that they'd train the model on the wrong distribution. The bug-versus-correct boundary in those samples is too clean. Real bugs are messier.…
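To make the three requirements for a rejected sample concrete, here is a minimal sketch of a single preference pair and a check that the rejected side really fails while the chosen side passes. It is an illustration under assumptions, not the pipeline this post settles on: the `prompt`/`chosen`/`rejected` field names follow the common preference-dataset convention, while `failure_reason`, `passes`, and `is_valid_pair` are hypothetical names, and the mutable-default bug is just one example of a plausible, real failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str           # reference implementation
    rejected: str         # plausible code that fails for a real reason
    failure_reason: str   # hypothetical field: why the rejected side fails


CHOSEN = '''
def append_item(item, bucket=None):
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket
'''

# Plausible mistake: the default list is created once and shared across calls.
REJECTED = '''
def append_item(item, bucket=[]):
    bucket.append(item)
    return bucket
'''

# The second call exposes the shared default.
CHECK = '''
assert append_item(1) == [1]
assert append_item(2) == [2]
'''


def passes(source: str, check: str) -> bool:
    """Run the candidate and its check in a fresh namespace; True if nothing raises."""
    namespace = {}
    try:
        exec(source, namespace)
        exec(check, namespace)
    except Exception:
        return False
    return True


def is_valid_pair(pair: PreferencePair, check: str) -> bool:
    """Keep a pair only if the chosen side passes and the rejected side fails."""
    return passes(pair.chosen, check) and not passes(pair.rejected, check)


pair = PreferencePair(
    prompt="Write append_item(item, bucket=None) that returns the bucket with item added.",
    chosen=CHOSEN,
    rejected=REJECTED,
    failure_reason="mutable default argument shared across calls",
)

print(is_valid_pair(pair, CHECK))
```

A filter like this only confirms that the rejected side fails; it says nothing about whether the failure is one a person would plausibly write, which still needs human judgment. Running `exec` on untrusted samples should happen inside a sandbox.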
