DPO vs SimPO: What Your Preference Trainer Is Actually Optimizing

DEV Community·Natnael Alemseged·25 days ago
#ai #llm #finetuning #margins #held #simpo

SalesConversion-Bench had one uncomfortable preference-tuning mismatch: the code trained with TRL's DPOTrainer, while the methodology narrative argued for SimPO. That is not just a naming issue. DPO and SimPO turn the same (prompt, chosen, rejected) pair into different update signals. When the held-out lift is small, like 22.73% vs 18.18%, the project cannot honestly say whether the model improved because DPO was the right objective, because LoRA rank constrained the update, or because training margins improved without robust held-out behavior.

The useful answer is not "DPO good, SimPO good, ORPO also good." The useful answer is: compare the objectives under fixed conditions, control for LoRA rank, and keep the objective whose gains survive held-out evaluation instead of only improving training margins.…
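To make "different update signals" concrete, here is a minimal sketch of the two per-pair losses as they are usually written: DPO scores the pair by how far the policy's summed log-probabilities have moved relative to a frozen reference model, while SimPO uses length-normalized log-probabilities with no reference model and a target margin. The function names, the `beta` and `gamma` values, and the log-probabilities in the example are illustrative assumptions, not values from SalesConversion-Bench or TRL defaults.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: inputs are summed sequence log-probs; the margin is measured
    relative to a frozen reference model. beta is an illustrative value."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def simpo_loss(policy_chosen, policy_rejected, len_chosen, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO: average per-token log-probs, no reference model, and a target
    margin gamma. beta and gamma are illustrative values."""
    avg_chosen = policy_chosen / len_chosen
    avg_rejected = policy_rejected / len_rejected
    return -log_sigmoid(beta * (avg_chosen - avg_rejected) - gamma)


# Hypothetical pair: both responses are 20 tokens long. The policy has raised
# the chosen response's log-prob relative to the reference (-50 -> -40), but
# the chosen response is still less likely per token than the rejected one.
pair = dict(policy_chosen=-40.0, policy_rejected=-30.0)

dpo = dpo_loss(**pair, ref_chosen=-50.0, ref_rejected=-30.0)
simpo = simpo_loss(**pair, len_chosen=20, len_rejected=20)

# A loss of log(2) corresponds to a zero margin, so it serves as the
# "no preference learned yet" baseline for both objectives.
baseline = math.log(2)
print(f"DPO loss:   {dpo:.4f} (below baseline: {dpo < baseline})")
print(f"SimPO loss: {simpo:.4f} (below baseline: {simpo < baseline})")
```

On this made-up pair, DPO reports a loss below the zero-margin baseline because the policy improved relative to the reference, while SimPO reports a loss above it because the chosen response remains less likely per token. That is the sense in which a training margin can look healthy under one objective and not the other, and why the comparison has to be settled on held-out data.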
