DPO vs SimPO: What Your Preference Trainer Is Actually Optimizing


DEV Community·Natnael Alemseged·25 days ago

#ai #llm #finetuning #margins #held #simpo

SalesConversion-Bench had one uncomfortable preference-tuning mismatch: the code trained with TRL DPOTrainer, while the methodology narrative argued for SimPO. That is not just a naming issue. DPO and SimPO turn the same (prompt, chosen, rejected) pair into different update signals. If the held-out lift is small, like 22.73% vs 18.18%, the project cannot honestly say whether the model improved because DPO was the right objective, because LoRA rank constrained the update, or because training margins improved without robust held-out behavior.

The useful answer is not "DPO good, SimPO good, ORPO also good." The useful answer is: compare the objectives under fixed conditions, control for LoRA rank, and keep the objective whose gains survive held-out evaluation instead of only improving training margins.…
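To make the "different update signals" point concrete, here is a minimal sketch of the two per-pair losses as they are usually written in the DPO and SimPO papers. It is not TRL's implementation and not the project's training code. The function names, the beta and gamma values, and the log-probabilities in the example are all hypothetical and chosen only for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO: the margin is the policy's log-ratio against a frozen reference.

    Inputs are summed sequence log-probabilities. beta is illustrative.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def simpo_loss(policy_chosen: float, policy_rejected: float,
               len_chosen: int, len_rejected: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """SimPO: no reference model; length-normalized log-probs and a target margin.

    gamma is the target reward margin. beta and gamma are illustrative.
    """
    margin = policy_chosen / len_chosen - policy_rejected / len_rejected
    return -log_sigmoid(beta * margin - gamma)


# Hypothetical pair: the chosen answer is longer (20 tokens vs 10), so its
# summed log-prob is lower even though its per-token log-prob is higher.
pair = dict(policy_chosen=-40.0, policy_rejected=-30.0)

print("DPO loss:  ", dpo_loss(ref_chosen=-42.0, ref_rejected=-29.0, **pair))
print("SimPO loss:", simpo_loss(len_chosen=20, len_rejected=10, **pair))
```

The same pair produces two different margins: DPO measures how far the policy has moved relative to the reference model, while SimPO ignores the reference and looks at average per-token log-probability minus a fixed target margin. That is why swapping one trainer for the other changes what a rising training margin actually means.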
