This post documents a real negative result: my trained model worked… but a well-written prompt worked better.

TL;DR

I built a 266-task evaluation benchmark for B2B sales-outreach agents, something existing benchmarks don’t measure well. Then I trained a small preference-learning judge model using SimPO.

What happened surprised me:

Training accuracy → 100%

Held-out accuracy → 25%

Classic overfitting.

But the real lesson wasn’t about the model. It was about the data.

After fixing dataset construction:

Held-out accuracy improved to 0.417 (Delta A +25pp)

A carefully prompted untrained model scored 0.833

👉 Conclusion: At this scale, judging B2B sales tone is mostly a prompt-following problem, not a preference-learning problem.…
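For readers who haven't met SimPO, here is a minimal sketch of its per-pair objective, plus the kind of pairwise accuracy check behind numbers like the ones above. This is not the training code from this project: the function names, the `beta` and `gamma` defaults, and the tuple layout are assumptions chosen purely for illustration.

```python
import math


def simpo_loss(logp_chosen, logp_rejected, len_chosen, len_rejected,
               beta=2.0, gamma=1.0):
    """SimPO loss for a single preference pair.

    logp_*: summed token log-probabilities of each response under the policy.
    len_*:  response lengths in tokens (used for length normalisation).
    beta, gamma: illustrative defaults, not the settings used in this post.
    """
    # SimPO's implicit reward is the length-normalised log-probability.
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected - gamma

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def pairwise_accuracy(pairs):
    """Fraction of pairs where the chosen response gets the higher
    length-normalised log-probability.

    pairs: iterable of (logp_chosen, logp_rejected, len_chosen, len_rejected).
    """
    pairs = list(pairs)
    correct = sum(1 for lc, lr, nc, nr in pairs if lc / nc > lr / nr)
    return correct / len(pairs)


if __name__ == "__main__":
    # Hypothetical log-probabilities, only to exercise the functions.
    toy_pairs = [(-10.0, -30.0, 10, 10), (-30.0, -10.0, 10, 10)]
    print(simpo_loss(*toy_pairs[0]))
    print(pairwise_accuracy(toy_pairs))
```

A gap like 100% on training pairs versus 25% on held-out pairs means the model has driven this loss close to zero on pairs it has seen while the ranking in `pairwise_accuracy` fails to carry over to new ones.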