When Your Training Loss Is Lying to You
Building a Tenacious-Specific Sales Outreach Benchmark
Eyoel Nebiyu · May 2026

DEV Community·Eyoel Nebiyu·about 1 month ago

#agents #ai #llm #machinelearning #model #training

This post documents a real negative result: my trained model worked… but a well-written prompt worked better.

TL;DR

I built a 266-task evaluation benchmark for B2B sales-outreach agents, something existing benchmarks don’t measure well. Then I trained a small preference-learning judge model using SimPO. What happened surprised me:

Training accuracy → 100%
Held-out accuracy → 25%

Classic overfitting. But the real lesson wasn’t about the model. It was about the data. After fixing dataset construction:

Held-out accuracy improved to 0.417 (Delta A +25pp)
A carefully prompted untrained model scored 0.833

👉 Conclusion: At this scale, judging B2B sales tone is mostly a prompt-following problem, not a preference-learning problem.…
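
For readers unfamiliar with SimPO, here is a minimal sketch of its per-pair loss and of the pairwise accuracy a judge like this could be scored on. It is not the author's training code: the function names, the `beta` and `gamma` defaults, and the toy log-probabilities are all illustrative assumptions. SimPO treats the length-normalized log-probability of a response as its implicit reward and asks the chosen response to beat the rejected one by a target margin `gamma`.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO loss for a single preference pair.

    logp_*: summed token log-probabilities of each response under the model.
    len_*:  number of tokens in each response.
    beta, gamma: placeholder values, not the settings used in the post.
    """
    # Implicit reward is the average (length-normalized) log-probability.
    avg_chosen = logp_chosen / len_chosen
    avg_rejected = logp_rejected / len_rejected
    margin = beta * (avg_chosen - avg_rejected) - gamma
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


def pairwise_accuracy(pairs):
    """Fraction of pairs where the chosen response gets the higher reward."""
    correct = sum(1 for lc, nc, lr, nr in pairs if lc / nc > lr / nr)
    return correct / len(pairs)


# Hypothetical (logp_chosen, len_chosen, logp_rejected, len_rejected) tuples.
toy_pairs = [
    (-12.0, 10, -15.0, 10),  # chosen preferred: correct
    (-30.0, 20, -20.0, 20),  # rejected preferred: wrong
    (-9.0, 6, -6.0, 6),      # wrong
    (-8.0, 8, -4.0, 8),      # wrong
]

print("accuracy:", pairwise_accuracy(toy_pairs))
print("mean loss:", sum(simpo_loss(*p) for p in toy_pairs) / len(toy_pairs))
```

Running the same `pairwise_accuracy` check on training pairs and on held-out pairs is what exposes a gap like the one above: the loss on training pairs can approach zero while held-out accuracy stays low.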
