When Your Training Loss Is Lying to You
Building a Tenacious-Specific Sales Outreach Benchmark
Eyoel Nebiyu · May 2026

DEV Community·Eyoel Nebiyu·about 1 month ago
#agents #ai #llm #machinelearning #model #training
This post documents a real negative result: my trained model worked… but a well-written prompt worked better.

TL;DR

I built a 266-task evaluation benchmark for B2B sales-outreach agents, something existing benchmarks don't measure well. Then I trained a small preference-learning judge model using SimPO. What happened surprised me:

Training accuracy → 100%
Held-out accuracy → 25%

Classic overfitting. But the real lesson wasn't about the model. It was about the data. After fixing dataset construction:

Held-out accuracy improved to 0.417 (Delta A +25pp)
A carefully prompted untrained model scored 0.833

👉 Conclusion: At this scale, judging B2B sales tone is mostly a prompt-following problem, not a preference-learning problem.…
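For readers unfamiliar with SimPO, here is a minimal sketch of its per-pair objective and of the pairwise accuracy metric a judge like this is typically scored on. It is not the author's training code: the function names, the `beta` and `gamma` defaults, and the tuple layout are all assumptions chosen for illustration. The loss itself follows the published SimPO form, which uses length-normalized log-probabilities and a target margin with no reference model.

```python
import math

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO loss for a single preference pair.

    logp_*: summed token log-probabilities of each response under the policy.
    len_*:  number of tokens in each response (used for length normalization).
    beta, gamma: illustrative defaults, not the values used in the post.
    """
    # Length-normalized reward gap, minus the target margin gamma.
    margin = beta * (logp_chosen / len_chosen
                     - logp_rejected / len_rejected) - gamma
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def pairwise_accuracy(pairs):
    """Fraction of pairs where the chosen response gets the higher
    length-normalized log-probability.

    pairs: iterable of (logp_chosen, len_chosen, logp_rejected, len_rejected).
    """
    pairs = list(pairs)
    correct = sum(1 for lc, nc, lr, nr in pairs if lc / nc > lr / nr)
    return correct / len(pairs)
```

Computing `pairwise_accuracy` separately on the training pairs and on a held-out split is what exposes the kind of gap described above: a loss that keeps falling on training pairs says nothing about whether the ranking generalizes.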