General sales benchmarks often miss how real outbound agents fail: overclaiming on weak signals, unsafe “bench” commitments, tone that drifts into pushy follow-ups, and gaps between what the rep promises and what delivery can support. For a class project (TRP1 Week 11), I built Tenacious-Bench v0.1, a compact, machine-scored task set aimed at those failure modes rather than generic helpfulness.

What’s in the dataset

The public release is on Hugging Face: https://huggingface.co/datasets/Bnobody/tenacious_bench_v0.1. It currently exposes 168 rows in the hub viewer, with splits aligned to how I train and evaluate: train (105) and validation (63). Tasks mix several authoring modes (programmatic sweeps, multi-LLM synthesis with judge filtering, trace-informed scenarios, and hand-authored adversarial cases), so the bench isn’t a single-generator monoculture.…
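
To sanity-check a local copy against the counts quoted above, here is a minimal sketch. It assumes the Hugging Face `datasets` library is installed and that the hub splits are literally named `train` and `validation`. The helper names `split_mismatches` and `load_split_sizes` are hypothetical, introduced only for illustration, and are not part of the release.

```python
from typing import Dict, Optional, Tuple

# Repo id taken from the dataset URL; row counts are the ones quoted in the post.
REPO_ID = "Bnobody/tenacious_bench_v0.1"
EXPECTED_ROWS = {"train": 105, "validation": 63}


def split_mismatches(
    actual: Dict[str, int],
    expected: Dict[str, int] = EXPECTED_ROWS,
) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Return {split: (expected, actual)} for every split whose row count differs.

    A split missing on either side is reported with None in that position.
    """
    names = set(expected) | set(actual)
    return {
        name: (expected.get(name), actual.get(name))
        for name in sorted(names)
        if expected.get(name) != actual.get(name)
    }


def load_split_sizes(repo_id: str = REPO_ID) -> Dict[str, int]:
    """Download the dataset and report rows per split (needs network access)."""
    # Imported lazily so the pure-Python check above works without the dependency.
    from datasets import load_dataset

    dataset = load_dataset(repo_id)  # a DatasetDict keyed by split name
    return {name: split.num_rows for name, split in dataset.items()}
```

Calling `split_mismatches(load_split_sizes())` should return an empty dict while the hub still matches the numbers above. A non-empty result would simply mean the release has changed since this post was written.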