I Built a Benchmark for the Failures Generic LLM Evaluations Miss

Generic LLM benchmarks are useful, but they are not the same thing as a workflow benchmark. That gap became obvious in my Week 11 project.

I was working on SignalForge, a deterministic-first outbound workflow for Tenacious. The system already had structured enrichment, confidence calibration, grounded email generation, CRM sync, lifecycle routing, and evaluation hooks. But Week 10 evidence showed that the hardest failures were not “can the model produce text?” failures. They were judgment failures: over-claiming from weak public signals, drifting into generic outsourcing language, escalating to booking too early, mishandling pricing handoffs, sounding technically plausible but socially wrong with a new CTO.

That is the kind of behavior that a broad assistant benchmark or a retail-agent benchmark can easily under-measure.…