Book: AI Agents Pocket Guide
Also by me: Thinking in Go (2-book series: Complete Guide to Go Programming + Hexagonal Architecture in Go)
My project: Hermes IDE | GitHub, an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

You changed one line of the system prompt. The chat-eval suite still passes: output quality looks fine, no hallucinations, the model still answers in JSON. You ship.

Two days later, support says the agent stopped sending follow-up emails after refunds. It is calling log_refund instead of send_followup. The text outputs were right; the tool calls were silently rewired.

Output evals do not catch this. You need an eval that grades the tool trajectory: which tools the agent called, in what order, with what arguments.

The harness is small. About ninety lines of Python, three judges in a ladder, and a golden CSV. Total bill on a 30-row golden set: a few dollars per run.…
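To make "grading the tool trajectory" concrete, here is a minimal sketch of a deterministic check that compares tool names in order, then arguments step by step. It is not the harness described in this post. The `ToolCall` shape, the `grade_trajectory` function, and the `issue_refund` tool are hypothetical names introduced for illustration; only `log_refund` and `send_followup` come from the scenario above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One step of a trajectory: a tool name plus its arguments (assumed shape)."""
    name: str
    args: dict[str, Any] = field(default_factory=dict)


def grade_trajectory(expected: list[ToolCall], actual: list[ToolCall]) -> tuple[bool, str]:
    """Exact-match grader: same tools, same order, same arguments."""
    expected_names = [call.name for call in expected]
    actual_names = [call.name for call in actual]

    # Check which tools were called and in what order.
    if expected_names != actual_names:
        return False, f"tool sequence mismatch: expected {expected_names}, got {actual_names}"

    # The sequences have equal length here, so zip covers every step.
    for step, (exp, act) in enumerate(zip(expected, actual)):
        if exp.args != act.args:
            return False, (
                f"argument mismatch at step {step} ({exp.name}): "
                f"expected {exp.args}, got {act.args}"
            )

    return True, "ok"


# Golden trajectory for a refund ticket (issue_refund is a made-up tool name).
expected = [
    ToolCall("issue_refund", {"order_id": "A1"}),
    ToolCall("send_followup", {"order_id": "A1"}),
]

# What the agent did after the prompt change: the follow-up was swapped out.
actual = [
    ToolCall("issue_refund", {"order_id": "A1"}),
    ToolCall("log_refund", {"order_id": "A1"}),
]

passed, reason = grade_trajectory(expected, actual)
print(passed, reason)
```

In this sketch the swapped tool fails on the sequence comparison even though the agent's text reply could be identical in both runs. Exact matching is the strictest possible grader, so a real eval would likely need something more forgiving for free-text arguments.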