Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team

Also by me: Thinking in Go (2-book series): Complete Guide to Go Programming + Hexagonal Architecture in Go

My project: Hermes IDE | GitHub, an IDE for developers who ship with Claude Code and other AI coding tools

Me: xgabriel.com | GitHub

A team ships a prompt change. The eval set has 100 questions. The old prompt scored 78, the new prompt scores 82. Slack lights up; the numbers land in the deploy note; the change ships.

Two weeks later, customer support tickets are flat. The "win" was four examples flipping out of a hundred. With a sample that small, four is well inside the noise floor. The same prompt re-run on the same model on a different day moves by more than that. The team did not measure an improvement; they measured a coin flip and treated the side it landed on as a result.

This is the cheapest, most common eval mistake in production LLM work.…
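To put a rough number on that noise floor, here is a minimal sketch using only the Python standard library. It assumes each eval question is an independent pass/fail trial and uses the normal approximation to the binomial. The function names `accuracy_interval` and `two_proportion_z` are made up for this illustration and are not from any eval tool. It also treats the two runs as independent samples; a paired test on per-question results would be the better fit for a shared eval set, but the scenario above only gives the two totals.

```python
import math


def accuracy_interval(correct, total, z=1.96):
    """Approximate 95% interval for an accuracy score (normal approximation)."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width


def two_proportion_z(correct_a, correct_b, total):
    """z statistic and two-sided p-value for the gap between two scores.

    Assumption: both runs use `total` questions and are treated as
    independent samples (unpooled standard error).
    """
    p_a = correct_a / total
    p_b = correct_b / total
    se = math.sqrt(p_a * (1 - p_a) / total + p_b * (1 - p_b) / total)
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    low, high = accuracy_interval(78, 100)
    print(f"old prompt: 78/100, interval {low:.3f} to {high:.3f}")

    z, p_value = two_proportion_z(78, 82, 100)
    print(f"78 vs 82 on 100 questions: z = {z:.2f}, p = {p_value:.2f}")
```

Under these assumptions the interval around the old score of 78 runs from roughly 70 to 86, so the new score of 82 sits comfortably inside it, and the gap works out to a z of about 0.7 with a p-value near 0.48. That is what "inside the noise floor" looks like in numbers.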