Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
Also by me: Thinking in Go (2-book series): Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub, an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

You ran the eval. Pass rate is 87%. You ship.

Your colleague reruns the same suite, same model, same prompts, same dataset, ten minutes later. Pass rate is 81%. Then 84%. Then 89%. Nothing changed.

Your eval is non-deterministic and you didn't notice because the library you used printed one number with two decimal places and called it a metric. That number is whatever the LLM judge said on the first call. Ask it again and a fraction of verdicts flip.

The bug isn't in any single library. The bug is the shape.…
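To make the flipping-verdict problem concrete, here is a minimal sketch that reruns one suite several times and reports the spread instead of a single number. Everything in it is an assumption for illustration: `noisy_judge` is a hypothetical stand-in for an LLM judge (it simulates flips with a seeded random generator, it does not call a model), and the 10% flip rate and 100-case dataset are made up.

```python
import random
import statistics


def noisy_judge(case_id: int, rng: random.Random, flip_rate: float) -> bool:
    """Hypothetical stand-in for an LLM judge.

    Each case has a fixed "true" verdict; with probability flip_rate
    the judge returns the opposite verdict on this particular call.
    """
    true_verdict = case_id % 10 != 0  # assumption: 90% of cases truly pass
    if rng.random() < flip_rate:
        return not true_verdict
    return true_verdict


def run_suite(case_ids, rng: random.Random, flip_rate: float) -> float:
    """Judge every case once and return the pass rate for this run."""
    verdicts = [noisy_judge(c, rng, flip_rate) for c in case_ids]
    return sum(verdicts) / len(verdicts)


def repeated_pass_rates(case_ids, runs: int, seed: int = 0, flip_rate: float = 0.1):
    """Rerun the same suite `runs` times; nothing changes except judge noise."""
    rng = random.Random(seed)
    return [run_suite(case_ids, rng, flip_rate) for _ in range(runs)]


def summarize(rates):
    """Report the distribution of pass rates, not just the first one."""
    return {
        "runs": len(rates),
        "mean": statistics.mean(rates),
        "stdev": statistics.stdev(rates) if len(rates) > 1 else 0.0,
        "min": min(rates),
        "max": max(rates),
    }


if __name__ == "__main__":
    rates = repeated_pass_rates(range(100), runs=10)
    s = summarize(rates)
    print(
        f"pass rate: {s['mean']:.1%} +/- {s['stdev']:.1%} "
        f"(min {s['min']:.0%}, max {s['max']:.0%}, n={s['runs']})"
    )
```

With a real judge you would swap `noisy_judge` for the actual model call and keep the rest: the point of the sketch is only that the reported metric becomes a mean with a spread over repeated runs, so an 81% and an 89% on the same suite read as noise rather than as a regression or an improvement.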