I shipped my fourth LLM agent to production last quarter. By month two, the eval suite that "passed in CI" was the reason a regression made it to a customer. The tests were green. But they were green for the wrong reason — every assertion was a single LLM call against a single golden answer, on a model whose temperature happened to land in our favor that day. We had built a coin flip and called it a test. This article is the harness I wish I'd had on day one. Not another wrapper around DeepEval or RAGAS — a thin layer on top of pytest that solves the five things every production LLM evaluation harness needs and most tutorials skip: Flake-aware tests. LLMs are stochastic. Single-shot assertions are noise. Cost-bounded tests. A single misbehaving prompt should not burn $40 on one CI run. Golden set with versioning. When a result changes, you need to know if the answer drifted or the model did. Regression-only CI gating. Block PRs on degradation vs. baseline, not on absolute floors that bit-rot.…