Building a Production LLM Evaluation Harness in Pytest: Cost-Bounded, Flake-Aware, CI-Gated (Runnable Python)

DEV Community·Nitin Srivastava·25 days ago

#python #llm #testing #ai #tests #cost

I shipped my fourth LLM agent to production last quarter. By month two, the eval suite that "passed in CI" was the reason a regression made it to a customer. The tests were green, but they were green for the wrong reason: every assertion was a single LLM call against a single golden answer, on a model whose temperature happened to land in our favor that day. We had built a coin flip and called it a test.

This article is the harness I wish I'd had on day one. It is not another wrapper around DeepEval or RAGAS. It is a thin layer on top of pytest that solves the five things every production LLM evaluation harness needs and most tutorials skip:

- Flake-aware tests. LLMs are stochastic. Single-shot assertions are noise.
- Cost-bounded tests. A single misbehaving prompt should not burn $40 on one CI run.
- Golden set with versioning. When a result changes, you need to know if the answer drifted or the model did.
- Regression-only CI gating. Block PRs on degradation vs. baseline, not on absolute floors that bit-rot.…
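To make three of these ideas concrete, here is a minimal sketch in plain Python. It is not the article's harness: `CostBudget`, `pass_rate`, `regressed`, the 5-point tolerance, the per-call cost, and the `fake_model` stand-in are all hypothetical names and values chosen for illustration. It shows repeated sampling instead of a single-shot assertion, a hard spend cap checked before every call, and a gate that compares against a stored baseline rather than an absolute floor.

```python
import random
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a call would push the run past its cost cap."""


@dataclass
class CostBudget:
    # Hypothetical helper: hard cap in USD for one CI run.
    limit_usd: float
    spent_usd: float = 0.0

    def charge(self, usd: float) -> None:
        # Refuse the call before recording any spend past the cap.
        if self.spent_usd + usd > self.limit_usd:
            raise BudgetExceeded(
                f"would spend {self.spent_usd + usd:.2f} of {self.limit_usd:.2f} USD"
            )
        self.spent_usd += usd


def pass_rate(call_model, check, prompt, n, budget, cost_per_call_usd):
    """Sample the model n times; return the fraction of outputs passing check."""
    passes = 0
    for _ in range(n):
        budget.charge(cost_per_call_usd)  # assumed flat per-call cost
        if check(call_model(prompt)):
            passes += 1
    return passes / n


def regressed(current, baseline, tolerance=0.05):
    """True when the pass rate drops more than `tolerance` below baseline."""
    return current < baseline - tolerance


# Stand-in for a real LLM client: right answer roughly 80% of the time.
rng = random.Random(0)


def fake_model(prompt):
    return "4" if rng.random() < 0.8 else "5"


budget = CostBudget(limit_usd=0.50)
rate = pass_rate(
    fake_model,
    lambda out: out == "4",
    "What is 2 + 2?",
    n=20,
    budget=budget,
    cost_per_call_usd=0.01,
)
```

Inside a pytest test, the gate would reduce to something like `assert not regressed(rate, baseline)`, with `baseline` loaded from wherever the previous run's pass rate is stored (an assumption here; the storage format is not specified above).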
