Most teams compare prompts like this:

Prompt A average score: 6.8
Prompt B average score: 7.4

"B is better, ship it."

I used to do this too. Then I ran the numbers properly and realized I'd been making deployment decisions on statistical noise. Here's what I learned about evaluating LLM prompts correctly, and the specific implementation I built.

The problem with averages on small datasets

LLM eval datasets are small. Most teams have 10-30 golden test cases. That's not enough data to make averages reliable. Here's why.

Imagine you score both prompts on 10 cases. Prompt B scores 0.6 points higher on average. Sounds like a win.

But with n=10, a difference of 0.6 points could easily happen by random chance: the model had a slightly better day, the test cases happened to favor B's phrasing, one outlier case pulled the average. You have no way to know without actually computing the probability.

This is the core problem: a difference is not the same as a statistically significant difference.…
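
To make "actually computing the probability" concrete, here is a minimal sketch of one way to do it: an exact paired sign-flip permutation test. It is an illustration under stated assumptions, not necessarily the implementation described in this post. The per-case scores are made up so that the averages match the 6.8 and 7.4 above, and the function name `paired_sign_flip_p_value` is hypothetical.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_p_value(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired per-case scores.

    Assumes both prompts were scored on the same test cases, in the same order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both prompts must be scored on the same cases")

    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))

    extreme = 0
    total = 0
    # Under the null hypothesis (no real difference between prompts), each
    # per-case difference is equally likely to be positive or negative.
    # Enumerating all 2**n sign patterns is exact, but only practical for
    # small n (roughly n <= 20); beyond that, sample random patterns instead.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float scores
            extreme += 1
    return extreme / total


# Hypothetical scores for 10 golden test cases (illustrative data only).
scores_a = [7, 6, 8, 5, 7, 9, 6, 7, 6, 7]  # mean 6.8
scores_b = [8, 6, 7, 7, 8, 9, 5, 9, 7, 8]  # mean 7.4

p_value = paired_sign_flip_p_value(scores_a, scores_b)
print(f"mean A = {mean(scores_a):.1f}, mean B = {mean(scores_b):.1f}, p = {p_value:.4f}")
```

For this made-up data the p-value comes out at about 0.19. If the two prompts were truly equivalent, a gap of 0.6 or more would show up in roughly one of every five such comparisons, which is far too often to treat it as evidence that B is better.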