Eval Set Sizing: The Statistical Power Math Behind LLM A/B Tests

DEV Community·Gabriel Anhaia·25 days ago

Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
Also by me: Thinking in Go (2-book series): Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub, an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

A team ships a prompt change. The eval set has 100 questions. The old prompt scored 78, the new prompt scores 82. Slack lights up; the numbers land in the deploy note; the change ships.

Two weeks later, customer support tickets are flat. The "win" was four examples flipping out of a hundred. With a sample that small, four is well inside the noise floor. The same prompt re-run on the same model on a different day moves by more than that. The team did not measure an improvement; they measured a coin flip and treated the side it landed on as a result.

This is the cheapest, most common eval mistake in production LLM work.…
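To put rough numbers on the 78-vs-82 story, here is a minimal sketch using only the Python standard library. It rests on assumptions that are not in the post: the two scores are treated as pass rates from independent samples of 100, the test is a plain unpooled two-proportion z-test, and the targets are the conventional 5% two-sided significance level and 80% power. The function names are made up for illustration. If both prompts are scored on the same questions, a paired test such as McNemar's is the better fit and the numbers will differ.

```python
import math
from statistics import NormalDist


def two_proportion_z(p1, p2, n1, n2):
    """Unpooled z statistic and two-sided p-value for the gap p2 - p1.

    Assumes the two samples are independent (an illustrative simplification).
    """
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def required_n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate examples needed per prompt to detect p1 vs p2.

    Uses the normal-approximation sample size formula for two proportions;
    alpha and power are assumed conventional defaults.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# The scenario from the story: 78/100 vs 82/100.
z_stat, p_value = two_proportion_z(0.78, 0.82, 100, 100)
n_needed = required_n_per_arm(0.78, 0.82)

print(f"z = {z_stat:.2f}, two-sided p = {p_value:.2f}")  # z = 0.71, p = 0.48
print(f"examples needed per prompt: {n_needed}")          # 1566
```

Under these assumptions the four-point gap is less than one standard error wide, and an eval set sized to reliably detect a real four-point lift would need on the order of 1,500 questions per prompt, not 100.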
