Your RAG Eval Set Is Probably Wrong. The Test That Catches It.


DEV Community·Gabriel Anhaia·about 1 month ago


Book: RAG Pocket Guide
Also by me: LLM Observability Pocket Guide
My project: Hermes IDE | GitHub, an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

Picture a composite scenario drawn from several RAG postmortems: a team ships a legal-research assistant with a Ragas faithfulness score in the mid-0.9s and an answer-relevance score not far behind. Two weeks after launch, customer success starts forwarding screenshots of the bot citing the wrong jurisdiction. The eval scores never move.

The eval set is a few hundred questions hand-written months earlier. Production users are running tax-court queries with citation patterns that do not exist anywhere in the eval set. The evals are measuring how well the system answered last quarter's questions, not this morning's. The dashboard is green for a system nobody is actually using.

Most teams blame the retriever. It is almost never the retriever.…
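The mismatch described above, production queries that resemble nothing in the eval set, can be made concrete with a rough coverage check. The sketch below is an illustration only and is not the test the article goes on to describe, since that part of the text is not available here. It assumes token-overlap (Jaccard) similarity as a crude stand-in for embeddings. The function names, the 0.3 threshold, and the sample queries are all hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase word tokens; a crude stand-in for an embedding (assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Token-set overlap in [0, 1]; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def eval_coverage(production_queries, eval_questions, threshold=0.3):
    """Return (coverage, uncovered).

    coverage is the fraction of production queries whose closest eval
    question scores at least `threshold`; uncovered lists the rest.
    The threshold is a hypothetical default, not a recommended value.
    """
    eval_token_sets = [tokens(q) for q in eval_questions]
    uncovered = []
    for query in production_queries:
        q_tokens = tokens(query)
        best = max((jaccard(q_tokens, e) for e in eval_token_sets), default=0.0)
        if best < threshold:
            uncovered.append(query)
    total = len(production_queries)
    coverage = 1.0 if total == 0 else 1 - len(uncovered) / total
    return coverage, uncovered


# Hypothetical sample data, invented for illustration.
eval_questions = [
    "What is the statute of limitations for breach of contract in California?",
    "Which court hears appeals from federal district courts?",
]
production_queries = [
    "What is the statute of limitations for breach of contract in Texas?",
    "Tax court precedent on home office deduction substantiation",
]

coverage, uncovered = eval_coverage(production_queries, eval_questions)
print(f"coverage: {coverage:.2f}")
for query in uncovered:
    print("not represented in eval set:", query)
```

In this toy data the tax-court query has no close neighbour in the eval set, so it is flagged as uncovered. A real check would likely swap the token overlap for embedding similarity and run over a recent sample of production logs.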
