RAG in Production: What the Tutorials Don't Tell You

DEV Community·Davide Mibelli·18 days ago

#ai #machinelearning #python #tutorial #retrieval #chunk

I built a RAG system that scored 91% on our internal eval suite. It retrieved the right chunks four out of five times in every benchmark we ran. We shipped it. Users thought it was broken.

The gap between "works in evaluation" and "works in production" is the thing every RAG tutorial skips. This article is what I learned closing that gap across three different production deployments: a customer support bot, an internal knowledge base, and a document Q&A tool for a legal team.

Why your evals lie to you

The typical RAG eval flow: take 50 question-answer pairs, run retrieval, score chunk relevance, measure answer quality. The benchmark looks good. Production does not.

The problem is that evaluation datasets are clean. Real user questions are not. Users ask ambiguous things, reference context from earlier in the conversation, use company-specific jargon that is not in your embedding model's vocabulary, and ask questions that span multiple documents. Your 50-pair eval dataset does not cover any of this.…
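To make the retrieval step of that eval flow concrete, here is a minimal sketch in Python. It is not the author's eval suite: the `EvalPair` type, the `hit_rate_at_k` helper, the three sample chunks, and the keyword-overlap `toy_retrieve` function are all hypothetical stand-ins for a real labeled dataset and a real vector search. It only illustrates how a clean set of questions can score perfectly while questions that lean on conversation context or different vocabulary miss entirely.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class EvalPair:
    question: str
    relevant_chunk_id: str  # hypothetical gold label for this question


def hit_rate_at_k(
    pairs: Sequence[EvalPair],
    retrieve: Callable[[str, int], List[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose gold chunk appears in the top-k results."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    hits = sum(1 for p in pairs if p.relevant_chunk_id in retrieve(p.question, k))
    return hits / len(pairs)


# Toy corpus (assumption): three made-up support snippets keyed by chunk id.
CHUNKS = {
    "refunds": "refunds are processed within five business days",
    "shipping": "shipping takes two to four days within the eu",
    "warranty": "the warranty covers hardware defects for two years",
}


def toy_retrieve(question: str, k: int) -> List[str]:
    """Keyword-overlap retriever standing in for a real vector search."""
    words = set(question.lower().split())
    overlap = {
        cid: len(words & set(text.lower().split())) for cid, text in CHUNKS.items()
    }
    # Only chunks sharing at least one word are returned, best match first.
    ranked = sorted(
        (cid for cid in CHUNKS if overlap[cid] > 0),
        key=lambda cid: overlap[cid],
        reverse=True,
    )
    return ranked[:k]


# Clean, eval-style questions that reuse the wording of the documents.
clean_pairs = [
    EvalPair("how long are refunds processed", "refunds"),
    EvalPair("how long does shipping take", "shipping"),
]

# Messier, production-style questions (made up for illustration).
messy_pairs = [
    EvalPair("and what about that one?", "warranty"),  # relies on earlier turns
    EvalPair("can I get my money back", "refunds"),    # shares no words with the doc
]

clean_score = hit_rate_at_k(clean_pairs, toy_retrieve, k=1)
messy_score = hit_rate_at_k(messy_pairs, toy_retrieve, k=1)
print(f"clean: {clean_score:.2f}, messy: {messy_score:.2f}")
```

The same metric gives very different numbers depending on which questions go into `pairs`, so the score says as much about the dataset as about the retriever. A real embedding-based retriever would likely handle the paraphrase case better than this keyword toy, but the follow-up question that depends on earlier turns would still arrive without the context it needs.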
