Book: RAG Pocket Guide
Also by me: LLM Observability Pocket Guide
My project: Hermes IDE | GitHub (an IDE for developers who ship with Claude Code and other AI coding tools)
Me: xgabriel.com | GitHub

Picture a composite scenario drawn from several RAG postmortems: a team ships a legal-research assistant with a Ragas faithfulness score in the mid-0.9s and an answer-relevance score not far behind. Two weeks after launch, customer-success starts forwarding screenshots of the bot citing the wrong jurisdiction. The eval scores never move.

The eval set is a few hundred questions hand-written months earlier. Production users are running tax-court queries with citation patterns that do not exist anywhere in the eval set. The evals are measuring how well the system answered last quarter's questions, not this morning's. The dashboard is green for a system nobody is actually using.

Most teams blame the retriever. It is almost never the retriever.…