We measured baseline accuracy across a production RAG system: 62%. Six weeks later, after five architecture changes and zero model changes: 94%. Here's exactly what we changed, why each one mattered, and what the numbers looked like before and after each step.

The Setup

Internal knowledge assistant for a mid-market company. Knowledge spread across Confluence, Google Drive, and SharePoint, approximately 4,200 documents total. Users asking natural language questions about internal policies, processes, and product specifications.

Stack at baseline:

| Component | Details |
| --- | --- |
| LLM | GPT-4o |
| Embedding model | text-embedding-3-large |
| Vector store | Pinecone |
| Chunking | RecursiveCharacterTextSplitter, 1024 tokens, 20% overlap |
| Retrieval | top-8 by cosine similarity, no re-ranking |
| Eval | none |

This worked well in testing. In production, it hit a ceiling at week three, when real users arrived with queries that went beyond the clean, structured examples we'd tested against. The first sign was a CTO call.…