My RAG system for financial document Q&A was stuck at 53% accuracy. I spent two weeks implementing hybrid retrieval, metadata filtering, and query routing. Accuracy went to 58%. Then I ran a corpus audit and found that 5 documents were never ingested and 2 were corrupted. Fixing that alone pushed recall from 83% to 94%. The most impactful improvement in the entire project took 30 minutes and zero lines of new code. The setup Quick context: I'm building a RAG system evaluated against FinanceBench (Patronus AI), a benchmark with 150 expert-annotated Q&A pairs about SEC filings. The pipeline is GPT-4o-mini for generation, text-embedding-3-small for embeddings, and Qdrant as the vector store. Full eval infrastructure with LLM-as-judge calibrated against human labels ([Post 1 covers the eval setup] https://dev.to/joaopaulotr/building-an-evaluation-harness-for-financial-rag-what-i-learned-about-llm-as-judge-calibration-5030 ).…