5 Failure Modes I Found in My Financial RAG (And the One That Actually Mattered)


DEV Community: python·João Paulo Traguetta Rufino·2 days ago


#dev #retrieval #judge #documents #human #correct


My RAG system for financial document Q&A was stuck at 53% accuracy. I spent two weeks implementing hybrid retrieval, metadata filtering, and query routing. Accuracy went to 58%.

Then I ran a corpus audit and found that 5 documents were never ingested and 2 were corrupted. Fixing that alone pushed recall from 83% to 94%. The most impactful improvement in the entire project took 30 minutes and zero lines of new code.

The setup

Quick context: I'm building a RAG system evaluated against FinanceBench (Patronus AI), a benchmark with 150 expert-annotated Q&A pairs about SEC filings. The pipeline is GPT-4o-mini for generation, text-embedding-3-small for embeddings, and Qdrant as the vector store. Full eval infrastructure with LLM-as-judge calibrated against human labels (Post 1 covers the eval setup: https://dev.to/joaopaulotr/building-an-evaluation-harness-for-financial-rag-what-i-learned-about-llm-as-judge-calibration-5030).…
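To make the corpus audit idea concrete, here is a minimal sketch of what such a check could look like. It is not the author's actual audit. The function names, the `ingested` mapping (document name to list of chunk texts, which in practice you would export from your vector store), and both thresholds are assumptions for illustration. The "corrupted" heuristic (very little text, or mostly unreadable characters) is one simple choice among many.

```python
from dataclasses import dataclass, field


@dataclass
class AuditReport:
    missing: list[str] = field(default_factory=list)  # expected but never ingested
    suspect: list[str] = field(default_factory=list)  # ingested but likely corrupted


def readable_ratio(text: str) -> float:
    """Fraction of characters that are letters, digits, or whitespace."""
    if not text:
        return 0.0
    ok = sum(ch.isalnum() or ch.isspace() for ch in text)
    return ok / len(text)


def audit_corpus(expected_docs, ingested, min_chars=1000, min_readable=0.8):
    """Compare the documents the benchmark references against what was ingested.

    expected_docs: iterable of document names the eval set refers to.
    ingested: dict mapping document name -> list of chunk texts (hypothetical export).
    min_chars, min_readable: assumed thresholds; tune them for your corpus.
    """
    report = AuditReport()
    for doc in sorted(set(expected_docs)):
        chunks = ingested.get(doc)
        if not chunks:
            report.missing.append(doc)
            continue
        text = "".join(chunks)
        if len(text) < min_chars or readable_ratio(text) < min_readable:
            report.suspect.append(doc)
    return report


if __name__ == "__main__":
    # Hypothetical document names and contents, for illustration only.
    expected = ["A_10K", "B_10K", "C_10K"]
    ingested = {
        "A_10K": ["Revenue grew 5 percent. " * 100],
        "B_10K": ["\x00\ufffd" * 1000],  # simulated garbage from a bad PDF parse
    }
    report = audit_corpus(expected, ingested)
    print("missing:", report.missing)
    print("suspect:", report.suspect)
```

The point of a check like this is that it runs against the corpus itself, before any retrieval or generation changes, so missing or corrupted documents surface as a data problem and are not mistaken for a retrieval problem.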
