5 Failure Modes I Found in My Financial RAG (And the One That Actually Mattered)

DEV Community: python·João Paulo Traguetta Rufino·2 days ago

My RAG system for financial document Q&A was stuck at 53% accuracy. I spent two weeks implementing hybrid retrieval, metadata filtering, and query routing. Accuracy went to 58%. Then I ran a corpus audit and found that 5 documents were never ingested and 2 were corrupted. Fixing that alone pushed recall from 83% to 94%. The most impactful improvement in the entire project took 30 minutes and zero lines of new code.

The setup

Quick context: I'm building a RAG system evaluated against FinanceBench (Patronus AI), a benchmark with 150 expert-annotated Q&A pairs about SEC filings. The pipeline is GPT-4o-mini for generation, text-embedding-3-small for embeddings, and Qdrant as the vector store. It has full eval infrastructure, with an LLM-as-judge calibrated against human labels (Post 1 covers the eval setup: https://dev.to/joaopaulotr/building-an-evaluation-harness-for-financial-rag-what-i-learned-about-llm-as-judge-calibration-5030 ).…
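The corpus audit mentioned in the opening paragraph can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: the function name `audit_corpus`, the `min_chars` threshold, and the document names are all hypothetical. It also assumes you have already pulled `(doc_name, chunk_text)` pairs out of the vector store by whatever means your setup provides. The idea is simply to compare the documents the benchmark references against what actually landed in the index, and to flag anything with suspiciously little text.

```python
from collections import defaultdict


def audit_corpus(expected_docs, ingested_chunks, min_chars=200):
    """Compare the documents the eval set expects against what was indexed.

    expected_docs: document names referenced by the benchmark questions.
    ingested_chunks: (doc_name, chunk_text) pairs exported from the index.
    min_chars: hypothetical threshold; documents with less total text than
        this are flagged as possibly corrupted or badly parsed.
    """
    # Total characters of indexed text per document.
    chars_per_doc = defaultdict(int)
    for doc_name, text in ingested_chunks:
        chars_per_doc[doc_name] += len(text.strip())

    expected = set(expected_docs)
    ingested = set(chars_per_doc)

    # Expected by the benchmark but absent from the index.
    missing = sorted(expected - ingested)
    # Present in the index but with almost no usable text.
    suspicious = sorted(
        name for name in expected & ingested
        if chars_per_doc[name] < min_chars
    )
    return {"missing": missing, "suspicious": suspicious}


# Toy data with made-up document names (assumption, not from the benchmark).
expected = ["A_2022_10K", "B_2021_10K", "C_2023_10Q"]
chunks = [
    ("A_2022_10K", "Revenue increased " * 50),  # healthy document
    ("C_2023_10Q", "\x00\x00"),                 # near-empty, likely a bad parse
]
report = audit_corpus(expected, chunks)
print(report)
```

A check like this is cheap to run before any retrieval tuning, because a question whose source document is missing from the index cannot be answered correctly however good the retriever is.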
