Building a RAG Evaluation Harness That Actually Catches Problems


DEV Community·Shiva Shrestha·28 days ago


#issue #rag #ai #words #context #question


Most "chat with your website" projects ship without any measurement. Mine did too. The live demo was up, answers looked plausible, and I moved on. Then I built a proper evaluation harness and found out exactly how wrong "looks plausible" is as a quality signal.

This post covers the eval design, the bugs it caught, the prompt changes that fixed most of them, and the two metrics that still don't pass threshold after all the fixes. The failures are the interesting part.

The System

Web Intelligence is a RAG pipeline that turns any public URL into a queryable knowledge base. You give it a URL, it crawls up to 50 pages, chunks and embeds the text with Pinecone's multilingual-e5-large, and stores vectors in a serverless Pinecone index. At query time, top-k chunks are retrieved and passed to an LLM (Gemini 2.0 Flash or a local Ollama model) with a strict context-only prompt. Nothing exotic.

The evaluation harness is the part I want to talk about.…
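To make the query-time step concrete, here is a minimal sketch of "retrieve top-k chunks, then build a context-only prompt". It is not the project's code: the bag-of-words `embed` function is a toy stand-in for multilingual-e5-large, the in-memory list stands in for the Pinecone index, and the names `retrieve` and `build_prompt` as well as the prompt wording are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a real embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question (in-memory, no vector DB)."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Assemble a context-only prompt; the wording here is illustrative."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return (
        "Answer using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Made-up example chunks, as if produced by the crawl-and-chunk step.
chunks = [
    "Pricing starts at ten dollars per month for the basic plan.",
    "The company was founded in 2015 in Kathmandu.",
    "Support is available by email on weekdays.",
]
question = "What does the basic plan cost per month?"
top = retrieve(question, chunks, k=2)
prompt = build_prompt(question, top)
```

The resulting `prompt` string is what would be sent to the LLM. Keeping retrieval and prompt assembly as separate functions is also what makes them separately measurable in an evaluation harness: retrieval quality can be checked on `top` before any model is called.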
