Most "chat with your website" projects ship without any measurement. Mine did too. The live demo was up, answers looked plausible, and I moved on. Then I built a proper evaluation harness and found out exactly how wrong "looks plausible" is as a quality signal.

This post covers the eval design, the bugs it caught, the prompt changes that fixed most of them, and the two metrics that still don't pass threshold after all the fixes. The failures are the interesting part.

The System

Web Intelligence is a RAG pipeline that turns any public URL into a queryable knowledge base. You give it a URL, it crawls up to 50 pages, chunks and embeds the text with Pinecone's multilingual-e5-large, and stores vectors in a serverless Pinecone index. At query time, top-k chunks are retrieved and passed to an LLM (Gemini 2.0 Flash or a local Ollama model) with a strict context-only prompt. Nothing exotic.

The evaluation harness is the part I want to talk about.…
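
For readers who want the query-time step in concrete form, here is a minimal sketch of "retrieve top-k chunks, then build a strict context-only prompt". It is an illustration under assumptions, not the project's actual code: the function names (`cosine`, `top_k`, `build_prompt`), the prompt wording, and the three toy chunks with hand-written 3-dimensional vectors are all made up for the example. A plain in-memory cosine ranking stands in for the Pinecone index, and no LLM is called.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunks, k=3):
    """Return the k chunks most similar to the query (stand-in for a vector DB query)."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return ranked[:k]

def build_prompt(question, retrieved):
    """Assemble a context-only prompt; the exact wording is an assumption."""
    context = "\n\n".join(f"[{i + 1}] {c['text']}" for i, c in enumerate(retrieved))
    return (
        "Answer using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Toy data: hypothetical chunks with hand-written vectors instead of real embeddings.
chunks = [
    {"text": "Pricing starts at $10 per month.", "vector": [1.0, 0.0, 0.0]},
    {"text": "The office is in Berlin.", "vector": [0.0, 1.0, 0.0]},
    {"text": "Support is available by email.", "vector": [0.0, 0.0, 1.0]},
]
query_vec = [0.9, 0.1, 0.0]  # pretend embedding of the question

retrieved = top_k(query_vec, chunks, k=2)
prompt = build_prompt("How much does it cost?", retrieved)
print(prompt)
```

In a real pipeline the ranking would come from the vector index and the prompt would go to the chosen model; the point here is only that everything the model sees is the retrieved text plus an instruction to refuse when that text is insufficient, which is the behavior an evaluation harness then has to verify.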