A tuned grep beat my MCP code-intelligence server on F1 by 9 points. I'm publishing the result anyway. Here's why.

Why this benchmark exists

I've spent the last six months building sverklo, a local-first MCP server that gives AI coding agents (Claude Code, Cursor, Windsurf) a real symbol graph instead of grep-based pattern matching. The product positioning has always been "stops the agent from hallucinating function names that don't exist in your codebase."

That positioning is hand-wavy without numbers. Six months in, I had no public benchmark. Whatever speed-of-iteration story I had for that, I was only telling it to myself.

So I built one: 60 hand-verified retrieval tasks across two real OSS codebases (expressjs/express and the sverklo repo itself), three baselines (naive grep, smart grep, sverklo), and metrics that measure both retrieval quality (F1, recall, precision) and the thing AI agents actually pay for (input tokens, tool calls, wall time). Results live at sverklo.com/bench.…
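For readers who want the retrieval-quality metrics spelled out, here is a minimal sketch of how precision, recall, and F1 could be computed for a single task. It assumes each task has a hand-verified set of expected items and that each baseline returns a set of items. The names `RetrievalScore` and `score_task` and the example paths are hypothetical, and the actual benchmark may score differently (for example per symbol rather than per file, or with a different averaging scheme).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalScore:
    precision: float
    recall: float
    f1: float


def score_task(expected: set[str], returned: set[str]) -> RetrievalScore:
    """Score one retrieval task (hypothetical helper, not sverklo's code).

    expected: hand-verified items that answer the task.
    returned: items a baseline actually surfaced.
    """
    hits = len(expected & returned)
    # Precision: what fraction of the returned items were correct.
    precision = hits / len(returned) if returned else 0.0
    # Recall: what fraction of the correct items were returned.
    recall = hits / len(expected) if expected else 0.0
    # F1: harmonic mean of the two; defined as 0 when there are no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return RetrievalScore(precision, recall, f1)


# Illustrative paths only: a tool that finds both expected files
# but also returns two irrelevant ones.
expected = {"src/router.js", "src/app.js"}
returned = {"src/router.js", "src/app.js", "test/app.test.js", "README.md"}
score = score_task(expected, returned)
```

In this example recall is perfect but precision is only one half, which is the kind of trade-off F1 is meant to summarize: noisy results cost the agent input tokens even when nothing is missed.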