An Eval Harness for Tool-Use Agents: 90 Lines, 3 Judges, $3 Per Run

DEV Community·Gabriel Anhaia·about 1 month ago

#ai #agents #python #judge #tool #input

Book: AI Agents Pocket Guide
Also by me: Thinking in Go (2-book series): Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub, an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

You changed one line of the system prompt. The chat-eval suite still passes: output quality looks fine, no hallucinations, the model still answers in JSON. You ship.

Two days later, support says the agent stopped sending follow-up emails after refunds. It is calling log_refund instead of send_followup. The text outputs were right; the tool calls were silently rewired.

Output evals do not catch this. You need an eval that grades the tool trajectory: which tools the agent called, in what order, with what arguments.

The harness is small. About ninety lines of Python, three judges in a ladder, and a golden CSV. Total bill on a 30-row golden set: a few dollars per run.…
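To make "grading the tool trajectory" concrete, here is a minimal sketch of a deterministic check that compares an expected list of tool calls against what the agent actually did. It is not the article's harness, and it does not reproduce the three-judge ladder or the golden CSV format. The `ToolCall` type, the `grade_trajectory` function, and the `order_id` argument are hypothetical names chosen for illustration; only `log_refund` and `send_followup` come from the text above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One tool invocation: the tool's name and the arguments passed to it."""
    name: str
    args: dict[str, Any] = field(default_factory=dict)


def grade_trajectory(expected: list[ToolCall], actual: list[ToolCall]) -> list[str]:
    """Return a list of mismatch descriptions; an empty list means the run passes."""
    expected_names = [call.name for call in expected]
    actual_names = [call.name for call in actual]

    # Check which tools were called and in what order.
    if expected_names != actual_names:
        return [f"tool sequence: expected {expected_names}, got {actual_names}"]

    # Same tools in the same order, so compare arguments step by step.
    problems = []
    for step, (exp, act) in enumerate(zip(expected, actual)):
        if exp.args != act.args:
            problems.append(
                f"step {step} ({exp.name}): expected args {exp.args}, got {act.args}"
            )
    return problems


if __name__ == "__main__":
    # Hypothetical golden row: after a refund, the agent should send a follow-up.
    golden = [ToolCall("send_followup", {"order_id": "A1"})]
    # The regression from the story: the agent logs the refund instead.
    observed = [ToolCall("log_refund", {"order_id": "A1"})]
    for problem in grade_trajectory(golden, observed):
        print(problem)
```

An exact-match check like this is cheap and catches the rewiring described above, but it is strict: any argument that legitimately varies between runs, such as free-text email bodies, would need a looser comparison.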
