Skills Without Evals Are Just Markdown and Hope

DEV Community·Daniel Sogl·about 1 month ago

TL;DR. I built an Anthropic Agent Skill for @ngrx/signals and ran it through the full eval pipeline: capability A/B benchmarks, token and wall-time accounting, and a description-optimizer loop. The skill lifts pass rate from 84% to 100%. It also adds 14 seconds and ~12,000 tokens per invocation (about $0.04 at Sonnet 4.6 input pricing). The description optimizer ran for three iterations and never beat the description I started with. And my evals are now saturated at 100%, which is itself a problem. All four findings are useful, and most teams ship skills without measuring any of them.

The two questions almost no one asks

You write a 200-line SKILL.md, drop it in ~/.claude/skills/, restart Claude Code, and move on. Skill shipped. Two questions you didn't answer:

A) Does this skill actually outperform the base model on the same task, or are you shipping a glorified prompt that the model could already do without help?…
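
The headline numbers in the TL;DR can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only and is not the author's eval harness. The function names are hypothetical, the 25-case run is made up purely so the rates land on 84% and 100% (the article does not say how many cases were run), and the $3 per million input tokens rate is an assumption chosen because it is consistent with "~12,000 tokens is about $0.04".

```python
# Assumption: USD per million input tokens. Not stated in the article;
# picked because 12,000 tokens at this rate rounds to the quoted $0.04.
ASSUMED_INPUT_USD_PER_MTOK = 3.00


def pass_rate(results: list[bool]) -> float:
    """Fraction of eval cases that passed."""
    if not results:
        raise ValueError("no eval results")
    return sum(results) / len(results)


def added_cost_usd(extra_tokens: int,
                   usd_per_mtok: float = ASSUMED_INPUT_USD_PER_MTOK) -> float:
    """Extra input cost per invocation caused by the skill's added tokens."""
    return extra_tokens * usd_per_mtok / 1_000_000


# Hypothetical 25-case A/B run, sized only to reproduce the quoted rates.
baseline = [True] * 21 + [False] * 4   # base model, no skill
with_skill = [True] * 25               # same cases, skill loaded

lift = pass_rate(with_skill) - pass_rate(baseline)
cost = added_cost_usd(12_000)

print(f"baseline {pass_rate(baseline):.0%}, with skill {pass_rate(with_skill):.0%}")
print(f"lift {lift:+.0%}, added input cost per invocation ${cost:.3f}")
```

Putting the lift and the per-invocation overhead side by side is the point: a pass-rate gain only means something next to what it costs in tokens and wall time, and a with-skill arm that sits at 100% can no longer show whether a later change makes the skill better.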
