Skills Without Evals Are Just Markdown and Hope

DEV Community·Daniel Sogl·about 1 month ago

TL;DR. I built an Anthropic Agent Skill for @ngrx/signals and ran it through the full eval pipeline: capability A/B benchmarks, token and wall-time accounting, and a description-optimizer loop. The skill lifts pass rate from 84% to 100%. It also adds 14 seconds and ~12,000 tokens per invocation (about $0.04 at Sonnet 4.6 input pricing). The description optimizer ran for three iterations and never beat the description I started with. And my evals are now saturated at 100%, which is itself a problem. All four findings are useful, and most teams ship skills without measuring any of them.

The two questions almost no one asks

You write a 200-line SKILL.md, drop it in ~/.claude/skills/, restart Claude Code, and move on. Skill shipped. Two questions you didn't answer:

A) Does this skill actually outperform the base model on the same task, or are you shipping a glorified prompt that the model could already do without help?…
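The cost and lift figures quoted in the TL;DR can be sanity-checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the price of $3 per million input tokens is an assumption chosen because it is consistent with the article's "about $0.04" for ~12,000 tokens. It is not a figure stated in the article.

```python
# Illustrative arithmetic only; helper names are hypothetical, not from the article.

# ASSUMPTION: input price in USD per million tokens. Chosen to be consistent
# with the article's "about $0.04" for ~12,000 tokens; check current pricing.
ASSUMED_INPUT_PRICE_PER_MTOK = 3.00


def added_cost_usd(extra_tokens: int,
                   price_per_mtok: float = ASSUMED_INPUT_PRICE_PER_MTOK) -> float:
    """Cost in USD of the extra input tokens a skill adds to one invocation."""
    return extra_tokens / 1_000_000 * price_per_mtok


def pass_rate_lift(baseline: float, with_skill: float) -> float:
    """Absolute lift in pass rate, as a fraction (0.16 means 16 points)."""
    return with_skill - baseline


# Figures taken from the TL;DR: ~12,000 extra tokens, 84% -> 100% pass rate.
cost = added_cost_usd(12_000)
lift = pass_rate_lift(0.84, 1.00)

print(f"added cost per invocation: ${cost:.3f}")
print(f"pass-rate lift: {lift * 100:.0f} percentage points")
```

Under that assumed price, the extra tokens come to a few cents per invocation, which rounds to the article's figure. The trade-off being measured is a 16-point pass-rate lift against that per-call cost and the 14 seconds of added wall time.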
