TL;DR. I built an Anthropic Agent Skill for @ngrx/signals and ran it through the full eval pipeline: capability A/B benchmarks, token and wall-time accounting, and a description-optimizer loop. The skill lifts pass rate from 84% to 100%. It also adds 14 seconds and ~12,000 tokens per invocation (about $0.04 at Sonnet 4.6 input pricing). The description optimizer ran for three iterations and never beat the description I started with. And my evals are now saturated at 100%, which is itself a problem. All four findings are useful, and most teams ship skills without measuring any of them.

The two questions almost no one asks

You write a 200-line SKILL.md, drop it in ~/.claude/skills/, restart Claude Code, and move on. Skill shipped. Two questions you didn't answer:

A) Does this skill actually outperform the base model on the same task, or are you shipping a glorified prompt for something the model could already do without help?…