Most runbooks are useless. Either they're too abstract ('check the logs') or they're a 40-page Confluence doc that nobody reads at 3 AM.

Here is the template I use. It fits on one page and works.

The template

1. Trigger. The exact alert name and what it means.
2. Impact. Who is affected? What are they seeing? Is this user-facing?
3. First 5 minutes. The single most useful command to run. One. Not five.
4. Common causes. The 3 things that most often cause this alert, in order of likelihood.
5. Fix per cause. For each common cause, the exact fix. Copy-paste-ready.
6. Escalation. Who to page if none of the above works. Include their timezone.
7. Post-incident. What to update after the incident is done (ticket, dashboard, doc).

Why this works

At 3 AM, your brain is running at 60%. You need a runbook that gives you the next action in under 30 seconds. A 40-page doc makes you think. A one-pager tells you what to do.

Start with your noisiest alert. Write the runbook. Test it on a new team member.…
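
The seven sections of the template can also be written down as a small data structure, which makes it easy to check that a runbook still follows the one-page rules. What follows is a minimal sketch in Python, not a definitive implementation. The names `Runbook`, `Cause`, `validate`, and `render`, the `HighErrorRate` alert, the `api` deployment, and the on-call rotation are all hypothetical, and the checks are only one possible reading of the rules in the template.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    description: str  # what usually goes wrong
    fix: str          # copy-paste-ready command or step


@dataclass
class Runbook:
    trigger: str         # exact alert name and what it means
    impact: str          # who is affected and what they are seeing
    first_command: str   # the single command for the first 5 minutes
    causes: list[Cause]  # ordered by likelihood, most likely first
    escalation: str      # who to page, including their timezone
    post_incident: list[str] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means no rule was broken."""
        problems = []
        # "One. Not five." is read here as: a single line of shell.
        if "\n" in self.first_command.strip():
            problems.append("first_command must be one command, not several")
        if not 1 <= len(self.causes) <= 3:
            problems.append("list between 1 and 3 common causes")
        for cause in self.causes:
            if not cause.fix.strip():
                problems.append(f"cause '{cause.description}' has no fix")
        if not self.escalation.strip():
            problems.append("escalation contact is missing")
        return problems

    def render(self) -> str:
        """Render the runbook as plain text in the template's section order."""
        lines = [
            f"1. Trigger. {self.trigger}",
            f"2. Impact. {self.impact}",
            f"3. First 5 minutes. {self.first_command}",
            "4. Common causes.",
        ]
        lines += [f"   {i}. {c.description}" for i, c in enumerate(self.causes, 1)]
        lines.append("5. Fix per cause.")
        lines += [f"   {i}. {c.fix}" for i, c in enumerate(self.causes, 1)]
        lines.append(f"6. Escalation. {self.escalation}")
        lines.append("7. Post-incident. " + ", ".join(self.post_incident))
        return "\n".join(lines)


# Hypothetical example: alert, service, and contacts are made up for illustration.
example = Runbook(
    trigger="HighErrorRate: more than 5% of API requests return 5xx",
    impact="User-facing. Customers see failed checkouts.",
    first_command="kubectl logs deployment/api --since=10m --tail=50",
    causes=[
        Cause("Bad deploy", "kubectl rollout undo deployment/api"),
        Cause("Traffic spike", "kubectl scale deployment/api --replicas=6"),
    ],
    escalation="Page the payments on-call (UTC+1)",
    post_incident=["ticket", "dashboard", "doc"],
)

print(example.validate())
print(example.render())
```

Running `validate()` before merging a runbook is a cheap way to catch one that has grown a second "first" command or lost its escalation contact. The rendered text stays short enough to read in one screen.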