Introduction
A bad runbook is worse than no runbook. It creates false confidence — the on-call engineer opens it, skims a wall of background context written for someone who has never seen the system, finds a single kubectl rollout restart command, and closes it. If that command doesn’t work, they’re on their own. The runbook has wasted their time and reset their mental state exactly when they needed to be most focused.
Good runbooks exist in a narrow genre: instructions for a specific, high-stress moment, written by someone who knows the system for someone who might not. Here’s what makes them work.
The core principle: a runbook is not documentation
Documentation explains how a system works. A runbook explains what to do right now when a specific thing has gone wrong. These are different documents serving different readers at different moments.
Runbook readers are:
- Tired, possibly just woken up
- Under time pressure
- Not necessarily the person who built this system
- Trying to make a decision with incomplete information
Write for that person. Every sentence should earn its place by either reducing diagnosis time or enabling an action.
Structure that works
1. Alert name and severity at the top
Literally the first line. The engineer searching for this runbook is coming from an alert. Make the match instant.
Alert: HighMemoryPressure (severity: warning)
Service: feature-store
SLO impact: potential increase in p99 latency
2. What is probably happening
One or two sentences on what the alert measures, followed by the most likely failure modes that produce it. This is not a complete description of the system. Three bullets of likely causes is usually right.
This alert fires when working set memory across feature-store pods
exceeds 80% of limit. Most common causes:
- Traffic spike without HPA catching up (check replica count)
- Cache leak in the embedding store (check /metrics endpoint)
- Upstream model server returning unexpectedly large payloads
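To make the alert condition in that example concrete, here is a minimal Python sketch of the check it describes. It assumes the 80% threshold applies per pod. The function name, the input shape, and the sample numbers are all hypothetical, and a real alert would normally be written in the monitoring system's own query language rather than in Python.

```python
def pods_over_threshold(pods, threshold=0.80):
    """Return names of pods whose working-set memory exceeds threshold * limit.

    `pods` maps a pod name to (working_set_bytes, limit_bytes). This shape is
    an assumption made for illustration, not a real monitoring API.
    """
    over = []
    for name, (working_set, limit) in sorted(pods.items()):
        # Skip pods with no memory limit set; the ratio is undefined for them.
        if limit > 0 and working_set / limit > threshold:
            over.append(name)
    return over


GIB = 1024 ** 3
# Hypothetical pod names and readings.
sample = {
    "feature-store-a": (int(1.7 * GIB), 2 * GIB),  # about 85% of limit
    "feature-store-b": (int(1.0 * GIB), 2 * GIB),  # 50% of limit
}
print(pods_over_threshold(sample))  # ['feature-store-a']
```

Writing the condition out this way also shows what the "probably happening" paragraph has to cover: anything that raises the working set, or anything that leaves too few pods to share it.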
3. Immediate triage steps
Numbered. Specific commands. Expected output for healthy and unhealthy states.
1. Check replica count:
kubectl describe hpa feature-store -n ml-serving
→ Healthy: the "Deployment pods" line shows current equal to desired
→ If current < desired: HPA is struggling, see step 4
2. Check recent restarts:
kubectl get pods -n ml-serving -l app=feature-store
→ Healthy: no restarts in the last 10m
→ Restarts in the last 10m: likely OOMKill in progress, go to step 5
The key is expected output. An engineer who has never seen this system should be able to tell immediately whether what they’re seeing is normal.
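A restart count alone doesn’t say why a container restarted. As a hedged sketch, the Python below shows one way to confirm the OOMKill suspicion from the second triage step, using the JSON that `kubectl get pods -n ml-serving -l app=feature-store -o json` returns. The function name, the 10-minute default, and the sample pods are hypothetical. The field names (`containerStatuses`, `lastState.terminated.reason`, `finishedAt`) come from the Kubernetes Pod status schema.

```python
from datetime import datetime, timedelta, timezone


def recent_oom_kills(pod_list, now, window=timedelta(minutes=10)):
    """Return names of pods with a container OOMKilled within `window` of `now`.

    `pod_list` is the parsed JSON from `kubectl get pods ... -o json`.
    """
    hits = []
    for pod in pod_list.get("items", []):
        for status in pod.get("status", {}).get("containerStatuses", []):
            terminated = status.get("lastState", {}).get("terminated")
            if not terminated or terminated.get("reason") != "OOMKilled":
                continue
            finished_at = terminated.get("finishedAt")
            if not finished_at:
                continue
            finished = datetime.strptime(
                finished_at, "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc)
            if now - finished <= window:
                hits.append(pod["metadata"]["name"])
                break  # one hit per pod is enough
    return hits


# Hypothetical sample data, trimmed to the fields the function reads.
now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
sample_pods = {"items": [
    {"metadata": {"name": "feature-store-a"},
     "status": {"containerStatuses": [{"restartCount": 2, "lastState": {
         "terminated": {"reason": "OOMKilled",
                        "finishedAt": "2024-05-01T11:55:00Z"}}}]}},
    {"metadata": {"name": "feature-store-b"},
     "status": {"containerStatuses": [{"restartCount": 1, "lastState": {
         "terminated": {"reason": "Error",
                        "finishedAt": "2024-05-01T11:58:00Z"}}}]}},
    {"metadata": {"name": "feature-store-c"},
     "status": {"containerStatuses": [{"restartCount": 0, "lastState": {}}]}},
]}
print(recent_oom_kills(sample_pods, now))  # ['feature-store-a']
```

If a helper like this exists, the runbook step should still state the expected output in plain words: an empty list is healthy, any pod name means go to the OOMKill path.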
4. Resolution paths (one per likely cause)
Branch from triage. Each path has a clear header and ends with either a resolution action or an escalation instruction.
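The rule that every path ends in a resolution or an escalation is easy to check mechanically if the runbook is kept as structured data. The following is a minimal Python sketch under that assumption. The `kind` labels, the path headers, and the step text are hypothetical examples, not a standard runbook format.

```python
# Step kinds that are allowed to end a path (hypothetical labels).
TERMINAL_KINDS = {"resolve", "escalate"}


def unterminated_paths(paths):
    """Return headers of paths whose last step neither resolves nor escalates."""
    return [
        header
        for header, steps in paths.items()
        if not steps or steps[-1]["kind"] not in TERMINAL_KINDS
    ]


# Hypothetical resolution paths for the HighMemoryPressure example.
paths = {
    "HPA not scaling": [
        {"kind": "action", "text": "Raise maxReplicas on the feature-store HPA"},
        {"kind": "resolve", "text": "Confirm memory drops below 80% of limit"},
    ],
    "Cache leak": [
        {"kind": "action", "text": "Inspect the /metrics endpoint"},
    ],
}
print(unterminated_paths(paths))  # ['Cache leak']
```

Here the "Cache leak" path is flagged because it leaves the engineer with a finding and no next move.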
5. Escalation path with names
If the runbook can’t resolve the issue, who do you call? Include actual names, Slack handles, and — for genuinely critical systems — phone numbers. “Contact the team” is not an escalation path.
What to ruthlessly cut
- History of how the system was built
- Architectural diagrams (link to them, don’t embed)
- Explanations of why things work the way they do
- Commands that “might be useful”
- Anything that takes more than 10 seconds to read without yielding an actionable insight
When to write them
The best time to write a runbook is immediately after resolving an incident. Not because you’re required to, but because the details are still fresh: you know what you looked at, what worked, and what wasted time. Write the runbook you wish you’d had.
The second best time is before you write the alert. If you’re creating a new alert and can’t immediately describe what the on-call engineer should do when it fires, you shouldn’t create the alert yet.
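One way to enforce this is a check that rejects any alert without a linked runbook. The sketch below assumes Prometheus-style rule groups that have already been parsed into Python dictionaries, and it assumes the link lives in a `runbook_url` annotation, which is a common convention rather than a required field. The function name, the second alert, and the URL are hypothetical.

```python
def alerts_missing_runbook(rule_groups, annotation="runbook_url"):
    """Return names of alerting rules that have no runbook annotation."""
    missing = []
    for group in rule_groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules never page anyone
            if not rule.get("annotations", {}).get(annotation):
                missing.append(rule["alert"])
    return missing


# Hypothetical rule groups, trimmed to the fields the check reads.
rule_groups = [{
    "name": "feature-store",
    "rules": [
        {"alert": "HighMemoryPressure",
         "labels": {"severity": "warning"},
         "annotations": {"runbook_url":
                         "https://runbooks.example.com/feature-store/high-memory-pressure"}},
        {"alert": "HighErrorRate",
         "labels": {"severity": "critical"},
         "annotations": {}},
    ],
}]
print(alerts_missing_runbook(rule_groups))  # ['HighErrorRate']
```

A check like this only proves that a link exists. Whether the runbook behind it is any good still has to be established by having someone follow it.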
Testing runbooks
Ask a colleague who didn’t write it to follow it on a non-production system. Note every point where they hesitate or ask a question. Every hesitation is a gap to fill. Run this exercise once when the runbook is first written, and again after any major system change.
A runbook that nobody has ever followed in a drill will fail when it matters most.