Using tooling to keep runbooks current
How we use lightweight tooling to keep runbooks close to reality instead of letting them turn into an aspirational wiki.
Runbooks age badly when they live far away from the systems they describe.
We had runbooks that mentioned dashboards which no longer existed, commands that no longer worked, and flags that had been removed months earlier.
During incidents, this looked like:
- someone following a runbook step that failed halfway through
- five minutes of "oh, that changed last quarter" explanations
- people abandoning the doc in favor of asking whoever "knows the system"
We didn’t try to solve this with stricter discipline alone.
We added tooling that made keeping runbooks current easier than letting them rot.
Constraints
- We didn’t want a heavy authoring system; engineers were already comfortable with Markdown in the repo.
- We couldn’t freeze changes to the systems until every runbook was perfect.
- Some systems changed more often than others; we needed a way to focus effort.
What we changed
1. Make runbooks live next to code
We moved key runbooks into the same repositories as their services.
That meant:
- changes to dashboards, flags, or APIs could include runbook updates in the same PR
- reviewers could see when a code change invalidated part of a runbook
Runbooks in the repo stayed Markdown, with links to dashboards and relevant scripts.
2. Add lightweight linting
We added a small linter that checked runbooks for:
- broken internal links (e.g., moved dashboards, renamed pages)
- references to flags or endpoints that no longer exist in code
When a check failed, it showed up in CI for the relevant service.
The goal wasn’t to catch everything, just the most common, easy-to-automate drift.
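To make the two checks concrete, here is a minimal sketch of what such a linter could look like. It is not our actual linter: it assumes internal links are relative Markdown links, that flags are written in backticks with a leading `--`, and that the set of flags still present in code (`known_flags`) is collected by some other step. The function name and both regexes are made up for illustration.

```python
import re
from pathlib import Path

# Assumption: internal links are relative Markdown links, e.g. [latency](../dashboards/latency.md)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Assumption: flags are written in backticks with a leading "--", e.g. `--enable-cache`
FLAG_RE = re.compile(r"`(--[a-z0-9][a-z0-9-]*)`")


def lint_runbook(text, runbook_dir, known_flags):
    """Return a list of (line_number, message) problems found in one runbook."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for target in LINK_RE.findall(line):
            # External URLs and in-page anchors are out of scope for this sketch.
            if "://" in target or target.startswith(("#", "mailto:")):
                continue
            path = target.split("#", 1)[0]  # drop any trailing #anchor
            if not (Path(runbook_dir) / path).exists():
                problems.append((lineno, f"broken internal link: {target}"))
        for flag in FLAG_RE.findall(line):
            if flag not in known_flags:
                problems.append((lineno, f"unknown flag: {flag}"))
    return problems
```

A CI job could run this over every runbook in the repository and fail the build when the returned list is non-empty. Extracting `known_flags` from the codebase is the service-specific part and is left out here.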
3. Wire runbooks into incident tooling
Our incident tooling links directly to runbooks based on the service and alert.
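As a rough illustration of the lookup, the sketch below maps a (service, alert) pair to a runbook location and falls back to a service-wide runbook when there is no alert-specific entry. The table contents, paths, alert names, and the fallback rule are all hypothetical, not a description of our incident tooling.

```python
# Hypothetical routing table: (service, alert) -> runbook location.
# A key with alert=None is the service-wide default.
RUNBOOKS = {
    ("checkout", "HighErrorRate"): "services/checkout/runbooks/errors.md",
    ("checkout", None): "services/checkout/runbooks/README.md",
}


def runbook_for(service, alert, table=RUNBOOKS):
    """Prefer an alert-specific runbook, then the service default, else None."""
    return table.get((service, alert)) or table.get((service, None))
```

Returning `None` for an unknown service gives the incident tool a clear signal that no runbook exists yet, which is itself worth surfacing to the owning team.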
We added a small feedback loop: after an incident, the lead can mark sections of the runbook as "helpful", "outdated", or "missing".
This feedback is visible to the owning team, who can then prioritize updates.
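One way to turn those marks into a prioritized list is sketched below. The record shape (`SectionFeedback`) and the ranking rule (count every non-"helpful" mark per section) are assumptions made for the example; only the three mark values come from the process described above.

```python
from collections import Counter
from dataclasses import dataclass

VALID_MARKS = {"helpful", "outdated", "missing"}


@dataclass(frozen=True)
class SectionFeedback:
    """One mark left by an incident lead (hypothetical record shape)."""
    runbook: str
    section: str
    mark: str
    incident_id: str


def sections_needing_attention(feedback):
    """Return [((runbook, section), count), ...] sorted by most problem marks first."""
    counts = Counter()
    for item in feedback:
        if item.mark not in VALID_MARKS:
            raise ValueError(f"unknown mark: {item.mark!r}")
        if item.mark != "helpful":
            counts[(item.runbook, item.section)] += 1
    return counts.most_common()
```

Sections that collect "outdated" or "missing" marks across several incidents rise to the top, which gives the owning team a simple starting point when deciding what to fix first.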
4. Use templates for common patterns
We created simple templates for common runbook sections:
- "First 5 minutes"
- "Rollback / Degrade"
- "Diagnostics"
Templates include placeholders for:
- links to dashboards
- commands for checks
- flags or config toggles
New runbooks start from these templates, which reduces the chance of missing critical sections.
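A template like this can be a few lines of code. The sketch below generates a Markdown skeleton with the three sections and the three placeholders listed above, plus a helper that reports which sections a runbook lacks. The heading levels, the `TODO(...)` marker style, and the function names are assumptions for illustration.

```python
SECTIONS = ["First 5 minutes", "Rollback / Degrade", "Diagnostics"]

# Assumption: unfilled placeholders are marked with TODO(...) so they are easy to grep for.
PLACEHOLDERS = [
    "Dashboards: TODO(link)",
    "Check commands: TODO(command)",
    "Flags / config toggles: TODO(flag)",
]


def new_runbook(service):
    """Return a Markdown skeleton for a new runbook."""
    lines = [f"# {service} runbook", ""]
    for section in SECTIONS:
        lines.append(f"## {section}")
        lines.extend(f"- {placeholder}" for placeholder in PLACEHOLDERS)
        lines.append("")
    return "\n".join(lines)


def missing_sections(text):
    """List the template sections that have no matching '## ' heading in text."""
    headings = {line[3:].strip() for line in text.splitlines() if line.startswith("## ")}
    return [section for section in SECTIONS if section not in headings]
```

`missing_sections` could also run alongside the link and flag checks, so a runbook that drops a critical section is flagged in the same place as other drift.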
5. Allow small edits without ceremony
Engineers can make small edits to runbooks (fixing a command, updating a link) with lightweight review.
We treat these like doc fixes, not major features.
This keeps friction low for the people who actually notice drift during day-to-day work.
Results / Measurements
We looked at:
- how often runbooks were edited
- how often lint checks failed
- feedback from incident leads about runbook usefulness
Over time:
- runbooks for the most active services saw regular small updates instead of annual overhauls
- the number of broken links and obviously outdated steps reported in incident reviews dropped
- incident write-ups more often referenced runbooks as part of the solution, not the problem
We still have stale docs in quieter corners of the system, but the core surfaces stay much healthier.
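For the edit-frequency measurement, a small script is enough to tell "regular small updates" apart from "one big overhaul". The sketch below assumes you already have (runbook path, edit date) pairs, for example parsed from `git log --date=short --format=%ad -- <path>`. The sample data and paths are made up, and the functions are illustrative rather than the exact analysis we ran.

```python
from collections import defaultdict
from datetime import date


def edits_per_month(edits):
    """edits: iterable of (runbook_path, date). Returns {path: {"YYYY-MM": count}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for path, day in edits:
        counts[path][day.strftime("%Y-%m")] += 1
    return {path: dict(months) for path, months in counts.items()}


def active_months(edits):
    """Number of distinct months in which each runbook was edited."""
    return {path: len(months) for path, months in edits_per_month(edits).items()}


# Made-up sample data for illustration only.
sample = [
    ("services/checkout/RUNBOOK.md", date(2024, 1, 9)),
    ("services/checkout/RUNBOOK.md", date(2024, 1, 23)),
    ("services/checkout/RUNBOOK.md", date(2024, 3, 4)),
    ("services/archive/RUNBOOK.md", date(2024, 2, 14)),
]
print(active_months(sample))
```

A runbook with many edits concentrated in a single month looks like an overhaul, while edits spread across many months suggest the steady upkeep described above.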
Takeaways
- Runbooks age at the speed of the systems they describe; co-locating them with code slows the drift.
- Simple linting catches a surprising amount of decay at very little cost.
- Wiring runbooks into incident tooling, and gathering feedback there, keeps them connected to reality.
- Lowering the friction for small edits produces more accurate docs than demanding perfection.