DELIVERY | 2023-12-12 | BY STORECODE

Using tooling to keep runbooks current

How we use lightweight tooling to keep runbooks close to reality instead of letting them turn into an aspirational wiki.

Tags: delivery, runbooks, tooling, incidents

Runbooks age badly when they live far away from the systems they describe.

We had runbooks that mentioned dashboards which no longer existed, commands that no longer worked, and flags that had been removed months earlier.

During incidents, this looked like:

someone following a runbook step that failed halfway through
five minutes of "oh, that changed last quarter" explanations
people abandoning the doc in favor of asking whoever "knows the system"

We didn’t try to solve this with stricter discipline alone.

We added tooling that made it easier for the runbooks to stay current than to rot.

Constraints

We didn’t want a heavy authoring system; engineers were already comfortable with Markdown in the repo.
We couldn’t freeze changes to the systems until every runbook was perfect.
Some systems changed more often than others; we needed a way to focus effort.

What we changed

1. Make runbooks live next to code

We moved key runbooks into the same repositories as their services.

That meant:

changes to dashboards, flags, or APIs could include runbook updates in the same PR
reviewers could see when a code change invalidated part of a runbook

Runbooks in the repo stayed Markdown, with links to dashboards and relevant scripts.

2. Add lightweight linting

We added a small linter that checked runbooks for:

broken internal links (e.g., moved dashboards, renamed pages)
references to flags or endpoints that no longer exist in code

When a check failed, it showed up in CI for the relevant service.

The goal wasn’t to catch everything, just the most common, easy-to-automate drift.
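As an illustration, a linter of this kind might look roughly like the Python sketch below. It is not the tool described here. The `flag:` notation, the function name, and the idea of passing in a set of known flags (collected from code by some other step) are all assumptions made for the example. It only checks relative, in-repo links; checking dashboard URLs would need an HTTP call, which is left out.

```python
import re
from pathlib import Path

# Matches Markdown links like [text](target).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Hypothetical convention: runbooks mention flags as `flag:some_name`.
FLAG_RE = re.compile(r"`flag:([A-Za-z0-9_.-]+)`")


def lint_runbook(text: str, runbook_dir: Path, known_flags: set[str]) -> list[str]:
    """Return human-readable problems found in one runbook's text."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for target in LINK_RE.findall(line):
            if target.startswith(("http://", "https://", "mailto:", "#")):
                continue  # external links and same-page anchors are not checked
            path = target.split("#", 1)[0]  # drop any anchor before the file check
            if not (runbook_dir / path).exists():
                problems.append(f"line {lineno}: broken link {target}")
        for flag in FLAG_RE.findall(line):
            if flag not in known_flags:
                problems.append(f"line {lineno}: unknown flag {flag}")
    return problems
```

A CI job could call `lint_runbook` for each runbook in the service's repository and fail the build when the returned list is non-empty.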

3. Wire runbooks into incident tooling

Our incident tooling links directly to runbooks based on the service and alert.

We added a small feedback loop:

after an incident, the lead can mark sections of the runbook as "helpful", "outdated", or "missing"

This feedback is visible to the owning team, who can then prioritize updates.
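The shape of this loop can be sketched in a few lines of Python. Everything named here (the routing table, `runbook_for`, `FeedbackLog`, the example service and paths) is hypothetical and only illustrates the idea: look up a runbook by service and alert, record the lead's marks, and surface the sections that were reported as outdated or missing.

```python
from collections import Counter
from dataclasses import dataclass, field

VALID_MARKS = {"helpful", "outdated", "missing"}

# Hypothetical routing table: (service, alert) -> runbook location.
# A key with alert=None is the service's default runbook.
RUNBOOK_INDEX = {
    ("checkout", "HighErrorRate"): "services/checkout/RUNBOOK.md#high-error-rate",
    ("checkout", None): "services/checkout/RUNBOOK.md",
}


def runbook_for(service, alert, index=RUNBOOK_INDEX):
    """Prefer an alert-specific entry, then fall back to the service default."""
    return index.get((service, alert)) or index.get((service, None))


@dataclass
class FeedbackLog:
    # Each entry is (runbook, section, mark).
    marks: list[tuple[str, str, str]] = field(default_factory=list)

    def record(self, runbook: str, section: str, mark: str) -> None:
        if mark not in VALID_MARKS:
            raise ValueError(f"unknown mark: {mark!r}")
        self.marks.append((runbook, section, mark))

    def needs_attention(self) -> list[tuple[str, str]]:
        """Sections marked outdated or missing, most-reported first."""
        counts = Counter((r, s) for r, s, m in self.marks if m != "helpful")
        return [key for key, _ in counts.most_common()]
```

The owning team would read the output of `needs_attention()` as a rough priority list rather than a strict queue.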

4. Use templates for common patterns

We created simple templates for common runbook sections:

"First 5 minutes"
"Rollback / Degrade"
"Diagnostics"

Templates include placeholders for:

links to dashboards
commands for checks
flags or config toggles

New runbooks start from these templates, which reduces the chance of missing critical sections.
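To make the idea concrete, here is a small Python sketch of a template plus a check for gaps. The section names come from the list above; the Markdown layout, the `TODO(...)` placeholder syntax, and the function names are assumptions for the example rather than the actual templates.

```python
import re

REQUIRED_SECTIONS = ["First 5 minutes", "Rollback / Degrade", "Diagnostics"]

# Hypothetical template; TODO(...) marks a placeholder the author must fill in.
TEMPLATE = """\
# Runbook: {service}

## First 5 minutes
- Dashboard: TODO(dashboard-link)
- Check command: TODO(check-command)

## Rollback / Degrade
- Flag or config toggle: TODO(flag-name)

## Diagnostics
- Dashboard: TODO(dashboard-link)
- Check command: TODO(check-command)
"""

PLACEHOLDER_RE = re.compile(r"TODO\([a-z-]+\)")


def new_runbook(service: str) -> str:
    """Start a runbook for a service from the shared template."""
    return TEMPLATE.format(service=service)


def template_gaps(text: str) -> list[str]:
    """List required sections that are absent and placeholders still unfilled."""
    headings = {line[3:].strip() for line in text.splitlines() if line.startswith("## ")}
    gaps = [f"missing section: {name}" for name in REQUIRED_SECTIONS if name not in headings]
    gaps += [f"unfilled placeholder: {p}" for p in sorted(set(PLACEHOLDER_RE.findall(text)))]
    return gaps
```

A check like `template_gaps` could run alongside the linter, so a new runbook that still contains placeholders is flagged before it is relied on in an incident.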

5. Allow small edits without ceremony

Engineers can make small edits to runbooks (fixing a command, updating a link) with lightweight review.

We treat these like doc fixes, not major features.

This keeps friction low for the people who actually notice drift during day-to-day work.

Results / Measurements

We looked at:

how often runbooks were edited
how often lint checks failed
feedback from incident leads about runbook usefulness

Over time:

runbooks for the most active services saw regular small updates instead of annual overhauls
fewer broken links and obviously outdated steps came up in incident reviews
incident write-ups more often referenced runbooks as part of the solution, not the problem

We still have stale docs in quieter corners of the system, but the core surfaces stay much healthier.
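The first measurement listed above, how often runbooks were edited, can be approximated from git history once runbooks live in the service repositories. The sketch below is one possible way to do it, not the method used here. It assumes the dates come from `git log --date=short --format=%ad -- <path to runbook>`, which prints one `YYYY-MM-DD` date per commit; the function name is made up for the example.

```python
from collections import Counter


def edits_per_month(git_log_dates: str) -> dict[str, int]:
    """Count commits per YYYY-MM from one-date-per-line git log output."""
    months = Counter(
        line.strip()[:7]  # "2023-11-28" -> "2023-11"
        for line in git_log_dates.splitlines()
        if line.strip()  # skip blank lines
    )
    return dict(sorted(months.items()))
```

A runbook with a few edits spread across many months matches the "regular small updates" pattern; a single spike once a year suggests it is still being overhauled in bulk.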

Takeaways

Runbooks age at the speed of the systems they describe; co-locating them with code slows the drift.
Simple linting catches a surprising amount of decay at low cost.
Wiring runbooks into incident tooling, and gathering feedback there, keeps them connected to reality.
Lowering the friction for small edits produces more accurate docs than demanding perfection.