STEWARDSHIP | 2018-07-14 | BY MARA SABOGAL

Ship the runbook with the change

If the system changed, the runbook changed. Otherwise incidents turn into archaeology.

stewardship, operations, documentation, reliability, process

The easiest way to tell whether a runbook is real is to watch what happens right after a change.

If a deploy path moves, a dashboard changes, or a rollback path gets “simplified,” and the runbook stays the same, the next incident becomes archaeology. Someone is paging through old docs while the system is on fire.

That’s why the line we try to enforce is boring and strict:

If the system changed, the runbook changed.

Constraints

Runbooks usually rot for normal, predictable reasons:

The change felt small. “It’s just a flag name.” “It’s just a new queue.”
The person shipping isn’t the person on call. The pain shows up later, for someone else.
Documentation is treated as optional. Code review blocks on tests; docs are “nice to have.”
Incidents are rare enough to hide the cost. You can go weeks without paying the bill.

The problem is that runbooks aren’t “documentation” in the usual sense. They’re an interface.

An interface that’s out of date is worse than missing: it looks authoritative while sending you down the wrong path.

What we changed

We stopped treating runbook updates as follow-up work.

Instead, we treat them as part of the same change. Concretely:

Any PR that changes runtime behavior must include a runbook diff.
- deploy path, config path, feature flags
- dashboard links or log query examples
- rollback / backout steps
- new dependencies, quotas, or failure modes
“Definition of done” includes the runbook being usable for the new world.
- If you rename a job, the runbook uses the new name.
- If you move a button, the runbook points to the new path.
- If rollback changed, the runbook names the new safe first action.
We make the runbook reviewable.
- The runbook lives next to the code (or is linked from it).
- We expect reviewers to comment on clarity, not just correctness.

A small rule that helps: if the change would make you answer “where do we look now?” in Slack, it belongs in the runbook.
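
As a backstop for the first rule above, a CI step could flag PRs that touch runtime files without touching a runbook. The sketch below is only an illustration: the path patterns and the function name are assumptions, not our actual layout, and a check like this supplements reviewer judgment rather than replacing it.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns; adjust them to your own repository layout.
RUNTIME_PATTERNS = ["deploy/*", "config/*", "flags/*", "dashboards/*"]
RUNBOOK_PATTERNS = ["runbooks/*", "*/RUNBOOK.md"]


def matches_any(path, patterns):
    """True if the path matches at least one shell-style pattern."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def missing_runbook_diff(changed_paths):
    """Return True when runtime files changed but no runbook file did."""
    touches_runtime = any(matches_any(p, RUNTIME_PATTERNS) for p in changed_paths)
    touches_runbook = any(matches_any(p, RUNBOOK_PATTERNS) for p in changed_paths)
    return touches_runtime and not touches_runbook


if __name__ == "__main__":
    changed = ["deploy/prod/rollout.yaml", "src/worker.py"]
    if missing_runbook_diff(changed):
        print("Runtime behavior changed but no runbook file was updated.")
```

Path matching only catches the obvious cases. A renamed flag inside application code would slip past it, which is why the reviewer question still matters.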

The runbook diff we expect

When a PR changes runtime behavior, the PR description includes a short “runbook diff” (copy/paste):

Symptom changes: what would the page/ticket look like now?
First dashboard: link + what question it answers
Safe first action: the thing you can do that won’t make it worse
Rollback / backout: exact command or button path + stop condition
New failure modes: new dependency, quota, overload path, or timeouts
Comms snippet: one sentence support can reuse

If you can’t fill this in, it usually means one of two things: the change isn’t observable yet, or the rollback story isn’t real yet.
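
If you want a mechanical nudge, the six fields above are easy to check for. This is a minimal sketch that assumes each field sits on its own line as "Label: value" in the PR description; the function name and the example text are hypothetical.

```python
REQUIRED_FIELDS = [
    "Symptom changes",
    "First dashboard",
    "Safe first action",
    "Rollback / backout",
    "New failure modes",
    "Comms snippet",
]


def unfilled_fields(pr_description):
    """Return the runbook-diff fields that are absent or left blank."""
    filled = {}
    for line in pr_description.splitlines():
        # Split on the first colon only, so URLs in the value survive intact.
        label, sep, value = line.partition(":")
        if sep:
            filled[label.strip()] = value.strip()
    return [field for field in REQUIRED_FIELDS if not filled.get(field)]


if __name__ == "__main__":
    description = "\n".join([
        "Symptom changes: queue depth alert fires instead of latency alert",
        "First dashboard: https://example.invalid/d/queue (is the queue draining?)",
        "Safe first action: pause the consumer",
        "Rollback / backout:",
        "Comms snippet: Processing is delayed; no data is lost.",
    ])
    print(unfilled_fields(description))
```

This only verifies that something was written. It cannot tell whether the template prompt was left unedited or whether the rollback step actually works; that part is still a reviewer's job.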

Review expectations

We don’t require perfect prose. We require that a non-author can take the first safe action.

Reviewers look for:

names that match the new world (jobs, flags, queues)
links that go to the right place without extra navigation
a safe first action that is reversible (turn down, pause, roll back) and comes before “scale”
a backout that doesn’t require tribal knowledge

If the change increases operational complexity, we ask the reverse question: what did we delete or automate so the runbook stays short?

Where we put it

We keep the runbook next to the code when we can. When we can’t, we still make the navigation reliable:

alerts link to the relevant runbook section
dashboards link back to the runbook
the runbook links to the one dashboard you start from

That triangle is what makes “the runbook exists” equivalent to “the runbook is findable.”
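
The triangle can be audited if you have an inventory of alerts, dashboards, and runbook sections. The sketch below assumes three plain dictionaries exported from wherever those live; the data shapes and names are illustrative, not a real tool's format.

```python
def broken_triangle_links(alerts, dashboards, runbook_sections):
    """List the missing edges in the alert / dashboard / runbook triangle.

    alerts:           alert name -> runbook section id it links to (or None)
    dashboards:       dashboard name -> runbook section id it links back to (or None)
    runbook_sections: section id -> name of the dashboard you start from (or None)
    """
    problems = []
    for alert, section in alerts.items():
        if section not in runbook_sections:
            problems.append(f"alert {alert!r} does not link to a known runbook section")
    for dashboard, section in dashboards.items():
        if section not in runbook_sections:
            problems.append(f"dashboard {dashboard!r} does not link back to the runbook")
    for section, dashboard in runbook_sections.items():
        if dashboard not in dashboards:
            problems.append(f"runbook section {section!r} does not link to a known dashboard")
    return problems


if __name__ == "__main__":
    alerts = {"queue-depth-high": "queue-backlog"}
    dashboards = {"queue-overview": None}  # no link back to the runbook yet
    runbook_sections = {"queue-backlog": "queue-overview"}
    for problem in broken_triangle_links(alerts, dashboards, runbook_sections):
        print(problem)
```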

One more constraint we’ve learned the hard way: the runbook needs a path for “I’m new here.”

If a step requires privileged access or a tool only one person has, the runbook should say that up front and point to the escalation path. Otherwise the runbook reads like a prank at 2am.

The smallest enforcement mechanism isn’t a tool. It’s a habit: reviewers ask “what changed in the runbook?” the same way they ask “how do we roll back?”

We’ve also found it helps to treat runbooks as living UI: if a runbook section hasn’t been opened in months, it either needs a better entry point (alert/ticket link) or it’s describing a failure mode you no longer have.
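
If your docs platform records when a section was last opened, finding those sections is a few lines. This sketch assumes such a record exists as a mapping from section id to date; the 90-day threshold is an arbitrary stand-in for “months,” and the section names are made up.

```python
from datetime import date, timedelta


def stale_sections(last_opened, today, max_age_days=90):
    """Return section ids never opened, or not opened within max_age_days.

    last_opened: section id -> date of the most recent view (None if never opened)
    """
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        section
        for section, opened in last_opened.items()
        if opened is None or opened < cutoff
    )


if __name__ == "__main__":
    last_opened = {
        "queue-backlog": date(2018, 7, 1),
        "legacy-failover": date(2018, 1, 15),
        "cert-rotation": None,
    }
    print(stale_sections(last_opened, today=date(2018, 7, 14)))
```

The output is a list of candidates to review, not a delete list: a stale section may just need an alert or ticket that links to it.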

In practice, “staying current” often means deleting. If a step is no longer valid, we remove it rather than leaving a fossil for someone else to trip over.

Results / Measurements

We don’t pretend this is perfectly measurable, but we’ve found a few cheap signals that correlate with calmer incidents:

Time-to-first-safe-action goes down. The first ten minutes stop being a debate about where to start.
Fewer “tribal knowledge” escalations. Less paging of the one person who “knows the weird thing.”
Fewer self-inflicted wounds. When rollback steps are current and explicit, people stop improvising under pressure.

The simplest metric we track is mechanical:

“Did this change ship with a runbook update?”

When the answer is consistently “yes,” the incident room gets quieter.
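
Tracking that question over time can be as simple as a ratio. The sketch below assumes each shipped change is recorded with two booleans, `changes_runtime` and `runbook_updated`; both field names are hypothetical.

```python
def runbook_update_rate(changes):
    """Fraction of runtime-affecting changes that shipped with a runbook update.

    Returns None when no change in the window affected runtime behavior.
    """
    runtime_changes = [c for c in changes if c["changes_runtime"]]
    if not runtime_changes:
        return None
    updated = sum(1 for c in runtime_changes if c["runbook_updated"])
    return updated / len(runtime_changes)


if __name__ == "__main__":
    shipped = [
        {"changes_runtime": True, "runbook_updated": True},
        {"changes_runtime": True, "runbook_updated": False},
        {"changes_runtime": True, "runbook_updated": True},
        {"changes_runtime": False, "runbook_updated": False},  # docs-only change, ignored
    ]
    print(runbook_update_rate(shipped))
```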

Takeaways

Treat the runbook like UI copy: it’s written for a stressed human, not for the person who wrote the system.

If a change alters how the system behaves, your runbook is now wrong until you update it.

If you can’t explain the new safe first action in a few lines, it’s a sign the change needs a clearer rollback plan.