STEWARDSHIP · 2018-06-30 · BY MARA SABOGAL

Runbooks are interfaces

A runbook checklist we use to make incidents boring.

stewardship, operations, documentation, reliability

I’ve watched smart teams lose time in the same place: not in the fix, but in the first ten minutes.

Someone asks, “what changed?” Someone asks, “where do we look?” And the room goes quiet while everyone rebuilds a mental model that should have been written down.

That’s not a competence problem. It’s an interface problem.

A runbook is the interface between a tired human and a complex system.

Context

Most documentation is written for the person who already understands the system.

A runbook is written for the moment you don’t.

What we mean by “runbook” (for non-engineers)

A runbook is a step-by-step “what to do next” guide for a live service—especially when something is broken or confusing.

It’s not a design doc. It’s closer to a pilot checklist: short, specific, and safe.

A decent runbook answers:

what “bad” looks like (and how to confirm it’s real)
how to tell severity / who’s affected
the safest first action (and what not to do)
how to roll back or back out a change
who to escalate to, and what info to include
what to communicate while you work
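
One way to keep those six questions honest is to treat them as required fields and check a draft for gaps. The sketch below is only an illustration, not a tool we ship: the `Runbook` class, its field names, and the sample text are all hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class Runbook:
    """One field per question a decent runbook answers (hypothetical names)."""
    symptoms: str = ""           # what "bad" looks like, and how to confirm it
    severity: str = ""           # how to tell severity / who is affected
    safe_first_action: str = ""  # safest first action, and what not to do
    rollback: str = ""           # how to roll back or back out a change
    escalation: str = ""         # who to escalate to, and what info to include
    communication: str = ""      # what to communicate while you work


def missing_sections(runbook: Runbook) -> list[str]:
    """Return the names of sections that are empty or whitespace-only."""
    return [f.name for f in fields(runbook) if not getattr(runbook, f.name).strip()]


# Made-up draft with only two of the six questions answered.
draft = Runbook(
    symptoms="Error rate is above normal on the service dashboard",
    safe_first_action="Pause the deploy pipeline; do not restart the database",
)
print(missing_sections(draft))
# → ['severity', 'rollback', 'escalation', 'communication']
```

An empty `safe_first_action` is the gap to fix first.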

Even if you can’t execute every step, a runbook lets you describe the issue cleanly, pull the right people in quickly, and avoid thrash.

It should be scannable. It should be blunt. It should prefer verbs over paragraphs. It should assume you’re reading it under stress.

If you want a team to “stay” with a system over time, the runbook is one of the most practical places to start.

Checklist

This is the starter runbook we use when we adopt a service. It’s intentionally boring.

Notes

A runbook doesn’t need to be comprehensive. It needs to be usable.

Usable usually means:

the first page tells you where to look
the first step is safe (and reversible)
the doc is short enough that someone will actually read it

If the runbook is long, make the top section scannable and link out to deeper docs. Don’t bury the first action under context.

If you can’t name a safe first action, you don’t have a runbook yet.

The safe first action is usually: stop the bleeding, restore visibility, then decide.

A quick test: can an engineer who didn’t build this service take a safe first action from this doc?

Write it like you’re writing UI copy. Prefer short labels. Use consistent naming. Put the important links and commands where someone can find them without scrolling through a wall of text.

We also make the runbook discoverable from the first entry point:

a page (the alert that reaches the on-call) links to the first dashboard
the first dashboard links back to the runbook
tickets/alerts link to both

If you can’t find the runbook from the page payload, it’s not operational yet.
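
That rule is easy to check mechanically. As a rough sketch, assuming alert payloads arrive as dictionaries and that the links live under `runbook_url` and `dashboard_url` (both field names are hypothetical, as is the example URL; real paging tools name these differently):

```python
# Hypothetical field names; adjust to whatever your paging tool actually sends.
REQUIRED_LINKS = ("runbook_url", "dashboard_url")


def missing_links(payload: dict) -> list[str]:
    """Return the required link fields that are absent or empty in a payload."""
    return [
        key for key in REQUIRED_LINKS
        if not str(payload.get(key) or "").strip()
    ]


# Made-up alert that links to a dashboard but not to the runbook.
alert = {
    "title": "High error rate",
    "dashboard_url": "https://dashboards.example.com/checkout",
}
print(missing_links(alert))
# → ['runbook_url']
```

A non-empty result means the person being paged has to go hunting, which is exactly the time this checklist is trying to save.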

One more test we like: can someone new to the rotation follow the runbook on a calm Tuesday?

If they can’t, the problem isn’t them. The interface is unclear.

And update the runbook as part of the change—not “after.”
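
If "as part of the change" needs teeth, a review-time check can flag changes that touch a service without touching its runbook. This is a minimal sketch under assumed conventions: the directory layout, the `RUNBOOK.md` filename, and the function name are all hypothetical, and the list of changed paths would come from your own tooling.

```python
SERVICE_DIR = "services/checkout"               # hypothetical service directory
RUNBOOK_PATH = "services/checkout/RUNBOOK.md"   # hypothetical runbook location


def runbook_needs_update(changed_paths: list[str]) -> bool:
    """True when service files changed but the runbook did not."""
    touched_service = any(
        path.startswith(SERVICE_DIR + "/") and path != RUNBOOK_PATH
        for path in changed_paths
    )
    return touched_service and RUNBOOK_PATH not in changed_paths


print(runbook_needs_update(["services/checkout/app.py"]))
# → True
print(runbook_needs_update(["services/checkout/app.py", RUNBOOK_PATH]))
# → False
```

Not every change needs a runbook edit, so a check like this works better as a prompt for the reviewer than as a hard gate.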

If this checklist feels like overhead, it usually means incidents are doing the documentation work for you. That trade is expensive.

Takeaways

In an incident, your runbook is the UI.

If the UI is unclear, you will pay for it in time, stress, and unnecessary escalation.

Write the runbook before you need it.