STEWARDSHIP · 2018-08-05 · BY MARA SABOGAL

Where runbooks live

Runbooks take three common forms: a page, a checklist, or a button. Pick the home that matches your incident reality.

stewardship, operations, documentation, reliability, incident-response

When someone asks “do we have a runbook for this?” they’re usually not asking whether the knowledge exists.

They’re asking where it lives.

I’ve watched smart teams lose time in the same place: not in the fix, but in the first ten minutes.

Someone asks, “what changed?” Someone asks, “where do we look?” And the room goes quiet while everyone rebuilds a mental model that should have been written down.

Runbooks don’t fail because teams can’t write. They fail because the runbook is in the wrong place for the moment you need it.

Constraints

There are a few predictable tensions:

  • Different readers. The person on call wants “what do I do next?” Support wants “what do I tell customers?” A new engineer wants “what is this thing?”
  • Different moments. A runbook written for calm Tuesday afternoons is unreadable at 2am.
  • Different homes. The runbook is in a wiki, the dashboards are elsewhere, the “one real command” is in a chat thread, and escalation is tribal knowledge.
  • Different kinds of runbooks. Some are instructions. Some are decision trees. Some are automation.

The mistake is treating all of those as the same artifact.

The second mistake is assuming “we wrote it” means “we can find it.”

In an incident, retrieval is part of the interface.

If the runbook is hidden behind three clicks, an unfamiliar folder name, and someone’s memory, it’s not operational.

What we changed

We started naming the form of the runbook up front.

In practice, runbooks tend to take three shapes:

  1. A page (the “scan-and-do” doc)
     • Markdown/wiki/doc that you can read in under a minute.
     • Good for: symptoms, severity, safe first actions, rollback steps, escalation, comms.
     • Weak spot: it can drift if it’s far from the change.
  2. A checklist (the “incident workspace” playbook)
     • A short sequence of steps that you can follow and mark off while you work.
     • Good for: coordination, handoff, “don’t forget to communicate,” capturing notes.
     • Weak spot: if it becomes a second source of truth, it diverges from the page.
  3. A button (the “runbook automation” workflow)
     • A controlled, repeatable action: restart a job, drain a queue, roll back a deploy.
     • Good for: reducing risky manual steps, making recovery safer and auditable.
     • Weak spot: it’s easy to assume “the button is the runbook” and forget the context.

We don’t pick one. We pick a primary home, then link the others to it.

Our default home

Our default is:

  • The page is canonical (symptoms, dashboards, safe first actions, escalation).
  • The checklist links to the page (and is focused on coordination and comms).
  • The automation links back to the page (and includes “when to use / when not to use”).

A runbook can have many entry points.

It should not have many sources of truth.

How a runbook evolves

Most runbooks start as a page, because you need context before you need speed.

After a few incidents, you pull repeated steps into a checklist so you can coordinate without re-reading prose.

After a few more, the risky steps become automation (the button). But the button never replaces the page. The page is where you explain what the button does, when it’s safe, and how to back out.

Buttons need guardrails: scoped inputs, explicit confirmation, and an audit trail. If a button can do irreversible damage, it’s not a button yet.
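As a rough illustration of those three guardrails, here is a minimal Python sketch of a “drain a queue” button. The queue names, the `AuditLog` class, and the `do_drain` callback are all hypothetical; a real workflow tool would supply its own equivalents.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical allow-list: the only queues this button may touch (scoped input).
ALLOWED_QUEUES = {"email-retry", "thumbnail-jobs"}


@dataclass
class AuditLog:
    """In-memory stand-in for an audit trail (assumption: real ones are durable)."""
    entries: list = field(default_factory=list)

    def record(self, actor, action, target, outcome):
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "action": action,
            "target": target,
            "outcome": outcome,
        })


def drain_queue(queue, actor, confirm, audit, do_drain):
    """Drain only if the target is in scope and the operator confirmed it."""
    if queue not in ALLOWED_QUEUES:
        audit.record(actor, "drain_queue", queue, "rejected: out of scope")
        raise ValueError(f"{queue!r} is not a queue this button may drain")
    # Explicit confirmation: the operator must retype the target name.
    if confirm != queue:
        audit.record(actor, "drain_queue", queue, "rejected: not confirmed")
        raise PermissionError("confirmation text must match the queue name")
    do_drain(queue)
    audit.record(actor, "drain_queue", queue, "done")
```

Rejected attempts are logged as well as successful ones, so the trail shows who tried what, not only what ran.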

The navigation triangle

In practice, we try to make a simple triangle easy to follow:

  • The page links to the one dashboard you open first.
  • The dashboard links back to the runbook and the safe first actions.
  • The alert or ticket links to the dashboard and the runbook.

If one of those links is missing, the incident room becomes a scavenger hunt.
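One way to keep the triangle honest is to check it mechanically. The sketch below assumes you can export, per service, which artifacts link to which; the node names and the shape of the `links` mapping are illustrative, not taken from any particular tool.

```python
# The links the triangle requires: each source must link to every listed target.
REQUIRED_LINKS = {
    "alert": {"dashboard", "runbook"},
    "runbook": {"dashboard"},
    "dashboard": {"runbook"},
}


def missing_links(links):
    """Return (source, target) pairs the triangle needs but the link map lacks."""
    missing = []
    for source, targets in REQUIRED_LINKS.items():
        present = links.get(source, set())
        for target in sorted(targets - present):
            missing.append((source, target))
    return missing


# Example: the alert links to the dashboard but not to the runbook.
links = {
    "alert": {"dashboard"},
    "runbook": {"dashboard"},
    "dashboard": {"runbook"},
}
print(missing_links(links))  # [('alert', 'runbook')]
```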

Choosing a primary home (a quick rubric)

A runbook should live where the incident starts.

  • If incidents start from a page (a pager alert, not the runbook page above), put the runbook link in the alert payload.
  • If incidents start from a ticket, put the runbook link in the ticket template.
  • If incidents start from a customer report, make sure support has a path from the report to the runbook (reference ID → dashboard → runbook).

A rule that keeps it honest: if you can’t find the runbook from the first alert or ticket, it effectively doesn’t exist.

Signs your runbook is in the wrong place

  • The first question on every incident call is “where is the doc?”
  • People paste commands in chat instead of linking the runbook.
  • The runbook is accurate but never opened.
  • The runbook exists, but it’s written like an encyclopedia (not like a UI).

Results / Measurements

A good runbook format is the one that reduces repeated questions in the first ten minutes.

The signals we watch are simple:

  • Fewer “where is the doc?” messages during incidents.
  • Faster time-to-first-safe-action (even when the on-call engineer didn’t build the system).
  • Fewer contradictory instructions (one source of truth, many entry points).
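Time-to-first-safe-action can be computed from an incident timeline if you record one. This is a minimal sketch that assumes each timeline event is a dict with a `kind` and a timestamp `at`; the event kinds `alert` and `safe_action` are hypothetical labels, not a standard.

```python
from datetime import datetime


def time_to_first_safe_action(events):
    """Seconds from the first alert to the first safe action after it.

    Returns None when the timeline has no alert or no later safe action.
    """
    alert_times = [e["at"] for e in events if e["kind"] == "alert"]
    action_times = [e["at"] for e in events if e["kind"] == "safe_action"]
    if not alert_times or not action_times:
        return None
    start = min(alert_times)
    # Only count actions taken at or after the first alert.
    after = [t for t in action_times if t >= start]
    if not after:
        return None
    return (min(after) - start).total_seconds()


# Example timeline: alert at 02:00:00, first safe action at 02:07:30.
timeline = [
    {"kind": "alert", "at": datetime(2020, 1, 1, 2, 0, 0)},
    {"kind": "message", "at": datetime(2020, 1, 1, 2, 3, 0)},
    {"kind": "safe_action", "at": datetime(2020, 1, 1, 2, 7, 30)},
]
print(time_to_first_safe_action(timeline))  # 450.0
```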

And a maintenance signal:

  • Runbook updates ship with changes. If that stops happening, the runbook is already rotting.
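A change-review check can surface this signal early. The sketch below assumes a hypothetical repository layout, `services/<name>/...` for code and `runbooks/<name>.md` for the page, and takes the list of paths touched by a change. Not every code change needs a runbook edit, so treat the result as a prompt for the reviewer rather than a hard gate.

```python
from pathlib import PurePosixPath


def services_missing_runbook_update(changed_paths):
    """Return services whose code changed without their runbook changing too.

    Assumes a hypothetical layout: services/<name>/... and runbooks/<name>.md.
    """
    touched_services = set()
    touched_runbooks = set()
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        if len(parts) >= 3 and parts[0] == "services":
            touched_services.add(parts[1])
        elif len(parts) == 2 and parts[0] == "runbooks" and parts[1].endswith(".md"):
            touched_runbooks.add(parts[1][: -len(".md")])
    return sorted(touched_services - touched_runbooks)


# Example: billing and search both changed, but only search updated its runbook.
changed = [
    "services/billing/api.py",
    "services/search/index.py",
    "runbooks/search.md",
]
print(services_missing_runbook_update(changed))  # ['billing']
```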

A softer signal that still matters: the tone of the incident room.

When the runbook is findable and scannable, the room gets quieter.

Takeaways

A runbook is an interface. Interfaces need a home.

Most teams don’t need a perfect tool. They need a clear answer to “where does the runbook live?” and a habit of updating it as part of the change.

If you have automation, document it like you would any other UI: what it does, what it’s safe for, and what it will not do.

And make the runbook discoverable from the first alert or ticket. If it’s not the first click, it’s not the interface.

Further reading