STEWARDSHIP · 2018-06-30 · BY MARA SABOGAL

Runbooks are interfaces

A runbook checklist we use to make incidents boring.

stewardship · operations · documentation · reliability

I’ve watched smart teams lose time in the same place: not in the fix, but in the first ten minutes.

Someone asks, “what changed?” Someone asks, “where do we look?” And the room goes quiet while everyone rebuilds a mental model that should have been written down.

That’s not a competence problem. It’s an interface problem.

A runbook is the interface between a tired human and a complex system.

Context

Most documentation is written for the person who already understands the system.

A runbook is written for the moment you don’t understand it.

What we mean by “runbook” (for non-engineers)

A runbook is a step-by-step “what to do next” guide for a live service—especially when something is broken or confusing.

It’s not a design doc. It’s closer to a pilot checklist: short, specific, and safe.

A decent runbook answers:

  • what “bad” looks like (and how to confirm it’s real)
  • how to tell severity / who’s affected
  • the safest first action (and what not to do)
  • how to roll back or back out a change
  • who to escalate to, and what info to include
  • what to communicate while you work

Even if you can’t execute every step, a runbook lets you describe the issue cleanly, pull the right people in quickly, and avoid thrash.

It should be scannable. It should be blunt. It should prefer verbs over paragraphs. It should assume you’re reading it under stress.

If you want a team to “stay” with a system over time, the runbook is one of the most practical places to start.

Checklist

This is the starter runbook we use when we adopt a service. It’s intentionally boring.

  • What is this service?

    • one sentence purpose
    • primary users / callers
    • link to architecture diagram (if it exists)
  • How do we know it’s healthy?

    • “green” signals (P95 latency, error rate, queue depth, etc.)
    • the one dashboard you’d open first
    • log query examples (copy/paste)
  • What changes the system?

    • deploy path (how code ships)
    • config path (how config ships)
    • feature flags (if any)
  • Common failure modes (top 3)

    • what it looks like
    • what to check first
    • what a safe first action is
  • Rollback / backout steps

    • the exact command or button path
    • what to confirm after rollback
    • what not to do under pressure
  • Dependencies and limits

    • external dependencies (DB, queue, vendor)
    • rate limits / quotas
    • known bottlenecks
  • Escalation

    • who to page (role, not person)
    • when to page
    • what info to include in the escalation message
  • Handoff

    • where this runbook lives
    • last updated date
    • owner (team / rotation)
  • Comms + coordination

    • internal channel / bridge link
    • update cadence (who updates, where)
    • customer-facing status page link (if you have one)
  • Safety rails

    • “do not do this” list (irreversible commands, dangerous knobs)
    • known sharp edges (the thing that surprises new on-call)
    • how to pause/stop background jobs safely
  • Practice

    • a quarterly runbook drill (open the dashboard, find the logs, confirm rollback)
    • record what was missing and patch it immediately
  • After-action

    • link to the last incident report (if you have one)
    • what we changed afterward (alerts, thresholds, runbook updates)
    • the “known weird thing” we don’t want rediscovered under pressure
  • Degrade modes / safe knobs

    • which feature flag disables the risky path
    • which throttle to turn down before you scale
    • what “degraded but safe” looks like on the dashboard
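
One way to keep this checklist honest is to treat it as data and lint it. The sketch below is a minimal illustration, assuming a runbook is stored as a dictionary keyed by section name. The section keys and the `missing_sections` helper are hypothetical names made up for this example, not part of any particular tool.

```python
# Hypothetical section keys mirroring the checklist above (assumed names).
REQUIRED_SECTIONS = [
    "what_is_this_service",
    "health_signals",
    "change_paths",
    "common_failure_modes",
    "rollback",
    "dependencies_and_limits",
    "escalation",
    "handoff",
    "comms",
    "safety_rails",
    "practice",
    "after_action",
    "degrade_modes",
]


def missing_sections(runbook: dict) -> list:
    """Return required sections that are absent or empty, in checklist order."""
    missing = []
    for key in REQUIRED_SECTIONS:
        value = runbook.get(key)
        # Treat None, "", [], {} and whitespace-only strings as "not filled in".
        if not value or (isinstance(value, str) and not value.strip()):
            missing.append(key)
    return missing


if __name__ == "__main__":
    draft = {"what_is_this_service": "Resizes uploaded images.", "rollback": "  "}
    for section in missing_sections(draft):
        print("missing:", section)
```

A check like this only proves a section exists. It can't tell you whether the rollback step actually works, which is what the quarterly drill is for.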

Notes

A runbook doesn’t need to be comprehensive. It needs to be usable.

Usable usually means:

  • the first page tells you where to look
  • the first step is safe (and reversible)
  • the doc is short enough that someone will actually read it

If the runbook is long, make the top section scannable and link out to deeper docs. Don’t bury the first action under context.

If you can’t name a safe first action, you don’t have a runbook yet.

The safe first action usually follows one order: stop the bleeding, restore visibility, then decide.

A quick test: can an engineer who didn’t build this service take a safe first action from this doc?

Write it like you’re writing UI copy. Prefer short labels. Use consistent naming. Put the important links and commands where someone can find them without scrolling through a wall of text.

We also make the runbook discoverable from the first entry point:

  • the page (the alert that reaches on-call) links to the first dashboard
  • the first dashboard links back to the runbook
  • tickets/alerts link to both

If you can’t find the runbook from the page payload, it’s not operational yet.
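
That test can be automated. Here is a minimal sketch, assuming alert definitions are available as dictionaries with `runbook_url` and `dashboard_url` fields. Both field names are assumptions for illustration; the payload from your alerting tool will look different.

```python
from urllib.parse import urlparse

# Hypothetical field names; real alert payloads vary by tool.
REQUIRED_LINK_FIELDS = ("runbook_url", "dashboard_url")


def is_http_url(value) -> bool:
    """True if value is a string that parses as an http(s) URL with a host."""
    if not isinstance(value, str):
        return False
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def missing_links(alert: dict) -> list:
    """Return the link fields that are absent or malformed in an alert payload."""
    return [field for field in REQUIRED_LINK_FIELDS
            if not is_http_url(alert.get(field))]


if __name__ == "__main__":
    alert = {"name": "HighErrorRate", "runbook_url": "TODO"}
    print(missing_links(alert))
```

Running something like this over every alert definition gives you a list of pages that would leave on-call hunting for the runbook.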

One more test we like: can someone new to the rotation follow the runbook on a calm Tuesday?

If they can’t, the problem isn’t them. The interface is unclear.

And update the runbook as part of the change—not “after.”
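
One way to nudge this is a review-time check that flags changes touching a service without touching its runbook. This is a minimal sketch, assuming each service lives under `services/<name>/` with its runbook at `services/<name>/RUNBOOK.md`. That layout and the function name are assumptions for illustration only.

```python
from pathlib import PurePosixPath


def services_missing_runbook_update(changed_paths):
    """Return services with changed files but no change to their RUNBOOK.md.

    Assumes a hypothetical repo layout: services/<name>/... with the runbook
    at services/<name>/RUNBOOK.md.
    """
    touched = set()
    runbook_updated = set()
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        if len(parts) < 3 or parts[0] != "services":
            continue  # not a file inside a service directory
        service = parts[1]
        touched.add(service)
        if parts[2:] == ("RUNBOOK.md",):
            runbook_updated.add(service)
    return sorted(touched - runbook_updated)


if __name__ == "__main__":
    changed = [
        "services/api/app.py",
        "services/api/RUNBOOK.md",
        "services/worker/jobs.py",
    ]
    print(services_missing_runbook_update(changed))
```

Not every change needs a runbook edit, so a check like this probably works better as a reminder on the review than as a hard gate.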

If this checklist feels like overhead, it usually means incidents are doing the documentation work for you. That trade is expensive.

Takeaways

In an incident, your runbook is the UI.

If the UI is unclear, you will pay for it in time, stress, and unnecessary escalation.

Write the runbook before you need it.