RELIABILITY · 2023-01-18 · BY ELI NAVARRO

Practicing incident handovers

What we changed about incident handovers so they stopped being an afterthought and started shortening incidents.

reliability · incidents · on-call · handovers

Incidents rarely stay neatly inside one person’s shift.

For a long time, we treated handovers as a detail. The on-call who was awake would type a few sentences in chat, maybe leave a comment in an incident doc, and hope the next person could reconstruct the rest.

Sometimes it worked.

Sometimes the new on-call would spend the first 20–30 minutes re-learning what the previous shift had already discovered.

We decided to treat incident handovers as a first-class part of reliability work.

Constraints

  • We did not want to introduce a parallel system of incident tooling just for handovers.
  • On-call rotations already had limited slack; any new process had to be short.
  • Incidents crossed time zones regularly; we could not rely on synchronous calls for every handover.
  • We wanted something that worked for both big "named" incidents and small-but-important pages.

What we changed

We focused on making handovers predictable and easy to consume.

1. A small, fixed template

We added a tiny handover template to our incident docs:

  • Status: one line ("stabilized / degrading / improving / unknown")
  • Hypothesis: what we currently think is happening
  • Active levers: what we are changing right now (flags, rollbacks, config)
  • Next checks: the next 2–3 things we plan to look at

The outgoing on-call fills this in before handing over.

We deliberately kept it short enough to complete in a few minutes.
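As a rough illustration, the four fields could be modelled as a small structure that renders the block and flags obvious gaps. This is a sketch, not our actual tooling: the `Handover` class, its `problems` check, and the rule that an empty hypothesis or more than three next checks counts as a problem are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List

# The four status values come from the template; everything else is hypothetical.
ALLOWED_STATUSES = ("stabilized", "degrading", "improving", "unknown")


@dataclass
class Handover:
    status: str
    hypothesis: str
    active_levers: List[str] = field(default_factory=list)
    next_checks: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return reasons the block is not ready to hand over (empty if fine)."""
        issues = []
        if self.status not in ALLOWED_STATUSES:
            issues.append("status must be one of: " + ", ".join(ALLOWED_STATUSES))
        if not self.hypothesis.strip():
            issues.append("hypothesis is empty")
        if not self.next_checks:
            issues.append("list at least one next check")
        elif len(self.next_checks) > 3:
            issues.append("keep next checks to 2-3 items")
        return issues

    def render(self) -> str:
        """Render the block as plain text, ready to paste at the top of the doc."""
        levers = "; ".join(self.active_levers) or "none"
        checks = "; ".join(self.next_checks) or "none"
        return "\n".join([
            f"Status: {self.status}",
            f"Hypothesis: {self.hypothesis}",
            f"Active levers: {levers}",
            f"Next checks: {checks}",
        ])
```

Keeping validation to a handful of cheap checks matches the intent of the template: it should nudge the outgoing on-call, not slow them down.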

2. A single source of truth

Previously, we had context scattered across:

  • chat threads
  • dashboards
  • an incident doc (sometimes)

We made the incident doc the canonical source of truth.

Rules:

  • links to relevant dashboards and runbooks go into the doc
  • significant decisions and hypotheses get a timestamped note
  • the handover template lives at the top

Chat is for conversation. The doc is for state.
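The timestamped-note rule is easy to make consistent with a tiny helper. The sketch below is one possible shape, assuming notes are plain-text lines and that timestamps are normalized to UTC so readers in different time zones see the same clock; the `format_note` name and the line format are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def format_note(text: str, author: str, now: Optional[datetime] = None) -> str:
    """Format a decision or hypothesis as a single timestamped line.

    `now` should be a timezone-aware datetime; it defaults to the current
    time. The timestamp is always rendered in UTC.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    return f"[{stamp}] {author}: {text.strip()}"
```

A line produced this way can be appended to the incident doc under the handover block, which keeps the record of state in the doc rather than in chat.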

3. Explicit lead baton-passing

We stopped assuming that the person who joined a call "last" was now leading.

Instead:

  • the outgoing lead writes the handover block
  • the incoming lead replies in chat with a short acknowledgment ("I’ve read the doc; taking lead from here")

This sounds formal, but it prevents the awkward period where nobody is sure who is in charge.
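The baton rule can be read as a two-step state change: the lead does not move until the handover block exists and the incoming person has acknowledged it. The following is an illustrative sketch of that logic only; `LeadBaton` and its method names are hypothetical, and in practice the "state" is simply the doc plus a chat message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeadBaton:
    current_lead: str
    pending_lead: Optional[str] = None
    handover_written: bool = False

    def write_handover(self) -> None:
        """The outgoing lead records that the handover block is filled in."""
        self.handover_written = True

    def offer(self, incoming: str) -> None:
        """Offer the lead; only allowed once the handover block exists."""
        if not self.handover_written:
            raise RuntimeError("write the handover block before offering the lead")
        self.pending_lead = incoming

    def acknowledge(self, who: str) -> None:
        """The incoming lead confirms they have read the doc and takes over."""
        if who != self.pending_lead:
            raise RuntimeError(f"{who} has not been offered the lead")
        self.current_lead = who
        self.pending_lead = None
        self.handover_written = False  # the next handover needs a fresh block
```

At every point exactly one person is `current_lead`, which is the property the explicit acknowledgment is meant to guarantee.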

4. Practicing handovers outside of real incidents

We added handovers to our incident rehearsal exercises:

  • run a small, time-boxed simulation
  • pause halfway and swap the lead to another engineer

The goal is not to make perfect game-day drills; it’s to make writing and consuming handovers feel normal.

Results / Measurements

We measured handover quality indirectly.

Two signals mattered:

  • Time-to-first-confident-action after handover. In incident reviews, we looked at how long it took the new lead to make their first confident decision after taking over.
  • Re-discovery. We tracked cases where the new shift repeated diagnostic work that the previous shift had already done.
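For the first signal, the arithmetic is just the gap between two timestamps pulled from the incident doc, averaged across handovers. Here is a minimal sketch, assuming each handover has been annotated during review with a handover time and a first-confident-action time; the function names are hypothetical, and deciding what counts as a "confident action" remains a human judgment made in the review.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Iterable, Tuple


def minutes_to_first_action(handover_at: datetime, first_action_at: datetime) -> float:
    """Minutes between the handover and the new lead's first confident action."""
    if first_action_at < handover_at:
        raise ValueError("first action precedes the handover")
    return (first_action_at - handover_at) / timedelta(minutes=1)


def average_minutes(pairs: Iterable[Tuple[datetime, datetime]]) -> float:
    """Average time-to-first-confident-action over (handover, action) pairs."""
    return mean(minutes_to_first_action(h, a) for h, a in pairs)
```

Re-discovery is harder to compute mechanically; a simple count of "already checked" notes per review is likely enough to see a trend.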

After a few months of using the template and explicit batons:

  • time-to-first-confident-action after a handover dropped from tens of minutes to around 5–10 minutes on average
  • we saw fewer "we had already checked that" moments in reviews

The biggest change was qualitative:

  • incoming leads reported feeling less like they were "walking into the dark"
  • outgoing leads had an obvious stopping point instead of typing ad-hoc notes until they collapsed

Takeaways

  • Handovers are part of the incident, not paperwork that happens afterward.
  • A tiny, consistent template beats long, freeform notes that nobody has time to write or read.
  • A single canonical incident doc makes it easier to find context across shifts.
  • Practicing handovers when nothing is on fire makes them faster and calmer when it is.
