Practicing incident handovers
What we changed about incident handovers so they stopped being an afterthought and started shortening incidents.
Incidents rarely stay neatly inside one person’s shift.
For a long time, we treated handovers as a detail. The on-call who was awake would type a few sentences in chat, maybe leave a comment in an incident doc, and hope the next person could reconstruct the rest.
Sometimes it worked.
Sometimes the new on-call would spend the first 20–30 minutes re-learning what the previous shift had already discovered.
We decided to treat incident handovers as a first-class part of reliability work.
Constraints
- We did not want to introduce a parallel system of incident tooling just for handovers.
- On-call rotations already had limited slack; any new process had to be short.
- Incidents crossed time zones regularly; we could not rely on synchronous calls for every handover.
- We wanted something that worked for both big "named" incidents and small-but-important pages.
What we changed
We focused on making handovers predictable and easy to consume.
1. A small, fixed template
We added a tiny handover template to our incident docs:
- Status: one line, chosen from "stabilized", "degrading", "improving", or "unknown"
- Hypothesis: what we currently think is happening
- Active levers: what we are changing right now (flags, rollbacks, config)
- Next checks: the next 2–3 things we plan to look at
The outgoing on-call fills this in before handing over.
We deliberately kept it short enough to complete in a few minutes.
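To make the shape of the template concrete, here is a minimal sketch of it as a data structure. It is an illustration, not the tooling described in this post: the `Handover` class, its method names, and the readiness checks are all assumptions, and the example values are invented.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    STABILIZED = "stabilized"
    DEGRADING = "degrading"
    IMPROVING = "improving"
    UNKNOWN = "unknown"


@dataclass
class Handover:
    """Hypothetical model of the four-field handover block."""
    status: Status
    hypothesis: str
    active_levers: list[str] = field(default_factory=list)
    next_checks: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return reasons the block is not ready to hand over (empty if fine).

        These checks are assumptions chosen for the sketch, not documented rules.
        """
        issues = []
        if not self.hypothesis.strip():
            issues.append("hypothesis is empty; say so explicitly if there is none yet")
        if not self.next_checks:
            issues.append("list at least one next check")
        elif len(self.next_checks) > 3:
            issues.append("keep next checks to 2-3 items")
        return issues

    def render(self) -> str:
        """Render the block as plain text for the top of the incident doc."""
        lines = [
            f"Status: {self.status.value}",
            f"Hypothesis: {self.hypothesis}",
            "Active levers:",
            *[f"  - {lever}" for lever in self.active_levers or ["none"]],
            "Next checks:",
            *[f"  - {check}" for check in self.next_checks],
        ]
        return "\n".join(lines)


# Invented example values, for illustration only.
example = Handover(
    status=Status.DEGRADING,
    hypothesis="connection pool exhaustion after the last deploy",
    active_levers=["rolling back the deploy"],
    next_checks=["pool saturation dashboard", "error rate after rollback"],
)
print(example.render())
```

Restricting the status to a fixed set of values keeps the first line scannable, and capping the next checks keeps the block short enough to fill in within a few minutes.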
2. A single source of truth
Previously, we had context scattered across:
- chat threads
- dashboards
- an incident doc (sometimes)
We made the incident doc the canonical source of truth.
Rules:
- links to relevant dashboards and runbooks go into the doc
- significant decisions and hypotheses get a timestamped note
- the handover template lives at the top
Chat is for conversation. The doc is for state.
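For the timestamped notes, a tiny helper is enough. The sketch below is hypothetical: the `note` function and the bracketed format are assumptions, and it stamps in UTC only because incidents that cross time zones are easier to read on a single clock.

```python
from datetime import datetime, timezone
from typing import Optional


def note(text: str, now: Optional[datetime] = None) -> str:
    """Format a decision or hypothesis as a UTC-stamped line for the incident doc.

    `now` should be a timezone-aware datetime; it defaults to the current time.
    """
    stamp = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return f"[{stamp.strftime('%Y-%m-%d %H:%M')} UTC] {text}"


# Invented example note, for illustration only.
print(note("Decision: roll back the deploy; hypothesis is pool exhaustion"))
```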
3. Explicit lead baton-passing
We stopped assuming that whoever had joined the call most recently was now leading.
Instead:
- the outgoing lead writes the handover block
- the incoming lead replies in chat with a short acknowledgment ("I’ve read the doc; taking lead from here")
This sounds formal, but it prevents the awkward period where nobody is sure who is in charge.
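The two steps behave like a small state machine. The sketch below is one possible reading, not a description of our process in code: the `Baton` class is hypothetical, and it assumes the outgoing lead stays in charge until the incoming lead acknowledges.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Baton:
    """Hypothetical record of who leads the incident right now."""
    lead: str
    offered_to: Optional[str] = None  # named successor who has not acknowledged yet

    def offer(self, incoming: str) -> None:
        """Outgoing lead has written the handover block and names a successor."""
        self.offered_to = incoming

    def acknowledge(self, who: str) -> None:
        """Incoming lead confirms they have read the doc and takes over."""
        if who != self.offered_to:
            raise ValueError(f"{who} has not been offered the lead")
        self.lead = who
        self.offered_to = None


# Invented names, for illustration only.
baton = Baton(lead="outgoing-lead")
baton.offer("incoming-lead")
print(baton.lead)  # still the outgoing lead until the acknowledgment arrives
baton.acknowledge("incoming-lead")
print(baton.lead)
```

Because the lead changes only on acknowledgment, there is always exactly one named lead, even while the handover is in flight.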
4. Practicing handovers outside of real incidents
We added handovers to our incident rehearsal exercises:
- run a small, time-boxed simulation
- pause halfway and swap the lead to another engineer
The goal is not to run perfect game-day drills; it’s to make writing and consuming handovers feel normal.
Results / Measurements
We measured handover quality indirectly.
Two signals mattered:
- Time-to-first-confident-action after handover. In incident reviews, we looked at how long it took the new lead to make a decision after taking over.
- Re-discovery. We tracked cases where the new shift repeated diagnostic work that the previous shift had already done.
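Both signals reduce to simple arithmetic over what incident reviews record. The following sketch assumes each review notes two timestamps and a yes/no flag; the `HandoverReview` structure and the `summarize` function are hypothetical, and the sample values are placeholders rather than real measurements.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class HandoverReview:
    """Hypothetical record extracted from one incident review."""
    handover_at: datetime      # when the incoming lead took over
    first_action_at: datetime  # first decision reviewers judged confident
    rediscovered: bool         # did the new shift repeat earlier diagnostics?

    @property
    def minutes_to_first_action(self) -> float:
        return (self.first_action_at - self.handover_at).total_seconds() / 60


def summarize(reviews: list[HandoverReview]) -> dict[str, float]:
    """Average the first signal and compute the share of handovers with re-discovery.

    Expects a non-empty list; an empty one has no meaningful average.
    """
    return {
        "mean_minutes_to_first_action": mean(r.minutes_to_first_action for r in reviews),
        "rediscovery_rate": sum(r.rediscovered for r in reviews) / len(reviews),
    }


# Placeholder values, for illustration only.
sample = [
    HandoverReview(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 8), False),
    HandoverReview(datetime(2024, 1, 2, 22, 0), datetime(2024, 1, 2, 22, 12), True),
]
print(summarize(sample))
```

Judging which action counts as "confident" stays a human call made in the review; the code only does the bookkeeping afterward.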
After a few months of using the template and explicit batons:
- time-to-first-confident-action after a handover dropped noticeably, from tens of minutes to roughly 5–10 minutes on average
- we saw fewer "we had already checked that" moments in reviews
The biggest change was qualitative:
- incoming leads reported feeling less like they were "walking into the dark"
- outgoing leads had an obvious stopping point instead of typing ad-hoc notes until they collapsed
Takeaways
- Handovers are part of the incident, not paperwork that happens afterward.
- A tiny, consistent template beats long, freeform notes that nobody has time to write or read.
- A single canonical incident doc makes it easier to find context across shifts.
- Practicing handovers when nothing is on fire makes them faster and calmer when it is.