RELIABILITY · 2020-01-29 · BY STORECODE

Incident report: Paging gaps during the first remote week

Our first week of mostly-remote on-call exposed blind spots in paging and escalation. We describe where alerts went missing and how we changed the system.

reliability · incidents · on-call · remote-work

Summary

In late January 2020, we shifted most of the team to working from home.

The services stayed the same. The alerts stayed the same.

The on-call rotation did not.

During the first mostly-remote week, we had a minor but telling incident: an alert that should have paged the primary engineer was delayed and then acknowledged by someone who happened to see it in chat, not by the person on rotation.

Nothing broke badly enough to make the news. A small spike in errors subsided once traffic was routed off the problematic path.

But the incident showed that our paging and escalation paths had silent assumptions about people being in the same building.

Impact

  • Duration: ~27 minutes from the first alert firing until the engineer on rotation knew about it; the first concrete mitigation, taken by a teammate off rotation, came about 14 minutes in.
  • User impact:
    • Elevated error rate (~1–2% above baseline) on a secondary API for ~20 minutes.
    • Most users experienced retries or a slightly slower path; a small number saw generic error messages.
  • Internal impact:
    • Confusion about who was on call and who was "allowed" to respond.
    • Extra work for support to confirm whether there was an active incident.

No data was lost. This was a coordination failure, not a data integrity failure.

Timeline

All times local.

  • 09:41 — Error-rate alert fires for a secondary API endpoint. The alert is configured to page the primary on-call and post to a shared incident channel.
  • 09:42 — The alert posts in chat. The paging call to the on-call engineer’s phone does not arrive; their VPN client had changed the network path their push notifications relied on.
  • 09:45 — A teammate notices the alert in chat and pings "Is anyone on this?" in the channel.
  • 09:48 — The teammate who noticed the alert begins investigating, even though they are not on rotation.
  • 09:55 — They identify a misbehaving dependency and temporarily route traffic away from the problematic path.
  • 10:01 — Error rates return to baseline.
  • 10:08 — The actual on-call engineer sees the backscroll and calls out that they never received the page.
  • 10:20 — A short review identifies several gaps in how we assumed pages would reach people who were now at home.

Root cause

The immediate cause was that the primary on-call did not receive the page.

The deeper cause was that our paging and escalation design assumed:

  • people would see wall monitors or hear someone’s laptop alarm in the office
  • other engineers would know, in real time, who was physically nearby and reachable

Our system relied on multiple layers:

  • the paging provider’s notification path
  • the VPN client
  • the engineer’s device and network

We had never tested this combination in the new working arrangement.

Contributing factors:

  • Outdated contact information. Some phone numbers and backup contacts were stale.
  • Loose rotation reminders. The engineer on call had swapped a shift informally without updating the rotation tool.
  • No explicit backup for this kind of alert. The alert did not escalate beyond the primary, assuming they would reliably receive it.

What we changed

1. Treat contact paths as production dependencies

We updated the on-call runbook and rotation tooling:

  • Phone numbers and contact methods are part of onboarding and reviewed quarterly.
  • Engineers verify that their devices receive a test page when their rotation starts.

We added a small, automated "page test" at the start of each new week:

  • a non-urgent test alert fires
  • the primary confirms receipt with a simple acknowledgment

If the test fails, we investigate before a real incident depends on it.
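
To make this concrete, here is a minimal sketch of what the weekly page test could look like, assuming a hypothetical HTTP API on the paging provider; the endpoint, payload fields, and timeout are illustrative placeholders, not our actual tooling.

    # page_test.py: weekly low-stakes page test (illustrative sketch; the endpoint,
    # payload fields, and timeout are assumptions, not a real provider API).
    import json
    import time
    import urllib.request

    PAGING_API = "https://paging.example.internal/api"  # hypothetical endpoint
    ACK_TIMEOUT_SECONDS = 15 * 60                        # how long we wait for an ack

    def send_test_page(primary: str) -> str:
        """Fire a non-urgent test alert at the current primary and return its id."""
        payload = json.dumps({
            "recipient": primary,
            "severity": "test",
            "message": "Weekly page test - please acknowledge.",
        }).encode()
        req = urllib.request.Request(
            f"{PAGING_API}/alerts", data=payload,
            headers={"Content-Type": "application/json"}, method="POST")
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["alert_id"]

    def wait_for_ack(alert_id: str) -> bool:
        """Poll the alert until it is acknowledged or the timeout expires."""
        deadline = time.time() + ACK_TIMEOUT_SECONDS
        while time.time() < deadline:
            with urllib.request.urlopen(f"{PAGING_API}/alerts/{alert_id}") as resp:
                if json.load(resp)["status"] == "acknowledged":
                    return True
            time.sleep(30)
        return False

    if __name__ == "__main__":
        alert_id = send_test_page(primary="current-primary")
        if not wait_for_ack(alert_id):
            # Surface the failure loudly so the contact path gets fixed
            # before a real page depends on it.
            print("PAGE TEST FAILED: primary did not acknowledge within the window")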

2. Clarify ownership and backup

We stopped relying on "whoever sees it" behavior.

  • Each rotation has a clear primary and backup.
  • Alerts that page the primary will escalate to the backup if not acknowledged within a short window (sketched just after this list).
  • The backup is expected to respond if the primary is unreachable, not to wait for confirmation.
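
As a rough sketch of that escalation rule, here is how we think about it in code. The five-minute window and the data shapes below are illustrative assumptions; in practice this lives in the paging provider's configuration, not in a script we run.

    # escalation.py: sketch of the primary-to-backup escalation rule.
    # The 5-minute window and the data shapes are illustrative assumptions.
    from dataclasses import dataclass
    from datetime import datetime, timedelta
    from typing import List, Optional

    ESCALATION_WINDOW = timedelta(minutes=5)  # assumed "short window" before escalating

    @dataclass
    class Alert:
        fired_at: datetime
        acknowledged_at: Optional[datetime] = None

    def page_targets(alert: Alert, now: datetime, primary: str, backup: str) -> List[str]:
        """Return who should currently be paged for this alert."""
        if alert.acknowledged_at is not None:
            return []                                 # someone owns it; stop paging
        if now - alert.fired_at < ESCALATION_WINDOW:
            return [primary]                          # still inside the primary's window
        return [primary, backup]                      # window expired; page the backup too

    if __name__ == "__main__":
        now = datetime.now()
        stale = Alert(fired_at=now - timedelta(minutes=9))  # unacknowledged for 9 minutes
        print(page_targets(stale, now, primary="primary-eng", backup="backup-eng"))
        # -> ['primary-eng', 'backup-eng']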

We also clarified in writing that stepping in when you see an unowned alert is appreciated, but that you should loop in the current primary and backup while you do.

3. Make the rotation visible

We added simple visibility improvements:

  • A small indicator in chat and internal dashboards showing who is currently primary and backup.
  • Handover messages at shift change that tag the new primary.

This reduced the time spent asking "who’s on?" when an alert appears.
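
The handover message itself is deliberately simple. A sketch of the idea, assuming a hypothetical chat webhook; the URL and the hard-coded names are placeholders for what the rotation tool provides at shift change.

    # handover.py: post a shift-change handover message that tags the new primary.
    # The webhook URL and names are placeholders; in practice the names come
    # from the rotation tool at shift change.
    import json
    import urllib.request

    CHAT_WEBHOOK = "https://chat.example.internal/hooks/oncall"  # hypothetical webhook

    def post_handover(primary: str, backup: str) -> None:
        """Announce the current primary and backup so nobody has to ask who is on call."""
        text = (f"On-call handover: @{primary} is now primary, @{backup} is backup. "
                "Pages escalate to the backup if not acknowledged promptly.")
        payload = json.dumps({"text": text}).encode()
        req = urllib.request.Request(
            CHAT_WEBHOOK, data=payload,
            headers={"Content-Type": "application/json"}, method="POST")
        urllib.request.urlopen(req)

    if __name__ == "__main__":
        post_handover(primary="primary-eng", backup="backup-eng")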

4. Include remote conditions in tests

We added a few simple "remote assumptions" to incident rehearsals:

  • run through a paging test with only home networks and VPNs in play
  • simulate the primary being unreachable and see how quickly the backup responds

We do not aim to model every possible failure path, just to avoid assuming that office-only paths will exist forever.

Follow-ups

Completed

  • Cleaned up rotation data and enforced owner fields.
  • Added weekly paging tests.
  • Configured escalation from primary to backup for this class of alerts.

Planned / in progress

  • Extend paging tests to more services.
  • Add simple reporting on "time from alert to human acknowledgment" as a reliability metric (see the sketch below).
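
A rough sketch of how that metric could be computed, assuming we can export alert firing and acknowledgment timestamps from the paging provider; the field names and sample timestamps are made up for illustration.

    # ack_latency.py: sketch of the planned alert-to-acknowledgment metric.
    # The input shape is an assumption; real data would come from the paging provider.
    from datetime import datetime
    from statistics import median
    from typing import List

    def ack_latencies(alerts: List[dict]) -> List[float]:
        """Seconds from each alert firing to a human acknowledging it."""
        return [
            (a["acknowledged_at"] - a["fired_at"]).total_seconds()
            for a in alerts
            if a.get("acknowledged_at") is not None
        ]

    if __name__ == "__main__":
        # Made-up example timestamps, just to show the calculation.
        sample = [
            {"fired_at": datetime(2020, 1, 29, 9, 41), "acknowledged_at": datetime(2020, 1, 29, 9, 48)},
            {"fired_at": datetime(2020, 1, 30, 14, 2), "acknowledged_at": datetime(2020, 1, 30, 14, 5)},
        ]
        print(f"median ack latency: {median(ack_latencies(sample)) / 60:.1f} minutes")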

Takeaways

  • Changing where people work without changing how pages reach them is a hidden reliability risk.
  • Contact paths and escalation rules are production dependencies and need tests, just like services do.
  • Making on-call ownership visible reduces hesitation and confusion during the first minutes of an incident.
  • Weekly low-stakes paging tests are cheap insurance against finding out during a real event that someone’s phone never rings.