Remote on-call without losing signal
What we changed in alerts, dashboards, and runbooks so remote on-call engineers see the same incident at the same time.
When most of the team started taking pages from home, a few weaknesses in our incident flow became obvious.
Nothing was "broken" in the usual sense. Alerts fired. Dashboards loaded. People showed up.
But the first ten minutes of an incident turned into parallel solo investigations.
In the office, ad-hoc coordination covered for a lot:
- someone would walk past a desk and say "it looks like queues"
- two people could point at the same graph and argue about it
- you could overhear which hypothesis was currently winning
Remote, we lost that ambient bandwidth. We saw three problems show up repeatedly:
- Different people opened different dashboards from the same alert.
- Runbooks assumed shared context ("open the main dashboard" without a link).
- Alerts were noisy enough that people muted them locally and relied on chat.
If we wanted remote on-call to work, we had to make the system carry more of the coordination load.
Constraints
- We could not pause feature work for a quarter to rebuild incident tooling.
- We had one primary metrics stack and one log stack; adding a third would increase surface area.
- Network conditions at home were unpredictable; we had to assume higher latency and occasional flakiness.
- We had mixed experience levels on the rotation; new engineers could not rely on "grabbing whoever is near" to get unstuck.
- We were already over-alerting. Any change that increased raw alert volume would be rejected.
Organizationally:
- Incident review time was limited; we could not rely on multi-hour training per person.
- Teams owned their services but shared common tooling; we needed conventions that scaled beyond one service.
What we changed
We made changes in three areas (alerts, dashboards, and runbooks), plus a small amount of tooling to keep everyone on the same page.
1. Alerts that point to a single place
Every paged alert now links to one first dashboard and one runbook section.
- For each service, we designated a single starting dashboard.
- Alerts for that service link directly to that dashboard with a pre-set time window.
- The same alert links to the incident runbook, anchored to a concrete heading (e.g., "First 10 minutes").
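In code, the convention is small. A rough sketch (the URLs, service name, and 20-minute window below are made up for illustration, not our real config):

```python
from urllib.parse import urlencode

# Hypothetical links for one service; in practice these live in per-service config.
DASHBOARD_BASE = "https://grafana.internal.example/d/checkout-first"
RUNBOOK_BASE = "https://wiki.internal.example/runbooks/checkout"

def first_dashboard_url(window_minutes: int = 20) -> str:
    """The first dashboard, opened at a pre-set time window."""
    params = {"from": f"now-{window_minutes}m", "to": "now"}
    return f"{DASHBOARD_BASE}?{urlencode(params)}"

def runbook_url(anchor: str = "first-10-minutes") -> str:
    """Deep link to a concrete runbook heading, not the top of the page."""
    return f"{RUNBOOK_BASE}#{anchor}"

# Every paged alert carries both links as annotations, so the page itself
# tells the on-call where to start.
alert_annotations = {
    "summary": "checkout error rate above threshold for 5 minutes",
    "dashboard": first_dashboard_url(),
    "runbook": runbook_url(),
}
```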
We removed alerts that did not have a clear first action.
If we could not answer "what should the on-call do in the first three minutes?", the alert became an info-level signal or was deleted.
2. Dashboards designed for remote collaboration
The first dashboard for a service became part of its interface.
We standardized a layout:
- Top row: request rate, error rate, P95 latency.
- Second row: critical endpoints by error rate, dependency health.
- Third row: resource saturation (queue depth, DB connections, thread pool usage).
Additional graphs live on secondary dashboards linked from the first.
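Because the layout is the same everywhere, it can be written down as data and checked during review. A minimal sketch of that idea (the panel names mirror the rows above; the validation is illustrative, not our actual tooling):

```python
# The standardized first-dashboard layout, expressed as data so reviews can
# check it the same way for every service.
FIRST_DASHBOARD_LAYOUT = [
    ["request_rate", "error_rate", "p95_latency"],           # top row
    ["endpoint_error_rates", "dependency_health"],           # second row
    ["queue_depth", "db_connections", "thread_pool_usage"],  # third row: saturation
]

def check_first_dashboard(panels: list[list[str]]) -> list[str]:
    """Return human-readable problems with a service's first dashboard."""
    problems = []
    expected = {name for row in FIRST_DASHBOARD_LAYOUT for name in row}
    actual = {name for row in panels for name in row}
    missing = expected - actual
    if missing:
        problems.append(f"missing standard panels: {sorted(missing)}")
    # Keep the first dashboard small and fast; deep dives belong on secondary pages.
    if sum(len(row) for row in panels) > 12:
        problems.append("too many panels for a first dashboard")
    return problems
```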
We also made the dashboards easier to narrate over a call:
- Consistent units and colors across services.
- Deploy markers visible by default.
- Named, saved views for common incident ranges ("last 20 minutes", "last deploy").
During reviews, we ask a specific question: "If three people are looking at this at once, can they talk about it without saying 'the blue one under the other blue one'?"
3. Runbooks that assume no shared memory
We rewrote the early sections of runbooks to be remote-friendly:
- Every step that said "check the main dashboard" now has a direct link.
- We added explicit copy-pasteable commands for common checks instead of "ssh in and take a look".
- We added a short table of contents with anchors for: "First 5 minutes", "Rollback", "Degrade mode".
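These properties erode quietly, so it helps to check them mechanically. A hedged sketch of a runbook lint, assuming runbooks are markdown files with the headings listed above (an illustration, not our actual tooling):

```python
import re
import sys

REQUIRED_HEADINGS = ["First 5 minutes", "Rollback", "Degrade mode"]

def lint_runbook(text: str) -> list[str]:
    """Flag runbook habits that stall a remote on-call engineer."""
    problems = []
    for heading in REQUIRED_HEADINGS:
        if not re.search(rf"^#+\s*{re.escape(heading)}", text, re.MULTILINE | re.IGNORECASE):
            problems.append(f"missing heading: {heading}")
    # "Check the dashboard" with no link nearby is exactly the shared-context
    # assumption we are trying to remove.
    for match in re.finditer(r"check the .{0,20}dashboard", text, re.IGNORECASE):
        nearby = text[match.start():match.end() + 200]
        if "http" not in nearby:
            problems.append(f"dashboard mention without a link: {match.group(0)!r}")
    return problems

if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path) as f:
            for problem in lint_runbook(f.read()):
                print(f"{path}: {problem}")
```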
We also added a small "Radio discipline" subsection:
- one person is the incident lead by default (the on-call)
- they say out loud which dashboard and time range they are looking at
- others confirm before proposing actions
This was intentionally lightweight. The goal was not a perfect incident command system, just less thrash in the first ten minutes.
4. Tooling to keep people on the same page
We made two small implementation changes that paid off disproportionately:
- The incident bot posts a link to the first dashboard and the runbook when an alert opens an incident channel.
- The bot also posts a pinned "current hypothesis" message that the lead can edit instead of rewriting context every few minutes.
Neither change required new infrastructure. Both reduced the amount of "wait, what are we looking at?" time.
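For the curious, the bot logic is only a few calls. A sketch assuming a Slack-style chat API via slack_sdk (the helper and message wording below are illustrative, not the real bot):

```python
from slack_sdk import WebClient

client = WebClient(token="...")  # in reality, the bot token comes from config

def open_incident_channel(channel_id: str, dashboard_url: str, runbook_url: str) -> str:
    """Post the starting links, then pin an editable "current hypothesis" message.

    Returns the timestamp of the hypothesis message so the incident lead can
    edit it later instead of rewriting context every few minutes.
    """
    client.chat_postMessage(
        channel=channel_id,
        text=f"First dashboard: {dashboard_url}\nRunbook: {runbook_url}",
    )
    hypothesis = client.chat_postMessage(
        channel=channel_id,
        text="Current hypothesis: (none yet)",
    )
    client.pins_add(channel=channel_id, timestamp=hypothesis["ts"])
    return hypothesis["ts"]

def update_hypothesis(channel_id: str, ts: str, text: str) -> None:
    """The lead edits the pinned message rather than re-posting context."""
    client.chat_update(channel=channel_id, ts=ts, text=f"Current hypothesis: {text}")
```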
Results / Measurements
We did not try to measure "remote productivity." We picked concrete signals tied to incidents:
- Time-to-first-shared-view. Before the change, it was common for the first 5–7 minutes of a page to be spent getting everyone to the same graphs. After the change, we consistently saw people land on the same dashboard within ~60 seconds of the first page.
- Number of parallel dashboard links in chat. In earlier incidents, we would routinely see 5–10 different dashboard URLs in the first ten minutes. Afterward, most incidents had one or two, usually the first dashboard and a specific deep-dive.
- Rollback decision time. In a small sample of incidents where rollback was on the table, the time from page to "we should roll back" shrank from ~18–25 minutes to ~10–15 minutes. The difference came from spending less time reconciling contradictory views.
We also collected qualitative feedback from on-call engineers after a month:
- People reported less "zoom fatigue" from incident calls because the tooling carried more of the coordination.
- Newer engineers were more willing to take the primary on-call role when they knew the alert would hand them a starting point.
There were failures too:
- In the first week, a few alerts still pointed to the old dashboards; those incidents reminded us to treat alert links as code, with ownership and review (see the sketch after this list).
- Some services tried to pack too much into the first dashboard, which made it slow over home connections. We had to enforce a performance budget and remove expensive breakdowns.
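The check the first failure calls for can be a small script in CI. A minimal sketch, assuming alert rules live in YAML files with an annotations block (the field names are illustrative, not our real rule format):

```python
import sys
import yaml  # PyYAML; assumes alert rules are checked in as plain YAML

REQUIRED_ANNOTATIONS = {"dashboard", "runbook"}

def check_rules(path: str) -> list[str]:
    """Fail review when a paging alert does not carry its starting links."""
    problems = []
    with open(path) as f:
        doc = yaml.safe_load(f) or {}
    for rule in doc.get("rules", []):
        if rule.get("severity") != "page":
            continue
        missing = REQUIRED_ANNOTATIONS - set(rule.get("annotations", {}))
        if missing:
            problems.append(f"{path}: {rule.get('alert', '<unnamed>')} missing {sorted(missing)}")
    return problems

if __name__ == "__main__":
    failures = [p for path in sys.argv[1:] for p in check_rules(path)]
    if failures:
        print("\n".join(failures))
        sys.exit(1)
```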
Takeaways
- Remote on-call is not just "the same process over video." You have to move coordination into the tools.
- Every paged alert should point to a single starting dashboard and a specific runbook section.
- First dashboards should be small, fast, and narratable; everything else can live on deeper pages.
- Runbooks should assume zero shared memory: explicit links, explicit commands, explicit roles.
- Small automation (like posting the dashboard and runbook links in chat) can remove surprising amounts of friction.