DESIGN · 2021-09-16 · BY MARA SABOGAL

Designing support tooling to shorten incidents

How small UX decisions in internal support tools reduced incident time-to-understand and support handoffs.

design · support · incidents · tooling · ux

Internal support tools are often designed around the happy path: look up a user, see their latest activity, answer a question.

During an incident, that same tool becomes a primary diagnostic surface.

If it’s confusing or slow, support teams send more questions to engineering. Engineers spend more time digging for context instead of fixing the problem.

We treated our support tool as a design problem with operational goals:

  • shorten time-to-understand what the user is seeing
  • make it easy to tell whether an issue is local, account-specific, or systemic
  • reduce the number of back-and-forth messages during an incident

Constraints

  • The tool was already widely used; we could not pause work to rebuild it from scratch.
  • Data lived in multiple backends (primary DB, logs, third-party providers); the tool mediated between them.
  • Support staff worked across time zones and shifts; we could not assume verbal handoffs.
  • We wanted changes that made sense both on a quiet Tuesday and during a major incident.

We also had to respect access boundaries:

  • not everyone should see sensitive fields
  • some actions (like refunds or force-completing flows) required stricter controls

What we changed

We focused on a few surfaces that repeatedly showed up in incident reviews.

1. A single “current status” panel

Previously, support reps had to scan multiple sections to answer a basic question: "What does this user think is happening right now?"

We created a top-level "current status" panel that pulled together:

  • most recent user-facing state (e.g., "payment pending", "verification required")
  • recent errors surfaced to the user
  • the last few key actions the user took

We deliberately avoided raw internal codes here. Instead, we:

  • reused the same phrases we show in the product
  • added a small toggle to reveal internal codes when needed

This made it much faster to map a screenshot or description from a user to what the system thought was happening.
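To make the shape of this concrete, here is a minimal sketch of what the panel's payload and the internal-code toggle could look like. The field names and types are illustrative assumptions, not our actual schema:

```typescript
// Illustrative shape for the "current status" panel payload.
// Field and type names are assumptions, not the real schema.
interface CurrentStatus {
  // Same phrase the product shows the user, e.g. "payment pending"
  userFacingState: string;
  // Internal code, hidden behind the toggle by default
  internalCode?: string;
  recentErrors: Array<{ shownToUser: string; internalCode?: string; at: string }>;
  lastActions: Array<{ label: string; at: string }>;
}

// The toggle only changes what is rendered, not what is fetched,
// so reps can reveal internal codes without another round trip.
function renderState(status: CurrentStatus, showInternalCodes: boolean): string {
  return showInternalCodes && status.internalCode
    ? `${status.userFacingState} (${status.internalCode})`
    : status.userFacingState;
}
```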

2. Timeline designed for incidents

We already logged events; the problem was how they were displayed.

We redesigned the per-user timeline with incidents in mind:

  • events grouped by flow (sign-in, checkout, settings change)
  • clear markers for errors and retries
  • a compact view of relevant metadata (device type, IP region) without overwhelming detail

We added filters for:

  • "show only errors and warnings"
  • "show only events from the last N minutes"

During an incident, this let support quickly answer:

  • Is this user experiencing a new issue or something ongoing?
  • Are others seeing similar errors right now?
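The two filters above are deliberately simple. As a rough sketch (the event shape and flow names are assumptions), they amount to a couple of predicates over the same event list:

```typescript
// Hypothetical timeline event shape and the two incident filters.
interface TimelineEvent {
  flow: "sign-in" | "checkout" | "settings-change";
  level: "info" | "warning" | "error";
  at: Date;
  message: string;
}

// "Show only errors and warnings"
const onlyProblems = (events: TimelineEvent[]) =>
  events.filter((e) => e.level !== "info");

// "Show only events from the last N minutes"
const lastMinutes = (events: TimelineEvent[], minutes: number) => {
  const cutoff = Date.now() - minutes * 60_000;
  return events.filter((e) => e.at.getTime() >= cutoff);
};
```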

3. Built-in runbook links

Support tools often rely on tribal knowledge: people know which internal doc to open when they see a specific error.

We made the tool do that mapping:

  • each class of error or state can link to a short runbook entry
  • the link shows up next to the error, not buried in a separate tab
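Conceptually, the mapping is just a lookup from error class to a short runbook entry. A minimal sketch, with invented error classes and URLs standing in for the real ones:

```typescript
// Hypothetical mapping from error class to a short runbook entry,
// rendered next to the error rather than in a separate tab.
// Error class names and URLs are invented for illustration.
const runbookLinks: Record<string, { title: string; url: string }> = {
  "payment.pending_too_long": {
    title: "Payment stuck in pending",
    url: "/runbooks/payments/pending",
  },
  "verification.document_rejected": {
    title: "Verification document rejected",
    url: "/runbooks/verification/rejected-document",
  },
};

const runbookFor = (errorClass: string) => runbookLinks[errorClass];
```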

Runbooks for support focus on:

  • what to tell the user
  • what, if anything, to ask engineering
  • when not to escalate

This reduced the number of "just in case" escalations during noisy incidents.

4. Triage views for support leads

During a major incident, a support lead needs a different view than an individual rep.

We added a small "incident triage" dashboard inside the tool:

  • count of active tickets tagged with likely-incident labels
  • recent spikes in specific error codes
  • quick filters to see affected accounts by region or plan tier

This dashboard doesn’t replace engineering observability; it complements it with the user-facing side of the story.
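Under the hood, the triage view is mostly counting. A sketch of the aggregation, assuming a simple ticket shape and a "likely-incident" label (both hypothetical):

```typescript
// Sketch of the aggregation behind the triage view: counting open,
// likely-incident tickets by error code and plan tier.
interface Ticket {
  open: boolean;
  labels: string[];
  errorCode?: string;
  planTier: "free" | "pro" | "enterprise";
}

function triageCounts(tickets: Ticket[]) {
  const byErrorCode = new Map<string, number>();
  const byPlanTier = new Map<string, number>();
  for (const t of tickets) {
    if (!t.open || !t.labels.includes("likely-incident")) continue;
    if (t.errorCode) {
      byErrorCode.set(t.errorCode, (byErrorCode.get(t.errorCode) ?? 0) + 1);
    }
    byPlanTier.set(t.planTier, (byPlanTier.get(t.planTier) ?? 0) + 1);
  }
  return { byErrorCode, byPlanTier };
}
```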

5. Making slowness visible

We instrumented the tool itself:

  • page load times for key views
  • time to first meaningful paint for the status panel
  • error rates for the APIs it calls

If the tool slows down or fails during an incident, that’s an incident for us too.

We added simple SLOs:

  • support dashboard P95 load time under X seconds
  • search success rate above Y%

Breaching those SLOs triggers work just like user-facing regressions.
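A minimal sketch of what those checks look like, with placeholder thresholds standing in for the X and Y above:

```typescript
// Placeholder thresholds, not our real numbers.
const SLOS = {
  dashboardP95LoadSeconds: 3, // stands in for "X seconds"
  searchSuccessRate: 0.99,    // stands in for "Y%"
};

function p95(samples: number[]): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  return sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * 0.95))];
}

function sloBreaches(loadTimesSec: number[], searchOk: number, searchTotal: number): string[] {
  const breaches: string[] = [];
  if (p95(loadTimesSec) > SLOS.dashboardP95LoadSeconds) breaches.push("dashboard load time");
  if (searchTotal > 0 && searchOk / searchTotal < SLOS.searchSuccessRate) breaches.push("search success rate");
  return breaches;
}
```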

Results and measurements

We measured a few concrete outcomes over the months after these changes:

  • Time-to-understand (roughly, how long it takes a support rep to summarize a user’s situation) dropped. In shadowed sessions, reps went from spending a minute or more hunting through multiple panels to ~20–30 seconds using the "current status" panel and timeline filters.
  • Incident handoff clarity improved. Engineering incident channels received messages with a single link to a user in the support tool and a short summary, instead of screenshots and partial context.
  • Escalation volume during some classes of incidents went down. In one rollout where a configuration bug affected a narrow cohort, support was able to identify the scope from the triage view and answer many users without engineering involvement.

We also caught a few tool-specific issues earlier because we were now measuring the tool's own performance.

We added one more small but important detail: the tool shows a subtle warning when its own error rate is elevated, so support doesn’t mistake a broken tool for a broken product.
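The check behind that warning is deliberately crude. Roughly (the threshold is an assumption):

```typescript
// Compare the tool's own recent API error rate against a threshold
// and show a subtle banner when it is elevated.
function toolHealthWarning(errors: number, requests: number): string | null {
  if (requests === 0) return null;
  const errorRate = errors / requests;
  return errorRate > 0.05 // assumed threshold, not our real one
    ? "Some data in this tool may be missing or stale — our own APIs are erroring."
    : null;
}
```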

Takeaways

  • Internal support tools are part of the incident surface. Designing them well can shorten incidents without touching backend code.
  • A single, well-designed "current status" panel reduces cognitive load for both support and engineering.
  • Timelines and runbook links inside the tool turn scattered logs into a usable narrative.
  • Measuring and setting SLOs for the support tool itself keeps it from quietly becoming the bottleneck during incidents.
