Incident report: Noisy LLM assistant during an outage
An internal LLM-based assistant generated confusing suggestions during an outage. We describe how it distracted responders and what we changed.
Summary
We introduced an internal assistant powered by an LLM to help during incidents.
The assistant could:
- summarize recent logs
- draft candidate timelines
- suggest likely runbook sections
During a major outage on August 16, 2024, the assistant became more of a distraction than a help.
It generated:
- confident but incorrect summaries of the situation
- suggestions pointing to the wrong services
- long blocks of text that hid the few useful observations
Responders spent time arguing with or ignoring the tool instead of turning it off.
We treated this as an incident involving tooling on the incident path.
Impact
- Duration: about 35 minutes of degraded response quality (14:05 to 14:40, while the assistant was active in the incident channel) during an already serious outage.
- User impact:
  - primarily from the underlying outage; the assistant did not directly affect users
  - indirectly, some mitigation decisions were slower because attention was split
- Internal impact:
  - confusion in the incident channel
  - erosion of trust in automated assistance
No additional user-facing downtime was attributed solely to the assistant, but it clearly did not improve the situation.
Timeline
All times local.
- 14:02 — Primary outage begins; core service experiencing elevated error rates and latency.
- 14:05 — Incident channel created; automated tooling, including the assistant, joins and posts initial summaries.
- 14:09 — Assistant posts a summary incorrectly attributing the issue to a specific downstream dependency based on a small sample of logs.
- 14:12 — Responders investigate the suggested dependency and find no matching issues.
- 14:18 — Assistant posts an updated timeline that mixes confirmed events with hypotheses stated as facts.
- 14:24 — A responder calls out that the assistant’s messages are confusing; others start muting its posts locally.
- 14:31 — The assistant suggests an outdated runbook that does not apply to the current architecture.
- 14:36 — Incident lead decides to disable the assistant for the remainder of the outage.
- 14:40 — Assistant is removed from the channel; focus returns to dashboards and runbooks.
- 15:12 — Underlying outage is mitigated through a combination of rollback and capacity changes.
- 15:45 — Incident closed; a separate review is scheduled for the assistant’s behavior.
Root cause
The assistant was configured and used in ways that were not aligned with incident needs.
Key issues:
- Overly broad input. The assistant consumed entire chat transcripts and large volumes of raw logs, mixing:
  - tentative hypotheses
  - confirmed facts
  - unrelated noise
- Prompts that rewarded confidence. The prompts emphasized "clear" summaries and "likely" root causes, with little attention to uncertainty or source attribution.
- Lack of guardrails. There was no clear way to tell:
  - which suggestions were based on confirmed signals
  - which runbooks were up to date
- Role confusion. Responders treated the assistant as something between a teammate and a dashboard, with no clear contract for either role.
The LLM did what it was asked: produce fluent text.
We had not constrained the problem well enough.
What we changed
1. Narrow the assistant’s scope
We restricted what the assistant does during incidents:
- focus on formatting and organizing human-provided data (e.g., structuring a list of known events)
- avoid generating new hypotheses about root cause
- avoid suggesting mitigations directly
The assistant now operates on tagged, curated inputs rather than raw logs and entire chats.
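As a rough sketch of what "tagged, curated inputs" means in practice, the Python below formats a list of human-tagged events into a timeline without adding anything of its own. The TaggedEvent fields and the format_timeline helper are illustrative assumptions, not our actual schema.

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative only: field names and statuses are assumptions,
# not the schema the assistant actually uses.
@dataclass
class TaggedEvent:
    at: datetime   # when the event was observed
    source: str    # e.g. a dashboard, a log query, or a responder note
    status: str    # "confirmed" or "hypothesis", set by a human
    text: str      # short human-written description

def format_timeline(events: list[TaggedEvent]) -> str:
    """Organize human-provided events into a timeline block.

    The assistant only formats what responders have already tagged;
    it does not add events or infer causes.
    """
    lines = []
    for e in sorted(events, key=lambda e: e.at):
        marker = "?" if e.status == "hypothesis" else " "
        lines.append(f"{e.at:%H:%M} {marker} [{e.source}] {e.text}")
    return "\n".join(lines)
```

The point is that every line the assistant produces maps back to an event a responder already entered and tagged.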
2. Make uncertainty and provenance explicit
We changed prompts and presentation so that:
- suggestions are labeled as such
- summaries include references to the specific graphs or log snippets they’re based on
- any inferred connection between events is marked clearly as a hypothesis
This makes it easier for humans to assess whether to trust or discard a suggestion.
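A minimal sketch of how a suggestion can carry explicit uncertainty and provenance; the Suggestion structure and render function are hypothetical, but they capture the rule we adopted: a claim with no sources does not get posted.

```python
from dataclasses import dataclass, field

# Hypothetical structure; the real message schema differs in detail.
@dataclass
class Suggestion:
    text: str
    sources: list[str] = field(default_factory=list)  # graph URLs, log query links
    is_hypothesis: bool = True  # default to the weaker claim

def render(suggestion: Suggestion) -> str:
    """Label the suggestion and list its sources so responders can check it."""
    if not suggestion.sources:
        # Without provenance the assistant stays quiet rather than sounding confident.
        return ""
    label = "HYPOTHESIS" if suggestion.is_hypothesis else "OBSERVED"
    refs = ", ".join(suggestion.sources)
    return f"[{label}] {suggestion.text} (sources: {refs})"
```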
3. Add a "mute" and "off" switch with clear ownership
We added simple controls, sketched below:
- incident leads can disable assistant messages for a given channel with a command
- the assistant respects a per-incident configuration (enabled, summary-only, or disabled)
We documented that:
- it is always acceptable to turn the assistant off during an incident
- this is not considered a failure, just a signal for tuning
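A simplified sketch of those controls, assuming a chat-command interface; the /assistant command, the AssistantMode values, and the in-memory store are illustrative rather than our exact implementation.

```python
from enum import Enum

class AssistantMode(Enum):
    ENABLED = "enabled"
    SUMMARY_ONLY = "summary-only"
    DISABLED = "disabled"

# Hypothetical in-memory store keyed by incident channel.
_modes: dict[str, AssistantMode] = {}

def handle_command(channel: str, command: str) -> str:
    """Let the incident lead change the assistant's mode with a chat command."""
    try:
        mode = AssistantMode(command.removeprefix("/assistant ").strip())
    except ValueError:
        return "Usage: /assistant enabled|summary-only|disabled"
    _modes[channel] = mode
    return f"Assistant is now {mode.value} for {channel}."

def may_post(channel: str, is_summary: bool) -> bool:
    """Check the per-incident configuration before posting anything."""
    mode = _modes.get(channel, AssistantMode.ENABLED)
    if mode is AssistantMode.DISABLED:
        return False
    if mode is AssistantMode.SUMMARY_ONLY:
        return is_summary
    return True
```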
4. Align runbook suggestions with versioned docs
We integrated the assistant with versioned runbooks stored in the repo.
It can:
- suggest runbooks that match the tagged service and incident type
- avoid suggesting runbooks that are marked as deprecated
This reduced the chance of it pointing at outdated procedures.
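A sketch of the filtering logic, assuming runbook metadata that carries service tags, incident types, and a deprecated flag; the Runbook fields here are illustrative.

```python
from dataclasses import dataclass

# Illustrative runbook metadata; the real metadata lives alongside the
# runbooks in the repo and has more fields.
@dataclass
class Runbook:
    path: str
    services: set[str]
    incident_types: set[str]
    deprecated: bool = False

def suggest_runbooks(runbooks: list[Runbook], service: str, incident_type: str) -> list[str]:
    """Return paths of current runbooks matching the tagged service and incident type."""
    return [
        rb.path
        for rb in runbooks
        if not rb.deprecated
        and service in rb.services
        and incident_type in rb.incident_types
    ]
```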
5. Treat the assistant as a product
We added the assistant to our internal product list with:
- an owner
- SLOs for responsiveness and accuracy (measured qualitatively through incident reviews)
- a feedback channel for responders to report confusing behavior
We now review its performance periodically like any other tool.
We also track a small set of per-incident assistant metrics (for incidents where it was enabled) so we can see whether it is trending toward "net helpful" or "net distracting" over time.
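As an illustration of the kind of tally we keep, the sketch below scores only incidents where the assistant was enabled; the record fields and the scoring are simplified assumptions, and in practice the judgment comes out of incident reviews rather than raw counts.

```python
from dataclasses import dataclass

# Hypothetical per-incident record, filled in during the incident review.
@dataclass
class AssistantIncidentRecord:
    incident_id: str
    enabled: bool
    helpful_posts: int       # posts responders marked as useful
    distracting_posts: int   # posts responders muted, corrected, or ignored

def net_helpfulness(records: list[AssistantIncidentRecord]) -> float:
    """Rough trend metric: above zero is net helpful, below zero is net distracting."""
    scored = [r for r in records if r.enabled]
    if not scored:
        return 0.0
    return sum(r.helpful_posts - r.distracting_posts for r in scored) / len(scored)
```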
Follow-ups
Completed
- Narrowed the assistant’s role and inputs.
- Added explicit controls for disabling or limiting it per incident.
- Integrated with versioned runbook metadata.
Planned / in progress
- Add lightweight evaluation tooling to test prompts against past incidents.
- Provide training and guidance for incident leads on when and how to use the assistant.
Takeaways
- LLM-based assistants can help, but only with a tightly scoped role and clear guardrails.
- During incidents, confident-sounding but wrong suggestions are worse than silence.
- Humans need obvious ways to turn tools off when they’re not helping.
- Treating incident tooling as a product—with owners and reviews—keeps experiments from becoming hidden risks.