Incident report: Noisy LLM assistant during an outage
An internal LLM-based assistant generated confusing suggestions during an outage. We describe how it distracted responders and what we changed.
Summary
We introduced an internal assistant powered by an LLM to help during incidents.
The assistant could:
- summarize recent logs
- draft candidate timelines
- suggest likely runbook sections
During a major outage on August 16, 2024, the assistant became more of a distraction than a help.
It generated:
- confident but incorrect summaries of the situation
- suggestions pointing to the wrong services
- long blocks of text that hid the few useful observations
Responders spent time arguing with or ignoring the tool instead of turning it off.
We treated this as an incident involving tooling on the incident path.
Impact
- Duration: about 35 minutes of degraded response quality (14:05 to 14:40, while the assistant was active in the incident channel) during an already serious outage.
- User impact:
  - primarily from the underlying outage; the assistant did not directly affect users
  - indirectly, some mitigation decisions were slower because attention was split
- Internal impact:
  - confusion in the incident channel
  - erosion of trust in automated assistance
No additional user-facing downtime was attributed solely to the assistant, but it clearly did not improve the situation.
Timeline
All times local.
- 14:02 — Primary outage begins; core service experiencing elevated error rates and latency.
- 14:05 — Incident channel created; automated tooling, including the assistant, joins and posts initial summaries.
- 14:09 — Assistant posts a summary incorrectly attributing the issue to a specific downstream dependency based on a small sample of logs.
- 14:12 — Responders investigate the suggested dependency and find no matching issues.
- 14:18 — Assistant posts an updated timeline that mixes confirmed events with hypotheses stated as facts.
- 14:24 — A responder calls out that the assistant’s messages are confusing; others start muting its posts locally.
- 14:31 — The assistant suggests an outdated runbook that does not apply to the current architecture.
- 14:36 — Incident lead decides to disable the assistant for the remainder of the outage.
- 14:40 — Assistant is removed from the channel; focus returns to dashboards and runbooks.
- 15:12 — Underlying outage is mitigated through a combination of rollback and capacity changes.
- 15:45 — Incident closed; a separate review is scheduled for the assistant’s behavior.
Root cause
The assistant was configured and used in ways that were not aligned with incident needs.
Key issues:
- Overly broad input. The assistant consumed entire chat transcripts and large volumes of raw logs, mixing:
  - tentative hypotheses
  - confirmed facts
  - unrelated noise
- Prompts that rewarded confidence. The prompts emphasized "clear" summaries and "likely" root causes, with little attention to uncertainty or source attribution.
- Lack of guardrails. There was no clear way to tell:
  - which suggestions were based on confirmed signals
  - which runbooks were up to date
- Role confusion. Responders treated the assistant as something between a teammate and a dashboard, with no clear contract for either role.
The LLM did what it was asked: produce fluent text.
We had not constrained the problem well enough.
What we changed
1. Narrow the assistant’s scope
We restricted what the assistant does during incidents:
- focus on formatting and organizing human-provided data (e.g., structuring a list of known events)
- avoid generating new hypotheses about root cause
- avoid suggesting mitigations directly
The assistant now operates on tagged, curated inputs rather than raw logs and entire chats.
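As a rough sketch of what "tagged, curated inputs" means in practice, the Python below formats a list of human-tagged events into a timeline without adding anything of its own. The TaggedEvent fields and the format_timeline helper are illustrative assumptions, not our actual schema.

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative only: field names and statuses are assumptions,
# not the schema the assistant actually uses.
@dataclass
class TaggedEvent:
    at: datetime   # when the event was observed
    source: str    # e.g. a dashboard, a log query, or a responder note
    status: str    # "confirmed" or "hypothesis", set by a human
    text: str      # short human-written description

def format_timeline(events: list[TaggedEvent]) -> str:
    """Organize human-provided events into a timeline block.

    The assistant only formats what responders have already tagged;
    it does not add events or infer causes.
    """
    lines = []
    for e in sorted(events, key=lambda e: e.at):
        marker = "?" if e.status == "hypothesis" else " "
        lines.append(f"{e.at:%H:%M} {marker} [{e.source}] {e.text}")
    return "\n".join(lines)
```

The point is that every line the assistant produces maps back to an event a responder already entered and tagged.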
2. Make uncertainty and provenance explicit
We changed prompts and presentation so that:
- suggestions are labeled as such
- summaries include references to the specific graphs or log snippets they’re based on
- any inferred connection between events is marked clearly as a hypothesis
This makes it easier for humans to assess whether to trust or discard a suggestion.
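A minimal sketch of how a suggestion can carry explicit uncertainty and provenance; the Suggestion structure and render function are hypothetical, but they capture the rule we adopted: a claim with no sources does not get posted.

```python
from dataclasses import dataclass, field

# Hypothetical structure; the real message schema differs in detail.
@dataclass
class Suggestion:
    text: str
    sources: list[str] = field(default_factory=list)  # graph URLs, log query links
    is_hypothesis: bool = True  # default to the weaker claim

def render(suggestion: Suggestion) -> str:
    """Label the suggestion and list its sources so responders can check it."""
    if not suggestion.sources:
        # Without provenance the assistant stays quiet rather than sounding confident.
        return ""
    label = "HYPOTHESIS" if suggestion.is_hypothesis else "OBSERVED"
    refs = ", ".join(suggestion.sources)
    return f"[{label}] {suggestion.text} (sources: {refs})"
```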
3. Add a "mute" and "off" switch with clear ownership
We added simple controls, sketched below:
- incident leads can disable assistant messages for a given channel with a command
- the assistant respects a per-incident configuration (enabled, summary-only, or disabled)
We documented that:
- it is always acceptable to turn the assistant off during an incident
- this is not considered a failure, just a signal for tuning
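A simplified sketch of those controls, assuming a chat-command interface; the /assistant command, the AssistantMode values, and the in-memory store are illustrative rather than our exact implementation.

```python
from enum import Enum

class AssistantMode(Enum):
    ENABLED = "enabled"
    SUMMARY_ONLY = "summary-only"
    DISABLED = "disabled"

# Hypothetical in-memory store keyed by incident channel.
_modes: dict[str, AssistantMode] = {}

def handle_command(channel: str, command: str) -> str:
    """Let the incident lead change the assistant's mode with a chat command."""
    try:
        mode = AssistantMode(command.removeprefix("/assistant ").strip())
    except ValueError:
        return "Usage: /assistant enabled|summary-only|disabled"
    _modes[channel] = mode
    return f"Assistant is now {mode.value} for {channel}."

def may_post(channel: str, is_summary: bool) -> bool:
    """Check the per-incident configuration before posting anything."""
    mode = _modes.get(channel, AssistantMode.ENABLED)
    if mode is AssistantMode.DISABLED:
        return False
    if mode is AssistantMode.SUMMARY_ONLY:
        return is_summary
    return True
```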
4. Align runbook suggestions with versioned docs
We integrated the assistant with versioned runbooks stored in the repo.
It can:
- suggest runbooks that match the tagged service and incident type
- avoid suggesting runbooks that are marked as deprecated
This reduced the chance of it pointing at outdated procedures.
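A sketch of the filtering logic, assuming runbook metadata that carries service tags, incident types, and a deprecated flag; the Runbook fields here are illustrative.

```python
from dataclasses import dataclass

# Illustrative runbook metadata; the real metadata lives alongside the
# runbooks in the repo and has more fields.
@dataclass
class Runbook:
    path: str
    services: set[str]
    incident_types: set[str]
    deprecated: bool = False

def suggest_runbooks(runbooks: list[Runbook], service: str, incident_type: str) -> list[str]:
    """Return paths of current runbooks matching the tagged service and incident type."""
    return [
        rb.path
        for rb in runbooks
        if not rb.deprecated
        and service in rb.services
        and incident_type in rb.incident_types
    ]
```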
5. Treat the assistant as a product
We added the assistant to our internal product list with:
- an owner
- SLOs for responsiveness and accuracy (measured qualitatively through incident reviews)
- a feedback channel for responders to report confusing behavior
We now review its performance periodically like any other tool.
We also track a small set of per-incident assistant metrics (for incidents where it was enabled) so we can see whether it is trending toward "net helpful" or "net distracting" over time.
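As an illustration of the kind of tally we keep, the sketch below scores only incidents where the assistant was enabled; the record fields and the scoring are simplified assumptions, and in practice the judgment comes out of incident reviews rather than raw counts.

```python
from dataclasses import dataclass

# Hypothetical per-incident record, filled in during the incident review.
@dataclass
class AssistantIncidentRecord:
    incident_id: str
    enabled: bool
    helpful_posts: int       # posts responders marked as useful
    distracting_posts: int   # posts responders muted, corrected, or ignored

def net_helpfulness(records: list[AssistantIncidentRecord]) -> float:
    """Rough trend metric: above zero is net helpful, below zero is net distracting."""
    scored = [r for r in records if r.enabled]
    if not scored:
        return 0.0
    return sum(r.helpful_posts - r.distracting_posts for r in scored) / len(scored)
```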
Follow-ups
Completed
- Narrowed the assistant’s role and inputs.
- Added explicit controls for disabling or limiting it per incident.
- Integrated with versioned runbook metadata.
Planned / in progress
- Add lightweight evaluation tooling to test prompts against past incidents.
- Provide training and guidance for incident leads on when and how to use the assistant.
Takeaways
- LLM-based assistants can help, but only with a tightly scoped role and clear guardrails.
- During incidents, confident-sounding but wrong suggestions are worse than silence.
- Humans need obvious ways to turn tools off when they’re not helping.
- Treating incident tooling as a product—with owners and reviews—keeps experiments from becoming hidden risks.