Using LLMs to draft incident timelines safely
How we introduced LLM-based drafting for incident timelines without letting a tool become the source of truth.
When large language model tools became widely available, one of the first obvious uses was summarizing incidents.
We were already writing detailed incident docs that collected:
- chat transcripts
- metric snapshots
- runbook links
The idea of pressing a button and getting a clean, readable timeline was attractive.
We also had concerns:
- Would a tool hallucinate steps that never happened?
- Would people skip writing good notes because "the bot will handle it"?
- Would the draft become the record, even if it missed important nuance?
We decided to experiment, but only if we could treat an LLM as a drafting assistant—not as the system of record.
Constraints
- We wanted incident docs to remain the canonical record.
- We did not want to send sensitive or identifying data to third-party tools.
- We needed drafts to be obviously drafts until a human accepted them.
- We could not assume everyone was comfortable editing model-generated text.
What we changed
We started small and focused on one narrow task: drafting timelines.
1. Structure the input before asking for help
Instead of copying an entire chat log into a tool, we:
- tagged key events in the incident doc ("page fired", "rollback started", "failover complete")
- linked to or embedded relevant graphs with brief captions ("P95 latency spiked here")
- ensured the doc contained timestamps alongside actions
This gave the drafting tool a structured, human-curated view of the incident.
We wrote a small internal tool, sketched below, that:
- pulled the tagged events and their context
- passed them to an LLM with a constrained prompt
- produced a first-draft timeline section
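To make that concrete, here is a minimal sketch of the event-collection step in Python. The doc format, the TaggedEvent type, and the function names are illustrative assumptions rather than our actual schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TaggedEvent:
    """One human-tagged event pulled from the incident doc."""
    timestamp: datetime
    tag: str      # e.g. "page fired", "rollback started"
    context: str  # brief caption or surrounding note


def collect_events(doc: dict) -> list[TaggedEvent]:
    """Pull tagged events out of a parsed incident doc and sort them by time.

    The doc format here is hypothetical: a "tagged_events" list of
    {"time", "tag", "note"} entries.
    """
    events = [
        TaggedEvent(
            timestamp=datetime.fromisoformat(entry["time"]),
            tag=entry["tag"],
            context=entry.get("note", ""),
        )
        for entry in doc.get("tagged_events", [])
    ]
    return sorted(events, key=lambda event: event.timestamp)


def format_events_for_prompt(events: list[TaggedEvent]) -> str:
    """Flatten events into plain, timestamped lines the model can only rephrase."""
    return "\n".join(
        f"{event.timestamp.isoformat()} | {event.tag} | {event.context}"
        for event in events
    )
```

Flattening to one line per event keeps the model's job limited to rephrasing and ordering what a human already curated.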
2. Constrain the prompt
The prompt asked for a specific format:
- bullet list of time-stamped events
- no new facts; only rephrase what was provided
- mark any uncertainty explicitly
We treated any deviation as a bug in our usage, not as "creative license."
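For illustration, the constrained prompt can be expressed as a simple template like the one below. The wording is an approximation of the constraints we asked for, not the literal prompt, and build_prompt is a hypothetical helper.

```python
# Approximate template; the wording we actually used evolved over time.
TIMELINE_PROMPT = """You are drafting the timeline section of an incident document.

Rules:
- Produce a bullet list of time-stamped events, one per line, in chronological order.
- Use only the events provided below. Do not add facts, causes, or steps that are not listed.
- If the timing or meaning of an event is unclear from the input, say so explicitly.
- Prefer hedged wording ("appears to", "suspected") over definitive claims.

Events:
{events}
"""


def build_prompt(events_block: str) -> str:
    """Fill the template with the flattened, human-curated event lines."""
    return TIMELINE_PROMPT.format(events=events_block)
```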
3. Keep humans firmly in the loop
The draft timeline never replaced the incident doc automatically.
Instead:
- the tool posted a draft section into the doc with clear markers ("Draft generated by tooling—review before publishing")
- the incident lead or reviewer edited, reordered, or deleted entries
- only after human review did we remove the draft marker
We also asked leads to note in reviews when the draft missed or distorted something important.
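The draft insertion itself can be as simple as the sketch below; the marker strings and the wrap_draft name are stand-ins for whatever the doc tooling actually uses.

```python
# Hypothetical marker text; the point is that the draft is visibly a draft.
DRAFT_HEADER = "DRAFT generated by tooling - review before publishing"
DRAFT_FOOTER = "END OF GENERATED DRAFT"


def wrap_draft(timeline_text: str) -> str:
    """Wrap a model-generated timeline in markers so nobody mistakes it for the record.

    A reviewer removes the markers only after editing and accepting the content.
    """
    return f"{DRAFT_HEADER}\n\n{timeline_text}\n\n{DRAFT_FOOTER}"
```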
4. Privacy and data hygiene
We configured the tool to:
- strip or mask user identifiers and other sensitive fields before sending text to the LLM
- avoid including raw payloads or secrets
We logged what we sent for auditing and improvement.
This forced us to tidy up some of our incident docs, which had occasionally included more raw data than was necessary.
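A minimal sketch of the masking step, assuming regex-based redaction; the patterns and names below are illustrative, and a real deny-list would be broader and reviewed.

```python
import re

# Illustrative patterns only; a production redaction list would be broader and reviewed.
REDACTION_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),          # email addresses
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip-address>"),  # IPv4 addresses
    (re.compile(r"(?i)\b(?:api[_-]?key|token|secret)\S*"), "<secret>"),
]


def redact(text: str) -> str:
    """Mask identifiers and obvious secrets before text leaves our systems."""
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


def prepare_for_llm(text: str, audit_log: list[str]) -> str:
    """Redact, then record exactly what is about to be sent, for later auditing."""
    cleaned = redact(text)
    audit_log.append(cleaned)
    return cleaned
```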
Results / Measurements
We looked at a few indicators over several months:
- Time to first coherent timeline. Before the tool, some incidents went days before anyone wrote a proper narrative. With drafting, leads often had a rough timeline within an hour of closure.
- Editing effort. In reviews, leads reported that editing a draft was noticeably faster than composing from scratch—as long as the input tags were good.
- Accuracy issues. We tracked cases where the draft introduced errors. The most common category was over-confident wording (e.g., "root cause" where the doc only had a hypothesis). We adjusted the prompt to prefer hedging language.
We did not judge the tool a success or failure based on how clever its output was.
We treated it as successful when:
- incident docs were more consistently complete and readable
- people spent less time retyping chat logs into paragraphs
Guardrails we added
We wrote down a few guardrails:
- The incident doc is the record; the tool only drafts.
- If an LLM-generated timeline disagrees with metrics or logs, the human wins.
- It’s acceptable to discard a draft entirely if it confuses more than it helps.
We also decided not to use LLMs for:
- generating action items (too easy to invent work that nobody owns)
- guessing at root causes (too easy to make hypotheses sound like facts)
Those remain human responsibilities.
Takeaways
- LLMs are good at turning structured, human-tagged events into readable timelines.
- They are not good at being the source of truth for what happened.
- Small internal tools with clear prompts and review steps worked better than pasting entire chats into generic interfaces.
- Treating the model as a drafting assistant, not an author, kept our incident process grounded.