Using LLMs to draft incident timelines safely
How we introduced LLM-based drafting for incident timelines without letting a tool become the source of truth.
When large language model tools became widely available, one of the first obvious uses was summarizing incidents.
We were already writing detailed incident docs that collected:
- chat transcripts
- metric snapshots
- runbook links
The idea of pressing a button and getting a clean, readable timeline was attractive.
We also had concerns:
- Would a tool hallucinate steps that never happened?
- Would people skip writing good notes because "the bot will handle it"?
- Would the draft become the record, even if it missed important nuance?
We decided to experiment, but only if we could treat an LLM as a drafting assistant—not as the system of record.
Constraints
- We wanted incident docs to remain the canonical record.
- We did not want to send sensitive or identifying data to third-party tools.
- We needed drafts to be obviously drafts until a human accepted them.
- We could not assume everyone was comfortable editing model-generated text.
What we changed
We started small and focused on one narrow task: drafting timelines.
1. Structure the input before asking for help
Instead of copying an entire chat log into a tool, we:
- tagged key events in the incident doc ("page fired", "rollback started", "failover complete")
- linked to or embedded relevant graphs with brief captions ("P95 latency spiked here")
- ensured the doc contained timestamps alongside actions
This gave the drafting tool a structured, human-curated view of the incident.
We wrote a small internal tool, sketched below, that:
- pulled the tagged events and their context
- passed them to an LLM with a constrained prompt
- produced a first-draft timeline section
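To make that concrete, here is a minimal sketch of the event-collection step in Python. The doc format, the TaggedEvent type, and the function names are illustrative assumptions rather than our actual schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TaggedEvent:
    """One human-tagged event pulled from the incident doc."""
    timestamp: datetime
    tag: str      # e.g. "page fired", "rollback started"
    context: str  # brief caption or surrounding note


def collect_events(doc: dict) -> list[TaggedEvent]:
    """Pull tagged events out of a parsed incident doc and sort them by time.

    The doc format here is hypothetical: a "tagged_events" list of
    {"time", "tag", "note"} entries.
    """
    events = [
        TaggedEvent(
            timestamp=datetime.fromisoformat(entry["time"]),
            tag=entry["tag"],
            context=entry.get("note", ""),
        )
        for entry in doc.get("tagged_events", [])
    ]
    return sorted(events, key=lambda event: event.timestamp)


def format_events_for_prompt(events: list[TaggedEvent]) -> str:
    """Flatten events into plain, timestamped lines the model can only rephrase."""
    return "\n".join(
        f"{event.timestamp.isoformat()} | {event.tag} | {event.context}"
        for event in events
    )
```

Flattening to one line per event keeps the model's job limited to rephrasing and ordering what a human already curated.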
2. Constrain the prompt
The prompt asked for a specific format:
- bullet list of time-stamped events
- no new facts; only rephrase what was provided
- mark any uncertainty explicitly
We treated any deviation as a bug in our usage, not as "creative license."
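For illustration, the constrained prompt can be expressed as a simple template like the one below. The wording is an approximation of the constraints we asked for, not the literal prompt, and build_prompt is a hypothetical helper.

```python
# Approximate template; the wording we actually used evolved over time.
TIMELINE_PROMPT = """You are drafting the timeline section of an incident document.

Rules:
- Produce a bullet list of time-stamped events, one per line, in chronological order.
- Use only the events provided below. Do not add facts, causes, or steps that are not listed.
- If the timing or meaning of an event is unclear from the input, say so explicitly.
- Prefer hedged wording ("appears to", "suspected") over definitive claims.

Events:
{events}
"""


def build_prompt(events_block: str) -> str:
    """Fill the template with the flattened, human-curated event lines."""
    return TIMELINE_PROMPT.format(events=events_block)
```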
3. Keep humans firmly in the loop
The draft timeline never replaced the incident doc automatically.
Instead:
- the tool posted a draft section into the doc with clear markers ("Draft generated by tooling—review before publishing")
- the incident lead or reviewer edited, reordered, or deleted entries
- only after human review did we remove the draft marker
We also asked leads to note in reviews when the draft missed or distorted something important.
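The draft insertion itself can be as simple as the sketch below; the marker strings and the wrap_draft name are stand-ins for whatever the doc tooling actually uses.

```python
# Hypothetical marker text; the point is that the draft is visibly a draft.
DRAFT_HEADER = "DRAFT generated by tooling - review before publishing"
DRAFT_FOOTER = "END OF GENERATED DRAFT"


def wrap_draft(timeline_text: str) -> str:
    """Wrap a model-generated timeline in markers so nobody mistakes it for the record.

    A reviewer removes the markers only after editing and accepting the content.
    """
    return f"{DRAFT_HEADER}\n\n{timeline_text}\n\n{DRAFT_FOOTER}"
```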
4. Privacy and data hygiene
We configured the tool to:
- strip or mask user identifiers and other sensitive fields before sending text to the LLM
- avoid including raw payloads or secrets
We logged what we sent for auditing and improvement.
This forced us to tidy up some of our incident docs, which had occasionally included more raw data than was necessary.
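A minimal sketch of the masking step, assuming regex-based redaction; the patterns and names below are illustrative, and a real deny-list would be broader and reviewed.

```python
import re

# Illustrative patterns only; a production redaction list would be broader and reviewed.
REDACTION_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),          # email addresses
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip-address>"),  # IPv4 addresses
    (re.compile(r"(?i)\b(?:api[_-]?key|token|secret)\S*"), "<secret>"),
]


def redact(text: str) -> str:
    """Mask identifiers and obvious secrets before text leaves our systems."""
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


def prepare_for_llm(text: str, audit_log: list[str]) -> str:
    """Redact, then record exactly what is about to be sent, for later auditing."""
    cleaned = redact(text)
    audit_log.append(cleaned)
    return cleaned
```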
Results / Measurements
We looked at a few indicators over several months:
- Time to first coherent timeline. Before the tool, some incidents went days before anyone wrote a proper narrative. With drafting, leads often had a rough timeline within an hour of closure.
- Editing effort. In reviews, leads reported that editing a draft was noticeably faster than composing from scratch—as long as the input tags were good.
- Accuracy issues. We tracked cases where the draft introduced errors. The most common category was over-confident wording (e.g., "root cause" where the doc only had a hypothesis). We adjusted the prompt to prefer hedging language.
We did not judge the tool a success or failure based on how clever its output was.
We treated it as successful when:
- incident docs were more consistently complete and readable
- people spent less time retyping chat logs into paragraphs
Guardrails we added
We wrote down a few guardrails:
- The incident doc is the record; the tool only drafts.
- If an LLM-generated timeline disagrees with metrics or logs, the human wins.
- It’s acceptable to discard a draft entirely if it confuses more than it helps.
We also decided not to use LLMs for:
- generating action items (too easy to invent work that nobody owns)
- guessing at root causes (too easy to make hypotheses sound like facts)
Those remain human responsibilities.
Takeaways
- LLMs are good at turning structured, human-tagged events into readable timelines.
- They are not good at being the source of truth for what happened.
- Small internal tools with clear prompts and review steps worked better than pasting entire chats into generic interfaces.
- Treating the model as a drafting assistant, not an author, kept our incident process grounded.