Using LLM tools to assist incident retrospectives
How we use LLM-based tools to help with retrospectives—clustering themes, drafting sections—while keeping humans in charge of conclusions.
A public-facing slice of an internal archive. Practical notes on stewardship, reliability, and delivery.
Clock skew between systems turned a cautious background sync into a runaway loop. We describe how time assumptions failed and what we changed.
We built and tested a feature in one staging environment and one region. It behaved very differently elsewhere. Here’s why and what we changed.
What we changed about API deprecations so they behaved more like managed migrations and less like surprise deadlines.
We lost data in a secondary index that many flows depended on more than we thought. We explain why it behaved as more than a cache and what we changed.
We set an SLO tighter than reality and spent months failing against it without learning much. Here’s what we changed.
We chose a common set of rollout controls so deploy tools, runbooks, and dashboards speak the same language across services.
A change to tracing sampling made a latency regression invisible in our usual views. We describe how it happened and what we changed.
A checklist we use when an internal tool quietly becomes essential and needs to be treated like a tier-1 service.
A platform team’s dashboards were great for them but hard for service owners to use during incidents. We describe how we changed them.
A misconfigured cost alert let a runaway job spend much more than intended before we noticed. We describe what happened and how we changed our approach.
How we use internal LLM tools to propose changes to runbooks based on incident docs, without letting the tool edit production docs on its own.
How we decide whether capabilities like auth, flags, or logging live in shared platforms or in individual services.
An online schema change left transactional traffic healthy but broke analytics pipelines. We describe the disconnect and what we changed.
Operational metrics we treated as internal-only later became compliance and reporting signals. We describe how we adapted.
An internal admin tool started as a quick experiment and quietly became essential. We describe how we discovered that and what we did about it.
A checklist we use when someone wants to ship or run automation—scripts, tools, jobs—that can change lots of production records at once.
How we built a reusable way to run backfills and reprocessing jobs without turning them into surprise production incidents.
Questions we kept getting about where AI assistants fit into review, and how we avoid outsourcing judgment.
An internal LLM-based assistant generated confusing suggestions during an outage. We describe how it distracted responders and what we changed.
We decided to move flag evaluation into a shared service instead of letting every client decide on its own.
A change to rate-limiting configuration in only one tier caused uneven outages. We describe what broke and how we aligned limits with SLOs.
What we changed about dashboards once we treated SLOs as the primary lens for reliability work.
A condensed checklist we use when a new vendor might end up on the production critical path.
Visual regression tests said everything was fine. Keyboard users and screen readers disagreed. This is how we found and fixed the gap.
A subtle difference in configuration between regions turned a safe rollout into an uneven incident. We describe the drift and how we removed it.
We defined what 'safe mode' means per service so we can degrade predictably instead of improvising under pressure.
How we introduced LLM-based drafting for incident timelines without letting a tool become the source of truth.
How we use lightweight tooling to keep runbooks close to reality instead of letting them drift into an aspirational wiki.
A refactor changed how we routed jobs between queues and regions. Some jobs began running in the wrong place. We describe how we found it and what we changed.
We treated observability as 'basically free' until the bill and the query latencies told a different story. This is what changed next.
Answers to common questions about when we use queues, when we use streams, and what we watch out for operationally.
How we changed security reviews so they protect users without turning into unpredictable roadblocks.
A short checklist we run before and during schema changes on shared databases.
How we adjusted our on-call rotation and habits once the team was no longer sitting in the same room or even the same continent.
How we design and stage schema changes in shared databases so one team’s release doesn’t surprise everyone else.
A background job ran for years without a clear expectation. When it finally broke, we had to decide what 'on time' meant.
How we changed our flag system and habits so flags carry enough context to be safe to flip during incidents.
Patterns we use so automatic retries feel predictable and honest instead of random and frustrating.
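To give a flavor of the kind of pattern that post covers, here is a minimal sketch of one of the usual suspects: capped exponential backoff with full jitter. The function name, exception type, and numbers are illustrative assumptions, not our production defaults.

```python
import random
import time


class TransientError(Exception):
    """Placeholder for whatever your client raises on retryable failures."""


def call_with_retries(operation, max_attempts=4, base_delay=0.2, max_delay=5.0):
    """Retry a flaky call with capped exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up honestly instead of retrying forever
            # Full jitter: sleep somewhere in [0, capped exponential delay].
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))
```

The cap bounds the worst-case delay a caller has to reason about, and the jitter keeps synchronized clients from retrying in lockstep.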
A planned failover that was supposed to be nearly invisible took much longer under real traffic. We describe why and what we changed.
What we changed about incident handovers so they stopped being an afterthought and started shortening incidents.
A misconfigured rate limit rule throttled legitimate traffic instead of abusive clients. We describe how it happened and what we changed.
How we made infrastructure cost visible enough that engineers could treat it like latency or reliability when making decisions.
Answers to common questions about which services and behaviors deserve explicit SLOs and how we choose them.
How we clarified who owns what for shared services so incidents and roadmaps stopped stalling in the gaps.
Patterns we use to make async work—emails, background jobs, slow checks—feel predictable instead of flaky.
We added a cache to protect a slow path and accidentally created a new failure mode. This is the story and what we changed.
The concrete changes we made to our alerting so pages became rarer, clearer, and more actionable.
We decided to move application secrets out of long-lived environment variables and into a managed secrets system.
What we changed in the checkout architecture so partial failures lead to smaller, clearer problems instead of full outages.
How we made sure experiments respect reliability by tying them to error budgets instead of running them until something breaks.
A degraded downstream service caused slow responses that turned into a wave of timeouts upstream. We describe the chain and what we changed.
How we started treating metrics, logs, and tracing changes like production code instead of 'just add it and see.'
Three practical tactics we use to keep shipping during large integration projects without breaking production.
A misconfigured permission set blocked a critical operational action during a rollout. We describe how it happened and what we changed.
A scheduled job we thought was gone kept running in an old namespace, causing periodic load spikes. We describe how we found it and what we changed.
How small UX decisions in internal support tools cut time-to-understand during incidents and reduced support handoffs.
A cross-team migration mostly worked, then stalled on edge cases. We describe what finally got it finished.
Answers to the questions we kept hearing about how realistic staging needs to be and where to spend the effort.
Why we started giving internal tools explicit error budgets instead of treating them as best-effort.
We chose a shared way to categorize errors across services so dashboards, alerts, and user-facing messages line up.
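Purely as an illustration of what a shared vocabulary can look like (these category names are hypothetical, not the set we settled on):

```python
from enum import Enum


class ErrorCategory(Enum):
    """Illustrative shared error categories so dashboards, alerts,
    and user-facing messages can line up on one vocabulary."""

    INVALID_REQUEST = "invalid_request"   # caller error; fix the request
    UNAUTHORIZED = "unauthorized"         # auth or permission problem
    DEPENDENCY_FAILURE = "dependency"     # downstream or vendor failure
    OVERLOADED = "overloaded"             # shed or throttled; retry later
    INTERNAL = "internal"                 # our bug; page someone
```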
We migrated a core read path to a new backend and watched P95 latency climb. This is the story of how we noticed, rolled back, and changed how we plan migrations.
We moved from ad-hoc circuit breakers to a shared pattern so failures in one dependency don’t fragment every service’s behavior.
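For readers who want a mental model before the full post, a circuit breaker can be as small as the sketch below. It is a simplified assumption-laden example (single process, no shared state, no metrics); the class name and thresholds are made up for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are being shed."""


class CircuitBreaker:
    """Open after N consecutive failures, then allow one trial call
    (half-open) once a cooldown has passed."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("breaker open; shedding the call")
            self.opened_at = None  # cooldown elapsed: half-open, try once
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```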
How we introduced simple telemetry budgets so small services stay observable without surprising costs or overload.
A short checklist we run before shipping end-of-year changes like pricing, tax rules, and reporting formats.
A practical checklist we use when the team is tired but the work still needs to ship.
We decided to keep a single primary source of truth for authentication and treat other stores as caches, even when duplication looks convenient.
We let a dashboard drift while an alert still depended on it. The next incident taught us why observability assets need owners.
A routine change to batch timing turned our job system into a bottleneck. We describe how a small shift in scheduling created a queue backlog and what we changed afterward.
Questions we kept getting once some services were put into maintenance mode, and what that means operationally.
A short list of things we wish we had treated as production-critical in our flag system before traffic spiked.
We adjusted how optional emails fail so they don’t turn into noisy incidents or user confusion during stressed periods.
What we changed in alerts, dashboards, and runbooks so remote on-call engineers see the same incident at the same time.
Our first week of mostly-remote on-call exposed blind spots in paging and escalation. We describe where alerts went missing and how we changed the system.
If an alert doesn’t point to a single starting dashboard, the first ten minutes turn into archaeology. We keep one “first dashboard” per service.
Telemetry that can’t be queried is just expensive noise. We treat cardinality and log volume as budgets.
During an incident, lack of access looks like downtime. Excess access looks like risk. We treat access like any other production system.
A small checklist we run on the flows that matter: sign-in, checkout, account changes. Accessibility is reliability for humans.
A single date is comfort. A range with assumptions is a plan you can update without drama.
A deploy invalidated hot cache keys and the database became the cache. We rolled back and added stampede protection.
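Stampede protection comes in several flavors; one common one is single-flight loading, sketched below under the assumption of a single-process, in-memory cache. It is illustrative only, not the exact mitigation from that incident.

```python
import threading

_cache = {}
_per_key_locks = {}
_locks_guard = threading.Lock()


def get_or_load(key, load):
    """Single-flight fill: the first caller for a missing key runs load();
    concurrent callers for the same key wait and reuse the cached result."""
    if key in _cache:
        return _cache[key]
    with _locks_guard:
        lock = _per_key_locks.setdefault(key, threading.Lock())
    with lock:
        # Re-check after acquiring the per-key lock: another caller may
        # have filled the cache while we were waiting.
        if key not in _cache:
            _cache[key] = load()
        return _cache[key]
```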
If the old path breaks during a migration, you lose your escape hatch. We prefer expand/contract patterns that keep rollback real.
Done isn’t shipped. Done is shipped, observable, reversible, and supportable.
Support work happens under stress. A small internal tool can remove minutes of confusion from every incident-sized ticket.
We prefer merging small changes behind flags to running long-lived branches that collapse into a risky merge.
A vendor degraded, our retries amplified it, and checkout suffered. We changed retry defaults and added clearer degradation paths.
Discovery isn’t a mood board. It’s the work that turns unknowns into decisions you can defend.
A destructive admin action was labeled like a refresh. We changed the UI so the safe path is obvious under stress.
Not every alert deserves a page. We separate pages from tickets so on-call attention stays reserved for real impact.
A good error message reduces support load: it tells a stressed human what happened, what to do next, and what to share.
We kept one deployable and invested in boundaries, tests, and observability first. Splitting later became safer and less dramatic.
A simple structure for updates when you’re uncertain: facts, unknowns, next step, next checkpoint.
A small observability change increased log volume enough to overload the service. Rollback fixed it; we changed how we ship logging changes.
Runbooks take three common forms: a page, a checklist, or a button. Pick the home that matches your incident reality.
If the system changed, the runbook changed. Otherwise incidents turn into archaeology.
A runbook checklist we use to make incidents boring.
Decision: default to reversible changes; treat irreversible work as a planned cutover.
False certainty is expensive. Honest constraints make better plans.
Maintenance is where the real system shows up. It requires judgment, not leftovers.
A late-night incident taught us a simple rule: heroics can’t be the delivery model.
A Q&A on long-term ownership: what it changes, what it costs, and what it buys you.
Case work is documented separately as structured engagements.