Using LLM tools to assist incident retrospectives
How we use LLM-based tools to help with retrospectives—clustering themes, drafting sections—while keeping humans in charge of conclusions.
A public-facing slice of an internal archive. Practical notes on stewardship, reliability, and delivery.
Clock skew between systems turned a cautious background sync into a runaway loop. We describe how time assumptions failed and what we changed.
We built and tested a feature in one staging environment and one region. It behaved very differently elsewhere. Here’s why and what we changed.
What we changed about API deprecations so they behaved more like managed migrations and less like surprise deadlines.
We lost data in a secondary index that many flows depended on more than we thought. We explain why it behaved as more than a cache and what we changed.
We set an SLO tighter than reality and spent months failing against it without learning much. Here’s what we changed.
We chose a common set of rollout controls so deploy tools, runbooks, and dashboards speak the same language across services.
A change to tracing sampling made a latency regression invisible in our usual views. We describe how it happened and what we changed.
A checklist we use when an internal tool quietly becomes essential and needs to be treated like a tier-1 service.
A platform team’s dashboards were great for them but hard for service owners to use during incidents. We describe how we changed them.
A misconfigured cost alert let a runaway job spend much more than intended before we noticed. We describe what happened and how we changed our approach.
How we use internal LLM tools to propose changes to runbooks based on incident docs, without letting the tool edit production docs on its own.
How we decide whether capabilities like auth, flags, or logging live in shared platforms or in individual services.
An online schema change left transactional traffic healthy but broke analytics pipelines. We describe the disconnect and what we changed.
Operational metrics we treated as internal-only later became compliance and reporting signals. We describe how we adapted.
An internal admin tool started as a quick experiment and quietly became essential. We describe how we discovered that and what we did about it.
A checklist we use when someone wants to ship or run automation—scripts, tools, jobs—that can change lots of production records at once.
How we built a reusable way to run backfills and reprocessing jobs without turning them into surprise production incidents.
Questions we kept getting about where AI assistants fit into review, and how we avoid outsourcing judgment.
An internal LLM-based assistant generated confusing suggestions during an outage. We describe how it distracted responders and what we changed.
We decided to move flag evaluation into a shared service instead of letting every client decide on its own.
A change to rate-limiting configuration in only one tier caused uneven outages. We describe what broke and how we aligned limits with SLOs.
What we changed about dashboards once we treated SLOs as the primary lens for reliability work.
A condensed checklist we use when a new vendor might end up on the production critical path.
Visual regression tests said everything was fine. Keyboard users and screen readers disagreed. This is how we found and fixed the gap.
A subtle difference in configuration between regions turned a safe rollout into an uneven incident. We describe the drift and how we removed it.
We defined what 'safe mode' means per service so we can degrade predictably instead of improvising under pressure.
How we introduced LLM-based drafting for incident timelines without letting a tool become the source of truth.
How we use lightweight tooling to keep runbooks close to reality instead of letting them drift into an aspirational wiki.
A refactor changed how we routed jobs between queues and regions. Some jobs began running in the wrong place. We describe how we found it and what we changed.
We treated observability as 'basically free' until the bill and the query latencies told a different story. This is what changed next.
Answers to common questions about when we use queues, when we use streams, and what we watch out for operationally.
How we changed security reviews so they protect users without turning into unpredictable roadblocks.
A short checklist we run before and during schema changes on shared databases.
How we adjusted our on-call rotation and habits once the team was no longer sitting in the same room or even the same continent.
How we design and stage schema changes in shared databases so one team’s release doesn’t surprise everyone else.
A background job ran for years without a clear expectation. When it finally broke, we had to decide what 'on time' meant.
How we changed our flag system and habits so flags carry enough context to be safe to flip during incidents.
Patterns we use so automatic retries feel predictable and honest instead of random and frustrating.
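To give a flavor of the kind of pattern that post covers, here is a minimal sketch of one of the usual suspects: capped exponential backoff with full jitter. The function name, exception type, and numbers are illustrative assumptions, not our production defaults.

```python
import random
import time


class TransientError(Exception):
    """Placeholder for whatever your client raises on retryable failures."""


def call_with_retries(operation, max_attempts=4, base_delay=0.2, max_delay=5.0):
    """Retry a flaky call with capped exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up honestly instead of retrying forever
            # Full jitter: sleep somewhere in [0, capped exponential delay].
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))
```

The cap bounds the worst-case delay a caller has to reason about, and the jitter keeps synchronized clients from retrying in lockstep.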
A planned failover that was supposed to be nearly invisible took much longer under real traffic. We describe why and what we changed.
What we changed about incident handovers so they stopped being an afterthought and started shortening incidents.
A misconfigured rate limit rule throttled legitimate traffic instead of abusive clients. We describe how it happened and what we changed.
How we made infrastructure cost visible enough that engineers could treat it like latency or reliability when making decisions.
Answers to common questions about which services and behaviors deserve explicit SLOs and how we choose them.
How we clarified who owns what for shared services so incidents and roadmaps stopped stalling in the gaps.
Patterns we use to make async work—emails, background jobs, slow checks—feel predictable instead of flaky.
We added a cache to protect a slow path and accidentally created a new failure mode. This is the story and what we changed.
The concrete changes we made to our alerting so pages became rarer, clearer, and more actionable.
We decided to move application secrets out of long-lived environment variables and into a managed secrets system.
What we changed in the checkout architecture so partial failures lead to smaller, clearer problems instead of full outages.
How we made sure experiments respect reliability by tying them to error budgets instead of running them until something breaks.
A degraded downstream service caused slow responses that turned into a wave of timeouts upstream. We describe the chain and what we changed.
How we started treating metrics, logs, and tracing changes like production code instead of 'just add it and see.'
Three practical tactics we use to keep shipping during large integration projects without breaking production.
A misconfigured permission set blocked a critical operational action during a rollout. We describe how it happened and what we changed.
A scheduled job we thought was gone kept running in an old namespace, causing periodic load spikes. We describe how we found it and what we changed.
How small UX decisions in internal support tools cut time-to-understand during incidents and reduced support handoffs.
A cross-team migration mostly worked, then stalled on edge cases. We describe what finally got it finished.
Answers to the questions we kept hearing about how realistic staging needs to be and where to spend the effort.
Why we started giving internal tools explicit error budgets instead of treating them as best-effort.
We chose a shared way to categorize errors across services so dashboards, alerts, and user-facing messages line up.
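Purely as an illustration of what a shared vocabulary can look like (these category names are hypothetical, not the set we settled on):

```python
from enum import Enum


class ErrorCategory(Enum):
    """Illustrative shared error categories so dashboards, alerts,
    and user-facing messages can line up on one vocabulary."""

    INVALID_REQUEST = "invalid_request"   # caller error; fix the request
    UNAUTHORIZED = "unauthorized"         # auth or permission problem
    DEPENDENCY_FAILURE = "dependency"     # downstream or vendor failure
    OVERLOADED = "overloaded"             # shed or throttled; retry later
    INTERNAL = "internal"                 # our bug; page someone
```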
We migrated a core read path to a new backend and watched P95 latency climb. This is the story of how we noticed, rolled back, and changed how we plan migrations.
We moved from ad-hoc circuit breakers to a shared pattern so failures in one dependency don’t fragment every service’s behavior.
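For readers who want a mental model before the full post, a circuit breaker can be as small as the sketch below. It is a simplified assumption-laden example (single process, no shared state, no metrics); the class name and thresholds are made up for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are being shed."""


class CircuitBreaker:
    """Open after N consecutive failures, then allow one trial call
    (half-open) once a cooldown has passed."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("breaker open; shedding the call")
            self.opened_at = None  # cooldown elapsed: half-open, try once
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```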
How we introduced simple telemetry budgets so small services stay observable without surprising costs or overload.
A short checklist we run before shipping end-of-year changes like pricing, tax rules, and reporting formats.
A practical checklist we use when the team is tired but the work still needs to ship.
We decided to keep a single primary source of truth for authentication and treat other stores as caches, even when duplication looks convenient.
We let a dashboard drift while an alert still depended on it. The next incident taught us why observability assets need owners.
A routine change to batch timing turned our job system into a bottleneck. We describe how a small shift in scheduling created a queue backlog and what we changed afterward.
Questions we kept getting once some services were put into maintenance mode, and what that means operationally.
A short list of things we wish we had treated as production-critical in our flag system before traffic spiked.
We adjusted how optional emails fail so they don’t turn into noisy incidents or user confusion during stressed periods.
What we changed in alerts, dashboards, and runbooks so remote on-call engineers see the same incident at the same time.
Our first week of mostly-remote on-call exposed blind spots in paging and escalation. We describe where alerts went missing and how we changed the system.
If an alert doesn’t point to a single starting dashboard, the first ten minutes turn into archaeology. We keep one “first dashboard” per service.
Telemetry that can’t be queried is just expensive noise. We treat cardinality and log volume as budgets.
During an incident, lack of access looks like downtime. Excess access looks like risk. We treat access like any other production system.
A small checklist we run on the flows that matter: sign-in, checkout, account changes. Accessibility is reliability for humans.
A single date is comfort. A range with assumptions is a plan you can update without drama.
A deploy invalidated hot cache keys and the database became the cache. We rolled back and added stampede protection.
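Stampede protection comes in several flavors; one common one is single-flight loading, sketched below under the assumption of a single-process, in-memory cache. It is illustrative only, not the exact mitigation from that incident.

```python
import threading

_cache = {}
_per_key_locks = {}
_locks_guard = threading.Lock()


def get_or_load(key, load):
    """Single-flight fill: the first caller for a missing key runs load();
    concurrent callers for the same key wait and reuse the cached result."""
    if key in _cache:
        return _cache[key]
    with _locks_guard:
        lock = _per_key_locks.setdefault(key, threading.Lock())
    with lock:
        # Re-check after acquiring the per-key lock: another caller may
        # have filled the cache while we were waiting.
        if key not in _cache:
            _cache[key] = load()
        return _cache[key]
```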
If the old path breaks during a migration, you lose your escape hatch. We prefer expand/contract patterns that keep rollback real.
Done isn’t shipped. Done is shipped, observable, reversible, and supportable.
Support work happens under stress. A small internal tool can remove minutes of confusion from every incident-sized ticket.
We prefer merging small changes behind flags to running long-lived branches that collapse into a risky merge.
A vendor degraded, our retries amplified it, and checkout suffered. We changed retry defaults and added clearer degradation paths.
Discovery isn’t a mood board. It’s the work that turns unknowns into decisions you can defend.
A destructive admin action was labeled like a refresh. We changed the UI so the safe path is obvious under stress.
Not every alert deserves a page. We separate pages from tickets so on-call attention stays reserved for real impact.
A good error message reduces support load: it tells a stressed human what happened, what to do next, and what to share.
We kept one deployable and invested in boundaries, tests, and observability first. Splitting later became safer and less dramatic.
A simple structure for updates when you’re uncertain: facts, unknowns, next step, next checkpoint.
A small observability change increased log volume enough to overload the service. Rollback fixed it; we changed how we ship logging changes.
Runbooks take three common forms: a page, a checklist, or a button. Pick the home that matches your incident reality.
If the system changed, the runbook changed. Otherwise incidents turn into archaeology.
A runbook checklist we use to make incidents boring.
Decision: default to reversible changes; treat irreversible work as a planned cutover.
False certainty is expensive. Honest constraints make better plans.
Maintenance is where the real system shows up. It requires judgment, not leftovers.
A late-night incident taught us a simple rule: heroics can’t be the delivery model.
A Q&A on long-term ownership: what it changes, what it costs, and what it buys you.
Case work is documented separately as structured engagements.