Archive

Logs

A public-facing slice of an internal archive. Practical notes on stewardship, reliability, and delivery.

100 entries
CULTURE | 2023-06-30 | BY STORECODE
culture, on-call, remote-work

On-call across time zones

How we adjusted our on-call rotation and habits once the team was no longer sitting in the same room or even the same continent.

STEWARDSHIP | 2023-04-27 | BY JONAS "JO" CARLIN
stewardship, background-jobs, reliability

Story: the job that never had an SLA

A background job ran for years without a clear expectation. When it finally broke, we had to decide what 'on time' meant.

RELIABILITY | 2023-01-18 | BY ELI NAVARRO
reliability, incidents, on-call

Practicing incident handovers

What we changed about incident handovers so they stopped being an afterthought and started shortening incidents.

ARCHITECTURE | 2020-04-18 | BY JONAS "JO" CARLIN
architecture, feature-flags, rollout

Note: feature flags under stress

A short list of things we wish we had treated as production-critical in our flag system before traffic spiked.

RELIABILITY | 2019-11-02 | BY PRIYA PATEL
reliability, observability, monitoring

The one dashboard

If an alert doesn’t point to a single starting dashboard, the first ten minutes turn into archaeology. We keep one “first dashboard” per service.

RELIABILITY | 2019-10-10 | BY PRIYA PATEL
reliability, observability, monitoring

Cardinality is a budget

Telemetry that can’t be queried is just expensive noise. We treat cardinality and log volume as budgets.

SECURITY | 2019-09-15 | BY ELI NAVARRO
security, operations, access-control

Access is a production dependency

During an incident, lack of access looks like downtime. Excess access looks like risk. We treat access like any other production system.

DELIVERY | 2019-06-07 | BY JONAS "JO" CARLIN
delivery, estimation, communication

Estimates as ranges

A single date is comfort. A range with assumptions is a plan you can update without drama.

RELIABILITY | 2019-05-11 | BY STORECODE
reliability, incident-response, caching

Incident report: A cache stampede

A deploy invalidated hot cache keys and the database became the cache. We rolled back and added stampede protection.

DESIGN | 2019-03-06 | BY MARA SABOGAL
design, support, internal-tools

Support tools are product

Support work happens under stress. A small internal tool can remove minutes of confusion from every incident-sized ticket.

RELIABILITY | 2019-01-27 | BY STORECODE
reliability, incident-response, integrations

Incident report: A retry storm

A vendor degraded, our retries amplified it, and checkout suffered. We changed retry defaults and added clearer degradation paths.

DESIGN | 2018-12-18 | BY MARA SABOGAL
design, operations, internal-tools

The button that looked harmless

A destructive admin action was labeled like a refresh. We changed the UI so the safe path is obvious under stress.

RELIABILITY | 2018-12-03 | BY ELI NAVARRO
reliability, alerting, on-call

Page or ticket

Not every alert deserves a page. We separate pages from tickets so on-call attention stays reserved for real impact.

STEWARDSHIP | 2018-08-05 | BY MARA SABOGAL
stewardship, operations, documentation

Where runbooks live

Runbooks take three common forms: a page, a checklist, or a button. Pick the home that matches your incident reality.

RELIABILITY | 2018-05-22 | BY ELI NAVARRO
reliability, deployments, migrations

Reversibility over bravado

Decision: default to reversible changes; treat irreversible work as a planned cutover.

RELIABILITY | 2018-02-05 | BY ELI NAVARRO
reliability, operations, on-call

No heroics as policy

A late-night incident taught us a simple rule: heroics can’t be the delivery model.

STEWARDSHIP | 2018-01-12 | BY JONAS "JO" CARLIN
stewardship, maintenance, delivery

We will stay

A Q&A on long-term ownership: what it changes, what it costs, and what it buys you.

Looking for case studies?

Case work is documented separately as structured engagements.