RELIABILITY2018-02-05BY ELI NAVARRO

No heroics as policy

A late-night incident taught us a simple rule: heroics can’t be the delivery model.

reliabilityoperationson-calldelivery

Incidents happen. That’s not the problem.

The problem is designing delivery so that incidents are the way the work gets done.

This is a sanitized story from a routine deploy that turned into a night of improvisation.

What happened

A few hours after deployment, error rates climbed from background noise to a level users could feel.

We didn’t have a dramatic signal—just a steady increase in failures and support messages that all meant the same thing: “it worked earlier today.”

Triage was straightforward. Confirm it’s real. Confirm blast radius. Correlate with the latest change.

Rollback was not straightforward.

The rollback path existed, but it wasn’t executable under pressure. The steps lived in someone’s memory. The deploy bundled a couple of operational changes that were “small” in a pull request and large at 1am.

We got back to a known-good state, but the cost was predictable: time, context-switching, and the creeping realization that we were depending on the same thing every fragile system depends on—heroics.

What we changed

The next day we wrote down the conclusion in plain language: if late nights are part of the plan, the system will eventually collect the debt.

So we changed the plan.

We made rollback a first-class requirement.

That meant two practical things. First, the rollback steps had to be written down and runnable. Second, they had to be practiced often enough that rollback wasn’t an invention under stress.

We reduced the size of changes by default. Smaller diffs are easier to review, easier to observe, and easier to undo.

And we treated operational docs as real work: a minimal runbook for common failure modes and a release checklist that makes the safe option the easy option.

None of this eliminates incidents. The goal is to stop incidents from being the delivery mechanism.

Takeaways

Heroics aren’t a personality trait. They’re a system behavior.

If your process depends on late nights, you’ll get late nights.

If your process depends on reversibility, runbooks, and small changes, you’ll get calmer delivery.

Further reading