STEWARDSHIP | 2020-11-30 | BY JONAS "JO" CARLIN

Checklist: Shipping safely when everyone is tired

A practical checklist we use when the team is tired but the work still needs to ship.

stewardship, delivery, on-call, checklists, remote-work

2020 forced us to admit something we already knew but rarely wrote down: people were shipping changes while tired long before any crisis.

The difference this year was that there was no illusion of "catching it in the hallway." We could not rely on quick in-person gut checks to stop risky deploys.

We needed a way to make safe choices visible in the tooling and the workflow, not just in ad-hoc conversations.

This checklist is deliberately small. It fits on one screen, it can be read on a phone, and it is focused on the few things that reliably keep incidents from getting worse when everyone is low on energy.

Context

We use this checklist when:

- the on-call rotation has been busy
- a release is happening outside normal hours
- external events mean people are distracted or stressed

The goal is not "never ship when tired."

The goal is:

- ship only changes that we can safely roll back or disable
- reduce the number of decisions a tired engineer has to make alone
- make it easy to say "not tonight" without social penalty

Checklist

Is this change reversible in minutes?
- There is a documented rollback path (flag, config toggle, or deploy).
- The rollback steps are written in the runbook, not only in someone’s head.
- We know how long rollback normally takes (e.g., "~5 minutes to redeploy").
Is the blast radius understood and acceptable?
- The change description clearly identifies which services and user flows it touches.
- We can describe the worst plausible visible failure in one sentence.
- If we’re wrong about the blast radius, it fails "small" (e.g., feature hidden) rather than "loud" (e.g., checkout down).
Is observability in place before we ship?
- There is at least one graph or log query that will move if the change misbehaves.
- Alerts are configured, or we have explicitly agreed to manual checks.
- The on-call knows which dashboard to watch in the first 15 minutes.
Is the rollout plan scoped to energy levels?
- We prefer dark launches, limited cohorts, or low-traffic windows.
- We avoid "big bang" migrations when the team is already stretched.
- Someone is explicitly on point to watch the rollout (not "whoever is around").
Are there at least two sets of eyes?
- Code review is not skipped.
- The reviewer understands the rollback plan and can repeat it back.
- If a second engineer is not available, we bias toward deferring.
Do we have a stop rule?
- We know what metric or signal will cause us to pause or roll back.
- The stop rule is written down ("if error rate > X for Y minutes, roll back").
- The on-call is empowered to apply the stop rule without asking for permission.
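
The written stop rule ("if error rate > X for Y minutes, roll back") can also be expressed as a small check, so a tired on-call does not have to interpret it. The following is a minimal sketch, not our actual tooling: `StopRule`, `should_roll_back`, and the one-sample-per-minute input are illustrative assumptions, and the thresholds are placeholders for whatever X and Y the team writes down.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class StopRule:
    """'If error rate > X for Y minutes, roll back.' (hypothetical helper)"""

    max_error_rate: float   # X, as a fraction (0.02 means 2%)
    sustained_minutes: int  # Y, consecutive minutes above the threshold

    def __post_init__(self) -> None:
        if self.sustained_minutes < 1:
            raise ValueError("sustained_minutes must be at least 1")


def should_roll_back(rule: StopRule, per_minute_error_rates: Sequence[float]) -> bool:
    """Return True when the most recent Y samples all exceed X.

    Assumes one error-rate sample per minute, oldest first, taken from
    whatever metrics source the team already has.
    """
    if len(per_minute_error_rates) < rule.sustained_minutes:
        return False  # not enough data yet to apply the rule
    recent = per_minute_error_rates[-rule.sustained_minutes:]
    return all(rate > rule.max_error_rate for rate in recent)
```

For example, with `StopRule(max_error_rate=0.02, sustained_minutes=3)`, three consecutive minutes above 2% trigger a rollback, while a single spike followed by recovery does not. Requiring consecutive samples is a deliberate choice in this sketch: it avoids rolling back on one noisy minute.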

Notes

This checklist does not care about the reason people are tired.

It applies equally to:

- a long incident earlier in the week
- a busy launch calendar
- a global event in the background

A few practical patterns we’ve found helpful:

- Bundle small-but-safe changes. When energy is low, it’s better to ship one safe batch with a single watch window than many tiny deploys that each carry their own risk.
- Make "defer" cheap. It should be socially normal to say "this can wait until tomorrow" and have the system respect that.
- Record decisions in the runbook. A short note like "we deferred X on 2020-11-30 due to fatigue" helps future us understand why the checklist exists.
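
A deferral note is easier to write consistently if its format is fixed. The helper below is a small sketch that reproduces the example note above; `deferral_note` and its parameters are hypothetical names, not part of any tool we use.

```python
from datetime import date


def deferral_note(change: str, reason: str, on: date) -> str:
    """Format a one-line runbook entry for a deferred change (illustrative)."""
    return f"we deferred {change} on {on.isoformat()} due to {reason}"
```

Calling `deferral_note("X", "fatigue", date(2020, 11, 30))` yields the note quoted above.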

We also learned that tired people will skip steps if they are hard to find.

So we:

- linked this checklist directly from the deploy tool
- added a short reminder in the on-call handoff doc
- mentioned it during incident reviews when fatigue was a factor
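
One way to make the checklist even harder to skip would be to have the deploy tool ask the six questions itself and default to "defer". The sketch below assumes such a hook exists. `CHECKLIST` and `decide` are hypothetical names, and nothing here describes how our deploy tool actually works.

```python
from typing import List, Mapping, Tuple

# The six questions from the checklist, in order.
CHECKLIST = (
    "Is this change reversible in minutes?",
    "Is the blast radius understood and acceptable?",
    "Is observability in place before we ship?",
    "Is the rollout plan scoped to energy levels?",
    "Are there at least two sets of eyes?",
    "Do we have a stop rule?",
)


def decide(answers: Mapping[str, bool]) -> Tuple[str, List[str]]:
    """Return ("ship", []) only when every question is answered yes.

    A missing answer counts as "no", so the default outcome is "defer".
    The second element lists the questions that were not satisfied.
    """
    unmet = [question for question in CHECKLIST if not answers.get(question, False)]
    return ("ship" if not unmet else "defer", unmet)
```

Treating an unanswered question as "no" keeps the burden on shipping rather than on deferring, which matches the default the checklist is trying to set.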

None of this guarantees safety.

It simply shifts the default from "ship unless someone yells stop" to "ship if it’s safe to do so given how we are today."

Takeaways

- Tired teams will still ship; the question is whether they are supported when they do.
- A small checklist that fits on one screen is more useful than an exhaustive guide nobody opens.
- Rollback, blast radius, observability, and stop rules are the levers that matter most when energy is low.
- Making "not tonight" an acceptable answer is part of technical stewardship, not a sign of weakness.
