Checklist: Shipping safely when everyone is tired
A practical checklist we use when the team is tired but the work still needs to ship.
2020 forced us to admit something we already knew but rarely wrote down: people were shipping changes while tired long before any crisis.
The difference this year was that there was no illusion of "catching it in the hallway." We could not rely on quick in-person gut checks to stop risky deploys.
We needed a way to make safe choices visible in the tooling and the workflow, not just in ad-hoc conversations.
This checklist is deliberately small. It fits on one screen, it can be read on a phone, and it is focused on the few things that reliably keep incidents from getting worse when everyone is low on energy.
Context
We use this checklist when:
- the on-call rotation has been busy
- a release is happening outside normal hours
- external events mean people are distracted or stressed
The goal is not "never ship when tired."
The goal is to:
- ship only changes that we can safely roll back or disable
- reduce the number of decisions a tired engineer has to make alone
- make it easy to say "not tonight" without social penalty
Checklist
- Is this change reversible in minutes?
  - There is a documented rollback path (flag, config toggle, or redeploy).
  - The rollback steps are written in the runbook, not only in someone’s head.
  - We know how long rollback normally takes (e.g., "~5 minutes to redeploy").
- Is the blast radius understood and acceptable?
  - The change description clearly identifies which services and user flows it touches.
  - We can describe the worst plausible visible failure in one sentence.
  - If we’re wrong about the blast radius, the change fails "small" (e.g., feature hidden) rather than "loud" (e.g., checkout down).
- Is observability in place before we ship?
  - There is at least one graph or log query that will move if the change misbehaves.
  - Alerts are configured, or we have explicitly agreed to manual checks.
  - The on-call engineer knows which dashboard to watch in the first 15 minutes.
- Is the rollout plan scoped to the team's energy level?
  - We prefer dark launches, limited cohorts, or low-traffic windows.
  - We avoid "big bang" migrations when the team is already stretched.
  - Someone is explicitly on point to watch the rollout (not "whoever is around").
- Are there at least two sets of eyes?
  - Code review is not skipped.
  - The reviewer understands the rollback plan and can repeat it back.
  - If a second engineer is not available, we bias toward deferring.
- Do we have a stop rule?
  - We know what metric or signal will cause us to pause or roll back.
  - The stop rule is written down ("if error rate > X for Y minutes, roll back").
  - The on-call engineer is empowered to apply the stop rule without asking for permission.
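A written stop rule is easiest to apply when it is mechanical. The sketch below shows one possible way to encode "if error rate > X for Y minutes, roll back", assuming one error-rate sample per minute. The names `StopRule` and `StopRuleMonitor`, and the thresholds in the usage lines, are hypothetical and not part of any particular monitoring tool.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class StopRule:
    """Hypothetical encoding of 'if error rate > X for Y minutes, roll back'."""
    max_error_rate: float   # X, as a fraction (0.02 means 2%)
    sustained_minutes: int  # Y, counted in consecutive one-minute samples


class StopRuleMonitor:
    """Keeps the last Y samples and reports when all of them exceed X."""

    def __init__(self, rule: StopRule) -> None:
        if rule.sustained_minutes < 1:
            raise ValueError("sustained_minutes must be at least 1")
        self.rule = rule
        # A bounded deque drops the oldest sample automatically.
        self._recent = deque(maxlen=rule.sustained_minutes)

    def observe(self, error_rate: float) -> str:
        """Record one per-minute sample and return 'continue' or 'roll_back'."""
        self._recent.append(error_rate)
        window_full = len(self._recent) == self.rule.sustained_minutes
        if window_full and all(r > self.rule.max_error_rate for r in self._recent):
            return "roll_back"
        return "continue"


# Example thresholds only: roll back after 3 consecutive minutes above 2%.
monitor = StopRuleMonitor(StopRule(max_error_rate=0.02, sustained_minutes=3))
decisions = [monitor.observe(rate) for rate in [0.01, 0.03, 0.04, 0.05]]
# decisions == ["continue", "continue", "continue", "roll_back"]
```

Requiring every sample in the window to exceed the threshold means a single noisy minute does not trigger a rollback, while a sustained breach does, and it does so without a judgment call from a tired engineer.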
Notes
This checklist does not care about the reason people are tired.
It applies equally to:
- a long incident earlier in the week
- a busy launch calendar
- a global event in the background
A few practical patterns we’ve found helpful:
- Bundle small-but-safe changes. When energy is low, it’s better to ship one safe batch than many tiny risky ones.
- Make "defer" cheap. It should be socially normal to say "this can wait until tomorrow" and have the system respect that.
- Record decisions in the runbook. A short note like "we deferred X on 2020-11-30 due to fatigue" helps future us understand why the checklist exists.
We also learned that tired people will skip steps if they are hard to find.
So we:
- linked this checklist directly from the deploy tool
- added a short reminder in the on-call handoff doc
- mentioned it during incident reviews when fatigue was a factor
None of this guarantees safety.
It simply shifts the default from "ship unless someone yells stop" to "ship if it’s safe to do so given the state the team is in today."
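One way to make that default concrete is a small gate in the deploy tool that returns "defer" unless every checklist question is answered yes. This is a minimal sketch under that assumption. `ChecklistAnswers` and `decide` are hypothetical names, and the fields simply mirror the six questions in the checklist.

```python
from dataclasses import dataclass, fields
from typing import List, Tuple


@dataclass(frozen=True)
class ChecklistAnswers:
    """One yes/no answer per checklist question (hypothetical field names)."""
    reversible_in_minutes: bool
    blast_radius_understood: bool
    observability_in_place: bool
    rollout_scoped_to_energy: bool
    two_sets_of_eyes: bool
    stop_rule_written_down: bool


def decide(answers: ChecklistAnswers) -> Tuple[str, List[str]]:
    """Return ('ship', []) only if every answer is yes.

    Otherwise return ('defer', unmet), where unmet lists the questions
    that were answered no, in checklist order.
    """
    unmet = [f.name for f in fields(answers) if not getattr(answers, f.name)]
    if unmet:
        return ("defer", unmet)
    return ("ship", [])


# Example: no second engineer is available tonight, so the gate defers.
tonight = ChecklistAnswers(
    reversible_in_minutes=True,
    blast_radius_understood=True,
    observability_in_place=True,
    rollout_scoped_to_energy=True,
    two_sets_of_eyes=False,
    stop_rule_written_down=True,
)
decision, unmet = decide(tonight)
# decision == "defer", unmet == ["two_sets_of_eyes"]
```

Returning the unmet items alongside the decision gives the engineer a ready-made reason to record, which keeps "not tonight" a tool outcome rather than a personal call.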
Takeaways
- Tired teams will still ship; the question is whether they are supported when they do.
- A small checklist that fits on one screen is more useful than an exhaustive guide nobody opens.
- Rollback, blast radius, observability, and stop rules are the levers that matter most when energy is low.
- Making "not tonight" an acceptable answer is part of technical stewardship, not a sign of weakness.