Stewardship · 2024-11-02 · by Jonas "Jo" Carlin

Checklist: Reviewing automation that touches production data

A checklist we use when someone wants to ship or run automation—scripts, tools, jobs—that can change lots of production records at once.

Tags: stewardship, automation, risk, production

Automation is where we scale both our best and worst ideas.

A small script or tool that touches production data can:

  • remove a pile of tedious manual work
  • or break thousands of records in seconds

This checklist is what we walk through before we:

  • ship new automation
  • run a one-off script on production data
  • promote a "helpful" internal bot to broader use

Context

Use this checklist when automation can:

  • create, update, or delete production records in bulk
  • change access, permissions, or configuration for many users or services
  • trigger downstream effects (emails, billing, external calls)

Checklist

  • What exactly can this automation change?

    • list the primary entities (users, orders, configs)
    • note any secondary effects (emails, webhooks, cache invalidation)
  • Is it scoped by default? (sketch after the list)

    • does it require explicit filters or IDs, or can it run against "everything"?
    • are there guardrails to prevent unbounded changes (max batch size, dry-run flags)?
  • Can it run in dry-run mode? (sketch after the list)

    • will it show what it would change without actually changing it?
    • does the output make it easy to spot mistakes?
  • Is it idempotent or at least safe to retry? (sketch after the list)

    • if it runs twice, will it double-apply changes or leave data in a weird state?
    • can it detect already-processed records?
  • How will we monitor it while it runs? (sketch after the list)

    • logs or metrics that show progress and error rates
    • a way to see which records have been touched
  • Is there a way to stop it quickly? (sketch after the list)

    • can we pause or cancel it without leaving everything half-updated?
    • do we know what state it leaves the world in if we stop halfway?
  • Who is allowed to run it?

    • is access limited to people who understand the impact?
    • are credentials and permissions scoped appropriately?
  • Do we have a rollback or repair story? (sketch after the list)

    • how would we restore or fix data if something goes wrong?
    • is that plan realistic at the scale of changes this tool can make?
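
A few of these questions are easier to show than to describe, so here are some short Python sketches. None of this is our real tooling; every name in them is made up.

For "is it scoped by default?": a CLI that refuses to run unless you name your targets, and caps how many it will accept. The flag names and the MAX_BATCH value are hypothetical.

    import argparse
    import sys

    MAX_BATCH = 500  # hypothetical cap on how many records one run may touch

    parser = argparse.ArgumentParser(description="bulk-update production records")
    parser.add_argument("--ids", nargs="+", help="explicit record IDs to change")
    parser.add_argument("--all-records", action="store_true",
                        help="run against everything (requires saying so out loud)")
    args = parser.parse_args()

    # Scoped by default: no targets means no run.
    if not args.ids and not args.all_records:
        sys.exit("refusing to run unscoped; pass --ids or the explicit --all-records flag")

    if args.ids and len(args.ids) > MAX_BATCH:
        sys.exit(f"{len(args.ids)} IDs exceeds MAX_BATCH ({MAX_BATCH}); split the run")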
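
For "can it run in dry-run mode?": one common shape is to default to printing the plan and only write when an explicit flag is set. Record and its fields are stand-ins.

    from dataclasses import dataclass

    @dataclass
    class Record:  # stand-in for a real production record
        id: int
        value: str

    def update(record: Record, new_value: str, apply: bool = False) -> None:
        # Dry-run by default: show what would change, write only on request.
        print(f"record {record.id}: {record.value!r} -> {new_value!r}")
        if apply:
            record.value = new_value  # a real tool would persist here

    update(Record(id=1, value="old"), "new")              # prints the plan, changes nothing
    update(Record(id=1, value="old"), "new", apply=True)  # actually applies

Defaulting to dry-run means the dangerous path is the one you have to ask for explicitly.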
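
For "is it idempotent?": the cheapest trick is to make each record carry evidence that it has been processed, so a retry skips it. The schema_version field is hypothetical.

    from dataclasses import dataclass

    @dataclass
    class Record:
        id: int
        payload: str
        schema_version: int = 1

    def migrate(record: Record) -> bool:
        """Upgrade a record to schema v2; safe to run any number of times."""
        if record.schema_version >= 2:
            return False  # already processed: a retry is a no-op
        record.payload = record.payload.upper()  # stand-in for the real transform
        record.schema_version = 2
        return True

    r = Record(id=7, payload="hello")
    assert migrate(r) is True   # first run applies the change
    assert migrate(r) is False  # second run detects it and skips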
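
For monitoring: even a one-off script can log progress and error counts every N records, so you can tell the difference between "slow" and "stuck". process() is a stand-in for the real per-record work.

    import logging

    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
    log = logging.getLogger("bulk-update")

    def process(record_id: str) -> None:
        """Stand-in for the real per-record change."""

    def run(record_ids: list[str]) -> None:
        ok = errors = 0
        for record_id in record_ids:
            try:
                process(record_id)
                ok += 1
            except Exception:
                errors += 1
                log.exception("failed on record %s", record_id)
            if (ok + errors) % 100 == 0:  # periodic progress line
                log.info("progress: %d ok, %d errors", ok, errors)
        log.info("finished: %d ok, %d errors", ok, errors)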
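
For the stop button: trap SIGINT and SIGTERM, finish the record in flight, and report a clean resume point instead of dying mid-write. Again, the helper is hypothetical.

    import signal

    stop_requested = False

    def request_stop(signum, frame):
        global stop_requested
        stop_requested = True  # finish the current record, then exit cleanly

    signal.signal(signal.SIGINT, request_stop)
    signal.signal(signal.SIGTERM, request_stop)

    def process(record_id: str) -> None:
        """Stand-in for the real per-record change."""

    def run(record_ids: list[str]) -> None:
        last_done = None
        for record_id in record_ids:
            if stop_requested:
                break
            process(record_id)     # each record is fully done or untouched
            last_done = record_id  # checkpoint: known-good resume point
        print(f"stopped cleanly; last completed record: {last_done}")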
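
For the rollback story: the simplest realistic plan is a snapshot of the old values, written before the first mutation, plus a restore script that reads it back. A sketch, with a hypothetical Record:

    import json
    import time
    from dataclasses import dataclass

    @dataclass
    class Record:
        id: int
        value: str

    def backup_then_update(records: list[Record], new_value: str) -> str:
        # Snapshot old values before touching anything, so a repair
        # script can restore them if this run turns out to be wrong.
        path = f"restore-point-{int(time.time())}.json"
        with open(path, "w") as f:
            json.dump([{"id": r.id, "value": r.value} for r in records], f)
        for r in records:
            r.value = new_value  # the actual bulk change
        return path

At large scale a JSON file stops being realistic; the point is that the restore path exists, and is tested, before the tool needs it.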

Notes

We don’t expect every script to be perfect before first use.

We do expect:

  • someone to think through these questions
  • changes to be small at first (limited scope, limited blast radius)
  • teams to harden automation over time based on real usage and incidents

A good pattern (sketched below):

  • start with read-only or dry-run-only modes
  • add logging and progress metrics
  • gradually allow writes once we trust the behavior
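
One way to encode that progression is an explicit mode the tool carries from day one, where writing is the last mode it earns. The names here are hypothetical:

    from enum import Enum

    class Mode(Enum):
        READ_ONLY = "read-only"  # first: only report what exists
        DRY_RUN = "dry-run"      # then: show what would change
        WRITE = "write"          # finally: allowed once we trust the behavior

    def handle(record_id: str, mode: Mode) -> None:
        print(f"[{mode.value}] record {record_id}")
        if mode is Mode.WRITE:
            pass  # the only branch that touches production data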

Takeaways

  • Automation that touches production data is part of your architecture, not just a convenience.
  • Guardrails, scoping, and observability matter as much for scripts and bots as they do for services.
  • Having a clear stop button and a repair story turns "this might be scary" into "this is a tool we can use safely."
