Stewardship · 2024-11-02 · by Jonas "Jo" Carlin

Checklist: Reviewing automation that touches production data

A checklist we use when someone wants to ship or run automation—scripts, tools, jobs—that can change lots of production records at once.

Tags: stewardship, automation, risk, production

Automation is where we scale both our best and worst ideas.

A small script or tool that touches production data can:

  • remove a pile of tedious manual work
  • or break thousands of records in seconds

This checklist is what we walk through before we:

  • ship new automation
  • run a one-off script on production data
  • promote a "helpful" internal bot to broader use

Context

Use this checklist when automation can:

  • create, update, or delete production records in bulk
  • change access, permissions, or configuration for many users or services
  • trigger downstream effects (emails, billing, external calls)

Checklist

  • What exactly can this automation change?

    • list the primary entities (users, orders, configs)
    • note any secondary effects (emails, webhooks, cache invalidation)
  • Is it scoped by default? (sketch after the list)

    • does it require explicit filters or IDs, or can it run against "everything"?
    • are there guardrails to prevent unbounded changes (max batch size, dry-run flags)?
  • Can it run in dry-run mode? (sketch after the list)

    • will it show what it would change without actually changing it?
    • does the output make it easy to spot mistakes?
  • Is it idempotent or at least safe to retry? (sketch after the list)

    • if it runs twice, will it double-apply changes or leave data in a weird state?
    • can it detect already-processed records?
  • How will we monitor it while it runs? (sketch after the list)

    • logs or metrics that show progress and error rates
    • a way to see which records have been touched
  • Is there a way to stop it quickly? (sketch after the list)

    • can we pause or cancel it without leaving everything half-updated?
    • do we know what state it leaves the world in if we stop halfway?
  • Who is allowed to run it?

    • is access limited to people who understand the impact?
    • are credentials and permissions scoped appropriately?
  • Do we have a rollback or repair story? (sketch after the list)

    • how would we restore or fix data if something goes wrong?
    • is that plan realistic at the scale of changes this tool can make?
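
A few of these questions are easier to show than to describe, so here are some short Python sketches. None of this is our real tooling; every name in them is made up.

For "is it scoped by default?": a CLI that refuses to run unless you name your targets, and caps how many it will accept. The flag names and the MAX_BATCH value are hypothetical.

    import argparse
    import sys

    MAX_BATCH = 500  # hypothetical cap on how many records one run may touch

    parser = argparse.ArgumentParser(description="bulk-update production records")
    parser.add_argument("--ids", nargs="+", help="explicit record IDs to change")
    parser.add_argument("--all-records", action="store_true",
                        help="run against everything (requires saying so out loud)")
    args = parser.parse_args()

    # Scoped by default: no targets means no run.
    if not args.ids and not args.all_records:
        sys.exit("refusing to run unscoped; pass --ids or the explicit --all-records flag")

    if args.ids and len(args.ids) > MAX_BATCH:
        sys.exit(f"{len(args.ids)} IDs exceeds MAX_BATCH ({MAX_BATCH}); split the run")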
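
For "can it run in dry-run mode?": one common shape is to default to printing the plan and only write when an explicit flag is set. Record and its fields are stand-ins.

    from dataclasses import dataclass

    @dataclass
    class Record:  # stand-in for a real production record
        id: int
        value: str

    def update(record: Record, new_value: str, apply: bool = False) -> None:
        # Dry-run by default: show what would change, write only on request.
        print(f"record {record.id}: {record.value!r} -> {new_value!r}")
        if apply:
            record.value = new_value  # a real tool would persist here

    update(Record(id=1, value="old"), "new")              # prints the plan, changes nothing
    update(Record(id=1, value="old"), "new", apply=True)  # actually applies

Defaulting to dry-run means the dangerous path is the one you have to ask for explicitly.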
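
For "is it idempotent?": the cheapest trick is to make each record carry evidence that it has been processed, so a retry skips it. The schema_version field is hypothetical.

    from dataclasses import dataclass

    @dataclass
    class Record:
        id: int
        payload: str
        schema_version: int = 1

    def migrate(record: Record) -> bool:
        """Upgrade a record to schema v2; safe to run any number of times."""
        if record.schema_version >= 2:
            return False  # already processed: a retry is a no-op
        record.payload = record.payload.upper()  # stand-in for the real transform
        record.schema_version = 2
        return True

    r = Record(id=7, payload="hello")
    assert migrate(r) is True   # first run applies the change
    assert migrate(r) is False  # second run detects it and skips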
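
For monitoring: even a one-off script can log progress and error counts every N records, so you can tell the difference between "slow" and "stuck". process() is a stand-in for the real per-record work.

    import logging

    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
    log = logging.getLogger("bulk-update")

    def process(record_id: str) -> None:
        """Stand-in for the real per-record change."""

    def run(record_ids: list[str]) -> None:
        ok = errors = 0
        for record_id in record_ids:
            try:
                process(record_id)
                ok += 1
            except Exception:
                errors += 1
                log.exception("failed on record %s", record_id)
            if (ok + errors) % 100 == 0:  # periodic progress line
                log.info("progress: %d ok, %d errors", ok, errors)
        log.info("finished: %d ok, %d errors", ok, errors)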
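
For the stop button: trap SIGINT and SIGTERM, finish the record in flight, and report a clean resume point instead of dying mid-write. Again, the helper is hypothetical.

    import signal

    stop_requested = False

    def request_stop(signum, frame):
        global stop_requested
        stop_requested = True  # finish the current record, then exit cleanly

    signal.signal(signal.SIGINT, request_stop)
    signal.signal(signal.SIGTERM, request_stop)

    def process(record_id: str) -> None:
        """Stand-in for the real per-record change."""

    def run(record_ids: list[str]) -> None:
        last_done = None
        for record_id in record_ids:
            if stop_requested:
                break
            process(record_id)     # each record is fully done or untouched
            last_done = record_id  # checkpoint: known-good resume point
        print(f"stopped cleanly; last completed record: {last_done}")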
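
For the rollback story: the simplest realistic plan is a snapshot of the old values, written before the first mutation, plus a restore script that reads it back. A sketch, with a hypothetical Record:

    import json
    import time
    from dataclasses import dataclass

    @dataclass
    class Record:
        id: int
        value: str

    def backup_then_update(records: list[Record], new_value: str) -> str:
        # Snapshot old values before touching anything, so a repair
        # script can restore them if this run turns out to be wrong.
        path = f"restore-point-{int(time.time())}.json"
        with open(path, "w") as f:
            json.dump([{"id": r.id, "value": r.value} for r in records], f)
        for r in records:
            r.value = new_value  # the actual bulk change
        return path

At large scale a JSON file stops being realistic; the point is that the restore path exists, and is tested, before the tool needs it.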

Notes

We don’t expect every script to be perfect before first use.

We do expect:

  • someone to think through these questions
  • changes to be small at first (limited scope, limited blast radius)
  • teams to harden automation over time based on real usage and incidents

A good pattern (sketched below):

  • start with read-only or dry-run-only modes
  • add logging and progress metrics
  • gradually allow writes once we trust the behavior
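
One way to encode that progression is an explicit mode the tool carries from day one, where writing is the last mode it earns. The names here are hypothetical:

    from enum import Enum

    class Mode(Enum):
        READ_ONLY = "read-only"  # first: only report what exists
        DRY_RUN = "dry-run"      # then: show what would change
        WRITE = "write"          # finally: allowed once we trust the behavior

    def handle(record_id: str, mode: Mode) -> None:
        print(f"[{mode.value}] record {record_id}")
        if mode is Mode.WRITE:
            pass  # the only branch that touches production data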

Takeaways

  • Automation that touches production data is part of your architecture, not just a convenience.
  • Guardrails, scoping, and observability matter as much for scripts and bots as they do for services.
  • Having a clear stop button and a repair story turns "this might be scary" into "this is a tool we can use safely."
