ARCHITECTURE · 2023-05-23 · BY ELI NAVARRO

Designing schema changes for shared databases

How we design and stage schema changes in shared databases so one team’s release doesn’t surprise everyone else.

architecture · databases · schema-changes · migrations

Shared databases have been a fact of life for longer than we’d like to admit.

In diagrams, every service owns its own data store.

In reality, some systems share a database for reasons that are expensive to change: legacy monoliths, complex reporting requirements, operational inertia.

In those systems, schema changes are one of the easiest ways for one team to surprise everyone else.

We’ve broken queries with an index "cleanup." We’ve locked tables while adding columns that looked harmlessly small. We’ve shipped incompatible changes because one service updated its ORM models before another did.

This post describes how we design schema changes now so they behave more like boring infrastructure work and less like roll-of-the-dice deploys.

Constraints

  • Multiple teams depend on the same database.
  • Not all queries are known or centrally cataloged.
  • Some workloads are bursty; load varies by hour and by day.
  • We can’t pause all writes during every migration.

We also have a hard requirement: user-facing flows must stay within SLOs while we change the schema.

What we changed

1. Schema changes have design docs

We stopped treating schema alterations as minor chores.

For non-trivial changes (new tables, new indexes, column type changes), we write a short design doc that covers:

  • what is changing and why
  • expected impact on read and write paths
  • how we’ll roll out and roll back the change
  • which teams and services are affected

The goal is not bureaucracy. It’s to force explicit thinking before we run ALTER TABLE on something hot.

2. Prefer additive changes

We bias toward additive changes that can be rolled out in phases:

  • add new columns instead of changing types in-place
  • add new tables alongside old ones, then backfill and switch over
  • add new indexes before changing queries to rely on them

Destructive changes (dropping columns, changing types) are treated as a second phase after traffic has safely moved.
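As an illustration, here’s roughly how an in-place column type change becomes a phased, additive one. The table and column names are made up, and the Postgres-flavored syntax is an assumption, not a prescription:

    -- Phase 1: add the new column next to the old one; nullable, no default, so no table rewrite.
    ALTER TABLE orders ADD COLUMN amount_cents bigint;

    -- Phase 2: backfill in small batches so no single transaction holds locks for long.
    UPDATE orders
       SET amount_cents = (amount * 100)::bigint
     WHERE amount_cents IS NULL
       AND id BETWEEN 1 AND 10000;   -- batch bounds driven by a script, repeated until done

    -- Phase 3: build any supporting index before queries start relying on it, without blocking writes.
    CREATE INDEX CONCURRENTLY idx_orders_amount_cents ON orders (amount_cents);

    -- Phase 4 (a later release, after readers and writers have switched): drop the old column.
    -- ALTER TABLE orders DROP COLUMN amount;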

3. Measure before and after

Before a change, we collect baseline data:

  • query frequencies and latencies for affected tables
  • peak times for relevant workloads
  • existing index usage

After the change, we compare:

  • did query latency or error rate move?
  • did overall load on the database change in expected ways?

This is mundane work, but it catches the difference between "we think this will be fine" and "it actually is."
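Most of the baseline comes straight from the database’s own statistics. Here is a sketch of the kind of queries we pull it from, assuming Postgres with pg_stat_statements enabled (column names vary slightly between versions, and the orders table is the same made-up example as above):

    -- Hot queries touching the affected table: call counts and average latency.
    SELECT query, calls, mean_exec_time
      FROM pg_stat_statements
     WHERE query ILIKE '%orders%'
     ORDER BY calls DESC
     LIMIT 20;

    -- How the table's existing indexes are actually being used.
    SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
      FROM pg_stat_user_indexes
     WHERE relname = 'orders';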

4. Schedule and stage during low-risk windows

We schedule heavy changes for windows where load is lower and support coverage is strong.

We also stage changes:

  • apply the schema change on its own, during a quiet window
  • wire up new code paths in a later deploy

This keeps us from stacking schema risk and code risk in the same deploy.

5. Test with production-shaped data

We can’t perfectly mirror production, but we can avoid worst-case surprises by:

  • running migrations against a copy or subset of real data in a lower environment
  • using tools that simulate the timing and locking behavior of the change

If a change takes 10 seconds on a small copy, we treat that as a warning, not an all-clear.
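The cheapest version of this is a manual dry run, separate from dedicated online-migration tools: restore a snapshot, run the same DDL with timing on, and look at the duration and locking behavior. A sketch in psql, against a hypothetical orders_copy table restored from a backup:

    -- Against a restored copy of production data, never against production itself.
    \timing on
    BEGIN;
    ALTER TABLE orders_copy ALTER COLUMN amount TYPE numeric(12,2);
    ROLLBACK;   -- we only wanted the timing and locks, not the change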

6. Add guardrails for long-running operations

We put simple protections in place:

  • time limits for online schema changes
  • monitoring for lock wait times and deadlocks during migrations
  • the ability to pause or abort a migration if indicators go bad

This keeps an unexpected long-running change from quietly wedging the system.
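For the first two guardrails, the migration session itself can carry the limits, so a blocked DDL statement gives up instead of queueing behind long-lived locks. Postgres-style settings shown; the exact knobs, values, and table names are assumptions that differ per database and per change:

    -- Abort quickly if the ALTER can't get its lock, rather than making other traffic wait behind it.
    SET lock_timeout = '5s';

    -- Cap how long any single statement in the migration may run.
    SET statement_timeout = '10min';

    ALTER TABLE orders ADD COLUMN shipped_at timestamptz;

    -- While the migration runs, watch for sessions stuck waiting on locks.
    SELECT pid, wait_event_type, wait_event, state, query
      FROM pg_stat_activity
     WHERE wait_event_type = 'Lock';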

Results / Measurements

Over time, we saw:

  • fewer incidents traced back to "someone changed the schema"
  • clearer communication between teams when a change might affect shared tables
  • more predictable migration windows

We still have to be careful. Shared databases don’t become magically safe because we wrote a process.

But when something does go wrong, we have:

  • a design doc to refer back to
  • monitoring in place to understand impact
  • a rollback or mitigation story that isn’t "wait and hope"

Takeaways

  • Schema changes in shared databases deserve design, not just tickets.
  • Additive, staged changes are easier to reason about and roll back.
  • Measuring before and after gives you evidence, not just optimism.
  • Guardrails and monitoring turn migrations from big bangs into managed operations.
