Building a backfill framework that doesn’t fight production traffic
How we built a reusable way to run backfills and reprocessing jobs without turning them into surprise production incidents.
Backfills and data reprocessing jobs started as one-off scripts.
Each time we needed to:
- recompute a field
- migrate data to a new format
- replay a subset of events
someone would write a script, point it at production, throttle it "by feel," and watch dashboards.
Sometimes this went fine.
Sometimes it:
- competed with production traffic for database or queue capacity
- caused sudden latency spikes
- failed in the middle, leaving partial state
We decided to stop treating backfills as bespoke operations and build a framework that encoded what we kept re-learning.
Constraints
- We couldn’t pause all writes during backfills.
- Different systems had different storage and access patterns.
- We needed something simple enough that teams would actually use it.
What we changed
1. Make backfills first-class jobs
We built a small backfill framework with a few clear primitives, sketched in code below:
- iterate over a set of items (IDs, rows, events)
- apply a function to each item
- track progress and errors
Instead of writing their own iteration loops, teams:
- implement the per-item logic
- configure batch size, concurrency, and schedules
The framework handles:
- tracking what has been processed
- resuming from partial completion
- reporting metrics
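To make this concrete, here is a minimal sketch of those primitives. `BackfillJob`, `process`, and the in-memory `done` set are illustrative names, not our actual API; the real framework persists its checkpoint durably and adds concurrency:

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable

@dataclass
class BackfillJob:
    """Hypothetical shape of a backfill job: iterate, apply, track."""
    name: str
    items: Iterable[str]                   # IDs, rows, or event keys
    process: Callable[[str], None]         # per-item logic supplied by the team
    batch_size: int = 100
    done: set = field(default_factory=set)      # persisted durably in practice
    errors: dict = field(default_factory=dict)  # failures, keyed by item

    def run(self) -> None:
        pending = [i for i in self.items if i not in self.done]  # resume point
        for start in range(0, len(pending), self.batch_size):
            for item in pending[start:start + self.batch_size]:
                try:
                    self.process(item)
                    self.done.add(item)
                except Exception as exc:   # record the failure, keep going
                    self.errors[item] = repr(exc)
            self._checkpoint()             # persist progress once per batch

    def _checkpoint(self) -> None:
        pass  # the real framework would write `done` to durable storage here
```

A team supplies only `items` and `process`; resumption falls out of consulting the checkpoint at the start of each run.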
2. Budget backfills with the same care as traffic
We introduced the idea of a backfill budget:
- how much database, queue, or external API capacity we are willing to spend
- what share of capacity the backfill may consume during peak vs off-peak hours
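For example, a budget can live next to the job as plain configuration; these field names are hypothetical:

```python
# Hypothetical budget declaration for one backfill job.
BACKFILL_BUDGET = {
    "db_read_qps": 500,              # absolute cap on extra database reads
    "queue_msgs_per_sec": 200,       # absolute cap on enqueue rate
    "peak_capacity_share": 0.05,     # at most 5% of capacity during peak hours
    "offpeak_capacity_share": 0.25,  # looser cap overnight
}
```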
The framework enforces:
- configurable rate limits
- adaptive backoff when key metrics (latency, error rates) cross thresholds
We treat backfill-induced metric changes as regressions, not "just background work."
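A sketch of the enforcement loop, assuming a stand-in `read_p99_latency_ms()` for whichever metric the budget watches; a real implementation would also ramp the rate back up once the metric recovers:

```python
import time

LATENCY_BUDGET_MS = 50.0   # assumed threshold for the impacted dependency
BASE_RATE = 200            # items/sec allowed when the system is healthy
MIN_RATE = 10              # floor: crawl rather than stall completely

def read_p99_latency_ms() -> float:
    """Stand-in for reading the dependency's latency metric."""
    return 20.0

def throttled(items, rate: int = BASE_RATE):
    """Yield items no faster than the budget allows, backing off under load."""
    for item in items:
        while read_p99_latency_ms() > LATENCY_BUDGET_MS and rate > MIN_RATE:
            rate = max(MIN_RATE, rate // 2)   # halve the rate under pressure
            time.sleep(1.0)                   # wait before re-checking
        time.sleep(1.0 / rate)                # pace the next item
        yield item
```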
3. Align with queues and streams
Instead of hitting databases and services directly, backfills often:
- enqueue work onto existing queues, at controlled rates
- read from or write to streams with explicit partitions
This lets them reuse existing scaling and isolation mechanisms.
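A sketch of that pattern, with `queue_client.send_batch` standing in for whatever queue SDK the team already uses:

```python
import time

def enqueue_backfill(items, queue_client, batch_size=50, batches_per_sec=2.0):
    """Feed backfill work through the existing queue at a capped rate."""
    batch = []
    for item in items:
        batch.append({"kind": "backfill", "id": item})
        if len(batch) >= batch_size:
            queue_client.send_batch(batch)     # consumed by the normal path
            batch = []
            time.sleep(1.0 / batches_per_sec)  # cap the enqueue rate
    if batch:
        queue_client.send_batch(batch)         # flush the final partial batch
```

Because the work flows through the normal consumers, existing autoscaling, retries, and isolation apply to the backfill for free.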
We avoid backfills that:
- open many direct connections to hot databases
- bypass normal application paths entirely
4. Make progress and impact observable
The framework emits progress metrics, sketched below:
- items processed per unit time
- remaining items
- error rates by error type
We also chart impact on:
- database load
- queue depths
- key service latencies
Each backfill has:
- a small dashboard combining its own metrics and the impacted services
- a runbook entry that explains how to slow down, pause, or roll back
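The emission side might look like the sketch below, with `emit` standing in for a real StatsD or Prometheus client:

```python
from collections import Counter

def emit(metric: str, value: float, tags: dict) -> None:
    print(f"{metric}={value} {tags}")  # stand-in for a real metrics client

def report_progress(job_name: str, processed: int, remaining: int,
                    errors: Counter, window_s: float = 60.0) -> None:
    """Publish the three core backfill metrics once per reporting window."""
    tags = {"backfill": job_name}
    emit("backfill.items_per_min", processed / (window_s / 60.0), tags)
    emit("backfill.items_remaining", remaining, tags)
    for error_type, count in errors.items():  # error rates broken out by type
        emit("backfill.errors", count, {**tags, "type": error_type})
```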
5. Practice on non-critical data
Before running a new backfill on critical data, teams:
- try it on lower-importance tables or entities
- verify that throttling, backoff, and resumption behave as expected
We also rehearse "stop in the middle" scenarios:
- intentionally pause a backfill
- confirm the system stays in a consistent, recoverable state
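The rehearsal can be as simple as the illustrative test below: process half the items, "pause," then resume from the checkpoint and verify every item is handled exactly once:

```python
def test_pause_and_resume():
    items = [f"id-{i}" for i in range(100)]
    processed, checkpoint = [], set()

    def handle(item):
        processed.append(item)
        checkpoint.add(item)   # in production this write is durable

    for item in items[:50]:    # first run: an operator pauses halfway
        handle(item)
    for item in items:         # resume: skip anything already checkpointed
        if item not in checkpoint:
            handle(item)

    assert sorted(processed) == sorted(items)     # nothing skipped
    assert len(processed) == len(set(processed))  # nothing done twice

test_pause_and_resume()
```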
Results / Measurements
After rolling out the framework, we saw:
- fewer incidents where "someone’s backfill" showed up as the root cause
- more predictable timelines for long-running jobs
- better reuse of patterns across teams
Backfills are still work; they still require planning.
But the default is now:
- use the framework
- respect capacity budgets
- have dashboards and runbooks
Takeaways
- Backfills are part of the production story; they deserve budgets, observability, and tooling.
- A small framework can encode hard-earned lessons and prevent each team from rediscovering them.
- Treating backfills as first-class jobs reduces the chance that a "one-time script" becomes the next incident.