ARCHITECTURE · 2024-10-21 · BY ELI NAVARRO

Building a backfill framework that doesn’t fight production traffic

How we built a reusable way to run backfills and reprocessing jobs without turning them into surprise production incidents.

architecture · backfills · batch-jobs · reliability

Backfills and data reprocessing jobs started as one-off scripts.

Each time we needed to:

  • recompute a field
  • migrate data to a new format
  • replay a subset of events

someone would write a script, point it at production, throttle it "by feel," and watch dashboards.

Sometimes this went fine.

Sometimes it:

  • competed with production traffic for database or queue capacity
  • caused sudden latency spikes
  • failed in the middle, leaving partial state

We decided to stop treating backfills as bespoke operations and build a framework that encoded what we kept re-learning.

Constraints

  • We couldn’t pause all writes during backfills.
  • Different systems had different storage and access patterns.
  • We needed something simple enough that teams would actually use it.

What we changed

1. Make backfills first-class jobs

We built a small backfill framework with clear primitives:

  • iterate over a set of items (IDs, rows, events)
  • apply a function to each item
  • track progress and errors

Instead of each team writing its own loop, teams:

  • implement the per-item logic
  • configure batch size, concurrency, and schedules

The framework handles:

  • tracking what has been processed
  • resuming from partial completion
  • reporting metrics
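
To make the primitives concrete, here is a minimal sketch of what the core loop can look like. Everything in it (the BackfillRunner name, the JSON-file checkpoint, the error counting) is illustrative rather than our actual implementation, but it shows the shape: iterate in batches, apply a per-item function, and persist enough progress to resume.

```python
# Illustrative sketch only: a tiny version of the framework's core loop.
# The item source and per-item function come from the team; the runner
# owns batching, progress tracking, and resumption.
import json
import os
from typing import Callable, Iterable

class BackfillRunner:
    def __init__(self, name: str, batch_size: int = 100) -> None:
        self.batch_size = batch_size
        self.checkpoint_path = f"{name}.checkpoint.json"
        self.errors: dict[str, int] = {}  # error counts by exception type

    def _load_done(self) -> set:
        # Resuming from partial completion means reloading processed item IDs.
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as f:
                return set(json.load(f))
        return set()

    def _save_done(self, done: set) -> None:
        with open(self.checkpoint_path, "w") as f:
            json.dump(sorted(done), f)

    def run(self, items: Iterable[str], process: Callable[[str], None]) -> None:
        done = self._load_done()
        batch: list[str] = []
        for item in items:
            if item in done:
                continue  # already handled on a previous run
            batch.append(item)
            if len(batch) >= self.batch_size:
                self._process_batch(batch, process, done)
                batch = []
        if batch:
            self._process_batch(batch, process, done)

    def _process_batch(self, batch, process, done) -> None:
        for item in batch:
            try:
                process(item)
                done.add(item)
            except Exception as exc:
                name = type(exc).__name__
                self.errors[name] = self.errors.get(name, 0) + 1
        # Checkpoint once per batch so an interrupted run can resume.
        self._save_done(done)
```

A team supplies only the item source and the process function; batch size (and, in the real framework, concurrency and scheduling) lives in configuration rather than in each script.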

2. Budget backfills with the same care as traffic

We introduced the idea of a backfill budget:

  • how much database, queue, or external API capacity we are willing to spend
  • what share of capacity the backfill may consume during peak vs off-peak hours

The framework enforces:

  • configurable rate limits
  • adaptive backoff when key metrics (latency, error rates) cross thresholds

We treat backfill-induced metric changes as regressions, not "just background work."
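
As an illustration of how enforcement can work (a sketch, not our actual enforcement code), a budget reduces to two knobs: a target rate, and a health signal that cuts it. The thresholds and multipliers below are assumptions.

```python
# Illustrative sketch: a rate limiter whose rate is cut when an observed
# health metric (here, p99 latency) crosses a threshold, and restored
# slowly once it recovers. Thresholds and multipliers are made up.
import time

class AdaptiveThrottle:
    def __init__(self, base_rate: float, latency_threshold_ms: float) -> None:
        self.base_rate = base_rate   # items/sec the budget allows
        self.rate = base_rate
        self.latency_threshold_ms = latency_threshold_ms

    def observe(self, p99_latency_ms: float) -> None:
        if p99_latency_ms > self.latency_threshold_ms:
            self.rate = max(self.rate * 0.5, 1.0)             # back off hard
        else:
            self.rate = min(self.rate * 1.1, self.base_rate)  # recover slowly

    def wait(self) -> None:
        # Called once per item; paces the loop to the current rate.
        time.sleep(1.0 / self.rate)
```

Peak versus off-peak budgets then become nothing more than a different base_rate per time window.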

3. Align with queues and streams

Instead of hitting databases and services directly, backfills often:

  • enqueue work onto existing queues, at controlled rates
  • read from or write to streams with explicit partitions

This lets them reuse existing scaling and isolation mechanisms.
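
For example (a hedged sketch; the payload shape and rate are illustrative), a queue-aligned backfill can be little more than a rate-limited producer:

```python
# Illustrative sketch: the backfill never touches the hot database itself.
# It feeds item IDs into an existing work queue at a bounded rate, and the
# normal consumers, with their usual scaling and isolation, do the writes.
import queue
import time

def enqueue_backfill(item_ids, work_queue: queue.Queue,
                     max_per_second: float = 50.0) -> None:
    interval = 1.0 / max_per_second
    for item_id in item_ids:
        work_queue.put({"type": "backfill", "id": item_id})  # reuse the normal job shape
        time.sleep(interval)
```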

We avoid backfills that:

  • open many direct connections to hot databases
  • bypass normal application paths entirely

4. Make progress and impact observable

The framework emits metrics:

  • items processed per unit time
  • remaining items
  • error rates by error type

We also chart impact on:

  • database load
  • queue depths
  • key service latencies

Each backfill has:

  • a small dashboard combining its own metrics and the impacted services
  • a runbook entry that explains how to slow down, pause, or roll back
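
The reporting side can be as small as the sketch below. The emit callable stands in for whatever metrics client a service already uses; all metric names here are illustrative.

```python
# Illustrative sketch of per-batch reporting; these are the signals the
# backfill dashboards are built from.
from typing import Callable

def report_progress(emit: Callable[[str, float], None],
                    processed: int, remaining: int,
                    errors: dict[str, int]) -> None:
    emit("backfill.items_processed", processed)
    emit("backfill.items_remaining", remaining)
    for error_type, count in errors.items():
        emit(f"backfill.errors.{error_type}", count)
```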

5. Practice on non-critical data

Before running a new backfill on critical data, teams:

  • try it on lower-importance tables or entities
  • verify that throttling, backoff, and resumption behave as expected

We also rehearse "stop in the middle" scenarios:

  • intentionally pause a backfill
  • confirm the system stays in a consistent, recoverable state
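
A rehearsal can be as small as the sketch below, which reuses the hypothetical BackfillRunner from step 1: interrupt a run halfway, resume, and assert that every item was processed exactly once.

```python
# Illustrative rehearsal of a "stop in the middle" scenario, using the
# BackfillRunner sketch from step 1.
import os

def test_pause_and_resume() -> None:
    # Start from a clean slate so the checkpoint reflects only this run.
    if os.path.exists("rehearsal.checkpoint.json"):
        os.remove("rehearsal.checkpoint.json")

    items = [str(i) for i in range(1000)]
    seen: list[str] = []

    def process(item: str) -> None:
        if len(seen) == 500:
            raise KeyboardInterrupt  # simulate an operator pausing the job
        seen.append(item)

    runner = BackfillRunner("rehearsal", batch_size=50)
    try:
        runner.run(items, process)
    except KeyboardInterrupt:
        pass  # stopped midway; the checkpoint holds the completed batches

    # Resuming must pick up only the unprocessed items.
    runner.run(items, seen.append)
    assert sorted(seen) == sorted(items)  # each item processed exactly once
```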

Results / Measurements

After rolling out the framework, we saw:

  • fewer incidents where "someone’s backfill" showed up as the root cause
  • more predictable timelines for long-running jobs
  • better reuse of patterns across teams

Backfills are still work; they still require planning.

But the default is now:

  • use the framework
  • respect capacity budgets
  • have dashboards and runbooks

Takeaways

  • Backfills are part of the production story; they deserve budgets, observability, and tooling.
  • A small framework can encode hard-earned lessons and prevent each team from rediscovering them.
  • Treating backfills as first-class jobs reduces the chance that a "one-time script" becomes the next incident.
