ARCHITECTURE · 2024-10-21 · BY ELI NAVARRO

Building a backfill framework that doesn’t fight production traffic

How we built a reusable way to run backfills and reprocessing jobs without turning them into surprise production incidents.

architecture · backfills · batch-jobs · reliability

Backfills and data reprocessing jobs started as one-off scripts.

Each time we needed to:

  • recompute a field
  • migrate data to a new format
  • replay a subset of events

someone would write a script, point it at production, throttle it "by feel," and watch dashboards.

Sometimes this went fine.

Sometimes it:

  • competed with production traffic for database or queue capacity
  • caused sudden latency spikes
  • failed in the middle, leaving partial state

We decided to stop treating backfills as bespoke operations and build a framework that encoded what we kept re-learning.

Constraints

  • We couldn’t pause all writes during backfills.
  • Different systems had different storage and access patterns.
  • We needed something simple enough that teams would actually use it.

What we changed

1. Make backfills first-class jobs

We built a small backfill framework with clear primitives:

  • iterate over a set of items (IDs, rows, events)
  • apply a function to each item
  • track progress and errors

Instead of each team writing its own loop, teams:

  • implement the per-item logic
  • configure batch size, concurrency, and schedules

The framework handles:

  • tracking what has been processed
  • resuming from partial completion
  • reporting metrics
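
To make the primitives concrete, here is a minimal sketch of what the core loop can look like. Everything in it (the BackfillRunner name, the JSON-file checkpoint, the error counting) is illustrative rather than our actual implementation, but it shows the shape: iterate in batches, apply a per-item function, and persist enough progress to resume.

```python
# Illustrative sketch only: a tiny version of the framework's core loop.
# The item source and per-item function come from the team; the runner
# owns batching, progress tracking, and resumption.
import json
import os
from typing import Callable, Iterable

class BackfillRunner:
    def __init__(self, name: str, batch_size: int = 100) -> None:
        self.batch_size = batch_size
        self.checkpoint_path = f"{name}.checkpoint.json"
        self.errors: dict[str, int] = {}  # error counts by exception type

    def _load_done(self) -> set:
        # Resuming from partial completion means reloading processed item IDs.
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as f:
                return set(json.load(f))
        return set()

    def _save_done(self, done: set) -> None:
        with open(self.checkpoint_path, "w") as f:
            json.dump(sorted(done), f)

    def run(self, items: Iterable[str], process: Callable[[str], None]) -> None:
        done = self._load_done()
        batch: list[str] = []
        for item in items:
            if item in done:
                continue  # already handled on a previous run
            batch.append(item)
            if len(batch) >= self.batch_size:
                self._process_batch(batch, process, done)
                batch = []
        if batch:
            self._process_batch(batch, process, done)

    def _process_batch(self, batch, process, done) -> None:
        for item in batch:
            try:
                process(item)
                done.add(item)
            except Exception as exc:
                name = type(exc).__name__
                self.errors[name] = self.errors.get(name, 0) + 1
        # Checkpoint once per batch so an interrupted run can resume.
        self._save_done(done)
```

A team supplies only the item source and the process function; batch size (and, in the real framework, concurrency and scheduling) lives in configuration rather than in each script.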

2. Budget backfills with the same care as traffic

We introduced the idea of a backfill budget:

  • how much database, queue, or external API capacity we are willing to spend
  • what share of capacity the backfill may consume during peak vs off-peak hours

The framework enforces:

  • configurable rate limits
  • adaptive backoff when key metrics (latency, error rates) cross thresholds

We treat backfill-induced metric changes as regressions, not "just background work."
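
As an illustration of how enforcement can work (a sketch, not our actual enforcement code), a budget reduces to two knobs: a target rate, and a health signal that cuts it. The thresholds and multipliers below are assumptions.

```python
# Illustrative sketch: a rate limiter whose rate is cut when an observed
# health metric (here, p99 latency) crosses a threshold, and restored
# slowly once it recovers. Thresholds and multipliers are made up.
import time

class AdaptiveThrottle:
    def __init__(self, base_rate: float, latency_threshold_ms: float) -> None:
        self.base_rate = base_rate   # items/sec the budget allows
        self.rate = base_rate
        self.latency_threshold_ms = latency_threshold_ms

    def observe(self, p99_latency_ms: float) -> None:
        if p99_latency_ms > self.latency_threshold_ms:
            self.rate = max(self.rate * 0.5, 1.0)             # back off hard
        else:
            self.rate = min(self.rate * 1.1, self.base_rate)  # recover slowly

    def wait(self) -> None:
        # Called once per item; paces the loop to the current rate.
        time.sleep(1.0 / self.rate)
```

Peak versus off-peak budgets then become nothing more than a different base_rate per time window.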

3. Align with queues and streams

Instead of hitting databases and services directly, backfills often:

  • enqueue work onto existing queues, at controlled rates
  • read from or write to streams with explicit partitions

This lets them reuse existing scaling and isolation mechanisms.
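
For example (a hedged sketch; the payload shape and rate are illustrative), a queue-aligned backfill can be little more than a rate-limited producer:

```python
# Illustrative sketch: the backfill never touches the hot database itself.
# It feeds item IDs into an existing work queue at a bounded rate, and the
# normal consumers, with their usual scaling and isolation, do the writes.
import queue
import time

def enqueue_backfill(item_ids, work_queue: queue.Queue,
                     max_per_second: float = 50.0) -> None:
    interval = 1.0 / max_per_second
    for item_id in item_ids:
        work_queue.put({"type": "backfill", "id": item_id})  # reuse the normal job shape
        time.sleep(interval)
```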

We avoid backfills that:

  • open many direct connections to hot databases
  • bypass normal application paths entirely

4. Make progress and impact observable

The framework emits metrics:

  • items processed per unit time
  • remaining items
  • error rates by error type

We also chart impact on:

  • database load
  • queue depths
  • key service latencies

Each backfill has:

  • a small dashboard combining its own metrics and the impacted services
  • a runbook entry that explains how to slow down, pause, or roll back
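
The reporting side can be as small as the sketch below. The emit callable stands in for whatever metrics client a service already uses; all metric names here are illustrative.

```python
# Illustrative sketch of per-batch reporting; these are the signals the
# backfill dashboards are built from.
from typing import Callable

def report_progress(emit: Callable[[str, float], None],
                    processed: int, remaining: int,
                    errors: dict[str, int]) -> None:
    emit("backfill.items_processed", processed)
    emit("backfill.items_remaining", remaining)
    for error_type, count in errors.items():
        emit(f"backfill.errors.{error_type}", count)
```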

5. Practice on non-critical data

Before running a new backfill on critical data, teams:

  • try it on lower-importance tables or entities
  • verify that throttling, backoff, and resumption behave as expected

We also rehearse "stop in the middle" scenarios:

  • intentionally pause a backfill
  • confirm the system stays in a consistent, recoverable state
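
A rehearsal can be as small as the sketch below, which reuses the hypothetical BackfillRunner from step 1: interrupt a run halfway, resume, and assert that every item was processed exactly once.

```python
# Illustrative rehearsal of a "stop in the middle" scenario, using the
# BackfillRunner sketch from step 1.
import os

def test_pause_and_resume() -> None:
    # Start from a clean slate so the checkpoint reflects only this run.
    if os.path.exists("rehearsal.checkpoint.json"):
        os.remove("rehearsal.checkpoint.json")

    items = [str(i) for i in range(1000)]
    seen: list[str] = []

    def process(item: str) -> None:
        if len(seen) == 500:
            raise KeyboardInterrupt  # simulate an operator pausing the job
        seen.append(item)

    runner = BackfillRunner("rehearsal", batch_size=50)
    try:
        runner.run(items, process)
    except KeyboardInterrupt:
        pass  # stopped midway; the checkpoint holds the completed batches

    # Resuming must pick up only the unprocessed items.
    runner.run(items, seen.append)
    assert sorted(seen) == sorted(items)  # each item processed exactly once
```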

Results / Measurements

After rolling out the framework, we saw:

  • fewer incidents where "someone’s backfill" showed up as the root cause
  • more predictable timelines for long-running jobs
  • better reuse of patterns across teams

Backfills are still work; they still require planning.

But the default is now:

  • use the framework
  • respect capacity budgets
  • have dashboards and runbooks

Takeaways

  • Backfills are part of the production story; they deserve budgets, observability, and tooling.
  • A small framework can encode hard-earned lessons and prevent each team from rediscovering them.
  • Treating backfills as first-class jobs reduces the chance that a "one-time script" becomes the next incident.
