Reliability · November 21, 2023 · by Storecode

Incident report: Misrouted background jobs after a refactor

A refactor changed how we routed jobs between queues and regions. Some jobs began running in the wrong place. We describe how we found it and what we changed.

Tags: reliability, background-jobs, queues, configuration

Summary

On November 21, 2023, a refactor of our background job routing logic caused a subset of jobs to be enqueued into the wrong queues and, in some cases, the wrong regions.

Jobs still ran. They just didn't always run where we expected.

The immediate symptoms were subtle:

  • some jobs took much longer than usual to complete
  • a few region-specific tasks ran in the wrong data center

The risk was higher than the initial symptoms suggested:

  • jobs might have violated data locality expectations
  • follow-up work that relied on region affinity behaved inconsistently

We detected the issue within a few hours, constrained the blast radius, and then corrected the routing logic.

Impact

  • Duration: roughly 2 hours from deployment of the refactor to rollback, and about 3 hours to incident close.
  • User impact:
    • some background tasks (like report generation and non-critical notifications) were delayed well beyond their usual completion times
    • no direct user-facing data integrity issues were observed
  • Internal impact:
    • increased effort in verifying that misrouted jobs had completed correctly
    • additional checks on data locality and compliance for affected jobs

We did not find evidence of data crossing boundaries it wasn’t allowed to cross, but the incident surfaced places where those guarantees were more implicit than explicit.

Timeline

All times local.

  • 09:12 — A deployment containing refactored job routing logic reaches production.
  • 09:40 — Metrics show a slight increase in average job age for certain queues, but still within alert thresholds.
  • 10:05 — An engineer notices an unexpected pattern in job metrics: region A queues are relatively idle while region B queues are busier than normal.
  • 10:17 — A manual spot-check reveals jobs with region A identifiers running in region B.
  • 10:22 — Incident channel opened. On-call for the job system and a partner team join.
  • 10:30 — The team confirms that a subset of routing rules is using a default region when a new field is missing or mis-parsed.
  • 10:38 — Temporary mitigation: tighten filters so that only jobs explicitly marked as cross-region are allowed to route to the default.
  • 10:50 — New deployment prepared to roll back the routing refactor while preserving unrelated fixes.
  • 11:05 — Rollback completed. New routing metrics show queues behaving as expected.
  • 11:20 — Additional checks confirm that misrouted jobs have completed and that no jobs are stuck in limbo.
  • 12:15 — Incident closed with follow-ups.

Root cause

The refactor changed how routing keys were computed for background jobs.

Previously, job routing was based on a small set of fields:

  • job type
  • target region
  • priority

The refactor introduced:

  • a more generic routing function that accepted a job payload and returned a destination
  • fallback behavior for missing or malformed fields

The intended behavior was:

  • jobs with explicit region affinity go to region-specific queues
  • region-agnostic jobs go to shared queues

Due to a bug in how we parsed and validated job metadata:

  • some jobs that should have been region-specific looked region-agnostic to the router
  • they were routed to default queues, sometimes in a different region
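
A minimal sketch of the failure mode, with hypothetical field names and a made-up default region (the real bug was in how we parsed and validated metadata, not necessarily a single renamed field):

```python
from dataclasses import dataclass

DEFAULT_REGION = "region-b"  # illustrative default, not our actual configuration


@dataclass
class Job:
    job_type: str
    priority: str
    metadata: dict  # after the refactor, routing fields travel in a generic payload


def route_before(job_type: str, target_region: str, priority: str) -> str:
    """Pre-refactor routing: a few explicit fields, no fallback."""
    return f"{target_region}:{job_type}:{priority}"


def route_after(job: Job) -> str:
    """Post-refactor routing (buggy): silently falls back to a default region."""
    # Illustrative bug: the producer wrote "target_region" but the router read
    # "region", so a region-specific job looked region-agnostic.
    region = job.metadata.get("region", DEFAULT_REGION)
    return f"{region}:{job.job_type}:{job.priority}"


if __name__ == "__main__":
    job = Job("report-generation", "low", {"target_region": "region-a"})
    print(route_before("report-generation", "region-a", "low"))  # region-a:report-generation:low
    print(route_after(job))                                      # region-b:report-generation:low
```

The job's intent never changed; only the router's view of it did, which is why jobs still ran, just not always where we expected.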

Contributing factors:

  • Weak tests. Tests validated that jobs reached "a" queue, but not always the correct one.
  • Underspecified contracts. Job producers and the router disagreed on which fields were required vs optional.
  • Limited observability. Our dashboards showed aggregate job counts by queue, but not clear breakdowns by intended vs actual region.

What we changed

1. Tighten routing contracts

We clarified and enforced the contract between job producers and the routing layer:

  • which fields are required (e.g., region, tenant)
  • which are optional
  • what happens when required fields are missing

We changed the routing code so that:

  • jobs missing required routing metadata fail fast with clear errors
  • no job silently falls back to a default region unless explicitly allowed
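
A minimal sketch of the stricter contract, assuming hypothetical field names and an illustrative exception type rather than our production code:

```python
from dataclasses import dataclass

REQUIRED_FIELDS = ("region", "tenant")  # illustrative required routing fields


class RoutingMetadataError(Exception):
    """Raised when a job is missing required routing metadata."""


@dataclass
class RoutingDecision:
    queue: str
    region: str


def route(job_type: str, metadata: dict, allow_default_region: bool = False) -> RoutingDecision:
    """Route a job, failing fast instead of silently defaulting."""
    missing = [field for field in REQUIRED_FIELDS if not metadata.get(field)]
    if missing:
        if not allow_default_region:
            # Fail fast with a clear error rather than guessing a destination.
            raise RoutingMetadataError(
                f"job_type={job_type} missing required routing fields: {missing}"
            )
        # Only jobs explicitly marked as cross-region may use the shared default.
        return RoutingDecision(queue=f"shared:{job_type}", region="default")
    return RoutingDecision(queue=f"{metadata['region']}:{job_type}", region=metadata["region"])
```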

2. Improve tests

We added tests that assert:

  • specific job types with specific metadata end up in specific queues
  • missing or malformed metadata leads to explicit errors, not silent defaults

We also introduced end-to-end tests in a lower environment that generate jobs and verify:

  • which queues they land in
  • how quickly they start and finish
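
A pytest-style sketch of the unit-level assertions, written against the hypothetical route helper from the previous sketch:

```python
import pytest

# Hypothetical module containing the route sketch shown earlier.
from routing import RoutingMetadataError, route


def test_region_specific_job_lands_in_region_queue():
    decision = route("report-generation", {"region": "region-a", "tenant": "t1"})
    assert decision.queue == "region-a:report-generation"


def test_missing_region_fails_fast_instead_of_defaulting():
    with pytest.raises(RoutingMetadataError):
        route("report-generation", {"tenant": "t1"})


def test_explicit_cross_region_job_may_use_shared_queue():
    decision = route("cleanup", {"tenant": "t1"}, allow_default_region=True)
    assert decision.queue == "shared:cleanup"
```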

3. Enhance observability for routing

We extended metrics and logs around job routing:

  • counts of jobs by intended vs actual region
  • routing error rates by job type
  • dashboards that highlight mismatches

This makes it easier to spot unexpected patterns (e.g., one region’s queues going quiet while another spikes).
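
As a sketch of the idea, a counter keyed by intended vs. actual region is enough to compute a mismatch rate; in production this would be a real metrics client rather than an in-process Counter:

```python
from collections import Counter

# Keyed by (job_type, intended_region, actual_region).
routing_outcomes: Counter = Counter()


def record_routing(job_type: str, intended_region: str, actual_region: str) -> None:
    """Record where a job was supposed to run vs. where it was actually enqueued."""
    routing_outcomes[(job_type, intended_region, actual_region)] += 1


def mismatch_rate() -> float:
    """Fraction of jobs whose actual region differs from the intended one."""
    total = sum(routing_outcomes.values())
    mismatched = sum(
        count
        for (_, intended, actual), count in routing_outcomes.items()
        if intended != actual
    )
    return mismatched / total if total else 0.0
```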

4. Clarify data locality expectations

The incident raised a broader question: which jobs must stay within a region for compliance or performance reasons?

We cataloged:

  • jobs with strict data locality requirements
  • jobs that are region-agnostic

We then updated routing rules and documentation so the system reflects those expectations.
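
One lightweight way to keep that catalog executable is a registry the router consults; the job types and policies below are placeholders, not our real definitions:

```python
from enum import Enum


class LocalityPolicy(Enum):
    REGION_STRICT = "region_strict"      # must run in the job's home region
    REGION_AGNOSTIC = "region_agnostic"  # may run in any region / shared queues


# Placeholder catalog; in practice this is derived from our job definitions.
JOB_LOCALITY = {
    "report-generation": LocalityPolicy.REGION_STRICT,
    "non-critical-notification": LocalityPolicy.REGION_AGNOSTIC,
}


def may_route_cross_region(job_type: str) -> bool:
    """Unknown job types default to strict, so new jobs must opt in to cross-region routing."""
    policy = JOB_LOCALITY.get(job_type, LocalityPolicy.REGION_STRICT)
    return policy is LocalityPolicy.REGION_AGNOSTIC
```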

Follow-ups

Completed

  • Fixed the routing bug and rolled out stricter validation.
  • Added tests and dashboards for job routing correctness.
  • Documented data locality requirements for key job types.

Planned / in progress

  • Extend routing correctness checks to more job categories.
  • Add automated alerts when jobs are enqueued into unexpected queues or regions.
  • Review other refactors that touch routing or classification logic for similar risks.

Takeaways

  • Routing logic is part of the system contract; refactors need tests that check behavior, not just types.
  • Silent fallbacks are dangerous, especially when they cross regions or other boundaries.
  • Observability for "where work happens" matters as much as observability for "whether work happens."
  • Data locality requirements should be explicit and enforced in code, not assumed.