Incident report: Queue backlog from "just-in-time" jobs
A routine change to batch timing turned our job system into a bottleneck. We describe how a small shift in scheduling created a queue backlog and what we changed afterward.
Summary
On June 9, 2020, a change to "just-in-time" batch processing caused our background job queue to accumulate a large backlog.
The queue primarily handled non-user-facing work: settlement updates, analytics exports, and delayed notifications.
Under normal conditions, spikes in this queue were acceptable. During this incident, the backlog grew large enough that it began to delay work that users did notice: confirmation emails, status updates, and some scheduled tasks.
We mitigated by temporarily disabling one high-volume job type, manually replaying a subset of work, and then reverting part of the scheduling change. We later reworked how we schedule and isolate batch workloads.
This incident is worth recording because it shows how easy it is to turn a queue into a shared choke point, especially when trying to be more "efficient" with timing.
Impact
- Duration: approximately 2 hours and 40 minutes of degraded background processing.
- User-visible effects:
  - delayed confirmation emails (up to ~45 minutes at the peak)
  - lag in account status updates for some long-running operations
- Internal effects:
  - operations had to manually verify completion of some high-value tasks
  - support saw an increase in "did this actually complete?" questions
No data was lost. Some work was completed out of the expected order, which we reconciled once the backlog cleared.
Timeline
All times are in local time.
- 09:05 — A deployment changes the schedule of a set of batch jobs from "spread across the hour" to "run closer to when data is expected to arrive".
- 09:20 — Queue depth for the primary worker queue starts to climb above its normal morning baseline.
- 09:32 — An internal dashboard alert for "queue depth > 5x baseline" fires. On-call acknowledges but classifies it as "background only" based on the alert's description.
- 09:45 — Support reports a trickle of tickets about missing confirmation emails and "stuck" statuses. On-call correlates this with the queue alert and opens an incident channel.
- 09:52 — We confirm that the worker fleet is healthy (no mass failures, CPU and memory within normal ranges) but that average job age is increasing.
- 10:02 — A deeper look at the queue composition shows one job type has increased from ~10% to ~60% of the queue.
- 10:10 — We discover that the scheduling change made this high-volume job type fire in a narrow window instead of being spread across the hour.
- 10:17 — Mitigation step 1: temporarily disable enqueuing for the high-volume job type.
- 10:23 — Queue depth stops growing but is still well above baseline; confirmation-related jobs are still delayed.
- 10:31 — Mitigation step 2: scale up worker instances handling this queue and increase concurrency for the affected job types.
- 10:44 — Average job age begins to drop; confirmation delays shrink to under 10 minutes.
- 11:05 — Queue depth returns to within 2x normal baseline. Confirmation jobs are effectively back to normal.
- 11:15 — We re-enable the high-volume job type with its previous "spread across the hour" schedule and keep additional monitoring in place.
- 11:45 — Incident is closed with follow-ups captured.
Root cause
The immediate cause was a scheduling change that concentrated a large batch of "just-in-time" jobs in a short window.
Previously, several classes of batch jobs were scheduled to run periodically over the course of an hour:
- some every 5 minutes
- some every 15 minutes
- some once per hour but with a randomized offset
To make results feel more "real-time" and improve freshness, we moved certain jobs to run shortly after upstream data became available.
The change looked reasonable in isolation:
- jobs that previously ran at :00, :15, :30, :45 were moved to run around :05–:10
- the number of jobs per run was roughly the same, but now they landed in a tighter band
In practice, this stacked multiple high-volume job types into the same short interval.
Because these jobs shared a single worker pool and queue with other background work, they competed for the same capacity. Everything in that queue slowed down together.
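To make the failure mode concrete, here is a small back-of-the-envelope simulation. The numbers are made up for illustration, not our actual volumes: the same hourly job count, enqueued in a five-minute band instead of spread across the hour, produces a backlog that takes most of an hour to drain at a fixed worker rate.

```python
# Illustrative only: same hourly volume, two enqueue patterns, fixed drain rate.

def simulate(enqueue_per_minute, drain_per_minute=40, minutes=60):
    """Return the peak queue depth over the simulated window."""
    depth = 0
    peak = 0
    for minute in range(minutes):
        depth += enqueue_per_minute[minute]
        depth = max(0, depth - drain_per_minute)
        peak = max(peak, depth)
    return peak

total_jobs = 1800  # hypothetical hourly volume for the high-volume job type

# Old schedule: the volume is spread evenly across the hour (30 jobs/minute).
spread = [total_jobs // 60] * 60

# New schedule: the same volume lands in a five-minute "just-in-time" window.
stacked = [0] * 60
for minute in range(5, 10):
    stacked[minute] = total_jobs // 5

print("peak backlog, spread :", simulate(spread))   # 0: workers keep up
print("peak backlog, stacked:", simulate(stacked))  # 1600 jobs, ~40 minutes to drain
```

The total work per hour is unchanged in both cases; only the shape of the arrival curve differs, and that is enough to push the queue well past what the workers can absorb in the window.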
Contributing factors
- Shared queue for dissimilar workloads. Long-running analytics jobs, quick confirmation emails, and status updates all lived in the same queue and worker pool.
- No per-job-type SLOs. We had a rough expectation that "background jobs complete within 15 minutes" but nothing more granular per job type.
- Scheduling logic in multiple places. Some scheduling decisions lived in code, others in a configuration panel; we didn’t have a single view of "what fires when".
- Underestimated peak. Our test data did not reproduce the actual traffic profile for the high-volume job type.
What we changed
1. Separate queues for latency-sensitive work
We split the queue into at least two classes:
- Latency-sensitive background work (e.g., confirmation emails, user-visible status updates).
- Throughput-oriented work (e.g., analytics exports, bulk recomputations).
Each class has its own queue, its own worker processes, and its own scaling rules.
This means a burst of analytics work cannot delay confirmation emails; in the worst case, the analytics queue falls behind while the latency-sensitive queue continues to drain.
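As an illustration, here is a minimal routing sketch. The job-type and queue names are hypothetical; the real mapping lives in the worker framework's routing configuration rather than in a hand-rolled function like this.

```python
# A sketch of the queue split; job-type names below are examples, not our schema.

from enum import Enum

class QueueClass(Enum):
    LATENCY = "latency-sensitive"   # confirmation emails, status updates
    THROUGHPUT = "throughput"       # analytics exports, bulk recomputations

# Explicit mapping of job types to queue classes.
JOB_ROUTES = {
    "send_confirmation_email": QueueClass.LATENCY,
    "update_account_status": QueueClass.LATENCY,
    "export_analytics": QueueClass.THROUGHPUT,
    "recompute_settlements": QueueClass.THROUGHPUT,
}

def queue_for(job_type: str) -> QueueClass:
    # Unknown job types default to the throughput queue, so a new high-volume
    # job cannot silently land next to user-visible work.
    return JOB_ROUTES.get(job_type, QueueClass.THROUGHPUT)
```

Defaulting unrecognized job types to the throughput side is the conservative choice here: the worst outcome of a misrouted job is slower bulk work, not delayed user-visible work.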
2. Make scheduling visible and reviewable
We created a simple "job schedule" dashboard:
- rows: job types
- columns: time buckets across the hour or day
- cells: expected number of enqueued jobs per bucket
We refresh this using actual enqueue counts from the last N days, not just configuration.
Before changing schedules, we now look at this dashboard to see whether we are stacking too much work into a narrow window.
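A rough sketch of how the cell values can be computed from enqueue events follows. The event fields and bucket size are assumptions for illustration, not our actual schema.

```python
# Aggregate enqueue events (job type + timestamp) into minute-of-hour buckets.

from collections import defaultdict
from datetime import datetime

def schedule_matrix(enqueue_events, bucket_minutes=5):
    """Rows: job type. Columns: minute-of-hour bucket. Cells: enqueue count."""
    matrix = defaultdict(lambda: defaultdict(int))
    for event in enqueue_events:
        ts: datetime = event["enqueued_at"]
        bucket = (ts.minute // bucket_minutes) * bucket_minutes
        matrix[event["job_type"]][bucket] += 1
    return matrix

# Counts averaged over recent days make it obvious when a job type's volume
# has moved from an even spread into a single narrow band such as :05-:10.
```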
3. Limits and backpressure
We added basic protections to the worker system:
- per-job-type concurrency caps (so one job type cannot starve others)
- alerting on job age for latency-sensitive queues
- a simple circuit breaker for new enqueues of non-critical jobs when the backlog exceeds a threshold
This is deliberately unsophisticated. The point is not perfect fairness; it is to keep one "efficient" change from turning into a systemic backlog.
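As a sketch of the shape of these protections (thresholds, caps, and names are illustrative, not our production values):

```python
# Two simple protections: a per-job-type concurrency cap checked by workers,
# and a producer-side circuit breaker that sheds non-critical enqueues when
# the backlog is too deep. All values below are made up for illustration.

MAX_CONCURRENT = {"export_analytics": 10, "recompute_settlements": 5}
DEFAULT_CAP = 20
BACKLOG_BREAKER_THRESHOLD = 50_000  # queue depth above which we shed non-critical work

def may_start(job_type: str, running_count: int) -> bool:
    """Worker-side cap: don't start another job of this type past its limit."""
    return running_count < MAX_CONCURRENT.get(job_type, DEFAULT_CAP)

def may_enqueue(job_type: str, queue_depth: int, critical: bool) -> bool:
    """Producer-side breaker: non-critical enqueues are rejected on deep backlog."""
    if critical:
        return True
    return queue_depth < BACKLOG_BREAKER_THRESHOLD
```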
4. Operational runbook updates
We updated runbooks with explicit steps for "queue depth" incidents:
- Identify which job types make up the backlog (top N by count and age); a sketch of this step follows after this section.
- Check which queues are affected and whether latency-sensitive jobs are being delayed.
- Decide whether to:
  - disable enqueues for non-critical job types
  - scale up workers for the affected queues
  - revert scheduling changes
- Document any temporary disables and create follow-up tasks to reconcile.
We also added a checklist item for reviewers: "Does this change concentrate work into a new narrow window?"
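For the identification step, a minimal sketch that summarises a backlog by job type, count, and oldest job age is below; the pending-job fields are assumptions, not our actual schema.

```python
# Summarise a backlog so on-call can see what is actually filling the queue.
# Assumes each pending job has "job_type" and a timezone-aware "enqueued_at".

from collections import Counter
from datetime import datetime, timezone

def backlog_summary(pending_jobs, top_n=5):
    counts = Counter(job["job_type"] for job in pending_jobs)
    now = datetime.now(timezone.utc)
    oldest = {}
    for job in pending_jobs:
        age = (now - job["enqueued_at"]).total_seconds()
        oldest[job["job_type"]] = max(oldest.get(job["job_type"], 0), age)
    return [
        {"job_type": jt, "count": n, "oldest_age_s": int(oldest[jt])}
        for jt, n in counts.most_common(top_n)
    ]
```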
Follow-ups
Completed
- Split the background worker system into separate queues for latency-sensitive and bulk work.
- Added dashboards for job age and queue depth per queue and per job type.
- Created alerts for when average job age for latency-sensitive queues exceeds a small threshold.
Planned / in progress
- Move more of the scheduling configuration into a single place so we can see and review it coherently.
- Introduce small load tests that simulate peak enqueue rates for the highest-volume job types.
- Define explicit SLOs for key background flows (e.g., confirmation emails sent within X minutes, status updates reflected within Y minutes).
What we’d do differently
The mistake was not "caring about freshness." It was changing timing without treating the queue as a shared resource with its own capacity.
Next time we will:
- review not just "how often" a job runs, but how many units of work it creates per run
- check the combined effect of multiple schedule changes on shared queues
- test the worst-case stack-up, not just the average case
Takeaways
- Queues are not infinite buffers; they are shared resources with capacity and trade-offs.
- "Just-in-time" scheduling can be an improvement, but only if you understand what else is trying to be "just-in-time" in the same window.
- Separate latency-sensitive background work from bulk work so they cannot starve each other.
- Make job schedules visible and reviewable; don’t let them live only in code or muscle memory.