RELIABILITY · 2025-12-05 · BY ELI NAVARRO

Incident report: Runaway background sync after clock skew

Clock skew between systems turned a cautious background sync into a runaway loop. We describe how time assumptions failed and what we changed.

reliability · time · background-jobs · incidents

Summary

On December 5, 2025, a set of background sync jobs went into a runaway loop.

The jobs were designed to:

  • sync changes between systems based on "last updated" timestamps
  • run periodically and idempotently

A clock skew between two systems meant that:

  • the jobs kept seeing records as "new" or "updated"
  • they reprocessed the same data repeatedly

This caused:

  • elevated load on both systems
  • longer sync times
  • a backlog of work that delayed legitimate updates

We treated this as an incident in how we handle time across systems.

Impact

  • Duration: several hours of elevated load and delayed syncs.
  • User impact:
    • some users saw stale data for longer than expected
    • certain "eventually consistent" flows took much longer to converge
  • Internal impact:
    • on-call engineers spent time identifying and stopping the runaway loop
    • follow-up work to clean up duplicated work and verify consistency

Timeline

All times local.

  • 01:20 — A routine time sync process on one system fails silently, leaving clocks skewed by several minutes.
  • 02:05 — Background sync jobs begin a new run, using timestamps to identify changed records.
  • 02:30 — Metrics show higher-than-usual work per run and rising CPU on both systems.
  • 03:10 — On-call is paged for elevated load and growing job backlogs.
  • 03:35 — Investigation reveals that many records are being re-synced repeatedly.
  • 04:05 — We identify inconsistent timestamps between systems as the root cause.
  • 04:20 — Jobs are paused; time synchronization is fixed.
  • 05:15 — Jobs are restarted with corrected configuration; backlog begins to drain.
  • 08:00 — Backlog cleared; metrics return to baseline.

Root cause

The immediate cause was clock skew between the two systems being synced (call them A and B).

The deeper causes were:

  • assuming wall-clock time was consistent and trustworthy across systems
  • using timestamps as the sole mechanism for detecting changes

Contributing factors

  • Silent time sync failure. Our monitoring did not treat time sync failures as critical.
  • No sequence or versioning field. We lacked a monotonic sequence or version number to detect changes.
  • Simplistic change detection logic. The jobs treated any record with a timestamp greater than the last seen value as "new." Clock skew made that logic loop.
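
For illustration, here is one plausible shape of that loop as a small Python sketch. Nothing in it is our production code: the amount of skew, the write-back behavior that re-stamps records, and every name in it are assumptions made for the example.

    from datetime import datetime, timedelta, timezone

    # Assumed for illustration: system A's clock runs five minutes fast.
    SKEW = timedelta(minutes=5)

    def clock_a():                                # skewed clock on system A
        return datetime.now(timezone.utc) + SKEW

    def clock_b():                                # healthy clock on system B
        return datetime.now(timezone.utc)

    record = {"id": 42, "updated_at": clock_a()}  # stamped by system A
    cursor = clock_b()                            # "last seen" value kept by the job

    def sync_once():
        global cursor, record
        # Naive change detection: any timestamp greater than the cursor is "new".
        if record["updated_at"] > cursor:
            print("re-processing record", record["id"])
            # Writing the merged result back re-stamps the record with A's fast
            # clock, so it lands ahead of the cursor again on the next run.
            record = {**record, "updated_at": clock_a()}
        cursor = clock_b()

    for _ in range(3):
        sync_once()  # prints "re-processing record 42" on every run

The cursor, taken from the healthy clock, never catches up to timestamps written with the fast clock, so the same records are selected run after run.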

What we changed

1. Treat time sync as a first-class dependency

We:

  • added monitoring and alerts for time synchronization on critical systems
  • treated significant skew as an operational incident

Background jobs that depend on time now:

  • check time sync health before running
  • fail fast or degrade safely when time is unreliable
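
As a sketch of what "check time sync health before running" can look like, the example below compares the local clock against an NTP reference using the third-party ntplib package and fails fast when the offset is too large. The threshold, server, and function names are assumptions, not our actual configuration.

    import ntplib  # third-party package: pip install ntplib

    MAX_SKEW_SECONDS = 2.0       # assumed tolerance for this example
    NTP_SERVER = "pool.ntp.org"  # assumed reference server

    def clock_is_healthy() -> bool:
        """Return True when the local clock is within tolerance of NTP time."""
        try:
            response = ntplib.NTPClient().request(NTP_SERVER, version=3, timeout=5)
        except Exception:
            # If the reference is unreachable, treat time as unreliable.
            return False
        return abs(response.offset) <= MAX_SKEW_SECONDS

    def run_sync_job():
        if not clock_is_healthy():
            # Fail fast rather than sync with an untrustworthy clock.
            raise RuntimeError("clock unhealthy; skipping this sync run")
        # ... normal sync work goes here ...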

2. Introduce explicit change markers

We added explicit change markers for sync:

  • version counters or sequence IDs
  • last processed markers stored per job

Jobs now:

  • rely on these markers rather than raw timestamps where possible
  • can resume safely even if clocks drift briefly
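
A minimal sketch of marker-based change detection, assuming a source table with a monotonically increasing version column and a per-job cursor table. sqlite3 stands in for whatever the real systems use, and every identifier here is hypothetical.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE records (
            id      INTEGER PRIMARY KEY,
            payload TEXT,
            version INTEGER NOT NULL          -- bumped monotonically on every change
        );
        CREATE TABLE sync_cursors (
            job          TEXT PRIMARY KEY,
            last_version INTEGER NOT NULL     -- last processed marker, stored per job
        );
    """)

    def apply_to_destination(record_id, payload):
        ...  # hypothetical: write the change to the other system

    def sync_once(job):
        row = db.execute("SELECT last_version FROM sync_cursors WHERE job = ?",
                         (job,)).fetchone()
        cursor = row[0] if row else 0

        # Select by version, not by wall-clock timestamp, so clock drift is irrelevant.
        changed = db.execute(
            "SELECT id, payload, version FROM records WHERE version > ? ORDER BY version",
            (cursor,)).fetchall()

        for record_id, payload, version in changed:
            apply_to_destination(record_id, payload)
            cursor = version                  # advance only after successful processing

        db.execute("INSERT OR REPLACE INTO sync_cursors (job, last_version) VALUES (?, ?)",
                   (job, cursor))
        db.commit()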

3. Harden sync logic

We updated sync jobs to:

  • detect and log suspicious patterns (e.g., seeing the same records as "new" repeatedly)
  • cap the amount of work they will do per run for a given record set

In suspicious cases, jobs:

  • slow down
  • raise alerts
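
The guard rails might look roughly like the sketch below: it caps the work accepted per run and flags records that keep reappearing as "new". The thresholds and the process, alert, and backoff hooks are placeholders invented for the example.

    from collections import Counter

    MAX_RECORDS_PER_RUN = 1000   # assumed cap for this example
    SUSPICIOUS_REPEATS = 3       # same record seen as "new" this many runs in a row

    seen_as_new = Counter()      # record id -> consecutive runs flagged as "new"

    def guarded_sync(changed_records, process, alert, backoff):
        """Process at most MAX_RECORDS_PER_RUN records and flag repeat offenders."""
        batch = changed_records[:MAX_RECORDS_PER_RUN]
        if len(changed_records) > MAX_RECORDS_PER_RUN:
            alert(f"sync run truncated at {MAX_RECORDS_PER_RUN} of "
                  f"{len(changed_records)} candidate records")

        repeats = 0
        for record in batch:
            seen_as_new[record["id"]] += 1
            if seen_as_new[record["id"]] >= SUSPICIOUS_REPEATS:
                repeats += 1
            process(record)

        # Forget records that did not show up this run.
        current_ids = {r["id"] for r in batch}
        for record_id in list(seen_as_new):
            if record_id not in current_ids:
                del seen_as_new[record_id]

        if repeats:
            alert(f"{repeats} records re-detected as 'new' for "
                  f"{SUSPICIOUS_REPEATS}+ consecutive runs")
            backoff()  # e.g. lengthen the interval before the next run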

4. Improve observability

We added metrics for:

  • number of records processed per run
  • number of re-processed records
  • time between source updates and sync completion

This makes it easier to spot runaway behavior early.
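
With the Prometheus Python client, for instance, those three metrics could be registered roughly as follows; the metric names and labels are assumptions for the example, not our real dashboards.

    from prometheus_client import Counter, Histogram

    RECORDS_PROCESSED = Counter(
        "sync_records_processed_total",
        "Records processed per sync run", ["job"])
    RECORDS_REPROCESSED = Counter(
        "sync_records_reprocessed_total",
        "Records seen as changed more than once", ["job"])
    SYNC_LAG_SECONDS = Histogram(
        "sync_lag_seconds",
        "Seconds between a source update and sync completion", ["job"])

    def record_run_metrics(job, processed, reprocessed, lag_seconds):
        RECORDS_PROCESSED.labels(job=job).inc(processed)
        RECORDS_REPROCESSED.labels(job=job).inc(reprocessed)
        for lag in lag_seconds:
            SYNC_LAG_SECONDS.labels(job=job).observe(lag)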

Follow-ups

Completed

  • Fixed time sync and cleared backlogs.
  • Added time sync monitoring and alerts.
  • Introduced explicit change markers for the affected flows.

Planned / in progress

  • Extend these patterns to other cross-system syncs.
  • Review jobs that rely heavily on wall-clock assumptions.

Takeaways

  • Time is a shared dependency; clock skew can turn safe jobs into runaway loops.
  • Timestamps alone are a fragile basis for change detection.
  • Monitoring for both time sync and suspicious job behavior helps catch these issues before they grow.
