RELIABILITY · 2025-12-05 · BY ELI NAVARRO
Incident report: Runaway background sync after clock skew
Clock skew between systems turned a cautious background sync into a runaway loop. We describe how time assumptions failed and what we changed.
reliability · time · background-jobs · incidents
Summary
On December 5, 2025, a set of background sync jobs went into a runaway loop.
The jobs were designed to:
- sync changes between systems based on "last updated" timestamps
- run periodically and idempotently
Clock skew between the two systems meant that:
- the jobs kept seeing records as "new" or "updated"
- they reprocessed the same data repeatedly
This caused:
- elevated load on both systems
- longer sync times
- a backlog of work that delayed legitimate updates
We treated this as an incident in how we handle time across systems.
Impact
- Duration: several hours of elevated load and delayed syncs.
- User impact:
  - some users saw stale data for longer than expected
  - certain "eventually consistent" flows took much longer to converge
- Internal impact:
  - on-call engineers spent time identifying and stopping the runaway loop
  - follow-up work to clean up duplicated work and verify consistency
Timeline
All times local.
- 01:20 — A routine time sync process on one system fails silently, leaving clocks skewed by several minutes.
- 02:05 — Background sync jobs begin a new run, using timestamps to identify changed records.
- 02:30 — Metrics show higher-than-usual work per run and rising CPU on both systems.
- 03:10 — On-call is paged for elevated load and growing job backlogs.
- 03:35 — Investigation reveals that many records are being re-synced repeatedly.
- 04:05 — We identify inconsistent timestamps between systems as the root cause.
- 04:20 — Jobs are paused; time synchronization is fixed.
- 05:15 — Jobs are restarted with corrected configuration; backlog begins to drain.
- 08:00 — Backlog cleared; metrics return to baseline.
Root cause
The immediate cause was a clock skew between systems A and B.
The deeper causes were:
- assuming wall-clock time was consistent and trustworthy across systems
- using timestamps as the sole mechanism for detecting changes
Contributing factors
- Silent time sync failure. Our monitoring did not treat time sync failures as critical.
- No sequence or versioning field. We lacked a monotonic sequence or version number to detect changes.
- Simplistic change detection logic. The jobs treated any record with a timestamp greater than the last seen value as "new." Clock skew made that logic loop (a sketch of the pattern follows this list).
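To make that failure mode concrete, here is a minimal sketch of timestamp-only change detection under clock skew. The names and values are illustrative, not our actual code; it only shows why a record stamped by a fast clock keeps looking "new" to a job that tracks progress on a slower clock.

```python
from datetime import datetime, timedelta

def select_changed(records, last_seen):
    # Timestamp-only change detection: anything newer than the marker is "new".
    return [r for r in records if r["updated_at"] > last_seen]

# Illustrative setup: system A's clock runs ~5 minutes ahead of system B's.
skew = timedelta(minutes=5)
clock_b = datetime(2025, 12, 5, 2, 5)
records = [{"id": 1, "updated_at": clock_b + skew}]  # stamped by A's fast clock

for run in range(3):
    last_seen = clock_b  # the job tracks progress using B's wall clock
    print(run, select_changed(records, last_seen))  # same record is "new" every run
    clock_b += timedelta(minutes=1)  # a minute passes between runs
```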
What we changed
1. Treat time sync as a first-class dependency
We:
- added monitoring and alerts for time synchronization on critical systems
- treated significant skew as an operational incident
Background jobs that depend on time now:
- check time sync health before running
- fail fast or degrade safely when time is unreliable (see the sketch below)
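As a rough illustration, a pre-run guard might look like the following. How the clock offset is obtained is environment-specific (for example, from the host's NTP or chrony daemon), so it is passed in here as an assumed get_offset_seconds callable, and the threshold is illustrative rather than our production value.

```python
MAX_SKEW_SECONDS = 2.0  # illustrative threshold

class ClockUnreliable(RuntimeError):
    """Raised when the local clock cannot be trusted for sync decisions."""

def require_healthy_clock(get_offset_seconds):
    # get_offset_seconds is assumed to report the offset (in seconds) from a
    # trusted time reference; returning None means "health unknown".
    offset = get_offset_seconds()
    if offset is None or abs(offset) > MAX_SKEW_SECONDS:
        raise ClockUnreliable(f"clock offset {offset!r}s exceeds {MAX_SKEW_SECONDS}s")

def run_sync_job(get_offset_seconds, do_sync, alert):
    try:
        require_healthy_clock(get_offset_seconds)
    except ClockUnreliable as exc:
        # Degrade safely: skip this run and alert, rather than sync on bad time.
        alert(str(exc))
        return "skipped"
    return do_sync()
```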
2. Introduce explicit change markers
We added explicit change markers for sync:
- version counters or sequence IDs
- last processed markers stored per job
Jobs now:
- rely on these markers rather than raw timestamps where possible
- can resume safely even if clocks drift briefly (sketched below)
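A minimal sketch of sequence-based change detection with a per-job marker, assuming each record carries a monotonic seq value and the marker store is persisted between runs; the names and the process helper are illustrative.

```python
def process(record):
    ...  # hypothetical per-record sync work

def sync_once(records, marker_store, job_name="example-sync"):
    # The per-job marker is the highest sequence number fully processed.
    # It only moves forward, so clock drift cannot make old work reappear.
    last_seq = marker_store.get(job_name, 0)
    changed = [r for r in records if r["seq"] > last_seq]
    for record in sorted(changed, key=lambda r: r["seq"]):
        process(record)
        last_seq = record["seq"]
    marker_store[job_name] = last_seq
    return len(changed)

# Usage with an in-memory marker store (a real job would persist this):
markers = {}
rows = [{"id": 1, "seq": 11}, {"id": 2, "seq": 12}]
print(sync_once(rows, markers))   # 2 records processed
print(sync_once(rows, markers))   # 0: nothing is re-processed on the next run
```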
3. Harden sync logic
We updated sync jobs to:
- detect and log suspicious patterns (e.g., seeing the same records as "new" repeatedly)
- cap the amount of work they will do per run for a given record set
In suspicious cases, jobs:
- slow down
- raise alerts (see the sketch below)
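One way to express those guardrails, sketched with illustrative thresholds and a hypothetical alert/backoff interface rather than our actual job framework:

```python
from collections import Counter

MAX_RECORDS_PER_RUN = 10_000   # illustrative cap on work per run
REPEAT_ALERT_THRESHOLD = 3     # runs a record may look "new" before we alert

reseen = Counter()  # record id -> number of runs it has appeared as changed

def guarded_batch(changed_records, alert, backoff):
    # Cap per-run work so a bad marker cannot turn into unbounded load.
    batch = changed_records[:MAX_RECORDS_PER_RUN]

    repeats = []
    for record in batch:
        reseen[record["id"]] += 1
        if reseen[record["id"]] >= REPEAT_ALERT_THRESHOLD:
            repeats.append(record["id"])

    if repeats:
        # The same records keep showing up as "new": slow down and page
        # someone instead of silently reprocessing them forever.
        alert(f"{len(repeats)} records re-synced {REPEAT_ALERT_THRESHOLD}+ times")
        backoff()

    return batch
```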
4. Improve observability
We added metrics for:
- number of records processed per run
- number of re-processed records
- time between source updates and sync completion
This makes it easier to spot runaway behavior early.
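A rough sketch of emitting those metrics; emit_metric stands in for whatever metrics client is actually in use (StatsD, Prometheus, etc.), and the metric names are illustrative.

```python
import time

def emit_metric(name, value):
    # Stand-in for the real metrics client; here we just print.
    print(f"{name}={value}")

def report_run_metrics(processed_ids, previously_processed_ids, source_update_times):
    emit_metric("sync.records_processed", len(processed_ids))
    emit_metric("sync.records_reprocessed",
                len(set(processed_ids) & set(previously_processed_ids)))
    if source_update_times:
        # Lag between the newest source update in this batch and sync completion,
        # with source_update_times given as Unix timestamps.
        emit_metric("sync.lag_seconds", time.time() - max(source_update_times))
```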
Follow-ups
Completed
- Fixed time sync and cleared backlogs.
- Added time sync monitoring and alerts.
- Introduced explicit change markers for the affected flows.
Planned / in progress
- Extend these patterns to other cross-system syncs.
- Review jobs that rely heavily on wall-clock assumptions.
Takeaways
- Time is a shared dependency; clock skew can turn safe jobs into runaway loops.
- Timestamps alone are a fragile basis for change detection.
- Monitoring for both time sync and suspicious job behavior helps catch these issues before they grow.