RELIABILITY · 2025-12-05 · BY ELI NAVARRO

Incident report: Runaway background sync after clock skew

Clock skew between systems turned a cautious background sync into a runaway loop. We describe how time assumptions failed and what we changed.

reliability · time · background-jobs · incidents

Summary

On December 5, 2025, a set of background sync jobs went into a runaway loop.

The jobs were designed to:

  • sync changes between systems based on "last updated" timestamps
  • run periodically and idempotently

A clock skew between two systems meant that:

  • the jobs kept seeing records as "new" or "updated"
  • they reprocessed the same data repeatedly

This caused:

  • elevated load on both systems
  • longer sync times
  • a backlog of work that delayed legitimate updates

We treated this as an incident in how we handle time across systems.

Impact

  • Duration: several hours of elevated load and delayed syncs.
  • User impact:
    • some users saw stale data for longer than expected
    • certain "eventually consistent" flows took much longer to converge
  • Internal impact:
    • on-call engineers spent time identifying and stopping the runaway loop
    • follow-up work to clean up duplicated work and verify consistency

Timeline

All times local.

  • 01:20 — A routine time sync process on one system fails silently, leaving clocks skewed by several minutes.
  • 02:05 — Background sync jobs begin a new run, using timestamps to identify changed records.
  • 02:30 — Metrics show higher-than-usual work per run and rising CPU on both systems.
  • 03:10 — On-call is paged for elevated load and growing job backlogs.
  • 03:35 — Investigation reveals that many records are being re-synced repeatedly.
  • 04:05 — We identify inconsistent timestamps between systems as the root cause.
  • 04:20 — Jobs are paused; time synchronization is fixed.
  • 05:15 — Jobs are restarted with corrected configuration; backlog begins to drain.
  • 08:00 — Backlog cleared; metrics return to baseline.

Root cause

The immediate cause was clock skew between the two systems being synced (call them A and B).

The deeper causes were:

  • assuming wall-clock time was consistent and trustworthy across systems
  • using timestamps as the sole mechanism for detecting changes

Contributing factors

  • Silent time sync failure. Our monitoring did not treat time sync failures as critical.
  • No sequence or versioning field. We lacked a monotonic sequence or version number to detect changes.
  • Simplistic change detection logic. The jobs treated any record with a timestamp greater than the last seen value as "new." Clock skew made that logic loop.
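
For illustration, here is one plausible shape of that loop as a small Python sketch. Nothing in it is our production code: the amount of skew, the write-back behavior that re-stamps records, and every name in it are assumptions made for the example.

    from datetime import datetime, timedelta, timezone

    # Assumed for illustration: system A's clock runs five minutes fast.
    SKEW = timedelta(minutes=5)

    def clock_a():                                # skewed clock on system A
        return datetime.now(timezone.utc) + SKEW

    def clock_b():                                # healthy clock on system B
        return datetime.now(timezone.utc)

    record = {"id": 42, "updated_at": clock_a()}  # stamped by system A
    cursor = clock_b()                            # "last seen" value kept by the job

    def sync_once():
        global cursor, record
        # Naive change detection: any timestamp greater than the cursor is "new".
        if record["updated_at"] > cursor:
            print("re-processing record", record["id"])
            # Writing the merged result back re-stamps the record with A's fast
            # clock, so it lands ahead of the cursor again on the next run.
            record = {**record, "updated_at": clock_a()}
        cursor = clock_b()

    for _ in range(3):
        sync_once()  # prints "re-processing record 42" on every run

The cursor, taken from the healthy clock, never catches up to timestamps written with the fast clock, so the same records are selected run after run.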

What we changed

1. Treat time sync as a first-class dependency

We:

  • added monitoring and alerts for time synchronization on critical systems
  • treated significant skew as an operational incident

Background jobs that depend on time now:

  • check time sync health before running
  • fail fast or degrade safely when time is unreliable
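
As a sketch of what "check time sync health before running" can look like, the example below compares the local clock against an NTP reference using the third-party ntplib package and fails fast when the offset is too large. The threshold, server, and function names are assumptions, not our actual configuration.

    import ntplib  # third-party package: pip install ntplib

    MAX_SKEW_SECONDS = 2.0       # assumed tolerance for this example
    NTP_SERVER = "pool.ntp.org"  # assumed reference server

    def clock_is_healthy() -> bool:
        """Return True when the local clock is within tolerance of NTP time."""
        try:
            response = ntplib.NTPClient().request(NTP_SERVER, version=3, timeout=5)
        except Exception:
            # If the reference is unreachable, treat time as unreliable.
            return False
        return abs(response.offset) <= MAX_SKEW_SECONDS

    def run_sync_job():
        if not clock_is_healthy():
            # Fail fast rather than sync with an untrustworthy clock.
            raise RuntimeError("clock unhealthy; skipping this sync run")
        # ... normal sync work goes here ...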

2. Introduce explicit change markers

We added explicit change markers for sync:

  • version counters or sequence IDs
  • last processed markers stored per job

Jobs now:

  • rely on these markers rather than raw timestamps where possible
  • can resume safely even if clocks drift briefly
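
A minimal sketch of marker-based change detection, assuming a source table with a monotonically increasing version column and a per-job cursor table. sqlite3 stands in for whatever the real systems use, and every identifier here is hypothetical.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE records (
            id      INTEGER PRIMARY KEY,
            payload TEXT,
            version INTEGER NOT NULL          -- bumped monotonically on every change
        );
        CREATE TABLE sync_cursors (
            job          TEXT PRIMARY KEY,
            last_version INTEGER NOT NULL     -- last processed marker, stored per job
        );
    """)

    def apply_to_destination(record_id, payload):
        ...  # hypothetical: write the change to the other system

    def sync_once(job):
        row = db.execute("SELECT last_version FROM sync_cursors WHERE job = ?",
                         (job,)).fetchone()
        cursor = row[0] if row else 0

        # Select by version, not by wall-clock timestamp, so clock drift is irrelevant.
        changed = db.execute(
            "SELECT id, payload, version FROM records WHERE version > ? ORDER BY version",
            (cursor,)).fetchall()

        for record_id, payload, version in changed:
            apply_to_destination(record_id, payload)
            cursor = version                  # advance only after successful processing

        db.execute("INSERT OR REPLACE INTO sync_cursors (job, last_version) VALUES (?, ?)",
                   (job, cursor))
        db.commit()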

3. Harden sync logic

We updated sync jobs to:

  • detect and log suspicious patterns (e.g., seeing the same records as "new" repeatedly)
  • cap the amount of work they will do per run for a given record set

In suspicious cases, jobs:

  • slow down
  • raise alerts
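
The guard rails might look roughly like the sketch below: it caps the work accepted per run and flags records that keep reappearing as "new". The thresholds and the process, alert, and backoff hooks are placeholders invented for the example.

    from collections import Counter

    MAX_RECORDS_PER_RUN = 1000   # assumed cap for this example
    SUSPICIOUS_REPEATS = 3       # same record seen as "new" this many runs in a row

    seen_as_new = Counter()      # record id -> consecutive runs flagged as "new"

    def guarded_sync(changed_records, process, alert, backoff):
        """Process at most MAX_RECORDS_PER_RUN records and flag repeat offenders."""
        batch = changed_records[:MAX_RECORDS_PER_RUN]
        if len(changed_records) > MAX_RECORDS_PER_RUN:
            alert(f"sync run truncated at {MAX_RECORDS_PER_RUN} of "
                  f"{len(changed_records)} candidate records")

        repeats = 0
        for record in batch:
            seen_as_new[record["id"]] += 1
            if seen_as_new[record["id"]] >= SUSPICIOUS_REPEATS:
                repeats += 1
            process(record)

        # Forget records that did not show up this run.
        current_ids = {r["id"] for r in batch}
        for record_id in list(seen_as_new):
            if record_id not in current_ids:
                del seen_as_new[record_id]

        if repeats:
            alert(f"{repeats} records re-detected as 'new' for "
                  f"{SUSPICIOUS_REPEATS}+ consecutive runs")
            backoff()  # e.g. lengthen the interval before the next run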

4. Improve observability

We added metrics for:

  • number of records processed per run
  • number of re-processed records
  • time between source updates and sync completion

This makes it easier to spot runaway behavior early.
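
With the Prometheus Python client, for instance, those three metrics could be registered roughly as follows; the metric names and labels are assumptions for the example, not our real dashboards.

    from prometheus_client import Counter, Histogram

    RECORDS_PROCESSED = Counter(
        "sync_records_processed_total",
        "Records processed per sync run", ["job"])
    RECORDS_REPROCESSED = Counter(
        "sync_records_reprocessed_total",
        "Records seen as changed more than once", ["job"])
    SYNC_LAG_SECONDS = Histogram(
        "sync_lag_seconds",
        "Seconds between a source update and sync completion", ["job"])

    def record_run_metrics(job, processed, reprocessed, lag_seconds):
        RECORDS_PROCESSED.labels(job=job).inc(processed)
        RECORDS_REPROCESSED.labels(job=job).inc(reprocessed)
        for lag in lag_seconds:
            SYNC_LAG_SECONDS.labels(job=job).observe(lag)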

Follow-ups

Completed

  • Fixed time sync and cleared backlogs.
  • Added time sync monitoring and alerts.
  • Introduced explicit change markers for the affected flows.

Planned / in progress

  • Extend these patterns to other cross-system syncs.
  • Review jobs that rely heavily on wall-clock assumptions.

Takeaways

  • Time is a shared dependency; clock skew can turn safe jobs into runaway loops.
  • Timestamps alone are a fragile basis for change detection.
  • Monitoring for both time sync and suspicious job behavior helps catch these issues before they grow.
