RELIABILITY · 2024-02-19 · BY STORECODE

Incident report: Config drift across multi-region clusters

A subtle configuration difference between regions turned a routine rollout into an incident that hit only one region. We describe the drift and how we removed it.

Tags: reliability · multi-region · configuration · rollout · incident-response

Summary

On February 19, 2024, we rolled out a new version of a core service across two regions.

The rollout plan assumed that both regions were configured identically:

  • same feature flag defaults
  • same retry and timeout settings
  • same dependency endpoints

In practice, they were not.

A configuration difference meant that the same code behaved differently by region. One region handled the new version as expected; the other experienced elevated error rates and retries.

The deployment felt like two different releases: a clean one and a noisy one.

We rolled back the affected region, stabilized traffic, and then treated the configuration drift itself as the real incident.

Impact

  • Duration: roughly 55 minutes of elevated error rates in one region.
  • User impact:
    • users routed to the affected region saw higher error rates and retries for several endpoints
    • users in the other region saw normal behavior
  • Internal impact:
    • increased on-call load for the service and for a dependent team
    • confusion in early investigation because only one region looked broken

We did not observe data corruption, but some operations failed more often for users routed to the affected region.

Timeline

All times are local to the affected region (region B).

  • 09:02 — Rollout of new version begins in region A (control) with canary traffic.
  • 09:07 — Canary in region A looks healthy; traffic gradually increases.
  • 09:15 — Rollout begins in region B using the same staged process.
  • 09:22 — Error rate for a key API in region B begins to rise. Region A remains within SLO.
  • 09:25 — On-call acknowledges error-rate alert scoped to region B.
  • 09:29 — Initial checks show code and build versions match across regions.
  • 09:34 — On-call compares dashboards and notices that retries and latency patterns differ between regions despite identical code.
  • 09:39 — Hypothesis shifts to configuration differences. Config snapshots are pulled for both regions.
  • 09:44 — A diff reveals that a timeout and retry policy for a downstream dependency is stricter in region A and more lenient in region B.
  • 09:48 — Decision: roll back region B to the previous version while we examine configurations more closely.
  • 09:52 — Rollback in region B completes; error rates drop toward baseline.
  • 10:05 — Teams review config management history and confirm that drift accumulated over several small, region-specific changes.
  • 10:30 — Incident closed with follow-ups focused on eliminating the sources of drift.

Root cause

The root cause was configuration drift between regions: timeouts and retry settings for a key downstream dependency differed.

In region A:

  • lower per-call timeouts
  • fewer retries with backoff

In region B:

  • higher per-call timeouts
  • more aggressive retries

Under the new version, the downstream dependency’s occasional slowness interacted badly with region B’s more permissive settings:

  • longer waits increased tail latency
  • extra retries amplified load on the dependency

Region A’s stricter settings made the same degradation manifest as faster failures, which our code handled via fallbacks.
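To make the interaction concrete, the sketch below uses hypothetical timeout and retry values (the real settings are not reproduced in this report) to show how a more lenient policy inflates both worst-case wait time and downstream load when every attempt times out.

```python
# Hypothetical per-region settings for calls to the slow downstream dependency.
# These values only illustrate the shape of the drift, not the real numbers.
REGION_A = {"per_call_timeout_s": 0.5, "max_retries": 1}   # stricter
REGION_B = {"per_call_timeout_s": 3.0, "max_retries": 4}   # more lenient

def worst_case(settings: dict) -> tuple[float, int]:
    """Worst-case wall-clock wait and number of downstream calls for one
    request when every attempt times out. Backoff is ignored for simplicity."""
    attempts = 1 + settings["max_retries"]
    return settings["per_call_timeout_s"] * attempts, attempts

for name, settings in (("region A", REGION_A), ("region B", REGION_B)):
    wait_s, calls = worst_case(settings)
    print(f"{name}: up to {wait_s:.1f}s of waiting, {calls}x load on the dependency")

# region A: up to 1.0s of waiting, 2x load on the dependency
# region B: up to 15.0s of waiting, 5x load on the dependency
```

Even with rough numbers, the asymmetry is the point: the lenient region waits much longer per request and sends several times more traffic at a dependency that is already struggling.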

The configuration drift originated from:

  • a one-off change months earlier in region B to "stabilize" a noisy dependency
  • lack of a single source of truth for cross-region configuration
  • ad-hoc overrides applied during a previous incident that were never reconciled

Contributing factors

  • No automated drift detection. We had no routine check that compared config between regions.
  • Limited config review. Region-specific changes were reviewed locally but not against a global baseline.
  • Implicit assumptions. Documentation and runbooks assumed identical behavior across regions.

What we changed

1. Single source of truth for shared configuration

We moved shared configuration into a central, versioned store:

  • regions consume the same base config
  • region-specific overrides are explicit and minimal
  • changes are reviewed against the shared baseline

This does not eliminate the need for overrides, but it makes them visible.
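As a rough illustration of the layout, here is a minimal sketch of a shared baseline with explicit per-region overrides; the key names, values, and merge logic are placeholders, not our actual config store.

```python
# Minimal sketch of "shared baseline + explicit per-region overrides".
BASE_CONFIG = {
    "dependency.timeout_s": 0.5,
    "dependency.max_retries": 1,
    "feature.new_checkout": False,
}

# Overrides must be spelled out per region, so any divergence is visible in review.
REGION_OVERRIDES = {
    "region-a": {},
    "region-b": {"dependency.timeout_s": 1.0},  # explicit, reviewed exception
}

def effective_config(region: str) -> dict:
    """Return the configuration a region actually runs with."""
    config = dict(BASE_CONFIG)
    config.update(REGION_OVERRIDES.get(region, {}))
    return config

print(effective_config("region-b"))
```

Because any divergence has to appear in the overrides map, a reviewer sees the difference against the shared baseline rather than against whatever the region happened to run before.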

2. Automated drift detection

We built a small tool that:

  • periodically fetches effective configuration from each region
  • diffs it against the shared baseline and against other regions
  • reports unexpected differences

We wired this into CI for config changes and into a low-priority alert channel for periodic checks.
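The core of the check can be sketched as follows, assuming each region's effective configuration can be fetched as a flat key-value map; the fetch function is a placeholder, and a real implementation would also need to handle nested values and intentionally allow-listed differences.

```python
# Sketch of the drift check: compare each region's effective config against the
# shared baseline plus its known overrides, and flag anything unexpected.

def fetch_effective_config(region: str) -> dict:
    raise NotImplementedError("placeholder: fetch the region's running config")

def find_drift(
    baseline: dict,
    known_overrides: dict[str, dict],
    regions: list[str],
) -> dict[str, dict]:
    """Return {region: {key: (expected, actual)}} for unexpected differences."""
    drift: dict[str, dict] = {}
    for region in regions:
        expected = {**baseline, **known_overrides.get(region, {})}
        actual = fetch_effective_config(region)
        diffs = {
            key: (expected.get(key), actual.get(key))
            for key in expected.keys() | actual.keys()
            if expected.get(key) != actual.get(key)
        }
        if diffs:
            drift[region] = diffs
    return drift
```

In CI, a non-empty result fails the config change; on a schedule, it goes to the low-priority alert channel.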

3. Config change runbooks

We updated runbooks with:

  • how to make region-specific overrides safely
  • when to prefer changing the shared baseline instead
  • steps to follow after using emergency overrides (including how to reconcile them)

We also added a checklist item to post-incident reviews:

  • "Did we change configuration in only one region? If so, did we reconcile it later?"

4. Rollout patterns that assume differences

We adjusted our rollout patterns:

  • canaries now include both regions with explicit regional metrics
  • dashboards show per-region behavior side by side by default
  • alerts for rollout issues include region tags and direct links to per-region config views

This made it easier to see when the "same code" behaved differently in different places.
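One way such a per-region canary gate could look is sketched below; the thresholds and metric names are hypothetical, and the real numbers would come from the rollout tooling.

```python
# Sketch of a per-region canary gate: proceed only if every region's error rate
# stays under the SLO threshold AND the regions stay close to each other.
ERROR_RATE_SLO = 0.01        # hypothetical: 1% errors allowed during canary
MAX_REGION_SPREAD = 0.005    # hypothetical: regions may differ by at most 0.5 points

def canary_healthy(error_rates: dict[str, float]) -> bool:
    """error_rates maps region name -> observed canary error rate."""
    worst = max(error_rates.values())
    spread = worst - min(error_rates.values())
    return worst <= ERROR_RATE_SLO and spread <= MAX_REGION_SPREAD

# Example: region B elevated while region A is healthy -> the gate fails.
print(canary_healthy({"region-a": 0.002, "region-b": 0.03}))  # False
```

A gate like this turns "region B looks worse than region A" from a dashboard observation into a stop condition for the rollout.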

Follow-ups

Completed

  • Migrated relevant config keys into a shared store with explicit region overrides.
  • Implemented automated drift detection for core services.
  • Updated dashboards and alerts to highlight per-region differences.

Planned / in progress

  • Extend drift detection to more services and dependencies.
  • Add periodic reviews of region-specific overrides to retire those that are no longer needed.
  • Incorporate drift checks into pre-rollout validation for critical changes.

Takeaways

  • Multi-region systems rarely stay identical by accident; drift accumulates.
  • Assuming identical behavior across regions can hide configuration risk until a rollout arrives.
  • A shared config baseline plus automated drift detection turns "surprising" differences into reviewable facts.
  • Rollout dashboards and alerts should make per-region differences obvious, not something you discover by accident.