Incident report: Slow database failover under load
A planned failover that was supposed to be nearly invisible instead degraded writes for roughly ten minutes under real traffic. We describe why and what we changed.
Summary
On February 24, 2023, we executed a planned failover of a primary database.
Based on pre-production tests, we expected the failover to complete in under 60 seconds with minimal impact.
Under real production load, the end-to-end failover, including client reconnection, took roughly ten minutes longer than planned.
During that window, a subset of write-heavy operations experienced elevated error rates and timeouts. Read traffic fared better but still saw elevated latency and occasional failures.
We completed the failover without data loss, but the incident exposed gaps in how we tested and monitored failover behavior.
Impact
- Duration: roughly 11 minutes of increased write failure rates and latency above SLO.
- User impact:
  - some users saw errors when attempting state-changing operations (e.g., updates, certain checkouts)
  - retries often succeeded after a short delay, but not reliably
- Internal impact:
  - elevated on-call load for the application and database teams
  - additional verification work after the event to confirm no inconsistencies
No data loss was observed. Some operations were delayed or retried multiple times.
Timeline
All times local.
- 09:55 — Pre-failover checks complete; metrics and dashboards show healthy baseline.
- 10:00 — Planned failover window begins. Application teams are notified.
- 10:02 — Automated failover process initiated for the primary database cluster.
- 10:03–10:05 — Replicas promote as expected, but DNS and connection string changes take longer than in staging tests.
- 10:06 — Application error rates for write-heavy endpoints begin to climb. Latency for some writes increases sharply.
- 10:07 — On-call for the primary application is paged for elevated errors. They join the planned failover channel.
- 10:09 — Database team confirms that the new primary is healthy but that some application instances are still attempting to connect to the old primary.
- 10:11 — Investigation reveals that some application processes cached connection information longer than expected and did not handle connection failures quickly.
- 10:13 — Application team rolls out a configuration change to reduce connection timeouts and force reconnects.
- 10:16 — Error rates begin to drop as more application instances successfully connect to the new primary.
- 10:21 — Write success rates and latency return to baseline.
- 10:30 — Post-failover verification runs (data consistency checks, replication health).
- 11:10 — Incident closed with follow-ups recorded.
Root cause
The primary issue was a mismatch between our failover assumptions and real-world behavior.
Our test scenarios assumed that:
- all application instances would detect connection failures quickly
- connection pools would refresh promptly against the new primary
- DNS and connection string updates would propagate within seconds
In reality:
- some long-lived application processes held onto stale connections longer than expected
- connection retries used conservative backoff, delaying reconnection
- a subset of clients used slightly different connection settings that were not exercised in tests
The failover itself (from the database system’s perspective) behaved as designed.
The combination of stale connections, slow reconnection behavior, and uneven configuration across clients turned a sub-minute failover into over ten minutes of partial unavailability for writes.
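To make the effect of conservative backoff concrete, here is a rough, illustrative calculation; the parameters are hypothetical, not our real client settings:

```python
# Illustrative only: hypothetical retry parameters, not our real client settings.
# Shows how a "conservative" exponential backoff stretches total reconnection time.

base_delay = 5.0      # seconds before the first retry
multiplier = 2.0      # exponential growth factor
max_delay = 120.0     # cap on any single wait

elapsed = 0.0
for attempt in range(1, 8):
    delay = min(base_delay * multiplier ** (attempt - 1), max_delay)
    elapsed += delay
    print(f"attempt {attempt}: wait {delay:5.0f}s, cumulative wait {elapsed:5.0f}s")

# With these numbers, a client has spent over six minutes just waiting between
# retries by attempt 7, before adding connection timeouts or DNS TTLs.
```

A client that happens to fail its first few attempts early in the failover window can easily sit out most of the impact period without ever trying the new primary.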
Contributing factors:
- Inconsistent client configuration. Different services and even different code paths used slightly different connection pool settings.
- Limited chaos testing. We did not regularly practice failovers under realistic traffic levels and patterns.
- Insufficient visibility. Dashboards focused on primary health and replication lag, not on client connection behavior.
What we changed
1. Standardize connection behavior
We consolidated database connection settings into shared libraries and configuration:
- consistent connection and read/write timeouts
- consistent retry and backoff policies
- clearer separation of read vs write connections where applicable
This reduced the chances of one client behaving differently in a way we hadn’t tested.
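As a sketch of what the shared settings look like, assuming a Python service; the names and values here are illustrative, not our production configuration:

```python
# db_settings.py: illustrative sketch of a shared connection-settings module.
# Names and values are examples, not our production configuration.
from dataclasses import dataclass

@dataclass(frozen=True)
class DBConnectionSettings:
    connect_timeout_s: float = 3.0        # fail fast if a node is unreachable
    read_timeout_s: float = 5.0
    write_timeout_s: float = 10.0
    max_retries: int = 5
    retry_base_delay_s: float = 0.5       # short, bounded backoff for reconnects
    retry_max_delay_s: float = 5.0
    max_connection_age_s: float = 300.0   # recycle pooled connections regularly

# Services import one of a few named profiles instead of hand-rolling settings.
PRIMARY_WRITER = DBConnectionSettings()
READ_REPLICA = DBConnectionSettings(read_timeout_s=2.0)
```

Keeping the profiles small and named makes it obvious in code review when a service is deviating from the shared defaults.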
2. Make failover behavior visible
We added metrics and dashboards for:
- number of active connections per client to each database node
- error rates by connection error type (e.g., connection refused, timeout)
- time-to-reconnect for clients after a simulated failover
These metrics are now part of the standard view during planned failovers and related incidents.
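A minimal sketch of the client-side instrumentation we mean, using the prometheus_client library; the metric names and buckets are illustrative:

```python
# Illustrative client-side metrics; names and buckets are examples.
from prometheus_client import Counter, Gauge, Histogram

# How many pooled connections this client holds against each database node.
db_active_connections = Gauge(
    "db_client_active_connections",
    "Active connections from this client to a database node",
    ["db_node"],
)

# Connection errors broken down by type (refused, timeout, reset, ...).
db_connection_errors = Counter(
    "db_client_connection_errors_total",
    "Connection errors seen by this client, by error type",
    ["db_node", "error_type"],
)

# How long a client takes to re-establish a healthy connection after a failure.
db_time_to_reconnect = Histogram(
    "db_client_time_to_reconnect_seconds",
    "Time from first connection failure to first successful query",
    buckets=(1, 2, 5, 10, 30, 60, 120, 300),
)
```

The time-to-reconnect histogram is the one we watch most closely during drills, since it captures exactly the gap that surprised us in this incident.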
3. Practice failovers under load
We scheduled regular failover exercises with production-like traffic patterns:
- start with non-peak windows
- gradually test closer to busier times as confidence grows
Each exercise is treated like an incident drill:
- clear success criteria (maximum acceptable error budget burn, reconnection time)
- a written plan and a post-exercise review
4. Improve client failure handling
We updated client code (a simplified sketch follows this list) to:
- detect and close stale connections more aggressively on specific error codes
- fail fast when the database is clearly unreachable, instead of waiting through long timeouts
- distinguish between transient errors and configuration errors, so we can surface the right signals during a failover
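A simplified sketch of the resulting reconnect policy; the exception classes and the pool methods (`pool.execute`, `pool.discard_current_connection`) are placeholders for our shared pool library, and the thresholds are illustrative:

```python
# Simplified, illustrative reconnect policy. Exception classes and pool methods
# are placeholders standing in for our shared connection-pool library.
import random
import time

class TransientDBError(Exception): ...      # e.g. connection refused, reset, timeout
class ConfigurationDBError(Exception): ...  # e.g. auth failure, unknown host

RECONNECT_DEADLINE_S = 30.0   # fail fast past this point instead of waiting
BASE_DELAY_S = 0.25
MAX_DELAY_S = 2.0

def run_with_reconnect(pool, query):
    start = time.monotonic()
    attempt = 0
    while True:
        try:
            return pool.execute(query)
        except TransientDBError:
            # Drop the (possibly stale) connection instead of reusing it.
            pool.discard_current_connection()
            attempt += 1
            if time.monotonic() - start > RECONNECT_DEADLINE_S:
                raise  # fail fast: let the caller surface the error
            delay = min(BASE_DELAY_S * 2 ** attempt, MAX_DELAY_S)
            time.sleep(delay + random.uniform(0, delay / 2))  # short, jittered backoff
        except ConfigurationDBError:
            # Not retryable: surface immediately so it is visible during a failover.
            raise
```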
5. Runbook updates
We enriched the database and application runbooks with failover-specific sections:
- pre-checks: what to look at before initiating a failover
- during: which dashboards to watch, which logs to tail
- after: how to verify data consistency and replication
We also added a simple checklist for application teams (an example check is sketched below):
- confirm your service reconnects correctly in staging failover tests
- ensure metrics for connection errors are wired into your dashboards
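As an example of the first checklist item, here is a rough harness a service team might run in staging while a failover drill is in progress; the DSN and the use of psycopg2 are assumptions, and any driver with similar error semantics would work:

```python
# Rough staging harness: measure time-to-reconnect while a failover drill runs.
# The DSN is a placeholder; adjust for your service's staging environment.
import time
import psycopg2

DSN = "dbname=app_staging host=staging-db.internal user=app"  # placeholder

def probe_once():
    conn = psycopg2.connect(DSN, connect_timeout=2)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            cur.fetchone()
    finally:
        conn.close()

first_failure = None
while True:
    try:
        probe_once()
        if first_failure is not None:
            print(f"reconnected after {time.monotonic() - first_failure:.1f}s")
            break
    except psycopg2.OperationalError:
        if first_failure is None:
            first_failure = time.monotonic()
    time.sleep(1)
```

The loop runs until it observes a failure followed by a successful probe, then reports the gap; that number is what we compare against the reconnection targets in the drill's success criteria.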
Follow-ups
Completed
- Standardized connection settings for major services.
- Added client connection metrics to database dashboards.
- Ran at least one additional failover test with improved visibility.
Planned / in progress
- Expand failover drills to more services and busier time windows.
- Automate parts of the failover and verification process to reduce manual steps.
- Define explicit SLOs for failover behavior (e.g., "clients reconnect within N seconds in normal conditions").
Takeaways
- A "successful" failover from the database’s perspective can still be an incident if clients reconnect slowly.
- Standardizing connection behavior across services makes failover behavior more predictable.
- Practicing failovers under realistic load is the only way to flush out certain failure modes.
- Visibility into client connections is as important as visibility into the database itself during failovers.