Incident report: Slow database failover under load
A planned failover that was supposed to be nearly invisible instead degraded writes for roughly ten minutes under real traffic. We describe why and what we changed.
Summary
On February 24, 2023, we executed a planned failover of a primary database.
Based on pre-production tests, we expected the failover to complete in under 60 seconds with minimal impact.
Under real production load, the end-to-end failover, including client reconnection, took roughly ten minutes longer than planned.
During that window, a subset of write-heavy operations experienced elevated error rates and timeouts. Read traffic fared better but still saw elevated latency and occasional failures.
We completed the failover without data loss, but the incident exposed gaps in how we tested and monitored failover behavior.
Impact
- Duration: roughly 11 minutes of increased write failure rates and latency above SLO.
- User impact:
  - some users saw errors when attempting state-changing operations (e.g., updates, certain checkouts)
  - retries often succeeded after a short delay, but not reliably
- Internal impact:
  - elevated on-call load for the application and database teams
  - additional verification work after the event to confirm no inconsistencies
No data loss was observed. Some operations were delayed or retried multiple times.
Timeline
All times local.
- 09:55 — Pre-failover checks complete; metrics and dashboards show healthy baseline.
- 10:00 — Planned failover window begins. Application teams are notified.
- 10:02 — Automated failover process initiated for the primary database cluster.
- 10:03–10:05 — Replicas promote as expected, but DNS and connection string changes take longer than in staging tests.
- 10:06 — Application error rates for write-heavy endpoints begin to climb. Latency for some writes increases sharply.
- 10:07 — On-call for the primary application is paged for elevated errors. They join the planned failover channel.
- 10:09 — Database team confirms that the new primary is healthy but that some application instances are still attempting to connect to the old primary.
- 10:11 — Investigation reveals that some application processes cached connection information longer than expected and did not handle connection failures quickly.
- 10:13 — Application team rolls out a configuration change to reduce connection timeouts and force reconnects.
- 10:16 — Error rates begin to drop as more application instances successfully connect to the new primary.
- 10:21 — Write success rates and latency return to baseline.
- 10:30 — Post-failover verification runs (data consistency checks, replication health).
- 11:10 — Incident closed with follow-ups recorded.
Root cause
The primary issue was a mismatch between our failover assumptions and real-world behavior.
Our test scenarios assumed that:
- all application instances would detect connection failures quickly
- connection pools would refresh promptly against the new primary
- DNS and connection string updates would propagate within seconds
In reality:
- some long-lived application processes held onto stale connections longer than expected
- connection retries used conservative backoff, delaying reconnection
- a subset of clients used slightly different connection settings that were not exercised in tests
The failover itself (from the database system’s perspective) behaved as designed.
The combination of stale connections, slow reconnection behavior, and uneven configuration across clients turned a sub-minute failover into over ten minutes of partial unavailability for writes.
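To make the effect of conservative backoff concrete, here is a rough, illustrative calculation; the parameters are hypothetical, not our real client settings:

```python
# Illustrative only: hypothetical retry parameters, not our real client settings.
# Shows how a "conservative" exponential backoff stretches total reconnection time.

base_delay = 5.0      # seconds before the first retry
multiplier = 2.0      # exponential growth factor
max_delay = 120.0     # cap on any single wait

elapsed = 0.0
for attempt in range(1, 8):
    delay = min(base_delay * multiplier ** (attempt - 1), max_delay)
    elapsed += delay
    print(f"attempt {attempt}: wait {delay:5.0f}s, cumulative wait {elapsed:5.0f}s")

# With these numbers, a client has spent over six minutes just waiting between
# retries by attempt 7, before adding connection timeouts or DNS TTLs.
```

A client that happens to fail its first few attempts early in the failover window can easily sit out most of the impact period without ever trying the new primary.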
Contributing factors:
- Inconsistent client configuration. Different services and even different code paths used slightly different connection pool settings.
- Limited chaos testing. We did not regularly practice failovers under realistic traffic levels and patterns.
- Insufficient visibility. Dashboards focused on primary health and replication lag, not on client connection behavior.
What we changed
1. Standardize connection behavior
We consolidated database connection settings into shared libraries and configuration:
- consistent connection and read/write timeouts
- consistent retry and backoff policies
- clearer separation of read vs write connections where applicable
This reduced the chances of one client behaving differently in a way we hadn’t tested.
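As a sketch of what the shared settings look like, assuming a Python service; the names and values here are illustrative, not our production configuration:

```python
# db_settings.py: illustrative sketch of a shared connection-settings module.
# Names and values are examples, not our production configuration.
from dataclasses import dataclass

@dataclass(frozen=True)
class DBConnectionSettings:
    connect_timeout_s: float = 3.0        # fail fast if a node is unreachable
    read_timeout_s: float = 5.0
    write_timeout_s: float = 10.0
    max_retries: int = 5
    retry_base_delay_s: float = 0.5       # short, bounded backoff for reconnects
    retry_max_delay_s: float = 5.0
    max_connection_age_s: float = 300.0   # recycle pooled connections regularly

# Services import one of a few named profiles instead of hand-rolling settings.
PRIMARY_WRITER = DBConnectionSettings()
READ_REPLICA = DBConnectionSettings(read_timeout_s=2.0)
```

Keeping the profiles small and named makes it obvious in code review when a service is deviating from the shared defaults.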
2. Make failover behavior visible
We added metrics and dashboards for:
- number of active connections per client to each database node
- error rates by connection error type (e.g., connection refused, timeout)
- time-to-reconnect for clients after a simulated failover
These metrics are now part of the standard view during planned failovers and related incidents.
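A minimal sketch of the client-side instrumentation we mean, using the prometheus_client library; the metric names and buckets are illustrative:

```python
# Illustrative client-side metrics; names and buckets are examples.
from prometheus_client import Counter, Gauge, Histogram

# How many pooled connections this client holds against each database node.
db_active_connections = Gauge(
    "db_client_active_connections",
    "Active connections from this client to a database node",
    ["db_node"],
)

# Connection errors broken down by type (refused, timeout, reset, ...).
db_connection_errors = Counter(
    "db_client_connection_errors_total",
    "Connection errors seen by this client, by error type",
    ["db_node", "error_type"],
)

# How long a client takes to re-establish a healthy connection after a failure.
db_time_to_reconnect = Histogram(
    "db_client_time_to_reconnect_seconds",
    "Time from first connection failure to first successful query",
    buckets=(1, 2, 5, 10, 30, 60, 120, 300),
)
```

The time-to-reconnect histogram is the one we watch most closely during drills, since it captures exactly the gap that surprised us in this incident.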
3. Practice failovers under load
We scheduled regular failover exercises with production-like traffic patterns:
- start with non-peak windows
- gradually test closer to busier times as confidence grows
Each exercise is treated like an incident drill:
- clear success criteria (maximum acceptable error budget burn, reconnection time)
- a written plan and a post-exercise review
4. Improve client failure handling
We updated client code (a simplified sketch follows this list) to:
- detect and close stale connections more aggressively on specific error codes
- fail fast when the database is clearly unreachable, instead of waiting through long timeouts
- distinguish between transient errors and configuration errors, so we can surface the right signals during a failover
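A simplified sketch of the resulting reconnect policy; the exception classes and the pool methods (`pool.execute`, `pool.discard_current_connection`) are placeholders for our shared pool library, and the thresholds are illustrative:

```python
# Simplified, illustrative reconnect policy. Exception classes and pool methods
# are placeholders standing in for our shared connection-pool library.
import random
import time

class TransientDBError(Exception): ...      # e.g. connection refused, reset, timeout
class ConfigurationDBError(Exception): ...  # e.g. auth failure, unknown host

RECONNECT_DEADLINE_S = 30.0   # fail fast past this point instead of waiting
BASE_DELAY_S = 0.25
MAX_DELAY_S = 2.0

def run_with_reconnect(pool, query):
    start = time.monotonic()
    attempt = 0
    while True:
        try:
            return pool.execute(query)
        except TransientDBError:
            # Drop the (possibly stale) connection instead of reusing it.
            pool.discard_current_connection()
            attempt += 1
            if time.monotonic() - start > RECONNECT_DEADLINE_S:
                raise  # fail fast: let the caller surface the error
            delay = min(BASE_DELAY_S * 2 ** attempt, MAX_DELAY_S)
            time.sleep(delay + random.uniform(0, delay / 2))  # short, jittered backoff
        except ConfigurationDBError:
            # Not retryable: surface immediately so it is visible during a failover.
            raise
```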
5. Runbook updates
We enriched the database and application runbooks with failover-specific sections:
- pre-checks: what to look at before initiating a failover
- during: which dashboards to watch, which logs to tail
- after: how to verify data consistency and replication
We also added a simple checklist for application teams (an example check is sketched below):
- confirm your service reconnects correctly in staging failover tests
- ensure metrics for connection errors are wired into your dashboards
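As an example of the first checklist item, here is a rough harness a service team might run in staging while a failover drill is in progress; the DSN and the use of psycopg2 are assumptions, and any driver with similar error semantics would work:

```python
# Rough staging harness: measure time-to-reconnect while a failover drill runs.
# The DSN is a placeholder; adjust for your service's staging environment.
import time
import psycopg2

DSN = "dbname=app_staging host=staging-db.internal user=app"  # placeholder

def probe_once():
    conn = psycopg2.connect(DSN, connect_timeout=2)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            cur.fetchone()
    finally:
        conn.close()

first_failure = None
while True:
    try:
        probe_once()
        if first_failure is not None:
            print(f"reconnected after {time.monotonic() - first_failure:.1f}s")
            break
    except psycopg2.OperationalError:
        if first_failure is None:
            first_failure = time.monotonic()
    time.sleep(1)
```

The loop runs until it observes a failure followed by a successful probe, then reports the gap; that number is what we compare against the reconnection targets in the drill's success criteria.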
Follow-ups
Completed
- Standardized connection settings for major services.
- Added client connection metrics to database dashboards.
- Ran at least one additional failover test with improved visibility.
Planned / in progress
- Expand failover drills to more services and busier time windows.
- Automate parts of the failover and verification process to reduce manual steps.
- Define explicit SLOs for failover behavior (e.g., "clients reconnect within N seconds in normal conditions").
Takeaways
- A "successful" failover from the database’s perspective can still be an incident if clients reconnect slowly.
- Standardizing connection behavior across services makes failover behavior more predictable.
- Practicing failovers under realistic load is the only way to flush out certain failure modes.
- Visibility into client connections is as important as visibility into the database itself during failovers.