Standardizing circuit breaker patterns across services
We moved from ad-hoc circuit breakers to a shared pattern so that a failing dependency no longer triggers a different, uncoordinated response from every service.
Circuit breakers started as one-off patches.
A team would see a dependency misbehave and add a quick safeguard:
- if failures exceed N in M seconds, stop calling for a while
Over time, every service grew its own version:
- slightly different thresholds
- slightly different reset logic
- slightly different metrics (or none)
During incidents, this fragmentation had real costs:
- some services kept hammering a failing dependency
- others backed off too aggressively, causing avoidable degradations
- it was hard to answer "who is still calling this dependency right now?"
We decided to standardize.
Constraints
- We could not migrate every call site at once.
- Different dependencies had different characteristics (latency, error modes).
- We wanted a pattern that worked in both synchronous and asynchronous contexts.
What we changed
1. Define a simple, explicit contract
We wrote down what a circuit breaker should do for us (a sketch of the contract follows this list):
- detect when a dependency is unhealthy based on failures or latency
- stop sending most traffic while unhealthy
- periodically test the dependency to see if it has recovered
- expose its state (open/half-open/closed) via metrics and logs
We added one more requirement:
- make it obvious to operators when the breaker, not the dependency, is blocking calls
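Sketched as a Python interface (the names here are illustrative, not the actual library API), the contract looks roughly like this:

```python
from enum import Enum
from typing import Callable, Protocol, TypeVar

T = TypeVar("T")

class BreakerState(Enum):
    CLOSED = "closed"        # traffic flows normally
    OPEN = "open"            # calls are rejected immediately
    HALF_OPEN = "half_open"  # a limited number of trial calls are allowed

class CircuitBreaker(Protocol):
    @property
    def state(self) -> BreakerState:
        """Current state, also exported via metrics and structured logs."""
        ...

    def call(self, fn: Callable[[], T]) -> T:
        """Invoke fn through the breaker.

        When the breaker is open, raises a breaker-specific error rather
        than the dependency's own error, so operators can tell that the
        breaker, not the dependency, is blocking the call.
        """
        ...
```

The breaker-specific error type is what satisfies the last requirement: an operator reading logs can immediately distinguish "the dependency failed" from "the breaker refused to call it".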
2. Build a shared implementation
We built a small library that implemented this contract:
- consistent state machine (closed → open → half-open → closed)
- configurable thresholds per dependency
- hooks for metrics and structured logs
Services integrated the library instead of rolling their own.
Where language differences existed, we ported the same semantics.
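A trimmed-down version of the core state machine, assuming a consecutive-failure threshold and a fixed reset timeout (the shared library also tracks latency, limits concurrent trial calls while half-open, and fires metric/log hooks):

```python
import time

class BreakerOpenError(Exception):
    """Raised when the breaker, not the dependency, rejects the call."""

class SimpleCircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if time.monotonic() - self._opened_at >= self.reset_timeout:
                self.state = "half_open"  # allow a trial call to probe for recovery
            else:
                raise BreakerOpenError("circuit open, call rejected")
        try:
            result = fn()
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_success(self):
        self._failures = 0
        self.state = "closed"              # recovered (or still healthy)

    def _record_failure(self):
        self._failures += 1
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"            # stop sending traffic
            self._opened_at = time.monotonic()
```

Call sites wrap their dependency calls in breaker.call(...) and catch BreakerOpenError to fall back or fail fast instead of waiting on a dependency that is known to be unhealthy.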
3. Make configuration data-driven
Breaker settings live in configuration rather than being hard-coded at call sites.
This makes it possible to:
- tune thresholds without redeploying
- apply similar policies across services that share a dependency
We documented defaults (an example configuration is sketched below):
- conservative starting thresholds
- recommended values for different classes of dependencies (e.g., internal vs external)
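As an illustration (dependency names and values below are made up, not our production settings), the configuration shape looks like this:

```python
# Conservative defaults, with per-dependency overrides kept in configuration.
DEFAULTS = {"failure_threshold": 5, "reset_timeout_s": 30, "latency_budget_ms": 500}

OVERRIDES = {
    # internal dependencies: tolerate more failures, retest sooner
    "internal-user-service": {"failure_threshold": 10, "reset_timeout_s": 15},
    # external dependencies: open earlier, back off longer, allow higher latency
    "external-payments-api": {"failure_threshold": 3, "reset_timeout_s": 60,
                              "latency_budget_ms": 2000},
}

def breaker_settings(dependency: str) -> dict:
    """Merge defaults with any per-dependency overrides loaded from config."""
    return {**DEFAULTS, **OVERRIDES.get(dependency, {})}
```

Because these values come from configuration rather than code, tightening a threshold during an incident is a config change, not a redeploy.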
4. Expose breaker state in dashboards and runbooks
We added:
- metrics for breaker state per dependency
- panels showing when breakers open and close
- guidance in runbooks for how to interpret breaker behavior
During incidents, this helps answer:
- is the dependency still failing, or are our breakers still holding the circuit open?
- which services are currently blocked by their breakers?
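The state metrics come from a small transition hook the breaker calls on every state change; a minimal sketch (field names are illustrative):

```python
import json
import logging
import time

log = logging.getLogger("circuit_breaker")

# Numeric encoding so state can be plotted as a gauge: 0=closed, 1=half_open, 2=open.
STATE_VALUES = {"closed": 0, "half_open": 1, "open": 2}

def on_state_change(dependency: str, old_state: str, new_state: str) -> None:
    """Emit a structured log line for every breaker transition.

    The same hook is where a per-dependency gauge would be set through
    whatever metrics client the service already uses.
    """
    log.warning(json.dumps({
        "event": "breaker_state_change",
        "dependency": dependency,
        "old_state": old_state,
        "new_state": new_state,
        "state_value": STATE_VALUES[new_state],
        "ts": time.time(),
    }))
```

Because every service emits the same event shape, the same dashboards and runbook queries work for all of them.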
5. Migrate incrementally
We migrated call sites in phases:
- high-risk dependencies first
- less critical paths later
At each step, we checked:
- changes in error and latency patterns
- breaker behavior under normal and stressed conditions
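To make the phasing concrete: in the sketch below (building on the SimpleCircuitBreaker and breaker_settings examples above, with a hypothetical "enabled" flag), a call site only routes through its breaker once its dependency has been switched on in configuration:

```python
_breakers: dict = {}

def get_breaker(dependency: str, settings: dict) -> SimpleCircuitBreaker:
    """Illustrative registry: one breaker instance per dependency."""
    if dependency not in _breakers:
        _breakers[dependency] = SimpleCircuitBreaker(
            failure_threshold=settings["failure_threshold"],
            reset_timeout=settings["reset_timeout_s"],
        )
    return _breakers[dependency]

def guarded_call(dependency: str, fn):
    """Route a call through its breaker only once that dependency is migrated."""
    settings = breaker_settings(dependency)
    if not settings.get("enabled", False):   # "enabled" is the migration flag (hypothetical)
        return fn()                          # not migrated yet: call the dependency directly
    return get_breaker(dependency, settings).call(fn)
```

Flipping the flag per dependency let us roll the breaker out to high-risk paths first and watch the metrics before touching the rest.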
Results / Measurements
After the migration, we saw:
- more predictable behavior when dependencies degraded
- fewer "retry storms" from services that didn’t previously have breakers
- faster diagnosis of who was still calling a failing dependency
Breakers didn’t remove failures, but they kept their impact bounded.
Takeaways
- Circuit breakers are a coordination tool, not just a local optimization.
- Standardizing their behavior and metrics makes incidents easier to reason about.
- Configuration-driven thresholds let us tune without code churn.
- Breakers should be visible in dashboards and runbooks, not hidden behind library calls.