ARCHITECTURE · 2021-02-08 · BY ELI NAVARRO

Standardizing circuit breaker patterns across services

We moved from ad-hoc circuit breakers to a shared pattern so that a failing dependency triggers the same predictable response everywhere, instead of a different reaction in every service.

architecture · reliability · circuit-breakers · dependencies

Circuit breakers started as one-off patches.

A team would see a dependency misbehave and add a quick safeguard:

  • if failures exceed N in M seconds, stop calling for a while
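
A minimal sketch of what such a one-off safeguard often looks like. The class name `AdHocGuard`, its parameters, and the injectable `clock` are all illustrative assumptions, not code from any of our services:

```python
import time
from collections import deque


class AdHocGuard:
    """Block calls for `cooldown` seconds once more than `max_failures`
    failures are recorded within `window` seconds (hypothetical sketch)."""

    def __init__(self, max_failures, window, cooldown, clock=time.monotonic):
        self.max_failures = max_failures
        self.window = window
        self.cooldown = cooldown
        self.clock = clock
        self.failures = deque()            # timestamps of recent failures
        self.blocked_until = float("-inf")

    def allow(self):
        # Calls are allowed again as soon as the cooldown has elapsed.
        return self.clock() >= self.blocked_until

    def record_failure(self):
        now = self.clock()
        self.failures.append(now)
        # Drop failures that have fallen out of the window.
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()
        if len(self.failures) > self.max_failures:
            self.blocked_until = now + self.cooldown
            self.failures.clear()
```

Note what is missing: no probing before traffic resumes, no exposed state, and no metrics. Those gaps are exactly where the per-service versions drifted apart.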

Over time, every service grew its own version:

  • slightly different thresholds
  • slightly different reset logic
  • slightly different metrics (or none)

During incidents, this fragmentation had real costs:

  • some services kept hammering a failing dependency
  • others backed off too aggressively, causing avoidable degradations
  • it was hard to answer "who is still calling this dependency right now?"

We decided to standardize.

Constraints

  • We could not migrate every call site at once.
  • Different dependencies had different characteristics (latency, error modes).
  • We wanted a pattern that worked in both synchronous and async contexts.

What we changed

1. Define a simple, explicit contract

We wrote down what a circuit breaker should do for us:

  • detect when a dependency is unhealthy based on failures or latency
  • stop sending most traffic while unhealthy
  • periodically test the dependency to see if it has recovered
  • expose its state (open/half-open/closed) via metrics and logs

We added one more requirement:

  • make it obvious to operators when the breaker, not the dependency, is blocking calls
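
One way to write that contract down in code is as an interface plus a dedicated error type. This is a sketch under assumptions: the names `CircuitBreaker`, `BreakerState`, `BreakerOpenError`, and the method signatures are hypothetical and not taken from our library.

```python
from enum import Enum
from typing import Protocol


class BreakerState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class BreakerOpenError(Exception):
    """Raised when the breaker rejects a call.

    The dependency was never contacted, so operators can tell
    "blocked by breaker" apart from "dependency returned an error".
    """

    def __init__(self, dependency: str, state: BreakerState):
        super().__init__(
            f"breaker for {dependency!r} is {state.value}; call not attempted"
        )
        self.dependency = dependency
        self.state = state


class CircuitBreaker(Protocol):
    """Hypothetical contract that every implementation would satisfy."""

    dependency: str

    @property
    def state(self) -> BreakerState: ...

    def allow_request(self) -> bool: ...

    # Latency is passed in so implementations can trip on slowness too.
    def record_success(self, latency_s: float) -> None: ...

    def record_failure(self, latency_s: float) -> None: ...
```

A distinct exception type is one simple way to meet the last requirement: a log line or error count carrying `BreakerOpenError` cannot be mistaken for a failure returned by the dependency itself.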

2. Build a shared implementation

We built a small library that implemented this contract:

  • consistent state machine (closed → open → half-open → closed, or back to open if the recovery probe fails)
  • configurable thresholds per dependency
  • hooks for metrics and structured logs

Services integrated the library instead of rolling their own.

Where language differences existed, we ported the same semantics.
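
To make the semantics concrete, here is a minimal sketch of such a state machine. It is not our library's code: the `Breaker` class, the consecutive-failure trip rule, the single-probe half-open behavior, and the `on_transition` hook signature are all assumptions chosen for illustration. It also ignores latency-based detection and does no locking.

```python
import time


class Breaker:
    """Hypothetical three-state breaker: closed, open, half-open."""

    def __init__(self, name, failure_threshold=5, reset_timeout_s=30.0,
                 on_transition=None, clock=time.monotonic):
        self.name = name
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        # Hook for metrics and structured logs; no-op by default.
        self.on_transition = on_transition or (lambda name, old, new: None)
        self.clock = clock
        self.state = "closed"
        self.consecutive_failures = 0
        self.opened_at = None

    def _transition(self, new_state):
        old, self.state = self.state, new_state
        if new_state == "open":
            self.opened_at = self.clock()
        self.on_transition(self.name, old, new_state)

    def allow_request(self):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout_s:
                self._transition("half-open")  # let a single probe through
                return True
            return False
        # Closed: allow. Half-open: a probe is already out, reject the rest.
        return self.state == "closed"

    def record_success(self):
        self.consecutive_failures = 0
        if self.state == "half-open":
            self._transition("closed")  # probe succeeded

    def record_failure(self):
        if self.state == "half-open":
            self._transition("open")  # probe failed, restart the timer
            return
        self.consecutive_failures += 1
        if (self.state == "closed"
                and self.consecutive_failures >= self.failure_threshold):
            self._transition("open")
```

Because the sketch performs no I/O of its own, the same object can be consulted from synchronous or async call paths; a multi-threaded service would additionally need a lock around the state changes.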

3. Make configuration data-driven

Breaker settings live in configuration, not hard-coded in call sites.

This makes it possible to:

  • tune thresholds without redeploying
  • apply similar policies across services that share a dependency

We documented defaults:

  • conservative starting thresholds
  • recommended values for different classes of dependencies (e.g., internal vs external)
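
As an illustration of the idea, the sketch below loads per-dependency settings from JSON and layers them over class-level defaults. Everything specific here is an assumption: the `BreakerConfig` fields, the JSON layout, the dependency names, and especially the numbers, which are placeholders rather than our documented defaults.

```python
import json
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BreakerConfig:
    failure_threshold: int
    reset_timeout_s: float


# Placeholder defaults per dependency class (values are made up).
CLASS_DEFAULTS = {
    "internal": BreakerConfig(failure_threshold=5, reset_timeout_s=10.0),
    "external": BreakerConfig(failure_threshold=3, reset_timeout_s=30.0),
}


def load_configs(raw: str) -> dict[str, BreakerConfig]:
    """Parse JSON of the form {dependency: {"class": ..., overrides...}}."""
    configs = {}
    for name, entry in json.loads(raw).items():
        entry = dict(entry)
        base = CLASS_DEFAULTS[entry.pop("class", "internal")]
        # Unknown keys raise TypeError, which surfaces typos early.
        configs[name] = replace(base, **entry)
    return configs


RAW = """
{
  "inventory-service": {"class": "internal"},
  "payments-api": {"class": "external", "reset_timeout_s": 60}
}
"""
configs = load_configs(RAW)
```

Tuning without a redeploy also requires the service to re-read this data at runtime (for example on a timer or a config-change notification); the sketch only covers parsing and defaulting.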

4. Expose breaker state in dashboards and runbooks

We added:

  • metrics for breaker state per dependency
  • panels showing when breakers open and close
  • guidance in runbooks for how to interpret breaker behavior

During incidents, this helps answer:

  • is the dependency still failing, or is our breaker still open and blocking calls to a dependency that has recovered?
  • which services are currently blocked by their breakers?
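
A rough sketch of how breaker transitions could feed those questions. The in-memory `breaker_states` dict stands in for a real metrics backend, and the function names and log fields are hypothetical, not our actual schema:

```python
import json
import logging

logger = logging.getLogger("breaker")

# Stand-in for a metrics backend: (service, dependency) -> current state.
breaker_states = {}


def report_transition(service, dependency, old, new):
    """Record the new state and emit one structured log line."""
    breaker_states[(service, dependency)] = new
    logger.warning(json.dumps({
        "event": "breaker_transition",
        "service": service,
        "dependency": dependency,
        "from": old,
        "to": new,
        # Makes explicit that the breaker, not the dependency, blocks calls.
        "blocked_by_breaker": new == "open",
    }))


def services_blocked_on(dependency):
    """Services whose breaker for `dependency` is currently open."""
    return sorted(
        service
        for (service, dep), state in breaker_states.items()
        if dep == dependency and state == "open"
    )
```

For example, after `report_transition("checkout", "payments-api", "closed", "open")`, calling `services_blocked_on("payments-api")` returns `["checkout"]`. In a real setup the same query would be a dashboard panel over the per-dependency state metric.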

5. Migrate incrementally

We migrated call sites in phases:

  • high-risk dependencies first
  • less critical paths later

At each step, we checked:

  • changes in error and latency patterns
  • breaker behavior under normal and stressed conditions

Results / Measurements

After the migration, we saw:

  • more predictable behavior when dependencies degraded
  • fewer "retry storms" from services that didn’t previously have breakers
  • faster diagnosis of who was still calling a failing dependency

Breakers didn’t remove failures, but they kept the impact of each failure bounded.

Takeaways

  • Circuit breakers are a coordination tool, not just a local optimization.
  • Standardizing their behavior and metrics makes incidents easier to reason about.
  • Configuration-driven thresholds let us tune without code churn.
  • Breakers should be visible in dashboards and runbooks, not hidden behind library calls.
