Standardizing circuit breaker patterns across services
We moved from ad-hoc circuit breakers to a shared pattern so that a failing dependency no longer triggers a different, uncoordinated response from every service.
Circuit breakers started as one-off patches.
A team would see a dependency misbehave and add a quick safeguard:
- if failures exceed N in M seconds, stop calling for a while
Over time, every service grew its own version:
- slightly different thresholds
- slightly different reset logic
- slightly different metrics (or none)
During incidents, this fragmentation had real costs:
- some services kept hammering a failing dependency
- others backed off too aggressively, causing avoidable degradations
- it was hard to answer "who is still calling this dependency right now?"
We decided to standardize.
Constraints
- We could not migrate every call site at once.
- Different dependencies had different characteristics (latency, error modes).
- We wanted a pattern that worked in both synchronous and asynchronous contexts.
What we changed
1. Define a simple, explicit contract
We wrote down what a circuit breaker should do for us (a sketch of the contract follows this list):
- detect when a dependency is unhealthy based on failures or latency
- stop sending most traffic while unhealthy
- periodically test the dependency to see if it has recovered
- expose its state (open/half-open/closed) via metrics and logs
We added one more requirement:
- make it obvious to operators when the breaker, not the dependency, is blocking calls
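Sketched as a Python interface (the names here are illustrative, not the actual library API), the contract looks roughly like this:

```python
from enum import Enum
from typing import Callable, Protocol, TypeVar

T = TypeVar("T")

class BreakerState(Enum):
    CLOSED = "closed"        # traffic flows normally
    OPEN = "open"            # calls are rejected immediately
    HALF_OPEN = "half_open"  # a limited number of trial calls are allowed

class CircuitBreaker(Protocol):
    @property
    def state(self) -> BreakerState:
        """Current state, also exported via metrics and structured logs."""
        ...

    def call(self, fn: Callable[[], T]) -> T:
        """Invoke fn through the breaker.

        When the breaker is open, raises a breaker-specific error rather
        than the dependency's own error, so operators can tell that the
        breaker, not the dependency, is blocking the call.
        """
        ...
```

The breaker-specific error type is what satisfies the last requirement: an operator reading logs can immediately distinguish "the dependency failed" from "the breaker refused to call it".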
2. Build a shared implementation
We built a small library that implemented this contract:
- consistent state machine (closed → open → half-open → closed)
- configurable thresholds per dependency
- hooks for metrics and structured logs
Services integrated the library instead of rolling their own.
Where language differences existed, we ported the same semantics.
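A trimmed-down version of the core state machine, assuming a consecutive-failure threshold and a fixed reset timeout (the shared library also tracks latency, limits concurrent trial calls while half-open, and fires metric/log hooks):

```python
import time

class BreakerOpenError(Exception):
    """Raised when the breaker, not the dependency, rejects the call."""

class SimpleCircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if time.monotonic() - self._opened_at >= self.reset_timeout:
                self.state = "half_open"  # allow a trial call to probe for recovery
            else:
                raise BreakerOpenError("circuit open, call rejected")
        try:
            result = fn()
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_success(self):
        self._failures = 0
        self.state = "closed"              # recovered (or still healthy)

    def _record_failure(self):
        self._failures += 1
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"            # stop sending traffic
            self._opened_at = time.monotonic()
```

Call sites wrap their dependency calls in breaker.call(...) and catch BreakerOpenError to fall back or fail fast instead of waiting on a dependency that is known to be unhealthy.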
3. Make configuration data-driven
Breaker settings live in configuration rather than being hard-coded at call sites.
This makes it possible to:
- tune thresholds without redeploying
- apply similar policies across services that share a dependency
We documented defaults (an example configuration is sketched below):
- conservative starting thresholds
- recommended values for different classes of dependencies (e.g., internal vs external)
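As an illustration (dependency names and values below are made up, not our production settings), the configuration shape looks like this:

```python
# Conservative defaults, with per-dependency overrides kept in configuration.
DEFAULTS = {"failure_threshold": 5, "reset_timeout_s": 30, "latency_budget_ms": 500}

OVERRIDES = {
    # internal dependencies: tolerate more failures, retest sooner
    "internal-user-service": {"failure_threshold": 10, "reset_timeout_s": 15},
    # external dependencies: open earlier, back off longer, allow higher latency
    "external-payments-api": {"failure_threshold": 3, "reset_timeout_s": 60,
                              "latency_budget_ms": 2000},
}

def breaker_settings(dependency: str) -> dict:
    """Merge defaults with any per-dependency overrides loaded from config."""
    return {**DEFAULTS, **OVERRIDES.get(dependency, {})}
```

Because these values come from configuration rather than code, tightening a threshold during an incident is a config change, not a redeploy.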
4. Expose breaker state in dashboards and runbooks
We added:
- metrics for breaker state per dependency
- panels showing when breakers open and close
- guidance in runbooks for how to interpret breaker behavior
During incidents, this helps answer:
- is the dependency still failing, or are our breakers still holding the circuit open?
- which services are currently blocked by their breakers?
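The state metrics come from a small transition hook the breaker calls on every state change; a minimal sketch (field names are illustrative):

```python
import json
import logging
import time

log = logging.getLogger("circuit_breaker")

# Numeric encoding so state can be plotted as a gauge: 0=closed, 1=half_open, 2=open.
STATE_VALUES = {"closed": 0, "half_open": 1, "open": 2}

def on_state_change(dependency: str, old_state: str, new_state: str) -> None:
    """Emit a structured log line for every breaker transition.

    The same hook is where a per-dependency gauge would be set through
    whatever metrics client the service already uses.
    """
    log.warning(json.dumps({
        "event": "breaker_state_change",
        "dependency": dependency,
        "old_state": old_state,
        "new_state": new_state,
        "state_value": STATE_VALUES[new_state],
        "ts": time.time(),
    }))
```

Because every service emits the same event shape, the same dashboards and runbook queries work for all of them.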
5. Migrate incrementally
We migrated call sites in phases:
- high-risk dependencies first
- less critical paths later
At each step, we checked:
- changes in error and latency patterns
- breaker behavior under normal and stressed conditions
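To make the phasing concrete: in the sketch below (building on the SimpleCircuitBreaker and breaker_settings examples above, with a hypothetical "enabled" flag), a call site only routes through its breaker once its dependency has been switched on in configuration:

```python
_breakers: dict = {}

def get_breaker(dependency: str, settings: dict) -> SimpleCircuitBreaker:
    """Illustrative registry: one breaker instance per dependency."""
    if dependency not in _breakers:
        _breakers[dependency] = SimpleCircuitBreaker(
            failure_threshold=settings["failure_threshold"],
            reset_timeout=settings["reset_timeout_s"],
        )
    return _breakers[dependency]

def guarded_call(dependency: str, fn):
    """Route a call through its breaker only once that dependency is migrated."""
    settings = breaker_settings(dependency)
    if not settings.get("enabled", False):   # "enabled" is the migration flag (hypothetical)
        return fn()                          # not migrated yet: call the dependency directly
    return get_breaker(dependency, settings).call(fn)
```

Flipping the flag per dependency let us roll the breaker out to high-risk paths first and watch the metrics before touching the rest.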
Results / Measurements
After the migration, we saw:
- more predictable behavior when dependencies degraded
- fewer "retry storms" from services that didn’t previously have breakers
- faster diagnosis of who was still calling a failing dependency
Breakers didn’t remove failures, but they kept their impact bounded.
Takeaways
- Circuit breakers are a coordination tool, not just a local optimization.
- Standardizing their behavior and metrics makes incidents easier to reason about.
- Configuration-driven thresholds let us tune without code churn.
- Breakers should be visible in dashboards and runbooks, not hidden behind library calls.