ARCHITECTURE | 2025-03-27 | BY ELI NAVARRO
Q&A: centralize or embed platform capabilities?
How we decide whether capabilities like auth, flags, or logging live in shared platforms or in individual services.
architecture, platforms, ownership, reliability
Q&A
Why centralize anything at all?
Centralizing capabilities like authentication, feature flags, or logging:
- reduces duplicate work
- creates consistent behavior across services
- gives us one place to enforce policies and observability
Without some centralization, every team re-learns the same hard lessons.
Why not centralize everything?
Because platforms have costs:
- they become critical dependencies
- they can become bottlenecks for change
- they require dedicated ownership and on-call
If we centralize too aggressively, we:
- slow teams down
- make outages wider when platforms fail
How do we decide what belongs in a shared platform?
We ask a few questions:
- Is this capability naturally cross-cutting? (auth, flags, logging usually are.)
- Do multiple teams need the same behavior?
- Does centralization make operations safer or more observable?
If the answer to all three is "yes", we lean toward a shared platform.
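As a rough illustration, the three questions can be written down as a checklist. This is a minimal sketch, not a tool we are describing: the `Capability` fields, the `lean_toward_platform` name, and the `min_teams` threshold of 2 (our reading of "multiple teams") are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Capability:
    """Hypothetical record of the answers to the three questions."""
    name: str
    cross_cutting: bool       # naturally spans services (auth, flags, logging)
    teams_needing_it: int     # how many teams need the same behavior
    safer_when_central: bool  # centralizing improves operations or observability


def lean_toward_platform(cap: Capability, min_teams: int = 2) -> bool:
    """Return True only when every question is answered "yes".

    min_teams is an assumed threshold for "multiple teams".
    """
    return (
        cap.cross_cutting
        and cap.teams_needing_it >= min_teams
        and cap.safer_when_central
    )


# Illustrative inputs, not real measurements.
auth = Capability("auth", cross_cutting=True, teams_needing_it=6, safer_when_central=True)
pricing_cache = Capability("pricing-cache", cross_cutting=False, teams_needing_it=1, safer_when_central=False)

print(lean_toward_platform(auth))
print(lean_toward_platform(pricing_cache))
```

The function only expresses a lean, not a verdict. A "yes" on all three still has to survive the cost questions covered earlier.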
When is it better to embed capability in a service?
We keep things embedded when:
- the behavior is highly specific to one domain
- the blast radius of mistakes is small and well-understood
- the service can own its SLOs without a central dependency
Examples:
- domain-specific caching strategies
- one-off data transforms that don’t generalize
How do SLOs influence the decision?
Shared platforms need:
- clear SLOs that match or exceed dependent services’ needs
- a plan for degradation when they’re unhealthy
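To see why a platform SLO has to match or exceed what its consumers need, it helps to do the arithmetic. The sketch below assumes the platform is a hard dependency (every request needs it) and that failures are independent, in which case availabilities multiply. The figures are illustrative, not measured values.

```python
def composite_availability(service: float, *hard_dependencies: float) -> float:
    """Upper bound on availability when every request needs all hard dependencies.

    Assumes independent failures, so the individual availabilities multiply.
    """
    result = service
    for dependency in hard_dependencies:
        result *= dependency
    return result


# Illustrative numbers: the service alone manages 99.95%, the platform 99.9%.
service_alone = 0.9995
platform = 0.999
target = 0.999  # the SLO the service wants to offer

achievable = composite_availability(service_alone, platform)
print(f"achievable: {achievable:.5f}, meets target: {achievable >= target}")
```

Under these assumptions the service cannot promise 99.9% while it depends synchronously on a 99.9% platform. Either the platform's SLO has to be stricter or the dependency has to become soft.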
If we can’t meet those SLOs, we:
- reconsider centralization
- or design services to degrade gracefully when the platform is down
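One way to degrade gracefully is to wrap the platform call so the service keeps working from the last value it saw, or from a safe default it chose itself. The sketch below uses feature flags as the example. `FlagClient`, `fetch_from_platform`, and the flag names are hypothetical and do not refer to any real library.

```python
from typing import Callable, Dict


class FlagClient:
    """Hypothetical wrapper around a shared feature-flag platform."""

    def __init__(self, fetch: Callable[[str], bool], defaults: Dict[str, bool]):
        self._fetch = fetch          # call into the platform; may raise
        self._defaults = defaults    # safe values chosen by the consuming service
        self._last_known: Dict[str, bool] = {}

    def is_enabled(self, flag: str) -> bool:
        try:
            value = self._fetch(flag)
        except Exception:
            # Platform unhealthy: prefer the last known value, then the safe
            # default, then "off". Real code would catch narrower exceptions.
            if flag in self._last_known:
                return self._last_known[flag]
            return self._defaults.get(flag, False)
        self._last_known[flag] = value
        return value


# Fake platform used only for this example.
platform = {"up": True}


def fetch_from_platform(flag: str) -> bool:
    if not platform["up"]:
        raise ConnectionError("flag platform unavailable")
    return flag == "new-checkout"


client = FlagClient(fetch_from_platform, defaults={"dark-mode": True})
print(client.is_enabled("new-checkout"))  # platform healthy: live value
platform["up"] = False
print(client.is_enabled("new-checkout"))  # platform down: last known value
print(client.is_enabled("dark-mode"))     # never fetched: service's safe default
```

The design choice here is that the consuming service, not the platform, decides what "safe" means for each flag.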
How do we avoid platforms becoming "one more big ball of mud"?
We:
- define clear boundaries and responsibilities
- version and document contracts (APIs, data, guarantees)
- avoid stuffing every unrelated capability into the same platform
If a proposed feature doesn’t fit the platform’s mission, we:
- build it closer to the consuming service
- or create a new, focused platform if it truly is cross-cutting
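Versioned contracts are easier to reason about when compatibility can be checked mechanically. The sketch below assumes a semantic-versioning-style rule (same major version, provider minor version at least what the consumer requires). The `Contract` type and the rule itself are assumptions for illustration, not a description of how any particular platform versions its APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    """Hypothetical versioned contract between a platform and a consumer."""
    name: str
    major: int  # bumped on breaking changes
    minor: int  # bumped on backward-compatible additions


def is_compatible(provided: Contract, required: Contract) -> bool:
    """True when the provided contract satisfies what the consumer requires."""
    return (
        provided.name == required.name
        and provided.major == required.major
        and provided.minor >= required.minor
    )


platform_offers = Contract("auth-token-api", major=2, minor=3)
print(is_compatible(platform_offers, Contract("auth-token-api", 2, 1)))
print(is_compatible(platform_offers, Contract("auth-token-api", 1, 9)))
```

A check like this could run in a consumer's CI so that a breaking platform change is caught before deployment rather than during an incident.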
What about incident response?
Platforms change how we respond to incidents:
- platform issues can affect many services at once
- platform dashboards and runbooks become central tools
We make sure platforms have:
- their own on-call rotation
- clear communication channels with consumers
- "blast radius" docs that explain who is impacted when they fail
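A "blast radius" doc can be partly generated if the dependency map is kept as data. The sketch below walks a hypothetical map of direct dependencies and returns everything impacted, directly or transitively, when one platform fails. The service names and the `DEPENDS_ON` structure are invented for the example.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical map: each service -> the things it depends on directly.
DEPENDS_ON: Dict[str, List[str]] = {
    "checkout": ["auth", "flags"],
    "search": ["flags"],
    "orders": ["checkout"],
    "reports": ["logging"],
}


def blast_radius(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every service that depends on `failed`, directly or transitively."""
    # Invert the map so we can look up "who depends on X".
    dependents: Dict[str, Set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    impacted: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in impacted:  # also guards against cycles
                impacted.add(service)
                queue.append(service)
    return impacted


print(sorted(blast_radius("auth", DEPENDS_ON)))   # ['checkout', 'orders']
print(sorted(blast_radius("flags", DEPENDS_ON)))  # ['checkout', 'orders', 'search']
```

The output treats every dependency as a hard one. Services that degrade gracefully when the platform is down would need to be marked separately, since they are affected but not necessarily broken.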
Takeaways
- Centralization makes sense for truly cross-cutting capabilities with many consumers.
- Platforms need strong SLOs, ownership, and clear contracts to be worth the dependency.
- Not everything belongs in a platform; small, domain-specific behaviors often should stay embedded.
- Thinking about SLOs and incident blast radius early helps avoid building platforms that are bigger risks than the problems they solve.