ARCHITECTURE · 2025-03-27 · BY ELI NAVARRO

Q&A: centralize or embed platform capabilities?

How we decide whether capabilities like auth, flags, or logging live in shared platforms or in individual services.

architecture · platforms · ownership · reliability

Q&A

Why centralize anything at all?

Centralizing capabilities like authentication, feature flags, or logging:

  • reduces duplicate work
  • creates consistent behavior across services
  • gives us one place to enforce policies and observability

Without some centralization, every team re-learns the same hard lessons.

Why not centralize everything?

Because platforms have costs:

  • they become critical dependencies
  • they can become bottlenecks for change
  • they require dedicated ownership and on-call

If we centralize too aggressively, we:

  • slow teams down
  • widen the impact of outages when a platform fails

How do we decide what belongs in a shared platform?

We ask a few questions:

  • Is this capability naturally cross-cutting? (auth, flags, logging usually are.)
  • Do multiple teams need the same behavior?
  • Does centralization make operations safer or more observable?

If the answer to these questions is "yes", we lean toward a shared platform.
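As a rough illustration, the checklist can be written down as a function. This is a sketch, not a tool we run: `CapabilityAssessment` and `lean_toward_platform` are hypothetical names, and the code assumes the strict reading where all three answers must be "yes".

```python
from dataclasses import dataclass


@dataclass
class CapabilityAssessment:
    """Answers to the three platform questions (field names are hypothetical)."""
    cross_cutting: bool       # Is the capability naturally cross-cutting?
    shared_behavior: bool     # Do multiple teams need the same behavior?
    safer_when_central: bool  # Does centralization make operations safer or more observable?


def lean_toward_platform(a: CapabilityAssessment) -> bool:
    """Return True only when every question is answered "yes"."""
    return a.cross_cutting and a.shared_behavior and a.safer_when_central


auth = CapabilityAssessment(True, True, True)
domain_cache = CapabilityAssessment(False, False, False)
print(lean_toward_platform(auth))          # True
print(lean_toward_platform(domain_cache))  # False
```

In practice the answers are rarely clean booleans; the point is that a single "no" is a reason to pause before building a platform.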

When is it better to embed a capability in a service?

We keep things embedded when:

  • the behavior is highly specific to one domain
  • the blast radius of mistakes is small and well-understood
  • the service can own its SLOs without a central dependency

Examples:

  • domain-specific caching strategies
  • one-off data transforms that don’t generalize
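For instance, a domain-specific cache can live entirely inside the service that needs it. The sketch below assumes a simple time-based expiry policy; `LocalTTLCache` and its `ttl_seconds` parameter are hypothetical, and a real domain cache would encode whatever invalidation rules that domain requires.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class LocalTTLCache:
    """In-process cache owned by a single service (hypothetical example).

    Entries expire after `ttl_seconds`. No shared platform is involved,
    so a bug here only affects this one service.
    """

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, object]] = {}

    def get(self, key: str) -> Optional[object]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop it so the caller recomputes from the source of truth.
            del self._entries[key]
            return None
        return value

    def put(self, key: str, value: object) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)
```

Because the expiry rule, key format, and TTL are all local decisions, the owning team can change them without coordinating with anyone else.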

How do SLOs influence the decision?

Shared platforms need:

  • clear SLOs that match or exceed dependent services’ needs
  • a plan for degradation when they’re unhealthy
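A quick calculation shows why the platform's SLO has to exceed, not just match, its consumers' targets. Assuming independent failures and a synchronous call to the platform on every request with no fallback, a service's achievable availability is bounded by the product of its own availability and the platform's. The function names below are illustrative.

```python
from math import prod
from typing import Sequence


def serial_availability(availabilities: Sequence[float]) -> float:
    """Upper bound on availability when every request needs all components.

    Assumes independent failures and synchronous calls with no fallback.
    """
    return prod(availabilities)


def platform_slo_is_sufficient(service_target: float, service_own: float, platform: float) -> bool:
    """Check whether a platform's availability leaves room for the service's target."""
    return serial_availability([service_own, platform]) >= service_target


# A service at 99.95% on its own, depending on a 99.9% platform:
# 0.9995 * 0.999 = 0.9985005, about 99.85%, which misses a 99.9% target.
print(platform_slo_is_sufficient(0.999, 0.9995, 0.999))  # False
```

Every hard dependency eats into the consumer's error budget, which is why a platform that merely matches its consumers' SLOs is not enough.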

If we can’t meet those SLOs, we:

  • reconsider centralization
  • or design services to degrade gracefully when the platform is down
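Graceful degradation often comes down to having a safe local answer when the platform call fails. Here is a minimal sketch for feature flags. `fetch_flag` stands in for a hypothetical flag-platform client, and catching every exception is a simplification; a real client would also set timeouts and distinguish error types.

```python
import logging
from typing import Callable, Mapping

logger = logging.getLogger(__name__)


def flag_enabled(
    name: str,
    fetch_flag: Callable[[str], bool],
    safe_defaults: Mapping[str, bool],
) -> bool:
    """Evaluate a feature flag, falling back to a local default if the platform fails."""
    try:
        return fetch_flag(name)
    except Exception:  # e.g. timeout, connection error, 5xx from the platform
        logger.warning("flag platform unavailable; using default for %s", name)
        # Flags without a configured default are treated as off, the conservative choice.
        return safe_defaults.get(name, False)
```

The defaults live in the consuming service, so it keeps serving requests (with known, reviewed behavior) while the platform is down.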

How do we avoid platforms becoming "one more big ball of mud"?

We:

  • define clear boundaries and responsibilities
  • version and document contracts (APIs, data, guarantees)
  • avoid stuffing every unrelated capability into the same platform
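One common way to version contracts is a semantic-versioning-style rule: a major version bump signals a breaking change, and a minor bump only adds fields or guarantees. The sketch below assumes a two-part `major.minor` version string; it illustrates that convention and is not a description of any particular contract format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContractVersion:
    major: int
    minor: int

    @classmethod
    def parse(cls, text: str) -> "ContractVersion":
        # Assumes the hypothetical "major.minor" format, e.g. "2.3".
        major, minor = text.split(".")
        return cls(int(major), int(minor))


def is_compatible(provided: ContractVersion, required: ContractVersion) -> bool:
    """Same major version, and at least the minor version the consumer relies on."""
    return provided.major == required.major and provided.minor >= required.minor
```

A consumer that requires `2.1` can safely talk to a platform serving `2.3`, but not `2.0` (missing additions) or `3.0` (breaking changes).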

If a proposed feature doesn’t fit the platform’s mission, we:

  • build it closer to the consuming service
  • or create a new, focused platform if it truly is cross-cutting

What about incident response?

Platforms change how we respond to incidents:

  • platform issues can affect many services at once
  • platform dashboards and runbooks become central tools

We make sure platforms have:

  • their own on-call rotation
  • clear communication channels with consumers
  • "blast radius" docs that explain who is impacted when they fail
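A blast radius doc can start from the dependency graph: invert it, then walk outward from the failed platform. This sketch assumes impact propagates through every hard dependency; services that degrade gracefully would, in reality, drop out of the list. The graph and service names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def blast_radius(depends_on: Dict[str, List[str]], failed: str) -> Set[str]:
    """Return every service that directly or transitively depends on `failed`.

    `depends_on` maps each service to the services or platforms it calls.
    """
    # Invert the graph: for each component, record who calls it.
    dependents: Dict[str, List[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk from the failed component through its callers.
    impacted: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, []):
            if caller != failed and caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


graph = {
    "checkout": ["auth", "flags"],
    "search": ["flags"],
    "storefront": ["checkout"],
    "reports": ["logging"],
}
print(sorted(blast_radius(graph, "flags")))  # ['checkout', 'search', 'storefront']
```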

Takeaways

  • Centralization makes sense for truly cross-cutting capabilities with many consumers.
  • Platforms need strong SLOs, ownership, and clear contracts to be worth the dependency.
  • Not everything belongs in a platform; small, domain-specific behaviors often should stay embedded.
  • Thinking about SLOs and incident blast radius early helps avoid building platforms that are bigger risks than the problems they solve.

Further reading