ARCHITECTURE · 2018-09-28 · BY ELI NAVARRO

Decision record: Delay the service split

We kept one deployable and invested in boundaries, tests, and observability first. Splitting later became safer and less dramatic.

architecture, boundaries, migrations, reliability

Context

We inherited a codebase that was described as “a monolith” and therefore “the problem.”

The request to split it was reasonable: deploys were stressful, ownership was fuzzy, and every change felt like it could break “somewhere else.”

But the real problem was operational: unclear ownership boundaries, weak deploy confidence, and limited observability. Splitting into services would have multiplied those problems (more pipelines, more dashboards, more pages, more places to be blind) before we had the habits to operate them.

Constraints at the time:

  • Small team, limited on-call coverage.
  • No reliable end-to-end test path for core flows.
  • Rollback was possible, but slow and under-documented.
  • Boundaries were social, not technical. A service split would have frozen the confusion into APIs.
  • We didn’t have a clear “first dashboard” per subsystem, so incidents started with a debate about where to look.

We wanted to avoid a common trap: calling a rewrite “architecture” and then discovering we just moved the mess across network calls.

Decision

We will keep a single deployable for now.

Before we split anything, we will:

  • draw boundaries in code (modules, interfaces, ownership)
  • establish baselines (latency, error rate, queue depth) and page only on impact
  • make rollback routine (fast, documented, exercised)
  • create a single starting dashboard per subsystem and link it from pages/tickets
  • treat the boundary as a contract we can test, even while everything is still one deploy
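
One way to read the last item: the boundary is an interface, and the contract is a test that any implementation must pass. The sketch below is illustrative only. `BillingBoundary`, `InProcessBilling`, `Invoice`, and `check_billing_contract` are hypothetical names, not taken from our codebase, and the billing domain is just an example.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Invoice:
    order_id: str
    total_cents: int


class BillingBoundary(Protocol):
    """Hypothetical contract: the only way other modules may reach billing."""

    def create_invoice(self, order_id: str, total_cents: int) -> Invoice: ...


class InProcessBilling:
    """In-process implementation; ships in the same deployable as its callers."""

    def create_invoice(self, order_id: str, total_cents: int) -> Invoice:
        if total_cents < 0:
            raise ValueError("total_cents must be non-negative")
        return Invoice(order_id=order_id, total_cents=total_cents)


def check_billing_contract(billing: BillingBoundary) -> None:
    """Contract test: works against any implementation of the boundary."""
    invoice = billing.create_invoice("order-1", 1250)
    assert invoice.order_id == "order-1"
    assert invoice.total_cents == 1250
    try:
        billing.create_invoice("order-2", -1)
    except ValueError:
        pass  # rejecting negative totals is part of the contract
    else:
        raise AssertionError("negative totals must be rejected")


# Today the contract runs against the in-process implementation.
check_billing_contract(InProcessBilling())
```

If the module is ever split out, the same `check_billing_contract` could be pointed at a remote client, so the split changes the transport rather than the agreement.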

We will revisit a service split only when:

  • the boundary is already stable in code (clear inputs/outputs, minimal shared state)
  • each boundary has an owner and an on-call story
  • deploy and rollback are boring (fast, documented, exercised)
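
These criteria can be written down as a checklist that returns reasons to wait rather than a yes/no. The following is a minimal sketch. The field names and the thresholds (zero shared tables, two rollback drills, fifteen minutes) are assumptions chosen for illustration, not numbers we agreed on.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BoundaryStatus:
    """Hypothetical snapshot of one candidate boundary; all fields are illustrative."""
    name: str
    shared_tables: int        # tables written from both sides of the boundary
    owner: Optional[str]      # owning team, if any
    has_oncall_story: bool    # someone is paged and has a runbook
    rollback_drills: int      # rollbacks exercised recently
    rollback_minutes: float   # typical time to roll back


def split_blockers(
    status: BoundaryStatus,
    max_shared_tables: int = 0,          # assumed threshold
    min_rollback_drills: int = 2,        # assumed threshold
    max_rollback_minutes: float = 15.0,  # assumed threshold
) -> List[str]:
    """Return the reasons a split should wait; an empty list means 'revisit'."""
    reasons = []
    if status.shared_tables > max_shared_tables:
        reasons.append("shared state across the boundary")
    if status.owner is None:
        reasons.append("no owner")
    if not status.has_oncall_story:
        reasons.append("no on-call story")
    if status.rollback_drills < min_rollback_drills:
        reasons.append("rollback not exercised")
    if status.rollback_minutes > max_rollback_minutes:
        reasons.append("rollback too slow")
    return reasons


billing = BoundaryStatus("billing", 2, "payments-team", True, 0, 40.0)
print(split_blockers(billing))
# ['shared state across the boundary', 'rollback not exercised', 'rollback too slow']
```

Returning the list of blockers keeps the conversation on what to fix next instead of on whether to split.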

Consequences

  • Some changes will feel slower because we’re paying down coupling first.
  • On-call stays simpler: one deployment pipeline, one set of dashboards, fewer failure modes.
  • When we do split, it will be a migration with a plan, not a rewrite with optimism.
  • We learn where the real boundaries are (change volume + incident patterns) before we cut new network boundaries.
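
Change volume can be read from commit history: modules that keep changing in the same commit are probably not separated by a real boundary yet. The sketch below counts such co-changes. It assumes the top-level directory names the module and takes commits as plain lists of file paths; both are simplifications, and the sample data is made up.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple


def module_of(path: str) -> str:
    """Assumption: the top-level directory names the module."""
    return path.split("/", 1)[0]


def co_change_counts(commits: Iterable[List[str]]) -> Counter:
    """Count how often each pair of modules is touched by the same commit."""
    counts: Counter = Counter()
    for files in commits:
        modules = sorted({module_of(f) for f in files})
        for pair in combinations(modules, 2):
            counts[pair] += 1
    return counts


# Made-up history: each inner list is the set of files touched by one commit.
commits = [
    ["billing/invoice.py", "orders/checkout.py"],
    ["billing/tax.py"],
    ["orders/cart.py", "billing/invoice.py"],
    ["search/index.py"],
]
print(co_change_counts(commits).most_common())
# [(('billing', 'orders'), 2)]
```

A pair that shows up often here, and also in incident write-ups, is a candidate for decoupling in code before it becomes a network boundary.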
