STEWARDSHIP | 2018-03-01 | BY ELI NAVARRO

Maintenance is senior work

Maintenance is where the real system shows up. It requires judgment, not leftovers.

Tags: stewardship, maintenance, reliability, delivery

Maintenance is where you learn what the system actually is.

Features are optional. Production isn’t.

On most inherited systems, the first “small” change teaches the same lesson: nothing is isolated. A minor dependency upgrade reveals hidden coupling. A one-line config tweak exposes a brittle assumption. A clean diff turns into a long review because no one is sure what else it touches.

This is why we treat maintenance as senior work.

Constraints

Maintenance lives inside constraints that greenfield work doesn’t.

The system is running. People rely on it. You don’t get to pause the world to tidy things up.

The technical constraints are predictable: partial docs, missing baselines, hidden dependencies, unclear ownership, and sharp edges that only show up under real traffic.

The human constraints are just as real: stakeholders want visible progress, and urgency is an easy substitute for a plan. If a team is already deep in after-hours work, it’s hard to justify time spent on upgrades, tests, and runbooks—until the day it’s unavoidable.

Maintenance also has an invisible enemy: recency bias.

When nothing is on fire, it feels safe to defer.

But deferral is how you turn a one-hour upgrade into a weekend cutover. The system doesn’t get simpler while you wait. It gets more coupled, more outdated, and less testable.

The other trap is calling it “cleanup.” Cleanup without a risk statement gets deprioritized forever.

Maintenance with a risk statement gets scheduled.

What we changed

We stopped treating maintenance as a chore and started treating it as risk work.

Senior engineers lead the changes that reduce future blast radius:

  • dependency upgrades and “rot” prevention
  • migration planning (especially anything involving data)
  • observability and operational guardrails
  • test strategy that matches how the system fails
  • removing coupling that makes every change high risk

We also stopped doing maintenance “in leftovers.”

We set a small explicit budget (time and attention) and protect it the same way we protect feature delivery:

  • a standing maintenance lane (a few hours/week or a day per sprint)
  • a rotating maintainer who owns upgrades, dependency drift, and small refactors
  • a weekly risk review where we pick the next two boring things
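A rotation like this is easy to make explicit so nobody has to remember whose week it is. The sketch below is one possible approach, not a description of our tooling: the roster names, the start date, and the `maintainer_for` helper are all hypothetical, and it assumes a simple weekly round-robin.

```python
from __future__ import annotations

from datetime import date


def maintainer_for(day: date, roster: list[str], rotation_start: date) -> str:
    """Return who holds the maintenance lane on `day`.

    Assumes a weekly round-robin that begins on `rotation_start`.
    """
    if not roster:
        raise ValueError("roster must contain at least one engineer")
    weeks_elapsed = (day - rotation_start).days // 7
    return roster[weeks_elapsed % len(roster)]


# Hypothetical roster and start date, for illustration only.
roster = ["ana", "ben", "chi"]
start = date(2018, 3, 5)  # a Monday

print(maintainer_for(date(2018, 3, 5), roster, start))   # ana
print(maintainer_for(date(2018, 3, 14), roster, start))  # ben
print(maintainer_for(date(2018, 3, 26), roster, start))  # ana (rotation wraps)
```

Counting weeks from a fixed start date, rather than using calendar week numbers, keeps the rotation even across year boundaries.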

The work is intentionally small: upgrade one dependency, delete one unused path, write one runbook section, add one guardrail.

If the only maintenance you do is the emergency kind, the system trains you to be afraid of touching it.

We also changed how we write maintenance tickets. Each ticket must answer:

  • what risk it reduces
  • how we’ll know it worked (a check, a metric, or a test)
  • what “stop” looks like if it makes things worse
  • what the rollback is (even if rollback is “revert the config change”)

That makes maintenance reviewable. It stops being “cleanup” and becomes operational work.
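The four questions can be enforced mechanically instead of by convention. Here is a minimal sketch, assuming tickets can be represented as simple records; the `MaintenanceTicket` class, its field names, and the sample ticket are illustrative, not a real tracker's schema.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MaintenanceTicket:
    """Hypothetical ticket shape; field names are illustrative."""

    title: str
    risk_reduced: str    # what risk it reduces
    success_check: str   # how we'll know it worked (check, metric, or test)
    stop_condition: str  # what "stop" looks like if it makes things worse
    rollback: str        # what the rollback is


def missing_answers(ticket: MaintenanceTicket) -> list[str]:
    """Return the names of fields that are empty or whitespace only."""
    return [f.name for f in fields(ticket) if not getattr(ticket, f.name).strip()]


# Sample ticket with one question left unanswered.
ticket = MaintenanceTicket(
    title="Upgrade HTTP client library",
    risk_reduced="Pinned version no longer receives security fixes",
    success_check="Integration tests pass; error rate unchanged after deploy",
    stop_condition="",
    rollback="Revert the dependency pin",
)

print(missing_answers(ticket))  # ['stop_condition']
```

A check like this could run when a ticket is filed or at the weekly risk review, so an unanswered question is caught before the work starts.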

We also tightened the definition of “done.”

A feature that ships without an operational story is not done. A service with no rollback path is not done. A system that can’t be upgraded safely is not done.

Junior engineers should absolutely do maintenance work—but not by being thrown into production alone. The goal is to teach safe maintenance inside constraints, not to use production as a rite of passage.

We pair maintenance deliberately.

A senior engineer drives the plan. A junior engineer does the work with guardrails: a rollback, a test, and a clear stop condition.

The point is to build judgment. Maintenance teaches you how the system actually fails, and that’s the fastest way to learn “production thinking” without inventing heroics.

Results / Measurements

Maintenance doesn’t always produce a single headline number, but it changes the shape of delivery.

  • releases get smaller because the system is easier to change
  • regressions get rarer because boundaries are clearer
  • upgrades stop becoming “big bang” events because they happen continuously

A proxy we like is time-to-understand: how long it takes an engineer to reason about a part of the system well enough to change it safely.

We also watch two boring numbers:

  • change failure rate: how often a deploy needs rollback or follow-up hotfix
  • unplanned work ratio: how much time goes to urgent fixes vs planned improvements

When maintenance is owned and paced, both tend to move in the right direction.
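Both numbers are cheap to compute once the inputs are recorded. The sketch below assumes each deploy is logged with whether it needed a rollback or a follow-up hotfix, and that hours are tagged as planned or unplanned. The `Deploy` record and both function names are hypothetical, and treating the ratio as unplanned hours over total hours is one reasonable reading of the definition above, not the only one.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Deploy:
    """Hypothetical deploy record."""

    needed_rollback: bool
    needed_hotfix: bool


def change_failure_rate(deploys: list[Deploy]) -> float:
    """Fraction of deploys that needed a rollback or a follow-up hotfix."""
    if not deploys:
        return 0.0
    failed = sum(1 for d in deploys if d.needed_rollback or d.needed_hotfix)
    return failed / len(deploys)


def unplanned_work_ratio(unplanned_hours: float, planned_hours: float) -> float:
    """Share of total tracked hours spent on urgent, unplanned fixes."""
    total = unplanned_hours + planned_hours
    return unplanned_hours / total if total else 0.0


# Illustrative data, not real measurements.
deploys = [
    Deploy(needed_rollback=False, needed_hotfix=False),
    Deploy(needed_rollback=True, needed_hotfix=False),
    Deploy(needed_rollback=False, needed_hotfix=False),
    Deploy(needed_rollback=False, needed_hotfix=False),
]

print(change_failure_rate(deploys))    # 0.25
print(unplanned_work_ratio(10.0, 30.0))  # 0.25
```

The absolute values matter less than the trend from one review to the next.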

We also track:

  • dependency age: how far behind current we are on core libraries
  • time-to-upgrade: how long a typical upgrade takes end-to-end

If those numbers grow, it’s a sign maintenance is slipping back into “later.”
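Dependency age can be approximated without special tooling. This sketch assumes you can look up the release date of the version you have pinned and of the latest available version; the `Dependency` record, the library names, the dates, and the 180-day threshold are all made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Dependency:
    """Hypothetical record of a pinned library and its newest release."""

    name: str
    pinned_released: date  # release date of the version in use
    latest_released: date  # release date of the newest version


def age_in_days(dep: Dependency) -> int:
    """Days between the pinned version's release and the latest release."""
    return (dep.latest_released - dep.pinned_released).days


def stale(deps: list[Dependency], max_days: int) -> list[str]:
    """Names of dependencies further behind than `max_days`, oldest first."""
    behind = [d for d in deps if age_in_days(d) > max_days]
    return [d.name for d in sorted(behind, key=age_in_days, reverse=True)]


# Illustrative data only.
deps = [
    Dependency("lib_a", date(2017, 1, 10), date(2018, 2, 1)),
    Dependency("lib_b", date(2018, 1, 15), date(2018, 2, 14)),
]

print(stale(deps, max_days=180))  # ['lib_a']
```

Sorting oldest first gives the weekly risk review a ready-made shortlist of upgrade candidates.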

Time-to-understand follows the same pattern: it rises when maintenance is neglected and drops when maintenance is owned and paced.

Takeaways

Maintenance is senior work because it’s where tradeoffs are real.

You need context to decide what not to touch, what to stabilize, and what to rewrite.

You also need taste.

Not “design taste.” Operational taste: the ability to spot the change that looks small but will create recurring incident work.

That’s why maintenance is a leadership job, not a leftovers job.

And you need the discipline to do it when nothing is on fire.

The system will always offer you a reason to postpone maintenance.

Senior work is doing it anyway.

And when you do it consistently, you get compounding returns: fewer surprises and more boring deploys.

Maintenance isn’t just about keeping up. It’s about keeping the system understandable.

If you want faster delivery, maintenance is one of the few levers that actually works.
