Story: the job that never had an SLA
A background job ran for years without a clear expectation. When it finally broke, we had to decide what 'on time' meant.
What happened
Every night, a background job reconciled a set of records.
It had existed longer than most of the team.
Nobody could remember who originally wrote it. The code had been moved, renamed, and partially refactored, but the behavior was roughly the same:
- gather a batch of records
- call out to a few dependencies
- write results back
There was no explicit SLO.
The job "usually" finished sometime before people arrived in the morning. That was considered good enough—until it wasn’t.
The night it slipped
One night, the job didn’t finish before business hours.
At first, the only signal was a trickle of support tickets:
- "My report is missing data"
- "This status hasn’t updated yet"
Engineers on call opened dashboards and saw that the daytime systems were healthy.
It took another 30 minutes before someone checked the job system and noticed that the nightly reconciliation was still grinding through work.
By then, the backlog had grown enough that even once the job finished, the results would still be several hours late.
The missing expectation
We had a debate in the incident channel:
- How late is "too late" for this job?
- Is this an incident or just an inconvenience?
- Should we pause other work to let it catch up, or kill it and let it rerun tomorrow?
We realized we had no shared answer.
The job was important enough that people noticed when it slipped—but not important enough that anyone had written down what "on time" meant.
In the moment, we did the reasonable things:
- prioritized clearing the backlog
- communicated with support so they could answer user questions more concretely
Afterward, we treated the gap as the real problem.
What we changed
1. Give the job an SLO
We started by writing down what success looked like:
- "The reconciliation job completes by HH:MM local time on at least N% of days."
We picked HH:MM based on when downstream users actually needed the data.
This became the job’s SLO.
We added a simple metric:
- completion time per run
And two alerts:
- if the job has not started by a certain time
- if it has not finished by the SLO time
Now a late run is visible before support tickets accumulate.
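For illustration, here is a minimal sketch of what the completion metric and the two alerts could look like, assuming the job system exposes start and finish timestamps. The deadlines, function names, and alert wording are hypothetical placeholders, not our real values.

```python
from datetime import datetime, time
from typing import Optional

# Placeholder deadlines for illustration; the real times were chosen from
# when downstream users actually needed the data.
START_DEADLINE = time(1, 30)    # job should have started by 01:30 local
FINISH_DEADLINE = time(6, 0)    # SLO: job should be finished by 06:00 local

def check_nightly_job(started_at: Optional[datetime],
                      finished_at: Optional[datetime],
                      now: datetime) -> list[str]:
    """Return alert messages for a single nightly run.

    Intended to be run every few minutes by a scheduler; the two checks
    mirror the two alerts described above.
    """
    alerts = []
    if started_at is None and now.time() >= START_DEADLINE:
        alerts.append("reconciliation: not started by start deadline")
    if finished_at is None and now.time() >= FINISH_DEADLINE:
        alerts.append("reconciliation: not finished by SLO time")
    return alerts

def completion_seconds(started_at: datetime, finished_at: datetime) -> float:
    """The per-run metric: how long the run took, recorded once it finishes."""
    return (finished_at - started_at).total_seconds()
```

Run on a schedule, something like this turns "the job is late" into an alert rather than a morning surprise.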
2. Make progress observable
Previously, we only knew whether the job had "finished" or not.
We added progress reporting:
- total items to process
- items processed so far
- rate over the last few minutes
This allowed us to answer:
- Is it stuck, or just slow?
- If current throughput holds, when will it finish?
During follow-up work, we also discovered a few inefficiencies that were easy to fix once we could see them.
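One way to implement that kind of progress reporting is sketched below: a sliding window over recent completions gives a throughput estimate and, from it, a projected finish time. The class name, window size, and usage pattern are illustrative choices, not the actual implementation.

```python
from collections import deque
from time import monotonic

class ProgressTracker:
    """Tracks items processed and estimates remaining time from recent throughput."""

    def __init__(self, total_items: int, window_seconds: float = 300.0):
        self.total = total_items
        self.done = 0
        self.window = window_seconds
        self.events = deque()  # (timestamp, item_count) pairs within the window

    def record(self, items: int = 1) -> None:
        now = monotonic()
        self.done += items
        self.events.append((now, items))
        # Drop events older than the window so the rate reflects recent work.
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()

    def rate_per_second(self) -> float:
        """Throughput over the last few minutes."""
        if len(self.events) < 2:
            return 0.0
        span = self.events[-1][0] - self.events[0][0]
        items = sum(n for _, n in self.events)
        return items / span if span > 0 else 0.0

    def eta_seconds(self) -> float | None:
        """If current throughput holds, how long until the job finishes?"""
        rate = self.rate_per_second()
        if rate <= 0:
            return None  # no recent progress: possibly stuck, not just slow
        return (self.total - self.done) / rate

# Illustrative usage inside the job loop:
#   tracker = ProgressTracker(total_items=len(batch))
#   ... process one record ...
#   tracker.record()
```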
3. Define failure behavior
We wrote down what should happen if the job cannot complete on time:
- which metrics and dashboards to check
- whether to prioritize partial completion for certain subsets
- how to communicate expected delays to users and support
This turned future decisions from improvisation into execution of a plan.
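As a purely illustrative sketch of the "partial completion" idea, assuming records can be grouped into subsets with a downstream priority; the batch shape, priority scheme, and cutoff time below are invented for the example.

```python
from datetime import datetime, time

def order_batches(batches: list[dict], now: datetime) -> list[dict]:
    """Decide processing order for the remaining batches.

    `batches` is assumed to look like
    {"name": "invoices", "priority": 1, "records": [...]},
    where lower priority numbers mean more important downstream consumers.
    If the run is at risk of missing its deadline, process the important
    subsets first so a partial completion still covers what users need most.
    """
    at_risk = now.time() >= time(4, 30)  # illustrative "behind schedule" cutoff
    if at_risk:
        return sorted(batches, key=lambda b: b["priority"])
    return batches  # plenty of time: keep the normal order
```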
4. Assign ownership
We identified a team as the owner of the job.
Ownership meant:
- keeping the SLO and alerts healthy
- being the first point of contact during regressions
- reviewing changes that affected the job’s inputs or outputs
We also documented the job’s purpose and dependencies.
What we changed in ourselves
The job was not unique.
Once we saw the pattern, we found other long-running tasks that also "never had an SLA":
- weekly reports
- periodic cleanups
- syncs with external systems
We started asking a simple question in design reviews:
- "If this runs later than planned, who notices and why does it matter?"
If the answer was "nobody" or "it doesn’t," we relaxed.
If the answer was "support, finance, or a partner," we treated it the same way we treated user-facing latency: with SLOs, alerts, and ownership.
What we changed in the system
Concretely, we:
- added per-job metrics and health checks to our job system
- created a small dashboard for long-running recurring jobs
- added a "job SLO" section to runbooks where relevant
We also aligned job scheduling with expectations:
- jobs that fed morning workflows started earlier
- low-priority maintenance jobs ran later or in narrower windows
This reduced contention and made it easier to reason about which parts of the job fleet mattered most at any given time.
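A minimal sketch of the kind of per-job metadata that makes this reasoning possible, assuming a small in-code registry; the job names, teams, fields, and times are hypothetical.

```python
from dataclasses import dataclass
from datetime import time

@dataclass
class RecurringJob:
    """Per-job record behind the dashboard and runbook sections described above."""
    name: str
    owner_team: str
    start_after: time            # start of the scheduling window
    slo_finish_by: time | None   # None for jobs with no downstream deadline
    feeds_morning_workflows: bool

# Hypothetical entries for illustration.
JOBS = [
    RecurringJob("nightly-reconciliation", "data-platform",
                 start_after=time(0, 30), slo_finish_by=time(6, 0),
                 feeds_morning_workflows=True),
    RecurringJob("weekly-cleanup", "infra",
                 start_after=time(2, 0), slo_finish_by=None,
                 feeds_morning_workflows=False),
]

# Jobs that feed morning workflows get the early part of the night;
# maintenance jobs without a deadline run later, in narrower windows.
morning_critical = [j for j in JOBS if j.feeds_morning_workflows]
maintenance = [j for j in JOBS if not j.feeds_morning_workflows]
```

Even this much structure lets a dashboard and a scheduler answer the same question: which jobs matter right now, and by when do they need to be done.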
Takeaways
- Background jobs that affect users need clear expectations, just like APIs.
- "Usually done by morning" is not an SLO.
- Simple metrics—start time, finish time, progress—make long-running jobs much easier to operate.
- Assigning ownership turns a historical script into a real part of the system.