Story: the job that never had an SLA
A background job ran for years without a clear expectation. When it finally broke, we had to decide what 'on time' meant.
What happened
Every night, a background job reconciled a set of records.
It had existed longer than most of the team.
Nobody could remember who originally wrote it. The code had been moved, renamed, and partially refactored, but the behavior was roughly the same:
- gather a batch of records
- call out to a few dependencies
- write results back
There was no explicit SLO.
The job "usually" finished sometime before people arrived in the morning. That was considered good enough—until it wasn’t.
The night it slipped
One night, the job didn’t finish before business hours.
At first, the only signal was a trickle of support tickets:
- "My report is missing data"
- "This status hasn’t updated yet"
Engineers on call opened dashboards and saw that the daytime systems were healthy.
It took another 30 minutes before someone checked the job system and noticed that the nightly reconciliation was still grinding through work.
By then, the backlog had grown enough that even once the job finished, the results would still be several hours late.
The missing expectation
We had a debate in the incident channel:
- How late is "too late" for this job?
- Is this an incident or just an inconvenience?
- Should we pause other work to let it catch up, or kill it and let it rerun tomorrow?
We realized we had no shared answer.
The job was important enough that people noticed when it slipped—but not important enough that anyone had written down what "on time" meant.
In the moment, we did the reasonable things:
- prioritized clearing the backlog
- communicated with support so they could answer user questions more concretely
Afterward, we treated the gap as the real problem.
What we changed
1. Give the job an SLO
We started by writing down what success looked like:
- "The reconciliation job completes by HH:MM local time on at least N% of days."
We picked HH:MM based on when downstream users actually needed the data.
This became the job’s SLO.
We added a simple metric:
- completion time per run
And two alerts:
- if the job has not started by a certain time
- if it has not finished by the SLO time
Now a late run is visible before support tickets accumulate.
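For illustration, here is a minimal sketch of what the completion metric and the two alerts could look like, assuming the job system exposes start and finish timestamps. The deadlines, function names, and alert wording are hypothetical placeholders, not our real values.

```python
from datetime import datetime, time
from typing import Optional

# Placeholder deadlines for illustration; the real times were chosen from
# when downstream users actually needed the data.
START_DEADLINE = time(1, 30)    # job should have started by 01:30 local
FINISH_DEADLINE = time(6, 0)    # SLO: job should be finished by 06:00 local

def check_nightly_job(started_at: Optional[datetime],
                      finished_at: Optional[datetime],
                      now: datetime) -> list[str]:
    """Return alert messages for a single nightly run.

    Intended to be run every few minutes by a scheduler; the two checks
    mirror the two alerts described above.
    """
    alerts = []
    if started_at is None and now.time() >= START_DEADLINE:
        alerts.append("reconciliation: not started by start deadline")
    if finished_at is None and now.time() >= FINISH_DEADLINE:
        alerts.append("reconciliation: not finished by SLO time")
    return alerts

def completion_seconds(started_at: datetime, finished_at: datetime) -> float:
    """The per-run metric: how long the run took, recorded once it finishes."""
    return (finished_at - started_at).total_seconds()
```

Run on a schedule, something like this turns "the job is late" into an alert rather than a morning surprise.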
2. Make progress observable
Previously, we only knew whether the job had "finished" or not.
We added progress reporting:
- total items to process
- items processed so far
- rate over the last few minutes
This allowed us to answer:
- Is it stuck, or just slow?
- If current throughput holds, when will it finish?
During follow-up work, we also discovered a few inefficiencies that were easy to fix once we could see them.
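One way to implement that kind of progress reporting is sketched below: a sliding window over recent completions gives a throughput estimate and, from it, a projected finish time. The class name, window size, and usage pattern are illustrative choices, not the actual implementation.

```python
from collections import deque
from time import monotonic

class ProgressTracker:
    """Tracks items processed and estimates remaining time from recent throughput."""

    def __init__(self, total_items: int, window_seconds: float = 300.0):
        self.total = total_items
        self.done = 0
        self.window = window_seconds
        self.events = deque()  # (timestamp, item_count) pairs within the window

    def record(self, items: int = 1) -> None:
        now = monotonic()
        self.done += items
        self.events.append((now, items))
        # Drop events older than the window so the rate reflects recent work.
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()

    def rate_per_second(self) -> float:
        """Throughput over the last few minutes."""
        if len(self.events) < 2:
            return 0.0
        span = self.events[-1][0] - self.events[0][0]
        items = sum(n for _, n in self.events)
        return items / span if span > 0 else 0.0

    def eta_seconds(self) -> float | None:
        """If current throughput holds, how long until the job finishes?"""
        rate = self.rate_per_second()
        if rate <= 0:
            return None  # no recent progress: possibly stuck, not just slow
        return (self.total - self.done) / rate

# Illustrative usage inside the job loop:
#   tracker = ProgressTracker(total_items=len(batch))
#   ... process one record ...
#   tracker.record()
```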
3. Define failure behavior
We wrote down what should happen if the job cannot complete on time:
- which metrics and dashboards to check
- whether to prioritize partial completion for certain subsets
- how to communicate expected delays to users and support
This turned future decisions from improvisation into execution of a plan.
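As a purely illustrative sketch of the "partial completion" idea, assuming records can be grouped into subsets with a downstream priority; the batch shape, priority scheme, and cutoff time below are invented for the example.

```python
from datetime import datetime, time

def order_batches(batches: list[dict], now: datetime) -> list[dict]:
    """Decide processing order for the remaining batches.

    `batches` is assumed to look like
    {"name": "invoices", "priority": 1, "records": [...]},
    where lower priority numbers mean more important downstream consumers.
    If the run is at risk of missing its deadline, process the important
    subsets first so a partial completion still covers what users need most.
    """
    at_risk = now.time() >= time(4, 30)  # illustrative "behind schedule" cutoff
    if at_risk:
        return sorted(batches, key=lambda b: b["priority"])
    return batches  # plenty of time: keep the normal order
```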
4. Assign ownership
We identified a team as the owner of the job.
Ownership meant:
- keeping the SLO and alerts healthy
- being the first point of contact during regressions
- reviewing changes that affected the job’s inputs or outputs
We also documented the job’s purpose and dependencies.
What we changed in ourselves
The job was not unique.
Once we saw the pattern, we found other long-running tasks that also "never had an SLA":
- weekly reports
- periodic cleanups
- syncs with external systems
We started asking a simple question in design reviews:
- "If this runs later than planned, who notices and why does it matter?"
If the answer was "nobody" or "it doesn’t," we relaxed.
If the answer was "support, finance, or a partner," we treated it the same way we treated user-facing latency: with SLOs, alerts, and ownership.
What we changed in the system
Concretely, we:
- added per-job metrics and health checks to our job system
- created a small dashboard for long-running recurring jobs
- added a "job SLO" section to runbooks where relevant
We also aligned job scheduling with expectations:
- jobs that fed morning workflows started earlier
- low-priority maintenance jobs ran later or in narrower windows
This reduced contention and made it easier to reason about which parts of the job fleet mattered most at any given time.
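A minimal sketch of the kind of per-job metadata that makes this reasoning possible, assuming a small in-code registry; the job names, teams, fields, and times are hypothetical.

```python
from dataclasses import dataclass
from datetime import time

@dataclass
class RecurringJob:
    """Per-job record behind the dashboard and runbook sections described above."""
    name: str
    owner_team: str
    start_after: time            # start of the scheduling window
    slo_finish_by: time | None   # None for jobs with no downstream deadline
    feeds_morning_workflows: bool

# Hypothetical entries for illustration.
JOBS = [
    RecurringJob("nightly-reconciliation", "data-platform",
                 start_after=time(0, 30), slo_finish_by=time(6, 0),
                 feeds_morning_workflows=True),
    RecurringJob("weekly-cleanup", "infra",
                 start_after=time(2, 0), slo_finish_by=None,
                 feeds_morning_workflows=False),
]

# Jobs that feed morning workflows get the early part of the night;
# maintenance jobs without a deadline run later, in narrower windows.
morning_critical = [j for j in JOBS if j.feeds_morning_workflows]
maintenance = [j for j in JOBS if not j.feeds_morning_workflows]
```

Even this much structure lets a dashboard and a scheduler answer the same question: which jobs matter right now, and by when do they need to be done.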
Takeaways
- Background jobs that affect users need clear expectations, just like APIs.
- "Usually done by morning" is not an SLO.
- Simple metrics—start time, finish time, progress—make long-running jobs much easier to operate.
- Assigning ownership turns a historical script into a real part of the system.