Incident report: Partial data loss in a secondary system
We lost data in a secondary index that more of our flows depended on than we realized. This report explains why the system behaved as more than a cache, and what we changed.
Summary
On October 13, 2025, a failure in a secondary data system caused partial data loss.
The system in question was described in diagrams as:
- "a cache"
- "a secondary index"
In reality, several flows treated it as if it were a source of truth.
When we lost a chunk of data in this system:
- some requests failed outright
- others returned incomplete results
The primary database still held the correct data, but the code paths to rebuild the index from it and to fall back to it were not as robust or fast as we assumed.
We treated this as a reliability incident, not just a "cache miss".
Impact
- Duration: roughly seven hours of degraded behavior (02:40 to 09:50 local time) while we detected the loss, confirmed the primary data was intact, and re-indexed.
- User impact:
  - some users could not find records they expected to see
  - some background processes that relied on the index stalled
- Internal impact:
  - time spent rebuilding the secondary system under pressure
  - work to reconcile which flows were safe to use while rebuilding
No primary data was lost, but the failure of a "secondary" system broke behaviors users cared about.
Timeline
All times local.
- 02:15 — A maintenance task runs against the secondary system, unexpectedly deleting more keys than intended.
- 02:40 — Background jobs that rely on the index begin to fail at higher rates.
- 03:05 — User-facing search and lookup flows show a spike in "no results" where results are expected.
- 03:20 — On-call is paged for error-rate increases and user reports.
- 03:45 — Investigation identifies missing entries in the secondary system; primary data is confirmed intact.
- 04:10 — We begin a controlled rebuild of the index from primary data.
- 07:30 — Rebuild completes for the highest-priority partitions; user-facing flows improve.
- 09:50 — Full rebuild completes; background jobs catch up.
- 11:00 — Incident closed; follow-ups documented.
Root cause
The immediate cause was a maintenance job that:
- ran with broader scope than intended
- removed entries from the secondary system that should have stayed
The deeper causes were:
- treating the system as "just a cache" in language, but not in code
- lacking robust, tested rebuild paths and fallbacks
Contributing factors
- Overloaded semantics. Some features wrote data only to the secondary system, assuming later backfills would reconcile it with the primary.
- Unclear ownership. The team running maintenance tasks did not own all the flows that depended on the index.
- Weak monitoring. Alerting focused on primary database signals, not on the health and completeness of the secondary.
What we changed
1. Clarify the role of secondary systems
We classified secondary systems as one of:
- caches (safe to rebuild or lose, with clear fallbacks)
- indexes (derived from primary, with clear rebuild paths)
- partial sources of truth (where writes must be coordinated more carefully)
We updated documentation and diagrams accordingly.
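As a rough sketch of how such a classification can be recorded (the system names, roles, and fields below are illustrative, not our actual configuration), a small registry that code and runbooks can both read might look like this:

```python
from enum import Enum


class SecondaryRole(Enum):
    CACHE = "cache"                    # safe to lose; callers must have a fallback
    INDEX = "index"                    # derived from primary; needs a tested rebuild path
    PARTIAL_SOURCE = "partial_source"  # holds some authoritative writes; coordinate carefully


# Hypothetical registry: system name -> role, fallback read path, rebuild entry point.
SECONDARY_SYSTEMS = {
    "search-index": {
        "role": SecondaryRole.INDEX,
        "fallback": "primary_lookup",       # degraded-mode read path
        "rebuild": "rebuild_search_index",  # documented, tested rebuild procedure
    },
    "session-cache": {
        "role": SecondaryRole.CACHE,
        "fallback": "recompute_session",
        "rebuild": None,  # caches repopulate lazily instead of being rebuilt
    },
}
```

Keeping the role next to its fallback and rebuild hooks makes it harder for a system to drift into being a partial source of truth without anyone noticing.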
2. Strengthen rebuild and fallback paths
For this system, we:
- implemented and tested a full rebuild procedure from primary data
- ensured that user-facing flows could fall back to primary reads in degraded mode, even if slower
We practiced running rebuilds during planned windows, not just during incidents.
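For illustration, here is a minimal sketch of that degraded-mode read path, assuming hypothetical `search_index` and `primary_db` clients (neither name is from our real codebase):

```python
import logging

logger = logging.getLogger(__name__)


def lookup_records(query, search_index, primary_db, degraded_mode=False):
    """Serve lookups from the index, falling back to the primary when needed."""
    try:
        hits = search_index.lookup(query)
    except Exception:
        logger.exception("index lookup failed; falling back to primary")
        return primary_db.query_records(query)

    if hits or not degraded_mode:
        return hits

    # In degraded mode an empty result may be a hole left by lost index entries,
    # so we pay for a slower but authoritative read from the primary.
    logger.warning("empty index result in degraded mode; falling back to primary")
    return primary_db.query_records(query)
```

The `degraded_mode` flag matters: falling back on every empty result would quietly shift normal traffic onto the primary, so the fallback on empty results only applies while the index is known to be incomplete.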
3. Improve monitoring for secondary health
We added metrics and alerts for:
- coverage/completeness of the index
- consistency checks between primary and secondary for sampled records
These signals now appear on the same dashboards as the primary system, not in a separate "nice to have" view.
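The sampled consistency check can be expressed roughly like this (the client objects, metric names, and sampling helper are all hypothetical):

```python
def check_index_consistency(primary_db, search_index, metrics, sample_size=500):
    """Compare a random sample of primary records against the index and emit metrics."""
    record_ids = primary_db.sample_record_ids(sample_size)  # hypothetical sampling helper
    if not record_ids:
        return

    missing = 0
    stale = 0
    for record_id in record_ids:
        record = primary_db.get_record(record_id)
        entry = search_index.get_entry(record_id)
        if entry is None:
            missing += 1
        elif entry.version < record.version:
            stale += 1

    # These gauges are alerted on alongside the primary-database signals.
    metrics.gauge("index.sample.missing_ratio", missing / len(record_ids))
    metrics.gauge("index.sample.stale_ratio", stale / len(record_ids))
```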
4. Guard maintenance operations
We tightened:
- scoping and "dry-run" requirements for maintenance jobs
- review and approval for operations that can remove large amounts of data
We added a simple rule:
- no maintenance operation that can delete or rewrite many entries runs without:
  - an explicit owner
  - a rollback or rebuild plan
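A sketch of the kind of guard this implies around destructive maintenance jobs; the thresholds, flags, and helper names here are illustrative rather than our actual tooling:

```python
class MaintenanceAborted(RuntimeError):
    """Raised when a destructive job fails its safety checks."""


def delete_index_entries(index_client, selector, *, owner=None, dry_run=True, max_deletes=10_000):
    """Delete index entries matching `selector`, with owner, dry-run, and blast-radius checks."""
    if owner is None:
        raise MaintenanceAborted("destructive maintenance requires an explicit owner")

    keys = index_client.find_keys(selector)  # hypothetical scoping call
    print(f"selector {selector!r} matches {len(keys)} keys (owner: {owner})")

    if dry_run:
        # Default mode: report what would be deleted and touch nothing.
        return 0

    if len(keys) > max_deletes:
        raise MaintenanceAborted(
            f"{len(keys)} keys exceeds the {max_deletes} limit; "
            "narrow the selector or get an approved override"
        )

    for key in keys:
        index_client.delete(key)
    return len(keys)
```

Dry-run by default means the destructive path has to be chosen deliberately, and the explicit owner check mirrors the rule above; the rollback or rebuild plan is still enforced in review rather than in code.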
Follow-ups
Completed
- Restored the secondary system from primary data.
- Documented the system’s role and dependencies.
- Added metrics and alerts for its health.
Planned / in progress
- Apply the same classification and rebuild patterns to other secondary systems.
- Review flows that write only to secondary systems to ensure they have a clear reconciliation story.
Takeaways
- "Secondary" does not mean "unimportant"; if users depend on it, it is part of your reliability story.
- Caches and indexes need explicit rebuild and fallback paths, not just optimistic assumptions.
- Maintenance operations on derived systems should be treated with the same care as operations on primary data.