Incident report: Partial data loss in a secondary system
We lost data in a secondary index that more of our flows depended on than we realized. This report explains why the system behaved as more than a cache, and what we changed.
Summary
On October 13, 2025, a failure in a secondary data system caused partial data loss.
The system in question was described in diagrams as:
- "a cache"
- "a secondary index"
In reality, several flows treated it as if it were a source of truth.
When we lost a chunk of data in this system:
- some requests failed outright
- others returned incomplete results
The primary database still held the correct data, but the code paths to rebuild the index from it and to fall back to it were not as robust or fast as we assumed.
We treated this as a reliability incident, not just a "cache miss".
Impact
- Duration: roughly seven hours of degraded behavior (02:40 to 09:50 local time) while we detected the loss, confirmed the primary data was intact, and re-indexed.
- User impact:
  - some users could not find records they expected to see
  - some background processes that relied on the index stalled
- Internal impact:
  - time spent rebuilding the secondary system under pressure
  - work to reconcile which flows were safe to use while rebuilding
No primary data was lost, but the failure of a "secondary" system broke behaviors users cared about.
Timeline
All times local.
- 02:15 — A maintenance task runs against the secondary system, unexpectedly deleting more keys than intended.
- 02:40 — Background jobs that rely on the index begin to fail at higher rates.
- 03:05 — User-facing search and lookup flows show a spike in "no results" where results are expected.
- 03:20 — On-call is paged for error-rate increases and user reports.
- 03:45 — Investigation identifies missing entries in the secondary system; primary data is confirmed intact.
- 04:10 — We begin a controlled rebuild of the index from primary data.
- 07:30 — Rebuild completes for the highest-priority partitions; user-facing flows improve.
- 09:50 — Full rebuild completes; background jobs catch up.
- 11:00 — Incident closed; follow-ups documented.
Root cause
The immediate cause was a maintenance job that:
- ran with broader scope than intended
- removed entries from the secondary system that should have stayed
The deeper causes were:
- treating the system as "just a cache" in language, but not in code
- lacking robust, tested rebuild paths and fallbacks
Contributing factors
- Overloaded semantics. Some features wrote data only to the secondary system, assuming later backfills would reconcile it with the primary.
- Unclear ownership. The team running maintenance tasks did not own all the flows that depended on the index.
- Weak monitoring. Alerting focused on primary database signals, not on the health and completeness of the secondary.
What we changed
1. Clarify the role of secondary systems
We classified secondary systems as one of:
- caches (safe to rebuild or lose, with clear fallbacks)
- indexes (derived from primary, with clear rebuild paths)
- partial sources of truth (where writes must be coordinated more carefully)
We updated documentation and diagrams accordingly.
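As a rough sketch of how such a classification can be recorded (the system names, roles, and fields below are illustrative, not our actual configuration), a small registry that code and runbooks can both read might look like this:

```python
from enum import Enum


class SecondaryRole(Enum):
    CACHE = "cache"                    # safe to lose; callers must have a fallback
    INDEX = "index"                    # derived from primary; needs a tested rebuild path
    PARTIAL_SOURCE = "partial_source"  # holds some authoritative writes; coordinate carefully


# Hypothetical registry: system name -> role, fallback read path, rebuild entry point.
SECONDARY_SYSTEMS = {
    "search-index": {
        "role": SecondaryRole.INDEX,
        "fallback": "primary_lookup",       # degraded-mode read path
        "rebuild": "rebuild_search_index",  # documented, tested rebuild procedure
    },
    "session-cache": {
        "role": SecondaryRole.CACHE,
        "fallback": "recompute_session",
        "rebuild": None,  # caches repopulate lazily instead of being rebuilt
    },
}
```

Keeping the role next to its fallback and rebuild hooks makes it harder for a system to drift into being a partial source of truth without anyone noticing.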
2. Strengthen rebuild and fallback paths
For this system, we:
- implemented and tested a full rebuild procedure from primary data
- ensured that user-facing flows could fall back to primary reads in degraded mode, even if slower
We practiced running rebuilds during planned windows, not just during incidents.
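For illustration, here is a minimal sketch of that degraded-mode read path, assuming hypothetical `search_index` and `primary_db` clients (neither name is from our real codebase):

```python
import logging

logger = logging.getLogger(__name__)


def lookup_records(query, search_index, primary_db, degraded_mode=False):
    """Serve lookups from the index, falling back to the primary when needed."""
    try:
        hits = search_index.lookup(query)
    except Exception:
        logger.exception("index lookup failed; falling back to primary")
        return primary_db.query_records(query)

    if hits or not degraded_mode:
        return hits

    # In degraded mode an empty result may be a hole left by lost index entries,
    # so we pay for a slower but authoritative read from the primary.
    logger.warning("empty index result in degraded mode; falling back to primary")
    return primary_db.query_records(query)
```

The `degraded_mode` flag matters: falling back on every empty result would quietly shift normal traffic onto the primary, so the fallback on empty results only applies while the index is known to be incomplete.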
3. Improve monitoring for secondary health
We added metrics and alerts for:
- coverage/completeness of the index
- consistency checks between primary and secondary for sampled records
These signals now appear on the same dashboards as the primary system, not in a separate "nice to have" view.
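The sampled consistency check can be expressed roughly like this (the client objects, metric names, and sampling helper are all hypothetical):

```python
def check_index_consistency(primary_db, search_index, metrics, sample_size=500):
    """Compare a random sample of primary records against the index and emit metrics."""
    record_ids = primary_db.sample_record_ids(sample_size)  # hypothetical sampling helper
    if not record_ids:
        return

    missing = 0
    stale = 0
    for record_id in record_ids:
        record = primary_db.get_record(record_id)
        entry = search_index.get_entry(record_id)
        if entry is None:
            missing += 1
        elif entry.version < record.version:
            stale += 1

    # These gauges are alerted on alongside the primary-database signals.
    metrics.gauge("index.sample.missing_ratio", missing / len(record_ids))
    metrics.gauge("index.sample.stale_ratio", stale / len(record_ids))
```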
4. Guard maintenance operations
We tightened:
- scoping and "dry-run" requirements for maintenance jobs
- review and approval for operations that can remove large amounts of data
We added a simple rule:
- no maintenance operation that can delete or rewrite many entries runs without:
  - an explicit owner
  - a rollback or rebuild plan
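A sketch of the kind of guard this implies around destructive maintenance jobs; the thresholds, flags, and helper names here are illustrative rather than our actual tooling:

```python
class MaintenanceAborted(RuntimeError):
    """Raised when a destructive job fails its safety checks."""


def delete_index_entries(index_client, selector, *, owner=None, dry_run=True, max_deletes=10_000):
    """Delete index entries matching `selector`, with owner, dry-run, and blast-radius checks."""
    if owner is None:
        raise MaintenanceAborted("destructive maintenance requires an explicit owner")

    keys = index_client.find_keys(selector)  # hypothetical scoping call
    print(f"selector {selector!r} matches {len(keys)} keys (owner: {owner})")

    if dry_run:
        # Default mode: report what would be deleted and touch nothing.
        return 0

    if len(keys) > max_deletes:
        raise MaintenanceAborted(
            f"{len(keys)} keys exceeds the {max_deletes} limit; "
            "narrow the selector or get an approved override"
        )

    for key in keys:
        index_client.delete(key)
    return len(keys)
```

Dry-run by default means the destructive path has to be chosen deliberately, and the explicit owner check mirrors the rule above; the rollback or rebuild plan is still enforced in review rather than in code.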
Follow-ups
Completed
- Restored the secondary system from primary data.
- Documented the system’s role and dependencies.
- Added metrics and alerts for its health.
Planned / in progress
- Apply the same classification and rebuild patterns to other secondary systems.
- Review flows that write only to secondary systems to ensure they have a clear reconciliation story.
Takeaways
- "Secondary" does not mean "unimportant"; if users depend on it, it is part of your reliability story.
- Caches and indexes need explicit rebuild and fallback paths, not just optimistic assumptions.
- Maintenance operations on derived systems should be treated with the same care as operations on primary data.