Designing quiet failure modes for optional emails
We changed how optional emails fail so that they no longer cause noisy incidents or confuse users during stressed periods.
When things get busy, optional emails are the first to show strain.
By "optional" we mean messages that are helpful but not required for system correctness:
- summaries
- low-priority notifications
- some marketing messages tied to product activity
They tend to share a few properties:
- they’re triggered from hot paths (after important actions)
- they talk to external providers
- they don’t have strict SLOs of their own
During stressed periods (traffic spikes, provider issues, infrastructure changes), we saw two failure modes:
- Optional emails caused noisy, user-visible errors when they failed.
- Optional email failures created distracting alerts that pulled focus away from truly critical issues.
We needed those emails to fail quietly and predictably.
Constraints
- We could not remove optional emails entirely; some teams depended on them for user engagement.
- Our email provider occasionally had its own incidents; we couldn’t change that.
- We could not let a slow or failing optional email break the main flows (sign-up, settings changes, purchases).
What we changed
We changed three things:
- Where optional emails are triggered.
- How failures surface internally.
- What we promise to users.
1. Move optional emails off the critical path
Previously, the pattern was:
- user completes an action
- we write to the database
- we send the optional email synchronously
- only then do we respond
If the email provider was slow or down, the whole flow stalled or failed.
We changed this to:
- user completes an action
- we write to the database
- we enqueue a job to send the optional email
- we respond to the user
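The new ordering can be sketched roughly as below. This is a minimal illustration, not our actual handler: `handle_settings_change`, `save_settings`, `EmailJob`, and the in-process `email_jobs` queue are hypothetical names, and a real system would use a durable job queue rather than `queue.Queue`.

```python
import queue
from dataclasses import dataclass


@dataclass(frozen=True)
class EmailJob:
    """Hypothetical payload describing one optional email to send later."""
    user_id: int
    template: str


# Stand-in for a real job queue (assumption: bounded, in-process).
email_jobs: queue.Queue = queue.Queue(maxsize=1000)


def save_settings(db: dict, user_id: int, settings: dict) -> None:
    """Stand-in for the database write; errors here should fail the request."""
    db[user_id] = dict(settings)


def handle_settings_change(db: dict, user_id: int, settings: dict) -> dict:
    # 1. Required work: persist the change.
    save_settings(db, user_id, settings)

    # 2. Optional work: enqueue only. The email provider is never called here.
    try:
        email_jobs.put_nowait(EmailJob(user_id, "settings_summary"))
        email_state = "queued"
    except queue.Full:
        # Dropping an optional email must not fail the user's action.
        email_state = "skipped"

    # 3. Respond without waiting on the email.
    return {"status": "ok", "summary_email": email_state}
```

The point of the sketch is that the only email-related operation on the request path is a non-blocking enqueue, and even that is allowed to fail without affecting the response.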
The job:
- runs with modest retries and backoff
- records success/failure in a small log table
This way, optional emails no longer sit on the same synchronous call path as the user’s main action.
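A job along these lines might look like the following sketch. The table name `optional_email_log`, the function `run_email_job`, the retry count, and the delays are all assumptions chosen for illustration; the `send` callable stands in for whatever client talks to the email provider.

```python
import sqlite3
import time
from typing import Callable


def init_log(conn: sqlite3.Connection) -> None:
    """Create the small log table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS optional_email_log ("
        "user_id INTEGER, template TEXT, status TEXT, attempts INTEGER)"
    )


def run_email_job(
    conn: sqlite3.Connection,
    send: Callable[[int, str], None],
    user_id: int,
    template: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to send with a few retries, then record the outcome either way."""
    status = "failed"
    attempts = 0
    for attempt in range(1, max_attempts + 1):
        attempts = attempt
        try:
            send(user_id, template)
            status = "sent"
            break
        except Exception:
            # A real worker would catch provider-specific errors only.
            if attempt < max_attempts:
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** (attempt - 1))
    conn.execute(
        "INSERT INTO optional_email_log VALUES (?, ?, ?, ?)",
        (user_id, template, status, attempts),
    )
    conn.commit()
    return status == "sent"
```

The job gives up after a bounded number of attempts and writes a row in both the success and failure case, so the log table can feed dashboards without the job ever raising into the caller.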
2. Downgrade failures to signals, not emergencies
We split email-related alerts into two classes:
- Critical: password reset emails, verification for sign-in, anything that directly affects access.
- Optional: summaries and other mostly informational messages.
For optional emails, we stopped paging on single failures or small error-rate bumps.
Instead, we:
- added dashboards showing send rates and error rates over time
- set thresholds where sustained failure (e.g., hours of high error rates) triggered an alert to a non-paging channel
On-call remains responsible for critical email flows.
Optional flows are monitored, but they don’t wake people up unless they start to look like a systemic problem.
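One way to express "sustained failure, not single blips" is to require several consecutive windows above an error-rate threshold before notifying. The sketch below assumes per-window send and failure counts are already available; `WindowStats`, `should_notify`, the 20% threshold, and the four-window requirement are illustrative values, not the ones we use.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class WindowStats:
    """Send outcomes for one fixed time window (e.g., one hour)."""
    sent: int
    failed: int

    @property
    def error_rate(self) -> float:
        total = self.sent + self.failed
        return self.failed / total if total else 0.0


def should_notify(
    windows: Sequence[WindowStats],
    threshold: float = 0.2,
    sustained_windows: int = 4,
) -> bool:
    """Return True only if the most recent windows ALL exceed the threshold.

    A single bad window, or a bad window followed by recovery, does not
    trigger a notification to the non-paging channel.
    """
    if len(windows) < sustained_windows:
        return False
    recent = windows[-sustained_windows:]
    return all(w.error_rate > threshold for w in recent)
```

With these assumed defaults, one window at a 50% error rate surrounded by healthy windows produces no notification, while four consecutive windows at 40% does.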
3. Align user messaging with reality
We reviewed the copy around actions that triggered optional emails.
We replaced statements like:
- "We’ve emailed you a summary."
with more honest alternatives:
- "We’ll email you a summary shortly (this doesn’t affect the status of your [action])."
For flows where optional emails might be delayed or skipped during incidents, we made sure users could:
- view the same information in-product
- get confirmation there, not only in email
This reduced confusion when an email didn’t arrive immediately.
Results / Measurements
After these changes, we saw:
- fewer incidents in which an optional email failure was the root cause of a user-visible problem
- a reduction in alert noise related to transient provider issues
- clearer boundaries between "email is a nice-to-have" and "email is required for access"
During one provider-side incident, optional email errors spiked.
Previously, that would have:
- slowed down some critical paths
- caused noisy alerts
This time, the main flows remained healthy, and optional emails recovered later without requiring an incident response.
Takeaways
- Optional emails should not share the same failure mode as required actions.
- Moving optional work off the synchronous path is often more effective than retrying harder.
- Alerts for optional features should reflect long-term health, not every transient blip.
- Honest user messaging about what "optional" really means helps during partial failures.