Designing for graceful degradation in checkout
What we changed in the checkout architecture so partial failures lead to smaller, clearer problems instead of full outages.
Checkout outages are expensive in a way that is easy to explain to anyone.
For a long time, our mental model of checkout reliability was binary: it works, or it’s down.
Reality was messier. The most painful incidents were not total outages, but partial failures:
- gift card balances failing to load
- add-on services not attaching correctly
- address validation timing out and blocking the whole flow
Users saw the same thing either way: "checkout is broken."
We decided to treat graceful degradation as an architectural requirement, not a nice-to-have.
Constraints
- We could not rebuild checkout from scratch.
- Several dependencies (tax, risk, address validation, add-on services) were owned by different teams and vendors.
- Some flows were legally or financially mandatory (e.g., tax calculation); others were optional but important.
- We wanted to avoid turning every edge case into a new config flag.
We also had to respect user trust:
- silently skipping important steps would be worse than failing clearly
- we needed to be explicit when we fell back to simpler behavior
What we changed
We approached graceful degradation as a series of small, concrete decisions.
1. Classify dependencies by criticality
We created three categories for checkout dependencies:
- Essential: required for a valid order (e.g., payment processing, core inventory checks).
- Strongly recommended: improves correctness or user clarity (e.g., tax calculation, some forms of address validation).
- Optional: adds value but is not required for a valid order (e.g., certain recommendations, some add-ons).
For each dependency, we wrote down:
- what happens if it’s slow
- what happens if it fails
- whether checkout should block, warn, or continue without it
This sounds obvious, but we hadn’t written it down before. It forced conversations between product, legal, and engineering.
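To make the classification concrete, here is a minimal sketch of how such a registry might look in code. The dependency names, timeout values, and field names are illustrative, not our actual configuration.

```typescript
// Sketch of a dependency criticality registry (names and values are illustrative).
type Criticality = "essential" | "strongly-recommended" | "optional";

interface DependencyPolicy {
  criticality: Criticality;
  // What checkout should do when the dependency fails or times out.
  onFailure: "block" | "warn-and-continue" | "hide-and-continue";
  timeoutMs: number;
}

const checkoutDependencies: Record<string, DependencyPolicy> = {
  paymentProcessing: { criticality: "essential",            onFailure: "block",             timeoutMs: 3000 },
  inventoryCheck:    { criticality: "essential",            onFailure: "block",             timeoutMs: 2000 },
  taxCalculation:    { criticality: "strongly-recommended", onFailure: "warn-and-continue", timeoutMs: 1500 },
  addressValidation: { criticality: "strongly-recommended", onFailure: "warn-and-continue", timeoutMs: 1500 },
  recommendations:   { criticality: "optional",             onFailure: "hide-and-continue", timeoutMs: 800 },
};
```

Keeping the table in one place also gave reviews a single artifact to argue about, instead of behavior scattered across call sites.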
2. Make failure modes explicit in the flow
We then changed the flow to match those decisions.
Examples:
- If an optional recommendation service fails, we hide that section and continue. We do not block checkout.
- If an essential inventory check fails, we block with a clear error and do not charge the user.
- If a strongly recommended address validation call times out, we:
  - allow the user to continue with a clear "unable to confirm" message
  - log the event and surface it to support
The key is that the user sees a different message depending on what failed and why.
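Building on the registry sketch above, a small function can translate each policy into a user-facing outcome. This is a sketch; the branch names and messages are hypothetical, not our production copy.

```typescript
// Sketch: translate a dependency's failure policy into what the user sees.
// Uses the DependencyPolicy type from the registry sketch above; messages are illustrative.
type FailureOutcome =
  | { kind: "block"; message: string }
  | { kind: "continue"; banner?: string };

function outcomeForFailure(policy: DependencyPolicy | undefined): FailureOutcome {
  if (!policy || policy.onFailure === "block") {
    // Essential work (or anything unclassified) blocks with a clear error and no charge.
    return { kind: "block", message: "We couldn't complete this step, and you have not been charged." };
  }
  if (policy.onFailure === "warn-and-continue") {
    // Strongly recommended work lets the user proceed, but says so explicitly.
    return { kind: "continue", banner: "We were unable to confirm this step. You can still continue." };
  }
  // Optional work degrades quietly: the section is hidden and checkout keeps moving.
  return { kind: "continue" };
}
```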
3. Isolate calls and timeouts
We untangled a few places where optional work was sitting on the same call stack as essential work.
Previously, one backend call might:
- fetch core checkout data
- call out to multiple dependencies sequentially
- assemble everything into a single response
If an optional dependency was slow, the entire response was slow.
We changed this to:
- fetch essential data first, with its own tighter timeout
- call optional dependencies separately, with their own looser budgets
- stream or progressively enhance optional UI as data arrives
On the client side, this meant rendering the core checkout form first and filling in extras as they became available.
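A rough sketch of that split, assuming hypothetical service calls and timeout budgets:

```typescript
// Hypothetical service calls; in reality these hit separately owned backends.
declare function fetchCoreCheckout(): Promise<{ cartId: string; customerId: string }>;
declare function fetchRecommendations(cartId: string): Promise<unknown>;
declare function fetchGiftCardBalance(customerId: string): Promise<number>;

// Reject if a call exceeds its budget, so a slow dependency can't hold the whole response.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    promise,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms)
    ),
  ]);
}

async function loadCheckout() {
  // Essential data gets a tight timeout; if this fails, checkout blocks.
  const core = await withTimeout(fetchCoreCheckout(), 2000);

  // Optional dependencies run in parallel on looser budgets.
  // Each one resolves to null on failure instead of failing the whole page.
  const [recommendations, giftCardBalance] = await Promise.all([
    withTimeout(fetchRecommendations(core.cartId), 5000).catch(() => null),
    withTimeout(fetchGiftCardBalance(core.customerId), 5000).catch(() => null),
  ]);

  return { core, recommendations, giftCardBalance };
}
```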
4. Design clear degraded states
We worked with design to define specific degraded states:
- an address form that shows "We couldn’t verify this address" with clear next steps
- a line item that shows "Gift card balance will be applied after confirmation" when the balance service is degraded
- a lightweight banner that acknowledges "Some promotions may not appear right now" instead of silently dropping them
These states had to be:
- visually distinct from normal success
- honest about what the system knows and doesn’t know
- non-blocking where appropriate
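One way to keep those properties honest in code is to model degraded states as explicit variants rather than booleans. The state names and copy below are a sketch, not our actual components.

```typescript
// Sketch: model degraded UI states as explicit variants instead of hidden booleans.
type SectionState<T> =
  | { status: "ready"; data: T }
  | { status: "degraded"; userMessage: string } // honest, visually distinct fallback
  | { status: "hidden" };                       // optional section dropped entirely

// Example: address validation timed out, so the form shows a non-blocking notice.
const addressSection: SectionState<{ normalized: string }> = {
  status: "degraded",
  userMessage: "We couldn't verify this address. You can continue, but please double-check it.",
};
```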
5. Instrument degradation, not just failures
We added metrics and logs for degradation events:
- count of checkouts that continued after a validation timeout
- rate of fallbacks for each optional service
- user outcomes (completion vs abandon) in degraded vs normal paths
We surfaced these on dashboards and in incident reviews.
This kept us honest: if a fallback path was overused or causing confusion, we could see it.
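As an illustration, a degradation event can be recorded as a tagged counter so dashboards can slice by dependency, reason, and outcome. The metrics client interface and event name here are hypothetical.

```typescript
// Sketch of instrumenting degradation as a first-class event, not just an error.
interface MetricsClient {
  increment(name: string, tags: Record<string, string>): void;
}

function recordDegradation(
  metrics: MetricsClient,
  dependency: string,
  reason: "timeout" | "error",
  userContinued: boolean
) {
  // One counter per fallback, tagged so we can compare degraded vs normal paths.
  metrics.increment("checkout.degradation", {
    dependency,
    reason,
    user_continued: String(userContinued),
  });
}
```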
Results / Measurements
After rolling out these changes incrementally, we saw a few concrete improvements:
- Fewer "all-or-nothing" incidents. Some classes of dependency issues that previously caused full checkout failures now showed up as localized degradation with smaller user impact.
- Clearer incident scopes. During issues with specific services (e.g., an add-on provider), we could quantify how many checkouts were affected and how.
- Better user outcomes in degraded states. In one case, when address validation was slow, allowing users to proceed with a clear warning preserved completion rates much better than the previous hard block.
We also saw more nuanced discussions in incident reviews:
- "Should this dependency be essential?" instead of "Checkout broke again." Some sessions asked whether a dependency should move between categories based on how often it degraded.
- "Is the degraded message clear enough?" instead of "Users are confused and we don't know why." Others focused on whether the fallback copy and UI actually matched what users experienced under load.
Takeaways
- Graceful degradation is an architectural choice, not just a UI tweak.
- Classifying dependencies by criticality makes failure decisions explicit and reviewable.
- Separating essential and optional work, with different timeouts, prevents small problems from becoming big outages.
- Designed degraded states, with honest messaging, keep user trust when parts of the system are unhealthy.
- Instrumenting degradation paths lets you tune them over time instead of treating them as one-off hacks.