STEWARDSHIP | 2024-04-15 | BY JONAS "JO" CARLIN

Checklist: Evaluating a new vendor for the critical path

A condensed checklist we use when a new vendor might end up on the production critical path.

Tags: stewardship, vendors, risk, reliability

Bringing a new vendor into production is easy.

Bringing them onto your critical path is not.

This checklist is what we run through when someone proposes that a new external service should sit between us and a user being able to complete an important task.

Context

Use this checklist when a vendor is going to:

- sit in front of or alongside checkout, sign-in, or account recovery
- influence decisions that block user actions (fraud, compliance, gating)
- handle data that users would reasonably consider sensitive

If the vendor is purely back-office and non-critical, you can still use this checklist, but treat it as guidance, not a gate.

Checklist

Do we know what happens when they are down or slow?
- define failure modes: timeouts, errors, partial responses
- define our behavior: block, degrade, or bypass
- confirm with the vendor’s documentation and our own tests
Do we have observability into their behavior?
- metrics for their latency and error rates from our perspective
- logs or traces that let us debug interactions
- clear mapping between their status page and our incidents
Are their SLOs compatible with ours?
- compare their availability and latency targets with our SLOs
- understand how they measure (and how we do)
- decide what we will do when they are inside their SLOs but we are not
Do we control how much we rely on them?
- feature flags or configuration to turn them off or down
- the ability to route some percentage of traffic around them
- documented "safe modes" in our own system
Can we test and stage safely?
- sandbox or test environment that behaves like production
- ability to replay traffic or run limited canaries
- a plan for how to roll forward and roll back
Is data handling acceptable?
- clear list of data fields we send and why
- retention and access patterns we can live with
- a path to reduce or anonymize data if needed later
Do we know who to call and how?
- operational contacts, not just sales
- escalation paths and expected response times
- integration with our own incident process where appropriate
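
To make the first, second, and fourth questions concrete, here is a minimal sketch of a wrapper around a vendor call. It is an illustration under assumptions, not our production code: `VendorGuard`, `OnFailure`, `call_vendor`, `traffic_percent`, and `degraded_result` are hypothetical names, and the vendor client is assumed to accept a timeout and raise on timeouts or errors.

```python
import random
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable


class OnFailure(Enum):
    BLOCK = "block"      # fail the user action
    DEGRADE = "degrade"  # continue with a conservative default answer
    BYPASS = "bypass"    # continue as if the vendor were not in the path


class VendorUnavailable(Exception):
    """Raised when the policy is BLOCK and the vendor call failed."""


@dataclass
class VendorGuard:
    # Hypothetical client function: takes (payload, timeout_s), raises on timeout or error.
    call_vendor: Callable[[dict, float], Any]
    timeout_s: float = 0.5
    on_failure: OnFailure = OnFailure.DEGRADE
    traffic_percent: float = 100.0  # flag: share of requests sent to the vendor
    degraded_result: Any = None
    # Counters and latencies as seen from our side of the integration.
    counts: dict = field(default_factory=lambda: {"ok": 0, "failed": 0, "skipped": 0})
    latencies_s: list = field(default_factory=list)

    def check(self, payload: dict) -> Any:
        # Route some percentage of traffic around the vendor.
        if random.random() * 100 >= self.traffic_percent:
            self.counts["skipped"] += 1
            return None
        started = time.monotonic()
        try:
            result = self.call_vendor(payload, self.timeout_s)
        except Exception as exc:
            self.counts["failed"] += 1
            if self.on_failure is OnFailure.BLOCK:
                raise VendorUnavailable("vendor call failed; blocking by policy") from exc
            if self.on_failure is OnFailure.DEGRADE:
                return self.degraded_result
            return None  # BYPASS: the caller proceeds without a vendor answer
        finally:
            # Record latency for successes and failures alike.
            self.latencies_s.append(time.monotonic() - started)
        self.counts["ok"] += 1
        return result


def flaky_client(payload: dict, timeout_s: float) -> Any:
    # Stand-in for a real client; always simulates a timeout.
    raise TimeoutError("simulated vendor timeout")


guard = VendorGuard(call_vendor=flaky_client, degraded_result={"decision": "review"})
print(guard.check({"order_id": "demo"}))  # falls back to the degraded result
print(guard.counts)
```

The point of the sketch is that the failure policy and the traffic share are explicit configuration rather than behavior scattered across call sites. In a real system the counters would feed whatever metrics pipeline is already in place.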

Notes

This checklist is a floor, not a ceiling.

We’ve learned that skipping even one of these questions almost always shows up later as "we never asked that" in an incident review.

We also learned not to rely solely on marketing material or status pages.

The most useful information usually comes from:

- talking to the vendor’s engineers or SREs directly
- testing behavior under controlled failure
- talking to other teams who have run the integration in production
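
As one way to test behavior under controlled failure, a scripted test double can stand in for the vendor and fail on demand. The sketch below is an assumption-laden illustration: `FlakyVendorStub`, `lookup`, and `decide` are hypothetical names, and the "slow" outcome is simulated by raising a timeout rather than by actually sleeping.

```python
class FlakyVendorStub:
    """Hypothetical test double that fails in a scripted order."""

    def __init__(self, script):
        # script: sequence of "ok", "error", or "slow" outcomes, replayed in a loop
        self.script = list(script)
        self.calls = 0

    def lookup(self, payload, timeout_s):
        outcome = self.script[self.calls % len(self.script)]
        self.calls += 1
        if outcome == "error":
            raise ConnectionError("injected vendor error")
        if outcome == "slow":
            # Simulate the client giving up after its timeout.
            raise TimeoutError(f"injected timeout after {timeout_s}s")
        return {"decision": "allow"}


def decide(vendor, payload, timeout_s=0.5):
    """Our side of the integration: fall back to manual review on vendor failure."""
    try:
        return vendor.lookup(payload, timeout_s)["decision"]
    except (ConnectionError, TimeoutError):
        return "review"


stub = FlakyVendorStub(["ok", "error", "slow", "ok"])
outcomes = [decide(stub, {"order": i}) for i in range(4)]
print(outcomes)  # ['allow', 'review', 'review', 'allow']
```

A stub like this only checks our own handling. It does not replace testing against the vendor's sandbox, where their real timeouts and error shapes show up.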

Takeaways

- New vendors on the critical path are architectural decisions, not just purchasing decisions.
- Understanding failure modes and SLO alignment upfront prevents painful surprises later.
- Feature flags, observability, and clear contacts turn vendor issues into manageable incidents instead of mysteries.
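
A worked example of the SLO alignment point: if a vendor is a hard dependency with no bypass, and failures are assumed independent, the availabilities multiply. The figures below are illustrative and not taken from any real SLO. With 99.95% on our side and 99.9% on the vendor's, the combined figure is about 99.85%, or roughly 65 minutes of downtime in a 30-day month instead of about 22.

```python
def serial_availability(*components: float) -> float:
    """Availability of a chain where every component must be up.

    Assumes failures are independent, which is a simplification.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


def monthly_downtime_minutes(availability: float, days: int = 30) -> float:
    """Expected downtime in minutes over a month of the given length."""
    return (1.0 - availability) * days * 24 * 60


ours = 0.9995    # illustrative target for our own stack
vendor = 0.999   # illustrative vendor target
combined = serial_availability(ours, vendor)

print(round(monthly_downtime_minutes(ours), 1))      # 21.6
print(round(monthly_downtime_minutes(combined), 1))  # 64.8
```

This is why the "block, degrade, or bypass" decision matters: a degrade or bypass path is what lets our SLO exceed what the multiplication alone allows.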
