Checklist: Evaluating a new vendor for the critical path
A condensed checklist we use when a new vendor might end up on the production critical path.
Bringing a new vendor into production is easy.
Bringing them onto your critical path is not.
This checklist is what we run through when someone proposes putting a new external service between a user and an important task they need to complete.
Context
Use this checklist when a vendor is going to:
- sit in front of or alongside checkout, sign-in, or account recovery
- influence decisions that block user actions (fraud, compliance, gating)
- handle data that users would reasonably consider sensitive
If the vendor is purely back-office and non-critical, you can still use this checklist—but treat it as guidance, not a gate.
Checklist
- Do we know what happens when they are down or slow?
  - define failure modes: timeouts, errors, partial responses
  - define our behavior: block, degrade, or bypass
  - confirm with the vendor’s documentation and our own tests
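One way to make the block / degrade / bypass decision explicit is to encode it next to the vendor call. The sketch below is illustrative only: `FailurePolicy`, `call_with_policy`, `VendorUnavailable`, and the `decision` field are hypothetical names, and it assumes the vendor client accepts a timeout in seconds and raises an exception when that timeout expires.

```python
import enum
from typing import Callable, Optional


class FailurePolicy(enum.Enum):
    BLOCK = "block"      # fail the user action when the vendor fails
    DEGRADE = "degrade"  # continue with a pre-agreed fallback result
    BYPASS = "bypass"    # continue as if the vendor were not there


class VendorUnavailable(Exception):
    """Raised when the vendor fails and the policy is BLOCK."""


def call_with_policy(
    vendor_call: Callable[[float], dict],
    policy: FailurePolicy,
    timeout_s: float,
    required_fields: tuple = ("decision",),
    fallback: Optional[dict] = None,
) -> dict:
    """Call the vendor and apply our chosen behavior on any failure mode."""
    try:
        response = vendor_call(timeout_s)
        # Treat a response that lacks required fields as a partial response.
        missing = [name for name in required_fields if name not in response]
        if missing:
            raise ValueError(f"partial response, missing {missing}")
        return {"source": "vendor", "result": response}
    except Exception as exc:  # timeout, error, or partial response
        if policy is FailurePolicy.BLOCK:
            raise VendorUnavailable(str(exc)) from exc
        if policy is FailurePolicy.DEGRADE:
            return {"source": "fallback", "result": fallback}
        return {"source": "bypassed", "result": None}
```

Returning a `source` label alongside the result is a design choice in this sketch: it lets downstream code and logs distinguish a real vendor answer from a fallback or a bypass.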
- Do we have observability into their behavior?
  - metrics for their latency and error rates from our perspective
  - logs or traces that let us debug interactions
  - clear mapping between their status page and our incidents
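Measuring "from our perspective" means timing and counting at our call site, not reading the vendor's dashboard. A minimal sketch, assuming an in-memory recorder; `VendorCallStats` and its methods are hypothetical, and a real system would more likely export these values to its existing metrics backend:

```python
import time
from collections import defaultdict


class VendorCallStats:
    """Records latency and errors for vendor calls as seen by our service."""

    def __init__(self):
        self.latencies_s = defaultdict(list)
        self.calls = defaultdict(int)
        self.errors = defaultdict(int)

    def observe(self, vendor, func, *args, **kwargs):
        """Run func, recording its duration and whether it raised."""
        start = time.monotonic()
        self.calls[vendor] += 1
        try:
            return func(*args, **kwargs)
        except Exception:
            self.errors[vendor] += 1
            raise
        finally:
            # Latency is recorded for failures too; slow errors matter.
            self.latencies_s[vendor].append(time.monotonic() - start)

    def error_rate(self, vendor):
        calls = self.calls[vendor]
        return self.errors[vendor] / calls if calls else 0.0
```

A call site would wrap each request as `stats.observe("fraud-vendor", client.check, payload)`, where `client.check` stands in for whatever the vendor's client actually exposes.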
- Are their SLOs compatible with ours?
  - compare their availability and latency targets with our SLOs
  - understand how they measure (and how we do)
  - decide what we will do when they are inside their SLOs but we are not
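The comparison is mostly arithmetic. If every request has to pass through both our system and the vendor, and failures are roughly independent (an assumption that real outages often violate), the best availability we can offer is the product of the two. The sketch below uses hypothetical helper names and illustrative targets, not figures from any real vendor:

```python
def serial_availability(*availabilities: float) -> float:
    """Availability of a chain where every component must be up.

    Assumes failures are independent, which is a simplification.
    """
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def downtime_minutes(availability: float, days: int = 30) -> float:
    """Allowed downtime in minutes over a window of the given length."""
    return (1.0 - availability) * days * 24 * 60


# Illustrative targets only.
ours_alone = 0.9995
vendor = 0.999
combined = serial_availability(ours_alone, vendor)
print(f"combined availability: {combined:.5%}")
print(f"vendor budget per 30 days: {downtime_minutes(vendor):.1f} min")
print(f"combined budget per 30 days: {downtime_minutes(combined):.1f} min")
```

With these numbers, a vendor at 99.9% is allowed about 43.2 minutes of downtime in 30 days while staying inside its own target. If our SLO is stricter than the combined figure, the vendor can meet its commitment while we miss ours.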
- Do we control how much we rely on them?
  - feature flags or configuration to turn them off or down
  - the ability to route some percentage of traffic around them
  - documented "safe modes" in our own system
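A kill switch and a percentage dial can share one small function. This is a sketch under assumptions: `vendor_enabled_for` is a hypothetical name, and the `enabled` and `rollout_percent` values would come from whatever flag or configuration system is already in place. It uses a stable hash rather than Python's built-in `hash()`, which is salted per process for strings.

```python
import hashlib


def vendor_enabled_for(user_id: str, enabled: bool, rollout_percent: int) -> bool:
    """Decide whether this user's traffic should go through the vendor.

    enabled is the kill switch; rollout_percent (0-100) is the dial.
    """
    if not enabled:
        return False
    # Stable bucket in [0, 100) so a user stays on the same side
    # across processes and restarts.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_percent
```

Because the bucket is fixed per user, raising the percentage only adds users to the vendor path and lowering it only removes them, which keeps ramp-ups and ramp-downs easier to reason about.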
- Can we test and stage safely?
  - sandbox or test environment that behaves like production
  - ability to replay traffic or run limited canaries
  - a plan for how to roll forward and roll back
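Vendor sandboxes do not always let us trigger failures on demand. Where that is the case, a local stub with injected faults can cover part of the gap in our own tests. The class below is a hypothetical stand-in, not any vendor's real client; its `check` method and response shape are assumptions made for illustration.

```python
import random


class FaultInjectingStub:
    """Test double that fails a configurable fraction of calls."""

    def __init__(self, error_rate: float, seed: int = 0):
        self.error_rate = error_rate
        # Seeded generator so a test run is reproducible.
        self._rng = random.Random(seed)

    def check(self, payload: dict) -> dict:
        if self._rng.random() < self.error_rate:
            raise ConnectionError("injected vendor failure")
        return {"decision": "allow", "echo": payload}
```

Setting `error_rate` to `1.0` simulates a full outage and `0.0` a healthy vendor; values in between approximate a flaky dependency. A stub like this complements the vendor's sandbox and a production canary rather than replacing them.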
- Is data handling acceptable?
  - clear list of data fields we send and why
  - retention and access patterns we can live with
  - a path to reduce or anonymize data if needed later
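The "fields we send and why" list can live in code, so that anything not listed never leaves our system. A minimal sketch: the field names, the reasons, and the helper names are all hypothetical. Note that keyed hashing is pseudonymization rather than full anonymization, since anyone holding the key can recompute the mapping.

```python
import hashlib
import hmac

# Hypothetical allowlist: each field we send, with the reason we send it.
ALLOWED_FIELDS = {
    "order_id": "lets us correlate vendor decisions with our orders",
    "amount_cents": "input to the vendor's decision",
    "country": "input to the vendor's decision",
}


def pseudonymize(value: str, key: bytes) -> str:
    """Replace an identifier with a keyed hash that is stable per key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def build_vendor_payload(record: dict, key: bytes) -> dict:
    """Keep only allowlisted fields; send a pseudonym instead of user_id."""
    payload = {name: record[name] for name in ALLOWED_FIELDS if name in record}
    if "user_id" in record:
        payload["user_ref"] = pseudonymize(str(record["user_id"]), key)
    return payload
```

Reducing what we send later then becomes a one-line change to the allowlist instead of a hunt through call sites.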
- Do we know who to call and how?
  - operational contacts, not just sales
  - escalation paths and expected response times
  - integration with our own incident process where appropriate
Notes
This checklist is a floor, not a ceiling.
We’ve learned that skipping even one of these questions almost always shows up later as "we never asked that" in an incident review.
We’ve also learned not to rely solely on marketing material or status pages.
The most useful information usually comes from:
- talking to the vendor’s engineers or SREs directly
- testing behavior under controlled failure
- talking to other teams who have run the integration in production
Takeaways
- New vendors on the critical path are architectural decisions, not just purchasing decisions.
- Understanding failure modes and SLO alignment up front prevents painful surprises later.
- Feature flags, observability, and clear contacts turn vendor issues into manageable incidents instead of mysteries.