Access is a production dependency
During an incident, lack of access looks like downtime. Excess access looks like risk. We treat access like any other production system.
Most teams only notice access when it’s missing.
A system can be healthy and still be inoperable if the on-call engineer can’t reach the logs.
In an incident, “I can’t access that” is not a security concern. It’s downtime.
I’ve seen incidents where the system was recoverable in five minutes, but we spent twenty minutes getting into the right place: waiting for someone with admin access to wake up, hunting for a token, or discovering that the “temporary” vendor login was tied to one person’s 2FA.
The outage clock keeps running while you’re negotiating access.
Constraints
Access control gets worse under predictable pressure:
- onboarding is rushed, so roles are copied instead of designed
- credentials are shared because it’s faster than fixing the process
- “temporary” access becomes permanent
- audit trails are missing, so trust becomes memory
In an incident, missing access is an outage multiplier.
In steady state, excessive access is a silent risk.
Access is also a reliability dependency that doesn’t show up in your error rate. Your service can be healthy and you can still fail to restore it quickly if nobody has the right permissions.
Two things are true at once:
- If nobody can access production, recovery is slow.
- If everybody can access everything, you don’t have boundaries.
Access also rots in ways that don’t show up on dashboards:
- vendor consoles with separate logins
- 2FA devices tied to a single person
- long-lived tokens nobody remembers
- service accounts that quietly become “the admin account”
If your recovery path depends on one person’s phone or one person’s memory, you don’t have an on-call rotation.
What we changed
We treated access as part of the operational surface area.
Inventory the access surface
We wrote down every place an on-call engineer might need to go during an incident: metrics, logs, tracing, deploys, feature flags, queue admin, vendor consoles.
For each surface, we wrote:
- how you get in
- who owns it
- what the least-privilege operational role looks like
This sounds bureaucratic.
It prevents the worst incident pattern: the page is firing, the dashboards exist, but the link drops you on a login screen you can’t get past.
It also tells you where the real risks are: places where “admin” is the only role, where the audit trail is missing, or where the recovery path depends on a single person.
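A minimal sketch of what that inventory can look like as data, assuming it lives in version control next to the runbook; the surface names, fields, and owners are placeholders, not a prescribed schema:

```python
# Hypothetical access inventory, kept in version control next to the runbook.
# Surface names, fields, and owners are placeholders, not a prescribed schema.
ACCESS_SURFACES = [
    {
        "surface": "logs",                 # where on-call needs to go
        "entry": "SSO -> logging vendor",  # how you get in
        "owner": "platform-team",          # who fixes the access path when it breaks
        "oncall_role": "logs-read",        # least-privilege operational role
        "break_glass_needed": False,
    },
    {
        "surface": "feature-flags",
        "entry": "SSO -> flag console",
        "owner": "app-team",
        "oncall_role": "flags-kill-switch",  # can flip kill switches, not edit targeting rules
        "break_glass_needed": True,          # broader edits require deliberate escalation
    },
]

def surfaces_missing_owner(surfaces):
    """Flag entries where nobody owns the access path -- the quiet risks."""
    return [s["surface"] for s in surfaces if not s.get("owner")]
```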
Replace shared credentials
Shared credentials feel like speed.
They are also:
- un-auditable
- hard to rotate
- impossible to attribute
We replaced shared credentials with role-based access where possible.
Define minimal operational roles
We made roles explicit.
Not “engineer” and “admin.”
Operational roles:
- on-call: read-only access to logs/metrics, ability to roll back and turn down risky knobs
- support: ability to look up reference IDs and correlate user reports, without broad write access
- deploy/ops: the small set of write actions needed to operate the system
This lets us answer a concrete question: “what access does an on-call engineer need to do the first ten minutes of work?”
A follow-up question matters just as much: “what should on-call explicitly not be able to do at 2am?”
We want recovery actions to be available, and risky configuration changes to require deliberate escalation.
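One way to keep the roles honest is to write them down as data and review changes the way you review code. A rough sketch, with illustrative permission names that would map onto whatever your IAM or SSO groups actually are:

```python
# Hypothetical operational roles written down as data. The permission strings
# are illustrative; map them onto whatever your IAM / SSO groups actually are.
OPERATIONAL_ROLES = {
    "on-call": {
        "read":  ["logs", "metrics", "traces"],
        "write": ["rollback", "feature-flag-kill-switch"],  # recovery actions only
        "deny":  ["schema-migrations", "billing-config"],   # explicitly out of reach at 2am
    },
    "support": {
        "read":  ["reference-id-lookup"],  # correlate user reports, no raw data dumps
        "write": [],
        "deny":  ["deploys", "queue-admin"],
    },
    "deploy-ops": {
        "read":  ["logs", "metrics"],
        "write": ["deploy", "rollback", "queue-admin"],  # the small set of write actions
        "deny":  [],
    },
}
```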
Add a break-glass path
We introduced a break-glass path: time-bounded elevated access with logging.
The point is not to bypass security.
The point is to avoid the worst pattern: permanent admin access “just in case” because escalation is painful.
Break-glass has three requirements:
- it expires
- it is logged
- it requires a reason (even a short one)
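A minimal sketch of the shape of a break-glass grant, assuming a small internal helper in front of your IAM; `add_to_group` and `schedule_revocation` are hypothetical stubs you would wire to your provider:

```python
import logging
from datetime import datetime, timedelta, timezone

log = logging.getLogger("break_glass")

def add_to_group(user: str, role: str) -> None:
    """Stub: call your IAM / SSO provider here."""

def schedule_revocation(user: str, role: str, at: datetime) -> None:
    """Stub: enqueue a job that removes the grant at `at`."""

def grant_break_glass(user: str, role: str, reason: str,
                      ttl: timedelta = timedelta(hours=1)) -> datetime:
    """Time-bounded elevated access: it expires, it is logged, it needs a reason."""
    if not reason.strip():
        raise ValueError("break-glass requires a reason, even a short one")
    expires_at = datetime.now(timezone.utc) + ttl
    add_to_group(user, role)
    schedule_revocation(user, role, at=expires_at)
    log.warning("break-glass: user=%s role=%s reason=%r expires=%s",
                user, role, reason, expires_at.isoformat())
    return expires_at
```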
Put access in the runbook
We put access paths into the runbook.
If the safe first action requires a permission that only one person has, that’s not a safe first action.
A runbook that says “check the logs” is not done if on-call can’t reach the logs.
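As a rough example (the group name, URL, and break-glass role are placeholders), a runbook step that carries its own access path looks something like this:

```
1. Check ingestion lag on the queue dashboard.
   Access: SSO group oncall-metrics-read -> https://metrics.example.internal/queues
   If the link hits a login wall: break-glass role queue-admin-ro (expires after 1h).
```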
Make access testable
We treat access like we treat backups: something you don’t want to discover is broken during the incident.
On a recurring cadence (monthly or quarterly), we do a quick “access drill”:
- a random on-call engineer verifies they can reach logs, dashboards, and the deploy tool
- we validate break-glass still works and still expires
- we fix whatever has drifted (SSO changes, renamed groups, expired vendor accounts)
If you can’t test access, you don’t really have it.
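Part of the drill can be automated. A sketch, assuming each surface exposes an authenticated endpoint the engineer can probe with their own short-lived token; the URLs are placeholders, and the human check (can you actually read a log line, not just reach the endpoint?) still matters:

```python
# Hypothetical access drill: the on-call engineer runs this with their own
# short-lived token. URLs and the auth scheme are placeholders for your setup.
import sys
import urllib.request

SURFACES = {
    "logs":       "https://logs.example.internal/api/health",
    "dashboards": "https://metrics.example.internal/api/health",
    "deploys":    "https://deploy.example.internal/api/whoami",
}

def can_reach(name: str, url: str, token: str) -> bool:
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status < 400
    except Exception as exc:
        print(f"FAIL {name}: {exc}")
        return False

if __name__ == "__main__":
    token = sys.argv[1]  # the engineer's own token, not a shared credential
    ok = [can_reach(name, url, token) for name, url in SURFACES.items()]
    sys.exit(0 if all(ok) else 1)
```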
Review access like a production system
We added a simple recurring review:
- who has access
- why
- when it expires
We also treat offboarding as an operational event: access removed, tokens rotated, shared secrets eliminated.
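A sketch of that review as a script rather than a meeting, assuming you can export grant records from your IAM or SSO provider; the record format and the entries below are invented for illustration:

```python
from datetime import date

# Hypothetical grant records exported from your IAM / SSO provider;
# the two entries are made up for illustration.
GRANTS = [
    {"user": "alice", "role": "deploy-ops", "reason": "on-call rotation", "expires": date(2025, 3, 1)},
    {"user": "bob",   "role": "admin",      "reason": "",                 "expires": None},
]

def needs_review(grant: dict, today: date) -> bool:
    """A grant needs review if it never expires, has expired, or has no reason."""
    return (grant["expires"] is None
            or grant["expires"] < today
            or not grant["reason"])

for g in [g for g in GRANTS if needs_review(g, date.today())]:
    print(f"review: {g['user']} / {g['role']} (expires={g['expires']}, reason={g['reason']!r})")
```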
Results / Measurements
The goal isn’t perfection. The goal is fewer bad surprises.
We watched for:
- fewer incidents delayed by “who can access this?”
- fewer shared secrets living in chat threads
- fewer “temporary” permissions that never get removed
We also watched how often an incident required someone outside the rotation to join purely to grant access. That’s not escalation. That’s a process failure.
One direct operational metric we tracked:
- time-to-first-log-line during an incident
If it takes ten minutes to get access to logs, you don’t have observability.
A practical metric: time-to-restore should not depend on knowing which person has the right login.
Takeaways
Access is part of uptime.
If you can’t get to the logs quickly, you don’t have observability.
If access is shared, you don’t have accountability.
If recovery depends on a single person’s phone, you don’t have an on-call rotation.