MFA Fatigue Attacks: How Uber Got Breached and How to Stop It

Reading Time: 10 minutes



TL;DR

  • An MFA fatigue attack exploits push-notification MFA (Duo, Okta Verify, Microsoft Authenticator) by flooding a user with push requests until they accept one — either out of exhaustion or after social engineering
  • Uber (September 2022): contractor credentials purchased on a criminal marketplace → repeated Duo push notifications → WhatsApp social engineering → push accepted → admin PAM credentials found on internal file share → full access to AWS, GCP, Slack, HackerOne
  • The attack works because push MFA creates a UX habit: “tap accept” is a trained response, not a decision
  • Detection: multiple MFA failures followed by a single success in a short window — Okta System Log, Azure AD Sign-in Log, AWS CloudTrail
  • The structural fix is replacing push MFA with phishing-resistant FIDO2 hardware keys — not security awareness training, not more push notifications, not “number matching” alone
  • Okta (October 2023): support system breach exposed session tokens → attackers bypassed MFA entirely by using stolen session context

OWASP Mapping: A07 Identification and Authentication Failures. The Uber breach is the defining infrastructure example. Okta demonstrates session token theft as a related A07 variant.


The Big Picture

┌─────────────────────────────────────────────────────────────────────┐
│                    MFA FATIGUE ATTACK ANATOMY                       │
│                                                                     │
│   STEP 1: OBTAIN CREDENTIALS                                        │
│   Attacker ──── phish / buy on market ──────▶ username + password  │
│                                                                     │
│   STEP 2: TRIGGER MFA FLOOD                                         │
│   Attacker ──── repeated login attempts ────▶ Push #1 → User: NO   │
│                                               Push #2 → User: NO   │
│                                               Push #3 → User: NO   │
│                                               Push #4 → User: ???   │
│                                                                     │
│   STEP 3: SOCIAL ENGINEERING LAYER                                  │
│   Attacker ──── "Hi, I'm from IT support.                           │
│                  Please accept the next push."                      │
│                                               Push #4 → User: YES  │
│                                                                     │
│   STEP 4: ACCESS                                                    │
│   Attacker ──── authenticated session ──────▶ Internal network      │
│                                               Enumerate shares      │
│                                               Find next credential  │
│                                                                     │
│   ═══════════════════════════════════════════════════════           │
│   WHY TRAINING DOESN'T HELP:                                        │
│   Push MFA trains users to tap accept. The attacker exploits        │
│   the trained behavior. Education competes with habit.              │
│                                                                     │
│   WHY HARDWARE KEYS DO:                                             │
│   FIDO2 requires physical presence. WhatsApp message                │
│   cannot accept a hardware key challenge.                           │
└─────────────────────────────────────────────────────────────────────┘

An MFA fatigue attack is how you bypass multi-factor authentication without breaking encryption or stealing the MFA seed — you exploit the user’s psychology and the UX of push-notification systems. The attacker knows the password. The only thing standing between them and access is the user’s willingness to tap “deny” indefinitely.


The Uber Breach: Anatomy Minute by Minute

September 15, 2022. The attacker’s capabilities: a purchased credential set for an Uber contractor account, a phone number, and patience.

The credential acquisition: Uber contractor credentials were available on criminal marketplaces. The attacker obtained a valid username and password for a contractor’s Uber corporate account.

The MFA flood:

The contractor’s account had Duo push-based MFA enrolled. The attacker initiated login attempts repeatedly, triggering a sequence of Duo push notifications to the contractor’s phone. The contractor rejected three or four of them. At this point, most attacks would stop — but the attacker added a social engineering layer.

The WhatsApp message:

The attacker sent a WhatsApp message to the contractor’s number, claiming to be from Uber IT support:

“Hi, this is the Uber IT support team. We’re seeing some issues with your account and need you to approve the next Duo notification to verify your identity.”

The contractor accepted the next push notification.

Post-authentication enumeration:

With an authenticated session, the attacker accessed Uber’s internal network. On an internal network share accessible to contractors, they found a PowerShell script. In that script: hardcoded Thycotic admin credentials. Thycotic is a Privileged Access Management (PAM) system — it stores credentials for privileged accounts across an organization.

The blast radius:

With Thycotic admin access, the attacker retrieved credentials for:
– AWS IAM accounts
– GCP service accounts
– Google Workspace admin
– VMware vSphere
– Slack workspace admin
– HackerOne bug bounty program admin (including details of open security reports)

The entire Uber infrastructure was accessible from one contractor’s push notification acceptance.

What Uber’s logs showed:

2022-09-15T02:17:00Z  [Duo] user=<contractor>  action=push_sent  result=rejected
2022-09-15T02:17:45Z  [Duo] user=<contractor>  action=push_sent  result=rejected
2022-09-15T02:18:30Z  [Duo] user=<contractor>  action=push_sent  result=rejected
2022-09-15T02:19:15Z  [Duo] user=<contractor>  action=push_sent  result=rejected
2022-09-15T02:22:00Z  [Duo] user=<contractor>  action=push_sent  result=approved
2022-09-15T02:22:05Z  [VPN] user=<contractor>  connection=established  ip=<attacker>

Four rejections followed by one approval in a five-minute window. This is a detectable pattern — but only if someone is looking for it.
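The rejection-then-approval sequence can be checked mechanically. The sketch below is a minimal illustration, assuming log lines shaped roughly like the excerpt above and already in chronological order. The regular expression, the sample user `contractor@example.com`, the documentation-range IP, and the thresholds are all assumptions for illustration; real Duo exports use a different format.

```python
import re
from datetime import datetime, timedelta

# Hypothetical line format modelled on the excerpt; not Duo's real export format.
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+\[Duo\]\s+(?P<user>\S+)\s+action=push_sent\s+result=(?P<result>\w+)$"
)

def parse_line(line):
    """Return (timestamp, user, result) for a Duo push line, or None for other lines."""
    match = LINE_RE.match(line.strip())
    if not match:
        return None
    ts = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%SZ")
    return ts, match["user"], match["result"]

def find_fatigue(lines, min_rejections=3, window=timedelta(minutes=10)):
    """Return users whose approval followed at least min_rejections rejections within window."""
    rejections = {}  # user -> timestamps of rejected pushes
    flagged = []
    for parsed in filter(None, map(parse_line, lines)):
        ts, user, result = parsed
        if result == "rejected":
            rejections.setdefault(user, []).append(ts)
        elif result == "approved":
            recent = [t for t in rejections.get(user, [])
                      if timedelta(0) <= ts - t <= window]
            if len(recent) >= min_rejections and user not in flagged:
                flagged.append(user)
    return flagged

SAMPLE = """\
2022-09-15T02:17:00Z  [Duo] contractor@example.com  action=push_sent  result=rejected
2022-09-15T02:17:45Z  [Duo] contractor@example.com  action=push_sent  result=rejected
2022-09-15T02:18:30Z  [Duo] contractor@example.com  action=push_sent  result=rejected
2022-09-15T02:19:15Z  [Duo] contractor@example.com  action=push_sent  result=rejected
2022-09-15T02:22:00Z  [Duo] contractor@example.com  action=push_sent  result=approved
2022-09-15T02:22:05Z  [VPN] contractor@example.com  connection=established  ip=203.0.113.7
"""

if __name__ == "__main__":
    print(find_fatigue(SAMPLE.splitlines()))
```

The thresholds are deliberately loose; tuning them against your own baseline of legitimate retries is the real work.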


Red Phase: Simulating MFA Fatigue

What the Attack Looks Like in Tooling

MFA fatigue attacks are conducted manually — an attacker with valid credentials and knowledge of which MFA system the target uses. No special tooling is required for the attack itself. What can be simulated:

Option 1: Repeated legitimate login attempts (test account only)

# DO NOT run against production accounts or accounts you don't own

# Using the Okta classic Authentication API (test environment only)
TEST_USERNAME="test-user@example.com"
TEST_PASSWORD="TestPassword123!"
OKTA_DOMAIN="your-org.okta.com"

for i in {1..5}; do
  echo "Attempt $i at $(date +%T)"
  response=$(curl -s -X POST \
    "https://${OKTA_DOMAIN}/api/v1/authn" \
    -H "Content-Type: application/json" \
    -d "{\"username\": \"${TEST_USERNAME}\", \"password\": \"${TEST_PASSWORD}\"}")

  status=$(echo "$response" | jq -r '.status')
  echo "  Status: $status"

  # Primary authentication returns MFA_REQUIRED; the push is only sent
  # once the push factor's verify endpoint is called with the state token
  if [ "$status" = "MFA_REQUIRED" ]; then
    state_token=$(echo "$response" | jq -r '.stateToken')
    factor_id=$(echo "$response" | jq -r '._embedded.factors[] | select(.factorType == "push") | .id')

    challenge=$(curl -s -X POST \
      "https://${OKTA_DOMAIN}/api/v1/authn/factors/${factor_id}/verify" \
      -H "Content-Type: application/json" \
      -d "{\"stateToken\": \"${state_token}\"}")
    echo "  Factor ID: $factor_id (push sent, status: $(echo "$challenge" | jq -r '.status'))"

    # In a real attack, the attacker would poll the verify endpoint for the user's response:
    echo "  Waiting 10 seconds for user to respond..."
    sleep 10
  fi

  sleep 30  # Wait between attempts to avoid rate limiting
done

Option 2: Tabletop exercise (no credentials required)

For organizations that cannot run live credential tests, the tabletop simulation maps the attack against your specific IdP logs. Pull 30 days of authentication logs and look for the pattern:

# Okta System Log: find users with multiple MFA failures followed by success
curl -H "Authorization: SSWS ${OKTA_API_TOKEN}" \
  "https://your-org.okta.com/api/v1/logs?filter=eventType+eq+\"user.authentication.auth_via_mfa\"&limit=1000" | \
  jq '
    group_by(.actor.id) |
    map({
      user: .[0].actor.displayName,
      total: length,
      failures: [.[] | select(.outcome.result == "FAILURE")] | length,
      successes: [.[] | select(.outcome.result == "SUCCESS")] | length
    }) |
    sort_by(.failures) |
    reverse |
    .[0:20]
  '

Users with high failure counts followed by eventual success are the fatigue attack pattern. Some will be legitimate (user locked themselves out, then called IT). The ones to investigate are those where the failure-to-success sequence happened in a short window (under 30 minutes) and from an unusual IP.
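The triage rule in the paragraph above (short window plus unusual IP) can be sketched as a filter over sequences you have already extracted from the logs. This is a minimal sketch under assumptions: the dictionary keys, the `known_ips` baseline structure, and the function name are hypothetical, and the 30-minute default simply mirrors the figure in the text.

```python
from datetime import datetime, timedelta

def triage(sequences, known_ips, max_window=timedelta(minutes=30)):
    """Keep failure-then-success sequences that were fast and came from an unseen IP.

    sequences: list of dicts with keys user, first_failure, success (datetimes), ip
    known_ips: dict mapping user -> set of IPs seen in that user's earlier logins
    """
    suspicious = []
    for seq in sequences:
        duration = seq["success"] - seq["first_failure"]
        fast = timedelta(0) <= duration <= max_window
        new_ip = seq["ip"] not in known_ips.get(seq["user"], set())
        if fast and new_ip:
            suspicious.append(seq["user"])
    return suspicious

# Illustrative data only; IPs are from documentation ranges.
EXAMPLE_SEQUENCES = [
    {"user": "alice", "ip": "198.51.100.20",
     "first_failure": datetime(2024, 1, 10, 9, 0), "success": datetime(2024, 1, 10, 9, 4)},
    {"user": "bob", "ip": "203.0.113.5",
     "first_failure": datetime(2024, 1, 10, 9, 0), "success": datetime(2024, 1, 10, 13, 0)},
    {"user": "carol", "ip": "192.0.2.44",
     "first_failure": datetime(2024, 1, 10, 9, 0), "success": datetime(2024, 1, 10, 9, 6)},
]
EXAMPLE_KNOWN_IPS = {"alice": {"192.0.2.10"}, "bob": {"192.0.2.11"}, "carol": {"192.0.2.44"}}

if __name__ == "__main__":
    print(triage(EXAMPLE_SEQUENCES, EXAMPLE_KNOWN_IPS))
```

Here only alice is kept: bob took four hours (more consistent with a lockout and a helpdesk call) and carol logged in from an address already in her baseline.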


Blue Phase: Detection Across Identity Providers

Okta: Push Notification Flood

# Okta System Log: detect repeated push failures from the same user
# Query for: 3 or more MFA failures by one user within the same 10-minute bucket
# (.published[0:15] truncates the timestamp to its 10-minute bucket, e.g. 2022-09-15T02:1)
curl -s -H "Authorization: SSWS ${OKTA_API_TOKEN}" \
  "https://your-org.okta.com/api/v1/logs?filter=eventType+eq+\"user.authentication.auth_via_mfa\"+and+outcome.result+eq+\"FAILURE\"&since=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" | \
  jq '
    group_by([.actor.id, .published[0:15]]) |
    map(select(length >= 3)) |
    map({
      user: .[0].actor.displayName,
      window: (.[0].published[0:15] + "0"),
      failure_count: length,
      ips: [.[].client.ipAddress] | unique
    })
  '

Azure AD: Conditional Access Logs

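# Azure AD: MFA denial flood detection (sign-in logs via Microsoft Graph, using Azure CLI)
# Note: `az monitor activity-log` covers Azure resource operations, not Azure AD sign-ins.
# Error code 500121 = authentication failed during the strong authentication (MFA) request.
# Assumes the caller has permission to read sign-in logs; adjust the filter to your tenant.
az rest --method GET \
  --uri "https://graph.microsoft.com/v1.0/auditLogs/signIns" \
  --uri-parameters "\$filter=createdDateTime ge $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ) and status/errorCode eq 500121" \
  --query "value[].{user:userPrincipalName,time:createdDateTime,ip:ipAddress,reason:status.failureReason}" \
  --output table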

In Microsoft Sentinel, the detection rule for MFA fatigue:

// Azure AD MFA Fatigue Detection — Sentinel KQL
SigninLogs
| where TimeGenerated > ago(24h)
| where AuthenticationRequirement == "multiFactorAuthentication"
// Keep successes in scope: filtering to failures here would force SuccessCount to 0
| summarize
    FailureCount = countif(ResultType != "0"),
    SuccessCount = countif(ResultType == "0"),
    IPs = make_set(IPAddress),
    StartTime = min(TimeGenerated),
    EndTime = max(TimeGenerated)
    by UserPrincipalName, bin(TimeGenerated, 10m)
| where FailureCount >= 3
| where SuccessCount >= 1
| where datetime_diff('minute', EndTime, StartTime) <= 30
| project UserPrincipalName, FailureCount, SuccessCount, IPs, StartTime, EndTime
| order by FailureCount desc

AWS CloudTrail: Console Session After MFA Flood

If your organization uses AWS SSO (IAM Identity Center) with an external IdP, the CloudTrail event that matters is the console login event immediately following the MFA success:

# List recent AWS console logins with source IP and MFA flag
# (compare the IPs against your own baseline to spot unusual ones)
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=ConsoleLogin \
  --start-time "$(date -d '24 hours ago' --iso-8601=seconds)" \
  --query 'Events[].{Time:EventTime,User:Username,Raw:CloudTrailEvent}' \
  --output json | \
  jq '.[] | (.Raw | fromjson) as $e | {
    time: .Time,
    user: .User,
    ip: $e.sourceIPAddress,
    mfa: $e.additionalEventData.MFAUsed
  }'

What a GuardDuty Alert Looks Like for This Attack

GuardDuty does not generate a specific finding for MFA fatigue (it does not have visibility into IdP logs). What it may catch downstream:

  • UnauthorizedAccess:IAMUser/ConsoleLoginSuccess.B — console login from unusual geographic location or Tor exit node
  • Discovery:IAMUser/AnomalousBehavior — if the attacker begins enumerating IAM after console access

The gap: GuardDuty’s behavioral analysis is per-account. If the attacker logs in using valid credentials and MFA, GuardDuty may not flag the initial access — only downstream actions that deviate from baseline.


Purple Phase: The Structural Fix

Fix 1: Replace Push MFA with FIDO2 Hardware Keys (for Tier-0 Accounts)

This is the only structural fix. MFA fatigue attacks work because push notifications can be approved by a human who is socially engineered. FIDO2 hardware keys (YubiKey, Google Titan, etc.) require physical possession of the key and a user gesture (touch). A WhatsApp message cannot substitute for physical key presence.

# Okta: Require hardware key MFA for admin accounts
# (done via Okta Admin Console → Security → Authentication Policies)
# CLI example using Okta API:

# Create the authentication policy container; the hardware-key requirement itself
# is enforced by a rule added to this policy afterwards
curl -X POST \
  "https://your-org.okta.com/api/v1/policies" \
  -H "Authorization: SSWS ${OKTA_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Admin Hardware Key Policy",
    "type": "ACCESS_POLICY",
    "status": "ACTIVE",
    "description": "Requires FIDO2 hardware key for admin access"
  }'

Phasing hardware keys across an organization:

Tier               | Examples                                      | Timeline
Tier 0 (immediate) | Cloud admin, IAM admin, Okta admin, DNS admin | Week 1
Tier 1 (30 days)   | All engineers with production access          | Month 1
Tier 2 (90 days)   | All employees with SSO access                 | Month 3
Contractors        | Scope-limited access, enforce at boundary     | Immediate

Fix 2: Number Matching (Intermediate Mitigation)

If hardware keys cannot be deployed immediately, number matching significantly reduces MFA fatigue effectiveness. Instead of a simple “approve/deny” push, the user must match a number shown on the login screen to a number shown in the authenticator app. This breaks the fatigue pattern — the attacker cannot trigger an approval without the user actively entering the correct number.

# Duo: Enable number matching (Duo calls the feature "Verified Duo Push")
# Duo Admin Console → Policies → Duo Push Number Matching: Required

# Microsoft Authenticator: Enable number matching
# Azure AD → Security → Authentication methods → Microsoft Authenticator
# Enable: "Require number matching for push notifications"

# Okta Verify: Enable number challenge
# Okta Admin → Security → Multifactor → Okta Verify → Enable "Number Challenge"
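Why number matching breaks blind approval can be shown with a toy model. The sketch below is not any vendor's protocol: the class name, the two-digit range, and the single-use rule are assumptions chosen for illustration. The point is only that an approval now requires information the user can get solely from the login screen.

```python
import secrets

class NumberMatchChallenge:
    """Toy model of a number-matching push (hypothetical, not a vendor implementation)."""

    def __init__(self):
        # Two-digit number displayed on the login screen, never sent in the push itself
        self.expected = secrets.randbelow(90) + 10
        self.resolved = False

    def respond(self, entered):
        """Single use: True only if the user typed the number shown at login."""
        if self.resolved:
            return False
        self.resolved = True
        return secrets.compare_digest(str(entered), str(self.expected))

if __name__ == "__main__":
    challenge = NumberMatchChallenge()
    # A user tapping through without seeing the attacker's login screen has to guess
    guess = 10 if challenge.expected != 10 else 11
    print("blind guess accepted:", challenge.respond(guess))
```

The relay weakness remains: if the attacker reads the number to the user over a call, the model above approves just the same.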

Fix 3: Detect and Block — Automated Response to Fatigue Pattern

#!/usr/bin/env python3
# Purple Team EP05 — MFA Fatigue Auto-Response
# Monitors Okta System Log; alerts the security team via SNS on fatigue pattern detection
# (automatic user suspension would be a further step and is not implemented here)
# Run as a Lambda function or scheduled script in your SIEM pipeline

from datetime import datetime, timedelta, timezone

import boto3
import requests

OKTA_DOMAIN = "your-org.okta.com"
OKTA_TOKEN = "your-okta-api-token"  # use Secrets Manager in production
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:security-alerts"

def get_recent_mfa_events(hours=1):
    since = (datetime.now(timezone.utc) - timedelta(hours=hours)).strftime("%Y-%m-%dT%H:%M:%SZ")
    url = f"https://{OKTA_DOMAIN}/api/v1/logs"
    params = {
        "filter": 'eventType eq "user.authentication.auth_via_mfa"',
        "since": since,
        "limit": 1000
    }
    headers = {"Authorization": f"SSWS {OKTA_TOKEN}"}
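    response = requests.get(url, params=params, headers=headers, timeout=30)
    response.raise_for_status()  # surface auth or rate-limit errors instead of parsing an error body
    return response.json()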

def detect_fatigue_pattern(events, failure_threshold=3, window_minutes=10):
    # Group MFA events per user, parsing Okta's ISO 8601 timestamps to second precision
    user_events = {}
    for event in events:
        user_id = event["actor"]["id"]
        entry = user_events.setdefault(
            user_id, {"name": event["actor"]["displayName"], "events": []}
        )
        entry["events"].append({
            "result": event["outcome"]["result"],
            "time": datetime.strptime(event["published"][:19], "%Y-%m-%dT%H:%M:%S"),
        })

    window = timedelta(minutes=window_minutes)
    fatigue_users = []
    for user_id, data in user_events.items():
        events_sorted = sorted(data["events"], key=lambda e: e["time"])
        for success in (e for e in events_sorted if e["result"] == "SUCCESS"):
            # Count only the failures inside the window immediately before this success
            recent_failures = [
                e for e in events_sorted
                if e["result"] == "FAILURE"
                and success["time"] - window <= e["time"] < success["time"]
            ]
            if len(recent_failures) >= failure_threshold:
                fatigue_users.append({
                    "user_id": user_id,
                    "user_name": data["name"],
                    "failure_count": len(recent_failures),
                    "success_after_failures": True
                })
                break  # one alert per user is enough

    return fatigue_users

def alert_security_team(fatigue_users):
    sns = boto3.client("sns")
    message = f"MFA FATIGUE ALERT — {len(fatigue_users)} user(s) detected:\n"
    for user in fatigue_users:
        message += f"  - {user['user_name']}: {user['failure_count']} failures then success\n"

    sns.publish(
        TopicArn=SNS_TOPIC_ARN,
        Subject="Purple Team: MFA Fatigue Attack Detected",
        Message=message
    )

def lambda_handler(event, context):
    events = get_recent_mfa_events(hours=1)
    fatigue_users = detect_fatigue_pattern(events)
    if fatigue_users:
        alert_security_team(fatigue_users)
    return {"fatigue_users_detected": len(fatigue_users)}

Fix 4: Privileged Access Workstations and Session Recording

The Uber breach succeeded because the attacker found hardcoded credentials on a file share accessible to contractors. Once identity is hardened, the downstream fix is removing credentials from scripts and shares:

# Ensure no scripts or configuration files contain credentials
# Run TruffleHog against your internal repositories and file shares
trufflehog filesystem /path/to/internal/share \
  --json \
  --include-detectors=all \
  2>/dev/null | \
  jq '{file: .SourceMetadata.Data.Filesystem.file, detector: .DetectorName, verified: .Verified}'

Run This in Your Own Environment: MFA Audit

#!/bin/bash
# Purple Team EP05 — MFA Coverage Audit
# Checks AWS IAM for console users without MFA and for access keys older than 90 days (A07 exposure)

echo "=== AWS: Console Users Without MFA ==="
aws iam generate-credential-report > /dev/null 2>&1
sleep 5
aws iam get-credential-report --query 'Content' --output text | base64 -d | \
  awk -F',' 'NR>1 && $4=="true" && $8=="false" {
    print "  USER: " $1 " | Console: " $4 " | MFA: " $8
  }'

echo ""
echo "=== AWS: IAM Users with Long-Lived Access Keys (rotation risk) ==="
aws iam get-credential-report --query 'Content' --output text | base64 -d | \
  awk -F',' 'NR>1 && $9=="true" && $10!="N/A" {
    cmd = "date -d " $10 " +%s"
    cmd | getline key_date; close(cmd)
    now = systime()
    age_days = int((now - key_date) / 86400)
    if (age_days > 90) print "  USER: " $1 " | KEY AGE: " age_days " days"
  }'

echo ""
echo "=== RECOMMENDATION ==="
echo "  - Any console user without MFA = immediate A07 exposure"
echo "  - For accounts with Okta/Azure AD: run IdP-specific audit above"
echo "  - Hardware FIDO2 keys required for all admin accounts"
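The same two checks can be written against the credential report CSV directly, which avoids depending on column positions. This is a minimal sketch, assuming the report has already been downloaded and decoded to text; the function name, the 90-day default, and the sample rows (which show only the first ten columns of a real report) are illustrative assumptions.

```python
import csv
import io
from datetime import datetime, timezone

def audit_credential_report(report_csv, max_key_age_days=90, now=None):
    """Return (console users without MFA, users whose access key 1 is older than the limit)."""
    now = now or datetime.now(timezone.utc)
    no_mfa, old_keys = [], []
    for row in csv.DictReader(io.StringIO(report_csv)):
        if row["password_enabled"] == "true" and row["mfa_active"] == "false":
            no_mfa.append(row["user"])
        rotated_raw = row["access_key_1_last_rotated"]
        if row["access_key_1_active"] == "true" and rotated_raw != "N/A":
            rotated = datetime.fromisoformat(rotated_raw.replace("Z", "+00:00"))
            if (now - rotated).days > max_key_age_days:
                old_keys.append(row["user"])
    return no_mfa, old_keys

# Illustrative rows only; a real report has more columns (second access key, certificates).
SAMPLE_REPORT = """\
user,arn,user_creation_time,password_enabled,password_last_used,password_last_changed,password_next_rotation,mfa_active,access_key_1_active,access_key_1_last_rotated
alice,arn:aws:iam::123456789012:user/alice,2023-01-01T00:00:00+00:00,true,N/A,N/A,N/A,false,false,N/A
bob,arn:aws:iam::123456789012:user/bob,2023-01-01T00:00:00+00:00,true,N/A,N/A,N/A,true,true,2023-01-01T00:00:00+00:00
"""

if __name__ == "__main__":
    print(audit_credential_report(SAMPLE_REPORT))
```

Reading columns by header name keeps the check working if AWS adds columns to the report.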

⚠ Common Mistakes When Responding to MFA Fatigue Risk

Mandating security training as the primary response. The Uber contractor was experienced. Training did not fail — the attacker exploited a social engineering vector that training cannot structurally prevent. Hardware keys remove the social engineering surface entirely.

Implementing “number matching” and considering MFA fatigue solved. Number matching makes fatigue attacks harder, not impossible. A sophisticated attacker can relay the number in real time via voice call (“what number do you see on your screen?”). It buys time; it does not eliminate the attack class.

Requiring MFA for employees but not contractors. The Uber breach was a contractor account. Contractor access policies tend to have looser MFA requirements because contractors often resist corporate MDM on personal devices. The solution is to scope contractor access tightly and require hardware key MFA at the access boundary, not push MFA.

Not monitoring for the failure-then-success pattern. The Okta System Log, Azure AD Sign-in Logs, and Duo Admin Panel all have the data to detect MFA fatigue in real time. Most organizations generate these logs but do not have detection rules for the pattern. The detection is straightforward; the investment is adding the rule to your SIEM.

Forgetting session tokens. The Okta breach was not MFA fatigue — it was session token theft. An attacker who can steal a valid session token does not need to beat MFA at all. Session token lifetime, storage security, and re-authentication requirements for sensitive operations are separate controls that address this variant.
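The session controls named in the last item (session lifetime and re-authentication for sensitive operations) can be sketched as a single check. This is an illustration under assumptions: the `Session` fields, the 8-hour and 15-minute limits, and the return values are hypothetical, not settings from any particular IdP.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Session:
    user: str
    issued_at: datetime
    last_mfa_at: datetime

# Assumed limits for illustration only
MAX_SESSION_AGE = timedelta(hours=8)
MAX_MFA_AGE_SENSITIVE = timedelta(minutes=15)

def check(session, sensitive, now):
    """Return 'allow', 'reauth' (fresh MFA required) or 'expired'."""
    if now - session.issued_at > MAX_SESSION_AGE:
        return "expired"
    if sensitive and now - session.last_mfa_at > MAX_MFA_AGE_SENSITIVE:
        return "reauth"
    return "allow"

if __name__ == "__main__":
    start = datetime(2024, 1, 10, 9, 0, tzinfo=timezone.utc)
    session = Session("alice", issued_at=start, last_mfa_at=start)
    # A token stolen an hour into the session still works for routine reads,
    # but a sensitive operation forces a fresh MFA the thief cannot complete
    print(check(session, sensitive=False, now=start + timedelta(hours=1)))
    print(check(session, sensitive=True, now=start + timedelta(hours=1)))
```

A stolen token is still usable for non-sensitive actions until it expires, which is why session lifetime and token storage remain separate controls.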


Quick Reference

Attack Variant           | Mechanism                                                 | Structural Fix
Push notification flood  | Attacker initiates logins repeatedly until user accepts   | FIDO2 hardware key MFA
Social engineering layer | Attacker contacts user claiming to be IT support          | Hardware key (physical presence required)
Session token theft      | Steal valid session without needing MFA at all            | Short session lifetime + re-auth for sensitive ops
Number matching bypass   | Relay number via voice call in real time                  | Hardware key (no relay possible)
SIM swap                 | Port victim’s phone number to attacker’s SIM; receive OTP | Hardware key (phone-independent)

Key Takeaways

  • An MFA fatigue attack exploits push notification UX — training users to tap “deny” competes with a trained habit of tapping “accept”; hardware keys eliminate the attack surface by requiring physical presence
  • The Uber breach (2022) was MFA fatigue + hardcoded credentials in a file share — two OWASP categories chained (A07 + A02)
  • Detection is straightforward: multiple MFA failures followed by a success in a short window — this pattern exists in every IdP’s logs; adding the detection rule is the work
  • Number matching is a meaningful intermediate mitigation; it is not a structural fix
  • Hardware FIDO2 keys are the structural fix — they require physical presence and are phishing-resistant by design
  • Tier-0 accounts (cloud admin, IAM admin, Okta admin) cannot wait for the phased rollout — hardware keys on day one
  • Session token theft (CircleCI, Okta support breach) is a related A07 variant: even perfect MFA is bypassed if a valid session token is exfiltrated

What’s Next

EP06 covers CI/CD secrets exposure — how pipeline breaches work, why storing credentials in environment variables is structurally dangerous, and how the CircleCI breach exposed secrets that teams thought were safely stored. The structural answer is OIDC workload identity (IAM EP07): short-lived credentials that cannot be exfiltrated because they don’t exist until the moment they’re needed.


Zero Trust Access in the Cloud: How the Evaluation Loop Actually Works

Reading Time: 10 minutes




TL;DR

  • Zero Trust: trust nothing implicitly, verify everything explicitly, minimize blast radius by assuming you will be breached
  • Network location is not identity — VPN is authentication for the tunnel, not authorization for the resource
  • JIT privilege elevation removes standing admin access: engineers request elevation for a specific purpose, scoped to a specific duration
  • Device posture is an access signal — a compromised endpoint with valid credentials is still a threat; Conditional Access gates on device compliance
  • Continuous session validation re-evaluates signals throughout the session — device falls out of compliance, sessions revoke in minutes, not at expiry
  • The highest-ROI early moves: eliminate machine static credentials, enforce MFA on all human access, federate to a single IdP

The Big Picture

  ZERO TRUST IAM — EVERY REQUEST EVALUATED INDEPENDENTLY

  API call arrives
         │
         ▼
  Identity verified? ──── No ────► DENY
         │
        Yes
         │
         ▼
  Device compliant? ───── No ────► DENY (or step-up MFA)
         │
        Yes
         │
         ▼
  Policy allows this  ─── No ────► DENY
  action on this ARN?
         │
        Yes
         │
         ▼
  Conditions met? ─────── No ────► DENY
  (time, IP, MFA age,              (e.g., outside business hours,
   risk score, session)             impossible travel detected)
         │
        Yes
         │
         ▼
       ALLOW ──────────────────────► LOG every decision (allow and deny)
         │
         └── Continuous re-evaluation:
             device state changes → revoke
             anomaly detected → revoke or step-up
             credential age → require re-auth

Introduction

The perimeter model of network security made a bet: inside the network is trusted, outside is not. Lock down the perimeter tightly enough and you’re safe. VPN in, and you’re one of us.

I grew up professionally in that model. Firewalls, DMZs, trusted zones. The idea had intuitive appeal — you build walls, you control what crosses them. For a while it worked reasonably well.

Then I watched it fail, repeatedly, in ways that were predictable in hindsight. An engineer’s laptop gets compromised at a coffee shop. They VPN in. Now the attacker is “inside.” A contractor account gets phished. They have valid Active Directory credentials. They’re inside. A cloud service gets misconfigured and exposes a management interface. There’s no perimeter for that to be inside of.

The perimeter model failed not because the walls weren’t strong enough, but because the premise was wrong. There is no inside. There is no perimeter that reliably separates trusted from untrusted. In a world of remote work, cloud services, contractor access, and API integrations, the attack surface doesn’t respect network boundaries.

Zero Trust is the architecture built on a different premise: trust nothing implicitly. Verify everything explicitly. Minimize blast radius by assuming you will be breached.

This isn’t a product you buy. It’s a set of principles applied to how you design, build, and operate your IAM. This episode is how those principles translate to concrete practices — building on everything we’ve covered in this series.


The Three Principles

Verify Explicitly

Every request must carry verifiable identity and context. Network location is not identity.

Old model: request from 10.0.0.0/8 → trusted, proceed
Zero Trust: request from 10.0.0.0/8 → still must present verifiable identity
                                       still must pass authorization check
                                       still must pass context evaluation
                                       then proceed (or deny)

In cloud IAM terms: every API call carries identity claims (IAM role ARN, federated identity, managed identity), and those claims are verified against policy on every single request. There is no concept of “once authenticated, trusted until logout”: the cloud control plane already authenticates and authorizes each API call independently. The challenge is extending this model to internal services, internal APIs, and human access patterns.
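The per-request evaluation order from the Big Picture diagram can be sketched as one function. This is a simplified illustration, not how any cloud provider's policy engine is implemented: the `Request` fields, the set-based `policy`, and the `conditions` mapping are hypothetical names introduced here.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    identity_verified: bool
    device_compliant: bool
    action: str
    resource: str
    context: dict = field(default_factory=dict)

def evaluate(request, policy, conditions, audit_log):
    """Evaluate one request in isolation; every decision, allow or deny, is logged."""
    if not request.identity_verified:
        decision = ("DENY", "identity not verified")
    elif not request.device_compliant:
        decision = ("DENY", "device not compliant")
    elif (request.action, request.resource) not in policy:
        decision = ("DENY", "no policy allows this action on this resource")
    else:
        failed = [name for name, check in conditions.items() if not check(request.context)]
        if failed:
            decision = ("DENY", "condition failed: " + failed[0])
        else:
            decision = ("ALLOW", "all checks passed")
    audit_log.append((request.action, request.resource, *decision))
    return decision[0]

# Hypothetical policy and condition for illustration
EXAMPLE_POLICY = {("s3:GetObject", "arn:aws:s3:::example-bucket/*")}
EXAMPLE_CONDITIONS = {"mfa_age": lambda ctx: ctx.get("mfa_age_minutes", 999) <= 60}

if __name__ == "__main__":
    log = []
    req = Request(True, True, "s3:GetObject", "arn:aws:s3:::example-bucket/*",
                  {"mfa_age_minutes": 5})
    print(evaluate(req, EXAMPLE_POLICY, EXAMPLE_CONDITIONS, log), log)
```

Nothing is carried over between calls: a request that passed a minute ago is evaluated from scratch the next time.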

Implementation in practice:
– mTLS for service-to-service communication — both sides present certificates; identity is the certificate, not the network path
– Bearer tokens on every internal API call — no session cookies, no “we’re on the same VPC so it’s fine”
– Short-lived credentials everywhere: a compromised credential expires on its own shortly after issuance instead of staying valid until a session times out 8 hours later
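The short-lived bearer token idea from the list above can be illustrated with the standard library alone. This is a toy format under stated assumptions: the token layout, function names, and the shared HMAC secret are hypothetical, and a production system would use an established standard such as OIDC-issued JWTs with a maintained library.

```python
import base64
import hashlib
import hmac
import json
import time

def issue_token(secret, subject, ttl_seconds, now=None):
    """Return a signed token naming the subject and an absolute expiry time."""
    now = time.time() if now is None else now
    payload = json.dumps({"sub": subject, "exp": now + ttl_seconds}).encode()
    body = base64.urlsafe_b64encode(payload)
    sig = base64.urlsafe_b64encode(hmac.new(secret, body, hashlib.sha256).digest())
    return (body + b"." + sig).decode()

def verify_token(secret, token, now=None):
    """Return the subject if the signature is valid and the token is unexpired, else None."""
    now = time.time() if now is None else now
    try:
        body, sig = token.encode().split(b".")
    except ValueError:
        return None  # not exactly two parts
    expected = base64.urlsafe_b64encode(hmac.new(secret, body, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None
    claims = json.loads(base64.urlsafe_b64decode(body))
    return claims["sub"] if claims["exp"] > now else None

if __name__ == "__main__":
    secret = b"demo-secret-not-for-production"
    token = issue_token(secret, "svc-orders", ttl_seconds=300)
    print(verify_token(secret, token))
```

The receiving service checks the token on every call; once the five minutes pass, a copied token is worthless without any revocation step.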

Use Least Privilege — Just-in-Time, Just-Enough

No standing access to sensitive resources. Access granted when needed, for the minimum scope, for the minimum duration.

Old model: alice is in the DBA group → permanent access to all databases
Zero Trust: alice requests access to production DB →
            verified: alice's device is enrolled in MDM and compliant
            verified: alice has an open change ticket for this task
            verified: current time is within business hours
            granted: connection to this specific database, from alice's specific IP
                     for 2 hours, then revoked automatically

This is JIT access. It reduces the window where a compromised credential can cause damage. It requires a change in how engineers think about access: access is not a property you have, it’s something you request when you need it. The operational friction is a feature, not a bug. Justifying each elevated access request is what keeps the access model honest.
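The JIT model described above reduces to two operations: record a justified, time-boxed grant, and check at access time whether an unexpired grant exists. The sketch below is a minimal in-memory illustration; `JitBroker`, `Grant`, the 4-hour cap, and the ticket reference are hypothetical, and a real deployment would back this with the IdP or an ITSM workflow.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class Grant:
    principal: str
    resource: str
    justification: str
    expires_at: datetime

class JitBroker:
    """In-memory stand-in for a JIT access service (illustrative only)."""

    def __init__(self, max_duration=timedelta(hours=4)):
        self.max_duration = max_duration
        self.grants = []  # doubles as the audit trail of who requested what, and why

    def request(self, principal, resource, justification, duration, now):
        if not justification.strip():
            raise ValueError("a justification is required")
        if duration > self.max_duration:
            raise ValueError("requested duration exceeds the maximum")
        grant = Grant(principal, resource, justification, now + duration)
        self.grants.append(grant)
        return grant

    def is_allowed(self, principal, resource, now):
        # No standing access: only an unexpired grant for this exact resource counts
        return any(
            g.principal == principal and g.resource == resource and now < g.expires_at
            for g in self.grants
        )

if __name__ == "__main__":
    now = datetime(2026, 4, 11, 9, 0, tzinfo=timezone.utc)
    broker = JitBroker()
    broker.request("alice", "prod-db", "change ticket CHG-0001", timedelta(hours=2), now)
    print(broker.is_allowed("alice", "prod-db", now + timedelta(hours=1)))
    print(broker.is_allowed("alice", "prod-db", now + timedelta(hours=3)))
```

Expiry is evaluated at check time, so access lapses even if no cleanup job ever runs.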

Assume Breach

Design systems as if the attacker is already inside. This drives different decisions:

  • Micro-segmentation: one role per service, minimum permissions per role. If one service is compromised, it can’t pivot to everything else.
  • Log everything: every authorization decision, allow or deny. When you’re investigating an incident, you need to know what happened, not just that something happened.
  • Automate response: anomalous API call pattern → trigger automated credential revocation or session termination. Don’t wait for a human to notice.

Building Zero Trust IAM — Block by Block

Block 1: Strong Identity Foundation

You can’t verify explicitly without strong authentication. The starting point:

# AWS: require MFA for all IAM operations (enforce via SCP across the org)
# Service-linked roles live under the aws-service-role/ path
{
  "Effect": "Deny",
  "Action": "*",
  "Resource": "*",
  "Condition": {
    "BoolIfExists": {
      "aws:MultiFactorAuthPresent": "false"
    },
    "StringNotLike": {
      "aws:PrincipalArn": [
        "arn:aws:iam::*:role/aws-service-role/*",
        "arn:aws:iam::*:role/OrganizationAccountAccessRole"
      ]
    }
  }
}

# GCP: enforce OS Login for VM SSH (ties SSH access to Google identity, not SSH keys)
gcloud compute project-info add-metadata \
  --metadata enable-oslogin=TRUE

# This means: SSH to a VM requires your Google identity to have roles/compute.osLogin
# or roles/compute.osAdminLogin. No more managing ~/.authorized_keys files on instances.

For human access: hardware FIDO2 keys (YubiKey, Google Titan) rather than TOTP where possible. TOTP codes can be phished in real-time adversary-in-the-middle attacks. Hardware keys cannot — the cryptographic challenge-response is bound to the origin URL.
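The origin binding mentioned above is visible in the WebAuthn `clientDataJSON` that the browser produces and the authenticator signs over. The sketch below covers only that one step, assuming the relying party has already received the base64url-encoded client data; the function names are hypothetical, and a real verifier also checks the signature and the RP ID hash, normally through a maintained WebAuthn library.

```python
import base64
import json

def encode_client_data(data):
    """Helper for the example: base64url-encode client data the way a browser would send it."""
    raw = json.dumps(data).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

def check_client_data(client_data_b64url, expected_origin, expected_challenge):
    """Partial WebAuthn assertion check: type, origin, and challenge only."""
    padded = client_data_b64url + "=" * (-len(client_data_b64url) % 4)
    data = json.loads(base64.urlsafe_b64decode(padded))
    return (
        data.get("type") == "webauthn.get"
        and data.get("origin") == expected_origin
        and data.get("challenge") == expected_challenge
    )

if __name__ == "__main__":
    # The browser, not the user, fills in the origin, so a look-alike domain
    # produces client data that the real site rejects
    phished = encode_client_data({
        "type": "webauthn.get",
        "challenge": "abc123",
        "origin": "https://login.examp1e.com",
    })
    print(check_client_data(phished, "https://login.example.com", "abc123"))
```

A TOTP code has no equivalent field: the six digits are valid wherever they are typed, which is what an adversary-in-the-middle page relies on.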

Block 2: Device Posture as an Access Signal

In a Zero Trust model, the identity of the user is necessary but not sufficient. The state of the device matters too — a compromised endpoint with valid credentials is still a threat.

# Azure Conditional Access: block access from non-compliant devices
# (configured in the Entra ID Conditional Access portal)
conditions:
  clientAppTypes: [browser, mobileAppsAndDesktopClients]
  devices:
    deviceFilter:
      mode: exclude
      rule: "device.isCompliant -eq True and device.trustType -eq 'AzureAD'"
grantControls:
  builtInControls: [compliantDevice]

# AWS Verified Access: identity + device posture for application access — no VPN
aws ec2 create-verified-access-instance \
  --description "Zero Trust app access"

# Create identity trust provider (Okta OIDC)
aws ec2 create-verified-access-trust-provider \
  --trust-provider-type user \
  --user-trust-provider-type oidc \
  --policy-reference-name okta \
  --oidc-options Issuer=https://company.okta.com,ClientId=...,ClientSecret=...,Scope=openid

# Create device trust provider (Jamf, CrowdStrike, or JumpCloud)
aws ec2 create-verified-access-trust-provider \
  --trust-provider-type device \
  --device-trust-provider-type jamf \
  --policy-reference-name jamf \
  --device-options TenantId=JAMF_TENANT_ID

AWS Verified Access allows users to reach internal applications by verifying both their identity (via OIDC) and their device health (via MDM) — without a VPN. The access gateway evaluates both signals on every connection, not just at login.

Block 3: Just-in-Time Privilege Elevation

No standing elevated access. Engineers are eligible for elevated roles; they activate them when needed.

# Azure PIM: engineer activates an eligible privileged role
az rest --method POST \
  --uri "https://graph.microsoft.com/v1.0/roleManagement/directory/roleAssignmentScheduleRequests" \
  --body '{
    "action": "selfActivate",
    "principalId": "USER_OBJECT_ID",
    "roleDefinitionId": "ROLE_DEF_ID",
    "directoryScopeId": "/",
    "justification": "Investigating security alert in tenant — incident ticket INC-2026-0411",
    "scheduleInfo": {
      "startDateTime": "2026-04-11T09:00:00Z",
      "expiration": {"type": "AfterDuration", "duration": "PT4H"}
    }
  }'
# Access activates, lasts 4 hours, then is removed automatically

# AWS: temporary account assignment via Identity Center
# (typically triggered by ITSM workflow integration, not manual CLI)
aws sso-admin create-account-assignment \
  --instance-arn "arn:aws:sso:::instance/ssoins-xxx" \
  --target-id ACCOUNT_ID \
  --target-type AWS_ACCOUNT \
  --permission-set-arn "arn:aws:sso:::permissionSet/ssoins-xxx/ps-yyy" \
  --principal-type USER \
  --principal-id USER_ID

# Schedule deletion (using EventBridge + Lambda in a real deployment)
aws sso-admin delete-account-assignment \
  --instance-arn "arn:aws:sso:::instance/ssoins-xxx" \
  --target-id ACCOUNT_ID \
  --target-type AWS_ACCOUNT \
  --permission-set-arn "arn:aws:sso:::permissionSet/ssoins-xxx/ps-yyy" \
  --principal-type USER \
  --principal-id USER_ID

The operational change this requires: engineers stop thinking of access as something they hold permanently and start thinking of it as something they request for a specific purpose.

This feels like friction until you’re investigating an incident and you have a precise record of who activated what elevated access and why.
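
The activate-then-expire model can be sketched independently of any cloud provider. The `Grant` type, the `activate` function, and the 8-hour ceiling below are assumptions for illustration, not part of Azure PIM or Identity Center.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    principal: str
    role: str
    justification: str
    starts_at: datetime
    expires_at: datetime

    def is_active(self, now: datetime) -> bool:
        # Access exists only inside the approved window.
        return self.starts_at <= now < self.expires_at


def activate(principal: str, role: str, justification: str, now: datetime,
             hours: int = 4, max_hours: int = 8) -> Grant:
    """Create a time-bound grant; the returned record doubles as the audit entry."""
    if not justification.strip():
        raise ValueError("a justification is required")
    if not 0 < hours <= max_hours:
        raise ValueError("duration must be between 1 and max_hours")
    return Grant(principal, role, justification, now, now + timedelta(hours=hours))
```

Because every grant carries its own justification and expiry, the same record answers both "does this engineer have access right now?" and "why did they have it last Tuesday?".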

Block 4: Continuous Session Validation

Traditional auth: verify once at login, trust the session until timeout.
Zero Trust auth: re-evaluate access signals continuously throughout the session.

Session starts: identity verified + device compliant + IP in expected range
                → access granted

15 minutes later: impossible travel detected (IP changes to different country)
                  → step-up authentication required, or session terminated

Later: device compliance state changes (EDR detects malware)
       → all active sessions for this device revoked immediately

This requires integration between your identity platform and your device management / EDR tooling. Entra ID Conditional Access with Continuous Access Evaluation (CAE) implements this natively. When certain events occur — device compliance change, IP anomaly, token revocation — access tokens are invalidated within minutes rather than waiting for natural expiry.

# GCP: bind IAM access to an Access Context Manager access level
# The access level enforces device compliance: if the device falls out of compliance,
# the access level is no longer satisfied and requests fail immediately
# Note: IAM evaluates request.auth.access_levels only for IAP-protected resources,
# so the binding below grants access to an IAP-secured application
gcloud projects add-iam-policy-binding my-project \
  --member="user:USER_EMAIL" \
  --role="roles/iap.httpsResourceAccessor" \
  --condition="expression='accessPolicies/POLICY_NUM/accessLevels/corporate_compliant_device' in request.auth.access_levels,title=Compliant device required"
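
The session timeline earlier in this block can be modeled as a function that is re-run whenever a signal changes. This is a simplified sketch: the `Signals` fields and the three outcomes are assumptions, and a real identity platform weighs many more inputs.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    STEP_UP = "step_up"
    REVOKE = "revoke"


@dataclass(frozen=True)
class Signals:
    device_compliant: bool
    country: str
    token_revoked: bool = False


def evaluate(baseline_country: str, current: Signals) -> Action:
    """Re-evaluate an existing session against the latest signals."""
    # Hard failures end the session outright.
    if current.token_revoked or not current.device_compliant:
        return Action.REVOKE
    # A location change is suspicious but recoverable: ask for stronger proof.
    if current.country != baseline_country:
        return Action.STEP_UP
    return Action.ALLOW
```

The difference from traditional session handling is when this runs: on every signal change during the session, not once at login.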

Block 5: Micro-Segmented Permissions

Every service has its own identity. Every identity has only what it needs. Compromise of one service cannot propagate to others.

# Terraform: IAM as code — each service gets a dedicated, scoped role
resource "aws_iam_role" "order_processor" {
  name                 = "svc-order-processor"
  permissions_boundary = aws_iam_policy.service_boundary.arn

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "order_processor" {
  name   = "order-processor-policy"
  role   = aws_iam_role.order_processor.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["sqs:ReceiveMessage", "sqs:DeleteMessage", "sqs:GetQueueAttributes"]
        Resource = aws_sqs_queue.orders.arn
      },
      {
        Effect   = "Allow"
        Action   = ["dynamodb:PutItem", "dynamodb:GetItem", "dynamodb:UpdateItem"]
        Resource = aws_dynamodb_table.orders.arn
      }
    ]
  })
}

# Open Policy Agent: enforce IAM standards at the policy level
# Run this in CI/CD and fail the build if any policy statement has wildcard actions
package iam.policy

deny[msg] {
  input.Statement[i].Effect == "Allow"
  input.Statement[i].Action == "*"
  msg := sprintf("Statement %d has wildcard Action — not allowed", [i])
}

deny[msg] {
  input.Statement[i].Effect == "Allow"
  input.Statement[i].Resource == "*"
  # IAM action names look like "s3:DeleteObject", so match ":Delete" rather than a "Delete" suffix
  contains(input.Statement[i].Action, ":Delete")
  msg := sprintf("Statement %d allows Delete on all resources; a specific ARN is required", [i])
}
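
IAM policy documents allow `Action` and `Resource` to be either a single string or a list, which the equality checks in the rules above do not cover. Here is a hedged Python sketch of the same two checks with that normalization added; `find_violations` and `as_list` are hypothetical names, and this is an alternative illustration rather than a replacement for the OPA policy.

```python
def as_list(value):
    """IAM allows a bare string or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def find_violations(policy: dict) -> list:
    """Return a message for each Allow statement that breaks the two rules."""
    violations = []
    for index, stmt in enumerate(as_list(policy.get("Statement", []))):
        if stmt.get("Effect") != "Allow":
            continue
        actions = as_list(stmt.get("Action", []))
        resources = as_list(stmt.get("Resource", []))
        if "*" in actions:
            violations.append(f"Statement {index} has wildcard Action")
        # Delete-style actions must name a specific resource ARN.
        if "*" in resources and any(":Delete" in action for action in actions):
            violations.append(f"Statement {index} allows Delete on all resources")
    return violations
```

In a pipeline, a non-empty result would fail the build. Service-level wildcards such as `s3:*` are deliberately not flagged here, matching the scope of the rules above.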

Block 6: Universal Audit Trail

Zero Trust without logging is just obscurity. Every authorization decision — allow and deny — must be logged, retained, and queryable.

# AWS: verify CloudTrail is comprehensive
aws cloudtrail get-trail-status --name management-trail
# Must have: IsLogging=true

aws cloudtrail get-trail --name management-trail
# Must have: IsMultiRegionTrail=true, IncludeGlobalServiceEvents=true

# Verify no management events are excluded
aws cloudtrail get-event-selectors --trail-name management-trail \
  | jq '.EventSelectors[] | {ReadWrite: .ReadWriteType, Mgmt: .IncludeManagementEvents}'
# ReadWriteType should be "All"; IncludeManagementEvents should be true

# GCP: ensure Data Access audit logs are enabled for IAM
gcloud projects get-iam-policy my-project --format=json | jq '.auditConfigs'
# Should see auditLogConfigs for cloudresourcemanager.googleapis.com and iam.googleapis.com
# with both DATA_READ and DATA_WRITE enabled

# Azure: route Entra ID logs to Log Analytics for long-term retention and querying
# Entra ID diagnostic settings are tenant-level, so they are set through the
# microsoft.aadiam provider rather than az monitor diagnostic-settings
az rest --method PUT \
  --uri "https://management.azure.com/providers/microsoft.aadiam/diagnosticSettings/entra-audit-to-la?api-version=2017-04-01-preview" \
  --body '{"properties": {"workspaceId": "/subscriptions/SUB_ID/resourceGroups/rg-monitoring/providers/Microsoft.OperationalInsights/workspaces/security-logs", "logs": [{"category":"AuditLogs","enabled":true},{"category":"SignInLogs","enabled":true}]}}'
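
"Logged, retained, and queryable" is easiest to achieve when every decision is written as one structured record. The sketch below assumes a JSON-lines format with hypothetical field names; it is not the schema of CloudTrail or any other provider log.

```python
import json
from collections import Counter


def log_decision(principal: str, action: str, resource: str,
                 allowed: bool, reason: str, timestamp: str) -> str:
    """Serialize one authorization decision, allow or deny, as a JSON line."""
    return json.dumps({
        "timestamp": timestamp,
        "principal": principal,
        "action": action,
        "resource": resource,
        "decision": "allow" if allowed else "deny",
        "reason": reason,
    }, sort_keys=True)


def denies_by_principal(lines) -> dict:
    """Example query: count denied requests per principal."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        if record["decision"] == "deny":
            counts[record["principal"]] += 1
    return dict(counts)
```

Recording denies alongside allows is what makes the second function possible: a principal that repeatedly hits deny is often the earliest visible sign of probing.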

Framework Alignment

Zero Trust IAM isn’t a framework itself — it’s a design philosophy. But it maps cleanly onto the controls that compliance frameworks are pushing organizations toward:

Framework | Reference | What It Covers Here
CISSP | Domain 5: IAM | Zero Trust reframes IAM as continuous, context-aware verification rather than perimeter-based trust
CISSP | Domain 1: Security & Risk Management | Assume breach as a risk management posture; blast radius minimization through least privilege
CISSP | Domain 7: Security Operations | Continuous monitoring, anomaly detection, and automated response are operational requirements of Zero Trust
ISO 27001:2022 | 5.15 Access control | Zero Trust access policy: verify explicitly, least privilege, assume breach
ISO 27001:2022 | 8.16 Monitoring activities | Continuous session validation and universal audit trail; all authorization decisions logged
ISO 27001:2022 | 8.20 Networks security | Micro-segmentation and mTLS replace implicit network trust with verified identity at every hop
ISO 27001:2022 | 5.23 Information security for cloud services | Zero Trust architecture applied to cloud IAM across AWS, GCP, and Azure
SOC 2 | CC6.1 | Zero Trust logical access controls: JIT, device posture, context-aware authorization
SOC 2 | CC6.7 | Continuous session validation and transmission controls across all system components
SOC 2 | CC7.1 | Threat detection through universal audit trails and anomaly-triggered automated response
SOC 2 | CC7.2 | Incident response: automated revocation and session termination on anomaly detection

Zero Trust Maturity — Where to Start

In practice, most organizations think about Zero Trust as a destination — a large, multi-year program. The reality is it’s a direction. Any movement in that direction reduces risk.

Level | Where You Are | What to Build Next
1: Initial | Some MFA; static credentials for machines; no centralized IdP | Eliminate machine static keys → workload identity
2: Managed | Centralized IdP; SSO for most systems; some MFA enforcement | Close SSO gaps; enforce MFA everywhere; federate to cloud
3: Defined | Least privilege being enforced; audit tooling in use; JIT for some privileged access | Expand JIT; policy-as-code in CI/CD; quarterly access reviews
4: Contextual | Device posture in access decisions; conditional access policies | Continuous session evaluation; automated anomaly response
5: Optimizing | Policy-as-code everywhere; automated right-sizing; anomaly-triggered revocation | Refine and maintain; Zero Trust is never “done”

The jump from Level 1 to Level 3 delivers the most security value per unit of effort. Start there. Don’t defer least privilege enforcement while you build a sophisticated device posture integration.


The Practical Sequence

If you’re building Zero Trust IAM from where most organizations are, this is the order that maximizes early security value:

  1. Inventory all identities — human and machine. You cannot secure what you can’t see. Build a complete picture before changing anything.

  2. Eliminate static credentials for machines — replace access keys and SA key files with workload identity. This is the highest-ROI change in most environments.

  3. Enforce MFA for all human access — especially cloud consoles, IdP admin, and VPN. Hardware keys for privileged accounts.

  4. Federate human identity — single IdP, SSO to cloud and major applications. Centralize the revocation path.

  5. Right-size IAM permissions — use last-accessed data and IAM Recommender to find and remove unused permissions. This is a continuous discipline, not a one-time clean-up.

  6. JIT for privileged access — Azure PIM, AWS Identity Center assignment automation, or equivalent for all elevated roles. No standing admin.

  7. IAM as code — all IAM changes via Terraform/Pulumi/CDK, reviewed in pull requests, validated by Access Analyzer or OPA in CI/CD, applied through automation.

  8. Continuous monitoring — alerts on IAM mutations, anomalous API call patterns, new cross-account trust relationships, new public resource exposures.

  9. Add context signals — Conditional Access policies incorporating device posture. Access Context Manager in GCP. AWS Verified Access for application access.

  10. Automated response — anomaly detected → automatic credential suspension or session termination. Close the window between detection and containment.
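
Step 8 above can start small. The sketch below filters CloudTrail-style records for a handful of IAM mutations worth alerting on. The field names (`eventSource`, `eventName`, `userIdentity.arn`) follow the CloudTrail record format, while the watch list and the alert wording are assumptions to tune per environment.

```python
# A starting watch list; extend it for your own environment.
WATCHED_EVENTS = {
    "AttachRolePolicy",
    "AttachUserPolicy",
    "CreateAccessKey",
    "PutRolePolicy",
    "PutUserPolicy",
    "UpdateAssumeRolePolicy",
}


def iam_mutation_alerts(events) -> list:
    """Return one alert string per watched IAM mutation in the event stream."""
    alerts = []
    for event in events:
        if event.get("eventSource") != "iam.amazonaws.com":
            continue
        if event.get("eventName") not in WATCHED_EVENTS:
            continue
        actor = event.get("userIdentity", {}).get("arn", "unknown")
        alerts.append(f"{event['eventName']} by {actor}")
    return alerts
```

In practice the same filter is usually expressed as an EventBridge rule or a SIEM query. The value of writing it out is agreeing on which event names belong on the list.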


Series Complete

This series covered Cloud IAM from the question “what even is IAM?” to Zero Trust architecture:

Episode | Topic | The Core Lesson
EP01 | What is IAM? | Access management is deny-by-default; every grant is an explicit decision
EP02 | AuthN vs AuthZ | Two separate gates; passing one doesn’t open the other
EP03 | Roles, Policies, Permissions | Structure prevents drift; wildcards accumulate into exposure
EP04 | AWS IAM Deep Dive | Trust policies and permission policies are both required; the evaluation chain has six layers
EP05 | GCP IAM Deep Dive | Hierarchy inheritance is a feature that needs careful handling; service account keys are an antipattern
EP06 | Azure RBAC and Entra ID | Two separate authorization planes; managed identities are the right model for workloads
EP07 | Workload Identity | Static credentials for machines are solvable at the root; OIDC token exchange replaces them
EP08 | IAM Attack Paths | The attack chain runs through IAM; iam:PassRole and its equivalents are privilege escalation primitives
EP09 | Least Privilege Auditing | 5% utilization is the average; the 95% excess is attack surface, and it’s measurable
EP10 | Federation, OIDC, SAML | The IdP is the trust anchor; everything downstream is bounded by its security
EP11 | Kubernetes RBAC | Two separate IAM layers; both must be secured; cluster-admin is the first thing to audit
EP12 | Zero Trust IAM | Trust nothing implicitly; verify everything explicitly; minimize blast radius through least privilege at every layer

IAM is not a feature you configure. It’s a practice you maintain. The organizations that operate with genuinely low cloud IAM risk don’t have fewer identities — they have better visibility into what those identities can do, and why, and what happened when something went wrong.

That’s what this series has been building toward.


The full series is at linuxcent.com/cloud-iam-series. If you found it useful, the best thing you can do is subscribe — the next series covers eBPF: what’s actually running in kernel space when Cilium, Falco, and Tetragon are doing their work.

Subscribe → linuxcent.com/subscribe