IAM Roles vs Policies: How Cloud Authorization Actually Works

Reading Time: 12 minutes




TL;DR

  • Every cloud permission is atomic: one action (s3:GetObject) on one resource class — the indivisible unit of access
  • Policies group permissions into documents with conditions; roles carry policies and are assigned to identities
  • Never attach policies directly to users — roles are the indirection layer that makes access auditable and revocable
  • AWS roles have two required configs: trust policy (who can assume) + permission policy (what they can do) — both must be right
  • GCP binds roles to resources; AWS attaches policies to identities — the mental models run in opposite directions
  • iam:PassRole in AWS and iam.serviceAccounts.actAs in GCP are privilege escalation vectors: always scope them to specific role ARNs (AWS) or specific service accounts (GCP), never *

The Big Picture

Three primitives underlie every cloud IAM system. Learn how they connect and any cloud access model becomes readable.

  THE THREE-LAYER STACK
  Build bottom-up. Assign top-down. Change one layer without touching the others.

  ┌──────────────────────────────────────────────────────────────────────┐
  │  LAYER 3 — IDENTITY                                                  │
  │  alice@example.com  ·  backend-service  ·  ci-runner@proj           │
  │  "who is acting — a human, a service, or a machine"                 │
  ├──────────────────────────────────────────────────────────────────────┤
  │  LAYER 2 — ROLE                                                      │
  │  BackendDeveloper  ·  DataAnalyst  ·  DeployBot  ·  S3ReadOnly      │
  │  "what function does this identity serve — the job title"           │
  ├──────────────────────────────────────────────────────────────────────┤
  │  LAYER 1 — POLICY                                                    │
  │  AllowS3Read  ·  AllowECRPush  ·  DenyProdDelete  ·  RequireMFA    │
  │  "what is explicitly permitted or denied, under what conditions"    │
  ├──────────────────────────────────────────────────────────────────────┤
  │  LAYER 0 — PERMISSION                                                │
  │  s3:GetObject  ·  ecr:PutImage  ·  s3:DeleteObject  ·  iam:PassRole│
  │  "one verb on one class of resource — the atom of access control"  │
  └──────────────────────────────────────────────────────────────────────┘

  When alice joins the backend team → assign her the BackendDeveloper role
  When the S3 bucket changes → update the policy once; alice gets it automatically
  When alice leaves → remove the role assignment; policy and permissions are untouched

If this maps better to something physical:

  PHYSICAL WORLD            →    CLOUD IAM

  A specific door rule      →    Permission      s3:GetObject
  Keycard access profile    →    Policy          AllowS3Read
  Job title                 →    Role            BackendDeveloper
  The employee              →    Identity        alice@example.com

  When the employee leaves: revoke the role assignment.
  The job title, the keycard profile, the door rules — all unchanged.
  Next hire gets the same role. Same access. No manual work.

Introduction

The distinction between IAM roles and policies defines how cloud authorization actually works, and getting it wrong is how access sprawl starts. Every failure at the authorization layer traces back to how these three primitives are, or aren’t, structured.

Every cloud IAM system — AWS, GCP, Azure — is built on the same three primitives: permissions, policies, and roles. Learn these well and any cloud provider becomes readable. Skip them and you spend years pattern-matching without understanding why anything is structured the way it is.

What Is Cloud IAM established the foundation: IAM is the system that governs who can access what in cloud infrastructure, and its default answer is always deny. Authentication vs Authorization: AWS AccessDenied Explained drew the line between authentication — proving identity — and authorization — proving you’re allowed to act. This episode is about the authorization layer specifically. These three building blocks are how authorization is expressed in practice.

Before walking through each one, here’s what access control looks like without any of this structure — because that’s the fastest way to understand why the layers exist.

In 2015 I inherited an AWS account from a 12-engineer team that had been building for 18 months. When I ran aws iam list-attached-user-policies across the 23 users, 17 had policies attached directly to the user object — not to groups, not to roles.

One engineer had left six months earlier. His access key was still active. Three policies still attached: read access to prod S3, write to a DynamoDB table, ability to invoke Lambda functions. When I asked what the DynamoDB table was for, nobody could tell me. The Lambda functions no longer existed.

That account wasn’t built by negligent engineers. It was built by engineers reaching for whatever granted access fastest, under deadline, without a framework. Permissions scattered. Nothing tracked. Nothing removed.

Roles, policies, and permissions are the framework that prevents that. Understanding them is the difference between an IAM configuration you can audit in an afternoon and one that takes a week and still leaves you uncertain.


What Are IAM Permissions? The Atomic Unit of Access Control

A permission is a single action on a class of resources. It is the most granular thing you can grant or deny — the atom of access control.

Cloud providers express permissions differently, but the structure is consistent: a service, a resource type, and an action verb.

# AWS: service:Action
s3:GetObject               # read an object from S3
ec2:StartInstances         # start EC2 instances
iam:PassRole               # assign a role to an AWS service — one of the most dangerous
kms:Decrypt                # use a KMS key to decrypt

# GCP: service.resource.verb
storage.objects.get
compute.instances.start
iam.serviceAccounts.actAs  # impersonate a service account — equivalent risk to iam:PassRole
cloudkms.cryptoKeyVersions.useToDecrypt

# Azure: Provider/ResourceType/Action
Microsoft.Storage/storageAccounts/blobServices/containers/read
Microsoft.Compute/virtualMachines/start/action
Microsoft.Authorization/roleAssignments/write   # grant roles — highest risk
Microsoft.KeyVault/vaults/secrets/getSecret/action

You generally don’t assign individual permissions directly to identities — that’s like handing someone 47 keys with no labels and expecting the system to remain auditable. Permissions are grouped into policies.
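
To make the three naming schemes concrete, here is a minimal Python sketch that splits a permission string into its parts. The Permission tuple and parse_permission function are hypothetical helpers written for this illustration, not part of any cloud SDK, and the detection rules (a "Microsoft." prefix for Azure, a colon for AWS, dots otherwise) are a simplifying assumption.

```python
from typing import NamedTuple


class Permission(NamedTuple):
    cloud: str
    service: str
    action: str


def parse_permission(raw: str) -> Permission:
    """Split a permission string into cloud, service, and action parts."""
    if raw.startswith("Microsoft."):
        # Azure: Provider/ResourceType/.../Action
        provider, _, rest = raw.partition("/")
        return Permission("azure", provider, rest)
    if ":" in raw:
        # AWS: service:Action
        service, _, action = raw.partition(":")
        return Permission("aws", service, action)
    # GCP: service.resource.verb
    service, _, rest = raw.partition(".")
    return Permission("gcp", service, rest)


for raw in ("s3:GetObject", "storage.objects.get",
            "Microsoft.Compute/virtualMachines/start/action"):
    print(parse_permission(raw))
```

Whatever the spelling, each string carries the same information: which service, and which verb on which kind of resource.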


What Are IAM Policies? Grouping Permissions with Conditions

A policy is a document that groups permissions and defines the conditions under which they apply.

AWS policy structure

An AWS policy document is JSON. Every field is a deliberate decision:

{
  "Version": "2012-10-17",
  "Statement": [
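    {
      "Sid": "AllowListS3BackupsByPrefix",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::company-backups",
      "Condition": {
        "StringLike": { "s3:prefix": ["2024/*", "2025/*"] }
      }
    },
    {
      "Sid": "AllowReadS3Backups",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": [
        "arn:aws:s3:::company-backups/2024/*",
        "arn:aws:s3:::company-backups/2025/*"
      ]
    },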
    {
      "Sid": "DenyDeleteEverywhere",
      "Effect": "Deny",
      "Action": "s3:DeleteObject",
      "Resource": "*"
    }
  ]
}

The Sid is a comment — use it. AllowReadS3Backups tells a future auditor why this statement exists. Statement1 is technical debt.

The Effect is either Allow or Deny. A Deny always wins — it cannot be overridden by any Allow anywhere in any policy on the same identity. If you have a Deny on s3:DeleteObject with "Resource": "*", nothing can grant delete access to that identity. This asymmetry is deliberate: it’s how guardrails work.

The Resource field is where access most often creeps wider than intended. "Resource": "*" on a write action means “every resource of this type in the account.” It works. It outlives the context that made it feel reasonable.
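
The "Deny always wins" rule is easier to see in code. The sketch below is a deliberately simplified model, not the real AWS evaluation engine: it ignores conditions, principals, and the other policy types, and it matches wildcards with Python's fnmatchcase (AWS matches action names case-insensitively, which this does not). The evaluate function and the sample statements are hypothetical.

```python
from fnmatch import fnmatchcase


def evaluate(statements, action, resource):
    """Return "Allow" or "Deny" for one request against a list of statements."""
    def as_list(value):
        return [value] if isinstance(value, str) else list(value)

    def matches(stmt):
        return (any(fnmatchcase(action, a) for a in as_list(stmt["Action"]))
                and any(fnmatchcase(resource, r) for r in as_list(stmt["Resource"])))

    matching = [s for s in statements if matches(s)]
    if any(s["Effect"] == "Deny" for s in matching):
        return "Deny"    # an explicit Deny beats every matching Allow
    if any(s["Effect"] == "Allow" for s in matching):
        return "Allow"
    return "Deny"        # implicit deny: no statement matched


statements = [
    {"Effect": "Allow", "Action": ["s3:GetObject", "s3:ListBucket"],
     "Resource": ["arn:aws:s3:::company-backups", "arn:aws:s3:::company-backups/*"]},
    {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
    {"Effect": "Deny", "Action": "s3:DeleteObject", "Resource": "*"},
]

obj = "arn:aws:s3:::company-backups/2024/db.sql"
print(evaluate(statements, "s3:GetObject", obj))
print(evaluate(statements, "s3:DeleteObject", obj))
```

Note that the broad s3:* Allow in the second statement does not rescue the delete: the Deny is checked first, which is what makes it usable as a guardrail.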

AWS policy types — which to reach for

┌──────────────────────────┬────────────────────────────┬──────────────────────────────┐
│ Type                     │ Attached to                │ What it does                 │
├──────────────────────────┼────────────────────────────┼──────────────────────────────┤
│ Identity-based           │ User, Group, Role          │ What the identity can do     │
│ Resource-based           │ S3 bucket, KMS key, Lambda │ Who can touch this resource  │
│ Permissions boundary     │ User or Role               │ Maximum possible (a ceiling) │
│ Service Control Policy   │ Org root, OU, or account   │ Org-level guardrail          │
│ Session policy           │ AssumeRole session         │ Restricts a specific session │
│ Resource Control Policy  │ Org root, OU, or account   │ Resource-level org guardrail │
└──────────────────────────┴────────────────────────────┴──────────────────────────────┘

Critical: Permissions boundaries and SCPs do not grant permissions. They constrain them. A boundary that allows s3:* doesn’t mean the identity has S3 access. It means the identity can have at most S3 access, if an identity-based policy actually grants it. Many engineers set a boundary and expect it to work as a grant. It doesn’t.
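
One way to internalize "ceilings do not grant" is to treat it as set intersection. This is a minimal sketch under a strong assumption: permissions are modelled as plain sets of concrete action names, with no wildcards, conditions, or resource-based policies. The effective_permissions function is hypothetical.

```python
def effective_permissions(identity_allows, boundary_allows=None, scp_allows=None):
    """Intersect what is granted with each ceiling; a ceiling never adds anything."""
    effective = set(identity_allows)
    for ceiling in (boundary_allows, scp_allows):
        if ceiling is not None:   # None means no ceiling of this type is attached
            effective &= set(ceiling)
    return effective


boundary = {"s3:GetObject", "s3:PutObject"}

# A boundary with no identity-based grant yields nothing
print(effective_permissions(set(), boundary))

# A grant outside the boundary is cut off; a grant inside it survives
print(effective_permissions({"s3:GetObject", "ec2:StartInstances"}, boundary))
```

The first call returns an empty set, which is exactly the surprise described above: the boundary alone gives the identity no access at all.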

GCP policy bindings

GCP doesn’t attach policy documents to identities. Each resource has an IAM policy — a set of bindings mapping roles to members:

{
  "bindings": [
    {
      "role": "roles/storage.objectViewer",
      "members": [
        "user:alice@example.com",
        "serviceAccount:SERVICE_ACCOUNT_EMAIL"
      ]
    },
    {
      "role": "roles/storage.objectCreator",
      "members": ["serviceAccount:SERVICE_ACCOUNT_EMAIL"],
      "condition": {
        "title": "Business hours only",
        "expression": "request.time.getHours('America/New_York') >= 9 && request.time.getHours('America/New_York') < 18"
      }
    }
  ],
  "version": 3
}

The mental model shift: in AWS you ask “what can this identity do?” by looking at the identity. In GCP you ask “who can access this resource?” by looking at the resource. The question runs in the opposite direction.
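
The direction of the question shows up directly in how you query a binding list. The sketch below uses a hand-written policy dictionary in the same shape as the JSON above; the member names are made-up placeholders, and both helper functions are hypothetical rather than part of any Google client library.

```python
def members_with_role(policy, role):
    """Resource-side question: which members hold this role on the resource?"""
    return sorted({member
                   for binding in policy["bindings"] if binding["role"] == role
                   for member in binding["members"]})


def roles_for_member(policy, member):
    """Identity-side question, answered by scanning this one resource's bindings."""
    return sorted({binding["role"]
                   for binding in policy["bindings"] if member in binding["members"]})


APP_SA = "serviceAccount:app@my-project.iam.gserviceaccount.com"  # placeholder name

bucket_policy = {
    "bindings": [
        {"role": "roles/storage.objectViewer",
         "members": ["user:alice@example.com", APP_SA]},
        {"role": "roles/storage.objectCreator",
         "members": [APP_SA]},
    ]
}

print(members_with_role(bucket_policy, "roles/storage.objectViewer"))
print(roles_for_member(bucket_policy, APP_SA))
```

The second function only covers one resource. To answer "what can this identity do?" across a GCP organization you would have to repeat that scan for every resource that carries a policy, which is why the resource-first model feels inverted to people coming from AWS.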

Azure role definitions

Azure separates what a role grants (role definition) from who gets it where (role assignment). Define once, assign at multiple scopes.

{
  "Name": "Custom Storage Reader",
  "IsCustom": true,
  "Actions": [
    "Microsoft.Storage/storageAccounts/blobServices/containers/read",
    "Microsoft.Storage/storageAccounts/blobServices/generateUserDelegationKey/action"
  ],
  "DataActions": [
    "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"
  ],
  "AssignableScopes": ["/subscriptions/SUB_ID"]
}

Actions vs DataActions catches people. Actions are control plane — you can see the storage account exists. DataActions are data plane — you can read actual blob contents. A user with Actions can list the container but cannot read a single byte without a DataAction. Both planes must be covered for the access to be complete.
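
A small sketch of the two-plane check, assuming a role definition is just a dictionary with Actions and DataActions lists. It ignores NotActions and NotDataActions and uses simple wildcard matching, so treat the allows function as an illustration of the idea rather than a model of Azure's actual evaluation.

```python
from fnmatch import fnmatchcase


def allows(role_definition, operation, plane):
    """Check one operation against Actions (control plane) or DataActions (data plane)."""
    key = "DataActions" if plane == "data" else "Actions"
    return any(fnmatchcase(operation.lower(), pattern.lower())
               for pattern in role_definition.get(key, []))


CONTAINER_READ = "Microsoft.Storage/storageAccounts/blobServices/containers/read"
BLOB_READ = "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"

control_only = {"Actions": [CONTAINER_READ], "DataActions": []}
control_and_data = {"Actions": [CONTAINER_READ], "DataActions": [BLOB_READ]}

print(allows(control_only, CONTAINER_READ, "control"))   # can see the container
print(allows(control_only, BLOB_READ, "data"))           # cannot read blob contents
print(allows(control_and_data, BLOB_READ, "data"))       # both planes covered
```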


What Are IAM Roles? The Layer That Scales Access Control

A role is a collection of policies assigned to identities. It’s the indirection layer that makes access manageable at scale.

Going back to the 2015 account: the problem wasn’t that engineers had access; they needed it. The problem was that access was scattered across 23 individual user objects with no shared structure. The earlier episode, What Is Cloud IAM, establishes this as the core problem IAM exists to solve. Roles are the structural answer.

The role model solves this:

Policy: S3ReadAccess (s3:GetObject on s3:::app-data/*, s3:ListBucket on s3:::app-data)
  ↓ attached to
Role: BackendDeveloper
  ↓ assigned to
Users: alice, bob, charlie, dave (and six more)

When the bucket changes  → update one policy
When someone joins       → assign one role
When someone leaves      → remove one role
Access model stays coherent because it's structured.
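
The same indirection can be sketched as three lookup tables. This is a toy in-memory model, not how any cloud stores IAM data; the dictionaries and the permissions_for function are hypothetical, and the names reuse the example above.

```python
policies = {"S3ReadAccess": {"s3:GetObject", "s3:ListBucket"}}
roles = {"BackendDeveloper": ["S3ReadAccess"]}
assignments = {"alice": ["BackendDeveloper"], "bob": ["BackendDeveloper"]}


def permissions_for(user):
    """Resolve user -> roles -> policies -> permissions."""
    return {perm
            for role in assignments.get(user, [])
            for policy in roles[role]
            for perm in policies[policy]}


before = permissions_for("bob")

# The bucket's requirements change: edit one policy, every role holder follows
policies["S3ReadAccess"].add("s3:GetObjectVersion")
after = permissions_for("bob")

# alice leaves: drop one assignment, the role and policy are untouched
assignments.pop("alice")

print(sorted(after - before))
print(permissions_for("alice"))
```

Each change touches exactly one table. That is the property that was missing in the account with policies attached to 17 individual users.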

AWS roles — the identity that issues temporary credentials

AWS roles are themselves IAM identities, not just permission containers. When something assumes a role, it gets temporary credentials from STS. Two things must be configured:

Trust policy — who can assume:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "ec2.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}

Without this, nobody can use the role regardless of its permissions. The trust policy is the gatekeeper.

Permission policy — what it can do:

aws iam create-role \
  --role-name AppServerRole \
  --assume-role-policy-document file://ec2-trust-policy.json

aws iam attach-role-policy \
  --role-name AppServerRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

When debugging “why can’t this Lambda/EC2/ECS task do X?”, the first thing I check is the trust policy. Many times the permission policy is correct — the service simply isn’t in the trust policy and cannot assume the role at all.
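
That debugging order can be written down as a short check. The sketch assumes the trust policy's Principal is a dictionary (as in the example above) and reduces the permission-policy side to a single boolean; can_assume and diagnose are hypothetical helpers, not AWS APIs.

```python
def can_assume(trust_policy, principal_type, principal_value):
    """True if an Allow statement names this principal for sts:AssumeRole."""
    for stmt in trust_policy["Statement"]:
        actions = stmt["Action"]
        actions = [actions] if isinstance(actions, str) else actions
        allowed = stmt["Principal"].get(principal_type, [])
        allowed = [allowed] if isinstance(allowed, str) else allowed
        if (stmt["Effect"] == "Allow" and "sts:AssumeRole" in actions
                and principal_value in allowed):
            return True
    return False


def diagnose(trust_policy, service, permission_allows_action):
    """Check the trust policy first, then the permission policy."""
    if not can_assume(trust_policy, "Service", service):
        return "trust policy: service cannot assume the role"
    if not permission_allows_action:
        return "permission policy: action not allowed"
    return "ok"


ec2_trust = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# A Lambda function handed a role that only trusts EC2
print(diagnose(ec2_trust, "lambda.amazonaws.com", permission_allows_action=True))
```

The Lambda case fails at the first check even though the permission policy is fine, which is the failure mode described above.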

GCP role types

┌──────────────────┬──────────────────────────────┬──────────────────────────────────┐
│ Type             │ Example                      │ When to use                      │
├──────────────────┼──────────────────────────────┼──────────────────────────────────┤
│ Basic/Primitive  │ roles/editor, roles/owner    │ Never in production              │
│ Predefined       │ roles/storage.objectViewer   │ Default — service-specific       │
│ Custom           │ Your org defines             │ When predefined is too broad     │
└──────────────────┴──────────────────────────────┴──────────────────────────────────┘

roles/editor at the project level grants write access to almost every GCP service. I’ve seen it granted “temporarily” and found it attached six months later. Default to predefined roles, and create a custom role when even the predefined one is too broad.

# Find the right predefined role
gcloud iam roles list --filter="name:roles/storage" --format="table(name,title)"

# See exactly what permissions it includes
gcloud iam roles describe roles/storage.objectViewer

# Create a custom role when predefined is still too broad
cat > custom-log-reader.yaml << 'EOF'
title: "Log Reader"
description: "Read application logs — nothing else"
stage: "GA"
includedPermissions:
  - logging.logEntries.list
  - logging.logs.list
  - logging.logMetrics.get
EOF
gcloud iam roles create LogReader --project=my-project --file=custom-log-reader.yaml

Azure built-in and custom roles

# List built-in roles containing "Storage"
az role definition list --output table | grep Storage

# View what a built-in role grants
az role definition list --name "Storage Blob Data Reader"

# Create a custom role
az role definition create --role-definition custom-app-storage.json

# Assign at a specific scope
az role assignment create \
  --assignee alice@example.com \
  --role "Storage Blob Data Reader" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod/providers/\
Microsoft.Storage/storageAccounts/prodstore

RBAC vs ABAC: Which Access Control Model to Use

RBAC — Role-Based Access Control

The dominant model. Access flows from role membership:

alice     ∈ BackendDeveloper  →  s3:GetObject on app-data/*
bob       ∈ DataAnalyst       →  athena:* on analytics-queries
ci-runner ∈ DeployRole        →  ecr:PutImage, ecs:UpdateService

RBAC degrades two ways: role explosion (200 roles, nobody can explain what they all do) and coarse roles (avoid explosion by making roles broad, now BackendDeveloper has prod access with no distinction from dev). Both look the same on a spreadsheet — lots of access, no clear principle.

ABAC — Attribute-Based Access Control

ABAC grants access based on attributes of the principal, resource, or environment — not role membership. This one policy replaced 12 team-specific policies in one account:

{
  "Effect": "Allow",
  "Action": "ec2:*",
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "aws:ResourceTag/Team": "${aws:PrincipalTag/Team}"
    }
  }
}

An engineer tagged Team=Platform can only act on EC2 resources tagged Team=Platform. Add a new team — tag their resources and their identity. No new policy. No new role.

The risk is tag drift. If someone tags a resource incorrectly, the access model breaks silently. In practice, I use ABAC for environment and team scoping, and explicit policies for sensitive services like KMS and IAM. How these primitives combine in a full AWS account is covered in the AWS IAM deep dive.
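
The tag comparison at the heart of that policy can be sketched in a few lines. This models only the StringEquals check between the principal's Team tag and the resource's Team tag; the abac_allows function is hypothetical and leaves out everything else AWS evaluates.

```python
def abac_allows(principal_tags, resource_tags, key="Team"):
    """Mirror of StringEquals aws:ResourceTag/Team == ${aws:PrincipalTag/Team}."""
    principal_value = principal_tags.get(key)
    resource_value = resource_tags.get(key)
    # A missing tag on either side never matches; the comparison is case-sensitive
    return principal_value is not None and principal_value == resource_value


engineer = {"Team": "Platform"}

print(abac_allows(engineer, {"Team": "Platform"}))   # same team
print(abac_allows(engineer, {"Team": "Data"}))       # different team
print(abac_allows(engineer, {"Team": "platform"}))   # tag drift: wrong case
print(abac_allows(engineer, {}))                     # tag drift: untagged resource
```

The last two cases are what tag drift looks like in practice: nothing errors, the comparison simply stops matching. The reverse mistake, a resource tagged with the wrong team's name, silently grants access instead.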

Conditions — when context determines access

// Require MFA for any IAM or Organizations action
{
  "Effect": "Deny",
  "Action": ["iam:*", "organizations:*"],
  "Resource": "*",
  "Condition": { "BoolIfExists": { "aws:MultiFactorAuthPresent": "false" } }
}

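// Restrict to corporate egress IPs. aws:SourceIp sees the caller's public address,
// so private ranges such as 10.0.0.0/8 never match. The CIDRs below are placeholders.
{
  "Effect": "Deny",
  "Action": "*",
  "Resource": "*",
  "Condition": {
    "NotIpAddress": { "aws:SourceIp": ["203.0.113.0/24", "198.51.100.0/24"] }
  }
}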

The MFA condition is in every account I manage. A compromised API key without an MFA session can’t escalate IAM privileges — the Deny blocks it at the condition level. This single statement meaningfully reduces the blast radius of a credential compromise.
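
The reason the statement uses BoolIfExists rather than Bool is the missing-key case: long-lived access keys produce requests where aws:MultiFactorAuthPresent is absent altogether. A minimal sketch of that semantics, assuming the request context is a plain dictionary of string values (the mfa_deny_applies function is hypothetical):

```python
def mfa_deny_applies(request_context):
    """BoolIfExists aws:MultiFactorAuthPresent == "false"."""
    value = request_context.get("aws:MultiFactorAuthPresent")
    if value is None:
        return True          # IfExists: a missing key makes the condition true
    return value == "false"


print(mfa_deny_applies({}))                                        # raw access key
print(mfa_deny_applies({"aws:MultiFactorAuthPresent": "false"}))   # session without MFA
print(mfa_deny_applies({"aws:MultiFactorAuthPresent": "true"}))    # MFA session
```

Only the third request escapes the Deny. With a plain Bool operator the first case would not match, and a leaked access key would slip past the guardrail.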


⚠ Production Gotchas

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 1 — Policies attached directly to users                 ║
║                                                                      ║
║  Feels fast. Creates the exact problem from 2015: access scattered  ║
║  across individual user objects with no shared structure.            ║
║  When the user leaves, their policies don't follow — they stay.     ║
║                                                                      ║
║  Fix: always use roles. Attach policies to roles. Assign roles to   ║
║  users. The role outlives the person.                               ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 2 — Using AWS managed policies in production            ║
║                                                                      ║
║  AmazonS3FullAccess grants s3:* on *. For a Lambda that reads one  ║
║  specific bucket, that's 100+ actions you didn't need, all live.    ║
║                                                                      ║
║  Fix: create customer managed policies scoped to the specific       ║
║  actions and ARNs the workload actually uses.                       ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 3 — iam:PassRole with "Resource": "*"                   ║
║                                                                      ║
║  iam:PassRole lets an identity assign a role to an AWS service.     ║
║  With Resource: *, it can pass ANY role — including ones with more  ║
║  permissions than it currently has. That is a privilege escalation. ║
║                                                                      ║
║  Fix: always scope iam:PassRole to a specific role ARN:             ║
║  "Resource": "arn:aws:iam::ACCOUNT:role/SpecificRoleName"          ║
╚══════════════════════════════════════════════════════════════════════╝
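
Gotcha 3 is easy to lint for. The sketch below scans a policy document for Allow statements that grant iam:PassRole (directly or through a wildcard) on "Resource": "*". It is a rough check under simple assumptions: it ignores NotAction, NotResource, and conditions, and find_unscoped_passrole and the sample policy are hypothetical.

```python
from fnmatch import fnmatchcase


def as_list(value):
    return [value] if isinstance(value, str) else list(value)


def find_unscoped_passrole(policy):
    """Return the Sid (or index) of each Allow statement passing any role."""
    findings = []
    for index, stmt in enumerate(policy["Statement"]):
        actions = as_list(stmt.get("Action", []))
        resources = as_list(stmt.get("Resource", []))
        # Matches iam:PassRole itself plus patterns like iam:Pass*, iam:*, and *
        grants_passrole = any(fnmatchcase("iam:passrole", a.lower()) for a in actions)
        if stmt.get("Effect") == "Allow" and grants_passrole and "*" in resources:
            findings.append(stmt.get("Sid", f"Statement[{index}]"))
    return findings


policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "PassAppRoleOnly", "Effect": "Allow", "Action": "iam:PassRole",
         "Resource": "arn:aws:iam::123456789012:role/AppServerRole"},
        {"Sid": "PassAnyRole", "Effect": "Allow",
         "Action": ["iam:PassRole", "ec2:RunInstances"], "Resource": "*"},
    ],
}

print(find_unscoped_passrole(policy))
```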

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 4 — Permissions boundary ≠ policy grant                 ║
║                                                                      ║
║  Setting a boundary that allows s3:* does NOT grant S3 access.     ║
║  The boundary is a ceiling — it limits maximum possible permissions. ║
║  The identity-based policy still needs to explicitly Allow the      ║
║  action. Both must be present for the access to work.               ║
╚══════════════════════════════════════════════════════════════════════╝

Cross-Cloud Rosetta Stone

Same concepts, different names and different directions. Bookmark this table.

┌─────────────────────────┬──────────────────────────┬──────────────────────────┬──────────────────────────┐
│ Concept                 │ AWS                      │ GCP                      │ Azure                    │
├─────────────────────────┼──────────────────────────┼──────────────────────────┼──────────────────────────┤
│ Atomic permission       │ s3:GetObject             │ storage.objects.get      │ .../blobs/read           │
│ Permission document     │ Policy (JSON)            │ (built into role def)    │ Role Definition          │
│ Access grant            │ Policy attachment        │ IAM Binding              │ Role Assignment          │
│ Job-function identity   │ IAM Role                 │ Predefined Role          │ Built-in Role            │
│ Non-human identity      │ IAM Role (assumed)       │ Service Account          │ Managed Identity         │
│ Org-level guardrail     │ SCP                      │ Org Policy               │ Management Group Policy  │
│ Permission ceiling      │ Permissions Boundary     │ —                        │ —                        │
│ Session restriction     │ Session Policy           │ —                        │ —                        │
│ Attribute-based grant   │ Tag conditions in policy │ IAM Conditions           │ Conditions in assignment │
└─────────────────────────┴──────────────────────────┴──────────────────────────┴──────────────────────────┘

Quick Reference

┌──────────────────────────┬────────────────────────────────────────────────────────────┐
│ Term                     │ What it is                                                 │
├──────────────────────────┼────────────────────────────────────────────────────────────┤
│ Permission               │ Atomic: one action on one resource class                   │
│ Policy                   │ Document grouping permissions + conditions                 │
│ Role (AWS)               │ Assumable identity — carries policies, issues temp creds   │
│ Trust policy (AWS)       │ Who can assume this role — separate from permissions       │
│ Permissions boundary     │ Ceiling — limits max possible permissions; does not grant  │
│ SCP                      │ Org guardrail — constrains all identities in scope         │
│ IAM Binding (GCP)        │ Maps a role to a member on a specific resource             │
│ Role Assignment (Azure)  │ Grants a role definition at a specific scope               │
│ ABAC                     │ Access by tag/attribute — one policy replaces many roles   │
│ RBAC                     │ Access by role membership — clean until roles proliferate  │
│ iam:PassRole             │ Privilege escalation vector — always scope to specific ARN │
└──────────────────────────┴────────────────────────────────────────────────────────────┘

Commands to know:
┌────────────────────────────────────────────────────────────────────────────────┐
│  # AWS — list policies attached to a role                                     │
│  aws iam list-attached-role-policies --role-name MyRole                       │
│                                                                                │
│  # AWS — view what a managed policy actually grants                           │
│  aws iam get-policy-version \                                                  │
│    --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \              │
│    --version-id v1   # replace v1 with the policy's DefaultVersionId           │
│                                                                                │
│  # AWS — who can assume this role?                                            │
│  aws iam get-role --role-name MyRole --query 'Role.AssumeRolePolicyDocument'  │
│                                                                                │
│  # GCP — view the IAM policy on a project                                    │
│  gcloud projects get-iam-policy PROJECT_ID --format=json                      │
│                                                                                │
│  # GCP: show which permissions one role includes                               │
│  gcloud iam roles describe roles/storage.objectViewer                         │
│                                                                                │
│  # Azure — list role assignments in a subscription                           │
│  az role assignment list --all --output table                                 │
│                                                                                │
│  # Azure — view exactly what a built-in role grants                          │
│  az role definition list --name "Storage Blob Data Reader"                   │
└────────────────────────────────────────────────────────────────────────────────┘

Framework Alignment

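┌────────────────┬──────────────────────────────────────────┬───────────────────────────────────────────────────────────────────────────────┐
│ Framework      │ Reference                                │ What It Covers Here                                                           │
├────────────────┼──────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ CISSP          │ Domain 5: Identity and Access Management │ RBAC and ABAC are the implementation models for authorization at scale        │
│ CISSP          │ Domain 1: Security & Risk Management     │ Role design implements separation of duties and least privilege               │
│ ISO 27001:2022 │ 5.15 Access control                      │ Access control policy; roles and policies are the mechanism                   │
│ ISO 27001:2022 │ 5.18 Access rights                       │ Provisioning, review, and removal of access rights; roles make this auditable │
│ ISO 27001:2022 │ 8.2 Privileged access rights             │ Permissions boundaries and conditions applied to elevated access              │
│ SOC 2          │ CC6.1                                    │ Logical access security; policy documents are the technical implementation    │
│ SOC 2          │ CC6.3                                    │ Access revocation; role-based model makes removal consistent and auditable    │
└────────────────┴──────────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────┘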

Key Takeaways

  • Permissions are atomic — one action on one resource class. Policies group permissions. Roles carry policies for assignment
  • AWS roles have two required configs: trust policy (who can assume) and permission policy (what it can do) — both must be correct
  • GCP binds roles to resources; AWS attaches policies to identities — the mental model runs in opposite directions
  • Azure separates role definition (what) from role assignment (who, where) — define once, assign at multiple scopes
  • RBAC scales through role design; ABAC scales through tag/attribute conditions — use ABAC where roles would proliferate
  • iam:PassRole and iam.serviceAccounts.actAs are privilege escalation vectors: scope them to specific role ARNs (AWS) or specific service accounts (GCP), never *
  • Conditions add context (MFA, IP, tags, time) to policies — the MFA condition on IAM actions is essential in every account

What’s Next

EP04 goes deep on AWS IAM — the most complex of the three cloud models. Policy evaluation order, cross-account trust, permissions boundaries in practice, SCPs, and IAM Identity Center for human access. We’ll work through the patterns that make AWS IAM maintainable at production scale.

Next: AWS IAM Deep Dive: Users, Groups, Roles, and Policies Explained


Authentication vs Authorization: AWS AccessDenied Explained

Reading Time: 10 minutes




TL;DR

  • Authentication asks “are you who you claim to be?” Authorization asks “are you allowed to do this?” Two separate gates, two separate failure modes
  • AWS AccessDenied is an authorization failure — the identity authenticated fine; fix the policy, not the credentials
  • Prefer short-lived credentials (STS temporary tokens, Managed Identities) over long-lived access keys — the difference is the blast radius window
  • MFA strengthens authentication; it does nothing for authorization — a hijacked session with broad permissions is just as dangerous with or without MFA on the original login
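  • HTTP 401 = authentication failure; HTTP 403 = authorization failure. The status code usually tells you which gate to debug, but many AWS APIs also return 403 for rejected credentials (for example InvalidClientTokenId), so check the error code too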
  • Both layers must enforce least privilege independently — application-layer authorization is not a substitute for tight cloud IAM

The Big Picture

Every API call in the cloud passes through two gates before it executes. Most engineers know the first one. The second is where most security failures live.

  THE TWO GATES — every cloud API call passes through both, in order

  ┌──────────────────────────────────────────────────────────────────┐
  │  GATE 1 — AUTHENTICATION                                         │
  │  "Are you who you claim to be?"                                  │
  │                                                                  │
  │  IAM user     →  Access Key + Secret (long-lived, rotatable)    │
  │  IAM role     →  Temporary STS token (expires automatically)    │
  │  Human        →  Password + MFA via console or IdP              │
  │  Service      →  Instance profile / Managed Identity / OIDC     │
  │                                                                  │
  │  Passes → move to Gate 2                                        │
  │  Fails  → stopped here, HTTP 401                                │
  └──────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
  ┌──────────────────────────────────────────────────────────────────┐
  │  GATE 2 — AUTHORIZATION                                          │
  │  "Are you allowed to do what you're trying to do?"               │
  │                                                                  │
  │  Evaluated against: identity-based policies · SCPs              │
  │                     resource-based policies · conditions         │
  │                     permissions boundaries · session policies    │
  │                                                                  │
  │  Default answer: DENY (explicit Allow required every time)      │
  │                                                                  │
  │  Passes → request executes                                      │
  │  Fails  → AccessDenied / HTTP 403                               │
  └──────────────────────────────────────────────────────────────────┘

  MFA hardens Gate 1. It has zero effect on Gate 2.
  A hijacked session with a valid token clears Gate 1 automatically.
  Gate 2 is your last line of defense — and the one that's most often misconfigured.
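
The two gates map onto two separate checks in code. This is a minimal sketch of a request handler, assuming sessions is a token-to-identity lookup and grants is an identity-to-actions lookup; both tables, the token values, and handle_request itself are hypothetical.

```python
def handle_request(token, action, sessions, grants):
    """Gate 1 (authentication) then Gate 2 (authorization); returns an HTTP status."""
    identity = sessions.get(token)
    if identity is None:
        return 401    # Gate 1 failed: unknown or missing credential
    if action not in grants.get(identity, set()):
        return 403    # Gate 2 failed: authenticated, but no Allow for this action
    return 200


sessions = {"token-alice": "alice"}
grants = {"alice": {"s3:GetObject"}}

print(handle_request("bogus-token", "s3:GetObject", sessions, grants))
print(handle_request("token-alice", "s3:GetObject", sessions, grants))
print(handle_request("token-alice", "s3:DeleteObject", sessions, grants))
```

A stolen but valid token sails through the first check in exactly the same way alice's own request does. Only the second check, and how tightly grants is scoped, limits what that token can do.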

Introduction

The authentication vs authorization distinction is the most commonly confused boundary in cloud security — and the source of most misdirected debugging when an AWS AccessDenied error appears. These are two separate gates, two separate failure modes, and two entirely different fixes.

Early in my career I wrote an API endpoint I was proud of. Token validation. Rejection of unauthenticated requests. I called it “secured” in the code review.

A senior engineer asked one question: “What happens if I take a valid token from a regular user and call your /admin/delete-user endpoint?”

I ran the test. It worked. Any employee — with a perfectly valid, properly issued token — could delete any user account in the system.

The authentication was correct. The authorization didn’t exist.

That gap between proving who you are and proving you’re allowed to do this is where a surprising number of security incidents live. Not just in application code — in cloud IAM too.

I’ve reviewed AWS environments where MFA was enforced on every human account, access keys were rotated quarterly, and yet a Lambda function had s3:* on * because whoever wrote the deployment script reached for AmazonS3FullAccess and moved on.

Gate 1 was solid. Gate 2 was wide open.

This episode draws the boundary cleanly — what each gate is, how each cloud implements it, and the specific failure modes that happen when the two get conflated.


How Authentication Works in Cloud IAM

Authentication answers: are you who you claim to be?

The three factor types

Authentication has not fundamentally changed in decades. What has changed is how cloud platforms implement it.

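┌────────────────────┬────────────┬────────────────────────────────────────────┐
│ Factor             │ Type       │ Cloud Examples                             │
├────────────────────┼────────────┼────────────────────────────────────────────┤
│ Something you know │ Knowledge  │ Password, access key secret, PIN           │
│ Something you have │ Possession │ TOTP app, FIDO2 hardware key, smart card   │
│ Something you are  │ Inherence  │ Biometrics (less common in cloud contexts) │
└────────────────────┴────────────┴────────────────────────────────────────────┘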

MFA requires two distinct factors. A username plus a password is not MFA: the username only identifies the account, so the password is the single (knowledge) factor. A password plus a TOTP code is MFA. Worth stating clearly because I’ve seen internal documentation describe “username and password” as two-factor authentication.

SMS codes count as MFA, but they’re the weakest form. SIM-swapping attacks — convincing a carrier to port your number — have been used to defeat SMS MFA on high-value accounts. If TOTP or FIDO2 hardware keys are available, use them.
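The rule above, that MFA needs two distinct factor types rather than two credentials, can be sketched in a few lines. The `FACTOR_TYPES` mapping and the `is_mfa` helper are illustrative names for this sketch, not part of any cloud SDK.

```python
# Illustrative mapping from credential kind to factor type (hypothetical names).
FACTOR_TYPES = {
    "password": "knowledge",
    "pin": "knowledge",
    "totp": "possession",
    "fido2_key": "possession",
    "sms_code": "possession",
    "fingerprint": "inherence",
}


def is_mfa(credentials):
    """Return True only if the credentials span at least two distinct factor types."""
    # Anything not in the mapping (a username, for example) is not a factor at all.
    types = {FACTOR_TYPES[c] for c in credentials if c in FACTOR_TYPES}
    return len(types) >= 2


print(is_mfa(["username", "password"]))  # False: one knowledge factor only
print(is_mfa(["password", "pin"]))       # False: two knowledge factors
print(is_mfa(["password", "totp"]))      # True: knowledge + possession
```

Counting factor types instead of credentials is what stops "password plus PIN" from passing as two-factor.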

How AWS authenticates

AWS has two fundamentally different identity classes:

Human identities authenticate via console (password + optional MFA) or CLI/API (Access Key ID + Secret Access Key). The access key is a long-lived credential with no default expiry. Every .env file with an access key, every git commit that included one, every CI/CD log that printed one — that credential is live until someone explicitly rotates or deletes it.

Machine identities — EC2, Lambda, ECS tasks — authenticate via temporary credentials issued by STS:

# Assume a role — get temporary credentials that expire
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/DevRole \
  --role-session-name alice-session \
  --duration-seconds 3600
# Returns: AccessKeyId + SecretAccessKey + SessionToken
# All three expire together. Nothing to rotate.

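# From inside an EC2 instance: credentials arrive automatically via IMDS
# IMDSv2 needs a session token first; a bare GET fails where IMDSv2 is enforced
TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/MyAppRole
# Returns: AccessKeyId, SecretAccessKey, Token, Expiration
# AWS refreshes these before expiry. The application never sees a rotation event.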

The IMDS model is the right one. The application never manages a credential — it appears, it’s used, it expires. If it leaks, it’s usable for hours at most, not years.


How GCP authenticates

GCP cleanly separates human and machine authentication.

Humans authenticate via Google Account or Workspace (OAuth2). The gcloud CLI handles the flow:

gcloud auth login                        # browser-based OAuth2 for humans
gcloud auth application-default login    # sets up Application Default Credentials for local dev

Machine identities use service accounts, ideally attached to the resource rather than using downloaded key files. Key files are GCP’s equivalent of long-lived AWS access keys — same problems, same risks.

# From inside a GCE VM — ADC uses the attached service account, no key file needed
gcloud auth print-access-token
# Use it: curl -H "Authorization: Bearer $(gcloud auth print-access-token)" ...

How Azure authenticates

Azure’s identity plane is Entra ID (formerly Azure Active Directory). Humans authenticate via Entra ID using OAuth2/OIDC. Machine identities use Managed Identities — Azure handles the entire credential lifecycle, nothing to configure or rotate.

az login                                  # browser-based OAuth2
# service principal for automation
az login --service-principal \
  -u APP_ID -p CERT_OR_SECRET \
  --tenant TENANT_ID

# From inside an Azure VM — get a token via IMDS, no credentials needed
curl -H 'Metadata: true' \
  'http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https://management.azure.com/'

The credential failure modes that repeat everywhere

In practice, the same patterns appear across all three clouds in every audit:

Leaked credentials — access keys in git commits, .env files, Docker image layers, CI/CD logs. GitHub’s secret scanning finds thousands of these monthly on public repos alone.

Long-lived credentials — an access key from 2019 is still valid in 2026 unless someone explicitly rotated it. I’ve audited accounts where 30% of access keys had never been rotated, some five years old.

Shared credentials — one key used by three services. When you revoke it, three things break. When it leaks, you can’t tell which service was the source.

Credential sprawl — service account keys downloaded for “one quick test” and never deleted. I once found seventeen key files for a single GCP service account, created by different engineers over two years. None rotated. Five belonged to accounts that no longer existed.

The direction of travel in all three clouds is credential-less: workload identity federation, managed identities, instance profiles. We’ll cover this specifically in OIDC Workload Identity: Eliminate Cloud Access Keys Entirely.


How Authorization Evaluates Every API Call

Authorization happens after authentication. The system knows who you are; now it decides what you can do. That decision is expressed through IAM roles and policies, the building blocks that state what each identity is allowed to do on which resources.

What the evaluation looks like

Every API call triggers an authorization check. You don’t notice when it succeeds. You notice when it fails:

REQUEST:
  Action:    s3:DeleteObject
  Resource:  arn:aws:s3:::prod-backups/2024-01-15.tar.gz
  Principal: arn:aws:iam::123456789012:role/DevEngineerRole
  Context:   { source_ip: "10.0.1.5", mfa: false, time: "14:32 UTC" }

EVALUATION:
  1. Explicit Deny anywhere? → none found
  2. Explicit Allow in any policy? → not granted
  3. Default → DENY

RESULT: AccessDenied

The engineer authenticated successfully. Valid credentials, valid session. But DevEngineerRole has no policy granting s3:DeleteObject on that bucket. Gate 1 passed. Gate 2 denied. They are evaluated independently.
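The evaluation above can be sketched as a small function. This is a simplified model, not AWS's actual engine: the statement format, the `evaluate` helper, and the single `s3:GetObject` grant given to DevEngineerRole are assumptions made for illustration, and matching uses Python's `fnmatchcase` as a stand-in for IAM wildcard matching.

```python
from fnmatch import fnmatchcase


def matches(pattern, value):
    """Wildcard match where '*' stands for any run of characters."""
    return fnmatchcase(value, pattern)


def evaluate(statements, action, resource):
    """Explicit Deny wins, then explicit Allow, otherwise the default Deny."""
    applicable = [
        s for s in statements
        if matches(s["action"], action) and matches(s["resource"], resource)
    ]
    if any(s["effect"] == "Deny" for s in applicable):
        return "DENY (explicit)"
    if any(s["effect"] == "Allow" for s in applicable):
        return "ALLOW"
    return "DENY (default)"


# Hypothetical policy content for DevEngineerRole: read access only.
dev_engineer_role = [
    {"effect": "Allow", "action": "s3:GetObject", "resource": "arn:aws:s3:::prod-backups/*"},
]
backup = "arn:aws:s3:::prod-backups/2024-01-15.tar.gz"

print(evaluate(dev_engineer_role, "s3:DeleteObject", backup))  # DENY (default)
print(evaluate(dev_engineer_role, "s3:GetObject", backup))     # ALLOW
```

Nothing in the function looks at credentials: by the time it runs, Gate 1 has already passed.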

Policy evaluation chains by cloud

AWS — evaluated in layers, explicit Deny wins at any layer:

1. Explicit Deny in any SCP?           → DENY (cannot be overridden anywhere)
2. No SCP Allow?                       → DENY
3. Explicit Deny in identity or resource policy? → DENY
4. Resource-based policy Allow?        → can ALLOW (same account)
5. Permissions boundary — no Allow?    → DENY
6. Session policy — no Allow?          → DENY
7. Identity-based policy Allow?        → ALLOW
Default (nothing granted):             → DENY

The default is always Deny. Every successful authorization is an explicit "Effect": "Allow" somewhere in the chain. This is the opposite of traditional Unix — in the cloud, if you didn’t explicitly grant it, it doesn’t exist.
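One way to read the chain is that SCPs, permissions boundaries, and session policies only cap what is possible, while the identity policy is what actually grants. The sketch below models that for a same-account request and deliberately leaves resource-based policies out. The `aws_decision` function and its set-based inputs are assumptions for illustration, not an AWS API.

```python
def aws_decision(action, explicit_denies, scp_allows, boundary_allows,
                 session_allows, identity_allows):
    """Simplified same-account evaluation; resource-based policies are omitted.

    Each *_allows argument is a set of allowed actions, or None when that
    layer is not configured. An unconfigured layer imposes no cap.
    """
    if action in explicit_denies:
        return "DENY: explicit deny"
    caps = [
        ("SCP", scp_allows),
        ("permissions boundary", boundary_allows),
        ("session policy", session_allows),
    ]
    for name, allowed in caps:
        if allowed is not None and action not in allowed:
            return f"DENY: no allow in {name}"
    if action in identity_allows:
        return "ALLOW"
    return "DENY: default"


identity = {"s3:GetObject", "s3:PutObject"}
scp = {"s3:GetObject"}

print(aws_decision("s3:GetObject", set(), scp, None, None, identity))
# ALLOW
print(aws_decision("s3:PutObject", set(), scp, None, None, identity))
# DENY: no allow in SCP
```

The second call shows why an Allow in the identity policy is necessary but not sufficient: any configured cap without a matching Allow ends the evaluation.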

GCP: additive, and bindings set higher in the hierarchy are inherited by everything below:

Permission granted if ANY binding grants it at:
  resource level → project level → folder level → organization level

IAM Deny Policies can override all grants (newer feature).
No binding at any level? → Denied.
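A minimal sketch of that lookup, assuming a toy hierarchy. Real GCP bindings attach roles rather than raw permissions, and deny policies are themselves attached at points in the hierarchy; here roles are flattened to permission sets and deny rules are a flat list to keep the example short. All resource names and the `gcp_allowed` helper are hypothetical.

```python
def ancestors(resource, parents):
    """Yield the resource and every ancestor up to the organization."""
    node = resource
    while node is not None:
        yield node
        node = parents.get(node)


def gcp_allowed(member, permission, resource, parents, bindings, deny_rules=()):
    """Allow if any binding on the resource or an ancestor grants the permission.

    bindings maps node -> list of (member, set_of_permissions).
    deny_rules is a collection of (member, permission) pairs that override grants.
    """
    if (member, permission) in deny_rules:
        return False
    for node in ancestors(resource, parents):
        for bound_member, permissions in bindings.get(node, []):
            if bound_member == member and permission in permissions:
                return True
    return False


parents = {
    "bucket/logs": "project/analytics",
    "project/analytics": "folder/data",
    "folder/data": "org/example",
    "org/example": None,
}
# One binding at folder level; everything beneath the folder inherits it.
bindings = {"folder/data": [("user:alice", {"storage.objects.get"})]}

print(gcp_allowed("user:alice", "storage.objects.get", "bucket/logs", parents, bindings))
# True
print(gcp_allowed("user:alice", "storage.objects.delete", "bucket/logs", parents, bindings))
# False
```

The bucket itself has no binding, yet the read succeeds, which is exactly why broad grants at folder or organization level deserve extra scrutiny.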

Azure RBAC:

1. Explicit Deny Assignment?           → DENY (even Owner can't override)
2. Role Assignment with Allow?         → ALLOW
Default:                               → DENY
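The Azure flow can be sketched the same way, with one addition: an assignment made at a scope also covers every child scope beneath it. The helpers below are illustrative, the subscription ID is a placeholder, and real role definitions carry more structure (NotActions, data actions) than this model shows.

```python
def scope_covers(assignment_scope, resource_id):
    """An assignment applies at its own scope and at every child scope below it."""
    a = assignment_scope.rstrip("/").lower()
    r = resource_id.rstrip("/").lower()
    return r == a or r.startswith(a + "/")


def azure_allowed(action, resource_id, role_assignments, deny_assignments=()):
    """Deny assignments are checked first; then any covering role assignment allows."""
    for scope, actions in deny_assignments:
        if scope_covers(scope, resource_id) and action in actions:
            return False
    return any(
        scope_covers(scope, resource_id) and action in actions
        for scope, actions in role_assignments
    )


rg = "/subscriptions/0000/resourceGroups/rg-app"
account = rg + "/providers/Microsoft.Storage/storageAccounts/appdata"
read = "Microsoft.Storage/storageAccounts/read"

assignments = [(rg, {read})]
print(azure_allowed(read, account, assignments))                     # True
print(azure_allowed(read, account, assignments, [(account, {read})]))  # False
```

Matching on `a + "/"` rather than a bare prefix keeps an assignment on `rg` from leaking onto a sibling such as `rg-app`.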

Why Confusing Authentication and Authorization Breaks Security

The token-as-authorization antipattern

An application checks for a valid JWT and if found, proceeds. The JWT proves the user authenticated with the IdP. However, it says nothing about what they’re allowed to do.

# This is authentication only — anyone with a valid token gets through
@app.route("/admin/delete-user", methods=["POST"])
def delete_user():
    token = request.headers.get("Authorization")
    if verify_token(token):           # asks: is this token real and unexpired?
        delete_user_from_db(...)      # executes for any valid token holder
        return "OK"
    return "Unauthorized", 401

# This separates the two correctly
# (verify_token is assumed to return the principal, or None for a bad token)
@app.route("/admin/delete-user", methods=["POST"])
def delete_user():
    token = request.headers.get("Authorization")
    principal = verify_token(token)                    # Gate 1: authentication
    if principal is None:
        return "Unauthorized", 401
    if not has_permission(principal, "users:delete"):  # Gate 2: authorization
        return "Forbidden", 403
    delete_user_from_db(...)
    return "OK"
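The snippet leaves `has_permission` undefined. A minimal sketch of what it could look like, assuming the verified principal is a dict carrying a `roles` list and that roles map to permission strings in a static table; both are assumptions for illustration, not the original application's design.

```python
# Hypothetical role-to-permission table; a real system would load this from
# a policy store instead of hard-coding it.
ROLE_PERMISSIONS = {
    "admin": {"users:read", "users:delete"},
    "employee": {"users:read"},
}


def has_permission(principal, permission):
    """Gate 2: True only if one of the principal's roles explicitly grants it."""
    if principal is None:
        return False
    return any(
        permission in ROLE_PERMISSIONS.get(role, set())
        for role in principal.get("roles", [])
    )


regular_user = {"sub": "u-1001", "roles": ["employee"]}
admin_user = {"sub": "u-0001", "roles": ["employee", "admin"]}

print(has_permission(regular_user, "users:delete"))  # False
print(has_permission(admin_user, "users:delete"))    # True
```

An unknown role or a missing `roles` field falls through to False, so the default here is deny, the same as in the cloud policy engines described earlier.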

The short-expiry principle

Credential type       Provider      Typical lifetime            Risk
Access Key + Secret   AWS           Permanent (until deleted)   Years of exposure if leaked
STS Temporary Token   AWS           15 min – 12 hours           Hours at most
OAuth2 Access Token   GCP / Azure   ~1 hour                     Short window
IMDS Token (VM)       All three     Hours                       Auto-refreshed by platform

A credential that expires in an hour has a one-hour exposure window if stolen. A credential that never expires has an unlimited window. This is the operational argument for managed identities and instance profiles, beyond just convenience.

# AWS — configure max session duration at role level
aws iam update-role \
  --role-name MyRole \
  --max-session-duration 3600   # 1 hour max

# GCP — access tokens expire in ~1 hour automatically
gcloud auth print-access-token
# For Application Default Credentials: gcloud auth application-default print-access-token

# Azure — token lifetime configurable in Entra ID token policies
az account get-access-token --resource https://management.azure.com/

⚠ Production Gotchas

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 1 — "We have MFA, so permissions can be broad"          ║
║                                                                      ║
║  MFA protects Gate 1 only. If a session is hijacked after login    ║
║  (via malware, SSRF, or a stolen session cookie), the attacker has  ║
║  a valid, MFA-authenticated token. Gate 1 is already cleared.       ║
║  Broad permissions in Gate 2 are the full attack surface.           ║
║                                                                      ║
║  Fix: treat Gate 2 (IAM policy) as your primary blast-radius        ║
║  control. MFA buys time. Least privilege limits damage.             ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 2 — Debugging AccessDenied by rotating credentials      ║
║                                                                      ║
║  AWS AccessDenied is an authorization failure. The identity         ║
║  authenticated successfully — there's no Allow in the policy.       ║
║  Rotating the access key does nothing.                              ║
║                                                                      ║
║  Fix: check the policy chain. Use simulate-principal-policy to      ║
║  confirm where the Allow is missing before touching credentials.    ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 3 — Application-layer authZ with broad cloud IAM        ║
║                                                                      ║
║  "The app controls access" is not a substitute for scoped cloud     ║
║  IAM. An SSRF vulnerability, exposed debug endpoint, or            ║
║  compromised dependency bypasses the application layer entirely.    ║
║  The cloud identity's permissions become the attacker's surface.    ║
║                                                                      ║
║  Fix: both layers enforce least privilege independently.            ║
╚══════════════════════════════════════════════════════════════════════╝
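Gotcha 2 comes down to reading the failure before acting on it. The sketch below maps an HTTP status and an AWS error code to the gate that failed. The code lists are a partial selection of real AWS error codes, and the `which_gate` helper is hypothetical. Because some AWS APIs report credential problems with a 403 status, the error code takes precedence over the status when both are present.

```python
# Partial lists of AWS error codes; extend for the services you use.
AUTHN_CODES = {
    "InvalidClientTokenId",
    "SignatureDoesNotMatch",
    "ExpiredToken",
    "UnrecognizedClientException",
}
AUTHZ_CODES = {"AccessDenied", "AccessDeniedException", "UnauthorizedOperation"}


def which_gate(http_status=None, error_code=None):
    """Classify a failure as Gate 1 (authentication) or Gate 2 (authorization)."""
    if error_code in AUTHN_CODES:
        return "gate 1: fix the credential"
    if error_code in AUTHZ_CODES:
        return "gate 2: fix the policy"
    if http_status == 401:
        return "gate 1: fix the credential"
    if http_status == 403:
        return "gate 2: fix the policy"
    return "unknown"


print(which_gate(403, "AccessDenied"))          # gate 2: fix the policy
print(which_gate(403, "InvalidClientTokenId"))  # gate 1: fix the credential
print(which_gate(401))                          # gate 1: fix the credential
```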

Authentication vs Authorization Audit Checklist

Split your IAM review along the authN/authZ boundary — they’re different problems with different fixes.

Authentication — Gate 1:
– Are there long-lived access keys that could be replaced with STS/Managed Identity?
– Is MFA enforced for all human identities with console or API access?
– Are service account key files present where workload identity is available?
– Are credentials stored in a secrets manager — not in code, .env files, or repos?
– When did each long-lived credential last rotate?
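The last Gate 1 question lends itself to automation. A minimal sketch, assuming key metadata has already been exported into a list of dicts; the field names, key IDs, and the 90-day threshold are assumptions for this example, not a provider default.

```python
from datetime import datetime, timezone


def stale_keys(keys, now, max_age_days=90):
    """Return (key_id, age_in_days) for active keys older than max_age_days, oldest first."""
    stale = []
    for key in keys:
        age = (now - key["created"]).days
        if key["status"] == "Active" and age > max_age_days:
            stale.append((key["id"], age))
    return sorted(stale, key=lambda item: item[1], reverse=True)


# Placeholder key IDs and dates for illustration only.
keys = [
    {"id": "AKIAEXAMPLEOLD", "status": "Active",
     "created": datetime(2019, 6, 1, tzinfo=timezone.utc)},
    {"id": "AKIAEXAMPLENEW", "status": "Active",
     "created": datetime(2025, 12, 1, tzinfo=timezone.utc)},
    {"id": "AKIAEXAMPLEOFF", "status": "Inactive",
     "created": datetime(2020, 1, 1, tzinfo=timezone.utc)},
]

now = datetime(2026, 1, 1, tzinfo=timezone.utc)
for key_id, age in stale_keys(keys, now):
    print(f"{key_id}: {age} days old")
```

Passing `now` in explicitly keeps the function deterministic, which makes it easy to run the same check in a scheduled job and in tests.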

Authorization — Gate 2:
– Does every policy follow least privilege — only the permissions the workload actually uses?
– Are there wildcards (s3:*, "Resource": "*") that could be narrowed?
– Are write, delete, and IAM-modification actions scoped to specific resources?
– Are SCPs or permissions boundaries capping maximum permissions at org or account level?
– When were each role’s permissions last reviewed against actual usage (Access Analyzer)?
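The wildcard question in the Gate 2 list is also easy to script as a first pass. This sketch walks an AWS-style policy document and reports every `*` in the Action or Resource of an Allow statement. The `find_wildcards` helper is hypothetical, and a hit is a prompt for review, not proof of a problem: a prefix wildcard on one bucket may be exactly what the workload needs.

```python
def as_list(value):
    """Policy fields may be a single string or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def find_wildcards(policy):
    """Return (statement_index, field, value) for each wildcard in an Allow statement."""
    findings = []
    for index, statement in enumerate(as_list(policy.get("Statement", []))):
        if statement.get("Effect") != "Allow":
            continue
        for field in ("Action", "Resource"):
            for value in as_list(statement.get(field, [])):
                if "*" in value:
                    findings.append((index, field, value))
    return findings


policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
        {"Effect": "Allow", "Action": ["s3:GetObject"], "Resource": "arn:aws:s3:::reports/*"},
    ],
}

for finding in find_wildcards(policy):
    print(finding)
```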


Quick Reference

┌────────────────────────────┬──────────────────────────────────────────────────┐
│ Term                       │ What it means                                    │
├────────────────────────────┼──────────────────────────────────────────────────┤
│ Authentication (AuthN)     │ Verifying identity — are you who you claim?      │
│ Authorization (AuthZ)      │ Verifying permission — are you allowed to act?   │
│ MFA                        │ Two distinct factors; strengthens Gate 1 only    │
│ STS (AWS)                  │ Security Token Service — issues temp credentials │
│ Access Key                 │ Long-lived AWS credential; avoid for services    │
│ Instance profile (AWS)     │ Container attaching a role to EC2                │
│ Managed Identity (Azure)   │ Credential-less identity for Azure services      │
│ Service Account (GCP)      │ Machine identity; prefer attached over key file  │
│ HTTP 401                   │ Authentication failure — prove who you are       │
│ HTTP 403 / AccessDenied    │ Authorization failure — fix the policy           │
└────────────────────────────┴──────────────────────────────────────────────────┘

Commands to know:
┌──────────────────────────────────────────────────────────────────────────────┐
│  # AWS — assume a role and get temporary credentials                        │
│  aws sts assume-role --role-arn arn:aws:iam::ACCOUNT:role/ROLE \            │
│    --role-session-name my-session --duration-seconds 3600                   │
│                                                                              │
│  # AWS — simulate a policy to debug AccessDenied before touching anything   │
│  aws iam simulate-principal-policy \                                         │
│    --policy-source-arn arn:aws:iam::ACCOUNT:role/MyRole \                   │
│    --action-names s3:GetObject \                                             │
│    --resource-arns arn:aws:s3:::my-bucket/*                                 │
│                                                                              │
│  # AWS — check what credentials your session is using                       │
│  aws sts get-caller-identity                                                 │
│                                                                              │
│  # GCP — print the current access token (expires in ~1 hour)                │
│  gcloud auth print-access-token                                              │
│                                                                              │
│  # GCP - print an access token from ADC                                      │
│  gcloud auth application-default print-access-token                         │
│                                                                              │
│  # Azure — get current token for ARM                                         │
│  az account get-access-token --resource https://management.azure.com/       │
│                                                                              │
│  # Azure — check who you're logged in as                                     │
│  az account show                                                             │
└──────────────────────────────────────────────────────────────────────────────┘

Framework Alignment

Framework        Reference                                  What It Covers Here
CISSP            Domain 5: Identity and Access Management   AuthN and AuthZ are the two core mechanisms; this episode defines the boundary
CISSP            Domain 1: Security & Risk Management       Conflating the two creates systematic, measurable risk with different attack surfaces
ISO 27001:2022   5.17 Authentication information            Managing credentials and authentication mechanisms across the identity lifecycle
ISO 27001:2022   8.5 Secure authentication                  Technical controls: MFA, session management, credential policies
ISO 27001:2022   5.15 Access control                        Policy requirements that depend on cleanly separating identity from permission
SOC 2            CC6.1                                      Logical access controls; this episode defines the two-gate model CC6.1 is built on
SOC 2            CC6.7                                      Access restrictions enforced at the authorization layer, not just authentication

Key Takeaways

  • Authentication proves identity; authorization proves permission — two gates, two separate failure modes, two separate fixes
  • AWS AccessDenied is a Gate 2 failure — the credential is valid, the policy is missing; fix the policy
  • Short-lived credentials (STS, Managed Identities, instance profiles) reduce the blast radius of a credential compromise from years to hours
  • MFA hardens Gate 1 — it has no effect on what an authenticated identity can do
  • HTTP 401 = Gate 1 failed; HTTP 403 = Gate 2 failed — the status code tells you where to look
  • Application-layer authorization and cloud IAM authorization are independent — both must enforce least privilege

What’s Next

You now know what the two gates are and where failures in each originate. IAM Roles vs Policies: How Cloud Authorization Actually Works goes into the mechanics of Gate 2 — the permissions, policies, and roles that implement authorization in practice, and the structural patterns that keep them from turning into an unmanageable sprawl.

Next: IAM Roles vs Policies: How Cloud Authorization Actually Works


eBPF Program Types — What’s Actually Running on Your Nodes

Reading Time: 8 minutes

eBPF: From Kernel to Cloud, Episode 4
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types


Architecture Overview

eBPF Program Types — tracing, networking, and security hook points across the Linux kernel
Each eBPF program type attaches to a different kernel hook — from socket filters to LSM enforcement points.

TL;DR

  • bpftool prog list and bpftool net list show every eBPF program on a node — run these first when debugging eBPF-based tool behavior
  • TC programs can stack on the same interface; stale programs from incomplete Cilium upgrades cause intermittent packet drops — check tc filter show after every Cilium upgrade
  • XDP fires before sk_buff allocation — fastest hook, but no pod identity; Cilium uses it for service load balancing, not pod policy
  • XDP silently falls back to generic mode on unsupported NICs — verify with ip link show | grep xdp
  • Tracepoints are stable across kernel versions; kprobe-based tools may silently break after node OS patches
  • LSM hooks enforce at the kernel level — what makes Tetragon’s enforcement mode fundamentally different from sidecar-based approaches

The Big Picture

  WHERE eBPF PROGRAM TYPES ATTACH IN THE KERNEL

  NIC hardware
       ↓
  DMA → ring buffer
       ↓
  ┌─────────────────────────────────────────────────┐
  │  XDP hook  (Cilium: service load balancing)     │
  │  Sees: raw packet bytes only. No pod identity.  │
  └─────────────────────────┬───────────────────────┘
                            │ XDP_PASS
                            ▼
  sk_buff allocated
       ↓
  ┌─────────────────────────────────────────────────┐
  │  TC ingress hook  (Cilium: pod policy ingress)  │
  │  Sees: sk_buff + socket + cgroup → pod identity │
  └─────────────────────────┬───────────────────────┘
                            ↓
  netfilter / IP routing
       ↓
  socket → process (syscall boundary)
  ┌─────────────────────────────────────────────────┐
  │  Tracepoint / kprobe  (Falco: syscall monitor)  │
  │  Sees: any kernel event, any process, any pod   │
  └─────────────────────────────────────────────────┘
  ┌─────────────────────────────────────────────────┐
  │  LSM hook  (Tetragon: kernel-level enforcement) │
  │  Sees: security check context. Can DENY.        │
  └─────────────────────────────────────────────────┘
       ↓
  IP routing → qdisc
  ┌─────────────────────────────────────────────────┐
  │  TC egress hook  (Cilium: pod policy egress)    │
  │  Sees: socket + cgroup on outbound traffic      │
  └─────────────────────────────────────────────────┘
       ↓
  NIC → wire

eBPF program types define where in the kernel a hook fires and what it can see, and knowing the difference is what makes you effective when Cilium or Falco behave unexpectedly. What the earlier episodes left unanswered, and what a 2am incident eventually forced, is which kinds of eBPF programs are actually running on your nodes and why the difference matters when something breaks.

A pod in production was dropping roughly one in fifty outbound TCP connections. Not all of them — just enough to cause intermittent timeouts in the application logs. NetworkPolicy showed egress allowed. Cilium reported no violations. Running curl manually from inside the pod worked every time.

I spent the better part of three hours eliminating possibilities. DNS. MTU. Node-level conntrack table exhaustion. Upstream firewall rules. Nothing.

Eventually, almost as an afterthought, I ran this:

sudo bpftool prog list

There were two TC programs attached to that pod’s veth interface. One from the current Cilium version. One from the previous version — left behind by a rolling upgrade that hadn’t cleaned up properly. Two programs. Different policy state. One was occasionally dropping packets based on rules that no longer existed in the current policy model.

The answer had been sitting in the kernel the whole time. I just didn’t know where to look.

That incident forced me to actually understand something I’d been hand-waving for two years: eBPF isn’t a single hook. It’s a family of program types, each attached to a different location in the kernel, each seeing different data, each suited for different problems. Understanding the difference is what separates “I run Cilium and Falco” from “I understand what Cilium and Falco are actually doing on my nodes” — and that difference matters when something breaks at 2am.

The Command You Should Run on Your Cluster Right Now

Before getting into the theory, do this:

# See every eBPF program loaded on the node
sudo bpftool prog list

# See every eBPF program attached to a network interface
sudo bpftool net list

On a node running Cilium and Falco, you’ll see something like this:

42: xdp           name cil_xdp_entry       loaded_at 2026-04-01T09:23:41
43: sched_cls     name cil_from_netdev      loaded_at 2026-04-01T09:23:41
44: sched_cls     name cil_to_netdev        loaded_at 2026-04-01T09:23:41
51: cgroup_sock_addr  name cil_sock4_connect loaded_at 2026-04-01T09:23:41
88: raw_tracepoint  name sys_enter          loaded_at 2026-04-01T09:23:55
89: raw_tracepoint  name sys_exit           loaded_at 2026-04-01T09:23:55

Each line is one loaded program, and each program type fires at a different point in the kernel. The type column (xdp, sched_cls, raw_tracepoint, cgroup_sock_addr) tells you where in the kernel execution path that program is attached and therefore what it can and cannot see.

If you see more programs than you expect on a specific interface — like I did — that’s your first clue.
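Grouping that listing by type makes an unexpected extra program easier to spot. The sketch below parses the sample output shown above. The regular expression assumes the `ID: type  name NAME` layout of bpftool's plain-text listing, and the `programs_by_type` helper is hypothetical. For anything beyond a quick look, `bpftool -j prog list` emits JSON, which is more robust to parse.

```python
import re
from collections import defaultdict

# Matches lines such as "42: xdp   name cil_xdp_entry   loaded_at ..."
LINE = re.compile(r"^(\d+):\s+(\S+)\s+name\s+(\S+)")


def programs_by_type(text):
    """Group program names from a plain-text 'bpftool prog list' by program type."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            _prog_id, prog_type, name = match.groups()
            grouped[prog_type].append(name)
    return dict(grouped)


sample = """\
42: xdp           name cil_xdp_entry       loaded_at 2026-04-01T09:23:41
43: sched_cls     name cil_from_netdev      loaded_at 2026-04-01T09:23:41
44: sched_cls     name cil_to_netdev        loaded_at 2026-04-01T09:23:41
51: cgroup_sock_addr  name cil_sock4_connect loaded_at 2026-04-01T09:23:41
88: raw_tracepoint  name sys_enter          loaded_at 2026-04-01T09:23:55
89: raw_tracepoint  name sys_exit           loaded_at 2026-04-01T09:23:55
"""

for prog_type, names in programs_by_type(sample).items():
    print(f"{prog_type}: {len(names)} ({', '.join(names)})")
```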

Why Program Types Exist

The Linux kernel isn’t a single pipeline. Network packets, system calls, file operations, process scheduling — these all run through different subsystems with different execution contexts and different available data.

eBPF lets you attach programs to specific points within those subsystems. The “program type” is the contract: it defines where the hook fires, what data the program receives, and what it’s allowed to do with it. A program designed to process network packets before they hit the kernel stack looks completely different from one designed to intercept system calls across all containers simultaneously.

Most of us will interact with four or five program types through the tools we already run. Understanding what each one actually is — where it sits, what it sees — is what makes you effective when those tools behave unexpectedly.

The Types Behind the Tools You Already Use

TC — Why Cilium Can Tell Which Pod Sent a Packet

TC stands for Traffic Control. It’s where Cilium enforces your NetworkPolicy, and it’s what caused my incident.

TC programs attach to network interfaces — specifically to the ingress and egress directions of the pod’s virtual interface (lxcXXXXX in Cilium’s naming). They fire after the kernel has already processed the packet enough to know its context: which socket created it, which cgroup that socket belongs to. Cgroup maps to container, container maps to pod.

This is the critical piece: TC is how Cilium knows which pod a packet belongs to. Without that cgroup context, per-pod policy enforcement isn’t possible.

# See TC programs on a pod's veth interface
sudo tc filter show dev lxc12345 ingress
sudo tc filter show dev lxc12345 egress

# If you see two entries on the same direction — that's the incident I described
# The priority number (pref 1, pref 2) tells you the order they run

When two TC programs sit on the same interface, they run in priority order, and the first one to return a terminal verdict such as “drop” wins: later programs never see that packet. This is why the issue was intermittent rather than consistent: the stale program only matched specific connection patterns.
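A toy model of stacked filters shows how a leftover program can drop only some traffic. The verdict constants use the kernel's TC action values; everything else, including the two lambda "programs" and the port they react to, is invented for illustration and does not reflect what Cilium's programs actually return.

```python
TC_ACT_UNSPEC = -1  # no verdict: continue with the next filter
TC_ACT_OK = 0       # accept the packet, stop processing
TC_ACT_SHOT = 2     # drop the packet, stop processing


def run_filters(filters, packet):
    """Run (pref, name, program) filters in ascending pref order until one gives a verdict."""
    for _pref, name, program in sorted(filters, key=lambda f: f[0]):
        verdict = program(packet)
        if verdict != TC_ACT_UNSPEC:
            return name, verdict
    return None, TC_ACT_OK  # nothing objected, so the packet continues


# Hypothetical programs: the current one passes this traffic without a verdict,
# the stale one still enforces a rule that no longer exists in current policy.
current = lambda pkt: TC_ACT_UNSPEC
stale = lambda pkt: TC_ACT_SHOT if pkt["dport"] == 8443 else TC_ACT_UNSPEC

filters = [(2, "stale", stale), (1, "current", current)]

print(run_filters(filters, {"dport": 443}))   # (None, 0)
print(run_filters(filters, {"dport": 8443}))  # ('stale', 2)
```

Only packets that match the stale rule are dropped, which is the shape of an intermittent failure: most connections work, a specific pattern does not.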

Fixing it is straightforward once you know what to look for:

# Remove a stale TC filter by its priority number
sudo tc filter del dev lxc12345 egress pref 2

Add this check to your post-upgrade runbook. Cilium upgrades are generally clean but not always.

XDP — Why Cilium Doesn’t Use TC for Everything

If TC is good enough for pod-level policy, why does Cilium also run an XDP program on the node’s main interface? Look at the bpftool prog list output again — there’s an xdp program loaded alongside the TC programs.

XDP fires earlier. Much earlier. Before the kernel allocates any memory for the packet. Before routing. Before connection tracking. Before anything.

The tradeoff is exactly what you’d expect: XDP is fast but context-poor. It sees raw packet bytes. It doesn’t know which pod the packet came from. It can’t read cgroup information because no socket buffer has been allocated yet.

Cilium uses XDP specifically for service load balancing on traffic entering the node (NodePort, LoadBalancer, and externalIPs services rather than pod-to-pod ClusterIP traffic): when a packet arrives at the node destined for a service address, XDP rewrites the destination to the actual pod IP in a single map lookup and sends it on its way. No iptables. No conntrack. The work is done before the kernel stack is involved.

There’s a silent failure mode worth knowing about here. XDP runs in one of two modes:

  • Native mode — runs inside the NIC driver itself, before any kernel allocation. This is where the performance comes from.
  • Generic mode — fallback when the NIC driver doesn’t support XDP. Runs later, after sk_buff allocation. No performance benefit over iptables.

If your NIC doesn’t support native XDP, Cilium silently falls back to generic mode. The policy still works — but the performance characteristics you assumed aren’t there.

# Check which XDP mode is active on your node's main interface
ip link show eth0 | grep xdp
# xdp or xdpdrv ← native mode (fast)
# xdpgeneric ← generic mode (no perf benefit)

Most cloud provider instance types with modern Mellanox/Intel NICs support native mode. Worth verifying rather than assuming.
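If you want this check in a post-deploy script rather than by eye, a small classifier over the `ip link` output is enough. The token names follow the comments above; the sample lines are assumptions about what the output looks like, so confirm against your own iproute2 version before relying on it.

```python
def xdp_mode(ip_link_output):
    """Classify the XDP attach mode from 'ip link show <dev>' output."""
    tokens = ip_link_output.split()
    if "xdpgeneric" in tokens:
        return "generic"
    if "xdpoffload" in tokens:
        return "offload"
    if "xdp" in tokens or any(t.startswith("xdpdrv") for t in tokens):
        return "native"
    return "none"


# Illustrative lines only; real output carries more fields.
native_line = "2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP"
generic_line = "2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpgeneric qdisc mq state UP"
plain_line = "2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP"

for line in (native_line, generic_line, plain_line):
    print(xdp_mode(line))
```

A result of "generic" on a node where you assumed native mode is the signal to check the NIC driver.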

Tracepoints — How Falco Sees Every Container

Falco loads two programs: sys_enter and sys_exit. These are raw tracepoints — they fire on every single system call, from every process, in every container on the node.

Tracepoints are explicitly defined and maintained instrumentation points in the kernel. Unlike hooks that attach to specific internal function names (which can be renamed or inlined between kernel versions), tracepoints are stable interfaces. They’re part of the kernel’s public contract with tooling that wants to instrument it.

This matters operationally. When you patch your nodes — and cloud-managed nodes get patched frequently — tools built on tracepoints keep working. Tools built on kprobes (internal function hooks) may silently stop firing if the function they’re attached to gets renamed or inlined by the compiler in a new kernel build.

# Verify what Falco is actually using
sudo bpftool prog list | grep -E "kprobe|tracepoint"

# Falco's current eBPF driver should show raw_tracepoint entries
# If you see kprobe entries from Falco, you're on the older driver
# Check: falco --version and the driver being loaded at startup

If you’re running Falco on a cluster that gets regular OS patch upgrades and you haven’t verified the driver mode, check it. The older kprobe-based driver has a real failure mode on certain kernel versions.

LSM — How Tetragon Blocks Operations at the Kernel Level

LSM hooks run at the kernel’s security decision points: file opens, socket connections, process execution, capability checks. The defining characteristic is that they can deny an operation. Return an error from an LSM hook and the kernel refuses the syscall before it completes.

This is qualitatively different from observability hooks. kprobes and tracepoints watch. LSM hooks enforce.

When you see Tetragon configured to kill a process attempting a privileged operation, or block a container from writing to a specific path, that’s an LSM hook making the decision inside the kernel — not a sidecar watching traffic, not an admission webhook running before pod creation, not a userspace agent trying to act fast enough. The enforcement is in the kernel itself.

# See if any LSM eBPF programs are active on the node
sudo bpftool prog list | grep lsm

# Verify LSM eBPF support on your kernel (required for Tetragon enforcement mode)
grep CONFIG_BPF_LSM /boot/config-$(uname -r)
# CONFIG_BPF_LSM=y   ← required
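Being compiled in is only half of the requirement: the BPF LSM also has to appear in the kernel's active LSM list, which Linux exposes at /sys/kernel/security/lsm. The sketch below checks both from text you have already read from those files. The `bpf_lsm_ready` helper and the sample strings are hypothetical.

```python
def bpf_lsm_ready(kernel_config_text, active_lsm_text):
    """Return (compiled_in, active) for the BPF LSM.

    kernel_config_text: contents of /boot/config-<kernel release>
    active_lsm_text:    contents of /sys/kernel/security/lsm (comma-separated)
    """
    compiled_in = any(
        line.strip() == "CONFIG_BPF_LSM=y"
        for line in kernel_config_text.splitlines()
    )
    active = "bpf" in [name.strip() for name in active_lsm_text.split(",")]
    return compiled_in, active


# Sample inputs for illustration; read the real files on a node instead.
config_text = "CONFIG_BPF=y\nCONFIG_BPF_SYSCALL=y\nCONFIG_BPF_LSM=y\n"
print(bpf_lsm_ready(config_text, "lockdown,capability,yama,apparmor,bpf"))  # (True, True)
print(bpf_lsm_ready(config_text, "lockdown,capability,yama,apparmor"))      # (True, False)
```

The second result is the case worth catching: the kernel supports BPF LSM, but it was not enabled at boot, so LSM programs cannot attach.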

The Practical Summary

What’s happening on your node    Program type      Where to look
Cilium service load balancing    XDP               ip link show eth0 | grep xdp
Cilium pod network policy        TC (sched_cls)    tc filter show dev lxcXXXX egress
Falco syscall monitoring         Tracepoint        bpftool prog list | grep tracepoint
Tetragon enforcement             LSM               bpftool prog list | grep lsm
Anything unexpected              All types         bpftool prog list, bpftool net list

The Incident, Revisited

Three hours of debugging. The answer was a stale TC program sitting at priority 2 on a pod’s veth interface, left behind by an incomplete Cilium upgrade.

# What I should have run first
sudo bpftool net list
sudo tc filter show dev lxc12345 egress

Two commands. Thirty seconds. If I’d known that TC programs can stack on the same interface, I’d have started there.

That’s the point of understanding program types — not to write eBPF programs yourself, but to know where to look when the tools you depend on don’t behave the way you expect. The programs are already there, running on your nodes right now. bpftool prog list shows you all of them.

Key Takeaways

  • bpftool prog list and bpftool net list show every eBPF program on a node — run these before anything else when debugging eBPF-based tool behavior
  • TC programs can stack on the same interface; stale programs from incomplete Cilium upgrades cause intermittent drops — check tc filter show after every Cilium upgrade
  • XDP runs before the kernel stack — fastest hook, but no pod identity; Cilium uses it for service load balancing, not pod policy
  • XDP silently falls back to generic mode on unsupported NICs — verify with ip link show | grep xdp
  • Tracepoints are stable across kernel versions; kprobe-based tools may silently break after node OS patches — verify your Falco driver mode
  • LSM hooks enforce at the kernel level — this is what makes Tetragon’s enforcement mode fundamentally different from sidecar-based approaches

What’s Next

Every eBPF program fires, does its work, and exits — but the work always involves data.

Counting connections. Tracking processes. Streaming events to a detection engine. In EP05, I’ll cover eBPF maps: the persistent data layer that connects kernel programs to the tools consuming their output. Understanding maps explains a class of production issues and makes bpftool map dump useful rather than cryptic.

What Is Cloud IAM — and Why Every API Call Depends on It

Reading Time: 11 minutes




TL;DR

  • Cloud IAM is the system that decides whether any API call is allowed or denied — deny by default, explicit Allow required at every layer
  • Every API call answers four questions: Who? (Identity) What? (Action) On what? (Resource) Under what conditions? (Context)
  • Two identity types in every cloud account: human (engineers) and machine (Lambda, EC2, Kubernetes pods) — machine identities outnumber human by 10:1 in most production environments
  • AWS, GCP, and Azure share the same model: deny-by-default, policy-driven, principal-based — different syntax, same mental model
  • The gap between granted and used permissions is where attackers move — the average IAM entity uses under 5% of its granted permissions
  • IAM failure has two modes: over-permissioned (“it works”) and over-restricted (“it’s secure, engineers work around it”) — both end in incidents

The Big Picture

                        WHAT IS CLOUD IAM?

  Every API call in AWS, GCP, or Azure answers four questions:

  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
  │    WHO?     │   │   WHAT?     │   │  ON WHAT?   │   │  UNDER      │
  │             │   │             │   │             │   │  WHAT?      │
  │  Identity / │   │  Action /   │   │  Resource   │   │             │
  │  Principal  │   │  Permission │   │             │   │  Condition  │
  │             │   │             │   │             │   │             │
  │ IAM Role    │   │ s3:GetObject│   │ arn:aws:s3: │   │ MFA: true   │
  │ Svc Account │   │ ec2:Start   │   │ ::prod-data │   │ IP: 10.0/8  │
  │ Managed     │   │ iam:        │   │ /exports/*  │   │ Time: 09-17 │
  │ Identity    │   │   PassRole  │   │             │   │             │
  └─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘
        └────────────────┴────────────────┴────────────────┘
                                  │
                     ┌────────────▼────────────┐
                     │    IAM Policy Engine    │
                     │    deny by default      │
                     │                         │
                     │  Explicit ALLOW?   ─────┼──→  PERMIT
                     │  Explicit DENY?    ─────┼──→  DENY (overrides Allow)
                     │  No matching rule? ─────┼──→  DENY (implicit)
                     └─────────────────────────┘

Cloud IAM is the answer to a question every growing infrastructure team hits: at scale, how do you know who can do what, why they can do it, and whether they still should?


Introduction

Cloud IAM (Identity and Access Management) is the control plane for access in every major cloud provider. Every API call — reading a file, starting an instance, invoking a function — goes through an IAM evaluation. The result is binary: explicit Allow or deny. There is no implicit access. Nothing is open by default. This is what makes cloud IAM fundamentally different from the access models that came before it.

Understanding why it works that way requires tracing how access control evolved — and what kept breaking at each stage.

A few years into my career managing Linux infrastructure, I was handed a production server audit. The task was straightforward: find out who had access to what. I pulled /etc/passwd, checked the sudoers file, reviewed SSH authorized_keys across the fleet.

Three days later, I had a spreadsheet nobody wanted to read.

The problem wasn’t that the access was wrong. Most of it was fine. The problem was that nobody — not the team lead, not the security team, not the engineers who’d been there five years — could tell me why a particular account had access to a particular server. It had accumulated. People joined, got access, changed teams, left. The access stayed.

That was a 40-server fleet in 2012.

Fast-forward to a cloud environment today: you might have 50 engineers, 300 Lambda functions, 20 microservices, CI/CD pipelines, third-party integrations, compliance scanners — all making API calls, all needing access to something. The identity sprawl problem I spent three days auditing manually on 40 servers now exists at a scale where manual auditing isn’t even a conversation.

This is the problem Identity and Access Management exists to solve. Not just in theory — in practice, at the scale cloud infrastructure demands.


How We Got Here — The Evolution of Access Control

To understand why cloud IAM works the way it does, you need to trace how access control evolved. The design decisions in AWS IAM, GCP, and Azure didn’t come out of nowhere. They’re answers to lessons learned the hard way across decades of broken systems.

The Unix Model (1970s–1990s): Simple and Sufficient

Unix got the fundamentals right early. Every resource (file, device, process) has an owner and a group. Every action is one of three: read, write, execute. Every user is either the owner, in the group, or everyone else.

-rw-r--r--  1 vamshi  engineers  4096 Apr 11 09:00 deploy.conf
# owner can read/write | group can read | others can read

For a single machine or a small network, this model is elegant. The permissions are visible in ls -l output. Reasoning about access is straightforward. Auditing means reading a few files.
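
As a small illustration of how compact this model is, the sketch below uses Python's standard stat module to split a mode into the owner, group, and others triads. The helper name explain_mode is hypothetical, and the mode value is chosen to match the deploy.conf listing above.

```python
import stat


def explain_mode(mode: int) -> dict:
    """Split a Unix file mode into its three permission triads."""
    text = stat.filemode(mode)  # ten characters: file type + three triads
    return {
        "owner": text[1:4],
        "group": text[4:7],
        "others": text[7:10],
    }


# 0o100644 is a regular file with permission bits 644.
print(explain_mode(0o100644))
```

Three triads of three bits each is the entire vocabulary, which is why the model audits easily on one machine and runs out of expressiveness at organisational scale.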

However, the cracks started showing when organizations grew. You’d add sudo to give specific commands to specific users. Then sudoers files became 300 lines long. Then you’d have shared accounts because managing individual ones was “too much overhead.” Shared accounts mean no individual accountability. No accountability means no audit trail worth anything.

The Directory Era (1990s–2000s): Centralise or Collapse

As networks grew, every server managing its own /etc/passwd became untenable. Enter LDAP and Active Directory. Instead of distributing identity management across every machine, you centralised it: one directory, one place to add users, one place to disable them when someone left.

This was a significant step forward. Onboarding got faster. Offboarding became reliable. Group membership drove access to resources across the network.

Why Groups Became the New Problem

But the permission model was still coarse. You were either in the Domain Admins group or you weren’t. “Read access to the file share” was a group. “Deploy to the staging web server” was a group. Managing fine-grained permissions at scale meant managing hundreds of groups, and the groups themselves became the audit nightmare.

I spent time in environments like this. The group named SG_Prod_App_ReadWrite_v2_FINAL that nobody could explain. The AD group from a project that ended three years ago but was still in twenty user accounts. The contractor whose AD account was disabled but whose service account was still running a nightly job.

The directory model centralised identity. It didn’t solve the permissions sprawl problem.

The Cloud Shift (2006–2014): Everything Changes

AWS launched EC2 in 2006. In 2011, AWS IAM went into general availability. That date matters — for the first five years of AWS, access control was primitive. Root accounts. Access keys. No roles.

Early AWS environments I’ve seen (and had to clean up) reflect this era: a single root account access key shared across a team, rotated manually on a shared spreadsheet. Static credentials in application config files. EC2 instances with AdministratorAccess because “it was easier at the time.”

The Model That Changed Everything

The AWS team understood what they’d built was dangerous. IAM in 2011 introduced the model that all three major cloud providers now share: deny-by-default, policy-driven, principal-based access control. Not “who is in which group.” The question became: which policy explicitly grants this specific action on this specific resource to this specific identity.

GCP launched its IAM model with a different flavour in 2012 — hierarchical, additive, binding-based. Azure RBAC came to general availability in 2014, built on top of Active Directory’s identity model.

By 2015, the modern cloud IAM era was established. The primitives existed. The problem shifted from “does IAM exist?” to “are we using it correctly?” — and most teams were not.

In practice, that question is still the right one to ask today.


The Problem IAM Actually Solves

Here’s the honest version of what IAM is for, based on what I’ve seen go wrong without it.

Without proper IAM, you get one of two outcomes:

The first is what I call the “it works” environment. Everything runs. The developers are happy. Access requests take five minutes because everyone gets the same broad policy. And then a Lambda function’s execution role — which had s3:* on * because someone once needed to debug something — gets its credentials exposed through an SSRF vulnerability in the app it runs. That role can now read every bucket in the account, including the one with the customer database exports.

The second is the “it’s secure” environment. Access is locked down. Every request goes through a ticket. The ticket goes to a security team that approves it in three to five business days. Engineers work around it by storing credentials locally. The workarounds become the real access model. The formal IAM posture and the actual access posture diverge. The audit finds the formal one. Attackers find the real one.

IAM, done right, is the discipline of walking the line between those two outcomes. It’s not a product you buy or a feature you turn on. It’s a practice — a continuous process of defining what access exists, why it exists, and whether it’s still needed.


The Core Concepts — Taught, Not Listed

Let me walk you through the vocabulary you need, grounded in what each concept means in practice.

Identity: Who Is Making This Request?

An identity is any entity that can hold a credential and make requests. In cloud environments, identities split into two types:

Human identities are engineers, operators, and developers. They authenticate via the console, CLI, or SDK. They should ideally authenticate through a central IdP (Okta, Google Workspace, Entra ID) using federation, which is covered in SAML vs OIDC: Which Federation Protocol Belongs in Your Cloud?

Machine identities are everything else: Lambda functions, EC2 instances, Kubernetes pods, CI/CD pipelines, monitoring agents, data pipelines. In most production environments, machine identities outnumber human identities by 10:1 or more.

This ratio matters. When your security model is designed primarily for human access, the 90% of identities that are machines become an afterthought. That’s where access keys end up in environment variables, where Lambda functions get broad permissions because nobody thought carefully about what they actually need, where the real attack surface lives.

Principal: The Authenticated Identity Making a Specific Request

A principal is an identity that has been authenticated and is currently making a request. The distinction from “identity” is subtle but important: the principal includes the context of how the identity authenticated.

In AWS, an IAM role assumed by EC2, assumed by a Lambda, and assumed by a developer’s CLI session are three different principals — even if they all assume the same role. The session context, source, and expiration differ.

{
  "Principal": {
    "AWS": "arn:aws:iam::123456789012:role/DataPipelineRole"
  }
}

In GCP, the equivalent term is principal (older documentation says member, and policy bindings still use a members field). In Azure, it’s security principal: a user, group, service principal, or managed identity.

Resource: What Is Being Accessed?

A resource is whatever is being acted upon. In AWS, every resource has an ARN (Amazon Resource Name) — a globally unique identifier.

arn:aws:s3:::customer-data-prod          # S3 bucket
arn:aws:s3:::customer-data-prod/*        # everything inside that bucket
arn:aws:ec2:ap-south-1:123456789012:instance/i-0abcdef1234567890
arn:aws:iam::123456789012:role/DataPipelineRole

The ARN structure tells you: service, region, account, resource type, resource name. Once you can read ARNs fluently, IAM policies become much less intimidating.
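
To make that structure concrete, here is a minimal parsing sketch. It assumes the general layout arn:partition:service:region:account-id:resource; the Arn type and parse_arn function are hypothetical names, not part of any AWS SDK.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    account: str
    resource: str


def parse_arn(arn: str) -> Arn:
    """Split an ARN into its fields.

    The resource part may itself contain ':' or '/', so split at most
    five times and keep the remainder intact.
    """
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return Arn(*parts[1:])


for example in (
    "arn:aws:s3:::customer-data-prod",
    "arn:aws:ec2:ap-south-1:123456789012:instance/i-0abcdef1234567890",
    "arn:aws:iam::123456789012:role/DataPipelineRole",
):
    print(parse_arn(example))
```

Note the empty fields: S3 bucket ARNs carry no region or account, and IAM ARNs carry no region because IAM is a global service.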

Action: What Is Being Done?

An action (AWS/Azure) or permission (GCP) is the operation being attempted. AWS writes these as service:Operation strings; GCP uses a dotted service.resource.verb form:

# AWS
s3:GetObject           # read a specific object
s3:PutObject           # write an object
s3:DeleteObject        # delete an object — treat differently than read
iam:PassRole           # assign a role to a service — one of the most dangerous permissions
ec2:DescribeInstances  # list instances — often overlooked, but reveals infrastructure

# GCP
storage.objects.get
storage.objects.create
iam.serviceAccounts.actAs   # impersonate a service account — equivalent to iam:PassRole danger

When I audit IAM configurations, I pay special attention to any policy that includes iam:*, iam:PassRole, or wildcards like "Action": "*". These are the permissions that let a compromised identity create new identities, assign itself more power, or impersonate other accounts. They’re the privilege escalation primitives — more on that in AWS IAM Privilege Escalation: How iam:PassRole Leads to Full Compromise.
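
A first pass at that audit can be automated. The sketch below is illustrative only: the RISKY_ACTIONS set and the example policy (including the DebugAccess statement) are assumptions made up for this example, and the check ignores NotAction, conditions, and partial wildcards such as iam:Pass*.

```python
# Actions this sketch treats as escalation-prone. The set is an assumption
# for illustration, not an exhaustive catalogue.
RISKY_ACTIONS = {"*", "iam:*", "iam:passrole"}


def _as_list(value):
    """IAM accepts a single string or a list; normalise to a list."""
    return value if isinstance(value, list) else [value]


def risky_statements(policy):
    """Return (Sid, action) pairs for Allow statements granting risky actions."""
    findings = []
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        for action in _as_list(stmt.get("Action", [])):
            # IAM action names are case-insensitive, so compare in lowercase.
            if action.lower() in RISKY_ACTIONS:
                findings.append((stmt.get("Sid", "<no Sid>"), action))
    return findings


# Hypothetical policy: one tightly scoped statement, one leftover debug grant.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadCustomerDataBucket",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": "arn:aws:s3:::customer-data-prod/*",
        },
        {
            "Sid": "DebugAccess",
            "Effect": "Allow",
            "Action": ["iam:PassRole", "ec2:DescribeInstances"],
            "Resource": "*",
        },
    ],
}

print(risky_statements(example_policy))
```

A real review would also look at the Resource each risky action is scoped to, since iam:PassRole on a single role ARN is a very different finding from iam:PassRole on *.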

Policy: The Document That Connects Everything

A policy is a document that says: this principal can perform these actions on these resources, under these conditions.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadCustomerDataBucket",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::customer-data-prod",
        "arn:aws:s3:::customer-data-prod/*"
      ]
    }
  ]
}

Notice what’s explicit here: the effect (Allow), the exact actions (not s3:*), and the exact resource (not *). Every word in this document is a deliberate decision. The moment you start using wildcards to save typing, you’re writing technical debt that will come back as a security incident.


How IAM Actually Works — The Decision Flow

When any API call hits a cloud service, an IAM engine evaluates it. Understanding this flow is the foundation of debugging access issues, and more importantly, of understanding why your security posture is what it is.

Request arrives:
  Action:    s3:PutObject
  Resource:  arn:aws:s3:::customer-data-prod/exports/2026-04-11.csv
  Principal: arn:aws:iam::123456789012:role/DataPipelineRole
  Context:   { source_ip: "10.0.2.15", mfa: false, time: "02:30 UTC" }

IAM Engine evaluation (AWS):
  1. Is there an explicit Deny anywhere? → No
  2. Does the SCP (if any) allow this? → Yes
  3. Does the identity-based policy allow this? → Yes (via DataPipelinePolicy)
  4. Does the resource-based policy (bucket policy) allow or deny? → No matching statement; for same-account access the identity-based Allow in step 3 is sufficient
  5. Is there a permissions boundary? → No
  Decision: ALLOW

The critical insight here: cloud IAM is deny-by-default. There is no implicit allow. If there is no policy that explicitly grants s3:PutObject to this role on this bucket, the request fails. The only way in is through an explicit "Effect": "Allow".

This is the opposite of how most traditional systems work. In a Unix permission model, if your file is world-readable (-r--r--r--), anyone can read it unless you actively restrict them. In cloud IAM, nothing is accessible unless you actively grant it.

When I’m debugging an AccessDenied error — and every engineer who works with cloud IAM spends significant time doing this — the mental model is always: “what is the chain of explicit Allows that should be granting this access, and at which layer is it missing?”
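
The core of that evaluation can be sketched in a few lines. This is a deliberately reduced model, assuming a flat list of statements: it covers explicit Deny, explicit Allow, and the implicit default, and leaves out SCPs, resource-based policies, permissions boundaries, and conditions. The function names are hypothetical, and fnmatch is only an approximation of IAM wildcard matching.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    return value if isinstance(value, list) else [value]


def _matches(stmt, action, resource):
    """True when the statement covers this action on this resource."""
    action_ok = any(
        fnmatchcase(action.lower(), pattern.lower())  # actions are case-insensitive
        for pattern in _as_list(stmt["Action"])
    )
    resource_ok = any(
        fnmatchcase(resource, pattern) for pattern in _as_list(stmt["Resource"])
    )
    return action_ok and resource_ok


def evaluate(statements, action, resource):
    """Explicit Deny wins, then explicit Allow, otherwise deny by default."""
    matching = [s for s in statements if _matches(s, action, resource)]
    if any(s["Effect"] == "Deny" for s in matching):
        return "DENY (explicit)"
    if any(s["Effect"] == "Allow" for s in matching):
        return "ALLOW"
    return "DENY (implicit)"


statements = [
    {"Effect": "Allow", "Action": "s3:GetObject",
     "Resource": "arn:aws:s3:::customer-data-prod/*"},
    {"Effect": "Deny", "Action": "s3:*",
     "Resource": "arn:aws:s3:::customer-data-prod/secrets/*"},
]
obj = "arn:aws:s3:::customer-data-prod/exports/2026-04-11.csv"
print(evaluate(statements, "s3:GetObject", obj))
print(evaluate(statements, "s3:PutObject", obj))
```

The second call is the common AccessDenied case: nothing denies the write, but nothing allows it either, so the default applies.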


Why This Is Harder Than It Looks

Understanding the concepts is the easy part. The hard part is everything that happens at organisational scale over time.

Scale. A real AWS account in a growing company might have 600+ IAM roles, 300+ policies, and 40+ cross-account trust relationships. None of these were designed together. They evolved incrementally, each change made by someone who understood the context at the time and may have left the organisation since. The cumulative effect is an IAM configuration that no single person fully understands.

Drift. IAM configs don’t stay clean. An engineer needs to debug a production issue at 2 AM and grants themselves broad access temporarily. The temporary access never gets revoked. Multiply that by a team of 20 over three years. I’ve audited environments where 60% of the permissions in a role had never been used — not once — in the 90-day CloudTrail window. That unused 60% is pure attack surface.

The machine identity blind spot. Most IAM governance practices were built for human users. Service accounts, Lambda roles, and CI/CD pipeline identities get created rapidly and reviewed rarely. In my experience, these are the identities most likely to have excess permissions, least likely to be in the access review process, and most likely to be the initial foothold in a cloud breach.

The gap between granted and used. This one surprised me most when I first started doing cloud security work. AWS data from real customer accounts shows the average IAM entity uses less than 5% of its granted permissions. That 95% excess isn’t just waste; it’s attack surface. Every permission that exists but isn’t needed is a permission an attacker can use if they compromise that identity.
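
Measuring that gap is a set difference once you have two lists: what a role is granted and what it has actually called. The sketch below assumes both are already expanded into concrete action names; the data is made up for illustration, and in practice the used set would have to come from something like CloudTrail or last-accessed reporting.

```python
def unused_permissions(granted, used):
    """Actions that were granted but never observed in use."""
    return set(granted) - set(used)


def unused_ratio(granted, used):
    """Fraction of granted actions that were never used (0.0 if none granted)."""
    granted = set(granted)
    if not granted:
        return 0.0
    return len(granted - set(used)) / len(granted)


# Hypothetical role: five granted actions, two seen in the audit window.
granted = {
    "s3:GetObject",
    "s3:PutObject",
    "s3:DeleteObject",
    "ec2:DescribeInstances",
    "iam:PassRole",
}
used = {"s3:GetObject", "s3:PutObject"}

print(sorted(unused_permissions(granted, used)))
print(f"{unused_ratio(granted, used):.0%} of granted actions unused")
```

The output of a check like this is a candidate list for removal, not a verdict: some permissions are legitimately rare, such as those needed only during disaster recovery.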


IAM Across AWS, GCP, and Azure — The Conceptual Map

The three major providers implement IAM differently in syntax, but the same model underlies all of them. Once you understand one deeply, the others become a translation exercise.

| Concept | AWS | GCP | Azure |
|---|---|---|---|
| Identity store | IAM users / roles | Google accounts, Workspace | Entra ID |
| Machine identity | IAM Role (via instance profile or AssumeRole) | Service Account | Managed Identity |
| Access grant mechanism | Policy document attached to identity or resource | IAM binding on resource (member + role + condition) | Role Assignment (principal + role + scope) |
| Hierarchy | Account is the boundary; Org via SCPs | Org → Folder → Project → Resource | Tenant → Management Group → Subscription → Resource Group → Resource |
| Default stance | Deny | Deny | Deny |
| Wildcard risk | "Action": "*" on "Resource": "*" | Primitive roles (viewer/editor/owner) | Owner or Contributor assigned broadly |

The hierarchy point is worth pausing on. AWS is relatively flat — the account is the primary security boundary. GCP’s hierarchy means a binding at the Organisation level propagates down to every project. Azure’s hierarchy means a role assignment at the Management Group level flows through every subscription beneath it.

The blast radius of a misconfiguration scales with how high in the hierarchy it sits.

This will matter in GCP IAM Policy Inheritance and Azure RBAC Explained when we go deep on GCP and Azure specifically. For now, the takeaway is: understand where in the hierarchy a permission is granted, because the same permission granted at the wrong level has a very different security implication.
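
The inheritance behaviour can be sketched as a walk up a tree. Everything here is hypothetical: the node names, members, and roles are invented, and the structure is a simplified picture of a GCP-style hierarchy rather than a call to any real API.

```python
# Child -> parent links for a small, made-up hierarchy.
PARENT = {
    "project/payments-prod": "folder/production",
    "folder/production": "org/example",
    "org/example": None,
}

# Bindings granted directly at each level: (member, role).
BINDINGS = {
    "org/example": [("group:platform-admins", "roles/viewer")],
    "folder/production": [("group:sre", "roles/editor")],
    "project/payments-prod": [
        ("serviceAccount:ci-runner", "roles/storage.objectViewer")
    ],
}


def effective_bindings(node, parent=PARENT, bindings=BINDINGS):
    """Collect bindings on the node and on every ancestor above it."""
    result = []
    while node is not None:
        for member, role in bindings.get(node, []):
            result.append((member, role, node))  # keep where it was granted
        node = parent.get(node)
    return result


for member, role, granted_at in effective_bindings("project/payments-prod"):
    print(f"{member:28} {role:28} granted at {granted_at}")
```

Recording where each binding was granted is the useful part: a role that looks harmless on one project reads very differently once you see it was granted at the organisation node.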


Framework Alignment

If you’re mapping this episode to a control framework — for a compliance audit, a certification study, or building a security program — here’s where it lands:

| Framework | Reference | What It Covers Here |
|---|---|---|
| CISSP | Domain 1: Security & Risk Management | IAM as a risk reduction control; blast radius is a risk variable |
| CISSP | Domain 5: Identity and Access Management | Direct implementation: who can do what, to which resources, under what conditions |
| ISO 27001:2022 | 5.15 Access control | Policy requirements for restricting access to information and systems |
| ISO 27001:2022 | 5.16 Identity management | Managing the full lifecycle of identities in the organization |
| ISO 27001:2022 | 5.18 Access rights | Provisioning, review, and removal of access rights |
| SOC 2 | CC6.1 | Logical access security controls to protect against unauthorized access |
| SOC 2 | CC6.3 | Access removal and review processes to limit unauthorized access |

Key Takeaways

  • IAM evolved from Unix file permissions → directory services → cloud policy engines, driven by scale and the failure modes of each prior model
  • Cloud IAM is deny-by-default: every access requires an explicit Allow somewhere in the policy chain
  • Identities are human or machine; in production, machines dominate — and they’re the under-governed majority
  • A policy binds a principal to actions on resources; every word is a deliberate security decision
  • The hardest IAM problems aren’t technical — they’re organisational: drift, unused permissions, machine identities nobody owns, and access reviews that never happen
  • The gap between granted and used permissions is where attackers find room to move

What’s Next

Now that you understand what IAM is and why it exists, the next question is the one that trips up even experienced engineers: what’s the difference between authentication and authorization, and why does conflating them cause security failures?

EP02 works through both — how cloud providers implement each, where the boundary sits, and why getting this boundary wrong creates exploitable gaps.

Next: Authentication vs Authorization: AWS AccessDenied Explained

Get EP02 in your inbox when it publishes → subscribe

The Runtime Reckoning: Dockershim Out, eBPF In, and PSP Finally Dies (2022–2023)

Reading Time: 6 minutes


Introduction

2022 is the year Kubernetes dealt with its legacy. The Docker shim that everyone had been warned about for two years was actually removed. PodSecurityPolicy — the broken security primitive that clusters had depended on since 1.3 — was deleted. And eBPF started displacing iptables as the networking substrate.

These weren’t additions to Kubernetes. They were the removal of technical debt accumulated over eight years. And the migrations they forced were the most operationally significant events since RBAC went stable.


Kubernetes 1.24 — Dockershim Removed (May 2022)

The dockershim was removed in 1.24. The deprecation had been announced in 1.20 (December 2020) — 18 months of warning. It didn’t matter. Operators who hadn’t migrated still scrambled.

The actual migration was straightforward for most environments:

# On each node, before upgrading to 1.24:
# 1. Install containerd
apt-get install -y containerd.io

# 2. Configure containerd
containerd config default | tee /etc/containerd/config.toml
# Edit: set SystemdCgroup = true in runc options

# 3. Update kubelet to use containerd socket
# /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
# Add: --container-runtime-endpoint=unix:///run/containerd/containerd.sock

# 4. Reload unit files, then restart containerd (to pick up config.toml) and the kubelet
systemctl daemon-reload && systemctl restart containerd kubelet

What the migration revealed: how many teams were depending on the Docker socket being present on nodes. Tools that mounted /var/run/docker.sock to talk to the Docker daemon — build tools, CI agents, some monitoring agents — broke. The ecosystem had to adapt to nerdctl (containerd’s Docker-compatible CLI), Kaniko, Buildah, or mounting the containerd socket instead.

Other 1.24 highlights:
  • Beta APIs disabled by default: New beta APIs would no longer be enabled automatically. This reversed a long-standing policy that had caused too many production clusters to accidentally pick up unstable features
  • gRPC probes beta: Liveness and readiness probes could now use gRPC health checks natively (enabled by default), so gRPC services no longer needed HTTP wrapper endpoints
  • Non-graceful node shutdown alpha: Handles the case where a node disappears before the kubelet can gracefully terminate pods, so stateful workloads can recover after a node failure


Kubernetes 1.25 — PSP Removed (August 2022)

PodSecurityPolicy was deleted in 1.25. Every cluster that was still using PSP had to migrate to Pod Security Admission (or OPA/Gatekeeper or Kyverno) before upgrading.

Pod Security Admission was GA in 1.25, ready to take over:

# Enforce restricted policy on a namespace
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/enforce-version=v1.25

# Test a pod against the policy without enforcing
kubectl label namespace staging \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted

The dry-run modes (warn, audit) were critical for migration: you could enable them on namespaces and watch what would have been rejected before switching to enforce mode.

The real migration challenge was existing workloads running as root, with privileged security contexts, or with hostPath mounts. The restricted policy rejected all of these. Production applications that had been running for years under permissive PSP policies now failed validation.
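
To get a feel for what an inventory of failing workloads looks like, here is a rough pre-check sketch. It covers only the three cases named above (hostPath volumes, privileged containers, and a missing runAsNonRoot: true), which is a small subset of what the restricted profile actually checks; the function name and the sample pod are hypothetical.

```python
def restricted_violations(pod):
    """List the ways a pod manifest (as a dict) breaks this reduced rule set."""
    violations = []
    spec = pod.get("spec", {})
    pod_ctx = spec.get("securityContext", {})

    for volume in spec.get("volumes", []):
        if "hostPath" in volume:
            violations.append(f"volume {volume.get('name')} uses hostPath")

    for container in spec.get("containers", []):
        name = container.get("name")
        ctx = container.get("securityContext", {})
        if ctx.get("privileged"):
            violations.append(f"container {name} is privileged")
        # A container-level setting overrides the pod-level one.
        run_as_non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot"))
        if run_as_non_root is not True:
            violations.append(f"container {name} does not set runAsNonRoot: true")

    return violations


# Made-up example of a long-running workload written for a permissive PSP.
legacy_pod = {
    "metadata": {"name": "legacy-app"},
    "spec": {
        "volumes": [{"name": "host-logs", "hostPath": {"path": "/var/log"}}],
        "containers": [
            {
                "name": "app",
                "image": "example/app:1.0",
                "securityContext": {"privileged": True},
            }
        ],
    },
}

for problem in restricted_violations(legacy_pod):
    print(problem)
```

The warn and audit labels remain the authoritative check, since they run the real admission logic; a script like this is only useful for triaging manifests before they reach a cluster.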

Also in 1.25:
  • Ephemeral containers stable: Attach a debug container to a running pod without restarting it

# Debug a running pod with no shell
kubectl debug -it nginx-pod --image=busybox:latest --target=nginx
  • CSI ephemeral volumes stable
  • cgroups v2 (unified hierarchy) support stable: Enables memory QoS, improved resource accounting

Kubernetes 1.26 — Structured Parameter Scheduling, Storage (December 2022)

1.26 focused on the scheduler and storage:
  • Dynamic Resource Allocation alpha: A generalization of the device plugin API that allows requesting complex resources (GPUs, FPGAs, network adapters) with scheduling constraints. The foundation for AI/ML workload scheduling on heterogeneous hardware
  • Cross-namespace PVC data sources alpha: Clone a PVC across namespaces, which enables namespace-based data isolation while sharing data sets
  • Pod scheduling readiness alpha: A pod can declare that it’s not ready to be scheduled until external conditions are met (data pre-loading complete, license validated, etc.)
  • Removal of in-tree cloud provider code (beta, continued): A long-running effort to move cloud-provider-specific code out of the core Kubernetes binary

The Dynamic Resource Allocation feature deserves emphasis: it’s the mechanism that makes Kubernetes a serious platform for GPU scheduling in AI/ML workloads. Device plugins (the prior mechanism) had limitations — a pod either got a GPU or it didn’t. DRA allows richer resource semantics: this pod needs two GPUs on the same PCIe bus, or this pod needs a specific GPU model.


eBPF Reshapes Kubernetes Networking

The most significant architectural shift in Kubernetes networking during 2022–2023 wasn’t a Kubernetes release feature. It was the adoption of eBPF-based CNI solutions — primarily Cilium — as the default networking layer in major managed Kubernetes offerings.

The iptables problem: kube-proxy has been using iptables rules to implement Service routing since Kubernetes 1.0. Every Service adds iptables rules to every node. At 10,000 services, the iptables rule table on each node has hundreds of thousands of rules. Traversing these rules on every packet is O(n). Updating them requires locking and flushing. At scale, iptables becomes a bottleneck.

The eBPF solution: Cilium replaces kube-proxy entirely, implementing Service routing using eBPF maps — hash tables in kernel memory. Service lookup is O(1). Rule updates don’t require locking. Network policy enforcement happens in the kernel, before packets even reach the application.
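
The difference in lookup cost is easy to see in a toy model. The sketch below is only an analogy written in Python: a list scanned in order stands in for a sequential rule chain, and a dict stands in for a kernel hash map. The addresses are generated, and nothing here reflects how kube-proxy or Cilium is actually implemented.

```python
def lookup_linear(rules, vip):
    """Walk the rules in order until one matches: cost grows with rule count."""
    for steps, (frontend, backends) in enumerate(rules, start=1):
        if frontend == vip:
            return backends, steps
    return None, len(rules)


def lookup_hashed(table, vip):
    """One hash lookup regardless of how many services exist."""
    return table.get(vip)


# 10,000 made-up services, each with a single backend.
rules = [
    (f"10.96.{i // 256}.{i % 256}:80", [f"10.0.{i // 256}.{i % 256}:8080"])
    for i in range(10_000)
]
table = dict(rules)

last_vip = rules[-1][0]
backends, steps = lookup_linear(rules, last_vip)
print(f"linear scan: {steps} comparisons to find {last_vip}")
print(f"hash lookup: {lookup_hashed(table, last_vip)}")
```

Both lookups return the same backend; the difference is that the linear scan's worst case is proportional to the number of services, which is the cost every packet pays in the rule-chain model.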

# Check if Cilium is running in kube-proxy replacement mode
cilium status | grep "KubeProxy replacement"
# KubeProxy replacement:    True

# eBPF-based service map — inspect directly
cilium service list
# ID   Frontend          Service Type   Backend
# 1    10.96.0.1:443     ClusterIP      10.0.0.5:6443
# 2    10.96.0.10:53     ClusterIP      10.0.1.2:53, 10.0.1.3:53

Network policy enforcement: Cilium’s NetworkPolicy implementation enforces rules at the eBPF layer — packets that would be dropped by policy are dropped before they ever leave the kernel, before they touch the pod’s network stack. This is both faster and more secure than userspace enforcement.

Hubble: Cilium’s observability layer — built on the same eBPF probes — provides real-time network flow visibility, HTTP layer observability (which service called which endpoint, response codes), and DNS query logging without any application changes.

Major adoption milestones:
– GKE’s default CNI became Cilium (Dataplane V2) in 2021
– Amazon EKS added Cilium support
– Azure AKS enabled Cilium-based networking
– Google’s Autopilot clusters use Cilium exclusively


Kubernetes 1.27 — Graceful Failure, In-Place Resize Alpha (April 2023)

  • In-Place Pod Vertical Scaling alpha: Change the CPU and memory resources of a running container without restarting the pod. For databases, JVM-based applications, and anything with warm caches, live resizing is a significant operational improvement
# Resize a container's CPU without restart
kubectl patch pod database-pod --type='json' \
  -p='[{"op": "replace", "path": "/spec/containers/0/resources/requests/cpu", "value": "2"}]'
  • SeccompDefault stable: Enable the default seccomp profile (RuntimeDefault) cluster-wide — a meaningful reduction in the default syscall attack surface for all pods
  • Mutable scheduling directives for Jobs stable: Change node affinity and tolerations of pending (not yet running) Job pods
  • ReadWriteOncePod PersistentVolume access mode beta: A volume can only be mounted by a single pod at a time, which is the correct semantic for databases with file-level locking requirements

The 1.5 Million Lines Removed: Cloud Provider Code Migration

One of the largest ongoing engineering efforts in Kubernetes 1.26–1.31 was the removal of in-tree cloud provider code. Every major cloud provider (AWS, Azure, GCP, OpenStack, vSphere) had code compiled directly into the Kubernetes control plane binaries.

The result: the Kubernetes API server and controller manager binaries contained code for AWS EBS volumes, GCE persistent disks, Azure managed disks, OpenStack Cinder — regardless of which cloud you were running on.

The migration moved this code to external Cloud Controller Managers (CCM) — separate processes that communicate with the API server like any other controller:

Before: kube-controller-manager (monolithic, includes all cloud providers)
After:  kube-controller-manager (generic) + cloud-controller-manager (cloud-specific, external)

By 1.31, approximately 1.5 million lines of code had been removed from the core binaries, reducing binary sizes by approximately 40%. This is the largest refactor in Kubernetes history.


Gateway API: Replacing Ingress (2022–2023)

The Ingress API, which graduated to stable in 1.19, has fundamental limitations:
– No support for TCP/UDP routing (HTTP only)
– No traffic splitting between multiple backends
– No header-based routing
– Vendor-specific features implemented via annotations (not portable)
– No RBAC granularity within a single Ingress resource

Gateway API (kubernetes-sigs/gateway-api) was designed as the successor, with a role-based model:

GatewayClass  → Managed by infrastructure provider (cluster admin)
Gateway       → Managed by cluster operators
HTTPRoute     → Managed by application developers

# Gateway: cluster operator configures the load balancer
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: production-gateway
spec:
  gatewayClassName: nginx
  listeners:
  - name: https
    port: 443
    protocol: HTTPS
    tls:
      mode: Terminate
      certificateRefs:
      - name: tls-cert

---
# HTTPRoute — application team configures routing
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-route
spec:
  parentRefs:
  - name: production-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api/v2
    backendRefs:
    - name: api-v2-service
      port: 8080
      weight: 90
    - name: api-v3-canary
      port: 8080
      weight: 10

Gateway API reached GA (v1.0) in October 2023, with the core HTTPRoute, Gateway, and GatewayClass resources graduating to stable.
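
The weight fields in the HTTPRoute above express a proportional split. As a rough sketch of what that means for individual requests, the code below picks a backend at random in proportion to its weight. The function name is hypothetical, and real Gateway implementations are free to realise the split differently.

```python
import random


def pick_backend(backend_refs, rng):
    """Choose one backend name with probability proportional to its weight."""
    names = [ref["name"] for ref in backend_refs]
    weights = [ref["weight"] for ref in backend_refs]
    return rng.choices(names, weights=weights, k=1)[0]


# Same names and weights as the HTTPRoute example above.
backend_refs = [
    {"name": "api-v2-service", "weight": 90},
    {"name": "api-v3-canary", "weight": 10},
]

rng = random.Random(42)  # seeded so repeated runs give the same sequence
counts = {ref["name"]: 0 for ref in backend_refs}
for _ in range(10_000):
    counts[pick_backend(backend_refs, rng)] += 1

print(counts)
```

Weights are relative, not percentages: 9 and 1 would produce the same split as 90 and 10.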


Key Takeaways

  • Dockershim removal in 1.24 completed the CRI migration that started in 1.5 — the Kubernetes runtime interface is now clean, with containerd and CRI-O as the standard runtimes
  • PSP removal in 1.25 forced a migration that should have happened years earlier; Pod Security Admission’s simplicity is a feature, not a limitation
  • eBPF-based networking (Cilium, Dataplane V2) is now the default in GKE and increasingly in EKS and AKS — O(1) service routing and kernel-level policy enforcement replace the iptables approach that dated to Kubernetes 1.0
  • Dynamic Resource Allocation (1.26 alpha) is the foundation for AI/ML GPU scheduling — more capable than device plugins and designed for heterogeneous hardware requests
  • Gateway API reaching GA replaced the annotation-driven, non-portable Ingress API with a role-oriented, extensible routing API
  • The cloud provider code removal (1.5M lines) is the largest refactor in Kubernetes history, a prerequisite for a maintainable, leaner core

What’s Next

← EP05: Security Hardens | EP07: Platform Engineering Era →

Series: Kubernetes: From Borg to Platform Engineering | linuxcent.com

eBPF vs Kernel Modules: An Honest Comparison for K8s Engineers

Reading Time: 8 minutes



~2,100 words · Reading time: 8 min · Series: eBPF: From Kernel to Cloud, Episode 3 of 18

In Episode 1 we covered what eBPF is. In Episode 2 we covered why it is safe. The question that comes next is the one most tutorials skip entirely:

If eBPF can do everything a kernel module does for observability, why do kernel modules still exist? And when should you still reach for one?

Most comparisons on this topic are written by people who have used one or the other. I have used both — device driver work from 2012 to 2014 and eBPF in production Kubernetes clusters for the last several years. This is the honest version of that comparison, including the cases where kernel modules are still the right answer.


Architecture Overview

[Figure: eBPF vs kernel modules, comparing safety, portability, and runtime loading. eBPF programs run in a sandboxed VM; kernel modules run with full ring-0 privileges.]

TL;DR

  • Kernel modules run with full ring-0 privileges and no safety net: a bug can trigger an immediate kernel oops or panic, with no recovery short of a reboot
  • eBPF runs in a sandboxed virtual machine: the verifier ensures it cannot crash the kernel, and CO-RE means one binary runs across kernel versions without recompilation
  • eBPF cannot replace kernel modules for hardware drivers, new filesystems, or deep scheduler modifications — those still require modules
  • On EKS, GKE, and most managed Kubernetes platforms, loading custom kernel modules is restricted or blocked; eBPF is the only viable kernel extension path
  • Kernel modules are a significant attack surface (container escape, privilege escalation); eBPF programs are constrained by the verifier and produce an audit trail
  • Practical rule: reach for eBPF first; only reach for a kernel module when eBPF’s sandboxed model provably cannot do what you need

What Kernel Modules Actually Are

A kernel module is a piece of compiled code that loads directly into the running Linux kernel. Once loaded, it operates with full kernel privileges — the same level of access as the kernel itself. There is no sandbox. There is no safety check. There is no verifier.

This is both the power and the problem.

Kernel modules can do things that nothing else in the Linux ecosystem can do: implement new filesystems, add hardware drivers, intercept and modify kernel data structures, hook into scheduler internals. They are how the kernel extends itself without requiring a recompile or a reboot.

But the operating model is unforgiving:

  • A bug in a kernel module can trigger an immediate kernel oops or panic, with nothing to catch it and no recovery short of a reboot
  • Modules must be compiled against the exact kernel headers of the running kernel
  • A module that works on RHEL 8 may refuse to load on RHEL 9 without recompilation
  • Loading a module requires root privileges and deliberate coordination in production
  • Debugging a module failure means kernel crash dumps, kdump analysis, and time

I experienced all of these during device driver work. The discipline that environment instils is real — you think very carefully before touching anything, because mistakes are instantaneous and complete.


What eBPF Does Differently

eBPF was not designed to replace kernel modules. It was designed to provide a safe, programmable interface to kernel internals for the specific use cases where modules had always been used but were too dangerous: observability, networking, and security monitoring.

The fundamental difference is the verifier, covered in depth in Episode 2. Before any eBPF program runs, the kernel proves it is safe. Before any kernel module runs, nothing checks anything.

That single architectural decision produces a completely different operational profile:

Property                 | Kernel module                    | eBPF program
Safety check before load | None                             | BPF verifier (mathematical proof of safety)
A bug causes             | Kernel panic, immediate          | Program rejected at load time
Kernel version coupling  | Compiled per kernel version      | CO-RE: compile once, run on any kernel 5.4+
Hot load / unload        | Risky, requires coordination     | Safe, zero downtime, zero pod restarts
Access scope             | Full kernel, unrestricted        | Restricted, granted per program type
Debugging                | Kernel crash dumps, kdump        | bpftool, bpftrace, readable error messages
Portability              | Recompile per distro per version | Single binary runs across distros and versions
Production risk          | High (no safety net)             | Low (verifier enforced before execution)

CO-RE: Why Portability Matters More Than Most Engineers Realise

The portability column in that table deserves more than a one-line entry, because it is the operational advantage that compounds over time.

A kernel module written for RHEL 8 ships compiled against 4.18.0-xxx.el8.x86_64 kernel headers. When RHEL 8 moves to a new minor version, the module may need recompilation. When you migrate to RHEL 9 — kernel 5.14 with a completely different ABI in places — the module almost certainly needs a full rewrite of any code that touches kernel internals that changed between versions.

If you are running Falco with its kernel module driver and you upgrade a node from Ubuntu 20.04 to 22.04, Falco needs a pre-built module for your exact new kernel or it needs to compile one. If the pre-built is not available and compilation fails — no runtime security monitoring until it is resolved.

eBPF with CO-RE works differently. CO-RE (Compile Once, Run Everywhere) uses the kernel’s embedded BTF (BPF Type Format) information to patch field offsets and data structure layouts at load time to match the running kernel. The eBPF program was compiled once, against a reference kernel. When it loads on a different kernel, libbpf reads the BTF data from /sys/kernel/btf/vmlinux and fixes up the relocations automatically.

The practical result: a Cilium or Falco binary built six months ago loads and runs correctly on a node you just upgraded to a newer kernel version — without any module rebuilding, without any intervention, without any downtime.

In a Kubernetes environment where node images update regularly — especially on managed services like EKS, GKE, and AKS — this is not a minor convenience. It is the difference between eBPF tooling that survives an upgrade cycle and kernel module tooling that breaks one.
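
The relocation step is easier to picture with a toy model. The sketch below is not libbpf; it only imitates the idea that a program records which struct fields it reads by name, and the loader resolves those names to offsets on the running kernel. The offsets and the relocate helper are invented for illustration.

```python
# Hypothetical struct layouts: field name -> byte offset. Real offsets come
# from the kernel's BTF data; these numbers are invented for illustration.
BUILD_KERNEL_BTF = {"task_struct": {"pid": 1256, "comm": 1784}}
TARGET_KERNEL_BTF = {"task_struct": {"pid": 2336, "comm": 2912}}


def relocate(accesses, target_btf):
    """Resolve (struct, field) accesses to byte offsets on the target kernel.

    Raises KeyError when the target kernel has no such field, so a missing
    field is reported up front instead of reading from a wrong offset.
    """
    return {(struct, field): target_btf[struct][field]
            for struct, field in accesses}
```

The same list of accesses yields different offsets against BUILD_KERNEL_BTF and TARGET_KERNEL_BTF, which is the whole point: the program stays the same and only the resolved offsets change.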


Security Implications: Container Escape and Privilege Escalation

The security difference between the two approaches matters specifically for container environments, and it goes beyond the verifier’s protection of your own nodes.

Kernel modules as an attack surface

Historically, kernel module vulnerabilities have been a primary vector for container escape. The attack pattern is straightforward: exploit a vulnerability in a loaded kernel module to gain kernel-level code execution, then use that access to break out of the container namespace into the host. Several high-profile CVEs over the past decade have followed this pattern.

The risk is compounded in environments that load third-party kernel modules — hardware drivers, filesystem modules, observability agents using the kernel module approach — because each additional module is an additional attack surface at the highest privilege level on the system.

eBPF’s security boundaries

eBPF does not eliminate the attack surface entirely, but it constrains it in important ways.

First, the verifier tracks pointer types and rejects programs loaded without elevated BPF privileges if they would leak kernel addresses to userspace. This narrows the class of KASLR bypass attacks that kernel module vulnerabilities have historically enabled.

Second, eBPF programs are sandboxed by design. They cannot access arbitrary kernel memory, cannot call arbitrary kernel functions, and cannot modify kernel data structures they were not explicitly granted access to. A vulnerability in an eBPF program is contained within that sandbox.

Third, the program type system controls what each eBPF program can see and do. A kprobe program watching syscalls cannot suddenly start modifying network packets. The scope is fixed at load time by the program type and verified by the kernel.

For EKS specifically: Falco running in eBPF mode on your nodes is not a kernel module that could be exploited for container escape. It is a verifier-checked program with a constrained access scope. The tool designed to detect container escapes is not itself a container escape vector — which is the correct security architecture.

Audit and visibility

eBPF programs are auditable in ways that kernel modules are not. You can list every eBPF program currently loaded on a node:

$ bpftool prog list
14: kprobe  name sys_enter_execve  tag abc123...  gpl
    loaded_at 2025-03-01T07:30:00+0000  uid 0
    xlated 240B  jited 172B  memlock 4096B  map_ids 3,4

27: cgroup_skb  name egress_filter  tag def456...  gpl
    loaded_at 2025-03-01T07:30:01+0000  uid 0

Every program is listed with its load time, its type, its tag (a hash of the program), and the maps it accesses. You can audit exactly what is running in your kernel at any point. Kernel modules offer no equivalent — lsmod tells you what is loaded but nothing about what it is actually doing.
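
That listing can also feed an automated check. The sketch below parses JSON shaped like the output of bpftool's JSON mode and reports programs that are not on an allowlist. The sample data is hand-written, the field names are assumed to match your bpftool version, and the allowlist is hypothetical.

```python
import json

# Hand-written sample in the shape of `bpftool --json prog list` output.
# Field names are assumed here; check them against your bpftool version.
SAMPLE = """[
  {"id": 14, "type": "kprobe", "name": "sys_enter_execve", "uid": 0, "map_ids": [3, 4]},
  {"id": 27, "type": "cgroup_skb", "name": "egress_filter", "uid": 0}
]"""

# Hypothetical allowlist of program names expected on this node.
EXPECTED = {"sys_enter_execve"}


def unexpected_programs(raw_json, expected_names):
    """Return (id, type, name) for loaded programs not on the allowlist."""
    programs = json.loads(raw_json)
    return [
        (p["id"], p.get("type", "unknown"), p.get("name", ""))
        for p in programs
        if p.get("name", "") not in expected_names
    ]
```

On a real node you would pass the command's output instead of SAMPLE and alert on any non-empty result.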


EKS and Managed Kubernetes: Where the Difference Is Most Visible

The eBPF vs kernel module distinction plays out most clearly in managed Kubernetes environments, because you do not control when nodes upgrade.

On EKS, when AWS releases a new optimised AMI for a node group and you update it, your nodes are replaced. Any kernel module-based tooling on those nodes needs pre-built modules for the new kernel, or it needs to compile them at node startup, or it fails. AWS does not provide the kernel source for EKS-optimised AMIs in the same way a standard distribution does, which makes module compilation at runtime unreliable.

The same fragility showed up in the EKS 1.33 migration (covered in the EKS 1.33 post), which was painful for Rocky Linux because it involved kernel-level networking behaviour that had been assumed stable. When the kernel networking stack changed, everything built on top of those assumptions broke.

eBPF-based tooling on EKS does not have this problem, provided the node OS ships with BTF enabled — which Amazon Linux 2023 and Ubuntu 22.04 EKS-optimised AMIs do. Cilium and Falco survive node replacements without any module rebuilding because CO-RE handles the kernel version differences automatically.

For GKE and AKS the story is similar. Both use node images with BTF enabled on current versions, and both upgrade nodes on a managed schedule that is difficult to predict precisely. eBPF tooling survives this. Kernel module tooling fights it.
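
Whether a node qualifies comes down to the BTF file mentioned earlier. A minimal check, assuming the standard /sys/kernel/btf/vmlinux location and a hypothetical helper name, could look like this:

```python
import os


def has_kernel_btf(path="/sys/kernel/btf/vmlinux"):
    """Return True if the kernel exposes BTF type information at `path`."""
    return os.path.isfile(path)
```

Running it in a node-level DaemonSet or a bootstrap script gives a quick yes or no before you roll out CO-RE based tooling.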


When You Should Still Use Kernel Modules

eBPF is not the right answer for every use case. Kernel modules remain the correct tool when:

You are implementing hardware support. Device drivers for new hardware still require kernel modules. eBPF cannot provide the low-level hardware interrupt handling, DMA operations, or hardware register access that a device driver needs. If you are bringing up a new network interface card, storage controller, or GPU, you are writing a kernel module.

You need to modify kernel behaviour, not just observe it. eBPF can observe and filter. It can drop packets, block syscalls via LSM hooks, and redirect traffic. But it cannot fundamentally change how the kernel handles a syscall, replace the scheduler on kernels that predate sched_ext (merged in Linux 6.12), or add a new filesystem type. Those changes require kernel modules or upstream kernel patches.

You are on a kernel older than 5.4. Without BTF and CO-RE, eBPF programs must be compiled per kernel version — which largely eliminates the portability advantage. On RHEL 7 or very old Ubuntu LTS versions still in production, kernel modules may be the more practical path for instrumentation work, though migrating the underlying OS is a better long-term answer.

You need capabilities the eBPF verifier rejects. The verifier’s safety constraints occasionally reject programs that are logically safe but that the verifier cannot prove safe statically. Complex loops, large stack allocations, and certain pointer arithmetic patterns hit verifier limits. In these edge cases, a kernel module can do what the verifier would not allow. These situations are rare and becoming rarer as the verifier improves across kernel versions.


The Practical Decision Framework

For most engineers reading this — Linux admins, DevOps engineers, SREs managing Kubernetes clusters — the decision is straightforward:

  • Observability, security monitoring, network policy, performance profiling on Linux 5.4+ → eBPF
  • Hardware drivers, new kernel subsystems, or kernels older than 5.4 → kernel modules
  • Production Kubernetes on EKS, GKE, or AKS → eBPF, always, because CO-RE survives managed upgrades and kernel modules do not
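
The three rules above are simple enough to write down as a function. This is only a restatement of the list in code, with hypothetical use-case labels; it is not a substitute for judgment about a specific workload.

```python
# Use cases that still require a kernel module (labels are hypothetical).
MODULE_ONLY = {"hardware_driver", "new_kernel_subsystem"}


def choose_mechanism(use_case, kernel=(5, 4), managed_k8s=False):
    """Apply the decision rules from the list above, in order of precedence."""
    if use_case in MODULE_ONLY:
        return "kernel module"
    if managed_k8s:
        # EKS, GKE, AKS: CO-RE survives managed node upgrades.
        return "eBPF"
    if kernel < (5, 4):
        # No BTF/CO-RE baseline, so the portability advantage is lost.
        return "kernel module"
    return "eBPF"
```

For example, choose_mechanism("observability", kernel=(5, 10)) returns "eBPF", while the same request on a 3.10 kernel falls back to "kernel module".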

The overlap between the two technologies — the use cases where both could work — has been shrinking for five years and continues to shrink as the verifier becomes more capable and CO-RE becomes more widely supported. The direction of travel is clear.

Kernel modules are a precision instrument for modifying kernel behaviour. eBPF is a safe, portable interface for observing and influencing it. In 2025, if you are reaching for a kernel module to instrument a production system, there is almost certainly a better path.


Up Next

Episode 4 covers the five things eBPF can observe that no other tool can — without agents, without sidecars, and without any changes to your application code. If you are running production Kubernetes and want to understand what true zero-instrumentation observability looks like, that is the post.

The full series is on LinkedIn — search #eBPFSeries — and all episodes are indexed on linuxcent.com under the eBPF Series tag.



Questions or corrections? Reach me on LinkedIn. If this was useful, the full series index is on linuxcent.com — search the eBPF Series tag for all episodes.

Security Hardens: Supply Chain, Pod Security, and the API Cleanup (2020–2022)

Reading Time: 6 minutes


Introduction

The 2020–2022 period redefined what “secure Kubernetes” meant. A global pandemic moved workloads to cloud-native infrastructure faster than security practices could follow. SolarWinds happened. Log4Shell happened. The software supply chain became a crisis.

At the same time, the Kubernetes project was doing something it had been reluctant to do: removing APIs and features, including PodSecurityPolicy — the primary security primitive that most enterprise clusters depended on. The replacement was simpler, but the migration was not.


Kubernetes 1.19 — LTS Behavior, Ingress Stable (August 2020)

1.19 extended the support window to one year (from nine months). This was an acknowledgment that enterprise organizations couldn’t upgrade four times per year — a common complaint from operations teams.

  • Ingress graduated to stable: networking.k8s.io/v1 — after years as a beta resource, Ingress finally had a stable API
  • Immutable ConfigMaps and Secrets to beta: Configuration protection becomes broadly available
  • EndpointSlices (still beta in 1.19; GA came in 1.21): The replacement for Endpoints. It shards pod-to-service mappings to avoid the single large Endpoints object that caused control plane stress at scale (10,000+ endpoints for a single service)
  • Structured logging (alpha): Machine-parseable log output from Kubernetes control plane components — a prerequisite for reliable SIEM integration
# EndpointSlice: distributed representation of service endpoints
kubectl get endpointslices -n production -l kubernetes.io/service-name=api-service
NAME                  ADDRESSTYPE   PORTS   ENDPOINTS                                   AGE
api-service-abc12     IPv4          8080    10.0.1.5,10.0.1.6,10.0.1.7 + 47 more...   2d
api-service-def34     IPv4          8080    10.0.2.1,10.0.2.2,10.0.2.3 + 47 more...   2d
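
The sharding idea can be sketched in a few lines. This is a simplification: the real EndpointSlice controller also tries to minimise churn when endpoints change, and the 100-per-slice limit used here is the documented default rather than something taken from the output above.

```python
def shard_endpoints(addresses, max_per_slice=100):
    """Split a flat endpoint list into slices of at most `max_per_slice`."""
    return [
        addresses[start:start + max_per_slice]
        for start in range(0, len(addresses), max_per_slice)
    ]
```

With 250 endpoints this yields three slices, so an update to one pod touches one small object instead of rewriting a single 250-entry Endpoints resource.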

Kubernetes 1.20 — Dockershim Deprecated (December 2020)

The announcement in 1.20 that the Docker shim was deprecated caused more panic than any previous Kubernetes deprecation. The message was misread by many as “Kubernetes is dropping Docker support” — the PR catastrophe that followed required the Kubernetes blog to publish a dedicated clarification post.

The reality: Docker-built images continued to work on Kubernetes. What was being removed was the code in the kubelet that talked directly to Docker’s daemon using a non-standard interface, rather than through the Container Runtime Interface (CRI). Docker images conform to the OCI (Open Container Initiative) image specification — they run on any CRI-compliant runtime.

The migration path:
  • containerd: The runtime that Docker itself used internally. Moving to containerd meant removing the Docker layer entirely, so the kubelet talks directly to containerd via CRI
  • CRI-O: An OCI-focused runtime designed specifically for Kubernetes, minimal and purpose-built

# Before (Docker socket): kubelet → dockershim → Docker daemon → containerd → runc
# After (direct CRI):     kubelet → containerd → runc
#                    or:  kubelet → CRI-O → runc

# Check runtime in use on a node
kubectl get node worker-1 -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}'
# containerd://1.6.4
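
For a fleet-wide audit, the same field can be parsed in a script. The sketch below assumes the value has the runtime://version shape shown above and that nodes still on the Docker path report a docker:// prefix; the helper names are hypothetical.

```python
def parse_runtime(value):
    """Split a containerRuntimeVersion string into (runtime, version)."""
    runtime, separator, version = value.partition("://")
    if not separator or not runtime or not version:
        raise ValueError(f"unexpected containerRuntimeVersion: {value!r}")
    return runtime, version


def uses_docker(value):
    """True if the node reports Docker as its container runtime."""
    return parse_runtime(value)[0] == "docker"
```

Feeding it the value from every node gives a list of machines that still needed migrating before the 1.24 upgrade.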

Also in 1.20:
  • API Priority and Fairness beta: Rate-limits API server requests by priority, which prevents a runaway controller from starving other API clients
  • CronJob controller v2 alpha: A rewritten, more scalable controller; the CronJob API itself went stable in 1.21
  • Volume snapshot stable


The SolarWinds Context (December 2020)

The SolarWinds supply chain attack, disclosed in December 2020, didn’t directly target Kubernetes. But it accelerated an existing conversation in the cloud-native community: if the build pipeline is compromised, signed binaries mean nothing. If the image registry is compromised, admission control on image names means nothing.

The attack catalyzed work on several fronts:
  • Sigstore: An open-source project (Google, Red Hat, Purdue University) for signing and verifying software artifacts including container images
  • SLSA (Supply chain Levels for Software Artifacts): A framework for incrementally improving supply chain security, from basic build provenance to hermetic builds with verified dependencies
  • SBOM (Software Bill of Materials): A machine-readable inventory of software components in an image, required by US Executive Order 14028 (May 2021) for software sold to the federal government


Kubernetes 1.21 — PodSecurityPolicy Deprecation (April 2021)

PodSecurityPolicy was deprecated in 1.21, with removal scheduled for 1.25. The deprecation was contentious: PSP was the only built-in mechanism for enforcing pod security constraints, and every security-conscious cluster depended on it despite its many flaws.

The replacement approach: Pod Security Standards — three predefined security profiles:

Profile    | Description                              | Use Case
Privileged | No restrictions                          | System-level workloads, trusted components
Baseline   | Prevents known privilege escalations     | General application workloads
Restricted | Hardened; follows current best practices | High-security workloads

Other 1.21 highlights:
  • CronJobs stable
  • Immutable ConfigMaps and Secrets stable
  • Graceful node shutdown beta: The kubelet gracefully terminates pods when a node shuts down (not just when the kubelet stops)
  • PodDisruptionBudget stable


Kubernetes 1.22 — The Great API Removal (August 2021)

1.22 was the most disruptive Kubernetes release for operations teams since 1.0. Several long-lived beta APIs were removed:

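Removed API                          | Replacement                     | Used By
networking.k8s.io/v1beta1 Ingress    | networking.k8s.io/v1            | Every ingress resource
apiextensions.k8s.io/v1beta1 CRD     | apiextensions.k8s.io/v1         | Every CRD definition
rbac.authorization.k8s.io/v1beta1    | rbac.authorization.k8s.io/v1    | RBAC resources
admissionregistration.k8s.io/v1beta1 | admissionregistration.k8s.io/v1 | Admission webhook configurations

batch/v1beta1 CronJob is often lumped in with this list, but it was only deprecated in 1.21 and kept being served until its removal in 1.25.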

Teams with Helm charts, Terraform modules, and CI/CD pipelines built against beta API versions had to update their manifests. This was the moment that finally drove home the message: beta APIs in Kubernetes are not stable — they will be removed.
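
A pre-upgrade scan for these versions does not need much code. The sketch below checks rendered manifest text for apiVersion lines that 1.22 no longer serves. The mapping is a partial list for illustration and the function name is hypothetical; purpose-built tools cover far more cases.

```python
import re

# apiVersions that stopped being served in 1.22, with their replacements.
# Partial list for illustration, not the full set of removals.
REMOVED_IN_1_22 = {
    "networking.k8s.io/v1beta1": "networking.k8s.io/v1",
    "apiextensions.k8s.io/v1beta1": "apiextensions.k8s.io/v1",
    "rbac.authorization.k8s.io/v1beta1": "rbac.authorization.k8s.io/v1",
}

API_VERSION = re.compile(r"^apiVersion:\s*(\S+)\s*$", re.MULTILINE)


def find_removed_apis(manifest_text):
    """Return (line_number, old_api, replacement) for each removed apiVersion."""
    findings = []
    for match in API_VERSION.finditer(manifest_text):
        old = match.group(1)
        if old in REMOVED_IN_1_22:
            line = manifest_text.count("\n", 0, match.start()) + 1
            findings.append((line, old, REMOVED_IN_1_22[old]))
    return findings
```

Running it over the output of a chart render or a directory of manifests in CI turns the upgrade surprise into a failing build.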

Also in 1.22:
  • Server-Side Apply stable: Apply semantics moved server-side, so field ownership tracking, conflict detection, and merge strategies are handled by the API server rather than client-side kubectl
  • Memory manager beta: Better NUMA-aware memory allocation for latency-sensitive workloads
  • Bound Service Account Token Volumes stable: Time-limited, audience-bound tokens for pods, replacing the long-lived, cluster-wide service account tokens that were a persistent security concern

# Bound service account token — expires, audience-restricted
# Projected volume mounts a time-limited token (default 1h expiry)
volumes:
- name: token
  projected:
    sources:
    - serviceAccountToken:
        audience: api
        expirationSeconds: 3600
        path: token

The bound token change was significant from a security perspective: previously, a service account token extracted from a pod would be valid indefinitely, for any audience. Projected tokens expire and are tied to a specific audience.
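
You can see both properties by decoding a token's claims. The sketch below reads the payload of a JWT and checks the exp and aud claims. It deliberately does not verify the signature, so treat it as an inspection aid and never as an authentication check; the function names are hypothetical.

```python
import base64
import json
import time


def decode_claims(token):
    """Decode the JWT payload segment WITHOUT verifying the signature."""
    payload = token.split(".")[1]
    padded = payload + "=" * (-len(payload) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


def token_status(token, expected_audience, now=None):
    """Report whether the token is expired and bound to the expected audience."""
    claims = decode_claims(token)
    now = time.time() if now is None else now
    audiences = claims.get("aud", [])
    if isinstance(audiences, str):
        audiences = [audiences]
    return {
        "expired": claims["exp"] <= now,
        "audience_ok": expected_audience in audiences,
    }
```

A legacy secret-based token carries no exp claim at all, which is exactly the gap the projected token closes.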


Pod Security Admission (Alpha in Kubernetes 1.22, GA in 1.25)

The replacement for PodSecurityPolicy was Pod Security Admission — an admission controller built into the API server (no webhook required) that enforces the three Pod Security Standards at the namespace level:

# Namespace-level security enforcement
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.25
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: v1.25

The three modes:
  • enforce: Reject pods that violate the policy
  • audit: Allow the pod but record the violation as an annotation on the audit log entry
  • warn: Allow the pod and send a warning to the client

Pod Security Admission is deliberately simpler than PSP. It does less — it enforces three fixed profiles, not arbitrary rules. For arbitrary policy, you still need OPA/Gatekeeper or Kyverno. But the simplicity means it works reliably, with no authorization edge cases.
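
The interaction between the three modes fits in a small function. This is a toy model, not the admission controller: it assumes you already know the strictest profile a pod complies with, and it treats a missing label as privileged, which matches the built-in default.

```python
# Profiles ordered from least to most strict.
LEVELS = ["privileged", "baseline", "restricted"]


def evaluate(namespace_labels, pod_satisfies):
    """Model the outcome of the three modes for one pod.

    `pod_satisfies` is the strictest profile the pod spec complies with.
    """
    def violates(mode):
        label = f"pod-security.kubernetes.io/{mode}"
        required = namespace_labels.get(label, "privileged")
        return LEVELS.index(pod_satisfies) < LEVELS.index(required)

    return {
        "admitted": not violates("enforce"),
        "warning": violates("warn"),
        "audit_annotation": violates("audit"),
    }
```

With the production namespace labels shown earlier, a baseline-only pod is rejected and the client also sees a warning; setting only warn and audit first is a common way to trial a stricter profile before enforcing it.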


Kubernetes 1.23 — Dual-Stack Stable, HPA v2 Stable (December 2021)

  • IPv4/IPv6 dual-stack stable: Pods and Services can have both IPv4 and IPv6 addresses — critical for organizations running mixed-stack networks or migrating from IPv4 to IPv6
  • HPA v2 stable: Horizontal Pod Autoscaler with support for multiple metrics (CPU, memory, custom metrics from Prometheus, external metrics). Scale on Prometheus metrics, not just CPU:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 1000m   # milli-units: 1000m = 1 request per second per pod
  • FlexVolume deprecated (in favor of CSI): Another step in the driver out-of-tree migration
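
The scaling decision behind the HPA manifest above follows a documented ratio: desired replicas equal the current count multiplied by current metric over target metric, rounded up. The sketch below applies that ratio and the min/max bounds. It ignores the tolerance band and stabilisation windows the real controller uses, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas):
    """Core HPA ratio, clamped to the configured replica bounds."""
    raw = math.ceil(current_replicas * (current_value / target_value))
    return max(min_replicas, min(max_replicas, raw))
```

With 4 replicas averaging 2.5 requests per second against a target of 1, the function asks for 10 replicas; the minReplicas and maxReplicas values from the manifest keep the result between 2 and 20.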

The Log4Shell Moment (December 2021)

Log4Shell (CVE-2021-44228) hit on December 9, 2021. The vulnerability allowed unauthenticated remote code execution in Java applications that logged attacker-controlled input with an affected Log4j 2 release (2.0-beta9 through 2.14.1). The blast radius was enormous: Log4j was in everything.

For Kubernetes operators, Log4Shell crystallized several operational realities:

Inventory problem: Do you know which of your pods is running a Java application? Do you know which version of Log4j it includes? Without an SBOM pipeline and admission-time image scanning, you probably don’t have a reliable answer.

Patch velocity problem: Once you know which images are vulnerable, how quickly can you rebuild and redeploy? Organizations with GitOps pipelines and image update automation (Flux’s image reflector, ArgoCD Image Updater) could respond in hours. Organizations without this infrastructure measured response time in days.

Runtime detection problem: Can you detect exploitation attempts in real time? Falco rules for Log4Shell JNDI lookup patterns were available within hours of disclosure — but only organizations already running Falco could use them.

Log4Shell made the case for supply chain security, image scanning, SBOM generation, and runtime detection tooling more effectively than any conference talk.
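
The inventory question becomes a query once SBOMs exist. The sketch below scans a CycloneDX-style component list for log4j-core versions below 2.15.0, the first release to address CVE-2021-44228. It is a simplification: the version parsing ignores pre-release suffixes, later Log4j CVEs required further upgrades, and the sample layout is assumed rather than taken from a real SBOM.

```python
import re


def parse_version(text):
    """Parse the leading major.minor[.patch] of a version string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if not match:
        return None
    return tuple(int(part or 0) for part in match.groups())


def vulnerable_components(sbom):
    """Return (name, version) for log4j-core entries in the affected range."""
    hits = []
    for component in sbom.get("components", []):
        if component.get("name") != "log4j-core":
            continue
        version = parse_version(component.get("version", ""))
        if version is not None and (2, 0, 0) <= version < (2, 15, 0):
            hits.append((component["name"], component["version"]))
    return hits
```

Run against every image's SBOM, this answers "which workloads are exposed" in minutes rather than days.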


Sigstore and the Supply Chain Response

In 2021, Sigstore reached a point where its tooling — cosign (image signing), rekor (transparency log), fulcio (keyless signing via OIDC) — was production-ready.

The keyless signing model was significant: instead of managing long-lived signing keys (which themselves become a supply chain risk), fulcio issues short-lived certificates tied to an OIDC identity (a GitHub Actions workflow, a GitLab CI job). The signature proves that a specific workflow built the image.

# Sign an image as part of CI (keyless, OIDC-based)
cosign sign --yes ghcr.io/org/app:v1.0.0

# Verify before deploying
cosign verify \
  --certificate-identity-regexp "https://github.com/org/app/.github/workflows/build.yml" \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  ghcr.io/org/app:v1.0.0

Policy engines (OPA/Gatekeeper, Kyverno) could be configured to reject pods using unsigned or unverified images at admission time — closing the loop from build provenance to runtime enforcement.


Key Takeaways

  • Dockershim deprecation in 1.20 was about removing the non-standard interface, not about dropping Docker image compatibility — containers built with Docker run on containerd or CRI-O without changes
  • The API removals in 1.22 were operationally painful but necessary — beta APIs in Kubernetes are not production-stable commitments
  • Pod Security Admission (PSP’s replacement) trades power for reliability — three fixed profiles enforced at the namespace level, built into the API server, no authorization edge cases
  • SolarWinds and Log4Shell made supply chain security a board-level concern; Sigstore, SBOM, and admission-time image verification moved from “nice to have” to operational requirements
  • Bound service account tokens (1.22 stable) addressed a persistent security gap: pod tokens that expire and are audience-restricted rather than long-lived cluster-wide credentials

What’s Next

← EP04: The Operator Era | EP06: The Runtime Reckoning →

Series: Kubernetes: From Borg to Platform Engineering | linuxcent.com

The Operator Era: Stateful Workloads, Service Mesh, and the Cloud-Native Stack (2018–2020)

Reading Time: 6 minutes


Introduction

By 2018, Kubernetes had won the orchestration market. The question was no longer “which orchestrator?” — it was “how do we run complex workloads on it, and how do we do it safely?”

The 2018–2020 period is defined by three parallel tracks: the Operator pattern maturing into a serious engineering discipline, the service mesh debate consuming enormous community energy, and the security model evolving from “trust everything in the cluster” toward something resembling defense-in-depth.


The OperatorHub Era

The Operator pattern, introduced by CoreOS engineers in 2016, reached critical mass in 2018–2019. In February 2019, Red Hat (together with AWS, Google Cloud, and Microsoft) launched OperatorHub.io, a registry for Kubernetes Operators covering databases (PostgreSQL, MongoDB, CockroachDB), messaging (Kafka, RabbitMQ), monitoring (Prometheus), and more.

The Operator SDK (Red Hat, 2018) gave teams a framework for building Operators in Go, Ansible, or Helm — lowering the barrier from “you need to write a Kubernetes controller from scratch” to “fill in the reconciliation logic.”
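
The "reconciliation logic" is the heart of any Operator: compare desired state with observed state and emit the actions that close the gap. The sketch below shows that comparison in its simplest form. It is not Operator SDK code; the resource names and dictionary shapes are hypothetical.

```python
def reconcile(desired, observed):
    """Return the actions needed to move `observed` toward `desired`.

    Both arguments map a resource name to its spec.
    """
    actions = []
    for name, spec in desired.items():
        if name not in observed:
            actions.append(("create", name))
        elif observed[name] != spec:
            actions.append(("update", name))
    for name in observed:
        if name not in desired:
            actions.append(("delete", name))
    return actions
```

The useful property is idempotence: once observed matches desired, reconcile returns an empty list, so the loop can run on every event without side effects.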

The maturity model for Operators was codified into five levels:

Level | Capability
1     | Basic Install: automated deployment
2     | Seamless Upgrades: patch and minor version upgrades
3     | Full Lifecycle: backup, failure recovery
4     | Deep Insights: metrics, alerts, log processing
5     | Auto Pilot: horizontal/vertical scaling, auto-config tuning

Most production Operators in 2019 were at Level 1–2. Getting to Level 3+ required encoding significant domain knowledge — the kind that previously lived in a senior database administrator’s head.


Kubernetes 1.11 — CoreDNS Default, Load Balancing Stable (June 2018)

  • CoreDNS reached GA as a cluster DNS provider and became the kubeadm default in place of kube-dns. CoreDNS is plugin-based: you can extend it for custom DNS resolution logic (split DNS, external name resolution, DNS-based service discovery for non-Kubernetes services)
  • IPVS-based kube-proxy stable: IPVS (IP Virtual Server) became a stable alternative to the default iptables mode for Service load balancing, enabling O(1) service routing instead of O(n) iptables rule traversal, which is critical at scale
  • TLS bootstrapping: Kubelets request and rotate their certificates automatically, so they no longer need manual certificate management (the feature reached GA one release later, in 1.12)

The IPVS kube-proxy mode is a good example of a performance improvement that also has operational implications. iptables rule evaluation degrades linearly with rule count; at 10,000+ services, iptables becomes a performance and debuggability problem. IPVS uses a hash table, giving O(1) lookups regardless of service count.
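
The difference between the two lookup strategies is easy to demonstrate outside the kernel. The sketch below contrasts a sequential rule walk with a hash table lookup; it models only the algorithmic shape, not how iptables or IPVS are implemented, and the function names are hypothetical.

```python
def linear_match(rules, vip):
    """iptables-style: walk rules in order; return (backend, rules_examined)."""
    for examined, (rule_vip, backend) in enumerate(rules, start=1):
        if rule_vip == vip:
            return backend, examined
    return None, len(rules)


def hash_match(table, vip):
    """IPVS-style: a single hash lookup, independent of table size."""
    return table.get(vip)
```

With 10,000 rules, matching the last service examines all 10,000 entries in the linear version, while the dictionary lookup does the same amount of work as it would with ten.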


Kubernetes 1.12 — 1.13: Amazon EKS, Runtime Security (September–December 2018)

Amazon EKS Goes GA (June 2018)

Amazon EKS became generally available in June 2018. This was significant not just for AWS customers but for the entire ecosystem: EKS’s launch meant every major cloud provider now had a production-grade managed Kubernetes offering.

EKS’s initial release was deliberately limited — managed control plane, self-managed worker nodes. This contrasted with GKE’s more automated approach, and the community noticed. GKE had been running managed Kubernetes longer, and it showed in feature completeness.

1.12 (September 2018)

  • RuntimeClass alpha: A mechanism to specify which container runtime to use for a pod — containerd, gVisor, Kata Containers. The foundation for confidential computing workloads where you want hardware-isolated containers
  • RBAC delegation: Service accounts could now grant RBAC permissions they themselves held — enabling Operators to manage RBAC for the applications they deploy
  • Volume snapshot alpha: Create point-in-time snapshots of PersistentVolumes — the beginning of Kubernetes-native backup primitives

1.13 (December 2018)

  • kubeadm graduates to GA: The cluster bootstrapping tool was now stable and recommended for production
  • CoreDNS default: CoreDNS replaced kube-dns as the default cluster DNS server
  • CSI stable: Storage drivers could be shipped entirely out of tree

Kubernetes 1.14 — Windows Containers Go Stable (March 2019)

Windows Server container support graduated to stable in 1.14. For the first time, Kubernetes clusters could run Windows workloads as first-class citizens — .NET Framework applications, IIS, SQL Server containers alongside Linux-based microservices.

The implementation required significant work: Windows containers have different networking models, different filesystem semantics, and different process models than Linux containers. Making them a first-class Kubernetes citizen meant handling all of those differences in the node components.

Also in 1.14:
  • PersistentVolume and StorageClass improvements
  • kubectl improvements: kubectl diff shows what would change before applying a manifest


The PodSecurityPolicy Problem

PodSecurityPolicy (PSP) was alpha in Kubernetes 1.3, beta in 1.8, and would remain in beta until it was deprecated in 1.21. It was simultaneously the most important security primitive in Kubernetes and the most broken.

PSP let administrators define what a pod was allowed to do:

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  volumes:
    - 'configMap'
    - 'emptyDir'
    - 'projected'
    - 'secret'
    - 'downwardAPI'
    - 'persistentVolumeClaim'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: MustRunAs
    ranges:
      - min: 1
        max: 65535
  fsGroup:
    rule: MustRunAs
    ranges:
      - min: 1
        max: 65535
  readOnlyRootFilesystem: false

The problem: the admission mechanism was confusing, the UX was hostile, and the authorization model (who could use which PSP) led to privilege escalation paths that were non-obvious. Many teams either disabled PSP entirely or created a permissive policy that made it functionally useless.

The community would spend years working toward a replacement. PSP was deprecated in 1.21 (2021) and removed in 1.25 (2022). Its replacement, Pod Security Admission, is discussed in EP05.


Kubernetes 1.15 — 1.17: Custom Resource Maturity (2019)

1.15 (June 2019)

  • CRDs continue maturing: Structural schemas, pruning of unknown fields — making CRDs behave more like first-class API types
  • Kustomize integrated into kubectl: Template-free Kubernetes configuration customization. Where Helm uses Go templates, Kustomize uses overlays — a base configuration plus environment-specific patches
# kustomization.yaml — base + production overlay
bases:
  - ../../base
patches:
  - deployment-replicas.yaml
  - resource-limits.yaml
configMapGenerator:
  - name: app-config
    literals:
      - ENV=production
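
To make the overlay idea concrete, here is a minimal Python sketch of a base manifest being patched by an environment-specific overlay. It is a simplified recursive dictionary merge, not Kustomize's actual merge algorithm, and the names apply_overlay, base, and production_patch are hypothetical.

```python
import copy


def apply_overlay(base: dict, patch: dict) -> dict:
    """Return a new dict: base with patch merged in recursively.

    Simplified illustration only: lists are replaced wholesale,
    whereas real Kustomize can merge some lists by key.
    """
    result = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Both sides are mappings: merge them field by field.
            result[key] = apply_overlay(result[key], value)
        else:
            # Scalars and lists from the patch win over the base.
            result[key] = copy.deepcopy(value)
    return result


# Hypothetical base Deployment fragment and production patch.
base = {
    "kind": "Deployment",
    "metadata": {"name": "app"},
    "spec": {"replicas": 1, "strategy": {"type": "RollingUpdate"}},
}
production_patch = {"spec": {"replicas": 5}}

production = apply_overlay(base, production_patch)
print(production["spec"])
```

The base is never modified, which mirrors the point of overlays: one shared base, with each environment described only by its differences.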

1.16 (September 2019)

  • CRDs graduate to GA (apiextensions.k8s.io/v1, replacing apiextensions.k8s.io/v1beta1)
  • Admission webhooks stable: Validating and mutating webhooks that intercept every API request. This is the foundation for OPA/Gatekeeper, Kyverno, and all policy-as-code enforcement in Kubernetes

The admission webhook framework’s graduation to stable in 1.16 was more significant than it appeared. It meant that any security policy engine — OPA/Gatekeeper, Kyverno, Styra, etc. — could now enforce policies on any Kubernetes resource creation or modification, using a stable, documented API.

  • Removal of several deprecated beta APIs: extensions/v1beta1 Deployments, DaemonSets, ReplicaSets — a preview of the more aggressive API cleanup that would come in 1.22

1.17 (December 2019)

  • Volume snapshots beta
  • Cloud Provider labels stable

OPA/Gatekeeper: Policy as Code Enters the Mainstream

Open Policy Agent (OPA) + Gatekeeper emerged as the policy engine of choice for Kubernetes in 2019. Gatekeeper uses the admission webhook framework to intercept API requests and evaluate them against Rego policies:

# Deny containers that explicitly set runAsUser: 0
# (containers that omit securityContext do not match this rule)
package kubernetes.admission

deny[msg] {
  input.request.kind.kind == "Pod"
  container := input.request.object.spec.containers[_]
  container.securityContext.runAsUser == 0
  msg := sprintf("Container %v must not run as root", [container.name])
}

The OPA/Gatekeeper model represented a shift in security thinking: instead of configuring security at the cluster level, you codify security policy in a language (Rego) and enforce it uniformly across all admission requests. Policies can be tested, versioned, and reviewed like code.
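
Because the policy is just logic over the admission request, the same check can be sketched in plain Python. The version below is an illustration, not Gatekeeper's implementation: the function name review_pod and the sample request are hypothetical, and it is deliberately stricter than the Rego rule above in that it also denies containers that neither set a non-zero runAsUser nor set runAsNonRoot. Whether to deny the unset case is a policy choice. The return value follows the admission.k8s.io/v1 AdmissionReview response shape.

```python
def review_pod(admission_review: dict) -> dict:
    """Build an AdmissionReview response denying Pods that may run as root."""
    request = admission_review["request"]
    messages = []

    if request["kind"]["kind"] == "Pod":
        spec = request["object"]["spec"]
        pod_ctx = spec.get("securityContext") or {}
        for container in spec.get("containers", []):
            ctx = container.get("securityContext") or {}
            # Container-level settings take precedence over pod-level ones.
            run_as_user = ctx.get("runAsUser", pod_ctx.get("runAsUser"))
            run_as_non_root = ctx.get(
                "runAsNonRoot", pod_ctx.get("runAsNonRoot", False)
            )
            explicit_root = run_as_user == 0
            unset_and_unguarded = run_as_user is None and not run_as_non_root
            if explicit_root or unset_and_unguarded:
                messages.append(
                    f"Container {container['name']} must not run as root"
                )

    response = {"uid": request["uid"], "allowed": not messages}
    if messages:
        response["status"] = {"message": "; ".join(messages)}
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }


# Hypothetical request: a Pod whose only container sets no securityContext.
sample = {
    "request": {
        "uid": "example-uid",
        "kind": {"kind": "Pod"},
        "object": {"spec": {"containers": [{"name": "web"}]}},
    }
}
print(review_pod(sample)["response"])
```

Being an ordinary function, the policy can be unit-tested and code-reviewed before it ever sits in front of an API server.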


Kubernetes 1.18 — Topology-Aware Routing, Immutability (March 2020)

  • Topology-aware service routing alpha: Route service traffic to endpoints in the same zone/node as the caller — reducing cross-zone data transfer costs and latency
  • Immutable ConfigMaps and Secrets alpha: Mark a ConfigMap or Secret as immutable — the API server rejects updates, preventing accidental mutation of configuration that applications have already loaded
  • IngressClass: A mechanism to specify which Ingress controller should handle an Ingress resource — enabling multiple ingress controllers in the same cluster
# Immutable secret — once set, cannot be changed
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
immutable: true
data:
  password: dGhpcyBpcyBhIHRlc3Q=
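
The rule the API server applies to an immutable object can be modelled in a few lines. The sketch below is a toy model under stated assumptions, not the API server's code: validate_update and ImmutableFieldError are hypothetical names, and only the data and binaryData fields are compared. It captures the two documented behaviours: the contents cannot change, and the immutable flag cannot be reverted once set.

```python
class ImmutableFieldError(ValueError):
    """Raised when an update would modify an immutable object."""


def validate_update(old: dict, new: dict) -> None:
    """Reject updates that change an immutable ConfigMap or Secret."""
    if old.get("immutable") is not True:
        return  # mutable objects accept any update in this toy model
    if new.get("immutable") is not True:
        raise ImmutableFieldError("immutable cannot be unset once true")
    for field in ("data", "binaryData"):
        if old.get(field) != new.get(field):
            raise ImmutableFieldError(f"{field} is immutable")


current = {
    "kind": "Secret",
    "metadata": {"name": "db-credentials"},
    "immutable": True,
    "data": {"password": "dGhpcyBpcyBhIHRlc3Q="},
}
attempted = dict(current, data={"password": "bmV3LXZhbHVl"})

try:
    validate_update(current, attempted)
    print("update accepted")
except ImmutableFieldError as error:
    print(f"update rejected: {error}")
```

In practice this means rotating an immutable Secret is a delete-and-recreate operation, usually under a new name that workloads are then pointed at.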

The Falco Adoption Wave

CNCF-donated Falco (originated by Sysdig) became the standard tool for Kubernetes runtime security in this period. Falco uses eBPF probes or kernel modules to monitor syscalls and generate alerts based on rules:

# Falco rule: detect shell spawned in a container
- rule: Terminal shell in container
  desc: A shell was spawned in a container
  condition: >
    spawned_process and container and
    shell_procs and proc.tty != 0
  output: >
    A shell was spawned in a container
    (user=%user.name container=%container.name
     shell=%proc.name parent=%proc.pname)
  priority: WARNING

Falco addressed the gap that PodSecurityPolicy couldn’t: admission-time policy prevents known-bad configurations from running, but it can’t detect a compromise that happens at runtime — a shell spawned by an exploited web application, for example.
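
To show how a rule condition like the one above is evaluated against a stream of events, here is a small Python sketch. It is an approximation, not Falco's engine: the macro expansions (what counts as spawned_process, container, and shell_procs) and the shell list are assumptions made for the example, and the event dictionaries are hypothetical.

```python
SHELL_BINARIES = {"sh", "bash", "zsh", "dash", "ksh"}  # assumed shell list


def terminal_shell_in_container(event: dict) -> bool:
    """Approximate the 'Terminal shell in container' condition."""
    spawned_process = event.get("evt.type") in {"execve", "execveat"}
    in_container = event.get("container.id", "host") != "host"
    is_shell = event.get("proc.name") in SHELL_BINARIES
    has_tty = event.get("proc.tty", 0) != 0
    return spawned_process and in_container and is_shell and has_tty


# Hypothetical events as they might arrive from the kernel-side probe.
events = [
    {"evt.type": "execve", "container.id": "3f2a9c", "proc.name": "bash",
     "proc.tty": 34816, "user.name": "www-data"},
    {"evt.type": "execve", "container.id": "3f2a9c", "proc.name": "nginx",
     "proc.tty": 0, "user.name": "www-data"},
    {"evt.type": "execve", "container.id": "host", "proc.name": "bash",
     "proc.tty": 34816, "user.name": "admin"},
]

alerts = [e for e in events if terminal_shell_in_container(e)]
for alert in alerts:
    print(f"WARNING shell in container {alert['container.id']} "
          f"(user={alert['user.name']} shell={alert['proc.name']})")
```

Only the first event matches: the second is not a shell and has no terminal, and the third runs on the host rather than in a container.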


The Service Mesh Exhaustion

By 2019, the service mesh landscape was producing more overhead than value for many teams. Istio’s operational complexity — its control plane components, its sidecar injection model, its frequent breaking changes between versions — burned teams that adopted it early.

The community questions were real: do you actually need mTLS between every service in your cluster? Is the operational cost of a service mesh worth the security benefit for every organization?

Linkerd 2.x (Buoyant) positioned itself as the lightweight alternative — simpler to operate, less configuration surface, Rust-based proxy instead of Envoy. For teams that wanted the security benefit (mTLS) without the complexity cost, Linkerd 2.x was often the better choice.

The honest answer in 2019-2020: service meshes were the right architecture for organizations with hundreds of services and dedicated platform teams. For most organizations, they were complexity that outpaced the threat model.


Key Takeaways

  • The Operator pattern matured from a pattern into an engineering discipline with tooling (Operator SDK), a registry (OperatorHub), and a capability maturity model
  • EKS going GA completed the managed Kubernetes trifecta — every major cloud provider was now committed
  • CRDs graduating to stable in 1.16 was the foundation for everything built on Kubernetes extensibility — Operators, policy engines, GitOps tools
  • Admission webhooks graduating to stable enabled the policy-as-code ecosystem (OPA/Gatekeeper, Kyverno) — the only viable alternative to PSP’s broken model
  • Falco established runtime security as a distinct discipline from admission-time policy enforcement
  • Service mesh adoption was real but the complexity cost was frequently underestimated; many teams that adopted Istio in 2018-2019 spent 2019-2020 managing it

What’s Next

← EP03: Enterprise Awakening | EP05: Security Hardens →

Series: Kubernetes: From Borg to Platform Engineering | linuxcent.com

BPF Verifier Explained: Why eBPF Is Safe for Production Kubernetes

Reading Time: 10 minutes



~2,400 words · Reading time: 9 min · Series: eBPF: From Kernel to Cloud, Episode 2 of 18

In Episode 1, we established what eBPF is and why it gives Linux admins and DevOps engineers kernel-level visibility without sidecars or code changes. The obvious follow-up question is the one every experienced engineer should ask before running anything in kernel space:

Is it actually safe to run on production nodes?

The answer is yes — and the reason is one specific component of the Linux kernel called the BPF verifier. This post explains what the verifier is, what it protects your cluster from, and why it changes the risk calculus for eBPF-based tools entirely.


Architecture Overview

Figure: BPF verifier and JIT pipeline, showing how eBPF programs are safety-checked and compiled before kernel execution. The verifier runs before every eBPF load and rejects unsafe programs before they touch the kernel.

TL;DR

  • The BPF verifier is a static analysis pass that runs before every eBPF program loads — it rejects unsafe programs before they touch the kernel
  • It prevents infinite loops (only bounded loops allowed), out-of-bounds memory access, null pointer dereferences, and privilege escalation via kernel pointer leaks
  • Unlike kernel modules, a verified eBPF program cannot kernel-panic your node — that guarantee is why eBPF-based tools are safe in production
  • Every eBPF-based tool you run (Cilium, Falco, Tetragon, Datadog) passes its programs through the verifier every time they are loaded, on every node
  • Ask three questions before adopting any eBPF tool: minimum kernel version required, CO-RE support (portable across kernels), and which program types it uses
  • Loading an eBPF program still requires CAP_BPF (kernel 5.8+) or CAP_SYS_ADMIN: the verifier constrains what a loaded program can do, not who is allowed to load one

The Fear That Holds Most Teams Back

When I first explain eBPF to Linux admins and DevOps engineers, the reaction is almost always the same:

“So it runs code inside the kernel? On our production nodes? That sounds like a disaster waiting to happen.”

It is a completely reasonable concern. The Linux kernel is not a place where mistakes are tolerated. A buggy kernel module can take down a server instantly — no warning, no graceful shutdown, just a hard panic and a 3 AM phone call.

I know this from personal experience. During 2012–2014, I worked briefly with Linux device driver code. That period taught me one thing clearly: kernel space does not forgive careless code.

So when people started talking about running programs inside the kernel via eBPF, my instinct was scepticism too. Then I understood the BPF verifier. And everything changed.


What the Verifier Actually Is

Think of the BPF verifier as a strict safety gate that sits between your eBPF program and the kernel. Before your eBPF program is allowed to run — before it touches a single system call, network packet, or container event — the verifier reads through every line of it and asks one question:

“Could this program crash or compromise the kernel?”

If the answer is yes, or even maybe, the program is rejected. It does not load. Your cluster stays safe. If the answer is a provable no, the program loads and runs.

This is not a runtime check that catches problems after the fact. It is a load-time guarantee — the kernel proves the program is safe before it ever executes. Here is what that looks like when you deploy Cilium:

You run: kubectl apply -f cilium-daemonset.yaml
         └─► Cilium loads its eBPF programs onto each node
                   └─► Kernel verifier checks every program
                             ├─► SAFE   → program loads, starts observing
                             └─► UNSAFE → rejected, cluster untouched

This is why Cilium can replace kube-proxy on your nodes, why Falco can watch every syscall in every container, and why Tetragon can enforce security policy at the kernel level — all without putting your cluster at risk.


What the Verifier Protects You From

You do not need to know how the verifier works internally. What matters is what it prevents — and why each protection matters specifically in Kubernetes environments.

Infinite loops

An eBPF program that never terminates would freeze the kernel event it is attached to — potentially hanging every container on that node. The verifier rejects any program it cannot prove will finish executing within a bounded number of instructions.

Why this matters: Every eBPF-based tool on your K8s nodes (Cilium, Falco, Tetragon, Hubble) has its programs checked for termination on every code path each time they load on your node. You are not trusting the vendor’s claim. The kernel enforces it.
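
The termination check is easier to picture with a toy model. The Python sketch below is not the kernel's verifier: it uses a made-up instruction format of (opcode, jump target) pairs and applies the most conservative rule possible, rejecting any backward jump, which is how early kernels behaved before bounded loops were accepted in Linux 5.3. The names toy_verify, JUMPS, and the max_instructions parameter are hypothetical.

```python
from typing import List, Tuple

Instruction = Tuple[str, int]  # (opcode, jump target index; unused for non-jumps)
JUMPS = {"jmp", "jmp_if"}


def toy_verify(program: List[Instruction],
               max_instructions: int = 4096) -> Tuple[bool, str]:
    """Accept a program only if it provably terminates.

    With forward-only jumps the program counter strictly increases,
    so execution finishes in at most len(program) steps.
    """
    if not program or program[-1][0] != "exit":
        return False, "program must end with exit"
    if len(program) > max_instructions:
        return False, "program too large"
    for index, (opcode, target) in enumerate(program):
        if opcode not in JUMPS:
            continue
        if target >= len(program):
            return False, f"instruction {index}: jump out of range"
        if target <= index:
            return False, f"instruction {index}: backward jump (possible unbounded loop)"
    return True, "ok"


straight_line = [("mov", 0), ("jmp_if", 3), ("add", 0), ("exit", 0)]
looping = [("mov", 0), ("add", 0), ("jmp", 1), ("exit", 0)]

print(toy_verify(straight_line))
print(toy_verify(looping))
```

The real verifier is far more capable, since it can prove that some loops are bounded, but the principle is the same: a program that cannot be shown to terminate is refused at load time.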

Memory safety violations

An eBPF program cannot read or write memory outside the boundaries it is explicitly granted. No reaching into another container’s memory space. No accessing kernel data structures it was not given permission to touch.

Why this matters: This is the property that makes eBPF acceptable on multi-tenant nodes. A bug in a Falco or Cilium program cannot turn into a stray write into another workload’s memory or into kernel data structures, because the verifier rejects unchecked memory access at load time rather than relying on policy.

Kernel crashes

The verifier checks that every pointer is valid before it is dereferenced, that every function call uses correct arguments, and that the program cannot corrupt kernel data structures. Programs that could cause a kernel panic are rejected before they load.

Why this matters: Running Cilium or Tetragon on a production node is not the same risk as loading an untested kernel module. The verifier has already proven these programs cannot crash your nodes — before they ever ran on your infrastructure.
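
The pointer check can be sketched the same way. In the toy Python model below, a map lookup yields a pointer that may be NULL, and the checker refuses any dereference that is not preceded by a null check on the same register. The instruction names (map_lookup, null_check, deref) and the function check_null_safety are hypothetical, and the model handles only straight-line code, whereas the real verifier tracks pointer state across every branch.

```python
from typing import Dict, List, Tuple

Op = Tuple[str, str]  # (operation, register name)


def check_null_safety(program: List[Op]) -> Tuple[bool, str]:
    """Reject programs that dereference a pointer that may still be NULL."""
    state: Dict[str, str] = {}  # register -> "maybe_null" or "non_null"
    for index, (op, reg) in enumerate(program):
        if op == "map_lookup":
            # A lookup can miss, so the result may be NULL.
            state[reg] = "maybe_null"
        elif op == "null_check":
            # Models the code path taken after 'if (ptr != NULL)'.
            if reg in state:
                state[reg] = "non_null"
        elif op == "deref":
            if state.get(reg) != "non_null":
                return False, f"instruction {index}: {reg} is not a checked pointer"
    return True, "ok"


safe = [("map_lookup", "r0"), ("null_check", "r0"), ("deref", "r0")]
unsafe = [("map_lookup", "r0"), ("deref", "r0")]

print(check_null_safety(safe))
print(check_null_safety(unsafe))
```

A kernel module with the second pattern would compile, load, and crash the node the first time the lookup missed. An eBPF program with that pattern never loads.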

Privilege escalation and kernel pointer leaks

eBPF programs cannot leak kernel memory addresses to userspace. This closes a class of container escape and privilege escalation attacks that have historically been possible through kernel module vulnerabilities.

Why this matters: Security tools built on eBPF — like Tetragon, which detects and blocks container escape attempts in real time — are not themselves a vector for the attacks they protect against.


eBPF vs Traditional Observability Agents

To appreciate what the verifier gives you operationally, compare the two main approaches to K8s observability.

Traditional agent — DaemonSet sidecar approach

Your K8s cluster
└─► Node
    ├─► App Pod (your service)
    ├─► Sidecar container (injected into every pod)
    │   └─► Reads /proc, intercepts syscalls via ptrace
    │       └─► 15–30% CPU/memory overhead per pod
    └─► Agent DaemonSet Pod
        └─► Aggregates data from all sidecars

Problems with this model:

  • Sidecar injection requires modifying every pod spec and typically an admission webhook
  • ptrace-based interception adds 50–100% overhead to the traced process and is blocked in hardened containers
  • The agent runs in userspace with elevated privileges — a larger attack surface
  • Updating the agent requires pod restarts across your fleet

eBPF-based tool — Cilium / Falco / Tetragon

Your K8s cluster
└─► Node
    ├─► App Pod (your service — completely unmodified)
    ├─► App Pod (another service — also unmodified)
    └─► eBPF programs (inside the kernel, verifier-checked)
        └─► See every syscall, network packet, file access
            └─► Forward events to userspace agent via ring buffer

Benefits:

  • No sidecar injection — pod specs stay clean, no admission webhook required
  • Kernel-level visibility with near-zero overhead (typically 1–3%)
  • The verifier guarantees the eBPF programs cannot harm your nodes
  • Works identically with Docker, containerd, and CRI-O

Tools You Are Probably Already Running — All Verifier-Protected

You may already be running eBPF on your nodes without thinking about it explicitly. In each case below, the verifier ran before the tool ever touched your cluster.

Tool | How the verifier is involved
Cilium | Every network policy decision, service load-balancing operation, and Hubble flow log is handled by eBPF programs that passed the verifier at node startup.
Falco | Syscall events are captured by a verifier-checked eBPF program attached to syscall hooks, then matched against Falco rules in userspace. Sub-millisecond detection is only possible because the capture runs in kernel space.
AWS VPC CNI | On EKS, networking operations have progressively moved to eBPF for performance at scale. If you are on a recent EKS AMI, eBPF is already doing work on your nodes.
systemd | Modern systemd uses eBPF for cgroup-based resource accounting and network traffic control. Active on most current Ubuntu, RHEL, and Amazon Linux 2023 installations.

Questions to Ask When Evaluating eBPF Tools

When a vendor tells you their tool uses eBPF, these three questions will quickly tell you how mature their implementation is.

1. What kernel version do you require?

The verifier’s capabilities have expanded significantly across kernel versions. Tools targeting kernel 5.8+ can use more powerful features safely. Tools claiming to work on kernel 4.x are constrained by an older, more limited verifier. The table below shows exactly where each major distribution stands.

Distribution | Default kernel | eBPF support level | Notes
Ubuntu 16.04 LTS | 4.4 | Basic eBPF only | No BTF. kprobes and socket filters work, but modern tooling like Cilium and the Falco eBPF driver will not run. EOL: do not use for new deployments.
Ubuntu 18.04 LTS | 4.15 | eBPF, no BTF | No CO-RE. Tools must be compiled against the exact running kernel headers. The HWE kernel (5.4) improves this but BTF still varies by build.
Ubuntu 20.04 LTS | 5.4 | BTF available, verify before use | CO-RE capable on most deployments. CONFIG_DEBUG_INFO_BTF was absent on some early builds. Verify with ls /sys/kernel/btf/vmlinux before deploying eBPF tooling. Cloud images generally have it enabled.
Ubuntu 20.10+ | 5.8 | Full BTF + CO-RE | First Ubuntu release where BTF was consistently enabled by default. Ring buffers available. Not an LTS release: use 22.04 for production.
Ubuntu 22.04 LTS | 5.15 | Full modern eBPF, production ready | BTF embedded. Ring buffers, global variables, LSM hooks. Default baseline for EKS-optimised Ubuntu AMIs. Recommended for new deployments.
Ubuntu 24.04 LTS | 6.8 | Full modern eBPF + latest features | Open-coded iterators, improved verifier precision, enhanced LSM support. Best Ubuntu option for cutting-edge eBPF tooling today.
Debian 10 (Buster) | 4.19 | Basic eBPF, no BTF | eBPF programs load but CO-RE is unavailable. Must compile against exact kernel headers. EOL: migrate to Debian 11 or 12.
Debian 11 (Bullseye) | 5.10 LTS | Full BTF + CO-RE | BTF enabled. CO-RE works. Cilium, Falco, and Tetragon all fully supported. Solid production baseline for Debian environments through 2026.
Debian 12 (Bookworm) | 6.1 LTS | Full modern eBPF, production ready | Same kernel generation as Amazon Linux 2023. LSM hooks, ring buffers, full CO-RE. Recommended Debian version for eBPF workloads today.
Debian 13 (Trixie) | 6.12 LTS | Full modern eBPF + latest features | Released August 2025. Same kernel generation as RHEL 10 / Rocky 10 / AlmaLinux 10. Maximum eBPF feature availability across all program types.
RHEL 7.6 | 3.10 (backported) | Tech Preview only, not production safe | First RHEL release to enable eBPF but explicitly marked as Tech Preview. Limited to kprobes and tracepoints. No XDP, no socket filters, no BTF. Do not use for eBPF in production.
RHEL 8 / Rocky 8 / AlmaLinux 8 | 4.18 (heavily backported) | Full BPF + BTF, functionally 5.4-equivalent | Red Hat backports make RHEL 8 kernels functionally comparable to upstream 5.4 for most eBPF use cases. BTF enabled across all releases. CO-RE works. Cilium treats RHEL 8.6+ as its minimum supported RHEL-family version.
RHEL 9 / Rocky 9 / AlmaLinux 9 | 5.14 (heavily backported) | Full modern eBPF, production ready | BTF embedded. XDP, tc, kprobe, tracepoint, and LSM hooks all supported. Falco, Cilium, and Tetragon fully supported. Recommended RHEL-family version for eBPF deployments today. Supported until 2032.
RHEL 10 / Rocky 10 / AlmaLinux 10 | 6.12 | Full modern eBPF + latest features | Same kernel generation as Debian 13 and upstream 6.12 LTS. Rocky 10 released June 2025, AlmaLinux 10 released May 2025. Enhanced eBPF functionality throughout.
Amazon Linux 2023 | 6.1+ | Full modern eBPF, production ready | BTF embedded. Full CO-RE. Recommended for EKS. Also resolves the NetworkManager deprecation issues in EKS 1.33+ (see the EKS 1.33 post).

Quick check for any distro: Run ls /sys/kernel/btf/vmlinux on your node. If the file exists, your kernel has BTF enabled and CO-RE-based eBPF tools will work correctly. If it does not exist, you are limited to tools that compile against your specific kernel headers. Run uname -r to confirm the exact kernel version.
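
If you want to run that check across a fleet rather than by hand, it is easy to script. The following Python sketch wraps the same two checks, the BTF file and the kernel release. The function names btf_available and describe_node are hypothetical, and the path is a parameter only so the logic can be exercised on machines without BTF.

```python
import os
import platform


def btf_available(btf_path: str = "/sys/kernel/btf/vmlinux") -> bool:
    """Return True if the kernel exposes BTF type information at btf_path."""
    return os.path.exists(btf_path)


def describe_node(btf_path: str = "/sys/kernel/btf/vmlinux") -> str:
    """Summarise CO-RE readiness for the machine this runs on."""
    kernel = platform.release()  # same value as `uname -r` on Linux
    if btf_available(btf_path):
        return f"kernel {kernel}: BTF present, CO-RE-based tools should load"
    return f"kernel {kernel}: no BTF, tools must be built against this kernel's headers"


print(describe_node())
```

Run it from a DaemonSet or your configuration management tool and you get a per-node inventory of which hosts are ready for CO-RE-based tooling.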

Rocky Linux and AlmaLinux note: Both distros rebuild directly from RHEL sources. Their kernel versions and eBPF capabilities are effectively identical to the corresponding RHEL release. When Cilium or Falco document “RHEL 9 support”, that applies equally to Rocky 9 and AlmaLinux 9 without any additional configuration.

2. Do you use CO-RE?

CO-RE (Compile Once, Run Everywhere) means the tool’s eBPF programs work correctly across different kernel versions without recompilation. Tools using CO-RE are more portable and significantly less likely to break after a routine node OS update. This is a reliable signal of engineering maturity in the vendor’s eBPF implementation.

3. What eBPF program types do you use?

Different program types have different privilege levels and access scopes. A tool that only needs kprobe access is asking for considerably less privilege than one requiring lsm hooks.

  • kprobe / tracepoint — observability and debugging
  • tc (traffic control) — network policy enforcement
  • xdp (eXpress Data Path) — high-performance packet processing
  • lsm (Linux Security Module) — security policy enforcement (used by Tetragon)

Understanding the program type tells you what the tool can and cannot see on your nodes, and how much kernel access you are granting it.


How Falco Uses the Verifier — A Step-by-Step Walkthrough

Here is exactly what happens when Falco starts on one of your K8s nodes, and where the verifier fits in:

1. Falco pod starts on the node (via DaemonSet)

2. Falco loads its eBPF programs into the kernel:
   └─► BPF verifier checks each program
       ├─► Can it crash the kernel?            No → continue
       ├─► Can it loop forever?                No → continue
       ├─► Can it access out-of-bounds memory? No → continue
       └─► PASS → program loads

3. Falco's eBPF programs attach to syscall hooks:
   └─► sys_enter_execve   (every process execution in every container)
   └─► sys_enter_openat   (every file open)
   └─► sys_enter_connect  (every outbound network connection)

4. A container runs an unexpected shell (potential attack):
   └─► execve() called inside the container
   └─► Falco's eBPF hook fires in kernel space
   └─► Event forwarded to Falco userspace via ring buffer
   └─► Falco rule matches: "shell spawned in container"
   └─► Alert fired in under 1 millisecond

5. Your container, your other pods, your node: completely unaffected

Step 2 is what the verifier makes safe. Without it, attaching eBPF hooks to every syscall on your production node would be an unacceptable risk. With it, Falco can offer this level of visibility with a mathematical safety guarantee.


The Bottom Line

You do not need to understand BPF bytecode, register states, or static analysis to use eBPF tools safely in production. What you do need to understand is this:

The BPF verifier is the reason eBPF is fundamentally different from kernel modules. It does not just make eBPF “safer” in a vague sense — it provides a mathematical proof that each program cannot crash your kernel before that program ever runs.

This is why eBPF-based tools can deliver deep kernel-level visibility into every container, every syscall, and every network flow — with near-zero overhead, no sidecar injection, and production safety that kernel modules could never guarantee.

The next time someone on your team hesitates about running Cilium, Falco, or Tetragon on production nodes because “it runs code in the kernel” — you now know what to tell them. The verifier already checked it. Before it ever touched your cluster.




Questions or corrections? Reach me on LinkedIn. If this was useful, the full series index is on linuxcent.com — search the eBPF Series tag for all episodes.

What Is eBPF? A Plain-English Guide for Linux and Kubernetes Engineers

Reading Time: 7 minutes



~1,900 words · Reading time: 7 min · Series: eBPF: From Kernel to Cloud, Episode 1 of 18

Your Linux kernel has had a technology built into it since 2014 that most engineers working with Linux every day have never looked at directly. You have almost certainly been using it — through Cilium, Falco, Datadog, or even systemd — without knowing it was there.

This post is the plain-English introduction to eBPF that I wished existed when I first encountered it. No kernel engineering background required. No bytecode, no BPF maps, no JIT compilation. Just a clear answer to the question every Linux admin and DevOps engineer eventually asks: what actually is eBPF, and why does it matter for the infrastructure I run every day?


Architecture Overview

Figure: eBPF architecture diagram showing program types, the verifier, the JIT compiler, and kernel hook points. eBPF sits between user space and the kernel, attaching programs to hook points without modifying kernel source.

TL;DR

  • eBPF lets you run small, safe programs inside the Linux kernel — no kernel module, no reboot, no application changes required
  • The name is a historical artefact; modern eBPF is a general-purpose kernel observability and networking platform, not a packet filter
  • Programs attach to kernel hook points (tracepoints, kprobes, socket filters) — giving you visibility into every syscall, file open, and network packet
  • You are probably already running eBPF: Cilium, Falco, Datadog, and systemd all use it under the hood
  • Safe for production because the BPF verifier rejects any program that could crash or loop — covered in depth in EP02
  • Full feature set from Linux 5.8+; meaningful production use from Linux 4.14+ (most EKS and GKE defaults qualify)

First: Forget the Name

eBPF stands for extended Berkeley Packet Filter. It is one of the most misleading names in computing for what the technology actually does.

The original BPF was a 1992 mechanism for filtering network packets — the engine behind tcpdump. The extended version, introduced in Linux 3.18 (2014) and significantly matured through Linux 5.x, is a completely different technology. It is no longer just about packets. It is no longer just about filtering.

Forget the name. Here is what eBPF actually is:

eBPF lets you run small, safe programs directly inside the Linux kernel — without writing a kernel module, without rebooting, and without modifying your applications.

That is the complete definition. Everything else is implementation detail. The one-liner above is what matters for how you use it day to day.


What the Linux Kernel Can See That Nothing Else Can

To understand why eBPF is significant, you need to understand what the Linux kernel already sees on every server and every Kubernetes node you run.

The kernel is the lowest layer of software on your machine. Every action that happens — every file opened, every process started, every network packet sent — passes through the kernel. That means it has a complete, real-time view of everything:

  • Every syscall — every open(), execve(), connect(), write() from every process in every container on the node, in real time
  • Every network packet — source, destination, port, protocol, bytes, and latency for every pod-to-pod and pod-to-external connection
  • Every process event — every fork, exec, and exit, including processes spawned inside containers that your container runtime never reports
  • Every file access — which process opened which file, when, and with what permissions, across all workloads on the node simultaneously
  • CPU and memory usage — per-process CPU time, function-level latency, and memory allocation patterns without profiling agents

The kernel has always had this visibility. The problem was that there was no safe, practical way to access it without writing kernel modules — which are complex, kernel version-specific, and genuinely dangerous to run in production. eBPF is the safe, practical way to access it.


The Problem eBPF Solves — A Real Kubernetes Scenario

Here is a situation every Kubernetes engineer has faced. A production pod starts behaving strangely — elevated CPU, slow responses, occasional connection failures. You want to understand what is happening at a low level: what syscalls is it making, what network connections is it opening, is something spawning unexpected processes?

The old approaches and their problems

Restart the pod with a debug sidecar. You lose the current state immediately. The issue may not reproduce. You have modified the workload.

Run strace inside the container via kubectl exec. strace uses ptrace, which adds 50–100% CPU overhead to the traced process and is unavailable in hardened containers. You are tracing one process at a time with no cluster-wide view.

Poll /proc with a monitoring agent. Snapshot-based. Any event that happens between polls is invisible. A process that starts, does something, and exits between intervals is completely missed.
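
The blind spot in polling is simple arithmetic, and a short Python sketch makes it visible. The process names, timestamps, and the 10-second interval below are hypothetical. The function seen_by_poller assumes polls happen at exact multiples of the interval and that a process is only visible if it is alive at a poll instant.

```python
def seen_by_poller(start_ms: int, end_ms: int, interval_ms: int) -> bool:
    """True if at least one poll instant falls inside [start_ms, end_ms)."""
    # Ceiling division: the first poll at or after the process starts.
    first_poll = -(-start_ms // interval_ms) * interval_ms
    return first_poll < end_ms


POLL_INTERVAL_MS = 10_000  # hypothetical agent polling every 10 seconds

# (name, start time, exit time) in milliseconds since the agent started.
processes = [
    ("long-running-worker", 5_000, 25_000),
    ("short-lived-shell", 12_000, 12_300),
]

for name, start, end in processes:
    status = "seen" if seen_by_poller(start, end, POLL_INTERVAL_MS) else "MISSED"
    print(f"{name}: alive {end - start} ms -> {status}")
```

The shell that lived for 300 ms between two polls never appears in the agent's data, which is exactly the kind of short-lived process an attacker's one-off command produces.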

The eBPF approach

# Use a debug pod on the node — no changes to your workload
$ kubectl debug node/your-node -it --image=cilium/hubble-cli

# Real-time kernel events from every container on this node (illustrative output, format simplified):
sys_enter_execve  pid=8821  comm=sh    args=["/bin/sh","-c","curl http://..."]
sys_enter_connect pid=8821  comm=curl  dst=203.0.113.42:443
sys_enter_openat  pid=8821  comm=curl  path=/etc/passwd

# Something inside the pod spawned a shell, made an outbound connection,
# and read /etc/passwd — all visible without touching the pod.

Real-time visibility. Negligible overhead on your workload. Nothing restarted. Nothing modified. That is what eBPF makes possible.


Tools You Are Probably Already Running on eBPF

eBPF is not a standalone product — it is the foundation that many tools in the cloud-native ecosystem are built on. You may already be running eBPF on your nodes without thinking about it explicitly.

Tool | What eBPF does for it | Without eBPF
Cilium | Replaces kube-proxy and iptables with kernel-level packet routing. 2–3× faster at scale. | iptables rules: linear lookup, degrades with service count
Falco | Watches every syscall in every container for security rule violations. Sub-millisecond detection. | Kernel module (risky) or ptrace (high overhead)
Tetragon | Runtime security enforcement: can kill a process or drop a network packet at the kernel level. | No practical alternative at this detection speed
Datadog Agent | Network performance monitoring and universal service monitoring without application code changes. | Language-specific agents injected into application code
systemd | cgroup resource accounting and network traffic control on your Linux nodes. | Legacy cgroup v1 interfaces with limited visibility

eBPF vs the Old Ways

Before eBPF, getting deep visibility into a running Linux system meant choosing between three approaches, each with a significant trade-off:

Approach | Visibility | Cost | Production safe?
Kernel modules | Full kernel access | One bug = kernel panic. Version-specific, must recompile per kernel update. | No
ptrace / strace | One process at a time | 50–100% CPU overhead on the traced process. Unusable in production. | No
Polling /proc | Snapshots only | Events between polls are invisible. Short-lived processes are missed entirely. | Partial
eBPF | Full kernel visibility | 1–3% overhead. Verifier-guaranteed safety. Real-time stream, not polling. | Yes

Is It Safe to Run in Production?

This is always the first question from any experienced Linux admin, and it is exactly the right question to ask. The answer is yes — and the reason is the BPF verifier.

Before any eBPF program is allowed to run on your node, the Linux kernel runs it through a built-in static safety analyser. This analyser examines every possible execution path and asks: could this program crash the kernel, loop forever, or access memory it should not?

If the answer is yes — or even maybe — the program is rejected at load time. It never runs.

This is fundamentally different from kernel modules. A kernel module loads immediately with no safety check. If it has a bug, you find out at runtime — usually as a kernel panic. An eBPF program that would cause a panic is rejected before it ever loads. The safety guarantee is mathematical, not hopeful.

Episode 2 of this series covers the BPF verifier in full: what it checks, how it makes Cilium and Falco safe on your production nodes, and what questions to ask eBPF tool vendors about their implementation.


Common Misconceptions

eBPF is not a specific tool or product. It is a kernel technology — a platform. Cilium, Falco, Tetragon, and Pixie are tools built on top of it. When a vendor says “we use eBPF”, they mean they build on this kernel capability, not that they share a single implementation.

eBPF is not only for networking. The Berkeley Packet Filter name suggests networking, but modern eBPF covers security, observability, performance profiling, and tracing. The networking origin is historical, not a limitation.

eBPF is not only for Kubernetes. It works on any Linux system running kernel 4.9+, including bare metal servers, Docker hosts, and VMs. Kubernetes is the most popular deployment target because of the observability challenges at scale, but it is not a requirement.

You do not need to write eBPF programs to benefit from eBPF. Most Linux admins and DevOps engineers will use eBPF through tools like Cilium, Falco, and Datadog — never writing a line of BPF code themselves. This series covers the writing side later. Understanding what eBPF is makes you a significantly better user of these tools today.


Kernel Version Requirements

eBPF is a Linux kernel feature. The capabilities available depend directly on the kernel version running on your nodes. Run uname -r on any node to check.

Kernel | What becomes available
4.9+ | Basic eBPF support. Tracing, socket filtering. Most production systems today meet this minimum.
5.4+ | BTF (BPF Type Format) and CO-RE: programs that adapt to different kernel versions without recompile. Recommended minimum for production tooling.
5.8+ | Ring buffers for high-performance event streaming. Global variables. The target kernel for Cilium, Falco, and Tetragon full feature support.
6.x | Open-coded iterators, improved verifier, LSM security enforcement hooks. Amazon Linux 2023 and Ubuntu 22.04+ ship 5.15 or newer and are fully eBPF-ready.
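
If you want to check a fleet of nodes instead of reading the table by eye, the tiers above can be encoded in a few lines of Python. This is a rough sketch: FEATURE_TIERS, parse_release, and available_tiers are illustrative names, the thresholds are copied from the table, and the release strings are examples. Distribution kernels sometimes backport features, so treat the version number as a guide, not a guarantee.

```python
import re

# Feature tiers copied from the table above: (minimum (major, minor), summary).
FEATURE_TIERS = [
    ((4, 9), "basic eBPF: tracing, socket filtering"),
    ((5, 4), "BTF and CO-RE"),
    ((5, 8), "ring buffers, global variables"),
    ((6, 0), "open-coded iterators, improved verifier, LSM hooks"),
]


def parse_release(release):
    """Extract (major, minor) from a `uname -r` style string such as '6.1.0-13-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def available_tiers(release):
    """Return the summaries of every tier whose minimum version is met."""
    version = parse_release(release)
    return [summary for minimum, summary in FEATURE_TIERS if version >= minimum]


# Example release strings (illustrative, not taken from real nodes).
for release in ["4.4.0-210-generic", "5.4.0-150-generic", "5.15.0-91-generic", "6.1.0-13-amd64"]:
    print(release, "->", available_tiers(release) or ["below the 4.9 minimum"])
```

On a live Linux node, passing platform.release() from the standard library to available_tiers gives the same string that uname -r prints.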

EKS users: Amazon Linux 2023 AMIs ship with kernel 6.1+ and support the full modern eBPF feature set out of the box. If you are still on AL2, the migration also resolves the NetworkManager deprecation issues covered in the EKS 1.33 post.


The Bottom Line

eBPF is the answer to a question Linux engineers have been asking for years: how do I get deep visibility into what is happening on my servers and Kubernetes nodes — without adding massive overhead, injecting sidecars, or risking a kernel panic?

The answer is: run small, safe programs at the kernel level, where everything is already visible. Let the BPF verifier check that those programs are safe before they run. Stream the results to your observability tools through BPF maps, the data structures shared between kernel and user space.

The tools you already use — Cilium for networking, Falco for security, Datadog for APM — are built on this foundation. Understanding eBPF means understanding why those tools work the way they do, what they can and cannot see, and how to evaluate new tools that claim to use it.

Every eBPF-based tool you run on your nodes passed through the BPF verifier before it touched your cluster. Episode 2 covers exactly what that means — and why it matters for your infrastructure decisions.


Questions or corrections? Reach me on LinkedIn. If this was useful, the full series index is on linuxcent.com — search the eBPF Series tag for all episodes.