Write Your First Kubernetes CRD: A Hands-On YAML Walkthrough

Reading Time: 6 minutes

Kubernetes CRDs & Operators: Extending the API, Episode 4
What Is a CRD? · CRDs You Already Use · CRD Anatomy · Write Your First CRD · CEL Validation · Controller Loop · Build an Operator · CRD Versioning · Admission Webhooks · CRDs in Production


TL;DR

  • Writing a Kubernetes CRD requires five YAML files: the CRD itself, a ClusterRole/ClusterRoleBinding, a namespaced Role/RoleBinding for consumers, and a sample custom resource
  • The BackupPolicy CRD built in this episode is the running example throughout the rest of the series — operators, versioning, and production patterns all use it
  • Apply the CRD, verify it with kubectl get crds, create a custom resource, and watch the API server validate your spec
  • RBAC for CRDs follows the same Role/ClusterRole model as built-in resources — the generated resource name is {plural}.{group}
  • Schema validation fires at apply time: bad field types, missing required fields, and out-of-range values all return clear errors before anything reaches etcd
  • Without a controller, a BackupPolicy is stored in etcd but nothing acts on it — that is the topic of EP05 and EP07

The Big Picture

  WHAT WE'RE BUILDING IN THIS EPISODE

  1. backuppolicies-crd.yaml        ← registers the BackupPolicy type
  2. backuppolicies-rbac.yaml       ← controls who can create/view/delete
  3. nightly-backup.yaml            ← our first custom resource instance

  After applying:

  kubectl get crds | grep backup      ← BackupPolicy type exists
  kubectl get backuppolicies -n demo  ← nightly instance exists
  kubectl describe bp nightly -n demo ← spec visible, status empty
  kubectl apply -f bad-backup.yaml    ← schema validation rejects bad data

Writing your first Kubernetes CRD is the step that bridges understanding CRDs conceptually to operating them in a real cluster. This episode is hands-on — every block of YAML is something you apply and verify.


Prerequisites

You need a running Kubernetes cluster and kubectl configured. Any of these work:

# Local options
kind create cluster --name crd-demo
# or
minikube start

# Verify cluster access
kubectl cluster-info
kubectl get nodes

Step 1: Write the CRD

Save this as backuppolicies-crd.yaml:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: backuppolicies.storage.example.com
spec:
  group: storage.example.com
  scope: Namespaced
  names:
    plural:     backuppolicies
    singular:   backuppolicy
    kind:       BackupPolicy
    shortNames:
      - bp
    categories:
      - storage
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          required: ["spec"]
          properties:
            spec:
              type: object
              required: ["schedule", "retentionDays"]
              properties:
                schedule:
                  type: string
                  description: "Cron expression (e.g. '0 2 * * *' for 02:00 daily)"
                retentionDays:
                  type: integer
                  minimum: 1
                  maximum: 365
                  description: "How many days to retain backup snapshots"
                storageClass:
                  type: string
                  default: "standard"
                  description: "StorageClass to use for backup volumes"
                targets:
                  type: array
                  description: "Namespaces and resources to include in the backup"
                  maxItems: 20
                  items:
                    type: object
                    required: ["namespace"]
                    properties:
                      namespace:
                        type: string
                      includeSecrets:
                        type: boolean
                        default: false
                suspended:
                  type: boolean
                  default: false
                  description: "Set to true to pause backup execution"
            status:
              type: object
              x-kubernetes-preserve-unknown-fields: true
      subresources:
        status: {}
      additionalPrinterColumns:
        - name: Schedule
          type: string
          jsonPath: .spec.schedule
        - name: Retention
          type: integer
          jsonPath: .spec.retentionDays
        - name: Suspended
          type: boolean
          jsonPath: .spec.suspended
        - name: Ready
          type: string
          jsonPath: .status.conditions[?(@.type=='Ready')].status
        - name: Age
          type: date
          jsonPath: .metadata.creationTimestamp

Apply it:

kubectl apply -f backuppolicies-crd.yaml

Verify it registered correctly:

kubectl get crds backuppolicies.storage.example.com
NAME                                    CREATED AT
backuppolicies.storage.example.com      2026-04-25T08:00:00Z

Check the API server now knows about it:

kubectl api-resources | grep backuppolic
backuppolicies    bp    storage.example.com/v1alpha1    true    BackupPolicy

Check it is Established:

kubectl get crd backuppolicies.storage.example.com \
  -o jsonpath='{.status.conditions[?(@.type=="Established")].status}'
True

If you see False or empty output, wait a few seconds and retry — the API server takes a moment to register new CRDs.


Step 2: Write RBAC

CRDs follow the same RBAC model as built-in resources. The resource name is {plural}.{group}.

Save this as backuppolicies-rbac.yaml:

# ClusterRole for operators/controllers that manage BackupPolicy objects
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: backuppolicy-controller
rules:
  - apiGroups: ["storage.example.com"]
    resources: ["backuppolicies"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["storage.example.com"]
    resources: ["backuppolicies/status"]
    verbs: ["get", "update", "patch"]
  - apiGroups: ["storage.example.com"]
    resources: ["backuppolicies/finalizers"]
    verbs: ["update"]
---
# Role for application teams to manage BackupPolicies in their namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: backuppolicy-editor
rules:
  - apiGroups: ["storage.example.com"]
    resources: ["backuppolicies"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
# Read-only role for auditors
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: backuppolicy-viewer
rules:
  - apiGroups: ["storage.example.com"]
    resources: ["backuppolicies"]
    verbs: ["get", "list", "watch"]
kubectl apply -f backuppolicies-rbac.yaml

Verify the roles exist:

kubectl get clusterrole | grep backuppolicy
backuppolicy-controller   2026-04-25T08:01:00Z
backuppolicy-editor       2026-04-25T08:01:00Z
backuppolicy-viewer       2026-04-25T08:01:00Z

Note on backuppolicies/status: The separate status RBAC rule is only meaningful if you enabled the status subresource (we did). Without it, status and spec share the same update path.


Step 3: Create a Namespace and Your First Custom Resource

kubectl create namespace demo

Save this as nightly-backup.yaml:

apiVersion: storage.example.com/v1alpha1
kind: BackupPolicy
metadata:
  name: nightly
  namespace: demo
  labels:
    app.kubernetes.io/managed-by: manual
spec:
  schedule: "0 2 * * *"
  retentionDays: 30
  storageClass: standard
  targets:
    - namespace: production
      includeSecrets: false
    - namespace: staging
      includeSecrets: false
  suspended: false

Apply it:

kubectl apply -f nightly-backup.yaml

Get it back:

kubectl get backuppolicies -n demo
NAME      SCHEDULE    RETENTION   SUSPENDED   READY   AGE
nightly   0 2 * * *   30          false       <none>  5s

The Ready column is <none> because there is no controller writing status yet. The custom resource exists and is stored in etcd, but nothing is acting on it.

Describe it:

kubectl describe bp nightly -n demo
Name:         nightly
Namespace:    demo
Labels:       app.kubernetes.io/managed-by=manual
Annotations:  <none>
API Version:  storage.example.com/v1alpha1
Kind:         BackupPolicy
Metadata:
  Creation Timestamp:  2026-04-25T08:05:00Z
  ...
Spec:
  Retention Days:  30
  Schedule:        0 2 * * *
  Storage Class:   standard
  Suspended:       false
  Targets:
    Include Secrets:  false
    Namespace:        production
    Include Secrets:  false
    Namespace:        staging
Status:
Events:  <none>

Step 4: Test Schema Validation

The API server now validates every BackupPolicy against the schema. Try creating an invalid one:

kubectl apply -f - <<'EOF'
apiVersion: storage.example.com/v1alpha1
kind: BackupPolicy
metadata:
  name: bad-policy
  namespace: demo
spec:
  schedule: "not-a-cron"
  retentionDays: 500
EOF
The BackupPolicy "bad-policy" is invalid:
  spec.retentionDays: Invalid value: 500:
    spec.retentionDays in body should be less than or equal to 365

Missing required field:

kubectl apply -f - <<'EOF'
apiVersion: storage.example.com/v1alpha1
kind: BackupPolicy
metadata:
  name: missing-schedule
  namespace: demo
spec:
  retentionDays: 7
EOF
The BackupPolicy "missing-schedule" is invalid:
  spec.schedule: Required value

Wrong type:

kubectl apply -f - <<'EOF'
apiVersion: storage.example.com/v1alpha1
kind: BackupPolicy
metadata:
  name: wrong-type
  namespace: demo
spec:
  schedule: "0 2 * * *"
  retentionDays: "thirty"
EOF
The BackupPolicy "wrong-type" is invalid:
  spec.retentionDays: Invalid value: "string":
    spec.retentionDays in body must be of type integer: "string"

All validation fires at the API boundary — before etcd, before any controller sees the object.


Step 5: Verify Default Values Apply

The schema defines storageClass: default: "standard" and suspended: default: false. Verify they are applied even when not specified:

kubectl apply -f - <<'EOF'
apiVersion: storage.example.com/v1alpha1
kind: BackupPolicy
metadata:
  name: minimal
  namespace: demo
spec:
  schedule: "0 0 * * 0"
  retentionDays: 7
EOF

kubectl get bp minimal -n demo -o jsonpath='{.spec.storageClass}'
standard
kubectl get bp minimal -n demo -o jsonpath='{.spec.suspended}'
false

Defaults are injected by the API server at admission time. They appear in etcd and in every kubectl get -o yaml output — the stored object includes the defaults even if the user did not specify them.


Step 6: Explore the API Endpoints

Your custom resource is now available at standard REST endpoints:

kubectl proxy --port=8001 &

# List all BackupPolicies in the demo namespace
curl -s http://localhost:8001/apis/storage.example.com/v1alpha1/namespaces/demo/backuppolicies \
  | jq '.items[].metadata.name'
"nightly"
"minimal"
# Get a specific BackupPolicy
curl -s http://localhost:8001/apis/storage.example.com/v1alpha1/namespaces/demo/backuppolicies/nightly \
  | jq '.spec'

This is how controllers discover and watch custom resources — via the same API server endpoints, using informers that wrap these REST calls with efficient list-and-watch semantics.


Step 7: Clean Up

kubectl delete namespace demo
kubectl delete -f backuppolicies-rbac.yaml
kubectl delete -f backuppolicies-crd.yaml   # WARNING: deletes all BackupPolicy instances first

⚠ Common Mistakes

metadata.name does not match {plural}.{group}. The most common error. If you name the CRD backuppolicy.storage.example.com (singular) but the spec says plural: backuppolicies, the API server rejects it. The name must always be {plural}.{group}.

No required fields on spec. Without required constraints, kubectl apply accepts an empty spec: {}. The controller then receives objects with no configuration and has to handle the nil case. Define required fields in the schema.

Forgetting subresources: status: {}. Without this, controllers writing .status also overwrite .spec on full PUT updates. This causes status updates to reset user edits. Enable the status subresource from day one.

Not testing validation errors. Schema validation is the first line of defense. Always explicitly test that your required fields are required, types are enforced, and range constraints work — before deploying the controller.


Quick Reference

# All kubectl operations work on custom resources
kubectl get      backuppolicies -n demo
kubectl get      bp -n demo                  # shortName
kubectl describe bp nightly -n demo
kubectl edit     bp nightly -n demo
kubectl delete   bp nightly -n demo

# Output formats
kubectl get bp -n demo -o yaml
kubectl get bp -n demo -o json
kubectl get bp -n demo -o jsonpath='{.items[*].metadata.name}'

# Watch for changes
kubectl get bp -n demo -w

# List across all namespaces
kubectl get bp -A

# Patch spec
kubectl patch bp nightly -n demo \
  --type=merge -p '{"spec":{"suspended":true}}'

Key Takeaways

  • A working CRD deployment needs: the CRD YAML, RBAC ClusterRoles, and at least one sample custom resource
  • The API server validates all custom resources against the schema at apply time — errors are surfaced immediately, not inside the controller
  • Default values in the schema are injected at admission time and appear in every stored object
  • RBAC for custom resources uses {plural}.{group} as the resource name — status and finalizers are separate sub-resources
  • Without a controller, custom resources are stored in etcd and serve as validated configuration — nothing acts on them until a controller is deployed

What’s Next

EP05: Kubernetes CRD CEL Validation extends schema validation beyond simple type and range checks — cross-field rules (“if storageClass is premium, retentionDays must be at most 90″), regex validation beyond pattern, and immutable field enforcement. All without an admission webhook.

Get EP05 in your inbox when it publishes → subscribe at linuxcent.com

Kubernetes CRD Schema Explained: Versions, Validation, and Status Subresource

Reading Time: 6 minutes

Kubernetes CRDs & Operators: Extending the API, Episode 3
What Is a CRD? · CRDs You Already Use · CRD Anatomy · Write Your First CRD · CEL Validation · Controller Loop · Build an Operator · CRD Versioning · Admission Webhooks · CRDs in Production


TL;DR

  • The Kubernetes CRD schema is defined in spec.versions[].schema.openAPIV3Schema — the API server uses it to validate every custom resource create and update before storing in etcd
    (OpenAPI v3 schema = a JSON Schema dialect that describes the structure, types, and constraints of your resource’s fields)
  • spec.versions is a list — CRDs can serve multiple API versions simultaneously; exactly one version must have storage: true
  • scope: Namespaced vs scope: Cluster controls whether custom resources live inside a namespace or at cluster level (like PersistentVolume vs PersistentVolumeClaim)
  • spec.names defines the plural, singular, kind, and optional shortNames used in kubectl and RBAC
  • The status subresource (subresources.status: {}) separates user writes (spec) from controller writes (status) — enabling optimistic concurrency and kubectl status support
  • The scale subresource (subresources.scale) makes your custom resource compatible with kubectl scale and the HorizontalPodAutoscaler

The Big Picture

  ANATOMY OF A CUSTOMRESOURCEDEFINITION

  apiVersion: apiextensions.k8s.io/v1
  kind: CustomResourceDefinition
  metadata:
    name: {plural}.{group}        ← MUST be exactly this format
  spec:
    group: {group}                ← API group (e.g. storage.example.com)
    scope: Namespaced | Cluster   ← where instances live
    names:                        ← how kubectl refers to this resource
      plural: backuppolicies
      singular: backuppolicy
      kind: BackupPolicy
      shortNames: [bp]
    versions:                     ← can be a list; one must have storage: true
      - name: v1alpha1
        served: true              ← API server responds to this version
        storage: true             ← etcd stores objects in this version
        schema:
          openAPIV3Schema:        ← validation schema for ALL objects of this type
            type: object
            properties:
              spec: {...}
              status: {...}
        subresources:
          status: {}              ← enables separate status write path
          scale:                  ← enables kubectl scale + HPA
            specReplicasPath: .spec.replicas
            statusReplicasPath: .status.replicas
        additionalPrinterColumns: ← extra columns in kubectl get output
          - name: Schedule
            type: string
            jsonPath: .spec.schedule

Understanding the Kubernetes CRD schema is the prerequisite for writing a CRD that behaves correctly in production — validation catches bad data at the API boundary, the status subresource prevents controller race conditions, and scope determines your entire RBAC and multi-tenancy model.


spec.group and metadata.name

The group is a reverse-DNS identifier for your API. Convention:

storage.example.com     ← domain you control + functional area
monitoring.myteam.io
databases.platform.company.com

The CRD’s metadata.name must be exactly {plural}.{group}:

metadata:
  name: backuppolicies.storage.example.com
spec:
  group: storage.example.com
  names:
    plural: backuppolicies

If these do not match, the API server rejects the CRD with a validation error. This is the most common first-timer mistake.


spec.scope: Namespaced vs Cluster

  SCOPE DETERMINES WHERE INSTANCES LIVE

  Namespaced (scope: Namespaced)       Cluster (scope: Cluster)
  ─────────────────────────────         ──────────────────────────
  kubectl get backuppolicies -n prod    kubectl get clusterbackuppolicies
  kubectl get backuppolicies -A         (no -n flag, no namespace)

  Analogous to: Pod, Deployment,        Analogous to: PersistentVolume,
                ConfigMap                             ClusterRole, Node

Namespaced: Use when instances are per-tenant or per-application. Users with namespace-scoped RBAC can manage their own instances without cluster-admin. Most CRDs should be namespaced.

Cluster-scoped: Use when instances represent cluster-wide configuration — a ClusterIssuer (cert-manager), ClusterSecretStore (ESO), a StorageClass-like concept. Requires cluster-level RBAC to create/modify.

You cannot change scope after a CRD is created without deleting and recreating it (which deletes all instances). Choose carefully.


spec.versions: Serving Multiple API Versions

spec:
  versions:
    - name: v1alpha1
      served: true
      storage: false       # not stored; converted on read
      schema:
        openAPIV3Schema: {...}
    - name: v1beta1
      served: true
      storage: false
      schema:
        openAPIV3Schema: {...}
    - name: v1
      served: true
      storage: true        # etcd stores in this version
      schema:
        openAPIV3Schema: {...}

Rules:
served: true means the API server accepts requests at this version
served: false means the API server returns 404 for that version — use to deprecate
– Exactly one version must have storage: true — this is what gets written to etcd
– When a client requests a non-storage version, the API server converts on the fly (or calls your conversion webhook — see EP08)

Early in development, start with v1alpha1 storage: true. Promote to v1 when the schema is stable. EP08 covers how to do this without losing data.


spec.names: What kubectl Sees

spec:
  names:
    plural:     backuppolicies     # kubectl get backuppolicies
    singular:   backuppolicy       # kubectl get backuppolicy (also works)
    kind:       BackupPolicy       # used in YAML apiVersion/kind
    listKind:   BackupPolicyList   # optional; auto-derived if omitted
    shortNames:                    # kubectl get bp
      - bp
    categories:                    # kubectl get all includes this type
      - all

categories is worth noting: if you add all to categories, your custom resources appear when someone runs kubectl get all -n mynamespace. Most CRDs deliberately do not add this — it clutters get all output. Only add it if your resource is a primary operational concern.


schema.openAPIV3Schema: Validation

The schema is where you define field types, required fields, constraints, and descriptions. The API server validates every create and update against this schema before writing to etcd.

schema:
  openAPIV3Schema:
    type: object
    required: ["spec"]
    properties:
      spec:
        type: object
        required: ["schedule", "retentionDays"]
        properties:
          schedule:
            type: string
            description: "Cron expression for backup schedule"
            pattern: '^(\*|[0-9,\-\/]+)\s+(\*|[0-9,\-\/]+)\s+(\*|[0-9,\-\/]+)\s+(\*|[0-9,\-\/]+)\s+(\*|[0-9,\-\/]+)$'
          retentionDays:
            type: integer
            minimum: 1
            maximum: 365
          storageClass:
            type: string
            default: "standard"        # default value (Kubernetes 1.17+)
          targets:
            type: array
            maxItems: 10
            items:
              type: object
              required: ["name"]
              properties:
                name:
                  type: string
                namespace:
                  type: string
                  default: "default"
      status:
        type: object
        x-kubernetes-preserve-unknown-fields: true   # controllers write arbitrary status

Field types available

Type Usage
string Text values; supports format, pattern, enum, minLength, maxLength
integer Whole numbers; supports minimum, maximum
number Floating point
boolean true/false
object Nested structure; use properties to define fields
array List; use items to define element schema; supports minItems, maxItems

x-kubernetes-preserve-unknown-fields: true

This tells the API server not to prune fields it does not know about. Use it on status (controllers write whatever they need) and on fields that are intentionally free-form (like a config field that accepts arbitrary YAML). Avoid it on spec — it bypasses validation.

Validation behavior in practice

# This will fail with a clear error:
kubectl apply -f - <<EOF
apiVersion: storage.example.com/v1alpha1
kind: BackupPolicy
metadata:
  name: bad
  namespace: default
spec:
  schedule: "not-a-cron"    # fails pattern validation
  retentionDays: 500         # fails maximum: 365
EOF
The BackupPolicy "bad" is invalid:
  spec.schedule: Invalid value: "not-a-cron": spec.schedule in body should match
    '^(\*|[0-9,\-\/]+)\s+...'
  spec.retentionDays: Invalid value: 500: spec.retentionDays in body should be
    less than or equal to 365

Schema validation catches configuration mistakes at apply time, not at runtime inside a pod. This is one of the core advantages of expressing domain configuration as CRDs rather than ConfigMaps.


additionalPrinterColumns: What kubectl get Shows

By default, kubectl get backuppolicies shows only NAME and AGE. You can add columns:

additionalPrinterColumns:
  - name: Schedule
    type: string
    jsonPath: .spec.schedule
    description: Cron schedule for backups
  - name: Retention
    type: integer
    jsonPath: .spec.retentionDays
    priority: 1          # 0 = always shown; 1 = only with -o wide
  - name: Ready
    type: string
    jsonPath: .status.conditions[?(@.type=='Ready')].status
  - name: Age
    type: date
    jsonPath: .metadata.creationTimestamp

Result:

NAME        SCHEDULE      READY   AGE
nightly     0 2 * * *     True    3d
weekly      0 0 * * 0     False   7d

Good printer columns turn kubectl get into a useful operational dashboard. Include Ready (from status conditions) so operators can immediately see which custom resources are healthy without running kubectl describe.


The Status Subresource

subresources:
  status: {}

Without the status subresource, spec and status are part of the same object. Any user with update permission on the CRD can modify both. Controllers write status through the same path as users write spec.

With the status subresource enabled:
kubectl apply / kubectl patch only update spec — the status block is stripped
– Controllers use the /status subresource endpoint to write status
– RBAC can grant update on backuppolicies (spec) independently from update on backuppolicies/status

  WITHOUT status subresource:         WITH status subresource:
  ─────────────────────────            ──────────────────────────
  PUT /backuppolicies/nightly          PUT /backuppolicies/nightly
  → updates spec AND status            → updates spec only

                                       PUT /backuppolicies/nightly/status
                                       → updates status only (controller path)

Always enable the status subresource on production CRDs. The split between spec and status is fundamental to the Kubernetes API contract. Without it, a controller updating status can accidentally overwrite spec changes made by a user at the same time.


The Scale Subresource

subresources:
  scale:
    specReplicasPath: .spec.replicas
    statusReplicasPath: .status.replicas
    labelSelectorPath: .status.labelSelector

This makes your custom resource compatible with:

kubectl scale backuppolicy nightly --replicas=3

And with HorizontalPodAutoscaler targeting your custom resource. If your CRD manages something replica-based (workers, shards, connections), enabling the scale subresource lets it plug into the standard Kubernetes autoscaling ecosystem without extra plumbing.


⚠ Common Mistakes

Forgetting x-kubernetes-preserve-unknown-fields: true on status. If you validate the status field with a strict schema but do not add this, the API server will prune any status fields the controller writes that are not in the schema. The controller’s status updates will silently lose fields. Either define the full status schema or use x-kubernetes-preserve-unknown-fields: true.

Using scope: Cluster for resources that should be namespaced. Once a CRD is created as cluster-scoped, you cannot make it namespaced without deleting and recreating it. Plan scope before deploying to production.

Not enabling the status subresource. Without it, controllers writing status can race with users updating spec. It also means kubectl patch --subresource=status does not work and some tooling behaves unexpectedly. Enable it from the start.

Loose schema with no required fields. An openAPIV3Schema with no required constraint accepts objects with empty spec. This usually means your controller gets called with a resource that is missing mandatory configuration. Define required fields and validate them at the API boundary, not inside the controller.


Quick Reference

# Inspect the full schema of a CRD
kubectl get crd backuppolicies.storage.example.com -o yaml | \
  yq '.spec.versions[0].schema'

# Check what subresources are enabled
kubectl get crd certificates.cert-manager.io -o jsonpath=\
  '{.spec.versions[0].subresources}'

# See all served versions for a CRD
kubectl get crd prometheuses.monitoring.coreos.com \
  -o jsonpath='{.spec.versions[*].name}'

# Check which version is the storage version
kubectl get crd certificates.cert-manager.io \
  -o jsonpath='{.spec.versions[?(@.storage==true)].name}'

# Describe the printer columns for a CRD
kubectl get crd scaledobjects.keda.sh \
  -o jsonpath='{.spec.versions[0].additionalPrinterColumns}'

Key Takeaways

  • spec.versions allows serving and storing multiple API versions; only one version has storage: true
  • scope (Namespaced vs Cluster) cannot be changed after creation — choose deliberately
  • openAPIV3Schema validates every CR at the API boundary, before etcd storage
  • The status subresource separates the user write path (spec) from the controller write path (status) — always enable it
  • additionalPrinterColumns makes kubectl get operationally useful; include a Ready column from status conditions

What’s Next

EP04: Write Your First Kubernetes CRD puts the anatomy into practice — a complete hands-on walkthrough building a BackupPolicy CRD from scratch, applying it to a cluster, creating instances, and verifying validation, RBAC, and status behavior.

Get EP04 in your inbox when it publishes → subscribe at linuxcent.com

CRDs You Already Use: cert-manager, KEDA, and External Secrets Explained

Reading Time: 6 minutes

Kubernetes CRDs & Operators: Extending the API, Episode 2
What Is a CRD? · CRDs You Already Use · CRD Anatomy · Write Your First CRD · CEL Validation · Controller Loop · Build an Operator · CRD Versioning · Admission Webhooks · CRDs in Production


TL;DR

  • cert-manager, KEDA, and External Secrets Operator are all CRD-based systems — understanding their custom resources shows you what a well-designed CRD looks like before you build one
  • cert-manager’s Certificate CRD expresses desired TLS state; the cert-manager controller reconciles that state by issuing, renewing, and storing certificates in Secrets
  • KEDA’s ScaledObject extends the HorizontalPodAutoscaler with external metrics (queue depth, Kafka lag, Prometheus queries) — the KEDA operator translates ScaledObjects into native HPA objects
  • External Secrets Operator’s ExternalSecret abstracts over secret backends (AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager) — the controller pulls values and writes Kubernetes Secrets
  • All three follow the same pattern: you describe desired state in a custom resource; the operator reconciles actual state to match
  • Kubernetes custom resources examples like these are the fastest way to internalize the CRD mental model before writing your own

The Big Picture

  THREE CRD-BASED OPERATORS AND WHAT THEY MANAGE

  ┌─────────────────────────────────────────────────────────────┐
  │  cert-manager                                               │
  │  Certificate CR  →  controller issues cert  →  TLS Secret  │
  └─────────────────────────────────────────────────────────────┘

  ┌─────────────────────────────────────────────────────────────┐
  │  KEDA                                                       │
  │  ScaledObject CR  →  controller creates HPA  →  Pod count  │
  └─────────────────────────────────────────────────────────────┘

  ┌─────────────────────────────────────────────────────────────┐
  │  External Secrets Operator                                  │
  │  ExternalSecret CR  →  controller pulls  →  K8s Secret      │
  │                         from Vault/AWS/GCP                  │
  └─────────────────────────────────────────────────────────────┘

  In every case:
  User creates CR  →  Operator watches CR  →  Operator acts  →  Status updated

Kubernetes custom resources examples from real tools like these reveal the design pattern you will use in every CRD you build: express desired state declaratively, let the controller bridge the gap to actual state, surface the outcome in the status subresource.


Why Look at Existing CRDs First?

Before designing your own CRD, you want to understand what good CRD design looks like from the user’s perspective. The engineers at Jetstack (cert-manager), KEDACORE (KEDA), and External Secrets contributors have collectively solved the same problems you will face:

  • What goes in spec vs status?
  • How do you reference other Kubernetes objects?
  • How do you handle secrets and credentials securely?
  • What does a healthy vs unhealthy custom resource look like?

Studying these before writing your own saves you from the most common first-timer mistakes.


cert-manager: The Certificate CRD

cert-manager is the most widely deployed CRD-based system in Kubernetes. It manages TLS certificates from Let’s Encrypt, internal CAs, and cloud providers.

The core CRDs

kubectl get crds | grep cert-manager
certificates.cert-manager.io
certificaterequests.cert-manager.io
challenges.acme.cert-manager.io
clusterissuers.cert-manager.io
issuers.cert-manager.io
orders.acme.cert-manager.io

The one you interact with most is Certificate. Here is a real example:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-tls
  namespace: production
spec:
  secretName: api-tls-cert        # cert-manager writes the TLS Secret here
  duration: 2160h                 # 90 days
  renewBefore: 720h               # renew 30 days before expiry
  subject:
    organizations:
      - example.com
  dnsNames:
    - api.example.com
    - api-internal.example.com
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer

What happens after you apply this:

  1. cert-manager controller sees the new Certificate object
  2. It contacts the referenced ClusterIssuer (Let’s Encrypt in this case)
  3. It completes the ACME challenge, obtains the certificate
  4. It writes the certificate and private key into the api-tls-cert Secret
  5. It updates the Certificate object’s status to reflect success
kubectl describe certificate api-tls -n production
Status:
  Conditions:
    Last Transition Time:  2026-04-10T08:00:00Z
    Message:               Certificate is up to date and has not expired
    Reason:                Ready
    Status:                True
    Type:                  Ready
  Not After:               2026-07-09T08:00:00Z
  Not Before:              2026-04-10T08:00:00Z
  Renewal Time:            2026-06-09T08:00:00Z

What this teaches you about CRD design

  • spec.secretName — the CR references an output object by name. The controller creates or updates that object.
  • spec.issuerRef — the CR references another custom resource (ClusterIssuer) by name. This is a common pattern for separating configuration concerns.
  • status.conditions — the standard Kubernetes condition pattern: type, status, reason, message. You will use the same structure in your own CRDs.
  • The controller owns status — users own spec. This separation is a core convention.

KEDA: The ScaledObject CRD

KEDA (Kubernetes Event-Driven Autoscaling) extends Kubernetes autoscaling beyond CPU and memory. It can scale deployments based on queue depth, Kafka consumer lag, Prometheus metric values, and dozens of other event sources.

The core CRDs

kubectl get crds | grep keda
clustertriggerauthentications.keda.sh
scaledjobs.keda.sh
scaledobjects.keda.sh
triggerauthentications.keda.sh

A ScaledObject ties a Deployment to an external scaler:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor        # the Deployment to scale
  minReplicaCount: 0             # scale to zero when idle
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/orders
        queueLength: "5"         # target: 5 messages per pod
        awsRegion: us-east-1
      authenticationRef:
        name: keda-sqs-auth      # TriggerAuthentication for AWS credentials

What KEDA does with this:

  1. KEDA controller sees the ScaledObject
  2. It creates a native HorizontalPodAutoscaler object targeting the order-processor Deployment
  3. KEDA’s metrics adapter polls the SQS queue depth and exposes it as a custom metric
  4. The HPA uses that metric to scale replicas — including to zero when the queue is empty
kubectl get scaledobject order-processor-scaler -n production
NAME                       SCALETARGETKIND      SCALETARGETNAME    MIN   MAX   TRIGGERS         READY   ACTIVE
order-processor-scaler     apps/Deployment      order-processor    0     50    aws-sqs-queue    True    True

What this teaches you about CRD design

  • spec.scaleTargetRef — targeting another object by name. The controller acts on that object, not on the CR itself.
  • spec.triggers — a list of trigger specifications. Lists of typed sub-objects are a recurring CRD pattern.
  • spec.minReplicaCount: 0 — expressing scale-to-zero as a first-class concept in the API. Built-in HPA does not support this; KEDA’s CRD extends the vocabulary of what is expressible.
  • The KEDA operator translates ScaledObject → native HPA. The CRD is an abstraction over a more complex Kubernetes object. This “translate and manage child resources” pattern is extremely common in operators.

External Secrets Operator: The ExternalSecret CRD

External Secrets Operator (ESO) solves a specific problem: secrets live in external systems (AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager), but Kubernetes workloads need them as Kubernetes Secrets. ESO bridges the gap.

The core CRDs

kubectl get crds | grep external-secrets
clusterexternalsecrets.external-secrets.io
clustersecretstores.external-secrets.io
externalsecrets.external-secrets.io
secretstores.external-secrets.io

A SecretStore defines the backend connection:

apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-manager
  namespace: production
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: eso-sa            # uses IRSA/workload identity

An ExternalSecret defines what to pull and how to map it:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-creds
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: database-secret          # Kubernetes Secret to create/update
    creationPolicy: Owner
  data:
    - secretKey: username          # key in the K8s Secret
      remoteRef:
        key: prod/database         # path in AWS Secrets Manager
        property: username         # property within that secret
    - secretKey: password
      remoteRef:
        key: prod/database
        property: password

After ESO reconciles this:

kubectl get secret database-secret -n production -o jsonpath='{.data.username}' | base64 -d
# outputs: db_user
kubectl describe externalsecret database-creds -n production
Status:
  Conditions:
    Last Transition Time:   2026-04-10T08:00:00Z
    Message:                Secret was synced
    Reason:                 SecretSynced
    Status:                 True
    Type:                   Ready
  Refresh Time:             2026-04-10T09:00:00Z
  Synced Resource Version:  1-abc123

What this teaches you about CRD design

  • spec.secretStoreRef — referencing a configuration CRD (SecretStore) from an operational CRD (ExternalSecret). This layering of CRDs to separate concerns is a mature pattern.
  • spec.refreshInterval — the CR expresses a desired behavior (periodic sync), not just a desired state snapshot. CRDs can express temporal behaviors.
  • spec.target.creationPolicy: Owner — ESO will set an owner reference on the created Secret, so deleting the ExternalSecret cascades to deleting the Secret. This is how controllers manage lifecycle.
  • Sensitive values never appear in the CR — only paths and references. The controller handles the actual secret retrieval. This is a key security pattern in CRD design.

The Common Pattern Across All Three

  OPERATOR PATTERN (cert-manager / KEDA / ESO / every other operator)

  User applies CR
        │
        ▼
  Controller watches CRDs
  (informer cache, events queue)
        │
        ▼
  Controller reconciles:
  actual state ──→ compare ──→ desired state
        │              │
        │         (gap found)
        │              │
        ▼              ▼
  Takes action      Updates status
  (issue cert,      conditions in CR
   create HPA,
   sync Secret)
        │
        └──── loops back, watches for next change

The design contract:
Users write spec — what they want
Controllers read spec, write status — what actually happened
Status conditions are truthReady: True/False with reason and message tell operators what the controller knows

This pattern, explained in depth in EP06, is why CRDs and controllers are designed the way they are.


⚠ Common Mistakes

Installing CRDs without the controller. If you install cert-manager’s CRDs from the crds.yaml manifest without installing cert-manager itself, Certificate objects will be accepted by the API server but never reconciled. The Ready condition will never appear. Always install the operator alongside its CRDs.

Editing status fields directly. Many teams try kubectl patch or kubectl edit to update a custom resource’s status to work around a stuck controller. Most well-written controllers overwrite status every reconcile loop — your manual change will be wiped. Fix the underlying issue, not the status display.

Assuming CRD deletion is safe. Covered in EP01 but worth repeating: deleting a CRD cascades to deleting all instances. If you kubectl delete crd certificates.cert-manager.io, every Certificate object in every namespace is gone and cert-manager will stop issuing. Back up CRDs and their instances before any CRD deletion.


Quick Reference

# See all CRDs installed by cert-manager
kubectl get crds | grep cert-manager.io

# Get all Certificates across all namespaces
kubectl get certificates -A

# Watch cert-manager reconcile a new Certificate
kubectl get certificate api-tls -n production -w

# See all ScaledObjects and their current state
kubectl get scaledobjects -A

# Check ESO sync status for all ExternalSecrets
kubectl get externalsecrets -A

# Inspect what APIs a CRD exposes
kubectl api-resources | grep cert-manager

Key Takeaways

  • cert-manager, KEDA, and ESO are canonical examples of well-designed CRD-based operators
  • All three follow the same pattern: user writes spec, controller reconciles to actual state, status reflects outcome
  • spec expresses desired state declaratively; the controller figures out how to achieve it
  • Status conditions (type, status, reason, message) are the standard way to surface controller outcomes
  • Sensitive values never appear in the CR — controllers retrieve them from external systems using references and credentials

What’s Next

EP03: CRD Anatomy opens the YAML of a CRD itself — spec.versions, OpenAPI schema properties, scope, names, and subresources. You have seen CRDs from the outside; next we look at how they are structured on the inside.

Get EP03 in your inbox when it publishes → subscribe at linuxcent.com

What Is a Kubernetes CRD? How Custom Resources Extend the API

Reading Time: 6 minutes

Kubernetes CRDs & Operators: Extending the API, Episode 1
What Is a CRD? · CRDs You Already Use · CRD Anatomy · Write Your First CRD · CEL Validation · Controller Loop · Build an Operator · CRD Versioning · Admission Webhooks · CRDs in Production


TL;DR

  • A Kubernetes CRD (Custom Resource Definition) is how you add new resource types to the Kubernetes API — the same way Deployment and Service exist natively, you can make BackupPolicy or Certificate exist too
    (CRD = the schema/blueprint; Custom Resource = an instance of that schema, just like a Pod is an instance of the Pod schema)
  • Every kubectl get crds on a real cluster shows dozens of them — cert-manager, KEDA, Prometheus Operator, Crossplane all ship their own CRDs
  • CRDs are served by the same API server as built-in resources — kubectl, RBAC, watches, and events all work identically
  • A CRD alone does nothing — a controller watches the custom resources and acts on them; together they form an Operator
  • CRDs live in etcd just like Pods and Deployments — they survive API server restarts and cluster upgrades
  • You do not need to modify Kubernetes source code or restart the API server to add a CRD

The Big Picture

  HOW KUBERNETES CRDs EXTEND THE API

  ┌──────────────────────────────────────────────────────────────┐
  │  Kubernetes API Server                                       │
  │                                                              │
  │  Built-in resources          Custom resources (via CRD)      │
  │  ─────────────────           ──────────────────────────      │
  │  Pod                         Certificate     (cert-manager)  │
  │  Deployment                  ScaledObject    (KEDA)          │
  │  Service                     ExternalSecret  (ESO)           │
  │  ConfigMap                   BackupPolicy    (your team)     │
  │  ...                         ...                             │
  │                                                              │
  │  All resources: same API, same kubectl, same RBAC, same etcd │
  └──────────────────────────────────────────────────────────────┘
            ▲                          ▲
            │ built in                 │ registered at runtime
            │                         │
         Kubernetes              CustomResourceDefinition
          binary                    (a YAML you apply)

What is a Kubernetes CRD? It is a resource that defines resources — a schema registration that teaches the API server about a new object type you want to use in your cluster.


What Problem CRDs Solve

Kubernetes ships with roughly 50 resource types: Pods, Deployments, Services, ConfigMaps, Secrets, PersistentVolumes, and so on. These cover the general-purpose building blocks for running containerized workloads.

But the moment you operate real infrastructure, you hit the edges. You want to express:

  • “This database should have three replicas with point-in-time recovery enabled” — not a Deployment
  • “This TLS certificate for api.example.com should renew 30 days before expiry” — not a Secret
  • “This queue consumer should scale to zero when the queue is empty” — not a HorizontalPodAutoscaler

Before CRDs (pre-2017), the only options were: use ConfigMaps as a poor substitute (no schema, no validation, no dedicated RBAC), or fork Kubernetes and add the resource natively (impractical for everyone outside the core team).

CRDs, introduced as stable in Kubernetes 1.16, solved this by letting you register a new resource type with the API server at runtime — without touching Kubernetes source code, without restarting the API server, without any special access beyond being able to create cluster-scoped resources.


The Kubernetes API: A Brief Mental Model

Before CRDs make sense, the API model needs to be clear.

  KUBERNETES API STRUCTURE

  apiVersion: apps/v1       ← API group (apps) + version (v1)
  kind: Deployment          ← resource type
  metadata:
    name: web               ← instance name
    namespace: default      ← namespace scope
  spec:
    replicas: 3             ← desired state

Every Kubernetes resource has:
– A group (e.g., apps, batch, networking.k8s.io) — or no group for core resources
– A version (e.g., v1, v1beta1)
– A kind (e.g., Deployment, Pod)
– A scope: namespaced or cluster-wide

The API server is a registry. Each group/version/kind combination maps to a Go struct that knows how to validate, store, and serve that resource type.

A CRD registers a new entry in that registry. You supply the group, version, kind, and schema. The API server handles everything else — serving it via REST, storing it in etcd, exposing it to kubectl.


What a CRD Looks Like

Here is the smallest possible CRD — it creates a new BackupPolicy resource type in the storage.example.com API group:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: backuppolicies.storage.example.com
spec:
  group: storage.example.com
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                schedule:
                  type: string
                retentionDays:
                  type: integer
  scope: Namespaced
  names:
    plural: backuppolicies
    singular: backuppolicy
    kind: BackupPolicy
    shortNames:
      - bp

Apply it:

kubectl apply -f backuppolicy-crd.yaml

Now create an instance:

apiVersion: storage.example.com/v1alpha1
kind: BackupPolicy
metadata:
  name: nightly
  namespace: default
spec:
  schedule: "0 2 * * *"
  retentionDays: 30
kubectl apply -f nightly-backup.yaml
kubectl get backuppolicies
kubectl get bp            # shortName works
kubectl describe bp nightly

The API server validates the spec against the schema, stores it in etcd, and returns it via all the standard API endpoints — all without a single line of custom code.


CRD vs Built-In Resource: What Is Different?

Not much, deliberately.

Capability Built-in resource Custom resource (CRD)
kubectl get / describe / delete Yes Yes
RBAC (Roles, ClusterRoles) Yes Yes
Watch (informers, events) Yes Yes
Stored in etcd Yes Yes
OpenAPI schema validation Yes Yes (you define the schema)
Admission webhooks Yes Yes
Status subresource Yes Optional (you enable it)
Scale subresource Yes Optional (you enable it)
Built-in controller behavior Yes No — you write the controller

The last row is the critical one. When you create a Deployment, the deployment controller immediately starts managing ReplicaSets. When you create a BackupPolicy, nothing happens — until you write and deploy a controller that watches BackupPolicy objects and acts on them.

That controller + the CRD is what people call an Operator.


A Real Cluster: What You Actually See

Run this on any cluster running cert-manager, Prometheus Operator, or any other tooling:

kubectl get crds

Sample output (abbreviated):

NAME                                                  CREATED AT
certificates.cert-manager.io                          2024-11-01T08:12:00Z
certificaterequests.cert-manager.io                   2024-11-01T08:12:00Z
issuers.cert-manager.io                               2024-11-01T08:12:00Z
clusterissuers.cert-manager.io                        2024-11-01T08:12:00Z
scaledobjects.keda.sh                                 2024-11-01T08:13:00Z
scaledjobs.keda.sh                                    2024-11-01T08:13:00Z
externalsecrets.external-secrets.io                   2024-11-01T08:14:00Z
prometheuses.monitoring.coreos.com                    2024-11-01T08:15:00Z
servicemonitors.monitoring.coreos.com                 2024-11-01T08:15:00Z

Every tool that ships as a CRD-based system registers its resource types here first. The count often surprises engineers: a production cluster with a typical toolchain easily has 40–80 CRDs.

Check how many are on your cluster:

kubectl get crds --no-headers | wc -l

How the API Server Handles a CRD

When you apply a CRD, the API server does three things:

  CRD REGISTRATION FLOW

  kubectl apply -f my-crd.yaml
          │
          ▼
  1. API server validates the CRD manifest
     (is the schema valid OpenAPI v3? are names correct?)
          │
          ▼
  2. CRD stored in etcd
     (under /registry/apiextensions.k8s.io/customresourcedefinitions/)
          │
          ▼
  3. New REST endpoints activated immediately:
     GET  /apis/storage.example.com/v1alpha1/namespaces/{ns}/backuppolicies
     POST /apis/storage.example.com/v1alpha1/namespaces/{ns}/backuppolicies
     ...

From this point, any kubectl get backuppolicies or API call to those endpoints is handled exactly like a built-in resource call — the API server serves it from etcd, applies RBAC, runs admission webhooks, and returns standard JSON.

No restart required. The new endpoints appear within seconds.


The Difference Between CRD and CR

Two terms that are easily confused:

  • CRD (CustomResourceDefinition) — the schema/blueprint. There is one CRD per resource type. certificates.cert-manager.io is a CRD.
  • CR (Custom Resource) — an instance of a CRD. Every Certificate object you create is a custom resource. You can have thousands of CRs per CRD.
  CRD (one)          →  Custom Resource (many)
  ─────────             ─────────────────────
  certificates          web-tls           (namespace: production)
  .cert-manager.io      api-tls           (namespace: production)
                        admin-tls         (namespace: staging)
                        ...

The CRD is applied once (usually by the tool’s Helm chart). Custom resources are created by your users, your CI pipeline, or your GitOps system throughout the life of the cluster.


Where CRDs Fit in the Kubernetes Extension Model

CRDs are one of three ways to extend Kubernetes:

  KUBERNETES EXTENSION MECHANISMS

  1. CRDs + Controllers (Operators)
     Add new resource types + behavior
     → cert-manager, KEDA, Argo CD, Crossplane
     Used for: domain-specific abstractions, infrastructure management

  2. Admission Webhooks
     Intercept API requests to validate or mutate objects
     → OPA/Gatekeeper, Kyverno, Istio injection
     Used for: policy enforcement, sidecar injection, defaulting

  3. API Aggregation (AA)
     Register a fully separate API server behind the main API server
     → metrics-server, custom autoscalers
     Used for: when you need non-CRUD semantics (e.g. exec, attach, streaming)

For 95% of use cases, CRDs + controllers are the right mechanism. API aggregation is complex and only warranted for non-standard API semantics. Admission webhooks are complementary to CRDs, not an alternative.


⚠ Common Mistakes

Confusing the CRD with the controller. The CRD is just a schema registration — it does not execute code. If you apply a CRD but do not deploy its controller, creating custom resources will succeed (the API server accepts them) but nothing will happen. This catches many people the first time they try to use cert-manager by only applying the CRDs without installing the cert-manager controller.

Assuming CRD deletion is safe. Deleting a CRD deletes all custom resources of that type from etcd. There is no “are you sure?” prompt. If you delete the certificates.cert-manager.io CRD, every Certificate object in every namespace is gone.

Treating CRDs as ConfigMap replacements. Some teams store configuration in CRDs purely to get schema validation. This works, but without a controller, the custom resources are inert data. If you only need configuration storage with validation, a CRD is viable — just be explicit that there is no reconciliation loop.


Quick Reference

# List all CRDs in the cluster
kubectl get crds

# Inspect a specific CRD's schema
kubectl get crd certificates.cert-manager.io -o yaml

# List all custom resources of a type
kubectl get certificates -A

# Get details on a specific custom resource
kubectl describe certificate web-tls -n production

# Delete a CRD (WARNING: deletes all instances)
kubectl delete crd backuppolicies.storage.example.com

# Check if a CRD is established (ready to use)
kubectl get crd backuppolicies.storage.example.com \
  -o jsonpath='{.status.conditions[?(@.type=="Established")].status}'
# Returns: True

Key Takeaways

  • A Kubernetes CRD registers a new resource type with the API server — no source code changes, no restart required
  • Custom resources behave identically to built-in resources: kubectl, RBAC, watches, etcd, admission webhooks all work the same way
  • The CRD is just the schema; a controller gives custom resources behavior — together they form an Operator
  • Every production cluster running modern tooling already uses dozens of CRDs
  • Deleting a CRD deletes all its instances — treat CRDs as production-critical objects

What’s Next

EP02: CRDs You Already Use makes this concrete before we go deeper — we walk through cert-manager’s Certificate, KEDA’s ScaledObject, and External Secrets’ ExternalSecret as working examples, so you understand what a well-designed CRD looks like from a user’s perspective before you design your own.

Get EP02 in your inbox when it publishes → subscribe at linuxcent.com

Kubernetes Today: v1.33 to v1.35, In-Place Resize GA, and What Comes Next

Reading Time: 6 minutes


Introduction

Ten years after the first commit, Kubernetes is not exciting in the way it was in 2015. That’s a compliment. The system is stable. The APIs are mature. The migrations — dockershim, PSP, cloud provider code — are behind us.

What the 1.33–1.35 cycle shows is a project focused on precision: removing edge cases, promoting long-running alpha features to stable, and making the scheduler, storage, and security model more correct rather than more powerful. That’s what a mature infrastructure platform looks like.

Here’s what happened and where the project is headed.


Kubernetes 1.33 — Sidecar Resize, In-Place Resize Beta (April 2025)

Code name: Octarine

In-Place Pod Vertical Scaling reaches Beta

After landing as alpha in 1.27, in-place pod resource resizing became beta in 1.33 — enabled by default via the InPlacePodVerticalScaling feature gate.

The capability: change CPU and memory requests/limits on a running container without terminating and restarting the pod.

# Resize a running container's CPU limit without restart
kubectl patch pod api-pod-xyz --type='json' -p='[
  {
    "op": "replace",
    "path": "/spec/containers/0/resources/requests/cpu",
    "value": "2"
  },
  {
    "op": "replace",
    "path": "/spec/containers/0/resources/limits/cpu",
    "value": "4"
  }
]'

# Verify the resize was applied
kubectl get pod api-pod-xyz -o jsonpath='{.status.containerStatuses[0].resources}'

Why this matters operationally: Before in-place resize, vertical scaling meant terminating the pod, losing in-memory state, waiting for a new pod to become ready. For databases with warm buffer pools, JVM applications with loaded heap caches, or any workload where startup cost is significant, this was a serious limitation. Vertical Pod Autoscaler (VPA) worked around it by restarting pods — acceptable for stateless workloads, problematic for stateful ones.

In 1.33, resizing also works for sidecar containers, combining two 1.32-stable features.

Sidecar Containers — Full Maturity

The first feature to formally combine sidecar and in-place resize: you can now vertically scale a service mesh proxy (Envoy sidecar) without restarting the application pod. For high-traffic services where the proxy itself becomes the CPU bottleneck, this is directly actionable.


Gateway API v1.4 (October 2025)

Gateway API continued its rapid iteration with v1.4:

BackendTLSPolicy (Standard channel): Configure TLS between the gateway and the backend service — not just TLS termination at the gateway, but end-to-end encryption:

apiVersion: gateway.networking.k8s.io/v1alpha3
kind: BackendTLSPolicy
metadata:
  name: api-backend-tls
spec:
  targetRefs:
  - group: ""
    kind: Service
    name: api-service
  validation:
    caCertificateRefs:
    - name: internal-ca
      group: ""
      kind: ConfigMap
    hostname: api.internal.corp

Gateway Client Certificate Validation: The gateway can now validate client certificates — mutual TLS for ingress traffic, not just between services.

TLSRoute to Standard: TLS routing (based on SNI, not HTTP host headers) graduated to the standard channel — enabling TCP workloads with TLS passthrough through the Gateway API model.

ListenerSet: Group multiple Gateway listeners — useful for shared infrastructure where multiple teams need to attach routes to the same gateway without managing separate Gateway resources.


Kubernetes 1.34 — Scheduler Improvements, DRA Continues (August 2025)

The 1.34 release focused on the scheduler and Dynamic Resource Allocation:

DRA structured parameters stabilization: The Dynamic Resource Allocation API matured its parameter model — resource drivers can expose structured claims that the scheduler understands, enabling topology-aware placement of GPU workloads:

apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
      selectors:
      - cel:
          expression: device.attributes["nvidia.com/gpu-product"].string() == "A100-SXM4-80GB"
      count: 2

Scheduler QueueingHint stable: Plugins can now tell the scheduler when to re-queue a pod for scheduling — instead of the scheduler periodically retrying all unschedulable pods, plugins signal when relevant cluster state has changed. This significantly reduces scheduler CPU consumption in large clusters with many unschedulable pods.

Fine-grained node authorization improvements: Kubelets can now be restricted from accessing Service resources they don’t need — further reducing the blast radius of a compromised kubelet.


Kubernetes 1.35 — In-Place Resize GA, Memory Limits Unlocked (December 2025)

In-Place Pod Vertical Scaling Graduates to Stable

After landing in alpha (1.27), beta (1.33), in-place resize graduated to GA in 1.35. Two significant improvements accompanied GA:

Memory limit decreases now permitted: Previously, you could increase memory limits in-place but not decrease them. The restriction existed because the kernel doesn’t immediately reclaim memory when the limit is lowered — the OOM killer would need to run. 1.35 lifts this restriction with proper handling: the kernel is instructed to reclaim, and the pod status reflects the resize progress.

Pod-Level Resources (alpha in 1.35): Specify resource requests and limits at the pod level rather than per-container — with in-place resize support. Useful for init containers and sidecar patterns where total pod resources matter more than per-container allocation.

spec:
  # Pod-level resources (alpha) — total budget for all containers
  resources:
    requests:
      cpu: "4"
      memory: "8Gi"
  containers:
  - name: application
    image: myapp:latest
    # No per-container resources; pod-level applies
  - name: log-collector
    image: fluentbit:latest
    restartPolicy: Always  # sidecar

Other 1.35 Highlights

Topology Spread Constraints improvements: Better handling of unschedulable scenarios — whenUnsatisfiable: ScheduleAnyway now has smarter fallback behavior.

VolumeAttributesClass stable: Change storage performance characteristics (IOPS, throughput) of a PersistentVolume without re-provisioning — the storage equivalent of in-place pod resize.

# Change volume IOPS without re-provisioning
kubectl patch pvc database-pvc --type='merge' -p='
  {"spec": {"volumeAttributesClassName": "high-performance"}}'

Job success policy improvements: Declare a Job successful when a subset of pods complete successfully — for distributed training jobs where not all workers need to finish.


What’s in Kubernetes 1.36 (April 22, 2026)

Kubernetes 1.36 is on track for April 22, 2026 release. Based on the enhancement tracking and KEP (Kubernetes Enhancement Proposal) pipeline, expected highlights include:

  • DRA continuing toward stable
  • Pod-level resources moving to beta
  • Scheduler improvements for AI/ML workload placement
  • Further Gateway API integration as core networking model

The project has reached a rhythm: four releases per year, each focused on advancing a predictable set of features through alpha → beta → stable. The drama of the 2019–2022 period (PSP, dockershim, API removals) is behind it.


The State of the Ecosystem in 2026

Control Plane Deployment Models

Model Examples Best For
Managed (cloud provider) GKE, EKS, AKS Most organizations; no control plane ops
Self-managed kubeadm, k3s, Talos Air-gapped, on-prem, specific compliance requirements
Managed (platform) Rancher, OpenShift Enterprises that need multi-cluster management + vendor support

CNI Landscape

CNI Model Notable Feature
Cilium eBPF kube-proxy replacement, network policy at kernel, Hubble observability
Calico eBPF or iptables BGP-based networking, hybrid cloud routing
Flannel VXLAN/host-gw Simple, low overhead, no network policy
Weave Mesh overlay Easy multi-host setup

eBPF-based CNIs (Cilium, Calico in eBPF mode) are now the default recommendation for production clusters. The iptables era of Kubernetes networking is ending.

Security Stack in 2026

A hardened Kubernetes cluster in 2026 runs:

Cluster provisioning:    Cluster API + GitOps (Flux/ArgoCD)
Admission control:       Pod Security Admission (restricted) + Kyverno or OPA/Gatekeeper
Runtime security:        Falco (eBPF-based syscall monitoring)
Network security:        Cilium NetworkPolicy + Cilium Cluster Mesh for multi-cluster
Image security:          Cosign signing in CI + admission webhook for signature verification
Secret management:       External Secrets Operator → HashiCorp Vault or cloud KMS
Observability:           Prometheus + Grafana + Hubble (network flows) + OpenTelemetry

The Permanent Principles That Haven’t Changed

Looking across twelve years and 35 minor versions, some things have not changed:

The API as the universal interface: Everything in Kubernetes is a resource. This remains the most important architectural decision — it makes every tool, every controller, every GitOps system work with the same model.

Reconciliation loops: Every Kubernetes controller watches actual state and drives it toward desired state. The controller pattern from 2014 is unchanged. CRDs and Operators are just more instances of it.

Labels and selectors: The flexible grouping mechanism from 1.0 is still the primary way Kubernetes components find each other. Services find pods. HPA finds Deployments. Operators find their managed resources.

Declarative, not imperative: You describe what you want. Kubernetes figures out how to achieve and maintain it. This principle, inherited from Borg’s BCL configuration, underlies everything from Deployments to Crossplane’s cloud resource management.


What’s Coming: The Next Five Years

WebAssembly on Kubernetes: The Wasm ecosystem (wasmCloud, SpinKube) is building toward running WebAssembly workloads as first-class Kubernetes pods — near-native performance, smaller images, stronger isolation than containers. Still early, but gaining real adoption.

AI inference as infrastructure: LLM serving is becoming a cluster primitive. Tools like KServe and vLLM on Kubernetes are moving from research to production. The scheduler, resource model, and networking will continue adapting to inference workload patterns.

Confidential computing: AMD SEV, Intel TDX, and ARM CCA provide hardware-level memory encryption for pods. The RuntimeClass mechanism and ongoing kernel work are making confidential Kubernetes workloads operational rather than experimental.

Leaner distributions: k3s, k0s, Talos, and Flatcar-based minimal Kubernetes distributions are growing in adoption for edge, IoT, and resource-constrained environments. The pressure is toward smaller, more auditable control planes.


Key Takeaways

  • In-place pod vertical scaling went from alpha (1.27) to stable (1.35) — live CPU and memory resize without pod restart changes the economics of stateful workload management
  • Gateway API v1.4 completes the ingress replacement story: BackendTLSPolicy, client certificate validation, and TLSRoute in standard channel
  • VolumeAttributesClass stable (1.35): Change storage performance in-place — the storage parallel to pod resource resize
  • The eBPF era of Kubernetes networking is established: Cilium as default CNI in GKE, growing in EKS/AKS, replacing iptables-based kube-proxy
  • The Kubernetes project in 2026 is focused on precision — promoting mature features to stable, reducing edge cases, improving scheduler efficiency — not adding new abstractions
  • WebAssembly, confidential computing, and AI inference scheduling are the frontiers to watch

Series Wrap-Up

Era Defining Change
2003–2014 Borg and Omega build the playbook internally at Google
2014–2016 Kubernetes 1.0, CNCF, and winning the container orchestration wars
2016–2018 RBAC stable, CRDs, cloud providers all-in on managed K8s
2018–2020 Operators, service mesh, OPA/Gatekeeper — the extensibility era
2020–2022 Supply chain crisis, PSP deprecated, API removals, dockershim exit
2022–2023 Dockershim and PSP removed, eBPF networking takes over
2023–2025 GitOps standard, sidecar stable, DRA, AI/ML workloads
2025–2026 In-place resize GA, VolumeAttributesClass, Gateway API complete

From 47,501 lines of Go in a 250-file GitHub commit to the operating system of the cloud — and still reconciling.


← EP07: Platform Engineering Era

Series: Kubernetes: From Borg to Platform Engineering | linuxcent.com

XDP — Packets Processed Before the Kernel Knows They Arrived

Reading Time: 10 minutes

eBPF: From Kernel to Cloud, Episode 7
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP**

14 min read


Introduction

EP01 through EP06 covered what eBPF is, how the verifier keeps it safe, and how the toolchain compiles and loads programs across kernel versions. This episode is where that foundation meets production networking.

XDP — eXpress Data Path — is the earliest hook in the Linux kernel packet path. It fires before sk_buff allocation, before routing, before netfilter. A DROP decision at XDP costs one bounds check and a return value. Everything else is skipped. At 1 million packets per second, that difference shows up directly as CPU.

This episode explains where XDP sits, what it can and cannot see, how Cilium uses it, and what every Kubernetes operator needs to know about it — even if they never write an eBPF program.


Table of Contents


Architecture Overview

XDP Pre-Stack Packet Hook — eBPF kernel data path diagram showing where XDP fires before sk_buff allocation
XDP fires before sk_buff allocation — the earliest possible kernel hook for zero-copy packet processing.

TL;DR

  • XDP fires before sk_buff allocation — the earliest possible kernel hook for packet processing
    (sk_buff = the kernel’s socket buffer — every normal packet requires one to be allocated, which adds up fast at scale)
  • Three modes: native (in-driver, full performance), generic (fallback, no perf gain), offloaded (NIC ASIC)
  • XDP context is raw packet bytes — no socket, no cgroup, no pod identity; handle non-IP traffic explicitly
  • Every pointer dereference requires a bounds check against data_end — the verifier enforces this
  • BPF_MAP_TYPE_LPM_TRIE is the right map type for IP prefix blocklists — handles /32 hosts and CIDRs together
  • XDP metadata area enables coordination with TC programs — classify at XDP speed, enforce with pod context at TC

Quick Check: Is XDP Running on Your Cluster?

Before the data path walkthrough — a two-command check you can run right now on any cluster node:

# SSH into a worker node, then:
bpftool net list

On a Cilium-managed node, you’ll see something like:

eth0 (index 2):
        xdpdrv  id 44

lxc8a3f21b (index 7):
        tc ingress id 47
        tc egress  id 48

Reading the output:
xdpdrv — XDP in native mode, running in the NIC driver before sk_buff (this is what you want)
xdpgeneric instead of xdpdrvgeneric mode, runs after sk_buff allocation, no performance benefit
– No XDP line at all — XDP not deployed; your CNI uses iptables for service forwarding

If you’re on EKS with aws-vpc-cni or GKE with kubenet, you likely won’t see XDP here — those CNIs use iptables. Understanding this section explains why teams migrating to Cilium see lower node CPU under the same traffic load.


Where XDP Sits in the Kernel Data Path

A client’s cluster was under a SYN flood — roughly 1 million packets per second from a rotating set of source IPs. We had iptables DROP rules installed within the first ten minutes, blocklist updated every 30 seconds as new source ranges appeared. The flood traffic dropped in volume. But node CPU stayed high. The %si column in top — software interrupt time — was sitting at 25–30%.

%si in top is the percentage of CPU time spent handling hardware interrupts and kernel-level packet processing — separate from your application’s CPU usage. On a quiet managed cluster (EKS, GKE) this is usually under 1%. Under a packet flood, high %si means the kernel is burning cycles just receiving packets, before your workloads run at all. It’s the metric that tells you the problem is below the application layer.

The iptables rules were matching. Packets were being dropped. CPU was still burning. The answer is where in the kernel the drop was happening. iptables fires inside the netfilter framework — after the kernel has already allocated an sk_buff for the packet, done DMA from the NIC ring buffer, and traversed several netfilter hooks. At 1Mpps, the allocation cost alone is measurable.

XDP fires before any of that.

The standard Linux packet receive path:

NIC hardware
  ↓
DMA to ring buffer (kernel memory)
  ↓
[XDP hook — fires here, before sk_buff]
  ├── XDP_DROP   → discard, zero further allocation
  ├── XDP_PASS   → continue to kernel network stack
  ├── XDP_TX     → transmit back out the same interface
  └── XDP_REDIRECT → forward to another interface or CPU
  ↓
sk_buff allocated from slab allocator
  ↓
netfilter: PREROUTING
  ↓
IP routing decision
  ↓
netfilter: INPUT or FORWARD
  ↓  [iptables fires somewhere in here]
socket receive queue
  ↓
userspace application

XDP runs at the driver level, in the NAPI poll context — the same context where the driver is processing received packets off the ring buffer. The program runs before the kernel’s general networking code gets involved. There’s no sk_buff, no reference counting, no slab allocation.

NAPI (New API) is how modern Linux receives packets efficiently. Instead of one CPU interrupt per packet (catastrophically expensive at 1Mpps), the NIC fires a single interrupt, then the kernel polls the NIC ring buffer in batches until it’s drained. XDP runs inside this polling loop — as close to the hardware as software gets without running on the NIC itself.

At 1Mpps, the difference between XDP_DROP and an iptables DROP is roughly the cost of allocating and then immediately freeing 1 million sk_buff objects per second — plus netfilter traversal, connection tracking lookup, and the DROP action itself. That’s the CPU time that was burning.

After moving the blocklist to an XDP program, the %si on the same traffic load dropped from 28% to 3%.


XDP Modes

XDP operates in three modes, and which one you get depends on your NIC driver.

Native XDP (XDP_FLAGS_DRV_MODE)

The eBPF program runs directly in the NIC driver’s NAPI poll function — in interrupt context, before sk_buff. This is the only mode that delivers the full performance benefit.

Driver support is required. The widely supported drivers: mlx4, mlx5 (Mellanox/NVIDIA), i40e, ice (Intel), bnxt_en (Broadcom), virtio_net (KVM/QEMU), veth (containers). Check support:

# Verify native XDP support on your driver
ethtool -i eth0 | grep driver
# driver: mlx5_core  ← supports native XDP

# Load in native mode
ip link set dev eth0 xdpdrv obj blocklist.bpf.o sec xdp

The veth driver supporting native XDP is what makes XDP meaningful inside Kubernetes pods — each pod’s veth interface can run an XDP program at wire speed.

Generic XDP (XDP_FLAGS_SKB_MODE)

Fallback for drivers that don’t support native XDP. The program still runs, but it runs after sk_buff allocation, as a hook in the netif_receive_skb path. No performance benefit over early netfilter. sk_buff is still allocated and freed for every packet.

# Generic mode — development and testing only
ip link set dev eth0 xdpgeneric obj blocklist.bpf.o sec xdp

Use this for development on a laptop with a NIC that lacks native XDP support. Never benchmark with it and never use it in production expecting performance gains.

Offloaded XDP

Runs on the NIC’s own processing unit (SmartNIC). Zero CPU involvement — the XDP decision happens in NIC hardware. Supported by Netronome Agilio NICs. Rare in production, but the theoretical ceiling for XDP performance.


The XDP Context: What Your Program Can See

XDP programs receive one argument: struct xdp_md.

struct xdp_md {
    __u32 data;           // offset of first packet byte in the ring buffer page
    __u32 data_end;       // offset past the last byte
    __u32 data_meta;      // metadata area before data (XDP metadata for TC cooperation)
    __u32 ingress_ifindex;
    __u32 rx_queue_index;
};

data and data_end are used as follows:

void *data     = (void *)(long)ctx->data;
void *data_end = (void *)(long)ctx->data_end;

// Every pointer dereference must be bounds-checked first
struct ethhdr *eth = data;
if ((void *)(eth + 1) > data_end)
    return XDP_PASS;  // malformed or truncated packet

The verifier enforces these bounds checks — every pointer derived from ctx->data must be validated before use. The error invalid mem access 'inv' means you dereferenced a pointer without checking the bounds. This is the most common cause of XDP program rejection.

For operators (not writing XDP code): You’ll see invalid mem access 'inv' in logs when an eBPF program is rejected at load time — most commonly during a Cilium upgrade or a custom tool deployment on a kernel the tool wasn’t built for. The fix is in the eBPF source or the tool version, not the cluster config.

What XDP cannot see:
– Socket state — no socket buffer exists yet
– Cgroup hierarchy — no pod identity
– Process information — no PID, no container
– Connection tracking state (unless you maintain it yourself in a map)

XDP is ingress-only. It fires on packets arriving at an interface, not departing. For egress, TC is the hook.


What This Means on Your Cluster Right Now

Every Cilium-managed node has XDP programs running. Here’s how to see them:

# All XDP programs on all interfaces — this is the full picture
bpftool net list
# Sample output on a Cilium node:
#
# eth0 (index 2):
#         xdpdrv  id 44         ← XDP in native mode on the node uplink
#
# lxc8a3f21b (index 7):
#         tc ingress id 47      ← TC enforces NetworkPolicy on pod ingress
#         tc egress  id 48      ← TC enforces NetworkPolicy on pod egress
#
# "xdpdrv"     = native mode (runs in NIC driver, before sk_buff — full performance)
# "xdpgeneric" = fallback mode (after sk_buff — no performance benefit over iptables)

# Which mode is active?
ip link show eth0 | grep xdp
# xdp mode drv  ← native (full performance)
# xdp mode generic  ← fallback (no perf benefit)

# Details on the XDP program ID
bpftool prog show id $(bpftool net show dev eth0 | grep xdp | awk '{print $NF}')
# Shows: loaded_at, tag, xlated bytes, jited bytes, map IDs

The map IDs in that output are the BPF maps the XDP program is using — typically the service VIP table for DNAT, and in security tools, the blocklist or allowlist. To see what’s in them:

# List maps used by the XDP program
bpftool prog show id <PROG_ID> | grep map_ids

# Dump the service map (for a Cilium node — this is the load balancer table)
bpftool map dump id <MAP_ID> | head -40

For a blocklist scenario — like the SYN flood mitigation above — the BPF_MAP_TYPE_LPM_TRIE is the standard data structure. A lookup for 192.168.1.45 hits a 192.168.1.0/24 entry in the same map, handling both host /32s and CIDR ranges in one lookup.

# Count entries in an XDP filter map
bpftool map dump id <BLOCKLIST_MAP_ID> | grep -c "key"

# Verify XDP is active and inspect program details
bpftool net show dev eth0

XDP Metadata: Cooperating with TC

Think of it as a sticky note attached to the packet. XDP writes the note at line speed (no context about pods or sockets). TC reads it later when full context is available, and acts on it. The packet carries the note between them.

More precisely: XDP can write metadata into the area before ctx->data — a small scratch space that survives as the packet moves from XDP to the TC hook. This is the coordination mechanism between the two eBPF layers.

The pattern: XDP classifies at speed (no sk_buff overhead), TC enforces with pod context (where you have socket identity). XDP writes a classification tag into the metadata area. TC reads it and makes the policy decision.

From an operational standpoint, when you see two eBPF programs on the same interface (one XDP, one TC), this pipeline is the likely explanation:

bpftool net list
# xdpdrv id 44 on eth0       ← XDP classifier running at line rate
# tc ingress id 47 on eth0   ← TC enforcer reading XDP metadata

How Cilium Uses XDP

Not running Cilium? On EKS with aws-vpc-cni or GKE with kubenet, service forwarding uses iptables NAT rules and conntrack instead. You can see this with iptables -t nat -L -n on a node — look for the KUBE-SVC-* chains. Those chains are what XDP replaces in a Cilium cluster. This is why teams migrating from kube-proxy to Cilium report lower node CPU at high connection rates — it’s not magic, it’s hook placement.

On a Cilium node, XDP handles the load balancing path for ClusterIP services. When a packet arrives at the node destined for a ClusterIP:

  1. XDP program checks the destination IP against a BPF LRU hash map of known service VIPs
  2. On a match, it performs DNAT — rewriting the destination IP to a backend pod IP
  3. Returns XDP_TX or XDP_REDIRECT to forward directly

No iptables NAT rules. No conntrack state machine. No socket buffer allocation for the routing decision. The lookup is O(1) in a BPF hash map.

# See Cilium's XDP program on the node uplink
ip link show eth0 | grep xdp
# xdp  (attached, native mode)

# The XDP program details
bpftool prog show pinned /sys/fs/bpf/cilium/xdp

# Load time, instruction count, JIT-compiled size
bpftool prog show id $(bpftool net list | grep xdp | awk '{print $NF}')

At production scale — 500+ nodes, 50k+ services — removing iptables from the service forwarding path with XDP reduces per-node CPU utilization measurably. The effect is most visible on nodes handling high connection rates to cluster services.


Operational Inspection

# All XDP programs on all interfaces
bpftool net list

# Check XDP mode (native, generic, offloaded)
ip link show | grep xdp

# Per-interface stats — includes XDP drop/pass counters
cat /sys/class/net/eth0/statistics/rx_dropped

# XDP drop counters exposed via bpftool
bpftool map dump id <stats_map_id>

# Verify XDP is active and show program details
bpftool net show dev eth0

Common Mistakes

Mistake Impact Fix
Missing bounds check before pointer dereference Verifier rejects: “invalid mem access” Always check ptr + sizeof(*ptr) > data_end before use
Using generic XDP for performance testing Misleading numbers — sk_buff still allocated Test in native mode only; check ip link output for mode
Not handling non-IP traffic (ARP, IPv6, VLAN) ARP breaks, IPv6 drops, VLAN-tagged frames dropped Check eth->h_proto and return XDP_PASS for non-IP
XDP for egress or pod identity No socket context at XDP; XDP is ingress only Use TC egress for pod-identity-aware egress policy
Forgetting BPF_F_NO_PREALLOC on LPM trie Full memory allocated at map creation for all entries Always set this flag for sparse prefix tries
Blocking ARP by accident in a /24 blocklist Loss of layer-2 reachability within the blocked subnet Separate ARP handling before the IP blocklist check

Key Takeaways

  • XDP fires before sk_buff allocation — the earliest possible kernel hook for packet processing
  • Three modes: native (in-driver, full performance), generic (fallback, no perf gain), offloaded (NIC ASIC)
  • XDP context is raw packet bytes — no socket, no cgroup, no pod identity; handle non-IP traffic explicitly
  • Every pointer dereference requires a bounds check against data_end — the verifier enforces this
  • BPF_MAP_TYPE_LPM_TRIE is the right map for IP prefix blocklists — handles /32 hosts and CIDRs together
  • XDP metadata area enables coordination with TC programs — classify at XDP speed, enforce with pod context at TC

What’s Next

XDP handles ingress at the fastest possible point but has no visibility into which pod sent a packet. EP08 covers TC eBPF — the hook that fires after sk_buff allocation, where socket and cgroup context exist.

TC is how Cilium implements pod-to-pod network policy without iptables. It’s also where stale programs from failed Cilium upgrades leave ghost filters that cause intermittent packet drops. Knowing how TC programs chain — and how to find and remove stale ones — is a specific, concrete operational skill.

Next: TC eBPF — pod-level network policy without iptables

Get EP08 in your inbox when it publishes → linuxcent.com/subscribe

Kubernetes RBAC and AWS IAM: The Two-Layer Access Model for EKS

Reading Time: 9 minutes


What Is Cloud IAMAuthentication vs AuthorizationIAM Roles vs PoliciesAWS IAM Deep DiveGCP Resource Hierarchy IAMAzure RBAC ScopesOIDC Workload IdentityAWS IAM Privilege EscalationAWS Least Privilege AuditSAML vs OIDC FederationKubernetes RBAC and AWS IAM


TL;DR

  • Kubernetes RBAC and cloud IAM are separate authorization layers — strong cloud IAM with weak Kubernetes RBAC is still a vulnerable cluster
  • cluster-admin ClusterRoleBindings are the first thing to audit — a compromised pod with cluster-admin controls the entire cluster
  • Disable automountServiceAccountToken on pods that don’t call the Kubernetes API — most application pods don’t need it mounted
  • Use OIDC for human access instead of X.509 client certificates — client certs cannot be revoked without rotating the CA
  • Bind groups from IdP, not individual usernames — revocation propagates automatically when someone leaves
  • A ServiceAccount that can create pods or create rolebindings is a privilege escalation path: the same class of risk as iam:PassRole

The Big Picture

  TWO AUTHORIZATION LAYERS — NEITHER COMPENSATES FOR THE OTHER

  ┌─────────────────────────────────────────────────────────────────┐
  │  CLOUD IAM LAYER  (AWS IAM / GCP IAM / Azure RBAC)             │
  │  Controls: S3, DynamoDB, Lambda, RDS, cloud services           │
  │  Human: federated identity from IdP (SAML / OIDC)             │
  │  Machine: IRSA annotation → IAM role / GKE WI / AKS WI        │
  │  Audit: CloudTrail, GCP Audit Logs, Azure Monitor              │
  └─────────────────────────────────────────────────────────────────┘
           ↕ separate systems — no inheritance in either direction
  ┌─────────────────────────────────────────────────────────────────┐
  │  KUBERNETES RBAC LAYER  (within the cluster)                   │
  │  Controls: pods, secrets, deployments, configmaps, namespaces  │
  │  Human: OIDC groups → ClusterRoleBinding (or RoleBinding)      │
  │  Machine: ServiceAccount → Role / ClusterRole                  │
  │  Audit: kube-apiserver audit log                               │
  └─────────────────────────────────────────────────────────────────┘

  Attack path: exploit app pod → SA has cluster-admin → own the cluster
  Audit finding: cluster-admin on app SA, regardless of cloud IAM posture

Introduction

I spent a long time in Kubernetes environments thinking cloud IAM and Kubernetes RBAC were related in a way that meant securing one partially covered the other. They don’t. They’re separate authorization systems that happen to share infrastructure.

The moment this crystallized for me: I was auditing an EKS cluster for a fintech company. Their AWS IAM posture was actually quite good — least privilege roles, no wildcard policies, SCPs in place at the org level. I was about to give them a clean bill of health when I ran one command:

kubectl get clusterrolebindings -o json | \
  jq '.items[] | select(.roleRef.name=="cluster-admin") | {name:.metadata.name, subjects:.subjects}'

The output showed five ClusterRoleBindings to cluster-admin. Two of them bound it to service accounts in production namespaces. One of those service accounts was used by an application that processed customer transactions.

cluster-admin in Kubernetes is the equivalent of AdministratorAccess in AWS. An attacker who compromises a pod running as that service account doesn’t just have access to the application’s data. They have control of the entire cluster: reading every secret in every namespace, deploying arbitrary workloads, modifying RBAC bindings to create persistence.

None of this showed up in the AWS IAM audit. AWS IAM and Kubernetes RBAC are separate systems. Securing one tells you nothing about the other.


Kubernetes RBAC Architecture

Kubernetes RBAC works with four object types:

Object Scope What It Does
Role Single namespace Defines permissions within one namespace
ClusterRole Cluster-wide Permissions across all namespaces, or for non-namespaced resources
RoleBinding Single namespace Binds a Role (or ClusterRole) to subjects, scoped to one namespace
ClusterRoleBinding Cluster-wide Binds a ClusterRole to subjects with cluster-wide scope

Subjects — the identities that receive the binding — are:
User: an external identity (Kubernetes has no native user objects; users come from the authenticator)
Group: a group of external identities
ServiceAccount: a Kubernetes-native machine identity, namespaced

The scoping matters. A ClusterRole defines what permissions exist. A RoleBinding applies that ClusterRole within a single namespace. A ClusterRoleBinding applies it everywhere. The same permissions, dramatically different blast radius.


Roles and ClusterRoles

# Role: read pods and their logs — scoped to the default namespace only
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""]          # "" = core API group (pods, secrets, configmaps, etc.)
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
# ClusterRole: manage Deployments across all namespaces
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: deployment-manager
rules:
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

The verbs map to HTTP methods against the Kubernetes API: get reads a specific resource, list returns a collection, watch streams changes, create/update/patch/delete are mutations.

One that consistently surprises people: list on secrets returns secret values in some Kubernetes versions and configurations. You might think “list” is just metadata, but listing secrets can include their data. If a service account needs to check whether a secret exists, grant get on the specific secret name. Avoid list on the secrets resource.

The Wildcard Risk

# This is effectively cluster-admin in the default namespace — avoid
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["*"]

Any * in RBAC rules is an audit finding. In practice I find wildcards most often in:
– Operator and controller service accounts (understandable, but worth reviewing)
– “Temporary” RBAC that became permanent
– Developer tooling given cluster-admin “because it was easier”

Run this to find all ClusterRoles with wildcard verbs:

kubectl get clusterroles -o json | \
  jq '.items[] | select(.rules[]?.verbs[] == "*") | .metadata.name'

Bindings — Connecting Identities to Roles

# RoleBinding: alice can read pods in the default namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: alice-pod-reader
  namespace: default
subjects:
- kind: User
  name: [email protected]
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
# ClusterRoleBinding: Prometheus can read cluster-wide (monitoring use case)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-cluster-reader
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
roleRef:
  kind: ClusterRole
  name: view
  apiGroup: rbac.authorization.k8s.io

An important pattern: a RoleBinding can reference a ClusterRole. This lets you define a role once at the cluster level (the ClusterRole) and bind it within specific namespaces through RoleBindings. The permissions are still scoped to the namespace where the RoleBinding lives. This is the right pattern for shared role definitions — define the permission set once, instantiate it with appropriate scope.

Default to RoleBinding over ClusterRoleBinding for namespace-scoped work. ClusterRoleBinding should be reserved for genuinely cluster-wide operations: monitoring agents, network plugins, cluster operators, security tooling.


Service Accounts — The Machine Identity in Kubernetes

Every pod in Kubernetes runs as a service account. If you don’t specify one, it uses the default service account in the pod’s namespace.

The default service account is where many RBAC misconfigurations accumulate. When someone creates a RoleBinding without thinking about which SA to use, they often bind the permission to default. Now every pod in that namespace that doesn’t explicitly set a service account — including pods deployed by developers who aren’t thinking about RBAC — inherits that binding.

# Create a dedicated SA for each application
kubectl create serviceaccount app-backend -n production

# Check what any SA can currently do — use this in every audit
kubectl auth can-i --list --as=system:serviceaccount:production:app-backend -n production

# Check a specific action
kubectl auth can-i get secrets \
  --as=system:serviceaccount:production:app-backend -n production

kubectl auth can-i create pods \
  --as=system:serviceaccount:production:app-backend -n production

Disable Auto-Mounting the SA Token

By default, Kubernetes mounts the service account token into every pod at /var/run/secrets/kubernetes.io/serviceaccount/token. A pod that doesn’t need to call the Kubernetes API doesn’t need this token. Having it mounted increases the blast radius if the pod is compromised — the token can be used to call the K8s API with whatever RBAC permissions the SA has.

# Disable at the pod level
apiVersion: v1
kind: Pod
spec:
  automountServiceAccountToken: false
  serviceAccountName: app-backend
  containers:
  - name: app
    image: my-app:latest

# Or at the service account level (applies to all pods using this SA)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
automountServiceAccountToken: false

For most application pods — anything that isn’t a Kubernetes operator, controller, or management tool — the K8s API token is unnecessary. Disable it.


Human Access to Kubernetes — Get Off Client Certificates

Kubernetes doesn’t manage human users natively. Authentication is delegated to an external mechanism. The most common approaches:

Method Notes
X.509 client certificates Common for initial cluster setup; credentials are embedded in kubeconfig; cannot be revoked without revoking the CA
Static bearer tokens Long-lived; avoid
OIDC via external IdP Preferred for human access — supports SSO, MFA, and revocation via IdP
Webhook auth Flexible, requires custom infrastructure

X.509 certificates are the bootstrap pattern. Every managed Kubernetes offering generates an admin kubeconfig with a client certificate. The problem: you can’t revoke individual certificates without rotating the CA. If you’re giving human engineers access via client certificates, someone leaving doesn’t actually lose cluster access until the certificate expires.

OIDC is the right model. Configure the kube-apiserver to accept JWTs from your IdP, bind RBAC permissions to groups from the IdP, and revocation becomes “remove from IdP group” rather than “hope the certificate expires soon”:

# kube-apiserver flags for OIDC (managed clusters configure this via provider settings)
--oidc-issuer-url=https://accounts.google.com
--oidc-client-id=my-cluster-client-id
--oidc-username-claim=email
--oidc-groups-claim=groups
--oidc-groups-prefix=oidc:
# User's kubeconfig — uses an exec plugin to fetch an OIDC token
users:
- name: alice
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      command: kubectl-oidc-login
      args:
        - get-token
        - --oidc-issuer-url=https://dex.company.com
        - --oidc-client-id=kubernetes

With managed clusters:

# EKS: add IAM role as a cluster access entry (replaces the aws-auth ConfigMap)
aws eks create-access-entry \
  --cluster-name my-cluster \
  --principal-arn arn:aws:iam::123456789012:role/DevTeamRole \
  --type STANDARD

aws eks associate-access-policy \
  --cluster-name my-cluster \
  --principal-arn arn:aws:iam::123456789012:role/DevTeamRole \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy \
  --access-scope type=namespace,namespaces=production,staging

# GKE: get credentials; IAM roles map to cluster permissions
gcloud container clusters get-credentials my-cluster --region us-central1
# roles/container.developer → edit permissions
# But: use ClusterRoleBindings for fine-grained control rather than relying on GCP IAM roles

# AKS: bind Entra ID groups to Kubernetes RBAC
az aks get-credentials --name my-aks --resource-group rg-prod
kubectl create clusterrolebinding dev-team-view \
  --clusterrole=view \
  --group=ENTRA_GROUP_OBJECT_ID

Cloud IAM + Kubernetes RBAC: The Integration Points

EKS Pod Identity / IRSA (revisited)

The annotation on the Kubernetes ServiceAccount is the bridge:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/AppBackendRole

Kubernetes RBAC controls what the pod can do inside the cluster. The IAM role controls what the pod can do in AWS. Both must be explicitly granted; neither inherits from the other.

GKE Workload Identity

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    iam.gke.io/gcp-service-account: [email protected]

AKS Workload Identity

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    azure.workload.identity/client-id: "MANAGED_IDENTITY_CLIENT_ID"
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    azure.workload.identity/use: "true"
spec:
  serviceAccountName: app-backend

RBAC Audit — What to Check First

# Start here: who has cluster-admin?
kubectl get clusterrolebindings -o json | \
  jq '.items[] | select(.roleRef.name=="cluster-admin") | 
      {binding: .metadata.name, subjects: .subjects}'
# cluster-admin should bind to almost nobody — review every result

# Find ClusterRoles with wildcard permissions
kubectl get clusterroles -o json | \
  jq '.items[] | select(.rules[]?.verbs[]? == "*") | .metadata.name'

# What can the default SA do in each namespace?
for ns in $(kubectl get namespaces -o name | cut -d/ -f2); do
  echo "=== $ns ==="
  kubectl auth can-i --list --as=system:serviceaccount:${ns}:default -n ${ns} 2>/dev/null \
    | grep -v "no" | head -10
done

# What can a specific SA do?
kubectl auth can-i --list \
  --as=system:serviceaccount:production:app-backend \
  -n production

# Check whether an SA can escalate — key risk indicators
kubectl auth can-i get secrets -n production \
  --as=system:serviceaccount:production:app-backend
kubectl auth can-i create pods -n production \
  --as=system:serviceaccount:production:app-backend
kubectl auth can-i create rolebindings -n production \
  --as=system:serviceaccount:production:app-backend

Creating pods and creating rolebindings are privilege escalation primitives. A service account that can create pods can run a pod with a different, more powerful SA. A service account that can create rolebindings can grant itself more permissions.

Useful Tools

# rbac-tool — visualize and analyze RBAC (install: kubectl krew install rbac-tool)
kubectl rbac-tool viz                              # generate a graph of all bindings
kubectl rbac-tool who-can get secrets -n production
kubectl rbac-tool lookup [email protected]

# rakkess — access matrix for a subject
kubectl rakkess --sa production:app-backend

# audit2rbac — generate minimal RBAC from audit logs
audit2rbac --filename /var/log/kubernetes/audit.log \
  --serviceaccount production:app-backend

Common RBAC Misconfigurations

Misconfiguration Risk Fix
cluster-admin bound to application SA Full cluster takeover from compromised pod Minimal ClusterRole; scope to namespace where possible
list or wildcard on secrets Read all secrets in scope — includes credentials, API keys Grant get on specific named secrets only
default SA with non-trivial permissions Every pod in the namespace inherits the permission Bind permissions to dedicated SAs; automountServiceAccountToken: false on default
ClusterRoleBinding for namespace-scoped work Namespace work with cluster-wide permission Always prefer RoleBinding; ClusterRoleBinding only for genuinely cluster-wide needs
Binding users by username string Hard to revoke; doesn’t sync with IdP Bind groups from IdP; revocation propagates through group membership
SA can create pods or create rolebindings Privilege escalation path Audit and remove these from non-privileged SAs

Framework Alignment

Framework Reference What It Covers Here
CISSP Domain 5 — Identity and Access Management Kubernetes RBAC operates as a full IAM system at the platform layer, independent of cloud IAM
CISSP Domain 3 — Security Architecture Two independent authorization layers (cloud + K8s) must each be designed and audited — one does not compensate for the other
ISO 27001:2022 5.15 Access control Kubernetes RBAC Roles, ClusterRoles, and bindings implement access control within the container platform
ISO 27001:2022 5.18 Access rights Service account provisioning, OIDC-based human access, and workload identity integration with cloud IAM
ISO 27001:2022 8.2 Privileged access rights cluster-admin and wildcard RBAC bindings represent the highest-privilege grants in Kubernetes
SOC 2 CC6.1 Kubernetes RBAC is the access control mechanism for the container platform layer in CC6.1
SOC 2 CC6.3 Binding revocation, SA token disabling, and OIDC group-based access removal satisfy CC6.3 requirements

Key Takeaways

  • Kubernetes RBAC and cloud IAM are separate authorization layers — both must be secured; strong cloud IAM with weak K8s RBAC is still a vulnerable cluster
  • cluster-admin bindings are the first thing to audit in any cluster — the blast radius of a compromised pod with cluster-admin is the entire cluster
  • Disable automountServiceAccountToken on service accounts and pods that don’t call the Kubernetes API — most application pods don’t need it
  • Use OIDC for human access rather than client certificates; revocation via IdP is instant and reliable
  • Bind groups from IdP rather than individual usernames; revocation propagates automatically when someone leaves
  • A service account that can create pods or create rolebindings is a privilege escalation path — audit for these in every namespace

What’s Next

EP12 is the capstone: Zero Trust IAM — how all the concepts in this series come together into an architecture that assumes nothing is implicitly trusted, verifies everything explicitly, and limits blast radius through least privilege enforced at every layer.

Next: Zero trust access in the cloud

Get EP12 in your inbox when it publishes → linuxcent.com/subscribe

CO-RE and libbpf — Write Once, Run on Any Kernel

Reading Time: 9 minutes

eBPF: From Kernel to Cloud, Episode 6
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf**


Architecture Overview

CO-RE and libbpf — portable eBPF program architecture using BTF for kernel-version independence
CO-RE relocations + BTF let a single eBPF binary run across different kernel versions without recompilation.

TL;DR

  • Kernel structs change between releases — hardcoded offsets break across patch versions, not just major releases
  • BTF embeds full type information in the kernel at /sys/kernel/btf/vmlinux; CO-RE uses it to patch field accesses at load time
    (BTF = BPF Type Format — a compact description of every struct, field, and byte offset in the running kernel, built into the kernel image)
  • vmlinux.h, generated from BTF, replaces all kernel headers with a single file committed to your repository
  • BPF_CORE_READ() is the CO-RE macro — every kernel struct access in a portable program goes through it
  • libbpf skeleton generation (bpftool gen skeleton) eliminates manual fd management for map and program lifecycle
  • For production tools: libbpf + CO-RE. For one-off debugging: bpftrace. For prototyping: BCC.

eBPF CO-RE (Compile Once, Run Everywhere) solves the kernel portability problem — the reason Cilium and Falco survive kernel upgrades without recompilation. What maps assumed — quietly — is that the kernel structs those programs read look the same tomorrow as they do today. They don’t. The Linux kernel has no stable ABI for internal data structures. task_struct, sk_buff, sock — the fields eBPF programs read constantly — can shift between patch releases, not just major versions. I learned this the hard way when a routine upgrade from 5.15.0-89 to 5.15.0-91 — two patch revisions — silently broke a custom tracer I’d been running in production for six months.


Six months after deploying a custom eBPF tracer for a client — it detected specific syscall patterns that Falco’s default ruleset didn’t cover — they ran a routine Ubuntu patch upgrade. Not a major kernel version jump. 5.15.0-89 to 5.15.0-91. Two patch revisions.

The tracer stopped loading. The error was invalid indirect read from stack. I opened the program source: nothing remotely like an indirect read. The program was a straightforward tracepoint handler, maybe 40 lines of C.

Three hours of debugging led to a four-byte offset difference. The struct task_struct had a field alignment change between the two patch versions. My program accessed ->comm at a hardcoded byte offset. On 5.15.0-89 that offset was 0x620. On 5.15.0-91 it was 0x624. The verifier caught the misalignment — correctly — and rejected the program.

I had compiled the eBPF bytecode against a fixed kernel header snapshot. The binary was not portable. Every time the kernel moved a struct field, the tool broke.

CO-RE is the solution to this.

Quick Check: Does Your Cluster Support CO-RE?

Two commands — check whether your nodes have the BTF support that CO-RE tools require:

# SSH into a worker node, then:
ls -la /sys/kernel/btf/vmlinux && echo "BTF available — CO-RE tools will work"

Expected output on a supported node:

-r--r--r-- 1 root root 4956234 Apr 21 00:00 /sys/kernel/btf/vmlinux
BTF available — CO-RE tools will work

If the file is missing: CO-RE tools (Cilium, Falco, Tetragon) will fall back to legacy BCC compilation mode — which requires a full compiler toolchain and kernel headers installed on every node.

# Confirm the kernel was built with BTF enabled
cat /boot/config-$(uname -r) | grep CONFIG_DEBUG_INFO_BTF
# CONFIG_DEBUG_INFO_BTF=y  ← required for CO-RE

Common results by platform:
| Platform | BTF available? |
|———-|—————-|
| Ubuntu 20.04+ (kernel 5.4+) | ✓ Yes |
| EKS managed nodes (AL2023) | ✓ Yes |
| GKE managed nodes (kernel 5.10+) | ✓ Yes |
| Amazon Linux 2 (older kernels) | ✗ No — BCC fallback |
| RHEL 7 / CentOS 7 | ✗ No |

Why Kernel Structs Change and Why It Matters

The Linux kernel has no stable ABI for internal data structures. task_struct, sock, sk_buff, file — the structs that eBPF programs read constantly — change between releases.

ABI (Application Binary Interface) is the contract that says a compiled binary built against version N will still work against version N+1 without recompilation. The Linux kernel maintains a stable ABI for syscalls (open(), read(), connect()) but makes no such guarantee for internal structs. Fields move, get added, get renamed between patch releases — and any program with hardcoded offsets silently breaks. Field additions, reordering, alignment changes, struct embedding changes. The kernel developers are under no obligation to preserve internal layouts, and they don’t.

Before CO-RE, eBPF programs dealt with this in two ways:

BCC (BPF Compiler Collection) — compile the eBPF C code at runtime on the target host, using that system’s kernel headers. No portability problem because compilation happens on the machine you’re deploying to. Cost: you need a full compiler toolchain, kernel headers, and Python runtime on every production node. Startup time in seconds. Container image size in hundreds of MB. For a security tool that should be lightweight and fast-starting, this is a non-starter.

Per-kernel compiled binaries — ship different builds for each supported kernel version, detect at runtime, load the matching binary. Falco maintained this model for years. The operational overhead is significant: a matrix of kernel × distro × version with separate build and test pipelines for each combination.

CO-RE is the third option. Compile once on a build machine, and let libbpf patch struct field accesses at load time on the target system, using type information embedded in the running kernel.

BTF: The Type System That Makes CO-RE Possible

BTF (BPF Type Format) is compact type debug information embedded directly into the kernel image. Since Linux 5.2, kernels built with CONFIG_DEBUG_INFO_BTF=y expose their full type information at /sys/kernel/btf/vmlinux.

# Verify BTF is available
ls -la /sys/kernel/btf/vmlinux

# Inspect the BTF for a specific struct
bpftool btf dump file /sys/kernel/btf/vmlinux format raw | grep -A 5 'task_struct'

# See the actual field offsets the running kernel uses
bpftool btf dump file /sys/kernel/btf/vmlinux format c | grep -A 20 'struct task_struct {'

BTF encodes every struct definition with field names, types, and byte offsets. When libbpf loads an eBPF program compiled with CO-RE relocations, it reads both the BTF the program was compiled against (embedded in the .bpf.o file) and the BTF of the running kernel. If task_struct->comm has moved, libbpf patches the field access instruction before loading the program.

This patching happens at load time, transparently, without modifying the binary you shipped.

CO-RE relocation is the mechanism behind this. When a CO-RE program is compiled, it embeds metadata saying “I need the offset of comm inside task_struct” rather than hardcoding 0x620. At load time, libbpf reads this relocation, looks up the real offset from the running kernel’s BTF, and patches the instruction. For operators: this is why Cilium and Falco survive kernel upgrades without you reinstalling them.

Most distribution kernels now ship with BTF enabled:

# Ubuntu 20.04+ (kernel 5.4+)
cat /boot/config-$(uname -r) | grep CONFIG_DEBUG_INFO_BTF
# CONFIG_DEBUG_INFO_BTF=y

# Check at runtime
file /sys/kernel/btf/vmlinux
# /sys/kernel/btf/vmlinux: symbolic link to /sys/kernel/btf/vmlinux

Amazon Linux 2023, Ubuntu 22.04, Debian 11+, RHEL 8.2+, and most cloud-provider-managed kernels have BTF. The notable exception: RHEL 7 and Amazon Linux 2 on older kernels.

The CO-RE Toolchain

The build pipeline for a CO-RE eBPF program:

Development machine:
  vmlinux.h (generated from kernel BTF)
       ↓
  myprog.bpf.c ──── clang -target bpf -g ────→ myprog.bpf.o
  (CO-RE relocations embedded in BTF section)
       ↓
  bpftool gen skeleton myprog.bpf.o ─────────→ myprog.skel.h
       ↓
  myprog.c (userspace) ── gcc ──→ myprog
  (statically links libbpf, skeleton handles load/attach/cleanup)

Target machine (any kernel with BTF, 5.4+):
  ./myprog
  ↓ libbpf reads /sys/kernel/btf/vmlinux
  ↓ patches field accesses to match current kernel struct layout
  ↓ verifier validates patched program
  ↓ program loads and runs

One binary. Any supported kernel. No compiler on the target system.

vmlinux.h — One Header to Replace Them All

Before CO-RE, eBPF C programs included dozens of kernel headers — linux/sched.h, linux/net.h, linux/fs.h, linux/socket.h — and they had to match the exact kernel version you were targeting.

vmlinux.h is generated from the BTF of a running kernel. It contains every struct, enum, typedef, and macro definition the kernel exposes through BTF — in a single file, without any compile-time kernel dependency.

# Generate vmlinux.h from the running kernel
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h

# Typical size
wc -l vmlinux.h
# 350000+

You commit vmlinux.h to your repository, generated from a representative kernel. CO-RE handles the actual layout differences at load time on whatever kernel you deploy to. The file is large but you only generate it once and update it when you add support for a new kernel generation.

In your eBPF C source:

#include "vmlinux.h"           // replaces all kernel headers
#include <bpf/bpf_helpers.h>   // eBPF helper functions
#include <bpf/bpf_tracing.h>   // tracing macros
#include <bpf/bpf_core_read.h> // CO-RE read macros

How CO-RE Fixes the Offset Problem

The mechanism is worth understanding once, even if you’re not writing eBPF programs.

When a CO-RE eBPF program accesses a kernel struct field, it doesn’t hardcode the byte offset. Instead, it records a relocation — “I need the offset of pid inside task_struct” — in the compiled binary. When libbpf loads the program, it resolves each relocation by looking up the field in the running kernel’s BTF and patches the access instruction to use the correct offset for this specific kernel.

This is why my four-byte problem happened: the tracer I’d compiled wasn’t using CO-RE. It hardcoded 0x620 as the offset of task_struct->comm. When the kernel moved it to 0x624, the program accessed the wrong memory, the verifier caught the misalignment, and the load failed. A CO-RE rewrite would have resolved comm‘s offset at load time from BTF and never known the difference.

The relocation model also handles fields that don’t exist on older kernels. If a program accesses a field added in kernel 5.15 and the running kernel is 5.10, libbpf can either skip the access (returning a zero value) or fail the load — depending on how the program marks the field access. This is how tools ship support for features across a kernel version range without separate builds.

What CO-RE Means for Tools You Already Run

This is why you care about CO-RE even if you’re never going to write an eBPF program yourself.

Falco, Cilium, Tetragon, and Pixie all ship as single binaries or container images. You install them on a Ubuntu 22.04 node, a RHEL 9 node, and an Amazon Linux 2023 node — three different kernel versions, three different task_struct layouts — and the same binary works on all of them. Before CO-RE, Falco maintained pre-compiled kernel probes for every supported kernel version in a matrix of distro × kernel × version. The probe list had thousands of entries. A kernel your distro shipped between Falco release cycles meant a gap in coverage until the next release.

With CO-RE, there’s one binary. libbpf reads the running kernel’s BTF at load time, patches the field accesses to match the actual struct layout, and the verifier checks the patched program. The tool vendor doesn’t need to know about your specific kernel. You don’t need to wait for a probe release.

The constraint is BTF availability. Check your nodes:

# Quick check — if this file exists, CO-RE tools work
ls /sys/kernel/btf/vmlinux

# Full confirmation
cat /boot/config-$(uname -r) | grep CONFIG_DEBUG_INFO_BTF
# CONFIG_DEBUG_INFO_BTF=y  ← required

What you’ll find: Ubuntu 20.04+, Debian 11+, RHEL 8.2+, Amazon Linux 2023, and GKE/EKS managed nodes all have BTF. Amazon Linux 2 and RHEL 7 do not. If you’re running those, CO-RE-based tools fall back to the legacy BCC compilation path — which requires kernel headers installed on the node.

The One Thing to Run Right Now

This command shows you the exact struct layout your running kernel uses — the same layout libbpf reads when it patches CO-RE programs at load time:

# See how your kernel defines task_struct right now
bpftool btf dump file /sys/kernel/btf/vmlinux format c | grep -A 30 '^struct task_struct {'

The output is the canonical type information for your running kernel. Every field, every offset. When libbpf loads a CO-RE program, it’s reading this to figure out whether task_struct->comm is at offset 0x620 or 0x624.

You can also see specific struct sizes and verify that two kernels differ:

# On kernel A (5.15.0-89)
bpftool btf dump file /sys/kernel/btf/vmlinux format raw | grep -w "task_struct" | head -3

# On kernel B (5.15.0-91) — same command, different output if struct changed
# This is what broke my tracer: field offset changed across a two-patch jump

The practical use: when a CO-RE eBPF tool fails to load with a BTF error, this is where you look. The error tells you which struct field the relocation failed on. This command shows you the current layout. You can confirm whether the field exists, whether it moved, whether it was renamed.

BCC vs libbpf vs bpftrace

Three approaches to eBPF development, with distinct tradeoffs:

BCC libbpf + CO-RE bpftrace
Compilation Runtime on target host Build-time on dev machine Runtime (embedded LLVM)
Target deployment Compiler + headers on every node Single static binary bpftrace binary only
Portability Compile-on-target handles it CO-RE + BTF handles it Internal CO-RE support
Memory overhead High (Python + compiler: 200MB+) Low (few MB binary) Medium
Startup time Seconds (compilation) Milliseconds Seconds (JIT compile)
Best for Prototyping, development Production tools, shipped software Interactive debugging sessions
Language Python + C C (kernel) + C/Go/Rust (userspace) bpftrace scripting

For anything you’re shipping — an eBPF-based security tool, an observability agent, an open-source project — libbpf + CO-RE is the right choice. BCC is for prototyping before you commit to an implementation. bpftrace is for the 30-second debugging session on a live node.

The practical test: if you’re building something you’ll deploy as a container image or a package, it needs to be a self-contained binary with no build dependencies on the target system. That means libbpf.

Common Mistakes

Mistake Impact Fix
Direct struct dereference instead of BPF_CORE_READ Program breaks on any kernel struct change Use BPF_CORE_READ() for all kernel struct field access
Missing char LICENSE[] SEC("license") = "GPL" GPL-only helpers (most tracing helpers) unavailable Always include the license section
vmlinux.h generated on a very old kernel Missing structs added in newer kernel releases Regenerate from the highest kernel version you target
Forgetting -g flag in clang invocation No BTF debug info → no CO-RE relocations Always compile with -g -O2 -target bpf
Hardcoding struct offsets as integer constants Breaks silently on next kernel patch Use BTF-aware CO-RE macros exclusively

Key Takeaways

  • Kernel structs change between releases — hardcoded offsets break across patch versions, not just major releases
  • BTF embeds full type information in the kernel at /sys/kernel/btf/vmlinux; CO-RE uses it to patch field accesses at load time
  • vmlinux.h, generated from BTF, replaces all kernel headers with a single file committed to your repository
  • BPF_CORE_READ() is the CO-RE macro — every kernel struct access in a portable program goes through it
  • libbpf skeleton generation (bpftool gen skeleton) eliminates manual fd management for map and program lifecycle
  • For production tools: libbpf + CO-RE. For one-off debugging: bpftrace. For prototyping: BCC.

What’s Next

CO-RE makes eBPF programs portable across kernel versions. EP07 takes the next question: where in the kernel’s data path does it make sense to attach them?

XDP fires before the kernel has allocated a single byte of memory for an incoming packet — before the kernel even knows whether to accept it. That hook placement is why Cilium can do line-rate load balancing and why some network filtering rules that look correct in iptables do nothing against certain traffic. The rules weren’t wrong. The hook was in the wrong place.

Next: XDP — packets processed before the kernel knows they arrived

Get EP07 in your inbox when it publishes → linuxcent.com/subscribe

eBPF Maps — The Persistent Data Layer Between Kernel and Userspace

Reading Time: 11 minutes

eBPF: From Kernel to Cloud, Episode 5
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps**


Architecture Overview

eBPF Maps — the persistent data layer between kernel eBPF programs and userspace tools
eBPF maps are the shared memory between kernel programs and userspace — hash, array, ringbuf, and LRU variants shown.

TL;DR

  • eBPF programs are stateless — maps are where all state lives, between invocations and between kernel and userspace
    (“stateless” here means each program invocation starts with no memory of previous runs — like a function with no global variables)
  • Every production eBPF tool (Cilium, Falco, Tetragon, Datadog NPM) is a map-based architecture — bpftool map list shows you what it’s actually holding
  • Per-CPU maps eliminate write contention for high-frequency counters; the tool aggregates per-CPU values at export time
  • LRU maps handle unbounded key spaces (IPs, PIDs, connections) without hard errors when full — but eviction is silent, so size generously
  • Ring buffer (kernel 5.8+) is the correct event streaming primitive — Falco and Tetragon both use it
  • Map memory is kernel-locked and invisible to standard memory metrics — account for it explicitly on eBPF-heavy nodes
  • Pinned maps survive restarts; Cilium uses this for zero-disruption connection tracking through upgrades

The Big Picture

  HOW eBPF MAPS CONNECT KERNEL PROGRAMS TO USERSPACE TOOLS

  ┌─────────────────────────────────────────────────────────────┐
  │  Kernel space                                               │
  │                                                             │
  │  [XDP program]  [TC program]  [kprobe]  [tracepoint]        │
  │        │              │           │           │             │
  │        └──────────────┴───────────┴───────────┘             │
  │                              │                              │
  │                   bpf_map_update_elem()                     │
  │                              │                              │
  │                              ▼                              │
  │  ┌─────────────────────────────────────────────────────┐    │
  │  │             eBPF MAP (kernel object)                │    │
  │  │  hash · percpu_hash · lru_hash · ringbuf · lpm_trie │    │
  │  │  Lives outside program invocations.                 │    │
  │  │  Pinned maps (/sys/fs/bpf/) survive restarts.       │    │
  │  └────────────────────┬────────────────────────────────┘    │
  └───────────────────────│─────────────────────────────────────┘
                          │  read / write via file descriptor
                          ▼
  ┌─────────────────────────────────────────────────────────────┐
  │  Userspace tools                                            │
  │                                                             │
  │  Cilium agent  Falco engine  Tetragon  bpftool map dump     │
  └─────────────────────────────────────────────────────────────┘

eBPF maps are the persistent data layer between kernel programs and the tools that consume their output. eBPF programs fire and exit — there’s no memory between invocations. Yet Cilium tracks TCP connections across millions of packets, and Falco correlates a process exec from five minutes ago with a suspicious network connection happening now. The mechanism between stateless kernel programs and the stateful production tools you depend on is what this episode is about — and understanding it changes what you see when you run bpftool map list.


I was trying to identify the noisy neighbor saturating a cluster’s egress link. I had an eBPF program loading cleanly, events firing, everything confirming it was working. But when I read back the per-port connection counters from userspace, everything was zero.

I spent an hour on it before posting to the BCC mailing list. The reply came back fast: eBPF programs don’t hold state between invocations. Every time the kprobe fires, the program starts fresh. The counter I was incrementing existed only for that single call — created, incremented to one, then discarded. On every single invocation. I was counting events one at a time, throwing the count away, and reading nothing.

That’s what eBPF maps solve.

Quick Check: What Maps Are Running on Your Node?

Before the map types walkthrough — see the live state of maps on any cluster node right now:

# SSH into a worker node, then:
bpftool map list

On a node running Cilium + Falco, you’ll see something like:

12: hash          name cilium_ct4_glo    key 24B  value 56B  max_entries 65536  memlock 5767168B
13: lpm_trie      name cilium_ipcache    key 40B  value 32B  max_entries 512000 memlock 327680B
14: percpu_hash   name cilium_metrics    key 8B   value 32B  max_entries 65536  memlock 2097152B
28: ringbuf       name falco_events      max_entries 8388608

Reading this output:
hash, lpm_trie, percpu_hash, ringbuf — the map type (each optimised for a different access pattern)
key 24B value 56B — sizes of a single entry’s key and value in bytes
max_entries — the hard ceiling; when the map is full, behaviour depends on type (see LRU section below)
memlock — non-pageable kernel memory this map consumes (invisible to free and container metrics)

Not running Cilium? On EKS with aws-vpc-cni or GKE with kubenet, there are far fewer maps here — primarily kube-proxy uses iptables rather than BPF maps. Running bpftool map list still works; you’ll just see fewer entries. On a pure iptables-based cluster, most of the maps you see come from the system kernel itself, not a CNI.

Maps Are the Architecture, Not an Afterthought

Maps are kernel objects that live outside any individual program invocation. They’re shared between multiple eBPF programs, readable and writable from userspace, and persistent for the lifetime of the map — which can outlive both the program that created them and the userspace process that loaded them.

Every production eBPF tool is fundamentally a map-based architecture:

  • Cilium stores connection tracking state in BPF hash maps
  • Falco uses ring buffers to stream syscall events to its userspace rule engine
  • Tetragon maintains process tree state across exec events using maps
  • Datadog NPM stores per-connection flow stats in per-CPU maps for lock-free metric accumulation

Run bpftool map list on a Cilium node:

$ bpftool map list
ID 12: hash          name cilium_ct4_glo    key 24B  value 56B   max_entries 65536
#      ^^^^           ^^^^^^^^^^^^^^^^       ^^^^^^   ^^^^^^^     ^^^^^^^^^^^^^^^^
#      type           map name               key size value size  max concurrent entries

ID 13: lpm_trie      name cilium_ipcache    key 40B  value 32B   max_entries 512000
#      longest-prefix-match trie — for IP address + CIDR lookups

ID 14: percpu_hash   name cilium_metrics    key 8B   value 32B   max_entries 65536
#      one copy of this map per CPU — no write contention for high-frequency counters

ID 28: ringbuf       name falco_events      max_entries 8388608
#                                           ^^^^^^^^^^^ 8MB ring buffer for event streaming

Connection tracking, IP policy cache, per-CPU metrics, event stream. Every one of these is a different map type, chosen for a specific reason.

Map Types and What They’re Actually Used For

Hash Maps

The general-purpose key-value store. A key maps to a value — lookup is O(1) average. Cilium’s connection tracking map (cilium_ct4_glo) is a hash map: the key is a 5-tuple (source IP, destination IP, ports, protocol), the value is the connection state.

$ bpftool map show id 12
12: hash  name cilium_ct4_glo  flags 0x0
        key 24B  value 56B  max_entries 65536  memlock 5767168B

The key 24B is the 5-tuple. The value 56B is the connection state record. max_entries 65536 is the upper bound — Cilium can track 65,536 active connections in this map before hitting the limit.

Hash maps are shared across all CPUs on the node. When multiple CPUs try to update the same entry simultaneously — which happens constantly on busy nodes — writes need to be coordinated. For most use cases this is fine. For high-frequency counters updated on every packet, it’s a bottleneck. That’s when you reach for a per-CPU hash map.

Where you see them: connection tracking, per-IP statistics, process-to-identity mapping, policy verdict caching.

Per-CPU Hash Maps

Per-CPU hash maps solve the write coordination problem by giving each CPU its own independent copy of every entry. There’s no sharing, no contention, no waiting — each CPU writes its own copy without touching any other.

The tradeoff: reading from userspace means collecting one value per CPU and summing them up. That aggregation happens in the tool, not the kernel.

# Cilium's per-CPU metrics map — one counter value per CPU
bpftool map dump id 14
key: 0x00000001
  value (CPU 00): 12345
  value (CPU 01): 8901
  value (CPU 02): 3421
  value (CPU 03): 7102
# total bytes for this metric: 31769

Cilium’s cilium_metrics map uses this pattern for exactly this reason — it’s updated on every packet across every CPU on the node. Forcing all CPUs to coordinate writes to a single shared entry at that rate would hurt throughput. Instead: each CPU writes locally, Cilium’s userspace agent sums the values at export time.

Where you see them: packet counters, byte counters, syscall frequency metrics — anywhere updates happen on every event at high volume.

LRU Hash Maps

LRU hash maps add automatic eviction. Same key-value semantics as a regular hash map, but when the map hits its entry limit, the least recently accessed entry is dropped to make room for the new one.

This matters for any map tracking dynamic state with an unpredictable number of keys: TCP connections, process IDs, DNS queries, pod IPs. Without LRU semantics, a full map returns an error on insert — and in production, that means your tool silently stops tracking new entries. Not a crash, not an alert — just missing data.

Cilium’s connection tracking map is LRU-bounded at 65,536 entries. On a node handling high-connection-rate workloads, this can fill up. When it does, Cilium starts evicting old connections to make room for new ones — and if it’s evicting too aggressively, you’ll see connection resets.

# Check current CT map usage vs its limit
bpftool map show id 12
# max_entries tells you the ceiling
# count entries to see current usage
bpftool map dump id 12 | grep -c "^key"

Size LRU maps at 2× your expected concurrent active entries. Aggressive eviction under pressure introduces gaps — not crashes, but missing or incorrect state.

Where you see them: connection tracking, process lineage, anything where the key space is dynamic and unbounded.

Ring Buffers

Ring buffers are how eBPF tools stream events from the kernel to a userspace consumer. Falco reads syscall events from a ring buffer. Tetragon streams process execution and network events through ring buffers. The pattern is the same across all of them:

kernel eBPF program
  → sees event (syscall, network packet, process exec)
  → writes record to ring buffer
  → userspace tool reads it and processes (Falco rules, Tetragon policies)

What makes ring buffers the right primitive for event streaming:

  • Single buffer shared across all CPUs — unlike the older perf_event_array approach which required one buffer per CPU, a ring buffer is one allocation, one file descriptor, one consumer
  • Lock-free — the kernel writes, the userspace tool reads, they don’t block each other
  • Backpressure when full — if the userspace tool can’t keep up, new events are dropped rather than queued indefinitely. The tool can detect and count drops. Falco reports these as Dropped events in its stats output.
# Falco's ring buffer — 8MB
bpftool map list | grep ringbuf
# ID 28: ringbuf  name falco_events  max_entries 8388608

8,388,608 bytes = 8MB. That’s the buffer between Falco’s kernel hooks and its rule engine. If there’s a burst of syscall activity and Falco’s rule evaluation can’t keep up, events drop into that window and are lost.

Sizing matters operationally. Too small and you drop events during normal burst. Too large and you’re holding non-pageable kernel memory that doesn’t show up in standard memory metrics.

# Check Falco's drop rate
falcoctl stats
# or check the Falco logs
journalctl -u falco | grep -i "drop"

Most production deployments run 8–32MB. Start at 8MB, monitor drop rates under load, size up if needed.

Where you see them: Falco event streaming, Tetragon audit events, any tool that needs to move high-volume event data from kernel to userspace.

Array Maps

Array maps are fixed-size, integer-indexed, and entirely pre-allocated at creation time. Think of them as lookup tables with integer keys — constant-time access, no hash overhead, no dynamic allocation.

Cilium uses array maps for policy configuration: a fixed set of slots indexed by endpoint identity number. When a packet arrives and Cilium needs to check policy, it indexes into the array directly rather than doing a hash lookup. For read-heavy, write-rare data, this is faster.

The constraint: you can’t delete entries from an array map. Every slot exists for the lifetime of the map. If you need to track state that comes and goes — connections, processes, pods — use a hash map instead.

Where you see them: policy configuration, routing tables with fixed indices, per-CPU stats indexed by CPU number.

LPM Trie Maps

LPM (Longest Prefix Match) trie maps handle IP prefix lookups — the same operation that a hardware router does when deciding which interface to send a packet out of.

You can store a mix of specific host addresses (/32) and CIDR ranges (/16, /24) in the same map, and a lookup returns the most specific match. If 10.0.1.15/32 and 10.0.0.0/8 are both in the map, a lookup for 10.0.1.15 returns the /32 entry.

Cilium’s cilium_ipcache map is an LPM trie. It maps every IP in the cluster to its security identity — the identifier Cilium uses for policy enforcement. When a packet arrives, Cilium does a trie lookup on the source IP to find out which endpoint sent it, then checks policy against that identity.

# Inspect the ipcache map
bpftool map show id 13
# lpm_trie  name cilium_ipcache  key 40B  value 32B  max_entries 512000

# Look up which security identity owns a pod IP
bpftool map lookup id 13 key hex 20 00 00 00 0a 00 01 0f 00 00 00 00 00 00 00 00 00 00 00 00

Where you see them: IP-to-identity mapping (Cilium), CIDR-based policy enforcement, IP blocklists.


Pinned Maps — State That Survives Restarts

By default, a map’s lifetime is tied to the tool that created it. When the tool exits, the kernel garbage-collects the map.

Pinning writes a reference to the BPF filesystem at /sys/fs/bpf, which keeps the map alive even after the creating process exits:

# See all maps Cilium has pinned
ls /sys/fs/bpf/tc/globals/
# cilium_ct4_global  cilium_ipcache  cilium_metrics  cilium_policy ...

# Inspect a pinned map directly — no Cilium process needed
bpftool map dump pinned /sys/fs/bpf/tc/globals/cilium_ct4_global

# Pin any map by ID for manual inspection
bpftool map pin id 12 /sys/fs/bpf/my_conn_tracker
bpftool map dump pinned /sys/fs/bpf/my_conn_tracker

Cilium pins all its maps under /sys/fs/bpf/tc/globals/. When Cilium restarts — rolling upgrade, crash, OOM kill — it reopens its pinned maps and resumes with existing state intact. Pods maintain established TCP connections through a Cilium restart without disruption.

This is operationally significant: if you’re evaluating eBPF-based tools for production, check whether they pin their maps. A tool that doesn’t loses all its tracked state on every restart — connection tracking resets, process lineage gaps, policy state rebuilt from scratch.


Map Memory: A Production Consideration

Map memory is kernel-locked — it cannot be paged out, and it doesn’t show up in standard memory pressure metrics. Your node’s free output and container memory limits don’t account for it.

Kernel-locked memory is memory the OS guarantees will never be swapped to disk — it stays in RAM permanently. The kernel requires this for eBPF maps because a kernel program running during a network interrupt cannot wait for a page fault. The side effect: it doesn’t appear in top, free, or container memory metrics, so it’s easy to accidentally provision nodes without accounting for it.

# Total eBPF map memory locked on this node
bpftool map list -j | python3 -c "
import json,sys
maps=json.load(sys.stdin)
total=sum(m.get('bytes_memlock',0) for m in maps)
print(f'Total map memory: {total/1024/1024:.1f} MB')
"

# Check system memlock limit (unlimited is correct for eBPF tools)
ulimit -l

# Check what Cilium's systemd unit sets
systemctl show cilium | grep -i memlock

On a node running Cilium + Falco + Datadog NPM, I’ve seen 200–400MB of map memory locked. That’s real, non-pageable kernel memory. If you’re sizing nodes for eBPF-heavy workloads, account for this separately from your pod workload memory.

If an eBPF tool fails to load with a permission error despite having enough free memory, the root cause is usually the memlock ulimit for the process. Cilium, Falco, and most production tools set LimitMEMLOCK=infinity in their systemd units. Verify this if you’re deploying a new eBPF-based tool and seeing unexpected load failures.


Inspecting Maps in Production

# List all maps: type, name, key/value sizes, memory usage
bpftool map list

# Dump all entries in a map (careful with large maps)
bpftool map dump id 12

# Look up a specific entry by key
bpftool map lookup id 12 key hex 0a 00 01 0f 00 00 00 00

# Watch map stats live
watch -n1 'bpftool map show id 12'

# See all maps for a specific tool by checking its pinned path
ls /sys/fs/bpf/tc/globals/                    # Cilium
ls /sys/fs/bpf/falco/                         # Falco (if pinned)

# Cross-reference map IDs with the programs using them
bpftool prog list
bpftool map list

⚠ Production Gotchas

A full LRU map drops state silently, not loudly
When Cilium’s CT map fills up, it starts evicting the least recently used connections — not returning an error. You see connection resets, not a tool alert. Check map utilisation (bpftool map dump id X | grep -c key) against max_entries on nodes with high connection rates.

Ring buffer drops don’t stop the tool — they create gaps
When Falco’s ring buffer fills up, events are dropped. Falco keeps running. The rule engine keeps processing. But you have gaps in your syscall visibility. Monitor Dropped events in Falco’s stats and size the ring buffer accordingly.

Map memory is invisible to standard monitoring
200–400MB of kernel-locked memory on a Cilium + Falco node doesn’t appear in top, container memory metrics, or memory pressure alerts. Size eBPF-heavy nodes with this in mind and add explicit map memory monitoring via bpftool.

Tools that don’t pin their maps lose state on restart
A Cilium restart with pinned maps = zero-disruption connection tracking. A tool without pinning = all tracked state rebuilt from scratch. This matters for connection tracking tools and any tool maintaining process lineage.

perf_event_array on kernel 5.8+ is the old way
Older eBPF tools use per-CPU perf_event_array for event streaming. Ring buffer is strictly better — single allocation, lower overhead, simpler consumption. If you’re running a tool that still uses perf_event_array on a 5.8+ kernel, it’s using a legacy path.


Key Takeaways

  • eBPF programs are stateless — maps are where all state lives, between invocations and between kernel and userspace
  • Every production eBPF tool (Cilium, Falco, Tetragon, Datadog NPM) is a map-based architecture — bpftool map list shows you what it’s actually holding
  • Per-CPU maps eliminate write contention for high-frequency counters; the tool aggregates per-CPU values at export time
  • LRU maps handle unbounded key spaces (IPs, PIDs, connections) without hard errors when full — but eviction is silent, so size generously
  • Ring buffer (kernel 5.8+) is the correct event streaming primitive — Falco and Tetragon both use it
  • Map memory is kernel-locked and invisible to standard memory metrics — account for it explicitly on eBPF-heavy nodes
  • Pinned maps survive restarts; Cilium uses this for zero-disruption connection tracking through upgrades

What’s Next

You know what program types run in the kernel, and you know how they hold state.

Get EP06 in your inbox when it publishes → linuxcent.com/subscribe But there’s a problem anyone running eBPF-based tools eventually runs into: a tool works on one kernel version and breaks on the next. Struct layouts shift between patch versions. Field offsets move. EP06 covers CO-RE (Compile Once, Run Everywhere) and libbpf — the mechanism that makes tools like Cilium and Falco survive your node upgrades without recompilation, and why kernel version compatibility is a solved problem for any tool built on this toolchain.

The Platform Engineering Era: GitOps, AI Workloads, and Leaner Kubernetes (2023–2025)

Reading Time: 6 minutes


Introduction

By 2023, the question had shifted from “how do we run Kubernetes?” to “how do we let other engineers run their workloads on Kubernetes without becoming a bottleneck?”

This is the platform engineering problem. And it drove the tooling that defined 2023–2025: GitOps as the deployment standard, Cluster API for Kubernetes-on-Kubernetes provisioning, AI/ML workloads forcing new scheduling capabilities, and the Kubernetes project itself shedding more weight to become faster to release and operate.


GitOps: Principle Becomes Practice

GitOps as a term was coined by Weaveworks in 2017. By 2023, it was no longer a debate — it was the default deployment model for organizations running Kubernetes at scale.

The principle: the desired state of your cluster lives in Git. A controller watches the repository and reconciles the cluster state to match. Every deployment is a PR merge. The audit trail is the Git history.

Flux v2 (CNCF graduated) and ArgoCD (CNCF incubating) became the two dominant implementations:

# Flux: GitRepository + Kustomization
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: production-config
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/org/k8s-config
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: production-apps
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/production
  prune: true          # Remove resources deleted from Git
  sourceRef:
    kind: GitRepository
    name: production-config
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: api
    namespace: production

The prune: true behavior is critical: resources deleted from Git are deleted from the cluster. This is what makes GitOps a security control — unknown resources that aren’t in Git get removed. No more accumulation of forgotten test deployments, rogue debug pods, or unauthorized configuration changes that outlive the engineer who made them.

ArgoCD’s Application model added a UI, synchronization policies, and multi-cluster management:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-api
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/org/apps
    targetRevision: HEAD
    path: api/production
  destination:
    server: https://kubernetes.default.svc
    namespace: api
  syncPolicy:
    automated:
      prune: true
      selfHeal: true    # Revert manual kubectl changes
    syncOptions:
    - CreateNamespace=true

The selfHeal: true option is where GitOps becomes enforceable: any manual change made with kubectl is automatically reverted within the sync interval. For compliance-sensitive environments, this is a configuration drift prevention control.


Cluster API: Kubernetes Managing Kubernetes

Cluster API (cluster-sigs/cluster-api) flipped the usual model: instead of using tools like Terraform or Ansible to provision Kubernetes clusters, Cluster API lets you manage Kubernetes clusters as Kubernetes resources — using a management cluster to provision and manage workload clusters.

# Create a new Kubernetes cluster as a Kubernetes resource
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: workload-cluster-prod
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: workload-cluster-prod
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: workload-cluster-prod-control-plane

Cluster API reconciliation handles cluster provisioning, scaling, upgrades, and deletion — all through the Kubernetes API, with all the tooling (RBAC, audit logging, GitOps integration) that entails. Multi-cluster platform teams could now manage hundreds of workload clusters from a single management cluster.


Kubernetes 1.28 — Sidecar Containers Alpha (August 2023)

Sidecar containers had been a Kubernetes pattern since 2015 — a helper container in the same pod as the main application. But there was no native sidecar lifecycle management. Sidecars were just regular init containers or additional containers, which meant:
– Init container sidecars ran before the application and had to block until they succeeded
– Regular container sidecars had no ordering guarantees at startup
– At pod termination, sidecars could die before the application finished draining

1.28 introduced native sidecar support: a new restartPolicy field for init containers:

spec:
  initContainers:
  - name: log-collector
    image: fluentbit:latest
    restartPolicy: Always    # This makes it a sidecar
    # Starts before main containers, stays running, stops after main containers exit
  containers:
  - name: application
    image: myapp:latest

A sidecar container (init container with restartPolicy: Always):
– Starts before application containers
– Stays running throughout the pod lifecycle
– Terminates automatically after all main containers exit
– Restarts if it crashes (unlike regular init containers)

This solved the service mesh sidecar problem: Istio and Linkerd injected Envoy proxies as regular containers, leading to race conditions where the proxy hadn’t started when the application tried to make outbound connections. Native sidecar lifecycle guarantees the proxy is ready before the application starts.

Also in 1.28:
Retroactive default StorageClass assignment: Existing PVCs without a StorageClass assignment get the default applied retroactively — useful for migrations
Non-graceful node shutdown stable: Handle node power failures without manual pod cleanup
Recovery from volume expansion failure: Previously, a failed volume expansion left the PVC in a broken state; 1.28 introduced a mechanism to recover


AI/ML Workloads Force New Kubernetes Capabilities

The LLM wave of 2023 drove GPU workloads onto Kubernetes at a scale and urgency the project hadn’t anticipated. Running LLM inference on Kubernetes required solving problems that CPU-centric cluster scheduling hadn’t encountered:

GPU topology awareness: Inference across multiple GPUs requires GPUs connected by NVLink or on the same PCIe switch, not arbitrary GPUs from different nodes or different PCIe buses. The Dynamic Resource Allocation API (1.26 alpha) was designed exactly for this.

Fractional GPU allocation: NVIDIA’s time-slicing and MIG (Multi-Instance GPU) allow multiple pods to share a single GPU. The GPU operator (NVIDIA) manages this at the node level:

# Check GPU resources visible to Kubernetes
kubectl get nodes -o custom-columns=\
  "NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
# NODE       GPU
# gpu-node-1   8
# gpu-node-2   8

Batch scheduling for training jobs: Training runs require all workers to start simultaneously — a single missing GPU makes the entire job stall. The Kubernetes Job API doesn’t guarantee this. Projects like Volcano (CNCF incubating) and Kueue (Kubernetes SIG Scheduling) added gang scheduling: a job only starts when all requested resources are available.

# Kueue: queue AI training jobs with resource quotas
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-queue
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu", "cpu", "memory"]
    flavors:
    - name: a100-80gb
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 16

Kubernetes 1.29 — Sidecar to Beta, Load Balancer IP Mode (December 2023)

  • Sidecar containers beta: The lifecycle semantics were refined based on 1.28 alpha feedback
  • Load balancer IP mode alpha: Distinguish between load balancers that use virtual IPs (kube-proxy handles the traffic) vs. those that handle traffic directly (no need for kube-proxy rules) — important for eBPF-based load balancers
  • ReadWriteOncePod volume access stable

Kubernetes 1.30 — Structured Authorization Config (April 2024)

  • Structured authorization configuration beta: Define multiple authorization webhooks with explicit ordering, failure modes, and connection settings — replacing the flat --authorization-mode flag
  • Sidecar containers beta continues
  • Node memory swap support beta: Allow pods to use swap memory — controversial but necessary for workloads with bursty memory patterns that prefer using swap over OOM kill
# Node with swap enabled — kubelet config
kind: KubeletConfiguration
memorySwap:
  swapBehavior: LimitedSwap

The swap support feature reversed a long-standing Kubernetes hard stance: swap was disabled since 1.0 because its interaction with Kubernetes memory accounting was unpredictable. The 1.30 approach adds proper accounting and policies.


Kubernetes 1.31 — Cloud Provider Code Removal Complete (August 2024)

1.31 marked the completion of the cloud provider code removal — the 1.5 million line migration that had been running since 1.26. Core binaries are 40% smaller. The API server, controller manager, and scheduler no longer contain vendor-specific code.

Also in 1.31:
Persistent Volume health monitor stable
AppArmor support stable: AppArmor profiles for pods using the native Kubernetes field (not annotations)
Traffic distribution for Services beta: Express topology preferences for Service routing (prefer local node, prefer same zone)

# Traffic distribution: prefer endpoints in the same zone
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  trafficDistribution: PreferClose
  selector:
    app: api
  ports:
  - port: 80
    targetPort: 8080

Kubernetes 1.32 — Sidecar Stable, DRA Beta (December 2024)

  • Sidecar containers stable: After nearly a decade of workarounds, the sidecar pattern is a first-class Kubernetes primitive
  • Dynamic Resource Allocation beta: GPU and specialized hardware scheduling ready for production evaluation
  • Job API improvements: Success and failure policies for indexed jobs — granular control over batch workload behavior
  • Custom Resource field selectors: Filter CRDs on arbitrary fields — making large CRD-based systems more efficient to query

Crossplane: Kubernetes as the Control Plane for Everything

Crossplane (CNCF graduated) extended the Kubernetes API model beyond the cluster itself. Using CRDs and controllers, Crossplane lets you manage cloud resources (RDS databases, S3 buckets, VPCs, IAM roles) as Kubernetes resources — provisioned, updated, and deleted through the Kubernetes API.

# Crossplane: provision an RDS PostgreSQL instance as a Kubernetes resource
apiVersion: database.aws.crossplane.io/v1beta1
kind: RDSInstance
metadata:
  name: production-db
spec:
  forProvider:
    region: us-east-1
    dbInstanceClass: db.r6g.xlarge
    masterUsername: admin
    engine: postgres
    engineVersion: "15"
    allocatedStorage: 100
    multiAZ: true
  writeConnectionSecretsToRef:
    name: production-db-credentials
    namespace: production

For platform teams, Crossplane means a single control plane — the Kubernetes API — for both compute workloads and cloud infrastructure. GitOps tools (Flux, ArgoCD) manage both.


Key Takeaways

  • GitOps (Flux, ArgoCD) became the production deployment standard — not for ideological reasons, but because the audit trail, drift detection, and self-healing properties solve real operational and compliance problems
  • Cluster API made Kubernetes cluster lifecycle (provisioning, upgrades, deletion) a Kubernetes-native operation — the same API, tooling, and audit trail
  • Native sidecar containers (1.28 alpha → 1.32 stable) finally resolved the lifecycle ordering problem that service meshes and log collectors had worked around for years
  • AI/ML workloads drove new scheduling capabilities (DRA, gang scheduling via Kueue/Volcano) and made GPU topology awareness a first-class concern
  • Crossplane generalized the Kubernetes API model to cloud infrastructure — the cluster is now a control plane for everything, not just containers

What’s Next

← EP06: The Runtime Reckoning | EP08: Kubernetes Today →

Series: Kubernetes: From Borg to Platform Engineering | linuxcent.com

eBPF Program Types — What’s Actually Running on Your Nodes

Reading Time: 8 minutes

eBPF: From Kernel to Cloud, Episode 4
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types**


Architecture Overview

eBPF Program Types — tracing, networking, and security hook points across the Linux kernel
Each eBPF program type attaches to a different kernel hook — from socket filters to LSM enforcement points.

TL;DR

  • bpftool prog list and bpftool net list show every eBPF program on a node — run these first when debugging eBPF-based tool behavior
  • TC programs can stack on the same interface; stale programs from incomplete Cilium upgrades cause intermittent packet drops — check tc filter show after every Cilium upgrade
  • XDP fires before sk_buff allocation — fastest hook, but no pod identity; Cilium uses it for service load balancing, not pod policy
  • XDP silently falls back to generic mode on unsupported NICs — verify with ip link show | grep xdp
  • Tracepoints are stable across kernel versions; kprobe-based tools may silently break after node OS patches
  • LSM hooks enforce at the kernel level — what makes Tetragon’s enforcement mode fundamentally different from sidecar-based approaches

The Big Picture

  WHERE eBPF PROGRAM TYPES ATTACH IN THE KERNEL

  NIC hardware
       ↓
  DMA → ring buffer
       ↓
  ┌─────────────────────────────────────────────────┐
  │  XDP hook  (Cilium: service load balancing)     │
  │  Sees: raw packet bytes only. No pod identity.  │
  └─────────────────────────┬───────────────────────┘
                            │ XDP_PASS
                            ▼
  sk_buff allocated
       ↓
  ┌─────────────────────────────────────────────────┐
  │  TC ingress hook  (Cilium: pod policy ingress)  │
  │  Sees: sk_buff + socket + cgroup → pod identity │
  └─────────────────────────┬───────────────────────┘
                            ↓
  netfilter / IP routing
       ↓
  socket → process (syscall boundary)
  ┌─────────────────────────────────────────────────┐
  │  Tracepoint / kprobe  (Falco: syscall monitor)  │
  │  Sees: any kernel event, any process, any pod   │
  └─────────────────────────────────────────────────┘
  ┌─────────────────────────────────────────────────┐
  │  LSM hook  (Tetragon: kernel-level enforcement) │
  │  Sees: security check context. Can DENY.        │
  └─────────────────────────────────────────────────┘
       ↓
  IP routing → qdisc
  ┌─────────────────────────────────────────────────┐
  │  TC egress hook  (Cilium: pod policy egress)    │
  │  Sees: socket + cgroup on outbound traffic      │
  └─────────────────────────────────────────────────┘
       ↓
  NIC → wire

eBPF program types define where in the kernel a hook fires and what it can see — and knowing the difference is what makes you effective when Cilium or Falco behave unexpectedly. What we hadn’t answered — and what a 2am incident eventually forced — is what kind of eBPF programs are actually running on your nodes, and why the difference matters when something breaks.

A pod in production was dropping roughly one in fifty outbound TCP connections. Not all of them — just enough to cause intermittent timeouts in the application logs. NetworkPolicy showed egress allowed. Cilium reported no violations. Running curl manually from inside the pod worked every time.

I spent the better part of three hours eliminating possibilities. DNS. MTU. Node-level conntrack table exhaustion. Upstream firewall rules. Nothing.

Eventually, almost as an afterthought, I ran this:

sudo bpftool prog list

There were two TC programs attached to that pod’s veth interface. One from the current Cilium version. One from the previous version — left behind by a rolling upgrade that hadn’t cleaned up properly. Two programs. Different policy state. One was occasionally dropping packets based on rules that no longer existed in the current policy model.

The answer had been sitting in the kernel the whole time. I just didn’t know where to look.

That incident forced me to actually understand something I’d been hand-waving for two years: eBPF isn’t a single hook. It’s a family of program types, each attached to a different location in the kernel, each seeing different data, each suited for different problems. Understanding the difference is what separates “I run Cilium and Falco” from “I understand what Cilium and Falco are actually doing on my nodes” — and that difference matters when something breaks at 2am.

The Command You Should Run on Your Cluster Right Now

Before getting into the theory, do this:

# See every eBPF program loaded on the node
sudo bpftool prog list

# See every eBPF program attached to a network interface
sudo bpftool net list

On a node running Cilium and Falco, you’ll see something like this:

42: xdp           name cil_xdp_entry       loaded_at 2026-04-01T09:23:41
43: sched_cls     name cil_from_netdev      loaded_at 2026-04-01T09:23:41
44: sched_cls     name cil_to_netdev        loaded_at 2026-04-01T09:23:41
51: cgroup_sock_addr  name cil_sock4_connect loaded_at 2026-04-01T09:23:41
88: raw_tracepoint  name sys_enter          loaded_at 2026-04-01T09:23:55
89: raw_tracepoint  name sys_exit           loaded_at 2026-04-01T09:23:55

Each line is a different program type. Each one fires at a different point in the kernel. The type column — xdp, sched_cls, raw_tracepoint, cgroup_sock_addr — tells you where in the kernel execution path that program is attached and therefore what it can and cannot see.

If you see more programs than you expect on a specific interface — like I did — that’s your first clue.

Why Program Types Exist

The Linux kernel isn’t a single pipeline. Network packets, system calls, file operations, process scheduling — these all run through different subsystems with different execution contexts and different available data.

eBPF lets you attach programs to specific points within those subsystems. The “program type” is the contract: it defines where the hook fires, what data the program receives, and what it’s allowed to do with it. A program designed to process network packets before they hit the kernel stack looks completely different from one designed to intercept system calls across all containers simultaneously.

Most of us will interact with four or five program types through the tools we already run. Understanding what each one actually is — where it sits, what it sees — is what makes you effective when those tools behave unexpectedly.

The Types Behind the Tools You Already Use

TC — Why Cilium Can Tell Which Pod Sent a Packet

TC stands for Traffic Control. It’s where Cilium enforces your NetworkPolicy, and it’s what caused my incident.

TC programs attach to network interfaces — specifically to the ingress and egress directions of the pod’s virtual interface (lxcXXXXX in Cilium’s naming). They fire after the kernel has already processed the packet enough to know its context: which socket created it, which cgroup that socket belongs to. Cgroup maps to container, container maps to pod.

This is the critical piece: TC is how Cilium knows which pod a packet belongs to. Without that cgroup context, per-pod policy enforcement isn’t possible.

# See TC programs on a pod's veth interface
sudo tc filter show dev lxc12345 ingress
sudo tc filter show dev lxc12345 egress

# If you see two entries on the same direction — that's the incident I described
# The priority number (pref 1, pref 2) tells you the order they run

When there are two TC programs on the same interface, the first one to return “drop” wins. The second program never runs. This is why the issue was intermittent rather than consistent — the stale program only matched specific connection patterns.

Fixing it is straightforward once you know what to look for:

# Remove a stale TC filter by its priority number
sudo tc filter del dev lxc12345 egress pref 2

Add this check to your post-upgrade runbook. Cilium upgrades are generally clean but not always.

XDP — Why Cilium Doesn’t Use TC for Everything

If TC is good enough for pod-level policy, why does Cilium also run an XDP program on the node’s main interface? Look at the bpftool prog list output again — there’s an xdp program loaded alongside the TC programs.

XDP fires earlier. Much earlier. Before the kernel allocates any memory for the packet. Before routing. Before connection tracking. Before anything.

The tradeoff is exactly what you’d expect: XDP is fast but context-poor. It sees raw packet bytes. It doesn’t know which pod the packet came from. It can’t read cgroup information because no socket buffer has been allocated yet.

Cilium uses XDP specifically for ClusterIP service load balancing — when a packet arrives at the node destined for a service VIP, XDP rewrites the destination to the actual pod IP in a single map lookup and sends it on its way. No iptables. No conntrack. The work is done before the kernel stack is involved.

There’s a silent failure mode worth knowing about here. XDP runs in one of two modes:

  • Native mode — runs inside the NIC driver itself, before any kernel allocation. This is where the performance comes from.
  • Generic mode — fallback when the NIC driver doesn’t support XDP. Runs later, after sk_buff allocation. No performance benefit over iptables.

If your NIC doesn’t support native XDP, Cilium silently falls back to generic mode. The policy still works — but the performance characteristics you assumed aren’t there.

# Check which XDP mode is active on your node's main interface
ip link show eth0 | grep xdp
# xdpdrv  ← native mode (fast)
# xdpgeneric ← generic mode (no perf benefit)

Most cloud provider instance types with modern Mellanox/Intel NICs support native mode. Worth verifying rather than assuming.

Tracepoints — How Falco Sees Every Container

Falco loads two programs: sys_enter and sys_exit. These are raw tracepoints — they fire on every single system call, from every process, in every container on the node.

Tracepoints are explicitly defined and maintained instrumentation points in the kernel. Unlike hooks that attach to specific internal function names (which can be renamed or inlined between kernel versions), tracepoints are stable interfaces. They’re part of the kernel’s public contract with tooling that wants to instrument it.

This matters operationally. When you patch your nodes — and cloud-managed nodes get patched frequently — tools built on tracepoints keep working. Tools built on kprobes (internal function hooks) may silently stop firing if the function they’re attached to gets renamed or inlined by the compiler in a new kernel build.

# Verify what Falco is actually using
sudo bpftool prog list | grep -E "kprobe|tracepoint"

# Falco's current eBPF driver should show raw_tracepoint entries
# If you see kprobe entries from Falco, you're on the older driver
# Check: falco --version and the driver being loaded at startup

If you’re running Falco on a cluster that gets regular OS patch upgrades and you haven’t verified the driver mode, check it. The older kprobe-based driver has a real failure mode on certain kernel versions.

LSM — How Tetragon Blocks Operations at the Kernel Level

LSM hooks run at the kernel’s security decision points: file opens, socket connections, process execution, capability checks. The defining characteristic is that they can deny an operation. Return an error from an LSM hook and the kernel refuses the syscall before it completes.

This is qualitatively different from observability hooks. kprobes and tracepoints watch. LSM hooks enforce.

When you see Tetragon configured to kill a process attempting a privileged operation, or block a container from writing to a specific path, that’s an LSM hook making the decision inside the kernel — not a sidecar watching traffic, not an admission webhook running before pod creation, not a userspace agent trying to act fast enough. The enforcement is in the kernel itself.

# See if any LSM eBPF programs are active on the node
sudo bpftool prog list | grep lsm

# Verify LSM eBPF support on your kernel (required for Tetragon enforcement mode)
grep CONFIG_BPF_LSM /boot/config-$(uname -r)
# CONFIG_BPF_LSM=y   ← required

The Practical Summary

What’s happening on your node Program type Where to look
Cilium service load balancing XDP ip link show eth0 \| grep xdp
Cilium pod network policy TC (sched_cls) tc filter show dev lxcXXXX egress
Falco syscall monitoring Tracepoint bpftool prog list \| grep tracepoint
Tetragon enforcement LSM bpftool prog list \| grep lsm
Anything unexpected All types bpftool prog list, bpftool net list

The Incident, Revisited

Three hours of debugging. The answer was a stale TC program sitting at priority 2 on a pod’s veth interface, left behind by an incomplete Cilium upgrade.

# What I should have run first
sudo bpftool net list
sudo tc filter show dev lxc12345 egress

Two commands. Thirty seconds. If I’d known that TC programs can stack on the same interface, I’d have started there.

That’s the point of understanding program types — not to write eBPF programs yourself, but to know where to look when the tools you depend on don’t behave the way you expect. The programs are already there, running on your nodes right now. bpftool prog list shows you all of them.

Key Takeaways

  • bpftool prog list and bpftool net list show every eBPF program on a node — run these before anything else when debugging eBPF-based tool behavior
  • TC programs can stack on the same interface; stale programs from incomplete Cilium upgrades cause intermittent drops — check tc filter show after every Cilium upgrade
  • XDP runs before the kernel stack — fastest hook, but no pod identity; Cilium uses it for service load balancing, not pod policy
  • XDP silently falls back to generic mode on unsupported NICs — verify with ip link show | grep xdp
  • Tracepoints are stable across kernel versions; kprobe-based tools may silently break after node OS patches — verify your Falco driver mode
  • LSM hooks enforce at the kernel level — this is what makes Tetragon’s enforcement mode fundamentally different from sidecar-based approaches

What’s Next

Every eBPF program fires, does its work, and exits — but the work always involves data.

Get EP05 in your inbox when it publishes → linuxcent.com/subscribe Counting connections. Tracking processes. Streaming events to a detection engine. In EP05, I’ll cover eBPF maps: the persistent data layer that connects kernel programs to the tools consuming their output. Understanding maps explains a class of production issues — and makes bpftool map dump useful rather than cryptic.