Kubernetes CRD Schema Explained: Versions, Validation, and Status Subresource

Reading Time: 6 minutes

Kubernetes CRDs & Operators: Extending the API, Episode 3
What Is a CRD? · CRDs You Already Use · CRD Anatomy · Write Your First CRD · CEL Validation · Controller Loop · Build an Operator · CRD Versioning · Admission Webhooks · CRDs in Production


TL;DR

  • The Kubernetes CRD schema is defined in spec.versions[].schema.openAPIV3Schema — the API server uses it to validate every custom resource create and update before storing in etcd
    (OpenAPI v3 schema = a JSON Schema dialect that describes the structure, types, and constraints of your resource’s fields)
  • spec.versions is a list — CRDs can serve multiple API versions simultaneously; exactly one version must have storage: true
  • scope: Namespaced vs scope: Cluster controls whether custom resources live inside a namespace or at cluster level (like PersistentVolume vs PersistentVolumeClaim)
  • spec.names defines the plural, singular, kind, and optional shortNames used in kubectl and RBAC
  • The status subresource (subresources.status: {}) separates user writes (spec) from controller writes (status) — enabling optimistic concurrency and kubectl status support
  • The scale subresource (subresources.scale) makes your custom resource compatible with kubectl scale and the HorizontalPodAutoscaler

The Big Picture

  ANATOMY OF A CUSTOMRESOURCEDEFINITION

  apiVersion: apiextensions.k8s.io/v1
  kind: CustomResourceDefinition
  metadata:
    name: {plural}.{group}        ← MUST be exactly this format
  spec:
    group: {group}                ← API group (e.g. storage.example.com)
    scope: Namespaced | Cluster   ← where instances live
    names:                        ← how kubectl refers to this resource
      plural: backuppolicies
      singular: backuppolicy
      kind: BackupPolicy
      shortNames: [bp]
    versions:                     ← can be a list; one must have storage: true
      - name: v1alpha1
        served: true              ← API server responds to this version
        storage: true             ← etcd stores objects in this version
        schema:
          openAPIV3Schema:        ← validation schema for ALL objects of this type
            type: object
            properties:
              spec: {...}
              status: {...}
        subresources:
          status: {}              ← enables separate status write path
          scale:                  ← enables kubectl scale + HPA
            specReplicasPath: .spec.replicas
            statusReplicasPath: .status.replicas
        additionalPrinterColumns: ← extra columns in kubectl get output
          - name: Schedule
            type: string
            jsonPath: .spec.schedule

Understanding the Kubernetes CRD schema is the prerequisite for writing a CRD that behaves correctly in production — validation catches bad data at the API boundary, the status subresource prevents controller race conditions, and scope determines your entire RBAC and multi-tenancy model.


spec.group and metadata.name

The group is a reverse-DNS identifier for your API. Convention:

storage.example.com     ← domain you control + functional area
monitoring.myteam.io
databases.platform.company.com

The CRD’s metadata.name must be exactly {plural}.{group}:

metadata:
  name: backuppolicies.storage.example.com
spec:
  group: storage.example.com
  names:
    plural: backuppolicies

If these do not match, the API server rejects the CRD with a validation error. This is the most common first-timer mistake.


spec.scope: Namespaced vs Cluster

  SCOPE DETERMINES WHERE INSTANCES LIVE

  Namespaced (scope: Namespaced)       Cluster (scope: Cluster)
  ─────────────────────────────         ──────────────────────────
  kubectl get backuppolicies -n prod    kubectl get clusterbackuppolicies
  kubectl get backuppolicies -A         (no -n flag, no namespace)

  Analogous to: Pod, Deployment,        Analogous to: PersistentVolume,
                ConfigMap                             ClusterRole, Node

Namespaced: Use when instances are per-tenant or per-application. Users with namespace-scoped RBAC can manage their own instances without cluster-admin. Most CRDs should be namespaced.

Cluster-scoped: Use when instances represent cluster-wide configuration — a ClusterIssuer (cert-manager), ClusterSecretStore (ESO), a StorageClass-like concept. Requires cluster-level RBAC to create/modify.
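
As a sketch of the RBAC difference (names are hypothetical), a namespaced Role is all a team needs to manage its own instances:

# Hypothetical Role: lets a team manage BackupPolicies in one namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: backuppolicy-editor
  namespace: prod
rules:
  - apiGroups: ["storage.example.com"]
    resources: ["backuppolicies"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]

A cluster-scoped resource would need a ClusterRole and ClusterRoleBinding instead.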

You cannot change scope after a CRD is created without deleting and recreating it (which deletes all instances). Choose carefully.


spec.versions: Serving Multiple API Versions

spec:
  versions:
    - name: v1alpha1
      served: true
      storage: false       # not stored; converted on read
      schema:
        openAPIV3Schema: {...}
    - name: v1beta1
      served: true
      storage: false
      schema:
        openAPIV3Schema: {...}
    - name: v1
      served: true
      storage: true        # etcd stores in this version
      schema:
        openAPIV3Schema: {...}

Rules:
– served: true means the API server accepts requests at this version
– served: false means the API server returns 404 for that version — use it to deprecate
– Exactly one version must have storage: true — this is what gets written to etcd
– When a client requests a non-storage version, the API server converts on the fly (or calls your conversion webhook — see EP08)

Early in development, start with v1alpha1 storage: true. Promote to v1 when the schema is stable. EP08 covers how to do this without losing data.


spec.names: What kubectl Sees

spec:
  names:
    plural:     backuppolicies     # kubectl get backuppolicies
    singular:   backuppolicy       # kubectl get backuppolicy (also works)
    kind:       BackupPolicy       # used in YAML apiVersion/kind
    listKind:   BackupPolicyList   # optional; auto-derived if omitted
    shortNames:                    # kubectl get bp
      - bp
    categories:                    # kubectl get all includes this type
      - all

categories is worth noting: if you add all to categories, your custom resources appear when someone runs kubectl get all -n mynamespace. Most CRDs deliberately do not add this — it clutters get all output. Only add it if your resource is a primary operational concern.
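
To check what the API server registered, kubectl api-resources shows the names in action (output abbreviated; assumes the BackupPolicy CRD from The Big Picture):

kubectl api-resources --api-group=storage.example.com
NAME             SHORTNAMES   APIVERSION                     NAMESPACED   KIND
backuppolicies   bp           storage.example.com/v1alpha1   true         BackupPolicy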


schema.openAPIV3Schema: Validation

The schema is where you define field types, required fields, constraints, and descriptions. The API server validates every create and update against this schema before writing to etcd.

schema:
  openAPIV3Schema:
    type: object
    required: ["spec"]
    properties:
      spec:
        type: object
        required: ["schedule", "retentionDays"]
        properties:
          schedule:
            type: string
            description: "Cron expression for backup schedule"
            pattern: '^(\*|[0-9,\-\/]+)\s+(\*|[0-9,\-\/]+)\s+(\*|[0-9,\-\/]+)\s+(\*|[0-9,\-\/]+)\s+(\*|[0-9,\-\/]+)$'
          retentionDays:
            type: integer
            minimum: 1
            maximum: 365
          storageClass:
            type: string
            default: "standard"        # default value (Kubernetes 1.17+)
          targets:
            type: array
            maxItems: 10
            items:
              type: object
              required: ["name"]
              properties:
                name:
                  type: string
                namespace:
                  type: string
                  default: "default"
      status:
        type: object
        x-kubernetes-preserve-unknown-fields: true   # controllers write arbitrary status

Field types available

Type      Usage
────      ─────
string    Text values; supports format, pattern, enum, minLength, maxLength
integer   Whole numbers; supports minimum, maximum
number    Floating point
boolean   true/false
object    Nested structure; use properties to define fields
array     List; use items to define element schema; supports minItems, maxItems

x-kubernetes-preserve-unknown-fields: true

This tells the API server not to prune fields it does not know about. Use it on status (controllers write whatever they need) and on fields that are intentionally free-form (like a config field that accepts arbitrary YAML). Avoid it on spec — it bypasses validation.
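
A sketch of a deliberately free-form field (the config field is hypothetical):

properties:
  spec:
    type: object
    properties:
      config:                                        # accepts arbitrary user YAML
        type: object
        x-kubernetes-preserve-unknown-fields: true   # no pruning inside this field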

Validation behavior in practice

# This will fail with a clear error:
kubectl apply -f - <<EOF
apiVersion: storage.example.com/v1alpha1
kind: BackupPolicy
metadata:
  name: bad
  namespace: default
spec:
  schedule: "not-a-cron"    # fails pattern validation
  retentionDays: 500         # fails maximum: 365
EOF

The BackupPolicy "bad" is invalid:
  spec.schedule: Invalid value: "not-a-cron": spec.schedule in body should match
    '^(\*|[0-9,\-\/]+)\s+...'
  spec.retentionDays: Invalid value: 500: spec.retentionDays in body should be
    less than or equal to 365

Schema validation catches configuration mistakes at apply time, not at runtime inside a pod. This is one of the core advantages of expressing domain configuration as CRDs rather than ConfigMaps.


additionalPrinterColumns: What kubectl get Shows

By default, kubectl get backuppolicies shows only NAME and AGE. You can add columns:

additionalPrinterColumns:
  - name: Schedule
    type: string
    jsonPath: .spec.schedule
    description: Cron schedule for backups
  - name: Retention
    type: integer
    jsonPath: .spec.retentionDays
    priority: 1          # 0 = always shown; 1 = only with -o wide
  - name: Ready
    type: string
    jsonPath: .status.conditions[?(@.type=='Ready')].status
  - name: Age
    type: date
    jsonPath: .metadata.creationTimestamp

Result:

NAME        SCHEDULE      READY   AGE
nightly     0 2 * * *     True    3d
weekly      0 0 * * 0     False   7d

Good printer columns turn kubectl get into a useful operational dashboard. Include Ready (from status conditions) so operators can immediately see which custom resources are healthy without running kubectl describe.


The Status Subresource

subresources:
  status: {}

Without the status subresource, spec and status are part of the same object. Any user with update permission on the custom resource can modify both. Controllers write status through the same path as users write spec.

With the status subresource enabled:
– kubectl apply / kubectl patch only update spec — the status block is stripped
– Controllers use the /status subresource endpoint to write status
– RBAC can grant update on backuppolicies (spec) independently from update on backuppolicies/status

  WITHOUT status subresource:         WITH status subresource:
  ─────────────────────────            ──────────────────────────
  PUT /backuppolicies/nightly          PUT /backuppolicies/nightly
  → updates spec AND status            → updates spec only

                                       PUT /backuppolicies/nightly/status
                                       → updates status only (controller path)

Always enable the status subresource on production CRDs. The split between spec and status is fundamental to the Kubernetes API contract. Without it, a controller updating status can accidentally overwrite spec changes made by a user at the same time.
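
With the subresource enabled, you can exercise the controller's write path by hand — kubectl 1.24+ can patch a subresource directly (field values illustrative):

# Write status through the /status endpoint, as a controller would
kubectl patch backuppolicy nightly --subresource=status \
  --type=merge -p '{"status":{"lastBackupTime":"2026-04-10T02:00:00Z"}}'

# The same patch without --subresource=status targets the main endpoint,
# where the status block is stripped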


The Scale Subresource

subresources:
  scale:
    specReplicasPath: .spec.replicas
    statusReplicasPath: .status.replicas
    labelSelectorPath: .status.labelSelector

This makes your custom resource compatible with:

kubectl scale backuppolicy nightly --replicas=3

And with HorizontalPodAutoscaler targeting your custom resource. If your CRD manages something replica-based (workers, shards, connections), enabling the scale subresource lets it plug into the standard Kubernetes autoscaling ecosystem without extra plumbing.
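
A minimal HPA sketch targeting the custom resource — assuming a BackupPolicy that manages replicated workers and exposes the scale subresource shown above:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backuppolicy-workers
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: storage.example.com/v1alpha1   # your CRD's group/version
    kind: BackupPolicy
    name: nightly
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70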


⚠ Common Mistakes

Forgetting x-kubernetes-preserve-unknown-fields: true on status. If you validate the status field with a strict schema but do not add this, the API server will prune any status fields the controller writes that are not in the schema. The controller’s status updates will silently lose fields. Either define the full status schema or use x-kubernetes-preserve-unknown-fields: true.

Using scope: Cluster for resources that should be namespaced. Once a CRD is created as cluster-scoped, you cannot make it namespaced without deleting and recreating it. Plan scope before deploying to production.

Not enabling the status subresource. Without it, controllers writing status can race with users updating spec. It also means kubectl patch --subresource=status does not work and some tooling behaves unexpectedly. Enable it from the start.

Loose schema with no required fields. An openAPIV3Schema with no required constraint accepts objects with empty spec. This usually means your controller gets called with a resource that is missing mandatory configuration. Define required fields and validate them at the API boundary, not inside the controller.


Quick Reference

# Inspect the full schema of a CRD
kubectl get crd backuppolicies.storage.example.com -o yaml | \
  yq '.spec.versions[0].schema'

# Check what subresources are enabled
kubectl get crd certificates.cert-manager.io \
  -o jsonpath='{.spec.versions[0].subresources}'

# See all served versions for a CRD
kubectl get crd prometheuses.monitoring.coreos.com \
  -o jsonpath='{.spec.versions[*].name}'

# Check which version is the storage version
kubectl get crd certificates.cert-manager.io \
  -o jsonpath='{.spec.versions[?(@.storage==true)].name}'

# Describe the printer columns for a CRD
kubectl get crd scaledobjects.keda.sh \
  -o jsonpath='{.spec.versions[0].additionalPrinterColumns}'

Key Takeaways

  • spec.versions allows serving and storing multiple API versions; only one version has storage: true
  • scope (Namespaced vs Cluster) cannot be changed after creation — choose deliberately
  • openAPIV3Schema validates every CR at the API boundary, before etcd storage
  • The status subresource separates the user write path (spec) from the controller write path (status) — always enable it
  • additionalPrinterColumns makes kubectl get operationally useful; include a Ready column from status conditions

What’s Next

EP04: Write Your First Kubernetes CRD puts the anatomy into practice — a complete hands-on walkthrough building a BackupPolicy CRD from scratch, applying it to a cluster, creating instances, and verifying validation, RBAC, and status behavior.

Get EP04 in your inbox when it publishes → subscribe at linuxcent.com

CRDs You Already Use: cert-manager, KEDA, and External Secrets Explained

Reading Time: 6 minutes

Kubernetes CRDs & Operators: Extending the API, Episode 2
What Is a CRD? · CRDs You Already Use · CRD Anatomy · Write Your First CRD · CEL Validation · Controller Loop · Build an Operator · CRD Versioning · Admission Webhooks · CRDs in Production


TL;DR

  • cert-manager, KEDA, and External Secrets Operator are all CRD-based systems — understanding their custom resources shows you what a well-designed CRD looks like before you build one
  • cert-manager’s Certificate CRD expresses desired TLS state; the cert-manager controller reconciles that state by issuing, renewing, and storing certificates in Secrets
  • KEDA’s ScaledObject extends the HorizontalPodAutoscaler with external metrics (queue depth, Kafka lag, Prometheus queries) — the KEDA operator translates ScaledObjects into native HPA objects
  • External Secrets Operator’s ExternalSecret abstracts over secret backends (AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager) — the controller pulls values and writes Kubernetes Secrets
  • All three follow the same pattern: you describe desired state in a custom resource; the operator reconciles actual state to match
  • Kubernetes custom resources examples like these are the fastest way to internalize the CRD mental model before writing your own

The Big Picture

  THREE CRD-BASED OPERATORS AND WHAT THEY MANAGE

  ┌─────────────────────────────────────────────────────────────┐
  │  cert-manager                                               │
  │  Certificate CR  →  controller issues cert  →  TLS Secret  │
  └─────────────────────────────────────────────────────────────┘

  ┌─────────────────────────────────────────────────────────────┐
  │  KEDA                                                       │
  │  ScaledObject CR  →  controller creates HPA  →  Pod count  │
  └─────────────────────────────────────────────────────────────┘

  ┌─────────────────────────────────────────────────────────────┐
  │  External Secrets Operator                                  │
  │  ExternalSecret CR  →  controller pulls  →  K8s Secret      │
  │                         from Vault/AWS/GCP                  │
  └─────────────────────────────────────────────────────────────┘

  In every case:
  User creates CR  →  Operator watches CR  →  Operator acts  →  Status updated

Kubernetes custom resources examples from real tools like these reveal the design pattern you will use in every CRD you build: express desired state declaratively, let the controller bridge the gap to actual state, surface the outcome in the status subresource.


Why Look at Existing CRDs First?

Before designing your own CRD, you want to understand what good CRD design looks like from the user’s perspective. The engineers behind cert-manager (Jetstack), KEDA (kedacore), and External Secrets have collectively solved the same problems you will face:

  • What goes in spec vs status?
  • How do you reference other Kubernetes objects?
  • How do you handle secrets and credentials securely?
  • What does a healthy vs unhealthy custom resource look like?

Studying these before writing your own saves you from the most common first-timer mistakes.


cert-manager: The Certificate CRD

cert-manager is the most widely deployed CRD-based system in Kubernetes. It manages TLS certificates from Let’s Encrypt, internal CAs, and cloud providers.

The core CRDs

kubectl get crds | grep cert-manager
certificates.cert-manager.io
certificaterequests.cert-manager.io
challenges.acme.cert-manager.io
clusterissuers.cert-manager.io
issuers.cert-manager.io
orders.acme.cert-manager.io

The one you interact with most is Certificate. Here is a real example:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-tls
  namespace: production
spec:
  secretName: api-tls-cert        # cert-manager writes the TLS Secret here
  duration: 2160h                 # 90 days
  renewBefore: 720h               # renew 30 days before expiry
  subject:
    organizations:
      - example.com
  dnsNames:
    - api.example.com
    - api-internal.example.com
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer

What happens after you apply this:

  1. cert-manager controller sees the new Certificate object
  2. It contacts the referenced ClusterIssuer (Let’s Encrypt in this case)
  3. It completes the ACME challenge, obtains the certificate
  4. It writes the certificate and private key into the api-tls-cert Secret
  5. It updates the Certificate object’s status to reflect success

kubectl describe certificate api-tls -n production
Status:
  Conditions:
    Last Transition Time:  2026-04-10T08:00:00Z
    Message:               Certificate is up to date and has not expired
    Reason:                Ready
    Status:                True
    Type:                  Ready
  Not After:               2026-07-09T08:00:00Z
  Not Before:              2026-04-10T08:00:00Z
  Renewal Time:            2026-06-09T08:00:00Z

What this teaches you about CRD design

  • spec.secretName — the CR references an output object by name. The controller creates or updates that object.
  • spec.issuerRef — the CR references another custom resource (ClusterIssuer) by name. This is a common pattern for separating configuration concerns.
  • status.conditions — the standard Kubernetes condition pattern: type, status, reason, message. You will use the same structure in your own CRDs (sketched after this list).
  • The controller owns status — users own spec. This separation is a core convention.
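
A minimal sketch of that condition structure in your own CRD’s status (values illustrative):

status:
  conditions:
    - type: Ready                      # which aspect this condition reports
      status: "True"                   # "True" / "False" / "Unknown"
      reason: BackupSucceeded          # machine-readable, CamelCase
      message: "Last backup completed at 02:00"
      lastTransitionTime: "2026-04-10T02:00:05Z"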

KEDA: The ScaledObject CRD

KEDA (Kubernetes Event-Driven Autoscaling) extends Kubernetes autoscaling beyond CPU and memory. It can scale deployments based on queue depth, Kafka consumer lag, Prometheus metric values, and dozens of other event sources.

The core CRDs

kubectl get crds | grep keda
clustertriggerauthentications.keda.sh
scaledjobs.keda.sh
scaledobjects.keda.sh
triggerauthentications.keda.sh

A ScaledObject ties a Deployment to an external scaler:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor        # the Deployment to scale
  minReplicaCount: 0             # scale to zero when idle
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/orders
        queueLength: "5"         # target: 5 messages per pod
        awsRegion: us-east-1
      authenticationRef:
        name: keda-sqs-auth      # TriggerAuthentication for AWS credentials

What KEDA does with this:

  1. KEDA controller sees the ScaledObject
  2. It creates a native HorizontalPodAutoscaler object targeting the order-processor Deployment
  3. KEDA’s metrics adapter polls the SQS queue depth and exposes it as a custom metric
  4. The HPA uses that metric to scale replicas — including to zero when the queue is empty

kubectl get scaledobject order-processor-scaler -n production
NAME                       SCALETARGETKIND      SCALETARGETNAME    MIN   MAX   TRIGGERS         READY   ACTIVE
order-processor-scaler     apps/Deployment      order-processor    0     50    aws-sqs-queue    True    True

What this teaches you about CRD design

  • spec.scaleTargetRef — targeting another object by name. The controller acts on that object, not on the CR itself.
  • spec.triggers — a list of trigger specifications. Lists of typed sub-objects are a recurring CRD pattern.
  • spec.minReplicaCount: 0 — expressing scale-to-zero as a first-class concept in the API. Built-in HPA does not support this; KEDA’s CRD extends the vocabulary of what is expressible.
  • The KEDA operator translates ScaledObject → native HPA. The CRD is an abstraction over a more complex Kubernetes object. This “translate and manage child resources” pattern is extremely common in operators.
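
The authenticationRef above points at a TriggerAuthentication. A sketch of what that might look like — the Secret name and keys are hypothetical:

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-sqs-auth
  namespace: production
spec:
  secretTargetRef:                 # map Secret keys to scaler parameters
    - parameter: awsAccessKeyID
      name: aws-credentials        # hypothetical Secret name
      key: AWS_ACCESS_KEY_ID
    - parameter: awsSecretAccessKey
      name: aws-credentials
      key: AWS_SECRET_ACCESS_KEY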

External Secrets Operator: The ExternalSecret CRD

External Secrets Operator (ESO) solves a specific problem: secrets live in external systems (AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager), but Kubernetes workloads need them as Kubernetes Secrets. ESO bridges the gap.

The core CRDs

kubectl get crds | grep external-secrets
clusterexternalsecrets.external-secrets.io
clustersecretstores.external-secrets.io
externalsecrets.external-secrets.io
secretstores.external-secrets.io

A SecretStore defines the backend connection:

apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-manager
  namespace: production
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: eso-sa            # uses IRSA/workload identity

An ExternalSecret defines what to pull and how to map it:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-creds
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: database-secret          # Kubernetes Secret to create/update
    creationPolicy: Owner
  data:
    - secretKey: username          # key in the K8s Secret
      remoteRef:
        key: prod/database         # path in AWS Secrets Manager
        property: username         # property within that secret
    - secretKey: password
      remoteRef:
        key: prod/database
        property: password

After ESO reconciles this:

kubectl get secret database-secret -n production -o jsonpath='{.data.username}' | base64 -d
# outputs: db_user

kubectl describe externalsecret database-creds -n production
Status:
  Conditions:
    Last Transition Time:   2026-04-10T08:00:00Z
    Message:                Secret was synced
    Reason:                 SecretSynced
    Status:                 True
    Type:                   Ready
  Refresh Time:             2026-04-10T09:00:00Z
  Synced Resource Version:  1-abc123

What this teaches you about CRD design

  • spec.secretStoreRef — referencing a configuration CRD (SecretStore) from an operational CRD (ExternalSecret). This layering of CRDs to separate concerns is a mature pattern.
  • spec.refreshInterval — the CR expresses a desired behavior (periodic sync), not just a desired state snapshot. CRDs can express temporal behaviors.
  • spec.target.creationPolicy: Owner — ESO will set an owner reference on the created Secret, so deleting the ExternalSecret cascades to deleting the Secret. This is how controllers manage lifecycle — inspected below.
  • Sensitive values never appear in the CR — only paths and references. The controller handles the actual secret retrieval. This is a key security pattern in CRD design.
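
You can inspect the owner relationship on the generated Secret (output illustrative):

kubectl get secret database-secret -n production \
  -o jsonpath='{.metadata.ownerReferences[0].kind}{"/"}{.metadata.ownerReferences[0].name}'
# ExternalSecret/database-creds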

The Common Pattern Across All Three

  OPERATOR PATTERN (cert-manager / KEDA / ESO / every other operator)

  User applies CR
        │
        ▼
  Controller watches CRDs
  (informer cache, events queue)
        │
        ▼
  Controller reconciles:
  actual state ──→ compare ──→ desired state
        │              │
        │         (gap found)
        │              │
        ▼              ▼
  Takes action      Updates status
  (issue cert,      conditions in CR
   create HPA,
   sync Secret)
        │
        └──── loops back, watches for next change

The design contract:
– Users write spec — what they want
– Controllers read spec, write status — what actually happened
– Status conditions are the source of truth — Ready: True/False with reason and message tells operators what the controller knows

This pattern, explained in depth in EP06, is why CRDs and controllers are designed the way they are.


⚠ Common Mistakes

Installing CRDs without the controller. If you install cert-manager’s CRDs from the crds.yaml manifest without installing cert-manager itself, Certificate objects will be accepted by the API server but never reconciled. The Ready condition will never appear. Always install the operator alongside its CRDs.

Editing status fields directly. Many teams try kubectl patch or kubectl edit to update a custom resource’s status to work around a stuck controller. Most well-written controllers overwrite status every reconcile loop — your manual change will be wiped. Fix the underlying issue, not the status display.

Assuming CRD deletion is safe. Covered in EP01 but worth repeating: deleting a CRD cascades to deleting all instances. If you kubectl delete crd certificates.cert-manager.io, every Certificate object in every namespace is gone and cert-manager will stop issuing. Back up CRDs and their instances before any CRD deletion.


Quick Reference

# See all CRDs installed by cert-manager
kubectl get crds | grep cert-manager.io

# Get all Certificates across all namespaces
kubectl get certificates -A

# Watch cert-manager reconcile a new Certificate
kubectl get certificate api-tls -n production -w

# See all ScaledObjects and their current state
kubectl get scaledobjects -A

# Check ESO sync status for all ExternalSecrets
kubectl get externalsecrets -A

# Inspect what APIs a CRD exposes
kubectl api-resources | grep cert-manager

Key Takeaways

  • cert-manager, KEDA, and ESO are canonical examples of well-designed CRD-based operators
  • All three follow the same pattern: user writes spec, controller reconciles to actual state, status reflects outcome
  • spec expresses desired state declaratively; the controller figures out how to achieve it
  • Status conditions (type, status, reason, message) are the standard way to surface controller outcomes
  • Sensitive values never appear in the CR — controllers retrieve them from external systems using references and credentials

What’s Next

EP03: CRD Anatomy opens the YAML of a CRD itself — spec.versions, OpenAPI schema properties, scope, names, and subresources. You have seen CRDs from the outside; next we look at how they are structured on the inside.

Get EP03 in your inbox when it publishes → subscribe at linuxcent.com

What Is a Kubernetes CRD? How Custom Resources Extend the API

Reading Time: 6 minutes

Kubernetes CRDs & Operators: Extending the API, Episode 1
What Is a CRD? · CRDs You Already Use · CRD Anatomy · Write Your First CRD · CEL Validation · Controller Loop · Build an Operator · CRD Versioning · Admission Webhooks · CRDs in Production


TL;DR

  • A Kubernetes CRD (Custom Resource Definition) is how you add new resource types to the Kubernetes API — the same way Deployment and Service exist natively, you can make BackupPolicy or Certificate exist too
    (CRD = the schema/blueprint; Custom Resource = an instance of that schema, just like a Pod is an instance of the Pod schema)
  • Every kubectl get crds on a real cluster shows dozens of them — cert-manager, KEDA, Prometheus Operator, Crossplane all ship their own CRDs
  • CRDs are served by the same API server as built-in resources — kubectl, RBAC, watches, and events all work identically
  • A CRD alone does nothing — a controller watches the custom resources and acts on them; together they form an Operator
  • CRDs live in etcd just like Pods and Deployments — they survive API server restarts and cluster upgrades
  • You do not need to modify Kubernetes source code or restart the API server to add a CRD

The Big Picture

  HOW KUBERNETES CRDs EXTEND THE API

  ┌──────────────────────────────────────────────────────────────┐
  │  Kubernetes API Server                                       │
  │                                                              │
  │  Built-in resources          Custom resources (via CRD)      │
  │  ─────────────────           ──────────────────────────      │
  │  Pod                         Certificate     (cert-manager)  │
  │  Deployment                  ScaledObject    (KEDA)          │
  │  Service                     ExternalSecret  (ESO)           │
  │  ConfigMap                   BackupPolicy    (your team)     │
  │  ...                         ...                             │
  │                                                              │
  │  All resources: same API, same kubectl, same RBAC, same etcd │
  └──────────────────────────────────────────────────────────────┘
            ▲                          ▲
            │ built in                 │ registered at runtime
            │                          │
         Kubernetes              CustomResourceDefinition
          binary                    (a YAML you apply)

What is a Kubernetes CRD? It is a resource that defines resources — a schema registration that teaches the API server about a new object type you want to use in your cluster.


What Problem CRDs Solve

Kubernetes ships with roughly 50 resource types: Pods, Deployments, Services, ConfigMaps, Secrets, PersistentVolumes, and so on. These cover the general-purpose building blocks for running containerized workloads.

But the moment you operate real infrastructure, you hit the edges. You want to express:

  • “This database should have three replicas with point-in-time recovery enabled” — not a Deployment
  • “This TLS certificate for api.example.com should renew 30 days before expiry” — not a Secret
  • “This queue consumer should scale to zero when the queue is empty” — not a HorizontalPodAutoscaler

Before CRDs (pre-2017), the only options were: use ConfigMaps as a poor substitute (no schema, no validation, no dedicated RBAC), or fork Kubernetes and add the resource natively (impractical for everyone outside the core team).

CRDs, introduced as stable in Kubernetes 1.16, solved this by letting you register a new resource type with the API server at runtime — without touching Kubernetes source code, without restarting the API server, without any special access beyond being able to create cluster-scoped resources.


The Kubernetes API: A Brief Mental Model

Before CRDs make sense, the API model needs to be clear.

  KUBERNETES API STRUCTURE

  apiVersion: apps/v1       ← API group (apps) + version (v1)
  kind: Deployment          ← resource type
  metadata:
    name: web               ← instance name
    namespace: default      ← namespace scope
  spec:
    replicas: 3             ← desired state

Every Kubernetes resource has:
– A group (e.g., apps, batch, networking.k8s.io) — or no group for core resources
– A version (e.g., v1, v1beta1)
– A kind (e.g., Deployment, Pod)
– A scope: namespaced or cluster-wide

The API server is a registry. Each group/version/kind combination maps to a Go struct that knows how to validate, store, and serve that resource type.

A CRD registers a new entry in that registry. You supply the group, version, kind, and schema. The API server handles everything else — serving it via REST, storing it in etcd, exposing it to kubectl.


What a CRD Looks Like

Here is the smallest possible CRD — it creates a new BackupPolicy resource type in the storage.example.com API group:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: backuppolicies.storage.example.com
spec:
  group: storage.example.com
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                schedule:
                  type: string
                retentionDays:
                  type: integer
  scope: Namespaced
  names:
    plural: backuppolicies
    singular: backuppolicy
    kind: BackupPolicy
    shortNames:
      - bp

Apply it:

kubectl apply -f backuppolicy-crd.yaml

Now create an instance:

apiVersion: storage.example.com/v1alpha1
kind: BackupPolicy
metadata:
  name: nightly
  namespace: default
spec:
  schedule: "0 2 * * *"
  retentionDays: 30

kubectl apply -f nightly-backup.yaml
kubectl get backuppolicies
kubectl get bp            # shortName works
kubectl describe bp nightly

The API server validates the spec against the schema, stores it in etcd, and returns it via all the standard API endpoints — all without a single line of custom code.
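
You can hit the new REST endpoint directly to confirm (jq used here only to trim the output):

kubectl get --raw \
  /apis/storage.example.com/v1alpha1/namespaces/default/backuppolicies \
  | jq '.items[].metadata.name'
# "nightly"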


CRD vs Built-In Resource: What Is Different?

Not much, deliberately.

Capability                        Built-in resource   Custom resource (CRD)
──────────                        ─────────────────   ─────────────────────
kubectl get / describe / delete   Yes                 Yes
RBAC (Roles, ClusterRoles)        Yes                 Yes
Watch (informers, events)         Yes                 Yes
Stored in etcd                    Yes                 Yes
OpenAPI schema validation         Yes                 Yes (you define the schema)
Admission webhooks                Yes                 Yes
Status subresource                Yes                 Optional (you enable it)
Scale subresource                 Yes                 Optional (you enable it)
Built-in controller behavior      Yes                 No — you write the controller

The last row is the critical one. When you create a Deployment, the deployment controller immediately starts managing ReplicaSets. When you create a BackupPolicy, nothing happens — until you write and deploy a controller that watches BackupPolicy objects and acts on them.

That controller + the CRD is what people call an Operator.


A Real Cluster: What You Actually See

Run this on any cluster running cert-manager, Prometheus Operator, or any other tooling:

kubectl get crds

Sample output (abbreviated):

NAME                                                  CREATED AT
certificates.cert-manager.io                          2024-11-01T08:12:00Z
certificaterequests.cert-manager.io                   2024-11-01T08:12:00Z
issuers.cert-manager.io                               2024-11-01T08:12:00Z
clusterissuers.cert-manager.io                        2024-11-01T08:12:00Z
scaledobjects.keda.sh                                 2024-11-01T08:13:00Z
scaledjobs.keda.sh                                    2024-11-01T08:13:00Z
externalsecrets.external-secrets.io                   2024-11-01T08:14:00Z
prometheuses.monitoring.coreos.com                    2024-11-01T08:15:00Z
servicemonitors.monitoring.coreos.com                 2024-11-01T08:15:00Z

Every tool that ships as a CRD-based system registers its resource types here first. The count often surprises engineers: a production cluster with a typical toolchain easily has 40–80 CRDs.

Check how many are on your cluster:

kubectl get crds --no-headers | wc -l

How the API Server Handles a CRD

When you apply a CRD, the API server does three things:

  CRD REGISTRATION FLOW

  kubectl apply -f my-crd.yaml
          │
          ▼
  1. API server validates the CRD manifest
     (is the schema valid OpenAPI v3? are names correct?)
          │
          ▼
  2. CRD stored in etcd
     (under /registry/apiextensions.k8s.io/customresourcedefinitions/)
          │
          ▼
  3. New REST endpoints activated immediately:
     GET  /apis/storage.example.com/v1alpha1/namespaces/{ns}/backuppolicies
     POST /apis/storage.example.com/v1alpha1/namespaces/{ns}/backuppolicies
     ...

From this point, any kubectl get backuppolicies or API call to those endpoints is handled exactly like a built-in resource call — the API server serves it from etcd, applies RBAC, runs admission webhooks, and returns standard JSON.

No restart required. The new endpoints appear within seconds.
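
In scripts and CI, wait for the CRD to report Established before creating instances:

kubectl apply -f backuppolicy-crd.yaml
kubectl wait --for=condition=Established \
  crd/backuppolicies.storage.example.com --timeout=30s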


The Difference Between CRD and CR

Two terms that are easily confused:

  • CRD (CustomResourceDefinition) — the schema/blueprint. There is one CRD per resource type. certificates.cert-manager.io is a CRD.
  • CR (Custom Resource) — an instance of a CRD. Every Certificate object you create is a custom resource. You can have thousands of CRs per CRD.

  CRD (one)          →  Custom Resource (many)
  ─────────             ─────────────────────
  certificates          web-tls           (namespace: production)
  .cert-manager.io      api-tls           (namespace: production)
                        admin-tls         (namespace: staging)
                        ...

The CRD is applied once (usually by the tool’s Helm chart). Custom resources are created by your users, your CI pipeline, or your GitOps system throughout the life of the cluster.


Where CRDs Fit in the Kubernetes Extension Model

CRDs are one of three ways to extend Kubernetes:

  KUBERNETES EXTENSION MECHANISMS

  1. CRDs + Controllers (Operators)
     Add new resource types + behavior
     → cert-manager, KEDA, Argo CD, Crossplane
     Used for: domain-specific abstractions, infrastructure management

  2. Admission Webhooks
     Intercept API requests to validate or mutate objects
     → OPA/Gatekeeper, Kyverno, Istio injection
     Used for: policy enforcement, sidecar injection, defaulting

  3. API Aggregation (AA)
     Register a fully separate API server behind the main API server
     → metrics-server, custom autoscalers
     Used for: when you need non-CRUD semantics (e.g. exec, attach, streaming)

For 95% of use cases, CRDs + controllers are the right mechanism. API aggregation is complex and only warranted for non-standard API semantics. Admission webhooks are complementary to CRDs, not an alternative.


⚠ Common Mistakes

Confusing the CRD with the controller. The CRD is just a schema registration — it does not execute code. If you apply a CRD but do not deploy its controller, creating custom resources will succeed (the API server accepts them) but nothing will happen. This catches many people the first time they try to use cert-manager by only applying the CRDs without installing the cert-manager controller.

Assuming CRD deletion is safe. Deleting a CRD deletes all custom resources of that type from etcd. There is no “are you sure?” prompt. If you delete the certificates.cert-manager.io CRD, every Certificate object in every namespace is gone.

Treating CRDs as ConfigMap replacements. Some teams store configuration in CRDs purely to get schema validation. This works, but without a controller, the custom resources are inert data. If you only need configuration storage with validation, a CRD is viable — just be explicit that there is no reconciliation loop.


Quick Reference

# List all CRDs in the cluster
kubectl get crds

# Inspect a specific CRD's schema
kubectl get crd certificates.cert-manager.io -o yaml

# List all custom resources of a type
kubectl get certificates -A

# Get details on a specific custom resource
kubectl describe certificate web-tls -n production

# Delete a CRD (WARNING: deletes all instances)
kubectl delete crd backuppolicies.storage.example.com

# Check if a CRD is established (ready to use)
kubectl get crd backuppolicies.storage.example.com \
  -o jsonpath='{.status.conditions[?(@.type=="Established")].status}'
# Returns: True

Key Takeaways

  • A Kubernetes CRD registers a new resource type with the API server — no source code changes, no restart required
  • Custom resources behave identically to built-in resources: kubectl, RBAC, watches, etcd, admission webhooks all work the same way
  • The CRD is just the schema; a controller gives custom resources behavior — together they form an Operator
  • Every production cluster running modern tooling already uses dozens of CRDs
  • Deleting a CRD deletes all its instances — treat CRDs as production-critical objects

What’s Next

EP02: CRDs You Already Use makes this concrete before we go deeper — we walk through cert-manager’s Certificate, KEDA’s ScaledObject, and External Secrets’ ExternalSecret as working examples, so you understand what a well-designed CRD looks like from a user’s perspective before you design your own.

Get EP02 in your inbox when it publishes → subscribe at linuxcent.com

LDAP Internals: The Directory Tree, Schema, and What Travels on the Wire

Reading Time: 12 minutes

The Identity Stack, Episode 2
EP01: What Is LDAP → EP02: LDAP Internals → EP03: LDAP Authentication on Linux → …


TL;DR

  • The Directory Information Tree (DIT) is the hierarchical database LDAP stores — every entry lives at a unique path described by its Distinguished Name (DN)
  • Object classes define what attributes an entry is allowed or required to have — posixAccount adds UID, GID, and home directory; inetOrgPerson adds email and display name
  • Schema is the rulebook: which attribute types exist across the entire directory, what syntax each follows, and which object classes require or permit them
  • An LDAP Search sends four things: a base DN, a scope (base/one/sub), a filter like (uid=vamshi), and a list of attributes to return — the server traverses the tree and returns LDIF
  • Every LDAP message on the wire is BER-encoded (Basic Encoding Rules, a subset of ASN.1) — a compact binary format, not text
  • ldapsearch output is LDIF (LDAP Data Interchange Format) — the human-readable representation of what the BER payload carried

The Big Picture: From ldapsearch to Directory Entry

ldapsearch -x -H ldap://dc.corp.com -b "dc=corp,dc=com" "(uid=vamshi)" cn mail uidNumber
     │
     │  TCP port 389 (or 636 for LDAPS)
     │  BER-encoded SearchRequest
     ▼
┌─────────────────────────────────────────────────┐
│  LDAP Server (AD / OpenLDAP / 389-DS / FreeIPA)  │
│                                                   │
│  Directory Information Tree                       │
│                                                   │
│  dc=corp,dc=com                    ← search base  │
│    └── ou=engineers                ← scope: sub   │
│          ├── uid=alice                            │
│          └── uid=vamshi  ← filter match           │
│                cn: vamshi                         │
│                mail: vamshi@corp.com              │
│                uidNumber: 1001                    │
└─────────────────────────────────────────────────┘
     │
     │  BER-encoded SearchResultEntry
     ▼
# LDIF output on your terminal
dn: uid=vamshi,ou=engineers,dc=corp,dc=com
cn: vamshi
mail: vamshi@corp.com
uidNumber: 1001

LDAP internals are the mechanics between the command you type and the directory entry you get back. EP01 explained why LDAP was invented. This episode explains what it actually does when you run it.


The Directory Information Tree

EP01 introduced the DIT as a concept inherited from X.500. Here’s what it actually looks like inside a directory.

Every LDAP directory has a root — the base DN — from which all entries descend. For a company called Corp with a domain corp.com, the base is typically dc=corp,dc=com. Below that, the tree branches into organizational units, and below those, individual entries for people, groups, services, and anything else the directory administrator decided to model.

dc=corp,dc=com                          ← domain root (base DN)
│
├── ou=people                           ← organizational unit: people
│     ├── uid=alice                     ← user entry
│     ├── uid=vamshi
│     └── uid=bob
│
├── ou=groups                           ← organizational unit: groups
│     ├── cn=engineers
│     └── cn=ops
│
├── ou=services                         ← organizational unit: service accounts
│     ├── cn=jenkins
│     └── cn=gitlab-runner
│
└── ou=hosts                            ← organizational unit: machines
      ├── cn=web01.corp.com
      └── cn=db01.corp.com

This hierarchy is not a file system and not a relational database. It is specifically optimized for reads — the query “give me everything about this user” is the operation the protocol is built around. Writes are infrequent. Reads are constant.

Every entry in the tree has exactly one parent. There are no cross-links between branches, no foreign keys. The tree is the structure. An entry’s position in the tree is what defines it.


Distinguished Names: Reading the Path

The Distinguished Name (DN) is how you address any entry in the directory. It reads right-to-left, from the leaf to the root, with each component separated by a comma.

uid=vamshi,ou=engineers,dc=corp,dc=com

Reading right-to-left:
  dc=corp,dc=com       ← domain: corp.com
  ou=engineers         ← organizational unit: engineers
  uid=vamshi           ← this specific entry: user "vamshi"

Each component of a DN — uid=vamshi, ou=engineers, dc=corp — is a Relative Distinguished Name (RDN). The RDN is the attribute-value pair that uniquely identifies the entry within its parent container. Two users in the same ou=engineers cannot both have uid=vamshi — that would create two entries with identical DNs, which the directory won’t allow.

Common RDN attribute types and what they mean:

Attribute   Stands for            Typical use
─────────   ──────────            ───────────
dc          Domain Component      Domain name segments (dc=corp,dc=com = corp.com)
ou          Organizational Unit   Container for grouping entries
cn          Common Name           Groups, service accounts, human-readable name
uid         User ID               Linux username — the standard RDN for user entries
o           Organization          Top-level org containers (less common in modern setups)

When your Linux system calls getent passwd vamshi, SSSD translates that into an LDAP Search for an entry where uid=vamshi somewhere under the configured base DN. The full DN comes back with the result, but what your system cares about are the attributes inside it.
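
The NSS side of that lookup, for comparison (output illustrative — built from the entry’s POSIX attributes):

getent passwd vamshi
# vamshi:*:1001:1001:vamshi:/home/vamshi:/bin/bash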


Object Classes and Schema

Every entry in the directory has an objectClass attribute — usually several values. Object classes define what attributes the entry is allowed or required to have.

# A typical user entry's object classes
dn: uid=vamshi,ou=engineers,dc=corp,dc=com
objectClass: top
objectClass: inetOrgPerson
objectClass: posixAccount
objectClass: shadowAccount

Each object class contributes a set of attributes — some required (MUST), some optional (MAY):

objectClass: posixAccount
  MUST: cn, uid, uidNumber, gidNumber, homeDirectory
  MAY:  userPassword, loginShell, gecos, description

objectClass: inetOrgPerson
  MUST: sn (surname), cn
  MAY:  mail, telephoneNumber, displayName, jpegPhoto, ...

objectClass: shadowAccount
  MUST: uid
  MAY:  shadowLastChange, shadowMin, shadowMax, shadowWarning, ...

When Linux authenticates a user via LDAP, it needs the posixAccount attributes: uidNumber (the numeric UID), gidNumber, homeDirectory, and loginShell. Without posixAccount, the user entry exists in the directory but can’t be used for Linux logins — getent passwd will return nothing.
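
A complete login-capable user entry, as an LDIF sketch you could feed to ldapadd (values illustrative):

# ldapadd -x -D "cn=admin,dc=corp,dc=com" -W -f vamshi.ldif
dn: uid=vamshi,ou=engineers,dc=corp,dc=com
objectClass: top
objectClass: inetOrgPerson
objectClass: posixAccount
objectClass: shadowAccount
cn: vamshi
sn: vamshi
uid: vamshi
uidNumber: 1001
gidNumber: 1001
homeDirectory: /home/vamshi
loginShell: /bin/bash
mail: vamshi@corp.com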

Groups in LDAP use their own object class:

objectClass: groupOfNames
  MUST: cn, member
  MAY:  description, owner, ...

# A group entry looks like this:
dn: cn=engineers,ou=groups,dc=corp,dc=com
objectClass: groupOfNames
cn: engineers
member: uid=vamshi,ou=engineers,dc=corp,dc=com
member: uid=alice,ou=engineers,dc=corp,dc=com

groupOfNames stores members as full DNs — which is why the SSSD group search filter is (member=uid=vamshi,ou=...) rather than (member=vamshi). The directory stores the exact path to each member entry. posixGroup is the alternative: it stores memberUid as a bare username string instead of a DN. Active Directory’s group class likewise stores members as full DNs; pure POSIX environments often use posixGroup.
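
For contrast, the posixGroup version of the same group — memberUid holds bare usernames, not DNs (gidNumber illustrative):

dn: cn=engineers,ou=groups,dc=corp,dc=com
objectClass: posixGroup
cn: engineers
gidNumber: 5000
memberUid: vamshi
memberUid: alice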

Object classes are grouped into three kinds:

Structural — defines what the entry fundamentally is. Every entry must have exactly one structural class. inetOrgPerson is the structural class in the user entry above.

Auxiliary — adds additional attributes to an existing entry. posixAccount and shadowAccount are auxiliary — note the AUXILIARY keyword in the schema definition quoted below. You can stack multiple auxiliary classes on a single entry.

Abstract — base classes that other classes inherit from. top is the root abstract class that every entry implicitly has. You never add top to an entry; it’s always there.

Schema: The Directory’s Type System

Schema is the global rulebook for the entire directory. It defines:

  • Attribute type definitions — what each attribute is named, what syntax it uses (a string? an integer? a binary blob?), whether it’s case-sensitive, whether multiple values are allowed
  • Object class definitions — which attributes each class requires or permits
  • Matching rules — how equality comparisons work for each attribute type

The schema is stored in the directory itself, under a special entry at cn=schema,cn=config (OpenLDAP) or cn=Schema,cn=Configuration (Active Directory). You can query it:

# View the schema for the posixAccount object class
ldapsearch -x -H ldap://your-dc \
  -b "cn=schema,cn=config" \
  "(objectClass=olcObjectClasses)" \
  olcObjectClasses | grep -A 10 "posixAccount"

# Output:
# olcObjectClasses: ( 1.3.6.1.1.1.2.0
#   NAME 'posixAccount'
#   DESC 'Abstraction of an account with POSIX attributes'
#   SUP top
#   AUXILIARY
#   MUST ( cn $ uid $ uidNumber $ gidNumber $ homeDirectory )
#   MAY ( userPassword $ loginShell $ gecos $ description ) )

That OID (1.3.6.1.1.1.2.0) is the globally unique identifier for the posixAccount object class. Every object class and attribute type in every LDAP directory on the planet has a unique OID assigned by an authority. This is how schema interoperability works across different directory implementations — OpenLDAP, Active Directory, and 389-DS can all understand each other’s posixAccount entries because they share the same OID.


LDAP Operations: What Actually Runs

RFC 4511 defines ten LDAP operations; the eight below are the ones you will work with directly (Unbind and Extended round out the set). Day-to-day authentication uses two: Bind and Search.

LDAP Operation Set
──────────────────
Bind        ← authenticate (prove identity)
Search      ← query the directory
Add         ← create a new entry
Modify      ← change attributes on an existing entry
Delete      ← remove an entry
ModifyDN    ← rename or move an entry
Compare     ← test if an attribute has a specific value
Abandon     ← cancel an outstanding operation

Bind: Proving Who You Are

Before any authenticated operation, the client sends a Bind request. There are two types:

Simple Bind — the client sends its DN and password in the clear (or over TLS). This is what -x in ldapsearch means: simple authentication.

# Simple bind as a service account
ldapsearch -x \
  -D "cn=svc-ldap-reader,ou=services,dc=corp,dc=com" \
  -w "service-account-password" \
  -H ldap://dc.corp.com \
  -b "dc=corp,dc=com" \
  "(uid=vamshi)"

SASL Bind — the client uses an authentication mechanism registered with SASL (Simple Authentication and Security Layer). Kerberos (via the GSSAPI mechanism) is the most common. EP05 covers Kerberos in detail.

# SASL bind using Kerberos (after kinit)
ldapsearch -Y GSSAPI \
  -H ldap://dc.corp.com \
  -b "dc=corp,dc=com" \
  "(uid=vamshi)"

An anonymous Bind (no DN, no password) is also valid for directories configured to allow anonymous reads. Many public LDAP directories (and some internal ones, misconfigured) allow this.
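
The classic anonymous query is the rootDSE — the server’s self-description, readable without credentials on most directories:

# Anonymous bind, empty base DN, base scope = the rootDSE
ldapsearch -x -H ldap://dc.corp.com -b "" -s base "(objectClass=*)" namingContexts
# namingContexts: dc=corp,dc=com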

Search: The Core Operation

A Search request has five parameters you set in practice — the full wire message also carries sizeLimit, timeLimit, and typesOnly, visible in the BER dump later:

baseObject   — where in the DIT to start (e.g., "dc=corp,dc=com")
scope        — how deep to look
               base    = only the base entry itself
               one     = one level below base (immediate children)
               sub     = entire subtree below base (most common)
derefAliases — how to handle alias entries (usually derefAlways)
filter       — what to match (e.g., "(uid=vamshi)")
attributes   — which attributes to return (empty = return all)

When SSSD authenticates a user login, it runs exactly two Search operations:

Search 1 — find the user's entry
  base:       dc=corp,dc=com
  scope:      sub
  filter:     (uid=vamshi)
  attributes: dn, uid, uidNumber, gidNumber, homeDirectory, loginShell

Search 2 — find the user's group memberships
  base:       dc=corp,dc=com
  scope:      sub
  filter:     (member=uid=vamshi,ou=engineers,dc=corp,dc=com)
  attributes: dn, cn, gidNumber

The first search locates the user entry and retrieves the POSIX attributes. The second finds all group entries that contain the user’s DN as a member. These two queries are the complete basis for a Linux login over LDAP.
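
Translated into ldapsearch (bind options omitted), the two queries look roughly like this:

# Search 1 — the user entry and its POSIX attributes
ldapsearch -x -H ldap://dc.corp.com -b "dc=corp,dc=com" \
  "(uid=vamshi)" uid uidNumber gidNumber homeDirectory loginShell

# Search 2 — every group whose member attribute contains the user's DN
ldapsearch -x -H ldap://dc.corp.com -b "dc=corp,dc=com" \
  "(member=uid=vamshi,ou=engineers,dc=corp,dc=com)" cn gidNumber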

Search Filters

LDAP filters follow a prefix (Polish notation) syntax. Every filter is wrapped in parentheses:

# Simple equality
(uid=vamshi)

# Presence — entry has this attribute at all
(mail=*)

# Substring match
(cn=vam*)

# Comparison
(uidNumber>=1000)

# Logical AND — both conditions must match
(&(objectClass=posixAccount)(uid=vamshi))

# Logical OR — either condition matches
(|(uid=vamshi)(mail=vamshi@corp.com))

# Logical NOT
(!(uid=guest))

# Combined — posixAccount entries with UID >= 1000 and no disabled flag
(&(objectClass=posixAccount)(uidNumber>=1000)(!(pwdAccountLockedTime=*)))

The & and | operators take any number of operands. Filter syntax looks strange the first time but is unambiguous and compact — which matters when you’re encoding it into BER for the wire.


What Actually Travels on the Wire

Every LDAP message is encoded in BER (Basic Encoding Rules), a binary subset of ASN.1. LDAP is not a text protocol.

When you run ldapsearch, the tool constructs a BER-encoded SearchRequest message and sends it over TCP. The server responds with one or more SearchResultEntry messages (one per matching entry), followed by a SearchResultDone. All of these are BER.

BER uses a type-length-value (TLV) encoding:

Tag byte(s)    — what type of data this is
Length byte(s) — how many bytes of data follow
Value byte(s)  — the actual data

A minimal LDAP SearchRequest for ldapsearch -x -b "dc=corp,dc=com" "(uid=vamshi)" uid looks like this on the wire:

30 3a          ← SEQUENCE (LDAPMessage)
  02 01 01     ← INTEGER 1 (messageID = 1)
  63 35        ← [APPLICATION 3] SearchRequest
    04 0e      ← OCTET STRING: baseObject
      64 63 3d  ← "dc=corp,dc=com" (14 bytes)
      63 6f 72
      70 2c 64
      63 3d 63
      6f 6d
    0a 01 02   ← ENUMERATED: scope = wholeSubtree (2)
    0a 01 03   ← ENUMERATED: derefAliases = derefAlways (3)
    02 01 00   ← INTEGER: sizeLimit = 0 (unlimited)
    02 01 00   ← INTEGER: timeLimit = 0 (unlimited)
    01 01 00   ← BOOLEAN: typesOnly = false
    a3 0d      ← [3] equalityMatch filter
      04 03 75 69 64   ← attributeDesc: "uid"
      04 06 76 61 6d   ← assertionValue: "vamshi"
             73 68 69
    30 05      ← SEQUENCE: AttributeDescriptionList
      04 03 75 69 64   ← "uid"

You don’t need to read BER by hand in practice. But knowing it’s binary — not HTTP, not JSON, not plain text — explains some things:

  • Why tcpdump port 389 shows binary output you can’t read directly
  • Why LDAP on port 389 looks different in Wireshark than HTTP traffic
  • Why ldapsearch output (LDIF) is a transformation of the wire data, not the wire data itself

To see the wire protocol in action:

# Run ldapsearch with debug output (level 1 = protocol tracing)
ldapsearch -d 1 -x \
  -H ldap://ldap.forumsys.com \
  -b "dc=example,dc=com" \
  -D "cn=read-only-admin,dc=example,dc=com" \
  -w readonly \
  "(uid=tesla)" cn

# You'll see output like:
# ldap_connect_to_host: TCP ldap.forumsys.com:389
# ldap_new_connection 1 1 0
# ldap_connect_to_host: Trying ldap.forumsys.com:389
# ldap_pvt_connect: fd: 5 tm: -1 async: 0
# TLS: can't connect.
# ldap_open_defconn: successful
# ber_scanf fmt ({it) ber:     ← BER decoding of the response
# ber_scanf fmt ({) ber:
# ber_scanf fmt (W) ber:
# ...

The ber_scanf lines are the BER decoder working through the server’s response. Each line represents one TLV element being read off the wire.
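
To watch the BER structure decoded field by field instead of as raw bytes, Wireshark’s CLI works well. A sketch, assuming tshark is installed and you can capture on the interface:

# Capture LDAP on port 389 and decode every BER element verbosely
sudo tshark -i eth0 -f "tcp port 389" -Y ldap -V | head -60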


Reading ldapsearch Output: Every Field

ldapsearch output is LDIF (LDAP Data Interchange Format), defined in RFC 2849. It’s the standard text serialization of LDAP entries.

ldapsearch -x \
  -H ldap://ldap.forumsys.com \
  -b "dc=example,dc=com" \
  -D "cn=read-only-admin,dc=example,dc=com" \
  -w password \
  "(uid=tesla)" \
  cn mail uid uidNumber objectClass

Output, annotated:

# extended LDIF
#
# LDAPv3                              ← protocol version confirmed
# base <dc=example,dc=com> with scope subtree
# filter: (uid=tesla)                 ← your search filter echoed back
# requesting: cn mail uid uidNumber objectClass
#

# tesla, example.com                  ← comment: CN, base DN
dn: uid=tesla,dc=example,dc=com      ← Distinguished Name — full path in the tree
objectClass: inetOrgPerson           ← structural class: person with org attrs
objectClass: organizationalPerson    ← superclass in the chain: adds telephoneNumber etc.
objectClass: person                  ← superclass: adds sn (surname)
objectClass: top                     ← every entry has this implicitly
cn: Tesla                            ← common name (from inetOrgPerson MUST)
mail: tesla@ldap.forumsys.com        ← email (from inetOrgPerson MAY)
uid: tesla                           ← userid (from inetOrgPerson MAY)

# search result
search: 2                            ← messageID of the SearchResultDone
result: 0 Success                    ← 0 = no error; 32 = no such object; 49 = invalid credentials

# numResponses: 2                    ← 1 result entry + 1 SearchResultDone
# numEntries: 1

The result: line is the one to watch when debugging. LDAP result codes:

Code  Meaning                  What it tells you
0     Success                  Query ran, results returned (or no results found — check numEntries)
32    No Such Object           Base DN doesn’t exist in this directory
49    Invalid Credentials      Bind failed — wrong DN, wrong password, or account locked
50    Insufficient Access      Your bind DN doesn’t have read permission on these entries
53    Unwilling to Perform     Server refused the operation (e.g., password policy, anonymous bind disabled)
65    Object Class Violation   Add/Modify would violate schema (missing MUST attribute, unrecognized object class)
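
The OpenLDAP client tools also exit with the LDAP result code, which makes the table directly scriptable. A sketch with placeholder connection details:

# $? carries the result code: 0 = success, 32 = no such object, 49 = invalid credentials
ldapsearch -x -H ldap://ldap.corp.com -b "dc=corp,dc=com" \
  -D "$BIND_DN" -w "$BIND_PW" "(uid=vamshi)" dn
echo "LDAP result code: $?"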

Ports: 389, 636, and 3268

Port 389   — LDAP (plaintext, or StartTLS in-session upgrade)
Port 636   — LDAPS (LDAP wrapped in TLS from the start)
Port 3268  — Active Directory Global Catalog (plain)
Port 3269  — Active Directory Global Catalog over TLS

Port 389 vs 636: Both carry the same BER-encoded LDAP protocol. The difference is when TLS starts. On 636 (LDAPS), the TLS handshake happens before the first LDAP message. On 389 with StartTLS, the client sends a plaintext ExtendedRequest with OID 1.3.6.1.4.1.1466.20037 to initiate the TLS upgrade, then both sides continue over TLS. In production, use one or the other — never unencrypted port 389. Your credentials transit the wire on every Bind.
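
From the client side, the two encrypted options look like this (placeholder hostname; -ZZ makes ldapsearch abort unless the StartTLS upgrade succeeds):

# LDAPS: TLS from the first byte, port 636
ldapsearch -x -H ldaps://ldap.corp.com:636 -b "dc=corp,dc=com" "(uid=vamshi)" dn

# StartTLS: connect in plaintext on 389, then upgrade in-session
ldapsearch -x -ZZ -H ldap://ldap.corp.com:389 -b "dc=corp,dc=com" "(uid=vamshi)" dn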

Ports 3268/3269 — Active Directory Global Catalog: AD organizes domains into forests. Each domain controller holds the full LDAP tree for its own domain. The Global Catalog is a read-only, partial replica of every domain in the forest — just the most-queried attributes from every object. When an application needs to find a user across domains in the same forest (not just in one domain), it queries the Global Catalog on 3268/3269 instead of a domain-specific DC on 389/636.

Forest: corp.com
  ├── Domain: corp.com       → DC at port 389/636   (full copy of corp.com)
  ├── Domain: emea.corp.com  → DC at port 389/636   (full copy of emea.corp.com)
  └── Global Catalog        → GC at port 3268/3269  (partial copy of ALL domains)

If your SSSD or application is configured to use port 3268 instead of 389, it’s talking to the Global Catalog — useful for forest-wide user lookups, but missing some less-common attributes that aren’t replicated to the GC.
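
The difference really is just the port. A sketch against a placeholder domain controller, bind options omitted for brevity (AD normally requires an authenticated bind); the empty base DN on the GC port searches the whole forest:

# Domain-scoped query against one DC
ldapsearch -x -H ldap://dc1.corp.com:389 -b "dc=corp,dc=com" "(sAMAccountName=vamshi)" dn

# Forest-wide query against the Global Catalog
ldapsearch -x -H ldap://dc1.corp.com:3268 -b "" -s sub "(sAMAccountName=vamshi)" dn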


Try It: ldapsearch Against Your Own Directory

If your Linux machine is joined to AD or connected to an LDAP directory, you can run these right now:

# 1. Confirm your SSSD knows where the LDAP server is
sudo grep -E "ldap_uri|ad_domain|krb5_server" /etc/sssd/sssd.conf  # sssd.conf is root-only (0600)

# 2. Look up your own user entry
ldapsearch -x \
  -H "$(sudo grep ldap_uri /etc/sssd/sssd.conf | awk -F= '{print $2}' | tr -d ' ')" \
  -b "dc=$(hostname -d | sed 's/\./,dc=/g')" \
  "(uid=$(whoami))" \
  dn objectClass uid uidNumber gidNumber homeDirectory loginShell

# 3. Find the groups you're in
ldapsearch -x \
  -H ldap://your-dc \
  -b "dc=corp,dc=com" \
  "(member=$(ldapsearch -x ... "(uid=$(whoami))" dn | grep ^dn | cut -d' ' -f2-))" \
  cn gidNumber

# 4. Check what object classes your entry has
ldapsearch -x \
  -H ldap://your-dc \
  -b "dc=corp,dc=com" \
  "(uid=$(whoami))" \
  objectClass

On a machine joined to Active Directory, the ldap_uri in sssd.conf is your domain controller’s address. On FreeIPA or OpenLDAP, it’s the directory server. The same ldapsearch commands work against all of them — because they all speak LDAP v3.


⚠ Common Misconceptions

“The DN is like a file path.” The analogy holds for reading it, but the DIT is not a file system. Entries don’t inherit permissions from parent containers the way files inherit from directories. Access control in LDAP is defined by ACLs on the server — not by position in the tree.

“LDAP is case-sensitive.” It depends on the attribute. Most string attributes (like cn and mail) use case-insensitive matching by default — (cn=Vamshi) and (cn=vamshi) return the same results. But some attributes (like userPassword and most binary types) are case-sensitive. The schema’s matching rules define this per-attribute.

“You need the full DN to search for a user.” No. The Search operation with a sub scope searches the entire subtree below the base DN. You search with a filter like (uid=vamshi) without knowing the full DN. The DN comes back in the result.

“LDAP accounts and Linux accounts are the same thing.” An LDAP user entry becomes a Linux account only if the entry has a posixAccount object class with the required POSIX attributes (uidNumber, gidNumber, homeDirectory). An LDAP entry without posixAccount can exist in the directory but getent passwd will not return it.

“The objectClass attribute can be changed freely.” Structural object classes cannot be changed after an entry is created — you’d have to delete and recreate the entry. Auxiliary classes can be added or removed. This is why correctly choosing the structural class at entry creation time matters.


Framework Alignment

CISSP Domain 5: Identity and Access Management
  DIT structure, DN addressing, object classes, and schema are the data model underpinning every enterprise identity store — understanding them is foundational to managing directory-based IAM

CISSP Domain 4: Communications and Network Security
  BER on port 389 is unencrypted; LDAPS (port 636) or StartTLS is required for production — wire-level understanding informs the transport security decision

CISSP Domain 3: Security Architecture and Engineering
  Schema design and DIT hierarchy are architectural decisions with security consequences: overly permissive schemas enable privilege escalation; flat DITs make access delegation harder

Key Takeaways

  • The DIT is a hierarchical database — every entry has a unique DN that describes its path from leaf to root
  • Object classes define the schema rules for each entry: what attributes are required (MUST) vs optional (MAY), and what the entry fundamentally is
  • For a user to be usable for Linux logins, the directory entry needs the posixAccount object class with uidNumber, gidNumber, and homeDirectory populated
  • An LDAP login is two operations: a Bind (authenticate), then a Search (retrieve POSIX attributes and group memberships)
  • Everything on the wire is BER-encoded binary — ldapsearch output is LDIF, a human-readable transformation of what the wire actually carries
  • LDAP result code 0 means success; 49 means bad credentials; 32 means the base DN doesn’t exist — these are the three you’ll debug most often


Run ldapsearch against your own directory and look at the object classes on your entry. Does it have posixAccount? Does it have shadowAccount? What attributes is your SSSD actually reading on every login — and what does it do when the LDAP server is unreachable? 👇


What’s Next

EP02 showed what’s inside the directory: the tree structure, the schema, the operations, and the wire protocol. What it left open is how Linux actually uses this information to grant a login.

LDAP is not, by itself, an authentication protocol. The Bind operation can verify a password — but that’s a tiny piece of what happens when you SSH into a machine joined to Active Directory. The full login flow runs through PAM, NSS, and SSSD before LDAP ever gets queried. EP03 traces that path.

Next: LDAP Authentication on Linux: PAM, NSS, and the Login Stack

Get EP03 in your inbox when it publishes → linuxcent.com/subscribe

Kubernetes Today: v1.33 to v1.35, In-Place Resize GA, and What Comes Next

Reading Time: 6 minutes


Introduction

Ten years after the first commit, Kubernetes is not exciting in the way it was in 2015. That’s a compliment. The system is stable. The APIs are mature. The migrations — dockershim, PSP, cloud provider code — are behind us.

What the 1.33–1.35 cycle shows is a project focused on precision: removing edge cases, promoting long-running alpha features to stable, and making the scheduler, storage, and security model more correct rather than more powerful. That’s what a mature infrastructure platform looks like.

Here’s what happened and where the project is headed.


Kubernetes 1.33 — Sidecar Resize, In-Place Resize Beta (April 2025)

Code name: Octarine

In-Place Pod Vertical Scaling reaches Beta

After landing as alpha in 1.27, in-place pod resource resizing became beta in 1.33 — enabled by default via the InPlacePodVerticalScaling feature gate.

The capability: change CPU and memory requests/limits on a running container without terminating and restarting the pod.

# Resize a running container's CPU limit without restart (uses the resize subresource)
kubectl patch pod api-pod-xyz --subresource resize --type='json' -p='[
  {
    "op": "replace",
    "path": "/spec/containers/0/resources/requests/cpu",
    "value": "2"
  },
  {
    "op": "replace",
    "path": "/spec/containers/0/resources/limits/cpu",
    "value": "4"
  }
]'

# Verify the resize was applied
kubectl get pod api-pod-xyz -o jsonpath='{.status.containerStatuses[0].resources}'

Why this matters operationally: Before in-place resize, vertical scaling meant terminating the pod, losing in-memory state, waiting for a new pod to become ready. For databases with warm buffer pools, JVM applications with loaded heap caches, or any workload where startup cost is significant, this was a serious limitation. Vertical Pod Autoscaler (VPA) worked around it by restarting pods — acceptable for stateless workloads, problematic for stateful ones.

In 1.33, resizing also works for sidecar containers, which graduated to stable in the same release.

Sidecar Containers — Full Maturity

The first feature to formally combine sidecar and in-place resize: you can now vertically scale a service mesh proxy (Envoy sidecar) without restarting the application pod. For high-traffic services where the proxy itself becomes the CPU bottleneck, this is directly actionable.
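
A hedged sketch of what that looks like, assuming a pod named web-pod-abc whose Envoy sidecar is initContainers[0] (sidecars are restartable init containers) on a 1.33+ cluster:

# Raise the sidecar proxy's CPU limit in place, via the resize subresource
kubectl patch pod web-pod-abc --subresource resize --type='json' -p='[
  {"op": "replace", "path": "/spec/initContainers/0/resources/limits/cpu", "value": "2"}
]'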


Gateway API v1.4 (October 2025)

Gateway API continued its rapid iteration with v1.4:

BackendTLSPolicy (Standard channel): Configure TLS between the gateway and the backend service — not just TLS termination at the gateway, but end-to-end encryption:

apiVersion: gateway.networking.k8s.io/v1alpha3
kind: BackendTLSPolicy
metadata:
  name: api-backend-tls
spec:
  targetRefs:
  - group: ""
    kind: Service
    name: api-service
  validation:
    caCertificateRefs:
    - name: internal-ca
      group: ""
      kind: ConfigMap
    hostname: api.internal.corp

Gateway Client Certificate Validation: The gateway can now validate client certificates — mutual TLS for ingress traffic, not just between services.

TLSRoute to Standard: TLS routing (based on SNI, not HTTP host headers) graduated to the standard channel — enabling TCP workloads with TLS passthrough through the Gateway API model.

ListenerSet: Group multiple Gateway listeners — useful for shared infrastructure where multiple teams need to attach routes to the same gateway without managing separate Gateway resources.


Kubernetes 1.34 — Scheduler Improvements, DRA Continues (August 2025)

The 1.34 release focused on the scheduler and Dynamic Resource Allocation:

DRA structured parameters stabilization: The Dynamic Resource Allocation API matured its parameter model — resource drivers can expose structured claims that the scheduler understands, enabling topology-aware placement of GPU workloads:

apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
      selectors:
      - cel:
          expression: device.attributes["nvidia.com/gpu-product"].string() == "A100-SXM4-80GB"
      count: 2
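
On a cluster with a DRA driver installed, these objects are queryable like any other resource. A sketch; the claim name matches the example above:

# DeviceClasses are published by the driver; claims are namespaced
kubectl get deviceclasses
kubectl get resourceclaims -A
kubectl describe resourceclaim gpu-claim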

Scheduler QueueingHint stable: Plugins can now tell the scheduler when to re-queue a pod for scheduling — instead of the scheduler periodically retrying all unschedulable pods, plugins signal when relevant cluster state has changed. This significantly reduces scheduler CPU consumption in large clusters with many unschedulable pods.

Fine-grained node authorization improvements: Kubelets can now be restricted from accessing Service resources they don’t need — further reducing the blast radius of a compromised kubelet.


Kubernetes 1.35 — In-Place Resize GA, Memory Limits Unlocked (December 2025)

In-Place Pod Vertical Scaling Graduates to Stable

After alpha in 1.27 and beta in 1.33, in-place resize graduated to GA in 1.35. Two significant improvements accompanied GA:

Memory limit decreases now permitted: Previously, you could increase memory limits in-place but not decrease them. The restriction existed because the kernel doesn’t immediately reclaim memory when the limit is lowered — the OOM killer would need to run. 1.35 lifts this restriction with proper handling: the kernel is instructed to reclaim, and the pod status reflects the resize progress.
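
A sketch of a live memory-limit decrease, reusing the pod name from the 1.33 example; the resize subresource applies here too:

# Lower the memory limit in place; the kernel is asked to reclaim
kubectl patch pod api-pod-xyz --subresource resize --type='json' -p='[
  {"op": "replace", "path": "/spec/containers/0/resources/limits/memory", "value": "2Gi"}
]'

# Resize progress surfaces in the pod's conditions and events
kubectl describe pod api-pod-xyz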

Pod-Level Resources (alpha in 1.35): Specify resource requests and limits at the pod level rather than per-container — with in-place resize support. Useful for init containers and sidecar patterns where total pod resources matter more than per-container allocation.

spec:
  # Pod-level resources (alpha) — total budget for all containers
  resources:
    requests:
      cpu: "4"
      memory: "8Gi"
  containers:
  - name: application
    image: myapp:latest
    # No per-container resources; the pod-level budget applies
  initContainers:
  - name: log-collector
    image: fluentbit:latest
    restartPolicy: Always  # sidecar: a restartable init container

Other 1.35 Highlights

Topology Spread Constraints improvements: Better handling of unschedulable scenarios — whenUnsatisfiable: ScheduleAnyway now has smarter fallback behavior.

VolumeAttributesClass stable: Change storage performance characteristics (IOPS, throughput) of a PersistentVolume without re-provisioning — the storage equivalent of in-place pod resize.

# Change volume IOPS without re-provisioning
kubectl patch pvc database-pvc --type='merge' -p='
  {"spec": {"volumeAttributesClassName": "high-performance"}}'

Job success policy improvements: Declare a Job successful when a subset of pods complete successfully — for distributed training jobs where not all workers need to finish.
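
A quick probe to check whether your cluster’s Job API exposes the field, with no objects created:

# Present when the success-policy feature is available on this cluster
kubectl explain job.spec.successPolicy --recursive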


What’s in Kubernetes 1.36 (April 22, 2026)

Kubernetes 1.36 is on track for April 22, 2026 release. Based on the enhancement tracking and KEP (Kubernetes Enhancement Proposal) pipeline, expected highlights include:

  • DRA continuing toward stable
  • Pod-level resources moving to beta
  • Scheduler improvements for AI/ML workload placement
  • Further Gateway API integration as core networking model

The project has reached a rhythm: three releases per year, each focused on advancing a predictable set of features through alpha → beta → stable. The drama of the 2019–2022 period (PSP, dockershim, API removals) is behind it.


The State of the Ecosystem in 2026

Control Plane Deployment Models

Model                      Examples             Best For
Managed (cloud provider)   GKE, EKS, AKS        Most organizations; no control plane ops
Self-managed               kubeadm, k3s, Talos  Air-gapped, on-prem, specific compliance requirements
Managed (platform)         Rancher, OpenShift   Enterprises that need multi-cluster management + vendor support

CNI Landscape

CNI       Model              Notable Feature
Cilium    eBPF               kube-proxy replacement, network policy at kernel, Hubble observability
Calico    eBPF or iptables   BGP-based networking, hybrid cloud routing
Flannel   VXLAN/host-gw      Simple, low overhead, no network policy
Weave     Mesh overlay       Easy multi-host setup

eBPF-based CNIs (Cilium, Calico in eBPF mode) are now the default recommendation for production clusters. The iptables era of Kubernetes networking is ending.

Security Stack in 2026

A hardened Kubernetes cluster in 2026 runs:

Cluster provisioning:    Cluster API + GitOps (Flux/ArgoCD)
Admission control:       Pod Security Admission (restricted) + Kyverno or OPA/Gatekeeper
Runtime security:        Falco (eBPF-based syscall monitoring)
Network security:        Cilium NetworkPolicy + Cilium Cluster Mesh for multi-cluster
Image security:          Cosign signing in CI + admission webhook for signature verification
Secret management:       External Secrets Operator → HashiCorp Vault or cloud KMS
Observability:           Prometheus + Grafana + Hubble (network flows) + OpenTelemetry

The Permanent Principles That Haven’t Changed

Looking across twelve years and 35 minor versions, some things have not changed:

The API as the universal interface: Everything in Kubernetes is a resource. This remains the most important architectural decision — it makes every tool, every controller, every GitOps system work with the same model.

Reconciliation loops: Every Kubernetes controller watches actual state and drives it toward desired state. The controller pattern from 2014 is unchanged. CRDs and Operators are just more instances of it.

Labels and selectors: The flexible grouping mechanism from 1.0 is still the primary way Kubernetes components find each other. Services find pods. HPA finds Deployments. Operators find their managed resources.

Declarative, not imperative: You describe what you want. Kubernetes figures out how to achieve and maintain it. This principle, inherited from Borg’s BCL configuration, underlies everything from Deployments to Crossplane’s cloud resource management.


What’s Coming: The Next Five Years

WebAssembly on Kubernetes: The Wasm ecosystem (wasmCloud, SpinKube) is building toward running WebAssembly workloads as first-class Kubernetes pods — near-native performance, smaller images, stronger isolation than containers. Still early, but gaining real adoption.

AI inference as infrastructure: LLM serving is becoming a cluster primitive. Tools like KServe and vLLM on Kubernetes are moving from research to production. The scheduler, resource model, and networking will continue adapting to inference workload patterns.

Confidential computing: AMD SEV, Intel TDX, and ARM CCA provide hardware-level memory encryption for pods. The RuntimeClass mechanism and ongoing kernel work are making confidential Kubernetes workloads operational rather than experimental.

Leaner distributions: k3s, k0s, Talos, and Flatcar-based minimal Kubernetes distributions are growing in adoption for edge, IoT, and resource-constrained environments. The pressure is toward smaller, more auditable control planes.


Key Takeaways

  • In-place pod vertical scaling went from alpha (1.27) to stable (1.35) — live CPU and memory resize without pod restart changes the economics of stateful workload management
  • Gateway API v1.4 completes the ingress replacement story: BackendTLSPolicy, client certificate validation, and TLSRoute in standard channel
  • VolumeAttributesClass stable (1.35): Change storage performance in-place — the storage parallel to pod resource resize
  • The eBPF era of Kubernetes networking is established: Cilium as default CNI in GKE, growing in EKS/AKS, replacing iptables-based kube-proxy
  • The Kubernetes project in 2026 is focused on precision — promoting mature features to stable, reducing edge cases, improving scheduler efficiency — not adding new abstractions
  • WebAssembly, confidential computing, and AI inference scheduling are the frontiers to watch

Series Wrap-Up

Era         Defining Change
2003–2014   Borg and Omega build the playbook internally at Google
2014–2016   Kubernetes 1.0, CNCF, and winning the container orchestration wars
2016–2018   RBAC stable, CRDs, cloud providers all-in on managed K8s
2018–2020   Operators, service mesh, OPA/Gatekeeper — the extensibility era
2020–2022   Supply chain crisis, PSP deprecated, API removals, dockershim exit
2022–2023   Dockershim and PSP removed, eBPF networking takes over
2023–2025   GitOps standard, sidecar stable, DRA, AI/ML workloads
2025–2026   In-place resize GA, VolumeAttributesClass, Gateway API complete

From 47,501 lines of Go in a 250-file GitHub commit to the operating system of the cloud — and still reconciling.


← EP07: Platform Engineering Era

Series: Kubernetes: From Borg to Platform Engineering | linuxcent.com

Hardening Blueprint as Code — Declare Your OS Baseline in YAML

Reading Time: 6 minutes

OS Hardening as Code, Episode 2
Cloud AMI Security Risks · Linux Hardening as Code


TL;DR

  • A hardening runbook is a list of steps someone runs. A HardeningBlueprint YAML is a build artifact — if it wasn’t applied, the image doesn’t exist
  • Linux hardening as code means declaring your entire OS security baseline in a single YAML file and building it reproducibly across any provider
  • stratum build --blueprint ubuntu22-cis-l1.yaml --provider aws either produces a hardened image or fails — there is no partial state
  • The blueprint includes: target OS/provider, compliance benchmark, Ansible roles, and per-control overrides with documented reasons
  • One blueprint file = one source of truth for your hardening posture, version-controlled and reviewable like any other infrastructure code
  • Post-build OpenSCAP scan runs automatically — the image only snapshots if it passes

The Problem: A Runbook That Gets Skipped Once Is a Runbook That Gets Skipped

Hardening runbook
       │
       ▼
  Human executes
  steps manually
       │
       ├─── 47 deployments: followed correctly
       │
       └─── 1 deployment at 2am: step 12 skipped
                    │
                    ▼
           Instance in production
           without audit logging,
           SSH password auth enabled,
           unnecessary services running

Linux hardening as code eliminates the human decision point. If the blueprint wasn’t applied, the image doesn’t exist.

EP01 showed that default cloud AMIs arrive pre-broken — unnecessary services, no audit logging, weak kernel parameters, SSH configured for convenience not security. The obvious response is a hardening script. But a script run by a human is still a process step. It can be skipped. It can be done halfway. It can drift across different engineers who each interpret “run the hardening script” slightly differently.


A production deployment last year. The platform team had a solid CIS L1 hardening runbook — 68 steps, well-documented, followed consistently. Then a critical incident at 2am required three new instances to be deployed on short notice. The engineer on call ran the provisioning script and, under pressure, skipped the hardening step with the intention of running it the next morning.

They didn’t. The three instances stayed in production unhardened for six weeks before an automated scan caught them. Audit logging wasn’t configured. SSH was accepting password authentication. Two unnecessary services were running that weren’t in the approved software list.

Nothing was breached. But the finding went into the next compliance report as a gap, the team spent a week remediating, and the post-mortem conclusion was “we need better runbook discipline.”

That’s the wrong conclusion. The runbook isn’t the problem. The problem is that hardening was a process step instead of a build constraint.


What Linux Hardening as Code Actually Means

Linux hardening as code is the same principle as infrastructure as code applied to OS security posture: the desired state is declared in a file, the file is the source of truth, and the execution is deterministic and repeatable.

HardeningBlueprint YAML
         │
         ▼
  stratum build
         │
  ┌──────┴──────────────────┐
  │  Provider Layer          │
  │  (cloud-init, disk       │
  │   names, metadata        │
  │   endpoint per provider) │
  └──────┬──────────────────┘
         │
  ┌──────┴──────────────────┐
  │  Ansible-Lockdown        │
  │  (CIS L1/L2, STIG —      │
  │   the hardening steps)   │
  └──────┬──────────────────┘
         │
  ┌──────┴──────────────────┐
  │  OpenSCAP Scanner        │
  │  (post-build verify)     │
  └──────┬──────────────────┘
         │
         ▼
  Golden Image (AMI/GCP image/Azure image)
  + Compliance grade in image metadata

The YAML file is what you write. Stratum handles the rest.


The HardeningBlueprint YAML

The blueprint is the complete, auditable declaration of your OS security posture:

# ubuntu22-cis-l1.yaml
name: ubuntu22-cis-l1
description: Ubuntu 22.04 CIS Level 1 baseline for production workloads
version: "1.0"

target:
  os: ubuntu
  version: "22.04"
  provider: aws
  region: ap-south-1
  instance_type: t3.medium

compliance:
  benchmark: cis-l1
  controls: all

hardening:
  - ansible-lockdown/UBUNTU22-CIS
  - role: custom-audit-logging
    vars:
      audit_log_retention_days: 90
      audit_max_log_file: 100

filesystem:
  tmp:
    type: tmpfs
    options: [nodev, nosuid, noexec]
  home:
    options: [nodev]

controls:
  - id: 1.1.2
    override: compliant
    reason: "tmpfs /tmp implemented via systemd unit — equivalent control"
  - id: 5.2.4
    override: compliant
    reason: "SSH timeout managed by session manager policy, not sshd_config"

Each section is explicit:

target — which OS, which version, which provider. This is the only provider-specific section. The compliance intent below it is portable.

compliance — which benchmark and which controls to apply. controls: all means every CIS L1 control. You can also specify controls: [1.x, 2.x] to scope to specific sections.

hardening — which Ansible roles to run. ansible-lockdown/UBUNTU22-CIS is the community CIS hardening role. You can add custom roles alongside it.

controls — documented exceptions. Not suppressions — overrides with a recorded reason. This is the difference between “we turned off this control” and “this control is satisfied by an equivalent implementation, documented here.”


Building the Image

# Validate the blueprint before building
stratum blueprint validate ubuntu22-cis-l1.yaml

# Build — this will take 15-20 minutes
stratum build --blueprint ubuntu22-cis-l1.yaml --provider aws

# Output:
# [15:42:01] Launching build instance...
# [15:42:45] Running ansible-lockdown/UBUNTU22-CIS (144 tasks)...
# [15:51:33] Running custom-audit-logging role...
# [15:52:11] Running post-build OpenSCAP scan (benchmark: cis-l1)...
# [15:54:08] Grade: A (98/100 controls passing)
# [15:54:09] 2 controls overridden (documented in blueprint)
# [15:54:10] Creating AMI snapshot: ami-0a7f3c9e82d1b4c05
# [15:54:47] Done. AMI tagged with compliance grade: cis-l1-A-98

If the post-build scan comes back below a configurable threshold, the build fails — no AMI is created. The instance is terminated. The image does not exist.

That is the structural guarantee. You cannot skip a build step at 2am because at 2am you’re calling stratum build, not running steps manually.
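
In a pipeline, the guarantee composes naturally. A sketch that uses only the commands shown in this post; the shell wiring around them is illustrative:

# Fail the pipeline unless the blueprint validates and the hardened image builds
set -euo pipefail
stratum blueprint validate ubuntu22-cis-l1.yaml
stratum build --blueprint ubuntu22-cis-l1.yaml --provider aws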


The Control Override Mechanism

The override mechanism is what separates this from checkbox compliance.

Every security benchmark has controls that conflict with how production environments actually work. CIS L1 recommends /tmp on a separate partition. Many cloud instances use tmpfs with equivalent nodev, nosuid, noexec mount options. The intent of the control is satisfied. The literal implementation differs.

Without an override mechanism, you have two bad options: fail the scan (noisy, meaningless), or configure the scanner to ignore the control (undocumented, invisible to auditors).

The blueprint’s controls section gives you a third option: record the override, document the reason, and let the scanner count it as compliant. The SARIF output and the compliance grade both reflect the documented state.

controls:
  - id: 1.1.2
    override: compliant
    reason: "tmpfs /tmp implemented via systemd unit — equivalent control"

This appears in the build log, in the SARIF export, and in the image metadata. An auditor reading the output sees: control 1.1.2 — compliant, documented exception, reason recorded. Not: control 1.1.2 — ignored.


What the Blueprint Gives You That a Script Doesn’t

                                Hardening script            HardeningBlueprint YAML
Version-controlled              Possible but not enforced   Always — it’s a file
Auditable exceptions            Typically not               Built-in override mechanism
Post-build verification         Manual or none              Automatic OpenSCAP scan
Image exists only if hardened   No                          Yes — build fails if scan fails
Multi-cloud portability         Requires separate scripts   Provider flag, same YAML
Drift detection                 Not possible                Rescan instance against original grade
Skippable at 2am                Yes                         No — you’d have to change the build process

The last row is the one that matters. A script is skippable because there’s a human in the loop. A blueprint is a build artifact — you can’t deploy the image without the blueprint having been applied, because the image is what the blueprint produces.


Validating a Blueprint Before Building

# Syntax and schema validation
stratum blueprint validate ubuntu22-cis-l1.yaml

# Dry-run — show what Ansible tasks will run, what controls will be checked
stratum build --blueprint ubuntu22-cis-l1.yaml --provider aws --dry-run

# Show all available controls for a benchmark
stratum blueprint controls --benchmark cis-l1 --os ubuntu --version 22.04

# Show what a specific control checks
stratum blueprint controls --id 1.1.2 --benchmark cis-l1

The dry-run output shows every Ansible task that will run, every OpenSCAP check that will fire, and flags any controls that might conflict with the provider environment before you’ve launched a build instance.


Production Gotchas

Build time is 15–25 minutes. Ansible-Lockdown applies 144+ tasks for CIS L1. Build this into your pipeline timing — don’t expect golden images in 3 minutes.

Cloud-init ordering matters. On AWS, certain hardening steps (sysctl tuning, PAM configuration) interact with cloud-init. The Stratum provider layer handles sequencing — but if you add custom hardening roles, test the cloud-init interaction explicitly.

Some CIS controls conflict with managed service requirements. AWS Systems Manager Session Manager requires specific SSH configuration. RDS requires specific networking settings. Use the controls override section to document these — don’t suppress them silently.

Kernel parameter hardening requires a reboot. Controls in the 3.x (network parameters) and 1.5.x (kernel modules) sections apply sysctl changes that take effect on reboot. The Stratum build process reboots the instance before the OpenSCAP scan — don’t skip the reboot if you’re building manually.


Key Takeaways

  • Linux hardening as code means the blueprint YAML is the build artifact — the image either exists and is hardened, or it doesn’t exist
  • The controls override mechanism is the difference between undocumented suppressions and auditable, reasoned exceptions
  • Post-build OpenSCAP scan runs automatically — a failing grade blocks image creation
  • One blueprint file is portable across providers (EP03 covers this): the compliance intent stays in the YAML, the cloud-specific details go in the provider layer
  • Version-controlling the blueprint gives you a complete history of what your OS security posture was at any point in time — the same way Terraform state tracks infrastructure

What’s Next

One blueprint, one provider. EP02 showed that the skip-at-2am problem is solved when hardening is a build artifact rather than a process step.

What it didn’t address: what happens when you expand to a second cloud. GCP uses different disk names. Azure cloud-init fires in a different order. The AWS metadata endpoint IP is different from every other provider. If you maintain separate hardening scripts per cloud, they drift within a month.

EP03 covers multi-cloud OS hardening: the same blueprint, six providers, no drift.

Next: multi-cloud OS hardening — one blueprint for AWS, GCP, and Azure

Get EP03 in your inbox when it publishes → linuxcent.com/subscribe

What Is LDAP — and Why It Was Invented to Replace Something Worse

Reading Time: 9 minutes

The Identity Stack, Episode 1
EP01 → EP02: LDAP Internals → EP03 → …


TL;DR

  • LDAP (Lightweight Directory Access Protocol) is a protocol for reading and writing directory information — most commonly, who is allowed to do what
  • It was built in 1993 as a “lightweight” alternative to X.500/DAP, which ran over the full OSI stack and was impossible to deploy on anything but mainframe hardware
  • Before LDAP, every server had its own /etc/passwd — 50 machines meant 50 separate user databases, managed manually
  • NIS (Network Information Service) was the first attempt to centralize this — it worked, then became a cleartext-credentials security liability
  • LDAP v3 (RFC 2251, 1997) is the version still in production today — 27 years of backwards compatibility
  • Everything you use today — Active Directory, Okta, Entra ID — is built on top of, or speaks, LDAP

The Big Picture: 50 Years of “Who Are You?”

1969–1980s   /etc/passwd — per-machine, no network auth
     │        50 servers = 50 user databases, managed manually
     │
     ▼
1984         Sun NIS / Yellow Pages — first centralized directory
     │        broadcast-based, no encryption, flat namespace
     │        Revolutionary for its era. A liability by the 1990s.
     │
     ▼
1988         X.500 / DAP — enterprise-grade directory services
     │        OSI protocol stack. Powerful. Impossible to deploy.
     │        Mainframe-class infrastructure required just to run it.
     │
     ▼
1993         RFC 1487 — LDAP v1
     │        Tim Howes, University of Michigan.
     │        Lightweight. TCP/IP. Actually deployable.
     │
     ▼
1997         RFC 2251 — LDAP v3
     │        SASL authentication. TLS. Controls. Referrals.
     │        The version still in production today.
     │
     ▼
2000s–now    Active Directory, OpenLDAP, 389-DS, FreeIPA
             Okta, Entra ID, Google Workspace
             LDAP DNA in every identity system on the planet.

What is LDAP? It’s the protocol that solved one of the most boring and consequential problems in computing: how do you know who someone is, across machines, at scale, without sending their password in cleartext?


The World Before LDAP

Before you understand why LDAP was invented, you need to feel the problem it solved.

Every Unix machine in the 1970s and 1980s managed its own users. When you created an account on a server, your username, UID, and hashed password went into /etc/passwd on that machine. Another machine had no idea you existed. If you needed access to ten servers, an administrator created ten separate accounts — manually, one by one. When you changed your password, each account had to be updated separately.

For a university with 200 machines and 10,000 students, this was chaos. For a company with offices in three cities, it was a full-time job for multiple sysadmins.

Machine A           Machine B           Machine C
/etc/passwd         /etc/passwd         /etc/passwd
vamshi:x:1001       (vamshi unknown)    vamshi:x:1004
alice:x:1002        alice:x:1001        alice:x:1003
bob:x:1003          bob:x:1002          (bob unknown)

Same people, different UIDs, different machines, no central truth.
File permissions become meaningless when UID 1001 means
different users on different hosts.

For every new hire, an admin SSHed to every machine and ran useradd. When someone left, you hoped whoever ran the offboarding remembered all the machines. Most organizations didn’t know their own attack surface because there was no single place to look.


Sun NIS: The First Attempt at Centralization

Sun Microsystems released NIS (Network Information Service) in 1984, originally called Yellow Pages — a name they had to drop after a trademark dispute with British Telecom. The idea was elegant: one server holds the authoritative /etc/passwd (and /etc/group, /etc/hosts, and a dozen other maps), and client machines query it instead of reading local files.

For the first time, you could create an account once and have it work across your entire network. For a generation of Unix administrators, NIS was liberating.

       NIS Master Server
       /var/yp/passwd.byname
              │
    ┌─────────┼──────────┐
    ▼         ▼          ▼
 Client A   Client B   Client C
 (query NIS — no local /etc/passwd needed)

NIS worked well — until it didn’t. The failure modes were structural:

No encryption. NIS responses were cleartext UDP. An attacker on the same network segment could capture the full password database with a packet sniffer. In 1984, “the network” meant a trusted corporate LAN. By the mid-1990s, it meant ethernet segments that included lab workstations, and the assumptions no longer held.

Broadcast-based discovery. NIS clients found servers by broadcasting on the local network. This worked on a single flat ethernet. It failed completely across routers, across buildings, and across WAN links. Multi-site organizations ended up running separate NIS domains with no connection between them — which partially defeated the purpose.

Flat namespace. NIS had no organizational hierarchy. One domain. Everything flat. You couldn’t have engineering and finance as separate administrative units. You couldn’t delegate user management to a department. One person — usually one overworked sysadmin — managed the whole thing.

UIDs had to match across all machines. If alice was UID 1002 on one server but UID 1001 on another, NFS file ownership became wrong. NIS enforced consistency, but onboarding a new machine into an existing network required manually auditing UID conflicts across the entire directory. Get one wrong and files end up owned by the wrong person.

NIS worked for thousands of installations from 1984 to the mid-1990s. It also ended careers when it failed. What the industry needed was a hierarchical, structured, encrypted, scalable directory service.


X.500 and DAP: The Right Idea, Wrong Protocol

The OSI (Open Systems Interconnection) standards body had an answer: X.500 directory services. X.500 was comprehensive, hierarchical, globally federated. The ITU-T published the standard in 1988, and it looked like exactly what enterprises needed.

X.500 Directory Information Tree (DIT)
              c=US                   ← country
                │
         o=University                ← organization
                │
         ┌──────┴──────┐
     ou=CS           ou=Physics      ← organizational units
         │
     cn=Tim Howes                    ← common name (person)
     telephoneNumber: +1-734-...
     mail: [email protected]

This data model — the hierarchy, the object classes, the distinguished names — is exactly what LDAP inherited. The DIT, the cn=, ou=, dc= notation in every LDAP query you’ve ever read: all of it came from X.500.

The problem was DAP: the Directory Access Protocol that X.500 used to communicate.

DAP ran over the full OSI protocol stack. Not TCP/IP — OSI. Seven layers, all of which required specialized software that in 1988 only mainframe and minicomputer vendors had implemented. A university department wanting to run X.500 needed hardware and software licenses that cost as much as a small car. The vast majority of workstations couldn’t speak OSI at all.

The data model was sound. The transport was impractical.

X.500 / DAP (1988)              LDAP v1 (1993)
──────────────────              ──────────────
Full OSI stack (7 layers)  →    TCP/IP only
Mainframe-class hardware   →    Any Unix box with a TCP stack
$50,000+ deployment cost   →    Free (reference implementation)
Vendor-specific OSI impl.  →    Standard socket API
Zero internet adoption     →    Universities deployed immediately

The Invention: LDAP at the University of Michigan

Tim Howes was at the University of Michigan in the early 1990s. The university was running X.500 for its directory — faculty, staff, student contact information, credentials. The data model was good. The protocol was the problem.

His insight, working with colleagues Wengyik Yeong and Steve Kille: strip X.500 down to what actually needs to function over a TCP/IP connection. Keep the hierarchical data model. Throw away the OSI transport. The result was the Lightweight Directory Access Protocol.

RFC 1487, published July 1993, described LDAP v1. It preserved the X.500 directory information model — the hierarchy, the object classes, the distinguished name format — and mapped it onto a protocol that could run over a simple TCP socket on port 389.

No specialized hardware. No OSI. If you had a Unix machine and TCP/IP, you could run LDAP. By 1993, that meant virtually every workstation and server in every university and most enterprises.

The University of Michigan deployed it immediately. Within two years, organizations across the internet were running the reference implementation.

LDAP v2 (RFC 1777, 1995) cleaned up the protocol. LDAP v3 (RFC 2251, 1997) is the version in production today — adding SASL authentication (which enables Kerberos integration), TLS support, referrals for federated directories, and extensible controls for server-side operations. The RFC that standardized the internet’s primary identity protocol is 27 years old and still running.


What LDAP Actually Is

LDAP is a client-server protocol for reading and writing a directory — a structured, hierarchical database optimized for reads.

Every entry in the directory has a Distinguished Name (DN) that describes its position in the hierarchy, and a set of attributes defined by its object classes. A person entry looks like this:

dn: cn=vamshi,ou=engineers,dc=linuxcent,dc=com
objectClass: inetOrgPerson
objectClass: posixAccount
cn: vamshi
uid: vamshi
uidNumber: 1001
gidNumber: 1001
homeDirectory: /home/vamshi
loginShell: /bin/bash
mail: vamshi@linuxcent.com

The DN reads right-to-left: domain linuxcent.com (dc=linuxcent,dc=com) → organizational unit engineers → common name vamshi. Every entry in the directory has a unique path through the tree — there’s no ambiguity about which vamshi you mean.

LDAP v3 defines ten operations: Bind (authenticate), Unbind, Search, Compare, Add, Delete, Modify, ModifyDN (rename), Abandon, and Extended. Most of what a Linux authentication system does with LDAP reduces to two: Bind (prove you are who you say you are) and Search (tell me everything you know about this user).
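
The write operations use the same family of client tools. A sketch of a Modify from the CLI, with a placeholder host and admin DN:

# Describe the change as LDIF, then send it as a Modify operation
cat > change-shell.ldif <<'EOF'
dn: cn=vamshi,ou=engineers,dc=linuxcent,dc=com
changetype: modify
replace: loginShell
loginShell: /bin/zsh
EOF
ldapmodify -x -H ldap://ldap.corp.com -D "cn=admin,dc=linuxcent,dc=com" -W -f change-shell.ldif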

When your Linux machine authenticates an SSH login against LDAP:

1. User types password
2. PAM calls pam_sss (or pam_ldap on older systems)
3. SSSD issues a Bind to the LDAP server: "I am cn=vamshi, and here is my credential"
4. LDAP server verifies the bind → success or failure
5. SSSD issues a Search: "give me the posixAccount attributes for uid=vamshi"
6. LDAP returns uidNumber, gidNumber, homeDirectory, loginShell
7. PAM creates the session with those attributes

The entire login flow is two LDAP operations: one Bind, one Search.
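
You can replay the same two operations by hand; ldapwhoami performs exactly the Bind step. A sketch with a placeholder server:

# Step 3: the Bind (prompts for the password)
ldapwhoami -x -H ldap://ldap.corp.com -D "cn=vamshi,ou=engineers,dc=linuxcent,dc=com" -W

# Step 5: the Search, fetching the POSIX attributes PAM needs
ldapsearch -x -H ldap://ldap.corp.com -b "dc=linuxcent,dc=com" \
  "(uid=vamshi)" uidNumber gidNumber homeDirectory loginShell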


Try It Right Now

You don’t need to set up an LDAP server to run your first query. There’s a public test LDAP directory at ldap.forumsys.com:

# Query a public LDAP server — no setup required
ldapsearch -x \
  -H ldap://ldap.forumsys.com \
  -b "dc=example,dc=com" \
  -D "cn=read-only-admin,dc=example,dc=com" \
  -w password \
  "(objectClass=inetOrgPerson)" \
  cn mail uid

# What you get back (abbreviated):
# dn: uid=tesla,dc=example,dc=com
# cn: Tesla
# mail: tesla@ldap.forumsys.com
# uid: tesla
#
# dn: uid=einstein,dc=example,dc=com
# cn: Albert Einstein
# mail: einstein@ldap.forumsys.com
# uid: einstein

Decode what you just ran:

  • -x — simple authentication (username/password bind, not Kerberos/SASL)
  • -H ldap://ldap.forumsys.com — the LDAP server URI, port 389
  • -b "dc=example,dc=com" — the base DN, the top of the subtree to search
  • -D "cn=read-only-admin,dc=example,dc=com" — the bind DN (who you’re authenticating as)
  • -w password — the bind password
  • "(objectClass=inetOrgPerson)" — the search filter: return entries that are people
  • cn mail uid — the attributes to return (default returns all)

That’s a live LDAP query returning real directory entries from a server running RFC 2251 — the same protocol Tim Howes designed in 1993.

On your own Linux system, if you’re joined to AD or LDAP, you can query it the same way with your domain credentials.


Why It Never Went Away

LDAP v3 was finalized in 1997. In 2024, it’s still the protocol every enterprise directory speaks. Why?

Because it became the lingua franca of enterprise identity before any replacement existed. Every application that needs to authenticate users — VPN concentrators, mail servers, network switches, web applications, HR systems — implemented LDAP support. Every directory service Microsoft, Red Hat, Sun, and Novell shipped stored data in an LDAP-accessible tree.

When Microsoft built Active Directory in 1999, they built it on top of LDAP + Kerberos. When your Linux machine joins an AD domain, it speaks LDAP to enumerate users and groups, and Kerberos to verify credentials. When Okta or Entra ID syncs with your on-premises directory, it uses LDAP Sync (or a modern protocol that maps directly to LDAP semantics).

The protocol is old. The ecosystem built on top of it is so deep that replacing LDAP would mean simultaneously replacing every enterprise application that depends on it. Nobody has done that. Nobody has had to.

What happened instead is the stack got taller. LDAP at the bottom, Kerberos for network authentication, SSSD as the local caching daemon, PAM as the Linux integration layer, SAML and OIDC at the top for web-based federation. The directory is still LDAP. The interfaces above it evolved.

That full stack — from the directory at the bottom to Zero Trust at the top — is what this series covers.


⚠ Common Misconceptions

“LDAP is an authentication protocol.” LDAP is a directory protocol. It stores identity information and can verify credentials (via Bind). Authentication in modern stacks is typically Kerberos or OIDC — LDAP provides the directory backing it.

“LDAP is obsolete.” LDAP is the storage layer for Active Directory, OpenLDAP, 389-DS, FreeIPA, and every enterprise IdP’s on-premises sync. It is ubiquitous. What’s changed is the interface layer above it.

“You need Active Directory to run LDAP.” Active Directory uses LDAP. OpenLDAP, 389-DS, FreeIPA, and Apache Directory Server are all standalone LDAP implementations. You can run a directory without Microsoft.

“LDAP and LDAPS are different protocols.” LDAP is the protocol. LDAPS is LDAP over TLS on port 636. StartTLS is LDAP on port 389 with an in-session upgrade to TLS. Same protocol, different transport security.


Framework Alignment

CISSP Domain 5: Identity and Access Management
  LDAP is the foundational directory protocol for centralized identity stores — the base layer of every enterprise IAM stack

CISSP Domain 4: Communications and Network Security
  Port 389 (LDAP), 636 (LDAPS), 3268/3269 (AD Global Catalog) — transport security decisions affect every directory deployment

CISSP Domain 3: Security Architecture and Engineering
  DIT hierarchy, schema design, replication topology — directory structure is an architectural security decision

NIST SP 800-63B
  LDAP as a credential service provider (CSP) backing enterprise authenticators

Key Takeaways

  • LDAP was invented to solve a real, painful problem: the authentication chaos that NIS couldn’t fix and X.500/DAP was too expensive to deploy
  • It inherited the right thing from X.500 (the hierarchical data model) and replaced the right thing (the impractical OSI transport with TCP/IP)
  • NIS was the predecessor that worked until it didn’t — its failure modes (no encryption, flat namespace, broadcast discovery) are exactly what LDAP was designed to fix
  • LDAP v3 (RFC 2251, 1997) is still the production standard — 27 years later
  • Active Directory, OpenLDAP, FreeIPA, Okta, Entra ID — every enterprise identity system either runs LDAP or speaks it
  • The full authentication stack is deeper than LDAP: the next 12 episodes peel it apart layer by layer

What’s Next

EP01 stayed at the design level — the problem, the predecessor failures, the invention, the data model.

EP02 goes inside the wire. The DIT structure, DN syntax, object classes, schema, and the BER-encoded bytes that actually travel from the server to your authentication daemon. Run ldapsearch against your own directory and read every line of what comes back.

Next: LDAP Internals: The Directory Tree, Schema, and What Travels on the Wire

Get EP02 in your inbox when it publishes → linuxcent.com/subscribe

XDP — Packets Processed Before the Kernel Knows They Arrived

Reading Time: 10 minutes

eBPF: From Kernel to Cloud, Episode 7
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP


TL;DR

  • XDP fires before sk_buff allocation — the earliest possible kernel hook for packet processing
    (sk_buff = the kernel’s socket buffer — every normal packet requires one to be allocated, which adds up fast at scale)
  • Three modes: native (in-driver, full performance), generic (fallback, no perf gain), offloaded (NIC ASIC)
  • XDP context is raw packet bytes — no socket, no cgroup, no pod identity; handle non-IP traffic explicitly
  • Every pointer dereference requires a bounds check against data_end — the verifier enforces this
  • BPF_MAP_TYPE_LPM_TRIE is the right map type for IP prefix blocklists — handles /32 hosts and CIDRs together
  • XDP metadata area enables coordination with TC programs — classify at XDP speed, enforce with pod context at TC

XDP eBPF fires before sk_buff allocation — the earliest possible kernel hook, and the reason iptables rules can be technically correct while still burning CPU at high packet rates. I had iptables DROP rules installed and working during a SYN flood. Packets were being dropped. CPU was still burning at 28% software interrupt time. The rules weren’t wrong. The hook was in the wrong place.


A client’s cluster was under a SYN flood — roughly 1 million packets per second from a rotating set of source IPs. We had iptables DROP rules installed within the first ten minutes, blocklist updated every 30 seconds as new source ranges appeared. The flood traffic dropped in volume. But node CPU stayed high. The %si column in top — software interrupt time — was sitting at 25–30%.

%si in top is the percentage of CPU time spent handling hardware interrupts and kernel-level packet processing — separate from your application’s CPU usage. On a quiet managed cluster (EKS, GKE) this is usually under 1%. Under a packet flood, high %si means the kernel is burning cycles just receiving packets, before your workloads run at all. It’s the metric that tells you the problem is below the application layer.
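
Two quick ways to watch that number directly on a node (mpstat assumes the sysstat package):

# Per-CPU software-interrupt time: the %soft column, refreshed every second
mpstat -P ALL 1

# Raw receive-softirq counters; NET_RX climbing fast means packet-processing load
watch -n1 'grep NET_RX /proc/softirqs'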

I didn’t understand why. The iptables rules were matching. Packets were being dropped. Why was the CPU still burning?

The answer is where in the kernel the drop was happening. iptables fires inside the netfilter framework — after the kernel has already allocated an sk_buff for the packet, done DMA from the NIC ring buffer, and traversed several netfilter hooks.

netfilter is the Linux kernel subsystem that handles packet filtering, NAT, and connection tracking. iptables is the userspace CLI that writes rules into it. At high packet rates, the cost isn’t the rule match — it’s the kernel work that happens before the rule is evaluated. At 1 million packets per second, the allocation cost alone is measurable. The attack was “slow” in network terms, but fast enough to keep the kernel memory allocator and netfilter traversal continuously busy.

XDP fires before any of that. Before sk_buff. Before routing. Before the kernel network stack has touched the packet at all. A DROP decision at the XDP layer costs one bounds check and a return value. Nothing else.

Quick Check: Is XDP Running on Your Cluster?

Before the data path walkthrough — a two-command check you can run right now on any cluster node:

# SSH into a worker node, then:
bpftool net list

On a Cilium-managed node, you’ll see something like:

eth0 (index 2):
        xdpdrv  id 44

lxc8a3f21b (index 7):
        tc ingress id 47
        tc egress  id 48

Reading the output:
xdpdrv — XDP in native mode, running in the NIC driver before sk_buff (this is what you want)
xdpgeneric instead of xdpdrv — generic mode, runs after sk_buff allocation, no performance benefit
No XDP line at all — XDP not deployed; your CNI uses iptables for service forwarding

If you’re on EKS with aws-vpc-cni or GKE with kubenet, you likely won’t see XDP here — those CNIs use iptables. Understanding this section explains why teams migrating to Cilium see lower node CPU under the same traffic load.

Where XDP Sits in the Kernel Data Path

The standard Linux packet receive path:

NIC hardware
  ↓
DMA to ring buffer (kernel memory)
  ↓
[XDP hook — fires here, before sk_buff]
  ├── XDP_DROP   → discard, zero further allocation
  ├── XDP_PASS   → continue to kernel network stack
  ├── XDP_TX     → transmit back out the same interface
  └── XDP_REDIRECT → forward to another interface or CPU
  ↓
sk_buff allocated from slab allocator
  ↓
netfilter: PREROUTING
  ↓
IP routing decision
  ↓
netfilter: INPUT or FORWARD
  ↓  [iptables fires somewhere in here]
socket receive queue
  ↓
userspace application

XDP runs at the driver level, in the NAPI poll context — the same context where the driver is processing received packets off the ring buffer. The program runs before the kernel’s general networking code gets involved. There’s no sk_buff, no reference counting, no slab allocation.

NAPI (New API) is how modern Linux receives packets efficiently. Instead of one CPU interrupt per packet (catastrophically expensive at 1Mpps), the NIC fires a single interrupt, then the kernel polls the NIC ring buffer in batches until it’s drained. XDP runs inside this polling loop — as close to the hardware as software gets without running on the NIC itself.

At 1Mpps, the difference between XDP_DROP and an iptables DROP is roughly the cost of allocating and then immediately freeing 1 million sk_buff objects per second — plus netfilter traversal, connection tracking lookup, and the DROP action itself. That’s the CPU time that was burning.

After moving the blocklist to an XDP program, the %si on the same traffic load dropped from 28% to 3%.
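
You don’t have to write the XDP program yourself to reproduce this. The xdp-tools project ships a prebuilt packet filter; a sketch, assuming the xdp-filter utility is installed and the NIC driver supports native XDP:

# Load the prebuilt XDP filter, then drop a source address before sk_buff allocation
xdp-filter load eth0
xdp-filter ip 203.0.113.7 -m src
xdp-filter status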

XDP Modes

XDP operates in three modes, and which one you get depends on your NIC driver.

Native XDP (XDP_FLAGS_DRV_MODE)

The eBPF program runs directly in the NIC driver’s NAPI poll function — in interrupt context, before sk_buff. This is the only mode that delivers the full performance benefit.

Driver support is required. The widely supported drivers: mlx4, mlx5 (Mellanox/NVIDIA), i40e, ice (Intel), bnxt_en (Broadcom), virtio_net (KVM/QEMU), veth (containers). Check support:

# Verify native XDP support on your driver
ethtool -i eth0 | grep driver
# driver: mlx5_core  ← supports native XDP

# Load in native mode
ip link set dev eth0 xdpdrv obj blocklist.bpf.o sec xdp

The veth driver supporting native XDP is what makes XDP meaningful inside Kubernetes pods — each pod’s veth interface can run an XDP program at wire speed.

Generic XDP (XDP_FLAGS_SKB_MODE)

Fallback for drivers that don’t support native XDP. The program still runs, but it runs after sk_buff allocation, as a hook in the netif_receive_skb path. No performance benefit over early netfilter. sk_buff is still allocated and freed for every packet.

# Generic mode — development and testing only
ip link set dev eth0 xdpgeneric obj blocklist.bpf.o sec xdp

Use this for development on a laptop with a NIC that lacks native XDP support. Never benchmark with it and never use it in production expecting performance gains.

Offloaded XDP

Runs on the NIC’s own processing unit (SmartNIC). Zero CPU involvement — the XDP decision happens in NIC hardware. Supported by Netronome Agilio NICs. Rare in production, but the theoretical ceiling for XDP performance.
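
If you do have such hardware, loading in offload mode uses the same iproute2 syntax as the other modes (interface and object file names are placeholders):

# Offloaded mode — requires a SmartNIC with eBPF offload support
ip link set dev eth0 xdpoffload obj blocklist.bpf.o sec xdp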

The XDP Context: What Your Program Can See

XDP programs receive one argument: struct xdp_md.

struct xdp_md {
    __u32 data;           // offset of first packet byte in the ring buffer page
    __u32 data_end;       // offset past the last byte
    __u32 data_meta;      // metadata area before data (XDP metadata for TC cooperation)
    __u32 ingress_ifindex;
    __u32 rx_queue_index;
};

data and data_end are used as follows:

void *data     = (void *)(long)ctx->data;
void *data_end = (void *)(long)ctx->data_end;

// Every pointer dereference must be bounds-checked first
struct ethhdr *eth = data;
if ((void *)(eth + 1) > data_end)
    return XDP_PASS;  // malformed or truncated packet

The verifier enforces these bounds checks — every pointer derived from ctx->data must be validated before use. The error invalid mem access 'inv' means you dereferenced a pointer without checking the bounds. This is the most common cause of XDP program rejection.
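
To make the discipline concrete, here is a minimal, self-contained XDP program in the same style — a sketch, not production code; the hard-coded address and program names are placeholders:

// Minimal XDP sketch: drop IPv4 packets from one hard-coded source, pass everything else
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int xdp_blocklist(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;                      // truncated frame

    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;                      // ARP, IPv6, VLAN — let the stack handle it

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;                      // bounds check before reading ip->saddr

    if (ip->saddr == bpf_htonl(0xC0A8012D))   // 192.168.1.45 — placeholder address
        return XDP_DROP;                      // dropped before any sk_buff exists

    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";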

For operators (not writing XDP code): you’ll see invalid mem access 'inv' in logs when an eBPF program is rejected at load time — most commonly during a Cilium upgrade or when deploying a custom tool on a kernel it wasn’t built for. The fix lives in the eBPF source or the tool version, not in cluster config: upgrade the tool or check its supported kernel matrix.

What XDP cannot see:
– Socket state — no socket buffer exists yet
– Cgroup hierarchy — no pod identity
– Process information — no PID, no container
– Connection tracking state (unless you maintain it yourself in a map)

XDP is ingress-only. It fires on packets arriving at an interface, not departing. For egress, TC is the hook.

What This Means on Your Cluster Right Now

Every Cilium-managed node has XDP programs running. Here’s how to see them:

# All XDP programs on all interfaces — this is the full picture
bpftool net list
# Sample output on a Cilium node:
#
# eth0 (index 2):
#         xdpdrv  id 44         ← XDP in native mode on the node uplink
#
# lxc8a3f21b (index 7):
#         tc ingress id 47      ← TC enforces NetworkPolicy on pod ingress
#         tc egress  id 48      ← TC enforces NetworkPolicy on pod egress
#
# "xdpdrv"     = native mode (runs in NIC driver, before sk_buff — full performance)
# "xdpgeneric" = fallback mode (after sk_buff — no performance benefit over iptables)

# Which mode is active?
ip link show eth0 | grep xdp
# xdp mode drv  ← native (full performance)
# xdp mode generic  ← fallback (no perf benefit)

# Details on the XDP program ID
bpftool prog show id $(bpftool net show dev eth0 | grep xdp | awk '{print $NF}')
# Shows: loaded_at, tag, xlated bytes, jited bytes, map IDs

The map IDs in that output are the BPF maps the XDP program is using — typically the service VIP table for DNAT, and in security tools, the blocklist or allowlist. To see what’s in them:

# List maps used by the XDP program
bpftool prog show id <PROG_ID> | grep map_ids

# Dump the service map (for a Cilium node — this is the load balancer table)
bpftool map dump id <MAP_ID> | head -40

For a blocklist scenario — like the SYN flood mitigation above — the BPF_MAP_TYPE_LPM_TRIE is the standard data structure. A lookup for 192.168.1.45 hits a 192.168.1.0/24 entry in the same map, handling both host /32s and CIDR ranges in one lookup. The practical operational check:

# Count entries in an XDP filter map
bpftool map dump id <BLOCKLIST_MAP_ID> | grep -c "key"

# Verify XDP is active and inspect program details
bpftool net show dev eth0
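
For orientation, the map definition behind such a blocklist looks roughly like this in eBPF C — a sketch with illustrative names; note the required BPF_F_NO_PREALLOC flag and the prefixlen-first key layout:

// Sketch of an LPM trie keyed by IPv4 prefix — names are illustrative
struct lpm_v4_key {
    __u32 prefixlen;   // must come first: number of significant bits (32 for a host)
    __u32 addr;        // IPv4 address, network byte order
};

struct {
    __uint(type, BPF_MAP_TYPE_LPM_TRIE);
    __uint(max_entries, 65536);
    __type(key, struct lpm_v4_key);
    __type(value, __u8);                  // 1 = blocked
    __uint(map_flags, BPF_F_NO_PREALLOC); // the kernel requires this flag for LPM tries
} blocklist SEC(".maps");

// In the XDP program (ip assumed parsed and bounds-checked as shown earlier):
// one lookup covers both /32 hosts and broader CIDR entries
struct lpm_v4_key key = { .prefixlen = 32, .addr = ip->saddr };
if (bpf_map_lookup_elem(&blocklist, &key))
    return XDP_DROP;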

XDP Metadata: Cooperating with TC

Think of it as a sticky note attached to the packet. XDP writes the note at line speed (no context about pods or sockets). TC reads it later when full context is available, and acts on it. The packet carries the note between them.

More precisely: XDP can write metadata into the area before ctx->data — a small scratch space that survives as the packet moves from XDP to the TC hook. This is the coordination mechanism between the two eBPF layers.

The pattern is: XDP classifies at speed (no sk_buff overhead), TC enforces with pod context (where you have socket identity). XDP writes a classification tag into the metadata area. TC reads it and makes the policy decision.
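
A minimal sketch of both sides, assuming a 4-byte tag (program names and tag values are illustrative):

// Sketch of the XDP→TC metadata handoff
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_classifier(struct xdp_md *ctx)
{
    // Grow the metadata area by 4 bytes (negative delta = grow)
    if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(__u32)))
        return XDP_PASS;

    void *data      = (void *)(long)ctx->data;
    void *data_meta = (void *)(long)ctx->data_meta;
    __u32 *tag = data_meta;

    if ((void *)(tag + 1) > data)   // verifier-required bounds check: meta ends before data
        return XDP_PASS;

    *tag = 1;                       // classification result — "suspicious" (placeholder)
    return XDP_PASS;                // hand the packet, plus the note, up to TC
}

SEC("tc")
int tc_enforcer(struct __sk_buff *skb)
{
    void *data_meta = (void *)(long)skb->data_meta;
    void *data      = (void *)(long)skb->data;
    __u32 *tag = data_meta;

    if ((void *)(tag + 1) > data)   // no metadata written — nothing to enforce
        return TC_ACT_OK;

    if (*tag == 1)
        return TC_ACT_SHOT;         // drop, now with pod/socket context available
    return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";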

This is the architecture behind tools like Pro-NDS: the fast-path pattern matching (connection tracking, signature matching) happens at XDP before any kernel allocation. The enforcement action — which requires knowing which pod sent this — happens at TC using the metadata XDP already wrote.

From an operational standpoint, when you see two eBPF programs on the same interface (one XDP, one TC), this pipeline is the likely explanation. The bpftool net list output shows both:

bpftool net list
# xdpdrv id 44 on eth0       ← XDP classifier running at line rate
# tc ingress id 47 on eth0   ← TC enforcer reading XDP metadata

How Cilium Uses XDP

Not running Cilium? On EKS with aws-vpc-cni or GKE with kubenet, service forwarding uses iptables NAT rules and conntrack instead. You can see this with iptables -t nat -L -n on a node — look for the KUBE-SVC-* chains. Those chains are what XDP replaces in a Cilium cluster. This is why teams migrating from kube-proxy to Cilium report lower node CPU at high connection rates — it’s not magic, it’s hook placement.

On a Cilium node, XDP handles the load balancing path for ClusterIP services. When a packet arrives at the node destined for a ClusterIP:

  1. XDP program checks the destination IP against a BPF LRU hash map of known service VIPs
  2. On a match, it performs DNAT — rewriting the destination IP to a backend pod IP
  3. Returns XDP_TX or XDP_REDIRECT to forward directly

No iptables NAT rules. No conntrack state machine. No socket buffer allocation for the routing decision. The lookup is O(1) in a BPF hash map.
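
The shape of that fast path, reduced to a sketch — Cilium’s real service and backend maps carry far more state, and all names and layouts here are illustrative:

// Sketch only — not Cilium's actual map definitions
struct vip_key { __be32 daddr; __be16 dport; __u8 proto; __u8 pad; };
struct backend { __be32 addr;  __be16 port;  __u16 pad; };

struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 65536);
    __type(key, struct vip_key);
    __type(value, struct backend);
} services SEC(".maps");

// Called from the XDP program after the IPv4/L4 headers are parsed and
// bounds-checked (ip and dport are assumed to come from that parsing step)
static __always_inline int service_dnat(struct iphdr *ip, __be16 dport)
{
    struct vip_key key = { .daddr = ip->daddr, .dport = dport, .proto = ip->protocol };
    struct backend *be = bpf_map_lookup_elem(&services, &key);

    if (!be)
        return 0;              // not a service VIP — fall through to XDP_PASS

    ip->daddr = be->addr;      // DNAT: rewrite destination to the backend pod
    // ...recompute the IP and L4 checksums, then XDP_TX / XDP_REDIRECT...
    return 1;
}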

# See Cilium's XDP program on the node uplink
ip link show eth0 | grep xdp
# xdp  (attached, native mode)

# The XDP program details
bpftool prog show pinned /sys/fs/bpf/cilium/xdp

# Load time, instruction count, JIT-compiled size
bpftool prog show id $(bpftool net list | grep xdp | awk '{print $NF}')

At production scale — 500+ nodes, 50k+ services — removing iptables from the service forwarding path with XDP reduces per-node CPU utilization measurably. The effect is most visible on nodes handling high connection rates to cluster services.

Operational Inspection

# All XDP programs on all interfaces
bpftool net list

# Check XDP mode (native, generic, offloaded)
ip link show | grep xdp

# Per-interface drop counter — note: XDP drop counters are driver-specific;
# many drivers expose them via ethtool -S rather than rx_dropped
cat /sys/class/net/eth0/statistics/rx_dropped

# XDP drop counters exposed via bpftool
bpftool map dump id <stats_map_id>

# Verify XDP is active and show program details
bpftool net show dev eth0

Common Mistakes

Mistake | Impact | Fix
Missing bounds check before pointer dereference | Verifier rejects: “invalid mem access” | Always check (void *)(ptr + 1) > data_end before use
Using generic XDP for performance testing | Misleading numbers — sk_buff still allocated | Test in native mode only; check ip link output for mode
Not handling non-IP traffic (ARP, IPv6, VLAN) | ARP breaks, IPv6 drops, VLAN-tagged frames dropped | Check eth->h_proto and return XDP_PASS for non-IP
XDP for egress or pod identity | No socket context at XDP; XDP is ingress only | Use TC egress for pod-identity-aware egress policy
Forgetting BPF_F_NO_PREALLOC on LPM trie | Map creation fails — the kernel requires this flag for LPM tries | Always set map_flags = BPF_F_NO_PREALLOC when defining the map
Blocking ARP by accident in a /24 blocklist | Loss of layer-2 reachability within the blocked subnet | Separate ARP handling before the IP blocklist check

Key Takeaways

  • XDP fires before sk_buff allocation — the earliest possible kernel hook for packet processing
  • Three modes: native (in-driver, full performance), generic (fallback, no perf gain), offloaded (NIC ASIC)
  • XDP context is raw packet bytes — no socket, no cgroup, no pod identity; handle non-IP traffic explicitly
  • Every pointer dereference requires a bounds check against data_end — the verifier enforces this
  • BPF_MAP_TYPE_LPM_TRIE is the right map for IP prefix blocklists — handles /32 hosts and CIDRs together
  • XDP metadata area enables coordination with TC programs — classify at XDP speed, enforce with pod context at TC

What’s Next

XDP handles ingress at the fastest possible point but has no visibility into which pod sent a packet. EP08 covers TC eBPF — the hook that fires after sk_buff allocation, where socket and cgroup context exist.

TC is how Cilium implements pod-to-pod network policy without iptables. It’s also where stale programs from failed Cilium upgrades leave ghost filters that cause intermittent packet drops. Knowing how TC programs chain — and how to find and remove stale ones — is a specific, concrete operational skill.

Next: TC eBPF — pod-level network policy without iptables

Get EP08 in your inbox when it publishes → linuxcent.com/subscribe

Zero Trust Access in the Cloud: How the Evaluation Loop Actually Works

Reading Time: 10 minutes


What Is Cloud IAM · Authentication vs Authorization · IAM Roles vs Policies · AWS IAM Deep Dive · GCP Resource Hierarchy IAM · Azure RBAC Scopes · OIDC Workload Identity · AWS IAM Privilege Escalation · AWS Least Privilege Audit · SAML vs OIDC Federation · Kubernetes RBAC and AWS IAM · Zero Trust Access in the Cloud


TL;DR

  • Zero Trust: trust nothing implicitly, verify everything explicitly, minimize blast radius by assuming you will be breached
  • Network location is not identity — VPN is authentication for the tunnel, not authorization for the resource
  • JIT privilege elevation removes standing admin access: engineers request elevation for a specific purpose, scoped to a specific duration
  • Device posture is an access signal — a compromised endpoint with valid credentials is still a threat; Conditional Access gates on device compliance
  • Continuous session validation re-evaluates signals throughout the session — device falls out of compliance, sessions revoke in minutes, not at expiry
  • The highest-ROI early moves: eliminate machine static credentials, enforce MFA on all human access, federate to a single IdP

The Big Picture

  ZERO TRUST IAM — EVERY REQUEST EVALUATED INDEPENDENTLY

  API call arrives
         │
         ▼
  Identity verified? ──── No ────► DENY
         │
        Yes
         │
         ▼
  Device compliant? ───── No ────► DENY (or step-up MFA)
         │
        Yes
         │
         ▼
  Policy allows this  ─── No ────► DENY
  action on this ARN?
         │
        Yes
         │
         ▼
  Conditions met? ─────── No ────► DENY
  (time, IP, MFA age,              (e.g., outside business hours,
   risk score, session)             impossible travel detected)
         │
        Yes
         │
         ▼
       ALLOW ──────────────────────► LOG every decision (allow and deny)
         │
         └── Continuous re-evaluation:
             device state changes → revoke
             anomaly detected → revoke or step-up
             credential age → require re-auth

Introduction

The perimeter model of network security made a bet: inside the network is trusted, outside is not. Lock down the perimeter tightly enough and you’re safe. VPN in, and you’re one of us.

I grew up professionally in that model. Firewalls, DMZs, trusted zones. The idea had intuitive appeal — you build walls, you control what crosses them. For a while it worked reasonably well.

Then I watched it fail, repeatedly, in ways that were predictable in hindsight. An engineer’s laptop gets compromised at a coffee shop. They VPN in. Now the attacker is “inside.” A contractor account gets phished. They have valid Active Directory credentials. They’re inside. A cloud service gets misconfigured and exposes a management interface. There’s no perimeter for that to be inside of.

The perimeter model failed not because the walls weren’t strong enough, but because the premise was wrong. There is no inside. There is no perimeter that reliably separates trusted from untrusted. In a world of remote work, cloud services, contractor access, and API integrations, the attack surface doesn’t respect network boundaries.

Zero Trust is the architecture built on a different premise: trust nothing implicitly. Verify everything explicitly. Minimize blast radius by assuming you will be breached.

This isn’t a product you buy. It’s a set of principles applied to how you design, build, and operate your IAM. This episode is how those principles translate to concrete practices — building on everything we’ve covered in this series.


The Three Principles

Verify Explicitly

Every request must carry verifiable identity and context. Network location is not identity.

Old model: request from 10.0.0.0/8 → trusted, proceed
Zero Trust: request from 10.0.0.0/8 → still must present verifiable identity
                                       still must pass authorization check
                                       still must pass context evaluation
                                       then proceed (or deny)

In cloud IAM terms: every API call carries identity claims (IAM role ARN, federated identity, managed identity), and those claims are verified against policy on every single request. There’s no concept of “once authenticated, trusted until logout.” In cloud IAM, this already exists natively. Every API call is authenticated and authorized independently. The challenge is extending this model to internal services, internal APIs, and human access patterns.

Implementation in practice:
– mTLS for service-to-service communication — both sides present certificates; identity is the certificate, not the network path
– Bearer tokens on every internal API call — no session cookies, no “we’re on the same VPC so it’s fine”
– Short-lived credentials everywhere — a compromised credential expires, not “after the session times out in 8 hours”
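
As one concrete instance of the short-lived-credentials point — a sketch using AWS STS with a placeholder role ARN; 900 seconds is the shortest session STS allows:

# Credentials from this call self-expire in 15 minutes
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/DeployRole \
  --role-session-name ci-deploy \
  --duration-seconds 900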

Use Least Privilege — Just-in-Time, Just-Enough

No standing access to sensitive resources. Access granted when needed, for the minimum scope, for the minimum duration.

Old model: alice is in the DBA group → permanent access to all databases
Zero Trust: alice requests access to production DB →
            verified: alice's device is enrolled in MDM and compliant
            verified: alice has an open change ticket for this task
            verified: current time is within business hours
            granted: connection to this specific database, from alice's specific IP
                     for 2 hours, then revoked automatically

This is JIT access. It reduces the window where a compromised credential can cause damage. It requires a change in how engineers think about access: access is not a property you have, it’s something you request when you need it. The operational friction is a feature, not a bug. Justifying each elevated access request is what keeps the access model honest.

Assume Breach

Design systems as if the attacker is already inside. This drives different decisions:

  • Micro-segmentation: one role per service, minimum permissions per role. If one service is compromised, it can’t pivot to everything else.
  • Log everything: every authorization decision, allow or deny. When you’re investigating an incident, you need to know what happened, not just that something happened.
  • Automate response: anomalous API call pattern → trigger automated credential revocation or session termination. Don’t wait for a human to notice.
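
A minimal sketch of that wiring on AWS — an EventBridge rule matching GuardDuty findings, routed to a hypothetical responder Lambda (rule and function names are placeholders):

# Route every GuardDuty finding to an automated responder
aws events put-rule \
  --name guardduty-to-responder \
  --event-pattern '{"source":["aws.guardduty"],"detail-type":["GuardDuty Finding"]}'

aws events put-targets \
  --rule guardduty-to-responder \
  --targets 'Id=1,Arn=arn:aws:lambda:us-east-1:123456789012:function:revoke-credentials'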

Building Zero Trust IAM — Block by Block

Block 1: Strong Identity Foundation

You can’t verify explicitly without strong authentication. The starting point:

# AWS: require MFA for all IAM operations — enforce via SCP across the org
{
  "Effect": "Deny",
  "Action": "*",
  "Resource": "*",
  "Condition": {
    "BoolIfExists": {
      "aws:MultiFactorAuthPresent": "false"
    },
    "StringNotLike": {
      "aws:PrincipalArn": [
        "arn:aws:iam::*:role/AWSServiceRole*",
        "arn:aws:iam::*:role/OrganizationAccountAccessRole"
      ]
    }
  }
}
# GCP: enforce OS Login for VM SSH (ties SSH access to Google identity, not SSH keys)
gcloud compute project-info add-metadata \
  --metadata enable-oslogin=TRUE

# This means: SSH to a VM requires your Google identity to have roles/compute.osLogin
# or roles/compute.osAdminLogin. No more managing ~/.authorized_keys files on instances.

For human access: hardware FIDO2 keys (YubiKey, Google Titan) rather than TOTP where possible. TOTP codes can be phished in real-time adversary-in-the-middle attacks. Hardware keys cannot — the cryptographic challenge-response is bound to the origin URL.

Block 2: Device Posture as an Access Signal

In a Zero Trust model, the identity of the user is necessary but not sufficient. The state of the device matters too — a compromised endpoint with valid credentials is still a threat.

# Azure Conditional Access: block access from non-compliant devices
# (configured in the Entra ID Conditional Access portal — shown here as pseudo-YAML)
conditions:
  clientAppTypes: [browser, mobileAppsAndDesktopClients]
  devices:
    deviceFilter:
      mode: exclude
      rule: "device.isCompliant -eq True and device.trustType -eq 'AzureAD'"
grantControls:
  builtInControls: [compliantDevice]
# AWS Verified Access: identity + device posture for application access — no VPN
aws ec2 create-verified-access-instance \
  --description "Zero Trust app access"

# Attach identity trust provider (Okta OIDC)
aws ec2 create-verified-access-trust-provider \
  --trust-provider-type user \
  --user-trust-provider-type oidc \
  --oidc-options IssuerURL=https://company.okta.com,ClientId=...,ClientSecret=...,Scope=openid

# Attach device trust provider (Jamf, Intune, or CrowdStrike)
aws ec2 create-verified-access-trust-provider \
  --trust-provider-type device \
  --device-trust-provider-type jamf \
  --device-options TenantId=JAMF_TENANT_ID

AWS Verified Access allows users to reach internal applications by verifying both their identity (via OIDC) and their device health (via MDM) — without a VPN. The access gateway evaluates both signals on every connection, not just at login.

Block 3: Just-in-Time Privilege Elevation

No standing elevated access. Engineers are eligible for elevated roles; they activate them when needed.

# Azure PIM: engineer activates an eligible privileged role
az rest --method POST \
  --uri "https://graph.microsoft.com/v1.0/roleManagement/directory/roleAssignmentScheduleRequests" \
  --body '{
    "action": "selfActivate",
    "principalId": "USER_OBJECT_ID",
    "roleDefinitionId": "ROLE_DEF_ID",
    "directoryScopeId": "/",
    "justification": "Investigating security alert in tenant — incident ticket INC-2026-0411",
    "scheduleInfo": {
      "startDateTime": "2026-04-11T09:00:00Z",
      "expiration": {"type": "AfterDuration", "duration": "PT4H"}
    }
  }'
# Access activates, lasts 4 hours, then automatically removed
# AWS: temporary account assignment via Identity Center
# (typically triggered by ITSM workflow integration, not manual CLI)
aws sso-admin create-account-assignment \
  --instance-arn "arn:aws:sso:::instance/ssoins-xxx" \
  --target-id ACCOUNT_ID \
  --target-type AWS_ACCOUNT \
  --permission-set-arn "arn:aws:sso:::permissionSet/ssoins-xxx/ps-yyy" \
  --principal-type USER \
  --principal-id USER_ID

# Schedule deletion (using EventBridge + Lambda in a real deployment)
aws sso-admin delete-account-assignment \
  --instance-arn "arn:aws:sso:::instance/ssoins-xxx" \
  --target-id ACCOUNT_ID \
  --target-type AWS_ACCOUNT \
  --permission-set-arn "arn:aws:sso:::permissionSet/ssoins-xxx/ps-yyy" \
  --principal-type USER \
  --principal-id USER_ID

The operational change this requires: engineers stop thinking of access as something they hold permanently and start thinking of it as something they request for a specific purpose.

This feels like friction until you’re investigating an incident and you have a precise record of who activated what elevated access and why.

Block 4: Continuous Session Validation

Traditional auth: verify once at login, trust the session until timeout.
Zero Trust auth: re-evaluate access signals continuously throughout the session.

Session starts: identity verified + device compliant + IP in expected range
                → access granted

15 minutes later: impossible travel detected (IP changes to different country)
                  → step-up authentication required, or session terminated

Later: device compliance state changes (EDR detects malware)
       → all active sessions for this device revoked immediately

This requires integration between your identity platform and your device management / EDR tooling. Entra ID Conditional Access with Continuous Access Evaluation (CAE) implements this natively. When certain events occur — device compliance change, IP anomaly, token revocation — access tokens are invalidated within minutes rather than waiting for natural expiry.

# GCP: bind IAM access to an Access Context Manager access level
# Access level enforces device compliance — if the device falls out of compliance,
# the access level is no longer satisfied and requests fail immediately
gcloud projects add-iam-policy-binding my-project \
  --member="user:[email protected]" \
  --role="roles/bigquery.admin" \
  --condition="expression=request.auth.access_levels.exists(x, x == 'accessPolicies/POLICY_NUM/accessLevels/corporate_compliant_device'),title=Compliant device required"

Block 5: Micro-Segmented Permissions

Every service has its own identity. Every identity has only what it needs. Compromise of one service cannot propagate to others.

# Terraform: IAM as code — each service gets a dedicated, scoped role
resource "aws_iam_role" "order_processor" {
  name                 = "svc-order-processor"
  permissions_boundary = aws_iam_policy.service_boundary.arn

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "order_processor" {
  name   = "order-processor-policy"
  role   = aws_iam_role.order_processor.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["sqs:ReceiveMessage", "sqs:DeleteMessage", "sqs:GetQueueAttributes"]
        Resource = aws_sqs_queue.orders.arn
      },
      {
        Effect   = "Allow"
        Action   = ["dynamodb:PutItem", "dynamodb:GetItem", "dynamodb:UpdateItem"]
        Resource = aws_dynamodb_table.orders.arn
      }
    ]
  })
}
# Open Policy Agent: enforce IAM standards at the policy level
# Run this in CI/CD — fail the build if any policy statement has wildcard actions
package iam.policy

deny[msg] {
  input.Statement[i].Effect == "Allow"
  input.Statement[i].Action == "*"
  msg := sprintf("Statement %d has wildcard Action — not allowed", [i])
}

deny[msg] {
  input.Statement[i].Effect == "Allow"
  input.Statement[i].Resource == "*"
  endswith(input.Statement[i].Action, "Delete")
  msg := sprintf("Statement %d allows Delete on all resources — requires specific ARN", [i])
}

Block 6: Universal Audit Trail

Zero Trust without logging is just obscurity. Every authorization decision — allow and deny — must be logged, retained, and queryable.

# AWS: verify CloudTrail is comprehensive
aws cloudtrail get-trail-status --name management-trail
# Must have: LoggingEnabled=true, IsMultiRegionTrail=true, IncludeGlobalServiceEvents=true

# Verify no management events are excluded
aws cloudtrail get-event-selectors --trail-name management-trail \
  | jq '.EventSelectors[] | {ReadWrite: .ReadWriteType, Mgmt: .IncludeManagementEvents}'
# ReadWriteType should be "All"; IncludeManagementEvents should be true

# GCP: ensure Data Access audit logs are enabled for IAM
gcloud projects get-iam-policy my-project --format=json | jq '.auditConfigs'
# Should see auditLogConfigs for cloudresourcemanager.googleapis.com and iam.googleapis.com
# with both DATA_READ and DATA_WRITE enabled

# Azure: route Entra ID logs to Log Analytics for long-term retention and querying
az monitor diagnostic-settings create \
  --name entra-audit-to-la \
  --resource "/tenants/TENANT_ID/providers/microsoft.aad/domains/company.com" \
  --logs '[{"category":"AuditLogs","enabled":true},{"category":"SignInLogs","enabled":true}]' \
  --workspace /subscriptions/SUB_ID/resourceGroups/rg-monitoring/providers/Microsoft.OperationalInsights/workspaces/security-logs

Framework Alignment

Zero Trust IAM isn’t a framework itself — it’s a design philosophy. But it maps cleanly onto the controls that compliance frameworks are pushing organizations toward:

Framework Reference | What It Covers Here
CISSP Domain 5 — IAM | Zero Trust reframes IAM as continuous, context-aware verification rather than perimeter-based trust
CISSP Domain 1 — Security & Risk Management | Assume breach as a risk management posture; blast radius minimization through least privilege
CISSP Domain 7 — Security Operations | Continuous monitoring, anomaly detection, and automated response are operational requirements of Zero Trust
ISO 27001:2022 5.15 Access control | Zero Trust access policy: verify explicitly, least privilege, assume breach
ISO 27001:2022 8.16 Monitoring activities | Continuous session validation and universal audit trail — all authorization decisions logged
ISO 27001:2022 8.20 Networks security | Micro-segmentation and mTLS replace implicit network trust with verified identity at every hop
ISO 27001:2022 5.23 Information security for cloud services | Zero Trust architecture applied to cloud IAM across AWS, GCP, and Azure
SOC 2 CC6.1 | Zero Trust logical access controls — JIT, device posture, context-aware authorization
SOC 2 CC6.7 | Continuous session validation and transmission controls across all system components
SOC 2 CC7.1 | Threat detection through universal audit trails and anomaly-triggered automated response
SOC 2 CC7.2 | Incident response — automated revocation and session termination on anomaly detection

Zero Trust Maturity — Where to Start

In practice, most organizations think about Zero Trust as a destination — a large, multi-year program. The reality is it’s a direction. Any movement in that direction reduces risk.

Level | Where You Are | What to Build Next
1 — Initial | Some MFA; static credentials for machines; no centralized IdP | Eliminate machine static keys → workload identity
2 — Managed | Centralized IdP; SSO for most systems; some MFA enforcement | Close SSO gaps; enforce MFA everywhere; federate to cloud
3 — Defined | Least privilege being enforced; audit tooling in use; JIT for some privileged access | Expand JIT; policy-as-code in CI/CD; quarterly access reviews
4 — Contextual | Device posture in access decisions; conditional access policies | Continuous session evaluation; automated anomaly response
5 — Optimizing | Policy-as-code everywhere; automated right-sizing; anomaly-triggered revocation | Refine and maintain — Zero Trust is never “done”

The jump from Level 1 to Level 3 delivers the most security value per unit of effort. Start there. Don’t defer least privilege enforcement while you build a sophisticated device posture integration.


The Practical Sequence

If you’re building Zero Trust IAM from where most organizations are, this is the order that maximizes early security value:

  1. Inventory all identities — human and machine. You cannot secure what you can’t see. Build a complete picture before changing anything.

  2. Eliminate static credentials for machines — replace access keys and SA key files with workload identity. This is the highest-ROI change in most environments.

  3. Enforce MFA for all human access — especially cloud consoles, IdP admin, and VPN. Hardware keys for privileged accounts.

  4. Federate human identity — single IdP, SSO to cloud and major applications. Centralize the revocation path.

  5. Right-size IAM permissions — use last-accessed data and IAM Recommender to find and remove unused permissions. This is a continuous discipline, not a one-time clean-up (see the sketch after this list).

  6. JIT for privileged access — Azure PIM, AWS Identity Center assignment automation, or equivalent for all elevated roles. No standing admin.

  7. IAM as code — all IAM changes via Terraform/Pulumi/CDK, reviewed in pull requests, validated by Access Analyzer or OPA in CI/CD, applied through automation.

  8. Continuous monitoring — alerts on IAM mutations, anomalous API call patterns, new cross-account trust relationships, new public resource exposures.

  9. Add context signals — Conditional Access policies incorporating device posture. Access Context Manager in GCP. AWS Verified Access for application access.

  10. Automated response — anomaly detected → automatic credential suspension or session termination. Close the window between detection and containment.
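
For step 5, the AWS side of that right-sizing loop looks roughly like this (role ARN is a placeholder):

# Kick off a last-accessed analysis for a role — returns a JobId
aws iam generate-service-last-accessed-details \
  --arn arn:aws:iam::123456789012:role/AppBackendRole

# Fetch the results; services the role has never touched are removal candidates
aws iam get-service-last-accessed-details --job-id JOB_ID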


Series Complete

This series covered Cloud IAM from the question “what even is IAM?” to Zero Trust architecture:

Episode | Topic | The Core Lesson
EP01 | What is IAM? | Access management is deny-by-default; every grant is an explicit decision
EP02 | AuthN vs AuthZ | Two separate gates; passing one doesn’t open the other
EP03 | Roles, Policies, Permissions | Structure prevents drift; wildcards accumulate into exposure
EP04 | AWS IAM Deep Dive | Trust policies and permission policies are both required; the evaluation chain has six layers
EP05 | GCP IAM Deep Dive | Hierarchy inheritance is a feature that needs careful handling; service account keys are an antipattern
EP06 | Azure RBAC and Entra ID | Two separate authorization planes; managed identities are the right model for workloads
EP07 | Workload Identity | Static credentials for machines are solvable at the root; OIDC token exchange replaces them
EP08 | IAM Attack Paths | The attack chain runs through IAM; iam:PassRole and its equivalents are privilege escalation primitives
EP09 | Least Privilege Auditing | 5% utilization is the average; the 95% excess is attack surface — and it’s measurable
EP10 | Federation, OIDC, SAML | The IdP is the trust anchor; everything downstream is bounded by its security
EP11 | Kubernetes RBAC | Two separate IAM layers; both must be secured; cluster-admin is the first thing to audit
EP12 | Zero Trust IAM | Trust nothing implicitly; verify everything explicitly; minimize blast radius through least privilege at every layer

IAM is not a feature you configure. It’s a practice you maintain. The organizations that operate with genuinely low cloud IAM risk don’t have fewer identities — they have better visibility into what those identities can do, and why, and what happened when something went wrong.

That’s what this series has been building toward.


The full series is at linuxcent.com/cloud-iam-series. If you found it useful, the best thing you can do is subscribe — the next series covers eBPF: what’s actually running in kernel space when Cilium, Falco, and Tetragon are doing their work.

Subscribe → linuxcent.com/subscribe

Kubernetes RBAC and AWS IAM: The Two-Layer Access Model for EKS

Reading Time: 9 minutes


What Is Cloud IAM · Authentication vs Authorization · IAM Roles vs Policies · AWS IAM Deep Dive · GCP Resource Hierarchy IAM · Azure RBAC Scopes · OIDC Workload Identity · AWS IAM Privilege Escalation · AWS Least Privilege Audit · SAML vs OIDC Federation · Kubernetes RBAC and AWS IAM


TL;DR

  • Kubernetes RBAC and cloud IAM are separate authorization layers — strong cloud IAM with weak Kubernetes RBAC is still a vulnerable cluster
  • cluster-admin ClusterRoleBindings are the first thing to audit — a compromised pod with cluster-admin controls the entire cluster
  • Disable automountServiceAccountToken on pods that don’t call the Kubernetes API — most application pods don’t need it mounted
  • Use OIDC for human access instead of X.509 client certificates — client certs cannot be revoked without rotating the CA
  • Bind groups from IdP, not individual usernames — revocation propagates automatically when someone leaves
  • A ServiceAccount that can create pods or create rolebindings is a privilege escalation path: the same class of risk as iam:PassRole

The Big Picture

  TWO AUTHORIZATION LAYERS — NEITHER COMPENSATES FOR THE OTHER

  ┌─────────────────────────────────────────────────────────────────┐
  │  CLOUD IAM LAYER  (AWS IAM / GCP IAM / Azure RBAC)             │
  │  Controls: S3, DynamoDB, Lambda, RDS, cloud services           │
  │  Human: federated identity from IdP (SAML / OIDC)             │
  │  Machine: IRSA annotation → IAM role / GKE WI / AKS WI        │
  │  Audit: CloudTrail, GCP Audit Logs, Azure Monitor              │
  └─────────────────────────────────────────────────────────────────┘
           ↕ separate systems — no inheritance in either direction
  ┌─────────────────────────────────────────────────────────────────┐
  │  KUBERNETES RBAC LAYER  (within the cluster)                   │
  │  Controls: pods, secrets, deployments, configmaps, namespaces  │
  │  Human: OIDC groups → ClusterRoleBinding (or RoleBinding)      │
  │  Machine: ServiceAccount → Role / ClusterRole                  │
  │  Audit: kube-apiserver audit log                               │
  └─────────────────────────────────────────────────────────────────┘

  Attack path: exploit app pod → SA has cluster-admin → own the cluster
  Audit finding: cluster-admin on app SA, regardless of cloud IAM posture

Introduction

I spent a long time in Kubernetes environments thinking cloud IAM and Kubernetes RBAC were related in a way that meant securing one partially covered the other. They don’t. They’re separate authorization systems that happen to share infrastructure.

The moment this crystallized for me: I was auditing an EKS cluster for a fintech company. Their AWS IAM posture was actually quite good — least privilege roles, no wildcard policies, SCPs in place at the org level. I was about to give them a clean bill of health when I ran one command:

kubectl get clusterrolebindings -o json | \
  jq '.items[] | select(.roleRef.name=="cluster-admin") | {name:.metadata.name, subjects:.subjects}'

The output showed five ClusterRoleBindings to cluster-admin. Two of them bound it to service accounts in production namespaces. One of those service accounts was used by an application that processed customer transactions.

cluster-admin in Kubernetes is the equivalent of AdministratorAccess in AWS. An attacker who compromises a pod running as that service account doesn’t just have access to the application’s data. They have control of the entire cluster: reading every secret in every namespace, deploying arbitrary workloads, modifying RBAC bindings to create persistence.

None of this showed up in the AWS IAM audit. AWS IAM and Kubernetes RBAC are separate systems. Securing one tells you nothing about the other.


Kubernetes RBAC Architecture

Kubernetes RBAC works with four object types:

Object | Scope | What It Does
Role | Single namespace | Defines permissions within one namespace
ClusterRole | Cluster-wide | Permissions across all namespaces, or for non-namespaced resources
RoleBinding | Single namespace | Binds a Role (or ClusterRole) to subjects, scoped to one namespace
ClusterRoleBinding | Cluster-wide | Binds a ClusterRole to subjects with cluster-wide scope

Subjects — the identities that receive the binding — are:
– User: an external identity (Kubernetes has no native user objects; users come from the authenticator)
– Group: a group of external identities
– ServiceAccount: a Kubernetes-native machine identity, namespaced

The scoping matters. A ClusterRole defines what permissions exist. A RoleBinding applies that ClusterRole within a single namespace. A ClusterRoleBinding applies it everywhere. The same permissions, dramatically different blast radius.


Roles and ClusterRoles

# Role: read pods and their logs — scoped to the default namespace only
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""]          # "" = core API group (pods, secrets, configmaps, etc.)
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
# ClusterRole: manage Deployments across all namespaces
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: deployment-manager
rules:
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

The verbs map to HTTP methods against the Kubernetes API: get reads a specific resource, list returns a collection, watch streams changes, create/update/patch/delete are mutations.

One that consistently surprises people: list on secrets returns the full secret objects — data included — not just names. A LIST response from the API contains complete objects, so “list secrets” is effectively “read every secret in scope.” If a service account only needs a specific secret, grant get on that secret by name via resourceNames (sketched below). Avoid list on the secrets resource.
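
Here’s what that narrower grant looks like — a sketch with placeholder names; resourceNames is the standard RBAC field for restricting a rule to specific objects:

# Role: get ONE named secret — no list, no access to the rest of the namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: read-db-credentials
rules:
- apiGroups: [""]
  resources: ["secrets"]
  resourceNames: ["db-credentials"]   # placeholder — the one secret this SA may read
  verbs: ["get"]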

The Wildcard Risk

# This is effectively cluster-admin in the default namespace — avoid
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["*"]

Any * in RBAC rules is an audit finding. In practice I find wildcards most often in:
– Operator and controller service accounts (understandable, but worth reviewing)
– “Temporary” RBAC that became permanent
– Developer tooling given cluster-admin “because it was easier”

Run this to find all ClusterRoles with wildcard verbs:

kubectl get clusterroles -o json | \
  jq '.items[] | select(.rules[]?.verbs[] == "*") | .metadata.name'

Bindings — Connecting Identities to Roles

# RoleBinding: alice can read pods in the default namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: alice-pod-reader
  namespace: default
subjects:
- kind: User
  name: [email protected]
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
# ClusterRoleBinding: Prometheus can read cluster-wide (monitoring use case)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-cluster-reader
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
roleRef:
  kind: ClusterRole
  name: view
  apiGroup: rbac.authorization.k8s.io

An important pattern: a RoleBinding can reference a ClusterRole. This lets you define a role once at the cluster level (the ClusterRole) and bind it within specific namespaces through RoleBindings. The permissions are still scoped to the namespace where the RoleBinding lives. This is the right pattern for shared role definitions — define the permission set once, instantiate it with appropriate scope.
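
In YAML, the pattern looks like this — a sketch reusing the deployment-manager ClusterRole from above; the group name is a placeholder:

# RoleBinding referencing a ClusterRole — permissions apply only in team-a
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-deployers
  namespace: team-a
subjects:
- kind: Group
  name: oidc:team-a-engineers
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole            # cluster-level definition...
  name: deployment-manager     # ...bound with namespace-level scope
  apiGroup: rbac.authorization.k8s.io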

Default to RoleBinding over ClusterRoleBinding for namespace-scoped work. ClusterRoleBinding should be reserved for genuinely cluster-wide operations: monitoring agents, network plugins, cluster operators, security tooling.


Service Accounts — The Machine Identity in Kubernetes

Every pod in Kubernetes runs as a service account. If you don’t specify one, it uses the default service account in the pod’s namespace.

The default service account is where many RBAC misconfigurations accumulate. When someone creates a RoleBinding without thinking about which SA to use, they often bind the permission to default. Now every pod in that namespace that doesn’t explicitly set a service account — including pods deployed by developers who aren’t thinking about RBAC — inherits that binding.

# Create a dedicated SA for each application
kubectl create serviceaccount app-backend -n production

# Check what any SA can currently do — use this in every audit
kubectl auth can-i --list --as=system:serviceaccount:production:app-backend -n production

# Check a specific action
kubectl auth can-i get secrets \
  --as=system:serviceaccount:production:app-backend -n production

kubectl auth can-i create pods \
  --as=system:serviceaccount:production:app-backend -n production

Disable Auto-Mounting the SA Token

By default, Kubernetes mounts the service account token into every pod at /var/run/secrets/kubernetes.io/serviceaccount/token. A pod that doesn’t need to call the Kubernetes API doesn’t need this token. Having it mounted increases the blast radius if the pod is compromised — the token can be used to call the K8s API with whatever RBAC permissions the SA has.

# Disable at the pod level
apiVersion: v1
kind: Pod
spec:
  automountServiceAccountToken: false
  serviceAccountName: app-backend
  containers:
  - name: app
    image: my-app:latest

# Or at the service account level (applies to all pods using this SA)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
automountServiceAccountToken: false

For most application pods — anything that isn’t a Kubernetes operator, controller, or management tool — the K8s API token is unnecessary. Disable it.


Human Access to Kubernetes — Get Off Client Certificates

Kubernetes doesn’t manage human users natively. Authentication is delegated to an external mechanism. The most common approaches:

Method | Notes
X.509 client certificates | Common for initial cluster setup; credentials are embedded in kubeconfig; cannot be revoked without rotating the CA
Static bearer tokens | Long-lived; avoid
OIDC via external IdP | Preferred for human access — supports SSO, MFA, and revocation via IdP
Webhook auth | Flexible, requires custom infrastructure

X.509 certificates are the bootstrap pattern. Every managed Kubernetes offering generates an admin kubeconfig with a client certificate. The problem: you can’t revoke individual certificates without rotating the CA. If you’re giving human engineers access via client certificates, someone leaving doesn’t actually lose cluster access until the certificate expires.

OIDC is the right model. Configure the kube-apiserver to accept JWTs from your IdP, bind RBAC permissions to groups from the IdP, and revocation becomes “remove from IdP group” rather than “hope the certificate expires soon”:

# kube-apiserver flags for OIDC (managed clusters configure this via provider settings)
--oidc-issuer-url=https://accounts.google.com
--oidc-client-id=my-cluster-client-id
--oidc-username-claim=email
--oidc-groups-claim=groups
--oidc-groups-prefix=oidc:
# User's kubeconfig — uses an exec plugin to fetch an OIDC token
users:
- name: alice
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      command: kubectl-oidc-login
      args:
        - get-token
        - --oidc-issuer-url=https://dex.company.com
        - --oidc-client-id=kubernetes

With managed clusters:

# EKS: add IAM role as a cluster access entry (replaces the aws-auth ConfigMap)
aws eks create-access-entry \
  --cluster-name my-cluster \
  --principal-arn arn:aws:iam::123456789012:role/DevTeamRole \
  --type STANDARD

aws eks associate-access-policy \
  --cluster-name my-cluster \
  --principal-arn arn:aws:iam::123456789012:role/DevTeamRole \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy \
  --access-scope type=namespace,namespaces=production,staging

# GKE: get credentials; IAM roles map to cluster permissions
gcloud container clusters get-credentials my-cluster --region us-central1
# roles/container.developer → edit permissions
# But: use ClusterRoleBindings for fine-grained control rather than relying on GCP IAM roles

# AKS: bind Entra ID groups to Kubernetes RBAC
az aks get-credentials --name my-aks --resource-group rg-prod
kubectl create clusterrolebinding dev-team-view \
  --clusterrole=view \
  --group=ENTRA_GROUP_OBJECT_ID

Cloud IAM + Kubernetes RBAC: The Integration Points

EKS Pod Identity / IRSA (revisited)

The annotation on the Kubernetes ServiceAccount is the bridge:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/AppBackendRole

Kubernetes RBAC controls what the pod can do inside the cluster. The IAM role controls what the pod can do in AWS. Both must be explicitly granted; neither inherits from the other.

GKE Workload Identity

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    iam.gke.io/gcp-service-account: [email protected]

AKS Workload Identity

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    azure.workload.identity/client-id: "MANAGED_IDENTITY_CLIENT_ID"
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    azure.workload.identity/use: "true"
spec:
  serviceAccountName: app-backend

RBAC Audit — What to Check First

# Start here: who has cluster-admin?
kubectl get clusterrolebindings -o json | \
  jq '.items[] | select(.roleRef.name=="cluster-admin") | 
      {binding: .metadata.name, subjects: .subjects}'
# cluster-admin should bind to almost nobody — review every result

# Find ClusterRoles with wildcard permissions
kubectl get clusterroles -o json | \
  jq '.items[] | select(.rules[]?.verbs[]? == "*") | .metadata.name'

# What can the default SA do in each namespace?
for ns in $(kubectl get namespaces -o name | cut -d/ -f2); do
  echo "=== $ns ==="
  kubectl auth can-i --list --as=system:serviceaccount:${ns}:default -n ${ns} 2>/dev/null \
    | grep -v "no" | head -10
done

# What can a specific SA do?
kubectl auth can-i --list \
  --as=system:serviceaccount:production:app-backend \
  -n production

# Check whether an SA can escalate — key risk indicators
kubectl auth can-i get secrets -n production \
  --as=system:serviceaccount:production:app-backend
kubectl auth can-i create pods -n production \
  --as=system:serviceaccount:production:app-backend
kubectl auth can-i create rolebindings -n production \
  --as=system:serviceaccount:production:app-backend

Creating pods and creating rolebindings are privilege escalation primitives. A service account that can create pods can run a pod with a different, more powerful SA. A service account that can create rolebindings can grant itself more permissions.
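
To make the create-pods escalation concrete — a sketch of the pod an attacker could submit (SA and image names are placeholders); note that no RBAC object is ever modified:

# Attacker has only create-pods — but can run AS any SA in the namespace
apiVersion: v1
kind: Pod
metadata:
  name: borrow-identity
  namespace: production
spec:
  serviceAccountName: powerful-sa   # e.g. an SA bound to cluster-admin
  containers:
  - name: shell
    image: busybox
    command: ["sh", "-c", "cat /var/run/secrets/kubernetes.io/serviceaccount/token && sleep 3600"]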

Useful Tools

# rbac-tool — visualize and analyze RBAC (install: kubectl krew install rbac-tool)
kubectl rbac-tool viz                              # generate a graph of all bindings
kubectl rbac-tool who-can get secrets -n production
kubectl rbac-tool lookup [email protected]

# rakkess — access matrix for a subject
kubectl rakkess --sa production:app-backend

# audit2rbac — generate minimal RBAC from audit logs
audit2rbac --filename /var/log/kubernetes/audit.log \
  --serviceaccount production:app-backend

Common RBAC Misconfigurations

Misconfiguration | Risk | Fix
cluster-admin bound to application SA | Full cluster takeover from compromised pod | Minimal ClusterRole; scope to namespace where possible
list or wildcard on secrets | Read all secrets in scope — includes credentials, API keys | Grant get on specific named secrets only
default SA with non-trivial permissions | Every pod in the namespace inherits the permission | Bind permissions to dedicated SAs; automountServiceAccountToken: false on default
ClusterRoleBinding for namespace-scoped work | Namespace work with cluster-wide permission | Always prefer RoleBinding; ClusterRoleBinding only for genuinely cluster-wide needs
Binding users by username string | Hard to revoke; doesn’t sync with IdP | Bind groups from IdP; revocation propagates through group membership
SA can create pods or create rolebindings | Privilege escalation path | Audit and remove these from non-privileged SAs

Framework Alignment

Framework Reference | What It Covers Here
CISSP Domain 5 — Identity and Access Management | Kubernetes RBAC operates as a full IAM system at the platform layer, independent of cloud IAM
CISSP Domain 3 — Security Architecture | Two independent authorization layers (cloud + K8s) must each be designed and audited — one does not compensate for the other
ISO 27001:2022 5.15 Access control | Kubernetes RBAC Roles, ClusterRoles, and bindings implement access control within the container platform
ISO 27001:2022 5.18 Access rights | Service account provisioning, OIDC-based human access, and workload identity integration with cloud IAM
ISO 27001:2022 8.2 Privileged access rights | cluster-admin and wildcard RBAC bindings represent the highest-privilege grants in Kubernetes
SOC 2 CC6.1 | Kubernetes RBAC is the access control mechanism for the container platform layer in CC6.1
SOC 2 CC6.3 | Binding revocation, SA token disabling, and OIDC group-based access removal satisfy CC6.3 requirements

Key Takeaways

  • Kubernetes RBAC and cloud IAM are separate authorization layers — both must be secured; strong cloud IAM with weak K8s RBAC is still a vulnerable cluster
  • cluster-admin bindings are the first thing to audit in any cluster — the blast radius of a compromised pod with cluster-admin is the entire cluster
  • Disable automountServiceAccountToken on service accounts and pods that don’t call the Kubernetes API — most application pods don’t need it
  • Use OIDC for human access rather than client certificates; revocation via IdP is instant and reliable
  • Bind groups from IdP rather than individual usernames; revocation propagates automatically when someone leaves
  • A service account that can create pods or create rolebindings is a privilege escalation path — audit for these in every namespace

What’s Next

EP12 is the capstone: Zero Trust IAM — how all the concepts in this series come together into an architecture that assumes nothing is implicitly trusted, verifies everything explicitly, and limits blast radius through least privilege enforced at every layer.

Next: Zero trust access in the cloud

Get EP12 in your inbox when it publishes → linuxcent.com/subscribe

SAML vs OIDC: Which Federation Protocol Belongs in Your Cloud?

Reading Time: 10 minutes


What Is Cloud IAM · Authentication vs Authorization · IAM Roles vs Policies · AWS IAM Deep Dive · GCP Resource Hierarchy IAM · Azure RBAC Scopes · OIDC Workload Identity · AWS IAM Privilege Escalation · AWS Least Privilege Audit · SAML vs OIDC Federation


TL;DR

  • Federation means downstream systems trust the IdP’s signed assertion — they never see credentials and don’t manage them independently
  • SAML is XML-based, browser-oriented, the enterprise standard; OIDC is JWT-based, API-native, the modern protocol for workload identity and consumer SSO
  • In OIDC trust policies, the sub condition is the security boundary — omitting it means any GitHub Actions workflow in any repository can assume your role
  • Validate all JWT claims: signature, iss, aud, exp, sub — libraries do this, but need correct configuration (especially aud)
  • The IdP is the trust anchor: compromise the IdP and every downstream system is compromised. Treat IdP admin access with the same controls as your most sensitive system.
  • JIT provisioning and Conditional Access extend federation from “who are you” to “are you in an appropriate context right now”

The Big Picture

  FEDERATION: HOW TRUST FLOWS FROM IdP TO DOWNSTREAM SYSTEMS

  Identity Provider  (Okta / Entra ID / Google / AD FS)
  ┌──────────────────────────────────────────────────────────────────┐
  │  User or workload authenticates → IdP issues signed assertion   │
  │                                                                  │
  │  ┌──────────────────────────┐  ┌───────────────────────────┐   │
  │  │  SAML Assertion (XML)    │  │  OIDC ID Token (JWT)       │   │
  │  │  RSA-signed, 5–10 min    │  │  RS256-signed, ~1 hr      │   │
  │  │  Audience: SP entity ID  │  │  aud: client ID           │   │
  │  │  Subject: user identity  │  │  sub: specific workload   │   │
  │  └───────────┬──────────────┘  └──────────┬────────────────┘   │
  └─────────────────────────────────────────────────────────────────┘
                 │  human SSO                  │  workload identity
                 ▼                             ▼
  ┌─────────────────────────┐  ┌───────────────────────────────────┐
  │ SP validates signature  │  │ AWS STS / GCP STS validates       │
  │ + audience + timestamp  │  │ signature + iss + aud + sub       │
  │ → console session       │  │ → AssumeRoleWithWebIdentity       │
  └─────────────────────────┘  └───────────────────────────────────┘

  Security bound: IdP security bounds every system that trusts it
  Disable in Okta → access revoked everywhere that trusts Okta

Introduction

Before federation existed, every system had its own user database. Your Jira account. Your AWS account. Your Salesforce account. Your internal wiki. Each one had its own password, its own MFA, its own offboarding process. When an engineer joined, someone had to create accounts in every system. When they left, you hoped whoever processed the offboarding remembered to deactivate all of them.

I’ve done that audit — the one where you’re trying to figure out if a former employee still has access to anything. You go system by system, cross-reference against HR records, find accounts that exist in places you’ve forgotten the company even uses. In one environment I found an ex-engineer’s account still active in a vendor portal six months after they left, because that system was set up by someone who had since also left the company, and nobody had documented it.

Federation solves this structurally. One identity provider. One place to authenticate. One place to revoke. Every downstream system trusts the IdP’s assertion rather than managing credentials independently. Disable someone in Okta and they lose access everywhere that trusts Okta — immediately, without a checklist.

This episode is how federation actually works at the protocol level, because understanding the mechanism is what lets you design it securely. A federation setup with a trust policy that accepts assertions from any OIDC issuer is worse than no federation — it’s a false sense of security.


The Federation Model

Identity Provider (IdP)          Service Provider (SP) / Relying Party
  (Okta, Google, AD FS, Entra ID)       (AWS, Salesforce, GitHub, your app)
         │                                          │
         │  1. User authenticates to IdP             │
         │     (password + MFA)                      │
         │                                          │
         │  2. IdP generates a signed assertion      │
         │     (SAML response or OIDC ID Token)      │
         │ ──────────────────────────────────────── ▶│
         │                                          │
         │  3. SP validates the signature            │
         │     (using IdP's public certificate       │
         │      or JWKS endpoint)                    │
         │  4. SP maps identity to local permissions │
         │  5. SP grants access                      │

The SP never sees the user’s password. It never has one. It trusts the IdP’s cryptographic signature — if the assertion is signed with the IdP’s private key, and the SP trusts that key, the identity is accepted.

This trust chain has one critical property: the security of every SP is bounded by the security of the IdP. Compromise the IdP, and every system that trusts it is compromised. This is why IdP security deserves the same attention as the most sensitive system it gates access to.


SAML 2.0 — The Enterprise Standard

SAML (Security Assertion Markup Language) is XML-based, verbose, and battle-tested. Published in 2005, it’s the protocol behind most enterprise SSO deployments. When your company says “use your corporate login for this vendor app,” SAML is usually the mechanism.

How a SAML Login Flows

1. User visits AWS console (the Service Provider)
2. AWS checks: no active session → redirect to IdP
   → https://company.okta.com/saml?SAMLRequest=...
3. Okta authenticates the user (password, MFA)
4. Okta generates a SAML Assertion — a signed XML document containing:
   - Who the user is (Subject, typically email)
   - Their attributes (group memberships, custom attributes)
   - When the assertion was issued and when it expires (valid 5-10 minutes typically)
   - Which SP this is for (Audience restriction)
   - Okta's digital signature (RSA-SHA256 or similar)
5. Browser POSTs the assertion to AWS's ACS (Assertion Consumer Service) URL
6. AWS validates the signature against Okta's public cert (retrieved from Okta's metadata URL)
7. AWS reads the SAML attribute for the IAM role
8. AWS calls sts:AssumeRoleWithSAML → issues temporary credentials
9. User gets a console session — no AWS credentials were ever stored anywhere
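
Step 8 is the only direct AWS API call in that flow. Here is a minimal boto3 sketch of the exchange; the ARNs are the hypothetical values used throughout this article, and the assertion is read from a file purely for illustration (in reality the browser POSTs it to the ACS URL):

import base64

import boto3

# Hypothetical ARNs: substitute your own role and SAML provider
ROLE_ARN = "arn:aws:iam::123456789012:role/EngineerRole"
PRINCIPAL_ARN = "arn:aws:iam::123456789012:saml-provider/OktaProvider"

# The assertion arrives base64-encoded from the IdP; reading it from
# disk here is for illustration only
with open("assertion.xml", "rb") as f:
    saml_assertion = base64.b64encode(f.read()).decode()

# AssumeRoleWithSAML is an unsigned call: no AWS credentials are needed,
# because the signed assertion IS the credential
sts = boto3.client("sts", region_name="us-east-1")
resp = sts.assume_role_with_saml(
    RoleArn=ROLE_ARN,
    PrincipalArn=PRINCIPAL_ARN,
    SAMLAssertion=saml_assertion,
    DurationSeconds=3600,  # bounded by the role's max session duration
)

creds = resp["Credentials"]  # AccessKeyId, SecretAccessKey, SessionToken
print("temporary credentials expire at:", creds["Expiration"])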

What a SAML Assertion Actually Looks Like

<saml:Assertion>
  <saml:Issuer>https://okta.company.com</saml:Issuer>

  <saml:Subject>
    <saml:NameID>[email protected]</saml:NameID>
  </saml:Subject>

  <saml:AttributeStatement>
    <!-- This attribute tells AWS which IAM role to assume -->
    <saml:Attribute Name="https://aws.amazon.com/SAML/Attributes/Role">
      <saml:AttributeValue>
        arn:aws:iam::123456789012:role/EngineerRole,arn:aws:iam::123456789012:saml-provider/OktaProvider
      </saml:AttributeValue>
    </saml:Attribute>
  </saml:AttributeStatement>

  <!-- Critical: time bounds on this assertion -->
  <saml:Conditions NotBefore="2026-04-11T09:00:00Z" NotOnOrAfter="2026-04-11T09:05:00Z">
    <saml:AudienceRestriction>
      <!-- Critical: this assertion is ONLY valid for AWS -->
      <saml:Audience>https://signin.aws.amazon.com/saml</saml:Audience>
    </saml:AudienceRestriction>
  </saml:Conditions>

  <ds:Signature>... RSA-SHA256 signature over the above ...</ds:Signature>
</saml:Assertion>

The Audience restriction and the NotOnOrAfter timestamp are two of the most security-critical fields. The audience ensures this assertion can’t be reused for a different SP. The timestamp ensures it can’t be replayed after expiry.

Setting Up SAML Federation with AWS

# Register Okta as a SAML provider in AWS IAM
aws iam create-saml-provider \
  --saml-metadata-document file://okta-metadata.xml \
  --name OktaProvider

# Create the IAM role that federated users will assume
aws iam create-role \
  --role-name EngineerRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:saml-provider/OktaProvider"
      },
      "Action": "sts:AssumeRoleWithSAML",
      "Condition": {
        "StringEquals": {
          "SAML:aud": "https://signin.aws.amazon.com/saml"
        }
      }
    }]
  }'

# In Okta: configure the AWS Account Federation app (the classic SAML-to-IAM-roles integration)
# Attribute mapping: https://aws.amazon.com/SAML/Attributes/Role
# Value: arn:aws:iam::123456789012:role/EngineerRole,arn:aws:iam::123456789012:saml-provider/OktaProvider

# Set maximum session duration (8 hours is reasonable for human access)
aws iam update-role \
  --role-name EngineerRole \
  --max-session-duration 28800

SAML Attack Surface

  XML Signature Wrapping (XSW)
    What it does:  Inserts a malicious assertion and wraps it around the legitimate signed one, so the SP validates one element but reads another
    Why it works:  SAML's XML structure is complex; naive signature validation checks the signed element, not the element the SP actually reads
    Prevention:    Use a vetted SAML library — never hand-roll parsing

  Assertion replay
    What it does:  Steals a valid assertion (e.g., via network intercept) and replays it before NotOnOrAfter
    Why it works:  If the SP doesn't track used assertion IDs, the same assertion can be presented multiple times
    Prevention:    Short expiry; SP tracks seen assertion IDs

  Audience bypass
    What it does:  Presents an assertion issued for SP A at SP B
    Why it works:  The SP doesn't verify the Audience field
    Prevention:    Always validate that Audience matches your SP entity ID

XML Signature Wrapping is the most interesting attack historically — security researchers used it to demonstrate that SAML implementations at major providers could be bypassed before vendors patched their libraries. The lesson: SAML is complex enough that rolling your own parser is asking for a vulnerability.


OpenID Connect (OIDC) — The Modern Protocol

OIDC is JSON-based, REST-native, and designed for the web and API-first world. Built on top of OAuth 2.0, it’s the protocol behind “Sign in with Google,” GitHub’s OIDC tokens for Actions, and workload identity federation across cloud providers.

Token Anatomy

An OIDC ID Token is a JWT — three base64url-encoded parts separated by dots:

Header.Payload.Signature

Header:
{
  "alg": "RS256",           ← signing algorithm
  "kid": "key-id-123"       ← which key signed this (for JWKS rotation)
}

Payload (the claims):
{
  "iss": "https://accounts.google.com",         ← who issued this token
  "sub": "108378629573454321234",               ← stable user identifier (not email)
  "aud": "my-app-client-id",                   ← who this token is for
  "exp": 1749600000,                           ← expires at (Unix timestamp)
  "iat": 1749596400,                           ← issued at
  "email": "[email protected]",
  "email_verified": true,
  "hd": "company.com"                          ← hosted domain (Google Workspace)
}

Signature: RSA-SHA256(base64url(header) + "." + base64url(payload), idp_private_key)

The relying party (your application, or AWS STS) validates the signature using the IdP’s public keys — available at the JWKS endpoint (/.well-known/jwks.json). The signature verification proves the token was issued by the expected IdP and hasn’t been tampered with since.
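
To make the three-part structure concrete, here is a small Python sketch (a hypothetical helper, not from any library) that splits and decodes a token for inspection. It deliberately skips signature verification, so never make an access decision from an unverified decode:

import base64
import json

def peek_jwt(token: str) -> tuple[dict, dict]:
    """Decode a JWT's header and payload WITHOUT verifying the signature.

    For inspection and debugging only: an unverified payload proves nothing.
    """
    header_b64, payload_b64, _signature_b64 = token.split(".")

    def b64url_decode(part: str) -> bytes:
        # JWT parts are unpadded base64url; restore padding before decoding
        return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))

    header = json.loads(b64url_decode(header_b64))
    payload = json.loads(b64url_decode(payload_b64))
    return header, payload

# header, claims = peek_jwt(raw_token)
# header["kid"] tells you which JWKS key to verify the signature against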

The Full OIDC Token Exchange (GitHub Actions → AWS)

# GitHub Actions automatically provides an OIDC token in the runner environment
# The token contains: iss=token.actions.githubusercontent.com, repo, ref, sha, run_id, etc.

# Step 1: Fetch the OIDC token from GitHub's token service
TOKEN=$(curl -H "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
  "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=sts.amazonaws.com" | jq -r '.value')

# Step 2: Present to AWS STS for exchange
aws sts assume-role-with-web-identity \
  --role-arn arn:aws:iam::123456789012:role/GitHubActionsRole \
  --role-session-name github-deploy \
  --web-identity-token "${TOKEN}"

# STS performs these validations:
# 1. Fetch GitHub's JWKS: https://token.actions.githubusercontent.com/.well-known/jwks
# 2. Verify signature is valid
# 3. Verify iss = "https://token.actions.githubusercontent.com" (matches the registered OIDC provider)
# 4. Verify aud = "sts.amazonaws.com"
# 5. Verify sub matches the trust policy condition
# 6. Verify exp is in the future

The trust policy condition on the IAM role is what prevents any GitHub repository from assuming this role:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
        "token.actions.githubusercontent.com:sub": "repo:my-org/my-repo:ref:refs/heads/main"
      }
    }
  }]
}

The sub condition is the security boundary. repo:my-org/my-repo:ref:refs/heads/main means: only runs triggered from the main branch of my-org/my-repo can assume this role. A pull request from a fork, a run from a different repo, or a run from a different branch — all get a different sub claim and the assumption fails.

I’ve reviewed trust policies that omit the sub condition and check only aud. That means any GitHub Actions workflow — in any repository, owned by anyone — can assume that role. This is not a theoretical misconfiguration: anyone can create a public GitHub repository, and its workflows can request OIDC tokens.
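
One dependency the trust policy does not show: the Federated principal must reference an OIDC identity provider already registered in the account. A hedged boto3 sketch of that registration, where the thumbprint is a placeholder rather than GitHub's actual certificate thumbprint:

import boto3

# IAM is a global service; the region here only selects an endpoint
iam = boto3.client("iam", region_name="us-east-1")

resp = iam.create_open_id_connect_provider(
    Url="https://token.actions.githubusercontent.com",
    ClientIDList=["sts.amazonaws.com"],  # must match the token's aud claim
    # SHA-1 thumbprint of the issuer's TLS certificate chain.
    # Placeholder value: look up the current thumbprint before using.
    ThumbprintList=["0000000000000000000000000000000000000000"],
)
print("provider ARN:", resp["OpenIDConnectProviderArn"])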

OIDC Validation Checklist

Every application that validates OIDC tokens must check all of these:

✓ Signature valid (using IdP's JWKS endpoint — not a hardcoded key)
✓ iss matches the expected IdP URL
✓ aud matches your application's client ID (not just "any audience")
✓ exp is in the future
✓ nbf (not before), if present, is in the past
✓ iat is recent (within your clock skew tolerance)
✓ For workload identity: sub is pinned to the specific workload

Skipping aud validation is the most common mistake. A token issued for application A with aud: app-a-client-id should not be accepted by application B. Without audience validation, any token obtained from the IdP for one application can be replayed at any other. Libraries like python-jose and jsonwebtoken enforce aud, but only when configured with the expected audience value, as in the sketch below.
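
Here is that checklist as a Python sketch using PyJWT (pip install "pyjwt[crypto]"); the issuer, audience, and JWKS URL are hypothetical stand-ins for your IdP's values:

import jwt  # PyJWT

# Hypothetical IdP configuration: substitute your own values
ISSUER = "https://idp.example.com"
AUDIENCE = "my-app-client-id"
JWKS_URL = "https://idp.example.com/.well-known/jwks.json"

# PyJWKClient fetches and caches the IdP's public keys, selecting the
# right one via the token header's kid (no hardcoded keys)
jwks_client = jwt.PyJWKClient(JWKS_URL)

def validate_id_token(token: str) -> dict:
    signing_key = jwks_client.get_signing_key_from_jwt(token)
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],  # pin the algorithm; never trust the header's alg blindly
        issuer=ISSUER,         # iss must match the expected IdP
        audience=AUDIENCE,     # aud must match; refuse "any audience"
        leeway=30,             # seconds of clock-skew tolerance for exp/nbf/iat
        options={"require": ["exp", "iat", "sub"]},  # reject tokens missing these claims
    )

# claims = validate_id_token(raw_token)
# For workload identity, additionally pin claims["sub"] to the expected workload.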


Enterprise Federation Patterns

Multi-Account AWS with IAM Identity Center + Okta

The pattern I deploy in every multi-account AWS environment:

Okta (IdP)
  └── IAM Identity Center
        ├── Account: prod     → Permission Sets: ReadOnly, DevOps
        ├── Account: staging  → Permission Sets: Developer  
        ├── Account: shared   → Permission Sets: NetworkAdmin, SecurityAudit
        └── Account: sandbox  → Permission Sets: Admin (sandbox only)

# Engineers access accounts through the Identity Center portal
aws configure sso
# Prompts: SSO start URL, region, account, role

aws sso login --profile prod-readonly

# List available accounts and roles (useful for tooling and scripts)
aws sso list-accounts --access-token "${TOKEN}"
aws sso list-account-roles --access-token "${TOKEN}" --account-id "${ACCOUNT_ID}"

# Get temporary credentials for a specific account/role
aws sso get-role-credentials \
  --account-id "${ACCOUNT_ID}" \
  --role-name ReadOnly \
  --access-token "${TOKEN}"

When an engineer is offboarded from Okta, they lose access to every AWS account immediately. No individual IAM user deletion across 20 accounts. No access key hunting. One action in Okta, complete revocation.
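
The same APIs behind those commands make access reviews scriptable. A hedged boto3 sketch that enumerates every account and role an SSO access token can reach (the region is an assumption and must match your Identity Center deployment), useful for exactly the offboarding audits described earlier:

import boto3

def list_sso_access(access_token: str) -> None:
    """Print every account/role combination reachable with this SSO token."""
    # Assumption: Identity Center is deployed in us-east-1
    sso = boto3.client("sso", region_name="us-east-1")
    for page in sso.get_paginator("list_accounts").paginate(accessToken=access_token):
        for account in page["accountList"]:
            roles = sso.list_account_roles(
                accessToken=access_token,
                accountId=account["accountId"],
            )["roleList"]
            for role in roles:
                print(f'{account["accountName"]} ({account["accountId"]}): {role["roleName"]}')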

Just-in-Time (JIT) Provisioning

Rather than creating user accounts in every downstream system ahead of time, JIT provisioning creates accounts on first login:

  1. User authenticates to IdP
  2. SAML/OIDC assertion includes group memberships and attributes
  3. SP receives assertion, checks if a user account exists for this sub
  4. If not: create the account with attributes from the assertion
  5. Grant access based on group claims
  6. On subsequent logins: update the account’s attributes if claims changed

The security property: when a user is disabled in the IdP, their accounts in downstream systems become inaccessible even if the account objects still exist. There is nothing left to log in with: a JIT-provisioned account orphaned by IdP deprovisioning is an inert shell, not a standing credential.
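
A minimal sketch of that upsert logic in Python, with hypothetical claim and store shapes; in practice the claims come from a validated assertion and the store is your application's database:

from dataclasses import dataclass, field

@dataclass
class User:
    sub: str                        # stable IdP identifier: the lookup key
    email: str
    groups: list[str] = field(default_factory=list)

# Stand-in for the SP's user database
user_store: dict[str, User] = {}

def jit_provision(claims: dict) -> User:
    """Create or refresh a local account from a *validated* assertion's claims."""
    user = user_store.get(claims["sub"])
    if user is None:
        # First login: create the account from assertion attributes
        user = User(sub=claims["sub"], email=claims["email"],
                    groups=claims.get("groups", []))
        user_store[user.sub] = user
    else:
        # Subsequent logins: resync attributes so drift can't accumulate
        user.email = claims["email"]
        user.groups = claims.get("groups", [])
    return user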


The IdP Is the Trust Anchor — Protect It Accordingly

The entire security of a federated system is bounded by the security of the IdP. If an attacker can log into Okta as an admin, they can issue valid SAML assertions for any user, for any role, to any SP that trusts Okta. Every downstream system is compromised simultaneously.

This is not theoretical. In the 2023 Caesars and MGM Resorts attacks, initial access was achieved through social engineering of help-desk and identity workflows — not through technical exploitation of cloud infrastructure. Once identity infrastructure is compromised, everything downstream follows.

What this means practically:

  • MFA for all IdP admin accounts — hardware FIDO2 keys, not TOTP. TOTP codes can be phished in real-time. Hardware keys cannot.
  • PIM / JIT access for IdP configuration changes — no standing admin access
  • Separate monitoring and alerting for IdP admin activity
  • Audit who can modify SAML/OIDC configurations and attribute mappings in the IdP — these are the levers for privilege escalation
  • Narrow audience restrictions — configure which SPs can receive assertions; don’t create a wildcard IdP configuration that serves all SPs

Conditional Access — Adding Context to Federation

Modern IdPs support Conditional Access policies that restrict when assertions are issued:

// Entra ID Conditional Access: require MFA + compliant device for AWS access
{
  "conditions": {
    "applications": {
      "includeApplications": ["AWS-Application-ID-in-Entra"]
    },
    "users": {
      "includeGroups": ["all-employees"]
    },
    "locations": {
      "excludeLocations": ["NamedLocation-CorporateNetwork"]
    }
  },
  "grantControls": {
    "operator": "AND",
    "builtInControls": ["mfa", "compliantDevice"]
  }
}

This policy: when an employee accesses AWS from outside the corporate network, they must use MFA on a device that MDM has verified as compliant. Note the effect of the named-location exclusion: from inside the corporate network this policy does not apply at all, so any baseline requirements for on-network access need their own policy.

Conditional Access is how you move beyond “authenticated to IdP” as the only gate. Device health, network location, risk score — these become inputs to the access decision.
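
For automation, these policies can be managed through the Microsoft Graph API. A hedged Python sketch that posts a policy like the one above; the token and the application, group, and location IDs are placeholders, and the caller needs the Policy.ReadWrite.ConditionalAccess permission:

import requests

GRAPH_TOKEN = "..."  # placeholder: OAuth token with Policy.ReadWrite.ConditionalAccess

policy = {
    "displayName": "Require MFA + compliant device for AWS",
    "state": "enabledForReportingButNotEnforced",  # report-only first, enforce later
    "conditions": {
        "clientAppTypes": ["all"],
        "applications": {"includeApplications": ["AWS-Application-ID-in-Entra"]},
        "users": {"includeGroups": ["all-employees-group-id"]},
        "locations": {
            "includeLocations": ["All"],
            "excludeLocations": ["NamedLocation-CorporateNetwork-id"],
        },
    },
    "grantControls": {"operator": "AND", "builtInControls": ["mfa", "compliantDevice"]},
}

resp = requests.post(
    "https://graph.microsoft.com/v1.0/identity/conditionalAccess/policies",
    headers={"Authorization": f"Bearer {GRAPH_TOKEN}"},
    json=policy,
    timeout=30,
)
resp.raise_for_status()
print("created policy:", resp.json()["id"])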


Framework Alignment

  CISSP Domain 5 — Identity and Access Management
    Federation is the mechanism for extending identity trust across organizational boundaries

  CISSP Domain 3 — Security Architecture
    Trust relationships must be explicitly designed; overly broad federation trust is an architectural failure

  ISO 27001:2022 5.19 — Information security in supplier relationships
    Federation with third-party IdPs and SPs establishes a cross-organizational trust boundary that must be governed

  ISO 27001:2022 8.5 — Secure authentication
    SAML and OIDC are the secure authentication protocols for federated access, including token validation requirements

  ISO 27001:2022 5.17 — Authentication information
    Credential lifecycle in federated systems: no passwords distributed to SPs; the IdP manages authentication

  SOC 2 CC6.1
    Federated identity is the access control mechanism for human access to cloud environments

  SOC 2 CC6.6
    Logical access from outside system boundaries: federation with external IdPs and partner organizations

Key Takeaways

  • Federation means downstream systems trust the IdP’s signed assertion — they never see credentials and don’t need to manage them independently
  • SAML is XML-based, browser-oriented, widely supported for enterprise SSO; OIDC is JWT-based, API-friendly, the protocol for modern workload identity and consumer SSO
  • In OIDC, the sub condition in trust policies is what prevents any workload from assuming any role — omitting it is a critical misconfiguration
  • Validate all JWT claims: signature, iss, aud, exp, sub — libraries do this, but they need correct configuration
  • The IdP is the trust anchor — its security posture bounds the security of every system that trusts it. Treat IdP admin access with the same controls as your most sensitive systems.
  • JIT provisioning and Conditional Access extend federation from “who are you” to “are you in an appropriate context right now”

What’s Next

EP11 brings this into Kubernetes — RBAC, service account tokens, and how the Kubernetes authorization layer interacts with cloud IAM. Two separate authorization systems, each needing to be secured in its own right; a gap in either becomes a gap in both.

Next: Kubernetes RBAC and AWS IAM

Get EP11 in your inbox when it publishes → linuxcent.com/subscribe