eBPF Archives - Linuxcent

Network Flow Observability — What Every Connection Reveals

May 29, 2026 by Vamshi Krishna Santhapuri

Reading Time: 10 minutes

eBPF: From Kernel to Cloud, Episode 10
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP · TC eBPF · bpftrace · Network Flow Observability · DNS Observability

Architecture Overview

eBPF Network Flow Observability — Hubble and Cilium architecture for zero-instrumentation flow monitoring — Hubble captures every packet decision at the eBPF layer — no sidecar, no app changes, no sampling.

TL;DR

Network flow observability with eBPF attaches persistent programs to TC hooks and records every connection attempt, retransmit, reset, and drop — continuously, with no sampling
(TC hook = Traffic Control hook: the point in the Linux network stack where eBPF programs intercept packets after ingress or before egress, tied to a specific network interface)
APM tools and service mesh telemetry are interpretations of what happened; kernel-level flow data from TC hooks is the raw event stream they all derive from
Retransmit counters at the kernel level reveal congestion, half-open connections, and remote endpoint failures that application logs never surface
Cilium’s Hubble and similar tools (Pixie, Retina) are eBPF flow exporters — they run TC programs, collect perf_event or ringbuf events, and expose them over an API
You can verify what flow data a tool is actually collecting with four bpftool commands — without reading documentation
Production caution: flow maps grow with the number of active connections; pin and bound your maps, and account for the per-packet overhead on high-throughput interfaces

EP09 showed bpftrace as an on-demand kernel query tool — compile a question, get an answer, clean up. Network flow observability with eBPF is the persistent version: programs that stay attached to TC hooks across your entire fleet, recording every connection without waiting for you to ask. When a client reports intermittent failures that appear nowhere in application logs, that persistent record is what you query. This episode covers how that layer works and how to read it.

Quick Check: What Flow Data Is Your Cluster Already Collecting?

Before building anything new, check what’s already running. If you have Cilium, Pixie, or Retina on your cluster, eBPF flow programs are already attached:

# SSH into a worker node, then:

# What TC programs are attached to cluster interfaces?
bpftool net list

# Expected output on a Cilium node:
# xdp:
#
# tc:
# eth0(2) clsact/ingress prog_id 38 prio 1 handle 0x1 direct-action
# eth0(2) clsact/egress  prog_id 39 prio 1 handle 0x1 direct-action
# lxc12a3(15) clsact/ingress prog_id 41 prio 1 handle 0x1 direct-action
# lxc12a3(15) clsact/egress  prog_id 42 prio 1 handle 0x1 direct-action

# What maps are those programs holding state in?
bpftool map list | grep -E "flow|conn|sock|nat"

# Sample output:
# 24: hash  name cilium_ct4_global  flags 0x0
#     key 24B  value 56B  max_entries 65536  memlock 4718592B
# 25: hash  name cilium_ct4_local   flags 0x0
#     key 24B  value 56B  max_entries 8192   memlock 589824B

Each lxcXXXX interface is a pod’s veth pair. The TC programs on those interfaces are what Cilium uses to enforce NetworkPolicy and collect flow telemetry. If you see prog_id values on pod interfaces, your cluster is already doing kernel-level flow collection.

Not running Cilium? On a plain kubeadm or EKS node without a CNI that uses eBPF, bpftool net list will show no TC programs on pod interfaces — just whatever kube-proxy or the CNI plugin installed. You can still attach your own flow programs with tc qdisc add dev eth0 clsact — that’s the starting point this episode covers.

The client opened a ticket on a Tuesday afternoon. “Intermittent connection failures to the payment gateway. Started around 11 AM. Application logs say timeout. Retry logic is masking it for most users but the error rate is up 0.3%.”

I looked at the APM dashboard. The service showed elevated latency — p99 at 850ms versus a normal 120ms — but no hard errors at the application layer. The service mesh metrics showed the downstream call succeeding from the mesh’s perspective. The payment gateway team said their side looked clean.

Three tools. Three different answers. All of them interpreting the network. None of them were the network.

I ran:

bpftool map dump id 24 | grep -A5 "payment-gateway-ip"

The connection tracking map showed retransmit count 14 for a specific (src_ip, dst_ip, src_port, dst_port) tuple — the same 5-tuple, every 30 seconds, for 2 hours. The kernel was retransmitting. The TCP stack was compensating. The application was seeing sporadic success because retransmits eventually got through. The APM dashboard averaged that latency into a p99 and called it “elevated.”

The kernel had the truth. Everything above it was rounding.

Why Application-Level Metrics Miss What the Kernel Sees

Application metrics — APM spans, service mesh telemetry, load balancer health checks — operate at Layer 7. They measure round-trip time for complete requests, error codes returned, bytes transferred. They answer “did this request succeed?” not “what did the network do to make it succeed?”

The TCP stack underneath those requests handles retransmits, congestion window adjustments, RST packets, and half-open connections silently. From an application’s perspective, a request that required 3 retransmits before the ACK arrived looks identical to one that succeeded on the first attempt — slightly slower, but successful.

This is structural, not a tooling gap. Application-layer observability tools cannot see below their own protocol boundary. The kernel’s TCP implementation does not report upward when it retransmits. It just retransmits.

eBPF flow observability closes this gap by attaching programs directly to the network path — at the TC hook, which fires on every packet crossing a network interface — and recording what the kernel actually does.

How TC Hook Flow Programs Work

EP08 covered TC eBPF programs for pod network policy. Flow observability uses the same attachment point with a different purpose: instead of allowing or dropping packets, the program reads packet metadata and writes it to a map or ring buffer.

Pod sends packet
      ↓
veth interface (lxcXXXX)
      ↓
TC clsact/egress hook fires
      ↓
eBPF program reads:
  - src IP, dst IP
  - src port, dst port
  - protocol
  - packet size
  - TCP flags (SYN, ACK, FIN, RST, retransmit bit)
      ↓
Writes event to ringbuf (or perf_event_array)
      ↓
Userspace consumer reads ringbuf
      ↓
Aggregates to flow record
      ↓
Exports to Hubble/Prometheus/flow store

ringbuf — a BPF ring buffer: a lock-free, memory-efficient queue shared between a kernel eBPF program and a userspace consumer. The kernel program writes events; the userspace reader drains them. Used instead of perf_event_array in kernel 5.8+ because it avoids per-CPU memory waste and supports variable-length records. When you see Hubble exporting flows, it’s reading from a ringbuf that the TC program writes to.

The key structural property: the TC hook fires on every packet. Not sampled. Not throttled by default. Every SYN, every ACK, every RST, every retransmit. For flow observability, you typically aggregate at the program level — count packets and bytes per 5-tuple per second, rather than emitting an event per packet — but the raw visibility is there if you need it.

What Retransmit Telemetry Actually Reveals

Most flow observability implementations track TCP retransmits specifically because they are the clearest signal of network-layer trouble invisible to applications.

A TCP retransmit happens when a sender doesn’t receive an ACK within the retransmission timeout (RTO). The kernel resends the segment and doubles the timeout (exponential backoff). From the application’s perspective, the call takes longer. If retransmits keep clearing, the application sees success — just slow success.

perf_event — a kernel mechanism for collecting performance data. In eBPF, BPF_MAP_TYPE_PERF_EVENT_ARRAY lets kernel programs push variable-length records to userspace readers via a ring buffer per CPU. Older tools use perf_event_array; newer ones use BPF_MAP_TYPE_RINGBUF (single shared ring, more efficient). If you inspect an older version of Cilium’s flow exporter, you’ll see perf_event writes; newer versions use ringbuf.

To observe retransmits directly with bpftrace:

# Count retransmit events per destination IP — run for 60 seconds
bpftrace -e '
kprobe:tcp_retransmit_skb {
    $sk = (struct sock *)arg0;
    $daddr = ntop(AF_INET, $sk->__sk_common.skc_daddr);
    @retransmits[$daddr] = count();
}
interval:s:60 { print(@retransmits); clear(@retransmits); exit(); }
'

Sample output:

Attaching 2 probes...
@retransmits[10.96.0.10]:   2       # DNS service — normal
@retransmits[172.16.4.23]:  847     # payment gateway endpoint ← problem here
@retransmits[10.244.1.5]:   1       # normal pod-to-pod traffic

847 retransmits to a single endpoint in 60 seconds. That’s not noise. That’s a congested or half-open connection being retried 14 times per second by the TCP stack while the application layer averages it into “elevated latency.”

How Cilium Hubble Collects Flow Data

Hubble is the flow observability layer built into Cilium. Understanding how it works makes you able to reason about what it can and cannot see — and how to verify what it’s actually collecting.

Hubble’s architecture:

Kernel (per node)
├── TC eBPF programs on all pod veth interfaces
│     write flow events → BPF ringbuf
│
└── Hubble node agent (userspace)
      reads ringbuf
      enriches with pod metadata (Kubernetes API)
      exposes gRPC API

Cluster level
└── Hubble Relay
      aggregates per-node gRPC streams
      exposes single cluster-wide API

User tooling
└── hubble observe  /  Hubble UI  /  Prometheus exporter

The TC programs are writing raw packet events. The Hubble agent is the consumer that translates those events into Kubernetes-aware flow records — adding pod name, namespace, label, and policy verdict on top of the 5-tuple and TCP metadata the kernel provides.

To see what Hubble’s TC programs have attached:

# On any Cilium node
bpftool net list | grep lxc

# lxce4a1(23) clsact/ingress prog_id 61  ← Hubble flow program on pod interface ingress
# lxce4a1(23) clsact/egress  prog_id 62  ← Hubble flow program on pod interface egress
# lxcf7b2(31) clsact/ingress prog_id 63
# lxcf7b2(31) clsact/egress  prog_id 64

# Inspect one of those programs to confirm it's reading flow metadata
bpftool prog show id 61

# Output:
# 61: sched_cls  name tail_handle_nat  tag 3a8e2f1b4c7d9e0a  gpl
#     loaded_at 2026-04-22T09:13:45+0530  uid 0
#     xlated 2144B  jited 1382B  memlock 4096B  map_ids 24,31,38
#     btf_id 142

sched_cls is the BPF program type for TC — confirming these are TC-attached flow programs. map_ids 24,31,38 — those are the maps this program reads from and writes to. You can dump any of them:

bpftool map dump id 24 | head -40

# Output (connection tracking entry):
# [{
#     "key": {
#         "saddr": "10.244.1.5",        # ← source pod IP
#         "daddr": "172.16.4.23",        # ← destination IP
#         "sport": 48291,                # ← source port
#         "dport": 443,                  # ← destination port
#         "nexthdr": 6,                  # ← protocol: TCP
#         "flags": 3                     # ← CT_EGRESS | CT_ESTABLISHED
#     },
#     "value": {
#         "rx_packets": 14832,           # ← packets received
#         "tx_packets": 14831,           # ← packets sent
#         "rx_bytes": 3841024,           # ← bytes received
#         "tx_bytes": 3756288,           # ← bytes sent
#         "lifetime": 21600,             # ← seconds until entry expires
#         "rx_closing": 0,
#         "tx_closing": 0
#     }
# }]

That’s the ground truth. Not an APM span. Not a service mesh metric. The actual per-connection counters the kernel is maintaining for that 5-tuple.

Writing a Minimal Flow Observer with bpftrace

You don’t need Cilium or Hubble to get flow telemetry. bpftrace can produce it directly on any node with BTF:

# Persistent flow table: connections + packet counts for 2 minutes
bpftrace -e '
kprobe:tcp_sendmsg {
    $sk = (struct sock *)arg0;
    $daddr = ntop(AF_INET, $sk->__sk_common.skc_daddr);
    $dport = $sk->__sk_common.skc_dport >> 8;
    @flows[comm, $daddr, $dport] = count();
}
interval:s:30 { print(@flows); clear(@flows); }
' --timeout 120

Sample output (every 30 seconds):

@flows[curl, 93.184.216.34, 443]:         12    # curl → example.com:443
@flows[coredns, 10.96.0.10, 53]:          341   # CoreDNS upstream queries
@flows[payment-svc, 172.16.4.23, 443]:   1204   # payment service → gateway
@flows[nginx, 10.244.2.3, 8080]:          89    # nginx → upstream pod

For retransmit tracking specifically:

# Combined flow + retransmit watcher — runs until Ctrl-C
bpftrace -e '
kprobe:tcp_retransmit_skb {
    $sk = (struct sock *)arg0;
    $daddr = ntop(AF_INET, $sk->__sk_common.skc_daddr);
    @retx[comm, $daddr] = count();
}
kprobe:tcp_sendmsg {
    $sk = (struct sock *)arg0;
    $daddr = ntop(AF_INET, $sk->__sk_common.skc_daddr);
    @sends[comm, $daddr] = count();
}
interval:s:10 {
    printf("=== Retransmit ratio (last 10s) ===\n");
    print(@retx);
    print(@sends);
    clear(@retx);
    clear(@sends);
}
'

This gives you both the volume of sends and the retransmit count side by side — the ratio tells you whether retransmits are a rounding error (0.01%) or a signal (5%+).

⚠ Production Gotchas

Map size bounds matter. Connection tracking maps default to tens of thousands of entries. On nodes with high connection churn (serverless, short-lived batch jobs), maps can fill and start dropping new entries silently. Check bpftool map show id N for max_entries and monitor map utilization. Cilium exposes this as cilium_bpf_map_pressure in Prometheus.

Per-packet overhead on high-throughput interfaces. A TC program that fires on every packet on a 10Gbps interface processes millions of packets per second. Aggregating at the program level (count per 5-tuple rather than emit per packet) keeps overhead manageable — Cilium does this. A naive bpftrace one-liner that emits a perf event per packet will saturate the perf ring buffer under real load. Use ringbuf write paths or aggregate before emitting.

TC hook placement and direction confusion. Ingress TC on a pod’s veth (lxcXXXX) sees egress traffic from the pod’s perspective — because the host sees the packet arriving on the veth after the pod sent it. This reversal is consistent but confusing when you’re reading direction labels in flow records. EP08 covered this in detail for policy enforcement; the same asymmetry applies to flow data.

Retransmit counters reset on connection close. If you’re tracking retransmit totals for a long-lived connection, the count is stored in the kernel’s socket state and is cleared when the socket closes. For persistent tracking across reconnects, aggregate at the flow level in userspace before the connection closes.

Hubble flow visibility requires pod interfaces. Hubble only sees traffic that crosses a pod’s veth interface. Node-to-node traffic that doesn’t involve a pod (e.g., node SSH, kubelet-to-API-server on the node IP) is not captured by default. For host-level network observability, you need a TC program on the physical interface (eth0, ens3), not just on pod veth pairs.

Quick Reference

What you want to see	Command
What TC programs are attached	`bpftool net list`
Which maps a program uses	`bpftool prog show id N` (check `map_ids`)
Connection tracking entries	`bpftool map dump id N`
Retransmits per destination	`bpftrace -e 'kprobe:tcp_retransmit_skb { ... }'`
Flow counts per process	`bpftrace -e 'kprobe:tcp_sendmsg { @[comm, daddr] = count(); }'`
Hubble flow stream (Cilium)	`hubble observe --follow`
Hubble flows for one pod	`hubble observe --pod mynamespace/mypod --follow`
Verify map pressure	`bpftool map show id N` (check `max_entries` vs entries)

Kernel function	What it marks
`tcp_sendmsg`	Data being sent on a TCP socket
`tcp_recvmsg`	Data being received on a TCP socket
`tcp_retransmit_skb`	A segment being retransmitted
`tcp_send_reset`	RST being sent
`tcp_fin`	Connection teardown initiated
`tcp_connect`	New outbound TCP connection attempt

Key Takeaways

Network flow observability with eBPF attaches TC programs that record every connection event continuously — not sampled, not throttled, not filtered by what the application reports
Retransmit telemetry from tcp_retransmit_skb reveals congestion and endpoint failures that are structurally invisible to application-layer monitoring tools
Cilium Hubble, Pixie, and Retina are all eBPF flow exporters — they run TC programs, drain a ringbuf, enrich with Kubernetes metadata, and expose the result over an API
You can verify what any flow tool is actually collecting with bpftool net list, bpftool prog show, and bpftool map dump — four commands, no documentation needed
Map sizing and per-packet overhead are the two production concerns; aggregate at the kernel level, bound your maps, and monitor map pressure
The kernel’s connection tracking map is the ground truth. APM dashboards, service mesh metrics, and load balancer health checks are all interpretations of what that map contains

What’s Next

Flow observability tells you what connections exist. EP11 goes one level deeper: what names your pods are resolving those connections to. DNS is where a compromised workload first reveals itself — it queries a domain that has no business being queried from a production pod, and if you’re not watching the kernel-level DNS path, you won’t see it until after the damage.

DNS observability at the kernel level uses tracepoint hooks on the DNS syscall path — the same ground-truth approach as flow telemetry, but for name resolution: every query, every response, tied to the pod that made it, without deploying a sidecar.

Next: DNS observability at the kernel level — what your pods are actually resolving

Get EP11 in your inbox when it publishes → linuxcent.com/subscribe

bpftrace — Kernel Answers in One Line

May 13, 2026May 10, 2026 by Vamshi Krishna Santhapuri

Reading Time: 8 minutes

eBPF: From Kernel to Cloud, Episode 9
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP · TC eBPF · bpftrace**

Architecture Overview

bpftrace and eBPF Tracing — dynamic kernel observability showing probe types and output pipeline — bpftrace attaches probes at runtime — no recompilation, no restarts, full kernel visibility in one line.

TL;DR

bpftrace is an eBPF compiler, not a monitoring agent — every one-liner compiles, loads, runs, and cleans up a complete kernel program
(think of it like kubectl exec — but for asking the kernel a direct question, with no agent, no sidecar, no prior setup)
kretprobe and tracepoint cover most production debugging needs; use tracepoints for stability across kernel versions
The security use cases are unique: kernel-level observation that an attacker inside a container cannot suppress
Every connection, every file open, every process spawn — observable in real time with a single command, no prior instrumentation
Production caution: high-frequency probes on hot paths add overhead; filter by pid/comm, use --timeout, watch %si
Container PIDs are host-namespace PIDs in bpftrace — use curtask->real_parent->tgid to correlate to container activity

bpftrace turns any kernel question into a one-liner — compiling, loading, and attaching a complete eBPF program in seconds, with no agents, no restarts, and no prior instrumentation on the node. When something is wrong on a node right now and you don’t know where to look, it’s how you ask the kernel a direct question. That’s what EP09 is about.

Quick Check: Is bpftrace Available on Your Node?

Before the one-liner toolkit — verify bpftrace is installed and working on a cluster node:

# SSH into a worker node, then:
bpftrace --version
# bpftrace v0.19.0   ← any version ≥ 0.16 supports the patterns in this episode

# Verify BTF is available (required for struct access one-liners)
ls /sys/kernel/btf/vmlinux && echo "BTF available"

# The simplest possible one-liner — count syscalls for 5 seconds
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }' --timeout 5

Expected output (abridged):

Attaching 1 probe...

@[containerd]: 312
@[kubelet]:    841
@[node_exporter]: 203
@[sshd]:       47

Each line is a process name and how many syscalls it made in 5 seconds. If this runs and produces output, everything in this episode will work on your node.

Not on a self-managed node? EKS managed nodes and GKE nodes don’t have bpftrace pre-installed, but you can run it from a privileged debug pod: kubectl debug node/<node-name> -it --image=quay.io/iovisor/bpftrace. The tool runs on the host kernel — you get full kernel visibility even from a pod.

A node in production started showing elevated TCP latency — p99 at 180ms, where p99 was normally under 10ms. The application logs were clean. The APM dashboard showed nothing unusual at the service level. CPU, memory, disk: all normal. The load balancer health checks were passing.

I had 12 minutes before the on-call escalation would have gone to the application team and started a war room.

I ran one command:

bpftrace -e 'kretprobe:tcp_recvmsg { @bytes[comm] = hist(retval); }' --timeout 10

Ten seconds of sampling. The histogram output showed a single process — backup-agent — receiving 4MB chunks at irregular intervals. Not the application. Not the service mesh. A backup agent that runs at the infrastructure layer, saturating the receive path with large reads during its scheduled window.

Found in 9 seconds. War room averted.

What made that possible is something most engineers don’t know about bpftrace: that one-liner is not a monitoring query. It’s a complete eBPF program — compiled, loaded into the kernel, attached to the tcp_recvmsg kernel return probe, run, and cleaned up — all in ten seconds. bpftrace is a compiler that happens to have a very convenient command-line interface.

What bpftrace Actually Is

bpftrace is not a monitoring tool. It’s an eBPF compiler with a high-level scripting language designed for one-shot investigation.

When you run bpftrace -e 'kretprobe:tcp_recvmsg { ... }', this is what happens:

Your one-liner
      ↓
bpftrace's built-in LLVM/Clang frontend
      ↓
eBPF bytecode (.bpf.o in memory)
      ↓
Kernel verifier validates the program
      ↓
JIT compiler compiles to native machine code
      ↓
Program attaches to tcp_recvmsg kretprobe
      ↓
Runs until Ctrl-C or --timeout
      ↓
Output printed, maps freed, program detached

The kernel doesn’t know bpftrace wrote the program. It’s the same path as Falco, Cilium, Tetragon — kernel program loaded via the BPF syscall, verified, JIT-compiled, attached to a probe. bpftrace just wraps that entire process in a scripting language that takes 30 seconds to write instead of an afternoon.

This is why bpftrace can answer questions that no other tool can: it compiles to a kernel-level observer that fires on any event in the kernel, on any process, on any container — without any prior instrumentation.

The Four Probe Types You’ll Use Most

bpftrace supports 20+ probe types. These four cover 90% of production debugging:

kprobe / kretprobe — Kernel Functions

Attaches to the entry (kprobe) or return (kretprobe) of any kernel function. The most powerful probes for understanding what the kernel is actually doing.

# Fire on every call to tcp_connect — who's making new TCP connections?
bpftrace -e 'kprobe:tcp_connect { printf("%s PID %d connecting\n", comm, pid); }'

# On return from tcp_recvmsg — how large are the reads per process?
bpftrace -e 'kretprobe:tcp_recvmsg { @[comm] = hist(retval); }'

# Count calls to vfs_write per process (file write activity)
bpftrace -e 'kprobe:vfs_write { @[comm] = count(); }'

Limitation: kernel functions are internal and can change between kernel versions. Use tracepoints (below) for stability when you can.

kprobe instability: A function targeted by a kprobe can be inlined by the kernel compiler — the compiler embeds the function’s code at its call sites with no separate entry point. When that happens, the kprobe silently fires on nothing. Verify before relying on one: bpftrace -l 'kprobe:function_name' — empty response means it was inlined. Use a tracepoint equivalent instead.

tracepoint — Stable Kernel Trace Points

Tracepoints are stable, versioned hooks explicitly placed in the kernel source. Unlike kprobes, they are part of the kernel’s public interface and guaranteed not to disappear between versions. Use these for anything you need to work reliably across a fleet with mixed kernel versions.

# Every file open — process name + filename
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
    printf("%s %s\n", comm, str(args->filename));
}'

# Every outbound connect — process, destination IP and port
bpftrace -e 'tracepoint:syscalls:sys_enter_connect {
    printf("%-16s %-6d\n", comm, pid);
}'

# List all available tracepoints (hundreds)
bpftrace -l 'tracepoint:syscalls:*' | head -30

uprobe — Userspace Function Probes

Attaches to a specific function in a userspace binary or library. Useful for observing application behaviour without recompiling.

# What bash commands are being typed on this node?
bpftrace -e 'uprobe:/bin/bash:readline { printf("%s\n", str(arg0)); }'

# Python function calls
bpftrace -e 'uprobe:/usr/bin/python3:PyObject_Call { printf("Python call: pid %d\n", pid); }'

From a security standpoint: this is how you observe what an attacker is typing in an interactive shell they’ve obtained on your node — in real time, from the kernel, without touching the terminal session.

interval — Periodic Sampling

Runs a block of code on a fixed interval. Used for aggregation and periodic stats.

# Print the top file-opening processes every 5 seconds
bpftrace -e '
kprobe:vfs_open { @[comm] = count(); }
interval:s:5  { print(@); clear(@); }
'

The One-Liner Toolkit: Runnable Right Now

These run on any Linux node with BTF (kernel 5.8+, Ubuntu 20.04+, most managed K8s nodes):

# What files is every process opening right now? (30-second view)
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
    printf("%-16s %s\n", comm, str(args->filename));
}' --timeout 30

# Who is making DNS queries? (catches queries from any container, no sidecar needed)
bpftrace -e 'tracepoint:net:net_dev_xmit {
    if (args->skbaddr->protocol == 0x0800) printf("%s\n", comm);
}'

# Latency histogram for all read() syscalls — find the slow process
bpftrace -e '
tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_read  {
    $latency = nsecs - @start[tid];
    @latency[comm] = hist($latency);
    delete(@start[tid]);
}' --timeout 15

# Which process is using the most CPU right now? (99Hz sampling)
bpftrace -e 'profile:hz:99 { @[comm] = count(); }' --timeout 10

# Real-time syscall frequency — find unusual process activity
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm, args->id] = count(); }' --timeout 10 \
  | sort -k3 -rn | head -20

# New TCP connections in the last 30 seconds — source and dest
bpftrace -e 'kprobe:tcp_connect {
    $sk = (struct sock *)arg0;
    printf("%-16s → %s:%d\n", comm,
           ntop(AF_INET, $sk->__sk_common.skc_daddr),
           $sk->__sk_common.skc_dport >> 8);
}' --timeout 30

# What is a specific PID doing? (replace 12345)
bpftrace -e 'tracepoint:syscalls:sys_enter_openat /pid == 12345/ {
    printf("%s\n", str(args->filename));
}'

Each of these compiles and loads in under 2 seconds. They leave no persistent state. When they exit, the kernel reverts to exactly the state it was in before.

The Security Use Cases

Watching an Active Session

If you suspect a process is running commands you didn’t deploy:

# See every bash command on this node in real time
bpftrace -e 'uprobe:/bin/bash:readline { printf("%s %s\n", comm, str(arg0)); }'

# Every process spawn — PID, parent, command
bpftrace -e 'tracepoint:syscalls:sys_enter_execve {
    printf("%-6d %-6d %s\n", pid, curtask->real_parent->tgid, str(args->filename));
}'

This is the kernel-level version of watching /var/log/auth.log — except it can’t be suppressed by an attacker who has root, because the probe runs in kernel space. An attacker who has compromised a container with root inside the container cannot prevent a bpftrace program on the host from observing their syscalls.

Detecting Unexpected Network Activity

# Any process making a connection to a non-standard port
bpftrace -e 'kprobe:tcp_connect {
    $sk = (struct sock *)arg0;
    $port = $sk->__sk_common.skc_dport >> 8;
    if ($port != 80 && $port != 443 && $port != 53) {
        printf("%-16s port %d\n", comm, $port);
    }
}'

# DNS queries to non-standard resolvers (anything not on port 53)
bpftrace -e 'tracepoint:syscalls:sys_enter_sendto {
    if (args->addr->sa_family == 2) {
        printf("%-16s → %s\n", comm, str(args->addr));
    }
}'

Watching File Access on Sensitive Paths

# Any access to /etc/passwd, /etc/shadow, /root/
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
    if (str(args->filename) == "/etc/passwd" ||
        str(args->filename) == "/etc/shadow") {
        printf("%-16s PID %-6d opened %s\n", comm, pid, str(args->filename));
    }
}'

Production Gotchas

CPU overhead: bpftrace probes fire synchronously in the traced context. High-frequency probes on hot kernel paths (vfs_read, sys_enter_* without filtering) can add 10–20% overhead. Always test with --timeout and watch %si before running on a production node.

Maps grow unbounded by default: @[comm] = count() will accumulate an entry per unique comm value forever in the current session. Use clear(@) in an interval block, or set a key limit: @[comm] = count(); if (@[comm] > 100) { clear(@comm); }.

kprobe instability: Functions targeted by kprobes can be inlined by the compiler between kernel versions, making the probe silently ineffective. If a kprobe isn’t firing, verify the function exists: bpftrace -l 'kprobe:function_name'. If it returns nothing, the function was inlined. Use a tracepoint equivalent instead.

Container PIDs: PIDs inside a container are different from host PIDs. pid in bpftrace is the host namespace PID.

Container PID semantics: When a container shows PID 1 internally, the host kernel sees it as PID 8432 (or whatever was assigned). bpftrace’s pid built-in always gives you the host-namespace PID. To map a container’s PID to the host PID: cat /proc/<host-pid>/status | grep NSpid — the second value is the PID inside the container. Or use curtask->real_parent->tgid in your probe to walk the process tree. This matters when you filter by pid in a one-liner and get no output — you may be filtering on the container-namespace PID instead of the host one.

BTF requirement: bpftrace requires BTF for struct field access ($sk->__sk_common.skc_daddr). If BTF is unavailable, struct access fails. Check /sys/kernel/btf/vmlinux exists before running struct-access one-liners.

Quick Reference

Probe type	Syntax	Use for
kernel function entry	`kprobe:function_name`	Function arguments
kernel function return	`kretprobe:function_name`	Return value, latency
kernel tracepoint	`tracepoint:subsys:name`	Stable, versioned hooks
userspace function	`uprobe:/path/to/bin:function`	App-level observation
CPU sampling	`profile:hz:99`	Flamegraphs, hot code
interval	`interval:s:N`	Periodic aggregation
process start	`tracepoint:syscalls:sys_enter_execve`	New process detection

Built-in variable	Value
`pid`	Process ID (host namespace)
`tid`	Thread ID
`comm`	Process name (15 chars)
`nsecs`	Nanoseconds since boot
`curtask`	Pointer to `task_struct`
`retval`	Return value (kretprobe/tracepoint exit)
`args`	Probe arguments struct

Key Takeaways

bpftrace is an eBPF compiler, not a monitoring agent — every one-liner compiles, loads, runs, and cleans up a complete kernel program
kretprobe and tracepoint cover most production debugging needs; use tracepoints for stability across kernel versions
The security use cases are unique: kernel-level observation that an attacker inside a container cannot suppress, because the probe runs on the host in kernel space
Every connection, every file open, every process spawn — observable in real time with a single command, no prior instrumentation
Production caution: high-frequency probes on hot paths add overhead; filter by pid/comm, use --timeout, watch %si

What’s Next

bpftrace answers questions you ask in the moment. EP10 covers what happens when you need those answers continuously — not as a one-shot investigation tool, but as persistent telemetry recording every network connection across your entire cluster.

Flow observability from TC hooks is the always-on version: a persistent eBPF program recording every connection attempt, every retransmit, every dropped packet — the ground truth layer that everything above it interprets. When your APM says “timeout” and the kernel says “retransmit storm to one specific endpoint,” the kernel is right.

Next: network flow observability at the kernel level

Get EP10 in your inbox when it publishes → linuxcent.com/subscribe

TC eBPF — Pod-Level Network Policy Without iptables

May 13, 2026May 3, 2026 by Vamshi Krishna Santhapuri

Reading Time: 10 minutes

eBPF: From Kernel to Cloud, Episode 8
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP · TC eBPF**

Architecture Overview

TC eBPF and Cilium — traffic control hook architecture showing ingress/egress packet flow with sk_buff context — The TC hook runs inside the kernel network stack — Cilium uses it for identity-based policy enforcement.

TL;DR

TC eBPF fires after sk_buff allocation — it has socket metadata, cgroup ID, and pod identity that XDP lacks
(sk_buff = the kernel’s socket buffer, allocated for every packet; TC fires after this allocation, so it can read socket and process identity)
Direct action (DA) mode combines filter and action; the program’s return value is the packet fate
Multiple TC programs chain on the same hook ordered by priority — stale programs from Cilium upgrades cause silent policy conflicts
tc filter show dev <iface> ingress/egress is the primary inspection tool; bpftool net list shows the full node picture
XDP + TC is the Cilium data path: XDP for pre-stack service load balancing, TC for per-pod identity-based enforcement
TC can modify packet content (bpf_skb_store_bytes) — the basis for TC-based DNAT and packet mangling

TC eBPF is where Cilium implements pod-level network policy without iptables — the hook that fires after sk_buff allocation, where socket and cgroup context exist, making per-pod enforcement possible. The obvious follow-up to XDP is why Cilium doesn’t use it for everything — pod network policy, egress enforcement, the full NetworkPolicy ruleset. The answer reveals an inherent trade-off built into the Linux data path: XDP’s speed comes from running before any context exists. At the moment it fires, there is no socket, no cgroup, no way to tell which pod sent the packet. The moment you need pod identity, you need a hook that fires later — and pays for it.

A specific pod in production was experiencing intermittent TCP connection failures to an external service. Not all connections — roughly one in fifty. Kubernetes NetworkPolicy showed egress allowed for the namespace. Cilium policy status showed no violations. Running curl from inside the pod worked fine.

The application logs told a different story: connection timeouts at the 30-second mark, no SYN-ACK received. Not a DNS issue — I verified with tcpdump inside the pod namespace. SYN packets were leaving the pod network namespace. They weren’t making it onto the wire.

I ran bpftool net list on the node and saw two TC egress programs attached to that pod’s veth interface. One from the current Cilium version (installed six weeks ago). One from the previous version — from before the rolling upgrade. Two programs. Different policy epochs. The older one had a stale block rule that fired intermittently based on connection tuple patterns it was never designed to handle in the new policy model.

Without understanding TC eBPF — what programs attach where, how multiple programs interact, and how to inspect them — I would have kept chasing ghosts in the application layer.

Quick Check: Are There Stale TC Filters on Your Cluster?

The most common TC eBPF issue on production clusters — stale filters left behind by a Cilium upgrade — is a two-command check:

# SSH into a worker node, then pick any pod's veth interface:
ip link | grep lxc | head -5
# lxc8a3f21b@if7: ...
# lxc2c9d3e1@if9: ...

# Check TC filters on that interface
tc filter show dev lxc8a3f21b egress

Healthy output (one filter, one priority):

filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_to_container direct-action not_in_hw id 44

Stale filter present (two priorities = problem):

filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_to_container direct-action not_in_hw id 44
filter protocol all pref 2 bpf chain 0
filter protocol all pref 2 bpf chain 0 handle 0x1 old_cil_to_container direct-action not_in_hw id 17
#                  ^^^^^^ two different priorities = two programs running in sequence

Two priorities on the same hook means two programs running sequentially. If the older one has a stale DROP rule, packets are being dropped intermittently — and nothing in the application layer will tell you why.

Not running Cilium? If you’re on a non-Cilium CNI (Calico, Flannel, aws-vpc-cni), you likely won’t have TC eBPF filters on pod interfaces. Run tc filter show dev eth0 ingress on the node uplink instead to see if any TC programs are attached at the node level. An empty response is normal for non-Cilium clusters.

Why TC, Not XDP

EP07 covered XDP: fastest possible hook, fires before sk_buff, drops at line rate. If XDP is so fast, why doesn’t Cilium use it for everything?

Because XDP sees only raw packet bytes. No socket. No cgroup. No pod identity.

In Kubernetes, network policy is inherently about identity. “Allow pod A to connect to pod B on port 8080.” To enforce this, you need to know which pod a packet is coming from on egress — and which pod it’s going to on ingress. That mapping lives in the cgroup hierarchy and the socket buffer, neither of which exist at XDP time.

TC fires later in the packet lifecycle, after sk_buff is allocated and populated:

Ingress path:
  wire → NIC → [XDP hook] → sk_buff allocated → [TC ingress hook] → netfilter → socket

Egress path:
  socket → IP routing → [TC egress hook] → qdisc → NIC → wire

At the TC egress hook on a pod’s veth interface, the sk_buff carries the socket that created the packet — and from that socket you can read the cgroup ID. The cgroup hierarchy maps container → pod, so the TC program knows which pod this traffic belongs to. That’s what makes pod-level enforcement possible.

The Linux Traffic Control Architecture

tc (traffic control) is the Linux subsystem for managing packet queues and scheduling. Most Linux administrators know it as the bandwidth-shaping tool:

# Classic tc usage — rate limit an interface
tc qdisc add dev eth0 root tbf rate 100mbit burst 32kbit latency 400ms

The qdisc (queuing discipline) is the primary abstraction. Under the qdisc sits a filter layer — and the filter type relevant to eBPF is cls_bpf, which attaches eBPF programs as packet classifiers.

qdisc (queuing discipline) is the kernel’s packet scheduler for an interface — it controls how packets are buffered and in what order they leave. For eBPF policy enforcement, Cilium uses a special qdisc called clsact which has no buffering behaviour at all; it purely provides the ingress and egress hook points where eBPF filters attach. If a pod veth doesn’t have clsact, Cilium isn’t enforcing policy on that pod.

Cilium attaches cls_bpf filters in direct action (DA) mode, which combines classifier and action into a single eBPF program. The program’s return value is the packet fate directly:

Return value	Action
`TC_ACT_OK` (0)	Pass the packet
`TC_ACT_SHOT` (2)	Drop the packet
`TC_ACT_REDIRECT` (7)	Redirect to another interface
`TC_ACT_PIPE` (3)	Pass to the next filter in the chain

TC Context: What Your Program Can See

TC programs receive a struct __sk_buff — a safe, BPF-accessible projection of the kernel sk_buff. Unlike the raw packet bytes in XDP, __sk_buff includes metadata:

struct __sk_buff {
    __u32 len;           // packet length
    __u32 pkt_type;      // PACKET_HOST, PACKET_BROADCAST, etc.
    __u32 mark;          // skb->mark — used by Cilium for pod identity
    __u32 queue_mapping;
    __u32 protocol;      // ETH_P_IP, ETH_P_IPV6, etc.
    __u32 vlan_present;
    __u32 vlan_tci;
    __u32 vlan_proto;
    __u32 priority;
    __u32 ingress_ifindex;
    __u32 ifindex;
    __u32 tc_index;
    __u32 cb[5];
    __u32 hash;
    __u32 tc_classid;
    __u32 data;          // offset to packet data
    __u32 data_end;
    __u32 napi_id;
    __u32 family;
    __u32 remote_ip4;    // source IP (ingress) or dest IP (egress)
    __u32 local_ip4;
    __u32 remote_port;
    __u32 local_port;
    // ...
};

skb->mark is how Cilium passes pod identity between its hook points.

skb->mark is a 32-bit field in every sk_buff that any kernel subsystem can read or write. It’s a general-purpose scratch field — iptables uses it, routing rules use it, and Cilium uses it to carry pod security identity from the socket hook through to TC enforcement. When Cilium stamps a pod’s identity into skb->mark at connection time, every subsequent TC filter on that packet’s path can read it without another identity lookup. The socket-level cgroup hook (cgroup_sock_addr) stamps the cgroup-derived pod identity into skb->mark when the socket calls connect(). By the time the packet reaches the TC egress hook, skb->mark carries the pod’s security identity — and the TC program uses it for policy enforcement.

What Cilium’s TC Filters Actually Do

The TC filter on each pod’s veth is Cilium’s enforcement point for Kubernetes NetworkPolicy. The mechanism:

When a pod opens a connection, a cgroup_sock_addr hook stamps the pod’s security identity (derived from its labels + namespace) into skb->mark
The TC egress filter on the veth reads skb->mark, looks up the pod identity + destination in the policy map, and returns TC_ACT_SHOT (drop) or TC_ACT_OK (pass)
The TC ingress filter on the receiving pod’s veth does the same check for inbound traffic

The policy map is a BPF LRU hash keyed on {pod_identity, dst_ip, dst_port, protocol}. This is what cilium policy get reads from — and what bpftool map dump shows directly:

# Find Cilium's policy maps
bpftool map list | grep -i policy

# Dump the active policy entries for a specific endpoint
# Get endpoint ID from: cilium endpoint list
cilium bpf policy get <endpoint-id>

# Cross-check with raw bpftool dump
bpftool map dump id <POLICY_MAP_ID> | head -30

The clsact qdisc is the prerequisite for any TC eBPF filter — it creates the ingress and egress hook points without any queuing behavior. Every pod veth on a Cilium node has one:

tc qdisc show dev lxcABCDEF
# qdisc clsact ffff: dev lxcABCDEF parent ffff:fff1
# ^^^^^^^^^^^^ this line confirms Cilium's hook points exist on this pod's veth
# If this is missing: Cilium is NOT enforcing NetworkPolicy on this pod

If a pod veth doesn’t show clsact, Cilium isn’t enforcing policy on that pod.

Multiple Programs and the Filter Chain

This is the detail that caused my production incident.

TC supports chaining multiple filters on the same hook, ordered by priority. Lower priority number runs first. When Cilium upgrades, it installs a new filter at a new priority before removing the old one. If the upgrade procedure has any timing gap — or if the removal step fails silently — you end up with two programs running in sequence.

# Show all TC filters on a pod's veth — both priorities visible
tc filter show dev lxc12345 egress

# Example output with a stale filter:
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_to_container direct-action not_in_hw id 44
filter protocol all pref 2 bpf chain 0
filter protocol all pref 2 bpf chain 0 handle 0x1 old_cil_to_container direct-action not_in_hw id 17

Two programs. Pref 1 runs first. Pref 2 runs second — unless pref 1 returned TC_ACT_SHOT, in which case the packet is already dropped and pref 2 never fires.

In my incident: pref 1 was the current Cilium version with correct policy, returning TC_ACT_OK for the traffic in question. Pref 2 was the old version with a stale block entry, returning TC_ACT_SHOT for a subset of connection tuples. Because TC_ACT_OK passes to the next filter in the chain (TC_ACT_PIPE would do the same), pref 2 got to run — and intermittently dropped packets.

The fix:

# Remove the stale filter by priority
tc filter del dev lxc12345 egress pref 2

# Verify only the current filter remains
tc filter show dev lxc12345 egress

This should be part of any post-upgrade verification for Cilium-managed clusters.

How Cilium Uses TC Across the Full Node

Cilium’s TC deployment on a node:

Pod veth (host-side, lxcXXXXX):
  TC ingress: cil_from_container — L3/L4 policy on the pod's outbound traffic
  TC egress:  cil_to_container   — L3/L4 policy on traffic arriving at the pod

Node uplink (eth0):
  TC ingress: cil_from_netdev    — traffic arriving from outside the node
  TC egress:  cil_to_netdev      — traffic leaving the node

XDP on eth0:
  cil_xdp_entry — pre-stack service load balancing (DNAT for ClusterIP)

The naming is counterintuitive at first: cil_from_container is attached to the TC ingress hook on the veth.

Veth direction confusion: TC ingress/egress is named from the kernel’s perspective of the interface, not the pod’s. The host-side veth interface receives traffic that the pod is sending — so TC ingress on the host veth = the pod’s outbound traffic. This trips up everyone the first time. When debugging, always confirm direction with tc filter show dev lxcXXX ingress and egress separately, and check which Cilium program name is attached (cil_from_container = pod outbound, cil_to_container = pod inbound). The veth ingress direction from the host perspective is traffic flowing out of the container. Traffic leaving the pod hits the host-side veth ingress, which is cil_from_container. It enforces egress policy for the pod. Naming follows the kernel’s perspective of the interface, not the application’s.

To see the full picture on a node:

# All eBPF network programs (XDP and TC) across all interfaces
bpftool net list

# TC-specific view
for iface in $(ip link | grep lxc | awk -F': ' '{print $2}'); do
    echo "=== $iface ==="
    tc filter show dev $iface ingress
    tc filter show dev $iface egress
done

TC Can Modify Packets Too

Unlike XDP, TC programs have full access to the sk_buff and can modify packet content — headers, payload, and checksums. This is how TC-based DNAT works in Cilium when XDP isn’t available on the NIC: the program rewrites the destination IP at L3 and updates the IP + transport checksums atomically. The kernel BPF helper handles the checksum recalculation.

From an operational standpoint: if you see a TC program attached but expected traffic is being redirected rather than dropped, the program is likely doing DNAT. bpftool prog dump xlated id <ID> shows the disassembled instructions and will reveal bpf_skb_store_bytes calls if packet rewriting is happening.

Debugging TC Programs in Production

Workflow I follow when investigating network issues on Cilium clusters:

# 1. List all eBPF network programs (see the full picture)
bpftool net list

# 2. Check specific interface for stale TC filters
tc filter show dev lxcABCDEF ingress
tc filter show dev lxcABCDEF egress

# 3. Inspect a specific program
bpftool prog show id 44

# 4. Disassemble a program (last resort for understanding behavior)
bpftool prog dump xlated id 44

# 5. Check Cilium's view of the same interface
cilium endpoint list
cilium endpoint get <endpoint-id>

# 6. Enable verbose TC program logs (debug builds only)
# Cilium: set CILIUM_DEBUG=true in the deployment

Common Mistakes

Mistake	Impact	Fix
Not checking for stale TC filters after Cilium upgrades	Conflicting policy programs cause intermittent drops	Run `tc filter show` post-upgrade; remove stale by priority
Confusing ingress/egress direction on veth interfaces	Policy applied to wrong traffic direction	TC ingress on host-side veth = pod’s outbound traffic
Attaching TC without `clsact` qdisc	Filter attachment fails	`tc qdisc add dev <iface> clsact` before filter add
Using `TC_ACT_OK` when you want to stop the chain	Subsequent filters still run	Use `TC_ACT_OK` knowing the chain continues; use `TC_ACT_REDIRECT` or explicit `TC_ACT_SHOT` only
Expecting TC performance equal to XDP	TC has sk_buff overhead — it’s slower	Right tool: XDP for pre-stack bulk drops, TC for identity-aware policy
Hardcoding `skb->mark` interpretation	Different tools use mark differently	Document mark field usage clearly; coordinate between Cilium and custom programs

Key Takeaways

TC eBPF fires after sk_buff allocation — it has socket metadata, cgroup ID, and pod identity that XDP lacks
Direct action (DA) mode combines filter and action; the program’s return value is the packet fate
Multiple TC programs chain on the same hook ordered by priority — stale programs from Cilium upgrades cause silent policy conflicts
tc filter show dev <iface> ingress/egress is the primary inspection tool; bpftool net list shows the full node picture
XDP + TC is the Cilium data path: XDP for pre-stack service load balancing, TC for per-pod identity-based enforcement
TC can modify packet content (bpf_skb_store_bytes) — the basis for TC-based DNAT and packet mangling

What’s Next

EP08 closes out the kernel machinery arc: program types, maps, CO-RE, XDP, TC. Five episodes on the engine under the tools. EP09 shifts from understanding the machinery to using it interactively.

bpftrace turns kernel knowledge into one-liners you can run on a live production node. Which process is touching this file right now? Where is this latency spike originating in the kernel call stack? Which container is making DNS queries to an unexpected resolver? Under 10 seconds per question — no restart, no sidecar, no instrumentation change.

Every bpftrace one-liner is a complete eBPF program compiled, loaded, run, and cleaned up on the fly. EP09 covers how that works and why it changes the way you investigate production incidents.

Next: bpftrace — kernel answers in one line

Get EP09 in your inbox when it publishes → linuxcent.com/subscribe

XDP — Packets Processed Before the Kernel Knows They Arrived

May 13, 2026April 21, 2026 by Vamshi Krishna Santhapuri

Reading Time: 10 minutes

eBPF: From Kernel to Cloud, Episode 7
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP**

14 min read

Introduction

EP01 through EP06 covered what eBPF is, how the verifier keeps it safe, and how the toolchain compiles and loads programs across kernel versions. This episode is where that foundation meets production networking.

XDP — eXpress Data Path — is the earliest hook in the Linux kernel packet path. It fires before sk_buff allocation, before routing, before netfilter. A DROP decision at XDP costs one bounds check and a return value. Everything else is skipped. At 1 million packets per second, that difference shows up directly as CPU.

This episode explains where XDP sits, what it can and cannot see, how Cilium uses it, and what every Kubernetes operator needs to know about it — even if they never write an eBPF program.

TL;DR
Quick Check: Is XDP Running on Your Cluster?
Where XDP Sits in the Kernel Data Path
XDP Modes
The XDP Context: What Your Program Can See
What This Means on Your Cluster Right Now
XDP Metadata: Cooperating with TC
How Cilium Uses XDP
Operational Inspection
Common Mistakes
Key Takeaways

Architecture Overview

XDP Pre-Stack Packet Hook — eBPF kernel data path diagram showing where XDP fires before sk_buff allocation — XDP fires before `sk_buff` allocation — the earliest possible kernel hook for zero-copy packet processing.

TL;DR

XDP fires before sk_buff allocation — the earliest possible kernel hook for packet processing
(sk_buff = the kernel’s socket buffer — every normal packet requires one to be allocated, which adds up fast at scale)
Three modes: native (in-driver, full performance), generic (fallback, no perf gain), offloaded (NIC ASIC)
XDP context is raw packet bytes — no socket, no cgroup, no pod identity; handle non-IP traffic explicitly
Every pointer dereference requires a bounds check against data_end — the verifier enforces this
BPF_MAP_TYPE_LPM_TRIE is the right map type for IP prefix blocklists — handles /32 hosts and CIDRs together
XDP metadata area enables coordination with TC programs — classify at XDP speed, enforce with pod context at TC

Quick Check: Is XDP Running on Your Cluster?

Before the data path walkthrough — a two-command check you can run right now on any cluster node:

# SSH into a worker node, then:
bpftool net list

On a Cilium-managed node, you’ll see something like:

eth0 (index 2):
        xdpdrv  id 44

lxc8a3f21b (index 7):
        tc ingress id 47
        tc egress  id 48

Reading the output:
– xdpdrv — XDP in native mode, running in the NIC driver before sk_buff (this is what you want)
– xdpgeneric instead of xdpdrv — generic mode, runs after sk_buff allocation, no performance benefit
– No XDP line at all — XDP not deployed; your CNI uses iptables for service forwarding

If you’re on EKS with aws-vpc-cni or GKE with kubenet, you likely won’t see XDP here — those CNIs use iptables. Understanding this section explains why teams migrating to Cilium see lower node CPU under the same traffic load.

Where XDP Sits in the Kernel Data Path

A client’s cluster was under a SYN flood — roughly 1 million packets per second from a rotating set of source IPs. We had iptables DROP rules installed within the first ten minutes, blocklist updated every 30 seconds as new source ranges appeared. The flood traffic dropped in volume. But node CPU stayed high. The %si column in top — software interrupt time — was sitting at 25–30%.

%si in top is the percentage of CPU time spent handling hardware interrupts and kernel-level packet processing — separate from your application’s CPU usage. On a quiet managed cluster (EKS, GKE) this is usually under 1%. Under a packet flood, high %si means the kernel is burning cycles just receiving packets, before your workloads run at all. It’s the metric that tells you the problem is below the application layer.

The iptables rules were matching. Packets were being dropped. CPU was still burning. The answer is where in the kernel the drop was happening. iptables fires inside the netfilter framework — after the kernel has already allocated an sk_buff for the packet, done DMA from the NIC ring buffer, and traversed several netfilter hooks. At 1Mpps, the allocation cost alone is measurable.

XDP fires before any of that.

The standard Linux packet receive path:

NIC hardware
  ↓
DMA to ring buffer (kernel memory)
  ↓
[XDP hook — fires here, before sk_buff]
  ├── XDP_DROP   → discard, zero further allocation
  ├── XDP_PASS   → continue to kernel network stack
  ├── XDP_TX     → transmit back out the same interface
  └── XDP_REDIRECT → forward to another interface or CPU
  ↓
sk_buff allocated from slab allocator
  ↓
netfilter: PREROUTING
  ↓
IP routing decision
  ↓
netfilter: INPUT or FORWARD
  ↓  [iptables fires somewhere in here]
socket receive queue
  ↓
userspace application

XDP runs at the driver level, in the NAPI poll context — the same context where the driver is processing received packets off the ring buffer. The program runs before the kernel’s general networking code gets involved. There’s no sk_buff, no reference counting, no slab allocation.

NAPI (New API) is how modern Linux receives packets efficiently. Instead of one CPU interrupt per packet (catastrophically expensive at 1Mpps), the NIC fires a single interrupt, then the kernel polls the NIC ring buffer in batches until it’s drained. XDP runs inside this polling loop — as close to the hardware as software gets without running on the NIC itself.

At 1Mpps, the difference between XDP_DROP and an iptables DROP is roughly the cost of allocating and then immediately freeing 1 million sk_buff objects per second — plus netfilter traversal, connection tracking lookup, and the DROP action itself. That’s the CPU time that was burning.

After moving the blocklist to an XDP program, the %si on the same traffic load dropped from 28% to 3%.

XDP Modes

XDP operates in three modes, and which one you get depends on your NIC driver.

Native XDP (XDP_FLAGS_DRV_MODE)

The eBPF program runs directly in the NIC driver’s NAPI poll function — in interrupt context, before sk_buff. This is the only mode that delivers the full performance benefit.

Driver support is required. The widely supported drivers: mlx4, mlx5 (Mellanox/NVIDIA), i40e, ice (Intel), bnxt_en (Broadcom), virtio_net (KVM/QEMU), veth (containers). Check support:

# Verify native XDP support on your driver
ethtool -i eth0 | grep driver
# driver: mlx5_core  ← supports native XDP

# Load in native mode
ip link set dev eth0 xdpdrv obj blocklist.bpf.o sec xdp

The veth driver supporting native XDP is what makes XDP meaningful inside Kubernetes pods — each pod’s veth interface can run an XDP program at wire speed.

Generic XDP (XDP_FLAGS_SKB_MODE)

Fallback for drivers that don’t support native XDP. The program still runs, but it runs after sk_buff allocation, as a hook in the netif_receive_skb path. No performance benefit over early netfilter. sk_buff is still allocated and freed for every packet.

# Generic mode — development and testing only
ip link set dev eth0 xdpgeneric obj blocklist.bpf.o sec xdp

Use this for development on a laptop with a NIC that lacks native XDP support. Never benchmark with it and never use it in production expecting performance gains.

Offloaded XDP

Runs on the NIC’s own processing unit (SmartNIC). Zero CPU involvement — the XDP decision happens in NIC hardware. Supported by Netronome Agilio NICs. Rare in production, but the theoretical ceiling for XDP performance.

The XDP Context: What Your Program Can See

XDP programs receive one argument: struct xdp_md.

struct xdp_md {
    __u32 data;           // offset of first packet byte in the ring buffer page
    __u32 data_end;       // offset past the last byte
    __u32 data_meta;      // metadata area before data (XDP metadata for TC cooperation)
    __u32 ingress_ifindex;
    __u32 rx_queue_index;
};

data and data_end are used as follows:

void *data     = (void *)(long)ctx->data;
void *data_end = (void *)(long)ctx->data_end;

// Every pointer dereference must be bounds-checked first
struct ethhdr *eth = data;
if ((void *)(eth + 1) > data_end)
    return XDP_PASS;  // malformed or truncated packet

The verifier enforces these bounds checks — every pointer derived from ctx->data must be validated before use. The error invalid mem access 'inv' means you dereferenced a pointer without checking the bounds. This is the most common cause of XDP program rejection.

For operators (not writing XDP code): You’ll see invalid mem access 'inv' in logs when an eBPF program is rejected at load time — most commonly during a Cilium upgrade or a custom tool deployment on a kernel the tool wasn’t built for. The fix is in the eBPF source or the tool version, not the cluster config.

What XDP cannot see:
– Socket state — no socket buffer exists yet
– Cgroup hierarchy — no pod identity
– Process information — no PID, no container
– Connection tracking state (unless you maintain it yourself in a map)

XDP is ingress-only. It fires on packets arriving at an interface, not departing. For egress, TC is the hook.

What This Means on Your Cluster Right Now

Every Cilium-managed node has XDP programs running. Here’s how to see them:

# All XDP programs on all interfaces — this is the full picture
bpftool net list
# Sample output on a Cilium node:
#
# eth0 (index 2):
#         xdpdrv  id 44         ← XDP in native mode on the node uplink
#
# lxc8a3f21b (index 7):
#         tc ingress id 47      ← TC enforces NetworkPolicy on pod ingress
#         tc egress  id 48      ← TC enforces NetworkPolicy on pod egress
#
# "xdpdrv"     = native mode (runs in NIC driver, before sk_buff — full performance)
# "xdpgeneric" = fallback mode (after sk_buff — no performance benefit over iptables)

# Which mode is active?
ip link show eth0 | grep xdp
# xdp mode drv  ← native (full performance)
# xdp mode generic  ← fallback (no perf benefit)

# Details on the XDP program ID
bpftool prog show id $(bpftool net show dev eth0 | grep xdp | awk '{print $NF}')
# Shows: loaded_at, tag, xlated bytes, jited bytes, map IDs

The map IDs in that output are the BPF maps the XDP program is using — typically the service VIP table for DNAT, and in security tools, the blocklist or allowlist. To see what’s in them:

# List maps used by the XDP program
bpftool prog show id <PROG_ID> | grep map_ids

# Dump the service map (for a Cilium node — this is the load balancer table)
bpftool map dump id <MAP_ID> | head -40

For a blocklist scenario — like the SYN flood mitigation above — the BPF_MAP_TYPE_LPM_TRIE is the standard data structure. A lookup for 192.168.1.45 hits a 192.168.1.0/24 entry in the same map, handling both host /32s and CIDR ranges in one lookup.

# Count entries in an XDP filter map
bpftool map dump id <BLOCKLIST_MAP_ID> | grep -c "key"

# Verify XDP is active and inspect program details
bpftool net show dev eth0

XDP Metadata: Cooperating with TC

Think of it as a sticky note attached to the packet. XDP writes the note at line speed (no context about pods or sockets). TC reads it later when full context is available, and acts on it. The packet carries the note between them.

More precisely: XDP can write metadata into the area before ctx->data — a small scratch space that survives as the packet moves from XDP to the TC hook. This is the coordination mechanism between the two eBPF layers.

The pattern: XDP classifies at speed (no sk_buff overhead), TC enforces with pod context (where you have socket identity). XDP writes a classification tag into the metadata area. TC reads it and makes the policy decision.

From an operational standpoint, when you see two eBPF programs on the same interface (one XDP, one TC), this pipeline is the likely explanation:

bpftool net list
# xdpdrv id 44 on eth0       ← XDP classifier running at line rate
# tc ingress id 47 on eth0   ← TC enforcer reading XDP metadata

How Cilium Uses XDP

Not running Cilium? On EKS with aws-vpc-cni or GKE with kubenet, service forwarding uses iptables NAT rules and conntrack instead. You can see this with iptables -t nat -L -n on a node — look for the KUBE-SVC-* chains. Those chains are what XDP replaces in a Cilium cluster. This is why teams migrating from kube-proxy to Cilium report lower node CPU at high connection rates — it’s not magic, it’s hook placement.

On a Cilium node, XDP handles the load balancing path for ClusterIP services. When a packet arrives at the node destined for a ClusterIP:

XDP program checks the destination IP against a BPF LRU hash map of known service VIPs
On a match, it performs DNAT — rewriting the destination IP to a backend pod IP
Returns XDP_TX or XDP_REDIRECT to forward directly

No iptables NAT rules. No conntrack state machine. No socket buffer allocation for the routing decision. The lookup is O(1) in a BPF hash map.

# See Cilium's XDP program on the node uplink
ip link show eth0 | grep xdp
# xdp  (attached, native mode)

# The XDP program details
bpftool prog show pinned /sys/fs/bpf/cilium/xdp

# Load time, instruction count, JIT-compiled size
bpftool prog show id $(bpftool net list | grep xdp | awk '{print $NF}')

At production scale — 500+ nodes, 50k+ services — removing iptables from the service forwarding path with XDP reduces per-node CPU utilization measurably. The effect is most visible on nodes handling high connection rates to cluster services.

Operational Inspection

# All XDP programs on all interfaces
bpftool net list

# Check XDP mode (native, generic, offloaded)
ip link show | grep xdp

# Per-interface stats — includes XDP drop/pass counters
cat /sys/class/net/eth0/statistics/rx_dropped

# XDP drop counters exposed via bpftool
bpftool map dump id <stats_map_id>

# Verify XDP is active and show program details
bpftool net show dev eth0

Common Mistakes

Mistake	Impact	Fix
Missing bounds check before pointer dereference	Verifier rejects: “invalid mem access”	Always check `ptr + sizeof(*ptr) > data_end` before use
Using generic XDP for performance testing	Misleading numbers — sk_buff still allocated	Test in native mode only; check `ip link` output for mode
Not handling non-IP traffic (ARP, IPv6, VLAN)	ARP breaks, IPv6 drops, VLAN-tagged frames dropped	Check `eth->h_proto` and return `XDP_PASS` for non-IP
XDP for egress or pod identity	No socket context at XDP; XDP is ingress only	Use TC egress for pod-identity-aware egress policy
Forgetting `BPF_F_NO_PREALLOC` on LPM trie	Full memory allocated at map creation for all entries	Always set this flag for sparse prefix tries
Blocking ARP by accident in a /24 blocklist	Loss of layer-2 reachability within the blocked subnet	Separate ARP handling before the IP blocklist check

Key Takeaways

XDP fires before sk_buff allocation — the earliest possible kernel hook for packet processing
Three modes: native (in-driver, full performance), generic (fallback, no perf gain), offloaded (NIC ASIC)
XDP context is raw packet bytes — no socket, no cgroup, no pod identity; handle non-IP traffic explicitly
Every pointer dereference requires a bounds check against data_end — the verifier enforces this
BPF_MAP_TYPE_LPM_TRIE is the right map for IP prefix blocklists — handles /32 hosts and CIDRs together
XDP metadata area enables coordination with TC programs — classify at XDP speed, enforce with pod context at TC

What’s Next

XDP handles ingress at the fastest possible point but has no visibility into which pod sent a packet. EP08 covers TC eBPF — the hook that fires after sk_buff allocation, where socket and cgroup context exist.

TC is how Cilium implements pod-to-pod network policy without iptables. It’s also where stale programs from failed Cilium upgrades leave ghost filters that cause intermittent packet drops. Knowing how TC programs chain — and how to find and remove stale ones — is a specific, concrete operational skill.

Next: TC eBPF — pod-level network policy without iptables

Get EP08 in your inbox when it publishes → linuxcent.com/subscribe

CO-RE and libbpf — Write Once, Run on Any Kernel

May 13, 2026April 19, 2026 by Vamshi Krishna Santhapuri

Reading Time: 9 minutes

eBPF: From Kernel to Cloud, Episode 6
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf**

Architecture Overview

CO-RE and libbpf — portable eBPF program architecture using BTF for kernel-version independence — CO-RE relocations + BTF let a single eBPF binary run across different kernel versions without recompilation.

TL;DR

Kernel structs change between releases — hardcoded offsets break across patch versions, not just major releases
BTF embeds full type information in the kernel at /sys/kernel/btf/vmlinux; CO-RE uses it to patch field accesses at load time
(BTF = BPF Type Format — a compact description of every struct, field, and byte offset in the running kernel, built into the kernel image)
vmlinux.h, generated from BTF, replaces all kernel headers with a single file committed to your repository
BPF_CORE_READ() is the CO-RE macro — every kernel struct access in a portable program goes through it
libbpf skeleton generation (bpftool gen skeleton) eliminates manual fd management for map and program lifecycle
For production tools: libbpf + CO-RE. For one-off debugging: bpftrace. For prototyping: BCC.

eBPF CO-RE (Compile Once, Run Everywhere) solves the kernel portability problem — the reason Cilium and Falco survive kernel upgrades without recompilation. What maps assumed — quietly — is that the kernel structs those programs read look the same tomorrow as they do today. They don’t. The Linux kernel has no stable ABI for internal data structures. task_struct, sk_buff, sock — the fields eBPF programs read constantly — can shift between patch releases, not just major versions. I learned this the hard way when a routine upgrade from 5.15.0-89 to 5.15.0-91 — two patch revisions — silently broke a custom tracer I’d been running in production for six months.

Six months after deploying a custom eBPF tracer for a client — it detected specific syscall patterns that Falco’s default ruleset didn’t cover — they ran a routine Ubuntu patch upgrade. Not a major kernel version jump. 5.15.0-89 to 5.15.0-91. Two patch revisions.

The tracer stopped loading. The error was invalid indirect read from stack. I opened the program source: nothing remotely like an indirect read. The program was a straightforward tracepoint handler, maybe 40 lines of C.

Three hours of debugging led to a four-byte offset difference. The struct task_struct had a field alignment change between the two patch versions. My program accessed ->comm at a hardcoded byte offset. On 5.15.0-89 that offset was 0x620. On 5.15.0-91 it was 0x624. The verifier caught the misalignment — correctly — and rejected the program.

I had compiled the eBPF bytecode against a fixed kernel header snapshot. The binary was not portable. Every time the kernel moved a struct field, the tool broke.

CO-RE is the solution to this.

Quick Check: Does Your Cluster Support CO-RE?

Two commands — check whether your nodes have the BTF support that CO-RE tools require:

# SSH into a worker node, then:
ls -la /sys/kernel/btf/vmlinux && echo "BTF available — CO-RE tools will work"

Expected output on a supported node:

-r--r--r-- 1 root root 4956234 Apr 21 00:00 /sys/kernel/btf/vmlinux
BTF available — CO-RE tools will work

If the file is missing: CO-RE tools (Cilium, Falco, Tetragon) will fall back to legacy BCC compilation mode — which requires a full compiler toolchain and kernel headers installed on every node.

# Confirm the kernel was built with BTF enabled
cat /boot/config-$(uname -r) | grep CONFIG_DEBUG_INFO_BTF
# CONFIG_DEBUG_INFO_BTF=y  ← required for CO-RE

Why Kernel Structs Change and Why It Matters

The Linux kernel has no stable ABI for internal data structures. task_struct, sock, sk_buff, file — the structs that eBPF programs read constantly — change between releases.

ABI (Application Binary Interface) is the contract that says a compiled binary built against version N will still work against version N+1 without recompilation. The Linux kernel maintains a stable ABI for syscalls (open(), read(), connect()) but makes no such guarantee for internal structs. Fields move, get added, get renamed between patch releases — and any program with hardcoded offsets silently breaks. Field additions, reordering, alignment changes, struct embedding changes. The kernel developers are under no obligation to preserve internal layouts, and they don’t.

Before CO-RE, eBPF programs dealt with this in two ways:

BCC (BPF Compiler Collection) — compile the eBPF C code at runtime on the target host, using that system’s kernel headers. No portability problem because compilation happens on the machine you’re deploying to. Cost: you need a full compiler toolchain, kernel headers, and Python runtime on every production node. Startup time in seconds. Container image size in hundreds of MB. For a security tool that should be lightweight and fast-starting, this is a non-starter.

Per-kernel compiled binaries — ship different builds for each supported kernel version, detect at runtime, load the matching binary. Falco maintained this model for years. The operational overhead is significant: a matrix of kernel × distro × version with separate build and test pipelines for each combination.

CO-RE is the third option. Compile once on a build machine, and let libbpf patch struct field accesses at load time on the target system, using type information embedded in the running kernel.

BTF: The Type System That Makes CO-RE Possible

BTF (BPF Type Format) is compact type debug information embedded directly into the kernel image. Since Linux 5.2, kernels built with CONFIG_DEBUG_INFO_BTF=y expose their full type information at /sys/kernel/btf/vmlinux.

# Verify BTF is available
ls -la /sys/kernel/btf/vmlinux

# Inspect the BTF for a specific struct
bpftool btf dump file /sys/kernel/btf/vmlinux format raw | grep -A 5 'task_struct'

# See the actual field offsets the running kernel uses
bpftool btf dump file /sys/kernel/btf/vmlinux format c | grep -A 20 'struct task_struct {'

BTF encodes every struct definition with field names, types, and byte offsets. When libbpf loads an eBPF program compiled with CO-RE relocations, it reads both the BTF the program was compiled against (embedded in the .bpf.o file) and the BTF of the running kernel. If task_struct->comm has moved, libbpf patches the field access instruction before loading the program.

This patching happens at load time, transparently, without modifying the binary you shipped.

CO-RE relocation is the mechanism behind this. When a CO-RE program is compiled, it embeds metadata saying “I need the offset of comm inside task_struct” rather than hardcoding 0x620. At load time, libbpf reads this relocation, looks up the real offset from the running kernel’s BTF, and patches the instruction. For operators: this is why Cilium and Falco survive kernel upgrades without you reinstalling them.

Most distribution kernels now ship with BTF enabled:

# Ubuntu 20.04+ (kernel 5.4+)
cat /boot/config-$(uname -r) | grep CONFIG_DEBUG_INFO_BTF
# CONFIG_DEBUG_INFO_BTF=y

# Check at runtime
file /sys/kernel/btf/vmlinux
# /sys/kernel/btf/vmlinux: symbolic link to /sys/kernel/btf/vmlinux

Amazon Linux 2023, Ubuntu 22.04, Debian 11+, RHEL 8.2+, and most cloud-provider-managed kernels have BTF. The notable exception: RHEL 7 and Amazon Linux 2 on older kernels.

The CO-RE Toolchain

The build pipeline for a CO-RE eBPF program:

Development machine:
  vmlinux.h (generated from kernel BTF)
       ↓
  myprog.bpf.c ──── clang -target bpf -g ────→ myprog.bpf.o
  (CO-RE relocations embedded in BTF section)
       ↓
  bpftool gen skeleton myprog.bpf.o ─────────→ myprog.skel.h
       ↓
  myprog.c (userspace) ── gcc ──→ myprog
  (statically links libbpf, skeleton handles load/attach/cleanup)

Target machine (any kernel with BTF, 5.4+):
  ./myprog
  ↓ libbpf reads /sys/kernel/btf/vmlinux
  ↓ patches field accesses to match current kernel struct layout
  ↓ verifier validates patched program
  ↓ program loads and runs

One binary. Any supported kernel. No compiler on the target system.

vmlinux.h — One Header to Replace Them All

Before CO-RE, eBPF C programs included dozens of kernel headers — linux/sched.h, linux/net.h, linux/fs.h, linux/socket.h — and they had to match the exact kernel version you were targeting.

vmlinux.h is generated from the BTF of a running kernel. It contains every struct, enum, typedef, and macro definition the kernel exposes through BTF — in a single file, without any compile-time kernel dependency.

# Generate vmlinux.h from the running kernel
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h

# Typical size
wc -l vmlinux.h
# 350000+

You commit vmlinux.h to your repository, generated from a representative kernel. CO-RE handles the actual layout differences at load time on whatever kernel you deploy to. The file is large but you only generate it once and update it when you add support for a new kernel generation.

In your eBPF C source:

#include "vmlinux.h"           // replaces all kernel headers
#include <bpf/bpf_helpers.h>   // eBPF helper functions
#include <bpf/bpf_tracing.h>   // tracing macros
#include <bpf/bpf_core_read.h> // CO-RE read macros

How CO-RE Fixes the Offset Problem

The mechanism is worth understanding once, even if you’re not writing eBPF programs.

When a CO-RE eBPF program accesses a kernel struct field, it doesn’t hardcode the byte offset. Instead, it records a relocation — “I need the offset of pid inside task_struct” — in the compiled binary. When libbpf loads the program, it resolves each relocation by looking up the field in the running kernel’s BTF and patches the access instruction to use the correct offset for this specific kernel.

This is why my four-byte problem happened: the tracer I’d compiled wasn’t using CO-RE. It hardcoded 0x620 as the offset of task_struct->comm. When the kernel moved it to 0x624, the program accessed the wrong memory, the verifier caught the misalignment, and the load failed. A CO-RE rewrite would have resolved comm‘s offset at load time from BTF and never known the difference.

The relocation model also handles fields that don’t exist on older kernels. If a program accesses a field added in kernel 5.15 and the running kernel is 5.10, libbpf can either skip the access (returning a zero value) or fail the load — depending on how the program marks the field access. This is how tools ship support for features across a kernel version range without separate builds.

What CO-RE Means for Tools You Already Run

This is why you care about CO-RE even if you’re never going to write an eBPF program yourself.

Falco, Cilium, Tetragon, and Pixie all ship as single binaries or container images. You install them on a Ubuntu 22.04 node, a RHEL 9 node, and an Amazon Linux 2023 node — three different kernel versions, three different task_struct layouts — and the same binary works on all of them. Before CO-RE, Falco maintained pre-compiled kernel probes for every supported kernel version in a matrix of distro × kernel × version. The probe list had thousands of entries. A kernel your distro shipped between Falco release cycles meant a gap in coverage until the next release.

With CO-RE, there’s one binary. libbpf reads the running kernel’s BTF at load time, patches the field accesses to match the actual struct layout, and the verifier checks the patched program. The tool vendor doesn’t need to know about your specific kernel. You don’t need to wait for a probe release.

The constraint is BTF availability. Check your nodes:

# Quick check — if this file exists, CO-RE tools work
ls /sys/kernel/btf/vmlinux

# Full confirmation
cat /boot/config-$(uname -r) | grep CONFIG_DEBUG_INFO_BTF
# CONFIG_DEBUG_INFO_BTF=y  ← required

What you’ll find: Ubuntu 20.04+, Debian 11+, RHEL 8.2+, Amazon Linux 2023, and GKE/EKS managed nodes all have BTF. Amazon Linux 2 and RHEL 7 do not. If you’re running those, CO-RE-based tools fall back to the legacy BCC compilation path — which requires kernel headers installed on the node.

The One Thing to Run Right Now

This command shows you the exact struct layout your running kernel uses — the same layout libbpf reads when it patches CO-RE programs at load time:

# See how your kernel defines task_struct right now
bpftool btf dump file /sys/kernel/btf/vmlinux format c | grep -A 30 '^struct task_struct {'

The output is the canonical type information for your running kernel. Every field, every offset. When libbpf loads a CO-RE program, it’s reading this to figure out whether task_struct->comm is at offset 0x620 or 0x624.

You can also see specific struct sizes and verify that two kernels differ:

# On kernel A (5.15.0-89)
bpftool btf dump file /sys/kernel/btf/vmlinux format raw | grep -w "task_struct" | head -3

# On kernel B (5.15.0-91) — same command, different output if struct changed
# This is what broke my tracer: field offset changed across a two-patch jump

The practical use: when a CO-RE eBPF tool fails to load with a BTF error, this is where you look. The error tells you which struct field the relocation failed on. This command shows you the current layout. You can confirm whether the field exists, whether it moved, whether it was renamed.

BCC vs libbpf vs bpftrace

Three approaches to eBPF development, with distinct tradeoffs:

	BCC	libbpf + CO-RE	bpftrace
Compilation	Runtime on target host	Build-time on dev machine	Runtime (embedded LLVM)
Target deployment	Compiler + headers on every node	Single static binary	bpftrace binary only
Portability	Compile-on-target handles it	CO-RE + BTF handles it	Internal CO-RE support
Memory overhead	High (Python + compiler: 200MB+)	Low (few MB binary)	Medium
Startup time	Seconds (compilation)	Milliseconds	Seconds (JIT compile)
Best for	Prototyping, development	Production tools, shipped software	Interactive debugging sessions
Language	Python + C	C (kernel) + C/Go/Rust (userspace)	bpftrace scripting

For anything you’re shipping — an eBPF-based security tool, an observability agent, an open-source project — libbpf + CO-RE is the right choice. BCC is for prototyping before you commit to an implementation. bpftrace is for the 30-second debugging session on a live node.

The practical test: if you’re building something you’ll deploy as a container image or a package, it needs to be a self-contained binary with no build dependencies on the target system. That means libbpf.

Common Mistakes

Mistake	Impact	Fix
Direct struct dereference instead of `BPF_CORE_READ`	Program breaks on any kernel struct change	Use `BPF_CORE_READ()` for all kernel struct field access
Missing `char LICENSE[] SEC("license") = "GPL"`	GPL-only helpers (most tracing helpers) unavailable	Always include the license section
vmlinux.h generated on a very old kernel	Missing structs added in newer kernel releases	Regenerate from the highest kernel version you target
Forgetting `-g` flag in clang invocation	No BTF debug info → no CO-RE relocations	Always compile with `-g -O2 -target bpf`
Hardcoding struct offsets as integer constants	Breaks silently on next kernel patch	Use BTF-aware CO-RE macros exclusively

Key Takeaways

Kernel structs change between releases — hardcoded offsets break across patch versions, not just major releases
BTF embeds full type information in the kernel at /sys/kernel/btf/vmlinux; CO-RE uses it to patch field accesses at load time
vmlinux.h, generated from BTF, replaces all kernel headers with a single file committed to your repository
BPF_CORE_READ() is the CO-RE macro — every kernel struct access in a portable program goes through it
libbpf skeleton generation (bpftool gen skeleton) eliminates manual fd management for map and program lifecycle
For production tools: libbpf + CO-RE. For one-off debugging: bpftrace. For prototyping: BCC.

What’s Next

CO-RE makes eBPF programs portable across kernel versions. EP07 takes the next question: where in the kernel’s data path does it make sense to attach them?

XDP fires before the kernel has allocated a single byte of memory for an incoming packet — before the kernel even knows whether to accept it. That hook placement is why Cilium can do line-rate load balancing and why some network filtering rules that look correct in iptables do nothing against certain traffic. The rules weren’t wrong. The hook was in the wrong place.

Next: XDP — packets processed before the kernel knows they arrived

Get EP07 in your inbox when it publishes → linuxcent.com/subscribe

eBPF Maps — The Persistent Data Layer Between Kernel and Userspace

May 13, 2026April 16, 2026 by Vamshi Krishna Santhapuri

Reading Time: 11 minutes

eBPF: From Kernel to Cloud, Episode 5
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps**

Architecture Overview

eBPF Maps — the persistent data layer between kernel eBPF programs and userspace tools — eBPF maps are the shared memory between kernel programs and userspace — hash, array, ringbuf, and LRU variants shown.

TL;DR

eBPF programs are stateless — maps are where all state lives, between invocations and between kernel and userspace
(“stateless” here means each program invocation starts with no memory of previous runs — like a function with no global variables)
Every production eBPF tool (Cilium, Falco, Tetragon, Datadog NPM) is a map-based architecture — bpftool map list shows you what it’s actually holding
Per-CPU maps eliminate write contention for high-frequency counters; the tool aggregates per-CPU values at export time
LRU maps handle unbounded key spaces (IPs, PIDs, connections) without hard errors when full — but eviction is silent, so size generously
Ring buffer (kernel 5.8+) is the correct event streaming primitive — Falco and Tetragon both use it
Map memory is kernel-locked and invisible to standard memory metrics — account for it explicitly on eBPF-heavy nodes
Pinned maps survive restarts; Cilium uses this for zero-disruption connection tracking through upgrades

The Big Picture

  HOW eBPF MAPS CONNECT KERNEL PROGRAMS TO USERSPACE TOOLS

  ┌─────────────────────────────────────────────────────────────┐
  │  Kernel space                                               │
  │                                                             │
  │  [XDP program]  [TC program]  [kprobe]  [tracepoint]        │
  │        │              │           │           │             │
  │        └──────────────┴───────────┴───────────┘             │
  │                              │                              │
  │                   bpf_map_update_elem()                     │
  │                              │                              │
  │                              ▼                              │
  │  ┌─────────────────────────────────────────────────────┐    │
  │  │             eBPF MAP (kernel object)                │    │
  │  │  hash · percpu_hash · lru_hash · ringbuf · lpm_trie │    │
  │  │  Lives outside program invocations.                 │    │
  │  │  Pinned maps (/sys/fs/bpf/) survive restarts.       │    │
  │  └────────────────────┬────────────────────────────────┘    │
  └───────────────────────│─────────────────────────────────────┘
                          │  read / write via file descriptor
                          ▼
  ┌─────────────────────────────────────────────────────────────┐
  │  Userspace tools                                            │
  │                                                             │
  │  Cilium agent  Falco engine  Tetragon  bpftool map dump     │
  └─────────────────────────────────────────────────────────────┘

eBPF maps are the persistent data layer between kernel programs and the tools that consume their output. eBPF programs fire and exit — there’s no memory between invocations. Yet Cilium tracks TCP connections across millions of packets, and Falco correlates a process exec from five minutes ago with a suspicious network connection happening now. The mechanism between stateless kernel programs and the stateful production tools you depend on is what this episode is about — and understanding it changes what you see when you run bpftool map list.

I was trying to identify the noisy neighbor saturating a cluster’s egress link. I had an eBPF program loading cleanly, events firing, everything confirming it was working. But when I read back the per-port connection counters from userspace, everything was zero.

I spent an hour on it before posting to the BCC mailing list. The reply came back fast: eBPF programs don’t hold state between invocations. Every time the kprobe fires, the program starts fresh. The counter I was incrementing existed only for that single call — created, incremented to one, then discarded. On every single invocation. I was counting events one at a time, throwing the count away, and reading nothing.

That’s what eBPF maps solve.

Quick Check: What Maps Are Running on Your Node?

Before the map types walkthrough — see the live state of maps on any cluster node right now:

# SSH into a worker node, then:
bpftool map list

On a node running Cilium + Falco, you’ll see something like:

12: hash          name cilium_ct4_glo    key 24B  value 56B  max_entries 65536  memlock 5767168B
13: lpm_trie      name cilium_ipcache    key 40B  value 32B  max_entries 512000 memlock 327680B
14: percpu_hash   name cilium_metrics    key 8B   value 32B  max_entries 65536  memlock 2097152B
28: ringbuf       name falco_events      max_entries 8388608

Reading this output:
– hash, lpm_trie, percpu_hash, ringbuf — the map type (each optimised for a different access pattern)
– key 24B value 56B — sizes of a single entry’s key and value in bytes
– max_entries — the hard ceiling; when the map is full, behaviour depends on type (see LRU section below)
– memlock — non-pageable kernel memory this map consumes (invisible to free and container metrics)

Not running Cilium? On EKS with aws-vpc-cni or GKE with kubenet, there are far fewer maps here — primarily kube-proxy uses iptables rather than BPF maps. Running bpftool map list still works; you’ll just see fewer entries. On a pure iptables-based cluster, most of the maps you see come from the system kernel itself, not a CNI.

Maps Are the Architecture, Not an Afterthought

Maps are kernel objects that live outside any individual program invocation. They’re shared between multiple eBPF programs, readable and writable from userspace, and persistent for the lifetime of the map — which can outlive both the program that created them and the userspace process that loaded them.

Every production eBPF tool is fundamentally a map-based architecture:

Cilium stores connection tracking state in BPF hash maps
Falco uses ring buffers to stream syscall events to its userspace rule engine
Tetragon maintains process tree state across exec events using maps
Datadog NPM stores per-connection flow stats in per-CPU maps for lock-free metric accumulation

Run bpftool map list on a Cilium node:

$ bpftool map list
ID 12: hash          name cilium_ct4_glo    key 24B  value 56B   max_entries 65536
#      ^^^^           ^^^^^^^^^^^^^^^^       ^^^^^^   ^^^^^^^     ^^^^^^^^^^^^^^^^
#      type           map name               key size value size  max concurrent entries

ID 13: lpm_trie      name cilium_ipcache    key 40B  value 32B   max_entries 512000
#      longest-prefix-match trie — for IP address + CIDR lookups

ID 14: percpu_hash   name cilium_metrics    key 8B   value 32B   max_entries 65536
#      one copy of this map per CPU — no write contention for high-frequency counters

ID 28: ringbuf       name falco_events      max_entries 8388608
#                                           ^^^^^^^^^^^ 8MB ring buffer for event streaming

Connection tracking, IP policy cache, per-CPU metrics, event stream. Every one of these is a different map type, chosen for a specific reason.

Map Types and What They’re Actually Used For

Hash Maps

The general-purpose key-value store. A key maps to a value — lookup is O(1) average. Cilium’s connection tracking map (cilium_ct4_glo) is a hash map: the key is a 5-tuple (source IP, destination IP, ports, protocol), the value is the connection state.

$ bpftool map show id 12
12: hash  name cilium_ct4_glo  flags 0x0
        key 24B  value 56B  max_entries 65536  memlock 5767168B

The key 24B is the 5-tuple. The value 56B is the connection state record. max_entries 65536 is the upper bound — Cilium can track 65,536 active connections in this map before hitting the limit.

Hash maps are shared across all CPUs on the node. When multiple CPUs try to update the same entry simultaneously — which happens constantly on busy nodes — writes need to be coordinated. For most use cases this is fine. For high-frequency counters updated on every packet, it’s a bottleneck. That’s when you reach for a per-CPU hash map.

Where you see them: connection tracking, per-IP statistics, process-to-identity mapping, policy verdict caching.

Per-CPU Hash Maps

Per-CPU hash maps solve the write coordination problem by giving each CPU its own independent copy of every entry. There’s no sharing, no contention, no waiting — each CPU writes its own copy without touching any other.

The tradeoff: reading from userspace means collecting one value per CPU and summing them up. That aggregation happens in the tool, not the kernel.

# Cilium's per-CPU metrics map — one counter value per CPU
bpftool map dump id 14
key: 0x00000001
  value (CPU 00): 12345
  value (CPU 01): 8901
  value (CPU 02): 3421
  value (CPU 03): 7102
# total bytes for this metric: 31769

Cilium’s cilium_metrics map uses this pattern for exactly this reason — it’s updated on every packet across every CPU on the node. Forcing all CPUs to coordinate writes to a single shared entry at that rate would hurt throughput. Instead: each CPU writes locally, Cilium’s userspace agent sums the values at export time.

Where you see them: packet counters, byte counters, syscall frequency metrics — anywhere updates happen on every event at high volume.

LRU Hash Maps

LRU hash maps add automatic eviction. Same key-value semantics as a regular hash map, but when the map hits its entry limit, the least recently accessed entry is dropped to make room for the new one.

This matters for any map tracking dynamic state with an unpredictable number of keys: TCP connections, process IDs, DNS queries, pod IPs. Without LRU semantics, a full map returns an error on insert — and in production, that means your tool silently stops tracking new entries. Not a crash, not an alert — just missing data.

Cilium’s connection tracking map is LRU-bounded at 65,536 entries. On a node handling high-connection-rate workloads, this can fill up. When it does, Cilium starts evicting old connections to make room for new ones — and if it’s evicting too aggressively, you’ll see connection resets.

# Check current CT map usage vs its limit
bpftool map show id 12
# max_entries tells you the ceiling
# count entries to see current usage
bpftool map dump id 12 | grep -c "^key"

Size LRU maps at 2× your expected concurrent active entries. Aggressive eviction under pressure introduces gaps — not crashes, but missing or incorrect state.

Where you see them: connection tracking, process lineage, anything where the key space is dynamic and unbounded.

Ring Buffers

Ring buffers are how eBPF tools stream events from the kernel to a userspace consumer. Falco reads syscall events from a ring buffer. Tetragon streams process execution and network events through ring buffers. The pattern is the same across all of them:

kernel eBPF program
  → sees event (syscall, network packet, process exec)
  → writes record to ring buffer
  → userspace tool reads it and processes (Falco rules, Tetragon policies)

What makes ring buffers the right primitive for event streaming:

Single buffer shared across all CPUs — unlike the older perf_event_array approach which required one buffer per CPU, a ring buffer is one allocation, one file descriptor, one consumer
Lock-free — the kernel writes, the userspace tool reads, they don’t block each other
Backpressure when full — if the userspace tool can’t keep up, new events are dropped rather than queued indefinitely. The tool can detect and count drops. Falco reports these as Dropped events in its stats output.

# Falco's ring buffer — 8MB
bpftool map list | grep ringbuf
# ID 28: ringbuf  name falco_events  max_entries 8388608

8,388,608 bytes = 8MB. That’s the buffer between Falco’s kernel hooks and its rule engine. If there’s a burst of syscall activity and Falco’s rule evaluation can’t keep up, events drop into that window and are lost.

Sizing matters operationally. Too small and you drop events during normal burst. Too large and you’re holding non-pageable kernel memory that doesn’t show up in standard memory metrics.

# Check Falco's drop rate
falcoctl stats
# or check the Falco logs
journalctl -u falco | grep -i "drop"

Most production deployments run 8–32MB. Start at 8MB, monitor drop rates under load, size up if needed.

Where you see them: Falco event streaming, Tetragon audit events, any tool that needs to move high-volume event data from kernel to userspace.

Array Maps

Array maps are fixed-size, integer-indexed, and entirely pre-allocated at creation time. Think of them as lookup tables with integer keys — constant-time access, no hash overhead, no dynamic allocation.

Cilium uses array maps for policy configuration: a fixed set of slots indexed by endpoint identity number. When a packet arrives and Cilium needs to check policy, it indexes into the array directly rather than doing a hash lookup. For read-heavy, write-rare data, this is faster.

The constraint: you can’t delete entries from an array map. Every slot exists for the lifetime of the map. If you need to track state that comes and goes — connections, processes, pods — use a hash map instead.

Where you see them: policy configuration, routing tables with fixed indices, per-CPU stats indexed by CPU number.

LPM Trie Maps

LPM (Longest Prefix Match) trie maps handle IP prefix lookups — the same operation that a hardware router does when deciding which interface to send a packet out of.

You can store a mix of specific host addresses (/32) and CIDR ranges (/16, /24) in the same map, and a lookup returns the most specific match. If 10.0.1.15/32 and 10.0.0.0/8 are both in the map, a lookup for 10.0.1.15 returns the /32 entry.

Cilium’s cilium_ipcache map is an LPM trie. It maps every IP in the cluster to its security identity — the identifier Cilium uses for policy enforcement. When a packet arrives, Cilium does a trie lookup on the source IP to find out which endpoint sent it, then checks policy against that identity.

# Inspect the ipcache map
bpftool map show id 13
# lpm_trie  name cilium_ipcache  key 40B  value 32B  max_entries 512000

# Look up which security identity owns a pod IP
bpftool map lookup id 13 key hex 20 00 00 00 0a 00 01 0f 00 00 00 00 00 00 00 00 00 00 00 00

Where you see them: IP-to-identity mapping (Cilium), CIDR-based policy enforcement, IP blocklists.

Pinned Maps — State That Survives Restarts

By default, a map’s lifetime is tied to the tool that created it. When the tool exits, the kernel garbage-collects the map.

Pinning writes a reference to the BPF filesystem at /sys/fs/bpf, which keeps the map alive even after the creating process exits:

# See all maps Cilium has pinned
ls /sys/fs/bpf/tc/globals/
# cilium_ct4_global  cilium_ipcache  cilium_metrics  cilium_policy ...

# Inspect a pinned map directly — no Cilium process needed
bpftool map dump pinned /sys/fs/bpf/tc/globals/cilium_ct4_global

# Pin any map by ID for manual inspection
bpftool map pin id 12 /sys/fs/bpf/my_conn_tracker
bpftool map dump pinned /sys/fs/bpf/my_conn_tracker

Cilium pins all its maps under /sys/fs/bpf/tc/globals/. When Cilium restarts — rolling upgrade, crash, OOM kill — it reopens its pinned maps and resumes with existing state intact. Pods maintain established TCP connections through a Cilium restart without disruption.

This is operationally significant: if you’re evaluating eBPF-based tools for production, check whether they pin their maps. A tool that doesn’t loses all its tracked state on every restart — connection tracking resets, process lineage gaps, policy state rebuilt from scratch.

Map Memory: A Production Consideration

Map memory is kernel-locked — it cannot be paged out, and it doesn’t show up in standard memory pressure metrics. Your node’s free output and container memory limits don’t account for it.

Kernel-locked memory is memory the OS guarantees will never be swapped to disk — it stays in RAM permanently. The kernel requires this for eBPF maps because a kernel program running during a network interrupt cannot wait for a page fault. The side effect: it doesn’t appear in top, free, or container memory metrics, so it’s easy to accidentally provision nodes without accounting for it.

# Total eBPF map memory locked on this node
bpftool map list -j | python3 -c "
import json,sys
maps=json.load(sys.stdin)
total=sum(m.get('bytes_memlock',0) for m in maps)
print(f'Total map memory: {total/1024/1024:.1f} MB')
"

# Check system memlock limit (unlimited is correct for eBPF tools)
ulimit -l

# Check what Cilium's systemd unit sets
systemctl show cilium | grep -i memlock

On a node running Cilium + Falco + Datadog NPM, I’ve seen 200–400MB of map memory locked. That’s real, non-pageable kernel memory. If you’re sizing nodes for eBPF-heavy workloads, account for this separately from your pod workload memory.

If an eBPF tool fails to load with a permission error despite having enough free memory, the root cause is usually the memlock ulimit for the process. Cilium, Falco, and most production tools set LimitMEMLOCK=infinity in their systemd units. Verify this if you’re deploying a new eBPF-based tool and seeing unexpected load failures.

Inspecting Maps in Production

# List all maps: type, name, key/value sizes, memory usage
bpftool map list

# Dump all entries in a map (careful with large maps)
bpftool map dump id 12

# Look up a specific entry by key
bpftool map lookup id 12 key hex 0a 00 01 0f 00 00 00 00

# Watch map stats live
watch -n1 'bpftool map show id 12'

# See all maps for a specific tool by checking its pinned path
ls /sys/fs/bpf/tc/globals/                    # Cilium
ls /sys/fs/bpf/falco/                         # Falco (if pinned)

# Cross-reference map IDs with the programs using them
bpftool prog list
bpftool map list

⚠ Production Gotchas

A full LRU map drops state silently, not loudly
When Cilium’s CT map fills up, it starts evicting the least recently used connections — not returning an error. You see connection resets, not a tool alert. Check map utilisation (bpftool map dump id X | grep -c key) against max_entries on nodes with high connection rates.

Ring buffer drops don’t stop the tool — they create gaps
When Falco’s ring buffer fills up, events are dropped. Falco keeps running. The rule engine keeps processing. But you have gaps in your syscall visibility. Monitor Dropped events in Falco’s stats and size the ring buffer accordingly.

Map memory is invisible to standard monitoring
200–400MB of kernel-locked memory on a Cilium + Falco node doesn’t appear in top, container memory metrics, or memory pressure alerts. Size eBPF-heavy nodes with this in mind and add explicit map memory monitoring via bpftool.

Tools that don’t pin their maps lose state on restart
A Cilium restart with pinned maps = zero-disruption connection tracking. A tool without pinning = all tracked state rebuilt from scratch. This matters for connection tracking tools and any tool maintaining process lineage.

perf_event_array on kernel 5.8+ is the old way
Older eBPF tools use per-CPU perf_event_array for event streaming. Ring buffer is strictly better — single allocation, lower overhead, simpler consumption. If you’re running a tool that still uses perf_event_array on a 5.8+ kernel, it’s using a legacy path.

Key Takeaways

eBPF programs are stateless — maps are where all state lives, between invocations and between kernel and userspace
Every production eBPF tool (Cilium, Falco, Tetragon, Datadog NPM) is a map-based architecture — bpftool map list shows you what it’s actually holding
Per-CPU maps eliminate write contention for high-frequency counters; the tool aggregates per-CPU values at export time
LRU maps handle unbounded key spaces (IPs, PIDs, connections) without hard errors when full — but eviction is silent, so size generously
Ring buffer (kernel 5.8+) is the correct event streaming primitive — Falco and Tetragon both use it
Map memory is kernel-locked and invisible to standard memory metrics — account for it explicitly on eBPF-heavy nodes
Pinned maps survive restarts; Cilium uses this for zero-disruption connection tracking through upgrades

What’s Next

You know what program types run in the kernel, and you know how they hold state.

Get EP06 in your inbox when it publishes → linuxcent.com/subscribe But there’s a problem anyone running eBPF-based tools eventually runs into: a tool works on one kernel version and breaks on the next. Struct layouts shift between patch versions. Field offsets move. EP06 covers CO-RE (Compile Once, Run Everywhere) and libbpf — the mechanism that makes tools like Cilium and Falco survive your node upgrades without recompilation, and why kernel version compatibility is a solved problem for any tool built on this toolchain.

eBPF Program Types — What’s Actually Running on Your Nodes

May 13, 2026April 12, 2026 by Vamshi Krishna Santhapuri

Reading Time: 8 minutes

eBPF: From Kernel to Cloud, Episode 4
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types**

Architecture Overview

eBPF Program Types — tracing, networking, and security hook points across the Linux kernel — Each eBPF program type attaches to a different kernel hook — from socket filters to LSM enforcement points.

TL;DR

bpftool prog list and bpftool net list show every eBPF program on a node — run these first when debugging eBPF-based tool behavior
TC programs can stack on the same interface; stale programs from incomplete Cilium upgrades cause intermittent packet drops — check tc filter show after every Cilium upgrade
XDP fires before sk_buff allocation — fastest hook, but no pod identity; Cilium uses it for service load balancing, not pod policy
XDP silently falls back to generic mode on unsupported NICs — verify with ip link show | grep xdp
Tracepoints are stable across kernel versions; kprobe-based tools may silently break after node OS patches
LSM hooks enforce at the kernel level — what makes Tetragon’s enforcement mode fundamentally different from sidecar-based approaches

The Big Picture

  WHERE eBPF PROGRAM TYPES ATTACH IN THE KERNEL

  NIC hardware
       ↓
  DMA → ring buffer
       ↓
  ┌─────────────────────────────────────────────────┐
  │  XDP hook  (Cilium: service load balancing)     │
  │  Sees: raw packet bytes only. No pod identity.  │
  └─────────────────────────┬───────────────────────┘
                            │ XDP_PASS
                            ▼
  sk_buff allocated
       ↓
  ┌─────────────────────────────────────────────────┐
  │  TC ingress hook  (Cilium: pod policy ingress)  │
  │  Sees: sk_buff + socket + cgroup → pod identity │
  └─────────────────────────┬───────────────────────┘
                            ↓
  netfilter / IP routing
       ↓
  socket → process (syscall boundary)
  ┌─────────────────────────────────────────────────┐
  │  Tracepoint / kprobe  (Falco: syscall monitor)  │
  │  Sees: any kernel event, any process, any pod   │
  └─────────────────────────────────────────────────┘
  ┌─────────────────────────────────────────────────┐
  │  LSM hook  (Tetragon: kernel-level enforcement) │
  │  Sees: security check context. Can DENY.        │
  └─────────────────────────────────────────────────┘
       ↓
  IP routing → qdisc
  ┌─────────────────────────────────────────────────┐
  │  TC egress hook  (Cilium: pod policy egress)    │
  │  Sees: socket + cgroup on outbound traffic      │
  └─────────────────────────────────────────────────┘
       ↓
  NIC → wire

eBPF program types define where in the kernel a hook fires and what it can see — and knowing the difference is what makes you effective when Cilium or Falco behave unexpectedly. What we hadn’t answered — and what a 2am incident eventually forced — is what kind of eBPF programs are actually running on your nodes, and why the difference matters when something breaks.

A pod in production was dropping roughly one in fifty outbound TCP connections. Not all of them — just enough to cause intermittent timeouts in the application logs. NetworkPolicy showed egress allowed. Cilium reported no violations. Running curl manually from inside the pod worked every time.

I spent the better part of three hours eliminating possibilities. DNS. MTU. Node-level conntrack table exhaustion. Upstream firewall rules. Nothing.

Eventually, almost as an afterthought, I ran this:

sudo bpftool prog list

There were two TC programs attached to that pod’s veth interface. One from the current Cilium version. One from the previous version — left behind by a rolling upgrade that hadn’t cleaned up properly. Two programs. Different policy state. One was occasionally dropping packets based on rules that no longer existed in the current policy model.

The answer had been sitting in the kernel the whole time. I just didn’t know where to look.

That incident forced me to actually understand something I’d been hand-waving for two years: eBPF isn’t a single hook. It’s a family of program types, each attached to a different location in the kernel, each seeing different data, each suited for different problems. Understanding the difference is what separates “I run Cilium and Falco” from “I understand what Cilium and Falco are actually doing on my nodes” — and that difference matters when something breaks at 2am.

The Command You Should Run on Your Cluster Right Now

Before getting into the theory, do this:

# See every eBPF program loaded on the node
sudo bpftool prog list

# See every eBPF program attached to a network interface
sudo bpftool net list

On a node running Cilium and Falco, you’ll see something like this:

42: xdp           name cil_xdp_entry       loaded_at 2026-04-01T09:23:41
43: sched_cls     name cil_from_netdev      loaded_at 2026-04-01T09:23:41
44: sched_cls     name cil_to_netdev        loaded_at 2026-04-01T09:23:41
51: cgroup_sock_addr  name cil_sock4_connect loaded_at 2026-04-01T09:23:41
88: raw_tracepoint  name sys_enter          loaded_at 2026-04-01T09:23:55
89: raw_tracepoint  name sys_exit           loaded_at 2026-04-01T09:23:55

Each line is a different program type. Each one fires at a different point in the kernel. The type column — xdp, sched_cls, raw_tracepoint, cgroup_sock_addr — tells you where in the kernel execution path that program is attached and therefore what it can and cannot see.

If you see more programs than you expect on a specific interface — like I did — that’s your first clue.

Why Program Types Exist

The Linux kernel isn’t a single pipeline. Network packets, system calls, file operations, process scheduling — these all run through different subsystems with different execution contexts and different available data.

eBPF lets you attach programs to specific points within those subsystems. The “program type” is the contract: it defines where the hook fires, what data the program receives, and what it’s allowed to do with it. A program designed to process network packets before they hit the kernel stack looks completely different from one designed to intercept system calls across all containers simultaneously.

Most of us will interact with four or five program types through the tools we already run. Understanding what each one actually is — where it sits, what it sees — is what makes you effective when those tools behave unexpectedly.

The Types Behind the Tools You Already Use

TC — Why Cilium Can Tell Which Pod Sent a Packet

TC stands for Traffic Control. It’s where Cilium enforces your NetworkPolicy, and it’s what caused my incident.

TC programs attach to network interfaces — specifically to the ingress and egress directions of the pod’s virtual interface (lxcXXXXX in Cilium’s naming). They fire after the kernel has already processed the packet enough to know its context: which socket created it, which cgroup that socket belongs to. Cgroup maps to container, container maps to pod.

This is the critical piece: TC is how Cilium knows which pod a packet belongs to. Without that cgroup context, per-pod policy enforcement isn’t possible.

# See TC programs on a pod's veth interface
sudo tc filter show dev lxc12345 ingress
sudo tc filter show dev lxc12345 egress

# If you see two entries on the same direction — that's the incident I described
# The priority number (pref 1, pref 2) tells you the order they run

When there are two TC programs on the same interface, the first one to return “drop” wins. The second program never runs. This is why the issue was intermittent rather than consistent — the stale program only matched specific connection patterns.

Fixing it is straightforward once you know what to look for:

# Remove a stale TC filter by its priority number
sudo tc filter del dev lxc12345 egress pref 2

Add this check to your post-upgrade runbook. Cilium upgrades are generally clean but not always.

XDP — Why Cilium Doesn’t Use TC for Everything

If TC is good enough for pod-level policy, why does Cilium also run an XDP program on the node’s main interface? Look at the bpftool prog list output again — there’s an xdp program loaded alongside the TC programs.

XDP fires earlier. Much earlier. Before the kernel allocates any memory for the packet. Before routing. Before connection tracking. Before anything.

The tradeoff is exactly what you’d expect: XDP is fast but context-poor. It sees raw packet bytes. It doesn’t know which pod the packet came from. It can’t read cgroup information because no socket buffer has been allocated yet.

Cilium uses XDP specifically for ClusterIP service load balancing — when a packet arrives at the node destined for a service VIP, XDP rewrites the destination to the actual pod IP in a single map lookup and sends it on its way. No iptables. No conntrack. The work is done before the kernel stack is involved.

There’s a silent failure mode worth knowing about here. XDP runs in one of two modes:

Native mode — runs inside the NIC driver itself, before any kernel allocation. This is where the performance comes from.
Generic mode — fallback when the NIC driver doesn’t support XDP. Runs later, after sk_buff allocation. No performance benefit over iptables.

If your NIC doesn’t support native XDP, Cilium silently falls back to generic mode. The policy still works — but the performance characteristics you assumed aren’t there.

# Check which XDP mode is active on your node's main interface
ip link show eth0 | grep xdp
# xdpdrv  ← native mode (fast)
# xdpgeneric ← generic mode (no perf benefit)

Most cloud provider instance types with modern Mellanox/Intel NICs support native mode. Worth verifying rather than assuming.

Tracepoints — How Falco Sees Every Container

Falco loads two programs: sys_enter and sys_exit. These are raw tracepoints — they fire on every single system call, from every process, in every container on the node.

Tracepoints are explicitly defined and maintained instrumentation points in the kernel. Unlike hooks that attach to specific internal function names (which can be renamed or inlined between kernel versions), tracepoints are stable interfaces. They’re part of the kernel’s public contract with tooling that wants to instrument it.

This matters operationally. When you patch your nodes — and cloud-managed nodes get patched frequently — tools built on tracepoints keep working. Tools built on kprobes (internal function hooks) may silently stop firing if the function they’re attached to gets renamed or inlined by the compiler in a new kernel build.

# Verify what Falco is actually using
sudo bpftool prog list | grep -E "kprobe|tracepoint"

# Falco's current eBPF driver should show raw_tracepoint entries
# If you see kprobe entries from Falco, you're on the older driver
# Check: falco --version and the driver being loaded at startup

If you’re running Falco on a cluster that gets regular OS patch upgrades and you haven’t verified the driver mode, check it. The older kprobe-based driver has a real failure mode on certain kernel versions.

LSM — How Tetragon Blocks Operations at the Kernel Level

LSM hooks run at the kernel’s security decision points: file opens, socket connections, process execution, capability checks. The defining characteristic is that they can deny an operation. Return an error from an LSM hook and the kernel refuses the syscall before it completes.

This is qualitatively different from observability hooks. kprobes and tracepoints watch. LSM hooks enforce.

When you see Tetragon configured to kill a process attempting a privileged operation, or block a container from writing to a specific path, that’s an LSM hook making the decision inside the kernel — not a sidecar watching traffic, not an admission webhook running before pod creation, not a userspace agent trying to act fast enough. The enforcement is in the kernel itself.

# See if any LSM eBPF programs are active on the node
sudo bpftool prog list | grep lsm

# Verify LSM eBPF support on your kernel (required for Tetragon enforcement mode)
grep CONFIG_BPF_LSM /boot/config-$(uname -r)
# CONFIG_BPF_LSM=y   ← required

The Practical Summary

What’s happening on your node	Program type	Where to look
Cilium service load balancing	XDP	`ip link show eth0 \\| grep xdp`
Cilium pod network policy	TC (`sched_cls`)	`tc filter show dev lxcXXXX egress`
Falco syscall monitoring	Tracepoint	`bpftool prog list \\| grep tracepoint`
Tetragon enforcement	LSM	`bpftool prog list \\| grep lsm`
Anything unexpected	All types	`bpftool prog list`, `bpftool net list`

The Incident, Revisited

Three hours of debugging. The answer was a stale TC program sitting at priority 2 on a pod’s veth interface, left behind by an incomplete Cilium upgrade.

# What I should have run first
sudo bpftool net list
sudo tc filter show dev lxc12345 egress

Two commands. Thirty seconds. If I’d known that TC programs can stack on the same interface, I’d have started there.

That’s the point of understanding program types — not to write eBPF programs yourself, but to know where to look when the tools you depend on don’t behave the way you expect. The programs are already there, running on your nodes right now. bpftool prog list shows you all of them.

Key Takeaways

bpftool prog list and bpftool net list show every eBPF program on a node — run these before anything else when debugging eBPF-based tool behavior
TC programs can stack on the same interface; stale programs from incomplete Cilium upgrades cause intermittent drops — check tc filter show after every Cilium upgrade
XDP runs before the kernel stack — fastest hook, but no pod identity; Cilium uses it for service load balancing, not pod policy
XDP silently falls back to generic mode on unsupported NICs — verify with ip link show | grep xdp
Tracepoints are stable across kernel versions; kprobe-based tools may silently break after node OS patches — verify your Falco driver mode
LSM hooks enforce at the kernel level — this is what makes Tetragon’s enforcement mode fundamentally different from sidecar-based approaches

What’s Next

Every eBPF program fires, does its work, and exits — but the work always involves data.

Get EP05 in your inbox when it publishes → linuxcent.com/subscribe Counting connections. Tracking processes. Streaming events to a detection engine. In EP05, I’ll cover eBPF maps: the persistent data layer that connects kernel programs to the tools consuming their output. Understanding maps explains a class of production issues — and makes bpftool map dump useful rather than cryptic.

eBPF vs Kernel Modules: An Honest Comparison for K8s Engineers

May 13, 2026April 4, 2026 by Vamshi Krishna Santhapuri

Reading Time: 8 minutes

Reading Time: 7 minutes

~2,100 words · Reading time: 8 min · Series: eBPF: From Kernel to Cloud, Episode 3 of 18

In Episode 1 we covered what eBPF is. In Episode 2 we covered why it is safe. The question that comes next is the one most tutorials skip entirely:

If eBPF can do everything a kernel module does for observability, why do kernel modules still exist? And when should you still reach for one?

Most comparisons on this topic are written by people who have used one or the other. I have used both — device driver work from 2012 to 2014 and eBPF in production Kubernetes clusters for the last several years. This is the honest version of that comparison, including the cases where kernel modules are still the right answer.

Table of Contents

Toggle

Architecture Overview

eBPF vs Kernel Modules — safety, portability, and runtime loading comparison diagram — eBPF programs run in a sandboxed VM; kernel modules run with full ring-0 privileges — the safety trade-off visualised.

TL;DR

Kernel modules run with full ring-0 privileges and no safety net — a bug causes an immediate kernel panic, no recovery
eBPF runs in a sandboxed virtual machine: the verifier ensures it cannot crash the kernel, and CO-RE means one binary runs across kernel versions without recompilation
eBPF cannot replace kernel modules for hardware drivers, new filesystems, or deep scheduler modifications — those still require modules
On EKS, GKE, and most managed Kubernetes platforms, loading custom kernel modules is restricted or blocked; eBPF is the only viable kernel extension path
Kernel modules are a significant attack surface (container escape, privilege escalation); eBPF programs are constrained by the verifier and produce an audit trail
Practical rule: reach for eBPF first; only reach for a kernel module when eBPF’s sandboxed model provably cannot do what you need

What Kernel Modules Actually Are

A kernel module is a piece of compiled code that loads directly into the running Linux kernel. Once loaded, it operates with full kernel privileges — the same level of access as the kernel itself. There is no sandbox. There is no safety check. There is no verifier.

This is both the power and the problem.

Kernel modules can do things that nothing else in the Linux ecosystem can do: implement new filesystems, add hardware drivers, intercept and modify kernel data structures, hook into scheduler internals. They are how the kernel extends itself without requiring a recompile or a reboot.

But the operating model is unforgiving:

A bug in a kernel module causes an immediate kernel panic — no exceptions, no recovery
Modules must be compiled against the exact kernel headers of the running kernel
A module that works on RHEL 8 may refuse to load on RHEL 9 without recompilation
Loading a module requires root privileges and deliberate coordination in production
Debugging a module failure means kernel crash dumps, kdump analysis, and time

I experienced all of these during device driver work. The discipline that environment instils is real — you think very carefully before touching anything, because mistakes are instantaneous and complete.

What eBPF Does Differently

eBPF was not designed to replace kernel modules. It was designed to provide a safe, programmable interface to kernel internals for the specific use cases where modules had always been used but were too dangerous: observability, networking, and security monitoring.

The fundamental difference is the verifier, covered in depth in Episode 2. Before any eBPF program runs, the kernel proves it is safe. Before any kernel module runs, nothing checks anything.

That single architectural decision produces a completely different operational profile:

Property	Kernel module	eBPF program
Safety check before load	None	BPF verifier — mathematical proof of safety
A bug causes	Kernel panic, immediate	Program rejected at load time
Kernel version coupling	Compiled per kernel version	CO-RE: compile once, run on any kernel 5.4+
Hot load / unload	Risky, requires coordination	Safe, zero downtime, zero pod restarts
Access scope	Full kernel, unrestricted	Restricted, granted per program type
Debugging	Kernel crash dumps, kdump	bpftool, bpftrace, readable error messages
Portability	Recompile per distro per version	Single binary runs across distros and versions
Production risk	High — no safety net	Low — verifier enforced before execution

CO-RE: Why Portability Matters More Than Most Engineers Realise

The portability column in that table deserves more than a one-line entry, because it is the operational advantage that compounds over time.

A kernel module written for RHEL 8 ships compiled against 4.18.0-xxx.el8.x86_64 kernel headers. When RHEL 8 moves to a new minor version, the module may need recompilation. When you migrate to RHEL 9 — kernel 5.14 with a completely different ABI in places — the module almost certainly needs a full rewrite of any code that touches kernel internals that changed between versions.

If you are running Falco with its kernel module driver and you upgrade a node from Ubuntu 20.04 to 22.04, Falco needs a pre-built module for your exact new kernel or it needs to compile one. If the pre-built is not available and compilation fails — no runtime security monitoring until it is resolved.

eBPF with CO-RE works differently. CO-RE (Compile Once, Run Everywhere) uses the kernel’s embedded BTF (BPF Type Format) information to patch field offsets and data structure layouts at load time to match the running kernel. The eBPF program was compiled once, against a reference kernel. When it loads on a different kernel, libbpf reads the BTF data from /sys/kernel/btf/vmlinux and fixes up the relocations automatically.

The practical result: a Cilium or Falco binary built six months ago loads and runs correctly on a node you just upgraded to a newer kernel version — without any module rebuilding, without any intervention, without any downtime.

In a Kubernetes environment where node images update regularly — especially on managed services like EKS, GKE, and AKS — this is not a minor convenience. It is the difference between eBPF tooling that survives an upgrade cycle and kernel module tooling that breaks one.

Security Implications: Container Escape and Privilege Escalation

The security difference between the two approaches matters specifically for container environments, and it goes beyond the verifier’s protection of your own nodes.

Kernel modules as an attack surface

Historically, kernel module vulnerabilities have been a primary vector for container escape. The attack pattern is straightforward: exploit a vulnerability in a loaded kernel module to gain kernel-level code execution, then use that access to break out of the container namespace into the host. Several high-profile CVEs over the past decade have followed this pattern.

The risk is compounded in environments that load third-party kernel modules — hardware drivers, filesystem modules, observability agents using the kernel module approach — because each additional module is an additional attack surface at the highest privilege level on the system.

eBPF’s security boundaries

eBPF does not eliminate the attack surface entirely, but it constrains it in important ways.

First, eBPF programs cannot leak kernel memory addresses to userspace. This is verifier-enforced and closes the class of KASLR bypass attacks that kernel module vulnerabilities have historically enabled.

Second, eBPF programs are sandboxed by design. They cannot access arbitrary kernel memory, cannot call arbitrary kernel functions, and cannot modify kernel data structures they were not explicitly granted access to. A vulnerability in an eBPF program is contained within that sandbox.

Third, the program type system controls what each eBPF program can see and do. A kprobe program watching syscalls cannot suddenly start modifying network packets. The scope is fixed at load time by the program type and verified by the kernel.

For EKS specifically: Falco running in eBPF mode on your nodes is not a kernel module that could be exploited for container escape. It is a verifier-checked program with a constrained access scope. The tool designed to detect container escapes is not itself a container escape vector — which is the correct security architecture.

Audit and visibility

eBPF programs are auditable in ways that kernel modules are not. You can list every eBPF program currently loaded on a node:

$ bpftool prog list
14: kprobe  name sys_enter_execve  tag abc123...  gpl
    loaded_at 2025-03-01T07:30:00+0000  uid 0
    xlated 240B  jited 172B  memlock 4096B  map_ids 3,4

27: cgroup_skb  name egress_filter  tag def456...  gpl
    loaded_at 2025-03-01T07:30:01+0000  uid 0

Every program is listed with its load time, its type, its tag (a hash of the program), and the maps it accesses. You can audit exactly what is running in your kernel at any point. Kernel modules offer no equivalent — lsmod tells you what is loaded but nothing about what it is actually doing.

EKS and Managed Kubernetes: Where the Difference Is Most Visible

The eBPF vs kernel module distinction plays out most clearly in managed Kubernetes environments, because you do not control when nodes upgrade.

On EKS, when AWS releases a new optimised AMI for a node group and you update it, your nodes are replaced. Any kernel module-based tooling on those nodes needs pre-built modules for the new kernel, or it needs to compile them at node startup, or it fails. AWS does not provide the kernel source for EKS-optimised AMIs in the same way a standard distribution does, which makes module compilation at runtime unreliable.

This is precisely why the EKS 1.33 migration covered in the EKS 1.33 post was painful for Rocky Linux: it involved kernel-level networking behaviour that had been assumed stable. When the kernel networking stack changed, everything built on top of those assumptions broke.

eBPF-based tooling on EKS does not have this problem, provided the node OS ships with BTF enabled — which Amazon Linux 2023 and Ubuntu 22.04 EKS-optimised AMIs do. Cilium and Falco survive node replacements without any module rebuilding because CO-RE handles the kernel version differences automatically.

For GKE and AKS the story is similar. Both use node images with BTF enabled on current versions, and both upgrade nodes on a managed schedule that is difficult to predict precisely. eBPF tooling survives this. Kernel module tooling fights it.

When You Should Still Use Kernel Modules

eBPF is not the right answer for every use case. Kernel modules remain the correct tool when:

You are implementing hardware support. Device drivers for new hardware still require kernel modules. eBPF cannot provide the low-level hardware interrupt handling, DMA operations, or hardware register access that a device driver needs. If you are bringing up a new network interface card, storage controller, or GPU, you are writing a kernel module.

You need to modify kernel behaviour, not just observe it. eBPF can observe and filter. It can drop packets, block syscalls via LSM hooks, and redirect traffic. But it cannot fundamentally change how the kernel handles a syscall, implement a new scheduling algorithm from scratch, or add a new filesystem type. Those changes require kernel modules or upstream kernel patches.

You are on a kernel older than 5.4. Without BTF and CO-RE, eBPF programs must be compiled per kernel version — which largely eliminates the portability advantage. On RHEL 7 or very old Ubuntu LTS versions still in production, kernel modules may be the more practical path for instrumentation work, though migrating the underlying OS is a better long-term answer.

You need capabilities the eBPF verifier rejects. The verifier’s safety constraints occasionally reject programs that are logically safe but that the verifier cannot prove safe statically. Complex loops, large stack allocations, and certain pointer arithmetic patterns hit verifier limits. In these edge cases, a kernel module can do what the verifier would not allow. These situations are rare and becoming rarer as the verifier improves across kernel versions.

The Practical Decision Framework

For most engineers reading this — Linux admins, DevOps engineers, SREs managing Kubernetes clusters — the decision is straightforward:

Observability, security monitoring, network policy, performance profiling on Linux 5.4+ → eBPF
Hardware drivers, new kernel subsystems, or kernels older than 5.4 → kernel modules
Production Kubernetes on EKS, GKE, or AKS → eBPF, always, because CO-RE survives managed upgrades and kernel modules do not

The overlap between the two technologies — the use cases where both could work — has been shrinking for five years and continues to shrink as the verifier becomes more capable and CO-RE becomes more widely supported. The direction of travel is clear.

Kernel modules are a precision instrument for modifying kernel behaviour. eBPF is a safe, portable interface for observing and influencing it. In 2025, if you are reaching for a kernel module to instrument a production system, there is almost certainly a better path.

Up Next

Episode 4 covers the five things eBPF can observe that no other tool can — without agents, without sidecars, and without any changes to your application code. If you are running production Kubernetes and want to understand what true zero-instrumentation observability looks like, that is the post.

The full series is on LinkedIn — search #eBPFSeries — and all episodes are indexed on linuxcent.com under the eBPF Series tag.

BPF Verifier Explained: Why eBPF Is Safe for Production Kubernetes

May 13, 2026March 22, 2026 by Vamshi Krishna Santhapuri

Reading Time: 10 minutes

Reading Time: 9 minutes

~2,400 words · Reading time: 9 min · Series: eBPF: From Kernel to Cloud, Episode 2 of 18

In Episode 1, we established what eBPF is and why it gives Linux admins and DevOps engineers kernel-level visibility without sidecars or code changes. The obvious follow-up question is the one every experienced engineer should ask before running anything in kernel space:

Is it actually safe to run on production nodes?

The answer is yes — and the reason is one specific component of the Linux kernel called the BPF verifier. This post explains what the verifier is, what it protects your cluster from, and why it changes the risk calculus for eBPF-based tools entirely.

Table of Contents

Toggle

Architecture Overview

BPF Verifier and JIT pipeline — how eBPF programs are safety-checked and compiled before kernel execution — The BPF verifier runs before every eBPF load — rejecting unsafe programs before they touch the kernel.

TL;DR

The BPF verifier is a static analysis pass that runs before every eBPF program loads — it rejects unsafe programs before they touch the kernel
It prevents infinite loops (only bounded loops allowed), out-of-bounds memory access, null pointer dereferences, and privilege escalation via kernel pointer leaks
Unlike kernel modules, a verified eBPF program cannot kernel-panic your node — that guarantee is why eBPF-based tools are safe in production
Every eBPF-based tool you run — Cilium, Falco, Tetragon, Datadog — passes its programs through the verifier on every node load
Ask three questions before adopting any eBPF tool: minimum kernel version required, CO-RE support (portable across kernels), and which program types it uses
(The verifier is also why eBPF programs require CAP_BPF or CAP_SYS_ADMIN — privilege is still required to load, just not to survive a bad load)

The Fear That Holds Most Teams Back

When I first explain eBPF to Linux admins and DevOps engineers, the reaction is almost always the same:

“So it runs code inside the kernel? On our production nodes? That sounds like a disaster waiting to happen.”

It is a completely reasonable concern. The Linux kernel is not a place where mistakes are tolerated. A buggy kernel module can take down a server instantly — no warning, no graceful shutdown, just a hard panic and a 3 AM phone call.

I know this from personal experience. During 2012–2014, I worked briefly with Linux device driver code. That period taught me one thing clearly: kernel space does not forgive careless code.

So when people started talking about running programs inside the kernel via eBPF, my instinct was scepticism too. Then I understood the BPF verifier. And everything changed.

What the Verifier Actually Is

Think of the BPF verifier as a strict safety gate that sits between your eBPF program and the kernel. Before your eBPF program is allowed to run — before it touches a single system call, network packet, or container event — the verifier reads through every line of it and asks one question:

“Could this program crash or compromise the kernel?”

If the answer is yes, or even maybe, the program is rejected. It does not load. Your cluster stays safe. If the answer is a provable no, the program loads and runs.

This is not a runtime check that catches problems after the fact. It is a load-time guarantee — the kernel proves the program is safe before it ever executes. Here is what that looks like when you deploy Cilium:

You run: kubectl apply -f cilium-daemonset.yaml
         └─► Cilium loads its eBPF programs onto each node
                   └─► Kernel verifier checks every program
                             ├─► SAFE   → program loads, starts observing
                             └─► UNSAFE → rejected, cluster untouched

This is why Cilium can replace kube-proxy on your nodes, why Falco can watch every syscall in every container, and why Tetragon can enforce security policy at the kernel level — all without putting your cluster at risk.

What the Verifier Protects You From

You do not need to know how the verifier works internally. What matters is what it prevents — and why each protection matters specifically in Kubernetes environments.

Infinite loops

An eBPF program that never terminates would freeze the kernel event it is attached to — potentially hanging every container on that node. The verifier rejects any program it cannot prove will finish executing within a bounded number of instructions.

Why this matters: Every eBPF-based tool on your K8s nodes — Cilium, Falco, Tetragon, Hubble — was verified to terminate correctly on every code path before it shipped. You are not trusting the vendor’s claim. The kernel enforced it.

Memory safety violations

An eBPF program cannot read or write memory outside the boundaries it is explicitly granted. No reaching into another container’s memory space. No accessing kernel data structures it was not given permission to touch.

Why this matters: This is the property that makes eBPF safe for multi-tenant clusters. A Falco rule monitoring one namespace cannot accidentally read data from another namespace’s containers. The verifier makes this impossible at the program level, not just at the policy level.

Kernel crashes

The verifier checks that every pointer is valid before it is dereferenced, that every function call uses correct arguments, and that the program cannot corrupt kernel data structures. Programs that could cause a kernel panic are rejected before they load.

Why this matters: Running Cilium or Tetragon on a production node is not the same risk as loading an untested kernel module. The verifier has already proven these programs cannot crash your nodes — before they ever ran on your infrastructure.

Privilege escalation and kernel pointer leaks

eBPF programs cannot leak kernel memory addresses to userspace. This closes a class of container escape and privilege escalation attacks that have historically been possible through kernel module vulnerabilities.

Why this matters: Security tools built on eBPF — like Tetragon, which detects and blocks container escape attempts in real time — are not themselves a vector for the attacks they protect against.

eBPF vs Traditional Observability Agents

To appreciate what the verifier gives you operationally, compare the two main approaches to K8s observability.

Traditional agent — DaemonSet sidecar approach

Your K8s cluster
└─► Node
    ├─► App Pod (your service)
    ├─► Sidecar container (injected into every pod)
    │   └─► Reads /proc, intercepts syscalls via ptrace
    │       └─► 15–30% CPU/memory overhead per pod
    └─► Agent DaemonSet Pod
        └─► Aggregates data from all sidecars

Problems with this model:

Sidecar injection requires modifying every pod spec and typically an admission webhook
ptrace-based interception adds 50–100% overhead to the traced process and is blocked in hardened containers
The agent runs in userspace with elevated privileges — a larger attack surface
Updating the agent requires pod restarts across your fleet

eBPF-based tool — Cilium / Falco / Tetragon

Your K8s cluster
└─► Node
    ├─► App Pod (your service — completely unmodified)
    ├─► App Pod (another service — also unmodified)
    └─► eBPF programs (inside the kernel, verifier-checked)
        └─► See every syscall, network packet, file access
            └─► Forward events to userspace agent via ring buffer

Benefits:

No sidecar injection — pod specs stay clean, no admission webhook required
Kernel-level visibility with near-zero overhead (typically 1–3%)
The verifier guarantees the eBPF programs cannot harm your nodes
Works identically with Docker, containerd, and CRI-O

Tools You Are Probably Already Running — All Verifier-Protected

You may already be running eBPF on your nodes without thinking about it explicitly. In each case below, the verifier ran before the tool ever touched your cluster.

Tool	How the verifier is involved
Cilium	Every network policy decision, service load-balancing operation, and Hubble flow log is handled by eBPF programs that passed the verifier at node startup.
Falco	Every Falco rule is enforced by a verifier-checked eBPF program attached to syscall hooks. Sub-millisecond detection is only possible because the program runs in kernel space.
AWS VPC CNI	On EKS, networking operations have progressively moved to eBPF for performance at scale. If you are on a recent EKS AMI, eBPF is already doing work on your nodes.
systemd	Modern systemd uses eBPF for cgroup-based resource accounting and network traffic control. Active on most current Ubuntu, RHEL, and Amazon Linux 2023 installations.

Questions to Ask When Evaluating eBPF Tools

When a vendor tells you their tool uses eBPF, these three questions will quickly tell you how mature their implementation is.

1. What kernel version do you require?

The verifier’s capabilities have expanded significantly across kernel versions. Tools targeting kernel 5.8+ can use more powerful features safely. Tools claiming to work on kernel 4.x are constrained by an older, more limited verifier. The table below shows exactly where each major distribution stands.

Distribution	Default kernel	eBPF support level	Notes
Ubuntu 16.04 LTS	4.4	Basic eBPF only	No BTF. kprobes and socket filters work but modern tooling like Cilium and Falco eBPF driver will not run. EOL — do not use for new deployments.
Ubuntu 18.04 LTS	4.15	eBPF, no BTF	No CO-RE. Tools must be compiled against the exact running kernel headers. The HWE kernel (5.4) improves this but BTF still varies by build.
Ubuntu 20.04 LTS	5.4	BTF available, verify before use	CO-RE capable on most deployments. `CONFIG_DEBUG_INFO_BTF` was absent on some early builds. Verify with `ls /sys/kernel/btf/vmlinux` before deploying eBPF tooling. Cloud images generally have it enabled.
Ubuntu 20.10+	5.8	Full BTF + CO-RE	First Ubuntu release where BTF was consistently enabled by default. Ring buffers available. Not an LTS release — use 22.04 for production.
Ubuntu 22.04 LTS	5.15	Full modern eBPF — production ready	BTF embedded. Ring buffers, global variables, LSM hooks. Default baseline for EKS-optimised Ubuntu AMIs. Recommended for new deployments.
Ubuntu 24.04 LTS	6.8	Full modern eBPF + latest features	Open-coded iterators, improved verifier precision, enhanced LSM support. Best Ubuntu option for cutting-edge eBPF tooling today.
Debian 10 (Buster)	4.19	Basic eBPF, no BTF	eBPF programs load but CO-RE is unavailable. Must compile against exact kernel headers. EOL — migrate to Debian 11 or 12.
Debian 11 (Bullseye)	5.10 LTS	Full BTF + CO-RE	BTF enabled. CO-RE works. Cilium, Falco, and Tetragon all fully supported. Solid production baseline for Debian environments through 2026.
Debian 12 (Bookworm)	6.1 LTS	Full modern eBPF — production ready	Same kernel generation as Amazon Linux 2023. LSM hooks, ring buffers, full CO-RE. Recommended Debian version for eBPF workloads today.
Debian 13 (Trixie)	6.12 LTS	Full modern eBPF + latest features	Released August 2025. Same kernel generation as RHEL 10 / Rocky 10 / AlmaLinux 10. Maximum eBPF feature availability across all program types.
RHEL 7.6	3.10 (backported)	Tech Preview only — not production safe	First RHEL release to enable eBPF but explicitly marked as Tech Preview. Limited to kprobes and tracepoints. No XDP, no socket filters, no BTF. Do not use for eBPF in production.
RHEL 8 / Rocky 8 / AlmaLinux 8	4.18 (heavily backported)	Full BPF + BTF — functionally 5.4-equivalent	Red Hat backports make RHEL 8 kernels functionally comparable to upstream 5.4 for most eBPF use cases. BTF enabled across all releases. CO-RE works. Cilium treats RHEL 8.6+ as its minimum supported RHEL-family version.
RHEL 9 / Rocky 9 / AlmaLinux 9	5.14 (heavily backported)	Full modern eBPF — production ready	BTF embedded. XDP, tc, kprobe, tracepoint, and LSM hooks all supported. Falco, Cilium, and Tetragon fully supported. Recommended RHEL-family version for eBPF deployments today. Supported until 2032.
RHEL 10 / Rocky 10 / AlmaLinux 10	6.12	Full modern eBPF + latest features	Same kernel generation as Debian 13 and upstream 6.12 LTS. Rocky 10 released June 2025, AlmaLinux 10 released May 2025. Enhanced eBPF functionality throughout.
Amazon Linux 2023	6.1+	Full modern eBPF — production ready	BTF embedded. Full CO-RE. Recommended for EKS. Also resolves the NetworkManager deprecation issues in EKS 1.33+ — see the EKS 1.33 post.

Quick check for any distro: Run ls /sys/kernel/btf/vmlinux on your node. If the file exists, your kernel has BTF enabled and CO-RE-based eBPF tools will work correctly. If it does not exist, you are limited to tools that compile against your specific kernel headers. Run uname -r to confirm the exact kernel version.

Rocky Linux and AlmaLinux note: Both distros rebuild directly from RHEL sources. Their kernel versions and eBPF capabilities are effectively identical to the corresponding RHEL release. When Cilium or Falco document “RHEL 9 support”, that applies equally to Rocky 9 and AlmaLinux 9 without any additional configuration.

2. Do you use CO-RE?

CO-RE (Compile Once, Run Everywhere) means the tool’s eBPF programs work correctly across different kernel versions without recompilation. Tools using CO-RE are more portable and significantly less likely to break after a routine node OS update. This is a reliable signal of engineering maturity in the vendor’s eBPF implementation.

3. What eBPF program types do you use?

Different program types have different privilege levels and access scopes. A tool that only needs kprobe access is asking for considerably less privilege than one requiring lsm hooks.

kprobe / tracepoint — observability and debugging
tc (traffic control) — network policy enforcement
xdp (eXpress Data Path) — high-performance packet processing
lsm (Linux Security Module) — security policy enforcement (used by Tetragon)

Understanding the program type tells you what the tool can and cannot see on your nodes, and how much kernel access you are granting it.

How Falco Uses the Verifier — A Step-by-Step Walkthrough

Here is exactly what happens when Falco starts on one of your K8s nodes, and where the verifier fits in:

1. Falco pod starts on the node (via DaemonSet)

2. Falco loads its eBPF programs into the kernel:
   └─► BPF verifier checks each program
       ├─► Can it crash the kernel?            No → continue
       ├─► Can it loop forever?                No → continue
       ├─► Can it access out-of-bounds memory? No → continue
       └─► PASS → program loads

3. Falco's eBPF programs attach to syscall hooks:
   └─► sys_enter_execve   (every process execution in every container)
   └─► sys_enter_openat   (every file open)
   └─► sys_enter_connect  (every outbound network connection)

4. A container runs an unexpected shell (potential attack):
   └─► execve() called inside the container
   └─► Falco's eBPF hook fires in kernel space
   └─► Event forwarded to Falco userspace via ring buffer
   └─► Falco rule matches: "shell spawned in container"
   └─► Alert fired in under 1 millisecond

5. Your container, your other pods, your node: completely unaffected

Step 2 is what the verifier makes safe. Without it, attaching eBPF hooks to every syscall on your production node would be an unacceptable risk. With it, Falco can offer this level of visibility with a mathematical safety guarantee.

The Bottom Line

You do not need to understand BPF bytecode, register states, or static analysis to use eBPF tools safely in production. What you do need to understand is this:

The BPF verifier is the reason eBPF is fundamentally different from kernel modules. It does not just make eBPF “safer” in a vague sense — it provides a mathematical proof that each program cannot crash your kernel before that program ever runs.

This is why eBPF-based tools can deliver deep kernel-level visibility into every container, every syscall, and every network flow — with near-zero overhead, no sidecar injection, and production safety that kernel modules could never guarantee.

The next time someone on your team hesitates about running Cilium, Falco, or Tetragon on production nodes because “it runs code in the kernel” — you now know what to tell them. The verifier already checked it. Before it ever touched your cluster.

What Is eBPF? A Plain-English Guide for Linux and Kubernetes Engineers

May 13, 2026March 19, 2026 by Vamshi Krishna Santhapuri

Reading Time: 7 minutes

Reading Time: 6 minutes

~1,900 words · Reading time: 7 min · Series: eBPF: From Kernel to Cloud, Episode 1 of 18

Your Linux kernel has had a technology built into it since 2014 that most engineers working with Linux every day have never looked at directly. You have almost certainly been using it — through Cilium, Falco, Datadog, or even systemd — without knowing it was there.

This post is the plain-English introduction to eBPF that I wished existed when I first encountered it. No kernel engineering background required. No bytecode, no BPF maps, no JIT compilation. Just a clear answer to the question every Linux admin and DevOps engineer eventually asks: what actually is eBPF, and why does it matter for the infrastructure I run every day?

Table of Contents

Toggle

Architecture Overview

What Is eBPF — architecture diagram showing eBPF program types, verifier, JIT compiler, and kernel hook points — eBPF sits between user space and the kernel — attaching programs to hook points without modifying kernel source.

TL;DR

eBPF lets you run small, safe programs inside the Linux kernel — no kernel module, no reboot, no application changes required
The name is a historical artefact; modern eBPF is a general-purpose kernel observability and networking platform, not a packet filter
Programs attach to kernel hook points (tracepoints, kprobes, socket filters) — giving you visibility into every syscall, file open, and network packet
You are probably already running eBPF: Cilium, Falco, Datadog, and systemd all use it under the hood
Safe for production because the BPF verifier rejects any program that could crash or loop — covered in depth in EP02
Full feature set from Linux 5.8+; meaningful production use from Linux 4.14+ (most EKS and GKE defaults qualify)

First: Forget the Name

eBPF stands for extended Berkeley Packet Filter. It is one of the most misleading names in computing for what the technology actually does.

The original BPF was a 1992 mechanism for filtering network packets — the engine behind tcpdump. The extended version, introduced in Linux 3.18 (2014) and significantly matured through Linux 5.x, is a completely different technology. It is no longer just about packets. It is no longer just about filtering.

Forget the name. Here is what eBPF actually is:

eBPF lets you run small, safe programs directly inside the Linux kernel — without writing a kernel module, without rebooting, and without modifying your applications.

That is the complete definition. Everything else is implementation detail. The one-liner above is what matters for how you use it day to day.

What the Linux Kernel Can See That Nothing Else Can

To understand why eBPF is significant, you need to understand what the Linux kernel already sees on every server and every Kubernetes node you run.

The kernel is the lowest layer of software on your machine. Every action that happens — every file opened, every process started, every network packet sent — passes through the kernel. That means it has a complete, real-time view of everything:

Every syscall — every open(), execve(), connect(), write() from every process in every container on the node, in real time
Every network packet — source, destination, port, protocol, bytes, and latency for every pod-to-pod and pod-to-external connection
Every process event — every fork, exec, and exit, including processes spawned inside containers that your container runtime never reports
Every file access — which process opened which file, when, and with what permissions, across all workloads on the node simultaneously
CPU and memory usage — per-process CPU time, function-level latency, and memory allocation patterns without profiling agents

The kernel has always had this visibility. The problem was that there was no safe, practical way to access it without writing kernel modules — which are complex, kernel version-specific, and genuinely dangerous to run in production. eBPF is the safe, practical way to access it.

The Problem eBPF Solves — A Real Kubernetes Scenario

Here is a situation every Kubernetes engineer has faced. A production pod starts behaving strangely — elevated CPU, slow responses, occasional connection failures. You want to understand what is happening at a low level: what syscalls is it making, what network connections is it opening, is something spawning unexpected processes?

The old approaches and their problems

Restart the pod with a debug sidecar. You lose the current state immediately. The issue may not reproduce. You have modified the workload.

Run strace inside the container via kubectl exec. strace uses ptrace, which adds 50–100% CPU overhead to the traced process and is unavailable in hardened containers. You are tracing one process at a time with no cluster-wide view.

Poll /proc with a monitoring agent. Snapshot-based. Any event that happens between polls is invisible. A process that starts, does something, and exits between intervals is completely missed.

The eBPF approach

# Use a debug pod on the node — no changes to your workload
$ kubectl debug node/your-node -it --image=cilium/hubble-cli

# Real-time kernel events from every container on this node:
sys_enter_execve  pid=8821  comm=sh    args=["/bin/sh","-c","curl http://..."]
sys_enter_connect pid=8821  comm=curl  dst=203.0.113.42:443
sys_enter_openat  pid=8821  comm=curl  path=/etc/passwd

# Something inside the pod spawned a shell, made an outbound connection,
# and read /etc/passwd — all visible without touching the pod.

Real-time visibility. No overhead on your workload. Nothing restarted. Nothing modified. That is what eBPF makes possible.

Tools You Are Probably Already Running on eBPF

eBPF is not a standalone product — it is the foundation that many tools in the cloud-native ecosystem are built on. You may already be running eBPF on your nodes without thinking about it explicitly.

Tool	What eBPF does for it	Without eBPF
Cilium	Replaces kube-proxy and iptables with kernel-level packet routing. 2–3× faster at scale.	iptables rules — linear lookup, degrades with service count
Falco	Watches every syscall in every container for security rule violations. Sub-millisecond detection.	Kernel module (risky) or ptrace (high overhead)
Tetragon	Runtime security enforcement — can kill a process or drop a network packet at the kernel level.	No practical alternative at this detection speed
Datadog Agent	Network performance monitoring and universal service monitoring without application code changes.	Language-specific agents injected into application code
systemd	cgroup resource accounting and network traffic control on your Linux nodes.	Legacy cgroup v1 interfaces with limited visibility

eBPF vs the Old Ways

Before eBPF, getting deep visibility into a running Linux system meant choosing between three approaches, each with a significant trade-off:

Approach	Visibility	Cost	Production safe?
Kernel modules	Full kernel access	One bug = kernel panic. Version-specific, must recompile per kernel update.	No
ptrace / strace	One process at a time	50–100% CPU overhead on the traced process. Unusable in production.	No
Polling /proc	Snapshots only	Events between polls are invisible. Short-lived processes are missed entirely.	Partial
eBPF	Full kernel visibility	1–3% overhead. Verifier-guaranteed safety. Real-time stream, not polling.	Yes

Is It Safe to Run in Production?

This is always the first question from any experienced Linux admin, and it is exactly the right question to ask. The answer is yes — and the reason is the BPF verifier.

Before any eBPF program is allowed to run on your node, the Linux kernel runs it through a built-in static safety analyser. This analyser examines every possible execution path and asks: could this program crash the kernel, loop forever, or access memory it should not?

If the answer is yes — or even maybe — the program is rejected at load time. It never runs.

This is fundamentally different from kernel modules. A kernel module loads immediately with no safety check. If it has a bug, you find out at runtime — usually as a kernel panic. An eBPF program that would cause a panic is rejected before it ever loads. The safety guarantee is mathematical, not hopeful.

Episode 2 of this series covers the BPF verifier in full: what it checks, how it makes Cilium and Falco safe on your production nodes, and what questions to ask eBPF tool vendors about their implementation.

Common Misconceptions

eBPF is not a specific tool or product. It is a kernel technology — a platform. Cilium, Falco, Tetragon, and Pixie are tools built on top of it. When a vendor says “we use eBPF”, they mean they build on this kernel capability, not that they share a single implementation.

eBPF is not only for networking. The Berkeley Packet Filter name suggests networking, but modern eBPF covers security, observability, performance profiling, and tracing. The networking origin is historical, not a limitation.

eBPF is not only for Kubernetes. It works on any Linux system running kernel 4.9+, including bare metal servers, Docker hosts, and VMs. K8s is the most popular deployment target because of the observability challenges at scale, but it is not a requirement.

You do not need to write eBPF programs to benefit from eBPF. Most Linux admins and DevOps engineers will use eBPF through tools like Cilium, Falco, and Datadog — never writing a line of BPF code themselves. This series covers the writing side later. Understanding what eBPF is makes you a significantly better user of these tools today.

Kernel Version Requirements

eBPF is a Linux kernel feature. The capabilities available depend directly on the kernel version running on your nodes. Run uname -r on any node to check.

Kernel	What becomes available
`4.9+`	Basic eBPF support. Tracing, socket filtering. Most production systems today meet this minimum.
`5.4+`	BTF (BPF Type Format) and CO-RE — programs that adapt to different kernel versions without recompile. Recommended minimum for production tooling.
`5.8+`	Ring buffers for high-performance event streaming. Global variables. The target kernel for Cilium, Falco, and Tetragon full feature support.
`6.x`	Open-coded iterators, improved verifier, LSM security enforcement hooks. Amazon Linux 2023 and Ubuntu 22.04+ ship 5.15 or newer and are fully eBPF-ready.

EKS users: Amazon Linux 2023 AMIs ship with kernel 6.1+ and support the full modern eBPF feature set out of the box. If you are still on AL2, the migration also resolves the NetworkManager deprecation issues covered in the EKS 1.33 post.

The Bottom Line

eBPF is the answer to a question Linux engineers have been asking for years: how do I get deep visibility into what is happening on my servers and Kubernetes nodes — without adding massive overhead, injecting sidecars, or risking a kernel panic?

The answer is: run small, safe programs at the kernel level, where everything is already visible. Let the BPF verifier guarantee those programs are safe before they run. Stream the results to your observability tools through shared memory maps.

The tools you already use — Cilium for networking, Falco for security, Datadog for APM — are built on this foundation. Understanding eBPF means understanding why those tools work the way they do, what they can and cannot see, and how to evaluate new tools that claim to use it.

Every eBPF-based tool you run on your nodes passed through the BPF verifier before it touched your cluster. Episode 2 covers exactly what that means — and why it matters for your infrastructure decisions.