XDP — Packets Processed Before the Kernel Knows They Arrived

Reading Time: 10 minutes

eBPF: From Kernel to Cloud, Episode 7
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP**


TL;DR

  • XDP fires before sk_buff allocation — the earliest possible kernel hook for packet processing
    (sk_buff = the kernel’s socket buffer — every normal packet requires one to be allocated, which adds up fast at scale)
  • Three modes: native (in-driver, full performance), generic (fallback, no perf gain), offloaded (NIC ASIC)
  • XDP context is raw packet bytes — no socket, no cgroup, no pod identity; handle non-IP traffic explicitly
  • Every pointer dereference requires a bounds check against data_end — the verifier enforces this
  • BPF_MAP_TYPE_LPM_TRIE is the right map type for IP prefix blocklists — handles /32 hosts and CIDRs together
  • XDP metadata area enables coordination with TC programs — classify at XDP speed, enforce with pod context at TC

XDP eBPF fires before sk_buff allocation — the earliest possible kernel hook, and the reason iptables rules can be technically correct while still burning CPU at high packet rates. I had iptables DROP rules installed and working during a SYN flood. Packets were being dropped. CPU was still burning at 28% software interrupt time. The rules weren’t wrong. The hook was in the wrong place.


A client’s cluster was under a SYN flood — roughly 1 million packets per second from a rotating set of source IPs. We had iptables DROP rules installed within the first ten minutes, blocklist updated every 30 seconds as new source ranges appeared. The flood traffic dropped in volume. But node CPU stayed high. The %si column in top — software interrupt time — was sitting at 25–30%.

%si in top is the percentage of CPU time spent handling hardware interrupts and kernel-level packet processing — separate from your application’s CPU usage. On a quiet managed cluster (EKS, GKE) this is usually under 1%. Under a packet flood, high %si means the kernel is burning cycles just receiving packets, before your workloads run at all. It’s the metric that tells you the problem is below the application layer.

I didn’t understand why. The iptables rules were matching. Packets were being dropped. Why was the CPU still burning?

The answer is where in the kernel the drop was happening. iptables fires inside the netfilter framework — after the kernel has already allocated an sk_buff for the packet, done DMA from the NIC ring buffer, and traversed several netfilter hooks.

netfilter is the Linux kernel subsystem that handles packet filtering, NAT, and connection tracking. iptables is the userspace CLI that writes rules into it. At high packet rates, the cost isn’t the rule match — it’s the kernel work that happens before the rule is evaluated. At 1 million packets per second, the allocation cost alone is measurable. The attack was “slow” in network terms, but fast enough to keep the kernel memory allocator and netfilter traversal continuously busy.

XDP fires before any of that. Before sk_buff. Before routing. Before the kernel network stack has touched the packet at all. A DROP decision at the XDP layer costs one bounds check and a return value. Nothing else.

Quick Check: Is XDP Running on Your Cluster?

Before the data path walkthrough — a two-command check you can run right now on any cluster node:

# SSH into a worker node, then:
bpftool net list

On a Cilium-managed node, you’ll see something like:

eth0 (index 2):
        xdpdrv  id 44

lxc8a3f21b (index 7):
        tc ingress id 47
        tc egress  id 48

Reading the output:
xdpdrv — XDP in native mode, running in the NIC driver before sk_buff (this is what you want)
xdpgeneric instead of xdpdrvgeneric mode, runs after sk_buff allocation, no performance benefit
– No XDP line at all — XDP not deployed; your CNI uses iptables for service forwarding

If you’re on EKS with aws-vpc-cni or GKE with kubenet, you likely won’t see XDP here — those CNIs use iptables. Understanding this section explains why teams migrating to Cilium see lower node CPU under the same traffic load.

Where XDP Sits in the Kernel Data Path

The standard Linux packet receive path:

NIC hardware
  ↓
DMA to ring buffer (kernel memory)
  ↓
[XDP hook — fires here, before sk_buff]
  ├── XDP_DROP   → discard, zero further allocation
  ├── XDP_PASS   → continue to kernel network stack
  ├── XDP_TX     → transmit back out the same interface
  └── XDP_REDIRECT → forward to another interface or CPU
  ↓
sk_buff allocated from slab allocator
  ↓
netfilter: PREROUTING
  ↓
IP routing decision
  ↓
netfilter: INPUT or FORWARD
  ↓  [iptables fires somewhere in here]
socket receive queue
  ↓
userspace application

XDP runs at the driver level, in the NAPI poll context — the same context where the driver is processing received packets off the ring buffer. The program runs before the kernel’s general networking code gets involved. There’s no sk_buff, no reference counting, no slab allocation.

NAPI (New API) is how modern Linux receives packets efficiently. Instead of one CPU interrupt per packet (catastrophically expensive at 1Mpps), the NIC fires a single interrupt, then the kernel polls the NIC ring buffer in batches until it’s drained. XDP runs inside this polling loop — as close to the hardware as software gets without running on the NIC itself.

At 1Mpps, the difference between XDP_DROP and an iptables DROP is roughly the cost of allocating and then immediately freeing 1 million sk_buff objects per second — plus netfilter traversal, connection tracking lookup, and the DROP action itself. That’s the CPU time that was burning.

After moving the blocklist to an XDP program, the %si on the same traffic load dropped from 28% to 3%.

XDP Modes

XDP operates in three modes, and which one you get depends on your NIC driver.

Native XDP (XDP_FLAGS_DRV_MODE)

The eBPF program runs directly in the NIC driver’s NAPI poll function — in interrupt context, before sk_buff. This is the only mode that delivers the full performance benefit.

Driver support is required. The widely supported drivers: mlx4, mlx5 (Mellanox/NVIDIA), i40e, ice (Intel), bnxt_en (Broadcom), virtio_net (KVM/QEMU), veth (containers). Check support:

# Verify native XDP support on your driver
ethtool -i eth0 | grep driver
# driver: mlx5_core  ← supports native XDP

# Load in native mode
ip link set dev eth0 xdpdrv obj blocklist.bpf.o sec xdp

The veth driver supporting native XDP is what makes XDP meaningful inside Kubernetes pods — each pod’s veth interface can run an XDP program at wire speed.

Generic XDP (XDP_FLAGS_SKB_MODE)

Fallback for drivers that don’t support native XDP. The program still runs, but it runs after sk_buff allocation, as a hook in the netif_receive_skb path. No performance benefit over early netfilter. sk_buff is still allocated and freed for every packet.

# Generic mode — development and testing only
ip link set dev eth0 xdpgeneric obj blocklist.bpf.o sec xdp

Use this for development on a laptop with a NIC that lacks native XDP support. Never benchmark with it and never use it in production expecting performance gains.

Offloaded XDP

Runs on the NIC’s own processing unit (SmartNIC). Zero CPU involvement — the XDP decision happens in NIC hardware. Supported by Netronome Agilio NICs. Rare in production, but the theoretical ceiling for XDP performance.

The XDP Context: What Your Program Can See

XDP programs receive one argument: struct xdp_md.

struct xdp_md {
    __u32 data;           // offset of first packet byte in the ring buffer page
    __u32 data_end;       // offset past the last byte
    __u32 data_meta;      // metadata area before data (XDP metadata for TC cooperation)
    __u32 ingress_ifindex;
    __u32 rx_queue_index;
};

data and data_end are used as follows:

void *data     = (void *)(long)ctx->data;
void *data_end = (void *)(long)ctx->data_end;

// Every pointer dereference must be bounds-checked first
struct ethhdr *eth = data;
if ((void *)(eth + 1) > data_end)
    return XDP_PASS;  // malformed or truncated packet

The verifier enforces these bounds checks — every pointer derived from ctx->data must be validated before use. The error invalid mem access 'inv' means you dereferenced a pointer without checking the bounds. This is the most common cause of XDP program rejection.

For operators (not writing XDP code): You’ll see invalid mem access 'inv' in logs when an eBPF program is rejected at load time — most commonly during a Cilium upgrade or a custom tool deployment on a kernel the tool wasn’t built for. The fix is in the eBPF source or the tool version, not the cluster config. If you see this error and you’re not writing eBPF yourself, it means the tool’s build doesn’t match your kernel version — upgrade the tool or check its supported kernel matrix.

What XDP cannot see:
– Socket state — no socket buffer exists yet
– Cgroup hierarchy — no pod identity
– Process information — no PID, no container
– Connection tracking state (unless you maintain it yourself in a map)

XDP is ingress-only. It fires on packets arriving at an interface, not departing. For egress, TC is the hook.

What This Means on Your Cluster Right Now

Every Cilium-managed node has XDP programs running. Here’s how to see them:

# All XDP programs on all interfaces — this is the full picture
bpftool net list
# Sample output on a Cilium node:
#
# eth0 (index 2):
#         xdpdrv  id 44         ← XDP in native mode on the node uplink
#
# lxc8a3f21b (index 7):
#         tc ingress id 47      ← TC enforces NetworkPolicy on pod ingress
#         tc egress  id 48      ← TC enforces NetworkPolicy on pod egress
#
# "xdpdrv"     = native mode (runs in NIC driver, before sk_buff — full performance)
# "xdpgeneric" = fallback mode (after sk_buff — no performance benefit over iptables)

# Which mode is active?
ip link show eth0 | grep xdp
# xdp mode drv  ← native (full performance)
# xdp mode generic  ← fallback (no perf benefit)

# Details on the XDP program ID
bpftool prog show id $(bpftool net show dev eth0 | grep xdp | awk '{print $NF}')
# Shows: loaded_at, tag, xlated bytes, jited bytes, map IDs

The map IDs in that output are the BPF maps the XDP program is using — typically the service VIP table for DNAT, and in security tools, the blocklist or allowlist. To see what’s in them:

# List maps used by the XDP program
bpftool prog show id <PROG_ID> | grep map_ids

# Dump the service map (for a Cilium node — this is the load balancer table)
bpftool map dump id <MAP_ID> | head -40

For a blocklist scenario — like the SYN flood mitigation above — the BPF_MAP_TYPE_LPM_TRIE is the standard data structure. A lookup for 192.168.1.45 hits a 192.168.1.0/24 entry in the same map, handling both host /32s and CIDR ranges in one lookup. The practical operational check:

# Count entries in an XDP filter map
bpftool map dump id <BLOCKLIST_MAP_ID> | grep -c "key"

# Verify XDP is active and inspect program details
bpftool net show dev eth0

XDP Metadata: Cooperating with TC

Think of it as a sticky note attached to the packet. XDP writes the note at line speed (no context about pods or sockets). TC reads it later when full context is available, and acts on it. The packet carries the note between them.

More precisely: XDP can write metadata into the area before ctx->data — a small scratch space that survives as the packet moves from XDP to the TC hook. This is the coordination mechanism between the two eBPF layers.

The pattern is: XDP classifies at speed (no sk_buff overhead), TC enforces with pod context (where you have socket identity). XDP writes a classification tag into the metadata area. TC reads it and makes the policy decision.

This is the architecture behind tools like Pro-NDS: the fast-path pattern matching (connection tracking, signature matching) happens at XDP before any kernel allocation. The enforcement action — which requires knowing which pod sent this — happens at TC using the metadata XDP already wrote.

From an operational standpoint, when you see two eBPF programs on the same interface (one XDP, one TC), this pipeline is the likely explanation. The bpftool net list output shows both:

bpftool net list
# xdpdrv id 44 on eth0       ← XDP classifier running at line rate
# tc ingress id 47 on eth0   ← TC enforcer reading XDP metadata

How Cilium Uses XDP

Not running Cilium? On EKS with aws-vpc-cni or GKE with kubenet, service forwarding uses iptables NAT rules and conntrack instead. You can see this with iptables -t nat -L -n on a node — look for the KUBE-SVC-* chains. Those chains are what XDP replaces in a Cilium cluster. This is why teams migrating from kube-proxy to Cilium report lower node CPU at high connection rates — it’s not magic, it’s hook placement.

On a Cilium node, XDP handles the load balancing path for ClusterIP services. When a packet arrives at the node destined for a ClusterIP:

  1. XDP program checks the destination IP against a BPF LRU hash map of known service VIPs
  2. On a match, it performs DNAT — rewriting the destination IP to a backend pod IP
  3. Returns XDP_TX or XDP_REDIRECT to forward directly

No iptables NAT rules. No conntrack state machine. No socket buffer allocation for the routing decision. The lookup is O(1) in a BPF hash map.

# See Cilium's XDP program on the node uplink
ip link show eth0 | grep xdp
# xdp  (attached, native mode)

# The XDP program details
bpftool prog show pinned /sys/fs/bpf/cilium/xdp

# Load time, instruction count, JIT-compiled size
bpftool prog show id $(bpftool net list | grep xdp | awk '{print $NF}')

At production scale — 500+ nodes, 50k+ services — removing iptables from the service forwarding path with XDP reduces per-node CPU utilization measurably. The effect is most visible on nodes handling high connection rates to cluster services.

Operational Inspection

# All XDP programs on all interfaces
bpftool net list

# Check XDP mode (native, generic, offloaded)
ip link show | grep xdp

# Per-interface stats — includes XDP drop/pass counters
cat /sys/class/net/eth0/statistics/rx_dropped

# XDP drop counters exposed via bpftool
bpftool map dump id <stats_map_id>

# Verify XDP is active and show program details
bpftool net show dev eth0

Common Mistakes

Mistake Impact Fix
Missing bounds check before pointer dereference Verifier rejects: “invalid mem access” Always check ptr + sizeof(*ptr) > data_end before use
Using generic XDP for performance testing Misleading numbers — sk_buff still allocated Test in native mode only; check ip link output for mode
Not handling non-IP traffic (ARP, IPv6, VLAN) ARP breaks, IPv6 drops, VLAN-tagged frames dropped Check eth->h_proto and return XDP_PASS for non-IP
XDP for egress or pod identity No socket context at XDP; XDP is ingress only Use TC egress for pod-identity-aware egress policy
Forgetting BPF_F_NO_PREALLOC on LPM trie Full memory allocated at map creation for all entries Always set this flag for sparse prefix tries
Blocking ARP by accident in a /24 blocklist Loss of layer-2 reachability within the blocked subnet Separate ARP handling before the IP blocklist check

Key Takeaways

  • XDP fires before sk_buff allocation — the earliest possible kernel hook for packet processing
  • Three modes: native (in-driver, full performance), generic (fallback, no perf gain), offloaded (NIC ASIC)
  • XDP context is raw packet bytes — no socket, no cgroup, no pod identity; handle non-IP traffic explicitly
  • Every pointer dereference requires a bounds check against data_end — the verifier enforces this
  • BPF_MAP_TYPE_LPM_TRIE is the right map for IP prefix blocklists — handles /32 hosts and CIDRs together
  • XDP metadata area enables coordination with TC programs — classify at XDP speed, enforce with pod context at TC

What’s Next

XDP handles ingress at the fastest possible point but has no visibility into which pod sent a packet. EP08 covers TC eBPF — the hook that fires after sk_buff allocation, where socket and cgroup context exist.

TC is how Cilium implements pod-to-pod network policy without iptables. It’s also where stale programs from failed Cilium upgrades leave ghost filters that cause intermittent packet drops. Knowing how TC programs chain — and how to find and remove stale ones — is a specific, concrete operational skill.

Next: TC eBPF — pod-level network policy without iptables

Get EP08 in your inbox when it publishes → linuxcent.com/subscribe

CO-RE and libbpf — Write Once, Run on Any Kernel

Reading Time: 9 minutes

eBPF: From Kernel to Cloud, Episode 6
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf**


TL;DR

  • Kernel structs change between releases — hardcoded offsets break across patch versions, not just major releases
  • BTF embeds full type information in the kernel at /sys/kernel/btf/vmlinux; CO-RE uses it to patch field accesses at load time
    (BTF = BPF Type Format — a compact description of every struct, field, and byte offset in the running kernel, built into the kernel image)
  • vmlinux.h, generated from BTF, replaces all kernel headers with a single file committed to your repository
  • BPF_CORE_READ() is the CO-RE macro — every kernel struct access in a portable program goes through it
  • libbpf skeleton generation (bpftool gen skeleton) eliminates manual fd management for map and program lifecycle
  • For production tools: libbpf + CO-RE. For one-off debugging: bpftrace. For prototyping: BCC.

eBPF CO-RE (Compile Once, Run Everywhere) solves the kernel portability problem — the reason Cilium and Falco survive kernel upgrades without recompilation. What maps assumed — quietly — is that the kernel structs those programs read look the same tomorrow as they do today. They don’t. The Linux kernel has no stable ABI for internal data structures. task_struct, sk_buff, sock — the fields eBPF programs read constantly — can shift between patch releases, not just major versions. I learned this the hard way when a routine upgrade from 5.15.0-89 to 5.15.0-91 — two patch revisions — silently broke a custom tracer I’d been running in production for six months.


Six months after deploying a custom eBPF tracer for a client — it detected specific syscall patterns that Falco’s default ruleset didn’t cover — they ran a routine Ubuntu patch upgrade. Not a major kernel version jump. 5.15.0-89 to 5.15.0-91. Two patch revisions.

The tracer stopped loading. The error was invalid indirect read from stack. I opened the program source: nothing remotely like an indirect read. The program was a straightforward tracepoint handler, maybe 40 lines of C.

Three hours of debugging led to a four-byte offset difference. The struct task_struct had a field alignment change between the two patch versions. My program accessed ->comm at a hardcoded byte offset. On 5.15.0-89 that offset was 0x620. On 5.15.0-91 it was 0x624. The verifier caught the misalignment — correctly — and rejected the program.

I had compiled the eBPF bytecode against a fixed kernel header snapshot. The binary was not portable. Every time the kernel moved a struct field, the tool broke.

CO-RE is the solution to this.

Quick Check: Does Your Cluster Support CO-RE?

Two commands — check whether your nodes have the BTF support that CO-RE tools require:

# SSH into a worker node, then:
ls -la /sys/kernel/btf/vmlinux && echo "BTF available — CO-RE tools will work"

Expected output on a supported node:

-r--r--r-- 1 root root 4956234 Apr 21 00:00 /sys/kernel/btf/vmlinux
BTF available — CO-RE tools will work

If the file is missing: CO-RE tools (Cilium, Falco, Tetragon) will fall back to legacy BCC compilation mode — which requires a full compiler toolchain and kernel headers installed on every node.

# Confirm the kernel was built with BTF enabled
cat /boot/config-$(uname -r) | grep CONFIG_DEBUG_INFO_BTF
# CONFIG_DEBUG_INFO_BTF=y  ← required for CO-RE

Common results by platform:
| Platform | BTF available? |
|———-|—————-|
| Ubuntu 20.04+ (kernel 5.4+) | ✓ Yes |
| EKS managed nodes (AL2023) | ✓ Yes |
| GKE managed nodes (kernel 5.10+) | ✓ Yes |
| Amazon Linux 2 (older kernels) | ✗ No — BCC fallback |
| RHEL 7 / CentOS 7 | ✗ No |

Why Kernel Structs Change and Why It Matters

The Linux kernel has no stable ABI for internal data structures. task_struct, sock, sk_buff, file — the structs that eBPF programs read constantly — change between releases.

ABI (Application Binary Interface) is the contract that says a compiled binary built against version N will still work against version N+1 without recompilation. The Linux kernel maintains a stable ABI for syscalls (open(), read(), connect()) but makes no such guarantee for internal structs. Fields move, get added, get renamed between patch releases — and any program with hardcoded offsets silently breaks. Field additions, reordering, alignment changes, struct embedding changes. The kernel developers are under no obligation to preserve internal layouts, and they don’t.

Before CO-RE, eBPF programs dealt with this in two ways:

BCC (BPF Compiler Collection) — compile the eBPF C code at runtime on the target host, using that system’s kernel headers. No portability problem because compilation happens on the machine you’re deploying to. Cost: you need a full compiler toolchain, kernel headers, and Python runtime on every production node. Startup time in seconds. Container image size in hundreds of MB. For a security tool that should be lightweight and fast-starting, this is a non-starter.

Per-kernel compiled binaries — ship different builds for each supported kernel version, detect at runtime, load the matching binary. Falco maintained this model for years. The operational overhead is significant: a matrix of kernel × distro × version with separate build and test pipelines for each combination.

CO-RE is the third option. Compile once on a build machine, and let libbpf patch struct field accesses at load time on the target system, using type information embedded in the running kernel.

BTF: The Type System That Makes CO-RE Possible

BTF (BPF Type Format) is compact type debug information embedded directly into the kernel image. Since Linux 5.2, kernels built with CONFIG_DEBUG_INFO_BTF=y expose their full type information at /sys/kernel/btf/vmlinux.

# Verify BTF is available
ls -la /sys/kernel/btf/vmlinux

# Inspect the BTF for a specific struct
bpftool btf dump file /sys/kernel/btf/vmlinux format raw | grep -A 5 'task_struct'

# See the actual field offsets the running kernel uses
bpftool btf dump file /sys/kernel/btf/vmlinux format c | grep -A 20 'struct task_struct {'

BTF encodes every struct definition with field names, types, and byte offsets. When libbpf loads an eBPF program compiled with CO-RE relocations, it reads both the BTF the program was compiled against (embedded in the .bpf.o file) and the BTF of the running kernel. If task_struct->comm has moved, libbpf patches the field access instruction before loading the program.

This patching happens at load time, transparently, without modifying the binary you shipped.

CO-RE relocation is the mechanism behind this. When a CO-RE program is compiled, it embeds metadata saying “I need the offset of comm inside task_struct” rather than hardcoding 0x620. At load time, libbpf reads this relocation, looks up the real offset from the running kernel’s BTF, and patches the instruction. For operators: this is why Cilium and Falco survive kernel upgrades without you reinstalling them.

Most distribution kernels now ship with BTF enabled:

# Ubuntu 20.04+ (kernel 5.4+)
cat /boot/config-$(uname -r) | grep CONFIG_DEBUG_INFO_BTF
# CONFIG_DEBUG_INFO_BTF=y

# Check at runtime
file /sys/kernel/btf/vmlinux
# /sys/kernel/btf/vmlinux: symbolic link to /sys/kernel/btf/vmlinux

Amazon Linux 2023, Ubuntu 22.04, Debian 11+, RHEL 8.2+, and most cloud-provider-managed kernels have BTF. The notable exception: RHEL 7 and Amazon Linux 2 on older kernels.

The CO-RE Toolchain

The build pipeline for a CO-RE eBPF program:

Development machine:
  vmlinux.h (generated from kernel BTF)
       ↓
  myprog.bpf.c ──── clang -target bpf -g ────→ myprog.bpf.o
  (CO-RE relocations embedded in BTF section)
       ↓
  bpftool gen skeleton myprog.bpf.o ─────────→ myprog.skel.h
       ↓
  myprog.c (userspace) ── gcc ──→ myprog
  (statically links libbpf, skeleton handles load/attach/cleanup)

Target machine (any kernel with BTF, 5.4+):
  ./myprog
  ↓ libbpf reads /sys/kernel/btf/vmlinux
  ↓ patches field accesses to match current kernel struct layout
  ↓ verifier validates patched program
  ↓ program loads and runs

One binary. Any supported kernel. No compiler on the target system.

vmlinux.h — One Header to Replace Them All

Before CO-RE, eBPF C programs included dozens of kernel headers — linux/sched.h, linux/net.h, linux/fs.h, linux/socket.h — and they had to match the exact kernel version you were targeting.

vmlinux.h is generated from the BTF of a running kernel. It contains every struct, enum, typedef, and macro definition the kernel exposes through BTF — in a single file, without any compile-time kernel dependency.

# Generate vmlinux.h from the running kernel
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h

# Typical size
wc -l vmlinux.h
# 350000+

You commit vmlinux.h to your repository, generated from a representative kernel. CO-RE handles the actual layout differences at load time on whatever kernel you deploy to. The file is large but you only generate it once and update it when you add support for a new kernel generation.

In your eBPF C source:

#include "vmlinux.h"           // replaces all kernel headers
#include <bpf/bpf_helpers.h>   // eBPF helper functions
#include <bpf/bpf_tracing.h>   // tracing macros
#include <bpf/bpf_core_read.h> // CO-RE read macros

How CO-RE Fixes the Offset Problem

The mechanism is worth understanding once, even if you’re not writing eBPF programs.

When a CO-RE eBPF program accesses a kernel struct field, it doesn’t hardcode the byte offset. Instead, it records a relocation — “I need the offset of pid inside task_struct” — in the compiled binary. When libbpf loads the program, it resolves each relocation by looking up the field in the running kernel’s BTF and patches the access instruction to use the correct offset for this specific kernel.

This is why my four-byte problem happened: the tracer I’d compiled wasn’t using CO-RE. It hardcoded 0x620 as the offset of task_struct->comm. When the kernel moved it to 0x624, the program accessed the wrong memory, the verifier caught the misalignment, and the load failed. A CO-RE rewrite would have resolved comm‘s offset at load time from BTF and never known the difference.

The relocation model also handles fields that don’t exist on older kernels. If a program accesses a field added in kernel 5.15 and the running kernel is 5.10, libbpf can either skip the access (returning a zero value) or fail the load — depending on how the program marks the field access. This is how tools ship support for features across a kernel version range without separate builds.

What CO-RE Means for Tools You Already Run

This is why you care about CO-RE even if you’re never going to write an eBPF program yourself.

Falco, Cilium, Tetragon, and Pixie all ship as single binaries or container images. You install them on a Ubuntu 22.04 node, a RHEL 9 node, and an Amazon Linux 2023 node — three different kernel versions, three different task_struct layouts — and the same binary works on all of them. Before CO-RE, Falco maintained pre-compiled kernel probes for every supported kernel version in a matrix of distro × kernel × version. The probe list had thousands of entries. A kernel your distro shipped between Falco release cycles meant a gap in coverage until the next release.

With CO-RE, there’s one binary. libbpf reads the running kernel’s BTF at load time, patches the field accesses to match the actual struct layout, and the verifier checks the patched program. The tool vendor doesn’t need to know about your specific kernel. You don’t need to wait for a probe release.

The constraint is BTF availability. Check your nodes:

# Quick check — if this file exists, CO-RE tools work
ls /sys/kernel/btf/vmlinux

# Full confirmation
cat /boot/config-$(uname -r) | grep CONFIG_DEBUG_INFO_BTF
# CONFIG_DEBUG_INFO_BTF=y  ← required

What you’ll find: Ubuntu 20.04+, Debian 11+, RHEL 8.2+, Amazon Linux 2023, and GKE/EKS managed nodes all have BTF. Amazon Linux 2 and RHEL 7 do not. If you’re running those, CO-RE-based tools fall back to the legacy BCC compilation path — which requires kernel headers installed on the node.

The One Thing to Run Right Now

This command shows you the exact struct layout your running kernel uses — the same layout libbpf reads when it patches CO-RE programs at load time:

# See how your kernel defines task_struct right now
bpftool btf dump file /sys/kernel/btf/vmlinux format c | grep -A 30 '^struct task_struct {'

The output is the canonical type information for your running kernel. Every field, every offset. When libbpf loads a CO-RE program, it’s reading this to figure out whether task_struct->comm is at offset 0x620 or 0x624.

You can also see specific struct sizes and verify that two kernels differ:

# On kernel A (5.15.0-89)
bpftool btf dump file /sys/kernel/btf/vmlinux format raw | grep -w "task_struct" | head -3

# On kernel B (5.15.0-91) — same command, different output if struct changed
# This is what broke my tracer: field offset changed across a two-patch jump

The practical use: when a CO-RE eBPF tool fails to load with a BTF error, this is where you look. The error tells you which struct field the relocation failed on. This command shows you the current layout. You can confirm whether the field exists, whether it moved, whether it was renamed.

BCC vs libbpf vs bpftrace

Three approaches to eBPF development, with distinct tradeoffs:

BCC libbpf + CO-RE bpftrace
Compilation Runtime on target host Build-time on dev machine Runtime (embedded LLVM)
Target deployment Compiler + headers on every node Single static binary bpftrace binary only
Portability Compile-on-target handles it CO-RE + BTF handles it Internal CO-RE support
Memory overhead High (Python + compiler: 200MB+) Low (few MB binary) Medium
Startup time Seconds (compilation) Milliseconds Seconds (JIT compile)
Best for Prototyping, development Production tools, shipped software Interactive debugging sessions
Language Python + C C (kernel) + C/Go/Rust (userspace) bpftrace scripting

For anything you’re shipping — an eBPF-based security tool, an observability agent, an open-source project — libbpf + CO-RE is the right choice. BCC is for prototyping before you commit to an implementation. bpftrace is for the 30-second debugging session on a live node.

The practical test: if you’re building something you’ll deploy as a container image or a package, it needs to be a self-contained binary with no build dependencies on the target system. That means libbpf.

Common Mistakes

Mistake Impact Fix
Direct struct dereference instead of BPF_CORE_READ Program breaks on any kernel struct change Use BPF_CORE_READ() for all kernel struct field access
Missing char LICENSE[] SEC("license") = "GPL" GPL-only helpers (most tracing helpers) unavailable Always include the license section
vmlinux.h generated on a very old kernel Missing structs added in newer kernel releases Regenerate from the highest kernel version you target
Forgetting -g flag in clang invocation No BTF debug info → no CO-RE relocations Always compile with -g -O2 -target bpf
Hardcoding struct offsets as integer constants Breaks silently on next kernel patch Use BTF-aware CO-RE macros exclusively

Key Takeaways

  • Kernel structs change between releases — hardcoded offsets break across patch versions, not just major releases
  • BTF embeds full type information in the kernel at /sys/kernel/btf/vmlinux; CO-RE uses it to patch field accesses at load time
  • vmlinux.h, generated from BTF, replaces all kernel headers with a single file committed to your repository
  • BPF_CORE_READ() is the CO-RE macro — every kernel struct access in a portable program goes through it
  • libbpf skeleton generation (bpftool gen skeleton) eliminates manual fd management for map and program lifecycle
  • For production tools: libbpf + CO-RE. For one-off debugging: bpftrace. For prototyping: BCC.

What’s Next

CO-RE makes eBPF programs portable across kernel versions. EP07 takes the next question: where in the kernel’s data path does it make sense to attach them?

XDP fires before the kernel has allocated a single byte of memory for an incoming packet — before the kernel even knows whether to accept it. That hook placement is why Cilium can do line-rate load balancing and why some network filtering rules that look correct in iptables do nothing against certain traffic. The rules weren’t wrong. The hook was in the wrong place.

Next: XDP — packets processed before the kernel knows they arrived

Get EP07 in your inbox when it publishes → linuxcent.com/subscribe

eBPF Maps — The Persistent Data Layer Between Kernel and Userspace

Reading Time: 11 minutes

eBPF: From Kernel to Cloud, Episode 5
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps**


TL;DR

  • eBPF programs are stateless — maps are where all state lives, between invocations and between kernel and userspace
    (“stateless” here means each program invocation starts with no memory of previous runs — like a function with no global variables)
  • Every production eBPF tool (Cilium, Falco, Tetragon, Datadog NPM) is a map-based architecture — bpftool map list shows you what it’s actually holding
  • Per-CPU maps eliminate write contention for high-frequency counters; the tool aggregates per-CPU values at export time
  • LRU maps handle unbounded key spaces (IPs, PIDs, connections) without hard errors when full — but eviction is silent, so size generously
  • Ring buffer (kernel 5.8+) is the correct event streaming primitive — Falco and Tetragon both use it
  • Map memory is kernel-locked and invisible to standard memory metrics — account for it explicitly on eBPF-heavy nodes
  • Pinned maps survive restarts; Cilium uses this for zero-disruption connection tracking through upgrades

The Big Picture

  HOW eBPF MAPS CONNECT KERNEL PROGRAMS TO USERSPACE TOOLS

  ┌─────────────────────────────────────────────────────────────┐
  │  Kernel space                                               │
  │                                                             │
  │  [XDP program]  [TC program]  [kprobe]  [tracepoint]        │
  │        │              │           │           │             │
  │        └──────────────┴───────────┴───────────┘             │
  │                              │                              │
  │                   bpf_map_update_elem()                     │
  │                              │                              │
  │                              ▼                              │
  │  ┌─────────────────────────────────────────────────────┐    │
  │  │             eBPF MAP (kernel object)                │    │
  │  │  hash · percpu_hash · lru_hash · ringbuf · lpm_trie │    │
  │  │  Lives outside program invocations.                 │    │
  │  │  Pinned maps (/sys/fs/bpf/) survive restarts.       │    │
  │  └────────────────────┬────────────────────────────────┘    │
  └───────────────────────│─────────────────────────────────────┘
                          │  read / write via file descriptor
                          ▼
  ┌─────────────────────────────────────────────────────────────┐
  │  Userspace tools                                            │
  │                                                             │
  │  Cilium agent  Falco engine  Tetragon  bpftool map dump     │
  └─────────────────────────────────────────────────────────────┘

eBPF maps are the persistent data layer between kernel programs and the tools that consume their output. eBPF programs fire and exit — there’s no memory between invocations. Yet Cilium tracks TCP connections across millions of packets, and Falco correlates a process exec from five minutes ago with a suspicious network connection happening now. The mechanism between stateless kernel programs and the stateful production tools you depend on is what this episode is about — and understanding it changes what you see when you run bpftool map list.


I was trying to identify the noisy neighbor saturating a cluster’s egress link. I had an eBPF program loading cleanly, events firing, everything confirming it was working. But when I read back the per-port connection counters from userspace, everything was zero.

I spent an hour on it before posting to the BCC mailing list. The reply came back fast: eBPF programs don’t hold state between invocations. Every time the kprobe fires, the program starts fresh. The counter I was incrementing existed only for that single call — created, incremented to one, then discarded. On every single invocation. I was counting events one at a time, throwing the count away, and reading nothing.

That’s what eBPF maps solve.

Quick Check: What Maps Are Running on Your Node?

Before the map types walkthrough — see the live state of maps on any cluster node right now:

# SSH into a worker node, then:
bpftool map list

On a node running Cilium + Falco, you’ll see something like:

12: hash          name cilium_ct4_glo    key 24B  value 56B  max_entries 65536  memlock 5767168B
13: lpm_trie      name cilium_ipcache    key 40B  value 32B  max_entries 512000 memlock 327680B
14: percpu_hash   name cilium_metrics    key 8B   value 32B  max_entries 65536  memlock 2097152B
28: ringbuf       name falco_events      max_entries 8388608

Reading this output:
hash, lpm_trie, percpu_hash, ringbuf — the map type (each optimised for a different access pattern)
key 24B value 56B — sizes of a single entry’s key and value in bytes
max_entries — the hard ceiling; when the map is full, behaviour depends on type (see LRU section below)
memlock — non-pageable kernel memory this map consumes (invisible to free and container metrics)

Not running Cilium? On EKS with aws-vpc-cni or GKE with kubenet, there are far fewer maps here — primarily kube-proxy uses iptables rather than BPF maps. Running bpftool map list still works; you’ll just see fewer entries. On a pure iptables-based cluster, most of the maps you see come from the system kernel itself, not a CNI.

Maps Are the Architecture, Not an Afterthought

Maps are kernel objects that live outside any individual program invocation. They’re shared between multiple eBPF programs, readable and writable from userspace, and persistent for the lifetime of the map — which can outlive both the program that created them and the userspace process that loaded them.

Every production eBPF tool is fundamentally a map-based architecture:

  • Cilium stores connection tracking state in BPF hash maps
  • Falco uses ring buffers to stream syscall events to its userspace rule engine
  • Tetragon maintains process tree state across exec events using maps
  • Datadog NPM stores per-connection flow stats in per-CPU maps for lock-free metric accumulation

Run bpftool map list on a Cilium node:

$ bpftool map list
ID 12: hash          name cilium_ct4_glo    key 24B  value 56B   max_entries 65536
#      ^^^^           ^^^^^^^^^^^^^^^^       ^^^^^^   ^^^^^^^     ^^^^^^^^^^^^^^^^
#      type           map name               key size value size  max concurrent entries

ID 13: lpm_trie      name cilium_ipcache    key 40B  value 32B   max_entries 512000
#      longest-prefix-match trie — for IP address + CIDR lookups

ID 14: percpu_hash   name cilium_metrics    key 8B   value 32B   max_entries 65536
#      one copy of this map per CPU — no write contention for high-frequency counters

ID 28: ringbuf       name falco_events      max_entries 8388608
#                                           ^^^^^^^^^^^ 8MB ring buffer for event streaming

Connection tracking, IP policy cache, per-CPU metrics, event stream. Every one of these is a different map type, chosen for a specific reason.

Map Types and What They’re Actually Used For

Hash Maps

The general-purpose key-value store. A key maps to a value — lookup is O(1) average. Cilium’s connection tracking map (cilium_ct4_glo) is a hash map: the key is a 5-tuple (source IP, destination IP, ports, protocol), the value is the connection state.

$ bpftool map show id 12
12: hash  name cilium_ct4_glo  flags 0x0
        key 24B  value 56B  max_entries 65536  memlock 5767168B

The key 24B is the 5-tuple. The value 56B is the connection state record. max_entries 65536 is the upper bound — Cilium can track 65,536 active connections in this map before hitting the limit.

Hash maps are shared across all CPUs on the node. When multiple CPUs try to update the same entry simultaneously — which happens constantly on busy nodes — writes need to be coordinated. For most use cases this is fine. For high-frequency counters updated on every packet, it’s a bottleneck. That’s when you reach for a per-CPU hash map.

Where you see them: connection tracking, per-IP statistics, process-to-identity mapping, policy verdict caching.

Per-CPU Hash Maps

Per-CPU hash maps solve the write coordination problem by giving each CPU its own independent copy of every entry. There’s no sharing, no contention, no waiting — each CPU writes its own copy without touching any other.

The tradeoff: reading from userspace means collecting one value per CPU and summing them up. That aggregation happens in the tool, not the kernel.

# Cilium's per-CPU metrics map — one counter value per CPU
bpftool map dump id 14
key: 0x00000001
  value (CPU 00): 12345
  value (CPU 01): 8901
  value (CPU 02): 3421
  value (CPU 03): 7102
# total bytes for this metric: 31769

Cilium’s cilium_metrics map uses this pattern for exactly this reason — it’s updated on every packet across every CPU on the node. Forcing all CPUs to coordinate writes to a single shared entry at that rate would hurt throughput. Instead: each CPU writes locally, Cilium’s userspace agent sums the values at export time.

Where you see them: packet counters, byte counters, syscall frequency metrics — anywhere updates happen on every event at high volume.

LRU Hash Maps

LRU hash maps add automatic eviction. Same key-value semantics as a regular hash map, but when the map hits its entry limit, the least recently accessed entry is dropped to make room for the new one.

This matters for any map tracking dynamic state with an unpredictable number of keys: TCP connections, process IDs, DNS queries, pod IPs. Without LRU semantics, a full map returns an error on insert — and in production, that means your tool silently stops tracking new entries. Not a crash, not an alert — just missing data.

Cilium’s connection tracking map is LRU-bounded at 65,536 entries. On a node handling high-connection-rate workloads, this can fill up. When it does, Cilium starts evicting old connections to make room for new ones — and if it’s evicting too aggressively, you’ll see connection resets.

# Check current CT map usage vs its limit
bpftool map show id 12
# max_entries tells you the ceiling
# count entries to see current usage
bpftool map dump id 12 | grep -c "^key"

Size LRU maps at 2× your expected concurrent active entries. Aggressive eviction under pressure introduces gaps — not crashes, but missing or incorrect state.

Where you see them: connection tracking, process lineage, anything where the key space is dynamic and unbounded.

Ring Buffers

Ring buffers are how eBPF tools stream events from the kernel to a userspace consumer. Falco reads syscall events from a ring buffer. Tetragon streams process execution and network events through ring buffers. The pattern is the same across all of them:

kernel eBPF program
  → sees event (syscall, network packet, process exec)
  → writes record to ring buffer
  → userspace tool reads it and processes (Falco rules, Tetragon policies)

What makes ring buffers the right primitive for event streaming:

  • Single buffer shared across all CPUs — unlike the older perf_event_array approach which required one buffer per CPU, a ring buffer is one allocation, one file descriptor, one consumer
  • Lock-free — the kernel writes, the userspace tool reads, they don’t block each other
  • Backpressure when full — if the userspace tool can’t keep up, new events are dropped rather than queued indefinitely. The tool can detect and count drops. Falco reports these as Dropped events in its stats output.
# Falco's ring buffer — 8MB
bpftool map list | grep ringbuf
# ID 28: ringbuf  name falco_events  max_entries 8388608

8,388,608 bytes = 8MB. That’s the buffer between Falco’s kernel hooks and its rule engine. If there’s a burst of syscall activity and Falco’s rule evaluation can’t keep up, events drop into that window and are lost.

Sizing matters operationally. Too small and you drop events during normal burst. Too large and you’re holding non-pageable kernel memory that doesn’t show up in standard memory metrics.

# Check Falco's drop rate
falcoctl stats
# or check the Falco logs
journalctl -u falco | grep -i "drop"

Most production deployments run 8–32MB. Start at 8MB, monitor drop rates under load, size up if needed.

Where you see them: Falco event streaming, Tetragon audit events, any tool that needs to move high-volume event data from kernel to userspace.

Array Maps

Array maps are fixed-size, integer-indexed, and entirely pre-allocated at creation time. Think of them as lookup tables with integer keys — constant-time access, no hash overhead, no dynamic allocation.

Cilium uses array maps for policy configuration: a fixed set of slots indexed by endpoint identity number. When a packet arrives and Cilium needs to check policy, it indexes into the array directly rather than doing a hash lookup. For read-heavy, write-rare data, this is faster.

The constraint: you can’t delete entries from an array map. Every slot exists for the lifetime of the map. If you need to track state that comes and goes — connections, processes, pods — use a hash map instead.

Where you see them: policy configuration, routing tables with fixed indices, per-CPU stats indexed by CPU number.

LPM Trie Maps

LPM (Longest Prefix Match) trie maps handle IP prefix lookups — the same operation that a hardware router does when deciding which interface to send a packet out of.

You can store a mix of specific host addresses (/32) and CIDR ranges (/16, /24) in the same map, and a lookup returns the most specific match. If 10.0.1.15/32 and 10.0.0.0/8 are both in the map, a lookup for 10.0.1.15 returns the /32 entry.

Cilium’s cilium_ipcache map is an LPM trie. It maps every IP in the cluster to its security identity — the identifier Cilium uses for policy enforcement. When a packet arrives, Cilium does a trie lookup on the source IP to find out which endpoint sent it, then checks policy against that identity.

# Inspect the ipcache map
bpftool map show id 13
# lpm_trie  name cilium_ipcache  key 40B  value 32B  max_entries 512000

# Look up which security identity owns a pod IP
bpftool map lookup id 13 key hex 20 00 00 00 0a 00 01 0f 00 00 00 00 00 00 00 00 00 00 00 00

Where you see them: IP-to-identity mapping (Cilium), CIDR-based policy enforcement, IP blocklists.


Pinned Maps — State That Survives Restarts

By default, a map’s lifetime is tied to the tool that created it. When the tool exits, the kernel garbage-collects the map.

Pinning writes a reference to the BPF filesystem at /sys/fs/bpf, which keeps the map alive even after the creating process exits:

# See all maps Cilium has pinned
ls /sys/fs/bpf/tc/globals/
# cilium_ct4_global  cilium_ipcache  cilium_metrics  cilium_policy ...

# Inspect a pinned map directly — no Cilium process needed
bpftool map dump pinned /sys/fs/bpf/tc/globals/cilium_ct4_global

# Pin any map by ID for manual inspection
bpftool map pin id 12 /sys/fs/bpf/my_conn_tracker
bpftool map dump pinned /sys/fs/bpf/my_conn_tracker

Cilium pins all its maps under /sys/fs/bpf/tc/globals/. When Cilium restarts — rolling upgrade, crash, OOM kill — it reopens its pinned maps and resumes with existing state intact. Pods maintain established TCP connections through a Cilium restart without disruption.

This is operationally significant: if you’re evaluating eBPF-based tools for production, check whether they pin their maps. A tool that doesn’t loses all its tracked state on every restart — connection tracking resets, process lineage gaps, policy state rebuilt from scratch.


Map Memory: A Production Consideration

Map memory is kernel-locked — it cannot be paged out, and it doesn’t show up in standard memory pressure metrics. Your node’s free output and container memory limits don’t account for it.

Kernel-locked memory is memory the OS guarantees will never be swapped to disk — it stays in RAM permanently. The kernel requires this for eBPF maps because a kernel program running during a network interrupt cannot wait for a page fault. The side effect: it doesn’t appear in top, free, or container memory metrics, so it’s easy to accidentally provision nodes without accounting for it.

# Total eBPF map memory locked on this node
bpftool map list -j | python3 -c "
import json,sys
maps=json.load(sys.stdin)
total=sum(m.get('bytes_memlock',0) for m in maps)
print(f'Total map memory: {total/1024/1024:.1f} MB')
"

# Check system memlock limit (unlimited is correct for eBPF tools)
ulimit -l

# Check what Cilium's systemd unit sets
systemctl show cilium | grep -i memlock

On a node running Cilium + Falco + Datadog NPM, I’ve seen 200–400MB of map memory locked. That’s real, non-pageable kernel memory. If you’re sizing nodes for eBPF-heavy workloads, account for this separately from your pod workload memory.

If an eBPF tool fails to load with a permission error despite having enough free memory, the root cause is usually the memlock ulimit for the process. Cilium, Falco, and most production tools set LimitMEMLOCK=infinity in their systemd units. Verify this if you’re deploying a new eBPF-based tool and seeing unexpected load failures.


Inspecting Maps in Production

# List all maps: type, name, key/value sizes, memory usage
bpftool map list

# Dump all entries in a map (careful with large maps)
bpftool map dump id 12

# Look up a specific entry by key
bpftool map lookup id 12 key hex 0a 00 01 0f 00 00 00 00

# Watch map stats live
watch -n1 'bpftool map show id 12'

# See all maps for a specific tool by checking its pinned path
ls /sys/fs/bpf/tc/globals/                    # Cilium
ls /sys/fs/bpf/falco/                         # Falco (if pinned)

# Cross-reference map IDs with the programs using them
bpftool prog list
bpftool map list

⚠ Production Gotchas

A full LRU map drops state silently, not loudly
When Cilium’s CT map fills up, it starts evicting the least recently used connections — not returning an error. You see connection resets, not a tool alert. Check map utilisation (bpftool map dump id X | grep -c key) against max_entries on nodes with high connection rates.

Ring buffer drops don’t stop the tool — they create gaps
When Falco’s ring buffer fills up, events are dropped. Falco keeps running. The rule engine keeps processing. But you have gaps in your syscall visibility. Monitor Dropped events in Falco’s stats and size the ring buffer accordingly.

Map memory is invisible to standard monitoring
200–400MB of kernel-locked memory on a Cilium + Falco node doesn’t appear in top, container memory metrics, or memory pressure alerts. Size eBPF-heavy nodes with this in mind and add explicit map memory monitoring via bpftool.

Tools that don’t pin their maps lose state on restart
A Cilium restart with pinned maps = zero-disruption connection tracking. A tool without pinning = all tracked state rebuilt from scratch. This matters for connection tracking tools and any tool maintaining process lineage.

perf_event_array on kernel 5.8+ is the old way
Older eBPF tools use per-CPU perf_event_array for event streaming. Ring buffer is strictly better — single allocation, lower overhead, simpler consumption. If you’re running a tool that still uses perf_event_array on a 5.8+ kernel, it’s using a legacy path.


Key Takeaways

  • eBPF programs are stateless — maps are where all state lives, between invocations and between kernel and userspace
  • Every production eBPF tool (Cilium, Falco, Tetragon, Datadog NPM) is a map-based architecture — bpftool map list shows you what it’s actually holding
  • Per-CPU maps eliminate write contention for high-frequency counters; the tool aggregates per-CPU values at export time
  • LRU maps handle unbounded key spaces (IPs, PIDs, connections) without hard errors when full — but eviction is silent, so size generously
  • Ring buffer (kernel 5.8+) is the correct event streaming primitive — Falco and Tetragon both use it
  • Map memory is kernel-locked and invisible to standard memory metrics — account for it explicitly on eBPF-heavy nodes
  • Pinned maps survive restarts; Cilium uses this for zero-disruption connection tracking through upgrades

What’s Next

You know what program types run in the kernel, and you know how they hold state.

Get EP06 in your inbox when it publishes → linuxcent.com/subscribe But there’s a problem anyone running eBPF-based tools eventually runs into: a tool works on one kernel version and breaks on the next. Struct layouts shift between patch versions. Field offsets move. EP06 covers CO-RE (Compile Once, Run Everywhere) and libbpf — the mechanism that makes tools like Cilium and Falco survive your node upgrades without recompilation, and why kernel version compatibility is a solved problem for any tool built on this toolchain.

eBPF Program Types — What’s Actually Running on Your Nodes

Reading Time: 8 minutes

eBPF: From Kernel to Cloud, Episode 4
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types**


TL;DR

  • bpftool prog list and bpftool net list show every eBPF program on a node — run these first when debugging eBPF-based tool behavior
  • TC programs can stack on the same interface; stale programs from incomplete Cilium upgrades cause intermittent packet drops — check tc filter show after every Cilium upgrade
  • XDP fires before sk_buff allocation — fastest hook, but no pod identity; Cilium uses it for service load balancing, not pod policy
  • XDP silently falls back to generic mode on unsupported NICs — verify with ip link show | grep xdp
  • Tracepoints are stable across kernel versions; kprobe-based tools may silently break after node OS patches
  • LSM hooks enforce at the kernel level — what makes Tetragon’s enforcement mode fundamentally different from sidecar-based approaches

The Big Picture

  WHERE eBPF PROGRAM TYPES ATTACH IN THE KERNEL

  NIC hardware
       ↓
  DMA → ring buffer
       ↓
  ┌─────────────────────────────────────────────────┐
  │  XDP hook  (Cilium: service load balancing)     │
  │  Sees: raw packet bytes only. No pod identity.  │
  └─────────────────────────┬───────────────────────┘
                            │ XDP_PASS
                            ▼
  sk_buff allocated
       ↓
  ┌─────────────────────────────────────────────────┐
  │  TC ingress hook  (Cilium: pod policy ingress)  │
  │  Sees: sk_buff + socket + cgroup → pod identity │
  └─────────────────────────┬───────────────────────┘
                            ↓
  netfilter / IP routing
       ↓
  socket → process (syscall boundary)
  ┌─────────────────────────────────────────────────┐
  │  Tracepoint / kprobe  (Falco: syscall monitor)  │
  │  Sees: any kernel event, any process, any pod   │
  └─────────────────────────────────────────────────┘
  ┌─────────────────────────────────────────────────┐
  │  LSM hook  (Tetragon: kernel-level enforcement) │
  │  Sees: security check context. Can DENY.        │
  └─────────────────────────────────────────────────┘
       ↓
  IP routing → qdisc
  ┌─────────────────────────────────────────────────┐
  │  TC egress hook  (Cilium: pod policy egress)    │
  │  Sees: socket + cgroup on outbound traffic      │
  └─────────────────────────────────────────────────┘
       ↓
  NIC → wire

eBPF program types define where in the kernel a hook fires and what it can see — and knowing the difference is what makes you effective when Cilium or Falco behave unexpectedly. What we hadn’t answered — and what a 2am incident eventually forced — is what kind of eBPF programs are actually running on your nodes, and why the difference matters when something breaks.

A pod in production was dropping roughly one in fifty outbound TCP connections. Not all of them — just enough to cause intermittent timeouts in the application logs. NetworkPolicy showed egress allowed. Cilium reported no violations. Running curl manually from inside the pod worked every time.

I spent the better part of three hours eliminating possibilities. DNS. MTU. Node-level conntrack table exhaustion. Upstream firewall rules. Nothing.

Eventually, almost as an afterthought, I ran this:

sudo bpftool prog list

There were two TC programs attached to that pod’s veth interface. One from the current Cilium version. One from the previous version — left behind by a rolling upgrade that hadn’t cleaned up properly. Two programs. Different policy state. One was occasionally dropping packets based on rules that no longer existed in the current policy model.

The answer had been sitting in the kernel the whole time. I just didn’t know where to look.

That incident forced me to actually understand something I’d been hand-waving for two years: eBPF isn’t a single hook. It’s a family of program types, each attached to a different location in the kernel, each seeing different data, each suited for different problems. Understanding the difference is what separates “I run Cilium and Falco” from “I understand what Cilium and Falco are actually doing on my nodes” — and that difference matters when something breaks at 2am.

The Command You Should Run on Your Cluster Right Now

Before getting into the theory, do this:

# See every eBPF program loaded on the node
sudo bpftool prog list

# See every eBPF program attached to a network interface
sudo bpftool net list

On a node running Cilium and Falco, you’ll see something like this:

42: xdp           name cil_xdp_entry       loaded_at 2026-04-01T09:23:41
43: sched_cls     name cil_from_netdev      loaded_at 2026-04-01T09:23:41
44: sched_cls     name cil_to_netdev        loaded_at 2026-04-01T09:23:41
51: cgroup_sock_addr  name cil_sock4_connect loaded_at 2026-04-01T09:23:41
88: raw_tracepoint  name sys_enter          loaded_at 2026-04-01T09:23:55
89: raw_tracepoint  name sys_exit           loaded_at 2026-04-01T09:23:55

Each line is a different program type. Each one fires at a different point in the kernel. The type column — xdp, sched_cls, raw_tracepoint, cgroup_sock_addr — tells you where in the kernel execution path that program is attached and therefore what it can and cannot see.

If you see more programs than you expect on a specific interface — like I did — that’s your first clue.

Why Program Types Exist

The Linux kernel isn’t a single pipeline. Network packets, system calls, file operations, process scheduling — these all run through different subsystems with different execution contexts and different available data.

eBPF lets you attach programs to specific points within those subsystems. The “program type” is the contract: it defines where the hook fires, what data the program receives, and what it’s allowed to do with it. A program designed to process network packets before they hit the kernel stack looks completely different from one designed to intercept system calls across all containers simultaneously.

Most of us will interact with four or five program types through the tools we already run. Understanding what each one actually is — where it sits, what it sees — is what makes you effective when those tools behave unexpectedly.

The Types Behind the Tools You Already Use

TC — Why Cilium Can Tell Which Pod Sent a Packet

TC stands for Traffic Control. It’s where Cilium enforces your NetworkPolicy, and it’s what caused my incident.

TC programs attach to network interfaces — specifically to the ingress and egress directions of the pod’s virtual interface (lxcXXXXX in Cilium’s naming). They fire after the kernel has already processed the packet enough to know its context: which socket created it, which cgroup that socket belongs to. Cgroup maps to container, container maps to pod.

This is the critical piece: TC is how Cilium knows which pod a packet belongs to. Without that cgroup context, per-pod policy enforcement isn’t possible.

# See TC programs on a pod's veth interface
sudo tc filter show dev lxc12345 ingress
sudo tc filter show dev lxc12345 egress

# If you see two entries on the same direction — that's the incident I described
# The priority number (pref 1, pref 2) tells you the order they run

When there are two TC programs on the same interface, the first one to return “drop” wins. The second program never runs. This is why the issue was intermittent rather than consistent — the stale program only matched specific connection patterns.

Fixing it is straightforward once you know what to look for:

# Remove a stale TC filter by its priority number
sudo tc filter del dev lxc12345 egress pref 2

Add this check to your post-upgrade runbook. Cilium upgrades are generally clean but not always.

XDP — Why Cilium Doesn’t Use TC for Everything

If TC is good enough for pod-level policy, why does Cilium also run an XDP program on the node’s main interface? Look at the bpftool prog list output again — there’s an xdp program loaded alongside the TC programs.

XDP fires earlier. Much earlier. Before the kernel allocates any memory for the packet. Before routing. Before connection tracking. Before anything.

The tradeoff is exactly what you’d expect: XDP is fast but context-poor. It sees raw packet bytes. It doesn’t know which pod the packet came from. It can’t read cgroup information because no socket buffer has been allocated yet.

Cilium uses XDP specifically for ClusterIP service load balancing — when a packet arrives at the node destined for a service VIP, XDP rewrites the destination to the actual pod IP in a single map lookup and sends it on its way. No iptables. No conntrack. The work is done before the kernel stack is involved.

There’s a silent failure mode worth knowing about here. XDP runs in one of two modes:

  • Native mode — runs inside the NIC driver itself, before any kernel allocation. This is where the performance comes from.
  • Generic mode — fallback when the NIC driver doesn’t support XDP. Runs later, after sk_buff allocation. No performance benefit over iptables.

If your NIC doesn’t support native XDP, Cilium silently falls back to generic mode. The policy still works — but the performance characteristics you assumed aren’t there.

# Check which XDP mode is active on your node's main interface
ip link show eth0 | grep xdp
# xdpdrv  ← native mode (fast)
# xdpgeneric ← generic mode (no perf benefit)

Most cloud provider instance types with modern Mellanox/Intel NICs support native mode. Worth verifying rather than assuming.

Tracepoints — How Falco Sees Every Container

Falco loads two programs: sys_enter and sys_exit. These are raw tracepoints — they fire on every single system call, from every process, in every container on the node.

Tracepoints are explicitly defined and maintained instrumentation points in the kernel. Unlike hooks that attach to specific internal function names (which can be renamed or inlined between kernel versions), tracepoints are stable interfaces. They’re part of the kernel’s public contract with tooling that wants to instrument it.

This matters operationally. When you patch your nodes — and cloud-managed nodes get patched frequently — tools built on tracepoints keep working. Tools built on kprobes (internal function hooks) may silently stop firing if the function they’re attached to gets renamed or inlined by the compiler in a new kernel build.

# Verify what Falco is actually using
sudo bpftool prog list | grep -E "kprobe|tracepoint"

# Falco's current eBPF driver should show raw_tracepoint entries
# If you see kprobe entries from Falco, you're on the older driver
# Check: falco --version and the driver being loaded at startup

If you’re running Falco on a cluster that gets regular OS patch upgrades and you haven’t verified the driver mode, check it. The older kprobe-based driver has a real failure mode on certain kernel versions.

LSM — How Tetragon Blocks Operations at the Kernel Level

LSM hooks run at the kernel’s security decision points: file opens, socket connections, process execution, capability checks. The defining characteristic is that they can deny an operation. Return an error from an LSM hook and the kernel refuses the syscall before it completes.

This is qualitatively different from observability hooks. kprobes and tracepoints watch. LSM hooks enforce.

When you see Tetragon configured to kill a process attempting a privileged operation, or block a container from writing to a specific path, that’s an LSM hook making the decision inside the kernel — not a sidecar watching traffic, not an admission webhook running before pod creation, not a userspace agent trying to act fast enough. The enforcement is in the kernel itself.

# See if any LSM eBPF programs are active on the node
sudo bpftool prog list | grep lsm

# Verify LSM eBPF support on your kernel (required for Tetragon enforcement mode)
grep CONFIG_BPF_LSM /boot/config-$(uname -r)
# CONFIG_BPF_LSM=y   ← required

The Practical Summary

What’s happening on your node Program type Where to look
Cilium service load balancing XDP ip link show eth0 \| grep xdp
Cilium pod network policy TC (sched_cls) tc filter show dev lxcXXXX egress
Falco syscall monitoring Tracepoint bpftool prog list \| grep tracepoint
Tetragon enforcement LSM bpftool prog list \| grep lsm
Anything unexpected All types bpftool prog list, bpftool net list

The Incident, Revisited

Three hours of debugging. The answer was a stale TC program sitting at priority 2 on a pod’s veth interface, left behind by an incomplete Cilium upgrade.

# What I should have run first
sudo bpftool net list
sudo tc filter show dev lxc12345 egress

Two commands. Thirty seconds. If I’d known that TC programs can stack on the same interface, I’d have started there.

That’s the point of understanding program types — not to write eBPF programs yourself, but to know where to look when the tools you depend on don’t behave the way you expect. The programs are already there, running on your nodes right now. bpftool prog list shows you all of them.

Key Takeaways

  • bpftool prog list and bpftool net list show every eBPF program on a node — run these before anything else when debugging eBPF-based tool behavior
  • TC programs can stack on the same interface; stale programs from incomplete Cilium upgrades cause intermittent drops — check tc filter show after every Cilium upgrade
  • XDP runs before the kernel stack — fastest hook, but no pod identity; Cilium uses it for service load balancing, not pod policy
  • XDP silently falls back to generic mode on unsupported NICs — verify with ip link show | grep xdp
  • Tracepoints are stable across kernel versions; kprobe-based tools may silently break after node OS patches — verify your Falco driver mode
  • LSM hooks enforce at the kernel level — this is what makes Tetragon’s enforcement mode fundamentally different from sidecar-based approaches

What’s Next

Every eBPF program fires, does its work, and exits — but the work always involves data.

Get EP05 in your inbox when it publishes → linuxcent.com/subscribe Counting connections. Tracking processes. Streaming events to a detection engine. In EP05, I’ll cover eBPF maps: the persistent data layer that connects kernel programs to the tools consuming their output. Understanding maps explains a class of production issues — and makes bpftool map dump useful rather than cryptic.