TC eBPF — Pod-Level Network Policy Without iptables

Reading Time: 10 minutes

eBPF: From Kernel to Cloud, Episode 8
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP · TC eBPF


Architecture Overview

TC eBPF and Cilium — traffic control hook architecture showing ingress/egress packet flow with sk_buff context
The TC hook runs inside the kernel network stack — Cilium uses it for identity-based policy enforcement.

TL;DR

  • TC eBPF fires after sk_buff allocation — it has socket metadata, cgroup ID, and pod identity that XDP lacks
    (sk_buff = the kernel’s socket buffer, allocated for every packet; TC fires after this allocation, so it can read socket and process identity)
  • Direct action (DA) mode combines filter and action; the program’s return value is the packet fate
  • Multiple TC programs chain on the same hook ordered by priority — stale programs from Cilium upgrades cause silent policy conflicts
  • tc filter show dev <iface> ingress/egress is the primary inspection tool; bpftool net list shows the full node picture
  • XDP + TC is the Cilium data path: XDP for pre-stack service load balancing, TC for per-pod identity-based enforcement
  • TC can modify packet content (bpf_skb_store_bytes) — the basis for TC-based DNAT and packet mangling

TC eBPF is where Cilium implements pod-level network policy without iptables — the hook that fires after sk_buff allocation, where socket and cgroup context exist, making per-pod enforcement possible. The obvious follow-up to XDP is why Cilium doesn’t use it for everything — pod network policy, egress enforcement, the full NetworkPolicy ruleset. The answer reveals an inherent trade-off built into the Linux data path: XDP’s speed comes from running before any context exists. At the moment it fires, there is no socket, no cgroup, no way to tell which pod sent the packet. The moment you need pod identity, you need a hook that fires later — and pays for it.


A specific pod in production was experiencing intermittent TCP connection failures to an external service. Not all connections — roughly one in fifty. Kubernetes NetworkPolicy showed egress allowed for the namespace. Cilium policy status showed no violations. Running curl from inside the pod worked fine.

The application logs told a different story: connection timeouts at the 30-second mark, no SYN-ACK received. Not a DNS issue — I verified with tcpdump inside the pod namespace. SYN packets were leaving the pod network namespace. They weren’t making it onto the wire.

I ran bpftool net list on the node and saw two TC egress programs attached to that pod’s veth interface. One from the current Cilium version (installed six weeks ago). One from the previous version — from before the rolling upgrade. Two programs. Different policy epochs. The older one had a stale block rule that fired intermittently based on connection tuple patterns it was never designed to handle in the new policy model.

Without understanding TC eBPF — what programs attach where, how multiple programs interact, and how to inspect them — I would have kept chasing ghosts in the application layer.

Quick Check: Are There Stale TC Filters on Your Cluster?

The most common TC eBPF issue on production clusters — stale filters left behind by a Cilium upgrade — is a two-command check:

# SSH into a worker node, then pick any pod's veth interface:
ip link | grep lxc | head -5
# lxc8a3f21b@if7: ...
# lxc2c9d3e1@if9: ...

# Check TC filters on that interface
tc filter show dev lxc8a3f21b egress

Healthy output (one filter, one priority):

filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_to_container direct-action not_in_hw id 44

Stale filter present (two priorities = problem):

filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_to_container direct-action not_in_hw id 44
filter protocol all pref 2 bpf chain 0
filter protocol all pref 2 bpf chain 0 handle 0x1 old_cil_to_container direct-action not_in_hw id 17
#                  ^^^^^^ two different priorities = two programs running in sequence

Two priorities on the same hook means two programs running sequentially. If the older one has a stale DROP rule, packets are being dropped intermittently — and nothing in the application layer will tell you why.
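To turn that eyeball check into something scriptable, here is a minimal sketch that counts distinct pref values in tc filter show output. The function name count_tc_prefs is my own, and the sample text is the stale-filter output from above rather than a live capture.

```shell
# Count distinct filter priorities ("pref" values) in `tc filter show` output.
# Reads the output on stdin; a result above 1 deserves a closer look.
count_tc_prefs() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "pref") seen[$(i + 1)] = 1
    }
    END {
        n = 0
        for (p in seen) n++
        print n
    }'
}

# Demo on the stale-filter output shown above (no live tc call needed).
count_tc_prefs <<'EOF'
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_to_container direct-action not_in_hw id 44
filter protocol all pref 2 bpf chain 0
filter protocol all pref 2 bpf chain 0 handle 0x1 old_cil_to_container direct-action not_in_hw id 17
EOF
# prints 2
```

On a live node the same function would be fed from the real command: tc filter show dev lxc8a3f21b egress | count_tc_prefs.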

Not running Cilium? If you’re on a non-Cilium CNI (Calico, Flannel, aws-vpc-cni), you likely won’t have TC eBPF filters on pod interfaces. Run tc filter show dev eth0 ingress on the node uplink instead to see if any TC programs are attached at the node level. An empty response is normal for non-Cilium clusters.

Why TC, Not XDP

EP07 covered XDP: fastest possible hook, fires before sk_buff, drops at line rate. If XDP is so fast, why doesn’t Cilium use it for everything?

Because XDP sees only raw packet bytes. No socket. No cgroup. No pod identity.

In Kubernetes, network policy is inherently about identity. “Allow pod A to connect to pod B on port 8080.” To enforce this, you need to know which pod a packet is coming from on egress — and which pod it’s going to on ingress. That mapping lives in the cgroup hierarchy and the socket buffer, neither of which exist at XDP time.

TC fires later in the packet lifecycle, after sk_buff is allocated and populated:

Ingress path:
  wire → NIC → [XDP hook] → sk_buff allocated → [TC ingress hook] → netfilter → socket

Egress path:
  socket → IP routing → [TC egress hook] → qdisc → NIC → wire

At the TC egress hook on a pod’s veth interface, the sk_buff carries the socket that created the packet — and from that socket you can read the cgroup ID. The cgroup hierarchy maps container → pod, so the TC program knows which pod this traffic belongs to. That’s what makes pod-level enforcement possible.

The Linux Traffic Control Architecture

tc (traffic control) is the Linux subsystem for managing packet queues and scheduling. Most Linux administrators know it as the bandwidth-shaping tool:

# Classic tc usage — rate limit an interface
tc qdisc add dev eth0 root tbf rate 100mbit burst 32kbit latency 400ms

The qdisc (queuing discipline) is the primary abstraction. Under the qdisc sits a filter layer — and the filter type relevant to eBPF is cls_bpf, which attaches eBPF programs as packet classifiers.

qdisc (queuing discipline) is the kernel’s packet scheduler for an interface — it controls how packets are buffered and in what order they leave. For eBPF policy enforcement, Cilium uses a special qdisc called clsact which has no buffering behaviour at all; it purely provides the ingress and egress hook points where eBPF filters attach. If a pod veth doesn’t have clsact, Cilium isn’t enforcing policy on that pod.

Cilium attaches cls_bpf filters in direct action (DA) mode, which combines classifier and action into a single eBPF program. The program’s return value is the packet fate directly:

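Return value | Action
TC_ACT_OK (0) | Pass the packet; final verdict, later filters are skipped
TC_ACT_SHOT (2) | Drop the packet
TC_ACT_REDIRECT (7) | Redirect to another interface
TC_ACT_UNSPEC (-1) | No verdict; continue with the next filter in the chain
TC_ACT_PIPE (3) | Not a direct-action verdict; cls_bpf treats it like TC_ACT_UNSPEC, so the next filter runs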

TC Context: What Your Program Can See

TC programs receive a struct __sk_buff — a safe, BPF-accessible projection of the kernel sk_buff. Unlike the raw packet bytes in XDP, __sk_buff includes metadata:

struct __sk_buff {
    __u32 len;           // packet length
    __u32 pkt_type;      // PACKET_HOST, PACKET_BROADCAST, etc.
    __u32 mark;          // skb->mark — used by Cilium for pod identity
    __u32 queue_mapping;
    __u32 protocol;      // ETH_P_IP, ETH_P_IPV6, etc.
    __u32 vlan_present;
    __u32 vlan_tci;
    __u32 vlan_proto;
    __u32 priority;
    __u32 ingress_ifindex;
    __u32 ifindex;
    __u32 tc_index;
    __u32 cb[5];
    __u32 hash;
    __u32 tc_classid;
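    __u32 data;          // start of packet data (bounds-check against data_end)
    __u32 data_end;
    __u32 napi_id;
    // Socket fields below are exposed to socket-attached program types;
    // the verifier rejects reads of family..local_port from TC programs.
    __u32 family;
    __u32 remote_ip4;    // network byte order
    __u32 local_ip4;     // network byte order
    __u32 remote_port;   // network byte order
    __u32 local_port;    // host byte order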
    // ...
};

skb->mark is how Cilium passes pod identity between its hook points.

skb->mark is a 32-bit field in every sk_buff that any kernel subsystem can read or write. It’s a general-purpose scratch field: iptables uses it, routing rules use it, and Cilium uses it to carry pod security identity from the socket hook through to TC enforcement. The socket-level cgroup hook (cgroup_sock_addr) runs when the socket calls connect() and stamps the cgroup-derived pod identity into the mark. By the time a packet from that socket reaches the TC hook, skb->mark carries the pod’s security identity, so every TC filter on the packet’s path can use it for policy enforcement without another identity lookup.

What Cilium’s TC Filters Actually Do

The TC filter on each pod’s veth is Cilium’s enforcement point for Kubernetes NetworkPolicy. The mechanism:

  1. When a pod opens a connection, a cgroup_sock_addr hook stamps the pod’s security identity (derived from its labels + namespace) into skb->mark
  2. The TC egress filter on the veth reads skb->mark, looks up the pod identity + destination in the policy map, and returns TC_ACT_SHOT (drop) or TC_ACT_OK (pass)
  3. The TC ingress filter on the receiving pod’s veth does the same check for inbound traffic

The policy map is a BPF LRU hash keyed on {pod_identity, dst_ip, dst_port, protocol}. This is what cilium bpf policy get reads from, and what bpftool map dump shows directly:

# Find Cilium's policy maps
bpftool map list | grep -i policy

# Dump the active policy entries for a specific endpoint
# Get endpoint ID from: cilium endpoint list
cilium bpf policy get <endpoint-id>

# Cross-check with raw bpftool dump
bpftool map dump id <POLICY_MAP_ID> | head -30
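To make the lookup concrete, here is a toy model of the policy check in plain shell. It is a sketch only: the table format, the identities, and the function name policy_verdict are made up for illustration and do not reflect Cilium's actual map encoding. It assumes an exact-match key and a default-deny result when no entry exists.

```shell
# Toy policy table: "identity dst_ip dst_port proto verdict" per line.
# All entries are invented for illustration.
POLICY_TABLE='1234 10.0.2.15 8080 tcp allow
1234 10.0.3.7 5432 tcp allow
5678 10.0.2.15 8080 tcp deny'

policy_verdict() {
    # $1=identity $2=dst_ip $3=dst_port $4=proto
    # Exact-match lookup; falls back to "deny" when no entry matches.
    printf '%s\n' "$POLICY_TABLE" |
        awk -v id="$1" -v ip="$2" -v port="$3" -v proto="$4" '
            $1 == id && $2 == ip && $3 == port && $4 == proto { v = $5 }
            END { print (v == "" ? "deny" : v) }'
}

policy_verdict 1234 10.0.2.15 8080 tcp   # prints allow
policy_verdict 1234 10.0.9.9 443 tcp     # prints deny (no entry)
```

In the real data path, "allow" would correspond to the program returning TC_ACT_OK and "deny" to TC_ACT_SHOT.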

The clsact qdisc is the prerequisite for any TC eBPF filter — it creates the ingress and egress hook points without any queuing behavior. Every pod veth on a Cilium node has one:

tc qdisc show dev lxcABCDEF
# qdisc clsact ffff: dev lxcABCDEF parent ffff:fff1
# ^^^^^^^^^^^^ this line confirms Cilium's hook points exist on this pod's veth
# If this is missing: Cilium is NOT enforcing NetworkPolicy on this pod

If a pod veth doesn’t show clsact, Cilium isn’t enforcing policy on that pod.
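The same check is easy to script. A minimal sketch, assuming the tc qdisc show output looks like the sample above; has_clsact is a name I made up, and it reads the command output on stdin so it can be tried without a live interface.

```shell
has_clsact() {
    # Reads `tc qdisc show dev <iface>` output on stdin.
    if grep -q 'qdisc clsact'; then
        echo "clsact present"
    else
        echo "clsact missing"
    fi
}

# Demo on the sample line from above.
has_clsact <<'EOF'
qdisc clsact ffff: dev lxcABCDEF parent ffff:fff1
EOF
# prints "clsact present"
```

On a node, pipe the real command into it: tc qdisc show dev lxcABCDEF | has_clsact.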

Multiple Programs and the Filter Chain

This is the detail that caused my production incident.

TC supports chaining multiple filters on the same hook, ordered by priority. Lower priority number runs first. When Cilium upgrades, it installs a new filter at a new priority before removing the old one. If the upgrade procedure has any timing gap — or if the removal step fails silently — you end up with two programs running in sequence.

# Show all TC filters on a pod's veth — both priorities visible
tc filter show dev lxc12345 egress

# Example output with a stale filter:
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_to_container direct-action not_in_hw id 44
filter protocol all pref 2 bpf chain 0
filter protocol all pref 2 bpf chain 0 handle 0x1 old_cil_to_container direct-action not_in_hw id 17

Two programs. Pref 1 runs first. Pref 2 runs second, but only if pref 1 hands the packet on. A final verdict from pref 1 (TC_ACT_SHOT, TC_ACT_OK, or TC_ACT_REDIRECT) ends the chain and pref 2 never fires.

In my incident: pref 1 was the current Cilium version with correct policy, and it did not drop the traffic in question. Pref 2 was the old version with a stale block entry, returning TC_ACT_SHOT for a subset of connection tuples. Whenever pref 1 handed a packet on instead of ending the chain (in direct-action mode that means TC_ACT_UNSPEC, or a code such as TC_ACT_PIPE that cls_bpf treats the same way), pref 2 got to run, and intermittently dropped packets.
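A small simulation helps when reasoning about which filter gets to decide. This is a sketch, assuming the kernel's cls_bpf direct-action behaviour as I understand it: TC_ACT_OK (0), TC_ACT_SHOT (2), and TC_ACT_REDIRECT (7) are final, while TC_ACT_UNSPEC (-1) and codes cls_bpf does not recognise as verdicts, such as TC_ACT_PIPE (3), move on to the next filter. The function name tc_chain_verdict is hypothetical, and other final codes (for example TC_ACT_STOLEN) are left out for brevity.

```shell
# Arguments: the return code of each filter, in pref order.
# Prints the fate of the packet under the assumptions stated above.
tc_chain_verdict() {
    for code in "$@"; do
        case "$code" in
            0) echo "pass"; return ;;       # TC_ACT_OK: final, chain ends
            2) echo "drop"; return ;;       # TC_ACT_SHOT: final
            7) echo "redirect"; return ;;   # TC_ACT_REDIRECT: final
            *) ;;                           # TC_ACT_UNSPEC, TC_ACT_PIPE: next filter
        esac
    done
    echo "pass"   # no filter gave a verdict, so the packet continues
}

tc_chain_verdict -1 2   # pref 1 defers, stale pref 2 drops: prints drop
tc_chain_verdict 0 2    # pref 1 gives a final OK: prints pass
```

The first call is the shape of the incident: the current program does not drop the packet, the stale one behind it does.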

The fix:

# Remove the stale filter by priority
tc filter del dev lxc12345 egress pref 2

# Verify only the current filter remains
tc filter show dev lxc12345 egress

This should be part of any post-upgrade verification for Cilium-managed clusters.

How Cilium Uses TC Across the Full Node

Cilium’s TC deployment on a node:

Pod veth (host-side, lxcXXXXX):
  TC ingress: cil_from_container — L3/L4 policy on the pod's outbound traffic
  TC egress:  cil_to_container   — L3/L4 policy on traffic arriving at the pod

Node uplink (eth0):
  TC ingress: cil_from_netdev    — traffic arriving from outside the node
  TC egress:  cil_to_netdev      — traffic leaving the node

XDP on eth0:
  cil_xdp_entry — pre-stack service load balancing (DNAT for ClusterIP)

The naming is counterintuitive at first: cil_from_container is attached to the TC ingress hook on the veth.

Veth direction confusion: TC ingress/egress is named from the kernel’s perspective of the interface, not the pod’s. The host-side veth receives the traffic that the pod sends, so TC ingress on the host veth is the pod’s outbound traffic. That is why cil_from_container sits on the ingress hook and enforces the pod’s egress policy, while cil_to_container sits on the egress hook and handles traffic arriving at the pod. This trips up everyone the first time. When debugging, run tc filter show dev lxcXXX ingress and egress separately and check which Cilium program name is attached (cil_from_container = pod outbound, cil_to_container = pod inbound).
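If you want that mapping in a debugging script rather than in your head, a tiny helper is enough. A sketch only: pod_direction is a name I invented, and it encodes nothing beyond the host-side veth mapping described above.

```shell
pod_direction() {
    # $1 = TC hook on the host-side veth (ingress or egress)
    case "$1" in
        ingress) echo "pod outbound (cil_from_container)" ;;
        egress)  echo "pod inbound (cil_to_container)" ;;
        *)       echo "unknown hook: $1" ;;
    esac
}

pod_direction ingress   # prints "pod outbound (cil_from_container)"
pod_direction egress    # prints "pod inbound (cil_to_container)"
```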

To see the full picture on a node:

# All eBPF network programs (XDP and TC) across all interfaces
bpftool net list

# TC-specific view
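# ip prints veth names as "lxc8a3f21b@if7"; strip the "@ifN" peer suffix for tc
for iface in $(ip -o link show | awk -F': ' '/lxc/ {sub(/@.*/, "", $2); print $2}'); do
    echo "=== $iface ==="
    tc filter show dev "$iface" ingress
    tc filter show dev "$iface" egress
done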

TC Can Modify Packets Too

TC programs are not limited to pass/drop verdicts. They have full access to the sk_buff and can modify packet content: headers, payload, and checksums. This is how TC-based DNAT works in Cilium when XDP isn’t available on the NIC: the program rewrites the destination IP at L3 with bpf_skb_store_bytes and fixes up the IP and transport checksums in the same pass. Kernel BPF helpers (bpf_l3_csum_replace, bpf_l4_csum_replace) handle the checksum recalculation.

From an operational standpoint: if you see a TC program attached but expected traffic is being redirected rather than dropped, the program is likely doing DNAT. bpftool prog dump xlated id <ID> shows the disassembled instructions and will reveal bpf_skb_store_bytes calls if packet rewriting is happening.
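Scanning a long xlated dump by eye is tedious, so here is a minimal sketch that pulls out the distinct helper names. It assumes call lines contain the text "call bpf_<name>", which is how I have seen bpftool render them; the exact layout can differ between bpftool versions. The function name list_helper_calls and the sample lines (including the numbers after #) are illustrative, not captured from a real program.

```shell
list_helper_calls() {
    # Reads `bpftool prog dump xlated` text on stdin and prints each
    # distinct bpf_* helper name that appears after "call".
    sed -n 's/.*call \(bpf_[a-z0-9_]*\).*/\1/p' | sort -u
}

# Demo on made-up dump lines.
list_helper_calls <<'EOF'
  12: (85) call bpf_skb_load_bytes#111
  27: (85) call bpf_l3_csum_replace#222
  31: (85) call bpf_skb_store_bytes#333
  32: (95) exit
EOF
```

On a node: bpftool prog dump xlated id 44 | list_helper_calls. Seeing bpf_skb_store_bytes together with the checksum helpers is a strong hint that the program rewrites packets.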

Debugging TC Programs in Production

Workflow I follow when investigating network issues on Cilium clusters:

# 1. List all eBPF network programs (see the full picture)
bpftool net list

# 2. Check specific interface for stale TC filters
tc filter show dev lxcABCDEF ingress
tc filter show dev lxcABCDEF egress

# 3. Inspect a specific program
bpftool prog show id 44

# 4. Disassemble a program (last resort for understanding behavior)
bpftool prog dump xlated id 44

# 5. Check Cilium's view of the same interface
cilium endpoint list
cilium endpoint get <endpoint-id>

# 6. Enable verbose TC program logs (debug builds only)
# Cilium: set CILIUM_DEBUG=true in the deployment

Common Mistakes

Mistake | Impact | Fix
Not checking for stale TC filters after Cilium upgrades | Conflicting policy programs cause intermittent drops | Run tc filter show post-upgrade; remove stale by priority
Confusing ingress/egress direction on veth interfaces | Policy applied to wrong traffic direction | TC ingress on host-side veth = pod’s outbound traffic
Attaching TC without clsact qdisc | Filter attachment fails | tc qdisc add dev <iface> clsact before filter add
Misjudging which return codes stop the chain | Later filters run when you expected a final verdict, or are skipped when you expected them to run | TC_ACT_OK, TC_ACT_SHOT, and TC_ACT_REDIRECT are final; TC_ACT_UNSPEC hands the packet to the next filter
Expecting TC performance equal to XDP | TC has sk_buff overhead, so it’s slower | Right tool: XDP for pre-stack bulk drops, TC for identity-aware policy
Hardcoding skb->mark interpretation | Different tools use mark differently | Document mark field usage clearly; coordinate between Cilium and custom programs

Key Takeaways

  • TC eBPF fires after sk_buff allocation — it has socket metadata, cgroup ID, and pod identity that XDP lacks
  • Direct action (DA) mode combines filter and action; the program’s return value is the packet fate
  • Multiple TC programs chain on the same hook ordered by priority — stale programs from Cilium upgrades cause silent policy conflicts
  • tc filter show dev <iface> ingress/egress is the primary inspection tool; bpftool net list shows the full node picture
  • XDP + TC is the Cilium data path: XDP for pre-stack service load balancing, TC for per-pod identity-based enforcement
  • TC can modify packet content (bpf_skb_store_bytes) — the basis for TC-based DNAT and packet mangling

What’s Next

EP08 closes out the kernel machinery arc: program types, maps, CO-RE, XDP, TC. Five episodes on the engine under the tools. EP09 shifts from understanding the machinery to using it interactively.

bpftrace turns kernel knowledge into one-liners you can run on a live production node. Which process is touching this file right now? Where is this latency spike originating in the kernel call stack? Which container is making DNS queries to an unexpected resolver? Under 10 seconds per question — no restart, no sidecar, no instrumentation change.

Every bpftrace one-liner is a complete eBPF program compiled, loaded, run, and cleaned up on the fly. EP09 covers how that works and why it changes the way you investigate production incidents.

Next: bpftrace — kernel answers in one line


XDP — Packets Processed Before the Kernel Knows They Arrived

Reading Time: 10 minutes

eBPF: From Kernel to Cloud, Episode 7
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP


Introduction

EP01 through EP06 covered what eBPF is, how the verifier keeps it safe, and how the toolchain compiles and loads programs across kernel versions. This episode is where that foundation meets production networking.

XDP — eXpress Data Path — is the earliest hook in the Linux kernel packet path. It fires before sk_buff allocation, before routing, before netfilter. A DROP decision at XDP costs one bounds check and a return value. Everything else is skipped. At 1 million packets per second, that difference shows up directly as CPU.

This episode explains where XDP sits, what it can and cannot see, how Cilium uses it, and what every Kubernetes operator needs to know about it — even if they never write an eBPF program.


Architecture Overview

XDP Pre-Stack Packet Hook — eBPF kernel data path diagram showing where XDP fires before sk_buff allocation
XDP fires before sk_buff allocation — the earliest possible kernel hook for zero-copy packet processing.

TL;DR

  • XDP fires before sk_buff allocation — the earliest possible kernel hook for packet processing
    (sk_buff = the kernel’s socket buffer — every normal packet requires one to be allocated, which adds up fast at scale)
  • Three modes: native (in-driver, full performance), generic (fallback, no perf gain), offloaded (NIC ASIC)
  • XDP context is raw packet bytes — no socket, no cgroup, no pod identity; handle non-IP traffic explicitly
  • Every pointer dereference requires a bounds check against data_end — the verifier enforces this
  • BPF_MAP_TYPE_LPM_TRIE is the right map type for IP prefix blocklists — handles /32 hosts and CIDRs together
  • XDP metadata area enables coordination with TC programs — classify at XDP speed, enforce with pod context at TC

Quick Check: Is XDP Running on Your Cluster?

Before the data path walkthrough — a two-command check you can run right now on any cluster node:

# SSH into a worker node, then:
bpftool net list

On a Cilium-managed node, you’ll see something like:

eth0 (index 2):
        xdpdrv  id 44

lxc8a3f21b (index 7):
        tc ingress id 47
        tc egress  id 48

Reading the output:
– xdpdrv: XDP in native mode, running in the NIC driver before sk_buff (this is what you want)
– xdpgeneric instead of xdpdrv: generic mode, runs after sk_buff allocation, no performance benefit
– No XDP line at all: XDP not deployed; your CNI uses iptables for service forwarding
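The three cases above can be folded into a small classifier for scripting across many nodes. This is a sketch that assumes the output uses the xdpdrv and xdpgeneric tokens as in the sample above; xdpoffload is my assumption for the offloaded case, and the function name xdp_mode is invented. Real bpftool output layout varies between versions, so adjust the patterns to what your nodes print.

```shell
xdp_mode() {
    # Reads `bpftool net list`-style text on stdin and prints the XDP mode.
    awk '
        /xdpdrv/     { mode = "native" }
        /xdpgeneric/ { mode = "generic" }
        /xdpoffload/ { mode = "offloaded" }
        END { print (mode == "" ? "none" : mode) }'
}

# Demo on the sample output from above.
xdp_mode <<'EOF'
eth0 (index 2):
        xdpdrv  id 44

lxc8a3f21b (index 7):
        tc ingress id 47
        tc egress  id 48
EOF
# prints native
```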

If you’re on EKS with aws-vpc-cni or GKE with kubenet, you likely won’t see XDP here — those CNIs use iptables. Understanding this section explains why teams migrating to Cilium see lower node CPU under the same traffic load.


Where XDP Sits in the Kernel Data Path

A client’s cluster was under a SYN flood — roughly 1 million packets per second from a rotating set of source IPs. We had iptables DROP rules installed within the first ten minutes, blocklist updated every 30 seconds as new source ranges appeared. The flood traffic dropped in volume. But node CPU stayed high. The %si column in top — software interrupt time — was sitting at 25–30%.

%si in top is the percentage of CPU time spent servicing software interrupts (softirqs), which is where the kernel does its receive-side packet processing. It is separate from your application’s CPU usage and from hardware interrupt time (%hi). On a quiet managed cluster (EKS, GKE) this is usually under 1%. Under a packet flood, high %si means the kernel is burning cycles just receiving packets, before your workloads run at all. It’s the metric that tells you the problem is below the application layer.

The iptables rules were matching. Packets were being dropped. CPU was still burning. The answer is where in the kernel the drop was happening. iptables fires inside the netfilter framework, after the packet has been DMA’d into the ring buffer, the kernel has allocated an sk_buff for it, and it has traversed several netfilter hooks. At 1Mpps, the allocation cost alone is measurable.

XDP fires before any of that.

The standard Linux packet receive path:

NIC hardware
  ↓
DMA to ring buffer (kernel memory)
  ↓
[XDP hook — fires here, before sk_buff]
  ├── XDP_DROP   → discard, zero further allocation
  ├── XDP_PASS   → continue to kernel network stack
  ├── XDP_TX     → transmit back out the same interface
  └── XDP_REDIRECT → forward to another interface or CPU
  ↓
sk_buff allocated from slab allocator
  ↓
netfilter: PREROUTING
  ↓
IP routing decision
  ↓
netfilter: INPUT or FORWARD
  ↓  [iptables fires somewhere in here]
socket receive queue
  ↓
userspace application

XDP runs at the driver level, in the NAPI poll context — the same context where the driver is processing received packets off the ring buffer. The program runs before the kernel’s general networking code gets involved. There’s no sk_buff, no reference counting, no slab allocation.

NAPI (New API) is how modern Linux receives packets efficiently. Instead of one CPU interrupt per packet (catastrophically expensive at 1Mpps), the NIC fires a single interrupt, then the kernel polls the NIC ring buffer in batches until it’s drained. XDP runs inside this polling loop — as close to the hardware as software gets without running on the NIC itself.

At 1Mpps, the difference between XDP_DROP and an iptables DROP is roughly the cost of allocating and then immediately freeing 1 million sk_buff objects per second — plus netfilter traversal, connection tracking lookup, and the DROP action itself. That’s the CPU time that was burning.

After moving the blocklist to an XDP program, the %si on the same traffic load dropped from 28% to 3%.
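A quick back-of-the-envelope number shows why per-packet cost matters so much at this rate. The sketch below assumes, for simplicity, that all packets land on a single core; with receive-side scaling the load is spread out, but the per-core arithmetic is the same. The function name ns_per_packet is my own.

```shell
ns_per_packet() {
    # $1 = packets per second handled by one core.
    # Prints how many nanoseconds that core can spend on each packet.
    echo $(( 1000000000 / $1 ))
}

ns_per_packet 1000000   # prints 1000
```

One microsecond per packet is the entire budget at 1Mpps on one core, so an allocation and free that happen only to feed a DROP rule take a visible share of it.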


XDP Modes

XDP operates in three modes, and which one you get depends on your NIC driver.

Native XDP (XDP_FLAGS_DRV_MODE)

The eBPF program runs directly in the NIC driver’s NAPI poll function, in softirq context, before sk_buff allocation. This is the only mode that delivers the full performance benefit.

Driver support is required. The widely supported drivers: mlx4, mlx5 (Mellanox/NVIDIA), i40e, ice (Intel), bnxt_en (Broadcom), virtio_net (KVM/QEMU), veth (containers). Check support:

# Verify native XDP support on your driver
ethtool -i eth0 | grep driver
# driver: mlx5_core  ← supports native XDP

# Load in native mode
ip link set dev eth0 xdpdrv obj blocklist.bpf.o sec xdp

The veth driver supporting native XDP is what makes XDP meaningful inside Kubernetes pods — each pod’s veth interface can run an XDP program at wire speed.

Generic XDP (XDP_FLAGS_SKB_MODE)

Fallback for drivers that don’t support native XDP. The program still runs, but it runs after sk_buff allocation, as a hook in the netif_receive_skb path. No performance benefit over early netfilter. sk_buff is still allocated and freed for every packet.

# Generic mode — development and testing only
ip link set dev eth0 xdpgeneric obj blocklist.bpf.o sec xdp

Use this for development on a laptop with a NIC that lacks native XDP support. Never benchmark with it and never use it in production expecting performance gains.

Offloaded XDP

Runs on the NIC’s own processing unit (SmartNIC). Zero CPU involvement — the XDP decision happens in NIC hardware. Supported by Netronome Agilio NICs. Rare in production, but the theoretical ceiling for XDP performance.


The XDP Context: What Your Program Can See

XDP programs receive one argument: struct xdp_md.

struct xdp_md {
    __u32 data;           // offset of first packet byte in the ring buffer page
    __u32 data_end;       // offset past the last byte
    __u32 data_meta;      // metadata area before data (XDP metadata for TC cooperation)
    __u32 ingress_ifindex;
    __u32 rx_queue_index;
};

data and data_end are used as follows:

void *data     = (void *)(long)ctx->data;
void *data_end = (void *)(long)ctx->data_end;

// Every pointer dereference must be bounds-checked first
struct ethhdr *eth = data;
if ((void *)(eth + 1) > data_end)
    return XDP_PASS;  // malformed or truncated packet

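The verifier enforces these bounds checks: every pointer derived from ctx->data must be validated against data_end before use. Dereferencing a packet pointer without the check is rejected at load time, typically with invalid access to packet; the related error invalid mem access 'inv' means the register being dereferenced is not tracked as a valid pointer at all. Missing or insufficient bounds checks are the most common cause of XDP program rejection.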

For operators (not writing XDP code): You’ll see invalid mem access 'inv' in logs when an eBPF program is rejected at load time — most commonly during a Cilium upgrade or a custom tool deployment on a kernel the tool wasn’t built for. The fix is in the eBPF source or the tool version, not the cluster config.

What XDP cannot see:
– Socket state — no socket buffer exists yet
– Cgroup hierarchy — no pod identity
– Process information — no PID, no container
– Connection tracking state (unless you maintain it yourself in a map)

XDP is ingress-only. It fires on packets arriving at an interface, not departing. For egress, TC is the hook.


What This Means on Your Cluster Right Now

A Cilium-managed node with XDP acceleration enabled has XDP programs running. Here’s how to see them:

# All XDP programs on all interfaces — this is the full picture
bpftool net list
# Sample output on a Cilium node:
#
# eth0 (index 2):
#         xdpdrv  id 44         ← XDP in native mode on the node uplink
#
# lxc8a3f21b (index 7):
#         tc ingress id 47      ← TC on the host-side veth: traffic leaving the pod
#         tc egress  id 48      ← TC on the host-side veth: traffic arriving at the pod
#
# "xdpdrv"     = native mode (runs in NIC driver, before sk_buff — full performance)
# "xdpgeneric" = fallback mode (after sk_buff — no performance benefit over iptables)

# Which mode is active?
ip link show eth0 | grep xdp
# xdp mode drv  ← native (full performance)
# xdp mode generic  ← fallback (no perf benefit)

# Details on the XDP program ID
bpftool prog show id $(bpftool net show dev eth0 | grep xdp | awk '{print $NF}')
# Shows: loaded_at, tag, xlated bytes, jited bytes, map IDs

The map IDs in that output are the BPF maps the XDP program is using — typically the service VIP table for DNAT, and in security tools, the blocklist or allowlist. To see what’s in them:

# List maps used by the XDP program
bpftool prog show id <PROG_ID> | grep map_ids

# Dump the service map (for a Cilium node — this is the load balancer table)
bpftool map dump id <MAP_ID> | head -40

For a blocklist scenario — like the SYN flood mitigation above — the BPF_MAP_TYPE_LPM_TRIE is the standard data structure. A lookup for 192.168.1.45 hits a 192.168.1.0/24 entry in the same map, handling both host /32s and CIDR ranges in one lookup.
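To show what longest-prefix matching means in practice, here is a userspace sketch of the lookup logic in plain shell. It is not how the kernel implements BPF_MAP_TYPE_LPM_TRIE (that is a trie, not a linear scan); it only reproduces the result you should expect from a lookup. The function names ip_to_int and lpm_lookup are mine, and the prefixes are examples.

```shell
ip_to_int() {
    # Convert a dotted-quad IPv4 address to a 32-bit integer.
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

lpm_lookup() {
    # $1 = address to look up; remaining args = CIDR prefixes in the blocklist.
    # Prints the most specific prefix containing the address, or "no match".
    addr=$(ip_to_int "$1")
    shift
    best=""
    best_len=-1
    for cidr in "$@"; do
        net=$(ip_to_int "${cidr%/*}")
        len=${cidr#*/}
        if [ "$len" -eq 0 ]; then
            mask=0
        else
            mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
        fi
        if [ $(( addr & mask )) -eq $(( net & mask )) ] && [ "$len" -gt "$best_len" ]; then
            best=$cidr
            best_len=$len
        fi
    done
    if [ -n "$best" ]; then echo "$best"; else echo "no match"; fi
}

lpm_lookup 192.168.1.45 10.0.0.0/8 192.168.1.0/24                   # prints 192.168.1.0/24
lpm_lookup 192.168.1.45 192.168.0.0/16 192.168.1.45/32 192.168.1.0/24   # prints 192.168.1.45/32
```

The second call shows why one map can hold both single hosts and ranges: the /32 entry wins over the wider prefixes that also contain the address.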

# Count entries in an XDP filter map
bpftool map dump id <BLOCKLIST_MAP_ID> | grep -c "key"

# Verify XDP is active and inspect program details
bpftool net show dev eth0

XDP Metadata: Cooperating with TC

Think of it as a sticky note attached to the packet. XDP writes the note at line speed (no context about pods or sockets). TC reads it later when full context is available, and acts on it. The packet carries the note between them.

More precisely: XDP can write metadata into the area before ctx->data — a small scratch space that survives as the packet moves from XDP to the TC hook. This is the coordination mechanism between the two eBPF layers.

The pattern: XDP classifies at speed (no sk_buff overhead), TC enforces with pod context (where you have socket identity). XDP writes a classification tag into the metadata area. TC reads it and makes the policy decision.

From an operational standpoint, when you see two eBPF programs on the same interface (one XDP, one TC), this pipeline is the likely explanation:

bpftool net list
# xdpdrv id 44 on eth0       ← XDP classifier running at line rate
# tc ingress id 47 on eth0   ← TC enforcer reading XDP metadata

How Cilium Uses XDP

Not running Cilium? On EKS with aws-vpc-cni or GKE with kubenet, service forwarding uses iptables NAT rules and conntrack instead. You can see this with iptables -t nat -L -n on a node — look for the KUBE-SVC-* chains. Those chains are what XDP replaces in a Cilium cluster. This is why teams migrating from kube-proxy to Cilium report lower node CPU at high connection rates — it’s not magic, it’s hook placement.

On a Cilium node, XDP handles the load balancing path for ClusterIP services. When a packet arrives at the node destined for a ClusterIP:

  1. XDP program checks the destination IP against a BPF LRU hash map of known service VIPs
  2. On a match, it performs DNAT — rewriting the destination IP to a backend pod IP
  3. Returns XDP_TX or XDP_REDIRECT to forward directly

No iptables NAT rules. No conntrack state machine. No socket buffer allocation for the routing decision. The lookup is O(1) in a BPF hash map.

# See Cilium's XDP program on the node uplink
ip link show eth0 | grep xdp
# xdp  (attached, native mode)

# The XDP program details
bpftool prog show pinned /sys/fs/bpf/cilium/xdp

# Load time, instruction count, JIT-compiled size
bpftool prog show id $(bpftool net list | grep xdp | awk '{print $NF}')

At production scale — 500+ nodes, 50k+ services — removing iptables from the service forwarding path with XDP reduces per-node CPU utilization measurably. The effect is most visible on nodes handling high connection rates to cluster services.


Operational Inspection

# All XDP programs on all interfaces
bpftool net list

# Check XDP mode (native, generic, offloaded)
ip link show | grep xdp

# Per-interface drop counter; XDP drops often show up in driver-specific
# counters instead (try: ethtool -S eth0 | grep -i xdp)
cat /sys/class/net/eth0/statistics/rx_dropped

# XDP drop counters, if the program keeps its own stats map
bpftool map dump id <stats_map_id>

# Verify XDP is active and show program details
bpftool net show dev eth0

Common Mistakes

Mistake | Impact | Fix
Missing bounds check before pointer dereference | Verifier rejects the program at load time | Always check (void *)(ptr + 1) > data_end before use
Using generic XDP for performance testing | Misleading numbers: sk_buff still allocated | Test in native mode only; check ip link output for mode
Not handling non-IP traffic (ARP, IPv6, VLAN) | ARP breaks, IPv6 drops, VLAN-tagged frames dropped | Check eth->h_proto and return XDP_PASS for non-IP
XDP for egress or pod identity | No socket context at XDP; XDP is ingress only | Use TC egress for pod-identity-aware egress policy
Forgetting BPF_F_NO_PREALLOC on LPM trie | Map creation fails: the kernel requires this flag for BPF_MAP_TYPE_LPM_TRIE | Always set this flag when creating the trie
Blocking ARP by accident in a /24 blocklist | Loss of layer-2 reachability within the blocked subnet | Separate ARP handling before the IP blocklist check

Key Takeaways

  • XDP fires before sk_buff allocation — the earliest possible kernel hook for packet processing
  • Three modes: native (in-driver, full performance), generic (fallback, no perf gain), offloaded (NIC ASIC)
  • XDP context is raw packet bytes — no socket, no cgroup, no pod identity; handle non-IP traffic explicitly
  • Every pointer dereference requires a bounds check against data_end — the verifier enforces this
  • BPF_MAP_TYPE_LPM_TRIE is the right map for IP prefix blocklists — handles /32 hosts and CIDRs together
  • XDP metadata area enables coordination with TC programs — classify at XDP speed, enforce with pod context at TC

What’s Next

XDP handles ingress at the fastest possible point but has no visibility into which pod sent a packet. EP08 covers TC eBPF — the hook that fires after sk_buff allocation, where socket and cgroup context exist.

TC is how Cilium implements pod-to-pod network policy without iptables. It’s also where stale programs from failed Cilium upgrades leave ghost filters that cause intermittent packet drops. Knowing how TC programs chain — and how to find and remove stale ones — is a specific, concrete operational skill.

Next: TC eBPF — pod-level network policy without iptables
