Network Flow Observability — What Every Connection Reveals

Reading Time: 9 minutes

eBPF: From Kernel to Cloud, Episode 10
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP · TC eBPF · bpftrace · Network Flow Observability · DNS Observability


TL;DR

  • Network flow observability with eBPF attaches persistent programs to TC hooks and records every connection attempt, retransmit, reset, and drop — continuously, with no sampling
    (TC hook = Traffic Control hook: the point in the Linux network stack where eBPF programs intercept packets after ingress or before egress, tied to a specific network interface)
  • APM tools and service mesh telemetry are interpretations of what happened; kernel-level flow data from TC hooks is the raw event stream they all derive from
  • Retransmit counters at the kernel level reveal congestion, half-open connections, and remote endpoint failures that application logs never surface
  • Cilium’s Hubble and similar tools (Pixie, Retina) are eBPF flow exporters — they run TC programs, collect perf_event or ringbuf events, and expose them over an API
  • You can verify what flow data a tool is actually collecting with four bpftool commands — without reading documentation
  • Production caution: flow maps grow with the number of active connections; pin and bound your maps, and account for the per-packet overhead on high-throughput interfaces

EP09 showed bpftrace as an on-demand kernel query tool — compile a question, get an answer, clean up. Network flow observability with eBPF is the persistent version: programs that stay attached to TC hooks across your entire fleet, recording every connection without waiting for you to ask. When a client reports intermittent failures that appear nowhere in application logs, that persistent record is what you query. This episode covers how that layer works and how to read it.

Quick Check: What Flow Data Is Your Cluster Already Collecting?

Before building anything new, check what’s already running. If you have Cilium, Pixie, or Retina on your cluster, eBPF flow programs are already attached:

# SSH into a worker node, then:

# What TC programs are attached to cluster interfaces?
bpftool net list

# Expected output on a Cilium node:
# xdp:
#
# tc:
# eth0(2) clsact/ingress prog_id 38 prio 1 handle 0x1 direct-action
# eth0(2) clsact/egress  prog_id 39 prio 1 handle 0x1 direct-action
# lxc12a3(15) clsact/ingress prog_id 41 prio 1 handle 0x1 direct-action
# lxc12a3(15) clsact/egress  prog_id 42 prio 1 handle 0x1 direct-action

# What maps are those programs holding state in?
bpftool map list | grep -E "flow|conn|sock|nat"

# Sample output:
# 24: hash  name cilium_ct4_global  flags 0x0
#     key 24B  value 56B  max_entries 65536  memlock 4718592B
# 25: hash  name cilium_ct4_local   flags 0x0
#     key 24B  value 56B  max_entries 8192   memlock 589824B

Each lxcXXXX interface is the host-side end of a pod’s veth pair. The TC programs on those interfaces are what Cilium uses to enforce NetworkPolicy and collect flow telemetry. If you see prog_id values on pod interfaces, your cluster is already doing kernel-level flow collection.

Not running Cilium? On a plain kubeadm or EKS node without a CNI that uses eBPF, bpftool net list will show no TC programs on pod interfaces — just whatever kube-proxy or the CNI plugin installed. You can still attach your own flow programs with tc qdisc add dev eth0 clsact — that’s the starting point this episode covers.


The client opened a ticket on a Tuesday afternoon. “Intermittent connection failures to the payment gateway. Started around 11 AM. Application logs say timeout. Retry logic is masking it for most users but the error rate is up 0.3%.”

I looked at the APM dashboard. The service showed elevated latency — p99 at 850ms versus a normal 120ms — but no hard errors at the application layer. The service mesh metrics showed the downstream call succeeding from the mesh’s perspective. The payment gateway team said their side looked clean.

Three tools. Three different answers. All of them interpreting the network. None of them were the network.

I ran:

bpftool map dump id 24 | grep -A5 "payment-gateway-ip"

The connection tracking map showed retransmit count 14 for a specific (src_ip, dst_ip, src_port, dst_port, protocol) tuple — the same 5-tuple, every 30 seconds, for 2 hours. The kernel was retransmitting. The TCP stack was compensating. The application was seeing sporadic success because retransmits eventually got through. The APM dashboard averaged that latency into a p99 and called it “elevated.”

The kernel had the truth. Everything above it was rounding.


Why Application-Level Metrics Miss What the Kernel Sees

Application metrics — APM spans, service mesh telemetry, load balancer health checks — operate at Layer 7. They measure round-trip time for complete requests, error codes returned, bytes transferred. They answer “did this request succeed?” not “what did the network do to make it succeed?”

The TCP stack underneath those requests handles retransmits, congestion window adjustments, RST packets, and half-open connections silently. From an application’s perspective, a request that needed 3 retransmits before the ACK arrived looks no different from one that succeeded on the first attempt — slightly slower, but successful.

This is structural, not a tooling gap. Application-layer observability tools cannot see below their own protocol boundary. The kernel’s TCP implementation does not report upward when it retransmits. It just retransmits.

eBPF flow observability closes this gap by attaching programs directly to the network path — at the TC hook, which fires on every packet crossing a network interface — and recording what the kernel actually does.


How TC Hook Flow Programs Work

EP08 covered TC eBPF programs for pod network policy. Flow observability uses the same attachment point with a different purpose: instead of allowing or dropping packets, the program reads packet metadata and writes it to a map or ring buffer.

Pod sends packet
      ↓
veth interface (lxcXXXX)
      ↓
TC clsact/egress hook fires
      ↓
eBPF program reads:
  - src IP, dst IP
  - src port, dst port
  - protocol
  - packet size
  - TCP flags (SYN, ACK, FIN, RST, retransmit bit)
      ↓
Writes event to ringbuf (or perf_event_array)
      ↓
Userspace consumer reads ringbuf
      ↓
Aggregates to flow record
      ↓
Exports to Hubble/Prometheus/flow store

ringbuf — a BPF ring buffer: a lock-free, memory-efficient queue shared between a kernel eBPF program and a userspace consumer. The kernel program writes events; the userspace reader drains them. Used instead of perf_event_array in kernel 5.8+ because it avoids per-CPU memory waste and supports variable-length records. When you see Hubble exporting flows, it’s reading from a ringbuf that the TC program writes to.

The key structural property: the TC hook fires on every packet. Not sampled. Not throttled by default. Every SYN, every ACK, every RST, every retransmit. For flow observability, you typically aggregate at the program level — count packets and bytes per 5-tuple per second, rather than emitting an event per packet — but the raw visibility is there if you need it.
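
To make that pipeline concrete, here is a minimal sketch of the kernel side — a TC program that parses the 5-tuple and TCP flags and pushes an event into a ringbuf. It is illustrative only (the struct, map, and program names are mine, not Cilium’s) and assumes IPv4/TCP traffic on a 5.8+ kernel with ringbuf support:

// flow_observer.bpf.c — a sketch of the TC → ringbuf path (names are illustrative)
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct flow_event {
    __u32 saddr, daddr;
    __u16 sport, dport;
    __u8  tcp_flags;                        // FIN/SYN/RST/PSH/ACK bits
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 20);           // 1 MiB ring shared with the userspace consumer
} flow_events SEC(".maps");

SEC("tc")
int observe_flow(struct __sk_buff *skb)
{
    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;                   // not IPv4 — pass it through untouched

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_TCP)
        return TC_ACT_OK;

    struct tcphdr *tcp = (void *)ip + ip->ihl * 4;
    if ((void *)(tcp + 1) > data_end)
        return TC_ACT_OK;

    struct flow_event *e = bpf_ringbuf_reserve(&flow_events, sizeof(*e), 0);
    if (!e)
        return TC_ACT_OK;                   // ring full — keep observing, never break traffic

    e->saddr     = ip->saddr;
    e->daddr     = ip->daddr;
    e->sport     = bpf_ntohs(tcp->source);
    e->dport     = bpf_ntohs(tcp->dest);
    e->tcp_flags = ((__u8 *)tcp)[13];       // byte 13 of the TCP header holds the flag bits
    bpf_ringbuf_submit(e, 0);

    return TC_ACT_OK;                       // observability only: always pass the packet
}

char LICENSE[] SEC("license") = "GPL";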


What Retransmit Telemetry Actually Reveals

Most flow observability implementations track TCP retransmits specifically because they are the clearest signal of network-layer trouble invisible to applications.

A TCP retransmit happens when a sender doesn’t receive an ACK within the retransmission timeout (RTO). The kernel resends the segment and doubles the timeout (exponential backoff). With Linux’s default 200 ms minimum RTO, five consecutive losses cost roughly 200 + 400 + 800 + 1600 + 3200 ms — on the order of six seconds — before the segment finally lands. From the application’s perspective, the call just takes longer. If retransmits keep clearing, the application sees success — slow success.

perf_event — a kernel mechanism for collecting performance data. In eBPF, BPF_MAP_TYPE_PERF_EVENT_ARRAY lets kernel programs push variable-length records to userspace readers via a ring buffer per CPU. Older tools use perf_event_array; newer ones use BPF_MAP_TYPE_RINGBUF (single shared ring, more efficient). If you inspect an older version of Cilium’s flow exporter, you’ll see perf_event writes; newer versions use ringbuf.
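
The difference shows up directly in how the two map types are declared and written from the kernel side. A hedged sketch — map and variable names are illustrative:

// Kernel-side map declarations only — names are illustrative
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// Older pattern: one buffer per CPU, written with bpf_perf_event_output()
struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} events SEC(".maps");
// emit: bpf_perf_event_output(skb, &events, BPF_F_CURRENT_CPU, &evt, sizeof(evt));

// Newer pattern (kernel 5.8+): one shared ring, variable-length records
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);   // total ring size in bytes
} flow_rb SEC(".maps");
// emit: bpf_ringbuf_output(&flow_rb, &evt, sizeof(evt), 0);
// or reserve/submit to avoid the copy: bpf_ringbuf_reserve() + bpf_ringbuf_submit()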

To observe retransmits directly with bpftrace:

# Count retransmit events per destination IP — run for 60 seconds
bpftrace -e '
kprobe:tcp_retransmit_skb {
    $sk = (struct sock *)arg0;
    $daddr = ntop($sk->__sk_common.skc_daddr);
    @retransmits[$daddr] = count();
}
interval:s:60 { print(@retransmits); clear(@retransmits); exit(); }
'

Sample output:

Attaching 2 probes...
@retransmits[10.96.0.10]:   2       # DNS service — normal
@retransmits[172.16.4.23]:  847     # payment gateway endpoint ← problem here
@retransmits[10.244.1.5]:   1       # normal pod-to-pod traffic

847 retransmits to a single endpoint in 60 seconds. That’s not noise. That’s a congested or half-open connection being retried 14 times per second by the TCP stack while the application layer averages it into “elevated latency.”


How Cilium Hubble Collects Flow Data

Hubble is the flow observability layer built into Cilium. Understanding how it works makes you able to reason about what it can and cannot see — and how to verify what it’s actually collecting.

Hubble’s architecture:

Kernel (per node)
├── TC eBPF programs on all pod veth interfaces
│     write flow events → BPF ringbuf
│
└── Hubble node agent (userspace)
      reads ringbuf
      enriches with pod metadata (Kubernetes API)
      exposes gRPC API

Cluster level
└── Hubble Relay
      aggregates per-node gRPC streams
      exposes single cluster-wide API

User tooling
└── hubble observe  /  Hubble UI  /  Prometheus exporter

The TC programs are writing raw packet events. The Hubble agent is the consumer that translates those events into Kubernetes-aware flow records — adding pod name, namespace, label, and policy verdict on top of the 5-tuple and TCP metadata the kernel provides.
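
Hubble’s agent is more elaborate than this, but the kernel-to-userspace handoff it performs is the standard libbpf ring buffer loop. A minimal sketch of that consumer side, assuming a TC program has pinned its ringbuf map at a hypothetical path and writes a matching event struct (not Hubble’s actual code):

// flow_consumer.c — drain a BPF ringbuf from userspace (illustrative sketch)
#include <stdio.h>
#include <linux/types.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

struct flow_event {                         // must match the kernel-side struct
    __u32 saddr, daddr;
    __u16 sport, dport;
    __u8  tcp_flags;
};

static int handle_event(void *ctx, void *data, size_t len)
{
    const struct flow_event *e = data;
    // A real exporter enriches here — map IPs to pod names via the Kubernetes API,
    // aggregate into flow records, then export over gRPC or Prometheus.
    printf("port %u -> %u flags=0x%02x\n", e->sport, e->dport, e->tcp_flags);
    return 0;
}

int main(void)
{
    // Assumes the TC program pinned its ringbuf map at this (hypothetical) path.
    int map_fd = bpf_obj_get("/sys/fs/bpf/flow_events");
    if (map_fd < 0)
        return 1;

    struct ring_buffer *rb = ring_buffer__new(map_fd, handle_event, NULL, NULL);
    if (!rb)
        return 1;

    while (ring_buffer__poll(rb, 100 /* ms */) >= 0)
        ;                                   // loop until interrupted or an error occurs

    ring_buffer__free(rb);
    return 0;
}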

To see what Hubble’s TC programs have attached:

# On any Cilium node
bpftool net list | grep lxc

# lxce4a1(23) clsact/ingress prog_id 61  ← Hubble flow program on pod interface ingress
# lxce4a1(23) clsact/egress  prog_id 62  ← Hubble flow program on pod interface egress
# lxcf7b2(31) clsact/ingress prog_id 63
# lxcf7b2(31) clsact/egress  prog_id 64

# Inspect one of those programs to confirm it's reading flow metadata
bpftool prog show id 61

# Output:
# 61: sched_cls  name tail_handle_nat  tag 3a8e2f1b4c7d9e0a  gpl
#     loaded_at 2026-04-22T09:13:45+0530  uid 0
#     xlated 2144B  jited 1382B  memlock 4096B  map_ids 24,31,38
#     btf_id 142

sched_cls is the BPF program type for TC — confirming these are TC-attached flow programs. map_ids 24,31,38 — those are the maps this program reads from and writes to. You can dump any of them:

bpftool map dump id 24 | head -40

# Output (connection tracking entry):
# [{
#     "key": {
#         "saddr": "10.244.1.5",        # ← source pod IP
#         "daddr": "172.16.4.23",        # ← destination IP
#         "sport": 48291,                # ← source port
#         "dport": 443,                  # ← destination port
#         "nexthdr": 6,                  # ← protocol: TCP
#         "flags": 3                     # ← CT_EGRESS | CT_ESTABLISHED
#     },
#     "value": {
#         "rx_packets": 14832,           # ← packets received
#         "tx_packets": 14831,           # ← packets sent
#         "rx_bytes": 3841024,           # ← bytes received
#         "tx_bytes": 3756288,           # ← bytes sent
#         "lifetime": 21600,             # ← seconds until entry expires
#         "rx_closing": 0,
#         "tx_closing": 0
#     }
# }]

That’s the ground truth. Not an APM span. Not a service mesh metric. The actual per-connection counters the kernel is maintaining for that 5-tuple.


Writing a Minimal Flow Observer with bpftrace

You don’t need Cilium or Hubble to get flow telemetry. bpftrace can produce it directly on any node with BTF:

# Persistent flow table: connections + packet counts for 2 minutes
timeout 120 bpftrace -e '
kprobe:tcp_sendmsg {
    $sk = (struct sock *)arg0;
    $daddr = ntop($sk->__sk_common.skc_daddr);
    // skc_dport is big-endian — swap the bytes to get the host-order port
    $dport = (($sk->__sk_common.skc_dport & 0xff) << 8) | ($sk->__sk_common.skc_dport >> 8);
    @flows[comm, $daddr, $dport] = count();
}
interval:s:30 { print(@flows); clear(@flows); }
'

Sample output (every 30 seconds):

@flows[curl, 93.184.216.34, 443]:         12    # curl → example.com:443
@flows[coredns, 10.96.0.10, 53]:          341   # CoreDNS upstream queries
@flows[payment-svc, 172.16.4.23, 443]:   1204   # payment service → gateway
@flows[nginx, 10.244.2.3, 8080]:          89    # nginx → upstream pod

For retransmit tracking specifically:

# Combined flow + retransmit watcher — runs until Ctrl-C
bpftrace -e '
kprobe:tcp_retransmit_skb {
    $sk = (struct sock *)arg0;
    $daddr = ntop($sk->__sk_common.skc_daddr);
    @retx[comm, $daddr] = count();
}
kprobe:tcp_sendmsg {
    $sk = (struct sock *)arg0;
    $daddr = ntop($sk->__sk_common.skc_daddr);
    @sends[comm, $daddr] = count();
}
interval:s:10 {
    printf("=== Retransmit ratio (last 10s) ===\n");
    print(@retx);
    print(@sends);
    clear(@retx);
    clear(@sends);
}
'

This gives you both the volume of sends and the retransmit count side by side — the ratio tells you whether retransmits are a rounding error (0.01%) or a signal (5%+).


⚠ Production Gotchas

Map size bounds matter. Connection tracking maps default to tens of thousands of entries. On nodes with high connection churn (serverless, short-lived batch jobs), maps can fill and start dropping new entries silently. Check bpftool map show id N for max_entries and monitor map utilization. Cilium exposes this as cilium_bpf_map_pressure in Prometheus.

Per-packet overhead on high-throughput interfaces. A TC program that fires on every packet on a 10Gbps interface processes millions of packets per second. Aggregating at the program level (count per 5-tuple rather than emit per packet) keeps overhead manageable — Cilium does this. A naive bpftrace one-liner that emits a perf event per packet will saturate the perf ring buffer under real load. Use ringbuf write paths or aggregate before emitting.
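
Both of the concerns above point at the same kernel-side pattern: aggregate per 5-tuple into a bounded LRU map and let userspace scrape counters, rather than emitting one event per packet. A sketch of that map shape — names and sizes are illustrative:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct flow_key {
    __u32 saddr, daddr;
    __u16 sport, dport;
    __u8  proto;
};

struct flow_counters {
    __u64 packets;
    __u64 bytes;
};

struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);   // oldest flows evicted; the map never outgrows its bound
    __uint(max_entries, 65536);            // explicit bound — size for the expected concurrent flows
    __type(key, struct flow_key);
    __type(value, struct flow_counters);
} flows SEC(".maps");

// Per packet, inside the TC program:
//   struct flow_counters *c = bpf_map_lookup_elem(&flows, &key);
//   if (c) {
//       __sync_fetch_and_add(&c->packets, 1);
//       __sync_fetch_and_add(&c->bytes, skb->len);
//   } else {
//       struct flow_counters fresh = { .packets = 1, .bytes = skb->len };
//       bpf_map_update_elem(&flows, &key, &fresh, BPF_ANY);
//   }

Userspace then reads the whole map once a second — with bpftool map dump or a libbpf loop — instead of draining millions of per-packet events.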

TC hook placement and direction confusion. Ingress TC on a pod’s veth (lxcXXXX) sees egress traffic from the pod’s perspective — because the host sees the packet arriving on the veth after the pod sent it. This reversal is consistent but confusing when you’re reading direction labels in flow records. EP08 covered this in detail for policy enforcement; the same asymmetry applies to flow data.

Retransmit counters reset on connection close. If you’re tracking retransmit totals for a long-lived connection, the count is stored in the kernel’s socket state and is cleared when the socket closes. For persistent tracking across reconnects, aggregate at the flow level in userspace before the connection closes.

Hubble flow visibility requires pod interfaces. Hubble only sees traffic that crosses a pod’s veth interface. Node-to-node traffic that doesn’t involve a pod (e.g., node SSH, kubelet-to-API-server on the node IP) is not captured by default. For host-level network observability, you need a TC program on the physical interface (eth0, ens3), not just on pod veth pairs.


Quick Reference

What you want to see                Command
What TC programs are attached       bpftool net list
Which maps a program uses           bpftool prog show id N  (check map_ids)
Connection tracking entries         bpftool map dump id N
Retransmits per destination         bpftrace -e 'kprobe:tcp_retransmit_skb { ... }'
Flow counts per process             bpftrace -e 'kprobe:tcp_sendmsg { ... }'  (see the flow observer above)
Hubble flow stream (Cilium)         hubble observe --follow
Hubble flows for one pod            hubble observe --pod mynamespace/mypod --follow
Map sizing / pressure               bpftool map show id N for max_entries; cilium_bpf_map_pressure for utilization

Kernel function       What it marks
tcp_sendmsg            Data being sent on a TCP socket
tcp_recvmsg            Data being received on a TCP socket
tcp_retransmit_skb     A segment being retransmitted
tcp_send_reset         A RST being sent
tcp_fin                A received FIN being processed (peer closing the connection)
tcp_connect            A new outbound TCP connection attempt

Key Takeaways

  • Network flow observability with eBPF attaches TC programs that record every connection event continuously — not sampled, not throttled, not filtered by what the application reports
  • Retransmit telemetry from tcp_retransmit_skb reveals congestion and endpoint failures that are structurally invisible to application-layer monitoring tools
  • Cilium Hubble, Pixie, and Retina are all eBPF flow exporters — they run TC programs, drain a ringbuf, enrich with Kubernetes metadata, and expose the result over an API
  • You can verify what any flow tool is actually collecting with bpftool net list, bpftool map list, bpftool prog show, and bpftool map dump — four commands, no documentation needed
  • Map sizing and per-packet overhead are the two production concerns; aggregate at the kernel level, bound your maps, and monitor map pressure
  • The kernel’s connection tracking map is the ground truth. APM dashboards, service mesh metrics, and load balancer health checks are all interpretations of what that map contains

What’s Next

Flow observability tells you what connections exist. EP11 goes one level deeper: what names your pods are resolving those connections to. DNS is where a compromised workload first reveals itself — it queries a domain that has no business being queried from a production pod, and if you’re not watching the kernel-level DNS path, you won’t see it until after the damage.

DNS observability at the kernel level hooks the syscall and socket path that DNS queries travel — the same ground-truth approach as flow telemetry, but for name resolution: every query, every response, tied to the pod that made it, without deploying a sidecar.

Next: DNS observability at the kernel level — what your pods are actually resolving

Get EP11 in your inbox when it publishes → linuxcent.com/subscribe

TC eBPF — Pod-Level Network Policy Without iptables

Reading Time: 10 minutes

eBPF: From Kernel to Cloud, Episode 8
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP · TC eBPF


TL;DR

  • TC eBPF fires after sk_buff allocation — it has socket metadata, cgroup ID, and pod identity that XDP lacks
    (sk_buff = the kernel’s socket buffer, allocated for every packet; TC fires after this allocation, so it can read socket and process identity)
  • Direct action (DA) mode combines filter and action; the program’s return value is the packet fate
  • Multiple TC programs chain on the same hook ordered by priority — stale programs from Cilium upgrades cause silent policy conflicts
  • tc filter show dev <iface> ingress/egress is the primary inspection tool; bpftool net list shows the full node picture
  • XDP + TC is the Cilium data path: XDP for pre-stack service load balancing, TC for per-pod identity-based enforcement
  • TC can modify packet content (bpf_skb_store_bytes) — the basis for TC-based DNAT and packet mangling

TC eBPF is where Cilium implements pod-level network policy without iptables — the hook that fires after sk_buff allocation, where socket and cgroup context exist, making per-pod enforcement possible. The obvious follow-up to XDP is why Cilium doesn’t use it for everything — pod network policy, egress enforcement, the full NetworkPolicy ruleset. The answer reveals an inherent trade-off built into the Linux data path: XDP’s speed comes from running before any context exists. At the moment it fires, there is no socket, no cgroup, no way to tell which pod sent the packet. The moment you need pod identity, you need a hook that fires later — and pays for it.


A specific pod in production was experiencing intermittent TCP connection failures to an external service. Not all connections — roughly one in fifty. Kubernetes NetworkPolicy showed egress allowed for the namespace. Cilium policy status showed no violations. Running curl from inside the pod worked fine.

The application logs told a different story: connection timeouts at the 30-second mark, no SYN-ACK received. Not a DNS issue — I verified with tcpdump inside the pod namespace. SYN packets were leaving the pod network namespace. They weren’t making it onto the wire.

I ran bpftool net list on the node and saw two TC egress programs attached to that pod’s veth interface. One from the current Cilium version (installed six weeks ago). One from the previous version — from before the rolling upgrade. Two programs. Different policy epochs. The older one had a stale block rule that fired intermittently based on connection tuple patterns it was never designed to handle in the new policy model.

Without understanding TC eBPF — what programs attach where, how multiple programs interact, and how to inspect them — I would have kept chasing ghosts in the application layer.

Quick Check: Are There Stale TC Filters on Your Cluster?

The most common TC eBPF issue on production clusters — stale filters left behind by a Cilium upgrade — is a two-command check:

# SSH into a worker node, then pick any pod's veth interface:
ip link | grep lxc | head -5
# lxc8a3f21b@if7: ...
# lxc2c9d3e1@if9: ...

# Check TC filters on that interface
tc filter show dev lxc8a3f21b egress

Healthy output (one filter, one priority):

filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_to_container direct-action not_in_hw id 44

Stale filter present (two priorities = problem):

filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_to_container direct-action not_in_hw id 44
filter protocol all pref 2 bpf chain 0
filter protocol all pref 2 bpf chain 0 handle 0x1 old_cil_to_container direct-action not_in_hw id 17
#                  ^^^^^^ two different priorities = two programs running in sequence

Two priorities on the same hook means two programs running sequentially. If the older one has a stale DROP rule, packets are being dropped intermittently — and nothing in the application layer will tell you why.

Not running Cilium? If you’re on a non-Cilium CNI (Calico, Flannel, aws-vpc-cni), you likely won’t have TC eBPF filters on pod interfaces. Run tc filter show dev eth0 ingress on the node uplink instead to see if any TC programs are attached at the node level. An empty response is normal for non-Cilium clusters.

Why TC, Not XDP

EP07 covered XDP: fastest possible hook, fires before sk_buff, drops at line rate. If XDP is so fast, why doesn’t Cilium use it for everything?

Because XDP sees only raw packet bytes. No socket. No cgroup. No pod identity.

In Kubernetes, network policy is inherently about identity. “Allow pod A to connect to pod B on port 8080.” To enforce this, you need to know which pod a packet is coming from on egress — and which pod it’s going to on ingress. That mapping lives in the cgroup hierarchy and the socket buffer, neither of which exist at XDP time.

TC fires later in the packet lifecycle, after sk_buff is allocated and populated:

Ingress path:
  wire → NIC → [XDP hook] → sk_buff allocated → [TC ingress hook] → netfilter → socket

Egress path:
  socket → IP routing → [TC egress hook] → qdisc → NIC → wire

At the TC egress hook on a pod’s veth interface, the sk_buff carries the socket that created the packet — and from that socket you can read the cgroup ID. The cgroup hierarchy maps container → pod, so the TC program knows which pod this traffic belongs to. That’s what makes pod-level enforcement possible.
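
A sketch of what that context looks like in code: where the sending socket is still attached to the sk_buff (for example, traffic the node itself originated), a TC program can read the socket’s cgroup ID with a single helper call — something with no XDP equivalent. The program below is illustrative, not Cilium’s mechanism:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int tag_by_cgroup(struct __sk_buff *skb)
{
    // Full 64-bit cgroup v2 ID of the socket attached to this skb.
    // Returns 0 if no socket is attached (e.g. forwarded traffic this node
    // did not originate). There is no equivalent at XDP time, because
    // there is no sk_buff and no socket yet.
    __u64 cgid = bpf_skb_cgroup_id(skb);

    // A real implementation would look cgid up in a map maintained by an
    // agent that translates cgroup IDs to pod identities, rather than printing it.
    if (cgid)
        bpf_printk("packet from cgroup %llu", cgid);

    return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";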

The Linux Traffic Control Architecture

tc (traffic control) is the Linux subsystem for managing packet queues and scheduling. Most Linux administrators know it as the bandwidth-shaping tool:

# Classic tc usage — rate limit an interface
tc qdisc add dev eth0 root tbf rate 100mbit burst 32kbit latency 400ms

The qdisc (queuing discipline) is the primary abstraction. Under the qdisc sits a filter layer — and the filter type relevant to eBPF is cls_bpf, which attaches eBPF programs as packet classifiers.

qdisc (queuing discipline) is the kernel’s packet scheduler for an interface — it controls how packets are buffered and in what order they leave. For eBPF policy enforcement, Cilium uses a special qdisc called clsact which has no buffering behaviour at all; it purely provides the ingress and egress hook points where eBPF filters attach. If a pod veth doesn’t have clsact, Cilium isn’t enforcing policy on that pod.

Cilium attaches cls_bpf filters in direct action (DA) mode, which combines classifier and action into a single eBPF program. The program’s return value is the packet fate directly:

Return value         Action
TC_ACT_UNSPEC (-1)   Fall through to the next filter (or the default action)
TC_ACT_OK (0)        Accept the packet
TC_ACT_SHOT (2)      Drop the packet
TC_ACT_PIPE (3)      Continue to the next action in the chain
TC_ACT_REDIRECT (7)  Redirect to another interface
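
In direct-action mode those codes are literally the program’s last line. A minimal sketch of a DA-mode filter that drops egress TCP traffic to one port — illustrative only, not how Cilium structures its policy programs:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("tc")
int block_port_9200(struct __sk_buff *skb)
{
    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_TCP)
        return TC_ACT_OK;

    struct tcphdr *tcp = (void *)ip + ip->ihl * 4;
    if ((void *)(tcp + 1) > data_end)
        return TC_ACT_OK;

    if (tcp->dest == bpf_htons(9200))
        return TC_ACT_SHOT;   // the verdict IS the return value in direct-action mode
    return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";

Attached with tc filter add dev <iface> egress bpf da obj <object.o> sec tc, no separate tc action is configured — the return code alone decides the packet’s fate.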

TC Context: What Your Program Can See

TC programs receive a struct __sk_buff — a safe, BPF-accessible projection of the kernel sk_buff. Unlike the raw packet bytes in XDP, __sk_buff includes metadata:

struct __sk_buff {
    __u32 len;           // packet length
    __u32 pkt_type;      // PACKET_HOST, PACKET_BROADCAST, etc.
    __u32 mark;          // skb->mark — used by Cilium for pod identity
    __u32 queue_mapping;
    __u32 protocol;      // ETH_P_IP, ETH_P_IPV6, etc.
    __u32 vlan_present;
    __u32 vlan_tci;
    __u32 vlan_proto;
    __u32 priority;
    __u32 ingress_ifindex;
    __u32 ifindex;
    __u32 tc_index;
    __u32 cb[5];
    __u32 hash;
    __u32 tc_classid;
    __u32 data;          // offset to packet data
    __u32 data_end;
    __u32 napi_id;
    __u32 family;
    __u32 remote_ip4;    // source IP (ingress) or dest IP (egress)
    __u32 local_ip4;
    __u32 remote_port;
    __u32 local_port;
    // ...
};

skb->mark is how Cilium passes pod identity between its hook points.

skb->mark is a 32-bit scratch field in every sk_buff that any kernel subsystem can read or write — iptables uses it, routing rules use it, and Cilium uses it to carry pod security identity between its hook points. The socket-level cgroup hook (cgroup_sock_addr) stamps the cgroup-derived pod identity into the mark when the socket calls connect(); by the time the packet reaches the TC egress hook, skb->mark carries the pod’s security identity, and the TC program can use it for policy enforcement without another identity lookup.
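
A minimal sketch of the read side — a TC program treating the mark as an identity and checking it against a policy map. The mark layout and the map here are invented for illustration; Cilium’s actual encoding is internal to its datapath:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

// Hypothetical map: security identity -> allowed (1) or denied (0)
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 4096);
    __type(key, __u32);
    __type(value, __u8);
} identity_allowed SEC(".maps");

SEC("tc")
int enforce_by_mark(struct __sk_buff *skb)
{
    // Assumes an earlier hook stored the identity in the low 16 bits of skb->mark.
    __u32 identity = skb->mark & 0xFFFF;

    __u8 *allowed = bpf_map_lookup_elem(&identity_allowed, &identity);
    if (allowed && *allowed)
        return TC_ACT_OK;     // identity is allowed — pass
    return TC_ACT_SHOT;       // unknown or denied identity — drop
}

char LICENSE[] SEC("license") = "GPL";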

What Cilium’s TC Filters Actually Do

The TC filter on each pod’s veth is Cilium’s enforcement point for Kubernetes NetworkPolicy. The mechanism:

  1. When a pod opens a connection, a cgroup_sock_addr hook stamps the pod’s security identity (derived from its labels + namespace) into skb->mark
  2. The TC egress filter on the veth reads skb->mark, looks up the pod identity + destination in the policy map, and returns TC_ACT_SHOT (drop) or TC_ACT_OK (pass)
  3. The TC ingress filter on the receiving pod’s veth does the same check for inbound traffic

The policy map is a BPF LRU hash keyed on {pod_identity, dst_ip, dst_port, protocol}. This is what cilium policy get reads from — and what bpftool map dump shows directly:

# Find Cilium's policy maps
bpftool map list | grep -i policy

# Dump the active policy entries for a specific endpoint
# Get endpoint ID from: cilium endpoint list
cilium bpf policy get <endpoint-id>

# Cross-check with raw bpftool dump
bpftool map dump id <POLICY_MAP_ID> | head -30

The clsact qdisc is the prerequisite for any TC eBPF filter — it creates the ingress and egress hook points without any queuing behavior. Every pod veth on a Cilium node has one:

tc qdisc show dev lxcABCDEF
# qdisc clsact ffff: dev lxcABCDEF parent ffff:fff1
# ^^^^^^^^^^^^ this line confirms Cilium's hook points exist on this pod's veth
# If this is missing: Cilium is NOT enforcing NetworkPolicy on this pod

If a pod veth doesn’t show clsact, Cilium isn’t enforcing policy on that pod.

Multiple Programs and the Filter Chain

This is the detail that caused my production incident.

TC supports chaining multiple filters on the same hook, ordered by priority. Lower priority number runs first. When Cilium upgrades, it installs a new filter at a new priority before removing the old one. If the upgrade procedure has any timing gap — or if the removal step fails silently — you end up with two programs running in sequence.

# Show all TC filters on a pod's veth — both priorities visible
tc filter show dev lxc12345 egress

# Example output with a stale filter:
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_to_container direct-action not_in_hw id 44
filter protocol all pref 2 bpf chain 0
filter protocol all pref 2 bpf chain 0 handle 0x1 old_cil_to_container direct-action not_in_hw id 17

Two programs. Pref 1 runs first; whether pref 2 ever sees the packet depends on pref 1’s return code — TC_ACT_SHOT has already dropped it, while TC_ACT_UNSPEC falls through to the next filter.

In my incident: pref 1 was the current Cilium version with the correct policy for the traffic in question. Pref 2 was the old version with a stale block entry that returned TC_ACT_SHOT for a subset of connection tuples. With two policy epochs live on the same hook, which program decided a given packet’s fate depended on fall-through behaviour the two versions were never designed to coordinate — and packets were intermittently dropped.

The fix:

# Remove the stale filter by priority
tc filter del dev lxc12345 egress pref 2

# Verify only the current filter remains
tc filter show dev lxc12345 egress

This should be part of any post-upgrade verification for Cilium-managed clusters.

How Cilium Uses TC Across the Full Node

Cilium’s TC deployment on a node:

Pod veth (host-side, lxcXXXXX):
  TC ingress: cil_from_container — L3/L4 policy on the pod's outbound traffic
  TC egress:  cil_to_container   — L3/L4 policy on traffic arriving at the pod

Node uplink (eth0):
  TC ingress: cil_from_netdev    — traffic arriving from outside the node
  TC egress:  cil_to_netdev      — traffic leaving the node

XDP on eth0:
  cil_xdp_entry — pre-stack service load balancing (DNAT for ClusterIP)

The naming is counterintuitive at first: cil_from_container is attached to the TC ingress hook on the veth.

Veth direction confusion: TC ingress/egress is named from the kernel’s perspective of the interface, not the pod’s. The host-side veth receives the traffic the pod is sending, so TC ingress on the host veth is the pod’s outbound traffic — cil_from_container, which enforces the pod’s egress policy. This trips up everyone the first time. When debugging, confirm direction with tc filter show dev lxcXXX ingress and egress separately, and check which Cilium program name is attached: cil_from_container = pod outbound, cil_to_container = pod inbound.

To see the full picture on a node:

# All eBPF network programs (XDP and TC) across all interfaces
bpftool net list

# TC-specific view
for iface in $(ip link | grep lxc | awk -F': ' '{print $2}' | cut -d@ -f1); do
    echo "=== $iface ==="
    tc filter show dev $iface ingress
    tc filter show dev $iface egress
done

TC Can Modify Packets Too

Unlike XDP, TC programs have full access to the sk_buff and can modify packet content — headers, payload, and checksums. This is how TC-based DNAT works in Cilium when XDP isn’t available on the NIC: the program rewrites the destination IP at L3 and updates the IP + transport checksums atomically. The kernel BPF helper handles the checksum recalculation.
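
A hedged sketch of that rewrite pattern — the fixed offsets below assume an untagged Ethernet frame with a 20-byte IPv4 header and no IP options; a real implementation derives offsets from the parsed headers:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

// Fixed offsets for an untagged Ethernet frame + 20-byte IPv4 header (illustrative only)
#define IP_DST_OFF   (14 + 16)              // iphdr.daddr, after the 14-byte Ethernet header
#define IP_CSUM_OFF  (14 + 10)              // IPv4 header checksum
#define TCP_CSUM_OFF (14 + 20 + 16)         // TCP checksum, assuming no IP options

SEC("tc")
int dnat_daddr(struct __sk_buff *skb)
{
    __u32 old_ip;
    __u32 new_ip = bpf_htonl(0x0A0A0A0A);   // hypothetical new destination: 10.10.10.10

    if (bpf_skb_load_bytes(skb, IP_DST_OFF, &old_ip, sizeof(old_ip)) < 0)
        return TC_ACT_OK;

    // Patch both checksums incrementally, then rewrite the destination address.
    bpf_l4_csum_replace(skb, TCP_CSUM_OFF, old_ip, new_ip,
                        BPF_F_PSEUDO_HDR | sizeof(new_ip));
    bpf_l3_csum_replace(skb, IP_CSUM_OFF, old_ip, new_ip, sizeof(new_ip));
    bpf_skb_store_bytes(skb, IP_DST_OFF, &new_ip, sizeof(new_ip), 0);

    return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";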

From an operational standpoint: if you see a TC program attached but expected traffic is being redirected rather than dropped, the program is likely doing DNAT. bpftool prog dump xlated id <ID> shows the disassembled instructions and will reveal bpf_skb_store_bytes calls if packet rewriting is happening.

Debugging TC Programs in Production

Workflow I follow when investigating network issues on Cilium clusters:

# 1. List all eBPF network programs (see the full picture)
bpftool net list

# 2. Check specific interface for stale TC filters
tc filter show dev lxcABCDEF ingress
tc filter show dev lxcABCDEF egress

# 3. Inspect a specific program
bpftool prog show id 44

# 4. Disassemble a program (last resort for understanding behavior)
bpftool prog dump xlated id 44

# 5. Check Cilium's view of the same interface
cilium endpoint list
cilium endpoint get <endpoint-id>

# 6. Enable verbose TC program logs (debug builds only)
# Cilium: set CILIUM_DEBUG=true in the deployment

Common Mistakes

  • Not checking for stale TC filters after Cilium upgrades — conflicting policy programs cause intermittent drops. Fix: run tc filter show post-upgrade and remove stale filters by priority.
  • Confusing ingress/egress direction on veth interfaces — policy gets applied to the wrong traffic direction. Fix: remember that TC ingress on the host-side veth is the pod’s outbound traffic.
  • Attaching TC filters without a clsact qdisc — the filter attachment fails. Fix: tc qdisc add dev <iface> clsact before tc filter add.
  • Assuming your program is the only filter on the hook — a stale filter at another priority can still decide the packet’s fate. Fix: keep exactly one filter per hook and verify with tc filter show.
  • Expecting TC performance equal to XDP — TC pays the sk_buff overhead and is slower. Fix: use XDP for pre-stack bulk drops, TC for identity-aware policy.
  • Hardcoding an skb->mark interpretation — different tools use the mark differently. Fix: document mark field usage clearly and coordinate between Cilium and custom programs.

Key Takeaways

  • TC eBPF fires after sk_buff allocation — it has socket metadata, cgroup ID, and pod identity that XDP lacks
  • Direct action (DA) mode combines filter and action; the program’s return value is the packet fate
  • Multiple TC programs chain on the same hook ordered by priority — stale programs from Cilium upgrades cause silent policy conflicts
  • tc filter show dev <iface> ingress/egress is the primary inspection tool; bpftool net list shows the full node picture
  • XDP + TC is the Cilium data path: XDP for pre-stack service load balancing, TC for per-pod identity-based enforcement
  • TC can modify packet content (bpf_skb_store_bytes) — the basis for TC-based DNAT and packet mangling

What’s Next

EP08 closes out the kernel machinery arc: program types, maps, CO-RE, XDP, TC. Five episodes on the engine under the tools. EP09 shifts from understanding the machinery to using it interactively.

bpftrace turns kernel knowledge into one-liners you can run on a live production node. Which process is touching this file right now? Where is this latency spike originating in the kernel call stack? Which container is making DNS queries to an unexpected resolver? Under 10 seconds per question — no restart, no sidecar, no instrumentation change.

Every bpftrace one-liner is a complete eBPF program compiled, loaded, run, and cleaned up on the fly. EP09 covers how that works and why it changes the way you investigate production incidents.

Next: bpftrace — kernel answers in one line

Get EP09 in your inbox when it publishes → linuxcent.com/subscribe