DNS at the Kernel Level — What Your Pods Are Actually Resolving

Reading Time: 9 minutes

eBPF: From Kernel to Cloud, Episode 11
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP · TC eBPF · bpftrace · Network Flow Observability · DNS Observability


Architecture Overview

eBPF DNS Kernel Observability — kernel-level DNS event capture without touching application code
eBPF intercepts DNS at the kernel socket layer — capturing query, response, and latency without application changes.

TL;DR

  • DNS observability in Kubernetes with eBPF hooks the kernel’s DNS syscall path — giving you per-pod query visibility without sidecars, restarts, or CoreDNS log scraping
    (tracepoint = a stable, versioned hook placed deliberately in the Linux kernel source; unlike kprobes, tracepoints survive kernel upgrades without breakage)
  • CoreDNS metrics tell you aggregate query rates; eBPF tracepoints tell you which pod queried what domain, when, and what was returned
  • A compromised workload’s first observable action is almost always an unexpected DNS query — infrastructure no legitimate process should ever resolve
  • The DNS syscall path in Linux goes: application calls getaddrinfo() → glibc → sendto() syscall → kernel network stack → UDP packet to CoreDNS resolver
  • You hook the sendto tracepoint to catch the query leaving the pod and the recvfrom tracepoint to catch the response arriving
  • Production note: DNS query payloads cross the kernel as raw UDP — parsing the DNS wire format in a bpftrace one-liner requires reading past the UDP header; Tetragon and Pixie do this parsing in the eBPF program itself

EP10 showed eBPF flow telemetry as the ground truth for what connections your pods are making. DNS observability with eBPF goes one layer beneath that: the name resolution step that happens before any connection is established. Every domain a pod resolves is visible at the kernel level. That visibility is what a security scan alert is missing when it flags “unexpected DNS queries” — it can see the traffic on the wire, but it can’t tell you which pod sent it without restarting or deploying an agent into the pod.

Quick Check: What DNS Traffic Is Leaving Your Pods Right Now?

Without installing anything, you can see DNS queries crossing any node in under 30 seconds:

# SSH into a worker node, then:

# Watch all UDP port 53 traffic — which processes are making DNS queries?
bpftrace -e '
tracepoint:syscalls:sys_enter_sendto {
    $port = (uint16)((uint8*)args->addr)[3] << 8 |
            (uint16)((uint8*)args->addr)[2];
    if ($port == 53) {
        printf("%-20s %-6d DNS query (UDP sendto)\n", comm, pid);
    }
}' --timeout 30

Expected output:

coredns              1842   DNS query (UDP sendto)   # ← CoreDNS forwarding upstream
nginx                9231   DNS query (UDP sendto)   # ← nginx resolving upstream
payment-svc          11043  DNS query (UDP sendto)   # ← your service making queries
curl                 14829  DNS query (UDP sendto)   # ← kubectl exec / debug session
# How many DNS queries per process in the last 30 seconds?
bpftrace -e '
tracepoint:syscalls:sys_enter_sendto {
    $port = (uint16)((uint8*)args->addr)[3] << 8 |
            (uint16)((uint8*)args->addr)[2];
    if ($port == 53) { @dns_queries[comm] = count(); }
}
interval:s:30 { print(@dns_queries); exit(); }
'

Expected output:

@dns_queries[coredns]:       1203   # ← upstream forwarder traffic
@dns_queries[payment-svc]:    847   # ← legitimate service queries
@dns_queries[unknown]:         12   # ← investigate this one

On EKS or GKE managed nodes: You may not be able to SSH directly to worker nodes, but you can run a privileged debug pod: kubectl debug node/<node-name> -it --image=quay.io/iovisor/bpftrace. The bpftrace program runs on the host kernel and sees all pods’ DNS queries. GKE Autopilot restricts privileged pods — use GKE’s built-in eBPF-based DNS observability instead (enabled via Cloud Logging with DNS policy logging).


A security scan flagged unexpected DNS queries from payment-svc in the production namespace. The query domains didn’t match anything in the service’s known dependency list. The scan tool showed the traffic on the wire — destination port 53, from the pod’s IP — but couldn’t tell us which process inside the pod was responsible or what domain was being queried without pulling the pod’s DNS logs.

The pod had no DNS logging enabled. CoreDNS showed the queries in its aggregate metrics but with no attribution below namespace level. Restarting the pod to add a DNS sidecar would wipe any in-memory state the process had accumulated.

I ran bpftrace with a recvfrom hook to catch the DNS response payloads coming back into the pod:

bpftrace -e '
tracepoint:syscalls:sys_exit_recvfrom {
    if (retval > 0) {
        printf("%-20s PID %-6d received %d bytes (possible DNS response)\n",
               comm, pid, retval);
    }
}' --timeout 60

Then cross-referenced the PIDs to container processes via /proc/<pid>/cgroup. The unexpected queries were coming from a sidecar process that had been injected by a recent Helm chart change — not from the main application container at all. A misconfigured Datadog agent injected into the wrong namespace was querying its intake endpoint.

No restart. No sidecar deployment. Found in under two minutes.


Why CoreDNS Metrics Don’t Give You This

CoreDNS exposes DNS query metrics via Prometheus. Those metrics tell you:
– Total queries per second across the cluster
– Query latency histograms
– Error rates (NXDOMAIN, SERVFAIL)
– Upstream forwarder health

What they don’t tell you:
– Which specific pod sent a query to a specific domain
– Which process inside that pod made the getaddrinfo() call
– Whether the query came from the main container or an injected sidecar
– The timing relationship between a DNS query and the connection that followed it

CoreDNS sees the query after it arrives at the resolver. eBPF tracepoints see the query at the moment the pod’s process issues the sendto() syscall — before it leaves the node. The difference is attribution.


The DNS Syscall Path in Linux

Understanding where the hook fires helps you reason about what you can observe:

Application code
    ↓
getaddrinfo("api.example.com") ← glibc resolver function
    ↓
glibc reads /etc/resolv.conf → finds nameserver 10.96.0.10 (CoreDNS ClusterIP)
    ↓
glibc builds DNS wire-format query packet
    ↓
sendto(sockfd, buf, len, 0, &resolver_addr, addrlen)
    ↓                     ← eBPF tracepoint fires here: sys_enter_sendto
Linux kernel: udp_sendmsg()
    ↓
Packet leaves pod veth interface
    ↓
TC eBPF on veth sees UDP packet (flow telemetry picks this up too)
    ↓
CoreDNS receives query, resolves, sends response
    ↓
Packet arrives back at pod veth
    ↓
recvfrom(sockfd, buf, len, 0, &src_addr, &src_len)
    ↓                     ← eBPF tracepoint fires here: sys_exit_recvfrom
glibc parses DNS response
    ↓
getaddrinfo() returns IP addresses to application

getaddrinfo — the standard POSIX function applications call to resolve a hostname to IP addresses. It lives in glibc, not in the kernel. The kernel never sees the domain name string directly — it only sees the UDP packet carrying the DNS wire-format query. To read the actual domain name in an eBPF program, you parse the DNS packet payload at the sendto tracepoint.

tracepoint — a stable, versioned hook deliberately placed in Linux kernel source code by kernel developers. Unlike kprobes (which attach to arbitrary kernel functions and break when those functions change), tracepoints are part of the kernel’s stable interface. The syscalls:sys_enter_sendto tracepoint has been present and stable since kernel 3.x. You can rely on it across Ubuntu 20.04 through the latest kernels without version checks.


Reading DNS Queries at the Tracepoint

The sendto tracepoint fires when any process sends data on a socket. Filtering to port 53 gives you DNS queries. Parsing the payload gives you the domain name.

The DNS wire format for a query:

Bytes 0-11:   DNS header (12 bytes)
              - Transaction ID (2 bytes)
              - Flags (2 bytes)
              - QDCount, ANCount, NSCount, ARCount (2 bytes each)
Byte 12+:     Question section
              - QNAME (variable length, label-encoded)
              - QTYPE (2 bytes)
              - QCLASS (2 bytes)

The QNAME is length-prefixed labels: \x03api\x07example\x03com\x00 for api.example.com. bpftrace can read the raw bytes but parsing label encoding inline in a one-liner is awkward. For raw query detection (flag any DNS query from a specific process), the tracepoint is enough:

# Watch DNS queries from a specific process name — replace "payment-svc"
bpftrace -e '
tracepoint:syscalls:sys_enter_sendto /comm == "payment-svc"/ {
    printf("PID %-6d sending %d bytes to DNS\n", pid, args->len);
}
'

For full domain name extraction, use a tool that implements DNS wire-format parsing in its eBPF layer. Tetragon and Pixie both do this. On a Tetragon-instrumented cluster:

# Watch DNS queries with domain names — Tetragon (all pods)
kubectl exec -n kube-system -it $(kubectl get pod -n kube-system -l app.kubernetes.io/name=tetragon -o name | head -1) \
  -- tetra getevents --event-types PROCESS_KPROBE \
  | grep -i dns

Sample Tetragon output:

{
  "process": {
    "pod": {"name": "payment-svc-7d4b9f-xk2p1", "namespace": "production"},
    "binary": "/usr/bin/payment-service",
    "pid": 11043
  },
  "function_name": "__sys_sendto",
  "args": [
    {"sock_arg": {"family": "AF_INET", "protocol": "UDP",
                  "daddr": "10.96.0.10", "dport": 53}},
    {"bytes_arg": "<DNS query for metrics.datadoghq.com>"}
  ]
}

Pod name, namespace, binary, PID, and the domain being queried — all from a kernel tracepoint, no sidecar, no pod restart.


Building Pod-Level DNS Attribution Without Tetragon

If you’re not running Tetragon, you can build pod-level attribution from the PID. When bpftrace reports a PID making a DNS query, map it to a container:

# Get the PID from bpftrace, then:
PID=11043

# Which cgroup does this PID belong to? (maps to container/pod)
cat /proc/$PID/cgroup | grep kubepods
# 12:cpu:/kubepods/burstable/pod3f8a21bc-4e7d-4b91-a3c2-8b947f6e3d12/a4c8f1e2b3d4...
# The pod UID is embedded: pod3f8a21bc-4e7d-4b91-a3c2-8b947f6e3d12

# Map pod UID to pod name
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.uid}{" "}{.metadata.name}{" "}{.metadata.namespace}{"\n"}{end}' \
  | grep 3f8a21bc-4e7d-4b91-a3c2-8b947f6e3d12
# 3f8a21bc-4e7d-4b91-a3c2-8b947f6e3d12  payment-svc-7d4b9f-xk2p1  production

That’s the full chain: kernel tracepoint → host PID → cgroup path → pod UID → pod name + namespace. Automatable. No agents required inside the pod.


Detecting Anomalous DNS: What to Watch For

DNS is the first observable action in most attack chains. A process that has been compromised or injected typically cannot establish a C2 connection without first resolving the C2 domain.

Signals worth watching at the kernel DNS layer:

Queries to non-cluster domains from unexpected processes

# Flag any DNS query to a non-cluster domain (not .cluster.local or .svc.cluster.local)
bpftrace -e '
tracepoint:syscalls:sys_enter_sendto {
    $port = (uint16)((uint8*)args->addr)[3] << 8 |
            (uint16)((uint8*)args->addr)[2];
    if ($port == 53) {
        printf("%-20s %-6d DNS sendto\n", comm, pid);
    }
}' --timeout 60

High-frequency DNS queries from a single process (DNS tunneling fingerprint)

# Processes making more than N DNS queries per second
bpftrace -e '
tracepoint:syscalls:sys_enter_sendto {
    $port = (uint16)((uint8*)args->addr)[3] << 8 |
            (uint16)((uint8*)args->addr)[2];
    if ($port == 53) { @[pid, comm] = count(); }
}
interval:s:1 {
    print(@);
    clear(@);
}
'

DNS tunneling exfiltrates data by encoding it in subdomains of queries. A process making 50+ DNS queries per second to varied subdomains of the same parent domain is a strong signal. CoreDNS aggregate metrics will show elevated query volume; the kernel tracepoint tells you which PID is responsible.

Queries immediately followed by a connection (normal vs anomalous pattern)

Legitimate services resolve a known set of domains. A process that resolves a new, never-before-seen domain and immediately opens a TCP connection to the returned IP is structurally different from normal service behavior. The combination of DNS tracepoint + TCP connect kprobe lets you correlate these events by PID and timestamp — without any application instrumentation.


⚠ Production Gotchas

DNS payload parsing is not trivial in bpftrace. Reading the domain name from the UDP payload requires byte-level parsing of the DNS wire format inside an eBPF program. bpftrace can read raw bytes with buf(), but the label-encoded domain name format requires a loop that the verifier may reject for complexity reasons. Tools like Tetragon and Pixie implement this parsing in C within their eBPF programs where they have more control over verifier limits. For raw detection (flag DNS queries from unexpected processes), the sendto tracepoint without payload parsing is enough.

sendto fires for all UDP, not just DNS. Filter on the destination port. The destination address structure is at args->addr — port is in network byte order at bytes 2–3 of the sockaddr_in structure. The filtering in the examples above is correct for port 53; double-check if you’re on a cluster that uses a non-standard DNS port.

CoreDNS pods will appear in your DNS query trace — that’s expected. CoreDNS makes upstream DNS queries to resolve non-cluster domains. Filter on namespace/cgroup if you want to exclude CoreDNS from your trace.

DNS over TCP is a separate code path. Most DNS queries are UDP. Large responses (>512 bytes) or DNSSEC responses may trigger TCP fallback. The sendto tracepoint catches UDP; for TCP DNS, you’d need tcp_sendmsg with port 53 filtering. In practice, within-cluster DNS resolution is almost entirely UDP.

glibc caching means not every getaddrinfo() generates a DNS query. glibc caches resolved hostnames in the process’s memory. A service that calls getaddrinfo("api.example.com") every 100ms may only generate a DNS query every 30 seconds (the TTL). If you’re looking for which pods are resolving a domain and see only occasional tracepoint hits, that’s expected — it’s the cache miss rate, not the access rate.


Quick Reference

What you want Command
All DNS queries on a node bpftrace -e 'tracepoint:syscalls:sys_enter_sendto { if (port == 53) ... }'
DNS query count per process bpftrace -e '... { @[comm] = count(); }'
DNS queries from a specific process bpftrace -e '... /comm == "my-svc"/ { ... }'
Map PID to pod cat /proc/<pid>/cgroup → extract pod UID → kubectl get pods
DNS events with domain names (Tetragon) tetra getevents --event-types PROCESS_KPROBE
DNS policy violations (Cilium) hubble observe --verdict DROPPED --protocol DNS
CoreDNS query logs kubectl logs -n kube-system -l k8s-app=kube-dns
DNS signal What it indicates
New domain, immediate TCP connect Possible C2 resolution
50+ queries/second from one PID DNS tunneling candidate
Query to non-cluster domain from batch job Unusual — investigate
NXDOMAIN responses at high rate Misconfiguration or DGA
Queries from PID not matching any known binary Injected process

Key Takeaways

  • DNS observability in Kubernetes with eBPF uses the sendto tracepoint — the hook fires when the process issues the syscall, before the packet leaves the node, giving you PID-level attribution with no sidecar
  • CoreDNS metrics show aggregate DNS health; kernel tracepoints show which pod and which process made each query — the attribution gap between the two is where anomaly detection lives
  • The DNS syscall path goes: getaddrinfo() → glibc → sendto() syscall → kernel UDP stack → CoreDNS. eBPF hooks fire at the sendto() boundary
  • A compromised workload’s first observable action is almost always a DNS query; tracepoint-based DNS observability catches it at the kernel level, ahead of any application log
  • glibc caches resolved names, so tracepoint hit rate reflects cache misses, not getaddrinfo() call rate — account for this when baselining
  • Full domain name extraction requires DNS wire-format parsing; Tetragon and Pixie do this in their eBPF programs; bpftrace one-liners detect the query event without the domain string

What’s Next

DNS observability tells you what a workload is resolving. EP12 answers what happens when you want to stop a workload from doing something — not detect it after the fact, but prevent it at the syscall boundary before it completes.

LSM hooks and Tetragon’s kill path enforce at the kernel level. When the kernel enforces, the process never gets the return value from the syscall. There is no “detect and respond” window — the action simply does not complete. That is a structurally different security posture from anything a sidecar or userspace agent can provide.

Next: LSM and Tetragon — when the kernel says no

Get EP12 in your inbox when it publishes → linuxcent.com/subscribe

EKS 1.33 Upgrade Blocker: Fixing Dead Nodes & NetworkManager on Rocky Linux

Reading Time: 5 minutes

The EKS 1.33+ NetworkManager Trap: A Complete systemd-networkd Migration Guide for Rocky & Alma Linux

TL;DR:

  • The Blocker: Upgrading to EKS 1.33+ is breaking worker nodes, especially on free community distributions like Rocky Linux and AlmaLinux. Boot times are spiking past 6 minutes, and nodes are failing to get IPs.
  • The Root Cause: AWS is deprecating NetworkManager in favor of systemd-networkd. However, ripping out NetworkManager can leave stale VPC IPs in /etc/resolv.conf. Combined with the systemd-resolved stub listener (127.0.0.53) and a few configuration missteps, it causes a total internal DNS collapse where CoreDNS pods crash and burn.
  • The Subtext: AWS is pushing this modern networking standard hard. Subtly, this acts as a major drawback for Rocky/Alma AMIs, silently steering frustrated engineers toward Amazon Linux 2023 (AL2023) as the “easy” way out.
  • The “Super Hack”: Automate the clean removal of NetworkManager, bypass the DNS stub listener by symlinking /etc/resolv.conf directly to the systemd uplink, and enforce strict state validation during the AMI build.

If you’ve been in the DevOps and SRE space long enough, you know that vendor upgrades rarely go exactly as planned. But lately, if you are running enterprise Linux distributions like Rocky Linux or AlmaLinux on AWS EKS, you might have noticed the ground silently shifting beneath your feet.

With the push to EKS 1.33+, AWS is mandating a shift toward modern, cloud-native networking standards. Specifically, they are phasing out the legacy NetworkManager in favor of systemd-networkd.

While this makes sense on paper, the transition for community distributions has been incredibly painful. AWS support couldn’t resolve our issues, and my SRE team had practically given up, officially halting our EKS upgrade process. It’s hard not to notice that this massive, undocumented friction in Rocky Linux and AlmaLinux conveniently positions AWS’s own Amazon Linux 2023 (AL2023) as the path of least resistance.

I’m hoping the incredible maintainers at free distributions like Rocky Linux and AlmaLinux take note of this architectural shift. But until the official AMIs catch up, we have to fix it ourselves. Here is the exact breakdown of the cascading failure that brought our clusters to their knees, and the “super hack” script we used to fix it.

The Investigation: A Cascading SRE Failure

When our EKS 1.33+ worker nodes started booting with 6+ minute latencies or outright failing to join the cluster, I pulled apart our Rocky Linux AMIs to monitor the network startup sequence. What I found was a classic cascading failure of services, stale data, and human error.

Step 1: The Race Condition

Initially, the problem was a violent tug-of-war. NetworkManager was not correctly disabled by default, and cloud-init was still trying to invoke it. This conflicted directly with systemd-networkd, paralyzing the network stack during boot. To fix this, we initially disabled the NetworkManager service and removed it from cloud-init.

Step 2: The Stale Data Landmine

Here is where the trap snapped shut. Because NetworkManager was historically the primary service responsible for dynamically generating and updating /etc/resolv.conf, completely disabling it stopped that file from being updated.

When we baked the new AMI via Packer, /etc/resolv.conf was orphaned and preserved the old configuration—specifically, a stale .2 VPC IP address from the temporary subnet where the AMI build ran.

Step 3: The Human Element

We’ve all been there: during a stressful outage, wires get crossed. While troubleshooting the dead nodes, one of our SREs mistakenly stopped the systemd-resolved service entirely, thinking it was conflicting with something else.

Step 4: Total DNS Collapse

When the new AMI booted up and joined the EKS node group, the environment was a disaster zone:

  1. NetworkManager was dead (intentional).
  2. systemd-resolved was stopped (accidental).
  3. /etc/resolv.conf contained a dead, stale IP address from a completely different subnet.

When kubelet started, it dutifully read the host’s broken /etc/resolv.conf and passed it up to CoreDNS. CoreDNS attempted to route traffic to the stale IP, failed, and started crash-looping. Internal DNS resolution (pod.namespace.svc.cluster.local) totally collapsed. The cluster was dead in the water.

Flowchart showing the cascading DNS failure in EKS worker nodes
The perfect storm: How stale data and disabled services led to a total CoreDNS collapse.

Linux Internals: How systemd Manages DNS (And Why CoreDNS Breaks)

To understand how to permanently fix this, we need to look at how systemd actually handles DNS under the hood. When using systemd-networkd, resolv.conf management is handled through a strict partnership with systemd-resolved.

Architecture diagram of systemd-networkd and systemd-resolved D-Bus communication
How systemd collects network data and the critical symlink choice that dictates EKS DNS health.

Here is how the flow works: systemd-networkd collects network and DNS information (from DHCP, Router Advertisements, or static configs) and pushes it to systemd-resolved via D-Bus. To manage your DNS resolution effectively, you must configure the /etc/resolv.conf symbolic link to match your desired mode of operation. You have three choices:

1. The “Recommended” Local DNS Stub (The EKS Killer)

By default, systemd recommends using systemd-resolved as a local DNS cache and manager, providing features like DNS-over-TLS and mDNS.

  • The Symlink: ln -sf /run/systemd/resolve/stub-resolv.conf /etc/resolv.conf
  • Contents: Points to 127.0.0.53 as the only nameserver.
  • The Problem: This is a disaster for Kubernetes. If Kubelet passes 127.0.0.53 to CoreDNS, CoreDNS queries its own loopback interface inside the pod network namespace, blackholing all cluster DNS.

2. Direct Uplink DNS (The “Super Hack” Solution)

This mode bypasses the local stub entirely. The system lists the actual upstream DNS servers (e.g., your AWS VPC nameservers) discovered by systemd-networkd directly in the file.

  • The Symlink: ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf
  • Contents: Lists all actual VPC DNS servers currently known to systemd-resolved.
  • The Benefit: CoreDNS gets the real AWS VPC nameservers, allowing it to route external queries correctly while managing internal cluster resolution perfectly.

3. Static Configuration (Manual)

If you want to manage DNS manually without systemd modifying the file, you break the symlink and create a regular file (rm /etc/resolv.conf). While systemd-networkd still receives DNS info from DHCP, it won’t touch this file. (Not ideal for dynamic cloud environments).


The Solution: A Surgical systemd Cutover

Knowing the internals, the path forward is clear. We needed to not only remove the legacy stack but explicitly rewire the DNS resolution to the Direct Uplink to prevent the stale data trap and bypass the notorious 127.0.0.53 stub listener.

Here is the exact state we achieved:

  1. Lock down cloud-init so it stops triggering legacy network services.
  2. Completely mask NetworkManager to ensure it never wakes up.
  3. Ensure systemd-resolved is enabled and running, but with the DNSStubListener explicitly disabled (DNSStubListener=no) so it doesn’t conflict with anything.
  4. Destroy the stale /etc/resolv.conf and create a symlink to the Direct Uplink (ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf).
  5. Reconfigure and restart systemd-networkd.

Pro-Tip for Debugging: To ensure systemd-networkd is successfully pushing DNS info to the resolver, verify your .network files in /etc/systemd/network/. Ensure UseDNS=yes (which is the default) is set in the [DHCPv4] section. You can always run resolvectl status to see exactly which DNS servers are currently assigned to each interface over D-Bus!

The Automation: Production AMI Prep Script

Manual hacks are great for debugging, but SRE is about repeatable automation. We’ve open-sourced the eks-production-ami-prep.sh script to handle this cutover automatically during your Packer or Image Builder pipeline. It standardizes the cutover, wipes out the stale data, and includes a strict validation suite.


The Results

By actively taking control of the systemd stack and ensuring /etc/resolv.conf was dynamically linked rather than statically abandoned, we completely unblocked our EKS 1.33+ upgrade.

More impressively, our system bootup time dropped from a crippling 6+ minutes down to under 2 minutes. We shouldn’t have to abandon fantastic, free enterprise distributions just because a cloud provider shifts their networking paradigm. If your team is struggling with AWS EKS upgrades on Rocky Linux or AlmaLinux, integrate this automation into your pipeline and get your clusters back in the fast lane.