The Pipeline Gate — Hardened Images as a CI/CD Build Constraint

Reading Time: 6 minutes

OS Hardening as Code, Episode 5
Cloud AMI Security Risks · Linux Hardening as Code · Multi-Cloud OS Hardening · Automated OpenSCAP Compliance · CI/CD Compliance Gate**


TL;DR

  • A CI/CD compliance gate turns an OS hardening grade from a report into a build constraint — unhardened images fail the pipeline before they can be deployed
  • POST /api/pipeline/scan returns pass/fail against a minimum grade threshold — integrates into any CI/CD system that can make an HTTP request
  • Failed gate output tells engineers exactly which controls failed and what to fix — not just “blocked”
  • The gate works on both build-time grades (new images) and runtime grades (existing instances)
  • GitHub Actions, GitLab CI, Jenkins, and Tekton integrations are one curl command
  • The structural guarantee: an image that doesn’t pass the gate doesn’t exist in the deployment pipeline

The Problem: A Grade No One Checks Is Decoration

Pipeline without compliance gate:
  Build → Test → Security scan (results to dashboard) → Deploy

What actually happens:
  Build → Test → Security scan → "C grade, but we need to ship" → Deploy anyway
                                           │
                                           └─ Dashboard shows C grade
                                              Nobody is paged
                                              Deployment succeeds

A CI/CD compliance gate means the pipeline can’t continue if the grade is below threshold.

EP04 showed that automated OpenSCAP compliance gives every image a verified, reproducible grade before deployment. What it assumed is that someone checks the grade before deploying. They don’t — not under deadline pressure, not when the image has been “working fine for months,” not at 2am.

The same problem that made hardening runbooks skippable applies to compliance grades: if checking the grade is a discretionary step, it will be skipped.


A new microservice was deployed from an unhardened base image. The team had built it quickly during a sprint, used a community AMI as the base, and planned to harden it “in the next sprint.”

Three weeks later, a penetration test found it. SSH password authentication enabled. Three unnecessary services running — one of them with a known CVE. The finding: the instance had full inbound access from the VPC and was reachable from a compromised adjacent instance.

The deployment had gone through the normal CI/CD pipeline. Unit tests passed. Integration tests passed. A vulnerability scan ran. The scan produced a report that went to a dashboard. Nobody had a gate set up to fail the build if the image was unhardened.

The hardening work from the “next sprint” plan would have taken four hours. The pentest remediation took a week, plus the time to investigate what had been exposed during the three weeks the instance was running.

The CI/CD pipeline had every check except the one that would have caught the base image problem before the first deployment.


The Pipeline API

The Pipeline API is a single HTTP endpoint that takes an image or instance ID, checks it against a minimum grade, and returns pass or fail:

# Fail the pipeline if the image grade is below B
curl -sf -X POST https://stratum.yourdomain.com/api/pipeline/scan \
  -H "Authorization: Bearer ${STRATUM_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "image_id": "ami-0a7f3c9e82d1b4c05",
    "min_grade": "B"
  }'

# Pass response (grade A):
# HTTP 200
# {
#   "result": "pass",
#   "image_id": "ami-0a7f3c9e82d1b4c05",
#   "grade": "A",
#   "score": 94,
#   "controls_passing": 94,
#   "controls_total": 100,
#   "scanned_at": "2026-04-19T15:54:10Z"
# }

# Fail response (grade C):
# HTTP 422
# {
#   "result": "fail",
#   "image_id": "ami-0c9d5e3f81a2b6e07",
#   "grade": "C",
#   "score": 72,
#   "min_grade_required": "B",
#   "failing_controls": [
#     { "id": "1.1.7", "title": "Separate partition for /var/log/audit", "severity": "medium" },
#     { "id": "3.3.2", "title": "TCP SYN cookies enabled", "severity": "low" },
#     ...
#   ]
# }

A non-200 response fails the pipeline. The || exit 1 in the shell integration handles this — if the API returns 422, the pipeline step exits non-zero and the job fails.


GitHub Actions Integration

# .github/workflows/deploy.yml

jobs:
  build-image:
    runs-on: ubuntu-latest
    outputs:
      ami_id: ${{ steps.build.outputs.ami_id }}
    steps:
      - name: Build hardened AMI
        id: build
        run: |
          AMI_ID=$(stratum build \
            --blueprint ubuntu22-cis-l1.yaml \
            --provider aws \
            --output json | jq -r '.image_id')
          echo "ami_id=${AMI_ID}" >> $GITHUB_OUTPUT

  compliance-gate:
    runs-on: ubuntu-latest
    needs: build-image
    steps:
      - name: Stratum compliance gate
        run: |
          curl -sf -X POST ${{ vars.STRATUM_URL }}/api/pipeline/scan \
            -H "Authorization: Bearer ${{ secrets.STRATUM_TOKEN }}" \
            -H "Content-Type: application/json" \
            -d "{\"image_id\": \"${{ needs.build-image.outputs.ami_id }}\", \"min_grade\": \"B\"}" \
            || { echo "Compliance gate failed — image does not meet minimum grade B"; exit 1; }

  deploy:
    runs-on: ubuntu-latest
    needs: [build-image, compliance-gate]
    steps:
      - name: Deploy to staging
        run: |
          aws autoscaling update-auto-scaling-group \
            --auto-scaling-group-name my-asg \
            --launch-template "ImageId=${{ needs.build-image.outputs.ami_id }}"

The deploy job only runs if compliance-gate passes. The AMI doesn’t reach the autoscaling group if it doesn’t meet the grade threshold.


GitLab CI Integration

# .gitlab-ci.yml

stages:
  - build
  - compliance
  - deploy

build-image:
  stage: build
  script:
    - |
      AMI_ID=$(stratum build \
        --blueprint ubuntu22-cis-l1.yaml \
        --provider aws \
        --output json | jq -r '.image_id')
      echo "AMI_ID=${AMI_ID}" >> build.env
  artifacts:
    reports:
      dotenv: build.env

compliance-gate:
  stage: compliance
  needs: [build-image]
  script:
    - |
      curl -sf -X POST ${STRATUM_URL}/api/pipeline/scan \
        -H "Authorization: Bearer ${STRATUM_TOKEN}" \
        -H "Content-Type: application/json" \
        -d "{\"image_id\": \"${AMI_ID}\", \"min_grade\": \"B\"}"

deploy:
  stage: deploy
  needs: [build-image, compliance-gate]
  script:
    - ./deploy.sh ${AMI_ID}

What the Failed Gate Tells You

The value of the CI/CD compliance gate is not just that it blocks bad images — it’s that the failure output tells engineers what to fix.

A gate failure in CI shows:

Compliance gate failed.

Image: ami-0c9d5e3f81a2b6e07
Grade: C (72/100)
Required: B (85/100)
Gap: 13 controls failing

Failing controls:
  HIGH   1.1.7   Separate partition for /var/log/audit
                 Fix: Provision /var/log/audit on a separate EBS volume
  MEDIUM 1.6.1.3 AppArmor enabled in bootloader
                 Fix: Update GRUB_CMDLINE_LINUX, run update-grub, reboot
  MEDIUM 3.3.2   TCP SYN cookies
                 Fix: echo "net.ipv4.tcp_syncookies=1" > /etc/sysctl.d/60-cis.conf
  LOW    5.2.21  SSH MaxStartups
                 Fix: Add "MaxStartups 10:30:60" to /etc/ssh/sshd_config
  ...

View full scan report: https://stratum.yourdomain.com/scans/ami-0c9d5e3f81a2b6e07

This is not a wall — it’s a list of exactly what to fix. The engineer running the pipeline sees the gap, fixes the blueprint or the Ansible role, rebuilds, and the gate passes. The gap is closed before any instance is deployed.


Runtime Gate: Checking Existing Instances

The Pipeline API also works against running instances, not just images:

# Gate on a running instance's current compliance state
curl -sf -X POST https://stratum.yourdomain.com/api/pipeline/scan \
  -H "Authorization: Bearer ${STRATUM_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "instance_id": "i-0abc123",
    "min_grade": "B",
    "scan_type": "runtime"
  }'

This is useful in deployment pipelines that don’t build custom AMIs — they launch instances and configure them after launch. The runtime gate runs after configuration is complete and before the instance is registered with the load balancer.

It also integrates into scheduled compliance jobs — scan your fleet on a schedule and alert when any instance drifts below grade threshold.


Grade Thresholds by Environment

Not all environments need the same threshold. A common pattern:

# Environment-specific minimum grades
environments:
  production: A      # 95%+ passing — no exceptions
  staging:    B      # 85%+ passing — minor gaps acceptable
  development: C     # 70%+ passing — experimental OK
# Production deploy gate
curl -sf -X POST .../api/pipeline/scan \
  -d '{"image_id": "ami-...", "min_grade": "A"}'

# Staging deploy gate
curl -sf -X POST .../api/pipeline/scan \
  -d '{"image_id": "ami-...", "min_grade": "B"}'

This lets development move fast with a lower bar while enforcing the highest standard at the production gate.


Production Gotchas

Gate latency on first scan: If the image hasn’t been scanned yet, the Pipeline API triggers a scan on demand. This takes 2–3 minutes. For build pipelines that want instant gate results, use stratum build --blueprint ... --scan-on-build to ensure the scan runs during the build step and the result is cached for the gate call.

Token rotation: The STRATUM_TOKEN used for API authentication should be rotated on the same schedule as other service credentials. Use environment-specific tokens so a compromised staging token doesn’t bypass a production gate.

Webhook notifications on gate failure: The Pipeline API can send a webhook to Slack, PagerDuty, or any endpoint when a gate fails. Configure this for production pipelines so failures are visible beyond the CI log.

# In the Stratum config
notifications:
  pipeline_failures:
    - type: slack
      webhook: ${SLACK_WEBHOOK}
      channel: "#platform-security"
    - type: webhook
      url: ${PAGERDUTY_WEBHOOK}
      min_grade: D     # only page on D/F, not B/C failures

Key Takeaways

  • A CI/CD compliance gate turns a compliance grade from a dashboard metric into a pipeline constraint — the image doesn’t deploy if it doesn’t pass
  • POST /api/pipeline/scan is a single HTTP call that any CI/CD system can make — no agent, no plugin, no SDK required
  • Failed gate output is actionable: every failing control includes the specific fix, not just the control ID
  • Runtime gates check instances after configuration, not just at image build time
  • Environment-specific thresholds let development move faster while enforcing the highest standard at production

What’s Next

The CI/CD compliance gate closes the final gap: even if an unhardened image gets built, it can’t deploy. EP05 is the bookmark episode — this is the point where OS hardening becomes structurally enforced rather than procedurally expected.

EP06 is the series closer. For five episodes, you’ve been using Stratum as a user. What does it look like to run it yourself — extend it with a custom control, add a provider, deploy the platform in your own infrastructure?

Stratum is open-core (Apache 2.0). EP06 is the architecture reveal, the GitHub release, and the extension guide for everything the series taught.

Next: Stratum — open-source OS hardening platform for multi-cloud infrastructure

Get EP06 in your inbox when it publishes → linuxcent.com/subscribe

The Platform Engineering Era: GitOps, AI Workloads, and Leaner Kubernetes (2023–2025)

Reading Time: 6 minutes


Introduction

By 2023, the question had shifted from “how do we run Kubernetes?” to “how do we let other engineers run their workloads on Kubernetes without becoming a bottleneck?”

This is the platform engineering problem. And it drove the tooling that defined 2023–2025: GitOps as the deployment standard, Cluster API for Kubernetes-on-Kubernetes provisioning, AI/ML workloads forcing new scheduling capabilities, and the Kubernetes project itself shedding more weight to become faster to release and operate.


GitOps: Principle Becomes Practice

GitOps as a term was coined by Weaveworks in 2017. By 2023, it was no longer a debate — it was the default deployment model for organizations running Kubernetes at scale.

The principle: the desired state of your cluster lives in Git. A controller watches the repository and reconciles the cluster state to match. Every deployment is a PR merge. The audit trail is the Git history.

Flux v2 (CNCF graduated) and ArgoCD (CNCF incubating) became the two dominant implementations:

# Flux: GitRepository + Kustomization
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: production-config
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/org/k8s-config
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: production-apps
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/production
  prune: true          # Remove resources deleted from Git
  sourceRef:
    kind: GitRepository
    name: production-config
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: api
    namespace: production

The prune: true behavior is critical: resources deleted from Git are deleted from the cluster. This is what makes GitOps a security control — unknown resources that aren’t in Git get removed. No more accumulation of forgotten test deployments, rogue debug pods, or unauthorized configuration changes that outlive the engineer who made them.

ArgoCD’s Application model added a UI, synchronization policies, and multi-cluster management:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-api
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/org/apps
    targetRevision: HEAD
    path: api/production
  destination:
    server: https://kubernetes.default.svc
    namespace: api
  syncPolicy:
    automated:
      prune: true
      selfHeal: true    # Revert manual kubectl changes
    syncOptions:
    - CreateNamespace=true

The selfHeal: true option is where GitOps becomes enforceable: any manual change made with kubectl is automatically reverted within the sync interval. For compliance-sensitive environments, this is a configuration drift prevention control.


Cluster API: Kubernetes Managing Kubernetes

Cluster API (cluster-sigs/cluster-api) flipped the usual model: instead of using tools like Terraform or Ansible to provision Kubernetes clusters, Cluster API lets you manage Kubernetes clusters as Kubernetes resources — using a management cluster to provision and manage workload clusters.

# Create a new Kubernetes cluster as a Kubernetes resource
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: workload-cluster-prod
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: workload-cluster-prod
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: workload-cluster-prod-control-plane

Cluster API reconciliation handles cluster provisioning, scaling, upgrades, and deletion — all through the Kubernetes API, with all the tooling (RBAC, audit logging, GitOps integration) that entails. Multi-cluster platform teams could now manage hundreds of workload clusters from a single management cluster.


Kubernetes 1.28 — Sidecar Containers Alpha (August 2023)

Sidecar containers had been a Kubernetes pattern since 2015 — a helper container in the same pod as the main application. But there was no native sidecar lifecycle management. Sidecars were just regular init containers or additional containers, which meant:
– Init container sidecars ran before the application and had to block until they succeeded
– Regular container sidecars had no ordering guarantees at startup
– At pod termination, sidecars could die before the application finished draining

1.28 introduced native sidecar support: a new restartPolicy field for init containers:

spec:
  initContainers:
  - name: log-collector
    image: fluentbit:latest
    restartPolicy: Always    # This makes it a sidecar
    # Starts before main containers, stays running, stops after main containers exit
  containers:
  - name: application
    image: myapp:latest

A sidecar container (init container with restartPolicy: Always):
– Starts before application containers
– Stays running throughout the pod lifecycle
– Terminates automatically after all main containers exit
– Restarts if it crashes (unlike regular init containers)

This solved the service mesh sidecar problem: Istio and Linkerd injected Envoy proxies as regular containers, leading to race conditions where the proxy hadn’t started when the application tried to make outbound connections. Native sidecar lifecycle guarantees the proxy is ready before the application starts.

Also in 1.28:
Retroactive default StorageClass assignment: Existing PVCs without a StorageClass assignment get the default applied retroactively — useful for migrations
Non-graceful node shutdown stable: Handle node power failures without manual pod cleanup
Recovery from volume expansion failure: Previously, a failed volume expansion left the PVC in a broken state; 1.28 introduced a mechanism to recover


AI/ML Workloads Force New Kubernetes Capabilities

The LLM wave of 2023 drove GPU workloads onto Kubernetes at a scale and urgency the project hadn’t anticipated. Running LLM inference on Kubernetes required solving problems that CPU-centric cluster scheduling hadn’t encountered:

GPU topology awareness: Inference across multiple GPUs requires GPUs connected by NVLink or on the same PCIe switch, not arbitrary GPUs from different nodes or different PCIe buses. The Dynamic Resource Allocation API (1.26 alpha) was designed exactly for this.

Fractional GPU allocation: NVIDIA’s time-slicing and MIG (Multi-Instance GPU) allow multiple pods to share a single GPU. The GPU operator (NVIDIA) manages this at the node level:

# Check GPU resources visible to Kubernetes
kubectl get nodes -o custom-columns=\
  "NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
# NODE       GPU
# gpu-node-1   8
# gpu-node-2   8

Batch scheduling for training jobs: Training runs require all workers to start simultaneously — a single missing GPU makes the entire job stall. The Kubernetes Job API doesn’t guarantee this. Projects like Volcano (CNCF incubating) and Kueue (Kubernetes SIG Scheduling) added gang scheduling: a job only starts when all requested resources are available.

# Kueue: queue AI training jobs with resource quotas
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-queue
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu", "cpu", "memory"]
    flavors:
    - name: a100-80gb
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 16

Kubernetes 1.29 — Sidecar to Beta, Load Balancer IP Mode (December 2023)

  • Sidecar containers beta: The lifecycle semantics were refined based on 1.28 alpha feedback
  • Load balancer IP mode alpha: Distinguish between load balancers that use virtual IPs (kube-proxy handles the traffic) vs. those that handle traffic directly (no need for kube-proxy rules) — important for eBPF-based load balancers
  • ReadWriteOncePod volume access stable

Kubernetes 1.30 — Structured Authorization Config (April 2024)

  • Structured authorization configuration beta: Define multiple authorization webhooks with explicit ordering, failure modes, and connection settings — replacing the flat --authorization-mode flag
  • Sidecar containers beta continues
  • Node memory swap support beta: Allow pods to use swap memory — controversial but necessary for workloads with bursty memory patterns that prefer using swap over OOM kill
# Node with swap enabled — kubelet config
kind: KubeletConfiguration
memorySwap:
  swapBehavior: LimitedSwap

The swap support feature reversed a long-standing Kubernetes hard stance: swap was disabled since 1.0 because its interaction with Kubernetes memory accounting was unpredictable. The 1.30 approach adds proper accounting and policies.


Kubernetes 1.31 — Cloud Provider Code Removal Complete (August 2024)

1.31 marked the completion of the cloud provider code removal — the 1.5 million line migration that had been running since 1.26. Core binaries are 40% smaller. The API server, controller manager, and scheduler no longer contain vendor-specific code.

Also in 1.31:
Persistent Volume health monitor stable
AppArmor support stable: AppArmor profiles for pods using the native Kubernetes field (not annotations)
Traffic distribution for Services beta: Express topology preferences for Service routing (prefer local node, prefer same zone)

# Traffic distribution: prefer endpoints in the same zone
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  trafficDistribution: PreferClose
  selector:
    app: api
  ports:
  - port: 80
    targetPort: 8080

Kubernetes 1.32 — Sidecar Stable, DRA Beta (December 2024)

  • Sidecar containers stable: After nearly a decade of workarounds, the sidecar pattern is a first-class Kubernetes primitive
  • Dynamic Resource Allocation beta: GPU and specialized hardware scheduling ready for production evaluation
  • Job API improvements: Success and failure policies for indexed jobs — granular control over batch workload behavior
  • Custom Resource field selectors: Filter CRDs on arbitrary fields — making large CRD-based systems more efficient to query

Crossplane: Kubernetes as the Control Plane for Everything

Crossplane (CNCF graduated) extended the Kubernetes API model beyond the cluster itself. Using CRDs and controllers, Crossplane lets you manage cloud resources (RDS databases, S3 buckets, VPCs, IAM roles) as Kubernetes resources — provisioned, updated, and deleted through the Kubernetes API.

# Crossplane: provision an RDS PostgreSQL instance as a Kubernetes resource
apiVersion: database.aws.crossplane.io/v1beta1
kind: RDSInstance
metadata:
  name: production-db
spec:
  forProvider:
    region: us-east-1
    dbInstanceClass: db.r6g.xlarge
    masterUsername: admin
    engine: postgres
    engineVersion: "15"
    allocatedStorage: 100
    multiAZ: true
  writeConnectionSecretsToRef:
    name: production-db-credentials
    namespace: production

For platform teams, Crossplane means a single control plane — the Kubernetes API — for both compute workloads and cloud infrastructure. GitOps tools (Flux, ArgoCD) manage both.


Key Takeaways

  • GitOps (Flux, ArgoCD) became the production deployment standard — not for ideological reasons, but because the audit trail, drift detection, and self-healing properties solve real operational and compliance problems
  • Cluster API made Kubernetes cluster lifecycle (provisioning, upgrades, deletion) a Kubernetes-native operation — the same API, tooling, and audit trail
  • Native sidecar containers (1.28 alpha → 1.32 stable) finally resolved the lifecycle ordering problem that service meshes and log collectors had worked around for years
  • AI/ML workloads drove new scheduling capabilities (DRA, gang scheduling via Kueue/Volcano) and made GPU topology awareness a first-class concern
  • Crossplane generalized the Kubernetes API model to cloud infrastructure — the cluster is now a control plane for everything, not just containers

What’s Next

← EP06: The Runtime Reckoning | EP08: Kubernetes Today →

Series: Kubernetes: From Borg to Platform Engineering | linuxcent.com