Zero Trust Identity: SPIFFE, SPIRE, mTLS, and Continuous Verification

Reading Time: 7 minutes

The Identity Stack, Episode 13
← EP12: Entra ID + Linux · EP13

Focus Keyphrase: zero trust identity
Search Intent: Informational
Meta Description: Zero Trust identity explained: continuous verification, SPIFFE workload identity, SPIRE, mTLS service mesh, and how the stack evolved from /etc/passwd to 2026. (162 chars)


TL;DR

  • Zero Trust means “never trust, always verify” — identity is verified continuously, not just at login time; network location provides no implicit trust
  • Human identity (users) and workload identity (services, pods, jobs) are separate problems — LDAP/Kerberos/OIDC solve the human side; SPIFFE/SPIRE solve the workload side
  • SPIFFE (Secure Production Identity Framework For Everyone) defines a standard for workload identity — a SPIFFE ID is a URI like spiffe://corp.com/ns/prod/sa/payments-svc
  • SPIRE (SPIFFE Runtime Environment) issues short-lived X.509 SVIDs (SPIFFE Verifiable Identity Documents) to workloads — certificates that rotate automatically, every hour
  • mTLS (mutual TLS) is how workloads prove identity to each other — both sides present certificates; no passwords, no API keys
  • The evolution: /etc/passwd (1970) → NIS → LDAP → Kerberos → SAML → OIDC → SPIFFE/SPIRE — the problem has always been the same; the trust boundary keeps moving outward

The Big Picture: From /etc/passwd to Zero Trust

1970s  /etc/passwd              ← trust: the local machine
       One machine, one user list

1984   NIS / Yellow Pages       ← trust: the local network
       Centralized, but cleartext, flat

1988   Kerberos                 ← trust: the KDC
       Tickets instead of passwords, network-wide

1993   LDAP                     ← trust: the directory server
       Hierarchical, scalable, encrypted (eventually)

2002   SAML                     ← trust: the IdP assertion
       Identity crosses the internet

2014   OIDC / OAuth2            ← trust: the JWT signature
       API-native, mobile-native, developer-native

2017   SPIFFE / SPIRE           ← trust: the workload certificate
       Automated identity for services, not humans

2026   Zero Trust               ← trust: nothing, verify everything
       Continuous verification, short-lived credentials,
       device posture, behavioral signals

EP01 of this series started with the chaos of per-machine /etc/passwd. This episode — EP13 — closes the loop: from that chaos to a model where identity is verified continuously, credentials expire in hours not years, and the network provides no implicit trust.


The Assumption That Zero Trust Rejects

Traditional security assumed: if you’re on the internal network, you’re trusted. A VPN user was treated as equivalent to someone at a desk in the office. A service running on the same Kubernetes node as another service was implicitly trusted.

That assumption broke in practice:

  • Compromised VPN credentials gave attackers full internal access
  • Lateral movement after initial compromise was easy — once inside, everything trusted you
  • Perimeter-based security had no visibility into east-west traffic (service-to-service)

Zero Trust inverts the model: the network provides no trust. Every access request is verified — user or service, internal or external, first request or hundredth. Trust is dynamic, contextual, and short-lived.


Human Zero Trust: Continuous Verification

For human users, Zero Trust extends OIDC and Conditional Access:

Short-lived tokens. Access tokens typically expire in about an hour (the common default in Entra ID and most OIDC deployments). Refresh tokens are revocable. A user who is terminated can have their refresh tokens revoked in Entra ID — the next time their app tries to use the refresh token, it fails. The maximum blast radius of a stolen access token is bounded by its lifetime.
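
As a concrete illustration of the revocation path, here is a minimal sketch using the Microsoft Graph revokeSignInSessions action through the Azure CLI (the UPN shown is a placeholder):

# Invalidate all refresh tokens and sessions for a terminated user
az rest --method POST \
  --uri "https://graph.microsoft.com/v1.0/users/terminated.user@corp.com/revokeSignInSessions"

# Already-issued access tokens stay valid until they expire (typically under an hour);
# the next refresh attempt fails and forces a full re-authentication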

Device posture. The device the user authenticates from is part of the identity assertion. Conditional Access can require: device is managed (Intune-enrolled), device is compliant (no malware, full-disk encryption enabled, OS patched). A valid user credential from an unmanaged device is denied.

Behavioral signals. Entra ID Identity Protection and similar systems analyze login patterns — unusual location, impossible travel (login from Mumbai, then New York 5 minutes later), unfamiliar device. High-risk sign-ins trigger step-up authentication or are blocked automatically.

Privileged Access Management (PAM). For privileged operations (production shell access, AD admin), Zero Trust adds time-bounded just-in-time access:

Request:  "I need admin access to db01.corp.com for 2 hours to investigate an incident"
Approval: Manager approves via Slack/email/ticketing system
Grant:    Temporary role assignment or password checkout from the PAM vault
Access:   User SSHes with a one-time or time-limited credential
Expire:   Credential automatically revoked after 2 hours
Audit:    Full session recording available for review

CyberArk, BeyondTrust, and HashiCorp Vault implement this model. Vault’s SSH Secrets Engine issues short-lived SSH certificates:

# Request a signed SSH certificate (valid 30 minutes)
vault ssh \
  -role=prod-admin \
  -mode=ca \
  -mount-point=ssh-client-signer \
  vamshi@db01.corp.com

# Vault issues a certificate signed by the server's trusted CA
# sshd on db01 trusts that CA — no authorized_keys needed
# Certificate expires in 30 minutes — no cleanup required
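
On the host side, the only one-time setup is telling sshd to trust the signing CA. A minimal sketch, assuming the same ssh-client-signer mount as above:

# On db01: fetch the CA public key from Vault and point sshd at it
vault read -field=public_key ssh-client-signer/config/ca \
  > /etc/ssh/trusted-user-ca-keys.pem

# /etc/ssh/sshd_config:
#   TrustedUserCAKeys /etc/ssh/trusted-user-ca-keys.pem
systemctl reload sshd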

Workload Identity: The Non-Human Problem

Services don’t have passwords they can type. A microservice calling another microservice needs to prove its identity — but you can’t give a Kubernetes pod a static API key (it’ll be in a config file, in a git repo, or in a crash dump within 6 months).

Workload identity solves this with short-lived, automatically rotated certificates — the service’s identity is its certificate, issued by a trusted CA, expiring in minutes to hours.

Traditional:                     Zero Trust:
  payments-svc → orders-svc        payments-svc → orders-svc
  Authentication: API key           Authentication: mTLS (X.509 cert)
  "Bearer sk_live_abc123"           cert: spiffe://corp.com/ns/prod/sa/payments-svc
  Rotation: manual (rarely done)    Rotation: automatic, every hour
  Revocation: change the key        Revocation: cert expires; new cert issued
  Audit: "API key was used"         Audit: "spiffe://payments-svc → spiffe://orders-svc"

SPIFFE: The Standard

SPIFFE (Secure Production Identity Framework For Everyone) defines what a workload identity looks like. The core concept is the SPIFFE ID — a URI in the format:

spiffe://<trust-domain>/<workload-path>

Examples:
  spiffe://corp.com/ns/prod/sa/payments-svc
  spiffe://corp.com/region/us-east/service/auth-api
  spiffe://corp.com/k8s/cluster-prod/namespace/payments/pod/payments-svc-abc123

The trust domain (corp.com) is the organizational boundary. The path is the workload identifier — typically encoding namespace, service account, or cluster information.

A SPIFFE ID is embedded in an SVID (SPIFFE Verifiable Identity Document) — either an X.509 certificate (X.509-SVID) or a JWT (JWT-SVID). The X.509-SVID is the standard form: the SPIFFE ID appears in the certificate’s Subject Alternative Name (SAN) field.

X.509 Certificate (SVID):
  Subject: CN=payments-svc
  SAN: URI=spiffe://corp.com/ns/prod/sa/payments-svc
  Validity: 1 hour
  Issuer: SPIRE Intermediate CA
  Signed by: corp.com trust bundle

Any service that has the corp.com trust bundle (the CA certificate chain) can verify that a certificate with spiffe://corp.com/... in the SAN was issued by the authorized CA for that trust domain.
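
You can see this structure on any SVID with standard tooling. A quick sketch, assuming the certificate has been written to svid.pem (for example via the spire-agent fetch command shown later):

# Inspect the SPIFFE ID embedded in the certificate's SAN field
openssl x509 -in svid.pem -noout -text | grep -A1 "Subject Alternative Name"
#   X509v3 Subject Alternative Name:
#       URI:spiffe://corp.com/ns/prod/sa/payments-svc

# And its remaining lifetime
openssl x509 -in svid.pem -noout -enddate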


SPIRE: The Runtime

SPIRE (SPIFFE Runtime Environment) is the reference implementation that issues SVIDs to workloads.

SPIRE Server
  ├── Node attestation: verifies the identity of the node/VM
  │   (AWS instance identity document, GCP service account, k8s node SA)
  └── Workload attestation: verifies the identity of the process
      (Kubernetes SA, Unix UID/GID, Docker container labels)
         │
         │ issues X.509 SVIDs (short-lived, auto-rotated)
         ▼
SPIRE Agent (runs on every node)
         │
         │ SPIFFE Workload API (Unix socket)
         ▼
Workload (your service)
  → gets its own certificate
  → gets the trust bundle (CA certs of trusted domains)
  → uses cert for mTLS with other services
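
How does the SPIRE Server decide which SPIFFE ID a given workload gets? Through registration entries that map attestation selectors to IDs. A minimal sketch for the payments service, assuming the Kubernetes workload attestor (the parent ID shown is illustrative):

# Any pod in namespace "prod" running as service account "payments-svc"
# is issued this SPIFFE ID
spire-server entry create \
  -spiffeID spiffe://corp.com/ns/prod/sa/payments-svc \
  -parentID spiffe://corp.com/spire/agent/k8s_psat/cluster-prod/node-1 \
  -selector k8s:ns:prod \
  -selector k8s:sa:payments-svc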

The workload fetches its identity via the Workload API socket — no environment variables, no file mounts. The SPIRE Agent pushes new certificates before the old ones expire. Rotation is transparent to the workload.

# On a node with SPIRE Agent running:
# Fetch the SVID for the current workload
spire-agent api fetch x509 \
  -socketPath /run/spire/sockets/agent.sock

# Output shows:
# SPIFFE ID: spiffe://corp.com/ns/prod/sa/payments-svc
# Certificate: (PEM)
# Trust bundle: (PEM of issuing CA chain)
# Expires: 2026-04-27T02:00:00Z (1 hour from now)

mTLS: Both Sides Show ID

Mutual TLS (mTLS) is what makes SPIFFE useful operationally. In standard TLS, only the server presents a certificate — the client just verifies it. In mTLS, both sides present certificates. Both sides verify the other’s certificate against the trust bundle.

payments-svc → orders-svc connection:

TLS handshake:
  payments-svc presents: spiffe://corp.com/ns/prod/sa/payments-svc cert
  orders-svc presents:   spiffe://corp.com/ns/prod/sa/orders-svc cert

  Both verify:
    • cert signed by trusted CA (the corp.com SPIRE CA)
    • cert not expired
    • SPIFFE ID in SAN matches what's expected

  After handshake: encrypted channel, both sides verified
  Authorization: orders-svc checks its policy:
    "is spiffe://corp.com/ns/prod/sa/payments-svc allowed to call /api/orders?"

Service meshes (Istio, Linkerd, Consul Connect) implement mTLS transparently — the application doesn’t handle certificates; the sidecar proxy does. In Istio’s case, Citadel (now istiod) acts as the SPIFFE-compatible CA, issuing certificates to Envoy sidecars. The application code doesn’t change.
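
With Istio, for example, requiring mTLS for all service-to-service traffic is a single resource. A sketch, assuming Istio is installed with istio-system as the root namespace:

# Enforce mTLS mesh-wide; plaintext connections between sidecars are rejected
kubectl apply -f - <<'EOF'
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
EOF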


Open Policy Agent: Authorization After Identity

Zero Trust separates identity from authorization. Once you know who the caller is (SPIFFE ID, OIDC token, user cert), a policy engine decides what they can do.

OPA (Open Policy Agent) is the standard for this:

# opa-policy.rego
package authz

# payments-svc can GET /api/orders; every other caller, method, or path is denied by default
allow {
  input.caller == "spiffe://corp.com/ns/prod/sa/payments-svc"
  input.method == "GET"
  startswith(input.path, "/api/orders")
}

default allow = false

The service checks OPA on each request: “caller=X wants to do Y to Z — allowed?” OPA evaluates the policy and returns a decision. The policy is version-controlled, tested, and deployed independently of the service.
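
As a sketch of that exchange, assuming OPA runs locally on its default port 8181 with the policy above loaded under the authz package (the path value is illustrative):

# Ask OPA for a decision on a single request
curl -s -X POST http://localhost:8181/v1/data/authz/allow \
  -H 'Content-Type: application/json' \
  -d '{"input": {"caller": "spiffe://corp.com/ns/prod/sa/payments-svc", "method": "GET", "path": "/api/orders/12345"}}'
# → {"result": true}
# Any other caller, or a non-GET method, returns {"result": false}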


⚠ Common Misconceptions

“Zero Trust means no trust.” Zero Trust means trust is earned dynamically through verification, not granted by network location. A verified user with a valid, compliant device and MFA is trusted — for the scope and duration of the verified session. The “zero” refers to implicit trust, not trust itself.

“SPIFFE replaces OIDC.” SPIFFE is for workload (service) identity. OIDC is for human (user) identity. They complement each other — a service has a SPIFFE identity; a user has an OIDC identity; the authorization layer accepts both.

“mTLS is complex to implement.” With a service mesh (Istio, Linkerd), mTLS is transparent — the sidecar handles it. Without a service mesh, the application needs to use the SPIFFE Workload API. The complexity is real but manageable, especially compared to the alternative of static API keys.


Framework Alignment

  • CISSP Domain 5 (Identity and Access Management): Zero Trust extends IAM to workloads (SPIFFE) and continuous verification (short-lived tokens, device posture) — it’s the current frontier of identity architecture
  • CISSP Domain 3 (Security Architecture and Engineering): The separation of identity (SPIFFE ID), authentication (mTLS), and authorization (OPA) is a clean architectural decomposition that scales to complex multi-service environments
  • CISSP Domain 4 (Communications and Network Security): mTLS encrypts and authenticates every service-to-service connection — it eliminates the assumption that east-west traffic on the internal network is safe
  • CISSP Domain 1 (Security and Risk Management): Zero Trust is a risk management posture — it accepts that perimeter breach is inevitable and limits blast radius through continuous verification and least-privilege

Key Takeaways

  • Zero Trust rejects network-based implicit trust — every request is verified regardless of source
  • Human identity: short-lived OIDC tokens, device posture checks, Conditional Access, JIT privileged access (Vault, CyberArk)
  • Workload identity: SPIFFE IDs in X.509 certificates, issued by SPIRE, rotated automatically every hour — no static API keys
  • mTLS lets services verify each other’s identity at the TLS layer — service meshes (Istio, Linkerd) implement it transparently
  • OPA handles authorization after identity is established — who you are ≠ what you can do
  • The series arc: /etc/passwd → NIS → LDAP → Kerberos → SAML → OIDC → SPIFFE/SPIRE — the problem has always been “how do you know who someone is, at scale, without trusting the network?” The answer keeps getting better.

What does identity look like at your organization — still static API keys and shared service accounts, or moving toward SPIFFE and short-lived credentials? 👇


The Identity Stack: From LDAP to Zero Trust — 13 episodes complete.

Start from EP01: What Is LDAP →

Stratum — OS Hardening as a Platform

Reading Time: 5 minutes

OS Hardening as Code, Episode 6
Cloud AMI Security Risks · Linux Hardening as Code · Multi-Cloud OS Hardening · Automated OpenSCAP Compliance · CI/CD Compliance Gate · Stratum Platform

Focus Keyphrase: OS hardening platform open source
Search Intent: Informational
Meta Description: Stratum is open-source (Apache 2.0) — declare your OS baseline in YAML, build hardened images across six clouds, and gate deployments on compliance grade. (154 chars)


TL;DR

  • Stratum is open-source under Apache 2.0 — the engine, blueprint format, scanner, and Pipeline API are all available on GitHub
  • The platform follows the same open-core model as Terraform/OpenTofu and Cilium/Isovalent: OSS core, self-hostable, extendable
  • Three extension points: custom compliance controls, provider plugins (add new cloud providers), pipeline integrations
  • Architecture: Blueprint YAML → Engine → Provider Layer → Ansible-Lockdown → OpenSCAP → Golden Image → Pipeline API
  • The series taught the user-facing interface for five episodes; EP06 covers what’s underneath and how to build on it
  • Installation is a single helm install or docker compose up — the platform runs in your environment

The Series Arc, Inverted

EP01 showed that default cloud AMIs arrive pre-broken. By the time you reach EP06, that problem has a complete solution:

EP01 — The problem:
  Default AMI → Production → Security audit finds gaps
  (unknown OS baseline, unverified hardening, no evidence)

EP06 — The solution:
  HardeningBlueprint YAML
           ↓
    stratum build          ← EP02 (blueprint as code)
    --provider aws,gcp     ← EP03 (multi-cloud)
           ↓
    OpenSCAP scan          ← EP04 (compliance grading)
    Grade: A (94/100)
           ↓
    POST /api/pipeline/scan ← EP05 (CI/CD gate)
    Result: pass
           ↓
    Production deployment
    (Grade A, SARIF attached, blueprint version-controlled)

For five episodes, you’ve used Stratum as a user. This episode covers what it looks like to run it yourself, extend it, and build on it.


I’ve spent years watching infrastructure teams solve the same OS hardening problem in slightly different ways. Custom scripts that drift. OpenSCAP runs that produce evidence no one reads. Compliance checklists completed by humans who have competing priorities.

The tools exist. ansible-lockdown applies CIS controls reliably. OpenSCAP verifies them accurately. The CI/CD systems can enforce anything you can express as a pass/fail. The gap isn’t the tooling — it’s the integration layer that ties them together into a reproducible, auditable pipeline.

Stratum is that integration layer, open-sourced.

The philosophy is the same as Terraform applied to OS security posture: declare the desired state in a version-controlled file, apply it reproducibly, and verify it automatically. The skip-at-2am problem disappears not because engineers are more careful, but because there’s no step to skip.


The Architecture

┌─────────────────────────────────────────────────────────┐
│                 HardeningBlueprint YAML                  │
│         (version-controlled, provider-agnostic)          │
└─────────────────────┬───────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────┐
│                   Stratum Engine                         │
│                  (Apache 2.0, OSS)                       │
│  ┌─────────────┐  ┌──────────────┐  ┌────────────────┐  │
│  │  Blueprint  │  │   Provider   │  │    Scheduler   │  │
│  │   Parser    │  │    Layer     │  │  (parallel     │  │
│  │             │  │  AWS  GCP    │  │   multi-cloud  │  │
│  │  Validates  │  │  Azure DO    │  │   builds)      │  │
│  │  schema +   │  │  Linode      │  │                │  │
│  │  overrides  │  │  Proxmox     │  │                │  │
│  └─────────────┘  └──────────────┘  └────────────────┘  │
└─────────────────────┬───────────────────────────────────┘
                      │
           ┌──────────┴──────────┐
           ▼                     ▼
  ┌─────────────────┐   ┌─────────────────┐
  │ Ansible-Lockdown │   │  OpenSCAP       │
  │  Runner          │   │  Scanner        │
  │                  │   │                 │
  │  UBUNTU22-CIS    │   │  A-F grade      │
  │  RHEL8-STIG      │   │  SARIF export   │
  │  Custom roles    │   │  Drift detect   │
  └────────┬─────────┘   └────────┬────────┘
           │                      │
           └──────────┬───────────┘
                      │
                      ▼
         ┌─────────────────────────┐
         │   Golden Image          │
         │   (AMI / GCP / Azure)   │
         │   + compliance metadata │
         └────────────┬────────────┘
                      │
                      ▼
         ┌─────────────────────────┐
         │   Pipeline API          │
         │   (Apache 2.0, OSS)     │
         │                         │
         │  POST /api/pipeline/scan │
         │  ← CI/CD gate           │
         └─────────────────────────┘

Every component is open-source under Apache 2.0. The engine, provider layer, Ansible runner, OpenSCAP scanner, and Pipeline API are all in the repository. Nothing is locked to a hosted service.


Installation

Stratum runs as a set of containers. Kubernetes or Docker Compose both work.

Kubernetes (Helm):

# Clone the repository
git clone https://github.com/rrskris/Stratum
cd Stratum

# Install Stratum in your cluster using the bundled Helm chart
helm install stratum ./deploy/helm/stratum \
  --namespace stratum-system \
  --create-namespace \
  --set config.providers.aws.enabled=true \
  --set config.providers.gcp.enabled=true \
  --set config.storageClass=standard

# Verify
kubectl get pods -n stratum-system
# NAME                          READY   STATUS    RESTARTS   AGE
# stratum-engine-0              1/1     Running   0          2m
# stratum-scanner-7d9b4-abc12   1/1     Running   0          2m
# stratum-api-6c8f5-def34       1/1     Running   0          2m

Docker Compose (single-node):

# Clone the repository
git clone https://github.com/rrskris/Stratum
cd Stratum

# Configure providers
cp config/providers.example.yaml config/providers.yaml
vim config/providers.yaml  # add AWS/GCP/Azure credentials

# Start
docker compose up -d

# Stratum is available at http://localhost:8080
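
From there you can exercise the Pipeline API directly. The endpoint path is the one used throughout the series; the request fields below are illustrative only, so check the repository for the exact schema:

# Hypothetical gate check against a freshly built image
curl -s -X POST http://localhost:8080/api/pipeline/scan \
  -H 'Content-Type: application/json' \
  -d '{"image_id": "ami-0abc123def456", "min_grade": "A"}'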

The Three Extension Points

1. Custom Compliance Controls

Add controls that aren’t in the CIS benchmark — internal policies, org-specific security requirements, or controls from other frameworks:

# controls/custom-audit-policy.yaml
id: CUSTOM-001
title: Audit logging retention must be 90 days
description: All instances must retain audit logs for 90 days minimum
severity: high
benchmark: custom
check:
  type: command
  command: "grep -E '^max_log_file_action' /etc/audit/auditd.conf"
  expected: "max_log_file_action = keep_logs"
remediation:
  type: ansible
  task: |
    - name: Configure audit log retention
      lineinfile:
        path: /etc/audit/auditd.conf
        regexp: '^max_log_file_action'
        line: 'max_log_file_action = keep_logs'

Deploy the custom control:

stratum controls deploy --file controls/custom-audit-policy.yaml

Reference it in any blueprint:

compliance:
  benchmark: cis-l1
  controls: all
  additional_controls:
    - CUSTOM-001

Custom controls appear in the grade calculation and SARIF output alongside CIS controls.

2. Provider Plugins

Add support for a new cloud provider by implementing the provider interface:

# providers/custom_provider.py
from stratum.providers import BaseProvider

class CustomProvider(BaseProvider):
    name = "my-cloud"

    def provision_build_instance(self, blueprint, config):
        # Launch a build instance on your cloud
        # Return: instance_id, connection_details
        ...

    def create_image(self, instance_id, blueprint, grade):
        # Snapshot the instance into an image
        # Tag with compliance metadata
        # Return: image_id
        ...

    def terminate_instance(self, instance_id):
        # Clean up the build instance
        ...

Register the plugin:

stratum providers register --file providers/custom_provider.py --name my-cloud

The provider is now available as --provider my-cloud in all stratum build commands.

3. Pipeline Integrations

Beyond the curl-based API, Stratum provides a webhook system that fires on build completion, scan results, and gate failures:

# Webhook configuration
notifications:
  - event: pipeline_gate_failure
    webhook: https://hooks.slack.com/...
    template: |
      Image {{ image_id }} failed compliance gate.
      Grade: {{ grade }} (required: {{ min_grade }})
      Top failing controls:
      {% for control in failing_controls[:3] %}
      - {{ control.id }}: {{ control.title }}
      {% endfor %}

  - event: build_complete
    webhook: https://jira.yourdomain.com/api/...
    template: |
      New image built: {{ image_id }}
      Blueprint: {{ blueprint_name }}@{{ blueprint_version }}
      Grade: {{ grade }}

The Open-Core Model

Stratum follows the same model as the tools that have become infrastructure standards:

  • Terraform / OpenTofu: core OSS, enterprise features in paid tier
  • Cilium / Isovalent: core OSS, enterprise support/features in paid tier
  • Vault / HCP Vault: core OSS, hosted/enterprise in paid tier
  • Stratum: engine + blueprint + scanner + Pipeline API, all Apache 2.0

Everything taught in this series — the blueprint format, the build pipeline, the compliance grading, the CI/CD gate — is in the OSS core. You can self-host it, extend it, contribute to it, and run it in your own infrastructure without any dependency on a hosted service.

The repository is at: github.com/rrskris/Stratum


What This Series Taught

EP01 — EP06 in one view:

  • EP01: Default AMIs are insecure by design → Stratum replaces the default AMI with a hardened golden image
  • EP02: Blueprint as code — the 2am skip disappears → HardeningBlueprint YAML, via 5-step wizard or direct YAML
  • EP03: One blueprint, six providers, no drift → 6 providers: AWS, GCP, Azure, DigitalOcean, Linode, Proxmox
  • EP04: Automated OpenSCAP — grade at build time → Compliance Scanner with A-F grades, SARIF, drift detection
  • EP05: CI/CD gate — the unhardened image never deploys → Pipeline API: POST /api/pipeline/scan
  • EP06: The platform — OSS, self-hostable, extendable → Apache 2.0, Helm install, three extension points

What’s Next

This series closes the OS hardening gap. The same principle — declare desired state, build reproducibly, verify automatically — applies to every layer of your infrastructure.

If you’ve been following the eBPF: From Kernel to Cloud series, EP10 covers what happens when you combine kernel-level observability with the hardened base that Stratum provides: every connection, every process spawn, every file access — visible from the host kernel, on an OS baseline you can verify.

The next series: Purple Team Playbook — real attack paths against cloud and Kubernetes infrastructure, how they’re detected, and how they’re closed. Starting May 8.

GitHub: github.com/rrskris/Stratum

Get the Purple Team series in your inbox → linuxcent.com/subscribe

Entra ID Linux Login: SSH Authentication with Azure AD Credentials

Reading Time: 7 minutes

The Identity Stack, Episode 12
← EP11: Identity Providers · EP12 · EP13: Zero Trust Identity →

Focus Keyphrase: Entra ID Linux login
Search Intent: Navigational
Meta Description: Configure Linux SSH login with Entra ID credentials — step by step: aad-auth, pam_aad, Conditional Access, and how to inspect the full auth flow in logs. (155 chars)


TL;DR

  • Entra ID (Azure AD) Linux login lets you SSH into a VM using your Azure AD credentials — no local Linux accounts, no SSH keys to distribute
  • The stack: aad-auth package + pam_aad.so + SSSD — Azure authenticates via OIDC device code flow or password, then maps the identity to a local Linux UID
  • Entra ID is not AD — it’s OIDC/OAuth2 native, with no LDAP and no Kerberos (unless you add Azure AD DS, a separate managed service)
  • Conditional Access Policies can gate Linux logins — MFA, device compliance, location restrictions — the same policies as for web apps
  • Two login modes: interactive (browser-based device code, for non-Azure VMs) and integrated (Azure IMDS-based, for Azure VMs)
  • Required roles: Virtual Machine Administrator Login or Virtual Machine User Login on the VM — IAM, not local sudoers

The Big Picture: How Entra ID Linux Login Works

User: ssh vamshi@corp.com

  sshd on Linux VM
      │
      ▼
  PAM (/etc/pam.d/sshd)
      │
      ├── pam_aad.so (auth)
      │     │
      │     │  OIDC device code flow:
      │     │  "Go to microsoft.com/devicelogin and enter code ABCD-1234"
      │     │  User authenticates in browser with MFA
      │     │  Entra ID issues id_token + access_token
      │     ▼
      │   pam_aad validates token:
      │     • signature (JWKS from Entra ID)
      │     • tenant ID (iss claim)
      │     • VM resource audience (aud claim)
      │     • group membership (groups claim)
      │
      └── pam_mkhomedir (session)
            Creates /home/vamshi@corp.com on first login

  Shell session created
  whoami → vamshi_corp_com (sanitized UPN for Linux username)

EP11 mapped the IdP landscape. This episode gets specific: Entra ID and Linux. Understanding this matters because Entra ID is increasingly where enterprise identities live, and cloud VMs that you SSH into with local accounts are an operational and security liability.


Entra ID vs Active Directory: What’s Different

This distinction matters before configuring anything.

  • Protocol: LDAP + Kerberos (AD) vs OIDC + OAuth2 (Entra ID)
  • Directory queries: ldapsearch (AD) vs the Microsoft Graph API (Entra ID)
  • Linux join: realm join with adcli + SSSD (AD) vs the aad-auth package (Entra ID)
  • Authentication: Kerberos tickets (AD) vs JWT tokens (Entra ID)
  • Group policy: GPO via Sysvol (AD) vs Conditional Access + Intune (Entra ID)
  • Network requirement: DC reachable on LAN/VPN (AD) vs HTTPS to login.microsoftonline.com (Entra ID)

Entra ID has no LDAP interface and no Kerberos realm. You cannot run ldapsearch against it. You cannot kinit to it. The authentication protocol is entirely OIDC/OAuth2 — the same protocol your browser uses to “Login with Microsoft.”
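
You can see the protocol surface directly: every tenant publishes a standard OIDC discovery document over plain HTTPS (replace TENANT_ID with your own):

# Fetch the tenant's OpenID Connect metadata: no LDAP, no KDC, just HTTPS
curl -s "https://login.microsoftonline.com/TENANT_ID/v2.0/.well-known/openid-configuration" \
  | jq '{issuer, token_endpoint, jwks_uri}'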

If you need LDAP and Kerberos from Azure, that’s Azure AD Domain Services — a separate managed service that Microsoft runs, which does speak LDAP and Kerberos. It’s not Entra ID; it’s a managed AD replica in Azure. EP12 covers the Entra ID path — the modern, protocol-native approach.


Prerequisites

# Azure side:
# 1. The VM's managed identity must be enabled (System-assigned)
# 2. Two Azure RBAC roles assigned on the VM resource:
#    - "Virtual Machine Administrator Login" (for sudo access)
#    - "Virtual Machine User Login" (for regular access)
# 3. Conditional Access policies that apply to the VM login scope

# VM side (Ubuntu 20.04+ / RHEL 8+):
# Install the aad-auth package (Microsoft-maintained)
curl -sSL https://packages.microsoft.com/keys/microsoft.asc \
  | gpg --dearmor -o /usr/share/keyrings/microsoft.gpg
echo "deb [signed-by=/usr/share/keyrings/microsoft.gpg] \
  https://packages.microsoft.com/ubuntu/22.04/prod jammy main" \
  > /etc/apt/sources.list.d/microsoft.list
apt-get update && apt-get install -y aad-auth

Configuration

# Configure the aad-auth package
aad-auth configure \
  --tenant-id 12345678-1234-1234-1234-123456789abc \
  --app-id 87654321-4321-4321-4321-cba987654321

# This writes /etc/aad.conf:
# [aad]
# tenant_id = 12345678-...
# app_id = 87654321-...
# version = 1

# Verify the PAM configuration was updated
grep pam_aad /etc/pam.d/common-auth
# auth [success=1 default=ignore] pam_aad.so

The aad-auth package installs pam_aad.so and configures PAM automatically. It also modifies /etc/nsswitch.conf to add aad as a source for passwd lookups — so getent passwd vamshi@corp.com works after the first login.


The Login Flow

On an Azure VM (Integrated mode)

Azure VMs have access to the Instance Metadata Service (IMDS) at 169.254.169.254. pam_aad uses the VM’s managed identity to get a token from IMDS, which proves the VM is trusted, then validates the user’s token against the tenant.

# User SSHes with the username as a UPN (vamshi@corp.onmicrosoft.com or vamshi@corp.com)
ssh vamshi@corp.onmicrosoft.com@vm.eastus.cloudapp.azure.com

# Or use the short form (the verified custom domain) if the tenant is configured:
ssh vamshi@corp.com@vm.eastus.cloudapp.azure.com

On first connection, pam_aad initiates the device code flow:

To sign in, use a web browser to open https://microsoft.com/devicelogin
and enter the code ABCD-1234 to authenticate.

The user opens the URL in any browser (on any device), enters the code, and authenticates with their Entra ID credentials + MFA. The SSH session gets a token. Subsequent logins within the token cache TTL skip the device code step.

Username format on the Linux system

Entra ID usernames (UPNs) contain @ — not valid in Linux usernames. aad-auth sanitizes the UPN:

vamshi@corp.com → vamshi_corp_com    (default)
# or, with shorter_username enabled in /etc/aad.conf:
vamshi@corp.com → vamshi

The UID is derived from the Azure AD Object ID (a deterministic hash) — stable across logins, same UID on every VM in the tenant.
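
After the first login, the mapping is visible locally (illustrative output; the actual UID comes from the Object ID hash):

id vamshi_corp_com
# uid=1568745231(vamshi_corp_com) gid=1568745231(vamshi_corp_com) groups=1568745231(vamshi_corp_com)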


Conditional Access for Linux Logins

Conditional Access Policies in Entra ID apply to Linux VM logins the same way they apply to web app logins.

Policy: Require MFA for Linux VM Login
  Conditions:
    Cloud apps: "Azure Linux Virtual Machine Sign-In"
    Users: All users (or specific groups)
  Grant:
    Require multi-factor authentication
    Require compliant device (optional)

With this policy, every SSH login triggers MFA — regardless of whether the client machine supports it. The MFA challenge appears in the device code flow (the browser window the user opens).

You can also enforce:
Location restrictions — only from corporate IP ranges
Device compliance — device must be Intune-managed
Sign-in risk — block logins flagged as risky by Entra ID Identity Protection

This is the operational shift: Linux login security is now managed in the same Conditional Access policy engine as every other Entra ID-protected resource. No more per-machine PAM configuration for MFA.


Role-Based Access: Who Can Log In

Access to the VM is controlled by Azure RBAC — not by local Linux groups or sudoers.

# Grant a user SSH access to the VM
az role assignment create \
  --assignee [email protected] \
  --role "Virtual Machine User Login" \
  --scope /subscriptions/SUB_ID/resourceGroups/RG/providers/Microsoft.Compute/virtualMachines/VM_NAME

# Grant admin (sudo) access
az role assignment create \
  --assignee vamshi@corp.com \
  --role "Virtual Machine Administrator Login" \
  --scope /subscriptions/SUB_ID/...

Virtual Machine Administrator Login maps to the sudo group on the Linux VM. Users with this role get passwordless sudo. Users with Virtual Machine User Login get a regular shell.

The mapping is enforced by pam_aad checking the groups claim in the token against the configured admin group. No /etc/sudoers.d/ files needed.


Debugging Entra ID Linux Logins

# Check aad-auth service status
systemctl status aad-auth

# View aad-auth logs
journalctl -u aad-auth -f

# Attempt a manual token validation (requires aad-auth debug mode)
aad-auth login --username vamshi@corp.com

# Check the local user cache
getent passwd vamshi_corp_com
# Returns if the user has logged in before

# Clear the local cache (forces re-authentication)
aad-auth clean-cache

# Verify Conditional Access isn't blocking (check Entra ID Sign-in logs)
# Azure Portal → Entra ID → Sign-in logs → filter by user + app "Azure Linux VM Sign-In"

The Entra ID Sign-in logs in the Azure Portal show every authentication attempt, the Conditional Access policies that evaluated, which ones passed/failed, and the exact failure reason. This is far more diagnostic than reading PAM logs.
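
If you prefer a CLI to the portal, the same sign-in events are queryable through Microsoft Graph. A sketch, assuming the caller has the AuditLog.Read.All permission (filter values are examples):

# Pull recent sign-in attempts for one user, including Conditional Access results
az rest --method GET \
  --uri "https://graph.microsoft.com/v1.0/auditLogs/signIns?\$filter=userPrincipalName%20eq%20'vamshi@corp.com'&\$top=10"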


Entra ID Connect: Bringing On-Prem Users to Entra ID

For organizations with existing on-prem AD who want to enable Entra ID Linux login:

On-prem AD users → Entra ID Connect sync → Entra ID
                                                │
                                    Linux VM login (aad-auth)

Entra ID Connect is a Windows Server application that syncs users from on-prem AD to Entra ID every 30 minutes. Users authenticate against Entra ID (which validates against AD via Password Hash Sync, Pass-Through Authentication, or Federation). The Linux VM doesn’t know or care — it sees an Entra ID token.

With Password Hash Sync: password hashes (not plaintext) are synced to Entra ID — users authenticate directly in the cloud.
With Pass-Through Authentication: Entra ID forwards authentication requests to an on-prem agent that validates against AD — no password hashes leave the datacenter.
With Federation (AD FS): Entra ID delegates authentication to the on-prem AD FS farm, which validates against AD — the most complex option, but with the most on-prem control.


⚠ Common Misconceptions

“Entra ID = Azure Active Directory = Active Directory.” Three different things. Active Directory: on-prem, LDAP+Kerberos. Azure AD (now Entra ID): cloud, OIDC+OAuth2. Azure AD Domain Services: managed AD replica in Azure, LDAP+Kerberos, not Entra ID.

“You need Azure AD DS to join Linux to Azure.” Azure AD DS is the managed AD service. Entra ID Linux login (via aad-auth) is entirely separate and doesn’t require AD DS. You can authenticate Linux to Entra ID directly via OIDC.

“The Linux username matches the Entra ID username.” The UPN is sanitized (@ and . become _) to produce a valid Linux username. The canonical identity is the UPN or the Entra Object ID. Don’t hardcode the sanitized username in scripts.


Framework Alignment

  • CISSP Domain 5 (Identity and Access Management): Entra ID Linux login centralizes Linux VM access in the same IAM system as all other enterprise resources — one policy engine, one audit log
  • CISSP Domain 3 (Security Architecture and Engineering): Eliminating per-VM local accounts removes a class of credential management risk — no SSH keys to rotate, no local accounts to audit
  • CISSP Domain 1 (Security and Risk Management): Conditional Access Policies enforcing MFA on Linux logins reduce the risk of credential-based compromise of cloud VMs

Key Takeaways

  • Entra ID Linux login uses OIDC device code flow — no LDAP, no Kerberos, no local Linux accounts
  • aad-auth package installs pam_aad.so and handles the full authentication stack: token issuance, validation, user cache, UID mapping
  • VM access is controlled by Azure RBAC roles (Virtual Machine Administrator Login / Virtual Machine User Login) — not by sudoers files
  • Conditional Access Policies apply to Linux VM logins — MFA, device compliance, and location restrictions use the same engine as every other Entra ID app
  • Debugging starts in Entra ID Sign-in logs (Azure Portal), not in /var/log/auth.log

What’s Next

EP12 showed how Entra ID enables Linux logins in the cloud. EP13 is the series closer: Zero Trust identity — what it means to verify identity continuously, how SPIFFE and SPIRE handle workload (non-human) identity, and where the stack goes from /etc/passwd in 1970 to a Zero Trust policy engine in 2026.

Next: Zero Trust Identity: SPIFFE, SPIRE, mTLS, and Continuous Verification

Get EP13 in your inbox when it publishes → linuxcent.com/subscribe

GCP Secure Boot Certificate Expiration 2026: What You Must Do Before June 24

Reading Time: 10 minutes


TL;DR

  • Three Microsoft UEFI Secure Boot certificates expire between June 24 and October 19, 2026
  • Any GCP Compute Engine instance with Secure Boot enabled, created before November 7, 2025, carries the old certs and is at risk
  • When the certs expire, instances may fail to boot after OS updates that pull in bootloaders signed only by the replacement 2023 certificates
  • GKE Shielded Nodes are affected too — node pools whose nodes haven’t been recreated since November 7, 2025 carry the old UEFI database
  • vTPM-sealed secrets, BitLocker, and Linux full disk encryption break if Secure Boot fails mid-update
  • Primary fix: recreate affected instances (post-Nov 7, 2025 instances include the updated UEFI DB automatically)
  • Emergency workaround if boot fails: temporarily disable Secure Boot, apply updates, re-enable

The Big Picture: The UEFI Secure Boot Trust Chain

  UEFI Firmware (PK — Platform Key, set by OEM/Google)
         │
         │  PK signs KEK updates
         ▼
  ┌─────────────────────────────────────────────┐
  │        KEK (Key Exchange Key Database)       │
  │  Microsoft Corporation KEK CA 2011           │ ← EXPIRING Jun 24, 2026
  │  Microsoft Corporation KEK CA 2023           │ ← Replacement (new VMs only)
  └────────────────────┬────────────────────────┘
                       │  KEK authorizes DB/DBX updates
                       ▼
  ┌─────────────────────────────────────────────┐
  │         DB (Authorized Signature Database)   │
  │  Microsoft UEFI CA 2011 ← signs Linux Shim  │ ← EXPIRING Jun 27, 2026
  │  Microsoft Windows PCA 2011 ← signs WinBoot │ ← EXPIRING Oct 19, 2026
  │  Microsoft UEFI CA 2023 ← replacement       │ ← Present on post-Nov 7 VMs
  │  Microsoft Windows PCA 2023 ← replacement   │ ← Present on post-Nov 7 VMs
  └────────┬───────────────────────┬────────────┘
           │                       │
           ▼                       ▼
   Linux Shim (shim.efi)    Windows Boot Manager
           │
           ▼
       GRUB2 / systemd-boot
           │
           ▼
       Linux Kernel

GCP Compute Engine instances with Secure Boot enabled — created before November 7, 2025 — have a UEFI signature database that includes the 2011 certificates but not the 2023 replacements. When those 2011 certificates expire, new bootloader binaries (signed exclusively by the 2023 certs) will be rejected at boot time.


What Secure Boot Actually Does — and Why Certificate Expiry Breaks Booting

Secure Boot is UEFI’s mechanism for ensuring that only cryptographically signed, trusted software runs during the boot sequence. The trust chain works like this:

  1. Platform Key (PK): Root of trust, set by the hardware manufacturer or cloud provider. Authorizes updates to the KEK.
  2. Key Exchange Key (KEK): Authorizes modifications to the DB and DBX (the forbidden signatures database). Microsoft holds one KEK slot; OEMs often hold another.
  3. DB (Signature Database): Contains the public certificates used to verify bootloaders. If a bootloader binary is signed by a cert in DB, it’s allowed to run. If not, the firmware halts.
  4. DBX (Forbidden Signatures Database): Revocation list. Bootloaders explicitly listed here are blocked even if they were once trusted.

Where expiry matters: The DB certificates don’t “enforce” anything at runtime by checking dates themselves — UEFI doesn’t do certificate revocation in real time. The problem is different and more insidious: as Linux distributions and Microsoft ship updated bootloaders, those new binaries are signed only by the 2023 replacement certificates, not the expiring 2011 ones. If your VM’s DB doesn’t contain the 2023 certs, the UEFI firmware will reject the new shim, and the system won’t boot after an OS update that upgrades the bootloader package.

On Debian/Ubuntu, shim-signed upgrades. On RHEL/CentOS Stream, shim-x64 upgrades. Either way: new binary, new signature, old DB — boot failure.
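
You can check which CA signed the shim currently installed on a VM. A sketch using sbverify from the sbsigntool package (the EFI path is the Ubuntu default and varies by distribution):

# Show the signing certificate(s) on the installed shim binary
sudo sbverify --list /boot/efi/EFI/ubuntu/shimx64.efi
# A shim signed only by "Microsoft UEFI CA 2023", on a VM whose DB lacks
# the 2023 cert, is exactly the combination that fails to boot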


The Three Certificates Expiring in 2026

1. Microsoft Corporation KEK CA 2011 — expires June 24, 2026

Role: Authorizes updates to the DB and DBX signature databases.

When the KEK expires, firmware that enforces KEK validity may refuse to accept DB/DBX updates signed by this certificate. This means even if Google pushes an out-of-band UEFI DB update containing the 2023 certs, instances with an expired-only KEK slot may not be able to apply it cleanly.

Replacement: Microsoft Corporation KEK CA 2023


2. Microsoft Corporation UEFI CA 2011 — expires June 27, 2026

Role: Signs third-party bootloaders — specifically the Linux Shim (shim.efi).

This is the most critical cert for Linux workloads. Every major Linux distribution uses a shim bootloader as the first-stage loader in a Secure Boot chain. The shim is signed by Microsoft’s UEFI CA because Linux vendors submit their shim builds to Microsoft for signing (to ensure broad UEFI compatibility). When new shim packages are released signed only by UEFI CA 2023, any VM with only the 2011 cert in its DB will reject them.

Replacement: Microsoft UEFI CA 2023


3. Microsoft Windows Production PCA 2011 — expires October 19, 2026

Role: Signs Windows Boot Manager and other Windows boot components.

Windows instances on GCP using Secure Boot are affected by this cert. Post-expiry Windows OS updates that ship a new Boot Manager binary signed exclusively by the 2023 PCA will fail to boot on instances carrying only the 2011 cert.

Replacement: Microsoft Windows Production PCA 2023

Windows-specific signal: Event ID 1801 in the Windows System event log — “Secure Boot CA/keys need to be updated” — will appear by mid-2026 on affected instances, before actual boot failure. This is your warning window.


Why GCP Instances Are Specifically Affected

Google’s Compute Engine Shielded VMs ship with a pre-populated UEFI variable database. The content of that database is fixed at instance creation time — it’s part of the VM’s UEFI firmware image. Instances created before November 7, 2025 have a DB that contains the 2011 certs but not the 2023 replacements. Instances created on or after November 7, 2025 had the updated database backfilled.

This is not a Google-specific failure. Every cloud provider and on-premises hypervisor platform that uses Secure Boot with a pre-populated UEFI DB has the same problem. GCP is ahead of many platforms in actually documenting it.


GKE Shielded Nodes: The Operational Blind Spot

GKE’s Shielded Nodes feature enables Secure Boot on node pool VMs. Each node is a Compute Engine instance — and all the same rules apply.

The risk: Node pools whose nodes were last provisioned before November 7, 2025 carry the old UEFI database. When containerd, the OS image, or the kernel gets updated via node auto-upgrade or manual node pool upgrade, the new node VMs will carry updated certs. But nodes that haven’t been replaced since before the cutoff are sitting on the old DB.

GKE auto-upgrade helps — but only if it’s actually running and has completed at least one full node replacement cycle since November 7, 2025.

Node pools with auto-upgrade disabled, or clusters in maintenance windows that delayed upgrades, are at risk.

The trigger scenario:
1. GKE runs a node OS update in-place on an old node (not a full node replacement)
2. The update upgrades the shim package to a version signed only by UEFI CA 2023
3. Next reboot: the node fails to boot
4. The node is marked NotReady, workloads are rescheduled — but the underlying VM is stuck


Detecting Affected Resources

Compute Engine Instances

gcloud compute instances list \
  --filter="creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true" \
  --format="table(name,zone,creationTimestamp,shieldedInstanceConfig.enableSecureBoot,status)"

Sample output:

NAME               ZONE           CREATION_TIMESTAMP        ENABLE_SECURE_BOOT  STATUS
prod-api-01        us-central1-a  2024-08-15T10:22:00Z      True                RUNNING   ← at risk
prod-db-02         us-central1-b  2023-11-01T08:15:00Z      True                RUNNING   ← at risk
prod-web-03        us-central1-a  2025-12-01T14:30:00Z      True                RUNNING   ← safe (post-Nov 7)

GKE Node Pools

# List node pools with Secure Boot enabled per cluster
gcloud container clusters list --format="value(name,location)" | while read NAME LOCATION; do
  echo "=== Cluster: $NAME ($LOCATION) ==="
  gcloud container node-pools list \
    --cluster="$NAME" \
    --location="$LOCATION" \
    --filter="config.shieldedInstanceConfig.enableSecureBoot=true" \
    --format="table(name,config.shieldedInstanceConfig.enableSecureBoot,management.autoUpgrade)"
done

Then verify node creation timestamps within affected pools:

gcloud compute instances list \
  --filter="labels.goog-gke-node:* AND creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true" \
  --format="table(name,zone,creationTimestamp,labels.goog-gke-node)"

Checking the UEFI DB on a Running Instance

SSH into an affected instance and verify which certs are in the DB:

# On the instance (requires mokutil and/or efitools)
sudo mokutil --db | grep -A3 "Subject:"

Look for CN=Microsoft UEFI CA 2023 in the output. Its absence means your instance has only the 2011 certs.

On GKE nodes (where you have node shell access via a DaemonSet or node debug pod):

# Using kubectl debug for node access
kubectl debug node/NODE_NAME -it --image=ubuntu -- bash
# Then inside the debug pod:
chroot /host
mokutil --db 2>/dev/null | grep "Microsoft.*2023" || echo "2023 cert NOT present — node at risk"

Solutions

Option 1: Recreate the Instance (Recommended)

Instances created after November 7, 2025 automatically receive the updated UEFI certificate database. The simplest fix is to recreate affected instances.

For Compute Engine:

# Step 1: Create a machine image (snapshot) of the existing instance
gcloud compute machine-images create INSTANCE_NAME-backup \
  --source-instance=INSTANCE_NAME \
  --source-instance-zone=ZONE

# Step 2: Delete the old instance (after verifying backup)
gcloud compute instances delete INSTANCE_NAME --zone=ZONE

# Step 3: Create new instance from machine image
gcloud compute instances create INSTANCE_NAME \
  --source-machine-image=INSTANCE_NAME-backup \
  --zone=ZONE \
  --shielded-secure-boot \
  --shielded-vtpm \
  --shielded-integrity-monitoring

The new instance will have the post-November 7, 2025 UEFI DB.

For GKE Node Pools:

# Option A: Upgrade the node pool (triggers node recreation)
gcloud container clusters upgrade CLUSTER_NAME \
  --location=LOCATION \
  --node-pool=NODE_POOL_NAME

# Option B: Recreate the node pool entirely
gcloud container node-pools create NODE_POOL_NAME-new \
  --cluster=CLUSTER_NAME \
  --location=LOCATION \
  --shielded-secure-boot \
  --shielded-integrity-monitoring \
  [... your existing pool config ...]

# Then cordon and drain the old pool nodes
kubectl cordon NODE_NAME
kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data

# Finally delete the old node pool
gcloud container node-pools delete NODE_POOL_NAME \
  --cluster=CLUSTER_NAME \
  --location=LOCATION

Option 2: Disable Secure Boot Temporarily (Emergency Workaround)

If an instance has already failed to boot after an OS update, or if you need to apply bootloader updates before recreating the instance:

# Stop the instance first (if it isn't already stopped)
gcloud compute instances stop INSTANCE_NAME --zone=ZONE

# Disable Secure Boot on the stopped instance
gcloud compute instances update INSTANCE_NAME \
  --zone=ZONE \
  --no-shielded-secure-boot

# Start the instance
gcloud compute instances start INSTANCE_NAME --zone=ZONE

# SSH in, apply OS updates and any pending bootloader upgrades
# (The system will boot without Secure Boot enforcement)
sudo apt-get update && sudo apt-get upgrade -y   # Debian/Ubuntu
# or
sudo dnf update -y                                # RHEL/CentOS

# Stop the instance again
gcloud compute instances stop INSTANCE_NAME --zone=ZONE

# Re-enable Secure Boot
gcloud compute instances update INSTANCE_NAME \
  --zone=ZONE \
  --shielded-secure-boot

# Start again — now boots with new bootloader binaries
gcloud compute instances start INSTANCE_NAME --zone=ZONE

Note: This workaround doesn’t add the 2023 certs to the DB. It bypasses Secure Boot enforcement temporarily. The underlying UEFI DB still only has the 2011 certs. You still need to recreate the instance to get the updated DB — this is only a bridge to keep the instance alive while you plan migration.


Option 3: Restore from Machine Image

If an instance is already in a boot failure state and the workaround above doesn’t apply:

# List available machine images
gcloud compute machine-images list

# Restore from a pre-failure machine image
gcloud compute instances create INSTANCE_NAME-restored \
  --source-machine-image=MACHINE_IMAGE_NAME \
  --zone=ZONE

Then immediately plan recreation on a post-November 7, 2025 instance.


vTPM, BitLocker, and Full Disk Encryption: The Hidden Risk

For VMs using Shielded VM features beyond just Secure Boot — specifically vTPM with sealed secrets — certificate expiry creates a more dangerous failure mode.

How vTPM sealing works:

  Boot sequence measurements → PCR registers (PCR 0–7 for UEFI, PCR 8–15 for OS)
         │
         ▼
  TPM seals secrets (FDE key, BitLocker key) to specific PCR values
         │
         ▼
  On next boot: PCR values must match for TPM to release the key
         │
         ▼
  If Secure Boot state changes (cert DB changes, Secure Boot disabled) →
  PCR values change → TPM refuses to unseal → FDE fails → disk inaccessible
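
You can observe the measurements the key is sealed against from a running Linux instance. A sketch using tpm2-tools (package names vary by distribution):

# Read the firmware/Secure Boot PCR bank; PCR 7 reflects Secure Boot state
sudo tpm2_pcrread sha256:0,1,2,3,4,5,6,7
# Disabling Secure Boot or changing the cert DB changes PCR 7,
# and a key sealed against the old value will not unseal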

What this means in practice:

  • Linux FDE (LUKS with TPM2 unsealing): If Secure Boot fails or is temporarily disabled per the workaround above, the TPM will not release the LUKS volume key. The system will drop to a recovery prompt. You need the LUKS recovery passphrase.

  • Windows BitLocker: If PCR values shift (Secure Boot disabled, cert DB changed), BitLocker enters recovery mode. The VM prompts for the BitLocker recovery key on next boot. Without it, the volume is inaccessible.

  • Windows Virtual Secure Mode: VSM uses vTPM to protect credentials. If Secure Boot state changes, VSM-protected secrets become inaccessible until re-enrollment.

Action before any changes:

# For Linux: ensure you have the LUKS recovery key
sudo cryptsetup luksDump /dev/sda3 | grep "Key Slot"

# For Windows: export BitLocker recovery key before touching Secure Boot state
# (Do this from within the running Windows instance via PowerShell)
Get-BitLockerVolume | Select-Object -ExpandProperty KeyProtector | Where-Object {$_.KeyProtectorType -eq "RecoveryPassword"}

Store recovery keys in Secret Manager, not just locally:

# Store LUKS key in GCP Secret Manager
echo -n "YOUR_RECOVERY_KEY" | gcloud secrets create luks-recovery-INSTANCE_NAME \
  --data-file=- \
  --replication-policy=automatic

⚠ Production Gotchas

1. OS update automation is the trigger, not the cert expiry date itself.
The certs don’t enforce anything at runtime. The actual failure happens when an unattended-upgrade, yum-cron, or GKE node OS update pulls in a new shim/Boot Manager binary signed only by the 2023 cert. Instances may fail to boot weeks or months before the official cert expiry date if distros ship updated bootloaders early.

2. GKE surge upgrades can mask the problem — temporarily.
During a node pool upgrade, GKE creates new nodes (with updated certs) before draining old ones. Workloads move to new nodes. The old nodes get deleted. This looks fine — until you realize some in-place operations (node taints, label changes, manual kubelet restarts) could force old nodes to reboot without triggering node replacement.

3. Disabling Secure Boot changes vTPM PCR values — plan FDE recovery before touching anything.
The temporary workaround (disable Secure Boot) will invalidate TPM-bound disk encryption. Have recovery keys ready before running --no-shielded-secure-boot.

4. Windows Event ID 1801 is an early warning — act on it.
If you see this event in your Windows Compute Engine instances before June 2026, that instance has already identified itself as carrying the old certs. Use it as your automated detection signal in Cloud Logging.

# Query Cloud Logging for Event ID 1801 across Windows instances
gcloud logging read 'resource.type="gce_instance" AND jsonPayload.EventID=1801' \
  --format="table(resource.labels.instance_id,timestamp,jsonPayload.Message)" \
  --limit=50

5. Instance templates propagate the old DB.
If you use instance templates or managed instance groups (MIGs) to create VMs, and those templates were created before November 7, 2025, new instances created from them may or may not inherit updated certs depending on how the template configures the UEFI DB. Verify by checking creation timestamp of the resulting instance, not the template.

6. Custom OS images don’t fix this.
Importing a custom image or using a custom OS does not update the UEFI certificate database. The DB is part of the VM’s UEFI firmware state, not the OS disk image. Recreating the instance is the only reliable path.


Quick Reference: Commands

  • List affected Compute Engine VMs:
      gcloud compute instances list --filter="creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true"
  • Check UEFI DB on a Linux VM:
      sudo mokutil --db | grep -E "Subject|Not After"
  • Check for 2023 cert presence:
      mokutil --db 2>/dev/null | grep "Microsoft.*2023" || echo "2023 cert absent"
  • Disable Secure Boot (emergency):
      gcloud compute instances update INSTANCE --zone=ZONE --no-shielded-secure-boot
  • Re-enable Secure Boot:
      gcloud compute instances update INSTANCE --zone=ZONE --shielded-secure-boot
  • Find affected GKE nodes:
      gcloud compute instances list --filter="labels.goog-gke-node:* AND creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true"
  • Trigger GKE node pool upgrade:
      gcloud container clusters upgrade CLUSTER --location=LOCATION --node-pool=POOL
  • Store LUKS key in Secret Manager:
      echo -n "KEY" | gcloud secrets create NAME --data-file=-
  • Query Windows Event 1801 in Logging:
      gcloud logging read 'resource.type="gce_instance" AND jsonPayload.EventID=1801'
  • Create machine image backup:
      gcloud compute machine-images create BACKUP --source-instance=INSTANCE --source-instance-zone=ZONE

Framework Alignment

  • CISSP Domain 7 (Security Operations): patch management, boot integrity, incident response
  • CISSP Domain 3 (Security Architecture): Secure Boot trust chain, TPM integration, cryptographic key lifecycle
  • NIST CSF 2.0 (ID.AM, PR.IP): asset inventory of affected VMs; integrity protection of the boot chain
  • CIS Benchmarks (CIS Google Cloud Computing Foundations): Shielded VM controls, vTPM configuration
  • OWASP Top 10 (A05: Security Misconfiguration): failure to maintain certificate currency in security-critical infrastructure

Key Takeaways

  • The expiry of three Microsoft UEFI CA certificates in 2026 creates a window where GCP VMs with Secure Boot enabled — created before November 7, 2025 — will fail to boot after pulling in new bootloader packages
  • The failure is not instantaneous on the cert expiry date. It’s triggered by the next OS update that ships a bootloader signed exclusively by the 2023 replacement certs
  • GKE Shielded Nodes are affected through the same mechanism: node VMs that haven’t been recreated since November 7, 2025 carry the old UEFI database
  • vTPM-sealed secrets (FDE, BitLocker, VSM) add a secondary failure mode if Secure Boot state is changed as part of remediation — have recovery keys before touching anything
  • Google’s recommended fix is instance recreation. The workaround (disable Secure Boot temporarily) keeps instances alive but doesn’t fix the underlying DB — treat it as a bridge, not a resolution
  • Audit now, before June 24. The command is one line. The blast radius of missing this is a production boot failure at 2 AM after a routine security patch run

What’s Next

If you’re running Shielded VMs in production, this certificate expiry is the kind of quiet deadline that fails silently — not with an alarm, but with a VM that doesn’t come back after a patch cycle. The time to audit is before your automated patching runs, not after.

If you found this useful, the linuxcent.com newsletter covers infrastructure security at this depth regularly — kernel internals, cloud platform gotchas, and the operational implications that vendor docs bury in footnotes.

Get the next deep-dive in your inbox when it publishes → [subscribe link]

Identity Providers Explained: On-Prem, Cloud, SCIM, and Federation

Reading Time: 6 minutes

The Identity Stack, Episode 11
EP10: SAML/OIDC · EP11 · EP12: Entra ID + Linux → …

Focus Keyphrase: what is an identity provider
Search Intent: Informational
Meta Description: Identity providers explained: on-prem (AD FS, Keycloak), cloud (Okta, Entra ID), SCIM provisioning, and how federation stitches them together. (151 chars)


TL;DR

  • An Identity Provider (IdP) is the system that authenticates users and issues identity assertions (SAML assertions, OIDC tokens) to applications
  • On-prem IdPs: AD FS (Microsoft), Shibboleth (universities), Keycloak (open source), Ping Identity — they sit in front of AD and speak SAML/OIDC to cloud apps
  • Cloud IdPs: Okta, Entra ID (Azure AD), Google Workspace, Ping Identity Cloud — they are the directory and the authentication layer in one
  • Federation: IdPs can trust each other — a corporate IdP can delegate to a cloud IdP, or federate with a partner org’s IdP
  • SCIM (System for Cross-domain Identity Management) is provisioning, not authentication — it creates/updates/deactivates user accounts in target systems when the source directory changes
  • The key distinction: federation (authentication flow) vs directory sync (data copy) — they solve different problems and are often deployed together

The Big Picture: Where IdPs Sit

                        On-prem Directory
                        (Active Directory / OpenLDAP / FreeIPA)
                               │
                               │ LDAP / Kerberos
                               ▼
                         Identity Provider
                         ┌──────────────────────────────────┐
                         │  AD FS / Keycloak / Okta /       │
                         │  Entra ID Connect / Shibboleth   │
                         │                                  │
                         │  Speaks: SAML 2.0 + OIDC + OAuth2│
                         └────────────────┬─────────────────┘
                                          │ assertions / tokens
                      ┌───────────────────┼───────────────────┐
                      ▼                   ▼                   ▼
               Salesforce          GitHub Enterprise      AWS IAM
               (SAML SP)           (OIDC RP)              (OIDC)

EP10 covered the protocols. This episode covers the systems — what an IdP actually does, how the major ones differ, and how they connect to each other through federation and SCIM.


On-Premises Identity Providers

AD FS (Active Directory Federation Services)

AD FS is Microsoft’s on-prem federation server — a Windows Server role that sits in front of Active Directory and speaks SAML 2.0 and OIDC to external applications.

What it does:
– Authenticates users against AD (Kerberos/LDAP behind the scenes)
– Issues SAML assertions and OIDC tokens to external SPs
– Handles claims transformation: maps AD attributes to what the SP expects

What it doesn’t do well:
– It’s Windows Server only
– Configuration is complex (XML, certificates, claim rule language)
– No built-in MFA (requires Azure MFA or a third-party provider)
– Being deprecated in favor of Entra ID for most use cases

AD FS made sense when everything was on-prem. As workloads move to cloud, Entra ID Connect (a lighter sync agent) combined with Entra ID as the IdP replaces AD FS for most enterprises.

Keycloak

Keycloak is the open-source IdP from Red Hat. It’s commonly deployed alongside directories like FreeIPA and OpenLDAP to provide web-based OIDC/SAML SSO, and it’s widely used by organizations that want full control over their identity infrastructure.

# Run Keycloak in development mode (Docker)
docker run -p 8080:8080 \
  -e KEYCLOAK_ADMIN=admin \
  -e KEYCLOAK_ADMIN_PASSWORD=admin \
  quay.io/keycloak/keycloak:latest \
  start-dev

# Keycloak concepts:
# Realm     — an isolated namespace (like a tenant)
# Client    — an application that uses Keycloak for auth (SP/RP)
# User federation — connect Keycloak to an existing LDAP/AD directory
# Identity brokering — federate with external IdPs (Google, GitHub, another SAML IdP)

Keycloak reads users from AD/LDAP via its User Federation feature — it doesn’t replace the directory, it federates it. Users still live in AD; Keycloak issues SAML/OIDC tokens based on those users.
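
To see those concepts concretely, here is a minimal sketch using kcadm.sh, the admin CLI bundled with the Keycloak server distribution — the realm and client names are illustrative, and the credentials match the dev-mode container above:

# Authenticate the CLI against the master realm
bin/kcadm.sh config credentials --server http://localhost:8080 \
  --realm master --user admin --password admin

# Create a realm (a tenant) and register an OIDC client (an application) in it
bin/kcadm.sh create realms -s realm=corp -s enabled=true
bin/kcadm.sh create clients -r corp -s clientId=myapp \
  -s protocol=openid-connect -s 'redirectUris=["https://app.corp.com/callback"]'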

Shibboleth

Shibboleth is the dominant IdP in academia. Most universities run it. It’s SAML-native, designed for federation between institutions — a student can authenticate at their home university’s IdP and access resources at a partner institution.


Cloud Identity Providers

Okta

Okta is a cloud IdP + directory. It can:
– Act as the primary user directory (storing users, credentials)
– Connect to on-prem AD via the Okta Active Directory Agent (a lightweight sync service)
– Federate with other IdPs (act as IdP or SP in a SAML/OIDC chain)
– Enforce MFA, Adaptive Authentication, Device Trust

Okta’s Lifecycle Management handles provisioning: when a user is created/disabled in Okta (or synced from AD), Okta can automatically create/deactivate accounts in downstream SaaS apps — via SCIM or app-specific APIs.

Entra ID (Azure Active Directory)

Entra ID is Microsoft’s cloud IdP. It’s both a directory (stores users, groups) and an IdP (issues tokens). For organizations running on-prem AD, Entra ID Connect syncs users from AD to Entra ID.

Entra ID is OIDC and OAuth2 native — it speaks SAML for legacy apps but JWT/OIDC for everything modern. Its OIDC implementation follows the standard closely; its token validation happens via /.well-known/openid-configuration and the JWKS endpoint.
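
Both of those are public endpoints you can inspect directly — a sketch using curl, with TENANT_ID as a placeholder for your tenant:

# Fetch the OIDC discovery document — issuer, endpoints, supported claims
curl -s "https://login.microsoftonline.com/TENANT_ID/v2.0/.well-known/openid-configuration" \
  | python3 -m json.tool

# The jwks_uri in that document points to the signing keys used to validate tokens
curl -s "https://login.microsoftonline.com/TENANT_ID/discovery/v2.0/keys" | python3 -m json.tool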

On-prem AD  →  Entra ID Connect (sync agent)  →  Entra ID (cloud)
                                                      │
                                              SAML / OIDC
                                                      │
                                            SaaS apps, Azure resources

Google Workspace

Google Workspace is Google’s combined directory + IdP. Google accounts are the users. Apps integrate via SAML or OIDC. Google’s OIDC implementation is one of the most widely used reference implementations — most OIDC libraries are tested against it.


Federation: IdPs Trusting Each Other

Federation is the mechanism that lets IdPs delegate to each other. Two patterns:

SAML Federation (IdP-to-IdP)

Common in academia and partner integrations:

User at University A → requests resource at University B
                              │
                              │ doesn't know user
                              ▼
                    University B SP redirects to...
                    Discovery Service: "which IdP are you from?"
                              │
                              ▼
                    University A IdP authenticates user
                              │
                    Sends SAML assertion to University B SP

University B’s SP trusts University A’s IdP because both are members of a SAML federation (e.g., InCommon in the US, eduGAIN globally). The federation metadata aggregates all members’ SAML metadata — certificates, endpoints — so members don’t have to manually configure each bilateral trust.

OIDC Identity Brokering

Keycloak, Okta, and Entra ID can all act as identity brokers — they sit between the application and the actual authenticating IdP:

App (OIDC RP) → Keycloak (broker IdP) → Google / GitHub / SAML IdP
                                               │ authenticate
                                               ▼
                                      Keycloak receives assertion
                                      Maps external claims to local claims
                                      Issues OIDC token to app

The app only knows Keycloak. Keycloak handles the upstream IdP complexity.


SCIM: Provisioning ≠ Authentication

SCIM (RFC 7644) is a REST API standard for user lifecycle management — creating, updating, and deactivating user accounts in a target system when changes happen in the source directory.

Source (Okta / Entra ID)           Target (Slack / GitHub / Jira)
         │                                    │
         │  SCIM 2.0 (REST + JSON)            │
         ├─ POST /Users  ─────────────────────► create user
         ├─ PATCH /Users/id ──────────────────► update attributes
         └─ DELETE /Users/id ─────────────────► deactivate account
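
On the wire, those operations are plain JSON over HTTPS. A minimal sketch of a create and a deactivate — the target URL, bearer token, and user are hypothetical:

# Create a user (RFC 7644) — the schemas URN identifies the SCIM core User resource
curl -X POST https://target-app.example.com/scim/v2/Users \
  -H "Authorization: Bearer $SCIM_TOKEN" \
  -H "Content-Type: application/scim+json" \
  -d '{
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
        "userName": "jdoe@corp.example",
        "name": { "givenName": "Jane", "familyName": "Doe" },
        "active": true
      }'

# Deactivate rather than delete — most IdPs flip "active" to false via PATCH
curl -X PATCH https://target-app.example.com/scim/v2/Users/USER_ID \
  -H "Authorization: Bearer $SCIM_TOKEN" \
  -H "Content-Type: application/scim+json" \
  -d '{
        "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
        "Operations": [{ "op": "replace", "path": "active", "value": false }]
      }'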

SCIM is not SSO. A SCIM-provisioned user in Slack can log in to Slack — but the authentication still goes through the IdP (SAML/OIDC). SCIM ensures the account exists. The IdP proves the user’s identity.

Why both? Because SSO alone doesn’t create accounts in target systems — it just authenticates to them. If a user tries to log in to Slack for the first time via SSO, Slack needs an account to map them to. SCIM creates that account before the first login (Just-in-Time provisioning handles it at first login, but SCIM handles it in bulk and handles deprovisioning reliably).

Deprovisioning is where SCIM matters most. When an employee leaves, you disable them in Okta — SCIM deactivates their account in every connected app within minutes. Without SCIM, IT runs a manual checklist. Someone misses Jira. The ex-employee has access for three weeks.


Directory Sync vs Federation

These are commonly confused:

Directory sync — copy user data from source to target. Entra ID Connect copies users from on-prem AD to Entra ID. This is not authentication; it’s data replication. After sync, Entra ID has its own copy of the user record.

Federation — delegate authentication to an external IdP. The target system doesn’t store credentials; it redirects to the IdP for authentication and trusts the assertion that comes back.

You often need both:
– Sync: so the target system has the user record and can enforce policies (group membership, license assignment)
– Federation: so the user authenticates against the source of truth (your IdP) rather than maintaining a separate password in every system


⚠ Common Misconceptions

“SCIM is an authentication protocol.” SCIM is a provisioning protocol. It creates and manages accounts. Authentication is SAML/OIDC. Both solve different parts of the identity lifecycle problem.

“SSO means you only have one password.” SSO means you only authenticate once per session. The password still exists (at the IdP). SSO reduces the number of authentication events, not the number of credentials.

“On-prem IdP + cloud sync is the same as a cloud IdP.” With on-prem IdP + cloud sync (e.g., AD + Entra ID Connect), authentication happens via the on-prem IdP — if it goes down, cloud SSO breaks. A pure cloud IdP (Okta standalone, Entra ID without on-prem AD) authenticates entirely in the cloud.


Framework Alignment

Domain Relevance
CISSP Domain 5: Identity and Access Management IdPs are the central control plane for federated identity — their architecture, trust relationships, and provisioning workflows define the enterprise IAM posture
CISSP Domain 1: Security and Risk Management SCIM-based deprovisioning is an access control risk management practice — without it, terminated employee access persists across connected systems
CISSP Domain 3: Security Architecture and Engineering The choice of on-prem vs cloud IdP, federation vs sync, and SCIM vs JIT provisioning are architectural decisions with long-term operational and security implications

Key Takeaways

  • An IdP authenticates users and issues assertions (SAML) or tokens (OIDC/OAuth2) — applications trust the IdP, not the user directly
  • On-prem: AD FS (Windows/legacy), Keycloak (open source, flexible), Shibboleth (academia)
  • Cloud: Okta (cloud-native, strong lifecycle management), Entra ID (Microsoft-integrated), Google Workspace
  • Federation = authentication delegation between IdPs; Directory sync = data replication; SCIM = account lifecycle (provisioning/deprovisioning)
  • SCIM deprovisioning is the critical control — it ensures ex-employees lose access automatically across all connected systems

What’s Next

EP11 covered the IdP landscape. EP12 gets specific: Entra ID and Linux — how you configure a Linux VM to accept SSH logins authenticated against Azure AD credentials, and how the aad-auth / pam_aad stack works end to end.

Next: Entra ID Linux Login: SSH Authentication with Azure AD Credentials

Get EP12 in your inbox when it publishes → linuxcent.com/subscribe

SAML vs OIDC vs OAuth2: Which Protocol Handles Which Identity Problem

Reading Time: 6 minutes

The Identity Stack, Episode 10
EP09: Active Directory · EP10 · EP11: Identity Providers → …

Focus Keyphrase: SAML vs OIDC explained
Search Intent: Investigational
Meta Description: SAML, OAuth2, and OIDC solve different problems and are often confused. Here’s what each protocol does, when to use it, and how a browser SSO login actually works. (163 chars)


TL;DR

  • SAML 2.0 is a federation protocol for browser-based SSO — an IdP issues a signed XML assertion that a Service Provider trusts; designed for enterprise applications
  • OAuth2 is an authorization delegation protocol, not authentication — it lets an application act on your behalf without knowing your password; the access token says what, not who
  • OIDC (OpenID Connect) = OAuth2 + an identity layer — adds the id_token (a JWT containing who you are) on top of OAuth2’s access_token (what you can do)
  • SAML vs OIDC: SAML is XML, enterprise-native, stateful; OIDC is JSON/JWT, API-native, stateless — new applications almost always use OIDC
  • The id_token is a JWT — decode it at jwt.io and read every claim — it tells you exactly what the IdP asserts about the user
  • The browser SSO flow is three redirects: user → SP → IdP (authenticate) → SP (consume assertion)

The Problem: LDAP and Kerberos Don’t Cross the Internet

EP09 showed how authentication works inside a corporate network. LDAP and Kerberos both assume network proximity to the directory server — firewall-friendly ports don’t help when the authentication protocol requires a direct connection to the KDC or directory.

Internal network: works
  Browser → intranet app → LDAP/Kerberos → AD DC (all on 10.0.0.0/8)

Internet: breaks
  Browser → SaaS app (AWS) → LDAP/Kerberos → AD DC (on-prem behind firewall)
  ✗ KDC not reachable across NAT
  ✗ LDAP not exposed to internet (shouldn't be)
  ✗ Every SaaS app can't have its own LDAP connection to your DC

SAML was invented in 2002 to solve this. OIDC in 2014. Both let identity assertions travel over HTTPS — the one protocol that crosses every firewall.


SAML 2.0: Enterprise Browser SSO

SAML 2.0 has three actors: the User, the Identity Provider (IdP), and the Service Provider (SP).

1. User visits SP (e.g., Salesforce)
   SP: "I don't know this user — send them to the IdP"
   ↓  HTTP redirect with SAMLRequest (base64-encoded AuthnRequest)

2. User arrives at IdP (e.g., Okta, AD FS, Entra ID)
   IdP: "Authenticate me" → user enters credentials
   IdP: generates a signed SAML Assertion (XML)
   ↓  HTTP POST to SP's Assertion Consumer Service (ACS) URL

3. SP receives the SAMLResponse
   SP: verifies the signature using IdP's public key
   SP: extracts user attributes from the Assertion
   SP: creates a session — user is logged in

The SAML Assertion is an XML document signed by the IdP. It contains:

<saml:Assertion>
  <saml:Issuer>https://idp.corp.com</saml:Issuer>
  <saml:Subject>
    <saml:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-format:emailAddress">
      [email protected]
    </saml:NameID>
  </saml:Subject>
  <saml:Conditions
    NotBefore="2026-04-27T01:00:00Z"
    NotOnOrAfter="2026-04-27T01:05:00Z">  ← short-lived: replay protection
  </saml:Conditions>
  <saml:AttributeStatement>
    <saml:Attribute Name="email">
      <saml:AttributeValue>[email protected]</saml:AttributeValue>
    </saml:Attribute>
    <saml:Attribute Name="groups">
      <saml:AttributeValue>engineers</saml:AttributeValue>
      <saml:AttributeValue>sre-team</saml:AttributeValue>
    </saml:Attribute>
  </saml:AttributeStatement>
</saml:Assertion>

The SP trusts the assertion because it’s signed with the IdP’s private key, and the SP has the IdP’s public certificate configured. No direct connection between SP and IdP needed during authentication — only the browser carries the assertion.

SP-initiated vs IdP-initiated:
– SP-initiated: user visits the SP, gets redirected to IdP, authenticates, redirected back — the common flow
– IdP-initiated: user starts at the IdP (e.g., company portal), clicks an app, IdP sends assertion directly — simpler but no SP-generated RequestID, so the SP can’t verify the request was expected (a security concern)
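
When debugging either flow, you can decode what the browser is actually carrying. In the Redirect binding the SAMLRequest is deflated then base64-encoded; in the POST binding the SAMLResponse is plain base64 — a sketch, assuming the values have already been URL-decoded:

# SAMLRequest from an HTTP-Redirect binding: base64 → raw deflate → XML
echo "$SAML_REQUEST" | base64 -d | python3 -c '
import sys, zlib
print(zlib.decompress(sys.stdin.buffer.read(), -15).decode())'

# SAMLResponse from an HTTP-POST binding: just base64
echo "$SAML_RESPONSE" | base64 -d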


OAuth2: Authorization Delegation (Not Authentication)

This distinction is important and consistently confused: OAuth2 is for authorization, not authentication.

OAuth2 solves: “I want to let GitHub Actions post to my Slack without giving GitHub my Slack password.”

Resource Owner (you)  → grants permission to →  Client (GitHub Actions)
                                                        │
                                                        │ access_token
                                                        ▼
                                               Resource Server (Slack API)
                                               "this token can post messages"

The access_token answers “what can this client do?” not “who is this user?” A resource server receiving an access token knows the token is valid and what scopes it carries — it does not necessarily know which human authorized it.

The four OAuth2 grant types:

Grant Use case
Authorization Code Web apps (server-side) — most secure, recommended
PKCE (+ Auth Code) Native/SPA apps — Auth Code without client secret
Client Credentials Machine-to-machine (no user) — service accounts
Device Code Devices without browsers (smart TVs, CLIs)

The Implicit grant (tokens in URL fragment) is deprecated. Don’t use it.
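
PKCE is what replaced Implicit for browser and native clients: the client generates a one-time code_verifier and sends its S256 hash (the code_challenge) with the authorization request, then proves possession of the verifier at the token endpoint. A sketch of generating the pair with openssl (RFC 7636):

# code_verifier: 43–128 unreserved characters, generated fresh per authorization request
code_verifier=$(openssl rand -base64 48 | tr -d '=+/' | cut -c1-64)

# code_challenge: BASE64URL( SHA-256(code_verifier) ), no padding
code_challenge=$(printf '%s' "$code_verifier" \
  | openssl dgst -sha256 -binary | openssl base64 -A | tr '+/' '-_' | tr -d '=')

echo "verifier:  $code_verifier"
echo "challenge: $code_challenge"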


OIDC: OAuth2 + Who You Are

OpenID Connect adds identity to OAuth2 by adding the id_token — a JWT that the IdP signs and that contains claims about the authenticated user.

Authorization Code flow with OIDC:

1. Client redirects user to IdP:
   GET /authorize?
     response_type=code
     &client_id=myapp
     &scope=openid email profile    ← "openid" scope triggers OIDC
     &redirect_uri=https://app.com/callback
     &state=random-nonce

2. IdP authenticates user, returns:
   GET /callback?code=AUTH_CODE&state=random-nonce

3. Client exchanges code for tokens:
   POST /token
   grant_type=authorization_code&code=AUTH_CODE...

4. IdP returns:
   {
     "access_token": "eyJ...",    ← what the user authorized
     "id_token": "eyJ...",        ← who the user is (JWT)
     "token_type": "Bearer",
     "expires_in": 3600
   }

The id_token decoded:

{
  "iss": "https://idp.corp.com",          ← issuer (the IdP)
  "sub": "user-guid-12345",               ← subject (stable user identifier)
  "aud": "myapp",                          ← audience (your client_id)
  "exp": 1745730000,                       ← expiry (Unix timestamp)
  "iat": 1745726400,                       ← issued at
  "email": "[email protected]",
  "name": "Vamshi Krishna",
  "groups": ["engineers", "sre-team"]     ← custom claims from IdP
}
# Decode any JWT at the command line (no verification — for debugging only)
# JWT segments are base64url-encoded; translate to standard base64 before decoding
echo "eyJ..." | cut -d. -f2 | tr '_-' '/+' | base64 -d 2>/dev/null | python3 -m json.tool

# Or: jwt.io — paste the token, read every claim

sub is the stable user identifier. Email addresses change. Names change. The sub claim is the IdP’s internal identifier for the user — use it as the primary key when storing user data. Never store email as the primary key.


SAML vs OIDC: When to Use Which

SAML 2.0 OIDC
Format XML JSON / JWT
Transport HTTP POST (browser only) HTTP redirect + JSON API
Age 2002 2014
Enterprise adoption Very high (AD FS, Okta, Entra ID) Very high (newer apps)
API-friendly No Yes
Mobile apps No Yes
Complexity High (XML, schemas, signatures) Medium (JWT, JSON)
Single Logout Specified (rarely works well) Optional, inconsistent

Use SAML when: You’re integrating with an enterprise SaaS that only supports SAML (Salesforce classic, legacy HR systems), or your IdP team mandates it.

Use OIDC when: You’re building a new application, integrating with a modern IdP, or need API-based token validation. OIDC is the default for everything new.

Use OAuth2 (Client Credentials) when: Service-to-service authentication with no user — your CI/CD pipeline authenticating to an API, your microservice calling another microservice.
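
The Client Credentials case is worth seeing concretely — there is no browser and no user, just a service exchanging its own credentials for a token. A sketch against a hypothetical IdP token endpoint:

# Machine-to-machine: the client authenticates as itself and receives an access_token
curl -s -X POST https://idp.corp.com/token \
  -d grant_type=client_credentials \
  -d client_id=ci-pipeline \
  -d client_secret="$CLIENT_SECRET" \
  -d scope="deploy:write"

# Response: { "access_token": "eyJ...", "token_type": "Bearer", "expires_in": 3600 }
# No id_token — there is no user to identify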


A Complete Browser SSO Flow (OIDC)

1. User visits https://app.corp.com (not logged in)
   App: no session → redirect to IdP

2. GET https://idp.corp.com/authorize?
        response_type=code
        &client_id=app-corp
        &scope=openid email
        &redirect_uri=https://app.corp.com/callback
        &state=abc123
        &nonce=xyz789

3. IdP: user is not authenticated → show login form
   User: enters [email protected] + password
   (or: IdP sees existing session cookie → skip login)

4. IdP: authentication success
   Redirect: GET https://app.corp.com/callback?code=AUTH_CODE&state=abc123

5. App (server-side): validate state=abc123 (CSRF protection)
   POST https://idp.corp.com/token
     grant_type=authorization_code
     &code=AUTH_CODE
     &client_id=app-corp
     &client_secret=SECRET
     &redirect_uri=https://app.corp.com/callback

6. IdP responds:
   { "id_token": "JWT...", "access_token": "JWT...", "expires_in": 3600 }

7. App: validate id_token signature (using IdP's JWKS endpoint)
   App: extract sub, email, groups from id_token
   App: create session for [email protected]
   App: redirect user to original destination

Step 7 is where most bugs live. The app must validate: signature (using the IdP’s public keys from the jwks_uri advertised in /.well-known/openid-configuration), iss (matches the expected IdP), aud (matches the client_id), exp (not expired), and nonce (matches what was sent in step 2). Skip any of these and you have an authentication bypass.


⚠ Common Misconceptions

“OAuth2 is for login.” OAuth2 is for authorization delegation. It can be used as a login mechanism only when OIDC (the openid scope + id_token) is added on top. “Login with Google” uses OIDC, not bare OAuth2.

“JWTs are encrypted.” By default, JWTs are signed (JWS), not encrypted. The header and payload are base64url-encoded — anyone can decode them. Encryption (JWE) is a separate, less commonly used spec. Never put secrets in a JWT payload assuming it’s private.

“SAML Single Logout works reliably.” SAML SLO is specified but inconsistently implemented. Many SPs ignore SLO requests or don’t propagate them correctly. Don’t depend on SLO for security — session revocation requires additional mechanisms (short-lived tokens, token introspection, session registries).


Framework Alignment

Domain Relevance
CISSP Domain 5: Identity and Access Management SAML, OAuth2, and OIDC are the three protocols that enable federated identity and SSO — understanding which does what is foundational to modern IAM design
CISSP Domain 4: Communications and Network Security JWT validation (signature, claims, expiry) is a network security control — failing to validate any claim is an authentication bypass vulnerability
CISSP Domain 3: Security Architecture and Engineering The choice of SAML vs OIDC is an architectural decision that affects every application integration, mobile support, and API design

Key Takeaways

  • SAML 2.0: XML-based browser SSO — three redirects, signed assertion, enterprise legacy apps
  • OAuth2: authorization delegation — access tokens grant scopes, not identity
  • OIDC: OAuth2 + id_token — adds who the user is on top of what they can do
  • sub is the stable user identifier in OIDC — never use email as a primary key
  • JWT validation must check: signature, iss, aud, exp, nonce — missing any is a security bypass
  • New applications: OIDC. Legacy enterprise SaaS: SAML. Service-to-service: OAuth2 Client Credentials

What’s Next

EP10 covered the protocols. EP11 covers the systems that implement them — the identity providers: what Okta, Entra ID, Keycloak, and AD FS actually do, how they federate with each other, and how SCIM handles user provisioning separately from authentication.

Next: Identity Providers Explained: On-Prem, Cloud, SCIM, and Federation

Get EP11 in your inbox when it publishes → linuxcent.com/subscribe

OWASP Top 10 History: How the List Evolved from 2003 to 2025

Reading Time: 8 minutes


series: OWASP LLM Top 10: From Web Roots to AI Frontiers
episode: 1 of 22
status: Draft
slug: /owasp-top-10-history-evolution/
focus_keyphrase: OWASP Top 10 history evolution
search_intent: Informational
meta_description: “OWASP Top 10 history: how the list evolved from SQL injection in 2003 to LLM prompt injection in 2025 — and what stayed constant across every version.”
owasp_mapping: “Foundation episode — establishes the OWASP organization, methodology, and six-version evolution before branching to the four lists that exist today (Web App, API, Cloud-Native, LLM).”


OWASP Top 10 History · The Four OWASP Lists · Why Classic OWASP Breaks for LLMs · OWASP LLM Top 10 2025


TL;DR

  • OWASP Top 10 history evolution spans six published versions from 2003 to 2021 — the category names change every cycle; the underlying failure classes do not
  • Injection, broken authentication, and access control have appeared in every single version under different names; they were exploited in 2003 and they are still the top breach vectors in 2025
  • The 2021 edition abstracted away from web-app-specific language into attack classes — which is what made OWASP applicable to cloud infrastructure, APIs, Kubernetes, and ultimately AI systems
  • OWASP is not a compliance standard; it is a community consensus on risk — but in 2025, the EU AI Act began directly citing the OWASP AI Exchange, which changes that calculus
  • Four distinct OWASP Top 10 lists exist today: Web App (2021), API Security (2023), Cloud-Native App Security, and LLM Applications (2025) — this series covers the last one, built on the foundation of the first

OWASP Mapping: Foundation episode. No single OWASP LLM category. This episode traces the lineage from OWASP Top 10 (2003) through all six web app versions to the four lists that exist in 2025. Every subsequent episode maps directly to one or more OWASP LLM Top 10 (2025) categories.


The Big Picture

OWASP TOP 10 EVOLUTION: 2003 → 2025

2003 ──▶ Web-era injection (SQL, XSS, parameter tampering)
          │  HTTP/1.0 apps. Databases directly exposed via
          │  dynamic SQL. Sessions via URL parameters.
          │
2007 ──▶ Session management + insecure comms elevated
          │  HTTPS adoption slow. Cookie theft common.
          │
2010 ──▶ Unvalidated redirects added. XSS re-ranked.
          │  The list reflects what's being actively exploited.
          │
2013 ──▶ CSRF dropped. Missing Function-Level Access added.
          │  First signs of API/microservice thinking.
          │
2017 ──▶ Risk-weighted ranking. CWE mappings. XXE added.
          │  Insecure Deserialization, Logging failures enter.
          │  The list becomes infrastructure-aware.
          │
2021 ──▶ Abstracted to attack classes. Insecure Design +
          │  SSRF added. Infrastructure/cloud applicability.
          │  ┌──────────────────────────────┐
          │  │ Now maps to cloud infra      │ ← Purple Team EP02
          │  │ Kubernetes, APIs, pipelines  │
          │  └──────────────────────────────┘
          │
          ├──▶ API Security Top 10 (2023)
          │     REST/GraphQL-specific risks
          │
          ├──▶ Cloud-Native App Security Top 10
          │     Containers, orchestration
          │
          └──▶ LLM Applications Top 10 (2023 v1 → 2025 v2)
                Prompt injection, model poisoning, RAG attacks
                ← THIS SERIES

OWASP Top 10 history is not a list of bugs. It is a snapshot of where the application surface was — and where attackers found the seams — taken every three to four years.


The 2003 Founding: What the Web Looked Like

The OWASP Foundation was established in 2001. The first Top 10 list shipped in 2003.

The web in 2003 looked nothing like it does now. Applications were monolithic. Databases were directly queried via dynamic SQL strings concatenated from user input. Authentication was a session identifier — often passed in a URL parameter rather than a cookie. “Security” was a firewall at the network perimeter — if you were inside the network, you were trusted.

SQL injection was not a theoretical risk. It was how attackers exfiltrated data in bulk, every day, at scale. The same for XSS: inject JavaScript into a page, steal session cookies, impersonate users. These were not edge cases — they were the primary breach vectors because the web was built without any assumption that input was untrusted.

The OWASP founding premise: developers build these vulnerabilities not because they are negligent, but because the threat model was never taught. The Top 10 list was documentation, not enforcement — a shared vocabulary for what actually causes breaches.


Version-by-Version: What Changed and What Did Not

Year Most Significant Addition What Dropped / Changed What It Reflects
2003 Unvalidated Input, SQL Injection, XSS, Command Injection Dynamic SQL era; input treated as trusted
2007 CSRF, Insecure Comms, Improper Error Handling Unvalidated Input consolidated HTTPS adoption gap; session theft via network
2010 Unvalidated Redirects + Forwards CSRF de-emphasized Open redirectors weaponized for phishing
2013 CSRF dropped; Missing Function-Level Access Insecure Storage removed API-style thinking entering the list
2017 Insecure Deserialization, Logging + Monitoring Failures, XXE Unvalidated Redirects dropped Server-side attack complexity; blind spots in detection
2021 Insecure Design (new class), SSRF XSS merged under Injection Architecture-level risk; abstract attack classes introduced

The column that doesn’t change: Broken Access Control, Injection, and Authentication Failures have appeared in every version. The names shift (A01 becomes A07 becomes A01 again). The category descriptions evolve. The underlying failure — you can access things you shouldn’t, or execute code you shouldn’t, or authenticate as someone you’re not — never leaves the list.

This is the most important observation in the entire series: OWASP’s vocabulary modernizes; the failure classes are constants. When you see LLM01 Prompt Injection in the 2025 LLM list, you are looking at the same failure class as A03 Injection in the web app list. The attack surface changed. The category did not.


What the 2021 Abstraction Unlocked

The 2017 → 2021 transition was architecturally significant. Prior versions were implicitly scoped to HTTP requests against web applications. The 2021 list made a deliberate choice to describe attack classes rather than attack techniques.

“Injection” in 2021 means: untrusted data is sent to an interpreter and executed as code or commands. That definition covers SQL injection, LDAP injection, OS command injection — and, it turns out, natural language prompt injection in LLMs. The definition doesn’t care what the interpreter is.

“Broken Access Control” in 2021 means: a principal can act on a resource or perform an action it was not intended to. That covers misconfigured S3 buckets, Kubernetes RBAC gaps — and an LLM agent with tool access that hasn’t been scoped to least capability.

This abstraction is why OWASP became applicable to cloud infrastructure, APIs, containers, and AI. It’s also why the Purple Team series (specifically EP02) was able to map the entire 2021 list directly to cloud infrastructure attack paths — and why this series can map the same abstraction to LLM attack surfaces.

For the cloud infrastructure angle, see OWASP Top 10 mapped to cloud infrastructure. This series starts where that one ends: the attack surface that cloud infrastructure runs on is increasingly powered by language models.


The Four Lists That Exist Today

OWASP has expanded beyond the original web app list. Four Top 10 lists are actively maintained as of 2025:

OWASP Top 10 — Web Application Security Risks (2021)
The original. HTTP-layer attacks on server-rendered or API-backed apps. A01 Broken Access Control through A10 SSRF. Still the baseline for any web-facing application.

OWASP API Security Top 10 (2023)
REST and GraphQL-specific. Broken Object Level Authorization (BOLA/IDOR), excessive data exposure, mass assignment, unrestricted resource consumption. API attacks account for the majority of cloud breaches — this list exists because the web app list missed API-specific attack surfaces.

OWASP Cloud-Native Application Security Top 10
Kubernetes, containers, orchestration-layer risks: insecure workload configurations, misconfigured cloud storage, vulnerable container images, runtime compromise. The cloud-infra angle.

OWASP Top 10 for LLM Applications (2025)
The list this series is built on. Prompt injection, model poisoning, supply chain risks for model artifacts, RAG database attacks, autonomous agent over-permission. The attack surfaces that arrive when you embed a language model in your infrastructure.

The full comparison — which list applies to which part of your architecture, and how they overlap — is in the next episode.


Why AI Arrived at OWASP

The OWASP Top 10 for LLM Applications was not invented top-down. It came from practitioners who were deploying language models and cataloguing the breach patterns they were seeing.

The first version (v1.0) shipped in August 2023, driven by a working group that formed in May 2023 — roughly six months after ChatGPT’s release made LLM deployment widespread. The timeline matters: security researchers were finding real vulnerabilities in production systems in real time, and the OWASP list was the community’s way of documenting the emerging threat model before it became a liability.

Version 2.0 shipped in November 2024. Two entirely new categories — System Prompt Leakage (LLM07) and Vector/Embedding Weaknesses (LLM08) — were added because RAG-based applications and agentic AI had become prevalent enough that their specific attack surfaces warranted dedicated treatment. Sensitive Information Disclosure moved from #6 to #2 because real breach data, not theory, showed it was the second most commonly exploited category.

The OWASP AI Exchange — a parallel OWASP project — went further. It produced a 300-page technical guide on AI security and privacy and contributed directly to the EU AI Act’s technical requirements. As of 2025, the EU AI Act for high-risk AI systems references risk assessment requirements that align directly with OWASP LLM Top 10 categories. OWASP is still not a compliance standard. But for AI systems in the EU, ignoring it is no longer a neutral choice.


⚠ Production Gotchas

“OWASP is a checklist you run once”
It’s a living document updated every 3–4 years based on actual breach data. The 2021 web app list is not the same document as the 2017 list. The 2025 LLM list has different categories than the 2023 v1 list. Running the 2017 checklist on a 2025 system is not OWASP compliance — it is a false sense of coverage.

“We are OWASP compliant”
OWASP is not a compliance standard. There is no OWASP certification, no OWASP audit, no OWASP controls framework. Organizations that say “we are OWASP compliant” mean they have reviewed the list and addressed the categories — that is a risk reduction exercise, not a regulatory state. The EU AI Act is a compliance standard. NIST AI RMF is a compliance framework. OWASP is the technical operationalization of both.

“The LLM Top 10 only matters if you’re building LLMs”
You don’t need to build LLMs for the list to apply. If you are deploying a chatbot powered by a third-party API, using an AI coding assistant that has access to your codebase, or running a RAG application that indexes internal documents — you are within scope of LLM01 through LLM10. The attack surface is the integration, not the model itself.


Quick Reference: OWASP Top 10 Versions

Year Version Key Additions Key Removals Architectural Context
2003 v1.0 Injection, Broken Auth, XSS, Insecure Config Monolithic web apps, dynamic SQL
2007 v2.0 CSRF, Insecure Comms Unvalidated Input → merged HTTPS gap, session theft
2010 v3.0 Unvalidated Redirects Phishing via redirectors
2013 v4.0 Missing Function-Level Access CSRF moved to lower priority API patterns emerging
2017 v5.0 XXE, Insecure Deserialization, Logging Failures Unvalidated Redirects Microservices, detection gaps
2021 v6.0 Insecure Design, SSRF XSS merged into Injection Attack class abstraction; cloud/AI applicability

Current parallel lists:

List Last Updated Primary Surface Key Org
Web App Top 10 2021 HTTP/web apps OWASP
API Security Top 10 2023 REST/GraphQL APIs OWASP
Cloud-Native App Security Top 10 2022 K8s/containers OWASP
LLM Applications Top 10 2025 (v2.0) Language models/AI OWASP GenAI

Framework Alignment

Framework Relevant Function Connection to OWASP History
NIST CSF 2.0 IDENTIFY (ID.RA) OWASP is the community risk catalog that feeds asset risk assessments
ISO 27001:2022 A.8.8 (vulnerability management) OWASP Top 10 is the standard reference for vulnerability class coverage
NIST AI RMF MAP 1.5 Identify which risk categories from OWASP LLM Top 10 apply to specific system components
EU AI Act Art. 9 (risk management system) High-risk AI system risk assessments reference OWASP AI Exchange technical guidance

Key Takeaways

  • OWASP Top 10 history is the story of attack surfaces expanding — web to API to cloud to AI — with the same failure classes appearing at each layer
  • The 2021 abstraction to attack classes (not web-specific techniques) was the architectural decision that made OWASP applicable everywhere, including LLMs
  • Four lists exist today; real systems touch multiple lists simultaneously
  • The LLM Top 10 (v2.0, 2025) is not theoretical — it was built from documented production breach patterns, and v2.0 added new categories because RAG and agentic AI created new attack surfaces fast enough to warrant them
  • OWASP is a risk framework, not a compliance standard — until 2025, when the EU AI Act began referencing OWASP AI Exchange guidance for high-risk AI systems

What’s Next

EP02 answers the navigation question this episode raises: if four OWASP lists exist, which one applies to your system — and what happens when a single architecture touches all four at once?

The Four OWASP Lists: Web App, API, Cloud-Native, and LLM Compared →

Get EP02 in your inbox when it publishes → subscribe

Network Flow Observability — What Every Connection Reveals

Reading Time: 9 minutes

eBPF: From Kernel to Cloud, Episode 10
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP · TC eBPF · bpftrace · Network Flow Observability · DNS Observability


TL;DR

  • Network flow observability with eBPF attaches persistent programs to TC hooks and records every connection attempt, retransmit, reset, and drop — continuously, with no sampling
    (TC hook = Traffic Control hook: the point in the Linux network stack where eBPF programs intercept packets after ingress or before egress, tied to a specific network interface)
  • APM tools and service mesh telemetry are interpretations of what happened; kernel-level flow data from TC hooks is the raw event stream they all derive from
  • Retransmit counters at the kernel level reveal congestion, half-open connections, and remote endpoint failures that application logs never surface
  • Cilium’s Hubble and similar tools (Pixie, Retina) are eBPF flow exporters — they run TC programs, collect perf_event or ringbuf events, and expose them over an API
  • You can verify what flow data a tool is actually collecting with four bpftool commands — without reading documentation
  • Production caution: flow maps grow with the number of active connections; pin and bound your maps, and account for the per-packet overhead on high-throughput interfaces

EP09 showed bpftrace as an on-demand kernel query tool — compile a question, get an answer, clean up. Network flow observability with eBPF is the persistent version: programs that stay attached to TC hooks across your entire fleet, recording every connection without waiting for you to ask. When a client reports intermittent failures that appear nowhere in application logs, that persistent record is what you query. This episode covers how that layer works and how to read it.

Quick Check: What Flow Data Is Your Cluster Already Collecting?

Before building anything new, check what’s already running. If you have Cilium, Pixie, or Retina on your cluster, eBPF flow programs are already attached:

# SSH into a worker node, then:

# What TC programs are attached to cluster interfaces?
bpftool net list

# Expected output on a Cilium node:
# xdp:
#
# tc:
# eth0(2) clsact/ingress prog_id 38 prio 1 handle 0x1 direct-action
# eth0(2) clsact/egress  prog_id 39 prio 1 handle 0x1 direct-action
# lxc12a3(15) clsact/ingress prog_id 41 prio 1 handle 0x1 direct-action
# lxc12a3(15) clsact/egress  prog_id 42 prio 1 handle 0x1 direct-action
# What maps are those programs holding state in?
bpftool map list | grep -E "flow|conn|sock|nat"

# Sample output:
# 24: hash  name cilium_ct4_global  flags 0x0
#     key 24B  value 56B  max_entries 65536  memlock 4718592B
# 25: hash  name cilium_ct4_local   flags 0x0
#     key 24B  value 56B  max_entries 8192   memlock 589824B

Each lxcXXXX interface is a pod’s veth pair. The TC programs on those interfaces are what Cilium uses to enforce NetworkPolicy and collect flow telemetry. If you see prog_id values on pod interfaces, your cluster is already doing kernel-level flow collection.

Not running Cilium? On a plain kubeadm or EKS node without a CNI that uses eBPF, bpftool net list will show no TC programs on pod interfaces — just whatever kube-proxy or the CNI plugin installed. You can still attach your own flow programs with tc qdisc add dev eth0 clsact — that’s the starting point this episode covers.


The client opened a ticket on a Tuesday afternoon. “Intermittent connection failures to the payment gateway. Started around 11 AM. Application logs say timeout. Retry logic is masking it for most users but the error rate is up 0.3%.”

I looked at the APM dashboard. The service showed elevated latency — p99 at 850ms versus a normal 120ms — but no hard errors at the application layer. The service mesh metrics showed the downstream call succeeding from the mesh’s perspective. The payment gateway team said their side looked clean.

Three tools. Three different answers. All of them interpreting the network. None of them were the network.

I ran:

bpftool map dump id 24 | grep -A5 "payment-gateway-ip"

The connection tracking map showed retransmit count 14 for a specific (src_ip, dst_ip, src_port, dst_port, protocol) tuple — the same 5-tuple, every 30 seconds, for 2 hours. The kernel was retransmitting. The TCP stack was compensating. The application was seeing sporadic success because retransmits eventually got through. The APM dashboard averaged that latency into a p99 and called it “elevated.”

The kernel had the truth. Everything above it was rounding.


Why Application-Level Metrics Miss What the Kernel Sees

Application metrics — APM spans, service mesh telemetry, load balancer health checks — operate at Layer 7. They measure round-trip time for complete requests, error codes returned, bytes transferred. They answer “did this request succeed?” not “what did the network do to make it succeed?”

The TCP stack underneath those requests handles retransmits, congestion window adjustments, RST packets, and half-open connections silently. From an application’s perspective, a request that required 3 retransmits before the ACK arrived looks identical to one that succeeded on the first attempt — slightly slower, but successful.

This is structural, not a tooling gap. Application-layer observability tools cannot see below their own protocol boundary. The kernel’s TCP implementation does not report upward when it retransmits. It just retransmits.

eBPF flow observability closes this gap by attaching programs directly to the network path — at the TC hook, which fires on every packet crossing a network interface — and recording what the kernel actually does.


How TC Hook Flow Programs Work

EP08 covered TC eBPF programs for pod network policy. Flow observability uses the same attachment point with a different purpose: instead of allowing or dropping packets, the program reads packet metadata and writes it to a map or ring buffer.

Pod sends packet
      ↓
veth interface (lxcXXXX)
      ↓
TC clsact/egress hook fires
      ↓
eBPF program reads:
  - src IP, dst IP
  - src port, dst port
  - protocol
  - packet size
  - TCP flags (SYN, ACK, FIN, RST, retransmit bit)
      ↓
Writes event to ringbuf (or perf_event_array)
      ↓
Userspace consumer reads ringbuf
      ↓
Aggregates to flow record
      ↓
Exports to Hubble/Prometheus/flow store

ringbuf — a BPF ring buffer: a lock-free, memory-efficient queue shared between a kernel eBPF program and a userspace consumer. The kernel program writes events; the userspace reader drains them. Used instead of perf_event_array in kernel 5.8+ because it avoids per-CPU memory waste and supports variable-length records. When you see Hubble exporting flows, it’s reading from a ringbuf that the TC program writes to.

The key structural property: the TC hook fires on every packet. Not sampled. Not throttled by default. Every SYN, every ACK, every RST, every retransmit. For flow observability, you typically aggregate at the program level — count packets and bytes per 5-tuple per second, rather than emitting an event per packet — but the raw visibility is there if you need it.
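
If you want this on an interface that doesn’t already have a CNI-managed program, attaching your own takes two tc commands — a sketch, assuming you’ve compiled a flow program into flow_observer.o (a hypothetical object file) with an ELF section named tc:

# Create the clsact qdisc — it provides the ingress and egress TC hook points
tc qdisc add dev eth0 clsact

# Attach the eBPF program in direct-action mode on egress
tc filter add dev eth0 egress bpf da obj flow_observer.o sec tc

# Confirm attachment — both views should show the program
tc filter show dev eth0 egress
bpftool net list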


What Retransmit Telemetry Actually Reveals

Most flow observability implementations track TCP retransmits specifically because they are the clearest signal of network-layer trouble invisible to applications.

A TCP retransmit happens when a sender doesn’t receive an ACK within the retransmission timeout (RTO). The kernel resends the segment and doubles the timeout (exponential backoff). From the application’s perspective, the call takes longer. If retransmits keep clearing, the application sees success — just slow success.

perf_event — a kernel mechanism for collecting performance data. In eBPF, BPF_MAP_TYPE_PERF_EVENT_ARRAY lets kernel programs push variable-length records to userspace readers via a ring buffer per CPU. Older tools use perf_event_array; newer ones use BPF_MAP_TYPE_RINGBUF (single shared ring, more efficient). If you inspect an older version of Cilium’s flow exporter, you’ll see perf_event writes; newer versions use ringbuf.

To observe retransmits directly with bpftrace:

# Count retransmit events per destination IP — run for 60 seconds
bpftrace -e '
kprobe:tcp_retransmit_skb {
    $sk = (struct sock *)arg0;
    $daddr = ntop(AF_INET, $sk->__sk_common.skc_daddr);
    @retransmits[$daddr] = count();
}
interval:s:60 { print(@retransmits); clear(@retransmits); exit(); }
'

Sample output:

Attaching 2 probes...
@retransmits[10.96.0.10]:   2       # DNS service — normal
@retransmits[172.16.4.23]:  847     # payment gateway endpoint ← problem here
@retransmits[10.244.1.5]:   1       # normal pod-to-pod traffic

847 retransmits to a single endpoint in 60 seconds. That’s not noise. That’s a congested or half-open connection being retried 14 times per second by the TCP stack while the application layer averages it into “elevated latency.”


How Cilium Hubble Collects Flow Data

Hubble is the flow observability layer built into Cilium. Understanding how it works makes you able to reason about what it can and cannot see — and how to verify what it’s actually collecting.

Hubble’s architecture:

Kernel (per node)
├── TC eBPF programs on all pod veth interfaces
│     write flow events → BPF ringbuf
│
└── Hubble node agent (userspace)
      reads ringbuf
      enriches with pod metadata (Kubernetes API)
      exposes gRPC API

Cluster level
└── Hubble Relay
      aggregates per-node gRPC streams
      exposes single cluster-wide API

User tooling
└── hubble observe  /  Hubble UI  /  Prometheus exporter

The TC programs are writing raw packet events. The Hubble agent is the consumer that translates those events into Kubernetes-aware flow records — adding pod name, namespace, label, and policy verdict on top of the 5-tuple and TCP metadata the kernel provides.
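
Querying that enriched stream is what the hubble CLI is for — two representative queries (the namespace and IP are illustrative; the flags are from the Hubble CLI):

# Live flows for one namespace, with policy verdicts
hubble observe --namespace payments --follow

# Only dropped flows toward a specific destination IP
hubble observe --to-ip 172.16.4.23 --verdict DROPPED --follow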

To see what Hubble’s TC programs have attached:

# On any Cilium node
bpftool net list | grep lxc

# lxce4a1(23) clsact/ingress prog_id 61  ← Hubble flow program on pod interface ingress
# lxce4a1(23) clsact/egress  prog_id 62  ← Hubble flow program on pod interface egress
# lxcf7b2(31) clsact/ingress prog_id 63
# lxcf7b2(31) clsact/egress  prog_id 64
# Inspect one of those programs to confirm it's reading flow metadata
bpftool prog show id 61

# Output:
# 61: sched_cls  name tail_handle_nat  tag 3a8e2f1b4c7d9e0a  gpl
#     loaded_at 2026-04-22T09:13:45+0530  uid 0
#     xlated 2144B  jited 1382B  memlock 4096B  map_ids 24,31,38
#     btf_id 142

sched_cls is the BPF program type for TC — confirming these are TC-attached flow programs. map_ids 24,31,38 — those are the maps this program reads from and writes to. You can dump any of them:

bpftool map dump id 24 | head -40

# Output (connection tracking entry):
# [{
#     "key": {
#         "saddr": "10.244.1.5",        # ← source pod IP
#         "daddr": "172.16.4.23",        # ← destination IP
#         "sport": 48291,                # ← source port
#         "dport": 443,                  # ← destination port
#         "nexthdr": 6,                  # ← protocol: TCP
#         "flags": 3                     # ← CT_EGRESS | CT_ESTABLISHED
#     },
#     "value": {
#         "rx_packets": 14832,           # ← packets received
#         "tx_packets": 14831,           # ← packets sent
#         "rx_bytes": 3841024,           # ← bytes received
#         "tx_bytes": 3756288,           # ← bytes sent
#         "lifetime": 21600,             # ← seconds until entry expires
#         "rx_closing": 0,
#         "tx_closing": 0
#     }
# }]

That’s the ground truth. Not an APM span. Not a service mesh metric. The actual per-connection counters the kernel is maintaining for that 5-tuple.


Writing a Minimal Flow Observer with bpftrace

You don’t need Cilium or Hubble to get flow telemetry. bpftrace can produce it directly on any node with BTF:

# Persistent flow table: connections + packet counts for 2 minutes
timeout 120 bpftrace -e '
kprobe:tcp_sendmsg {
    $sk = (struct sock *)arg0;
    $daddr = ntop(AF_INET, $sk->__sk_common.skc_daddr);
    $dport = bswap($sk->__sk_common.skc_dport);  // skc_dport is stored in network byte order
    @flows[comm, $daddr, $dport] = count();
}
interval:s:30 { print(@flows); clear(@flows); }
'

Sample output (every 30 seconds):

@flows[curl, 93.184.216.34, 443]:         12    # curl → example.com:443
@flows[coredns, 10.96.0.10, 53]:          341   # CoreDNS upstream queries
@flows[payment-svc, 172.16.4.23, 443]:   1204   # payment service → gateway
@flows[nginx, 10.244.2.3, 8080]:          89    # nginx → upstream pod

For retransmit tracking specifically:

# Combined flow + retransmit watcher — runs until Ctrl-C
bpftrace -e '
kprobe:tcp_retransmit_skb {
    $sk = (struct sock *)arg0;
    $daddr = ntop(AF_INET, $sk->__sk_common.skc_daddr);
    @retx[comm, $daddr] = count();
}
kprobe:tcp_sendmsg {
    $sk = (struct sock *)arg0;
    $daddr = ntop(AF_INET, $sk->__sk_common.skc_daddr);
    @sends[comm, $daddr] = count();
}
interval:s:10 {
    printf("=== Retransmit ratio (last 10s) ===\n");
    print(@retx);
    print(@sends);
    clear(@retx);
    clear(@sends);
}
'

This gives you both the volume of sends and the retransmit count side by side — the ratio tells you whether retransmits are a rounding error (0.01%) or a signal (5%+).


⚠ Production Gotchas

Map size bounds matter. Connection tracking maps default to tens of thousands of entries. On nodes with high connection churn (serverless, short-lived batch jobs), maps can fill and start dropping new entries silently. Check bpftool map show id N for max_entries and monitor map utilization. Cilium exposes this as cilium_bpf_map_pressure in Prometheus.

Per-packet overhead on high-throughput interfaces. A TC program that fires on every packet on a 10Gbps interface processes millions of packets per second. Aggregating at the program level (count per 5-tuple rather than emit per packet) keeps overhead manageable — Cilium does this. A naive bpftrace one-liner that emits a perf event per packet will saturate the perf ring buffer under real load. Use ringbuf write paths or aggregate before emitting.

TC hook placement and direction confusion. Ingress TC on a pod’s veth (lxcXXXX) sees egress traffic from the pod’s perspective — because the host sees the packet arriving on the veth after the pod sent it. This reversal is consistent but confusing when you’re reading direction labels in flow records. EP08 covered this in detail for policy enforcement; the same asymmetry applies to flow data.

Retransmit counters reset on connection close. If you’re tracking retransmit totals for a long-lived connection, the count is stored in the kernel’s socket state and is cleared when the socket closes. For persistent tracking across reconnects, aggregate at the flow level in userspace before the connection closes.

Hubble flow visibility requires pod interfaces. Hubble only sees traffic that crosses a pod’s veth interface. Node-to-node traffic that doesn’t involve a pod (e.g., node SSH, kubelet-to-API-server on the node IP) is not captured by default. For host-level network observability, you need a TC program on the physical interface (eth0, ens3), not just on pod veth pairs.


Quick Reference

What you want to see Command
What TC programs are attached bpftool net list
Which maps a program uses bpftool prog show id N (check map_ids)
Connection tracking entries bpftool map dump id N
Retransmits per destination bpftrace -e 'kprobe:tcp_retransmit_skb { ... }'
Flow counts per process bpftrace -e 'kprobe:tcp_sendmsg { @[comm, daddr] = count(); }'
Hubble flow stream (Cilium) hubble observe --follow
Hubble flows for one pod hubble observe --pod mynamespace/mypod --follow
Verify map pressure bpftool map show id N (check max_entries vs entries)

Kernel function What it marks
tcp_sendmsg Data being sent on a TCP socket
tcp_recvmsg Data being received on a TCP socket
tcp_retransmit_skb A segment being retransmitted
tcp_send_reset RST being sent
tcp_fin Connection teardown initiated
tcp_connect New outbound TCP connection attempt

Key Takeaways

  • Network flow observability with eBPF attaches TC programs that record every connection event continuously — not sampled, not throttled, not filtered by what the application reports
  • Retransmit telemetry from tcp_retransmit_skb reveals congestion and endpoint failures that are structurally invisible to application-layer monitoring tools
  • Cilium Hubble, Pixie, and Retina are all eBPF flow exporters — they run TC programs, drain a ringbuf, enrich with Kubernetes metadata, and expose the result over an API
  • You can verify what any flow tool is actually collecting with bpftool net list, bpftool prog show, bpftool map dump, and bpftool map show — four commands, no documentation needed
  • Map sizing and per-packet overhead are the two production concerns; aggregate at the kernel level, bound your maps, and monitor map pressure
  • The kernel’s connection tracking map is the ground truth. APM dashboards, service mesh metrics, and load balancer health checks are all interpretations of what that map contains

What’s Next

Flow observability tells you what connections exist. EP11 goes one level deeper: what names your pods are resolving those connections to. DNS is where a compromised workload first reveals itself — it queries a domain that has no business being queried from a production pod, and if you’re not watching the kernel-level DNS path, you won’t see it until after the damage.

DNS observability at the kernel level hooks the syscall path that DNS queries traverse — the same ground-truth approach as flow telemetry, but for name resolution: every query, every response, tied to the pod that made it, without deploying a sidecar.

Next: DNS observability at the kernel level — what your pods are actually resolving

Get EP11 in your inbox when it publishes → linuxcent.com/subscribe

The Pipeline Gate — Hardened Images as a CI/CD Build Constraint

Reading Time: 6 minutes

OS Hardening as Code, Episode 5
Cloud AMI Security Risks · Linux Hardening as Code · Multi-Cloud OS Hardening · Automated OpenSCAP Compliance · CI/CD Compliance Gate

Focus Keyphrase: CI/CD compliance gate
Search Intent: Investigational
Meta Description: A compliance grade no one checks before deploying is decoration. The Pipeline API makes grade a build constraint — unhardened images never reach production. (158 chars)


TL;DR

  • A CI/CD compliance gate turns an OS hardening grade from a report into a build constraint — unhardened images fail the pipeline before they can be deployed
  • POST /api/pipeline/scan returns pass/fail against a minimum grade threshold — integrates into any CI/CD system that can make an HTTP request
  • Failed gate output tells engineers exactly which controls failed and what to fix — not just “blocked”
  • The gate works on both build-time grades (new images) and runtime grades (existing instances)
  • GitHub Actions, GitLab CI, Jenkins, and Tekton integrations are one curl command
  • The structural guarantee: an image that doesn’t pass the gate doesn’t exist in the deployment pipeline

The Problem: A Grade No One Checks Is Decoration

Pipeline without compliance gate:
  Build → Test → Security scan (results to dashboard) → Deploy

What actually happens:
  Build → Test → Security scan → "C grade, but we need to ship" → Deploy anyway
                                           │
                                           └─ Dashboard shows C grade
                                              Nobody is paged
                                              Deployment succeeds

A CI/CD compliance gate means the pipeline can’t continue if the grade is below threshold.

EP04 showed that automated OpenSCAP compliance gives every image a verified, reproducible grade before deployment. What it assumed is that someone checks the grade before deploying. They don’t — not under deadline pressure, not when the image has been “working fine for months,” not at 2am.

The same problem that made hardening runbooks skippable applies to compliance grades: if checking the grade is a discretionary step, it will be skipped.


A new microservice was deployed from an unhardened base image. The team had built it quickly during a sprint, used a community AMI as the base, and planned to harden it “in the next sprint.”

Three weeks later, a penetration test found it. SSH password authentication enabled. Three unnecessary services running — one of them with a known CVE. The finding: the instance had full inbound access from the VPC and was reachable from a compromised adjacent instance.

The deployment had gone through the normal CI/CD pipeline. Unit tests passed. Integration tests passed. A vulnerability scan ran. The scan produced a report that went to a dashboard. Nobody had a gate set up to fail the build if the image was unhardened.

The hardening work from the “next sprint” plan would have taken four hours. The pentest remediation took a week, plus the time to investigate what had been exposed during the three weeks the instance was running.

The CI/CD pipeline had every check except the one that would have caught the base image problem before the first deployment.


The Pipeline API

The Pipeline API is a single HTTP endpoint that takes an image or instance ID, checks it against a minimum grade, and returns pass or fail:

# Fail the pipeline if the image grade is below B
curl -sf -X POST https://stratum.yourdomain.com/api/pipeline/scan \
  -H "Authorization: Bearer ${STRATUM_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "image_id": "ami-0a7f3c9e82d1b4c05",
    "min_grade": "B"
  }'

# Pass response (grade A):
# HTTP 200
# {
#   "result": "pass",
#   "image_id": "ami-0a7f3c9e82d1b4c05",
#   "grade": "A",
#   "score": 94,
#   "controls_passing": 94,
#   "controls_total": 100,
#   "scanned_at": "2026-04-19T15:54:10Z"
# }

# Fail response (grade C):
# HTTP 422
# {
#   "result": "fail",
#   "image_id": "ami-0c9d5e3f81a2b6e07",
#   "grade": "C",
#   "score": 72,
#   "min_grade_required": "B",
#   "failing_controls": [
#     { "id": "1.1.7", "title": "Separate partition for /var/log/audit", "severity": "medium" },
#     { "id": "3.3.2", "title": "TCP SYN cookies enabled", "severity": "low" },
#     ...
#   ]
# }

A non-200 response fails the pipeline. The -f flag makes curl exit non-zero on any 4xx/5xx response, so a 422 already fails the step on its own; the || { ...; exit 1; } in the CI integrations below just adds a readable error message before the job stops.
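For CI systems where the step is just a shell script (Jenkins, Tekton), a hedged variant of the same call that captures the response body, so the failure details land in the job log rather than only a curl exit code:

BODY=$(mktemp)
STATUS=$(curl -s -o "$BODY" -w '%{http_code}' -X POST "${STRATUM_URL}/api/pipeline/scan" \
  -H "Authorization: Bearer ${STRATUM_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"image_id\": \"${AMI_ID}\", \"min_grade\": \"B\"}")

if [ "$STATUS" != "200" ]; then
  echo "Compliance gate failed (HTTP ${STATUS}):"
  cat "$BODY"
  exit 1
fi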


GitHub Actions Integration

# .github/workflows/deploy.yml

jobs:
  build-image:
    runs-on: ubuntu-latest
    outputs:
      ami_id: ${{ steps.build.outputs.ami_id }}
    steps:
      - name: Build hardened AMI
        id: build
        run: |
          AMI_ID=$(stratum build \
            --blueprint ubuntu22-cis-l1.yaml \
            --provider aws \
            --output json | jq -r '.image_id')
          echo "ami_id=${AMI_ID}" >> $GITHUB_OUTPUT

  compliance-gate:
    runs-on: ubuntu-latest
    needs: build-image
    steps:
      - name: Stratum compliance gate
        run: |
          curl -sf -X POST ${{ vars.STRATUM_URL }}/api/pipeline/scan \
            -H "Authorization: Bearer ${{ secrets.STRATUM_TOKEN }}" \
            -H "Content-Type: application/json" \
            -d "{\"image_id\": \"${{ needs.build-image.outputs.ami_id }}\", \"min_grade\": \"B\"}" \
            || { echo "Compliance gate failed — image does not meet minimum grade B"; exit 1; }

  deploy:
    runs-on: ubuntu-latest
    needs: [build-image, compliance-gate]
    steps:
      - name: Deploy to staging
        run: |
          aws autoscaling update-auto-scaling-group \
            --auto-scaling-group-name my-asg \
            --launch-template "ImageId=${{ needs.build-image.outputs.ami_id }}"

The deploy job only runs if compliance-gate passes. The AMI doesn’t reach the autoscaling group if it doesn’t meet the grade threshold.


GitLab CI Integration

# .gitlab-ci.yml

stages:
  - build
  - compliance
  - deploy

build-image:
  stage: build
  script:
    - |
      AMI_ID=$(stratum build \
        --blueprint ubuntu22-cis-l1.yaml \
        --provider aws \
        --output json | jq -r '.image_id')
      echo "AMI_ID=${AMI_ID}" >> build.env
  artifacts:
    reports:
      dotenv: build.env

compliance-gate:
  stage: compliance
  needs: [build-image]
  script:
    - |
      curl -sf -X POST ${STRATUM_URL}/api/pipeline/scan \
        -H "Authorization: Bearer ${STRATUM_TOKEN}" \
        -H "Content-Type: application/json" \
        -d "{\"image_id\": \"${AMI_ID}\", \"min_grade\": \"B\"}"

deploy:
  stage: deploy
  needs: [build-image, compliance-gate]
  script:
    - ./deploy.sh ${AMI_ID}

What the Failed Gate Tells You

The value of the CI/CD compliance gate is not just that it blocks bad images — it’s that the failure output tells engineers what to fix.

A gate failure in CI shows:

Compliance gate failed.

Image: ami-0c9d5e3f81a2b6e07
Grade: C (72/100)
Required: B (85/100)
Gap: 13 controls failing

Failing controls:
  HIGH   1.1.7   Separate partition for /var/log/audit
                 Fix: Provision /var/log/audit on a separate EBS volume
  MEDIUM 1.6.1.3 AppArmor enabled in bootloader
                 Fix: Update GRUB_CMDLINE_LINUX, run update-grub, reboot
  MEDIUM 3.3.2   TCP SYN cookies
                 Fix: echo "net.ipv4.tcp_syncookies=1" > /etc/sysctl.d/60-cis.conf
  LOW    5.2.21  SSH MaxStartups
                 Fix: Add "MaxStartups 10:30:60" to /etc/ssh/sshd_config
  ...

View full scan report: https://stratum.yourdomain.com/scans/ami-0c9d5e3f81a2b6e07

This is not a wall — it’s a list of exactly what to fix. The engineer running the pipeline sees the gap, fixes the blueprint or the Ansible role, rebuilds, and the gate passes. The gap is closed before any instance is deployed.
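If a pipeline step only has the raw 422 JSON (saved here to response.json), a small jq filter over the fields shown in the fail response above reproduces the same readable gap list:

jq -r '"Grade: \(.grade) (\(.score)/100)   required: \(.min_grade_required)",
       "Failing controls:",
       (.failing_controls[] | "  \(.severity | ascii_upcase)  \(.id)  \(.title)")' response.json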


Runtime Gate: Checking Existing Instances

The Pipeline API also works against running instances, not just images:

# Gate on a running instance's current compliance state
curl -sf -X POST https://stratum.yourdomain.com/api/pipeline/scan \
  -H "Authorization: Bearer ${STRATUM_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "instance_id": "i-0abc123",
    "min_grade": "B",
    "scan_type": "runtime"
  }'

This is useful in deployment pipelines that don’t build custom AMIs — they launch instances and configure them after launch. The runtime gate runs after configuration is complete and before the instance is registered with the load balancer.

It also integrates into scheduled compliance jobs — scan your fleet on a schedule and alert when any instance drifts below grade threshold.
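A sketch of that scheduled job, runnable from cron or a pipeline schedule. The instance discovery via the AWS CLI and the env tag are assumptions; the gate call is the same runtime scan shown above.

for ID in $(aws ec2 describe-instances \
    --filters "Name=tag:env,Values=production" "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].InstanceId' --output text); do
  curl -sf -X POST "${STRATUM_URL}/api/pipeline/scan" \
    -H "Authorization: Bearer ${STRATUM_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "{\"instance_id\": \"${ID}\", \"min_grade\": \"B\", \"scan_type\": \"runtime\"}" \
    || echo "DRIFT: ${ID} is below grade B"
done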


Grade Thresholds by Environment

Not all environments need the same threshold. A common pattern:

# Environment-specific minimum grades
environments:
  production: A      # 95%+ passing — no exceptions
  staging:    B      # 85%+ passing — minor gaps acceptable
  development: C     # 70%+ passing — experimental OK

# Production deploy gate
curl -sf -X POST .../api/pipeline/scan \
  -d '{"image_id": "ami-...", "min_grade": "A"}'

# Staging deploy gate
curl -sf -X POST .../api/pipeline/scan \
  -d '{"image_id": "ami-...", "min_grade": "B"}'

This lets development move fast with a lower bar while enforcing the highest standard at the production gate.
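One way to wire this into a single reusable pipeline step, a sketch that derives the threshold from a DEPLOY_ENV variable your pipeline is assumed to set already:

case "${DEPLOY_ENV}" in
  production)  MIN_GRADE=A ;;
  staging)     MIN_GRADE=B ;;
  development) MIN_GRADE=C ;;
  *) echo "Unknown environment: ${DEPLOY_ENV}"; exit 1 ;;
esac

curl -sf -X POST "${STRATUM_URL}/api/pipeline/scan" \
  -H "Authorization: Bearer ${STRATUM_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"image_id\": \"${AMI_ID}\", \"min_grade\": \"${MIN_GRADE}\"}"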


Production Gotchas

Gate latency on first scan: If the image hasn’t been scanned yet, the Pipeline API triggers a scan on demand. This takes 2–3 minutes. For build pipelines that want instant gate results, use stratum build --blueprint ... --scan-on-build to ensure the scan runs during the build step and the result is cached for the gate call.

Token rotation: The STRATUM_TOKEN used for API authentication should be rotated on the same schedule as other service credentials. Use environment-specific tokens so a compromised staging token doesn’t bypass a production gate.

Webhook notifications on gate failure: The Pipeline API can send a webhook to Slack, PagerDuty, or any endpoint when a gate fails. Configure this for production pipelines so failures are visible beyond the CI log.

# In the Stratum config
notifications:
  pipeline_failures:
    - type: slack
      webhook: ${SLACK_WEBHOOK}
      channel: "#platform-security"
    - type: webhook
      url: ${PAGERDUTY_WEBHOOK}
      min_grade: D     # only page on D/F, not B/C failures

Key Takeaways

  • A CI/CD compliance gate turns a compliance grade from a dashboard metric into a pipeline constraint — the image doesn’t deploy if it doesn’t pass
  • POST /api/pipeline/scan is a single HTTP call that any CI/CD system can make — no agent, no plugin, no SDK required
  • Failed gate output is actionable: every failing control includes the specific fix, not just the control ID
  • Runtime gates check instances after configuration, not just at image build time
  • Environment-specific thresholds let development move faster while enforcing the highest standard at production

What’s Next

The CI/CD compliance gate closes the final gap: even if an unhardened image gets built, it can’t deploy. EP05 is the bookmark episode — this is the point where OS hardening becomes structurally enforced rather than procedurally expected.

EP06 is the series closer. For five episodes, you’ve been using Stratum as a user. What does it look like to run it yourself — extend it with a custom control, add a provider, deploy the platform in your own infrastructure?

Stratum is open-core (Apache 2.0). EP06 is the architecture reveal, the GitHub release, and the extension guide for everything the series taught.

Next: Stratum — open-source OS hardening platform for multi-cloud infrastructure

Get EP06 in your inbox when it publishes → linuxcent.com/subscribe

How Active Directory Works: LDAP, Kerberos, and Group Policy Under the Hood

Reading Time: 6 minutes

The Identity Stack, Episode 9
EP08: FreeIPA ← EP09 → EP10: SAML/OIDC → …

Focus Keyphrase: Active Directory LDAP
Search Intent: Informational
Meta Description: Active Directory is LDAP + Kerberos + DNS + Group Policy — all tightly integrated. Here’s how AD replication, Sites, GPO, and Linux domain join actually work. (160 chars)


TL;DR

  • Active Directory is not a product that happens to use LDAP — it is an LDAP directory with a Microsoft-extended schema, a built-in Kerberos KDC, and DNS tightly integrated
  • Replication uses USNs (Update Sequence Numbers) and GUIDs — the Knowledge Consistency Checker (KCC) automatically builds the replication topology
  • Sites and site links tell AD which DCs are physically close — AD prefers to authenticate users against a DC in the same site to minimize WAN latency
  • Group Policy Objects (GPOs) are stored as LDAP entries (in the CN=Policies container) and Sysvol files — LDAP tells clients which GPOs apply; Sysvol delivers the policy files
  • Linux joins AD via realm join (uses adcli + SSSD) or net ads join (Samba + winbind) — both register a machine account in AD and get a Kerberos keytab
  • The difference between Linux in AD and Linux in FreeIPA: AD is optimized for Windows; FreeIPA is optimized for Linux — both interoperate

The Big Picture: What AD Actually Is

Active Directory Domain: corp.com
┌────────────────────────────────────────────────────────────┐
│                                                            │
│  LDAP directory          Kerberos KDC                      │
│  ─────────────           ──────────                        │
│  Schema: 1000+ classes   Realm: CORP.COM                   │
│  Objects: users, groups, Issues TGTs + service tickets     │
│  computers, GPOs, OUs    Uses LDAP as the account DB       │
│                                                            │
│  DNS                     Sysvol (DFS share)                │
│  ────                    ────────────────                  │
│  SRV records for KDC     GPO templates                     │
│  and LDAP discovery      Login scripts                     │
│                          Replicated via DFSR               │
│                                                            │
│  Replication engine: USN + GUID + KCC                      │
└────────────────────────────────────────────────────────────┘
          │ replicates to          │ replicates to
          ▼                        ▼
   DC: dc02.corp.com        DC: dc03.corp.com

EP08 showed FreeIPA as the Linux-native answer to enterprise identity. AD is the Microsoft answer — and because most enterprises run Windows clients, understanding AD is unavoidable for Linux infrastructure engineers. This episode goes behind the LDAP and Kerberos protocols to explain what makes AD specifically work.


The AD Schema: LDAP With 1000+ Object Classes

AD’s schema extends the base LDAP schema with Microsoft-specific classes and attributes. Every user object is a user class (which extends organizationalPerson which extends person which extends top) with additional attributes like:

sAMAccountName   ← the pre-Windows 2000 login name (vamshi)
userPrincipalName ← the modern UPN ([email protected])
objectGUID       ← a globally unique 128-bit identifier (never changes, even if DN changes)
objectSid        ← Windows Security Identifier (used for ACL enforcement on Windows)
whenCreated      ← creation timestamp
pwdLastSet       ← password change timestamp
userAccountControl ← bitmask: disabled, locked, password never expires, etc.
memberOf         ← back-link: groups this user belongs to

objectGUID is the authoritative identifier in AD — not the DN. When a user is renamed or moved to a different OU, the GUID stays the same. Applications that store a user’s DN will break on rename; applications that store the GUID won’t.

userAccountControl is the bitmask that controls account state:

Flag          Value   Meaning
ACCOUNTDISABLE  2     Account disabled
LOCKOUT         16    Account locked out
PASSWD_NOTREQD  32    Password not required
NORMAL_ACCOUNT  512   Normal user account (set on almost all accounts)
DONT_EXPIRE_PASSWD 65536  Password never expires

# Query AD from a Linux machine
ldapsearch -x -H ldap://dc.corp.com \
  -D "[email protected]" -w password \
  -b "dc=corp,dc=com" \
  "(sAMAccountName=vamshi)" \
  sAMAccountName userPrincipalName objectGUID memberOf userAccountControl

Replication: USN + GUID + KCC

AD replication is multi-master — every DC accepts writes. The replication engine uses:

USN (Update Sequence Number) — a per-DC counter that increments on every local write. Each attribute in the directory stores the USN at which it was last modified (uSNChanged, uSNCreated). When DC-A replicates to DC-B, DC-B asks: “give me everything you’ve changed since the last USN I saw from you.”

GUID — each object has a globally unique identifier. If the same attribute is modified on two DCs before replication (a conflict), the conflict is resolved at the attribute level: the write with the higher version number wins; if versions match, the later modification timestamp wins; if timestamps are also equal, the write from the DC with the higher invocation GUID wins.

KCC (Knowledge Consistency Checker) — a component that runs on every DC and automatically constructs the replication topology. You don’t configure which DCs replicate to which — the KCC builds a minimum spanning tree that ensures every DC is connected to every other within a set number of hops. You configure Sites and site links; the KCC does the rest.

# Check replication status from a Linux machine (requires rpcclient or adcli)
# Or on the DC: repadmin /showrepl (Windows tool)

# Simulate: query the highestCommittedUSN from a DC
ldapsearch -x -H ldap://dc.corp.com \
  -D "[email protected]" -w password \
  -b "" -s base highestCommittedUSN

Sites are AD’s concept of physical network topology. A site is a set of IP subnets with high-bandwidth connectivity between them. Site links represent the WAN connections between sites.

Site: Mumbai              Site: Hyderabad
┌────────────────┐        ┌────────────────┐
│ DC: dc-mum-01  │        │ DC: dc-hyd-01  │
│ DC: dc-mum-02  │        │ DC: dc-hyd-02  │
│ subnet: 10.1/16│        │ subnet: 10.2/16│
└───────┬────────┘        └────────┬───────┘
        │                          │
        └──── Site Link ───────────┘
              Cost: 100
              Replication interval: 15 min

When a user in Mumbai authenticates, the client locates a DC in the same site using DNS SRV records. The SRV records include the site name in the service name: _ldap._tcp.Mumbai._sites.dc._msdcs.corp.com. SSSD and Windows clients query site-local SRV records first.
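You can watch the site-aware lookup from any Linux client with dig: the site-scoped record first, then the domain-wide one clients fall back to (site and DC names follow the diagram above).

dig +short SRV _ldap._tcp.Mumbai._sites.dc._msdcs.corp.com
# 0 100 389 dc-mum-01.corp.com.   (priority, weight, port, target)

dig +short SRV _ldap._tcp.dc._msdcs.corp.com
# returns every DC in the domain, regardless of site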

If no DC is available in the local site, authentication falls back to a DC in another site across the WAN link. Configuring sites correctly prevents remote authentication failures from killing local operations.


Group Policy: LDAP + Sysvol

GPOs are stored in two places:

LDAP — the CN=Policies,CN=System,DC=corp,DC=com container holds GPO metadata objects. Each GPO has a GUID, a display name, and version numbers. The gPLink attribute on OUs and the domain root links GPOs to where they apply.

Sysvol — the actual policy templates and scripts live in \\corp.com\SYSVOL\corp.com\Policies\{GPO-GUID}\. Sysvol is a DFS-R (Distributed File System Replication) share replicated to every DC.

When a Windows client applies Group Policy:
1. LDAP query: what GPOs are linked to my OU chain?
2. Sysvol fetch: download the policy templates from the GPO’s Sysvol path
3. Apply: process Registry settings, Security settings, Scripts

Linux clients don’t process GPOs natively. The adcli and sssd tools interpret a small subset of AD policy (password policy, account lockout) via LDAP. Full GPO processing on Linux requires Samba’s samba-gpupdate or third-party tools.
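You can observe step 1 from a Linux machine by reading the gPLink attribute on an OU directly; the OU path here is an example, and the GUID shown is the well-known Default Domain Policy.

ldapsearch -Y GSSAPI -H ldap://dc.corp.com \
  -b "ou=Linux Servers,dc=corp,dc=com" -s base gPLink gPOptions
# gPLink: [LDAP://cn={31B2F340-016D-11D2-945F-00C04FB984F9},cn=policies,cn=system,DC=corp,DC=com;0]
# The GUID maps to the Sysvol path \\corp.com\SYSVOL\corp.com\Policies\{GUID}\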


Joining Linux to AD

# Install required packages
dnf install -y realmd sssd adcli samba-common

# Discover the domain
realm discover corp.com
# corp.com
#   type: kerberos
#   realm-name: CORP.COM
#   domain-name: corp.com
#   configured: no
#   server-software: active-directory
#   client-software: sssd

# Join
realm join corp.com -U Administrator
# Prompts for Administrator password
# Creates machine account in AD
# Configures sssd.conf, krb5.conf, nsswitch.conf, pam.d automatically

# Verify
realm list
id [email protected]

What the join does:

  1. Creates a machine account HOSTNAME$ in CN=Computers,DC=corp,DC=com
  2. Sets a machine password (rotated automatically by SSSD)
  3. Retrieves a Kerberos keytab to /etc/krb5.keytab
  4. Configures SSSD with id_provider = ad, auth_provider = ad
  5. Updates /etc/nsswitch.conf to include sss
  6. Updates /etc/pam.d/ to include pam_sss

After joining, SSSD uses the machine’s Kerberos keytab to authenticate to the DC and query LDAP — no hardcoded service account credentials required.
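You can verify that chain yourself after the join: list the keytab and request a TGT as the machine account (the hostname here is an example).

klist -k /etc/krb5.keytab     # shows host/web01.corp.com@CORP.COM and WEB01$@CORP.COM entries
kinit -k 'WEB01$'             # authenticate as the machine account, no password prompt
klist                         # TGT for WEB01$@CORP.COM issued by the domain's KDC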


LDAP Queries Against AD from Linux

# Find a user (after kinit or with -w password)
ldapsearch -Y GSSAPI -H ldap://dc.corp.com \
  -b "dc=corp,dc=com" \
  "(sAMAccountName=vamshi)" \
  sAMAccountName mail memberOf

# Find all members of a group
ldapsearch -Y GSSAPI -H ldap://dc.corp.com \
  -b "dc=corp,dc=com" \
  "(cn=engineers)" \
  member

# Find all AD-joined Linux machines
ldapsearch -Y GSSAPI -H ldap://dc.corp.com \
  -b "dc=corp,dc=com" \
  "(&(objectClass=computer)(operatingSystem=*Linux*))" \
  cn operatingSystem lastLogonTimestamp

# Find disabled accounts
ldapsearch -Y GSSAPI -H ldap://dc.corp.com \
  -b "dc=corp,dc=com" \
  "(userAccountControl:1.2.840.113556.1.4.803:=2)" \
  sAMAccountName

The last filter uses an LDAP extensible match (1.2.840.113556.1.4.803 is the OID of the bitwise-AND matching rule, LDAP_MATCHING_RULE_BIT_AND). userAccountControl:1.2.840.113556.1.4.803:=2 means “entries where userAccountControl AND 2 equals 2” — i.e., the ACCOUNTDISABLE bit is set. The extensible-match filter syntax is standard LDAP; the bitwise matching rule itself is a Microsoft AD extension.
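The same matching rule works for any bit in the userAccountControl table above. For example, accounts whose password never expires (bit 65536):

ldapsearch -Y GSSAPI -H ldap://dc.corp.com \
  -b "dc=corp,dc=com" \
  "(userAccountControl:1.2.840.113556.1.4.803:=65536)" \
  sAMAccountName pwdLastSet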


⚠ Common Misconceptions

“AD is just Microsoft’s LDAP.” AD is LDAP + Kerberos + DNS + DFS-R + GPO, all tightly integrated and with a schema that the Microsoft ecosystem depends on. You can query AD with standard ldapsearch. You cannot replace it with OpenLDAP without breaking every Windows client.

“Linux machines in AD get GPO.” Linux machines appear in AD and can be organized into OUs. Standard GPOs don’t apply to them. Samba’s samba-gpupdate can process a subset of AD policy for Linux — mostly Registry and Security settings mapped to Linux equivalents.

“realm leave removes the machine cleanly.” realm leave removes local configuration but does not delete the machine account from AD. The stale computer object stays in CN=Computers until an AD admin deletes it. Always run realm leave && adcli delete-computer -U Administrator for a clean removal.


Framework Alignment

  • CISSP Domain 5 (Identity and Access Management): AD is the dominant enterprise identity store — understanding its LDAP structure, Kerberos realm, and GPO model is essential for IAM in mixed environments
  • CISSP Domain 4 (Communications and Network Security): AD replication traffic (RPC, LDAP, Kerberos) is a significant portion of enterprise WAN traffic — Sites and site links are a network security and performance design decision
  • CISSP Domain 3 (Security Architecture and Engineering): AD forest/domain/OU hierarchy is an architectural decision with long-term security consequences — getting OU structure wrong constrains GPO delegation for years

Key Takeaways

  • AD is LDAP + Kerberos + DNS + GPO + DFS-R — not a product that “uses” these; they’re the implementation
  • Replication is multi-master via USN + GUID; the KCC builds the topology automatically from Sites configuration
  • objectGUID is the stable identifier — not the DN, which changes on rename/move
  • realm join is the correct way to join Linux to AD — it configures SSSD, Kerberos, PAM, and NSS correctly in one command
  • userAccountControl is the bitmask that controls account state — (userAccountControl:1.2.840.113556.1.4.803:=2) finds disabled accounts

What’s Next

EP09 covered AD — LDAP and Kerberos inside the corporate network. EP10 covers what happens when identity needs to work across the internet, where Kerberos doesn’t reach: SAML, OAuth2, and OIDC — the protocols that let identity leave the building.

Next: SAML vs OIDC vs OAuth2: Which Protocol Handles Which Identity Problem

Get EP10 in your inbox when it publishes → linuxcent.com/subscribe

FreeIPA: LDAP + Kerberos + PKI in a Single Linux Identity Stack

Reading Time: 5 minutes

The Identity Stack, Episode 8
EP07: LDAP HA ← EP08 → EP09: Active Directory → …

Focus Keyphrase: FreeIPA setup
Search Intent: Investigational
Meta Description: FreeIPA integrates 389-DS, MIT Kerberos, Dogtag PKI, and SSSD into one Linux identity stack. Here’s what it gives you and how to use it effectively. (153 chars)


TL;DR

  • FreeIPA is 389-DS (LDAP) + MIT Kerberos + Dogtag PKI + Bind DNS + SSSD — one ipa-server-install command gets you an enterprise identity platform
  • Host-Based Access Control (HBAC) lets you define centrally: which users can SSH to which hosts — no more managing /etc/security/access.conf per machine
  • Sudo rules from the directory: define sudo policy centrally, have every machine pull it — no /etc/sudoers.d/ files scattered across the fleet
  • ipa CLI is the management interface — ipa user-add, ipa group-add, ipa hbacrule-add — everything that took five LDAP commands takes one ipa command
  • FreeIPA trusts with Active Directory let Linux machines authenticate AD users without joining the AD domain
  • The right choice for Linux-centric environments; AD is the right choice when Windows clients dominate

The Big Picture: What FreeIPA Integrates

┌─────────────────────────────────────────────────────────┐
│                    FreeIPA Server                        │
│                                                         │
│  389-DS (LDAP)    MIT Kerberos    Dogtag PKI            │
│  ─────────────    ───────────     ─────────             │
│  User/group       TGT + service   Machine certs         │
│  storage          ticket issuing  User certs             │
│                                   OCSP / CRL            │
│  Bind DNS         SSSD (client)   Apache (WebUI)        │
│  ──────────       ────────────    ──────────────        │
│  SRV records      Enrollment      Management UI         │
│  for KDC/LDAP     automation      REST API              │
└─────────────────────────────────────────────────────────┘
              ▲                  ▲
              │ enrollment       │ SSH + sudo rules
   ┌──────────┴──────────┐  ┌───┴──────────────────┐
   │  Linux client        │  │  Linux client         │
   │  (ipa-client-install)│  │  (ipa-client-install) │
   └─────────────────────┘  └──────────────────────┘

EP06 and EP07 built OpenLDAP from components. FreeIPA gives you all of that plus Kerberos, PKI, DNS, and HBAC — opinionated, integrated, and managed through a single CLI and WebUI. This episode shows what you actually get from it.


Why FreeIPA Instead of Bare OpenLDAP

Running bare OpenLDAP requires you to:
– Configure schema for POSIX accounts, SSH keys, sudo rules, HBAC manually
– Set up MIT Kerberos separately and integrate it with LDAP
– Build your own PKI for machine certificates
– Maintain DNS SRV records for Kerberos discovery
– Write client enrollment scripts
– Build a management interface (or live in LDIF)

FreeIPA does all of this in one installer, with a consistent data model across all components. The trade-off is opacity — FreeIPA makes decisions for you (schema, replication topology, Kerberos realm name) that bare OpenLDAP leaves to you.


Installing FreeIPA Server

# RHEL / Rocky / AlmaLinux
dnf install -y freeipa-server freeipa-server-dns

# Run the installer (interactive)
ipa-server-install

# Or non-interactive:
ipa-server-install \
  --realm=CORP.COM \
  --domain=corp.com \
  --ds-password=DM_password \
  --admin-password=Admin_password \
  --setup-dns \
  --forwarder=8.8.8.8 \
  --unattended

# After install: get an admin Kerberos ticket
kinit admin

The installer creates:
– 389-DS instance with the FreeIPA schema
– MIT KDC with realm CORP.COM
– Dogtag CA and all certificate infrastructure
– Bind DNS with SRV records for the KDC and LDAP server
– Apache WebUI at https://ipa.corp.com/ipa/ui/
– SSSD configured on the server itself

Time: 5–10 minutes, for a stack that used to take a week of manual configuration to assemble.
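A quick sanity check once the installer finishes, using the admin ticket from the kinit above (ipa-healthcheck ships as a separate package and is optional):

ipa ping              # confirms Kerberos auth and the IPA API both respond
ipa config-show       # realm, default shell, default user settings
ipa-healthcheck       # full component health report, if the package is installed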


The ipa CLI

Every management action goes through ipa. It talks to the IPA server’s REST API and handles Kerberos authentication transparently (it uses your kinit session).

# Users
ipa user-add vamshi \
  --first=Vamshi --last=Krishna \
  [email protected] \
  --password

ipa user-show vamshi
ipa user-find --all              # search all users
ipa user-disable vamshi          # lock account without deleting
ipa user-mod vamshi --shell=/bin/zsh

# Groups
ipa group-add engineers --desc "Engineering team"
ipa group-add-member engineers --users=vamshi,alice

# Password policy
ipa pwpolicy-mod --minlength=12 --maxlife=90 --history=10

# SSH public keys — stored centrally, pushed to every host
ipa user-mod vamshi --sshpubkey="ssh-ed25519 AAAA..."
# SSSD on enrolled hosts will use this key for SSH login — no authorized_keys file needed
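How that works on the host side: ipa-client-install typically points sshd at SSSD's key lookup helper, so sshd asks the directory for the key at login time. The paths below are the common RHEL-family defaults set by the installer.

# In /etc/ssh/sshd_config, set by the client installer:
#   AuthorizedKeysCommand /usr/bin/sss_ssh_authorizedkeys
#   AuthorizedKeysCommandUser nobody

# Check what sshd would receive for a user:
sss_ssh_authorizedkeys vamshi
# ssh-ed25519 AAAA...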

Host-Based Access Control (HBAC)

HBAC is the feature that justifies FreeIPA for most Linux shops. It lets you define centrally: which users (or groups) can log in to which hosts (or host groups), using which services (SSH, sudo, FTP).

Without HBAC, access control is per-machine: /etc/security/access.conf or PAM pam_access rules, replicated across every server, managed inconsistently.

With HBAC: one rule, enforced everywhere.

# Create host groups
ipa hostgroup-add production-servers --desc "Production Linux hosts"
ipa hostgroup-add-member production-servers --hosts=web01.corp.com,db01.corp.com

# Create user groups
ipa group-add sre-team
ipa group-add-member sre-team --users=vamshi,alice

# Create an HBAC rule
ipa hbacrule-add allow-sre-to-prod \
  --desc "SRE team can SSH to production"
ipa hbacrule-add-user allow-sre-to-prod --groups=sre-team
ipa hbacrule-add-host allow-sre-to-prod --hostgroups=production-servers
ipa hbacrule-add-service allow-sre-to-prod --hbacsvcs=sshd

# Test the rule before applying
ipa hbactest \
  --user=vamshi \
  --host=web01.corp.com \
  --service=sshd
# Access granted: True
# Matched rules: allow-sre-to-prod

SSSD on each enrolled host enforces the HBAC rules at login time by querying the IPA server. No per-machine configuration. Add a new server to the production-servers host group and the HBAC rules apply immediately.


Sudo Rules from the Directory

# Create a sudo rule
ipa sudorule-add allow-sre-sudo \
  --cmdcat=all \
  --desc "SRE team gets full sudo on production"
ipa sudorule-add-user allow-sre-sudo --groups=sre-team
ipa sudorule-add-host allow-sre-sudo --hostgroups=production-servers

# Or a scoped rule — only specific commands
ipa sudorule-add allow-service-restart
ipa sudocmdgroup-add service-commands
ipa sudocmd-add /usr/bin/systemctl
ipa sudocmdgroup-add-member service-commands --sudocmds="/usr/bin/systemctl"
ipa sudorule-add-allow-command allow-service-restart --sudocmdgroups=service-commands

On enrolled hosts, SSSD’s sssd_sudo responder pulls these rules and the sudo command evaluates them locally. No /etc/sudoers.d/ files. Central policy, local enforcement.


Enrolling a Client

# On the client machine
dnf install -y freeipa-client

ipa-client-install \
  --domain=corp.com \
  --server=ipa.corp.com \
  --realm=CORP.COM \
  --principal=admin \
  --password=Admin_password \
  --unattended

# What this does:
# 1. Registers the host in IPA as a machine principal
# 2. Retrieves a host Kerberos keytab (/etc/krb5.keytab)
# 3. Configures SSSD (sssd.conf, nsswitch.conf, pam.d)
# 4. Configures Kerberos (/etc/krb5.conf)
# 5. Optionally configures NTP and DNS

After enrollment: getent passwd vamshi returns the IPA user. SSH with an IPA password works. HBAC rules are enforced. Sudo rules from the directory apply. SSH public keys from the user’s IPA profile work without authorized_keys files.
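A short post-enrollment checklist, run on the client as root (user and host names are examples):

getent passwd vamshi                                   # resolved through SSSD from IPA
ipa hbactest --user=vamshi --host=$(hostname -f) --service=sshd
sudo -l -U vamshi                                      # sudo rules pulled from the directory
ssh vamshi@$(hostname -f) true                         # key served from the user's IPA profile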


FreeIPA Trust with Active Directory

In mixed environments (Linux servers + Windows clients), you can establish a trust between FreeIPA and AD without joining the Linux servers to the AD domain directly.

# On the IPA server (after installing ipa-server-trust-ad)
ipa-adtrust-install --netbios-name=CORP

# Establish the trust
ipa trust-add ad.corp.com \
  --admin=Administrator \
  --password \
  --type=ad

# AD users can now log in to IPA-enrolled Linux hosts
# They appear as: CORP.COM\username or [email protected]

Under the hood: ipa-adtrust-install configures Samba on the IPA server so it can present itself as an AD-compatible forest for the trust relationship. AD users’ cross-realm Kerberos tickets from the AD KDC are accepted on IPA-enrolled hosts, and SSSD maps those users to POSIX attributes stored in IPA (or generated automatically via ID mapping).
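Verifying the trust from the Linux side (names follow the example above):

ipa trust-show ad.corp.com            # trust type, direction, and status
id [email protected]              # AD user resolved with a UID/GID from ID mapping
getent passwd [email protected]   # the same lookup through NSS/SSSD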


⚠ Common Misconceptions

“FreeIPA is just OpenLDAP with a UI.” FreeIPA uses 389-DS (not OpenLDAP), adds a full Kerberos KDC, a certificate authority, DNS, HBAC enforcement, and sudo management — all with a consistent schema designed for these use cases. It’s an integrated identity platform, not a wrapper.

“HBAC rules replace firewall rules.” HBAC controls who can log in to a host at the authentication layer — not network access. An HBAC denial means the SSH login is rejected only after the TCP connection has already been established. You still need firewall rules to block TCP access.

“FreeIPA replicas are identical.” FreeIPA uses 389-DS Multi-Supplier replication. All replicas accept reads and writes. But the CA is separate — only the initial server (and explicitly designated CA replicas) run the CA. If the CA goes down, certificate operations stop; authentication does not.


Framework Alignment

  • CISSP Domain 5 (Identity and Access Management): FreeIPA is an enterprise IAM platform — HBAC, sudo policy, SSH key management, and certificate-based authentication are all IAM controls
  • CISSP Domain 3 (Security Architecture and Engineering): FreeIPA’s integrated CA enables certificate-based authentication for machines and users — a stronger authentication factor than passwords
  • CISSP Domain 1 (Security and Risk Management): Centralized HBAC and sudo policy reduces the attack surface of privilege escalation — no more inconsistent sudoers files that drift across the fleet

Key Takeaways

  • FreeIPA = 389-DS + MIT Kerberos + Dogtag PKI + Bind DNS — one installer, one management interface
  • HBAC rules define centrally who can SSH to which host groups — enforced by SSSD on every enrolled client, no per-machine config
  • Sudo rules from the directory replace scattered /etc/sudoers.d/ files — central policy, SSSD-enforced locally
  • ipa hbactest lets you verify access rules before a user hits a blocked login — use it before every policy change
  • For Linux-centric environments: FreeIPA. For Windows-dominant environments: AD. For mixed: FreeIPA trust with AD.

What’s Next

FreeIPA is the Linux answer to enterprise identity. EP09 covers the Microsoft answer — Active Directory — which extended LDAP and Kerberos into a complete enterprise platform with Group Policy, Sites, and a replication model built for global scale.

Next: How Active Directory Works: LDAP, Kerberos, and Group Policy Under the Hood

Get EP09 in your inbox when it publishes → linuxcent.com/subscribe