Compliance Grading — Automated OpenSCAP with A-F Scores Before Deployment

Reading Time: 6 minutes

OS Hardening as Code, Episode 4
Cloud AMI Security Risks · Linux Hardening as Code · Multi-Cloud OS Hardening · Automated OpenSCAP Compliance


TL;DR

  • “We use CIS L1” means nothing without a verified grade — automated OpenSCAP compliance provides one before any instance is deployed
  • Stratum runs OpenSCAP after every build and attaches the grade to the image metadata: cis-l1-A-95
  • Grades are A through F based on percentage of controls passing, with explicit accounting for documented overrides
  • SARIF output is machine-readable — importable directly into GitHub Advanced Security, Jira, or any SIEM
  • Drift detection: rescan any running instance against the original blueprint and see exactly which controls changed since the image was built
  • An image that scores below your minimum grade threshold doesn’t get snapshotted — it doesn’t exist

The Problem: A Grade That’s Never Been Verified Is Not a Grade

Security audit request:
"Provide CIS L1 compliance evidence for all production instances"

Team response:
  Instance A: "CIS L1 hardened" — OpenSCAP last run: 4 months ago
  Instance B: "CIS L1 hardened" — OpenSCAP last run: never
  Instance C: "CIS L1 hardened" — OpenSCAP version: 1.2 (current: 1.3.8)
  Instance D: "CIS L1 hardened" — manual scan output: "87% passing"
  Instance E: "CIS L1 hardened" — manual scan output: "91% passing"

"Which profile was used for D and E? Are they comparable?"
"Were they scanned before or after a recent kernel update?"
"Why is C running an old OpenSCAP version?"

Automated OpenSCAP compliance means the grade is generated the same way, on every image, every time, before the image is ever deployed.

EP03 showed that the same HardeningBlueprint YAML builds consistent OS images across six cloud providers. What it left open is the question every auditor eventually asks: how do you know the Ansible hardening actually did what you think it did? Running Ansible-Lockdown successfully means the tasks ran. It does not mean every CIS control is satisfied — some controls can’t be applied by Ansible alone, some require manual verification, and some interact with the environment in unexpected ways.


A compliance team requested CIS L2 evidence for a SOC 2 Type II audit. The security team had been running OpenSCAP scans — but manually, on-demand, using slightly different profiles across teams, with no standard for how to store or compare results.

The audit found four problems:
1. Two instances had been scanned with CIS L1, not L2, despite being labeled “CIS L2”
2. Three instances hadn’t been scanned in over six months
3. The scan outputs from different teams were in different formats (HTML vs XML vs text)
4. Two instances showed “91% passing” and “89% passing” — with no documentation of whether those were acceptable thresholds or what the failing controls were

The finding wasn’t a security failure — it was a documentation and process failure. But it consumed two weeks of engineering time to resolve and appeared in the audit report as a gap.

The root cause: compliance scanning was a manual step that produced inconsistent output in an inconsistent format.


How Automated OpenSCAP Compliance Works

Every Stratum build ends with an automated OpenSCAP scan:

stratum build --blueprint ubuntu22-cis-l1.yaml --provider aws
      │
      ├─ Provisions build instance
      │
      ├─ Runs Ansible-Lockdown (144 tasks)
      │
      ├─ Runs post-build OpenSCAP scan
      │    ├── Profile: CIS Ubuntu 22.04 L1 (from blueprint)
      │    ├── OpenSCAP version: pinned in blueprint (default: latest)
      │    └── 100 controls checked
      │
      ├─ Calculates grade
      │    ├── Passing:   93 controls
      │    ├── Failing:   5 controls
      │    ├── Overrides: 2 (documented in blueprint)
      │    └── Grade: A (95/100 effective)
      │
      ├─ Writes to image metadata:
      │    compliance_grade=cis-l1-A-95
      │    compliance_scan_date=2026-04-19
      │    compliance_profile=cis-ubuntu22.04-l1
      │
      └─ Snapshots AMI (or fails if grade < min_grade)

The grade is written into the AMI (or GCP/Azure image) metadata at creation time. It travels with the image. Any instance launched from this AMI carries the provenance of what was applied and what grade was achieved.
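If the grade lands in the AMI’s tags (an assumption about where the metadata is exposed on AWS; adapt to wherever it actually lives for your provider), any automation can read it back with a few lines of boto3:

import boto3

# Read the compliance metadata back from an AMI's tags (illustrative sketch;
# the tag keys mirror the fields shown in the build diagram above).
ec2 = boto3.client("ec2", region_name="ap-south-1")
image = ec2.describe_images(ImageIds=["ami-0a7f3c9e82d1b4c05"])["Images"][0]
tags = {t["Key"]: t["Value"] for t in image.get("Tags", [])}

print(tags.get("compliance_grade"))       # e.g. cis-l1-A-95
print(tags.get("compliance_scan_date"))   # e.g. 2026-04-19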


The A-F Grade Calculation

The grade is not a simple percentage. It accounts for documented overrides and applies a threshold-based letter scale:

Total CIS controls:    100
Passing:               93
Failing:               5 (genuine failures)
Overrides (compliant): 2 (documented in blueprint, counted as passing)

Effective passing:     95 / 100
Grade:                 A

Grade thresholds (configurable per blueprint):

Grade   Default threshold   Meaning
A       ≥ 95% effective     Production-ready, minimal exceptions
B       85–94%              Acceptable with documented exceptions
C       70–84%              Below standard — deploy with caution
D       55–69%              Significant gaps — do not deploy to production
F       < 55%               Hardening failed — image not snapshotted

The thresholds are configurable in the blueprint:

compliance:
  benchmark: cis-l1
  controls: all
  min_grade: B          # Build fails if grade < B
  grade_thresholds:
    A: 95
    B: 85
    C: 70
    D: 55

If the build produces a grade below min_grade, the instance is terminated and no image is created. The failure is logged with the full list of controls that blocked the grade.
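For readers who want the arithmetic spelled out, here is a rough Python sketch of the grading logic described above. It is illustrative only, not Stratum’s implementation; the thresholds are the defaults from the table.

# Sketch of the grade arithmetic (illustrative, not Stratum's code).
GRADE_THRESHOLDS = {"A": 95, "B": 85, "C": 70, "D": 55}   # percent effective passing

def compute_grade(passing, failing, overrides):
    total = passing + failing + overrides
    effective = passing + overrides            # documented overrides count as passing
    pct = 100 * effective // total
    for letter, threshold in GRADE_THRESHOLDS.items():
        if pct >= threshold:
            return letter, pct
    return "F", pct

grade, pct = compute_grade(passing=93, failing=5, overrides=2)
print(grade, pct)                              # A 95

min_grade = "B"
order = "ABCDF"
if order.index(grade) > order.index(min_grade):
    raise SystemExit("grade below min_grade — image not snapshotted")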


Reading the Scan Output

# Show the last build's scan results
stratum scan --show-last --blueprint ubuntu22-cis-l1.yaml

# Output:
# Build: ubuntu22-cis-l1 @ 2026-04-19T15:42:01Z
# Provider: aws (ap-south-1)
# Grade: A (95/100 effective controls)
#
# Passing controls: 93
# Failing controls: 5
# ──────────────────────────────────────────────
# FAIL  1.1.7   Ensure separate partition for /var/log/audit
#       Reason: tmpfs used — separate block device not configured
#       Remediation: Add /var/log/audit to separate EBS volume
#
# FAIL  1.6.1.3 Ensure AppArmor is enabled in bootloader config
#       Reason: GRUB_CMDLINE_LINUX missing apparmor=1 security=apparmor
#       Remediation: Update /etc/default/grub, run update-grub, reboot
#
# FAIL  3.1.1   Ensure IPv6 is disabled if not needed
#       Reason: net.ipv6.conf.all.disable_ipv6=0
#       Remediation: Set in /etc/sysctl.d/60-kernel-hardening.conf
# ...
#
# Overrides (compliant): 2
# ──────────────────────────────────────────────
# OVERRIDE  1.1.2   tmpfs /tmp via systemd unit — equivalent control
# OVERRIDE  5.2.4   SSH timeout managed by session manager policy

The failing controls tell you exactly what to fix and how to fix it. This is the difference between “87% passing” as a number and “87% passing” as an actionable gap list.


SARIF Export

Every scan produces a SARIF (Static Analysis Results Interchange Format) file:

# Export scan results to SARIF
stratum scan \
  --instance i-0abc123 \
  --benchmark cis-l1 \
  --output sarif \
  --out-file scan-results/i-0abc123-cis-l1.sarif

SARIF is the standard interchange format for security scan results. It can be imported into:

  • GitHub Advanced Security — upload via the github/codeql-action/upload-sarif action, results appear in the Security tab
  • Jira — import as security findings, linked to the image or instance ID
  • Splunk / SIEM — structured JSON, parseable as events
  • AWS Security Hub — convert to ASFF findings and import via the Security Hub API

For audit purposes, the SARIF file is the evidence artifact. It contains the full scan profile, every control result, the OpenSCAP version, the scan timestamp, and the machine it was run against.
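As a rough illustration of what “machine-readable” buys you, here is a short Python sketch that summarizes a SARIF file using only the standard library. It assumes failed controls are reported at the SARIF "error" level, which is a convention of this example rather than something the scanner guarantees.

import json
from collections import Counter

# Summarize the SARIF file exported above.
with open("scan-results/i-0abc123-cis-l1.sarif") as f:
    sarif = json.load(f)

results = sarif["runs"][0]["results"]                    # one run per scan
print(Counter(r.get("level", "none") for r in results))  # counts by result level

for r in results:
    if r.get("level") == "error":                        # assumed mapping: error = failed control
        print(r["ruleId"], "-", r["message"]["text"])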

# Upload to GitHub Advanced Security
stratum scan \
  --instance i-0abc123 \
  --benchmark cis-l1 \
  --output sarif \
  --github-upload \
  --github-ref $GITHUB_REF \
  --github-sha $GITHUB_SHA

Drift Detection

The grade at build time is the baseline. Any instance can be rescanned against the blueprint that built it:

# Rescan a running instance
stratum scan --instance i-0abc123 --blueprint ubuntu22-cis-l1.yaml

# Output:
# Instance: i-0abc123 (launched from ami-0a7f3c9e82d1b4c05)
# Original grade (build):  A (95/100) — 2026-01-15
# Current grade (rescan):  B (87/100) — 2026-04-19
#
# Drifted controls (8):
#   3.3.2  TCP SYN cookies: FAIL — net.ipv4.tcp_syncookies=0
#           Last passing: 2026-01-15 (build)
#           Current value: 0 (expected: 1)
#
#   5.3.2  sudo log_input: FAIL — rule removed from /etc/sudoers.d/
#           Last passing: 2026-01-15 (build)
#           Current value: [rule absent] (expected: Defaults log_input)

Drift detection is how you find the instances that were “temporarily” modified and never reverted. The scan compares the current state against the baseline — not against a generic CIS profile, but against the specific blueprint version that built the image.
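Conceptually, the drift comparison is just a per-control diff between the build-time baseline and the rescan. A toy sketch (control IDs reused from the output above, states invented for illustration):

# Illustrative drift diff: compare per-control results from the build-time
# baseline against a rescan (states are hypothetical).
baseline = {"3.3.2": "pass", "5.3.2": "pass", "1.1.7": "fail"}
rescan   = {"3.3.2": "fail", "5.3.2": "fail", "1.1.7": "fail"}

drifted = {
    control: (baseline[control], rescan[control])
    for control in baseline
    if rescan.get(control) != baseline[control]
}
for control, (was, now) in sorted(drifted.items()):
    print(f"{control}: {was} -> {now}")
# 3.3.2: pass -> fail
# 5.3.2: pass -> fail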


Scanning Without a Build: Assessing Existing Instances

For instances not built with Stratum, you can run a standalone scan:

# Assess an existing instance against CIS L1
stratum scan --instance i-0legacy123 --benchmark cis-l1

# No blueprint comparison — just the raw CIS grade
# Output:
# Grade: C (72/100)
# 28 controls failing
# ...

This is useful for assessing the state of instances built before Stratum was in use, or for comparing a manual hardening approach against the benchmark.


What Controls Typically Block an A Grade

For Ubuntu 22.04 CIS L1 builds in most cloud environments, these are the controls that most commonly prevent an A grade:

1.1.7   /var/log/audit separate partition
        Why it often fails: cloud images don’t have separate volumes at build time
        Fix: add an EBS volume, configure it at launch

1.6.1   AppArmor bootloader config
        Why it often fails: GRUB parameters not set correctly
        Fix: update /etc/default/grub, run update-grub

3.1.1   Disable IPv6
        Why it often fails: cloud networking sometimes requires IPv6
        Fix: override with a documented reason if intentional

5.2.21  SSH MaxStartups
        Why it often fails: default sshd_config not updated
        Fix: add MaxStartups 10:30:60 to sshd_config

6.1.10  World-writable files
        Why it often fails: some package installations leave world-writable files
        Fix: post-install cleanup in the Ansible role

The first two (separate audit partition, AppArmor bootloader) are the most common A→B blockers and often require architecture decisions about how volumes are provisioned at launch versus build time.


Key Takeaways

  • Automated OpenSCAP compliance means every image has a verified, reproducible grade generated by the same scanner with the same profile, before it’s ever deployed
  • The A-F grade accounts for documented overrides from the blueprint — the failing controls in the output are genuine gaps, not known exceptions
  • SARIF export makes scan results importable into GitHub Advanced Security, Jira, SIEM, and audit tooling
  • Drift detection catches configuration changes that happen after the image is deployed — the grade at build time is the baseline
  • Images that score below min_grade don’t get snapshotted — the failed build tells you exactly which controls to fix

What’s Next

Automated OpenSCAP compliance gives every image a verified grade before deployment. What EP04 left open is what happens after the grade is known — specifically, what prevents an engineer from deploying a C-grade image to production “just this once.”

The Pipeline API is the answer. EP05 covers the CI/CD compliance gate: POST /api/pipeline/scan fails the build if the image grade is below threshold. The unhardened image never reaches production — not because engineers are disciplined, but because the pipeline won’t let it through.

Next: CI/CD compliance gate — block unhardened images before they reach production

Get EP05 in your inbox when it publishes → linuxcent.com/subscribe

What Is Purple Team Security: Red + Blue = Better Defense

Reading Time: 8 minutes

What Is Purple Team Security · OWASP Top 10 Mapped to Cloud Infrastructure · Cloud Security Breaches 2020–2025


TL;DR

  • Purple team security is the practice of combining offensive (red) and defensive (blue) work in the same exercise — attackers simulate real techniques while defenders tune detection in real time
  • Traditional red team engagements produce a report; purple team produces a faster MTTD (mean time to detect)
  • The structural output is not a findings list — it’s updated detection rules, tested playbooks, and a measured detection baseline
  • Purple team is not a permanent headcount; it is a cadence of exercises run against your own infrastructure
  • Every episode in this series follows the red-blue-purple model: attack simulation → detection → structural fix

OWASP Mapping: This episode establishes the series methodology. No single OWASP category. Subsequent episodes map directly to A01 through A10.


The Big Picture

┌─────────────────────────────────────────────────────────────────┐
│                    PURPLE TEAM MODEL                            │
│                                                                 │
│   RED TEAM                    BLUE TEAM                         │
│   (Offensive)                 (Defensive)                       │
│                                                                 │
│   ┌──────────┐               ┌──────────┐                       │
│   │ Simulate │──── attack ──▶│  Detect  │                       │
│   │ attack   │               │  alert   │                       │
│   └──────────┘               └──────────┘                       │
│         │                          │                            │
│         └──────────┬───────────────┘                            │
│                    │                                            │
│              ┌─────▼──────┐                                     │
│              │  DEBRIEF   │  ← The purple layer                 │
│              │ What fired?│                                      │
│              │ What didn't│                                      │
│              │ Why?       │                                      │
│              └─────┬──────┘                                     │
│                    │                                            │
│         ┌──────────▼──────────┐                                 │
│         │  Updated detection  │                                 │
│         │  rules + playbooks  │                                 │
│         └─────────────────────┘                                 │
│                                                                 │
│   OUTCOME: Detection time drops exercise-over-exercise          │
└─────────────────────────────────────────────────────────────────┘

What is purple team security? It is the structured practice of attacking your own infrastructure — with full visibility on both sides — so that detection logic improves after every exercise, not just after a real breach.


Why Red vs. Blue Alone Fails

Eleven days.

That was how long the red team had access before the blue team detected the compromise, in an engagement I ran two years ago. It was a standard authorized engagement — well-scoped, realistic techniques, no shortcuts. The red team was good. The blue team was experienced. And still: eleven days.

The debrief was the turning point. The red team had used techniques that generated logs — CloudTrail entries, VPC Flow Log anomalies, process spawn events. The blue team had the data. The detections just weren’t tuned for these specific patterns. Nobody had ever run the techniques against this specific environment and verified whether the alerts fired.

We restructured the next exercise as a purple team exercise. Same attacker techniques. But this time, the blue team was in the room with the red team. They watched each technique execute in real time. They checked whether the alert fired. When it didn’t, they wrote the detection rule on the spot and verified it before moving to the next technique.

Detection time in the following exercise: four hours.

That is the entire argument for purple team security. Not philosophy. Not org charts. Eleven days versus four hours.


What Red Team Alone Gets Wrong

Traditional red team engagements produce a report with findings. The findings describe what the attacker did. The recommendations describe what to fix. Then the report goes to a remediation queue, the org closes the tickets over three months, and the detection logic is never tested.

The fundamental problem: a red team report tells you what happened; it doesn’t tell you whether your detection would catch it happening again.

The MITRE ATT&CK framework catalogs several hundred techniques and sub-techniques. An annual red team engagement tests maybe 20 of them against your environment. You get a PDF. You don’t get a detection baseline.

Red team alone also creates adversarial dynamics inside the organization. Red team wins when they’re not caught. Blue team wins when they catch everything. These goals are structurally opposed, which means neither team has an incentive to share information that would help the other.


What Blue Team Alone Gets Wrong

Blue team without red team input is writing detection rules in the abstract. They tune alerts based on what they think an attacker would do, not what an attacker actually does against your specific environment with your specific tooling.

Signature-based detection catches known-bad. Behavioral detection catches anomalies. Neither catches a sophisticated attacker who has studied your baseline — unless you’ve explicitly tested whether the behavior that attacker uses registers as an anomaly in your environment.

Blue teams also tend toward alert fatigue. When everything fires, nothing gets investigated. Tuning requires knowing which signals correspond to real techniques, and that knowledge only comes from running the techniques.


The Purple Team Model: How It Actually Works

Purple team security is not a permanent team structure. You don’t hire a purple team. You run purple team exercises.

The exercise structure:

1. SCOPE          — agree on the attack scenario (e.g., "compromised developer credentials")
2. RED EXECUTES   — red team runs the first technique in the scenario
3. BLUE OBSERVES  — blue team watches for the alert; records: fired / not fired / noisy
4. DEBRIEF        — immediate, technique by technique. Why didn't it fire? What data existed?
5. TUNE           — blue team updates detection rule. Red team re-runs. Verify it fires.
6. NEXT TECHNIQUE — repeat for every technique in the scenario
7. MEASURE        — record detection rate and detection time at the end of the exercise

The output of a purple team exercise is not a PDF. It is:
– Updated detection rules (tested and verified)
– A measured detection time for each technique
– A documented attack scenario with the specific commands used
– A baseline for the next exercise to beat

This is what “purple” means: the red and blue work together, in the same room or on the same call, producing improved defense as a direct output of the attack simulation.


The MITRE ATT&CK Scaffolding

Every purple team exercise is anchored to ATT&CK techniques. ATT&CK provides the shared vocabulary: red team uses technique T1078 (Valid Accounts), blue team knows which data sources detect T1078, and the exercise verifies whether those detections are actually implemented and tuned.

MITRE ATT&CK Technique
         │
         ├── Tactic: Initial Access / Persistence / Lateral Movement / ...
         ├── Data Sources: CloudTrail, Process events, Network traffic, ...
         ├── Detection: What behavioral indicator to look for
         └── Mitigations: What configuration change prevents or limits it

When you scope a purple team exercise using ATT&CK, you get explicit coverage tracking. After six exercises, you can report: “We have verified detections for 47 of the 112 techniques most relevant to our threat model. These 65 are not yet covered.”

That is a measurable security posture improvement. It is auditable. It is repeatable.
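A coverage report like that doesn’t need tooling to get started; a dictionary keyed by ATT&CK technique ID is enough for the first few exercises. A minimal sketch (the technique IDs are real ATT&CK identifiers, the statuses are made up for illustration):

# Toy coverage tracker for exercise-over-exercise reporting.
coverage = {
    "T1078": "verified",      # Valid Accounts — detection fired in last exercise
    "T1530": "verified",      # Data from Cloud Storage
    "T1098": "not-covered",   # Account Manipulation
    "T1110": "noisy",         # Brute Force — alert fires but drowns in noise
}

verified = sum(1 for status in coverage.values() if status == "verified")
print(f"verified detections: {verified}/{len(coverage)} "
      f"({100 * verified // len(coverage)}%)")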


Where OWASP Fits in This Series

This series uses OWASP Top 10 (2021) as the threat taxonomy, not ATT&CK. The reason: OWASP Top 10 maps directly to the classes of vulnerability that caused the major breaches between 2020 and 2025 — and it is familiar to the developers and architects who need to remediate them.

The next episode maps every OWASP Top 10 category to its cloud and Kubernetes infrastructure equivalent. Most engineers think OWASP applies only to web applications. It doesn’t. Broken Access Control (A01) is the S3 bucket that’s public when it shouldn’t be. Cryptographic Failures (A02) is the environment variable with a plaintext database password committed to GitHub. Injection (A03) is the unsanitized input that reaches a shell command or database query on your infrastructure; Server-Side Request Forgery (A10) is the request that hits the EC2 metadata endpoint from inside your own application.

The framing shifts. The categories don’t.


Red Phase Primer: How Attack Simulations Work in This Series

Every episode from EP04 onward follows this structure:

Red phase — the technique the attacker uses, with the actual commands. Not “the attacker exploited misconfigured IAM.” The actual aws CLI command or kubectl invocation that demonstrates the technique. Commands are safe for authorized use in your own environment or a test account.

Blue phase — what detection looks like. The CloudTrail event, the GuardDuty finding, the Falco rule, the SIEM query. If it doesn’t fire by default, the episode says so explicitly — and shows you how to make it fire.

Purple phase — the structural fix. Not “train your developers to be more careful.” The IAM policy, the SCPs, the network control, the pre-commit hook. The thing that makes the vulnerability not exist, not the thing that makes humans try harder to avoid it.


Run This in Your Own Environment: Baseline Your Current Detection Coverage

Before EP02, establish a detection baseline. This tells you where you start, so later exercises have a number to beat.

aws guardduty list-findings \
  --detector-id $(aws guardduty list-detectors --query 'DetectorIds[0]' --output text) \
  --finding-criteria '{
    "Criterion": {
      "updatedAt": {
        "GreaterThanOrEqual": '$(date -d '30 days ago' +%s000)'
      }
    }
  }' \
  --query 'FindingIds' --output text | \
  xargs -n 50 aws guardduty get-findings \
    --detector-id $(aws guardduty list-detectors --query 'DetectorIds[0]' --output text) \
    --finding-ids | \
  jq '.Findings[] | {type: .Type, severity: .Severity, count: 1}' | \
  jq -s 'group_by(.type) | map({type: .[0].type, count: length})'

# Check that CloudTrail trails exist and are multi-region (use get-trail-status for IsLogging)
aws cloudtrail describe-trails --query 'trailList[].{Name:Name,MultiRegion:IsMultiRegionTrail,HomeRegion:HomeRegion}' --output table

# Check if S3 server access logging is enabled on all buckets
aws s3api list-buckets --query 'Buckets[].Name' --output text | \
  tr '\t' '\n' | \
  while read bucket; do
    logging=$(aws s3api get-bucket-logging --bucket "$bucket" 2>/dev/null)
    if [ -z "$logging" ] || echo "$logging" | grep -q '{}'; then
      echo "NO LOGGING: $bucket"
    else
      echo "LOGGING OK: $bucket"
    fi
  done

Record your current findings count by category and the number of buckets without logging. These are your pre-exercise baselines.


⚠ Common Mistakes When Starting a Purple Team Practice

Running it as an annual event. One purple team exercise per year produces a report. Monthly exercises with 3–5 techniques each produce measurable improvement in detection time. Frequency is the variable.

Letting red and blue work in separate rooms. The purple layer is the debrief. If red sends a report and blue reads it later, you’ve just done a red team engagement. The real-time shared observation is what generates the immediate detection improvement.

Measuring success as “how many vulnerabilities were found.” The right metric is detection time per technique and detection coverage across your ATT&CK or OWASP matrix. Vulnerabilities found is an output of the exercise; faster detection is the outcome.

Starting with sophisticated techniques. The first exercise should test basics: credential access, S3 enumeration, IAM privilege escalation attempts. These generate straightforward logs in CloudTrail. If your detection doesn’t catch these, it won’t catch the sophisticated stuff either. Start where the coverage gaps are most embarrassing.

No documentation of the exercise environment state. If you tune a detection rule during an exercise and then a Terraform change overwrites the policy, you’ve lost the improvement. All detection changes from exercises go through version control immediately.


Quick Reference

Term                   Definition
Purple team security   Practice of combined red/blue exercises where both teams improve detection together
MTTD                   Mean Time to Detect — the primary metric purple team exercises reduce
ATT&CK                 MITRE framework mapping adversary techniques to data sources and detections
Red phase              Attacker perspective: simulate the technique with real commands
Blue phase             Defender perspective: what detection fires (or doesn’t)
Purple phase           The joint debrief and immediate detection tuning that makes both better
Detection baseline     Measured MTTD and technique coverage before the first exercise
OWASP Top 10           Threat taxonomy used in this series — applies to infrastructure, not just web apps

Key Takeaways

  • Purple team security is a practice, not a team: structured exercises where red attacks and blue detects in real time, with joint debrief producing updated detection rules
  • The metric that matters is detection time per technique — not findings count
  • Red team alone produces a report; purple team produces a faster MTTD and tested detection coverage
  • MITRE ATT&CK provides the technique vocabulary; OWASP Top 10 provides the vulnerability taxonomy this series uses
  • Every major cloud breach 2020–2025 maps to an OWASP category — those categories are the exercise backlog for any cloud-running organization
  • Detection improvements from exercises must be version-controlled immediately or they disappear with the next infrastructure change
  • Frequency of exercises is the primary driver of improvement — monthly beats annual by an order of magnitude

What’s Next

EP02 maps every OWASP Top 10 category to its cloud infrastructure equivalent. Most engineers treat OWASP as a web application concern. The cloud security breaches from 2020 to 2025 tell a different story: the S3 bucket that became public is A01; the CI/CD pipeline secret is A08; the SSRF to EC2 metadata is A10. The taxonomy was always infrastructure-applicable. EP02 makes that mapping explicit — with the cloud-native equivalent, the real breach that demonstrates it, and the detection query to run.

Get EP02 in your inbox when it publishes → subscribe at linuxcent.com

Identity Providers Explained: On-Prem, Cloud, SCIM, and Federation

Reading Time: 6 minutes

The Identity Stack, Episode 11
EP10: SAML/OIDC → EP11 → EP12: Entra ID + Linux → …


TL;DR

  • An Identity Provider (IdP) is the system that authenticates users and issues identity assertions (SAML assertions, OIDC tokens) to applications
  • On-prem IdPs: AD FS (Microsoft), Shibboleth (universities), Keycloak (open source), Ping Identity — they sit in front of AD and speak SAML/OIDC to cloud apps
  • Cloud IdPs: Okta, Entra ID (Azure AD), Google Workspace, Ping Identity Cloud — they are the directory and the authentication layer in one
  • Federation: IdPs can trust each other — a corporate IdP can delegate to a cloud IdP, or federate with a partner org’s IdP
  • SCIM (System for Cross-domain Identity Management) is provisioning, not authentication — it creates/updates/deactivates user accounts in target systems when the source directory changes
  • The key distinction: federation (authentication flow) vs directory sync (data copy) — they solve different problems and are often deployed together

The Big Picture: Where IdPs Sit

                        On-prem Directory
                        (Active Directory / OpenLDAP / FreeIPA)
                               │
                               │ LDAP / Kerberos
                               ▼
                         Identity Provider
                         ┌──────────────────────────────────┐
                         │  AD FS / Keycloak / Okta /       │
                         │  Entra ID Connect / Shibboleth   │
                         │                                  │
                         │  Speaks: SAML 2.0 + OIDC + OAuth2│
                         └────────────────┬─────────────────┘
                                          │ assertions / tokens
                      ┌───────────────────┼───────────────────┐
                      ▼                   ▼                   ▼
               Salesforce          GitHub Enterprise      AWS IAM
               (SAML SP)           (OIDC RP)              (OIDC)

EP10 covered the protocols. This episode covers the systems — what an IdP actually does, how the major ones differ, and how they connect to each other through federation and SCIM.


On-Premises Identity Providers

AD FS (Active Directory Federation Services)

AD FS is Microsoft’s on-prem federation server — a Windows Server role that sits in front of Active Directory and speaks SAML 2.0 and OIDC to external applications.

What it does:
– Authenticates users against AD (Kerberos/LDAP behind the scenes)
– Issues SAML assertions and OIDC tokens to external SPs
– Handles claims transformation: maps AD attributes to what the SP expects

What it doesn’t do well:
– It’s Windows Server only
– Configuration is complex (XML, certificates, claim rule language)
– No built-in MFA (requires Azure MFA or a third-party provider)
– Being deprecated in favor of Entra ID for most use cases

AD FS made sense when everything was on-prem. As workloads move to cloud, Entra ID Connect (a lighter sync agent) combined with Entra ID as the IdP replaces AD FS for most enterprises.

Keycloak

Keycloak is the open-source IdP from Red Hat. It’s commonly paired with FreeIPA (via LDAP user federation) for web-based OIDC/SAML SSO, and it’s widely deployed on its own by organizations that want full control over their identity infrastructure.

# Run Keycloak in development mode (Docker)
docker run -p 8080:8080 \
  -e KEYCLOAK_ADMIN=admin \
  -e KEYCLOAK_ADMIN_PASSWORD=admin \
  quay.io/keycloak/keycloak:latest \
  start-dev

# Keycloak concepts:
# Realm     — an isolated namespace (like a tenant)
# Client    — an application that uses Keycloak for auth (SP/RP)
# User federation — connect Keycloak to an existing LDAP/AD directory
# Identity brokering — federate with external IdPs (Google, GitHub, another SAML IdP)

Keycloak reads users from AD/LDAP via its User Federation feature — it doesn’t replace the directory, it federates it. Users still live in AD; Keycloak issues SAML/OIDC tokens based on those users.

Shibboleth

Shibboleth is the dominant IdP in academia. Most universities run it. It’s SAML-native, designed for federation between institutions — a student can authenticate at their home university’s IdP and access resources at a partner institution.


Cloud Identity Providers

Okta

Okta is a cloud IdP + directory. It can:
– Act as the primary user directory (storing users, credentials)
– Connect to on-prem AD via the Okta Active Directory Agent (a lightweight sync service)
– Federate with other IdPs (act as IdP or SP in a SAML/OIDC chain)
– Enforce MFA, Adaptive Authentication, Device Trust

Okta’s Lifecycle Management handles provisioning: when a user is created/disabled in Okta (or synced from AD), Okta can automatically create/deactivate accounts in downstream SaaS apps — via SCIM or app-specific APIs.

Entra ID (Azure Active Directory)

Entra ID is Microsoft’s cloud IdP. It’s both a directory (stores users, groups) and an IdP (issues tokens). For organizations running on-prem AD, Entra ID Connect syncs users from AD to Entra ID.

Entra ID is OIDC and OAuth2 native — it speaks SAML for legacy apps but JWT/OIDC for everything modern. Its OIDC implementation follows the standard closely; its token validation happens via /.well-known/openid-configuration and the JWKS endpoint.

On-prem AD  →  Entra ID Connect (sync agent)  →  Entra ID (cloud)
                                                      │
                                              SAML / OIDC
                                                      │
                                            SaaS apps, Azure resources

Google Workspace

Google Workspace is Google’s combined directory + IdP. Google accounts are the users. Apps integrate via SAML or OIDC. Google’s OIDC implementation is one of the most widely used reference implementations — most OIDC libraries are tested against it.


Federation: IdPs Trusting Each Other

Federation is the mechanism that lets IdPs delegate to each other. Two patterns:

SAML Federation (IdP-to-IdP)

Common in academia and partner integrations:

User at University A → requests resource at University B
                              │
                              │ doesn't know user
                              ▼
                    University B SP redirects to...
                    Discovery Service: "which IdP are you from?"
                              │
                              ▼
                    University A IdP authenticates user
                              │
                    Sends SAML assertion to University B SP

University B’s SP trusts University A’s IdP because both are members of a SAML federation (e.g., InCommon in the US, eduGAIN globally). The federation metadata aggregates all members’ SAML metadata — certificates, endpoints — so members don’t have to manually configure each bilateral trust.

OIDC Identity Brokering

Keycloak, Okta, and Entra ID can all act as identity brokers — they sit between the application and the actual authenticating IdP:

App (OIDC RP) → Keycloak (broker IdP) → Google / GitHub / SAML IdP
                                               │ authenticate
                                               ▼
                                      Keycloak receives assertion
                                      Maps external claims to local claims
                                      Issues OIDC token to app

The app only knows Keycloak. Keycloak handles the upstream IdP complexity.


SCIM: Provisioning ≠ Authentication

SCIM (RFC 7644) is a REST API standard for user lifecycle management — creating, updating, and deactivating user accounts in a target system when changes happen in the source directory.

Source (Okta / Entra ID)           Target (Slack / GitHub / Jira)
         │                                    │
         │  SCIM 2.0 (REST + JSON)            │
         ├─ POST /Users  ─────────────────────► create user
         ├─ PATCH /Users/id ──────────────────► update attributes
         └─ DELETE /Users/id ─────────────────► deactivate account

SCIM is not SSO. A SCIM-provisioned user in Slack can log in to Slack — but the authentication still goes through the IdP (SAML/OIDC). SCIM ensures the account exists. The IdP proves the user’s identity.

Why both? Because SSO alone doesn’t create accounts in target systems — it just authenticates to them. If a user tries to log in to Slack for the first time via SSO, Slack needs an account to map them to. SCIM creates that account before the first login (Just-in-Time provisioning handles it at first login, but SCIM handles it in bulk and handles deprovisioning reliably).

Deprovisioning is where SCIM matters most. When an employee leaves, you disable them in Okta — SCIM deactivates their account in every connected app within minutes. Without SCIM, IT runs a manual checklist. Someone misses Jira. The ex-employee has access for three weeks.
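For a concrete picture of what a SCIM 2.0 exchange looks like on the wire, here is a hedged sketch of the two calls that matter most, using Python’s requests library. The base URL and bearer token are placeholders; in practice the IdP (Okta, Entra ID) sends these requests for you once the integration is configured.

import requests

BASE = "https://app.example.com/scim/v2"      # placeholder SCIM endpoint of the target app
HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/scim+json"}

# Provision: create the account before the first SSO login
user = {
    "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
    "userName": "vamshi@corp.com",
    "name": {"givenName": "Vamshi", "familyName": "Krishna"},
    "active": True,
}
created = requests.post(f"{BASE}/Users", json=user, headers=HEADERS).json()

# Deprovision: deactivate the account when the user is disabled in the IdP
patch = {
    "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
    "Operations": [{"op": "replace", "path": "active", "value": False}],
}
requests.patch(f"{BASE}/Users/{created['id']}", json=patch, headers=HEADERS)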


Directory Sync vs Federation

These are commonly confused:

Directory sync — copy user data from source to target. Entra ID Connect copies users from on-prem AD to Entra ID. This is not authentication; it’s data replication. After sync, Entra ID has its own copy of the user record.

Federation — delegate authentication to an external IdP. The target system doesn’t store credentials; it redirects to the IdP for authentication and trusts the assertion that comes back.

You often need both:
– Sync: so the target system has the user record and can enforce policies (group membership, license assignment)
– Federation: so the user authenticates against the source of truth (your IdP) rather than maintaining a separate password in every system


⚠ Common Misconceptions

“SCIM is an authentication protocol.” SCIM is a provisioning protocol. It creates and manages accounts. Authentication is SAML/OIDC. Both solve different parts of the identity lifecycle problem.

“SSO means you only have one password.” SSO means you only authenticate once per session. The password still exists (at the IdP). SSO reduces the number of authentication events, not the number of credentials.

“On-prem IdP + cloud sync is the same as a cloud IdP.” With on-prem IdP + cloud sync (e.g., AD + Entra ID Connect), authentication happens via the on-prem IdP — if it goes down, cloud SSO breaks. A pure cloud IdP (Okta standalone, Entra ID without on-prem AD) authenticates entirely in the cloud.


Framework Alignment

CISSP Domain 5: Identity and Access Management
    IdPs are the central control plane for federated identity — their architecture, trust
    relationships, and provisioning workflows define the enterprise IAM posture

CISSP Domain 1: Security and Risk Management
    SCIM-based deprovisioning is an access control risk management practice — without it,
    terminated employee access persists across connected systems

CISSP Domain 3: Security Architecture and Engineering
    The choice of on-prem vs cloud IdP, federation vs sync, and SCIM vs JIT provisioning are
    architectural decisions with long-term operational and security implications

Key Takeaways

  • An IdP authenticates users and issues assertions (SAML) or tokens (OIDC/OAuth2) — applications trust the IdP, not the user directly
  • On-prem: AD FS (Windows/legacy), Keycloak (open source, flexible), Shibboleth (academia)
  • Cloud: Okta (cloud-native, strong lifecycle management), Entra ID (Microsoft-integrated), Google Workspace
  • Federation = authentication delegation between IdPs; Directory sync = data replication; SCIM = account lifecycle (provisioning/deprovisioning)
  • SCIM deprovisioning is the critical control — it ensures ex-employees lose access automatically across all connected systems

What’s Next

EP11 covered the IdP landscape. EP12 gets specific: Entra ID and Linux — how you configure a Linux VM to accept SSH logins authenticated against Azure AD credentials, and how the aad-auth / pam_aad stack works end to end.

Next: Entra ID Linux Login: SSH Authentication with Azure AD Credentials

Get EP12 in your inbox when it publishes → linuxcent.com/subscribe

SAML vs OIDC vs OAuth2: Which Protocol Handles Which Identity Problem

Reading Time: 6 minutes

The Identity Stack, Episode 10
EP09: Active Directory → EP10 → EP11: Identity Providers → …


TL;DR

  • SAML 2.0 is a federation protocol for browser-based SSO — an IdP issues a signed XML assertion that a Service Provider trusts; designed for enterprise applications
  • OAuth2 is an authorization delegation protocol, not authentication — it lets an application act on your behalf without knowing your password; the access token says what, not who
  • OIDC (OpenID Connect) = OAuth2 + an identity layer — adds the id_token (a JWT containing who you are) on top of OAuth2’s access_token (what you can do)
  • SAML vs OIDC: SAML is XML, enterprise-native, stateful; OIDC is JSON/JWT, API-native, stateless — new applications almost always use OIDC
  • The id_token is a JWT — decode it at jwt.io and read every claim — it tells you exactly what the IdP asserts about the user
  • The browser SSO flow is three redirects: user → SP → IdP (authenticate) → SP (consume assertion)

The Problem: LDAP and Kerberos Don’t Cross the Internet

EP09 showed how authentication works inside a corporate network. LDAP and Kerberos both assume network proximity to the directory server — firewall-friendly ports don’t help when the authentication protocol requires a direct connection to the KDC or directory.

Internal network: works
  Browser → intranet app → LDAP/Kerberos → AD DC (all on 10.0.0.0/8)

Internet: breaks
  Browser → SaaS app (AWS) → LDAP/Kerberos → AD DC (on-prem behind firewall)
  ✗ KDC not reachable across NAT
  ✗ LDAP not exposed to internet (shouldn't be)
  ✗ Every SaaS app can't have its own LDAP connection to your DC

SAML was invented in 2002 to solve this. OIDC in 2014. Both let identity assertions travel over HTTPS — the one protocol that crosses every firewall.


SAML 2.0: Enterprise Browser SSO

SAML 2.0 has three actors: the User, the Identity Provider (IdP), and the Service Provider (SP).

1. User visits SP (e.g., Salesforce)
   SP: "I don't know this user — send them to the IdP"
   ↓  HTTP redirect with SAMLRequest (base64-encoded AuthnRequest)

2. User arrives at IdP (e.g., Okta, AD FS, Entra ID)
   IdP: "Authenticate me" → user enters credentials
   IdP: generates a signed SAML Assertion (XML)
   ↓  HTTP POST to SP's Assertion Consumer Service (ACS) URL

3. SP receives the SAMLResponse
   SP: verifies the signature using IdP's public key
   SP: extracts user attributes from the Assertion
   SP: creates a session — user is logged in

The SAML Assertion is an XML document signed by the IdP. It contains:

<saml:Assertion>
  <saml:Issuer>https://idp.corp.com</saml:Issuer>
  <saml:Subject>
    <saml:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-format:emailAddress">
      vamshi@corp.com
    </saml:NameID>
  </saml:Subject>
  <saml:Conditions
    NotBefore="2026-04-27T01:00:00Z"
    NotOnOrAfter="2026-04-27T01:05:00Z">  ← short-lived: replay protection
  </saml:Conditions>
  <saml:AttributeStatement>
    <saml:Attribute Name="email">
      <saml:AttributeValue>vamshi@corp.com</saml:AttributeValue>
    </saml:Attribute>
    <saml:Attribute Name="groups">
      <saml:AttributeValue>engineers</saml:AttributeValue>
      <saml:AttributeValue>sre-team</saml:AttributeValue>
    </saml:Attribute>
  </saml:AttributeStatement>
</saml:Assertion>

The SP trusts the assertion because it’s signed with the IdP’s private key, and the SP has the IdP’s public certificate configured. No direct connection between SP and IdP needed during authentication — only the browser carries the assertion.

SP-initiated vs IdP-initiated:
– SP-initiated: user visits the SP, gets redirected to IdP, authenticates, redirected back — the common flow
– IdP-initiated: user starts at the IdP (e.g., company portal), clicks an app, IdP sends assertion directly — simpler but no SP-generated RequestID, so the SP can’t verify the request was expected (a security concern)


OAuth2: Authorization Delegation (Not Authentication)

This distinction is important and consistently confused: OAuth2 is for authorization, not authentication.

OAuth2 solves: “I want to let GitHub Actions post to my Slack without giving GitHub my Slack password.”

Resource Owner (you)  → grants permission to →  Client (GitHub Actions)
                                                        │
                                                        │ access_token
                                                        ▼
                                               Resource Server (Slack API)
                                               "this token can post messages"

The access_token answers “what can this client do?” not “who is this user?” A resource server receiving an access token knows the token is valid and what scopes it carries — it does not necessarily know which human authorized it.

The four OAuth2 grant types:

Grant                 Use case
Authorization Code    Web apps (server-side) — most secure, recommended
PKCE (+ Auth Code)    Native/SPA apps — Auth Code without client secret
Client Credentials    Machine-to-machine (no user) — service accounts
Device Code           Devices without browsers (smart TVs, CLIs)

The Implicit grant (tokens in URL fragment) is deprecated. Don’t use it.
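The Client Credentials grant is worth seeing in full because it’s the one with no browser and no user involved. A minimal sketch with Python’s requests library; the token endpoint, client ID, and scope are placeholders:

import requests

# Client Credentials grant: machine-to-machine, no user.
resp = requests.post(
    "https://idp.corp.com/token",
    data={
        "grant_type": "client_credentials",
        "client_id": "ci-pipeline",
        "client_secret": "<secret>",
        "scope": "deploy:write",
    },
)
access_token = resp.json()["access_token"]   # says what the client may do, not who a user is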


OIDC: OAuth2 + Who You Are

OpenID Connect adds identity to OAuth2 by adding the id_token — a JWT that the IdP signs and that contains claims about the authenticated user.

Authorization Code flow with OIDC:

1. Client redirects user to IdP:
   GET /authorize?
     response_type=code
     &client_id=myapp
     &scope=openid email profile    ← "openid" scope triggers OIDC
     &redirect_uri=https://app.com/callback
     &state=random-nonce

2. IdP authenticates user, returns:
   GET /callback?code=AUTH_CODE&state=random-nonce

3. Client exchanges code for tokens:
   POST /token
   grant_type=authorization_code&code=AUTH_CODE...

4. IdP returns:
   {
     "access_token": "eyJ...",    ← what the user authorized
     "id_token": "eyJ...",        ← who the user is (JWT)
     "token_type": "Bearer",
     "expires_in": 3600
   }

The id_token decoded:

{
  "iss": "https://idp.corp.com",          ← issuer (the IdP)
  "sub": "user-guid-12345",               ← subject (stable user identifier)
  "aud": "myapp",                          ← audience (your client_id)
  "exp": 1745730000,                       ← expiry (Unix timestamp)
  "iat": 1745726400,                       ← issued at
  "email": "[email protected]",
  "name": "Vamshi Krishna",
  "groups": ["engineers", "sre-team"]     ← custom claims from IdP
}
# Decode any JWT at the command line (no verification — for debugging only)
echo "eyJ..." | cut -d. -f2 | base64 -d 2>/dev/null | python3 -m json.tool

# Or: jwt.io — paste the token, read every claim

sub is the stable user identifier. Email addresses change. Names change. The sub claim is the IdP’s internal identifier for the user — use it as the primary key when storing user data. Never store email as the primary key.


SAML vs OIDC: When to Use Which

                      SAML 2.0                              OIDC
Format                XML                                   JSON / JWT
Transport             HTTP POST (browser only)              HTTP redirect + JSON API
Ratified              2005 (SAML 1.0: 2002)                 2014
Enterprise adoption   Very high (AD FS, Okta, Entra ID)     Very high (newer apps)
API-friendly          No                                    Yes
Mobile apps           No                                    Yes
Complexity            High (XML, schemas, signatures)       Medium (JWT, JSON)
Single Logout         Specified (rarely works well)         Optional, inconsistent

Use SAML when: You’re integrating with an enterprise SaaS that only supports SAML (Salesforce classic, legacy HR systems), or your IdP team mandates it.

Use OIDC when: You’re building a new application, integrating with a modern IdP, or need API-based token validation. OIDC is the default for everything new.

Use OAuth2 (Client Credentials) when: Service-to-service authentication with no user — your CI/CD pipeline authenticating to an API, your microservice calling another microservice.


A Complete Browser SSO Flow (OIDC)

1. User visits https://app.corp.com (not logged in)
   App: no session → redirect to IdP

2. GET https://idp.corp.com/authorize?
        response_type=code
        &client_id=app-corp
        &scope=openid email
        &redirect_uri=https://app.corp.com/callback
        &state=abc123
        &nonce=xyz789

3. IdP: user is not authenticated → show login form
   User: enters vamshi@corp.com + password
   (or: IdP sees existing session cookie → skip login)

4. IdP: authentication success
   Redirect: GET https://app.corp.com/callback?code=AUTH_CODE&state=abc123

5. App (server-side): validate state=abc123 (CSRF protection)
   POST https://idp.corp.com/token
     grant_type=authorization_code
     &code=AUTH_CODE
     &client_id=app-corp
     &client_secret=SECRET
     &redirect_uri=https://app.corp.com/callback

6. IdP responds:
   { "id_token": "JWT...", "access_token": "JWT...", "expires_in": 3600 }

7. App: validate id_token signature (using IdP's JWKS endpoint)
   App: extract sub, email, groups from id_token
   App: create session for vamshi@corp.com
   App: redirect user to original destination

Step 7 is where most bugs live. The app must validate: signature (using the IdP’s public keys from the jwks_uri advertised in /.well-known/openid-configuration), iss (matches the expected IdP), aud (matches the client_id), exp (not expired), and nonce (matches what was sent in step 2). Skip any of these and you have an authentication bypass.
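As an illustration of those checks, here is a hedged sketch using the third-party PyJWT library. It assumes id_token and expected_nonce are already in hand from steps 2 and 6, and that the JWKS URL shown is what the discovery document advertises; in practice, read jwks_uri from /.well-known/openid-configuration rather than hardcoding it.

import jwt                        # PyJWT
from jwt import PyJWKClient

ISSUER = "https://idp.corp.com"
CLIENT_ID = "app-corp"

# Fetch the signing key the IdP advertises via its jwks_uri.
jwks = PyJWKClient("https://idp.corp.com/.well-known/jwks.json")   # placeholder jwks_uri
signing_key = jwks.get_signing_key_from_jwt(id_token)               # id_token from step 6

claims = jwt.decode(
    id_token,
    signing_key.key,
    algorithms=["RS256"],
    audience=CLIENT_ID,           # aud check
    issuer=ISSUER,                # iss check; exp and signature are checked by default
)
if claims.get("nonce") != expected_nonce:     # must match the nonce sent in step 2
    raise ValueError("nonce mismatch — possible token replay")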


⚠ Common Misconceptions

“OAuth2 is for login.” OAuth2 is for authorization delegation. It can be used as a login mechanism only when OIDC (the openid scope + id_token) is added on top. “Login with Google” uses OIDC, not bare OAuth2.

“JWTs are encrypted.” By default, JWTs are signed (JWS), not encrypted. The header and payload are base64url-encoded — anyone can decode them. Encryption (JWE) is a separate, less commonly used spec. Never put secrets in a JWT payload assuming it’s private.

“SAML Single Logout works reliably.” SAML SLO is specified but inconsistently implemented. Many SPs ignore SLO requests or don’t propagate them correctly. Don’t depend on SLO for security — session revocation requires additional mechanisms (short-lived tokens, token introspection, session registries).


Framework Alignment

CISSP Domain 5: Identity and Access Management
    SAML, OAuth2, and OIDC are the three protocols that enable federated identity and SSO —
    understanding which does what is foundational to modern IAM design

CISSP Domain 4: Communications and Network Security
    JWT validation (signature, claims, expiry) is a network security control — failing to
    validate any claim is an authentication bypass vulnerability

CISSP Domain 3: Security Architecture and Engineering
    The choice of SAML vs OIDC is an architectural decision that affects every application
    integration, mobile support, and API design

Key Takeaways

  • SAML 2.0: XML-based browser SSO — three redirects, signed assertion, enterprise legacy apps
  • OAuth2: authorization delegation — access tokens grant scopes, not identity
  • OIDC: OAuth2 + id_token — adds who the user is on top of what they can do
  • sub is the stable user identifier in OIDC — never use email as a primary key
  • JWT validation must check: signature, iss, aud, exp, nonce — missing any is a security bypass
  • New applications: OIDC. Legacy enterprise SaaS: SAML. Service-to-service: OAuth2 Client Credentials

What’s Next

EP10 covered the protocols. EP11 covers the systems that implement them — the identity providers: what Okta, Entra ID, Keycloak, and AD FS actually do, how they federate with each other, and how SCIM handles user provisioning separately from authentication.

Next: Identity Providers Explained: On-Prem, Cloud, SCIM, and Federation

Get EP11 in your inbox when it publishes → linuxcent.com/subscribe

How Active Directory Works: LDAP, Kerberos, and Group Policy Under the Hood

Reading Time: 6 minutes

The Identity Stack, Episode 9
EP08: FreeIPA → EP09 → EP10: SAML/OIDC → …


TL;DR

  • Active Directory is not a product that happens to use LDAP — it is an LDAP directory with a Microsoft-extended schema, a built-in Kerberos KDC, and DNS tightly integrated
  • Replication uses USNs (Update Sequence Numbers) and GUIDs — the Knowledge Consistency Checker (KCC) automatically builds the replication topology
  • Sites and site links tell AD which DCs are physically close — AD prefers to authenticate users against a DC in the same site to minimize WAN latency
  • Group Policy Objects (GPOs) are stored as LDAP entries (in the CN=Policies container) and Sysvol files — LDAP tells clients which GPOs apply; Sysvol delivers the policy files
  • Linux joins AD via realm join (uses adcli + SSSD) or net ads join (Samba + winbind) — both register a machine account in AD and get a Kerberos keytab
  • The difference between Linux in AD and Linux in FreeIPA: AD is optimized for Windows; FreeIPA is optimized for Linux — both interoperate

The Big Picture: What AD Actually Is

Active Directory Domain: corp.com
┌────────────────────────────────────────────────────────────┐
│                                                            │
│  LDAP directory          Kerberos KDC                      │
│  ─────────────           ──────────                        │
│  Schema: 1000+ classes   Realm: CORP.COM                   │
│  Objects: users, groups, Issues TGTs + service tickets     │
│  computers, GPOs, OUs    Uses LDAP as the account DB       │
│                                                            │
│  DNS                     Sysvol (DFS share)                │
│  ────                    ────────────────                  │
│  SRV records for KDC     GPO templates                     │
│  and LDAP discovery      Login scripts                     │
│                          Replicated via DFSR               │
│                                                            │
│  Replication engine: USN + GUID + KCC                      │
└────────────────────────────────────────────────────────────┘
          │ replicates to          │ replicates to
          ▼                        ▼
   DC: dc02.corp.com        DC: dc03.corp.com

EP08 showed FreeIPA as the Linux-native answer to enterprise identity. AD is the Microsoft answer — and because most enterprises run Windows clients, understanding AD is unavoidable for Linux infrastructure engineers. This episode goes behind the LDAP and Kerberos protocols to explain what makes AD specifically work.


The AD Schema: LDAP With 1000+ Object Classes

AD’s schema extends the base LDAP schema with Microsoft-specific classes and attributes. Every user object is a user class (which extends organizationalPerson which extends person which extends top) with additional attributes like:

sAMAccountName   ← the pre-Windows 2000 login name (vamshi)
userPrincipalName ← the modern UPN (vamshi@corp.com)
objectGUID       ← a globally unique 128-bit identifier (never changes, even if DN changes)
objectSid        ← Windows Security Identifier (used for ACL enforcement on Windows)
whenCreated      ← creation timestamp
pwdLastSet       ← password change timestamp
userAccountControl ← bitmask: disabled, locked, password never expires, etc.
memberOf         ← back-link: groups this user belongs to

objectGUID is the authoritative identifier in AD — not the DN. When a user is renamed or moved to a different OU, the GUID stays the same. Applications that store a user’s DN will break on rename; applications that store the GUID won’t.

userAccountControl is the bitmask that controls account state:

Flag          Value   Meaning
ACCOUNTDISABLE  2     Account disabled
LOCKOUT         16    Account locked out
PASSWD_NOTREQD  32    Password not required
NORMAL_ACCOUNT  512   Normal user account (set on almost all accounts)
DONT_EXPIRE_PASSWD 65536  Password never expires
# Query AD from a Linux machine
ldapsearch -x -H ldap://dc.corp.com \
  -D "[email protected]" -w password \
  -b "dc=corp,dc=com" \
  "(sAMAccountName=vamshi)" \
  sAMAccountName userPrincipalName objectGUID memberOf userAccountControl
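The userAccountControl value comes back as a plain integer; decoding it is a bitwise AND against the flag table above. A small Python sketch:

# Decode a userAccountControl value into its flags (values from the table above).
UAC_FLAGS = {
    0x0002: "ACCOUNTDISABLE",
    0x0010: "LOCKOUT",
    0x0020: "PASSWD_NOTREQD",
    0x0200: "NORMAL_ACCOUNT",
    0x10000: "DONT_EXPIRE_PASSWD",
}

def decode_uac(value):
    return [name for bit, name in UAC_FLAGS.items() if value & bit]

print(decode_uac(512))       # ['NORMAL_ACCOUNT']
print(decode_uac(66050))     # ['ACCOUNTDISABLE', 'NORMAL_ACCOUNT', 'DONT_EXPIRE_PASSWD']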

Replication: USN + GUID + KCC

AD replication is multi-master — every DC accepts writes. The replication engine uses:

USN (Update Sequence Number) — a per-DC counter that increments on every local write. Each attribute in the directory stores the USN at which it was last modified (uSNChanged, uSNCreated). When DC-A replicates to DC-B, DC-B asks: “give me everything you’ve changed since the last USN I saw from you.”

GUID — each object has a globally unique identifier. If the same attribute is modified on two DCs before replication (a conflict), the conflict is resolved attribute by attribute: the write with the higher version number wins, ties go to the later modification timestamp, and if both are equal the value originating from the DC with the higher invocation GUID wins.

KCC (Knowledge Consistency Checker) — a component that runs on every DC and automatically constructs the replication topology. You don’t configure which DCs replicate to which — the KCC builds a minimum spanning tree that ensures every DC is connected to every other within a set number of hops. You configure Sites and site links; the KCC does the rest.

# Check replication status from a Linux machine (requires rpcclient or adcli)
# Or on the DC: repadmin /showrepl (Windows tool)

# Simulate: query the highestCommittedUSN from a DC
ldapsearch -x -H ldap://dc.corp.com \
  -D "[email protected]" -w password \
  -b "" -s base highestCommittedUSN

Sites are AD’s concept of physical network topology. A site is a set of IP subnets with high-bandwidth connectivity between them. Site links represent the WAN connections between sites.

Site: Mumbai              Site: Hyderabad
┌────────────────┐        ┌────────────────┐
│ DC: dc-mum-01  │        │ DC: dc-hyd-01  │
│ DC: dc-mum-02  │        │ DC: dc-hyd-02  │
│ subnet: 10.1/16│        │ subnet: 10.2/16│
└───────┬────────┘        └────────┬───────┘
        │                          │
        └──── Site Link ───────────┘
              Cost: 100
              Replication interval: 15 min

When a user in Mumbai authenticates, AD’s KDC locates a DC in the same site using DNS SRV records. The SRV records include the site name in the service name: _ldap._tcp.Mumbai._sites.dc._msdcs.corp.com. SSSD and Windows clients query site-local SRV records first.

If no DC is available in the local site, authentication falls back to a DC in another site across the WAN link. Configuring sites correctly prevents remote authentication failures from killing local operations.
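
You can confirm the site-scoped locator records from any client with dig (the site and domain names below are the ones from the diagram):

# Site-scoped DC locator records for the Mumbai site
dig +short _ldap._tcp.Mumbai._sites.dc._msdcs.corp.com SRV
dig +short _kerberos._tcp.Mumbai._sites.dc._msdcs.corp.com SRV

# Site-agnostic fallback: returns DCs from every site
dig +short _ldap._tcp.dc._msdcs.corp.com SRV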


Group Policy: LDAP + Sysvol

GPOs are stored in two places:

LDAP — the CN=Policies,CN=System,DC=corp,DC=com container holds GPO metadata objects. Each GPO has a GUID, a display name, and version numbers. The gPLink attribute on OUs and the domain root links GPOs to where they apply.

Sysvol — the actual policy templates and scripts live in \\corp.com\SYSVOL\corp.com\Policies\{GPO-GUID}\. Sysvol is a DFS-R (Distributed File System Replication) share replicated to every DC.

When a Windows client applies Group Policy:
1. LDAP query: what GPOs are linked to my OU chain?
2. Sysvol fetch: download the policy templates from the GPO’s Sysvol path
3. Apply: process Registry settings, Security settings, Scripts

Linux clients don’t process GPOs natively. The adcli and sssd tools interpret a small subset of AD policy (password policy, account lockout) via LDAP. Full GPO processing on Linux requires Samba’s samba-gpupdate or third-party tools.
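
Which GPOs are linked where is itself readable over LDAP. A small illustrative query (the OU name here is hypothetical) reads gPLink, which lists the linked GPO DNs, and gPOptions, which holds the link flags:

# GPOs linked to a specific OU
ldapsearch -Y GSSAPI -H ldap://dc.corp.com \
  -b "ou=linux-servers,dc=corp,dc=com" -s base gPLink gPOptions

# GPOs linked at the domain root
ldapsearch -Y GSSAPI -H ldap://dc.corp.com \
  -b "dc=corp,dc=com" -s base gPLink gPOptions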


Joining Linux to AD

# Install required packages
dnf install -y realmd sssd adcli samba-common

# Discover the domain
realm discover corp.com
# corp.com
#   type: kerberos
#   realm-name: CORP.COM
#   domain-name: corp.com
#   configured: no
#   server-software: active-directory
#   client-software: sssd

# Join
realm join corp.com -U Administrator
# Prompts for Administrator password
# Creates machine account in AD
# Configures sssd.conf, krb5.conf, nsswitch.conf, pam.d automatically

# Verify
realm list
id [email protected]

What the join does:

  1. Creates a machine account HOSTNAME$ in CN=Computers,DC=corp,DC=com
  2. Sets a machine password (rotated automatically by SSSD)
  3. Retrieves a Kerberos keytab to /etc/krb5.keytab
  4. Configures SSSD with id_provider = ad, auth_provider = ad
  5. Updates /etc/nsswitch.conf to include sss
  6. Updates /etc/pam.d/ to include pam_sss

After joining, SSSD uses the machine’s Kerberos keytab to authenticate to the DC and query LDAP — no hardcoded service account credentials required.
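
You can inspect that keytab directly and prove the machine credentials work (replace HOSTNAME with the machine’s short name in caps; the $ is part of the account name):

# List the machine principals (and key version numbers) stored in the keytab
sudo klist -k /etc/krb5.keytab

# Get a TGT as the machine account using only the keytab
sudo kinit -k -t /etc/krb5.keytab 'HOSTNAME$@CORP.COM'
sudo klist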


LDAP Queries Against AD from Linux

# Find a user (after kinit or with -w password)
ldapsearch -Y GSSAPI -H ldap://dc.corp.com \
  -b "dc=corp,dc=com" \
  "(sAMAccountName=vamshi)" \
  sAMAccountName mail memberOf

# Find all members of a group
ldapsearch -Y GSSAPI -H ldap://dc.corp.com \
  -b "dc=corp,dc=com" \
  "(cn=engineers)" \
  member

# Find all AD-joined Linux machines
ldapsearch -Y GSSAPI -H ldap://dc.corp.com \
  -b "dc=corp,dc=com" \
  "(&(objectClass=computer)(operatingSystem=*Linux*))" \
  cn operatingSystem lastLogonTimestamp

# Find disabled accounts
ldapsearch -Y GSSAPI -H ldap://dc.corp.com \
  -b "dc=corp,dc=com" \
  "(userAccountControl:1.2.840.113556.1.4.803:=2)" \
  sAMAccountName

The last filter uses an LDAP extensible match (1.2.840.113556.1.4.803 is the OID for bitwise AND). userAccountControl:1.2.840.113556.1.4.803:=2 means “entries where userAccountControl AND 2 equals 2” — i.e., the ACCOUNTDISABLE bit is set. This is a Microsoft AD extension not in standard LDAP.
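
The same extensible-match pattern works for any bit in the userAccountControl table above. For example, the 65536 bit finds accounts whose password never expires:

# Accounts with "password never expires" set (DONT_EXPIRE_PASSWD = 65536)
ldapsearch -Y GSSAPI -H ldap://dc.corp.com \
  -b "dc=corp,dc=com" \
  "(userAccountControl:1.2.840.113556.1.4.803:=65536)" \
  sAMAccountName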


⚠ Common Misconceptions

“AD is just Microsoft’s LDAP.” AD is LDAP + Kerberos + DNS + DFS-R + GPO, all tightly integrated and with a schema that the Microsoft ecosystem depends on. You can query AD with standard ldapsearch. You cannot replace it with OpenLDAP without breaking every Windows client.

“Linux machines in AD get GPO.” Linux machines appear in AD and can be organized into OUs. Standard GPOs don’t apply to them. Samba’s samba-gpupdate can process a subset of AD policy for Linux — mostly Registry and Security settings mapped to Linux equivalents.

“realm leave removes the machine cleanly.” realm leave removes local configuration but does not delete the machine account from AD. The stale computer object stays in CN=Computers until an AD admin deletes it. Always run realm leave && adcli delete-computer -U Administrator for a clean removal.


Framework Alignment

CISSP Domain 5: Identity and Access Management
  AD is the dominant enterprise identity store — understanding its LDAP structure, Kerberos realm, and GPO model is essential for IAM in mixed environments

CISSP Domain 4: Communications and Network Security
  AD replication traffic (RPC, LDAP, Kerberos) is a significant portion of enterprise WAN traffic — Sites and site links are a network security and performance design decision

CISSP Domain 3: Security Architecture and Engineering
  AD forest/domain/OU hierarchy is an architectural decision with long-term security consequences — getting OU structure wrong constrains GPO delegation for years

Key Takeaways

  • AD is LDAP + Kerberos + DNS + GPO + DFS-R — not a product that “uses” these; they’re the implementation
  • Replication is multi-master via USN + GUID; the KCC builds the topology automatically from Sites configuration
  • objectGUID is the stable identifier — not the DN, which changes on rename/move
  • realm join is the correct way to join Linux to AD — it configures SSSD, Kerberos, PAM, and NSS correctly in one command
  • userAccountControl is the bitmask that controls account state — (userAccountControl:1.2.840.113556.1.4.803:=2) finds disabled accounts

What’s Next

EP09 covered AD — LDAP and Kerberos inside the corporate network. EP10 covers what happens when identity needs to work across the internet, where Kerberos doesn’t reach: SAML, OAuth2, and OIDC — the protocols that let identity leave the building.

Next: SAML vs OIDC vs OAuth2: Which Protocol Handles Which Identity Problem

Get EP10 in your inbox when it publishes → linuxcent.com/subscribe

LDAP High Availability: Load Balancing and Production Architecture

Reading Time: 6 minutes

The Identity Stack, Episode 7
EP06: OpenLDAP → EP07 → EP08: FreeIPA → …


TL;DR

  • LDAP HA means multiple directory servers behind a load balancer — clients connect to a VIP, not to individual servers
  • Read/write split: all writes go to the provider, reads are distributed across consumers — the load balancer enforces this by routing on port or backend check
  • SSSD handles multi-server failover natively (ldap_uri accepts a comma-separated list) — for apps without built-in failover, HAProxy with health checks does the work
  • Connection pooling is critical at scale — nss_ldap and pam_ldap opened a new connection per login; SSSD maintains a pool; apps that use libldap directly must implement their own
  • cn=monitor is the built-in monitoring endpoint — exposes connection counts, operation rates, and backend stats readable via ldapsearch
  • 389-DS (Red Hat Directory Server) is the production choice for >1M entries — purpose-built for large directories with a dedicated replication engine

The Big Picture: Production LDAP Topology

         Clients (SSSD, apps, VPN concentrators)
                      │
              ┌───────▼───────┐
              │   HAProxy VIP  │   ← single endpoint, port 389/636
              │  10.0.0.10     │
              └───────┬───────┘
                      │
          ┌───────────┼───────────┐
          ▼           ▼           ▼
   ldap1.corp.com  ldap2.corp.com  ldap3.corp.com
   (Provider)      (Consumer)      (Consumer)
   Reads + Writes  Reads only      Reads only
          │           ▲               ▲
          └───────────┴───────────────┘
               SyncRepl replication

EP06 built a two-node replicated directory. This episode covers what happens when the directory becomes infrastructure — when it needs to survive a node failure, handle thousands of connections, and be monitored like any other critical service.


HAProxy for LDAP

HAProxy is the standard choice for LDAP load balancing. Unlike HTTP, LDAP is a stateful protocol — once a client binds, subsequent operations on that connection share the authenticated session. The load balancer must use connection persistence, not per-request routing.

# /etc/haproxy/haproxy.cfg

global
    log /dev/log local0
    maxconn 50000

defaults
    mode tcp                  # LDAP is TCP, not HTTP
    timeout connect 5s
    timeout client  30s
    timeout server  30s
    option tcplog

# ── LDAP read/write split ─────────────────────────────────────────────

# Writes → provider only
frontend ldap-write
    bind *:389
    default_backend ldap-provider

backend ldap-provider
    balance first                   # always use first available (provider)
    option tcp-check
    tcp-check connect
    server ldap1 ldap1.corp.com:389 check inter 5s rise 2 fall 3
    server ldap2 ldap2.corp.com:389 check inter 5s rise 2 fall 3 backup

# Reads → all nodes round-robin
frontend ldap-read
    bind *:3389                     # internal read port
    default_backend ldap-consumers

backend ldap-consumers
    balance roundrobin
    option tcp-check
    tcp-check connect
    server ldap1 ldap1.corp.com:389 check inter 5s
    server ldap2 ldap2.corp.com:389 check inter 5s
    server ldap3 ldap3.corp.com:389 check inter 5s

# LDAPS (TLS) passthrough: slapd terminates TLS, HAProxy only forwards the stream
frontend ldaps
    bind *:636
    default_backend ldap-consumers-tls

backend ldap-consumers-tls
    balance roundrobin
    option tcp-check
    tcp-check connect
    # No "ssl" on the server lines: the client's TLS session passes through untouched,
    # so the backend certificates must cover the name clients actually connect to (the VIP)
    server ldap1 ldap1.corp.com:636 check inter 5s
    server ldap2 ldap2.corp.com:636 check inter 5s

The health check (tcp-check connect) just verifies TCP connectivity. For a more precise check — verifying that slapd is actually responding to LDAP requests — use a custom script that runs ldapsearch and checks the result code.
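
A minimal sketch of such a check, intended to be wired in through HAProxy’s external-check mechanism (option external-check plus external-check command; depending on your HAProxy version, external checks may also need to be enabled in the global section). The script path and the root-DSE read are illustrative:

#!/bin/bash
# /usr/local/bin/ldap-check.sh (illustrative)
# HAProxy external-check invokes this with: <vip_addr> <vip_port> <server_addr> <server_port>
SERVER="$3"
PORT="$4"

# Anonymous read of the root DSE; fails unless slapd answers a real LDAP search
ldapsearch -x -H "ldap://${SERVER}:${PORT}" \
  -b "" -s base -l 3 namingContexts > /dev/null 2>&1
exit $?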


SSSD Multi-Server Failover

SSSD has native failover — no load balancer required for SSSD-based clients:

# /etc/sssd/sssd.conf
[domain/corp.com]
ldap_uri = ldap://ldap1.corp.com, ldap://ldap2.corp.com, ldap://ldap3.corp.com
# SSSD tries them in order; switches to next on failure
# Switches back to the primary after failover_primary_timeout (default: 31s)

# For AD, discovery via DNS SRV records is even better:
ad_server = _srv_
# SSSD queries _ldap._tcp.corp.com SRV records and gets all DCs automatically

SSSD monitors the connection health. If the current server becomes unreachable, it switches to the next in the list within seconds. Existing cached data keeps serving during the switchover. Clients using SSSD don’t need a load balancer for basic HA.


Connection Pooling

Every LDAP bind creates an authenticated session on the server, and every open connection holds a file descriptor and server resources. slapd also enforces per-session limits (olcConnMaxPending, olcConnMaxPendingAuth in OLC) and closes sessions that queue too many outstanding operations.

The problem: applications that use libldap directly tend to open a new connection per operation. At 500 requests/second, that’s 500 new TCP connections, 500 binds, 500 TLS handshakes per second — a directory that can handle 5000 concurrent connections starts refusing new ones.

The solutions:

SSSD — handles this automatically. SSSD maintains one or a small number of persistent connections per domain and multiplexes all PAM/NSS queries through them.

Application-level pooling — frameworks like python-ldap with connection pooling, ldap3 with connection strategies, or dedicated middleware like 389-DS’s Directory Proxy Server.

Server-side limits in OpenLDAP — slapd has no single “max connections” directive. Total concurrent connections are bounded by the process file-descriptor limit, while olcConnMaxPending (default 100, anonymous sessions) and olcConnMaxPendingAuth (default 1000, authenticated sessions) cap how many operations a single session may queue. Set the descriptor limit and these values deliberately so overload produces a controlled failure mode instead of unbounded queuing.
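
If you want to tune those per-session limits, they live on the cn=config entry and can be changed at runtime. A minimal sketch with illustrative values:

# conn-limits.ldif: per-session pending-operation limits (illustrative values)
dn: cn=config
changetype: modify
replace: olcConnMaxPending
olcConnMaxPending: 100
-
replace: olcConnMaxPendingAuth
olcConnMaxPendingAuth: 1000

ldapmodify -Y EXTERNAL -H ldapi:/// -f conn-limits.ldif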


Monitoring with cn=monitor

OpenLDAP exposes live operational statistics via the cn=monitor database — a virtual LDAP subtree that reflects the server’s current state. Enable it:

# enable-monitor.ldif
dn: cn=module,cn=config
objectClass: olcModuleList
cn: module
olcModulePath: /usr/lib/ldap
olcModuleLoad: back_monitor

dn: olcDatabase=monitor,cn=config
objectClass: olcDatabaseConfig
olcDatabase: monitor
olcAccess: to *
  by dn="cn=admin,dc=corp,dc=com" read
  by * none

Query it:

# Overall statistics
ldapsearch -x -H ldap://localhost \
  -D "cn=admin,dc=corp,dc=com" -w password \
  -b "cn=monitor" -s sub "(objectClass=*)" \
  monitorOpInitiated monitorOpCompleted

# Connection counts
ldapsearch -x -H ldap://localhost \
  -D "cn=admin,dc=corp,dc=com" -w password \
  -b "cn=Connections,cn=monitor" -s one \
  monitorConnectionNumber

# Operations by type
ldapsearch -x -H ldap://localhost \
  -D "cn=admin,dc=corp,dc=com" -w password \
  -b "cn=Operations,cn=monitor" -s one \
  monitorOpInitiated monitorOpCompleted

Useful metrics to export to Prometheus (via prometheus-openldap-exporter or similar):
– monitorOpCompleted per operation type (bind, search, modify)
– monitorConnectionNumber — current connection count
– Backend-specific: olmMDBEntries, olmMDBPagesMax, olmMDBPagesUsed


389-DS: LDAP at Scale

OpenLDAP is excellent for directories up to a few million entries. When you need:
– 10M+ entries
– High write throughput (more than a few hundred writes/second)
– Fine-grained replication filtering
– A dedicated web-based admin UI

…389-DS (Red Hat Directory Server, community edition) is the production answer. It’s what FreeIPA uses under the hood.

Key architectural differences from OpenLDAP:

Multi-supplier replication — 389-DS’s replication engine uses a dedicated changelog (stored in LMDB) and Change Sequence Numbers (CSNs) for conflict resolution. Multi-supplier (multi-master) replication is first-class, not a bolted-on feature.

Changelog — every change is written to a persistent changelog before being applied. This enables precise replication: a consumer can reconnect after a network partition and get exactly the changes it missed, rather than doing a full resync.

Plugin architecture — 389-DS functionality (replication, managed entries, DNA for automatic UID allocation, memberOf, password policy) is all implemented as plugins that can be enabled/disabled per directory instance.

# Install 389-DS
dnf install -y 389-ds-base

# Create a new instance
dscreate interactive
# — or use a template:
dscreate from-file /path/to/instance.inf

# Manage with dsctl
dsctl slapd-corp status
dsctl slapd-corp start
dsctl slapd-corp stop

# Admin with dsconf
dsconf slapd-corp backend suffix list
dsconf slapd-corp replication status --suffix "dc=corp,dc=com"

The dsconf replication status command gives a live view of replication lag across all suppliers and consumers — something OpenLDAP requires you to compute manually from contextCSN comparisons.


Global Catalog: Cross-Domain Search in AD

When your directory spans multiple AD domains in a forest, the Global Catalog solves a specific problem: a user in emea.corp.com needs to be found by an app that only knows corp.com.

Forest: corp.com
  ├── corp.com       → DC port 389    full directory: 500K entries
  ├── emea.corp.com  → DC port 389    full directory: 200K entries
  └── Global Catalog → GC port 3268  partial replica: 700K entries
                                       (not all attributes — just the most queried ones)

The GC replicates a subset of attributes from every domain in the forest. By default: cn, mail, sAMAccountName, userPrincipalName, memberOf, and about 150 others. Attributes marked with isMemberOfPartialAttributeSet in the schema are replicated to the GC.

If an application is configured to use port 3268 instead of 389, it’s using the GC — and it won’t see attributes not included in the partial attribute set. This surprises teams that add a custom attribute to AD and then wonder why their application can’t see it on 3268 but can on 389.
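
To see the difference in practice, point the same kind of query at the GC port. A hedged example reusing the corp.com names from earlier; whether an attribute comes back depends on whether it is in the partial attribute set:

# Forest-wide search via the Global Catalog (3268; 3269 for GC over TLS)
ldapsearch -x -H ldap://dc.corp.com:3268 \
  -D "[email protected]" -w password \
  -b "dc=corp,dc=com" \
  "(sAMAccountName=vamshi)" \
  sAMAccountName userPrincipalName mail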


⚠ Production Gotchas

HAProxy TCP health checks don’t verify LDAP is responsive. A server can accept TCP connections but have slapd in a degraded state (database corruption, out-of-memory). Build a proper LDAP health check: a script that binds and searches a known entry and checks the result.

Replication lag under write load. SyncRepl consumers can fall behind under sustained write load. Monitor the contextCSN difference between provider and consumers. If consumers are more than a few seconds behind, investigate the provider’s write throughput and the consumer’s processing speed.

Directory size and the MDB mapsize. LMDB requires a pre-configured maximum database size (olcDbMaxSize). If the database grows beyond this, slapd starts failing writes. Set it to 2–4x your expected data size and monitor olmMDBPagesUsed / olmMDBPagesMax.


Key Takeaways

  • HAProxy in TCP mode provides LDAP load balancing — use balance first for write routing (provider only), balance roundrobin for reads
  • SSSD has native failover via ldap_uri — for SSSD clients, a load balancer adds HA but isn’t strictly required
  • cn=monitor is the built-in OpenLDAP monitoring endpoint — export its counters to Prometheus for operational visibility
  • 389-DS is the right choice for >1M entries, high write throughput, or multi-supplier replication as a first-class feature
  • Global Catalog (port 3268/3269) is a partial replica of all AD domains — useful for forest-wide searches, but missing non-replicated attributes

What’s Next

EP07 covers the infrastructure layer. EP08 zooms out to FreeIPA — what you get when LDAP, Kerberos, DNS, PKI, and HBAC are integrated into a single Linux-native identity stack, and why most Linux shops running their own directory should be running FreeIPA instead of bare OpenLDAP.

Next: FreeIPA: LDAP + Kerberos + PKI in a Single Linux Identity Stack

Get EP08 in your inbox when it publishes → linuxcent.com/subscribe

OpenLDAP Setup and Replication: Running Your Own Directory

Reading Time: 5 minutes

The Identity Stack, Episode 6
EP01 → … → EP05: Kerberos → EP06 → EP07: LDAP HA → …


TL;DR

  • OpenLDAP’s server process is slapd — the backend that stores data is MDB (LMDB), a memory-mapped B-tree that replaced the old Berkeley DB backend
  • Configuration lives in the directory itself: cn=config (OLC — Online Configuration) lets you modify slapd at runtime without restarting
  • SyncRepl is the replication protocol: a consumer subscribes to a provider and stays in sync via either polling (refreshOnly) or a persistent connection (refreshAndPersist)
  • Multi-Provider (formerly Multi-Master) lets multiple nodes accept writes — conflict resolution uses CSN (Change Sequence Number), last-writer-wins
  • The essential tools: slapd, ldapadd, ldapmodify, ldapsearch, slapcat, slaptest
  • Always build indexes on the attributes you search most — uid, cn, memberOf — or every search is a full scan

The Big Picture: slapd Architecture

ldapsearch / ldapadd / SSSD / any LDAP client
              │ TCP 389 / 636
              ▼
         ┌─────────────────────────────────┐
         │  slapd (OpenLDAP server)         │
         │                                 │
         │  Frontend (protocol layer)       │
         │    • parse BER requests          │
         │    • ACL enforcement             │
         │    • schema validation           │
         │                                 │
         │  Backend (storage layer)         │
         │    • MDB (LMDB) — default       │
         │    • memory-mapped file I/O      │
         │    • ACID transactions           │
         └────────────┬────────────────────┘
                      │
              /var/lib/ldap/
              data.mdb   (the directory data)
              lock.mdb   (LMDB lock file)

EP05 showed Kerberos in isolation. OpenLDAP is where you run the identity store that Kerberos references — and where SSSD looks up user and group attributes. This episode builds a working two-node replicated directory from scratch.


Installation

# Ubuntu / Debian
apt-get install -y slapd ldap-utils

# RHEL / Rocky / AlmaLinux
dnf install -y openldap-servers openldap-clients

# After install — Ubuntu runs a configuration wizard (debconf)
# Re-run it later if needed: dpkg-reconfigure slapd
# Or answer it once and then switch to OLC management

On RHEL-family systems, slapd is not configured after install — you work entirely through OLC from the start.


OLC: The Directory Configures Itself

The old way was slapd.conf — a static file that required a full restart on every change. OLC (Online Configuration) replaced it: slapd’s own configuration is stored as LDAP entries under cn=config. You modify configuration the same way you modify data — with ldapmodify. Changes take effect immediately.

cn=config                        ← root config entry
├── cn=schema,cn=config          ← schema definitions
│     ├── cn={0}core             ← core schema
│     ├── cn={1}cosine           ← RFC 1274 attributes
│     └── cn={2}inetorgperson    ← inetOrgPerson object class
├── olcDatabase={-1}frontend     ← default settings for all databases
├── olcDatabase={0}config        ← the config database itself
└── olcDatabase={1}mdb           ← your actual directory data
      ├── olcAccess              ← ACLs
      ├── olcSuffix              ← base DN (e.g., dc=corp,dc=com)
      └── olcDbIndex             ← search indexes

Everything under cn=config has attributes prefixed with olc (OpenLDAP Configuration). You query and modify it just like any other LDAP subtree — with one restriction: only the cn=config admin (usually gidNumber=0+uidNumber=0,cn=peercred,cn=external,cn=auth — the local root via SASL EXTERNAL) can write to it.
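
Browsing cn=config as the local root over ldapi:// is the quickest way to see this structure. For example:

# View the config tree (DNs only), then one database's key settings
sudo ldapsearch -Y EXTERNAL -H ldapi:/// -b cn=config dn
sudo ldapsearch -Y EXTERNAL -H ldapi:/// \
  -b "olcDatabase={1}mdb,cn=config" olcSuffix olcRootDN olcDbIndex olcAccess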


Bootstrapping a Directory

The quickest way to get a working directory is a set of LDIF files applied in order.

1. Load schemas

# Apply the schemas OpenLDAP ships with
ldapadd -Y EXTERNAL -H ldapi:/// \
  -f /etc/ldap/schema/cosine.ldif
ldapadd -Y EXTERNAL -H ldapi:/// \
  -f /etc/ldap/schema/inetorgperson.ldif
ldapadd -Y EXTERNAL -H ldapi:/// \
  -f /etc/ldap/schema/nis.ldif       # adds posixAccount, posixGroup

2. Configure the MDB database

# mdb-config.ldif
dn: olcDatabase={1}mdb,cn=config
changetype: modify
replace: olcSuffix
olcSuffix: dc=corp,dc=com
-
replace: olcRootDN
olcRootDN: cn=admin,dc=corp,dc=com
-
replace: olcRootPW
olcRootPW: {SSHA}hashed_password_here

Generate the hash: slappasswd -s yourpassword

ldapmodify -Y EXTERNAL -H ldapi:/// -f mdb-config.ldif

3. Add indexes

# indexes.ldif
dn: olcDatabase={1}mdb,cn=config
changetype: modify
add: olcDbIndex
olcDbIndex: uid eq,pres
olcDbIndex: cn eq,sub
olcDbIndex: sn eq,sub
olcDbIndex: mail eq
olcDbIndex: memberOf eq
olcDbIndex: entryCSN eq
olcDbIndex: entryUUID eq

The last two (entryCSN, entryUUID) are required for SyncRepl replication to work efficiently.

4. Load initial data

# base.ldif
dn: dc=corp,dc=com
objectClass: top
objectClass: dcObject
objectClass: organization
o: Corp
dc: corp

dn: ou=people,dc=corp,dc=com
objectClass: organizationalUnit
ou: people

dn: ou=groups,dc=corp,dc=com
objectClass: organizationalUnit
ou: groups

dn: uid=vamshi,ou=people,dc=corp,dc=com
objectClass: inetOrgPerson
objectClass: posixAccount
objectClass: shadowAccount
cn: Vamshi Krishna
sn: Krishna
uid: vamshi
uidNumber: 1001
gidNumber: 1001
homeDirectory: /home/vamshi
loginShell: /bin/bash
mail: [email protected]
userPassword: {SSHA}hashed_password_here

ldapadd -x -H ldap://localhost \
  -D "cn=admin,dc=corp,dc=com" \
  -w adminpassword \
  -f base.ldif

ACLs: Who Can Read What

OpenLDAP ACLs are evaluated top-to-bottom; first match wins.

# acls.ldif — set via OLC
dn: olcDatabase={1}mdb,cn=config
changetype: modify
replace: olcAccess
# Users can change their own passwords
olcAccess: to attrs=userPassword
  by self write
  by anonymous auth
  by * none
# Authenticated users can read entries under ou=people (including their own)
olcAccess: to dn.subtree="ou=people,dc=corp,dc=com"
  by self read
  by users read
  by * none
# Service accounts can read everything (for SSSD)
olcAccess: to *
  by dn="cn=svc-ldap,ou=services,dc=corp,dc=com" read
  by self read
  by * none

A service account (cn=svc-ldap) that SSSD uses to search the directory needs read access to ou=people and ou=groups. Never give SSSD admin (write) access.


SyncRepl Replication

SyncRepl is a pull-based replication protocol built on the LDAP Sync operation (RFC 4533). A consumer connects to a provider and requests changes. The provider sends them. The consumer stays in sync.

On the Provider: Enable the syncprov overlay

# syncprov.ldif
# olcSpCheckpoint: checkpoint every 100 ops or 10 minutes
# olcSpSessionLog: keep the last 100 changes for delta-sync
dn: olcOverlay=syncprov,olcDatabase={1}mdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcSyncProvConfig
olcOverlay: syncprov
olcSpCheckpoint: 100 10
olcSpSessionLog: 100

ldapadd -Y EXTERNAL -H ldapi:/// -f syncprov.ldif

On the Consumer: Configure syncrepl

# consumer-config.ldif
# type=refreshAndPersist keeps a persistent connection (refreshOnly = polling)
# retry="5 5 60 +" means: retry 5 times every 5s, then every 60s forever
# interval only applies to refreshOnly: sync every 5 minutes
# olcUpdateRef redirects writes received on the consumer to the provider
dn: olcDatabase={1}mdb,cn=config
changetype: modify
add: olcSyncrepl
olcSyncrepl: rid=001
  provider=ldap://ldap1.corp.com:389
  bindmethod=simple
  binddn="cn=repl-svc,dc=corp,dc=com"
  credentials=replication-password
  searchbase="dc=corp,dc=com"
  scope=sub
  schemachecking=on
  type=refreshAndPersist
  retry="5 5 60 +"
  interval=00:00:05:00
-
add: olcUpdateRef
olcUpdateRef: ldap://ldap1.corp.com

refreshAndPersist keeps a persistent connection open. Changes replicate within milliseconds. refreshOnly polls on an interval — simpler, but adds latency.

Verify Replication

# On provider: check the contextCSN (the sync state token)
ldapsearch -x -H ldap://ldap1.corp.com \
  -D "cn=admin,dc=corp,dc=com" -w password \
  -b "dc=corp,dc=com" -s base contextCSN
# contextCSN: 20260427010000.000000Z#000000#000#000000

# On consumer: should match after sync
ldapsearch -x -H ldap://ldap2.corp.com \
  -D "cn=admin,dc=corp,dc=com" -w password \
  -b "dc=corp,dc=com" -s base contextCSN
# Same CSN = in sync

Multi-Provider: Accepting Writes on Both Nodes

Standard SyncRepl has one provider and one or more consumers — only the provider accepts writes. Multi-Provider (formerly Multi-Master) lets every node accept writes.

# On each node — add mirrormode to the database config
dn: olcDatabase={1}mdb,cn=config
changetype: modify
add: olcMirrorMode
olcMirrorMode: TRUE

With mirrormode enabled and each node configured as both provider and consumer of the other, writes on either node replicate to the other. Conflict resolution is CSN-based (Change Sequence Number) — a monotonically increasing timestamp. Last write wins at the attribute level.

Multi-Provider does not prevent split-brain conflicts — if two clients write the same attribute on two different nodes during a network partition, the higher CSN wins when the partition heals. For most directory use cases (user passwords, group memberships), this is acceptable. For others, it requires careful thought.


⚠ Production Gotchas

MDB data file grows monotonically. LMDB never shrinks the data file automatically. Deleted entries leave free space inside the file that gets reused, but the file on disk doesn’t shrink. Use slapcat to export and slapadd to reimport if you need to reclaim disk space.

slapcat is the only safe backup. slapcat reads the MDB database directly and exports LDIF — it does not go through slapd. Run it while slapd is running (LMDB is MVCC-safe for readers), but never copy the raw MDB files while slapd is running.
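
A minimal backup-and-restore sketch (paths and the openldap user/group are Debian/Ubuntu defaults; RHEL-family systems use ldap:ldap):

# Backup: export database {1} to LDIF while slapd is running
slapcat -n 1 -l /var/backups/corp-backup.ldif

# Restore: stop slapd, remove the MDB files, reimport, fix ownership
systemctl stop slapd
rm -f /var/lib/ldap/data.mdb /var/lib/ldap/lock.mdb
slapadd -n 1 -l /var/backups/corp-backup.ldif
chown -R openldap:openldap /var/lib/ldap
systemctl start slapd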

Schema changes on a replicated directory require coordination. Load the new schema on the provider first. SyncRepl will propagate it to consumers — but if a consumer gets a new entry using the new schema before the schema itself is replicated, the import will fail. Load schemas manually on all nodes before adding entries that use them.


Key Takeaways

  • OpenLDAP uses LMDB (MDB backend) — a memory-mapped, ACID-compliant storage engine with no external dependency
  • OLC (cn=config) is the right way to configure slapd — changes apply without restarts
  • SyncRepl pulls changes from a provider to a consumer — refreshAndPersist for near-real-time, refreshOnly for poll-based
  • Always index uid, cn, entryCSN, and entryUUID — unindexed searches are full scans
  • Multi-Provider allows writes on all nodes with CSN-based last-write-wins conflict resolution

What’s Next

A single OpenLDAP server works. Two nodes with SyncRepl work better. EP07 goes further: how you put multiple LDAP servers behind a load balancer, how connection pooling works, what to monitor, and how 389-DS handles directories with tens of millions of entries.

Next: LDAP High Availability: Load Balancing and Production Architecture

Get EP07 in your inbox when it publishes → linuxcent.com/subscribe

How Kerberos Works: Tickets, KDC, and Why Enterprises Use It With LDAP

Reading Time: 7 minutes

The Identity Stack, Episode 5
EP01 → EP02 → EP03 → EP04: SSSD → EP05 → EP06: OpenLDAP → …


TL;DR

  • Kerberos is a network authentication protocol — it proves identity without sending passwords over the network, using time-limited cryptographic tickets
  • Three actors: the client, the KDC (Key Distribution Center), and the service — the KDC issues tickets; clients use tickets to authenticate to services
  • The ticket flow: AS-REQ (get a TGT) → TGS-REQ (exchange TGT for a service ticket) → AP-REQ (present service ticket to the target service)
  • A TGT (Ticket-Granting Ticket) is a session credential — it lets you request service tickets without re-entering your password for the lifetime of the ticket (default 10 hours)
  • LDAP + Kerberos together: LDAP stores identity (who you are), Kerberos authenticates it (proves you are who you say you are) — Active Directory is exactly this combination
  • kinit, klist, kdestroy are the hands-on tools — run them and read the ticket output

The Big Picture: Three Actors, Three Steps

         1. AS-REQ / AS-REP
Client ◄────────────────────► AS (Authentication Server)
  │                                     │
  │    (part of KDC)                    │
  │                                     ▼
  │         2. TGS-REQ / TGS-REP   TGS (Ticket-Granting Server)
  ├───────────────────────────────────►│
  │         (part of KDC)              │
  │                                    │
  │    3. AP-REQ / AP-REP              │
  └─────────────────────────────► Service (SSH, LDAP, NFS, HTTP...)

KDC = AS + TGS (usually the same process, same machine)

EP04 mentioned Kerberos tickets and clock skew requirements without explaining the protocol. This episode explains why Kerberos was invented, what a ticket actually is, and how the three-step flow works — so that when SSSD says “KDC unreachable” or kinit fails with “pre-authentication required,” you know exactly what’s happening.


The Problem Kerberos Was Built to Solve

MIT’s Project Athena started in 1983 — a campus-wide computing initiative giving students access to thousands of workstations. The problem: how do you authenticate a student at workstation 847 to a file server across campus without sending their password over the network?

In 1988, Steve Miller and Clifford Neuman published Kerberos version 4. The core insight: a trusted third party (the KDC) can issue cryptographic proof that a user has authenticated, and that proof can be presented to any service on the network without the service ever seeing the user’s password.

The password never leaves the client machine after the initial authentication. Every subsequent authentication — to a different service, to the same service again — uses a ticket. The KDC knows both the client and the service. The client and service only need to trust the KDC.


Keys, Tickets, and Sessions

Before the protocol, the primitives:

Long-term keys — derived from passwords. When you set a password in Kerberos, it’s run through string-to-key derivation and the resulting key is stored in the KDC database (the directory database on AD, /var/lib/krb5kdc/principal on MIT Kerberos). The client derives the same key from the password at authentication time. Neither side ever sends the raw password.

Session keys — temporary symmetric keys created by the KDC for a specific session. They’re valid for the ticket’s lifetime. After the ticket expires, the session key is useless.

Tickets — encrypted blobs issued by the KDC. A ticket contains the session key, the client identity, the expiry time, and optional flags. It’s encrypted with the target service’s long-term key — only the service can decrypt it. The client carries the ticket but can’t read the contents.


The Three-Step Flow

Step 1: AS-REQ / AS-REP — Getting a TGT

Client                        KDC (AS component)
  │                                │
  │── AS-REQ ──────────────────────►
  │   {username, timestamp}         │
  │   (timestamp encrypted with     │
  │    client's long-term key)       │
  │                                 │
  │   KDC verifies: decrypts        │
  │   timestamp with stored key.    │
  │   If valid → issues TGT         │
  │                                 │
  ◄── AS-REP ──────────────────────│
      {session_key_enc_with_client, │
       TGT_enc_with_krbtgt_key}     │

The client decrypts the session key using its long-term key (derived from the password). The TGT is encrypted with the KDC’s own key (krbtgt) — the client can’t read it, but carries it.

This is the step that requires the password. After this, the TGT is what the client uses for everything else.

Step 2: TGS-REQ / TGS-REP — Getting a Service Ticket

Client                        KDC (TGS component)
  │                                │
  │── TGS-REQ ─────────────────────►
  │   {TGT, authenticator,         │
  │    target_service_name}        │
  │   (authenticator encrypted      │
  │    with TGT session key)        │
  │                                 │
  │   KDC: decrypts TGT,           │
  │   verifies authenticator,       │
  │   issues service ticket         │
  │                                 │
  ◄── TGS-REP ────────────────────│
      {service_session_key_enc,    │
       service_ticket_enc_with_    │
       service_long_term_key}      │

No password involved. The client proves its identity by presenting the TGT (which only the KDC can issue) and an authenticator (a timestamp encrypted with the TGT’s session key, proving the client holds the session key without revealing it).

Step 3: AP-REQ / AP-REP — Authenticating to the Service

Client                        Service (sshd, LDAP, NFS...)
  │                                │
  │── AP-REQ ──────────────────────►
  │   {service_ticket,             │
  │    authenticator_enc_with_      │
  │    service_session_key}        │
  │                                 │
  │   Service: decrypts ticket      │
  │   with its long-term key,       │
  │   verifies authenticator        │
  │                                 │
  ◄── AP-REP (optional) ───────────│
      {mutual authentication}       │

The service decrypts the ticket using its own key. It extracts the client identity and session key. It verifies the authenticator. No communication with the KDC required — the service trusts what the KDC signed.


Why Clock Skew Matters

Every Kerberos authenticator contains a timestamp. The service rejects authenticators older than 5 minutes (by default) — this prevents replay attacks where an attacker captures an authenticator and replays it later.

This is why clock skew over 5 minutes breaks Kerberos authentication entirely. If your machine’s clock drifts 6 minutes from the KDC, every authenticator you generate is rejected as too old or too far in the future. No tickets. No AD logins. No SSSD authentication.

# Check time sync status
timedatectl status
chronyc tracking        # if using chrony
ntpq -p                 # if using ntpd

# If clock is off: force a sync
chronyc makestep        # immediate step correction (chrony)

Hands-On: kinit, klist, kdestroy

# Get a TGT (will prompt for password)
kinit [email protected]

# Show current tickets
klist
# Credentials cache: FILE:/tmp/krb5cc_1001
# Principal: [email protected]
#
# Valid starting     Expires            Service principal
# 04/27/26 01:00:00  04/27/26 11:00:00  krbtgt/[email protected]
#   renew until 05/04/26 01:00:00

# Show encryption types used (the -e flag)
klist -e
# 04/27/26 01:00:00  04/27/26 11:00:00  krbtgt/[email protected]
#         Etype: aes256-cts-hmac-sha1-96, aes256-cts-hmac-sha1-96

# Get a service ticket for a specific service
kvno host/[email protected]
# host/[email protected]: kvno = 3

# Show the flags set on each ticket in the cache
klist -f
# Flags: F=forwardable, f=forwarded, P=proxiable, p=proxy, D=postdateable,
#        d=postdated, R=renewable, I=initial, i=invalid, H=hardware auth

# Destroy all tickets
kdestroy

The Valid starting and Expires fields are the ticket lifetime. After expiry, you need to re-authenticate (or renew the ticket if it’s within the renew until window). The renew until date is when even renewal stops working.
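
Renewing is a kinit flag, and you can also ask for explicit lifetimes up front:

# Renew the current TGT (works until the "renew until" time)
kinit -R

# Request a TGT with an explicit ticket lifetime and renewable lifetime
kinit -l 10h -r 7d [email protected]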


/etc/krb5.conf

[libdefaults]
    default_realm = CORP.COM
    dns_lookup_realm = false
    dns_lookup_kdc = true         # find KDCs via DNS SRV records
    ticket_lifetime = 10h
    renew_lifetime = 7d
    forwardable = true            # tickets can be forwarded to remote hosts (needed for SSH forwarding)
    rdns = false

[realms]
    CORP.COM = {
        kdc = dc01.corp.com
        kdc = dc02.corp.com       # failover KDC
        admin_server = dc01.corp.com
    }

[domain_realm]
    .corp.com = CORP.COM
    corp.com = CORP.COM

With dns_lookup_kdc = true, Kerberos finds KDCs by querying DNS SRV records (_kerberos._tcp.corp.com). AD sets these up automatically. On MIT Kerberos, you add them manually. DNS-based discovery is the recommended approach for AD environments — it picks up new DCs automatically.
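
You can check what that DNS-based discovery will find with a couple of SRV lookups:

# KDC locator records (what dns_lookup_kdc resolves)
dig +short _kerberos._tcp.corp.com SRV
dig +short _kerberos._udp.corp.com SRV

# In AD, the DC locator records live under _msdcs
dig +short _ldap._tcp.dc._msdcs.corp.com SRV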


Kerberos + LDAP: Why Enterprises Run Both

LDAP and Kerberos solve different problems and are almost always deployed together:

LDAP answers:  "Who is vamshi? What groups is he in? What's his home directory?"
Kerberos answers: "Is this really vamshi? Prove it without sending a password."

Active Directory is exactly this combination — the directory is LDAP-based, the authentication is Kerberos. When a Linux machine joins an AD domain via realm join or adcli, it gets:
– LDAP access to the AD directory (for NSS: user and group lookups)
– A Kerberos principal registered in AD (for PAM: ticket-based authentication)
– A machine account (the machine’s identity in the directory)

When you SSH into an AD-joined Linux machine:
1. SSSD issues a Kerberos AS-REQ for the user’s TGT
2. SSSD validates that TGT by obtaining a service ticket for the machine’s own host/ principal and checking it against /etc/krb5.keytab (protection against a spoofed KDC)
3. If both steps succeed, the user is authenticated — no LDAP Bind with a password
4. SSSD does an LDAP Search to get POSIX attributes (UID, GID, home dir)

Password-based LDAP Bind is the fallback when Kerberos isn’t available. Kerberos is the default on AD-joined systems — and it’s more secure because the password never leaves the client.


⚠ Common Misconceptions

“Kerberos sends your password to the KDC.” It doesn’t. The client derives a key from the password locally and uses that key to encrypt a timestamp (the pre-authentication data). The KDC verifies the timestamp using the stored key. The raw password never travels.

“Kerberos is an authorization protocol.” Kerberos authenticates — it proves who you are. Authorization (what you can do) is a separate decision, usually handled by ACLs on the service or directory group membership.

“Once you have a TGT, you’re authenticated to everything.” A TGT only proves your identity to the KDC. Each service requires a separate service ticket. The TGT is what lets you get those service tickets without re-entering your password.

“Kerberos requires AD.” MIT Kerberos 5 is a standalone implementation. FreeIPA (EP08) runs MIT Kerberos. Heimdal is another implementation. AD uses a Microsoft-extended version of Kerberos 5, but the core protocol is the same RFC.


Framework Alignment

CISSP Domain 5: Identity and Access Management
  Kerberos is the de facto enterprise authentication protocol — SSO, delegation, and service account authentication all depend on it

CISSP Domain 4: Communications and Network Security
  Kerberos prevents credential sniffing and replay attacks — two of the core network authentication threat categories

CISSP Domain 3: Security Architecture and Engineering
  The KDC is a critical single point of trust — its availability, key management, and krbtgt key rotation are architectural security decisions

Key Takeaways

  • Kerberos is a ticket-based protocol — the password is used once to get a TGT; from then on, tickets prove identity without the password
  • The three-step flow: get a TGT from the AS, exchange it for a service ticket at the TGS, present the service ticket to the target service
  • Clock skew over 5 minutes breaks Kerberos — time synchronization is a hard dependency
  • LDAP stores identity; Kerberos authenticates it — Active Directory is exactly this combination, and so is FreeIPA
  • klist -e shows the encryption types in use — aes256-cts-hmac-sha1-96 is what you want to see; arcfour-hmac (RC4) is legacy and should be disabled

What’s Next

EP05 covered Kerberos as a protocol. EP06 goes hands-on: building a real LDAP directory with OpenLDAP, configuring replication, and understanding how the server-side components — slapd, the MDB backend, SyncRepl — fit together.

Next: OpenLDAP Setup and Replication: Running Your Own Directory

Get EP06 in your inbox when it publishes → linuxcent.com/subscribe

Hardening Blueprint as Code — Declare Your OS Baseline in YAML

Reading Time: 6 minutes

OS Hardening as Code, Episode 2
Cloud AMI Security Risks · Linux Hardening as Code


TL;DR

  • A hardening runbook is a list of steps someone runs. A HardeningBlueprint YAML is a build artifact — if it wasn’t applied, the image doesn’t exist
  • Linux hardening as code means declaring your entire OS security baseline in a single YAML file and building it reproducibly across any provider
  • stratum build --blueprint ubuntu22-cis-l1.yaml --provider aws either produces a hardened image or fails — there is no partial state
  • The blueprint includes: target OS/provider, compliance benchmark, Ansible roles, and per-control overrides with documented reasons
  • One blueprint file = one source of truth for your hardening posture, version-controlled and reviewable like any other infrastructure code
  • Post-build OpenSCAP scan runs automatically — the image only snapshots if it passes

The Problem: A Runbook That Gets Skipped Once Is a Runbook That Gets Skipped

Hardening runbook
       │
       ▼
  Human executes
  steps manually
       │
       ├─── 47 deployments: followed correctly
       │
       └─── 1 deployment at 2am: step 12 skipped
                    │
                    ▼
           Instance in production
           without audit logging,
           SSH password auth enabled,
           unnecessary services running

Linux hardening as code eliminates the human decision point. If the blueprint wasn’t applied, the image doesn’t exist.

EP01 showed that default cloud AMIs arrive pre-broken — unnecessary services, no audit logging, weak kernel parameters, SSH configured for convenience not security. The obvious response is a hardening script. But a script run by a human is still a process step. It can be skipped. It can be done halfway. It can drift across different engineers who each interpret “run the hardening script” slightly differently.


A production deployment last year. The platform team had a solid CIS L1 hardening runbook — 68 steps, well-documented, followed consistently. Then a critical incident at 2am required three new instances to be deployed on short notice. The engineer on call ran the provisioning script and, under pressure, skipped the hardening step with the intention of running it the next morning.

They didn’t. The three instances stayed in production unhardened for six weeks before an automated scan caught them. Audit logging wasn’t configured. SSH was accepting password authentication. Two unnecessary services were running that weren’t in the approved software list.

Nothing was breached. But the finding went into the next compliance report as a gap, the team spent a week remediating, and the post-mortem conclusion was “we need better runbook discipline.”

That’s the wrong conclusion. The runbook isn’t the problem. The problem is that hardening was a process step instead of a build constraint.


What Linux Hardening as Code Actually Means

Linux hardening as code is the same principle as infrastructure as code applied to OS security posture: the desired state is declared in a file, the file is the source of truth, and the execution is deterministic and repeatable.

HardeningBlueprint YAML
         │
         ▼
  stratum build
         │
  ┌──────┴──────────────────┐
  │  Provider Layer          │
  │  (cloud-init, disk       │
  │   names, metadata        │
  │   endpoint per provider) │
  └──────┬──────────────────┘
         │
  ┌──────┴──────────────────┐
  │  Ansible-Lockdown        │
  │  (CIS L1/L2, STIG —      │
  │   the hardening steps)   │
  └──────┬──────────────────┘
         │
  ┌──────┴──────────────────┐
  │  OpenSCAP Scanner        │
  │  (post-build verify)     │
  └──────┬──────────────────┘
         │
         ▼
  Golden Image (AMI/GCP image/Azure image)
  + Compliance grade in image metadata

The YAML file is what you write. Stratum handles the rest.


The HardeningBlueprint YAML

The blueprint is the complete, auditable declaration of your OS security posture:

# ubuntu22-cis-l1.yaml
name: ubuntu22-cis-l1
description: Ubuntu 22.04 CIS Level 1 baseline for production workloads
version: "1.0"

target:
  os: ubuntu
  version: "22.04"
  provider: aws
  region: ap-south-1
  instance_type: t3.medium

compliance:
  benchmark: cis-l1
  controls: all

hardening:
  - ansible-lockdown/UBUNTU22-CIS
  - role: custom-audit-logging
    vars:
      audit_log_retention_days: 90
      audit_max_log_file: 100

filesystem:
  tmp:
    type: tmpfs
    options: [nodev, nosuid, noexec]
  home:
    options: [nodev]

controls:
  - id: 1.1.2
    override: compliant
    reason: "tmpfs /tmp implemented via systemd unit — equivalent control"
  - id: 5.2.4
    override: compliant
    reason: "SSH timeout managed by session manager policy, not sshd_config"

Each section is explicit:

target — which OS, which version, which provider. This is the only provider-specific section. The compliance intent below it is portable.

compliance — which benchmark and which controls to apply. controls: all means every CIS L1 control. You can also specify controls: [1.x, 2.x] to scope to specific sections.

hardening — which Ansible roles to run. ansible-lockdown/UBUNTU22-CIS is the community CIS hardening role. You can add custom roles alongside it.

controls — documented exceptions. Not suppressions — overrides with a recorded reason. This is the difference between “we turned off this control” and “this control is satisfied by an equivalent implementation, documented here.”


Building the Image

# Validate the blueprint before building
stratum blueprint validate ubuntu22-cis-l1.yaml

# Build — this will take 15-20 minutes
stratum build --blueprint ubuntu22-cis-l1.yaml --provider aws

# Output:
# [15:42:01] Launching build instance...
# [15:42:45] Running ansible-lockdown/UBUNTU22-CIS (144 tasks)...
# [15:51:33] Running custom-audit-logging role...
# [15:52:11] Running post-build OpenSCAP scan (benchmark: cis-l1)...
# [15:54:08] Grade: A (98/100 controls passing)
# [15:54:09] 2 controls overridden (documented in blueprint)
# [15:54:10] Creating AMI snapshot: ami-0a7f3c9e82d1b4c05
# [15:54:47] Done. AMI tagged with compliance grade: cis-l1-A-98

If the post-build scan comes back below a configurable threshold, the build fails — no AMI is created. The instance is terminated. The image does not exist.

That is the structural guarantee. You cannot skip a build step at 2am because at 2am you’re calling stratum build, not running steps manually.


The Control Override Mechanism

The override mechanism is what separates this from checkbox compliance.

Every security benchmark has controls that conflict with how production environments actually work. CIS L1 recommends /tmp on a separate partition. Many cloud instances use tmpfs with equivalent nodev, nosuid, noexec mount options. The intent of the control is satisfied. The literal implementation differs.

Without an override mechanism, you have two bad options: fail the scan (noisy, meaningless), or configure the scanner to ignore the control (undocumented, invisible to auditors).

The blueprint’s controls section gives you a third option: record the override, document the reason, and let the scanner count it as compliant. The SARIF output and the compliance grade both reflect the documented state.

controls:
  - id: 1.1.2
    override: compliant
    reason: "tmpfs /tmp implemented via systemd unit — equivalent control"

This appears in the build log, in the SARIF export, and in the image metadata. An auditor reading the output sees: control 1.1.2 — compliant, documented exception, reason recorded. Not: control 1.1.2 — ignored.


What the Blueprint Gives You That a Script Doesn’t

                                Hardening script             HardeningBlueprint YAML
Version-controlled              Possible but not enforced    Always — it’s a file
Auditable exceptions            Typically not                Built-in override mechanism
Post-build verification         Manual or none               Automatic OpenSCAP scan
Image exists only if hardened   No                           Yes — build fails if scan fails
Multi-cloud portability         Requires separate scripts    Provider flag, same YAML
Drift detection                 Not possible                 Rescan instance against original grade
Skippable at 2am                Yes                          No — you’d have to change the build process

The last row is the one that matters. A script is skippable because there’s a human in the loop. A blueprint is a build artifact — you can’t deploy the image without the blueprint having been applied, because the image is what the blueprint produces.


Validating a Blueprint Before Building

# Syntax and schema validation
stratum blueprint validate ubuntu22-cis-l1.yaml

# Dry-run — show what Ansible tasks will run, what controls will be checked
stratum build --blueprint ubuntu22-cis-l1.yaml --provider aws --dry-run

# Show all available controls for a benchmark
stratum blueprint controls --benchmark cis-l1 --os ubuntu --version 22.04

# Show what a specific control checks
stratum blueprint controls --id 1.1.2 --benchmark cis-l1

The dry-run output shows every Ansible task that will run, every OpenSCAP check that will fire, and flags any controls that might conflict with the provider environment before you’ve launched a build instance.


Production Gotchas

Build time is 15–25 minutes. Ansible-Lockdown applies 144+ tasks for CIS L1. Build this into your pipeline timing — don’t expect golden images in 3 minutes.

Cloud-init ordering matters. On AWS, certain hardening steps (sysctl tuning, PAM configuration) interact with cloud-init. The Stratum provider layer handles sequencing — but if you add custom hardening roles, test the cloud-init interaction explicitly.

Some CIS controls conflict with managed service requirements. AWS Systems Manager Session Manager requires specific SSH configuration. RDS requires specific networking settings. Use the controls override section to document these — don’t suppress them silently.

Kernel parameter hardening requires a reboot. Controls in the 3.x (network parameters) and 1.5.x (kernel modules) sections apply sysctl changes that take effect on reboot. The Stratum build process reboots the instance before the OpenSCAP scan — don’t skip the reboot if you’re building manually.
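
If you are building manually, a quick post-reboot spot-check of a few 3.x network parameters confirms the sysctl changes actually took effect (which parameter maps to which control ID varies by benchmark version):

# Expected values under CIS L1 for a non-routing host
sysctl net.ipv4.ip_forward                 # 0
sysctl net.ipv4.conf.all.send_redirects    # 0
sysctl net.ipv4.conf.all.accept_redirects  # 0
sysctl net.ipv4.conf.all.log_martians      # 1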


Key Takeaways

  • Linux hardening as code means the blueprint YAML is the build artifact — the image either exists and is hardened, or it doesn’t exist
  • The controls override mechanism is the difference between undocumented suppressions and auditable, reasoned exceptions
  • Post-build OpenSCAP scan runs automatically — a failing grade blocks image creation
  • One blueprint file is portable across providers (EP03 covers this): the compliance intent stays in the YAML, the cloud-specific details go in the provider layer
  • Version-controlling the blueprint gives you a complete history of what your OS security posture was at any point in time — the same way Terraform state tracks infrastructure

What’s Next

One blueprint, one provider. EP02 showed that the skip-at-2am problem is solved when hardening is a build artifact rather than a process step.

What it didn’t address: what happens when you expand to a second cloud. GCP uses different disk names. Azure cloud-init fires in a different order. The AWS metadata endpoint IP is different from every other provider. If you maintain separate hardening scripts per cloud, they drift within a month.

EP03 covers multi-cloud OS hardening: the same blueprint, six providers, no drift.

Next: Multi-Cloud OS Hardening — One Blueprint for AWS, GCP, and Azure

Get EP03 in your inbox when it publishes → linuxcent.com/subscribe