FreeIPA: LDAP + Kerberos + PKI in a Single Linux Identity Stack

Reading Time: 5 minutes

The Identity Stack, Episode 8
EP07: LDAP HAEP08EP09: Active Directory → …


TL;DR

  • FreeIPA is 389-DS (LDAP) + MIT Kerberos + Dogtag PKI + Bind DNS + SSSD — one ipa-server-install command gets you an enterprise identity platform
  • Host-Based Access Control (HBAC) lets you define centrally: which users can SSH to which hosts — no more managing /etc/security/access.conf per machine
  • Sudo rules from the directory: define sudo policy centrally, have every machine pull it — no /etc/sudoers.d/ files scattered across the fleet
  • ipa CLI is the management interface — ipa user-add, ipa group-add, ipa hbacrule-add — everything that took five LDAP commands takes one ipa command
  • FreeIPA trusts with Active Directory let Linux machines authenticate AD users without joining the AD domain
  • The right choice for Linux-centric environments; AD is the right choice when Windows clients dominate

The Big Picture: What FreeIPA Integrates

┌─────────────────────────────────────────────────────────┐
│                    FreeIPA Server                        │
│                                                         │
│  389-DS (LDAP)    MIT Kerberos    Dogtag PKI            │
│  ─────────────    ───────────     ─────────             │
│  User/group       TGT + service   Machine certs         │
│  storage          ticket issuing  User certs             │
│                                   OCSP / CRL            │
│  Bind DNS         SSSD (client)   Apache (WebUI)        │
│  ──────────       ────────────    ──────────────        │
│  SRV records      Enrollment      Management UI         │
│  for KDC/LDAP     automation      REST API              │
└─────────────────────────────────────────────────────────┘
              ▲                  ▲
              │ enrollment       │ SSH + sudo rules
   ┌──────────┴──────────┐  ┌───┴──────────────────┐
   │  Linux client        │  │  Linux client         │
   │  (ipa-client-install)│  │  (ipa-client-install) │
   └─────────────────────┘  └──────────────────────┘

EP06 and EP07 built OpenLDAP from components. FreeIPA gives you all of that plus Kerberos, PKI, DNS, and HBAC — opinionated, integrated, and managed through a single CLI and WebUI. This episode shows what you actually get from it.


Why FreeIPA Instead of Bare OpenLDAP

Running bare OpenLDAP requires you to:
– Configure schema for POSIX accounts, SSH keys, sudo rules, HBAC manually
– Set up MIT Kerberos separately and integrate it with LDAP
– Build your own PKI for machine certificates
– Maintain DNS SRV records for Kerberos discovery
– Write client enrollment scripts
– Build a management interface (or live in LDIF)

FreeIPA does all of this in one installer, with a consistent data model across all components. The trade-off is opacity — FreeIPA makes decisions for you (schema, replication topology, Kerberos realm name) that bare OpenLDAP leaves to you.


Installing FreeIPA Server

# RHEL / Rocky / AlmaLinux
dnf install -y freeipa-server freeipa-server-dns

# Run the installer (interactive)
ipa-server-install

# Or non-interactive:
ipa-server-install \
  --realm=CORP.COM \
  --domain=corp.com \
  --ds-password=DM_password \
  --admin-password=Admin_password \
  --setup-dns \
  --forwarder=8.8.8.8 \
  --unattended

# After install: get an admin Kerberos ticket
kinit admin

The installer creates:
– 389-DS instance with the FreeIPA schema
– MIT KDC with realm CORP.COM
– Dogtag CA and all certificate infrastructure
– Bind DNS with SRV records for the KDC and LDAP server
– Apache WebUI at https://ipa.corp.com/ipa/ui/
– SSSD configured on the server itself

Time: 5–10 minutes. What used to take a week of manual configuration.


The ipa CLI

Every management action goes through ipa. It talks to the IPA server’s REST API and handles Kerberos authentication transparently (it uses your kinit session).

# Users
ipa user-add vamshi \
  --first=Vamshi --last=Krishna \
  [email protected] \
  --password

ipa user-show vamshi
ipa user-find --all              # search all users
ipa user-disable vamshi          # lock account without deleting
ipa user-mod vamshi --shell=/bin/zsh

# Groups
ipa group-add engineers --desc "Engineering team"
ipa group-add-member engineers --users=vamshi,alice

# Password policy
ipa pwpolicy-mod --minlength=12 --maxlife=90 --history=10

# SSH public keys — stored centrally, pushed to every host
ipa user-mod vamshi --sshpubkey="ssh-ed25519 AAAA..."
# SSSD on enrolled hosts will use this key for SSH login — no authorized_keys file needed

Host-Based Access Control (HBAC)

HBAC is the feature that justifies FreeIPA for most Linux shops. It lets you define centrally: which users (or groups) can log in to which hosts (or host groups), using which services (SSH, sudo, FTP).

Without HBAC, access control is per-machine: /etc/security/access.conf or PAM pam_access rules, replicated across every server, managed inconsistently.

With HBAC: one rule, enforced everywhere.

# Create host groups
ipa hostgroup-add production-servers --desc "Production Linux hosts"
ipa hostgroup-add-member production-servers --hosts=web01.corp.com,db01.corp.com

# Create user groups
ipa group-add sre-team
ipa group-add-member sre-team --users=vamshi,alice

# Create an HBAC rule
ipa hbacrule-add allow-sre-to-prod \
  --desc "SRE team can SSH to production"
ipa hbacrule-add-user allow-sre-to-prod --groups=sre-team
ipa hbacrule-add-host allow-sre-to-prod --hostgroups=production-servers
ipa hbacrule-add-service allow-sre-to-prod --hbacsvcs=sshd

# Test the rule before applying
ipa hbactest \
  --user=vamshi \
  --host=web01.corp.com \
  --service=sshd
# Access granted: True
# Matched rules: allow-sre-to-prod

SSSD on each enrolled host enforces the HBAC rules at login time by querying the IPA server. No per-machine configuration. Add a new server to the production-servers host group and the HBAC rules apply immediately.


Sudo Rules from the Directory

# Create a sudo rule
ipa sudorule-add allow-sre-sudo \
  --cmdcat=all \
  --desc "SRE team gets full sudo on production"
ipa sudorule-add-user allow-sre-sudo --groups=sre-team
ipa sudorule-add-host allow-sre-sudo --hostgroups=production-servers

# Or a scoped rule — only specific commands
ipa sudorule-add allow-service-restart
ipa sudocmdgroup-add service-commands
ipa sudocmd-add /usr/bin/systemctl
ipa sudocmdgroup-add-member service-commands --sudocmds="/usr/bin/systemctl"
ipa sudorule-add-allow-command allow-service-restart --sudocmdgroups=service-commands

On enrolled hosts, SSSD’s sssd_sudo responder pulls these rules and the sudo command evaluates them locally. No /etc/sudoers.d/ files. Central policy, local enforcement.


Enrolling a Client

# On the client machine
dnf install -y freeipa-client

ipa-client-install \
  --domain=corp.com \
  --server=ipa.corp.com \
  --realm=CORP.COM \
  --principal=admin \
  --password=Admin_password \
  --unattended

# What this does:
# 1. Registers the host in IPA as a machine principal
# 2. Retrieves a host Kerberos keytab (/etc/krb5.keytab)
# 3. Configures SSSD (sssd.conf, nsswitch.conf, pam.d)
# 4. Configures Kerberos (/etc/krb5.conf)
# 5. Optionally configures NTP and DNS

After enrollment: getent passwd vamshi returns the IPA user. SSH with an IPA password works. HBAC rules are enforced. Sudo rules from the directory apply. SSH public keys from the user’s IPA profile work without authorized_keys files.


FreeIPA Trust with Active Directory

In mixed environments (Linux servers + Windows clients), you can establish a trust between FreeIPA and AD without joining the Linux servers to the AD domain directly.

# On the IPA server (after installing ipa-server-trust-ad)
ipa-adtrust-install --netbios-name=CORP

# Establish the trust
ipa trust-add ad.corp.com \
  --admin=Administrator \
  --password \
  --type=ad

# AD users can now log in to IPA-enrolled Linux hosts
# They appear as: CORP.COM\username or [email protected]

Under the hood: FreeIPA acts as an SSSD-enabled Samba DC for the trust relationship. AD users’ Kerberos tickets from the AD KDC are accepted by the FreeIPA KDC, which maps them to POSIX attributes stored in IPA (or automatically generated via ID mapping).


⚠ Common Misconceptions

“FreeIPA is just OpenLDAP with a UI.” FreeIPA uses 389-DS (not OpenLDAP), adds a full Kerberos KDC, a certificate authority, DNS, HBAC enforcement, and sudo management — all with a consistent schema designed for these use cases. It’s an integrated identity platform, not a wrapper.

“HBAC rules replace firewall rules.” HBAC controls who can log in to a host at the authentication layer — not network access. A blocked HBAC rule means the SSH session is rejected after TCP connection. You still need firewall rules to block TCP access.

“FreeIPA replicas are identical.” FreeIPA uses 389-DS Multi-Supplier replication. All replicas accept reads and writes. But the CA is separate — only the initial server (and explicitly designated CA replicas) run the CA. If the CA goes down, certificate operations stop; authentication does not.


Framework Alignment

Domain Relevance
CISSP Domain 5: Identity and Access Management FreeIPA is an enterprise IAM platform — HBAC, sudo policy, SSH key management, and certificate-based authentication are all IAM controls
CISSP Domain 3: Security Architecture and Engineering FreeIPA’s integrated CA enables certificate-based authentication for machines and users — a stronger authentication factor than passwords
CISSP Domain 1: Security and Risk Management Centralized HBAC and sudo policy reduces the attack surface of privilege escalation — no more inconsistent sudoers files that drift across the fleet

Key Takeaways

  • FreeIPA = 389-DS + MIT Kerberos + Dogtag PKI + Bind DNS — one installer, one management interface
  • HBAC rules define centrally who can SSH to which host groups — enforced by SSSD on every enrolled client, no per-machine config
  • Sudo rules from the directory replace scattered /etc/sudoers.d/ files — central policy, SSSD-enforced locally
  • ipa hbactest lets you verify access rules before a user hits a blocked login — use it before every policy change
  • For Linux-centric environments: FreeIPA. For Windows-dominant environments: AD. For mixed: FreeIPA trust with AD.

What’s Next

FreeIPA is the Linux answer to enterprise identity. EP09 covers the Microsoft answer — Active Directory — which extended LDAP and Kerberos into a complete enterprise platform with Group Policy, Sites, and a replication model built for global scale.

Next: How Active Directory Works: LDAP, Kerberos, and Group Policy Under the Hood

Get EP09 in your inbox when it publishes → linuxcent.com/subscribe

GCP Secure Boot Certificate Expiration 2026: What You Must Do Before June 24

Reading Time: 10 minutes


TL;DR

  • Three Microsoft UEFI Secure Boot certificates expire between June 24 and October 19, 2026
  • Any GCP Compute Engine instance with Secure Boot enabled, created before November 7, 2025, carries the old certs and is at risk
  • When the certs expire, instances may fail to boot after OS updates that pull in bootloaders signed only by the replacement 2023 certificates
  • GKE Shielded Nodes are affected too — node pools whose nodes haven’t been recreated since November 7, 2025 carry the old UEFI database
  • vTPM-sealed secrets, BitLocker, and Linux full disk encryption break if Secure Boot fails mid-update
  • Primary fix: recreate affected instances (post-Nov 7, 2025 instances include the updated UEFI DB automatically)
  • Emergency workaround if boot fails: temporarily disable Secure Boot, apply updates, re-enable

The Big Picture: The UEFI Secure Boot Trust Chain

  UEFI Firmware (PK — Platform Key, set by OEM/Google)
         │
         │  PK signs KEK updates
         ▼
  ┌─────────────────────────────────────────────┐
  │        KEK (Key Exchange Key Database)       │
  │  Microsoft Corporation KEK CA 2011           │ ← EXPIRING Jun 24, 2026
  │  Microsoft Corporation KEK CA 2023           │ ← Replacement (new VMs only)
  └────────────────────┬────────────────────────┘
                       │  KEK authorizes DB/DBX updates
                       ▼
  ┌─────────────────────────────────────────────┐
  │         DB (Authorized Signature Database)   │
  │  Microsoft UEFI CA 2011 ← signs Linux Shim  │ ← EXPIRING Jun 27, 2026
  │  Microsoft Windows PCA 2011 ← signs WinBoot │ ← EXPIRING Oct 19, 2026
  │  Microsoft UEFI CA 2023 ← replacement       │ ← Present on post-Nov 7 VMs
  │  Microsoft Windows PCA 2023 ← replacement   │ ← Present on post-Nov 7 VMs
  └────────┬───────────────────────┬────────────┘
           │                       │
           ▼                       ▼
   Linux Shim (shim.efi)    Windows Boot Manager
           │
           ▼
       GRUB2 / systemd-boot
           │
           ▼
       Linux Kernel

GCP Compute Engine instances with Secure Boot enabled — created before November 7, 2025 — have a UEFI signature database that includes the 2011 certificates but not the 2023 replacements. When those 2011 certificates expire, new bootloader binaries (signed exclusively by the 2023 certs) will be rejected at boot time.


What Secure Boot Actually Does — and Why Certificate Expiry Breaks Booting

Secure Boot is UEFI’s mechanism for ensuring that only cryptographically signed, trusted software runs during the boot sequence. The trust chain works like this:

  1. Platform Key (PK): Root of trust, set by the hardware manufacturer or cloud provider. Authorizes updates to the KEK.
  2. Key Exchange Key (KEK): Authorizes modifications to the DB and DBX (the forbidden signatures database). Microsoft holds one KEK slot; OEMs often hold another.
  3. DB (Signature Database): Contains the public certificates used to verify bootloaders. If a bootloader binary is signed by a cert in DB, it’s allowed to run. If not, the firmware halts.
  4. DBX (Forbidden Signatures Database): Revocation list. Bootloaders explicitly listed here are blocked even if they were once trusted.

Where expiry matters: The DB certificates don’t “enforce” anything at runtime by checking dates themselves — UEFI doesn’t do certificate revocation in real time. The problem is different and more insidious: as Linux distributions and Microsoft ship updated bootloaders, those new binaries are signed only by the 2023 replacement certificates, not the expiring 2011 ones. If your VM’s DB doesn’t contain the 2023 certs, the UEFI firmware will reject the new shim, and the system won’t boot after an OS update that upgrades the bootloader package.

On Debian/Ubuntu, shim-signed upgrades. On RHEL/CentOS Stream, shim-x64 upgrades. Either way: new binary, new signature, old DB — boot failure.


The Three Certificates Expiring in 2026

1. Microsoft Corporation KEK CA 2011 — expires June 24, 2026

Role: Authorizes updates to the DB and DBX signature databases.

When the KEK expires, firmware that enforces KEK validity may refuse to accept DB/DBX updates signed by this certificate. This means even if Google pushes an out-of-band UEFI DB update containing the 2023 certs, instances with an expired-only KEK slot may not be able to apply it cleanly.

Replacement: Microsoft Corporation KEK CA 2023


2. Microsoft Corporation UEFI CA 2011 — expires June 27, 2026

Role: Signs third-party bootloaders — specifically the Linux Shim (shim.efi).

This is the most critical cert for Linux workloads. Every major Linux distribution uses a shim bootloader as the first-stage loader in a Secure Boot chain. The shim is signed by Microsoft’s UEFI CA because Linux vendors submit their shim builds to Microsoft for signing (to ensure broad UEFI compatibility). When new shim packages are released signed only by UEFI CA 2023, any VM with only the 2011 cert in its DB will reject them.

Replacement: Microsoft UEFI CA 2023


3. Microsoft Windows Production PCA 2011 — expires October 19, 2026

Role: Signs Windows Boot Manager and other Windows boot components.

Windows instances on GCP using Secure Boot are affected by this cert. Post-expiry Windows OS updates that ship a new Boot Manager binary signed exclusively by the 2023 PCA will fail to boot on instances carrying only the 2011 cert.

Replacement: Microsoft Windows Production PCA 2023

Windows-specific signal: Event ID 1801 in the Windows System event log — “Secure Boot CA/keys need to be updated” — will appear by mid-2026 on affected instances, before actual boot failure. This is your warning window.


Why GCP Instances Are Specifically Affected

Google’s Compute Engine Shielded VMs ship with a pre-populated UEFI variable database. The content of that database is fixed at instance creation time — it’s part of the VM’s UEFI firmware image. Instances created before November 7, 2025 have a DB that contains the 2011 certs but not the 2023 replacements. Instances created on or after November 7, 2025 had the updated database backfilled.

This is not a Google-specific failure. Every cloud provider and on-premises hypervisor platform that uses Secure Boot with a pre-populated UEFI DB has the same problem. GCP is ahead of many platforms in actually documenting it.


GKE Shielded Nodes: The Operational Blind Spot

GKE’s Shielded Nodes feature enables Secure Boot on node pool VMs. Each node is a Compute Engine instance — and all the same rules apply.

The risk: Node pools whose nodes were last provisioned before November 7, 2025 carry the old UEFI database. When containerd, the OS image, or the kernel gets updated via node auto-upgrade or manual node pool upgrade, the new node VMs will carry updated certs. But nodes that haven’t been replaced since before the cutoff are sitting on the old DB.

GKE auto-upgrade helps — but only if it’s actually running and has completed at least one full node replacement cycle since November 7, 2025.

Node pools with auto-upgrade disabled, or clusters in maintenance windows that delayed upgrades, are at risk.

The trigger scenario:
1. GKE runs a node OS update in-place on an old node (not a full node replacement)
2. The update upgrades the shim package to a version signed only by UEFI CA 2023
3. Next reboot: the node fails to boot
4. The node is marked NotReady, workloads are rescheduled — but the underlying VM is stuck


Detecting Affected Resources

Compute Engine Instances

gcloud compute instances list \
  --filter="creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true" \
  --format="table(name,zone,creationTimestamp,shieldedInstanceConfig.enableSecureBoot,status)"

Sample output:

NAME               ZONE           CREATION_TIMESTAMP        ENABLE_SECURE_BOOT  STATUS
prod-api-01        us-central1-a  2024-08-15T10:22:00Z      True                RUNNING   ← at risk
prod-db-02         us-central1-b  2023-11-01T08:15:00Z      True                RUNNING   ← at risk
prod-web-03        us-central1-a  2025-12-01T14:30:00Z      True                RUNNING   ← safe (post-Nov 7)

GKE Node Pools

# List node pools with Secure Boot enabled per cluster
gcloud container clusters list --format="value(name,location)" | while read NAME LOCATION; do
  echo "=== Cluster: $NAME ($LOCATION) ==="
  gcloud container node-pools list \
    --cluster="$NAME" \
    --location="$LOCATION" \
    --filter="config.shieldedInstanceConfig.enableSecureBoot=true" \
    --format="table(name,config.shieldedInstanceConfig.enableSecureBoot,management.autoUpgrade)"
done

Then verify node creation timestamps within affected pools:

gcloud compute instances list \
  --filter="labels.goog-gke-node:* AND creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true" \
  --format="table(name,zone,creationTimestamp,labels.goog-gke-node)"

Checking the UEFI DB on a Running Instance

SSH into an affected instance and verify which certs are in the DB:

# On the instance (requires mokutil and/or efitools)
sudo mokutil --db | grep -A3 "Subject:"

Look for CN=Microsoft UEFI CA 2023 in the output. Its absence means your instance has only the 2011 certs.

On GKE nodes (where you have node shell access via a DaemonSet or node debug pod):

# Using kubectl debug for node access
kubectl debug node/NODE_NAME -it --image=ubuntu -- bash
# Then inside the debug pod:
chroot /host
mokutil --db 2>/dev/null | grep "Microsoft.*2023" || echo "2023 cert NOT present — node at risk"

Solutions

Instances created after November 7, 2025 automatically receive the updated UEFI certificate database. The simplest fix is to recreate affected instances.

For Compute Engine:

# Step 1: Create a machine image (snapshot) of the existing instance
gcloud compute machine-images create INSTANCE_NAME-backup \
  --source-instance=INSTANCE_NAME \
  --source-instance-zone=ZONE

# Step 2: Delete the old instance (after verifying backup)
gcloud compute instances delete INSTANCE_NAME --zone=ZONE

# Step 3: Create new instance from machine image
gcloud compute instances create INSTANCE_NAME \
  --source-machine-image=INSTANCE_NAME-backup \
  --zone=ZONE \
  --shielded-secure-boot \
  --shielded-vtpm \
  --shielded-integrity-monitoring

The new instance will have the post-November 7, 2025 UEFI DB.

For GKE Node Pools:

# Option A: Upgrade the node pool (triggers node recreation)
gcloud container clusters upgrade CLUSTER_NAME \
  --location=LOCATION \
  --node-pool=NODE_POOL_NAME

# Option B: Recreate the node pool entirely
gcloud container node-pools create NODE_POOL_NAME-new \
  --cluster=CLUSTER_NAME \
  --location=LOCATION \
  --shielded-secure-boot \
  --shielded-integrity-monitoring \
  [... your existing pool config ...]

# Then cordon and drain the old pool nodes
kubectl cordon NODE_NAME
kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data

# Finally delete the old node pool
gcloud container node-pools delete NODE_POOL_NAME \
  --cluster=CLUSTER_NAME \
  --location=LOCATION

Option 2: Disable Secure Boot Temporarily (Emergency Workaround)

If an instance has already failed to boot after an OS update, or if you need to apply bootloader updates before recreating the instance:

# Disable Secure Boot on the stopped instance
gcloud compute instances update INSTANCE_NAME \
  --zone=ZONE \
  --no-shielded-secure-boot

# Start the instance
gcloud compute instances start INSTANCE_NAME --zone=ZONE

# SSH in, apply OS updates and any pending bootloader upgrades
# (The system will boot without Secure Boot enforcement)
sudo apt-get update && sudo apt-get upgrade -y   # Debian/Ubuntu
# or
sudo dnf update -y                                # RHEL/CentOS

# Stop the instance again
gcloud compute instances stop INSTANCE_NAME --zone=ZONE

# Re-enable Secure Boot
gcloud compute instances update INSTANCE_NAME \
  --zone=ZONE \
  --shielded-secure-boot

# Start again — now boots with new bootloader binaries
gcloud compute instances start INSTANCE_NAME --zone=ZONE

Note: This workaround doesn’t add the 2023 certs to the DB. It bypasses Secure Boot enforcement temporarily. The underlying UEFI DB still only has the 2011 certs. You still need to recreate the instance to get the updated DB — this is only a bridge to keep the instance alive while you plan migration.


Option 3: Restore from Machine Image

If an instance is already in a boot failure state and the workaround above doesn’t apply:

# List available machine images
gcloud compute machine-images list

# Restore from a pre-failure machine image
gcloud compute instances create INSTANCE_NAME-restored \
  --source-machine-image=MACHINE_IMAGE_NAME \
  --zone=ZONE

Then immediately plan recreation on a post-November 7, 2025 instance.


vTPM, BitLocker, and Full Disk Encryption: The Hidden Risk

For VMs using Shielded VM features beyond just Secure Boot — specifically vTPM with sealed secrets — certificate expiry creates a more dangerous failure mode.

How vTPM sealing works:

  Boot sequence measurements → PCR registers (PCR 0–7 for UEFI, PCR 8–15 for OS)
         │
         ▼
  TPM seals secrets (FDE key, BitLocker key) to specific PCR values
         │
         ▼
  On next boot: PCR values must match for TPM to release the key
         │
         ▼
  If Secure Boot state changes (cert DB changes, Secure Boot disabled) →
  PCR values change → TPM refuses to unseal → FDE fails → disk inaccessible

What this means in practice:

  • Linux FDE (LUKS with TPM2 unsealing): If Secure Boot fails or is temporarily disabled per the workaround above, the TPM will not release the LUKS volume key. The system will drop to a recovery prompt. You need the LUKS recovery passphrase.

  • Windows BitLocker: If PCR values shift (Secure Boot disabled, cert DB changed), BitLocker enters recovery mode. The VM prompts for the BitLocker recovery key on next boot. Without it, the volume is inaccessible.

  • Windows Virtual Secure Mode: VSM uses vTPM to protect credentials. If Secure Boot state changes, VSM-protected secrets become inaccessible until re-enrollment.

Action before any changes:

# For Linux: ensure you have the LUKS recovery key
sudo cryptsetup luksDump /dev/sda3 | grep "Key Slot"

# For Windows: export BitLocker recovery key before touching Secure Boot state
# (Do this from within the running Windows instance via PowerShell)
Get-BitLockerVolume | Select-Object -ExpandProperty KeyProtector | Where-Object {$_.KeyProtectorType -eq "RecoveryPassword"}

Store recovery keys in Secret Manager, not just locally:

# Store LUKS key in GCP Secret Manager
echo -n "YOUR_RECOVERY_KEY" | gcloud secrets create luks-recovery-INSTANCE_NAME \
  --data-file=- \
  --replication-policy=automatic

⚠ Production Gotchas

1. OS update automation is the trigger, not the cert expiry date itself.
The certs don’t enforce anything at runtime. The actual failure happens when an unattended-upgrade, yum-cron, or GKE node OS update pulls in a new shim/Boot Manager binary signed only by the 2023 cert. Instances may fail to boot weeks or months before the official cert expiry date if distros ship updated bootloaders early.

2. GKE surge upgrades can mask the problem — temporarily.
During a node pool upgrade, GKE creates new nodes (with updated certs) before draining old ones. Workloads move to new nodes. The old nodes get deleted. This looks fine — until you realize some in-place operations (node taints, label changes, manual kubelet restarts) could force old nodes to reboot without triggering node replacement.

3. Disabling Secure Boot changes vTPM PCR values — plan FDE recovery before touching anything.
The temporary workaround (disable Secure Boot) will invalidate TPM-bound disk encryption. Have recovery keys ready before running --no-shielded-secure-boot.

4. Windows Event ID 1801 is an early warning — act on it.
If you see this event in your Windows Compute Engine instances before June 2026, that instance has already identified itself as carrying the old certs. Use it as your automated detection signal in Cloud Logging.

# Query Cloud Logging for Event ID 1801 across Windows instances
gcloud logging read 'resource.type="gce_instance" AND jsonPayload.EventID=1801' \
  --format="table(resource.labels.instance_id,timestamp,jsonPayload.Message)" \
  --limit=50

5. Instance templates propagate the old DB.
If you use instance templates or managed instance groups (MIGs) to create VMs, and those templates were created before November 7, 2025, new instances created from them may or may not inherit updated certs depending on how the template configures the UEFI DB. Verify by checking creation timestamp of the resulting instance, not the template.

6. Custom OS images don’t fix this.
Importing a custom image or using a custom OS does not update the UEFI certificate database. The DB is part of the VM’s UEFI firmware state, not the OS disk image. Recreating the instance is the only reliable path.


Quick Reference: Commands

Task Command
List affected Compute Engine VMs gcloud compute instances list --filter="creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true"
Check UEFI DB on a Linux VM sudo mokutil --db \| grep -E "Subject\|Not After"
Check for 2023 cert presence mokutil --db 2>/dev/null \| grep "Microsoft.*2023" \|\| echo "2023 cert absent"
Disable Secure Boot (emergency) gcloud compute instances update INSTANCE --zone=ZONE --no-shielded-secure-boot
Re-enable Secure Boot gcloud compute instances update INSTANCE --zone=ZONE --shielded-secure-boot
Find affected GKE nodes gcloud compute instances list --filter="labels.goog-gke-node:* AND creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true"
Trigger GKE node pool upgrade gcloud container clusters upgrade CLUSTER --location=LOCATION --node-pool=POOL
Store LUKS key in Secret Manager echo -n "KEY" \| gcloud secrets create NAME --data-file=-
Query Windows Event 1801 in Logging gcloud logging read 'resource.type="gce_instance" AND jsonPayload.EventID=1801'
Create machine image backup gcloud compute machine-images create BACKUP --source-instance=INSTANCE --source-instance-zone=ZONE

Framework Alignment

Framework Domain Relevance
CISSP Domain 7: Security Operations Patch management, boot integrity, incident response
CISSP Domain 3: Security Architecture Secure Boot trust chain, TPM integration, cryptographic key lifecycle
NIST CSF 2.0 ID.AM, PR.IP Asset inventory of affected VMs; integrity protection of boot chain
CIS Benchmarks CIS Google Cloud Computing Foundations Shielded VM controls, vTPM configuration
OWASP Top 10 A05: Security Misconfiguration Failure to maintain certificate currency in security-critical infrastructure

Key Takeaways

  • The expiry of three Microsoft UEFI CA certificates in 2026 creates a window where GCP VMs with Secure Boot enabled — created before November 7, 2025 — will fail to boot after pulling in new bootloader packages
  • The failure is not instantaneous on the cert expiry date. It’s triggered by the next OS update that ships a bootloader signed exclusively by the 2023 replacement certs
  • GKE Shielded Nodes are affected through the same mechanism: node VMs that haven’t been recreated since November 7, 2025 carry the old UEFI database
  • vTPM-sealed secrets (FDE, BitLocker, VSM) add a secondary failure mode if Secure Boot state is changed as part of remediation — have recovery keys before touching anything
  • Google’s recommended fix is instance recreation. The workaround (disable Secure Boot temporarily) keeps instances alive but doesn’t fix the underlying DB — treat it as a bridge, not a resolution
  • Audit now, before June 24. The command is one line. The blast radius of missing this is a production boot failure at 2 AM after a routine security patch run

What’s Next

If you’re running Shielded VMs in production, this certificate expiry is the kind of quiet deadline that fails silently — not with an alarm, but with a VM that doesn’t come back after a patch cycle. The time to audit is before your automated patching runs, not after.

If you found this useful, the linuxcent.com newsletter covers infrastructure security at this depth regularly — kernel internals, cloud platform gotchas, and the operational implications that vendor docs bury in footnotes.

Get the next deep-dive in your inbox when it publishes → [subscribe link]

SSSD: The Caching Daemon That Powers Every Enterprise Linux Login

Reading Time: 7 minutes

The Identity Stack, Episode 4
EP01: What Is LDAPEP02: LDAP InternalsEP03: LDAP Auth on LinuxEP04EP05: Kerberos → …


TL;DR

  • SSSD (System Security Services Daemon) is the caching and brokering layer between Linux and directory services — it handles LDAP, Kerberos, and AD so PAM and NSS don’t have to
  • Architecture: three tiers — responders (answer PAM/NSS queries), providers (talk to AD/LDAP/Kerberos), and a shared cache (LDB database on disk)
  • Credential caching means offline logins work — a user who authenticated yesterday can log in today even if the domain controller is unreachable
  • Key config: sssd.conf — the [domain] section is where almost all tuning happens
  • Debugging toolkit: sssctl, sss_cache, id, getent, journalctl -u sssd
  • The most common failure modes are: SSSD not running, stale cache, misconfigured ldap_search_base, and clock skew breaking Kerberos

The Big Picture: SSSD as the Identity Broker

PAM (pam_sss)         NSS (sss module)
      │                      │
      └──────────┬───────────┘
                 ▼
          SSSD Responders
          ┌────────────────────────────────────┐
          │  PAM responder   NSS responder      │
          │  (auth, account, (passwd, group,    │
          │   session)        shadow lookups)   │
          └────────────┬───────────────────────┘
                       │  shared cache (LDB)
                       ▼
          SSSD Providers
          ┌────────────────────────────────────┐
          │  identity provider  auth provider   │
          │  (user/group attrs) (credentials)   │
          └────────────┬───────────────────────┘
                       │
          ┌────────────┼────────────┐
          ▼            ▼            ▼
       LDAP          Kerberos    Local files
    (AD / OpenLDAP)  (KDC / AD)

EP03 showed that SSSD sits between PAM and LDAP. This episode goes inside it — the architecture, the config, and how to tell exactly what it’s doing on any given login attempt.


Why SSSD Exists

The problem before SSSD: nss_ldap and pam_ldap made direct LDAP connections for every query. No caching, no connection pooling, no failover, no offline support. On a system that makes dozens of getpwuid() calls per second (every ls -l, every process spawn), this meant dozens of LDAP roundtrips per second hitting the domain controller.

SSSD solved this with a single daemon that:
– Maintains a persistent connection pool to the directory
– Caches identity and credential data in an LDB (LDAP-like) database on disk
– Handles failover across multiple directory servers
– Satisfies PAM and NSS queries from cache when the directory is unreachable

The credential cache is the key insight. When you authenticate successfully, SSSD stores a hash of your credentials locally. If the domain controller is unreachable on your next login — network outage, laptop offline, VPN not connected — SSSD can verify your credentials against the local cache. You log in. You never knew the DC was down.


SSSD Architecture

SSSD is a set of cooperating processes sharing a cache:

Monitor — the parent process. Starts and restarts all other SSSD processes. If a responder or provider crashes, the monitor restarts it.

Responders — answer queries from PAM and NSS. Each responder handles a specific interface:
sssd_nss — answers getpwnam(), getpwuid(), getgrnam(), initgroups() calls
sssd_pam — handles PAM authentication, account checks, and session management
sssd_autofs, sssd_ssh, sssd_sudo — optional responders for specific services

Providers — the backend processes that talk to the actual directory:
– Each domain gets its own provider process (sssd_be[domain_name])
– The provider connects to LDAP/Kerberos/AD, fetches data, and writes it to the shared cache
– If the provider crashes or loses connectivity, responders fall back to serving from cache

Cache — LDB files in /var/lib/sss/db/. One database per configured domain, plus a cache for negative results (lookups that returned “not found”). The cache is an LDAP-like directory stored on disk — SSSD uses the same hierarchical structure for local storage as the remote directory uses.

# See the cache files
ls -la /var/lib/sss/db/
# cache_corp.com.ldb         ← user/group data for domain corp.com
# ccache_corp.com            ← Kerberos credential cache
# timestamps_corp.com.ldb   ← when entries were last refreshed

sssd.conf: The Config That Matters

/etc/sssd/sssd.conf has a [sssd] section (global) and one [domain/name] section per directory. The domain section is where almost all tuning happens.

[sssd]
services = nss, pam, sudo
domains = corp.com
config_file_version = 2

[domain/corp.com]
# What type of directory this is
id_provider = ad               # or: ldap, ipa, files
auth_provider = ad             # or: ldap, krb5, none
access_provider = ad           # controls who can log in

# The AD/LDAP server (can be a list for failover)
ad_domain = corp.com
ad_server = dc01.corp.com, dc02.corp.com

# Where to look for users and groups
ldap_search_base = dc=corp,dc=com

# Cache behavior
cache_credentials = true       # enable offline login
entry_cache_timeout = 5400     # how long before re-querying (seconds)
offline_credentials_expiration = 1  # days cached credentials stay valid offline

# What uid/gid range belongs to this domain (prevents UID conflicts)
ldap_id_mapping = true         # auto-map AD SIDs to UIDs (no uidNumber needed)
# OR for classical POSIX LDAP:
# ldap_id_mapping = false      # use uidNumber/gidNumber from directory

# Restrict logins to specific AD groups
# access_provider = simple
# simple_allow_groups = linux-admins, sre-team

# Home directory and shell defaults
override_homedir = /home/%u
default_shell = /bin/bash
fallback_homedir = /home/%u

# Enumerate all users (expensive on large dirs — disable unless needed)
enumerate = false

The two most commonly wrong settings:

ldap_search_base — if this doesn’t include the OU where your users live, SSSD won’t find them. On AD, the default searches the entire domain, which is usually correct. On OpenLDAP, you may need ou=people,dc=corp,dc=com.

ldap_id_mapping — on AD, users typically don’t have uidNumber attributes. Setting ldap_id_mapping = true tells SSSD to derive a UID from the user’s SID algorithmically. This produces consistent UIDs across machines. Setting it to false requires actual uidNumber attributes in the directory.


Credential Caching and Offline Logins

The cache is what separates SSSD from a simple proxy. When cache_credentials = true:

  1. On successful authentication, SSSD stores a hash of the credential in the LDB cache
  2. On the next authentication attempt, SSSD first tries the domain controller
  3. If the DC is unreachable, SSSD falls back to the local credential hash
  4. If the hash matches, login succeeds — even with no network

The credential hash is not the cleartext password — it’s a salted hash stored in /var/lib/sss/db/cache_corp.com.ldb. The security model is the same as /etc/shadow: someone with root access to the machine can access the hashes.

offline_credentials_expiration controls how long cached credentials stay valid when the DC is unreachable. 0 means forever (not recommended for high-security environments). 1 means one day — after 24 hours offline, even cached credentials expire and the user must authenticate online.


The Debugging Toolkit

# 1. Is SSSD running?
systemctl status sssd
pgrep -a sssd    # shows all SSSD processes (monitor + responders + providers)

# 2. Domain connectivity status
sssctl domain-status corp.com
# Domain: corp.com
# Active servers:
#   LDAP: dc01.corp.com
#   KDC: dc01.corp.com
# Discovered servers:
#   LDAP: dc01.corp.com, dc02.corp.com

# 3. Can SSSD find a specific user?
sssctl user-checks vamshi
# user: vamshi
# user name: [email protected]
# POSIX attributes: UID=1001, GID=1001, ...
# Authentication: success (uses actual PAM auth stack)

# 4. What does NSS see?
getent passwd vamshi          # full passwd entry
id vamshi                     # uid, gid, groups

# 5. Flush stale cache entries
sss_cache -u vamshi           # invalidate one user
sss_cache -G engineers        # invalidate one group
sss_cache -E                  # invalidate everything (nuclear option)

# 6. Live logs
journalctl -u sssd -f         # tail all SSSD logs
# Then attempt login in another terminal — watch the auth flow in real time

# 7. Increase log verbosity temporarily
sssctl config-check            # validate sssd.conf syntax
# Edit sssd.conf: add debug_level = 6 under [domain/corp.com]
systemctl restart sssd
journalctl -u sssd -f          # now shows LDAP queries, cache hits/misses

The single most useful command is sssctl user-checks <username>. It runs the full NSS + PAM auth stack internally and prints what SSSD would do on a real login — without creating a session or touching the running system.


Breaking SSSD (and What Each Failure Looks Like)

SSSD not running:

ssh vamshi@server
# Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password)
# getent passwd vamshi → (empty)
# Fix: systemctl start sssd

Stale cache after AD password change:

# User changed password in AD but SSSD still has old credential hash
ssh vamshi@server  # password accepted (wrong!) — cache hit with old hash
# Fix: sss_cache -u vamshi, then attempt login again

Clock skew > 5 minutes (breaks Kerberos):

journalctl -u sssd | grep -i "clock skew\|KDC\|kinit"
# sssd_be[corp.com]: Kerberos authentication failed: Clock skew too great
# Fix: systemctl restart chronyd (or ntpd), verify time sync

ldap_search_base wrong:

getent passwd vamshi  # empty, but user exists in AD
sssctl user-checks vamshi  # "User not found"
# Check: ldap_search_base must include the OU containing users
# Test: ldapsearch -x -H ldap://dc -b "ou=engineers,dc=corp,dc=com" "(uid=vamshi)"

⚠ Common Misconceptions

“Restarting SSSD logs everyone out.” Restarting SSSD doesn’t affect existing authenticated sessions. Active shell sessions, running processes — all unaffected. Only new authentication attempts are disrupted during the restart window, which takes a few seconds.

“sss_cache -E fixes everything.” Flushing the entire cache forces SSSD to re-fetch all entries from the domain controller on the next lookup. On a system with many users or enumeration enabled, this can cause a brief spike in LDAP traffic and slow lookups. Use targeted flushes (-u username, -G group) when possible.

“debug_level should always be high.” SSSD at debug_level = 9 logs every LDAP packet. On a production system with active logins, this generates gigabytes of logs quickly. Set it temporarily for debugging, then remove it and restart.


Framework Alignment

Domain Relevance
CISSP Domain 5: Identity and Access Management SSSD is the runtime implementation of enterprise identity integration on Linux — understanding its caching model, failover behavior, and credential storage is foundational to IAM operations
CISSP Domain 3: Security Architecture and Engineering The credential cache design (/var/lib/sss/db/) creates a local credential store with specific security properties — architects need to understand the offline login trade-off
CISSP Domain 7: Security Operations SSSD is a critical security service — monitoring it, understanding its failure modes, and knowing how to recover it quickly are operational security skills

Key Takeaways

  • SSSD is a three-tier system: responders (serve PAM/NSS), providers (talk to AD/LDAP), and a shared LDB cache — each tier is independently restartable
  • Credential caching enables offline logins — the security trade-off is a local hash store in /var/lib/sss/db/
  • sssctl user-checks is the first tool to reach for when a login fails — it simulates the full auth flow and shows exactly where it breaks
  • ldap_id_mapping = true is the right choice for AD environments without POSIX attributes; false requires actual uidNumber/gidNumber in the directory
  • Clock skew over 5 minutes silently breaks Kerberos authentication — time sync is a hard dependency

What’s Next

EP04 showed SSSD’s role as the caching and brokering layer. What it referenced repeatedly — “Kerberos ticket”, “KDC”, “GSSAPI” — is the authentication protocol that sits underneath AD-joined Linux logins. SSSD uses Kerberos to authenticate. LDAP carries the identity data. EP05 explains how Kerberos works.

Next: How Kerberos Works: Tickets, KDC, and Why Enterprises Use It With LDAP

Get EP05 in your inbox when it publishes → linuxcent.com/subscribe

How LDAP Authentication Works on Linux: PAM, NSS, and the Login Stack

Reading Time: 8 minutes

The Identity Stack, Episode 3
EP01: What Is LDAPEP02: LDAP InternalsEP03EP04: SSSD → …


TL;DR

  • LDAP is a directory protocol — it stores identity information and can verify a password via Bind, but authentication on Linux runs through PAM, not directly through LDAP
  • NSS (/etc/nsswitch.conf) answers “who is this user?” — it resolves UIDs, group memberships, and home directories by querying LDAP (or the local files, or SSSD)
  • PAM (/etc/pam.d/) answers “are they allowed in?” — it enforces authentication, account validity, session setup, and password policy
  • pam_ldap (the old way) opened a direct LDAP connection on every login — fragile, no caching, broken when the LDAP server was unreachable
  • pam_sss (the modern way) delegates to SSSD, which caches credentials and handles failover — SSSD is the layer between Linux and the directory
  • Tracing a single SSH login: sshd → PAM → pam_sss → SSSD → LDAP Bind + Search → session created

The Big Picture: One SSH Login, Four Layers

You type: ssh [email protected]

  sshd
    │
    ▼
  PAM  (/etc/pam.d/sshd)          ← "Is this user allowed in?"
    │
    ├── pam_sss    (auth)          ← sends credentials to SSSD
    ├── pam_sss    (account)       ← checks account not expired/locked
    ├── pam_sss    (session)       ← logs the session open/close
    └── pam_mkhomedir (session)    ← creates /home/vamshi if it doesn't exist
    │
    ▼
  SSSD  (/etc/sssd/sssd.conf)     ← "Let me check the directory"
    │
    ├── NSS responder              ← answers getent, id, getpwnam
    └── LDAP/Kerberos provider     ← talks to the actual directory
    │
    ▼
  LDAP Server (AD / OpenLDAP)
    │
    ├── Bind: uid=vamshi + password (or Kerberos ticket)
    └── Search: posixAccount attrs for uid=vamshi
    │
    ▼
  Linux session created
  UID=1001, GID=1001, HOME=/home/vamshi, SHELL=/bin/bash

EP02 showed what the directory contains and what travels on the wire. What it left open is how Linux uses that to grant a login — and why LDAP is not, by itself, an authentication protocol.


Why LDAP Is Not an Authentication Protocol

This is the confusion that trips people most. LDAP can verify a password — the Bind operation does exactly that. But authentication on Linux means something broader: checking credentials, checking account validity, enforcing password policy, setting up a session, creating a home directory. LDAP handles one piece of that. PAM handles the rest.

More precisely: LDAP doesn’t know what a Linux session is. It doesn’t know about /etc/pam.d/. It doesn’t enforce login hours, account expiry, or concurrent session limits. It returns directory entries and verifies binds. The intelligence about what to do with those results lives in the Linux authentication stack.

When you run ssh vamshi@server, the OS doesn’t open an LDAP connection and ask “can this user log in?” It calls PAM. PAM consults its configuration, and PAM decides whether to call LDAP (directly or via SSSD), whether to check the shadow file, whether to enforce MFA. LDAP is one possible backend. It’s not the gatekeeper.


NSS: The Traffic Controller

Before PAM runs, Linux needs to know if the user exists at all. That’s NSS’s job.

/etc/nsswitch.conf is a routing table for name resolution. It tells the OS where to look when something asks “who is UID 1001?” or “what groups is vamshi in?”:

# /etc/nsswitch.conf

passwd:     files sss        ← user lookups: check /etc/passwd first, then SSSD
group:      files sss        ← group lookups: check /etc/group first, then SSSD
shadow:     files sss        ← shadow password lookups
hosts:      files dns        ← hostname lookups (not identity-related)
netgroup:   sss              ← NIS netgroups from SSSD only
automount:  sss              ← autofs maps from SSSD

Every call to getpwnam(), getpwuid(), getgrnam(), getgrgid() in any process — including sshd — goes through NSS. The entries in nsswitch.conf control which backends are tried in order.

With passwd: files sss, a lookup for user vamshi:
1. Checks /etc/passwd — not found (vamshi is a domain user, not in local files)
2. Queries SSSD — SSSD checks its cache, or queries LDAP, and returns the posixAccount attributes

Without the sss entry in passwd:, domain users don’t exist on the system — getent passwd vamshi returns nothing, id vamshi fails, SSH login never gets to PAM’s authentication step.

# Verify NSS is routing to SSSD correctly
getent passwd vamshi
# vamshi:*:1001:1001:Vamshi K:/home/vamshi:/bin/bash

# If this returns nothing, NSS isn't reaching SSSD
# Check: systemctl status sssd && grep passwd /etc/nsswitch.conf

# See what groups the user is in (NSS group lookup)
id vamshi
# uid=1001(vamshi) gid=1001(engineers) groups=1001(engineers),1002(ops)

PAM: The Real Gatekeeper

PAM (Pluggable Authentication Modules) is the framework that lets Linux swap authentication backends without recompiling anything. Every service that needs to authenticate users — sshd, sudo, login, su, gdm — has a PAM configuration file in /etc/pam.d/.

Each PAM config defines four stacks:

auth        ← verify credentials (password, key, MFA)
account     ← check if the account is valid (not expired, not locked, login hours)
password    ← password change policy
session     ← set up/tear down the session (home dir, limits, logging)

A typical /etc/pam.d/sshd on a system joined to AD via SSSD:

# /etc/pam.d/sshd

# auth stack — verify the user's credentials
auth    required      pam_sepermit.so
auth    substack      password-auth   ← usually includes pam_sss.so

# account stack — check account validity
account required      pam_nologin.so
account include       password-auth

# password stack — handle password changes
password include      password-auth

# session stack — set up the session
session required      pam_selinux.so close
session required      pam_loginuid.so
session optional      pam_keyinit.so force revoke
session include       password-auth
session optional      pam_motd.so
session optional      pam_mkhomedir.so skel=/etc/skel/ umask=0077
session required      pam_selinux.so open

The include and substack directives pull in shared stacks from other files (like /etc/pam.d/password-auth). On a system with SSSD, password-auth contains:

auth    required      pam_env.so
auth    sufficient    pam_sss.so      ← try SSSD first
auth    required      pam_deny.so     ← if pam_sss fails, deny

account required      pam_unix.so
account sufficient    pam_localuser.so
account sufficient    pam_sss.so      ← SSSD account check
account required      pam_permit.so

session optional      pam_sss.so      ← SSSD session tracking

The sufficient flag means: if this module succeeds, stop checking this stack and consider it passed. required means: this must pass (but continue checking other modules and report failure at the end). requisite means: if this fails, stop immediately.


PAM Control Flags at a Glance

required   — must succeed; failure reported after remaining modules run
requisite  — must succeed; failure reported immediately, stack stops
sufficient — if success, stop stack (ignore remaining); failure continues
optional   — result ignored unless it's the only module in the stack

This matters for debugging. If pam_sss.so is sufficient and SSSD is down, PAM falls through to pam_deny.so — login denied. If it were optional, the login would proceed to the next module. The control flag is the policy decision.


The Old Way: pam_ldap

Before SSSD, Linux systems used pam_ldap and nss_ldap directly:

# Old /etc/pam.d/common-auth (Ubuntu pre-SSSD era)
auth    sufficient    pam_ldap.so    ← direct LDAP connection per login
auth    required      pam_unix.so nullok_secure

# Old /etc/nsswitch.conf
passwd: files ldap    ← nss_ldap for user lookups
group:  files ldap

pam_ldap opened a fresh LDAP connection on every login attempt. No caching. If the LDAP server was unreachable for 3 seconds, the login hung for 3 seconds — sometimes much longer. If the LDAP server was down, all domain logins failed immediately. Previously logged-in users with active sessions were fine; new logins simply didn’t work.

nss_ldap had the same problem for NSS lookups: every getpwnam() call hit the LDAP server directly. On a busy system with many processes doing user lookups, this meant hundreds of LDAP queries per second, no connection reuse, and no way to survive a brief network blip.

The problems were structural:
– No credential caching — offline logins impossible
– No connection pooling — LDAP server saw one connection per login attempt
– No failover logic — one LDAP server down meant all logins down
– Slow timeouts that blocked login sessions

SSSD was built to fix all of this.


The Modern Way: pam_sss + SSSD

pam_sss doesn’t talk to LDAP directly. It’s a thin client that passes authentication requests to SSSD over a Unix domain socket. SSSD manages the LDAP connection, the credential cache, and the failover logic.

sshd  →  PAM (pam_sss)  →  SSSD (Unix socket)  →  LDAP server
                                   │
                                   └── credential cache
                                       (survives brief LDAP outages)

When pam_sss sends a credential to SSSD:
1. SSSD checks its in-memory cache — if the credential hash matches a recent successful auth, it can satisfy the request without hitting LDAP
2. If not cached (or cache expired), SSSD sends a Bind to the LDAP server
3. On success, SSSD caches the result and returns success to pam_sss
4. pam_sss returns PAM_SUCCESS, and the auth stack continues

The credential cache is what enables offline logins. If the LDAP server is unreachable and a user has authenticated successfully within the cache TTL (default: 1 day for credentials, configurable via cache_credentials = True in sssd.conf), SSSD satisfies the auth from cache and the login succeeds. The user never knows the LDAP server was down.


Tracing a Full SSH Login

Here’s every step of an SSH login for a domain user, in order:

1.  sshd accepts the TCP connection
2.  sshd calls PAM: pam_start("sshd", "vamshi", ...)

3.  PAM auth stack runs pam_sss:
      pam_sss sends credentials to SSSD via /var/lib/sss/pipes/pam

4.  SSSD auth provider:
      a. Check credential cache — miss (first login)
      b. Resolve user: NSS lookup for uid=vamshi
         → SSSD LDAP provider searches dc=corp,dc=com for (uid=vamshi)
         → Returns: uidNumber=1001, gidNumber=1001, homeDirectory=/home/vamshi
      c. Authenticate: LDAP Simple Bind as uid=vamshi,ou=engineers,dc=corp,dc=com
         → Server returns: success
      d. Cache the credential hash + POSIX attrs

5.  SSSD returns PAM_SUCCESS to pam_sss

6.  PAM account stack runs pam_sss:
      SSSD checks: account not expired, not locked, login permitted
      → PAM_ACCT_MGMT success

7.  PAM session stack:
      pam_loginuid sets /proc/self/loginuid = 1001
      pam_mkhomedir creates /home/vamshi if missing
      pam_sss opens session (records in SSSD session tracking)

8.  sshd creates the shell, sets environment:
      USER=vamshi, HOME=/home/vamshi, SHELL=/bin/bash, LOGNAME=vamshi

9.  Shell prompt appears

Steps 4b and 4c are the only two LDAP operations in the entire login flow: one Search to resolve the user’s attributes, one Bind to verify the password. Everything else is PAM and SSSD.


Debugging the Stack

When a login fails, the failure could be in any layer. Work top-down:

# 1. Does NSS resolve the user at all?
getent passwd vamshi
# If empty: NSS isn't reaching SSSD, or SSSD isn't finding the user in LDAP

# 2. Is SSSD running and healthy?
systemctl status sssd
sssctl domain-status corp.com      # shows SSSD's view of domain connectivity

# 3. What does SSSD think about the user?
sssctl user-checks vamshi          # runs auth + account checks internally
id vamshi                          # forces NSS resolution and shows group memberships

# 4. What does SSSD's log say?
journalctl -u sssd -f              # tail SSSD logs live, then attempt login

# 5. Can you reach the LDAP server at all?
ldapsearch -x -H ldap://dc.corp.com \
  -D "cn=svc-ldap,ou=services,dc=corp,dc=com" \
  -w "password" \
  -b "dc=corp,dc=com" \
  "(uid=vamshi)" dn

# 6. Force a cache flush if entries are stale
sss_cache -u vamshi                # invalidate this user's cache entry
sss_cache -G engineers             # invalidate a group

The sssctl user-checks command is the single most useful diagnostic — it simulates the full PAM auth + account check flow without actually creating a session, and prints exactly what SSSD would do on a real login attempt.


⚠ Common Misconceptions

“If ldapsearch works, SSH login should work.” Not necessarily. ldapsearch tests the LDAP layer. An SSH login requires NSS to resolve the user, PAM to authenticate, SSSD to be running and configured correctly, and pam_mkhomedir to create the home directory if it’s the first login. Any of these can fail independently.

“pam_ldap and pam_sss do the same thing.” They have the same job (authenticate via LDAP) but completely different architectures. pam_ldap is a direct-connect, no-cache module. pam_sss is a client of SSSD, which provides caching, connection pooling, failover, and offline support. On any modern system, you want pam_sss.

“nsswitch.conf order doesn’t matter much.” It matters exactly as much as the order suggests. passwd: files sss means local /etc/passwd is always checked first — if a domain username collides with a local user, the local account wins. This is the intended behavior (local accounts should always be reachable), but it means you’ll never override a local account with a directory entry.

“SSSD cache = security risk.” The cache stores a credential hash, not the cleartext password. An attacker with access to the SSSD cache database (/var/lib/sss/db/) would see hashed credentials — the same situation as /etc/shadow. The real concern is whether offline authentication is appropriate for your security posture; it can be disabled with offline_credentials_expiration = 0.


Framework Alignment

Domain Relevance
CISSP Domain 5: Identity and Access Management PAM is the enforcement layer for authentication policy on Linux — understanding its stack is foundational to any Linux IAM deployment
CISSP Domain 3: Security Architecture and Engineering The separation between NSS (resolution) and PAM (authentication) is an architectural boundary — misunderstanding it leads to misconfigured systems where account checks are bypassed
CISSP Domain 4: Communications and Network Security pam_ldap vs pam_sss affects whether credentials travel over a direct LDAP connection (one socket per login, no TLS guarantee) or through SSSD’s managed, pooled connection

Key Takeaways

  • LDAP alone is not an authentication protocol for Linux — authentication flows through PAM, and LDAP is one of PAM’s possible backends
  • NSS (/etc/nsswitch.conf) resolves user identity (who is UID 1001?); PAM enforces it (are they allowed in?)
  • pam_ldap talks to LDAP directly — no cache, no failover, login blocked when LDAP is unreachable
  • pam_sss delegates to SSSD — credential caching, connection pooling, offline login, and failover are all built in
  • A full SSH login touches LDAP exactly twice: one Search for POSIX attributes, one Bind to verify the password
  • When login fails, debug top-down: NSS resolution → SSSD status → LDAP reachability → PAM config

What’s Next

EP03 showed how authentication reaches LDAP — through PAM, through SSSD, through a Bind. What it assumed is that SSSD is healthy and the LDAP server is reachable. The moment either goes wrong, the behavior depends entirely on how SSSD is configured — its cache TTLs, its failover order, its offline credential policy.

EP04 goes inside SSSD: the architecture, the sssd.conf knobs that matter, how to read the logs, and how to break it intentionally and fix it.

Next: SSSD: The Caching Daemon That Powers Every Enterprise Linux Login

Get EP04 in your inbox when it publishes → linuxcent.com/subscribe

What Is LDAP — and Why It Was Invented to Replace Something Worse

Reading Time: 9 minutes

The Identity Stack, Episode 1
EP01EP02: LDAP Internals → EP03 → …


TL;DR

  • LDAP (Lightweight Directory Access Protocol) is a protocol for reading and writing directory information — most commonly, who is allowed to do what
  • It was built in 1993 as a “lightweight” alternative to X.500/DAP, which ran over the full OSI stack and was impossible to deploy on anything but mainframe hardware
  • Before LDAP, every server had its own /etc/passwd — 50 machines meant 50 separate user databases, managed manually
  • NIS (Network Information Service) was the first attempt to centralize this — it worked, then became a cleartext-credentials security liability
  • LDAP v3 (RFC 2251, 1997) is the version still in production today — 27 years of backwards compatibility
  • Everything you use today — Active Directory, Okta, Entra ID — is built on top of, or speaks, LDAP

The Big Picture: 50 Years of “Who Are You?”

1969–1980s   /etc/passwd — per-machine, no network auth
     │        50 servers = 50 user databases, managed manually
     │
     ▼
1984         Sun NIS / Yellow Pages — first centralized directory
     │        broadcast-based, no encryption, flat namespace
     │        Revolutionary for its era. A liability by the 1990s.
     │
     ▼
1988         X.500 / DAP — enterprise-grade directory services
     │        OSI protocol stack. Powerful. Impossible to deploy.
     │        Mainframe-class infrastructure required just to run it.
     │
     ▼
1993         RFC 1487 — LDAP v1
     │        Tim Howes, University of Michigan.
     │        Lightweight. TCP/IP. Actually deployable.
     │
     ▼
1997         RFC 2251 — LDAP v3
     │        SASL authentication. TLS. Controls. Referrals.
     │        The version still in production today.
     │
     ▼
2000s–now    Active Directory, OpenLDAP, 389-DS, FreeIPA
             Okta, Entra ID, Google Workspace
             LDAP DNA in every identity system on the planet.

What is LDAP? It’s the protocol that solved one of the most boring and consequential problems in computing: how do you know who someone is, across machines, at scale, without sending their password in cleartext?


The World Before LDAP

Before you understand why LDAP was invented, you need to feel the problem it solved.

Every Unix machine in the 1970s and 1980s managed its own users. When you created an account on a server, your username, UID, and hashed password went into /etc/passwd on that machine. Another machine had no idea you existed. If you needed access to ten servers, an administrator created ten separate accounts — manually, one by one. When you changed your password, each account had to be updated separately.

For a university with 200 machines and 10,000 students, this was chaos. For a company with offices in three cities, it was a full-time job for multiple sysadmins.

Machine A           Machine B           Machine C
/etc/passwd         /etc/passwd         /etc/passwd
vamshi:x:1001       (vamshi unknown)    vamshi:x:1004
alice:x:1002        alice:x:1001        alice:x:1003
bob:x:1003          bob:x:1002          (bob unknown)

Same people, different UIDs, different machines, no central truth.
File permissions become meaningless when UID 1001 means
different users on different hosts.

For every new hire, an admin SSHed to every machine and ran useradd. When someone left, you hoped whoever ran the offboarding remembered all the machines. Most organizations didn’t know their own attack surface because there was no single place to look.


Sun NIS: The First Attempt at Centralization

Sun Microsystems released NIS (Network Information Service) in 1984, originally called Yellow Pages — a name they had to drop after a trademark dispute with British Telecom. The idea was elegant: one server holds the authoritative /etc/passwd (and /etc/group, /etc/hosts, and a dozen other maps), and client machines query it instead of reading local files.

For the first time, you could create an account once and have it work across your entire network. For a generation of Unix administrators, NIS was liberating.

       NIS Master Server
       /var/yp/passwd.byname
              │
    ┌─────────┼──────────┐
    ▼         ▼          ▼
 Client A   Client B   Client C
 (query NIS — no local /etc/passwd needed)

NIS worked well — until it didn’t. The failure modes were structural:

No encryption. NIS responses were cleartext UDP. An attacker on the same network segment could capture the full password database with a packet sniffer. In 1984, “the network” meant a trusted corporate LAN. By the mid-1990s, it meant ethernet segments that included lab workstations, and the assumptions no longer held.

Broadcast-based discovery. NIS clients found servers by broadcasting on the local network. This worked on a single flat ethernet. It failed completely across routers, across buildings, and across WAN links. Multi-site organizations ended up running separate NIS domains with no connection between them — which partially defeated the purpose.

Flat namespace. NIS had no organizational hierarchy. One domain. Everything flat. You couldn’t have engineering and finance as separate administrative units. You couldn’t delegate user management to a department. One person — usually one overworked sysadmin — managed the whole thing.

UIDs had to match across all machines. If alice was UID 1002 on one server but UID 1001 on another, NFS file ownership became wrong. NIS enforced consistency, but onboarding a new machine into an existing network required manually auditing UID conflicts across the entire directory. Get one wrong and files end up owned by the wrong person.

NIS worked for thousands of installations from 1984 to the mid-1990s. It also ended careers when it failed. What the industry needed was a hierarchical, structured, encrypted, scalable directory service.


X.500 and DAP: The Right Idea, Wrong Protocol

The OSI (Open Systems Interconnection) standards body had an answer: X.500 directory services. X.500 was comprehensive, hierarchical, globally federated. The ITU-T published the standard in 1988, and it looked like exactly what enterprises needed.

X.500 Directory Information Tree (DIT)
              c=US                   ← country
                │
         o=University                ← organization
                │
         ┌──────┴──────┐
     ou=CS           ou=Physics      ← organizational units
         │
     cn=Tim Howes                    ← common name (person)
     telephoneNumber: +1-734-...
     mail: [email protected]

This data model — the hierarchy, the object classes, the distinguished names — is exactly what LDAP inherited. The DIT, the cn=, ou=, dc= notation in every LDAP query you’ve ever read: all of it came from X.500.

The problem was DAP: the Directory Access Protocol that X.500 used to communicate.

DAP ran over the full OSI protocol stack. Not TCP/IP — OSI. Seven layers, all of which required specialized software that in 1988 only mainframe and minicomputer vendors had implemented. A university department wanting to run X.500 needed hardware and software licenses that cost as much as a small car. The vast majority of workstations couldn’t speak OSI at all.

The data model was sound. The transport was impractical.

X.500 / DAP (1988)              LDAP v1 (1993)
──────────────────              ──────────────
Full OSI stack (7 layers)  →    TCP/IP only
Mainframe-class hardware   →    Any Unix box with a TCP stack
$50,000+ deployment cost   →    Free (reference implementation)
Vendor-specific OSI impl.  →    Standard socket API
Zero internet adoption     →    Universities deployed immediately

The Invention: LDAP at the University of Michigan

Tim Howes was at the University of Michigan in the early 1990s. The university was running X.500 for its directory — faculty, staff, student contact information, credentials. The data model was good. The protocol was the problem.

His insight, working with colleagues Wengyik Yeong and Steve Kille: strip X.500 down to what actually needs to function over a TCP/IP connection. Keep the hierarchical data model. Throw away the OSI transport. The result was the Lightweight Directory Access Protocol.

RFC 1487, published July 1993, described LDAP v1. It preserved the X.500 directory information model — the hierarchy, the object classes, the distinguished name format — and mapped it onto a protocol that could run over a simple TCP socket on port 389.

No specialized hardware. No OSI. If you had a Unix machine and TCP/IP, you could run LDAP. By 1993, that meant virtually every workstation and server in every university and most enterprises.

The University of Michigan deployed it immediately. Within two years, organizations across the internet were running the reference implementation.

LDAP v2 (RFC 1777, 1995) cleaned up the protocol. LDAP v3 (RFC 2251, 1997) is the version in production today — adding SASL authentication (which enables Kerberos integration), TLS support, referrals for federated directories, and extensible controls for server-side operations. The RFC that standardized the internet’s primary identity protocol is 27 years old and still running.


What LDAP Actually Is

LDAP is a client-server protocol for reading and writing a directory — a structured, hierarchical database optimized for reads.

Every entry in the directory has a Distinguished Name (DN) that describes its position in the hierarchy, and a set of attributes defined by its object classes. A person entry looks like this:

dn: cn=vamshi,ou=engineers,dc=linuxcent,dc=com

objectClass: inetOrgPerson
objectClass: posixAccount
cn: vamshi
uid: vamshi
uidNumber: 1001
gidNumber: 1001
homeDirectory: /home/vamshi
loginShell: /bin/bash
mail: [email protected]

The DN reads right-to-left: domain linuxcent.com (dc=linuxcent,dc=com) → organizational unit engineers → common name vamshi. Every entry in the directory has a unique path through the tree — there’s no ambiguity about which vamshi you mean.

LDAP defines eight operations: Bind (authenticate), Search, Add, Modify, Delete, ModifyDN (rename), Compare, and Abandon. Most of what a Linux authentication system does with LDAP reduces to two: Bind (prove you are who you say you are) and Search (tell me everything you know about this user).

When your Linux machine authenticates an SSH login against LDAP:

1. User types password
2. PAM calls pam_sss (or pam_ldap on older systems)
3. SSSD issues a Bind to the LDAP server: "I am cn=vamshi, and here is my credential"
4. LDAP server verifies the bind → success or failure
5. SSSD issues a Search: "give me the posixAccount attributes for uid=vamshi"
6. LDAP returns uidNumber, gidNumber, homeDirectory, loginShell
7. PAM creates the session with those attributes

The entire login flow is two LDAP operations: one Bind, one Search.


Try It Right Now

You don’t need to set up an LDAP server to run your first query. There’s a public test LDAP directory at ldap.forumsys.com:

# Query a public LDAP server — no setup required
ldapsearch -x \
  -H ldap://ldap.forumsys.com \
  -b "dc=example,dc=com" \
  -D "cn=read-only-admin,dc=example,dc=com" \
  -w readonly \
  "(objectClass=inetOrgPerson)" \
  cn mail uid

# What you get back (abbreviated):
# dn: uid=tesla,dc=example,dc=com
# cn: Tesla
# mail: [email protected]
# uid: tesla
#
# dn: uid=einstein,dc=example,dc=com
# cn: Albert Einstein
# mail: [email protected]
# uid: einstein

Decode what you just ran:

  • -x — simple authentication (username/password bind, not Kerberos/SASL)
  • -H ldap://ldap.forumsys.com — the LDAP server URI, port 389
  • -b "dc=example,dc=com" — the base DN, the top of the subtree to search
  • -D "cn=read-only-admin,dc=example,dc=com" — the bind DN (who you’re authenticating as)
  • -w readonly — the bind password
  • "(objectClass=inetOrgPerson)" — the search filter: return entries that are people
  • cn mail uid — the attributes to return (default returns all)

That’s a live LDAP query returning real directory entries from a server running RFC 2251 — the same protocol Tim Howes designed in 1993.

On your own Linux system, if you’re joined to AD or LDAP, you can query it the same way with your domain credentials.


Why It Never Went Away

LDAP v3 was finalized in 1997. In 2024, it’s still the protocol every enterprise directory speaks. Why?

Because it became the lingua franca of enterprise identity before any replacement existed. Every application that needs to authenticate users — VPN concentrators, mail servers, network switches, web applications, HR systems — implemented LDAP support. Every directory service Microsoft, Red Hat, Sun, and Novell shipped stored data in an LDAP-accessible tree.

When Microsoft built Active Directory in 1999, they built it on top of LDAP + Kerberos. When your Linux machine joins an AD domain, it speaks LDAP to enumerate users and groups, and Kerberos to verify credentials. When Okta or Entra ID syncs with your on-premises directory, it uses LDAP Sync (or a modern protocol that maps directly to LDAP semantics).

The protocol is old. The ecosystem built on top of it is so deep that replacing LDAP would mean simultaneously replacing every enterprise application that depends on it. Nobody has done that. Nobody has had to.

What happened instead is the stack got taller. LDAP at the bottom, Kerberos for network authentication, SSSD as the local caching daemon, PAM as the Linux integration layer, SAML and OIDC at the top for web-based federation. The directory is still LDAP. The interfaces above it evolved.

That full stack — from the directory at the bottom to Zero Trust at the top — is what this series covers.


⚠ Common Misconceptions

“LDAP is an authentication protocol.” LDAP is a directory protocol. It stores identity information and can verify credentials (via Bind). Authentication in modern stacks is typically Kerberos or OIDC — LDAP provides the directory backing it.

“LDAP is obsolete.” LDAP is the storage layer for Active Directory, OpenLDAP, 389-DS, FreeIPA, and every enterprise IdP’s on-premises sync. It is ubiquitous. What’s changed is the interface layer above it.

“You need Active Directory to run LDAP.” Active Directory uses LDAP. OpenLDAP, 389-DS, FreeIPA, and Apache Directory Server are all standalone LDAP implementations. You can run a directory without Microsoft.

“LDAP and LDAPS are different protocols.” LDAP is the protocol. LDAPS is LDAP over TLS on port 636. StartTLS is LDAP on port 389 with an in-session upgrade to TLS. Same protocol, different transport security.


Framework Alignment

Domain Relevance
CISSP Domain 5: Identity and Access Management LDAP is the foundational directory protocol for centralized identity stores — the base layer of every enterprise IAM stack
CISSP Domain 4: Communications and Network Security Port 389 (LDAP), 636 (LDAPS), 3268/3269 (AD Global Catalog) — transport security decisions affect every directory deployment
CISSP Domain 3: Security Architecture and Engineering DIT hierarchy, schema design, replication topology — directory structure is an architectural security decision
NIST SP 800-63B LDAP as a credential service provider (CSP) backing enterprise authenticators

Key Takeaways

  • LDAP was invented to solve a real, painful problem: the authentication chaos that NIS couldn’t fix and X.500/DAP was too expensive to deploy
  • It inherited the right thing from X.500 (the hierarchical data model) and replaced the right thing (the impractical OSI transport with TCP/IP)
  • NIS was the predecessor that worked until it didn’t — its failure modes (no encryption, flat namespace, broadcast discovery) are exactly what LDAP was designed to fix
  • LDAP v3 (RFC 2251, 1997) is still the production standard — 27 years later
  • Active Directory, OpenLDAP, FreeIPA, Okta, Entra ID — every enterprise identity system either runs LDAP or speaks it
  • The full authentication stack is deeper than LDAP: the next 12 episodes peel it apart layer by layer

What’s Next

EP01 stayed at the design level — the problem, the predecessor failures, the invention, the data model.

EP02 goes inside the wire. The DIT structure, DN syntax, object classes, schema, and the BER-encoded bytes that actually travel from the server to your authentication daemon. Run ldapsearch against your own directory and read every line of what comes back.

Next: LDAP Internals: The Directory Tree, Schema, and What Travels on the Wire

Get EP02 in your inbox when it publishes → linuxcent.com/subscribe

Cloud AMI Security Risks & How Custom OS Images Fix them and what’s wrong with defaults

Reading Time: 8 minutes

~2,800 words  ·  Reading time: 12 min  ·  Series: OS Image Security, Post 1 of 6

When you launch an EC2 instance from an AWS Marketplace AMI, or spin up a VM from a cloud-provider base image on GCP or Azure, you’re trusting a decision someone else made months ago about what your server should contain. That decision was made for the widest possible audience — not for your workload, your threat model, or your compliance requirements.

This post tears open what’s actually inside a default cloud image, compares it against what a production-hardened image should contain, and explains why the calculus changes depending on whether you’re deploying to AWS, an on-prem KVM host, or a Nutanix AHV cluster.


What a cloud provider is actually optimising for

AWS, Canonical, Red Hat, and every other publisher shipping to cloud marketplaces are solving a distribution problem, not a security problem. Their images need to:

  • Boot successfully on any instance type in any region
  • Work for the first-time user running their first workload
  • Support every possible use case — web servers, databases, ML training jobs, bastion hosts, everything

That constraint produces images that are, by design, permissive. Permissive gets out of the way. Permissive doesn’t break anything on day one. Permissive is also the opposite of what you want on a production server.

Let’s look at what “permissive” actually means in concrete terms.


Dissecting a default AWS AMI

Take Amazon Linux 2023 (AL2023), one of the more intentionally stripped-down cloud images available. Even with Amazon’s effort to reduce its footprint compared to AL2, a fresh AL2023 instance ships with more than most workloads need.

Services running at boot that most workloads don’t need

chronyd.service            # Fine — you need NTP
systemd-resolved.service   # Fine
dbus-broker.service        # Fine
amazon-ssm-agent.service   # Arguably fine if you use SSM
NetworkManager.service     # Debatable — most cloud workloads don't need NM

On a RHEL 8/9 or Ubuntu 22.04 Marketplace image, the list is longer. You’ll find avahi-daemon (mDNS/DNS-SD service discovery — on a server), bluetooth.service in some configurations, cups on some RHEL variants, and on Ubuntu, snapd running and occupying memory along with its associated mount units.

Every running service is an attack surface. Every socket it opens is a listening endpoint you didn’t ask for.

SSH configuration out of the box

The default sshd_config on most Marketplace images is not hardened. You’ll typically find:

PermitRootLogin prohibit-password   # Better than 'yes', but not 'no'
PasswordAuthentication no           # Usually disabled by cloud-init — good
X11Forwarding yes                   # On a headless server. Why?
AllowAgentForwarding yes            # Unnecessary for most workloads
PrintLastLog yes                    # Minor, but generates audit noise
MaxAuthTries 6                      # CIS recommends 4 or fewer
ClientAliveInterval 0               # No idle timeout

CIS Benchmark Level 1 for RHEL 9 has 40+ SSH-specific controls. A default image satisfies perhaps a third of them.

Kernel parameters that aren’t tuned

# Not set, or not set correctly, on most default images:
net.ipv4.conf.all.send_redirects = 1        # Should be 0
net.ipv4.conf.default.accept_redirects = 1  # Should be 0
net.ipv4.ip_forward = 0                     # Correct if not a router, but often left unset
kernel.randomize_va_space = 2               # Usually correct — verify anyway
fs.suid_dumpable = 0                        # Often not set
kernel.dmesg_restrict = 1                   # Rarely set

These live in /etc/sysctl.d/ and need to be explicitly applied. In a default AMI, they are not.

No audit daemon configured

auditd is installed on most RHEL-family images. It is not configured. The default audit.rules file is essentially empty — the daemon runs but captures almost nothing. On Ubuntu, auditd isn’t even installed by default.

CIS Benchmark Level 2 for RHEL 9 specifies 30+ auditd rules covering file access, privilege escalation, user management changes, network configuration changes, and more. None of them are present in a default AMI.

Package surface

Run rpm -qa | wc -l or dpkg -l | grep -c ^ii on a fresh instance. AL2023 comes in around 350 packages. Ubuntu 22.04 Server minimal sits around 500. RHEL 9 from Marketplace — depending on the variant — lands between 400 and 600.

How many of those packages does your application actually need? For a Python web service: Python, your runtime dependencies, and a handful of system libraries. The rest is exposure.


The on-prem story is different — and often worse

Cloud images at least get regular updates from their publishers. On-prem KVM and Nutanix environments tell a different story.

The KVM / QCOW2 situation

Most teams running KVM get their base images one of three ways:

  1. Download a cloud image (cloud-init enabled QCOW2) from the distro vendor and use it directly
  2. Convert an existing VMware VMDK or OVA and hope for the best
  3. Run a manual Kickstart/Preseed install once, then treat the result as the “golden image” forever

Option 1 gives you the same problems as the cloud image analysis above, plus you’re now responsible for handling cloud-init in an environment that might not have a metadata service — so you either ship a seed ISO with every VM, or you rip out cloud-init and manage first-boot differently.

Option 3 is the most common and the most dangerous. That “golden image” was created by someone who’s possibly no longer at the company, contains packages pinned to versions from 18 months ago, and has sshd configured however was convenient at the time. Worse, it gets cloned hundreds of times and none of those clones are ever individually updated at the image level.

The Nutanix AHV specifics

Nutanix AHV images have additional considerations that cloud images don’t deal with:

  • AHV uses a custom paravirtualised SCSI controller (virtio-scsi or the Nutanix variant). Images imported from VMware need pvscsi drivers removed and virtio_scsi added to the initramfs before the disk will be detected at boot.
  • The Nutanix guest tools agent (ngt) is separate from the kernel and needs to be installed inside the image for snapshot quiescence, VSS integration, and in-guest metrics.
  • cloud-init works on AHV but requires the ConfigDrive datasource — not the EC2 datasource that most cloud QCOW2 images default to. An unconfigured datasource means cloud-init times out at boot, costing 3–5 minutes on every first start.
  • NUMA topology on large AHV nodes affects memory allocation in ways that need kernel tuning (vm.zone_reclaim_mode, kernel.numa_balancing) — parameters no generic cloud image sets.

The result is that most Nutanix environments end up with a patchwork: partially converted images, manually applied guest tools, and hardening that was done once per environment rather than once per image.


What a hardened image actually looks like

A properly built hardened image isn’t just “a default image with some hardening applied at the end.” The hardening is architectural — decisions made at build time that change the fundamental shape of what’s inside the image.

Package set — minimal by design

Start from a minimal install group — @minimal-environment on RHEL/Rocky, --variant=minbase on Debian derivatives. Then add only what the image class requires. For a web server image: your runtime, a process supervisor, and nothing else. No man-db, no X11-common, no avahi.

Every package you don’t install is a CVE that can never affect you.

Filesystem hardening

Separate mount points with restrictive options prevent a class of privilege escalation attacks that depend on executing binaries from world-writable locations:

/tmp      nodev,nosuid,noexec
/var      nodev,nosuid
/var/tmp  nodev,nosuid,noexec
/home     nodev,nosuid
/dev/shm  nodev,nosuid,noexec

These are not applied by any default cloud image.

Kernel parameters — baked in at build time

# /etc/sysctl.d/99-hardening.conf

net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.all.log_martians = 1
net.ipv6.conf.all.accept_redirects = 0
kernel.randomize_va_space = 2
fs.suid_dumpable = 0
kernel.dmesg_restrict = 1
kernel.kptr_restrict = 2
net.core.bpf_jit_harden = 2

Applied at image build time. Present on every instance, every time, before your application code runs.

SSH locked down

Protocol 2
PermitRootLogin no
MaxAuthTries 4
LoginGraceTime 60
X11Forwarding no
AllowAgentForwarding no
AllowTcpForwarding no
PermitUserEnvironment no
Ciphers [email protected],[email protected],aes256-ctr
MACs [email protected],[email protected]
KexAlgorithms curve25519-sha256,diffie-hellman-group16-sha512
ClientAliveInterval 300
ClientAliveCountMax 3
Banner /etc/issue.net

This is approximately CIS Level 1 SSH hardening. It lives in the image — not in a post-deploy playbook.

auditd rules embedded

# Privilege escalation
-a always,exit -F arch=b64 -S execve -C uid!=euid -F euid=0 -k setuid

# Sudo usage
-w /etc/sudoers -p wa -k sudoers

# User and group management
-w /etc/passwd -p wa -k identity
-w /etc/group  -p wa -k identity

# Kernel module loading
-a always,exit -F arch=b64 -S init_module -S delete_module -k modules

The full CIS L2 auditd ruleset runs to ~60 rules. They’re all committed to the image. Every instance generates audit logs from minute one of its existence.

Services disabled at build time

systemctl disable avahi-daemon
systemctl disable cups
systemctl disable postfix
systemctl disable bluetooth
systemctl disable rpcbind
systemctl mask debug-shell.service

The service list varies by distro. The principle is the same: if it’s not required by the image’s purpose, it doesn’t run.


The platform dimension: why you can’t use one image everywhere

This is where the complexity gets real. A CIS-hardened RHEL 9 image built for AWS doesn’t directly work on KVM, and it doesn’t directly work on Nutanix either. The security controls are the same — the platform-specific layer underneath them is not.

Here’s what needs to differ per target platform:

Concern AWS (AMI) KVM (QCOW2) Nutanix AHV
Disk format Raw / VMDK → AMI QCOW2 QCOW2 / VMDK
Boot mechanism GRUB2 + PVGRUB2 or UEFI GRUB2 GRUB2 + UEFI
Network driver ENA (ena kernel module) virtio-net virtio-net
Storage driver NVMe or xen-blkfront virtio-blk / virtio-scsi virtio-scsi
cloud-init datasource ec2 NoCloud / ConfigDrive ConfigDrive
Guest agent AWS SSM / CloudWatch qemu-guest-agent Nutanix Guest Tools
Metadata service 169.254.169.254 None (seed ISO) or local Nutanix AOS

A single pipeline needs to produce platform-specific artefacts from a single hardened source. The hardening doesn’t change. The drivers, datasources, and agents do.


Where this sits relative to CIS and NIST

The controls described above aren’t arbitrary. They map directly to published frameworks.

CIS Benchmark Level 1 covers controls with low operational impact and high security return — SSH configuration, kernel parameters, filesystem mount options, service reduction. Almost everything in the “what a hardened image looks like” section above is CIS Level 1.

CIS Benchmark Level 2 adds auditd configuration, PAM controls, additional filesystem protections, and more aggressive service disablement. It trades some operational flexibility for a significantly smaller attack surface.

NIST SP 800-53 CM-6 (Configuration Settings) directly requires that systems be configured to the most restrictive settings consistent with operational requirements. Baking hardening into the image is a stronger implementation of CM-6 than applying it post-deploy — because it’s guaranteed, auditable at build time, and consistent across every instance regardless of how it was launched.

NIST SP 800-53 SI-2 (Flaw Remediation) maps to your image patching cadence. An image rebuilt monthly against the latest package repositories satisfies SI-2 more completely than runtime patching alone, because it also eliminates packages you don’t need — packages that would need patching if they were present.

The full CIS and NIST control mapping will be covered in depth later in this series.


The build-time vs runtime hardening distinction

This is the most important concept in the entire post.

Hardening applied at runtime — via Ansible, Chef, cloud-init user-data, or a shell script — is conditional. It runs if the automation runs. It applies if nothing fails. It’s consistent only if every deployment goes through exactly the same path.

Hardening embedded in the image is unconditional. It cannot be skipped. It doesn’t depend on connectivity to an Ansible control node. It doesn’t require cloud-init to succeed. It cannot be accidentally omitted by a new team member who doesn’t know the runbook.

This distinction matters most at incident response time. When you’re investigating a compromised instance, the first question you want to answer confidently is: was this instance ever in a known-good state?

  • If your hardening is in the image: yes, from boot.
  • If your hardening is applied post-deploy: it depends on whether everything went right on that specific instance’s first boot.

What comes next

The practical question this raises: how do you build these images in a repeatable, multi-platform way, with CIS scanning integrated into the build pipeline?

Packer covers most of the builder layer. OpenSCAP provides the scanning. Kickstart, cloud-init, and Nutanix AHV-specific tooling fill the gaps. But the orchestration between these — producing a consistent hardened image for three different target platforms from a single source of truth — is where most teams hit friction.

The next post in this series covers the platform-specific differences between AWS, KVM, and Nutanix in depth: what actually needs to change per target when your security baseline is shared.

Next in the series: Cloud vs KVM vs Nutanix — why one image doesn’t fit all →


Questions or corrections? Open an issue or reach me on LinkedIn. If this was useful, the series index has the full roadmap.