Linuxcent

Identity Providers Explained: On-Prem, Cloud, SCIM, and Federation

May 10, 2026May 8, 2026 by Vamshi Krishna Santhapuri

Reading Time: 6 minutes

The Identity Stack, Episode 11
EP10: SAML/OIDC → EP11 → EP12: Entra ID + Linux → …

TL;DR

An Identity Provider (IdP) is the system that authenticates users and issues identity assertions (SAML assertions, OIDC tokens) to applications
On-prem IdPs: AD FS (Microsoft), Shibboleth (universities), Keycloak (open source), Ping Identity — they sit in front of AD and speak SAML/OIDC to cloud apps
Cloud IdPs: Okta, Entra ID (Azure AD), Google Workspace, Ping Identity Cloud — they are the directory and the authentication layer in one
Federation: IdPs can trust each other — a corporate IdP can delegate to a cloud IdP, or federate with a partner org’s IdP
SCIM (System for Cross-domain Identity Management) is provisioning, not authentication — it creates/updates/deactivates user accounts in target systems when the source directory changes
The key distinction: federation (authentication flow) vs directory sync (data copy) — they solve different problems and are often deployed together

The Big Picture: Where IdPs Sit

                        On-prem Directory
                        (Active Directory / OpenLDAP / FreeIPA)
                               │
                               │ LDAP / Kerberos
                               ▼
                         Identity Provider
                         ┌──────────────────────────────────┐
                         │  AD FS / Keycloak / Okta /       │
                         │  Entra ID Connect / Shibboleth   │
                         │                                  │
                         │  Speaks: SAML 2.0 + OIDC + OAuth2│
                         └────────────────┬─────────────────┘
                                          │ assertions / tokens
                      ┌───────────────────┼───────────────────┐
                      ▼                   ▼                   ▼
               Salesforce          GitHub Enterprise      AWS IAM
               (SAML SP)           (OIDC RP)              (OIDC)

EP10 covered the protocols. This episode covers the systems — what an IdP actually does, how the major ones differ, and how they connect to each other through federation and SCIM.

On-Premises Identity Providers

AD FS (Active Directory Federation Services)

AD FS is Microsoft’s on-prem federation server — a Windows Server role that sits in front of Active Directory and speaks SAML 2.0 and OIDC to external applications.

What it does:
– Authenticates users against AD (Kerberos/LDAP behind the scenes)
– Issues SAML assertions and OIDC tokens to external SPs
– Handles claims transformation: maps AD attributes to what the SP expects

What it doesn’t do well:
– It’s Windows Server only
– Configuration is complex (XML, certificates, claim rule language)
– No built-in MFA (requires Azure MFA or a third-party provider)
– Being deprecated in favor of Entra ID for most use cases

AD FS made sense when everything was on-prem. As workloads move to cloud, Entra ID Connect (a lighter sync agent) combined with Entra ID as the IdP replaces AD FS for most enterprises.

Keycloak

Keycloak is the open-source IdP from Red Hat. It’s what FreeIPA uses for web-based OIDC/SAML SSO, and it’s widely deployed independently for organizations that want full control over their identity infrastructure.

# Run Keycloak in development mode (Docker)
docker run -p 8080:8080 \
  -e KEYCLOAK_ADMIN=admin \
  -e KEYCLOAK_ADMIN_PASSWORD=admin \
  quay.io/keycloak/keycloak:latest \
  start-dev

# Keycloak concepts:
# Realm     — an isolated namespace (like a tenant)
# Client    — an application that uses Keycloak for auth (SP/RP)
# User federation — connect Keycloak to an existing LDAP/AD directory
# Identity brokering — federate with external IdPs (Google, GitHub, another SAML IdP)

Keycloak reads users from AD/LDAP via its User Federation feature — it doesn’t replace the directory, it federates it. Users still live in AD; Keycloak issues SAML/OIDC tokens based on those users.

Shibboleth

Shibboleth is the dominant IdP in academia. Most universities run it. It’s SAML-native, designed for federation between institutions — a student can authenticate at their home university’s IdP and access resources at a partner institution.

Cloud Identity Providers

Okta

Okta is a cloud IdP + directory. It can:
– Act as the primary user directory (storing users, credentials)
– Connect to on-prem AD via the Okta Active Directory Agent (a lightweight sync service)
– Federate with other IdPs (act as IdP or SP in a SAML/OIDC chain)
– Enforce MFA, Adaptive Authentication, Device Trust

Okta’s Lifecycle Management handles provisioning: when a user is created/disabled in Okta (or synced from AD), Okta can automatically create/deactivate accounts in downstream SaaS apps — via SCIM or app-specific APIs.

Entra ID (Azure Active Directory)

Entra ID is Microsoft’s cloud IdP. It’s both a directory (stores users, groups) and an IdP (issues tokens). For organizations running on-prem AD, Entra ID Connect syncs users from AD to Entra ID.

Entra ID is OIDC and OAuth2 native — it speaks SAML for legacy apps but JWT/OIDC for everything modern. Its OIDC implementation follows the standard closely; its token validation happens via /.well-known/openid-configuration and the JWKS endpoint.

On-prem AD  →  Entra ID Connect (sync agent)  →  Entra ID (cloud)
                                                      │
                                              SAML / OIDC
                                                      │
                                            SaaS apps, Azure resources

Google Workspace

Google Workspace is Google’s combined directory + IdP. Google accounts are the users. Apps integrate via SAML or OIDC. Google’s OIDC implementation is one of the most widely used reference implementations — most OIDC libraries are tested against it.

Federation: IdPs Trusting Each Other

Federation is the mechanism that lets IdPs delegate to each other. Two patterns:

SAML Federation (IdP-to-IdP)

Common in academia and partner integrations:

User at University A → requests resource at University B
                              │
                              │ doesn't know user
                              ▼
                    University B SP redirects to...
                    Discovery Service: "which IdP are you from?"
                              │
                              ▼
                    University A IdP authenticates user
                              │
                    Sends SAML assertion to University B SP

University B’s SP trusts University A’s IdP because both are members of a SAML federation (e.g., InCommon in the US, eduGAIN globally). The federation metadata aggregates all members’ SAML metadata — certificates, endpoints — so members don’t have to manually configure each bilateral trust.

OIDC Identity Brokering

Keycloak, Okta, and Entra ID can all act as identity brokers — they sit between the application and the actual authenticating IdP:

App (OIDC RP) → Keycloak (broker IdP) → Google / GitHub / SAML IdP
                                               │ authenticate
                                               ▼
                                      Keycloak receives assertion
                                      Maps external claims to local claims
                                      Issues OIDC token to app

The app only knows Keycloak. Keycloak handles the upstream IdP complexity.

SCIM: Provisioning ≠ Authentication

SCIM (RFC 7644) is a REST API standard for user lifecycle management — creating, updating, and deactivating user accounts in a target system when changes happen in the source directory.

Source (Okta / Entra ID)           Target (Slack / GitHub / Jira)
         │                                    │
         │  SCIM 2.0 (REST + JSON)            │
         ├─ POST /Users  ─────────────────────► create user
         ├─ PATCH /Users/id ──────────────────► update attributes
         └─ DELETE /Users/id ─────────────────► deactivate account

SCIM is not SSO. A SCIM-provisioned user in Slack can log in to Slack — but the authentication still goes through the IdP (SAML/OIDC). SCIM ensures the account exists. The IdP proves the user’s identity.

Why both? Because SSO alone doesn’t create accounts in target systems — it just authenticates to them. If a user tries to log in to Slack for the first time via SSO, Slack needs an account to map them to. SCIM creates that account before the first login (Just-in-Time provisioning handles it at first login, but SCIM handles it in bulk and handles deprovisioning reliably).

Deprovisioning is where SCIM matters most. When an employee leaves, you disable them in Okta — SCIM deactivates their account in every connected app within minutes. Without SCIM, IT runs a manual checklist. Someone misses Jira. The ex-employee has access for three weeks.

Directory Sync vs Federation

These are commonly confused:

Directory sync — copy user data from source to target. Entra ID Connect copies users from on-prem AD to Entra ID. This is not authentication; it’s data replication. After sync, Entra ID has its own copy of the user record.

Federation — delegate authentication to an external IdP. The target system doesn’t store credentials; it redirects to the IdP for authentication and trusts the assertion that comes back.

You often need both:
– Sync: so the target system has the user record and can enforce policies (group membership, license assignment)
– Federation: so the user authenticates against the source of truth (your IdP) rather than maintaining a separate password in every system

⚠ Common Misconceptions

“SCIM is an authentication protocol.” SCIM is a provisioning protocol. It creates and manages accounts. Authentication is SAML/OIDC. Both solve different parts of the identity lifecycle problem.

“SSO means you only have one password.” SSO means you only authenticate once per session. The password still exists (at the IdP). SSO reduces the number of authentication events, not the number of credentials.

“On-prem IdP + cloud sync is the same as a cloud IdP.” With on-prem IdP + cloud sync (e.g., AD + Entra ID Connect), authentication happens via the on-prem IdP — if it goes down, cloud SSO breaks. A pure cloud IdP (Okta standalone, Entra ID without on-prem AD) authenticates entirely in the cloud.

Framework Alignment

Domain	Relevance
CISSP Domain 5: Identity and Access Management	IdPs are the central control plane for federated identity — their architecture, trust relationships, and provisioning workflows define the enterprise IAM posture
CISSP Domain 1: Security and Risk Management	SCIM-based deprovisioning is an access control risk management practice — without it, terminated employee access persists across connected systems
CISSP Domain 3: Security Architecture and Engineering	The choice of on-prem vs cloud IdP, federation vs sync, and SCIM vs JIT provisioning are architectural decisions with long-term operational and security implications

Key Takeaways

An IdP authenticates users and issues assertions (SAML) or tokens (OIDC/OAuth2) — applications trust the IdP, not the user directly
On-prem: AD FS (Windows/legacy), Keycloak (open source, flexible), Shibboleth (academia)
Cloud: Okta (cloud-native, strong lifecycle management), Entra ID (Microsoft-integrated), Google Workspace
Federation = authentication delegation between IdPs; Directory sync = data replication; SCIM = account lifecycle (provisioning/deprovisioning)
SCIM deprovisioning is the critical control — it ensures ex-employees lose access automatically across all connected systems

What’s Next

EP11 covered the IdP landscape. EP12 gets specific: Entra ID and Linux — how you configure a Linux VM to accept SSH logins authenticated against Azure AD credentials, and how the aad-auth / pam_aad stack works end to end.

Next: Entra ID Linux Login: SSH Authentication with Azure AD Credentials

Get EP12 in your inbox when it publishes → linuxcent.com/subscribe

SAML vs OIDC vs OAuth2: Which Protocol Handles Which Identity Problem

May 10, 2026May 8, 2026 by Vamshi Krishna Santhapuri

Reading Time: 6 minutes

The Identity Stack, Episode 10
EP09: Active Directory → EP10 → EP11: Identity Providers → …

TL;DR

SAML 2.0 is a federation protocol for browser-based SSO — an IdP issues a signed XML assertion that a Service Provider trusts; designed for enterprise applications
OAuth2 is an authorization delegation protocol, not authentication — it lets an application act on your behalf without knowing your password; the access token says what, not who
OIDC (OpenID Connect) = OAuth2 + an identity layer — adds the id_token (a JWT containing who you are) on top of OAuth2’s access_token (what you can do)
SAML vs OIDC: SAML is XML, enterprise-native, stateful; OIDC is JSON/JWT, API-native, stateless — new applications almost always use OIDC
The id_token is a JWT — decode it at jwt.io and read every claim — it tells you exactly what the IdP asserts about the user
The browser SSO flow is three redirects: user → SP → IdP (authenticate) → SP (consume assertion)

The Problem: LDAP and Kerberos Don’t Cross the Internet

EP09 showed how authentication works inside a corporate network. LDAP and Kerberos both assume network proximity to the directory server — firewall-friendly ports don’t help when the authentication protocol requires a direct connection to the KDC or directory.

Internal network: works
  Browser → intranet app → LDAP/Kerberos → AD DC (all on 10.0.0.0/8)

Internet: breaks
  Browser → SaaS app (AWS) → LDAP/Kerberos → AD DC (on-prem behind firewall)
  ✗ KDC not reachable across NAT
  ✗ LDAP not exposed to internet (shouldn't be)
  ✗ Every SaaS app can't have its own LDAP connection to your DC

SAML was invented in 2002 to solve this. OIDC in 2014. Both let identity assertions travel over HTTPS — the one protocol that crosses every firewall.

SAML 2.0: Enterprise Browser SSO

SAML 2.0 has three actors: the User, the Identity Provider (IdP), and the Service Provider (SP).

1. User visits SP (e.g., Salesforce)
   SP: "I don't know this user — send them to the IdP"
   ↓  HTTP redirect with SAMLRequest (base64-encoded AuthnRequest)

2. User arrives at IdP (e.g., Okta, AD FS, Entra ID)
   IdP: "Authenticate me" → user enters credentials
   IdP: generates a signed SAML Assertion (XML)
   ↓  HTTP POST to SP's Assertion Consumer Service (ACS) URL

3. SP receives the SAMLResponse
   SP: verifies the signature using IdP's public key
   SP: extracts user attributes from the Assertion
   SP: creates a session — user is logged in

The SAML Assertion is an XML document signed by the IdP. It contains:

<saml:Assertion>
  <saml:Issuer>https://idp.corp.com</saml:Issuer>
  <saml:Subject>
    <saml:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-format:emailAddress">
      [email protected]
    </saml:NameID>
  </saml:Subject>
  <saml:Conditions
    NotBefore="2026-04-27T01:00:00Z"
    NotOnOrAfter="2026-04-27T01:05:00Z">  ← short-lived: replay protection
  </saml:Conditions>
  <saml:AttributeStatement>
    <saml:Attribute Name="email">
      <saml:AttributeValue>[email protected]</saml:AttributeValue>
    </saml:Attribute>
    <saml:Attribute Name="groups">
      <saml:AttributeValue>engineers</saml:AttributeValue>
      <saml:AttributeValue>sre-team</saml:AttributeValue>
    </saml:Attribute>
  </saml:AttributeStatement>
</saml:Assertion>

The SP trusts the assertion because it’s signed with the IdP’s private key, and the SP has the IdP’s public certificate configured. No direct connection between SP and IdP needed during authentication — only the browser carries the assertion.

SP-initiated vs IdP-initiated:
– SP-initiated: user visits the SP, gets redirected to IdP, authenticates, redirected back — the common flow
– IdP-initiated: user starts at the IdP (e.g., company portal), clicks an app, IdP sends assertion directly — simpler but no SP-generated RequestID, so the SP can’t verify the request was expected (a security concern)

OAuth2: Authorization Delegation (Not Authentication)

This distinction is important and consistently confused: OAuth2 is for authorization, not authentication.

OAuth2 solves: “I want to let GitHub Actions post to my Slack without giving GitHub my Slack password.”

Resource Owner (you)  → grants permission to →  Client (GitHub Actions)
                                                        │
                                                        │ access_token
                                                        ▼
                                               Resource Server (Slack API)
                                               "this token can post messages"

The access_token answers “what can this client do?” not “who is this user?” A resource server receiving an access token knows the token is valid and what scopes it carries — it does not necessarily know which human authorized it.

The four OAuth2 grant types:

Grant	Use case
Authorization Code	Web apps (server-side) — most secure, recommended
PKCE (+ Auth Code)	Native/SPA apps — Auth Code without client secret
Client Credentials	Machine-to-machine (no user) — service accounts
Device Code	Devices without browsers (smart TVs, CLIs)

The Implicit grant (tokens in URL fragment) is deprecated. Don’t use it.

OIDC: OAuth2 + Who You Are

OpenID Connect adds identity to OAuth2 by adding the id_token — a JWT that the IdP signs and that contains claims about the authenticated user.

Authorization Code flow with OIDC:

1. Client redirects user to IdP:
   GET /authorize?
     response_type=code
     &client_id=myapp
     &scope=openid email profile    ← "openid" scope triggers OIDC
     &redirect_uri=https://app.com/callback
     &state=random-nonce

2. IdP authenticates user, returns:
   GET /callback?code=AUTH_CODE&state=random-nonce

3. Client exchanges code for tokens:
   POST /token
   grant_type=authorization_code&code=AUTH_CODE...

4. IdP returns:
   {
     "access_token": "eyJ...",    ← what the user authorized
     "id_token": "eyJ...",        ← who the user is (JWT)
     "token_type": "Bearer",
     "expires_in": 3600
   }

The id_token decoded:

{
  "iss": "https://idp.corp.com",          ← issuer (the IdP)
  "sub": "user-guid-12345",               ← subject (stable user identifier)
  "aud": "myapp",                          ← audience (your client_id)
  "exp": 1745730000,                       ← expiry (Unix timestamp)
  "iat": 1745726400,                       ← issued at
  "email": "[email protected]",
  "name": "Vamshi Krishna",
  "groups": ["engineers", "sre-team"]     ← custom claims from IdP
}

# Decode any JWT at the command line (no verification — for debugging only)
echo "eyJ..." | cut -d. -f2 | base64 -d 2>/dev/null | python3 -m json.tool

# Or: jwt.io — paste the token, read every claim

sub is the stable user identifier. Email addresses change. Names change. The sub claim is the IdP’s internal identifier for the user — use it as the primary key when storing user data. Never store email as the primary key.

SAML vs OIDC: When to Use Which

	SAML 2.0	OIDC
Format	XML	JSON / JWT
Transport	HTTP POST (browser only)	HTTP redirect + JSON API
Age	2002	2014
Enterprise adoption	Very high (AD FS, Okta, Entra ID)	Very high (newer apps)
API-friendly	No	Yes
Mobile apps	No	Yes
Complexity	High (XML, schemas, signatures)	Medium (JWT, JSON)
Single Logout	Specified (rarely works well)	Optional, inconsistent

Use SAML when: You’re integrating with an enterprise SaaS that only supports SAML (Salesforce classic, legacy HR systems), or your IdP team mandates it.

Use OIDC when: You’re building a new application, integrating with a modern IdP, or need API-based token validation. OIDC is the default for everything new.

Use OAuth2 (Client Credentials) when: Service-to-service authentication with no user — your CI/CD pipeline authenticating to an API, your microservice calling another microservice.

A Complete Browser SSO Flow (OIDC)

1. User visits https://app.corp.com (not logged in)
   App: no session → redirect to IdP

2. GET https://idp.corp.com/authorize?
        response_type=code
        &client_id=app-corp
        &scope=openid email
        &redirect_uri=https://app.corp.com/callback
        &state=abc123
        &nonce=xyz789

3. IdP: user is not authenticated → show login form
   User: enters [email protected] + password
   (or: IdP sees existing session cookie → skip login)

4. IdP: authentication success
   Redirect: GET https://app.corp.com/callback?code=AUTH_CODE&state=abc123

5. App (server-side): validate state=abc123 (CSRF protection)
   POST https://idp.corp.com/token
     grant_type=authorization_code
     &code=AUTH_CODE
     &client_id=app-corp
     &client_secret=SECRET
     &redirect_uri=https://app.corp.com/callback

6. IdP responds:
   { "id_token": "JWT...", "access_token": "JWT...", "expires_in": 3600 }

7. App: validate id_token signature (using IdP's JWKS endpoint)
   App: extract sub, email, groups from id_token
   App: create session for [email protected]
   App: redirect user to original destination

Step 7 is where most bugs live. The app must validate: signature (using IdP’s public keys from /.well-known/jwks.json), iss (matches the expected IdP), aud (matches the client_id), exp (not expired), and nonce (matches what was sent in step 2). Skip any of these and you have an authentication bypass.

⚠ Common Misconceptions

“OAuth2 is for login.” OAuth2 is for authorization delegation. It can be used as a login mechanism only when OIDC (the openid scope + id_token) is added on top. “Login with Google” uses OIDC, not bare OAuth2.

“JWTs are encrypted.” By default, JWTs are signed (JWS), not encrypted. The header and payload are base64url-encoded — anyone can decode them. Encryption (JWE) is a separate, less commonly used spec. Never put secrets in a JWT payload assuming it’s private.

“SAML Single Logout works reliably.” SAML SLO is specified but inconsistently implemented. Many SPs ignore SLO requests or don’t propagate them correctly. Don’t depend on SLO for security — session revocation requires additional mechanisms (short-lived tokens, token introspection, session registries).

Framework Alignment

Domain	Relevance
CISSP Domain 5: Identity and Access Management	SAML, OAuth2, and OIDC are the three protocols that enable federated identity and SSO — understanding which does what is foundational to modern IAM design
CISSP Domain 4: Communications and Network Security	JWT validation (signature, claims, expiry) is a network security control — failing to validate any claim is an authentication bypass vulnerability
CISSP Domain 3: Security Architecture and Engineering	The choice of SAML vs OIDC is an architectural decision that affects every application integration, mobile support, and API design

Key Takeaways

SAML 2.0: XML-based browser SSO — three redirects, signed assertion, enterprise legacy apps
OAuth2: authorization delegation — access tokens grant scopes, not identity
OIDC: OAuth2 + id_token — adds who the user is on top of what they can do
sub is the stable user identifier in OIDC — never use email as a primary key
JWT validation must check: signature, iss, aud, exp, nonce — missing any is a security bypass
New applications: OIDC. Legacy enterprise SaaS: SAML. Service-to-service: OAuth2 Client Credentials

What’s Next

EP10 covered the protocols. EP11 covers the systems that implement them — the identity providers: what Okta, Entra ID, Keycloak, and AD FS actually do, how they federate with each other, and how SCIM handles user provisioning separately from authentication.

Next: Identity Providers Explained: On-Prem, Cloud, SCIM, and Federation

Get EP11 in your inbox when it publishes → linuxcent.com/subscribe

How Active Directory Works: LDAP, Kerberos, and Group Policy Under the Hood

May 10, 2026May 7, 2026 by Vamshi Krishna Santhapuri

Reading Time: 6 minutes

The Identity Stack, Episode 9
EP08: FreeIPA → EP09 → EP10: SAML/OIDC → …

TL;DR

Active Directory is not a product that happens to use LDAP — it is an LDAP directory with a Microsoft-extended schema, a built-in Kerberos KDC, and DNS tightly integrated
Replication uses USNs (Update Sequence Numbers) and GUIDs — the Knowledge Consistency Checker (KCC) automatically builds the replication topology
Sites and site links tell AD which DCs are physically close — AD prefers to authenticate users against a DC in the same site to minimize WAN latency
Group Policy Objects (GPOs) are stored as LDAP entries (in the CN=Policies container) and Sysvol files — LDAP tells clients which GPOs apply; Sysvol delivers the policy files
Linux joins AD via realm join (uses adcli + SSSD) or net ads join (Samba + winbind) — both register a machine account in AD and get a Kerberos keytab
The difference between Linux in AD and Linux in FreeIPA: AD is optimized for Windows; FreeIPA is optimized for Linux — both interoperate

The Big Picture: What AD Actually Is

Active Directory Domain: corp.com
┌────────────────────────────────────────────────────────────┐
│                                                            │
│  LDAP directory          Kerberos KDC                      │
│  ─────────────           ──────────                        │
│  Schema: 1000+ classes   Realm: CORP.COM                   │
│  Objects: users, groups, Issues TGTs + service tickets     │
│  computers, GPOs, OUs    Uses LDAP as the account DB       │
│                                                            │
│  DNS                     Sysvol (DFS share)                │
│  ────                    ────────────────                  │
│  SRV records for KDC     GPO templates                     │
│  and LDAP discovery      Login scripts                     │
│                          Replicated via DFSR               │
│                                                            │
│  Replication engine: USN + GUID + KCC                      │
└────────────────────────────────────────────────────────────┘
          │ replicates to          │ replicates to
          ▼                        ▼
   DC: dc02.corp.com        DC: dc03.corp.com

EP08 showed FreeIPA as the Linux-native answer to enterprise identity. AD is the Microsoft answer — and because most enterprises run Windows clients, understanding AD is unavoidable for Linux infrastructure engineers. This episode goes behind the LDAP and Kerberos protocols to explain what makes AD specifically work.

The AD Schema: LDAP With 1000+ Object Classes

AD’s schema extends the base LDAP schema with Microsoft-specific classes and attributes. Every user object is a user class (which extends organizationalPerson which extends person which extends top) with additional attributes like:

sAMAccountName   ← the pre-Windows 2000 login name (vamshi)
userPrincipalName ← the modern UPN ([email protected])
objectGUID       ← a globally unique 128-bit identifier (never changes, even if DN changes)
objectSid        ← Windows Security Identifier (used for ACL enforcement on Windows)
whenCreated      ← creation timestamp
pwdLastSet       ← password change timestamp
userAccountControl ← bitmask: disabled, locked, password never expires, etc.
memberOf         ← back-link: groups this user belongs to

objectGUID is the authoritative identifier in AD — not the DN. When a user is renamed or moved to a different OU, the GUID stays the same. Applications that store a user’s DN will break on rename; applications that store the GUID won’t.

userAccountControl is the bitmask that controls account state:

Flag          Value   Meaning
ACCOUNTDISABLE  2     Account disabled
LOCKOUT         16    Account locked out
PASSWD_NOTREQD  32    Password not required
NORMAL_ACCOUNT  512   Normal user account (set on almost all accounts)
DONT_EXPIRE_PASSWD 65536  Password never expires

# Query AD from a Linux machine
ldapsearch -x -H ldap://dc.corp.com \
  -D "[email protected]" -w password \
  -b "dc=corp,dc=com" \
  "(sAMAccountName=vamshi)" \
  sAMAccountName userPrincipalName objectGUID memberOf userAccountControl

Replication: USN + GUID + KCC

AD replication is multi-master — every DC accepts writes. The replication engine uses:

USN (Update Sequence Number) — a per-DC counter that increments on every local write. Each attribute in the directory stores the USN at which it was last modified (uSNChanged, uSNCreated). When DC-A replicates to DC-B, DC-B asks: “give me everything you’ve changed since the last USN I saw from you.”

GUID — each object has a globally unique identifier. If the same attribute is modified on two DCs before replication (a conflict), the conflict is resolved: last-writer-wins at the attribute level, based on the modification timestamp. If timestamps are equal, the attribute value from the DC with the lexicographically higher GUID wins.

KCC (Knowledge Consistency Checker) — a component that runs on every DC and automatically constructs the replication topology. You don’t configure which DCs replicate to which — the KCC builds a minimum spanning tree that ensures every DC is connected to every other within a set number of hops. You configure Sites and site links; the KCC does the rest.

# Check replication status from a Linux machine (requires rpcclient or adcli)
# Or on the DC: repadmin /showrepl (Windows tool)

# Simulate: query the highestCommittedUSN from a DC
ldapsearch -x -H ldap://dc.corp.com \
  -D "[email protected]" -w password \
  -b "" -s base highestCommittedUSN

Sites and Site Links

Sites are AD’s concept of physical network topology. A site is a set of IP subnets with high-bandwidth connectivity between them. Site links represent the WAN connections between sites.

Site: Mumbai              Site: Hyderabad
┌────────────────┐        ┌────────────────┐
│ DC: dc-mum-01  │        │ DC: dc-hyd-01  │
│ DC: dc-mum-02  │        │ DC: dc-hyd-02  │
│ subnet: 10.1/16│        │ subnet: 10.2/16│
└───────┬────────┘        └────────┬───────┘
        │                          │
        └──── Site Link ───────────┘
              Cost: 100
              Replication interval: 15 min

When a user in Mumbai authenticates, AD’s KDC locates a DC in the same site using DNS SRV records. The SRV records include the site name in the service name: _ldap._tcp.Mumbai._sites.dc._msdcs.corp.com. SSSD and Windows clients query site-local SRV records first.

If no DC is available in the local site, authentication falls back to a DC in another site across the WAN link. Configuring sites correctly prevents remote authentication failures from killing local operations.

Group Policy: LDAP + Sysvol

GPOs are stored in two places:

LDAP — the CN=Policies,CN=System,DC=corp,DC=com container holds GPO metadata objects. Each GPO has a GUID, a display name, and version numbers. The gPLink attribute on OUs and the domain root links GPOs to where they apply.

Sysvol — the actual policy templates and scripts live in \\corp.com\SYSVOL\corp.com\Policies\{GPO-GUID}\. Sysvol is a DFS-R (Distributed File System Replication) share replicated to every DC.

When a Windows client applies Group Policy:
1. LDAP query: what GPOs are linked to my OU chain?
2. Sysvol fetch: download the policy templates from the GPO’s Sysvol path
3. Apply: process Registry settings, Security settings, Scripts

Linux clients don’t process GPOs natively. The adcli and sssd tools interpret a small subset of AD policy (password policy, account lockout) via LDAP. Full GPO processing on Linux requires Samba’s samba-gpupdate or third-party tools.

Joining Linux to AD

realm join (recommended)

# Install required packages
dnf install -y realmd sssd adcli samba-common

# Discover the domain
realm discover corp.com
# corp.com
#   type: kerberos
#   realm-name: CORP.COM
#   domain-name: corp.com
#   configured: no
#   server-software: active-directory
#   client-software: sssd

# Join
realm join corp.com -U Administrator
# Prompts for Administrator password
# Creates machine account in AD
# Configures sssd.conf, krb5.conf, nsswitch.conf, pam.d automatically

# Verify
realm list
id [email protected]

What the join does:

Creates a machine account HOSTNAME$ in CN=Computers,DC=corp,DC=com
Sets a machine password (rotated automatically by SSSD)
Retrieves a Kerberos keytab to /etc/krb5.keytab
Configures SSSD with id_provider = ad, auth_provider = ad
Updates /etc/nsswitch.conf to include sss
Updates /etc/pam.d/ to include pam_sss

After joining, SSSD uses the machine’s Kerberos keytab to authenticate to the DC and query LDAP — no hardcoded service account credentials required.

LDAP Queries Against AD from Linux

# Find a user (after kinit or with -w password)
ldapsearch -Y GSSAPI -H ldap://dc.corp.com \
  -b "dc=corp,dc=com" \
  "(sAMAccountName=vamshi)" \
  sAMAccountName mail memberOf

# Find all members of a group
ldapsearch -Y GSSAPI -H ldap://dc.corp.com \
  -b "dc=corp,dc=com" \
  "(cn=engineers)" \
  member

# Find all AD-joined Linux machines
ldapsearch -Y GSSAPI -H ldap://dc.corp.com \
  -b "dc=corp,dc=com" \
  "(&(objectClass=computer)(operatingSystem=*Linux*))" \
  cn operatingSystem lastLogonTimestamp

# Find disabled accounts
ldapsearch -Y GSSAPI -H ldap://dc.corp.com \
  -b "dc=corp,dc=com" \
  "(userAccountControl:1.2.840.113556.1.4.803:=2)" \
  sAMAccountName

The last filter uses an LDAP extensible match (1.2.840.113556.1.4.803 is the OID for bitwise AND). userAccountControl:1.2.840.113556.1.4.803:=2 means “entries where userAccountControl AND 2 equals 2” — i.e., the ACCOUNTDISABLE bit is set. This is a Microsoft AD extension not in standard LDAP.

⚠ Common Misconceptions

“AD is just Microsoft’s LDAP.” AD is LDAP + Kerberos + DNS + DFS-R + GPO, all tightly integrated and with a schema that the Microsoft ecosystem depends on. You can query AD with standard ldapsearch. You cannot replace it with OpenLDAP without breaking every Windows client.

“Linux machines in AD get GPO.” Linux machines appear in AD and can be organized into OUs. Standard GPOs don’t apply to them. Samba’s samba-gpupdate can process a subset of AD policy for Linux — mostly Registry and Security settings mapped to Linux equivalents.

“realm leave removes the machine cleanly.” realm leave removes local configuration but does not delete the machine account from AD. The stale computer object stays in CN=Computers until an AD admin deletes it. Always run realm leave && adcli delete-computer -U Administrator for a clean removal.

Framework Alignment

Domain	Relevance
CISSP Domain 5: Identity and Access Management	AD is the dominant enterprise identity store — understanding its LDAP structure, Kerberos realm, and GPO model is essential for IAM in mixed environments
CISSP Domain 4: Communications and Network Security	AD replication traffic (RPC, LDAP, Kerberos) is a significant portion of enterprise WAN traffic — Sites and site links are a network security and performance design decision
CISSP Domain 3: Security Architecture and Engineering	AD forest/domain/OU hierarchy is an architectural decision with long-term security consequences — getting OU structure wrong constrains GPO delegation for years

Key Takeaways

AD is LDAP + Kerberos + DNS + GPO + DFS-R — not a product that “uses” these; they’re the implementation
Replication is multi-master via USN + GUID; the KCC builds the topology automatically from Sites configuration
objectGUID is the stable identifier — not the DN, which changes on rename/move
realm join is the correct way to join Linux to AD — it configures SSSD, Kerberos, PAM, and NSS correctly in one command
userAccountControl is the bitmask that controls account state — (userAccountControl:1.2.840.113556.1.4.803:=2) finds disabled accounts

What’s Next

EP09 covered AD — LDAP and Kerberos inside the corporate network. EP10 covers what happens when identity needs to work across the internet, where Kerberos doesn’t reach: SAML, OAuth2, and OIDC — the protocols that let identity leave the building.

Next: SAML vs OIDC vs OAuth2: Which Protocol Handles Which Identity Problem

Get EP10 in your inbox when it publishes → linuxcent.com/subscribe

FreeIPA: LDAP + Kerberos + PKI in a Single Linux Identity Stack

May 13, 2026May 7, 2026 by Vamshi Krishna Santhapuri

Reading Time: 5 minutes

The Identity Stack, Episode 8
EP07: LDAP HA → EP08 → EP09: Active Directory → …

TL;DR

FreeIPA is 389-DS (LDAP) + MIT Kerberos + Dogtag PKI + Bind DNS + SSSD — one ipa-server-install command gets you an enterprise identity platform
Host-Based Access Control (HBAC) lets you define centrally: which users can SSH to which hosts — no more managing /etc/security/access.conf per machine
Sudo rules from the directory: define sudo policy centrally, have every machine pull it — no /etc/sudoers.d/ files scattered across the fleet
ipa CLI is the management interface — ipa user-add, ipa group-add, ipa hbacrule-add — everything that took five LDAP commands takes one ipa command
FreeIPA trusts with Active Directory let Linux machines authenticate AD users without joining the AD domain
The right choice for Linux-centric environments; AD is the right choice when Windows clients dominate

The Big Picture: What FreeIPA Integrates

┌─────────────────────────────────────────────────────────┐
│                    FreeIPA Server                        │
│                                                         │
│  389-DS (LDAP)    MIT Kerberos    Dogtag PKI            │
│  ─────────────    ───────────     ─────────             │
│  User/group       TGT + service   Machine certs         │
│  storage          ticket issuing  User certs             │
│                                   OCSP / CRL            │
│  Bind DNS         SSSD (client)   Apache (WebUI)        │
│  ──────────       ────────────    ──────────────        │
│  SRV records      Enrollment      Management UI         │
│  for KDC/LDAP     automation      REST API              │
└─────────────────────────────────────────────────────────┘
              ▲                  ▲
              │ enrollment       │ SSH + sudo rules
   ┌──────────┴──────────┐  ┌───┴──────────────────┐
   │  Linux client        │  │  Linux client         │
   │  (ipa-client-install)│  │  (ipa-client-install) │
   └─────────────────────┘  └──────────────────────┘

EP06 and EP07 built OpenLDAP from components. FreeIPA gives you all of that plus Kerberos, PKI, DNS, and HBAC — opinionated, integrated, and managed through a single CLI and WebUI. This episode shows what you actually get from it.

Why FreeIPA Instead of Bare OpenLDAP

Running bare OpenLDAP requires you to:
– Configure schema for POSIX accounts, SSH keys, sudo rules, HBAC manually
– Set up MIT Kerberos separately and integrate it with LDAP
– Build your own PKI for machine certificates
– Maintain DNS SRV records for Kerberos discovery
– Write client enrollment scripts
– Build a management interface (or live in LDIF)

FreeIPA does all of this in one installer, with a consistent data model across all components. The trade-off is opacity — FreeIPA makes decisions for you (schema, replication topology, Kerberos realm name) that bare OpenLDAP leaves to you.

Installing FreeIPA Server

# RHEL / Rocky / AlmaLinux
dnf install -y freeipa-server freeipa-server-dns

# Run the installer (interactive)
ipa-server-install

# Or non-interactive:
ipa-server-install \
  --realm=CORP.COM \
  --domain=corp.com \
  --ds-password=DM_password \
  --admin-password=Admin_password \
  --setup-dns \
  --forwarder=8.8.8.8 \
  --unattended

# After install: get an admin Kerberos ticket
kinit admin

The installer creates:
– 389-DS instance with the FreeIPA schema
– MIT KDC with realm CORP.COM
– Dogtag CA and all certificate infrastructure
– Bind DNS with SRV records for the KDC and LDAP server
– Apache WebUI at https://ipa.corp.com/ipa/ui/
– SSSD configured on the server itself

Time: 5–10 minutes. What used to take a week of manual configuration.

The ipa CLI

Every management action goes through ipa. It talks to the IPA server’s REST API and handles Kerberos authentication transparently (it uses your kinit session).

# Users
ipa user-add vamshi \
  --first=Vamshi --last=Krishna \
  [email protected] \
  --password

ipa user-show vamshi
ipa user-find --all              # search all users
ipa user-disable vamshi          # lock account without deleting
ipa user-mod vamshi --shell=/bin/zsh

# Groups
ipa group-add engineers --desc "Engineering team"
ipa group-add-member engineers --users=vamshi,alice

# Password policy
ipa pwpolicy-mod --minlength=12 --maxlife=90 --history=10

# SSH public keys — stored centrally, pushed to every host
ipa user-mod vamshi --sshpubkey="ssh-ed25519 AAAA..."
# SSSD on enrolled hosts will use this key for SSH login — no authorized_keys file needed

Host-Based Access Control (HBAC)

HBAC is the feature that justifies FreeIPA for most Linux shops. It lets you define centrally: which users (or groups) can log in to which hosts (or host groups), using which services (SSH, sudo, FTP).

Without HBAC, access control is per-machine: /etc/security/access.conf or PAM pam_access rules, replicated across every server, managed inconsistently.

With HBAC: one rule, enforced everywhere.

# Create host groups
ipa hostgroup-add production-servers --desc "Production Linux hosts"
ipa hostgroup-add-member production-servers --hosts=web01.corp.com,db01.corp.com

# Create user groups
ipa group-add sre-team
ipa group-add-member sre-team --users=vamshi,alice

# Create an HBAC rule
ipa hbacrule-add allow-sre-to-prod \
  --desc "SRE team can SSH to production"
ipa hbacrule-add-user allow-sre-to-prod --groups=sre-team
ipa hbacrule-add-host allow-sre-to-prod --hostgroups=production-servers
ipa hbacrule-add-service allow-sre-to-prod --hbacsvcs=sshd

# Test the rule before applying
ipa hbactest \
  --user=vamshi \
  --host=web01.corp.com \
  --service=sshd
# Access granted: True
# Matched rules: allow-sre-to-prod

SSSD on each enrolled host enforces the HBAC rules at login time by querying the IPA server. No per-machine configuration. Add a new server to the production-servers host group and the HBAC rules apply immediately.

Sudo Rules from the Directory

# Create a sudo rule
ipa sudorule-add allow-sre-sudo \
  --cmdcat=all \
  --desc "SRE team gets full sudo on production"
ipa sudorule-add-user allow-sre-sudo --groups=sre-team
ipa sudorule-add-host allow-sre-sudo --hostgroups=production-servers

# Or a scoped rule — only specific commands
ipa sudorule-add allow-service-restart
ipa sudocmdgroup-add service-commands
ipa sudocmd-add /usr/bin/systemctl
ipa sudocmdgroup-add-member service-commands --sudocmds="/usr/bin/systemctl"
ipa sudorule-add-allow-command allow-service-restart --sudocmdgroups=service-commands

On enrolled hosts, SSSD’s sssd_sudo responder pulls these rules and the sudo command evaluates them locally. No /etc/sudoers.d/ files. Central policy, local enforcement.

Enrolling a Client

# On the client machine
dnf install -y freeipa-client

ipa-client-install \
  --domain=corp.com \
  --server=ipa.corp.com \
  --realm=CORP.COM \
  --principal=admin \
  --password=Admin_password \
  --unattended

# What this does:
# 1. Registers the host in IPA as a machine principal
# 2. Retrieves a host Kerberos keytab (/etc/krb5.keytab)
# 3. Configures SSSD (sssd.conf, nsswitch.conf, pam.d)
# 4. Configures Kerberos (/etc/krb5.conf)
# 5. Optionally configures NTP and DNS

After enrollment: getent passwd vamshi returns the IPA user. SSH with an IPA password works. HBAC rules are enforced. Sudo rules from the directory apply. SSH public keys from the user’s IPA profile work without authorized_keys files.

FreeIPA Trust with Active Directory

In mixed environments (Linux servers + Windows clients), you can establish a trust between FreeIPA and AD without joining the Linux servers to the AD domain directly.

# On the IPA server (after installing ipa-server-trust-ad)
ipa-adtrust-install --netbios-name=CORP

# Establish the trust
ipa trust-add ad.corp.com \
  --admin=Administrator \
  --password \
  --type=ad

# AD users can now log in to IPA-enrolled Linux hosts
# They appear as: CORP.COM\username or [email protected]

Under the hood: FreeIPA acts as an SSSD-enabled Samba DC for the trust relationship. AD users’ Kerberos tickets from the AD KDC are accepted by the FreeIPA KDC, which maps them to POSIX attributes stored in IPA (or automatically generated via ID mapping).

⚠ Common Misconceptions

“FreeIPA is just OpenLDAP with a UI.” FreeIPA uses 389-DS (not OpenLDAP), adds a full Kerberos KDC, a certificate authority, DNS, HBAC enforcement, and sudo management — all with a consistent schema designed for these use cases. It’s an integrated identity platform, not a wrapper.

“HBAC rules replace firewall rules.” HBAC controls who can log in to a host at the authentication layer — not network access. A blocked HBAC rule means the SSH session is rejected after TCP connection. You still need firewall rules to block TCP access.

“FreeIPA replicas are identical.” FreeIPA uses 389-DS Multi-Supplier replication. All replicas accept reads and writes. But the CA is separate — only the initial server (and explicitly designated CA replicas) run the CA. If the CA goes down, certificate operations stop; authentication does not.

Framework Alignment

Domain	Relevance
CISSP Domain 5: Identity and Access Management	FreeIPA is an enterprise IAM platform — HBAC, sudo policy, SSH key management, and certificate-based authentication are all IAM controls
CISSP Domain 3: Security Architecture and Engineering	FreeIPA’s integrated CA enables certificate-based authentication for machines and users — a stronger authentication factor than passwords
CISSP Domain 1: Security and Risk Management	Centralized HBAC and sudo policy reduces the attack surface of privilege escalation — no more inconsistent sudoers files that drift across the fleet

Key Takeaways

FreeIPA = 389-DS + MIT Kerberos + Dogtag PKI + Bind DNS — one installer, one management interface
HBAC rules define centrally who can SSH to which host groups — enforced by SSSD on every enrolled client, no per-machine config
Sudo rules from the directory replace scattered /etc/sudoers.d/ files — central policy, SSSD-enforced locally
ipa hbactest lets you verify access rules before a user hits a blocked login — use it before every policy change
For Linux-centric environments: FreeIPA. For Windows-dominant environments: AD. For mixed: FreeIPA trust with AD.

What’s Next

FreeIPA is the Linux answer to enterprise identity. EP09 covers the Microsoft answer — Active Directory — which extended LDAP and Kerberos into a complete enterprise platform with Group Policy, Sites, and a replication model built for global scale.

Next: How Active Directory Works: LDAP, Kerberos, and Group Policy Under the Hood

Get EP09 in your inbox when it publishes → linuxcent.com/subscribe

LDAP High Availability: Load Balancing and Production Architecture

May 10, 2026May 6, 2026 by Vamshi Krishna Santhapuri

Reading Time: 6 minutes

The Identity Stack, Episode 7
EP06: OpenLDAP → EP07 → EP08: FreeIPA → …

TL;DR

LDAP HA means multiple directory servers behind a load balancer — clients connect to a VIP, not to individual servers
Read/write split: all writes go to the provider, reads are distributed across consumers — the load balancer enforces this by routing on port or backend check
SSSD handles multi-server failover natively (ldap_uri accepts a comma-separated list) — for apps without built-in failover, HAProxy with health checks does the work
Connection pooling is critical at scale — nss_ldap and pam_ldap opened a new connection per login; SSSD maintains a pool; apps that use libldap directly must implement their own
cn=monitor is the built-in monitoring endpoint — exposes connection counts, operation rates, and backend stats readable via ldapsearch
389-DS (Red Hat Directory Server) is the production choice for >1M entries — purpose-built for large directories with a dedicated replication engine

The Big Picture: Production LDAP Topology

         Clients (SSSD, apps, VPN concentrators)
                      │
              ┌───────▼───────┐
              │   HAProxy VIP  │   ← single endpoint, port 389/636
              │  10.0.0.10     │
              └───────┬───────┘
                      │
          ┌───────────┼───────────┐
          ▼           ▼           ▼
   ldap1.corp.com  ldap2.corp.com  ldap3.corp.com
   (Provider)      (Consumer)      (Consumer)
   Reads + Writes  Reads only      Reads only
          │           ▲               ▲
          └───────────┴───────────────┘
               SyncRepl replication

EP06 built a two-node replicated directory. This episode covers what happens when the directory becomes infrastructure — when it needs to survive a node failure, handle thousands of connections, and be monitored like any other critical service.

HAProxy for LDAP

HAProxy is the standard choice for LDAP load balancing. Unlike HTTP, LDAP is a stateful protocol — once a client binds, subsequent operations on that connection share the authenticated session. The load balancer must use connection persistence, not per-request routing.

# /etc/haproxy/haproxy.cfg

global
    log /dev/log local0
    maxconn 50000

defaults
    mode tcp                  # LDAP is TCP, not HTTP
    timeout connect 5s
    timeout client  30s
    timeout server  30s
    option tcplog

# ── LDAP read/write split ─────────────────────────────────────────────

# Writes → provider only
frontend ldap-write
    bind *:389
    default_backend ldap-provider

backend ldap-provider
    balance first                   # always use first available (provider)
    option tcp-check
    tcp-check connect
    server ldap1 ldap1.corp.com:389 check inter 5s rise 2 fall 3
    server ldap2 ldap2.corp.com:389 check inter 5s rise 2 fall 3 backup

# Reads → all nodes round-robin
frontend ldap-read
    bind *:3389                     # internal read port
    default_backend ldap-consumers

backend ldap-consumers
    balance roundrobin
    option tcp-check
    tcp-check connect
    server ldap1 ldap1.corp.com:389 check inter 5s
    server ldap2 ldap2.corp.com:389 check inter 5s
    server ldap3 ldap3.corp.com:389 check inter 5s

# LDAPS (TLS)
frontend ldaps
    bind *:636
    default_backend ldap-consumers-tls

backend ldap-consumers-tls
    balance roundrobin
    server ldap1 ldap1.corp.com:636 check inter 5s ssl verify required ca-file /etc/ssl/certs/ca.pem
    server ldap2 ldap2.corp.com:636 check inter 5s ssl verify required ca-file /etc/ssl/certs/ca.pem

The health check (tcp-check connect) just verifies TCP connectivity. For a more precise check — verifying that slapd is actually responding to LDAP requests — use a custom script that runs ldapsearch and checks the result code.

SSSD Multi-Server Failover

SSSD has native failover — no load balancer required for SSSD-based clients:

# /etc/sssd/sssd.conf
[domain/corp.com]
ldap_uri = ldap://ldap1.corp.com, ldap://ldap2.corp.com, ldap://ldap3.corp.com
# SSSD tries them in order; switches to next on failure
# Switches back to primary after ldap_recovery_interval (default: 30s)

# For AD, discovery via DNS SRV records is even better:
ad_server = _srv_
# SSSD queries _ldap._tcp.corp.com SRV records and gets all DCs automatically

SSSD monitors the connection health. If the current server becomes unreachable, it switches to the next in the list within seconds. Existing cached data keeps serving during the switchover. Clients using SSSD don’t need a load balancer for basic HA.

Connection Pooling

Every LDAP bind creates an authenticated session on the server. A server with connection limits (olcConnMaxPending, olcConnMaxPendingAuth in OLC) will reject new connections when those limits are hit.

The problem: applications that use libldap directly tend to open a new connection per operation. At 500 requests/second, that’s 500 new TCP connections, 500 binds, 500 TLS handshakes per second — a directory that can handle 5000 concurrent connections starts refusing new ones.

The solutions:

SSSD — handles this automatically. SSSD maintains one or a small number of persistent connections per domain and multiplexes all PAM/NSS queries through them.

Application-level pooling — frameworks like python-ldap with connection pooling, ldap3 with connection strategies, or dedicated middleware like 389-DS‘s Directory Proxy Server.

ldap_maxconnections in OpenLDAP — sets a hard limit. When hit, new connections block until existing ones close. Set this to something reasonable (olcConnMaxPending: 100 in OLC) so you get a controlled failure mode instead of unbounded queuing.

Monitoring with cn=monitor

OpenLDAP exposes live operational statistics via the cn=monitor database — a virtual LDAP subtree that reflects the server’s current state. Enable it:

# enable-monitor.ldif
dn: cn=module,cn=config
objectClass: olcModuleList
cn: module
olcModulePath: /usr/lib/ldap
olcModuleLoad: back_monitor

dn: olcDatabase=monitor,cn=config
objectClass: olcDatabaseConfig
olcDatabase: monitor
olcAccess: to *
  by dn="cn=admin,dc=corp,dc=com" read
  by * none

Query it:

# Overall statistics
ldapsearch -x -H ldap://localhost \
  -D "cn=admin,dc=corp,dc=com" -w password \
  -b "cn=monitor" -s sub "(objectClass=*)" \
  monitorOpInitiated monitorOpCompleted

# Connection counts
ldapsearch -x -H ldap://localhost \
  -D "cn=admin,dc=corp,dc=com" -w password \
  -b "cn=Connections,cn=monitor" -s one \
  monitorConnectionNumber

# Operations by type
ldapsearch -x -H ldap://localhost \
  -D "cn=admin,dc=corp,dc=com" -w password \
  -b "cn=Operations,cn=monitor" -s one \
  monitorOpInitiated monitorOpCompleted

Useful metrics to export to Prometheus (via prometheus-openldap-exporter or similar):
– monitorOpCompleted per operation type (bind, search, modify)
– monitorConnectionNumber — current connection count
– Backend-specific: olmMDBEntries, olmMDBPagesMax, olmMDBPagesUsed

389-DS: LDAP at Scale

OpenLDAP is excellent for directories up to a few million entries. When you need:
– 10M+ entries
– High write throughput (more than a few hundred writes/second)
– Fine-grained replication filtering
– A dedicated web-based admin UI

…389-DS (Red Hat Directory Server, community edition) is the production answer. It’s what FreeIPA uses under the hood.

Key architectural differences from OpenLDAP:

Multi-supplier replication — 389-DS’s replication engine uses a dedicated changelog (stored in LMDB) and Change Sequence Numbers (CSNs) for conflict resolution. Multi-supplier (multi-master) replication is first-class, not a bolted-on feature.

Changelog — every change is written to a persistent changelog before being applied. This enables precise replication: a consumer can reconnect after a network partition and get exactly the changes it missed, rather than doing a full resync.

Plugin architecture — 389-DS functionality (replication, managed entries, DNA for automatic UID allocation, memberOf, password policy) is all implemented as plugins that can be enabled/disabled per directory instance.

# Install 389-DS
dnf install -y 389-ds-base

# Create a new instance
dscreate interactive
# — or use a template:
dscreate from-file /path/to/instance.inf

# Manage with dsctl
dsctl slapd-corp status
dsctl slapd-corp start
dsctl slapd-corp stop

# Admin with dsconf
dsconf slapd-corp backend suffix list
dsconf slapd-corp replication status -suffix "dc=corp,dc=com"

The dsconf replication status command gives a live view of replication lag across all suppliers and consumers — something OpenLDAP requires you to compute manually from contextCSN comparisons.

Global Catalog: Cross-Domain Search in AD

When your directory spans multiple AD domains in a forest, the Global Catalog solves a specific problem: a user in emea.corp.com needs to be found by an app that only knows corp.com.

Forest: corp.com
  ├── corp.com       → DC port 389    full directory: 500K entries
  ├── emea.corp.com  → DC port 389    full directory: 200K entries
  └── Global Catalog → GC port 3268  partial replica: 700K entries
                                       (not all attributes — just the most queried ones)

The GC replicates a subset of attributes from every domain in the forest. By default: cn, mail, sAMAccountName, userPrincipalName, memberOf, and about 150 others. Attributes marked with isMemberOfPartialAttributeSet in the schema are replicated to the GC.

If an application is configured to use port 3268 instead of 389, it’s using the GC — and it won’t see attributes not included in the partial attribute set. This surprises teams that add a custom attribute to AD and then wonder why their application can’t see it on 3268 but can on 389.

⚠ Production Gotchas

HAProxy TCP health checks don’t verify LDAP is responsive. A server can accept TCP connections but have slapd in a degraded state (database corruption, out-of-memory). Build a proper LDAP health check: a script that binds and searches a known entry and checks the result.

replication lag under write load. SyncRepl consumers can fall behind under sustained write load. Monitor the contextCSN difference between provider and consumers. If consumers are more than a few seconds behind, investigate the provider’s write throughput and the consumer’s processing speed.

Directory size and the MDB mapsize. LMDB requires a pre-configured maximum database size (olcDbMaxSize). If the database grows beyond this, slapd starts failing writes. Set it to 2–4x your expected data size and monitor olmMDBPagesUsed / olmMDBPagesMax.

Key Takeaways

HAProxy in TCP mode provides LDAP load balancing — use balance first for write routing (provider only), balance roundrobin for reads
SSSD has native failover via ldap_uri — for SSSD clients, a load balancer adds HA but isn’t strictly required
cn=monitor is the built-in OpenLDAP monitoring endpoint — export its counters to Prometheus for operational visibility
389-DS is the right choice for >1M entries, high write throughput, or multi-supplier replication as a first-class feature
Global Catalog (port 3268/3269) is a partial replica of all AD domains — useful for forest-wide searches, but missing non-replicated attributes

What’s Next

EP07 covers the infrastructure layer. EP08 zooms out to FreeIPA — what you get when LDAP, Kerberos, DNS, PKI, and HBAC are integrated into a single Linux-native identity stack, and why most Linux shops running their own directory should be running FreeIPA instead of bare OpenLDAP.

Next: FreeIPA: LDAP + Kerberos + PKI in a Single Linux Identity Stack

Get EP08 in your inbox when it publishes → linuxcent.com/subscribe

GCP Secure Boot Certificate Expiration 2026: What You Must Do Before June 24

May 7, 2026May 6, 2026 by Vamshi Krishna Santhapuri

Reading Time: 10 minutes

TL;DR

Three Microsoft UEFI Secure Boot certificates expire between June 24 and October 19, 2026
Any GCP Compute Engine instance with Secure Boot enabled, created before November 7, 2025, carries the old certs and is at risk
When the certs expire, instances may fail to boot after OS updates that pull in bootloaders signed only by the replacement 2023 certificates
GKE Shielded Nodes are affected too — node pools whose nodes haven’t been recreated since November 7, 2025 carry the old UEFI database
vTPM-sealed secrets, BitLocker, and Linux full disk encryption break if Secure Boot fails mid-update
Primary fix: recreate affected instances (post-Nov 7, 2025 instances include the updated UEFI DB automatically)
Emergency workaround if boot fails: temporarily disable Secure Boot, apply updates, re-enable

The Big Picture: The UEFI Secure Boot Trust Chain

  UEFI Firmware (PK — Platform Key, set by OEM/Google)
         │
         │  PK signs KEK updates
         ▼
  ┌─────────────────────────────────────────────┐
  │        KEK (Key Exchange Key Database)       │
  │  Microsoft Corporation KEK CA 2011           │ ← EXPIRING Jun 24, 2026
  │  Microsoft Corporation KEK CA 2023           │ ← Replacement (new VMs only)
  └────────────────────┬────────────────────────┘
                       │  KEK authorizes DB/DBX updates
                       ▼
  ┌─────────────────────────────────────────────┐
  │         DB (Authorized Signature Database)   │
  │  Microsoft UEFI CA 2011 ← signs Linux Shim  │ ← EXPIRING Jun 27, 2026
  │  Microsoft Windows PCA 2011 ← signs WinBoot │ ← EXPIRING Oct 19, 2026
  │  Microsoft UEFI CA 2023 ← replacement       │ ← Present on post-Nov 7 VMs
  │  Microsoft Windows PCA 2023 ← replacement   │ ← Present on post-Nov 7 VMs
  └────────┬───────────────────────┬────────────┘
           │                       │
           ▼                       ▼
   Linux Shim (shim.efi)    Windows Boot Manager
           │
           ▼
       GRUB2 / systemd-boot
           │
           ▼
       Linux Kernel

GCP Compute Engine instances with Secure Boot enabled — created before November 7, 2025 — have a UEFI signature database that includes the 2011 certificates but not the 2023 replacements. When those 2011 certificates expire, new bootloader binaries (signed exclusively by the 2023 certs) will be rejected at boot time.

What Secure Boot Actually Does — and Why Certificate Expiry Breaks Booting

Secure Boot is UEFI’s mechanism for ensuring that only cryptographically signed, trusted software runs during the boot sequence. The trust chain works like this:

Platform Key (PK): Root of trust, set by the hardware manufacturer or cloud provider. Authorizes updates to the KEK.
Key Exchange Key (KEK): Authorizes modifications to the DB and DBX (the forbidden signatures database). Microsoft holds one KEK slot; OEMs often hold another.
DB (Signature Database): Contains the public certificates used to verify bootloaders. If a bootloader binary is signed by a cert in DB, it’s allowed to run. If not, the firmware halts.
DBX (Forbidden Signatures Database): Revocation list. Bootloaders explicitly listed here are blocked even if they were once trusted.

Where expiry matters: The DB certificates don’t “enforce” anything at runtime by checking dates themselves — UEFI doesn’t do certificate revocation in real time. The problem is different and more insidious: as Linux distributions and Microsoft ship updated bootloaders, those new binaries are signed only by the 2023 replacement certificates, not the expiring 2011 ones. If your VM’s DB doesn’t contain the 2023 certs, the UEFI firmware will reject the new shim, and the system won’t boot after an OS update that upgrades the bootloader package.

On Debian/Ubuntu, shim-signed upgrades. On RHEL/CentOS Stream, shim-x64 upgrades. Either way: new binary, new signature, old DB — boot failure.

The Three Certificates Expiring in 2026

1. Microsoft Corporation KEK CA 2011 — expires June 24, 2026

Role: Authorizes updates to the DB and DBX signature databases.

When the KEK expires, firmware that enforces KEK validity may refuse to accept DB/DBX updates signed by this certificate. This means even if Google pushes an out-of-band UEFI DB update containing the 2023 certs, instances with an expired-only KEK slot may not be able to apply it cleanly.

Replacement: Microsoft Corporation KEK CA 2023

2. Microsoft Corporation UEFI CA 2011 — expires June 27, 2026

Role: Signs third-party bootloaders — specifically the Linux Shim (shim.efi).

This is the most critical cert for Linux workloads. Every major Linux distribution uses a shim bootloader as the first-stage loader in a Secure Boot chain. The shim is signed by Microsoft’s UEFI CA because Linux vendors submit their shim builds to Microsoft for signing (to ensure broad UEFI compatibility). When new shim packages are released signed only by UEFI CA 2023, any VM with only the 2011 cert in its DB will reject them.

Replacement: Microsoft UEFI CA 2023

3. Microsoft Windows Production PCA 2011 — expires October 19, 2026

Role: Signs Windows Boot Manager and other Windows boot components.

Windows instances on GCP using Secure Boot are affected by this cert. Post-expiry Windows OS updates that ship a new Boot Manager binary signed exclusively by the 2023 PCA will fail to boot on instances carrying only the 2011 cert.

Replacement: Microsoft Windows Production PCA 2023

Windows-specific signal: Event ID 1801 in the Windows System event log — “Secure Boot CA/keys need to be updated” — will appear by mid-2026 on affected instances, before actual boot failure. This is your warning window.

Why GCP Instances Are Specifically Affected

Google’s Compute Engine Shielded VMs ship with a pre-populated UEFI variable database. The content of that database is fixed at instance creation time — it’s part of the VM’s UEFI firmware image. Instances created before November 7, 2025 have a DB that contains the 2011 certs but not the 2023 replacements. Instances created on or after November 7, 2025 had the updated database backfilled.

This is not a Google-specific failure. Every cloud provider and on-premises hypervisor platform that uses Secure Boot with a pre-populated UEFI DB has the same problem. GCP is ahead of many platforms in actually documenting it.

GKE’s Shielded Nodes feature enables Secure Boot on node pool VMs. Each node is a Compute Engine instance — and all the same rules apply.

The risk: Node pools whose nodes were last provisioned before November 7, 2025 carry the old UEFI database. When containerd, the OS image, or the kernel gets updated via node auto-upgrade or manual node pool upgrade, the new node VMs will carry updated certs. But nodes that haven’t been replaced since before the cutoff are sitting on the old DB.

GKE auto-upgrade helps — but only if it’s actually running and has completed at least one full node replacement cycle since November 7, 2025.

Node pools with auto-upgrade disabled, or clusters in maintenance windows that delayed upgrades, are at risk.

The trigger scenario:
1. GKE runs a node OS update in-place on an old node (not a full node replacement)
2. The update upgrades the shim package to a version signed only by UEFI CA 2023
3. Next reboot: the node fails to boot
4. The node is marked NotReady, workloads are rescheduled — but the underlying VM is stuck

Detecting Affected Resources

Compute Engine Instances

gcloud compute instances list \
  --filter="creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true" \
  --format="table(name,zone,creationTimestamp,shieldedInstanceConfig.enableSecureBoot,status)"

Sample output:

NAME               ZONE           CREATION_TIMESTAMP        ENABLE_SECURE_BOOT  STATUS
prod-api-01        us-central1-a  2024-08-15T10:22:00Z      True                RUNNING   ← at risk
prod-db-02         us-central1-b  2023-11-01T08:15:00Z      True                RUNNING   ← at risk
prod-web-03        us-central1-a  2025-12-01T14:30:00Z      True                RUNNING   ← safe (post-Nov 7)

GKE Node Pools

# List node pools with Secure Boot enabled per cluster
gcloud container clusters list --format="value(name,location)" | while read NAME LOCATION; do
  echo "=== Cluster: $NAME ($LOCATION) ==="
  gcloud container node-pools list \
    --cluster="$NAME" \
    --location="$LOCATION" \
    --filter="config.shieldedInstanceConfig.enableSecureBoot=true" \
    --format="table(name,config.shieldedInstanceConfig.enableSecureBoot,management.autoUpgrade)"
done

Then verify node creation timestamps within affected pools:

gcloud compute instances list \
  --filter="labels.goog-gke-node:* AND creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true" \
  --format="table(name,zone,creationTimestamp,labels.goog-gke-node)"

Checking the UEFI DB on a Running Instance

SSH into an affected instance and verify which certs are in the DB:

# On the instance (requires mokutil and/or efitools)
sudo mokutil --db | grep -A3 "Subject:"

Look for CN=Microsoft UEFI CA 2023 in the output. Its absence means your instance has only the 2011 certs.

On GKE nodes (where you have node shell access via a DaemonSet or node debug pod):

# Using kubectl debug for node access
kubectl debug node/NODE_NAME -it --image=ubuntu -- bash
# Then inside the debug pod:
chroot /host
mokutil --db 2>/dev/null | grep "Microsoft.*2023" || echo "2023 cert NOT present — node at risk"

Solutions

Option 1: Recreate Instances (Primary — Recommended by Google)

Instances created after November 7, 2025 automatically receive the updated UEFI certificate database. The simplest fix is to recreate affected instances.

For Compute Engine:

# Step 1: Create a machine image (snapshot) of the existing instance
gcloud compute machine-images create INSTANCE_NAME-backup \
  --source-instance=INSTANCE_NAME \
  --source-instance-zone=ZONE

# Step 2: Delete the old instance (after verifying backup)
gcloud compute instances delete INSTANCE_NAME --zone=ZONE

# Step 3: Create new instance from machine image
gcloud compute instances create INSTANCE_NAME \
  --source-machine-image=INSTANCE_NAME-backup \
  --zone=ZONE \
  --shielded-secure-boot \
  --shielded-vtpm \
  --shielded-integrity-monitoring

The new instance will have the post-November 7, 2025 UEFI DB.

For GKE Node Pools:

# Option A: Upgrade the node pool (triggers node recreation)
gcloud container clusters upgrade CLUSTER_NAME \
  --location=LOCATION \
  --node-pool=NODE_POOL_NAME

# Option B: Recreate the node pool entirely
gcloud container node-pools create NODE_POOL_NAME-new \
  --cluster=CLUSTER_NAME \
  --location=LOCATION \
  --shielded-secure-boot \
  --shielded-integrity-monitoring \
  [... your existing pool config ...]

# Then cordon and drain the old pool nodes
kubectl cordon NODE_NAME
kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data

# Finally delete the old node pool
gcloud container node-pools delete NODE_POOL_NAME \
  --cluster=CLUSTER_NAME \
  --location=LOCATION

Option 2: Disable Secure Boot Temporarily (Emergency Workaround)

If an instance has already failed to boot after an OS update, or if you need to apply bootloader updates before recreating the instance:

# Disable Secure Boot on the stopped instance
gcloud compute instances update INSTANCE_NAME \
  --zone=ZONE \
  --no-shielded-secure-boot

# Start the instance
gcloud compute instances start INSTANCE_NAME --zone=ZONE

# SSH in, apply OS updates and any pending bootloader upgrades
# (The system will boot without Secure Boot enforcement)
sudo apt-get update && sudo apt-get upgrade -y   # Debian/Ubuntu
# or
sudo dnf update -y                                # RHEL/CentOS

# Stop the instance again
gcloud compute instances stop INSTANCE_NAME --zone=ZONE

# Re-enable Secure Boot
gcloud compute instances update INSTANCE_NAME \
  --zone=ZONE \
  --shielded-secure-boot

# Start again — now boots with new bootloader binaries
gcloud compute instances start INSTANCE_NAME --zone=ZONE

Note: This workaround doesn’t add the 2023 certs to the DB. It bypasses Secure Boot enforcement temporarily. The underlying UEFI DB still only has the 2011 certs. You still need to recreate the instance to get the updated DB — this is only a bridge to keep the instance alive while you plan migration.

Option 3: Restore from Machine Image

If an instance is already in a boot failure state and the workaround above doesn’t apply:

# List available machine images
gcloud compute machine-images list

# Restore from a pre-failure machine image
gcloud compute instances create INSTANCE_NAME-restored \
  --source-machine-image=MACHINE_IMAGE_NAME \
  --zone=ZONE

Then immediately plan recreation on a post-November 7, 2025 instance.

vTPM, BitLocker, and Full Disk Encryption: The Hidden Risk

For VMs using Shielded VM features beyond just Secure Boot — specifically vTPM with sealed secrets — certificate expiry creates a more dangerous failure mode.

How vTPM sealing works:

  Boot sequence measurements → PCR registers (PCR 0–7 for UEFI, PCR 8–15 for OS)
         │
         ▼
  TPM seals secrets (FDE key, BitLocker key) to specific PCR values
         │
         ▼
  On next boot: PCR values must match for TPM to release the key
         │
         ▼
  If Secure Boot state changes (cert DB changes, Secure Boot disabled) →
  PCR values change → TPM refuses to unseal → FDE fails → disk inaccessible

What this means in practice:

Linux FDE (LUKS with TPM2 unsealing): If Secure Boot fails or is temporarily disabled per the workaround above, the TPM will not release the LUKS volume key. The system will drop to a recovery prompt. You need the LUKS recovery passphrase.
Windows BitLocker: If PCR values shift (Secure Boot disabled, cert DB changed), BitLocker enters recovery mode. The VM prompts for the BitLocker recovery key on next boot. Without it, the volume is inaccessible.
Windows Virtual Secure Mode: VSM uses vTPM to protect credentials. If Secure Boot state changes, VSM-protected secrets become inaccessible until re-enrollment.

Action before any changes:

# For Linux: ensure you have the LUKS recovery key
sudo cryptsetup luksDump /dev/sda3 | grep "Key Slot"

# For Windows: export BitLocker recovery key before touching Secure Boot state
# (Do this from within the running Windows instance via PowerShell)
Get-BitLockerVolume | Select-Object -ExpandProperty KeyProtector | Where-Object {$_.KeyProtectorType -eq "RecoveryPassword"}

Store recovery keys in Secret Manager, not just locally:

# Store LUKS key in GCP Secret Manager
echo -n "YOUR_RECOVERY_KEY" | gcloud secrets create luks-recovery-INSTANCE_NAME \
  --data-file=- \
  --replication-policy=automatic

⚠ Production Gotchas

1. OS update automation is the trigger, not the cert expiry date itself.
The certs don’t enforce anything at runtime. The actual failure happens when an unattended-upgrade, yum-cron, or GKE node OS update pulls in a new shim/Boot Manager binary signed only by the 2023 cert. Instances may fail to boot weeks or months before the official cert expiry date if distros ship updated bootloaders early.

2. GKE surge upgrades can mask the problem — temporarily.
During a node pool upgrade, GKE creates new nodes (with updated certs) before draining old ones. Workloads move to new nodes. The old nodes get deleted. This looks fine — until you realize some in-place operations (node taints, label changes, manual kubelet restarts) could force old nodes to reboot without triggering node replacement.

3. Disabling Secure Boot changes vTPM PCR values — plan FDE recovery before touching anything.
The temporary workaround (disable Secure Boot) will invalidate TPM-bound disk encryption. Have recovery keys ready before running --no-shielded-secure-boot.

4. Windows Event ID 1801 is an early warning — act on it.
If you see this event in your Windows Compute Engine instances before June 2026, that instance has already identified itself as carrying the old certs. Use it as your automated detection signal in Cloud Logging.

# Query Cloud Logging for Event ID 1801 across Windows instances
gcloud logging read 'resource.type="gce_instance" AND jsonPayload.EventID=1801' \
  --format="table(resource.labels.instance_id,timestamp,jsonPayload.Message)" \
  --limit=50

5. Instance templates propagate the old DB.
If you use instance templates or managed instance groups (MIGs) to create VMs, and those templates were created before November 7, 2025, new instances created from them may or may not inherit updated certs depending on how the template configures the UEFI DB. Verify by checking creation timestamp of the resulting instance, not the template.

6. Custom OS images don’t fix this.
Importing a custom image or using a custom OS does not update the UEFI certificate database. The DB is part of the VM’s UEFI firmware state, not the OS disk image. Recreating the instance is the only reliable path.

Quick Reference: Commands

Task	Command
List affected Compute Engine VMs	`gcloud compute instances list --filter="creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true"`
Check UEFI DB on a Linux VM	`sudo mokutil --db \\| grep -E "Subject\\|Not After"`
Check for 2023 cert presence	`mokutil --db 2>/dev/null \\| grep "Microsoft.*2023" \\|\\| echo "2023 cert absent"`
Disable Secure Boot (emergency)	`gcloud compute instances update INSTANCE --zone=ZONE --no-shielded-secure-boot`
Re-enable Secure Boot	`gcloud compute instances update INSTANCE --zone=ZONE --shielded-secure-boot`
Find affected GKE nodes	`gcloud compute instances list --filter="labels.goog-gke-node:* AND creationTimestamp < '2025-11-07' AND shieldedInstanceConfig.enableSecureBoot=true"`
Trigger GKE node pool upgrade	`gcloud container clusters upgrade CLUSTER --location=LOCATION --node-pool=POOL`
Store LUKS key in Secret Manager	`echo -n "KEY" \\| gcloud secrets create NAME --data-file=-`
Query Windows Event 1801 in Logging	`gcloud logging read 'resource.type="gce_instance" AND jsonPayload.EventID=1801'`
Create machine image backup	`gcloud compute machine-images create BACKUP --source-instance=INSTANCE --source-instance-zone=ZONE`

Framework Alignment

Framework	Domain	Relevance
CISSP	Domain 7: Security Operations	Patch management, boot integrity, incident response
CISSP	Domain 3: Security Architecture	Secure Boot trust chain, TPM integration, cryptographic key lifecycle
NIST CSF 2.0	ID.AM, PR.IP	Asset inventory of affected VMs; integrity protection of boot chain
CIS Benchmarks	CIS Google Cloud Computing Foundations	Shielded VM controls, vTPM configuration
OWASP Top 10	A05: Security Misconfiguration	Failure to maintain certificate currency in security-critical infrastructure

Key Takeaways

The expiry of three Microsoft UEFI CA certificates in 2026 creates a window where GCP VMs with Secure Boot enabled — created before November 7, 2025 — will fail to boot after pulling in new bootloader packages
The failure is not instantaneous on the cert expiry date. It’s triggered by the next OS update that ships a bootloader signed exclusively by the 2023 replacement certs
GKE Shielded Nodes are affected through the same mechanism: node VMs that haven’t been recreated since November 7, 2025 carry the old UEFI database
vTPM-sealed secrets (FDE, BitLocker, VSM) add a secondary failure mode if Secure Boot state is changed as part of remediation — have recovery keys before touching anything
Google’s recommended fix is instance recreation. The workaround (disable Secure Boot temporarily) keeps instances alive but doesn’t fix the underlying DB — treat it as a bridge, not a resolution
Audit now, before June 24. The command is one line. The blast radius of missing this is a production boot failure at 2 AM after a routine security patch run

What’s Next

If you’re running Shielded VMs in production, this certificate expiry is the kind of quiet deadline that fails silently — not with an alarm, but with a VM that doesn’t come back after a patch cycle. The time to audit is before your automated patching runs, not after.

If you found this useful, the linuxcent.com newsletter covers infrastructure security at this depth regularly — kernel internals, cloud platform gotchas, and the operational implications that vendor docs bury in footnotes.

Get the next deep-dive in your inbox when it publishes → [subscribe link]

OpenLDAP Setup and Replication: Running Your Own Directory

May 10, 2026May 5, 2026 by Vamshi Krishna Santhapuri

Reading Time: 5 minutes

The Identity Stack, Episode 6
EP01 → … → EP05: Kerberos → EP06 → EP07: LDAP HA → …

TL;DR

OpenLDAP’s server process is slapd — the backend that stores data is MDB (LMDB), a memory-mapped B-tree that replaced the old Berkeley DB backend
Configuration lives in the directory itself: cn=config (OLC — Online Configuration) lets you modify slapd at runtime without restarting
SyncRepl is the replication protocol: a consumer subscribes to a provider and stays in sync via either polling (refreshOnly) or a persistent connection (refreshAndPersist)
Multi-Provider (formerly Multi-Master) lets multiple nodes accept writes — conflict resolution uses CSN (Change Sequence Number), last-writer-wins
The essential tools: slapd, ldapadd, ldapmodify, ldapsearch, slapcat, slaptest
Always build indexes on the attributes you search most — uid, cn, memberOf — or every search is a full scan

The Big Picture: slapd Architecture

ldapsearch / ldapadd / SSSD / any LDAP client
              │ TCP 389 / 636
              ▼
         ┌─────────────────────────────────┐
         │  slapd (OpenLDAP server)         │
         │                                 │
         │  Frontend (protocol layer)       │
         │    • parse BER requests          │
         │    • ACL enforcement             │
         │    • schema validation           │
         │                                 │
         │  Backend (storage layer)         │
         │    • MDB (LMDB) — default       │
         │    • memory-mapped file I/O      │
         │    • ACID transactions           │
         └────────────┬────────────────────┘
                      │
              /var/lib/ldap/
              data.mdb   (the directory data)
              lock.mdb   (LMDB lock file)

EP05 showed Kerberos in isolation. OpenLDAP is where you run the identity store that Kerberos references — and where SSSD looks up user and group attributes. This episode builds a working two-node replicated directory from scratch.

Installation

# Ubuntu / Debian
apt-get install -y slapd ldap-utils

# RHEL / Rocky / AlmaLinux
dnf install -y openldap-servers openldap-clients

# After install — Ubuntu runs a configuration wizard
# Skip it: dpkg-reconfigure slapd
# Or answer it and then switch to OLC management

On RHEL-family systems, slapd is not configured after install — you work entirely through OLC from the start.

OLC: The Directory Configures Itself

The old way was slapd.conf — a static file that required a full restart on every change. OLC (Online Configuration) replaced it: slapd‘s own configuration is stored as LDAP entries under cn=config. You modify configuration the same way you modify data — with ldapmodify. Changes take effect immediately.

cn=config                        ← root config entry
├── cn=schema,cn=config          ← schema definitions
│     ├── cn={0}core             ← core schema
│     ├── cn={1}cosine           ← RFC 1274 attributes
│     └── cn={2}inetorgperson    ← inetOrgPerson object class
├── olcDatabase={-1}frontend     ← default settings for all databases
├── olcDatabase={0}config        ← the config database itself
└── olcDatabase={1}mdb           ← your actual directory data
      ├── olcAccess              ← ACLs
      ├── olcSuffix              ← base DN (e.g., dc=corp,dc=com)
      └── olcDbIndex             ← search indexes

Everything under cn=config has attributes prefixed with olc (OpenLDAP Configuration). You query and modify it just like any other LDAP subtree — with one restriction: only the cn=config admin (usually gidNumber=0+uidNumber=0,cn=peercred,cn=external,cn=auth — the local root via SASL EXTERNAL) can write to it.

Bootstrapping a Directory

The quickest way to get a working directory is a set of LDIF files applied in order.

1. Load schemas

# Apply the schemas OpenLDAP ships with
ldapadd -Y EXTERNAL -H ldapi:/// \
  -f /etc/ldap/schema/cosine.ldif
ldapadd -Y EXTERNAL -H ldapi:/// \
  -f /etc/ldap/schema/inetorgperson.ldif
ldapadd -Y EXTERNAL -H ldapi:/// \
  -f /etc/ldap/schema/nis.ldif       # adds posixAccount, posixGroup

2. Configure the MDB database

# mdb-config.ldif
dn: olcDatabase={1}mdb,cn=config
changetype: modify
replace: olcSuffix
olcSuffix: dc=corp,dc=com
-
replace: olcRootDN
olcRootDN: cn=admin,dc=corp,dc=com
-
replace: olcRootPW
olcRootPW: {SSHA}hashed_password_here

Generate the hash: slappasswd -s yourpassword

ldapmodify -Y EXTERNAL -H ldapi:/// -f mdb-config.ldif

3. Add indexes

# indexes.ldif
dn: olcDatabase={1}mdb,cn=config
changetype: modify
add: olcDbIndex
olcDbIndex: uid eq,pres
olcDbIndex: cn eq,sub
olcDbIndex: sn eq,sub
olcDbIndex: mail eq
olcDbIndex: memberOf eq
olcDbIndex: entryCSN eq
olcDbIndex: entryUUID eq

The last two (entryCSN, entryUUID) are required for SyncRepl replication to work efficiently.

4. Load initial data

# base.ldif
dn: dc=corp,dc=com
objectClass: top
objectClass: dcObject
objectClass: organization
o: Corp
dc: corp

dn: ou=people,dc=corp,dc=com
objectClass: organizationalUnit
ou: people

dn: ou=groups,dc=corp,dc=com
objectClass: organizationalUnit
ou: groups

dn: uid=vamshi,ou=people,dc=corp,dc=com
objectClass: inetOrgPerson
objectClass: posixAccount
objectClass: shadowAccount
cn: Vamshi Krishna
sn: Krishna
uid: vamshi
uidNumber: 1001
gidNumber: 1001
homeDirectory: /home/vamshi
loginShell: /bin/bash
mail: [email protected]
userPassword: {SSHA}hashed_password_here

ldapadd -x -H ldap://localhost \
  -D "cn=admin,dc=corp,dc=com" \
  -w adminpassword \
  -f base.ldif

ACLs: Who Can Read What

OpenLDAP ACLs are evaluated top-to-bottom; first match wins.

# acls.ldif — set via OLC
dn: olcDatabase={1}mdb,cn=config
changetype: modify
replace: olcAccess
# Users can change their own passwords
olcAccess: to attrs=userPassword
  by self write
  by anonymous auth
  by * none
# Users can read their own entry
olcAccess: to dn.base="ou=people,dc=corp,dc=com"
  by self read
  by users read
  by * none
# Service accounts can read everything (for SSSD)
olcAccess: to *
  by dn="cn=svc-ldap,ou=services,dc=corp,dc=com" read
  by self read
  by * none

A service account (cn=svc-ldap) that SSSD uses to search the directory needs read access to ou=people and ou=groups. Never give SSSD admin (write) access.

SyncRepl Replication

SyncRepl is a pull-based replication protocol built on the LDAP Sync operation (RFC 4533). A consumer connects to a provider and requests changes. The provider sends them. The consumer stays in sync.

On the Provider: Enable the syncprov overlay

# syncprov.ldif
dn: olcOverlay=syncprov,olcDatabase={1}mdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcSyncProvConfig
olcOverlay: syncprov
olcSpCheckpoint: 100 10     # checkpoint every 100 ops or 10 minutes
olcSpSessionLog: 100        # keep last 100 changes for delta-sync

ldapadd -Y EXTERNAL -H ldapi:/// -f syncprov.ldif

On the Consumer: Configure syncrepl

# consumer-config.ldif
dn: olcDatabase={1}mdb,cn=config
changetype: modify
add: olcSyncrepl
olcSyncrepl: rid=001
  provider=ldap://ldap1.corp.com:389
  bindmethod=simple
  binddn="cn=repl-svc,dc=corp,dc=com"
  credentials=replication-password
  searchbase="dc=corp,dc=com"
  scope=sub
  schemachecking=on
  type=refreshAndPersist    # persistent connection (vs refreshOnly = polling)
  retry="5 5 60 +"          # retry: 5 times every 5s, then every 60s forever
  interval=00:00:05:00      # (for refreshOnly) sync every 5 minutes
-
add: olcUpdateRef
olcUpdateRef: ldap://ldap1.corp.com   # redirect writes to provider

refreshAndPersist keeps a persistent connection open. Changes replicate within milliseconds. refreshOnly polls on an interval — simpler, but adds latency.

Verify Replication

# On provider: check the contextCSN (the sync state token)
ldapsearch -x -H ldap://ldap1.corp.com \
  -D "cn=admin,dc=corp,dc=com" -w password \
  -b "dc=corp,dc=com" -s base contextCSN
# contextCSN: 20260427010000.000000Z#000000#000#000000

# On consumer: should match after sync
ldapsearch -x -H ldap://ldap2.corp.com \
  -D "cn=admin,dc=corp,dc=com" -w password \
  -b "dc=corp,dc=com" -s base contextCSN
# Same CSN = in sync

Multi-Provider: Accepting Writes on Both Nodes

Standard SyncRepl has one provider and one or more consumers — only the provider accepts writes. Multi-Provider (formerly Multi-Master) lets every node accept writes.

# On each node — add mirrormode to the database config
dn: olcDatabase={1}mdb,cn=config
changetype: modify
add: olcMirrorMode
olcMirrorMode: TRUE

With mirrormode enabled and each node configured as both provider and consumer of the other, writes on either node replicate to the other. Conflict resolution is CSN-based (Change Sequence Number) — a monotonically increasing timestamp. Last write wins at the attribute level.

Multi-Provider does not prevent split-brain conflicts — if two clients write the same attribute on two different nodes during a network partition, the higher CSN wins when the partition heals. For most directory use cases (user passwords, group memberships), this is acceptable. For others, it requires careful thought.

⚠ Production Gotchas

MDB data file grows monotonically. LMDB never shrinks the data file automatically. Deleted entries leave free space inside the file that gets reused, but the file on disk doesn’t shrink. Use slapcat to export and slapadd to reimport if you need to reclaim disk space.

slapcat is the only safe backup. slapcat reads the MDB database directly and exports LDIF — it does not go through slapd. Run it while slapd is running (LMDB is MVCC-safe for readers), but never copy the raw MDB files while slapd is running.

Schema changes on a replicated directory require coordination. Load the new schema on the provider first. SyncRepl will propagate it to consumers — but if a consumer gets a new entry using the new schema before the schema itself is replicated, the import will fail. Load schemas manually on all nodes before adding entries that use them.

Key Takeaways

OpenLDAP uses LMDB (MDB backend) — a memory-mapped, ACID-compliant storage engine with no external dependency
OLC (cn=config) is the right way to configure slapd — changes apply without restarts
SyncRepl pulls changes from a provider to a consumer — refreshAndPersist for near-real-time, refreshOnly for poll-based
Always index uid, cn, entryCSN, and entryUUID — unindexed searches are full scans
Multi-Provider allows writes on all nodes with CSN-based last-write-wins conflict resolution

What’s Next

A single OpenLDAP server works. Two nodes with SyncRepl work better. EP07 goes further: how you put multiple LDAP servers behind a load balancer, how connection pooling works, what to monitor, and how 389-DS handles directories with tens of millions of entries.

Next: LDAP High Availability: Load Balancing and Production Architecture

Get EP07 in your inbox when it publishes → linuxcent.com/subscribe

OWASP Top 10 History: How the List Evolved from 2003 to 2025

May 4, 2026 by Vamshi Krishna Santhapuri

Reading Time: 8 minutes

series: OWASP LLM Top 10: From Web Roots to AI Frontiers
episode: 1 of 22
status: Draft
slug: /owasp-top-10-history-evolution/
focus_keyphrase: OWASP Top 10 history evolution
search_intent: Informational
meta_description: “OWASP Top 10 history: how the list evolved from SQL injection in 2003 to LLM prompt injection in 2025 — and what stayed constant across every version.”
owasp_mapping: “Foundation episode — establishes the OWASP organization, methodology, and six-version evolution before branching to the four lists that exist today (Web App, API, Cloud-Native, LLM).”

OWASP Top 10 History → The Four OWASP Lists → Why Classic OWASP Breaks for LLMs → OWASP LLM Top 10 2025

TL;DR

OWASP Top 10 history evolution spans six published versions from 2003 to 2021 — the category names change every cycle; the underlying failure classes do not
Injection, broken authentication, and access control have appeared in every single version under different names; they were exploited in 2003 and they are still the top breach vectors in 2025
The 2021 edition abstracted away from web-app-specific language into attack classes — which is what made OWASP applicable to cloud infrastructure, APIs, Kubernetes, and ultimately AI systems
OWASP is not a compliance standard; it is a community consensus on risk — but in 2025, the EU AI Act began directly citing the OWASP AI Exchange, which changes that calculus
Four distinct OWASP Top 10 lists exist today: Web App (2021), API Security (2023), Cloud-Native App Security, and LLM Applications (2025) — this series covers the last one, built on the foundation of the first

OWASP Mapping: Foundation episode. No single OWASP LLM category. This episode traces the lineage from OWASP Top 10 (2003) through all six web app versions to the four lists that exist in 2025. Every subsequent episode maps directly to one or more OWASP LLM Top 10 (2025) categories.

The Big Picture

OWASP TOP 10 EVOLUTION: 2003 → 2025

2003 ──▶ Web-era injection (SQL, XSS, parameter tampering)
          │  HTTP/1.0 apps. Databases directly exposed via
          │  dynamic SQL. Sessions via URL parameters.
          │
2007 ──▶ Session management + insecure comms elevated
          │  HTTPS adoption slow. Cookie theft common.
          │
2010 ──▶ Unvalidated redirects added. XSS re-ranked.
          │  The list reflects what's being actively exploited.
          │
2013 ──▶ CSRF dropped. Missing Function-Level Access added.
          │  First signs of API/microservice thinking.
          │
2017 ──▶ Risk-weighted ranking. CWE mappings. XXE added.
          │  Insecure Deserialization, Logging failures enter.
          │  The list becomes infrastructure-aware.
          │
2021 ──▶ Abstracted to attack classes. Insecure Design +
          │  SSRF added. Infrastructure/cloud applicability.
          │  ┌──────────────────────────────┐
          │  │ Now maps to cloud infra      │ ← Purple Team EP02
          │  │ Kubernetes, APIs, pipelines  │
          │  └──────────────────────────────┘
          │
          ├──▶ API Security Top 10 (2023)
          │     REST/GraphQL-specific risks
          │
          ├──▶ Cloud-Native App Security Top 10
          │     Containers, orchestration
          │
          └──▶ LLM Applications Top 10 (2023 v1 → 2025 v2)
                Prompt injection, model poisoning, RAG attacks
                ← THIS SERIES

OWASP Top 10 history is not a list of bugs. It is a snapshot of where the application surface was — and where attackers found the seams — taken every three to four years.

The 2003 Founding: What the Web Looked Like

The OWASP Foundation was established in 2001. The first Top 10 list shipped in 2003.

The web in 2003 looked nothing like it does now. Applications were monolithic. Databases were directly queried via dynamic SQL strings concatenated from user input. Authentication was session cookies stored in URL parameters. “Security” was a firewall at the network perimeter — if you were inside the network, you were trusted.

SQL injection was not a theoretical risk. It was how attackers exfiltrated data in bulk, every day, at scale. The same for XSS: inject JavaScript into a page, steal session cookies, impersonate users. These were not edge cases — they were the primary breach vectors because the web was built without any assumption that input was untrusted.

The OWASP founding premise: developers build these vulnerabilities not because they are negligent, but because the threat model was never taught. The Top 10 list was documentation, not enforcement — a shared vocabulary for what actually causes breaches.

Version-by-Version: What Changed and What Did Not

Year	Most Significant Addition	What Dropped / Changed	What It Reflects
2003	Unvalidated Input, SQL Injection, XSS, Command Injection	—	Dynamic SQL era; input treated as trusted
2007	CSRF, Insecure Comms, Improper Error Handling	Unvalidated Input consolidated	HTTPS adoption gap; session theft via network
2010	Unvalidated Redirects + Forwards	CSRF de-emphasized	Open redirectors weaponized for phishing
2013	CSRF dropped; Missing Function-Level Access	Insecure Storage removed	API-style thinking entering the list
2017	Insecure Deserialization, Logging + Monitoring Failures, XXE	Unvalidated Redirects dropped	Server-side attack complexity; blind spots in detection
2021	Insecure Design (new class), SSRF	XSS merged under Injection	Architecture-level risk; abstract attack classes introduced

The column that doesn’t change: Broken Access Control, Injection, and Authentication Failures have appeared in every version. The names shift (A01 becomes A07 becomes A01 again). The category descriptions evolve. The underlying failure — you can access things you shouldn’t, or execute code you shouldn’t, or authenticate as someone you’re not — never leaves the list.

This is the most important observation in the entire series: OWASP’s vocabulary modernizes; the failure classes are constants. When you see LLM01 Prompt Injection in the 2025 LLM list, you are looking at the same failure class as A03 Injection in the web app list. The attack surface changed. The category did not.

What the 2021 Abstraction Unlocked

The 2017 → 2021 transition was architecturally significant. Prior versions were implicitly scoped to HTTP requests against web applications. The 2021 list made a deliberate choice to describe attack classes rather than attack techniques.

“Injection” in 2021 means: untrusted data is sent to an interpreter and executed as code or commands. That definition covers SQL injection, LDAP injection, OS command injection — and, it turns out, natural language prompt injection in LLMs. The definition doesn’t care what the interpreter is.

“Broken Access Control” in 2021 means: a principal can act on a resource or perform an action it was not intended to. That covers misconfigured S3 buckets, Kubernetes RBAC gaps — and an LLM agent with tool access that hasn’t been scoped to least capability.

This abstraction is why OWASP became applicable to cloud infrastructure, APIs, containers, and AI. It’s also why the Purple Team series (specifically EP02) was able to map the entire 2021 list directly to cloud infrastructure attack paths — and why this series can map the same abstraction to LLM attack surfaces.

For the cloud infrastructure angle, see OWASP Top 10 mapped to cloud infrastructure. This series starts where that one ends: the attack surface that cloud infrastructure runs on is increasingly powered by language models.

The Four Lists That Exist Today

OWASP has expanded beyond the original web app list. Four Top 10 lists are actively maintained as of 2025:

OWASP Top 10 — Web Application Security Risks (2021)
The original. HTTP-layer attacks on server-rendered or API-backed apps. A01 Broken Access Control through A10 SSRF. Still the baseline for any web-facing application.

OWASP API Security Top 10 (2023)
REST and GraphQL-specific. Broken Object Level Authorization (BOLA/IDOR), excessive data exposure, mass assignment, unrestricted resource consumption. API attacks account for the majority of cloud breaches — this list exists because the web app list missed API-specific attack surfaces.

OWASP Cloud-Native Application Security Top 10
Kubernetes, containers, orchestration-layer risks: insecure workload configurations, misconfigured cloud storage, vulnerable container images, runtime compromise. The cloud-infra angle.

OWASP Top 10 for LLM Applications (2025)
The list this series is built on. Prompt injection, model poisoning, supply chain risks for model artifacts, RAG database attacks, autonomous agent over-permission. The attack surfaces that arrive when you embed a language model in your infrastructure.

The full comparison — which list applies to which part of your architecture, and how they overlap — is in the next episode.

Why AI Arrived at OWASP

The OWASP Top 10 for LLM Applications was not invented top-down. It came from practitioners who were deploying language models and cataloguing the breach patterns they were seeing.

The first version (v1.0) shipped in August 2023, driven by a working group that formed in May 2023 — roughly six months after ChatGPT created widespread LLM deployment. The timeline matters: security researchers were finding real vulnerabilities in production systems in real time, and the OWASP list was the community’s way of documenting the emerging threat model before it became a liability.

Version 2.0 shipped in November 2024. Two entirely new categories — System Prompt Leakage (LLM07) and Vector/Embedding Weaknesses (LLM08) — were added because RAG-based applications and agentic AI had become prevalent enough that their specific attack surfaces warranted dedicated treatment. Sensitive Information Disclosure moved from #6 to #2 because real breach data, not theory, showed it was the second most commonly exploited category.

The OWASP AI Exchange — a parallel OWASP project — went further. It produced a 300-page technical guide on AI security and privacy and contributed directly to the EU AI Act’s technical requirements. As of 2025, the EU AI Act for high-risk AI systems references risk assessment requirements that align directly with OWASP LLM Top 10 categories. OWASP is still not a compliance standard. But for AI systems in the EU, ignoring it is no longer a neutral choice.

⚠ Production Gotchas

“OWASP is a checklist you run once”
It’s a living document updated every 3–4 years based on actual breach data. The 2021 web app list is not the same document as the 2017 list. The 2025 LLM list has different categories than the 2023 v1 list. Running the 2017 checklist on a 2025 system is not OWASP compliance — it is a false sense of coverage.

“We are OWASP compliant”
OWASP is not a compliance standard. There is no OWASP certification, no OWASP audit, no OWASP controls framework. Organizations that say “we are OWASP compliant” mean they have reviewed the list and addressed the categories — that is a risk reduction exercise, not a regulatory state. The EU AI Act is a compliance standard. NIST AI RMF is a compliance framework. OWASP is the technical operationalization of both.

“The LLM Top 10 only matters if you’re building LLMs”
You don’t need to build LLMs for the list to apply. If you are deploying a chatbot powered by a third-party API, using an AI coding assistant that has access to your codebase, or running a RAG application that indexes internal documents — you are within scope of LLM01 through LLM10. The attack surface is the integration, not the model itself.

Quick Reference: OWASP Top 10 Versions

Year	Version	Key Additions	Key Removals	Architectural Context
2003	v1.0	Injection, Broken Auth, XSS, Insecure Config	—	Monolithic web apps, dynamic SQL
2007	v2.0	CSRF, Insecure Comms	Unvalidated Input → merged	HTTPS gap, session theft
2010	v3.0	Unvalidated Redirects	—	Phishing via redirectors
2013	v4.0	Missing Function-Level Access	CSRF moved to lower priority	API patterns emerging
2017	v5.0	XXE, Insecure Deserialization, Logging Failures	Unvalidated Redirects	Microservices, detection gaps
2021	v6.0	Insecure Design, SSRF	XSS merged into Injection	Attack class abstraction; cloud/AI applicability

Current parallel lists:

List	Last Updated	Primary Surface	Key Org
Web App Top 10	2021	HTTP/web apps	OWASP
API Security Top 10	2023	REST/GraphQL APIs	OWASP
Cloud-Native App Security Top 10	2022	K8s/containers	OWASP
LLM Applications Top 10	2025 (v2.0)	Language models/AI	OWASP GenAI

Framework Alignment

Framework	Relevant Function	Connection to OWASP History
NIST CSF 2.0	IDENTIFY (ID.RA)	OWASP is the community risk catalog that feeds asset risk assessments
ISO 27001:2022	A.8.8 (vulnerability management)	OWASP Top 10 is the standard reference for vulnerability class coverage
NIST AI RMF	MAP 1.5	Identify which risk categories from OWASP LLM Top 10 apply to specific system components
EU AI Act	Art. 9 (risk management system)	High-risk AI system risk assessments reference OWASP AI Exchange technical guidance

Key Takeaways

OWASP Top 10 history is the story of attack surfaces expanding — web to API to cloud to AI — with the same failure classes appearing at each layer
The 2021 abstraction to attack classes (not web-specific techniques) was the architectural decision that made OWASP applicable everywhere, including LLMs
Four lists exist today; real systems touch multiple lists simultaneously
The LLM Top 10 (v2.0, 2025) is not theoretical — it was built from documented production breach patterns, and v2.0 added new categories because RAG and agentic AI created new attack surfaces fast enough to warrant them
OWASP is a risk framework, not a compliance standard — until 2025, when the EU AI Act began referencing OWASP AI Exchange guidance for high-risk AI systems

What’s Next

EP02 answers the navigation question this episode raises: if four OWASP lists exist, which one applies to your system — and what happens when a single architecture touches all four at once?

The Four OWASP Lists: Web App, API, Cloud-Native, and LLM Compared →

Get EP02 in your inbox when it publishes → subscribe

How Kerberos Works: Tickets, KDC, and Why Enterprises Use It With LDAP

May 10, 2026May 4, 2026 by Vamshi Krishna Santhapuri

Reading Time: 7 minutes

The Identity Stack, Episode 5
EP01 → EP02 → EP03 → EP04: SSSD → EP05 → EP06: OpenLDAP → …

TL;DR

Kerberos is a network authentication protocol — it proves identity without sending passwords over the network, using time-limited cryptographic tickets
Three actors: the client, the KDC (Key Distribution Center), and the service — the KDC issues tickets; clients use tickets to authenticate to services
The ticket flow: AS-REQ (get a TGT) → TGS-REQ (exchange TGT for a service ticket) → AP-REQ (present service ticket to the target service)
A TGT (Ticket-Granting Ticket) is a session credential — it lets you request service tickets without re-entering your password for the lifetime of the ticket (default 10 hours)
LDAP + Kerberos together: LDAP stores identity (who you are), Kerberos authenticates it (proves you are who you say you are) — Active Directory is exactly this combination
kinit, klist, kdestroy are the hands-on tools — run them and read the ticket output

The Big Picture: Three Actors, Three Steps

         1. AS-REQ / AS-REP
Client ◄────────────────────► AS (Authentication Server)
  │                                     │
  │    (part of KDC)                    │
  │                                     ▼
  │         2. TGS-REQ / TGS-REP   TGS (Ticket-Granting Server)
  ├───────────────────────────────────►│
  │         (part of KDC)              │
  │                                    │
  │    3. AP-REQ / AP-REP              │
  └─────────────────────────────► Service (SSH, LDAP, NFS, HTTP...)

KDC = AS + TGS (usually the same process, same machine)

EP04 mentioned Kerberos tickets and clock skew requirements without explaining the protocol. This episode explains why Kerberos was invented, what a ticket actually is, and how the three-step flow works — so that when SSSD says “KDC unreachable” or kinit fails with “pre-authentication required,” you know exactly what’s happening.

The Problem Kerberos Was Built to Solve

MIT’s Project Athena started in 1983 — a campus-wide computing initiative giving students access to thousands of workstations. The problem: how do you authenticate a student at workstation 847 to a file server across campus without sending their password over the network?

In 1988, Steve Miller and Clifford Neuman published Kerberos version 4. The core insight: a trusted third party (the KDC) can issue cryptographic proof that a user has authenticated, and that proof can be presented to any service on the network without the service ever seeing the user’s password.

The password never leaves the client machine after the initial authentication. Every subsequent authentication — to a different service, to the same service again — uses a ticket. The KDC knows both the client and the service. The client and service only need to trust the KDC.

Keys, Tickets, and Sessions

Before the protocol, the primitives:

Long-term keys — derived from passwords. When you set a password in Kerberos, it’s hashed into a key stored in the KDC database (in the krbtgt account on AD, in /var/lib/krb5kdc/principal on MIT Kerberos). The client also derives this key from the password at authentication time. Neither ever sends the raw password.

Session keys — temporary symmetric keys created by the KDC for a specific session. They’re valid for the ticket’s lifetime. After the ticket expires, the session key is useless.

Tickets — encrypted blobs issued by the KDC. A ticket contains the session key, the client identity, the expiry time, and optional flags. It’s encrypted with the target service’s long-term key — only the service can decrypt it. The client carries the ticket but can’t read the contents.

The Three-Step Flow

Step 1: AS-REQ / AS-REP — Getting a TGT

Client                        KDC (AS component)
  │                                │
  │── AS-REQ ──────────────────────►
  │   {username, timestamp}         │
  │   (timestamp encrypted with     │
  │    client's long-term key)       │
  │                                 │
  │   KDC verifies: decrypts        │
  │   timestamp with stored key.    │
  │   If valid → issues TGT         │
  │                                 │
  ◄── AS-REP ──────────────────────│
      {session_key_enc_with_client, │
       TGT_enc_with_krbtgt_key}     │

The client decrypts the session key using its long-term key (derived from the password). The TGT is encrypted with the KDC’s own key (krbtgt) — the client can’t read it, but carries it.

This is the step that requires the password. After this, the TGT is what the client uses for everything else.

Step 2: TGS-REQ / TGS-REP — Getting a Service Ticket

Client                        KDC (TGS component)
  │                                │
  │── TGS-REQ ─────────────────────►
  │   {TGT, authenticator,         │
  │    target_service_name}        │
  │   (authenticator encrypted      │
  │    with TGT session key)        │
  │                                 │
  │   KDC: decrypts TGT,           │
  │   verifies authenticator,       │
  │   issues service ticket         │
  │                                 │
  ◄── TGS-REP ────────────────────│
      {service_session_key_enc,    │
       service_ticket_enc_with_    │
       service_long_term_key}      │

No password involved. The client proves its identity by presenting the TGT (which only the KDC can issue) and an authenticator (a timestamp encrypted with the TGT’s session key, proving the client holds the session key without revealing it).

Step 3: AP-REQ / AP-REP — Authenticating to the Service

Client                        Service (sshd, LDAP, NFS...)
  │                                │
  │── AP-REQ ──────────────────────►
  │   {service_ticket,             │
  │    authenticator_enc_with_      │
  │    service_session_key}        │
  │                                 │
  │   Service: decrypts ticket      │
  │   with its long-term key,       │
  │   verifies authenticator        │
  │                                 │
  ◄── AP-REP (optional) ───────────│
      {mutual authentication}       │

The service decrypts the ticket using its own key. It extracts the client identity and session key. It verifies the authenticator. No communication with the KDC required — the service trusts what the KDC signed.

Why Clock Skew Matters

Every Kerberos authenticator contains a timestamp. The service rejects authenticators older than 5 minutes (by default) — this prevents replay attacks where an attacker captures an authenticator and replays it later.

This is why clock skew over 5 minutes breaks Kerberos authentication entirely. If your machine’s clock drifts 6 minutes from the KDC, every authenticator you generate is rejected as too old or too far in the future. No tickets. No AD logins. No SSSD authentication.

# Check time sync status
timedatectl status
chronyc tracking        # if using chrony
ntpq -p                 # if using ntpd

# If clock is off: force a sync
chronyc makestep        # immediate step correction (chrony)

Hands-On: kinit, klist, kdestroy

# Get a TGT (will prompt for password)
kinit [email protected]

# Show current tickets
klist
# Credentials cache: FILE:/tmp/krb5cc_1001
# Principal: [email protected]
#
# Valid starting     Expires            Service principal
# 04/27/26 01:00:00  04/27/26 11:00:00  krbtgt/[email protected]
#   renew until 05/04/26 01:00:00

# Show encryption types used (the -e flag)
klist -e
# 04/27/26 01:00:00  04/27/26 11:00:00  krbtgt/[email protected]
#         Etype: aes256-cts-hmac-sha1-96, aes256-cts-hmac-sha1-96

# Get a service ticket for a specific service
kvno host/[email protected]
# host/[email protected]: kvno = 3

# Show all tickets including service tickets
klist -f
# Flags: F=forwardable, f=forwarded, P=proxiable, p=proxy, D=postdated,
#        d=postdated, R=renewable, I=initial, i=invalid, H=hardware auth

# Destroy all tickets
kdestroy

The Valid starting and Expires fields are the ticket lifetime. After expiry, you need to re-authenticate (or renew the ticket if it’s within the renew until window). The renew until date is when even renewal stops working.

/etc/krb5.conf

[libdefaults]
    default_realm = CORP.COM
    dns_lookup_realm = false
    dns_lookup_kdc = true         # find KDCs via DNS SRV records
    ticket_lifetime = 10h
    renew_lifetime = 7d
    forwardable = true            # tickets can be forwarded to remote hosts (needed for SSH forwarding)
    rdns = false

[realms]
    CORP.COM = {
        kdc = dc01.corp.com
        kdc = dc02.corp.com       # failover KDC
        admin_server = dc01.corp.com
    }

[domain_realm]
    .corp.com = CORP.COM
    corp.com = CORP.COM

With dns_lookup_kdc = true, Kerberos finds KDCs by querying DNS SRV records (_kerberos._tcp.corp.com). AD sets these up automatically. On MIT Kerberos, you add them manually. DNS-based discovery is the recommended approach for AD environments — it picks up new DCs automatically.

Kerberos + LDAP: Why Enterprises Run Both

LDAP and Kerberos solve different problems and are almost always deployed together:

LDAP answers:  "Who is vamshi? What groups is he in? What's his home directory?"
Kerberos answers: "Is this really vamshi? Prove it without sending a password."

Active Directory is exactly this combination — the directory is LDAP-based, the authentication is Kerberos. When a Linux machine joins an AD domain via realm join or adcli, it gets:
– LDAP access to the AD directory (for NSS: user and group lookups)
– A Kerberos principal registered in AD (for PAM: ticket-based authentication)
– A machine account (the machine’s identity in the directory)

When you SSH into an AD-joined Linux machine:
1. SSSD issues a Kerberos AS-REQ for the user’s TGT
2. SSSD uses the TGT to get a service ticket for the Linux machine’s PAM service
3. Authentication is verified via the service ticket — no LDAP Bind with a password
4. SSSD does an LDAP Search to get POSIX attributes (UID, GID, home dir)

Password-based LDAP Bind is the fallback when Kerberos isn’t available. Kerberos is the default on AD-joined systems — and it’s more secure because the password never leaves the client.

⚠ Common Misconceptions

“Kerberos sends your password to the KDC.” It doesn’t. The client derives a key from the password locally and uses that key to encrypt a timestamp (the pre-authentication data). The KDC verifies the timestamp using the stored key. The raw password never travels.

“Kerberos is an authorization protocol.” Kerberos authenticates — it proves who you are. Authorization (what you can do) is a separate decision, usually handled by ACLs on the service or directory group membership.

“Once you have a TGT, you’re authenticated to everything.” A TGT only proves your identity to the KDC. Each service requires a separate service ticket. The TGT is what lets you get those service tickets without re-entering your password.

“Kerberos requires AD.” MIT Kerberos 5 is a standalone implementation. FreeIPA (EP08) runs MIT Kerberos. Heimdal is another implementation. AD uses a Microsoft-extended version of Kerberos 5, but the core protocol is the same RFC.

Framework Alignment

Domain	Relevance
CISSP Domain 5: Identity and Access Management	Kerberos is the de facto enterprise authentication protocol — SSO, delegation, and service account authentication all depend on it
CISSP Domain 4: Communications and Network Security	Kerberos prevents credential sniffing and replay attacks — two of the core network authentication threat categories
CISSP Domain 3: Security Architecture and Engineering	The KDC is a critical single point of trust — its availability, key management, and account (`krbtgt`) rotation are architectural security decisions

Key Takeaways

Kerberos is a ticket-based protocol — the password is used once to get a TGT; from then on, tickets prove identity without the password
The three-step flow: get a TGT from the AS, exchange it for a service ticket at the TGS, present the service ticket to the target service
Clock skew over 5 minutes breaks Kerberos — time synchronization is a hard dependency
LDAP stores identity; Kerberos authenticates it — Active Directory is exactly this combination, and so is FreeIPA
klist -e shows the encryption types in use — aes256-cts-hmac-sha1-96 is what you want to see; arcfour-hmac (RC4) is legacy and should be disabled

What’s Next

EP05 covered Kerberos as a protocol. EP06 goes hands-on: building a real LDAP directory with OpenLDAP, configuring replication, and understanding how the server-side components — slapd, the MDB backend, SyncRepl — fit together.

Next: OpenLDAP Setup and Replication: Running Your Own Directory

Get EP06 in your inbox when it publishes → linuxcent.com/subscribe

TC eBPF — Pod-Level Network Policy Without iptables

May 13, 2026May 3, 2026 by Vamshi Krishna Santhapuri

Reading Time: 10 minutes

eBPF: From Kernel to Cloud, Episode 8
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps · CO-RE and libbpf · XDP · TC eBPF**

Architecture Overview

TC eBPF and Cilium — traffic control hook architecture showing ingress/egress packet flow with sk_buff context — The TC hook runs inside the kernel network stack — Cilium uses it for identity-based policy enforcement.

TL;DR

TC eBPF fires after sk_buff allocation — it has socket metadata, cgroup ID, and pod identity that XDP lacks
(sk_buff = the kernel’s socket buffer, allocated for every packet; TC fires after this allocation, so it can read socket and process identity)
Direct action (DA) mode combines filter and action; the program’s return value is the packet fate
Multiple TC programs chain on the same hook ordered by priority — stale programs from Cilium upgrades cause silent policy conflicts
tc filter show dev <iface> ingress/egress is the primary inspection tool; bpftool net list shows the full node picture
XDP + TC is the Cilium data path: XDP for pre-stack service load balancing, TC for per-pod identity-based enforcement
TC can modify packet content (bpf_skb_store_bytes) — the basis for TC-based DNAT and packet mangling

TC eBPF is where Cilium implements pod-level network policy without iptables — the hook that fires after sk_buff allocation, where socket and cgroup context exist, making per-pod enforcement possible. The obvious follow-up to XDP is why Cilium doesn’t use it for everything — pod network policy, egress enforcement, the full NetworkPolicy ruleset. The answer reveals an inherent trade-off built into the Linux data path: XDP’s speed comes from running before any context exists. At the moment it fires, there is no socket, no cgroup, no way to tell which pod sent the packet. The moment you need pod identity, you need a hook that fires later — and pays for it.

A specific pod in production was experiencing intermittent TCP connection failures to an external service. Not all connections — roughly one in fifty. Kubernetes NetworkPolicy showed egress allowed for the namespace. Cilium policy status showed no violations. Running curl from inside the pod worked fine.

The application logs told a different story: connection timeouts at the 30-second mark, no SYN-ACK received. Not a DNS issue — I verified with tcpdump inside the pod namespace. SYN packets were leaving the pod network namespace. They weren’t making it onto the wire.

I ran bpftool net list on the node and saw two TC egress programs attached to that pod’s veth interface. One from the current Cilium version (installed six weeks ago). One from the previous version — from before the rolling upgrade. Two programs. Different policy epochs. The older one had a stale block rule that fired intermittently based on connection tuple patterns it was never designed to handle in the new policy model.

Without understanding TC eBPF — what programs attach where, how multiple programs interact, and how to inspect them — I would have kept chasing ghosts in the application layer.

Quick Check: Are There Stale TC Filters on Your Cluster?

The most common TC eBPF issue on production clusters — stale filters left behind by a Cilium upgrade — is a two-command check:

# SSH into a worker node, then pick any pod's veth interface:
ip link | grep lxc | head -5
# lxc8a3f21b@if7: ...
# lxc2c9d3e1@if9: ...

# Check TC filters on that interface
tc filter show dev lxc8a3f21b egress

Healthy output (one filter, one priority):

filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_to_container direct-action not_in_hw id 44

Stale filter present (two priorities = problem):

filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_to_container direct-action not_in_hw id 44
filter protocol all pref 2 bpf chain 0
filter protocol all pref 2 bpf chain 0 handle 0x1 old_cil_to_container direct-action not_in_hw id 17
#                  ^^^^^^ two different priorities = two programs running in sequence

Two priorities on the same hook means two programs running sequentially. If the older one has a stale DROP rule, packets are being dropped intermittently — and nothing in the application layer will tell you why.

Not running Cilium? If you’re on a non-Cilium CNI (Calico, Flannel, aws-vpc-cni), you likely won’t have TC eBPF filters on pod interfaces. Run tc filter show dev eth0 ingress on the node uplink instead to see if any TC programs are attached at the node level. An empty response is normal for non-Cilium clusters.

Why TC, Not XDP

EP07 covered XDP: fastest possible hook, fires before sk_buff, drops at line rate. If XDP is so fast, why doesn’t Cilium use it for everything?

Because XDP sees only raw packet bytes. No socket. No cgroup. No pod identity.

In Kubernetes, network policy is inherently about identity. “Allow pod A to connect to pod B on port 8080.” To enforce this, you need to know which pod a packet is coming from on egress — and which pod it’s going to on ingress. That mapping lives in the cgroup hierarchy and the socket buffer, neither of which exist at XDP time.

TC fires later in the packet lifecycle, after sk_buff is allocated and populated:

Ingress path:
  wire → NIC → [XDP hook] → sk_buff allocated → [TC ingress hook] → netfilter → socket

Egress path:
  socket → IP routing → [TC egress hook] → qdisc → NIC → wire

At the TC egress hook on a pod’s veth interface, the sk_buff carries the socket that created the packet — and from that socket you can read the cgroup ID. The cgroup hierarchy maps container → pod, so the TC program knows which pod this traffic belongs to. That’s what makes pod-level enforcement possible.

The Linux Traffic Control Architecture

tc (traffic control) is the Linux subsystem for managing packet queues and scheduling. Most Linux administrators know it as the bandwidth-shaping tool:

# Classic tc usage — rate limit an interface
tc qdisc add dev eth0 root tbf rate 100mbit burst 32kbit latency 400ms

The qdisc (queuing discipline) is the primary abstraction. Under the qdisc sits a filter layer — and the filter type relevant to eBPF is cls_bpf, which attaches eBPF programs as packet classifiers.

qdisc (queuing discipline) is the kernel’s packet scheduler for an interface — it controls how packets are buffered and in what order they leave. For eBPF policy enforcement, Cilium uses a special qdisc called clsact which has no buffering behaviour at all; it purely provides the ingress and egress hook points where eBPF filters attach. If a pod veth doesn’t have clsact, Cilium isn’t enforcing policy on that pod.

Cilium attaches cls_bpf filters in direct action (DA) mode, which combines classifier and action into a single eBPF program. The program’s return value is the packet fate directly:

Return value	Action
`TC_ACT_OK` (0)	Pass the packet
`TC_ACT_SHOT` (2)	Drop the packet
`TC_ACT_REDIRECT` (7)	Redirect to another interface
`TC_ACT_PIPE` (3)	Pass to the next filter in the chain

TC Context: What Your Program Can See

TC programs receive a struct __sk_buff — a safe, BPF-accessible projection of the kernel sk_buff. Unlike the raw packet bytes in XDP, __sk_buff includes metadata:

struct __sk_buff {
    __u32 len;           // packet length
    __u32 pkt_type;      // PACKET_HOST, PACKET_BROADCAST, etc.
    __u32 mark;          // skb->mark — used by Cilium for pod identity
    __u32 queue_mapping;
    __u32 protocol;      // ETH_P_IP, ETH_P_IPV6, etc.
    __u32 vlan_present;
    __u32 vlan_tci;
    __u32 vlan_proto;
    __u32 priority;
    __u32 ingress_ifindex;
    __u32 ifindex;
    __u32 tc_index;
    __u32 cb[5];
    __u32 hash;
    __u32 tc_classid;
    __u32 data;          // offset to packet data
    __u32 data_end;
    __u32 napi_id;
    __u32 family;
    __u32 remote_ip4;    // source IP (ingress) or dest IP (egress)
    __u32 local_ip4;
    __u32 remote_port;
    __u32 local_port;
    // ...
};

skb->mark is how Cilium passes pod identity between its hook points.

skb->mark is a 32-bit field in every sk_buff that any kernel subsystem can read or write. It’s a general-purpose scratch field — iptables uses it, routing rules use it, and Cilium uses it to carry pod security identity from the socket hook through to TC enforcement. When Cilium stamps a pod’s identity into skb->mark at connection time, every subsequent TC filter on that packet’s path can read it without another identity lookup. The socket-level cgroup hook (cgroup_sock_addr) stamps the cgroup-derived pod identity into skb->mark when the socket calls connect(). By the time the packet reaches the TC egress hook, skb->mark carries the pod’s security identity — and the TC program uses it for policy enforcement.

What Cilium’s TC Filters Actually Do

The TC filter on each pod’s veth is Cilium’s enforcement point for Kubernetes NetworkPolicy. The mechanism:

When a pod opens a connection, a cgroup_sock_addr hook stamps the pod’s security identity (derived from its labels + namespace) into skb->mark
The TC egress filter on the veth reads skb->mark, looks up the pod identity + destination in the policy map, and returns TC_ACT_SHOT (drop) or TC_ACT_OK (pass)
The TC ingress filter on the receiving pod’s veth does the same check for inbound traffic

The policy map is a BPF LRU hash keyed on {pod_identity, dst_ip, dst_port, protocol}. This is what cilium policy get reads from — and what bpftool map dump shows directly:

# Find Cilium's policy maps
bpftool map list | grep -i policy

# Dump the active policy entries for a specific endpoint
# Get endpoint ID from: cilium endpoint list
cilium bpf policy get <endpoint-id>

# Cross-check with raw bpftool dump
bpftool map dump id <POLICY_MAP_ID> | head -30

The clsact qdisc is the prerequisite for any TC eBPF filter — it creates the ingress and egress hook points without any queuing behavior. Every pod veth on a Cilium node has one:

tc qdisc show dev lxcABCDEF
# qdisc clsact ffff: dev lxcABCDEF parent ffff:fff1
# ^^^^^^^^^^^^ this line confirms Cilium's hook points exist on this pod's veth
# If this is missing: Cilium is NOT enforcing NetworkPolicy on this pod

If a pod veth doesn’t show clsact, Cilium isn’t enforcing policy on that pod.

Multiple Programs and the Filter Chain

This is the detail that caused my production incident.

TC supports chaining multiple filters on the same hook, ordered by priority. Lower priority number runs first. When Cilium upgrades, it installs a new filter at a new priority before removing the old one. If the upgrade procedure has any timing gap — or if the removal step fails silently — you end up with two programs running in sequence.

# Show all TC filters on a pod's veth — both priorities visible
tc filter show dev lxc12345 egress

# Example output with a stale filter:
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_to_container direct-action not_in_hw id 44
filter protocol all pref 2 bpf chain 0
filter protocol all pref 2 bpf chain 0 handle 0x1 old_cil_to_container direct-action not_in_hw id 17

Two programs. Pref 1 runs first. Pref 2 runs second — unless pref 1 returned TC_ACT_SHOT, in which case the packet is already dropped and pref 2 never fires.

In my incident: pref 1 was the current Cilium version with correct policy, returning TC_ACT_OK for the traffic in question. Pref 2 was the old version with a stale block entry, returning TC_ACT_SHOT for a subset of connection tuples. Because TC_ACT_OK passes to the next filter in the chain (TC_ACT_PIPE would do the same), pref 2 got to run — and intermittently dropped packets.

The fix:

# Remove the stale filter by priority
tc filter del dev lxc12345 egress pref 2

# Verify only the current filter remains
tc filter show dev lxc12345 egress

This should be part of any post-upgrade verification for Cilium-managed clusters.

How Cilium Uses TC Across the Full Node

Cilium’s TC deployment on a node:

Pod veth (host-side, lxcXXXXX):
  TC ingress: cil_from_container — L3/L4 policy on the pod's outbound traffic
  TC egress:  cil_to_container   — L3/L4 policy on traffic arriving at the pod

Node uplink (eth0):
  TC ingress: cil_from_netdev    — traffic arriving from outside the node
  TC egress:  cil_to_netdev      — traffic leaving the node

XDP on eth0:
  cil_xdp_entry — pre-stack service load balancing (DNAT for ClusterIP)

The naming is counterintuitive at first: cil_from_container is attached to the TC ingress hook on the veth.

Veth direction confusion: TC ingress/egress is named from the kernel’s perspective of the interface, not the pod’s. The host-side veth interface receives traffic that the pod is sending — so TC ingress on the host veth = the pod’s outbound traffic. This trips up everyone the first time. When debugging, always confirm direction with tc filter show dev lxcXXX ingress and egress separately, and check which Cilium program name is attached (cil_from_container = pod outbound, cil_to_container = pod inbound). The veth ingress direction from the host perspective is traffic flowing out of the container. Traffic leaving the pod hits the host-side veth ingress, which is cil_from_container. It enforces egress policy for the pod. Naming follows the kernel’s perspective of the interface, not the application’s.

To see the full picture on a node:

# All eBPF network programs (XDP and TC) across all interfaces
bpftool net list

# TC-specific view
for iface in $(ip link | grep lxc | awk -F': ' '{print $2}'); do
    echo "=== $iface ==="
    tc filter show dev $iface ingress
    tc filter show dev $iface egress
done

TC Can Modify Packets Too

Unlike XDP, TC programs have full access to the sk_buff and can modify packet content — headers, payload, and checksums. This is how TC-based DNAT works in Cilium when XDP isn’t available on the NIC: the program rewrites the destination IP at L3 and updates the IP + transport checksums atomically. The kernel BPF helper handles the checksum recalculation.

From an operational standpoint: if you see a TC program attached but expected traffic is being redirected rather than dropped, the program is likely doing DNAT. bpftool prog dump xlated id <ID> shows the disassembled instructions and will reveal bpf_skb_store_bytes calls if packet rewriting is happening.

Debugging TC Programs in Production

Workflow I follow when investigating network issues on Cilium clusters:

# 1. List all eBPF network programs (see the full picture)
bpftool net list

# 2. Check specific interface for stale TC filters
tc filter show dev lxcABCDEF ingress
tc filter show dev lxcABCDEF egress

# 3. Inspect a specific program
bpftool prog show id 44

# 4. Disassemble a program (last resort for understanding behavior)
bpftool prog dump xlated id 44

# 5. Check Cilium's view of the same interface
cilium endpoint list
cilium endpoint get <endpoint-id>

# 6. Enable verbose TC program logs (debug builds only)
# Cilium: set CILIUM_DEBUG=true in the deployment

Common Mistakes

Mistake	Impact	Fix
Not checking for stale TC filters after Cilium upgrades	Conflicting policy programs cause intermittent drops	Run `tc filter show` post-upgrade; remove stale by priority
Confusing ingress/egress direction on veth interfaces	Policy applied to wrong traffic direction	TC ingress on host-side veth = pod’s outbound traffic
Attaching TC without `clsact` qdisc	Filter attachment fails	`tc qdisc add dev <iface> clsact` before filter add
Using `TC_ACT_OK` when you want to stop the chain	Subsequent filters still run	Use `TC_ACT_OK` knowing the chain continues; use `TC_ACT_REDIRECT` or explicit `TC_ACT_SHOT` only
Expecting TC performance equal to XDP	TC has sk_buff overhead — it’s slower	Right tool: XDP for pre-stack bulk drops, TC for identity-aware policy
Hardcoding `skb->mark` interpretation	Different tools use mark differently	Document mark field usage clearly; coordinate between Cilium and custom programs

Key Takeaways

TC eBPF fires after sk_buff allocation — it has socket metadata, cgroup ID, and pod identity that XDP lacks
Direct action (DA) mode combines filter and action; the program’s return value is the packet fate
Multiple TC programs chain on the same hook ordered by priority — stale programs from Cilium upgrades cause silent policy conflicts
tc filter show dev <iface> ingress/egress is the primary inspection tool; bpftool net list shows the full node picture
XDP + TC is the Cilium data path: XDP for pre-stack service load balancing, TC for per-pod identity-based enforcement
TC can modify packet content (bpf_skb_store_bytes) — the basis for TC-based DNAT and packet mangling

What’s Next

EP08 closes out the kernel machinery arc: program types, maps, CO-RE, XDP, TC. Five episodes on the engine under the tools. EP09 shifts from understanding the machinery to using it interactively.

bpftrace turns kernel knowledge into one-liners you can run on a live production node. Which process is touching this file right now? Where is this latency spike originating in the kernel call stack? Which container is making DNS queries to an unexpected resolver? Under 10 seconds per question — no restart, no sidecar, no instrumentation change.

Every bpftrace one-liner is a complete eBPF program compiled, loaded, run, and cleaned up on the fly. EP09 covers how that works and why it changes the way you investigate production incidents.

Next: bpftrace — kernel answers in one line

Get EP09 in your inbox when it publishes → linuxcent.com/subscribe

SSSD: The Caching Daemon That Powers Every Enterprise Linux Login

May 10, 2026April 27, 2026 by Vamshi Krishna Santhapuri

Reading Time: 7 minutes

The Identity Stack, Episode 4
EP01: What Is LDAP → EP02: LDAP Internals → EP03: LDAP Auth on Linux → EP04 → EP05: Kerberos → …

TL;DR

SSSD (System Security Services Daemon) is the caching and brokering layer between Linux and directory services — it handles LDAP, Kerberos, and AD so PAM and NSS don’t have to
Architecture: three tiers — responders (answer PAM/NSS queries), providers (talk to AD/LDAP/Kerberos), and a shared cache (LDB database on disk)
Credential caching means offline logins work — a user who authenticated yesterday can log in today even if the domain controller is unreachable
Key config: sssd.conf — the [domain] section is where almost all tuning happens
Debugging toolkit: sssctl, sss_cache, id, getent, journalctl -u sssd
The most common failure modes are: SSSD not running, stale cache, misconfigured ldap_search_base, and clock skew breaking Kerberos

The Big Picture: SSSD as the Identity Broker

PAM (pam_sss)         NSS (sss module)
      │                      │
      └──────────┬───────────┘
                 ▼
          SSSD Responders
          ┌────────────────────────────────────┐
          │  PAM responder   NSS responder      │
          │  (auth, account, (passwd, group,    │
          │   session)        shadow lookups)   │
          └────────────┬───────────────────────┘
                       │  shared cache (LDB)
                       ▼
          SSSD Providers
          ┌────────────────────────────────────┐
          │  identity provider  auth provider   │
          │  (user/group attrs) (credentials)   │
          └────────────┬───────────────────────┘
                       │
          ┌────────────┼────────────┐
          ▼            ▼            ▼
       LDAP          Kerberos    Local files
    (AD / OpenLDAP)  (KDC / AD)

EP03 showed that SSSD sits between PAM and LDAP. This episode goes inside it — the architecture, the config, and how to tell exactly what it’s doing on any given login attempt.

Why SSSD Exists

The problem before SSSD: nss_ldap and pam_ldap made direct LDAP connections for every query. No caching, no connection pooling, no failover, no offline support. On a system that makes dozens of getpwuid() calls per second (every ls -l, every process spawn), this meant dozens of LDAP roundtrips per second hitting the domain controller.

SSSD solved this with a single daemon that:
– Maintains a persistent connection pool to the directory
– Caches identity and credential data in an LDB (LDAP-like) database on disk
– Handles failover across multiple directory servers
– Satisfies PAM and NSS queries from cache when the directory is unreachable

The credential cache is the key insight. When you authenticate successfully, SSSD stores a hash of your credentials locally. If the domain controller is unreachable on your next login — network outage, laptop offline, VPN not connected — SSSD can verify your credentials against the local cache. You log in. You never knew the DC was down.

SSSD Architecture

SSSD is a set of cooperating processes sharing a cache:

Monitor — the parent process. Starts and restarts all other SSSD processes. If a responder or provider crashes, the monitor restarts it.

Responders — answer queries from PAM and NSS. Each responder handles a specific interface:
– sssd_nss — answers getpwnam(), getpwuid(), getgrnam(), initgroups() calls
– sssd_pam — handles PAM authentication, account checks, and session management
– sssd_autofs, sssd_ssh, sssd_sudo — optional responders for specific services

Providers — the backend processes that talk to the actual directory:
– Each domain gets its own provider process (sssd_be[domain_name])
– The provider connects to LDAP/Kerberos/AD, fetches data, and writes it to the shared cache
– If the provider crashes or loses connectivity, responders fall back to serving from cache

Cache — LDB files in /var/lib/sss/db/. One database per configured domain, plus a cache for negative results (lookups that returned “not found”). The cache is an LDAP-like directory stored on disk — SSSD uses the same hierarchical structure for local storage as the remote directory uses.

# See the cache files
ls -la /var/lib/sss/db/
# cache_corp.com.ldb         ← user/group data for domain corp.com
# ccache_corp.com            ← Kerberos credential cache
# timestamps_corp.com.ldb   ← when entries were last refreshed

sssd.conf: The Config That Matters

/etc/sssd/sssd.conf has a [sssd] section (global) and one [domain/name] section per directory. The domain section is where almost all tuning happens.

[sssd]
services = nss, pam, sudo
domains = corp.com
config_file_version = 2

[domain/corp.com]
# What type of directory this is
id_provider = ad               # or: ldap, ipa, files
auth_provider = ad             # or: ldap, krb5, none
access_provider = ad           # controls who can log in

# The AD/LDAP server (can be a list for failover)
ad_domain = corp.com
ad_server = dc01.corp.com, dc02.corp.com

# Where to look for users and groups
ldap_search_base = dc=corp,dc=com

# Cache behavior
cache_credentials = true       # enable offline login
entry_cache_timeout = 5400     # how long before re-querying (seconds)
offline_credentials_expiration = 1  # days cached credentials stay valid offline

# What uid/gid range belongs to this domain (prevents UID conflicts)
ldap_id_mapping = true         # auto-map AD SIDs to UIDs (no uidNumber needed)
# OR for classical POSIX LDAP:
# ldap_id_mapping = false      # use uidNumber/gidNumber from directory

# Restrict logins to specific AD groups
# access_provider = simple
# simple_allow_groups = linux-admins, sre-team

# Home directory and shell defaults
override_homedir = /home/%u
default_shell = /bin/bash
fallback_homedir = /home/%u

# Enumerate all users (expensive on large dirs — disable unless needed)
enumerate = false

The two most commonly wrong settings:

ldap_search_base — if this doesn’t include the OU where your users live, SSSD won’t find them. On AD, the default searches the entire domain, which is usually correct. On OpenLDAP, you may need ou=people,dc=corp,dc=com.

ldap_id_mapping — on AD, users typically don’t have uidNumber attributes. Setting ldap_id_mapping = true tells SSSD to derive a UID from the user’s SID algorithmically. This produces consistent UIDs across machines. Setting it to false requires actual uidNumber attributes in the directory.

Credential Caching and Offline Logins

The cache is what separates SSSD from a simple proxy. When cache_credentials = true:

On successful authentication, SSSD stores a hash of the credential in the LDB cache
On the next authentication attempt, SSSD first tries the domain controller
If the DC is unreachable, SSSD falls back to the local credential hash
If the hash matches, login succeeds — even with no network

The credential hash is not the cleartext password — it’s a salted hash stored in /var/lib/sss/db/cache_corp.com.ldb. The security model is the same as /etc/shadow: someone with root access to the machine can access the hashes.

offline_credentials_expiration controls how long cached credentials stay valid when the DC is unreachable. 0 means forever (not recommended for high-security environments). 1 means one day — after 24 hours offline, even cached credentials expire and the user must authenticate online.

The Debugging Toolkit

# 1. Is SSSD running?
systemctl status sssd
pgrep -a sssd    # shows all SSSD processes (monitor + responders + providers)

# 2. Domain connectivity status
sssctl domain-status corp.com
# Domain: corp.com
# Active servers:
#   LDAP: dc01.corp.com
#   KDC: dc01.corp.com
# Discovered servers:
#   LDAP: dc01.corp.com, dc02.corp.com

# 3. Can SSSD find a specific user?
sssctl user-checks vamshi
# user: vamshi
# user name: [email protected]
# POSIX attributes: UID=1001, GID=1001, ...
# Authentication: success (uses actual PAM auth stack)

# 4. What does NSS see?
getent passwd vamshi          # full passwd entry
id vamshi                     # uid, gid, groups

# 5. Flush stale cache entries
sss_cache -u vamshi           # invalidate one user
sss_cache -G engineers        # invalidate one group
sss_cache -E                  # invalidate everything (nuclear option)

# 6. Live logs
journalctl -u sssd -f         # tail all SSSD logs
# Then attempt login in another terminal — watch the auth flow in real time

# 7. Increase log verbosity temporarily
sssctl config-check            # validate sssd.conf syntax
# Edit sssd.conf: add debug_level = 6 under [domain/corp.com]
systemctl restart sssd
journalctl -u sssd -f          # now shows LDAP queries, cache hits/misses

The single most useful command is sssctl user-checks <username>. It runs the full NSS + PAM auth stack internally and prints what SSSD would do on a real login — without creating a session or touching the running system.

Breaking SSSD (and What Each Failure Looks Like)

SSSD not running:

ssh vamshi@server
# Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password)
# getent passwd vamshi → (empty)
# Fix: systemctl start sssd

Stale cache after AD password change:

# User changed password in AD but SSSD still has old credential hash
ssh vamshi@server  # password accepted (wrong!) — cache hit with old hash
# Fix: sss_cache -u vamshi, then attempt login again

Clock skew > 5 minutes (breaks Kerberos):

journalctl -u sssd | grep -i "clock skew\|KDC\|kinit"
# sssd_be[corp.com]: Kerberos authentication failed: Clock skew too great
# Fix: systemctl restart chronyd (or ntpd), verify time sync

ldap_search_base wrong:

getent passwd vamshi  # empty, but user exists in AD
sssctl user-checks vamshi  # "User not found"
# Check: ldap_search_base must include the OU containing users
# Test: ldapsearch -x -H ldap://dc -b "ou=engineers,dc=corp,dc=com" "(uid=vamshi)"

⚠ Common Misconceptions

“Restarting SSSD logs everyone out.” Restarting SSSD doesn’t affect existing authenticated sessions. Active shell sessions, running processes — all unaffected. Only new authentication attempts are disrupted during the restart window, which takes a few seconds.

“sss_cache -E fixes everything.” Flushing the entire cache forces SSSD to re-fetch all entries from the domain controller on the next lookup. On a system with many users or enumeration enabled, this can cause a brief spike in LDAP traffic and slow lookups. Use targeted flushes (-u username, -G group) when possible.

“debug_level should always be high.” SSSD at debug_level = 9 logs every LDAP packet. On a production system with active logins, this generates gigabytes of logs quickly. Set it temporarily for debugging, then remove it and restart.

Framework Alignment

Domain	Relevance
CISSP Domain 5: Identity and Access Management	SSSD is the runtime implementation of enterprise identity integration on Linux — understanding its caching model, failover behavior, and credential storage is foundational to IAM operations
CISSP Domain 3: Security Architecture and Engineering	The credential cache design (`/var/lib/sss/db/`) creates a local credential store with specific security properties — architects need to understand the offline login trade-off
CISSP Domain 7: Security Operations	SSSD is a critical security service — monitoring it, understanding its failure modes, and knowing how to recover it quickly are operational security skills

Key Takeaways

SSSD is a three-tier system: responders (serve PAM/NSS), providers (talk to AD/LDAP), and a shared LDB cache — each tier is independently restartable
Credential caching enables offline logins — the security trade-off is a local hash store in /var/lib/sss/db/
sssctl user-checks is the first tool to reach for when a login fails — it simulates the full auth flow and shows exactly where it breaks
ldap_id_mapping = true is the right choice for AD environments without POSIX attributes; false requires actual uidNumber/gidNumber in the directory
Clock skew over 5 minutes silently breaks Kerberos authentication — time sync is a hard dependency

What’s Next

EP04 showed SSSD’s role as the caching and brokering layer. What it referenced repeatedly — “Kerberos ticket”, “KDC”, “GSSAPI” — is the authentication protocol that sits underneath AD-joined Linux logins. SSSD uses Kerberos to authenticate. LDAP carries the identity data. EP05 explains how Kerberos works.

Next: How Kerberos Works: Tickets, KDC, and Why Enterprises Use It With LDAP

Get EP05 in your inbox when it publishes → linuxcent.com/subscribe

TL;DR

The Big Picture: Where IdPs Sit

On-Premises Identity Providers

AD FS (Active Directory Federation Services)

Keycloak

Shibboleth

Cloud Identity Providers

Okta

Entra ID (Azure Active Directory)

Google Workspace

Federation: IdPs Trusting Each Other

SAML Federation (IdP-to-IdP)

OIDC Identity Brokering

SCIM: Provisioning ≠ Authentication

Directory Sync vs Federation

⚠ Common Misconceptions

Framework Alignment

Key Takeaways

What’s Next

TL;DR

The Problem: LDAP and Kerberos Don’t Cross the Internet

SAML 2.0: Enterprise Browser SSO

OAuth2: Authorization Delegation (Not Authentication)

OIDC: OAuth2 + Who You Are

SAML vs OIDC: When to Use Which

A Complete Browser SSO Flow (OIDC)

⚠ Common Misconceptions

Framework Alignment

Key Takeaways

What’s Next

TL;DR

The Big Picture: What AD Actually Is

The AD Schema: LDAP With 1000+ Object Classes

Replication: USN + GUID + KCC

Sites and Site Links

Group Policy: LDAP + Sysvol

Joining Linux to AD

realm join (recommended)

What the join does:

LDAP Queries Against AD from Linux

⚠ Common Misconceptions

Framework Alignment

Key Takeaways

What’s Next

TL;DR

The Big Picture: What FreeIPA Integrates

Why FreeIPA Instead of Bare OpenLDAP

Installing FreeIPA Server

The ipa CLI

Host-Based Access Control (HBAC)

Sudo Rules from the Directory

Enrolling a Client

FreeIPA Trust with Active Directory

⚠ Common Misconceptions

Framework Alignment

Key Takeaways

What’s Next

TL;DR

The Big Picture: Production LDAP Topology

HAProxy for LDAP

SSSD Multi-Server Failover

Connection Pooling

Monitoring with cn=monitor

389-DS: LDAP at Scale

Global Catalog: Cross-Domain Search in AD

⚠ Production Gotchas

Key Takeaways

What’s Next

TL;DR

The Big Picture: The UEFI Secure Boot Trust Chain

What Secure Boot Actually Does — and Why Certificate Expiry Breaks Booting

The Three Certificates Expiring in 2026

1. Microsoft Corporation KEK CA 2011 — expires June 24, 2026

2. Microsoft Corporation UEFI CA 2011 — expires June 27, 2026

3. Microsoft Windows Production PCA 2011 — expires October 19, 2026

Why GCP Instances Are Specifically Affected

GKE Shielded Nodes: The Operational Blind Spot

Detecting Affected Resources

Compute Engine Instances

GKE Node Pools