Kubernetes RBAC and AWS IAM: The Two-Layer Access Model for EKS

Reading Time: 9 minutes

Meta Description: Understand how Kubernetes RBAC and AWS IAM interact in EKS — map the two-layer access model and debug permission failures across both control planes.




TL;DR

  • Kubernetes RBAC and cloud IAM are separate authorization layers — strong cloud IAM with weak Kubernetes RBAC is still a vulnerable cluster
  • cluster-admin ClusterRoleBindings are the first thing to audit — a compromised pod with cluster-admin controls the entire cluster
  • Disable automountServiceAccountToken on pods that don’t call the Kubernetes API — most application pods don’t need it mounted
  • Use OIDC for human access instead of X.509 client certificates — client certs cannot be revoked without rotating the CA
  • Bind groups from IdP, not individual usernames — revocation propagates automatically when someone leaves
  • A ServiceAccount that can create pods or create rolebindings is a privilege escalation path: the same class of risk as iam:PassRole

The Big Picture

  TWO AUTHORIZATION LAYERS — NEITHER COMPENSATES FOR THE OTHER

  ┌─────────────────────────────────────────────────────────────────┐
  │  CLOUD IAM LAYER  (AWS IAM / GCP IAM / Azure RBAC)             │
  │  Controls: S3, DynamoDB, Lambda, RDS, cloud services           │
  │  Human: federated identity from IdP (SAML / OIDC)             │
  │  Machine: IRSA annotation → IAM role / GKE WI / AKS WI        │
  │  Audit: CloudTrail, GCP Audit Logs, Azure Monitor              │
  └─────────────────────────────────────────────────────────────────┘
           ↕ separate systems — no inheritance in either direction
  ┌─────────────────────────────────────────────────────────────────┐
  │  KUBERNETES RBAC LAYER  (within the cluster)                   │
  │  Controls: pods, secrets, deployments, configmaps, namespaces  │
  │  Human: OIDC groups → ClusterRoleBinding (or RoleBinding)      │
  │  Machine: ServiceAccount → Role / ClusterRole                  │
  │  Audit: kube-apiserver audit log                               │
  └─────────────────────────────────────────────────────────────────┘

  Attack path: exploit app pod → SA has cluster-admin → own the cluster
  Audit finding: cluster-admin on app SA, regardless of cloud IAM posture

Introduction

I spent a long time in Kubernetes environments thinking cloud IAM and Kubernetes RBAC were related in a way that meant securing one partially covered the other. They aren’t. They’re separate authorization systems that happen to share infrastructure.

The moment this crystallized for me: I was auditing an EKS cluster for a fintech company. Their AWS IAM posture was actually quite good — least privilege roles, no wildcard policies, SCPs in place at the org level. I was about to give them a clean bill of health when I ran one command:

kubectl get clusterrolebindings -o json | \
  jq '.items[] | select(.roleRef.name=="cluster-admin") | {name:.metadata.name, subjects:.subjects}'

The output showed five ClusterRoleBindings to cluster-admin. Two of them bound it to service accounts in production namespaces. One of those service accounts was used by an application that processed customer transactions.

cluster-admin in Kubernetes is the equivalent of AdministratorAccess in AWS. An attacker who compromises a pod running as that service account doesn’t just have access to the application’s data. They have control of the entire cluster: reading every secret in every namespace, deploying arbitrary workloads, modifying RBAC bindings to create persistence.

None of this showed up in the AWS IAM audit. AWS IAM and Kubernetes RBAC are separate systems. Securing one tells you nothing about the other.


Kubernetes RBAC Architecture

Kubernetes RBAC works with four object types:

Object | Scope | What It Does
Role | Single namespace | Defines permissions within one namespace
ClusterRole | Cluster-wide | Permissions across all namespaces, or for non-namespaced resources
RoleBinding | Single namespace | Binds a Role (or ClusterRole) to subjects, scoped to one namespace
ClusterRoleBinding | Cluster-wide | Binds a ClusterRole to subjects with cluster-wide scope

Subjects, the identities that receive the binding, are:

  • User: an external identity (Kubernetes has no native user objects; users come from the authenticator)
  • Group: a group of external identities
  • ServiceAccount: a Kubernetes-native machine identity, namespaced

The scoping matters. A ClusterRole defines what permissions exist. A RoleBinding applies that ClusterRole within a single namespace. A ClusterRoleBinding applies it everywhere. The same permissions, dramatically different blast radius.


Roles and ClusterRoles

# Role: read pods and their logs — scoped to the default namespace only
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""]          # "" = core API group (pods, secrets, configmaps, etc.)
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
# ClusterRole: manage Deployments across all namespaces
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: deployment-manager
rules:
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

The verbs map to HTTP methods against the Kubernetes API: get reads a specific resource, list returns a collection, watch streams changes, create/update/patch/delete are mutations.
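To make that mapping concrete, here is a rough Python sketch of how a request translates into the verb that RBAC checks. The function name request_verb and its parameters are illustrative only, not part of any Kubernetes client library, and the sketch ignores special cases such as subresources and impersonation.

```python
def request_verb(http_method: str, is_collection: bool, watch: bool = False) -> str:
    """Approximate the RBAC verb the API server derives from a request."""
    method = http_method.upper()
    if method in ("GET", "HEAD"):
        if watch:
            return "watch"
        # Reading one named object is "get"; reading a collection is "list".
        return "list" if is_collection else "get"
    if method == "DELETE":
        return "deletecollection" if is_collection else "delete"
    return {"POST": "create", "PUT": "update", "PATCH": "patch"}[method]


print(request_verb("GET", is_collection=False))              # get
print(request_verb("GET", is_collection=True))               # list
print(request_verb("GET", is_collection=True, watch=True))   # watch
print(request_verb("PATCH", is_collection=False))            # patch
```

The point worth noticing is that get and list are the same HTTP method; only the target (one object versus a collection) differs, which is why they have to be granted separately.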

One that consistently surprises people: list on secrets returns the full Secret objects, including their data, not just names and metadata. Anyone who can list (or watch) secrets in a namespace can read every secret value in that namespace. If a service account needs to check whether a secret exists, grant get restricted to that specific secret name with resourceNames. Avoid list on the secrets resource.

The Wildcard Risk

# This is effectively cluster-admin in the default namespace — avoid
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["*"]

Any * in RBAC rules is an audit finding. In practice I find wildcards most often in:
– Operator and controller service accounts (understandable, but worth reviewing)
– “Temporary” RBAC that became permanent
– Developer tooling given cluster-admin “because it was easier”

Run this to find all ClusterRoles with wildcard verbs:

kubectl get clusterroles -o json | \
  jq -r '.items[] | select(any(.rules[]?.verbs[]?; . == "*")) | .metadata.name'
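The same check can be sketched in Python when the audit needs to cover more than verbs. This is a minimal example under stated assumptions: the function name wildcard_clusterroles and the sample data are invented for illustration, and the input is assumed to have the shape of kubectl get clusterroles -o json output.

```python
import json


def wildcard_clusterroles(doc):
    """Return names of ClusterRoles with "*" in verbs, resources, or apiGroups."""
    names = []
    for item in doc.get("items", []):
        rules = item.get("rules") or []   # aggregated roles can carry rules: null
        if any("*" in (rule.get(field) or [])
               for rule in rules
               for field in ("verbs", "resources", "apiGroups")):
            names.append(item["metadata"]["name"])
    return names


# Hypothetical sample in the shape of `kubectl get clusterroles -o json`.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "pod-reader"},
   "rules": [{"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list"]}]},
  {"metadata": {"name": "too-broad"},
   "rules": [{"apiGroups": ["apps"], "resources": ["deployments"], "verbs": ["*"]}]},
  {"metadata": {"name": "aggregated"}, "rules": null}
]}
""")

print(wildcard_clusterroles(sample))  # ['too-broad']
```

Each role is reported once, however many of its rules contain a wildcard. Against a real cluster, the document would come from json.load(sys.stdin) with the kubectl output piped in.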

Bindings — Connecting Identities to Roles

# RoleBinding: alice can read pods in the default namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: alice-pod-reader
  namespace: default
subjects:
- kind: User
  name: alice@example.com   # placeholder address
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
---
# ClusterRoleBinding: Prometheus can read cluster-wide (monitoring use case)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-cluster-reader
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
roleRef:
  kind: ClusterRole
  name: view
  apiGroup: rbac.authorization.k8s.io

An important pattern: a RoleBinding can reference a ClusterRole. This lets you define a role once at the cluster level (the ClusterRole) and bind it within specific namespaces through RoleBindings. The permissions are still scoped to the namespace where the RoleBinding lives. This is the right pattern for shared role definitions — define the permission set once, instantiate it with appropriate scope.

Default to RoleBinding over ClusterRoleBinding for namespace-scoped work. ClusterRoleBinding should be reserved for genuinely cluster-wide operations: monitoring agents, network plugins, cluster operators, security tooling.
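The scoping rule can be sketched in a few lines of Python. This is a toy model, not how the API server is implemented: the Binding class, the allowed function, and the subject strings are all made up for illustration, and it assumes each role's rules have been flattened into sets of verbs and resources.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Binding:
    kind: str                         # "RoleBinding" or "ClusterRoleBinding"
    subject: str                      # user, group, or service account name
    verbs: frozenset                  # verbs granted by the referenced role
    resources: frozenset              # resources granted by the referenced role
    namespace: Optional[str] = None   # set only for a RoleBinding


def allowed(bindings, subject, verb, resource, namespace):
    """Return True if any binding grants verb on resource in namespace."""
    for b in bindings:
        if b.subject != subject:
            continue
        # A RoleBinding applies only inside its own namespace,
        # even when it references a ClusterRole.
        if b.kind == "RoleBinding" and b.namespace != namespace:
            continue
        if verb in b.verbs and resource in b.resources:
            return True
    return False


# The same permission set, bound with two different scopes.
view = dict(verbs=frozenset({"get", "list", "watch"}), resources=frozenset({"pods"}))
bindings = [
    Binding("RoleBinding", "oidc:dev-team", namespace="staging", **view),
    Binding("ClusterRoleBinding", "oidc:sre-team", **view),
]

print(allowed(bindings, "oidc:dev-team", "list", "pods", "staging"))     # True
print(allowed(bindings, "oidc:dev-team", "list", "pods", "production"))  # False
print(allowed(bindings, "oidc:sre-team", "list", "pods", "production"))  # True
```

Both groups hold identical permissions; only the binding kind decides whether those permissions stop at the namespace boundary.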


Service Accounts — The Machine Identity in Kubernetes

Every pod in Kubernetes runs as a service account. If you don’t specify one, it uses the default service account in the pod’s namespace.

The default service account is where many RBAC misconfigurations accumulate. When someone creates a RoleBinding without thinking about which SA to use, they often bind the permission to default. Now every pod in that namespace that doesn’t explicitly set a service account — including pods deployed by developers who aren’t thinking about RBAC — inherits that binding.

# Create a dedicated SA for each application
kubectl create serviceaccount app-backend -n production

# Check what any SA can currently do — use this in every audit
kubectl auth can-i --list --as=system:serviceaccount:production:app-backend -n production

# Check a specific action
kubectl auth can-i get secrets \
  --as=system:serviceaccount:production:app-backend -n production

kubectl auth can-i create pods \
  --as=system:serviceaccount:production:app-backend -n production

Disable Auto-Mounting the SA Token

By default, Kubernetes mounts the service account token into every pod at /var/run/secrets/kubernetes.io/serviceaccount/token. A pod that doesn’t need to call the Kubernetes API doesn’t need this token. Having it mounted increases the blast radius if the pod is compromised — the token can be used to call the K8s API with whatever RBAC permissions the SA has.

# Disable at the pod level
apiVersion: v1
kind: Pod
spec:
  automountServiceAccountToken: false
  serviceAccountName: app-backend
  containers:
  - name: app
    image: my-app:latest

---
# Or at the service account level (applies to all pods using this SA)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
automountServiceAccountToken: false

For most application pods — anything that isn’t a Kubernetes operator, controller, or management tool — the K8s API token is unnecessary. Disable it.
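When both the pod and its service account set automountServiceAccountToken, the pod-level value takes precedence. A small sketch of that resolution order follows; the function name token_is_mounted is hypothetical, and None stands for a field left unset.

```python
from typing import Optional


def token_is_mounted(pod_setting: Optional[bool], sa_setting: Optional[bool]) -> bool:
    """Resolve the effective automountServiceAccountToken value for a pod."""
    if pod_setting is not None:      # the pod spec wins when it is set
        return pod_setting
    if sa_setting is not None:       # otherwise fall back to the ServiceAccount
        return sa_setting
    return True                      # nothing set anywhere: the token is mounted


print(token_is_mounted(None, None))    # True
print(token_is_mounted(None, False))   # False
print(token_is_mounted(True, False))   # True
```

The last case is the one to watch for in audits: disabling the mount on the service account does not help if an individual pod spec turns it back on.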


Human Access to Kubernetes — Get Off Client Certificates

Kubernetes doesn’t manage human users natively. Authentication is delegated to an external mechanism. The most common approaches:

Method | Notes
X.509 client certificates | Common for initial cluster setup; credentials are embedded in kubeconfig; cannot be revoked without rotating the CA
Static bearer tokens | Long-lived; avoid
OIDC via external IdP | Preferred for human access: supports SSO, MFA, and revocation via IdP
Webhook auth | Flexible, requires custom infrastructure

X.509 certificates are the bootstrap pattern. kubeadm and many self-managed installers generate an admin kubeconfig with an embedded client certificate. The problem: Kubernetes has no certificate revocation mechanism, so you can’t revoke an individual certificate without rotating the CA. If you’re giving human engineers access via client certificates, someone leaving doesn’t actually lose cluster access until the certificate expires or the CA is rotated.

OIDC is the right model. Configure the kube-apiserver to accept JWTs from your IdP, bind RBAC permissions to groups from the IdP, and revocation becomes “remove from IdP group” rather than “hope the certificate expires soon”:

# kube-apiserver flags for OIDC (managed clusters configure this via provider settings)
--oidc-issuer-url=https://accounts.google.com
--oidc-client-id=my-cluster-client-id
--oidc-username-claim=email
--oidc-groups-claim=groups
--oidc-groups-prefix=oidc:

# User's kubeconfig: uses an exec plugin (kubelogin, run as the kubectl oidc-login plugin) to fetch an OIDC token
users:
- name: alice
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      command: kubectl
      args:
        - oidc-login
        - get-token
        - --oidc-issuer-url=https://dex.company.com
        - --oidc-client-id=kubernetes
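It helps to see what identity the API server ends up with after those flags are applied, because that is the string RBAC bindings have to match. The sketch below is a simplified model, not the API server's code: the function name kube_identity and the sample claims are invented, and it assumes the email claim is used as the username with no username prefix configured.

```python
def kube_identity(claims, username_claim="email", groups_claim="groups",
                  groups_prefix="oidc:"):
    """Map decoded ID-token claims to the (username, groups) RBAC will see."""
    username = claims[username_claim]
    # Every group from the token gets the configured prefix, so IdP groups
    # cannot collide with built-in names such as system:masters.
    groups = [groups_prefix + g for g in claims.get(groups_claim, [])]
    return username, groups


# Hypothetical decoded token payload.
claims = {"email": "alice@example.com", "groups": ["platform-team", "oncall"]}

print(kube_identity(claims))
# ('alice@example.com', ['oidc:platform-team', 'oidc:oncall'])
```

With the flags shown above, a binding therefore has to name the group as oidc:platform-team, not platform-team. A binding that omits the prefix silently matches nobody.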

With managed clusters:

# EKS: add IAM role as a cluster access entry (replaces the aws-auth ConfigMap)
aws eks create-access-entry \
  --cluster-name my-cluster \
  --principal-arn arn:aws:iam::123456789012:role/DevTeamRole \
  --type STANDARD

aws eks associate-access-policy \
  --cluster-name my-cluster \
  --principal-arn arn:aws:iam::123456789012:role/DevTeamRole \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy \
  --access-scope type=namespace,namespaces=production,staging

# GKE: get credentials; IAM roles map to cluster permissions
gcloud container clusters get-credentials my-cluster --region us-central1
# roles/container.developer → edit permissions
# But: use ClusterRoleBindings for fine-grained control rather than relying on GCP IAM roles

# AKS: bind Entra ID groups to Kubernetes RBAC
az aks get-credentials --name my-aks --resource-group rg-prod
kubectl create clusterrolebinding dev-team-view \
  --clusterrole=view \
  --group=ENTRA_GROUP_OBJECT_ID

Cloud IAM + Kubernetes RBAC: The Integration Points

EKS Pod Identity / IRSA (revisited)

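With IRSA, the annotation on the Kubernetes ServiceAccount is the bridge (EKS Pod Identity creates the same link through a pod identity association rather than an annotation):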

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/AppBackendRole

Kubernetes RBAC controls what the pod can do inside the cluster. The IAM role controls what the pod can do in AWS. Both must be explicitly granted; neither inherits from the other.

GKE Workload Identity

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    iam.gke.io/gcp-service-account: GSA_NAME@PROJECT_ID.iam.gserviceaccount.com

AKS Workload Identity

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: production
  annotations:
    azure.workload.identity/client-id: "MANAGED_IDENTITY_CLIENT_ID"
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    azure.workload.identity/use: "true"
spec:
  serviceAccountName: app-backend

RBAC Audit — What to Check First

# Start here: who has cluster-admin?
kubectl get clusterrolebindings -o json | \
  jq '.items[] | select(.roleRef.name=="cluster-admin") | 
      {binding: .metadata.name, subjects: .subjects}'
# cluster-admin should bind to almost nobody — review every result

# Find ClusterRoles with wildcard permissions
kubectl get clusterroles -o json | \
  jq -r '.items[] | select(any(.rules[]?.verbs[]?; . == "*")) | .metadata.name'

# What can the default SA do in each namespace?
# (--list prints a table of resources and verbs, not yes/no answers, so nothing is filtered out)
for ns in $(kubectl get namespaces -o name | cut -d/ -f2); do
  echo "=== $ns ==="
  kubectl auth can-i --list --as="system:serviceaccount:${ns}:default" -n "${ns}" 2>/dev/null \
    | head -10
done

# What can a specific SA do?
kubectl auth can-i --list \
  --as=system:serviceaccount:production:app-backend \
  -n production

# Check whether an SA can escalate — key risk indicators
kubectl auth can-i get secrets -n production \
  --as=system:serviceaccount:production:app-backend
kubectl auth can-i create pods -n production \
  --as=system:serviceaccount:production:app-backend
kubectl auth can-i create rolebindings -n production \
  --as=system:serviceaccount:production:app-backend

Creating pods and creating rolebindings are privilege escalation primitives. A service account that can create pods can run a pod with a different, more powerful SA. A service account that can create rolebindings can grant itself more permissions.
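Those checks can be automated over the rules of any Role or ClusterRole. The following is a rough sketch rather than a complete escalation scanner: the RISKY set and the function name risky_grants are my own choices, and for brevity it matches on verbs and resources only, ignoring apiGroups and resourceNames.

```python
# (verb, resource) pairs treated here as escalation or data-exposure primitives.
RISKY = {
    ("create", "pods"),
    ("create", "rolebindings"),
    ("get", "secrets"),
    ("list", "secrets"),
}


def risky_grants(rules):
    """Return the sorted risky 'verb resource' pairs granted by RBAC rules."""
    found = set()
    for rule in rules:
        verbs = rule.get("verbs", [])
        resources = rule.get("resources", [])
        for verb, resource in RISKY:
            verb_ok = verb in verbs or "*" in verbs
            resource_ok = resource in resources or "*" in resources
            if verb_ok and resource_ok:
                found.add(f"{verb} {resource}")
    return sorted(found)


# Hypothetical rules, in the shape of the .rules field of a Role.
rules = [
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "create"]},
    {"apiGroups": [""], "resources": ["configmaps"], "verbs": ["*"]},
]

print(risky_grants(rules))  # ['create pods']
```

A wildcard on either side counts as a match, so a rule with verbs ["*"] on secrets is reported as both get secrets and list secrets.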

Useful Tools

# rbac-tool — visualize and analyze RBAC (install: kubectl krew install rbac-tool)
kubectl rbac-tool viz                              # generate a graph of all bindings
kubectl rbac-tool who-can get secrets -n production
kubectl rbac-tool lookup alice@example.com

# rakkess: access matrix for a subject (the krew plugin is installed and invoked as access-matrix)
kubectl access-matrix --sa production:app-backend

# audit2rbac — generate minimal RBAC from audit logs
audit2rbac --filename /var/log/kubernetes/audit.log \
  --serviceaccount production:app-backend

Common RBAC Misconfigurations

Misconfiguration | Risk | Fix
cluster-admin bound to application SA | Full cluster takeover from compromised pod | Minimal ClusterRole; scope to namespace where possible
list or wildcard on secrets | Read all secrets in scope, including credentials and API keys | Grant get on specific named secrets only
default SA with non-trivial permissions | Every pod in the namespace inherits the permission | Bind permissions to dedicated SAs; automountServiceAccountToken: false on default
ClusterRoleBinding for namespace-scoped work | Namespace work with cluster-wide permission | Always prefer RoleBinding; ClusterRoleBinding only for genuinely cluster-wide needs
Binding users by username string | Hard to revoke; doesn’t sync with IdP | Bind groups from IdP; revocation propagates through group membership
SA can create pods or create rolebindings | Privilege escalation path | Audit and remove these from non-privileged SAs

Framework Alignment

Framework Reference | What It Covers Here
CISSP Domain 5: Identity and Access Management | Kubernetes RBAC operates as a full IAM system at the platform layer, independent of cloud IAM
CISSP Domain 3: Security Architecture | Two independent authorization layers (cloud + K8s) must each be designed and audited; one does not compensate for the other
ISO 27001:2022 5.15 Access control | Kubernetes RBAC Roles, ClusterRoles, and bindings implement access control within the container platform
ISO 27001:2022 5.18 Access rights | Service account provisioning, OIDC-based human access, and workload identity integration with cloud IAM
ISO 27001:2022 8.2 Privileged access rights | cluster-admin and wildcard RBAC bindings represent the highest-privilege grants in Kubernetes
SOC 2 CC6.1 | Kubernetes RBAC is the access control mechanism for the container platform layer in CC6.1
SOC 2 CC6.3 | Binding revocation, SA token disabling, and OIDC group-based access removal satisfy CC6.3 requirements

Key Takeaways

  • Kubernetes RBAC and cloud IAM are separate authorization layers — both must be secured; strong cloud IAM with weak K8s RBAC is still a vulnerable cluster
  • cluster-admin bindings are the first thing to audit in any cluster — the blast radius of a compromised pod with cluster-admin is the entire cluster
  • Disable automountServiceAccountToken on service accounts and pods that don’t call the Kubernetes API — most application pods don’t need it
  • Use OIDC for human access rather than client certificates; revocation happens in the IdP and takes effect once the user’s current short-lived token expires
  • Bind groups from IdP rather than individual usernames; revocation propagates automatically when someone leaves
  • A service account that can create pods or create rolebindings is a privilege escalation path — audit for these in every namespace

What’s Next

EP12 is the capstone: Zero Trust IAM — how all the concepts in this series come together into an architecture that assumes nothing is implicitly trusted, verifies everything explicitly, and limits blast radius through least privilege enforced at every layer.


What Is Cloud IAM — and Why Every API Call Depends on It

Reading Time: 11 minutes

Meta Description: Understand what cloud IAM is and why every API call in AWS, GCP, and Azure hits a deny-by-default check — the foundational model behind all cloud access.




TL;DR

  • Cloud IAM is the system that decides whether any API call is allowed or denied — deny by default, explicit Allow required at every layer
  • Every API call answers four questions: Who? (Identity) What? (Action) On what? (Resource) Under what conditions? (Context)
  • Two identity types in every cloud account: human (engineers) and machine (Lambda, EC2, Kubernetes pods) — machine identities outnumber human by 10:1 in most production environments
  • AWS, GCP, and Azure share the same model: deny-by-default, policy-driven, principal-based — different syntax, same mental model
  • The gap between granted and used permissions is where attackers move — the average IAM entity uses under 5% of its granted permissions
  • IAM failure has two modes: over-permissioned (“it works”) and over-restricted (“it’s secure, engineers work around it”) — both end in incidents

The Big Picture

                        WHAT IS CLOUD IAM?

  Every API call in AWS, GCP, or Azure answers four questions:

  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
  │    WHO?     │   │   WHAT?     │   │  ON WHAT?   │   │  UNDER      │
  │             │   │             │   │             │   │  WHAT?      │
  │  Identity / │   │  Action /   │   │  Resource   │   │             │
  │  Principal  │   │  Permission │   │             │   │  Condition  │
  │             │   │             │   │             │   │             │
  │ IAM Role    │   │ s3:GetObject│   │ arn:aws:s3: │   │ MFA: true   │
  │ Svc Account │   │ ec2:Start   │   │ ::prod-data │   │ IP: 10.0/8  │
  │ Managed     │   │ iam:        │   │ /exports/*  │   │ Time: 09-17 │
  │ Identity    │   │   PassRole  │   │             │   │             │
  └─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘
        └────────────────┴────────────────┴────────────────┘
                                  │
                     ┌────────────▼────────────┐
                     │    IAM Policy Engine    │
                     │    deny by default      │
                     │                         │
                     │  Explicit ALLOW?   ─────┼──→  PERMIT
                     │  Explicit DENY?    ─────┼──→  DENY (overrides Allow)
                     │  No matching rule? ─────┼──→  DENY (implicit)
                     └─────────────────────────┘

Cloud IAM is the answer to a question every growing infrastructure team hits: at scale, how do you know who can do what, why they can do it, and whether they still should?


Introduction

Cloud IAM (Identity and Access Management) is the control plane for access in every major cloud provider. Every API call — reading a file, starting an instance, invoking a function — goes through an IAM evaluation. The result is binary: explicit Allow or deny. There is no implicit access. Nothing is open by default. This is what makes cloud IAM fundamentally different from the access models that came before it.

Understanding why it works that way requires tracing how access control evolved — and what kept breaking at each stage.

A few years into my career managing Linux infrastructure, I was handed a production server audit. The task was straightforward: find out who had access to what. I pulled /etc/passwd, checked the sudoers file, reviewed SSH authorized_keys across the fleet.

Three days later, I had a spreadsheet nobody wanted to read.

The problem wasn’t that the access was wrong. Most of it was fine. The problem was that nobody — not the team lead, not the security team, not the engineers who’d been there five years — could tell me why a particular account had access to a particular server. It had accumulated. People joined, got access, changed teams, left. The access stayed.

That was a 40-server fleet in 2012.

Fast-forward to a cloud environment today: you might have 50 engineers, 300 Lambda functions, 20 microservices, CI/CD pipelines, third-party integrations, compliance scanners — all making API calls, all needing access to something. The identity sprawl problem I spent three days auditing manually on 40 servers now exists at a scale where manual auditing isn’t even a conversation.

This is the problem Identity and Access Management exists to solve. Not just in theory — in practice, at the scale cloud infrastructure demands.


How We Got Here — The Evolution of Access Control

To understand why cloud IAM works the way it does, you need to trace how access control evolved. The design decisions in AWS IAM, GCP, and Azure didn’t come out of nowhere. They’re answers to lessons learned the hard way across decades of broken systems.

The Unix Model (1970s–1990s): Simple and Sufficient

Unix got the fundamentals right early. Every resource (file, device, process) has an owner and a group. Every action is one of three: read, write, execute. Every user is either the owner, in the group, or everyone else.

-rw-r--r--  1 vamshi  engineers  4096 Apr 11 09:00 deploy.conf
# owner can read/write | group can read | others can read

For a single machine or a small network, this model is elegant. The permissions are visible in the output of ls -l. Reasoning about access is straightforward. Auditing means reading a few files.

However, the cracks started showing when organizations grew. You’d add sudo to give specific commands to specific users. Then sudoers files became 300 lines long. Then you’d have shared accounts because managing individual ones was “too much overhead.” Shared accounts mean no individual accountability. No accountability means no audit trail worth anything.

The Directory Era (1990s–2000s): Centralise or Collapse

As networks grew, every server managing its own /etc/passwd became untenable. Enter LDAP and Active Directory. Instead of distributing identity management across every machine, you centralised it: one directory, one place to add users, one place to disable them when someone left.

This was a significant step forward. Onboarding got faster. Offboarding became reliable. Group membership drove access to resources across the network.

Why Groups Became the New Problem

But the permission model was still coarse. You were either in the Domain Admins group or you weren’t. “Read access to the file share” was a group. “Deploy to the staging web server” was a group. Managing fine-grained permissions at scale meant managing hundreds of groups, and the groups themselves became the audit nightmare.

I spent time in environments like this. The group named SG_Prod_App_ReadWrite_v2_FINAL that nobody could explain. The AD group from a project that ended three years ago but was still in twenty user accounts. The contractor whose AD account was disabled but whose service account was still running a nightly job.

The directory model centralised identity. It didn’t solve the permissions sprawl problem.

The Cloud Shift (2006–2014): Everything Changes

AWS launched EC2 in 2006. In 2011, AWS IAM went into general availability. That date matters — for the first five years of AWS, access control was primitive. Root accounts. Access keys. No roles.

Early AWS environments I’ve seen (and had to clean up) reflect this era: a single root account access key shared across a team, rotated manually on a shared spreadsheet. Static credentials in application config files. EC2 instances with AdministratorAccess because “it was easier at the time.”

The Model That Changed Everything

The AWS team understood what they’d built was dangerous. IAM in 2011 introduced the model that all three major cloud providers now share: deny-by-default, policy-driven, principal-based access control. Not “who is in which group.” The question became: which policy explicitly grants this specific action on this specific resource to this specific identity.

GCP launched its IAM model with a different flavour in 2012 — hierarchical, additive, binding-based. Azure RBAC came to general availability in 2014, built on top of Active Directory’s identity model.

By 2015, the modern cloud IAM era was established. The primitives existed. The problem shifted from “does IAM exist?” to “are we using it correctly?” — and most teams were not.

In practice, that question is still the right one to ask today.


The Problem IAM Actually Solves

Here’s the honest version of what IAM is for, based on what I’ve seen go wrong without it.

Without proper IAM, you get one of two outcomes:

The first is what I call the “it works” environment. Everything runs. The developers are happy. Access requests take five minutes because everyone gets the same broad policy. And then a Lambda function’s execution role — which had s3:* on * because someone once needed to debug something — gets its credentials exposed through an SSRF vulnerability in the app it runs. That role can now read every bucket in the account, including the one with the customer database exports.

The second is the “it’s secure” environment. Access is locked down. Every request goes through a ticket. The ticket goes to a security team that approves it in three to five business days. Engineers work around it by storing credentials locally. The workarounds become the real access model. The formal IAM posture and the actual access posture diverge. The audit finds the formal one. Attackers find the real one.

IAM, done right, is the discipline of walking the line between those two outcomes. It’s not a product you buy or a feature you turn on. It’s a practice — a continuous process of defining what access exists, why it exists, and whether it’s still needed.


The Core Concepts — Taught, Not Listed

Let me walk you through the vocabulary you need, grounded in what each concept means in practice.

Identity: Who Is Making This Request?

An identity is any entity that can hold a credential and make requests. In cloud environments, identities split into two types:

Human identities are engineers, operators, and developers. They authenticate via the console, CLI, or SDK. They should ideally authenticate through a central IdP (Okta, Google Workspace, Entra ID) using federation — more on that in SAML vs OIDC: Which Federation Protocol Belongs in Your Cloud?.

Machine identities are everything else: Lambda functions, EC2 instances, Kubernetes pods, CI/CD pipelines, monitoring agents, data pipelines. In most production environments, machine identities outnumber human identities by 10:1 or more.

This ratio matters. When your security model is designed primarily for human access, the 90% of identities that are machines become an afterthought. That’s where access keys end up in environment variables, where Lambda functions get broad permissions because nobody thought carefully about what they actually need, where the real attack surface lives.

Principal: The Authenticated Identity Making a Specific Request

A principal is an identity that has been authenticated and is currently making a request. The distinction from “identity” is subtle but important: the principal includes the context of how the identity authenticated.

In AWS, an IAM role assumed by EC2, assumed by a Lambda, and assumed by a developer’s CLI session are three different principals — even if they all assume the same role. The session context, source, and expiration differ.

{
  "Principal": {
    "AWS": "arn:aws:iam::123456789012:role/DataPipelineRole"
  }
}

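In GCP, the equivalent term is principal (called member in older documentation, and still visible as the members field of an IAM policy binding). In Azure, it’s security principal: a user, group, service principal, or managed identity.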

Resource: What Is Being Accessed?

A resource is whatever is being acted upon. In AWS, every resource has an ARN (Amazon Resource Name) — a globally unique identifier.

arn:aws:s3:::customer-data-prod          # S3 bucket
arn:aws:s3:::customer-data-prod/*        # everything inside that bucket
arn:aws:ec2:ap-south-1:123456789012:instance/i-0abcdef1234567890
arn:aws:iam::123456789012:role/DataPipelineRole

The ARN structure tells you: service, region, account, resource type, resource name. Once you can read ARNs fluently, IAM policies become much less intimidating.
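Reading ARNs gets easier once you see them split into fields. Here is a minimal Python sketch; the Arn type and parse_arn function are illustrative helpers, not part of boto3 or any AWS SDK. Note the partition field (aws in the examples above), which sits between the arn prefix and the service.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str   # "aws" in the standard commercial regions
    service: str     # s3, ec2, iam, ...
    region: str      # empty for global services such as S3 buckets and IAM
    account: str     # empty for S3 bucket ARNs
    resource: str    # resource type and name, format varies by service


def parse_arn(arn: str) -> Arn:
    """Split an ARN into its five fields after the literal 'arn' prefix."""
    # Split at most 5 times so colons inside the resource part are preserved.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return Arn(*parts[1:])


bucket = parse_arn("arn:aws:s3:::customer-data-prod")
instance = parse_arn("arn:aws:ec2:ap-south-1:123456789012:instance/i-0abcdef1234567890")

print(bucket.service, repr(bucket.region), bucket.resource)
# s3 '' customer-data-prod
print(instance.region, instance.account, instance.resource)
# ap-south-1 123456789012 instance/i-0abcdef1234567890
```

The empty region and account fields in the S3 example explain the run of colons in bucket ARNs: the fields are still there, they are just blank.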

Action: What Is Being Done?

An action (AWS/Azure) or permission (GCP) is the operation being attempted. Cloud providers express these as service:Operation strings:

# AWS
s3:GetObject           # read a specific object
s3:PutObject           # write an object
s3:DeleteObject        # delete an object — treat differently than read
iam:PassRole           # assign a role to a service — one of the most dangerous permissions
ec2:DescribeInstances  # list instances — often overlooked, but reveals infrastructure

# GCP
storage.objects.get
storage.objects.create
iam.serviceAccounts.actAs   # impersonate a service account — equivalent to iam:PassRole danger

When I audit IAM configurations, I pay special attention to any policy that includes iam:*, iam:PassRole, or wildcards like "Action": "*". These are the permissions that let a compromised identity create new identities, assign itself more power, or impersonate other accounts. They’re the privilege escalation primitives — more on that in AWS IAM Privilege Escalation: How iam:PassRole Leads to Full Compromise.

Policy: The Document That Connects Everything

A policy is a document that says: this principal can perform these actions on these resources, under these conditions.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadCustomerDataBucket",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::customer-data-prod",
        "arn:aws:s3:::customer-data-prod/*"
      ]
    }
  ]
}

Notice what’s explicit here: the effect (Allow), the exact actions (not s3:*), and the exact resource (not *). Every word in this document is a deliberate decision. The moment you start using wildcards to save typing, you’re writing technical debt that will come back as a security incident.


How IAM Actually Works — The Decision Flow

When any API call hits a cloud service, an IAM engine evaluates it. Understanding this flow is the foundation of debugging access issues, and more importantly, of understanding why your security posture is what it is.

Request arrives:
  Action:    s3:PutObject
  Resource:  arn:aws:s3:::customer-data-prod/exports/2026-04-11.csv
  Principal: arn:aws:iam::123456789012:role/DataPipelineRole
  Context:   { source_ip: "10.0.2.15", mfa: false, time: "02:30 UTC" }

IAM Engine evaluation (AWS):
  1. Is there an explicit Deny anywhere? → No
  2. Does the SCP (if any) allow this? → Yes
  3. Does the identity-based policy allow this? → Yes (via DataPipelinePolicy)
  4. Does the resource-based policy (bucket policy) allow or deny? → No matching statement → not required for same-account access; the identity-based Allow is sufficient
  5. Is there a permissions boundary? → No
  Decision: ALLOW

The critical insight here: cloud IAM is deny-by-default. There is no implicit allow. If there is no policy that explicitly grants s3:PutObject to this role on this bucket, the request fails. The only way in is through an explicit "Effect": "Allow".
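The deny-by-default rule can be sketched in a few lines of Python. This is a deliberately simplified model, not the real AWS evaluation engine: it looks only at a flat list of statements and ignores SCPs, permissions boundaries, resource-based policies, conditions, and NotAction. The `evaluate` function and its return strings are hypothetical names chosen for this example.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    return [value] if isinstance(value, str) else list(value)


def _matches(stmt, action, resource):
    # Action names are case-insensitive; resource ARNs are compared as written.
    action_hit = any(
        fnmatchcase(action.lower(), p.lower()) for p in _as_list(stmt["Action"])
    )
    resource_hit = any(fnmatchcase(resource, p) for p in _as_list(stmt["Resource"]))
    return action_hit and resource_hit


def evaluate(statements, action, resource):
    matching = [s for s in statements if _matches(s, action, resource)]
    if any(s["Effect"] == "Deny" for s in matching):
        return "DENY (explicit)"   # an explicit Deny always wins
    if any(s["Effect"] == "Allow" for s in matching):
        return "ALLOW"             # the only way in
    return "DENY (implicit)"       # nothing matched: deny by default


allow_exports = {
    "Effect": "Allow",
    "Action": "s3:PutObject",
    "Resource": "arn:aws:s3:::customer-data-prod/exports/*",
}
deny_all_s3 = {
    "Effect": "Deny",
    "Action": "s3:*",
    "Resource": "arn:aws:s3:::customer-data-prod/*",
}
target = "arn:aws:s3:::customer-data-prod/exports/2026-04-11.csv"

print(evaluate([allow_exports], "s3:PutObject", target))
print(evaluate([allow_exports], "s3:DeleteObject", target))
print(evaluate([allow_exports, deny_all_s3], "s3:PutObject", target))
```

The three calls exercise the three possible outcomes: a matching Allow, no matching statement at all, and a Deny that overrides an Allow.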

This is the opposite of how many traditional systems behave. On a typical Unix system, the default umask of 022 creates new files world-readable (-rw-r--r--), so anyone can read them unless you actively restrict access. In cloud IAM, nothing is accessible unless you actively grant it.

When I’m debugging an AccessDenied error — and every engineer who works with cloud IAM spends significant time doing this — the mental model is always: “what is the chain of explicit Allows that should be granting this access, and at which layer is it missing?”


Why This Is Harder Than It Looks

Understanding the concepts is the easy part. The hard part is everything that happens at organisational scale over time.

Scale. A real AWS account in a growing company might have 600+ IAM roles, 300+ policies, and 40+ cross-account trust relationships. None of these were designed together. They evolved incrementally, each change made by someone who understood the context at the time and may have left the organisation since. The cumulative effect is an IAM configuration that no single person fully understands.

Drift. IAM configs don’t stay clean. An engineer needs to debug a production issue at 2 AM and grants themselves broad access temporarily. The temporary access never gets revoked. Multiply that by a team of 20 over three years. I’ve audited environments where 60% of the permissions in a role had never been used — not once — in the 90-day CloudTrail window. That unused 60% is pure attack surface.

The machine identity blind spot. Most IAM governance practices were built for human users. Service accounts, Lambda roles, and CI/CD pipeline identities get created rapidly and reviewed rarely. In my experience, these are the identities most likely to have excess permissions, least likely to be in the access review process, and most likely to be the initial foothold in a cloud breach.

The gap between granted and used. This one surprised me most when I first started doing cloud security work. AWS data from real customer accounts shows the average IAM entity uses less than 5% of its granted permissions. That 95% excess isn’t just waste; it’s attack surface. Every permission that exists but isn’t needed is a permission an attacker can use if they compromise that identity.
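The granted-versus-used gap is simple set arithmetic once you have both lists. The Python sketch below assumes you have already collected the granted actions (from the role’s policies) and the used actions (from your audit logs); both sets here are made-up sample data, and `unused_permissions` is a hypothetical helper.

```python
def unused_permissions(granted, used):
    """Return the unused actions (sorted) and the unused fraction of the grant."""
    unused = granted - used
    ratio = len(unused) / len(granted) if granted else 0.0
    return sorted(unused), ratio


# Sample data for illustration only.
granted = {
    "s3:GetObject", "s3:PutObject", "s3:DeleteObject",
    "iam:PassRole", "ec2:DescribeInstances",
}
used = {"s3:GetObject", "s3:PutObject"}

unused, ratio = unused_permissions(granted, used)
print(unused)
print(f"{ratio:.0%} of granted permissions were never used")
```

The sorted list of unused actions is the candidate set for removal; the ratio is only a headline number for tracking drift over time.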


IAM Across AWS, GCP, and Azure — The Conceptual Map

The three major providers implement IAM differently in syntax, but the same model underlies all of them. Once you understand one deeply, the others become a translation exercise.

Concept | AWS | GCP | Azure
Identity store | IAM users / roles | Google accounts, Workspace | Entra ID
Machine identity | IAM Role (via instance profile or AssumeRole) | Service Account | Managed Identity
Access grant mechanism | Policy document attached to identity or resource | IAM binding on resource (member + role + condition) | Role Assignment (principal + role + scope)
Hierarchy | Account is the boundary; Org via SCPs | Org → Folder → Project → Resource | Tenant → Management Group → Subscription → Resource Group → Resource
Default stance | Deny | Deny | Deny
Wildcard risk | "Action": "*" on "Resource": "*" | Primitive roles (viewer/editor/owner) | Owner or Contributor assigned broadly

The hierarchy point is worth pausing on. AWS is relatively flat — the account is the primary security boundary. GCP’s hierarchy means a binding at the Organisation level propagates down to every project. Azure’s hierarchy means a role assignment at the Management Group level flows through every subscription beneath it.

The blast radius of a misconfiguration scales with how high in the hierarchy it sits.
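One way to see this is to model the hierarchy as a tree and count the resources that inherit a grant made at each level. The Python sketch below uses an invented GCP-style hierarchy; the node names and the `blast_radius` helper are assumptions for illustration only.

```python
# Hypothetical hierarchy: each node maps to its children; leaves are resources.
HIERARCHY = {
    "org": ["folder-prod", "folder-dev"],
    "folder-prod": ["project-payments", "project-data"],
    "folder-dev": ["project-sandbox"],
    "project-payments": ["bucket-invoices", "db-ledger"],
    "project-data": ["bucket-exports"],
    "project-sandbox": ["bucket-scratch"],
}


def blast_radius(node):
    """Return every leaf resource that inherits a grant made at `node`."""
    children = HIERARCHY.get(node, [])
    if not children:
        return [node]  # a leaf: the grant affects only this resource
    affected = []
    for child in children:
        affected.extend(blast_radius(child))
    return affected


for level in ("org", "folder-prod", "project-data"):
    print(level, len(blast_radius(level)))
```

The same role granted at `org` reaches every resource in the tree, while at `project-data` it reaches one.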

This will matter in GCP IAM Policy Inheritance and Azure RBAC Explained when we go deep on GCP and Azure specifically. For now, the takeaway is: understand where in the hierarchy a permission is granted, because the same permission granted at the wrong level has a very different security implication.


Framework Alignment

If you’re mapping this episode to a control framework — for a compliance audit, a certification study, or building a security program — here’s where it lands:

Framework | Reference | What It Covers Here
CISSP | Domain 1: Security & Risk Management | IAM as a risk reduction control; blast radius is a risk variable
CISSP | Domain 5: Identity and Access Management | Direct implementation: who can do what, to which resources, under what conditions
ISO 27001:2022 | 5.15 Access control | Policy requirements for restricting access to information and systems
ISO 27001:2022 | 5.16 Identity management | Managing the full lifecycle of identities in the organization
ISO 27001:2022 | 5.18 Access rights | Provisioning, review, and removal of access rights
SOC 2 | CC6.1 | Logical access security controls to protect against unauthorized access
SOC 2 | CC6.3 | Access removal and review processes to limit unauthorized access

Key Takeaways

  • IAM evolved from Unix file permissions → directory services → cloud policy engines, driven by scale and the failure modes of each prior model
  • Cloud IAM is deny-by-default: every access requires an explicit Allow somewhere in the policy chain
  • Identities are human or machine; in production, machines dominate — and they’re the under-governed majority
  • A policy binds a principal to actions on resources; every word is a deliberate security decision
  • The hardest IAM problems aren’t technical — they’re organisational: drift, unused permissions, machine identities nobody owns, and access reviews that never happen
  • The gap between granted and used permissions is where attackers find room to move

What’s Next

Now that you understand what IAM is and why it exists, the next question is the one that trips up even experienced engineers: what’s the difference between authentication and authorization, and why does conflating them cause security failures?

EP02 works through both — how cloud providers implement each, where the boundary sits, and why getting this boundary wrong creates exploitable gaps.

Next: Authentication vs Authorization: AWS AccessDenied Explained
