Kubernetes CRDs in Production: Finalizers, Status Conditions, and RBAC Patterns

Reading Time: 8 minutes

Kubernetes CRDs & Operators: Extending the API, Episode 10
What Is a CRD? · CRDs You Already Use · CRD Anatomy · Write Your First CRD · CEL Validation · Controller Loop · Build an Operator · CRD Versioning · Admission Webhooks · CRDs in Production


TL;DR

  • Finalizers block deletion until cleanup completes — they prevent orphaned external resources but cause stuck objects if the controller crashes mid-cleanup; always implement a removal timeout
  • Status conditions are the standard communication channel between controller and user: use type, status, reason, message, and observedGeneration on every condition; never invent ad-hoc status fields
  • Owner references wire automatic garbage collection — when the parent custom resource is deleted, Kubernetes deletes owned child objects; use them for every object your controller creates in the same namespace
  • RBAC for CRDs in multi-tenant clusters must include separate ClusterRoles for controller, editor, and viewer; grant status and finalizers as separate sub-resources; never give application teams cluster-scoped create/delete on CRDs
  • The three most common Kubernetes CRD production failure modes: finalizer death loop, status thrash, and CRD deletion cascade — all avoidable with the patterns in this episode
  • Every CRD on a healthy cluster should report the Established condition as True in status.conditions; a CRD that never becomes Established rejects create requests for its resource type

The Big Picture

  PRODUCTION CRD LIFECYCLE: FULL PICTURE

  Create         Reconcile        Suspend/Resume      Delete
  ──────         ─────────        ──────────────      ──────
  User applies   Controller       User patches         User deletes
  BackupPolicy   creates CronJob, spec.suspended=true  BackupPolicy
      │          sets status          │                    │
      ▼              │                ▼                    ▼
  Admission      │           Controller          Finalizer blocks
  webhook        │           suspends CronJob     deletion
  (if any)       │                               Controller:
      │          │                                 1. Delete CronJob
      ▼          ▼                                 2. Remove external state
  Schema       Status                              3. Remove finalizer
  validation   conditions                          Object deleted from etcd
      │        updated
      ▼
  Controller
  reconcile
  triggered

Kubernetes CRD production readiness is not just about making the happy path work — it is about designing for the failure modes: controllers crashing mid-operation, deletion races, and status messages that confuse operators at 2am.


Finalizers: Controlled Deletion

A finalizer is a string in metadata.finalizers. Kubernetes will not delete an object that has finalizers, regardless of who issues the delete command.

metadata:
  name: nightly
  namespace: demo
  finalizers:
    - storage.example.com/backup-cleanup  # ← your controller put this here

When kubectl delete bp nightly runs:

  1. API server sets metadata.deletionTimestamp  (does NOT delete yet)
  2. Object is visible as "Terminating"
  3. Controller sees deletionTimestamp set
  4. Controller runs cleanup:
       - delete backup data from S3
       - delete CronJob (or let owner references handle it)
       - release any external locks
  5. Controller removes the finalizer:
       patch bp nightly --type=json \
         -p '[{"op":"remove","path":"/metadata/finalizers/0"}]'
  6. API server sees finalizers list is now empty → deletes the object

Adding a finalizer in Go

const finalizerName = "storage.example.com/backup-cleanup"

func (r *BackupPolicyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    bp := &storagev1alpha1.BackupPolicy{}
    if err := r.Get(ctx, req.NamespacedName, bp); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Deletion path
    if !bp.DeletionTimestamp.IsZero() {
        if controllerutil.ContainsFinalizer(bp, finalizerName) {
            if err := r.cleanupExternalResources(ctx, bp); err != nil {
                return ctrl.Result{}, err
            }
            controllerutil.RemoveFinalizer(bp, finalizerName)
            if err := r.Update(ctx, bp); err != nil {
                return ctrl.Result{}, err
            }
        }
        return ctrl.Result{}, nil
    }

    // Normal path: ensure finalizer is present
    if !controllerutil.ContainsFinalizer(bp, finalizerName) {
        controllerutil.AddFinalizer(bp, finalizerName)
        if err := r.Update(ctx, bp); err != nil {
            return ctrl.Result{}, err
        }
    }

    // ... rest of reconcile
    return ctrl.Result{}, nil
}

Finalizer death loop and the timeout pattern

If cleanupExternalResources always returns an error (external system down, bug in cleanup code), the object gets stuck in Terminating forever. The operator cannot delete it; kubectl delete --force does not help with finalizers.

Prevention: add a cleanup deadline with status tracking.

func (r *BackupPolicyReconciler) cleanupExternalResources(ctx context.Context, bp *storagev1alpha1.BackupPolicy) error {
    // Check if we've been trying to clean up for too long
    if bp.DeletionTimestamp != nil {
        deadline := bp.DeletionTimestamp.Add(10 * time.Minute)
        if time.Now().After(deadline) {
            // Log the failure, abandon cleanup, let the object be deleted.
            log.FromContext(ctx).Error(nil, "cleanup deadline exceeded, removing finalizer anyway",
                "name", bp.Name)
            return nil   // returning nil removes the finalizer
        }
    }
    // ... actual cleanup
    return nil
}

Recovery for a stuck object (use only when cleanup truly cannot succeed):

kubectl patch bp nightly -n demo --type=json \
  -p '[{"op":"remove","path":"/metadata/finalizers"}]'

Status Conditions: The Right Way

The Kubernetes standard condition format is defined in k8s.io/apimachinery/pkg/apis/meta/v1.Condition:

type Condition struct {
    Type               string          // e.g. "Ready", "Synced", "Degraded"
    Status             ConditionStatus // "True", "False", "Unknown"
    ObservedGeneration int64           // the .metadata.generation this condition reflects
    LastTransitionTime metav1.Time     // when Status last changed
    Reason             string          // machine-readable, CamelCase, e.g. "CronJobCreated"
    Message            string          // human-readable, may contain details
}

Standard condition types

  Type          Meaning
  ────          ───────
  Ready         The resource is fully reconciled and operational
  Synced        The resource has been synced with an external system
  Progressing   An operation is actively in progress
  Degraded      The resource is operating in a reduced capacity
Use Ready: True only when the full reconcile is complete and the resource is functional. Use Ready: False with a clear Message when reconcile fails or is blocked.
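
What users actually see on the object: an illustrative conditions array, roughly as kubectl get bp nightly -o yaml would render it (field values are examples, not output from a real cluster):

status:
  conditions:
    - type: Ready
      status: "True"
      observedGeneration: 5
      lastTransitionTime: "2025-01-15T02:00:00Z"
      reason: ReconcileSuccess
      message: CronJob nightly-backup is scheduled and active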

Setting conditions in Go

meta.SetStatusCondition(&bp.Status.Conditions, metav1.Condition{
    Type:               "Ready",
    Status:             metav1.ConditionFalse,
    ObservedGeneration: bp.Generation,
    Reason:             "CronJobCreateFailed",
    Message:            fmt.Sprintf("failed to create CronJob: %v", err),
})
if err := r.Status().Update(ctx, bp); err != nil {
    return ctrl.Result{}, err
}

meta.SetStatusCondition handles deduplication — it updates an existing condition of the same Type rather than appending a duplicate.

observedGeneration is critical

metadata.generation      = 5   (increments on every spec change)
status.observedGeneration = 3  (set by controller on each reconcile)

If observedGeneration < generation:
  → controller has not yet reconciled the latest spec change
  → status.conditions reflect an older state
  → do NOT alert based on conditions that lag generation

Always set ObservedGeneration: bp.Generation when writing status conditions. Tooling (Argo CD, Flux, kubectl wait) depends on this to know whether status is current.
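
A minimal consumer-side guard, assuming the bp object and imports from the reconciler above; treat a lagging condition as unknown rather than trusting it:

// A condition is only trustworthy if it reflects the spec you are looking at.
cond := meta.FindStatusCondition(bp.Status.Conditions, "Ready")
switch {
case cond == nil || cond.ObservedGeneration < bp.Generation:
    // stale or missing: the controller has not reconciled this spec yet
case cond.Status == metav1.ConditionTrue:
    // current and Ready
default:
    // current but not Ready: inspect cond.Reason and cond.Message
}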

kubectl wait uses conditions

# Wait until BackupPolicy is Ready
kubectl wait bp/nightly -n demo \
  --for=condition=Ready \
  --timeout=60s

This works because kubectl wait reads the status.conditions array.


Owner References: Automatic Garbage Collection

Owner references wire a parent-child relationship between Kubernetes objects. When the parent is deleted, Kubernetes garbage-collects all owned children automatically.

metadata:
  name: nightly-backup       # CronJob
  ownerReferences:
    - apiVersion: storage.example.com/v1alpha1
      kind: BackupPolicy
      name: nightly
      uid: a1b2c3d4-...
      controller: true          # only one owner can be the controller
      blockOwnerDeletion: true  # with foreground deletion, the owner is not removed until this child is gone

Set in Go using ctrl.SetControllerReference:

if err := ctrl.SetControllerReference(bp, cronJob, r.Scheme); err != nil {
    return ctrl.Result{}, err
}

Owner reference rules

  • A namespaced owner must be in the same namespace as the objects it owns — cross-namespace owner references are disallowed, and a cluster-scoped object can never be owned by a namespaced one
  • Only one object can be the controller: true owner; others can be non-controller owners
  • Deleting the owner cascades to deleting owned objects — this is garbage collection, not finalizer-based cleanup

Without owner references, deleting a BackupPolicy leaves the CronJob as an orphan. This is hard to detect and accumulates over time.
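
Putting it together in the reconciler, a sketch that assumes a Schedule field on the BackupPolicy spec plus the imports batchv1 (k8s.io/api/batch/v1) and apierrors (k8s.io/apimachinery/pkg/api/errors):

cronJob := &batchv1.CronJob{
    ObjectMeta: metav1.ObjectMeta{
        Name:      bp.Name + "-backup",
        Namespace: bp.Namespace, // same namespace as the owner
    },
    Spec: batchv1.CronJobSpec{
        Schedule: bp.Spec.Schedule, // illustrative spec field
        // ... jobTemplate elided
    },
}
// Stamps the ownerReference (controller: true) onto the child before creation.
if err := ctrl.SetControllerReference(bp, cronJob, r.Scheme); err != nil {
    return ctrl.Result{}, err
}
if err := r.Create(ctx, cronJob); err != nil && !apierrors.IsAlreadyExists(err) {
    return ctrl.Result{}, err
}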


RBAC Patterns for Multi-Tenant CRD Usage

A production CRD deployment needs three distinct RBAC roles:

# 1. Controller role — full access for the operator
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: backuppolicy-controller
rules:
  - apiGroups: ["storage.example.com"]
    resources: ["backuppolicies"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: ["storage.example.com"]
    resources: ["backuppolicies/status"]
    verbs: ["get", "update", "patch"]
  - apiGroups: ["storage.example.com"]
    resources: ["backuppolicies/finalizers"]
    verbs: ["update"]
  - apiGroups: ["batch"]
    resources: ["cronjobs"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
# 2. Editor role — for application teams (namespaced binding)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: backuppolicy-editor
rules:
  - apiGroups: ["storage.example.com"]
    resources: ["backuppolicies"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # No status write — only the controller writes status
  # No finalizers write — prevents deletion blocking by non-controllers
---
# 3. Viewer role — for audit, monitoring
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: backuppolicy-viewer
rules:
  - apiGroups: ["storage.example.com"]
    resources: ["backuppolicies"]
    verbs: ["get", "list", "watch"]

Bind editor/viewer roles at namespace scope, not cluster scope:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-alpha-backup-editor
  namespace: team-alpha
subjects:
  - kind: Group
    name: team-alpha
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: backuppolicy-editor
  apiGroup: rbac.authorization.k8s.io

This pattern gives team-alpha full control over BackupPolicies in their namespace but no access to other namespaces — standard Kubernetes multi-tenancy.
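
You can verify the resulting access with kubectl auth can-i; the user name alice is illustrative:

# team-alpha can manage BackupPolicies in its own namespace...
kubectl auth can-i create backuppolicies.storage.example.com \
  --as alice --as-group team-alpha -n team-alpha   # yes

# ...but not in other namespaces, and never the status subresource
kubectl auth can-i create backuppolicies.storage.example.com \
  --as alice --as-group team-alpha -n team-beta    # no
kubectl auth can-i update backuppolicies.storage.example.com \
  --subresource=status --as alice --as-group team-alpha -n team-alpha  # no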


The Three Production Failure Modes

1. Finalizer death loop

Symptoms: Object stuck in Terminating for hours; kubectl get bp nightly -o yaml shows deletionTimestamp set, but the object never goes away.

Cause: cleanupExternalResources always returns an error.

Detection:

kubectl get bp nightly -n demo -o jsonpath='{.metadata.deletionTimestamp}'
# non-empty = stuck in termination
kubectl describe bp nightly -n demo
# look for repeated reconcile error events

Fix: Add cleanup deadline in controller; use kubectl patch to remove finalizer as last resort.

2. Status thrash

Symptoms: Controller sets Ready: True, then Ready: False, then Ready: True in a rapid loop. Alert noise, confusing dashboards.

Cause: The controller's own status write triggers a new watch event; combined with cache lag, each reconcile sees a "change", rewrites status, and flips the condition again.

Fix: Set ObservedGeneration on every condition and skip the write when nothing actually changed. In controller-runtime, filter watch events with predicate.GenerationChangedPredicate so status-only updates do not trigger another reconcile.

// Only write status if something actually changed
current := meta.FindStatusCondition(bp.Status.Conditions, "Ready")
if current == nil || current.Status != desired.Status ||
    current.Reason != desired.Reason || current.ObservedGeneration != desired.ObservedGeneration {
    meta.SetStatusCondition(&bp.Status.Conditions, desired)
    if err := r.Status().Update(ctx, bp); err != nil {
        return ctrl.Result{}, err
    }
}

3. CRD deletion cascade

Symptoms: A team deletes a CRD for cleanup purposes; all instances across all namespaces disappear silently.

Cause: kubectl delete crd backuppolicies.storage.example.com — the API server cascades the deletion to all custom resources of that type.

Prevention:
– Block direct CRD deletion with an admission policy that rejects DELETE on production CRDs (see the sketch after this list)
– Use GitOps (Argo CD, Flux) to manage CRD installation — a deleted CRD is automatically re-applied from the Git source, but the custom resources it carried are not
– Back up CRDs and instances with Velero or equivalent before any CRD management operations
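
One way to implement that guardrail is a ValidatingAdmissionPolicy. This is a minimal sketch (the v1 API is GA since Kubernetes 1.30; older clusters need a validating webhook instead) with illustrative names:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: deny-crd-delete
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["apiextensions.k8s.io"]
        apiVersions: ["v1"]
        operations: ["DELETE"]
        resources: ["customresourcedefinitions"]
  validations:
    - expression: "false"   # reject every matching request
      message: "CRD deletion is blocked; remove the policy binding first."
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: deny-crd-delete-binding
spec:
  policyName: deny-crd-delete
  validationActions: ["Deny"]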


Production Readiness Checklist

CRD DEFINITION
  □ spec.versions has exactly one storage: true version
  □ Status subresource enabled (subresources.status: {})
  □ additionalPrinterColumns includes Ready column from status.conditions
  □ OpenAPI schema defines required fields and types
  □ CEL rules cover cross-field constraints

CONTROLLER
  □ Owner references set on all child resources
  □ Finalizer logic includes cleanup deadline
  □ Status conditions use standard format with observedGeneration
  □ Reconcile function is idempotent
  □ Not-found errors handled cleanly (return nil, not error)
  □ At least 2 replicas with leader election enabled

RBAC
  □ Three ClusterRoles: controller, editor, viewer
  □ Status and finalizers are separate RBAC sub-resources
  □ Editor/viewer bound at namespace scope, not cluster scope
  □ Controller ServiceAccount has only necessary permissions

OPERATIONS
  □ CRD installed via GitOps or Helm (not manual kubectl apply)
  □ Backup of CRDs and instances included in cluster backup
  □ kubectl get crds shows Established: True for all CRDs
  □ Monitoring for stuck Terminating objects (finalizer deadlock)
  □ Alert on controller reconcile error rate, not just pod health

⚠ Common Mistakes

Granting update on backuppolicies but not backuppolicies/status to the controller. If the controller cannot write status, status updates silently fail. The controller appears to run but status conditions never update. Grant both backuppolicies (for spec/metadata writes) and backuppolicies/status (for the status subresource path).

Setting Ready: True before all owned resources are healthy. If the controller sets Ready: True after creating the CronJob but before verifying the CronJob is actually active, users see a false-positive health signal. Only set Ready: True when you have confirmed the desired state is actually achieved.
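
A sketch of that verification step, assuming one CronJob child per policy; setNotReady is a hypothetical helper that writes a Ready=False condition and returns:

// Confirm the child exists and is not suspended before reporting Ready.
var cj batchv1.CronJob
key := types.NamespacedName{Name: bp.Name + "-backup", Namespace: bp.Namespace}
if err := r.Get(ctx, key, &cj); err != nil {
    return r.setNotReady(ctx, bp, "CronJobMissing", err.Error()) // hypothetical helper
}
if cj.Spec.Suspend != nil && *cj.Spec.Suspend {
    return r.setNotReady(ctx, bp, "CronJobSuspended", "child CronJob is suspended")
}
// Only now is Ready: True justified.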

Not setting observedGeneration on status conditions. Tools like Argo CD and kubectl wait --for=condition=Ready will report incorrect health status if observedGeneration is stale. Always set ObservedGeneration: obj.Generation in every condition write.

Using kubectl delete crd in a production cluster without a backup. This is irreversible. Treat CRDs as production-critical infrastructure — require GitOps review, backup verification, and team approval before any CRD deletion.


Quick Reference

# Check for stuck Terminating objects (field selectors do not support deletionTimestamp)
kubectl get backuppolicies -A -o json | jq -r \
  '.items[] | select(.metadata.deletionTimestamp != null) | "\(.metadata.namespace)/\(.metadata.name)"'

# Force-remove a stuck finalizer (use only when cleanup is truly impossible)
kubectl patch bp nightly -n demo --type=json \
  -p '[{"op":"remove","path":"/metadata/finalizers/0"}]'

# Check all CRDs are Established
kubectl get crds -o jsonpath='{range .items[*]}{.metadata.name} {.status.conditions[?(@.type=="Established")].status}{"\n"}{end}'

# Watch status conditions update during reconcile
kubectl get bp nightly -n demo -w -o \
  jsonpath='{.status.conditions[?(@.type=="Ready")].status} {.status.conditions[?(@.type=="Ready")].message}{"\n"}'

# Verify owner references are set on child CronJob
kubectl get cronjob nightly-backup -n demo \
  -o jsonpath='{.metadata.ownerReferences}'

# List all objects owned by a BackupPolicy (by label)
kubectl get all -n demo -l backuppolicy=nightly

Key Takeaways

  • Finalizers block deletion until cleanup completes — always implement a cleanup deadline to prevent permanent stuck objects
  • Status conditions must use the standard format with observedGeneration — tooling depends on it for correctness
  • Owner references enable automatic garbage collection of child resources when the parent is deleted
  • RBAC needs three roles (controller, editor, viewer) with status and finalizers as separate sub-resources
  • The three production failure modes — finalizer death loop, status thrash, CRD deletion cascade — are all preventable with the patterns covered in this episode

Series Complete

You now have the full picture of Kubernetes CRDs and Operators: from understanding what a CRD is (EP01), through real examples (EP02), schema design (EP03), hands-on YAML (EP04), CEL validation (EP05), the controller loop (EP06), building an operator (EP07), versioning (EP08), admission webhooks (EP09), to production patterns in this episode.

The next series in the Kubernetes learning arc on linuxcent.com covers Kubernetes Networking Deep Dive — Services, Ingress, Gateway API, CNI, and eBPF networking. Subscribe below to get it when it launches.

Stay subscribed → linuxcent.com

IAM Roles vs Policies: How Cloud Authorization Actually Works

Reading Time: 12 minutes



What Is Cloud IAM · Authentication vs Authorization · IAM Roles vs Policies · AWS IAM Deep Dive · GCP Resource Hierarchy IAM · Azure RBAC Scopes


TL;DR

  • Every cloud permission is atomic: one action (s3:GetObject) on one resource class — the indivisible unit of access
  • Policies group permissions into documents with conditions; roles carry policies and are assigned to identities
  • Never attach policies directly to users — roles are the indirection layer that makes access auditable and revocable
  • AWS roles have two required configs: trust policy (who can assume) + permission policy (what they can do) — both must be right
  • GCP binds roles to resources; AWS attaches policies to identities — the mental models run in opposite directions
  • iam:PassRole in AWS and iam.serviceAccounts.actAs in GCP are privilege escalation vectors — always scope to specific ARNs, never *

The Big Picture

Three primitives underlie every cloud IAM system. Learn how they connect and any cloud access model becomes readable.

  THE THREE-LAYER STACK
  Build bottom-up. Assign top-down. Change one layer without touching the others.

  ┌──────────────────────────────────────────────────────────────────────┐
  │  LAYER 3 — IDENTITY                                                  │
  │  alice@example.com  ·  backend-service  ·  ci-runner@proj         │
  │  "who is acting — a human, a service, or a machine"                 │
  ├──────────────────────────────────────────────────────────────────────┤
  │  LAYER 2 — ROLE                                                      │
  │  BackendDeveloper  ·  DataAnalyst  ·  DeployBot  ·  S3ReadOnly      │
  │  "what function does this identity serve — the job title"           │
  ├──────────────────────────────────────────────────────────────────────┤
  │  LAYER 1 — POLICY                                                    │
  │  AllowS3Read  ·  AllowECRPush  ·  DenyProdDelete  ·  RequireMFA    │
  │  "what is explicitly permitted or denied, under what conditions"    │
  ├──────────────────────────────────────────────────────────────────────┤
  │  LAYER 0 — PERMISSION                                                │
  │  s3:GetObject  ·  ecr:PutImage  ·  s3:DeleteObject  ·  iam:PassRole│
  │  "one verb on one class of resource — the atom of access control"  │
  └──────────────────────────────────────────────────────────────────────┘

  When alice joins the backend team → assign her the BackendDeveloper role
  When the S3 bucket changes → update the policy once; alice gets it automatically
  When alice leaves → remove the role assignment; policy and permissions are untouched

If this maps better to something physical:

  PHYSICAL WORLD            →    CLOUD IAM

  A specific door rule           Permission      s3:GetObject
  Keycard access profile    →    Policy          AllowS3Read
  Job title                 →    Role            BackendDeveloper
  The employee              →    Identity        alice@example.com

  When the employee leaves: revoke the role assignment.
  The job title, the keycard profile, the door rules — all unchanged.
  Next hire gets the same role. Same access. No manual work.

Introduction

IAM roles vs policies is the distinction that defines how cloud authorization actually works — and getting it wrong is how access sprawl starts. Nearly every failure at the authorization layer traces back to how these three primitives are, or aren't, structured.

Every cloud IAM system — AWS, GCP, Azure — is built on the same three primitives: permissions, policies, and roles. Learn these well and any cloud provider becomes readable. Skip them and you spend years pattern-matching without understanding why anything is structured the way it is.

What Is Cloud IAM established the foundation: IAM is the system that governs who can access what in cloud infrastructure, and its default answer is always deny. Authentication vs Authorization: AWS AccessDenied Explained drew the line between authentication — proving identity — and authorization — proving you’re allowed to act. This episode is about the authorization layer specifically. These three building blocks are how authorization is expressed in practice.

Before walking through each one, here’s what access control looks like without any of this structure — because that’s the fastest way to understand why the layers exist.

In 2015 I inherited an AWS account from a 12-engineer team that had been building for 18 months. When I ran aws iam list-attached-user-policies across the 23 users, 17 had policies attached directly to the user object — not to groups, not to roles.

One engineer had left six months earlier. His access key was still active. Three policies still attached: read access to prod S3, write to a DynamoDB table, ability to invoke Lambda functions. When I asked what the DynamoDB table was for, nobody could tell me. The Lambda functions no longer existed.

That account wasn’t built by negligent engineers. It was built by engineers reaching for whatever granted access fastest, under deadline, without a framework. Permissions scattered. Nothing tracked. Nothing removed.

Roles, policies, and permissions are the framework that prevents that. Understanding them is the difference between an IAM configuration you can audit in an afternoon and one that takes a week and still leaves you uncertain.


What Are IAM Permissions? The Atomic Unit of Access Control

A permission is a single action on a class of resources. It is the most granular thing you can grant or deny — the atom of access control.

Cloud providers express permissions differently, but the structure is consistent: a service, a resource type, and an action verb.

# AWS: service:Action
s3:GetObject               # read an object from S3
ec2:StartInstances         # start EC2 instances
iam:PassRole               # assign a role to an AWS service — one of the most dangerous
kms:Decrypt                # use a KMS key to decrypt

# GCP: service.resource.verb
storage.objects.get
compute.instances.start
iam.serviceAccounts.actAs  # impersonate a service account — equivalent risk to iam:PassRole
cloudkms.cryptoKeyVersions.useToDecrypt

# Azure: Provider/ResourceType/Action
Microsoft.Storage/storageAccounts/blobServices/containers/read
Microsoft.Compute/virtualMachines/start/action
Microsoft.Authorization/roleAssignments/write   # grant roles — highest risk
Microsoft.KeyVault/vaults/secrets/getSecret/action

You generally don’t assign individual permissions directly to identities — that’s like handing someone 47 keys with no labels and expecting the system to remain auditable. Permissions are grouped into policies.


What Are IAM Policies? Grouping Permissions with Conditions

A policy is a document that groups permissions and defines the conditions under which they apply.

AWS policy structure

An AWS policy document is JSON. Every field is a deliberate decision:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowReadS3Backups",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": [
        "arn:aws:s3:::company-backups/2024/*",
        "arn:aws:s3:::company-backups/2025/*"
      ]
    },
    {
      "Sid": "AllowListBackupPrefixes",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::company-backups",
      "Condition": {
        "StringLike": { "s3:prefix": ["2024/*", "2025/*"] }
      }
    },
    {
      "Sid": "DenyDeleteEverywhere",
      "Effect": "Deny",
      "Action": "s3:DeleteObject",
      "Resource": "*"
    }
  ]
}

The Sid is a comment — use it. AllowReadS3Backups tells a future auditor why this statement exists. Statement1 is technical debt.

The Effect is either Allow or Deny. A Deny always wins — it cannot be overridden by any Allow anywhere in any policy on the same identity. If you have a Deny on s3:DeleteObject with "Resource": "*", nothing can grant delete access to that identity. This asymmetry is deliberate: it’s how guardrails work.

The Resource field is where access most often creeps wider than intended. "Resource": "*" on a write action means “every resource of this type in the account.” It works. It outlives the context that made it feel reasonable.

AWS policy types — which to reach for

┌──────────────────────────┬────────────────────────────┬────────────────────────────┐
│ Type                     │ Attached to                │ What it does               │
├──────────────────────────┼────────────────────────────┼────────────────────────────┤
│ Identity-based           │ User, Group, Role          │ What the identity can do   │
│ Resource-based           │ S3 bucket, KMS key, Lambda │ Who can touch this resource │
│ Permissions boundary     │ User or Role               │ Maximum possible — ceiling  │
│ Service Control Policy   │ AWS Org OU or Account      │ Org-level guardrail         │
│ Session policy           │ AssumeRole session         │ Restricts a specific session│
│ Resource Control Policy  │ AWS Org resources          │ Resource-level org guardrail│
└──────────────────────────┴────────────────────────────┴────────────────────────────┘

Critical: Permissions boundaries and SCPs do not grant permissions. They constrain them. A boundary that allows s3:* doesn’t mean the identity has S3 access. It means the identity can have at most S3 access, if an identity-based policy actually grants it. Many engineers set a boundary and expect it to work as a grant. It doesn’t.
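
A small illustration with hypothetical account and policy names; the effective access is the intersection of the two:

# Boundary: at most s3:* (grants nothing by itself)
aws iam put-user-permissions-boundary \
  --user-name ci-user \
  --permissions-boundary arn:aws:iam::123456789012:policy/S3OnlyBoundary

# Identity policy: s3:GetObject on one bucket
aws iam attach-user-policy \
  --user-name ci-user \
  --policy-arn arn:aws:iam::123456789012:policy/ReadAppDataBucket

# Effective access: s3:GetObject on that bucket, and nothing else.
# Anything the boundary allows but no identity policy grants is still denied.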

GCP policy bindings

GCP doesn’t attach policy documents to identities. Each resource has an IAM policy — a set of bindings mapping roles to members:

{
  "bindings": [
    {
      "role": "roles/storage.objectViewer",
      "members": [
        "user:[email protected]",
        "serviceAccount:[email protected]"
      ]
    },
    {
      "role": "roles/storage.objectCreator",
      "members": ["serviceAccount:[email protected]"],
      "condition": {
        "title": "Business hours only",
        "expression": "request.time.getHours('America/New_York') >= 9 && request.time.getHours('America/New_York') < 18"
      }
    }
  ]
}

The mental model shift: in AWS you ask “what can this identity do?” by looking at the identity. In GCP you ask “who can access this resource?” by looking at the resource. The question runs in the opposite direction.
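
The direction shows up directly in the tooling. Member and project names here are illustrative, and the last query needs the Cloud Asset Inventory API enabled:

# AWS: start from the identity
aws iam list-attached-user-policies --user-name alice

# GCP: start from the resource...
gcloud storage buckets get-iam-policy gs://app-data

# ...or search across resources for one member
gcloud asset search-all-iam-policies \
  --scope=projects/my-project \
  --query="policy:alice@example.com"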

Azure role definitions

Azure separates what a role grants (role definition) from who gets it where (role assignment). Define once, assign at multiple scopes.

{
  "Name": "Custom Storage Reader",
  "IsCustom": true,
  "Actions": [
    "Microsoft.Storage/storageAccounts/blobServices/containers/read",
    "Microsoft.Storage/storageAccounts/blobServices/generateUserDelegationKey/action"
  ],
  "DataActions": [
    "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"
  ],
  "AssignableScopes": ["/subscriptions/SUB_ID"]
}

Actions vs DataActions catches people. Actions are control plane — you can see the storage account exists. DataActions are data plane — you can read actual blob contents. A user with Actions can list the container but cannot read a single byte without a DataAction. Both planes must be covered for the access to be complete.


What Are IAM Roles? The Layer That Scales Access Control

A role is a collection of policies assigned to identities. It’s the indirection layer that makes access manageable at scale.

Going back to the 2015 account: the problem wasn’t that engineers had access — they needed it. The problem was that access was scattered across 23 individual user objects with no shared structure. That is exactly the problem What Is Cloud IAM establishes IAM exists to solve. Roles are the structural answer.

The role model solves this:

Policy: S3ReadAccess (s3:GetObject, s3:ListBucket on s3:::app-data/*)
  ↓ attached to
Role: BackendDeveloper
  ↓ assigned to
Users: alice, bob, charlie, dave (and six more)

When the bucket changes  → update one policy
When someone joins       → assign one role
When someone leaves      → remove one role
Access model stays coherent because it's structured.

AWS roles — the identity that issues temporary credentials

AWS roles are themselves IAM identities, not just permission containers. When something assumes a role, it gets temporary credentials from STS. Two things must be configured:

Trust policy — who can assume:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "ec2.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}

Without this, nobody can use the role regardless of its permissions. The trust policy is the gatekeeper.

Permission policy — what it can do:

aws iam create-role \
  --role-name AppServerRole \
  --assume-role-policy-document file://ec2-trust-policy.json

aws iam attach-role-policy \
  --role-name AppServerRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

When debugging “why can’t this Lambda/EC2/ECS task do X?”, the first thing I check is the trust policy. Many times the permission policy is correct — the service simply isn’t in the trust policy and cannot assume the role at all.
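
The two checks, in the order I run them; the ARNs are illustrative:

# 1. Who can assume the role? (trust policy)
aws iam get-role --role-name AppServerRole \
  --query 'Role.AssumeRolePolicyDocument'

# 2. What can it do once assumed? (permission policies)
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:role/AppServerRole \
  --action-names s3:GetObject \
  --resource-arns arn:aws:s3:::app-data/config.json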

GCP role types

┌──────────────────┬──────────────────────────────┬──────────────────────────────────┐
│ Type             │ Example                      │ When to use                      │
├──────────────────┼──────────────────────────────┼──────────────────────────────────┤
│ Basic/Primitive  │ roles/editor, roles/owner    │ Never in production              │
│ Predefined       │ roles/storage.objectViewer   │ Default — service-specific       │
│ Custom           │ Your org defines             │ When predefined is too broad     │
└──────────────────┴──────────────────────────────┴──────────────────────────────────┘

roles/editor at the project level grants write access to almost every GCP service. I’ve seen it granted “temporarily” and found it attached six months later. Always use predefined roles.

# Find the right predefined role
gcloud iam roles list --filter="name:roles/storage" --format="table(name,title)"

# See exactly what permissions it includes
gcloud iam roles describe roles/storage.objectViewer

# Create a custom role when predefined is still too broad
cat > custom-log-reader.yaml << 'EOF'
title: "Log Reader"
description: "Read application logs — nothing else"
stage: "GA"
includedPermissions:
  - logging.logEntries.list
  - logging.logs.list
  - logging.logMetrics.get
EOF
gcloud iam roles create LogReader --project=my-project --file=custom-log-reader.yaml

Azure built-in and custom roles

# List built-in roles containing "Storage"
az role definition list --output table | grep Storage

# View what a built-in role grants
az role definition list --name "Storage Blob Data Reader"

# Create a custom role
az role definition create --role-definition custom-app-storage.json

# Assign at a specific scope
az role assignment create \
  --assignee alice@example.com \
  --role "Storage Blob Data Reader" \
  --scope /subscriptions/SUB_ID/resourceGroups/rg-prod/providers/\
Microsoft.Storage/storageAccounts/prodstore

RBAC vs ABAC: Which Access Control Model to Use

RBAC — Role-Based Access Control

The dominant model. Access flows from role membership:

alice     ∈ BackendDeveloper  →  s3:GetObject on app-data/*
bob       ∈ DataAnalyst       →  athena:* on analytics-queries
ci-runner ∈ DeployRole        →  ecr:PutImage, ecs:UpdateService

RBAC degrades two ways: role explosion (200 roles, nobody can explain what they all do) and coarse roles (avoid explosion by making roles broad, now BackendDeveloper has prod access with no distinction from dev). Both look the same on a spreadsheet — lots of access, no clear principle.

ABAC — Attribute-Based Access Control

ABAC grants access based on attributes of the principal, resource, or environment — not role membership. This one policy replaced 12 team-specific policies in one account:

{
  "Effect": "Allow",
  "Action": "ec2:*",
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "aws:ResourceTag/Team": "${aws:PrincipalTag/Team}"
    }
  }
}

An engineer tagged Team=Platform can only act on EC2 resources tagged Team=Platform. Add a new team — tag their resources and their identity. No new policy. No new role.

The risk is tag drift. If someone tags a resource incorrectly, the access model breaks silently. In practice, I use ABAC for environment and team scoping, and explicit policies for sensitive services like KMS and IAM. How these primitives combine in a full AWS account is covered in the AWS IAM deep dive.
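
Wiring that up is two tagging operations, with illustrative names:

# Tag the identity (role tags become aws:PrincipalTag values)...
aws iam tag-role --role-name PlatformEngineer \
  --tags Key=Team,Value=Platform

# ...and the resources it should reach
aws ec2 create-tags --resources i-0abc123def456789a \
  --tags Key=Team,Value=Platform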

Conditions — when context determines access

// Require MFA for any IAM or Organizations action
{
  "Effect": "Deny",
  "Action": ["iam:*", "organizations:*"],
  "Resource": "*",
  "Condition": { "BoolIfExists": { "aws:MultiFactorAuthPresent": "false" } }
}

// Restrict to corporate IP range
{
  "Effect": "Deny",
  "Action": "*",
  "Resource": "*",
  "Condition": {
    "NotIpAddress": { "aws:SourceIp": ["10.0.0.0/8", "172.16.0.0/12"] }
  }
}

The MFA condition is in every account I manage. A compromised API key without an MFA session can’t escalate IAM privileges — the Deny blocks it at the condition level. This single statement meaningfully reduces the blast radius of a credential compromise.


⚠ Production Gotchas

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 1 — Policies attached directly to users                 ║
║                                                                      ║
║  Feels fast. Creates the exact problem from 2015: access scattered  ║
║  across individual user objects with no shared structure.            ║
║  When the user leaves, their policies don't follow — they stay.     ║
║                                                                      ║
║  Fix: always use roles. Attach policies to roles. Assign roles to   ║
║  users. The role outlives the person.                               ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 2 — Using AWS managed policies in production            ║
║                                                                      ║
║  AmazonS3FullAccess grants s3:* on *. For a Lambda that reads one  ║
║  specific bucket, that's ~30 permissions you didn't need, all live. ║
║                                                                      ║
║  Fix: create customer managed policies scoped to the specific       ║
║  actions and ARNs the workload actually uses.                       ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 3 — iam:PassRole with "Resource": "*"                   ║
║                                                                      ║
║  iam:PassRole lets an identity assign a role to an AWS service.     ║
║  With Resource: *, it can pass ANY role — including ones with more  ║
║  permissions than it currently has. That is a privilege escalation. ║
║                                                                      ║
║  Fix: always scope iam:PassRole to a specific role ARN:             ║
║  "Resource": "arn:aws:iam::ACCOUNT:role/SpecificRoleName"          ║
╚══════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════╗
║  ⚠  GOTCHA 4 — Permissions boundary ≠ policy grant                 ║
║                                                                      ║
║  Setting a boundary that allows s3:* does NOT grant S3 access.     ║
║  The boundary is a ceiling — it limits maximum possible permissions. ║
║  The identity-based policy still needs to explicitly Allow the      ║
║  action. Both must be present for the access to work.               ║
╚══════════════════════════════════════════════════════════════════════╝

Cross-Cloud Rosetta Stone

Same concepts, different names and different directions. Bookmark this table.

┌─────────────────────────┬──────────────────────────┬──────────────────────────┬──────────────────────────┐
│ Concept                 │ AWS                      │ GCP                      │ Azure                    │
├─────────────────────────┼──────────────────────────┼──────────────────────────┼──────────────────────────┤
│ Atomic permission       │ s3:GetObject             │ storage.objects.get      │ .../blobs/read           │
│ Permission document     │ Policy (JSON)            │ (built into role def)    │ Role Definition          │
│ Access grant            │ Policy attachment        │ IAM Binding              │ Role Assignment          │
│ Job-function identity   │ IAM Role                 │ Predefined Role          │ Built-in Role            │
│ Non-human identity      │ IAM Role (assumed)       │ Service Account          │ Managed Identity         │
│ Org-level guardrail     │ SCP                      │ Org Policy               │ Management Group Policy  │
│ Permission ceiling      │ Permissions Boundary     │ —                        │ —                        │
│ Session restriction     │ Session Policy           │ —                        │ —                        │
│ Attribute-based grant   │ Tag conditions in policy │ IAM Conditions           │ Conditions in assignment │
└─────────────────────────┴──────────────────────────┴──────────────────────────┴──────────────────────────┘

Quick Reference

┌──────────────────────────┬────────────────────────────────────────────────────────────┐
│ Term                     │ What it is                                                 │
├──────────────────────────┼────────────────────────────────────────────────────────────┤
│ Permission               │ Atomic: one action on one resource class                   │
│ Policy                   │ Document grouping permissions + conditions                 │
│ Role (AWS)               │ Assumable identity — carries policies, issues temp creds   │
│ Trust policy (AWS)       │ Who can assume this role — separate from permissions       │
│ Permissions boundary     │ Ceiling — limits max possible permissions; does not grant  │
│ SCP                      │ Org guardrail — constrains all identities in scope         │
│ IAM Binding (GCP)        │ Maps a role to a member on a specific resource             │
│ Role Assignment (Azure)  │ Grants a role definition at a specific scope               │
│ ABAC                     │ Access by tag/attribute — one policy replaces many roles   │
│ RBAC                     │ Access by role membership — clean until roles proliferate  │
│ iam:PassRole             │ Privilege escalation vector — always scope to specific ARN │
└──────────────────────────┴────────────────────────────────────────────────────────────┘

Commands to know:
┌────────────────────────────────────────────────────────────────────────────────┐
│  # AWS — list policies attached to a role                                     │
│  aws iam list-attached-role-policies --role-name MyRole                       │
│                                                                                │
│  # AWS — view what a managed policy actually grants                           │
│  aws iam get-policy-version \                                                  │
│    --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \              │
│    --version-id v1                                                             │
│                                                                                │
│  # AWS — who can assume this role?                                            │
│  aws iam get-role --role-name MyRole --query 'Role.AssumeRolePolicyDocument'  │
│                                                                                │
│  # GCP — view the IAM policy on a project                                    │
│  gcloud projects get-iam-policy PROJECT_ID --format=json                      │
│                                                                                │
│  # GCP — list all roles and what permissions they include                    │
│  gcloud iam roles describe roles/storage.objectViewer                         │
│                                                                                │
│  # Azure — list role assignments in a subscription                           │
│  az role assignment list --all --output table                                 │
│                                                                                │
│  # Azure — view exactly what a built-in role grants                          │
│  az role definition list --name "Storage Blob Data Reader"                   │
└────────────────────────────────────────────────────────────────────────────────┘

Framework Alignment

┌──────────────────┬───────────────────────────────────────────┬──────────────────────────────────────────────────┐
│ Framework        │ Reference                                 │ What It Covers Here                              │
├──────────────────┼───────────────────────────────────────────┼──────────────────────────────────────────────────┤
│ CISSP            │ Domain 5 — Identity and Access Management │ RBAC and ABAC are the implementation models for  │
│                  │                                           │ authorization at scale                           │
│ CISSP            │ Domain 1 — Security & Risk Management     │ Role design implements separation of duties and  │
│                  │                                           │ least privilege                                  │
│ ISO 27001:2022   │ 5.15 Access control                       │ Access control policy — roles and policies are   │
│                  │                                           │ the mechanism                                    │
│ ISO 27001:2022   │ 5.18 Access rights                        │ Provisioning, review, and removal of access      │
│                  │                                           │ rights — roles make this auditable               │
│ ISO 27001:2022   │ 8.2 Privileged access rights              │ Permissions boundaries and conditions applied    │
│                  │                                           │ to elevated access                               │
│ SOC 2            │ CC6.1 Logical access security             │ Policy documents are the technical               │
│                  │                                           │ implementation                                   │
│ SOC 2            │ CC6.3 Access revocation                   │ Role-based model makes removal consistent and    │
│                  │                                           │ auditable                                        │
└──────────────────┴───────────────────────────────────────────┴──────────────────────────────────────────────────┘

Key Takeaways

  • Permissions are atomic — one action on one resource class. Policies group permissions. Roles carry policies for assignment
  • AWS roles have two required configs: trust policy (who can assume) and permission policy (what it can do) — both must be correct
  • GCP binds roles to resources; AWS attaches policies to identities — the mental model runs in opposite directions
  • Azure separates role definition (what) from role assignment (who, where) — define once, assign at multiple scopes
  • RBAC scales through role design; ABAC scales through tag/attribute conditions — use ABAC where roles would proliferate
  • iam:PassRole and iam.serviceAccounts.actAs are privilege escalation vectors — scope them to specific ARNs, never *
  • Conditions add context (MFA, IP, tags, time) to policies — the MFA condition on IAM actions is essential in every account

What’s Next

EP04 goes deep on AWS IAM — the most complex of the three cloud models. Policy evaluation order, cross-account trust, permissions boundaries in practice, SCPs, and IAM Identity Center for human access. We’ll work through the patterns that make AWS IAM maintainable at production scale.

Next: AWS IAM Deep Dive: Users, Groups, Roles, and Policies Explained

Get the AWS IAM deep dive in your inbox when it publishes → linuxcent.com/subscribe