eBPF Program Types — What’s Actually Running on Your Nodes

eBPF: From Kernel to Cloud, Episode 4
Earlier in this series: What Is eBPF? · The BPF Verifier · eBPF vs Kernel Modules


By Episode 3, we’d covered what eBPF is, why the verifier makes it safe for production, and why it has replaced kernel modules for observability workloads. What we hadn’t answered — and what a 2am incident eventually forced me to confront — is what kind of eBPF programs are actually running on your nodes, and why the difference matters when something breaks.

A pod in production was dropping roughly one in fifty outbound TCP connections. Not all of them — just enough to cause intermittent timeouts in the application logs. NetworkPolicy showed egress allowed. Cilium reported no violations. Running curl manually from inside the pod worked every time.

I spent the better part of three hours eliminating possibilities. DNS. MTU. Node-level conntrack table exhaustion. Upstream firewall rules. Nothing.

Eventually, almost as an afterthought, I ran this:

sudo bpftool prog list

There were two TC programs attached to that pod’s veth interface. One from the current Cilium version. One from the previous version — left behind by a rolling upgrade that hadn’t cleaned up properly. Two programs. Different policy state. One was occasionally dropping packets based on rules that no longer existed in the current policy model.

The answer had been sitting in the kernel the whole time. I just didn’t know where to look.

That incident forced me to actually understand something I’d been hand-waving for two years: eBPF isn’t a single hook. It’s a family of program types, each attached to a different location in the kernel, each seeing different data, each suited for different problems. Understanding the difference is what separates “I run Cilium and Falco” from “I understand what Cilium and Falco are actually doing on my nodes” — and that difference matters when something breaks at 2am.

The Command You Should Run on Your Cluster Right Now

Before getting into the theory, do this:

# See every eBPF program loaded on the node
sudo bpftool prog list

# See every eBPF program attached to a network interface
sudo bpftool net list

On a node running Cilium and Falco, you’ll see something like this:

42: xdp               name cil_xdp_entry      loaded_at 2026-04-01T09:23:41
43: sched_cls         name cil_from_netdev    loaded_at 2026-04-01T09:23:41
44: sched_cls         name cil_to_netdev      loaded_at 2026-04-01T09:23:41
51: cgroup_sock_addr  name cil_sock4_connect  loaded_at 2026-04-01T09:23:41
88: raw_tracepoint    name sys_enter          loaded_at 2026-04-01T09:23:55
89: raw_tracepoint    name sys_exit           loaded_at 2026-04-01T09:23:55

Each line is a different program type. Each one fires at a different point in the kernel. The type column — xdp, sched_cls, raw_tracepoint, cgroup_sock_addr — tells you where in the kernel execution path that program is attached and therefore what it can and cannot see.

If you see more programs than you expect on a specific interface — like I did — that’s your first clue.
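As a quick sketch of why that type column is worth reading, here is how you might tally a node's programs by type from output like the sample above. The parsing is naive and the sample lines are illustrative; bpftool's text format is not a stable interface, so treat this as a debugging aid, not tooling:

```python
# Tally eBPF programs by type from bpftool-style "prog list" text.
# Field 1 on each line is the program type (xdp, sched_cls, ...).
from collections import Counter

sample = """\
42: xdp            name cil_xdp_entry    loaded_at 2026-04-01T09:23:41
43: sched_cls      name cil_from_netdev  loaded_at 2026-04-01T09:23:41
88: raw_tracepoint name sys_enter        loaded_at 2026-04-01T09:23:55
"""

types = Counter(line.split()[1] for line in sample.splitlines())
print(dict(types))  # {'xdp': 1, 'sched_cls': 1, 'raw_tracepoint': 1}
```

An unexpected entry in a tally like this (two sched_cls programs where you expect one, say) is exactly the clue described above.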

Why Program Types Exist

The Linux kernel isn’t a single pipeline. Network packets, system calls, file operations, process scheduling — these all run through different subsystems with different execution contexts and different available data.

eBPF lets you attach programs to specific points within those subsystems. The “program type” is the contract: it defines where the hook fires, what data the program receives, and what it’s allowed to do with it. A program designed to process network packets before they hit the kernel stack looks completely different from one designed to intercept system calls across all containers simultaneously.

Most of us will interact with four or five program types through the tools we already run. Understanding what each one actually is — where it sits, what it sees — is what makes you effective when those tools behave unexpectedly.

The Types Behind the Tools You Already Use

TC — Why Cilium Can Tell Which Pod Sent a Packet

TC stands for Traffic Control. It’s where Cilium enforces your NetworkPolicy, and it’s what caused my incident.

TC programs attach to network interfaces — specifically to the ingress and egress directions of the pod’s virtual interface (lxcXXXXX in Cilium’s naming). They fire after the kernel has already processed the packet enough to know its context: which socket created it, which cgroup that socket belongs to. Cgroup maps to container, container maps to pod.

This is the critical piece: TC is how Cilium knows which pod a packet belongs to. Without that cgroup context, per-pod policy enforcement isn’t possible.

# See TC programs on a pod's veth interface
sudo tc filter show dev lxc12345 ingress
sudo tc filter show dev lxc12345 egress

# If you see two entries on the same direction — that's the incident I described
# The priority number (pref 1, pref 2) tells you the order they run

When there are two TC programs on the same interface, the first one to return “drop” wins. The second program never runs. This is why the issue was intermittent rather than consistent — the stale program only matched specific connection patterns.
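The stacking behavior is easy to model. This toy simulation (the program names and the port-based rule are invented for illustration) shows why a stale filter produces intermittent rather than consistent drops:

```python
# Hypothetical model of stacked TC filters: programs run in ascending
# priority (pref) order, and the first DROP verdict is final.

def run_tc_chain(programs, packet):
    """programs: list of (priority, fn); fn returns 'OK' or 'DROP'."""
    for _, prog in sorted(programs, key=lambda p: p[0]):
        if prog(packet) == "DROP":
            return "DROP"  # later programs never see the packet
    return "OK"

# pref 1: the current policy allows all outbound traffic
current = lambda pkt: "OK"

# pref 2: a stale program still dropping one destination port
stale = lambda pkt: "DROP" if pkt["dport"] == 8443 else "OK"

chain = [(1, current), (2, stale)]
print(run_tc_chain(chain, {"dport": 443}))   # OK
print(run_tc_chain(chain, {"dport": 8443}))  # DROP: only matching traffic is affected
```

Because the stale rule matches only specific traffic, most connections sail through, which is what makes this failure mode so hard to spot from the application side.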

Fixing it is straightforward once you know what to look for:

# Remove a stale TC filter by its priority number
sudo tc filter del dev lxc12345 egress pref 2

Add this check to your post-upgrade runbook. Cilium upgrades are generally clean but not always.

XDP — Why Cilium Doesn’t Use TC for Everything

If TC is good enough for pod-level policy, why does Cilium also run an XDP program on the node’s main interface? Look at the bpftool prog list output again — there’s an xdp program loaded alongside the TC programs.

XDP fires earlier. Much earlier. Before the kernel allocates any memory for the packet. Before routing. Before connection tracking. Before anything.

The tradeoff is exactly what you’d expect: XDP is fast but context-poor. It sees raw packet bytes. It doesn’t know which pod the packet came from. It can’t read cgroup information because no socket buffer has been allocated yet.

Cilium uses XDP specifically for ClusterIP service load balancing — when a packet arrives at the node destined for a service VIP, XDP rewrites the destination to the actual pod IP in a single map lookup and sends it on its way. No iptables. No conntrack. The work is done before the kernel stack is involved.
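In miniature, that service translation is a map lookup followed by a destination rewrite. This sketch invents the VIP, the backend IPs, and the round-robin selection; Cilium's real datapath uses BPF maps and its own backend-selection logic:

```python
# Illustrative sketch of XDP-style service load balancing: one lookup
# rewrites the service VIP to a backend pod IP before the kernel stack runs.
from itertools import cycle

service_map = {
    ("10.96.0.10", 443): ["10.0.1.23", "10.0.2.41"],  # VIP -> backend pods (invented)
}
_rr = {k: cycle(v) for k, v in service_map.items()}  # naive round-robin state

def xdp_lb(packet):
    key = (packet["dst_ip"], packet["dst_port"])
    if key not in service_map:
        return packet  # not a service VIP: pass through untouched
    # Rewrite the destination and send it on (XDP_TX / redirect in the real thing)
    return dict(packet, dst_ip=next(_rr[key]))

pkt = xdp_lb({"dst_ip": "10.96.0.10", "dst_port": 443})
print(pkt["dst_ip"])  # a backend pod IP, not the VIP
```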

There’s a silent failure mode worth knowing about here. XDP runs in one of two modes:

  • Native mode — runs inside the NIC driver itself, before any kernel allocation. This is where the performance comes from.
  • Generic mode — fallback when the NIC driver doesn’t support XDP. Runs later, after sk_buff allocation. No performance benefit over iptables.

If your NIC doesn’t support native XDP, Cilium silently falls back to generic mode. The policy still works — but the performance characteristics you assumed aren’t there.

# Check which XDP mode is active on your node's main interface
ip link show eth0 | grep xdp
# xdpdrv  ← native mode (fast)
# xdpgeneric ← generic mode (no perf benefit)

Most cloud provider instance types with modern Mellanox/Intel NICs support native mode. Worth verifying rather than assuming.
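If you want to script that check across a fleet, the classification boils down to which flag appears in the ip link output. A sketch, assuming iproute2's common flag names (xdpdrv, xdpgeneric, xdpoffload); exact formatting varies by version:

```python
# Classify the XDP attach mode from `ip link show` output text.
import re

def xdp_mode(ip_link_output: str) -> str:
    if re.search(r"\bxdpdrv\b", ip_link_output):
        return "native"   # runs in the NIC driver: the fast path
    if re.search(r"\bxdpgeneric\b", ip_link_output):
        return "generic"  # post-sk_buff fallback: no perf benefit
    if re.search(r"\bxdpoffload\b", ip_link_output):
        return "offload"  # program runs on the NIC hardware itself
    return "none"

sample = "2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 xdpgeneric qdisc mq ..."
print(xdp_mode(sample))  # generic
```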

Tracepoints — How Falco Sees Every Container

Falco loads two programs: sys_enter and sys_exit. These are raw tracepoints — they fire on every single system call, from every process, in every container on the node.

Tracepoints are explicitly defined and maintained instrumentation points in the kernel. Unlike hooks that attach to specific internal function names (which can be renamed or inlined between kernel versions), tracepoints are stable interfaces. They’re part of the kernel’s public contract with tooling that wants to instrument it.

This matters operationally. When you patch your nodes — and cloud-managed nodes get patched frequently — tools built on tracepoints keep working. Tools built on kprobes (internal function hooks) may silently stop firing if the function they’re attached to gets renamed or inlined by the compiler in a new kernel build.

# Verify what Falco is actually using
sudo bpftool prog list | grep -E "kprobe|tracepoint"

# Falco's current eBPF driver should show raw_tracepoint entries
# If you see kprobe entries from Falco, you're on the older driver
# Check: falco --version and the driver being loaded at startup

If you’re running Falco on a cluster that gets regular OS patch upgrades and you haven’t verified the driver mode, check it. The older kprobe-based driver has a real failure mode on certain kernel versions.

LSM — How Tetragon Blocks Operations at the Kernel Level

LSM hooks run at the kernel’s security decision points: file opens, socket connections, process execution, capability checks. The defining characteristic is that they can deny an operation. Return an error from an LSM hook and the kernel refuses the syscall before it completes.

This is qualitatively different from observability hooks. kprobes and tracepoints watch. LSM hooks enforce.

When you see Tetragon configured to kill a process attempting a privileged operation, or block a container from writing to a specific path, that’s an LSM hook making the decision inside the kernel — not a sidecar watching traffic, not an admission webhook running before pod creation, not a userspace agent trying to act fast enough. The enforcement is in the kernel itself.

# See if any LSM eBPF programs are active on the node
sudo bpftool prog list | grep lsm

# Verify LSM eBPF support on your kernel (required for Tetragon enforcement mode)
grep CONFIG_BPF_LSM /boot/config-$(uname -r)
# CONFIG_BPF_LSM=y   ← required

The Practical Summary

| What’s happening on your node | Program type | Where to look |
| --- | --- | --- |
| Cilium service load balancing | XDP | ip link show eth0 \| grep xdp |
| Cilium pod network policy | TC (sched_cls) | tc filter show dev lxcXXXX egress |
| Falco syscall monitoring | Tracepoint | bpftool prog list \| grep tracepoint |
| Tetragon enforcement | LSM | bpftool prog list \| grep lsm |
| Anything unexpected | All types | bpftool prog list, bpftool net list |

The Incident, Revisited

Three hours of debugging. The answer was a stale TC program sitting at priority 2 on a pod’s veth interface, left behind by an incomplete Cilium upgrade.

# What I should have run first
sudo bpftool net list
sudo tc filter show dev lxc12345 egress

Two commands. Thirty seconds. If I’d known that TC programs can stack on the same interface, I’d have started there.

That’s the point of understanding program types — not to write eBPF programs yourself, but to know where to look when the tools you depend on don’t behave the way you expect. The programs are already there, running on your nodes right now. bpftool prog list shows you all of them.

Key Takeaways

  • bpftool prog list and bpftool net list show every eBPF program on a node — run these before anything else when debugging eBPF-based tool behavior
  • TC programs can stack on the same interface; stale programs from incomplete Cilium upgrades cause intermittent drops — check tc filter show after every Cilium upgrade
  • XDP runs before the kernel stack — fastest hook, but no pod identity; Cilium uses it for service load balancing, not pod policy
  • XDP silently falls back to generic mode on unsupported NICs — verify with ip link show | grep xdp
  • Tracepoints are stable across kernel versions; kprobe-based tools may silently break after node OS patches — verify your Falco driver mode
  • LSM hooks enforce at the kernel level — this is what makes Tetragon’s enforcement mode fundamentally different from sidecar-based approaches

What’s Next

Every eBPF program fires, does its work, and exits — but the work always involves data. Counting connections. Tracking processes. Streaming events to a detection engine. In EP05, I’ll cover eBPF maps: the persistent data layer that connects kernel programs to the tools consuming their output. Understanding maps explains a class of production issues — and makes bpftool map dump useful rather than cryptic.

EP01: What is IAM? The Identity Problem in Modern Infrastructure


Introduction

A few years into my career managing Linux infrastructure, I was handed a production server audit. The task was straightforward: find out who had access to what. I pulled /etc/passwd, checked the sudoers file, reviewed SSH authorized_keys across the fleet.

Three days later, I had a spreadsheet nobody wanted to read.

The problem wasn’t that the access was wrong. Most of it was fine. The problem was that nobody — not the team lead, not the security team, not the engineers who’d been there five years — could tell me why a particular account had access to a particular server. It had accumulated. People joined, got access, changed teams, left. The access stayed.

That was a 40-server fleet in 2012.

Fast-forward to a cloud environment today: you might have 50 engineers, 300 Lambda functions, 20 microservices, CI/CD pipelines, third-party integrations, compliance scanners — all making API calls, all needing access to something. The identity sprawl problem I spent three days auditing manually on 40 servers now exists at a scale where manual auditing isn’t even a conversation.

This is the problem Identity and Access Management exists to solve. Not just in theory — in practice, at the scale cloud infrastructure demands.


How We Got Here — The Evolution of Access Control

To understand why cloud IAM works the way it does, you need to trace how access control evolved. The design decisions in AWS IAM, GCP, and Azure didn’t come out of nowhere — they’re answers to lessons learned the hard way across decades of broken systems.

The Unix Model (1970s–1990s): Simple and Sufficient

Unix got the fundamentals right early. Every resource (file, device, process) has an owner and a group. Every action is one of three: read, write, execute. Every user is either the owner, in the group, or everyone else.

-rw-r--r--  1 vamshi  engineers  4096 Apr 11 09:00 deploy.conf
# owner can read/write | group can read | others can read

For a single machine or a small network, this model is elegant. The permissions are visible in an ls -l listing. Reasoning about access is straightforward. Auditing means reading a few files.
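The whole model fits in a ten-character string. Decoding it makes the point about how little state there is to audit:

```python
# Decode the mode string from `ls -l`: one owner, one group, three bits each.

def decode_mode(mode: str) -> dict:
    """'-rw-r--r--' -> {'owner': 'rw', 'group': 'r', 'other': 'r'}"""
    assert len(mode) == 10  # type char + three rwx triplets
    out = {}
    for name, i in zip(("owner", "group", "other"), range(1, 10, 3)):
        out[name] = mode[i:i + 3].replace("-", "")
    return out

print(decode_mode("-rw-r--r--"))
# {'owner': 'rw', 'group': 'r', 'other': 'r'}
```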

The cracks started showing when organizations grew. You’d add sudo to give specific commands to specific users. Then sudoers files became 300 lines long. Then you’d have shared accounts because managing individual ones was “too much overhead.” Shared accounts mean no individual accountability. No accountability means no audit trail worth anything.

The Directory Era (1990s–2000s): Centralise or Collapse

As networks grew, every server managing its own /etc/passwd became untenable. Enter LDAP and Active Directory. Instead of distributing identity management across every machine, you centralised it: one directory, one place to add users, one place to disable them when someone left.

This was a significant step forward. Onboarding got faster. Offboarding became reliable. Group membership drove access to resources across the network.

But the permission model was still coarse. You were either in the Domain Admins group or you weren’t. “Read access to the file share” was a group. “Deploy to the staging web server” was a group. Managing fine-grained permissions at scale meant managing hundreds of groups, and the groups themselves became the audit nightmare.

I spent time in environments like this. The group named SG_Prod_App_ReadWrite_v2_FINAL that nobody could explain. The AD group from a project that ended three years ago but was still in twenty user accounts. The contractor whose AD account was disabled but whose service account was still running a nightly job.

The directory model centralised identity. It didn’t solve the permissions sprawl problem.

The Cloud Shift (2006–2014): Everything Changes

AWS launched EC2 in 2006. In 2011, AWS IAM went into general availability. That date matters — for the first five years of AWS, access control was primitive. Root accounts. Access keys. No roles.

Early AWS environments I’ve seen (and had to clean up) reflect this era: a single root account access key shared across a team, rotated manually on a shared spreadsheet. Static credentials in application config files. EC2 instances with AdministratorAccess because “it was easier at the time.”

The AWS team understood what they’d built was dangerous. IAM in 2011 introduced the model that all three major cloud providers now share: deny-by-default, policy-driven, principal-based access control. Not “who is in which group” but “which policy explicitly grants this specific action on this specific resource to this specific identity.”

GCP later introduced its IAM model with a different flavour — hierarchical, additive, binding-based. Azure RBAC followed, built on top of Active Directory’s identity model.

By the mid-2010s, the modern cloud IAM era was established. The primitives existed. The problem shifted from “does IAM exist?” to “are we using it correctly?” — and most teams were not.


The Problem IAM Actually Solves

Here’s the honest version of what IAM is for, based on what I’ve seen go wrong without it.

Without proper IAM, you get one of two outcomes:

The first is what I call the “it works” environment. Everything runs. The developers are happy. Access requests take five minutes because everyone gets the same broad policy. And then a Lambda function’s execution role — which had s3:* on * because someone once needed to debug something — gets its credentials exposed through an SSRF vulnerability in the app it runs. That role can now read every bucket in the account, including the one with the customer database exports.

The second is the “it’s secure” environment. Access is locked down. Every request goes through a ticket. The ticket goes to a security team that approves it in three to five business days. Engineers work around it by storing credentials locally. The workarounds become the real access model. The formal IAM posture and the actual access posture diverge. The audit finds the formal one. Attackers find the real one.

IAM, done right, is the discipline of walking the line between those two outcomes. It’s not a product you buy or a feature you turn on. It’s a practice — a continuous process of defining what access exists, why it exists, and whether it’s still needed.


The Core Concepts — Taught, Not Listed

Let me walk you through the vocabulary you need, grounded in what each concept means in practice.

Identity: Who Is Making This Request?

An identity is any entity that can hold a credential and make requests. In cloud environments, identities split into two types:

Human identities are engineers, operators, and developers. They authenticate via the console, CLI, or SDK. They should ideally authenticate through a central IdP (Okta, Google Workspace, Entra ID) using federation — more on that in EP10.

Machine identities are everything else: Lambda functions, EC2 instances, Kubernetes pods, CI/CD pipelines, monitoring agents, data pipelines. In most production environments, machine identities outnumber human identities by 10:1 or more.

This ratio matters. When your security model is designed primarily for human access, the 90% of identities that are machines become an afterthought. That’s where access keys end up in environment variables, where Lambda functions get broad permissions because nobody thought carefully about what they actually need, where the real attack surface lives.

Principal: The Authenticated Identity Making a Specific Request

A principal is an identity that has been authenticated and is currently making a request. The distinction from “identity” is subtle but important: the principal includes the context of how the identity authenticated.

In AWS, the same IAM role assumed by an EC2 instance, by a Lambda function, and by a developer’s CLI session yields three different principals. The session context, source, and expiration differ.

{
  "Principal": {
    "AWS": "arn:aws:iam::123456789012:role/DataPipelineRole"
  }
}

In GCP, the equivalent term is member. In Azure, it’s security principal — a user, group, service principal, or managed identity.

Resource: What Is Being Accessed?

A resource is whatever is being acted upon. In AWS, every resource has an ARN (Amazon Resource Name) — a globally unique identifier.

arn:aws:s3:::customer-data-prod          # S3 bucket
arn:aws:s3:::customer-data-prod/*        # everything inside that bucket
arn:aws:ec2:ap-south-1:123456789012:instance/i-0abcdef1234567890
arn:aws:iam::123456789012:role/DataPipelineRole

The ARN structure tells you: service, region, account, resource type, resource name. Once you can read ARNs fluently, IAM policies become much less intimidating.
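A sketch of that structure as code. It handles the common colon-delimited layout shown above; the full ARN grammar has service-specific quirks this ignores:

```python
# ARNs split on ':' into six fields; the resource part may itself
# contain ':' or '/', so the split is capped at five.

def parse_arn(arn: str) -> dict:
    prefix, partition, service, region, account, resource = arn.split(":", 5)
    return {
        "partition": partition,  # aws, aws-cn, aws-us-gov
        "service": service,
        "region": region,        # empty for global services like S3 and IAM
        "account": account,      # empty for S3 bucket ARNs
        "resource": resource,
    }

print(parse_arn("arn:aws:s3:::customer-data-prod/*")["service"])  # s3
print(parse_arn("arn:aws:iam::123456789012:role/DataPipelineRole"))
```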

Action: What Is Being Done?

An action (AWS/Azure) or permission (GCP) is the operation being attempted. Cloud providers express these as service:Operation strings:

# AWS
s3:GetObject           # read a specific object
s3:PutObject           # write an object
s3:DeleteObject        # delete an object — treat differently than read
iam:PassRole           # assign a role to a service — one of the most dangerous permissions
ec2:DescribeInstances  # list instances — often overlooked, but reveals infrastructure

# GCP
storage.objects.get
storage.objects.create
iam.serviceAccounts.actAs   # impersonate a service account — equivalent to iam:PassRole danger

When I audit IAM configurations, I pay special attention to any policy that includes iam:*, iam:PassRole, or wildcards like "Action": "*". These are the permissions that let a compromised identity create new identities, assign itself more power, or impersonate other accounts. They’re the privilege escalation primitives — more on that in EP08.
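That audit habit can be sketched as a toy policy linter. The function and its rules are invented for illustration; real reviews lean on tools like IAM Access Analyzer and CloudTrail rather than string matching:

```python
# Flag the red flags named above: wildcard or iam:* actions, iam:PassRole,
# and IAM-capable actions paired with a wildcard resource.

RISKY_ACTIONS = {"*", "iam:*", "iam:PassRole"}

def policy_warnings(policy: dict) -> list:
    warnings = []
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue  # only Allow statements grant anything
        actions = stmt["Action"] if isinstance(stmt["Action"], list) else [stmt["Action"]]
        resources = stmt.get("Resource", [])
        resources = resources if isinstance(resources, list) else [resources]
        for a in actions:
            if a in RISKY_ACTIONS:
                warnings.append(f"risky action: {a}")
        if "*" in resources and any(a.startswith(("iam:", "*")) for a in actions):
            warnings.append("wildcard resource with IAM actions")
    return warnings

policy = {"Statement": [{"Effect": "Allow", "Action": "iam:PassRole", "Resource": "*"}]}
print(policy_warnings(policy))
```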

Policy: The Document That Connects Everything

A policy is a document that says: this principal can perform these actions on these resources, under these conditions.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadCustomerDataBucket",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::customer-data-prod",
        "arn:aws:s3:::customer-data-prod/*"
      ]
    }
  ]
}

Notice what’s explicit here: the effect (Allow), the exact actions (not s3:*), and the exact resource (not *). Every word in this document is a deliberate decision. The moment you start using wildcards to save typing, you’re writing technical debt that will come back as a security incident.


How IAM Actually Works — The Decision Flow

When any API call hits a cloud service, an IAM engine evaluates it. Understanding this flow is the foundation of debugging access issues, and more importantly, of understanding why your security posture is what it is.

Request arrives:
  Action:    s3:PutObject
  Resource:  arn:aws:s3:::customer-data-prod/exports/2026-04-11.csv
  Principal: arn:aws:iam::123456789012:role/DataPipelineRole
  Context:   { source_ip: "10.0.2.15", mfa: false, time: "02:30 UTC" }

IAM Engine evaluation (AWS):
  1. Is there an explicit Deny anywhere? → No
  2. Does the SCP (if any) allow this? → Yes
  3. Does the identity-based policy allow this? → Yes (via DataPipelinePolicy)
  4. Does the resource-based policy (bucket policy) allow or deny? → No explicit rule → the identity-based Allow suffices for same-account access
  5. Is there a permissions boundary? → No
  Decision: ALLOW

The critical insight here: cloud IAM is deny-by-default. There is no implicit allow. If there is no policy that explicitly grants s3:PutObject to this role on this bucket, the request fails. The only way in is through an explicit "Effect": "Allow".

This is the opposite of how most traditional systems work. In a Unix permission model, if your file is world-readable (-r--r--r--), anyone can read it unless you actively restrict them. In cloud IAM, nothing is accessible unless you actively grant it.

When I’m debugging an AccessDenied error — and every engineer who works with cloud IAM spends significant time doing this — the mental model is always: “what is the chain of explicit Allows that should be granting this access, and at which layer is it missing?”
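That mental model reduces to three rules, sketched here with a deliberately simplified matcher. Real evaluation also layers in SCPs, permissions boundaries, conditions, and resource policies as distinct stages:

```python
# Deny-by-default policy evaluation, boiled down: explicit Deny always
# wins, an explicit Allow is required, and the default answer is Deny.

def evaluate(request: dict, policies: list) -> str:
    """policies: list of dicts with Effect / Action / Resource (simplified)."""
    def matches(stmt):
        return (stmt["Action"] in ("*", request["action"])
                and stmt["Resource"] in ("*", request["resource"]))

    if any(s["Effect"] == "Deny" and matches(s) for s in policies):
        return "DENY"   # 1. explicit Deny trumps everything
    if any(s["Effect"] == "Allow" and matches(s) for s in policies):
        return "ALLOW"  # 2. an explicit Allow is required
    return "DENY"       # 3. the implicit default: deny

req = {"action": "s3:PutObject", "resource": "arn:aws:s3:::customer-data-prod/*"}
print(evaluate(req, []))  # DENY: no policy means no access
print(evaluate(req, [{"Effect": "Allow", "Action": "s3:PutObject",
                      "Resource": "arn:aws:s3:::customer-data-prod/*"}]))  # ALLOW
```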


Why This Is Harder Than It Looks

Understanding the concepts is the easy part. The hard part is everything that happens at organisational scale over time.

Scale. A real AWS account in a growing company might have 600+ IAM roles, 300+ policies, and 40+ cross-account trust relationships. None of these were designed together. They evolved incrementally, each change made by someone who understood the context at the time and may have left the organisation since. The cumulative effect is an IAM configuration that no single person fully understands.

Drift. IAM configs don’t stay clean. An engineer needs to debug a production issue at 2 AM and grants themselves broad access temporarily. The temporary access never gets revoked. Multiply that by a team of 20 over three years. I’ve audited environments where 60% of the permissions in a role had never been used — not once — in the 90-day CloudTrail window. That unused 60% is pure attack surface.

The machine identity blind spot. Most IAM governance practices were built for human users. Service accounts, Lambda roles, and CI/CD pipeline identities get created rapidly and reviewed rarely. In my experience, these are the identities most likely to have excess permissions, least likely to be in the access review process, and most likely to be the initial foothold in a cloud breach.

The gap between granted and used. This one surprised me most when I first started doing cloud security work. AWS data from real customer accounts shows the average IAM entity uses less than 5% of its granted permissions. That 95% excess isn’t just waste — it’s attack surface. Every permission that exists but isn’t needed is a permission an attacker can use if they compromise that identity.
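Measuring the gap is conceptually just a set difference between what a role is granted and what CloudTrail saw it use. The permission sets below are invented:

```python
# Granted-vs-used gap: everything granted but never observed in the
# audit window is pure attack surface.

granted = {"s3:GetObject", "s3:PutObject", "s3:DeleteObject",
           "ec2:DescribeInstances", "iam:PassRole", "kms:Decrypt"}
used_in_90_days = {"s3:GetObject", "kms:Decrypt"}  # from CloudTrail, say

unused = granted - used_in_90_days
excess = len(unused) / len(granted)
print(f"unused: {sorted(unused)}")
print(f"excess permission ratio: {excess:.0%}")  # 67%
```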


IAM Across AWS, GCP, and Azure — The Conceptual Map

The three major providers implement IAM differently in syntax, but the same model underlies all of them. Once you understand one deeply, the others become a translation exercise.

| Concept | AWS | GCP | Azure |
| --- | --- | --- | --- |
| Identity store | IAM users / roles | Google accounts, Workspace | Entra ID |
| Machine identity | IAM Role (via instance profile or AssumeRole) | Service Account | Managed Identity |
| Access grant mechanism | Policy document attached to identity or resource | IAM binding on resource (member + role + condition) | Role Assignment (principal + role + scope) |
| Hierarchy | Account is the boundary; Org via SCPs | Org → Folder → Project → Resource | Tenant → Management Group → Subscription → Resource Group → Resource |
| Default stance | Deny | Deny | Deny |
| Wildcard risk | "Action": "*" on "Resource": "*" | Primitive roles (viewer/editor/owner) | Owner or Contributor assigned broadly |

The hierarchy point is worth pausing on. AWS is relatively flat — the account is the primary security boundary. GCP’s hierarchy means a binding at the Organisation level propagates down to every project. Azure’s hierarchy means a role assignment at the Management Group level flows through every subscription beneath it. The blast radius of a misconfiguration scales with how high in the hierarchy it sits.
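The propagation rule can be sketched as a tree walk: a binding placed on a node covers everything beneath it. The hierarchy here is invented:

```python
# Hierarchical grant propagation (GCP/Azure style): a binding on a node
# applies to that node and every descendant.

tree = {
    "org":         ["folder-prod", "folder-dev"],
    "folder-prod": ["project-api", "project-data"],
    "folder-dev":  ["project-sandbox"],
}

def affected_scope(node: str) -> list:
    """Every node at or below the grant point inherits the binding."""
    scope = [node]
    for child in tree.get(node, []):
        scope.extend(affected_scope(child))
    return scope

print(len(affected_scope("org")))    # 6: an org-level grant reaches everything
print(affected_scope("folder-dev"))  # ['folder-dev', 'project-sandbox']
```

The blast-radius point falls straight out of this: the same role bound one level higher multiplies the resources it touches.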

This will matter in EP05 and EP06 when we go deep on GCP and Azure specifically. For now, the takeaway is: understand where in the hierarchy a permission is granted, because the same permission granted at the wrong level has a very different security implication.


Framework Alignment

If you’re mapping this episode to a control framework — for a compliance audit, a certification study, or building a security program — here’s where it lands:

| Framework | Reference | What It Covers Here |
| --- | --- | --- |
| CISSP | Domain 1 — Security & Risk Management | IAM as a risk reduction control; blast radius is a risk variable |
| CISSP | Domain 5 — Identity and Access Management | Direct implementation: who can do what, to which resources, under what conditions |
| ISO 27001:2022 | 5.15 Access control | Policy requirements for restricting access to information and systems |
| ISO 27001:2022 | 5.16 Identity management | Managing the full lifecycle of identities in the organization |
| ISO 27001:2022 | 5.18 Access rights | Provisioning, review, and removal of access rights |
| SOC 2 | CC6.1 | Logical access security controls to protect against unauthorized access |
| SOC 2 | CC6.3 | Access removal and review processes to limit unauthorized access |

Key Takeaways

  • IAM evolved from Unix file permissions → directory services → cloud policy engines, driven by scale and the failure modes of each prior model
  • Cloud IAM is deny-by-default: every access requires an explicit Allow somewhere in the policy chain
  • Identities are human or machine; in production, machines dominate — and they’re the under-governed majority
  • A policy binds a principal to actions on resources; every word is a deliberate security decision
  • The hardest IAM problems aren’t technical — they’re organisational: drift, unused permissions, machine identities nobody owns, and access reviews that never happen
  • The gap between granted and used permissions is where attackers find room to move

What’s Next

Now that you understand what IAM is and why it exists, the next question is the one that trips up even experienced engineers: what’s the difference between authentication and authorization, and why does conflating them cause security failures?

EP02 works through both — how cloud providers implement each, where the boundary sits, and why getting this boundary wrong creates exploitable gaps.

Next: EP02 — Authentication vs Authorization: The Two Pillars of IAM

eBPF vs Kernel Modules: An Honest Comparison for K8s Engineers

~2,100 words · Reading time: 8 min · Series: eBPF: From Kernel to Cloud, Episode 3 of 18

In Episode 1 we covered what eBPF is. In Episode 2 we covered why it is safe. The question that comes next is the one most tutorials skip entirely:

If eBPF can do everything a kernel module does for observability, why do kernel modules still exist? And when should you still reach for one?

Most comparisons on this topic are written by people who have used one or the other. I have used both — device driver work from 2012 to 2014 and eBPF in production Kubernetes clusters for the last several years. This is the honest version of that comparison, including the cases where kernel modules are still the right answer.


What Kernel Modules Actually Are

A kernel module is a piece of compiled code that loads directly into the running Linux kernel. Once loaded, it operates with full kernel privileges — the same level of access as the kernel itself. There is no sandbox. There is no safety check. There is no verifier.

This is both the power and the problem.

Kernel modules can do things that nothing else in the Linux ecosystem can do: implement new filesystems, add hardware drivers, intercept and modify kernel data structures, hook into scheduler internals. They are how the kernel extends itself without requiring a recompile or a reboot.

But the operating model is unforgiving:

  • A bug in a kernel module causes an immediate kernel panic — no exceptions, no recovery
  • Modules must be compiled against the exact kernel headers of the running kernel
  • A module that works on RHEL 8 may refuse to load on RHEL 9 without recompilation
  • Loading a module requires root privileges and deliberate coordination in production
  • Debugging a module failure means kernel crash dumps, kdump analysis, and time

I experienced all of these during device driver work. The discipline that environment instils is real — you think very carefully before touching anything, because mistakes are instantaneous and complete.


What eBPF Does Differently

eBPF was not designed to replace kernel modules. It was designed to provide a safe, programmable interface to kernel internals for the specific use cases where modules had always been used but were too dangerous: observability, networking, and security monitoring.

The fundamental difference is the verifier, covered in depth in Episode 2. Before any eBPF program runs, the kernel proves it is safe. Before any kernel module runs, nothing checks anything.

That single architectural decision produces a completely different operational profile:

  • Safety check before load: none for a module; an eBPF program must pass the BPF verifier, a mathematical proof of safety
  • What a bug causes: an immediate kernel panic for a module; an eBPF program is rejected at load time
  • Kernel version coupling: modules are compiled per kernel version; eBPF with CO-RE compiles once and runs on any kernel 5.4+
  • Hot load and unload: risky for modules and requires coordination; safe for eBPF, with zero downtime and zero pod restarts
  • Access scope: modules get the full kernel, unrestricted; eBPF access is restricted and granted per program type
  • Debugging: kernel crash dumps and kdump for modules; bpftool, bpftrace, and readable error messages for eBPF
  • Portability: modules recompile per distro per version; a single eBPF binary runs across distros and kernel versions
  • Production risk: high for modules, with no safety net; low for eBPF, with verifier enforcement before execution

CO-RE: Why Portability Matters More Than Most Engineers Realise

Portability deserves more than the one-line entry it got in that comparison, because it is the operational advantage that compounds over time.

A kernel module written for RHEL 8 ships compiled against 4.18.0-xxx.el8.x86_64 kernel headers. When RHEL 8 moves to a new minor version, the module may need recompilation. When you migrate to RHEL 9 — kernel 5.14 with a completely different ABI in places — the module almost certainly needs a full rewrite of any code that touches kernel internals that changed between versions.

If you are running Falco with its kernel module driver and you upgrade a node from Ubuntu 20.04 to 22.04, Falco needs a pre-built module for your exact new kernel, or it needs to compile one on the node. If no pre-built module is available and compilation fails, you have no runtime security monitoring until it is resolved.

eBPF with CO-RE works differently. CO-RE (Compile Once, Run Everywhere) uses the kernel’s embedded BTF (BPF Type Format) information to patch field offsets and data structure layouts at load time to match the running kernel. The eBPF program was compiled once, against a reference kernel. When it loads on a different kernel, libbpf reads the BTF data from /sys/kernel/btf/vmlinux and fixes up the relocations automatically.

The practical result: a Cilium or Falco binary built six months ago loads and runs correctly on a node you just upgraded to a newer kernel version — without any module rebuilding, without any intervention, without any downtime.
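You can check whether a node is ready for CO-RE before deploying anything. A small sketch (the function name is mine, and the path parameter exists only so the check can be exercised against test files rather than a live kernel):

```shell
#!/bin/sh
# Report whether a kernel exposes BTF, which CO-RE loaders need in
# order to relocate field offsets at load time.
btf_check() {
    btf_path="${1:-/sys/kernel/btf/vmlinux}"
    if [ -r "$btf_path" ]; then
        echo "BTF present: CO-RE tools (Cilium, Falco eBPF driver) can relocate at load time"
    else
        echo "no BTF: tools must be compiled against this kernel's exact headers"
    fi
}
```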

In a Kubernetes environment where node images update regularly — especially on managed services like EKS, GKE, and AKS — this is not a minor convenience. It is the difference between eBPF tooling that survives an upgrade cycle and kernel module tooling that breaks one.


Security Implications: Container Escape and Privilege Escalation

The security difference between the two approaches matters specifically for container environments, and it goes beyond the verifier’s protection of your own nodes.

Kernel modules as an attack surface

Historically, kernel module vulnerabilities have been a primary vector for container escape. The attack pattern is straightforward: exploit a vulnerability in a loaded kernel module to gain kernel-level code execution, then use that access to break out of the container namespace into the host. Several high-profile CVEs over the past decade have followed this pattern.

The risk is compounded in environments that load third-party kernel modules — hardware drivers, filesystem modules, observability agents using the kernel module approach — because each additional module is an additional attack surface at the highest privilege level on the system.

eBPF’s security boundaries

eBPF does not eliminate the attack surface entirely, but it constrains it in important ways.

First, eBPF programs cannot leak kernel memory addresses to userspace. This is verifier-enforced and closes the class of KASLR bypass attacks that kernel module vulnerabilities have historically enabled.

Second, eBPF programs are sandboxed by design. They cannot access arbitrary kernel memory, cannot call arbitrary kernel functions, and cannot modify kernel data structures they were not explicitly granted access to. A vulnerability in an eBPF program is contained within that sandbox.

Third, the program type system controls what each eBPF program can see and do. A kprobe program watching syscalls cannot suddenly start modifying network packets. The scope is fixed at load time by the program type and verified by the kernel.

For EKS specifically: Falco running in eBPF mode on your nodes is not a kernel module that could be exploited for container escape. It is a verifier-checked program with a constrained access scope. The tool designed to detect container escapes is not itself a container escape vector — which is the correct security architecture.

Audit and visibility

eBPF programs are auditable in ways that kernel modules are not. You can list every eBPF program currently loaded on a node:

$ bpftool prog list
14: kprobe  name sys_enter_execve  tag abc123...  gpl
    loaded_at 2025-03-01T07:30:00+0000  uid 0
    xlated 240B  jited 172B  memlock 4096B  map_ids 3,4

27: cgroup_skb  name egress_filter  tag def456...  gpl
    loaded_at 2025-03-01T07:30:01+0000  uid 0

Every program is listed with its load time, its type, its tag (a hash of the program), and the maps it accesses. You can audit exactly what is running in your kernel at any point. Kernel modules offer no equivalent — lsmod tells you what is loaded but nothing about what it is actually doing.
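The audit story extends to attachments. bpftool net show lists which programs are attached to which interface, and a few lines of awk over that output would have caught the duplicate-program incident at the start of this post. A sketch, assuming output lines shaped like `veth1234(7) clsact/ingress cil_from_container id 27` (the exact format varies by bpftool version, so treat the parsing as illustrative):

```shell
#!/bin/sh
# Flag interfaces with more than one TC eBPF program attached,
# reading `bpftool net show`-style lines on stdin.
find_duplicate_tc() {
    awk '/clsact/ {
             split($1, dev, "(")    # "veth1234(7)" -> "veth1234"
             count[dev[1]]++
         }
         END {
             for (d in count)
                 if (count[d] > 1)
                     print d " has " count[d] " TC programs attached"
         }'
}
```

Usage: `bpftool net show | find_duplicate_tc`. An interface with two TC programs after a CNI upgrade is worth investigating.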


EKS and Managed Kubernetes: Where the Difference Is Most Visible

The eBPF vs kernel module distinction plays out most clearly in managed Kubernetes environments, because you do not control when nodes upgrade.

On EKS, when AWS releases a new optimised AMI for a node group and you update it, your nodes are replaced. Any kernel module-based tooling on those nodes needs pre-built modules for the new kernel, or it needs to compile them at node startup, or it fails. AWS does not provide the kernel source for EKS-optimised AMIs in the same way a standard distribution does, which makes module compilation at runtime unreliable.

This is precisely why the EKS 1.33 migration covered in the EKS 1.33 post was painful for Rocky Linux: it involved kernel-level networking behaviour that had been assumed stable. When the kernel networking stack changed, everything built on top of those assumptions broke.

eBPF-based tooling on EKS does not have this problem, provided the node OS ships with BTF enabled — which Amazon Linux 2023 and Ubuntu 22.04 EKS-optimised AMIs do. Cilium and Falco survive node replacements without any module rebuilding because CO-RE handles the kernel version differences automatically.

For GKE and AKS the story is similar. Both use node images with BTF enabled on current versions, and both upgrade nodes on a managed schedule that is difficult to predict precisely. eBPF tooling survives this. Kernel module tooling fights it.


When You Should Still Use Kernel Modules

eBPF is not the right answer for every use case. Kernel modules remain the correct tool when:

You are implementing hardware support. Device drivers for new hardware still require kernel modules. eBPF cannot provide the low-level hardware interrupt handling, DMA operations, or hardware register access that a device driver needs. If you are bringing up a new network interface card, storage controller, or GPU, you are writing a kernel module.

You need to modify kernel behaviour, not just observe it. eBPF can observe and filter. It can drop packets, block syscalls via LSM hooks, and redirect traffic. But it cannot fundamentally change how the kernel handles a syscall, implement a new scheduling algorithm from scratch, or add a new filesystem type. Those changes require kernel modules or upstream kernel patches.

You are on a kernel older than 5.4. Without BTF and CO-RE, eBPF programs must be compiled per kernel version — which largely eliminates the portability advantage. On RHEL 7 or very old Ubuntu LTS versions still in production, kernel modules may be the more practical path for instrumentation work, though migrating the underlying OS is a better long-term answer.

You need capabilities the eBPF verifier rejects. The verifier’s safety constraints occasionally reject programs that are logically safe but that the verifier cannot prove safe statically. Complex loops, large stack allocations, and certain pointer arithmetic patterns hit verifier limits. In these edge cases, a kernel module can do what the verifier would not allow. These situations are rare and becoming rarer as the verifier improves across kernel versions.


The Practical Decision Framework

For most engineers reading this — Linux admins, DevOps engineers, SREs managing Kubernetes clusters — the decision is straightforward:

  • Observability, security monitoring, network policy, performance profiling on Linux 5.4+ → eBPF
  • Hardware drivers, new kernel subsystems, or kernels older than 5.4 → kernel modules
  • Production Kubernetes on EKS, GKE, or AKS → eBPF, always, because CO-RE survives managed upgrades and kernel modules do not

The overlap between the two technologies — the use cases where both could work — has been shrinking for five years and continues to shrink as the verifier becomes more capable and CO-RE becomes more widely supported. The direction of travel is clear.

Kernel modules are a precision instrument for modifying kernel behaviour. eBPF is a safe, portable interface for observing and influencing it. In 2025, if you are reaching for a kernel module to instrument a production system, there is almost certainly a better path.


Up Next

Episode 4 covers the five things eBPF can observe that no other tool can — without agents, without sidecars, and without any changes to your application code. If you are running production Kubernetes and want to understand what true zero-instrumentation observability looks like, that is the post.

The full series is on LinkedIn — search #eBPFSeries — and all episodes are indexed on linuxcent.com under the eBPF Series tag.



Questions or corrections? Reach me on LinkedIn. If this was useful, the full series index is on linuxcent.com — search the eBPF Series tag for all episodes.

BPF Verifier Explained: Why eBPF Is Safe for Production Kubernetes

~2,400 words · Reading time: 9 min · Series: eBPF: From Kernel to Cloud, Episode 2 of 18

In Episode 1, we established what eBPF is and why it gives Linux admins and DevOps engineers kernel-level visibility without sidecars or code changes. The obvious follow-up question is the one every experienced engineer should ask before running anything in kernel space:

Is it actually safe to run on production nodes?

The answer is yes — and the reason is one specific component of the Linux kernel called the BPF verifier. This post explains what the verifier is, what it protects your cluster from, and why it changes the risk calculus for eBPF-based tools entirely.


The Fear That Holds Most Teams Back

When I first explain eBPF to Linux admins and DevOps engineers, the reaction is almost always the same:

“So it runs code inside the kernel? On our production nodes? That sounds like a disaster waiting to happen.”

It is a completely reasonable concern. The Linux kernel is not a place where mistakes are tolerated. A buggy kernel module can take down a server instantly — no warning, no graceful shutdown, just a hard panic and a 3 AM phone call.

I know this from personal experience. During 2012–2014, I worked briefly with Linux device driver code. That period taught me one thing clearly: kernel space does not forgive careless code.

So when people started talking about running programs inside the kernel via eBPF, my instinct was scepticism too. Then I understood the BPF verifier. And everything changed.


What the Verifier Actually Is

Think of the BPF verifier as a strict safety gate that sits between your eBPF program and the kernel. Before your eBPF program is allowed to run — before it touches a single system call, network packet, or container event — the verifier reads through every line of it and asks one question:

“Could this program crash or compromise the kernel?”

If the answer is yes, or even maybe, the program is rejected. It does not load. Your cluster stays safe. If the answer is a provable no, the program loads and runs.

This is not a runtime check that catches problems after the fact. It is a load-time guarantee — the kernel proves the program is safe before it ever executes. Here is what that looks like when you deploy Cilium:

You run: kubectl apply -f cilium-daemonset.yaml
         └─► Cilium loads its eBPF programs onto each node
                   └─► Kernel verifier checks every program
                             ├─► SAFE   → program loads, starts observing
                             └─► UNSAFE → rejected, cluster untouched

This is why Cilium can replace kube-proxy on your nodes, why Falco can watch every syscall in every container, and why Tetragon can enforce security policy at the kernel level — all without putting your cluster at risk.


What the Verifier Protects You From

You do not need to know how the verifier works internally. What matters is what it prevents — and why each protection matters specifically in Kubernetes environments.

Infinite loops

An eBPF program that never terminates would stall the kernel code path it is attached to, potentially hanging every container on that node. The verifier rejects any program it cannot prove will finish executing within a bounded number of instructions.

Why this matters: Every eBPF-based tool on your K8s nodes — Cilium, Falco, Tetragon, Hubble — was verified to terminate correctly on every code path before it shipped. You are not trusting the vendor’s claim. The kernel enforced it.

Memory safety violations

An eBPF program cannot read or write memory outside the boundaries it is explicitly granted. No reaching into another container’s memory space. No accessing kernel data structures it was not given permission to touch.

Why this matters: This is the property that makes eBPF safe for multi-tenant clusters. A Falco rule monitoring one namespace cannot accidentally read data from another namespace’s containers. The verifier makes this impossible at the program level, not just at the policy level.

Kernel crashes

The verifier checks that every pointer is valid before it is dereferenced, that every function call uses correct arguments, and that the program cannot corrupt kernel data structures. Programs that could cause a kernel panic are rejected before they load.

Why this matters: Running Cilium or Tetragon on a production node is not the same risk as loading an untested kernel module. The verifier has already proven these programs cannot crash your nodes — before they ever ran on your infrastructure.

Privilege escalation and kernel pointer leaks

eBPF programs cannot leak kernel memory addresses to userspace. This closes a class of container escape and privilege escalation attacks that have historically been possible through kernel module vulnerabilities.

Why this matters: Security tools built on eBPF — like Tetragon, which detects and blocks container escape attempts in real time — are not themselves a vector for the attacks they protect against.


eBPF vs Traditional Observability Agents

To appreciate what the verifier gives you operationally, compare the two main approaches to K8s observability.

Traditional agent — DaemonSet sidecar approach

Your K8s cluster
└─► Node
    ├─► App Pod (your service)
    ├─► Sidecar container (injected into every pod)
    │   └─► Reads /proc, intercepts syscalls via ptrace
    │       └─► 15–30% CPU/memory overhead per pod
    └─► Agent DaemonSet Pod
        └─► Aggregates data from all sidecars

Problems with this model:

  • Sidecar injection requires modifying every pod spec and typically an admission webhook
  • ptrace-based interception adds 50–100% overhead to the traced process and is blocked in hardened containers
  • The agent runs in userspace with elevated privileges — a larger attack surface
  • Updating the agent requires pod restarts across your fleet

eBPF-based tool — Cilium / Falco / Tetragon

Your K8s cluster
└─► Node
    ├─► App Pod (your service — completely unmodified)
    ├─► App Pod (another service — also unmodified)
    └─► eBPF programs (inside the kernel, verifier-checked)
        └─► See every syscall, network packet, file access
            └─► Forward events to userspace agent via ring buffer

Benefits:

  • No sidecar injection — pod specs stay clean, no admission webhook required
  • Kernel-level visibility with near-zero overhead (typically 1–3%)
  • The verifier guarantees the eBPF programs cannot harm your nodes
  • Works identically with Docker, containerd, and CRI-O

Tools You Are Probably Already Running — All Verifier-Protected

You may already be running eBPF on your nodes without thinking about it explicitly. In each case below, the verifier ran before the tool ever touched your cluster.

  • Cilium: every network policy decision, service load-balancing operation, and Hubble flow log is handled by eBPF programs that passed the verifier at node startup.
  • Falco: every Falco rule is enforced by a verifier-checked eBPF program attached to syscall hooks. Sub-millisecond detection is only possible because the program runs in kernel space.
  • AWS VPC CNI: on EKS, networking operations have progressively moved to eBPF for performance at scale. If you are on a recent EKS AMI, eBPF is already doing work on your nodes.
  • systemd: modern systemd uses eBPF for cgroup-based resource accounting and network traffic control, active on most current Ubuntu, RHEL, and Amazon Linux 2023 installations.

Questions to Ask When Evaluating eBPF Tools

When a vendor tells you their tool uses eBPF, these three questions will quickly tell you how mature their implementation is.

1. What kernel version do you require?

The verifier’s capabilities have expanded significantly across kernel versions. Tools targeting kernel 5.8+ can use more powerful features safely, while tools claiming to work on kernel 4.x are constrained by an older, more limited verifier. The breakdown below shows where each major distribution stands.

  • Ubuntu 16.04 LTS (kernel 4.4): basic eBPF only. No BTF. kprobes and socket filters work, but modern tooling such as Cilium and the Falco eBPF driver will not run. EOL: do not use for new deployments.
  • Ubuntu 18.04 LTS (kernel 4.15): eBPF without BTF, so no CO-RE. Tools must be compiled against the exact running kernel headers. The HWE kernel (5.4) improves this, but BTF still varies by build.
  • Ubuntu 20.04 LTS (kernel 5.4): BTF available, verify before use. CO-RE capable on most deployments, but CONFIG_DEBUG_INFO_BTF was absent on some early builds. Verify with ls /sys/kernel/btf/vmlinux before deploying eBPF tooling; cloud images generally have it enabled.
  • Ubuntu 20.10+ (kernel 5.8): full BTF and CO-RE. First Ubuntu release where BTF was consistently enabled by default; ring buffers available. Not an LTS release, so use 22.04 for production.
  • Ubuntu 22.04 LTS (kernel 5.15): full modern eBPF, production ready. BTF embedded; ring buffers, global variables, LSM hooks. Default baseline for EKS-optimised Ubuntu AMIs. Recommended for new deployments.
  • Ubuntu 24.04 LTS (kernel 6.8): full modern eBPF plus the latest features. Open-coded iterators, improved verifier precision, enhanced LSM support. Best Ubuntu option for cutting-edge eBPF tooling today.
  • Debian 10 "Buster" (kernel 4.19): basic eBPF, no BTF. Programs load, but CO-RE is unavailable and tools must compile against exact kernel headers. EOL: migrate to Debian 11 or 12.
  • Debian 11 "Bullseye" (kernel 5.10 LTS): full BTF and CO-RE. Cilium, Falco, and Tetragon are all fully supported. A solid production baseline for Debian environments through 2026.
  • Debian 12 "Bookworm" (kernel 6.1 LTS): full modern eBPF, production ready. Same kernel generation as Amazon Linux 2023; LSM hooks, ring buffers, full CO-RE. The recommended Debian version for eBPF workloads today.
  • Debian 13 "Trixie" (kernel 6.12 LTS): full modern eBPF plus the latest features. Released August 2025; same kernel generation as RHEL 10 / Rocky 10 / AlmaLinux 10. Maximum eBPF feature availability across all program types.
  • RHEL 7.6 (kernel 3.10 with backports): Tech Preview only, not production safe. The first RHEL release to enable eBPF, but limited to kprobes and tracepoints. No XDP, no socket filters, no BTF. Do not use for eBPF in production.
  • RHEL 8 / Rocky 8 / AlmaLinux 8 (kernel 4.18, heavily backported): full BPF and BTF, functionally 5.4-equivalent. Red Hat backports make RHEL 8 kernels comparable to upstream 5.4 for most eBPF use cases. BTF is enabled across all releases and CO-RE works. Cilium treats RHEL 8.6+ as its minimum supported RHEL-family version.
  • RHEL 9 / Rocky 9 / AlmaLinux 9 (kernel 5.14, heavily backported): full modern eBPF, production ready. BTF embedded; XDP, tc, kprobe, tracepoint, and LSM hooks all supported. Falco, Cilium, and Tetragon fully supported. The recommended RHEL-family version for eBPF deployments today; supported until 2032.
  • RHEL 10 / Rocky 10 / AlmaLinux 10 (kernel 6.12): full modern eBPF plus the latest features. Same kernel generation as Debian 13 and upstream 6.12 LTS. Rocky 10 released June 2025, AlmaLinux 10 released May 2025. Enhanced eBPF functionality throughout.
  • Amazon Linux 2023 (kernel 6.1+): full modern eBPF, production ready. BTF embedded, full CO-RE. Recommended for EKS; also resolves the NetworkManager deprecation issues in EKS 1.33+ (see the EKS 1.33 post).
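Much of that breakdown reduces to a version check you can script. A rough classifier using the 4.x / 5.4 / 5.8 thresholds discussed above (boundaries simplified, and the function name is mine; heavily backported kernels like RHEL 8's 4.18 will be misclassified, which is why the BTF file check remains the better signal):

```shell
#!/bin/sh
# Rough eBPF capability classification from a kernel release string.
ebpf_support_level() {
    ver="$1"                 # normally $(uname -r), e.g. "5.15.0-105-generic"
    major="${ver%%.*}"
    rest="${ver#*.}"
    minor="${rest%%.*}"
    minor="${minor%%-*}"
    if [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 8 ]; }; then
        echo "full modern eBPF: BTF, CO-RE, ring buffers"
    elif [ "$major" -eq 5 ] && [ "$minor" -ge 4 ]; then
        echo "CO-RE capable if BTF is enabled: verify /sys/kernel/btf/vmlinux"
    elif [ "$major" -ge 4 ]; then
        echo "basic eBPF only: compile per kernel, no CO-RE"
    else
        echo "too old for practical eBPF work"
    fi
}
```

Usage: `ebpf_support_level "$(uname -r)"`.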

Quick check for any distro: Run ls /sys/kernel/btf/vmlinux on your node. If the file exists, your kernel has BTF enabled and CO-RE-based eBPF tools will work correctly. If it does not exist, you are limited to tools that compile against your specific kernel headers. Run uname -r to confirm the exact kernel version.

Rocky Linux and AlmaLinux note: Both distros rebuild directly from RHEL sources. Their kernel versions and eBPF capabilities are effectively identical to the corresponding RHEL release. When Cilium or Falco document “RHEL 9 support”, that applies equally to Rocky 9 and AlmaLinux 9 without any additional configuration.

2. Do you use CO-RE?

CO-RE (Compile Once, Run Everywhere) means the tool’s eBPF programs work correctly across different kernel versions without recompilation. Tools using CO-RE are more portable and significantly less likely to break after a routine node OS update. This is a reliable signal of engineering maturity in the vendor’s eBPF implementation.

3. What eBPF program types do you use?

Different program types have different privilege levels and access scopes. A tool that only needs kprobe access is asking for considerably less privilege than one requiring lsm hooks.

  • kprobe / tracepoint — observability and debugging
  • tc (traffic control) — network policy enforcement
  • xdp (eXpress Data Path) — high-performance packet processing
  • lsm (Linux Security Module) — security policy enforcement (used by Tetragon)

Understanding the program type tells you what the tool can and cannot see on your nodes, and how much kernel access you are granting it.
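A quick way to see that scope on a live node is to summarise loaded programs by type. This sketch parses bpftool prog list output of the shape shown earlier in this post (`14: kprobe  name ...`); the function name is mine:

```shell
#!/bin/sh
# Count loaded eBPF programs by type, reading `bpftool prog list`
# output on stdin. Header lines start with "<id>:", and the second
# field on those lines is the program type.
count_prog_types() {
    awk '/^[0-9]+:/ { types[$2]++ }
         END { for (t in types) print types[t], t }' | sort
}
```

Usage: `sudo bpftool prog list | count_prog_types`. A node running Cilium and Falco will typically show a mix of tc, kprobe, and tracepoint programs.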


How Falco Uses the Verifier — A Step-by-Step Walkthrough

Here is exactly what happens when Falco starts on one of your K8s nodes, and where the verifier fits in:

1. Falco pod starts on the node (via DaemonSet)

2. Falco loads its eBPF programs into the kernel:
   └─► BPF verifier checks each program
       ├─► Can it crash the kernel?            No → continue
       ├─► Can it loop forever?                No → continue
       ├─► Can it access out-of-bounds memory? No → continue
       └─► PASS → program loads

3. Falco's eBPF programs attach to syscall hooks:
   └─► sys_enter_execve   (every process execution in every container)
   └─► sys_enter_openat   (every file open)
   └─► sys_enter_connect  (every outbound network connection)

4. A container runs an unexpected shell (potential attack):
   └─► execve() called inside the container
   └─► Falco's eBPF hook fires in kernel space
   └─► Event forwarded to Falco userspace via ring buffer
   └─► Falco rule matches: "shell spawned in container"
   └─► Alert fired in under 1 millisecond

5. Your container, your other pods, your node: completely unaffected

Step 2 is what the verifier makes safe. Without it, attaching eBPF hooks to every syscall on your production node would be an unacceptable risk. With it, Falco can offer this level of visibility with a mathematical safety guarantee.


The Bottom Line

You do not need to understand BPF bytecode, register states, or static analysis to use eBPF tools safely in production. What you do need to understand is this:

The BPF verifier is the reason eBPF is fundamentally different from kernel modules. It does not just make eBPF “safer” in a vague sense — it provides a mathematical proof that each program cannot crash your kernel before that program ever runs.

This is why eBPF-based tools can deliver deep kernel-level visibility into every container, every syscall, and every network flow — with near-zero overhead, no sidecar injection, and production safety that kernel modules could never guarantee.

The next time someone on your team hesitates about running Cilium, Falco, or Tetragon on production nodes because “it runs code in the kernel” — you now know what to tell them. The verifier already checked it. Before it ever touched your cluster.




What Is eBPF? A Plain-English Guide for Linux and Kubernetes Engineers

~1,900 words · Reading time: 7 min · Series: eBPF: From Kernel to Cloud, Episode 1 of 18

Your Linux kernel has had a technology built into it since 2014 that most engineers working with Linux every day have never looked at directly. You have almost certainly been using it — through Cilium, Falco, Datadog, or even systemd — without knowing it was there.

This post is the plain-English introduction to eBPF that I wished existed when I first encountered it. No kernel engineering background required. No bytecode, no BPF maps, no JIT compilation. Just a clear answer to the question every Linux admin and DevOps engineer eventually asks: what actually is eBPF, and why does it matter for the infrastructure I run every day?


First: Forget the Name

eBPF stands for extended Berkeley Packet Filter. It is one of the most misleading names in computing for what the technology actually does.

The original BPF was a 1992 mechanism for filtering network packets — the engine behind tcpdump. The extended version, introduced in Linux 3.18 (2014) and significantly matured through Linux 5.x, is a completely different technology. It is no longer just about packets. It is no longer just about filtering.

Forget the name. Here is what eBPF actually is:

eBPF lets you run small, safe programs directly inside the Linux kernel — without writing a kernel module, without rebooting, and without modifying your applications.

That is the complete definition. Everything else is implementation detail. The one-liner above is what matters for how you use it day to day.


What the Linux Kernel Can See That Nothing Else Can

To understand why eBPF is significant, you need to understand what the Linux kernel already sees on every server and every Kubernetes node you run.

The kernel is the lowest layer of software on your machine. Every action that happens — every file opened, every process started, every network packet sent — passes through the kernel. That means it has a complete, real-time view of everything:

  • Every syscall — every open(), execve(), connect(), write() from every process in every container on the node, in real time
  • Every network packet — source, destination, port, protocol, bytes, and latency for every pod-to-pod and pod-to-external connection
  • Every process event — every fork, exec, and exit, including processes spawned inside containers that your container runtime never reports
  • Every file access — which process opened which file, when, and with what permissions, across all workloads on the node simultaneously
  • CPU and memory usage — per-process CPU time, function-level latency, and memory allocation patterns without profiling agents

The kernel has always had this visibility. The problem was that there was no safe, practical way to access it without writing kernel modules — which are complex, kernel version-specific, and genuinely dangerous to run in production. eBPF is the safe, practical way to access it.


The Problem eBPF Solves — A Real Kubernetes Scenario

Here is a situation every Kubernetes engineer has faced. A production pod starts behaving strangely — elevated CPU, slow responses, occasional connection failures. You want to understand what is happening at a low level: what syscalls is it making, what network connections is it opening, is something spawning unexpected processes?

The old approaches and their problems

Restart the pod with a debug sidecar. You lose the current state immediately. The issue may not reproduce. You have modified the workload.

Run strace inside the container via kubectl exec. strace uses ptrace, which adds 50–100% CPU overhead to the traced process and is unavailable in hardened containers. You are tracing one process at a time with no cluster-wide view.

Poll /proc with a monitoring agent. Snapshot-based. Any event that happens between polls is invisible. A process that starts, does something, and exits between intervals is completely missed.
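The polling gap is easy to demonstrate. This sketch starts a process that exits immediately, then "polls" /proc for it the way an interval-based agent would; by the time the poll happens, the process is already gone (Linux-specific, since it reads /proc, and the function name is mine):

```shell
#!/bin/sh
# Show that a poll-based view misses a short-lived process entirely.
demo_polling_miss() {
    sh -c 'exit 0' &      # the short-lived "workload"
    pid=$!
    wait "$pid"           # stand-in for the gap between two polls
    if [ -d "/proc/$pid" ]; then
        echo "poll saw PID $pid"
    else
        echo "poll missed PID $pid entirely"
    fi
}
```

An eBPF program attached to the process-exec tracepoint would have seen the event the instant it happened, because it receives a stream rather than taking snapshots.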

The eBPF approach

# Use a debug pod on the node — no changes to your workload
$ kubectl debug node/your-node -it --image=cilium/hubble-cli

# Real-time kernel events from every container on this node:
sys_enter_execve  pid=8821  comm=sh    args=["/bin/sh","-c","curl http://..."]
sys_enter_connect pid=8821  comm=curl  dst=203.0.113.42:443
sys_enter_openat  pid=8821  comm=curl  path=/etc/passwd

# Something inside the pod spawned a shell, made an outbound connection,
# and read /etc/passwd — all visible without touching the pod.

Real-time visibility. No overhead on your workload. Nothing restarted. Nothing modified. That is what eBPF makes possible.


Tools You Are Probably Already Running on eBPF

eBPF is not a standalone product — it is the foundation that many tools in the cloud-native ecosystem are built on. You may already be running eBPF on your nodes without thinking about it explicitly.

  • Cilium: replaces kube-proxy and iptables with kernel-level packet routing, 2–3× faster at scale. Without eBPF: iptables rules, a linear lookup that degrades with service count.
  • Falco: watches every syscall in every container for security rule violations, with sub-millisecond detection. Without eBPF: a kernel module (risky) or ptrace (high overhead).
  • Tetragon: runtime security enforcement that can kill a process or drop a network packet at the kernel level. Without eBPF: no practical alternative at this detection speed.
  • Datadog Agent: network performance monitoring and universal service monitoring without application code changes. Without eBPF: language-specific agents injected into application code.
  • systemd: cgroup resource accounting and network traffic control on your Linux nodes. Without eBPF: legacy cgroup v1 interfaces with limited visibility.

eBPF vs the Old Ways

Before eBPF, getting deep visibility into a running Linux system meant choosing between three approaches, each with a significant trade-off:

  • Kernel modules: full kernel access, but one bug means a kernel panic, and each module is version-specific and must be recompiled per kernel update. Production safe: no.
  • ptrace / strace: one process at a time, at 50–100% CPU overhead on the traced process. Production safe: no.
  • Polling /proc: snapshots only; events between polls are invisible and short-lived processes are missed entirely. Production safe: partially.
  • eBPF: full kernel visibility at 1–3% overhead, with verifier-guaranteed safety and a real-time event stream instead of polling. Production safe: yes.

Is It Safe to Run in Production?

This is always the first question from any experienced Linux admin, and it is exactly the right question to ask. The answer is yes — and the reason is the BPF verifier.

Before any eBPF program is allowed to run on your node, the Linux kernel runs it through a built-in static safety analyser. This analyser examines every possible execution path and asks: could this program crash the kernel, loop forever, or access memory it should not?

If the answer is yes — or even maybe — the program is rejected at load time. It never runs.

This is fundamentally different from kernel modules. A kernel module loads immediately with no safety check. If it has a bug, you find out at runtime — usually as a kernel panic. An eBPF program that would cause a panic is rejected before it ever loads. The safety guarantee is mathematical, not hopeful.

Episode 2 of this series covers the BPF verifier in full: what it checks, how it makes Cilium and Falco safe on your production nodes, and what questions to ask eBPF tool vendors about their implementation.


Common Misconceptions

eBPF is not a specific tool or product. It is a kernel technology — a platform. Cilium, Falco, Tetragon, and Pixie are tools built on top of it. When a vendor says “we use eBPF”, they mean they build on this kernel capability, not that they share a single implementation.

eBPF is not only for networking. The Berkeley Packet Filter name suggests networking, but modern eBPF covers security, observability, performance profiling, and tracing. The networking origin is historical, not a limitation.

eBPF is not only for Kubernetes. It works on any Linux system running kernel 4.9+, including bare metal servers, Docker hosts, and VMs. K8s is the most popular deployment target because of the observability challenges at scale, but it is not a requirement.

You do not need to write eBPF programs to benefit from eBPF. Most Linux admins and DevOps engineers will use eBPF through tools like Cilium, Falco, and Datadog — never writing a line of BPF code themselves. This series covers the writing side later. Understanding what eBPF is makes you a significantly better user of these tools today.


Kernel Version Requirements

eBPF is a Linux kernel feature. The capabilities available depend directly on the kernel version running on your nodes. Run uname -r on any node to check.
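The check can be scripted. A minimal sketch (POSIX sh) that maps a kernel release string to the feature tiers described in this section — the `ebpf_tier` helper and its wording are ours, and the version parsing is deliberately naive:

```shell
# Sketch, not production code: map a kernel release string to an eBPF
# feature tier. Thresholds follow the table in this section.
ebpf_tier() {
    v="${1%%-*}"    # "6.1.94-13-generic" -> "6.1.94"
    # ge A B: true if version A >= version B (sort -V does the comparison)
    ge() { [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]; }
    if   ge "$v" 5.8; then echo "5.8+: ring buffers, CO-RE, BTF"
    elif ge "$v" 5.4; then echo "5.4+: CO-RE and BTF, no ring buffers"
    elif ge "$v" 4.9; then echo "4.9+: basic tracing and socket filtering"
    else echo "pre-4.9: too old for eBPF tooling"
    fi
}

ebpf_tier "$(uname -r)"
```

Run it on each node (or fold it into your AMI build validation) to know which tier your fleet is actually on.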

| Kernel | What becomes available |
| --- | --- |
| 4.9+ | Basic eBPF support. Tracing, socket filtering. Most production systems today meet this minimum. |
| 5.4+ | BTF (BPF Type Format) and CO-RE — programs that adapt to different kernel versions without recompiling. Recommended minimum for production tooling. |
| 5.8+ | Ring buffers for high-performance event streaming. Global variables. The target kernel for full Cilium, Falco, and Tetragon feature support. |
| 6.x | Open-coded iterators, improved verifier, LSM security enforcement hooks. Amazon Linux 2023 and Ubuntu 22.04+ ship 5.15 or newer and are fully eBPF-ready. |

EKS users: Amazon Linux 2023 AMIs ship with kernel 6.1+ and support the full modern eBPF feature set out of the box. If you are still on AL2, the migration also resolves the NetworkManager deprecation issues covered in the EKS 1.33 post.


The Bottom Line

eBPF is the answer to a question Linux engineers have been asking for years: how do I get deep visibility into what is happening on my servers and Kubernetes nodes — without adding massive overhead, injecting sidecars, or risking a kernel panic?

The answer is: run small, safe programs at the kernel level, where everything is already visible. Let the BPF verifier guarantee those programs are safe before they run. Stream the results to your observability tools through shared memory maps.

The tools you already use — Cilium for networking, Falco for security, Datadog for APM — are built on this foundation. Understanding eBPF means understanding why those tools work the way they do, what they can and cannot see, and how to evaluate new tools that claim to use it.

Every eBPF-based tool you run on your nodes passed through the BPF verifier before it touched your cluster. Episode 2 covers exactly what that means — and why it matters for your infrastructure decisions.


Questions or corrections? Reach me on LinkedIn. If this was useful, the full series index is on linuxcent.com — search the eBPF Series tag for all episodes.

Cloud AMI Security Risks & How Custom OS Images Fix Them: What's Wrong with the Defaults

~2,800 words  ·  Reading time: 12 min  ·  Series: OS Image Security, Post 1 of 6

When you launch an EC2 instance from an AWS Marketplace AMI, or spin up a VM from a cloud-provider base image on GCP or Azure, you’re trusting a decision someone else made months ago about what your server should contain. That decision was made for the widest possible audience — not for your workload, your threat model, or your compliance requirements.

This post tears open what’s actually inside a default cloud image, compares it against what a production-hardened image should contain, and explains why the calculus changes depending on whether you’re deploying to AWS, an on-prem KVM host, or a Nutanix AHV cluster.


What a cloud provider is actually optimising for

AWS, Canonical, Red Hat, and every other publisher shipping to cloud marketplaces are solving a distribution problem, not a security problem. Their images need to:

  • Boot successfully on any instance type in any region
  • Work for the first-time user running their first workload
  • Support every possible use case — web servers, databases, ML training jobs, bastion hosts, everything

That constraint produces images that are, by design, permissive. Permissive gets out of the way. Permissive doesn’t break anything on day one. Permissive is also the opposite of what you want on a production server.

Let’s look at what “permissive” actually means in concrete terms.


Dissecting a default AWS AMI

Take Amazon Linux 2023 (AL2023), one of the more intentionally stripped-down cloud images available. Even with Amazon’s effort to reduce its footprint compared to AL2, a fresh AL2023 instance ships with more than most workloads need.

Services running at boot that most workloads don’t need

chronyd.service            # Fine — you need NTP
systemd-resolved.service   # Fine
dbus-broker.service        # Fine
amazon-ssm-agent.service   # Arguably fine if you use SSM
NetworkManager.service     # Debatable — most cloud workloads don't need NM

On a RHEL 8/9 or Ubuntu 22.04 Marketplace image, the list is longer. You’ll find avahi-daemon (mDNS/DNS-SD service discovery — on a server), bluetooth.service in some configurations, cups on some RHEL variants, and on Ubuntu, snapd running and occupying memory along with its associated mount units.

Every running service is an attack surface. Every socket it opens is a listening endpoint you didn’t ask for.

SSH configuration out of the box

The default sshd_config on most Marketplace images is not hardened. You’ll typically find:

PermitRootLogin prohibit-password   # Better than 'yes', but not 'no'
PasswordAuthentication no           # Usually disabled by cloud-init — good
X11Forwarding yes                   # On a headless server. Why?
AllowAgentForwarding yes            # Unnecessary for most workloads
PrintLastLog yes                    # Minor, but generates audit noise
MaxAuthTries 6                      # CIS recommends 4 or fewer
ClientAliveInterval 0               # No idle timeout

CIS Benchmark Level 1 for RHEL 9 has 40+ SSH-specific controls. A default image satisfies perhaps a third of them.
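To see what "perhaps a third" means on your own fleet, here is a throwaway spot-check that greps a config for a few CIS-style values. The `audit_sshd` helper and the three settings are illustrative, not CIS tooling — use OpenSCAP or a real CIS scanner for the full benchmark:

```shell
# Toy spot-check: PASS/FAIL a handful of CIS-style sshd settings against a
# config file. Matches exact lines only; a real audit uses OpenSCAP.
audit_sshd() {
    cfg="$1"
    for want in 'PermitRootLogin no' 'X11Forwarding no' 'MaxAuthTries 4'; do
        if grep -qix "$want" "$cfg"; then
            echo "PASS: $want"
        else
            echo "FAIL: $want"
        fi
    done
}

# Usage on a node: audit_sshd /etc/ssh/sshd_config
```

Run it against a fresh Marketplace instance and count the FAIL lines.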

Kernel parameters that aren’t tuned

# Not set, or not set correctly, on most default images:
net.ipv4.conf.all.send_redirects = 1        # Should be 0
net.ipv4.conf.default.accept_redirects = 1  # Should be 0
net.ipv4.ip_forward = 0                     # Correct if not a router, but often left unset
kernel.randomize_va_space = 2               # Usually correct — verify anyway
fs.suid_dumpable = 0                        # Often not set
kernel.dmesg_restrict = 1                   # Rarely set

These live in /etc/sysctl.d/ and need to be explicitly applied. In a default AMI, they are not.
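You can see the drift for yourself by reading /proc/sys directly, which is what `sysctl -n` does under the hood. The `check` helper and the selection of keys here are illustrative:

```shell
# Illustrative drift check: compare a few hardening keys against their
# desired values by reading /proc/sys directly.
check() {
    path="/proc/sys/$1"
    want="$2"
    have="$(cat "$path" 2>/dev/null || echo missing)"
    if [ "$have" = "$want" ]; then
        echo "$1: ok ($have)"
    else
        echo "$1: DRIFT (want $want, have $have)"
    fi
}

check net/ipv4/conf/all/send_redirects 0
check kernel/randomize_va_space 2
check fs/suid_dumpable 0
```

On a default AMI, expect DRIFT lines.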

No audit daemon configured

auditd is installed on most RHEL-family images. It is not configured. The default audit.rules file is essentially empty — the daemon runs but captures almost nothing. On Ubuntu, auditd isn’t even installed by default.

CIS Benchmark Level 2 for RHEL 9 specifies 30+ auditd rules covering file access, privilege escalation, user management changes, network configuration changes, and more. None of them are present in a default AMI.

Package surface

Run rpm -qa | wc -l or dpkg -l | grep -c ^ii on a fresh instance. AL2023 comes in around 350 packages. Ubuntu 22.04 Server minimal sits around 500. RHEL 9 from Marketplace — depending on the variant — lands between 400 and 600.

How many of those packages does your application actually need? For a Python web service: Python, your runtime dependencies, and a handful of system libraries. The rest is exposure.


The on-prem story is different — and often worse

Cloud images at least get regular updates from their publishers. On-prem KVM and Nutanix environments tell a different story.

The KVM / QCOW2 situation

Most teams running KVM get their base images one of three ways:

  1. Download a cloud image (cloud-init enabled QCOW2) from the distro vendor and use it directly
  2. Convert an existing VMware VMDK or OVA and hope for the best
  3. Run a manual Kickstart/Preseed install once, then treat the result as the “golden image” forever

Option 1 gives you the same problems as the cloud image analysis above, plus you’re now responsible for handling cloud-init in an environment that might not have a metadata service — so you either ship a seed ISO with every VM, or you rip out cloud-init and manage first-boot differently.

Option 3 is the most common and the most dangerous. That “golden image” was created by someone who’s possibly no longer at the company, contains packages pinned to versions from 18 months ago, and has sshd configured however was convenient at the time. Worse, it gets cloned hundreds of times and none of those clones are ever individually updated at the image level.

The Nutanix AHV specifics

Nutanix AHV images have additional considerations that cloud images don’t deal with:

  • AHV uses a custom paravirtualised SCSI controller (virtio-scsi or the Nutanix variant). Images imported from VMware need pvscsi drivers removed and virtio_scsi added to the initramfs before the disk will be detected at boot.
  • The Nutanix guest tools agent (ngt) is separate from the kernel and needs to be installed inside the image for snapshot quiescence, VSS integration, and in-guest metrics.
  • cloud-init works on AHV but requires the ConfigDrive datasource — not the EC2 datasource that most cloud QCOW2 images default to. An unconfigured datasource means cloud-init times out at boot, costing 3–5 minutes on every first start.
  • NUMA topology on large AHV nodes affects memory allocation in ways that need kernel tuning (vm.zone_reclaim_mode, kernel.numa_balancing) — parameters no generic cloud image sets.
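The cloud-init timeout has a small, standard fix: pin the datasource list in a cloud.cfg.d drop-in so cloud-init never probes the EC2 metadata endpoint. The filename below is our choice; `datasource_list` is standard cloud-init configuration:

```yaml
# /etc/cloud/cloud.cfg.d/99-ahv-datasource.cfg
# Pin cloud-init to ConfigDrive so it doesn't probe (and time out on)
# the EC2 metadata endpoint at first boot on AHV.
datasource_list: [ ConfigDrive, None ]
```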

The result is that most Nutanix environments end up with a patchwork: partially converted images, manually applied guest tools, and hardening that was done once per environment rather than once per image.


What a hardened image actually looks like

A properly built hardened image isn’t just “a default image with some hardening applied at the end.” The hardening is architectural — decisions made at build time that change the fundamental shape of what’s inside the image.

Package set — minimal by design

Start from a minimal install group — @minimal-environment on RHEL/Rocky, --variant=minbase on Debian derivatives. Then add only what the image class requires. For a web server image: your runtime, a process supervisor, and nothing else. No man-db, no X11-common, no avahi.

Every package you don’t install is a CVE that can never affect you.
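On the RHEL/Rocky side, minimal-by-design looks like this in a Kickstart %packages section. The specific packages are illustrative — nginx stands in for whatever the image class actually serves:

```text
# Kickstart %packages sketch: start from the minimal environment group,
# add only what the image class needs, explicitly exclude extras.
# (nginx is illustrative; swap in your actual runtime.)
%packages --excludedocs
@^minimal-environment
nginx
-man-db
-cockpit
%end
```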

Filesystem hardening

Separate mount points with restrictive options prevent a class of privilege escalation attacks that depend on executing binaries from world-writable locations:

/tmp      nodev,nosuid,noexec
/var      nodev,nosuid
/var/tmp  nodev,nosuid,noexec
/home     nodev,nosuid
/dev/shm  nodev,nosuid,noexec

These are not applied by any default cloud image.
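In /etc/fstab form this looks like the following. Device paths and filesystem types are placeholders; only the option columns carry the hardening:

```text
# Illustrative /etc/fstab entries. Devices and fs types are placeholders.
tmpfs                /tmp      tmpfs  defaults,nodev,nosuid,noexec  0 0
/dev/mapper/vg0-var  /var      xfs    defaults,nodev,nosuid         0 0
/dev/mapper/vg0-vtmp /var/tmp  xfs    defaults,nodev,nosuid,noexec  0 0
/dev/mapper/vg0-home /home     xfs    defaults,nodev,nosuid         0 0
tmpfs                /dev/shm  tmpfs  defaults,nodev,nosuid,noexec  0 0
```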

Kernel parameters — baked in at build time

# /etc/sysctl.d/99-hardening.conf

net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.all.log_martians = 1
net.ipv6.conf.all.accept_redirects = 0
kernel.randomize_va_space = 2
fs.suid_dumpable = 0
kernel.dmesg_restrict = 1
kernel.kptr_restrict = 2
net.core.bpf_jit_harden = 2

Applied at image build time. Present on every instance, every time, before your application code runs.

SSH locked down

Protocol 2
PermitRootLogin no
MaxAuthTries 4
LoginGraceTime 60
X11Forwarding no
AllowAgentForwarding no
AllowTcpForwarding no
PermitUserEnvironment no
Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com,aes256-ctr
MACs hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com
KexAlgorithms curve25519-sha256,diffie-hellman-group16-sha512
ClientAliveInterval 300
ClientAliveCountMax 3
Banner /etc/issue.net

This is approximately CIS Level 1 SSH hardening. It lives in the image — not in a post-deploy playbook.

auditd rules embedded

# Privilege escalation
-a always,exit -F arch=b64 -S execve -C uid!=euid -F euid=0 -k setuid

# Sudo usage
-w /etc/sudoers -p wa -k sudoers

# User and group management
-w /etc/passwd -p wa -k identity
-w /etc/group  -p wa -k identity

# Kernel module loading
-a always,exit -F arch=b64 -S init_module -S delete_module -k modules

The full CIS L2 auditd ruleset runs to ~60 rules. They’re all committed to the image. Every instance generates audit logs from minute one of its existence.

Services disabled at build time

systemctl disable avahi-daemon
systemctl disable cups
systemctl disable postfix
systemctl disable bluetooth
systemctl disable rpcbind
systemctl mask debug-shell.service

The service list varies by distro. The principle is the same: if it’s not required by the image’s purpose, it doesn’t run.


The platform dimension: why you can’t use one image everywhere

This is where the complexity gets real. A CIS-hardened RHEL 9 image built for AWS doesn’t directly work on KVM, and it doesn’t directly work on Nutanix either. The security controls are the same — the platform-specific layer underneath them is not.

Here’s what needs to differ per target platform:

| Concern | AWS (AMI) | KVM (QCOW2) | Nutanix AHV |
| --- | --- | --- | --- |
| Disk format | Raw / VMDK → AMI | QCOW2 | QCOW2 / VMDK |
| Boot mechanism | GRUB2 + PVGRUB2 or UEFI | GRUB2 | GRUB2 + UEFI |
| Network driver | ENA (ena kernel module) | virtio-net | virtio-net |
| Storage driver | NVMe or xen-blkfront | virtio-blk / virtio-scsi | virtio-scsi |
| cloud-init datasource | Ec2 | NoCloud / ConfigDrive | ConfigDrive |
| Guest agent | AWS SSM / CloudWatch | qemu-guest-agent | Nutanix Guest Tools |
| Metadata service | 169.254.169.254 | None (seed ISO) or local | Nutanix AOS |

A single pipeline needs to produce platform-specific artefacts from a single hardened source. The hardening doesn’t change. The drivers, datasources, and agents do.
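With Packer, for example, that single-source shape can be sketched as one shared provisioning step fanned out to per-platform builders. The source names and script paths below are hypothetical; the source blocks themselves (amazon-ebs, qemu) would be defined elsewhere:

```hcl
# Sketch: one hardened baseline, multiple platform-specific builders.
build {
  sources = [
    "source.amazon-ebs.rhel9",  # AWS AMI
    "source.qemu.rhel9",        # KVM QCOW2
    "source.qemu.rhel9-ahv",    # QCOW2 variant prepped for Nutanix AHV
  ]

  # The hardening layer is identical for every target.
  provisioner "shell" {
    scripts = [
      "scripts/harden-ssh.sh",
      "scripts/harden-sysctl.sh",
      "scripts/harden-auditd.sh",
    ]
  }
}
```

Platform-specific drivers, datasources, and agents then live in per-source provisioning, not in the shared baseline.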


Where this sits relative to CIS and NIST

The controls described above aren’t arbitrary. They map directly to published frameworks.

CIS Benchmark Level 1 covers controls with low operational impact and high security return — SSH configuration, kernel parameters, filesystem mount options, service reduction. Almost everything in the “what a hardened image looks like” section above is CIS Level 1.

CIS Benchmark Level 2 adds auditd configuration, PAM controls, additional filesystem protections, and more aggressive service disablement. It trades some operational flexibility for a significantly smaller attack surface.

NIST SP 800-53 CM-6 (Configuration Settings) directly requires that systems be configured to the most restrictive settings consistent with operational requirements. Baking hardening into the image is a stronger implementation of CM-6 than applying it post-deploy — because it’s guaranteed, auditable at build time, and consistent across every instance regardless of how it was launched.

NIST SP 800-53 SI-2 (Flaw Remediation) maps to your image patching cadence. An image rebuilt monthly against the latest package repositories satisfies SI-2 more completely than runtime patching alone, because it also eliminates packages you don’t need — packages that would need patching if they were present.

The full CIS and NIST control mapping will be covered in depth later in this series.


The build-time vs runtime hardening distinction

This is the most important concept in the entire post.

Hardening applied at runtime — via Ansible, Chef, cloud-init user-data, or a shell script — is conditional. It runs if the automation runs. It applies if nothing fails. It’s consistent only if every deployment goes through exactly the same path.

Hardening embedded in the image is unconditional. It cannot be skipped. It doesn’t depend on connectivity to an Ansible control node. It doesn’t require cloud-init to succeed. It cannot be accidentally omitted by a new team member who doesn’t know the runbook.

This distinction matters most at incident response time. When you’re investigating a compromised instance, the first question you want to answer confidently is: was this instance ever in a known-good state?

  • If your hardening is in the image: yes, from boot.
  • If your hardening is applied post-deploy: it depends on whether everything went right on that specific instance’s first boot.

What comes next

The practical question this raises: how do you build these images in a repeatable, multi-platform way, with CIS scanning integrated into the build pipeline?

Packer covers most of the builder layer. OpenSCAP provides the scanning. Kickstart, cloud-init, and Nutanix AHV-specific tooling fill the gaps. But the orchestration between these — producing a consistent hardened image for three different target platforms from a single source of truth — is where most teams hit friction.

The next post in this series covers the platform-specific differences between AWS, KVM, and Nutanix in depth: what actually needs to change per target when your security baseline is shared.

Next in the series: Cloud vs KVM vs Nutanix — why one image doesn’t fit all →


Questions or corrections? Open an issue or reach me on LinkedIn. If this was useful, the series index has the full roadmap.

EKS 1.33 Upgrade Blocker: Fixing Dead Nodes & NetworkManager on Rocky Linux

The EKS 1.33+ NetworkManager Trap: A Complete systemd-networkd Migration Guide for Rocky & Alma Linux

TL;DR:

  • The Blocker: Upgrading to EKS 1.33+ is breaking worker nodes, especially on free community distributions like Rocky Linux and AlmaLinux. Boot times are spiking past 6 minutes, and nodes are failing to get IPs.
  • The Root Cause: AWS is deprecating NetworkManager in favor of systemd-networkd. However, ripping out NetworkManager can leave stale VPC IPs in /etc/resolv.conf. Combined with the systemd-resolved stub listener (127.0.0.53) and a few configuration missteps, it causes a total internal DNS collapse where CoreDNS pods crash and burn.
  • The Subtext: AWS is pushing this modern networking standard hard. Subtly, this acts as a major drawback for Rocky/Alma AMIs, silently steering frustrated engineers toward Amazon Linux 2023 (AL2023) as the “easy” way out.
  • The “Super Hack”: Automate the clean removal of NetworkManager, bypass the DNS stub listener by symlinking /etc/resolv.conf directly to the systemd uplink, and enforce strict state validation during the AMI build.

If you’ve been in the DevOps and SRE space long enough, you know that vendor upgrades rarely go exactly as planned. But lately, if you are running enterprise Linux distributions like Rocky Linux or AlmaLinux on AWS EKS, you might have noticed the ground silently shifting beneath your feet.

With the push to EKS 1.33+, AWS is mandating a shift toward modern, cloud-native networking standards. Specifically, they are phasing out the legacy NetworkManager in favor of systemd-networkd.

While this makes sense on paper, the transition for community distributions has been incredibly painful. AWS support couldn’t resolve our issues, and my SRE team had practically given up, officially halting our EKS upgrade process. It’s hard not to notice that this massive, undocumented friction in Rocky Linux and AlmaLinux conveniently positions AWS’s own Amazon Linux 2023 (AL2023) as the path of least resistance.

I’m hoping the incredible maintainers at free distributions like Rocky Linux and AlmaLinux take note of this architectural shift. But until the official AMIs catch up, we have to fix it ourselves. Here is the exact breakdown of the cascading failure that brought our clusters to their knees, and the “super hack” script we used to fix it.

The Investigation: A Cascading SRE Failure

When our EKS 1.33+ worker nodes started booting with 6+ minute latencies or outright failing to join the cluster, I pulled apart our Rocky Linux AMIs to monitor the network startup sequence. What I found was a classic cascading failure of services, stale data, and human error.

Step 1: The Race Condition

Initially, the problem was a violent tug-of-war. NetworkManager was not correctly disabled by default, and cloud-init was still trying to invoke it. This conflicted directly with systemd-networkd, paralyzing the network stack during boot. To fix this, we initially disabled the NetworkManager service and removed it from cloud-init.

Step 2: The Stale Data Landmine

Here is where the trap snapped shut. Because NetworkManager was historically the primary service responsible for dynamically generating and updating /etc/resolv.conf, completely disabling it stopped that file from being updated.

When we baked the new AMI via Packer, /etc/resolv.conf was orphaned and preserved the old configuration—specifically, a stale .2 VPC IP address from the temporary subnet where the AMI build ran.

Step 3: The Human Element

We’ve all been there: during a stressful outage, wires get crossed. While troubleshooting the dead nodes, one of our SREs mistakenly stopped the systemd-resolved service entirely, thinking it was conflicting with something else.

Step 4: Total DNS Collapse

When the new AMI booted up and joined the EKS node group, the environment was a disaster zone:

  1. NetworkManager was dead (intentional).
  2. systemd-resolved was stopped (accidental).
  3. /etc/resolv.conf contained a dead, stale IP address from a completely different subnet.

When kubelet started, it dutifully read the host’s broken /etc/resolv.conf and passed it up to CoreDNS. CoreDNS attempted to route traffic to the stale IP, failed, and started crash-looping. Internal DNS resolution (pod.namespace.svc.cluster.local) totally collapsed. The cluster was dead in the water.

[Figure: flowchart of the cascading DNS failure in EKS worker nodes. The perfect storm: how stale data and disabled services led to a total CoreDNS collapse.]

Linux Internals: How systemd Manages DNS (And Why CoreDNS Breaks)

To understand how to permanently fix this, we need to look at how systemd actually handles DNS under the hood. When using systemd-networkd, resolv.conf management is handled through a strict partnership with systemd-resolved.

[Figure: architecture diagram of systemd-networkd and systemd-resolved D-Bus communication. How systemd collects network data, and the critical symlink choice that dictates EKS DNS health.]

Here is how the flow works: systemd-networkd collects network and DNS information (from DHCP, Router Advertisements, or static configs) and pushes it to systemd-resolved via D-Bus. To manage your DNS resolution effectively, you must configure the /etc/resolv.conf symbolic link to match your desired mode of operation. You have three choices:

1. The “Recommended” Local DNS Stub (The EKS Killer)

By default, systemd recommends using systemd-resolved as a local DNS cache and manager, providing features like DNS-over-TLS and mDNS.

  • The Symlink: ln -sf /run/systemd/resolve/stub-resolv.conf /etc/resolv.conf
  • Contents: Points to 127.0.0.53 as the only nameserver.
  • The Problem: This is a disaster for Kubernetes. If Kubelet passes 127.0.0.53 to CoreDNS, CoreDNS queries its own loopback interface inside the pod network namespace, blackholing all cluster DNS.

2. Direct Uplink DNS (The “Super Hack” Solution)

This mode bypasses the local stub entirely. The system lists the actual upstream DNS servers (e.g., your AWS VPC nameservers) discovered by systemd-networkd directly in the file.

  • The Symlink: ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf
  • Contents: Lists all actual VPC DNS servers currently known to systemd-resolved.
  • The Benefit: CoreDNS gets the real AWS VPC nameservers, allowing it to route external queries correctly while managing internal cluster resolution perfectly.

3. Static Configuration (Manual)

If you want to manage DNS manually without systemd modifying the file, you break the symlink and create a regular file (rm /etc/resolv.conf). While systemd-networkd still receives DNS info from DHCP, it won’t touch this file. (Not ideal for dynamic cloud environments).


The Solution: A Surgical systemd Cutover

Knowing the internals, the path forward is clear. We needed to not only remove the legacy stack but explicitly rewire the DNS resolution to the Direct Uplink to prevent the stale data trap and bypass the notorious 127.0.0.53 stub listener.

Here is the exact state we achieved:

  1. Lock down cloud-init so it stops triggering legacy network services.
  2. Completely mask NetworkManager to ensure it never wakes up.
  3. Ensure systemd-resolved is enabled and running, but with the DNSStubListener explicitly disabled (DNSStubListener=no) so it doesn’t conflict with anything.
  4. Destroy the stale /etc/resolv.conf and create a symlink to the Direct Uplink (ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf).
  5. Reconfigure and restart systemd-networkd.
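Step 3 translates to a one-file drop-in. The path follows the standard systemd drop-in convention; the filename is our choice:

```ini
# /etc/systemd/resolved.conf.d/99-no-stub.conf
[Resolve]
DNSStubListener=no
```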

Pro-Tip for Debugging: To ensure systemd-networkd is successfully pushing DNS info to the resolver, verify your .network files in /etc/systemd/network/. Ensure UseDNS=yes (which is the default) is set in the [DHCPv4] section. You can always run resolvectl status to see exactly which DNS servers are currently assigned to each interface over D-Bus!
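For reference, a minimal .network file satisfying that pro-tip might look like this. The Name= glob is an assumption; adjust it for your interface naming scheme:

```ini
# /etc/systemd/network/10-dhcp.network: minimal example.
[Match]
Name=en* eth*

[Network]
DHCP=ipv4

[DHCPv4]
# UseDNS defaults to yes; stated explicitly so the intent survives audits.
UseDNS=yes
```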

The Automation: Production AMI Prep Script

Manual hacks are great for debugging, but SRE is about repeatable automation. We’ve open-sourced the eks-production-ami-prep.sh script to handle this cutover automatically during your Packer or Image Builder pipeline. It standardizes the cutover, wipes out the stale data, and includes a strict validation suite.


The Results

By actively taking control of the systemd stack and ensuring /etc/resolv.conf was dynamically linked rather than statically abandoned, we completely unblocked our EKS 1.33+ upgrade.

More impressively, our system bootup time dropped from a crippling 6+ minutes down to under 2 minutes. We shouldn’t have to abandon fantastic, free enterprise distributions just because a cloud provider shifts their networking paradigm. If your team is struggling with AWS EKS upgrades on Rocky Linux or AlmaLinux, integrate this automation into your pipeline and get your clusters back in the fast lane.

Supercharge Your Nginx Security: A Practical Guide to Enabling TLS 1.3 on Rocky Linux 9

Alright, let’s get straight to it. You’re running a modern web stack on Linux. You’ve been diligent, you’ve secured your URL endpoints, and you’re serving traffic over HTTPS using TLS 1.2. That’s a solid baseline. But in the world of infrastructure, standing still is moving backward. TLS 1.3 has been the standard for a while now, and it’s not just an incremental update; it’s a significant leap forward in both security and performance.

The good news? If you’re on a current platform like Rocky Linux 9.6, you’re already 90% of the way there. The underlying components are in place. This guide is the final 10%—a no-nonsense, command-line focused walkthrough to get you from TLS 1.2 to the faster, more secure TLS 1.3, complete with the validation steps and pro-tips to make it production-ready.

Prerequisites Check: Ensure Your OS Is Up to Date and We're Good to Go

Before we touch any configuration files, let’s confirm your environment is ready. Enabling TLS 1.3 depends on two critical pieces of software: your web server (Nginx) and the underlying cryptography library (OpenSSL).

  • Nginx: You need version 1.13.0 or newer.
  • OpenSSL: You need version 1.1.1 or newer.

Rocky Linux 9.6 and its siblings in the RHEL 9 family ship with versions far newer than these minimums. Let’s verify it. SSH into your server and run this command:

nginx -V

The output will be verbose, but you’re looking for two lines. You’ll see something like this (your versions may differ slightly):

nginx version: nginx/1.26.x
built with OpenSSL 3.2.x ...

With Nginx and OpenSSL versions well above the minimum, we’re cleared for takeoff.

The Upgrade: Configuring Nginx for TLS 1.3

This is where the rubber meets the road. The process involves a single, targeted change to your Nginx configuration.

Step 1: Locate Your Nginx Server Block

Your SSL configuration is defined within a server block in your Nginx files. If you have a simple setup, this might be in /etc/nginx/nginx.conf. However, the best practice is to have separate configuration files for each site in /etc/nginx/conf.d/.

Find the relevant file for the site you want to upgrade. It will contain the listen 443 ssl; directive and your ssl_certificate paths.

Step 2: Modify the ssl_protocols Directive

Inside your server block, find the line that begins with ssl_protocols. To enable TLS 1.3 while maintaining compatibility for clients that haven’t caught up, modify this line to include TLSv1.3. The best practice is to support both 1.2 and 1.3.

# BEFORE
# ssl_protocols TLSv1.2;

# AFTER: Add TLSv1.3
ssl_protocols TLSv1.2 TLSv1.3;

It is critical to verify this directive in every server block where you want TLS 1.3 enabled. ssl_protocols is inherited from the http block, but the protocol list used for an actual handshake comes from the default server for that IP:port pair, because protocol negotiation happens before SNI. A forgotten default server block can silently pin you to the old protocol list.

Validation and Deployment: Trust, but Verify

A configuration change isn’t complete until it’s verified. This two-step process ensures you don’t break your site and that the change actually worked.

Step 1: Test and Reload Nginx

Never apply a new configuration blind. First, run the built-in Nginx test to check for syntax errors:

sudo nginx -t

If all is well, you’ll see a success message. Now, gracefully reload Nginx to apply the changes without dropping connections:

sudo systemctl reload nginx

Step 2: Verify TLS 1.3 is Active

Your server is reloaded, but how do you know TLS 1.3 is active? You must verify it with an external tool.

  • Quick Command-Line Check: For a fast check from your terminal, use curl:
    curl -I -v --tlsv1.3 --tls-max 1.3 https://your-domain.com

    Look for output confirming a successful connection using TLSv1.3.

  • The Gold Standard: The most comprehensive way to verify your setup is with the Qualys SSL Labs SSL Server Test. Navigate to their website, enter your domain name, and run a scan. In the “Configuration” section of the report, you will see a heading for “Protocols.” If your setup was successful, you will see a definitive “Yes” next to TLS 1.3.

Advanced Hardening: Pro-Tips for Production

You’ve enabled a modern protocol. Now, let’s enforce its use and add other layers of security that a production environment demands.

Pro-Tip 1: Implement HSTS (HTTP Strict Transport Security)

HSTS is a header your server sends to tell browsers that they should only communicate with your site using HTTPS. This prevents downgrade attacks. Add this header to your Nginx server block:

add_header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload" always;
  • max-age=63072000: Tells the browser to cache this rule for two years.
  • includeSubDomains: Applies the rule to all subdomains. Use with caution.
  • preload: Allows you to submit your site to a list built into browsers, ensuring they never connect via HTTP.
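The max-age value is just a count of seconds; 63072000 is two 365-day years, which is easy to sanity-check with shell arithmetic:

```shell
# HSTS max-age: two 365-day years expressed in seconds
echo $(( 2 * 365 * 24 * 3600 ))
```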

Pro-Tip 2: Enable OCSP Stapling

Online Certificate Status Protocol (OCSP) Stapling improves performance and privacy by allowing your server to fetch the revocation status of its own certificate and “staple” it to the TLS handshake. This saves the client from having to make a separate request to the Certificate Authority.

Enable it by adding these lines to your server block:

# OCSP Stapling
ssl_stapling on;
ssl_stapling_verify on;
ssl_trusted_certificate /etc/letsencrypt/live/your-domain.com/chain.pem; # CA chain (chain.pem for Let's Encrypt), used to verify stapled responses
resolver 8.8.8.8 1.1.1.1 valid=300s; # Use public resolvers

Pro-Tip 3: Modernize Your Cipher Suites

While TLS 1.3 has its own small set of mandatory, highly secure cipher suites, you can still define the ciphers for TLS 1.2. The ssl_prefer_server_ciphers directive should be set to off for TLS 1.3, which is the default in modern Nginx versions, allowing the client’s more modern cipher preferences to be honored. However, you should still define a strong cipher list for TLS 1.2.

Here is a modern configuration snippet combining these tips:

server {
    listen 443 ssl http2;
    server_name your-domain.com;

    # SSL Config
    ssl_certificate /path/to/fullchain.pem;
    ssl_certificate_key /path/to/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_prefer_server_ciphers off;

    # HSTS Header
    add_header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload" always;

    # OCSP Stapling
    ssl_stapling on;
    ssl_stapling_verify on;
    ssl_trusted_certificate /path/to/chain.pem;

    # ... other configurations ...
}

TL;DR

  • Enable TLS 1.3 by adding it to the ssl_protocols directive in your Nginx server block: ssl_protocols TLSv1.2 TLSv1.3;. Rocky Linux 9.6 ships with the required Nginx and OpenSSL versions.
  • Always validate your configuration before and after applying it. Use sudo nginx -t to check syntax, and then use an external tool like the Qualys SSL Labs test to confirm TLS 1.3 is active on your live domain.
  • Go beyond the basic setup by implementing advanced hardening. Add the Strict-Transport-Security (HSTS) header and enable OCSP Stapling to build a truly robust and secure configuration.

Conclusion

Upgrading to TLS 1.3 on a modern stack like Nginx on Rocky Linux 9 is refreshingly simple. The core task is a one-line change. However, as a senior engineer, your job doesn’t end there. The real “super hack” is in the full workflow: making the change, rigorously validating it from an external perspective, and then hardening the configuration with production-grade features like HSTS and OCSP Stapling. By following these steps, you’ve done more than just flip a switch; you’ve demonstrably improved your site’s security posture and performance, confirming your stack is compliant with the latest standards.

Implementing ILM with Write Aliases (Logstash + Elasticsearch)

In this blog post, I demonstrate how to create a new Elasticsearch index that can roll over automatically using aliases.

We will implement ILM (Index Lifecycle Management) in Elasticsearch with Logstash using write aliases.

Optimize Elasticsearch indexing with a clean, reliable setup: use Index Lifecycle Management (ILM) with a dedicated write alias, let Elasticsearch handle rollovers, and keep Logstash writing to the alias instead of hardcoded index names. This approach improves stability, reduces manual ops, and scales cleanly as log volume grows.


What you’ll set up

  • Write to a single write alias.
  • Apply ILM via an index template with a rollover alias.
  • Bootstrap the first index with the alias marked as is_write_index:true.
  • Point Logstash at ilm_rollover_alias (not a date-based index).

Prerequisites

  • Elasticsearch with ILM enabled.
  • Logstash connected to Elasticsearch.
  • An ILM policy (example: es_policy01).

1) Create index template with rollover alias

Define a template that applies the ILM policy and the alias all indices will use.

PUT _index_template/test-vks
{
  "index_patterns": ["vks-nginx-*"],
  "priority": 691,
  "template": {
    "settings": {
      "index": {
        "lifecycle": {
          "name": "es_policy01",
          "rollover_alias": "vks-nginx-write-alias"
        },
        "number_of_shards": 1,
        "number_of_replicas": 0
      }
    },
    "mappings": {
      "dynamic": "runtime"
    }
  }
}

Notes:

  • Only set index.lifecycle.rollover_alias here; do not declare the alias body in the template.
  • Tune shards/replicas for your cluster and retention goals.

2) Bootstrap the first index

Create the first managed index and bind the write alias to it.

PUT /<vks-nginx-error-{now/d}-000001>
{
  "aliases": {
    "vks-nginx-write-alias": {
      "is_write_index": true
    }
  }
}

Notes:

  • The -000001 suffix is required for rollover sequencing.
  • is_write_index:true tells Elasticsearch where new writes should go.
  • In Kibana Dev Tools you can type the date-math name literally; with curl, the special characters must be percent-encoded (the path becomes %3Cvks-nginx-error-%7Bnow%2Fd%7D-000001%3E).

3) Configure Logstash to use the write alias

Point Logstash to the rollover alias and avoid hardcoding an index name.

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    manage_template => false
    template_name   => "test-vks"
    # index => "vks-nginx-error-%{+YYYY.MM.dd}"   # keep commented when using ILM
    ilm_rollover_alias => "vks-nginx-write-alias"
  }
}

Notes:

  • manage_template => false prevents Logstash from overwriting your Elasticsearch template.
  • Restart Logstash after changes.

How rollover works

  • When ILM conditions are met, Elasticsearch creates the next index (...-000002), moves the write alias to it, and keeps previous indices searchable.
  • Reads via the alias cover all indices it targets; writes always land on the active write index.
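The -000002 name isn't magic: rollover increments the zero-padded numeric suffix of the current write index. The naming rule can be sketched in bash (the index name below is a made-up example):

```shell
# Current write index (example name); rollover bumps the 6-digit suffix.
current="vks-nginx-error-2024.01.15-000001"

prefix="${current%-*}"   # everything before the final "-"
seq="${current##*-}"     # the numeric suffix, e.g. 000001
# 10# forces base 10 so the leading zeros aren't parsed as octal (bash)
next=$(printf '%s-%06d' "$prefix" $(( 10#$seq + 1 )))

echo "$next"   # vks-nginx-error-2024.01.15-000002
```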

Common issues and quick fixes

  • rollover_alias missing: Ensure index.lifecycle.rollover_alias is set in the template and matches the alias used in bootstrap and Logstash.
  • Docs landing in the wrong index: Remove index in Logstash; use only ilm_rollover_alias.
  • Alias conflicts on rollover: Don’t embed the alias body in the template—bind it during the bootstrap call only.

Complete flow: template with rollover alias → bootstrap index (-000001, is_write_index:true) → Logstash writes to the alias → ILM rolls the index over.

Quick checklist

  • ILM policy exists (e.g., es_policy01).
  • Template includes index.lifecycle.name and index.lifecycle.rollover_alias.
  • First index created with -000001 and is_write_index:true.
  • Logstash writes to the alias (no concrete index).
  • Logstash restarted and ILM verified.

Verify your setup (optional)

Run these in Kibana Dev Tools or via curl:

GET _ilm/policy/es_policy01
GET _index_template/test-vks
GET vks-nginx-write-alias/_alias
POST /vks-nginx-write-alias/_rollover   # non-prod/manual test

Install Java on Linux CentOS

In this tutorial we will quickly set up Java on Linux CentOS.

We will use the yum command to download and install OpenJDK 1.8:

[vamshi@node01 ~]$ sudo yum install java-1.8.0-openjdk.x86_64

We have installed OpenJDK 1.8; we can check the version using java -version:

[vamshi@node01 ~]$ java -version
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)


We make use of the alternatives command in CentOS, which lists any other versions of Java installed on the machine, and then lets us set the system-wide default Java version.

[vamshi@node01 ~]$ alternatives --list | grep java
java auto /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/bin/java
jre_openjdk auto /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre
jre_1.8.0 auto /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre
jre_1.7.0 auto /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.261-2.6.22.2.el7_8.x86_64/jre
[vamshi@node01 ~]$ sudo alternatives --config java

There are 2 programs which provide 'java'.

  Selection    Command
-----------------------------------------------
*  1           java-1.8.0-openjdk.x86_64 (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/bin/java)
 + 2           java-1.7.0-openjdk.x86_64 (/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.261-2.6.22.2.el7_8.x86_64/jre/bin/java)

Enter to keep the current selection[+], or type selection number: 1

This sets OpenJDK 1.8 as the default version of java on the system.

Setting the JAVA_HOME path
We export the JAVA_HOME variable so that other programs and users that rely on it, such as Maven or a servlet container, can locate the Java installation.

There are two levels at which we can set the visibility of the JAVA_HOME environment variable.
1. Set JAVA_HOME for a single user profile
Add the following to ~/.bash_profile:

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre

PATH=$PATH:$JAVA_HOME/bin

export PATH

Note that JAVA_HOME points to the JRE root directory, not to its bin/ subdirectory; bin/ is what gets appended to PATH.

To apply the changes, log out and back in, or source ~/.bash_profile directly:

[vamshi@node01 ~]$ source ~/.bash_profile

Verifying the changes:

[vamshi@node01 ~]$ echo $PATH
/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/vamshi/.local/bin:/home/vamshi/bin:/home/vamshi/.local/bin:/home/vamshi/bin:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/bin
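You can try the JAVA_HOME/PATH mechanics without a real JDK. The sketch below uses a throwaway directory with a fake java executable standing in for the OpenJDK install path; every name in it is a placeholder, not a real JDK layout.

```shell
# Throwaway directory standing in for the real JDK root (placeholder)
fake_jdk=$(mktemp -d)
mkdir -p "$fake_jdk/bin"
printf '#!/bin/sh\necho "openjdk-demo 1.8"\n' > "$fake_jdk/bin/java"
chmod +x "$fake_jdk/bin/java"

# Same shape as the .bash_profile entries (prepended here so the
# demo binary wins even if a real java is installed)
export JAVA_HOME="$fake_jdk"
export PATH="$JAVA_HOME/bin:$PATH"

java               # resolved through the new PATH entry
echo "$JAVA_HOME"  # points at the (fake) JDK root, not its bin/
```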

2. Set JAVA_HOME system-wide, available to all users.

[vamshi@node01 ~]$ sudo sh -c "echo 'export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre' > /etc/profile.d/java.sh"

This writes the JAVA_HOME path to a new file, /etc/profile.d/java.sh, in the system-wide profile.d directory, which every login shell reads.

Ensure the changes were written to /etc/profile.d/java.sh:

[vamshi@node01 ~]$ cat /etc/profile.d/java.sh
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre

To apply the change immediately in your current shell, source the file (other users pick it up at their next login):

[vamshi@node01 ~]$ source /etc/profile.d/java.sh

Verify with the env command:

[vamshi@node01 ~]$ env | grep JAVA_HOME
JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre

How do I download and install Java on CentOS?

Install Java On CentOS

  1. Install OpenJDK 11. Update the package repository to ensure you download the latest software: sudo yum update. …
  2. Install OpenJRE 11. Java Runtime Environment 11 (Open JRE 11) is a subset of OpenJDK. …
  3. Install Oracle Java 11. …
  4. Install JDK 8. …
  5. Install JRE 8. …
  6. Install Oracle Java 12.

Is Java installed on CentOS?

OpenJDK, the open-source implementation of the Java Platform, is the default Java development and runtime in CentOS 7. The installation is simple and straightforward.

How do I install Java on Linux?

  • Java for Linux Platforms
  • Change to the directory in which you want to install. Type: cd directory_path_name. …
  • Move the .tar.gz archive binary to the current directory.
  • Unpack the tarball and install Java: tar zxvf jre-8u73-linux-i586.tar.gz. The Java files are installed in a directory called jre1. …
  • Delete the .tar.gz file.

How do I install latest version of Java on CentOS?

To install the OpenJDK 8 JRE using yum, run this command: sudo yum install java-1.8.0-openjdk.

Where is java path on CentOS?

They usually reside in /usr/lib/jvm. You can list them via ll /usr/lib/jvm. The value you need to enter in the JAVA_HOME field in Jenkins is /usr/lib/jvm/java-1.8.

How do I know if java is installed on CentOS 7?

  • To check the Java version on Linux Ubuntu/Debian/CentOS:
  • Open a terminal window.
  • Run the following command: java -version.
  • The output should display the version of the Java package installed on your system. In the example below, OpenJDK version 11 is installed.

Where is java path set in Linux?

Steps

  • Change to your home directory: cd $HOME.
  • Open the .bashrc file.
  • Add the following line to the file, replacing the JDK directory with the name of your Java installation directory: export PATH=/usr/java/<JDK Directory>/bin:$PATH.
  • Save the file and exit. Use the source command to force Linux to reload the .bashrc file.

How do I install java 14 on Linux?

Installing OpenJDK 14

  • Step 1: Update APT. …
  • Step 2: Download and Install JDK Kit. …
  • Step 3: Check Installed JDK Framework. …
  • Step 4: Update Path to JDK (Optional) …
  • Step 6: Set Up Environment Variable. …
  • Step 7: Open Environment File. …
  • Step 8: Save Your Changes.

How do I know where java is installed on Linux?

This depends a bit on your package system… if the java command works, you can type readlink -f $(which java) to find the location of the java command. On the OpenSUSE system I’m on now it returns /usr/lib64/jvm/java-1.6.0-openjdk-1.6.0/jre/bin/java (but this is not a system which uses apt-get).

How do I install java 11 on Linux?

Installing the 64-Bit JDK 11 on Linux Platforms

  1. Download the required file. For Linux x64 systems: jdk-11.interim. …
  2. Change the directory to the location where you want to install the JDK, then move the .tar. …
  3. Unpack the tarball and install the downloaded JDK: $ tar zxvf jdk-11. …
  4. Delete the .tar.gz file.

Signals in Linux; trap command – practical example

The SIGNALS in Linux

Signals are the kernel’s response to certain actions generated by the user, by a program or application, or by I/O devices.
The Linux trap command gives us a good way to understand signals and take advantage of them.
The trap command can be used to respond to certain conditions and invoke various actions when a shell receives a signal.
Below are the various signals in Linux.

[vamshi@linuxcent ~]$ trap -l
1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL 5) SIGTRAP
6) SIGABRT 7) SIGBUS 8) SIGFPE 9) SIGKILL 10) SIGUSR1
11) SIGSEGV 12) SIGUSR2 13) SIGPIPE 14) SIGALRM 15) SIGTERM
16) SIGSTKFLT 17) SIGCHLD 18) SIGCONT 19) SIGSTOP 20) SIGTSTP
21) SIGTTIN 22) SIGTTOU 23) SIGURG 24) SIGXCPU 25) SIGXFSZ
26) SIGVTALRM 27) SIGPROF 28) SIGWINCH 29) SIGIO 30) SIGPWR
31) SIGSYS 34) SIGRTMIN 35) SIGRTMIN+1 36) SIGRTMIN+2 37) SIGRTMIN+3
38) SIGRTMIN+4 39) SIGRTMIN+5 40) SIGRTMIN+6 41) SIGRTMIN+7 42) SIGRTMIN+8
43) SIGRTMIN+9 44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9 56) SIGRTMAX-8 57) SIGRTMAX-7
58) SIGRTMAX-6 59) SIGRTMAX-5 60) SIGRTMAX-4 61) SIGRTMAX-3 62) SIGRTMAX-2
63) SIGRTMAX-1 64) SIGRTMAX

Let’s take a look at some important SIGNALS and how they are categorized:

Job control signals: these signals are used to control queued and waiting processes
(18) SIGCONT, (19) SIGSTOP, (20) SIGTSTP

Termination signals: these signals are used to interrupt or terminate a running process
(2) SIGINT, (3) SIGQUIT, (6) SIGABRT, (9) SIGKILL, (15) SIGTERM.

Async I/O signals: these signals are generated when data is available on an input/output device or when the kernel wishes to notify applications about resource availability.
(23) SIGURG, (29) SIGIO, (29) SIGPOLL.

Timer signals: these signals are generated when an application wishes to trigger timer alarms.
(14) SIGALRM, (27) SIGPROF, (26) SIGVTALRM.

Error reporting signals: these signals occur when a running process or application code ends up in an exception or a fault.
(1) SIGHUP, (4) SIGILL, (5) SIGTRAP, (7) SIGBUS, (8) SIGFPE, (13) SIGPIPE, (11) SIGSEGV, (24) SIGXCPU.
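Alongside trap -l, the kill -l built-in translates between the numbers and names used in the categories above (bash syntax):

```shell
kill -l 9        # prints KILL
kill -l 15       # prints TERM
kill -l SIGTERM  # prints 15
```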

Trap command syntax:

trap [-lp] [[ARG] SIGNAL ...]

ARG is a command to be interpreted and executed when the shell receives the signal(s) SIGNAL.

If no arguments are supplied, trap prints the list of commands associated with each signal.
To unset a trap, a - is used in place of ARG, followed by the SIGNAL, as we will demonstrate in the following section.

How to set a trap on linux through the command line?

[vamshi@linuxcent ~]$ trap 'echo -e "You Pressed Ctrl-C"' SIGINT

Now you have successfully set up a trap.

Whenever you press Ctrl-C on your keyboard, the message “You Pressed Ctrl-C” gets printed.

[vamshi@linuxcent ~]$ ^CYou Pressed Ctrl-C
[vamshi@linuxcent ~]$ ^CYou Pressed Ctrl-C
[vamshi@linuxcent ~]$ ^CYou Pressed Ctrl-C
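The same trap works non-interactively. A minimal script that registers a handler and then delivers a signal to itself (SIGUSR1 is used here so the demo runs unattended, without a keyboard interrupt):

```shell
#!/bin/sh
# Register a handler for SIGUSR1, then send that signal to this shell.
trap 'echo "caught SIGUSR1"' USR1

kill -USR1 $$     # the handler fires once kill returns
echo "still running"
```

The handler runs between the two commands, so the script prints "caught SIGUSR1" followed by "still running".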

Now type the trap command and you can see the currently set trap details.

[vamshi@node01 ~]$ trap
trap -- 'echo -e "You Pressed Ctrl-C"' SIGINT
trap -- '' SIGTSTP
trap -- '' SIGTTIN
trap -- '' SIGTTOU

To unset the trap, pass - in place of the command:

[vamshi@node01 ~]$ trap - SIGINT

The same is evident from the output below:

[vamshi@node01 ~]$ trap
trap -- '' SIGTSTP
trap -- '' SIGTTIN
trap -- '' SIGTTOU
[vamshi@node01 ~]$ ^C
[vamshi@node01 ~]$ ^C

What is trap command in Linux?

trap is a built-in bash command used to execute a command when the shell receives a signal. When an event occurs, bash notifies the script via a signal. Many signals are available in bash; the most common is SIGINT (Signal Interrupt).

What is trap command in bash?

If you’ve written any amount of bash code, you’ve likely come across the trap command. Trap allows you to catch signals and execute code when they occur. Signals are asynchronous notifications that are sent to your script when certain events occur.

How do you Ctrl-C trap?

To trap Ctrl-C in a shell script, we will need to use the trap shell builtin command. When a user sends a Ctrl-C interrupt signal, the signal SIGINT (Signal number 2) is sent.

What is trap shell?

In the fish shell, trap is a wrapper around the fish event delivery framework. It exists for backwards compatibility with POSIX shells; for other uses, fish recommends defining an event handler instead. As in bash, ARG is the command to be executed on signal delivery.

What signals Cannot be caught?

There are two signals which cannot be intercepted and handled: SIGKILL and SIGSTOP.

How does shell trap work?

The shell records the command you pass to trap as the handler for the listed signals. When one of those signals is delivered, the shell finishes the command it is currently executing, runs the handler, and then resumes the script. A trap set to an empty string ('') causes the signal to be ignored instead.

How do I wait in Linux?

Approach:

  1. Creating a simple process.
  2. Using a special variable($!) to find the PID(process ID) for that particular process.
  3. Print the process ID.
  4. Using wait command with process ID as an argument to wait until the process finishes.
  5. After the process is finished printing process ID with its exit status.
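The five steps above can be sketched as a short script:

```shell
#!/bin/sh
# 1) create a simple background process
sleep 1 &
# 2) capture its PID via the special variable $!
pid=$!
# 3) print the process ID
echo "started pid: $pid"
# 4) wait until that process finishes
wait "$pid"
# 5) report the exit status after it completes
echo "exit status: $?"
```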

How use stty command in Linux?

  1. stty --all: This option prints all current settings in human-readable form. …
  2. stty -g: This option prints all current settings in a stty-readable form. …
  3. stty -F <device>: This option opens and uses the specified DEVICE instead of stdin. …
  4. stty --help: This option displays help text and exits.

Can I trap Sigkill?

You can’t catch SIGKILL (and SIGSTOP), so enabling your custom handler for SIGKILL is moot. You can catch all other signals, so perhaps try to design around those. By default pkill will send SIGTERM, not SIGKILL, which obviously can be caught.

What signal is Ctrl D?

Ctrl + D is not a signal, it’s EOF (End-Of-File). It closes the stdin pipe. If read(STDIN) returns 0, it means stdin closed, which means Ctrl + D was hit (assuming there is a keyboard at the other end of the pipe).