What Is eBPF? A Plain-English Guide for Linux and Kubernetes Engineers

~1,900 words · Reading time: 7 min · Series: eBPF: From Kernel to Cloud, Episode 1 of 18

Your Linux kernel has had a technology built into it since 2014 that most engineers working with Linux every day have never looked at directly. You have almost certainly been using it — through Cilium, Falco, Datadog, or even systemd — without knowing it was there.

This post is the plain-English introduction to eBPF that I wish had existed when I first encountered it. No kernel engineering background required. No bytecode, no BPF maps, no JIT compilation. Just a clear answer to the question every Linux admin and DevOps engineer eventually asks: what actually is eBPF, and why does it matter for the infrastructure I run every day?


First: Forget the Name

eBPF stands for extended Berkeley Packet Filter. It is one of the most misleading names in computing for what the technology actually does.

The original BPF was a 1992 mechanism for filtering network packets — the engine behind tcpdump. The extended version, introduced in Linux 3.18 (2014) and significantly matured through Linux 5.x, is a completely different technology. It is no longer just about packets. It is no longer just about filtering.

Forget the name. Here is what eBPF actually is:

eBPF lets you run small, safe programs directly inside the Linux kernel — without writing a kernel module, without rebooting, and without modifying your applications.

That is the complete definition. Everything else is implementation detail. The one-liner above is what matters for how you use it day to day.
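
To make the one-liner concrete: assuming the bpftrace utility is installed on a host (a separate package, not something this post requires), a single command is enough to load, verify, and run a tiny program inside the kernel:

# One line of bpftrace: the kernel loads, verifies, and runs this small program,
# printing every file open on the system in real time until you press Ctrl-C.
$ sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s opened %s\n", comm, str(args->filename)); }'

No kernel module, no reboot, no change to any application — and the program is removed the moment you stop it.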


What the Linux Kernel Can See That Nothing Else Can

To understand why eBPF is significant, you need to understand what the Linux kernel already sees on every server and every Kubernetes node you run.

The kernel is the lowest layer of software on your machine. Every action that happens — every file opened, every process started, every network packet sent — passes through the kernel. That means it has a complete, real-time view of everything:

  • Every syscall — every open(), execve(), connect(), write() from every process in every container on the node, in real time
  • Every network packet — source, destination, port, protocol, bytes, and latency for every pod-to-pod and pod-to-external connection
  • Every process event — every fork, exec, and exit, including processes spawned inside containers that your container runtime never reports
  • Every file access — which process opened which file, when, and with what permissions, across all workloads on the node simultaneously
  • CPU and memory usage — per-process CPU time, function-level latency, and memory allocation patterns without profiling agents

The kernel has always had this visibility. The problem was that there was no safe, practical way to access it without writing kernel modules — which are complex, kernel version-specific, and genuinely dangerous to run in production. eBPF is the safe, practical way to access it.


The Problem eBPF Solves — A Real Kubernetes Scenario

Here is a situation every Kubernetes engineer has faced. A production pod starts behaving strangely — elevated CPU, slow responses, occasional connection failures. You want to understand what is happening at a low level: what syscalls is it making, what network connections is it opening, is something spawning unexpected processes?

The old approaches and their problems

Restart the pod with a debug sidecar. You lose the current state immediately. The issue may not reproduce. You have modified the workload.

Run strace inside the container via kubectl exec. strace uses ptrace, which adds 50–100% CPU overhead to the traced process and is unavailable in hardened containers. You are tracing one process at a time with no cluster-wide view.

Poll /proc with a monitoring agent. Snapshot-based. Any event that happens between polls is invisible. A process that starts, does something, and exits between intervals is completely missed.

The eBPF approach

# Use a debug pod on the node — no changes to your workload
$ kubectl debug node/your-node -it --image=cilium/hubble-cli

# Real-time kernel events from every container on this node:
sys_enter_execve  pid=8821  comm=sh    args=["/bin/sh","-c","curl http://..."]
sys_enter_connect pid=8821  comm=curl  dst=203.0.113.42:443
sys_enter_openat  pid=8821  comm=curl  path=/etc/passwd

# Something inside the pod spawned a shell, made an outbound connection,
# and read /etc/passwd — all visible without touching the pod.

Real-time visibility. No overhead on your workload. Nothing restarted. Nothing modified. That is what eBPF makes possible.


Tools You Are Probably Already Running on eBPF

eBPF is not a standalone product — it is the foundation that many tools in the cloud-native ecosystem are built on. You may already be running eBPF on your nodes without thinking about it explicitly.

  • Cilium — replaces kube-proxy and iptables with kernel-level packet routing, 2–3× faster at scale. Without eBPF: iptables rules with linear lookup that degrade as service count grows.
  • Falco — watches every syscall in every container for security rule violations, with sub-millisecond detection. Without eBPF: a kernel module (risky) or ptrace (high overhead).
  • Tetragon — runtime security enforcement that can kill a process or drop a network packet at the kernel level. Without eBPF: no practical alternative at this detection speed.
  • Datadog Agent — network performance monitoring and universal service monitoring without application code changes. Without eBPF: language-specific agents injected into application code.
  • systemd — cgroup resource accounting and network traffic control on your Linux nodes. Without eBPF: legacy cgroup v1 interfaces with limited visibility.
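
If you want to check whether any of these tools have already loaded eBPF programs on a node, the kernel can tell you directly. A quick sketch, assuming the bpftool utility is installed:

# List every eBPF program currently loaded in this node's kernel.
# Program names usually hint at which tool loaded them.
$ sudo bpftool prog list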

eBPF vs the Old Ways

Before eBPF, getting deep visibility into a running Linux system meant choosing between three approaches, each with a significant trade-off:

  • Kernel modules — visibility: full kernel access. Cost: one bug means a kernel panic; version-specific, must be recompiled for every kernel update. Production safe: no.
  • ptrace / strace — visibility: one process at a time. Cost: 50–100% CPU overhead on the traced process; unusable in production. Production safe: no.
  • Polling /proc — visibility: snapshots only. Cost: events between polls are invisible; short-lived processes are missed entirely. Production safe: partial.
  • eBPF — visibility: full kernel visibility. Cost: roughly 1–3% overhead, verifier-guaranteed safety, a real-time stream rather than polling. Production safe: yes.

Is It Safe to Run in Production?

This is always the first question from any experienced Linux admin, and it is exactly the right question to ask. The answer is yes — and the reason is the BPF verifier.

Before any eBPF program is allowed to run on your node, the Linux kernel runs it through a built-in static safety analyser. This analyser examines every possible execution path and asks: could this program crash the kernel, loop forever, or access memory it should not?

If the answer is yes — or even maybe — the program is rejected at load time. It never runs.

This is fundamentally different from kernel modules. A kernel module loads immediately with no safety check. If it has a bug, you find out at runtime — usually as a kernel panic. An eBPF program that would cause a panic is rejected before it ever loads. The safety guarantee is mathematical, not hopeful.

Episode 2 of this series covers the BPF verifier in full: what it checks, how it makes Cilium and Falco safe on your production nodes, and what questions to ask eBPF tool vendors about their implementation.


Common Misconceptions

eBPF is not a specific tool or product. It is a kernel technology — a platform. Cilium, Falco, Tetragon, and Pixie are tools built on top of it. When a vendor says “we use eBPF”, they mean they build on this kernel capability, not that they share a single implementation.

eBPF is not only for networking. The Berkeley Packet Filter name suggests networking, but modern eBPF covers security, observability, performance profiling, and tracing. The networking origin is historical, not a limitation.

eBPF is not only for Kubernetes. It works on any Linux system running kernel 4.9+, including bare metal servers, Docker hosts, and VMs. K8s is the most popular deployment target because of the observability challenges at scale, but it is not a requirement.

You do not need to write eBPF programs to benefit from eBPF. Most Linux admins and DevOps engineers will use eBPF through tools like Cilium, Falco, and Datadog — never writing a line of BPF code themselves. This series covers the writing side later. Understanding what eBPF is makes you a significantly better user of these tools today.


Kernel Version Requirements

eBPF is a Linux kernel feature. The capabilities available depend directly on the kernel version running on your nodes. Run uname -r on any node to check.
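
A quick check you can run on any node — the BTF path and bpftool are assumptions that depend on how the kernel and packages were built, so treat this as a sketch:

# Kernel version — compare against the table below
$ uname -r

# BTF availability (needed for CO-RE tooling); present on kernels built with BTF support
$ ls /sys/kernel/btf/vmlinux

# If bpftool is installed, it can probe the available eBPF features directly
$ sudo bpftool feature probe kernel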

  • 4.9+ — basic eBPF support: tracing and socket filtering. Most production systems today meet this minimum.
  • 5.4+ — BTF (BPF Type Format) and CO-RE, so programs adapt to different kernel versions without recompiling. Recommended minimum for production tooling.
  • 5.8+ — ring buffers for high-performance event streaming, plus global variables. The target kernel for full Cilium, Falco, and Tetragon feature support.
  • 6.x — open-coded iterators, continued verifier improvements, and broader LSM security enforcement hook support.

Amazon Linux 2023 and Ubuntu 22.04+ ship kernel 5.15 or newer out of the box and are fully eBPF-ready.

EKS users: Amazon Linux 2023 AMIs ship with kernel 6.1+ and support the full modern eBPF feature set out of the box. If you are still on AL2, the migration also resolves the NetworkManager deprecation issues covered in the EKS 1.33 post.


The Bottom Line

eBPF is the answer to a question Linux engineers have been asking for years: how do I get deep visibility into what is happening on my servers and Kubernetes nodes — without adding massive overhead, injecting sidecars, or risking a kernel panic?

The answer is: run small, safe programs at the kernel level, where everything is already visible. Let the BPF verifier guarantee those programs are safe before they run. Stream the results to your observability tools through shared memory maps.

The tools you already use — Cilium for networking, Falco for security, Datadog for APM — are built on this foundation. Understanding eBPF means understanding why those tools work the way they do, what they can and cannot see, and how to evaluate new tools that claim to use it.

Every eBPF-based tool you run on your nodes passed through the BPF verifier before it touched your cluster. Episode 2 covers exactly what that means — and why it matters for your infrastructure decisions.


Questions or corrections? Reach me on LinkedIn. If this was useful, the full series index is on linuxcent.com — search the eBPF Series tag for all episodes.

Enterprise Awakening: RBAC, CRDs, Cloud Providers, and Helm Goes Mainstream (2016–2018)

Reading Time: 6 minutes


Introduction

By the end of 2016, engineers were running Kubernetes in production. Not as an experiment — in production, handling real traffic. And that’s where the real gaps became visible.

The 2016–2018 period is the era when Kubernetes grew up. RBAC went stable. CRDs replaced the fragile ThirdPartyResource hack. The major cloud providers launched managed services. Helm became the standard for packaging. And the security posture, which had been an afterthought in the Borg-derived model, started getting serious attention.


Kubernetes 1.6 — The RBAC Milestone (March 2017)

Kubernetes 1.6 is the release that made enterprise Kubernetes possible. The headline feature: RBAC (Role-Based Access Control) promoted to beta, enabled by default.

Before RBAC, Kubernetes had attribute-based access control (ABAC) — a flat policy file on the API server that required a restart to change. It worked, but it was operationally painful and offered no granularity at the namespace level.

RBAC introduced four objects:
Role: A set of permissions scoped to a namespace
ClusterRole: A set of permissions cluster-wide or reusable across namespaces
RoleBinding: Assigns a Role to a user/group/service account in a namespace
ClusterRoleBinding: Assigns a ClusterRole cluster-wide

# Example: read-only access to pods in the dev namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: dev
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: dev
subjects:
- kind: User
  name: alice
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

Also in 1.6:
etcd v3 as default: Better performance, watch semantics, and transaction support
Node Authorization mode: Kubelets can now only access secrets and pods bound to their own node — a critical lateral movement restriction
Audit logging (alpha): API server logs every request — who did what, to which resource, at what time
Scale: Tested to 5,000 nodes per cluster

The node authorization mode deserves more attention than it typically gets. Before 1.6, a compromised kubelet could read all secrets in the cluster. Node authorization restricted the kubelet to only the secrets it needed for pods scheduled on that node. This single change dramatically reduced the blast radius of a node compromise.
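
For reference, the hardening described above maps to kube-apiserver flags. A minimal sketch using the modern flag names — exact flags and paths vary by Kubernetes version and installer:

# Enable the Node authorizer alongside RBAC, restrict kubelets via admission,
# and write basic audit logs (alpha in 1.6).
kube-apiserver \
  --authorization-mode=Node,RBAC \
  --enable-admission-plugins=NodeRestriction \
  --audit-log-path=/var/log/kubernetes/audit.log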


Kubernetes 1.7 — Custom Resource Definitions (June 2017)

The most significant architectural decision in Kubernetes history after the initial design: ThirdPartyResources (TPRs) were replaced with CustomResourceDefinitions (CRDs).

TPRs were a fragile mechanism, introduced around 1.2, that let users define custom API types. They had serious limitations: no schema validation, no versioning, data-loss bugs, and poor upgrade behavior.

CRDs are what make the Kubernetes API extension model work. They let you define new resource types that the API server stores and serves, with optional schema validation via OpenAPI v3 schemas, version conversion, and admission webhook integration.

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.stable.example.com
spec:
  group: stable.example.com
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              size:
                type: string
              version:
                type: string
  scope: Namespaced
  names:
    plural: databases
    singular: database
    kind: Database

CRDs enabled the entire Operator ecosystem that would define the next phase of Kubernetes. Without stable, schema-validated custom resources, you can’t build reliable controllers on top of them.

Also in 1.7:
Secrets encryption at rest (alpha): Finally, secrets stored in etcd could be encrypted with AES-CBC or AES-GCM
Network Policy promoted to stable: CNI plugins implementing NetworkPolicy could now enforce pod-level ingress/egress rules
API aggregation layer: Extend the Kubernetes API with custom API servers — the foundation for metrics-server and other API extensions


Kubernetes 1.8 — RBAC Goes Stable (September 2017)

RBAC graduated to stable in 1.8. This was the point of no return for enterprise adoption. Security teams could now enforce least-privilege on Kubernetes API access with a documented, stable API.
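
Stable RBAC also made least-privilege something you can verify rather than assume. A quick sketch using kubectl's built-in authorization check — the user alice and the dev namespace refer back to the 1.6 example above:

# Allowed by the pod-reader Role: get/watch/list pods in dev
$ kubectl auth can-i list pods --namespace dev --as alice

# Not granted by that Role, so this should return "no"
$ kubectl auth can-i delete pods --namespace dev --as alice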

Key additions:
Storage Classes stable: Dynamic volume provisioning — request a PersistentVolume and have the underlying storage (EBS, GCE PD, NFS) automatically provisioned
Workloads API (apps/v1beta2): Deployments, ReplicaSets, DaemonSets, and StatefulSets all moved under a unified API group, signaling they were heading toward stable

The admission webhook framework — which would become the foundation for policy enforcement tools like OPA/Gatekeeper — was also being refined in this period.


The Cloud Provider Moment (2017–2018)

October 2017: Docker Surrenders

At DockerCon Europe in October 2017, Docker Inc. announced that Docker Enterprise Edition would ship with Kubernetes support alongside Docker Swarm. This was, effectively, Docker Inc. conceding the orchestration market to Kubernetes. Swarm remained available, but the message was clear: Kubernetes was the production standard.

October 2017: Microsoft Previews AKS

Microsoft previewed Azure Kubernetes Service at DockerCon Europe. The managed Kubernetes race was on.

November 2017: Amazon Announces EKS

At AWS re:Invent 2017, Amazon announced Elastic Kubernetes Service. The three major cloud providers — Google (GKE, running since 2014), Microsoft (AKS), and Amazon (EKS) — were all committed to managed Kubernetes.

For enterprise buyers, this was the signal they needed. Kubernetes was no longer a bet on an experimental technology — it was the supported, managed offering from every major cloud provider.


Kubernetes 1.9 — Workloads API Stable (December 2017)

The Workloads API (apps/v1) went stable in 1.9. This matters because it locked in the API contract for Deployments, ReplicaSets, DaemonSets, and StatefulSets. Infrastructure built on these APIs would not break on upgrades.

# apps/v1 Deployment — the stable form that operators rely on
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "128Mi"
            cpu: "500m"

Also in 1.9:
Windows container support moved to beta — actual Windows Server 2016 nodes in a cluster
CoreDNS available as an alternative to kube-dns: A more extensible, plugin-based DNS server that would replace kube-dns as the default in 1.11


Kubernetes 1.10 — Storage, Auth, and Scale (March 2018)

1.10 continued the enterprise hardening:
CSI (Container Storage Interface) beta: A standardized interface between Kubernetes and storage providers. Before CSI, storage drivers were compiled into the kubelet binary. CSI moved them out-of-tree, allowing storage vendors to ship their own drivers without waiting for a Kubernetes release
External credential providers (alpha): Authenticate against external systems (cloud IAM, HashiCorp Vault) for kubeconfig credentials
Node problem detector stable: Detect and report node-level problems (kernel deadlocks, corrupted file systems) as Kubernetes events and node conditions

The CSI transition was one of the most important infrastructure decisions of this period. It decoupled storage driver development from the Kubernetes release cycle — a necessary step for cloud providers to ship storage integrations rapidly and independently.


The Istio Announcement and Service Mesh Wars (May 2017)

Google and IBM announced Istio in May 2017 — a service mesh that layered mTLS, traffic management, and observability on top of existing Kubernetes deployments without changing application code.

Istio’s architecture: sidecar proxies (Envoy) injected into every pod, managed by a control plane. Every service-to-service call passes through the sidecar, enabling:
– Mutual TLS between services (zero-trust networking at the service layer)
– Fine-grained traffic control (canary releases, circuit breaking, retries)
– Distributed tracing and metrics

Linkerd (from Buoyant) had been working on the same problem since 2016. The two projects would compete for the “service mesh standard” throughout 2017–2019.

The service mesh conversation was fundamentally a security architecture conversation: how do you enforce mutual authentication and encryption between services in a Kubernetes cluster without requiring application developers to implement it?


CoreOS Acquisition and the Operator Pattern (2018)

In January 2018, Red Hat acquired CoreOS for $250 million. CoreOS had contributed two things that would permanently shape Kubernetes:

1. The Operator Pattern (introduced by CoreOS engineers Brandon Philips and Josh Wood in 2016): An Operator is a custom controller that uses CRDs to manage the lifecycle of complex, stateful applications. The etcd Operator (CoreOS’s own) was the first — it automated etcd cluster creation, scaling, backup, and failure recovery. The pattern generalized: a Prometheus Operator, a PostgreSQL Operator, a Kafka Operator.

The Operator pattern is the answer to the question “how do you encode operational knowledge into software?” A human operator knows how to deploy, scale, backup, and recover a database. An Operator codifies that knowledge into a controller loop.

# Operator pattern: watch CRD → reconcile → manage application
CRD (EtcdCluster) → Operator Controller watches → creates/updates Pods, Services, Snapshots

2. etcd: The distributed key-value store that backs the Kubernetes control plane. CoreOS built and maintained etcd. Red Hat acquiring CoreOS meant that the company maintaining Kubernetes’s most critical dependency (after the kernel) was now inside the Red Hat/IBM orbit.


Helm 2 and the Charts Ecosystem

By 2017–2018, Helm had become the de facto package manager for Kubernetes. The public Helm chart repository hosted hundreds of charts — databases (PostgreSQL, MySQL, Redis), monitoring (Prometheus, Grafana), ingress controllers (nginx), CI/CD tools (Jenkins, GitLab Runner).

Helm 2 introduced Tiller — a server-side component that managed release state in the cluster. Tiller became the most criticized security decision in the Kubernetes ecosystem: Tiller ran with cluster-admin privileges by default, meaning any user who could reach Tiller’s gRPC endpoint could do anything in the cluster.

Security teams hated Tiller. The Helm team addressed it in Helm 3 (2019) by removing Tiller entirely and storing release state as Kubernetes Secrets instead.


Key Takeaways

  • RBAC going stable in 1.8 was the single most important security event in early Kubernetes history — it gave enterprises the access control model they needed for production
  • CRDs replacing TPRs in 1.7 enabled the entire Operator ecosystem that would define the next phase of Kubernetes
  • Docker Inc.’s October 2017 announcement that it would support Kubernetes in Docker EE effectively ended the container orchestration wars
  • The three major cloud providers (GKE, AKS, EKS) all standardizing on managed Kubernetes drove enterprise adoption faster than any feature announcement could
  • The Operator pattern — Kubernetes controllers that encode operational knowledge — emerged from CoreOS and became the standard model for managing complex stateful applications
  • Helm filled a real gap but Tiller’s cluster-admin model was a security debt the community had to repay in Helm 3

What’s Next

← EP02: The Container Wars | EP04: The Operator Era →

Series: Kubernetes: From Borg to Platform Engineering | linuxcent.com

The Container Wars: Kubernetes 1.0, CNCF, and the Fight for Orchestration (2014–2016)

Reading Time: 6 minutes


Introduction

Three orchestration systems entered the arena in 2015. Only one would still matter three years later.

Docker had created the container revolution. Now everyone needed to run containers at scale, and three camps formed around three very different philosophies. Understanding why Kubernetes won — and how close it came to not winning — explains most of the design choices that still shape Kubernetes today.


The State of Container Orchestration in 2014

When Kubernetes made its public debut at DockerCon 2014, it entered a space that didn’t yet have a name. “Container orchestration” wasn’t a category. It was a problem people had started to feel but not yet articulate.

Three approaches emerged nearly simultaneously:

Docker Swarm (announced December 2014): Docker’s answer to orchestration, built on the premise that the tool you use to run containers should also be the tool you use to cluster them. Swarm used the same Docker CLI and Docker API — zero new concepts for developers already using Docker.

Apache Mesos (Mesosphere Marathon): Mesos predated Docker. It was a distributed systems kernel originally developed at Berkeley, used in production at Twitter, Airbnb, and Apple. Marathon was the framework for running long-running services on top of Mesos. Mesos could run Docker containers, Hadoop jobs, and Spark workloads on the same cluster. Infrastructure engineers took it seriously.

Kubernetes: The newcomer with Google’s name behind it, but no track record outside Google, and early versions that required significant operational expertise to run.


Kubernetes v1.0: July 21, 2015

The 1.0 release was announced at OSCON in Portland on July 21, 2015. The timing was deliberate — it coincided with the announcement of the Cloud Native Computing Foundation.

What shipped in 1.0:

  • Pods: The core scheduling unit — one or more containers sharing a network namespace and storage
  • Replication Controllers: Keep N copies of a pod running (later replaced by ReplicaSets and Deployments)
  • Services: A stable virtual IP and DNS name in front of a set of pods
  • Namespaces: Soft multi-tenancy boundaries within a cluster
  • Labels and Selectors: The flexible grouping mechanism that makes everything composable
  • Persistent Volumes (basic): Pods could mount persistent storage
  • kubectl: The command-line interface

What was not in 1.0:
– No RBAC (Role-Based Access Control)
– No network policy
– No autoscaling
– No Ingress resources
– No StatefulSets
– No DaemonSets (added in 1.1)
– Secrets were stored in plaintext in etcd

The security posture of a fresh Kubernetes 1.0 cluster was essentially: “trust everything inside the cluster.” That was the inherited assumption from Borg.


The CNCF Formation

Alongside the 1.0 release, Google donated Kubernetes to the newly formed Cloud Native Computing Foundation — a Linux Foundation project. This was a critical strategic move.

By donating Kubernetes to a neutral foundation, Google:
1. Removed the perception of a single vendor controlling the project
2. Created a governance model that made enterprise adoption politically safe
3. Invited competitors (Red Hat, CoreOS, Docker, Microsoft) to contribute without ceding control to them

The CNCF’s initial Technical Oversight Committee included engineers from Google, Red Hat, Twitter, Cisco, and others. This governance model would later become the template for every CNCF project that followed.


v1.1 — v1.5: Building the Foundation (Late 2015–2016)

Kubernetes 1.1 (November 2015)

  • Horizontal Pod Autoscaler (HPA): Automatically scale pod count based on CPU utilization — a one-line example follows this list
  • HTTP load balancing: Ingress API added as alpha — pods could now be exposed via HTTP routing rules
  • Job objects: Run a task to completion, not just keep it running
  • Performance: a 30% throughput improvement and a significantly higher pods-per-minute scheduling rate
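
For the HPA, a single imperative command is enough to try it out. A sketch using today's kubectl syntax (the deployment name is hypothetical):

# Keep the web deployment between 2 and 10 replicas, targeting 70% average CPU
$ kubectl autoscale deployment web --min=2 --max=10 --cpu-percent=70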

Kubernetes 1.2 (March 2016)

  • Deployments promoted to beta: Rolling updates, rollback, pause/resume — the deployment primitive that engineers actually use for application deployments
  • ConfigMaps: Decouple configuration from container images (no more baking config into images) — a short example follows this list
  • Daemon Sets stable: Run exactly one pod per node — the pattern for node agents (log shippers, monitoring agents, network plugins)
  • Scale: Tested to 1,000 nodes and 30,000 pods per cluster
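
ConfigMaps are easy to see in action. A sketch using today's kubectl syntax — the names and file are hypothetical:

# Create a ConfigMap from a literal and a file, then inject it into a deployment as env vars
$ kubectl create configmap app-config --from-literal=LOG_LEVEL=info --from-file=nginx.conf
$ kubectl set env deployment/web --from=configmap/app-config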

Kubernetes 1.3 (July 2016)

  • StatefulSets (then called PetSets, alpha): Ordered, persistent-identity pods — the first serious attempt to run databases and stateful applications
  • Cross-cluster federation (alpha): Run workloads across multiple clusters
  • PodDisruptionBudgets (alpha): Control how many pods can be unavailable during voluntary disruptions — critical for safe rolling updates
  • rkt integration (“Rktnetes”): the first experiment in the kubelet driving a runtime other than Docker — a precursor to the Container Runtime Interface

Kubernetes 1.4 (September 2016)

  • kubeadm: A tool to bootstrap a Kubernetes cluster in two commands (sketched after this list). Before kubeadm, setting up a cluster required following Kelsey Hightower’s “Kubernetes the Hard Way” — valuable for learning, painful for production
  • ScheduledJobs (CronJobs): Run a job on a schedule
  • PodPresets: Inject common configuration into pods at admission time
  • Init Containers beta: Containers that run to completion before the main application containers start — the clean solution for initialization sequencing
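
The “two commands” behind kubeadm look roughly like this — a sketch that assumes a container runtime and the kubeadm packages are already installed on every node:

# On the control-plane node
$ sudo kubeadm init

# On each worker node, using the token and CA hash that `kubeadm init` prints
$ sudo kubeadm join <control-plane-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>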

Kubernetes 1.5 (December 2016)

  • StatefulSets promoted to beta
  • PodDisruptionBudgets to beta
  • Windows Server container support (alpha): First step toward a non-Linux node
  • CRI (Container Runtime Interface) alpha: The abstraction layer that would eventually allow Kubernetes to run containerd, CRI-O, and others instead of depending on Docker
  • OpenAPI spec: Machine-readable API documentation, enabling client code generation

Helm: The Missing Package Manager (February 2016)

Kubernetes gave you primitives. It did not give you a way to install applications composed of those primitives. In February 2016, Deis (later acquired by Microsoft) released Helm — a package manager for Kubernetes.

Helm introduced two concepts that stuck:
Charts: A collection of Kubernetes manifests bundled with templating and default values
Releases: An installed instance of a chart, with its own lifecycle (install, upgrade, rollback, delete) — the corresponding commands are sketched below
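
In Helm 2 terms, the day-to-day lifecycle looked roughly like this — chart and release names are illustrative, and Helm 3 later dropped the --name flag along with Tiller:

# Install a chart from the public repository as a named release
$ helm install stable/prometheus --name monitoring

# Upgrade, inspect, and roll back that release
$ helm upgrade monitoring stable/prometheus
$ helm list
$ helm rollback monitoring 1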

Helm’s immediate adoption signaled something important: the community was already thinking in terms of applications, not just raw primitives. Infrastructure engineers needed a layer of abstraction above YAML.


The Battle Lines Harden

By mid-2016, the three-way contest was becoming clearer:

Docker Swarm’s advantage: Zero friction for existing Docker users. docker swarm init + docker stack deploy. No new CLI, no new API, no new mental model. For small teams running straightforward applications, it was compelling.

Mesos’s advantage: Proven at very large scale before Kubernetes existed. Twitter ran Mesos in production. It could run heterogeneous workloads (Docker containers, Hadoop, Spark) on the same cluster. Enterprise data teams already had Mesos expertise.

Kubernetes’s advantage: The Google name, rapidly growing community, and a design that was clearly winning the feature race. But operational complexity was real — running Kubernetes well in 2016 required significant investment.


The Turning Point Nobody Talks About

The real moment that decided the container wars wasn’t a feature announcement. It was cloud provider behavior.

Google Kubernetes Engine (GKE) — then called Google Container Engine — had been running since 2014. It was the first managed Kubernetes service, and it worked. In 2016, both Microsoft and Amazon were working on managed Kubernetes offerings. Neither chose Docker Swarm. Neither chose Mesos.

When cloud providers converge on a technology, the market follows. By the time Amazon announced EKS and Microsoft announced AKS in late 2017, the decision was already made.


The Security Debt Accumulates

Running through the 1.0–1.5 feature list reveals a security architecture that was being designed in flight:

  • etcd stored secrets as base64-encoded strings — not encrypted (a one-line demonstration follows this list). Kubernetes 1.7 (2017) would add encryption at rest, but it required explicit configuration
  • The API server was unauthenticated by default in early versions — you needed to configure authentication
  • Network traffic between pods was unrestricted — all pods could reach all other pods on all ports, across all namespaces. NetworkPolicy existed as alpha in 1.3 but required a CNI plugin that supported it
  • The kubelet’s API was open — in early Kubernetes, the kubelet’s HTTP API was accessible without authentication from within the cluster
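
The base64 point is easy to demonstrate — a sketch with a hypothetical secret name, showing that anyone allowed to read a Secret object can read its contents:

# base64 is encoding, not encryption: one pipe away from plaintext
$ kubectl get secret db-credentials -o jsonpath='{.data.password}' | base64 -d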

These weren’t oversights — they were reasonable defaults for an internal cluster managed by a single team. They became liabilities as Kubernetes moved into multi-tenant enterprise environments.


KubeCon: A Community Forms

The first KubeCon conference ran November 9-11, 2015, in San Francisco — a small gathering of a few hundred engineers. By November 2016, KubeCon North America in Seattle drew thousands. The growth was not marketing-driven; it was practitioners solving real problems and sharing what they learned.

This community dynamic was qualitatively different from the Docker Swarm and Mesos ecosystems. Kubernetes had a contributor culture — pull requests, SIG (Special Interest Group) meetings, public design docs. The project was being built in the open, and engineers could see it happening.


Key Takeaways

  • Kubernetes 1.0 shipped in July 2015 with the basics functional but security model immature — no RBAC, no network policy, secrets stored in plaintext
  • The CNCF governance model was the strategic move that made enterprise adoption politically safe — no single vendor controls the project
  • Helm filled the missing application packaging layer that raw Kubernetes couldn’t provide
  • The container wars were decided not by technical superiority alone, but by cloud provider alignment — when Google, Microsoft, and Amazon all built managed Kubernetes, the market followed
  • v1.1–v1.5 established the core workload primitives: Deployments, StatefulSets, DaemonSets, Jobs, ConfigMaps, HPA — most of these remain the daily vocabulary of Kubernetes operations

What’s Next

← EP01: The Borg Legacy | EP03: Enterprise Awakening →

Series: Kubernetes: From Borg to Platform Engineering | linuxcent.com

The Borg Legacy: How Google Built the Blueprint for Kubernetes (2003–2014)

Reading Time: 5 minutes


Introduction

Every piece of infrastructure has a lineage. Kubernetes didn’t appear from nowhere in 2014. It is, in almost every meaningful sense, Google’s Borg system rebuilt for the world — with a decade of hard lessons baked in.

To understand Kubernetes, you have to understand what came before it. And what came before it ran (and still runs) more compute than most organizations will ever touch.


Google’s Scale Problem (2003)

By the early 2000s, Google was running hundreds of thousands of jobs across tens of thousands of machines. Web indexing, ads, Gmail, Maps — all of these needed compute, and none of them could afford to waste it.

In 2006, Google engineers Paul Menage and Rohit Seth proposed a kernel feature called cgroups (control groups) — a mechanism to limit, prioritize, account for, and isolate the resource usage of process groups. The Linux kernel merged cgroups in 2.6.24 (2008). This was the primitive that would later make containers possible.

Simultaneously, Google built Borg — an internal cluster management system that could run hundreds of thousands of jobs, from many thousands of different applications, across many clusters, with each cluster having up to tens of thousands of machines. Borg was never open-sourced. It ran (and still runs) Google’s entire production workload.


What Borg Got Right

Borg introduced concepts that engineers didn’t yet have names for. They became the vocabulary of modern infrastructure:

Workload types:
Borg separated workloads into two classes: long-running services (high-priority, latency-sensitive) and batch jobs (best-effort, preemptible). Kubernetes would later call these Deployments and Jobs.

Declarative specification:
Borg jobs were described in a configuration language (BCL, a dialect of GCL). You declared what you wanted; Borg figured out how to achieve it. Sound familiar?

Resource limits and requests:
Borg tasks had both a request (what you need) and a limit (what you can use). Kubernetes adopted this model directly — resources.requests and resources.limits in pod specs trace directly back to Borg.

Health checking and rescheduling:
Borg monitored task health and automatically rescheduled failed tasks. The kubelet’s liveness and readiness probes are descendants of this.

Cell (cluster) topology:
Borg organized machines into “cells” — what Kubernetes calls clusters. The Borgmaster (control plane) managed the cell.


Omega: The Sequel That Didn’t Ship

Around 2011, Google started building Omega — a more flexible scheduler designed to address Borg’s limitations. Borg had a monolithic scheduler; Omega introduced a shared-state, optimistic-concurrency model where multiple schedulers could operate concurrently without stepping on each other.

A 2013 paper from Google (“Omega: flexible, scalable schedulers for large compute clusters”) made these ideas public. Omega itself stayed internal, but many of its scheduling concepts influenced Kubernetes’ extensible scheduler design.


The Docker Moment (March 2013)

On March 15, 2013, Solomon Hykes stood at PyCon and demonstrated Docker with a five-minute talk titled “The future of Linux Containers.” The demo ran a container. That was it. The room understood immediately.

Docker solved the packaging and distribution problem. Linux had had containers (via LXC and cgroups/namespaces) for years, but running one required deep kernel knowledge. Docker wrapped all of that in a UX that a developer could actually use.

Google’s engineers watched. They recognized the pattern: Docker was doing for containers what the smartphone did for mobile computing — making an existing capability accessible to everyone.

The Google engineers building the next generation of infrastructure realized: once containers become ubiquitous, someone will need to orchestrate them at scale. And they had already built that system internally, twice.


The Decision to Open-Source (Fall 2013)

In late 2013, a small group of Google engineers — Brendan Burns, Joe Beda, Craig McLuckie, Ville Aikas, Tim Hockin, Dawn Chen, Brian Grant, and Daniel Smith — began a new project internally codenamed “Project Seven” (a reference to the Borg drone Seven of Nine).

The core insight: Google’s competitive advantage in infrastructure came from what ran on the cluster management system, not the system itself. Open-sourcing a Kubernetes-like system would benefit Google by standardizing the ecosystem around patterns Google already understood better than anyone.

The initial design decisions were deliberate:

  • Go as the implementation language: Fast compilation, good concurrency primitives, easy deployment as static binaries
  • REST API as the primary interface: Everything in Kubernetes is an API resource. This is not accidental — it makes the system composable and automatable from day one
  • Labels and selectors over hierarchical naming: Borg used a hierarchical job/task naming scheme; Kubernetes chose a flat namespace with label-based grouping, which proved far more flexible
  • Reconciliation loops everywhere: Every Kubernetes controller is a loop that watches actual state and drives it toward desired state. This is the controller pattern, and it is the heart of Kubernetes extensibility — a two-command demonstration follows this list
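
A simple way to watch a reconciliation loop work on any cluster (the pod name is hypothetical): delete a pod owned by a Deployment, and the ReplicaSet controller immediately recreates it to restore the declared replica count.

# Actual state drops below desired state, and the controller closes the gap
$ kubectl delete pod nginx-deployment-66b6c48dd5-abcde
$ kubectl get pods --watch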

First Commit: June 6, 2014

The first public commit landed on GitHub on June 6, 2014: 250 files, 47,501 lines of Go, Bash, and Markdown.

Three days later, on June 10, 2014, Eric Brewer (VP of Infrastructure at Google) announced Kubernetes publicly at DockerCon 2014. The announcement framed it explicitly as bringing Google’s infrastructure learnings to the community.

By July 10, 2014, Microsoft, Red Hat, IBM, and Docker had joined the contributor community.


What Kubernetes Deliberately Left Out of Borg

The designers made intentional decisions about what not to carry forward:

No proprietary language: Borg’s BCL/GCL was Google-internal. Kubernetes used plain JSON (later YAML) manifests — standard formats any tool could read and write.

No magic autoscaling by default: Borg aggressively reclaimed resources. Kubernetes launched without this, adding HPA (Horizontal Pod Autoscaler) later, allowing operators to control the behavior.

No built-in service discovery tied to the scheduler: Borg had tight coupling between scheduling and name resolution. Kubernetes separated these: Services (kube-proxy, DNS) are distinct from the scheduler, allowing them to evolve independently.


The Borg Paper (2015)

In April 2015, Google published “Large-scale cluster management at Google with Borg” — the first public detailed description of the system. Reading it alongside the Kubernetes documentation reveals how directly the design decisions transferred.

Key numbers from the paper:
– Borg ran hundreds of thousands of jobs from thousands of applications
– Typical cell: 10,000 machines
– Utilization improvements from bin-packing: significant enough to justify the entire engineering investment

The paper is required reading for anyone who wants to understand why Kubernetes is designed the way it is — not as a series of arbitrary choices but as a deliberately evolved system.


The Lineage That Matters for Security

From a security architecture perspective, the Borg lineage matters because the isolation model was designed for a trusted-internal environment, not a multi-tenant hostile-external one. This created a debt that Kubernetes has spent years paying down:

  • Namespaces are a soft boundary, not a hard isolation primitive — just as Borg’s cells were
  • The default-allow network model reflects Borg’s assumption of a trusted internal network
  • No built-in admission control at launch — Borg trusted its job submitters

Understanding this history explains why features like NetworkPolicy, PodSecurity, RBAC, and OPA/Gatekeeper were retrofitted over years rather than built-in from day one. The system was designed by and for Google’s internal trust model. The security hardening came as it entered the wild.


Key Takeaways

  • Kubernetes is Google’s Borg system rebuilt for the world, carrying 10+ years of cluster management experience
  • Core Kubernetes primitives — resource requests/limits, declarative specs, health-based rescheduling, label-based grouping — map directly to Borg concepts
  • The decision to open-source was strategic, not altruistic: Google wanted to standardize the ecosystem on patterns it already mastered
  • The security gaps in early Kubernetes (no default network isolation, permissive RBAC, no pod-level security controls) trace directly to Borg’s trusted-internal-network assumptions
  • Docker’s accessibility breakthrough created the demand; Google’s Borg experience supplied the architecture

What’s Next

EP02: The Container Wars → — Kubernetes 1.0, the CNCF formation, and the three-way fight between Docker Swarm, Apache Mesos, and Kubernetes for control of the container orchestration market.


Series: Kubernetes: From Borg to Platform Engineering | linuxcent.com