<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Optimized Infra]]></title><description><![CDATA[Sharing my experiences in achieving optimal performance and efficiency in infrastructure]]></description><link>https://blog.zmalik.dev</link><image><url>https://substackcdn.com/image/fetch/$s_!csb0!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01337def-03c5-4291-b06e-cd2d75129c13_80x80.png</url><title>Optimized Infra</title><link>https://blog.zmalik.dev</link></image><generator>Substack</generator><lastBuildDate>Fri, 24 Apr 2026 12:59:35 GMT</lastBuildDate><atom:link href="https://blog.zmalik.dev/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Zain]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[zainwrites@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[zainwrites@substack.com]]></itunes:email><itunes:name><![CDATA[Zain]]></itunes:name></itunes:owner><itunes:author><![CDATA[Zain]]></itunes:author><googleplay:owner><![CDATA[zainwrites@substack.com]]></googleplay:owner><googleplay:email><![CDATA[zainwrites@substack.com]]></googleplay:email><googleplay:author><![CDATA[Zain]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Intellectual Honesty in the Age of Vibes]]></title><description><![CDATA[I didn't know intellectual honesty was a skill until it started appearing in my performance reviews.]]></description><link>https://blog.zmalik.dev/p/intellectual-honesty-in-the-age-of</link><guid isPermaLink="false">https://blog.zmalik.dev/p/intellectual-honesty-in-the-age-of</guid><dc:creator><![CDATA[Zain]]></dc:creator><pubDate>Mon, 23 Feb 2026 21:19:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!TkW2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f234db-b587-4c9a-a953-098d7b389c56_1888x1028.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The first time I encountered it as a company value, it was called &#8220;intellectual honesty&#8221; The expectation was that you do the deep dive, that you seek the truth, that you don&#8217;t stop at the first plausible explanation. I was told I was good at it. In my last company, at CloudKitchens, it came up again in my performance reviews, but under a different name: &#8220;truth seeking&#8221;. Same idea, different label. I remember reading it and thinking, sure, obviously. Isn&#8217;t that just engineering? You find the bug, you find the root cause, you fix the thing. But over time, I realized something: not everyone treats this as the default. </p><p>For many engineers, it&#8217;s one approach among many. For some engineers, it is the only approach. They need to know how things work to actually fix them. Not roughly. Not &#8220;the docs say this&#8221; Actually know. 
Read the source, tcpdump the traffic, go as low as possible.</p><p>Most companies have some version of this value; they just name it differently. Atlassian calls it &#8220;Open company, no bullshit.&#8221; Netflix calls it &#8220;Candor.&#8221; Amazon calls it &#8220;Dive Deep,&#8221; being skeptical when metrics and anecdotes don&#8217;t match. Bridgewater calls it &#8220;radical truth and radical transparency.&#8221; The language varies. The underlying idea is the same: </p><div class="pullquote"><p>&#8220;Have the courage to see things as they actually are, not as you wish they were&#8221;</p></div><p>Every one of these companies felt the need to put this in writing because the natural human tendency runs the other direction.</p><p>I never thought of it as a philosophy. It was just how I operated. That was the joy of the work, actually. But then I started sitting in incident reviews where blame was sometimes shifted sideways, where the postmortem became a performance instead of a diagnosis. Not because people were malicious, but because investigating further would mean challenging assumptions they were comfortable with. The database was fine. The config was correct. The deploy went smoothly. It always went smoothly. Nobody wanted to be the person who pulled the thread, because pulling the thread meant admitting the thing they trusted was the thing that broke.</p><p>And it wasn&#8217;t just postmortems. I sat in design reviews, architecture discussions, planning meetings where everyone nodded along and moved on. Not because they agreed, but because asking a real question carried risk. Ask &#8220;why are we doing it this way?&#8221; and you might come across as rude, as not a team player, as the person who slows things down. And worse: your question might reveal that you also don&#8217;t fully understand the thing you&#8217;re supposed to own. So you nod. Everyone nods. The meeting ends. 
The assumption survives.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!TkW2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f234db-b587-4c9a-a953-098d7b389c56_1888x1028.png" alt=""></figure></div><p>Every team like this eventually produces a well-intentioned deflector. Someone who, in the middle of an outage or a heated incident review, confidently declares: &#8220;This all happened because of X.&#8221; Maybe it&#8217;s a network blip. Maybe it&#8217;s a bad deploy. Maybe it&#8217;s &#8220;the cloud provider had an issue.&#8221; The explanation is plausible enough that nobody challenges it, specific enough that it sounds like a root cause, and shallow enough that it doesn&#8217;t threaten anyone&#8217;s assumptions. Leadership hears it, writes it down, moves on. The engineers are relieved.</p><div class="pullquote"><p>Because now everyone is looking in the same direction, and that direction isn&#8217;t at them.</p></div><p>I&#8217;ve seen this play out with noisy neighbors more times than I can count. Service is slow. Latency is spiking. Someone says &#8220;it&#8217;s noisy neighbors, we need isolated nodes.&#8221; It sounds right. It sounds like infrastructure wisdom. So the team escalates, gets dedicated capacity, and for a while things seem better, maybe because the deploy also happened to restart the pods, maybe because traffic patterns shifted. Then the latency comes back. So they ask for more isolation. More dedicated nodes. The bill goes up, the problem persists, and six weeks later someone finally profiles the application and finds a connection pool misconfiguration, or a missing index, or a goroutine leak that&#8217;s been there since day one. The noisy neighbor was never the problem. But it was a comfortable answer, and comfortable answers don&#8217;t get questioned.</p><p>The well-intentioned deflector isn&#8217;t lying. They might even be partially right. But &#8220;partially right&#8221; is the most dangerous kind of wrong in engineering, because it closes the investigation before it reaches the actual cause. And then, three months later, the same incident happens again. Maybe with a different trigger, maybe in a different service, but the same underlying pattern. Because the real root cause was never found. It was simply more comfortable not to look.</p><p>The more I saw that pattern, across teams, across companies, across incidents, the more I came to believe that rigorous thinking isn&#8217;t just <em>a</em> pillar of engineering. 
It&#8217;s <em>the</em> pillar. Everything else, design, architecture, reliability, performance, is downstream of whether you&#8217;re building on verified reality or comfortable assumptions.</p><p>And right now, in the age of LLMs, it&#8217;s about to become the most valuable skill you can build.</p><div><hr></div><h2>The Interstellar Tax</h2><p>In aviation, there&#8217;s a navigation principle called the 1-in-60 rule: for every degree you&#8217;re off course, you drift one mile for every sixty you travel. At aviation distances, that&#8217;s a minor correction. At interplanetary distances, it&#8217;s annihilation.</p><p>NASA&#8217;s Mars Climate Orbiter traveled 669 million kilometers over nine and a half months. One team at Lockheed Martin generated thrust data in pound-force-seconds. The navigation team at JPL assumed it was in newton-seconds. Nobody verified. Seven small errors accumulated across the journey. When the spacecraft arrived at Mars, it was 170 kilometers closer than planned - too close. It burned up. A $327 million mission, destroyed by a unit mismatch that went unchallenged for 286 days.</p><p>Every wrong assumption in engineering works like this. It&#8217;s not a static cost - it&#8217;s a trajectory error. The longer you travel on a false heading, the further you end up from reality. And by the time you discover you&#8217;re off course, the fuel to correct may already be spent.</p><p>I think about this compounding constantly. In the infrastructure world, I&#8217;ve seen teams spend months debugging network performance because everyone assumed the machines provided by the cloud provider were fine. The assumption felt safe: </p><div class="pullquote"><p>&#8220;it&#8217;s a managed component, it&#8217;s battle-tested, thousands of clusters run it.&#8221; </p></div><p>But assumptions don&#8217;t care about your feelings. That&#8217;s the interstellar tax. Every day spent traveling in the wrong direction is a day you have to travel back.</p><div><hr></div><h2>A Brief History of Expensive Assumptions</h2><p>The wreckage of unchallenged assumptions is scattered across engineering history. The pattern is always the same: something was assumed, evidence was available, nobody checked.</p><p><strong>Ariane 5 (1996)</strong> - Assumed Ariane 4 flight software value ranges would hold for a faster rocket. A 64-bit float overflowed a 16-bit integer. $370M rocket self-destructed 37 seconds after liftoff.</p><p><strong>Boeing 737 MAX (2018-2019)</strong> - Assumed pilots would diagnose and override a faulty sensor-driven system within three seconds. 346 people dead. ~$20B in losses.</p><p><strong>Hubble Space Telescope (1990)</strong> - Assumed the primary measuring device was calibrated correctly. Two backup instruments flagged the error. Readings dismissed. $1.5B mirror polished to exactly the wrong shape. $700M to fix.</p><p><strong>Challenger (1986)</strong> - Assumed O-rings would hold at 31&#176;F despite engineers recommending no launch below 53&#176;F. Seven dead, 73 seconds after liftoff.</p><p><strong>Therac-25 (1985-1987)</strong> - Assumed software alone could replace hardware safety interlocks. Manufacturer called overdose &#8220;impossible.&#8221; Six patients overdosed, three dead over 19 months.</p><p><strong>Knight Capital (2012)</strong> - Assumed a deployment script reporting &#8220;success&#8221; meant all eight servers were updated. One wasn&#8217;t. Dormant code from 2003 activated. 
$440M lost in 45 minutes.</p><p><strong>CrowdStrike (2024)</strong> - Assumed their content validator would catch a field count mismatch. Template defined 21 fields, sensor provided 20. 8.5 million Windows devices blue-screened simultaneously. ~$5.4B in losses.</p><p><strong>AWS S3 (2017)</strong> - Assumed a routine capacity removal command had safe bounds and that restart procedures matched current scale. Took down ~20% of the internet for four hours. $150M+ in losses.</p><p><strong>GitLab (2017)</strong> - Assumed their backups worked. Had five separate backup mechanisms. Zero functioned. Engineer ran <code>rm -rf</code> on production. 300GB of data lost. 18-hour outage.</p><p><strong>Cloudflare (2019)</strong> - Assumed a regex passing CI was safe for global deployment. Catastrophic backtracking spiked every edge server to 100% CPU. 27 minutes of global outage.</p><p>Same pattern. Different decade, different domain. The assumption was reasonable. The evidence was available. Nobody verified.</p><div><hr></div><h2>Now Add LLMs to the Equation</h2><p>Everything I&#8217;ve described above happened in a world where humans wrote their own code, did their own calculations, and made their own decisions about what to trust. The failure mode was always the same: humans not questioning their own assumptions. And even in that world, I watched people dodge hard questions in postmortems because the answers might be uncomfortable.</p><p>LLMs don&#8217;t create this problem. They accelerate it. And that&#8217;s exactly why the engineers who build real understanding right now will be more valuable than ever.</p><p>There&#8217;s a quieter version of this problem that nobody talks about. You make it up the ladder. You get the title, the scope, the org chart with your name near the top. And somewhere along the way, the systems you&#8217;re responsible for outgrew your hands-on understanding. You&#8217;re supposed to know, but you don&#8217;t. Not really. Not at the level where you could debug it yourself at 3 AM. So you develop instincts for when to nod confidently and when to delegate the question to someone who might actually know. This has always been part of senior engineering leadership, the gap between authority and understanding. But it used to be bounded by the fact that <em>someone</em> on the team had to write the code, had to understand it, had to make it work from scratch.</p><p>Now that gap can be closed with a prompt. And that changes what it means to be the person who actually understands.</p><p>With vibe coding and AI-assisted everything, the code actually works. That&#8217;s the interesting part. You prompt an LLM, you get a service that compiles, passes basic tests, handles the happy path. It looks like progress. It feels like velocity. But ask two questions and reality starts crumbling. Why Redis? Why can the server only handle this many requests? Why is this retry logic unbounded? The person who wrote it, who prompted it into existence, often can&#8217;t answer. Not because they&#8217;re bad engineers, but because they never had to confront those questions. The LLM absorbed them.</p><p>And here&#8217;s where it gets strange: the answers to those questions might come from the LLM itself. So now you have a human-bot interaction where the bot wrote the code and the bot explains the code, and the human in the middle is doing prompt engineering and interpreting the output. At some point you have to ask: what is the human actually contributing? 
If you&#8217;re not the one who understands why Redis, if you&#8217;re not the one who can reason about request capacity, you&#8217;re not engineering. You&#8217;re proxying. And you could remove the middleman entirely, just let the bot talk to itself, and the output would be roughly the same.</p><p>This is where the opportunity is. In a world where anyone can generate code, the person who can explain <em>why</em> that code is right or wrong becomes irreplaceable. The person who can ask the second question, the one the LLM can&#8217;t answer about your specific system, your specific failure modes, your specific constraints, that person is more valuable now than they&#8217;ve ever been. For the person who already felt the pressure to pretend they understood, LLMs are the perfect collaborator. They don&#8217;t judge. They don&#8217;t ask follow-ups. They give you something that looks right and let you move on. But for the person willing to actually understand, LLMs are the most powerful learning accelerator ever built. The difference is honest self-assessment about which one you&#8217;re doing.</p><p>Choose wrong and you get more bugs, shipped with more confidence. That&#8217;s not a tooling problem. That&#8217;s an intellectual rigor problem.</p><div><hr></div><h2>The Assumptions Worth Questioning</h2><p>LLMs introduce a new category of assumptions into your codebase. Not maliciously. Not even incorrectly, most of the time. But silently, and at scale. Knowing what to question is the skill.</p><p><strong>Code that looks correct is not necessarily correct.</strong> LLMs produce code that reads like it was written by a competent human. It follows naming conventions, uses popular patterns, includes comments. This surface-level quality is genuinely impressive, and it&#8217;s also exactly what makes it worth scrutinizing. It passes the &#8220;does this look right?&#8221; filter, which was never a reliable filter to begin with. The engineer who reads it critically, instead of just visually, is the one who catches the issue before production does.</p><p><strong>&#8220;It worked&#8221; does not mean &#8220;it&#8217;s right.&#8221;</strong> LLM-generated code often works for the common case. But engineering has always been about the cases you didn&#8217;t think of. The Ariane 5 code worked perfectly on Ariane 4 trajectories. The MCAS system worked perfectly until the sensor failed. The engineer who asks &#8220;what input would break this?&#8221; is doing the work that separates a demo from a production system.</p><p><strong>Speed of production is not progress.</strong> When you can generate a thousand lines in minutes, it feels productive. But if you don&#8217;t understand those thousand lines, you&#8217;ve created technical debt at a speed that was never before possible. The engineer who slows down to understand what was generated, who treats the LLM output as a first draft rather than a finished product, is building something that will actually hold up.</p><p><strong>The model does not understand your system.</strong> An LLM has no knowledge of your architecture, your invariants, your failure modes, your SLAs. It generates code based on statistical patterns from millions of other systems. It picked Redis because Redis appears frequently in training data next to the word &#8220;cache,&#8221; not because it analyzed your access patterns and consistency requirements. It set the connection pool to 10 because that&#8217;s a common default, not because it profiled your throughput under load. 
The question is never &#8220;can the LLM write this code?&#8221; The question is &#8220;does this code reflect the reality of <em>my</em> system?&#8221; The engineer who can answer that, who understands the system deeply enough to validate or reject what the model produced, is the one who will be indispensable. But if the human&#8217;s understanding also came from the LLM, you have a closed loop with no ground truth anywhere in the circuit.</p><div><hr></div><h2>Honest Engineering as a Practice</h2><p>I&#8217;ve come to see honest engineering not as a personality trait but as a set of concrete habits. Things you do before you trust, before you ship, before you declare something fixed.</p><p><strong>Measure before theorizing.</strong> The Ariane 5 team assumed Ariane 4 ranges would hold. They never measured actual values for the new flight profile. In the LLM age, this means: don&#8217;t assume generated code handles your edge cases. Instrument it. Profile it. Feed it adversarial inputs.</p><p><strong>Treat contradictory data as signal, not noise.</strong> Hubble&#8217;s backup instruments flagged the mirror flaw. The readings were dismissed. The Therac-25 operators reported burns; the manufacturer said it was impossible. When your monitoring says something different from your mental model, your mental model is wrong. This applies equally to AI-generated code that &#8220;should work&#8221; but behaves unexpectedly in staging.</p><p><strong>Verify across boundaries.</strong> The Mars Orbiter failed at the interface between Lockheed Martin and JPL, each correct in isolation, wrong together. LLMs have no concept of your system boundaries, your team interfaces, your deployment constraints. Every piece of generated code that crosses a boundary, between services, between teams, between trust zones, needs manual verification.</p><p><strong>Understand before you merge.</strong> If you can&#8217;t explain why the code works, you can&#8217;t explain why it won&#8217;t work. The 59% of developers shipping code they don&#8217;t understand are creating future incidents that nobody will be able to debug, because nobody understood the system they built. This was always true for copy-pasted Stack Overflow answers, but LLMs have industrialized the problem.</p><p><strong>Make assumptions explicit.</strong> The Ariane 5 overflow decision was &#8220;obscured from external review.&#8221; If your LLM prompt assumes a specific data format, a particular library version, a certain deployment environment, write that down. Review it. Challenge it. Because the model won&#8217;t. One way to write an assumption down is sketched below.</p>
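<p>Making an assumption explicit can be as mechanical as encoding it in a type. Here is a minimal Go sketch of the Mars Climate Orbiter lesson, with purely illustrative names (nothing here comes from any real flight software), where the pound-force-second to newton-second conversion happens at exactly one reviewable boundary:</p><pre><code>package main

// Encode the unit assumption in the type system so a pound-force-second
// can never silently flow into code that expects newton-seconds.

import "fmt"

type NewtonSeconds float64
type PoundForceSeconds float64

// ToSI converts at the one boundary where the assumption is visible.
func (p PoundForceSeconds) ToSI() NewtonSeconds {
	return NewtonSeconds(float64(p) * 4.44822) // 1 lbf*s = 4.44822 N*s
}

// applyImpulse accepts only SI units. Passing PoundForceSeconds
// directly is a compile-time error, not a months-long silent drift.
func applyImpulse(impulse NewtonSeconds) {
	fmt.Printf("thruster impulse: %.3f N*s\n", float64(impulse))
}

func main() {
	reported := PoundForceSeconds(1.5) // what the vendor software emits
	applyImpulse(reported.ToSI())      // the conversion is explicit and reviewable
}</code></pre>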
<div><hr></div><h2>The Skill That Will Matter Most</h2><p>LLMs are not going away. They&#8217;re going to get better. The code they generate will improve, the vulnerabilities will decrease, the tooling around them will mature. But none of that changes the fundamental dynamic: someone still needs to understand the system.</p><p>When you write code yourself, you encounter resistance. You hit compiler errors, test failures, logical contradictions. Each one is a micro-moment of reality testing, a forced confrontation with how things actually work. When an LLM generates code for you, those friction points disappear. The code compiles. The tests pass. Everything looks green. And the absence of friction feels like correctness.</p><p>But the absence of friction isn&#8217;t the absence of bugs. It&#8217;s the absence of <em>your understanding of where the bugs are.</em></p><p>History teaches us that the most expensive engineering failures don&#8217;t come from ignorance. They come from false confidence, from teams that believed they knew enough, that the system was well-understood, that the assumption was safe. Every disaster I&#8217;ve described was built by brilliant engineers who were wrong about one thing and didn&#8217;t know it.</p><p>So here&#8217;s my advice, especially if you&#8217;re early in your career: build the understanding. Use LLMs, use them aggressively, but use them the way you&#8217;d use a senior engineer who&#8217;s brilliant but has never seen your codebase. They can draft, they can suggest, they can teach you patterns. But they can&#8217;t tell you whether those patterns fit your system. That part is yours. And the more people outsource that part, the more valuable it becomes for the people who don&#8217;t.</p><p>The engineers who thrive in this era won&#8217;t be the ones who can prompt the fastest. They&#8217;ll be the ones who can look at what the model produced and say, with confidence, &#8220;this is wrong, and here&#8217;s why.&#8221; Or, just as importantly, &#8220;I don&#8217;t know if this is right, and I need to find out before we ship it.&#8221;</p><div class="pullquote"><p>That&#8217;s intellectual courage. It has always been the foundation of good engineering. In the age of vibes, it&#8217;s the whole building.</p></div><p>If there&#8217;s one skill I&#8217;d want every engineer to build, one thing I&#8217;d want the next computer science curriculum to actually teach, it would be this: intellectual honesty. How to question what&#8217;s in front of you, especially when it looks right. We&#8217;re going to vibe. We&#8217;re going to generate, ship, and iterate faster than ever. But the engineers who matter will be the ones who pause, question, and then vibe again, better.</p><div><hr></div><h2>References</h2><p><strong>Mars Climate Orbiter (1999)</strong> - $327M loss due to metric/imperial unit mismatch between Lockheed Martin and NASA JPL.</p><ul><li><p><a href="https://science.nasa.gov/mission/mars-climate-orbiter/">NASA Mission Page</a></p></li><li><p><a href="https://en.wikipedia.org/wiki/Mars_Climate_Orbiter">Wikipedia: Mars Climate Orbiter</a></p></li><li><p><a href="https://sma.nasa.gov/docs/default-source/safety-messages/safetymessage-2009-08-01-themarsclimateorbitermishap.pdf?sfvrsn=eaa1ef8_4">NASA System Failure Case Study (PDF)</a></p></li></ul><p><strong>Ariane 5 Flight 501 (1996)</strong> - $370M loss, 37 seconds after liftoff. Integer overflow in reused Ariane 4 software.</p><ul><li><p><a href="https://en.wikipedia.org/wiki/Ariane_flight_V88">Wikipedia: Ariane Flight V88</a></p></li><li><p><a href="https://www.esa.int/Newsroom/Press_Releases/Ariane_501_-_Presentation_of_Inquiry_Board_report">ESA Inquiry Board Report</a></p></li><li><p><a href="https://www.drisq.com/the-ariane-5-failure-how-a-huge-disaster-paved-the-way-for-better-coding">Drisq: How a Huge Disaster Paved the Way for Better Coding</a></p></li></ul><p><strong>Boeing 737 MAX / MCAS (2018&#8211;2019)</strong> - 346 deaths across Lion Air Flight 610 and Ethiopian Airlines Flight 302. 
Single-sensor design, flawed safety assumptions.</p><ul><li><p><a href="https://en.wikipedia.org/wiki/Boeing_737_MAX_groundings">Wikipedia: Boeing 737 MAX Groundings</a></p></li><li><p><a href="https://en.wikipedia.org/wiki/Maneuvering_Characteristics_Augmentation_System">Wikipedia: MCAS</a></p></li><li><p><a href="https://www.seattletimes.com/seattle-news/times-watchdog/the-inside-story-of-mcas-how-boeings-737-max-system-gained-power-and-lost-safeguards/">The Seattle Times: The Inside Story of MCAS</a></p></li><li><p><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC7351545/">PMC / NIH: The Boeing 737 MAX - Lessons for Engineering Ethics</a></p></li></ul><p><strong>Hubble Space Telescope Mirror Flaw (1990)</strong> - $1.5B telescope launched with mirror ground to the wrong curvature. $700M servicing mission to correct.</p><ul><li><p><a href="https://science.nasa.gov/mission/hubble/observatory/design/optics/hubbles-mirror-flaw/">NASA: Hubble&#8217;s Mirror Flaw</a></p></li><li><p><a href="https://en.wikipedia.org/wiki/Hubble_Space_Telescope">Wikipedia: Hubble Space Telescope</a></p></li><li><p><a href="https://www.cbsnews.com/news/an-ingenius-fix-for-hubbles-famously-flawed-vision/">CBS News: How NASA Fixed Hubble&#8217;s Flawed Vision</a></p></li></ul><p><strong>Space Shuttle Challenger (1986)</strong> - 7 crew killed. O-ring failure in cold temperatures despite engineer warnings.</p><ul><li><p><a href="https://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster">Wikipedia: Space Shuttle Challenger Disaster</a></p></li><li><p><a href="https://en.wikipedia.org/wiki/Roger_Boisjoly">Wikipedia: Roger Boisjoly</a></p></li><li><p><a href="https://www.nasa.gov/history/rogersrep/v1ch5.htm">NASA Rogers Commission Report, Chapter 5</a></p></li><li><p><a href="https://priceonomics.com/the-space-shuttle-challenger-explosion-and-the-o/">Priceonomics: The Challenger Explosion and the O-Ring</a></p></li></ul><p><strong>Therac-25 (1985&#8211;1987)</strong> - At least 6 radiation overdoses, 3 deaths. 
Software race conditions with no hardware safety interlocks.</p><ul><li><p><a href="https://en.wikipedia.org/wiki/Therac-25">Wikipedia: Therac-25</a></p></li><li><p><a href="https://ieeexplore.ieee.org/document/274940/">IEEE: An Investigation of the Therac-25 Accidents (Leveson &amp; Turner)</a></p></li><li><p><a href="https://ethicsunwrapped.utexas.edu/case-study/therac-25">Ethics Unwrapped: Therac-25 Case Study</a></p></li><li><p><a href="https://hackaday.com/2015/10/26/killed-by-a-machine-the-therac-25/">Hackaday: Killed By A Machine</a></p></li></ul><p><strong>Knight Capital Group ($440M loss, 2012)</strong> - Deployment error triggered dormant trading code, bankrupting the firm in 45 minutes.</p><ul><li><p><a href="https://en.wikipedia.org/wiki/Knight_Capital_Group">Wikipedia: Knight Capital Group</a></p></li><li><p><a href="https://www.sec.gov/newsroom/press-releases/2013-222">SEC Charges Knight Capital</a></p></li><li><p><a href="https://specbranch.com/posts/knight-capital/">Speculative Branches: The Knight Capital Disaster</a></p></li><li><p><a href="https://www.henricodolfing.ch/en/case-study-4-the-440-million-software-error-at-knight-capital/">Henrico Dolfing: The $440 Million Software Error</a></p></li></ul><p><strong>CrowdStrike Channel File 291 Outage (2024)</strong> - A faulty sensor configuration update crashed 8.5 million Windows devices worldwide.</p><ul><li><p><a href="https://en.wikipedia.org/wiki/2024_CrowdStrike-related_IT_outages">Wikipedia: 2024 CrowdStrike-related IT Outages</a></p></li><li><p><a href="https://www.crowdstrike.com/wp-content/uploads/2024/08/Channel-File-291-Incident-Root-Cause-Analysis-08.06.2024.pdf">CrowdStrike Root Cause Analysis (PDF)</a></p></li><li><p><a href="https://www.crowdstrike.com/en-us/blog/channel-file-291-rca-available/">CrowdStrike Blog: Channel File 291 RCA</a></p></li><li><p><a href="https://www.cisa.gov/news-events/alerts/2024/07/19/widespread-it-outage-due-crowdstrike-update">CISA Advisory: Widespread IT Outage</a></p></li></ul><p><strong>AWS S3 Outage (2017)</strong> - A mistyped command removed critical infrastructure and took down ~20% of the internet for 4 hours.</p><ul><li><p><a href="https://aws.amazon.com/message/41926/">AWS Official Postmortem</a></p></li><li><p><a href="https://www.gremlin.com/blog/the-2017-amazon-s-3-outage">Gremlin: After the Retrospective - The 2017 Amazon S3 Outage</a></p></li><li><p><a href="https://networkworld.com/article/3176595/aws-says-a-typo-caused-the-massive-s3-failure-this-week.html">Network World: AWS Says a Typo Caused the Massive S3 Failure</a></p></li></ul><p><strong>GitLab Database Outage (2017)</strong> - An engineer accidentally deleted 300GB of production data; all five backup methods had silently failed.</p><ul><li><p><a href="https://about.gitlab.com/blog/postmortem-of-database-outage-of-january-31/">GitLab Official Postmortem</a></p></li><li><p><a href="https://thehackernews.com/2024/03/4-instructive-postmortems-on-data.html">The Hacker News: 4 Instructive Postmortems</a></p></li><li><p><a href="https://downtimeproject.com/podcast/gitlabs-2017-postgres-outage/">The Downtime Project: GitLab&#8217;s 2017 Postgres Outage</a></p></li></ul><p><strong>Cloudflare WAF Outage (2019)</strong> - A single regex with catastrophic backtracking behavior spiked every edge server to 100% CPU globally.</p><ul><li><p><a href="https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/">Cloudflare Blog: Details of the Outage on July 2, 2019</a></p></li><li><p><a 
href="https://blog.cloudflare.com/cloudflare-outage/">Cloudflare Blog: Initial Outage Report</a></p></li></ul><p><strong>LLM Code Security Data (2025)</strong></p><ul><li><p><a href="https://clutch.co/resources/devs-use-ai-generated-code-they-dont-understand">Clutch: 59% of Devs Use AI Code They Don&#8217;t Understand</a></p></li><li><p><a href="https://shiftmag.dev/stack-overflow-survey-2025-ai-5653/">Stack Overflow 2025 Developer Survey - AI trust dropping to 33%</a></p></li><li><p><a href="https://www.itpro.com/software/development/software-developers-not-checking-ai-generated-code-verification-debt">IT Pro: 96% Don&#8217;t Trust AI Code, Fewer Than Half Review It</a></p></li><li><p><a href="https://www.infosecurity-magazine.com/news/llms-vulnerable-code-default/">Infosecurity Magazine: LLMs Produce Vulnerable Code by Default</a></p></li><li><p><a href="https://www.endorlabs.com/learn/the-most-common-security-vulnerabilities-in-ai-generated-code">Endor Labs: Most Common Vulnerabilities in AI-Generated Code</a></p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.zmalik.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Optimized Infra! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Lazy-Pulling Container Images: A Deep Dive Into OCI Seekability]]></title><description><![CDATA[From DEFLATE dependency chains to FUSE mounts: how few competing approaches make container layers randomly accessible, and what they all require you to change on every node.]]></description><link>https://blog.zmalik.dev/p/lazy-pulling-container-images-a-deep</link><guid isPermaLink="false">https://blog.zmalik.dev/p/lazy-pulling-container-images-a-deep</guid><dc:creator><![CDATA[Zain]]></dc:creator><pubDate>Sun, 08 Feb 2026 17:18:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!SFGL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51dc2ede-6a8b-4b0f-8187-2346ebb2ee53_1408x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Your container images are tar archives compressed with gzip. That single design decision, made in the Docker era and inherited by OCI, means that reading any file requires downloading and decompressing the entire layer from byte zero. For a 12GB ML image, that&#8217;s 2-8 minutes of cold-start time before a single inference request, depending on your bandwidth to the registry.</p><p>Lazy-pulling fixes this by fetching only the bytes you need, when you need them. The concept has existed since 2019. Multiple implementations are in production today. 
They all solve the same byte-level problem and they all require the same infrastructure change: swapping containerd&#8217;s snapshotter, deploying a FUSE daemon on every node, and accepting that your registry just became a runtime dependency.</p><p>This post starts with why the problem is harder than it looks at the byte level, then surveys the major approaches and what they trade off. The core of the post is a hands-on experiment: I deploy an in-cluster registry, convert images to eStargz, patch containerd with a custom snapshotter, and measure something nobody benchmarks properly. Not just pull time, but <em>readiness</em>, the moment a container can actually serve its first request.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!SFGL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51dc2ede-6a8b-4b0f-8187-2346ebb2ee53_1408x768.jpeg" alt=""></figure></div><p>A note on scope. This post is about container image layers, the runtime, libraries, and application code that make up the OCI image. Model weight delivery is a different topic and won&#8217;t be covered here.</p><div><hr></div><h2>Part I: Why Container Layers Resist Random Access</h2><h3>The Tar Format: Sequential by Design</h3><p>A container image layer is a POSIX tar archive. 
Tar was designed for tape, literally <em>Tape ARchive</em>, and it shows. There is no central directory, no file index, no way to find a file without scanning from the beginning:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!geez!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65ddb0b6-689b-4b5b-aac4-315af1e6508b_1216x1164.png" alt=""></figure></div><p>To find <code>/usr/bin/python3</code> in a tar archive, you read the first header (512 bytes), check the filename, skip past the file data (reading the size field to know how far), read the next header, and repeat until you find your target or hit EOF. For a layer with 10,000 files, that&#8217;s potentially scanning through gigabytes of file data you don&#8217;t care about.</p>
<p>This is already slow. Gzip makes it worse.</p><h3>GZIP + DEFLATE: The Dependency Chain</h3><p>Container layers are <code>tar.gz</code>, the tar archive above compressed as a single gzip stream. GZIP uses DEFLATE compression (RFC 1951), which combines two techniques:</p><p><strong>LZ77 (sliding window)</strong> replaces repeated byte sequences with back-references. Instead of storing &#8220;the quick brown fox&#8221; twice, the second occurrence becomes a (distance, length) pair pointing back to the first. The window size is up to 32KB. Any byte in the output can reference up to 32,768 bytes before it.</p><p><strong>Huffman coding</strong> uses variable-length encoding where frequent symbols get shorter codes. The coding tables can be static (pre-defined) or dynamic (computed per block and embedded in the stream).</p><p>Together, these create a decompression dependency chain:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!RnG8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4010a80a-16d4-42a2-b5a0-5e2753c0eb94_1656x934.png" alt=""></figure></div><p>To decompress Block 3, the decompressor needs the Huffman tables for Block 3 (embedded in the block header, or the static tables), up to 32KB of previously decompressed output (the sliding window) for resolving back-references, and the current bit position in the compressed stream (DEFLATE is bit-aligned, not byte-aligned).</p><p>The decompressor state at any point is a function of everything that came before. There is no way to start decompressing at Block 3 without either decompressing Blocks 0-2 first or having a saved snapshot of the decompressor&#8217;s internal state at the Block 3 boundary.</p><p><strong>This is the fundamental problem.</strong> A 5GB compressed layer containing 10,000 files is a single DEFLATE stream. To read the last file, you decompress from byte 0. There&#8217;s no shortcut within the format.</p>
<h3>The Combined Cost</h3><p>For a modern LLM serving image (think vLLM, SGLang, or TensorRT-LLM), excluding model weights which are typically mounted separately or pulled from object storage:</p><pre><code>Runtime layers: 12.3 GB compressed, 28.5 GB uncompressed, 5 layers
  &#9500;&#9472;&#9472; Base OS (Ubuntu)
  &#9500;&#9472;&#9472; CUDA runtime + cuDNN
  &#9500;&#9472;&#9472; Python + pip packages
  &#9500;&#9472;&#9472; PyTorch
  &#9492;&#9472;&#9472; vLLM application code

Traditional pull sequence:
  1. Download 12.3 GB                              &#8594; 2-8 min (depends on link speed)
  2. Decompress 12.3 GB &#8594; 28.5 GB                  &#8594; 2 min (CPU-bound)
  3. Write 28.5 GB to overlayfs (5 layer extracts)  &#8594; 3 min (IO-bound)
  &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
  Total: ~7-13 minutes to container Running state

What the container actually reads at startup:
  Python interpreter + stdlib + torch + vllm imports &#8776; 150 MB
  That's about 0.5% of the uncompressed image.</code></pre><p>You download 12.3 GB of runtime layers to read 150 MB. The remaining 99.5% sits on disk, fully decompressed, waiting for requests that might never come. On GPU nodes costing $3-30/hr, those minutes of idle accelerator time add up, and that&#8217;s before you&#8217;ve even started loading model weights.</p><div><hr></div><h2>Part II: Many Ways to Break the Chain</h2><p>Every lazy-pulling solution addresses the same problem: making individual files accessible without downloading the entire layer. They differ in <em>where</em> they break the DEFLATE dependency chain and <em>what</em> they require at the format level.</p><h3>Independent Gzip Members (eStargz)</h3><p><strong>The core insight.</strong> RFC 1952 says a gzip file can contain multiple concatenated members. Standard decompressors treat them as a single stream. But each member has its own independent DEFLATE state.</p><p>eStargz recompresses each file (or chunk of a large file) as a separate gzip member:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!yBnn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c8aa31-9433-462b-9f1b-616c1883f65d_1418x640.png" alt="eStargz layout: each file stored as an independent gzip member"></figure></div>
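<p>The member independence is checkable with nothing but the Go standard library (blob name hypothetical): <code>gzip.Reader.Multistream(false)</code> stops at each member boundary, and <code>Reset</code> starts the next member with fresh DEFLATE state.</p><pre><code>// members.go: decode a multi-member gzip blob one member at a time.
package main

import (
    "compress/gzip"
    "fmt"
    "io"
    "log"
    "os"
)

func main() {
    f, err := os.Open("layer.estargz") // hypothetical eStargz blob
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    zr, err := gzip.NewReader(f)
    if err != nil {
        log.Fatal(err)
    }
    for i := 0; ; i++ {
        zr.Multistream(false) // treat the current member as a complete stream
        n, err := io.Copy(io.Discard, zr)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("member %d: %d uncompressed bytes\n", i, n)
        // Reset positions the reader at the next member, if any remain.
        if err := zr.Reset(f); err == io.EOF {
            break
        } else if err != nil {
            log.Fatal(err)
        }
    }
}</code></pre><p>Combined with the TOC offsets described next, the same property lets a reader seek straight to one member and decode only that file.</p>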
srcset="https://substackcdn.com/image/fetch/$s_!yBnn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c8aa31-9433-462b-9f1b-616c1883f65d_1418x640.png 424w, https://substackcdn.com/image/fetch/$s_!yBnn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c8aa31-9433-462b-9f1b-616c1883f65d_1418x640.png 848w, https://substackcdn.com/image/fetch/$s_!yBnn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c8aa31-9433-462b-9f1b-616c1883f65d_1418x640.png 1272w, https://substackcdn.com/image/fetch/$s_!yBnn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c8aa31-9433-462b-9f1b-616c1883f65d_1418x640.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>The TOC (Table of Contents)</strong> is a JSON document stored as the final gzip member. 
It maps every file to its compressed byte offset.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!aX6O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F511e58b7-4b09-4340-ad69-c6846bb72230_1424x632.png" alt="eStargz TOC mapping file paths to compressed offsets"></figure></div>
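<p>Abridged, a TOC entry looks roughly like this (field names follow the eStargz spec; the path, sizes, and offsets are illustrative):</p><pre><code>{
  "version": 1,
  "entries": [
    {
      "name": "usr/local/bin/python3.11",
      "type": "reg",
      "size": 5902312,
      "offset": 8421376,
      "digest": "sha256:...",
      "chunkOffset": 0,
      "chunkSize": 4194304
    }
  ]
}</code></pre>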
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>The DiffID preservation trick.</strong> eStargz changes compression boundaries but not content. When you concatenate and decompress all gzip members, you get a byte-identical tar stream to the original. The DiffID (the layer&#8217;s identity in the image config) is preserved. The blob digest changes (different compressed bytes), but the layer identity doesn&#8217;t.</p><p><strong>The cost.</strong> every image must be converted at build time. The conversion recompresses the entire layer. Compression ratio may differ slightly (independent members can&#8217;t reference data in other members, losing some LZ77 efficiency). In practice, the overhead is ~2% larger blobs.</p><h3>Other Approaches</h3><p>eStargz is the approach I use in the experiment later in this post, but it&#8217;s not the only one. Three other approaches are worth knowing about.</p><p><strong>AWS SOCI</strong> takes a different angle entirely. Instead of recompressing the image, it creates an external index (the zTOC) that stores periodic snapshots of the zlib decompressor&#8217;s internal state. These checkpoints let you resume decompression from any 4MB boundary instead of from byte zero. The index is stored as a separate OCI artifact linked via the Referrers API, so the original image stays completely unmodified. The tradeoff is that reads may need to decompress up to 4MB of unwanted data to reach the target file, and the index must be explicitly copied alongside the image when promoting across registries.</p><p><strong>Nydus</strong> replaces tar.gz entirely with RAFS (Registry Acceleration File System), a purpose-built format with separated metadata and content-addressable chunks. Its key differentiator is cross-layer chunk deduplication, which can reduce total download by 20-30% for ML images with overlapping runtime layers. Nydus also offers an EROFS kernel backend (Linux 5.19+) that eliminates FUSE from the data path, dropping per-operation latency from ~100-500&#956;s to ~10-50&#956;s. 
<p><strong>The cost.</strong> Every image must be converted at build time. The conversion recompresses the entire layer. Compression ratio may differ slightly (independent members can&#8217;t reference data in other members, losing some LZ77 efficiency). In practice, the overhead is ~2% larger blobs.</p><h3>Other Approaches</h3><p>eStargz is the approach I use in the experiment later in this post, but it&#8217;s not the only one. Three other approaches are worth knowing about.</p><p><strong>AWS SOCI</strong> takes a different angle entirely. Instead of recompressing the image, it creates an external index (the zTOC) that stores periodic snapshots of the zlib decompressor&#8217;s internal state. These checkpoints let you resume decompression from any 4MB boundary instead of from byte zero. The index is stored as a separate OCI artifact linked via the Referrers API, so the original image stays completely unmodified. The tradeoff is that reads may need to decompress up to 4MB of unwanted data to reach the target file, and the index must be explicitly copied alongside the image when promoting across registries.</p><p><strong>Nydus</strong> replaces tar.gz entirely with RAFS (Registry Acceleration File System), a purpose-built format with separated metadata and content-addressable chunks. Its key differentiator is cross-layer chunk deduplication, which can reduce total download by 20-30% for ML images with overlapping runtime layers. Nydus also offers an EROFS kernel backend (Linux 5.19+) that eliminates FUSE from the data path, dropping per-operation latency from ~100-500&#956;s to ~10-50&#956;s. The cost is that images must be converted, and the pure Nydus format isn&#8217;t backward-compatible with standard runtimes (though a &#8220;zran&#8221; compatibility mode exists).</p><p><strong>Azure Artifact Streaming</strong> and <strong>Google Image Streaming</strong> build the seekability index server-side, transparently, with no user-visible conversion step. Azure&#8217;s implementation is based on OverlayBD, which operates at the block device level via TCMU rather than using FUSE. Google generates an opaque index automatically on push and uses a custom containerd plugin backed by FUSE with aggressive multi-level caching. Both are closed implementations that require their specific managed Kubernetes service and container registry.</p><div><hr></div><h2>Part III: The containerd Integration Point</h2><p>Every solution, open-source or proprietary, modifies the same component: containerd&#8217;s snapshotter.</p><h3>Why It Has to Be the Snapshotter</h3><p>containerd&#8217;s image pull pipeline has a clear separation of concerns.</p><pre><code><code>Image Pull Pipeline:
  1. Resolver     &#8594; Converts image reference to manifest digest
  2. Fetcher      &#8594; Downloads blobs from registry &#8594; Content Store
  3. Unpacker     &#8594; Reads blobs from Content Store &#8594; Snapshotter
  4. Snapshotter  &#8594; Materializes layer diffs into mountable filesystem
  5. Runtime      &#8594; Mounts filesystem, creates container</code></code></pre><p>Steps 1-3 operate on blobs as opaque bytes. The Snapshotter (step 4) is where bytes become <em>files</em>. This is the only point where lazy-pulling can be inserted. You intercept the moment containerd tries to materialize a complete filesystem from a layer blob and instead provide a virtual filesystem that fetches on demand.</p><h3>The Snapshotter Contract</h3><p>The default <code>overlayfs</code> snapshotter&#8217;s <code>Prepare()</code> call is synchronous and complete:</p><pre><code><code>// overlayfs snapshotter: Prepare returns after full extraction
func (o *snapshotter) Prepare(ctx context.Context, key, parent string, opts ...Opt) ([]mount.Mount, error) {
    // ... (layer already fully extracted to disk)
    return []mount.Mount{{
        Type:    "overlay",
        Source:  "overlay",
        Options: []string{
            fmt.Sprintf("lowerdir=%s", lowerDirs),
            fmt.Sprintf("upperdir=%s", upperDir),
            fmt.Sprintf("workdir=%s", workDir),
        },
    }}, nil
}</code></code></pre><p>A lazy-pulling snapshotter can return a FUSE mount instead:</p><pre><code><code>// remote snapshotter: Prepare returns immediately, content fetched on demand
func (s *remoteSnapshotter) Prepare(ctx context.Context, key, parent string, opts ...Opt) ([]mount.Mount, error) {
    // 1. Fetch TOC/index (small metadata, fast)
    // 2. Start FUSE daemon for this layer
    // 3. Return mount point where FUSE serves content on demand
    return []mount.Mount{{
        Type:    "fuse.rawBridge",
        Source:  "stargz",
        Options: []string{"ro", fmt.Sprintf("mountpoint=%s", mountDir)},
    }}, nil
}</code></code></pre><p>From containerd&#8217;s perspective, both return <code>[]mount.Mount</code>. The OCI runtime (<code>runc</code>) mounts them identically. The container process sees a normal filesystem either way. The difference is entirely in what backs those mounts.</p><h3>What Every Solution Adds to the Node</h3><p>For FUSE-based solutions (eStargz, SOCI, Nydus default, Google), you end up with this node topology:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!OEoV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55188374-a552-4d9d-a7eb-c9d09ad091d3_1452x1304.png" alt="Node topology with a FUSE-based remote snapshotter daemon"></figure></div>
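<p>For reference, registering a proxy-plugin snapshotter looks roughly like this (a sketch following the stargz-snapshotter docs; the socket path and plugin name vary by installation):</p><pre><code># /etc/containerd/config.toml (sketch)
[proxy_plugins]
  [proxy_plugins.stargz]
    type = "snapshot"
    address = "/run/containerd-stargz-grpc/containerd-stargz-grpc.sock"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "stargz"
  disable_snapshot_annotations = false</code></pre>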
<p>The <code>config.toml</code> change is simple. What it implies is not.</p><p><strong>Process dependency ordering:</strong> The snapshotter daemon must be running before containerd starts accepting CRI calls. If the daemon crashes during node boot, containerd silently falls back to <code>overlayfs</code>. Your containers start, but slowly, and the only signal is startup latency, not an error.</p><p><strong>FUSE mount lifecycle:</strong> Each lazy layer creates a FUSE mount. These mounts are tied to the snapshotter daemon&#8217;s process. If the daemon restarts, existing mounts become stale. Reads return <code>ENOTCONN</code>. Every container using those mounts breaks.</p><p><strong>Cache management:</strong> Every solution caches fetched chunks locally. Without explicit GC configuration, the cache grows until the disk fills. On GPU nodes with expensive local NVMe, that&#8217;s capacity competing with model weights, checkpoints, and scratch space.</p><div><hr></div><h2>Part IV: Readiness as the Metric That Matters</h2><p>Most lazy-pulling benchmarks report a single number: image pull time. Pull goes from minutes to sub-second, the chart looks dramatic, and the blog post ends. But pull time is a misleading metric for lazy-pulling because the cost doesn&#8217;t disappear. It shifts.</p><p>With a traditional full pull, the container starts with every file already on disk. With lazy-pulling, the container starts with <em>nothing</em> on disk. The first time the process reads a file, whether loading a binary, opening a shared library, or importing a Python module, that read blocks on a FUSE cache miss, which triggers an HTTP Range request to the registry, which adds network latency to what should be a local filesystem operation.</p><p>The metric that actually matters is <strong>readiness</strong>: the moment the container can serve its first request. We define it as the time from <code>container create</code> to a successful HTTP response on the health endpoint. This captures the full cost: pull, start, and all the on-demand file fetching that happens before the process is functional.</p>
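<p>In sketch form (the endpoint, port, and timeout here are assumptions, not the harness code), the measurement is just time-to-first-2xx:</p><pre><code>// readiness.go: poll the health endpoint until the first 2xx response
// and report the elapsed time since the container was created.
package main

import (
    "fmt"
    "net/http"
    "time"
)

func main() {
    start := time.Now() // taken immediately after container create returns
    url := "http://127.0.0.1:8080/healthz"
    deadline := start.Add(60 * time.Second)

    for time.Now().Before(deadline) {
        resp, err := http.Get(url)
        if err == nil {
            resp.Body.Close()
            if resp.StatusCode/100 == 2 {
                fmt.Printf("ready after %v\n", time.Since(start))
                return
            }
        }
        time.Sleep(10 * time.Millisecond) // tight poll so the probe doesn't inflate the number
    }
    fmt.Println("never became ready")
}</code></pre>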
<p>Readiness reveals the tradeoff that pull-time-only benchmarks hide.</p><h3>The Experiment</h3><p>We built an in-cluster test harness to compare three configurations head-to-head, measuring readiness on each. The goal: isolate what lazy-pulling actually changes by controlling for network, registry, and image variables.</p><h4>Architecture</h4><p>The experiment deploys three components into a Kubernetes cluster:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!ywej!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe27c261b-0f3a-4d7a-bd46-e5a118b97c11_1686x1248.png" alt="Experiment architecture: local registry, eStargz conversion job, node patching, stargz-snapshotter"></figure></div>
<p><strong>Local registry.</strong> A standard Docker Distribution v2 registry runs as a Deployment, backed by a PVC, exposed via NodePort on port 30500. Since kube-proxy binds NodePort on <code>0.0.0.0:&lt;port&gt;</code> on every node, the registry is reachable at <code>localhost:30500</code> from any node without DNS, ingress, or TLS. This eliminates internet variability from the measurement.</p><p><strong>eStargz conversion.</strong> A Helm post-install Job converts each source image: pulls from DockerHub, rebuilds each layer with a TOC using the <code>containerd/stargz-snapshotter/estargz</code> Go library, and pushes the result to the local registry. The converted image is backward-compatible; any OCI runtime can pull and unpack it normally. The TOC is only used by snapshotters that understand it.</p><p><strong>Node patching.</strong> A privileged DaemonSet writes the <code>hosts.toml</code> for the local registry and restarts containerd via <code>nsenter</code> into the host PID namespace. This makes containerd trust plain HTTP pulls from <code>localhost:30500</code>.</p><p><strong>stargz-snapshotter.</strong> For the FUSE lazy-pull configuration, the custom stargz-snapshotter runs as a containerd proxy plugin on the worker node. On pull, it fetches only the TOC from each layer (milliseconds). It mounts a FUSE filesystem instead of extracting layers to disk. On file access, it fetches individual files from the registry via HTTP Range requests, decompresses the specific byte range, and returns them to the process.</p>
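<p>For intuition, a single uncached read boils down to one ranged blob request. A minimal Go sketch (the repository name, digest, and byte range are illustrative; a real snapshotter also handles auth and retries):</p><pre><code>// rangefetch.go: the on-demand fetch behind a FUSE read. OCI registries
// serve layer blobs at /v2/&lt;name&gt;/blobs/&lt;digest&gt; and accept Range headers.
package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
)

func main() {
    url := "http://localhost:30500/v2/nginx/blobs/sha256:..." // hypothetical digest
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        log.Fatal(err)
    }
    // Fetch only the gzip member for one file, at the offset the TOC gave us.
    req.Header.Set("Range", "bytes=104857600-104861695")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err) // inside a container this surfaces as EIO on read()
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("status %s, got %d compressed bytes\n", resp.Status, len(body))
}</code></pre>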
<h4>Three Configurations Under Test</h4><p>We test three cold-start paths against the same image (nginx:1.25, ~70MB compressed), each starting with a clean image cache:</p><p><strong>DockerHub full pull.</strong> Standard path: containerd pulls all layers from DockerHub over the internet, decompresses them, writes them to overlayfs. Container starts with all files on disk.</p><p><strong>Local registry + overlayfs full pull.</strong> Same full-pull mechanics, but the registry is in-cluster. This isolates the network improvement: same decompression and extraction, no internet round-trip.</p><p><strong>Local registry + FUSE lazy pull.</strong> eStargz image, stargz-snapshotter active. Pull fetches only the TOC. Container starts immediately. Files are fetched on-demand through FUSE as nginx loads its binary, shared libraries, and config.</p><h4>Measurement Method</h4><p>Benchmarks run at the containerd level using <code>ctr</code> and <code>ctr-remote</code>, bypassing the Kubernetes CRI path. This eliminates kubelet scheduling, readiness probe interval, and CRI overhead from the measurement, giving us a clean view of the pull &#8594; start &#8594; ready pipeline.</p><p>For each configuration, we measure three phases:</p><ul><li><p><strong>Image Pull</strong> (time to complete the pull/lazy-pull operation)</p></li><li><p><strong>Container Start</strong> (time from pull completion to process running)</p></li><li><p><strong>HTTP Readiness</strong> (time from container start to a successful HTTP response on <code>/healthz</code>)</p></li></ul><h3>Results</h3><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!c04n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d49034c-cd32-4430-bad5-8ff3b5379a1c_1696x1018.png" alt="Benchmark results: pull, start, and readiness times per configuration"></figure></div>
<h4>What Readiness Reveals</h4><p><strong>Pull is 65x faster with FUSE</strong> (0.088s vs 5.782s). Only the TOC and image manifest are downloaded, not the actual layer data. This is the number most benchmarks report, and where most of them stop.</p><p><strong>Readiness is 20x slower with FUSE</strong> (0.271s vs 0.013s).
When nginx tries to serve <code>/healthz</code>, the FUSE filesystem must fetch the nginx binary, shared libraries (libc, libssl, libpcre), and config files from the registry over HTTP. Each file read becomes a network round-trip. With full-pull approaches, those files are already on disk and readiness is essentially instant.</p><p><strong>The local registry alone gets you some of the improvement</strong> (5.981s &#8594; 3.577s total) with zero containerd configuration changes, no FUSE, no snapshotter swap. It&#8217;s the boring optimization that often gets skipped in the rush toward lazy-pulling.</p><p><strong>Total readiness is still best with FUSE</strong> (0.589s vs 5.981s). For nginx, the pull savings vastly outweigh the readiness penalty. But the gap narrows as image complexity grows:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!kWef!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdc84cad-45d4-4388-8f5f-273fcfc2289f_1842x924.png" alt="How the readiness gap grows with image complexity"></figure></div>
<p>For a PyTorch image where the process imports hundreds of Python modules at startup, each triggering a FUSE cache miss and a network round-trip, the readiness phase grows proportionally. The pull savings are still enormous (minutes to sub-second), but the readiness penalty can reach seconds as the FUSE daemon serializes dozens of HTTP Range requests.</p><h4>A Note on custom stargz-snapshotter</h4><p>I used a custom snapshotter built on top of stargz-snapshotter that adds several performance improvements. It implements async prefetch to proactively fetch layer data in the background, eliminating the stalling and timeouts that occur when lazy-loading multi-gigabyte container images. Without async prefetch, on-demand fetches for large layers would consistently time out. The custom snapshotter also registers the necessary unpack platform handlers for containerd v2 compatibility, ensuring lazy-pulling works through the standard CRI path. The benchmarks above use this custom snapshotter.</p><div><hr></div><h2>Part V: The Operational Reality</h2><h3>Registry as Runtime Dependency</h3><p>This is the single most important operational change that lazy-pulling introduces, and it&#8217;s consistently underemphasized in vendor documentation.</p><p>With traditional pulls, the registry interaction is bounded: download all blobs, verify digests, done. The container runs entirely from local disk.
Registry outages don&#8217;t affect running workloads.</p><p>With lazy-pulling, every uncached file access is a live HTTP request to the registry:</p><pre><code><code>Timeline of a lazy-pulled container:

  t=0s     Container starts (TOC fetched, FUSE mounted)
  t=0.1s   Python interpreter loaded (prefetched, cached)
  t=0.5s   torch imported (prefetched, cached)
  t=2.0s   Model loaded, serving requests
  
  t=3600s  User uploads file triggering rare code path
           &#8594; import obscure_module
           &#8594; FUSE: cache miss for /usr/lib/python3.11/obscure_module.py
           &#8594; HTTP Range request to registry
           &#8594; Registry is in maintenance window
           &#8594; read() returns EIO
           &#8594; Python: ModuleNotFoundError
           &#8594; 500 error to user

  The container has been running for an hour.
  The failure looks like a missing file, not a network issue.
  kubectl describe pod shows nothing.
  The pod is in Running state.</code></code></pre><p>Every solution mitigates this with <strong>background fetching</strong>, downloading the complete image content in the background after the container starts. The race condition is the window between container start and background fetch completion. For a 12GB image, that window can be anywhere from 2 to 8 minutes depending on registry bandwidth. Any uncached file access during that window hits the registry live.</p><p>The deeper issue: background fetching eliminates the lazy-pulling benefit over time. Once the full image is downloaded, you have exactly the same disk usage as a traditional pull. Lazy-pulling optimizes <strong>time to first request</strong>, not steady-state resource consumption. If your containers run for hours, you&#8217;re carrying the operational complexity of lazy-pulling for a one-time startup improvement.</p><h3>FUSE at Scale</h3><p>Most open-source solutions and Google&#8217;s proprietary implementation use FUSE in the data path. Azure&#8217;s OverlayBD-based approach is a notable exception. Some characteristics of FUSE in production:</p><p><strong>Per-operation latency includes scheduling.</strong> A FUSE read is: kernel sends request to FUSE device, userspace daemon wakes up and reads from <code>/dev/fuse</code>, daemon processes the request, daemon writes response back, kernel completes the syscall. The wake-up and write-back are context switches. On CPU-saturated GPU nodes, FUSE response latency becomes a function of CPU contention.</p><p><strong>Mount count scales multiplicatively.</strong> A node running 20 containers, each with 5 image layers, creates up to 100 FUSE mounts. Each maintains kernel-side state. The aggregate matters on nodes already tracking thousands of cgroups, network interfaces, and device mappings.</p><p><strong>Observability is split.</strong> <code>strace</code> shows slow <code>read()</code> syscalls. <code>iostat</code> shows nothing unusual (FUSE isn&#8217;t a block device). The actual bottleneck, a cache miss triggering an HTTP request inside the snapshotter daemon, is visible only through the daemon&#8217;s own metrics, which may or may not be exported in your monitoring stack.</p><p><strong>Failure mode is unfamiliar.</strong> When a FUSE daemon crashes, existing mounts go stale. Reads return <code>ENOTCONN</code>, &#8220;Transport endpoint is not connected&#8221;, on what the application thinks is a local file. This error doesn&#8217;t appear in most applications&#8217; retry logic because local filesystem reads are assumed to be reliable.</p><h3>The Nydus/EROFS Exception</h3><p>Nydus with the EROFS backend (Linux 5.19+) is the only solution that eliminates FUSE from the data path entirely. The cached read path goes through kernel VFS &#8594; EROFS driver &#8594; page cache, with zero extra context switches. The on-demand fetch path uses the fscache subsystem&#8217;s userspace daemon, but this runs asynchronously. EROFS can serve other cached reads while waiting for a fetch to complete.</p><p>The result is ~10x lower per-operation latency and no stale mount failure mode. The catch: you need a kernel with <code>CONFIG_EROFS_FS_ONDEMAND</code> enabled, and many minimal cloud-native node OSes still don&#8217;t enable this config flag by default.</p><div><hr></div><h2>Part VI: Where This Stands</h2><p>Container image lazy-pulling is a solved problem at the format level. 
We have four proven approaches spanning backward-compatible (eStargz), non-invasive (SOCI), high-performance (Nydus/EROFS), and fully-managed (Google, Azure). The format diversity is healthy: different tradeoffs suit different environments.</p><p>What remains unsolved is the integration story. Every solution requires replacing the containerd snapshotter on every node, running a long-lived daemon with its own lifecycle and failure modes, depending on FUSE or TCMU or a specific kernel configuration, managing a persistent cache with a GC policy, and monitoring for a new class of failures (registry-as-runtime-dependency, stale FUSE mounts, cache pressure).</p><p>The cloud provider&#8217;s answer is to absorb this complexity into managed services, which works but gates the benefit behind a specific registry and Kubernetes distribution. The open-source answer is to install and operate the snapshotter yourself, which works but adds operational surface that compounds with every containerd and kernel upgrade.</p><p>The piece that&#8217;s missing is an upstream-native lazy-pulling snapshotter in containerd. Not a proxy plugin over a Unix socket, but a built-in snapshotter that speaks the OCI distribution spec, understands SOCI-style external indexes (so images don&#8217;t need modification), and uses EROFS where available with FUSE as fallback. The OCI Referrers API provides the standard metadata attachment mechanism. EROFS is in mainline Linux. The stargz-snapshotter is already in the containerd GitHub organization. The building blocks exist. The assembly is the hard part.</p><p>In container infrastructure, the hardest problems aren&#8217;t the algorithms or the formats. They&#8217;re making the operational integration invisible enough that teams adopt it without needing a dedicated infrastructure engineer to babysit it. Our readiness experiment shows that even the simplest lazy-pulling setup (eStargz + stargz-snapshotter) delivers a 10x improvement in total cold-start time, but only if you understand that the cost shifts from pull to runtime, measure accordingly, and plan for the registry dependency.</p><p>Everything in this post addresses the <em>container runtime</em> side of cold-start, meaning the OS, libraries, interpreters, and application code baked into OCI layers. For GPU workloads, that&#8217;s often only half the story. Getting multi-gigabyte model weights to the right node at the right time is a different problem with different constraints, and outside the scope of this post.</p><div><hr></div><h3>The ROI of Complexity</h3><p>We often talk about lazy-pulling as a &#8220;startup speed&#8221; hack, but that misses the bigger picture. This is about <strong>resilience</strong> and <strong>yield</strong>.</p><p>When a spot instance vanishes or a node kernel panics, a 5-minute pull time is a 5-minute outage. Lazy-pulling turns that catastrophe into a 10-second hiccup. That difference is the margin between a seamless failover and a user-visible incident.</p><p>Simultaneously, every minute an H100 sits idle waiting for tar extraction is capital burned. I&#8217;m curious: how are you modeling this trade-off? Is the operational tax of running FUSE daemons worth the reclaimed GPU time and the reduction in MTTR? Or is the risk of a runtime registry dependency too high for your SLA?</p>
]]></content:encoded></item><item><title><![CDATA[Who Will Observe the Observability? eBPF Performance at Scale]]></title><description><![CDATA[Why infrastructure teams can't afford to be passengers, even in managed Kubernetes]]></description><link>https://blog.zmalik.dev/p/who-will-observe-the-observability</link><guid isPermaLink="false">https://blog.zmalik.dev/p/who-will-observe-the-observability</guid><dc:creator><![CDATA[Zain]]></dc:creator><pubDate>Sun, 23 Nov 2025 20:31:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!q_5r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F571e8f2a-f0a2-4139-9e5c-b1fdc5eeba72_928x482.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>From KubeCon Stage to Technical Deep-Dive</h2><p>After my <a href="https://www.youtube.com/watch?v=J-Zx64mJzVk">KubeCon talk</a> with Grzegorz G&#322;&#261;b, <strong>&#8220;Fix First, Investigate Later: When an eBPF Rollout Brought Down Our Network,&#8221;</strong> I received numerous questions. While many wanted to discuss the specific technicalities of the incident, a distinct pattern emerged in the &#8220;hallway.&#8221; People weren&#8217;t just asking <em>how</em> the bug happened. They were also interested in: <strong>&#8220;What kind of skills does an Infrastructure Team actually need to own this stack?&#8221;</strong></p><p>The conversation shifted from the specifics of the incident to the capability of the team solving it. It raised a fundamental question: What is the gradient of expertise required today?</p>
<h3>The Illusion of &#8220;Managed&#8221; Systems</h3><p>There is no short answer to &#8220;what skills do we need?&#8221; but there is a mindset shift required regarding the infrastructure we rent. Too many organizations fall into the trap of treating managed Kubernetes services as black boxes. We convince ourselves that because we pay a premium for a managed control plane, the underlying compute is &#8220;someone else&#8217;s problem.&#8221;</p><p>But if your business continuity relies on these foundational systems, you cannot afford to be a passenger. Even in managed environments, you almost always have full access to the <strong>worker node</strong> VMs. There is <strong>nothing</strong> actually stopping us from building the expertise to debug them.</p><p>This depth of expertise is where an infrastructure team differentiates itself. It is the difference between the &#8220;Folklore Team&#8221; (&#8220;Just restart the node, that usually fixes it&#8221;) and the &#8220;Engineering Team&#8221; (&#8220;We identified a race condition in the CNI plugin...&#8221;).</p><p>Let&#8217;s take the technical findings from our KubeCon talk as a case study. Much like the deep dive into the <a href="https://blog.zmalik.dev/p/packet-drop">irqbalance issue</a> revealed, this investigation required us to look past the abstractions. It wasn&#8217;t just a matter of reading logs. It required digging into the kernel&#8217;s data transfer mechanisms to understand the root cause.</p><p>In fact, the deep dive I&#8217;m about to share happened much later, during personal time, driven by curiosity to reproduce and fully understand the issue. Since there were no direct business metrics attached to the specific root cause initially, limited organizational effort was allocated to it. It took that curiosity to go back and find the why.</p><div><hr></div><h2>The Hidden Challenge: Scaling eBPF on 32+ Core Systems</h2><p>While eBPF programs promise minimal overhead, the reality on high-core-count machines under production traffic patterns tells a different story. Our investigation into Microsoft Retina&#8217;s <code>packetparser</code> plugin revealed a serious scaling bottleneck: when multi-core applications generate high traffic volumes across 32, 64, or 96 cores, the choice between perf arrays and ring buffers becomes <strong>critical</strong>.</p><p>Under real-world traffic patterns, where packet processing spreads across all cores simultaneously, perf arrays can degrade performance by up to 50%, while ring buffers maintain a more consistent 7% overhead. The difference becomes especially pronounced as traffic patterns shift from single-core bursts to sustained multi-core loads, turning what should be lightweight observability into a production-impacting bottleneck.</p><div><hr></div><h2>Understanding the Architecture: Retina&#8217;s Packetparser</h2><p>Let&#8217;s examine how modern eBPF observability tools transfer data from kernel to user space, using Microsoft <a href="https://github.com/microsoft/retina">Retina</a> as our case study.
Retina&#8217;s <code>packetparser</code> plugin showcases a common architecture in eBPF network monitoring:</p><h3>The eBPF Programs</h3><p>Packetparser deploys four TC-BPF programs:</p><ul><li><p><strong>endpoint_ingress/egress</strong>: Attached to pod interfaces</p></li><li><p><strong>host_ingress/egress</strong>: Monitoring host network interfaces</p></li></ul><p>These programs perform packet analysis:</p><ul><li><p>Parse TCP flags (all 9 bits)</p></li><li><p>Track TSval/TSecr for RTT calculation (RFC 7323)</p></li><li><p>Monitor sequence/ACK numbers</p></li><li><p>Calculate per-flow metrics (bytes/packets)</p></li></ul><p>But the real challenge isn&#8217;t in the kernel; it&#8217;s in getting this data to user space efficiently.</p><h2>Perf Arrays: The Traditional Approach</h2><p>Most eBPF tools default to perf arrays (<strong>BPF_MAP_TYPE_PERF_EVENT_ARRAY</strong>) for kernel-to-user communication. Here&#8217;s why this made sense initially, and why it breaks down at scale.</p><h3>How Perf Arrays Work</h3><p>Perf arrays create a per-CPU buffer architecture:</p><pre><code>32 CPU Cores = 32 Independent Buffers
Each buffer: 32 pages (131KB)
Total memory: ~4MB across all CPUs</code></pre><figure><img src="https://substackcdn.com/image/fetch/$s_!_Af4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86db0d95-5342-4bf6-83b0-1ef5a6f1d7eb_1636x886.png" alt=""/></figure><p>The data flow follows this pattern:</p><ol><li><p>Each CPU writes events to its dedicated buffer (Kernel &#8220;Producer&#8221;)</p></li><li><p>A single <strong>reader</strong> goroutine polls all buffers via epoll (User Space &#8220;Consumer&#8221;)</p></li><li><p>The reader consumes events in FIFO order across all CPUs</p></li><li><p>Events are forwarded to consumer workers through channels</p></li></ol><p>This architecture seems reasonable. It avoids lock contention by giving each CPU its own buffer. But our benchmarks revealed a critical scaling problem.</p>
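<p>To make the single-consumer pattern concrete, here is a minimal Go sketch using the cilium/ebpf library&#8217;s perf reader, a common choice for Go-based eBPF agents. This is an illustration of the pattern, not Retina&#8217;s actual code: the pinned map path and the 32-page per-CPU buffer size are assumptions.</p><pre><code>package main

import (
    "errors"
    "log"

    "github.com/cilium/ebpf"
    "github.com/cilium/ebpf/perf"
)

func main() {
    // Load a pinned BPF_MAP_TYPE_PERF_EVENT_ARRAY (one ring per CPU).
    // The pin path is a placeholder for this sketch.
    events, err := ebpf.LoadPinnedMap("/sys/fs/bpf/retina_events", nil)
    if err != nil {
        log.Fatal(err)
    }
    defer events.Close()

    // A single reader polls every per-CPU buffer through one epoll
    // instance; 32 pages per CPU mirrors the sizing discussed above.
    rd, err := perf.NewReader(events, 32*4096)
    if err != nil {
        log.Fatal(err)
    }
    defer rd.Close()

    for {
        rec, err := rd.Read() // blocks until some CPU's buffer has data
        if err != nil {
            if errors.Is(err, perf.ErrClosed) {
                return
            }
            continue
        }
        // A non-zero LostSamples count means a per-CPU buffer overflowed
        // before this single consumer could drain it.
        if rec.LostSamples &gt; 0 {
            log.Printf("lost %d samples on CPU %d", rec.LostSamples, rec.CPU)
        }
        _ = rec.RawSample // decode the event struct here
    }
}</code></pre><p>Note the structural constraint this makes visible: no matter how many CPUs produce events, there is exactly one <code>Read()</code> loop draining all of them.</p>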
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The data flow follows this pattern:</p><ol><li><p>Each CPU writes events to its dedicated buffer (Kernel &#8220;Producer&#8221;)</p></li><li><p>A single <strong>reader</strong> goroutine polls all buffers via epoll (User Space &#8220;Consumer&#8221;)</p></li><li><p>The reader consumes events in FIFO order across all CPUs</p></li><li><p>Events are forwarded to consumer workers through channels</p></li></ol><p>This architecture seems reasonable. It avoids lock contention by giving each CPU its own buffer. But our benchmarks revealed a critical scaling problem.</p><h2><strong>Reproducing the Multi-Core Storm</strong></h2><p>To validate our hypothesis, we needed to simulate the traffic patterns that trigger this bottleneck. And that&#8217;s where standard tools fell short. <code>iperf3</code>, the go-to for network benchmarking, is single-threaded by default. 
It generates impressive throughput numbers, but all that traffic flows through a single core, completely missing the multi-core contention pattern we observed in production.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!hW04!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd823e379-6cf7-4ae3-a4b4-7fe19f23f2a7_1540x694.png" alt=""/></figure><p>So we built a purpose-designed Go application to replicate a real-world, network-intensive workload. The architecture leverages SO_REUSEPORT to bind multiple listeners to the same port, allowing the Linux kernel to distribute incoming <code>SYN</code> packets across sockets using flow hashing on the real client pod source IPs. Each accepted connection spawns a lightweight goroutine for reading and decoding, feeding work into a buffered channel consumed by a fixed worker pool. This design ensures that when we spin up dozens of client pods hammering the receiver, we&#8217;re genuinely spreading packet processing across all available cores, exactly the scenario where eBPF&#8217;s data transfer path becomes the bottleneck rather than the application itself.</p>
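<p>As a rough illustration of that receiver design (a sketch under stated assumptions, not the exact tool we used), the essential trick is binding several listeners to one port with SO_REUSEPORT so the kernel spreads incoming flows across them. The port and listener count below are arbitrary.</p><pre><code>package main

import (
    "context"
    "io"
    "log"
    "net"
    "syscall"

    "golang.org/x/sys/unix"
)

func main() {
    // Set SO_REUSEPORT on every socket before bind, so multiple
    // listeners can share the same address and port.
    lc := net.ListenConfig{
        Control: func(network, address string, c syscall.RawConn) error {
            var serr error
            err := c.Control(func(fd uintptr) {
                serr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
            })
            if err != nil {
                return err
            }
            return serr
        },
    }

    for i := 0; i &lt; 8; i++ { // one listener per worker; scale with cores
        ln, err := lc.Listen(context.Background(), "tcp", ":9000")
        if err != nil {
            log.Fatal(err)
        }
        go func(ln net.Listener) {
            for {
                conn, err := ln.Accept()
                if err != nil {
                    return
                }
                // Drain the connection; in the real tool this fed a
                // buffered channel consumed by a fixed worker pool.
                go func() {
                    defer conn.Close()
                    io.Copy(io.Discard, conn)
                }()
            }
        }(ln)
    }
    select {} // block forever
}</code></pre><p>With eight such listeners, the kernel&#8217;s flow hash decides which accept queue each new connection lands in, so load spreads across cores without any user-space coordination.</p>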
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So we built a purpose-designed Go application to replicate a real-world, network-intensive workload. The architecture leverages SO_REUSEPORT to bind multiple listeners to the same port, allowing the Linux kernel to distribute incoming <code>SYN</code> packets across sockets using flow hashing on the real client pod source IPs. Each accepted connection spawns a lightweight goroutine for reading and decoding, feeding work into a buffered channel consumed by a fixed worker pool. This design ensures that when we spin up dozens of client pods hammering the receiver, we&#8217;re genuinely spreading packet processing across all available cores, exactly the scenario where eBPF&#8217;s data transfer path becomes the bottleneck rather than the application itself.</p><h3>The Benchmark Results: A Non-Linear Degradation</h3><p>We tested network throughput using this multi-threaded network receiver app on VM types with varying core counts. The results showed a disturbing trend: the more powerful the machine, the worse the relative performance overhead.</p><pre><code><code>Performance Impact by Core Count:

4-core nodes:
  - Baseline:     200 Mb/s
  - With Retina:  200 Mb/s (0% impact)
  
16-core nodes:  
  - Baseline:     5 Gb/s
  - With Retina:  4.3 Gb/s (14% reduction)

32-core nodes:
  - Baseline:     8.0 Gb/s
  - With Retina:  4.5 Gb/s (44% reduction)</code></pre><p>On small nodes (4 cores), the overhead was negligible. By the time we reached 64 cores, the observability tool was choking nearly <strong>50% of the available network throughput capacity</strong>.</p><h2>Deep Dive: CPU Limits &amp; Pinning</h2><p>When we started investigating the performance drop on 32-core nodes, our immediate hypothesis was CPU throttling. We assumed the eBPF agent was hitting its Kubernetes resource limits. To test this, we ran a series of comparative benchmarks specifically on the 32-core nodes.</p><p>The results were counter-intuitive and highlighted the &#8220;thrashing&#8221; behavior of perf arrays:</p><p><strong>Network Throughput Comparison (32-core Node)</strong></p><pre><code>Scenario&#9;Throughput&#9;Impact&#9;Notes
Without Retina (Baseline)&#9;8.0 Gb/s&#9;-&#9;Clean baseline
With Retina (Default)&#9;4.5 Gb/s&#9;-43%&#9;Standard deployment
Retina (No CPU Limits)&#9;3.7 Gb/s&#9;-53%&#9;Performance worsened!
Retina (CPU Pinning)&#9;5.2 Gb/s&#9;-35%&#9;Best case for Perf Arrays</code></pre><h3>Why did removing CPU limits make it worse?</h3><p>Seeing performance drop from 4.5 Gb/s to 3.7 Gb/s after <em>removing</em> CPU limits (the &#8220;No CPU Limits&#8221; row) was a surprise.</p><p>This revealed that the bottleneck wasn&#8217;t a lack of CPU cycles, but <strong>scheduling contention</strong>:</p><ol><li><p><strong>Thrashing:</strong> With unlimited CPU, the reader thread spun more aggressively, polling the 32 buffers.</p></li><li><p><strong>Context Switching:</strong> This generated excessive context switches (&gt;120k/sec).</p></li></ol><p>At a ballpark cost of a few microseconds per switch (an assumption, not something we measured directly), 120k context switches per second can burn on the order of half a CPU core on scheduling alone.</p><h3>The Effect of CPU Pinning</h3><p>When we applied CPU pinning (isolating the Retina agent to specific cores), throughput improved to <strong>5.2 Gb/s</strong>. While this was better than the default configuration, it still represented a <strong>35% performance penalty</strong> compared to the baseline. In Kubernetes, this can be achieved via the CPU Manager <code>static</code> policy.</p><p>Even with perfect CPU isolation, the architectural overhead of polling 32 separate memory regions via perf arrays prevented us from reaching acceptable performance.</p><h3>A Look at the CPU Usage Pattern</h3><p>With our reproducible pattern, we saw that the more cores we used and the more network-intensive the machines were, the higher the degradation was.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!TsPf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6564186-2c5e-4275-856b-823560232d9c_2276x1208.png" alt=""/></figure>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d6564186-2c5e-4275-856b-823560232d9c_2276x1208.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1208,&quot;width&quot;:2276,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:205759,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.zmalik.dev/i/179255417?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0cbd76e-0620-4949-8c49-3c3e33653096_2302x1208.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TsPf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6564186-2c5e-4275-856b-823560232d9c_2276x1208.png 424w, https://substackcdn.com/image/fetch/$s_!TsPf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6564186-2c5e-4275-856b-823560232d9c_2276x1208.png 848w, https://substackcdn.com/image/fetch/$s_!TsPf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6564186-2c5e-4275-856b-823560232d9c_2276x1208.png 1272w, https://substackcdn.com/image/fetch/$s_!TsPf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6564186-2c5e-4275-856b-823560232d9c_2276x1208.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>This made us look into again which part is actually changing with CPU core count.</p><h2>Debugging eBPF related Degradation </h2><p>We just analyzed a specific CPU usage pattern using our reproducible test environment. 
We observed a distinct correlation: as we increased the number of cores on nodes running network-intensive workloads, the performance degradation became more severe.</p><p>This led us to investigate which components were scaling poorly with the CPU core count. Let&#8217;s go down the debugging path.</p><p>First, we checked the loaded BPF programs to identify potential overhead. We will use just one as an example.</p><pre><code>bpftool prog list

...
1167: sched_cls  name endpoint_ingress_filter  tag 44b14ea77164570a  gpl
&#9;loaded_at 2025-11-23T19:06:46+0000  uid 0
&#9;xlated 10480B  jited 7222B  memlock 12288B  map_ids 268,282
&#9;btf_id 391
...</code></pre><p>All we needed from the command above was the <code>map_ids</code> field. We took a closer look at map 282.</p><pre><code>bpftool map show id 282

282: perf_event_array  name retina_packetpa  flags 0x0
&#9;key 4B  value 4B  max_entries 16  memlock 512B</code></pre><p>It was a perf_event_array. To understand its impact, we decided to check the event frequency. We ran a bpftrace script to compare perf_event_output (streams) against perf_event_wakeup (wakeups).</p><pre><code>timeout 10 sudo bpftrace -e '
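  // Count kernel-side submissions into the per-CPU perf buffers
  // (perf_event_output) against consumer wakeups (perf_event_wakeup),
  // printing per-second averages every 2 seconds.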
  BEGIN {
    printf("Timestamp | Streams/s | Wakeups/s\n");
    printf("----------------------------------\n");
  }

  kprobe:perf_event_output {
    @streams = @streams + 1;
  }

  kprobe:perf_event_wakeup {
    @wakeups = @wakeups + 1;
  }

  interval:s:2 {
    printf("%s | %9d | %9d\n",
           strftime("%H:%M:%S", nsecs),
           @streams / 2,
           @wakeups / 2);

    @streams = 0;
    @wakeups = 0;
  }
'
Attaching 4 probes...
Timestamp | Streams/s | Wakeups/s
----------------------------------
19:43:31 |     84883 |     62328
19:43:33 |     91055 |     65034
19:43:35 |     94762 |     69747
19:43:37 |     94825 |     69257

@streams: 171695
@wakeups: 115786</code></pre><p>The results showed almost 70% of events resulting in a wakeup, which is significant.</p><p>To confirm the source, we asked: what happens if we disable Retina? How much of this volume actually belongs to retina-agent? We ran the exact same bpftrace script with Retina disabled.</p><pre><code>
Attaching 4 probes...
Timestamp | Streams/s | Wakeups/s
----------------------------------
19:44:43 |         0 |         0
19:44:45 |         0 |         0
19:44:47 |         0 |         0
19:44:49 |         0 |         0


@streams: 0
@wakeups: 0</code></pre><p>The answer was clear: all of it. Every recorded event was generated by Retina running on our node.</p><p>We were able to identify the per-CPU perf arrays as the culprit relatively quickly, simply by following the BPF program thread.</p><p>So, what does this mean for our packetparser architecture?</p><p>We were running not just one, but four TC-BPF programs simultaneously. As the data demonstrates, relying heavily on per-CPU perf arrays can lead to extreme noise. The overhead incurred by these wakeups creates significant pressure on the system, explaining why the degradation worsened with increased core counts and network load.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!9m1W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F823e8265-7cb4-46a3-b628-63ae7598e97d_2210x1204.png" alt=""/></figure>
<h3>Why Perf Arrays Break at Scale</h3><p>The bottleneck emerges from a fundamental architectural constraint: <strong>buffer polling management</strong>.</p><p>Consider what happens on a 32-core system under high network load:</p><ul><li><p>The single reader thread must manage file descriptors for 32 separate buffers.</p></li><li><p>Unlike a shared buffer, this requires iterating over multiple non-contiguous memory regions.</p></li><li><p><strong>NUMA Penalties:</strong> On multi-socket systems (common in 32+ core VMs), the user-space reader typically runs on one NUMA node but must access memory pages allocated on remote NUMA nodes to drain the per-CPU buffers. This leads to cache line bouncing and expensive remote memory access.</p></li></ul><h2>Ring Buffers: A Different Approach</h2><p>In kernel 5.8, BPF ring buffers (<strong>BPF_MAP_TYPE_RINGBUF</strong>) introduced a fundamentally different architecture. Instead of per-CPU isolation, ring buffers use a single shared data structure.</p><h3>Ring Buffer Architecture</h3><pre><code>All CPUs &#8594; Single Shared Buffer (8MB)
          &#8595;
    [Spinlock coordination]
          &#8595;
     Single Consumer Read Point</code></pre><figure><img src="https://substackcdn.com/image/fetch/$s_!I7oe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9660b6b0-2be2-4498-ba30-41126bb2329e_1358x1184.png" alt=""/></figure>
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Key differences from perf arrays:</p><ul><li><p><strong>Multi-producer, single-consumer</strong> design</p></li><li><p><strong>Lock-free</strong> for readers (spinlock only for writers)</p></li><li><p><strong>Efficient Batching</strong> - consume events from all CPUs in one contiguous memory pass</p></li><li><p><strong>Adaptive sizing</strong> - independent of CPU count</p></li></ul><h3>Implementation and Testing</h3><p>To address this, we modified Retina to support ring buffers. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q_5r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F571e8f2a-f0a2-4139-9e5c-b1fdc5eeba72_928x482.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q_5r!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F571e8f2a-f0a2-4139-9e5c-b1fdc5eeba72_928x482.png 424w, https://substackcdn.com/image/fetch/$s_!q_5r!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F571e8f2a-f0a2-4139-9e5c-b1fdc5eeba72_928x482.png 848w, https://substackcdn.com/image/fetch/$s_!q_5r!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F571e8f2a-f0a2-4139-9e5c-b1fdc5eeba72_928x482.png 1272w, https://substackcdn.com/image/fetch/$s_!q_5r!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F571e8f2a-f0a2-4139-9e5c-b1fdc5eeba72_928x482.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q_5r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F571e8f2a-f0a2-4139-9e5c-b1fdc5eeba72_928x482.png" width="522" height="271.125" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/571e8f2a-f0a2-4139-9e5c-b1fdc5eeba72_928x482.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:482,&quot;width&quot;:928,&quot;resizeWidth&quot;:522,&quot;bytes&quot;:88664,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.zmalik.dev/i/179255417?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F571e8f2a-f0a2-4139-9e5c-b1fdc5eeba72_928x482.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!q_5r!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F571e8f2a-f0a2-4139-9e5c-b1fdc5eeba72_928x482.png 424w, https://substackcdn.com/image/fetch/$s_!q_5r!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F571e8f2a-f0a2-4139-9e5c-b1fdc5eeba72_928x482.png 848w, https://substackcdn.com/image/fetch/$s_!q_5r!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F571e8f2a-f0a2-4139-9e5c-b1fdc5eeba72_928x482.png 1272w, https://substackcdn.com/image/fetch/$s_!q_5r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F571e8f2a-f0a2-4139-9e5c-b1fdc5eeba72_928x482.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To verify the fix, we ran the same steps as above. First, we inspected the map again:</p><pre><code>bpftool map show id 455
455: ringbuf  name retina_packetpa  flags 0x0
&#9;key 0B  value 0B  max_entries 8388608  memlock 8434072B</code></pre><p>This time, instead of a perf_event_array, we see a ringbuf.</p><p>Next, we double-checked that perf_event_wakeup events were eliminated by running our tracing script again.</p><pre><code>Timestamp | Streams/s | Wakeups/s
----------------------------------
19:59:27 |         0 |         0
19:59:29 |         0 |         0
19:59:31 |         0 |         0
19:59:33 |         0 |         0


@streams: 0
@wakeups: 0</code></pre><p>The results confirmed it: the specific noise from the perf event array was completely gone. Let&#8217;s now run some tests.</p><h3>Benchmark Results: Ring Buffer Performance</h3><p>After implementing ring buffer support, we repeated our benchmarks on the same 32-core nodes that struggled with perf arrays:</p><pre><code>Network Throughput Comparison (32-core nodes):
- Baseline (no Retina):         8.0 Gb/s
- With Perf Arrays (Pinned):    5.2 Gb/s (35% overhead)
- With Ring Buffer:             7.4 Gb/s (7.5% overhead)</code></pre><p>The improvement was dramatic. We reduced overhead from 35% to 7.5%. This validated our hypothesis: on high-core-count systems under sustained multi-core load, the data structure choice fundamentally determines whether eBPF observability remains transparent or becomes a production bottleneck.</p><h3>Trade-offs and Considerations</h3><p>Ring buffers aren&#8217;t universally better. Here&#8217;s what we learned:</p><p><strong>When Perf Arrays Win:</strong></p><ul><li><p>Low core counts (&#8804;8 cores)</p></li><li><p>Strict per-CPU isolation requirements</p></li><li><p>Older kernels (pre-5.8)</p></li><li><p>NUMA-sensitive workloads (specifically where kernel-side write latency is the priority over user-side read latency)</p></li></ul><p><strong>When Ring Buffers Win:</strong></p><ul><li><p>High core counts (&#8805;16 cores)</p></li><li><p>Bursty traffic patterns</p></li><li><p>Limited consumer threads</p></li><li><p>Memory-constrained environments</p></li></ul><h2>The State of eBPF Observability: Looking Forward</h2><p>Our investigation highlights a critical gap in the eBPF ecosystem: most tools are optimized for modest systems but deployed on increasingly powerful hardware.</p><h3>Recommendations for Tool Developers</h3><ol><li><p><strong>Make buffer mechanisms configurable</strong>: Don&#8217;t hardcode perf arrays or ring buffers</p></li><li><p><strong>Test on production-representative hardware</strong>: If users run 32+ cores, test on 32+ cores</p></li><li><p><strong>Document scaling characteristics</strong>: Be transparent about performance at different scales</p></li><li><p><strong>Provide escape hatches</strong>: Quick ways to disable or tune down collection</p></li></ol><h3>The Future: Adaptive Mechanisms</h3><p>The ideal eBPF observability tool would:</p><ul><li><p>Auto-detect system characteristics (CPU count, NUMA topology)</p></li><li><p>Dynamically switch buffer mechanisms based on load</p></li><li><p>Implement backpressure when overwhelmed</p></li><li><p>Gracefully degrade (do sampling) rather than impact workload performance</p></li></ul><h2>Conclusion: Renters vs. Owners</h2><p>This investigation highlights why the choice of buffer mechanism (perf arrays vs. ring buffers) isn&#8217;t just an implementation detail. It defines the scalability of your observability stack.</p><p>But more importantly, it brings us back to the question raised in the hallway: <strong>Why does an infrastructure team need to know the kernel stack if they are just &#8220;renting&#8221; the cloud?</strong></p><p>Because when you run at scale, the abstraction leaks. If we had stayed in the &#8220;renter&#8221; mindset, we would have opened a support ticket. And, to say the least, no one acknowledged this as a problem to begin with. The vendor or cloud provider will look at the saturation and point the finger back at you, claiming it is <strong>your rogue workload</strong> causing the issue.</p><p>They wouldn&#8217;t be entirely wrong: your workload <em>is</em> high-traffic. But the degradation isn&#8217;t the workload&#8217;s fault. It&#8217;s the observability tool struggling to observe it. If you view yourself merely as a renter, you accept the degradation. If you view yourself as an engineer owning the stack, you investigate, you debug, and you fix.</p><div><hr></div><p><em>Questions or experiences to share?
Reach out on <a href="https://www.linkedin.com/in/thezainm/">LinkedIn</a>.</em></p><h2>References</h2><ul><li><p><a href="https://www.youtube.com/watch?v=J-Zx64mJzVk">KubeCon 2025 NA talk: &#8220;Fix First, Investigate Later: When an eBPF Rollout Brought Down Our Network&#8221;</a></p></li><li><p><a href="https://nakryiko.com/posts/bpf-ringbuf/">Perf Arrays vs. Ring Buffers Comparison</a></p></li><li><p><a href="https://www.kernel.org/doc/html/next/bpf/ringbuf.html">BPF Ring Buffer Documentation</a></p></li></ul><h2>Acknowledgments</h2><p>Thanks to <a href="https://www.linkedin.com/in/grzesuav/">Grzegorz G&#322;&#261;b</a> (Whatnot) for co-presenting at KubeCon. Reach out to him for any questions related to the first part of our KubeCon presentation on mutation webhook magic.</p><p>Thanks to <a href="https://www.linkedin.com/in/rektide/">Matthew Fowle</a> for suggesting the title of this blog post: &#8220;Who will observe the observability?&#8221;</p>]]></content:encoded></item><item><title><![CDATA[From Utilization to PSI: Rethinking Resource Starvation Monitoring in Kubernetes]]></title><description><![CDATA[From Utilization Confusion to PSI Clarity in Kubernetes]]></description><link>https://blog.zmalik.dev/p/from-utilization-to-psi-rethinking</link><guid isPermaLink="false">https://blog.zmalik.dev/p/from-utilization-to-psi-rethinking</guid><dc:creator><![CDATA[Zain]]></dc:creator><pubDate>Sun, 27 Apr 2025 13:02:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!g-Yz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce21c391-e9de-4b3f-9a39-32cbc5879d94_902x902.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<pre><code>In Kubernetes v1.33 (alpha), cAdvisor&#8217;s Pressure Stall Information (PSI) metrics can be enabled on the kubelet by passing --feature-gates=KubeletPSI=true</code></pre><h3>Introduction: The Evolution of Resource Monitoring</h3><p>In traditional VM-based environments, monitoring resource starvation was straightforward: you watched <strong>resource utilization</strong> (CPU, memory, etc.) against the machine&#8217;s capacity. If a VM&#8217;s CPU usage hit close to 100% of its allocated vCPUs or memory usage neared 100% of RAM, you knew contention was occurring. High utilization meant the workload was starved for more resources. This utilization-centric approach made sense when each VM had fixed resources.</p><p>However, Kubernetes changed the game.
Kubernetes introduced the concepts of <strong>resource requests</strong> and <strong>limits</strong> for containers, enabling dynamic sharing and overcommitment of resources on a node.</p><p>Many teams initially tried to carry over the old monitoring mindset, comparing container usage to its <strong>requested resources</strong> as a proxy for stress. Unfortunately, usage vs. requests can be very misleading in Kubernetes (it may simply be borrowing idle capacity). A container using more CPU than requested isn&#8217;t necessarily a problem, and one using less than requested isn&#8217;t necessarily safe from contention. The traditional model of &#8220;utilization == starvation&#8221; doesn&#8217;t directly apply in this new world of shared resources and elastic consumption.</p><p>In this post, we&#8217;ll explore:</p><ul><li><p>Why the old metrics (like CPU utilization vs. requests) fall short in Kubernetes.</p></li><li><p>Why even monitoring usage against limits is only a slight improvement.</p></li><li><p>Why setting CPU limits in Kubernetes is often considered a bad practice, as it can hurt performance.</p></li><li><p>How Linux&#8217;s Completely Fair Scheduler (CFS) using CPU shares (weights) based on requests usually suffices to manage CPU contention.</p></li><li><p>How <strong>Pressure Stall Information (PSI)</strong> metrics provide a far more accurate picture of resource contention.</p></li></ul><p>We'll look at key scenarios that PSI highlights, such as CPU throttling events or genuine CPU pressure, and how PSI avoids the false positives of older monitoring approaches. Technical sample queries will be included to illustrate how to gather and use PSI metrics in practice.</p><p>If you&#8217;re a Kubernetes engineer or SRE still relying on outdated utilization metrics, it&#8217;s time to update your toolkit. Let&#8217;s dive in.</p><h3>The Traditional Approach: Utilization vs. Requests (and Why It Fails)</h3><p>In pre-container environments, we monitored utilization to catch resource starvation. For example, if a VM&#8217;s CPU was 95% utilized or its memory 90% full, that was a red flag.</p><p>Many teams initially applied a similar idea to Kubernetes by looking at a container&#8217;s <strong>usage relative to its resource requests</strong> (the amount of CPU or memory it &#8220;requested&#8221; when scheduled). The assumption was: if a pod&#8217;s CPU usage is near or above its request, it must be at risk of starvation, and if usage is well below request, it&#8217;s safe.</p><p>This approach, however, is flawed in Kubernetes. <strong>Resource requests are not hard allocations</strong> &#8211; they are guarantees for scheduling and baseline service, not fixed ceilings.
Kubernetes uses requests to decide which node can host a pod and to ensure each pod gets its fair share when resources are contested, but a pod can use <em>more</em> CPU than requested if the node has spare capacity. Similarly, a pod might have low CPU usage relative to its request, yet still encounter contention if other pods compete for CPU.</p><p>In other words, comparing usage to requests is apples-to-oranges: requests are a scheduling construct, while usage is actual consumption.</p><p><strong>Example:</strong> Imagine a pod requests 1 CPU but at runtime it&#8217;s using 1.5 CPUs on average. In a VM world this would be &#8220;150% utilization&#8221; (impossible on a fixed 1 CPU allocation), but on Kubernetes this scenario can happen if the node has idle CPU cycles. The pod simply borrows CPU above its request since no one else is using it. Naively, an SRE might see 150% of request and panic. But if the node isn&#8217;t fully utilized, this isn&#8217;t actually a problem! The pod isn&#8217;t starved at all; it&#8217;s benefiting from extra headroom. Kubernetes explicitly allows this: "As long as the node isn't maxed out, pod B can use whatever extra CPU is free... it won't interfere with pod A's fair share. That's the whole point of CPU requests &#8211; they give you a floor (guarantee)."</p><p>On the other hand, consider a pod that requests 1 CPU, but is only using 0.5 CPU most of the time. One might think it&#8217;s safe because it&#8217;s under its request. But if the node is fully booked with other pods and this pod occasionally needs more CPU (say bursts to 1 CPU), it <em>will</em> get at least 1 CPU (its full request) if it needs it &#8211; that&#8217;s guaranteed. However, if it needed <em>more</em> than 1 CPU (beyond its request) at a time when the node is busy, it might experience delays. 
Traditional monitoring wouldn&#8217;t flag this at all because usage hasn&#8217;t hit any static threshold relative to request or capacity.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!g-Yz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce21c391-e9de-4b3f-9a39-32cbc5879d94_902x902.png" alt=""/></figure>
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In short, utilization vs. request is a poor indicator of actual distress in Kubernetes. A pod can be using 200% of its requested CPU and be perfectly healthy if the node has spare capacity, or it can be well below 100% of its request and still suffer if the node CPU is fully contended (or if it&#8217;s artificially capped by other means). The old model &#8220;high utilization = bad&#8221; doesn&#8217;t directly translate when resources are elastic.</p><h3>Why It Made Sense on VMs (Fixed Quota) but Not on Kubernetes</h3><p>It&#8217;s worth highlighting why this confusion exists.</p><ul><li><p><strong>On a VM or physical machine:</strong> Your CPU and memory allocations are basically fixed. If you have 4 vCPUs, 100% usage means all 4 are busy. If you have 8 GB of RAM, using 7.5 GB means you&#8217;re about to run out. There&#8217;s a fixed ceiling, so usage as a fraction of that ceiling is a meaningful metric.</p></li><li><p><strong>In Kubernetes:</strong> A container&#8217;s &#8220;ceiling&#8221; is not always fixed at its request. If no explicit limit is set, the true ceiling is the node&#8217;s capacity (or remaining capacity), which is often much higher than the request. A container&#8217;s resource usage can go beyond what it requested (temporary boost) or can be constrained by overall node conditions even <em>before</em> hitting its request (if other pods demand their share).</p></li></ul><p>Thus, the ratio usage/request can be very misleading. High usage/request doesn&#8217;t necessarily mean trouble (could be just opportunistic usage), while low usage/request doesn&#8217;t guarantee no contention.</p><p>Many Kubernetes monitoring dashboards still show &#8220;CPU utilization vs. requests&#8221; or &#8220;Memory usage vs. requests&#8221; for pods or deployments. These can be useful for <strong>capacity planning or right-sizing</strong> (e.g., to see if requests are far too high or too low relative to actual usage over time). 
<h3>Monitoring Against Limits: A Slightly Better Approach</h3><p>Realizing the pitfalls of using requests as the yardstick, many teams shifted to monitoring <strong>resource limits</strong> instead. Kubernetes allows setting resources.limits for CPU and memory, which are hard constraints: a container cannot exceed its CPU limit (it will be <strong>throttled</strong>) and cannot exceed its memory limit (it will be <strong>OOM-killed</strong> if it tries).</p><p>Intuitively, monitoring usage against these hard limits makes more sense:</p><ul><li><p>If a container is close to 100% of its memory limit, it&#8217;s in danger of OOM.</p></li><li><p>If a container&#8217;s CPU usage is hitting 100% of its CPU limit, it means it&#8217;s fully utilizing its allowed CPU and could be throttled.</p></li></ul><p><strong>Memory limits</strong> in particular demand close attention. Unlike CPU, memory is not a &#8220;compressible&#8221; resource &#8211; if you run out of memory, the kernel cannot just slow things down; something has to give (usually the process gets killed). "Memory is different because it is non-compressible &#8211; once you give memory you can't take it away without killing the process." For this reason, best practice is to <strong>always set memory limits</strong> on pods, and monitor whether memory usage approaches those limits. A container at 95% of its memory limit is one allocation away from an OOM kill. So monitoring memory usage vs. limits (and receiving alerts before it hits 100%) is essential.</p>
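<p>A matching sketch for memory, under the same kube-state-metrics assumption; container_memory_working_set_bytes is the figure the kubelet watches when making eviction decisions:</p><pre><code>
# Fraction of the memory limit in use (0.95 = one allocation away from an OOM kill)
sum by (namespace, pod, container) (
  container_memory_working_set_bytes{container!="", container!="POD"}
)
/ on (namespace, pod, container)
sum by (namespace, pod, container) (
  kube_pod_container_resource_limits{resource="memory"}
)
</code></pre><p>Alerting when this ratio stays above roughly 0.9 gives you the &#8220;before it hits 100%&#8221; warning described above.</p>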
<p>For <strong>CPU limits</strong>, if set, a container&#8217;s CPU usage sitting at 100% of its limit is a sign it wants more CPU but is not allowed to have it. Hitting a CPU limit won&#8217;t kill the container &#8211; instead, the Linux kernel will <strong>throttle</strong> the container&#8217;s CPU cycles to enforce the limit. Throttling means the container&#8217;s processes are made to wait, even if the CPU is idle, until the next time slice &#8211; effectively capping its CPU usage to the limit over time. If you monitor a container and see its CPU usage flatlined at its limit (say, constantly using 1 core when the limit is 1 core), that likely means the container could use more CPU if it were available. In other words, it&#8217;s potentially CPU-starved (constrained by the limit).</p><p>An even clearer indicator is to monitor the <strong>CPU throttling metrics</strong> that cAdvisor exposes when limits are in place. For example, cAdvisor tracks container_cpu_cfs_throttled_seconds_total (cumulative seconds a container was throttled) and the number of throttling occurrences. By checking the rate of increase of this metric, you can tell whether the container is actively being throttled by the CPU quota. A high throttling rate means the container hit its CPU limit frequently. Monitoring throttling metrics captures scenarios where average CPU usage is low but brief bursts above the limit cause throttling.</p>
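<p>In PromQL, a sketch of both throttling views (these are the standard cAdvisor metric names, though label sets vary by setup):</p><pre><code>
# Seconds of throttling per second (0.1 = throttled 10% of wall-clock time)
rate(container_cpu_cfs_throttled_seconds_total{container!=""}[5m])

# Fraction of CFS periods in which the container hit its quota
rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])
  /
rate(container_cpu_cfs_periods_total{container!=""}[5m])
</code></pre>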
<p>Overall, watching memory usage vs. memory limits and CPU usage vs. CPU limits (or throttle metrics) is more aligned with real resource risks:</p><ul><li><p>If memory usage is near the limit, the pod is at risk of OOM kill &#8211; a critical condition.</p></li><li><p>If CPU usage hits the limit and throttling occurs, the pod&#8217;s performance is being artificially constrained by its quota.</p></li></ul><p>This approach reduces false alarms compared to the naive utilization-vs-request method. You won&#8217;t alert on a pod using 150% of its request if it still hasn&#8217;t hit any limit. Instead, you&#8217;d alert when it actually hits a ceiling (limit) or gets throttled. It&#8217;s a step in the right direction.</p><p>However, there are two big caveats:</p><ol><li><p>Not everyone sets CPU limits (in fact, as we&#8217;ll discuss next, setting CPU limits can be counterproductive).</p></li><li><p>Even with limits, these signals don&#8217;t tell the whole story of <em>why</em> the pod is constrained, or whether it&#8217;s a true contention issue or just a mis-configured limit.</p></li></ol><p>If you follow modern best practices, you might only set memory limits and not CPU limits on your pods. In that case, CPU usage has no defined hard limit to compare against &#8211; a pod can use all the CPU it can get on the node. You&#8217;re back to square one for CPU: how do you detect CPU contention without a limit? Monitoring raw CPU usage alone still isn&#8217;t sufficient, because a pod could be slowed down by competition with other pods even if it has no fixed limit.</p><p>Secondly, even when CPU limits are used, you might want to detect contention <em>before</em> a pod is being throttled at 100% of its limit. For example, a pod might be using 80% of its limit while the node is completely busy; it might not be throttled yet, but it could still be experiencing latency due to high CPU demand on the node. Pure usage metrics won&#8217;t flag that.</p><p>The bottom line: monitoring limits is better than nothing &#8211; especially for memory &#8211; but it&#8217;s a reactive measure and can miss subtler forms of contention. We need a way to directly measure <strong>&#8220;how hard is the workload trying to use resources and being held back,&#8221;</strong> whether by limits or by competition with others.</p><p>Enter Linux&#8217;s CPU scheduler behavior, and why many recommend removing CPU limits entirely in favor of a different approach.</p><h3>The Case Against CPU Limits (and How Kubernetes Schedules CPU Fairly Without Them)</h3><p>If monitoring CPU limits and throttling is an improvement, an even more radical improvement is to avoid CPU limits altogether. This might sound counterintuitive &#8211; if you don&#8217;t limit CPU, won&#8217;t pods just contend uncontrolled? But Kubernetes (and Linux) have a built-in mechanism to handle CPU contention: <strong>CFS CPU shares</strong> based on the pod&#8217;s CPU requests (also known as CPU weight).</p><p>Many experts argue that setting CPU limits causes more harm than good in Kubernetes, and that you can rely on requests and the kernel scheduler for fair sharing. Let&#8217;s break down why CPU limits can be harmful:</p><ul><li><p><strong>They restrict natural bursting, even when resources are idle.</strong> A container with a CPU limit cannot exceed that limit, no matter what. If the node has idle CPU cycles, a container without a limit could have used those cycles to handle a spike in work, then dropped back down. With a limit, those idle cycles go unused while the container threads sit idle waiting for the next time slice. In effect, &#8220;resources are available but you aren&#8217;t allowed to use them.&#8221; This is wasted potential and can degrade application performance. Why slow down your app just to keep CPU idle?</p></li><li><p><strong>They can cause complex throttling behavior.</strong> When a container hits its quota early in a scheduling period, the kernel will throttle it for the remainder of that period. This can introduce latency spikes. The throttling isn&#8217;t smooth; it literally pauses the container&#8217;s threads. If your application is latency-sensitive, CPU quotas can produce irregular delays that are hard to predict or tune.</p></li><li><p><strong>They are often unnecessary for fairness.</strong> The typical reason people set CPU limits is to prevent one pod from hogging the CPU and starving others (the &#8220;noisy neighbor&#8221; problem). But Kubernetes already has a solution for this: CPU requests translate to CFS weights. The Linux <strong>Completely Fair Scheduler</strong> distributes CPU time according to these weights when there&#8217;s contention. If two pods contend for CPU, each gets a share proportional to its weight (derived from its CPU request). For instance, if Pod A requests 1500 millicores and Pod B requests 1000 millicores, A will get 60% and B 40% of CPU time under contention (see the sketch after this list). It doesn&#8217;t matter if Pod B tries to use more; it will only get spare cycles beyond its share if A isn&#8217;t using its full request. In other words, <strong>requests give you a guaranteed floor and a fair share, without the need for hard caps.</strong> The kernel scheduler&#8217;s use of weights is well documented: &#8220;Kubernetes resources.requests.cpu translates into a weight. It&#8217;s the relative weight that matters &#8211; the ratio of one container&#8217;s request to another&#8217;s. If the node is under load, container B (with double the request of A) will get roughly twice as much CPU time as container A.&#8221; This happens automatically, no CPU limit required.</p></li><li><p><strong>CPU limits don&#8217;t affect scheduling, only runtime.</strong> A subtle point: the Kubernetes scheduler doesn&#8217;t even consider limits when placing pods, only requests. This means you could have a node where the total CPU limits of its pods exceed capacity; limits aren&#8217;t used for admission control. Their only function is to throttle at runtime. If you&#8217;ve already ensured via requests that the node won&#8217;t be overloaded (the scheduler won&#8217;t place more total requested CPU than the node has), then limits are mostly redundant for preventing overload.</p></li></ul>
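<p>To make the 60/40 arithmetic concrete, here is a back-of-the-envelope sketch. The conversion shown is the cgroup v1 rule, and the cgroup path is hypothetical &#8211; exact paths vary by cgroup version and driver:</p><pre><code>
# cgroup v1: cpu.shares = milliCPU * 1024 / 1000
#   Pod A: 1500m -&gt; 1536 shares
#   Pod B: 1000m -&gt; 1024 shares
# Under contention: A gets 1536 / (1536 + 1024) = 60%, B gets 40%.

# Inspecting a running pod's weight from the node (illustrative cgroup v2 path):
cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod&lt;uid&gt;.slice/cpu.weight
</code></pre>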
<p>Because of these reasons, many in the Kubernetes community advocate <strong>not using CPU limits at all for most workloads.</strong> If every pod has an appropriate CPU request, then no pod can starve another of its guaranteed share. Any pod can still burst above its request if extra CPU is available, which improves utilization and performance. And if two pods both want more than their share, they&#8217;ll be limited by the CFS weighting &#8211; effectively, each is &#8220;throttled&#8221; only by the fact that the other exists and has a claim, not by an arbitrary cap. It&#8217;s a more organic form of throttling based on competition, not a static limit.</p><p>To illustrate, consider a scenario: two pods (Pod A and Pod B) share a node. If there are no limits but each has a request (say A requests 1 CPU, B requests 1 CPU on a 2-CPU node), then if B suddenly needs more CPU and A doesn&#8217;t need all of its share, B can temporarily use 1.5 CPUs while A uses 0.5. A can still get its full 1 CPU whenever it needs it (that much is reserved), and B just opportunistically uses the slack. Both live. If, instead, we imposed a limit equal to their request (1 CPU each), then even if A were idle, B could not exceed 1 CPU &#8211; it would be stuck waiting while that extra CPU stays idle. That&#8217;s exactly what we want to avoid.</p><p>The modern best practice is: <strong>use CPU requests for all pods</strong> (and make them as accurate as possible), but set <strong>no CPU limits in most cases.</strong> The only exceptions might be certain workloads that internally adjust to a given CPU limit, or multi-tenant clusters where you absolutely need to cap usage of untrusted workloads. But for typical microservices in a controlled cluster, CPU limits often do more harm than good.</p>
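<p>A minimal pod spec following this practice might look like the following (names and values are illustrative):</p><pre><code>
apiVersion: v1
kind: Pod
metadata:
  name: my-app                              # illustrative
spec:
  containers:
  - name: my-app-container
    image: registry.example.com/my-app:1.0  # illustrative
    resources:
      requests:
        cpu: "500m"      # fair-share weight + scheduling guarantee
        memory: "512Mi"
      limits:
        memory: "512Mi"  # memory is non-compressible: always cap it
        # deliberately no cpu limit: the container may burst into idle CPU
</code></pre>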
<p>If you adopt this approach (no CPU limits), you gain performance &#8211; pods can burst and use idle cycles &#8211; and simpler behavior. But you lose the simple signal of &#8220;CPU usage == limit&#8221; and the throttling metric for that pod, since there is no artificial throttling anymore. You need a different way to monitor when a pod is truly encountering CPU contention. After all, just because we removed the limit doesn&#8217;t mean we don&#8217;t care if the pod is getting constrained; it&#8217;s just constrained by actual contention now (other pods or node capacity), not by a configured quota.</p><p>How can we detect that scenario? This is where <strong>Pressure Stall Information (PSI)</strong> comes in as a game-changer for monitoring. It gives us direct insight into contention, regardless of whether a CPU limit is involved or not.</p><h3>The Modern Approach: Pressure Stall Information (PSI)</h3><p>Linux&#8217;s <strong>Pressure Stall Information (PSI)</strong> is a kernel feature (introduced in Linux 4.20) that provides a direct measure of resource contention. In essence, PSI metrics tell you what percentage of time tasks are <strong>stalled (waiting)</strong> due to lack of a given resource &#8211; CPU, memory, or IO.</p><p>This is exactly the signal we want for detecting resource starvation:</p><ul><li><p>If an application&#8217;s threads are frequently waiting on CPU because the CPU is busy elsewhere (or a quota throttled them), that indicates <strong>CPU pressure</strong>.</p></li><li><p>If they are waiting on memory (e.g., for memory to be freed or swapped in), that indicates <strong>memory pressure</strong>.</p></li></ul><p>PSI has been described as a &#8220;barometer&#8221; of resource pressure, providing early warning as pressure builds. Unlike raw utilization, which only shows <em>how much</em> of a resource is being used, <strong>PSI shows <em>how contended</em> that resource is</strong>, i.e., the cost (in wait time) of that contention.</p><p>To put it another way: high CPU utilization could be either because an app is happily consuming available CPU or because it&#8217;s struggling to get CPU time; PSI distinguishes these by measuring the delay. If an app is using a lot of CPU but not experiencing delays, PSI will remain low. If an app is getting delayed (runnable but not running), PSI will report a higher percentage.</p><p>Concretely, the Linux kernel exposes PSI data via files like /proc/pressure/cpu, /proc/pressure/memory, etc., and with cgroups v2, you can get PSI for specific cgroups (which is how Kubernetes can get per-container and per-pod PSI).</p>
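<p>On a PSI-enabled kernel you can inspect these files directly; the numbers below are illustrative:</p><pre><code>
$ cat /proc/pressure/cpu
some avg10=2.04 avg60=0.75 avg300=0.40 total=157622151

$ cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.00 total=34170
full avg10=0.00 avg60=0.00 avg300=0.00 total=10163
</code></pre><p>avg10, avg60, and avg300 are the stall percentages over 10-second, 60-second, and 5-minute windows; total is the cumulative stall time in microseconds.</p>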
<p>The CPU PSI metric is reported as a single number (<em>some</em> pressure); a &#8220;full&#8221; CPU stall is undefined at the system level, since the CPU is always running something. For memory and IO, PSI is reported in two flavors: <em>some</em> (at least one task stalled) and <em>full</em> (all tasks stalled, meaning a complete stall). But for most purposes, the &#8220;some&#8221; metric is the primary indicator of pressure.</p><p>What does &#8220;some CPU pressure = 20%&#8221; mean in plain terms? It means that over the time window, 20% of the time there was at least one task that wanted to run but couldn&#8217;t due to the CPU being busy. In other words, one or more threads were ready to execute but had to wait. 0% CPU pressure means no delay. 100% CPU pressure (the extreme case) would mean that at all times, something was waiting for CPU.</p><p>The beauty of PSI is that it directly measures contention <strong>as experienced by the workload.</strong> It doesn&#8217;t matter whether the contention is because of a hard limit (throttling) or because other processes are competing &#8211; if your container&#8217;s tasks had to wait, PSI captures it. Conversely, if your container is blasting CPU but never actually waits (because there was no contention), PSI stays low.</p><p>As the VictoriaMetrics team put it: &#8220;PSI tracks when tasks are delayed or stalled due to resource contention &#8211; basically when the CPU is too busy to handle everything right away... These [PSI] metrics give you a pretty direct view into how much CPU pressure your containers are dealing with &#8212; something that raw CPU usage numbers don&#8217;t always show clearly.&#8221; This is a crucial point: raw usage can&#8217;t differentiate between using 80% of CPU with no interference vs. using 80% and desperately wanting 100%. PSI can.</p><h3>PSI in Kubernetes: Getting the Data</h3><p>Initially, PSI was only available by manually checking the host or cgroup files, but it has since been integrated into Kubernetes&#8217; monitoring pipeline. Recent versions of cAdvisor (and the Kubernetes summary API) now expose PSI metrics for each container, pod, and node.</p><p>As of this writing, this is typically an alpha feature &#8211; you may need to enable the KubeletPSI feature gate and be running on a Linux kernel that supports cgroup v2 and PSI (kernel 4.20+ with cgroup2). But assuming those requirements are met, you&#8217;ll have new metrics available in the kubelet&#8217;s /metrics/cadvisor endpoint.</p><p>The key PSI metrics for containers exposed via cAdvisor are typically named:</p><ul><li><p>container_pressure_cpu_waiting_seconds_total: total time tasks in the container have been delayed waiting for CPU (corresponds to the PSI &#8220;some&#8221; CPU counter). &#8220;Waiting&#8221; here means at least one task waiting.</p></li><li><p>container_pressure_cpu_stalled_seconds_total: total time <em>all</em> tasks in the container were stalled due to CPU (CPU &#8220;full&#8221;, less commonly used for CPU).</p></li></ul><p>Similarly, you&#8217;ll find:</p><ul><li><p>container_pressure_memory_waiting_seconds_total and ...memory_stalled_seconds_total for memory pressure (some vs. full).</p></li><li><p>container_pressure_io_waiting_seconds_total and ...io_stalled_seconds_total for IO pressure.</p></li></ul><p>These metrics accumulate time (in seconds) that tasks were stalled. To get a <strong>current pressure percentage</strong> over a time interval, you take a rate of these counters. We&#8217;ll demonstrate that in the next section with queries.</p><p>The key thing is: we now have a direct gauge of resource contention for each container/pod. We no longer have to infer it indirectly from usage vs. requests or throttling metrics. We can literally see &#8220;this container spent X% of the last 5 minutes waiting on CPU&#8221;. That is gold from an SRE perspective &#8211; it answers &#8220;is my app suffering from lack of CPU?&#8221; with a concrete measure.</p><h3>Unlocking Insights: How PSI Reveals Real Contention</h3><p>Let&#8217;s discuss a few scenarios to illustrate how PSI shines, highlighting exactly the cases mentioned earlier:</p><ol><li><p><strong>Pod is throttled (CPU limit):</strong> Suppose you still have a CPU limit on a pod, and the pod is frequently hitting that limit. Each time it hits the limit, the kernel throttles it (makes it wait until the next period). From the pod&#8217;s perspective, its processes were ready to run but got halted &#8211; a classic CPU stall. <strong>PSI will register this:</strong> during those throttle intervals, tasks were waiting for CPU even though the CPU might have been idle otherwise (it&#8217;s a forced wait). Therefore, the container&#8217;s cpu_waiting PSI goes up. If you see, say, 10% CPU pressure on a container that correlates with it running at its exact CPU limit, that indicates it spent 10% of its time throttled due to the limit. In older monitoring, you might have noticed high throttle counts; with PSI, you see the <em>impact</em> of that throttling as a percentage of lost time. This is a more intuitive measure (&#8220;10% CPU starvation&#8221;) than raw counts. The advantage is that PSI doesn&#8217;t require any special case &#8211; it doesn&#8217;t matter that the wait was self-inflicted by a limit; it will still show up.</p></li><li><p><strong>Pod exceeds its request and the node is at capacity (genuine CPU contention):</strong> Now consider a pod with no CPU limit. It has a request of 1 CPU but can use more if available. It starts using 2 CPUs because demand increased. If the node has at least 2 CPUs idle, it will get 2 CPUs &#8211; no contention, no pressure. But if the node only had 1.5 CPUs free beyond what others are using, then the pod will be competing with others for CPU time. The Linux scheduler will give it its fair share (~1 CPU worth plus some fraction of the extra), but not the full 2 CPUs it wants. The pod will effectively be running below the level it would like to (it has threads that could run more, but they must wait their turn). In this scenario, even though there&#8217;s no explicit limit, the pod is experiencing CPU starvation due to node limits and competition. How do we detect it? 
CPU usage of that pod might show something like 1.5 CPUs of usage (above its request of 1, which might or might not alert someone). But <strong>PSI will clearly show something like, for example, 25% CPU wait</strong>, meaning that for 25% of the time, the pod had tasks waiting on CPU because the node was fully busy. That directly quantifies the contention. In other words, whenever a pod is unable to run because other pods (or overall load) are using the CPU, CPU PSI rises. This is exactly when SREs need to know &#8211; the pod could benefit from more CPU (or a higher request, or moving to a less busy node, or scaling out). Traditional metrics couldn&#8217;t isolate this condition well.</p></li><li><p><strong>Pod exceeds its request but node has available capacity (no contention):</strong> This is the flip side and addresses the false-positives issue. A pod might be using more CPU than it requested (say 200% of its request), but if the node has idle cores, this is not a problem &#8211; the pod isn&#8217;t depriving anyone and isn&#8217;t waiting for CPU. Old-school monitoring might wrongly flag this as an issue. But <strong>PSI will be near 0% in this case</strong>, because from the pod&#8217;s view, it got all the CPU time it wanted with no delays. No waiting, no pressure. This is a beautiful example of PSI avoiding a false alert. The SRE can confidently ignore high usage if pressure remains low &#8211; it means the high usage is simply opportunistic consumption of idle resources, not contention. By focusing on pressure, you don&#8217;t cry wolf when a pod is just efficiently using available headroom.</p></li></ol><p>To sum up, PSI aligns alerts with actual performance-impacting events. High CPU PSI means the app experienced CPU wait time (it was ready to do work but had to wait). High memory PSI means the app was stalled due to memory. If these metrics are low, it means lack of resources is not significantly slowing the app, regardless of how high the utilization numbers might be.</p><p>Memory PSI is also extremely useful. Memory contention in Kubernetes typically leads to OOM kills. Memory PSI can show that an application is spending time waiting for memory (e.g., garbage collection hitting heavy page faults). If memory PSI for a container is significant, that&#8217;s a red flag that even if it hasn&#8217;t been OOM-killed yet, it&#8217;s suffering and could benefit from more memory or optimizations. In the past, one might only notice memory issues after an OOM kill event. PSI gives a window into the &#8220;gray zone&#8221; of memory pressure before a fatal event.</p><p>In summary, PSI metrics let you detect real resource starvation conditions in Kubernetes &#8211; whether due to CPU limits, CPU competition, or a memory crunch &#8211; without getting confused by usage patterns that aren&#8217;t actually problematic. This makes them a powerful addition to the monitoring arsenal for Kubernetes SREs.</p><h3>Putting PSI to Work: Practical Monitoring Examples</h3><p>Now that we have these PSI metrics, how do we use them? In most setups, you&#8217;ll be scraping the kubelet/cAdvisor metrics with Prometheus (or another monitoring system). Assuming container_pressure_cpu_waiting_seconds_total and friends are being collected, here are some sample queries and techniques using PromQL:</p><p><strong>1. 
Calculate CPU pressure percent for a container or pod:</strong></p><p>Use the rate of the _waiting_seconds_total counter over a window, and multiply by 100 for a percentage.</p><pre><code><code>100 * rate(container_pressure_cpu_waiting_seconds_total{namespace="my-namespace", pod="my-pod", container="my-app-container"}[5m])</code></code></pre><p>This yields the percentage of time over the last 5 minutes that at least one task in the specified container was waiting for CPU. If this value is 30, it means 30% of the time the container was CPU-starved. (Adjust labels to match your metrics setup; filter out container="POD" if needed.)</p><p><strong>2. Alert on high CPU pressure:</strong></p><p>Set up an alert like: &#8220;CPU pressure &gt; X% for Y minutes&#8221;.</p><pre><code><code>rate(container_pressure_cpu_waiting_seconds_total{namespace!~"kube.*"}[5m]) &gt; 0.20</code></code></pre><p>This checks for &gt;20% CPU pressure over 5 minutes for any container outside the kube-* namespaces (note the raw rate is a fraction, so 0.20 corresponds to 20%). Choose a threshold that makes sense &#8211; even a consistently small non-zero value might be worth investigating, but 10&#8211;20% is often a good starting point to avoid noise. This alert says &#8220;this container spent more than 20% of the last 5 minutes waiting on CPU &#8211; it&#8217;s likely CPU-starved.&#8221;</p><p><strong>3. Memory pressure monitoring:</strong></p><p>Similarly, use container_pressure_memory_waiting_seconds_total.</p><pre><code><code>100 * rate(container_pressure_memory_waiting_seconds_total{namespace="my-namespace", pod="my-pod", container="my-app-container"}[5m])</code></code></pre><p>This gives the percent of time the container was stalled due to memory. Ideally this is 0%. Any sustained non-zero memory pressure indicates the app is experiencing memory contention (e.g., the kernel is frequently reclaiming pages, or it&#8217;s on the verge of OOM). You might alert if this goes above, say, 5% for some time, because significant memory stall can degrade performance badly.</p><p><strong>4. Node-level pressure:</strong></p><p>Check overall node pressure by looking at the metrics for the root cgroup (usually identified by a specific label combination like id="/", container="", and pod="").</p><pre><code><code>100 * rate(container_pressure_cpu_waiting_seconds_total{id="/"}[5m])</code></code></pre><p>This query (adjusting labels as needed for your Prometheus setup) gives the overall CPU pressure for the entire node. If this is high, it means the node is collectively overcommitted on CPU. This can be used to drive node-level auto-scaling or just to monitor overall health.</p><p><strong>5. Identify top contended pods:</strong></p><p>Find which pods have the highest CPU pressure using topk.</p><pre><code><code>topk(5, 100 * rate(container_pressure_cpu_waiting_seconds_total{namespace="my-namespace", container!="POD"}[5m]))</code></code></pre><p>This lists the top 5 containers (excluding pause containers) in my-namespace by CPU pressure percentage over the last 5 minutes. This is great for troubleshooting: it directly surfaces &#8220;who is suffering from CPU contention the most.&#8221;</p><p><strong>6. Combine with usage for context:</strong></p><p>PSI is best used alongside traditional metrics. Create a dashboard showing:</p><ul><li><p>CPU Usage (millicores)</p></li><li><p>CPU Pressure (%)</p></li><li><p>Memory Usage (bytes)</p></li><li><p>Memory Pressure (%)</p></li></ul><p>side by side for each pod/container. 
This way you can differentiate:</p><ul><li><p><strong>High usage + Low pressure:</strong> Healthy, high throughput, efficiently using resources. Good!</p></li><li><p><strong>Lower usage + High pressure:</strong> The application is likely being throttled or contended; performance is likely degraded. Needs investigation/more resources.</p></li><li><p><strong>High usage + High pressure:</strong> The application is very busy and hitting contention. It could potentially use more resources, or it needs optimization.</p></li></ul>
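<p>Taken together, the queries above translate naturally into alerting rules. A sketch of a Prometheus rule file follows; thresholds, durations, and labels are illustrative and should be tuned to your environment:</p><pre><code>
groups:
- name: psi-alerts
  rules:
  - alert: ContainerCPUPressureHigh
    # &gt;20% of time with at least one runnable task waiting on CPU
    expr: rate(container_pressure_cpu_waiting_seconds_total{container!="", container!="POD"}[5m]) &gt; 0.20
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "CPU pressure high for {{ $labels.namespace }}/{{ $labels.pod }}"
  - alert: ContainerMemoryPressureDetected
    # any sustained memory stall deserves a look
    expr: rate(container_pressure_memory_waiting_seconds_total{container!="", container!="POD"}[5m]) &gt; 0.05
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Memory pressure detected for {{ $labels.namespace }}/{{ $labels.pod }}"
</code></pre>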
<p>If request latency spikes alongside high CPU pressure, it confirms the application was delayed by CPU availability. If latency spikes but CPU pressure is zero, the cause lies elsewhere.</p><p>Remember to ensure your cluster setup provides these metrics. Check your Kubernetes version, cAdvisor configuration, and monitoring agent setup. PSI metrics are gaining adoption but might require explicit configuration depending on your environment.</p><h3>Conclusion: Out with the Old, In with the New (Monitoring)</h3><p>The world of Kubernetes resource management requires rethinking old monitoring habits. Historically, we obsessed over utilization percentages and compared usage to static allocations. In Kubernetes, that paradigm is outdated. A pod running at 95% of its requested CPU might be absolutely fine, while another at 50% could be suffering &#8211; without the right insight, you wouldn&#8217;t know.</p><p>We saw that monitoring against resource limits is a step closer to reality, especially for memory and for detecting explicit CPU throttling, but even that has limitations, particularly as best practices shift toward minimal use of CPU limits.</p><p>By leveraging <strong>PSI metrics</strong>, we align our monitoring with what actually matters: whether workloads are delayed due to resource contention. This gives SREs and engineers a much clearer signal amidst the noise. No more guessing or second-guessing based on indirect metrics &#8211; PSI tells it like it is.</p><p>To be opinionated: the traditional model of looking at utilization or usage vs. requests in Kubernetes is not just misleading, it&#8217;s antiquated. In an environment where resource allocations are fluid and &#8220;100% usage&#8221; has no fixed meaning, clinging to those old metrics can lead to bad decisions (throttling workloads unnecessarily, or not noticing when something is starving).</p><p>Modern Kubernetes operations should adopt a <strong>contention-first monitoring mindset</strong> using PSI. Here are the key takeaways:</p><ul><li><p><strong>Always set memory requests and limits</strong> and monitor usage against limits. Use <strong>memory PSI</strong> to catch pressure early.</p></li><li><p><strong>Set CPU requests</strong> for all containers to ensure fair scheduling and capacity planning.</p></li><li><p><strong>Avoid CPU limits for most workloads.</strong> Let pods burst and trust Kubernetes/Linux to share CPU via CFS weights.</p></li><li><p><strong>Monitor CPU contention directly with PSI metrics</strong> rather than naive utilization. High CPU PSI is a clear signal of starvation; low PSI indicates resource availability.</p></li><li><p>Use PSI alongside other metrics for full context (e.g., correlate with latency or traditional usage).</p></li><li><p>Monitor <strong>node-level PSI</strong> to understand overall cluster saturation.</p></li></ul><p>The Kubernetes ecosystem is recognizing the value of PSI. It&#8217;s making its way into upstream features and recommendations. By incorporating PSI into your monitoring dashboards and alerts, you&#8217;ll have a much sharper understanding of your clusters&#8217; performance. You&#8217;ll reduce noise (no more false alarms for benign high usage) and catch true issues faster (seeing actual contention as it develops).</p><p>In Kubernetes, &#8220;not all high utilization is created equal,&#8221; and PSI is the lens that shows the difference. As engineers and SREs, embracing this new approach will let us focus our optimizations and firefighting where it truly matters. It&#8217;s time to retire the old metrics (or at least deprioritize them) and adopt a contention-first monitoring mindset. Your pods will thank you by doing their work without waiting in line (and your pager will thank you for the quieter nights!).</p><p>No single metric is a silver bullet, but in the realm of resource monitoring, PSI is a huge leap forward. Combined with good resource request hygiene and sensible limits (or the lack thereof), it forms the core of a modern, accurate picture of Kubernetes performance. The old utilization metrics served us well in the VM era, but Kubernetes demands a more nuanced view &#8211; and we now have the tools to achieve it.</p>]]></content:encoded></item><item><title><![CDATA[Are You Getting the Most Out of Your Cloud Network?]]></title><description><![CDATA[Pushing the Limits in a Managed Environment]]></description><link>https://blog.zmalik.dev/p/get-most-out-of-cloud-network</link><guid isPermaLink="false">https://blog.zmalik.dev/p/get-most-out-of-cloud-network</guid><dc:creator><![CDATA[Zain]]></dc:creator><pubDate>Tue, 09 Jul 2024 14:00:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1cb35a8-8de9-49d6-9e83-d105257b6b86_1450x1168.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As users of managed Kubernetes clusters (such as AKS, EKS, or GKE), we often focus on deploying and managing our applications at a higher level, relying on the underlying infrastructure to handle the low-level details. However, there is a world of optimization opportunities beneath the surface, one that is often left to the mercy of default configurations. In this blog post, we'll embark on a journey into the depths of kernel tuning, shedding light on how adjusting kernel-level settings and understanding low-level concepts can significantly enhance the performance of your infrastructure.</p><p>One critical aspect of this optimization involves recognizing that transitioning to a more powerful VM or a specialized machine doesn&#8217;t automatically alter the kernel&#8217;s default settings. 
Because these defaults are generally optimized for smaller instances, they can limit the performance capabilities of larger nodes.</p>
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Although we are using Azure as an example, the concepts discussed here can be applied to most cloud providers.<br><br>Before we dive into the specifics of kernel tuning, it's important to note that we will be using Ubuntu as our example throughout this blog post. However, the concepts and techniques discussed here can be applied to other Linux distributions with similar kernel architectures.</p><h2>Journey from network device to the application</h2><p>Let's explore the optimization opportunities for the journey of a network packet from the NIC to a Linux process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3tIl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1cb35a8-8de9-49d6-9e83-d105257b6b86_1450x1168.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3tIl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1cb35a8-8de9-49d6-9e83-d105257b6b86_1450x1168.heic 424w, https://substackcdn.com/image/fetch/$s_!3tIl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1cb35a8-8de9-49d6-9e83-d105257b6b86_1450x1168.heic 848w, https://substackcdn.com/image/fetch/$s_!3tIl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1cb35a8-8de9-49d6-9e83-d105257b6b86_1450x1168.heic 1272w, https://substackcdn.com/image/fetch/$s_!3tIl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1cb35a8-8de9-49d6-9e83-d105257b6b86_1450x1168.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3tIl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1cb35a8-8de9-49d6-9e83-d105257b6b86_1450x1168.heic" width="1450" height="1168" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c1cb35a8-8de9-49d6-9e83-d105257b6b86_1450x1168.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1168,&quot;width&quot;:1450,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:94998,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3tIl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1cb35a8-8de9-49d6-9e83-d105257b6b86_1450x1168.heic 424w, https://substackcdn.com/image/fetch/$s_!3tIl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1cb35a8-8de9-49d6-9e83-d105257b6b86_1450x1168.heic 848w, https://substackcdn.com/image/fetch/$s_!3tIl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1cb35a8-8de9-49d6-9e83-d105257b6b86_1450x1168.heic 1272w, https://substackcdn.com/image/fetch/$s_!3tIl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1cb35a8-8de9-49d6-9e83-d105257b6b86_1450x1168.heic 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The green-highlighted steps in the flowchart illustrate the areas we will target. </p><h3><strong>Optimizing the Ring Buffer for High Network Traffic Spikes</strong></h3><p>When dealing with high network traffic spikes, one of the critical areas to focus on is the ring buffer. The ring buffer is a fixed-size buffer that temporarily holds incoming packets before they are processed by the kernel. If the buffer is too small, it can easily fill up during traffic surges, resulting in packet loss. 
By tuning the ring buffer size, we can mitigate this issue and improve network efficiency.</p><p>Let&#8217;s first check the defaults we have in our AKS cluster.</p>
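<p>The defaults can be read with ethtool. The output below is illustrative of what we saw; your device name and values will differ:</p><pre><code>
$ ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:             8192
RX Mini:        n/a
RX Jumbo:       n/a
TX:             8192
Current hardware settings:
RX:             1024
RX Mini:        n/a
RX Jumbo:       n/a
TX:             1024
</code></pre>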
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So we have a default of 1024 and maximum allowed values in our instance type is 8192</p><p>Which means depending on the nature of our network load we are not leveraging the maximum allowed values. In our case we were seeing a network device packet drop across nodes which had high network usage. Usually we just blamed it to high network usage.</p><p>After tuning the value to a higher value, we saw a dramatic change in packet drop.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aBX8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2083d985-7ef3-449a-9e1a-2acd9e583d5f_2526x826.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aBX8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2083d985-7ef3-449a-9e1a-2acd9e583d5f_2526x826.png 424w, https://substackcdn.com/image/fetch/$s_!aBX8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2083d985-7ef3-449a-9e1a-2acd9e583d5f_2526x826.png 848w, https://substackcdn.com/image/fetch/$s_!aBX8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2083d985-7ef3-449a-9e1a-2acd9e583d5f_2526x826.png 1272w, https://substackcdn.com/image/fetch/$s_!aBX8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2083d985-7ef3-449a-9e1a-2acd9e583d5f_2526x826.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aBX8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2083d985-7ef3-449a-9e1a-2acd9e583d5f_2526x826.png" width="1456" height="476" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2083d985-7ef3-449a-9e1a-2acd9e583d5f_2526x826.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:476,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1801133,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aBX8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2083d985-7ef3-449a-9e1a-2acd9e583d5f_2526x826.png 424w, https://substackcdn.com/image/fetch/$s_!aBX8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2083d985-7ef3-449a-9e1a-2acd9e583d5f_2526x826.png 848w, https://substackcdn.com/image/fetch/$s_!aBX8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2083d985-7ef3-449a-9e1a-2acd9e583d5f_2526x826.png 1272w, https://substackcdn.com/image/fetch/$s_!aBX8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2083d985-7ef3-449a-9e1a-2acd9e583d5f_2526x826.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h3>Distributing Network Processing Load Across CPU Cores</h3><p>In a high-performance environment, ensuring that network processing is efficiently distributed across multiple CPU cores is crucial. Without proper distribution, some CPU cores may become overutilized while others remain underutilized, leading to suboptimal performance and increased latency. 
Three key features can help address this issue: Receive Side Scaling (RSS), Receive Packet Steering (RPS), and Receive Flow Steering (RFS).</p><h3>Receive Side Scaling (RSS)</h3><p>RSS works by distributing the network interrupt handling across multiple CPU cores. When a packet arrives, the NIC generates an interrupt that is handled by a specific CPU core, determined by the RSS hash.</p>
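<p>In our case, the NIC exposed 8 combined queues by default, which you can check with ethtool (illustrative output):</p><pre><code>
$ ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX:             0
TX:             0
Other:          0
Combined:       32
Current hardware settings:
RX:             0
TX:             0
Other:          0
Combined:       8
</code></pre>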
<p>This means that network traffic can be distributed across those 8 queues for both receiving and transmitting data, with each queue mapped to one of the 8 cores.<br><br>We can verify this by checking the network indirection table.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!kyV0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe05b589a-bd7e-4ede-bd67-d3f2527d4c82_1412x834.png" alt=""/></figure></div>
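<p>A sketch of how to dump the table (again assuming <code>eth0</code>):</p><pre><code># Print the RX flow hash indirection table: which queue each hash bucket maps to
ethtool -x eth0</code></pre>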
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e05b589a-bd7e-4ede-bd67-d3f2527d4c82_1412x834.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1412,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:144561,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kyV0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe05b589a-bd7e-4ede-bd67-d3f2527d4c82_1412x834.png 424w, https://substackcdn.com/image/fetch/$s_!kyV0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe05b589a-bd7e-4ede-bd67-d3f2527d4c82_1412x834.png 848w, https://substackcdn.com/image/fetch/$s_!kyV0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe05b589a-bd7e-4ede-bd67-d3f2527d4c82_1412x834.png 1272w, https://substackcdn.com/image/fetch/$s_!kyV0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe05b589a-bd7e-4ede-bd67-d3f2527d4c82_1412x834.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Our NIC hardware allows about 128 entries in the indirection table and all of those are mapped to 8 queues. But we are allowed to have as many queues as our VM cores, in this case 32. </p><pre><code><code>ethtool -L eth0 combined 32</code></code></pre><p>Now we do have those 128 entries mapped to all 32 queues.</p><p>By increasing the number of RSS queues, we can achieve several benefits</p><ul><li><p><strong>Parallel Processing:</strong> With 32 queues, there are more parallel paths for processing incoming packets. 
This can lead to higher overall network throughput, which is especially beneficial for high-bandwidth applications.</p></li><li><p><strong>Reduced Queue Depth:</strong> With more queues, each queue is shallower, so packets spend less time waiting, which speeds up processing.</p></li><li><p><strong>Faster Packet Processing:</strong> By distributing the workload more evenly across CPUs, each packet can be processed more quickly, reducing overall latency.</p></li></ul><h3>Receive Packet Steering (RPS)</h3><p>RPS distributes the processing of received packets across multiple CPUs, based on a hash of the packet flow. This can help alleviate bottlenecks on a single CPU core.<br><br>It's important to note that RPS focuses specifically on the packet processing stage, where the actual handling and manipulation of packet data occur. This is separate from the interrupt handling phase, which deals with the initial reception and queuing of incoming packets by the network interface.<br><br>Systems with a high number of CPU cores can leverage RPS to utilize their full processing potential; with a low core count the benefits are close to zero.</p><p>Most modern kernels support RPS. You can quickly verify it in the boot configuration:</p>
<pre><code>grep RPS /boot/config-$(uname -r)
CONFIG_RPS=y</code></pre><p>Let's check on any of the queues:</p>
<pre><code>cat /sys/class/net/eth0/queues/rx-0/rps_cpus
0</code></pre><p>The default value of the <code>rps_cpus</code> file is typically <code>0</code>, indicating that no CPUs are explicitly assigned for RPS and packets are processed by the CPU handling the interrupt. But we can enable it for our system.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!c_38!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989040fa-efc6-4aa2-847d-1a58cb78ea8b_1402x242.png" alt=""/></figure></div>
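<p>Enabling RPS comes down to writing a hexadecimal CPU bitmask into <code>rps_cpus</code> for every receive queue. A minimal sketch, assuming 32 cores and <code>eth0</code> (the all-ones mask is an illustrative choice):</p><pre><code># Let all 32 CPUs participate in packet processing for every RX queue
for q in /sys/class/net/eth0/queues/rx-*; do
  echo ffffffff > "$q/rps_cpus"
done</code></pre>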
<p>This way we improve network performance by balancing the load of packet processing across all CPU cores, reducing the likelihood of any single core becoming a bottleneck.</p><p>By enabling RPS we achieve:</p><ul><li><p><strong>Enhanced Load Distribution:</strong> By distributing packet processing across multiple CPUs, RPS helps balance the workload, preventing any single CPU core from becoming a bottleneck. This ensures more efficient utilization of all available CPU resources.</p></li><li><p><strong>Improved Throughput:</strong> With RPS enabled, multiple CPU cores can handle packet processing simultaneously, leading to higher overall network throughput. This is especially beneficial for systems with high network traffic.</p></li><li><p><strong>Reduced Latency:</strong> By spreading the packet processing load across multiple CPUs, RPS reduces the time packets spend waiting in the queue, leading to faster processing and lower latency.</p></li></ul><h3>Receive Flow Steering (RFS)</h3><p>Receive Flow Steering (RFS) is an enhancement to Receive Packet Steering (RPS) that improves cache locality and performance by steering packets to the CPU that is already processing the relevant flow. This keeps the CPU cache "hot" with relevant data, reducing latency and improving efficiency.</p><p>RFS maintains a flow table that maps each flow to a specific CPU. This table is populated as packets are processed.</p><p>When a packet arrives at the NIC, it triggers a hardware interrupt; the interrupt handler or NIC driver performs minimal processing and schedules a soft IRQ for the packet.</p><p>A hash is computed from the packet's 5-tuple (source IP, destination IP, source port, destination port, and protocol). This hash uniquely identifies the flow.</p><p>If the flow is found in the table, the corresponding CPU is determined. If not, a CPU is selected based on the current load or other policies, and the flow table is updated with this new entry.</p><p>The flow table provides the CPU ID that should process the packet, and the packet is enqueued for processing by that CPU.</p><p>The designated CPU processes the packet, handling the protocol stack (TCP/UDP, IP) and delivering the packet to the appropriate socket buffer in user space.</p><p>The benefits of enabling RFS include:</p><ul><li><p><strong>Improved Cache Locality:</strong> By processing packets of the same flow on the same CPU, RFS improves cache hits, reducing memory access latency.</p></li><li><p><strong>Reduced Context Switching:</strong> Minimizes the need to switch context between CPUs, leading to more efficient CPU utilization.</p></li></ul><p>If you are using <a href="https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy">CPUsets</a> in your application, you might see even greater gains in performance and efficiency due to improved CPU resource allocation and enhanced cache locality.</p>
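<p>RFS is off by default; enabling it means sizing a global flow table and a per-queue flow count. A sketch under the same 32-queue assumption (the sizes follow the common guideline of dividing the global table size by the queue count, not a tuned recommendation):</p><pre><code># Global flow table size
sysctl -w net.core.rps_sock_flow_entries=32768
# Per-queue flow count: 32768 entries / 32 queues = 1024
for q in /sys/class/net/eth0/queues/rx-*; do
  echo 1024 > "$q/rps_flow_cnt"
done</code></pre>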
<h3>SoftIRQs Processing</h3><p>We just increased the ring buffer, optimized RSS, and enabled RPS and RFS to handle scenarios where packet bursts are common. But you might still see packet drops in situations where the kernel cannot process packets quickly enough.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!J0FP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde8fc853-bc15-4585-96b4-dc50badc58cc_807x1197.heic" alt=""/></figure></div>
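<p>One quick way to confirm this kind of drop: <code>/proc/net/softnet_stat</code> has one row per CPU, and the second hexadecimal column counts packets dropped because the backlog queue overflowed.</p><pre><code># One row per CPU; a growing second column means the backlog is overflowing
cat /proc/net/softnet_stat</code></pre>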
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Increasing backlog can help in these situations to match the increased ring buffer. If the default value 1000 is not sufficient, you will require to do some iterations till you find the right value for it. </p><pre><code>sysctl -w net.core.netdev_max_backlog=$new_value</code></pre><p>Setting this value too high may not be ideal, as sometimes it&#8217;s better to drop packets. Because our VM's network specs are closely tied to the number of CPU cores, we decided to base the new value on a multiple of the core count.</p><p>Also we can choose to increase the budget, this will increase the packets we process in each soft IRQ invocation. </p><pre><code>sysctl -w net.core.netdev_budget=$new_budget_value</code></pre><p>You can optimize your system for different scenarios, whether you need high throughput or low latency. Probably default values won&#8217;t cut for both extreme situations.</p><h3>Automation @ Scale</h3><p>We won&#8217;t go in details about how to set it up the automations but lets talk about the systems used to automate these issues we mentioned. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DJtu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccc8bd34-1a7b-49da-a856-cadc228f06ea_1104x780.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DJtu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccc8bd34-1a7b-49da-a856-cadc228f06ea_1104x780.png 424w, https://substackcdn.com/image/fetch/$s_!DJtu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccc8bd34-1a7b-49da-a856-cadc228f06ea_1104x780.png 848w, https://substackcdn.com/image/fetch/$s_!DJtu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccc8bd34-1a7b-49da-a856-cadc228f06ea_1104x780.png 1272w, https://substackcdn.com/image/fetch/$s_!DJtu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccc8bd34-1a7b-49da-a856-cadc228f06ea_1104x780.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DJtu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccc8bd34-1a7b-49da-a856-cadc228f06ea_1104x780.png" width="1104" height="780" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ccc8bd34-1a7b-49da-a856-cadc228f06ea_1104x780.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:780,&quot;width&quot;:1104,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1246658,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DJtu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccc8bd34-1a7b-49da-a856-cadc228f06ea_1104x780.png 424w, https://substackcdn.com/image/fetch/$s_!DJtu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccc8bd34-1a7b-49da-a856-cadc228f06ea_1104x780.png 848w, https://substackcdn.com/image/fetch/$s_!DJtu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccc8bd34-1a7b-49da-a856-cadc228f06ea_1104x780.png 1272w, https://substackcdn.com/image/fetch/$s_!DJtu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccc8bd34-1a7b-49da-a856-cadc228f06ea_1104x780.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We are using Kubernetes and will leverage its tools to optimize our system management. We have two <a href="https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/">DaemonSets</a>. The first DaemonSet runs our custom Go code to check if each node has the correct version of a <a href="https://manpages.ubuntu.com/manpages/bionic/man1/systemd.1.html">systemD</a> unit. If the systemD unit is not running, it labels the node; if it is running, it removes the label. The second DaemonSet runs only on nodes labeled by the first one, ensuring the systemD unit is installed and started if it's not already. Finally, the systemD unit itself will configure kernel settings based on the number of CPU cores.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!THMr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74efe7b-6fae-492e-9530-f06517673905_826x1576.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!THMr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74efe7b-6fae-492e-9530-f06517673905_826x1576.png 424w, https://substackcdn.com/image/fetch/$s_!THMr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74efe7b-6fae-492e-9530-f06517673905_826x1576.png 848w, https://substackcdn.com/image/fetch/$s_!THMr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74efe7b-6fae-492e-9530-f06517673905_826x1576.png 1272w, https://substackcdn.com/image/fetch/$s_!THMr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74efe7b-6fae-492e-9530-f06517673905_826x1576.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!THMr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74efe7b-6fae-492e-9530-f06517673905_826x1576.png" width="826" height="1576" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d74efe7b-6fae-492e-9530-f06517673905_826x1576.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1576,&quot;width&quot;:826,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:102664,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!THMr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74efe7b-6fae-492e-9530-f06517673905_826x1576.png 424w, https://substackcdn.com/image/fetch/$s_!THMr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74efe7b-6fae-492e-9530-f06517673905_826x1576.png 848w, https://substackcdn.com/image/fetch/$s_!THMr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74efe7b-6fae-492e-9530-f06517673905_826x1576.png 1272w, https://substackcdn.com/image/fetch/$s_!THMr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd74efe7b-6fae-492e-9530-f06517673905_826x1576.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Summary</h2><p><code>In this blog post, we explored various kernel-level optimizations that can significantly enhance the performance of your infrastructure by focusing on network packet handling and processing. We discussed tuning the ring buffer, enabling RSS, RPS, and RFS, and adjusting SoftIRQs settings.</code></p><p><code>These optimizations can help mitigate packet loss, balance network processing loads, and improve overall efficiency.  
While this post covered several key areas, there are still other optimizations to consider, such as fine-tuning socket buffer settings and exploring more advanced kernel parameters. If you have any questions or would like to share your experiences, feel free to reach out to me directly.</p><p>Happy optimizing, and stay tuned for more deep dives into system performance enhancements!</p>]]></content:encoded></item><item><title><![CDATA[Container Network Packet Drop in AKS]]></title><description><![CDATA[Investigating and mitigating a curious case of container network packet drop in AKS (Azure Kubernetes Service)]]></description><link>https://blog.zmalik.dev/p/packet-drop</link><guid isPermaLink="false">https://blog.zmalik.dev/p/packet-drop</guid><dc:creator><![CDATA[Zain]]></dc:creator><pubDate>Mon, 25 Sep 2023 12:51:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GG2y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a676ec0-5b0b-4658-a753-0f3d8a1103c0_2104x1150.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>During a recent system outage, our Azure Kubernetes Service (AKS) clusters experienced a peculiar issue. Specifically, some containers suffered packet drops, causing network connectivity problems.</p><p>Our AKS clusters run containerized workloads, managed by Cluster API (CAPZ). Each node pool is a Virtual Machine Scale Set (VMSS), which we manage indirectly through the AKS layer.</p><p>During the outage, certain workloads on a specific node were affected. Initially, we resolved the issue by cordoning off the node and migrating the workloads. However, the problem recurred shortly after. 
The primary symptom was a significant increase in network packet drops.</p><p>For container network drops, we rely on metrics exposed by cadvisor:</p><pre><code>container_network_receive_packets_dropped_total
container_network_transmit_packets_dropped_total</code></pre><h2>Investigation</h2><p>After identifying the packet drop issue, we initiated an investigation to determine whether network throttling was occurring at the VM level, and sought the help of Azure Support for a thorough examination.</p><p>Another common symptom across all problematic nodes was a VM Freeze event, observed in the node status conditions. According to Azure documentation, a VM Freeze event can occur for a variety of reasons:</p><pre><code>The Virtual Machine is scheduled to pause for a few seconds. 
CPU and network connectivity may be suspended, but there's no 
impact on memory or open files.</code></pre><p>But we have no further visibility into the internals of an Azure VM Freeze event. The preliminary findings from Azure Support indicated no anomalies with the VM, suggesting we review any alterations in workload behavior. Concurrently, we ran iPerf tests and captured tcpdump data on our end to dig deeper into the nature of the packet drops and the network performance problems we were facing.</p><h2>Root Cause Analysis</h2><p>An intriguing observation concerned CPU utilization on the affected node: one core was pinned at 100%, while the remaining cores showed significantly lower utilization.</p><p>This second metric was coming from <a href="https://github.com/prometheus/node_exporter">node-exporter</a>:</p><pre><code>sum by (cpu) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))</code></pre><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!GG2y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a676ec0-5b0b-4658-a753-0f3d8a1103c0_2104x1150.png" alt=""/></figure></div>
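<p>On the node itself, you can see the same skew without Prometheus (assuming the sysstat package is installed):</p><pre><code># Per-CPU utilization, three 1-second samples
mpstat -P ALL 1 3</code></pre>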
<h2>perf results</h2><p>The next thing I did was run <code>perf</code> on the node to see what was keeping CPU 5 busy.</p><pre><code>perf record -C 5 -a -g -D 99 -- sleep 60
</code></pre><p>The results show that the CPU is being consumed by the <code>ksoftirqd/5</code> process:</p><pre><code>...
  Children      Self  Command          Shared Object 
+   99.40%     0.00%  ksoftirqd/5      [unknown]     
+   99.34%     0.00%  ksoftirqd/5      [unknown]     
+   99.18%     0.00%  ksoftirqd/5      [unknown]     
+   99.12%     0.00%  ksoftirqd/5      [unknown]     
...</code></pre><h2>ksoftirqd</h2><p><code>ksoftirqd</code> led me to inspect the softirqs. To do this, I had to check the interrupts on the node.</p><pre><code>cat /proc/interrupts

     CPU0  CPU1  CPU2  CPU3       CPU4       CPU5  CPU6  CPU7
  4:   0     0   538     0          0          0     0     0 
  8:   0     0     0     0          0          0     0     0 
  9:   0     0     0     0          0          0     0     0 
 24:   0     0     0   586          0   82868004     0     0 
 25: 728     0     0     0  869041985          0     0     0 
 26:   0   864     0     0          0  813776462     0     0 
 27:   0     0  1439     0  838852829          0     1     0 
 28:   0     0     0  1545          0  781818909     0     1 
 29:   1     0     0     0 1234309153          0     0     0 
 30:   0     1     0     0          0 1262389002     0     0 
 31:   0     0     1     0  853755079          0  1172     0 
 32:   0     0     0     1          0  812015919     0  1417 
</code></pre><p>We can clearly see that CPU 4 and CPU 5 are handling way more interrupts than the other CPUs.</p>
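<p>You can cross-check this from the softirq side as well: <code>/proc/softirqs</code> breaks the counters down per CPU, and the NET_RX row should show the same skew.</p><pre><code># Per-CPU softirq counters; compare the NET_RX row across CPUs
grep -E 'CPU|NET_RX' /proc/softirqs</code></pre>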
<h2>smp_affinity</h2><p>The next thing I did was check the smp_affinity of the interrupts.</p><pre><code>for i in {24..32} ; do cat /proc/irq/$i/smp_affinity; done
20
10
20
10
20
10
20
10
20
</code></pre><p>The values 20 and 10 are hexadecimal CPU masks: 0x20 means the IRQ is handled by CPU 5, and 0x10 means CPU 4. So all of these interrupts are being handled by CPU 4 and CPU 5, which explains the spike on those two cores, and in turn the packet drops in the containers running on the node.</p><p>To understand why, we need to remember how interrupts are handled in Linux.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!sTN3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde71642-469c-4d30-972a-0a8213ddc5ae_2000x1286.png" alt=""/></figure></div>
<p>In this scenario, <code>ksoftirqd/5</code> sitting at 100% CPU means CPU 5 is fully saturated handling interrupts. It cannot accommodate any further interrupt requests, so network packets get discarded: the saturation of CPU 5 is the bottleneck that manifests as packet drops.</p><p>To double-check whether this is a common configuration on Azure VMs, I checked the smp_affinity of the interrupts on another node in the same VMSS that had not had a VM Freeze event yet.</p><pre><code>for i in {24..32} ; do cat /proc/irq/$i/smp_affinity; done
40
80
10
80
04
20
08
01
02</code></pre><h2>IRQBalance</h2><p>We can see that on this node the interrupts are balanced across all the CPUs. So what is wrong with our node? Why is it not balanced?</p><p>Let's check the <code>irqbalance</code> service status:</p><pre><code>service irqbalance status
&#9679; irqbalance.service - irqbalance daemon
     Loaded: loaded 
     Active: active (running) 
       Docs: man:irqbalance(1)
</code></pre><p><code>irqbalance</code> is running. But we are definitely not seeing the interrupts distributed across all the CPUs.</p><pre><code>systemctl try-restart irqbalance
</code></pre><p>And right after restarting <code>irqbalance</code>, the IRQs were balanced across all the CPUs. The packet drops were gone, and CPU utilization was back to normal.</p><h2>Automated Mitigation</h2><p>Now that we knew what was happening with VM Freeze events and packet drops, and had a manual mitigation, it was time to automate it.</p><p>The available metrics let us dig into the number of interrupts, grouped by device or CPU for example, but there is no metric that exposes the <code>smp_affinity</code> of the interrupts.</p><p>We already had a <code>Daemonset</code> running on all the nodes, so we extended our existing compute <code>Daemonset</code> to do the following:</p><ul><li><p>emit metrics about the smp_affinity of the interrupts</p></li><li><p>if the smp_affinity is not balanced, label the node</p></li></ul><p>Now we not only have metrics giving us observability into the issue, but we also label the node, setting the stage for an automated mitigation.</p><p>The mitigation itself was simple at this point:</p><ul><li><p>a new <code>DaemonSet</code> with a <code>nodeSelector</code> targeting the problematic nodes via the label</p></li><li><p>run <code>nsenter -t 1 -m -n -i systemctl try-restart irqbalance</code> in the container of the new <code>Daemonset</code></p></li></ul>
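<p>As a rough sketch of the detection step (the IRQ range matches the example node above; the label name and the <code>NODE_NAME</code> environment variable are illustrative assumptions, and our real check lives in the Go DaemonSet):</p><pre><code># Count the distinct CPU masks the NIC IRQs are spread over;
# on the broken node above this yields 2 (0x10 and 0x20)
distinct=$(for i in {24..32}; do cat /proc/irq/$i/smp_affinity; done | sort -u | wc -l)
if [ "$distinct" -le 2 ]; then
  kubectl label node "$NODE_NAME" irq-affinity-unbalanced=true --overwrite
fi</code></pre>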
<p>As soon as a VM Freeze event left a node in this problematic state, we were able to mitigate the issue automatically.</p><p>This is only a temporary mitigation, as it is as far as we can go as users of a managed AKS cluster.</p><p>Azure is still investigating this issue, and we are waiting for a permanent fix. For most users, the limited interrupt-handling capacity might not be an issue at all; you will only notice it if you push the node to substantial network usage.</p><p>For this post I used an 8-core VM as the example, but the issue can happen on any VM size; we observed it on 16-, 32-, and 64-core VMs. The bigger the node, the bigger the impact, because of the proportion of interrupt capacity concentrated on a few cores.</p><h2>Conclusion</h2><p>This was a very interesting case and a great learning experience, especially as a reminder of where we stand as users of a managed Kubernetes cluster.</p><ul><li><p>Running a managed Kubernetes cluster is not a set-and-forget affair. It's imperative to understand the cluster's internals and to be able to debug issues at the cluster level.</p></li><li><p>Managed-services support is great, but it's not a replacement for knowledge of the internals of the system you are running.</p></li></ul><p>I'm 100% confident that Azure engineering will solve this problem, and we will not have to deal with this issue anymore. But until then, we have to be prepared to dig into the internals of the system and mitigate issues ourselves.</p><p>[update] 10/10/2023: Azure engineering has identified the issue and is rolling out a fix in the next few weeks. This was a bug in <a href="https://bugs.launchpad.net/ubuntu/+source/irqbalance/+bug/2038573">irqbalance</a>; the fix is in upstream irqbalance 1.9.0+.</p><ul><li><p>Bug introduction: patch <a href="https://github.com/Irqbalance/irqbalance/commit/e9e28114036a198b311ee17dd542540f749e6a68">e9e2811</a> initiated the issue.</p></li><li><p>Resolution: a subsequent fix was provided in <a href="https://github.com/Irqbalance/irqbalance/commit/2a66a666d3e202dec5b1a4309447e32d5f292871">2a66a66</a>.</p></li></ul>]]></content:encoded></item></channel></rss>