From Utilization to PSI: Rethinking Resource Starvation Monitoring in Kubernetes
From Utilization Confusion to PSI Clarity in Kubernetes
In Kubernetes v1.33, cAdvisor’s Pressure Stall Information (PSI) metrics can be enabled on the kubelet as an alpha feature by passing --feature-gates=KubeletPSI=true.
Introduction: The Evolution of Resource Monitoring
In traditional VM-based environments, monitoring resource starvation was straightforward: you watched resource utilization (CPU, memory, etc.) against the machine’s capacity. If a VM’s CPU usage hit close to 100% of its allocated vCPUs or memory usage neared 100% of RAM, you knew contention was occurring. High utilization meant the workload was starved for more resources. This utilization-centric approach made sense when each VM had fixed resources.
However, Kubernetes changed the game. Kubernetes introduced the concepts of resource requests and limits for containers, enabling dynamic sharing and overcommitment of resources on a node.
Many teams initially tried to carry over the old monitoring mindset, comparing a container’s usage to its requested resources as a proxy for stress. Unfortunately, usage vs. requests can be very misleading in Kubernetes: a container running above its request may simply be borrowing idle capacity. A container using more CPU than requested isn’t necessarily a problem, and one using less than requested isn’t necessarily safe from contention. The traditional model of “utilization == starvation” doesn’t directly apply in this new world of shared resources and elastic consumption.
In this post, we’ll explore:
Why the old metrics (like CPU utilization vs. requests) fall short in Kubernetes.
Why even monitoring usage against limits is only a slight improvement.
Why setting CPU limits in Kubernetes is often considered a bad practice, as it can hurt performance.
How Linux’s Completely Fair Scheduler (CFS) using CPU shares (weights) based on requests usually suffices to manage CPU contention.
How Pressure Stall Information (PSI) metrics provide a far more accurate picture of resource contention.
We'll look at key scenarios that PSI highlights, such as CPU throttling events or genuine CPU pressure, and how PSI avoids the false positives of older monitoring approaches. Technical sample queries will be included to illustrate how to gather and use PSI metrics in practice.
If you’re a Kubernetes engineer or SRE still relying on outdated utilization metrics, it’s time to update your toolkit. Let’s dive in.
The Traditional Approach: Utilization vs. Requests (and Why It Fails)
In pre-container environments, we monitored utilization to catch resource starvation. For example, if a VM’s CPU was 95% utilized or its memory 90% full, that was a red flag.
Many teams initially applied a similar idea to Kubernetes by looking at a container’s usage relative to its resource requests (the amount of CPU or memory it “requested” when scheduled). The assumption was: if a pod’s CPU usage is near or above its request, it must be at risk of starvation, and if usage is well below request, it’s safe.
This approach, however, is flawed in Kubernetes. Resource requests are not hard allocations – they are guarantees for scheduling and baseline service, not fixed ceilings. Kubernetes uses requests to decide which node can host a pod and to ensure each pod gets its fair share when resources are contested, but a pod can use more CPU than requested if the node has spare capacity. Similarly, a pod might have low CPU usage relative to its request, yet still encounter contention if other pods compete for CPU.
In other words, comparing usage to requests is apples-to-oranges: requests are a scheduling construct, while usage is actual consumption.
Example: Imagine a pod requests 1 CPU but at runtime it’s using 1.5 CPUs on average. In a VM world this would be “150% utilization” (impossible on a fixed 1 CPU allocation), but on Kubernetes this scenario can happen if the node has idle CPU cycles. The pod simply borrows CPU above its request since no one else is using it. Naively, an SRE might see 150% of request and panic. But if the node isn’t fully utilized, this isn’t actually a problem! The pod isn’t starved at all; it’s benefiting from extra headroom. Kubernetes explicitly allows this: "As long as the node isn't maxed out, pod B can use whatever extra CPU is free... it won't interfere with pod A's fair share. That's the whole point of CPU requests – they give you a floor (guarantee)."
On the other hand, consider a pod that requests 1 CPU, but is only using 0.5 CPU most of the time. One might think it’s safe because it’s under its request. But if the node is fully booked with other pods and this pod occasionally needs more CPU (say bursts to 1 CPU), it will get at least 1 CPU (its full request) if it needs it – that’s guaranteed. However, if it needed more than 1 CPU (beyond its request) at a time when the node is busy, it might experience delays. Traditional monitoring wouldn’t flag this at all because usage hasn’t hit any static threshold relative to request or capacity.
In short, utilization vs. request is a poor indicator of actual distress in Kubernetes. A pod can be using 200% of its requested CPU and be perfectly healthy if the node has spare capacity, or it can be well below 100% of its request and still suffer if the node CPU is fully contended (or if it’s artificially capped by other means). The old model “high utilization = bad” doesn’t directly translate when resources are elastic.
Why It Made Sense on VMs (Fixed Quota) but Not on Kubernetes
It’s worth highlighting why this confusion exists.
On a VM or physical machine: Your CPU and memory allocations are basically fixed. If you have 4 vCPUs, 100% usage means all 4 are busy. If you have 8 GB of RAM, using 7.5 GB means you’re about to run out. There’s a fixed ceiling, so usage as a fraction of that ceiling is a meaningful metric.
In Kubernetes: A container’s “ceiling” is not always fixed at its request. If no explicit limit is set, the true ceiling is the node’s capacity (or remaining capacity), which is often much higher than the request. A container’s resource usage can go beyond what it requested (temporary boost) or can be constrained by overall node conditions even before hitting its request (if other pods demand their share).
Thus, the ratio usage/request can be very misleading. High usage/request doesn’t necessarily mean trouble (could be just opportunistic usage), while low usage/request doesn’t guarantee no contention.
Many Kubernetes monitoring dashboards still show “CPU utilization vs. requests” or “Memory usage vs. requests” for pods or deployments. These can be useful for capacity planning or right-sizing (e.g., to see if requests are far too high or too low relative to actual usage over time). But they’re not reliable for real-time detection of contention or starvation. Relying on them for alerting can cause false positives (alerting on benign bursts above request) and false negatives (missing actual contention that doesn’t manifest in those ratios).
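For completeness, here is roughly what that right-sizing view looks like in PromQL, assuming cAdvisor metrics plus kube-state-metrics (which exposes kube_pod_container_resource_requests); treat it as a sketch and adjust label names to your setup:
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m])) / sum by (namespace, pod) (kube_pod_container_resource_requests{resource="cpu"})
A value of 1.5 means the pod is averaging 150% of its requested CPU – useful for tuning requests over time, but, as argued above, not a contention alert.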
Monitoring Against Limits: A Slightly Better Approach
Realizing the pitfalls of using requests as the yardstick, many teams shifted to monitoring resource limits instead. Kubernetes allows setting resources.limits for CPU and memory, which are hard constraints: a container cannot exceed its CPU limit (it will be throttled) and cannot exceed its memory limit (it will be OOM-killed if it tries).
Intuitively, monitoring usage against these hard limits makes more sense:
If a container is close to 100% of its memory limit, it’s in danger of OOM.
If a container’s CPU usage is hitting 100% of its CPU limit, it means it’s fully utilizing its allowed CPU and could be throttled.
Memory limits in particular demand close attention. Unlike CPU, memory is not a “compressible” resource – if you run out of memory, the kernel cannot just slow things down; something has to give (usually the process gets killed). "Memory is different because it is non-compressible – once you give memory you can't take it away without killing the process." For this reason, best practice is to always set memory limits on pods, and monitor if memory usage approaches those limits. A container at 95% of its memory limit is one allocation away from an OOM Kill. So monitoring memory usage vs. limits (and receiving alerts before it hits 100%) is essential.
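As a hedged example, that memory check can be expressed in PromQL with cAdvisor’s working-set and spec metrics (label names vary by setup, and the != 0 filter skips containers that have no memory limit configured):
container_memory_working_set_bytes{container!=""} / (container_spec_memory_limit_bytes{container!=""} != 0) > 0.90
Anything matching this expression is a container above 90% of its memory limit – close enough to an OOM kill to warrant attention.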
For CPU limits, if set, a container’s CPU usage being at 100% of its limit is a sign it wants more CPU but is not allowed to have it. Hitting a CPU limit won’t kill the container – instead, the Linux kernel will throttle the container’s CPU cycles to enforce the limit. Throttling means the container’s processes are made to wait, even if the CPU is idle, until the next time slice – effectively capping its CPU usage to the limit over time. If you monitor a container and see its CPU usage flatlined at its limit (say constantly using 1 core when the limit is 1 core), that likely means the container could use more CPU if it were available. In other words, it’s potentially CPU-starved (constrained by the limit).
An even clearer indicator is to monitor the CPU throttling metrics that cAdvisor exposes when limits are in place. For example, cAdvisor tracks container_cpu_cfs_throttled_seconds_total (cumulative seconds a container was throttled) and the number of throttling occurrences. By checking the rate of increase of this metric, you can tell if the container is actively being throttled by the CPU quota. A high throttling rate means the container hit its CPU limit frequently. Monitoring throttling metrics captures scenarios where average CPU usage is low but brief bursts above the limit cause throttling.
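As a sketch, the fraction of CFS scheduling periods in which a container was actually throttled can be computed from the two cAdvisor period counters (adjust labels to your environment):
sum by (namespace, pod, container) (increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) / sum by (namespace, pod, container) (increase(container_cpu_cfs_periods_total{container!=""}[5m]))
A value of 0.25 means the container was throttled in roughly a quarter of its scheduling periods over the last five minutes.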
Overall, watching memory usage vs. memory limits and CPU usage vs. CPU limits (or throttle metrics) is more aligned with real resource risks:
If memory usage is near the limit, the pod is at risk of OOM kill – a critical condition.
If CPU usage hits the limit and throttling occurs, the pod’s performance is being artificially constrained by its quota.
This approach reduces false alarms compared to the naive utilization-vs-request method. You won’t alert on a pod using 150% of its request if it still hasn’t hit any limit. Instead, you’d alert when it actually hits a ceiling (limit) or gets throttled. It’s a step in the right direction.
However, there are two big caveats:
Not everyone sets CPU limits (in fact, as we’ll discuss next, setting CPU limits can be counterproductive).
Even with limits, these signals don’t tell the whole story of why the pod is constrained – whether it’s a true contention issue or just a mis-configured limit.
If you follow modern best practices, you might only set memory limits and not CPU limits on your pods. In that case, CPU usage has no defined hard limit to compare against – a pod can use all the CPU it can get on the node. You’re back to square one for CPU: how do you detect CPU contention without a limit? Monitoring raw CPU usage alone still isn’t sufficient, because a pod could be slowed down by competition with other pods even if it has no fixed limit.
Secondly, even when CPU limits are used, you might want to detect contention before a pod is being throttled at 100% of its limit. For example, a pod might be using 80% of its limit while the node is completely busy; it might not be throttled yet, but it could still be experiencing latency due to high CPU demand on the node. Pure usage metrics won’t flag that.
The bottom line: Monitoring limits is better than nothing – especially for memory – but it’s a reactive measure and can miss subtler forms of contention. We need a way to directly measure “how hard is the workload trying to use resources and being held back,” whether by limits or by competition with others.
Enter Linux’s CPU scheduler behavior and why many recommend removing CPU limits entirely in favor of a different approach.
The Case Against CPU Limits (and How Kubernetes Schedules CPU Fairly Without Them)
If monitoring CPU limits and throttling is an improvement, an even more radical improvement is to avoid CPU limits altogether. This might sound counterintuitive – if you don’t limit CPU, won’t pods just contend uncontrolled? But Kubernetes (and Linux) have a built-in mechanism to handle CPU contention: CFS CPU shares based on the pod’s CPU requests (also known as CPU weight).
Many experts argue that setting CPU limits causes more harm than good in Kubernetes, and that you can rely on requests and the kernel scheduler for fair sharing. Let’s break down why CPU limits can be harmful:
They restrict natural bursting, even when resources are idle. A container with a CPU limit cannot exceed that limit, no matter what. If the node has idle CPU cycles, a container without a limit could have used those cycles to handle a spike in work, then dropped back down. With a limit, those idle cycles go unused while the container threads sit idle waiting for the next time slice. In effect, “resources are available but you aren’t allowed to use them.” This is wasted potential and can degrade application performance. Why slow down your app just to keep CPU idle?
They can cause complex throttling behavior. When a container hits its quota early in a scheduling period, the kernel will throttle it for the remainder of that period. This can introduce latency spikes. The throttling isn’t smooth; it literally pauses the container’s threads. If your application is latency-sensitive, CPU quotas can produce irregular delays that are hard to predict or tune.
They are often unnecessary for fairness. The typical reason people set CPU limits is to prevent one pod from hogging the CPU and starving others (the “noisy neighbor” problem). But Kubernetes already has a solution for this: CPU requests translate to CFS weights. The Linux Completely Fair Scheduler distributes CPU time according to these weights when there’s contention. If two pods contend for CPU, each gets a share proportional to its weight (derived from its CPU request). For instance, if Pod A requests 1500 millicores and Pod B requests 1000 millicores, A will get 60% and B 40% of CPU time under contention. It doesn’t matter if Pod B tries to use more; it will only get spare cycles beyond its share if A isn’t using its full request. In other words, requests give you a guaranteed floor and a fair share, without the need for hard caps. The kernel scheduler’s use of weights is well documented: “Kubernetes resources.requests.cpu translates into a weight. It’s the relative weight that matters – the ratio of one container’s request to another’s. If the node is under load, container B (with double the request of A) will get roughly twice as much CPU time as container A.” This happens automatically, with no CPU limit required (a worked example of the weight arithmetic follows this list).
CPU limits don’t affect scheduling, only runtime. A subtle point: the Kubernetes scheduler doesn’t even consider limits when placing pods, only requests. This means you could have a node where total CPU limits of pods exceed capacity; the limits aren’t used for admission control. Their only function is to throttle at runtime. If you’ve already ensured via requests that the node won’t be overloaded (scheduler won’t put more total requested CPU than the node capacity), then limits are mostly redundant for preventing overload.
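To make that weight arithmetic concrete, here is a rough sketch of the conversion the kubelet performs; the exact formulas depend on the kubelet and cgroup version, so take the numbers as illustrative:
Pod A: 1500m request → cpu.shares ≈ 1500 × 1024 / 1000 = 1536
Pod B: 1000m request → cpu.shares ≈ 1000 × 1024 / 1000 = 1024
Under contention: A gets 1536 / (1536 + 1024) = 60% of CPU time, B gets 40%.
On cgroup v2 these shares are further mapped to a cpu.weight value, but the ratio – and therefore the 60/40 split – is roughly preserved. No CPU limit is involved at any point.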
Because of these reasons, many in the Kubernetes community advocate not using CPU limits at all for most workloads. If every pod has an appropriate CPU request, then no pod can starve another of its guaranteed share. Any pod can still burst above its request if extra CPU is available, which improves utilization and performance. And if two pods both want more than their share, they’ll be limited by the CFS weighting – effectively, each is “throttled” only by the fact that the other exists and has a claim, not by an arbitrary cap. It’s a more organic form of throttling based on competition, not a static limit.
To illustrate, consider a scenario: two pods (Pod A and Pod B) share a 2-CPU node, each requesting 1 CPU with no limits. If B suddenly needs more CPU and A doesn’t need all of its share, B can temporarily use 1.5 CPUs while A uses 0.5. A can still get its full 1 CPU whenever it needs it (that much is reserved for it), and B just opportunistically uses the slack. Everyone is fine. If, instead, we imposed a limit equal to each request (1 CPU each), then even with A idle, B could not exceed 1 CPU – it would be stuck waiting while that extra CPU sits idle. That’s exactly what we want to avoid.
The modern best practice is: use CPU requests for all pods (and make them as accurate as possible), but set no CPU limits in most cases. The only exceptions might be certain workloads that internally adjust to a given CPU limit or multi-tenant clusters where you absolutely need to cap usage of untrusted workloads. But for typical microservices in a controlled cluster, CPU limits often do more harm than good.
If you adopt this approach (no CPU limits), you gain performance – pods can burst and use idle cycles – and simpler behavior. But you lose the simple signal of “CPU usage == limit” and the throttling metric for that pod, since there is no artificial throttling anymore. You need a different way to monitor when a pod is truly encountering CPU contention. After all, just because we removed the limit doesn’t mean we don’t care if the pod is getting constrained; it’s just constrained by actual contention now (other pods or node capacity), not by a configured quota.
How can we detect that scenario? This is where Pressure Stall Information (PSI) comes in as a game-changer for monitoring. It gives us direct insight into contention, regardless of whether a CPU limit is involved or not.
The Modern Approach: Pressure Stall Information (PSI)
Linux’s Pressure Stall Information (PSI) is a kernel feature (introduced in Linux 4.20) that provides a direct measure of resource contention. In essence, PSI metrics tell you what percentage of time tasks are stalled (waiting) due to lack of a given resource – CPU, memory, or IO.
This is exactly the signal we want for detecting resource starvation:
If an application’s threads are frequently waiting on CPU because the CPU is busy elsewhere (or a quota throttled them), that indicates CPU pressure.
If they are waiting on memory (e.g., for memory to be freed or swapped in), that indicates memory pressure.
PSI has been described as a “barometer” of resource pressure, providing early warning as pressure builds. Unlike raw utilization, which only shows how much resource is being used, PSI shows how contended that resource is, i.e., the cost (in wait time) of that contention.
To put it another way: high CPU utilization could be either because an app is happily consuming available CPU or because it’s struggling to get CPU time; PSI distinguishes these by measuring the delay. If an app is using a lot of CPU but not experiencing delays, PSI will remain low. If an app is getting delayed (runnable but not running), PSI will report a higher percentage.
Concretely, the Linux kernel exposes PSI data via files like /proc/pressure/cpu, /proc/pressure/memory, etc., and with cgroups v2, you can get PSI for specific cgroups (which is how Kubernetes can get per-container and per-pod PSI).
At the system level, CPU PSI is reported as a single metric (some pressure) – a system-wide “full” CPU stall isn’t meaningful, because some task is always running on the CPU whenever anything is runnable; per-cgroup CPU pressure does include a full measure as well. For memory and IO, PSI is reported in two flavors: some (at least one task stalled) and full (all tasks stalled, meaning a complete stall). But for most purposes, the “some” metric is the primary indicator of pressure.
What does “some CPU pressure = 20%” mean in plain terms? It means that over the time window, 20% of the time there was at least one task that wanted to run but couldn’t due to CPU being busy. In other words, one or more threads were ready to execute but had to wait. 0% CPU pressure means no delay. 100% CPU pressure (extreme case) would mean at all times, something was waiting for CPU.
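For reference, reading one of these pressure files yields output roughly like the following (the numbers are illustrative; the format follows the kernel’s PSI documentation, and older kernels omit the full line for CPU at the system level):
some avg10=2.50 avg60=1.10 avg300=0.40 total=123456789
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
The avg10/avg60/avg300 fields are pressure percentages averaged over the last 10, 60, and 300 seconds, and total is cumulative stall time in microseconds – the same quantity the cAdvisor counters described below accumulate, only in seconds.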
The beauty of PSI is that it directly measures contention as experienced by the workload. It doesn’t matter whether the contention is because of a hard limit (throttling) or because other processes are competing – if your container’s tasks had to wait, PSI captures it. Conversely, if your container is blasting CPU but never actually waits (because there was no contention), PSI stays low.
As the VictoriaMetrics team put it: “PSI tracks when tasks are delayed or stalled due to resource contention – basically when the CPU is too busy to handle everything right away... These [PSI] metrics give you a pretty direct view into how much CPU pressure your containers are dealing with — something that raw CPU usage numbers don’t always show clearly.” This is a crucial point: raw usage can’t differentiate between using 80% of CPU with no interference vs. using 80% and desperately wanting 100%. PSI can.
PSI in Kubernetes: Getting the Data
Initially, PSI was only available by manually checking the host or cgroup files, but it has since been integrated into Kubernetes’ monitoring pipeline. Recent versions of cAdvisor (and the Kubernetes summary API) now expose PSI metrics for each container, pod, and node.
As of this writing, this is typically an alpha feature – you may need to enable the KubeletPSI feature gate and be running on a Linux kernel that supports cgroup v2 and PSI (kernel 4.20+ with cgroup2). But assuming those requirements are met, you’ll have new metrics available in the kubelet’s /metrics/cadvisor endpoint.
The key PSI metrics for containers exposed via cAdvisor are typically named:
container_pressure_cpu_waiting_seconds_total: total time tasks in the container have been delayed waiting for CPU (corresponds to the PSI “some” CPU counter). “Waiting” here means at least one task waiting.
container_pressure_cpu_stalled_seconds_total: total time all tasks in the container were stalled due to CPU (CPU “full”, less commonly used for CPU).
Similarly, you’ll find:
container_pressure_memory_waiting_seconds_total and ...memory_stalled_seconds_total for memory pressure (some vs full).
container_pressure_io_waiting_seconds_total and ...io_stalled_seconds_total for IO pressure.
These metrics accumulate time (in seconds) that tasks were stalled. To get a current pressure percentage over a time interval, you take a rate of these counters. We’ll demonstrate that in the next section with queries.
The key thing is: we now have a direct gauge of resource contention for each container/pod. We no longer have to infer it indirectly from usage vs. requests or throttling metrics. We can literally see “this container spent X% of the last 5 minutes waiting on CPU”. That is gold from an SRE perspective – it answers “is my app suffering from lack of CPU?” with a concrete measure.
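Before building dashboards on these counters, a quick sanity check that they are actually being scraped might look like this (assuming Prometheus scrapes the kubelet’s cAdvisor endpoint and the metric names above):
count by (instance) (container_pressure_cpu_waiting_seconds_total)
If this returns nothing, revisit the KubeletPSI feature gate, the kernel/cgroup v2 requirements, and your scrape configuration.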
Unlocking Insights: How PSI Reveals Real Contention
Let’s discuss a few scenarios to illustrate how PSI shines, highlighting exactly the cases mentioned earlier:
Pod is throttled (CPU limit): Suppose you still have a CPU limit on a pod, and the pod is frequently hitting that limit. Each time it hits the limit, the kernel throttles it (makes it wait until the next period). From the pod’s perspective, its processes were ready to run but got halted – classic CPU stall. PSI will register this: during those throttle intervals, tasks were waiting for CPU even though the CPU might have been idle otherwise (it’s a forced wait). Therefore, the container’s cpu_waiting PSI goes up. If you see, say, 10% CPU pressure on a container that correlates with it running at its exact CPU limit, that indicates it spent 10% of time throttled due to the limit. In older monitoring, you might have noticed high throttle counts; with PSI, you see the impact of that throttling as a percentage of lost time. This is a more intuitive measure (“10% CPU starvation”) than just raw counts. The advantage is that PSI doesn’t require any special case – it doesn’t matter that the wait was self-inflicted by a limit; it will still show up.
Pod exceeds its request and the node is at capacity (genuine CPU contention): Now consider a pod with no CPU limit. It has a request of 1 CPU but can use more if available, and it starts using 2 CPUs because demand increased. If the node has at least 2 CPUs free, it will get them – no contention, no pressure. But if the node only has 1.5 CPUs free beyond what others are using, the pod will be competing with others for CPU time. The Linux scheduler will give it its fair share (~1 CPU worth plus some fraction of the extra), but not the full 2 CPUs it wants. The pod will effectively run below the level it would like to (it has threads that could run more, but they must wait their turn). In this scenario, even though there’s no explicit limit, the pod is experiencing CPU starvation due to node capacity and competition. How do we detect it? The pod’s CPU usage might show something like 1.5 CPUs (above its request of 1, which might or might not alert someone). But PSI will clearly show something like 25% CPU wait, meaning that for 25% of the time the pod had tasks waiting on CPU because the node was fully busy. That directly quantifies the contention. In other words, whenever a pod is unable to run because other pods (or overall load) are using the CPU, CPU PSI rises. This is exactly when SREs need to know – the pod could benefit from more CPU (a higher request, moving to a less busy node, or scaling out). Traditional metrics couldn’t isolate this condition well (a pair of query sketches for separating this case from limit-induced throttling follows these scenarios).
Pod exceeds its request but node has available capacity (no contention): This is the flip side and addresses the false positives issue. A pod might be using more CPU than its requested (say 200% of request) but if the node has idle cores, this is not a problem – the pod isn’t depriving anyone and isn’t waiting for CPU. Old-school monitoring might wrongly flag this as an issue. But PSI will be near 0% in this case, because from the pod’s view, it got all the CPU time it wanted with no delays. No waiting, no pressure. This is a beautiful example of PSI avoiding a false alert. The SRE can confidently ignore high usage if pressure remains low – it means the high usage is simply opportunistic consumption of idle resources, not contention. By focusing on pressure, you don’t cry wolf when a pod is just efficiently using available headroom.
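To separate the first two scenarios in practice, one hedged approach is to cross-reference CPU pressure with the CFS throttling counters: pressure that coincides with throttling points to a limit, while pressure without throttling points to competition on the node (adjust the thresholds and label matching to your setup):
rate(container_pressure_cpu_waiting_seconds_total{container!=""}[5m]) > 0.10 and on (namespace, pod, container) rate(container_cpu_cfs_throttled_periods_total[5m]) > 0
rate(container_pressure_cpu_waiting_seconds_total{container!=""}[5m]) > 0.10 unless on (namespace, pod, container) rate(container_cpu_cfs_throttled_periods_total[5m]) > 0
The first expression surfaces limit-induced pressure (scenario one); the second surfaces pressure with no observed throttling, i.e., genuine contention (scenario two).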
To sum up, PSI aligns alerts with actual performance-impacting events. High CPU PSI means the app experienced CPU wait time (it was ready to do work but had to wait). High memory PSI means the app was stalled due to memory. If these metrics are low, it means lack of resources is not significantly slowing the app, regardless of how high the utilization numbers might be.
Memory PSI is also extremely useful. Memory contention in Kubernetes typically ends in OOM kills. Memory PSI can show that an application is spending time waiting for memory (e.g., stalled on page reclaim, refaults, or swapping). If memory PSI for a container is significant, that’s a red flag: even if it hasn’t been OOM-killed yet, it’s suffering and could benefit from more memory or optimizations. In the past, one might only notice memory issues after an OOM kill event. PSI gives a window into the “gray zone” of memory pressure before a fatal event.
In summary, PSI metrics let you detect real resource starvation conditions in Kubernetes: whether due to CPU limits, CPU competition, or memory crunch, without getting confused by usage patterns that aren’t actually problematic. This makes them a powerful addition to the monitoring arsenal for Kubernetes SREs.
Putting PSI to Work: Practical Monitoring Examples
Now that we have these PSI metrics, how do we use them? In most setups, you’ll be scraping the kubelet/cAdvisor metrics with Prometheus (or another monitoring system). Assuming container_pressure_cpu_waiting_seconds_total and friends are being collected, here are some sample queries and techniques using PromQL:
1. Calculate CPU pressure percent for a container or pod:
Use the rate of the _waiting_seconds_total counter over a window, and multiply by 100 for percentage.
100 * rate(container_pressure_cpu_waiting_seconds_total{namespace="my-namespace", pod="my-pod", container="my-app-container"}[5m])
This yields the percentage of time over the last 5 minutes that at least one task in the specified container was waiting for CPU. If this value is 30, it means 30% of the time the container was CPU-starved. (Adjust labels to match your metrics setup; filter out container="POD" if needed).
2. Alert on high CPU pressure:
Set up an alert like: “CPU pressure > X% for Y minutes”.
rate(container_pressure_cpu_waiting_seconds_total{namespace!~"kube.*"}[5m]) > 0.20
This checks for >20% CPU pressure over 5 minutes for any container outside the kube-* namespaces (kube-system and friends). Choose a threshold that makes sense – even a small, consistently non-zero value might be worth investigating, but 10-20% is often a good starting point to avoid noise. This alert says “this container spent more than 20% of the last 5 minutes waiting on CPU – it’s likely CPU starved.”
3. Memory pressure monitoring:
Similarly, use container_pressure_memory_waiting_seconds_total.
100 * rate(container_pressure_memory_waiting_seconds_total{namespace="my-namespace", pod="my-pod", container="my-app-container"}[5m])
This gives the percent of time the container was stalled due to memory. Ideally this is 0%. Any sustained non-zero memory pressure indicates the app is experiencing memory contention (e.g., the kernel is frequently reclaiming pages, or it’s on the verge of OOM). You might alert if this goes above, say, 5% for some time, because significant memory stall could degrade performance badly.
4. Node-level pressure:
Check overall node pressure by looking at the metrics for the root cgroup (usually identified by a specific label like id="/", container="", and pod="").
100 * rate(container_pressure_cpu_waiting_seconds_total{id="/"}[5m])
This query (adjusting labels as needed for your Prometheus setup) could give the overall CPU pressure for the entire node. If this is high, it means the node is collectively overcommitted on CPU. This can be used to drive node-level auto-scaling or just to monitor overall health.
5. Identify top contended pods:
Find which pods have the highest CPU pressure using topk.
topk(5, 100 * rate(container_pressure_cpu_waiting_seconds_total{namespace="my-namespace", container!="POD"}[5m]))
This would list the top 5 containers (excluding pause containers) in my-namespace by CPU pressure percentage over the last 5 minutes. This is great for troubleshooting: it directly surfaces “who is suffering from CPU contention the most.”
6. Combine with usage for context:
PSI is best used alongside traditional metrics. Create a dashboard showing:
CPU Usage (millicores)
CPU Pressure (%)
Memory Usage (bytes)
Memory Pressure (%)
Side by side for each pod/container. This way you can differentiate:
High usage + Low pressure: Healthy, high throughput, efficiently using resources. Good!
Lower usage + High pressure: Application is likely getting throttled or contended; performance is likely degraded. Needs investigation/more resources.
High usage + High pressure: Application is very busy and hitting contention. Could potentially use more resources or needs optimization.
If request latency spikes alongside high CPU pressure, it confirms the application was delayed by CPU availability. If latency spikes but CPU pressure is zero, the cause lies elsewhere.
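For the CPU pair of panels, a minimal sketch of the two queries might look like this (assuming the standard cAdvisor metric and label names; the memory pair is analogous with container_memory_working_set_bytes and container_pressure_memory_waiting_seconds_total):
1000 * sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
100 * sum by (namespace, pod, container) (rate(container_pressure_cpu_waiting_seconds_total{container!=""}[5m]))
The first panel shows consumption in millicores; the second shows the percentage of time tasks in that container were waiting on CPU.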
Remember to ensure your cluster setup provides these metrics. Check your Kubernetes version, cAdvisor configuration, and monitoring agent setup. PSI metrics are gaining adoption but might require explicit configuration depending on your environment.
Conclusion: Out with the Old, In with the New (Monitoring)
The world of Kubernetes resource management requires rethinking old monitoring habits. Historically, we obsessed over utilization percentages and compared usage to static allocations. In Kubernetes, that paradigm is outdated. A pod running at 95% of its requested CPU might be absolutely fine, while another at 50% could be suffering – without the right insight, you wouldn’t know.
We saw that monitoring against resource limits is a step closer to reality, especially for memory and for detecting explicit CPU throttling, but even that has limitations, particularly as best practices shift toward minimal use of CPU limits.
By leveraging PSI metrics, we align our monitoring with what actually matters: whether workloads are delayed due to resource contention. This gives SREs and engineers a much clearer signal amidst the noise. No more guessing or second-guessing based on indirect metrics – PSI tells it like it is.
To be opinionated: The traditional model of looking at utilization or usage vs. requests in Kubernetes is not just misleading, it’s antiquated. In an environment where resource allocations are fluid and “100% usage” has no fixed meaning, clinging to those old metrics can lead to bad decisions (throttling workloads unnecessarily, or not noticing when something is starving).
Modern Kubernetes operations should adopt a contention-first monitoring mindset using PSI. Here are the key takeaways:
Always set memory requests and limits and monitor usage against limits. Use memory PSI to catch pressure early.
Set CPU requests for all containers to ensure fair scheduling and capacity planning.
Avoid CPU limits for most workloads. Let pods burst and trust Kubernetes/Linux to share CPU via CFS weights.
Monitor CPU contention directly with PSI metrics rather than naive utilization. High CPU PSI is a clear signal of starvation; low PSI means the workload is getting the CPU it needs.
Use PSI alongside other metrics for full context (e.g., correlate with latency or traditional usage).
Monitor node-level PSI to understand overall cluster saturation.
The Kubernetes ecosystem is recognizing the value of PSI. It’s making its way into upstream features and recommendations. By incorporating PSI into your monitoring dashboards and alerts, you’ll have a much sharper understanding of your clusters’ performance. You’ll reduce noise (no more false alarms for benign high usage) and catch true issues faster (seeing actual contention as it develops).
In Kubernetes, “not all high utilization is created equal,” and PSI is the lens that shows the difference. As engineers and SREs, embracing this new approach will let us focus our optimizations and firefighting where it truly matters. It’s time to retire the old metrics (or at least deprioritize them) and adopt a contention-first monitoring mindset. Your pods will thank you by doing their work without waiting in line (and your pager will thank you for quieter nights).
No single metric is a silver bullet, but in the realm of resource monitoring, PSI is a huge leap forward. Combined with good resource request hygiene and sensible limits (or lack thereof), it forms the core of a modern, accurate picture of Kubernetes performance. The old utilization metrics served us well in the VM era, but Kubernetes demands a more nuanced view – and we now have the tools to achieve it.