Introduction
Resource requests and limits are the contract between your workload and the Kubernetes scheduler. Get them right and your cluster is predictable, bin-packed efficiently, and degrades gracefully under load. Get them wrong and you get OOMKilled pods, throttled CPUs, evictions that cascade at the worst possible moment, and a scheduler that confidently places work onto nodes that cannot actually handle it.
The problem is that most teams set these values through cargo-culting, guessing, or copying whatever the tutorial said. This post is about doing it properly.
What requests and limits actually mean
Requests are what the scheduler uses to decide where to place a pod. A node is considered eligible if it has enough unallocated capacity — that is, the sum of all pods’ requests fits within the node’s allocatable resources. The pod may use less than it requested, or more, as long as resources are available.
Limits are a hard ceiling enforced at runtime by cgroups. For CPU, exceeding the limit causes throttling — the process is slowed down but not killed. For memory, exceeding the limit causes the OOM killer to terminate the container. This distinction matters enormously in practice.
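To make the two fields concrete, here is a minimal pod spec (names and image are hypothetical) annotated with what each value actually does:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web                # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      resources:
        requests:
          cpu: 250m        # scheduler reserves 0.25 cores of node allocatable
          memory: 256Mi    # counted against the node's allocatable memory
        limits:
          cpu: "1"         # throttled above 1 core — slowed, never killed
          memory: 512Mi    # OOM killer terminates the container above this
```

The requests block is consumed at scheduling time; the limits block is enforced at runtime by cgroups, with the asymmetric CPU/memory behavior described above.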
The most common mistakes
Setting limits without requests
If you set resources.limits but not resources.requests, Kubernetes sets requests equal to limits. This sounds safe but it destroys bin-packing efficiency: every pod reserves its worst-case allocation even when idle.
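A sketch of the anti-pattern — only the resources stanza shown, values hypothetical:

```yaml
# Only limits set — the API server defaults requests to the same values,
# so this container reserves its full worst-case allocation at schedule
# time, even when idle.
resources:
  limits:
    cpu: "2"
    memory: 4Gi
  # effective requests become: cpu: "2", memory: 4Gi
```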
Setting CPU limits too low
CPU is compressible. Throttling is silent. A pod that is hitting its CPU limit — throttled by the CFS quota in most scheduling periods (100ms by default) — will appear healthy: it’s running, it’s responding. But tail latency percentiles will be elevated and you will spend hours blaming your application code.
The practical answer for many services: set CPU requests based on typical load, and either set limits generously or omit them entirely (using a LimitRange at the namespace level as a backstop).
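A namespace-level backstop might look like this — a sketch with hypothetical names and values:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-backstop       # hypothetical name
  namespace: my-team       # hypothetical namespace
spec:
  limits:
    - type: Container
      default:             # applied as the limit when a container sets none
        cpu: "4"
      defaultRequest:      # applied as the request when a container sets none
        cpu: 100m
```

With this in place, pods that omit CPU limits still get a generous ceiling instead of running completely unbounded.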
Setting memory limits equal to requests
Memory is incompressible. If your application has any variance in heap usage — and most do — tight memory limits will cause periodic OOMKills under load spikes. The better model is requests at typical usage, limits at the maximum you’d tolerate before you’d rather crash and restart than let the node run out of RAM.
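The model in YAML form, for a hypothetical service whose typical working set is around 300Mi and whose tolerable peak is around 600Mi:

```yaml
resources:
  requests:
    memory: 300Mi   # typical usage — this is what drives scheduling
    cpu: 200m
  limits:
    memory: 600Mi   # ~2x requests: headroom for spikes, then crash-and-restart
    # no cpu limit here: rely on a namespace LimitRange as the backstop
```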
Not knowing what your application actually uses
This is the root cause. You cannot set accurate values if you have never measured. The cAdvisor metrics container_cpu_usage_seconds_total and container_memory_working_set_bytes are exposed by the kubelet, and the working set is the same measure the kubelet consults when making eviction decisions. Query them for your workload over a representative period, find the 95th and 99th percentiles, and use those as your baseline.
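As a sketch, the percentile queries might look like this in PromQL — the pod name pattern myapp-.* is a hypothetical placeholder for your workload:

```promql
# p99 of per-pod CPU usage over the last 14 days (in cores), 5m rate windows
quantile_over_time(0.99,
  rate(container_cpu_usage_seconds_total{pod=~"myapp-.*", container!=""}[5m])[14d:5m]
)

# p99 of working-set memory over the same window (in bytes)
quantile_over_time(0.99,
  container_memory_working_set_bytes{pod=~"myapp-.*", container!=""}[14d]
)
```

The container!="" matcher drops the pod-level cgroup aggregate rows that cAdvisor also emits, so you only see per-container series.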
A practical approach
- Deploy with only requests set, based on a reasonable estimate.
- Run for two weeks under realistic traffic.
- Query actual usage from your metrics store.
- Set requests at p90 usage, limits at 2× for memory, and revisit CPU limits only if you see specific throttling evidence.
- Use VPA in recommendation mode (updateMode: Off) to get Kubernetes’ own suggestions as a sanity check.
Vertical Pod Autoscaler in recommendation mode is underused. It watches your pods’ actual resource consumption and tells you what it would have recommended. You don’t have to accept its recommendations automatically — just read them.
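A recommendation-only VPA object is short — this sketch assumes a Deployment named myapp (hypothetical):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa          # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp            # hypothetical workload
  updatePolicy:
    updateMode: "Off"      # recommend only — never mutate running pods
```

Once the recommender has accumulated some history, kubectl describe vpa myapp-vpa shows its target and bound values in the status section.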
QoS classes and why they matter for eviction
Kubernetes assigns pods to one of three Quality of Service classes:
- Guaranteed — every container sets CPU and memory requests equal to limits
- Burstable — at least one container sets a request or limit, but the pod doesn’t meet the Guaranteed bar
- BestEffort — no container sets any requests or limits at all
Under node memory pressure, the kubelet reclaims from BestEffort pods and from Burstable pods exceeding their requests first; Guaranteed pods, and Burstable pods still within their requests, go last. This means your monitoring stack should be Guaranteed. Your batch ML training jobs can probably be BestEffort. Most application pods should be Burstable, tuned as above.
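For a Guaranteed pod, the resources stanza of every container must look like this (values hypothetical) — requests and limits identical for both CPU and memory:

```yaml
resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: 500m      # must match the request exactly
    memory: 1Gi    # must match the request exactly
```

You can confirm which class Kubernetes assigned with kubectl get pod <name> -o jsonpath='{.status.qosClass}'.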
Understanding this hierarchy turns eviction from a mysterious random event into a predictable, intentional design choice.