
I don't set limits because I'm afraid of how a pod is going to affect other pods. I set limits because I don't want to get used to being able to tap into excess CPU that isn't guaranteed to be available.

As the node fills up with more and more other pods, it's possible that a pod that was running just fine a moment ago is crawling to a halt.

Limits allow me to simulate the same behavior and plan for it by doing the right capacity planning.

They are not the only way to approach it! But they are the simplest way to do it.
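
For concreteness, a minimal sketch of what I mean (names and numbers are made up, not from any real deployment): a pod spec where the CPU limit equals the request, so the pod never comes to rely on spare capacity that may disappear as the node fills up.

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-app            # hypothetical name
    spec:
      containers:
      - name: app
        image: example/app:1.0     # placeholder image
        resources:
          requests:
            cpu: "500m"
            memory: "256Mi"
          limits:
            cpu: "500m"            # limit == request: no borrowing of idle CPU
            memory: "256Mi"

With limit == request, the pod behaves the same on an empty node and on a packed one, which is exactly the behaviour I want to plan capacity around.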




Limiting CPU to the amount guaranteed to be available also guarantees very significant resource waste unless all your pods spin at 100% CPU continuously.

The best way to utilize resources is to overcommit, and the smart way to overcommit is to, say, allow 4x overcommit with each allocation limited to 1/4th of the available resources so no individual peak can choke the system. Given varied allocations, things average out with a reasonable amount of performance variability.
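
To put rough numbers on that (purely illustrative, not from a real cluster):

    node capacity:            16 cores
    per-pod CPU limit:         4 cores   (1/4 of the node, so one spike can't take the box)
    per-pod CPU request:       1 core
    pods the scheduler packs: 16         (it packs by requests: 16 x 1 core)
    sum of limits:            64 cores   (4x overcommit)

As long as the 16 pods don't all peak at once, the node stays busy; if a few peak together, they share the headroom and see some variability.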

Idle CPUs are wasted CPUs and money out the window.


> with each allocation limited to 1/4th of the available resources so no individual peak can choke the system.

This assumes that the scheduled workloads are created equal, which isn't the case. The app owners have no control over what else gets scheduled on the node, which introduces uncontrollable variability in the performance of what should be identical replicas and environments. What helps here is... limits. The requests-to-limits ratio allows application owners to reason about the variability risk they are willing to take in relation to the needs of the application (e.g. imagine a latency-sensitive workload on a critical path vs a BAU service vs a background job which only cares about throughput -- for each of these classes, the ratio would probably be very different).

This way, you can still overcommit, but not by a rule of thumb created centrally by the cluster ops team (e.g. aim for 1/4); instead the decision is distributed across each workload owner (i.e. application ops), where it can be made a lot more accurately and with better results. This is what the parent post is also talking about.
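
As a sketch of what that per-class ratio tuning can look like (all values invented for illustration):

    # latency-sensitive, critical path: request == limit, no variability from neighbours
    resources:
      requests: { cpu: "1",    memory: "1Gi" }
      limits:   { cpu: "1",    memory: "1Gi" }

    # BAU service: some burst headroom, accepts some variability (1:2 ratio)
    resources:
      requests: { cpu: "500m", memory: "512Mi" }
      limits:   { cpu: "1",    memory: "512Mi" }

    # throughput-oriented background job: small request, no CPU limit, takes what's idle
    resources:
      requests: { cpu: "100m", memory: "256Mi" }
      limits:   { memory: "256Mi" }

The cluster still overcommits overall, but each owner chooses how exposed their workload is.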


1/4th was merely an example for one resource type, and a suitable limit may be much lower depending on the cluster and workloads. The point is that a limit set to 1/(number of workloads) guarantees wasted resources; limits should be set significantly higher, based on realistic workloads, while still ensuring that it takes N workloads to consume all resources so the risk of peak demand collisions averages out.

> This assumes that the scheduled workloads are created equal which isn't the case.

This particular allocation technique benefits from scheduled workloads not being equal, as equality would increase the likelihood of peak demand collisions.


That's why you use monitoring and alerting, so you notice degraded performance before the pod is crawling to a halt.

You need to do it anyway because a service might progressively need more resources as it's getting more traffic, even if you're not adding any other pod.


Sure, you need monitoring and alerting, and sure, there are other reasons why you need to update your requests.

But having _neighbours_ affect the behaviour of your workload is precisely what creates the kind of fatigue that then results in people claiming that it's hard to run k8s workloads. K8s is highly dynamic; pods can get scheduled onto a node by chance. Sometimes, on some clusters, pagers will ring and incidents will be created for conditions that may resolve themselves, all because of another deployment (possibly by another team) happening.

Overcommit/bursting is an advanced cost saving feature.

Let me say it again: splitting up a large machine into smaller parts that can use the unused capacity of other parts in order to reduce waste is an advanced feature!

The problem is that the request/limits feature is presented in the configuration spec and in the documentation in a deceptively simple way, and we're tricked into thinking it's a basic feature.

Not all companies have ops teams that are well equipped to do more sophisticated things. My advice for teams who cannot set up full automation around capacity management is to just not use these advanced features.

An alternative is to just use smaller dedicated nodes and (anti)affinity rules, so you always understand which pods go with which other pods. It's clunky but it's actually easier to reason about what's going to happen.
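
A rough sketch of the affinity route (labels are hypothetical):

    # pod fragment: keep replicas of this app off nodes that already run one,
    # and pin them to a dedicated node pool
    spec:
      nodeSelector:
        node-pool: latency               # hypothetical node label
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: checkout            # hypothetical app label
            topologyKey: kubernetes.io/hostname

It's more YAML and more node groups to manage, but you can look at the spec and know exactly which pods can end up sharing a box.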

EDIT: typos


> splitting up a large machine into smaller parts that can use the unused capacity of other parts in order to reduce waste is an advanced feature

That's an interesting take. For those of us who once used microcomputers back in the 1980s like Commodore 64s, Apple IIs, and IBM PCs running MS-DOS, sure. But time-sharing systems have been around since the 1960s and Unix dates back to the 1970s, and multi-user logins and resource controls like ulimits have been part of the picture for a very long time.

We've had plenty of time to get used to multitasking and resource contention and management. Is it complicated? It can be. But if you consider that containers are basically a program lifecycle/deployment and resource control abstraction on an OS that's had these features (maybe not cgroups and namespaces, but similar ones) since its birth, it's not really all that advanced.


The same arguments hold on old time-sharing systems. Setting up your limits in such a way that you can use excess capacity when available is very well suited (now and back then) to batch workloads. Scheduler priorities are a way to mix batch and interactive workloads.

When you're not latency sensitive but aiming at optimizing throughput, you can achieve pretty good utilization of the underlying resources. This is what setting requests below limits in k8s is good for: batch workloads.
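
For example, a batch worker would typically be declared with a small request and a much larger CPU limit, so it soaks up idle capacity but gets squeezed back toward its request under contention (fragment, names invented):

    # batch worker pod fragment
    containers:
    - name: worker
      image: example/etl:1.0        # placeholder
      resources:
        requests:
          cpu: "250m"               # what the scheduler reserves for it
        limits:
          cpu: "4"                  # how far it may burst when the node is idle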

Batch workloads still exist and they're important, but the proportion of batch vs latency-sensitive work shifted significantly from the 70s to the internet age. Request handlers that sit in the critical path for the user experience of a web or mobile app are not only latency sensitive but also _tail_ latency sensitive.

Tail latency often poses significant problems because stragglers can determine the total duration of the interaction as perceived by the user, even if only one straggler request suffers a slowdown.

While there are tricks to deal with tail latency and still improve utilization (i.e. reduce waste), they are hard to implement in practice and/or don't generalize well across workloads.

One thing always works and is your best option as a starting point: stop worrying about idle CPU as an _average_ usage metric. The behaviour during CPU usage spikes (even spikes shorter than the scraping period of your favourite monitoring tool!) determines the latency envelope of your service. Measure how much CPU you need during those spikes and plan accordingly.

Ideally it should be possible to let some low-priority (batch) process use that idle CPU, but AFAIK that's not currently possible with k8s.

EDIT: forgot to mention that all this discussion matters only inasmuch as you have strict latency targets. If you're ok with occasionally having very slow requests, and your customers are ok with that, and your bosses don't freak out because they spoke to that one customer who is very angry that their requests time out a few times every month ... I'm very happy for you! Not everybody has these requirements, and you can go pretty far with a production system consisting of a couple of VMs on AWS without k8s, containers, cgroups or whatever in the picture. I know people who do this and it works just fine for them. But in order to understand what the fuss is about, we need to frame this discussion as batch vs controlled-latency. Otherwise it's hard to explain why so many otherwise intelligent people are making choices that appear to be a bit silly, isn't it? Sure, there are people who are "just wrong" on the internet :-) but more often than not, if somebody ends up with a different solution it's because they have different goals and tradeoffs.


Imagine Amazon's order-history microservice causing an outage in the order-creation microservice. Regardless of monitoring and alerting, you never want a non-critical system to cause an outage in the most critical system you have.


In my experience working with containers -- both on my own behalf and for customers -- I can think of relatively few situations in which CPU limits were necessary or useful for the workload.

When you set CPU reservations properly in your container definitions, they don't "crawl[] to a halt" when other containers demand more CPU than they did before. Every container is guaranteed to get the CPU it requested the moment it needs it, regardless of what other containers are using right up till that moment. So the key is to set the requested CPU to the minimum the application owner needs in the contended case. That's why I say that "requests are limits in reverse" - a request declared by container A effectively creates an implicit limit on container B when container A needs the CPU.
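
To make "requests are limits in reverse" concrete (numbers invented): CPU requests are translated into cgroup CPU weights, so when a node is fully contended the CPU is divided in proportion to requests, with no limits needed.

    node: 4 cores, both containers spinning as hard as they can
    container A request: 1000m  ->  ends up with ~1 core
    container B request: 3000m  ->  ends up with ~3 cores

B can use all 4 cores while A is idle, but the moment A wants its core back it gets it; A's request effectively acts as a ceiling on B.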

Perhaps you're concerned that not setting limits causes oversubscription on a node. But this is not so: when scheduling pods, the K8S scheduler only considers requests, not limits. If every container is forced to declare a CPU request - as it should - then K8S will not overprovision pods onto a node, and every container will be entitled to all the CPU it requested when needed.
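
In other words, a request-only spec is enough for the scheduler's bookkeeping (fragment):

    resources:
      requests:
        cpu: "500m"    # the scheduler counts this against the node's allocatable CPU
      # no cpu limit: the container may burst into whatever is idle on the node

The sum of requests on a node never exceeds its allocatable capacity, so every container can always get at least what it asked for.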

Consider, too, that most applications are memory-bound and concurrency-bound, not CPU bound. And a typical web server is also demand-driven and load-balanced; it doesn't spontaneously consume CPU. For those kinds of applications, resource consumption is roughly consistent across all the containers, and the "pod running just fine a moment ago...crawling to a halt" isn't a phenomenon you're going to see, as long as they haven't yet consumed all the CPU they requested. Limits hurt more than they help, especially for these kinds of workloads, because being able to burst gives them headroom to serve while a horizontal scale-out process (adding more pods) is taking place.

Most CPU-bound workloads tend to be batch workloads (e.g. ETL, map/reduce). Teams who run those on shared nodes usually just want to get whatever CPU they can. These jobs tend not to have strict deadlines and the owners don't usually panic when a worker pod has its CPU throttled on a shared node. And those who do care about getting all the CPU available are frequently putting these workloads on dedicated nodes, often even dedicated clusters. And the nodes tend to be ephemeral - launched just-in-time to run the job and then terminated when the job has completed.
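
The dedicated-node pattern usually comes down to a taint on the node pool plus a matching toleration and selector on the job pods (all names hypothetical):

    # node pool reserved for batch, e.g.:
    #   kubectl taint nodes <node-name> dedicated=batch:NoSchedule
    #   kubectl label nodes <node-name> dedicated=batch
    # pod fragment:
    tolerations:
    - key: dedicated
      operator: Equal
      value: batch
      effect: NoSchedule
    nodeSelector:
      dedicated: batch

so the CPU-hungry jobs land only there, and those nodes can be spun up for the job and torn down afterwards.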

As others have pointed out, putting CPU limits on containers is a contributing cause of nodes having lower CPU utilization than they could otherwise sustain. Every node costs money -- whether you bought it or rent it -- so underutilized nodes represent wasted capital or cloud spend. You want nodes to run as hot as possible to get the most out of your investment. If your CFO/cloud financial team isn't breathing down your neck when they see idle nodes, they ought to be.



