This is a design flaw in Kubernetes. The article doesn't really explain what's happening, though. The real problem is that there is no synchronization between the ingress controller (which builds the ingress software's configuration, e.g. nginx's, from the Endpoints resources), kube-proxy (which manages iptables rules from the Endpoints resources), and the kubelet (which sends the signals to the container). A preStop hook with a sleep equal to an acceptable timeout will handle 99%+ of cases (and the cases it doesn't would have exceeded your timeout anyhow). Things become more complicated when there are sidecar containers (say an envoy or nginx routing to another container in the same pod); that often requires shenanigans such as shared emptyDir{} volumes plus a watcher (fsnotify or similar) that waits for socket files to be closed to ensure requests have fully completed.
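For concreteness, a minimal sketch of that preStop-sleep pattern (the image name and the 30-second value are placeholders, not from the article):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      # Leave room for the preStop sleep plus the app's own SIGTERM handling.
      terminationGracePeriodSeconds: 60
      containers:
      - name: web
        image: example.com/web:latest   # placeholder image
        ports:
        - containerPort: 8080
        lifecycle:
          preStop:
            exec:
              # Keep serving while ingress/kube-proxy/kubelet catch up on the Endpoints change.
              command: ["sleep", "30"]
```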
It's more of a design compromise than an outright flaw, though. Since you can't know whether your order to shut down a pod has arrived in a distributed system (per the CAP theorem), you either have to do it the way k8s already does or you have to accept potentially unbounded pod shutdown (and, by extension, new release rollout) durations during network partitions. K8s just chose Availability over Consistency in this case.
You can argue whether it would have been preferable to choose C over A instead (or, even better, to make this configurable), but in a distributed system you will always have to trade one of these two off. The hacks with shared emptyDir volumes just move the system back to "Consistency" mode, but in a hacky way.
I would say that's true for networking.k8s.io/v1beta1 Ingress, but not for networking.k8s.io/v1 which is much better.
There are still some issues around separation of "concerns", maybe, e.g.:
Should the Ingress also handle redirects? The ALB Ingress controller has its own annotation DSL to support this, and the nginx controller has a completely different annotation DSL for it. I don't think Envoy does, though.
But then there's the question of supporting CDNs; some controllers support it with annotations and some through `pathType: ImplementationSpecific` and a `backend.resource` CRD (which doesn't have to be a CRD; these could become native networking.k8s.io/v1 extensions in the future that controllers can opt in to support). This becomes great when combined with the operator framework (+ embedded kubebuilder).
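As a sketch, a v1 Ingress resource backend looks roughly like this (the API group and kind are hypothetical, standing in for whatever CRD a given controller exposes for CDN/bucket backends):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example
spec:
  rules:
  - http:
      paths:
      - path: /assets
        pathType: ImplementationSpecific
        backend:
          resource:
            apiGroup: cdn.example.com   # hypothetical CRD group
            kind: StorageBucket         # hypothetical kind the controller understands
            name: static-assets
```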
So, I think there's a lot of potential for things to get better.
A great success example in the ecosystem is cert-manager, that a lot of controllers rely on as a peer dependency in the cluster.
> A preStop hook with a sleep equal to an acceptable timeout will handle 99%+ of cases
That's precisely what we did at one of my previous clients. To increase portability, we wrote the smallest possible sleep equivalent in C, statically linked it, stuck it into a ConfigMap and mounted it into the pods so every workload would have the same preStop hook.
It was funny to watch when a new starter in the team would find out about that very elegant, stable and useful hack and go "wtf is going on here?" :D
This dealt with pretty much all our 5XXs due to unclean shutdowns.
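Roughly, that kind of setup can look like the following (all names, paths and the 30-second value are made up; the actual binary goes into the ConfigMap's binaryData):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prestop-hooks
binaryData:
  sleep: <base64 of the statically linked sleep binary>   # placeholder
---
apiVersion: v1
kind: Pod
metadata:
  name: example-workload
spec:
  volumes:
  - name: hooks
    configMap:
      name: prestop-hooks
      defaultMode: 0555              # mount the file as executable
  containers:
  - name: app
    image: example.com/app:latest    # placeholder image
    volumeMounts:
    - name: hooks
      mountPath: /hooks
      readOnly: true
    lifecycle:
      preStop:
        exec:
          command: ["/hooks/sleep", "30"]
```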
I mean, technically, you can recreate this scenario on a single host as well. Send a sigterm to an application and try to swap in another instance of it.
System fundamentals are at the heart of that problem: SIGTERM is just what it is, a signal. An application can choose to acknowledge it and do something, or catch it and ignore it, and the system has no way of knowing which the application chose.
All that to say, I'm not sure it's as much of a flaw in Kubernetes as much as it's the way systems work and Kubernetes is reflecting that.
In my view it is a clear flaw that the signal to terminate can arrive while the server is still getting new requests. Being able to steer traffic based on your knowledge of the state of the system is one of the reasons why you'd want to set up an integrated environment where the load-balancer and servers are controlled from the same process.
The time to send the signal is entirely under control of the managing process. It could synchronize with the load-balancer before sending pods the term signal, and I'm unclear why this isn't done.
I don't think there is anything reasonable to synchronize with that will guarantee no new connections. You can remove the address from the control plane synchronously, but the stale config might live on in the kubelet or kube-proxy distributed throughout the cluster. I don't think you want to have blocking synchronization with every node every time you want to stop a pod.
The alternative is that you wait some amount of time before dying instead of explicit synchronization, which is exactly what this lame-duck period is. You find out that you should die ASAP, and then you decide how long you want to wait until you actually die.
I don't really see an issue with adding synchronisation; there's no fundamental reason why having endpoint consumers acknowledge updates before terminating removed pods would be horrifically expensive, especially with EndpointSlices.
With 10,000 nodes running kube-proxy it is a bit expensive and, more importantly, error prone. A problem on a single node that wasn't even talking to the app could stop that app from exiting indefinitely if acks were required, and clusters this size already push gigabits of traffic in Endpoints watches.
Additionally, acks aren't possible for clients of headless services, so having just kube-proxy handle this doesn't go far enough.
But yeah, maybe accept that as a tradeoff for ClusterIP services, and more deeply integrate the real load balancer options.
Its design is good enough. There's just enough protocol to make it portable, and it's almost completely extensible so you can make it do basically anything.
This peculiar behavior (where the Service health check is unaware of the Deployment’s known instances) mirrors Google Compute Engine (where the httpLoadBalancer’s healthCheck is independent of the instanceGroupManager’s known instances). If you want your program to exit as soon as possible rather than waiting for a SIGKILL from the instance group manager, you have to hard-code the health check timeouts like so:
1. After a SIGTERM, the shutdownHook should keep the HTTP server running. Future /@status requests must return an error, but user requests must still succeed!
2. The shutdownHook sleeps for a minimum of the load balancer health check’s checkIntervalSec * unhealthyThreshold + timeoutSec (which by the way must be less than the instanceGroupManager’s health check’s checkIntervalSec * unhealthyThreshold if it uses the same endpoint)
3. Now the load balancer should not be sending new requests. The shutdownHook then waits for any existing requests to drain.
4. After requests drain, the shutdownHook can finally exit gracefully.
It is annoying to have to wait for the health check’s delay (rather than simply draining existing requests as in AWS), but it seems to be necessary for Google-designed load balancers and instance group managers.
With kubernetes you can at least add a preStop hook that sleeps for 60-120sec if the app is not designed to handle SIGTERM. The pod enters the terminating state just prior to executing the preStop hook.
This is something that plagued me when I started using Kubernetes because out of the box you will get 503s with a load balanced service.
If you're on EKS with the AWS Load Balancer Controller, after some research I ended up stumbling on https://github.com/foriequal0/pod-graceful-drain and never looked back. This has been working great so far. You install it once into your cluster and you're done. No dropped connections during a rollout and no need to set up preStop lifecycle hooks. The only downside I've seen so far is it takes slightly longer (about a minute or so) to terminate pods but I'd much rather have that than have to worry about 503s in production during a rollout.
typically a Service of type LoadBalancer is the same as a NodePort, except the controller manager will not write the status field, expecting some other piece of software to handle it.
the stock AWS controller does this so badly, you will reach one random pod per node, and probably not even on the same node, and sometimes the same pod twice, and this connection usually persists until the pod dies so scaling up is near useless. that's not a kubernetes problem
kube-proxy and ingress controllers usually rewrite their routing within 1-2 seconds - but I've seen some other designs that take up to 10 minutes to properly set up their load balancers (GKE). once again, not a kubernetes problem
This is a really neat idea, to intercept the pod deletion requests using an admission hooks controller and just delay it for a bit. Thanks for sharing!
Everyone that uses kubernetes must run into this at some point but we all end up doing exactly what is in this article. But we do it quietly because we all think we must be idiots since there is virtually no one else writing anything about it.
btw. this can be fixed with readiness probes. and applications should implement them.
i.e. the application still gets traffic, the readiness probe fails after SIGTERM, the application responds with an error on the probe, and the load balancer/service sees that and removes the endpoint. The application can still respond to the in-flight requests, since it should not stop directly on SIGTERM; once everything in flight is done, it should stop.
(this also works with non http, if SIGTERM is handled)
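A sketch of the probe side of that pattern, assuming the app exposes a /ready endpoint that it deliberately starts failing once it has received SIGTERM (path, port and timings are made up):

```yaml
containers:
- name: app
  image: example.com/app:latest   # placeholder image
  ports:
  - containerPort: 8080
  readinessProbe:
    httpGet:
      path: /ready        # assumption: app returns 5xx here after SIGTERM
      port: 8080
    periodSeconds: 2      # poll often so the endpoint is removed quickly
    failureThreshold: 1
```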
No, sorry but I don't think you understood the article. Readiness probes don't fix this. In fact, they could exacerbate the problem (although you obviously need them anyway). The problem is Kubernetes and the Load Balancer react to pod status changes at different times. (In fact, it's worse - different Kubernetes components also react to status changes at different times.) The load balancer is almost always too slow to react, so, it sends traffic to an instance that has already started shutting down.
Their proposed solution isn't really a solution either, and will probably make things worse.
Current behavior - Kubernetes sends SIGTERM signal to your container, it starts the shutdown process, stops the web server, and finishes up the requests in flight. Ingress controller continues to send some requests to the pod regardless, and they fail to connect. The client can handle these failures in whatever way it deems fit.
Their solution - Kubernetes sends SIGTERM signal, container ignores it and keeps taking new requests. After a while the process is forcefully killed, all requests in flight fail in unpredictable ways, and other resources (like DB connections) may be left dangling.
If you're running the nginx ingress controller, consider setting the nginx.ingress.kubernetes.io/service-upstream annotation to "true" instead of the default (false).
---
By default the NGINX ingress controller uses a list of all endpoints (Pod IP/port) in the NGINX upstream configuration.
The nginx.ingress.kubernetes.io/service-upstream annotation disables that behavior and instead uses a single upstream in NGINX, the service's Cluster IP and port.
This can be desirable for things like zero-downtime deployments.
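For illustration, the annotation sits on the Ingress object (host and service names below are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  annotations:
    # Route to the Service's ClusterIP instead of the individual pod IPs.
    nginx.ingress.kubernetes.io/service-upstream: "true"
spec:
  ingressClassName: nginx
  rules:
  - host: web.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web
            port:
              number: 80
```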
This is not a good idea; you're doubling the load on conntrack.
Maybe ingress-nginx should fix their config generation instead. Had a cluster with at least one change per second to the ingresses, and nginx would regularly just die (and orphan existing requests after 20 seconds).
ah good tip, I don't care about session affinity or custom balancing algos so that works. I'd imagine running in GKE or AWS you would also avoid the DNAT / conntrack overhead as pods by default use a routable VPC IP instead of a magic CNI IP. Would have to test that though.
Quote from related issue:
The NGINX ingress controller does not use Services to route traffic to the pods. Instead it uses the Endpoints API in order to bypass kube-proxy to allow NGINX features like session affinity and custom load balancing algorithms. It also removes some overhead, such as conntrack entries for iptables DNAT.
How does this remove conntrack overhead? It should add more, because the node with the ingress controller now has to hold an extra <Cluster IP, Cluster Port, Pod IP, Pod Port> mapping, regardless of which CNI is used (flannel in gateway mode also eliminates this overhead, by the way - you just need to make sure there is nothing like the default EC2 source/destination check in place).
Yes! I think this is a really under-reported issue. It's basically caused by Kubernetes doing things without confirming everyone responded to prior status updates. It affects every ingress controller, and it also affects Services of type LoadBalancer, and there isn't a real fix. Even if you add a timeout in the preStop hook, that still might not handle it 100% of the time. IMO it is a design flaw in Kubernetes.
Not defending the situation, but with a preStop hook, at least in the case of APIs, k8s can handle it 100%; it's just messy.
We have a preStop hook of 62s. 60s timeouts are set in our apps, 61s is set on the ALBs (ensuring the ALB is never the cause of the hangup), and 62s on the preStop to make sure nothing has come into the container in the last 62s.
Then we set a terminationGracePeriodSeconds of 60 just to make sure it doesn't pop off too fast. This gives us 120s where nothing happens and anything in flight can get to where it's going.
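As a sketch, the pod-side half of those numbers (the app and ALB timeouts live elsewhere; the image name is a placeholder):

```yaml
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: api
    image: example.com/api:latest   # placeholder image
    lifecycle:
      preStop:
        exec:
          # 62s: longer than the 60s app timeout and the 61s ALB timeout.
          command: ["sleep", "62"]
```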
I think K8s Secrets get a bad rap. They are not intended to be secret in the sense that they are "kept from prying eyes by default". The Secret object is simply a first-class citizen that differentiates it from a ConfigMap in a way that allows distinct ACLs.
Most organizations I know will still use something like ExternalSecret for source control and then populate the Secret with the values once in the cluster, in an object with very few access points.
I think calling it a secret when it isn't gave it a bad rap. The last time I looked at the documentation, it didn't even clearly describe that it is not a secure object (that may have changed recently). Why call it a secret when it is not even close to one? I guess thing-to-store-secrets-if-you-use-rbac was too long.
Yes I understand that. My point is until you configure it in that way it is not “secret” and the name of the object is a bit misleading, especially when first learning k8s.
They're base64 encoded because they can be binary data; it's got nothing to do with hiding their value. K8s secrets are for delineating secret (as opposed to non-secret, ConfigMap) data, so that access to it can be controlled differently.
You can set up encryption at rest, you can use RBAC to control the access, etc — those features are possible because Secret gives a specific resource for secret data.
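For instance, encryption at rest can be scoped to Secrets only via the API server's EncryptionConfiguration (key name and material below are placeholders):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets                 # encrypt Secrets; ConfigMaps stay plain
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key>   # placeholder
      - identity: {}            # still read data written before encryption was enabled
```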
Because these objects were defined earlier in Kubernetes' history the fields have inconsistent names and defaults. In Secret there is a canonical data field which stores bytes and a stringData field which will convert text to bytes for you. ConfigMap has separate data (text) and binaryData (bytes) fields which are both canonical.
If the interface were redesigned today, Secrets would probably look like a renamed clone of ConfigMap.
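To illustrate the field differences (values are dummies):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: example-secret
data:
  token: c3VwZXItc2VjcmV0     # canonical field: base64-encoded bytes ("super-secret")
stringData:
  username: admin             # convenience field: plain text, converted to bytes for you
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-config
data:
  log_level: debug            # canonical text field
binaryData:
  blob.bin: AAECAwQF          # canonical bytes field, base64-encoded
```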
It is in your hands (the version where it became available has been end-of-life for more than a year, basically forever in Kubernetes terms); maybe they will change the default too. At least there's a nice bold warning box in the docs.
- You can configure etcd to encrypt Secrets without taking the encryption performance hit on ConfigMaps
- You can configure the audit logs to log the diff whenever a ConfigMap was created or updated while only logging metadata and redacting content when Secrets are created or updated
- You can configure RBAC policies that grant access to ConfigMaps without Secrets (e.g. for a controller or operator)
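A minimal sketch of that last point (namespace and names are placeholders):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader      # placeholder name
  namespace: default
rules:
- apiGroups: [""]
  resources: ["configmaps"]   # note: no "secrets" listed
  verbs: ["get", "list", "watch"]
```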
This is why we inject `preStop: ["/bin/sleep", "30"]` in every kubernetes PodSpec via an admission controller.
Kubernetes Services/Ingresses are designed to be event-driven and eventually consistent. For example, the nginx ingress controller keeps a direct list of Pod IPs as backends and removes them asynchronously as it receives pod lifecycle events. In real life, this means the pod can be shut down well before the endpoint is actually removed, leading to elevated error rates. The preStop hook is necessary to avoid request failures, at the expense of a longer pod shutdown.
Is this true of the OOTB GKE nginx ingress? Hard to tell; by 'load balancer' do they mean the nginx ingress reverse proxy?
I can imagine the delay in updating the GCP global load balancer from GKE would be much higher than nginx-ingress reacting to changes in pod health/endpoints.
Either way, I guess the takeaway is that there is a race between endpoints being updated and those updates propagating, and it seems like that isn't handled as cleanly as people assume; this likely gets worse with node contention and Kube API performance problems.
Wait... SIGTERM can be handled to gracefully shut down your pod, i.e. finish handling all existing requests before closing.
I'm not sure what the issue is here: that SIGTERM is sent before removal from the load balancer? If you handle SIGTERM gracefully, doesn't this solve your issue...
I'm no k8s expert so maybe I am misunderstanding the problem
> I'm not sure what the issue is here: that SIGTERM is sent before removal from the load balancer?
That's exactly the issue. It'd be better for the orchestration system to handle removing the service from the LB in response to a configuration change than to wait for e.g. health checks to fail.
The SIGTERM method means you have to modify every service to accept SIGTERM (no big deal) and keep running for some amount of time (a fiddly configuration detail), and in the meantime wait for the LB to see the terminating pod's health check fail so it can stop sending it traffic.
If we flipped it around a bit, having the LB remove the traffic before sending a term signal, we could be more assured that poorly behaving pods (pods that die too quickly or too slowly) don't result in user-facing issues.
Isn't this fundamentally k8s treating the pods as cattle?
Your service could fail ungracefully at any time for all kinds of reasons, so the client needs to be somewhat prepared (with retries or whatever). If your client can retry then this doesn't matter unless you have no replication? Am I missing something?
Non-graceful shutdowns must be handled anyway (generally with client-side retries), so this shouldn't be an issue in practice. Kubernetes is intended for distributed/clustered services, not traditional HA services.
NEG is "network endpoint groups" which basically means that Load Balancers deal not with nodes, but with pods directly. So LBs can know what state the pod is. You can even avoid using Ingress (instead, annotate your k8s Service with neg name, and use backend service for that neg in LB), so it'd be possible to combine k8s and non-k8s services under the same Load Balancer, e.g. for migration.
But I'm afraid that, NEG or no NEG, the LB still relies on health checks anyway, so it doesn't matter.
Would this be mitigated if the ingress load balancer also did health checks with the pods? So when a pod goes into shutdown state, the ingress can detect and stop routing traffic.
Health checks typically only run at like 30sec frequency. Even if you set it to 1sec that’s still a big window for requests to slip in before ingress detects the pod is gone.