This is a design flaw in Kubernetes. The article doesn't really explain what's happening, though. The real problem is that there is no synchronization between the ingress controller (which builds the ingress software's configuration, e.g. nginx's, from the Endpoints resources), kube-proxy (which manages iptables rules from the Endpoints resources), and the kubelet (which sends the signals to the container). A preStop hook with a sleep equal to an acceptable timeout will handle 99%+ of cases (and the cases it doesn't would have exceeded your timeout anyhow). Things become more complicated when there are sidecar containers (say an envoy or nginx routing to another container in the same pod); that often requires shenanigans such as shared emptyDir{} volumes plus a watcher (fsnotify or similar) that waits for socket files to be closed to ensure requests have fully completed.
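For concreteness, a minimal sketch of that preStop-sleep pattern (the image name and the 30-second value are placeholders, not from the article):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      # Leave room for the preStop sleep plus the app's own SIGTERM handling.
      terminationGracePeriodSeconds: 60
      containers:
      - name: web
        image: example.com/web:latest   # placeholder image
        ports:
        - containerPort: 8080
        lifecycle:
          preStop:
            exec:
              # Keep serving while ingress/kube-proxy/kubelet catch up on the Endpoints change.
              command: ["sleep", "30"]
```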
It's more of a design compromise than an outright flaw, though. Since you can't know whether your order to shut down a pod has arrived in a distributed system (per the CAP theorem), you either have to do it the way k8s already does or you have to accept potentially unbounded pod shutdown (and, by extension, new release rollout) durations during network partitions. K8s just chose Availability over Consistency in this case.
You can argue whether it would have been preferable to choose C over A instead (or, even better, to make this configurable), but in a distributed system you will always have to trade one of these two off. The hacks with shared emptyDir volumes just move the system back to "Consistency" mode, but in a hacky way.
I would say that's true for networking.k8s.io/v1beta1 Ingress, but not for networking.k8s.io/v1 which is much better.
There are still some issues around separation of "concerns", maybe, e.g.:
Should the Ingress also handle redirects? The ALB Ingress controller has its own annotation DSL to support this, and the nginx controller has a completely different annotation DSL for it. I don't think Envoy does, though.
But then there's the question of supporting CDNs; some controllers support it with annotations and some through `pathType: ImplementationSpecific` and a `backend.resource` CRD (which doesn't have to be a CRD; these could become native networking.k8s.io/v1 extensions in the future that controllers can opt in to support). This becomes great when combined with the operator framework (+ embedded kubebuilder).
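As a sketch, a v1 Ingress resource backend looks roughly like this (the API group and kind are hypothetical, standing in for whatever CRD a given controller exposes for CDN/bucket backends):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example
spec:
  rules:
  - http:
      paths:
      - path: /assets
        pathType: ImplementationSpecific
        backend:
          resource:
            apiGroup: cdn.example.com   # hypothetical CRD group
            kind: StorageBucket         # hypothetical kind the controller understands
            name: static-assets
```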
So, I think there's a lot of potential for things to get better.
A great success example in the ecosystem is cert-manager, that a lot of controllers rely on as a peer dependency in the cluster.
> A preStop hook with a sleep equal to an acceptable timeout will handle 99%+ of cases
That's precisely what we did at one of my previous clients. To increase portability, we wrote the smallest possible sleep equivalent in C, statically linked it, stuck it into a ConfigMap and mounted it into the pods so every workload would have the same preStop hook.
It was funny to watch when a new starter in the team would find out about that very elegant, stable and useful hack and go "wtf is going on here?" :D
This dealt with pretty much all our 5XXs due to unclean shutdowns.
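Roughly, that kind of setup can look like the following (all names, paths and the 30-second value are made up; the actual binary goes into the ConfigMap's binaryData):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prestop-hooks
binaryData:
  sleep: <base64 of the statically linked sleep binary>   # placeholder
---
apiVersion: v1
kind: Pod
metadata:
  name: example-workload
spec:
  volumes:
  - name: hooks
    configMap:
      name: prestop-hooks
      defaultMode: 0555              # mount the file as executable
  containers:
  - name: app
    image: example.com/app:latest    # placeholder image
    volumeMounts:
    - name: hooks
      mountPath: /hooks
      readOnly: true
    lifecycle:
      preStop:
        exec:
          command: ["/hooks/sleep", "30"]
```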
I mean, technically, you can recreate this scenario on a single host as well. Send a sigterm to an application and try to swap in another instance of it.
System fundamentals are at the heart of that problem: SIGTERM is just what it is, a signal. An application can choose to acknowledge it and do something, or catch it and ignore it, and the system has no way of knowing which the application chose.
All that to say, I'm not sure it's as much of a flaw in Kubernetes as much as it's the way systems work and Kubernetes is reflecting that.
In my view it is a clear flaw that the signal to terminate can arrive while the server is still getting new requests. Being able to steer traffic based on your knowledge of the state of the system is one of the reasons why you'd want to set up an integrated environment where the load-balancer and servers are controlled from the same process.
The time to send the signal is entirely under control of the managing process. It could synchronize with the load-balancer before sending pods the term signal, and I'm unclear why this isn't done.
I don't think there is anything reasonable to synchronize with that will guarantee no new connections. You can remove the address from the control plane synchronously, but the stale config might live on in the kubelet or kube-proxy distributed throughout the cluster. I don't think you want to have blocking synchronization with every node every time you want to stop a pod.
The alternative is that you wait some amount of time before dying instead of explicit synchronization, which is exactly what this lame-duck period is. You find out that you should die ASAP, and then you decide how long you want to wait until you actually die.
I don't really see an issue with adding synchronisation; there's no fundamental reason why having endpoint consumers acknowledge updates before terminating removed pods would be horrifically expensive, especially with EndpointSlices.
With 10,000 nodes running kube-proxy it is a bit expensive and, more importantly, error prone. A problem on a single node that wasn't even talking to the app could stop that app from exiting indefinitely if acks were required, and clusters this size already push gigabits of traffic in Endpoints watches.
Additionally, acks aren't possible for clients of headless services, so having just kube-proxy handle this doesn't go far enough.
But yeah, maybe accept that as a tradeoff for ClusterIP services, and more deeply integrate the real load balancer options.
Its design is good enough. There's just enough protocol to make it portable, and it's almost completely extensible so you can make it do basically anything.
This peculiar behavior (where the Service health check is unaware of the Deployment’s known instances) mirrors Google Compute Engine (where the httpLoadBalancer’s healthCheck is independent of the instanceGroupManager’s known instances). If you want your program to exit as soon as possible rather than waiting for a SIGKILL from the instance group manager, you have to hard-code the health check timeouts like so:
1. After a SIGTERM, the shutdownHook should keep the HTTP server running. Future /@status requests must return an error, but user requests must still succeed!
2. The shutdownHook sleeps for a minimum of the load balancer health check’s checkIntervalSec * unhealthyThreshold + timeoutSec (which by the way must be less than the instanceGroupManager’s health check’s checkIntervalSec * unhealthyThreshold if it uses the same endpoint)
3. Now the load balancer should not be sending new requests. The shutdownHook then waits for any existing requests to drain.
4. After requests drain, the shutdownHook can finally exit gracefully.
It is annoying to have to wait for the health check’s delay (rather than simply draining existing requests as in AWS), but it seems to be necessary for Google-designed load balancers and instance group managers.
With kubernetes you can at least add a preStop hook that sleeps for 60-120sec if the app is not designed to handle SIGTERM. The pod enters the terminating state just prior to executing the preStop hook.
This is something that plagued me when I started using Kubernetes because out of the box you will get 503s with a load balanced service.
If you're on EKS with the AWS Load Balancer Controller, after some research I ended up stumbling on https://github.com/foriequal0/pod-graceful-drain and never looked back. This has been working great so far. You install it once into your cluster and you're done. No dropped connections during a rollout and no need to set up preStop lifecycle hooks. The only downside I've seen so far is it takes slightly longer (about a minute or so) to terminate pods but I'd much rather have that than have to worry about 503s in production during a rollout.
typically a Service of type LoadBalancer is the same as a NodePort, except the controller manager will not write the status field, expecting some other piece of software to handle it.
the stock AWS controller does this so badly, you will reach one random pod per node, and probably not even on the same node, and sometimes the same pod twice, and this connection usually persists until the pod dies so scaling up is near useless. that's not a kubernetes problem
kube-proxy and ingress controllers usually rewrite their routing within 1-2 seconds - but I've seen some other designs that take up to 10 minutes to properly set up their load balancers (GKE). once again, not a kubernetes problem
This is a really neat idea, to intercept the pod deletion requests using an admission hooks controller and just delay it for a bit. Thanks for sharing!
Everyone that uses kubernetes must run into this at some point but we all end up doing exactly what is in this article. But we do it quietly because we all think we must be idiots since there is virtually no one else writing anything about it.
btw. this can be fixed with readiness probes. and applications should implement them.
i.e. the application still gets traffic, the readiness probe fails after SIGTERM, the application responds with an error on the probe, and the load balancer/service sees that and removes the endpoint. The application can still respond to the in-flight requests, since it should not stop directly on SIGTERM; once everything in flight is done, it should stop.
(this also works with non http, if SIGTERM is handled)
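A sketch of the probe side of that pattern, assuming the app exposes a /ready endpoint that it deliberately starts failing once it has received SIGTERM (path, port and timings are made up):

```yaml
containers:
- name: app
  image: example.com/app:latest   # placeholder image
  ports:
  - containerPort: 8080
  readinessProbe:
    httpGet:
      path: /ready        # assumption: app returns 5xx here after SIGTERM
      port: 8080
    periodSeconds: 2      # poll often so the endpoint is removed quickly
    failureThreshold: 1
```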
No, sorry but I don't think you understood the article. Readiness probes don't fix this. In fact, they could exacerbate the problem (although you obviously need them anyway). The problem is Kubernetes and the Load Balancer react to pod status changes at different times. (In fact, it's worse - different Kubernetes components also react to status changes at different times.) The load balancer is almost always too slow to react, so, it sends traffic to an instance that has already started shutting down.
Their proposed solution isn't really a solution either, and will probably make things worse.
Current behavior - Kubernetes sends SIGTERM signal to your container, it starts the shutdown process, stops the web server, and finishes up the requests in flight. Ingress controller continues to send some requests to the pod regardless, and they fail to connect. The client can handle these failures in whatever way it deems fit.
Their solution - Kubernetes sends SIGTERM signal, container ignores it and keeps taking new requests. After a while the process is forcefully killed, all requests in flight fail in unpredictable ways, and other resources (like DB connections) may be left dangling.
If you're running the nginx ingress controller, consider setting the nginx.ingress.kubernetes.io/service-upstream annotation to "true" instead of the default (false).
---
By default the NGINX ingress controller uses a list of all endpoints (Pod IP/port) in the NGINX upstream configuration.
The nginx.ingress.kubernetes.io/service-upstream annotation disables that behavior and instead uses a single upstream in NGINX, the service's Cluster IP and port.
This can be desirable for things like zero-downtime deployments.
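For illustration, the annotation sits on the Ingress object (host and service names below are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  annotations:
    # Route to the Service's ClusterIP instead of the individual pod IPs.
    nginx.ingress.kubernetes.io/service-upstream: "true"
spec:
  ingressClassName: nginx
  rules:
  - host: web.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web
            port:
              number: 80
```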
This is not a good idea; you're doubling the load on conntrack.
Maybe ingress-nginx should fix their config generation instead. Had a cluster with at least one change per second to the ingresses, and nginx would regularly just die (and orphan existing requests after 20 seconds).
ah good tip, I don't care about session affinity or custom balancing algos so that works. I'd imagine running in GKE or AWS you would also avoid the DNAT / conntrack overhead as pods by default use a routable VPC IP instead of a magic CNI IP. Would have to test that though.
Quote from related issue:
The NGINX ingress controller does not use Services to route traffic to the pods. Instead it uses the Endpoints API in order to bypass kube-proxy to allow NGINX features like session affinity and custom load balancing algorithms. It also removes some overhead, such as conntrack entries for iptables DNAT.
How does this remove conntrack overhead? It should add more, because the node with the ingress controller now has to hold an extra <Cluster IP, Cluster Port, Pod IP, Pod Port> mapping, regardless of which CNI is used (flannel in gateway mode also eliminates this overhead, by the way - you just need to make sure there is nothing like the default EC2 source/destination check in place).
Yes! I think this is a really under-reported issue. It's basically caused by Kubernetes doing things without confirming everyone responded to prior status updates. It affects every ingress controller, and it also affects Services of type LoadBalancer, and there isn't a real fix. Even if you add a timeout in the preStop hook, that still might not handle it 100% of the time. IMO it is a design flaw in Kubernetes.
Not defending the situation, but with a preStop hook, at least in the case of APIs, k8s can handle it 100%; it's just messy.
We have a preStop hook of 62s. 60s timeouts are set in our apps, 61s is set on the ALBs (ensuring the ALB is never the cause of the hangup), and 62s on the preStop to make sure nothing has come into the container in the last 62s.
Then we set a terminationGracePeriodSeconds of 60 just to make sure it doesn't pop off too fast. This gives us 120s where nothing happens and anything in flight can get to where it's going.
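As a sketch, the pod-side half of those numbers (the app and ALB timeouts live elsewhere; the image name is a placeholder):

```yaml
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: api
    image: example.com/api:latest   # placeholder image
    lifecycle:
      preStop:
        exec:
          # 62s: longer than the 60s app timeout and the 61s ALB timeout.
          command: ["sleep", "62"]
```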
I think K8s Secrets get a bad rap. They are not intended to be secret in the sense that they are "kept from prying eyes by default". The Secret object is simply a first-class citizen that differentiates it from a ConfigMap in a way that allows distinct ACLs.
Most organizations I know will still use something like ExternalSecret for source control and then populate the Secret with the values once in the cluster, in an object with very few access points.
I think calling it a secret when it isn't gave it a bad rap. The last time I looked at the documentation, it didn't even clearly describe that it is not a secure object (that may have changed recently). Why call it a secret when it is not even close to one? I guess thing-to-store-secrets-if-you-use-rbac was too long.
Yes I understand that. My point is until you configure it in that way it is not “secret” and the name of the object is a bit misleading, especially when first learning k8s.
They're base64 encoded because they can be binary data; it's got nothing to do with hiding their value. K8s secrets are for delineating secret (as opposed to non-secret, ConfigMap) data, so that access to it can be controlled differently.
You can set up encryption at rest, you can use RBAC to control the access, etc — those features are possible because Secret gives a specific resource for secret data.
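For instance, encryption at rest can be scoped to Secrets only via the API server's EncryptionConfiguration (key name and material below are placeholders):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets                 # encrypt Secrets; ConfigMaps stay plain
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key>   # placeholder
      - identity: {}            # still read data written before encryption was enabled
```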
Because these objects were defined earlier in Kubernetes' history the fields have inconsistent names and defaults. In Secret there is a canonical data field which stores bytes and a stringData field which will convert text to bytes for you. ConfigMap has separate data (text) and binaryData (bytes) fields which are both canonical.
If the interface were redesigned today, Secrets would probably look like a renamed clone of ConfigMap.
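To illustrate the field differences (values are dummies):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: example-secret
data:
  token: c3VwZXItc2VjcmV0     # canonical field: base64-encoded bytes ("super-secret")
stringData:
  username: admin             # convenience field: plain text, converted to bytes for you
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-config
data:
  log_level: debug            # canonical text field
binaryData:
  blob.bin: AAECAwQF          # canonical bytes field, base64-encoded
```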
It is in your hands (the version where it became available has been end-of-life for more than a year, basically forever in Kubernetes terms); maybe they will change the default too. At least there's a nice bold warning box in the docs.
- You can configure etcd to encrypt Secrets without taking the encryption performance hit on ConfigMaps
- You can configure the audit logs to log the diff whenever a ConfigMap was created or updated while only logging metadata and redacting content when Secrets are created or updated
- You can configure RBAC policies that grant access to ConfigMaps without Secrets (e.g. for a controller or operator)
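A minimal sketch of that last point (namespace and names are placeholders):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader      # placeholder name
  namespace: default
rules:
- apiGroups: [""]
  resources: ["configmaps"]   # note: no "secrets" listed
  verbs: ["get", "list", "watch"]
```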
This is why we inject `preStop: ["/bin/sleep", "30"]` in every kubernetes PodSpec via an admission controller.
Kubernetes Services/Ingresses are designed to be event-driven and eventually consistent. For example, the nginx ingress controller keeps a direct list of Pod IPs as backends and removes them asynchronously as it receives pod lifecycle events. In real life, this means the pod can be shut down well before the endpoint is actually removed, leading to elevated error rates. The preStop hook is necessary to avoid request failures, at the expense of a longer pod shutdown.
Is this true of the OOTB GKE nginx ingress? Hard to tell; by 'load balancer' do they mean the nginx ingress reverse proxy?
I can imagine the delay in updating the GCP global load balancer from GKE would be much higher than nginx-ingress reacting to changes in pod health/endpoints.
Either way, I guess the takeaway is that there is a race between endpoints being updated and those updates propagating, and it seems like that isn't handled as cleanly as people assume; this likely gets worse with node contention and Kube API performance problems.
Wait... SIGTERM can be handled to gracefully shut down your pod, i.e. finish handling all existing requests before closing.
I'm not sure what the issue is here: that SIGTERM is sent before removal from the load balancer? If you handle SIGTERM gracefully, doesn't this solve your issue...
I'm no k8s expert so maybe I am misunderstanding the problem
> I'm not sure what the issue is here: that SIGTERM is sent before removal from the load balancer?
That's exactly the issue. It'd be better for the orchestration system to handle removing the service from the LB in response to a configuration change than to wait for e.g. health checks to fail.
The SIGTERM method means you have to modify every service to accept SIGTERM (no big deal) and keep running for some amount of time (a fiddly configuration detail), and in the meantime wait for the LB to see the terminating pod's health check fail so it can stop sending it traffic.
If we flipped it around a bit, having the LB remove the traffic before sending a term signal, we could be more assured that poorly behaving pods (pods that die too quickly or too slowly) don't result in user-facing issues.
Isn't this fundamentally k8s treating the pods as cattle?
Your service could fail ungracefully at any time for all kinds of reasons, so the client needs to be somewhat prepared (with retries or whatever). If your client can retry then this doesn't matter unless you have no replication? Am I missing something?
Non-graceful shutdowns must be handled anyway (generally with client-side retries), so this shouldn't be an issue in practice. Kubernetes is intended for distributed/clustered services, not traditional HA services.
NEG is "network endpoint groups" which basically means that Load Balancers deal not with nodes, but with pods directly. So LBs can know what state the pod is. You can even avoid using Ingress (instead, annotate your k8s Service with neg name, and use backend service for that neg in LB), so it'd be possible to combine k8s and non-k8s services under the same Load Balancer, e.g. for migration.
But I'm afraid that, NEG or no NEG, the LB still relies on health checks anyway, so it doesn't matter.
Would this be mitigated if the ingress load balancer also did health checks with the pods? So when a pod goes into shutdown state, the ingress can detect and stop routing traffic.
Health checks typically only run at like 30sec frequency. Even if you set it to 1sec that’s still a big window for requests to slip in before ingress detects the pod is gone.