Debugging Network Stalls on Kubernetes (github.blog)
180 points by chmaynard on Nov 22, 2019 | 47 comments



What I'm struggling to understand is why it is a good idea to do IP inside IP.

Giving containers their own NIC and IP isn't hard, and in most cases it's far faster than spending time on some immature tunnelling protocol.

I can see that doing multi-cloud deploys with transparent failover might benefit from a VPN, but short of operating on a hostile network I can't see it being worth the heartache to deploy an overlay network.


This is actually interesting. We chose to use an overlay network because Kubernetes was a new, complex system that we initially didn't have experience with internally, so we wanted to isolate any problems it could create as much as possible, and ensure that teams working on the Kubernetes deployment weren't blocked by needing network engineering time.

This meant that we felt an overlay network was the most pragmatic (works out of the box) solution, even though IPIP has some drawbacks: it adds complexity (within the Kubernetes cluster) and it hashes poorly across router ECMP / NIC RX queues (neither looks into the inner IP packet as part of hashing). It was definitely a concern of ours, though, considering we'd very intentionally avoided IPIP in our load balancer (https://github.blog/2018-08-08-glb-director-open-source-load... has a write-up of why).

We've considered the alternative of announcing a /24 or /25 or similar via BGP from each Kubernetes node, which is supported by systems like Calico, but so far the limitations of IPIP haven't caused an issue, because our Kubernetes clusters are large enough to hash evenly even with the overlay network added, so we haven't needed to migrate. It's definitely an interesting trade-off on complexity overall, though: simple routed networks are much easier to understand than another layer of encapsulation.
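
To make the hashing point concrete, here is a toy sketch (illustrative only; real routers and NICs use their own hash functions, and all addresses below are made up) of why IPIP flows collapse: the outer header between two nodes looks the same for every tunnelled flow, so a hash that only sees the outer header maps them all onto the same ECMP link / RX queue:

    import hashlib

    def bucket(src, dst, proto, sport=0, dport=0, n_links=8):
        # Toy stand-in for an ECMP/RSS hash that only sees the fields given.
        key = f"{src}|{dst}|{proto}|{sport}|{dport}".encode()
        return int(hashlib.sha256(key).hexdigest(), 16) % n_links

    # Outer IPIP header (protocol 4) between the same two nodes, no L4 ports:
    # every pod-to-pod flow between this node pair lands in the same bucket.
    print(bucket("10.0.0.1", "10.0.0.2", proto=4))
    print(bucket("10.0.0.1", "10.0.0.2", proto=4))

    # If the hash could see the inner TCP 5-tuples, the flows would spread out.
    print(bucket("10.244.1.5", "10.244.2.9", proto=6, sport=41234, dport=8080))
    print(bucket("10.244.1.6", "10.244.2.9", proto=6, sport=55120, dport=8080))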


This, a thousand times!

You don't even need to give them actual NICs, just veth pairs with routing (some CNI plugins do just that, btw).
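
For anyone unsure what "veth pairs with routing" means in practice, a rough sketch (the interface names, namespace, and addresses are made up; a real CNI plugin does this over netlink rather than shelling out, and the host end also needs proxy ARP or an on-link gateway for the pod's default route to resolve):

    import subprocess

    def ip(*args):
        # Thin wrapper around the iproute2 "ip" command.
        subprocess.run(["ip", *args], check=True)

    POD_NS, POD_IP = "pod1", "10.244.1.10"  # hypothetical namespace and pod IP

    ip("netns", "add", POD_NS)
    ip("link", "add", "veth-host1", "type", "veth", "peer", "name", "veth-pod1")
    ip("link", "set", "veth-pod1", "netns", POD_NS)

    # Pod side: address plus a default route out of its end of the pair.
    ip("-n", POD_NS, "addr", "add", f"{POD_IP}/32", "dev", "veth-pod1")
    ip("-n", POD_NS, "link", "set", "veth-pod1", "up")
    ip("-n", POD_NS, "route", "add", "default", "dev", "veth-pod1")

    # Host side: a plain /32 route to the pod -- no bridge, no overlay.
    ip("link", "set", "veth-host1", "up")
    ip("route", "add", f"{POD_IP}/32", "dev", "veth-host1")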

I think a big part of why overlay networks are so common is that a lot of the engineers deploying these systems have a shaky or non-existent understanding of networking and routing. They are stuck in the "local network and a default route" model, and if that's all you have, an overlay it is...


Sorry, yes, I should have said veth.

At least that way you can dump them on an entirely different VLAN if you want separation. The best part is you're using your cloud provider's networking tools (let's be honest, it is a cloud, and they are way more likely to work).


Do you know what CNI plugins support veth pairs?


It's about routing more than veth pairs (I mean, you can also do simple bridging as the other comment says, but it's more limited), so look for the BGP-using ones, mainly Calico, but kube-router too.

I know there are simpler ones that don't use BGP, but I don't remember their names. For example, there is one for AWS that uses ENI interfaces and passes them directly to containers (you're obviously limited by the number of ENIs you can attach to a particular host, then). There is also one that configures routes in the VPC routing table, and a similar one for GCP.
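
As a rough idea of what the "routes in the VPC routing table" style does under the hood, it's essentially one route per node, pointing that node's pod CIDR at its instance. A hedged sketch with boto3 (the route table ID, CIDR, and instance ID below are placeholders):

    import boto3

    ec2 = boto3.client("ec2")

    # One route per node: this node's pod CIDR -> this node's EC2 instance.
    ec2.create_route(
        RouteTableId="rtb-0123456789abcdef0",
        DestinationCidrBlock="10.244.3.0/24",
        InstanceId="i-0abcdef1234567890",
    )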


I've tried using Bridget[1] before, but didn't actually manage to get it working properly (although the problem was likely firewall related).

[1] https://github.com/kvaps/bridget



The built-in bridge one does. Maybe you meant something else by your question though?


This is more or less how azure-cni works, though you only have a single NIC that's assigned multiple IPs by Azure.


I keep seeing articles showing up on debugging network issues, DNS or something else. Makes me wonder how much engineering effort we are putting in to fight the tool. Have these people considered/investigated other tools like Nomad (HashiCorp)? How much value does Kubernetes actually add versus these issues?


Kubernetes is the CORBA of this generation. It will float around because of a few heavyweights preaching it and suckers falling for it. Eventually it will die, like all overly complex, hideously designed, and poorly implemented things do. You can save your time by just ignoring this nonsense.


> Kubernetes is the CORBA of this generation. It will float around because of a few heavyweights preaching it and suckers falling for it.

This sort of assertion is oblivious to the fact that Kubernetes does solve a few basic problems that no other orchestration service solves, at least not as easily.

I'm referring to problems like cluster autoscaling.

Until there's an alternative to Kubernetes that not only offers these features but is also supported by service providers like AWS, Azure, or GCP, Kubernetes is not a fad but the default tool of the trade.


> I'm referring to problems like cluster autoscaling.

When it comes to autoscaling, Kubernetes has two levels of scaling: pods and nodes. This introduces pros and cons in itself.

Either your resources are under-utilized and you waste money, or you can only scale as fast as your cloud provider's regular scaling anyway.

I'll just point out that, for example, the AWS ASG TargetTracking scaling policy blows the default Kubernetes HPA/CA scaling out of the water. It's more conservative, while HPA is highly susceptible to pod thrashing.

Two-level scaling introduces a lot of complexity. It gives more flexibility and power to the user, but only if that user has enough experience not to fall into the multiple pitfalls it also creates.
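
For reference, the documented HPA rule is roughly desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), with a tolerance band around 1.0. A small sketch (numbers are illustrative, and real HPA adds stabilization windows and scale policies on top) of how a spiky metric can bounce the replica count around:

    import math

    def hpa_desired(current_replicas, current_metric, target_metric, tolerance=0.1):
        # Rough model of the documented HPA scaling rule.
        ratio = current_metric / target_metric
        if abs(ratio - 1.0) <= tolerance:
            return current_replicas          # within tolerance: no scaling
        return math.ceil(current_replicas * ratio)

    replicas = 10
    for cpu in (0.55, 0.95, 0.50, 0.92, 0.48):   # observed utilization vs 0.70 target
        replicas = hpa_desired(replicas, cpu, target_metric=0.70)
        print(replicas)                          # bounces 8, 11, 8, 11, 8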

I'm confident enough to say that ANY corporation that is not familiar with Kubernetes and decides to introduce it into its technology stack will most likely shoot itself in the foot, also increasing the cost of infrastructure for its platform.

I've seen it too many times. It then results in the company hiring some consultant, who looks at the abomination their devops created, shakes their head, and spends a few months fixing the incompetence.

Doing Kubernetes the right way is hard.


> I'll just point out that, for example, the AWS ASG TargetTracking scaling policy blows the default Kubernetes HPA/CA scaling out of the water.

That doesn't really count as it's a proprietary service controlled by a single service provider.

> I'm confident enough to say that ANY corporation that is not familiar with Kubernetes (...) also increasing the cost of infrastructure for its platform.

That assertion doesn't pass muster because a) you're assuming generalized and widespread inexperience and/or incompetence and b) you're assuming that not being able to learn how to use a service is a permanent state of affairs.

Meanwhile, back in the real world, cluster autoscaling works well and does in fact let users shut down nodes they are not using, which would otherwise have to stay up and cost real money. Kubernetes is the reason this feature is available to the general public. Until a better alternative appears, Kubernetes is by far the best and only option available to the whole industry.


> That assertion doesn't pass muster because (..)

It does, because in the same time the corporation could have used the tech stack they are familiar with and focused on the product, which would directly improve their revenue. In most of the cases I've consulted on, the move to Kubernetes was pushed by people who did not understand that for their use case it made absolutely no difference and only brought a lot of complexity to the table, which then turned into a year or two of consulting costs and even more operational work than they previously had, just to maintain the new tech stack.

Lose-Lose situation.

> Kubernetes is by far the best and only option available to the whole industry.

No it isn't. And the sooner people understand that, the better.


Can you recommend some alternatives to k8s that are cloud provider agnostic?

I have 3 Docker Compose nodes that I manage manually. I will shortly need another one.

Seems like time for an orchestrator. If I squint, Swarm is a reasonable solution, but the k8s hysteria is sweeping me along...


There is no such thing as "cloud agnostic". Be it K8s on AWS, Azure, or GCP, each one has its own quirks that will make K8s run better, worse, or even terribly.

Abstracting some things at the orchestration level does not mean there is no (and will be no) cloud-specific code injected there. Actually, there is a ton of cloud-specific code in there, and when something stops working the way you want it to, the solution often turns out to be writing even more vendor-locking code.

Instead of looking for a "cloud agnostic" solution, find the solution that fits your use case best. Using managed services, if researched correctly, can give many advantages. But like everything, that requires A LOT of knowledge about them and about your domain.


It's interesting that you say that. Are you talking about the hosted offerings, then? If I spin up my own k8s "on metal" in GCP, would there still be something GCP-specific to it?


Remembering CORBA, and glad it never took off: what if Kubernetes is the Java of this generation? Java is not yet dying and, for better or worse, it does solve organizational problems around code. Kubernetes is likewise becoming a business standard; look at VMware, IBM/Red Hat, and managed service providers offering it as standard.

One can write huge books on Java, both on its strengths and solutions and on its weaknesses. For smaller companies and projects, it would probably be overkill to start with the Java/JVM monolith, even today. Though Java is the reason I jumped off the dev bandwagon, I can defend the language in hindsight, because it does solve several huge "unnecessary" problems that other languages/runtimes simply don't address, the operational costs having been hidden away in expensive hardware upgrades.

In hindsight, I don't see huge obstacles to the same being said about Kubernetes in the future. These low-level kinks will be worked out, and it will provide for the masses what they can't or won't achieve alone. The masses don't need Google scale, but if they can get 15-year-old, mature distributed systems tech almost for free, there are several reasons to standardize.


I'm torn on this opinion. From 20 years of experience I can see where Kubernetes fails and seems absurd (internally routed and NATted networks with terrifying complexity and design, obtuse declarations in YAML/JSON that hide more ugly complexity). Kubernetes tries to do too much for too many. OTOH, there are some really nice infrastructure simplifications which can be achieved using k8s that didn't exist previously. Like any tool, it can be used in a sane way or made into a monstrosity.


> Kubernetes is CORBA of this generation.

I like to compare Kubernetes to WebSphere.


I think React.js is the CORBA of this generation.

k8s might be too, but I think the fact that containers are so portable means it will be easy to migrate your application away from k8s and onto new infrastructure if you want. That's kind of the beauty of it.


I work somewhere that has one of the world's largest Nomad deployments.

It's easy to manage, and since the 0.9.5 release they have resolved most of my issues.

Simply adding Traefik/Fabio gets you a lot of benefit as well.

We run it across multiple distributions as well and it "just works," which is not something I experienced with Kubernetes. It also works on Windows if you need that.


The issue was not due to Kubernetes, and would have appeared with other containerization technology as well.


The problem was due to cadvisor, a Google project for monitoring container resource usage:

https://github.com/google/cadvisor

I think GP's point stands. I'd trust HashiCorp more than Google to help out on problems like these; they have a financial incentive to help you, and you can pay for support. Google does it just for corporate image and is known for terrible technical support for anything outside its advertising customers.


Fantastic write-up. I'm curious how many people were involved in the investigation and how long it took.


This is a great question, thank you for asking!

Initially, a few teams around the org had folks investigating the poor performance from the different perspectives of the applications that were observing issues. Once it was clear that it wasn't the applications themselves or their configuration at fault, the team that runs our Kubernetes infrastructure started collating information (in GitHub issues) and getting to the point of having a clear repro (the Vegeta test) and knowing what to look out for. This was the slowest part of the process, because we needed to understand that something non-application-level was going on (and because "random network latency" is a very difficult thing to narrow down); it probably took on the order of months from the first sign of an issue to fixing all the other issues that were contributing small amounts of latency and being sure we still had an underlying problem to find.

At that point, when it became clear that something more low-level was going on, we put together a focus team from a selection of teams to investigate the underlying cause. That was a group of about 5 engineers actively working on it, with another 5-10 interested engineers following along and helping out. Folks were typically working in pairs or solo to dive into different potential leads, looping everyone else in on Slack as they went. Most of the work here was finding signal in the noise; we found a lot of other smaller system-level issues along the way that got ruled out and/or were low priority to fix. There were other DNS-related issues at play, and fixing those also improved things, but not the specific underlying issue in the post here. Going down the specific path in the post took just a few days once the first few steps showed something was wrong at the packet level. The remediation from there was also just a few days, because we already had infrastructure in place to detect a known issue and mitigate it in a safe/graceful way. The focus team was working on this as a primary task for a few weeks overall.


There have been issues with Kubernetes and NAT reported by Xing engineering.

12 minutes into the video: https://www.youtube.com/watch?v=MoIdU0J0f0E


The insert_failed issue described in the video was one of the ones we discovered during our investigations as well, but it was already well understood because of this excellent Xing blog post, which was extremely useful and referenced internally a lot: https://tech.xing.com/a-reason-for-unexplained-connection-ti...


Would these zombie cgroups be able to be detected in any of the Kubernetes statsd metrics emitted?


We didn't find any metric that surfaced zombie cgroups, presumably because the kernel mostly tries to hide them from user space since they have been deleted but not yet cleaned up. The only way we found at the time to track them was via a BCC script and observing the latency of reading the /sys/fs/cgroup/memory/memory.stat file.
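
If you want to spot the symptom yourself, even a trivial timing loop on the node shows it once zombie cgroups have piled up. A sketch (not the BCC script from the post; the path assumes a cgroup-v1 host):

    import time

    # Reading the root memory.stat walks every child cgroup, including deleted
    # ("zombie") ones that the kernel hasn't reclaimed yet.
    PATH = "/sys/fs/cgroup/memory/memory.stat"

    worst = 0.0
    for _ in range(100):
        start = time.monotonic()
        with open(PATH) as f:
            f.read()
        worst = max(worst, time.monotonic() - start)

    print(f"worst memory.stat read: {worst * 1000:.1f} ms")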


Awesome deep dive! I wonder how one could debug this without deep insight into the networking stack. I hope I'll never have a problem like this myself ;)


I read the article with great interest. The question is, how do I run the bcc script from the article? In a sidecar container, or on the worker node machine?


The bcc script was run on the Kubernetes node itself directly over SSH, but it should be possible to run it in a privileged container as well.


Super interesting, thanks for sharing!

These things are always my favorite to deal with - the feeling at the end when you figure it out is amazing, and you usually learn a ton on your way there, too.


Is the tunneling protocol really IPIP for these overlays? Oh boy, that's going to really suck on wide multi-path networks like the ones used by cloud providers.


I learned a lot reading this writeup, thanks for posting!


This does not really have anything to do with either Kubernetes or networks. If your computer is busy, it won't be able to process packets. Accessing certain kernel stats via proc, sys, or other special files may be really expensive. For example, reading /proc/pid/smaps of a running mysqld takes 2 seconds on a computer I happen to have on hand. Sometimes, when you have many cores, it is expensive to produce some of the fields of /proc/pid/stat because the kernel has to visit numerous per-CPU data structures. /proc/pid/statm is better for this reason, if it contains what you are looking for.

TL;DR: reading kernel stats can take a long time and cost a lot of CPU cycles. It costs more with more containers, and more on bigger machines.
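
A quick way to see the difference for yourself (a sketch; pass any PID, e.g. that of a busy mysqld):

    import sys
    import time

    pid = sys.argv[1] if len(sys.argv) > 1 else "self"

    for name in ("smaps", "statm"):
        path = f"/proc/{pid}/{name}"
        start = time.monotonic()
        with open(path) as f:
            data = f.read()
        elapsed_ms = (time.monotonic() - start) * 1000
        # smaps walks and formats every VMA; statm is just a few counters.
        print(f"{path}: {elapsed_ms:.2f} ms, {len(data)} bytes")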


True, it does not have anything to do with k8s or networks per se, but that's the context in which this issue arose: they noticed higher network latency on a Kubernetes cluster.

The value of this blog post isn't only in the "why" ("reading kernel stats can take a long time and cost a lot of CPU cycles"); the "how" of finding the cause of the symptoms is also of interest.


> This does not really have anything to do with either kubernetes or networks.

The article is literally about an issue that was experienced while operating Kubernetes clusters.

FTA:

> Essentially, applications running on our Kubernetes clusters would observe seemingly random latency of up to and over 100ms on connections, which would cause downstream timeouts or retries.

Sounds like a problem affecting Kubernetes to me, and an important one.

More importantly, it sounds like a non-trivial problem that others operating Kubernetes clusters would be interested in learning how to identify and how to search for the root cause.


It has literally nothing to do with K8s. An equally suitable title would have been "Debugging network stalls on the Intel Xeon processor" or "Debugging network stalls on planet Earth".


I feel like the title is somewhat appropriate: part of the post was about selectively removing parts of Kubernetes networking, and digging through the kernel networking stack, to troubleshoot the issue.


Simple answer really: anything with Kubernetes in the title gets more clicks.


An inefficient cadvisor usage pattern on Linux causes unexpected latency. Also, this affects buzzword-compatible technologies.


What makes you say it was a cadvisor issue? Isn't it clearly a kernel issue?


It looks more like Docker is pushing the Linux kernel in unprecedented ways. You can't blame Linux for having a cache there for (so far) normal workloads. And cadvisor was pushing the problem further by constantly reading stats.




