At this point, why not just drive it up to the logical conclusion? Treat your business model as cattle, not pets. Customers leaving? Fire up another business until capital runs out, and if it does, no worries, just hop to another job!
Sorry, but I feel like I landed in crazy-land. Kubernetes is already an exercise in how many layers you can insert with nobody understanding the whole picture. Ostensibly, it's so that you can isolate those fucking jobs so that different teams can run different tasks in the same cluster without interfering with each other. Hence namespaces, services, resource requirements, port translation, autoscalers, and all those yaml files.
It boggles my mind that people look at something like Kubernetes and decide "You know what? We need more layers. On top of this."
I'm in kind of the opposite position, as I feel confused about why I keep seeing so much resistance to higher levels of abstraction here.
> It boggles my mind that people look at something like Kubernetes and decide "You know what? We need more layers. On top of this."
I'm not involved with this project at all, but yes, I look at the massive Kubernetes deployments I've worked with, and conclude that toil and various other kinds of problems could be reduced with higher-level abstractions for declaratively managing configuration for all of these clusters.
If you wanted to run workloads on 100k servers, how would you do it? Would you have a team manually configure each cluster individually, or would you consider that there are a lot of similarities between clusters, and you only want differences that are chosen intentionally, and look into some abstraction to help keep the complexity down?
If 100k isn't enough for you, is there any number of nodes at which you'd consider building some tooling to support your work?
What about data center count and change rate? There are a lot of business models that benefit from running workloads closer to customers, so it's useful to have deployments in many datacenters and clouds across the world, and for this to grow over time. Do you really not see why some people are interested in being able to bring up a cluster in a new DC by updating a few configs rather than starting from nothing by hand with kubeadm init?
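To make that concrete, here's a hypothetical sketch of what I mean by declaratively managing clusters (the schema and names are purely illustrative, not any particular tool's format):

    # hypothetical cluster inventory -- illustrative, not a real tool's schema
    clusters:
      - name: eu-west-dc3          # new DC: add one entry like this
        provider: baremetal
        kubernetesVersion: v1.20.2
        nodePools:
          - name: workers
            count: 12

Bringing up a cluster in a new DC is then a small diff to a file plus a reconcile run, instead of a hand-run kubeadm session.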
I apologize if I've misrepresented your position by misunderstanding something. I feel like I'm missing or misunderstanding something from your perspective, but I don't know what it is.
More generally, I think it's extremely normal for people to build higher levels of abstraction as things grow. It's usually worth considering abstraction or architectural changes in a system every time it grows by about 10x.
Some of the most painful platforms I've supported in my career have been those that didn't do this, and instead tried to grow through brute force. Something can be fine to do once, troublesome to do 10 times, and overwhelming to do 100 times.
> If you wanted to run workloads on 100k servers...
But... Most don't run 100k servers. Most don't run 10k servers. I was working with a very large consumer insurance company recently, and while they are really large for their segment, they're well under 15k servers in production. And containers aren't much of a thing. There's a lot of overengineered architecture going into non-hyperscaler/FAANG production environments, compared to the use case Kubernetes was originally designed to solve. The common consequence appears to be increased complexity and less operational system knowledge.
I don't follow where you're trying to go here, or how you intend this to contradict anything I said.
You're right that most people don't run 100k servers. In fact, most people don't run any servers at all, and don't use kubernetes. Some people do, and some of those people use approaches like this.
The comment I was responding to, by my reading, seems to express incredulity that anyone would do this, and seems to imply that this is a bad, counterproductive idea that nobody should ever implement. The most notable quote is "I feel like I landed in crazy-land". Do you read it differently from me?
My intent in my reply was to describe some of the use-cases where this is helpful. It's not only FAANG that run more than a small handful of clusters; there are plenty of smaller businesses that need to run large numbers of clusters.
I don't know anything about the company that wrote this article, but I found https://zitadel.ch/usecases on their website, and given that description, it sounds quite plausible that they run many small clusters in many clouds and datacenters. They probably don't need a ton of compute, but I wouldn't be surprised if they had latency needs for being close to their customers, or other specialized requirements that motivate dedicated clusters.
I also don't know anything about the insurance industry, so I'm curious to hear if I've guessed incorrectly, but it doesn't seem like the kind of business with especially high compute needs. I believe you when you say that your company has no use for this.
Can you believe me when I say that there are genuine problems these systems are trying to address? I keep hearing about companies who fund their employees building unnecessary pointless over-engineered production automation, but I have yet to ever encounter one in real life. Every SRE job I've had so far has been on a team that really wanted to invest time in improving infrastructure and automation, but couldn't get time away from the toil for it. If anyone can recommend a company that over-invests in infrastructure automation and architecture instead of under-invests, I'd dearly love to try it and see if it's as bad as I've heard.
Even without high compute needs or large numbers of clusters, there are still a lot of benefits you can get when you move things to declarative configurations and away from global mutable state and human minds. I agree that this can go wrong, but that doesn't mean it's categorically worthless to try.
On the other hand, if you're just looking to gripe about it having gone wrong for you, could you share some war stories? I may be feeling idealistic from frustration with low investment in tools support and production automation lately, so maybe I could use some horror stories of it going badly to scare me straight. :)
I think that the contradiction here comes from the fact that these tools suited for large-scale operations, like Kubernetes, end up getting standardized and adopted even by smaller corporations, which have neither the specialists nor the resources to utilize them properly. Be it because of FOMO (fear of missing out), CV-driven development, or something else entirely, I've seen this a number of times in the industry and it's always gone poorly. Instead of relatively quick deployments with Docker Swarm, Hashicorp Nomad, Docker Compose, or anything of the sort, it suddenly becomes an uphill battle of trying to administer the darn cluster, as opposed to just being able to develop software, even with turnkey solutions like Rancher ( https://rancher.com/ which is great, by the way), especially if the company has only recently adopted Kubernetes.
In contrast, I think that Docker Swarm does a much better job at smaller scales (see the sketch after this list), because:
- it uses way less resources than Kubernetes (which matters on smaller nodes)
- it is included in the install of Docker and therefore is easy to launch
- it supports the Compose format, which is often used for development with Docker Compose
- it is relatively simple, yet handles most of the concerns of smaller scale projects (i.e. excluding things like autoscaling or CRDs)
- there are solutions to manage it with web interfaces as well, like Portainer, if necessary ( https://www.portainer.io/ )
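To make the Compose point concrete, here's a minimal sketch (image, names and ports are just examples) that works both for local development and for a Swarm deployment:

    # docker-compose.yml -- minimal sketch with illustrative names
    version: "3.8"
    services:
      web:
        image: nginx:alpine
        ports:
          - "8080:80"
        deploy:
          replicas: 2   # honored by 'docker stack deploy', ignored by plain 'docker-compose up'

Locally you run `docker-compose up`; on a Swarm it's `docker swarm init` once, then `docker stack deploy -c docker-compose.yml mystack`. Same file, both worlds.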
Or, if you need Kubernetes in a non-managed format, I'd suggest that anyone take a look at projects like K3s ( https://k3s.io/ ), which attempt to cut out all of the unnecessary components and features of Kubernetes that most people won't use.
In my opinion, going with Kubernetes as a whole or even full Kubernetes distros (or building your own clusters for prod) is akin to choosing Apache Kafka because you want to build an event driven system, and then needing to throw a team at the thing to actually keep it running smoothly, instead of just choosing something simpler, like RabbitMQ or ZeroMQ.
I'd argue that most companies out there don't even have "SRE jobs"; instead it's a duty that gets thrown upon regular devs who also need to deliver features. It's probably easy to overestimate how capable most companies are at throwing resources at problems to make them disappear, mostly due to a lack of said resources.
Startups are insisting on K8s experience despite targeting a market that will never need more than a single server if they write the application properly.
I look at "if you've got 100K servers..." and ask "why does anyone needs 100K servers?", rather than "how would I manage that?".
Each layer adds complexity and reduces performance. Fewer layers are better. Write unikernel applications.
I think the comment you are replying to was objecting to the number of layers, rather than the level, of abstraction. More concretely, I'd guess they would not object to a tool that replaces k8s and offers a higher level api, but they do object to building said tool on top of k8s.
For myself, I think it depends on how leaky the abstractions are. The leakier, the fewer layers the house of cards can stand. That is, it's more about where you slice it than how many times.
That's something that comes in time, if ever. If Kubernetes started by replacing the OS layer of abstraction, say by requiring unikernel applications, then it would have virtually no adoption, because we as an industry don't know how to build and operate unikernels in production. You have to meet the market/industry where it's at and then co-evolve with the industry into some tighter, more optimal form.
I mean, I agree in that I'd like such a tool too, but given that we don't have this hypothetical more-scalable kubernetes, how is it such an unimaginably shitty idea to build tooling for declarative configuration of clusters?
As you say, there's still a fundamental amount of complexity involved in operating a system, and the best abstractions limit the amount you need to care about in a given context.
Something I like about declarative configuration specifically is how well it helps you move information from ephemeral human memories and habits into something more-reliable.
Most places I've worked usually had some weird ephemeral things, oral history about what needs to be treated differently, many systems that needed conversations to understand, etc. The more you limit your use of global mutable state (interactively changing things in production), the less room there is for important stuff to live in people's heads.
When I can spin up on a new service just by learning how a tool works and reading their configs, then I have all of the "what" and "how" and "where" there to look up at any time in a consistent way. There's much less room for weird state or needing to do rituals or broken staircases or whatever. When I talk with people, I can focus much more on the "why" questions.
I might be more sensitive to this than some people, because I've got a shitty memory, but the more I'm able to just check a thing the better. Theoretically documentation can fill a lot of the same role, but I've never worked anywhere with consistently up-to-date and comprehensive documentation for what's deployed. When it's the only way to deploy at all, it's no longer optional.
It feels like the same sort of thing as a good static type system to me. You can do local reasoning with just what you see, and limit unexpected action at a distance. The types are mandatory, and machine-checked, so you can't get them wrong in certain ways, and you can't just skip writing them like you do sometimes for unit tests that would cover similar verification otherwise.
> At this point, why not just drive it up to the logical conclusion? Treat your business model as cattle, not pets.
We're already there. Cue "We invest in teams not ideas"[1,2] and "We're pivoting since we couldn't get a product-market match, but have millions in the bank"[3]
> It boggles my mind that people look at something like Kubernetes and decide "You know what? We need more layers. On top of this."
This statement is less about adding another layer. It's more about not falling into the same trap that we did with VMs in 2010: managing them by hand, producing unreproducible configurations, "tending" to them. The same pattern is happening with Kubernetes, but it fundamentally runs on computers at the end of the day, so it's prone to similar types of failures.
A service mesh isn't a layer on top of K8s, but rather fills gaps in K8s feature set. Service meshes are great for managing traffic between services: authentication, authorization, encryption, traffic shaping, etc.
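For instance, with Istio (one popular mesh; not necessarily what the parent had in mind), traffic shaping between two versions of a service is a small declarative object. A sketch, using an illustrative service name `reviews` (the v1/v2 subsets would be defined in a companion DestinationRule):

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: reviews
    spec:
      hosts:
        - reviews                # in-mesh service name (illustrative)
      http:
        - route:
            - destination:
                host: reviews
                subset: v1
              weight: 90         # 90% of traffic stays on v1
            - destination:
                host: reviews
                subset: v2
              weight: 10         # 10% canary to v2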
Kuberneteuberneteuberneteuberneteuberneteuberneteuberneteubernetes is a perfect name for the beautiful system you are about to architect: it's an ancient Greek word for helmsman-man-man-man-man-man-man-man, who is in charge of helmsman-man-man-man-man-man-man, who is in charge of ...
I mean, depending on the business, employees aren't trusted to understand the whole picture regardless. Many employees at traditional businesses don't even have administrator access on their laptops, depending on their position, so it's not logically inconsistent with how things seem to operate. With that lack of big-picture overview, it makes things hard to scrutinise, since you're only seeing one sliver of an implementation (i.e., "It saves money" without seeing, let alone understanding, the technical requirements, and vice versa).
Limited admin practice isn't just to save money. It's a security practice, and it's a good one. In a large business, no one understands the whole business, except maybe the legal department and the CEO.
I've only worked for three large (2000+) companies. I've written reports to executive management committees. CEOs never really care about projects per se; they care about business units. Each business unit can have important projects. CEOs and management committees don't really dive into them; they care about how each business unit is operating according to their management goals. CEOs aren't gonna care about Python or Golang.
I agree in principle! To be a bit clearer, I was thinking more in terms of locality. For example, "Users don't even have access to their laptops right in front of them", let alone visibility into their own department or the wider organisation, due to access policy restrictions. Being able to write or edit everywhere would be ridiculous, but sometimes there are (admittedly intangible) upsides to transparency (read-only access in a technical sense) into most aspects of the wider business.
At every company I've worked for, legal usually signs off on every deal, in terms of the legality of the contract. They know every contract/client we have. They know where all our business is. They might not care what our business plan is or whether it works, but they know what our business is. The same can be said about the finance group.
> Ostensibly, it's so that you can isolate those fucking jobs so that different teams can run different tasks in the same cluster without interfering with each other.
I think the point is so the teams can run services at all. Isolation isn’t the point, it’s breaking a dependency on an ops team (Kubernetes in a sense is ops as a service).
The only explanation I have that makes any sense is that it is a very Application Developer solution to a System Administration problem that hasn't matured its abstractions and is now trapped in Tech Debt purgatory.
If I've read the Google papers on borg right (Kubernetes is conceptually borg v3, with omega being v2), this is different from how Google runs things.
They'll do warehouse scale computing with borg operating large clusters. borg is at the bottom.
The workloads spanning dev, test, and prod then run on these clusters. By having large clusters with lots of things running on them they get high utilization of the hardware and need less hardware.
It's amusing to see k8s used in such a different way, one that often uses a lot more hardware while driving up costs, built on concepts Google used to lower the cost.
Or, maybe I read the papers and book wrong.
I like the idea of higher utilization and better efficiency because it uses less resources which is more green.
No, that's exactly how it works. You have clusters spanning a datacenter failure domain (~= an AZ), and everything from prod to dev workloads runs on there, with low priority batch jobs bringing up the resource utilization to a sensible level.
You can do the same thing with k8s, you just have to trust its multitenancy support. You have RBAC, priority, quotas, preemption, pod security policies, network policies... Use them. You can even force some workloads to use gVisor or separate prod and dev workloads on different worker machines.
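As a sketch of the priority/preemption piece (names and values are illustrative): you define PriorityClasses, pods opt in via `priorityClassName`, and production can evict best-effort batch work under pressure:

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: prod-critical        # illustrative name
    value: 1000000
    description: "Production workloads; preempt lower-priority pods."
    ---
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: batch-low
    value: 100
    preemptionPolicy: Never      # batch fills spare capacity but never evicts others
    description: "Best-effort batch; first to be preempted."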
You're right, vanilla Kubernetes has a level of complexity that starts paying off at a certain cluster scale. But the wide adoption of K8s also shows that people love the standardization and API it offers for orchestrating workloads, even if they don't take advantage of its scaling capabilities.
My hope is that projects like k3s will manage to cover that small scale spectrum of the market.
One attraction of k3s is you can get a "real" production-grade K8s running on a single machine, or a small cluster, very quickly and easily. That can be great for learning, development and testing, certainly.
K3s is actually targeted at "edge" scenarios where you aren't running in a cloud, and don't have the ability and/or desire to dynamically adjust the cluster size.
You can scale a k3s cluster easily enough manually, adding or deleting nodes, but to get cluster autoscaling you'd need to do environment-dependent work to make it automatic.
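For what it's worth, that manual scaling really is about one command per node; per the k3s docs, joining a new agent looks roughly like this (server address and token are placeholders):

    # on the new worker node; URL and token are placeholders
    curl -sfL https://get.k3s.io | \
      K3S_URL=https://my-server:6443 \
      K3S_TOKEN=<contents of /var/lib/rancher/k3s/server/node-token> \
      sh -

    # removing a node: drain it, then delete it
    kubectl drain my-node --ignore-daemonsets
    kubectl delete node my-node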
If you do want autoscaling, then things will likely be quite a bit easier with a managed cluster like Google's GKE, or even possibly a self-installed cluster using a tool like kops, kubespray etc. (It's a while since I've used any of those, so I'm not sure which is best.) AWS EKS is another choice, although its setup is a bit more complex and I wouldn't really recommend it to someone getting started.
And for production scenarios, integration with cloud load balancers, network environment etc. is similarly going to be easier with a provider-managed cluster. It's all possible to do with k3s, but it's more work and a steeper learning curve.
And furthermore, the primary problem with scaling out the number of clusters is that it hamstrings one of the primary value propositions of kubernetes: increased utilization. The scheduler can't do its job without yet another scheduler on top if you spread your workloads outside of its sphere of influence.
How do they do version upgrades, isn't that the traditional Achilles heel of K8s that leads people to want to frequently recreate clusters from scratch and/or do blue/green?
The borg concepts and interfaces have been the same for ages. The borglet and borgmaster get released and pushed out very frequently (daily or weekly), and it doesn't break anything among the workloads. There are no maintenance windows for these changes because containers can keep running while the borglet restarts, and borgmasters are protocol/format-compatible, so the new release binary simply joins the quorum. Machines also get rotated out of the borg cell regularly for kernel upgrades, way more often than I've seen outside of Google.
I think an important thing to know about K8s is Omega failed to replace Borg, and then the Omega people created K8s. So K8s does not necessarily descend from Borg, and not all of Borg's desirable attributes made it into K8s.
The omega folks were on the borgmaster and borglet teams when they were building omega (I was on the borg team at the same time, but working on a different project). It's fair to say that k8s is an intellectual inheritor of the parts of borg that are required for it to be useful on the outside.
Borglet/borgmaster releases definitely break workloads. I recall one where something was rolled out to all the test clusters, passed all the tests (except one of ours), and was about to be promoted to increasing percentages of the prod clusters. Whilst debugging why our test (which was not part of the feature rollout) broke, we realized that if this had rolled out to prod, it would have broken all Tensorflow jobs, and would have been a major OMG.
But yeah, most of the time, the release process for borglet and borgmaster is fairly fast and fairly reliable.
A big part of this is that outside Google, the number of people who have to operate k8s as a fraction of all k8s users is way higher than the fraction of borg users who have to operate borg, so there's a lot of stuff in k8s that is 'end user experience' comforts and affordances for operators.
If you don't depend on clusters being 100% available all the time and design your applications to handle cluster-wide outages (which you need to do at Google scale anyway), then simply doing progressive rollouts across clusters is good enough. If a cluster rollout fails and knocks some workloads offline, so be it. Just revert the rollout and figure out what went wrong.
You can also throw in some smaller clusters that replicate the 'main' clusters' software stack and have some workloads whose downtime does not impact production jobs (not just dev workloads, but replicas of prod workloads!). These can be the earliest in your rollout pipeline and serve as an early stage canary.
Highly available kubeadm clusters are designed to be upgraded in place. The Kubernetes api-server is also designed to function with a minor version skew (for example v1.19.x and v1.20.x) that would happen during an in-place upgrade.
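For reference, the documented in-place kubeadm upgrade is a sequence along these lines (the version shown is just an example):

    # on the first control-plane node
    kubeadm upgrade plan               # lists available target versions
    kubeadm upgrade apply v1.20.2      # upgrades the control-plane components

    # then, one node at a time
    kubectl drain <node> --ignore-daemonsets
    kubeadm upgrade node               # run on that node
    # (plus upgrading the kubelet/kubectl packages on the node)
    kubectl uncordon <node>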
Cluster API takes the above and can in-place upgrade clusters for you. It's pretty awesome to see firsthand. Bumping the Kubernetes version string and machine image reference can upgrade and even swap distros of a running cluster with no downtime.
One caveat is that downgrading the apiserver is not guaranteed to be possible, since the schema of some types in the API may have been migrated to newer versions in the etcd database that the previous version may not be able to read. There are tools such as Velero (https://velero.io) which can restore a previous snapshot, but you will likely incur downtime and lose any changes since the snapshot.
First, if I understand it right... Google does some smart things in upgrades. They do things like tests, and then upgrade their equivalent of AZs in a data center. I'm sure there have been upgrades gone bad that they've had to fix.
Kubernetes can be upgraded. I've watched nicely done upgrades happening 4 or 5 years ago. I've watched simple upgrades happen more recently. It's not unheard of. Even in public clouds, I've upgraded Kubernetes through many minor versions without issue.
I would argue it's more work to create more clusters. You need to migrate workloads and anything pointing to them. It also would cost more as you have to run more hardware.
Exactly. We practice this with Mesos. The point is that a central infrastructure team maintains the (regional) clusters and provides an interface for application teams to submit the services. Each application team maintaining its own cluster or dealing directly with the full power of the cluster scheduler ecosystem would defeat the purpose.
> It's amusing to see k8s used in such a different way and one that often uses a lot more hardware while driving up costs.
To be fair, Google runs its own datacenters, with teams doing research & optimisations on all levels of the stack, hardware and software, and most importantly an amount of engineering resources converging to infinity.
The rest are stuck with VMs that share network interfaces and have to monitor CPU steal, understand complex pricing models, etc. Engineering resources are scarce, so most companies will over-provision just to be safe, and because profiling the application and fixing that API call that takes too long is expensive, they will spin up another 50 pods.
I've also seen Kubernetes being used this way, with one cluster per data center for company-wide utilization, with segmentation at the _namespace_ layer. The priority class system is heavily utilized to make sure production workloads are always running, and other workloads are pre-empted as needed.
Yes, that statement is wrong. I guess the real difference in the pet/cow calculus is that for cattle you probably won't pay more for treatment than the cow is worth; with pets, people do this all the time.
Yup. If the argument is "don't fall in love with your servers", I'm in agreement.
However, the idea that whenever anything weird happens you should just kill your server/cluster and move on without any sort of investigation seems like a recipe for disaster. That's a great way to mask bugs that may in fact be systemic in nature, and that are causing, or eventually will cause, service degradation for your customers.
I would hate to work in an environment where bugs are ignored and worked around instead of understood and fixed.
I like the idea of killing it and moving on without paging someone at 2AM.
Ideally that goes into an async queue of issues though and someone finds the root cause and that goes into a queue of issues to fix, which is actually burned down.
I suspect what is happening more often is that the whole stack has so many levels and the SREs responsible for it all don't have the visibility into the stack they need to debug it all, so they use their SLOs as a club to ignore issues as long as they're meeting their metrics until it becomes a firefighting drill.
A pile of cargo culted best practices and SLOs replacing hands on debugging.
What you do with the server/cluster after you take it out of service is up to you. Having automation like this to take things out of service means that you can immediately restore production workloads to full functionality.
I'm far more likely to ignore and work around a bug instead of doing a proper investigation when I've got pressure to get production back up because this server/cluster is a Special Snowflake that must be fixed in-place.
Hardware fails, and bugs happen. There's no getting around it. Automation to handle this case is a good part of any strategy for identifying, understanding, and fixing bugs.
AWS, in my mind, can quickly lose the Kubernetes war amongst cloud providers. This is every cloud provider's chance: EKS on AWS is so damn tied into a bunch of other AWS products that it's literally impossible to just delete a cluster now. I tried. It's tied into VPCs and Subnets and EC2 and load balancers and a bunch of other products that no longer make sense now that K8s won.
In my opinion it needs to be re-engineered completely into a super slim product that is not tied to all these crazy things.
It's not literally impossible to delete a cluster; I do this many times daily, fully automated, with no issues. And an EKS cluster is not tied to a VPC or subnets: you can spin them up independently, and you can delete your clusters without affecting the VPC or subnets in any way.
An Ingress or a service of type LoadBalancer will create a load balancer in AWS that's tied to your cluster, but that's the whole point of Kubernetes, it'll spin up the equivalent resources in Azure or GCP or DigitalOcean.
Also, the fact that a new EKS cluster takes at least 20 minutes to come up and be ready makes AWS's offering the weakest among the Big Three cloud vendors.
This isn’t true. I’ve provisioned 53 EKS clusters this week, and every one of them has been up in under 11 minutes with all of the accoutrements. I understand it has been substantially slower in the past.
You can spin up a local multi-node cluster using kind[0] in 1 minute on 6+ year old hardware. I know it's not the same but I really have to imagine there's ways to speed this up on the cloud. I haven't spun up a cluster on DO or Linode in a while, does anyone know how long it takes there for a comparison?
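For anyone who hasn't tried it, a multi-node kind cluster is a tiny config file plus one command:

    # kind-config.yaml
    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    nodes:
      - role: control-plane
      - role: worker
      - role: worker

and then `kind create cluster --config kind-config.yaml`.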
Those things don't have to be inherently slow and e.g. external IP don't need to block bring-up.
I've worked on KIND and on clusters on clouds (at Google, but on multiple clouds), and both can be very quick; if anything, there's still low-hanging fruit to make KIND faster that I'd expect a production service with more staffing to handle.
KIND is Kubernetes, typically on much weaker hardware :-)
Within a few minutes is a perfectly reasonable expectation even for "real"-er clusters, see e.g. under 4 minutes:
I just spun up a very simple 3 node EKS cluster using eksctl on us-east-1 and it took 19 minutes. Still seems quite slow to me. I'm not sure how you got 11 minutes over a sample size of 50+ clusters; maybe it's dependent on region or what worker node types you used?
GCP charges the same now (though your first one is free). I don't think it's an unreasonable cost, but it's certainly not competitive with places like Digital Ocean which are still free.
It really depends on the use case. We use Kubernetes for hosting managed data warehouses. EKS cluster spin-up time is a one-time cost to start a new environment for data warehouses. It's insignificant compared to the time required to load data. Other issues like auto-scaling performance are more important, at least in our case.
Not sure. I'm using DigitalOcean's managed version and have experience with EKS. There are pros and cons on both sides. What I like about EKS is support for API and k8s logs (accounting/security); coupled with CloudTrail and IAM integration for RBAC/user management, it's bliss.
I like the idea of "Building on Quicksand" as the analogy for Distributed Systems, but also maintaining your software dependencies. This article basically recommends trying to minimize your dependencies to keep reproducibility/portability high. I generally agree with this, but also carry an "all things within reason" mentality. But just as the article describes coworkers growing into their cluster, the complexity of what they run in their cluster will also grow over time and eventually they'll realize they've just built up their own "distribution". A few years ago, I've written a post asking people to think critically when they hear someone mention "Vanilla" Kubernetes[0].
The real problem they suffered is actually that Kubernetes isn't fundamentally designed for multi-tenancy. Instead, you're forced to make separate clusters to isolate different domains. Google themselves run multiple Borg clusters to isolate different domains, so it's natural that Kubernetes end up with a similar design.
You're just wrong, unfortunately - Google runs dev and test and prod all on the same clusters. Kubernetes multi-tenancy works just fine, but the conventional definition of multi-tenancy includes things like "network isolation" that are misguided. Multi-tenancy should be set up (and is within Google) by understanding what is and isn't shared with the environment, and through cryptographic assertion of who you're speaking to. If you want to see the latter part nicely integrated, come to a SIGAUTH meeting and help me argue for it
> If you want to see the latter part nicely integrated, come to a SIGAUTH meeting and help me argue for it.
Invitation accepted! :) I've been dying to see an ALTS-like [1] thing that works with Kubernetes. I really should be able to talk encrypted and authenticated gRPC-to-gRPC without ever having to set up secrets or manually provision certificates, dammit.
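(For what it's worth, service meshes get partway there today. E.g. in Istio, which is not ALTS but is similar in spirit, a single object flips the mesh to authenticated, encrypted traffic, with workload certs issued automatically by the mesh CA. A sketch:

    apiVersion: security.istio.io/v1beta1
    kind: PeerAuthentication
    metadata:
      name: default
      namespace: istio-system   # mesh-wide when placed in the root namespace
    spec:
      mtls:
        mode: STRICT            # reject plaintext; certs come from the mesh CA

Still not the zero-config gRPC-to-gRPC story ALTS gives you inside Google, though.)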
All I said was that they use separate clusters to "isolate domains" which is a pretty vague description on purpose -- I did not intend to claim they do it for different deployment environments as you've described.
It's fairly subjective what types of isolation define "multi-tenancy" which is why there hasn't been progress made despite SIG and WG efforts in the past. While you do not believe network isolation should be included, there are plenty of developers working on OpenShift Online which may disagree. OSO lets anyone on the internet sign up for free and instantly be given their own namespace on a shared cluster full of untrusted actors.
Hey, we used Tectonic ;-) It was a great tool at that time.
Tectonic did influence some of the concepts around ORBOS.
Just think of Tectonic combined with GitOps, minus the iPXE part.
When I worked at Asana, we created a small framework that allowed for blue-green deployments of Kubernetes Clusters (and the apps that lived on top of them) called KubeApps[0].
It worked out great for us -- upgrading Kubernetes was easy and testable, never worried about code drift, etc.
Running a separate cluster for every service assures high overhead and poor utilization. Fine if you can afford it, but be aware that you are paying it.
Packing everything together is a "selling point" until you find that a service can fill up ephemeral storage and take down other services, or consume bandwidth without limit.
Let's not forget the potential security implications of not keeping things properly isolated.
People who were around when provisioning on bare-metal was still a thing already learned all these lessons. Somehow it seems they have been forgotten by all the people driving hype around Kubernetes.
> Packing everything together is a "selling point" until you find that a service can fill up ephemeral storage and take down other services, or consume bandwidth without limit.
Ephemeral storage has resource requests/limits in pods.
Traffic shaping/limiting can be accomplished using kubernetes.io/{ingress,egress}-bandwidth annotations. It's not as nice as resources (because there's not quotas and capacity planning, and it's generally very simplistic) but you can still easily build on this.
Pods can also have priorities and higher priority workloads can and will preempt lower priority workloads.
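Concretely, both the storage and bandwidth knobs live right on the pod spec (names and values are illustrative):

    apiVersion: v1
    kind: Pod
    metadata:
      name: bounded-app                         # illustrative name
      annotations:
        kubernetes.io/ingress-bandwidth: 10M    # requires the bandwidth CNI plugin
        kubernetes.io/egress-bandwidth: 10M
    spec:
      containers:
        - name: app
          image: nginx:alpine
          resources:
            requests:
              ephemeral-storage: "1Gi"
            limits:
              ephemeral-storage: "2Gi"          # exceeding this gets the pod evicted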
> Let's not forget the potential security implications of not keeping things properly isolated.
For harder isolation, you can use gVisor or even Kata Containers.
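The gVisor route is similarly declarative, assuming your nodes have runsc installed and configured in the CRI runtime:

    apiVersion: node.k8s.io/v1
    kind: RuntimeClass
    metadata:
      name: gvisor
    handler: runsc              # the gVisor runtime handler on the node

Pods then opt in with `runtimeClassName: gvisor` in their spec.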
> People who were around when provisioning on bare-metal was still a thing already learned all these lessons. Somehow it seems they have been forgotten by all the people driving hype around Kubernetes.
Kubernetes explicitly aims to solve resource isolation. It was built by people who have decades of experience solving this exact problem in production, on bare metal, at scale. Effectively, Kubernetes resource isolation is one of the best solutions out there to easily, predictably and strongly isolate resource between workloads _and_ maximize utilization at the same time.
Kubernetes has the concept of limits, especially on ephemeral storage; additionally: if your node becomes unhealthy then the workloads would be rescheduled on another node.
I’m super not hypey about kubernetes, mostly because the complexity surrounding networking is opaque and built on a foundation of sand... But let’s not argue things that aren’t true.
> additionally: if your node becomes unhealthy then the workloads would be rescheduled on another node.
Well of course, but you're going to run into that issue (likely) on all of the nodes where the offending service lives.
> But let’s not argue things that aren’t true.
If what I've said is untrue, looking at open GitHub issues and the Kubernetes documentation is certainly no indication. That's a massive problem all by itself.
The first issue you've linked concerns quota support for ephemeral storage requests/limits - which is not about the limits themselves, but the ability to set limit quotas per tenant/namespace. E.g., team A cannot use more than 100G of ephemeral storage in total in the cluster. EDIT: No, sorry, it's about using underlying filesystem quotas for limiting ephemeral storage, vs. the current implementation, see the third point below. Also see KEP: https://github.com/kubernetes/enhancements/tree/master/keps/...
The second is a tracking issue for a KEP that has been implemented but is still in alpha/beta. This will be closed when all the related features are stable. There's also some discussion about related functionality that might be added as part of this KEP/design.
The third issue is about integrating Docker storage quotas with Kubernetes ephemeral quotas - ie., translating ephemeral storage limits into disk quotas (which would result in -ENOSPC to workloads), vs. the standard kubelet implementation which just kills/evicts workloads that run past their limit.
I agree these are difficult to understand if you're not familiar with the k8s development/design process. I also had to spend a few minutes on each one of them to understand what the actual state of the issues is. However, they're in a development issue tracker, and the end-user k8s documentation clearly states that Ephemeral Storage requests/limits works, how it works, and what its limitations are: https://kubernetes.io/docs/concepts/configuration/manage-res...
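As an aside, the per-namespace cap from the "team A" example above does exist as a standard ResourceQuota (names and values illustrative):

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: team-a-storage
      namespace: team-a                       # illustrative tenant namespace
    spec:
      hard:
        requests.ephemeral-storage: 100Gi     # cap on the namespace's summed requests
        limits.ephemeral-storage: 200Gi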
Yeah, especially in production bare metal clusters. If you want N+2 redundancy that's at least 5 physical machines for just the control plane (etcd & apiservers), more if you don't want to colocate worker nodes with that...
Even if you have full bare metal automation and thousands of machines that seems like unnecessary waste.
Decreases utilization, but also decreases coordination between teams (no man-bear-pigs). Also weigh the long-term costs of poorly maintained platforms and infrastructure in disaster cases, security issues, or when migrating to other providers.
High overhead can be automated away, google ORBOS.
You don't exactly need to run a cluster per service ;-) Instead you can choose to collocate services that belong together and form a "domain". But don't go down the route of building the one almighty Kubernetes cluster where all your domains run in a single place.
Raising cattle is a lot of work. You have to weigh them regularly, apply treatments for intestinal worms, for lice, move them from pasture to pasture so they don't overgraze. It's a fulltime job.
Also, if a cow dies, people don't just buy a new one. It represents quite a loss of profit. It also represents a big potential problem on the farm that people will want to resolve - they're your money makers; if they're dying, it's an issue.
You're still free to do that. But honestly, having done DevOps/sysadmin in both of those worlds, I vastly prefer the whole one-service-per-machine, cattle treatment, with the one-off exception for remote workspaces and the like.
Losing a server was often devastating, even with backups. "Don't run your builds right now, people are visiting the website" also sounds a lot like "Don't connect to the internet right now, I need to use the phone".
I mean there's "correct" ways to implement the whole pet servers concept but .. why? It's fun, sure, but it's also a waste of time, and when you want to be productive it just tends to get in your way.
Because it let me replace Ansible playbooks with a sensibly designed YAML syntax usually needing 40% the number of lines, and got my deploys down to a roughly constant 6 seconds regardless of the size of the application. Kubernetes let me finally stop thinking about deployment, I can get new apps online in 5-10 minutes or less.
Don't even get me started on monitoring.. k8s murders everything else here
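(To make that concrete: a minimal sketch of the sort of manifest being described, with illustrative names and image. `kubectl apply -f app.yaml` with a new image tag is then the entire deploy.)

    # app.yaml -- minimal sketch, illustrative names
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp
    spec:
      replicas: 2
      selector:
        matchLabels: { app: myapp }
      template:
        metadata:
          labels: { app: myapp }
        spec:
          containers:
            - name: myapp
              image: registry.example.com/myapp:1.2.3
              ports:
                - containerPort: 8080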
It gives you automatic failover and decent-ish (at least when coupled with Rancher, naked k8s is nuts) management compared to a couple of manually (or Puppet) managed servers.
A well implemented mini cluster can and will save you so much time in later maintenance and deployments.
> ... the GitOps pattern is gaining adoption for Kubernetes workload deployment.
Is it really, though? I for one am glad I didn't jump on the bandwagon early. A lot of the articles popping up nowadays mentioning the downsides of GitOps make a lot of sense.
Kubernetes is already insanely complicated. In practice, minor version differences of anything in the stack lead to issues. I get why it all exists, but at some point I have to ask: are containers really that much better than an rpm/deb and a private corporate repo? Every container is effectively a chroot. Add to that the lack of easy debugging you'd get from simple packages and daemons. I get that doesn't work for "cloud" scale or whatever, but I think the excitement over this stuff is overblown.
For larger companies, I think a huge benefit of Kubernetes is the shared language for defining and operating services, as well as the well-thought-out abstractions for how these services interact.
Costs are generally less of a concern, but having one way of running, operating, and writing services allowed our dev team to move faster, share knowledge, etc.
When I was at BigCo and our requirements became so complex and demanding that we had to migrate onto serious containerization and orchestration software, well, it was necessary but we all pined for the days when it was a dozen services and 20k boxes and we didn’t need that shit.
Not everyone in the world practices animal husbandry, so the "cattle" metaphor doesn't make a lot of sense to some of us, like me.
I have no idea how "cattle" should be treated, other than they are killed/used for resources.
Fungible vs non-fungible seem to be the concept hiding behind pets/cattle. You can substitute a dollar for another, you can't substitute your favorite rock for another.