Kubernetes a black hole of unpredictable spend, according to new report (theregister.com)
126 points by eminemence on June 29, 2021 | 86 comments



I think the article's headline is a little rude to Kubernetes. I'm by no means a fan of Kubernetes, especially not in non-tech enterprises, but the article is really about the unpredictable and rising cost of moving into the cloud owned by the big tech companies, isn't it? Sure, Kubernetes can be part of that, but you can easily run into the same predicament without it.

The unpredictability of cost is actually the prime reason we stuck with our own cloud, where we rent the iron at a known rate (technically we buy the hardware that the company hosts for us, but it's not really ours; we just use it till it breaks). That's just better for a public sector budget than paying by mileage, at least if anyone outside the IT department bothers to look into what they're signing off on.

The really interesting part will be where we go from here. Moving from self-hosted to rented iron that we run our virtual servers on was a fairly simple move that would be easy to reverse. The move into the cloud is even easier, but unless you're careful, it could be very costly to get out.


Unfortunately, AWS is the new Oracle. No one ever gets in trouble for picking it, and it's a great way to make it look like you, as a high-up exec, provide value: "Look how fast we're iterating now, thanks to my decision." It almost always ends in a mess of unmaintainable, poorly thought-out services that someone else has to come in and clean up or move to the next proprietary service.

The last five years have been soul-crushing for me as someone who actually enjoys managing datacenters. We have seen time and time again that having your own DC leads to much better visibility and control over spending, as well as lower cost. Not to mention the huge advantage it gives when negotiating with cloud vendors if you are a mid-size or larger company.

So time and time again I have had to transition out of environments you can reason about and into AWS, and become a glorified support engineer, but I guess that's what companies need nowadays: someone who will read the docs the other engineers don't want to and troubleshoot all the issues, because AWS is so easy.

I'm glad I got to learn how the “cloud” works, though, as I likely never would have been drawn to infra and programming otherwise in this day and age.


The problem with running your own DC is growing past your planned capacity. There's often a huge delay between developers having to put up with under-resourced machines on the VM infrastructure and more capacity being approved.

As a developer I've put up with over-subscribed VMware clouds and I vastly prefer the Azure/AWS option.


There are a lot of bad ways to run your own hardware. Using EC2 just for burst capacity provides an easy escape hatch if you need it. However, I have never run into the issue of not calculating CPU capacity properly, and I also would never use VMware; that sounds like IT is running your DC. I could see this being the case for some tiny startup that just owns a small amount of rack space, though.

Of course, you can pay the cloud providers to deal with your company's bad planning. That's what most do.

I also would never advocate a DC for everything. As a startup, it likely makes no sense to run your own hardware, and it likely doesn't make sense to run k8s either; however, one of those is completely acceptable. Once you get to more predictable growth, owning your own hardware starts to look more attractive, but most don't know how to calculate it properly, and finance likes to make it murky with capex and opex buckets.


I recently experienced this firsthand, in a company which owned no computers beyond employee laptops. The product was entirely built of AWS services created by a pile of Terraform spaghetti. It was only really understood by someone whose superpower was the ability to keep an apparently unlimited number of levels of indirection in his head.

I hear they might need to move it all to Azure soon!


Terraform just does too much and gets abused. It's great for simple config but quickly turns into custom modules to make things easier. Eventually those modules need to change and stuff breaks, but you never know until someone tries to run it again. Then you just pray no one tries to manage their database with it.

Additionally, since it's put together over time, if you actually had to recreate the environment it would never work. I spent time automating the recreation of an environment and quickly stopped. Terraform is the illusion of infra as code and fails miserably at any scale.


You know you can lock module versions, right? And of course the environment recreation needs to be tested periodically if you have a disaster recovery plan.


Locking module versions only works if all of your modules are pulled in through the registry, and that isn't an option at some places.

Testing environment recreation is impossible at a certain scale. Try it when you have hundreds of people adding Terraform code, most of whom only know how to copy and paste.


> I hear they might need to move it all to Azure soon!

I'm not sure if you made that post as a joke or not...but you just described a certain San Francisco based startup that operates in the SPF/DKIM/DMARC space...

Which company (if you can say) were you describing?


Many people have been fired after AWS migrations resulted in massive cost increases. But few people want to talk about failed projects at hospitals, etc., where the cloud is a poor fit.

IT has always had issues with people cargo-culting solutions without understanding the details, and the cloud is no different.


I agree; it's just the current one that affects me. We did have someone get fired over an AWS migration at my first startup, but they just turned around and tried to do it again. It took one year for a single application that only lived on a single server, and we never did anything more advanced before we sold.


Are those of us happy/comfortable with current major cloud offerings simply not speaking up here? I can't fathom running a data center any longer for a company of nearly any size. Given my team's responsibilities, this would require 2-3x more headcount with significantly worse SLA/SLO if we still ran our own datacenter. Maybe it's not such a big deal for places with constant demand? Or is this just a case of observer bias?


It depends on what you do.

Cloud boxes are insanely expensive (easily 10x the price of the equivalent in-house box, taking hosting, power, cooling, and hardware into account).

To make this work, you need a combination of variable demand and only paying for partial salaries (your cloud boxes are mutualized with other people's boxes).

If you're a reasonably big company (tens of thousands of servers), with fairly stable demand and adequate capacity planning, you won't necessarily save a huge amount of money by outsourcing your DCs. You can argue that the GCP/AWS guys are better than you at running fleets of servers and data centers, but at 10x the price, it's worth double-checking. If all I do is raw compute 100% of the time at a very large scale, it's extremely likely I want to do it myself.

Obviously, there's more than raw hardware to the cloud, starting with all kinds of managed services, which can be worth it. Again, you'll have to do the maths: 10x for the boxes, then extra for the distributed DB? Does it give me a competitive advantage? Better time to market?
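
To make the maths concrete, here's the kind of back-of-the-envelope I mean. Every number below (the 10x multiplier, utilization, payroll) is a made-up assumption for illustration, not anyone's real pricing:

    # Rough sketch with invented numbers -- adjust to your own situation.
    def yearly_cost(servers, cost_per_server_year, utilization, ops_payroll=0.0):
        # You pay for provisioned capacity, so divide by utilization
        # to get the cost of the capacity you actually have to stand up.
        return servers * cost_per_server_year / utilization + ops_payroll

    # In-house: cheap boxes, but you carry idle headroom and an ops team.
    in_house = yearly_cost(servers=20_000, cost_per_server_year=2_000,
                           utilization=0.6, ops_payroll=20 * 150_000)

    # Cloud: assume ~10x per box, higher utilization, ops mostly outsourced.
    cloud = yearly_cost(servers=20_000, cost_per_server_year=20_000,
                        utilization=0.9)

    print(f"in-house: ~${in_house / 1e6:.0f}M/yr, cloud: ~${cloud / 1e6:.0f}M/yr")

At that scale and with that stability, the 10x hardware premium dominates everything else; with a small, spiky workload the picture flips.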

In the end, there are good use cases and bad use cases for the cloud, and I don't think it's as clear-cut as you say.

EDIT: if cloud hardware prices were not completely ridiculous (say 2x), then it might suddenly be a lot more compelling, and I would most likely agree with you (security/regulatory issues aside).


I think you can find better cloud prices (outside of AWS) that make sense. Running a datacenter is hard; running your own servers (software only) is much easier.

I can get this dedicated server in Germany (courtesy of hetzner.com) for 40 EUR / 47 USD per month (albeit the USD is relatively weak right now): Intel® Core™ i7-6700 quad-core CPU, 64 GB DDR4 RAM, 2x 512 GB NVMe SSD.

Building an equivalent computer would cost me around 1300 EUR / 1540 USD without VAT (assuming the RAM is ECC).

Let's assume 12.5 EUR / 15 USD of electricity per month; the computer would pay for itself in about two years. It's hard to price maintenance, issues, and hardware failures, but I feel like this computer could last 4-5 years, making renting roughly 2x the price of owning.


When we pushed a bunch of teams at a large retailer from EC2 to Kubernetes on EC2 (not EKS), we reduced overall cloud spend by ~30%.

It's only unpredictable and unquantifiable if you don't look at it. Is the problem that they don't have the right tools to look at it yet?


The flip side of this is that I was able to reduce an employer's cloud spend by 80 percent by just _fixing_ the mistakes they made when migrating to Kubernetes. We were still on k8s afterwards; it's just that the first pass made a bunch of naive assumptions and poor optimizations.


Renting the hardware is not necessarily a cost-saving measure though: how much of the compute/storage capacity you have is sitting idle in your datacenter? That's the whole point of finops: you need to have full visibility into the usage of your infrastructure so you can optimize the spend.


OP didn't say anything about cost saving but rather predictability.


Gosh, there is a job title for capacity planning?


do capacity planners buy and sell capacity to scale up and down month to month? :)


When does the need ever go down?


If you're retail, it goes down after Xmas. If you're a tax company, it goes down after May. If you sell a product, it goes down when you go long enough without releasing anything new.


At night, often. For example, I have had a use case where we needed a 1000 node build farm during the day when developers were working, but only 50 at night. Machine learning jobs are another common source of workloads that need burst capacity.
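
Back-of-the-envelope on that build farm (the 10h/14h split is an assumed illustration of "working day vs. night", not measured numbers):

    day_nodes, night_nodes = 1000, 50
    day_hours, night_hours = 10, 14

    # Average capacity actually needed over a 24h cycle
    avg_nodes = (day_nodes * day_hours + night_nodes * night_hours) / (day_hours + night_hours)
    peak_nodes = day_nodes  # what you'd provision in-house to cover the peak

    print(f"average nodes needed: {avg_nodes:.0f}")
    print(f"pay-per-use vs provision-for-peak: {avg_nodes / peak_nodes:.0%} of the node-hours")

So paying only for what's scheduled buys you a bit more than a 2x saving versus owning enough hardware for the daytime peak.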


What company has constant load 24/7/365?


I’m going to guess Visa.


I'm sure they still see time of day effects, day of week effects, month of year effects.


It is not necessarily a cost-saving measure, but it can be in some cases. In a project I was involved with, we came to the conclusion that we would pay AWS every three months the cost of the hardware that we would buy ourselves. I am aware that AWS includes hosting and services. Nevertheless, that is a very big difference.


Still amazing to me that keeping capacity in reserve is now demonized


"Less than 25 per cent of those surveyed said they could accurately predict how much they’d spend on Kubernetes to within 5 per cent of actual cost."

The premise seems off to me. Of course people have a hard time predicting the cost of an autoscaling infrastructure that they haven't had for a long time.

Presumably they moved off of a fixed-size infrastructure to get to Kubernetes, where they were either paying for excess capacity on some days or paying in the form of poor performance when demand exceeded supply.

Five percent accuracy seems like a high bar, and you would want a year or two to understand your seasonality and growth rate, etc.


Geez, we can barely predict anything within 5%.


I have experienced first hand several cases of k8s gone wrong. In the end I have come to the conclusion that most companies don't really need the complexity of k8s.

Seriously, most k8s projects I have been involved with required so much effort to bootstrap and keep going that it just blew me away! The experience for the average developer was just frustrating and infuriating: AWS ECS to the rescue!

Some will argue: vendor lock-in! Really? I bet most services out there are already vendor locked in; just go with the flow and make your life easier.

I have seen companies fail because they invested so much in building infrastructure, supposedly free of vendor lock-in (or so they thought), that they lost sight of and did not invest enough in building the actual product: no revenue -> party is over.

Don't make the same mistake.


> most companies don't really need the complexity of k8s

If you want to run a complex service consisting of multiple microservices, auto scaling and so on, nothing beats Kubernetes. But you're right, most small businesses just need a simple web site, and for them, an Amazon Lightsail VM might suffice.


yeah, nothing beats K8s except you know... maybe serverless. or maybe cloud primitives that are already there and are handled by someone else.

IMHO, unless you have a huge fleet of bare-metal machines, K8s is overhead you don't need.


I run k8s on my home network; it's nice to not have to log into my linsux box to see logs and restart things.

I like having the ability to just nuke my thing at home and have it come up the same way it did before... Not sure why it took me so long to adopt the work mentality at home, but it's great. Plex acting funny? Kill the pod. Sonar needs an update? Kill the pod.


Thanks for your HO but you're wrong. K8s is the future and the new JVM.


ECS trades Kubernetes manifests and generalized tools for CloudFormation or the AWS UI, though...

Most everything else seems to be the same thing as if you ran with EKS, just different names for everything.

Setting it up yourself, though, no, I wouldn't do that unless I had a large enough team to maintain it.


> ECS trades Kubernetes manifests and generalized tools for CloudFormation or the AWS UI, though...

Terraform has served me well; it does come with some pain as well, but nothing compared to k8s.


> I have seen companies fail because they invested so much in building infrastructure, supposedly free of vendor lock-in (or so they thought), that they lost sight of and did not invest enough in building the actual product: no revenue -> party is over.

How was the stock comp for engineers at these companies?


I was the tech lead / main engineer at a startup with a huge investor putting 1 million into a seed round. I refused a salary with a little bit of equity (single-digit percentage) and kept contracting, because I didn't trust the business side and apparently I wasn't cool enough to become an exec.

Everything was fine at the beginning: we quickly built a CRUD app in Node.js, hosted on K8s on Google Cloud, doing something useful for our target market. We demoed it with the client and everything was fine.

Then they hired a tech lead from another failed startup as CTO. He started bringing in all his friends from his previous employer (all from the same country, so the company started splitting into a French crowd and everyone else) with minor exec titles. Stock compensation for the execs was pretty good, and they even had a 50k bonus, or so I heard.

Slowly but surely they started pushing for our backend to be rewritten in Go for no reason - which I resisted - and then we decided to move to AWS + Nomad. This was 2015 and the tooling was even more minimal than what HashiCorp offers nowadays - so they basically started building a K8s on top of Nomad to replicate the capabilities we had before.

I complained saying that we didn't have time for this with the impending demos and I said it was a stupid decision.

I was promptly fired for that - and without me in the way, they threw away our Node.js application and started rewriting everything in Go.

They never made another demo in time, and the investor eventually pulled the plug on further investment and acqui-hired the company for little money. The founders were pissed and called me later on to update me on how it went and to apologise. Fin.


A late-stage "startup" that had struggled with growth for years thought they needed to invest in building a more complex system and infrastructure, and they hired tons of engs (who actually had literally nothing to do)...

Instead, what they really needed was to focus on their customers and build them a useful product that simply worked. They spent months building and deploying their own k8s cluster: EKS was "vendor lock-in," so that wasn't a good choice for them. But guess what: all their infrastructure was already running in AWS anyway, and their product was already vendor locked in: RDS, S3, etc...

Also... to make things even more complex, they thought they needed to go all-in on super-distributed microservices: it took literally months to get new "services" up and running in production. It was a s*t show!

One of the many stories of "let's break the monolith, embrace microservices, thus k8s"... gone horribly wrong.

Eventually most engs left ....


> But guess what: all their infrastructure was already running in AWS anyway, and their product was already vendor locked in: RDS, S3, etc...

I laughed, but your response doesn't answer my question regarding engineers' comp.


stock options, not great


> Eventually most engs left ....

and now have k8s on their resume. Bazinga!


Did we work for the same company?


Thank you. At least I am not alone. I just want to write some code and not waste a week redeploying K8s shit again... send help...


The headline seems a bit hyperbolic after reading the article. The bar chart has the caption "Accurate prediction of Kubernetes costs is a challenge." However, the chart shows that one in five respondents represented in the data don't bother to predict their costs at all. Over half can predict within 10%, which seems fairly reasonable even if there is room for improvement. That leaves the remaining 20% who really are struggling, unable to predict their costs to better than 25% accuracy.

The trouble with all of this is that it doesn't really account for how the respondents use Kubernetes. What type of workloads are they running and how variable are those workloads? Would the organizations struggling to predict costs still struggle using another solution if their workloads are highly variable? Are they trading fixed costs for scalability in the face of those variable workloads? It's certainly possible to set upper bounds to autoscaling and to run fixed sized workloads in Kubernetes.

Perhaps the best takeaway from the article is that there is an opportunity to develop better cost management tools or offer consulting services in this domain. I know there are a few companies out there hoping to offer services in this space already.


This is not counting the people costs involved; kube experts ain't cheap. Complexity is only going up.

Suggestion for next article -> "Software a black hole of unpredictable spend"


From the application engineer side I'm not convinced that Kubernetes is particularly complex. I recently ramped up on it and found it liberating, frankly. Once I understood the basic concepts it was much more sensible than staring at a mountain of bespoke Ansible scripts operating on components I could barely see or understand.

I can't speak to the SRE side; I can imagine the complexity there. But are these challenges greater than maintaining, observing, and modifying a Rube Goldberg machine of managed AWS services?


I have been on both sides of the stick.

At a previous place we set up a cluster on AWS. This was before EKS. We started out with kops initially but later used the generated CFN YAML. It was not an easy feat. There were a lot of gotchas, moving pieces, and much more, and all of those moving pieces had their own gotchas. Plus lots of competition in the area with not a lot of comparisons, since it was early days. Many things were not fully stable. A lot of issues. We got there in the end, but it wasn't easy.

On the flip side, we were able to onboard people with their services in days, not weeks (previously the company ran its own datacentres). Teams were allowed to go to AWS directly or go to our K8s cluster. I was able to observe the lead time of three teams: two chose k8s, one chose AWS. Those going to K8s were able to get their prod environment running within a week. The other team took a month to do their dev environment. All three teams deployed a single stateless service.

This is obviously anecdotal, but I was really impressed with the user friendliness of kubernetes for the consumers.

Nowadays, four years later, it's a different story, with every major provider managing the clusters for you. At my current company we haven't taken the jump yet, so unfortunately I can't fully compare, but from the little I've played with EKS, it's as easy as simple CRUD operations.


I'll back up your anecdote. We had a lot of teams fretting about how long it would take to utilize Kubernetes. After a short sit-down with one of the Kubernetes ops team members (even juniors), they were up and running and thanking us for all the time they would save managing EC2 patching, etc.


There is a lot of stuff to learn about how to do things the Kubernetes way, but yeah, it's just a tool to run isolated processes across a fleet of servers. You can make a mess in EC2 and in Kubernetes, and I don't see much, if any, additional cost complexity associated with running Kubernetes. It is friggin' gorgeous and elegant compared to something like Hadoop. Spark on Kubernetes has been a dream compared to writing automation against the Cloudera or Ambari Hadoop management layers.


"Any sufficiently complicated deployment and application management system contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Kubernetes."


... including kubernetes itself


> I can't speak to the SRE side

In all seriousness, what other side is there? The SRE's role is to make sure you never encounter kubernetes. E.g., you have a git repo and some branches; if you push to them, deployments magically happen. As a non-SRE, what parts of kubernetes would you actually be touching or interacting with?


> The SRE's role is to make sure you never encounter kubernetes

In my org feature development teams own the K8S configuration and uptime for their own workloads/services.

The SRE team owns the shared Kubernetes platform itself and the related infra/config (cluster-level config, RBAC, the CI/CD infra, certificate tooling, the secrets infrastructure, the observability pipelines, etc).


bouncing pods and occasionally tailing logs in the case of startup problems is about it


Had the exact same experience. Was planning to use Ansible until it became obvious that Kubernetes was a perfect match, and it was fairly easy to learn the basics. Getting it to run correctly within our VPC was a pain for the developer doing that part though.


I have a basic Ruby on Rails app up on Kube but with no authentication and no SSL certificate. What should I learn next to add these things? How did you learn these things?


Maybe using cert-manager & ingress-nginx to front the service with SSL. Then your ingress-nginx will front your Ruby app and decorate it with SSL. Please note (and I am embarrassed it took me so long to figure this out): there are two nginx ingress projects, "nginx-ingress" and "ingress-nginx"! I would strongly recommend using the official k8s one, which is ingress-nginx. See: https://www.digitalocean.com/community/tutorials/how-to-set-...

This will help you learn the ingress pattern. After that, I would suggest exploring ways to tack a sidecar on (log aggregator, etc.) - my impression is you are just looking for things to learn; I wouldn't normally suggest doing this just to do it.

Alternatively, you could try exploring putting Grafana/Prometheus in, though this can be a big bite for someone learning, so I would recommend learning/comprehending sidecars, ingress, etc. first, as they are some of the building blocks of k8s.

Edit: I see there's another comment suggesting Traefik for ingress - that's fine too; it's the concept that matters, not your particular choice. If you have a lot of trouble implementing one, try the other; the things you learn on the journey will help quite a bit.


Use Traefik as your Ingress, done.


Correction: that's for TLS; auth is still TBD.

Can't edit since I'm on mobile.


Just refresh the page; the edit button should reappear on mobile.


Thanks for the tip. "On mobile" means using Materialistic for me; there's no way to edit there.


From the SRE side: if you need bespoke fancy shit (we write a lot of operators to remove toil from running things we'd typically do with ansible/manually), the kubernetes API, CRDs and the operator/controller pattern are a joy to develop against for the most part.


My only experience with k8s is trying to run it in-house on a cluster of servers. I could never get it to work. Ansible makes much more sense to me.


How do k8s and Ansible overlap in your world? I personally haven't encountered a situation where one can serve as a substitute for the other.


A lot of the "finops" practitioners I've seen are myopically focused on tagging AWS resources, and that falls to pieces with Kubernetes because AWS can't see inside the Kubernetes clusters.

I’m not surprised they don’t like it.


I just had a meeting with a 'finops' manager where I showed him how Kubernetes has a similar tagging structure (labels) and how we can break down per-team pricing based on CPU/memory utilization.

It's not hard; you just need the tools (Kubecost, etc.).


Pod A uses 2 cores, pod B uses 1 core. The machine has 4 cores, and all remaining unscheduled pods require 2 cores.

How do you attribute the partial usage of the node? Is it 2 cores billed to pod A, 1 core billed to pod B, and 1 core billed to some random team?

Or do you have 2/3 of the node billed to pod A and 1/3 of the node billed to pod B?

Now deal with this permutation across all the various variables.


You do it roughly based on the deployment requests and the average HPA values throughout a time slice.

Most k8s workloads run on a homogeneous set of node types, so you can have an hourly cost per GB and per vCPU without digging too much into it.
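
As a rough sketch (the hourly rates, requests, and team labels below are invented for illustration; tools like kubecost compute a fancier version of this):

    # Request-based showback: price each pod by what it reserves, then
    # roll the totals up by team label.
    VCPU_HOUR = 0.03   # assumed blended $/vCPU-hour for the node pool
    GB_HOUR = 0.004    # assumed blended $/GB-hour

    pods = [
        # (pod name, requested vCPU, requested memory GiB, team label)
        ("api-7f9c", 2.0, 4.0, "payments"),
        ("worker-1b2d", 1.0, 8.0, "ml"),
    ]

    bill = {}
    for _name, vcpu, mem_gib, team in pods:
        hourly = vcpu * VCPU_HOUR + mem_gib * GB_HOUR
        bill[team] = bill.get(team, 0.0) + hourly

    for team, cost in sorted(bill.items()):
        print(f"{team}: ${cost:.3f}/hour")

Whatever node capacity is left unrequested (the idle headroom from the question above) either gets spread pro rata across teams or billed to the platform team.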


Yup. Particularly because k8s scales based on whole system load, not single app load. It's harder to predict because it's better at optimizing resource utilization and ultimately lowering costs.


You can scale on many things in more recent versions of K8s: for example, the depth of un-acked pub/sub messages, or custom metrics in Prometheus format.

https://cloud.google.com/kubernetes-engine/docs/tutorials/au...


I wrote a tool that helps estimate K8s costs by simulating K8s clusters. You write your pods in a simple DSL, and it runs kube-scheduler behind the scenes without any actual nodes.

It's still really basic but I'd love to hear your feedback!

https://github.com/aporia-ai/kubesurvival


I think this is a great idea. I can see the DSL being very useful for situations where you want to think about hypotheticals. My main piece of feedback is that you should consider supporting the ability to feed in your existing k8s manifests as input to this tool in addition to the DSL. I think that would make the tool very appealing and make it very easy to onboard users with existing clusters who are looking to reduce their costs.


For sure, this is definitely planned!


Awesome, I'm definitely keeping an eye on this project.


I mean, what did anyone think the cloud was? This isn't news.


This article is written from the perspective of companies that failed to plan for Kubernetes and just used it willy-nilly. So... pointless.


Kubernetes will happily run without autoscaling. This article is barking up the wrong tree.


I am pretty sure that Kubernetes is just a buzzword they included to get clicks. You can definitely spend money in the cloud with or without Kubernetes, and in the Kubernetes case, disabling node autoscaling doesn't eliminate the ability to spend money by submitting an API object. The article mentions PVs (which will be auto-provisioned whenever you request one; this feature is so core to Kubernetes that it works without effort even if you aren't using the cloud provider's managed Kubernetes offering) and cloud storage (write a byte, now you're spending money to store that byte), and it's right -- those things cost money and make spend unpredictable. But it really has nothing to do with Kubernetes. They may as well have said "using Intel CPUs makes cloud spend unpredictable," because Intel CPUs are capable of executing instructions that call APIs to spend money in the cloud. Technically true, but kind of grasping at straws.


I've seen enough installations to say this is not the fault of Kubernetes; it's that people think this should just work automatically. It has to actually be implemented to count as well.


As opposed to the corporate VMware enterprise license shock-and-awe ritual we go through every year. Infrastructure platforms are wicked expensive.


We have the best of both worlds: we're migrating to k8s, but on ESXi just because.


Disclaimer: I'm Co-Founder and CEO at https://vantage.sh/ - a cloud cost transparency platform.

We have been hearing this a lot from our customers who use EKS. They are running single clusters as shared infrastructure, so they have no insight into which workloads are contributing the most cost. The same is true of other shared infra, like data pipelines.

We are currently working on a solution for pod-level cost insights. If anyone is interested in signing up for the beta, shoot an email to ben@vantage.sh


Why does the graph have a legend when the only color is green?


I think this headline is hyperbole, but also somewhat true, though not through any fault of Kubernetes. I've worked in this space extensively and have been called in to consult, in one capacity or another, on a number of large enterprise Kubernetes deployments. Nearly universally I found the following things to be true:

1. Companies had critical infrastructure for the success of Kubernetes owned by teams that opposed deploying Kubernetes

2. The primary person shepherding Kubernetes into the company's environment had not done their due diligence on which workloads were appropriate for Kubernetes and which were not, or on how applications would integrate across mixed environments when required.

3. The principal tech resources at the company were not educated about containerization, Kubernetes, and the intricacies of container networking but were on the hook internally for the implementation.

What ends up driving the "black hole of unpredictable spend" is that companies are sold (either internally or externally) on a relatively short migration timeframe, but that timeframe is contingent on the company having appropriate infrastructure, staffing, and no key people internally blocking said migration. If any factor is out of whack, the migration timeline can quickly approach infinity.

While it is true that there are startups that could run everything they need for their first 10k customers on 5 VMs with Nginx and MySQL, yet decide to build grandiose environments in Kubernetes they don't need, the opposite is also true: there are huge enterprises that could in reality massively benefit from Kubernetes in their environment but for "political" reasons can't get it done even after spending millions of dollars, so they are stuck mired in their "legacy" environments. Networking, in particular, is a huge barrier to entry for enterprise Kubernetes deployments, and those deployments are almost always stymied by people, not technology, because most enterprises have some Boomer network admin who doesn't actually know anything about networking beyond the Cisco gear running things.

So, what do companies do? They go to AWS or GCP and they just run up a /massive/ bill, as they very very slowly migrate (often rewrite) their legacy systems to the cloud. This is of course astronomically and unnecessarily expensive, but it's generally not the fault of the underlying technology. AWS and Google are happy to bilk major enterprises as well, and often sell them a bill of goods they can't deliver on.



