There is obviously some point, somewhere between ‘just running a website out of my spare bedroom’ and ‘Facebook’, where some of this infrastructure does become necessary.
It is very important that you realize when you pass one of those points.
It doesn’t help to just tell people ‘you don’t need kubernetes, you are not Facebook’
There are a lot of companies that aren’t Facebook, but they still have multimillion dollar revenue streams riding on service availability, with compliance obligations and employee paychecks on the line, and which need to be able to reliably release changes to production code every day, written by multiple development teams in different time zones, so... yeah, maybe they are a bit beyond just scp’ing their files up to a single Linux VPS.
This is a good point.
I wish there was more discussion around when to consider making these decisions, how to transition, what types of systems are in the middle ground between bedroom and facebook, etc.
It seems all of the articles/posts that get our attention are the "you don't need it ever" opinions, but those are incredibly short-sighted and assume that everyone is recklessly applying complex infrastructure.
In general, it would be nice to have more nuanced discussions around these choices rather than the low hanging fruit that is always presented.
What I tend to see missing is clear explanations of, "What problems does this solve?" Usually you see a lot of, "What does it do?" which isn't quite the same thing. The framing is critical because it's a lot easier for people to decide whether or not they have a particular problem. If you frame it as, "Do we need this?" though, well, humans are really bad at distinguishing need from want.
Take Hadoop. I wasted a lot of time and money on distributed computing once, almost entirely because of this mistake. I focused on the cool scale-out features, and of course I want to be able to scale out. Scale out is an incredibly useful thing!
If, on the other hand, I had focused on the problem - "If your storage can sustain a transfer rate of X MB/s, then your theoretical minimum time to chew through Y MB of data is Y/X seconds, and, if Y is big enough, then Y/X becomes unacceptable" - then I could easily have said, "Well my Y/X is 1/10,000 of the acceptable limit and it's only growing at 20% per year, so I guess I don't really need to spend any more time thinking about this right now."
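To make the arithmetic concrete with made-up numbers (the X and Y here are purely illustrative, not from any real workload):

```sh
# Purely illustrative figures:
X=200       # sustained storage throughput, MB/s
Y=500000    # data to chew through, MB (500 GB)
echo $(( Y / X ))   # 2500 seconds, i.e. ~42 minutes per full scan
# If a ~42-minute batch window is acceptable and Y grows 20%/year,
# a single machine buys you years before distribution pays off.
```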
Yep. All of us tech folks read sites like Hacker News and hear all about what the latest hot Silicon Valley tech startup is doing. Or what Uber is doing. Or what Google is doing. Or what Amazon is doing. And we want to be cool as well, so we jump on that new technology. Whereas for the significant majority of applications out there, older, less sexy technology on modern fast computers will almost certainly be good enough and probably easier to maintain.
I've had several experiences where I attended a tech talk by someone from Uber, and never once did I come away with the impression that the problem they were trying to solve was the kind of thing Fred Brooks had in mind when he coined the term "essential complexity."
That said, volume and velocity aren't the only kinds of scale that technical teams have to grapple with. I've spent enough time at a big organization to understand that Conway's Law costs CPU cycles. Lots of them.
> What I tend to see missing is clear explanations of, "What problems does this solve?"
This still sounds like starting out by looking at something because it's cool, hip, and perhaps useful, and only afterwards considering whether it solves any of your problems. The proper working order is to start with the problem. Hardly any problem requires a state-of-the-art solution.
Where I live, when wiring a house, regulation says you must use the same standard for everything on a single group on the switchboard. This hilariously means that if you need to extend a run of iron pipes with canvas-insulated wires, you have to use metal pipes with canvas-wrapped wires (or rip everything out and replace it with something modern).
When you are in a position to pay someone full time to look after the web infrastructure of your project, that's when you should think this stuff through.
Before that, don't worry about the tech, go for what you know, and expect it to need totally re-writing at some point. Let's face it, this stuff moves quickly.
That's not good. That kind of stuff must be derived from your business plan, not from the current situation. You should know that "if I'm successful at this level, I will need this kind of infrastructure" before the problems appear, and not be caught by surprise and react in a rush. (Granted that reacting in a rush will probably happen anyway, but it shouldn't be surprising.)
If your business requires enough infrastructure that this kind of tooling is a need at all, it may very well be an important variable on whether you can possibly be profitable or not.
> You should know that "if I'm successful at this level, I will need this kind of infrastructure" before the problems appear, and not be caught by surprise and react in a rush. (Granted that reacting in a rush will probably happen anyway, but it shouldn't be surprising.)
What's the ROI on spending time on planning for that, given that you acknowledge that the response will be a mess anyway? In my experience, companies that planned for scaling don't handle it noticeably better than companies that didn't plan for it, so maybe it's better not to bother?
> What's the ROI on spending time on planning for that, given that you acknowledge that the response will be a mess anyway?
Let's put it this way: what's the ROI of failing to address any of the failure modes avoided or mitigated by taking the time to think things through at design time?
Are you planning on getting paid when you can't deliver value because your service is down?
The ROI of reducing time to market is huge, whereas I don't think I've ever seen thinking things through at design time deliver any real benefits (not even a reduced rate of outages).
> I don't think I've ever seen thinking things through at design time deliver any real benefits (...)
That's mostly the Dunning-Kruger effect rearing its head. It makes zero sense to claim that making fundamental mistakes has zero impact on a project.
This type of obliviousness is even less comprehensible given that every other engineering field has project planning and decision theory imbued in its matrix, and every single intro to systems design book puts its emphasis on failure avoidance and mitigation. Yet somehow, in software development, hacking is tolerated, and dealing with the perfectly avoidable screwups that result from said hacking is shrugged off as if they were acts of nature.
I suspect it's tolerated because it works better; IMO the value of planning is something cargo-culted from engineering fields where changing something after it's partially built is costly. In software, usually the cheapest way to figure out whether something will work is to try it, which is in stark contrast to a field like physical construction.
Also, when cost caused by downtime and updates and/or hardware problems exceed the cost of this full time engineer.
Even then, K8s is probably not the next step. Simple HA with a load balancer and separate test and production environments, plus some automation with an orchestration tool, works wonders.
If you are already deploying to Heroku, you are already doing ops.
It makes zero sense to presume you don't need to know what you're doing or benefit from automating processes just because you can't hire a full-time expert.
True, but there's a difference between solving ops problems by throwing more money at hardware and solving ops problems by thinking smarter. That's why I think that some multiple of an FTE salary is a good metric for when you start transitioning to more complex operational setups, because it gives you a number for when the gain of that new complexity is worth the cost.
This. When you have enough revenue to hire new dev, and infrastructure is your most pressing problem to solve with that money, hire someone who knows this stuff backwards to solve it.
So I agree there’s definitely a need for more writing in this space (and you’d think it might be particularly of interest to incubating startups...)
My top level take on this is that the cues to take things up a level in terms of infrastructure diligence are when the risk budget you’re dealing with is no longer entirely your own. Once you have a banker, a lawyer, an insurance policy, an investor, a big customer, a government regulator, shareholders... ultimately once there’s someone you have to answer to if you haven’t done enough to protect their interests.
And risk management isn’t just ‘what will happen if the server catches fire’, it’s ‘how do I guarantee the code running in production is the code I think it is?’ And ‘how do I roll back if I push a bug to production?’ And ‘how can I make sure all the current versions of the code come back up after patching my servers?’
And it turns out that things like kubernetes can help you solve some of those problems. And things like serverless can make some of those problems go away completely. But of course they come with their own problems! But when those things become important enough that you need to solve for them to answer the risk aversion of sufficient stakeholders, the cost of adopting those technologies starts to look sensible.
> Obviously if you’re FAANG-level or some established site where that 0.1% downtime translates into vast quantities of cash disappearing from your books, this stuff is all great. You have the funds to do things “right”. Your fancy zero-click continuous deployment system saves you thousands/millions a year. But at indie maker, “Hey look at this cool thing I built … please, please someone look at it (and upvote it on Product Hunt too, thx)” scale — the scale of almost every single site on the net — that 0.1% is vanishingly insignificant.
> Obviously if you’re FAANG-level or some established site where that 0.1% downtime translates into vast quantities of cash disappearing from your books,
I'm yet to see a company with any online presence where downtime doesn't eat away at their profits.
In the very least, downtime means you're paying money to get no service.
Downtime is fine for someone's blog on his personal website that he set up on a weekend while drinking beer. I'm yet to see a business state that they are OK with their site throwing 404s or 500s at random, adding up to a whole workday per year, which is what 0.1% downtime translates to.
Fear not, the article has you covered (emphasis mine):
> Obviously if you’re FAANG-level or some established site where that 0.1% downtime translates into vast quantities of cash disappearing from your books, this stuff is all great.
This is not a FANG thing. Not being able to deliver a product or service is a business thing. This happens in small mom-and-pop shops as well as FANGs. This is a free-market-economy thing. If your business is not open, then you get expenses but zero income. How is this hard to understand?
The software world involves way more than your small side project that you managed to cobble together during a weekend. Things do need to work reliably and predictably. Otherwise not only do you not get your lunch but also your competitors eat it from under your nose. Why is this even being discussed, in HN of all places?
Because having to keep a site live doesn't automatically mean I need all the complexity of K8s? I can deploy my server on two VMs in two AZs with Ansible and run them with systemd to restart on crash. Just because I don't immediately jump to K8s doesn't mean I don't know how to run a site.
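For reference, the restart-on-crash half of that is about ten lines of systemd unit, sketched here with hypothetical names and paths:

```ini
# /etc/systemd/system/mysite.service -- minimal sketch, hypothetical app
[Unit]
Description=My site
After=network.target

[Service]
ExecStart=/opt/mysite/bin/server
Restart=on-failure
RestartSec=2s

[Install]
WantedBy=multi-user.target
```

Then `systemctl enable --now mysite` on each VM, which the same Ansible play can do for you.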
This is a really weird take on what ‘every single site on the internet’ is. Or of what proportion of the software development community is working on sites of that scale.
It’s like saying ‘the vast majority of people building houses don’t need to dig foundations’ because you build Lego houses, and after all the vast majority of houses are made of Lego, right?
There are ways to quantify it - but a real telling point is if your service is down because of the complexity of tooling you don’t fully understand.
Many problems can be solved with other methods - especially if you do NOT have global all-sync’d requirements like Twitter and Facebook. Many SaaS products (example: Salesforce) find it perfectly acceptable to shard customers to particular instances, thereby limiting downtime if something goes wrong.
And always keep in the back of your mind that the largest VPS or servers are very performant today - having to scale sideways may be masking inefficiencies in the code.
True - but the datacenter burning still hits you where it hurts if your entire Kubernetes cluster is living in that datacenter.
Whereas if you make a point of spinning up each Linux box on different datacenters (and even perhaps different providers) you are at least resilient against that.
Using advanced tooling doesn't remove the need to do actual planning beyond "oh I have backups" - you should be listing the risks and recoveries for various scenarios - though being able to handle "AWS deleted all my servers because I returned too many things via Amazon Prime" means you can also handle any lesser problem.
IME you'll hit it around $500,000 USD ARR. But again, situations vary. I've worked on projects where hours-long scheduled downtime at the weekend was acceptable. So easy!
> There is obviously some point, somewhere between ‘just running a website out of my spare bedroom’ and ‘Facebook’, where some of this infrastructure does become necessary.
Frankly, that blog post reads as terribly myopic. The author essentially looks at his personal experience setting up his single-dev personal project, complains that he spent some time jump-starting a basic CICD pipeline, and from there proceeds to argue that no one needs any of that.
Well, how does he expect to check whether or not his changes break the deployment? How does he expect to deploy his stuff? How does he plan to get multiple commits to play nice prior to deployment? How does he expect to roll back changes? How does he plan to handle request spikes, whether from normal traffic or attacks? Does he plan to get up at 3AM on a Tuesday to scale up deployments, or does he plan to let his site crap out during the night and just get back to it in the morning?
And somehow we are supposed to accept the naive belief that getting features out of the door is all that matters, and that basic ops is not needed by anyone?
Yeah, I kind of agree with the author, but not 100%. The takeaway shouldn't be "don't use CI, kubernetes, etc." but rather "don't learn something new" when your goal is to get a product out the door. If you already know kubernetes, then it might seem silly to not use it given how it simplifies things. It's only when you don't know kubernetes that you'll run into problems.
> but rather "don't learn something new" when your goal is to get a product out the door.
I don't agree at all, because that argument is a poorly put-together strawman.
No one ever had to choose between setting up a CICD pipeline and churning out features. Ever.
In fact, continuous deployment makes it possible for everyone to get features out of the door as fast as they possibly can, and it makes deployments a two-way door instead of a finicky, one-way, pray-that-it-works event.
The door that features have to get through is not JIRA but the prod environment. It's inconceivable how some people try to pretend that automating operations adds no value.
And hell, CICD nowadays is trivial. You know why? Because of tools like Kubernetes/Cloudformation/CDK/Ansible/etc.
There is a whole lot of time before your first customer where you don't need any of this. Getting an MVP out > getting your engineering shinies.
In my company I first had servers directly on a bare metal box, with half the infra running on my workstation, then moved everything to docker-compose, then kubernetes and now I'm getting ready to get my CICD pipelines.
> There is a whole lot of time before your first customer where you don't need any of this.
No, there really isn't. Customers want features, and they want them rolled out as fast as possible.
How do you provide that? By automating your delivery process. You don't treat your infrastructure as a pet, and all you need to do to get a feature to all customers, fully tested and verified it won't break your service, is to push a commit.
With modern CICD systems and a basic Kubernetes setup, you get a fully working continuous deployment pipeline from scratch in about 15 minutes. There is no excuse.
The blog post is bullshit from someone who has zero insight and no relevant experience.
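To put a number on the "15 minutes" claim above: once a cluster exists, the continuous-deployment half is roughly this much CI config (GitHub Actions shown as one example; the image name and deployment are hypothetical, and it assumes registry credentials and a kubeconfig are already wired up as secrets):

```yaml
# .github/workflows/deploy.yml -- minimal sketch, hypothetical names
name: deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t registry.example.com/myapp:${{ github.sha }} .
      - run: docker push registry.example.com/myapp:${{ github.sha }}
      # Rolls the deployment to the new image; k8s handles the restart.
      - run: kubectl set image deployment/myapp myapp=registry.example.com/myapp:${{ github.sha }}
```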
(I am not commenting on the article, I am commenting on the comment thread)
> Customers want features, and they want them rolled out as fast as possible.
I am going to reiterate: There is a whole lot of time _before_ your first customer.
Now I agree with you: at my last job we had no CICD, no tests and no procedures when I joined, and it was _hell_ to release anything. Writing tests and doing the infra candy improved our velocity, even though technically we were doing more work. But this was _with_ customers. If you don't have customers, they don't care about the features.
> I am going to reiterate: There is a whole lot of time _before_ your first customer.
So what? Don't you understand that this just renders your point even less defendable?
> But this was _with_ customers.
So what? Do you expect your code to work only in prod? Do you hope that all the bugs you're adding right now will just pile up into a big tangled, unmanageable mess until it becomes a problem? Are you even bothered by the fact that your last commit may or may not have screwed the entire project?
There is absolutely zero justification for not automating testing and delivery. Zero. There is not a single use case where pushing half-baked crashers without any verification is OK.
>Facebook is kind of an interesting example, as they got pretty far into "hyperscale" with mostly PHP, memcached, and Mysql.
But PHP and MySql are only the surface level tools we see.
There is also the unseen and that's where the immense complexity lies hidden under the surface.
E.g., to help with scaling (which Friendster failed at), Facebook had separate MySql instances for each school. An interview with Facebook co-founder Dustin Moskovitz mentioned they had the constant stress of emergencies and fires to put out. They had tons of custom tooling, Linux scripts, batch jobs, automation/orchestration/dashboards, etc., and it's very likely the combination of all that ends up inventing an "invisible internal Kubernetes" without calling it "Kubernetes" to manage their server fleet. In other words, if Docker/Kubernetes/AWS had been around, the early Facebook team might have offloaded some of the complexity onto a 3rd party to reduce the all-nighters. But the tools didn't exist yet, so they had to do it the hard way and invent "deployment orchestration" themselves.
[EDIT to reply]
Thanks for the "Tupperware" comment. A Google search leads to a Facebook post saying they renamed it to "Twine":
That's recent though. I'm saying they scaled up quite a long way before doing anything like that. From the outside it looks like they didn't really branch out into things like this until 2008 or so.
>From the outside it looks like they didn't really branch out into things like this until 2008 or so.
No, even before the Newsfeed feature rollout of September 2006, they already had crazy complexity orchestrating multiple servers for many schools. They were only a "simple PHP+MySql" website back in February 2004, when only Harvard students were on it. Dustin said he was the one in charge of spinning up new servers for each new school, and he said it was a nightmare of engineering effort.
Stack Overflow in 2009 is an example of something that caches extremely well - several orders of magnitude better than stuff like a personalized wall for every user.
IIRC, they were doing things like patching BitTorrent to prefer closer IPs to pull from during their deploys, so a release would roll through their network from one side to the other without completely saturating it.
But the point still remains, they had challenges to solve, but they were also simple, which made it possible TO solve those challenges.
This could be the final post to this discussion. You can't really get more to the point than that.
We are running a large German browser game on individual VPS instances. For many years, with PHP, MySQL and Redis. It works and it runs just fine.
I dunno exactly what Craigslist is running on today, but some slides from 2012[1] include items like running MongoDB over "3 shards across 3 node replica sets" with "duplicate config in 2nd data center"; a "sharded multi-node" redis cluster; multiple "vertically partitioned" MySQL clusters; the actual application stack on top of that being mod_perl on Apache; and it was all operated by a 'small but growing' dedicated team.
It might or might not be notable that the growth of that tech team included, one year later, hiring Larry Wall[2].
Given that, I'm not sure how to apply 'if Craigslist doesn't need it, I probably don't' as a general principle....
I see language as a personal choice, and while I wouldn't run Perl today (I'm partial to Python, myself), I think their stack is fairly simple - a tried-and-true language (however quirky it may be), no virtualization, a simple bare-metal hardware stack - and yet they serve an incredible number of users. In that sense, I'm sure hiring Larry was a no-brainer.
They are no FB, but they're bigger than almost anything else on the web [0]. I doubt most people will build anything bigger in their careers. I certainly won't.
A huge caveat might be app features. Clearly, running something like Travelocity could require more pieces, and something like a BI app in a Fortune 100 company would have many more connectors to data sources. But in general, they've done pretty well using only newer versions of what worked way back in 1999, and to me, that's just incredible.
People bitching about K8S, systemd, etc. miss the point that these systems allow a qualified generalist ops admin to pop in and fix stuff without knowing the app well.
For a little company with <500 servers, it doesn’t matter. As size and complexity increase, figuring out how the guy who left last year set up FreeBSD becomes a liability.
That surfaces a very important communication problem. If you consider 500 servers a "small datacenter", then, of course, you will say that K8S adds value to small operations.
A qualified generalist ops admin should be productive on any modern UNIX or UNIX-like within a couple weeks.
Figuring out how the guy last year set up FreeBSD consists of similar effort and process as figuring out how the guy last year set up Debian or CentOS or SUSE.
It can be challenging. I basically bought my house based on figuring that out for people in the old days.
It’s not an issue with FreeBSD or anything per se. Usually there’s a genius superhero guy who does whatever, who burns out/quits/otherwise goes away, and the business is stuck.
I’d suggest that when you have paying customers, and your duct-tape deploy process is starting to feel risky to run during business hours, that’s a good time to take stock for the first time. Before that point it should be the fastest thing you can wire up given your own personal tooling expertise.
(I’d argue that k8s is actually lean enough to be a competitor here if you’re already an expert and you use a managed service like GKE).
Your infra should be in service to some uptime/performance/DX goals. Define those (honestly!) and then you can figure out what you need to improve.
In particular if no customer has ever noticed downtime, and you don’t think you’d lose business if someone hit a 30-min outage, then you probably don’t need to upgrade. Twitter was notoriously unreliable in the early days and they made it, for example.
Fair point. But people just love to over-complicate things, and it is far more often that you come across absolutely ridiculous infrastructure monstrosities at very small businesses than that you come across the opposite.
I’ve found that career FOMO drives a lot of the desire to overcomplicate infrastructure.
It’s hard to look at job listings from prestigious companies listing K8s and other buzzwords as requirements without feeling some desire to add those buzzwords to your own resume.
If you need scale, then you know what you need when you need it. None of this advice people are tossing around would apply to WhatsApp, for instance. Because they needed to get into the weeds on BEAM optimizations for Erlang and they needed to do so quickly. They probably didn't have a year to mess around getting K8S up and stable. Or crafting the perfect Ansible script. Or infrastructure-as-code. They probably cared very little about any of that.
WhatsApp is the closest thing you'll see for explosive growth. And they had time to work on BEAM patches. Think about that. Their whole world didn't just collapse because they couldn't support a billion connections on day one. Even they had time to scale as they needed.
A couple of Ansible scripts to deploy the k3s Kubernetes distribution onto cloud VMs.*
Add LetsEncrypt to the Traefik install for https
Use the default Helm chart to make your k8s deployments for your apps - you'll probably only need to tweak a couple values to start - ports and ingress rules.
* k3s comes with Traefik pre-baked to listen on ports 80 and 443 so you can run it "bare metal" without a cloud provider's load balancer service or any special tweaks to Traefik config. If you just Helm install any of the standard k8s ingress providers like the standard Traefik chart, Consul, or Nginx, you have to tweak a bunch to get them working on privileged external ports 80 and 443.
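Concretely, the bootstrap per VM is one documented install command, and the app install is one Helm invocation (the chart values here are hypothetical, assuming a default `helm create` chart):

```sh
# On each cloud VM (or via the Ansible scripts mentioned above):
curl -sfL https://get.k3s.io | sh -     # installs k3s with Traefik bundled
# From your workstation, with kubectl/helm pointed at the cluster:
helm install myapp ./chart \
  --set ingress.enabled=true \
  --set "ingress.hosts[0].host=myapp.example.com"
```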
I found Dockerizing my app really early to be a good bridge to k8s.
Build container locally, test, push. On Linux box, pull the new app container, stop the old app container, start the new app container. Separate nginx reverse proxy also in docker so if you’re fast enough you don’t drop any packets.
I’d say docker is worth using from the get-go, but if you are most familiar with raw Linux probably just use that for your first iteration.
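The whole loop fits in a few lines of shell; a sketch with hypothetical names:

```sh
# On your workstation:
docker build -t registry.example.com/myapp:latest .
docker push registry.example.com/myapp:latest

# On the Linux box (nginx keeps proxying in its own container while you swap):
docker pull registry.example.com/myapp:latest
docker stop myapp && docker rm myapp
docker run -d --name myapp --restart unless-stopped \
  -p 8080:8080 registry.example.com/myapp:latest
```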
What about a managed service like EKS or GKE? A minimal k8s config is perhaps 20 lines to spin up a service and expose the ports - that feels pretty achievable for a bedroom coder?
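For reference, the "20 lines" looks roughly like this (a minimal sketch with hypothetical names; on EKS/GKE the LoadBalancer service provisions the cloud LB for you):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels: {app: myapp}
  template:
    metadata:
      labels: {app: myapp}
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.0
          ports: [{containerPort: 8080}]
---
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  type: LoadBalancer
  selector: {app: myapp}
  ports: [{port: 80, targetPort: 8080}]
```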
Machine installation manuals, "create machine" shell scripts, manual Docker, script-launched Docker, VM management tools... Well, technically, Puppet and its kind are there too, depending on whether you consider them viable or not.
You can use something like AWS Application Load Balancer with two EC2 hosts behind it. Linode also offers an LB if you prefer Linode.
Now do a simple Blue/Green deploy. Deploy to the first host: maybe you use Java, so just have a script that shuts down your app, copies over a new uberjar, and starts it up again. Try it out, and if it all works, swap the LB to point to this host instead. If it results in errors and failures, your rollback procedure is to swap the LB back to the previous host.
You can scale this pretty easily as well, just add more hosts behind your Blue/Green groups.
Now the last challenge is actually going to be your DB. You can't swap DB instances like that, so that's where something like AWS RDS or a managed DB service is useful, because they often offer automatic replicas and failover.
But if you're lazier (or don't have the money) than that, just have a third host running PostgreSQL with a scheduled backup of your DB to some other durable storage like AWS S3. And when you need to update that host, schedule a maintenance window for your app/website.
So...
1) You can start with one single VM and have your DNS point directly to that host, where the host runs your DB and your app. Deployments are you ssh-ing in, shutting down, copying a new uberjar, and starting up again. And you turn this into a script that does it for you (sketched after this list).
2) Then you add a DB backup cron job to S3 so you don't lose all your users data from an incident.
3) Now if you want to avoid downtime when you deploy, you add an LB, move your DNS to point to the LB, put the existing host behind the LB (it becomes Green), add another host which will be Blue, and move your DB to a third host that's not behind the LB. Now do what I said earlier: you can deploy new versions of your app without downtime and do quick rollbacks in case you deploy something that breaks your users.
4) If your user base grows and one host isn't enough anymore, you can add more behind the Green and Blue groups. Everything else is the same, just your script now needs to swap between the groups.
5) Now you might start to want tests running automatically, the uberjar being created automatically when you commit, a beta stage, and all that. So you can set up a CI/CD pipeline. Have a test host and a test DB. And your app has config to know which environment it runs under, to detect which DB to connect to.
6) Finally, if your DB becomes a scaling bottleneck due to growth, congratulations: you've now become a successful business and can just hire other people to deal with it.
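For step 1, the deploy script is nothing fancy; a minimal sketch with hypothetical hosts and paths (step 3's Blue/Green is this same script pointed at the idle host, followed by flipping the LB over):

```sh
#!/usr/bin/env bash
set -euo pipefail
HOST=deploy@myapp.example.com   # hypothetical

scp target/myapp.jar "$HOST":/opt/myapp/myapp.jar.new
ssh "$HOST" '
  sudo systemctl stop myapp
  mv /opt/myapp/myapp.jar /opt/myapp/myapp.jar.prev   # keep one rollback copy
  mv /opt/myapp/myapp.jar.new /opt/myapp/myapp.jar
  sudo systemctl start myapp
'
```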
It's always a judgment call with how to complicate your infrastructure.
If you succeed, sometime in the future, you will have lots of engineering resources to manage the complexity of a "web scale" service. That level of complexity is costly, though.
So, today, what is more important is ease of development, a friendly software iteration cycle, and easy testing on developer machines, since you don't have FB's or Google's test infrastructure yet.
You have to shoot for something that you can build fairly simply that'll carry you the next year or two, and think about it in terms of API boundaries, so that you can replace any particular piece, with a faster, more complex piece in the future when you need it.
Your appropriate design for today isn't going to come from this site, or from reading AWS or GCP press materials, but from people who have been through this multiple times and have some experience in evolving small systems into big systems. They're hard to find.
In that case, just imagine how far you'd go if, instead of manually scp-ing to a bunch of servers, you installed a tool like Ansible, added your bunch of Linux boxes to a list, wrote down in a script what you need to scp to those boxes, and from then on just ran the script whenever you need to update them.
But somehow that's too much because it's now ops and no one needs to scp to boxes safely and reliably like that.
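For the record, that tool-shaped version of "scp in a loop" is about ten lines of Ansible (the inventory group, paths, and service name here are hypothetical):

```yaml
# site.yml -- minimal sketch; run with: ansible-playbook -i inventory site.yml
- hosts: webservers        # the "list of boxes" lives in your inventory file
  become: true
  tasks:
    - name: Copy the app bundle
      ansible.builtin.copy:
        src: ./build/myapp.tar.gz
        dest: /opt/myapp/myapp.tar.gz
    - name: Restart the service
      ansible.builtin.systemd:
        name: myapp
        state: restarted
```

Same scp underneath, but now it's idempotent, logged, and parallel across every box.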
In the overwhelming majority of cases, you will never pass that point. A well-optimized dedicated server with 90 fast cores, 128GB of RAM and 4TB of SSD goes for $120/mo at Hetzner. That can easily handle 1 billion non-cached monthly page views if you write decent code. More if you cache. How many sites get that much traffic? A few dozen. If you're dedicating this much effort to preparing for one-in-a-million eventualities, you will never ship anything.
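Back-of-the-envelope on that claim:

```sh
# 1 billion page views/month expressed as an average request rate:
echo $(( 1000000000 / (30 * 24 * 3600) ))   # ~385 requests/second
# ~385 req/s sustained is well within one big box for most apps,
# though real traffic is spiky, so peak capacity needs headroom.
```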
> It is very important that you realize when you pass one of those points.
It. Does. Not. Matter.
You're not going to be the same guy who sets up Facebook's garage server in 2004 and then scales it out to the multi-datacenter monster it is today.
If you are that guy, then I really feel sorry for you. Something went seriously wrong with your equity (or lack thereof) and you are doing janitor shit when you should be up in the tower with the other C-levels.
Part of what you're saying is a good point, IMO. The people who transition the infrastructure from Level 0 to Level 1 won't be the same people in charge of infrastructure when it's time to transition from Level 17 to Level 18.
But however many levels there are from garage infrastructure to Big Tech infrastructure--even if there were only 2--someone will always have to decide that the time has come to transition from This Level to Next Level. Different expertise may be required to transition from L0 -> L1 vs L10->L11 vs L30->L31, but that expertise still matters.
And if I were the one who had to decide at any of those levels, I would want a well-designed plan for the next transition or two that was of the form: 'When [some measurement] reaches [some threshold] we will [change the infrastructure in the following ways...]'
> And if I were the one who had to decide at any of those levels, I would want a well-designed plan for the next transition or two
Right. This was really my (poorly-expressed) original point. Deciding how and when to scale is going to be a team and organization effort and decision. If it does come down to one guy deciding on something, then that's going to be the CTO. And if he/she is a good CTO, their decision is based on the feedback of a dozen people under them.
Scaling is a progression and no matter how fast your growth is, you do have time to come up with a solid plan and execute that plan. You don't need Facebook scale on day one even if you do become the next Facebook.
I agree that there's a problem with that kind of false dichotomy, but just as Rails is "good enough" for a vast majority of workloads (where you specialize away the parts that need it when you need it), I'd say the same about Heroku. The DX scales from 0 to "big enough" very well, and once you get beyond that, it's reasonably straightforward to move the parts that need it into AWS.
I've noticed that there is often an ideological argument for or against open source (sometimes to an extreme extent) buried in arguments for complex/expensive/distracting infrastructure, whether directly or indirectly. It gets coded as "portability", and that's why you end up managing your own metal on Kubernetes. But you've really got to ask yourself: if you had to pick one, would you index on convenience or portability? There are some businesses where that portability actually is that important. But there are a lot of businesses for which convenience comes first.
Determining what your priorities are is the first step. From there, I think you have options.
Infrastructure is important whether you're running a bare metal server or VM's on someone else's computer in the cloud.
Different infrastructure solves different problems.
Container orchestrators solve a very specific density problem. Back in the day, we recklessly threw a machine together and put either one app or an unbounded number of apps on it. If you put one, you've wasted all the remaining resources on the machine. If you put an unbounded number of apps on it, you've effectively created a hacker's wet dream, to say nothing of noisy-neighbor hell. Can you still have containers that demand hoggish resources? Sure; configuration is up to your imagination and your expertise - just as with virtual machines and BSD jails. That said, now you have options.
Container orchestrators also made it incredibly easy to pay less for things like load balancers. Instead of paying for that pricey layer 7 load balancer from your favorite cloud provider, you can bootstrap nginx into your stack and have lots of fun. Container orchestrators are also very chatty and invoke a level of network utilization that our predecessors probably would've gawked at. When misconfigured, containers are difficult to troubleshoot and require a large array of supplementary software in order to run them with any efficacy partially due to sheer number (nobody runs just one container).
Walking around telling people "you don't need x" is a reactionary statement, because I have only covered the surface-level information about container orchestrators' benefits and trade-offs; you can't possibly know the answer without extensive conversations about those trade-offs.
Saying stuff like this:
> ‘Engineers get sidetracked by things that make engineers excited, not that solve real problems for users’ — we’ve heard it all before
Engineers do overcomplicate stuff, but there are reasonable explanations for why, if you dare to ask with any kind of non-judgmental intent. It's fanciful to imagine that engineers, especially professional ones, sit around complicating things for their own amusement.
I run virtual machines and a Kubernetes cluster in my personal playground. They are both parts of functional projects; however, I end up deploying applications on either platform based on what the application's needs are. It took me about a day to get that Kubernetes cluster configured the way I wanted it and to implement some basic CI/CD that keeps it in shape. It probably doesn't get updates as much as it should, but neither do my VMs, so if you end up on one of my side projects: you've been warned! That day spent has saved me time ever since. When I want to deploy, I'm one kubectl command away from updating my production code. Every platform requires effort - effort to build, effort to maintain; the only question is where that effort is spent between inception and sunset.