There is obviously some point, somewhere between ‘just running a website out of my spare bedroom’ and ‘Facebook’, where some of this infrastructure does become necessary.
It is very important that you realize when you pass one of those points.
It doesn’t help to just tell people ‘you don’t need kubernetes, you are not Facebook’
There are a lot of companies that aren’t Facebook, but they still have multimillion dollar revenue streams riding on service availability, with compliance obligations and employee paychecks on the line, and which need to be able to reliably release changes to production code every day, written by multiple development teams in different time zones, so... yeah, maybe they are a bit beyond just scp’ing their files up to a single Linux VPS.
This is a good point.
I wish there was more discussion around when to consider making these decisions, how to transition, what types of systems are in the middle ground between bedroom and facebook, etc.
It seems all of the articles/posts that get our attention are the "you don't need it ever" opinions, but those are incredibly short-sighted and assume that everyone is recklessly applying complex infrastructure.
In general, it would be nice to have more nuanced discussions around these choices rather than the low hanging fruit that is always presented.
What I tend to see missing is clear explanations of, "What problems does this solve?" Usually you see a lot of, "What does it do?" which isn't quite the same thing. The framing is critical because it's a lot easier for people to decide whether or not they have a particular problem. If you frame it as, "Do we need this?" though, well, humans are really bad at distinguishing need from want.
Take Hadoop. I wasted a lot of time and money on distributed computing once, almost entirely because of this mistake. I focused on the cool scale-out features, and of course I want to be able to scale out. Scale out is an incredibly useful thing!
If, on the other hand, I had focused on the problem - "If your storage can sustain a transfer rate of X MB/s, then your theoretical minimum time to chew through Y MB of data is Y/X seconds, and, if Y is big enough, then Y/X becomes unacceptable" - then I could easily have said, "Well my Y/X is 1/10,000 of the acceptable limit and it's only growing at 20% per year, so I guess I don't really need to spend any more time thinking about this right now."
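To put hypothetical numbers on that: with a single disk sustaining 500 MB/s and 200,000 MB (~200 GB) of data,

    Y / X = 200,000 MB / 500 MB/s = 400 seconds ≈ 7 minutes per full scan

If your acceptable limit is an overnight batch window, that is nowhere near needing a distributed system.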
Yep. All of us tech folks read sites like hackernews and read all about what the latest hot Silicon Valley tech startup is doing. Or what Uber is doing. Or what google is doing. Or what Amazon is doing. And we want to be cool as well so we jump on that new technology. Whereas for the significant majority of applications out there, older less sexy technology on modern fast computers will almost certainly be good enough and probably easier to maintain.
I've had several experiences where I attended a tech talk by someone from Uber, and never once did I come away with the impression that the problem they were trying to solve was the kind of thing Fred Brooks had in mind when he coined the term "essential complexity."
That said, volume and velocity aren't the only kinds of scale that technical teams have to grapple with. I've spent enough time at a big organization to understand that Conway's Law costs CPU cycles. Lots of them.
> What I tend to see missing is clear explanations of, "What problems does this solve?"
This still sounds like starting out (like a scientist) by looking at something because it is cool, hip and perhaps useful, and only then considering whether it solves any of your problems. The proper working order is to start by having the problem. Hardly any problem requires a state-of-the-art solution.
When one is wiring a house (where I live), the regulations say you should use the same standard for everything on a group on the switchboard. This hilariously means that if you need to extend a run of iron pipes with canvas-insulated wires, you have to use metal pipes with canvas-wrapped wires (or rip everything out and replace it with something modern).
When you are in a position to pay someone full time to look after the web infrastructure of your project, that's when you should think this stuff through.
Before that, don't worry about the tech, go for what you know, and expect it to need totally re-writing at some point. Let's face it, this stuff moves quickly.
That's not good. That kind of stuff must be derived from your business plan, not from the current situation. You should know that "if I'm successful at this level, I will need this kind of infrastructure" before the problems appear, and not be caught by surprise and react in a rush. (Granted that reacting in a rush will probably happen anyway, but it shouldn't be surprising.)
If your business requires enough infrastructure that this kind of tooling is a need at all, it may very well be an important variable on whether you can possibly be profitable or not.
> You should know that "if I'm successful at this level, I will need this kind of infrastructure" before the problems appear, and not be caught by surprise and react in a rush. (Granted that reacting in a rush will probably happen anyway, but it shouldn't be surprising.)
What's the ROI on spending time on planning for that, given that you acknowledge that the response will be a mess anyway? In my experience companies that planned for scaling don't handle it noticeably better than companies that didn't plan for it, so maybe it's better not to bother?
> What's the ROI on spending time on planning for that, given that you acknowledge that the response will be mess anyway?
Let's put it this way: what's the ROI of failing to address any of the failure modes avoided or mitigated by taking the time to think things through at design time?
Are you planning on getting paid when you can't deliver value because your service is down?
The ROI of reducing time to market is huge, whereas I don't think I've ever seen thinking things through at design time deliver any real benefits (not even a reduced rate of outages).
> I don't think I've ever seen thinking things through at design time deliver any real benefits (...)
That's mostly the Dunning-Kruger rearing its head. It makes zero sense to state that making fundamental mistakes has zero impact on a project.
This type of obliviousness is even less comprehensible given that every single engineering field has project planning and decision theory woven into its fabric, and every single intro to systems design book puts its emphasis on failure avoidance and mitigation, but somehow in software development hacking is tolerated and dealing with perfectly avoidable screwups resulting from said hacking is ignored as if they were acts of nature.
I suspect it's tolerated because it works better; IMO the value of planning is something cargo-culted from engineering fields where changing something after it's partially built is costly. In software, usually the cheapest way to figure out whether something will work is to try it, which is in stark contrast to a field like physical construction.
Also, when the cost caused by downtime, updates and/or hardware problems exceeds the cost of that full-time engineer.
Even then K8s is probably not the next step. Simple HA with a load balancer and separate test and production environments, plus some automation with an orchestration tool, works wonders.
If you are already deploying to Heroku, you are already doing ops.
It makes zero sense to presume you don't need to know what you're doing or benefit from automating processes just because you can't hire a full-time expert.
True, but there's a difference between solving ops problems by throwing more money at hardware and solving ops problems by thinking smarter. That's why I think that some multiple of an FTE salary is a good metric for when you start transitioning to more complex operational setups, because it gives you a number for when the gain of that new complexity is worth the cost.
This. When you have enough revenue to hire new dev, and infrastructure is your most pressing problem to solve with that money, hire someone who knows this stuff backwards to solve it.
So I agree there’s definitely a need for more writing in this space (and you’d think it might be particularly of interest to incubating startups...)
My top level take on this is that the cues to take things up a level in terms of infrastructure diligence are when the risk budget you’re dealing with is no longer entirely your own. Once you have a banker, a lawyer, an insurance policy, an investor, a big customer, a government regulator, shareholders... ultimately once there’s someone you have to answer to if you haven’t done enough to protect their interests.
And risk management isn’t just ‘what will happen if the server catches fire’, it’s ‘how do I guarantee the code running in production is the code I think it is?’ And ‘how do I roll back if I push a bug to production?’ And ‘how can I make sure all the current versions of the code come back up after patching my servers?’
And it turns out that things like kubernetes can help you solve some of those problems. And things like serverless can make some of those problems go away completely. But of course they come with their own problems! But when those things become important enough that you need to solve for them to answer the risk aversion of sufficient stakeholders, the cost of adopting those technologies starts to look sensible.
> Obviously if you’re FAANG-level or some established site where that 0.1% downtime translates into vast quantities of cash disappearing from your books, this stuff is all great. You have the funds to do things “right”. Your fancy zero-click continuous deployment system saves you thousands/millions a year. But at indie maker, “Hey look at this cool thing I built … please, please someone look at it (and upvote it on Product Hunt too, thx)” scale — the scale of almost every single site on the net — that 0.1% is vanishingly insignificant.
> Obviously if you’re FAANG-level or some established site where that 0.1% downtime translates into vast quantities of cash disappearing from your books,
I have yet to see a company with any online presence where downtime doesn't eat away at their profits.
In the very least, downtime means you're paying money to get no service.
Downtime is fine for someone's blog on his personal website that he set up on a weekend while drinking beer. I have yet to see a business state that they are ok with their site returning 404s or 500s randomly, for the equivalent of a whole workday, which is what 0.1% translates to.
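For scale, 0.1% of a year works out to roughly:

    0.001 * 365 * 24 ≈ 8.8 hours of downtime per year

That's about one full workday, and nothing guarantees it will be spread out politely in one-minute slices overnight.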
Fear not, the article has you covered (emphasis mine):
> Obviously if you’re FAANG-level or some established site where that 0.1% downtime translates into vast quantities of cash disappearing from your books, this stuff is all great.
This is not a FANG thing. Not being able to deliver a product or service is a business thing. This happens in small mom&pop shops as well as FANGs. This is a free market economy thing. If your business is not open then you get expenses but zero income. How is this hard to understand?
The software world involves way more than your small side project that you managed to cobble together during a weekend. Things do need to work reliably and predictably. Otherwise not only do you not get your lunch, but your competitors eat it from under your nose. Why is this even being discussed, on HN of all places?
Because having to keep a site live doesn't automatically mean I need all the complexity of K8s? I can deploy my server on two VMs in two AZs with Ansible and run them with systemd to restart on crash. Just because I don't immediately jump to K8s doesn't mean I don't know how to run a site.
This is a really weird take on what ‘every single site on the internet’ is. Or of what proportion of the software development community is working on sites of that scale.
It’s like saying ‘the vast majority of people building houses don’t need to dig foundations’ because you build Lego houses, and after all the vast majority of houses are made of Lego, right?
There are ways to quantify it - but a real telling point is if your service is down because of the complexity of tooling you don’t fully understand.
Many problems can be solved with other methods - especially if you do NOT have global all-sync’d requirements like Twitter and Facebook. Many SaaS (example: Salesforce) are perfectly acceptable sharding customers to particular instances thereby limiting downtime if something goes wrong.
And always keep in the back of your mind that the largest VPS or servers are very performant today - having to scale sideways may be masking inefficiencies in the code.
True - but the datacenter burning still hits you where it hurts if your entire Kubernetes cluster is living in that datacenter.
Whereas if you make a point of spinning up each Linux box on different datacenters (and even perhaps different providers) you are at least resilient against that.
Using advanced tooling doesn't remove the need to do actual planning beyond "oh I have backups" - you should be listing the risks and recoveries for various scenarios - though being able to handle "AWS deleted all my servers because I returned too many things via Amazon Prime" means you can also handle any lesser problem.
IME you'll hit it around $500,000 USD ARR. But again, situations vary. I've worked on projects where hours-long scheduled downtime at the weekend was acceptable. So easy!
> There is obviously some point, somewhere between ‘just running a website out of my spare bedroom’ and ‘Facebook’, where some of this infrastructure does become necessary.
Frankly, that blog post reads as terribly myopic. The author essentially looks at his personal experience setting up his single-dev personal project, complains that he spent some time jump-starting a basic CICD pipeline, and from there he proceeds to argue that no one needs any of that.
Well, how does he expect to check whether his changes break the deployment or not? How does he expect to deploy his stuff? How does he plan to get multiple commits to play nice prior to deployment? How does he expect to roll back changes? How does he plan to address request spikes, either from normal traffic or attacks? Does he plan to get up at 3AM on a Tuesday to scale up deployments, or does he plan to let his site crap out during the night and just get back to it in the morning?
And somehow we are supposed to believe the naive idea that getting features out of the door is all that matters, and that basic ops is not needed by anyone?
Yeah, I kind of agree with the author, but not 100%. The takeaway shouldn't be "don't use CI, kubernetes, etc." but rather "don't learn something new" when your goal is to get a product out the door. If you already know kubernetes, then it might seem silly to not use it given how it simplifies things. It's only when you don't know kubernetes that you'll run into problems.
> but rather "don't learn something new" when your goal is to get a product out the door.
I don't agree at all, because that argument is a poorly put-together strawman.
No one ever had to choose between setting up a CICD pipeline and churning out features. Ever.
In fact, continuous deployment makes it possible for everyone to get features out of the door as fast as they possibly can, and ensures deployments are a two-way-door thing instead of a finicky, one-way-door, pray-that-it-works event.
The door that features have to get through is not JIRA, but the prod environment. It's inconceivable how some people try to pretend that automating operations adds no value.
And hell, CICD nowadays is trivial. You know why? Because of tools like Kubernetes/Cloudformation/CDK/Ansible/etc.
There is a whole lot of time before your first customer where you don't need any of this. Getting an MVP out > getting your engineering shinies.
In my company I first had servers directly on a bare metal box, with half the infra running on my workstation, then moved everything to docker-compose, then kubernetes and now I'm getting ready to get my CICD pipelines.
> There is a whole lot of time before your first customer where you don't need any of this.
No, there really isn't. Customers want features, and they want them rolled out as fast as possible.
How do you provide that? By automating your delivery process. You don't treat your infrastructure as a pet, and all you need to do to get a feature to all customers, fully tested and verified it won't break your service, is to push a commit.
With modern CICD systems and a basic Kubernetes setup, you get a fully working continuous deployment pipeline from scratch in about 15 minutes. There is no excuse.
The blog post is bullshit from someone who has zero insight and no relevant experience.
(I am not commenting on the article, I am commenting on the comment thread)
> Customers want features, and they want them rolled out as fast as possible.
I am going to reiterate: There is a whole lot of time _before_ your first customer.
Now I agree with you: on my last job we had no CICD, no tests and no procedures when I joined, and it was _hell_ to release anything. Writing tests and doing the infra candy improved our velocity, even though technically we were doing more work. But this was _with_ customers. If you don't have customers, they don't care about the features.
> I am going to reiterate: There is a whole lot of time _before_ your first customer.
So what? Don't you understand that this just renders your point even less defendable?
> But this was _with_ customers.
So what? Do you expect to get your code to work only in prod? Do you hope that all the bugs you're adding right now should just pile up into a big tangled and unmanageable mess until it becomes a problem? Are you even bothered by the fact that your last commit may or may not have screwed the entire project?
There is absolutely zero justification to not automate testing and delivery. Zero. There is not a single use case where pushing half-baked crashers without any verification should be ok.
>Facebook is kind of an interesting example, as they got pretty far into "hyperscale" with mostly PHP, memcached, and Mysql.
But PHP and MySql are only the surface level tools we see.
There is also the unseen and that's where the immense complexity lies hidden under the surface.
E.g. To help with scaling (which Friendster failed at), Facebook had separate MySql instances for each school. An interview with Facebook co-founder Dustin Moskovitz mentioned they had constant stress of emergencies and fires to put out. They had tons of custom tooling, Linux scripts, batch jobs, automation/orchestration/dashboard, etc and it's very likely the combination of all that ends up inventing an "invisible internal Kubernetes" without calling it "Kubernetes" to manage their server fleet. In other words, if Docker/Kubernetes/AWS were around, the early Facebook team might have offloaded some of the complexity onto a 3rd-party to reduce the all-nighters. But the tools didn't exist yet so they had to do it the hard way and invent "deployment orchestration" themselves.
[EDIT to reply]
Thanks for the "Tupperware" comment. A Google search leads to a Facebook post saying they renamed it to "Twine":
That's recent though. I'm saying they scaled up quite a long way before doing anything like that. From the outside it looks like they didn't really branch out into things like this until 2008 or so.
>From the outside it looks like they didn't really branch out into things like this until 2008 or so.
No, even before the Newsfeed feature rollout of September 2006, they already had crazy complexity orchestrating multiple servers for many schools. They were only a "simple PHP+MySql" website when it was February 2004 with only Harvard students on it. Dustin said he was the one in charge of spinning up new servers for each new school and he said it was a nightmare of engineering effort.
Stack Overflow in 2009 is an example of something that caches extremely well, several orders of magnitude better than stuff like personalized walls for every user.
IIRC, they were doing things like patching BitTorrent to prefer closer IPs to pull from during their deploys. So it would roll through their network from one side to the other without just completely saturating the entire network.
But the point still remains, they had challenges to solve, but they were also simple, which made it possible TO solve those challenges.
This could be the final post to this discussion. You can't really get more to the point than that.
We are running a large German browser game on individual VPS instances. For many years, with PHP, MySQL and Redis. It works and it runs just fine.
I dunno exactly what Craigslist is running on today, but some slides from 2012[1] include items like running MongoDB over "3 shards across 3 node replica sets" with "duplicate config in 2nd data center"; a "sharded multi-node" redis cluster; multiple "vertically partitioned" MySQL clusters; the actual application stack on top of that being mod_perl on Apache, and it was all operated by a 'small but growing' dedicated team.
It might or might not be notable that the growth of that tech team included, one year later, hiring Larry Wall[2].
Given that, I'm not sure how to apply 'if Craigslist doesn't need it, I probably don't' as a general principle....
I see language as a personal choice, and while I wouldn't run perl today (I'm partial to Python, myself), I think their stack is fairly simple - tried & true language (however quirky it may be), no virtualization, simple metal hardware stack, and yet they serve an incredible number of users. In that sense, I'm sure hiring Larry was a no-brainer.
They are no FB, but they're bigger than almost anything else on the web [0]. I doubt most people will build anything bigger in their careers. I certainly won't.
A huge caveat might be app features. Clearly running something like travelocity could require more pieces, and something like a BI app in a fortune 100 company would have many more connectors to data sources. But in general, they've done pretty well using only newer versions of what worked way back in 1999, and to me, that's just incredible.
People bitching about K8S, systemd, etc. miss the point that these systems allow a qualified generalist ops admin to pop in and fix stuff without knowing the app well.
For a little company with <500 servers, it doesn’t matter. As size and complexity increase, figuring out how the guy who left last year set up FreeBSD becomes a liability.
That surfaces a very important communication problem. If you consider 500 servers a "small datacenter", then, of course, you will say that K8S adds value to small operations.
A qualified generalist ops admin should be productive on any modern UNIX or UNIX-like within a couple weeks.
Figuring out how the guy last year set up FreeBSD consists of similar effort and process as figuring out how the guy last year set up Debian or CentOS or SUSE.
It can be challenging. I basically bought my house based on figuring that out for people in the old days.
It’s not an issue with FreeBSD or anything per se. Usually there’s a genius superhero guy who does whatever, who burns out/quits/otherwise goes away, and the business is stuck.
I’d suggest that when you have paying customers, and your duct-tape deploy process is starting to feel risky to run during business hours, that’s a good time to do your first taking of stock. Before that point it should be the fastest thing you can wire up given your own personal tooling expertise.
(I’d argue that k8s is actually lean enough to be a competitor here if you’re already an expert and you use a managed service like GKE).
Your infra should be in service to some uptime/performance/DX goals. Define those (honestly!) and then you can figure out what you need to improve.
In particular if no customer has ever noticed downtime, and you don’t think you’d lose business if someone hit a 30-min outage, then you probably don’t need to upgrade. Twitter was notoriously unreliable in the early days and they made it, for example.
Fair point. But people just love to over-complicate things, and it is far more often that you come across absolutely ridiculous infrastructure monstrosities for very small businesses than that you come across the opposite.
I’ve found that career FOMO drives a lot of the desire to overcomplicate infrastructure.
It’s hard to look at job listings from prestigious companies listing K8s and other buzzwords as requirements without feeling some desire to add those buzzwords to your own resume.
If you need scale, then you know what you need when you need it. None of this advice people are tossing around would apply to WhatsApp, for instance. Because they needed to get into the weeds on BEAM optimizations for Erlang and they needed to do so quickly. They probably didn't have a year to mess around getting K8S up and stable. Or crafting the perfect Ansible script. Or infrastructure-as-code. They probably cared very little about any of that.
WhatsApp is the closest thing you'll see for explosive growth. And they had time to work on BEAM patches. Think about that. Their whole world didn't just collapse because they couldn't support a billion connections on day one. Even they had time to scale as they needed.
A couple of Ansible scripts to deploy the k3s Kubernetes distribution onto cloud VMs.*
Add LetsEncrypt to the Traefik install for https
Use the default Helm chart to make your k8s deployments for your apps - you'll probably only need to tweak a couple values to start - ports and ingress rules.
* k3s comes with Traefik pre-baked to listen on ports 80 and 443 so you can run it "bare metal" without a cloud provider's load balancer service or any special tweaks to Traefik config. If you just Helm install any of the standard k8s ingress providers like the standard Traefik chart, Consul, or Nginx, you have to tweak a bunch to get them working on privileged external ports 80 and 443.
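As a rough sketch of that setup, assuming a chart scaffolded with "helm create" (the hostname and values below are illustrative; check the current k3s and Traefik docs before relying on this):

    # install k3s on the VM (Traefik ships as the default ingress on 80/443)
    curl -sfL https://get.k3s.io | sh -

    # point kubectl/helm at the cluster
    export KUBECONFIG=/etc/rancher/k3s/k3s.yaml

    # deploy your app from the default chart, tweaking only ports and ingress
    helm install myapp ./chart \
      --set service.port=8080 \
      --set ingress.enabled=true \
      --set "ingress.hosts[0].host=example.com"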
I found Dockerizing my app really early to be a good bridge to k8s.
Build container locally, test, push. On Linux box, pull the new app container, stop the old app container, start the new app container. Separate nginx reverse proxy also in docker so if you’re fast enough you don’t drop any packets.
I’d say docker is worth using from the get-go, but if you are most familiar with raw Linux probably just use that for your first iteration.
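A minimal sketch of that flow, with the registry, image name and port as placeholders:

    # locally: build, test, push
    docker build -t registry.example.com/myapp:v2 .
    docker push registry.example.com/myapp:v2

    # on the Linux box (nginx container in front keeps proxying to port 8080)
    docker pull registry.example.com/myapp:v2
    docker stop myapp && docker rm myapp
    docker run -d --name myapp -p 127.0.0.1:8080:8080 registry.example.com/myapp:v2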
What about a managed service like EKS or GKE? A minimum k8s config is perhaps 20 lines to spin up a service and expose the ports - that feels pretty achievable for bedroom coder?
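For reference, a bare-bones Deployment plus Service is roughly this (names, image and ports are placeholders):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp
    spec:
      replicas: 2
      selector:
        matchLabels: { app: myapp }
      template:
        metadata:
          labels: { app: myapp }
        spec:
          containers:
          - name: myapp
            image: registry.example.com/myapp:v1
            ports:
            - containerPort: 8080
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: myapp
    spec:
      type: LoadBalancer
      selector: { app: myapp }
      ports:
      - port: 80
        targetPort: 8080

On a managed service like EKS or GKE, the LoadBalancer service gets you a public endpoint without touching the nodes.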
Machine installation manuals, "Create machine" shell scripts, manual docker, script-launched docker, VM management tools... Well, technically, Puppet and its kind are there too, depending on whether you consider them viable or not.
You can use something like AWS Application Load Balancer with two EC2 hosts behind it. Linode also offers an LB if you prefer Linode.
Now do a simple Blue/Green deploy. Deploy to the first host: say you use Java, so just have a script that shuts down your app, copies over a new uberjar and starts it up again. Try it out, and if it all works, swap the LB to point to this host instead. If it results in errors and failures, your rollback procedure is to swap it back again to the previous host.
You can scale this pretty easily as well, just add more hosts behind your Blue/Green groups.
Now the last challenge is actually going to be your DB. You can't swap DB instances like that, so that's where something like AWS RDS or a managed DB service is useful, because they often offer automatic replicas and failover.
But if you're lazier (or don't have the money) than that, just have a third host running PostgreSQL with a scheduled backup of your DB to some other durable storage like AWS S3. And when you need to update that host, schedule a maintenance window for your app/website.
So...
1) You can start with one single VM and have your DNS point directly to that host. Where the host runs your DB and your app. Deployments are you ssh-ing, shutting down, copying a new uberjar, starting up again. And you turn this into a script that does it for you.
2) Then you add a DB backup cron job to S3 so you don't lose all your users data from an incident.
3) Now if you want to prevent needing downtime when you deploy, you add an LB and move your DNS to point to the LB and put the host behind the LB, it becomes Green, and add another host which will be Blue, move your DB to a third host that's not behind the LB. And now do what I said earlier. You can now deploy new versions of your app without downtime, and do quick rollbacks in case you deploy something that breaks your users.
4) If your user base grows and one host isn't enough anymore, you can add more behind the Green and Blue groups. Everything else is the same, just your script now needs to swap between the groups.
5) Now you might start to want tests running automatically, the uberjar being created automatically when you commit, a beta stage and all that. So you can set up a CI/CD pipeline. Have a test host and a test DB. And your app has config to know which environment it runs under to detect which DB to connect to.
6) Finally, if your DB becomes a scaling bottleneck due to growth, congratulations, you've now become a successful business and can just hire other people to deal with it.
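A rough sketch of the deploy script from steps 1 and 3, assuming a systemd-managed Java app; the host name, paths and health-check URL are all made up:

    #!/bin/bash
    set -euo pipefail
    # the host/group NOT currently behind the LB
    IDLE_HOST=blue.internal.example.com

    # ship the new build and restart the app
    scp target/app.jar deploy@"$IDLE_HOST":/opt/myapp/app.jar
    ssh deploy@"$IDLE_HOST" 'sudo systemctl restart myapp'

    # smoke-test the idle host before flipping traffic
    curl -fsS "http://$IDLE_HOST:8080/health" > /dev/null

    # last step (manual or via the AWS CLI): swap the LB target group to the
    # idle host; rollback is just swapping it back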
It's always a judgment call with how to complicate your infrastructure.
If you succeed, sometime in the future, you will have lots of engineering resources to manage the complexity of a "web scale" service. However, this level of complexity is high.
So, today, what is more important is ease of development, a friendly software iteration cycle, and easy testing on developer machines, since you don't have FB's or Google's test infrastructure yet.
You have to shoot for something that you can build fairly simply that'll carry you the next year or two, and think about it in terms of API boundaries, so that you can replace any particular piece, with a faster, more complex piece in the future when you need it.
Your appropriate design for today isn't going to come from this site, or from reading AWS or GCP press materials, but from people who have been through this multiple times and have some experience in evolving small systems into big systems. They're hard to find.
In that case just imagine how far you'd go if, instead of manually scp'ing to a bunch of servers, you install a tool like Ansible, add your bunch of Linux boxes to a list, write down in a script what you need to scp to those boxes, and from then on just run the script whenever you need to update them.
But somehow that's too much because it's now ops and no one needs to scp to boxes safely and reliably like that.
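Concretely, that "list" and "script" are just an inventory and a playbook along these lines (the group name, paths and service name are made up for illustration):

    # inventory.ini
    [web]
    box1.example.com
    box2.example.com

    # deploy.yml
    - hosts: web
      become: true
      tasks:
        - name: Copy the new build to every box
          ansible.builtin.copy:
            src: ./build/
            dest: /var/www/myapp/
        - name: Restart the app
          ansible.builtin.service:
            name: myapp
            state: restarted

Then every update is one command: ansible-playbook -i inventory.ini deploy.yml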
In the overwhelming majority of cases, you will never pass that point. A well-optimized dedicated server with 90 fast cores, 128GB ram and 4TB SSD goes for $120/mo at Hetzner. That can easily handle 1 billion non-cached monthly page views if you write decent code. More if you cache. How many sites get that much traffic? A few dozen. If you're dedicating this much effort in preparing for 1-in-a-million eventualities, you will never ship anything.
> It is very important that you realize when you pass one of those points.
It. Does. Not. Matter.
You're not going to be the same guy that sets up Facebook's garage server in 2004 and then scales it out to the multi-data-center monster it is today.
If you are that guy, then I really feel sorry for you. Something went seriously wrong with your equity (or lack thereof) and you are doing janitor shit when you should be up in the tower with the other C-levels.
Part of what you're saying is a good point, IMO. The people who transition the infrastructure from Level 0 to Level 1 won't be the same people in charge of infrastructure when it's time to transition from Level 17 to Level 18.
But however many levels there are from garage infrastructure to Big Tech infrastructure--even if there were only 2--someone will always have to decide that the time has come to transition from This Level to Next Level. Different expertise may be required to transition from L0 -> L1 vs L10->L11 vs L30->L31, but that expertise still matters.
And if I were the one who had to decide at any of those levels, I would want a well-designed plan for the next transition or two that was of the form: 'When [some measurement] reaches [some threshold] we will [change the infrastructure in the following ways...]'
> And if I were the one who had to decide at any of those levels, I would want a well-designed plan for the next transition or two
Right. This was really my (poorly-expressed) original point. Deciding how and when to scale is going to be a team and organization effort and decision. If it does come down to one guy deciding on something, then that's going to be the CTO. And if he/she is a good CTO, their decision is based on the feedback of a dozen people under them.
Scaling is a progression and no matter how fast your growth is, you do have time to come up with a solid plan and execute that plan. You don't need Facebook scale on day one even if you do become the next Facebook.
I agree that there's a problem with that kind of a false dichotomy, but just as Rails is "good enough" for a vast majority of workloads (where you specialize off the parts that need it when you need it), I'd say the same about Heroku. The DX scales up from 0 to "big enough" very well, and once you get beyond that, it's reasonably straightforward to bring the parts you need inside of AWS.
I've noticed sometimes that there is often an ideological argument for or against open source (and sometimes to an extreme extent) buried in arguments for complex/expensive/distracting infrastructure, whether directly or indirectly. It gets coded as "portability" and so that's why you end up managing your own metal on Kubernetes. But you've got to really ask yourself whether if you had to pick one, you'd index on convenience or portability. There are some businesses where that portability actually is that important. But there are a lot of businesses for which convenience is first in priority.
Determining what your priorities are is the first step. From there, I think you have options.
Infrastructure is important whether you're running a bare metal server or VM's on someone else's computer in the cloud.
Different infrastructure solves different problems.
Container orchestrators solve a very specific density problem. Back in the day we recklessly threw a machine together and either put one app or an unbounded number of apps on it. If you put one, then you've wasted all the remaining resources on the machine. If you put an unbounded number of apps on it, then you've effectively created a hacker's wet dream, not to mention noisy-neighbor hell. Can you still have containers that demand hoggish resources? Sure, configuration is up to your imagination and your expertise - just like with virtual machines and BSD jails. That said, now you have options.
Container orchestrators also made it incredibly easy to pay less for things like load balancers. Instead of paying for that pricey layer 7 load balancer from your favorite cloud provider, you can bootstrap nginx into your stack and have lots of fun. Container orchestrators are also very chatty and invoke a level of network utilization that our predecessors probably would've gawked at. When misconfigured, containers are difficult to troubleshoot and require a large array of supplementary software in order to run them with any efficacy partially due to sheer number (nobody runs just one container).
Walking around telling people "you don't need x" is a reactionary statement, because I have only covered the surface-level information about container orchestrators with regard to benefits and trade-offs; you can't possibly know the answer without having extensive conversations about these.
Saying stuff like this:
> ‘Engineers get sidetracked by things that make engineers excited, not that solve real problems for users’ — we’ve heard it all before
Engineers do overcomplicate stuff, but there are reasonable explanations for why, if you dare to ask with any kind of non-judgemental intent. It's fanciful that engineers sit around complicating things for their own amusement, especially professional ones.
I run virtual machines and a Kubernetes cluster in my personal playground. They are both part of functional projects; however, I end up deploying applications on either platform based on what the application's needs are. It took me about a day to get that Kubernetes cluster configured the way I wanted it and to implement some basic CI/CD that kept it in shape. It probably doesn't get updates as much as it should, but neither do my VM's, so if you end up on one of my side projects: you've been warned! That day spent has saved me time in the future. When I want to deploy, I'm one kubectl command away from updating my production code. Every platform requires effort; effort to build, effort to maintain; the only question is where it's spent between the time of inception and sunset.
This past year I’ve finally been able to hit the $200/hour rate as a k8s consultant.
I feel like I’m selling snake oil. I’m working on my 7th implementation and none of my customers need kubernetes.
Basically all of their developers and devops engineers want kubernetes on their resumes or because it is shiny and what the cool kids are using.
My largest implementation was a $3m/year, 9 year old company who could run their entire infrastructure on a couple dozen ec2 instances. However they are under the delusion of “hitting it big and needing to be able to scale 100x”.
Paid off my student loans this year, and soon my house, thanks to superfluous k8s integrations!
I don't think this kind of attitude should be encouraged. Imagine other professions celebrating things like that.
Surgeon: Some of my patients request bladder surgeries while in reality their problems could be solved with a few simple diet supplements. But I just do the procedure and take the money. Just bought my 3rd Tesla, thanks to superfluous bladder operations!
I know selling people shit they don't need is celebrated as an achievement nowadays, but feels like it shouldn't be.
It already happens in medicine. Of course, a lot of it is "I'm afraid of getting sued so I'm going to cover my ass with all these tests." but, still.
When I was younger I had a pain in my abdomen I had never experienced before. My grandmother told me to go to their family doctor who was probably older than they were. He pushed on some things and tapped on some things and finally diagnosed me. "It's gas. Go home and have a big fart." Sure enough, hours later, a big release and the pain was gone.
These days I'd have been sent in for at least an MRI or CAT scan to go along with all the blood work that they would do.
Now, I know what some of you are going to say. "But it COULD have been something way worse!" Sure, it could have been. But this guy had been a doctor for a long time by that point and knew from experience that it was gas. Just like when I recently had to rebuild a really old server and was getting errors that suggested missing class files but, after 25 years of doing this, I knew that it was actually missing a PHP module.
There are plenty of stories of the opposite, where the old doctor uses their intuition and it fails compared to a basic checklist. Instead of seeing temp 101 plus symptoms A+B+C = X, they see "looks young and healthy, can't be X because X is only for old people." Turns out it is X.
I mean, one of the most famous stories is of experienced doctors refusing to wash their hands because they knew better based on their "years of experience".
Most people keep themselves from recognizing the problems with what they are doing if they are making lots of money off it, they convince themselves what they are doing is worthwhile after all.
"It is difficult to get a man to understand something when his salary depends upon his not understanding it."
I don't think (I hope?) anyone is celebrating "cheating the rubes". But I respect GP's ability to be honest about it -- it's super useful to the rest of us. For instance if we've been skeptical of k8s, seeing someone who's an expert in it say that makes us think ok maybe we're on to something.
Most people in that position would keep their mouths shut, and probably keep themselves from admitting to themselves what's up.
I've learned to not question what the customer wants. If I push back and think "you probably really want solution X over the solution Y you're asking for", more often than not it turns into a big fiasco. Every issue that comes up becomes my fault for suggesting X and not going with Y. Whereas if I just do Y as they asked, then they associate the issues with Y and not me.
Tons of work done in various industries is "shitwork" that the person doing the work wouldn't normally do except they're being paid to do it. At some point you give up arguing with the customer.
The dieting is a perfect example - many health issues can be solved by losing weight but few people want to make the lifestyle changes necessary to do that.
> few people want to make the lifestyle changes necessary to do that.
I didn't have the will to make lifestyle changes until I had a serious bout of sciatica. I needed that external motivator of torturous pain to be able to commit to changing my eating and fitness habits.
Two and a bit years later I'm down 12Kgs and slim and healthy (but, ironically, still have back pain issues that requires minor surgery to alleviate).
"Don't worry, it's only for the rest of your life". If you can come to terms with that, you're ready to start.
Dude, sciatica was my catalyst for lifestyle change as well. I had a bicycle accident 14 years ago and damaged my piriformis. It was the most humbling and painful experience of my life.
I think so. If someone sees enough value in what you provide to be willing to pay you, you shouldn't feel bad about it. It's not like you're pushing them to adopt k8s. They already did that.
It's like someone bought a Mercedes. Nobody needs a Mercedes, they could drive a Toyota. But now they need an oil change, and that's what you do, and they'll pay you for it.
But that's exactly what surgeons and accountants and mechanics and consultants and entertainers do. Most work is unnecessary. Once you embrace that, it is a lot easier to understand how this economy works.
I can confirm this. I was offered surgery for my 2 year old for a problem that is not life threatening and could possibly (but rarely) cure itself. In the end it indeed cured itself.
Sounds exactly like the medical industrial complex in the US. The opioid crisis and general over-medication and over-treatment just magically happened / keeps happening. Surely no doctor ever thought "let's give the customer what they are asking for".
I’m confused. Isn’t a kubernetes cluster just a way to deploy software onto a couple of dozen ec2 instances?
Sure it has some accidental complexity, but so does having half a dozen different software systems each managing their own deployment pathway onto ec2.
There seems to be some belief that kubernetes has to be some kind of crazy complex system to operate, when really it’s just an api for managing which docker processes get spun up.
I'd assume anybody criticizing kubernetes isn't a huge docker fan either.
I see both used much more as dev ops tools rather than real cluster management.
Particularly questionable in organizations that may not have any in-house tech expertise. Some contractor, who may be the only programmer involved at a given time, comes in and sets everything up the kubernetes/docker way. From now on you'll always need kubernetes-enabled devs to maintain your thing.
Happened to somebody I know. They believed they had a "cluster" with different "servers" but it was a single machine. Then I had to explain why they cannot just FTP into the server, upload plugin folders (like some tutorial asked) and expect it to work. Nor did I have enough time to help out since who knows how these dockers mount folders, what's the latest "best practice" or what the dev who maintains it prefers. There's no way they'll ever need more than one machine. I'm pretty sure the system in question does not even have a good cluster story in the first place, so at best you could scale the database (which usually can do clusters well without 3rd party tools).
I am always baffled by this take. I run k8s because it's by far the easiest way to do what I want (deploy a handful of services to a handful of servers), not because it's cool or massively powerful.
In the past few years, k8s simply got easy to get started with and it feels like no one noticed.
If it was easy to set up a nuclear power plant, would that be a great idea for anyone to set one up and run it?
You will run into problems eventually that you won't be able to solve (especially as you have more complex requirements). If this is just for a pet project you probably won't care. But if there's a lot riding on you being able to fix it quickly, you will not enjoy being in a position where fixing it takes hours to days.
It's like a car. Super easy to operate, until there's a strange sound, and then something falls off. Do you know how to fix it? Do you need to get to work in 30 minutes? Can you just "order an Uber" when it comes to K8s?
I’m not sure what justifies the analogy of kubernetes to either a nuclear power plant or a car you mysteriously have no ability to fix, nor why the existence of an ‘Uber for kubernetes’ seems so outlandish (I mean there are literally api calls I can make to cloud platforms that will summon new kubernetes infrastructure for me to rent, in minutes, which is pretty Uber-like).
It’s not rocket science, it’s just networking, config management and container orchestration.
All these analogies are equally true of just.. running Linux in the first place. Or writing software. Using things that require some knowledge requires some knowledge.
Over-simplifying it does not change the fact that it is literally the most complicated system that exists today to just run a web server. The most complicated system of anything is going to be more difficult than the alternatives.
Sure. If ‘running a web server’ is the extent of your problem then kubernetes is a very wrong turn (in fact, so is ‘spinning up a Linux box’ - nobody needs to be writing an nginx config file to just put some static content on the internet).
I feel like this is just an argument between completely different communities. Like a recreational yachtsman saying ‘I don’t see why anyone should need a nuclear power plant to sail a boat’ to a naval attack submarine captain.
Yes, but the other way around: a naval attack submarine captain (Me) telling a recreational yachtsman (potential K8s users) not to put a nuclear power plant in their sailboat. I think some of us get that it should only be used sparingly as a big investment, but there's so many more using it just because it seems like they should.
Most tools are easier than people make them out to be. The average dev just isn't that great, and the average company overestimates the difficulty of these tools, while burying their devs with mindless drone work so they never get the time to study these things. The average dev isn't one which has studied CS/software dev and/or spends loads of free time still investigating tech off the clock, either.
"Basically all of their developers and devops engineers want kubernetes on their resumes or because it is shiny and what the cool kids are using.
"
That's what I am doing now. Knowing this stuff makes you much more employable and raises your salary. It's a stupid world but resume driven development is a rational thing to do and very encouraged by hiring practices. For a long time I thought the company would appreciate making things simple and cost efficient but when I look at the latest hires it's pretty clear that the company wouldn't even hire most of the people who keep the business running because they are "outdated". You really have to keep churning constantly or you fall off the bandwagon.
On the other hand I am pretty new to "modern" server and web development and the complexity is just insane. Getting a simple API server and frontend going with comprehensive testing, CI/CD, containers, dependency injection and whatever is a huge effort. It barely leaves you time to think about writing efficient quality code because you are so busy with pull requests, writing test code and other stuff.
Firstly - congrats on making great income by providing valuable skills to a hot market.
When you say "none of my customers need kubernetes" is that because they are not using any of its benefits, or because they could do it another way and kubernetes was one of the options rather than the only option?
In my experience very few people are fully exploiting everything k8s has to offer, but many people are getting side benefits from greater standardisation of deployments, consistency of implementation for logging, tracing and telemetry, reproducibility of environments and, to an extent, talent retention and development.
I'm very interested to understand where the line is for genuinely getting zero benefits of using k8s.
What in your opinion is the optimal way to run "a couple dozen ec2 instances"?
I'm kind of in that zone.
I don't think "totally manually" works out very well once you are in even "one dozen EC2 instances", it's too easy to make mistakes, you spend too much time (with downtime) figuring out what mistake you made and what it was supposed to be instead. Generally too hard to orchestrate changes without downtime.
But is there anything in between totally manual and kubernetes?
I think that's one reason people are reaching for kubernetes, they actually are having challenges that need a solution, and can't figure out what the solution is other than kubernetes.
It’s unpopular, but I embrace (AWS) lock-in for small and mid-sized infrastructure. AWS’ various services solve 99.9% of any infrastructure needs. High-bandwidth streaming and heavy GPU processing are among the few places my customers find it to be cost prohibitive.
For compute I lean towards an immutable infrastructure. I was a Puppet / Chef engineer for 8 years, never again. It’s cleaner to configure apps via some mixture of pre-baked AMIs, Dockerfiles and some sort of parameter store. (Usually SSM or Vault.)
Fargate is a cleaner path but so much more expensive.
Thanks for response! Sure, I'm down with AWS lock-in, for the sake of argument.
But how do you manage this EC2 fleet and collection of other AWS services? Changes happen -- new version of your local software, change your mind about some configuration, new version of a database, whatever.
OK, it's immutable. But I still need to launch a new EC2, and get everyone else that's supposed to know about it to know about it and start talking to it instead of the old one (or get public traffic to go to it or whatever), and then take down the old one once it's safe to do so. (I also need to figure out how to construct it as immutable).
Purely manually? Just manual operations in the AWS console?
There may be an obvious answer here I am missing, not being familiar with the tools. What can be overwhelming here is figuring out the "right" way to orchestrate this stuff; there are so many options (or sometimes you can't find any options but suspect you are not looking in the right place) and so many partisans. One reason for Kubernetes' popularity is the successful awareness marketing: I've heard of it and know it is supposed to solve these challenges I'm having, I don't really understand what the options are and am overwhelmed trying to figure it out... but k8s seems popular, it must be good?
Your CI pipeline can build an AMI with your Docker image baked in, and a simple systemd unit to start it on boot.
Deploy step would just be to upgrade AutoScalingGroup with the new AMI, in a rolling fashion with CloudFormation.
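The rolling part is mostly an UpdatePolicy on the AutoScalingGroup; roughly this, where the launch template resource, subnets and sizes are placeholders:

    AppGroup:
      Type: AWS::AutoScaling::AutoScalingGroup
      UpdatePolicy:
        AutoScalingRollingUpdate:
          MinInstancesInService: 1
          MaxBatchSize: 1
          PauseTime: PT5M
      Properties:
        MinSize: "2"
        MaxSize: "4"
        VPCZoneIdentifier: [subnet-aaa111, subnet-bbb222]
        LaunchTemplate:
          LaunchTemplateId: !Ref AppLaunchTemplate   # the template that carries the AMI id
          Version: !GetAtt AppLaunchTemplate.LatestVersionNumber

Updating the stack with a launch template pointing at the new AMI then replaces instances one batch at a time.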
---
However, in most cases I'd recommend going with ECS + Fargate, where you can just deploy your Docker image, and don't have to care about the underlying virtual machines.
---
If you want to go K8s route, EKS + Fargate, (or if you want to go cheaper, EKS + EC2 Node Groups), are the best way in most cases, but you need to be aware that you will have to keep an eye on upgrading EKS versions, Helm charts, and other resources you add to the cluster.
I'd say Kubernetes is a good tool if you need it, but otherwise I'd start with something that doesn't require any maintenance on the ops side.
This. I'm not some Unix sysadmin type who wants to do everything manually. But the AWS features work out of the box for basic deployment, scaling, etc. and Kubernetes seems to just add unnecessary operational complexity without any value if you don't need any other features and are okay with AWS lock-in.
The canonical AWS answer here is: Elastic Beanstalk.
If you use a CI tool, it probably has a CD counterpart which can pipe a passing build directly to EB.
Or you can deploy manually from your repo with "eb deploy".
In either case, AWS will automatically bring the app up on new instances, and test them. If everything looks good, the load balancer will direct live traffic to the new instances, and then spin down the old ones.
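With the EB CLI, that whole flow is roughly three commands (the application and environment names are placeholders):

    eb init my-app --platform docker --region us-east-1
    eb create my-app-prod
    # later, on every passing build:
    eb deploy my-app-prod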
Obviously a personal opinion, but Ansible solves this problem really well. It does get sluggish once we're talking about more than 50 machines, though.
To reliably end up with a working cluster:
- spin up plain operating system VMs
- spin up one more VM to run Ansible from
- collect the IPs, write them into the Ansible inventory
- pull/upload your custom binaries
- run Ansible playbooks to set the VMs up
Sometimes it's beneficial to have a two-step process. One playbook to install rarely changing components. Then make a VM snapshot. The second playbook installs your custom binaries. Future installations can then start from the snapshot.
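In practice the two-step split is just two playbooks run at different times (file names are arbitrary):

    # one-off: base OS packages, users, firewall... then snapshot the VM
    ansible-playbook -i inventory.ini base.yml

    # every release: only your own binaries and config, starting from the snapshot
    ansible-playbook -i inventory.ini app.yml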
Back in the day I worked on code to create custom Linux ISOs with network provisioning. Always wondered why more people didn't do this. You just need a kickstart script. With a simple provision script you can have a fully automated install. No messing about copying ssh keys or setting IP addresses, etc.
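For the curious, a bare-bones kickstart file for a RHEL-family installer looks roughly like this; the hostname, password hash and package list are placeholders:

    lang en_US.UTF-8
    keyboard us
    timezone UTC
    network --bootproto=dhcp --hostname=node01
    rootpw --iscrypted $6$replace.with.a.real.hash
    clearpart --all --initlabel
    autopart
    %packages
    @core
    openssh-server
    %end
    %post
    # drop in ssh keys, provisioning agent, etc.
    %end
    reboot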
I rarely see instances where kubernetes actually makes sense over just using regular AWS services (load balancing, autoscaling, AMIs, etc.) directly. It seems like k8s just adds a mostly superfluous extra layer of abstraction with a rather user-unfriendly interface.
I'm in your position, but in my opinion, there are plenty of teams that are better off with Kubernetes even if they don't need it. There is complexity in the processes and even the scripting people use to glue things together, and there are enough places for the team to screw up.
Oh, I can tell you because I work with these asshats. They come in, deploy the technology, usually get promoted and then leave. The people left behind have to maintain their garbage and because promotions are only given to new feature work we're the suckers holding the bag.
Actually, this is exactly how I accidentally started consulting on k8s. Three of my customers had engineers who advocated for k8s, did 50-75% of an implementation, then left.
I have told all of my customers they don’t need this, but they are convinced this is the way of the future.
Right side of history? I dunno about anything like that. It’s just business, show me the money.
There will be a new trend next year or the year after, ad nauseam.
I’ve been consulting on and off for over 20 years now, I dont even remember my first 10 customers anymore. I get a lot of referrals from customers and from a consulting company I sold.
I was primarily focusing on terraform, which is the only technology I’ve actually ever enjoyed professionally, but there’s just so much stupid $$ in k8s.
Doubtful. The consultant who is willing to say that straight (and with the experience to know it) is going to be worth a lot more than $200/h, to the right clients.
I went down this exact same path. Started simplifying everything and it did wonders for my velocity. It took me a while to realize that we have teams and teams of people maintaining all this crap I was trying to spin up myself. It just wasn't worth it.
I went back to basic principles: FreeBSD. systemd has taken Linux in the wrong direction in my opinion. I don't want anything fancy, I don't want to learn another person's resume-building framework, I want to use technology that has proven longevity.
Our technology space moves so fast, for my personal work, I have to leverage the first principle technologies that I have mastered over my career. The latest framework may make some problems easier to solve, but I don't deploy technology without achieving some level of competency. That takes significant time. I'm getting wiser with how I spend that time, it's not always easy to know.
This process has also taught me why we have this proliferation of SaaS solutions. It's tempting, and maybe one day I'll be wise enough to realize that this is the route to take. However, I'm still able to leverage my core skills and open-source technology enough that SaaS solutions don't look attractive to me. I can see the benefits either way, so I suppose maybe in five years when I reflect I'll know if I made the right investment.
I've often heard people say that systemd attempts to do too much and as a result is bloated, but personally I've found it sufficient for what I want to do: making certain software run and/or automatically restart on a server.
Some of the alternatives that i've heard of:
- SysVinit (I think older RPM distros used it) https://wiki.archlinux.org/index.php/SysVinit
- OpenRC (I believe Alpine Linux uses it as its default) https://wiki.archlinux.org/index.php/OpenRC
- and here are some other links https://alternativeto.net/software/systemd/
For better or worse, I've mostly just stuck with systemd because it's good enough - and init systems that require scripts to be written manually intimidate me, since it seems like it'd be easy to make mistakes and the scripts wouldn't be standardized enough. Personally, I prefer to create a configuration file that just defines where the software is, how to keep track of its PID, where and how to get configuration for it, as well as what user/group to execute it as (among other things).
Maybe the poster you're replying to will provide some arguments against systemd and situations where it isn't very nice to use.
The reason I asked is because I'm really not well educated on the alternatives. I've just used SysVinit a little bit (on older Ubuntu) and while I've been learning more and more sysadmin-related stuff this past couple of years, it's all been systemd.
Next time you're debugging connection issues between your canary webserver kubernetes deployment and redis cluster misc-3, ask yourself: are we serving more traffic than stack overflow?
I feel like we need a campaign to help developers understand just how powerful modern server hardware is when you can get at it directly.
Kestrel is pushing 7 million plaintext requests per second on TechEmpower. And their test rig is certainly not the best you can get these days.
The cloud has ruined our perception of computing hardware. If I have a choice between owning a physical machine in order to meet a performance objective, or building out a 20 node k8 cluster in AWS for some notion of scalability, I am going to shell out for the physical machine and happily drive to the data center to install it myself.
Who would honestly prefer more computers over fewer computers to manage? Even if it's in the cloud it's still a fucking mess. The entire game of software engineering is about managing complexity, not creating more of it and making it someone else's problem.
Paying for someone to manage the hardware is included in my assumptions. If you are asking "who would consider the role?" the answer is - Anyone who wants the money.
You're not really out of the weeds when it comes to dedicated servers either. You still need an on call resource to manage the software in that scenario.
> Next time you're debugging connection issues between your canary webserver kubernetes deployment and redis cluster misc-3, ask yourself: are we serving more traffic than stack overflow?
Kubernetes is a configuration management system first and foremost. It can be useful even if you run a low number of very powerful machines. In fact, most of our k8s clusters are like that - no fancy SDN, just a couple of very beefy machines with simple subnet routing among them.
- If I use state-of-the-art, production-grade technology, I can talk about it in interviews; if I say, oh well, I have plenty of experience running docker compose manually on a server, not so much. This is also where new technology can be tested without hurting anyone, so even if something is totally overkill it can still make sense in the overall picture of all the projects you work on.
- CI/CD is maybe not relevant to your users, but it's about your own sanity: it forces complete documentation, in code, of what is done to build and ship, decouples it from local dev setups and makes it as deterministic as possible. Even small projects benefit - have you ever come back after 2 years and taken hours to remember the order of voodoo commands, or lost a laptop and then had to install specific versions of CLI tools?
- The logic is totally changed by the simplicity of FaaS offerings. Nowadays I would maybe not build a gigantic Kubernetes system, but instead of dealing with VMs and servers I'd skip right to Cloudflare Workers or Cloud Run and get the same benefits as k8s with even less work and more time to work on features.
>- CI/CD is maybe not relevant to your users, but it's about your own sanity: it forces complete documentation, in code, of what is done to build and ship, decouples it from local dev setups and makes it as deterministic as possible. Even small projects benefit - have you ever come back after 2 years and taken hours to remember the order of voodoo commands, or lost a laptop and then had to install specific versions of CLI tools?
This to me is the biggest advantage of this sort of thing. Documentation and notes are great, but I'd rather just have the thing-in-itself available that does the job.
I can go back to projects I wrote 10-14 years ago, and the same build/deploy scripts still work today. I used standard POSIXy or GNU tools back then. They still work the same way today.
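To give a flavour of what a "boring tools" build script can look like, here's a toy sketch using only the Python standard library; the src/ and tests/ layout, app name and version are assumptions, not anyone's real setup:

    #!/usr/bin/env python3
    """One-file build script: run the tests, then package the source tree
    into a versioned tarball. Only stdlib, no CI service required.
    The src/ layout, test command and version string are illustrative."""
    import subprocess
    import tarfile
    from pathlib import Path

    VERSION = "1.4.2"

    def test() -> None:
        # Run the test suite; a non-zero exit aborts the build.
        subprocess.run(["python3", "-m", "unittest", "discover", "-s", "tests"], check=True)

    def package() -> Path:
        # Produce dist/myapp-<version>.tar.gz from the src/ tree.
        dist = Path("dist")
        dist.mkdir(exist_ok=True)
        out = dist / f"myapp-{VERSION}.tar.gz"
        with tarfile.open(out, "w:gz") as tar:
            tar.add("src", arcname=f"myapp-{VERSION}")
        return out

    if __name__ == "__main__":
        test()
        print("built", package())

The point isn't the language; it's that the whole build-and-package procedure is written down in one file that doesn't depend on a third-party service staying in business.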
It really depends on how you pick what you depend on. Would some random third-party CI/CD infrastructure I might have selected 10 years ago still work, or still be in business, today? Hard to say. Maybe I would have picked something that stayed supported to this day, maybe not...
Yeah, and 13 years ago there was a similar argument about git. I heard a lot of developers say: well, it's a small project and it's just me, no way I'm dealing with repos, making commits and learning git or mercurial. That attitude completely died out.
Perhaps in the resume/interview it's best to highlight outcomes (supported N users with 99.9% availability, solo, on $x a month of spend) rather than the tech used. "Evaluated kubernetes and [tech du jour]" also gets the keywords in there.
Many people keep saying something to the effect of, "There is a right time to scale a system and we need to focus on that." This is true. There is a right time to scale up. In my experience, you rarely know when that might be except in hindsight.
I worked at a startup once and we built the product in a way that it should have been able to scale up pretty well. Then one of the co-founders went on a national news channel and started talking about our company. Our servers got absolutely slammed with traffic. We did the best we could but it didn't go well.
Now, we should have known to expect increased traffic from an event like that, and, in fact, we did. We simply didn't know just how much traffic we were going to get, and, honestly, couldn't have known that. We had only had a tiny fraction of that traffic until that event. That was the best basis for comparison we had at that point in time.
We updated and upgraded some things after that and the next time he went on that news channel, things were a lot better.
Knowing when to scale up is a hard problem. You have hints, sure, but there's never a data point that says, "Ok, you have until July to migrate this system from a single VPS to a full Kubernetes build-out."
I would like to add that you'd be foolish to use my post as a reason for building a system to scale to high levels prematurely. I agree with the original article that doing so bogs you down in needless complexity and wastes time and resources.
At that startup, we built the system to scale to where we were projecting traffic would be within a given time period. If that co-founder going on national television hadn't happened, our original plan would have been sufficient.
This isn't totally related to infrastructure, but figuring out what you should build, as opposed to what you could build, is something I've struggled with as I've moved along the engineering management ladder. My current role is a stone's throw from solo developer (only developer at a 4-person start-up), and while I've got total autonomy in the work I complete and the approach I take, it's a terrible feeling discovering that you're not super-human and that you need constraints like almost every engineer that's come before you.
Like the author, there have been a number of times where I've felt self-satisfied completing a piece of technical work, only to scratch my head a week later and wonder whether it was worth it. I can say with certainty that having a solid CI/CD pipeline has saved my bacon a few times, so I'd recommend that as an early investment to make; internal tooling for some of the customer registration was a bit of a mistake and not as critical as I thought it would be, which now feels like a waste of 3 weeks.
I have yet to come up with any meaningful advice to pass on to developers in the same position. With time you get better at spotting when you can get away with hacking solutions together and when to put the time in; having a forward vision for a product really helps focus your thought and attention, and time spent making things visible to the rest of the business is rarely time wasted. Awareness of what's going on in the rest of the industry is useful, but don't obsess over the new hotness: build with what you know works and save your innovation tokens for a rainy day.
I've heard the Kubernetes vs VPS a lot now. I think it's the wrong way to think about it.
It's more about a learning curve of best practices.
- Building your application with a CI pipeline is a best practice. It helps to ensure your code builds from the repo and your tests run.
- Building your app into a container is a best practice. The Dockerfile tracks your dependencies as code, and you can version control it.
I can tell you from bitter experience these 2 are essential.
Once your app is containerized you have many choices in how you deploy it. You can use a PaaS, e.g. Heroku or Azure Web Apps.
The next step up from that is orchestration.
If your service has a few containers and maybe some jobs, then Kubernetes starts to look like a good option. K8s is about $50 a month on DigitalOcean; you can run a few projects on that.
Continuous deployment/delivery is a best practice. It takes the stress out of releases.
Infrastructure as Code is a best practice. It gives you immutable infrastructure, and once you have that, you get so many options in how you manage failover etc.
If you are concerned about uptime during deployments then Blue/Green is a great option too.
We're not overcomplicating things; these skills will transfer from a side project to a billion-dollar corp concerned with safety-critical software.
So you have to climb a few learning curves, but after that you're a better developer.
It's not just engineers who want to build complex infrastructure, though. This is a massive misconception that shifts the blame to 'nerdy engineers' doing 'nerdy' stuff instead of solving real problems.
A startup I worked at raised a whole bunch of money and hired tons of engineers. They were all tasked with "building something big", e.g. data engineers were expected to set up a "big data architecture" with complex data lakes, job schedulers, Spark clusters, streaming jobs and all sorts of weird stuff. No one really understood why we were building all this, and we never got real answers from management.
I find there's a bit of an "if you give a mouse a cookie" effect with this stuff.
I recently wanted to organize a data processing pipeline at our company. In the past, the system was: there's a bunch of scripts, some of them call each other, and sometimes a person logs in and kicks off one or more scripts without recording their inputs and sending the output to wherever they feel like. I thought implementing this on Argo Workflows would allow us to keep track of previous runs and generally clarify things.
Well, Argo runs on Kubernetes. Also, it kinda wants your data in a bucket. And Argo on Kubernetes wants Kustomize and Skaffold and/or Helm, and soon I'm installing a package manager that manages package managers and spending half my time managing systems we don't really need.
I'm learning I need to draw the line and deny Argo some of the things it wants. No, we don't need to run MinIO and call our files "artefacts." The filesystem works just fine, thanks; we don't need another layer on top of it. This means we can't follow some of the Argo examples and we lose some of its functionality.
Our problem domain is bounded in such a way that we never have to worry about scaling beyond 1 host in production. For the last 6 years, we have survived on little more than .NET Core and SQLite.
We utilize Self-Contained Deployments, so rolling out a new install for a customer is an unzip operation followed by sc.exe create invocation on a blank win2019 server. Because we use SQLite, we don't have to worry about configuring any separate infrastructure to store the business data. It all lives in a convenient folder called "db" right next to the binaries.
We also built our software so that it is capable of rebuilding itself from source and performing an in-place upgrade. Then we put a fancy Blazor UI on top of that and made builds & deployments the problem of non-developers. The time/frustration savings here are beyond your wildest imagination. We have project managers competently communicating about client environments in terms of commit hashes now: "Oh, Client X's production is still back at 2c8ac0, I'll run the build at e8c0a2." This is something that isn't really feasible when you have your solution architecture scattered to the seven winds.
Our entire stack lives inside the confines of a single process on a single host. Focusing on the vertical and making it as powerful and capable as possible, rather than chasing delusions of grandeur, has paid massive dividends so far. Maybe one day we get screwed and it's all over because we never planned to scale to 2+ hosts, but we could never have hoped to get to that point in the first place if we were chasing a bunch of shiny bullshit around all day.
Here's an oversimplified rule of thumb: try not to spend more than a tenth of your time on infrastructure or build if you are developing a product. If you find you are spending more than that on manual stuff, try to automate etc. to get back below that.
If you are scaling staff, you may want up to a tenth the engineering staff to be dedicated to these tasks.... as you scale up, more automation to stay at or below that level.
Oversimplified.... but not bad to think about if you think you are overdoing it.
If you are way below a tenth.... really think about if you are counting all the manual workarounds and pieces of overhead honestly....or if you are giving up scalability, reliability or flexibility by not controlling your infrastructure.
As always, it depends. Lots of people wish they had Facebook's challenges when it comes to scaling. 99% of applications simply don't.
I've never seen a company I've worked with go for 'fancy' infrastructure because they needed to. It was always because they HOPE that they need to. Granted, re-engineering your application / architecture for scaling is expensive, but not as expensive as building the wrong thing in the first place.
They should all have started off with a monolith and grown from there. I mean, in all those cases we had just a single, 'monolithic' front-end without any problems.
> Your users don’t care how your code gets onto your server.
I work in B2B and while sure, our users don't necessarily care but they also aren't the ones paying us money. The people paying us money have a huge set of questions about how code gets on our servers and Kubernetes is not everything but it helps a lot answering those questions. (Especially using GKE.)
As a "devops" engineer, I can say with total confidence that complexity is rarely the right answer. When it is, it's because that complexity is the lowest pain point in a calculation that has itself become complex.
Sometimes you can add complexity to the whole stack by adding a new shiny layer that brings down complexity where you care most about it, in your user-facing UI/UX layer.
Developers can handle complexity via code much more efficiently than users operating a site or app...
Also, one of the nice developments of the recent explosion of serverless/as-a-service stuff is that you can mostly try things for free, stay happy on the free tier if you don't need scale, and use that scale when you need it without much hassle, albeit paying the premium of their fat margins.
Also, you should probably have an ejection plan in case scale plus an external provider becomes too expensive; you should always have a rough idea of how to run your stack yourself if you need to...
Complexity is never the right answer. Complexity, in my experience, comes from two things: The person building that system doesn't understand the problem well enough and/or equates complexity with quality. Neither are good things.
That's because there are 2 kinds of complexity: essential (or intrinsic) complexity and accidental complexity.
When you use k8s without truly needing it, that's accidental complexity. But some problems do require k8s, and there the engineered complexity of k8s helps solve the problem at hand.
Of course the issue is that most people don't need k8s, whereas Google and the like do.
I'd say that premature infrastructure optimization (complexity, complex automation, Kubernetes, etc., whatever you call it) is like premature code optimization: both are bad.
However, as with optimization, there are some best practices that are good to follow, such as avoiding N+1 queries.
For me personally, that means deploying with docker-compose; the yml has usually already been created during development anyway, and it makes it much easier to migrate later.
This is of course a tautology. Premature anything is bad. That’s what ‘premature’ means. It means you did it too soon, before the optimal time. Doing it later would have been better.
But there is an optimal time. And even if you did it too soon, if you eventually needed it, at least you’ve done it. Better that than getting to it too late, surely?
And then there's the "I've already done it this way a dozen times and I'm comfortable with it. There are benefits to it and it's not that much extra effort to set it up from the beginning and save 10x effort down the line."
> Your goal, when launching a product, is to build a product that solves a problem for your users.
This is the main takeaway, in my opinion. I've been at small companies who were under-architected "because we can" and large companies who were over-architected because someone in the leadership chain was enamored with a particular technical choice.
In the end it's a cost curve and depends heavily on the nature of your business. I can say personally if the business is even mildly complex (not necessarily large, this could mean a web app + a mobile app + a single RDBMS) I'd sleep better having some AWS (or other cloud) alerts set up to monitor uptime/etc. Others have mentioned CI/CD, but again, heavily depends on your needs. I worked at one place that survived uploading .zip files to Elastic Beanstalk for _years_ to update our PHP website.
For some of the other data compliance/up-time/service availability stuff, I'd expect to have the revenue to cover the sort of overhead that introduces (otherwise you're just not charging enough, or the business isn't really as viable as you think).
There is no reason it has to be that hard to have the scalable infra out of the box. There should be some setup out there, ideally portable, where setup is one command and now I have a working scalable demo app (multi-user todo, fb clone, instagram clone, etc...)
That setup should include unit tests, integration tests, a/b deployment, rolling upgrades, versioning, local dev, staging servers, production servers, data backups, logging infra, etc...
The inputs to get started should be as simple as 10 questions: domain names, service names, ...
As it is now, setting up a new scalable website is similar in some ways to trying to set up a new computer if Linux, macOS and Windows didn't exist and you had to write your own OS from existing parts. There's no reason it should be this way. There's a set of reasonable defaults that would work for most projects.
I wouldn't consider that too complex or over-engineered. I'd consider it simplification, in the same way that getting so much from existing OSes simplifies using my computers, versus having to write my own OS for every new computer I buy.
Operations guys, "devops" as they are called now, remain the arch-wizards with grey beards. They love creating highly company-customized, incredibly complex, often brittle infrastructure and deployments based on obscure scripts written in non-mainstream languages, built on plugins and libraries without documentation. The same opaqueness as in IT security, but in emergencies you can downsize or ignore IT security, whereas without the ops guys your product cannot survive.
Kubernetes, Helm and cloud-platform specific quirks were a godsend that cemented in this status for at least another decade.
There is probably no job with higher job security in tech than this.
So were it not for this era of cloud platforms and the other technical "advancements" you mention, would we somehow be in a different place when it comes to maintaining and operating software?
As someone on a devops team, I could argue that ops teams are rarely the ones deciding which technologies to use. If we follow the basic sales -> development -> maintenance pipeline, as ops your task is simply to keep the service running and secure with the tools you are given. Sure, there can be added complexity, but I doubt it is as systemic/general as you make it seem.
Anyway, if this is seen as a problem, the devs could simply take on the responsibility of learning those quirks themselves, or produce software of a quality that requires no maintenance whatsoever. But for some reason that hasn't happened, even though the original idea behind devops was to blur this line. So until that day comes, I'll happily enjoy whatever security one can really have in today's job market.
No, we would simply still be using incomprehensible WebSphere setups. Now we've got Kubernetes as the application server of the current decade, one that's application language/framework agnostic.
But it shouldn't be like this. Devops shouldn't have to exist. Everything should work like heroku, where even mediocre devs can quickly setup proper pipelines.
> Kubernetes, Helm and cloud-platform specific quirks were a godsend that cemented in this status for at least another decade.
Quite to the contrary - containers and Kubernetes are standardizing and commoditizing builds, deployments and orchestration across different environments.
>>"...every minute spent on infrastructure is a minute less spent shipping features."
Completely true.
This does not mean that we should spend zero minutes on infrastructure.
It does mean that we should ensure every minute and dollar spent on infrastructure is well spent - that it will actually produce, for the customers, the developers, and/or the company overall, a return of sufficient value.
I used to be all about planning for scale early. But I've actually found it's better to plan to build one or two throw-away versions, built as well as we can while staying fast and with only minimal scaling features, THEN take the lessons learned (about real features, usage patterns, and the loads that will need to be supported) and apply them to building out the more scalable versions. The initial cheap, cheerful, and disposable versions provide initial traction, a lot of knowledge, and a known and limited technical and financial debt, but the 'killer feature' is zero technical lock-in when you go to design and build the scalable version.
Whether the infra is complex or not isn't going to make a huge difference to whether your product makes money. Some of the highest-revenue-generating products I have seen were tire fires under the hood. One product had multiple people working 24 hours a day to constantly hose down the multiple levels of problems that the incredibly poor design and terrible infrastructure created. But through the toil of those people in the backend, the product made hundreds of millions a year.
Whether or not you use complex technology, it matters more whether it's well designed and well operated. If it's not, you will end up with a fire either way. It's just that "simple" fires are a lot easier to put out than "complex" ones. (this is literally true; a lot of fires should not be put out with water)
* Built the app (into a self-contained .jar; it was a JVM shop)
* Put the app into a Ubuntu Docker image. This step was arguably unnecessary, but the same way Maven is used to isolate JVM dependencies ("it works on my machine"), the purpose of the Docker image was to isolate dependencies on the OS environment.
* Put the Docker image onto an AWS AMI that only had Docker on it, and whose sole purpose was to run the Docker image.
* Combined the AMI with an appropriately sized EC2 instance type.
* Spun up the EC2 instances and flipped the AWS ELBs to point to the new ones, blue/green style.
The beauty of this was the stupidly simple process and complete isolation of all the apps. No cluster that ran multiple diverse CPU and memory requirement apps simultaneously. No K8s complexity. Still had all the horizontal scaling benefits etc.
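For illustration only, the flip step might look roughly like this with boto3, assuming a classic ELB; the AMI ID, ELB name and instance type are placeholders, and the original team may well have scripted it differently:

    #!/usr/bin/env python3
    """Blue/green-ish swap on a classic ELB: launch instances from the new AMI,
    wait for them, register them, then deregister the old ones.
    AMI ID, ELB name and instance type are placeholders."""
    import boto3

    ec2 = boto3.client("ec2")
    elb = boto3.client("elb")

    NEW_AMI = "ami-0123456789abcdef0"   # the image whose only job is to run the app's container
    ELB_NAME = "my-app-elb"
    COUNT = 2

    def launch(ami: str, count: int) -> list[str]:
        resp = ec2.run_instances(ImageId=ami, InstanceType="m5.large",
                                 MinCount=count, MaxCount=count)
        ids = [i["InstanceId"] for i in resp["Instances"]]
        ec2.get_waiter("instance_running").wait(InstanceIds=ids)
        return ids

    def swap(new_ids: list[str]) -> None:
        # Find what is currently behind the ELB, add the new instances, drop the old.
        desc = elb.describe_load_balancers(LoadBalancerNames=[ELB_NAME])
        old = desc["LoadBalancerDescriptions"][0]["Instances"]
        elb.register_instances_with_load_balancer(
            LoadBalancerName=ELB_NAME,
            Instances=[{"InstanceId": i} for i in new_ids])
        if old:
            elb.deregister_instances_from_load_balancer(
                LoadBalancerName=ELB_NAME, Instances=old)

    if __name__ == "__main__":
        swap(launch(NEW_AMI, COUNT))

In practice you would also wait for the new instances to pass the ELB health checks before deregistering the old ones.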
My team is actually in the process of solving this problem ourselves. At the end of the day, k8s is overkill for most people, and I feel that relying on a nice UI/UX layered on top of k8s will have devastating effects down the road when you need to troubleshoot, manage, and migrate.
I ran my B2B SaaS on a single (physical) server for the first three years or so. I then switched to a setup with three production servers, and three staging servers (in a different data center), mostly to have the comfort of a distributed database, and a staging system to try things out.
Physical servers provide enormous computing capacity these days. And managing them isn't much different from virtual servers: you should be using a tool like ansible anyway. The only difference is that when I spin up my development cluster I use terraform to generate the ansible inventory, and for physical servers the inventory is static.
I do not anticipate needing more capacity or scalability anytime within my planning horizon.
I keep seeing people investing tons of time and effort into things like Kubernetes and I honestly have no idea why. It complicates things enormously for no good reason: the benefits, if you aren't Google, aren't interesting at all, and the drawbacks are huge. The biggest drawback for me (apart from the cost of learning, building, and maintaining the infrastructure) is the complexity and resulting unpredictable failure modes.
There are things which I'd say are really important for infrastructure, though:
* a working and tested backup strategy (a sketch of a restore check follows after this list)
* disaster recovery plan (what happens when your entire datacenter burns down)
* managing servers via a tool like ansible
* regular testing of various scenarios (including setting up an entire system from scratch on new servers, with data recovery from backups)
* a database that reliably preserves all of your customers' data when any one of your servers goes down (this unfortunately excludes many databases and makes the usual "just use Postgres" a poor choice, now flame me to high heaven)
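On the backup item above, the "tested" part is the one people skip. A toy sketch of a restore check, using SQLite purely as a stand-in (the backup path and the customers table are made up); the same shape applies to pg_restore or whatever restore tool your database ships with:

    #!/usr/bin/env python3
    """Restore last night's backup into a scratch file and sanity-check it.
    Backup path and the expected table are illustrative stand-ins."""
    import shutil
    import sqlite3
    from pathlib import Path

    BACKUP = Path("/backups/app-latest.sqlite3")   # produced by the nightly job
    SCRATCH = Path("/tmp/restore-check.sqlite3")

    def check_restore() -> None:
        shutil.copyfile(BACKUP, SCRATCH)           # the "restore" step for SQLite is just a copy
        con = sqlite3.connect(SCRATCH)
        try:
            ok = con.execute("PRAGMA integrity_check").fetchone()[0]
            rows = con.execute("SELECT count(*) FROM customers").fetchone()[0]
            assert ok == "ok", f"integrity_check failed: {ok}"
            assert rows > 0, "restored database has no customers"
            print(f"backup looks sane: {rows} customers")
        finally:
            con.close()

    if __name__ == "__main__":
        check_restore()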
I can't imagine a small business where Kubernetes makes business sense. Do people seriously have problems with scaling to hundreds of machines? Scalability problems with huge traffic peaks that actually cause server load? I'm having trouble keeping my CPUs busy for any meaningful amount of time, and I don't spend much time on optimizations, I just avoid doing stupid things.
> a database that reliably preserves all of your customers' data when any one of your servers goes down (this unfortunately excludes many databases and makes the usual "just use Postgres" a poor choice, now flame me to high heaven)
You mean, as in backups have the latest data? Every transactional DB does that, Postgres included. You just have to back up the correct data (it's in the docs for every one of them).
If you mean any DB server can go down at any time and the others must take over without losing any data, then your best bet is to hire a postgres expert to set it up for you. Both Oracle and SQL Server nominally do that, but they have flaws that their experts can't correct, and you won't be able to set any of them up without acquiring some expertise first.
> * a database that reliably preserves all of your customers' data when any one of your servers goes down (this unfortunately excludes many databases and makes the usual "just use Postgres" a poor choice, now flame me to high heaven)
What do you suggest instead? My understanding was that Postgres, when properly set up with backups and replication, is the most reliable open-source database.
That's so true! Not long ago, I wrote an Ansible playbook to install my open-source software on a DigitalOcean VPS. I devoted so much time to learning Ansible, as I had zero experience with it. I kept working on that playbook for almost two months. Even after all that, it was buggy.
One day, I just woke up and had an epiphany that a BASH script could do the same! On top of that, the end users would not have to install Ansible in order to run the playbook.
There is a fairly in-depth article on which modern deployment strategies are useful, and which are a waste of time for mid-size companies as of 2020 here:
So true. HN often admires complexity for complexity sake.
Even without Cloudflare, a static website on 5 VPSes from 5 different providers on 5 different continents, with round-robin DNS and a low TTL, will get you most of the bang for the buck.
Add geoIP to the mix (like from Azure) and you're golden, even if you are just scp'ing files to 5 VPS.
If a DC burns down, remove the A record (or even better: have a script do it for you) and reroute to the remaining 4.
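The "have a script do it for you" part can be tiny. A sketch, assuming the records happen to live in Route 53 (zone ID, domain and IPs are placeholders; any DNS provider with an API works the same way): probe each VPS, then UPSERT the A record with only the healthy addresses.

    #!/usr/bin/env python3
    """Round-robin DNS failover: drop unreachable VPS IPs from the A record.
    Zone ID, domain and IPs are placeholders; assumes the zone is in Route 53."""
    import socket
    import boto3

    ZONE_ID = "Z0123456789ABCDEFGHIJ"
    DOMAIN = "example.com."
    VPS_IPS = ["203.0.113.1", "203.0.113.2", "203.0.113.3", "203.0.113.4", "203.0.113.5"]

    def is_up(ip: str, port: int = 443, timeout: float = 3.0) -> bool:
        # Cheap reachability probe: can we open a TCP connection to the web port?
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                return True
        except OSError:
            return False

    def update_record(healthy: list[str]) -> None:
        boto3.client("route53").change_resource_record_sets(
            HostedZoneId=ZONE_ID,
            ChangeBatch={"Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": DOMAIN,
                    "Type": "A",
                    "TTL": 60,  # keep the TTL low so clients pick up changes quickly
                    "ResourceRecords": [{"Value": ip} for ip in healthy],
                },
            }]},
        )

    if __name__ == "__main__":
        alive = [ip for ip in VPS_IPS if is_up(ip)]
        if alive:  # never publish an empty record set
            update_record(alive)

Run it from cron on a box outside those five data centers, and keep the TTL low as noted above.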
That’s funny, sometimes when I visit your site, I get an HTML page that references script files and resources that come back as 404s. I wonder if it’s possible I’m hitting the site during a deployment and getting the new HTML page from one of your servers, but then the resource requests are going to a different server which hasn’t received the new version yet.
I guess occasional broken images and script failures are unavoidable though. You’d need like, some dev ops engineers to solve a problem of that complexity.