There is obviously some point, somewhere between ‘just running a website out of my spare bedroom’ and ‘Facebook’, where some of this infrastructure does become necessary.
It is very important that you realize when you pass one of those points.
It doesn’t help to just tell people ‘you don’t need kubernetes, you are not Facebook’
There are a lot of companies that aren’t Facebook, but they still have multimillion dollar revenue streams riding on service availability, with compliance obligations and employee paychecks on the line, and which need to be able to reliably release changes to production code every day, written by multiple development teams in different time zones, so... yeah, maybe they are a bit beyond just scp’ing their files up to a single Linux VPS.
This is a good point.
I wish there was more discussion around when to consider making these decisions, how to transition, what types of systems are in the middle ground between bedroom and facebook, etc.
It seems all of the articles/posts that get our attention are the "you don't need it ever" opinions, but those are incredibly short-sighted and assume that everyone is recklessly applying complex infrastructure.
In general, it would be nice to have more nuanced discussions around these choices rather than the low hanging fruit that is always presented.
What I tend to see missing is clear explanations of, "What problems does this solve?" Usually you see a lot of, "What does it do?" which isn't quite the same thing. The framing is critical because it's a lot easier for people to decide whether or not they have a particular problem. If you frame it as, "Do we need this?" though, well, humans are really bad at distinguishing need from want.
Take Hadoop. I wasted a lot of time and money on distributed computing once, almost entirely because of this mistake. I focused on the cool scale-out features, and of course I want to be able to scale out. Scale out is an incredibly useful thing!
If, on the other hand, I had focused on the problem - "If your storage can sustain a transfer rate of X MB/s, then your theoretical minimum time to chew through Y MB of data is Y/X seconds, and, if Y is big enough, then Y/X becomes unacceptable" - then I could easily have said, "Well my Y/X is 1/10,000 of the acceptable limit and it's only growing at 20% per year, so I guess I don't really need to spend any more time thinking about this right now."
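To make the arithmetic concrete with made-up numbers (the X and Y here are purely illustrative, not from any real workload):

```sh
# Purely illustrative figures:
X=200       # sustained storage throughput, MB/s
Y=500000    # data to chew through, MB (500 GB)
echo $(( Y / X ))   # 2500 seconds, i.e. ~42 minutes per full scan
# If a ~42-minute batch window is acceptable and Y grows 20%/year,
# a single machine buys you years before distribution pays off.
```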
Yep. All of us tech folks read sites like Hacker News and hear all about what the latest hot Silicon Valley tech startup is doing. Or what Uber is doing. Or what Google is doing. Or what Amazon is doing. And we want to be cool as well, so we jump on that new technology. Whereas for the significant majority of applications out there, older, less sexy technology on modern fast computers will almost certainly be good enough and probably easier to maintain.
I've had several experiences where I attended a tech talk by someone from Uber, and never once did I come away with the impression that the problem they were trying to solve was the kind of thing Fred Brooks had in mind when he coined the term "essential complexity."
That said, volume and velocity aren't the only kinds of scale that technical teams have to grapple with. I've spent enough time at a big organization to understand that Conway's Law costs CPU cycles. Lots of them.
> What I tend to see missing is clear explanations of, "What problems does this solve?"
This still sounds like starting out by looking at something because it's cool, hip, and perhaps useful, and only afterwards considering whether it solves any of your problems. The proper working order is to start with the problem. Hardly any problem requires a state-of-the-art solution.
Where I live, when wiring a house, regulation says you must use the same standard for everything on a single group on the switchboard. This hilariously means that if you need to extend a run of iron pipes with canvas-insulated wires, you have to use metal pipes with canvas-wrapped wires (or rip everything out and replace it with something modern).
When you are in a position to pay someone full time to look after the web infrastructure of your project, that's when you should think this stuff through.
Before that, don't worry about the tech, go for what you know, and expect it to need totally re-writing at some point. Let's face it, this stuff moves quickly.
That's not good. That kind of stuff must be derived from your business plan, not from the current situation. You should know that "if I'm successful at this level, I will need this kind of infrastructure" before the problems appear, and not be caught by surprise and react in a rush. (Granted that reacting in a rush will probably happen anyway, but it shouldn't be surprising.)
If your business requires enough infrastructure that this kind of tooling is a need at all, it may very well be an important variable on whether you can possibly be profitable or not.
> You should know that "if I'm successful at this level, I will need this kind of infrastructure" before the problems appear, and not be caught by surprise and react in a rush. (Granted that reacting in a rush will probably happen anyway, but it shouldn't be surprising.)
What's the ROI on spending time on planning for that, given that you acknowledge that the response will be a mess anyway? In my experience, companies that planned for scaling don't handle it noticeably better than companies that didn't plan for it, so maybe it's better not to bother?
> What's the ROI on spending time on planning for that, given that you acknowledge that the response will be a mess anyway?
Let's put it this way: what's the ROI of failing to address any of the failure modes avoided or mitigated by taking the time to think things through at design time?
Are you planning on getting paid when you can't deliver value because your service is down?
The ROI of reducing time to market is huge, whereas I don't think I've ever seen thinking things through at design time deliver any real benefits (not even a reduced rate of outages).
> I don't think I've ever seen thinking things through at design time deliver any real benefits (...)
That's mostly the Dunning-Kruger effect rearing its head. It makes zero sense to claim that making fundamental mistakes has zero impact on a project.
This type of obliviousness is even less comprehensible given that every other engineering field has project planning and decision theory imbued in its matrix, and every single intro to systems design book puts its emphasis on failure avoidance and mitigation. Yet somehow, in software development, hacking is tolerated, and dealing with the perfectly avoidable screwups that result from said hacking is shrugged off as if they were acts of nature.
I suspect it's tolerated because it works better; IMO the value of planning is something cargo-culted from engineering fields where changing something after it's partially built is costly. In software, usually the cheapest way to figure out whether something will work is to try it, which is in stark contrast to a field like physical construction.
Also, when cost caused by downtime and updates and/or hardware problems exceed the cost of this full time engineer.
Even then, K8s is probably not the next step. Simple HA with a load balancer and separate test and production environments, plus some automation with an orchestration tool, works wonders.
If you are already deploying to Heroku, you are already doing ops.
It makes zero sense to presume you don't need to know what you're doing or benefit from automating processes just because you can't hire a full-time expert.
True, but there's a difference between solving ops problems by throwing more money at hardware and solving ops problems by thinking smarter. That's why I think that some multiple of an FTE salary is a good metric for when you start transitioning to more complex operational setups, because it gives you a number for when the gain of that new complexity is worth the cost.
This. When you have enough revenue to hire new dev, and infrastructure is your most pressing problem to solve with that money, hire someone who knows this stuff backwards to solve it.
So I agree there’s definitely a need for more writing in this space (and you’d think it might be particularly of interest to incubating startups...)
My top level take on this is that the cues to take things up a level in terms of infrastructure diligence are when the risk budget you’re dealing with is no longer entirely your own. Once you have a banker, a lawyer, an insurance policy, an investor, a big customer, a government regulator, shareholders... ultimately once there’s someone you have to answer to if you haven’t done enough to protect their interests.
And risk management isn’t just ‘what will happen if the server catches fire’, it’s ‘how do I guarantee the code running in production is the code I think it is?’ And ‘how do I roll back if I push a bug to production?’ And ‘how can I make sure all the current versions of the code come back up after patching my servers?’
And it turns out that things like kubernetes can help you solve some of those problems. And things like serverless can make some of those problems go away completely. But of course they come with their own problems! But when those things become important enough that you need to solve for them to answer the risk aversion of sufficient stakeholders, the cost of adopting those technologies starts to look sensible.
> Obviously if you’re FAANG-level or some established site where that 0.1% downtime translates into vast quantities of cash disappearing from your books, this stuff is all great. You have the funds to do things “right”. Your fancy zero-click continuous deployment system saves you thousands/millions a year. But at indie maker, “Hey look at this cool thing I built … please, please someone look at it (and upvote it on Product Hunt too, thx)” scale — the scale of almost every single site on the net — that 0.1% is vanishingly insignificant.
> Obviously if you’re FAANG-level or some established site where that 0.1% downtime translates into vast quantities of cash disappearing from your books,
I'm yet to see a company with any online presence where downtime doesn't eat away at their profits.
In the very least, downtime means you're paying money to get no service.
Downtime is fine for someone's blog on his personal website that he set up on a weekend while drinking beer. I'm yet to see a business state that they are OK with their site throwing 404s or 500s at random, adding up to a whole workday per year, which is what 0.1% downtime translates to.
Fear not, the article has you covered (emphasis mine):
> Obviously if you’re FAANG-level or some established site where that 0.1% downtime translates into vast quantities of cash disappearing from your books, this stuff is all great.
This is not a FANG thing. Not being able to deliver a product or service is a business thing. This happens in small mom-and-pop shops as well as FANGs. This is a free-market-economy thing. If your business is not open, then you get expenses but zero income. How is this hard to understand?
The software world involves way more than your small side project that you managed to cobble together during a weekend. Things do need to work reliably and predictably. Otherwise not only do you not get your lunch but also your competitors eat it from under your nose. Why is this even being discussed, in HN of all places?
Because having to keep a site live doesn't automatically mean I need all the complexity of K8s? I can deploy my server on two VMs in two AZs with Ansible and run them with systemd to restart on crash. Just because I don't immediately jump to K8s doesn't mean I don't know how to run a site.
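For reference, the restart-on-crash half of that is about ten lines of systemd unit, sketched here with hypothetical names and paths:

```ini
# /etc/systemd/system/mysite.service -- minimal sketch, hypothetical app
[Unit]
Description=My site
After=network.target

[Service]
ExecStart=/opt/mysite/bin/server
Restart=on-failure
RestartSec=2s

[Install]
WantedBy=multi-user.target
```

Then `systemctl enable --now mysite` on each VM, which the same Ansible play can do for you.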
This is a really weird take on what ‘every single site on the internet’ is. Or of what proportion of the software development community is working on sites of that scale.
It’s like saying ‘the vast majority of people building houses don’t need to dig foundations’ because you build Lego houses, and after all the vast majority of houses are made of Lego, right?
There are ways to quantify it - but a real telling point is if your service is down because of the complexity of tooling you don’t fully understand.
Many problems can be solved with other methods - especially if you do NOT have global all-sync’d requirements like Twitter and Facebook. Many SaaS products (example: Salesforce) find it perfectly acceptable to shard customers to particular instances, thereby limiting downtime if something goes wrong.
And always keep in the back of your mind that the largest VPS or servers are very performant today - having to scale sideways may be masking inefficiencies in the code.
True - but the datacenter burning still hits you where it hurts if your entire Kubernetes cluster is living in that datacenter.
Whereas if you make a point of spinning up each Linux box on different datacenters (and even perhaps different providers) you are at least resilient against that.
Using advanced tooling doesn't remove the need to do actual planning beyond "oh I have backups" - you should be listing the risks and recoveries for various scenarios - though being able to handle "AWS deleted all my servers because I returned too many things via Amazon Prime" means you can also handle any lesser problem.
IME you'll hit it around $500,000 USD ARR. But again, situations vary. I've worked on projects where hours-long scheduled downtime at the weekend was acceptable. So easy!
> There is obviously some point, somewhere between ‘just running a website out of my spare bedroom’ and ‘Facebook’, where some of this infrastructure does become necessary.
Frankly, that blog post reads as terribly myopic. The author essentially looks at his personal experience setting up his single-dev personal project, complains that he spent some time jump-starting a basic CICD pipeline, and from there proceeds to argue that no one needs any of that.
Well, how does he expect to check whether or not his changes break the deployment? How does he expect to deploy his stuff? How does he plan to get multiple commits to play nice prior to deployment? How does he expect to roll back changes? How does he plan to handle request spikes, whether from normal traffic or attacks? Does he plan to get up at 3AM on a Tuesday to scale up deployments, or does he plan to let his site crap out during the night and just get back to it in the morning?
And somehow we are supposed to accept the naive belief that getting features out of the door is all that matters, and that basic ops is not needed by anyone?
Yeah, I kind of agree with the author, but not 100%. The takeaway shouldn't be "don't use CI, kubernetes, etc." but rather "don't learn something new" when your goal is to get a product out the door. If you already know kubernetes, then it might seem silly to not use it given how it simplifies things. It's only when you don't know kubernetes that you'll run into problems.
> but rather "don't learn something new" when your goal is to get a product out the door.
I don't agree at all, because that argument is a poorly put-together strawman.
No one ever had to choose between setting up a CICD pipeline and churning out features. Ever.
In fact, continuous deployment makes it possible for everyone to get features out of the door as fast as they possibly can, and it makes deployments a two-way door instead of a finicky, one-way, pray-that-it-works event.
The door that features have to get through is not JIRA but the prod environment. It's inconceivable how some people try to pretend that automating operations adds no value.
And hell, CICD nowadays is trivial. You know why? Because of tools like Kubernetes/Cloudformation/CDK/Ansible/etc.
There is a whole lot of time before your first customer where you don't need any of this. Getting an MVP out > getting your engineering shinies.
In my company I first had servers directly on a bare metal box, with half the infra running on my workstation, then moved everything to docker-compose, then kubernetes and now I'm getting ready to get my CICD pipelines.
> There is a whole lot of time before your first customer where you don't need any of this.
No, there really isn't. Customers want features, and they want them rolled out as fast as possible.
How do you provide that? By automating your delivery process. You don't treat your infrastructure as a pet, and all you need to do to get a feature to all customers, fully tested and verified it won't break your service, is to push a commit.
With modern CICD systems and a basic Kubernetes setup, you get a fully working continuous deployment pipeline from scratch in about 15 minutes. There is no excuse.
The blog post is bullshit from someone who has zero insight and no relevant experience.
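To put a number on the "15 minutes" claim above: once a cluster exists, the continuous-deployment half is roughly this much CI config (GitHub Actions shown as one example; the image name and deployment are hypothetical, and it assumes registry credentials and a kubeconfig are already wired up as secrets):

```yaml
# .github/workflows/deploy.yml -- minimal sketch, hypothetical names
name: deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t registry.example.com/myapp:${{ github.sha }} .
      - run: docker push registry.example.com/myapp:${{ github.sha }}
      # Rolls the deployment to the new image; k8s handles the restart.
      - run: kubectl set image deployment/myapp myapp=registry.example.com/myapp:${{ github.sha }}
```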
(I am not commenting on the article, I am commenting on the comment thread)
> Customers want features, and they want them rolled out as fast as possible.
I am going to reiterate: There is a whole lot of time _before_ your first customer.
Now I agree with you: at my last job we had no CICD, no tests and no procedures when I joined, and it was _hell_ to release anything. Writing tests and doing the infra candy improved our velocity, even though technically we were doing more work. But this was _with_ customers. If you don't have customers, they don't care about the features.
> I am going to reiterate: There is a whole lot of time _before_ your first customer.
So what? Don't you understand that this just renders your point even less defendable?
> But this was _with_ customers.
So what? Do you expect your code to work only in prod? Do you hope that all the bugs you're adding right now will just pile up into a big tangled, unmanageable mess until it becomes a problem? Are you even bothered by the fact that your last commit may or may not have screwed the entire project?
There is absolutely zero justification for not automating testing and delivery. Zero. There is not a single use case where pushing half-baked crashers without any verification is OK.
>Facebook is kind of an interesting example, as they got pretty far into "hyperscale" with mostly PHP, memcached, and Mysql.
But PHP and MySql are only the surface level tools we see.
There is also the unseen and that's where the immense complexity lies hidden under the surface.
E.g., to help with scaling (which Friendster failed at), Facebook had separate MySql instances for each school. An interview with Facebook co-founder Dustin Moskovitz mentioned they had the constant stress of emergencies and fires to put out. They had tons of custom tooling, Linux scripts, batch jobs, automation/orchestration/dashboards, etc., and it's very likely the combination of all that ends up inventing an "invisible internal Kubernetes" without calling it "Kubernetes" to manage their server fleet. In other words, if Docker/Kubernetes/AWS had been around, the early Facebook team might have offloaded some of the complexity onto a 3rd party to reduce the all-nighters. But the tools didn't exist yet, so they had to do it the hard way and invent "deployment orchestration" themselves.
[EDIT to reply]
Thanks for the "Tupperware" comment. A Google search leads to a Facebook post saying they renamed it to "Twine":
That's recent though. I'm saying they scaled up quite a long way before doing anything like that. From the outside it looks like they didn't really branch out into things like this until 2008 or so.
>From the outside it looks like they didn't really branch out into things like this until 2008 or so.
No, even before the Newsfeed feature rollout of September 2006, they already had crazy complexity orchestrating multiple servers for many schools. They were only a "simple PHP+MySql" website back in February 2004, when only Harvard students were on it. Dustin said he was the one in charge of spinning up new servers for each new school, and he said it was a nightmare of engineering effort.
Stack Overflow in 2009 is an example of something that caches extremely well - several orders of magnitude better than stuff like a personalized wall for every user.
IIRC, they were doing things like patching BitTorrent to prefer closer IPs to pull from during their deploys, so a release would roll through their network from one side to the other without completely saturating it.
But the point still remains, they had challenges to solve, but they were also simple, which made it possible TO solve those challenges.
This could be the final post to this discussion. You can't really get more to the point than that.
We are running a large German browser game on individual VPS instances. For many years, with PHP, MySQL and Redis. It works and it runs just fine.
I dunno exactly what Craigslist is running on today, but some slides from 2012[1] include items like running MongoDB over "3 shards across 3 node replica sets" with "duplicate config in 2nd data center"; a "sharded multi-node" redis cluster; multiple "vertically partitioned" MySQL clusters; the actual application stack on top of that being mod_perl on Apache; and it was all operated by a 'small but growing' dedicated team.
It might or might not be notable that the growth of that tech team included, one year later, hiring Larry Wall[2].
Given that, I'm not sure how to apply 'if Craigslist doesn't need it, I probably don't' as a general principle....
I see language as a personal choice, and while I wouldn't run Perl today (I'm partial to Python, myself), I think their stack is fairly simple - a tried-and-true language (however quirky it may be), no virtualization, a simple bare-metal hardware stack - and yet they serve an incredible number of users. In that sense, I'm sure hiring Larry was a no-brainer.
They are no FB, but they're bigger than almost anything else on the web [0]. I doubt most people will build anything bigger in their careers. I certainly won't.
A huge caveat might be app features. Clearly, running something like Travelocity could require more pieces, and something like a BI app in a Fortune 100 company would have many more connectors to data sources. But in general, they've done pretty well using only newer versions of what worked way back in 1999, and to me, that's just incredible.
People bitching about K8S, systemd, etc. miss the point that these systems allow a qualified generalist ops admin to pop in and fix stuff without knowing the app well.
For a little company with <500 servers, it doesn’t matter. As size and complexity increase, figuring out how the guy who left last year set up FreeBSD becomes a liability.
That surfaces a very important communication problem. If you consider 500 servers a "small datacenter", then, of course, you will say that K8S adds value to small operations.
A qualified generalist ops admin should be productive on any modern UNIX or UNIX-like within a couple weeks.
Figuring out how the guy last year set up FreeBSD consists of similar effort and process as figuring out how the guy last year set up Debian or CentOS or SUSE.
It can be challenging. I basically bought my house based on figuring that out for people in the old days.
It’s not an issue with FreeBSD or anything per se. Usually there’s a genius superhero guy who does whatever, who burns out/quits/otherwise goes away, and the business is stuck.
I’d suggest that when you have paying customers, and your duct-tape deploy process is starting to feel risky to run during business hours, that’s a good time to take stock for the first time. Before that point it should be the fastest thing you can wire up given your own personal tooling expertise.
(I’d argue that k8s is actually lean enough to be a competitor here if you’re already an expert and you use a managed service like GKE).
Your infra should be in service to some uptime/performance/DX goals. Define those (honestly!) and then you can figure out what you need to improve.
In particular if no customer has ever noticed downtime, and you don’t think you’d lose business if someone hit a 30-min outage, then you probably don’t need to upgrade. Twitter was notoriously unreliable in the early days and they made it, for example.
Fair point. But people just love to over-complicate things, and it is far more often that you come across absolutely ridiculous infrastructure monstrosities at very small businesses than that you come across the opposite.
I’ve found that career FOMO drives a lot of the desire to overcomplicate infrastructure.
It’s hard to look at job listings from prestigious companies listing K8s and other buzzwords as requirements without feeling some desire to add those buzzwords to your own resume.
If you need scale, then you know what you need when you need it. None of this advice people are tossing around would apply to WhatsApp, for instance. Because they needed to get into the weeds on BEAM optimizations for Erlang and they needed to do so quickly. They probably didn't have a year to mess around getting K8S up and stable. Or crafting the perfect Ansible script. Or infrastructure-as-code. They probably cared very little about any of that.
WhatsApp is the closest thing you'll see for explosive growth. And they had time to work on BEAM patches. Think about that. Their whole world didn't just collapse because they couldn't support a billion connections on day one. Even they had time to scale as they needed.
A couple of Ansible scripts to deploy the k3s Kubernetes distribution onto cloud VMs.*
Add LetsEncrypt to the Traefik install for https
Use the default Helm chart to make your k8s deployments for your apps - you'll probably only need to tweak a couple values to start - ports and ingress rules.
* k3s comes with Traefik pre-baked to listen on ports 80 and 443 so you can run it "bare metal" without a cloud provider's load balancer service or any special tweaks to Traefik config. If you just Helm install any of the standard k8s ingress providers like the standard Traefik chart, Consul, or Nginx, you have to tweak a bunch to get them working on privileged external ports 80 and 443.
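Concretely, the bootstrap per VM is one documented install command, and the app install is one Helm invocation (the chart values here are hypothetical, assuming a default `helm create` chart):

```sh
# On each cloud VM (or via the Ansible scripts mentioned above):
curl -sfL https://get.k3s.io | sh -     # installs k3s with Traefik bundled
# From your workstation, with kubectl/helm pointed at the cluster:
helm install myapp ./chart \
  --set ingress.enabled=true \
  --set "ingress.hosts[0].host=myapp.example.com"
```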
I found Dockerizing my app really early to be a good bridge to k8s.
Build container locally, test, push. On Linux box, pull the new app container, stop the old app container, start the new app container. Separate nginx reverse proxy also in docker so if you’re fast enough you don’t drop any packets.
I’d say docker is worth using from the get-go, but if you are most familiar with raw Linux probably just use that for your first iteration.
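The whole loop fits in a few lines of shell; a sketch with hypothetical names:

```sh
# On your workstation:
docker build -t registry.example.com/myapp:latest .
docker push registry.example.com/myapp:latest

# On the Linux box (nginx keeps proxying in its own container while you swap):
docker pull registry.example.com/myapp:latest
docker stop myapp && docker rm myapp
docker run -d --name myapp --restart unless-stopped \
  -p 8080:8080 registry.example.com/myapp:latest
```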
What about a managed service like EKS or GKE? A minimal k8s config is perhaps 20 lines to spin up a service and expose the ports - that feels pretty achievable for a bedroom coder?
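For reference, the "20 lines" looks roughly like this (a minimal sketch with hypothetical names; on EKS/GKE the LoadBalancer service provisions the cloud LB for you):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels: {app: myapp}
  template:
    metadata:
      labels: {app: myapp}
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.0
          ports: [{containerPort: 8080}]
---
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  type: LoadBalancer
  selector: {app: myapp}
  ports: [{port: 80, targetPort: 8080}]
```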
Machine installation manuals, "create machine" shell scripts, manual Docker, script-launched Docker, VM management tools... Well, technically, Puppet and its kind are there too, depending on whether you consider them viable or not.
You can use something like AWS Application Load Balancer with two EC2 hosts behind it. Linode also offers an LB if you prefer Linode.
Now do a simple Blue/Green deploy. Deploy to the first host: maybe you use Java, so just have a script that shuts down your app, copies over a new uberjar, and starts it up again. Try it out, and if it all works, swap the LB to point to this host instead. If it results in errors and failures, your rollback procedure is to swap the LB back to the previous host.
You can scale this pretty easily as well, just add more hosts behind your Blue/Green groups.
Now the last challenge is actually going to be your DB. You can't swap DB instances like that, so that's where something like AWS RDS or a managed DB service is useful, because they often offer automatic replicas and failover.
But if you're lazier (or don't have the money) than that, just have a third host running PostgreSQL with a scheduled backup of your DB to some other durable storage like AWS S3. And when you need to update that host, schedule a maintenance window for your app/website.
So...
1) You can start with one single VM and have your DNS point directly to that host, where the host runs your DB and your app. Deployments are you ssh-ing in, shutting down, copying a new uberjar, and starting up again. And you turn this into a script that does it for you (sketched after this list).
2) Then you add a DB backup cron job to S3 so you don't lose all your users data from an incident.
3) Now if you want to avoid downtime when you deploy, you add an LB, move your DNS to point to the LB, put the existing host behind the LB (it becomes Green), add another host which will be Blue, and move your DB to a third host that's not behind the LB. Now do what I said earlier: you can deploy new versions of your app without downtime and do quick rollbacks in case you deploy something that breaks your users.
4) If your user base grows and one host isn't enough anymore, you can add more behind the Green and Blue groups. Everything else is the same, just your script now needs to swap between the groups.
5) Now you might start to want tests running automatically, the uberjar being created automatically when you commit, a beta stage, and all that. So you can set up a CI/CD pipeline. Have a test host and a test DB. And your app has config to know which environment it runs under, to detect which DB to connect to.
6) Finally, if your DB becomes a scaling bottleneck due to growth, congratulations: you've now become a successful business and can just hire other people to deal with it.
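For step 1, the deploy script is nothing fancy; a minimal sketch with hypothetical hosts and paths (step 3's Blue/Green is this same script pointed at the idle host, followed by flipping the LB over):

```sh
#!/usr/bin/env bash
set -euo pipefail
HOST=deploy@myapp.example.com   # hypothetical

scp target/myapp.jar "$HOST":/opt/myapp/myapp.jar.new
ssh "$HOST" '
  sudo systemctl stop myapp
  mv /opt/myapp/myapp.jar /opt/myapp/myapp.jar.prev   # keep one rollback copy
  mv /opt/myapp/myapp.jar.new /opt/myapp/myapp.jar
  sudo systemctl start myapp
'
```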
It's always a judgment call with how to complicate your infrastructure.
If you succeed, sometime in the future, you will have lots of engineering resources to manage the complexity of a "web scale" service. That level of complexity is costly, though.
So, today, what is more important is ease of development, a friendly software iteration cycle, and easy testing on developer machines, since you don't have FB's or Google's test infrastructure yet.
You have to shoot for something that you can build fairly simply that'll carry you the next year or two, and think about it in terms of API boundaries, so that you can replace any particular piece, with a faster, more complex piece in the future when you need it.
Your appropriate design for today isn't going to come from this site, or from reading AWS or GCP press materials, but from people who have been through this multiple times and have some experience in evolving small systems into big systems. They're hard to find.
In that case, just imagine how far you'd go if, instead of manually scp-ing to a bunch of servers, you installed a tool like Ansible, added your bunch of Linux boxes to a list, wrote down in a script what you need to scp to those boxes, and from then on just ran the script whenever you need to update them.
But somehow that's too much because it's now ops and no one needs to scp to boxes safely and reliably like that.
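For the record, that tool-shaped version of "scp in a loop" is about ten lines of Ansible (the inventory group, paths, and service name here are hypothetical):

```yaml
# site.yml -- minimal sketch; run with: ansible-playbook -i inventory site.yml
- hosts: webservers        # the "list of boxes" lives in your inventory file
  become: true
  tasks:
    - name: Copy the app bundle
      ansible.builtin.copy:
        src: ./build/myapp.tar.gz
        dest: /opt/myapp/myapp.tar.gz
    - name: Restart the service
      ansible.builtin.systemd:
        name: myapp
        state: restarted
```

Same scp underneath, but now it's idempotent, logged, and parallel across every box.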
In the overwhelming majority of cases, you will never pass that point. A well-optimized dedicated server with 90 fast cores, 128GB of RAM and 4TB of SSD goes for $120/mo at Hetzner. That can easily handle 1 billion non-cached monthly page views if you write decent code. More if you cache. How many sites get that much traffic? A few dozen. If you're dedicating this much effort to preparing for one-in-a-million eventualities, you will never ship anything.
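Back-of-the-envelope on that claim:

```sh
# 1 billion page views/month expressed as an average request rate:
echo $(( 1000000000 / (30 * 24 * 3600) ))   # ~385 requests/second
# ~385 req/s sustained is well within one big box for most apps,
# though real traffic is spiky, so peak capacity needs headroom.
```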
> It is very important that you realize when you pass one of those points.
It. Does. Not. Matter.
You're not going to be the same guy who sets up Facebook's garage server in 2004 and then scales it out to the multi-datacenter monster it is today.
If you are that guy, then I really feel sorry for you. Something went seriously wrong with your equity (or lack thereof) and you are doing janitor shit when you should be up in the tower with the other C-levels.
Part of what you're saying is a good point, IMO. The people who transition the infrastructure from Level 0 to Level 1 won't be the same people in charge of infrastructure when it's time to transition from Level 17 to Level 18.
But however many levels there are from garage infrastructure to Big Tech infrastructure--even if there were only 2--someone will always have to decide that the time has come to transition from This Level to Next Level. Different expertise may be required to transition from L0 -> L1 vs L10->L11 vs L30->L31, but that expertise still matters.
And if I were the one who had to decide at any of those levels, I would want a well-designed plan for the next transition or two that was of the form: 'When [some measurement] reaches [some threshold] we will [change the infrastructure in the following ways...]'
> And if I were the one who had to decide at any of those levels, I would want a well-designed plan for the next transition or two
Right. This was really my (poorly-expressed) original point. Deciding how and when to scale is going to be a team and organization effort and decision. If it does come down to one guy deciding on something, then that's going to be the CTO. And if he/she is a good CTO, their decision is based on the feedback of a dozen people under them.
Scaling is a progression and no matter how fast your growth is, you do have time to come up with a solid plan and execute that plan. You don't need Facebook scale on day one even if you do become the next Facebook.
I agree that there's a problem with that kind of false dichotomy, but just as Rails is "good enough" for a vast majority of workloads (where you specialize away the parts that need it when you need it), I'd say the same about Heroku. The DX scales from 0 to "big enough" very well, and once you get beyond that, it's reasonably straightforward to move the parts that need it into AWS.
I've noticed that there is often an ideological argument for or against open source (sometimes to an extreme extent) buried in arguments for complex/expensive/distracting infrastructure, whether directly or indirectly. It gets coded as "portability", and that's why you end up managing your own metal on Kubernetes. But you've really got to ask yourself: if you had to pick one, would you index on convenience or portability? There are some businesses where that portability actually is that important. But there are a lot of businesses for which convenience comes first.
Determining what your priorities are is the first step. From there, I think you have options.
Infrastructure is important whether you're running a bare metal server or VM's on someone else's computer in the cloud.
Different infrastructure solves different problems.
Container orchestrators solve a very specific density problem. Back in the day, we recklessly threw a machine together and put either one app or an unbounded number of apps on it. If you put one, you've wasted all the remaining resources on the machine. If you put an unbounded number of apps on it, you've effectively created a hacker's wet dream, to say nothing of noisy-neighbor hell. Can you still have containers that demand hoggish resources? Sure; configuration is up to your imagination and your expertise - just as with virtual machines and BSD jails. That said, now you have options.
Container orchestrators also made it incredibly easy to pay less for things like load balancers. Instead of paying for that pricey layer 7 load balancer from your favorite cloud provider, you can bootstrap nginx into your stack and have lots of fun. Container orchestrators are also very chatty and invoke a level of network utilization that our predecessors probably would've gawked at. When misconfigured, containers are difficult to troubleshoot and require a large array of supplementary software in order to run them with any efficacy partially due to sheer number (nobody runs just one container).
Walking around telling people "you don't need x" is a reactionary statement, because I have only covered the surface-level information about container orchestrators' benefits and trade-offs; you can't possibly know the answer without extensive conversations about those trade-offs.
Saying stuff like this:
> ‘Engineers get sidetracked by things that make engineers excited, not that solve real problems for users’ — we’ve heard it all before
Engineers do overcomplicate stuff, but there are reasonable explanations for why, if you dare to ask with any kind of non-judgmental intent. It's fanciful to imagine that engineers, especially professional ones, sit around complicating things for their own amusement.
I run virtual machines and a Kubernetes cluster in my personal playground. They are both parts of functional projects; however, I end up deploying applications on either platform based on what the application's needs are. It took me about a day to get that Kubernetes cluster configured the way I wanted it and to implement some basic CI/CD that keeps it in shape. It probably doesn't get updates as much as it should, but neither do my VMs, so if you end up on one of my side projects: you've been warned! That day spent has saved me time ever since. When I want to deploy, I'm one kubectl command away from updating my production code. Every platform requires effort - effort to build, effort to maintain; the only question is where that effort is spent between inception and sunset.