Beefy machines don't give you redundancy, though; arguably, they make redundancy more expensive.
If you run on a single beefy VM, you need at least two of those running for redundancy, meaning you're 100% overprovisioned (m6a.32xlarge x2 = $11/hr).
If instead you run on 4 smaller instances split across multiple regions (or AZs or whatever), you only need to overprovision by 25% if you want to handle a single failure (m6a.8xlarge x5 = $7/hr).
Obviously if you're doing offline or non-latency sensitive processing, that doesn't matter.
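To make the overprovisioning math above concrete, here's a quick back-of-the-envelope sketch. The prices are approximate us-east-1 on-demand rates and the N+1 sizing is an assumption; the ratio is what matters, not the exact dollars.

    # Back-of-the-envelope for the cost figures above. Prices are approximate
    # us-east-1 on-demand rates (an assumption); the ratio is the point.

    def n_plus_one(hourly_price: float, needed: int) -> tuple[float, float]:
        """Hourly cost and spare-capacity fraction to survive one instance failure."""
        running = needed + 1                  # N+1: one extra instance as the spare
        overprovision = running / needed - 1  # fraction of paid-for capacity that's idle
        return running * hourly_price, overprovision

    # One beefy VM handles the whole job -> a second exists purely for redundancy.
    cost_big, over_big = n_plus_one(5.53, needed=1)      # m6a.32xlarge
    # Four quarter-size VMs handle the job -> a fifth covers any single failure.
    cost_small, over_small = n_plus_one(1.38, needed=4)  # m6a.8xlarge

    print(f"2x m6a.32xlarge: ${cost_big:.2f}/hr, {over_big:.0%} spare")
    print(f"5x m6a.8xlarge:  ${cost_small:.2f}/hr, {over_small:.0%} spare")
    # -> roughly $11/hr with 100% spare capacity vs roughly $7/hr with 25%.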
Conversely, since no one is actually doing true resilience anyway (just look at how many services failed when one AZ in us-east-1 went down), why not just start a new machine with the same code and data (either through active replication or shared hard disks), take the 1h downtime until that's up and running, and be golden?
You can even deduct it from your customer's monthly bill.
But if you have 20 of them, the frequency of problems goes up.
In the context of this conversation, where people are bringing up situations with hundreds of servers, 4 servers still counts as vertical.
For AWS, where you can lose an AZ (and lately that feels like a certainty), there are arguments to be made that 3 or 6 servers are the least you should ever run, even if they sell a machine that could handle the whole workload. You should go down two instance sizes and spread out. But don't allocate 40+ 2xlarge machines to do the work. That is also dumb.
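Here's a rough sketch of why a handful of right-sized machines is enough (assuming instances are spread evenly across 3 AZs and load rebalances onto the survivors): the spare capacity you need to ride out an AZ loss depends on the number of AZs, not the number of instances.

    # Spare capacity needed to survive the loss of a whole AZ, assuming an
    # even spread across AZs and that load rebalances onto the survivors.

    def spare_for_az_loss(instances: int, azs: int = 3) -> float:
        assert instances % azs == 0, "assumes an even spread across AZs"
        surviving = instances - instances // azs   # instances left after one AZ dies
        return instances / surviving - 1           # extra capacity you must carry

    for n in (3, 6, 12):
        print(f"{n} instances over 3 AZs: {spare_for_az_loss(n):.0%} spare")
    # -> 50% spare every time: the AZ-loss penalty depends on the number of AZs,
    #    not the number of instances, so 3 or 6 right-sized machines cover it
    #    just as well as 40+ small ones.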
What's also screwing up the math here is that AWS is doing something untoward with pricing. With dedicated hardware, for the last six decades, cost per instruction has had a bimodal distribution. The fastest hardware costs more per cycle, but if your problem doesn't fit on one (or three) machines, you have to pay that price to stay competitive. At the other end, the opportunity cost of continuing to produce the crappiest machines available, or the overhead of slicing up a shared box, starts to dominate. If you just need a little bit of work done, you're going to pay almost as much as your neighbor who does three times as much. Half as much at best.
I don't really think these ideas are mutually exclusive. The example given here ( https://news.ycombinator.com/item?id=29660117 ) discusses 4000 nodes running extremely inefficiently. Depending on how inefficient it is, you could replace it with a few beefier servers placed redundantly in different zones. (And if it's _really_ inefficient, you might be able to replace it with an order of magnitude fewer servers that aren't beefier at all.)
I think the real point being made is to use your resources efficiently. Of course, some of your most expensive resources are programmers, so inefficient hardware solutions may end up cheaper overall, but I've personally seen solutions that are architecturally so complex that replacing them with something simpler would be a big win across the board.
Depends on what kind of redundancy you want. You can achieve quite decent uptime for a single instance with local redundancy, process health watchdogs to restart things, crash isolation… Yes, some people will need a cluster with no downtime on failure events, but the clustering itself adds complexity and can even increase downtime if it causes problems. For example, a single machine can't have load balancer issues or replication errors.
Also, you're assuming an active-active redundancy scheme, which might not always be necessary, and you're not counting the cost of the elements required to support the cluster.