I recommend not having thousands of servers. Who has thousands of servers? Maybe...

ghshephard · on May 4, 2020

Last company I worked at, the application stack, after lots and lots of trimming, and memory tweaking, ran 14 servers. Hot Failover doubled that, Disaster recovery ran 3x. So, that was 42 servers per customer. We had 40+ customer - so, at last count, were at 1600+ servers, just for production applications - not including development, marketing, sales, finances, etc... And it was just a run-of-the-mill data ingestion company - nothing special. 1600 servers is only 80 racks - that's a small data center as things go - there are hundreds, if not thousands of companies with data centers that large.

Now, at my current gig - we have 43 nodes - but each node can run 100+ k8s pods (and, they are beefy 768 GB nodes). Our administration overhead, in terms of people/effort/auditing is a fraction of what it was at the last gig. Also - Disaster and failover is just a matter of bringing up the clusters and persistent stores in another region. No need to keep compute on standby. This week, we're even doing a migration of a cluster from GKE to Azure. And it started out on AWS (though not in production) - I'm not in the sysadmin world anymore, but from what I've seen, things like k8s have made it a heckuva lot easier to do big things with fewer people.

seized · on May 4, 2020

Anymore than five or ten servers are risky to manage manually. This isn't a problem you hit suddenly at 2000 servers when at 1800 its fine.

Anyone whose properly managed more than a few servers realizes and understands this.

wolfgang42 · on May 4, 2020

Heck, IMO even one server is risky to manage manually. If you're making all your config changes from a prompt it's easy to forget how or why you set something up a specific way and end up with a huge pain when you need to do it again. A config management tool gives you executable documentation for exactly how the server is set up should you ever need to recreate it from scratch.

bn7t · on May 4, 2020

>If you're making all your config changes from a prompt it's easy to forget how or why you set something up a specific way [...]

The correct way would be to just document the changes.

wolfgang42 · on May 4, 2020

I tried that, but I'm not sufficiently superhuman. Things slowly but inexorably drift out of sync as changes get made in one place but not the other, despite my best efforts. Declaring how things should be and then letting the computer make it so is both less work and more reliable. (This is why I call it "executable documentation".)

someone13 · on May 4, 2020

Tell that to companies doing extremely large-scale machine learning. Or any cloud infrastructure provider. Or CDNs. Or literally any video production company that owns a render farm. Or any company doing large-scale media transcoding/streaming.

Maybe "don't have thousands of servers" is just a bad take :)

seized · on May 4, 2020

Many companies operate at a large scale. If your company or your product doesn't.... That doesn't mean your way is the correct way for anyone.

You haven't caught on that no one has said automation replaces testing or other processes but that seems to be the drum you're banging on. Ansible or whatever tool doesn't replace testing, monitoring, pentests, whatever else you want to do. Automation lets you fix the issues faster and replace servers faster and more reliably, once you have done the testing that you have to do either way.

akerl_ · on May 4, 2020

What about the plethora of companies that provide B2B IT services? If your company’s business is “deploy $insert_products on-premise for each customer”, and you’re doing good business, you’re going to quickly approach thousands of servers.

nineteen999 · on May 4, 2020

At a large Australian bank I contracted to about 10 years ago, it was suspected that they had upwards of 13,500 servers, including many mainframes. They had so many servers that there were more than a few just sitting uncollected and still packaged in the storage vault, abandoned several years ago when the project they were intended for was cancelled/put on hold etc. Nobody every claimed them.

Another part of the problem is that in the past they invested in every platform under the sun that ran the financial applications they required, so there was a lot of legacy platforms that they couldn't yet rationalize away (AIX, Solaris etc).

The other problem is that they are incredibly risk averse, and thus will build new platforms extremely slowly and only after immense testing will they cut over from the old platform to the new platform. They are typically averse to consolidation or rationalisation since the ongoing stability of the application is the most important factor to their business.

A virtualization umbrella project that I was working on 10 years ago for this bank that covered several "core" platforms (AIX, VMware, Solaris, etc) is still under way as we speak, with many applications still to be migrated from old platform to the new one, despite the original "replacement" hardware now being obsoleted by the manufacturer already, according to those I know that still work there. I wonder if they'll ever finish.

AmericanChopper · on May 4, 2020

I was doing some work for a bank last month. It’s a reasonably small bank, and you likely haven’t heard of it. They have thousands of servers. They had over 100 logstash instances alone. All their desktops were virtualized, and that was a couple thousand servers at least.

ddevault · on May 4, 2020

They have thousands of servers, but should they? I'm sure they must, banks are, after all, world-famous for their technical accumen. It's good to know that my money is as safe as I always suspected it was.

d3ad1ysp0rk · on May 4, 2020

Oof. I can't see how this is anything other than a huge assumption to think that you know better than each and every business that has thousands of servers. The smartest people I know generally assume the least of situations they are not intimately familiar.

How large is the company you work for (employees or revenue)? How many servers do you have to support that company?

wolfgang42 · on May 4, 2020

> How large is the company you work for (employees or revenue)? How many servers do you have to support that company?

“SourceHut [of which ddevault is the founder] has 10 dedicated servers and about 30 virtual machines, all of which were provisioned by hand... [It] made 4 thousand dollars in Q1 with two employees and an intern.” (from https://sourcehut.org/blog/2020-04-20-prioritizing-simplitit...)

> Oof... The smartest people I know generally assume the least of situations they are not intimately familiar.

Yeah, I generally think highly of Drew’s opinions on technical matters, but this thread is just painful to read.

AmericanChopper · on May 4, 2020

The thing I don’t like about the sorts of claims made here is that when people advocate for simplicity like this, they often just have a highly opinionated perspective of what should be simple. So they’re not really advocating for simplicity, they’re just advocating for prioritising whatever thing they think it is that’s important. Maybe that thing should be important, but it usually depends on the context, so you end up with bad general purpose advice, which usually doesn’t accurately portray the trade-offs it’s advocating.

AmericanChopper · on May 4, 2020

I did suggest to them, for the sake of reliability, that they should move their entire infrastructure to one incredibly large server. Because a single point of failure is easier to maintain (less points that can fail). But sadly, they didn’t listen.

Seriously though, payment infrastructure actually is known for its reliability and fault tolerance. Banking systems don’t go offline terribly often.

ddevault · on May 4, 2020

Is your reasonably small bank known for its reliability and fault tolerance? The main reason banks don't go offline is because the core critical infrastructure is running on 50 year old mainframes that no one is allowed to touch because all of the greybeards with actual talent who made them are pushing up daisies.

Nowhere did I say you should be on one incredibly large server, nor that you should have a single point of failure. That wouldn't be simple, either, because it would fail to support the prime directive, or would require a great deal of gymnastics to. It's about balance. You don't need thousands of servers to make a reliable system.

bowyakka · on May 4, 2020

> Is your reasonably small bank known for its reliability and fault tolerance? The main reason banks don't go offline is because the core critical infrastructure is running on 50 year old mainframes that no one is allowed to touch because all of the greybeards with actual talent who made them are pushing up daisies.

> Nowhere did I say you should be on one incredibly large server, nor that you should have a single point of failure. That wouldn't be simple, either, because it would fail to support the prime directive, or would require a great deal of gymnastics to. It's about balance. You don't need thousands of servers to make a reliable system.

Heh those things go offline every night for 2 hour maintenance.

Also fun when those nice single points of failure crash (they do)

Source: work at a shop trying to _get out of_ greybeard mainframe to get more reliablity

AmericanChopper · on May 4, 2020

Most banks have a window of a couple of hours per night where access to the core system is restricted for settlement. It's kind of a necessity of the business model. However most banks (including every bank I've ever worked at, which is quite a few), don't shut down services during that time. ATM, credit and debit cards, everything else... All those systems run all the time. They just have an eventual consistency model. The compromise that the banks tend to make, is that certain types of fraud are more tolerated during that window, rather than sacrificing availability.

AmericanChopper · on May 4, 2020

> Is your reasonably small bank known for its reliability and fault tolerance?

I don’t know if they have a specific reputation for it. But their services were very reliable and resilient.

> The main reason banks don't go offline is because the core critical infrastructure is running on 50 year old mainframes that no one is allowed to touch because all of the greybeards with actual talent who made them are pushing up daisies.

This isn’t really true at all. The banks central ledger will likely be running on DB2, and that will almost never fail. But most of their infrastructure runs on more “modern” systems (DB2 is still modern if we’re being honest, it’s just more specialized). When you swipe your card somewhere, nearly all of the systems the transaction passes through are ordinary web services, running on ordinary architecture. That’s where the greatest risk of service disruption is. It’s just the final side effect is updating a record in a DB2 database somewhere.

Most banks also outsource a lot of the core system maintenance directly to IBM. A typical software or infrastructure engineer in a bank will have absolutely no contact with those systems.

> Nowhere did I say you should be on one incredibly large server

No, you just seem to think that reducing the number of components in your infrastructure is the key to reliability. If you want fault tolerance, then you generally want redundancy, and you want your services to fail gracefully (ie, not take down other non-dependant services when they do). Both of those things require deploying more servers.

If points of failure are a concern, then you also need to account for the fact that every single time you make a manual change you are creating a new point of human failure (with humans generally being the least reliable part of any well designed system). If you automate a change, you can spend more time scrutinizing it for mistakes, and be reasonably sure it’s deployed exactly the same everywhere. Combine that with blue/green, rolling or canary deployments, and automated testing, and you end up with something much more reliable than shelling your way across the entire infrastructure to deploy a single change. To suggest that deploying less infrastructure is a superior alternative really just comes across as ludditism.

vegardx · on May 4, 2020

You might think so, but if that was the reason your bank would be offline a lot. Everything that these old monoliths process come in batches, at predictable intervals.

What makes banking systems reliable is that there's so much built in latency in every operation that nobody really notices if something is down for six hours.

AmericanChopper · on May 4, 2020

> What makes banking systems reliable is that there's so much built in latency in every operation that nobody really notices if something is down for six hours.

This is true for settlement, but not authorization. When you spend your money it could be hours or days until the transaction is correctly reflected in your balance. But in order to spend it in the first place the transaction has to be authorized, and this happens in real time. Depending on where you live, who you bank with, what kind of card you’re using, how you’re using it... this authorization may rely heavily on your bank being available for online transaction processing, or not very much. But unless you’re doing a transaction with one of those imprint devices, your transaction must be authorized in real time by somebody.

vegardx · on May 4, 2020

I sit close by to the people that deal with this at work, so my understanding is very limited. Payment processors, service providers, reconciliation, yatta yatta. We used to joke around with them about carrying cash, just in case. We stopped joking around when we realized they all did.

AmericanChopper · on May 4, 2020

Payment processor could mean a number of different things. It could mean a tiny gateway operator. The sort of thing that can fail quite happily without impact transaction processing (most services you use will be backed by redundant gateways).

I carry some cash too. But the times I’ve needed it were for things like an outage of the internet connection at a restaurant, or an issue with a payment terminal provider (or for the very uncommon cash-only merchant).