Is needing servers with 96 vCPUs and 768 GB of RAM each... normal?
I'm a programmer, but I do mobile dev, and that amount of resources for a single machine sounds comically high to me. I thought modern websites usually had more 'normally' specced servers and just distributed the load across a ton of them?
Can anyone recommend a Coursera/EdX or other online resource that goes over the basics of designing and setting up these kinds of high-performance server systems?
I honestly don't think you'll find a course that covers this. The SO devs (and others) have kept a blog going for years with the type of information you're after. The search term is "mechanical sympathy."
Language/framework also matters. You're not going to pull this off with a backend based on a framework that gives no damns about performance. Rust, C# (bleeding edge), Java and C come to mind as good candidates.
SO doesn't have the content requirements of a modern Social Network: images, video, and streaming are huge factors. Especially with modern smartphone capabilities.
I wrote a program that ran on that type of computer once. It was real-time log analysis. There was a lot of data coming in, and there was a need for a globally-consistent datastore for the results. Solution? Put it all in RAM. If the single replica crashes, you can just read all the inputs again.
Some extra capacity existed for that, and some other system already stored the raw logs to disk durably. The terabyte of in-memory data just made queries tolerable enough to run every few seconds and display in alert/chart form.
I didn't keep the system in this state for very long, but for version 1.0 it was just the thing -- idea to production in a short period of time. Eventually it did move to a more distributed system, as log volume (and usefulness) increased, and we had more time to deal with the details. It was mostly nice to not have to reprocess data after releases -- I could do them in the middle of the day without anyone caring.
My biggest worry when writing this was that 40Gbps of network bandwidth wouldn't be enough, but it was fine in the early stages. 40Gbps is a lot of data.
I'm not sure I'd say it's a great sign that you need a single beefy machine to run something, but it's a tradeoff worth considering. I found the distributed system version easier to operate, but it did constrain what sort of features you could add. I think we got it right, but it's easy to code yourself into a corner when you have encoded assumptions deep into the system. Best to avoid that until you're sure your assumptions are right.
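If it helps to picture the v1 pattern, here's a toy sketch in Python (the real system wasn't Python, and the record fields and paths are made up; the point is "all results in RAM, replay the durable raw logs to recover"):

```python
import json
from collections import defaultdict


class InMemoryAggregator:
    """Single-replica results store held entirely in RAM.

    Durability comes from elsewhere: another system already writes the raw
    logs to disk, so if this process dies you rebuild state by replaying
    them from the beginning.
    """

    def __init__(self):
        # Hypothetical metric: error counts per service.
        self.error_counts = defaultdict(int)

    def apply(self, record: dict) -> None:
        # One cheap update per incoming log record; no cross-node
        # coordination, so the view of the results is trivially consistent.
        if record.get("level") == "ERROR":
            self.error_counts[record.get("service", "unknown")] += 1

    def replay(self, lines) -> None:
        # Crash recovery / cold start: re-read the durable raw logs.
        for line in lines:
            self.apply(json.loads(line))

    def query(self, service: str) -> int:
        # Queries stay fast enough to run every few seconds for
        # alerts/charts because the whole working set lives in memory.
        return self.error_counts[service]


# Example usage (path is hypothetical):
# agg = InMemoryAggregator()
# with open("/var/log/raw/events.jsonl") as f:
#     agg.replay(f)
# print(agg.query("checkout"))
```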
To be clear (if the image is accurate), they weren't asking for one beefy DB instance with 96 vCPUs and 768GB RAM. They were asking for 70-100 instances with those requirements.
I remember reading a few days ago that their code was atrocious. Perhaps this is an extreme example of the costs of never refactoring? I can believe that the cost of spaghetti code compounds quickly.
I can believe that the company who named resources sequentially, enforced no security validation for viewing posts/media, and didn't strip metadata from media uploads also didn't have engineers especially skilled in optimization.
I agree that the other points indicate bad engineering. But not stripping metadata can be intentional. E.g. some smaller image hosts skip it deliberately to preserve files bit-identically, and some forums with a heavy emphasis on minimal moderation take the position that opsec is the poster's responsibility.
I don't know what 96 vCPUs means in terms of real cores, and my server knowledge is a bit old at this point, but here goes.
There's a benefit to running on fewer servers. If you can make good use of many-core machines and gobs of RAM, it makes sense to go up to at least reasonably large machines. For Intel, dual-socket Xeon is widely available and not obscenely priced; for AMD, I haven't seen a lot of dual-socket Epyc, but 64 cores in a single socket is quite a lot. 768 GB seems big, but if you can put it into one machine instead of 12 machines with 64 GB each, that helps reduce maintenance and communication overhead.
I ran systems with dual Intel Xeon E5-2690 v4s, a total of 28 cores/56 threads, and we put up to 768 GB in some of them; that was several years ago, and you can get a lot bigger now.
Databases love RAM, and social sites make a lot of queries, so it makes some sense. I don't know what their usage numbers are or what their site looks like; I'm just guessing based on the general description. The traffic numbers didn't look too big, but the type of request makes a big difference there; serving media is relatively easy, while serving comment threads and highlighting your friends is trickier.
(Serving media with transcoding is a lot less easy though)
This seems extreme, especially if they're going bare-metal; even a single one of these DB servers would handle a lot of traffic on bare metal.
I'm also not sure about running Postgres on 70-100 servers. Unless they're doing some sharding at the application level (rough sketch below), I'd expect the overhead of replication and the resulting network traffic to be insane if they're merely replicating across all of them.
Their whole website should be able to run on a handful of these machines; their main cost and resource usage would be hosting and converting uploaded media, not the DB or app servers.
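For what it's worth, by application-level sharding I mean something like this toy Python sketch (not anything Parler actually does; the connection strings and database name are invented):

```python
import hashlib

# Hypothetical shard map: each Postgres primary owns a slice of the users,
# instead of every server replicating every row.
SHARD_DSNS = [
    "postgresql://app@db-shard-00/socialdb",
    "postgresql://app@db-shard-01/socialdb",
    "postgresql://app@db-shard-02/socialdb",
    # ...one entry per primary
]


def dsn_for_user(user_id: str) -> str:
    # Stable hash of the user id -> shard index, so a given user's rows
    # always land on the same primary and no full-mesh replication is needed.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARD_DSNS)
    return SHARD_DSNS[index]


# The application opens a connection to dsn_for_user(user_id) for that
# user's reads/writes; only the replicas *within* a shard need to replicate.
```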
Maybe they are trying to bring up multiple sites (i.e. the equivalent of AWS regions) for redundancy. If you have 10 "regions", then maybe the hardware requirements look a bit more reasonable (7-10 DB servers each, etc.)
It wouldn't matter for the anti-trust lawsuit, since they're alleging that Amazon conspired with Twitter to kick off Parler because Parler would steal Twitter's market share. (Well, trying to allege, because they don't even manage to allege that any further than "Amazon hosts Twitter" and implying there's no other reason Amazon would kick Parler off, despite spending half the brief complaining that they're being kicked off for being conservative.)
This is an absolutely bonkers amount of resources for the functionality and audience they have. We got our first 5 million users on 4 used PowerEdge nodes and our first 30 million with 2 bare-metal DB servers.
They're running Wordpress, which is famous for scaling badly to the extent that there are large companies like WPEngine.com devoted to doing nothing except working around that.
A lot of people like WordPress's back-end admin panel and its WYSIWYG editing, so a huge ecosystem of themes/mods/plugins has grown up around WP.
If you run a small website it's all fine; otherwise you have to deal with performance issues and WP limitations, because you've decided to build a social network on top of a blogging engine.
The reason is that they're running their social media site on WordPress (of all things, why?!?!?!), so it's probable that it will scale horrendously. The data breach that happened at Parler was allegedly due to an exploit in one of the WordPress plugins.