But for that bandwidth to be used efficiently, the processes on each NUMA node need to almost exclusively limit themselves to memory attached to their node -- at which point, well-written software could probably do just as well spread out over several machines that are connected by multiple 100Gb network links, and then you've saved two or three bajillion dollars.
If you're heavily using the bandwidth over the NUMA interconnect, then you're not going to be using the memory bandwidth very effectively (and likely not really using the processor cores effectively), and that's when NVDIMMs like Optane Persistent Memory could give you large amounts of bulk memory storage on a smaller system.
Or just use a number of Intel's new PCIe 4.0 Optane SSDs in a single machine in place of the extra memory and memory channels... the latency isn't the same as RAM, but it's much closer than traditional SSDs, and the bandwidth per SSD is like 7GB/s, which is impressive.
It all depends on the application at hand, but there are solutions that cost a lot less than the Platinum machines for virtually every problem, in my opinion.
I don't know... perhaps I'm too cynical about these cost-ineffective systems that just happen to be large, and I should be more impressed.
> NUMA scales better than RDMA / Ethernet / Infiniband. A lot, lot, LOT better.
> But for that bandwidth to be used efficiently, the processes on each NUMA node need to almost exclusively limit themselves to memory attached to their node -- at which point, well-written software could probably do just as well spread out over several machines that are connected by multiple 100Gb network links, and then you've saved two or three bajillion dollars.
Wait, so latency over NUMA is too much, but you're willing to incur a 100Gb network link? Intel's UPI is like 40GBps (Giga-BYTEs, not bits) at sub 500ns latency.
100Gb Ethernet is what? 10GBps in practice? 1/4th the bandwidth with 4x more latency (in the microseconds) or something?
That's a PCIe latency penalty (x2, for the sender + the receiver). That's a penalty for electrical -> optical, and back again.
Any latency-bound or bandwidth-bound problem is going to prefer a NUMA link over 100Gbps Ethernet hanging off PCIe.
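For a rough sense of the bandwidth gap being argued here, a back-of-envelope sketch follows. The 40 GB/s UPI figure and the 100 Gb/s link speed are taken from the comment above; the helper name is made up for illustration, and the conversion ignores protocol overhead, so the Ethernet number is an upper bound rather than what you'd see in practice.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbit_per_s / 8.0


UPI_GBYTE_PER_S = 40.0   # figure quoted in the comment above
ETH_GBIT_PER_S = 100.0   # raw line rate of one 100Gb link

# Raw line rate only; real-world throughput will be lower.
eth_gbyte_per_s = gbit_to_gbyte(ETH_GBIT_PER_S)
ratio = UPI_GBYTE_PER_S / eth_gbyte_per_s

print(f"100Gb Ethernet: at most {eth_gbyte_per_s} GB/s")
print(f"UPI vs one 100Gb link: {ratio:.1f}x")
```

At the raw line rate the gap is about 3.2x, and it approaches the 4x mentioned above once you assume roughly 10 GB/s of usable Ethernet throughput.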
> Wait, so latency over NUMA is too much, but you're willing to incur a 100Gb network link?
The point is not just the latency, but latency and bandwidth.
If the application is relying heavily on the NUMA interconnect to transfer tons of data, it's not going to be making efficient use of the processor cores or the RAM bandwidth. It's a total all-around bust. You're just wasting money at that point. ^1
If you aren't relying heavily on the NUMA interconnect, and each node is operating independently with only small bits of information exchanged across the interconnect, then you'd save a metric ton of money by switching to separate machines and using just high speed fiber network links -- such as 100Gb.
I'm not saying that network would be better than the NUMA interconnect. I'm saying that you're not going to be having a happy day if you're relying on the interconnect for large amounts of data transfer.
The only situation where the NUMA setup is better is if you have a need for frequent, low latency communication between NUMA nodes... where very little data is being transferred between nodes, so the entire problem is just latency.
At that point, you're still suffering a lot from the NUMA interconnect, and it would be better to use larger processors... such as Epyc Rome processors.
So you really have to be in a very obscure situation which can't fit onto a dual socket Rome server, but can fit within a machine less than 2x larger. (28 cores * 8 sockets is less than 2x larger than 64 cores * 2 sockets)
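The core-count comparison in that parenthetical works out as follows. This is just the arithmetic from the comment (28 cores per socket across 8 sockets versus 64 cores per socket across 2 sockets); the function name is made up for illustration, and raw core counts say nothing about per-core performance.

```python
def total_cores(cores_per_socket: int, sockets: int) -> int:
    """Total physical cores in a multi-socket machine."""
    return cores_per_socket * sockets


intel_8s = total_cores(28, 8)   # 8-socket Xeon Platinum
rome_2s = total_cores(64, 2)    # dual-socket Epyc Rome
ratio = intel_8s / rome_2s

print(f"8-socket Intel: {intel_8s} cores")
print(f"2-socket Rome:  {rome_2s} cores")
print(f"Ratio: {ratio}x")
```

That gives 224 cores against 128, a ratio of 1.75x, which is where the "less than 2x larger" claim comes from.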
You see how complicated this is and how absurdly niche those Platinum 8-socket machines are? They're almost never the right choice.
^1: The exception is if you have an even more niche use case that somehow is built with exactly this scenario in mind, and manages to balance everything perfectly. Such software is almost certainly ridiculously overcomplicated at this point.
> If the application is relying heavily on the NUMA interconnect to transfer tons of data, it's not going to be making efficient use of the processor cores or the memory bandwidth. It's a total all around bust. You're just wasting money at that point.
If the interconnect is your bottleneck, you spend money on the interconnect to make it faster. Basic engineering: you attack the bottleneck.
You don't start talking about slower systems and how they're cheaper. Because that just slows down the rest of your system.
------
If you just wanted cores, you'd buy a 28-core Xeon Gold. The point of the 28-core Xeon Platinum is the 8-way interconnect and scaling up to 8-way NUMA systems. The only person who would ever buy a Xeon Platinum is someone who wants 40GBps UPI connections at relatively low latencies (or maybe even the 300GBps connections that IBM offers, but that's a similar high-cost vertical-scaling system).
> So you really have to be in a very obscure situation which can't fit onto a dual socket Rome server, but can fit within a machine less than 2x larger. (28 cores * 8 sockets is less than 2x larger than 64 cores * 2 sockets)
That's not even that hard to figure out! A 48TB Memcached / Redis box, which is far more useful than an 8TB Memcached / Redis box.
A bit basic, but yeah. That's the point: spend more money on hardware and then don't spend much engineering time thinking about optimization. If 48TBs of RAM solves a problem that 8TB cannot, then just get the 48TB system.
> That's not even that hard to figure out! A 48TB Memcached / Redis box, which is far more useful than an 8TB Memcached / Redis box.
No... dozens of terabytes of Optane would be just as good and much much cheaper. The person designing the system would have to prove that a few nanoseconds of latency difference makes any material difference to the company's profits in order to justify the more expensive machine. Otherwise, it's a huge waste of company money, hurting the business.
Also keep in mind that Redis is only going to be using a single core of that machine. A total waste of huge amounts of CPU cores just to have a lot of RAM, when there are equally good solutions that cost much less.
You surely must see why I'm skeptical.
> That's the point: spend more money on hardware and then don't spend much engineering time thinking about optimization.
It's bad business practice to buy the most expensive thing possible instead of engineering a solution that is the right price. If the more expensive solution saves money in the long term, sure... but your example doesn't show this.
Not really... especially if this is an in-memory database attached to a network, as implied.
It’s only slower if someone can observe the difference, which I don’t think they would be able to in this design.
I’m a strong proponent of using fewer, larger machines and services, instead of incurring the overhead involved in spreading things out into a million microservices on a million machines. But there is a balance to be achieved, and beyond a certain point... synthetic improvements in performance don’t show up in the real world.
Queuing up a few database requests concurrently to make up for the overhead of literally hundreds of nanoseconds of latency is trivial, especially when Optane can service those requests concurrently, unlike a spinning hard drive. Applications running on other machines won’t be able to tell a difference.
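The "queue up a few requests" argument is essentially Little's law: throughput equals requests in flight divided by per-request latency. Here's a minimal sketch of that reasoning. The two latency values are illustrative placeholders, not measurements of any real DRAM or Optane part, and the function names are made up for this example.

```python
import math


def throughput_per_s(in_flight: int, latency_ns: float) -> float:
    """Little's law: throughput = concurrency / latency."""
    return in_flight * 1e9 / latency_ns


def in_flight_needed(target_per_s: float, latency_ns: float) -> int:
    """Smallest number of concurrent requests that reaches the target throughput."""
    return math.ceil(target_per_s * latency_ns / 1e9)


# Assumed, illustrative latencies (NOT benchmark results).
FAST_LATENCY_NS = 100.0   # stand-in for a RAM-backed lookup
SLOW_LATENCY_NS = 350.0   # stand-in for a slower persistent-memory lookup

baseline = throughput_per_s(1, FAST_LATENCY_NS)
needed = in_flight_needed(baseline, SLOW_LATENCY_NS)

print(f"One request at a time on the fast tier: {baseline:.0f} req/s")
print(f"Requests in flight needed on the slow tier to match: {needed}")
```

This only holds if the slower device can actually service that many requests concurrently, which is the assumption the comment makes about Optane.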
But, agree to disagree.
There are probably applications where these mega machines are useful, but I don’t personally find this to be a compelling example.
I readily admit that I could be wrong... but neither of us has numbers in front of us showing a compelling reason for a company to spend unbelievable amounts of money on a single machine. My experiences (limited compared to many, I’m sure) tell me this isn’t the winning scenario, though.
> I readily admit that I could be wrong... but neither of us has numbers in front of us showing a compelling reason for a company to spend unbelievable amounts of money on a single machine.
$500k on a machine isn't a lot of money compared to engineers. Even if you buy 4 of them for test / staging / 2x production, it's not a lot compared to the amount spent on programming.
One thing being expensive doesn’t make another thing not expensive.
It’s possible for them to both be independently expensive, and I’m saying that unless you can prove that the performance difference makes any difference to company profits, it is literally a huge waste of company money to buy those expensive machines.
A lot of applications will actually perform worse in NUMA environments, so you’re spending more money to get worse performance.
Reality isn’t as simple as “throw unlimited money at Intel to save engineering time.” Intel wishes it was.
Engineering effort will be expended either way. It is worth finding the right solution, rather than the most expensive solution. Especially since that most expensive solution is likely to come with major problems.
Preface: If you have actually encountered applications that must be run on 8-socket systems because those are literally the only fit for the application... I would love to hear about those experiences. With the advent of Epyc Rome, most use cases for these 8-socket systems vanished instantly. It would be fascinating to hear about use cases that still exist. Your experiences are obviously different than mine.
If you need more than 8TB of RAM, with the right application design you can probably do better with fast Optane Persistent Memory or Optane SSDs, and an effective caching strategy. You can have many dozens of terabytes of Optane storage connected to a single system, and Optane is consistently low latency (though not as low latency as RAM, obviously).
If you need more compute power, you can generally do better with multiple linked machines. You can only scale an 8-socket system up to 8 sockets. You can link way more machines than that together to get more CPU performance than any 8 socket system could dream of.
----------
I didn't expect you to read and respond so quickly, so I had edited my previous comment before you submitted your reply.
This was a key quote added to my previous comment:
>> So you really have to be in a very obscure situation which can't fit onto a dual socket Rome server, but can fit within a machine less than 2x larger. (28 cores * 8 sockets is less than 2x larger than 64 cores * 2 sockets)
In response to your current comment,
> If the interconnect is your bottleneck, you spend money on the interconnect to make it faster. Basic engineering: you attack the bottleneck.
Exactly. Using a dual-socket Epyc Rome system would be more than half as powerful as the biggest 8-socket Intel systems, but it would reduce contention over the interconnect dramatically, which means that many applications that are simply wasting money on an 8-socket system would suddenly work better.
This also goes back to my comment about using accelerators instead of an 8-socket system.
The odds of encountering a situation that just happens to work well with Intel's ridiculously complicated 8-socket NUMA interconnect, but can't work well over a network, can't work well on a system half the size, and requires enormous amounts of RAM to keep the cores fed seem vanishingly small... and in that case, we still have to consider whether an accelerator (GPU, FPGA, or ASIC) could be used to make a solution that is a better fit for the application anyway, and if so, you'll save a large amount of money that way as well.
So, for buying an 8-socket system to make sense, the application must...
- require more performance than one dual socket Epyc Rome system can deliver, but less than twice that
- not depend on transferring huge amounts of data around the interconnect
- depend on very low latency communication between NUMA nodes
- need enormous memory bandwidth on each NUMA node
- need huge amounts of RAM on each memory channel (so you can't just use HBM2 on a GPU to get massive amounts of bandwidth, for example)
- etc.
It's a niche within a niche within a niche.
As I said in an earlier comment, I probably should be more impressed instead of being so cynical about the usefulness of such a machine. They are engineering marvels... but in almost every case, you can save money with a different approach and get equal or better results.
That's why 8-socket server sales made up such a small percentage of the market, even before Epyc Rome came in and completely obliterated almost all of the very little value proposition that remained.
Don't forget that Rome also has a wildly nonuniform interconnect between the core complexes, and the system integrator gets much less control over it than over Intel's UPI links. When you really need to end up with a very large single system image at the application layer, the bigger architecture works out to be much cheaper than 256GB DIMMs or HPC networking.
8-socket CLX nets you 1.75x the cores, and 3x as many memory channels vs. a 2-socket Rome system. It also scales to a single system image with 32 sockets if you use a fabric to connect smaller nodes:
I mean, I buy AMD Threadripper for my home use and experimentation. I'm pretty aware of the benefits of AMD's architecture.
But I also know that in-memory databases are a thing. Nothing I've touched personally needs an in-memory database, but it's a real solution to a real problem. A niche for sure, but a niche that's pretty common actually.
Whenever I see these absurd 8-socket designs with 48TBs of RAM, I instinctively think "Oh yeah, for those in-memory database peeps". I never needed it personally, but it's not that hard to imagine why 48TBs of RAM beats out any other architecture (including Optane or Flash).
> it's not that hard to imagine why 48TBs of RAM beats out any other architecture (including Optane or Flash).
Agree to disagree.
In-memory databases are common, yes, but it is pretty hard to imagine practical situations where an in-memory database can't handle a few nanoseconds of additional latency.
All else being equal, of course more RAM is nice to have. All else is not equal, though, so this is all highly theoretical.
> NUMA scales better than RDMA / Ethernet / Infiniband. A lot, lot, LOT better.
Disagree.