> "Intel’s current Xeon offering simply isn’t competitive in any way or form at ...

pinewurst · on Dec 18, 2020

What they mean (IMHO) is that Intel pricing on Xeon Platinum SKUs is so disproportional to their comparative performance that they would be foolish to purchase.

dragontamer · on Dec 18, 2020

Xeon Platinum supports 8-socket machines. 6x memory channels x 8-sockets == 48x DIMMs for your massive 6TB RAM servers.

Platinum is crazy priced for a crazy reason: its a building block to truly massive computers that only IBM matches. Supercomputers don't even typically use Platinum: Platinum is for the most vertically-oriented systems.

---------

Xeon Gold is your more typical 2-socket or 4-socket machines: same performance, but fewer sockets supported.

coder543 · on Dec 18, 2020

> Xeon Platinum supports 8-socket machines. 6x memory channels x 8-sockets == 48x DIMMs for your massive 6TB RAM servers.

Both AMD Epyc Rome and Ampere Altra can support 8TB of RAM in dual socket machines, so that spec isn't particularly impressive anymore.

> Platinum is crazy priced for a crazy reason: its a building block to truly massive computers that only IBM matches.

Meh. A dual-socket Rome system with 128 cores (256 threads), 8TB of RAM, and 128 PCIe 4.0 lanes is an enormous machine. A full 8 socket Platinum system is going to cost an order of magnitude (or two) more, and the performance improvement is going to be less than 2x.

An 8-socket system is going to consume an enormous amount of power, cost an unbelievable amount of money, and the complex NUMA domain combined with Intel's interconnect isn't going to "just work" for almost any software, so you're likely going to put a lot of effort into making that system barely work.

Intel Platinum is priced crazy because Intel is building monolithic processors (which means low yields) and Intel likes to have a substantial profit margin. Together, those mean high prices.

At a certain point, you're better off investing in making your application work on an accelerator of some kind, in the form of a GPU or an FPGA or task-specific ASIC, rather than giving Intel all your money in the hopes of eeking out a marginal performance improvement. Alternatively, finding software that can scale to more than one machine.

Maybe IBM has truly large systems that are worth considering. Intel doesn't right now, unless you're using some software that demands Intel for contractual reasons no one can argue with.

dragontamer · on Dec 18, 2020

But not with 6-channels per socket (48-channels total across an 8-socket computer). That's a heck of a lot of bandwidth.

I was doing the specs from memory and conservatively: 128GB sticks across 48-channels.

https://www.supermicro.com/en/products/system/7U/7088/SYS-70...

This 8-way Supermicro system supports 192x DIMMs. So... 128GB sticks x 192 == 24TBs of RAM or so, maybe 48TBs.

> At a certain point, you're better off investing in making your application work on an accelerator of some kind, in the form of a GPU or an FPGA or task-specific ASIC, rather than giving Intel all your money in the hopes of eeking out a marginal performance improvement. Alternatively, finding software that can scale to more than one machine.

I mean, these 8-way systems are what? $500k ? Less than $1MM for sure. How many engineers do you need before a machine of that price is reasonable?

Really, not that many. If you can solve a problem with a team of 10 engineers + 48 TBs of RAM, that's probably cheaper than trying to solve the same problem with 30 engineers with a cheaper computer.

-----------

> An 8-socket system is going to consume an enormous amount of power, cost an unbelievable amount of money, and the complex NUMA domain combined with Intel's interconnect isn't going to "just work" for almost any software, so you're likely going to put a lot of effort into making that system barely work.

NUMA scales better than RDMA / Ethernet / Infiniband. A lot, lot, LOT better.

All communications are just memory-reads, and that's a real memory read. Not an "emulated memory read transmitted over Ethernet / RDMA pretending to be a memory read". You incur the NUMA penalty but that's almost certainly better than incurring a PCIe penalty.

coder543 · on Dec 18, 2020

Sure, I agree it's a lot of bandwidth.

But for that bandwidth to be used efficiently, the processes on each NUMA node need to almost exclusively limit themselves to memory attached to their node -- at which point, well-written software could probably do just as good spread out over several machines that are connected by multiple 100Gb network links, and then you saved two or three bajillion dollars.

If you're heavily using the bandwidth over the NUMA interconnect, then you're not going to be using the memory bandwidth very effectively (and likely not really using the processor cores effectively), and that's when NVDIMMs like Optane Persistent Memory could give you large amounts of bulk memory storage on a smaller system.

Or just use a number of Intel's new PCIe 4.0 Optane SSDs in a single machine in place of the extra memory and memory channels... the latency isn't the same as RAM, but it's much closer than traditional SSDs, and the bandwidth per SSD is like 7GB/s, which is impressive.

It all depends on the application at hand, but there are solutions that cost a lot less than the Platinum machines for virtually every problem, in my opinion.

I don't know... perhaps I'm too cynical of these cost-ineffective systems that just happen to be large, and I should be more impressed.

> NUMA scales better than RDMA / Ethernet / Infiniband. A lot, lot, LOT better.

Disagree.

dragontamer · on Dec 18, 2020

> But for that bandwidth to be used efficiently, the processes on each NUMA node need to almost exclusively limit themselves to memory attached to their node -- at which point, well-written software could probably do just as good spread out over several machines that are connected by multiple 100Gb network links, and then you saved two or three bajillion dollars.

Wait, so latency over NUMA is too much, but you're willing to incur a 100Gb network link? Intel's UPI is like 40GBps (Giga-BYTEs, not bits) at sub 500ns latency.

100Gb Ethernet is what? 10GBps in practice? 1/4th the bandwidth with 4x more latency (in the microseconds) or something?

That's a PCIe latency penalty (x2, for the sender + the receiver). That's a penalty for electrical -> optical, and back again.

Any latency, or bandwidth, bound problem is going to prefer a NUMA-link rather than 100Gbps over PCIe.

coder543 · on Dec 18, 2020

> Wait, so latency over NUMA is too much, but you're willing to incur a 100Gb network link?

The point is not just the latency, but latency and bandwidth.

If the application is relying heavily on the NUMA interconnect to transfer tons of data, it's not going to be making efficient use of the processor cores or the RAM bandwidth. It's a total all around bust. You're just wasting money at that point. ^1

If you aren't relying heavily on the NUMA interconnect, and each node is operating independently with only small bits of information exchanged across the interconnect, then you'd save a metric ton of money by switching to separate machines and using just high speed fiber network links -- such as 100Gb.

I'm not saying that network would be better than the NUMA interconnect. I'm saying that you're not going to be having a happy day if you're relying on the interconnect for large amounts of data transfer.

The only situation where the NUMA set up is better is if you have a need for frequent, low latency communication between NUMA nodes... where very little data is being transferred between nodes, so the entire problem is just latency.

At that point, you're still suffering a lot by the NUMA interconnect, and it would be better to use larger processors... such as Epyc Rome processors.

So you really have to be in a very obscure situation which can't fit onto a dual socket Rome server, but can fit within a machine less than 2x larger. (28 cores * 8 sockets is less than 2x larger than 64 cores * 2 sockets)

You see how complicated this is and how absurdly niche those Platinum 8-socket machines are? They're almost never the right choice.

^1: The exception is if you have an even more niche use case that somehow is built for exactly this scenario in mind, and manages to balance everything perfectly. Such a software is almost certainly ridiculously overcomplicated at this point.

dragontamer · on Dec 18, 2020

> If the application is relying heavily on the NUMA interconnect to transfer tons of data, it's not going to be making efficient use of the processor cores or the memory bandwidth. It's a total all around bust. You're just wasting money at that point.

If the interconnect is your bottleneck, you spend money on the interconnect to make it faster. Basic engineering: you attack the bottleneck.

You don't start talking about slower systems and how they're cheaper. Because that just slows down the rest of your system.

------

If you just wanted cores, you buy a 28core Xeon Gold. The point of 28-core Xeon Platinum is for the 8-way interconnect and scaling up to 8-way NUMA systems. The only person who would ever buy a Xeon Platinum is someone who wants 40GBps UPI connections at relatively low latencies. (or maybe even the 300GBps connections that IBM offers, but that's a similar high-cost vertical-scaling system)

> So you really have to be in a very obscure situation which can't fit onto a dual socket Rome server, but can fit within a machine less than 2x larger. (28 cores * 8 sockets is less than 2x larger than 64 cores * 2 sockets)

That's not even that hard to figure out! A 48TB Memcached / Redis, which is far more useful than a 8TB Memcached / Redis box.

A bit basic, but yeah. That's the point: spend more money on hardware and then don't spend much engineering time thinking about optimization. If 48TBs of RAM solves a problem that 8TB cannot, then just get the 48TB system.

coder543 · on Dec 18, 2020

You have since edited your comment, so...

> That's not even that hard to figure out! A 48TB Memcached / Redis, which is far more useful than a 8TB Memcached / Redis box.

No... dozens of terabytes of Optane would be just as good and much much cheaper. The person designing the system would have to prove that a few nanoseconds of latency difference makes any material difference to the company's profits in order to justify the more expensive machine. Otherwise, it's a huge waste of company money, hurting the business.

Also keep in mind that Redis is only going to be using a single core of that machine. A total waste of huge amounts of CPU cores just to have a lot of RAM, when there are equally good solutions that cost much less.

You surely must see why I'm skeptical.

> That's the point: spend more money on hardware and then don't spend much engineering time thinking about optimization.

It's bad business practice to buy the most expensive thing possible instead of engineering a solution that is the right price. If the more expensive solution saves money in the long term, sure... but your example doesn't show this.

dragontamer · on Dec 18, 2020

> No... dozens of terabytes of Optane would be just as good and much much cheaper.

PCIe Optane is orders of magnitudes slower than DDR4. In both bandwidth and latency.

The only Optane that keeps up to DDR4 (kinda-sorta) is the Optane DIMMs which are exclusive to Xeon Golds / Platinum systems.

coder543 · on Dec 18, 2020

Not really... especially if this is an in-memory database attached to a network, as implied.

It’s only slower if someone can observe the difference, which I don’t think they would be able to in this design.

I’m a strong proponent of using fewer, larger machines and services, instead of incurring the overhead involved in spreading things out into a million microservices on a million machines. But there is a balance to be achieved, and beyond a certain point... synthetic improvements in performance don’t show up in the real world.

Queuing up a few database requests concurrently to make up for the overhead of literally hundreds of nanoseconds of latency is trivial, especially when Optane can service those requests concurrently, unlike a spinning hard drive. Applications running on other machines won’t be able to tell a difference.

But, agree to disagree.

There are probably applications where these mega machines are useful, but I don’t personally find this to be a compelling example.

I readily admit that I could be wrong... but neither of us have numbers in front of us showing a compelling reason for a company to spend unbelievable amounts of money on a single machine. My experiences (limited compared to many, I’m sure) tell me this isn’t the winning scenario, though.

dragontamer · on Dec 18, 2020

> I readily admit that I could be wrong... but neither of us have numbers in front of us showing a compelling reason for a company to spend unbelievable amounts of money on a single machine.

$500k on a machine isn't a lot of money compared to engineers. Even if you buy 4 of them for test / staging / 2xProduction, its not a lot compared to the amount spent on programming.

coder543 · on Dec 18, 2020

One thing being expensive doesn’t make another thing not expensive.

It’s possible for them to both be independently expensive, and I’m saying that unless you can prove that the performance difference makes any difference to company profits, it is literally a huge waste of company money to buy those expensive machines.

A lot of applications will actually perform worse in NUMA environments, so you’re spending more money to get worse performance.

Reality isn’t as simple as “throw unlimited money at Intel to save engineering time.” Intel wishes it was.

Engineering effort will be expended either way. It is worth finding the right solution, rather than the most expensive solution. Especially since that most expensive solution is likely to come with major problems.

coder543 · on Dec 18, 2020

Preface: If you have actually encountered applications that must be run on 8-socket systems because those are literally the only fit for the application... I would love to hear about those experiences. With the advent of Epyc Rome, most use cases for these 8-socket systems vanished instantly. It would be fascinating to hear about use cases that still exist. Your experiences are obviously different than mine.

If you need more than 8TB of RAM, with the right application design you can probably do better with fast Optane Persistent Memory or Optane SSDs, and an effective caching strategy. You can have many dozens of terabytes of Optane storage connected to a single system, and Optane is consistently low latency (though not as low latency as RAM, obviously).

If you need more compute power, you can generally do better with multiple linked machines. You can only scale an 8-socket system up to 8 sockets. You can link way more machines than that together to get more CPU performance than any 8 socket system could dream of.

----------

I didn't expect you to read and respond so quickly, so I had edited my previous comment before you submitted your reply.

This was a key quote added to my previous comment:

>> So you really have to be in a very obscure situation which can't fit onto a dual socket Rome server, but can fit within a machine less than 2x larger. (28 cores * 8 sockets is less than 2x larger than 64 cores * 2 sockets)

In response to your current comment,

> If the interconnect is your bottleneck, you spend money on the interconnect to make it faster. Basic engineering: you attack the bottleneck.

Exactly. Using a dual-socket Epyc Rome system would be more than half as powerful as the biggest 8-socket Intel systems, but it would reduce contention over the interconnect dramatically, which means that many applications that are simply wasting money on an 8-socket system would suddenly work better.

This also goes back to my comment about using accelerators instead of an 8-socket system.

The odds of encountering a situation that just happens to work well with Intel's ridiculously complicated 8-socket NUMA interconnect, but can't work well over a network, and can't work well on a system half the size and requires enormous amounts of RAM to keep the cores fed, the odds seem vanishingly small... and in that case, we still have to consider whether an accelerator (GPU, FPGA, or ASIC) could be used to make a solution that is a better fit for the application anyways, and if so, you'll save large amount of money that way as well.

So, to make buying an 8-socket system make sense, the application must require performance that is...

- less than twice a dual socket Epyc Rome system, but greater than one dual socket Epyc Rome system can handle

- not dependent on transferring huge amounts of data around the interconnect

- dependent on very low latency communication between NUMA nodes

- needs enormous memory bandwidth for each NUMA node

- needs huge amounts of RAM on each memory channel (so you can't just use HBM2 on a GPU to get massive amounts of bandwidth, for example)

- etc.

It's a niche within a niche within a niche.

As I said in an earlier comment, I probably should be more impressed instead of being so cynical about the usefulness of such a machine. They are engineering marvels... but in almost every case, you can save money with a different approach and get equal or better results.

That's why 8-socket server sales made up such a small percentage of the market, even before Epyc Rome came in and completely obliterated almost all of the very little value proposition that remained.

blasdel · on Dec 19, 2020

Don't forget that Rome also has a wildly nonuniform interconnect between the core complexes, and the system integrator gets much less control over it than Intel's UPI links. When you really need to end up with a very large single system image at the application layer, the bigger architecture works out to be much cheaper than 256gb DIMMs or HPC networking.

8-socket CLX nets you 1.75x the cores, and 3x as many memory channels vs. a 2-socket Rome system. It also scales to a single system image with 32 sockets if you use a fabric to connect smaller nodes:

* 4-socket nodes: https://www.hpe.com/us/en/servers/superdome.html

* 2-socket nodes: https://atos.net/en/solutions/enterprise-servers/bullsequana...

That's 48tb of DRAM with all 128gb DIMMs, or 12tb+128tb when using 512gb Optane PDIMMs.

dragontamer · on Dec 18, 2020

I mean, I buy AMD Threadripper for my home use and experimentation. I'm pretty aware of the benefits of AMD's architecture.

But I also know that in-memory databases are a thing. Nothing I've touched personally needs an in-memory database, but its a real solution to a real problem. A niche for sure, but a niche that's pretty common actually.

Whenever I see these absurd 8-socket designs with 48TBs of RAM, I instinctively think "Oh yeah, for those in-memory database peeps". I never needed it personally, but its not that hard to imagine why 48TBs of RAM beats out any other architecture (including Optane or Flash).

coder543 · on Dec 18, 2020

> its not that hard to imagine why 48TBs of RAM beats out any other architecture (including Optane or Flash).

Agree to disagree.

In-memory databases are common yes, but it is pretty hard to imagine practical situations where an in-memory database can't handle a few nanoseconds of additional latency.

All else being equal, of course more RAM is nice to have. All else is not equal, though, so this is all highly theoretical.

But it is fun to think about!

jeffbee · on Dec 18, 2020

Right? "This platform does/does not run my software" is a huge checkbox when buying computers. The viewpoint of people who actually specify, buy, and operate servers at scale is severely underrepresented in these discussions.

coder543 · on Dec 18, 2020

Keep in mind that Intel also has to compete with AMD... and AMD is also x86_64.

AMD and Altra are trading blows at the high end of performance, as seen in this article, while Intel is... not, except in extremely specialized applications.

Intel servers don't even offer PCIe 4.0 at this time, which is just a bad joke when it comes to building out servers with lots of high performance storage and networking. For now, Intel offers (relatively) poor CPU performance and poor peripheral performance.

So, if your software can't run on Altra, the other choice for high performing servers is AMD, not Intel, unless you're just locked into Intel for historical reasons.

Intel does have some nice budget offerings for cheap servers, though, such as Xeon D.

monocasa · on Dec 18, 2020

Ironically Xeon-D seems to have been quietly cancelled.

klelatti · on Dec 18, 2020

So you're never going to consider moving to another architecture because your software (currently) doesn't run on it?

So Amazon made a huge mistake with Graviton then? Last time I checked Amazon 'specify, buy, and operate servers at scale'.

waynesonfire · on Dec 18, 2020

I don't know if it's a mistake, the competition is welcome. I spun up some arm servers because they were lower cost but then ended up having to worry about software availability issues. It's non-trivial and I don't think it was worth the savings at this time and went back to x86.

Regardless, the claim that there is no value in Intel let alone x86 is a stretch.