The Ampere Altra Review: 2x 80 Cores Arm Server Performance Monster (anandtech.com)
274 points by sien on Dec 18, 2020 | 261 comments



This line is the kicker for me:

"Intel’s current Xeon offering simply isn’t competitive in any way or form at this moment in time."

The rest of the paragraph twists the knife more, and declares there to be an open market on server hardware for AMD and Ampere to fight it out.

With application stacks being what they are and optimizations for Xeon so deeply integrated, it's unlikely that folks running private datacenters actually feel the same way, but if I were spinning up a new company with the sort of funding to be able to decide between self-hosted or cloud I would have some math to do.


AMD should see the writing on the wall and start designing ARM CPUs.

I think that would be very exciting because they have the engineering expertise and talent necessary to compete against Apple's M1. I don't know of any other company that can do that (besides Intel, but you know...).

Additionally, there is a huge hole open right now for ARM on everything else that is not an Apple product. Apple normalized ARM on consumers' desktops/laptops: they brought down the "subjective" barrier against ARM, which is sometimes much harder to break than the technological one.

But ARM CPUs for everything else that is not an Apple product are up for grabs. The M1 is extremely competitive but it will never leave Apple's realm (unless Apple wants to start competing in the CPU business, but I see that going very strongly against Apple's consumer focus).

It is incredible, looking at it in retrospect, but Apple opened up (or normalized, if you will) a whole new market for ARM products.


> AMD should see the writing on the wall and start designing ARM CPUs.

This is a really tough strategy decision. Focus all energy on delivering a knock-out blow to Intel (in x86/64) or try to do other stuff too?

For the next several years at least, there is still a whole lot of money to be made selling x86/64 chips. (I don't know if Microsoft has plans to move mainstream Windows users off x86/64, but even if they do, it won't happen quickly.)

Right now, AMD has an opportunity to take x86/64 away from Intel. Probably the best opportunity in decades. There's something to be said for focusing all resources on delivering a knock-out blow right now, before the opportunity is gone.

On the other hand, ARM might have a better future over the long term. And it isn't necessarily true that AMD would be spreading itself too thin by trying to do both.


This is a very thoughtful comment, thank you.

I would love to be a fly on the wall in the corresponding AMD strategy meetings :)

I think to not focus on x86 right now would be a tremendous blunder on AMD’s part. And at the same time, ignoring ARM for too long will give others (like Nvidia) time to set up shop and capture the market.

I wonder how much of the Zen architectural “backend” is amenable to being reused with a different ISA (I don’t even know if such a thing would be possible). Perhaps they could have a very low-effort initiative to try something along those lines.


For the moment I don't think it's a tough decision at all - it has to be x86 (and I say that as someone who has complained about the x86 duopoly). Why?

- It's in both AMD's and Intel's interests for x86 to remain dominant for as long as possible. If they could, it would be in their interest to drive new Arm entrants out of the market (I don't believe that will happen though).

- Introducing an Arm product (or even hinting at one) would add credibility to new entrants and undermine existing products.

- They could almost certainly switch to an Arm design reasonably quickly if they needed to.

There is also some uncertainty about the impact that Nvidia will have on Arm when it takes it over. x86 is AMD's jointly owned ISA - they have no control over the Arm ISA.


I'm also curious if there is something inherent to ARM that would stop them from making similar optimizations to the AMD chips. They are already using TSMC so they are not held back by density.


> I'm also curious if there is something inherent to ARM that would stop them from making similar optimizations to the AMD chips.

Yeah ARM isn’t x86. Apple made big gains by leveraging the difference in instruction length and complexity between x86 and ARM.

Something AMD can’t do, and they’ve said as much.


I still don't have a clear understanding. Apple also made big gains by integrating performance sensitive stuff on the package. AMD could do the same. I wonder if they are constrained by some hardware-level interoperability requirements across the motherboard, like with chipsets or DMA controllers or whatnot made by different companies. Apple clearly has a nice advantage here.


My understanding from reading around (I can't find sources right now) is that a big chunk of Apple's performance comes from how wide their instruction decode hardware is, and how deep their out-of-order window (reorder buffer) is.

ARM instructions are fixed length so it's very easy to stack up decoders next to each other. You don't need complex logic to figure out where instruction boundaries are, which you do with x86.

This, added to a very deep reorder buffer, means that the M1 has the ability to reorder a significant (something silly like 2x more than any AMD or Intel CPU) number of instructions to increase CPU utilisation and thus performance.

AMD can't do this on an x86 CPU because the complexity involved is too high, and I think someone has found comments from them basically saying so.
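
To make that boundary problem concrete, here's a toy sketch in C (not how real decoders are built; fake_insn_length is a made-up stand-in for a real x86 length decoder):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Made-up stand-in for a real x86 length decoder (which needs prefix/
       opcode/ModRM parsing); pretend the first byte encodes the length,
       purely to show the data dependence. */
    static size_t fake_insn_length(const uint8_t *p) { return 1 + (p[0] & 0x0F); }

    /* Fixed 4-byte encoding (AArch64-style): decoder k can fetch word k with
       no coordination, so eight of these could happen in the same cycle. */
    static uint32_t fetch_fixed(const uint8_t *code, size_t k) {
        uint32_t insn;
        memcpy(&insn, code + 4 * k, sizeof insn);
        return insn;
    }

    /* Variable-length encoding (x86-style): you only know where instruction
       i+1 starts after sizing instruction i, so this walk is inherently
       serial (or needs speculative length-guessing hardware). */
    static size_t nth_boundary(const uint8_t *code, size_t n) {
        size_t off = 0;
        for (size_t i = 0; i < n; i++)
            off += fake_insn_length(code + off);
        return off;
    }

    int main(void) {
        uint8_t code[64] = {0};
        (void)fetch_fixed(code, 3);
        (void)nth_boundary(code, 8);
        return 0;
    }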

Yeah, Apple did make gains from a better integrated package; some of that would be tricky to pull off elsewhere without additional software support. Some of the other integrations would be tricky to pull off because they would reduce the amount of customisation OEMs could do. Remember one of the big things Apple integrated was RAM, making it unified memory for all the co-processors. That could be tricky to do, unless we move into a world where RAM is always integrated into CPUs.


> Apple also made big gains by integrating performance sensitive stuff on the package. AMD could do the same.

The only thing the M1 has integrated that a typical AMD or Intel laptop CPU doesn't have is a neural net processor. Everything else about the M1's 'integrated architecture' is typical.

Notably the M1's RAM is not integrated as is often incorrectly stated. It's just regular soldered LPDDR4X. And M1's RAM latencies are worse than Intel & AMD's socketed DDR4 latencies, so being soldered and physically close is definitely not providing a performance advantage.


I think the performance sensitive stuff Apple integrated is to some extent dependent on their control of their own software stack, which AMD doesn't have control of.


AMD is doing fine. "The writing on the wall". Zen 3 is going to be very competitive, and their on-chip interconnect is great.

Sure, add a line for ARM.

What's interesting to me: I feel like a lot of this started when Intel wouldn't do the chip for the iPhone. All of a sudden you had this leading-edge manufacturer pushing Arm hard, which drove volumes through the roof, and all the copycats (Android etc.) copied the chip selection, screen style, etc. as well.

I wonder how different the path might have been if intel had come up with something for apple.


> I wonder how different the path might have been if intel had come up with something for apple.

They kind of tried and failed. Intel is completely stuck in their "x86 is the only way" culture. x86 just can't compete in the low power space. For phones, the x86 power requirements made it effectively DOA.

Sure, Atom could have competed with arm in some markets (small laptops for example, they had some success in the early netbook days), but Intel managed to screw even that up by deprioritizing Atom development - they thought they could make more money in the server space. Short term, not a bad bet, but it wouldn't surprise me if it set Intel's eventual demise in motion. We're starting to see the signs now.


They turned Apple down in 2005 to make the iPhone processor.

They've been shooting themselves in the foot toe by toe for a while.

I hope they come back for a while though, I mean at least to like $55/share for undisclosed personal reasons.


It's a bit ironic that they owned StrongARM which of course became XScale and was sold the year before the iPhone appeared! I think some of the StrongARM team founded PASemi and others may have ended up working for Arm in Texas.


The issue is this. Apple asked Intel to make the chip for the iPhone. If you partner with Apple on this, then, together, and with billions of dollars, your entire ecosystem improves even if you were not originally the best.

When you do this type of deal, you get to work with top-class engineers and a real major customer driving your power / efficiency story. Intel at the time had the process node advantage as well.

What's happened is Apple has SIGNIFICANTLY funded massive capital investment in the ARM ecosystem and in non-intel fab because now those non-intel providers are critical. Imagine all that money flowing to intel just on fab side.

Rumors around TSMC and Apple for iPhones are that Apple funds some of the capacity at TSMC and pays for leading nodes. I'm not sure of the scale of Apple's orders, but they are going to be meaningful both in quantity and in what Apple is willing to pay.


I wonder if x86 really is inherently inefficient, or if it could get there given enough investment (i.e. not just put a 2nd tier team on it for a few months but really push it with major resources). It feels like they just did not want to bother, not that it was impossible.


I don't think that Intel producing the iPhone chip was ever likely to happen - sure Jobs had to ask Intel as a courtesy for a partner but it was never realistic.

- Arm has dominated low power / mobile segment from well before the iPhone due to its superior power consumption;

- The non-Arm IP (eg the Imagination GPU) was also important and it was unlikely that Intel would have integrated that on their SoCs.

I had an Atom tablet from the time when Intel supposedly had competitive mobile SoCs and it really wasn't a great device - ran hot and very short battery life.


The first Atom SoCs used an Imagination GPU.


Interesting - and I stand corrected - thanks.


> I had an Atom tablet [...] and it really wasn't a great device - ran hot and very short battery life.

In all honesty the implementation and OS also matter. Most Atom tablets weren't particularly great (or even good) devices overall. And running Windows on a contemporary ARM SoC wouldn't have resulted in a better experience.


It was Android and much, much worse than contemporary Arm tablets.


Most Atom tablets I've ever seen were very bad implementations from noname brands, I haven't seen a halo product. And the OS choice was worse, either a full Windows, or an x86 Android, neither of which is the best choice for such a device. And that's on top of any weakness the CPU itself may have.


I've got one of those Bay Trail tablets. Its crippled memory bandwidth destroys performance and torpedoes battery life (race to sleep).

Apple tends to emphasize memory bandwidth in its own designs.


Apple was also one of the first investors in Arm and one of the reasons the original company Acorn existed.


Acorn existed fully independently of Apple. Check out the history of the BBC Micro and its forebears.

Apple came on the scene later, when Acorn had developed the ARM processor. You're right that they were an original investor in the spinout company.


Indeed, the writing on the wall is that AMD is eating Intel's lunch and is positioned to dominate the server x86 market.


The RISC-V ISA would be the better bet.

a. They start investing now and they can have some leverage in the spec.

b. The spec is minimal and open enough to extension that it sets itself up for more domain-specific processors, which we will be seeing a lot more of thanks to the slowing of transistors-per-dollar growth.

c. There is a pretty big theme of moving toward more open and democratized standards in tech.

d. They would be the big dogs in RISC-V, whereas ARM already has big players.

e. In both ARM and RISC-V they would be saving the need for extra transistors. My understanding is that there are extra transistors that basically translate x86 machine code to some internal RISC ISA for performance reasons on x86 processors. Due to the slowing of transistors-per-dollar growth and the maturation of external fabs, Intel and AMD will no longer be able to 'hide' this deficit.

f. No royalties


I'd love to see RISC-V or another open source ISA become the dominant design. But I'm not sure RISC-V is there yet. Companies have been pouring resources into optimizing ARM CPUs for years now, and they're now starting to get to the point where their performance is competitive with x86. RISC-V is just getting started. I don't know if there's a RISC-V CPU that comes anywhere close to the raw performance of higher end ARM CPUs, never mind x86.

Having said that, I hope AMD doesn't rule out developing a RISC-V CPU. I'd expect them to have the resources to start working on ARM now, while planning for RISC-V further out.


> I'd expect them to have the resources to start working on ARM now, while planning for RISC-V further out.

I doubt that strategy would work. The problem in the PC space is that there's no Apple-like leader who can boldly drive a change in CPU architecture. Whoever is first out of the gate bears all the risk of the change not catching on. If you lose the bet, you've wasted massive R&D spend to make a lemon. Take a look at the reviews of Microsoft's ARM Surface laptops to see what I mean here.

I think there's an opportunity over the next 5 years or so for PC makers to ride the M1's coattails and shift from x86 to ARM. Especially given intel's failure to move off 14nm. But I doubt lightning will strike twice. If windows moves to ARM now, we'll be stuck there for at least another few decades. And the technical argument for RISC-V will be much weaker if ARM rules the roost.

Weirdly the strongest counterargument I can think of is due to electron. Chrome will maintain first class support for any and every popular CPU architecture. So the more that desktop software is written on the web and in electron apps, the easier any architecture transition will be down the road. That and the server space. Linux already supports RISC-V extremely well given how few linux-capable RV chips there are in the wild.


> Companies have been pouring resources into optimizing ARM CPUs for years now

Yes, companies (Apple especially, but also Samsung and Qualcomm) have been making optimized ARM CPUs, but those designs remain within those companies. Very similar optimizations could be made on a RISC-V implementation, and AMD has proven that they have the ability to make design optimizations.


Not sure if it's still the case but RISC-V was beating ARM in code size as well


I find that a bit surprising, since my understanding is that RISC-V is more minimal (has fewer higher-level instructions, though that will change with extensions) than ARM.

This paper refutes your point and is in line with my understanding: https://carrv.github.io/2020/papers/CARRV2020_paper_12_Perot...

One counterpoint is that the simpler the ISA, the more work the compiler needs to do. So x86 may be easier to write an optimizing compiler for, given it does some of the work for you (despite the bloat).

I wonder how much benchmarks on the current ARM and RISC-V chips will change over time due to compiler improvements. ARM probably already has a lot of investment in some workloads...

EDIT: This presentation shows code size comparisons with a RISC-V extension that adds multiply/divide, and it gets really close to ARM: https://riscv.org/wp-content/uploads/2019/12/12.10-12.50a-Co...


The paper/presentation you link to is comparing 32-bit ARM with Thumb and RISC-V. This is mostly relevant to embedded processors, and my understanding is ARM and RISC-V are somewhat equal there.

But for 64-bit, ARM has dropped support for Thumb, while RISC-V still supports compressed instructions. In that case, RISC-V is more dense than ARM. I can't find the numbers for that right now, but I read it just yesterday so I'm pretty sure I remember correctly.

Also keep in mind that Thumb is a mode that the ARM CPU has to switch to, while RISC-V compressed instructions can be mixed with normal instructions.
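
For what it's worth, here's a quick (unscientific) way to eyeball code density yourself; assuming the Debian/Ubuntu cross toolchains (gcc-aarch64-linux-gnu, gcc-riscv64-linux-gnu) are installed, compile any C file at -Os for both targets and compare the .text column from size(1):

    /* density.c - any nontrivial function will do; the point is comparing
     * the .text size the two compilers emit for identical source.
     *
     *   aarch64-linux-gnu-gcc -Os -c density.c && size density.o
     *   riscv64-linux-gnu-gcc -Os -march=rv64gc -c density.c && size density.o
     *   riscv64-linux-gnu-gcc -Os -march=rv64g  -c density.c && size density.o
     *
     * (rv64gc includes the compressed extension, rv64g leaves it out, so the
     * last two lines show how much RVC buys you.)
     */
    #include <stddef.h>

    long checksum(const long *v, size_t n) {
        long acc = 0;
        for (size_t i = 0; i < n; i++)
            acc += v[i] * 31 + (v[i] >> 3);
        return acc;
    }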


The paper you linked to is about embedded controller micro-benchmarks being worse by 5-10%; RISC-V's compressed ISA made some tough decisions to improve density on general-purpose code, not just micro-focused on MCUs. On something larger like SPEC, RISC-V is denser than ARM and x86.


My understanding was that RV was less dense than the ARM-M Thumb variants, but more dense than AArch64 like these servers would be.


Yes, you are correct. Thanks for pointing that out to me.

With the compression and multiply/divide extensions, it's very similar to the ARM-M Thumb variants in code density.

https://riscv.org/announcements/2016/04/risc-v-offers-simple...

https://riscv.org/wp-content/uploads/2019/12/12.10-12.50a-Co...

Really fun time to be working on this part of the stack.


AMD might already be working on an ARM design[1].

[1] https://www.techspot.com/news/87851-amd-rumored-working-arm-....


From what I recall, the K12 core was essentially the same micro-architecture as Zen, but with ARM instruction decoders. There were some vague statements that the ISA allowed some optimizations compared to the x86 version, but we never got to see what that really means, since the K12 project was shelved to focus on x86. I wouldn't be surprised if AMD has been sitting on K12 and keeping it just updated enough so that they could revive it whenever the market becomes more interested in ARM servers.


Would be fun if you could switch instruction decoders...


Maybe, but that's a Zen 2 proc being used, and it isn't far off in single-core perf or really multi-core perf on a lot of tests (with worse single-core perf and fewer cores). It's the latest ARM tech against a year-old design.

Zen 3 EPYC, releasing in Q1, is likely to be a different story.


> AMD should see the writing on the wall and start designing ARM CPUs.

ARM's brand new 80-core system struggles to match AMD's year old 64-core system, while also being entirely unable to run some of the workloads either at all or in the 2S configuration. And with mostly comparable power draw and a much worse turbo story that ARM is trying to pretend is somehow a good thing.

Competition is great, but it's a bit premature to claim "the writing is on the wall", is it not? AMD has made ARM CPUs in the past; they're not rigidly stuck to x86 like Intel is, but it's also not like x86 is in immediate danger, either.


It's not Arm's system, it's Ampere's. Arm is claiming nothing about this CPU.


This system uses ARM's Neoverse-N1 CPU cores. Ampere put it in a socket, but it's still "ARM's" CPU core. This isn't a custom design like Apple's CPUs.

And per the article it sounds like it's ARM's fault this can't exceed 3.3 ghz. That seems to be a Neoverse-N1 limitation:

"Fundamentally, the Altra’s handling of frequency and power in such a manner is simply a by-product of the Neoverse-N1 cores not being able to clock in higher than 3.3GHz"


System != Core for a whole range of reasons as you know - not least price. I actually think that Ampere should get the credit for producing a system this competitive (even with an off the shelf core).


I hope somebody will come up with a standard and open way to boot an ARM board and to enumerate devices on it, much like BIOS does for x86 boards.

Possibly the makers of Pine64 and Raspberry Pi could cooperate and offer something in this area, because they collectively control most of the market of hobby / general-purpose ARM system boards. If they set a common standard (or even two standards), other producers in this space would have an incentive to follow. This will simplify OS porting efforts enormously.


It already exists, it's called Server Base System Architecture (https://en.wikipedia.org/wiki/Server_Base_System_Architectur...). AFAIK, this Ampere Altra follows that standard, which is how unmodified Linux distributions can run on it (https://www.phoronix.com/scan.php?page=article&item=ampere-a... tested both CentOS and Ubuntu).


The BIOS doesn't really let you enumerate devices on x86. That mainly happens through PCI config space.


The "IBM PC architecture" tells you where to find the PCI config space (either the well known i/o address 0x0CF8 and friends, or memory mapped address found in ACPI tables)

BIOS provided tables allow you to enumerate devices that aren't on the PCI bus, and BIOS provides methods to interact with storage devices. UEFI does all the same things, but differently.
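
For the curious, the legacy "mechanism #1" port-I/O enumeration is small enough to sketch; a rough illustration for x86 Linux (needs root for iopl), not production code:

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/io.h>   /* iopl/outl/inl; x86 Linux only */

    /* Write a bus/device/function/register selector to 0xCF8, then read the
       32-bit config dword back from 0xCFC. */
    static uint32_t pci_cfg_read(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t reg) {
        uint32_t addr = 0x80000000u | ((uint32_t)bus << 16) | ((uint32_t)dev << 11)
                      | ((uint32_t)fn << 8) | (reg & 0xFC);
        outl(addr, 0xCF8);
        return inl(0xCFC);
    }

    int main(void) {
        if (iopl(3) != 0) { perror("iopl"); return 1; }
        for (unsigned dev = 0; dev < 32; dev++) {
            uint32_t id = pci_cfg_read(0, dev, 0, 0);       /* vendor/device ID */
            if ((id & 0xFFFF) != 0xFFFF)                    /* 0xFFFF: nothing there */
                printf("00:%02x.0 vendor=%04x device=%04x\n",
                       dev, (unsigned)(id & 0xFFFF), (unsigned)(id >> 16));
        }
        return 0;
    }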

A lot of ARM64 boards intended towards general purpose use include UEFI these days.


The standard is called SBSA aka ServerReady.


I think Chromebooks did ARM years before Apple, no?


Yes but they were terrible. Just being ARM isn't some kind of magic dust. It has to actually be fast.


If you look at the historical benchmarks of the non-apple ARM chips, though they're clearly behind apple, they have largely kept pace. Huawei was perhaps even closing the gap before the trade war knocked them out of contention.

I don't think apple was really all that special in this regard. They were ahead of the curve, which is impressive, but the market was created by arm, not just apple. I doubt we'd be seeing this kind of resurgence had others - especially google - not also jumped on the ARM bandwagon. A rising tide lifts all boats; both android and ios likely benefited from each other's shared usage of arm because it helped the whole ecosystem.

For some sense of scale: qualcomm just released some benchmark numbers for its latest chip, and while they're not all that impressive (https://www.anandtech.com/show/16325/qualcomm-discloses-snap...) it's still good to put "not that impressive" in perspective: they're comparable to an i7-6700k in single-threaded workloads, and to an i7-4770k in multi-threaded workloads. The oldest iPhone that runs the same geekbench version I can find is the iPhone 5s, which is a little more than 4x slower in the single-threaded benchmark, and 8x in the multi-threaded. Yet that chip was no slouch!

Obviously, apple's impressive pole position completely hogs the spotlight, but it's worth remembering that the competition isn't actually all that bad. That level of performance would make for an extremely fast chromebook - were it not for the fact that most are cheap cheap cheap and will never use a high-end chip.

Development across the board has been so fast that if you remember slow ARM chromebooks, I think you should attribute that slowness to two things other than apple vs everybody else: firstly, they were built to a bargain-basement price, and thus you weren't getting anything near this performance - not because it wasn't apple, but simply because they didn't use the right chips in the first place - and secondly, it's been a while, and when things improve that quickly, a slightly outdated model can easily appear terrible without that being a condemnation of the whole approach.


> If you look at the historical benchmarks of the non-apple ARM chips, though they're clearly behind apple, they have largely kept pace.

Not really. Look at the difference between Apple's "little" cores and the stock ARM "little" core.

>The performance showcased here roughly matches a 2.2GHz Cortex-A76 which is essentially 4x faster than the performance of any other mobile SoC today which relies on Cortex-A55 cores, all while using roughly the same amount of system power and having 3x the power efficiency.

https://www.anandtech.com/show/16192/the-iphone-12-review/2

Apple's Icestorm "little" cores are in the same performance ballpark as the big cores in the Qualcomm variant of a Galaxy S10.

That's a night and day difference.


Apple is wealthy and powerful, and gets the cream of the crop - i.e. the first batch of the new 5nm output. This isn't new; apple always releases their models a few months before equivalent snapdragons come out. This is a meaningful advantage, to be sure, but if you're trying to figure out how much of apple's perf win is due to their great chips and not merely a generational gap, it's misleading. Comparing the s10 today, which is clearly on an inferior process, to the iphone 12 pro today isn't useful from this perspective. It's better to compare the upcoming 5nm 888 to apple's A14.

And here, the numbers just don't back up the claim that apple is running rings around qualcomm in terms of performance. Not just that; the difference between apple+qualcomm chips hasn't changed all that dramatically over the years; if anything it's gotten smaller.

Yes, if you look at power at equivalent performance, that looks better for apple, but then that's always the case, because nobody runs client-side CPUs at the optimal perf/watt point - they run them faster. It's also not really how people use their phones, is it? People don't turn off the high-power cores in iphones; they use em as designed, i.e. at approximately equivalent power, just faster.

The point here isn't that Apple's win doesn't exist; simply that it's not the mythological night-and-day difference it's sometimes painted to be. Furthermore, because apple consistently releases earlier on a new process node, at the moment of their release the perf numbers look better than they are on average over the product's lifetime; in effect, you're seeing the combination of both Apple's good CPUs and a one-generation gap. Then there's the fact that Apple's CPUs are so fast due not solely to the skill of their silicon division, but also their willingness to spend for more transistors. We don't have numbers for the 888 yet, but it's very likely going to be a lower-cost design if history is any guide (and the anandtech article goes into some other non-obvious ways in which it saves costs too, like power planes, that I bet apple is willing to go the extra mile on).

All that matters, because you're trying to predict how other CPUs will perform, and thus need to tease out the impact of stuff like process node, apple's willingness to spend extra on larger caches, and x86 vs ARM - and just ascribing all that perf uplift to apple's magic pixie dust won't help you make predictions regarding the rest of the CPU market in the future.


> We don't have numbers for the 888 yet

The 888 uses the same Cortex-A55 cores mentioned above (the same core all the Android SOCs have been using for a couple of years now) with Apple having a 4x performance win at the same power levels.


Do you have a citation for that? Benchmarks don't typically target the efficiency cores.

In any case, I think it's tricky to read too much into microbenchmarks like this; the overall system perf in any case does not have anything near that perf difference, but instead a slowly shrinking gap (as described before, and supported by links to evidence).

I'm still curious about those efficiency cores mind you, even if I don't think it has much impact on the overall point. Also, one of the things qualcomm intentionally skimps on is powerplanes, I'm assuming for cost reasons - and that means low-perf power will be suboptimal. Then again, I'm not sure how much that matters.

(Don't forget the context of the Altra ARM server CPU and what it means for intel/AMD; efficiency cores probably don't tell a very meaningful story of qualcomm vs. apple, but certainly don't apply in that space, yet).


The source is Anandtech running SPEC on them, already given above.

SPEC is not a microbenchmark. It's been the industry standard for comparing completely dissimilar CPU architectures for decades.


I thought you were talking specifically about the low-power cores (and I couldn't find a site specifically comparing those, but even if you could find those, it would be a weird niche (micro) benchmark), because it's not clear if those cores are used like that.

In any case, the link you shared (https://www.anandtech.com/show/16192/the-iphone-12-review/2) backs up what I said; the overall SoC perf of the 865 is quite close to that of the A13 (which is its same-gen competitor). The specfp score of the 865 is 74% of the A13's; the specint 64% (the 865+ is 3 percentage points better here). And the joules consumed by the A13 are significantly higher than the 865's for both benchmarks (the 865+ uses more energy than the 865, but less than the A13).

In fact, the geomean of perf and power between A13 thunder and lightning cores is pretty close to on the money for the 865, for both power and perf!

All in all, the 865 looks pretty competitive in the benchmark you linked; pretty much smack between the high-power and low-power cores of the A13: more efficient than its high-power cores, but slower, and faster than the A13's low-power cores, but less efficient. Unfortunately, the page doesn't include the 865's low-power cores in the benchmarks, but from the point of view of disproving the apple-domination story it doesn't really matter: clearly competitors can build cores that are close to the cutting edge of apple's power/perf tradeoff; they just choose slower, more efficient designs.

Where are you getting a 4x improvement in performance?


> Where are you getting a 4x improvement in performance?

From the link you just quoted.

Here it is again:

>Look at the difference between Apple's "little" cores and the stock ARM "little" core.

>The performance showcased here roughly matches a 2.2GHz Cortex-A76 which is essentially 4x faster than the performance of any other mobile SoC today which relies on Cortex-A55 cores, all while using roughly the same amount of system power and having 3x the power efficiency.

https://www.anandtech.com/show/16192/the-iphone-12-review/2


Ah, right, I see what you're saying. However, note that these other SoCs don't rely on the A55 alone (this is what I meant by microbenchmark: it's comparing a small bit of the chips in isolation, in a way you'd never normally use them), and that the A14 should be compared to the 888 if you want to tease out the design bit (apple's) from the process bit (TSMC's). Early 888 benchmarks look good, but it's more reliable to compare the 865 to the A13, since those numbers are clearly solid by now (no unreleased-SoC shenanigans). And the 865's high-power core is clearly competitive with the A13; it's slower (around 3/4 speed), but more efficient, not less, than the A13 - again, the link you posted shows that.

So even if the low power cores in the A14 are impressive; clearly other competitors are capable of getting results that are competitive with apple on the same node - at least for the high power cores.

I strongly suspect the low-power cores just aren't a priority for qualcomm; but in any case, when it comes to servers (what this all started with), clearly the high-power cores are the ones that matter, and it's equally clearly not the case that apple's lead looks unassailable. It's been shrinking year by year, not growing, and the difference isn't as large as it's made out to be.

Looking at this data I get the impression that Apple has more resources to throw at the problem, and that some parts of their solution are simply better - but the difference is quite small, and they use more transistors to get there, and need more power planes (thus cost) to get the greater efficiency. The big difference is simply how much head start they get at TSMC, which is a question of money, not some secret design sauce.

And again, this all started on a thread about server CPUs and memories of how slow chromebooks were (i.e. not talking of efficiency cores in isolation, but SoC perf overall). And for that I think the data speaks for itself: Apple has a significant lead - but a small one, and one that's been shrinking over the years. There is no evidence they're in a class of their own at the same process node - other chips are 3/4 as fast and more efficient, and that seems like a reasonable tradeoff. Nor, incidentally, is this just a two-horse race; much as samsung's chips are derided, they're not that much worse, and likely much of that is due to the inferior process node. Huawei too seemed competitive pre-trade-war, sometimes beating qualcomm. Given that at the same process node there are three different competitors that come so close, it just doesn't look like Apple's design is really all that unique. It's quite conceivable a different ARM competitor might catch up in a few years, given the current trends. Then again - only if they get a slot at TSMC, since there are rumors that Apple has already bought much of the 3nm production, and as intel showed in the past, decent design with a process node advantage is a winning combination. But Apple needs that process node advantage to keep a large lead.


They really haven't: Apple has slowly gained on, then outpaced them over the last several years. Apple's chips these days "lap" the others, where the best that competitors have to offer performs worse than Apple's design from the year before.


The 888 (the first qualcomm chip on the same 5nm process) has approximately equivalent performance (91%) to the Apple A14 in multicore benchmarks, and 71% of the single-thread geekbench 5 score. source: https://www.anandtech.com/show/16325/qualcomm-discloses-snap...

If you look a few years ago, a similar comparison would be the snapdragon 820 vs. Apple's A9 (both on samsung's 14nm process, though apple was dual-sourcing with TSMC 16nm). Then, the iphone 6s plus scored 535, and the oneplus 3 (using the 820) scored 306 - i.e. 57% of apple's performance. On multicore it achieved 77% of the performance (also a larger difference than today).

Qualcomm has been catching up lately on single-thread and multithread perf, just not very quickly (i mean at this rate it'll be well over a decade before they're equal in single thread perf...).

Sources: https://browser.geekbench.com/android-benchmarks and https://browser.geekbench.com/ios-benchmarks, search for oneplus 3 and iphone 6s plus.


I never had an ARM one, but imagine they would be a bit slow. However, they are still a fairly wide use case on a laptop, so to speak. And that gets blurred further by tablets I guess.

The M1 is a nice product. But let's not pretend it's the first ARM device in the world. It has plenty of merits to rest on; it doesn't need that qualifier as justification, anyway.


How are Exynos chromebooks terrible?


They're slow. The only thing a Chromebook has is a web browser. The browser performance of e.g. the Samsung Chromebook 2 was less than half that of its contemporary competitors, and a modern Chromebook with an Intel Core CPU is 5-10x faster.

Exynos didn't even manage to be the fastest ARM CPU in a Chromebook. The Tegra K1 was a bit quicker.


> The only thing a Chromebook has is a web browser

My (Celeron) Chromebook 2 supported a whole Linux VM (via Crostini) and using Vim for python and Go development on it was noticeably slower than any contemporary laptop - I can't imagine the Exynos version was far off on perf (since it was only a diff SKU). I would not call it "terrible".


The Celeron Chromebooks 2 were only slightly faster, maybe 15%. That's why I specifically called out the Intel Core CPU, not the Celeron. The Celeron N2840 is a "Bay Trail" Atom core, which nobody would willingly purchase. The state of the art CPU from that generation of Chromebooks was the 4th gen Core i3, a "Haswell" part that was more than twice as quick as that Celeron.


Apple is not known for doing things first, but for doing things better than the competition in some way.


Has anyone seriously used chromebooks for development work?

That's the difference.


I did for about a year, though with Linux installed instead. It's not really a bad experience when most if not all of your pipeline is punted remotely anyhow.



I recall AMD mentioning that their Zen architecture is pretty generic and that it would not be hard to make it execute ARM instructions instead. AMD has worked on this before, although the project was canned, but there's been some rumors recently of it having been resurrected.


Note that Nuvia is working on that "hole" with some very Apple-like claims.


It's not totally clear to me if the M1 gains are from the 5nm process, which even AMD hasn't transitioned to yet while still smoking everything Intel offers on 7nm, from ARM itself, or from some mix of both.

Of course, it's probably a mix, but I don't know if ARM alone has much in that mix. It does seem to me that, even if ARM held no performance benefit over x86, Apple would still decide to use it based on their use of it in their hugely successful mobile products.


I think M1's gains are mainly from the specific configuration they chose to build the ARMs in (after doing the necessary quantitative analyses), not from actual development of CPU tech.


When the first chips using Apple's Firestorm cores were announced on 5nm, the theory was that Apple had used the node shrink to decrease power usage, not increase performance.

>The one explanation and theory I have is that Apple might have finally pulled back on their excessive peak power draw at the maximum performance states of the CPUs and GPUs, and thus peak performance wouldn’t have seen such a large jump this generation, but favour more sustainable thermal figures.

>Apple’s A12 and A13 chips were large performance upgrades both on the side of the CPU and GPU, however one criticism I had made of the company’s designs is that they both increased the power draw beyond what was usually sustainable in a mobile thermal envelope. This meant that while the designs had amazing peak performance figures, the chips were unable to sustain them for prolonged periods beyond 2-3 minutes. Keeping that in mind, the devices throttled to performance levels that were still ahead of the competition, leaving Apple in a leadership position in terms of efficiency.

https://www.anandtech.com/show/16088/apple-announces-5nm-a14...


This article might help the picture come into focus a little better: https://debugger.medium.com/why-is-apples-m1-chip-so-fast-32...


I can imagine that every cloud hosting provider expands their fleet of machines, and also retires and replaces some of it.

If the expansion and replacement will mostly involve AMD and ARM-based solutions, I won't be surprised.

Of course Xeons are going to stay for a relatively long time, because some customers are deeply invested in them, e.g. relying on AVX512 heavily, etc.

I also don't think that Intel is going to go under. AMD was in the position of a second-rate player for decades, and had expensive failures, too (remember Bulldozer?). So I expect Intel to regroup and show us something amazing in, say, 5 years. But one thing they are going to lose: their sweet big margins.


I think the big issue for Intel will be avoiding a vicious circle of uncompetitive performance -> lower margins -> less investment in process/fabs -> uncompetitive performance.

In part AMD broke out of this by using TSMC, but Intel relying on (not just using) TSMC would be a huge step.


I bet there will be enough interest in keeping a top-tier fab facility and expertise on the US soil. If I were DoD, I would be very interested, for instance.


Probably connected:

https://www.extremetech.com/electronics/317329-tsmc-will-ope...

Not sure if it has to be absolutely top-tier for defense purposes though.


You can go with larger nodes and larger power budgets for avionics and other electronic stuff in battlefield machines.

OTOH you'd rather be top-notch in your communication equipment, low-power sensors for reconnaissance, etc. In some areas, having a US-controlled 7 nm fab may matter even for DoD.


Interesting - thanks.


I personally expect those plans to be dropped or cut back relatively soon as Trump leaves office. I don't think the 2024 date for this plan was at all a coincidence. Taiwan is extremely shifty about their semiconductor tech leaving the island. Of course, 5nm would be very outdated by then either way.


Intel are also miles ahead when it comes to software and documentation.

Intel's optimization guide alone is about 1000 pages long, AMD's isn't as good and they've been at it for decades now.

If you have the money to bring up your own software, Intel isn't as competitive any more, but for the middle ground where you don't have that budget, it's not the same calculus.


This is an important point. Intel has a quite mature performance analysis stack that AMD doesn't match. Their performance monitoring unit in Skylake-SP and later is very capable, and the one in Ice Lake, Tiger Lake and later doubles the number of counters and adds other abilities. It's a software ecosystem that allows you to get a very high level of performance out of your software, if your shop is building its own software.
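
For anyone who hasn't poked at this layer: on Linux both vendors' counters are read through the same perf_event interface (it's what perf builds on); a rough sketch counting retired instructions around a toy loop:

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    /* No glibc wrapper exists for this syscall, so make our own. */
    static int open_counter(struct perf_event_attr *attr) {
        return (int)syscall(SYS_perf_event_open, attr, 0, -1, -1, 0);
    }

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.size = sizeof attr;
        attr.type = PERF_TYPE_HARDWARE;            /* generic event, mapped onto the PMU */
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;  /* retired instructions */
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;

        int fd = open_counter(&attr);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        volatile uint64_t acc = 0;                 /* toy workload to measure */
        for (uint64_t i = 0; i < 10000000; i++) acc += i;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count = 0;
        if (read(fd, &count, sizeof count) != sizeof count) return 1;
        printf("retired instructions: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }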


>.....With application stacks being what they are and optimizations for Xeon so deeply integrated......

That is of course in the context of server CPU design. Your mileage will vary depending on your usage.

With the announcement of Marvell's exit [1] from ARM server CPUs, it is now all but confirmed that Microsoft and Google are also working on their own ARM server CPUs. Which means all the hyperscalers will now have a common interest in having software running on ARM. And this will trickle down to every part of the industry.

It will take some time, depending on your scale, the cost benefits of switching over may be different. But there is no denying that Intel doesn't have a short term ( 2-3 years ) answer. And I dont see their medium term ( 4 - 5 years ) are any better. Their new Optane SSD is really exciting though [2].

And I am willing to bet all of these new server CPU entrants have absolutely zero experience in CPU forecasting, and are too risk-averse to put down commitments on capacity, which ultimately leads to TSMC getting a bad name yet again for "Not Enough Capacity".

[1] https://www.servethehome.com/impact-of-marvell-thunderx3-gen...

[2] https://www.servethehome.com/new-intel-optane-p5800x-100-dwp...


I'd be curious to see AES encryption performance compared between the same CPUs from Ampere, Intel, and AMD. In my experience, Intel have a significant performance advantage with their AES-NI implementation over AMD. I wonder what Ampere's cores can do.

For some workloads, special things like AES-NI or other optimizations which only Intel (or another vendor) have can make a huge difference in real world performance which generalized benchmarks don't cover.


> I'd be curious to see AES encryption performance compared between the same CPUs from Ampere, Intel, and AMD. In my experience, Intel have a significant performance advantage with their AES-NI implementation over AMD. I wonder what Ampere's cores can do.

AMD in Zen1 had 2x AES-pipelines, allowing 2x 128-bit AES operations per clocktick.

Intel caught up later, and eventually also added 2x AES-pipelines of their own. But then AMD added 256-bit AES instructions to Zen3 (which are 128-bit x2 vectorized).

AVX512 does have 4x SIMD AES (4x 128-bit wide), for 4x parallel AES computations per clock tick. But that's not on consumer chips. I guess that's the widest AES implementation right now, but AMD isn't that far behind with Zen3.

--------

If anything, I'd say AMD actually was beating Intel at the AES-benchmark at Zen1 (In year ~2017 or so), but Intel / AMD have been trading blows with each other since then.


I haven't been able to test any Zen 3 CPUs. Now I'm curious!

From benchmarks listed here: https://calomel.org/aesni_ssl_performance.html it seems that Intel had similar performance in their 2015 release of the i7-6700 as AMD's 2017 release of Ryzen 7 1800X.

Raw general purpose benchmarks I found via Google put the Ryzen 7 1800X at about twice the compute capability as an i7-6700, but their AES performance is quite even.


CBC mode cannot be parallelized. That is a sequential-only algorithm, and none of the stuff I talked about before can take advantage of CBC mode.

GCM mode can be parallelized. Notice AES-128-GCM (which can take advantage of 2x pipelines) really flies on Zen1 compared to 6700.
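
To make the dependence concrete, here's a toy sketch with the AES-NI intrinsic (one aesenc round standing in for real AES, compile with -maes); the point is just the dependence pattern, not correct crypto:

    #include <immintrin.h>   /* AES-NI intrinsics; build with -maes */
    #include <string.h>
    #include <stddef.h>

    /* Single aesenc round standing in for full AES-128. */
    static __m128i toy_aes(__m128i block, __m128i key) {
        return _mm_aesenc_si128(block, key);
    }

    /* CBC-style: block i needs ciphertext i-1, so the AES units can never
       work on two blocks of the same stream at once. */
    static void cbc_like(__m128i *ct, const __m128i *pt, size_t n,
                         __m128i key, __m128i iv) {
        __m128i prev = iv;
        for (size_t i = 0; i < n; i++) {
            prev = toy_aes(_mm_xor_si128(pt[i], prev), key);
            ct[i] = prev;
        }
    }

    /* CTR/GCM-style: every block is independent, so a core with two AES
       pipes can keep both busy by interleaving adjacent blocks. */
    static void ctr_like(__m128i *ct, const __m128i *pt, size_t n,
                         __m128i key, __m128i ctr) {
        const __m128i one = _mm_set_epi64x(0, 1);
        for (size_t i = 0; i + 1 < n; i += 2) {
            __m128i c0 = ctr;  ctr = _mm_add_epi64(ctr, one);
            __m128i c1 = ctr;  ctr = _mm_add_epi64(ctr, one);
            ct[i]     = _mm_xor_si128(pt[i],     toy_aes(c0, key));  /* independent */
            ct[i + 1] = _mm_xor_si128(pt[i + 1], toy_aes(c1, key));  /* independent */
        }
    }

    int main(void) {
        __m128i pt[8], ct[8];
        memset(pt, 0, sizeof pt);
        __m128i key = _mm_set_epi64x(1, 2), iv = _mm_setzero_si128();
        cbc_like(ct, pt, 8, key, iv);
        ctr_like(ct, pt, 8, key, iv);
        return 0;
    }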


Regardless, it does really all come down to what your compute need is, and specialized metrics for somewhat odd-ball cases can sometimes still heavily favor Intel for the money.

Thanks for your comments! I didn't previously fully comprehend how the AVX modes and pipeline widths mattered for AES.


Yeah, this is super-specialized. I just happen to have studied the AES instructions to a ridiculous degree a few years ago... (https://github.com/dragontamer/AESRand)

You don't come up with 32-bytes of random numbers every 4-clock ticks unless you understand this stuff. :-)


The reason why the M1 performs so well, next to its insanely wide out-of-order capabilities, is that Apple placed specialized processors really close and gave everything a fast connection. Then, because they control a large part of the OS and applications, they rewrote the underlying macOS libraries and the applications to make use of these specific co-processors. They were able to achieve this because they are vertically integrated.

If AMD were to switch to RISC they could also create a hugely out-of-order-capable CPU, but they would be lacking the co-processors plus the tight integration, and if they could add them they would lack the ability to drive and dictate that the CPU be used in the most performant way.

So it seems to me that Apple is perfectly aligned to take a large part of the market and AMD, as a general purpose cpu designer, will need to come up with a very competitive cpu either on x86 (ZenX) or on RISC to stay competitive. They can outsource the fabrication of the chips. Intel meanwhile, with its owned fabs, has a lot more inertia and ego to turn around.

On the other hand, maybe AMD or Intel will find a way to increase the re-order buffer and the amount of “splicer” (I forgot the official name for the unit that takes the instruction code and translates it into chunks of microcode).


I have to disagree with this, I really don’t think that Apple is a competitor here.

The M1 chip is undoubtedly a remarkable piece of engineering. But it is quite clear that Apple are never going to sell it standalone. It also seems extremely unlikely that Apple are going to get back into the server game again.

Apple sell high markup, low volume (relatively) products. Data centre servers are the opposite, and Apple have stated quite explicitly they’re not interested in building anything that isn’t a high margin, luxury product.

In the consumer space, I can see Apple taking more market share. But again, the vast majority of PCs are either business machines or low margin consumer machines. Both are markets that Apple doesn’t seem to be that interested in diving into (their enterprise support is crap, and see above regarding low margin goods for the consumer market).

In all I must disagree with this statement

> So it seems to me that Apple is perfectly aligned to take a large part of the market

Not because Apple doesn’t have the technical chops to do it. But because they just aren’t interested in that market.

As such I think AMD, and other ARM manufacturers have everything to play for, and no need to worry about Apple.


I agree that it is unlikely that Apple will sell the M1 chip separately exactly for the reasons I mentioned in my post.

However, they do not have to sell the chip individually to be a competitor to AMD. If Apple’s devices are sufficiently attractive they will eat the market share of consumer devices from the likes of Dell/HP/etc. When those companies start selling fewer devices they have less need for AMD or Intel CPUs.

This leaves only the datacenter market where Apple will not compete with Intel and AMD.

Indeed, there is lots of room for a (new?) player to pull off the same trick as Apple for general purpose CPUs. I just think they are not as well positioned because of the lack of vertical integration. So AMD and Intel indirectly have to worry about what Apple is doing.


What about Linux servers for their own datacenters?


I can't wait to see if the engineering teams from Cavium push Marvell back into the ring. Cavium was designing an Arm server to compete with Intel for years before they switched to HPC. The market is primed. Intel is in serious trouble (I know, people have been saying this since NexGen in 1995, but for real real this time!)


Marvell/Cavium just canceled their server processors.


I thought they cancelled years ago and switched to HPC. Something recent?


Earlier this year ThunderX 3 was pitched as a server processor, then it was hyperscale-only, then it was canceled.


are there any good articles about what has gone so, so wrong at intel in the last few years?


There's a lot of great stuff at SemiAccurate throughout the years that details what went wrong at Intel and why. Unfortunately, you have to pay for it because Charlie has a practical stranglehold on semiconductor inside information.

And Charlie is a bit of a polarizing figure. He's gotten some things hilariously wrong, but he's also been dead on the money for a lot of issues, and his documentation of what's happening / happened at Intel is pretty much crystal clear.

You could probably browse through all the 10nm articles there and get a good idea of what happened without purchasing a subscription, if you have a decent knowledge of the semiconductor industry.

EDIT: If you read though these articles, even the parts that are available to non-registered users, a pretty clear picture starts to emerge - https://www.semiaccurate.com/tag/10nm/


I have yet to see an article with details about what went wrong in their fab process, but something seriously went wrong with their 10 nm fab process. Yields are low, many years in, and there were large delays.

This messed up their whole CPU development pipeline, because their designs are tightly bound to the process node, and it's a multi-year process. Because 10 nm hasn't ever worked in volume, and their new CPU designs are based around 10 nm, their latest server chips are still based on optimized Skylake microarchitecture, which was first released in 2015.

Early indications are that Intel's next process node (which they're calling 7nm) is also experiencing problems; if those problems aren't resolved, that probably means they'll have CPU designs for 7nm that they also can't ship in volume, in addition to the 10nm designs.

If they can figure out their fab issues, and they don't have a corporate implosion, I'm guessing they'll start making competitive CPUs again; they've been way behind before and come back (anti-competitive practices helped make sure they had the finances for that, of course).


that’s a statement refuted by the article in which it appears, though. The Ampere system is clearly not suited to latency-sensitive application services, unless you’re just looking to double your latency for no reason. If you looked at these specjbb results because you’re shopping for Java backend servers then you’re buying Xeon yet again.


It's not refuted, because Ampere isn't the only competitor. The statement is that with AMD EPYC and Ampere Altra as options, Intel Xeon loses everywhere.


I don't really think that's a supportable statement. The Xeon's outlier latency on specjbb is superior until it hits its throughput limits. A lot of machines are bought on the basis of latency, not throughput.

Also you really have to grind your teeth to get past AnandTech's habit of comparing, on a performance-per-dollar basis, just whatever CPUs they happen to have lying around. The entire reason the Xeon Platinum costs $10000 is because it scales to 8 sockets, a level needed only by people who have backed themselves into a corner with Oracle or SAP and who now need a gigantic single host at any price. Anyone who was actually planning to build a 2-socket server would choose the Xeon Gold 6258R with the same core count, the same cache, higher clocks, and less than half the price. Suddenly, when you make a comparison to a comparable product, you find that Intel has the cheapest product of the three, along with certain favorable performance aspects.


> "Intel’s current Xeon offering simply isn’t competitive in any way or form at this moment in time."

What? That makes so little sense. Is this chip x86 or something? Because the architecture is a pretty big deal for me when I decide which chip to invest in.


What they mean (IMHO) is that Intel pricing on Xeon Platinum SKUs is so disproportional to their comparative performance that they would be foolish to purchase.


Xeon Platinum supports 8-socket machines. 6x memory channels x 8-sockets == 48x DIMMs for your massive 6TB RAM servers.

Platinum is crazy priced for a crazy reason: it's a building block to truly massive computers that only IBM matches. Supercomputers don't even typically use Platinum: Platinum is for the most vertically-oriented systems.

---------

Xeon Gold is your more typical 2-socket or 4-socket machines: same performance, but fewer sockets supported.


> Xeon Platinum supports 8-socket machines. 6x memory channels x 8-sockets == 48x DIMMs for your massive 6TB RAM servers.

Both AMD Epyc Rome and Ampere Altra can support 8TB of RAM in dual socket machines, so that spec isn't particularly impressive anymore.

> Platinum is crazy priced for a crazy reason: it's a building block to truly massive computers that only IBM matches.

Meh. A dual-socket Rome system with 128 cores (256 threads), 8TB of RAM, and 128 PCIe 4.0 lanes is an enormous machine. A full 8 socket Platinum system is going to cost an order of magnitude (or two) more, and the performance improvement is going to be less than 2x.

An 8-socket system is going to consume an enormous amount of power, cost an unbelievable amount of money, and the complex NUMA domain combined with Intel's interconnect isn't going to "just work" for almost any software, so you're likely going to put a lot of effort into making that system barely work.

Intel Platinum is priced crazy because Intel is building monolithic processors (which means low yields) and Intel likes to have a substantial profit margin. Together, those mean high prices.

At a certain point, you're better off investing in making your application work on an accelerator of some kind, in the form of a GPU or an FPGA or task-specific ASIC, rather than giving Intel all your money in the hopes of eking out a marginal performance improvement. Alternatively, finding software that can scale to more than one machine.

Maybe IBM has truly large systems that are worth considering. Intel doesn't right now, unless you're using some software that demands Intel for contractual reasons no one can argue with.


But not with 6-channels per socket (48-channels total across an 8-socket computer). That's a heck of a lot of bandwidth.

I was doing the specs from memory and conservatively: 128GB sticks across 48-channels.

https://www.supermicro.com/en/products/system/7U/7088/SYS-70...

This 8-way Supermicro system supports 192x DIMMs. So... 128GB sticks x 192 == 24TBs of RAM or so, maybe 48TBs.

> At a certain point, you're better off investing in making your application work on an accelerator of some kind, in the form of a GPU or an FPGA or task-specific ASIC, rather than giving Intel all your money in the hopes of eking out a marginal performance improvement. Alternatively, finding software that can scale to more than one machine.

I mean, these 8-way systems are what? $500k ? Less than $1MM for sure. How many engineers do you need before a machine of that price is reasonable?

Really, not that many. If you can solve a problem with a team of 10 engineers + 48 TBs of RAM, that's probably cheaper than trying to solve the same problem with 30 engineers with a cheaper computer.

-----------

> An 8-socket system is going to consume an enormous amount of power, cost an unbelievable amount of money, and the complex NUMA domain combined with Intel's interconnect isn't going to "just work" for almost any software, so you're likely going to put a lot of effort into making that system barely work.

NUMA scales better than RDMA / Ethernet / Infiniband. A lot, lot, LOT better.

All communications are just memory-reads, and that's a real memory read. Not an "emulated memory read transmitted over Ethernet / RDMA pretending to be a memory read". You incur the NUMA penalty but that's almost certainly better than incurring a PCIe penalty.


Sure, I agree it's a lot of bandwidth.

But for that bandwidth to be used efficiently, the processes on each NUMA node need to almost exclusively limit themselves to memory attached to their node -- at which point, well-written software could probably do just as good spread out over several machines that are connected by multiple 100Gb network links, and then you saved two or three bajillion dollars.
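
For what it's worth, "limiting a process to its node" in practice looks roughly like the libnuma sketch below (or just `numactl --cpunodebind=0 --membind=0`); the node number and buffer size here are made up:

    /* Minimal sketch: run this thread on NUMA node 0 and allocate its
       working buffer from node 0's memory, so traffic stays off the
       inter-socket interconnect. Build with -lnuma. */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        int node = 0;                    /* assumption: all work on node 0 */
        size_t len = 1UL << 30;          /* assumption: 1 GiB working set */

        numa_run_on_node(node);                    /* pin execution to node 0's cores */
        char *buf = numa_alloc_onnode(len, node);  /* back it with node-0 pages */
        if (!buf) { perror("numa_alloc_onnode"); return 1; }

        /* ... node-local work on buf goes here ... */

        numa_free(buf, len);
        return 0;
    }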

If you're heavily using the bandwidth over the NUMA interconnect, then you're not going to be using the memory bandwidth very effectively (and likely not really using the processor cores effectively), and that's when NVDIMMs like Optane Persistent Memory could give you large amounts of bulk memory storage on a smaller system.

Or just use a number of Intel's new PCIe 4.0 Optane SSDs in a single machine in place of the extra memory and memory channels... the latency isn't the same as RAM, but it's much closer than traditional SSDs, and the bandwidth per SSD is like 7GB/s, which is impressive.

It all depends on the application at hand, but there are solutions that cost a lot less than the Platinum machines for virtually every problem, in my opinion.

I don't know... perhaps I'm too cynical of these cost-ineffective systems that just happen to be large, and I should be more impressed.

> NUMA scales better than RDMA / Ethernet / Infiniband. A lot, lot, LOT better.

Disagree.


> But for that bandwidth to be used efficiently, the processes on each NUMA node need to almost exclusively limit themselves to memory attached to their node -- at which point, well-written software could probably do just as good spread out over several machines that are connected by multiple 100Gb network links, and then you saved two or three bajillion dollars.

Wait, so latency over NUMA is too much, but you're willing to incur a 100Gb network link? Intel's UPI is like 40GBps (Giga-BYTEs, not bits) at sub 500ns latency.

100Gb Ethernet is what? 10GBps in practice? 1/4th the bandwidth with 4x more latency (in the microseconds) or something?

That's a PCIe latency penalty (x2, for the sender + the receiver). That's a penalty for electrical -> optical, and back again.

Any latency, or bandwidth, bound problem is going to prefer a NUMA-link rather than 100Gbps over PCIe.
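
As a rough back-of-envelope on that trade-off: transfer time is approximately latency + size / bandwidth. Plugging in the approximate figures above (assumptions from this thread, not measurements):

    /* Back-of-envelope transfer times: time = latency + bytes / bandwidth.
       The link figures are the rough numbers quoted above, not measurements. */
    #include <stdio.h>

    int main(void) {
        double upi_lat = 500e-9, upi_bw = 40e9;  /* ~500 ns, ~40 GB/s (UPI)    */
        double eth_lat = 5e-6,   eth_bw = 10e9;  /* ~5 us,   ~10 GB/s (100GbE) */
        double sizes[] = { 64, 4096, 1 << 20, 64.0 * (1 << 20) };

        for (int i = 0; i < 4; i++) {
            double b = sizes[i];
            printf("%10.0f B: UPI %9.2f us, 100GbE %9.2f us\n",
                   b, (upi_lat + b / upi_bw) * 1e6, (eth_lat + b / eth_bw) * 1e6);
        }
        return 0;
    }

Under those numbers the NUMA link wins by roughly 10x on tiny messages (pure latency) and roughly 4x on bulk transfers (pure bandwidth).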


> Wait, so latency over NUMA is too much, but you're willing to incur a 100Gb network link?

The point is not just the latency, but latency and bandwidth.

If the application is relying heavily on the NUMA interconnect to transfer tons of data, it's not going to be making efficient use of the processor cores or the RAM bandwidth. It's a total all around bust. You're just wasting money at that point. ^1

If you aren't relying heavily on the NUMA interconnect, and each node is operating independently with only small bits of information exchanged across the interconnect, then you'd save a metric ton of money by switching to separate machines and using just high speed fiber network links -- such as 100Gb.

I'm not saying that network would be better than the NUMA interconnect. I'm saying that you're not going to be having a happy day if you're relying on the interconnect for large amounts of data transfer.

The only situation where the NUMA set up is better is if you have a need for frequent, low latency communication between NUMA nodes... where very little data is being transferred between nodes, so the entire problem is just latency.

At that point, you're still suffering a lot by the NUMA interconnect, and it would be better to use larger processors... such as Epyc Rome processors.

So you really have to be in a very obscure situation which can't fit onto a dual socket Rome server, but can fit within a machine less than 2x larger. (28 cores * 8 sockets is less than 2x larger than 64 cores * 2 sockets)

You see how complicated this is and how absurdly niche those Platinum 8-socket machines are? They're almost never the right choice.

^1: The exception is if you have an even more niche use case that somehow is built for exactly this scenario in mind, and manages to balance everything perfectly. Such software is almost certainly ridiculously overcomplicated at this point.


> If the application is relying heavily on the NUMA interconnect to transfer tons of data, it's not going to be making efficient use of the processor cores or the memory bandwidth. It's a total all around bust. You're just wasting money at that point.

If the interconnect is your bottleneck, you spend money on the interconnect to make it faster. Basic engineering: you attack the bottleneck.

You don't start talking about slower systems and how they're cheaper. Because that just slows down the rest of your system.

------

If you just wanted cores, you buy a 28-core Xeon Gold. The point of the 28-core Xeon Platinum is the 8-way interconnect and scaling up to 8-way NUMA systems. The only person who would ever buy a Xeon Platinum is someone who wants 40GBps UPI connections at relatively low latencies (or maybe even the 300GBps connections that IBM offers, but that's a similar high-cost vertical-scaling system).

> So you really have to be in a very obscure situation which can't fit onto a dual socket Rome server, but can fit within a machine less than 2x larger. (28 cores * 8 sockets is less than 2x larger than 64 cores * 2 sockets)

That's not even that hard to figure out! A 48TB Memcached / Redis box, which is far more useful than an 8TB Memcached / Redis box.

A bit basic, but yeah. That's the point: spend more money on hardware and then don't spend much engineering time thinking about optimization. If 48TBs of RAM solves a problem that 8TB cannot, then just get the 48TB system.


You have since edited your comment, so...

> That's not even that hard to figure out! A 48TB Memcached / Redis box, which is far more useful than an 8TB Memcached / Redis box.

No... dozens of terabytes of Optane would be just as good and much much cheaper. The person designing the system would have to prove that a few nanoseconds of latency difference makes any material difference to the company's profits in order to justify the more expensive machine. Otherwise, it's a huge waste of company money, hurting the business.

Also keep in mind that Redis is only going to be using a single core of that machine. A total waste of huge amounts of CPU cores just to have a lot of RAM, when there are equally good solutions that cost much less.

You surely must see why I'm skeptical.

> That's the point: spend more money on hardware and then don't spend much engineering time thinking about optimization.

It's bad business practice to buy the most expensive thing possible instead of engineering a solution that is the right price. If the more expensive solution saves money in the long term, sure... but your example doesn't show this.


> No... dozens of terabytes of Optane would be just as good and much much cheaper.

PCIe Optane is orders of magnitude slower than DDR4, in both bandwidth and latency.

The only Optane that keeps up to DDR4 (kinda-sorta) is the Optane DIMMs which are exclusive to Xeon Golds / Platinum systems.


Not really... especially if this is an in-memory database attached to a network, as implied.

It’s only slower if someone can observe the difference, which I don’t think they would be able to in this design.

I’m a strong proponent of using fewer, larger machines and services, instead of incurring the overhead involved in spreading things out into a million microservices on a million machines. But there is a balance to be achieved, and beyond a certain point... synthetic improvements in performance don’t show up in the real world.

Queuing up a few database requests concurrently to make up for the overhead of literally hundreds of nanoseconds of latency is trivial, especially when Optane can service those requests concurrently, unlike a spinning hard drive. Applications running on other machines won’t be able to tell a difference.

But, agree to disagree.

There are probably applications where these mega machines are useful, but I don’t personally find this to be a compelling example.

I readily admit that I could be wrong... but neither of us have numbers in front of us showing a compelling reason for a company to spend unbelievable amounts of money on a single machine. My experiences (limited compared to many, I’m sure) tell me this isn’t the winning scenario, though.


> I readily admit that I could be wrong... but neither of us have numbers in front of us showing a compelling reason for a company to spend unbelievable amounts of money on a single machine.

$500k on a machine isn't a lot of money compared to engineers. Even if you buy 4 of them for test / staging / 2x production, it's not a lot compared to the amount spent on programming.


One thing being expensive doesn’t make another thing not expensive.

It’s possible for them to both be independently expensive, and I’m saying that unless you can prove that the performance difference makes any difference to company profits, it is literally a huge waste of company money to buy those expensive machines.

A lot of applications will actually perform worse in NUMA environments, so you’re spending more money to get worse performance.

Reality isn’t as simple as “throw unlimited money at Intel to save engineering time.” Intel wishes it was.

Engineering effort will be expended either way. It is worth finding the right solution, rather than the most expensive solution. Especially since that most expensive solution is likely to come with major problems.


Preface: If you have actually encountered applications that must be run on 8-socket systems because those are literally the only fit for the application... I would love to hear about those experiences. With the advent of Epyc Rome, most use cases for these 8-socket systems vanished instantly. It would be fascinating to hear about use cases that still exist. Your experiences are obviously different than mine.

If you need more than 8TB of RAM, with the right application design you can probably do better with fast Optane Persistent Memory or Optane SSDs, and an effective caching strategy. You can have many dozens of terabytes of Optane storage connected to a single system, and Optane is consistently low latency (though not as low latency as RAM, obviously).

If you need more compute power, you can generally do better with multiple linked machines. You can only scale an 8-socket system up to 8 sockets. You can link way more machines than that together to get more CPU performance than any 8 socket system could dream of.

----------

I didn't expect you to read and respond so quickly, so I had edited my previous comment before you submitted your reply.

This was a key quote added to my previous comment:

>> So you really have to be in a very obscure situation which can't fit onto a dual socket Rome server, but can fit within a machine less than 2x larger. (28 cores * 8 sockets is less than 2x larger than 64 cores * 2 sockets)

In response to your current comment,

> If the interconnect is your bottleneck, you spend money on the interconnect to make it faster. Basic engineering: you attack the bottleneck.

Exactly. Using a dual-socket Epyc Rome system would be more than half as powerful as the biggest 8-socket Intel systems, but it would reduce contention over the interconnect dramatically, which means that many applications that are simply wasting money on an 8-socket system would suddenly work better.

This also goes back to my comment about using accelerators instead of an 8-socket system.

The odds of encountering a situation that just happens to work well with Intel's ridiculously complicated 8-socket NUMA interconnect, but can't work well over a network, can't work well on a system half the size, and requires enormous amounts of RAM to keep the cores fed, seem vanishingly small... and in that case, we still have to consider whether an accelerator (GPU, FPGA, or ASIC) could be used to make a solution that is a better fit for the application anyway, and if so, you'll save a large amount of money that way as well.

So, to make buying an 8-socket system make sense, the application must require performance that is...

- less than twice a dual socket Epyc Rome system, but greater than one dual socket Epyc Rome system can handle

- not dependent on transferring huge amounts of data around the interconnect

- dependent on very low latency communication between NUMA nodes

- needs enormous memory bandwidth for each NUMA node

- needs huge amounts of RAM on each memory channel (so you can't just use HBM2 on a GPU to get massive amounts of bandwidth, for example)

- etc.

It's a niche within a niche within a niche.

As I said in an earlier comment, I probably should be more impressed instead of being so cynical about the usefulness of such a machine. They are engineering marvels... but in almost every case, you can save money with a different approach and get equal or better results.

That's why 8-socket server sales made up such a small percentage of the market, even before Epyc Rome came in and completely obliterated almost all of the very little value proposition that remained.


Don't forget that Rome also has a wildly nonuniform interconnect between the core complexes, and the system integrator gets much less control over it than Intel's UPI links. When you really need to end up with a very large single system image at the application layer, the bigger architecture works out to be much cheaper than 256GB DIMMs or HPC networking.

8-socket CLX nets you 1.75x the cores, and 3x as many memory channels vs. a 2-socket Rome system. It also scales to a single system image with 32 sockets if you use a fabric to connect smaller nodes:

* 4-socket nodes: https://www.hpe.com/us/en/servers/superdome.html

* 2-socket nodes: https://atos.net/en/solutions/enterprise-servers/bullsequana...

That's 48TB of DRAM with all 128GB DIMMs, or 12TB + 128TB when using 512GB Optane PDIMMs.


I mean, I buy AMD Threadripper for my home use and experimentation. I'm pretty aware of the benefits of AMD's architecture.

But I also know that in-memory databases are a thing. Nothing I've touched personally needs an in-memory database, but it's a real solution to a real problem. A niche for sure, but a niche that's pretty common actually.

Whenever I see these absurd 8-socket designs with 48TBs of RAM, I instinctively think "Oh yeah, for those in-memory database peeps". I never needed it personally, but it's not that hard to imagine why 48TBs of RAM beats out any other architecture (including Optane or Flash).


> it's not that hard to imagine why 48TBs of RAM beats out any other architecture (including Optane or Flash).

Agree to disagree.

In-memory databases are common yes, but it is pretty hard to imagine practical situations where an in-memory database can't handle a few nanoseconds of additional latency.

All else being equal, of course more RAM is nice to have. All else is not equal, though, so this is all highly theoretical.

But it is fun to think about!


Right? "This platform does/does not run my software" is a huge checkbox when buying computers. The viewpoint of people who actually specify, buy, and operate servers at scale is severely underrepresented in these discussions.


Keep in mind that Intel also has to compete with AMD... and AMD is also x86_64.

AMD and Altra are trading blows at the high end of performance, as seen in this article, while Intel is... not, except in extremely specialized applications.

Intel servers don't even offer PCIe 4.0 at this time, which is just a bad joke when it comes to building out servers with lots of high performance storage and networking. For now, Intel offers (relatively) poor CPU performance and poor peripheral performance.

So, if your software can't run on Altra, the other choice for high performing servers is AMD, not Intel, unless you're just locked into Intel for historical reasons.

Intel does have some nice budget offerings for cheap servers, though, such as Xeon D.


Ironically Xeon-D seems to have been quietly cancelled.


So you're never going to consider moving to another architecture because your software (currently) doesn't run on it?

So Amazon made a huge mistake with Graviton then? Last time I checked Amazon 'specify, buy, and operate servers at scale'.


I don't know if it's a mistake, the competition is welcome. I spun up some arm servers because they were lower cost but then ended up having to worry about software availability issues. It's non-trivial and I don't think it was worth the savings at this time and went back to x86.

Regardless, the claim that there is no value in Intel let alone x86 is a stretch.


So how are the cores interconnected? ARM server designs love to throw out tens to hundreds of cores, which sounds good on paper but isn't really useful for lots of workloads if they have a shit interconnect or lots of NUMA nodes.

Interconnect is key, e.g., see the improvements between Zen 2 and Zen 3, especially for the `4 < cores <= 8` models - which stem mostly from having 8 cores per core complex with a good direct interconnect, instead of two complexes with only 4 cores each.

An anecdote: I got my hands on a box with a Cavium ThunderX2 96-logical-core processor about two years ago; a kernel compile was slower than on my high-end (but not overclocked, and fairly common) i7-based workstation from ~2015. That is certainly not a statistically relevant comparison, and there may be some workloads that do well with those CPUs, but it's neither single-threaded nor massively parallel work, maybe something in between...


This is a serious problem. On-die interconnect requires insane bandwidth and lots of caches. We're talking TB/s to keep cores fed. This is a significant challenge (source: I worked on Intel Xeon Phi), and at the time I worked on it in 2007 (pre-name-change), there was not a lot of empirical data, just lots of simulations, guard-bands and finger crossing.


What are guard-bands?


At the time the validation models were not very mature so they moved everything up to higher metal layers with big-ass drivers and designed for worst-case. I'm sure Intel has a much better handle on it now (they ain't dummies!), but back then the orders were "don't let the ring(s) throttle the cores".


This is a dumb question. But as you've said, cores per die has a strong influence on how performant a system is and that's because of the processor interconnects. My question is, what are the processors saying to each other?


It's not a dumb question; it's actually the defining characteristic of performance for large systems.

In a shared-memory multiprocessor, all cores and all memory are logically connected along a massive shared bus, and both inter-core communication and regular core-to-main-memory communication take up slots on the bus. In practice, of course, it isn't physically a single bus but a more complex topology (usually meshes, I think) with protocols that make it behave like a bus for cache coherency purposes.

One of the readily apparent consequences of interconnect is NUMA, non-uniform memory accesses. Historically (my information on this matter is a few years out of date, so I don't know how much this is true today), this has been a bigger issue for AMD than Intel, as AMD had wildly divergent memory latencies (2-3× between worst/best case) even for a 4-core processor. So if the OS decided to resume execution of a sleeping program on a different core, your performance might suddenly drop by ½ or ⅓.

For some benchmarks (such as classic SPEC, or make -j), multicore performance is approximated by running the same binary on every core and there is no sharing between different copies. The interconnect here matters only insomuch as the cross-chatter on the interconnect may impede performance. But HPC applications tend to be much more sensitive to the actual interconnect, since reading and writing from other processors' memory is more common (transferring halo data, or having global data structures that you need to read from and occasionally write).


It's not so much "processors" talking to each other as caches.

The basic theory is called "MESI": Modified / Exclusive / Shared / Invalid. When a remote core reads a cache line, if no one else has that cache line then it's the Exclusive owner. But if multiple caches have the cache line, then it is a Shared cache line instead. If a remote cache modifies the line, then your copy becomes Invalid, while the writer's copy is placed into the Modified state (its cache is correct, RAM is stale).

MESI isn't how things are done in practice: additional messages and states are piled on top in proprietary ways. But MESI will get you to the "basic textbook" concept at least.

EDIT: Oh, and "snooping". Caches "snoop" on each other's communications, which helps keep the MESI state up to date. Now you can have directory-based snooping (kind of a centralized push), or rings, or whatever. Those details are proprietary and change from chip to chip. In practice, these details are almost never published, so your best bet to understanding is "Something MESI-like is going on..."
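
The way this coherence traffic usually bites in practice is false sharing: two cores writing into the same cache line force ownership to bounce between Modified and Invalid on every update. A rough sketch (the 64-byte line size and iteration count are assumptions; build with -O2 -pthread):

    /* Two threads increment counters that share a cache line, then counters
       padded onto separate lines; the first run bounces the line between
       cores (MESI ping-pong), the second doesn't. */
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 100000000UL

    struct { volatile unsigned long a, b; } same_line;
    struct { volatile unsigned long a; char pad[64]; volatile unsigned long b; } padded;

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    static void *bump_same(void *p)   { for (unsigned long i = 0; i < ITERS; i++) same_line.b++; return p; }
    static void *bump_padded(void *p) { for (unsigned long i = 0; i < ITERS; i++) padded.b++;    return p; }

    int main(void) {
        pthread_t t;
        double t0;

        t0 = now();
        pthread_create(&t, NULL, bump_same, NULL);
        for (unsigned long i = 0; i < ITERS; i++) same_line.a++;   /* same line: M <-> I ping-pong */
        pthread_join(t, NULL);
        printf("shared cache line: %.2f s\n", now() - t0);

        t0 = now();
        pthread_create(&t, NULL, bump_padded, NULL);
        for (unsigned long i = 0; i < ITERS; i++) padded.a++;      /* separate lines: each core keeps its own */
        pthread_join(t, NULL);
        printf("padded counters:   %.2f s\n", now() - t0);
        return 0;
    }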


In the case of Zen/Zen2, some of the dies didn't have direct access to main memory but went through an adjacent die's memory controller and channel. This had a performance and latency effect, and required some extra work on the part of OS schedulers to keep tasks on cores close to the memory they were using.

It's (relatively) easy to have lots of cores on one die, it's not so easy to keep them fed.
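
That non-uniformity is visible from userspace: the kernel exposes the firmware's node distance table (the same matrix `numactl --hardware` prints, where 10 means local and bigger numbers mean more remote). A tiny sketch via libnuma; the output obviously depends on the machine:

    /* Print the NUMA node distance matrix (ACPI SLIT) as the kernel reports it;
       10 = local node, larger values = farther away. Build with -lnuma. */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        int nodes = numa_max_node() + 1;
        for (int i = 0; i < nodes; i++) {
            for (int j = 0; j < nodes; j++)
                printf("%4d", numa_distance(i, j));
            printf("\n");
        }
        return 0;
    }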


Modern CPUs like these are shared memory multiprocessors. Programs on these CPUs are often designed to run on multiple cores at the same time to improve performance, necessitating communication between cores, to coordinate and share work between each other. Since each core typically has a private cache, most communication in a system like this will typically involve cache coherence protocols to ensure cores have a consistent view of memory contents. Adding more cores increases the overall complexity of the system. The latency of these protocols grows with more cores as interconnections become more "distant", across processor interconnects. Minimizing latency and effectively scaling to more cores is a difficult problem to solve while staying within silicon and thermal budgets.


"knock knock" "who's there?"


ThunderX was a huge disappointment. ThunderX2 (which one may think is the successor of ThunderX, but is actually a completely different system that Cavium obtained by acquiring a different company that was also working on ARMv8 hardware) was a (not so huge) disappointment. Cavium tried to copy-paste lots of coprocessors and offload things from the CPU, but the overall system was not that great.

Early AMD Softiron and Applied Micro boards (which had 8 cores unlike the ThunderX which had 48 or 96) were actually faster, which I always found interesting.

But Ampere's previous generation (before N1) is fast, much faster than the ThunderX2. Afaik, they built it on top of previous Applied Micro IP. So I'd expect N1 to be in a different league and not worth comparing to ThunderX2.


> But Ampere's previous generation (before N1) is fast, much faster than the ThunderX2.

I think you're getting ThunderX2 and ThunderX mixed up here.


In my experience, for tasks like everyday operation, kernel building, etc.: ThunderX < ThunderX2 < eMAG (Ampere's platform before N1).

I wouldn't say it's the same order of magnitude for the 2 comparisons, but it's definitely noticeable.


Another anecdote: in the 1990s, Sun servers had relatively "weak" SPARC CPUs; contemporary Intels would wipe the floor with them in integer performance tests. But they had superior interconnects and I/O channels, so e.g. in 4-CPU configurations the SPARC-based servers showed much better application performance (say, running Oracle RDBMS) than the beefier Intel 4-CPU machines.

The connection fabric is utterly important, and expensive.


> slower than on my...

May have consumed far less power doing the job though. For server farms performance-per-watt can be far more important than absolute speed per unit, especially for massively parallel tasks.


The Cavium ThunderX2 is a 205W TDP chip vs a circa-2017 i7 (~105W), so probably not.


> So how are the cores interconnected?

No one answered that, and judging from the core-to-core ping-pong test, there seem to be some weird, possibly performance-crushing issues involving the second processor.
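
For context, the core-to-core ping-pong test is conceptually just two pinned threads bouncing a flag and timing the round trip, roughly like the sketch below (Linux-specific affinity calls; the choice of cores 0 and 1 is arbitrary):

    /* Two threads pinned to different cores pass a token back and forth;
       the average round trip approximates core-to-core (cache-to-cache) latency.
       Linux/GNU specific. Build with -O2 -pthread. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    #define ROUNDS 1000000

    static atomic_int token = 0;

    static void pin(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *responder(void *arg) {
        pin((int)(long)arg);
        for (int i = 0; i < ROUNDS; i++) {
            while (atomic_load(&token) != 1) ;   /* wait for ping */
            atomic_store(&token, 0);             /* send pong */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t;
        struct timespec a, b;

        pin(0);                                           /* this thread on core 0 */
        pthread_create(&t, NULL, responder, (void *)1L);  /* partner on core 1 */

        clock_gettime(CLOCK_MONOTONIC, &a);
        for (int i = 0; i < ROUNDS; i++) {
            atomic_store(&token, 1);              /* ping */
            while (atomic_load(&token) != 0) ;    /* wait for pong */
        }
        clock_gettime(CLOCK_MONOTONIC, &b);
        pthread_join(t, NULL);

        double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
        printf("average round trip: %.1f ns\n", ns / ROUNDS);
        return 0;
    }

Running that with the two threads placed on different sockets or different core complexes is the sort of thing the article's ping-pong chart is measuring.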


I love seeing these alternatives to x86, Intel and AMD showing up. After long years of nothing exciting on the CPU side until AMD released Zen, now with the M1 from Apple and this, hardware feels exciting again. I also really like the fact that Ampere CPUs are not something that will only live in one ecosystem (like Graviton, which will, I guess, never exist outside of AWS, or the M1 outside of Apple products).

I really hope something similar is going to show up for PC as well.


I'm excited as well. But as an earlier discussion pointed out, I'm worried that the transition might impact the openness of PCs, e.g. locked bootloaders etc.


Yup. Finding out that Big Sur phones home every time one opens an application is a showstopper for me, never mind Apple's ever-changing ports and its campaign against the right to repair.

We've also already seen that Microsoft considers ARM systems to be "special" enough to require locked bootloaders as well. Don't get me started on the mess in the smartphone world. I'd hate to see a world where the only "open" computing revolves around either Raspberry Pi-class SBCs or expensive datacenter servers, with no middle ground.


Windows on arm64 systems aren't locked down.

Meanwhile for arm64 Macs, they aren't either. Yesterday, https://twitter.com/xenokovah/status/1339914714055368704?s=2... was released to run unsigned kernels.


It's encouraging to see an effort like this, but unless it can become a first-class citizen on M1 hardware, it will be at best like trying to keep an iDevice jailbroken, or having a custom Android ROM that lacks important basic functionality like VoLTE.

One can hope, I guess. If I could be sure I could run Linux without it being hobbled, and that Apple wouldn't pull the rug out from under me, I could actually see myself adding a Mac Mini to the stable.

Edit:

> Windows on arm64 systems aren't locked down.

Has this changed? I recall that on ARM, Microsoft requires that UEFI Secure Boot be enabled, cannot be disabled, and cannot load custom keys. So, you could boot Linux as long as it's been "blessed" by Microsoft, assuming they don't pull the rug out, much like leaving Secure Boot enabled on an x86 PC (except on a PC you can usually load your own keys).

https://arstechnica.com/information-technology/2012/01/windo...


That was a restriction for 32-bit Arm Windows devices, which were indeed locked down. (I ended up breaking the Secure Boot implementation for Windows RT devices later on, and they didn't bother to fix it)

For 64-bit Arm Windows devices, they had the security policies of a conventional PC since the very beginning.


Thanks for clearing that up for me. I've long viewed that as one of the storm clouds on the horizon, as it was.


Alternate source, though it may be dated to around the time of v2.2 of the UEFI spec.

https://www.happyassassin.net/posts/2014/01/25/uefi-boot-how...

I find it funny that almost all the real gripes with user freedom infringement actually tend to come as a result of Microsoft or other big players getting their fingers into a hardware platform.


Not quite, since Apple actively works against jailbreaks on iOS. On macOS they have provided an actual path for code execution to boot alternate OSes.


When Windows 10X gets released it might be different, as the original plan before COVID happened was for it to be the introduction of Win32 sandboxing as well.

Now with Reunion merging both worlds, and Windows 10X only planned for 2021, it might be a different story.


Windows 10X is launching for Intel SoCs first, hopefully without locked BLs.


There are already a few ARM laptops that are quite open (like the Pine64 ones). There are also ARM boards like the Raspberry Pi. I hope that if more beefy CPUs and SoCs show up, we'll see them popping up in similar devices as well.


Sort of. Even the Pi is not entirely open-source. The Broadcom SOC relies on some closed-source binaries.


There's an interesting rumor that says AMD is building an ARM SoC for OEMs...


Didn't they build the architecture and then chuck it?


Yes, called Seattle, in their codenames.

For a little while there was one partner who used it for Gluster (or similar) storage nodes.

It fit into the model of Arm processors that were performant enough to be the glue between hard disks or flash and a reasonably fast network. But after the PCIe lanes were all dispatched, there wasn't a lot of compute left over for applications on top.



I think they grafted 4 and 8 Cortex-A57 cores into a package, and it didn't seem to have much market interest.

I too have heard that they have their own ARM microarchitecture project, though.


I could see AMD grafting an Arm front end onto Epyc.


If I were them I wouldn't even bother with anything pre-Arm v9. That's going to draw all the hype, especially when Apple announces the Arm v9-based M2 next year.

And then I'm sure most sheep-like OEMs will say "Oh, we want THAT, too. Where is it - we want it yesterday!" But AMD won't be able to provide one too soon, because they would've gone all in on Arm v8, and they'd want to squeeze at least a couple of generations out of that microarchitecture.

Something similar happened when Apple announced the first ARMv8 processor and it took Qualcomm and all the rest 2 years to catch up.


At the time of the intro of 64 bit ARM, the conventional wisdom was, "Pooh pooh, this'll just bloat your code and not offer any performance advantage whatsoever. How pointlessly stupid!"


The problem is that ARMv9, as I type this, is not available. Perhaps Arm Holdings and/or Nvidia are working on the ARMv9 instruction set, and there has been a lot of speculation about what ARMv9 will be, but so far nothing concrete. [1]

ARMv8 (i.e. 64-bit ARM), of course, has been around for a while now.

[1] Keep in mind that anything on Reddit not confirmed elsewhere is very likely either wild speculation or made up fiction.


Apple is usually the first to ship ARM designs. I think they are the only ARMv8.5 implementation at the moment.


I'm just blown away that Apple gets the credit for the Arm switch, but I'll take it.


Apple has been running ARM processors since the original iPhone.


Apple was also part of the joint venture that started ARM.


Nice hardware, but if I order a hundred, how many will I get? Arm servers so far have been vaporware, only available in single-digit quantities after a very long wait.

There is a new announcement every half year or so by one firm or another. But we've never managed to get beyond getting a sample, and not for lack of trying. Numbers just aren't available, models change extremely fast, producers go bankrupt. Maybe it will get better once one of the big server producers starts shipping Arm?


Like Amazon? You know you can get an ARM server there any day of the week for extremely reasonable prices?

Edit: to clarify, I mean on AWS, not from the Amazon store.


Reasonable is relative to what you want to do: cloud instances are only cheap if your machines are usually idle and you would otherwise have overprovisioned your datacenter massively. Cloud instances are only cheap if there is no great amount of input/output data to transfer, like simulation results or large model input datasets. Cloud instances are only cheap if you don't need assurances about locality or security for compliance. Cloud instances are only cheap if you can live with a slow network between the instances; as soon as you want InfiniBand or at least dedicated Ethernet links, you are either out of luck or in "really expensive" territory again.


You left out management cost. Once you’re talking something more than a personal laptop the cost of administration needs to be factored in. This is where cloud environments have some big advantages since you’re not paying skilled people to run cables, debug firmware, swap parts, etc. Yes, you can still beat cloud (especially if you serve a lot of network traffic) but it’s not easy unless you’re giving up security, features, or availability.


> run cables, debug firmware, swap parts, etc.

Data centers already do that. You could build your own rack and send it to them, but the more common model is renting preconfigured servers on a monthly or annual basis with them swapping out broken hardware as needed etc.

Cloud computing can save on administrative costs, but there is a huge world between building everything in house and AWS.


Data centers do that for additional costs. It can be a way to save if your needs are simple and don’t change but it’s also often a source of downtime or operational overhead since you’re looking at managing separate groups of people versus using an API.

You can definitely save money this way but it rarely ends up being as much as touted - double digit percentages, not whole multiples - unless you’re really bandwidth intensive.


Perhaps have a look at Profitbricks. They are an interesting small alternative to AWS for a niche market.


If I can't get a machine for local development I wouldn't care personally. I like the idea of many, many threads as I work a lot with data, but I need something like a PC to try things on.

Also, don't really like vendor lock-in.


I'm doing some development on AWS currently, on a g4dn instance with a T4 accelerator. With pings lower than 50ms and remote VS Code it gets really efficient. The biggest issue of course is the additional mental burden of being remote, plus you can't simply print an image in an ssh console...

I thought it would be horrible, but after a brief adjustment period it turned out to be okay-ish for my current work.


But you can view images in a terminal emulator. Not every one, but kitty, alacritty, st, iTerm2 for macOS, etc. should support it. This works for me (you need to install ueberzug, e.g. via pip):

    { declare -Ap c=([action]=add [identifier]=0 [x]=0 [y]=0 [width]=80 [height]=80 [path]="mypic.png") ; sleep 5;} | ueberzug layer --parser bash


In 2004 I was using telnet/X Windows into the company's UNIX development server; in 2020 I use ssh/Web/RDP into the cloud environment instead.


To be fair, it is using their own hardware (Graviton2, mentioned in the article). Other than Ampere, are there any other hardware vendors offering ARM-based server CPUs that are commercially available?


Amazon's Graviton2 is in some of the Anandtech benchmark results. The availability & pricing is probably related to how far behind it is in those benchmarks, too...


These things look awesome, but are still pretty expensive. I'd love to put an ARM server in my homelab, but the price range between $100 (RPi & Co) and $10,000 (enterprise server) seems pretty empty right now :/


You can get SolidRun HoneyComb[0] for ~$1000, it has 16 cores with supposedly reasonable performance.

[0] https://www.solid-run.com/arm-servers-networking-platforms/h...


For a thousand bucks of CPU, motherboard and RAM, anyone with a modicum of clue building x86-64 industry standard PC hardware could set up something based on Ryzen that runs circles around that. Probably at least quadruple the performance in any integer or floating point benchmarks.

$659 Ryzen 5800X

$250 motherboard


Reasonable, but not exciting. Cortex-A72 is quite old by now. Just compare EC2 a1 instances vs. the newer ones.


Just compare how much such an exciting AWS instance would cost in a month.

You can't compare it to owning a whole machine for several years, having bought it once.


That puppy is on my wish list if tax refunds are a thing this spring.


This looks what I was searching for. Thank you!


Do I read the table on https://www.anandtech.com/show/16315/the-ampere-altra-review... incorrectly?

To me, it says they have a system with 32 cores at 1.7 GHz for $800, and a whole range between $2,200 and $4,050.


I think that's the processor only? I'm basing that off the paragraph after the table, which says:

> AMD’s EPYC 7742 which still comes in at $6950

A couple quick searches (Amazon, Newegg) show people selling the CPU-only for this price. Since that's the price they're comparing to the table, I think the table prices are for the ARM CPU only.


Thanks! I already found $800 cheap-ish, but didn't consider it a price for just the CPU; my mind isn't used to seeing server prices, it seems.


I built my home compute cluster out of four Odroid N2s. These are $80 ARM-based single-board computers (SBCs), similar to Raspberry Pis. You can bridge the gap between the high and low ends by buying low end and scaling horizontally.

The troublesome piece is, of course, the horizontal scaling. For a cluster of ARM CPUs to be worthwhile, you need a job that can be parallelized, but not so massively parallel that it already runs on a GPU.

Even so, I think there's a lot of potential for these low end ARM devices. Ultimately, their impact on x86 could be the same as what x86 did to the previous generation of mini computers.


I'll believe these are a real, viable thing for ordinary open source Linux/BSD developers when I can go buy a $150 motherboard and a $250 CPU from Newegg that have performance anywhere NEAR what I can get with an equivalently priced Ryzen, or even some Intel 10th/11th generation Core i5/i7 CPU.


There are several multi-board RPi&Co solutions. Pine64 has some interesting offerings in this space.


TuringPi 2 (for RPi CM4) https://turingpi.com/

Pine64 SOPINE Clusterboard https://www.pine64.org/clusterboard/


The problem with most of these is the availability of a Linux operating system image that is as mature and close to 'stock' Debian as Raspbian is. Usually it's some two-year-old version of Ubuntu that's been cobbled together by the vendor.


I believe there's been some work in this space by the Armbian[1] community, which is aiming to create a unified base distro for a wide range of ARM single-board computers including Odroid, Pine64, etc.

[1] https://www.armbian.com/


Just wait for virtualization on M1 Mac.


That’s useful if you’re not averse to MacOS. Given that Apple now deploys backdoors by default for their own apps which will always inevitably result in exploits, and turning their back on decades of computing history with no “legacy” < 64bit support, I find myself struggling with that choice.

Heck, my 2015 MBP is still running Mojave (and thankfully still receiving software updates)


The removal of 32-bit support in Catalina was definitely a bad thing for many who are using x86 Macs, but for someone interested in an ARM-based Mac, it shouldn't play a role.


> shouldn't play a role.

Until it does. Some future macOS may require some new architecture feature, and then it'll be "better buy an Arm13 Mac if you want to run macOS 2023!". Apple doesn't have a very good track record of supporting their own old hardware in new macOS releases. Several iMacs and Mac Minis come to mind. Their past behaviour sets a benchmark for their future behaviour, and since there are no reliable assurances about their support roadmap, I'd think carefully...


What? Big Sur supports hardware that is at least 7 years old. Catalina goes back to 2012 (and still gets security updates). For an OS vendor that has yearly updates, they do a pretty good job of supporting older hardware.

https://eshop.macsales.com/guides/Mac_OS_X_Compatibility


The PowerPC->Intel transition led to an OSX release that was Intel only just three years after the first Intel hardware was released.

Apple's known for cutting compatibility when they have some goal that's served by it, or it's viewed as an albatross by their product side.


True. It looks like the last PPC Macs were released in 2005 and were available through 2006 [1]. The last OS that supported them was Leopard, which got security updates through 2009 [2].

[1] https://everymac.com/systems/by_timeline/ultimate-mac-timeli...

[2] https://en.wikipedia.org/wiki/Mac_OS_X_Leopard#Release_histo...


Even better: based on Apple's history, High Sierra probably EOLs in January 2021 and supports hardware all the way back to 2010. So that's a decade of hardware support, which is pretty good.


> Given that Apple now deploys backdoors by default for their own apps which will always inevitably result in exploits

You're talking about the certificate revocation check that bypasses VPNs? That one is to prevent malware from being able to hijack and block certificate checks.


No, I’m talking about regular Apple apps that bypass local software firewalls.

See this HN thread from ~2 months ago for an introduction: https://news.ycombinator.com/item?id=24838816

This is new behaviour in Big Sur, breaking existing tooling and providing a back door for Apple, and for anyone who can exploit it, like this guy: https://twitter.com/patrickwardle/status/1327726496203476992


If you are just building a home lab server (ie. don't need a video card, well... usually), you can build a surprisingly powerful, inexpensive system. Probably in the $500-$1000 range depending on CPU choice.


Just like with Xeon server hardware, you will need to wait 4-6 years for companies to retire their fleets. R730s are currently reasonably priced (under $1K), but those were released new in 2014. R740s are still a few thousand dollars.


Why would you want a 6-year-old server chip? By that point I'd expect a mid-tier consumer chip to have better performance/watt, and much better support for peripherals.


Commercial Warning... Just in case anyone is interested in obtaining their own Altra system, they are available for pre-order here: https://store.avantek.co.uk/arm-servers.html


How about the Q32-17? For $800, it sounds like a pretty decent server and I am interested.


Bet you $5 it will be absolutely annihilated in any benchmark by an $800 Threadripper or Epyc on a single-socket server motherboard.


I'm surprised I searched this whole thread and didn't see a single mention of NVIDIA! It seems everyone wants either Intel or AMD to pivot to ARM64 (very unlikely) in the server space. But with NVIDIA's $40 billion purchase of ARM -- I wonder what they're cooking up?

I think a high-performance ARM64 CPU from NVIDIA with an on-die GPU would be a great offering. In many benchmarks it's actually the on-die GPU that makes the M1 so much faster. A general purpose on-die GPU would help speed up a lot of low and mid-level cloud workloads, especially with native kernel support.

Good luck, NVIDIA. And godspeed!


I keep seeing the Google Chrome logo in the fan frame: https://images.anandtech.com/doci/16315/X-T30_DSF3724_575px....


Google Chrome powers a large percentage of all fan spins.


Talk about images you can hear.


From the comments:

> Each Neoverse N1 core with 1MB L2 is just 1.4mm^2, so 80 of them add up to 112mm^2. The die size is estimated at about 350mm^2, so tiny compared to the total ~1100mm^2 in EPYC 7742.

So performance/area is >3x that of EPYC. Now that is efficiency!


This is interesting to see another company take a crack at super-density. Scaleway famously entered this market in 2017 only to leave it in 2020 citing reliability issues [1]

Such is the churn and iteration to challenge x86. I welcome it.

[1] - https://www.theregister.com/2020/04/21/scaleway_arm64_cloud_...


How close are Intel to releasing anything that's a "next gen" core design and not just incremental design change or node shrink? When will Intel's "Zen2" arrive? Are there any rumors?


SemiAccurate thinks it's nowhere on the horizon. They have a good track record on these things. See this, for example: https://www.semiaccurate.com/2020/11/12/intel-delays-a-mains...


No need to go by rumors. Intel themselves have announced that Ice Lake SP won't happen until mid-2021. They showed off some stuff at SC'20 last month, but since it's possible that Zen3 EPYC "Milan" will launch at the same time or even before, it's impossible to say what their competitive position will look like when they finally ship it.


On the server? Unknown. Anywhere at all? They already did - two iterations, in fact. Ice Lake uses the new Sunny Cove uArch and Tiger Lake uses the successor to Sunny Cove called Willow Cove. And Tiger Lake is fine but not great: https://www.anandtech.com/show/16084/intel-tiger-lake-review...

Rocket Lake is supposedly a "backport" of Sunny Cove to 14nm for desktop, coming Q1 2021.


Hopefully they get it together. Competition is good for us consumers.

They could end up being a low end x86 vendor.


Alder Lake will be a big-little design on 10nm, probably coming in Q3 2021.


So, what OS do these run? Is it OEM-provided custom kernels, or third-party?


Anything that's SBSA compliant (supports ACPI, basically)… and ready to run on a massive machine with (for some reason) very sparsely populated huge physical memory address space.

https://github.com/freebsd/freebsd/commit/d7704f9e75aafd312a... https://github.com/freebsd/freebsd/commit/5786fe85813a411327... https://github.com/freebsd/freebsd/commit/3ef92b2f0a0ff3f101... https://github.com/freebsd/freebsd/commit/1f181d2512f60098ed...



Wow. Never knew Linux architecture allowed it to scale on hardware in such a way.

Edit: I know Linux is customized to run on a variety of hardware; I was just pointing out that I never knew you could install Ubuntu on hardware like this and have it take full advantage of it, given how specialized that hardware is.


Linux has been running on 2000+ core machines for longer than a decade (at least):

https://unix.stackexchange.com/questions/4507/how-many-cores...


The last time discussions about Linux and SMP were hip, its competitors were Windows 2000 and 2003 :-D


Imagine a Beowulf cluster of these things!



Linux ran on 512 processor Altix machines more than a decade ago already.


Most (> 98%?) supercomputers run Linux. Has been so for many years.


Those are mostly a bazillion nodes with a couple dozen Xeon cores each, wired up together (over a very nice interconnect). Each Linux instance only has to handle the cores for a single node.

That said, I agree with you that it isn't particularly surprising that Linux handles lots of cores fine, it's been run on more esoteric hardware for ages.


Ironically though, many of those computers use IBM tooling, like the xlc toolchain.


Has anyone tried machine learning on this or the Graviton2?

I understand TPUs and GPUs easily beat these guys but it would still be interesting to see what raw cpu power can achieve in 2020.


> Has anyone tried machine learning on this or the Graviton2?

I have not done any machine learning on AWS Graviton2 CPUs but I ran many other CPU benchmarks on Graviton2 CPUs and overall I have been disappointed by their performance. They are still much slower than current x64 CPUs (x64 CPUs are up to 2x faster in single thread mode).

According to the benchmarks from AnandTech, the Ampere Altra should have much better performance than Graviton2 CPUs, as its performance is neck and neck with the fastest x64 CPUs.


er.

So how much for a 1U chassis, 512GB RAM, 2 x 1TB NVMe, 4 x 10GbE NICs, dual PSU? What can I run on this? Proxmox? Kubernetes?

Or a 4U chassis, 128GB RAM, 2 x 128GB NVMe, 4 x 10GbE NICs, 100 x 16TB SATA? What can I run on this? CentOS? Ubuntu? ZFS? Gluster?


https://store.avantek.co.uk/ampere-altra-mt-jade.html

They've only got the 2U and no SATA, but that page prices up the rest of your specs.

You can run CentOS, Ubuntu, ZFS and Gluster.


This article is the first CPU review I've seen that lacks both benchmark numbers and information about the manufacturer/process node. Is there a more technical document describing the Altra?


Did you go to page 2? There are several pages with lots of benchmark numbers.


7nm.


32 cores for $800: feasible for a nice home server rig.


Or (as https://news.ycombinator.com/user?id=thinkmassive pointed out) wait for SOPine ClusterBoard https://pine64.com/product/clusterboard-with-7-sopine-comput... and get 28 cores for $310. ($100 board, $30 per A64 compute module once it comes out, seven modules on a board.)


But really, in what way would 32 cores be useful in a home server, except for running a bunch of VMs that do nothing, or donating money in the form of electricity to Folding@home?


When doing my home programming projects, I regularly max out my 16-core / 32-thread Threadripper for minutes, or even hours. If you're not taking minutes to hours per programming task, you aren't really pushing the limits of modern computers. High-performance GPU and/or CPU is great.

Things like "what's the best 32-bit number for my custom random number generator" ?? (Try all 4-billion+ 32-bit numbers and run statistics on them, then sort the results). Except not really, that kind of search takes less than 10 seconds now, lol. But that should give you an idea of the size / scale of modern CPU power.

Or searching for chess or other AI search tasks. Or deep learning. Or raytracing. Or LTSpice simulations. Or... you get the gist.
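
For a flavour of what that kind of embarrassingly parallel sweep looks like, here's a rough OpenMP sketch; score() is a made-up stand-in for whatever RNG statistic you'd actually compute (build with -O2 -fopenmp):

    /* Sweep the entire 32-bit space in parallel and keep the best-scoring value.
       score() is a placeholder; substitute real RNG statistics. */
    #include <stdint.h>
    #include <stdio.h>

    static double score(uint32_t x) {
        /* placeholder metric (GCC/Clang builtin), not a real RNG test */
        return -__builtin_popcount(x ^ 0x9E3779B9u);
    }

    int main(void) {
        double best = -1e300;
        uint32_t best_x = 0;

        #pragma omp parallel
        {
            double lbest = -1e300;
            uint32_t lx = 0;

            #pragma omp for schedule(static)
            for (int64_t i = 0; i <= 0xFFFFFFFFLL; i++) {
                double s = score((uint32_t)i);
                if (s > lbest) { lbest = s; lx = (uint32_t)i; }
            }

            #pragma omp critical
            if (lbest > best) { best = lbest; best_x = lx; }
        }
        printf("best candidate: 0x%08x (score %.1f)\n", best_x, best);
        return 0;
    }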


Beefy CPU, couple TB of HDDs and an SSD to boot off of: (comfortably) host your own private cloud. And probably a <whatever social/messaging solution you prefer> instance for your friends & family too if you're feeling adventurous. Via authenticated Tor Hidden Services, for optimum easy-safe ratio.

Now the value of a private cloud is a subjective question, true, but personal experience suggests having it right there on the home LAN encourages practical use, especially in a shared flat.


All of that could be trivially done on a Raspberry Pi for $30 instead of a massively overspecced ARM server for $10,000.


I agree, that's what I'm rocking: a bunch of 8-gig Pis doing everything from RAM, and a Threadripper DB server. All at home, all for <$3k up front. The equivalent monthly bill at a major hoster would be >$200/month more than electricity plus the cost of the <low single digit> hours needed.

The same setup with a more reasonably priced consumer proc (but no ECC) would be around $1.5k up front.


It would chew through large compiles (e.g., kernel, BSD userspace, anything Gentoo) in no time flat. Especially if you load the thing up with RAM to use as a tmpfs...


Since it is ARM, running them at 0% would use little electricity :)


I assume that because of leakage current, the baseline power usage at low load isn't all that different for the same amount of chip real estate.


Is this capable of doing hardware transcoding?



