
The actual title seems to be "Testing AMD’s Bergamo: Zen 4c Spam", which I really like because, from the perspective of 20 years ago, this would feel a bit like "spam" or a CPU-core "Zergling rush".

As I said before, I do believe that this is the future of CPU cores. [1] With RAM latency not really having kept pace with CPUs, ever more performant cores really seem like a waste. In a cloud setting, where you always have some work to do, it seems like simpler cores, but more of them, is really the answer. It's in that environment that the weight of x86's legacy will catch up with us and we'll need to get rid of all the transistors wasted on decoding cruft.

https://news.ycombinator.com/item?id=40535915




I largely agree with you, but funnily enough the very same blog has a great post on the x86 decoding myth https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-doesnt-...


I'm not sure I agree with that. The unknown length of instructions in x86 does make decoders more power hungry. There's no doubt about that. That is a big problem for the power efficiency of x86, and the blog doesn't address it at all.

The M1 really is a counterexample to the theory Jim is putting forward. The real proof would be if the same results were also reproduced on an M1 instead of Zen.


> The unknown length of instructions in x86 does make decoders more power hungry. There's no doubt about that.

I have doubts about that. I-cache word lines are much larger than instructions anyway, and it was the reduction in memory fetch operations that made THUMB more energy-efficient on ARM (and even there, there's lots of discussion on whether that claim holds up). And if you're going for fixed-width instructions then many instructions will use more space than they use now, reducing the overall effectiveness of the I-cache.

So even if you can prove that a fixed-size decoder uses less power, you will still need to prove that that gain in decoder power efficiency is greater than the increased power usage due to reduced instruction density and accompanying loss in cache efficiency.


It's the width of multiplexing that has to happen between having a fetch line and extracting a single instruction. As an instruction can start at many different locations, you need to be able to shift down from all those positions.

That's not too bad for the first instruction in a line, but the second instruction's position is dependent on how the first instruction decodes, the third on the second, etc. So it's not only a big multiplexer tree, but a content-dependent multiplexer tree. If you're trying to unpack multiple instructions each clock (of course you are, you've got six schedulers to feed), then that's a big pile of logic.

Even RISC-V has this problem, but there they've limited it to two sizes of instruction (2 and 4 bytes), and the size is in the first 2 bits of each instruction (so no fancy decode needed)
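To make that concrete, here's a rough sketch (in Rust, purely for illustration) of why the RISC-V scheme stays cheap: classifying each 16-bit parcel's length needs only its low 2 bits and can happen for every parcel in parallel, while working out which parcels actually start an instruction is still a serial, if cheap, walk.

    // Sketch only: RISC-V C-extension length rule. A parcel whose low two bits
    // are 0b11 starts a 4-byte instruction; anything else starts a 2-byte one.
    fn rvc_insn_len(parcel: u16) -> usize {
        if parcel & 0b11 == 0b11 { 4 } else { 2 }
    }

    // Resolving instruction starts is still a serial prefix walk: the start of
    // instruction N depends on the length of instruction N-1.
    fn instruction_starts(parcels: &[u16]) -> Vec<usize> {
        let mut starts = Vec::new();
        let mut i = 0;
        while i < parcels.len() {
            starts.push(i * 2); // byte offset within the fetch line
            i += rvc_insn_len(parcels[i]) / 2; // skip one or two parcels
        }
        starts
    }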


Jim was involved in the early versions of Zen & M1, I believe he knows.

Apple's M series looks very impressive because, typically, at launch they are a node ahead of the competition; early-access deals with TSMC are the secret weapon, and this buys them about 6 months. They are also primarily laptop chips; AMD has competitive technology but always launches its low-power chips after the desktop and server parts.


> early-access deals with TSMC are the secret weapon, and this buys them about 6 months

Aren't Apple typically 2 years ahead? The M1 came out in 2020; other CPUs on the same node (TSMC 5 nm) came out in 2022. If you mean Apple launches 6 months ahead of the rest of the industry getting onto the previous node, sure, but not the current node.

What you may be thinking of is that AMD's 7 nm is comparable to Apple's 5 nm, but really what you should compare is today's AMD CPUs with the Apple CPUs from 2022, since they are on the same architecture.

But yeah, all the impressive bits about Apple performance disappear once you take architecture into account.


> but really what you should compare is today's AMD CPUs with the Apple CPUs from 2022, since they are on the same architecture

There only seem to be comparisons between laptop CPUs, which are quite limited.


Same node. Not same architecture.


Not necessarily. Qualcomm just released its Windows chips, and in the benchmarks I've seen, it loses to the M1 in power efficiency despite being built on a more advanced node, performing much closer to the Intel and AMD offerings.

Apple is just that good.


I agreed with you up until the x86 comment. Legacy x86 support is largely a red herring. The constraints are architectural (as you noted, per-core memory bandwidth, plus other things) more than they are due to being tied down to legacy instruction sets.


If the goal ends up being many-many-core, x86's complexity tax may start to matter. The cost of x86 compatibility relative to all the complexities required for performance has been small, but if we end up deciding that memory latency is going to kill us and we can't keep complex cores fed, then that is a vote for simpler architectures.

I suspect the future will be something between these extremes (tons of dumb cores or ever-more-complicated cores to try and squeeze out IPC), though.


The x86 tax ALREADY matters. Intel was only able to increase the number of decoders by adding entirely independent decoder complexes while reducing the number of decoders per branch.

In practice, this means that decode throughput increases for branchy code, but non-branchy code will be limited to just 3 decoders. In contrast, an ARM X4 or Apple M4 can decode 10 instructions under all conditions.

This also plays into ideas like the Larrabee/Knights processors, where you basically want the tiniest core possible attached to a massive SIMD engine. x86 decode eats up a lot of real estate. Even worse, x86 decode adds a bunch of extra stages, which in turn increases the size of the branch predictor.

That's not the biggest issue though. Dealing with all of this and all the x86 footguns threaded throughout the pipeline slows down development. It takes more designers and more QA engineers more time to make and test everything. ARM can develop a similar CPU design for a fraction of the cost compared to AMD/Intel, and in less time too, because there are simply fewer edge cases they have to work with. This ultimately means the ARM chips can be sold for significantly less money or at higher margins.


Then show me the low-cost ARM version of AMD's 96-core Threadripper.


Ampere One was supposed to be out by now.


In the talk given by the lead architect for Skymont, they implied it could decode 9 under all conditions, not just when there are heavy branches.


The fun thing with branch predictors is that they tell you where the next branch is (among other things like the direction of the branch). Since hardware is built out of finite wires, the prediction will saturate to some maximum distance (something in the next few cache lines).

How this affects decode clusters is left as an exercise to the reader.


> In contrast, an ARM X4 or Apple M4 can decode 10 instructions under all conditions.

Not if there are fewer than 10 instructions between branches...


Each core needs to handle the full complexity of x86. Now, as super-scalar OoO x86 cores have evolved the percentage of die allocated to decoding the cruft has gone down.

…but when we start swarming simple cores, that cost starts to rise. Each core needs to be able to decode everything. Now, when you have 100 cores, even if the cruft is just 4%, that means you could have 4 more cores. This is for free if you are willing to recompile your code.

Now, it may turn out that we need more decoding complexity than something like RISC-V currently has (Qualcomm has been working on it), but those would be deliberate, intentional choices instead of accrued ones, made for the needs and trade-offs of today, not those of the early '80s.


As a developer of fairly standard software, there's very little I can say I rely on from the x86/x64 ISA.

One big one is probably the consistency model [1], which affects atomic operations and synchronizing multi-threaded code. Usually not directly, though; I typically use libraries or OS primitives.

Are there any non-obvious (to me anyway) ways us "typical devs" rely on x86/x64?

I get the sense that a lot of software is one recompile away from running on some other ISA, but perhaps I'm overly naive.

[1]: https://en.wikipedia.org/wiki/Consistency_model
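As a minimal illustration of the consistency-model point (a Rust sketch with made-up names): with acquire/release ordering this is portable, whereas sloppier code that used Relaxed on the flag could happen to work on x86's stronger TSO memory model and then fail intermittently on ARM.

    use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
    use std::thread;

    static DATA: AtomicU32 = AtomicU32::new(0);
    static READY: AtomicBool = AtomicBool::new(false);

    fn main() {
        let writer = thread::spawn(|| {
            DATA.store(42, Ordering::Relaxed);
            READY.store(true, Ordering::Release); // publish the data
        });
        let reader = thread::spawn(|| {
            while !READY.load(Ordering::Acquire) {} // wait for publication
            // Guaranteed by the release/acquire pair on READY, on any ISA.
            assert_eq!(DATA.load(Ordering::Relaxed), 42);
        });
        writer.join().unwrap();
        reader.join().unwrap();
    }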


> Are there any non-obvious (to me anyway) ways us "typical devs" rely on x86/x64?

Generally the answer is "we bought this product 12 years ago and it doesn't have an ARM version". Or variants like "We can't retire this set of systems which is still running the binary we blessed in this other contract".

It's true that no one writing "fairly standard software" is freaking out over the inability to work on a platform without AVX-VNNI, or deals with lockless algorithms that can't be made feasibly correct with only acquire/release memory ordering semantics. But that's not really where the friction is.


Yeah, I was just trying to check for a blind spot. In these cloudy days it seems nearly trivial for a lot of workloads, but like I said, perhaps I had missed something.

For us the biggest obstacle is that our compiler doesn't support anything but x86/x64. But we're moving to .Net so that'll solve itself.


A lot of systems are “good enough” and run flawlessly for years/decades so unless you have a good business case you won’t be able to move from x86 to ARM or the new RISC open stuff because the original system cost a couple million dollars.


"good enough" but made a decade ago would run fine in an emulator, while much more instrumentalized and under control than if running directly on hardware.


That was true through the 90's, but not anymore. A typical datacenter unit from 2014 would have been a 4-socket Ivy Bridge-EP with 32ish ~3 GHz cores. You aren't going to emulate that feasibly with equivalent performance on anything modern, not yet.


Cycle exact? Sure. But what are the odds you need that for some x86 software made in 2014?

Otherwise, via translation techniques, getting 80% of native performance isn't unheard of. Which would be very fast relative to any such Ivy Bridge server.

Transitioning away from x86 definitely is feasible, as very successfully demonstrated by Apple.


> Legacy x86 support

... is slowly disappearing. Even on Windows 10 it is very hard to run Win32 programs from the Win95/Win98 era.


People have been saying this for 20ish years (or probably much longer) - more, simpler cores are the future.

In my experience, people just don't know how to build multi-threaded software and programming languages haven't done all that much to support the paradigm.

Multi threading is still the domain of gnarly bugs, and specialists writing specialist software.

The only kind of forward-looking thing I've seen in this area is the Bend language, which has been making strides over the last couple of months.

And besides all that, Amdahl's law still exists - if 10% of the program cannot be parallelized, you're going to get a 10x speedup at most. Consumer-grade chips already tend to have that many cores.
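For reference, Amdahl's law says the speedup with serial fraction s on n cores is 1 / (s + (1 - s) / n); a quick sketch with s = 0.1 shows how the curve flattens well below the 10x ceiling.

    // Amdahl's law: speedup on n cores with serial fraction s.
    fn amdahl(s: f64, n: f64) -> f64 {
        1.0 / (s + (1.0 - s) / n)
    }

    fn main() {
        // With s = 0.10 this prints roughly 3.08x, 4.71x, 8.77x, 9.91x.
        for n in [4.0, 8.0, 64.0, 1024.0] {
            println!("{n} cores -> {:.2}x", amdahl(0.10, n));
        }
    }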


> In my experience, people just don't know how to build multi-threaded software and programming languages haven't done all that much to support the paradigm.

Go? Rust? Any functional language?


Go just turned a bunch of library constructs (green threads, queues) into language keywords; many languages have the same stuff with slightly more involved syntax.

Rust, in my opinion, is the biggest admission of failure of modern multithreaded thinking, with types like 'X, but single-threaded' and 'X, but slower and working with multiple threads', requiring a complex rewrite for the code to even compile. It moves all the mental burden of threading onto the programmer.

CPUs have the ability to automatically execute non-dependent instructions in parallel, at the finest granularity. Yet if we look at a higher level, at the level of functions and operational blocks of a program, there's almost nothing production-grade that can say: hey, you sort arrays A and B and then compare them, so let's run these two sorts in parallel.
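To make the "sort A and B in parallel" example concrete: today you have to spell that parallelism out by hand. A small Rust sketch with scoped threads (the function is hypothetical, just for illustration):

    // Nothing production-grade infers this for you; the programmer writes it.
    fn sorted_equal(mut a: Vec<i32>, mut b: Vec<i32>) -> bool {
        std::thread::scope(|s| {
            s.spawn(|| a.sort()); // the two sorts touch disjoint data,
            s.spawn(|| b.sort()); // so they can safely run in parallel
        }); // the scope joins both threads here
        a == b
    }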


Writing multithreaded programs does not mean that people know how to do this.

Just fire up the Windows Process Explorer and look at the CPU graphs.


> Multi threading is still the domain of gnarly bugs, and specialists writing specialist software.

1) In the cloud there are always more requests to serve; each request can still be serial. 2) Stuff like reactive streams allows for parallelisation. Independent threads acquiring locks will forever be difficult, but there are other models that are easier and are getting adopted.


> Multi threading is still the domain of gnarly bugs, and specialists writing specialist software.

It is not even there. Windows (7, 10) has difficulties splitting jobs between I/O and the processor. Simulations take hours because of this and because Windows likes to migrate tasks from one core to another.


I haven't written low-level code for Windows for a while, but I recall that all Windows I/O since the NT days has been asynchronous at the kernel level. An I/O thread is purely a user-space construct.

In Linux, I/O threads are real, with true asynchronous I/O only being recently introduced with io_uring.


> As I said before, I do believe that this is the future of CPU cores

It is not. Or at least not the future, singular. Many applications still favor strong single-core performance, which means in, say, a 64-core CPU, ~56 (if not more) of them will be twiddling their thumbs.

> It's in that environment that the weight of x86's legacy will catch up with us and we'll need to get rid of all the transistors wasted on decoding cruft.

This very same site has a well-known article named “ISA doesn’t matter”. As noted though, with many-core, having to budget decoder silicon/power might start to matter enough.


Why does everyone keep repeating this mantra? I wrote the x86 decoder for https://github.com/jart/blink which is based off Intel's xed disassembler. It's so tiny to do if you have the know-how.

    master jart@studio:~/blink$ ls -hal o/tiny/blink/x86.o
    -rw-r--r-- 1 jart staff 23K Jun 22 19:03 o/tiny/blink/x86.o
Modern microprocessors have 100,000,000,000+ transistors, so how much die space could 184,000 bits for x86 decoding really need? What proof is there that this isn't just some holy war over the user-facing design? The stuff that actually matters is probably just memory speed and other chip internals, and companies like Intel, AMD, NVIDIA, and ARM aren't sharing that with us. So if you think you understand the challenges and tradeoffs they're facing, then I'm willing to bet it's just false confidence and peanut-gallery consensus, since we don't know what we don't know.


Decoding 1 x86 instruction per cycle is easy. That's solved like 40 years ago.

The problem is that a superscalar CPU needs to decode multiple x86 instructions per cycle. I think the latest Intel big-core pipeline can do (IIRC) 6 instructions per cycle, so to keep the pipeline full the decoder MUST be able to decode 6 per cycle too.

If it's ARM, it's easy to do multiple decode. The M1 does (IIRC) 8 per cycle easily, because the instruction length is fixed: the first decoder starts at PC, the second starts at PC+4, etc. But x86 instructions are variable length, so after the first decoder decodes the instruction at IP, where does the second decoder start?
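A toy sketch of the difference (the insn_len helper below is hypothetical, standing in for a full x86 length decoder):

    // Fixed 4-byte ISA: every decode lane's start address is known immediately.
    fn fixed_width_starts(pc: u64, lanes: u64) -> Vec<u64> {
        (0..lanes).map(|i| pc + 4 * i).collect()
    }

    // Variable-length x86: lane i's start is only known after lane i-1's length
    // has been worked out (prefixes, opcode, ModRM, SIB, ...), so a naive wide
    // decoder degenerates into a serial chain.
    fn variable_width_starts(bytes: &[u8], lanes: usize,
                             insn_len: impl Fn(&[u8]) -> usize) -> Vec<usize> {
        let mut starts = Vec::new();
        let mut off = 0;
        for _ in 0..lanes {
            if off >= bytes.len() { break; }
            starts.push(off);
            off += insn_len(&bytes[off..]);
        }
        starts
    }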


It isn't quite that bad. The decoders write stop bits back into the L1I, to demarcate where the instructions align. Since those bits aren't indexed in the cache and don't affect associativity, they don't really cost much: a handful of 6T SRAM cells per cache line.


I would have assumed it just decodes the x86 into a 32-bit ARM-like internal ISA, similar to how a JIT works in software. x86 decoding is extremely costly in software if you build an interpreter, probably like 30%, and that's assuming you have a cache. But with JIT code morphing in Blink, the decoding cost drops to essentially nothing. As best as I understand it, all x86 microprocessors since the NexGen Nx586 have worked this way too. Once you're code morphing the frontend user-facing ISA, a much bigger problem rears its ugly head, which is the 4096-byte page size. That's something Apple really harped on with their M1 design, which increased it to 16 KB. It matters since morphed code can't be connected across page boundaries.


It decodes to uOPs optimized for the exact microarchitecture of that particular CPU. High performance ARM64 designs do the same.

But in the specific case of tracking variable length instruction boundaries, that happens in the L1i cache. uOP caches make decode bandwidth less critical, but it is still important enough to optimize.


That's called the uOP cache, which Intel has been using since Sandy Bridge (and AMD too, but I don't remember off the top of my head since when). But that's more transistors for the cache and its control mechanism.


It's definitely better than what NVIDIA does, inventing an entirely new ISA each year. If the hardware isn't paying the cost for a frontend, then it shovels the burden onto software. There's a reason every AI app has to bundle a 500 MB matrix multiplication library in each download, and it's because GPUs force you to compile your code ten times for the last ten years of ISAs.


Part of it is that, but part of it is that people pay for getting from 95% optimal to 99% optimal, and doing that is actually a lot of work. If you peek inside the matrix multiplication library you'll note that it's not just "we have the best algorithm for the last 7 GPU microarchitectures" but also 7 implementations for the latest architecture, because that's just what you need to do to go fast. Kind of like how, if you take an uninformed look at glibc memcpy, you'll see there is an AVX2 path and an ERMS path, but also that it will switch between algorithms based on the size of the input. You can easily go "yeah, my SSE2 code is tiny and gets decent performance", but if you stop there you're leaving something on the table, and with GPUs it's this but even more extreme.
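A toy sketch of that dispatch idea (not glibc's actual code; the thresholds and feature flags are made up for illustration):

    // Toy dispatcher: pick a copy strategy from runtime CPU features and the
    // request size. Real libraries do this per microarchitecture and per size
    // class, which is where both the bulk and the speed come from.
    fn copy_strategy(len: usize, has_avx2: bool, has_erms: bool) -> &'static str {
        if len < 64 {
            "short copy: a few unrolled loads/stores"
        } else if has_erms && len >= 2048 {
            "rep movsb (ERMS) path"
        } else if has_avx2 {
            "32-byte AVX2 vector loop"
        } else {
            "generic fallback loop"
        }
    }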


Using the uops directly as the ISA would be a bad idea for code density. In RISC-V land, vendors tend to target standard extensions/profiles, but when their hardware is capable of other operations they often expose those through custom extensions.


IMO if the trade off is cheaper, faster hardware iteration then Nvidia’s strategy makes a lot of sense.


Chips and Cheese specifically talks about this in the article I mention[0].

x86 decoders take up a small but still significant share of the silicon and power budget, usually somewhere between 3% and 7%. Not a terrible cost to pay, but if legacy is your only reason, why keep doing so? It’s extra watts and silicon you could dedicate to something else.

[0] https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-doesnt-...


But decoders for e.g. ARM are not free either, right? Or am I misunderstanding something?


Correct. However, because ARM has fixed-length instructions, the decoder can make more assumptions, keeping it simpler.

Like I said, it's only a small amount of extra silicon you're paying the x86 tax with, but with the world mostly becoming ARM-compatible, there's no more reason to pay it.


> with many-core, having to budget decoder silicon/power might start to matter enough

That seems backwards to me. Narrower, simpler cores with fewer execution engines have a much easier time decoding. It's the combinatorics of x86's variable length instructions and prefix coding that makes wide decoders superlinearly expensive.


I apologize for removing the word spam (and apologized to C&C directly as well). I mistook it as a mistake on their part since the word "spam" was not used anywhere else in the article. They put it in there as an assumption that people would just get it and I did not. My bad!


This idea was explored over a decade ago, in the context of cloud computing: https://www.cs.cmu.edu/~fawnproj/


And almost a decade earlier, in the context of private cloud, was implemented in Sun's Niagara line of processors: https://en.wikipedia.org/wiki/UltraSPARC_T1

The UltraSPARC T1 was designed from scratch as a multi-threaded, special-purpose processor, and thus introduced a whole new architecture for obtaining performance. Rather than trying to make each core as intelligent and optimized as possible, Sun's goal was to run as many concurrent threads as possible and maximize utilization of each core's pipeline.


Something I've realised is that we're entering the era of "kilocores", where we start counting cores by the thousands, much like the early computers had kilo-words of memory. Soon... mega-cores, then giga-cores, and on!


Hate to burst your bubble, but with the end of Moore's law this seems unlikely to come to pass.


There are 256-core and 288-core server processors from AMD and Intel respectively about to ship this year. If you count hyper-threads as a "virtual core" and count all such vCPUs in a two-socket box, then we're up to 1,024 or 1,152 already. That is the number you'll see in 'top' or Task Manager, and that's what matters to running software.

Also worth noting that a high-end GPU already has 16K cores, although the definition of a "core" in that context isn't as clear-cut as with normal CPUs.

These server CPUs are still being made with 5nm or 4nm technology. Sure, that's just a marketing number, not a physical size, but the point is that there are already firm plans from both Intel and TSMC to at least double the density compared to these current-gen nodes. Another doubling is definitely physically possible, but might not be cost effective for a long time.

Still, I wouldn't be surprised to see 4K vCPUs in a single box available in about a decade in the cloud.

After that? Maybe 3D stacking or volumetric manufacturing techniques will let us side-step the scale limits imposed by the atomic nature of matter. We won't be able to shrink any further, but we'll eventually figure out how to make more complex structures more efficiently.


Yes, but that's still a far cry from a million cores. Unless we change the power requirements and heat generation fundamentally, I don't see how we could get to a point where we have a million CPU cores that look anything like we have today (you can get there with more limited cores, but my impression of OP's comment was that they would be like today's cores).


A million cores won't look like a bigger Xeon or EPYC in a socket.

It'll be a combination of nascent technologies on the cusp of viability.

First, something like this will have to be manufactured with a future process node about 2-3 generations past what is currently planned. Intel has plans in place for "18A", so we're talking something like "5A" here, with upwards of 1 trillion transistors per chip. We're already over 200 billion, so this is not unreasonable.

Power draw will have to be reduced by switching materials to something like silicon-carbide instead of pure silicon.

Then this will have to use 3D stacking of some sort and packaging more like DIMMs instead of a single central CPU. So dozens of sockets per box with much weaker memory coherency guarantees. We can do 8-16 sockets with coherent memory now, and we're already moving towards multiple chiplets on a board that is itself a lot like a large chip with very high bandwidth interconnect. This can be extended further, with memory and compute interleaved.

Some further simplifications might be needed. This might end up looking like a hybrid between GPUs and CPUs. An example might be the IBM POWER server CPUs, some of which have 8 hyper-threads per physical core. Unlike POWER, getting to hundreds of kilocores or one megacore with general-purpose compute might require full-featured but simple cores.

Imagine 1024 compute chiplets, each with 64 GB of local memory layered on top. Each chiplet has 32 simple cores, and each core has 8 hyper threads. This would be a server with 64 TB of memory and 256K vCPUs. A single big motherboard with 32 DIMM-style sockets each holding a compute board with 32 chiplets on it would be about the same size as a normal server.
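Quick arithmetic check on that hypothetical box (a throwaway sketch):

    fn main() {
        let sockets = 32u64;
        let chiplets = sockets * 32;          // 1,024 chiplets
        let memory_tb = chiplets * 64 / 1024; // 64 TB of local memory
        let vcpus = chiplets * 32 * 8;        // 32 cores x 8 threads = 262,144 (~256K)
        println!("{chiplets} chiplets, {memory_tb} TB, {vcpus} vCPUs");
    }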



