Zen, CUDA, and Tensor Cores, Part I: The Silicon (computerenhance.com)
175 points by throwaway71271 82 days ago | 35 comments



The answer to the leading question "What’s the difference between a Zen core, a CUDA core, and a Tensor core?" is not covered in Part 1, so you may want to wait if this interests you more than chip layouts.


Here's my quick take.

A top of the line Zen core is a powerful CPU with wide SIMD (AVX-512 is 16 lanes of 32 bit quantities), significant superscalar parallelism (capable of issuing approximately 4 SIMD operations per clock), and a high clock rate (over 5GHz). There isn't a lot of confusion about what constitutes a "core," though multithreading can inflate the "thread" count. See [1] for a detailed analysis of the Zen 5 line.

A single Granite Ridge core has peak 32 bit multiply-add performance of about 730 GFLOPS.

Nvidia, by contrast, uses the marketing term "core" to refer to a single SIMD lane. Their GPUs are organized as 32 SIMD lanes grouped into each "warp," and 4 warps grouped into a Streaming Multiprocessor (SM). CPU and GPU architectures can't be directly compared, but just going by peak floating point performance, the most comparable granularity to a CPU core is the SM. A warp is in some ways more powerful than a CPU core (generally wider SIMD, larger register file, more local SRAM, better latency hiding) but in other ways less (much less superscalar parallelism, lower clock, around 2.5GHz). A 4090 has 128 SMs, which is a lot and goes a long way to explaining why a GPU has so much throughput. A 1080, by contrast, has 20 SMs - still a goodly number but not mind-meltingly bigger than a high end CPU. See the Nvidia Ada whitepaper [2] for an extremely detailed breakdown of 4090 specs (among other things).

A single Nvidia 4090 "core" has peak 32 bit multiply-add performance of about 5 GFLOPS, while an SM has 640 GFLOPS.
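(For anyone who wants to check my arithmetic, here's the back-of-envelope version in Python. The ~2.5 GHz clock is approximate, and I'm counting an FMA as 2 flops.)

    # Rough peak FP32 multiply-add throughput: lanes * 2 flops per FMA * clock in GHz.
    def peak_gflops(lanes, clock_ghz, flops_per_lane_per_clock=2):
        return lanes * flops_per_lane_per_clock * clock_ghz

    print(peak_gflops(1, 2.5))                 # ~5 GFLOPS: one Nvidia "core" (one SIMD lane)
    print(peak_gflops(128, 2.5))               # ~640 GFLOPS: one SM (128 lanes)
    print(peak_gflops(128 * 128, 2.5) / 1000)  # ~82 TFLOPS: a 4090 with 128 SMs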

I don't know anybody who counts tensor cores by core count, as the capacity of a "core" varies pretty widely by generation. It's almost certainly best just to compare TFLOPS - also a bit of a slippery concept, as that depends on the precision and also whether the application can make use of the sparsity feature.

I'll also note that not all GPU vendors follow Nvidia's lead in counting individual SIMD lanes as "cores." Apple Silicon, by contrast, uses "core" to refer to a grouping of 128 SIMD lanes, similar to an Nvidia SM. A top of the line M2 Ultra contains 76 such cores, for 9728 SIMD lanes. I found Philip Turner's Metal benchmarks [3] useful for understanding the quantitative similarities and differences between Apple, AMD, and Nvidia GPUs.

[1]: http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardo...

[2]: https://images.nvidia.com/aem-dam/Solutions/Data-Center/l4/n...

[3]: https://github.com/philipturner/metal-benchmarks


An x64 core roughly corresponding to an SM, or in the amdgpu world a compute unit (CU), seems right. It's in the same ballpark for power consumption, and it's the component that handles an instruction pointer, a local register file, and so forth.

A really big CPU is a couple of hundred cores; a big GPU is a few hundred SMs / CUs. Some low-power chips have 8 x64 cores and 8 CUs on the same package. It all roughly lines up.


If SIMD lanes come to vastly dominate the composition of a typical computer chip (in terms, e.g., of where power is consumed) will the distinction between CPU/GPU continue to be meaningful?

For decades the GPU was "special purpose" hardware dedicated to the math of screen graphics. If the type of larger-scale numerical computation that has been popularised with LLMs is now deemed "typical use", then the distinction may be becoming irrelevant (and even counterproductive from a software development perspective).


Xeon Phi was that, a bunch of little cores with a ton of SIMD lanes each.

It didn’t really work out, in part because it was too far from a regular old Xeon to run your normal code well without optimizing. On the other side, Intel couldn’t keep up with NVIDIA on the metrics people care about for these compute accelerators: memory bandwidth, mostly. If you are going to have to refactor your whole project anyway to use a compute accelerator, you probably want a pretty big reward. It isn’t obvious (to me at least) whether this is because the Phi cores, simple as they were, were still a lot more complex than a GPU “core,” so the design had various hidden bottlenecks that were too hard to work out due to that complexity, or because Intel just wasn’t executing very well at the time, especially compared to NVIDIA (it was Intel’s dark age vs NVIDIA’s golden age, really). The programmer’s “logical or” joke is a possible answer here.

But, you can’t do everything in parallel. It is a shame the Phi didn’t survive into the age where Intel is also doing big/little cores (in a single chip). A big Xeon core (for latency) surrounded by a bunch of little Phi cores (for throughput) could have been a really interesting device.


The special purpose graphics distinction is already mostly irrelevant and has been for 10 or 20 years for anyone doing High Performance Computing (HPC) or AI. It predates LLMs. For a while we had the acronym GPGPU - General Purpose computing on Graphics Processing Units [1]. But even that is now an anachronism, it started dying in 2007 when CUDA was released. With CUDA and OpenCL and compute shaders all being standard, it is now widely understood that today’s GPUs are used for general purpose compute and might not do any graphics. The bulk of chip area is general purpose and has been for some time. From a software development perspective GPU is just a legacy name but is not causing productivity problems or confusion.

To be fair, yes, most GPUs still do come with things like texture units, video transcode units, ray tracing cores, and a framebuffer and video output. But that’s already changing: you have, for example, some GPUs with ray tracing and some without that are designed more for data centers. And you don’t have to use the graphics functionality; on GPU supercomputers it’s common for the majority of GPU nodes to be compute-only.

In the meantime we now have CPUs with embedded GPUs (aka iGPUs), GPUs with embedded CPUs, GPUs that come paired with CPUs and a wide interconnect (like Nvidia Grace Hopper), CPU-GPU chips (like Apple M1), and yes, CPUs in general have more and more SIMD.

It’s useful to have a name or a way to distinguish between a processor that mostly uses a single threaded SISD programming model and has a small handful of hardware threads, versus a processor that uses a SIMD/SIMT model and has tens of thousands of threads. That might be mainly a question of workloads and algorithms, but the old line between CPU and GPU is very blurry, headed towards extinction, and the “graphics” part has already lost meaning.

[1] https://en.wikipedia.org/wiki/General-purpose_computing_on_g...


The display controller, which handles the frame buffers and the video outputs, and the video decoding/encoding unit are two blocks that are usually well separated from the remainder of the GPU.

In many systems-on-a-chip, these 3 blocks (the GPU in the strict sense, the video decoder/encoder, and the display controller) may even be licensed from different IP vendors and then combined in a single chip. Also in CPUs with integrated GPUs, like Intel Lunar Lake and AMD Strix Point, these 3 blocks can be found in well separated locations on the silicon die.

The graphics-specific functions that do belong in the GPU proper, because their operations are interleaved with the general-purpose computations done by shaders, are the ray-tracing units, the texture units and the rasterization units.


The x64 cores putting more hardware into the vector units, and amdgpu changing from 64-wide to 32-wide SIMD (at least for some chips), look like convergent evolution to me. My personal belief is that the speculation-and-pipelining approach is worse than having many tasks and swapping between them.

I think the APU designs from AMD are the transition pointing to the future. The GPU cores will gain increasing access to the raw hardware and the user interface until the CPU cores are optional and ultimately discarded.


There is little relationship between the reasons that determine the width of SIMD in CPUs and GPUs, so there is no convergence between them.

In the Intel/AMD CPUs, the 512-bit width, i.e. 64 bytes or 16 FP32 numbers, matches the width of the cache line and the width of a DRAM burst transfer, which simplifies the writing of optimized programs. This SIMD width also provides a good ratio between the power consumed in the execution units and the power wasted in the control part of the CPU (around 80% of the total power consumption goes to the execution units, which is much more than when using narrower SIMD instructions).
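Spelled out, just the arithmetic of that width match:

    # AVX-512 register width vs. the x86 cache line (both 64 bytes).
    simd_bits, fp32_bits, cache_line_bytes = 512, 32, 64
    assert simd_bits // 8 == cache_line_bytes    # one 512-bit load/store spans exactly one line
    print(simd_bits // fp32_bits, "FP32 lanes")  # 16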

Increasing the SIMD width more than that in CPUs would complicate the interaction with the cache memories and with the main memory, while providing only a negligible improvement in the energy efficiency, so there is no reason to do this. At least in the following decade it is very unlikely that any CPU would increase the SIMD width beyond 16 FP32 numbers per operation.

On the other hand, the AMD GPUs before RDNA had a SIMD width of 64 FP32 numbers, but the operations were pipelined and executed in 4 clock cycles, so only 16 FP32 numbers were processed per clock cycle.

RDNA doubled the width of the SIMD execution, processing 32 FP32 numbers per clock cycle. For this, SIMD instructions with a reduced width of 32 FP32 were introduced, executed in one clock cycle, versus the old 64 FP32 instructions that were executed in four clock cycles. For backwards compatibility, RDNA kept the 64 FP32 instructions, executed in two clock cycles, but these were not recommended for new programs.

RDNA 3 has changed all this again, because now the 64 FP32 instructions can sometimes be executed in a single clock cycle, so they may again be preferable to the 32 FP32 instructions. However, it is also possible to take advantage of the increased width of the RDNA 3 SIMD execution units when using 32 FP32 instructions, if certain new instructions that encode dual operations are used.

So the AMD GPUs have continuously evolved towards wider SIMD execution units, from 16 FP32 before RDNA, to 32 FP32 in RDNA and finally to 64 FP32 in RDNA 3.
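A simplified sketch of that progression, in FP32 lanes retired per clock by one SIMD unit (RDNA 3's single-cycle wave64 applies only to eligible instructions):

    # FP32 lanes per clock from one SIMD unit, per AMD GPU generation (simplified).
    generations = {
        "GCN (pre-RDNA)": (64, 4),  # wave64 pipelined over 4 clocks
        "RDNA / RDNA 2":  (32, 1),  # wave32 in a single clock
        "RDNA 3":         (64, 1),  # wave64 in one clock, or dual-issued wave32
    }
    for name, (wave, clocks) in generations.items():
        print(f"{name}: {wave // clocks} FP32 per clock")   # 16, 32, 64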

The distance from CPUs has been steadily increasing, there is no convergence.


There are still a lot of differences, even if you put a lot more SIMD lanes into the CPU. CPUs keep their execution resources fed by aggressive caching, prefetching, and out-of-order execution, while GPUs rely on having lots of threads around so that if one stalls another is able to execute.


Hi Raph, first of all thank you for all of your contributions and writings - I've learned a ton from reading your blog!

A minor quibble amidst your good comparison above ;)

For a Zen 5 core, we have 16-wide SIMD with 4 pipes; 2 are FMA (2 flops) and 2 are FADD, at ~5 GHz. I math that out to 16 * 6 * 5 = 480 GFLOP/core... am I missing something?


According to the initial reviews, it appears that when 512-bit instructions are executed at the maximum rate, power consumption increases enough that the clock frequency drops to around 4 GHz for a 9950X.

So a 9950X can do 256 FMA + 256 FADD for FP64, or 512 FMA + 512 FADD for FP32, per clock cycle (totals across all 16 cores).

Using FP32, because it can be compared with the GPUs, that is 1536 flops per clock cycle, therefore about 6 FP32 Tflop/s @ 4 GHz for a 9950X (around 384 FP32 Gflop/s per core, but this number is not very relevant, because a single active core would run at a much higher clock frequency, probably over 5 GHz). For an application that uses only FMA, like matrix multiplication, the throughput drops to around 4 FP32 Tflop/s or 2 FP64 Tflop/s.
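Spelling that out (a rough sketch, assuming the ~4 GHz all-core clock and 2 FMA + 2 FADD 512-bit pipes per core):

    # Back-of-envelope peak FP32 throughput for a 9950X under all-core AVX-512 load.
    cores, clock_ghz = 16, 4.0
    fma_lanes_per_core  = 32   # 2 x 512-bit FMA pipes  = 2 * 16 FP32 lanes
    fadd_lanes_per_core = 32   # 2 x 512-bit FADD pipes = 2 * 16 FP32 lanes

    flops_per_clock = cores * (fma_lanes_per_core * 2 + fadd_lanes_per_core * 1)
    print(flops_per_clock)                                                       # 1536
    print(flops_per_clock * clock_ghz / 1000, "FP32 Tflop/s")                    # ~6.1
    print(cores * fma_lanes_per_core * 2 * clock_ghz / 1000, "Tflop/s FMA-only") # ~4.1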

The values for the FP32 throughput are similar to those of the best integrated GPUs that exist at this time. Therefore doing graphics rendering on the CPU of a 9950X might be similarly fast to doing it on the iGPU of the best mobile CPUs. Graphics rendering on a 9950X can still leverage the graphics- and video-specific blocks contained in the anemic GPU included in the 9950X, whose only problem is that it has a very small number of compute shaders, but their functions can be augmented by the strong CPU.


Thanks for the kind words and the clarification. I'm sure you're right; I was just multiplying things together without taking into account the different capabilities of the different execution units. Hopefully that doesn't invalidate the major points I was making.


For those of us not fluent in codenames:

Granite Ridge core = Zen 5 core.


> It's almost certainly best just to compare TFLOPS

Depends on what you're comparing with what, and the context, of course.

Casey is doing education, so that people learn how best to program these devices. A mere comparison of TFLOPS of CPU vs GPU would be useless towards those ends. Similarly, just a bare comparison of TFLOPS between different GPUs even of the same generation would mask architectural differences in how to in practice achieve those theoretical TFLOPS upper bounds.

I think Casey believes most people don't know how to program well for these devices/architectures. In that context, I think it's appropriate to be almost dismissive of TFLOPS comparison talk.


> Depends on what you're comparing with what, and the context, of course.

Agreed

It's a classic question of what is an "embarrassingly parallel" problem (e.g. physics calculations, image rendering, LLM image creation or textual content generation) and what is not.


> It's almost certainly best just to compare TFLOPS - also a bit of a slippery concept, as that depends on the precision

Agreed. Some quibbles about the slipperiness of the concept.

Flops are floating point operations. IMO it should not be confusing at all: just count single-precision floating point operations, which all devices can do, and which are explicitly defined in the IEEE 754 standard.

Half precision flops are interesting but should be called out for the non-standard metric they are. Anyone using half precision flops as a flop is either being intentionally misleading or is confused about user expectations.

On the other side, lots of scientific computing folks would rather have doubles, but IMO we should get with the times and learn to deal with less precision. It is fun, you get to make some trade-offs and you can see if your algorithms are really as robust as you expected. A free 2x speed up even on CPUs is pretty nice.

> and also whether the application can make use of the sparsity feature

Eh, I don’t like it. Flops are flops. Avoiding a computation exploiting sparsity is not a flop. If we want to take credit for flops not executed via sparsity, there’s a whole ecosystem of mostly-CPU “sparse matrix” codes to consider. Of course, GPUs have this nice 50% sparse feature, but nobody wants to compete against PARDISO or iterative solvers for really sparse problems, right? Haha.
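To illustrate why the sparsity number is slippery, here is the bookkeeping for a matmul with Nvidia-style 2:4 structured sparsity (toy numbers, just to show what gets counted):

    # Flop accounting for an M x K @ K x N matmul with 2:4 structured sparsity.
    M = N = K = 4096
    dense_flops = 2 * M * N * K              # one multiply + one add per MAC
    executed_with_2_4 = dense_flops // 2     # half the MACs are skipped, not performed
    print(f"dense:    {dense_flops / 1e9:.1f} GFLOP")
    print(f"executed: {executed_with_2_4 / 1e9:.1f} GFLOP")
    # The marketed "sparse TFLOPS" figure is 2x the dense peak, i.e. it counts the
    # skipped MACs as if they had been executed.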


In domains like ML, people care way more about the half precision FLOPs than single precision.


They don’t have much application outside ML, at least as far as I know. Just call them ML ops, and then they can include things like those funky shared-exponent floating point formats, and/or stuff with ints.

Or they could be measured in bits per second.

Actually I’m pretty interested in figuring out if we can use them for numerical linear algebra stuff, but I think it’d take some doing.


You can calculate the area of the tensor and raytracing units by measuring and comparing die sizes between the nearest 20-series and 16-series chips. Contrary to the assumptions a lot of people made from the cartoon diagrams, it's actually relatively small: together they make up approximately 18% of the cluster area, and below 10% of the chip as a whole. The area is roughly 2/3 tensor units and 1/3 raytracing units, so RT is around 3% of total chip area and tensor is around 6%.
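To make the implied split explicit (rough numbers from that comparison, not official figures):

    # Rough area split of the tensor + RT units (a bit under 10% of the whole chip).
    special_fraction = 0.09
    tensor_share, rt_share = 2 / 3, 1 / 3
    print(f"tensor: ~{special_fraction * tensor_share:.0%} of the chip")  # ~6%
    print(f"RT:     ~{special_fraction * rt_share:.0%} of the chip")      # ~3%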

https://old.reddit.com/r/hardware/comments/baajes/rtx_adds_1...

This could have changed somewhat in newer releases, but probably not too drastically, since NVIDIA has never really increased raw ray performance beyond the 20-series launch. And while there have been a few raytracing features added around the edges, raster and cache have been bumped significantly too (notably, Ampere got dual-issue FP32 pipelines... which didn't really work out that well for NVIDIA either!), so honestly there's a reasonable chance it's slightly less in subsequent architectures.


> Each of the tiles on the CPU side is actually a Zen 4 core, complete with its dedicated L2 cache.

Perhaps it could be more interesting to compare without the L2 cache.


The L2 really belongs to the core, a comparison without it does not make much sense.

The GPU cores (in the classic sense, i.e. not what NVIDIA names as "cores") also include cache memories and also local memories that are directly addressable.

The only confusion is caused by the fact that first NVIDIA, and then ATI/AMD too, have started to use an obfuscated terminology where they have replaced a large number of terms that had been used for decades in the computing literature with other terms.

For maximum confusion, many terms that previously had clear meanings, like "thread" or "core", have been reused with new meanings and ATI/AMD has invented a set of terms corresponding to those used by NVIDIA but with completely different word choices.

I hate the employees of NVIDIA and ATI/AMD who thought it was a good idea to replace all the traditional terms without having any reason for this.

The traditional meaning of a thread is that for each thread there exists a distinct program counter a.k.a. instruction pointer, which is used to fetch and execute instructions from a program stored in the memory.

The traditional meaning of a core is that it is a block that is equivalent with a traditional independent processor, i.e. equivalent with a complete computer minus the main memory and the peripherals.

A core may have only one program counter, when it can execute a single thread at a time, or it may have multiple program counters (with associated register sets) when it can execute multiple threads, using either FGMT (fine-grained multithreading) or SMT (simultaneous multithreading).

The traditional terms were very clear and they have direct correspondents in GPUs, but NVIDIA and AMD use other words for those instead of "thread" and "core" and they reuse the words "thread" and "core" for very different things, for maximum obfuscation. For instance, NVIDIA uses "warp" instead of "thread", while AMD uses "wavefront" instead of "thread". NVIDIA uses "thread" to designate what was traditionally named the body of a "parallel for" a.k.a. "parallel do" program structure (which when executed on a GPU or multi-core CPU is unrolled and distributed over cores, threads and SIMD lanes).
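A toy sketch of that mapping, purely illustrative (plain Python, not how any real scheduler works): the body of the parallel for is what CUDA calls a "thread", each warp is a thread in the traditional sense, and each lane is what NVIDIA markets as a "core":

    # Illustrative only: how a "parallel for" body maps onto warps and SIMD lanes.
    def parallel_for(n, body, warp_size=32):
        # Each warp has one program counter driving warp_size lanes in lockstep.
        for warp_id in range((n + warp_size - 1) // warp_size):  # scheduled across SMs/CUs
            for lane in range(warp_size):
                i = warp_id * warp_size + lane   # what CUDA calls the "thread" index
                if i < n:                        # lanes past n are simply masked off
                    body(i)

    out = [0] * 100
    parallel_for(100, lambda i: out.__setitem__(i, i * i))
    print(out[:5])   # [0, 1, 4, 9, 16]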


Or maybe a CUDA core versus one of Zen's SIMD ports.


It was a good read. I wonder what hot takes he'll have in the second part if any.


I refused to buy the so-determined defective chips, even when they represented better value, because if the intent was truly to maximize yield then there should be, for Ryzen for example, good 7-core versions with only 1 core found to be defective. Since no 7-core Zens exist, at least some of the CPUs with 6-core CCDs must have intentionally had 1 working core destroyed, for reasons unknown, possibly to meet volume targets. If this is because the Ryzen cores can only be disabled in pairs, then it boggles my mind that, given the difference of tens to hundreds of dollars between the 6- and 8-core versions, it would not be economic to add the circuits to allow each core to be individually fused off and allow further product differentiation, especially considering how much effort and how many SKUs have gone into frequency binning on AM4 (5700X, 5800, 5800X, 5800XT, etc.) rather than bigger market-segmentation jumps.


> if the intent was truly to maximize yield then there should be, for Ryzen for example, good 7-core versions with only 1 core found to be defective. Since no 7-core Zens exist

There are Zen processors that use 7 cores per CCD, e.g. Epyc 7663, 7453, 9634.

The difference between Ryzen and Epyc is the I/O die. The CCDs are the same so that's presumably where they go.

Another reason you might not see this on the consumer chips is that they have higher base clocks. If you have a CCD where one core is bad and another isn't exactly bad but can't hit the same frequencies as the other six, it doesn't take a lot of difference before it makes more sense to turn off the slowest than lower the base clock for the whole processor. 6 x 4.7GHz is faster than 7 x 4.0GHz, much less 7 x 2.5GHz.

In theory you could let that one core run at a significantly lower speed than the others, but there is a lot of naive software that will misbehave in that context. Whereas the base clock for the Epyc 9634 is 2.25GHz, because it has twelve 7-core CCDs so it's nearly 300W, and doesn't want to be nearly 1300W regardless of whether or not most of the cores could do >4GHz.


To correct the example for the Epyc line, models appear to exist with 1 through 8 cores enabled per CCD, except for 5.


The Epyc models with lower core counts per CCD probably don't exist because of yields though. The 73F3 has two cores per CCD, so with eight CCDs it only has 16 cores. The 7303 also has 16 cores but two CCDs, so all eight cores per CCD are active. The 73F3 costs more than five times as much. That's weird if the 73F3 is the dumping ground for broken dice. Not so weird when you consider that it has four times as much L3 cache and higher clock speeds.

The extra cores in the 73F3 aren't necessarily bad, they're disabled so the others can have their L3 cache and so they can pick the two cores from each CCD that hit the highest clock speeds. Doing that is expensive, especially if the other cores aren't all bad, but then you get better performance per core. Which some people will pay a premium for, so they offer models like that even if yields are good and there aren't that many CCDs with that many bad cores.

At which point your premise is invalid because processors are being sold with cores disabled for performance reasons rather than yield reasons.


> they're disabled so the others can have their L3 cache and so they can pick the two cores from each CCD that hit the highest clock speeds

What or where does that follow from? One can take a CCD with 2+ good cores and pin a process to a set of the fastest cores based on profiling, and those 2+ cores could use the L3 cache as needed; disabling cores at the hardware level is the waste, because if they were not disabled, other processes would be able to benefit from more than 2 cores when desired. The latter point, disabling cores for "better [frequency] performance per core, which some people will pay a premium for", is dubious, especially for the Epyc server line. If that were true, then there should be at least a few 4-core or smaller SKUs for the desktop Ryzen variants, where apps like games are more likely to benefit from the higher clock.


> What or where does that follow from? One can take a CCD with 2+ good cores and pin a process to a set of the fastest cores based on profiling, and those 2+ cores could use the L3 cache as needed

You're assuming that the buyer knows how to do this and wants to do it themselves, rather than buying a piece of hardware which is configured from the factory to do it for them, and getting a modest discount over the processor with more of the cores operational because some of the cores they weren't going to use anyway might be defective.

> disabling cores at the hardware level is the waste, because if they were not disabled, other processes would be able to benefit from more than 2 cores when desired.

This is exactly the thing some buyers want to avoid. Many applications will spawn a thread for each hardware thread, but each thread for each application will consume shared resources like L3 cache and memory bandwidth. That adds up fast if you have 96 cores per socket. When these are your bottleneck you don't want to spend your time changing the defaults in every application to not do this, you just want a processor with fewer cores and more L3 cache.

> The latter point, disabling cores for "better [frequency] performance per core, which some people will pay a premium for", is dubious, especially for the Epyc server line. If that were true, then there should be at least a few 4-core or smaller SKUs for the desktop Ryzen variants, where apps like games are more likely to benefit from the higher clock.

Base clocks are limited by power. AM5 supplies up to 170W, which is sufficient to run 12 cores at a base clock of 4.7GHz and is approximately the limit of the architecture rather than the socket. The Zen4 processors with fewer cores than that don't have higher base clocks, they have lower TDPs.

SP5 supplies up to 400W, but supports up to 12 Zen4 CCDs, which is 96 cores. If you enable all 96 cores then you have ~4W/core rather than ~14W/core, and on top of that the Epyc I/O die is bigger and consumes more power than the I/O die for AM5. The highest base clock with 96 Zen4 cores is the 9684X at 2.55GHz and the full 400W. So if you want a >4GHz base clock you have to reduce the core count, and doing this by reducing the number of cores per CCD leaves you with more L3 cache than reducing the number of CCDs. There is no reason to do the same for AM5 because the lower core count processors already aren't limited by socket power.
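In rough numbers (naive watts-per-core division, ignoring the I/O die and uncore):

    # Socket power spread across cores, AM5 vs. SP5 (very rough).
    am5_watts, am5_cores = 170, 12    # e.g. a 12-core Zen4 part at its full base clock
    sp5_watts, sp5_cores = 400, 96    # e.g. a 96-core Epyc with 12 fully enabled CCDs
    print(round(am5_watts / am5_cores, 1), "W per core on AM5")   # ~14.2
    print(round(sp5_watts / sp5_cores, 1), "W per core on SP5")   # ~4.2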


I would guess that there is a desire to not create too many product tiers. I believe the 6-core parts are made from two 3-core CCXs (rather than 4 and 2), so only one core is disabled per CCX.


Current Ryzen and EPYC processors have 8 core CCXs. The 6 core parts used to be as you described, but are now a single CCX. The Zen C dies have two CCXs, but they are still 8 core CCXs, and are always symmetrical in core count.

The big exception is that the new Zen 5 Strix Point chip has a 4 core CCX for the non-C cores. I think the Zen 4 based Z1 has a similar setup but don't remember and couldn't quickly find the actual information to confirm.


The Ryzen Z1 was a weird one: two Zen4 cores plus four Zen4c cores all in one cluster, sharing the same 16MB L3 cache.


It would be sort of cool if they could do direct-to-consumer sales with every core running at whatever its maximum speed is, or turned off if it's defective. But that's not something you could do through existing distribution channels; everyone presumes a fairly limited number of SKUs.


That first sentence is a spectacular non-sequitur.



