
NVIDIA RTX 4090: ~1,008 GB/s

NVIDIA RTX 4080: ~717 GB/s

AMD Radeon RX 7900 XTX: ~960 GB/s

AMD Radeon RX 7900 XT: ~800 GB/s

How's that slow exactly?

You can have 10,000,000 GB/s, and without enough VRAM it's useless.
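
To make that concrete, here's a back-of-the-envelope sketch (the 70 GB model size is a hypothetical ~70B-parameter network at roughly 8 bits per weight) of why capacity matters before bandwidth does: for single-batch LLM decoding, every generated token streams the full set of weights from memory, so tokens/s is roughly bandwidth divided by model size, but only if the weights fit in local memory at all.

    # Rough sketch; the 70 GB "model" is hypothetical (~70B params at ~8 bits/weight).
    def decode_tokens_per_sec(model_gb, vram_gb, bandwidth_gb_s):
        """Rough upper bound on memory-bound decode speed; 0 if the model doesn't fit."""
        if model_gb > vram_gb:
            return 0.0  # weights spill out of VRAM, so the card's bandwidth no longer applies
        return bandwidth_gb_s / model_gb

    print(decode_tokens_per_sec(70, 24, 1008))  # 24 GB / ~1,008 GB/s card: 0.0 (doesn't fit)
    print(decode_tokens_per_sec(70, 512, 800))  # 512 GB / 800 GB/s machine: ~11.4 tokens/s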



I have a 4090 and, out of curiosity, I looked up the FLOPS in comparison with Apple chips.

Nvidia RTX 4090 (Ada Lovelace)

FP32: Approximately 82.6 TFLOPS

FP16: When using its 4th‑generation Tensor Cores in FP16 mode with FP32 accumulation, it can deliver roughly 165.2 TFLOPS (in non‑tensor mode, the FP16 rate is similar to FP32).

FP8: The Ada architecture introduces support for an FP8 format; using this mode (again with FP32 accumulation), the RTX 4090 can achieve roughly 330.3 TFLOPS (or about 660.6 TOPS, depending on how you count operations).

Apple M1 Ultra (an earlier top‑end Apple chip)

FP32: Around 15.9 TFLOPS (as reported in various benchmarks)

FP16: By similar scaling, FP16 performance would be roughly double that value—approximately 31.8 TFLOPS (again, an estimate based on common patterns in Apple’s GPU designs)

FP8: Like the M3 family, the M1 Ultra does not support a dedicated FP8 precision mode.

So a $2,000 Nvidia 4090 gives you about 5x the FLOPS, but with far less high-speed RAM (24 GB vs. 512 GB from Apple in the new M3 Ultra). The RAM bandwidth on the Nvidia card is over 1 TB/s, compared with 800 GB/s for Apple Silicon.
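
A quick sanity check of those ratios, using only the approximate figures quoted above:

    # Sanity-checking the comparison with the approximate figures quoted above.
    rtx4090_fp32_tflops, m1_ultra_fp32_tflops = 82.6, 15.9
    rtx4090_bw_gb_s, m3_ultra_bw_gb_s = 1008, 800
    rtx4090_vram_gb, m3_ultra_ram_gb = 24, 512

    print(rtx4090_fp32_tflops / m1_ultra_fp32_tflops)  # ~5.2x compute advantage for the 4090
    print(rtx4090_bw_gb_s / m3_ultra_bw_gb_s)          # ~1.26x bandwidth advantage for the 4090
    print(m3_ultra_ram_gb / rtx4090_vram_gb)           # ~21x memory-capacity advantage for Apple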

Apple is catching up here and I am very keen for them to continue doing so! Anything that knocks Nvidia down a notch is good for humanity.


> Anything that knocks Nvidia down a notch is good for humanity.

I don't love Nvidia a whole lot but I can't understand where this sentiment comes from. Apple abandoned their partnership with Nvidia, tried to support their own CUDA alternative with blackjack and hookers (OpenCL), abandoned that, and began rolling out a proprietary replacement.

CUDA sucks for the average Joe, but Apple abandoned any chance of taking the high road when they cut ties with Khronos. Apple doesn't want better AI infrastructure for humanity; they envy the control Nvidia wields and want it for themselves. Metal versus CUDA is the type of competition where no matter who wins, humanity loses. Bring back OpenCL, then we'll talk about net positives again.


Uhm, we can expect close to 8 FP32 TFLOPS from the CPUs alone on the M3 Ultra. It comes with 4 tensor engines (AMX), each capable of about 2 TFLOPS.

The M3 Max GPU benchmarks at around 14 TFLOPS, so the Ultra should score around 28 TFLOPS.

Double the numbers for FP16.
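
Spelling out the arithmetic behind that estimate (all figures are the approximate ones above):

    # Rough arithmetic behind the estimate above; all figures are approximate.
    amx_engines, tflops_per_amx = 4, 2.0
    cpu_fp32_tflops = amx_engines * tflops_per_amx            # ~8 TFLOPS from the AMX units alone

    m3_max_gpu_fp32_tflops = 14.0
    m3_ultra_gpu_fp32_tflops = 2 * m3_max_gpu_fp32_tflops     # two Max dies -> ~28 TFLOPS

    print(cpu_fp32_tflops, m3_ultra_gpu_fp32_tflops)          # 8.0 28.0
    print(2 * cpu_fp32_tflops, 2 * m3_ultra_gpu_fp32_tflops)  # FP16 at double rate: 16.0 56.0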


H100 SXM: ~3 TB/s

VRAM is not really the limiting factor for serious actors in this space.


If my grandmother had wheels, she’d be a bicycle



