It's good enough to run whatever local model you want. Two 80-core GPUs are no joke. Linking the two machines together gives you an effective 1.6 TB/s of memory bandwidth and 1 TB of total memory.
You can run the full DeepSeek R1 671B model at Q8 at 40 tokens/s, or at Q4 at 80 tokens/s. Only 37B parameters are active per token because R1 is MoE.
Linking two of these together lets you run a model (R1) more capable than GPT-4o at a comfortable speed at home. That was simply fantasy a year ago.
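For what it's worth, those token rates line up with a simple bandwidth-bound estimate. A minimal sketch, assuming decode speed is limited purely by how fast the 37B active parameters can be streamed from memory (ignoring KV cache and interconnect overhead), using only the figures quoted in this thread:

```python
# Back-of-envelope, bandwidth-bound decode estimate for an MoE model,
# assuming every active parameter is read from memory once per token.
active_params = 37e9   # DeepSeek R1: ~37B active params per token (MoE)
bandwidth = 1.6e12     # two linked M3 Ultras: ~1.6 TB/s aggregate (as quoted above)

for name, bytes_per_param in [("Q8", 1.0), ("Q4", 0.5)]:
    tokens_per_s = bandwidth / (active_params * bytes_per_param)
    print(name, round(tokens_per_s), "tokens/s")
# -> Q8 ~43 tokens/s, Q4 ~86 tokens/s, roughly the 40/80 figures above
```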
> with a vanishingly small fraction of flops and a small fraction of memory bandwidth
Is it though?
Wikipedia says [1] an M3 Max can do 14 TFLOPS of FP32, so an M3 Ultra ought to do 28 TFLOPS. nVidia claims [2] a Blackwell GPU does 80 TFLOPS of FP32. So the M3 Ultra is roughly 1/3 the speed of a Blackwell.
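Worked out explicitly, using just the figures quoted from [1] and [2]:

```python
# FP32 throughput comparison, quoted figures only.
m3_ultra_tflops = 2 * 14    # M3 Ultra = two M3 Max dies at ~14 TFLOPS FP32 each
blackwell_tflops = 80       # quoted Blackwell FP32 figure

print(m3_ultra_tflops / blackwell_tflops)   # 0.35 -> roughly 1/3
```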
Calling that "a vanishingly small fraction" seems like a bit of an exaggeration.
I mean, by that metric, a single Blackwell GPU only has "a vanishingly small fraction" of the memory of an M3 Ultra. And the M3 Ultra is only burning "a vanishingly small fraction" of a Blackwell's electrical power.
nVidia likes throwing around numbers like "20 petaFLOPS" for FP4, but that's not real floating point... it's just 1990s-vintage µ-law/A-law integer math.
Edit: Further, most (all?) of the TFLOPS numbers you see on nVidia datasheets for "Tensor FLOPs" have a little asterisk next to them saying they are "effective" TFLOPS using the sparsity feature, where half the elements of one of the matrices in the multiplication are zeroed.
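For illustration, here's a minimal numpy sketch of what that 2:4 structured sparsity amounts to: in every group of four weights, only two values are kept nonzero, so half the multiply-accumulates are guaranteed to be by zero, which is where the doubled "effective" figure comes from. (Keeping the two largest-magnitude weights per group is an assumption for illustration; the hardware only requires the 2-of-4 zero pattern.)

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))        # dense weight matrix (illustrative shape)

groups = w.reshape(-1, 4)               # view weights in groups of 4
keep = np.argsort(np.abs(groups), axis=1)[:, 2:]   # indices of the 2 largest per group
mask = np.zeros_like(groups, dtype=bool)
np.put_along_axis(mask, keep, True, axis=1)
w_sparse = (groups * mask).reshape(w.shape)

print("fraction of zeroed weights:", 1 - mask.mean())   # 0.5 by construction
```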