
Yeah they did? The M4 family tops out at 546GB/s of memory bandwidth (on the M4 Max); the M3 Ultra bumps that up to 819GB/s.

(and the 512GB version is $4,000 more rather than $10,000 - that's still worth mocking, but it's nowhere near as much)



Not that dramatic of an increase, actually - the M2 Max already had 400GB/s and the M2 Ultra 800GB/s of memory bandwidth, so the M3 Ultra's 819GB/s is just a modest bump. Though the M4 Max's additional 146GB/s (546 vs the previous Max chips' 400) is a more noticeable improvement.


Also worth noting that 800/819GB/s of memory bandwidth is actually VERY usable for LLMs. Consider that a 4090 is just a hair above 1000GB/s.
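
Rough math on why that's usable, treating single-stream decode as purely bandwidth-bound (the model size and the assumption of full bandwidth utilization are illustrative; real kernels fall short of peak):

    # Ceiling on decode speed: every weight byte is read once per token.
    def max_tokens_per_sec(model_gb, bandwidth_gbps):
        return bandwidth_gbps / model_gb

    print(max_tokens_per_sec(70, 819))   # 70B @ 8-bit on M3 Ultra: ~11.7 tok/s
    print(max_tokens_per_sec(70, 1008))  # same at 4090 bandwidth: ~14.4 tok/s
                                         # (though 70GB wouldn't fit in 24GB VRAM)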


Does it work like that though at this larger scale? 512GB of VRAM would be spread across multiple NVIDIA cards, so the bandwidth and access are parallelized.

But here it looks more like a bottleneck, from my (admittedly naive) understanding.


For inference the bandwidth generally isn't parallelized, because each token has to pass through the model layer by layer. The most common way to split a model is to assign each GPU a contiguous subset of the layers, and it doesn't take much bandwidth to send the intermediate activations over PCIe to the next GPU - the weights themselves stay put on each card.
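
A minimal sketch of that layer split, assuming PyTorch and two CUDA devices (the layer sizes and split point are made up for illustration):

    import torch
    import torch.nn as nn

    # Hypothetical 8-layer model: layers 0-3 on cuda:0, layers 4-7 on cuda:1.
    layers = [nn.Linear(4096, 4096) for _ in range(8)]
    for i, layer in enumerate(layers):
        layer.to("cuda:0" if i < 4 else "cuda:1")

    def forward(x):
        x = x.to("cuda:0")
        for i, layer in enumerate(layers):
            if i == 4:
                # Only this small activation tensor crosses PCIe;
                # the 4096x4096 weight matrices never leave their GPU.
                x = x.to("cuda:1")
            x = layer(x)
        return x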


My understanding is that the GPU must still load its assigned layers from VRAM into registers and L2 cache for every token, because those aren't large enough to hold a significant portion of the weights. So naively, for a 24GB slice of the model, you'd need to move up to 24GB from VRAM for every token.
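
Back-of-envelope numbers for that worst case (assuming the 4090's theoretical ~1008GB/s peak, which real kernels don't reach):

    # Streaming a 24GB weight slice from VRAM once per token:
    slice_gb = 24
    bandwidth_gbps = 1008                 # RTX 4090 theoretical peak
    ms_per_token = slice_gb / bandwidth_gbps * 1000
    print(ms_per_token)                   # ~23.8 ms, i.e. a ~42 tokens/s ceiling
                                          # for that GPU's share of the model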


But the memory bandwidth is only part of the equation; the 4090 is at least several times faster at compute than the fastest Apple CPU/GPU.
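
A rough way to see where each limit bites, with illustrative spec numbers (assumed ballpark figures, not from a datasheet): single-token decode is usually bandwidth-bound, while prompt prefill over many tokens is compute-bound, which is where the compute lead shows up.

    # Roofline-style estimate per token (all figures illustrative).
    params = 7e9                   # 7B model
    bytes_per_param = 2            # fp16
    bandwidth = 1.0e12             # ~1 TB/s (4090-class)
    flops = 1.6e14                 # ~160 TFLOPS fp16 (4090-class, assumed)

    t_mem = params * bytes_per_param / bandwidth   # ~14 ms: read all weights
    t_compute = 2 * params / flops                 # ~0.09 ms: matmul FLOPs
    # Decode (batch 1): t_mem >> t_compute, so bandwidth dominates.
    # Prefill over N tokens reuses each weight N times, so compute starts
    # to dominate once N exceeds roughly t_mem / t_compute (~160 here).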



