Model weights are significantly larger than cache in almost all cases. Even an 8B parameter model is ~16 GB in half precision; GPU on-chip caches are tens of MB, nowhere near large enough to hold that.

Every weight has to be touched for every forward pass, meaning you have to wait for 16 GB to transfer from VRAM -> SRAM -> registers. That's not even close to 100 ns: on a 4090 with ~1 TB/s memory bandwidth it's about 16 milliseconds. PCIe latency to launch kernels or move 20 integers or whatever is functionally irrelevant on this scale.
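
Back-of-envelope in Python, using the rough figures above (8B params in fp16, ~1 TB/s of VRAM bandwidth; these are assumptions, exact numbers vary by card and model):

    # Time to stream all weights out of VRAM once per forward pass.
    # Figures are the rough assumptions from above, not measurements.
    params = 8e9                    # 8B parameter model
    bytes_per_param = 2             # fp16 / bf16
    weight_bytes = params * bytes_per_param   # ~16 GB
    bandwidth = 1e12                # ~1 TB/s VRAM bandwidth (4090 ballpark)
    t_pass_ms = weight_bytes / bandwidth * 1e3
    print(f"~{t_pass_ms:.0f} ms just to read the weights once")  # ~16 ms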

The real reason for batching is that it lets you reuse that gigantic VRAM->SRAM transfer across the batch and sequence dimensions. Instead of paying a 16 ms memory tax for each token, you pay it once for the whole batched forward pass.
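
And a sketch of the amortization, with made-up batch sizes just for illustration:

    # Same assumed figures as above; the batch sizes here are hypothetical.
    weight_bytes = 16e9             # ~16 GB of fp16 weights
    bandwidth = 1e12                # ~1 TB/s VRAM bandwidth
    sweep_ms = weight_bytes / bandwidth * 1e3   # ~16 ms per weight sweep
    for batch_tokens in (1, 8, 64, 256):
        per_token = sweep_ms / batch_tokens
        print(f"batch={batch_tokens:>3}: ~{per_token:.2f} ms of weight traffic per token")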

You've made several incorrect assumptions, and I'm not going to take the time to correct them all, so forgive the brevity. I'll just say that the 16 ms memory tax is wildly incorrect.

You either have a massive misconception of GPT-like decoder transformers or of how GPU data paths are architected, or you're trolling. Go talk to a modern reasoning model to get yourself some knowledge; it will be much better than what you appear to have.
