
> Transformers are typically memory-bandwidth bound during decoding.

Not in the case of language models, which are typically bound by memory size rather than bandwidth.
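(For context on the two claims being traded here: in the bandwidth-bound regime, generating each token streams the full set of weights through the GPU once, so throughput is roughly memory bandwidth divided by model size in bytes. A minimal sketch; the 7B / fp16 / 1 TB/s figures below are illustrative, not from the thread:)

    # Rough decode-throughput ceiling when weight reads dominate:
    # every generated token reads all weights once, so
    # tokens/s ~= memory bandwidth / model size in bytes.
    def decode_tokens_per_sec(params_billion, bytes_per_param,
                              bandwidth_gb_per_s):
        model_bytes = params_billion * 1e9 * bytes_per_param
        return bandwidth_gb_per_s * 1e9 / model_bytes

    # Illustrative: a 7B model in fp16 on ~1 TB/s of HBM.
    print(decode_tokens_per_sec(7, 2, 1000))  # ~71 tokens/s ceiling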



nope


I assume even this one won't run on an RTX 5090 due to constrained memory size: https://news.ycombinator.com/item?id=43270843
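(A quick way to see the size constraint; the 70B parameter count is illustrative, and this ignores the KV cache and activations, which only make things worse:)

    # Do the weights alone fit in VRAM? Billions of params times
    # bytes per param gives GB directly (1e9 bytes per GB).
    def weights_gb(params_billion, bytes_per_param):
        return params_billion * bytes_per_param

    # Illustrative: a 70B model in fp16 vs the RTX 5090's 32 GB.
    print(weights_gb(70, 2))        # 140 GB of weights
    print(weights_gb(70, 2) <= 32)  # False: does not fit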


Sure, on consumer GPUs, but that is not what constrains model inference in most actual industry setups. Technically, even then you are bound by CPU-GPU memory bandwidth more than by GPU memory size alone, although that is maybe splitting hairs. (See the sketch below.)
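(If the weights don't fit in VRAM and get streamed from host RAM every step, the CPU-GPU link does become the bottleneck. A sketch with illustrative numbers, assuming a PCIe 5.0 x16 link at ~64 GB/s versus ~1 TB/s of on-device HBM:)

    # When weights are offloaded to host RAM, each decode step drags
    # the model across the CPU-GPU link, so that link sets the ceiling.
    def offload_tokens_per_sec(model_gb, link_gb_per_s):
        return link_gb_per_s / model_gb

    # Illustrative: 140 GB of weights over PCIe 5.0 vs resident in HBM.
    print(offload_tokens_per_sec(140, 64))    # ~0.46 tokens/s over PCIe
    print(offload_tokens_per_sec(140, 1000))  # ~7 tokens/s if resident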


Why are industry setups considered "actual" while others are not?



