
> Transformers are typically memory-bandwidth bound during decoding.

Not in the case of language models, which are typically bound by memory size rather than bandwidth.
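(For context on the two claims being traded here: in the bandwidth-bound regime, generating each token streams the full set of weights through the GPU once, so throughput is roughly memory bandwidth divided by model size in bytes. A minimal sketch; the 7B / fp16 / 1 TB/s figures below are illustrative, not from the thread:)

    # Rough decode-throughput ceiling when weight reads dominate:
    # every generated token reads all weights once, so
    # tokens/s ~= memory bandwidth / model size in bytes.
    def decode_tokens_per_sec(params_billion, bytes_per_param,
                              bandwidth_gb_per_s):
        model_bytes = params_billion * 1e9 * bytes_per_param
        return bandwidth_gb_per_s * 1e9 / model_bytes

    # Illustrative: a 7B model in fp16 on ~1 TB/s of HBM.
    print(decode_tokens_per_sec(7, 2, 1000))  # ~71 tokens/s ceiling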



nope


I assume even this one won't run on an RTX 5090 due to constrained memory size: https://news.ycombinator.com/item?id=43270843
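(A quick way to see the size constraint; the 70B parameter count is illustrative, and this ignores the KV cache and activations, which only make things worse:)

    # Do the weights alone fit in VRAM? Billions of params times
    # bytes per param gives GB directly (1e9 bytes per GB).
    def weights_gb(params_billion, bytes_per_param):
        return params_billion * bytes_per_param

    # Illustrative: a 70B model in fp16 vs the RTX 5090's 32 GB.
    print(weights_gb(70, 2))        # 140 GB of weights
    print(weights_gb(70, 2) <= 32)  # False: does not fit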


Sure, on consumer GPUs, but that is not what constrains model inference in most actual industry setups. Technically, even then you are bound by CPU-GPU memory bandwidth more than by GPU memory size alone, although that is maybe splitting hairs. (See the sketch below.)
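(If the weights don't fit in VRAM and get streamed from host RAM every step, the CPU-GPU link does become the bottleneck. A sketch with illustrative numbers, assuming a PCIe 5.0 x16 link at ~64 GB/s versus ~1 TB/s of on-device HBM:)

    # When weights are offloaded to host RAM, each decode step drags
    # the model across the CPU-GPU link, so that link sets the ceiling.
    def offload_tokens_per_sec(model_gb, link_gb_per_s):
        return link_gb_per_s / model_gb

    # Illustrative: 140 GB of weights over PCIe 5.0 vs resident in HBM.
    print(offload_tokens_per_sec(140, 64))    # ~0.46 tokens/s over PCIe
    print(offload_tokens_per_sec(140, 1000))  # ~7 tokens/s if resident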


Why are industry setups considered "actual" while others are not?



