
But if you need to get 2x consumer GPUs, it seems to me the reason is not their compute capability but rather being able to fit the model in the combined VRAM of both. So what exactly does having lots of memory on a server help with, when it's not memory a GPU can use, unlike on Apple Silicon computers?



The problem with LLMs is that the models are large, and the entire model has to be read from memory for every generated token. If the model is 40GB and you have 80GB/s of memory bandwidth, you can't get more than two tokens per second. That's about what you get from running it on the CPU of a normal desktop PC with dual-channel DDR5-5200. You can run arbitrarily large models by just adding memory, but it's not very fast.
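A rough back-of-envelope sketch of that bound (the desktop bandwidth figure is my own estimate for dual-channel DDR5-5200, not from the comment itself):

    # Decode speed is roughly bounded by how fast the weights can be streamed from RAM.
    def max_tokens_per_second(model_size_gb, bandwidth_gb_s):
        # Each generated token requires reading (roughly) all of the weights once.
        return bandwidth_gb_s / model_size_gb

    # Dual-channel DDR5-5200: 2 channels * 5200 MT/s * 8 bytes ~= 83 GB/s
    print(max_tokens_per_second(40, 83))  # ~2 tokens/s, matching the figure above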

GPUs have a lot of memory bandwidth. For example, the RTX 4090 has just over 1000GB/s, so a 40GB model could get up to 25 tokens/second. Except that the RTX 4090 only has 24GB of memory, so a 40GB model doesn't fit on one and you need two of them. For a 128GB model you'd need six of them. But they're each $2000, so that sucks.
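The same arithmetic for the GPU case, as a sketch (using the VRAM, bandwidth and price figures quoted above):

    import math

    def gpus_needed(model_size_gb, vram_gb=24):  # RTX 4090: 24GB VRAM per card
        return math.ceil(model_size_gb / vram_gb)

    print(gpus_needed(40))          # 2 cards for a 40GB model
    print(gpus_needed(128))         # 6 cards for a 128GB model
    print(1000 / 40)                # ~25 tokens/s upper bound at ~1000GB/s
    print(gpus_needed(128) * 2000)  # ~$12,000 worth of GPUs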

Servers with a lot of memory channels have a decent amount of memory bandwidth, not as much as high-end GPUs but still several times more than desktop PCs, so the performance is kind of medium. Meanwhile they support copious amounts of cheap commodity RAM. There is no GPU; you just run it on a CPU with a lot of cores and memory channels.
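To put rough numbers on that (an illustration with assumed channel counts and speeds; actual server platforms vary):

    # Peak DRAM bandwidth ~= channels * transfer rate (MT/s) * 8 bytes per transfer
    def bandwidth_gb_s(channels, mt_s):
        return channels * mt_s * 8 / 1000

    print(bandwidth_gb_s(2, 5200))        # ~83 GB/s:  dual-channel desktop
    print(bandwidth_gb_s(12, 4800))       # ~461 GB/s: a 12-channel server CPU
    print(bandwidth_gb_s(12, 4800) / 40)  # ~11 tokens/s for a 40GB model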


Got it, thanks!



