
But if you need to get 2x consumer GPUs, it seems to me the reason is not their compute capability but rather being able to fit the model in the combined VRAM of both. So what exactly does having lots of memory on a server help with, when it's not memory a GPU can use, unlike on Apple Silicon computers?



The problem with LLMs is that the models are large, and the entire model has to be read from memory for every generated token. If the model is 40GB and you have 80GB/s of memory bandwidth, you can't get more than two tokens per second. That's about what you get from running it on the CPU of a normal desktop PC with dual-channel DDR5-5200. You can run arbitrarily large models by just adding memory, but it's not very fast.
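A rough back-of-envelope sketch of that bound (the desktop bandwidth figure is my own estimate for dual-channel DDR5-5200, not from the comment itself):

    # Decode speed is roughly bounded by how fast the weights can be streamed from RAM.
    def max_tokens_per_second(model_size_gb, bandwidth_gb_s):
        # Each generated token requires reading (roughly) all of the weights once.
        return bandwidth_gb_s / model_size_gb

    # Dual-channel DDR5-5200: 2 channels * 5200 MT/s * 8 bytes ~= 83 GB/s
    print(max_tokens_per_second(40, 83))  # ~2 tokens/s, matching the figure above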

GPUs have a lot of memory bandwidth. For example, the RTX 4090 has just over 1000GB/s, so a 40GB model could get up to 25 tokens/second. Except that the RTX 4090 only has 24GB of memory, so a 40GB model doesn't fit on one and you need two of them. For a 128GB model you'd need six of them. But they're each $2000, so that sucks.
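The same arithmetic for the GPU case, as a sketch (using the VRAM, bandwidth and price figures quoted above):

    import math

    def gpus_needed(model_size_gb, vram_gb=24):  # RTX 4090: 24GB VRAM per card
        return math.ceil(model_size_gb / vram_gb)

    print(gpus_needed(40))          # 2 cards for a 40GB model
    print(gpus_needed(128))         # 6 cards for a 128GB model
    print(1000 / 40)                # ~25 tokens/s upper bound at ~1000GB/s
    print(gpus_needed(128) * 2000)  # ~$12,000 worth of GPUs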

Servers with a lot of memory channels have a decent amount of memory bandwidth, not as much as high-end GPUs but still several times more than desktop PCs, so the performance is kind of medium. Meanwhile they support copious amounts of cheap commodity RAM. There is no GPU; you just run it on a CPU with a lot of cores and memory channels.
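To put rough numbers on that (an illustration with assumed channel counts and speeds; actual server platforms vary):

    # Peak DRAM bandwidth ~= channels * transfer rate (MT/s) * 8 bytes per transfer
    def bandwidth_gb_s(channels, mt_s):
        return channels * mt_s * 8 / 1000

    print(bandwidth_gb_s(2, 5200))        # ~83 GB/s:  dual-channel desktop
    print(bandwidth_gb_s(12, 4800))       # ~461 GB/s: a 12-channel server CPU
    print(bandwidth_gb_s(12, 4800) / 40)  # ~11 tokens/s for a 40GB model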


Got it, thanks!



