
I had given up a long time ago on self-hosted transformer models for coding because the SOTA was definitely in favor of SaaS. This might just get me to give it another try.

Would llama.cpp support multiple (RTX 3090, no NVLink hardware bridge) GPUs over PCIe 4? (Rest of the machine is 32 CPU cores, 256GB RAM.)



How fast you run this model will strongly depend on whether you have DDR4 or DDR5 RAM.

You will mostly be using one of your 3090s. The other one will be basically doing nothing. You CAN put the MoE weights on the 2nd 3090, but it's not going to speed up inference much, like <5% speedup. As in, if you lack a GPU, you'd be looking at <1 token/sec depending on how fast your CPU does flops; with a single 3090 you'd be doing ~10 tokens/sec, but with two 3090s you'll still just be doing maybe 11 tokens/sec. These numbers are made up, but you get the idea.

Qwen3 Coder 480B is 261GB at IQ4_XS, 276GB at Q4_K_XL, so you'll be putting all the expert weights in RAM. That's why your RAM bandwidth is your limiting factor. I hope you're running a workstation with dual CPUs and 12 sticks of DDR5 RAM per CPU, which gives you 24 channels of DDR5 total.
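If it helps, theoretical peak RAM bandwidth is just channels x 8 bytes x transfer rate. A quick sketch (my own little helper, plugging in the DDR5-6400 and channel counts mentioned here):

    # Theoretical peak RAM bandwidth: channels * 8 bytes per transfer * MT/s
    def ram_bandwidth_gbs(channels, mt_per_s):
        return channels * 8 * mt_per_s / 1000   # GB/s

    print(ram_bandwidth_gbs(12, 6400))   # one socket, 12-channel DDR5-6400 -> ~614 GB/s
    print(ram_bandwidth_gbs(24, 6400))   # dual socket, 24 channels -> ~1229 GB/s combined
    print(ram_bandwidth_gbs(2, 3200))    # ordinary desktop dual-channel DDR4-3200 -> ~51 GB/s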


1 CPU, DDR4 ram


How many channels of DDR4 ram? What speed is it running at? DDR4-3200?

The (approximate) equation for milliseconds per token is:

time per token = (active params) * (quant bits) / (8 bits/byte) * [ (fraction of active params in common weights) / (GPU memory bandwidth) + (fraction of active params in expert weights) / (system RAM bandwidth) ]

This equation ignores prefill (prompt processing) time. It assumes the CPU and GPU are fast enough compute-wise to do the math and that the bottleneck is memory bandwidth (this is usually true).
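Here's that equation as a small Python helper for plugging in your own hardware (the function and variable names are mine; it's the same napkin math, prefill ignored, memory bandwidth assumed to be the bottleneck):

    def ms_per_token(active_params_b, quant_bits, common_frac, gpu_gbs, ram_gbs):
        # Bytes read per generated token, split between the GPU (common/shared
        # weights) and system RAM (expert weights) at their respective bandwidths.
        gigabytes = active_params_b * quant_bits / 8
        seconds = gigabytes * (common_frac / gpu_gbs + (1 - common_frac) / ram_gbs)
        return seconds * 1000   # milliseconds per token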

So for example, if you are running Kimi K2 (32B active params per token, 74% of those in expert weights, 26% in common params/shared expert) at Q4 quantization (4 bits per param), with a 3090 GPU (935GB/sec memory bandwidth) and an AMD Epyc 9005 CPU with 12-channel DDR5-6400 (614GB/sec memory bandwidth), then:

Time for token generation = (32b params)*(4bits/param)/8 bits*[(26%)/(935 GB/s) + (74%)/(614GB/sec)] = 23.73 ms/token or ~42 tokens/sec. https://www.wolframalpha.com/input?i=1+sec+%2F+%2816GB+*+%5B...
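Plugging the Kimi K2 numbers into the helper above reproduces that result:

    ms = ms_per_token(active_params_b=32, quant_bits=4, common_frac=0.26,
                      gpu_gbs=935, ram_gbs=614)
    print(ms, 1000 / ms)   # ~23.7 ms/token, ~42 tokens/sec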

Notice how this equation explains why the second 3090 is pretty much useless. If you load the common weights onto the first 3090 (which is standard procedure), then the 2nd 3090 is just "fast memory" for some expert weights. If the quantized model is 256GB (rough estimate, I don't know the model size off the top of my head) and the common weights are 11GB (true for Kimi K2; I don't know if it's true for Qwen3, but it's a decent rough estimate), then you have 245GB of "expert" weights. Yes, this is generally the correct ratio for MoE models, Deepseek R1 included.

If you put 24GB of that 245GB on your second 3090, you get 935GB/sec speed on... 24/245, or ~10%, of each token. In my Kimi K2 example above, you start off with about 19.3 ms per token spent reading expert weights from RAM, so even if the 24GB on your second GPU were infinitely fast, you'd still spend about 17.4 ms per token reading from RAM. In total that's about 21.8 ms/token, or roughly 46 tokens/sec. That's with an infinitely fast 2nd GPU, and you get a speedup of merely ~4 tokens/sec.
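A minimal sketch of that offload arithmetic (the 19.3 ms / 4.45 ms split and the 245GB figure are the rough Kimi K2 estimates from above):

    # Effect of parking 24 GB of ~245 GB of expert weights on a 2nd 3090.
    expert_ms_from_ram = 19.3         # ms/token reading experts from 614 GB/s RAM
    common_ms_from_gpu = 4.45         # ms/token reading common weights from the 1st 3090
    offload_frac = 24 / 245           # share of expert bytes moved to the 2nd GPU

    # Treat the 2nd GPU as infinitely fast; only the RAM reads shrink, by ~10%:
    total_ms = common_ms_from_gpu + expert_ms_from_ram * (1 - offload_frac)
    print(total_ms, 1000 / total_ms)  # ~21.9 ms/token, ~46 tokens/sec vs ~42 with one GPU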


Inspired me to write this, since it seems like most people don't understand how fast models run:

https://unexcitedneurons.substack.com/p/how-to-calculate-hom...


Thank you for that writeup!

In my case it's a fairly old system I built from cheap eBay parts: a Threadripper 3970X with 8x32GB of dual-channel 2666MHz DDR4.
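For a rough sense of what the napkin math upthread predicts for that box (I'm assuming quad-channel operation, which is what the 3970X platform supports, and borrowing the Kimi K2 split purely as an illustration, since I don't know Qwen3's exact common/expert ratio):

    # Rough estimate for a Threadripper 3970X with DDR4-2666 and one 3090.
    # Quad-channel assumed; halve the RAM figure if it's really running dual-channel.
    ram_gbs = 4 * 8 * 2.666           # ~85 GB/s theoretical peak for quad-channel DDR4-2666
    gpu_gbs = 935                     # single RTX 3090
    # Kimi K2-style split (32B active, Q4, ~26% common weights on the GPU):
    ms = 32 * 4 / 8 * (0.26 / gpu_gbs + 0.74 / ram_gbs) * 1000
    print(ms, 1000 / ms)              # ~143 ms/token, i.e. ~7 tokens/sec, dominated by DDR4 reads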


Oh yes, llama.cpp's trick is that it supports any hardware setup! It might be a bit slower, but it should work well!



