Back in April I bought some parts to build a PC for testing LLMs with llama.cpp. I paid around $192 for: a B550MH motherboard, AMD Ryzen 3 4100, 1x16GB DDR4 Kingston ValueRAM, 256GB M.2 SSD. I already had an old PC case with a 350W PSU.
I was getting 2.2 tokens/s with the llama-2-13b-chat.Q4_K_M.gguf and 3.3 tokens/s with llama-2-13b-chat.Q3_K_S.gguf. With Mistral and Zephyr, the Q4_K_M versions, I was getting 4.4 tokens/s.
A few days ago I bought another stick of 16GB RAM ($30) and for some reason that escapes me, the inference speed doubled. So now I'm getting 6.5 tokens/s with llama-2-13b-chat.Q3_K_S.gguf, which for my needs gives the same results as Q4_K_M, and 9.1 tokens/s with Mistral and Zephyr. Personally, I can barely keep up with reading at 9 tokens/s (if I also have to process the text and check for errors).
If I weren't considering an Nvidia 4060 Ti for Stable Diffusion, I'd seriously consider a used RX 580 8GB ($75) and run Llama Q4_K_M entirely on the GPU, or offload some layers when using a 30B model.
CPUs often have two RAM channels, so you need two sticks to get the full memory bandwidth out of the processor. Inference is very memory-bandwidth intensive, so it makes sense that performance doubled.
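A back-of-the-envelope sketch of why bandwidth matters: each generated token has to stream (roughly) the whole weight set from RAM, so the bandwidth ceiling caps tokens/s. The bandwidth and model-size numbers below are illustrative assumptions, not measurements from the build above:

```python
# CPU inference is largely memory-bandwidth bound: every token reads
# approximately the full set of quantized weights from RAM once.
def est_tokens_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper-bound estimate: tokens/s = bandwidth / bytes moved per token."""
    return bandwidth_gb_s / model_size_gb

# DDR4-3200 theoretical peak: ~25.6 GB/s per channel, ~51.2 GB/s dual channel.
# A 13B Q4_K_M GGUF is roughly 8 GB (approximate figure).
single = est_tokens_per_s(25.6, 8.0)
dual = est_tokens_per_s(51.2, 8.0)

print(f"single channel: ~{single:.1f} tok/s, dual channel: ~{dual:.1f} tok/s")
# Doubling the bandwidth doubles the ceiling, consistent with the ~2x speedup.
```

Real throughput lands below these peaks (prompt processing, cache effects, compute overhead), but the 2x ratio between single- and dual-channel configurations carries through.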
Do you know of a good reference / primer for LLMs from a technical architecture perspective? I've been somewhat avoiding them, but after seeing MonadGPT -- I'm just too damn curious.
Ideally, I'd like to be able to have a "survey level" understanding of what goes into scaling these models, and what they're capable of at different levels of scale. For example, in the "introducing llama" page, they say
> Smaller, more performant models such as LLaMA enable others in the research community who don’t have access to large amounts of infrastructure to study these models, further democratizing access in this important, fast-changing field.
I'd like to be able to discuss the tradeoffs here somewhat intelligently. What exactly does "smaller, more performant" mean in this context, and how can we quantify the differences between models that demand larger infrastructure?
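One concrete way to quantify "smaller": the RAM (or VRAM) needed just to hold the weights, which scales with parameter count times bits per weight. A rough sketch, where the bits-per-weight figures for the quantized formats are approximate averages rather than exact values:

```python
# Approximate memory footprint of the weights alone (excludes KV cache
# and activations). Quantized bits-per-weight values are rough averages.
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """GB needed to store the weights at a given precision."""
    return params_billions * bits_per_weight / 8  # 1e9 params * bits / 8 / 1e9 bytes

for name, params in [("7B", 7), ("13B", 13), ("70B", 70)]:
    for fmt, bpw in [("FP16", 16), ("~Q4", 4.8), ("~Q3", 3.5)]:
        print(f"{name} {fmt}: ~{weight_gb(params, bpw):.1f} GB")
```

This is why a 13B model at 4-bit quantization fits in 16 GB of system RAM while the same model at FP16 does not, and why "smaller" models open the field to people without datacenter hardware.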