Back in April I bought some parts to build a PC for testing LLMs with llama.cpp. I paid around $192 for: a B550MH motherboard, AMD Ryzen 3 4100, 1x16GB DDR4 Kingston ValueRAM, 256GB M.2 SSD. I already had an old PC case with a 350W PSU.
I was getting 2.2 tokens/s with the llama-2-13b-chat.Q4_K_M.gguf and 3.3 tokens/s with llama-2-13b-chat.Q3_K_S.gguf. With Mistral and Zephyr, the Q4_K_M versions, I was getting 4.4 tokens/s.
A few days ago I bought another stick of 16GB RAM ($30) and for some reason that escapes me, the inference speed doubled. So now I'm getting 6.5 tokens/s with llama-2-13b-chat.Q3_K_S.gguf, which for my needs gives the same results as Q4_K_M, and 9.1 tokens/s with Mistral and Zephyr. Personally, I can barely keep up with reading at 9 tokens/s (if I also have to process the text and check for errors).
If I weren't considering an Nvidia 4060 Ti for Stable Diffusion, I'd seriously consider a used RX 580 8GB ($75) and run Llama Q4_K_M entirely on the GPU, or offload some layers when using a 30B model.
CPUs often have two RAM channels, so you need two sticks to get the full memory bandwidth out of the processor. Inference is very memory-bandwidth intensive, so it makes sense that performance doubled.
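A back-of-the-envelope sketch of why bandwidth matters: each generated token has to stream (roughly) the whole weight set from RAM, so the bandwidth ceiling caps tokens/s. The bandwidth and model-size numbers below are illustrative assumptions, not measurements from the build above:

```python
# CPU inference is largely memory-bandwidth bound: every token reads
# approximately the full set of quantized weights from RAM once.
def est_tokens_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper-bound estimate: tokens/s = bandwidth / bytes moved per token."""
    return bandwidth_gb_s / model_size_gb

# DDR4-3200 theoretical peak: ~25.6 GB/s per channel, ~51.2 GB/s dual channel.
# A 13B Q4_K_M GGUF is roughly 8 GB (approximate figure).
single = est_tokens_per_s(25.6, 8.0)
dual = est_tokens_per_s(51.2, 8.0)

print(f"single channel: ~{single:.1f} tok/s, dual channel: ~{dual:.1f} tok/s")
# Doubling the bandwidth doubles the ceiling, consistent with the ~2x speedup.
```

Real throughput lands below these peaks (prompt processing, cache effects, compute overhead), but the 2x ratio between single- and dual-channel configurations carries through.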
Do you know of a good reference / primer for LLMs from a technical architecture perspective? I've been somewhat avoiding them, but after seeing MonadGPT -- I'm just too damn curious.
Ideally, I'd like to be able to have a "survey level" understanding of what goes into scaling these models, and what they're capable of at different levels of scale. For example, in the "introducing llama" page, they say
> Smaller, more performant models such as LLaMA enable others in the research community who don’t have access to large amounts of infrastructure to study these models, further democratizing access in this important, fast-changing field.
I'd like to be able to discuss the tradeoffs here somewhat intelligently. What exactly does "smaller, more performant" mean in this context, and how can we quantify the differences between models that demand larger infrastructure?
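One concrete way to quantify "smaller": the RAM (or VRAM) needed just to hold the weights, which scales with parameter count times bits per weight. A rough sketch, where the bits-per-weight figures for the quantized formats are approximate averages rather than exact values:

```python
# Approximate memory footprint of the weights alone (excludes KV cache
# and activations). Quantized bits-per-weight values are rough averages.
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """GB needed to store the weights at a given precision."""
    return params_billions * bits_per_weight / 8  # 1e9 params * bits / 8 / 1e9 bytes

for name, params in [("7B", 7), ("13B", 13), ("70B", 70)]:
    for fmt, bpw in [("FP16", 16), ("~Q4", 4.8), ("~Q3", 3.5)]:
        print(f"{name} {fmt}: ~{weight_gb(params, bpw):.1f} GB")
```

This is why a 13B model at 4-bit quantization fits in 16 GB of system RAM while the same model at FP16 does not, and why "smaller" models open the field to people without datacenter hardware.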