Running 32B on what hardware?

nurettin · 2026-03-05T05:17:17

I'm running it with llama.cpp on a CPU-only 2020-model Ryzen server with 2x128 GB of RAM, and it seems as intelligent as GPT-4. I optimized it a little by forcing it to run on a single RAM stick and tuning the llama.cpp build parameters, which took it from 3-5 tok/s to a more acceptable 5-8 tok/s.

It can call tools and reason well enough to use them when appropriate.

ramgine · 2026-03-05T15:27:14

Does keeping it on one stick make it perform better? I have an EPYC server with 1 TB of RAM in 64 GB sticks and a 3060, and I'm looking to maximize which models I can run on it.

nurettin · 2026-03-06T05:31:56

In my case it definitely helps, but keeping a 64k context window would probably require more than 64 GB.
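For the memory question, a rough back-of-the-envelope estimate is the weights (parameter count × bits per weight ÷ 8) plus the KV cache (2 × layers × KV heads × head size × context length × bytes per element). The sketch below assumes a hypothetical 32B model with 64 layers, 8 KV heads, a head size of 128 and an f16 cache. Those architecture numbers and the quantization levels (4.5, 8 and 16 bits per weight) are illustrative assumptions, not figures from this thread, so substitute the values from your own model's config.

```python
# Rough memory estimate for running an LLM: weights + KV cache.
# ASSUMPTION: all model figures below are hypothetical, for illustration only.
# Read the real layer count, KV head count and head size from your model's config.

GIB = 1024 ** 3


def weights_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights at a given quantization level."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Size of the key/value cache: one K and one V vector per layer,
    per KV head, per token in the context window (2 bytes/elem = f16)."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical 32B-parameter model using grouped-query attention.
N_PARAMS = 32e9
N_LAYERS = 64
N_KV_HEADS = 8
HEAD_DIM = 128
N_CTX = 64 * 1024  # a "64k" context window

kv = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, N_CTX)  # f16 cache
for bits in (4.5, 8, 16):
    w = weights_bytes(N_PARAMS, bits)
    total = w + kv
    print(f"{bits:>4} bits/weight: weights {w / GIB:5.1f} GiB "
          f"+ KV {kv / GIB:4.1f} GiB = {total / GIB:5.1f} GiB")
```

Under these assumptions the f16 KV cache alone comes to 16 GiB at 64k tokens, so whether everything fits in 64 GB depends mostly on how heavily the weights are quantized. A model without grouped-query attention (as many KV heads as attention heads) would need a proportionally larger cache.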