Not a 30 series, but on my 4090 I'm getting 32.0 tokens/s on a 13b q4_0 model (uses about 10GiB of VRAM) with full context (`-ngl 99 -n 2048 --ignore-eos` to force all layers onto GPU memory and run the full 2048-token context). As a point of reference, exllama [1] currently runs a 4-bit GPTQ of the same 13b model at 83.5 tokens/s.
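For anyone who wants to reproduce the numbers, the invocation looks roughly like this. The model path and prompt are placeholders, and you need llama.cpp built with cuBLAS for `-ngl` to actually offload anything:

```sh
# -ngl 99      -> offload all layers to the GPU
# -c 2048      -> full 2048-token context
# -n 2048      -> generate up to 2048 tokens
# --ignore-eos -> keep generating past end-of-text so the run fills the context
./main -m ./models/llama-13b/ggml-model-q4_0.bin \
    -ngl 99 -c 2048 -n 2048 --ignore-eos \
    -p "Once upon a time"
```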
Also, someone please let me use my 3050 4GB for something other than stable diffusion generation of silly thumbnail-sized pics. I'd be happy with an LLM that's specialized in insults and car analogies.
With llama.cpp you can split inference between the CPU and GPU, using however much VRAM you have available. And plenty of small models fit entirely in 4GB of VRAM; anything around 3B parameters quantized to 4-bit should be fine.
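As a rough sketch (the model filenames are placeholders, and the right `-ngl` value depends on how many layers actually fit in 4GB), it's just a matter of choosing how many layers to push to the GPU:

```sh
# A 3B q4_0 model should fit entirely in ~4GB of VRAM:
./main -m ./models/open-llama-3b/ggml-model-q4_0.bin -ngl 99 -p "Hello"

# Or split a 7B model: offload as many layers as fit, keep the rest on the CPU.
# Lower -ngl if you run out of VRAM; raise it if you have headroom.
./main -m ./models/llama-7b/ggml-model-q4_0.bin -ngl 20 -p "Hello"
```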
Cheaper Nvidia cards are generally considered to have dubious value. Having seen the benchmarks I agree, but it's not a game-changing difference, really. For CUDA and ML stuff, the 3050 would run circles around the 6600.
For CUDA and ML you'd be much better off choosing a 3060. Honestly, if you've only got the money for a 4GB 3050, you're probably better off working in Google Colab.
With layer offloading, I don't necessarily agree. Not being able to load an entire model into memory isn't a dealbreaker these days. You can even spill over into swap space if your drive is fast enough, so there's really no excuse not to use the hardware if you have it. Unless you just like the cloud and hate setting stuff up yourself, or what have you.
I'm curious how it compares to his Apple Silicon numbers.