Also, quantization and allocation strategies matter a lot for local use. 16 GB of VRAM doesn't seem like much, but you can run a recent 32B model at IQ3 with its full 128k context if you keep the KV cache in system memory, at about 15 t/s generation and a decent prompt processing speed (just above 1000 t/s on my hardware).
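For a rough sense of why that split works, here is a back-of-the-envelope sketch of the memory math. The layer count, KV head count, head dimension, and the 3.5 bits per weight are my assumptions for a generic 32B-class model with grouped-query attention at an IQ3-style quant, not numbers from any specific model card, so plug in your own.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Approximate size of the K and V caches for one sequence, in bytes."""
    # The leading 2 accounts for one K tensor plus one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the quantized weights, in bytes."""
    return n_params * bits_per_weight / 8


# Hypothetical model shape (assumed values, check your model's config).
kv = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, n_ctx=128 * 1024)

# Assumed average of ~3.5 bits per weight for an IQ3-style quant.
weights = weight_bytes(n_params=32e9, bits_per_weight=3.5)

print(f"KV cache at fp16: {kv / GIB:.1f} GiB")
print(f"Weights at ~3.5 bpw: {weights / GIB:.1f} GiB")
```

Under those assumptions the weights come to roughly 13 GiB, which fits in 16 GB of VRAM, while a full-context fp16 KV cache is around 32 GiB and has to live somewhere else, hence system memory.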