Just tried it - doesn't seem to be working. In fact, I'm getting 1.4 t/s with a Quadro P4000 (8 GB) running a 7B model at 3 bits per weight. Are you changing anything other than the 8-bit cache and context?

For reference, I'm getting 10 t/s with a Q5_K_M Mistral GGUF model.