
Just tried it, and it doesn't seem to be working. In fact, I'm only getting 1.4 t/s on a Quadro P4000 (8 GB) running a 7B model at 3 bits per weight. Are you changing anything other than the 8-bit cache and the context?

For reference, I'm getting 10 t/s with a Q5_K_M Mistral GGUF model.


