Hacker News

Quantization and allocation strategies are also a big thing for local usage. 16 GB of VRAM doesn't seem like a lot, but you can run a recent 32B model at IQ3 with its full 128k context if you allocate the KV cache in system memory, getting 15 t/s and decent prompt processing speed (just above 1000 t/s on my hardware).
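To see why moving the KV cache off the GPU matters, here's a rough back-of-the-envelope sketch of KV-cache size. The dimensions below are illustrative assumptions resembling a GQA 32B model (64 layers, 8 KV heads, head dim 128), not any specific model's published config:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # K and V each store n_kv_heads * head_dim values per token per layer,
    # hence the leading factor of 2
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Assumed dims for a GQA 32B-class model; fp16 (2-byte) cache entries
full = kv_cache_bytes(64, 8, 128, 128 * 1024, 2)
print(f"{full / 2**30:.0f} GiB")  # → 32 GiB
```

An fp16 KV cache at full 128k context can be on the order of tens of GiB on its own, i.e. far more than 16 GB of VRAM, which is why keeping it in system RAM (for example via llama.cpp's `--no-kv-offload` flag) while the quantized weights stay on the GPU makes this configuration workable at all.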



