
Thanks for the uploads! Was reading through the Unsloth docs for Qwen3-Coder before I found the HN thread :)

What would be a reasonable throughput level to expect from running 8-bit or 16-bit versions on 8x H200 DGX systems?



Oh, 8x H200 is nice - for llama.cpp, definitely look at https://docs.unsloth.ai/basics/qwen3-coder-how-to-run-locall... - llama.cpp has a high-throughput mode which should be helpful.

You should be able to get 40 to 50 tokens/s at a minimum. High-throughput mode plus a small draft model might get you to 100 tokens/s generation.
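For reference, a draft model is wired up via llama-server's speculative-decoding flags. This is only a sketch: the GGUF filenames are placeholders (not from this thread), and the tuning values (`--parallel`, `--draft-max`, etc.) would need adjusting for your hardware and workload.

```shell
# Sketch of a llama-server launch with multiple slots + speculative decoding.
# Filenames below are placeholder assumptions, not actual release artifacts.
llama-server \
  -m Qwen3-Coder-Q8_0.gguf \
  -md qwen3-draft-small.gguf \
  -ngl 99 \
  -c 32768 \
  --parallel 8 \
  --draft-max 16 --draft-min 4
# -m   : main 8-bit quantized model
# -md  : small draft model used for speculative decoding
# -ngl : offload all layers to GPU
# --parallel : serve multiple slots; continuous batching is on by default
```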



