
Only if the GPU is serving multiple users, maybe.

LLMs can't batch token generation for a single user. It's sequential: each token depends on the ones before it. In fact, that's part of the paper's point: "dumb" batching leaves the GPU underutilized because responses aren't all the same length, so by the end the batch has dwindled to a handful of still-running responses being processed one token at a time.
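
To make the underutilization concrete, here's a rough back-of-the-envelope sketch in Python (my own illustration, not from the paper, and the lengths are made up): with static batching, every request holds its batch slot until the longest response in the batch finishes, so slots belonging to already-finished responses keep doing wasted work.

    def static_batch_utilization(response_lengths):
        # One forward pass per decode step covers the whole batch,
        # and the batch runs until its longest response finishes.
        steps = max(response_lengths)
        total_slot_steps = steps * len(response_lengths)
        useful = sum(response_lengths)  # tokens actually needed
        return useful / total_slot_steps

    # Made-up batch with realistically uneven response lengths:
    print(static_batch_utilization([10, 40, 90, 500]))  # 0.32

So even a batch of four keeps only about a third of its slots busy here; continuous batching schemes fix this by refilling finished slots with new requests instead of waiting for the whole batch to drain.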



