
Only if the GPU is serving multiple users, maybe.

LLMs can't batch token generation for a single user. It's sequential: each token depends on the ones before it. In fact, that's part of the paper's point: "dumb" batching leaves the GPU underutilized because responses aren't all the same length, so by the end the batch has dwindled to a handful of still-running responses being processed one token at a time.
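
To make the underutilization concrete, here's a rough back-of-the-envelope sketch in Python (my own illustration, not from the paper, and the lengths are made up): with static batching, every request holds its batch slot until the longest response in the batch finishes, so slots belonging to already-finished responses keep doing wasted work.

    def static_batch_utilization(response_lengths):
        # One forward pass per decode step covers the whole batch,
        # and the batch runs until its longest response finishes.
        steps = max(response_lengths)
        total_slot_steps = steps * len(response_lengths)
        useful = sum(response_lengths)  # tokens actually needed
        return useful / total_slot_steps

    # Made-up batch with realistically uneven response lengths:
    print(static_batch_utilization([10, 40, 90, 500]))  # 0.32

So even a batch of four keeps only about a third of its slots busy here; continuous batching schemes fix this by refilling finished slots with new requests instead of waiting for the whole batch to drain.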



