Batching during inference massively helps arithmetic intensity, and the batch sizes that motivates tend to exceed the memory capacity of a single GPU.
Hence the desire for training-like cluster processing, so that e.g. every time a weight is fetched from memory it gets used for every inference stream that needs it.
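For intuition, a rough roofline-style sketch (all numbers below are illustrative assumptions, not measurements): with fp16 weights, the arithmetic intensity of a weight matrix applied to a decode batch grows linearly with batch size, and you need batches in the hundreds before a modern GPU stops being memory-bandwidth-bound on weight traffic.

```python
# Rough estimate of arithmetic intensity for one weight matrix applied to a
# batch of decode tokens. Shapes and the hardware balance point are assumptions.

def arithmetic_intensity(d_in: int, d_out: int, batch: int,
                         bytes_per_weight: float = 2.0) -> float:
    """FLOPs per byte of weight traffic for a (d_out x d_in) matmul over `batch` tokens."""
    flops = 2 * d_in * d_out * batch                 # one multiply-add per weight per token
    weight_bytes = d_in * d_out * bytes_per_weight   # each weight fetched once for the whole batch
    return flops / weight_bytes

# Hypothetical accelerator balance point: a few hundred FLOPs per byte of HBM traffic.
COMPUTE_PER_BYTE = 300.0

for batch in (1, 8, 64, 512):
    ai = arithmetic_intensity(d_in=8192, d_out=8192, batch=batch)
    bound = "compute-bound" if ai >= COMPUTE_PER_BYTE else "memory-bound"
    print(f"batch={batch:4d}  ~{ai:6.1f} FLOPs/byte  -> {bound}")
```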
The catch is that you typically can't fit the context for 100+ inference streams on one GPU, hence the desire to shard along dimensions that are less wasteful (w.r.t. memory bandwidth) than entire inference streams.
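And a rough sense of why 100+ streams don't fit: for a hypothetical 70B-class model shape (80 layers, 8 KV heads with GQA, head dim 128, fp16 cache, all assumptions for illustration), the KV cache alone for 128 streams at 8K context is already several times a single GPU's HBM.

```python
# Back-of-envelope KV-cache footprint for many concurrent inference streams.
# Model shape, context length, and dtype are assumptions for illustration.

def kv_cache_gb(streams: int, context_len: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_elem: float = 2.0) -> float:
    # 2x for keys and values, per layer, per token, per stream.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return streams * context_len * per_token / 1e9

total = kv_cache_gb(streams=128, context_len=8192, layers=80, kv_heads=8, head_dim=128)
print(f"~{total:.0f} GB of KV cache")   # roughly 340 GB, vs. ~80 GB of HBM on one GPU
```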