
Thank you! I was looking for how to do this. The example in the issue above shows how to increase the context size in ollama:

    $ ollama run llama3.2
    >>> /set parameter num_ctx 32768
    Set parameter 'num_ctx' to '32768'
    >>> /save llama3.2-32k
    Created new model 'llama3.2-32k'
    >>> /bye
    $ ollama run llama3.2-32k "Summarize this file: $(cat README.md)"
    ...
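If you'd rather not save a new model, the same parameter can apparently also be passed per request through ollama's local HTTP API. A minimal Python sketch, assuming the default endpoint at http://localhost:11434 and the documented "options" field:

    import json
    import urllib.request

    # Ask for a one-off 32k context window instead of baking it into a /save'd model.
    payload = {
        "model": "llama3.2",
        "prompt": "Summarize this file: ...",   # paste the README contents here
        "stream": False,
        "options": {"num_ctx": 32768},          # per-request context window
    }

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])
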
The table in the reddit post above also shows context size vs. memory requirements for Model: 01-ai/Yi-34B-200K (Params: 34.395B, Mode: infer):

    Sequence Length vs Bit Precision Memory Requirements
       SL / BP |     4      |     6      |     8      |     16
    --------------------------------------------------------------
           256 |     16.0GB |     24.0GB |     32.1GB |     64.1GB
           512 |     16.0GB |     24.1GB |     32.1GB |     64.2GB
          1024 |     16.1GB |     24.1GB |     32.2GB |     64.3GB
          2048 |     16.1GB |     24.2GB |     32.3GB |     64.5GB
          4096 |     16.3GB |     24.4GB |     32.5GB |     65.0GB
          8192 |     16.5GB |     24.7GB |     33.0GB |     65.9GB
         16384 |     17.0GB |     25.4GB |     33.9GB |     67.8GB
         32768 |     17.9GB |     26.8GB |     35.8GB |     71.6GB
         65536 |     19.8GB |     29.6GB |     39.5GB |     79.1GB
        131072 |     23.5GB |     35.3GB |     47.0GB |     94.1GB
    *   200000 |     27.5GB |     41.2GB |     54.9GB |    109.8GB

    * Model Max Context Size
Code: https://gist.github.com/lapp0/d28931ebc9f59838800faa7c73e3a0...
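Most of the growth in that table is the KV cache, which scales linearly with sequence length. A rough back-of-envelope in Python, assuming the published Yi-34B config (60 layers, 8 KV heads via grouped-query attention, head dim 128) and ignoring weights and activations, so it only explains the deltas between rows, not the totals:

    # Rough KV-cache size: 2 (K and V) x layers x KV heads x head dim x bytes x tokens.
    def kv_cache_gib(seq_len, n_layers=60, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
        return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len / 2**30

    for seq_len in (256, 4096, 32768, 131072, 200000):
        print(f"{seq_len:>7} tokens -> ~{kv_cache_gib(seq_len):5.1f} GiB of fp16 KV cache")

That works out to roughly 0.23 MB per token at fp16, which lines up with the ~46 GB difference between the 256 and 200,000 rows in the 16-bit column.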


Can context be split across multiple GPUs?


Not my field, but from this[1] blog post, which references this[2] paper, it would seem so. Note that the optimal approaches differ somewhat between training and inference. Also note that several of the approaches rely on batching multiple requests (prompts) to exploit the parallelism, so you won't see the same gains if you feed it only a single prompt at a time. A rough sketch of the per-GPU effect is below the references.

[1]: https://medium.com/@plienhar/llm-inference-series-4-kv-cachi...

[2]: https://arxiv.org/abs/2104.04473
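To make the multi-GPU point concrete: with Megatron-style tensor parallelism [2], attention heads are sharded across GPUs, so each GPU only stores the KV cache for its own slice of heads. A sketch using the same assumed Yi-34B numbers as above (60 layers, 8 KV heads, head dim 128), ignoring communication overhead and any head replication:

    # Per-GPU KV cache under tensor parallelism: each GPU holds 1/tp_degree of the KV heads.
    def per_gpu_kv_cache_gib(seq_len, tp_degree, n_layers=60, n_kv_heads=8,
                             head_dim=128, bytes_per_elem=2):
        heads_per_gpu = n_kv_heads / tp_degree
        return 2 * n_layers * heads_per_gpu * head_dim * bytes_per_elem * seq_len / 2**30

    for tp in (1, 2, 4, 8):
        print(f"TP={tp}: ~{per_gpu_kv_cache_gib(200_000, tp):5.1f} GiB of KV cache per GPU")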



