That calculation is incorrect. You need to fit the model (140GB for FP16) plus the KV cache (roughly 5GB per sequence at 32k tokens in FP8 with FlashAttention-2) times the batch size into VRAM.
If the goal is to run an FP16 70B model as fast as possible, you would want 8 GPUs with P2P, for a total of 192GB of VRAM. The model is then split across all 8 GPUs with 8-way tensor parallelism, letting you use the full 8TB/s of aggregate memory bandwidth on every iteration. That leaves roughly 50GB spread across the GPUs for KV cache pages, so you can raise the batch size to 8 (or maybe more).
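Roughly, the arithmetic looks like this (a quick back-of-the-envelope sketch; the per-GPU VRAM and per-sequence KV cache figures are just the ones quoted above):

```python
# Approximate VRAM budget for 8-way tensor parallelism on an FP16 70B model.
num_gpus = 8
vram_per_gpu_gb = 24          # e.g. an RTX 4090
model_fp16_gb = 140           # 70B params * 2 bytes
kv_cache_per_seq_gb = 5       # ~32k tokens per sequence, FP8 KV cache

total_vram_gb = num_gpus * vram_per_gpu_gb        # 192 GB total
headroom_gb = total_vram_gb - model_fp16_gb       # ~52 GB left for KV cache pages
max_batch = headroom_gb // kv_cache_per_seq_gb    # ~10 full-context sequences

print(total_vram_gb, headroom_gb, max_batch)
```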
I’ve got a few 4090s that I’m planning on doing this with. I’d appreciate even the smallest directional tip on splitting the model in a way you believe is likely to work.
The split is done automatically by the inference engine if you enable tensor parallelism. TensorRT-LLM, vLLM, and aphrodite-engine can all do this out of the box. The main catch is that you generally need either 4 or 8 GPUs for it to work on current models.
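In vLLM, for example, this is just a constructor argument; a minimal sketch (the model name is a placeholder, and I'm assuming the weights fit across your GPUs):

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size shards the model across that many GPUs;
# it must evenly divide the model's attention heads, hence the 2/4/8 restriction.
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder; use whatever 70B checkpoint you run
    tensor_parallel_size=4,
    dtype="float16",
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```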
2 GPUs work fine too, as long as your model fits. Mixing different GPU models with the same VRAM, however, is highly sketchy: sometimes it works, sometimes it doesn't. In any case, throughput would be limited by the slower GPU.