That calculation is incorrect. You need to fit the model (140GB for FP16) plus the KV cache (roughly 5GB per sequence at 32k tokens in FP8 with FlashAttention-2) times the batch size into VRAM.
If the goal is to run an FP16 70B model as fast as possible, you would want 8 GPUs with P2P, for a total of 192GB of VRAM. The model is then split across all 8 GPUs with 8-way tensor parallelism, letting you use the full 8TB/s of aggregate memory bandwidth on every iteration. That leaves roughly 50GB spread across the GPUs for KV cache pages, so you can raise the batch size to 8 (or maybe more).
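Roughly, the arithmetic looks like this (a quick back-of-the-envelope sketch; the per-GPU VRAM and per-sequence KV cache figures are just the ones quoted above):

```python
# Approximate VRAM budget for 8-way tensor parallelism on an FP16 70B model.
num_gpus = 8
vram_per_gpu_gb = 24          # e.g. an RTX 4090
model_fp16_gb = 140           # 70B params * 2 bytes
kv_cache_per_seq_gb = 5       # ~32k tokens per sequence, FP8 KV cache

total_vram_gb = num_gpus * vram_per_gpu_gb        # 192 GB total
headroom_gb = total_vram_gb - model_fp16_gb       # ~52 GB left for KV cache pages
max_batch = headroom_gb // kv_cache_per_seq_gb    # ~10 full-context sequences

print(total_vram_gb, headroom_gb, max_batch)
```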
I’ve got a few 4090s that I’m planning on doing this with. I’d appreciate even the smallest directional tip on splitting the model in a way you believe is likely to work.
The split is done automatically by the inference engine if you enable tensor parallelism. TensorRT-LLM, vLLM, and aphrodite-engine can all do this out of the box. The main catch is that you generally need either 4 or 8 GPUs for it to work on current models.
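In vLLM, for example, this is just a constructor argument; a minimal sketch (the model name is a placeholder, and I'm assuming the weights fit across your GPUs):

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size shards the model across that many GPUs;
# it must evenly divide the model's attention heads, hence the 2/4/8 restriction.
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder; use whatever 70B checkpoint you run
    tensor_parallel_size=4,
    dtype="float16",
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```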
2 GPUs work fine too, as long as your model fits. Mixing different GPU models with the same VRAM, however, is highly sketchy: sometimes it works, sometimes it doesn't. In any case, throughput would be limited by the slower GPU.