For example, if you want to run low latency multi-GPU inference with tensor parallelism in TensorRT-LLM, there is a requirement that the number of heads in the model is divisible by the number of GPUs. Most current published models are divisible by 4 and 8, but not 6.
Interesting... 1 Zen 4 EPYC CPU yields a maximum of 128 PCIE lanes so it wouldn't be possible to put 8 full fat GPUs on while maintaining some lanes for storage and networking. Same deal with Threadripper Pro.
It should be possible with onboard PCIe switches. You probably don't need the networking or storage to be all that fast while running the job, so it can dedicate almost all of the bandwidth to the GPU.
I don't know if there are boards that implement this, though, I'm only looking at systems with 4x GPUs currently. Even just plugging in a 5kW GPU server in my apartment would be a bit of a challenge. With 4x 4090, the max load would be below 3kW, so a single 240V plug can handle it no issue.
Not sure if there exists an 8-way PCIE Gen 5 Multiplexer that doesn't cost ludicrous amounts of cash. Ludicrous being a highly subjective and relative term of course.
98 lanes of PCIe 4.0 fabric switch as just the chip (to solder onto a motherboard/backplane) costs 850$ (PEX88096).
You could for example take 2 x16 GPUs, pass then through (2216=64 lanes), and have 2 x16 that bifurcate to at least x4 (might even be x2, I didn't find that part of the docs just now) for anything you want, plus 2 x1 for minor stuff.
They do claim to have no problems being connected up into a switching fabric, and very much allow multi-host operations (you will need signal retimers quite soon, though).
They're the stuff that enables cloud operators to pool like 30 GPUs across like 10 CPU sockets while letting you virtually hot-plug them to fit demand.
Or when you want to make a SAN with real NVMe-over-PCIe.
Far cheaper than normal networking switches with similar ports (assuming hosts doing just x4 bifurcation, it's very comparable to a 50G Ethernet port. The above chip thus matches a 24 port 50G Ethernet switch. Trading reach for only needing retimers, not full NICs, in each connected host. Easily better for HPC clusters up to about 200 kW made from dense compute nodes.), but sadly still lacking affordable COTS parts that don't require soldering or contacting sales for pricing (the only COTS with list prices seem to be Broadcom's reference designs, for prices befitting an evaluation kit, not a Beowulf cluster).
I really like the information about how the cloud providers do their multiplexing, thanks. There was some tech posted here a few month ago that was similar I found very interesting - plug all devices, ram, hard drives, and CPUS into a larger fabric and a way to spin up "servers" of any size from the pool of resources... wish I could remember the name now.
nit: HN formatting messed up your math in the second sentence, I believe you italicized on accident using * for equations.
It's more difficult to split your work across 6 GPUs evenly, and easier when you have 4 or 8 GPUs. The latter setups have powers of 2, which for example, can evenly divide a 2D or 3D grid, but 6 GPUs are awkward to program. Thus, the OP argues that a 6-GPU setup is highly suboptimal for many existing applications and there's no point to pay more for the extra 2.
What do you mean?