GPU training of ML/DL models gets bottlenecked in surprising places. Generally, when doing distributed training across multiple GPUs with a large dataset, you move enough data to fill the GPU's memory all at once, let it crunch for a little while, then push the next batch in. People have come up with workarounds like doing multiple training passes over each batch of data, but ideally you would not do that and would instead refresh the whole 12/16GB of training data on each pass. If you have enough host RAM to keep the whole training dataset in memory (likely), then you can easily find that the bottleneck in your system is transfer bandwidth to the GPU. People like to train with a large number of GPUs in parallel, but they really don't like cutting down to 8x PCIe lanes per GPU.
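A common way to hide some of that host-to-device transfer cost is to overlap the copy of the next batch with compute on the current one. Below is a minimal sketch of that idea, assuming PyTorch (the comment doesn't name a framework); the model, dataset sizes, and `prefetch_loader` helper are all hypothetical and just illustrate pinned memory plus non-blocking copies on a side CUDA stream.

```python
# Hypothetical sketch: overlap H2D transfers with GPU compute using
# pinned host memory and a dedicated copy stream (PyTorch assumed).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy dataset and model, purely for illustration.
x = torch.randn(100_000, 512)
y = torch.randint(0, 10, (100_000,))
loader = DataLoader(
    TensorDataset(x, y),
    batch_size=4096,
    shuffle=True,
    num_workers=4,
    pin_memory=True,   # pinned host buffers allow async H2D copies
)

model = nn.Linear(512, 10).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

copy_stream = torch.cuda.Stream() if device.type == "cuda" else None

def prefetch_loader(loader):
    """Yield batches already copied to the GPU, copying the next batch
    on a side stream while the previous one is being consumed."""
    prev = None
    for xb, yb in loader:
        if copy_stream is not None:
            with torch.cuda.stream(copy_stream):
                xb = xb.to(device, non_blocking=True)
                yb = yb.to(device, non_blocking=True)
        else:
            xb, yb = xb.to(device), yb.to(device)
        if prev is not None:
            yield prev
        prev = (xb, yb)
    if prev is not None:
        yield prev

for xb, yb in prefetch_loader(loader):
    if copy_stream is not None:
        # Make sure the copy of this batch has finished before compute.
        torch.cuda.current_stream().wait_stream(copy_stream)
    opt.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    opt.step()
```

This only hides latency; it doesn't add bandwidth, so if the dataset truly has to stream over 8x PCIe lanes every epoch, the link itself remains the ceiling.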


