
GPU memory bandwidth is a limiting factor for how fast training can happen, so it's much more efficient to train models on tightly interconnected, high-memory GPUs in a single cluster.

Also, gradient updates from all nodes need to be combined at least every few training steps, and syncing all of those gradients over an ordinary network would take far longer than over local GPU interconnects.
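A rough back-of-envelope sketch of why that sync cost matters (all numbers below are illustrative assumptions, not measurements): compare the time to move one round of gradients over a fast local interconnect versus a typical internet link.

```python
# Back-of-envelope estimate with assumed numbers: time to all-reduce
# the gradients of a 7B-parameter model stored in fp16 (2 bytes each).
params = 7e9
bytes_per_param = 2                     # fp16 gradients
grad_bytes = params * bytes_per_param   # ~14 GB per sync step

# A ring all-reduce sends roughly 2*(n-1)/n times the gradient size
# over each link; approximate as 2x for a large number of nodes.
traffic = 2 * grad_bytes

nvlink_bw = 450e9    # assumed ~450 GB/s local GPU interconnect
wan_bw = 1.25e9      # assumed 10 Gbit/s internet link = 1.25 GB/s

print(f"local sync: {traffic / nvlink_bw:.2f} s")
print(f"WAN sync:   {traffic / wan_bw:.1f} s")
```

Under these assumptions the local sync takes a fraction of a second while the internet sync takes on the order of tens of seconds per step, which is why training stays on co-located hardware.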


