
At the very least, computation could be distributed at the hyperparameter tuning stage. Each node would be responsible for training on a different set of hyperparameters. The master node would coordinate the selection and distribution of new hyperparameter sets, amortizing the data-set distribution time.
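
As a rough illustration (not a real cluster setup), something like the following, with Python's multiprocessing standing in for worker nodes and train_and_evaluate as a hypothetical placeholder for the actual training job:

    # Sketch of master-coordinated hyperparameter search. Each worker process
    # stands in for a cluster node; train_and_evaluate is a hypothetical
    # placeholder for training against the node's locally cached data-set.
    import itertools
    import random
    from multiprocessing import Pool

    def train_and_evaluate(hparams):
        # Placeholder: train on the local data copy and return a validation
        # score for this hyperparameter set.
        lr, batch_size = hparams
        return {"lr": lr, "batch_size": batch_size,
                "val_loss": random.random()}  # stand-in for a real metric

    def main():
        # Master node: enumerate candidate hyperparameter sets once, then
        # farm them out; the data-set itself only has to be shipped once.
        grid = list(itertools.product([1e-2, 1e-3, 1e-4],   # learning rates
                                      [32, 64, 128]))        # batch sizes
        with Pool(processes=4) as pool:   # 4 simulated nodes
            results = pool.map(train_and_evaluate, grid)
        best = min(results, key=lambda r: r["val_loss"])
        print("best hyperparameters:", best)

    if __name__ == "__main__":
        main()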

It would also be possible to distribute computation of batches across nodes. Each node would compute the gradients on its batch, and the master would combine gradients and distribute new weights.
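
A toy sketch of that synchronous scheme in NumPy, using an illustrative linear-regression model rather than a real network; the workers' gradient computations would run in parallel on separate nodes:

    # Synchronous data-parallel SGD: each "node" computes gradients on its
    # own batch, the master averages them and broadcasts new weights.
    import numpy as np

    def local_gradient(weights, x_batch, y_batch):
        # Gradient of mean squared error for a linear model y = x @ w.
        preds = x_batch @ weights
        return 2.0 * x_batch.T @ (preds - y_batch) / len(y_batch)

    rng = np.random.default_rng(0)
    n_nodes, dim, lr = 4, 10, 0.1
    weights = np.zeros(dim)

    # Each node holds its own shard of the data (distributed once up front).
    shards = [(rng.normal(size=(256, dim)), rng.normal(size=256))
              for _ in range(n_nodes)]

    for step in range(100):
        # Workers: compute gradients on their local batches.
        grads = [local_gradient(weights, x, y) for x, y in shards]
        # Master: average the gradients and broadcast the updated weights.
        weights -= lr * np.mean(grads, axis=0)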

High-speed interconnects (e.g. InfiniBand) are not needed in this scenario, and bandwidth usage scales with the size of the weights and gradients, not with the size of the data-set.
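
Back-of-the-envelope, assuming a hypothetical 100M-parameter float32 model and 8 nodes, the per-step traffic is fixed by the model size and node count alone:

    # Illustrative per-step traffic for the gradient/weight exchange above;
    # the model size and node count are assumptions, not measurements.
    num_params = 100_000_000
    bytes_per_param = 4          # float32
    num_nodes = 8

    # Each step: every node sends its gradients up and gets new weights back.
    per_node = 2 * num_params * bytes_per_param          # up + down, in bytes
    total = per_node * num_nodes

    print(f"per node per step: {per_node / 1e9:.1f} GB")      # 0.8 GB
    print(f"at the master per step: {total / 1e9:.1f} GB")    # 6.4 GB
    # The totals depend only on model size and node count, not data-set size.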



