For very large models, the weights may not fit on one GPU.
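
Quick back-of-the-envelope to see why (the parameter count and GPU size here are just illustrative assumptions, sketched in Python for concreteness):

    # Weights alone for an assumed 70B-parameter model stored in fp16/bf16.
    params = 70e9          # assumed parameter count
    bytes_per_param = 2    # fp16/bf16
    weight_gb = params * bytes_per_param / 1e9
    print(weight_gb)       # ~140 GB of weights, more than a single 80 GB GPU,
                           # before gradients, optimizer state, or activations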

Also, multiple GPUs can enable a larger effective batch size when each GPU only has room for the activations of perhaps one or two training examples.

There is definitely a performance hit, but a GPU<->GPU peer transfer has lower latency than going GPU->CPU->software context switch->GPU.
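
In PyTorch you can check whether two devices can do that peer access at all (a minimal sketch; the device indices 0 and 1 are assumptions):

    import torch

    # Peer access means GPU 0 can copy to/from GPU 1 directly over NVLink/PCIe,
    # without staging the data through host (CPU) memory.
    if torch.cuda.device_count() >= 2:
        print(torch.cuda.can_device_access_peer(0, 1))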

For "normal" pytorch training, the training is generally streamed through the GPU. The model does a batch training step on one batch while the next one is being loaded, and the transfer time is usually less than than the time it takes to do the forward and backward passes through the batch.

For multi-GPU training there are various data-parallel and model-parallel topologies, and there are ways of mitigating latency by interleaving some operations so you don't take the full hit, but multi-GPU training is definitely not perfectly parallel. It is almost required for some large models, and sometimes a moderately larger batch improves convergence speed enough to overcome the per-batch latency hit.
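
A minimal data-parallel sketch with DistributedDataParallel (the launch command, model, and dataset are placeholder assumptions; each process owns one GPU): every rank trains on its own shard of the data, and gradients are all-reduced across GPUs, which is where the communication cost comes from — NCCL overlaps the all-reduce with the backward pass where it can.

    # Launch with e.g.: torchrun --nproc_per_node=4 train_ddp.py
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(128, 1).cuda(), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    ds = TensorDataset(torch.randn(1024, 128), torch.randn(1024, 1))
    # Each rank sees a different shard; effective batch = per-GPU batch * world size.
    sampler = DistributedSampler(ds)
    loader = DataLoader(ds, batch_size=16, sampler=sampler, pin_memory=True)

    for x, y in loader:
        x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()   # gradient all-reduce overlaps with the backward pass
        opt.step()

    dist.destroy_process_group()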
