Simplified: Training works by taking an input sample (say an image), running it through the network, seeing if your answer is right, then updating the weights.
If you had 4 GPUs, each GPU would process 1/4 of the input images. Once they are all done, they would pool their updates and apply them to a global view of the network. Repeat.
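To make the "pool their updates" step concrete, here is a minimal sketch in plain Python. It assumes a toy one-weight model y = w * x with mean squared error, and it simulates the 4 GPUs with a sequential loop; the names `gradient` and `data_parallel_step` are made up for illustration and are not from any library.

```python
# Toy data-parallel step: 4 simulated "GPUs" each compute a gradient on
# their shard of the batch, then the gradients are averaged into one update.

def gradient(w, shard):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, batch, num_workers=4, lr=0.01):
    # Assumption: len(batch) is divisible by num_workers, so shards are equal.
    size = len(batch) // num_workers
    shards = [batch[i * size:(i + 1) * size] for i in range(num_workers)]
    # Each worker runs forward + backward on its own shard (sequential here).
    grads = [gradient(w, shard) for shard in shards]
    # "Pool the updates": average the per-worker gradients.
    pooled = sum(grads) / num_workers
    # Apply the pooled gradient to the single shared copy of the weight.
    return w - lr * pooled

batch = [(float(x), 3.0 * x) for x in range(1, 9)]  # targets come from y = 3x
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, batch)
```

With equal-sized shards, the averaged gradient matches the gradient over the whole batch, so the result is the same as training on one device with the full batch, and `w` ends up close to 3.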
In practice, images are not particularly large and a batch of them would easily fit on a single GPU. What's more common is either (a) performing the forward and backward passes on 4 GPUs where each GPU has its own batch, then collecting the gradient from all 4 backward passes or (b) splitting the computation for individual layers across multiple GPUs.
Both (a) and (b) have trade-offs. Approach (a) multiplies the effective batch size by the number of GPUs, and some models perform worse with large batches, so it is not always preferred; other models are hard or impossible to parallelize at the layer level, ruling out (b). Google NMT did (b), though it required many trade-offs and restrictions (see my blog post[1]), while many image-based tasks are happy with large batch sizes and so go with (a).
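As a rough illustration of one way to read (b), the sketch below splits the output units of a single dense layer across 4 simulated devices and concatenates the partial results. This is an assumed toy setup in plain Python, not how Google NMT or any particular framework partitions its layers; `dense` and `split_dense` are hypothetical helper names.

```python
# Toy layer-level split: one dense layer's output units are divided among
# 4 simulated devices; each computes its slice, and slices are concatenated.

def dense(x, weights):
    """weights is a list with one weight vector per output unit."""
    return [sum(xi * wi for xi, wi in zip(x, unit)) for unit in weights]

def split_dense(x, weights, num_devices=4):
    # Assumption: the number of output units is divisible by num_devices.
    size = len(weights) // num_devices
    out = []
    for d in range(num_devices):
        part = weights[d * size:(d + 1) * size]  # units owned by device d
        out.extend(dense(x, part))               # gather the partial outputs
    return out

x = [1.0, 2.0]
weights = [[float(i), float(i + 1)] for i in range(8)]  # 8 output units
result = split_dense(x, weights)
```

Every device needs the full input `x` and the outputs have to be gathered afterwards, which hints at the communication cost that makes (b) awkward for some architectures.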