In practice, images are not particularly large and a batch of them would easily fit on a single GPU. What's more common is either (a) performing the forward and backward passes on 4 GPUs where each GPU has its own batch, then collecting the gradients from all 4 backward passes, or (b) splitting the computation for individual layers across multiple GPUs.
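For concreteness, a rough PyTorch sketch of (a) might look like the following (the model, batch shape, and device count are made up, and real code would lean on torch.nn.parallel.DistributedDataParallel rather than doing this by hand):

    import copy
    import torch
    import torch.nn as nn

    devices = [f"cuda:{i}" for i in range(4)]
    model = nn.Linear(512, 10)                 # stand-in for a real network
    replicas = [copy.deepcopy(model).to(d) for d in devices]

    batch = torch.randn(64, 512)               # one large batch...
    shards = batch.chunk(len(devices))         # ...split into per-GPU slices

    # Each GPU runs its own forward and backward pass on its slice.
    for replica, shard, device in zip(replicas, shards, devices):
        loss = replica(shard.to(device)).sum() # toy loss
        loss.backward()

    # Collect the gradients from all replicas and average them into the
    # master copy; an optimizer step on `model` then updates the weights,
    # which get broadcast back to the replicas for the next batch.
    for master_p, *replica_ps in zip(model.parameters(),
                                     *(r.parameters() for r in replicas)):
        master_p.grad = torch.stack([p.grad.cpu() for p in replica_ps]).mean(dim=0)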
Both (a) and (b) come with trade-offs. Some models perform worse with large batch sizes, which makes (a) unattractive, while others are hard or impossible to parallelize at the layer level, ruling out (b). Google NMT went with (b), though that required many trade-offs and restrictions (see my blog post[1]), while many image-based tasks tolerate large batch sizes just fine and so go with (a).
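A similarly hedged sketch of (b): split the layers themselves across devices so activations (and, on the backward pass, gradients) hop from GPU to GPU. The layer sizes and device ids are again placeholders:

    import torch
    import torch.nn as nn

    class TwoDeviceNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Linear(512, 1024).to("cuda:0")  # first half on GPU 0
            self.decoder = nn.Linear(1024, 10).to("cuda:1")   # second half on GPU 1

        def forward(self, x):
            h = torch.relu(self.encoder(x.to("cuda:0")))
            return self.decoder(h.to("cuda:1"))               # activations cross devices

    out = TwoDeviceNet()(torch.randn(64, 512))
    out.sum().backward()   # gradients flow back across both devices

Those device-to-device copies are where much of the communication overhead comes from, which is part of why (b) tends to need more careful engineering than (a).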
[1]: http://smerity.com/articles/2016/google_nmt_arch.html