This area is a big deal - ML networks need to be much deeper and denser to provide human-level understanding, and training networks is currently a considerable bottleneck.
Does this method make it easier to spread a neural network over multiple GPUs/machines? I mean, does it reduce the amount of data being communicated between compute nodes or just decouples the updates from the need to wait for the rest of the net to finish?
Does this method make it easier to spread a neural network over multiple GPUs/machines?
Yes, but this isn't the primary focus of this work.
This is about a method of approximating error rates (gradient) back up the neural network.
This is important because allowing the use of approximate error rates means that earlier layers can be trained without waiting for error back-propagation from the later layers.
This asynchronous feature helps on a (computer) network too - there is no need to wait for back-propagation across the network.
As they point out the error will back-propagation eventually. The analogy with an eventually consistent database system (and the effect that has on scalability) is pretty clear.