When we use data parallelism, we're summing/averaging the gradients induced by different parts of the data, not the weights of the model itself. When using multiple models for ensemble methods, we're summing/averaging the outputs of the models for a given sample, not the weights of the models themselves. Averaging the weights of several models might happen to work on a particular problem, but it definitely doesn't work in general.
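To make that distinction concrete, here's a minimal numpy sketch (toy linear model, made-up data, all names hypothetical) of the two kinds of averaging that do work: averaging gradients across data shards, and averaging predictions across an ensemble.

```python
# Minimal sketch (toy linear model, made-up data) of the two kinds of averaging
# that do work: gradients across data shards, and predictions across an ensemble.
# Never the weights.
import numpy as np

rng = np.random.default_rng(0)

def grad(w, X, y):
    # gradient of mean squared error for a linear model y ~ X @ w
    return 2 * X.T @ (X @ w - y) / len(y)

# --- data parallelism: one set of weights, gradients averaged over shards ---
w = rng.normal(size=3)
X1, y1 = rng.normal(size=(8, 3)), rng.normal(size=8)
X2, y2 = rng.normal(size=(8, 3)), rng.normal(size=8)
g = (grad(w, X1, y1) + grad(w, X2, y2)) / 2   # equals the gradient on the combined batch
w = w - 0.1 * g

# --- ensembling: separately trained models, predictions averaged per sample ---
w_a, w_b = rng.normal(size=3), rng.normal(size=3)  # stand-ins for two trained models
x = rng.normal(size=3)
ensemble_pred = (x @ w_a + x @ w_b) / 2            # average outputs, not weights
```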
Trivial example:

w2 * tanh(w1 * x) = (-w2) * tanh((-w1) * x)

Both sides compute the exact same function (tanh is odd), but one model has weights (w1, w2) and the other has (-w1, -w2). Average the weights of those two equivalent models and you get w1 = 0 and w2 = 0, i.e. a model that outputs 0 for every input.
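A quick numpy check of that sign-flip example (toy one-hidden-unit model, arbitrary weight values):

```python
# Quick numpy check of the sign-flip example above.
import numpy as np

def f(w1, w2, x):
    return w2 * np.tanh(w1 * x)

w1, w2 = 1.3, 0.7                                  # arbitrary weights
x = np.linspace(-2.0, 2.0, 5)

assert np.allclose(f(w1, w2, x), f(-w1, -w2, x))   # same function, flipped signs

# average the weights of the two equivalent models:
w1_avg, w2_avg = (w1 - w1) / 2, (w2 - w2) / 2      # both are 0
print(f(w1_avg, w2_avg, x))                        # [0. 0. 0. 0. 0.]
```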
If you're talking about asynchronous data parallelism, then there can be some averaging of weights, but all the workers start from the same weights and are re-synced often enough that they never drift far enough apart to break the averaging.
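Roughly what that looks like, as a local-SGD-style sketch (toy linear model, hypothetical worker/step counts, not any particular framework's API):

```python
# Rough local-SGD-style sketch: every worker starts from the same weights, takes
# a few local steps on its own shard, and is averaged back in before drifting far.
import numpy as np

rng = np.random.default_rng(0)
shards = [(rng.normal(size=(32, 3)), rng.normal(size=32)) for _ in range(4)]
w = np.zeros(3)                                   # shared starting point

def grad(w, X, y):
    return 2 * X.T @ (X @ w - y) / len(y)

for _ in range(10):                               # sync rounds
    local = []
    for X, y in shards:                           # each "worker" updates its own copy
        w_k = w.copy()
        for _ in range(5):                        # a few local steps between syncs
            w_k -= 0.05 * grad(w_k, X, y)
        local.append(w_k)
    w = np.mean(local, axis=0)                    # re-sync: average the weights
```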
If you're averaging weights often enough, it's basically the same as averaging gradients. But if you average the weights of a bunch of independently trained models, you're going to have a rough time. Even if two models compute the exact same function, the ordering of the rows and columns in their intermediate weight matrices will totally ruin your averaging strategy.
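Here's a small numpy sketch of that ordering problem (toy two-layer tanh net, random weights): permuting the hidden units leaves the function untouched, but averaging the permuted weights with the originals does not.

```python
# Permuting the hidden units of a small tanh net changes nothing about the
# function it computes, but averaging the permuted weights with the originals
# pairs up unrelated hidden units and wrecks the result.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(1, 4))

def net(W1, W2, x):
    return W2 @ np.tanh(W1 @ x)

perm = np.array([1, 2, 3, 0])                     # any non-identity reordering of hidden units
W1_p, W2_p = W1[perm], W2[:, perm]                # same function, rows/columns shuffled

x = rng.normal(size=3)
assert np.allclose(net(W1, W2, x), net(W1_p, W2_p, x))

# averaging the two equivalent weight sets:
W1_avg, W2_avg = (W1 + W1_p) / 2, (W2 + W2_p) / 2
print(net(W1, W2, x), net(W1_avg, W2_avg, x))     # no longer the same
```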