
When we use data parallelism, we're summing/averaging the gradients induced by different parts of the data, not the weights of the model itself. When using multiple models for ensemble methods, we're summing/averaging the output of the models for a sample. We're not summing/averaging the weights of the models themselves. While averaging the weights of several models might work on a given problem, it definitely doesn't work in general.
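A quick numerical sketch of that first point (toy linear model and MSE loss, my own numbers, not anything from the thread): with equal-size shards, the full-batch gradient is just the average of the per-shard gradients, which is what data parallelism relies on. The weights themselves are never averaged.

    # Toy check: for a mean-squared-error loss, the gradient over the full
    # batch equals the average of the per-shard gradients (equal shard sizes).
    import numpy as np

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(8, 3)), rng.normal(size=8)
    w = rng.normal(size=3)

    def grad(Xb, yb, w):
        # gradient of mean((Xb @ w - yb)^2) with respect to w
        return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

    full = grad(X, y, w)
    shards = [grad(X[i:i+2], y[i:i+2], w) for i in range(0, 8, 2)]
    assert np.allclose(full, np.mean(shards, axis=0))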

Here's a minimal counterexample for weight averaging: since tanh is odd, these two tiny networks compute exactly the same function:

w2*tanh(w1*x) = (-w2)*tanh(-w1*x)

But if you average the weights of those two equivalent models, you get all zeros, and the averaged model just outputs 0.
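A minimal sketch of that counterexample with concrete numbers (scalar weights of my own choosing): the two sign-flipped models agree everywhere, but their element-wise weight average is zero, so the averaged model outputs 0 for every input.

    # Sign symmetry of tanh: flipping the sign of both layers leaves the
    # function unchanged, but the element-wise weight average is zero.
    import numpy as np

    w1, w2 = 1.3, -0.7
    x = np.linspace(-2, 2, 5)

    f_a = w2 * np.tanh(w1 * x)
    f_b = (-w2) * np.tanh(-w1 * x)
    assert np.allclose(f_a, f_b)            # equivalent models

    w1_avg, w2_avg = (w1 + -w1) / 2, (w2 + -w2) / 2
    print(w1_avg, w2_avg)                   # both 0.0 -> averaged model outputs 0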

If you're talking about asynchronous data parallelism, then there can be some averaging of weights, but all the workers start from the same weights and are re-synced often enough that the weights never drift far enough apart to break anything.




https://www.docdroid.net/faDq8Bu/swarm-training-v01a.pdf

We average the weights themselves, and the efficiency seems to be similar to gradient gathering.

It’s also averaging in slices, not the full model. There’s never a full resync.
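Roughly what slice-wise averaging between two peers might look like (a toy sketch under my own assumptions, not necessarily the exchange protocol in the linked PDF): only a randomly chosen slice of the parameter vector is averaged at each sync, and the rest of each peer's weights are left alone.

    # Toy illustration of slice-wise weight averaging between two peers:
    # average only a random slice of the parameters, never the whole model.
    import numpy as np

    rng = np.random.default_rng(0)
    w_peer_a = rng.normal(size=10)
    w_peer_b = rng.normal(size=10)

    idx = rng.choice(10, size=3, replace=False)    # the slice to exchange
    avg = (w_peer_a[idx] + w_peer_b[idx]) / 2
    w_peer_a[idx] = avg
    w_peer_b[idx] = avg                            # remaining weights untouched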

SWA (stochastic weight averaging) is the theoretical basis for why it works, I think.

Another way of thinking about it: if the gradients can be averaged, then so can the weights, because starting from the same weights each worker's update differs only by its own (scaled) gradient.
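A toy check of that reasoning (my own numbers, assuming one plain SGD step per worker from a shared starting point): averaging the resulting weights is identical to taking one step with the averaged gradient.

    # Starting from the same weights, w_i = w0 - lr * g_i per worker, so
    # averaging the weights equals stepping with the averaged gradient.
    import numpy as np

    rng = np.random.default_rng(0)
    w0, lr = rng.normal(size=4), 0.1
    grads = [rng.normal(size=4) for _ in range(3)]   # per-worker gradients

    w_avg_of_weights = np.mean([w0 - lr * g for g in grads], axis=0)
    w_step_on_avg_grad = w0 - lr * np.mean(grads, axis=0)
    assert np.allclose(w_avg_of_weights, w_step_on_avg_grad)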


If you are averaging weights often enough, then it's basically the same as averaging gradients. If you average the weights of a bunch of independently-trained models, though, you're going to have a rough time. Even if the networks compute exactly the same function, the ordering of the rows and columns in the intermediate weight matrices (which hidden unit is which) will totally ruin your averaging strategy.
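A sketch of that failure mode (toy two-layer tanh MLP, my own numbers): permuting the hidden units gives a functionally identical network, but averaging its weights with the original's changes the function.

    # Permuting the hidden units of a two-layer net gives an equivalent model,
    # yet averaging its weights with the original's breaks the function.
    import numpy as np

    rng = np.random.default_rng(0)
    W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(1, 4))
    x = rng.normal(size=3)

    def mlp(W1, W2, x):
        return W2 @ np.tanh(W1 @ x)

    perm = rng.permutation(4)                      # reorder the hidden units
    W1_p, W2_p = W1[perm], W2[:, perm]
    assert np.allclose(mlp(W1, W2, x), mlp(W1_p, W2_p, x))   # same function

    W1_avg, W2_avg = (W1 + W1_p) / 2, (W2 + W2_p) / 2
    print(mlp(W1, W2, x), mlp(W1_avg, W2_avg, x))  # generally not equal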



