https://www.docdroid.net/faDq8Bu/swarm-training-v01a.pdf

We average the weights themselves, and the efficiency seems to be similar to gradient gathering.

It’s also averaging in slices, not the full model. There’s never a full resync.
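A minimal sketch of what slice-wise averaging could look like, assuming workers all-reduce one disjoint slice of the flat parameter vector per round; the function name and the schedule here are illustrative, not taken from the paper:

    import numpy as np

    def average_slice(worker_weights, start, stop):
        # Average one slice of the flat parameter vector across workers,
        # writing the mean back in place. Only this slice gets synced.
        slice_mean = np.mean([w[start:stop] for w in worker_weights], axis=0)
        for w in worker_weights:
            w[start:stop] = slice_mean

    # Hypothetical schedule: 4 workers, 1000 parameters, a different
    # 100-wide slice each round, so no single round touches the full model.
    rng = np.random.default_rng(0)
    workers = [rng.normal(size=1000) for _ in range(4)]
    for round_idx in range(10):
        start = (round_idx * 100) % 1000
        average_slice(workers, start, start + 100)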

SWA (stochastic weight averaging) is the theoretical basis for why it works, I think.

Another way of thinking about it: if the gradients can be averaged, then so can the weights; for a single SGD step from the same starting point, the two are mathematically identical.
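A one-step sanity check of that claim (a sketch assuming plain SGD and a shared starting point; over many unsynced steps the workers drift apart and the equality only holds approximately):

    import numpy as np

    rng = np.random.default_rng(1)
    w0 = rng.normal(size=8)                         # shared starting weights
    grads = [rng.normal(size=8) for _ in range(4)]  # per-worker gradients
    lr = 0.1

    # Averaging the stepped weights ...
    avg_of_weights = np.mean([w0 - lr * g for g in grads], axis=0)
    # ... equals stepping once with the averaged gradient.
    step_with_avg_grad = w0 - lr * np.mean(grads, axis=0)

    assert np.allclose(avg_of_weights, step_with_avg_grad)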

If you are averaging weights often enough, it's basically the same as averaging gradients. But if you average the weights of a bunch of independently trained models, you're going to have a rough time: even when two networks compute exactly the same function, their hidden units can be permuted differently, so the order of rows and columns in the intermediate matrices will totally ruin your averaging strategy.
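A small demonstration of the permutation problem (a sketch using a toy two-layer ReLU net; the point is that two parameterizations of the identical function average to something else entirely):

    import numpy as np

    rng = np.random.default_rng(2)
    W1 = rng.normal(size=(16, 4))   # hidden x input
    W2 = rng.normal(size=(1, 16))   # output x hidden

    def net(W1, W2, x):
        # Toy ReLU MLP, biases omitted for brevity.
        return W2 @ np.maximum(W1 @ x, 0)

    x = rng.normal(size=4)
    perm = rng.permutation(16)

    # Permuting the hidden units (rows of W1, columns of W2) leaves the
    # computed function unchanged ...
    assert np.allclose(net(W1, W2, x), net(W1[perm], W2[:, perm], x))

    # ... but averaging the two equivalent parameterizations does not
    # recover the same function.
    W1_avg = (W1 + W1[perm]) / 2
    W2_avg = (W2 + W2[:, perm]) / 2
    print(net(W1, W2, x), net(W1_avg, W2_avg, x))  # generally different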
