If you are averaging weights often enough, then it's basically the same as averaging gradients. If you average the weights of a bunch of independently-trained models, you're going to have a rough time. Even if two networks compute the exact same function, their hidden units (the rows and columns of the intermediate weight matrices) can be permuted differently, and that will totally ruin your averaging strategy.
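A quick way to see the permutation problem (a toy NumPy sketch, not anyone's actual training setup): permute the hidden units of a tiny MLP so both copies compute the same function, then naively average them and watch the output change.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 2-layer MLP: y = W2 @ relu(W1 @ x + b1)
W1 = rng.normal(size=(4, 3))
b1 = rng.normal(size=4)
W2 = rng.normal(size=(2, 4))

def forward(W1, b1, W2, x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0)

# Permute the hidden units: identical function, different weight layout.
perm = np.array([1, 2, 3, 0])
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

x = rng.normal(size=3)
print(np.allclose(forward(W1, b1, W2, x), forward(W1p, b1p, W2p, x)))  # True

# Naively average the two "equivalent" models: the function changes.
Wa1, ba1, Wa2 = (W1 + W1p) / 2, (b1 + b1p) / 2, (W2 + W2p) / 2
print(np.allclose(forward(W1, b1, W2, x), forward(Wa1, ba1, Wa2, x)))  # False
```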
We average the weights themselves, and the efficiency seems to be similar to gradient gathering.
It’s also averaging in slices, not the full model. There’s never a full resync.
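I don't know the exact scheme in use, but here's a minimal sketch of what slice-wise averaging could look like: each sync round, the workers average only one shard of the flattened parameter vector, so no single round ever touches the whole model.

```python
import numpy as np

N_WORKERS, N_PARAMS, N_SHARDS = 4, 12, 3  # hypothetical sizes
rng = np.random.default_rng(1)

# Each worker holds its own (slightly diverged) copy of the parameters.
params = [rng.normal(size=N_PARAMS) for _ in range(N_WORKERS)]
shards = np.array_split(np.arange(N_PARAMS), N_SHARDS)

def sync_step(params, round_idx):
    """Average only one shard of the parameters this round."""
    idx = shards[round_idx % N_SHARDS]
    mean_slice = np.mean([p[idx] for p in params], axis=0)
    for p in params:
        p[idx] = mean_slice  # only this slice gets resynced

for r in range(6):
    # ... local training would run here, letting the copies diverge ...
    sync_step(params, r)
```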
SWA (stochastic weight averaging) is the theoretical basis for why it works, I think.
Another way of thinking about it: if every replica starts from the same weights, averaging the weights after a step is the same as taking a step with the averaged gradient, so if the gradients can be averaged, then so can the weights.
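As a sanity check of that intuition (assuming plain SGD and identical starting weights on every replica; with momentum or Adam it's only approximate), averaging the weights after one step is literally the same as stepping with the averaged gradient:

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.normal(size=5)           # shared starting weights
grads = rng.normal(size=(3, 5))  # one gradient per replica
lr = 0.1

# Option A: each replica steps locally, then the weights are averaged.
avg_of_weights = np.mean([w - lr * g for g in grads], axis=0)

# Option B: the gradients are averaged first, then one shared step is taken.
step_with_avg_grad = w - lr * grads.mean(axis=0)

print(np.allclose(avg_of_weights, step_with_avg_grad))  # True
```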