When we use data parallelism, we're summing/averaging the gradients induced by different parts of the data, not the weights of the model itself. When using multiple models for ensemble methods, we're summing/averaging the outputs of the models for a given sample, not the weights of the models themselves. Averaging the weights of several models might happen to work on a particular problem, but it definitely doesn't work in general.
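To make that distinction concrete, here's a minimal numpy sketch (toy linear model, made-up data, all names hypothetical) of the two kinds of averaging that do work: averaging gradients across data shards, and averaging predictions across an ensemble.

```python
# Minimal sketch (toy linear model, made-up data) of the two kinds of averaging
# that do work: gradients across data shards, and predictions across an ensemble.
# Never the weights.
import numpy as np

rng = np.random.default_rng(0)

def grad(w, X, y):
    # gradient of mean squared error for a linear model y ~ X @ w
    return 2 * X.T @ (X @ w - y) / len(y)

# --- data parallelism: one set of weights, gradients averaged over shards ---
w = rng.normal(size=3)
X1, y1 = rng.normal(size=(8, 3)), rng.normal(size=8)
X2, y2 = rng.normal(size=(8, 3)), rng.normal(size=8)
g = (grad(w, X1, y1) + grad(w, X2, y2)) / 2   # equals the gradient on the combined batch
w = w - 0.1 * g

# --- ensembling: separately trained models, predictions averaged per sample ---
w_a, w_b = rng.normal(size=3), rng.normal(size=3)  # stand-ins for two trained models
x = rng.normal(size=3)
ensemble_pred = (x @ w_a + x @ w_b) / 2            # average outputs, not weights
```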
Trivial example:

w2 * tanh(w1 * x) = (-w2) * tanh((-w1) * x)

Both sides compute the exact same function (tanh is odd), but one model has weights (w1, w2) and the other has (-w1, -w2). Average the weights of those two equivalent models and you get w1 = 0 and w2 = 0, i.e. a model that outputs 0 for every input.
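A quick numpy check of that sign-flip example (toy one-hidden-unit model, arbitrary weight values):

```python
# Quick numpy check of the sign-flip example above.
import numpy as np

def f(w1, w2, x):
    return w2 * np.tanh(w1 * x)

w1, w2 = 1.3, 0.7                                  # arbitrary weights
x = np.linspace(-2.0, 2.0, 5)

assert np.allclose(f(w1, w2, x), f(-w1, -w2, x))   # same function, flipped signs

# average the weights of the two equivalent models:
w1_avg, w2_avg = (w1 - w1) / 2, (w2 - w2) / 2      # both are 0
print(f(w1_avg, w2_avg, x))                        # [0. 0. 0. 0. 0.]
```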
If you're talking about asynchronous data parallelism, then there can be some averaging of weights, but all the workers start from the same weights and are re-synced often enough that they never drift far enough apart to break the averaging.
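Roughly what that looks like, as a local-SGD-style sketch (toy linear model, hypothetical worker/step counts, not any particular framework's API):

```python
# Rough local-SGD-style sketch: every worker starts from the same weights, takes
# a few local steps on its own shard, and is averaged back in before drifting far.
import numpy as np

rng = np.random.default_rng(0)
shards = [(rng.normal(size=(32, 3)), rng.normal(size=32)) for _ in range(4)]
w = np.zeros(3)                                   # shared starting point

def grad(w, X, y):
    return 2 * X.T @ (X @ w - y) / len(y)

for _ in range(10):                               # sync rounds
    local = []
    for X, y in shards:                           # each "worker" updates its own copy
        w_k = w.copy()
        for _ in range(5):                        # a few local steps between syncs
            w_k -= 0.05 * grad(w_k, X, y)
        local.append(w_k)
    w = np.mean(local, axis=0)                    # re-sync: average the weights
```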
If you're averaging weights often enough, it's basically the same as averaging gradients. But if you average the weights of a bunch of independently trained models, you're going to have a rough time. Even if two models compute the exact same function, the ordering of the rows and columns in their intermediate weight matrices will totally ruin your averaging strategy.
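Here's a small numpy sketch of that ordering problem (toy two-layer tanh net, random weights): permuting the hidden units leaves the function untouched, but averaging the permuted weights with the originals does not.

```python
# Permuting the hidden units of a small tanh net changes nothing about the
# function it computes, but averaging the permuted weights with the originals
# pairs up unrelated hidden units and wrecks the result.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(1, 4))

def net(W1, W2, x):
    return W2 @ np.tanh(W1 @ x)

perm = np.array([1, 2, 3, 0])                     # any non-identity reordering of hidden units
W1_p, W2_p = W1[perm], W2[:, perm]                # same function, rows/columns shuffled

x = rng.normal(size=3)
assert np.allclose(net(W1, W2, x), net(W1_p, W2_p, x))

# averaging the two equivalent weight sets:
W1_avg, W2_avg = (W1 + W1_p) / 2, (W2 + W2_p) / 2
print(net(W1, W2, x), net(W1_avg, W2_avg, x))     # no longer the same
```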