> That pushes most of the weights to be really close to zero unless larger values are necessary.
So why not have a certain epsilon, below which you can turn the connection off altogether? (meaning the back-propagation would only apply to the remaining connections) To avoid getting stuck in local minima you could occasionally re-initialise them with a random small value.
Again, zero background in machine learning here. It's a sincerely naive question, to which I fully expect an answer like "we've tried that; methods X, Y and Z are the most famous, and this is how they work out in practice".
What do you gain by doing that? It isn't any cheaper to train with connections removed. It could really damage the training if a parameter gets stuck at 0 that shouldn't be. And the sparsity penalty has traditionally been considered to be enough.
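For concreteness, here is a rough sketch of what the proposal above might look like on a flat list of weights. It is only an illustration, not an established method: the names `prune_step` and `revive` and the parameters `eps`, `prob` and `scale` are made up for this example. Note that the mask sits next to a dense weight list, so nothing here makes a training step cheaper, and a pruned weight stays at exactly 0 until it happens to be revived.

```python
import random

def prune_step(weights, mask, grads, lr=0.1, eps=1e-2):
    """One gradient step on the active connections, then prune the small ones.

    weights, grads: lists of floats; mask: list of bools (True = active).
    All names and defaults are hypothetical, chosen for illustration.
    """
    for i in range(len(weights)):
        if not mask[i]:
            continue  # pruned connection: back-propagation skips it
        weights[i] -= lr * grads[i]
        if abs(weights[i]) < eps:
            # Below the threshold: turn the connection off altogether.
            mask[i] = False
            weights[i] = 0.0
    return weights, mask

def revive(weights, mask, prob=0.1, eps=1e-2, scale=0.05, rng=random):
    """Occasionally re-initialise pruned connections with a small random value.

    The new magnitude is drawn from [eps, scale] so the connection is not
    pruned again immediately by the threshold check.
    """
    for i in range(len(weights)):
        if not mask[i] and rng.random() < prob:
            sign = rng.choice((-1.0, 1.0))
            weights[i] = sign * rng.uniform(eps, scale)
            mask[i] = True
    return weights, mask
```

In a training loop you would call `prune_step` every update and `revive` only every so often. How often, and with what `prob`, is exactly the kind of tuning question the reply is worried about.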