Pruning AI networks without impacting performance (ibm.com)
101 points by rbanffy on Dec 12, 2017 | 20 comments



Wouldn't it make more sense to do pruning during training instead of afterwards? That is, make the training itself involve pruning?

I remember coming across a paper four years ago showing that when using evolutionary algorithms, simply introducing a tiny connection cost would prune useless connections and spontaneously generate modularity:

> we demonstrate that the ubiquitous, direct selection pressure to reduce the cost of connections between network nodes causes the emergence of modular networks. Computational evolution experiments with selection pressures to maximize network performance and minimize connection costs yield networks that are significantly more modular and more evolvable than control experiments that only select for performance.

http://rspb.royalsocietypublishing.org/content/280/1755/2012...

Not sure if that is easily generalized to other machine learning approaches, since I'm not working in machine learning.
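For readers who want the connection-cost idea in concrete form, here is a minimal sketch. It assumes a single scalar score (performance minus a small cost per active connection); the quoted paper may well combine the two pressures differently, and `fitness` and `connection_cost` are made-up names for illustration only.

```python
def fitness(performance, weights, connection_cost=0.01):
    """Selection score: task performance minus a small cost per active connection.

    Assumption: a connection counts as active when its weight is non-zero.
    """
    n_connections = sum(1 for w in weights if w != 0.0)
    return performance - connection_cost * n_connections

# Two candidate networks with the same task performance.
dense_net = [0.5, -0.2, 0.0001, 0.3]   # keeps a nearly useless connection
sparse_net = [0.5, -0.2, 0.0, 0.3]     # has dropped it

dense_score = fitness(0.9, dense_net)
sparse_score = fitness(0.9, sparse_net)
```

With equal performance the sparser candidate scores higher, so selection slowly removes connections that do not pay for themselves.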


During training the weights are constantly changing and their final values are uncertain. If you prune a connection too early, you could destroy a synapse that would become important later on.


That's no different from a real-life network. Biology tends to solve this by growing connections as well as pruning them.

Another comment mentioned Song Han's work on NN compression. Well, I looked up a recent paper and look what it says:

> We discovered an interesting byproduct of model compression: re-densifying and retraining from a sparse model can improve the accuracy. That is, compared to a dense CNN baseline, dense → sparse → dense (DSD) training yielded higher accuracy

> We now explain our DSD training strategy. On top of the sparse SqueezeNet (pruned 3x), we let the killed weights recover, initializing them from zero. We let the survived weights keeping their value. We retrained the whole network using learning rate of 1e−4. After 20 epochs of training, we observed that the top-1 ImageNet accuracy improved by 4.3 percentage-points

> Sparsity is a powerful form of regularization. Our intuition is that, once the network arrives at a local minimum given the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum. So far, we trained in just three stages of density (dense → sparse → dense), but regularizing models by intermittently pruning parameters throughout training would be an interesting area of future work

https://arxiv.org/pdf/1602.07360v3.pdf
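The dense → sparse → dense schedule quoted above can be sketched on a toy problem. This is only an illustration under loud assumptions: a six-weight linear model instead of SqueezeNet, full-batch gradient descent instead of SGD, made-up learning rates and step counts, and "pruned 3x" read as keeping the largest third of the weights by magnitude. All function names are hypothetical.

```python
import random

def gd_steps(w, mask, data, lr, steps=200):
    """Full-batch gradient descent on mean squared error.

    Weights whose mask entry is 0 receive no update.
    """
    w = list(w)
    n = len(data)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i, xi in enumerate(x):
                grad[i] += 2.0 * err * xi / n
        w = [wi - lr * g * m for wi, g, m in zip(w, grad, mask)]
    return w

def magnitude_mask(w, keep_fraction):
    """1.0 for the largest-magnitude weights, 0.0 for the rest."""
    k = max(1, round(keep_fraction * len(w)))
    order = sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)
    keep = set(order[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(w))]

def dsd(w0, data, keep_fraction=1 / 3):
    dense = [1.0] * len(w0)
    w = gd_steps(w0, dense, data, lr=0.05)      # dense: ordinary training
    mask = magnitude_mask(w, keep_fraction)
    w = [wi * m for wi, m in zip(w, mask)]      # prune: killed weights become zero
    w = gd_steps(w, mask, data, lr=0.05)        # sparse: train survivors only
    w = gd_steps(w, dense, data, lr=0.01)       # dense again: killed weights restart from zero
    return w, mask

# Toy data from a hypothetical "true" weight vector (an assumption for the demo).
rng = random.Random(0)
true_w = [2.0, 0.0, -1.0, 0.0, 0.0, 0.5]
data = []
for _ in range(100):
    x = [rng.uniform(-1.0, 1.0) for _ in true_w]
    data.append((x, sum(t * xi for t, xi in zip(true_w, x))))

final_w, sparse_mask = dsd([0.0] * len(true_w), data)
```

The lower learning rate in the last phase loosely mirrors the small rate mentioned in the quote; nothing here says anything about the accuracy gain the paper reports.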

I wonder if you could also get similar results by simply turning some connections "off" for a while and training only the rest of the connections. Then turn the missing connections on again, randomly turn a few other connections off, and continue training.


Most deep NNs are already trained with a sparsity penalty called weight regularization or weight decay. That pushes most of the weights to be really close to zero unless larger values are necessary. The benefit of this is it's continuous and differentiable. So it can be trained with backpropagation. Binary on/off connections are much more complicated to optimize.
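As a rough illustration of the weight decay described above, assuming the usual L2 form of the penalty (the names and hyperparameters are made up for the sketch): the penalty adds a term proportional to each weight to its gradient, so a weight the data does not care about shrinks smoothly toward zero.

```python
def sgd_step_with_weight_decay(w, grad, lr=0.1, weight_decay=0.01):
    """One SGD step on loss + (weight_decay / 2) * sum(w_i ** 2)."""
    return [wi - lr * (gi + weight_decay * wi) for wi, gi in zip(w, grad)]

# With a zero data gradient, each step multiplies every weight by (1 - lr * weight_decay).
w = [1.0, -2.0]
for _ in range(100):
    w = sgd_step_with_weight_decay(w, grad=[0.0, 0.0])
```

Note that this kind of penalty makes unneeded weights small rather than exactly zero, which is why a separate thresholding step is still needed to actually remove connections.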


> That pushes most of the weights to be really close to zero unless larger values are necessary.

So why not have a certain epsilon, below which you can turn the connection off altogether? (meaning the back-propagation would only apply to the remaining connections) To avoid getting stuck in local minima you could occasionally re-initialise them with a random small value.

Again, zero background in machine learning here. It's a sincerely naive question to which I fully expect a "we've tried that; methods X, Y and Z are the most famous, and this is how they work out in practice".
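The epsilon idea in this comment could look something like the sketch below. It is not a named method from the literature, just a literal reading of the proposal; `epsilon`, `prob`, `scale` and the function names are all assumptions.

```python
import random

def prune_below_epsilon(w, mask, epsilon=1e-3):
    """Switch off every connection whose weight magnitude is below epsilon."""
    new_mask = [0.0 if abs(wi) < epsilon else m for wi, m in zip(w, mask)]
    new_w = [wi * m for wi, m in zip(w, new_mask)]
    return new_w, new_mask

def masked_update(w, grad, mask, lr=0.1):
    """Gradient step applied only to connections that are still on."""
    return [wi - lr * gi * m for wi, gi, m in zip(w, grad, mask)]

def revive(w, mask, rng, prob=0.05, scale=0.01):
    """Occasionally switch a dead connection back on with a small random weight."""
    new_w, new_mask = list(w), list(mask)
    for i, m in enumerate(mask):
        if m == 0.0 and rng.random() < prob:
            new_w[i] = rng.uniform(-scale, scale)
            new_mask[i] = 1.0
    return new_w, new_mask

w = [0.5, 0.0004, -0.2, -0.0001]
mask = [1.0] * len(w)
w, mask = prune_below_epsilon(w, mask)
w = masked_update(w, grad=[1.0, 1.0, 1.0, 1.0], mask=mask)
```

In a real training loop `revive` would be called every so often between updates, which is the "occasionally re-initialise" part of the question.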


What do you gain by doing that? It isn't any cheaper to train with connections removed. It could really damage the training if a parameter gets stuck at 0 that shouldn't be. And the sparsity penalty has traditionally been considered to be enough.


> I wonder if you could also get similar results by simply turning some connections "off" for a while and training only the rest of the connections. Then turn the missing connections on again, randomly turn a few other connections off, and continue training.

Isn't that almost like Dropout? Dropout does what you say but it changes the dropped set of neurons for each individual example.
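A minimal sketch of that per-example behaviour, using the common "inverted dropout" variant where surviving units are rescaled during training (the original paper instead scales at test time; the variant and the names here are my choice for the illustration):

```python
import random

def dropout(activations, rng, p_drop=0.5):
    """Zero each unit with probability p_drop and rescale survivors by 1 / (1 - p_drop)."""
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
batch = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
# A fresh random mask is drawn for every example in the batch.
dropped = [dropout(example, rng) for example in batch]
```

The difference from the proposal above is the time scale: the dropped set changes on every example instead of staying fixed for a stretch of training.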


https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf

Looks pretty close, yeah.

So this is basically dividing the network into two groups and giving each group a second weight: one for those that are on, zero for those that are off.

Has anyone tried more variants of this? Using random bundling + various levels of amplification of such bundles, so that backpropagation affects various networks differently?


There is also this: Mixture of experts. A massive neural net from Google, with multiple submodules that are activated depending on input.

https://medium.com/@thoszymkowiak/google-brains-new-super-fa...
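To make "submodules that are activated depending on input" a bit more concrete, here is a small sketch of top-k gating, one common way to pick experts. It is an assumption that this matches the linked system; in a real model the gate scores come from a learned gating network applied to the input, whereas here they are passed in by hand, and the experts are toy functions.

```python
import math

def top_k_gate(scores, k=2):
    """Softmax over the k highest gate scores; all other experts get no weight."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def mixture_output(x, experts, gate_scores, k=2):
    """Weighted sum of the selected experts; unselected experts are never evaluated."""
    weights = top_k_gate(gate_scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Toy experts (hypothetical): each is just a function of a scalar input.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
y = mixture_output(3.0, experts, gate_scores=[0.1, 2.0, -1.0, 2.0], k=2)
```

Only the selected experts run for a given input, which is what lets the total parameter count grow without the per-example cost growing with it.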


What we’re seeing is that most weights are actually spurious. This in effect reveals the underlying circuitry. I did network pruning on gene regulatory networks using an evolutionary algorithm (mathematically identical to artificial neural networks).

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2538912/


Okay, this is at least partly disingenuous: pruning deep nets absolutely helps (see e.g. Song Han's Deep Compression paper). It's also very worrying that they only test on the toy dataset, MNIST.


I would have liked to have seen a comparison against Song Han's approach.


> see e.g. Song Han's Deep Compression paper

Thanks for that, by the way. I have a friend who can really use that information.


Neat. As someone who has largely surface-level knowledge of ML, does this development mean we can expect many more on-phone networks in the future?

Phrased another way: Is this a huge step up from previous pruning methods?


Glancing at the paper, I see the biggest dataset they used was MNIST. So it's hard to quantify how much accuracy they would preserve for larger, more useful networks on more complicated tasks.


Compressing is indeed a hot topic, but I do feel like this article (the one on arXiv https://arxiv.org/abs/1611.05162 ) has some major shortcomings. First off, the datasets used (spiral and MNIST) are simple and small. They can be used as illustration, but should be avoided for benchmarking. Secondly, despite it being a hot topic, the authors did not compare with other algorithms. Thirdly, they use a network with two hidden dense layers and over a million parameters for MNIST; of course you can prune 95% of those parameters. You could probably have achieved the same result by simply training with 5% of the weights. Finally, there seems to be no approach for convolution layers?

In network pruning, my experience is that simple heuristics sometimes outperform hard math approaches. Also, which approach works best can differ wildly between problems. A good approach on one problem and one network can be very bad on a slightly different network. In this sense, it is sad that LeNet is usually used for benchmarking, as the results typically don't generalize well.


Compressing neural networks for inference is an entire subfield of work, with a number of effective approaches. It looks like the paper doesn't compare against any of them?


iirc this was called optimal-brain-damage in oldish literature. it is pretty cool actually :)


Interestingly, the brain also undergoes a thorough pruning process early in childhood development. I wonder whether this process accomplishes something similar to what the linked approach does for artificial neural networks.

https://en.m.wikipedia.org/wiki/Synaptic_pruning


similar work from a couple of years ago https://arxiv.org/pdf/1506.02626.pdf

"our method prunes redundant connections using a three-step method. First, we train the network to learn which connections are important. Next, we prune the unimportant connections. Finally, we retrain the network to fine tune the weights of the remaining connections. On the ImageNet dataset, our method reduced the number of parameters of AlexNet by a factor of 9×, from 61 million to 6.7 million, without incurring accuracy loss. Similar experiments with VGG-16 found that the total number of parameters can be reduced by 13×, from 138 million to 10.3 million, again with no loss of accuracy."
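A small sketch of the pruning step and a sanity check of the quoted compression factors. The fixed magnitude threshold is an assumption for illustration (the quote does not say how the cut-off is chosen), and the training and retraining steps are left out.

```python
def prune_by_magnitude(weights, threshold):
    """Step 2 of the quoted method, sketched: zero out low-magnitude connections."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def compression_ratio(params_before, params_after):
    """How many times fewer parameters remain after pruning."""
    return params_before / params_after

pruned = prune_by_magnitude([0.8, -0.05, 0.3, 0.01], threshold=0.1)

# Parameter counts taken from the quote above.
alexnet_ratio = compression_ratio(61e6, 6.7e6)    # roughly 9.1
vgg16_ratio = compression_ratio(138e6, 10.3e6)    # roughly 13.4
```

So the 9× and 13× figures in the quote are the rounded-down ratios of the parameter counts it gives.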



