
It definitely is a problem. For instance, there was a recent post about trying to teach a robot to place a board with a hole in it onto a peg. It would just learn to shove the board next to the peg instead.


Sorry, saw this late.

It's a demonstrated (and partly proven) result that deep neural networks don't get trapped in bad local minima in their very high-dimensional parameter spaces.

GANs and reinforcement learning are different; research on getting those to converge to good minima is still very much in its infancy. I don't consider those to be just "a neural network", but sorry, I should have been clearer.


This thread is about reinforcement learning, which definitely suffers from local minima.

But even vanilla supervised nets suffer from local minima. Anyone who's played with them has encountered it. Here you can mess around with a neural net live in the browser, and it very easily gets stuck if you try more than three layers (especially with the spiral dataset): http://playground.tensorflow.org/
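For what it's worth, here's a rough sketch of that behaviour outside the browser (my own construction, assuming numpy and scikit-learn; the dataset and architecture are just illustrative). A small MLP on a two-spirals dataset will, depending on the random seed, sometimes fit the data and sometimes stall at poor training accuracy:

    # Small MLP on a two-spirals dataset; results vary noticeably by seed.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def two_spirals(n=500, noise=0.2, seed=0):
        rng = np.random.default_rng(seed)
        t = rng.uniform(0.25, 3.0, n) * 2 * np.pi
        d1 = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)
        d2 = -d1  # the second spiral is the first rotated by 180 degrees
        X = np.concatenate([d1, d2]) + rng.normal(scale=noise, size=(2 * n, 2))
        y = np.concatenate([np.zeros(n), np.ones(n)])
        return X, y

    X, y = two_spirals()
    for seed in range(5):
        clf = MLPClassifier(hidden_layer_sizes=(8, 8, 8), activation='tanh',
                            max_iter=2000, random_state=seed)
        clf.fit(X, y)
        print(seed, "train accuracy:", clf.score(X, y))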


That's why I said high-dimensional neural networks. There's been a lot of literature explaining why local minima aren't a problem in very high-dimensional loss surfaces.

Check any of the literature on this subject: https://arxiv.org/abs/1611.06310v2

https://arxiv.org/abs/1406.2572

Local minima are something people thought would be a problem, especially back in the 2000s. They played around with small neural nets on toy examples like yours and concluded the problem was intractable. It's a big reason why neural nets fell out of fashion in the early 2000s and people moved toward techniques like SVMs.

These toy examples don't generalize to high dimensions, and if you take a look at the literature, you'll see that the consensus agrees with my statement.


Ehh, these theoretical results have questionable applicability to real life. Sure, it might be very easy to learn simple correlations like "this patch of pixels correlates highly with the output '8'". But it's trivial to construct examples where neural nets get stuck in local minima. For instance, try training a net to multiply two binary numbers.
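Here's a minimal sketch of that binary-multiplication task (my own setup, assuming scikit-learn; the bit width and layer size are arbitrary choices). Inputs are two 4-bit numbers, targets are the 8 product bits; a small fully connected net will often plateau well above zero training loss:

    # Learn 4-bit x 4-bit multiplication from bit vectors to product bits.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    BITS = 4
    pairs = [(a, b) for a in range(2 ** BITS) for b in range(2 ** BITS)]
    X = np.array([[int(c) for c in f"{a:0{BITS}b}{b:0{BITS}b}"] for a, b in pairs])
    Y = np.array([[int(c) for c in f"{a * b:0{2 * BITS}b}"] for a, b in pairs])

    net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=5000, random_state=0)
    net.fit(X, Y)
    print("final training loss:", net.loss_)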

Maybe with a billion neurons, some of them would correspond to the correct algorithm just by random chance and get reinforced by backprop. But very few NNs have layers larger than a thousand neurons, because the cost of a dense layer grows quadratically with its width (a width-n layer feeding another width-n layer already has n^2 weights), while the chance of random weights hitting the solution decreases exponentially.

One of the biggest reasons things like stochastic gradient descent and dropout are used is that they break local minima.


The statement "deep neural networks are not affected by poor local minima" is not really a personal opinion or theory at this point; it's the dominant consensus in the research community.

These are not just theoretical results. They're theory papers trying to explain the empirical observation that neural nets don't get stuck in bad local minima.

> Given that deep networks are highly nonlinear systems optimized by local gradient methods, why do they not seem to be affected by bad local minima?

And other such results.

As I said above, neural nets are obviously able to get stuck in local minima in toy examples. If you read my above comment, you'll see that that has no bearing on my initial statement.

Dropout's main motivation is not to break local minima; it's to achieve better generalization. If it were meant to break bad minima, we'd see better training loss after adding dropout, which is obviously not the case.

As for SGD, we used to think it was mainly for computational reasons: we can't fit the entire training set in one batch, so we have to split it into mini-batches.

Modern theory holds instead that SGD helps avoid sharp minima, along with having some other desirable properties.
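As a rough illustration of the dropout point (assuming PyTorch; the toy regression data and architecture are just placeholders), the same small net trained with and without dropout will typically end with higher training loss in the dropout case, i.e. dropout regularizes rather than helping the optimizer escape bad minima:

    # Compare final training loss of the same net with and without dropout.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    X = torch.randn(512, 20)
    y = torch.randn(512, 1)

    def train(p_drop):
        model = nn.Sequential(nn.Linear(20, 128), nn.ReLU(),
                              nn.Dropout(p_drop),
                              nn.Linear(128, 1))
        opt = torch.optim.SGD(model.parameters(), lr=0.05)
        for _ in range(2000):
            opt.zero_grad()
            loss = ((model(X) - y) ** 2).mean()
            loss.backward()
            opt.step()
        model.eval()  # disable dropout for the final training-loss measurement
        with torch.no_grad():
            return ((model(X) - y) ** 2).mean().item()

    print("no dropout :", train(0.0))
    print("dropout 0.5:", train(0.5))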

I'm not sure you're really reading my comments thoroughly or checking the links, so if you're actually interested in understanding what's going on, please do some proper research on the topic.



