The theories I've heard for why you don't get stuck in local optima when using deep learning include:
1. It's hard to get stuck in multidimensional space
2. There are more saddles than convex local optima
3. There are many local optima, but they are all useful
4. Something related to spin glass theory (which I don't understand)
5. There is no theory, or we haven't found it yet; all we know is that it works in practice and the theory will have to catch up later
I can't comment on #4, but with regard to #1 and #2 we can make an argument that high-dimensional space is generally friendly towards stochastic gradient descent.
Consider a neural net that produces a single output given `N` inputs; it's basically a function `f(x) = f(x_1, x_2, …, x_N)`.
A critical point `x*` has zero gradient, `∇f(x*) = (0, 0, …, 0)`, and an N×N Hessian of the form `[H f(x*)]_{ij} = (∂^2 f)/(∂x_i ∂x_j)`.
The critical point is a local minimum if the Hessian is positive definite at `x*` and a local maximum if it's negative definite; if the Hessian has a mixture of positive and negative eigenvalues, it's a saddle point.
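To make the classification concrete, here's a tiny sketch (my own toy example, not from any particular framework): it checks the signs of the Hessian's eigenvalues at a critical point, using `f(x, y) = x^2 - y^2`, whose critical point at the origin is a saddle.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-8):
    """Positive definite -> local minimum, negative definite -> local maximum,
    mixed signs -> saddle point, otherwise the second-order test is inconclusive."""
    eigvals = np.linalg.eigvalsh(hessian)   # symmetric Hessian => real eigenvalues
    if np.all(eigvals > tol):
        return "local minimum"
    if np.all(eigvals < -tol):
        return "local maximum"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"
    return "degenerate (needs higher-order information)"

# f(x, y) = x^2 - y^2 has a critical point at the origin; its Hessian there is diag(2, -2).
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
print(classify_critical_point(H))   # -> "saddle point"
```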
Heuristically, we have N eigenvalues, and absent prior information we can treat each one as equally likely to be positive or negative.
So the probability of an arbitrary critical point being a local minimum is `(1/2)^N`, which becomes extremely small as N grows large.
Large values of N are kinda the bread-and-butter of deep learning, so we expect that most critical points we'll encounter are saddle points.
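If you want to poke at the `(1/2)^N` heuristic numerically, here's a rough Monte Carlo sketch (my own illustration; random symmetric matrices are only a stand-in for real loss-surface Hessians, and their eigenvalue signs aren't actually independent, so the empirical fraction drops off even faster than `(1/2)^N`; either way, it collapses quickly):

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_positive_definite(n, trials=20_000):
    """Fraction of random symmetric n x n matrices whose eigenvalues are all positive."""
    hits = 0
    for _ in range(trials):
        a = rng.standard_normal((n, n))
        h = (a + a.T) / 2.0                       # symmetrize -> a valid stand-in "Hessian"
        if np.all(np.linalg.eigvalsh(h) > 0):     # all eigenvalues positive = local-minimum-like
            hits += 1
    return hits / trials

for n in (1, 2, 3, 5, 8):
    print(f"N={n}: empirical {fraction_positive_definite(n):.4f}   heuristic (1/2)^N = {0.5 ** n:.4f}")
```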
Additionally, the inherent noise in SGD means that we're unlikely to stay trapped at a saddle point, because once you're nudged away from the saddle, the rate at which you slide off it increases rapidly.
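As a toy illustration of that last point (again, just a sketch I cooked up, not anyone's actual training setup): plain gradient descent parked exactly on the saddle of `f(x, y) = x^2 - y^2` never moves, while the same update with a tiny bit of added noise, a stand-in for SGD's minibatch noise, gets pushed onto the negative-curvature direction and slides away at an accelerating rate.

```python
import numpy as np

def grad(p):
    # Gradient of f(x, y) = x^2 - y^2: positive curvature in x, negative in y.
    x, y = p
    return np.array([2 * x, -2 * y])

rng = np.random.default_rng(1)
lr = 0.05

for noise in (0.0, 1e-3):
    p = np.zeros(2)                  # start exactly at the saddle point (0, 0)
    for _ in range(200):
        p = p - lr * (grad(p) + noise * rng.standard_normal(2))
    print(f"noise={noise}: after 200 steps, |y| = {abs(p[1]):.3g}")

# With zero noise, |y| stays exactly 0; with noise, each step multiplies the
# y-component by roughly (1 + 2*lr), so the escape accelerates.
```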
So if your net appears to have converged, it's probably at a local optimum with a reasonably deep basin of attraction, assuming you're using the bag of tricks we've accumulated in the last decade (stuff like adding a bit of noise and randomizing the order in which the training data is presented).
As for 3 & 5, we kinda cheat because if your model appears to have converged but is not performing adequately, we step outside of the learning algorithm and modify the structure of the neural net or tune the hyperparameters.
I don't know if we'll ever have a general theory that explains why neural nets seem to work so well, because you'd have to characterize both the possible tasks and the possible models, but perhaps we'll gradually chip away at the problem as we gather more empirical data and formalize heuristics based on those findings.
I think it's a combination of (1) and (2): in a high-dimensional space, for a critical point to be a local optimum the curvature has to have the same sign along every dimension, and the probability of that falls off exponentially with the number of dimensions. So in practice, they're essentially all saddles.
I would not be comfortable making such assumptions about the topology of fitness functions in high-dimensional spaces. I think it was not so long ago that I read an article here on HN about how weird high-dimensional spaces are, but I can't find it quickly.
Further, the mind-boggling size of high-dimensional spaces makes me all but certain that not a single non-trivial neural network made by homo sapiens has ever been at a global maximum. But I have nothing but my hunch on this.