Is it true that genetic algorithms have the benefit of being able to find the global optimum more consistently, due to the incorporation of randomness in subsequent generations, whereas DL models often get stuck at local optima?
DL models don't often get stuck at local optima. In theory they could be vulnerable to that, but in practice it simply doesn't happen in most practical supervised learning applications.
I'm not up to date on the theoretical research on this topic, but as far as I recall there are some interesting demonstrations on realistic problems showing that all the different "local" optima you reach from different random initializations are actually "connected": there exists a path along which the loss never increases that takes you from a worse "local optimum" to a better one. So it isn't really a local optimum; it's just nontrivial to find the path to the better optimum when the space is very high-dimensional.
The theories I've heard for why you don't get stuck in local optima when using deep learning include:
1. It's hard to get stuck in multidimensional space
2. There are more saddles than convex local optima
3. There are many local optima, but they are all useful
4. Something related to spin glass theory (which I don't understand)
5. There is no theory, or we haven't found it yet; all we know is that it works in practice and the theory will have to catch up later
I can't comment on #4, but with regard to #1 and #2 we can make an argument that high-dimensional space is generally friendly towards stochastic gradient descent.
Consider the loss of a neural net as a function of its `N` parameters: it's basically a function `f(x) = f(x_1, x_2, …, x_N)`.
A critical point x* has gradient `∇f(x*) = (0, 0, …, 0)`, and the N×N Hessian there is `[H_f(x*)]_{ij} = ∂²f/∂x_i ∂x_j`.
The critical point is a local minimum if the Hessian is positive definite at x* and a local maximum if it's negative definite; otherwise it's a saddle point, which corresponds to the Hessian having a mixture of positive and negative eigenvalues.
Heuristically, the Hessian has N eigenvalues, and absent prior information we can treat the sign of each one as a fair coin flip: probability 1/2 that it's positive and 1/2 that it's negative.
So the probability of an arbitrary critical point being a local minimum is `(1/2)^N`, which becomes extremely small as N grows large.
Large values of N are kinda the bread-and-butter of deep learning, so we expect that most critical points we'll encounter are saddle points.
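To get a feel for that `(1/2)^N` argument, here's a quick numerical sanity check. It uses random symmetric Gaussian matrices as a crude stand-in for Hessians at arbitrary critical points, which is purely an illustrative assumption (real loss surfaces are not random matrices), but the trend of positive-definite Hessians becoming vanishingly rare as N grows shows up clearly:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_hessian(n):
    # Random symmetric Gaussian matrix as a crude stand-in for the Hessian
    # at an arbitrary critical point (an illustrative assumption only).
    a = rng.normal(size=(n, n))
    return (a + a.T) / 2

trials = 2000
for n in (2, 5, 10, 50):
    minima = 0
    for _ in range(trials):
        eigs = np.linalg.eigvalsh(random_hessian(n))
        minima += bool(np.all(eigs > 0))  # all eigenvalues positive => local minimum
    print(f"N={n:2d}  observed fraction of minima: {minima / trials:.4f}  "
          f"coin-flip heuristic (1/2)^N: {0.5 ** n:.1e}")
```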
Additionally, the inherent noise in SGD means that we're unlikely to stay trapped at a saddle point, because once you're nudged away from the saddle, the rate at which you slide off it increases rapidly.
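To see why noise helps at a saddle, here's a minimal sketch on the toy function `f(x, y) = x^2 - y^2` (my own example, not anything from a real network): plain gradient descent started exactly at the saddle never moves, while a little gradient noise lets the iterate slide off along the unstable direction, faster and faster as it goes.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad(p):
    # Gradient of the toy function f(x, y) = x**2 - y**2;
    # its only critical point, (0, 0), is a saddle.
    x, y = p
    return np.array([2 * x, -2 * y])

def descend(noise_scale, steps=100, lr=0.1):
    p = np.zeros(2)  # start exactly at the saddle point
    for _ in range(steps):
        g = grad(p) + noise_scale * rng.normal(size=2)  # SGD-style noisy gradient
        p = p - lr * g
    return p

print("plain GD from the saddle:", descend(noise_scale=0.0))   # stays at (0, 0)
print("noisy GD from the saddle:", descend(noise_scale=1e-3))  # slides off along y
```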
So if your net appears to have converged, it's probably at a local optimum with a reasonably deep basin of attraction, assuming you're using the bag of tricks we've accumulated in the last decade (stuff like adding a bit of noise and randomizing the order in which the training data is presented).
As for #3 and #5, we kinda cheat: if the model appears to have converged but isn't performing adequately, we step outside the learning algorithm and modify the structure of the neural net or tune the hyperparameters.
I don't know if we'll ever have a general theory that explains why neural nets seem to work so well, because you'd have to characterize both the possible tasks and the possible models, but perhaps we'll gradually chip away at the problem as we gather more empirical data and formalize heuristics based on those findings.
I think it's a combination of (1) and (2): in a high-dimensional space, for a critical point to be a local minimum the loss has to curve upward in every dimension, the probability of which falls off exponentially in the number of dimensions. So in practice, they're all saddles.
I would not be comfortable making such assumptions about the topology of fitness functions in high-dimensional spaces. Not long ago I read an article here on HN about how weird high-dimensional spaces are, but I can't find it quickly.
Further, the mind-boggling size of high-dimensional spaces makes me all but certain that not a single non-trivial neural network made by Homo sapiens has ever been at a global optimum. But I have nothing but a hunch to go on here.
This is common wisdom that I think is false. You absolutely do get stuck in local optima frequently with reinforcement learning. OpenAI has a good example somewhere of a robot trying to put a peg through a hole: trained with regular gradient descent, it just gets stuck putting the peg pretty close to the hole, but not through it.
I'm not even sure it's not a problem in general. I've watched NNs frequently get stuck in local minima even on incredibly simple datasets like XOR or spirals. SGD and dropout are widely used in part because they add noise to the gradients, which can help break out of local optima, but that's not a perfect method.
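If you want to see that for yourself, here's a minimal NumPy sketch (my own toy code, not any particular published experiment): a 2-2-1 network trained on XOR with plain full-batch gradient descent, restarted from several seeds. Whether a given seed stalls depends on the initialization scale and learning rate, but with only two hidden units and no noise it's common to see some runs stuck well above zero error.

```python
import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])  # XOR targets

def train(seed, steps=5000, lr=0.5):
    # Tiny 2-2-1 net: tanh hidden layer, sigmoid output, MSE loss,
    # trained by plain full-batch gradient descent (no momentum, no noise).
    rng = np.random.default_rng(seed)
    W1, b1 = rng.normal(size=(2, 2)), np.zeros((1, 2))
    W2, b2 = rng.normal(size=(2, 1)), np.zeros((1, 1))
    for _ in range(steps):
        h = np.tanh(X @ W1 + b1)                    # hidden activations
        out = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # sigmoid output
        err = out - y
        # Backprop (gradient of the MSE, up to a constant factor)
        d_out = err * out * (1.0 - out)
        dW2, db2 = h.T @ d_out, d_out.sum(0, keepdims=True)
        d_h = (d_out @ W2.T) * (1.0 - h ** 2)
        dW1, db1 = X.T @ d_h, d_h.sum(0, keepdims=True)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return float(np.mean(err ** 2))

for seed in range(10):
    print(f"seed {seed}: final MSE = {train(seed):.4f}")
```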
This is field-dependent; it feels more like an attribute of certain types of data than of certain algorithms.
Reinforcement learning was on my mind when I wrote "practical supervised learning applications", because yes, RL is different in that regard. And various function-fitting examples (starting with XOR) do indeed get stuck.
However, if we're applying neural networks to the (wide and practically important!) class of "pattern recognition" tasks like processing image or language data, then it's different; those are entire fields where you can easily spend a whole career working on just one of these types of data. Perhaps there's a relation to the structure and redundancy inherent in such data.
It depends, but I don't think general statements like that are necessarily true. Most methods allow trade-offs between converging quickly on local optima vs finding global optima. The mutation aspect of genetic algorithms is just one way to do this. With well tuned mutation parameters, genetic algorithms can definitely be successful at finding global optima, but similar measures exist for most methods.
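To make that concrete, here's a toy sketch (a made-up 1-D objective and parameters, nothing from a real application): a minimal genetic algorithm whose population starts near a poor local maximum, so whether it ever reaches the better peaks depends almost entirely on the mutation scale. Comparing the best individual with the population mean also hints at the exploration/exploitation trade-off.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(x):
    # Toy multimodal objective on [0, 10]: many local maxima,
    # with the best one near x ≈ 9.1.
    return np.sin(5 * x) + 0.3 * x

def ga(mutation_sigma, generations=300, pop_size=50):
    # Population deliberately initialized near a poor local maximum,
    # so mutation is the only way to reach the better peaks.
    pop = rng.normal(loc=1.6, scale=0.05, size=pop_size)
    for _ in range(generations):
        parents = pop[np.argsort(fitness(pop))[-pop_size // 2:]]     # keep the best half
        children = rng.choice(parents, size=pop_size)                # clone parents
        children += rng.normal(scale=mutation_sigma, size=pop_size)  # Gaussian mutation
        pop = np.clip(children, 0.0, 10.0)
    fit = fitness(pop)
    return float(fit.max()), float(fit.mean())

for sigma in (0.02, 0.6, 3.0):
    best, mean = ga(sigma)
    print(f"mutation sigma={sigma:<4}  best fitness={best:.2f}  population mean={mean:.2f}")
```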
What do you mean by "find" and how do you define "global optima"?
For "find" you could discuss convergence rates vs. the points that are being converged to. If you add randomness at every iteration you're not even converging, at which point annealing rate becomes a related issue.
For "global optima" are we talking about training error, or test error, or some other kind of function value?
Deep learning has enough randomness already: the random weight initialization, the random choice of samples (hence the name Stochastic Gradient Descent), the random flips/crops/noise in data augmentation, and the randomness injected by dropout.
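For what it's worth, here's roughly where each of those sources of randomness enters a bare-bones PyTorch training loop (a generic sketch with made-up shapes, not anyone's actual setup):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # pins the randomness for reproducibility; it doesn't remove it

# 1. Random weight initialization happens inside the layer constructors.
model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # 4. dropout injects fresh randomness at every forward pass
    nn.Linear(64, 2),
)

# Dummy tensors standing in for a real dataset (made-up shapes).
X = torch.randn(1000, 32)
y = torch.randint(0, 2, (1000,))

# 2. shuffle=True gives the random choice/order of samples each epoch (the "S" in SGD).
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for xb, yb in loader:
    # 3. Data augmentation would go here (random flips/crops/noise for images);
    #    for this toy tensor data, a little Gaussian noise stands in for it.
    xb = xb + 0.01 * torch.randn_like(xb)
    opt.zero_grad()
    loss_fn(model(xb), yb).backward()
    opt.step()
```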