Wait, I'm confused...you're saying genetic approaches fell out of favor because they're basically just stochastic gradient descent? Most of modern DL relies heavily on SGD at various points during training.
My impression is that they fell out of favor precisely because they don't actually use any gradients, and end up converging on good maxima slower than you could if you used the gradients from the net. Am I off base here?
If you mutate genomes by small additive modifications to a vector of continuous parameters, then taking lots of samples and keeping the best is essentially a stochastic approximation to gradient descent. However, unlike the SGD used in deep learning, it doesn't make use of calculus and therefore requires many more samples (exponentially more, in the worst case) to get a gradient of equivalent accuracy. I.e. it's slow.
If your mutations aren't small, or your parameters are not continuously valued, or your fitness function is hard to differentiate analytically, genetic algorithms might still come out ahead.
On the other hand, even if evolutionary algorithms require a lot of samples, they are embarrassingly parallel: you can easily try all samples simultaneously. If you have enough resources to throw at the problem, it can be faster (although more resource-intensive) to estimate the gradient this way than to compute an accurate gradient analytically.
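To make that concrete, here's a minimal sketch (Python/NumPy, toy fitness function, mirrored perturbations; all names are made up for illustration) of estimating a gradient purely from sampled mutations. Every sample in the loop is independent, so in principle they could all be evaluated on separate workers at once:

```python
import numpy as np

def es_gradient_estimate(f, theta, sigma=0.1, n_samples=100):
    """Estimate the gradient of a fitness function f at theta by random
    perturbation (mirrored sampling). Each iteration is independent, so
    all samples could be evaluated in parallel on separate workers."""
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        eps = np.random.randn(*theta.shape)  # random mutation direction
        # Weight the direction by how much better it does than its mirror image.
        grad += (f(theta + sigma * eps) - f(theta - sigma * eps)) * eps
    return grad / (2 * n_samples * sigma)

# Toy usage: climb toward the maximum of f(x) = -||x||^2 (optimum at the origin).
theta = np.ones(10)
for _ in range(200):
    theta += 0.05 * es_gradient_estimate(lambda x: -np.sum(x ** 2), theta)
print(np.round(theta, 3))  # should end up near zero
```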
SGD is embarrassingly parallel as well. You can train a net on several different examples simultaneously and combine the gradients or learned weights.
The reason it's not done so much is that the bandwidth cost of moving huge numbers of gradients or weights between computers is pretty significant. There's been all sorts of research into compressing them or reducing their precision. However, this is a problem for evolutionary algorithms as well.
Not exactly. The required bandwidth for the evolutionary strategy is actually very small: if every node knows each other's random seed, they can reconstruct the best model themselves, using the seed of whichever node declares the best result. There is no need to transfer any weights. Of course, this trick only works if it's cheap to compute the weights other nodes are using, which is not the case for SGD. So evolutionary strategies have an advantage here, even if it's not a decisive one.
Well, that would work for simple hill climbing. In a full genetic algorithm, every member of the population is the product of thousands of previous individuals.
I mean, we were talking about simple hill climbing.
Still, even more complex schemes do not require particularly high bandwidth. You only need to broadcast the fitness of each individual, and from there each node can independently recalculate and combine the best ones.
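Roughly like this, as a single-process sketch (the `perturbation`, `hill_climb_step`, and `fitness_fn` names and the toy fitness are made up for illustration): only seeds and scalar fitnesses are ever "broadcast", and every node can rebuild the winning weights locally.

```python
import numpy as np

def perturbation(seed, shape, sigma=0.1):
    """Deterministically reconstruct a node's mutation from its seed alone."""
    rng = np.random.default_rng(seed)
    return sigma * rng.standard_normal(shape)

def hill_climb_step(theta, fitness_fn, node_seeds):
    """One round of distributed hill climbing with seed-only communication,
    simulated here in a single process."""
    # Phase 1 (parallel): each node evaluates the candidate built from its seed.
    results = [(seed, fitness_fn(theta + perturbation(seed, theta.shape)))
               for seed in node_seeds]
    # Phase 2 (cheap broadcast): only (seed, scalar fitness) pairs are shared.
    best_seed, best_fitness = max(results, key=lambda r: r[1])
    # Phase 3 (local): every node rebuilds the winning weights from the seed,
    # so no weight vectors ever cross the network.
    if best_fitness > fitness_fn(theta):
        return theta + perturbation(best_seed, theta.shape)
    return theta

# Toy usage: climb toward the maximum of f(x) = -||x||^2.
theta = np.ones(10)
for step in range(500):
    seeds = range(step * 64, (step + 1) * 64)  # 64 "nodes" per round
    theta = hill_climb_step(theta, lambda x: -np.sum(x ** 2), seeds)
print(-np.sum(theta ** 2))  # fitness should approach 0
```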
Is it true that genetic algorithms have the benefit of being able to find the global optima more consistently due to the incorporation of randomness in subsequent generations? Whereas DL models often get stuck at local optima?
DL models don't often get stuck at local optima. In theory they could be vulnerable to that, but in practice they are not: it simply doesn't happen in most practical supervised learning applications.
I'm not up to date on the theoretical research on this topic, but as far as I recall there are some interesting demonstrations on realistic problems showing that all the different "local" optima resulting from different random initializations are actually "connected", i.e. there exists a nondecreasing route from a worse "local optimum" to a better one. So in reality it's not a local optimum; it's just that it's nontrivial to find the path to the better optimum when the space is very high-dimensional.
The theories I've heard for why you don't get stuck in local optima when using deep learning include:
1. It's hard to get stuck in multidimensional space
2. There are more saddles than convex local optima
3. There are many local optima, but they are all useful
4. Something related to spin glass theory (which I don't understand)
5. There is no theory, or we haven't found it yet; all we know is that it works in practice and the theory will have to catch up later
I can't comment on #4, but with regard to #1 and #2 we can make an argument that high-dimensional space is generally friendly towards stochastic gradient descent.
Consider a neural net that produces a single output given `N` inputs: it's basically a function `f(x) = f(x_1, x_2, …, x_N)`.
A critical point x* has gradient `∇f(x*) = (0, 0, …, 0)` and an N×N Hessian of the form `[H•f(x*)]_{ij} = (∂^2 f)/(∂x_i ∂x_j)`.
The critical point is a local minimum if the Hessian is positive definite at x*, a local maximum if it's negative definite, and a saddle point otherwise.
The saddle case corresponds to the Hessian having a mixture of positive and negative eigenvalues.
Heuristically, we have N eigenvalues, and absent prior information we can treat each eigenvalue as positive with probability 1/2 and negative with probability 1/2.
So the probability of an arbitrary critical point being a local minimum is `(1/2)^N`, which becomes extremely small as N grows large.
Large values of N are kinda the bread-and-butter of deep learning, so we expect that most critical points we'll encounter are saddle points.
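As a quick numerical sanity check of that heuristic (under the crude assumption that a critical point's Hessian looks like a random symmetric matrix, which real loss surfaces need not satisfy), you can sample random symmetric matrices and count how often all the eigenvalues come out positive:

```python
import numpy as np

def fraction_local_minima(n, trials=5000, seed=0):
    """Fraction of random symmetric n x n 'Hessians' that are positive
    definite, i.e. the fraction of critical points that would be local
    minima under this crude random-matrix model."""
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(trials):
        a = rng.standard_normal((n, n))
        hessian = (a + a.T) / 2  # symmetrize
        if np.all(np.linalg.eigvalsh(hessian) > 0):
            count += 1
    return count / trials

for n in (1, 2, 3, 5, 10):
    print(n, fraction_local_minima(n))
# The fraction collapses as n grows: nearly every sampled critical point is a
# saddle, consistent with true local minima becoming vanishingly rare for large N.
```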
Additionally, the inherent noise in SGD means that we're unlikely to stay trapped at a saddle point, because once you're nudged away from the saddle, the rate at which you slide off it increases rapidly.
So if your net appears to have converged, it's probably at a local optimum with a reasonably deep basin of attraction, assuming you're using the bag of tricks we've accumulated in the last decade (stuff like adding a bit of noise and randomizing the order in which the training data is presented).
As for 3 & 5, we kinda cheat because if your model appears to have converged but is not performing adequately, we step outside of the learning algorithm and modify the structure of the neural net or tune the hyperparameters.
I don't know if we'll ever have a general theory that explains why neural nets seem to work so well, because you'd have to characterize both the possible tasks and the possible models, but perhaps we'll gradually chip away at the problem as we gather more empirical data and formalize heuristics based on those findings.
I think it's a combination of (1) and (2): in a high-dimensional space, for a critical point to be a true local optimum the curvature has to have the same sign in every dimension, and the probability of that falls off exponentially in the number of dimensions. So in practice, they're all saddles.
I would not be comfortable making such assumptions about the topology of fitness functions in high-dimensional spaces. I think I read an article here on HN not long ago about how weird high-dimensional spaces are, but I can't find it quickly.
Further, the mind-boggling size of these high-dimensional spaces makes me all but certain that not a single non-trivial neural network made by Homo sapiens has ever been at a global optimum. But I have nothing but my hunch on this.
This is common wisdom that I think is false. You absolutely do get stuck in local optima frequently with reinforcement learning. OpenAI has a good example somewhere of a robot trying to put a peg through a hole: trained with regular gradient descent, it gets stuck putting the peg pretty close to the hole, but not through it.
I'm not even sure that it's not a problem in general. I know I've watched NNs frequently get stuck in local minima even on incredibly simple datasets like XOR or spirals. SGD and dropout are widely used in part because they add noise to the gradients that can help break out of local optima, but that's not a perfect method.
This is field-dependent; it feels more like an attribute of certain types of data than of certain algorithms.
Reinforcement learning was on my mind when I wrote "practical supervised learning applications", because yes, RL is different in that regard. And various function-approximation examples (starting with XOR) do indeed get stuck.
However, if we're applying neural networks to the (wide and practically important!) class of "pattern recognition" tasks like processing image or language data, then it's different, and those are full fields where you can easily spend a whole career working on just one of these types of data. Perhaps there's a relation with the structure and redundancy inherent in data like this.
It depends, but I don't think general statements like that are necessarily true. Most methods allow trade-offs between converging quickly on local optima vs finding global optima. The mutation aspect of genetic algorithms is just one way to do this. With well tuned mutation parameters, genetic algorithms can definitely be successful at finding global optima, but similar measures exist for most methods.
What do you mean by "find" and how do you define "global optima"?
For "find" you could discuss convergence rates vs. the points that are being converged to. If you add randomness at every iteration you're not even converging, at which point annealing rate becomes a related issue.
For "global optima" are we talking about training error, or test error, or some other kind of function value?
Deep learning has enough randomness: the random weight initialization, the random choice of samples (as in the name Stochastic Gradient Descent), the random flip/crop/noise in data augmentation, and the randomness injected by dropout.
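As a rough sketch of where each of those enters a training step (schematic NumPy with a toy linear model standing in for a real network; all names here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng()

# 1. Random weight initialization.
weights = rng.normal(scale=0.01, size=(28 * 28, 10))

def grad_fn(weights, batch, targets, dropout_mask):
    """Stand-in for backprop: gradient of a plain linear least-squares model."""
    x = batch.reshape(len(batch), -1) * dropout_mask
    error = x @ weights - targets
    return x.T @ error / len(batch)

def training_step(weights, images, labels, lr=0.1, batch_size=32, dropout_p=0.5):
    """One schematic SGD step, annotated with where randomness enters."""
    # 2. Stochastic sample choice: a random minibatch (the "S" in SGD).
    idx = rng.choice(len(images), size=batch_size, replace=False)
    batch, targets = images[idx], labels[idx]

    # 3. Data augmentation: random horizontal flips plus a little additive noise.
    flip = rng.random(batch_size) < 0.5
    batch[flip] = batch[flip][:, :, ::-1]
    batch = batch + rng.normal(scale=0.01, size=batch.shape)

    # 4. Dropout: a random mask over input features for this step.
    dropout_mask = rng.random(weights.shape[0]) > dropout_p

    return weights - lr * grad_fn(weights, batch, targets, dropout_mask)

# Toy usage with fake data, just to show the step running.
images = rng.random((1000, 28, 28))
labels = np.eye(10)[rng.integers(0, 10, size=1000)]
weights = training_step(weights, images, labels)
```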
You are not off base at all; thanks for clarifying, and sorry for the confusion. I did not mean to say it was using gradient descent; it's been a while. The term I was thinking of was multiple runs of "simulated annealing".
Genetic algorithms essentially run coordinate descent with some tweaks. On the positive side, coordinate descent doesn't need gradients and thus happily optimizes complex models. On the negative side, coordinate descent is roughly N times slower than gradient descent where N is the number of parameters being optimized.
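For reference, here's a minimal sketch of the kind of derivative-free coordinate descent being compared against (toy objective, made-up names): each sweep probes the coordinates one at a time, which is why it costs on the order of N function evaluations per step instead of a single gradient evaluation.

```python
import numpy as np

def coordinate_descent_step(f, x, step=0.1):
    """Derivative-free coordinate descent: probe each coordinate separately
    and keep a nudge only if it lowers f. One full sweep costs on the order
    of 2N evaluations of f, versus one gradient for a gradient-descent step."""
    for i in range(len(x)):
        for delta in (step, -step):
            candidate = x.copy()
            candidate[i] += delta
            if f(candidate) < f(x):
                x = candidate
                break
    return x

# Toy usage: minimize a simple quadratic with its minimum at (3, 3, 3, 3, 3).
f = lambda x: np.sum((x - 3.0) ** 2)
x = np.zeros(5)
for _ in range(100):
    x = coordinate_descent_step(f, x)
print(x)
```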
GAs are not "basically just stochastic gradient descent." That's the confusion.
Why did they fall out of favour? Fashion, perhaps a natural break in progress: they hit a wall and couldn't get further. But there is plenty of GA research going on.