It is difficult to understand the implications of these assumptions and if they really apply to supervised deep learning nets.
It has been known for a very long time that simple models, like the Random Energy Model, display a spin glass transition at low temperature. (The REM is the p → ∞ limit of the mean-field p-spherical spin glass used in the earlier papers by LeCun.) So it is expected that a random network may also display this kind of behavior, and, therefore, there could exist a large number of global minima, separated by very large barriers.
However, it has been argued that this is unphysical for very strongly correlated systems like proteins and random hetero-polymers (with non-local contacts). Instead, one would have a minimally frustrated spin glass. This would lead to a highly funneled energy landscape, with a single minimum (or a few local minima), i.e. a kind of rugged convexity.
As I understand it, the main result of the paper relies on these 4 assumptions:
- That the dimensionality of the output of the network is smaller than that of the input. That is usually the case in image recognition, where the image is width x height x channels dimensional, while the output is usually a much smaller number of label-wise probabilities. It probably isn't the case when you generate data from some smaller representation, e.g. with autoencoders, image generation, etc.
- That the input data is decorrelated, and that the input data is uncorrelated with the output ground truth. The former can easily be obtained via a whitening transformation in many cases in practice. I am not quite sure about the latter.
- That whether a connection in the network is activated or not is random, with the same probability of success across the network. "Active" means the ReLU activation function has output greater than 0. Many people initialize the weights with some 0-mean random variable and a constant bias, in which case that assumption should hold at the beginning of training. Whether it holds throughout training could easily be verified empirically, i.e. by monitoring the network's activations (see the sketch after this list).
- That the network activations are independent of the input, the weights and each other. That's obviously not completely true - the network activations of a given layer are a function of the previous layer's activations and weights, and ultimately of the input in the first layer. With large enough networks, this may hold sufficiently in practice - any single activation should not depend very significantly on any other single variable.
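Since the activation assumptions (the third and fourth) are the easiest to probe empirically, here is a minimal sketch of what I mean by monitoring the activations. This is my own illustration, not code from the paper; the layer sizes, input distribution, and initialization scale are arbitrary choices:

    import numpy as np

    rng = np.random.default_rng(0)
    layer_sizes = [784, 512, 512]                     # input plus two hidden layers (illustrative)
    X = rng.standard_normal((1000, layer_sizes[0]))   # stand-in for whitened inputs

    h = X
    for i, (fan_in, fan_out) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:]), start=1):
        W = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))
        h = np.maximum(h @ W, 0.0)                    # ReLU, zero bias at initialization
        active = (h > 0).astype(float)                # on/off pattern per unit and sample
        corr = np.corrcoef(active, rowvar=False)
        off_diag = np.abs(corr[~np.eye(fan_out, dtype=bool)])
        print(f"layer {i}: P(active) = {active.mean():.3f}, mean |corr| = {off_diag.mean():.3f}")

If the third assumption holds, P(active) should stay roughly constant across layers (and, if re-run during training, across training time); the mean absolute correlation gives a rough feel for how badly the independence assumption is violated.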
I may have missed something in interpreting the maths; any comments are appreciated. From a practical standpoint, especially for computer vision, these assumptions seem quite reasonable. I am not qualified to comment on the proof of this result, however, so I would wait for peer review. Still, it is heartening to see the theory of deep learning finally catching up with practice!
Linear correlation. Actually, it seems to be a bit more general than just uncorrelated: if the input is an m by n matrix X and the ground truth a k by n matrix Y, the author requires XX^T and XY^T to be full rank. A whitening transformation would yield the identity matrix for XX^T, but that's a bit stronger than what's strictly necessary. My interpretation of XY^T being full rank as meaning X and Y are uncorrelated might indeed be mistaken.
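To make that concrete, here is a minimal sketch of the two rank conditions (using the X as m by n, Y as k by n convention above) and of why whitening is stronger than needed. The data is synthetic and the whitening step is just the standard recipe, nothing taken from the paper:

    import numpy as np

    rng = np.random.default_rng(0)
    m, k, n = 20, 5, 1000
    X = rng.standard_normal((m, n))                     # inputs, one column per sample
    Y = rng.standard_normal((k, n))                     # ground truth, one column per sample

    print(np.linalg.matrix_rank(X @ X.T) == m)          # is XX^T full rank?
    print(np.linalg.matrix_rank(X @ Y.T) == min(m, k))  # is XY^T full rank?

    # Whitening makes XX^T exactly the identity, which is stronger than full rank:
    U, S, _ = np.linalg.svd(X @ X.T)
    X_white = np.diag(1.0 / np.sqrt(S)) @ U.T @ X
    print(np.allclose(X_white @ X_white.T, np.eye(m)))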
I am not quite sure there is such a thing. :p
I was playing a bit loose with the mathematics here, and trying to find some more intuitive way to explain "XY^T is full-rank", but it got confusing. Sorry about that. I will edit my initial post accordingly. (Ah, can't edit it seems, oh well)
I have a feeling, albeit one not informed by deep knowledge of either field, that there are already-known pure mathematical results that can cause a <insert your favorite way of saying "paradigm shift"> in machine learning, just "waiting to be discovered" by people in the latter field.
All this talk of maxima and minima reminds me of Morse theory, for instance (and that Wikipedia page is more than what I know about it). [1]
For context, if true (the paper has not been peer reviewed yet), this confirms what has already been suspected and empirically evaluated about deep learning: that the non-convexity of neural networks is not an issue.
If these claims are true (specifically, that every local minimum is a global minimum), then why did the earlier neural networks have poor performance? Why did we need advancements like pretraining via stacked RBMs and dropout in order to make deep learning converge on usable/better models?
1. Our labeled datasets were thousands of times too small.
2. Our computers were millions of times too slow.
3. We initialized the weights in a stupid way.
4. We used the wrong type of non-linearity.
The pre-training helped with initialization, but later it turned out that just initializing the weights with the correct scale for each layer (to deal with the vanishing/exploding gradient effect) worked almost as well, given enough data.
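For reference, here is a minimal sketch of one such scale-aware scheme (the He/MSRA-style recipe for ReLU layers is one common choice; the layer sizes here are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    layer_sizes = [784, 512, 256, 10]
    # Variance 2 / fan_in keeps the pre-activation scale roughly constant across
    # ReLU layers, which is what avoids vanishing/exploding gradients at the start.
    weights = [rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
               for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    biases = [np.zeros(fan_out) for fan_out in layer_sizes[1:]]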
Poor performance is generally evaluated by considering the validation error. This paper only cares about the training error: it is about the presence or absence of bad local minima in the optimization problem that one has to solve when training an MLP with ReLU activations. The learning and the optimization problems are related but they are not the same :)
Dropout is a tool to prevent overfitting (a large gap between training error and validation error). This paper does not say anything about overfitting, generalization, or the impact of regularization.
Also note that unsupervised pre-training via stacked RBMs has proven mostly useless for MLPs with ReLU activations if the network is wide enough and the training set is big enough. It is unclear whether initialization via unsupervised pre-training can improve the training error or not. I think it mostly has an impact on the validation error (although I am not sure).
Furthermore, practitioners tend to stop training before full convergence on the training set. Instead, one generally stops when the validation error stops decreasing significantly (early stopping), and one does not really care about the final value of the training loss that could have been reached by continuing to train forever. Traditional SGD has a convergence rate that is too slow, and in practice this prevents checking whether we are converging to a bad local minimum or not on non-toy problems.
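In pseudocode, that early-stopping loop looks something like the sketch below. Here train_one_epoch() and validation_error() are hypothetical helpers standing in for whatever framework is being used, and the patience/tolerance values are arbitrary:

    # Hypothetical helpers: train_one_epoch(model) does one pass over the training
    # set, validation_error(model) evaluates the model on held-out data.
    best_val, bad_epochs, patience, tol = float("inf"), 0, 10, 1e-4

    for epoch in range(1000):
        train_one_epoch(model)
        val = validation_error(model)
        if val < best_val - tol:
            best_val, bad_epochs = val, 0   # still improving, keep going
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                       # stop well before the training loss converges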
To sum up: a better understanding of the optimization problem is very helpful (in particular to tackle underfitting and reduce training times), but that alone will not ensure that we can build models that generalize correctly to unseen data.
Using ReLU units is a newer advancement and I agree that changing the activation function does change the cost function. However, before Hinton got all excited about ReLU units, he was still showing huge improvements just by using pretraining and later by using dropout, which shouldn't change the cost function.
Dropout helps with convergence/optimization, sure. The existence of a global minimum says nothing about the time required to reach it. Important to note that dropout isn't as common anymore; it's not a huge win.
> For deeper networks, Corollary 2.4 states that there exist “bad” saddle points in the sense that the Hessian at the point has no negative eigenvalue.
To me these sound just as bad as local minima. Also I don't think it's standard to call something a saddle point unless the Hessian has negative as well as positive eigenvalues. Otherwise there's no "saddle", more something like a valley or plateau.
They claim that these can be escaped with some perturbation:
> From the proof of Theorem 2.3, we see that some perturbation is sufficient to escape such bad saddle points.
I haven't read through the (long!) proof in detail, but it doesn't seem obvious to me why these would be any easier to escape via perturbation than a local minimum would be, and I think this could use some extra explanation, as it seems like an important point for the result to be useful. Did anyone figure this bit out?
Ah yep, true. I'd forgotten you can still get the saddle effect from higher-order derivatives; the Hessian eigenvalues aren't enough to characterise it.
I was thinking of examples like (x-y)^2 at zero, although I guess that's still a local minimum, just not a unique local minimum in any neighbourhood.
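For a concrete example of the higher-order effect (mine, not from the paper): take f(x, y) = x^2 + y^3. At the origin the gradient vanishes and the Hessian is diag(2, 0), so there is no negative eigenvalue, yet f(0, -t) = -t^3 < 0 for every t > 0, so the origin is not a local minimum; the cubic term supplies the "saddle". By contrast, for (x-y)^2 every point on the line x = y genuinely is a (non-strict) local minimum.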
This paper would be much improved by cutting the material on linear models, which AFAIK is not new, and then by properly explaining the assumptions (Axa...) that they are trying to do away with for nonlinear models, rather than referring the reader to previous work.
Oh yeah, our deep learning is a super cool technology, no need to study math, believe us, just buy more GPUs and they will help you, throw the data on the fan. Or, even better, give us your data, we'll decide what to do with it for you, upload your data to our cloud and relax.
Consider the following question: how many layers do you need in a ReLU network to approximate the function x^2 on [0,1] with 1e-6 accuracy?
Exactly, this simple example demonstrates how hard it is to learn even a simple function. Playing with your code, 10 hidden units give error 0.6 after 1000 iterations, while 1000 hidden units give error 0.02 after 10000 iterations, so it takes much longer to train. This is not an easy technology.
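For anyone who wants to try it, here is a minimal sketch of the experiment (my own reconstruction, since the original code isn't shown here; the width, optimizer, learning rate and iteration count are arbitrary choices, so the exact error values will differ):

    import torch

    torch.manual_seed(0)
    x = torch.rand(1024, 1)                  # samples from [0, 1]
    y = x ** 2                               # target function

    n_hidden = 1000                          # try 10 vs. 1000, as discussed above
    model = torch.nn.Sequential(
        torch.nn.Linear(1, n_hidden),
        torch.nn.ReLU(),
        torch.nn.Linear(n_hidden, 1),
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for step in range(10000):
        opt.zero_grad()
        loss = ((model(x) - y) ** 2).mean()  # training MSE
        loss.backward()
        opt.step()

    print(loss.item())                       # final training error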
[1] http://stats.stackexchange.com/questions/203288/understandin...