We understand how they work mechanically, but not why they work from a theoretical standpoint (which I assume is what you're trying to say by "abstractly").
Why ASGD and backprop converge on a non-convex optimization problem, and what kinds of model topologies make them do better or worse? That's all basically art right now.
No, that's not what he's trying to say. We know why gradient descent and back-propagation work.
But no one understands what an actual trained neural net is doing. You can look at the weights, and you can watch the inputs and outputs, but it is very difficult to understand why it does what it does.
It just fits a model to data, but there is no explanation of why that model is best. The weights are not interpretable by humans.
There have been some attempts at making models which humans can interpret. One program, named Eureqa, fits the simplest possible mathematical expression to a set of samples. A biologist tried it on his data and found that it produced an expression which actually fit the data really well. But he couldn't publish it because he couldn't explain why it worked. It just did. There was no understanding, no explanation.
"We know why gradient descent and back-propagation works."
I would phrase that differently: we know when gradient descent and back-propagation work, not why they work for so many real-world problems.
For the "when", there are zillions of published mathematical results stating "if a problem has property X, this and this method will find a solution with property Y in time T", in zillions of variations (Y can be the true optimum, a value within x% of the real optimum, the true optimum 'most of the time', etc.; T can be 'eventually', 'after O(n^3) iterations', 'always', etc.).
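To give one concrete instance of that template (a standard textbook result, not anything specific to neural nets): if the objective f is convex and L-smooth, gradient descent with step size 1/L satisfies

    f(x_k) - f(x*) <= L * ||x_0 - x*||^2 / (2k)

i.e. a 1/k rate. Notice that convexity is the "property X" doing all the work there, and real networks don't have it.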
However, for most real-world problems we do not know whether they have property X, or even how to go about arguing that it is likely they have property X, other than the somewhat circular "algorithm A seems to work well for it, and we know it works for problems of type X"
Well, we do know why gradient descent works (for smooth objectives), at least for finding a local minimum, because finding a minimum is basically what it does by construction. Similarly, we certainly know how back-propagation works, because it's simple calculus: a backwards application of the chain rule.
Perhaps what you're trying to say is that we don't know why finding local minima of these problems is good enough to solve them?
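To make the "by construction" part concrete, here's a minimal sketch in plain numpy (toy data and layer sizes I made up): the forward pass is just function composition, the backward pass is the chain rule applied in reverse, and the update is a literal step downhill.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, 3))               # toy inputs (made up)
    y = np.sin(X.sum(axis=1, keepdims=True))   # toy target function

    W1 = rng.normal(scale=0.5, size=(3, 16)); b1 = np.zeros(16)
    W2 = rng.normal(scale=0.5, size=(16, 1)); b2 = np.zeros(1)
    lr = 0.1

    for step in range(500):
        # forward pass: a composition of simple functions
        h = np.tanh(X @ W1 + b1)
        pred = h @ W2 + b2
        loss = np.mean((pred - y) ** 2)

        # backward pass: chain rule, applied from the output back toward the input
        d_pred = 2 * (pred - y) / len(X)            # dL/dpred
        dW2 = h.T @ d_pred; db2 = d_pred.sum(0)     # dL/dW2, dL/db2
        d_h = d_pred @ W2.T                         # dL/dh
        d_z = d_h * (1 - h ** 2)                    # through tanh: d tanh(z)/dz = 1 - tanh(z)^2
        dW1 = X.T @ d_z; db1 = d_z.sum(0)

        # gradient descent: step downhill on the loss surface
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

    print(loss)   # the loss has dropped toward a (local) minimum -- the "by construction" part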
If that is what he was trying to say, then he's wrong. Hiddencost is right. We do not have a solid theoretical understanding of neural nets, or of why backprop and all the other tricks (e.g. ReLU, dropout) work. There is basic intuition about why they work, but no rigorous theory. Even worse, we have no theory of which architectures work better or worse and why.
So whenever you read about some 50-layer net trained with this architecture and this padding/stride/normalization, and you wonder how they came up with that, the answer is: some grad student sat there, thought about his past experience and the papers he'd read on architectures that have worked well, and then spent months trying a bunch of things.
I don't think that's quite right. There is a lot of understanding about neural nets and why different methods work. Maybe not to the level that satisfies purist mathematicians, but nothing really satisfies them.
Yes, it's true that hyperparameters seem arbitrary, but that's just a consequence of the no-free-lunch theorem. Some models will always fit some problems better than others. There is no such thing as a perfect model. NNs turn out to be a really good prior for real-world problems. I wrote about why that might be here: http://houshalter.tumblr.com/post/120134087595/approximating...
But in principle, that doesn't apply to things like the number of layers. In theory the best neural net has infinite layers of infinite size and infinite convolution, with a stride of 1. Because you can fit every other model into that, and as long as the parameters are properly regularized it can avoid overfitting too. In the real world, of course, we are limited to merely 50 layers, and need to cut corners with convolution sizes and such.
Likewise for training. The best algorithm for training is bayesian inference on the parameters. Since that is usually very impractical, we use approximations like maximum-likelihood or dropout.
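For what it's worth, the dropout-as-cheap-Bayes idea can be sketched in a few lines of numpy (toy, untrained weights, nothing rigorous): keep dropout on at prediction time and average several stochastic forward passes, which acts as a crude stand-in for averaging predictions over a distribution on the parameters.

    import numpy as np

    rng = np.random.default_rng(1)
    W1 = rng.normal(size=(3, 32)); W2 = rng.normal(size=(32, 1))   # pretend these were trained

    def forward(x, drop=0.5):
        h = np.maximum(0.0, x @ W1)            # relu layer
        mask = rng.random(h.shape) > drop      # random dropout mask
        h = h * mask / (1 - drop)              # inverted dropout scaling
        return h @ W2

    x = rng.normal(size=(1, 3))
    samples = np.array([forward(x) for _ in range(100)])
    print(samples.mean(), samples.std())   # mean ~ prediction, std ~ crude uncertainty estimate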
No, that's not even what the theory says. In theory, all you need is a single hidden layer, and all you need to do is keep increasing the number of hidden units until things work, because that can model any function (universal approximation theorem). What you said makes no sense theoretically (an infinite number of layers and convolutions makes no sense - I think you meant to use the word arbitrary). In the real world, we need to stack layers, for some reason that is only vaguely understood.
It goes without saying that the reason we have 50 (vs. deeper) layer nets and strides, among other things, is not solely to reduce computational cost.
The universal approximation theorem just says that you can construct a giant NN to fit any function, basically by making it a giant lookup table. It says nothing about fitting functions efficiently, i.e. generalizing from little data with fewer parameters.
In order to do that you do need to use multiple layers. And the same is true for digital circuits, which NNs basically are. I'm sure there is mathematical theory and literature on the representational power of digital circuits.
There is a limit to what you can do with only one layer of circuits, and you can compute more functions more efficiently with more layers. That is, taking the results of some operations and then doing more operations on those results: composing functions, as opposed to just memorizing a lookup table, which is inefficient.
That's why multiple layers work better. It isn't some strange mystery.
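A crude illustration of the circuit point (toy code, not a proof): computing the parity of n bits by composing two-input XOR gates takes about n gates, while the one-shot "lookup table" approach needs 2^n entries.

    from functools import reduce
    import operator

    n = 16
    bits = [1, 0, 1, 1] * 4                    # some n-bit input

    # "deep" circuit: compose ~n two-input XOR gates
    parity_composed = reduce(operator.xor, bits)

    # "shallow" circuit: one layer that memorizes every possible input, 2^n entries
    table = {i: bin(i).count("1") % 2 for i in range(2 ** n)}
    key = int("".join(map(str, bits)), 2)
    parity_lookup = table[key]

    assert parity_composed == parity_lookup
    print(len(table), "table entries vs", n - 1, "xor gates")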
>an infinite number of layers and convolutions makes no sense - I think you meant to use the word arbitrary
A better way to word it would be "as it approaches infinity" or "in the limit" or something. That is, the accuracy of the neural net should increase, and only increase, as you increase the number of layers and units (provided you have proper regularization/priors). Bigger models can emulate smaller models, but not vice versa.
Yes, in order to generalize better you need deeper nets. That was my whole point. But how deep? And what are the parameters of each layer? Grad students just pull those numbers from intuition. And it goes without saying that an infinitely deep net (whatever that means) would not generalize on little data, and would get even harder to train the deeper it gets. If it means what I think it means, you're basically claiming that recurrent neural nets can easily represent anything, but RNNs exist today, and they don't do the magic you're claiming they do.
The forward pass of a net is not theoretically interesting. It's the training of the net that has no theory. The training has nothing to do with digital circuits.
You've handwaved some (perfectly fine) ideas about composing functions and such, and then claimed "it isn't some strange mystery." That's my point: you've argued some ideas from intuition. There is little theoretical rigor around this, however.
While we understand the math of backpropagation, we don't really understand why it lets neural networks converge as well as they do. If you asked a mathematician they'd point out that yes, backpropagation can work, but there are no guarantees, even about the probability that it will work. And yet it does work, often enough and in enough different cases to be useful.
Yes, interpretability can be an issue, but that's not what is being discussed here.
> There have been some attempts at making models which humans can interpret. One program, named Eureqa, fits the simplest possible mathematical expression to a set of samples. A biologist tried it on his data and found that it produced an expression which actually fit the data really well. But he couldn't publish it because he couldn't explain why it worked. It just did. There was no understanding, no explanation.
This is a pretty well understood issue in the ML community. Different tools have different strengths: regression (sometimes), tree-based classifiers and Bayesian nets are nice for explaining your data, but other methods (SVMs, neural networks, etc.) can be better for prediction in some circumstances.
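A quick illustration with scikit-learn (assuming you have it installed; the dataset choice is arbitrary): the tree hands you rules you can actually read, the SVM just hands you predictions.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(export_text(tree))        # human-readable if/else rules

    svm = SVC().fit(X, y)
    print(svm.predict(X[:5]))       # predictions, but no explanation to read off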
Gradient descent is guaranteed (for a small enough step size) to move the parameters toward a better-fitting solution. That's almost the definition of GD. I guess you are saying that we don't know why it doesn't just get stuck in local optima?
I'm not certain, but for one, stochastic gradient descent is usually used, which helps break out of local optima. And second, they don't seem to be as much of a problem as you add more hidden units. I was under the impression there was a mathematical explanation for both of those, though I haven't done much research.
Off the top of my head, I think that as you add more and more hidden units, a subset of them will fit the function just by chance, and gradient descent will simply increase their weights. In fact, as long as the parameters are somewhere in the broad basin of a correct solution, GD can move them downhill to the optimal solution. Not that we even want the optimal solution, just a good enough one.
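Here's a toy illustration of the first point, with a made-up 1-D loss and Gaussian noise standing in for minibatch sampling (so it's not a claim about real nets): plain GD started in the shallow well stays there, while the noisy updates usually hop over the barrier into the deeper well.

    import numpy as np

    rng = np.random.default_rng(0)

    def grad(x):
        return 4*x*(x**2 - 1) + 0.3    # gradient of (x^2-1)^2 + 0.3x: two wells, the left one is deeper

    def run(noise_scale):
        x = 1.0                        # start in the shallow (right) well
        for _ in range(3000):
            x -= 0.01 * (grad(x) + rng.normal(scale=noise_scale))
        return x

    print("plain GD ends at %.2f" % run(0.0))           # stuck near +1
    ends = [run(10.0) for _ in range(200)]
    frac = np.mean([e < 0 for e in ends])
    print("noisy 'SGD' ends in the deeper well %d%% of the time" % (100 * frac))
    # typically a solid majority of the noisy runs, vs 0% for plain GD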
At this point I'm going to point you to [1]. They interview Ilya Sutskever (ex Toronto, ex Google Brain, now director of OpenAI). He knows a little about neural networks[2] and possibly knows as much as anyone else about gradient descent.
In [1] he talks a lot about how there is no theoretical basis to think that a deep neural network should converge, and prior to around 2006 the accepted wisdom was that networks deep enough to outperform other methods of machine learning were useless because they couldn't be trained. Then they discovered that using larger random values for initialization means that they do converge, but that there is no theoretical basis to explain this at all.
I believe that has to do with the vanishing gradient issue, which has been studied.
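The effect is easy to see numerically. A toy numpy sketch (tanh layers; depth, width and scales picked arbitrarily): with a small random init the backpropagated gradient shrinks geometrically with depth, while with a larger init it survives all the way back to the first layer.

    import numpy as np

    rng = np.random.default_rng(0)

    def grad_norm_at_input(init_scale, depth=50, width=100):
        x = rng.normal(size=width)
        Ws = [rng.normal(scale=init_scale / np.sqrt(width), size=(width, width))
              for _ in range(depth)]
        # forward pass, remembering activations
        hs = [x]
        for W in Ws:
            hs.append(np.tanh(hs[-1] @ W))
        # backward pass: start with an all-ones gradient at the output
        g = np.ones(width)
        for W, h in zip(reversed(Ws), reversed(hs[1:])):
            g = (g * (1 - h**2)) @ W.T     # chain rule through tanh and the weight matrix
        return np.linalg.norm(g)

    print(grad_norm_at_input(0.5))   # tiny: the gradient has vanished by the time it reaches layer 1
    print(grad_norm_at_input(1.5))   # many orders of magnitude larger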
As for why nets converge at all, there's this paper which I believe tries to establish some theory of why bigger nets don't have such a big problem with local optima: http://arxiv.org/abs/1412.0233
>The number of local minima outside that band diminishes exponentially with the size of the network. We empirically verify that the mathematical model exhibits similar behavior as the computer simulations, despite the presence of high dependencies in real networks. We conjecture that both simulated annealing and SGD converge to the band of low critical points, and that all critical points found there are local minima of high quality measured by the test error. This emphasizes a major difference between large- and small-size networks where for the latter poor quality local minima have non-zero probability of being recovered.
Your link supports my point, not yours. I'm aware of research showing that local minima are close to the global minimum. That does not mean neural nets usually converge to the global minimum, only that the local minima they converge to are close to it.
> Finally, we prove that recovering the global minimum becomes harder as the network size increases
Do you have a citation or blog post for this application of Eureqa that gave a good result but wasn't publishable from a biological study perspective? I have a background in BME and CS and am very interested in learning about this anecdote.