If that is what he was trying to say, then he's wrong. Hiddencost is right. We do not have a solid theoretical understanding of neural nets and of why backprop and all the other tricks (e.g. ReLU, dropout) work. There is basic intuition for why they work, but no rigorous theory. Even worse, we have no theory of which architectures work better or worse, and why.
So whenever you read about some 50-layer net trained with this architecture, with this padding/stride/normalization, and you wonder how they came up with that, the answer is: some grad student sat there, thought about his past experience and the papers he'd read on architectures that have worked well, and then spent months trying a bunch of things.
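To give a concrete (entirely made-up) picture of what that trial-and-error looks like, here is a minimal sketch of random search over a few architecture knobs; the search space, the scoring function, and every number in it are placeholders for illustration, not anyone's actual recipe:

    import random

    def validation_score(depth, stride, padding):
        # Placeholder: in practice this would train a net with these settings
        # and return its validation accuracy, which is the expensive part.
        return random.random()

    random.seed(0)
    best = None
    for _ in range(20):
        cfg = {
            "depth": random.choice([18, 34, 50, 101]),
            "stride": random.choice([1, 2]),
            "padding": random.choice(["same", "valid"]),
        }
        score = validation_score(**cfg)
        if best is None or score > best[0]:
            best = (score, cfg)
    print("best config found:", best[1])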
I don't think that's quite right. There is a lot of understanding of neural nets and why different methods work. Maybe not to the level that satisfies purist mathematicians, but nothing really satisfies them.
Yes, it's true that hyperparameters seem arbitrary, but that's just a consequence of the no-free-lunch theorem. Some models will always fit some problems better than others. There is no such thing as a perfect model. NNs turn out to be a really good prior for real-world problems. I wrote about why that might be here: http://houshalter.tumblr.com/post/120134087595/approximating...
But in principle, that doesn't apply to things like the number of layers. In theory, the best neural net has infinite layers of infinite size and infinite convolutions, with a stride of 1. Because you can fit every other model into that, and as long as the parameters are properly regularized, it avoids overfitting too. In the real world, of course, we are limited to merely 50 layers and need to cut corners with convolution sizes and such.
Likewise for training. The best algorithm for training is Bayesian inference over the parameters. Since that is usually very impractical, we use approximations like maximum likelihood or dropout.
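As a toy illustration of that contrast, here is a minimal sketch for a one-parameter linear model where exact Bayesian inference is actually tractable; the data, the noise variance, and the prior are all invented for the example:

    import numpy as np

    # Toy data: y = w*x + noise, with w unknown.
    rng = np.random.default_rng(0)
    x = rng.normal(size=20)
    y = 2.0 * x + rng.normal(scale=0.5, size=20)

    sigma2 = 0.25      # assumed (known) noise variance
    prior_var = 1.0    # Gaussian prior on w: N(0, prior_var)

    # Maximum likelihood: a single point estimate of w.
    w_ml = (x @ y) / (x @ x)

    # Exact Bayesian inference: a full (Gaussian) posterior over w.
    post_var = 1.0 / (x @ x / sigma2 + 1.0 / prior_var)
    post_mean = post_var * (x @ y) / sigma2

    print("ML estimate:       ", w_ml)
    print("posterior mean/std:", post_mean, np.sqrt(post_var))

For a neural net the posterior has no closed form, which is why people fall back on point estimates or on things that only approximate it.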
No, that's not even what the theory says. In theory, all you need is a single hidden layer, and all you need to do is keep increasing the number of hidden units until things work, because such a net can model any function (the universal approximation theorem). What you said makes no sense theoretically (an infinite number of layers and convolutions makes no sense - I think you meant to use the word arbitrary). In the real world, we need to stack layers, for reasons that are only vaguely understood.
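Here is a minimal sketch of that single-hidden-layer picture, assuming a toy 1-D target, random tanh hidden units, and an output layer fit by least squares; the target function and the widths are arbitrary, but it's usually enough to watch the approximation error fall as units are added:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 200)[:, None]
    target = np.sin(2 * x).ravel()        # arbitrary function to approximate

    for width in (4, 32, 256):
        W = rng.normal(size=(1, width))   # random hidden-layer weights
        b = rng.normal(size=width)
        H = np.tanh(x @ W + b)            # hidden activations, shape (200, width)
        w_out, *_ = np.linalg.lstsq(H, target, rcond=None)
        err = np.max(np.abs(H @ w_out - target))
        print(f"width={width:4d}  max error={err:.3f}")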
It goes without saying that the reason we have 50-layer (vs. deeper) nets and strides, among other things, is not solely to reduce computational cost.
The universal approximation theorem just says that you can construct a giant NN to fit any function, basically by making it a giant lookup table. It says nothing about fitting functions efficiently, i.e. generalizing from little data using fewer parameters.
In order to do that you do need to use multiple layers. And the same is true for digital circuits, which NNs basically are. I'm sure there is mathematical theory and literature on the representational power of digital circuits.
There is a limit to what you can do with only one layer of circuits, and you can compute more functions, more efficiently, with more layers. That is, taking the results of some operations and then doing more operations on those results. Composing functions, as opposed to just memorizing a lookup table, which is inefficient.
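To make the composition point concrete, here is a small sketch using n-bit parity as the example function (my choice, not from the thread): a flat lookup table needs 2^n entries, while composing XOR steps reuses intermediate results and needs only n-1 operations per input.

    from itertools import product

    n = 8
    inputs = list(product([0, 1], repeat=n))

    # "One layer": memorize every input -> output pair, 2**n entries.
    lookup = {bits: sum(bits) % 2 for bits in inputs}

    # "Composed layers": fold XOR over the bits, reusing intermediate results.
    def parity(bits):
        acc = 0
        for b in bits:
            acc ^= b
        return acc

    assert all(lookup[bits] == parity(bits) for bits in inputs)
    print(f"lookup entries: {len(lookup)}, XOR ops per input: {n - 1}")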
That's why multiple layers work better. It isn't some strange mystery.
>an infinite number of layers and convolutions makes no sense - I think you meant to use the word arbitrary
A better way to word it would be "as it approaches infinity" or "in the limit" or something. That is, the accuracy of the neural net should increase, and only increase, as you increase the number of layers and units (provided you have proper regularization/priors). That's because bigger models can emulate smaller models, but not vice versa.
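Here is a minimal sketch of the "bigger models can emulate smaller models" direction for a single tanh hidden layer, by zero-padding a small net's weights into a wider net; the sizes are arbitrary, and of course this says nothing about what training would actually find:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 4))            # batch of 5 inputs, 4 features each

    # Small net: 4 inputs -> 3 hidden tanh units -> 1 output.
    W1 = rng.normal(size=(4, 3))
    b1 = rng.normal(size=3)
    w2 = rng.normal(size=3)
    small_out = np.tanh(x @ W1 + b1) @ w2

    # Big net: 4 inputs -> 8 hidden units, with the extra units zeroed out.
    W1_big = np.zeros((4, 8))
    W1_big[:, :3] = W1
    b1_big = np.zeros(8)
    b1_big[:3] = b1
    w2_big = np.zeros(8)
    w2_big[:3] = w2
    big_out = np.tanh(x @ W1_big + b1_big) @ w2_big

    # The wider net computes exactly the same function as the smaller one.
    print(np.allclose(small_out, big_out))   # True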
Yes, in order to generalize better you need deeper nets. That was my whole point. But how deep? And what are the parameters of each layer? Grad students just pull those numbers from intuition. And it goes without saying that an infinitely deep net (whatever that means) would not generalize from little data, and would get even harder to train the deeper it got. If it means what I think it means, you're basically claiming that recurrent neural nets can easily represent anything, but RNNs exist today, and they don't do the magic you're claiming they do.
The forward pass of a net is not theoretically interesting. It's the training of the net that has no theory. The training has nothing to do with digital circuits.
It goes without saying that you've handwaved some (perfectly fine) ideas about composing functions and such, and then claimed "it isn't some strange mystery." That's my point: you've argued some ideas from intuition. There is little theoretical rigor around this, however.