One of the problems with coming up with a good theory is that, at the end of the day, we're building a system that's particularly suited to a certain kind of pattern. If you're building a facial-recognition convnet, something about the dataset of faces is going to influence what works and what doesn't.
When you're building digital circuits, by contrast, they're expected not to care about what the bits mean or which patterns are more likely. A circuit has to work for all possible inputs, with equal quality.
There are commonalities between how you would process faces and how you would recognize other visual objects, and that's why there are design patterns such as "convolutional layers come before fully-connected layers".
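To make that design pattern concrete, here is a minimal sketch of such a network in PyTorch (the framework, the layer sizes, and the name TinyConvNet are my choices for illustration, not something the text specifies): the convolutional stage exploits the local, reusable structure that images share, and only then does a fully-connected stage make the task-specific decision.

    # Minimal sketch of "convolutional layers before fully-connected layers".
    # Assumes PyTorch and 32x32 RGB inputs; sizes are illustrative only.
    import torch
    import torch.nn as nn

    class TinyConvNet(nn.Module):
        def __init__(self, num_classes: int = 10):
            super().__init__()
            # Convolutional stage: encodes the prior that nearby pixels are
            # related and that the same local features recur across the image.
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
            # Fully-connected stage: combines the extracted features into a
            # task-specific decision once the spatial structure has been used.
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(32 * 8 * 8, num_classes),  # 32x32 input -> 8x8 maps
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.classifier(self.features(x))

    # Example: a batch of four 32x32 RGB images.
    logits = TinyConvNet()(torch.randn(4, 3, 32, 32))
    print(logits.shape)  # torch.Size([4, 10])

The point of the ordering is exactly the specialization being discussed: the early layers bake in assumptions about visual data, and that's what makes the pattern work for faces and other images rather than for arbitrary inputs.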
In a way, the "no free lunch" theorem says that you always pay a price when you specialize to a certain kind of pattern: the gain comes at the expense of other patterns. So any stack of theories about ML/DL is going to be incomplete unless it says something about the nature of your data/patterns.
(That doesn't mean we can't say anything useful about DL; it just puts a certain damper on those efforts.)
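For reference, the usual loose form of the no-free-lunch result for supervised learning (Wolpert's formulation) can be sketched as follows, where $\mathcal{F}$ is the set of all possible target functions and $\mathrm{err}_{\mathrm{OTS}}$ is off-training-set error; the uniform averaging over targets is part of the theorem's assumptions, not a claim about real data:

    % Sketch: averaged uniformly over all targets, every learner does equally well.
    \[
      \frac{1}{|\mathcal{F}|} \sum_{f \in \mathcal{F}}
        \mathbb{E}\!\left[ \mathrm{err}_{\mathrm{OTS}}(A, f) \right]
      \;=\; c
      \qquad \text{for every learning algorithm } A .
    \]

So whatever an algorithm gains on face-like targets it gives back, on average, across the rest of $\mathcal{F}$; the only way out is to say something about which targets actually occur.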