Yes, I have a very good point in fact. But the above comment purposely chooses not to argue with it, because it's easier to ignore it entirely and argue something else.
The problem is you presented something as a fact while it’s just a guess. Some people guess it’s an improved gradient flow, others guess it’s a smoother loss surface, someone else guesses it’s a shortcut for early layer information to reach later layers, etc. We don’t actually know why resnets work so well.
The point of that comment doesn't have anything to do with how ResNets actually work. You missed the actual point.
> We don’t actually know why resnets work so well.
Yes actually we do. We know, from the literature, that very deep neural networks suffered from vanishing gradients in their early layers in the same way traditional RNNs did. We know that was the motivation for introducing skip connections which gives us a hypothesis we can test. We can measure, using the test I described, the differences in the size of gradients in the early layers with and without skip connections. We can do this across many different problems for additional statistical power. We can analyze the linear case and see that the repeated matmults should lead to small gradients if their singular values are small. To ignore all of this and say that well we don't have a general proof that satisfies a mathematician so i guess we just don't know is silly.
You're doing it again - presenting guesses as facts. Why would a resnet - a batch normalized network using ReLU activations suffer from vanishing gradient problem? Does it? Have you actually done the experiment you've described? I have, and I didn't see gradients vanish. Sometimes gradients exploded - likely from a bad weights initialization (to be clear - that's a guess), and sometimes they didn't, but even when they didn't the networks never converged. The best we can do is to say: "skip connections seem to help training deep networks, and we have a few guesses as why, none of which is very convincing".
We know, from the literature
Let's look at the literature:
1. Training Very Deep Neural Networks: Rethinking the Role of Skip Connections: https://orbilu.uni.lu/bitstream/10993/47494/1/OyedotunAl%20I... they're making a hypothesis that skip connections might help prevent transformation of activations into singular matrices, which in turn could lead to unstable gradients (or not, it's a guess).
2. Improving the Trainability of Deep Neural Networks through Layerwise Batch-Entropy Regularization: https://openreview.net/pdf?id=LJohl5DnZf they are making some hypothesis about an optimal information flow through the network, and that a particular form of regularization helps improve this flow (no skip connections are needed).
3. Deep Learning without Shortcuts: Shaping the Kernel with Tailored Rectifiers https://arxiv.org/abs/2203.08120: focus on initial conditions and propose better activation functions.
Clearly the issues are a bit more complicated than the vanishing gradients problem, and each of these papers offer a different explanation of why skip connections help.
It's similar to people building a bridge in 15th century - there was empirical evidence and intuition of how bridges should be built, but very little theory explaining that that evidence or intuition. Your statements are like "next time we should make the support columns thicker so that the bridge doesn't collapse", when in reality it collapsed due to the resonant oscillations induced by people marching on it in unison. Thicker columns will probably help, but they do nothing to improve understanding of the issue. They are just a guess.
That's why we need mathematicians looking at it, and attempting to formalize at least parts of the empirical evidence, so that someone, some day, will develop a compelling theory.