Thank you for the great discussion. I think you've put your finger on exactly the right thing. We can now dispense with the old VC-style thinking (i.e., that generalization comes from the hypothesis space not being complex enough). The real question now is this: is it the loss landscape itself, or the particular way in which the landscape is searched, that leads to good generalization in deep learning?
One could imagine an "exhaustive" search of the loss landscape, say on God's computer, that picks an arbitrary point among all those that minimize the loss (or come close to the minimum); with our computers we can merely sample. In either case, it's hard to see how one would avoid picking "memorization" solutions. Recall that in an over-parameterized setting there are many solutions with the same low training loss but very different test losses. The reference in my original post [1] gives a nice example with a toy over-parameterized linear model (Section 3) where multiple linear models fit the training data yet generalize very differently. (It also shows why GD ends up picking the better-generalizing solution.)
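Here is a minimal sketch of that kind of setup, under my own assumptions (dimensions, noiseless labels, zero initialization); it is not the exact example from Section 3 of [1], just an illustration that many interpolating solutions exist and that GD from zero lands on the minimum-norm one:

```python
# Illustrative sketch (not the paper's exact setup): an over-parameterized
# least-squares problem where many weight vectors interpolate the training
# data but generalize very differently. Gradient descent from zero converges
# to the minimum-norm interpolant (the pseudoinverse solution).
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                          # n training points, d >> n parameters
w_true = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ w_true                          # noiseless labels for simplicity

# Gradient descent on the squared loss, initialized at zero.
w = np.zeros(d)
lr = 0.01
for _ in range(20000):
    w -= lr * X.T @ (X @ w - y) / n

# An alternative interpolant: the min-norm solution plus a null-space component.
w_min_norm = np.linalg.pinv(X) @ y
null_dir = rng.normal(size=d)
null_dir -= np.linalg.pinv(X) @ (X @ null_dir)   # project out the row space of X
w_bad = w_min_norm + 5.0 * null_dir              # same training loss, very different weights

X_test = rng.normal(size=(1000, d))
y_test = X_test @ w_true
for name, v in [("GD solution", w), ("min-norm", w_min_norm), ("other interpolant", w_bad)]:
    train = np.mean((X @ v - y) ** 2)
    test = np.mean((X_test @ v - y_test) ** 2)
    print(f"{name:>18}: train {train:.2e}, test {test:.2e}")
```

All three have essentially zero training loss, but the interpolant with a large null-space component has a much larger test loss; "exhaustive" or random search over the set of minimizers has no reason to avoid such solutions.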
Now, people have argued that the curvature around a solution distinguishes well-generalizing solutions from poorly generalizing ones. With that we are already moving into the territory of how the space is sampled, i.e., the specifics of the search algorithm (a direction you may not like), but even if we press ahead it's not a satisfactory explanation: in a linear model with L2 loss the curvature is the same everywhere, as Zhang et al. pointed out. So curvature theories already fail in the simplest case, unless one believes that linear models are somehow fundamentally different from deeper, non-linear models.
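A quick numerical check of that fact (my own illustration, not from [1]): for a linear model with squared loss, the Hessian is (2/n) X^T X, which does not depend on the weights, so every point of the loss surface has identical curvature and "flatness" cannot tell solutions apart:

```python
# The Hessian of the mean squared loss of a linear model is 2/n * X^T X,
# independent of w: the curvature is the same at every point in weight space.
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 100
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def numerical_hessian(w, eps=1e-4):
    # Central finite differences of the gradient of L(w) = mean((Xw - y)^2).
    grad = lambda v: 2 * X.T @ (X @ v - y) / n
    H = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        H[:, i] = (grad(w + e) - grad(w - e)) / (2 * eps)
    return H

H_at_zero = numerical_hessian(np.zeros(d))
H_elsewhere = numerical_hessian(10 * rng.normal(size=d))
print(np.allclose(H_at_zero, H_elsewhere, atol=1e-6))      # True: curvature is constant
print(np.allclose(H_at_zero, 2 * X.T @ X / n, atol=1e-6))  # matches the analytic Hessian
```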
[1] points out other troubling facts about the curvature explanation (Section 12), but the one I like most is the following: according to curvature theories, the reason for good generalization at the start of training is fundamentally different from the reason for good generalization at the end of training. (As always, generalization is just the difference between test and training loss, so good generalization means that difference is small, not necessarily that the test loss itself is small.) At the start of GD training, curvature theories are not applicable (we just picked a random point, after all), so they would hold that we get good (in fact, perfect) generalization because we haven't looked at the training data. At the end of training, they say we generalize well because we found a shallow minimum. This lack of continuity is disconcerting. In contrast, stability-based arguments provide a continuous explanation: the longer you run SGD, the less stable it is (so don't run it too long and you'll be fine, since you'll achieve an acceptable tradeoff between lowering the training loss and overfitting).
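To illustrate that continuous, stability-style story, here is a rough sketch of my own (again not an argument from [1], and using plain full-batch GD rather than SGD for simplicity): track the train/test gap along the trajectory. At initialization the gap reflects only sampling noise, since the weights have not seen the training data; as training drives the training loss toward zero while the test loss cannot follow, the gap grows:

```python
# Rough illustration of the stability-style view: the generalization gap
# (test loss minus train loss) is near zero at initialization and grows
# the longer we train, giving one continuous story from start to finish.
import numpy as np

rng = np.random.default_rng(2)
n, d = 20, 100
w_true = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)   # a little label noise
X_test = rng.normal(size=(1000, d))
y_test = X_test @ w_true

w = np.zeros(d)
lr = 0.01
for step in range(5001):
    if step % 1000 == 0:
        train = np.mean((X @ w - y) ** 2)
        test = np.mean((X_test @ w - y_test) ** 2)
        print(f"step {step:5d}  train {train:.3f}  test {test:.3f}  gap {test - train:.3f}")
    w -= lr * X.T @ (X @ w - y) / n
```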
[1]: https://arxiv.org/abs/2203.10036