At risk of embarrassing the rest of the thread, please note that the paper was released with an R package [0] which performs the polynomial regression algorithm, and additionally an entire chapter of the paper is dedicated to running this package against two other R packages which implement neural networks, with competitive results.
I don't know if it's at all embarrassing (though there are some HNers that definitely did not at least skim the paper, which is a whole other thing).
The examples they gave in the paper were all tiny: MNIST was the biggest with 784 features (Crossfit has <40, Forest has <60, Taxi has ~11), so this is an absurd comparison. You can get better results than all of the models presented, in much less time, using random kitchen sinks (also sometimes known as 'random features') [0].
Plus, it's a little weird to me that their polynomial regression "runs out of memory"—they're solving a least squares problem for which we have a ton of good algorithms for. More specifically, there's no need to generate the data all at once (e.g., why they mention the "polynomial matrix would be huge") much less run QR factorization on this super dense matrix... rather, you would do this in a streaming fashion (e.g., generate the points, one by one, and proceed via SGD, or a nice pre-conditioned version of it since we have a good approximation of the Hessian via some simple streaming dimensionality reduction).
At the risk of sounding a bit harsh, this paper reads like someone recently found out about least squares and tried a few of the bases until they realized that simple things do pretty well in small cases when compared to fancy models like NNs (which is something almost every researcher that I work with knows—though I don't work directly in ML).
Indeed, there is the unfortunate fact that directly pointing out that a particular person has not read the paper is against HN's rules. Let us not violate the Prime Directive further.
If I understand the origin of this paper correctly, it is mainly the idea and work of an undergraduate student. From this perspective, several problems become more understandable. The limited computation done in support of the paper may have been done by a setup as meager as a lone student's laptop. The naïve view of machine learning as a field could be explained by undergraduate eagerness; indeed, perhaps they did "recently [find] out about least squares".
I agree with your thoughts on SGD. The authors did do PCA for some of their tests, but this entire setup should be buildable in a more efficient and incremental way.
I think, however, that the reason that this paper got the other three co-authors on it, despite its controversial nature, is that the core of the original idea is interesting and has yielded related fruit like UMAP [0] in the past. We should be more comfortable exploring it.
I like your book reference. It's informative and direct, without mumbo-jumbo. To accelerate the lookup for the next follower, start at p279.
[0] https://github.com/matloff/polyreg