Hacker News
[dupe] Neural Networks Are Essentially Polynomial Regression (matloff.wordpress.com)
122 points by ibobev on Aug 30, 2019 | 40 comments



How and why did this end up on the homepage again?[a][b]

This is a silly claim, backed only by tiny toy experiments on tiny toy datasets with tiny toy neural networks. See links to the paper in [a] and [b] below.

Using the same logic as the authors of the paper, I could make the claim that neural nets are "essentially the same" as any universal approximator (GP's, SVM's, RF's, GLM's, etc.) and then run a few experiments only on tiny toy problems to back up my claim.

The authors should try to fit a polynomial regression model to cognitive tasks such as language translation, image generation, or modeling the game of Go... they won't get very far without running into the curse of dimensionality.

Please stop posting this on HN.

[a] https://news.ycombinator.com/item?id=17385383

[b] https://news.ycombinator.com/item?id=19145706


I came here to say this ⬆. Glad someone else did it for me, and better than I could have.


This feels overly sensationalist.

It's no surprise that in domains where low-parameter models work at all, they often work better than NNs. In econometric forecasting, where the traditional methods encapsulate more prior knowledge than could ever be inferred from the data, neural networks aren't anywhere near competitive. Even in domains where NNs are state-of-the-art, they might not be worth the training costs over GLMs or gradient-boosted forests or whatever.

But there are many other domains, like raw waveform modeling, large natural images (not MNIST), or NLP, where anything that isn't neural fails miserably. These are the domains where people are most excited about neural networks — problems that otherwise have no solution at all. Neural networks won't help you get a better interpretation of your n=25 daily-resolution cohort study, or predict user retention from your 300-column, 20K-row Excel spreadsheet.

In their paper, the authors dismiss most of NN's greatest recent successes as being "specialized" applications that only work because they incorporate tricks like recurrence and convolution, and not because they are neural. This is a much clearer, bolder claim, but they hardly support it as far as I can tell. The things that they call "specialized" (RNNs and CNNs!) are what most would recognize as the bread-and-butter of neural network design.


Right, the whole point is that neural networks let you apply tricks like convolution or recurrence to capture a lot of the structure in the model domain and exponentially reduce the search space for your model when dealing with large inputs like detailed images or waveforms.
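
For a rough sense of scale, here's a back-of-the-envelope sketch in Python (layer sizes are made up, not from the paper or this thread) comparing a dense layer on a modest image with a small convolutional layer that shares its weights across positions:

  # Hypothetical sizes, purely to illustrate the weight-sharing argument.
  H, W, C = 224, 224, 3         # input image: height, width, channels
  hidden = 1024                 # width of a dense hidden layer
  k, out_ch = 3, 64             # 3x3 conv kernel with 64 output channels

  dense_params = (H * W * C) * hidden + hidden    # weights + biases
  conv_params = (k * k * C) * out_ch + out_ch     # shared across all positions

  print(f"dense layer: {dense_params:,} parameters")  # ~154 million
  print(f"conv layer:  {conv_params:,} parameters")   # 1,792

The convolution also bakes in translation invariance, so the reduction isn't just in parameter count but in the space of hypotheses the model has to consider.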


Could one say, in a sense, that recurrence is essentially differential equations, and that convolution is essentially a more complicated operation than those of ordinary polynomial arithmetic?

This might sound like nonsense. On the one hand, most trivial convolutions use trivial operators; "polynomials" might include higher operations anyhow, or approximate some of the more important ones, none of which is appealing if simplicity equals efficiency. On the other hand, I never really understood diff-eqs; ODEs seem to me like polynomials over self-similar polynomials, hence "recurrent"; all the other diff-eqs I can't begin to fathom.


There are many perspectives on everything. Deep ConvNets, for instance, can be expressed as a continuously evolving ODE. Here's a fantastic paper on this view:

https://papers.nips.cc/paper/7892-neural-ordinary-differenti...
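
The core correspondence there is easy to sketch: a residual block computes x + f(x), which is one forward-Euler step of dx/dt = f(x). A toy numpy illustration (my own made-up f and sizes, just to show the identity, not the paper's code):

  import numpy as np

  rng = np.random.default_rng(0)
  W = rng.normal(size=(8, 8)) * 0.1     # toy weights for the residual branch

  def f(x):                             # the residual branch
      return np.tanh(W @ x)

  def resnet_block(x):                  # x_{t+1} = x_t + f(x_t)
      return x + f(x)

  def euler_ode(x, steps, dt=1.0):      # forward Euler on dx/dt = f(x)
      for _ in range(steps):
          x = x + dt * f(x)
      return x

  x0 = rng.normal(size=8)
  # One residual block is exactly one Euler step with dt = 1.
  print(np.allclose(resnet_block(x0), euler_ode(x0, steps=1)))  # True

The Neural ODE paper essentially takes the limit of many such blocks and hands the integration to an adaptive ODE solver.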


Yeah, that's a reasonable way of looking at it. Convolution would be a partial differential equation in this view.


This has to do with the translational symmetry of the input, which is what allows a convolutional neural network in the first place. But neural networks need not be convolutional.


At the beginning the authors claim they're not just making a restatement of the Stone-Weierstrass theorem (any continuous function on a compact set can be approximated arbitrarily well by a polynomial), but reading through, I'm not sure what they're proving besides that. In fact, on pages 6-7 they simply appeal to that theorem before stating "NNs can loosely be viewed as a form of polynomial regression". Most of the rest of the paper is comparing polyreg and NNs on various datasets. Sec. 9 doesn't have anything especially novel in it. For instance, in 9.3 they say they'll explore the causes of overfitting in the context of their "NN <-> PR" principle, but never actually do so...

Polynomial regression is nice because it's a little easier to interpret, and it's also a convex problem with a single, global minimum. OTOH you have to design features yourself, otherwise for high-dimensional problems polyreg quickly requires way too much memory and compute to solve in a reasonable amount of time.

This paper might have been more interesting if it had somehow connected neural nets to those facts, or if it had shown how insight into the way neural nets work can be used to improve polyreg. But (admittedly, from a brief reading) I don't see anything like that here.
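
To put a rough number on the memory/compute point: a full polynomial basis of degree <= k in d variables has C(d+k, k) terms, which blows up fast. A quick sketch with toy numbers of my own choosing (not figures from the paper):

  from math import comb

  # Number of monomial terms of total degree <= k in d input variables.
  def n_poly_terms(d, k):
      return comb(d + k, k)

  print(n_poly_terms(10, 2))    # 66
  print(n_poly_terms(100, 3))   # 176,851
  print(n_poly_terms(784, 3))   # ~81 million -- MNIST pixels at degree 3

That's why feature design (or some kind of term selection) is unavoidable for polyreg in high dimensions.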


I agree. All I see here is: "We know any reasonably smooth function can be approximated by a neural net, likewise it can be approximated by a polynomial. So they're the same!" Imagine how it's going to blow their minds when they find out about Fourier transforms or wavelets!

There are lots of ways to approximate functions; the property of NNs that makes them attractive for ML isn't the universal approximation theorem. It's that there's a fast, robust method of training them that's easy to implement, easy to vectorize, easy to parallelize, and easy to customize for different applications.
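
On the "easy to implement" point: a complete one-hidden-layer net with manual backprop and plain gradient descent fits in a screenful of numpy. A rough sketch on made-up toy data (untuned, just to show the shape of the method):

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.uniform(-1, 1, size=(200, 1))
  y = np.sin(3 * X)                                 # toy regression target

  # One hidden layer of 16 tanh units, squared-error loss.
  W1, b1 = rng.normal(size=(1, 16)) * 0.5, np.zeros(16)
  W2, b2 = rng.normal(size=(16, 1)) * 0.5, np.zeros(1)

  lr = 0.1
  for _ in range(2000):
      h = np.tanh(X @ W1 + b1)                      # forward pass
      pred = h @ W2 + b2
      err = pred - y                                # backward pass (chain rule)
      dW2, db2 = h.T @ err / len(X), err.mean(axis=0)
      dh = (err @ W2.T) * (1 - h ** 2)
      dW1, db1 = X.T @ dh / len(X), dh.mean(axis=0)
      W1 -= lr * dW1; b1 -= lr * db1                # plain gradient descent
      W2 -= lr * dW2; b2 -= lr * db2

  print(float(np.mean((pred - y) ** 2)))            # final training MSE

And because it's all matrix multiplies and elementwise ops, the same code vectorizes and parallelizes essentially for free.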


Isn't it rather the other way around, that regression methods informed the development of NNs?

I suppose the finding isn't novel. The contribution here might still be a didactic approach, demystifying NNs by restating a fundamental notion in familiar terms. For example, exposing implicitly (that is, leaving the insight to be a success for the learner) that "OTOH you have to design features yourself [for the polyreg, because ...]". Although, since "feature engineering" is a buzzword in NN development too, I still don't understand the difference. Indeed, the paper implies there is none except for the approach and terminology.

Essentially, they uphold that the traditional terminology should not be discarded in favour of the new, but rather understood in context. The part of understanding is left to the reader, of course.

Relatedly, in a linear algebra course, Terry Tao's one-pixel camera was credited with the subsequent ML resurgence, while we were otherwise talking about convolutions, Fourier syntheses, wavelets and the like. It's no secret that linear algebra is a cornerstone of ... just a cornerstone, and that abstract algebra, topology and the like lie very near; at least this lecturer made it a main concern of the course to get that across, which fit in nicely with a logic course on universal algebra that I took in parallel, while others taught rather monotonously towards Taylor series and Fourier transforms. At any rate, pretty much all professional researchers in the field stress that the mathematical basis needs to be understood.

PS: The paper is written with an undergrad, whatever that means; no offence, I really don't know the slang, much less how much study vs. research this implies. The publishing coauthor slash blog host shows some resentment against newfangled, fanciful terminology in the blog's About section, which might explain the scope of the paper, as well as the intended extent, as far as I outlined my impressions above.

Your criticism is not quite correct, insofar as the blog post states the conclusions of the paper explicitly, which seem to be about explaining common pitfalls in statistical terms.

PPS: Many observers lament that the results of NNs are intractable, nigh impossible to verify. This is a strong contrast to mathematical rigor. Hoping for the traditional methodology to get up to speed is in principle justified. I'm sure that your remark about design by hand being intractable holds as well, I'm just not sure to what extent. Showing that it can be done reasonably for some problems is a start, and chronicling that endeavor is par for the course, though perhaps not yet enough, I guess, as another comment asks for benchmarks.


This is an essentially meaningless finding. Neural networks had the dark days they did in the late 90's/early 2000's because computational power didn't allow for the depth they needed to be applicable to most problems. Once GPUs caught up with the method, deep neural nets thrived on some problems.

Those problems are precisely the ones that this paper doesn't evaluate (image recognition, text processing, etc.). In most other domains, tree-based models crush neural nets and statistical regression methods, and so this whole thing just ends up being an academic exercise.


It's not a finding. They never compare their results to literature benchmarks, just to the networks they've managed to estimate in R. They also say that, charitably, they've refrained from reporting the "worst results", but they fail to present a summary of the quality of results across the many networks they ran.


I agree, though they do make the pretty strong statement of

  NNAEPR suggests that one may abandon using NNs altogether, and simply use PR instead.
in the blog post, which sounds to me like they're going for a finding.


Important point:

> Our work so far has primarily been on general feedforward NNs. Our investigations have not yet involved much on specialized networks such as convolutional NNs (CNNs) for image classification, recurrent NNs (RNNs) for text processing, and so on. Though we intend to adapt our ideas to these frameworks, we view them as separate, orthogonal issues.


>Though we intend to adapt our ideas to these frameworks, we view them as separate, orthogonal issues.

The authors should read about dynamical systems. I don't understand why the physics/controls folks, DSP folks, and ML folks don't read each other's papers. They're using the same math and using the same concepts. The latter two groups seem to be working together more often however.


Their applications are tabular data (for which MLPs have never been the method of choice) and MNIST (which I could classify at 85% with a rusty nail), so it's not super impressive.

NNs and the associated toolkit shine with structured high-dimensional data where CNNs, RNNs, or modern shenanigans like Transformer networks excel. I sincerely doubt that these networks turn out to be reducible to polynomial regression in any practically useful sense of the notion. But who knows.


Terminology note: data like images and voice which have strong spatial or temporal patterns are actually referred to as "unstructured" data; while data you get from running "SELECT * FROM some_table" or the carefully designed variable of a clinical trial are referred to as "structured" data.

If this seems backwards to you (as it did to me at first), note that unstructured data can be captured raw from instruments like cameras and microphones, while structured data usually involves a programmer coding exactly what ends up in each variable.

As you say, deep neural networks based on CNNs are SOTA on unstructured image data, RNNs are SOTA on unstructured voice and text data, while tree models like random forests and boosted trees are usually SOTA on problems involving structured data. The reason seems to be that the inductive biases inherent to CNNs and RNNs, such as translation invariance, are a good fit for the natural structure of such data, while the strong ability of trees to find rules is well suited to data where every variable is cleanly and unambiguously coded.


Yeah, that's right. Doing too little proofreading with HN comments...


NNs can approximate any computable function. Any function can also be approximated by a polynomial. This is all proven math and has been known for a long time. I think of these as different facets of the same thing and each has different tools for computing, analyzing and understanding.


A hash table can also approximate any function, so the "universal function approximation" thing is a bit oversold. It isn't really what matters. What matters is how well methods generalize beyond the training data, and how much data they need to do this well.
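
A throwaway sketch of that point: a lookup table "approximates" the training set perfectly and says nothing useful off it (toy data, purely illustrative):

  import numpy as np

  rng = np.random.default_rng(0)
  x_train = rng.uniform(-1, 1, size=100)
  y_train = x_train ** 2                        # the "true" function

  table = dict(zip(x_train.tolist(), y_train.tolist()))

  def memorizer(x, default=0.0):
      return table.get(x, default)              # exact match or nothing

  x0 = float(x_train[0])
  print(memorizer(x0), x0 ** 2)                 # perfect on a training point
  print(memorizer(0.123456), 0.123456 ** 2)     # useless on a new input

Universal approximation tells you the model class is big enough; it says nothing about whether the fitted model behaves sensibly between (or beyond) the points it has seen.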


Is the relevant model here simply a sum of 2nd order terms?

It's widely known to practitioners that vanilla neural nets don't do very well on small datasets because of overfitting issues -- which can probably be corrected with very careful regularization and parameter optimization, but where it's generally best to just use simpler methods.

Polynomial regression is one of those, but I'm not sure how it compares to other methods such as random forests and KNN variations.

Those methods don't scale very well; in particular, PR (if I'm interpreting it correctly) doesn't scale, because depth is essential to solving complex problems with computational efficiency.

The researchers conveniently refer to those more complex tasks as a 'work in progress'... when they are what NNs are mostly used for (i.e. it starts getting interesting with CIFAR-10). That said, I would be curious to see performance of deeper polynomial regression networks. They become almost identical to NNs then, but maybe some methods of PR could be extended to give an edge in some cases (e.g. perhaps better regularization techniques than dropout/etc).
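
For reference, the degree-2 model in question is just least squares on expanded features; a few lines of scikit-learn reproduce the idea (my own toy setup, not the authors' R code):

  import numpy as np
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import PolynomialFeatures
  from sklearn.linear_model import LinearRegression

  rng = np.random.default_rng(0)
  X = rng.normal(size=(500, 5))                 # 5 tabular features
  y = X[:, 0] * X[:, 1] + X[:, 2] ** 2 + 0.1 * rng.normal(size=500)

  # Degree-2 polynomial regression: all terms x_i, x_i*x_j, x_i^2.
  model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
  model.fit(X, y)
  print(model.score(X, y))                      # R^2 close to 1 on this toy data

Stacking such expansions to get "deeper" polynomial models is exactly where the term count (and the resemblance to an NN) starts to explode.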


Previous discussion (June 24, 2018): https://news.ycombinator.com/item?id=17385383


Dismissing more interesting architectures, and ignoring the different approaches to fitting makes this seem very incomplete. I've long thought that neural networks are principally useful when the architecture can be chosen to exploit structural features of a problem - it hardly seems surprising that feed-forward neural networks with a given architecture have an alternate representation as polynomial regression. The focus on multicollinearity is also sort of odd here - I usually take it as a given that neural networks provide an over-complete basis which only works due to various kinds of regularization (and the use of optimization techniques like gradient-descent/back-propagation, which aren't really concerned with globally optimal parameter estimates, but rather in producing good fitted values/predictions).


> Our work so far has primarily been on general feedforward NNs. Our investigations have not yet involved much on specialized networks such as convolutional NNs (CNNs) for image classification, recurrent NNs (RNNs) for text processing, and so on. Though we intend to adapt our ideas to these frameworks, we view them as separate, orthogonal issues.

This is an interesting insight. Future work is needed, though. The vast majority of impressive results using NNs have been achieved by specialized networks, with special mention for resnets in vision [and Go] and transformers for NLP.


This is a... preprint. It's clearly not ready for prime time yet.

The main interesting finding for me was the utility of variance inflation factors in debugging neural networks. But not only does the thing lack polish (what is their metric for "accuracy"?), it's still quite "unscholarly", most noticeably lacking:

* serious comparisons with classic literature benchmarks (I'm not even saying state-of-the-art) of NNs rather than whatever they were able to find with R. (Who uses R in academic machine learning?)

* a better-developed argument about why Taylor is good enough. I'm not asking for rigor, but an article about reinforced concrete that just said what they said wouldn't be enough. An example of a workable engineering-type Taylor argument would be something that says all minima are locally like parabolas and h ≈ 0.01 is a very conservative upper bound for the needed approximation, because the material is almost 100 times as strong and we just need 10 times for this calculation.

* reproducibility, reproducibility, reproducibility. Don't R people have Jupyter notebooks? Does their Keras package not serialize network structure and weights? Keras for Python does.


> Don't R people have Jupyter notebooks?

Yes, though I prefer Rmarkdown documents.



I suppose it depends on the activation function, but many of them are bounded. Polynomials are unbounded at the extremes, and as such should really only be trusted (sometimes) for interpolation, not extrapolation. If you get a new input which is outside the radius of your training set, the answer could be pretty extreme and not supported by the training data.
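
A quick way to see that contrast (toy example, degree picked arbitrarily): fit a polynomial to samples of a bounded function and evaluate it just outside the training range.

  import numpy as np

  x = np.linspace(-1, 1, 50)
  y = np.tanh(3 * x)                     # bounded "ground truth"

  coeffs = np.polyfit(x, y, deg=9)       # degree-9 least-squares fit

  for x_new in (0.5, 1.5, 3.0):          # inside, just outside, far outside
      print(x_new, np.polyval(coeffs, x_new), np.tanh(3 * x_new))
      # the polynomial tracks tanh inside [-1, 1] and drifts away outside it

The tanh stays in [-1, 1] no matter how far out you go; the polynomial eventually follows its leading term off to infinity.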


> Polynomials are unbounded at the extremes, and as such should really only be trusted for interpolation, not extrapolation.

Very well noted. There's also this, which is usually taught in second-year engineering "numerical calculus" courses and studied again extensively in graduate programs involving [something numbers inside computers something]:

https://en.wikipedia.org/wiki/Runge's_phenomenon


It was exactly that topic that made me include the "sometimes" qualification :-)


ANNs aren't that much different from polynomial regression; the issue is that there are upwards of millions of parameters. Totally intractable for the human mind.


They are also almost exactly functional composition. Were it not for the activation terms, this composition would reduce to a simple linear equation.
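
That collapse is easy to check directly; a tiny numpy sketch with arbitrary sizes: without the nonlinearity, two stacked layers are literally one matrix.

  import numpy as np

  rng = np.random.default_rng(0)
  W1, W2 = rng.normal(size=(16, 8)), rng.normal(size=(4, 16))
  x = rng.normal(size=8)

  two_layers = W2 @ (W1 @ x)             # "deep" linear network
  one_layer = (W2 @ W1) @ x              # the same map as a single matrix

  print(np.allclose(two_layers, one_layer))             # True
  print(np.allclose(W2 @ np.tanh(W1 @ x), one_layer))   # False with tanh in between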


This isn't a serious ML result; it reads more like course notes/toy experiments than a serious research effort. The funny thing is, this is still only for tabular data; we don't see any actual classification results with proper robust metrics (also the PCA setup seems a bit bizarre in this setting).




I mean, both are ~general classes of functions, so it's not surprising they can represent each other?


A biological neural network does much more than our neural models do. How much more, and why? We don't even know.

Most neural models don't use spikes or dynamical systems, or some of the most intricate stuff going on in neurons.


I'm not an expert in ML. However, with what little I do know, I wonder how well "training" a neural network really mimics the brain.

Our brains grow as we mature and along the way reinforce pathways as we internalize information. Our interpretation of the world is internal and unique to our individual consciousness.

The part that is "trained" is how we relate that information externally to our environment and other entities.

A child says "Da da" because it learns that sound elicits a reaction. The internal representation of its father is not trained. It is where the neurons grew and reinforced.

Maybe poorly communicated on my part to people who know much more about it than me. Tl;dr: maybe focus less on perfecting the training of recognition and more on the training of mapping the external to what was naturally recognized where the neurons fell.


The so-called 'neural networks' have nothing to do with the brain. They're only called that as a branding exercise: math is scary and boring, while biology and brains are warm and fuzzy.



