Regularization is all you need: simple neural nets can excel on tabular data (arxiv.org)
216 points by tracyhenry on June 22, 2021 | 93 comments



This is interesting, but the paper still notes that in most "real-life" applications people will likely still prefer gradient-boosted trees, just because you need to allocate significant computation to hyperparameter tuning even in the case of the MLP-with-regularization-cocktail. For just getting something off the ground quickly based on tabular data, GBDTs are still unbeatable.


> GBDTs are still unbeatable.

You'd be surprised how many times I've replaced a GBDT with logistic regression and had a negligible drop-off in model performance, with a dramatic improvement in training time as well as in debugging and fixing production models.

I've had plenty of cases where a bit of reasonable feature transformation can get a logistic model to outperform a GBDT. Any non-linearity you're picking up with a GBDT can often easily be captured with some very simple feature tweaking.
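Roughly the kind of feature tweaking I mean, as a sketch (scikit-learn; the DataFrame and column names below are made up for illustration):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Simple tweaks: log-compress a heavy-tailed column and add an explicit
    # interaction term that a GBDT would otherwise have to learn via splits.
    df["log_income"] = np.log1p(df["income"])
    df["age_x_tenure"] = df["age"] * df["tenure_months"]

    X = df[["log_income", "age_x_tenure", "age", "tenure_months"]]
    y = df["churned"]

    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X, y)

A couple of transforms like these often recover most of the non-linearity while keeping the coefficients readable.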

My experience has been that GBDTs are only particularly useful in Kaggle contests, where minuscule improvements in an arbitrary metric are valuable and training time and model debugging are completely unimportant.

There are absolutely cases where NNs can go places that logistic regression can't touch (CV and NLP), but I have yet to see a real world production pipeline where GBDT provides enough improvement over Logistic Regression to justify throwing out all of the performance and engineering benefits of linear models.


I strongly agree with this. Not to mention parameter interpretability and, in the case of Bayesian models, uncertainty estimates and convergence diagnostics. Such things are very important when making decisions under uncertainty. Kaggle competitions and empirical benchmarks are very biased samples of model performance in real life.

I feel these two things have too much influence on the course of machine learning research and communities, and this is not good. Most ML researchers and practitioners are barely aware of the latest advances in parametric modelling, which is a shame. Multilevel models allow you to model response variables with explicit dependence structures. This is done through random (sometimes hierarchical) effects constrained by variance parameters. These parameters regularize the effects themselves and converge really well when fitting factors with high cardinality.

Also, multilevel models are very interesting when it comes to the bias-variance tradeoff. Having more levels in a distribution of random effects actually DECREASES [1] overfitting, which is fascinating.

[1] https://m-clark.github.io/posts/2019-05-14-shrinkage-in-mixe...


While I agree and it is surprising that multi-level/hierarchical modeling is rarely applied in industry (I used them extensively in academia and industry), dealing with hundreds or thousands of random effects in large data sets, especially in non-linear models, is a computational nightmare. And the benefits may not warrant those nightmares.


Finally multi-level/hierarchical modeling is starting to permeate industry thanks to Stan and company.

I use hierarchical modeling regularly to help build Zapier. So do other companies like Generable: https://www.generable.com/

I suspect hierarchical models will become the next “new” hot data structure in software engineering due to their ability to compact logic. https://twitter.com/statwonk/status/1363104221747421184?s=21


I don't know about permeating the industry. I know for example that the model that Airbnb used 3 years ago (things may have changed in the meantime) to forecast occupancy was a random-effects model maintained by a single person in Europe. I don't know about the penetrance of Generable and companies providing similar probabilistic modeling solutions, although I hope they succeed.

When I was working for one of the FAANGs, I was the only one using random effects models (that I know of), in particular non-linear random effects models with ~hundreds of random effects. I was using a language/tool faster than Stan (fitting the same model with Stan would have taken hours, or more likely days), but making the models converge was always challenging. In addition, since most of my colleagues had a CS background, were in love with the latest uninterpretable, brute-force algorithm, and were scared of a more statistical approach they made no effort to learn, I faced pushback and skepticism despite the model working very well.

I love random effects models, and I built my technical career on them.


I think one of the main reasons is that there is no good Python library for doing linear mixed effects models. There is no sklearn implementation. There are some libraries that wrap R's lmer (probably using rpy2 or something). The best native Python library I could find is statsmodels, and it has several shortcomings: saving a model to disk consumes hundreds of megabytes, the predict method is nearly useless because it only uses the fixed effects, multi-level structure beyond a single group is not clearly documented, and the syntax gets ugly if you really try it, never mind actually implementing a predict method that uses the random effects. I think once someone does a decent sklearn implementation, it might take off. I've been thinking of doing an implementation for sklearn as a side project, but I'm not an ML researcher, just a practitioner, so it might suck :)
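For what it's worth, here's a sketch of the workaround for the predict issue (statsmodels MixedLM with a random intercept per group; the DataFrames and column names are made up):

    import statsmodels.formula.api as smf

    # Random intercept per store; `train_df` / `new_df` are hypothetical frames.
    md = smf.mixedlm("sales ~ price + promo", data=train_df, groups=train_df["store_id"])
    res = md.fit()

    fixed_part = res.predict(new_df)   # statsmodels only gives you the fixed effects here

    # Add each row's estimated random intercept back in (0 for unseen groups).
    ranef = res.random_effects         # dict: group label -> Series of estimated effects
    offsets = new_df["store_id"].map(lambda g: ranef[g].iloc[0] if g in ranef else 0.0)
    y_hat = fixed_part + offsets

It works, but having to hand-roll this for anything beyond a random intercept is exactly the pain point.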


I used statsmodels for a while ... it's definitely possible to predict on arbitrary inputs, it's just a pain to fiddle the right inputs into it ...


>You'd be surprised how many times I've replaced a GBDT with logistic regression and had a negligible drop-off in model performance, with a dramatic improvement in training time as well as in debugging and fixing production models.

Not only reduced training time, but also less data needed for training, which is particularly important when training on time-series data for something that changes over time, as older data is less useful.


> I have yet to see a real world production pipeline where GBDT provides enough improvement over Logistic Regression

Not my field at all, so "I know nooothing".

Are GBDTs very different from "plain" binary decision trees? I've seen the latter a lot in the context of particle experiments[1][2][3].

[1]: https://arxiv.org/abs/physics/0408124

[2]: http://cds.cern.ch/record/2289251/

[3]: https://arxiv.org/abs/2002.02534


Very simply: plain decision trees usually overfit the training data (and, therefore, perform very badly out of sample). So the important part isn't the tree but the boosting: how you go from an ensemble of weak learners to something that works.

And this boosting generalises to any learner. You can apply it to regression too. Again, the boosting part is really the key. The innovation isn't a new technique either, it is just the aggressive application of computing power to these problems.
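Schematically, the boosting loop for squared-error regression looks something like this (toy sketch with sklearn trees and synthetic data, not a production implementation):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=500)

    trees, lr = [], 0.1
    pred = np.full_like(y, y.mean())
    for _ in range(100):
        residual = y - pred                           # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        trees.append(tree)
        pred += lr * tree.predict(X)                  # each weak learner nudges the ensemble

Each individual tree is deliberately weak (shallow); the ensemble and the small learning rate are what keep it from overfitting.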


They are the same concept under the hood, but a GBDT is an ensemble model that grows a number of trees in sequence, each new tree correcting the errors of the overall model so far.


Uhm, how do you deal with imbalanced data? Like, I mean 99:1 or something? I'm always worried about feature engineering - in the right hands it's great, but I'd posit that the majority of DSes out there do not have said hands. I'd much rather take a random forest with no manipulation and shittier (and hopefully less biased) results.


What are the sizes of the datasets? I have a hard time conceptualizing tabular business data large enough to be a problem.


consider the problem of "online advertising"


When you have billions of rows the performance savings can be nice.


One of my projects several years back ran both a LR model and a DNN against the same input data (albeit featurized differently). Accuracy, P&R were roughly the same (minor differences depending on the time horizon), but the LR model took maybe a half hour to train and five minutes to run; the DNN took about 24 hours to train and an hour or two to run.

This wasn't even particularly huge data compared to my other projects. But certainly at that scale, there are huge differences between regression & NNs.


An overlooked advantage of the MLP is that it is differentiable. Essentially, one trades the extra CPU for a classifier that one can pipe gradients through. That can be extremely useful in larger NN architectures.
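A sketch of what that buys you (PyTorch; the modules, shapes and data here are made up): a tabular MLP can be one branch of a larger network, and gradients from the downstream loss flow back through it, which you can't do with a GBDT in that slot.

    import torch
    import torch.nn as nn

    tab_mlp = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))
    image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 32))  # stand-in encoder
    head = nn.Linear(64, 1)

    x_tab = torch.randn(8, 16)
    x_img = torch.randn(8, 1, 28, 28)
    target = torch.randn(8, 1)

    z = torch.cat([tab_mlp(x_tab), image_encoder(x_img)], dim=1)
    loss = nn.functional.mse_loss(head(z), target)
    loss.backward()   # gradients reach the tabular MLP's weights as well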


Images -> correlation in space -> Deep convolution networks

Time series -> correlation in time -> Recurrent networks

Tabular data without clear correlation structure -> good old ML (ANN, SVM, DT, LR, KNN).

This is obvious when following the field since 2006 or so. Deep Convolutional Networks were considered a special case for data with local correlations at a hierarchy of spatial scales. Same for RNNs in time, although they came much later (when was the LSTM rediscovery again? 2016?)

For most data without clear spatial or temporal structure to exploit, the good old ML techniques will work just fine.


Not only will they work just fine, they'll save you compute costs and training time, and lower your need for data.


It's not necessarily true that MLPs are very compute-expensive. It's at most a couple of layers, and if your input is sparse (categorical) you can gain even more. For some problems it can be the fastest decent non-linear model, in my experience.
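For example, something like this stays cheap (sketch with made-up column names; sklearn's MLPClassifier accepts the sparse matrix directly):

    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import OneHotEncoder

    enc = OneHotEncoder(handle_unknown="ignore")     # returns a sparse matrix by default
    X_sparse = enc.fit_transform(df[["country", "device", "campaign_id"]])

    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=50)
    clf.fit(X_sparse, df["clicked"])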


I don’t think that was the claim… MLP/ANNs are fine except for the difficulties around interpretability. DTs and LR are preferable on that front. Or an SVM if you know a kernel/similarity metric that kills it in your data.


Not to mention the shedloads of “X is all you need” papers which you can ignore, because Bishop’s “Pattern Recognition and ML” book is actually all you need (plus perhaps a good reference on linear algebra).


It is true that MLPs are classic, but the regularization methods that apparently make a big empirical difference in this paper are newer concepts (data augmentation, skip connections/residual blocks, dropout, batch norm, lookahead, stochastic weight averaging, etc.). They compare against a good old MLP without the bells and whistles in Table 2, and the classic MLP is quite a poor performer (XGBoost beats a classical MLP very significantly). Which leads to the conclusion that we need all these recent deep learning advances in regularization techniques to make the difference.
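To make that concrete, a few of the cocktail's ingredients look roughly like this on a plain MLP (PyTorch sketch; this is illustrative, not the authors' exact configuration):

    import torch
    import torch.nn as nn
    from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

    class ResidualBlock(nn.Module):
        def __init__(self, width, p_drop=0.2):
            super().__init__()
            self.block = nn.Sequential(
                nn.Linear(width, width), nn.BatchNorm1d(width),
                nn.ReLU(), nn.Dropout(p_drop),
            )

        def forward(self, x):
            return x + self.block(x)          # skip connection

    def make_mlp(n_features, width=256, depth=4, n_classes=2):
        layers = [nn.Linear(n_features, width)]
        layers += [ResidualBlock(width) for _ in range(depth)]
        layers += [nn.Linear(width, n_classes)]
        return nn.Sequential(*layers)

    model = make_mlp(n_features=30)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)  # weight decay
    swa_model, swa_sched = AveragedModel(model), SWALR(opt, swa_lr=5e-4)     # SWA

    # Inside the training loop, after some warm-up epochs:
    #     swa_model.update_parameters(model); swa_sched.step()
    # and before evaluation: update_bn(train_loader, swa_model)   # train_loader is hypothetical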


From the paper: Tabular datasets are the last "unconquered castle" for deep learning, with traditional ML methods like Gradient-Boosted Decision Trees still performing strongly even against recent specialized neural architectures.

My, but this statement seems more than a little grandiose.

Never mind that XGBoost still does well on a substantial portion of ML challenges (supposedly). The bigger problem is that there's a confusion of maps and territories in this way of talking about machine learning. The field of ML has made a certain level of palpable progress by creating a number of challenges and benchmarks and then doing well on them. But success on a benchmark isn't necessarily the same as success at the "task" broadly. An NLP test doesn't imply mastering real language, a driving benchmark doesn't imply mastery of real-road driving, etc. Notably, success on a benchmark also "isn't nothing". In a situation like the game of go, the possibilities can be fully captured "in the lab", and success at tests indeed became success against humans. But with driving or language, things are much more complicated.

What I would say is that benchmark success seems to produce at least a situation where the machine can achieve human-like performance for some neighborhood (or tile, etc.) limited in time, space and subject. Of course, driving is the poster child for the limitations of "works most of the time", but a lot of "intelligent" activities require an ongoing thread of "reasonableness" aside from having an immediate logic.

Anyway, it would be nice if our ML folks looked at this stuff more as a beginning than as a situation where they're already poised for success.


The paper is missing a control: how good is this 'cocktail of regularization' when applied to traditional methods like XGBoost?

At best you can claim the result here that neural networks with regularization methods can beat traditional methods without it, but to be apples to apples both methods must have access to the same 'cocktail of regularization'.


From the paper:

> This paper is the first to provide compelling evidence that well-regularized neural networks (even simple MLPs!) indeed surpass the current state-of-the-art models in tabular datasets, including recent neural network architectures and GBDT (Section 6).

> Next, we analyze the empirical significance of our well-regularized MLPs against the GBDT implementations in Figure 2b. The results show that our MLPs outperform both GBDT variants (XGBoost and auto-sklearn) with a statistically significant margin.

They test against XGBoost, GBDT Auto-sklearn, and others. Did you read the paper?


> They test against XGBoost, GBDT Auto-sklearn, and others. Did you read the paper?

Yes. Did you read my comment?

They compare NN + Cocktail vs. vanilla XGB. They don't compare NN + Cocktail vs. XGB + Cocktail.

To make it crystal clear, if I wrote a paper "existing medicine A enhanced with novel method B is more effective than existing medicine C" and I did not include the control "C + B" (assuming if relevant, which is the case here), that'd be bad science. It's very much possible that novel method B is doing the heavy lifting and A isn't all that relevant. s/A/NN, s/B/Cocktail, s/C/XGBoost.


How would you even apply layer normalization or SWA to XGB? These methods are neural net specific.


GBDTs have their own set of hyperparameters, such as learning rate, number of trees, min samples per bin, L0, L1, etc. So you could definitely also create an appropriate cocktail to optimize over, although GBDTs are typically more robust with respect to hyperparameters.
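A comparable sweep on the GBDT side might look like this (XGBoost with scikit-learn's random search; the ranges are illustrative only):

    from scipy.stats import loguniform, randint
    from sklearn.model_selection import RandomizedSearchCV
    from xgboost import XGBClassifier

    param_space = {
        "learning_rate": loguniform(1e-3, 0.3),
        "n_estimators": randint(100, 1000),
        "max_depth": randint(2, 10),
        "min_child_weight": loguniform(0.1, 20),
        "subsample": [0.5, 0.7, 0.9, 1.0],
        "reg_alpha": loguniform(1e-8, 10),     # L1 penalty
        "reg_lambda": loguniform(1e-8, 10),    # L2 penalty
    }
    search = RandomizedSearchCV(XGBClassifier(), param_space, n_iter=200, cv=5)
    # search.fit(X_train, y_train)   # X_train / y_train stand in for whatever tabular data you have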


The authors do claim to do a hyperparameter sweep but only for vanilla XGB hyperparams.


The old "my method (with as much optimization as I could get away with) beats the other method (with as little optimization as I could get away with)"


Yup. The Least Publishable Unit strikes again.


There is nothing neural-network-specific about batch normalization if you use it on the input layer. I don't think it matters for a tree algorithm like XGBoost either way, though.

SWA is pretty NN specific. So leave it out for XGB. There's a bunch that are relevant, and they could be very important.


Batch norm has an advantage for iterative methods on mini-batches, while XGB uses the full training set. Using batch norm on the full training set is equivalent to Z-normalizing the features, which has no effect at all for XGB, as the scale of the features plays no role in the split decisions at the tree nodes. Apart from a few non-parametric data augmentation methods (note that adversarial augmentation is also NN-specific), I do not think any other regularization used in that paper can be directly/intuitively applied to XGB.
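A quick way to convince yourself of the scale-invariance part (toy check with a single sklearn tree on synthetic data; the same reasoning carries over to boosted trees):

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5)) * [1, 10, 100, 0.01, 5]   # wildly different scales
    y = (X[:, 0] + 0.1 * X[:, 2] > 0).astype(int)

    X_scaled = StandardScaler().fit_transform(X)
    raw = DecisionTreeClassifier(random_state=0).fit(X, y)
    scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

    print((raw.predict(X) == scaled.predict(X_scaled)).all())   # expected: True

Splits only depend on the ordering of values within each feature, so any per-feature monotonic rescaling leaves the learned partition unchanged.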


I don't think every single regularization method in the cocktail can be applied to non-neural-network methods, but I'm pretty sure some of them can, like data augmentation. The authors could have figured out which methods can be applied to non-NN models or considered if equivalent/analogous methods exist. I agree it would make a fairer comparison.


I agree with your point, e.g. data augmentation can be added, but that's pretty much it. All the other regularization techniques they use are neural-network-specific and cannot be applied to gradient-boosted trees. What I find particularly striking about this paper is that their method trains a single neural network which outperforms an ensemble of decision trees (XGBoost). Asking for a perfect apples-to-apples comparison would also mean comparing an ensemble of MLPs vs. XGBoost. In this context, at least, the message here is that XGBoost and/or other gradient-boosting methods are no longer a silver bullet for tabular datasets. Boosting for trees was great at reducing both bias and variance, but apparently neural networks can achieve the same effect with high capacity (low bias) and a mix of modern regularization techniques (low variance).


This paper compares XGBoost vs. neural nets and an ensemble: [Tabular Data: Deep Learning is Not All You Need] https://arxiv.org/abs/2106.03253


A lot of these papers can be titled "Tuning hyperparameters on the evaluation dataset is all you need." I see a few cases in this paper.


Per the paper, 37.5% of the regularization regimes were Batch Norm + Weight Decay. According to other findings, Batch Norm and Weight decay have an interesting interaction that results in a learning rate adjustment, not regularization:

https://blog.janestreet.com/l2-regularization-and-batch-norm...
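The rough argument, as I understand it (sketch of the mechanism, not a derivation from the linked post or the paper): batch norm makes a layer's output invariant to rescaling its weights,

    f(\alpha w, x) = f(w, x)
      \;\Rightarrow\; \nabla_w L \perp w, \qquad
      \|\nabla_{\alpha w} L\| = \tfrac{1}{\alpha}\,\|\nabla_w L\|

so an SGD step rotates the direction of w with an effective step size of roughly \eta_{\mathrm{eff}} \approx \eta / \|w\|^2. Weight decay can't change the represented function directly; it only shrinks \|w\|, which raises \eta_{\mathrm{eff}}.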


If you want to use NNs on tabular data look up the work that's being done on point clouds. They both share the same major symmetry over permutations.


They don't? One is a row in a table (R^n, where the order of the columns matters) and the other is a set of points (in R^(n x d)).

If you consider the whole table, any dataSET is like a point cloud.


The table is the point cloud; the row is the point. The symmetry is permutations of rows or permutations of points.


In which case would you use the whole table?


Well yeah, there's no permutation symmetry between columns. (Unless there is...)


Then, as the parent said, almost all datasets are like point clouds, except time-series datasets.


Yeah, they are.


Oh that's interesting. Can you say more? Haven't seen much relating the two topics before.


A row in a table of data is a point in R^n. I'm not sure how much there is to write about it other than to say, that's a point cloud.


what's the intuition for why that might be desirable? I can sort of see that you might care to consider the relation between a given row and other rows (not dissimilar to something like kernel methods) and then you can use something like Deep Sets[1] to featurize the data?

[1] https://arxiv.org/abs/1703.06114


I think the way it works is, you have one network that produces global permutation-invariant (maintained so by training loss) metrics and another that recognizes based on those metrics. The big prior you're putting in is that the order of the points doesn't matter. Relationships between points do matter but only in a permutation-invariant way. I would recommend reading the literature because of course, it's not my idea. :)
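In code, that structure is roughly a Deep-Sets-style model (PyTorch sketch; the sizes are arbitrary): a per-point encoder, a permutation-invariant sum over points, and a second network on the pooled representation.

    import torch
    import torch.nn as nn

    class DeepSets(nn.Module):
        def __init__(self, d_in, d_hidden=64, d_out=1):
            super().__init__()
            self.phi = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                                     nn.Linear(d_hidden, d_hidden))
            self.rho = nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.ReLU(),
                                     nn.Linear(d_hidden, d_out))

        def forward(self, x):                        # x: (batch, n_points, d_in)
            return self.rho(self.phi(x).sum(dim=1))  # sum makes it permutation-invariant

    model = DeepSets(d_in=3)
    points = torch.randn(2, 100, 3)
    shuffled = points[:, torch.randperm(100)]
    print(torch.allclose(model(points), model(shuffled)))   # expected: True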


No, that's a point.


The table's the point cloud. I guess if you want to be really pedantic about grammar then I should point out that a set of one item is still a set. :-)


True - I misread your post. Your first post was intriguing but the second was dismissive. Surely there's more to be said than data sets are point clouds? Images are points in R^N too, right?


If your unit of recognition was a set of images, and not an image alone, then you would have permutation symmetry and want to use point cloud techniques to design your first layer. So yes images are points in R^N.


Oh thanks, this was not clear from your other posts: the whole table is used as the data point, not the row. Much clearer now why you would compare this to point clouds.


Which work do you suggest?


Would you please stop it with the "$whatever is all you need"?


Isn’t it a reference to the “Attention is All You Need” machine learning paper?

https://arxiv.org/abs/1706.03762


Yes, and this is the fourth or fifth paper I see copying that format. The attention paper was pretty damn significant. This won’t be, because it just shows that hyperparameters are important. We know that.


"All you need considered harmful"


"The Unreasonable Effectiveness of All You Need Is Considered Harmful"

-- All interpretations worth considering...


"The Unreasonable Effectiveness of All You Need Is Considered Harmful for Fun and Profit"


FANG doesn't want you to know how this one weird trick about the unreasonable effectiveness of all you need considered harmful for fun and profit and you will never guess what happened next.


We live in a society where FANG doesn't want you to know how this one weird trick about the unreasonable effectiveness of all you need considered harmful for fun and profit and you will never guess what happened next in my social experiment gone wrong.



I often interpret that pattern as a reference to the lyric "love is all you need" by the Beatles (e.g., an attempt at being playful and relatable), but that may be my musical bias.

Either way, totally agree. Overused, almost always incorrect, and easily misconstrued especially by people who don't speak English as their first language.


What if the paper which finally gives us The Singularity is titled "All you need is all you need?"


If the paper had not used this clickbait title, it would probably have gone unnoticed and no one would have paid attention. Sad truth about our sensation-oriented modern lives :(


Sometimes it feels like paper authors treat their audience as children.


to be fair, many of the tasks of machine learning are things children do, like telling if something is a dog, coloring in a picture, solving a maze, etc.


The field is certainly moving in a direction of infantilization due to the social media attention economy. Your paper competes with cute flashy cat gifs etc. on Twitter. And Twitter is ridiculously important in AI/ML, except perhaps above 50-60 years of age.


A little whimsy doesn't hurt... Nobody said science and research had to be dry and witless.


Communication is based on the lowest common denominator. The best communication can be understood by a wide range of audiences. Have you considered that you weren't the intended audience?


I don't know about that. It's probably driven by the need for a short, attention*-gathering title.

* I see what I did there.


It's such a cheap trick too, basically clickbait.


But it works, for example it got to the top of HN.


I strongly agree with this paper. Regularizing a NN is the key to better performance. But first, one needs to know what exactly to regularize. I don't think trial-and-error hyperparameter bingo should be the way to go. We need better insight into and understanding of these networks: analyze their structure and find out what exactly is wrong with it. Then a TARGETED regularization (layerwise, or maybe per neuron) has huge potential to let very simple networks perform extremely well. I even suggest that adaptive regularization (on/off/strength) should be researched even more. It is not necessary that a network be regularized the same way all the time.
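For what it's worth, per-layer regularization is already easy to express; the hard part is knowing what the per-layer settings should be. A PyTorch sketch with per-layer weight decay via optimizer parameter groups (the split and values below are purely illustrative):

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(30, 256), nn.ReLU(),
                          nn.Linear(256, 256), nn.ReLU(),
                          nn.Linear(256, 2))

    param_groups = [
        {"params": model[0].parameters(), "weight_decay": 1e-4},  # gentle on the input layer
        {"params": model[2].parameters(), "weight_decay": 1e-2},  # heavier in the middle
        {"params": model[4].parameters(), "weight_decay": 0.0},   # none on the output layer
    ]
    optimizer = torch.optim.AdamW(param_groups, lr=1e-3)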


Any references/recommendations on the best practices to analyze a network, weights, etc.?


Anyone who finds this paper interesting should read "Self-Attention Between Datapoints", which shows non-parametric transformers often beat GBDTs on tabular data.

https://www.google.com/url?sa=t&source=web&rct=j&url=https:/...


Short description: Overfit is All you need


> As a result, we propose regularizing plain Multilayer Perceptron (MLP) networks by searching for the optimal combination/cocktail of 13 regularization techniques for each dataset using a joint optimization over the decision on which regularizers to apply and their subsidiary hyperparameters.

Pardon? https://youtu.be/RXJKdh1KZ0w


Ridiculous clickbait title. How the heck did this neural network "can" Excel wrt. this tabular dataset? What's the underlying objective and the evaluation benchmark? And what feature of Excel was this tested against in the first place?

I suppose the follow up will be titled "One Weird Trick is All You Need To Destroy SOTA On This Dataset!"


Not Excel: the spreadsheet software, Excel: the verb.

They aren't saying that they "canned" Excel (the software), they are saying that neural nets have the potential to perform well on tasks involving tabular data that are traditionally performed by other ML techniques.


I, for one, very much appreciate the comedy of this post.


This makes no sense. Neural nets already tear through tabular data like butter. We're literally dropping out half of the learned parameters. This points to the fundamental inefficiency of the approach. Better regularization is the answer, but it's not about round-robin of existing techniques.


If you spend the same amount of time on XGBoost and LightGBM looking for hyperparameters and build trees with a small enough eta, will you achieve the same performance? That's not answered in the paper, and if I were a reviewer I would ask them to compare the time spent.


After reading the paper, I believe they address both aspects you mention: i) they give XGBoost and all the baselines up to 4 CPU days of hyperparameter optimization time on 20-core CPU servers per dataset, the same compute as the authors' proposed method; ii) the search space for the XGBoost hyperparameters includes low eta (starting from 0.001) and large num_round ranges (up to 1000 trees) in Table 5 in the appendix.


Attention, regularization, someone tell me what I really need.


Massive hyperparameter sweeps without any cross validation are the perfect way to overfit holdout sets on small tabular datasets.


But they evaluate on a test set while sweeping on a dev set, so there is no risk of overfitting to the test set.


To piggyback on the topic of tabular data, has anyone experienced transfer learning on tabular data in their work or research?


I see what you did there.




