The conclusion, that a low-complexity statistical ensemble is almost as good as a (computationally) complex Deep Learning model, should not come as a surprise, given the data.
The dataset[1] used here is 3003 time series from the M3 competition run by the International Journal of Forecasting. Almost all of these are sampled at the yearly, quarterly or monthly frequency, each with typically 40 to 120 observations ("samples" in Machine Learning lingo), and the task is to forecast a few months/quarters/years out of sample. Most experienced Machine Learners will realize that there is probably limited value in fitting a high-complexity n-layer Deep Learning model to 120 data points to try to predict the next 12. If you have daily or intraday (hourly/minutely/secondly) time series, more complex models might become more worthwhile, but such series are barely represented in the dataset.
To me the most surprising result was how badly AutoARIMA performed. Seasonal ARIMA was one of the traditional go-to methods for this kind of data.
If the task is to predict the next 12 values from a sample of 120 previous values, drawn from some computationally simple statistical process, it's much cheaper and easier to use old-fashioned, tried-and-true statistical methods.
If the task is to predict millions of pixel values that make up an original work of art, or the pixel values over time that make up a deep-fake video, or the next set of values encoding the next best possible play in a game of Go, or the set of values that encode the entire structure of a protein, and so on, then you have no choice: You must use a deep neural network. Simple methods cannot do any of that.
Sure, with more data and a high signal you might have a substantial advantage with DL. But lots of data is low signal-to-noise, which again favours parametric models.
Something that bothers me about the ML literature is that they frequently present a large number of evaluation results, such as precision and AUC, but these are not qualified by error bars. Typically they make a table with different algorithms on one side and different problems on the other, and the highest score for a given problem gets bolded.
I know that if you did the experiment over and over again with different splits you'd get slightly different scores, so I'd like to see some guidance as to significance in terms of ① statistical significance, and ② whether it is significant on a business level. Would customers notice the difference? Would it enable better decisions that move the needle for revenue or other business metrics?
This study is an example where a drastically more expensive algorithm seems to produce a practically insignificant improvement.
This is one of my default suggestions when I act as a reviewer: t-test with Bonferroni correction, please. ML, ironically, has absolutely horrible practices in terms of distinguishing signal from noise (which at least is partially offset by the social pressure to share code, but still).
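Concretely, something like this is all I'm asking for. It's only a sketch; the score arrays below are made up, and in practice they'd come from repeated runs over different splits/seeds:

    import numpy as np
    from scipy import stats

    # Per-split (or per-seed) scores for a proposed method and two baselines.
    # These arrays are illustrative; in a paper they would come from k repeated
    # runs with different splits/seeds.
    proposed   = np.array([0.871, 0.868, 0.874, 0.869, 0.872])
    baseline_a = np.array([0.861, 0.866, 0.860, 0.865, 0.863])
    baseline_b = np.array([0.869, 0.867, 0.873, 0.870, 0.868])

    baselines = {"baseline_a": baseline_a, "baseline_b": baseline_b}
    m = len(baselines)  # number of comparisons for the Bonferroni correction

    for name, scores in baselines.items():
        t, p = stats.ttest_rel(proposed, scores)   # paired t-test over splits
        p_adj = min(1.0, p * m)                    # Bonferroni-adjusted p-value
        print(f"{name}: t={t:.2f}, raw p={p:.4f}, Bonferroni p={p_adj:.4f}")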
Bonferroni correction on hold-out data is an excellent suggestion. To adapt it to time series forecasting, one could perform temporal cross-validation with rolling windows and track the variance of performance through time.
Unfortunately, the computational time would explode if the ML method's optimization is performed naively. Precise measurements of the statistical significance would crowd out researchers except for Big Tech.
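Something like this rolling-origin setup is what I have in mind. It's only a sketch: `fit_and_forecast` is a stand-in for whatever method is being evaluated, and the seasonal-naive forecaster at the bottom is just for illustration.

    import numpy as np

    def rolling_origin_scores(y, horizon=12, initial=60, step=12, fit_and_forecast=None):
        """Evaluate a forecaster on successive rolling windows of a single series.

        y: 1-D array of observations; fit_and_forecast(train, horizon) -> forecasts.
        Returns one error (here MAE) per window, so you can look at the spread,
        not just the mean.
        """
        errors = []
        for end in range(initial, len(y) - horizon + 1, step):
            train, test = y[:end], y[end:end + horizon]
            forecast = fit_and_forecast(train, horizon)
            errors.append(np.mean(np.abs(test - forecast)))
        return np.array(errors)

    # Toy usage with a seasonal-naive forecaster (repeat the last 12 observations).
    rng = np.random.default_rng(0)
    y = 100 + 10 * np.sin(np.arange(120) * 2 * np.pi / 12) + rng.normal(0, 2, 120)
    snaive = lambda train, h: np.tile(train[-12:], (h + 11) // 12)[:h]
    scores = rolling_origin_scores(y, fit_and_forecast=snaive)
    print(scores.mean(), scores.std())  # mean MAE and its spread across windows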
What would be a better method for machine learning folks to take? As a sincere curiosity / desire to learn, not meant as a rhetorical implication that I disagree.
I interpret your criticism to mean that ML folks tend to re-use a test set multiple times without worrying that doing so reduces the meaning of the results. If that's what you mean, then I do agree.
Informally, some researchers are aware of this and aim to use a separate validation data set for all parameter tuning, and would like to use a held out test set as few times as possible — ideally just once. But it gets more complicated than that because, for example, different subsets of the data may not really be independent samples from the run-time distribution (example: data points = medical data about patients who lived or died, but only from three hospitals; the model can learn about different success rates per hospital successfully but it would not generalize to other hospitals). In other words, there are a lot of subtle ways in which a held out test set can result in overconfidence, and I always like to learn of better ways to resist that overconfidence.
Ben Recht actually has a line of work showing that we aren't overfitting the validation/test sets for now (amazingly...). What I mean is, by chasing higher and higher SotA with more and more money and compute, whole fields can go on "improving" only for papers like https://arxiv.org/abs/2003.08505 or "Implementation Matters in Deep RL" to come out and show that what's going on is different from the literature consensus. The standards for showing improvement are low, while the standards for negative results are high (I'm a bit biased because I have a rejected paper trying to show empirically that some deep RL work didn't add marginal value, but I think the case still holds). Everyone involved is trying their best to do good science, but unless someone like me asks for it, there simply isn't a career value-add in doing exhaustive checking.
A concrete improvement would be only allowing one change at a time per paper, and measuring the impact of changing that one thing. But then you couldn't realistically publish anything outside of megacorps. Another solution might be banning corporate papers, or at least making a separate track... From reviewing papers, it seems like single authors or small teams in academia need to compete with Google, where multiple teams might share aspects of a project (one doing the architecture, another a new training algorithm, etc.) that won't be disclosed; you'll just read a paper where, for some reason, a novel architecture is introduced using a baseline which is a bit exotic but also used in another paper that came out close to this one, and a regulariser which was introduced just before that...
If you limit the pools, you can put much higher standards on experiments on the corporate side, where the budget exists, while giving academia more points for novelty and creativity.
Thanks! I hadn't previously thought about the internal boost that large well-sponsored (aka corporate) teams get this way. It seems worth being aware of. I'm certainly in favor of encouraging & including researchers who work in smaller teams or with less funding.
Question: why do we care about the Bonferroni correction if the model being reviewed shows high performance on holdout/test samples?
I mean, it's nice to know that the p-values of coefficients on models you are submitting for publication are appropriately reported under the conservative approach Bonferroni applies, but I would think making it a _default_ is an inappropriate forcing function when the performance on holdout is more appropriate. Data leakage would be a much, much larger concern IMHO. Variance of the performance metrics is also important.
Because the variance can be uniformly high, making it difficult to properly judge the improvement of one method vs the baseline method: did you actually improve, or did you just get a few lucky seeds? It's much harder to get a paper debunking new "SotA" methods so I default to showing a clear improvement over a good baseline. Simply looking at the performance is also not enough because a task can look impressive, but be actually quite simple (and vice versa), so using these statistical measures makes it easy to distinguish good models on hard tasks from bad models on easy tasks.
I should also note 1) this is about testing whether the performance of a model is meaningfully different from another, not the coefficient of the models 2) I don't reject papers just because they lack this, or if they fail to achieve a statistical significance, I just want it in the paper so the reader can use that to judge (and it also helps suss out cherry picked results)
You'd want to do some sort of test because it can help assess whether your method did better than the alternatives by chance. For example can you really say Method A is better than B if A got 88% accuracy on the holdout set and B got 86% accuracy? Would that be true of all possible datasets?
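For instance, a paired-bootstrap sketch. The per-example correctness vectors are simulated here; in practice they would be the actual predictions of A and B on the shared holdout set:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000  # holdout size

    # Per-example correctness on the same holdout set (simulated: A ~88%, B ~86%).
    correct_a = rng.random(n) < 0.88
    correct_b = rng.random(n) < 0.86

    # Paired bootstrap over examples: how often is A's accuracy really higher?
    boot_diffs = []
    for _ in range(10_000):
        idx = rng.integers(0, n, n)
        boot_diffs.append(correct_a[idx].mean() - correct_b[idx].mean())
    boot_diffs = np.array(boot_diffs)

    lo, hi = np.percentile(boot_diffs, [2.5, 97.5])
    print(f"accuracy difference: {correct_a.mean() - correct_b.mean():.3f}, "
          f"95% CI [{lo:.3f}, {hi:.3f}]")  # if the CI straddles 0, the gap may be noise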
t-test with Bonferroni isn't necessarily the best test for all metrics either.
The test sample is just a small, arbitrary sample from a universe of similar data.
You (probably) don't care about test-set performance per se but instead want to be able to claim that one model works better _in general_ than another. For that, you need to bust out the tools of statistical inference.
The test set allows you to make this claim if it is representative of the universe of novel data the model will run on and there is no data spoilage between test and train.
This isn't always true (especially in aggregate series over time, of course), and of course statistical measures are used to report model performance. A Bonferroni correction struck me as a weird place to apply this specifically, but after the other comments from yesterday I saw where they were taking it.
Every researcher would love to include error bars but it's a matter of limited computing resources at universities. Unless you're training on a tiny dataset like MNIST, these training runs get expensive. Also, unless you parallelize from the start and risk wasting a lot of resources if something goes wrong, it could take longer to get the results.
Simple formulas only work because the models themselves for those polls are incredibly simple and adding a bit more complexity requires a lot of tools to compute these uncertainties (this is part of the reason you see probabilistic programming so popular for people doing non-trivial polling work).
There are no simple approximations for a range of even slightly complex models. Even some nice computational tricks like the Laplace approximation don't work on models with huge numbers of parameters (since you need the Hessian, or at least a workable approximation to it).
A good overview of the situation is covered in Efron & Hastie's "Computer Age Statistical Inference".
In the Machine Learning literature, the variance of accuracy measurements originates from different network parameter initializations.
Since the deep learning ensembles already use aggregate computation in the hundreds of days, computing the variance would elevate the computational time into thousands of days.
In contrast, statistical methods that we report optimize convex objectives; their optimal parameters are deterministic.
That being said, we like the idea of including cross-validation with different splits for future experiments.
The test sets are large enough to render this moot, as the confidence intervals are almost certainly smaller than the precisions typically reported, i.e. 0.1%.
I've worked on commercial systems where N <= 10,000 in the evaluation set, and the confidence interval there is probably not as good as 0.1%. For instance, there is a lot of work on this data set (which we used to tune up a search engine), and sometimes it is as bad as N=50 queries with judgements. I don't see papers that are part of TREC or based on TREC data dealing with sampling errors in any systematic way.
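To put rough numbers on that, a back-of-the-envelope normal-approximation interval for a proportion-style metric (my own illustration, not tied to any particular TREC track):

    import math

    def ci_half_width(p, n, z=1.96):
        # Normal-approximation 95% interval half-width for a proportion metric.
        return z * math.sqrt(p * (1 - p) / n)

    for n in (50, 1_000, 10_000, 1_000_000):
        print(f"N={n:>9}: +/- {100 * ci_half_width(0.8, n):.2f} percentage points")
    # N=50 gives roughly +/- 11 points; you need N on the order of a million
    # before the interval shrinks to the ~0.1% precision papers like to report.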
NIST's TREC workshop series uses Cyril Cleverdon's methodology (the "Cranfield paradigm") from the 1960s, and more could surely be done on the evaluation front:
- systematically addressing sampling error;
- more than 50 queries;
- more/all QRELs;
- full evaluation instead of system pooling;
- studying IR beyond just the English language (this has been picked up by CLEF and NTCIR in Europe and Japan, respectively);
- metrics that take energy efficiency into account;
- ...
At the same time, we have to be very grateful to NIST/TREC for executing an international (open) benchmark annually, which has moved the field forward a lot in the last 25 years.
I've done some work in this area and have indeed found that simpler statistical models often outperform ML/DL.
This is true for single time series, where we are predicting P(x_{t+1} | x_{0..t}).
DL has advantages when you:
a) have additional context at each time step, or
b) have multiple related time series.
For example, consider Amazon, which forecasts demand for all of its products. At each time step, they know about inventory and marketing efforts, and could even model higher-dimensional attributes like the persuasiveness of the item's description with NLP.
It's also true they have items that are highly correlated. Skis, Snowboards, and Ski jackets all likely have similar sales patterns. Leveraging this correlation can increase accuracy, and is especially useful when you have items with limited history.
Including all of that context is hard with a statistical model, and whatever equation a human can come up with to combine them is probably worse than a learned, embedding-based DL model.
Statistical models are a great starting point & baseline for most problems, but as you add real-world complexity beyond the general-case time series, that's not as true.
I might not be aware of it, but I wish there were more benchmarks/research on higher complexity problems.
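To make the "learned, embedding-based" idea above concrete, here's a toy sketch in PyTorch. All the names, sizes, and covariates are made up, and real systems (e.g. DeepAR) are considerably more involved:

    import torch
    import torch.nn as nn

    class GlobalForecaster(nn.Module):
        """One model shared across all items: item embedding + lags + covariates."""
        def __init__(self, n_items, n_lags=12, n_covariates=4, emb_dim=16, hidden=64):
            super().__init__()
            self.item_emb = nn.Embedding(n_items, emb_dim)  # lets related items share signal
            self.net = nn.Sequential(
                nn.Linear(emb_dim + n_lags + n_covariates, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),                        # one-step-ahead demand
            )

        def forward(self, item_id, lags, covariates):
            x = torch.cat([self.item_emb(item_id), lags, covariates], dim=-1)
            return self.net(x).squeeze(-1)

    # Toy batch: 32 (item, lag-window, covariate) examples drawn from 1000 items.
    model = GlobalForecaster(n_items=1000)
    item_id = torch.randint(0, 1000, (32,))
    lags = torch.randn(32, 12)          # last 12 observations (scaled)
    covariates = torch.randn(32, 4)     # e.g. price, promo flag, inventory, season
    target = torch.randn(32)

    loss = nn.functional.mse_loss(model(item_id, lags, covariates), target)
    loss.backward()  # trained jointly over all items, so cold-start items borrow strength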
My understanding is that the biggest difference between traditional machine learning and neural networks is that neural networks are useful when features either need to be generated or are poorly understood (such as a binary blob of an image or sound sample). So for data that is already neatly organized and labeled in a spreadsheet, DL loses its main advantage.
I completely agree with your perspective. It is a reality that deep learning models might offer certain advantages over classical statistical models. We are building benchmarks and comparisons to clarify when the more complex models are better.
We also want to show with this experiment the importance of creating benchmarks. In many use cases, practitioners choose more sophisticated models because they think this will give them better accuracy. The main idea is that robust benchmarks should always be created.
Timeseries data can sometimes be deceptive, depending on what you are trying to model.
I have been hacking on a personal research project on hurricane track forecasting using deep learning. Given only track and intensity data at different points in time (every 6 hours) and some simple feature engineering, you will not get any results close to the official NHC forecast, no matter what model you use.
In hindsight, this is a little obvious. Hurricane forecasts depend more on factors other than time itself. A sales forecast can depend on seasonal trends and key events in time, but a hurricane forecast is much more dependent on long-range spatial data, like the state of the atmosphere and ocean, which is very non-trivial to model using just track data.
However, deep learning models and techniques are helpful in this scenario because they allow you to integrate multiple modalities, like images, graphs, and volumetric data, into one model, which may not be possible with statistical models alone.
I'm heavily involved in this area of research (getting deep learning competitive with computationally efficient statistical methods), and I'd like to note a couple things I've found:
1. Deep learning doesn't require thorough understanding of priors or statistical techniques. This opens the door to more programmers in the same way high level languages empower far more people than pure assembly. The tradeoffs are analogous - high human efficiency, loss of compute efficiency.
2. Near-CPU deep learning accelerators are making certain classes of models far easier to run efficiently. For example, an M1 chip can run matrix multiplies (DL primitive composed of floating point operations) 1000x faster than individual instructions (2TFlops vs 2GHz). This really changes the game, since we're now able to compare 1000 floating point multiplications with a single if statement.
It opens the door to more script kiddies, not more researchers. I really think we need more researchers who understand inference from first principles and build models with a view to furthering understanding, as opposed to more fit(X, y).
I don’t say this naively. At least in industry, I think the weight of imposter data scientists is getting to a level that may cause the profession to implode from customer disillusionment within the next 10 years, precisely because fit(X, y) is so accessible.
I am not sure you aren't trading "high human efficiency" against an increased risk of blowing up at some point. Good luck doing forecasting without a thorough understanding of priors and statistics in general.
Agreed, I see the "lower barrier to entry" in this particular case as coming with potentially huge risks. IMO, statistics is vastly, vastly, vastly under-appreciated and under-estimated.
I think that term already has usage as a proxy for "lowest sampling variance"; for example, the Gauss-Markov theorem shows that OLS is the minimum-variance linear unbiased estimator.
I guess this is echoing your point 2, but I would have generally said that "principled" statistical models are less efficient these days than DL (see: HMC being much slower than variational Bayes). Priors are usually overrated but I think the risk is more that basic mistakes are made because people don't understand what assumptions go into "basic" machine learning ideas like train/test splits or model selection. I'm not sure it warrants a lot of panic though.
This readme lands to me like this: "People say deep learning killed stats, but that's not true; in fact, DL can be a huge mistake."
Ok, I fully agree with their foundational premise: Start simple.
But: They've overstated their case a bit. Saying that deep learning will cost $11,000 and need 14 days on this data set is not reasonable. I believe you can find some code that will cost that much. The readme suggests that this is typical of deep learning, which is not true. DL models have enormous variety. You can train a useful, high-performance model on a laptop CPU in a seconds-to-minutes timeframe; examples include multilayer perceptrons for simple classification, a smaller-scale CNN, or a collaborative filtering model.
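For a sense of scale, something like this trains in seconds on a laptop CPU (scikit-learn on the toy digits dataset; the timing and accuracy in the comments are what I'd typically expect, not a benchmark):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)          # 1797 8x8 images, 10 classes
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
    clf.fit(X_train, y_train)                    # a few seconds on a laptop CPU
    print(clf.score(X_test, y_test))             # typically ~0.95+ accuracy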
While I don't endorse all details of their argument, I do think the culture of applied ML/data science has shifted too far toward default-DL. The truth is that many problems faced by real companies can be solved with simple techniques or pre-trained models.
Another perspective: A DL model is a spacecraft (expensive, sophisticated, powerful). Simple models like logistic regression are bikes and cars (affordable, efficient, less powerful). Using heuristics is like walking. Often your goal is just a few blocks away, in which case it would be inefficient to use a spacecraft.
>They've overstated their case a bit. Saying that deep learning will cost $11,000 and need 14 days on this data set is not reasonable.
After glancing at the paper they're criticising, I really wonder how they arrived at these insane figures. From what I saw, they were mostly using things like MLPs with a handful of layers and O(100) neurons at most. Yes, if you put a hundred-million-parameter transformer in there you will train forever (and waste tons of compute, since that would be complete overkill), but not with simple perceptrons. I don't know the extent of the data, but given these architectures I very much doubt a practical model would take this long to train, even on a CPU, given that you could run a statistical ensemble in 5 minutes.
Nice article and interesting comparison.
Yet I have a minor issue with the title: deep learning models are also statistical methods ... "univariate models vs. " would be a better title.
You could argue that deep learning is not a statistical method in the traditional sense, in that a typical neural network model is not a probability model, and some neural networks are well known to produce specifically bad probability models, requiring some amount of post processing in order to produce correctly "calibrated" probability predictions.
However, I don't like that a strict dichotomy is often presented between "deep learning" and "statistics". There is a whole world of gray areas and hybrid techniques, which tend to be more accessible, easier to reason about, and more effective in practice, especially on smaller "tabular" datasets. What about generalized additive models, random forests, gradient boosted trees, etc.?
The author of the document I'm sure is aware of these techniques, and I assume they are left out because they didn't perform well enough to be considered here. But I don't think it does the discourse any favors to promulgate the false dichotomy.
Statistical models and probabilistic models are not synonymous.
Vanilla deep learning models are statistical models (a la linear regression) and not probabilistic models (a la Gaussian mixture). It is important to maintain the distinction.
But to your point about the dichotomy between deep learning and more "traditional" statistical methods: this confusion in common parlance clearly has negative effects on model-building among engineers. You are right that when people think "deep learning" they think of very specific architectures with very specific features, and don't seem to conceive of the possibility that automatic differentiation techniques mean you can incorporate all sorts of new model components that blur the line between deep learning and older methods. For instance, you could feed the results of a kernel SVM to an ARIMA model in such a way that the whole thing is end-to-end differentiable. In fact, the great benefit of deep learning long-term is (in my opinion) that the ability to build these compositional models means you can bake in that much more inductive bias into the models you build, meaning they can be smaller and more stable in training.
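Not the kernel-SVM-into-ARIMA example itself, but here's a toy of the same idea: a classical component (simple exponential smoothing with a learnable smoothing weight) composed with a small network and trained end-to-end by autograd. Everything here is illustrative:

    import torch
    import torch.nn as nn

    class SmoothedForecaster(nn.Module):
        """Exponential smoothing with a learnable alpha, whose level feeds a small MLP."""
        def __init__(self, window=12, hidden=32):
            super().__init__()
            self.alpha_raw = nn.Parameter(torch.tensor(0.0))  # sigmoid -> alpha in (0, 1)
            self.head = nn.Sequential(
                nn.Linear(window + 1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

        def forward(self, y):                     # y: (batch, window) of past observations
            alpha = torch.sigmoid(self.alpha_raw)
            level = y[:, 0]
            for t in range(1, y.shape[1]):        # classical recursion, but differentiable
                level = alpha * y[:, t] + (1 - alpha) * level
            x = torch.cat([y, level.unsqueeze(-1)], dim=-1)
            return self.head(x).squeeze(-1)       # one-step-ahead forecast

    model = SmoothedForecaster()
    y_past = torch.randn(8, 12)
    y_next = torch.randn(8)
    loss = nn.functional.mse_loss(model(y_past), y_next)
    loss.backward()                               # gradients flow through alpha and the MLP together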
"Vanilla deep learning models are statistical models (a la linear regression) and not probabilistic models (a la Gaussian mixture). It is important to maintain the distinction."
Isn't this just a matter of interpretation of the models? You can interpret linear regression in a Bayesian way and say that the prediction of the linear model is the MAP of the mean, you can also calculate the variance, the l2 norm objective is saying the distribution of errors is normally distributed, l2 regularisation is a normal prior on the coefficients, etc, etc? All the same stuff can be applied to deep learning models.
Maybe I don't understand your distinction between statistical and probabilistic though?
> Isn't this just a matter of interpretation of the models?
Not really. This is the classic frequentist vs Bayesian debate. In frequentist-land, you are computing point estimates of the model parameters. In Bayesian-land, you are computing distribution estimates of the model parameters. It is true that there is a difference in interpretation of the generative process but the two choices demand fundamentally different models because of the decision about which of the parameters or data are considered "real" and which are considered "generated".
I think a more abstract/general way to put it is: "statistics" is concerned with statistical summary values (i.e. mean-field estimates over measures) while "probability" is concerned more with distributions (i.e., topologies of measures). I'm not sure this is a rigorously correct way to characterize it, but it illustrates the intuition I'm trying to convey.
Statistics as practiced today (1930s until now?) consists almost entirely of making inferences about unobserved probability distributions. That includes nonparametric statistics, and frequentist versus Bayesian has nothing to do with it.
There are some probability models that are not really statistical models, but there are few or no statistical models that are not also probability models.
Least-squares regression is a probability model. Even if you don't particularly care about the error distribution, you are still estimating a conditional expectation and setting a conditional independence assumption on the residuals. If that's not a probability model, then I don't know what it is!
Statistics can be summarized as one thing: n -> N. Does "little n" represent "big N"? In other words, does the sample generalize to the population? Statistics means something like "description of the state". It was born out of census samples, where larger populations had to be estimated from samples. "n" could be a handful of fish from a lake of "N" fish. "n" could also be the parameter estimated in a linear regression from the sample of data collected, while "N" is the true parameter of the relationship if we had all the data. Point estimation is about finding the needle in the haystack, but much more often statistics is about finding the haystack given the needle. One tool statistics uses to get to the haystack is probability.
A point estimate of distribution parameters describing a population is frequentist. A point estimate of distribution parameters describing another distribution's parameter is Bayesian.
How the parameters are estimated is not the message.
In statistics there are latin letters and greek letters. When you see a symbol denoted as a greek letter then that is a population parameter. When you see a latin letter that is a sample estimate. It could be Frequentist, Bayesian, Likelihoodist, Fiducial, Empirical Bayes, etc. Theoretical population greeks or sample calculated latins.
I have a very limited statistical background, but doesn't variational inference applied to neural networks make them probabilistic models? The modelling definitely seems so, because the math in those papers doesn't even specify that the model is a network (it implies that it can be any model).
Yes indeed. This synthesis of concepts is a great illustration of moving beyond hardened dichotomies in this research space and I believe similar approaches will be fruitful in the years to come.
They are all univariate models: some are trained offline on a bunch of different series before being applied (deep learning, “global” models), others are applied directly to each series to forecast (“statistical”, “local” models), but the task is the same univariate time series prediction for every model there.
I wish we could start moving to better approaches for evaluating time series forecasts. Ideally, the forecaster reports a probability distribution over time series, then we evaluate the predictive density with regard to an error function that is optimal for the intended application of the forecast at hand.
I use my package https://github.com/alexhallam/tablespoon to generate naive forecasts then evaluate the crps of the naive vs the crps of the alternative method. This “skill score” approach is very good.
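For Gaussian predictive distributions the CRPS has a closed form, so the skill-score comparison is cheap. A sketch (the observations, the "naive" parameters, and the pretend model forecasts below are all invented for illustration):

    import numpy as np
    from scipy.stats import norm

    def crps_gaussian(y, mu, sigma):
        """CRPS of a N(mu, sigma) predictive distribution at observation y (closed form)."""
        z = (y - mu) / sigma
        return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

    # Illustrative forecasts for 12 future observations.
    y_obs = np.array([103., 98., 110., 107., 101., 99., 104., 112., 95., 102., 108., 100.])
    naive_mu, naive_sigma = 100.0, 8.0    # e.g. last observed value, historical residual sd
    model_mu = y_obs + np.random.default_rng(0).normal(0, 3, 12)  # pretend model forecasts
    model_sigma = 3.0

    crps_naive = crps_gaussian(y_obs, naive_mu, naive_sigma).mean()
    crps_model = crps_gaussian(y_obs, model_mu, model_sigma).mean()
    print("skill score:", 1 - crps_model / crps_naive)  # > 0 means the model beats naive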
What is the point of this kind of comparison? It is completely dependent on the 3000 datasets they chose to use. You're not going to find that one method is better than another in general, or find some type of time series for which you can make a specific methodological recommendation (unless that series is specifically constructed with a mathematical feature, like stationarity).
What matters is "which method is better for MY data?", but that's not something an academic can study. You just have to test a few different things.
What deep learning could instead be used for in this case is to incorporate more data, like text describing events that affects macroeconomics when doing macroeconomic predictions.
Hmmm. Not sure why they use M3 data when there is already M4 where a deep learning model won. I know because I reimplemented it as a toy version in python here: https://github.com/leanderloew/ES-RNN-Pytorch
It was actually very cool because the model was a blend of exponential smoothing and DL.
I can have some interest in, hope for, etc., machine learning. One reason is that, for the curve-fitting methods of classic statistics, i.e., versions of regression, the math assumptions that give some hope of good results are essentially impossible to verify and look like they will hold closely only rarely. So, even when using such statistics, good advice is to have two steps: (1) apply the statistics, i.e., fit, using half the data and then (2) verify, test, check using the other half. But, gee, those two steps are also common in machine learning. Sooo, if we can't find much in classic math theorems and proofs to support machine learning, then we are just put back to the two steps statistics has had to use anyway.
So, if we have to use the two steps anyway, then the possible advantages of non-linear fitting have some promise.
So, to me, a larger concern comes to the top: In my experience in such things, call it statistics, optimization, data analysis, whatever, a huge advantage is bringing to the work some understanding that doesn't come with the data and/or really needs a human. The understanding might be about the real problem or about some mathematical methods.
E.g., once some guys had a problem in optimal allocation of some resources. They had tried simulated annealing, run for days, and quit without knowing much about the quality of the results.
I took the problem as 0-1 integer linear programming, a bit large, 600,000 variables, 40,000 constraints, and in 900 seconds on a slow computer, with Lagrangian relaxation, got a feasible solution guaranteed, from the bounding, to be within 0.025% of optimality. The big advantages were understanding the 0-1 program, seeing a fast way to do the primal-dual iterations, and seeing how to use Lagrangian relaxation. My guess is that it would be tough for some very general machine learning to compete much short of artificial general intelligence.
One way to describe the problem with the simulated annealing was that it was just too general, didn't exploit what a human might understand about the real problem and possible solution methods selected for that real problem.
I have a nice collection of such successes where the keys were some insight into the specific problems and some math techniques, that is, some human abilities that would seem to need machine learning to have artificial general intelligence to compete. With lots of data, lots of computing, and the advantages of non-linear operations, at times machine learning might be the best approach even now.
Net, still, in many cases, human intelligence is tough to beat.
A point about gradient-free methods such as simulated annealing and genetic algorithms: the transition (sometimes called "neighbor") function is the most important part by far. The most important insight is the most obvious one in some way: if your task is to search a problem space efficiently for an optimal solution, it pays to know exactly how to move from where you are to where you want to be in that problem space. To that point, (the structure of) transitions between successive state samples should be refined to your specific problem and encoding of the domain in order to be useful in any reasonable amount of time.
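A minimal sketch of that point: the annealing loop itself is generic boilerplate, and all the leverage is in the neighbor function, which should encode what you know about the problem. Here it's a toy 0-1 knapsack with a swap move rather than blind bit flips (all numbers made up):

    import math, random

    random.seed(0)
    values  = [random.randint(1, 100) for _ in range(200)]
    weights = [random.randint(1, 50)  for _ in range(200)]
    capacity = 2000

    def score(x):                       # 0/1 knapsack objective with infeasibility penalty
        w = sum(wi for wi, xi in zip(weights, x) if xi)
        v = sum(vi for vi, xi in zip(values,  x) if xi)
        return v if w <= capacity else v - 10 * (w - capacity)

    def neighbor(x):
        # The domain-specific part: a swap move (drop one chosen item, add one unchosen)
        # keeps the solution near the capacity boundary instead of random bit flips.
        x = x[:]
        ones  = [i for i, xi in enumerate(x) if xi]
        zeros = [i for i, xi in enumerate(x) if not xi]
        if ones and zeros and random.random() < 0.8:
            x[random.choice(ones)] = 0
            x[random.choice(zeros)] = 1
        else:
            i = random.randrange(len(x))
            x[i] = 1 - x[i]
        return x

    x = [0] * 200
    best, t = score(x), 50.0
    for step in range(20000):
        cand = neighbor(x)
        delta = score(cand) - score(x)
        if delta > 0 or random.random() < math.exp(delta / t):
            x = cand
        best = max(best, score(x))
        t *= 0.9995                      # cooling schedule
    print(best)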
> the transition (sometimes called "neighbor") function is the most important part by far.
And, indeed, in the 0-1 integer linear programming with Lagrangian relaxation I used, there is nothing differentiable, so it should be counted as "gradient free". And the linear programming part and the Lagrangian part do "move" from where we are to closer to where we want to be.
A thing is, the bag of tricks, techniques, that work here is large. So, right, should use knowledge of the real problem to pick what tricks to use.
Clean data would benefit most models, not just non-deep learning models. Missing data introduces bias even in DL models.
- they focus on linear relationships and not complex joint distributions
1. If seasonality is present, which is usually the case in practical business problems, then you will find that actual ~ lag_actuals explains most of the variance with a linear relationship (see the sketch at the end of this comment). Non-linearities in time series are not something that I see often. You can usually make a feature that explains those away linearly.
- they focus on fixed temporal dependence that must be diagnosed and specified a priori
Not sure what you are saying here.
- they take as input univariate, not multiple-input, data
That is not the case. Time series regression can take many lags for inputs.
- they focus on one-step forecasts, not long time horizons
False. Time series regression models are used to forecast revenue many years into the future in the business world.
- they’re highly parameterized and rigid to assumptions
So is F=ma
- they fail for cold start problems
Because a cold start is not a time series data set. Why would time series methods work on non-time-series data?
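To make the lag_actuals point concrete, a toy sketch (numpy only, simulated monthly data, not from any real business problem):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 120  # ten years of monthly data
    month = np.arange(n) % 12
    y = 100 + 0.5 * np.arange(n) + 15 * np.sin(2 * np.pi * month / 12) + rng.normal(0, 3, n)

    # actual ~ lag_actual + month dummies: a purely linear model.
    lag = y[:-1]
    target = y[1:]
    dummies = np.eye(12)[month[1:]]
    X = np.column_stack([np.ones(n - 1), lag, dummies[:, 1:]])  # drop one dummy to avoid collinearity

    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    fitted = X @ beta
    r2 = 1 - np.sum((target - fitted) ** 2) / np.sum((target - target.mean()) ** 2)
    print(f"R^2 of lag + seasonal dummies: {r2:.3f}")  # explains most of the variance here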
I appreciate the response but I admit I'm a bit confused - maybe it's the case that you're not familiar with the need for modern forecasting? Some of your 'defense' is just flat out wrong. A few points:
- You write:
1. "Clean data would benefit most models, not just non-deep learning models. Missing data introduces bias even in DL models." and 2. "Because a cold start is not a time series data set. Why would time series methods work on non time series data."
Clean data is good, yes, thanks for the insight. But 1. missing data is a huge issue for model quality, particularly for highly-parameterized state space models if you're missing key aspects (e.g. an ETS model with missing data for trend is an awful model) and 2. cold start is certainly a time series concept (and one with lots of ongoing research at that). Here's an example: At Amazon, thousands of new products launch each week. Many of those new products have zero or very limited historical data, but metrics (e.g. demand) still need to be forecasted. Despite your claim that such an issue doesn't exist, this is a cold start time series problem. Modern techniques exist to handle this issue, such as DeepAR, which handles the cold start issue by training a global model for probabilistic forecasts (https://arxiv.org/pdf/1704.04110.pdf)
- You write: "That is not the case. Time series regression can take many lags for inputs"
Of course, but that's not what was written. I suggest you google the difference between multiple and multivariate time series
- You write: "False. Time series regression models are used to forecast revenue many years into the future in the business world"
In the 'business world'(?) you can use a forecasting model to do multi-step ahead prediction (aka k-step ahead error propagation), but I'm referring specifically to direct multi-horizon forecasting (e.g. https://arxiv.org/pdf/1711.11053.pdf)
- You write: "Not sure sure you are saying here."
I'm describing verbatim the process by which the classical forecasting techniques described work.
- You write: "So is F=ma"
This is agreeing with me. I think maybe your intent was to write something with a nonlinearity, but I’m not sure you’re sure what point you’re trying to make here.
- You write: "1. If seasonality is present. Which is usually the case in practical business problems then you will find that actual ~ lag_actuals explains most of the variance with a linear relationship. non-linearities in time series is not something that I see often. You can usually make a feature that explains those away linearly."
You ignored the bit about joint distributions, and you gave an example of lag features you've seen. Non-linearities are common in many time series applications (e.g. weekly returns of <currency> foreign exchange rates, <stock> realized volatility). You can handle them with classical techniques, but it's painful if you do something more complex than just introducing a feature and hoping it captures the missing variance.
You can fit a state space model to missing data in many APIs, sure, but that doesn't mean it's supported in any meaningful way.
Imagine you have a new product with 6 months of missing data. You can feed that into ETS/ARIMA/whatever, but you're not getting any valuable output for those missing point-value estimates.
A lot of the M3 datasets we use are high-frequency, with large seasonal inputs. Considering that Gaussian Process (GP) complexity is O(N^3), a careful study of their performance would be challenging.
Also... I'm not aware of any efficient GP Python implementations.
GPs over time series can leverage low-dimensional index sets for O(N lg N) fitting and inference. This can be done by interpolating the inputs onto a regular grid which admits Toeplitz kernels. See https://arxiv.org/abs/1503.01057.
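The Toeplitz trick in isolation looks roughly like this (numpy only; just the O(N log N) kernel matvec on a regular grid, not a full GP fit, and the kernel/lengthscale are arbitrary):

    import numpy as np

    n = 4096
    t = np.arange(n, dtype=float)                   # regular time grid
    first_col = np.exp(-0.5 * (t / 25.0) ** 2)      # stationary (RBF) kernel -> Toeplitz K

    def toeplitz_matvec(col, v):
        """K @ v in O(N log N) by embedding the symmetric Toeplitz matrix in a circulant."""
        c = np.concatenate([col, col[-2:0:-1]])     # circulant embedding of length 2N - 2
        fv = np.fft.rfft(np.concatenate([v, np.zeros(len(c) - len(v))]))
        return np.fft.irfft(np.fft.rfft(c) * fv, n=len(c))[:len(v)]

    v = np.random.default_rng(0).normal(size=n)
    fast = toeplitz_matvec(first_col, v)

    # Check against the dense O(N^2) product on a small prefix.
    K = np.exp(-0.5 * ((t[:512, None] - t[None, :512]) / 25.0) ** 2)
    print(np.allclose(K @ v[:512], toeplitz_matvec(first_col[:512], v[:512])))  # True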
[1] https://forecasters.org/resources/time-series-data/m3-compet...