The conclusion, that a low-complexity statistical ensemble is almost as good as a (computationally) complex Deep Learning model, should not come as a surprise, given the data.
The dataset[1] used here is 3003 time series from the M3 competition run by the International Journal of Forecasting. Almost all of these are sampled at the yearly, quarterly or monthly frequency, each with typically 40 to 120 observations ("samples" in Machine Learning lingo), and the task is to forecast a few months/quarters/years out of sample. Most experienced Machine Learners will realize that there is probably limited value in fitting a high-complexity n-layer Deep Learning model to 120 data points to try to predict the next 12. If you have daily or intraday (hourly/minutely/secondly) time series, more complex models might become more worthwhile, but such series are barely represented in the dataset.
To me the most surprising result was how badly AutoARIMA performed. Seasonal ARIMA was one of the traditional go-to methods for this kind of data.
If the task is to predict the next 12 values from a sample of 120 previous values, drawn from some computationally simple statistical process, it's much cheaper and easier to use old-fashioned, tried-and-true statistical methods.
If the task is to predict millions of pixel values that make up an original work of art, or the pixel values over time that make up a deep-fake video, or the next set of values encoding the next best possible play in a game of Go, or the set of values that encode the entire structure of a protein, and so on, then you have no choice: You must use a deep neural network. Simple methods cannot do any of that.
Sure, with more data and a high signal you might have a substantial advantage with DL. But lots of data is low signal-to-noise, which again favours parametric models.
Something that bothers me about the ML literature is that they frequently present a large number of evaluation results, such as precision and AUC, but these are not qualified by error bars. Typically they make a table with different algorithms on one side and different problems on the other, and the highest score for a given problem gets bolded.
I know that if you did the experiment over and over again with different splits you'd get slightly different scores, so I'd like to see some guidance as to significance in terms of ① statistical significance, and ② whether it is significant on a business level. Would customers notice the difference? Would it enable better decisions that move the needle for revenue or other business metrics?
This study is an example where a drastically more expensive algorithm seems to produce a practically insignificant improvement.
This is one of my default suggestions when I act as a reviewer: t-test with Bonferroni correction, please. ML, ironically, has absolutely horrible practices in terms of distinguishing signal from noise (which at least is partially offset by the social pressure to share code, but still).
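Concretely, something like this is all I'm asking for. It's only a sketch; the score arrays below are made up, and in practice they'd come from repeated runs over different splits/seeds:

    import numpy as np
    from scipy import stats

    # Per-split (or per-seed) scores for a proposed method and two baselines.
    # These arrays are illustrative; in a paper they would come from k repeated
    # runs with different splits/seeds.
    proposed   = np.array([0.871, 0.868, 0.874, 0.869, 0.872])
    baseline_a = np.array([0.861, 0.866, 0.860, 0.865, 0.863])
    baseline_b = np.array([0.869, 0.867, 0.873, 0.870, 0.868])

    baselines = {"baseline_a": baseline_a, "baseline_b": baseline_b}
    m = len(baselines)  # number of comparisons for the Bonferroni correction

    for name, scores in baselines.items():
        t, p = stats.ttest_rel(proposed, scores)   # paired t-test over splits
        p_adj = min(1.0, p * m)                    # Bonferroni-adjusted p-value
        print(f"{name}: t={t:.2f}, raw p={p:.4f}, Bonferroni p={p_adj:.4f}")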
Bonferroni correction on hold-out data is an excellent suggestion. To adapt it to time series forecasting, one could perform temporal cross-validation with rolling windows and track the variance of performance through time.
Unfortunately, the computational time would explode if the ML method's optimization is performed naively. Precise measurements of the statistical significance would crowd out researchers except for Big Tech.
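Something like this rolling-origin setup is what I have in mind. It's only a sketch: `fit_and_forecast` is a stand-in for whatever method is being evaluated, and the seasonal-naive forecaster at the bottom is just for illustration.

    import numpy as np

    def rolling_origin_scores(y, horizon=12, initial=60, step=12, fit_and_forecast=None):
        """Evaluate a forecaster on successive rolling windows of a single series.

        y: 1-D array of observations; fit_and_forecast(train, horizon) -> forecasts.
        Returns one error (here MAE) per window, so you can look at the spread,
        not just the mean.
        """
        errors = []
        for end in range(initial, len(y) - horizon + 1, step):
            train, test = y[:end], y[end:end + horizon]
            forecast = fit_and_forecast(train, horizon)
            errors.append(np.mean(np.abs(test - forecast)))
        return np.array(errors)

    # Toy usage with a seasonal-naive forecaster (repeat the last 12 observations).
    rng = np.random.default_rng(0)
    y = 100 + 10 * np.sin(np.arange(120) * 2 * np.pi / 12) + rng.normal(0, 2, 120)
    snaive = lambda train, h: np.tile(train[-12:], (h + 11) // 12)[:h]
    scores = rolling_origin_scores(y, fit_and_forecast=snaive)
    print(scores.mean(), scores.std())  # mean MAE and its spread across windows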
What would be a better method for machine learning folks to take? As a sincere curiosity / desire to learn, not meant as a rhetorical implication that I disagree.
I interpret your criticism to mean that ML folks tend to re-use a test set multiple times without worrying that doing so reduces the meaning of the results. If that's what you mean, then I do agree.
Informally, some researchers are aware of this and aim to use a separate validation data set for all parameter tuning, and would like to use a held out test set as few times as possible — ideally just once. But it gets more complicated than that because, for example, different subsets of the data may not really be independent samples from the run-time distribution (example: data points = medical data about patients who lived or died, but only from three hospitals; the model can learn about different success rates per hospital successfully but it would not generalize to other hospitals). In other words, there are a lot of subtle ways in which a held out test set can result in overconfidence, and I always like to learn of better ways to resist that overconfidence.
Ben Recht actually has a line of work showing that we aren't overfitting the validation/test sets for now (amazingly...). What I mean is, by chasing higher and higher SotA with more and more money and compute, whole fields can go on "improving" only for papers like https://arxiv.org/abs/2003.08505 or "Implementation Matters in Deep RL" to come out and show that what's going on is different from the literature consensus. The standards for showing improvement are low, while the standards for negative results are high (I'm a bit biased because I have a rejected paper trying to show empirically that some deep RL work didn't add marginal value, but I think the case still holds). Everyone involved is trying their best to do good science, but unless someone like me asks for it, there simply isn't a career value-add in doing exhaustive checking.
A concrete improvement would be only allowing one change at a time per paper, and measuring the impact of changing that one thing. But then you couldn't realistically publish anything outside of megacorps. Another solution might be banning corporate papers, or at least making a separate track... From reviewing papers, it seems like single authors or small teams in academia need to compete with Google, where multiple teams might share aspects of a project (one doing the architecture, another a new training algorithm, etc.) that won't be disclosed; you'll just read a paper where, for some reason, a novel architecture is introduced using a baseline which is a bit exotic but also used in another paper that came out close to this one, and a regulariser which was introduced just before that...
If you limit the pools, you can put much higher standards on experiments on the corporate side, where the budget exists, while giving academia more points for novelty and creativity.
Thanks! I hadn't previously thought about the internal boost that large well-sponsored (aka corporate) teams get this way. It seems worth being aware of. I'm certainly in favor of encouraging & including researchers who work in smaller teams or with less funding.
Question: why do we care about the Bonferroni correction if the model being reviewed shows high performance on holdout/test samples?
I mean, it's nice to know that the p-values of coefficients on models you are submitting for publication are appropriately reported under the conservative approach Bonferroni applies, but I would think making it a _default_ is an inappropriate forcing function when the performance on holdout is more appropriate. Data leakage would be a much, much larger concern IMHO. Variance of the performance metrics is also important.
Because the variance can be uniformly high, making it difficult to properly judge the improvement of one method vs the baseline method: did you actually improve, or did you just get a few lucky seeds? It's much harder to get a paper debunking new "SotA" methods so I default to showing a clear improvement over a good baseline. Simply looking at the performance is also not enough because a task can look impressive, but be actually quite simple (and vice versa), so using these statistical measures makes it easy to distinguish good models on hard tasks from bad models on easy tasks.
I should also note 1) this is about testing whether the performance of a model is meaningfully different from another, not the coefficient of the models 2) I don't reject papers just because they lack this, or if they fail to achieve a statistical significance, I just want it in the paper so the reader can use that to judge (and it also helps suss out cherry picked results)
You'd want to do some sort of test because it can help assess whether your method did better than the alternatives by chance. For example can you really say Method A is better than B if A got 88% accuracy on the holdout set and B got 86% accuracy? Would that be true of all possible datasets?
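For instance, a paired-bootstrap sketch. The per-example correctness vectors are simulated here; in practice they would be the actual predictions of A and B on the shared holdout set:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000  # holdout size

    # Per-example correctness on the same holdout set (simulated: A ~88%, B ~86%).
    correct_a = rng.random(n) < 0.88
    correct_b = rng.random(n) < 0.86

    # Paired bootstrap over examples: how often is A's accuracy really higher?
    boot_diffs = []
    for _ in range(10_000):
        idx = rng.integers(0, n, n)
        boot_diffs.append(correct_a[idx].mean() - correct_b[idx].mean())
    boot_diffs = np.array(boot_diffs)

    lo, hi = np.percentile(boot_diffs, [2.5, 97.5])
    print(f"accuracy difference: {correct_a.mean() - correct_b.mean():.3f}, "
          f"95% CI [{lo:.3f}, {hi:.3f}]")  # if the CI straddles 0, the gap may be noise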
t-test with Bonferroni isn't necessarily the best test for all metrics either.
The test sample is just a small, arbitrary sample from a universe of similar data.
You (probably) don't care about test-set performance per se but instead want to be able to claim that one model works better _in general_ than another. For that, you need to bust out the tools of statistical inference.
The test set allows you to make this claim if it is representative of the universe of novel data the model will run on and there is no data spoilage between test and train.
This isn't always true (especially in aggregate series over time, of course), and of course statistical measures are used to report model performance. A Bonferroni correction struck me as a weird place to apply this specifically, but after the other comments from yesterday I saw where they were taking it.
Every researcher would love to include error bars but it's a matter of limited computing resources at universities. Unless you're training on a tiny dataset like MNIST, these training runs get expensive. Also, unless you parallelize from the start and risk wasting a lot of resources if something goes wrong, it could take longer to get the results.
Simple formulas only work because the models themselves for those polls are incredibly simple and adding a bit more complexity requires a lot of tools to compute these uncertainties (this is part of the reason you see probabilistic programming so popular for people doing non-trivial polling work).
There are no simple approximations for a range of even slightly complex models. Even some nice computational tricks like the Laplace approximation don't work on models with huge numbers of parameters (since you need the Hessian, or at least a workable approximation to it).
A good overview of the situation is covered in Efron & Hastie's "Computer Age Statistical Inference".
In the Machine Learning literature, the variance of accuracy measurements originates from different network parameter initializations.
Since the deep learning ensembles already use aggregate computation in the hundreds of days, computing the variance would elevate the computational time into thousands of days.
In contrast, statistical methods that we report optimize convex objectives; their optimal parameters are deterministic.
That being said, we like the idea of including cross-validation with different splits for future experiments.
The test sets are large enough to render this moot, as the confidence intervals are almost certainly smaller than the precisions typically reported, i.e. 0.1%.
I've worked on commercial systems where N <= 10,000 in the evaluation set, and the confidence interval there is probably not as good as 0.1%. For instance, there is a lot of work on this data set (which we used to tune up a search engine), and sometimes it is as bad as N=50 queries with judgements. I don't see papers that are part of TREC or based on TREC data dealing with sampling errors in any systematic way.
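To put rough numbers on that, a back-of-the-envelope normal-approximation interval for a proportion-style metric (my own illustration, not tied to any particular TREC track):

    import math

    def ci_half_width(p, n, z=1.96):
        # Normal-approximation 95% interval half-width for a proportion metric.
        return z * math.sqrt(p * (1 - p) / n)

    for n in (50, 1_000, 10_000, 1_000_000):
        print(f"N={n:>9}: +/- {100 * ci_half_width(0.8, n):.2f} percentage points")
    # N=50 gives roughly +/- 11 points; you need N on the order of a million
    # before the interval shrinks to the ~0.1% precision papers like to report.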
NIST's TREC workshop series uses Cyril Cleverdon's methodology (the "Cranfield paradigm") from the 1960s, and more could surely be done on the evaluation front:
- systematically addressing sampling error;
- more than 50 queries;
- more/all QRELs;
- full evaluation instead of system pooling;
- studying IR beyond just the English language (this has been picked up by CLEF and NTCIR in Europe and Japan, respectively);
- metrics that take energy efficiency into account;
- ...
At the same time, we have to be very grateful to NIST/TREC for executing an international (open) benchmark annually, which has moved the field forward a lot in the last 25 years.
I've done some work in this area and have indeed found that simpler statistical models often outperform ML/DL.
This is true for single time series, where we are predicting P(x_{t+1} | x_{0..t}).
DL has advantages when you:
a) have additional context at each time step, or
b) have multiple related time series.
For example, consider Amazon, which forecasts demand for all of its products. At each time step, they know about inventory and marketing efforts, and could even model higher-dimensional attributes like the persuasiveness of the item's description with NLP.
It's also true they have items that are highly correlated. Skis, Snowboards, and Ski jackets all likely have similar sales patterns. Leveraging this correlation can increase accuracy, and is especially useful when you have items with limited history.
Including all of that context is hard with a statistical model, and whatever equation a human can come up with to combine them is probably worse than a learned, embedding-based DL model.
Statistical models are a great starting point & baseline for most problems, but as you add real-world complexity beyond the general-case time series, that's not as true.
I might not be aware of it, but I wish there were more benchmarks/research on higher complexity problems.
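To make the "learned, embedding-based" idea above concrete, here's a toy sketch in PyTorch. All the names, sizes, and covariates are made up, and real systems (e.g. DeepAR) are considerably more involved:

    import torch
    import torch.nn as nn

    class GlobalForecaster(nn.Module):
        """One model shared across all items: item embedding + lags + covariates."""
        def __init__(self, n_items, n_lags=12, n_covariates=4, emb_dim=16, hidden=64):
            super().__init__()
            self.item_emb = nn.Embedding(n_items, emb_dim)  # lets related items share signal
            self.net = nn.Sequential(
                nn.Linear(emb_dim + n_lags + n_covariates, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),                        # one-step-ahead demand
            )

        def forward(self, item_id, lags, covariates):
            x = torch.cat([self.item_emb(item_id), lags, covariates], dim=-1)
            return self.net(x).squeeze(-1)

    # Toy batch: 32 (item, lag-window, covariate) examples drawn from 1000 items.
    model = GlobalForecaster(n_items=1000)
    item_id = torch.randint(0, 1000, (32,))
    lags = torch.randn(32, 12)          # last 12 observations (scaled)
    covariates = torch.randn(32, 4)     # e.g. price, promo flag, inventory, season
    target = torch.randn(32)

    loss = nn.functional.mse_loss(model(item_id, lags, covariates), target)
    loss.backward()  # trained jointly over all items, so cold-start items borrow strength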
My understanding is that the biggest difference between traditional machine learning and neural networks is that neural networks are useful when features either need to be generated or are poorly understood (such as a binary blob of an image or sound sample). So for data that is already neatly organized and labeled in a spreadsheet, DL loses its main advantage.
I completely agree with your perspective. It is a reality that deep learning models might offer certain advantages over classical statistical models. We are building benchmarks and comparisons to clarify when the more complex models are better.
We also want to show with this experiment the importance of creating benchmarks. In many use cases, practitioners choose more sophisticated models because they think this will give them better accuracy. The main idea is that robust benchmarks should always be created.
Timeseries data can sometimes be deceptive, depending on what you are trying to model.
I have been hacking on a personal research project on hurricane track forecasting using deep learning. Given only track and intensity data at different points in time (every 6 hours) and some simple feature engineering, you will not get any results close to the official NHC forecast, no matter what model you use.
In hindsight, this is a little obvious. Hurricane forecasts depend more on factors other than time itself. A sales forecast can depend on seasonal trends and key events in time, but a hurricane forecast is much more dependent on long-range spatial data, like the state of the atmosphere and ocean, which is very non-trivial to model using just track data.
However, deep learning models and techniques are helpful in this scenario because they allow you to integrate multiple modalities, like images, graphs, and volumetric data, into one model, which may not be possible with statistical models alone.
I'm heavily involved in this area of research (getting deep learning competitive with computationally efficient statistical methods), and I'd like to note a couple things I've found:
1. Deep learning doesn't require thorough understanding of priors or statistical techniques. This opens the door to more programmers in the same way high level languages empower far more people than pure assembly. The tradeoffs are analogous - high human efficiency, loss of compute efficiency.
2. Near-CPU deep learning accelerators are making certain classes of models far easier to run efficiently. For example, an M1 chip can run matrix multiplies (DL primitive composed of floating point operations) 1000x faster than individual instructions (2TFlops vs 2GHz). This really changes the game, since we're now able to compare 1000 floating point multiplications with a single if statement.
It opens the door to more script kiddies, not more researchers. I really think we need more researchers who understand inference from first principles and build models with a view to furthering understanding, as opposed to more fit(X, y).
I don’t say this naively. At least in industry, I think the weight of imposter data scientists is getting to a level that may cause the profession to implode from customer disillusionment within the next 10 years, precisely because fit(X, y) is so accessible.
I am not sure you aren't trading "high human efficiency" against an increased risk of blowing up at some point. Good luck doing forecasting without a thorough understanding of priors and statistics in general.
Agreed, I see the "lower barrier to entry" in this particular case as coming with potentially huge risks. IMO, statistics is vastly, vastly, vastly under-appreciated and under-estimated.
I think that term already has usage as a proxy for "lowest sampling variance"; for example, the Gauss-Markov theorem shows that OLS is the minimum-variance linear unbiased estimator.
I guess this is echoing your point 2, but I would have generally said that "principled" statistical models are less efficient these days than DL (see: HMC being much slower than variational Bayes). Priors are usually overrated but I think the risk is more that basic mistakes are made because people don't understand what assumptions go into "basic" machine learning ideas like train/test splits or model selection. I'm not sure it warrants a lot of panic though.
This readme lands to me like this: "People say deep learning killed stats, but that's not true; in fact, DL can be a huge mistake."
Ok, I fully agree with their foundational premise: Start simple.
But: They've overstated their case a bit. Saying that deep learning will cost $11,000 and need 14 days on this data set is not reasonable. I believe you can find some code that will cost that much. The readme suggests that this is typical of deep learning, which is not true. DL models have enormous variety. You can train a useful, high-performance model on a laptop CPU in a seconds-to-minutes timeframe; examples include multilayer perceptrons for simple classification, a smaller-scale CNN, or a collaborative filtering model.
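For a sense of scale, something like this trains in seconds on a laptop CPU (scikit-learn on the toy digits dataset; the timing and accuracy in the comments are what I'd typically expect, not a benchmark):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)          # 1797 8x8 images, 10 classes
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
    clf.fit(X_train, y_train)                    # a few seconds on a laptop CPU
    print(clf.score(X_test, y_test))             # typically ~0.95+ accuracy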
While I don't endorse all details of their argument, I do think the culture of applied ML/data science has shifted too far toward default-DL. The truth is that many problems faced by real companies can be solved with simple techniques or pre-trained models.
Another perspective: A DL model is a spacecraft (expensive, sophisticated, powerful). Simple models like logistic regression are bikes and cars (affordable, efficient, less powerful). Using heuristics is like walking. Often your goal is just a few blocks away, in which case it would be inefficient to use a spacecraft.
>They've overstated their case a bit. Saying that deep learning will cost $11,000 and need 14 days on this data set is not reasonable.
After glancing at the paper they're criticising, I really wonder how they arrived at these insane figures. From what I saw, they were mostly using things like MLPs with a handful of layers and O(100) neurons at most. Yes, if you put a hundred-million-parameter transformer in there you will train forever (and waste tons of compute, since that would be complete overkill), but not with simple perceptrons. I don't know the extent of the data, but given these architectures I very much doubt a practical model would take this long to train, even on a CPU, given that you could run a statistical ensemble in 5 minutes.
Nice article and interesting comparison.
Yet I have a minor issue with the title: deep learning models are also statistical methods ... "univariate models vs. " would be a better title.
You could argue that deep learning is not a statistical method in the traditional sense, in that a typical neural network model is not a probability model, and some neural networks are well known to produce specifically bad probability models, requiring some amount of post processing in order to produce correctly "calibrated" probability predictions.
However, I don't like that a strict dichotomy is often presented between "deep learning" and "statistics". There is a whole world of gray areas and hybrid techniques, which tend to be more accessible, easier to reason about, and more effective in practice, especially on smaller "tabular" datasets. What about generalized additive models, random forests, gradient boosted trees, etc.?
The author of the document I'm sure is aware of these techniques, and I assume they are left out because they didn't perform well enough to be considered here. But I don't think it does the discourse any favors to promulgate the false dichotomy.
Statistical models and probabilistic models are not synonymous.
Vanilla deep learning models are statistical models (a la linear regression) and not probabilistic models (a la Gaussian mixture). It is important to maintain the distinction.
But to your point about the dichotomy between deep learning and more "traditional" statistical methods: this confusion in common parlance clearly has negative effects on model-building among engineers. You are right that when people think "deep learning" they think of very specific architectures with very specific features, and don't seem to conceive of the possibility that automatic differentiation techniques mean you can incorporate all sorts of new model components that blur the line between deep learning and older methods. For instance, you could feed the results of a kernel SVM to an ARIMA model in such a way that the whole thing is end-to-end differentiable. In fact, the great benefit of deep learning long-term is (in my opinion) that the ability to build these compositional models means you can bake in that much more inductive bias into the models you build, meaning they can be smaller and more stable in training.
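Not the kernel-SVM-into-ARIMA example itself, but here's a toy of the same idea: a classical component (simple exponential smoothing with a learnable smoothing weight) composed with a small network and trained end-to-end by autograd. Everything here is illustrative:

    import torch
    import torch.nn as nn

    class SmoothedForecaster(nn.Module):
        """Exponential smoothing with a learnable alpha, whose level feeds a small MLP."""
        def __init__(self, window=12, hidden=32):
            super().__init__()
            self.alpha_raw = nn.Parameter(torch.tensor(0.0))  # sigmoid -> alpha in (0, 1)
            self.head = nn.Sequential(
                nn.Linear(window + 1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

        def forward(self, y):                     # y: (batch, window) of past observations
            alpha = torch.sigmoid(self.alpha_raw)
            level = y[:, 0]
            for t in range(1, y.shape[1]):        # classical recursion, but differentiable
                level = alpha * y[:, t] + (1 - alpha) * level
            x = torch.cat([y, level.unsqueeze(-1)], dim=-1)
            return self.head(x).squeeze(-1)       # one-step-ahead forecast

    model = SmoothedForecaster()
    y_past = torch.randn(8, 12)
    y_next = torch.randn(8)
    loss = nn.functional.mse_loss(model(y_past), y_next)
    loss.backward()                               # gradients flow through alpha and the MLP together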
"Vanilla deep learning models are statistical models (a la linear regression) and not probabilistic models (a la Gaussian mixture). It is important to maintain the distinction."
Isn't this just a matter of interpretation of the models? You can interpret linear regression in a Bayesian way and say that the prediction of the linear model is the MAP of the mean, you can also calculate the variance, the l2 norm objective is saying the distribution of errors is normally distributed, l2 regularisation is a normal prior on the coefficients, etc, etc? All the same stuff can be applied to deep learning models.
Maybe I don't understand your distinction between statistical and probabilistic though?
> Isn't this just a matter of interpretation of the models?
Not really. This is the classic frequentist vs Bayesian debate. In frequentist-land, you are computing point estimates of the model parameters. In Bayesian-land, you are computing distribution estimates of the model parameters. It is true that there is a difference in interpretation of the generative process but the two choices demand fundamentally different models because of the decision about which of the parameters or data are considered "real" and which are considered "generated".
I think a more abstract/general way to put it is: "statistics" is concerned with statistical summary values (i.e. mean-field estimates over measures) while "probability" is concerned more with distributions (i.e., topologies of measures). I'm not sure this is a rigorously correct way to characterize it, but it illustrates the intuition I'm trying to convey.
Statistics as practiced today (1930s until now?) consists almost entirely of making inferences about unobserved probability distributions. That includes nonparametric statistics, and frequentist versus Bayesian has nothing to do with it.
There are some probability models that are not really statistical models, but there are few or no statistical models that are not also probability models.
Least-squares regression is a probability model. Even if you don't particularly care about the error distribution, you are still estimating a conditional expectation and setting a conditional independence assumption on the residuals. If that's not a probability model, then I don't know what it is!
Statistics can be summarized as one thing: n -> N. Does "little n" represent "big N"? In other words, does the sample generalize to the population? Statistics means something like "description of the state". It was born out of census samples, where larger populations had to be estimated from samples. "n" could be a handful of fish from a lake of "N" fish. "n" could also be the parameter estimated in a linear regression from the sample of data collected, while "N" is the true parameter of the relationship if we had all the data. Point estimation is about finding the needle in the haystack, but much more often statistics is about finding the haystack given the needle. One tool statistics uses to get to the haystack is probability.
A point estimate of distribution parameters describing a population is frequentist. A point estimate of distribution parameters describing another distribution's parameter is Bayesian.
How the parameters are estimated is not the message.
In statistics there are latin letters and greek letters. When you see a symbol denoted as a greek letter then that is a population parameter. When you see a latin letter that is a sample estimate. It could be Frequentist, Bayesian, Likelihoodist, Fiducial, Empirical Bayes, etc. Theoretical population greeks or sample calculated latins.
I have a very limited statistical background, but doesn't variational inference applied to neural networks make them probabilistic models? The modelling definitely seems so, because the math in those papers doesn't even specify that the model is a network (it implies that it can be any model).
Yes indeed. This synthesis of concepts is a great illustration of moving beyond hardened dichotomies in this research space and I believe similar approaches will be fruitful in the years to come.
They are all univariate models: some are trained offline on a bunch of different series before being applied (deep learning, “global” models), others are applied directly to each series to forecast (“statistical”, “local” models), but the task is the same univariate time series prediction for every model there.
I wish we could start moving to better approaches for evaluating time series forecasts. Ideally, the forecaster reports a probability distribution over time series, then we evaluate the predictive density with regard to an error function that is optimal for the intended application of the forecast at hand.
I use my package https://github.com/alexhallam/tablespoon to generate naive forecasts then evaluate the crps of the naive vs the crps of the alternative method. This “skill score” approach is very good.
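For Gaussian predictive distributions the CRPS has a closed form, so the skill-score comparison is cheap. A sketch (the observations, the "naive" parameters, and the pretend model forecasts below are all invented for illustration):

    import numpy as np
    from scipy.stats import norm

    def crps_gaussian(y, mu, sigma):
        """CRPS of a N(mu, sigma) predictive distribution at observation y (closed form)."""
        z = (y - mu) / sigma
        return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

    # Illustrative forecasts for 12 future observations.
    y_obs = np.array([103., 98., 110., 107., 101., 99., 104., 112., 95., 102., 108., 100.])
    naive_mu, naive_sigma = 100.0, 8.0    # e.g. last observed value, historical residual sd
    model_mu = y_obs + np.random.default_rng(0).normal(0, 3, 12)  # pretend model forecasts
    model_sigma = 3.0

    crps_naive = crps_gaussian(y_obs, naive_mu, naive_sigma).mean()
    crps_model = crps_gaussian(y_obs, model_mu, model_sigma).mean()
    print("skill score:", 1 - crps_model / crps_naive)  # > 0 means the model beats naive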
What is the point of this kind of comparison? It is completely dependent on the 3000 datasets they chose to use. You're not going to find that one method is better than another in general, or find some type of time series for which you can make a specific methodological recommendation (unless that series is specifically constructed with a mathematical feature, like stationarity).
What matters is "which method is better for MY data?", but that's not something an academic can study. You just have to test a few different things.
What deep learning could instead be used for in this case is to incorporate more data, like text describing events that affects macroeconomics when doing macroeconomic predictions.
Hmmm. Not sure why they use M3 data when there is already M4 where a deep learning model won. I know because I reimplemented it as a toy version in python here: https://github.com/leanderloew/ES-RNN-Pytorch
It was actually very cool because the model was a blend of exponential smoothing and DL.
I can have some interest in, hope for, etc., machine learning. One reason is that, for the curve-fitting methods of classic statistics, i.e., versions of regression, the math assumptions that give some hope of good results are essentially impossible to verify and look like they will hold closely only rarely. So, even when using such statistics, good advice is to have two steps: (1) apply the statistics, i.e., fit, using half the data and then (2) verify, test, check using the other half. But, gee, those two steps are also common in machine learning. Sooo, if we can't find much in classic math theorems and proofs to support machine learning, then we are just put back to the two steps statistics has had to use anyway.
So, if we have to use the two steps anyway, then the possible advantages of non-linear fitting have some promise.
So, to me, a larger concern comes to the top: In my experience in such things, call it statistics, optimization, data analysis, whatever, a huge advantage is bringing to the work some understanding that doesn't come with the data and/or really needs a human. The understanding might be about the real problem or about some mathematical methods.
E.g., once some guys had a problem in optimal allocation of some resources. They had tried simulated annealing, run for days, and quit without knowing much about the quality of the results.
I took the problem as 0-1 integer linear programming, a bit large, 600,000 variables, 40,000 constraints, and in 900 seconds on a slow computer, with Lagrangian relaxation, got a feasible solution guaranteed, from the bounding, to be within 0.025% of optimality. The big advantages were understanding the 0-1 program, seeing a fast way to do the primal-dual iterations, and seeing how to use Lagrangian relaxation. My guess is that it would be tough for some very general machine learning to compete much short of artificial general intelligence.
One way to describe the problem with the simulated annealing was that it was just too general, didn't exploit what a human might understand about the real problem and possible solution methods selected for that real problem.
I have a nice collection of such successes where the keys were some insight into the specific problems and some math techniques, that is, some human abilities that would seem to need machine learning to have artificial general intelligence to compete. With lots of data, lots of computing, and the advantages of non-linear operations, at times machine learning might be the best approach even now.
Net, still, in many cases, human intelligence is tough to beat.
A point about gradient-free methods such as simulated annealing and genetic algorithms: the transition (sometimes called "neighbor") function is the most important part by far. The most important insight is the most obvious one in some way: if your task is to search a problem space efficiently for an optimal solution, it pays to know exactly how to move from where you are to where you want to be in that problem space. To that point, (the structure of) transitions between successive state samples should be refined to your specific problem and encoding of the domain in order to be useful in any reasonable amount of time.
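A minimal sketch of that point: the annealing loop itself is generic boilerplate, and all the leverage is in the neighbor function, which should encode what you know about the problem. Here it's a toy 0-1 knapsack with a swap move rather than blind bit flips (all numbers made up):

    import math, random

    random.seed(0)
    values  = [random.randint(1, 100) for _ in range(200)]
    weights = [random.randint(1, 50)  for _ in range(200)]
    capacity = 2000

    def score(x):                       # 0/1 knapsack objective with infeasibility penalty
        w = sum(wi for wi, xi in zip(weights, x) if xi)
        v = sum(vi for vi, xi in zip(values,  x) if xi)
        return v if w <= capacity else v - 10 * (w - capacity)

    def neighbor(x):
        # The domain-specific part: a swap move (drop one chosen item, add one unchosen)
        # keeps the solution near the capacity boundary instead of random bit flips.
        x = x[:]
        ones  = [i for i, xi in enumerate(x) if xi]
        zeros = [i for i, xi in enumerate(x) if not xi]
        if ones and zeros and random.random() < 0.8:
            x[random.choice(ones)] = 0
            x[random.choice(zeros)] = 1
        else:
            i = random.randrange(len(x))
            x[i] = 1 - x[i]
        return x

    x = [0] * 200
    best, t = score(x), 50.0
    for step in range(20000):
        cand = neighbor(x)
        delta = score(cand) - score(x)
        if delta > 0 or random.random() < math.exp(delta / t):
            x = cand
        best = max(best, score(x))
        t *= 0.9995                      # cooling schedule
    print(best)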
> the transition (sometimes called "neighbor") function is the most important part by far.
And, indeed, in the 0-1 integer linear programming with Lagrangian relaxation I used, there is nothing differentiable, so it should be counted as "gradient free". And the linear programming part and the Lagrangian part do "move" from where we are to closer to where we want to be.
A thing is, the bag of tricks, techniques, that work here is large. So, right, should use knowledge of the real problem to pick what tricks to use.
Clean data would benefit most models, not just non-deep learning models. Missing data introduces bias even in DL models.
- they focus on linear relationships and not complex joint distributions
1. If seasonality is present, which is usually the case in practical business problems, then you will find that actual ~ lag_actuals explains most of the variance with a linear relationship (see the sketch at the end of this comment). Non-linearities in time series are not something that I see often. You can usually make a feature that explains those away linearly.
- they focus on fixed temporal dependence that must be diagnosed and specified a priori
Not sure what you are saying here.
- they take as input univariate, not multiple-input, data
That is not the case. Time series regression can take many lags for inputs.
- they focus on one-step forecasts, not long time horizons
False. Time series regression models are used to forecast revenue many years into the future in the business world.
- they’re highly parameterized and rigid to assumptions
So is F=ma
- they fail for cold start problems
Because a cold start is not a time series data set. Why would time series methods work on non-time-series data?
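To make the lag_actuals point concrete, a toy sketch (numpy only, simulated monthly data, not from any real business problem):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 120  # ten years of monthly data
    month = np.arange(n) % 12
    y = 100 + 0.5 * np.arange(n) + 15 * np.sin(2 * np.pi * month / 12) + rng.normal(0, 3, n)

    # actual ~ lag_actual + month dummies: a purely linear model.
    lag = y[:-1]
    target = y[1:]
    dummies = np.eye(12)[month[1:]]
    X = np.column_stack([np.ones(n - 1), lag, dummies[:, 1:]])  # drop one dummy to avoid collinearity

    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    fitted = X @ beta
    r2 = 1 - np.sum((target - fitted) ** 2) / np.sum((target - target.mean()) ** 2)
    print(f"R^2 of lag + seasonal dummies: {r2:.3f}")  # explains most of the variance here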
I appreciate the response but I admit I'm a bit confused - maybe it's the case that you're not familiar with the need for modern forecasting? Some of your 'defense' is just flat out wrong. A few points:
- You write:
1. "Clean data would benefit most models, not just non-deep learning models. Missing data introduces bias even in DL models." and 2. "Because a cold start is not a time series data set. Why would time series methods work on non time series data."
Clean data is good, yes, thanks for the insight. But 1. missing data is a huge issue for model quality, particularly for highly-parameterized state space models if you're missing key aspects (e.g. an ETS model with missing data for trend is an awful model) and 2. cold start is certainly a time series concept (and one with lots of ongoing research at that). Here's an example: At Amazon, thousands of new products launch each week. Many of those new products have zero or very limited historical data, but metrics (e.g. demand) still need to be forecasted. Despite your claim that such an issue doesn't exist, this is a cold start time series problem. Modern techniques exist to handle this issue, such as DeepAR, which handles the cold start issue by training a global model for probabilistic forecasts (https://arxiv.org/pdf/1704.04110.pdf)
- You write: "That is not the case. Time series regression can take many lags for inputs"
Of course, but that's not what was written. I suggest you google the difference between multiple and multivariate time series
- You write: "False. Time series regression models are used to forecast revenue many years into the future in the business world"
In the 'business world'(?) you can use a forecasting model to do multi-step ahead prediction (aka k-step ahead error propagation), but I'm referring specifically to direct multi-horizon forecasting (e.g. https://arxiv.org/pdf/1711.11053.pdf)
- You write: "Not sure sure you are saying here."
I'm describing verbatim the process by which the classical forecasting techniques described work.
- You write: "So is F=ma"
This is agreeing with me. I think maybe your intent was to write something with a nonlinearity, but I’m not sure you’re sure what point you’re trying to make here.
- You write: "1. If seasonality is present. Which is usually the case in practical business problems then you will find that actual ~ lag_actuals explains most of the variance with a linear relationship. non-linearities in time series is not something that I see often. You can usually make a feature that explains those away linearly."
You ignored the bit about joint distributions, and you gave an example of lag features you've seen. Non-linearities are common in many time series applications (e.g. weekly returns of <currency> foreign exchange rates, <stock> realized volatility). You can handle them with classical techniques, but it's painful if you do something more complex than just introducing a feature and hoping it captures the missing variance.
You can fit a state space model to missing data in many APIs, sure, but that doesn't mean it's supported in any meaningful way.
Imagine you have a new product with 6 months of missing data. You can feed that into ETS/ARIMA/whatever, but you're not getting any valuable output for those missing point-value estimates.
A lot of the M3 datasets we use are high-frequency, with large seasonal inputs. Considering that Gaussian Process (GP) complexity is O(N^3), a careful study of their performance would be challenging.
Also... I'm not aware of any efficient GP Python implementations.
GPs over time series can leverage low-dimensional index sets for O(N lg N) fitting and inference. This can be done by interpolating the inputs onto a regular grid which admits Toeplitz kernels. See https://arxiv.org/abs/1503.01057.
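The Toeplitz trick in isolation looks roughly like this (numpy only; just the O(N log N) kernel matvec on a regular grid, not a full GP fit, and the kernel/lengthscale are arbitrary):

    import numpy as np

    n = 4096
    t = np.arange(n, dtype=float)                   # regular time grid
    first_col = np.exp(-0.5 * (t / 25.0) ** 2)      # stationary (RBF) kernel -> Toeplitz K

    def toeplitz_matvec(col, v):
        """K @ v in O(N log N) by embedding the symmetric Toeplitz matrix in a circulant."""
        c = np.concatenate([col, col[-2:0:-1]])     # circulant embedding of length 2N - 2
        fv = np.fft.rfft(np.concatenate([v, np.zeros(len(c) - len(v))]))
        return np.fft.irfft(np.fft.rfft(c) * fv, n=len(c))[:len(v)]

    v = np.random.default_rng(0).normal(size=n)
    fast = toeplitz_matvec(first_col, v)

    # Check against the dense O(N^2) product on a small prefix.
    K = np.exp(-0.5 * ((t[:512, None] - t[None, :512]) / 25.0) ** 2)
    print(np.allclose(K @ v[:512], toeplitz_matvec(first_col[:512], v[:512])))  # True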
[1] https://forecasters.org/resources/time-series-data/m3-compet...