The conclusion that a low-complexity statistical ensemble is almost as good as a (computationally) complex Deep Learning model should not come as a surprise, given the data.
The dataset[1] used here consists of 3003 time series from the M3 competition run by the International Journal of Forecasting. Almost all of these are sampled at a yearly, quarterly, or monthly frequency, each with typically 40 to 120 observations ("samples" in Machine Learning lingo), and the task is to forecast a few months/quarters/years out of sample. Most experienced Machine Learners will realize that there is probably limited value in fitting a high-complexity n-layer Deep Learning model to 120 data points to try to predict the next 12. If you have daily or intraday (hourly/minute/second) time series, more complex models might become more worthwhile, but such series are barely represented in the dataset.
To me the most surprising result was just how badly AutoARIMA performed. Seasonal ARIMA was one of the traditional go-to methods for this kind of data.
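For context, "AutoARIMA" just means an automated search over seasonal ARIMA orders. A rough sketch of what that looks like in practice, using pmdarima on synthetic monthly data (illustrative only; whichever implementation the benchmark actually uses may differ):

    # Illustrative only: automatic seasonal ARIMA on a synthetic monthly series.
    import numpy as np
    import pmdarima as pm

    rng = np.random.default_rng(0)
    t = np.arange(120)
    # trend + yearly seasonality + noise
    y = 10 + 0.1 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, 120)

    # auto_arima searches over (p, d, q)(P, D, Q, m) orders by information criterion.
    model = pm.auto_arima(y, seasonal=True, m=12, suppress_warnings=True)
    print(model.order, model.seasonal_order)
    print(model.predict(n_periods=12))  # forecast the next 12 months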
If the task is to predict the next 12 values from a sample of 120 previous values, drawn from some computationally simple statistical process, it's much cheaper and easier to use old-fashioned, tried-and-true statistical methods.
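To make that concrete, a "low-complexity statistical ensemble" can be as simple as averaging a few classical forecasters. A sketch on synthetic data (the model choices here are mine, not necessarily the ones in the benchmark):

    # Sketch: average three classical forecasters on a 120-point monthly series.
    import numpy as np
    from statsmodels.tsa.holtwinters import ExponentialSmoothing
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(1)
    t = np.arange(120)
    y = 10 + 0.1 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, 120)
    h = 12  # forecast horizon

    # Holt-Winters with additive trend and yearly seasonality.
    f_ets = ExponentialSmoothing(y, trend="add", seasonal="add",
                                 seasonal_periods=12).fit().forecast(h)

    # A small fixed-order seasonal ARIMA.
    f_arima = ARIMA(y, order=(1, 1, 1),
                    seasonal_order=(0, 1, 1, 12)).fit().forecast(h)

    # Seasonal naive: repeat the last observed year.
    f_snaive = y[-12:]

    # The "ensemble" is just the pointwise average of the three forecasts.
    print((f_ets + f_arima + f_snaive) / 3)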
If the task is to predict millions of pixel values that make up an original work of art, or the pixel values over time that make up a deep-fake video, or the next set of values encoding the next best possible play in a game of Go, or the set of values that encode the entire structure of a protein, and so on, then you have no choice: You must use a deep neural network. Simple methods cannot do any of that.
Sure, with more data and a high signal-to-noise ratio you might have a substantial advantage with DL. But lots of data has a low signal-to-noise ratio, which again favours parametric models.
[1] https://forecasters.org/resources/time-series-data/m3-compet...