> In recent years, transformer-based models have gained prominence in multivariate long-term time series forecasting
Prominence, yes. But are they generally better than non-deep learning models? My understanding was that this is not the case, but I don't follow this field closely.
From experience in payments/spending forecasting, I've found that deep learning models generally underperform gradient-boosted tree models. Deep learning models tend to be good at learning seasonality but do not handle complex trends or shocks very well. Economic/financial data tends to have straightforward seasonality with complex trends, so deep learning tends to do quite poorly.
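For concreteness, the kind of tree-based setup I mean is roughly the sketch below: gradient-boosted trees on lagged and calendar features. The feature choices, model settings, and placeholder data are illustrative only, not an actual pipeline.

```python
# Sketch only: gradient-boosted trees on lagged + calendar features for a
# daily spending-style series. Features, settings, and data are placeholders.
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

def make_features(y: pd.Series, n_lags: int = 14) -> pd.DataFrame:
    df = pd.DataFrame({"y": y})
    for lag in range(1, n_lags + 1):
        df[f"lag_{lag}"] = y.shift(lag)          # autoregressive features
    df["dayofweek"] = y.index.dayofweek          # simple seasonality features
    df["month"] = y.index.month
    return df.dropna()

idx = pd.date_range("2022-01-01", periods=730, freq="D")
y = pd.Series(np.random.randn(730).cumsum() + 100.0, index=idx)  # placeholder data

feats = make_features(y)
X, target = feats.drop(columns="y"), feats["y"]
X_train, y_train = X[:-30], target[:-30]         # hold out the last 30 days
X_test, y_test = X[-30:], target[-30:]

model = HistGradientBoostingRegressor(max_depth=6)
model.fit(X_train, y_train)
pred = model.predict(X_test)                     # one-step-ahead forecasts
```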
I do agree with this paper - all of the good deep learning time series architectures I've tried are simple extensions of MLPs or RNNs (e.g. DeepAR or N-BEATS). The transformer-based architectures I've used have been absolutely awful, especially the endless stream of transformer-based "foundational models" that are coming out these days.
Transformers are just MLPs with extra steps, so in theory they should be just as powerful. The problem with transformers is simultaneously their big advantage: they scale extremely well with larger networks and more training data, better than any other architecture out there. So if you had enormous datasets and an unlimited compute budget, you could probably do amazing things in this regard as well. But if you're just a mortal data scientist without extra funding, you will be better off with more traditional approaches.
I think what you say is true when comparing transformers to CNNs/RNNs, but not to MLPs.
Transformers, RNNs, and CNNs are all techniques to reduce parameter count compared to a pure-MLP model. If you took a transformer model and replaced each self-attention layer with a linear layer + activation function, you'd have a pure MLP model that can model every relationship the transformer does, and more possible relationships besides (at the cost of vastly more parameters). MLPs are more powerful/scalable but transformers are more efficient.
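That thought experiment looks roughly like this in PyTorch (my sketch, with arbitrary sizes): a pre-norm block where the token-mixing step is either self-attention or one big linear layer + activation over the flattened sequence.

```python
# Sketch of the "swap attention for a linear layer" thought experiment.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, seq_len=96, d_model=64, use_attention=True):
        super().__init__()
        self.use_attention = use_attention
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        if use_attention:
            self.mixer = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        else:
            # "pure MLP" variant: every (position, channel) pair can interact
            self.mixer = nn.Sequential(
                nn.Flatten(start_dim=1),
                nn.Linear(seq_len * d_model, seq_len * d_model),
                nn.GELU(),
                nn.Unflatten(1, (seq_len, d_model)),
            )
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        h = self.norm1(x)
        mixed = self.mixer(h, h, h)[0] if self.use_attention else self.mixer(h)
        x = x + mixed
        return x + self.ff(self.norm2(x))

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(Block(use_attention=True)))   # ~50k: mixing cost ~4 * d_model**2
print(n_params(Block(use_attention=False)))  # ~37.8M: mixing cost ~(seq_len * d_model)**2
```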
Compared to MLPs, transformers save on parameter count by skimping on the number of parameters devoted to modeling the relationships between tokens. This works in language modeling, where relationships between tokens aren't that important - you can jumble up the words in this sentence and it still mostly makes sense. It doesn't work in time series, where relationships between tokens (timesteps) are the most important thing of all. The LTSF paper linked in the OP paper also mentions this same problem: https://arxiv.org/pdf/2205.13504 (see section 1)
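The linear baselines in that paper are strikingly simple; roughly something like the sketch below (the decomposition/normalization variants are omitted and the sizes are made up):

```python
# Rough sketch of the one-layer linear baseline family from the linked LTSF
# paper (Zeng et al., 2022); details simplified, sizes are placeholders.
import torch
import torch.nn as nn

class LinearForecaster(nn.Module):
    def __init__(self, lookback: int, horizon: int):
        super().__init__()
        self.proj = nn.Linear(lookback, horizon)  # one weight per (past step, future step)

    def forward(self, x):                         # x: (batch, lookback, channels)
        return self.proj(x.transpose(1, 2)).transpose(1, 2)  # (batch, horizon, channels)

model = LinearForecaster(lookback=336, horizon=96)
y_hat = model(torch.randn(8, 336, 7))             # e.g. a 7-variate input window
```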
Though I agree with the idea that MLPs are theoretically more "capable" than transformers, I think seeing them just as a parameter reduction technique is also excessively reductive.
Many have tried to build deep and large MLPs over the years, but at some point adding more parameters stopped improving their performance.
In contrast, transformers became so popular because their modelling power just kept scaling with more and more data and more and more parameters. It seems like the 'restriction' imposed on transformers (the attention structure) is a very good functional form for modelling language (and, increasingly, some tasks in vision and audio).
They did not become popular because they were modest with respect to the parameters used.
>Compared to MLPs, transformers save on parameter count by skimping on the number of parameters
That is only correct if you look at models with equal parameter counts from a purely theoretical perspective. In practice, it is possible to train transformers to orders-of-magnitude bigger scales than MLPs because they are so much more efficient. That's why I said a modern transformer will easily beat these comparatively puny MLP models, but only in cases where data and compute budgets allow it. That is not even a question. If you look at recent time series forecasting leaderboard entries, you'll almost always see transformers at or near the top: https://github.com/thuml/Time-Series-Library
Transformers reduce the number of relationships between tokens that must be learned, too. An MLP has to separately learn all possible relationships between token 1 and 2, and 2 and 3, and 3 and 4. A transformer can learn relationships between specific values regardless of position.
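A tiny demo of the "regardless of position" point (my example, not the commenter's): without positional encodings, self-attention produces the same (correspondingly shifted) output when the same values show up at different positions in the window, whereas a dense layer over the flattened window ties its weights to absolute positions.

```python
# Demo: self-attention without positional encodings is shift-equivariant.
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=8, num_heads=1, batch_first=True)
x = torch.randn(1, 6, 8)                         # (batch, timesteps, features)
shifted = torch.roll(x, shifts=2, dims=1)        # same values, moved by two steps

out, _ = attn(x, x, x)
out_shifted, _ = attn(shifted, shifted, shifted)
print(torch.allclose(torch.roll(out, 2, dims=1), out_shifted, atol=1e-6))  # True
```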
In my aviation safety work, deep learning outperforms traditional non-DL models for multivariate time-series forecasting. Among deep learning models, I've seen wide variance in performance between transformers, Bi-LSTMs, regular MLPs, VAEs, and so on.
If you have short time series with low variance, little noise and few outliers, strong prior knowledge, or limited resources to train and maintain a model, I would stick with simpler traditional models.
If DL is a good fit for your use-case, then I tend to like transformers or combining CNNs with recurrent models (e.g., BiGRU, GRU, BiLSTM, LSTM) and optional attention.
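One possible shape of that CNN + recurrent + attention combination, as a PyTorch sketch (the layer sizes and the simple attention pooling are arbitrary choices of mine, not a recommended configuration):

```python
# Sketch: Conv1d front-end -> BiGRU -> attention pooling -> forecast head.
import torch
import torch.nn as nn

class CNNBiGRUAttn(nn.Module):
    def __init__(self, n_features: int, horizon: int, hidden: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.gru = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.attn_score = nn.Linear(2 * hidden, 1)   # simple attention pooling over time
        self.head = nn.Linear(2 * hidden, horizon)

    def forward(self, x):                            # x: (batch, time, features)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)   # local patterns
        h, _ = self.gru(h)                                  # (batch, time, 2*hidden)
        w = torch.softmax(self.attn_score(h), dim=1)        # attention weights over time
        ctx = (w * h).sum(dim=1)                            # weighted summary of the window
        return self.head(ctx)                               # (batch, horizon)

model = CNNBiGRUAttn(n_features=12, horizon=24)
y_hat = model(torch.randn(32, 96, 12))                      # placeholder batch
```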
While I don't have firsthand experience with these models, I recently discussed this topic with a friend who has used tree-based models like XGBoost for time series analysis. They noted that transformer-based architectures tend to yield decent performance on time series tasks with relatively little effort compared to tree models.
From what I understood, tree-based models can usually outperform transformers when given sufficient parameter tuning. However, models like TimeGPT offer decent performance without extensive tuning, making them an attractive option for quicker implementations.
A part of my work is literally building nowcasting and other types of prediction models in economics (inflation, GDP, etc.) and finance (market liquidity, etc.). I haven't yet had a chance to read the paper, but overall the tone of "transformers are great for what they do, but LSTM-type models are still very valuable" completely resonates with me.
No, Graphcast is a graph transformer trained on ERA5 weather reconstructions of the atmosphere, not a general time series prediction model. It, by the way, outperforms all traditional global point forecasts (non-ensembles), at least on predicting large-scale global patterns (Z500 and such, at lead times of 3–10 days or so). ECMWF has AIFS, which is a derivative of Graphcast; they'll probably get it, or something similar, into production in a couple of years.
I'd say that's kind of a different task. I'm not a pro in this, but you could maybe treat it as a multi-variate forecast problem where the targets are probabilities per event if n is really small?
I can't speak for all use cases, but I've done a great deal of work in the space of using deep learning approaches for anomaly detection in network device telemetry. In particular, with high-resolution univariate time series of latency measurements, we saw success using convolutional autoencoders and GANs. These methods lean on reconstruction loss rather than forecasting, but are still effective.
There is some prior art for this that we leaned on [1][2].
RE: transformers — I did some early experimentation with Temporal Fusion Transformers [3] which worked pretty well for forecasting compared to other deep learning methods, but rarely did I see it outperform standard baselines (like ARIMA) in our datasets.
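For a sense of what the reconstruction-loss approach looks like, here is a very rough sketch (mine, not the actual production system): a 1-D convolutional autoencoder over fixed-length windows of a univariate latency series, flagging windows whose reconstruction error crosses a simple threshold. The architecture and the 3-sigma rule are placeholders.

```python
# Sketch: conv autoencoder + reconstruction-error threshold for anomaly detection.
import torch
import torch.nn as nn

class ConvAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(16, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):                     # x: (batch, 1, window)
        return self.decoder(self.encoder(x))

model = ConvAE()                              # would normally be trained on "normal" windows
x = torch.randn(64, 1, 128)                   # placeholder latency windows
recon = model(x)
err = ((recon - x) ** 2).mean(dim=(1, 2))     # per-window reconstruction error
threshold = err.mean() + 3 * err.std()        # e.g. a simple 3-sigma rule
anomalous = err > threshold
```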
There is no such thing as a generally best model, due to the no-free-lunch theorem. What works in hedge funds will be bad in other areas that need fewer or different inductive biases, because those areas have more or less data, and different kinds of data.
Some funds that tried to recruit me were really interested in classical generative models (ARMA, GARCH, HMMs with heavy-tailed emissions, etc.) extended with deep components to make them more flexible. Pyro and Kevin Murphy's ProbML vol II are a good starting point to learn more about these topics.
The key is to understand that in some of these problems, data is relatively scarce, and it is really important to quantify uncertainty.
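As a minimal sketch of that "classical generative model with a full posterior" flavor, here is an AR(1) with heavy-tailed Student-t noise in Pyro, fit with NUTS so the forecast comes with uncertainty. The priors and the data are placeholders of mine, not anything from an actual fund's setup.

```python
# Sketch: AR(1) with Student-t emissions, full posterior via NUTS in Pyro.
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS

def ar1_student_t(y):
    phi = pyro.sample("phi", dist.Uniform(-1.0, 1.0))    # AR coefficient (placeholder prior)
    sigma = pyro.sample("sigma", dist.HalfNormal(1.0))   # noise scale
    nu = pyro.sample("nu", dist.Gamma(2.0, 0.1))         # tail heaviness
    with pyro.plate("time", len(y) - 1):
        pyro.sample("obs", dist.StudentT(nu, phi * y[:-1], sigma), obs=y[1:])

y = torch.randn(300)                                     # placeholder series
mcmc = MCMC(NUTS(ar1_student_t), num_samples=500, warmup_steps=500)
mcmc.run(y)
posterior = mcmc.get_samples()                           # draws of phi, sigma, nu

# One-step-ahead predictive draws, rather than a single point forecast.
pred = dist.StudentT(posterior["nu"], posterior["phi"] * y[-1], posterior["sigma"]).sample()
```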
I know next to nothing about this. How do people make use of forecasts that don't provide an uncertainty estimate? It seems like that's the most important part. Why hasn't Bayesian statistics taken over completely?
Bayesian inference is costly and adds a significant amount of complexity to your workflow. But yes, I agree, the way uncertainty is handled is often sloppy.
Maximum likelihood estimates are very frequently atypical points in the posterior distribution. It is unsettling to hear people are using this and not computing the entire posterior.
For example, satellite imagery of trucking activity correlated to specific companies or industries.
It's all signal processing at some level, but directly modeling the time series of price or other asset metrics doesn't have the alpha it may have had decades ago.
Time series forecasting works best in deterministic domains. None of the published LLM/AI/deep/machine learning techniques do well in the stock market. Absolutely none. We've tried them all.