XLSTMTime: Long-Term Time Series Forecasting with xLSTM (arxiv.org)
231 points by beefman 3 months ago | 53 comments



> In recent years, transformer-based models have gained prominence in multivariate long-term time series forecasting

Prominence, yes. But are they generally better than non-deep learning models? My understanding was that this is not the case, but I don't follow this field closely.


From experience in payments/spending forecasting, I've found that deep learning generally underperforms gradient-boosted tree models. Deep learning models tend to be good at learning seasonality but don't handle complex trends or shocks very well. Economic/financial data tends to have straightforward seasonality with complex trends, so deep learning tends to do quite poorly.
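
A rough sketch of that kind of setup, framing the forecast as supervised regression on lag features with a gradient-boosted model (scikit-learn here, with a toy series and a made-up lag window, not my production pipeline):

    import numpy as np
    from sklearn.ensemble import HistGradientBoostingRegressor

    def make_lag_features(y, n_lags=14):
        # Each row holds the previous n_lags values; the target is the next value.
        X, t = [], []
        for i in range(n_lags, len(y)):
            X.append(y[i - n_lags:i])
            t.append(y[i])
        return np.array(X), np.array(t)

    # Toy daily series: weekly seasonality plus noise.
    rng = np.random.default_rng(0)
    y = 100 + 10 * np.sin(2 * np.pi * np.arange(730) / 7) + rng.normal(0, 2, 730)

    X, t = make_lag_features(y)
    split = len(X) - 30  # hold out the last 30 days
    model = HistGradientBoostingRegressor().fit(X[:split], t[:split])
    print(model.predict(X[split:])[:5])  # one-step-ahead predictions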

I do agree with this paper - all of the good deep learning time series architectures I've tried are simple extensions of MLPs or RNNs (e.g. DeepAR or N-BEATS). The transformer-based architectures I've used have been absolutely awful, especially the endless stream of transformer-based "foundational models" that are coming out these days.


Transformers are just MLPs with extra steps, so in theory they should be just as powerful. The problem with transformers is simultaneously their big advantage: they scale extremely well with larger networks and more training data, better than any other architecture out there. So if you had enormous datasets and an unlimited compute budget, you could probably do amazing things in this regard as well. But if you're just a mortal data scientist without extra funding, you will be better off with more traditional approaches.


I think what you say is true when comparing transformers to CNNs/RNNs, but not to MLPs.

Transformers, RNNs, and CNNs are all techniques to reduce parameter count compared to a pure-MLP model. If you took a transformer model and replaced each self-attention layer with a linear layer plus activation function, you'd have a pure MLP model that can model every relationship the transformer does, plus many more, but at the cost of tons more parameters. MLPs are more powerful/scalable but transformers are more efficient.
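
Back-of-the-envelope numbers for that parameter cost (illustrative sizes only, ignoring biases and the feed-forward blocks):

    # Sequence length L and model width d (made-up, roughly GPT-2-ish values).
    L, d = 512, 768

    # Single-head self-attention layer: Q, K, V and output projections, each d x d.
    attention_params = 4 * d * d

    # Dense layer that mixes all tokens at once: maps the flattened (L*d) input
    # to an (L*d) output, so it can express arbitrary cross-token relationships.
    full_mlp_params = (L * d) ** 2

    print(f"{attention_params:,}")  # 2,359,296 (~2.4 million)
    print(f"{full_mlp_params:,}")   # 154,618,822,656 (~155 billion)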

Compared to MLPs, transformers save on parameter count by skimping on the number of parameters devoted to modeling the relationship between tokens. This works in language modeling, where relationships between tokens aren't that important - you can jumble up the words in this sentence and it still mostly makes sense. This doesn't work in time series, where relationships between tokens (timesteps) are the most important thing of all. The LTSF paper linked in the OP paper also mentions this same problem: https://arxiv.org/pdf/2205.13504 (see section 1)
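
The striking thing about that paper's counterexample is how little machinery it needs: its "Linear" baseline is essentially a single linear map from the lookback window to the forecast horizon. A minimal sketch of the same idea (plain least squares here rather than the paper's exact training setup):

    import numpy as np

    def fit_linear_forecaster(series, lookback=96, horizon=24):
        # Build (past window -> future window) pairs and solve for one weight
        # matrix by least squares: a single linear map, no attention.
        X, Y = [], []
        for i in range(len(series) - lookback - horizon + 1):
            X.append(series[i:i + lookback])
            Y.append(series[i + lookback:i + lookback + horizon])
        W, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)
        return W  # shape: (lookback, horizon)

    rng = np.random.default_rng(1)
    series = np.sin(np.arange(2000) * 0.1) + rng.normal(0, 0.1, 2000)
    W = fit_linear_forecaster(series)
    forecast = series[-96:] @ W  # next 24 steps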


Though I agree with the idea that MLPs are theoretically more "capable" than transformers, I think seeing them just as a parameter reduction technique is also excessively reductive.

Many have tried to build deep and large MLPs for a long time, but at some point adding more parameters stopped improving model performance.

In contrast, transformers became so popular because their modelling power just kept scaling with more and more data and more and more parameters. It seems like the 'restriction' imposed on transformers (the attention structure) is a very good functional form for modelling language (and, more and more, some tasks in vision and audio).

They did not become popular because they were modest with respect to the parameters used.


>Compared to MLPs, transformers save on parameter count by skimping on the number of parameters

That is only correct if you look at models with equal parameter counts from a purely theoretical perspective. In practice, it is possible to train transformers to orders of magnitude bigger scales than MLPs because they are so much more efficient. That's why I said a modern transformer will easily beat these puny modern MLPs, but only in cases where data and compute budgets allow it. That is not even a question. If you look at recent time series forecasting leaderboards, you'll almost always see transformers at or near the top: https://github.com/thuml/Time-Series-Library


Transformers reduce the number of relationships between tokens that must be learned, too. An MLP has to separately learn all possible relationships between token 1 and 2, and 2 and 3, and 3 and 4. A transformer can learn relationships between specific values regardless of position.


In my aviation safety work, deep learning outperforms traditional non-DL models for multivariate time-series forecasting. Among deep learning models, I've seen a wide variance in performance across transformers, Bi-LSTMs, regular MLPs, VAEs, and so on.


Seconding the other question, would be curious to know


What's your go-to model that generally performs well with little tuning?


If you have short time series with low variance, little noise and few outliers, strong prior knowledge, or limited resources to train and maintain a model, I would stick with simpler traditional models.

If DL is a good fit for your use-case, then I tend to like transformers or combining CNNs with recurrent models (e.g., BiGRU, GRU, BiLSTM, LSTM) and optional attention.
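
Roughly the shape such a model takes in PyTorch (a generic sketch with made-up sizes, not a description of my actual architecture):

    import torch
    import torch.nn as nn

    class ConvBiLSTMAttn(nn.Module):
        def __init__(self, n_features, hidden=64, horizon=1):
            super().__init__()
            # 1-D convolution over time extracts local temporal patterns.
            self.conv = nn.Conv1d(n_features, 32, kernel_size=3, padding=1)
            # BiLSTM captures longer-range structure in both directions.
            self.lstm = nn.LSTM(32, hidden, batch_first=True, bidirectional=True)
            # Simple additive attention pools the LSTM outputs over time.
            self.attn = nn.Linear(2 * hidden, 1)
            self.head = nn.Linear(2 * hidden, horizon)

        def forward(self, x):          # x: (batch, time, features)
            h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
            out, _ = self.lstm(h)      # (batch, time, 2 * hidden)
            w = torch.softmax(self.attn(out), dim=1)
            context = (w * out).sum(dim=1)
            return self.head(context)  # (batch, horizon)

    model = ConvBiLSTMAttn(n_features=8)
    print(model(torch.randn(4, 48, 8)).shape)  # torch.Size([4, 1])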


What are you doing in aviation safety that requires time series modeling? Weather?


My best guess would be accident occurrence prediction.


Now take into account that it has to be lightweight, and DL falls short.


While I don't have firsthand experience with these models, I recently discussed this topic with a friend who has used tree-based models like XGBoost for time series analysis. They noted that transformer-based architectures tend to yield decent performance on time series tasks with relatively little effort compared to tree models.

From what I understood, tree-based models can usually outperform transformers when given sufficient parameter tuning. However, models like TimeGPT offer decent performance without extensive tuning, making them an attractive option for quicker implementations.


The paper says this in the next paragraph. xLSTMTime is not transformer-based either.


They aren’t so hot, but recent efforts at transfer learning were promising.


A part of my work is literally building nowcasting and other types of prediction models in economics (inflation, GDP, etc.) and finance (market liquidity, etc.). I haven’t yet had a chance to read the paper, but overall the tone of “transformers are great for what they do, but LSTM-type models are still very valuable” completely resonates with me.


Have you had the chance to apply Mamba to your work at all? Thoughts?


Is this somehow related to the Google weather prediction model using AI [1]?

https://deepmind.google/discover/blog/graphcast-ai-model-for...


No, Graphcast is a graph transformer trained on ERA5 weather reconstructions of the atmosphere, not a general time series prediction model. By the way, it outperforms all traditional global point forecasts (non-ensembles), at least on predicting large-scale global patterns (Z500 and such, at lead times of 3–10 days or so). ECMWF has AIFS, a derivative of Graphcast; they'll probably get it or something similar into production in a couple of years.


AIFS is transformer-based (Graphcast is a pure GNN), so it's a different architecture, and it is already running operationally, see:

https://www.ecmwf.int/en/about/media-centre/aifs-blog/2024/i...


It's marketed as a forecasting tool, so is this not applicable to event classification in time series?


I'd say that's kind of a different task. I'm not a pro in this, but you could maybe treat it as a multi-variate forecast problem where the targets are probabilities per event if n is really small?


Yes, I would be interested in where this (or any transformer/LLM-based approach) improves anomaly detection, for example.


I can't speak for all use cases, but I've done a great deal of work in the space of using deep learning approaches for anomaly detection in network device telemetry. In particular, with high-resolution univariate time series of latency measurements, we saw success using convolutional autoencoders and GANs. These methods lean on reconstruction loss rather than forecasting, but are still effective.
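
For context, the reconstruction-error idea in roughly its simplest form (a generic sketch, not our production model):

    import torch
    import torch.nn as nn

    class ConvAutoencoder(nn.Module):
        # Compress a window of the series and reconstruct it; windows that
        # reconstruct poorly (high error) get flagged as anomalous.
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv1d(1, 16, 5, stride=2, padding=2), nn.ReLU(),
                nn.Conv1d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose1d(32, 16, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),
                nn.ConvTranspose1d(16, 1, 5, stride=2, padding=2, output_padding=1),
            )

        def forward(self, x):  # x: (batch, 1, window)
            return self.decoder(self.encoder(x))

    model = ConvAutoencoder()
    window = torch.randn(8, 1, 128)  # e.g. 128 latency samples per window
    recon = model(window)
    # Per-window reconstruction error; threshold it (e.g. at a high percentile
    # of errors on normal training data) to flag anomalies.
    error = ((recon - window) ** 2).mean(dim=(1, 2))
    print(error.shape)  # torch.Size([8])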

There is some prior art for this that we leaned on [1][2].

RE: transformers — I did some early experimentation with Temporal Fusion Transformers [3] which worked pretty well for forecasting compared to other deep learning methods, but rarely did I see it outperform standard baselines (like ARIMA) in our datasets.

[1] https://www.mdpi.com/2076-3417/12/23/12472

[2] https://arxiv.org/abs/2009.07769

[3] https://arxiv.org/abs/1912.09363
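
For anyone setting up that kind of comparison, the ARIMA baseline side is only a few lines with statsmodels (the series and the (p, d, q) order below are arbitrary, just to show the shape of the setup):

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(0)
    y = np.cumsum(rng.normal(0, 1, 500))         # toy series (random walk)

    fit = ARIMA(y[:-24], order=(2, 1, 2)).fit()  # order chosen arbitrarily
    baseline_forecast = fit.forecast(steps=24)   # compare this against the DL model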


Too bad the dataset link in the paper isn't working. I hope that'll get amended.


The best deep learning time series models are closed source inside hedge funds.


Most of the hard work is actually feature construction rather than monolithic models. And afaik gradient boosting still rules the world


There is no such thing as a generally best model, due to the no-free-lunch theorem. What works in hedge funds will be bad in other areas that need different inductive biases, because they have different amounts and kinds of data.


I think hedge funds, at least the advanced ones, definitely don't use time series modelling anymore. That's quite outdated nowadays.


There are many ways of approaching quantitative trading and many people do employ time series analysis, especially for high frequency trading.


What do you suspect they are using?


Some funds that tried to recruit me were really interested in classical generative models (ARMA, GARCH, HMMs with heavy-tailed emissions, etc.) extended with deep components to make them more flexible. Pyro and Kevin Murphy's ProbML vol II are a good starting point to learn more about these topics.

The key is to understand that in some of these problems, data is relatively scarce, and it is really important to quantify uncertainty.
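
As a concrete example of the classical building blocks involved, a GARCH(1,1) with heavy-tailed shocks takes only a few lines to simulate (the parameters below are made up):

    import numpy as np

    def simulate_garch(n=1000, omega=0.05, alpha=0.1, beta=0.85, seed=0):
        # GARCH(1,1): sigma_t^2 = omega + alpha * r_{t-1}^2 + beta * sigma_{t-1}^2
        rng = np.random.default_rng(seed)
        r = np.zeros(n)
        sigma2 = np.full(n, omega / (1 - alpha - beta))  # start at the long-run variance
        for t in range(1, n):
            sigma2[t] = omega + alpha * r[t - 1] ** 2 + beta * sigma2[t - 1]
            r[t] = np.sqrt(sigma2[t]) * rng.standard_t(df=5)  # heavy-tailed shocks
        return r, sigma2

    returns, variance = simulate_garch()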


I know next to nothing about this. How do people make use of forecasts that don't provide an uncertainty? It seems like that's the most important part. Why hasn't Bayesian statistics taken over completely?


Bayesian inference is costly and adds a significant amount of complexity to your workflow. But yes, I agree, the way uncertainty is handled is often sloppy.

Maximum likelihood estimates are very frequently atypical points in the posterior distribution. It is unsettling to hear people are using this and not computing the entire posterior.
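
A quick way to see the "atypical point" issue: in high dimensions, essentially no posterior mass sits near the mode. Taking a d-dimensional standard normal as a stand-in posterior, the mode is at the origin, yet samples concentrate at radius around sqrt(d):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 1000                                    # dimension of the "posterior"
    samples = rng.standard_normal((10000, d))   # draws from N(0, I)
    radii = np.linalg.norm(samples, axis=1)

    print(radii.mean())           # ~31.6, i.e. about sqrt(1000)
    print((radii < 1.0).mean())   # fraction of draws near the mode: essentially 0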


They pull data from all kinds of things now.

For example, satellite imagery of trucking activity correlated to specific companies or industries.

It's all signal processing at some level, but directly modeling the time series of price or other asset metrics doesn’t have the alpha it may have had decades ago.


Alternative data is passed into time series models. They are features.

You don’t know as much about this as you think


☝️


Time series forecasting works best in deterministic domains. None of the published LLM/AI/deep/machine learning techniques do well in the stock market. Absolutely none. We've tried them all.


Reminder: If someone's time series forecasting method worked, they wouldn't be publishing it.


They definitely would and do, the vast majority of time series work is not about asset prices or beating the stock market


The Transformer architecture, despite being one of the most successful in AI history, was still published.


It's a sequence model, not a time-series model. All time series are sequences but not all sequences are time series.


I misread this as XSLT :')


100% clicked thinking I was getting into an article about XML and wondering how interesting that was in 2024. Simultaneously disappointed and pleased.


Yup. And it's about transforms too.


Same. I am old?


Me too (and yes, I'm old)


Can't wait for someone to lose all their money trying to predict stocks with this thing


Wow, is there a way to apply this to financial trading?


The paper links to their code:

https://github.com/muslehal/xLSTMTime

And their data, which includes daily exchange rates:

https://drive.google.com/drive/folders/1nuMUIADOc1BNN-uDO2N7...


If you have a financial dataset, I can try it for you



