The paper focuses on a number of practical mistakes but overlooks some higher-level ones that I think are also important:
- Most ML algorithms perform well when there is some underlying phenomenon whose characteristics are being statistically inferred by the model. However, the stock market is now essentially a large network of computers trying to model one another. You can imagine how this might break some of the underlying assumptions that good ML results rely upon. We can see the results of this in the increasing frequency of "flash crashes" caused by over-leveraged quant hedge funds all tripping over sell triggers and/or getting margin calls at the same time.
- In practice, asset class/trading strategy correlations tend to change dramatically over time. Nowadays we talk about the market being in 'risk-on' or 'risk-off' mode, since we're used to seeing market-wide selloffs and buy-ins. Your ML model can only do so much if the entire market is selling everything, unless you're going to dip your toes into derivatives or short selling, to which I say: good luck! :)
- A major, major issue is execution. You can have the greatest and most accurate model in the world, but actually trading it is another beast entirely. Bid/ask spreads and the market impact of your own trading, particularly if you are dealing with a non-trivial amount of money or trading securities with less-than-ideal liquidity, will usually eat up any alpha you might have. In my own backtests of fairly straightforward trading algorithms, even a minor 0.05% bid-ask spread can eat a weekly trading algorithm's lunch, never mind if you're planning on intra-day trading or trading anything other than the most popular funds/stocks.
- Beyond any of these risks, you're inevitably going to have to suffer through drawdowns. I don't know about you, but watching my money disappear while it's being controlled by a trading algorithm/model that is subject to all kinds of mistakes and bugs is well beyond my own intestinal fortitude.
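To put a rough number on the execution point above, here's a back-of-the-envelope sketch. All figures (the 8% gross edge, the 0.05% spread, weekly trading) are assumptions for illustration, not results from any real backtest:

```python
# Rough illustration of how a per-trade bid/ask spread compounds
# against a weekly-rebalancing strategy. All numbers are hypothetical.

def net_annual_return(gross_annual, spread, trades_per_year):
    """Gross annual return minus a round-trip spread cost paid on every trade."""
    per_trade_gross = (1 + gross_annual) ** (1 / trades_per_year) - 1
    per_trade_net = (1 + per_trade_gross) * (1 - spread) - 1
    return (1 + per_trade_net) ** trades_per_year - 1

# A hypothetical 8% gross edge, paying a 0.05% spread weekly (52 trades/yr):
print(net_annual_return(0.08, 0.0005, 52))
# ≈ 0.052, i.e. the spread eats roughly 2.8 points of the 8-point edge
```

The point being that a cost that looks negligible per trade compounds into a large fraction of your alpha once you trade frequently.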
"Most ML algorithms perform well when there is some underlying phenomenon whose characteristics are being statistically inferred by the model. However, the stock market is now essentially a large network of computers trying to model one another."
This is one of the insights that I think deserves more ink than it gets. Historically, systems analysis has focused on predominantly linear systems, where unseen underlying factors that could nonetheless be modeled as finite state machines (FSMs) combined into emergent behaviors that were more complex but ultimately still linear across the region of analysis.
That sort of analysis falls down, however, when the modelling system is part of the system it is modelling. The result can easily become non-linear (or turbulent, or chaotic, depending on when you were introduced to systems analysis :-)), and the calculus for those conditions is a lot harder to tease out.
The feedback algorithms become signals to other feedback algorithms, and you get hard-to-predict changes that don't track the measured data: they track the response to the measured data, which is itself changing in response to the response.
From a practical viewpoint (for me), it's of interest for discovering entities that use ML techniques to game search engine rankings, which are themselves derived by ML algorithms.
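The "response to the response" loop described above can be sketched in a few lines. This is a toy model, not a market simulation: two naive agents each trade on the other's last position rather than on any fundamental, and the feedback gain is a made-up parameter:

```python
# Toy sketch of two agents each reacting to the other's last move.
# Purely illustrative; the gain value is a hypothetical parameter
# chosen to show how a feedback loop with gain > 1 diverges.

def simulate(steps=20, gain=1.1):
    a, b = 1.0, -1.0          # initial positions of the two agents
    history = []
    for _ in range(steps):
        # Simultaneous update: each agent fades the OTHER's last position.
        a, b = -gain * b, -gain * a
        history.append(a)
    return history

path = simulate()
# With gain > 1 the positions oscillate and grow without bound: each
# agent is chasing a "signal" that is just the other agent's feedback.
print(path[-1])
```

With gain < 1 the same loop damps out instead; the interesting (and scary) regime is when the aggregate feedback gain across many interacting models drifts above 1.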
Indeed -- it opens up a (scary) need for models to be "self-aware" of competitors' models and their effects on the training data. Above my pay grade, for sure.
Great point... I've been thinking lately about how it must be possible to hack these systems now. E.g., there must be machine analysis of press coverage on equities, and the machines must quickly deduce "positive" or "negative". This means fake stories have a lot of power. Now, this was always true, of course... but traditionally you had humans vet what seemed legit. Now I bet you can keyword-stuff articles, among other things, to shift the market... who knows, but the sky's the limit.
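To make the keyword-stuffing worry concrete: here's a deliberately naive keyword-based sentiment scorer. Real news-analysis systems are far more sophisticated, but the adversarial pressure is the same; the word lists and headlines are invented for illustration:

```python
# A deliberately naive keyword sentiment scorer (hypothetical word lists).
# It counts positive vs negative keywords -- and is trivially gameable.

POSITIVE = {"beat", "record", "growth", "upgrade", "strong"}
NEGATIVE = {"miss", "lawsuit", "downgrade", "weak", "recall"}

def score(text):
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

honest = "Company reports weak quarter as analysts downgrade stock"
stuffed = honest + " strong strong growth record beat upgrade"

print(score(honest), score(stuffed))  # stuffing flips the signal
```

Any model that weights raw term frequency invites exactly this attack, which is why adversarial robustness matters once money is reacting to the output.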
I work in a quant fund. Can I just say that the longer I spend in the industry, the more I find that ML/AI techniques are (in general) useless? It almost seems like collective self-delusion: building ever more complex systems that chase the latest fad in the field without first appreciating that your dataset is almost always just noise...
Me, I just do my boss's bidding like a good little soldier and code up whatever the fund wants. My personal portfolio is a vanilla asset allocation model. Guess which one has done better for the past 2 years?
There was a really nice example of a mistake applying ML to currency exchange rate forecasting given in the "Data Snooping" section of Yaser Abu-Mostafa's "Learning from Data" course at Caltech (http://work.caltech.edu/telecourse.html).
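For anyone who hasn't seen that lecture, the classic form of the trap is normalizing the full series before splitting it, so test-period statistics leak into the training data. A minimal sketch (synthetic numbers, purely to show the mechanics, not the lecture's actual dataset):

```python
# Data-snooping sketch: normalizing train + test together leaks
# test-set statistics into training. Synthetic drifting series.

import random
import statistics

random.seed(42)
series = [random.gauss(0.01 * i, 1.0) for i in range(500)]  # upward drift
split = int(0.8 * len(series))

# WRONG: mean/std computed over the whole series, test period included
mu, sd = statistics.mean(series), statistics.stdev(series)
snooped_train = [(x - mu) / sd for x in series[:split]]

# RIGHT: mean/std computed from the training window only
mu_t, sd_t = statistics.mean(series[:split]), statistics.stdev(series[:split])
clean_train = [(x - mu_t) / sd_t for x in series[:split]]

print(mu - mu_t)  # the leaked shift: the model has "seen" the test drift
```

With a drifting series the snooped normalization quietly tells the model where the test period sits, which is exactly how that forecasting example produced impressive-looking but bogus results.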
Re: the insufficient data sets problem, I've been surprised how hard it is to find large quantities of historical financial data.
Decent sources of the data seem very expensive, and I'm surprised there isn't a startup out there tackling this well, since I'd be willing to pay a decent amount of money for minute-by-minute or hour-by-hour tick data -- but I'm finding most subscription prices crazy. Is this data really as hard to come by as it seems?
It's a situation where your average hedge fund can shell out the thousands of dollars per month for a full-access feed while individual investors are left hanging. Zacks used to offer reasonable prices for accessing fundamental data via their API but has since changed its pricing model to be more prohibitive to small-timers.
Intra-day data is still a bitch, but YCharts recently added a premium plan for $200/mo that seems to hit a nice sweet spot for individual investors who do serious trading and want access to fundamental data & analysis tools.
It is expensive, as you know. The data is generally owned by the exchanges, and they see it as a major revenue source. On the other hand, if you want it, it's just a cost of playing. Open data is not big in finance.
This could be renamed "Common Machine Learning Mistakes". Great advice all around, of course. Insufficient data, lack of or incorrect data prep, and poorly defined success criteria are problems that plague all forms of machine learning.
As other commenters pointed out, the biggest problem with stock market modeling is with trade execution and automated trading. A skilled ML practitioner will know how to deal with the size of the data sets and the normalization of the data. Placing the trades properly and taking into account the myriad sources of information available will give even experts a hard time.
Nice, accessible paper, but I find it galling that the images are so low quality. This paper is by no means alone in suffering from low image quality - I was reading a Nature paper earlier today where the graphics had compression artifacts.
Pet peeve: reading two-column pdfs on screen is awful!
Nice article. Sometimes when using machine learning, people forget how it works, what its limitations are, etc., and sometimes a solution that fits a certain range better is not the best solution overall.
On typical screens made only to display movies, sure. On the 10-inch retina iPad, the article page fits the screen perfectly in portrait mode and everything is perfectly readable. If there's a perfect use case for that 10-inch device, it's this.
Actually, the biggest problem is jumping from the bottom of the 1st column to the top of the 2nd column (even on paper this can be bothersome sometimes; well, it's their standard).
If the page fits entirely on the iPad screen, great!