Predicting Stock Performance with Natural Language Deep Learning (microsoft.com)
233 points by lxm on May 30, 2018 | 96 comments



This is not very convincing. It reads like a semester project for an undergrad or a summer intern. In particular, the prediction task used for evaluation is not very meaningful, and the intended use case (predicting biotech underperformers) is not very useful without predicting by how much they would underperform - i.e. in the end the output of this model is just another regression factor with questionable significance.

But in that case, it's very unlikely a heavy CNN text model would be better suited than simpler methods like a classifier on LSA vectors, or even just extracting the GloVe vectors and doing no convolution at all - especially given all the hyperparameters they mention needing to tweak.
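To be concrete about the kind of simpler baseline I mean, something along these lines (a rough sketch with scikit-learn; texts and labels stand in for the filing text and the performance classes):

    # Sketch: TF-IDF + LSA + logistic regression as a simple text baseline.
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    baseline = make_pipeline(
        TfidfVectorizer(max_features=50000, stop_words="english"),
        TruncatedSVD(n_components=200),     # LSA: low-rank projection of the TF-IDF matrix
        LogisticRegression(max_iter=1000),
    )
    print(cross_val_score(baseline, texts, labels, cv=5).mean())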

Meanwhile, an ablation study to figure out that you need LeCun initialization seems both overkill and demanding of far more specialization in deep learning than a typical firm would be looking to have, all just to squeeze out what might be a slight bit of efficacy in one industry group.

When I previously worked in quant finance, I used to be very passionate about trying to apply the latest & greatest methods.

But over time my feeling is most of it is inapplicable to finance, because it ends up being a lot of work for pretty much no efficacy, like in this post. I found much more value in simpler regression and tree models, and using simple bag of words models with text. The only “advanced” stuff that ever seemed to add significant efficacy was Bayesian hierarchical regression, but only for helping overcome limitations of classical random effects models, not for adding any greater complexity.

The post author deserves plenty of credit for skill in deep learning, but the analysis is unconvincing if it's meant to be a case for using the MSFT cloud for deep learning in finance.


Agreed. I also noted three potential flaws:

1. They applied pandas.qcut over the entire returns (i.e. training and validation set) when generating the target performance labels, which could compromise the validation.

2. The ordering of news is actually important when it comes to forecasting the movements of the market; their training/validation split was done after shuffling the entire dataset, which means the model will have access to information that shouldn't be available to it during training.

3. It makes more sense to use a training-validation-testing setting rather than a training-validation split when reporting the model performance in order to avoid inflated results due to all the hyper-parameter tuning.


What do you mean by training-validation-testing vs training-validation split? Also why would pandas.qcut compromise the validation?


If the quantile assignments took into account data from both the training set and the test set, then it means the training set quantile label itself contains lookahead information about what the distribution of test set values is.

This can lead to artificially increased test set accuracy since by optimizing the model during training, you are partially optimizing based on information you gained by knowing about the test set in advance.
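Concretely, a leakage-free version fits the bin edges on the training returns only and then applies those same edges to the held-out data - something like this sketch (variable names are placeholders):

    # Sketch: compute tercile edges on training returns only, then reuse them for the test set.
    import pandas as pd

    train_labels, bin_edges = pd.qcut(train_returns, q=3,
                                      labels=["low", "medium", "high"],
                                      retbins=True)
    # Widen the outer edges so test-set values outside the training range still fall in a bin.
    bin_edges[0], bin_edges[-1] = -float("inf"), float("inf")
    test_labels = pd.cut(test_returns, bins=bin_edges,
                         labels=["low", "medium", "high"])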

Training-validation-testing generally means keeping a portion of the training data held out for evaluating accuracy and convergence after each full pass of training updates. That way you get information with each training update about whether the model is overfitting, whether it performs unrealistically well on the validation set (which it's not trained on), and whether early stopping criteria have been met.

Then, after all that, you move on to checking the final accuracy and metrics on a fully held out test set.

This validation approach is common in deep learning because of the many diagnostics you need during training. Without some information from outside the training set itself, it can be hard to understand how the learning rate is affecting you, how likely overfitting is, or whether there is a vanishing gradient problem.

Waiting all the way until the end of training to get any feedback on these is sometimes just too inefficient, or too risky: you might discover an issue only after a huge time sink.
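And for filing data specifically, the cleanest way to carve out all three sets without leaking future information is a chronological cut rather than a shuffle. A sketch, assuming a DataFrame df with a filing_date column:

    # Sketch: chronological train/validation/test split, no shuffling across time.
    df = df.sort_values("filing_date")
    n = len(df)
    train = df.iloc[:int(0.70 * n)]                 # oldest filings: fit the model
    val = df.iloc[int(0.70 * n):int(0.85 * n)]      # next slice: tuning and early stopping
    test = df.iloc[int(0.85 * n):]                  # most recent filings: touched once, at the end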


Yep. A lot of quant strategies actually exclude biotech altogether. Simply too volatile. Even with relatively conservative portfolio concentration limits you can still lose millions on a bio name dropping 50% due to a rumor of a failed clinical trial.


Bingo. Echoes my experience (also in quant finance). ML is just not all that useful in our field. To the extent that some shops are getting mileage from it, I presume it's from combining lots of really alternative datasets. Satellite photos, traffic data, etc?


Methodology seems to be poor, as with most publications on finance.

1. How did they split the data for train/validation/test sets? If all of the sets had the data from the same time period, or the same companies end up in multiple sets, it is a major flaw. For example, if outperformance was persistent for the companies over the time period considered, the model may simply learn to identify specific companies by their filings.

2. What is the variance of the out-of-sample performance? Given that their dataset is very small, and the model performed badly at predicting high returns and reasonably well at predicting low returns, what are the chances of getting those results by luck alone?

3. How has the model performed since then, on the most recent filings?

4. Why use a convnet? Would gradient boosted trees not perform just as well/better but be more interpretable? Methods like the ones in the eli5 package can help get an idea of why the model makes a particular prediction, which could help sanity check the model.
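For instance, something like this sketch (assuming a feature matrix X_train/X_test and labels y_train; not the article's actual setup):

    # Sketch: gradient boosted trees plus eli5 explanations.
    import eli5
    from sklearn.ensemble import GradientBoostingClassifier

    gbt = GradientBoostingClassifier().fit(X_train, y_train)
    # Global view: which features carry the most weight overall.
    print(eli5.format_as_text(eli5.explain_weights(gbt)))
    # Local view: feature contributions behind one specific prediction.
    print(eli5.format_as_text(eli5.explain_prediction(gbt, X_test.iloc[0])))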


The sense check here is "If I had a large team of people reading 10-k reports, could I profit from the analysis?". Well, it turns out this is exactly what many asset managers do. If it was easy to profit from this sort of information, we'd be awash with asset managers beating the market. We aren't.

There are two mutually exclusive conclusions that you could draw from this: one, the informational advantage you get from reading company reports is small, and TFA is overstating their case. Two, convnets have discovered a way of reading reports that conveys a big advantage, and further, human readers can't replicate the process.

My money is on the first hypothesis. The second is an enormous claim, and so demands an enormous amount of evidence.


> If it was easy to profit from this sort of information, we'd be awash with asset managers beating the market.

Unless it's a Moneyball-like situation, where managers theoretically have access to all the same metrics, but actually use inferior squishy human judgement instead of hard data.

Due to the amount of data across the entire industry, it's totally conceivable that "human readers can't replicate the process".

That said, I'm a bit of a pessimist, and concur with your suspicion.


This is not really finance research, so do not confuse the two. It is CS/ML with a financial application; two different things. Return predictability is low in real data, even if you control for time- and firm-fixed effects.


When applying CS/ML to finance, methodology is absolutely crucial. Yet most work - even work done by quants in major sell-side banks - pays little attention to methodology; it is often naively assumed that the methodology applied to classifying cat images can be directly transferred into the domain of finance.

Here is an article that explains the problem well: http://zacharydavid.com/2017/08/06/fitting-to-noise-or-nothi...


But that's because it is being applied by people who come from classifying cat images and have little domain knowledge.


>"When applying CS/ML to finance, methodology is absolutely crucial."

When wouldn't it be?


In other areas, we can often assume that each example in our dataset is an independent sample from some underlying distribution. We can also assume that the distribution stays constant. Furthermore, we have fairly high-accuracy ground truth labels. Also, we know that everything we need to classify a sample is in the data we are given. We may also have more data to play with than we really need.

Take a handwritten digit recognition task like MNIST, for example - the way we write the number '5' does not change much, and the distribution of the different ways people write that number stays pretty much the same over time. The labels we are given are very accurate and everything we need to classify the image is in the data.

All these assumptions mean that picking the right methodology is fairly easy - anyone can read a short tutorial and get it right.

None of the assumptions I listed hold in the world of finance. There is no standard methodology that everyone can agree should be followed. You really have to be very, very careful in order to produce useful results.


It sounds like you just need to use a different methodology, not that methodology is unimportant for the ML 101 examples that became popular, like MNIST.


Yes, but I also think it's a lot easier to get it wrong in finance without even realizing it than in many other domains. It's relatively difficult to screw up the methodology on MNIST. Many mistakenly assume that the methodology from MNIST transfers directly to finance, which makes the problem worse.

All of that means that when talking about applying ML to finance, discussing the methodology in detail is a must. If it is not talked about, or only briefly mentioned, in my experience that usually means the methodology used is rubbish. Whereas you don't need to focus as much on the methodology when talking about MNIST-like problems - one can usually assume a reasonable one was used.


If you use financial performance to measure the success of a model, methodology is critical. It's easy to have models that generate outperformance on paper; in reality, nearly all of those turn out to have some flaw in them.


>"If you use financial performance to measure the success of a model, methodology is critical."

When wouldn't it be? And why does this sound exactly like the other post I just read?


Compared to other fields, small problems with timing usually don't impact results as severely. You don't get completely different results if you forget to move one time series by just a few hours.

As for the second part, I don't know. Probably because we had the same thought? Not sure which post you refer to.


1. Are you really assuming that researchers at MSFT screwed up the training and validation sets?

4. In what way is GBT more interpretable than an ANN?


I don't think that was MS researchers - I think they just publicised this bit of research.

With GBT, you can check why a particular prediction was made - roughly - by navigating down each tree for a particular sample and summing up the influence of each feature on the final score. Then you can see if something weird is having a large effect on your score. Can you do something similar with ANNs?


Yes they did. I noted them in another comment.


They're attempting something fantastically difficult to do effectively (arguably beyond human capability) and arguing that their slightly-better-than-chance results indicate that their approach is promising. More likely, the system is picking up on some low-hanging-fruit indicators in the data (perhaps that corporate filings with certain buzzwords are likely to fail, or ones that reference projects beyond a certain age, or similar) which would already be easily picked up on by human analysts, and anything much beyond that would be impossible with anything remotely resembling the present method.

I want to posit a general hypothesis (perhaps it's already been said); better-than-chance performance by a classifier on some dataset is not evidence that a similar (or any) classifier can perform better on the same data.


I was thinking about weather prediction.

Now that the climate is changing, you can see that all the models are becoming less and less accurate.

So yeah, some models might be performing better than others, but overall they are all performing worse.


> Now that the climate is changing, you can see that all the models are becoming less and less accurate.

That's not true.

"Last year, the National Weather Service 5-day forecasts were within 4 degrees of the high temperature. That's as accurate as 2005's 3-day forecasts and a full degree better than the 5-day forecasts of 11 years ago"[1]

[1] http://www.mcall.com/business/tech/mc-weather-forecasts-impr...

https://www.washingtonpost.com/news/capital-weather-gang/wp/...


Weather prediction is constantly getting more accurate. And since it only really cares about what happens in the next 7 (now maybe 10) days, climate change is simply not a factor.


Has there been any research into the decline of weather prediction models?

Sounds really interesting.


Weather models are constantly being validated and improved. They are getting better as computation power and grid resolution have increased. The climate is changing but the laws of physics are not. Also, climate is different from weather.


Yup, it might be picking up on a few clear negative signals like profit warnings, and positive signals like mergers. It doesn't seem to be doing much better than the old Bayesian classifiers on Reuters feeds did, though; since it's more advanced it may potentially be better, but that's a big maybe.


A fuller analysis would show the time-series of total stock returns and mark the corresponding prediction. I've built these models professionally, and these are some questions I would expect to be asked:

Did all the true 'low return' predictions happen during a bear market, or is there a good temporal mix between predictions?

Did they all happen for the same company? If so, is it possible the report was revised after publication?

What horizon are they measuring performance over? What happens if they change that horizon? What happens if they lag the prediction by a day?

What happens if they use a finer granularity than high / medium / low?


> already be easily picked up on by human analysts

How?


A few years ago I spoke to a data scientist who, after a bunch of NLP work, had hit on a very simple technique for quickly identifying good news and bad news.

How to tell that it's good news: there are numbers near the top.

How to tell that it's bad news: the figures are buried as deeply as legally permissible.
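In feature terms that heuristic is almost trivial to encode - a toy sketch (the 500-character cutoff is made up):

    import re

    # Toy heuristic: good news tends to put the figures up front.
    def looks_like_good_news(press_release, cutoff=500):
        match = re.search(r"\d[\d,.]*%?", press_release)
        return match is not None and match.start() < cutoff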


Ironically, I figured the authors of the article weren't very impressed with their results since they weren't outlined in the summary...


Moreover, when the news is good, the numbers at the top are the right ones.


That sounds about right - most money-making strategies are based on simple logic and observations. People seem blindly impressed by ML and convolutional neural networks, even when they don't yield any impressive results.

I built a system to determine which fundamental or technical factors have the most influence over a stock price. It required a fair amount of manual tinkering and sample runs to eliminate variance. The system itself is not that complex, but it works. When I have demo'd it in an interview, the interviewers are usually not that impressed... they expect that a system must be very complicated to get any results.


That just goes to show that Pearl's approach to causal reasoning is solid. Instead of looking for patterns and correlations in data, one could consider building causal graphs from domain knowledge.


I have found that if the earnings comments are easy to understand, the stock is going to go up. If there are lots of qualifiers and they invent new ways to measure sales, costs and profit, then the stock is going down.

If they mention the weather during the last quarter - run!


Weather really is the last refuge of scoundrels.


They lose a lot of information by artificially classifying a continuous variable (% change) into 3 bins (high/medium/low) instead of trying to do regression on percent change. This incorrect practice is surprisingly prevalent in deep learning.


With financial data you might be better off separating out the side of the bet from the size of the bet and thus it makes sense to just make three bins. The features to use for direction and volatility are different.


Choosing bins for stock returns isn't easy. You could use two bins for positive vs. negative returns but any other thresholds are arbitrary and results could depend on the knowledge available when the threshold was chosen.


It's arbitrary if you make it arbitrary. Depending on the data, you might target various time intervals to find the optimal way to trade on that information, just like any hyperparameter, and the bins are a function of that time. On average, in a 30-minute period bitcoin moves by 0.2% on high-liquidity exchanges.


But as I understand it, they used the whole sample to determine the bins, including the validation sets. That alone is a big no-go. I didn't mean that you can't find a proper way to choose bins, but there is a lot of potential for errors.

And in the end you want to invest based on the model. Arbitrary bins aren't that helpful if some of them combine desirable and undesirable (e.g. positive and negative) results.


>whole sample to determine bins, including validation sets.

Well, that's not random or arbitrary, just wrong.

>And in the end you want to invest based on the model.

I'm not quite sure what you mean here. Either you determine that a model and its hyperparameters (which include the bins) are correct often enough via testing on out-of-sample or even synthetic data, or you are talking about deciding what to do given a single observation. In the latter case, I would assume you give the observation to the model and ask it what it thinks, and the bins are part of that determination (as done here in the article with the softmax output); given that you've done the testing, you should have a level of confidence about the outcome of acting on the model's output. The bins aren't a post-processing step. The post-processing you might do to trade recall for accuracy would be to require the bin to have a stronger signal (class probability > 0.5 or something, and otherwise ignore) - but all of this is "the model".
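To make that concrete, the kind of threshold step I mean looks roughly like this (a sketch; the 0.5 cutoff is arbitrary):

    import numpy as np

    # Only act when the winning bin is confident enough; otherwise stand aside.
    def act_on(softmax_probs, classes=("low", "medium", "high"), min_confidence=0.5):
        i = int(np.argmax(softmax_probs))
        return classes[i] if softmax_probs[i] >= min_confidence else None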

>Arbitrary bins aren't that helpful if some of them combine desirable and undesirable results

Correct me if I'm wrong but it sounds like you are making the case that there are models and bins which would have only good outcomes (and those are the only useful ones)? Am I misunderstanding something here?


It might even make sense to use something like a mixture network or a GAN to get an idea of the distribution.


So if something like this came into common use, the obvious follow-on would be "earnings writers" who specialize in phrasing and vocabulary that tickles the AI. (It would be easy to practice, just feed your draft into the system, get a score, tweak some words, repeat.) So, reproducing the "SEO" frenzies of a decade ago.


This technique generally (applying NLP to 10K’s and other corporate filings) has been in practice for many years. One common, highly commoditized text signal is earnings surprise: measuring sentiment in earnings call transcripts, and comparing sentiment scores with similar metrics from analyst publications in reaction to the earnings announcement.

The arms race effect you mention absolutely exists. Corporate officers are often coached extensively on key phrases to repeat in earnings calls, and phrases to avoid, and how to steer caller questions away from SEO-like speech patterns they don’t want.


In earnings calls, management already generally reads off scripts and won't answer anything they didn't prepare for (except for some smaller companies, e.g. Tesla recently). They know that transcripts are available and one wrong word could cause investors to panic. Using machine learning won't really change much there. Maybe picking up on signals humans can't hear (e.g. confidence in voice) but texts are already overanalyzed.


Goodhart's law will eventually come around.


I truly wonder when we'll get some deep learning models running on enterprise Exchange servers, trying to find meaning in the deluge of email sent across all large corporates and building fancy dashboards with "actionable" warnings.

I bet someone is going to make a fortune based on the pitch alone.


Or just a version of Outlook that has a usable search function. As much as I don't like being dependent on Google, at least I know that I can find all the emails I'm looking for in Gmail. That's definitely not the case for Outlook; even recent emails can be very hard to find (we're on the newest Office version at my job).


Seriously, I moved from a G-Suite org, where I was actually a G-Suite admin, to an Outlook based MSFT shop and the difference is staggering. Skype for Business alone is horrendous, but combine that with Outlook and internal communication becomes a nightmare.



Weren't there some bots doing sentiment analysis on Twitter? These were discovered and then exploited at one point to cause a rapid drop in the stock markets.


Sentiment analysis was a big topic and is still being used. But results are very specific. Most investors won't tweet while they're losing money so the short term information gain is small.

Where it can help is to determine the state of the economy. Same as with image recognition for satellite pictures (traffic patterns, footfall at stores), those indicators can give you a faster idea if something's going wrong with the economy than classical indicators. But I have yet to see an accurate one.


Any links with more info would be appreciated :)




Microsoft should just analyze trends of LinkedIn links and job postings correlated with people leaving companies.


Like Google isn't already doing that?


The training set includes old stock data? How is that correlated with anything? Steam engines were spectacular stocks, until the car and electricity came along.

I wouldn't train an AI on stocks - I would train it on history books and old newspapers to find hidden needs and gaps. People not using half of the day due to darkness? People fighting due to starvation, or reduced chances to mate and procreate?

An NN could find these needs and hand them to a different NN trained to find solutions, or companies likely to develop solutions.


Except an NN could not find these needs. They are very poor at the understanding part of language.


It would be interesting, alongside estimating the general "consensus" on stock performance using NLP, sentiment analysis, etc., to use game theory to find the contrarian strategy - used by only a few players - with the highest probability of yield: staying one step ahead of the moving average of dominant expectations, and finding the equilibrium where the transacted amount doesn't affect the dominant trend, so you can keep surfing the wave as long as possible.


Related work from 2009, for those interested:

"Financial forecasting using character n-gram analysis and readability scores of annual reports" by Matthew Butler and Vlado Keselj.
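Their setup is easy to approximate today, roughly like this (a sketch using scikit-learn and the textstat package, not the authors' code; reports and labels are placeholders):

    # Sketch: character n-grams plus a readability score as features for annual reports.
    import numpy as np
    import scipy.sparse as sp
    import textstat
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    ngrams = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), max_features=20000)
    X_ngrams = ngrams.fit_transform(reports)                  # reports: list of report texts
    readability = np.array([[textstat.flesch_reading_ease(r)] for r in reports])
    X = sp.hstack([X_ngrams, sp.csr_matrix(readability)])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)    # labels: e.g. over/underperformed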


This feels like the type of thing that if widespread could be easily abused - at least in the short term. Something along the likes of SEO, but for SEC filings - used to boost a stock price in the short term.


Black Hat SEC?

There was a model to predict upvotes on Hacker News based on the title that made the rounds a while back. Fortunately we don't have a bunch of submissions with the title "YC YC YC YC YC YC golang is better than Rust" [1]

[1] https://news.ycombinator.com/item?id=14400603


This is really only my 2c, and I've only really dabbled in BTC/altcoin trading. In my experience, there is too much entropy generated by the collective masses transacting to reliably predict most ups and downs without cheating or using some external "special" knowledge of the market in question. We will need to wait until the days of strong quantum computers before we can achieve that level of precision with AI.


By that point the other traders will also have quantum computers, so they'll be just as hard to predict.



JP Morgan had a nice quote for these things: "The Market Fluctuates".

I don't really think we can predict the stock market because the market fluctuates every day based on news and things like that.

The stock market is like legal gambling; in the words of Warren Buffett, "The intelligent earn money off the fools".


Warren Buffett has literally made billions by exactly "predicting the stock market" - the thing you say is "impossible". So what do you mean?


Well, that isn't "prediction". He invests in sound businesses which are poised to grow. His investments used to take decades to give a good return.

He did not spit out "100 stocks which will go up tomorrow using my fancy algorithm".

He researched the people behind the company, the finances, etc.

Wildly different from what algorithms do.


I started doing this with 10-K reports; the different formatting between different years was too much of a pain in the ass. They started embedding Excel documents, in what I think is base64, into the text files? I don't know. In the 90s the tables were in plain text.


Yes, XBRL (XML) files. However, they only give you the financials. What OP wanted was the management summary, risk factors, etc., which still need to be parsed. Luckily many companies now put anchors to those sections in the HTML.
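For the section extraction, a crude approach that works on a lot of the newer HTML filings is to flatten the markup and slice between the item headings - a sketch (real filings need a lot more special-casing, e.g. the table of contents repeats the headings):

    # Sketch: pull the "Risk Factors" section out of a 10-K, assuming reasonably clean HTML.
    import re
    from bs4 import BeautifulSoup

    def risk_factors(html):
        text = BeautifulSoup(html, "html.parser").get_text(" ")
        # In most 10-Ks, Item 1A is "Risk Factors" and Item 1B is "Unresolved Staff Comments".
        match = re.search(r"Item\s+1A\.?(.*?)Item\s+1B\.?", text,
                          flags=re.IGNORECASE | re.DOTALL)
        return match.group(1).strip() if match else ""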


Here's a meta question about all such "predicting stock market performance using X"-type publications: if you have a method that does better than average, why are you publishing it, and not, say, investing all your money and making a killing?


This is a pretty common response so I think its worth pointing out that there are a lot of reasons why you might have a model you think might be able to predict the stock market but are not actively using it to make money.

* You may not be eligible to trade that instrument because of where you live or who you are

* You may have found an edge based on historical data in one instrument and expect it could work for another but lack the data necessary to confirm

* You may not have the capital necessary to trade at sufficient scale

* You may not have the programming skills to turn a mathematical model into a high-frequency trading bot with adequate risk controls, and to build and maintain it (a waaaaaaaaaay different skillset)

* You may have found a prediction whose error rate is within the transaction costs and thus needs to be traded at higher volume (lower tx costs) to confirm

And before anyone thinks "why not just negotiate to sell it to someone who does": they get hundreds of random people doing that every day, so your chances of getting noticed are between slim and none.

Now, what is an effective way to benefit from a market prediction model is to publish it. Those people may hire you to continue to develop it further or to refine it for other markets you didn't think of. No model lasts forever, so having demonstrated the ability to find one makes it more likely you will do so in the future, thus should attract big money salaries from people with the capital for your next one.

Not to say that these are legit, just pointing out that it isn't something to be hand-wavily dismissive about, is all. I wish authors would start a paper with something like this just to address this inevitable comment. "We aren't trading this because we can't trade SLURM futures" would really help readers out if they also can't.


The Rockwell Retroencabulator ...


which one, the original or the turbo?


I don't think this is very useful outside of short-term fluctuations. At the end of the day, analysis of the 3 financial statements (balance sheet, cash flow, profit/loss) is what drives price.


It turns out you can make a lot of money off those short term fluctuations.


Yeah I mean that is true. Almost had an edit to mention that.


Anyone have experience with Azure? Nearly everyone I see personally in industry is doing their research on AWS. This article is obviously trying to market Azure, which makes me curious.


The ML features of Azure are as good as anyone’s. You can dip your toes in the water at https://studio.azureml.net/ then scale all the way to running CNTK on GPU farms.


Azure is almost as big as AWS. I think Azure is used more for enterprise SaaS.


According to whom? And what is included? Microsoft has a history of blurring the lines here. I think the only way you get them being equal is by combining internal use and/or their SaaS offerings. I don't think it's a fair comparison if you do that.


Annual revenue, 1st quarter 2018:

Microsoft: $21.2B, Amazon: $20.4B

From [1]. That includes Office 365 revenue [2]. Excluding that, Azure is about 1/3 the size of AWS [3].

[1] https://www.zdnet.com/article/cloud-providers-ranking-2018-h...

[2] https://techcrunch.com/2017/10/30/aws-continues-to-rule-the-...

[3] https://www.srgresearch.com/articles/cloud-market-keeps-grow...


Maybe you like this article: https://www.redpixie.com/blog/microsoft-azure-aws-guide

I think this is a fair and in depth comparison.


> I think this is a fair and in depth comparison.

redpixie is an MS partner, so I don't expect it to be fair.

https://www.redpixie.com/blog/iaas-paas-saas

“As a Microsoft Partner, we focus on Microsoft Azure IaaS solutions”


Nothing like an article written 6 months ago about a method with 50% accuracy. The dime in my pocket does just as well and is much simpler :)


This was being promoted to me an insane amount on Twitter - now on HN as well?


Really interesting article. I imagine the financial markets are one industry a lot of people apply deep learning models to.


If you learn how to predict the stock market and reveal your methods then you no longer know how to predict the stock market.


This is from December 2017.


Yeah, the Github repo on this looks quiet; no responses to the issues posted.


This sounds like the plot of a mediocre thriller. Oh wait, it is.[1]

[1]https://www.theguardian.com/books/2011/sep/30/fear-index-rob...




