Cool paper, but the fact that a silly keyword like 'color' is so significant without an author explanation makes me question their results. I really wish they dug deeper and explained unintuitive results like that. There may be a bug in their experiment.
Also, why would they use the Dow Jones Industrial Average? It is a ridiculously bad average of stock market performance for many reasons. This planet money podcast goes into why:
Lots of people use the Dow precisely because it is a bad indicator. If they don't like the conclusion they get with something better, like the W5000 or the S&P500, they can try it again with the Dow for a "second opinion."
The whole methodology is a joke anyway. If you evaluate a huge number of search terms, some of them are going do better than others. So terms that might mean something ("debt") get mixed up with terms like ("color") that got lucky.
What you have to do is come up with one strategy that trains a classifier on the combined words, tunes it up with logistic regression and derives optimal trading actions from that. And you've got to factor in what you're paying to the broker and the market makers.
The DJIA is a "bad" index but it is highly correlated with the S&P 500. I'm including a link to a study looking at ~70 years of data which shows a 0.95 correlation coefficient between DJIA and SP500.
So, while the methodology might be spurious (random search terms may give false positives) i don't think the use of the Dow vs SP500 is a key criticism.
The "Salmon study" was an illustration of the necessity of correcting for multiple comparisons in voxel-wise testing. The problem it's addressing isn't itself data dredging, based on the definition linked above on Wikipedia.
Uncorrected multiple comparisons are, of course, a big portion of the statistical dubiousness inherent it "data dredging"
Logistic regression and other parametric, non-regularized linear learners tend to do poorly with NLP forecasting -type modeling. (They usually overfit.)
That's why you build another threshold-type learner and then apply logistic regression to convert the score produced by learner A into a probability score.
Then you can tune at the exact point of the precision-recall curve that maximizes business value.
For what it's worth, I came across this yesterday and tried replicating their results. I didn't use exactly the same process - I used the S&P 500 and the trading strategy was not exactly the same (because i'm using a framework that makes something else easier to try). But my approach was close enough that I would expect to see similar results if their results were robust enough to be of any use. I didn't get similar results, in fact 'debt' went the other way, and I calculated an associated p value of about 0.8 (i.e. no significance whatsoever). Of course I could have made a mistake, and having come this far going to spend some time double checking everything today, but it'd also surprise me greatly if there was money lying around for the taking this easily..
"Color" is slang for opinion or prediction in finance, e.g. a trader could ask an analyst for some "color" on a company before he goes and buys its stock.
I used to backtest trading indicators a lot and the majority of time the indicator would work on a stock like Apple but wouldn't work on a stock like Tyson Foods. Or any other stock. I was just fitting the curve.
Ignoring methodology concerns for a moment, there have been times in the past where people have found statistical regularities like this in the stock market. Sometimes people have made money on these discoveries. If not someone else will as soon as the researcher publishes. But making money on them gets rid of them, and given that this is now publicly available information those regularities certainly no longer exist, if they ever existed in the first place.
I'm (and economists) aren't saying that there are never $10 bills on the sidewalk, just that they don't stay there for long after someone notices them. Stock prices went down on the weekends for decades before someone noticed. But once it was noticed it stopped pretty quickly.
I'm more interested in the persistent market inefficiencies; growth vs. value, January effect, low P/E, all interesting examples that haven't been arbitraged away.
I am just trying to double check to make sure I understood so bear with me. Are you telling me that stock prices systematically went down for decades and no one took this into account to make money? there is hope after all...
Most of the time, the economist is right about that: the vast majority of ten-dollar-bill-appearing things found on sidewalks are obviously not real ten-dollar bills once you pick them up. They're advertisements for nightclubs or get-rich-quick schemes and whatnot.
Out-of-sample backtest or it didn't happen. Out of a bunch of random search queries, chances are several of them will be "predictive" of futures moves in an index. Big whoop.
"One day we had a conversation where we figured we could just try to predict the stock market, and then we decided it was illegal. So we stopped doing that."
I mean, subject to a variety of risk and regulatory issues, perhaps, and not a core competency, and a variety of other things, but... the fundamental reason for insider-trading laws is to make sure the insiders don't abuse their positions and act against the interests of the company's owners (shareholders) by trading in stock tips instead of building shareholder value. If you get knowledge independently -- like if you sent your analyst out to count the number of cars in a firm's parking lots to estimate hiring/firing -- that's all well and good (and is something that hedge funds actually do.)
Heck, if it works, this sort of thing would be great. Moving the market earlier means fewer people buy and sell companies at the wrong price. That sort of knowledge is worth billions. (Think about it from the perspective of startups: if you could see the future and know whether a company would work out before you even founded it, then you would build only successful companies, and you'd be certain they'd be funded well. This is but a small fraction of that power, but it's still quite meaningful.)
The legality of information you obtain is not about whether you obtained it independently, but based on whether it is public knowledge (or can be derived from it) or not.
Let's say you are not connected to company X, but you have a friend that works there. One day your friend tells you material fact. Now, your friend probably broke a couple of rules, but _you_ cannot control what people tell you and you cannot be held liable for what you hear. However, acting on this information is illegal. Even if you presumably do not use this information, but trade shares of company X before the information you know becomes public, you are in murky waters.
Same with Google. Google cannot control what people search for. However, acting on this information would be illegal, unless they are absolutely sure that information in the search terms is consistent with public knowledge about the company.
I believe this is not a fully accurate picture, although the example you give in your second paragraph is definitely accurate.
At a high level you aren't allowed to use "material, non-public information" for investment purposes, but information isn't material just because you can make money off of it in some way, otherwise "channel checks" would be illegal. Material non-public information has to come from insiders of the company, so the only argument that could be made that it was illegal for Google to make investments is based on material information that was being provided to them, via search terms, by corporate insiders. If they are merely using the sentiment exposed by the public to them through search terms that is probably legal. Similarly it's legal for hedge funds to fly planes over department stores and count the cars in their parking lots to gauge the level of business they are seeing at Christmas time, even though this isn't public information.
Somewhat related, on a more micro level: some have suggested a trend in which news about Anne Hathaway drives up the price Berkeshire Hathaway’s stock.
As the intelligence of our technology grows, I find it amusing to consider these less-logical patterns in the algorithms (indirectly) responsible for so much of our economy.
You are always going to be able to find trends between data points if you have enough information. In order to tell whether you actually have predictable information you need to do a double test, which they don't seem to mention anywhere.
Typically you split your data pool in half and use data analysis to determine some kind of relation, such as the change in "color" seems to correlate to a reverse change in DOW Jones. You then generate a prediction algorithm based on that.
Finally you use your prediction system on the other half of your data, to see if you are actually predicting or just correlating. Feel free to adjust any of your methodology but make sure you don't include the second half in your generation step or else you have only shown correlation not prediction.
I was thinking of something very similar the other day but in regards to bitcoins. They are very volatile at this stage and whenever they're in the public eye they either seem to increase or decrease in value. Perhaps somebody could perform a study as to a relation between the volume of bitcoin related search terms and it's market value.
I'm actually looking into this a little right now since I happen to work for a media monitoring company. I can tell you there is a definite correlation between them, but haven't been able to determine if it's a leading or lagging indicator.
I'm not sure this is so great. For example, the number of searches for Bitcoin is predicted very well by the number of searches for "marina fresh beat band".
As soon as someone figures out bitcoin derivatives (e.g. options), you'll be able to buy or sell the volatility. More broadly, such a development may also stabilize those price fluctuations.
Theory. If you expect that the price of bitcoins is going to be somewhere in the next month, but it's wildly deviated from that on the spot market, there is a monetary incentive to buy/sell bitcoin until the prices are in line with long-term expectations -- discounted for risk, of course. Derivatives like futures options reduce that risk, because you can lock in that future price right now. If anyone's offering, that is.
Derivatives could also reduce risk for users. For instance: you're a business that accepts bitcoins for payment. But the price of bitcoins can fall by 50% in a day! So you'd better transfer those bitcoins to cash the moment you get them, or you're subject to losing half your cash any day. Something that's too hot to handle like that kind of damages the ability to use it as a medium of exchange, doesn't it? But you could alternatively buy bitcoin currency futures, essentially locking in your current price. So maybe you could hang onto those bitcoins a while and use them to pay people for your referral program or something. (Oh, sure, there's a cost associated with it, but there are costs associated with translating cash back and forth as well. Depending on the price of the derivatives, it might make sense.)
When people use derivatives to hedge their exposure to "bad things happening" and the market is not very liquid (i.e. BTC), arbitrage opportunities can persist. For example, if there is no buyer at an out-of-the-money option strike price I know is overpriced, then the market cannot correct itself even though I am "right." When these mis-pricings stick around for a while, the eventual corrections are often much more extreme. Derivatives reduce overall volatility when the spot market is mature and liquid, which is a far cry from MtGox right now.
Many many moons ago I had a similar idea, although my idea was to spot patterns of eBay purchases. I had a feeling that people in various parts of the world would buy certain types of items depending on economic trends. Once I had identified purchasing trends, it seemed to me that I might be able to predict trends.
Don't think it's that simple. Example: There's a sudden burst in searches for "Macys". They could have started a huge advertising campaign. Turns out that campaign costed them billions of dollar, and didnt have high ROI.
This seems more like a macroeconomic-leading-indicator strategy. The algorithm worked against the Dow Jones Industrials, not against individual stocks.
How do you get access to google search terms for something like this? Also, anyone have any other hypothesis about search terms that would predict movement?
What's interesting is that I had this exact idea with precious metals. The frequency of the terms searched (like gold or buy gold) matches the gold spot market index impressively closely.
Also, why would they use the Dow Jones Industrial Average? It is a ridiculously bad average of stock market performance for many reasons. This planet money podcast goes into why:
http://www.npr.org/blogs/money/2013/03/12/174139347/episode-...