No, they can't - or at least, this paper doesn't provide any compelling evidence that they can.
I read this paper when it first came out a few years ago and produced an implementation of the signal. It is heavily overfitted to historical data - many plausible alternative assumptions about which keywords are predictive are not profitable even in backtest, let alone useful as a basis for future trading.
This is an unfortunate example of non-finance domain experts, who I'm sure are more than capable in their respective fields, making egregious errors when they try to apply their knowledge in finance.
> I thought the common practice was using part of the historical data for creating the model, and another sizable, non overlapping chunk to validate it.
One problem is that too often, people break the data into a training set and a testing set. Then they train N algos on the training data, test them on the testing data, and then trade on the algo that tested best.
Once you use the testing set for more than one algo, it's really a meta-training set.
Really, you need a training set, a testing set, and a validation set. If you use the validation data set with more than one algo, it's no longer a validation set.
So, you train N algos, test N algos. Pick the best, and validate it. If validation fails, do you have enough discipline to wait for more data to come in and try again? Most people do not and will make hand-wavy arguments about why it's okay to re-shuffle the same data into 3 data sets and try again.
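The discipline described above - select on the test set, then touch the validation set exactly once - can be sketched as follows. This is a minimal illustration, not from the paper: the data is synthetic, and the "N algos" are stand-ins (ridge regression at a few penalty strengths).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(900, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=900)

# Chronological three-way split: train / test / validation.
X_tr, X_te, X_va = X[:500], X[500:700], X[700:]
y_tr, y_te, y_va = y[:500], y[500:700], y[700:]

def fit_ridge(X, y, alpha):
    # Closed-form ridge regression: w = (X'X + alpha*I)^-1 X'y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

# Train N candidate algos on the training set only.
weights = {a: fit_ridge(X_tr, y_tr, a) for a in (0.1, 1.0, 10.0)}

# Picking the winner on the test set makes it a meta-training set...
best_alpha = min(weights, key=lambda a: mse(X_te, y_te, weights[a]))

# ...so the validation set is used exactly once, on the winner only.
# If this number is bad, the honest move is to wait for new data.
val_mse = mse(X_va, y_va, weights[best_alpha])
print(val_mse)
```

The point of the structure is that `X_va`/`y_va` appear on exactly one line; as soon as a second candidate is scored against them, they have become a second meta-training set.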
It's an infinite regression. You keep needing more data to be completely 'fair'. If the data set is finite, you eventually use all of it. Then where do you go?
Another route is to model the data source, and train on the model (which you can run forever to get endless data). Then test on the real-world data. But that's only as good as the model.
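A toy version of that route, with the caveat baked in: fit a generative model to part of the real data (here an AR(1) process, purely an assumed choice), train on as much synthetic data from the model as you like, and evaluate only on held-out real data. The result is only as good as the AR(1) assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in "real" series (in practice, market data).
phi_true = 0.8
real = np.zeros(1000)
for t in range(1, real.size):
    real[t] = phi_true * real[t - 1] + rng.normal(scale=0.3)

fit_data, holdout = real[:500], real[500:]

# Model the data source: estimate the AR(1) coefficient and noise scale.
phi_hat = float(fit_data[1:] @ fit_data[:-1] / (fit_data[:-1] @ fit_data[:-1]))
sigma_hat = float(np.std(fit_data[1:] - phi_hat * fit_data[:-1]))

# "Run the model forever": train on a long simulated series (here, the
# training step is just re-estimating the one parameter on synthetic data).
sim = np.zeros(100_000)
for t in range(1, sim.size):
    sim[t] = phi_hat * sim[t - 1] + rng.normal(scale=sigma_hat)
phi_model = float(sim[1:] @ sim[:-1] / (sim[:-1] @ sim[:-1]))

# Test on real-world data only.
test_mse = float(np.mean((holdout[1:] - phi_model * holdout[:-1]) ** 2))
print(test_mse)
```

If the real source isn't actually AR(1), the endless synthetic data just lets you fit the model's errors very precisely - which is the "only as good as the model" problem.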
> This is an unfortunate example of non-finance domain experts, who I'm sure are more than capable in their respective fields, making egregious errors when they try to apply their knowledge in finance.
Full ACK to this statement. I remember this post from when it was written in 2013 (by the way, can that date be put in the title?), alongside a similar paper arguing that Twitter hashtags/likes/retweets could serve as a market signal - mostly for this excellent response:
> I read this paper when it first came out a few years ago, and produced an implementation of the signal. They have heavily overfitted to historical data - many plausible alternative assumptions for which keywords are predictive are not profitable in backtest at all, let alone useful as a basis for future trading.

> This is an unfortunate example of non-finance domain experts, who I'm sure are more than capable in their respective fields, making egregious errors when they try to apply their knowledge in finance.
https://xkcd.com/1570/