The paper trains an XGBoost model to predict outcomes for startup investments made in the 2014-2016 period. The model is trained on investments made prior to 2014 and uses features from Pitchbook. The author reports that the most important features are text descriptions of the company and CEO, featurized as unigram and n-gram TF-IDF features.
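For readers who want to picture the setup, here is a minimal sketch of that kind of pipeline in Python. Everything in it (file name, column names, hyperparameters) is illustrative, not taken from the paper:

    # TF-IDF over company descriptions feeding an XGBoost classifier,
    # trained on pre-2014 deals and evaluated on the 2014-2016 cohort.
    # All names and parameters below are hypothetical.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from xgboost import XGBClassifier

    df = pd.read_csv("deals.csv")                         # hypothetical Pitchbook-style export
    train = df[df.deal_year < 2014]
    test = df[df.deal_year.between(2014, 2016)]

    vec = TfidfVectorizer(ngram_range=(1, 2), min_df=5)   # unigram + bigram features
    X_train = vec.fit_transform(train.company_description)
    X_test = vec.transform(test.company_description)

    model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
    model.fit(X_train, train.outcome)                     # outcome: 1 = "good" investment
    print(model.score(X_test, test.outcome))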

The paper avoids the very first methodological problem one would check for, which is using a held-out set that is mixed in time. That is, it at least uses a held-out _period_ (2014-2016) rather than individually held-out samples. Randomly held-out samples would let the model learn about future trends: if you know which companies will be successful in 2026, that helps a lot in deciding what to invest in today, even if you can't pick those exact companies. So at least the paper doesn't do that.
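To make the distinction concrete (toy code, entirely made-up columns):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Toy frame standing in for deal-level data.
    df = pd.DataFrame({"deal_year": [2010, 2011, 2012, 2013, 2014, 2015, 2016],
                       "outcome":   [0, 1, 0, 1, 1, 0, 1]})

    # Random hold-out: pre- and post-2014 deals land in both train and test,
    # so the model can exploit trends from the test period's "future".
    train_bad, test_bad = train_test_split(df, test_size=0.3, random_state=0)

    # Temporal hold-out (what the paper does): train strictly before the test window.
    train_ok = df[df.deal_year < 2014]
    test_ok = df[df.deal_year.between(2014, 2016)]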

However, the paper doesn't seem to have limited itself to information written and documented about the companies at the time of the investment decisions. Descriptions of the CEO will change over time. Once a company is looking to go public, maybe it emphasises that its CEO has an MBA; if the company died at seed, maybe that detail never makes it into the text. Even the people listed as the company's founders can be revised over time (Tesla is a famous example).

The paper really doesn't give us much insight into what the model learns. This means we need to consider competing explanations for the model's accuracy: we're not forced to accept the author's conclusion that the investments were indeed "predictably bad" in a way that could be profitably implemented as a trading strategy.

By itself, the fact that a black-box machine learning model with a large number of features can distinguish between two classes on a held-out set isn't very strong evidence that the model has learned something of practical importance that will generalise beyond the train and held-out sets you've collected. You need to show us what it has actually learned. If it's supposed to identify wolves, is it looking at the animal or at the snow in the background? For a study like this, I'd want to see the features pruned down to the point where the model is still accurate with only a few textual features, and then a report of what those features are. If you can't do that in a way that actually leads to human insight, it's extremely unlikely that the model is learning anything real.
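A sketch of that pruning check, continuing the illustrative TF-IDF/XGBoost objects from the sketch above (none of this is the paper's actual code):

    # Prune the fitted model down to a handful of TF-IDF terms, then check
    # whether accuracy survives and whether the surviving terms make sense.
    import numpy as np
    from xgboost import XGBClassifier

    def prune_and_inspect(model, vec, X_train, y_train, X_test, y_test, keep=20):
        importances = model.feature_importances_
        top = np.argsort(importances)[::-1][:keep]         # indices of the top terms
        terms = np.array(vec.get_feature_names_out())[top]

        small = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
        small.fit(X_train[:, top], y_train)
        acc = small.score(X_test[:, top], y_test)

        # If accuracy holds, the terms should read like pre-investment signals.
        # If they read like hindsight ("acquired", "unicorn"), that's leakage.
        return acc, list(terms)

    print(prune_and_inspect(model, vec, X_train, train.outcome, X_test, test.outcome))

If the accuracy collapses under pruning, or the surviving terms are uninterpretable, the headline result deserves much less weight.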




> The one data point that is not specific to the incubator deal is the company description. Pitchbook does not store or provide a time-varying description of the company. Instead, for each startup I have a current description of the firm, its product, and activities, independent of its health or status. I use this information in my analyses as a way to extract information about the company’s product and thus refer to this field as product information. A key assumption in this approach is that the descriptions found in Pitchbook do not evolve as a function of early-stage funding or late-stage success which would mechanically make the descriptions predictive. To the best of my knowledge the company descriptions available to me today are not systematically different from the descriptions available to early-stage investors when they would have been engaged in due diligence. Further, all results are robust to the exclusion of the product information.

This is a prime place to go looking for a leak of signal from the future that inflated the performance of the model.

It would be interesting to pull Pitchbook's descriptions of successful and failed startups and see whether those descriptions leak any information about the later success or failure of the firm.
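A crude first pass at that check (the term list, file, and column names below are made up):

    # Count how often hindsight-flavoured phrases show up in descriptions of
    # successful vs. failed startups. If post-outcome language concentrates in
    # one class, the descriptions were likely written after the fact.
    import re
    import pandas as pd

    HINDSIGHT_TERMS = ["acquired by", "ipo", "ceased operations",
                       "shut down", "unicorn", "publicly traded"]
    PATTERN = re.compile("|".join(map(re.escape, HINDSIGHT_TERMS)))

    def hindsight_rate(descriptions: pd.Series) -> float:
        return descriptions.str.lower().str.contains(PATTERN).mean()

    # df = pd.read_csv("pitchbook_descriptions.csv")   # hypothetical export
    # print(hindsight_rate(df[df.outcome == 1].description))
    # print(hindsight_rate(df[df.outcome == 0].description))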


Good points here.

And I agree: without some kind of sensitivity analysis, partial dependence analysis, etc., it's hard to draw conclusions about what, if anything, the model has learned.
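For the kind of pipeline sketched in the parent comment, even permutation importance on the held-out period would help. Illustrative code, reusing the hypothetical model, vec, X_test, and test objects from that sketch:

    # Shuffle each TF-IDF column on the held-out period and see how much the
    # score drops; terms the model genuinely relies on should hurt when broken.
    import numpy as np
    from sklearn.inspection import permutation_importance

    result = permutation_importance(model, X_test.toarray(), test.outcome,
                                    n_repeats=10, random_state=0)
    terms = vec.get_feature_names_out()
    for idx in np.argsort(result.importances_mean)[::-1][:15]:
        print(terms[idx], round(result.importances_mean[idx], 4))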

It's also particularly important to test your model against simulated data with a known effect built into it. Your model should be able to learn real effects and avoid learning spurious ones. Simulation studies can be time-consuming and difficult to design, but not much more so than a good test suite for a piece of software. I don't know why this technique isn't more common in statistics and ML, even in the world of traditional probability models. It really should be taught in stats and ML courses.
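A bare-bones version of such a simulation check (entirely synthetic data, nothing from the paper):

    # Run the same modelling step on (a) synthetic data with a known effect and
    # (b) synthetic data with no effect at all. It should work on (a) and sit at
    # chance on (b); if it "works" on the null data, something is leaking.
    import numpy as np
    from xgboost import XGBClassifier

    rng = np.random.default_rng(0)
    n, p = 4000, 50
    X = rng.normal(size=(n, p))

    def fit_and_score(X, y):
        half = len(y) // 2                               # simple ordered split
        model = XGBClassifier(n_estimators=200, max_depth=3)
        model.fit(X[:half], y[:half])
        return model.score(X[half:], y[half:])

    # (a) Known effect: the outcome is driven by the first feature plus noise.
    y_real = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
    print("known effect:", fit_and_score(X, y_real))     # should be well above 0.5

    # (b) Null: the outcome is independent of every feature.
    y_null = rng.integers(0, 2, size=n)
    print("null effect: ", fit_and_score(X, y_null))     # should hover around 0.5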



