The paper trains an XGBoost model to predict outcomes for startup investments made in the 2014-2016 period. The model is trained on investments made prior to 2014 and uses features from Pitchbook. The author reports that the most important features are text descriptions of the company and CEO, featurized as unigram and n-gram TF-IDF features.
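For readers who want to picture the setup, here is a minimal sketch of that kind of pipeline in Python. Everything in it (file name, column names, hyperparameters) is illustrative, not taken from the paper:

    # TF-IDF over company descriptions feeding an XGBoost classifier,
    # trained on pre-2014 deals and evaluated on the 2014-2016 cohort.
    # All names and parameters below are hypothetical.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from xgboost import XGBClassifier

    df = pd.read_csv("deals.csv")                         # hypothetical Pitchbook-style export
    train = df[df.deal_year < 2014]
    test = df[df.deal_year.between(2014, 2016)]

    vec = TfidfVectorizer(ngram_range=(1, 2), min_df=5)   # unigram + bigram features
    X_train = vec.fit_transform(train.company_description)
    X_test = vec.transform(test.company_description)

    model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
    model.fit(X_train, train.outcome)                     # outcome: 1 = "good" investment
    print(model.score(X_test, test.outcome))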

The paper avoids the very first methodological problem one would check for, which is using a held-out set that is mixed in time. That is, it at least uses a held-out _period_ (2014-2016) rather than individually held-out samples. Randomly held-out samples would let the model learn about future trends: if you know which companies will be successful in 2026, that helps a lot in deciding what to invest in today, even if you can't pick those exact companies. So at least the paper doesn't do that.
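To make the distinction concrete (toy code, entirely made-up columns):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Toy frame standing in for deal-level data.
    df = pd.DataFrame({"deal_year": [2010, 2011, 2012, 2013, 2014, 2015, 2016],
                       "outcome":   [0, 1, 0, 1, 1, 0, 1]})

    # Random hold-out: pre- and post-2014 deals land in both train and test,
    # so the model can exploit trends from the test period's "future".
    train_bad, test_bad = train_test_split(df, test_size=0.3, random_state=0)

    # Temporal hold-out (what the paper does): train strictly before the test window.
    train_ok = df[df.deal_year < 2014]
    test_ok = df[df.deal_year.between(2014, 2016)]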

However, the paper doesn't seem to have limited itself to information written and documented about the companies at the time of the investment decisions. Descriptions of the CEO will change over time. Once a company is looking to go public, maybe it emphasises that its CEO has an MBA; if the company died at seed, maybe that detail never makes it into the text. Even the people listed as the company's founders can be revised over time (Tesla is a famous example).

The paper really doesn't give us much insight into what the model learns. This means we need to consider competing explanations for the model's accuracy: we're not forced to accept the author's conclusion that the investments were indeed "predictably bad" in a way that could be profitably implemented as a trading strategy.

By itself, the fact that a black-box machine learning model with a large number of features can distinguish between two classes on a held-out set isn't very strong evidence that the model has learned something of practical importance that will generalise beyond the train and held-out sets you've collected. You need to show us what it has actually learned. If it's supposed to identify wolves, is it looking at the animal or at the snow in the background? For a study like this, I'd want to see the features pruned down to the point where the model is still accurate with only a few textual features, and then a report of what those features are. If you can't do that in a way that actually leads to human insight, it's extremely unlikely that the model is learning anything real.
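A sketch of that pruning check, continuing the illustrative TF-IDF/XGBoost objects from the sketch above (none of this is the paper's actual code):

    # Prune the fitted model down to a handful of TF-IDF terms, then check
    # whether accuracy survives and whether the surviving terms make sense.
    import numpy as np
    from xgboost import XGBClassifier

    def prune_and_inspect(model, vec, X_train, y_train, X_test, y_test, keep=20):
        importances = model.feature_importances_
        top = np.argsort(importances)[::-1][:keep]         # indices of the top terms
        terms = np.array(vec.get_feature_names_out())[top]

        small = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
        small.fit(X_train[:, top], y_train)
        acc = small.score(X_test[:, top], y_test)

        # If accuracy holds, the terms should read like pre-investment signals.
        # If they read like hindsight ("acquired", "unicorn"), that's leakage.
        return acc, list(terms)

    print(prune_and_inspect(model, vec, X_train, train.outcome, X_test, test.outcome))

If the accuracy collapses under pruning, or the surviving terms are uninterpretable, the headline result deserves much less weight.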




> The one data point that is not specific to the incubator deal is the company description. Pitchbook does not store or provide a time-varying description of the company. Instead, for each startup I have a current description of the firm, its product, and activities, independent of its health or status. I use this information in my analyses as a way to extract information about the company’s product and thus refer to this field as product information. A key assumption in this approach is that the descriptions found in Pitchbook do not evolve as a function of early-stage funding or late-stage success which would mechanically make the descriptions predictive. To the best of my knowledge the company descriptions available to me today are not systematically different from the descriptions available to early-stage investors when they would have been engaged in due diligence. Further, all results are robust to the exclusion of the product information.

This is a prime place to go looking for a leak of signal from the future that inflated the performance of the model.

It would be interesting to pull Pitchbook's descriptions of successful and failed startups and see whether those descriptions leak any information about the later success or failure of the firm.
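A crude first pass at that check (the term list, file, and column names below are made up):

    # Count how often hindsight-flavoured phrases show up in descriptions of
    # successful vs. failed startups. If post-outcome language concentrates in
    # one class, the descriptions were likely written after the fact.
    import re
    import pandas as pd

    HINDSIGHT_TERMS = ["acquired by", "ipo", "ceased operations",
                       "shut down", "unicorn", "publicly traded"]
    PATTERN = re.compile("|".join(map(re.escape, HINDSIGHT_TERMS)))

    def hindsight_rate(descriptions: pd.Series) -> float:
        return descriptions.str.lower().str.contains(PATTERN).mean()

    # df = pd.read_csv("pitchbook_descriptions.csv")   # hypothetical export
    # print(hindsight_rate(df[df.outcome == 1].description))
    # print(hindsight_rate(df[df.outcome == 0].description))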


Good points here.

And I agree: without some kind of sensitivity analysis, partial dependence analysis, etc., it's hard to draw conclusions about what, if anything, the model has learned.
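For the kind of pipeline sketched in the parent comment, even permutation importance on the held-out period would help. Illustrative code, reusing the hypothetical model, vec, X_test, and test objects from that sketch:

    # Shuffle each TF-IDF column on the held-out period and see how much the
    # score drops; terms the model genuinely relies on should hurt when broken.
    import numpy as np
    from sklearn.inspection import permutation_importance

    result = permutation_importance(model, X_test.toarray(), test.outcome,
                                    n_repeats=10, random_state=0)
    terms = vec.get_feature_names_out()
    for idx in np.argsort(result.importances_mean)[::-1][:15]:
        print(terms[idx], round(result.importances_mean[idx], 4))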

It's also particularly important to test your model against simulated data with a known effect built into it. Your model should be able to learn real effects and avoid learning spurious ones. Simulation studies can be time-consuming and difficult to design, but not much more so than a good test suite for a piece of software. I don't know why this technique isn't more common in statistics and ML, even in the world of traditional probability models. It really should be taught in stats and ML courses.
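A bare-bones version of such a simulation check (entirely synthetic data, nothing from the paper):

    # Run the same modelling step on (a) synthetic data with a known effect and
    # (b) synthetic data with no effect at all. It should work on (a) and sit at
    # chance on (b); if it "works" on the null data, something is leaking.
    import numpy as np
    from xgboost import XGBClassifier

    rng = np.random.default_rng(0)
    n, p = 4000, 50
    X = rng.normal(size=(n, p))

    def fit_and_score(X, y):
        half = len(y) // 2                               # simple ordered split
        model = XGBClassifier(n_estimators=200, max_depth=3)
        model.fit(X[:half], y[:half])
        return model.score(X[half:], y[half:])

    # (a) Known effect: the outcome is driven by the first feature plus noise.
    y_real = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
    print("known effect:", fit_and_score(X, y_real))     # should be well above 0.5

    # (b) Null: the outcome is independent of every feature.
    y_null = rng.integers(0, 2, size=n)
    print("null effect: ", fit_and_score(X, y_null))     # should hover around 0.5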



