
Interesting concept: handling missing data with Trinary decision trees. At a high level, it is reminiscent of multiple imputation in random forests, which can also address missingness. The Trinary tree takes a different approach, though, by not presuming that the missing values carry any significant information about the response. It's intriguing that it shines in MCAR (missing completely at random) settings but falls short under Informative Missingness (IM).

> "Notably, the Trinary tree outperforms its peers in MCAR settings, especially when data is only missing out-of-sample, while lacking behind in IM settings."

This somewhat mirrors the behavior of early imputation strategies. I do wonder how the Trinary tree would compare with older methods for handling missing values, such as CART's surrogate splits or C4.5's probabilistic splits, which were built on a somewhat similar intuition.

It's also great to see the Trinary tree combined with the Missing In Attributes (MIA) approach into the TrinaryMIA tree. The efficacy of this hybrid model isn't completely surprising, though: MIA has historically held up well across diverse missing-data scenarios, and pairing it with the Trinary approach could combine the strengths of both.

What would be really interesting is to see whether the essence of the Trinary decision tree can be carried over to boosting models like XGBoost or LightGBM. Since these models are well known for their built-in handling of missing values, maybe there's some potential symbiosis there?
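For context on what boosting libraries do today: XGBoost learns a "default direction" for missing values at each split, which is close in spirit to MIA. Here is a minimal sketch of that idea for a single regression split. It uses plain squared error instead of the gradient statistics real libraries use, and the function names are mine, not from any library.

```python
import math


def sse(values):
    """Sum of squared errors around the mean; 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split_with_default_direction(x, y):
    """Return (threshold, missing_goes_left, loss) minimising squared error.

    Missing entries in x are None or NaN. Every candidate threshold is scored
    twice: once with the missing rows sent left, once with them sent right.
    Returns None if the observed values allow no split.
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    observed = sorted((xi, yi) for xi, yi in zip(x, y) if not is_missing(xi))
    missing_y = [yi for xi, yi in zip(x, y) if is_missing(xi)]

    best = None
    for i in range(1, len(observed)):
        lo, hi = observed[i - 1][0], observed[i][0]
        if lo == hi:
            continue  # no threshold fits between equal values
        threshold = (lo + hi) / 2
        left = [yi for _, yi in observed[:i]]
        right = [yi for _, yi in observed[i:]]
        for missing_goes_left in (True, False):
            if missing_goes_left:
                loss = sse(left + missing_y) + sse(right)
            else:
                loss = sse(left) + sse(right + missing_y)
            if best is None or loss < best[2]:
                best = (threshold, missing_goes_left, loss)
    return best
```

Because the missing rows always end up in one of the two children, this implicitly treats missingness as informative, which is exactly the assumption the Trinary tree avoids.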




I implemented something like this in a [pre-XGBoost boosting framework](https://github.com/ryanbressler/CloudForest) ~10 years ago and it worked well.

It isn't even that much of a speed hit with the classical sorting-based CART implementation. However, XGBoost and LightGBM use histogram-based approximate sorting, which might be harder to adapt in a performant way. And the code will certainly be a lot messier.
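To illustrate why the sorting-based version stays cheap, here is a rough sketch of an exact split search that sets the missing rows aside in a third bucket and scans only the observed values. This is not CloudForest's code or the paper's exact algorithm; the function name and the squared-error criterion are assumptions for illustration.

```python
import math


def three_way_split(x, y):
    """Exact split search on observed values; missing rows form a third bucket.

    Returns (threshold, left_idx, right_idx, missing_idx), or None if the
    observed values allow no split. Missing entries in x are None or NaN.
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    # One pass to set missing rows aside, then the usual sort of the rest.
    missing_idx = [i for i, v in enumerate(x) if is_missing(v)]
    order = sorted((i for i, v in enumerate(x) if not is_missing(v)),
                   key=lambda i: x[i])
    n = len(order)
    total = sum(y[i] for i in order)
    total_sq = sum(y[i] ** 2 for i in order)

    best_loss = best_pos = best_thr = None
    left_sum = left_sq = 0.0
    for pos in range(1, n):
        i = order[pos - 1]
        left_sum += y[i]
        left_sq += y[i] ** 2
        if x[order[pos - 1]] == x[order[pos]]:
            continue  # no threshold fits between equal values
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        # Squared error of each side from running sums: sum(y^2) - sum(y)^2 / count
        loss = (left_sq - left_sum ** 2 / pos) + (right_sq - right_sum ** 2 / (n - pos))
        if best_loss is None or loss < best_loss:
            best_loss, best_pos = loss, pos
            best_thr = (x[order[pos - 1]] + x[order[pos]]) / 2
    if best_pos is None:
        return None
    return best_thr, order[:best_pos], order[best_pos:], missing_idx
```

The only extra work over a plain CART scan is the initial filter. What the third bucket should predict is left open here. With histogram-based split finding, the missing rows would need their own bin bookkeeping, which is presumably where the messiness comes in.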


Came here to cite your work. I still mention "CloudForest" in my slides as "an interesting implementation that is also capable of handling NANs in DTs in a slightly different way." Crazy that this has already been 10 years.



