RandomForest and other decision tree methods can actually handle missing data very well by treating an individual cell in the data matrix as missing, rather than discarding the entire row.
So the assertion that "All algorithms should operate only on data vectors and on frequency weights — they should have no knowledge of missing-ness." is false - there are a lot of other fruitful ways to handle missing data, such as using indicator variables, imputation, etc. - see http://www.amazon.com/Statistical-Analysis-Missing-Roderick-...
>RandomForest and other decision tree methods can actually handle missing data very well by treating an individual cell in the data matrix as missing
This is why I tend to favor these over regression or other methods. Real data is going to have lots of missing values, sometimes too many for imputation to even work. Whittling down your data to only those which have all predictor fields filled in can drastically shrink down your data set.
If anyone is particularly interested, there are notes on the original RandomForest implementation, by Breiman and Cutler, which describe how missing values are handled (as well as a great deal of other interesting tricks!):
There are some other ways an rf can be made to handle missing values including splitting the values onto a third branch at each node (which can be useful if the fact the value is missing may be informative), just leaving the missing values out at each split and applying a bias correction (as was common with pre rf decision trees), or doing a local imputation within the in branch cases at each node.
- Regarding missing values, if you have a special number in your survey to indicate it, then preprocess your data to replace it with an NA.
- Regarding 80 bit doubles. Adding a few bits of precision may be just all you need in some cases, but that's a crap shoot. If you have catastrophic cancellation, then rewrite your algorithm to avoid the numerical instability. Use compensated sums for instance, or work in log-space. There are many ways to solve this problems in a much more robust way than adding more precision.
- Plenty of algorithms can handle missingness. You can treat them as hidden variables and use something like the EM algorithm or MCMC sampling.
In his example you have several missing value indicator that each contain some sort of reason as to why it's missing. Collapsing that to NA would lose data, which is why the post argue against it.
If it's important and your algorithm can handle it, then it's data, not missing data, and there's no need to be concerned with the underlying representation.
If it's not important or your algorithm can't handle it, then you don't lose anything by collapsing it to NA.
The toolkit universe seems bifurcated between incomplete stats packages written by programmers and incomplete programming packages written by statisticians.
Why is needing more significant digits than the hardware can provide doing something wrong? At the end of day the tools should be subordinate to the problem at hand and not the other way around. You're arguing for the primacy of tools over the actual problem being solved. That sounds backwards.
On the precision front, http://www.mpfr.org/ bindings for a language are a much nicer choice than doing fiddling things with 80bit floats. Then you can have 1000 bit floats if you need the precision! And then it just becomes a library design problem. https://github.com/ekmett/rounded is one such exemplar binding
on the empty/missing data front, I often in my haskell code model it as something like (Maybe (Either Weird Normal)), though some folks might write it as something more like ExceptT Maybe Weird Normal, but same idea either way.
I'm not sure how your Maybe type representation would help with the OP's concern about SIMD optimization.
It's a theoretically clean approach, but this article is largely focused on how to keep the (SIMD-aware) compiler humming along on large vectors without reverting to scalar operations.
AFAICT GHC only has experimental support for SIMD optimizations anyway, so this is probably less of an issue for Haskell code than it is for C/C++/Fortan numeric primitives (and languages like Julia which utilize them).
I can totally get struct of arrays format in haskell (Heck, the Unboxed.* modules in the vector package are basically that http://hackage.haskell.org/package/vector)
I can define the SOA format as something like (Vector Bool, Vector Bool, Vector a, Vector b), or something similarly along those lines. That allows a pretty reasonable branchless vectorized formulation with only a relatively small ~2x space overhead for most pointwise operations.
I actually wrote a prototype of a simd backed vectorized arithmetic lib for haskell that does the right thing over a year ago, https://github.com/wellposed/vector-vectorized, but the fact of the matter is that pointwise simd is honestly kinda boring, it really only is useful for streaming/linear time workloads. What more interesting is SIMD on super linear time algorithms like FFT and (Dense) Matrix multiply, along with on compressed structures and audio/video. In all of these latter cases, SIMD shuffle matters to! And I've not yet seen a high level language that does a good treatment of supporting simd shuffles.
https://github.com/flame/blis/blob/master/kernels/x86_64/avx...
is a very very good example of how shuffle/permutation heavy interesting SIMD gets. (The associated project, BLIS, is a modern, amazingly well engineered sequel to blas, written in C, thats easy to customize to new architectures).
I definitely agree that GHC's current SIMD isnt very useful, it lacks nice support for shuffles, and its a bit weird (it naively just bundles up the LLVM simd api). I hope to work with a few folks to fix that up in time for GHC 7.12, I actually spent a bit of time this summer trying to plan out some design ideas.
The frequency weight trick seems like a cool one. You always have to code around frequency/probability/analytic weights, so you get two birds with one stone (It's not a perfect solution though, as if you need to e.g. regress two variables by different groups, then updating the mask for each value of the group ID is silly compared to just adding a conditional)
On a related note, I don't know why more people don't look at what others in stats have made (e.g. Stata's solution of having 32 missing values using the very last values of each numeric type, so you can bitmask them easily). If you really want to supersede Matlab, R, Numpy, Stata, etc. then don't discard everything they have made, learn about the good parts and use them (!!!).
Also, what's up with the comments here? At least 4/8 comments are basically saying "don't use regressions, use a technique that doesn't need missing value hacks..". Well guys that's NOT up to the language writer to decide. If you don't do linear regressions well and fast they you are dead in the water in terms of a basic statistical package. Same for the precision issue; many algorithms at some point require quad precision so just saying "screw it change the algo" just misses the point.
So the assertion that "All algorithms should operate only on data vectors and on frequency weights — they should have no knowledge of missing-ness." is false - there are a lot of other fruitful ways to handle missing data, such as using indicator variables, imputation, etc. - see http://www.amazon.com/Statistical-Analysis-Missing-Roderick-...