When you say "hashing makes the programming a little easier," I think you hit the nail on the head. I'm not trying to improve classification accuracy -- my goal was just to make it as easy as possible to learn on arbitrary structured data.
>my goal was just to make it as easy as possible to learn on arbitrary structured data
I'd be very careful about throwing arbitrary data at your learner, at least if you don't understand your data well. Oftentimes the predictors and response are not properly separated in the same way they will be during real-world usage (for example, in time); this leads to target leaks, where your model is effectively cheating by using data it won't have in production.
Target leaks are obvious when the classifier performs suspiciously well on in-sample test data, but sometimes the repercussions are subtler and still very damaging in a production environment.
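The usual safeguard is to split by time rather than at random, so nothing from a test row's "future" can leak into training. A minimal sketch, assuming the data sits in a pandas DataFrame with a "timestamp" column (the column names are just placeholders):

```python
# Minimal sketch of a time-based split. Assumes a pandas DataFrame `df`
# with a "timestamp" column; everything else is left as generic features.
import pandas as pd

def time_based_split(df: pd.DataFrame, cutoff: pd.Timestamp):
    """Train on rows strictly before the cutoff, test on rows at or after it,
    so the model never sees information from the future of a test row."""
    train = df[df["timestamp"] < cutoff]
    test = df[df["timestamp"] >= cutoff]
    return train, test

# Hypothetical usage:
# train, test = time_based_split(df, pd.Timestamp("2015-01-01"))
```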
Hm, couldn't a hybrid approach deal with this? E.g., hash all the data except the few dimensions you think are vital, and append those to the resulting hashed array?
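Roughly what I have in mind, using scikit-learn's FeatureHasher; the "vital" column names are made-up placeholders:

```python
# Rough sketch of the hybrid idea: hash everything except a few "vital"
# columns, then append those columns untouched.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction import FeatureHasher

VITAL = {"age", "account_balance"}  # hypothetical columns kept out of the hash

def featurize(records, n_features=2**18):
    """records: list of dicts of raw structured data."""
    hasher = FeatureHasher(n_features=n_features, input_type="dict")
    # Hash everything except the vital columns.
    hashed = hasher.transform(
        [{k: v for k, v in r.items() if k not in VITAL} for r in records]
    )
    # Append the vital columns as-is, in a fixed order.
    vital = csr_matrix([[float(r[k]) for k in sorted(VITAL)] for r in records])
    return hstack([hashed, vital]).tocsr()

# Hypothetical usage:
# X = featurize([{"age": 42, "account_balance": 1000.0, "city": "Berlin"}])
```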
The hashing isn't for security, though, so why not keep a store of hash → key? It adds some overhead, but you wouldn't have to hash twice, and you could use the mapping table for debugging rather than in operational code, where it would cost resources.
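Something like this sketch, say; the bucketing function and names are just illustrative, not any particular library's API:

```python
# Keep a side table mapping hashed bucket index -> original feature names,
# purely for debugging; it never has to ship with the operational scoring code.
import hashlib
from collections import defaultdict

N_BUCKETS = 2**18
bucket_to_keys = defaultdict(set)  # debug-only mapping, kept out of the hot path

def bucket(feature_name: str) -> int:
    """Map a raw feature name to a bucket index, recording the mapping."""
    h = int(hashlib.md5(feature_name.encode("utf-8")).hexdigest(), 16)
    idx = h % N_BUCKETS
    bucket_to_keys[idx].add(feature_name)  # remember which names hit this bucket
    return idx

# Later, when the weight at index i looks suspicious:
# print(bucket_to_keys[i])  # which raw feature names collided into that bucket?
```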