When you say "hashing makes the programming a little easier," I think you hit the nail on the head. I'm not trying to improve classification accuracy -- my goal was just to make it as easy as possible to learn on arbitrary structured data.
>my goal was just to make it as easy as possible to learn on arbitrary structured data
I'd be very careful about throwing arbitrary data at your learner, at least if you don't understand your data well. Oftentimes the predictors and response are not properly separated in the same way they will be during real-world usage (for example, in time); this leads to target leaks, where your model is effectively cheating by using data it won't have in production.
Target leaks are obvious when the classifier performs suspiciously well on in-sample test data, but sometimes the repercussions are subtler and still very damaging in a production environment.
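The usual safeguard is to split by time rather than at random, so nothing from a test row's "future" can leak into training. A minimal sketch, assuming the data sits in a pandas DataFrame with a "timestamp" column (the column names are just placeholders):

```python
# Minimal sketch of a time-based split. Assumes a pandas DataFrame `df`
# with a "timestamp" column; everything else is left as generic features.
import pandas as pd

def time_based_split(df: pd.DataFrame, cutoff: pd.Timestamp):
    """Train on rows strictly before the cutoff, test on rows at or after it,
    so the model never sees information from the future of a test row."""
    train = df[df["timestamp"] < cutoff]
    test = df[df["timestamp"] >= cutoff]
    return train, test

# Hypothetical usage:
# train, test = time_based_split(df, pd.Timestamp("2015-01-01"))
```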
Hm, couldn't a hybrid approach deal with this? E.g., hash all the data except the few dimensions you think are vital, and append those to the resulting hashed array?
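Roughly what I have in mind, using scikit-learn's FeatureHasher; the "vital" column names are made-up placeholders:

```python
# Rough sketch of the hybrid idea: hash everything except a few "vital"
# columns, then append those columns untouched.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction import FeatureHasher

VITAL = {"age", "account_balance"}  # hypothetical columns kept out of the hash

def featurize(records, n_features=2**18):
    """records: list of dicts of raw structured data."""
    hasher = FeatureHasher(n_features=n_features, input_type="dict")
    # Hash everything except the vital columns.
    hashed = hasher.transform(
        [{k: v for k, v in r.items() if k not in VITAL} for r in records]
    )
    # Append the vital columns as-is, in a fixed order.
    vital = csr_matrix([[float(r[k]) for k in sorted(VITAL)] for r in records])
    return hstack([hashed, vital]).tocsr()

# Hypothetical usage:
# X = featurize([{"age": 42, "account_balance": 1000.0, "city": "Berlin"}])
```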
The hashing isn't for security, though, so why not keep a store of hash → key? It adds some overhead, but you wouldn't have to hash twice, and you could use the mapping table for debugging rather than in operational code, where it would cost resources.
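Something like this sketch, say; the bucketing function and names are just illustrative, not any particular library's API:

```python
# Keep a side table mapping hashed bucket index -> original feature names,
# purely for debugging; it never has to ship with the operational scoring code.
import hashlib
from collections import defaultdict

N_BUCKETS = 2**18
bucket_to_keys = defaultdict(set)  # debug-only mapping, kept out of the hot path

def bucket(feature_name: str) -> int:
    """Map a raw feature name to a bucket index, recording the mapping."""
    h = int(hashlib.md5(feature_name.encode("utf-8")).hexdigest(), 16)
    idx = h % N_BUCKETS
    bucket_to_keys[idx].add(feature_name)  # remember which names hit this bucket
    return idx

# Later, when the weight at index i looks suspicious:
# print(bucket_to_keys[i])  # which raw feature names collided into that bucket?
```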