What about building machine learning models that make predictions on said data? ...

abhgh · on Dec 2, 2020

I have made another comment on this page, and this is indeed a problem, barring specific cases where you have a "simulator" and want to learn something applicable to some "aggregate of simulations". For example, learning a reinforcement learning policy for tic-tac-toe is fine, because you can build a tic-tac-toe simulator and you want a universal policy.

The problem goes beyond testing. Your training might (1) be deprived of information your original data had: e.g., among 1,000 features, the key to the classification of interest might be a single feature; how does your generator know not to distort it, potentially at the cost of distorting the other 999 features? Or it might (2) latch onto the assumptions made by the data generator: e.g., for a continuous-valued feature, your model might bias itself towards the distribution moments that the generator assumed.
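
To make point (1) concrete, here is a minimal sketch, not anyone's actual generator. It assumes a hypothetical naive generator that fits every column on its own; the toy data, the seed, and all names are made up for illustration. The marginals come out right, but the one feature-label dependence the classifier needs is gone.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# Toy "real" data (assumption): the label depends only on the key feature.
real_key = [rng.gauss(0.0, 1.0) for _ in range(n)]
real_label = [1.0 if x > 0 else 0.0 for x in real_key]

# Hypothetical naive generator: model every column independently.
mu, sigma = statistics.fmean(real_key), statistics.pstdev(real_key)
fake_key = [rng.gauss(mu, sigma) for _ in range(n)]
positive_rate = statistics.fmean(real_label)
fake_label = [1.0 if rng.random() < positive_rate else 0.0 for _ in range(n)]

real_corr = pearson(real_key, real_label)
fake_corr = pearson(fake_key, fake_label)
print(f"feature/label correlation  real: {real_corr:.2f}  generated: {fake_corr:.2f}")
```

A model trained on the generated columns has nothing to learn from the key feature, even though a per-column comparison of real and generated data would look fine.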

I am not sure generating good fake data is a different problem from good density estimation. And in some cases, you might need to specify which parts of the original distribution you don't want the generator to mess with: consider an NLP dataset where your model must rely on sentence structure. Generating the right bag-of-words features might not help here, because sequence matters. The same holds if you want to use contextual embeddings: sequence matters then too.

Even if you did manage to generate a "distributionally-compatible" version of the data, you could run into problems in cases where you perform some kind of data enrichment at a later stage. For example, if the original data has zipcodes that you wanted to mask, and your data generator substitutes them with arbitrary strings, then at a later point you cannot introduce a feature that measures the proximity of two locations.
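
A small sketch of the zipcode case, under loud assumptions: the codes, the centroid table, and the toy grid coordinates are all hypothetical placeholders, not real postal data. Masking keeps "same code, same token", which is enough for grouping, but the later proximity feature has nothing to look up.

```python
import math
import random

# Hypothetical lookup table: placeholder codes -> (x, y) position in km on a toy grid.
CENTROIDS = {"Z001": (0.0, 0.0), "Z002": (3.0, 4.0), "Z003": (30.0, 40.0)}

def proximity_km(code_a, code_b):
    """Distance between two codes, or None if either code is unknown."""
    a, b = CENTROIDS.get(code_a), CENTROIDS.get(code_b)
    if a is None or b is None:
        return None
    return math.dist(a, b)

def mask_codes(codes, rng):
    """Replace each distinct code with an arbitrary token, consistently."""
    mapping = {}
    masked = []
    for code in codes:
        if code not in mapping:
            mapping[code] = f"{rng.getrandbits(32):08x}"
        masked.append(mapping[code])
    return masked

original = ["Z001", "Z002", "Z003", "Z001"]
masked = mask_codes(original, random.Random(0))

print(proximity_km(original[0], original[1]))  # distance is available on the original codes
print(proximity_km(masked[0], masked[1]))      # None: the tokens carry no location
print(masked[0] == masked[3])                  # equality survives masking
```

If proximity is going to matter later, the masking step would have to preserve it on purpose (for example by coarsening codes instead of replacing them), which is exactly the kind of thing you need to know in advance.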

fractionalhare · on Dec 2, 2020

This isn't my area of expertise, but I've spoken to computer vision researchers who apparently use generated data for training models for self-driving vehicle autonomy. Maybe they only use generated data for the train set and then do cross-validation on real data? I'd like to hear them chime in on this thread if any are reading here.

Theoretically speaking, if the generated data has the same distribution and parameters as the real data [1], and encodes similar nonparametric features like seasonality and user activity, I think generated data might be fine. [2]

_________

1. Admittedly tricky if you have limited data and no insight into the underlying population distribution/features, just those of the sample. But then you have a worse problem for modeling diagnostics anyway.

2. In the sense that anything is "fine", which is a spectrum that requires some critical skepticism in statistics. There are always caveats but it may still be robust and useful.
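
One way to probe the "same distribution and parameters" condition is sketched below, assuming a one-dimensional feature and toy samples that are made up for illustration. It computes a two-sample Kolmogorov-Smirnov statistic by hand (the largest gap between the two empirical CDFs) and shows that matching the mean and variance alone is not the same as matching the distribution.

```python
import bisect
import random

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# Toy "real" feature (assumption): skewed, exponential with mean 1 and variance 1.
real = [rng.expovariate(1.0) for _ in range(n)]

# Generator A: samples from the same distribution.
same_dist = [rng.expovariate(1.0) for _ in range(n)]

# Generator B: only matches the first two moments (normal with mean 1, stddev 1).
same_moments = [rng.gauss(1.0, 1.0) for _ in range(n)]

ks_same_dist = ks_statistic(real, same_dist)
ks_same_moments = ks_statistic(real, same_moments)
print(f"KS vs same distribution: {ks_same_dist:.3f}")
print(f"KS vs same moments only: {ks_same_moments:.3f}")
```

This only checks one marginal at a time; it says nothing about joint structure such as seasonality, so it is a necessary check at best, not a sufficient one.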

levesque · on Dec 2, 2020

You could kickstart training on simulators and then do a transfer, i.e. fine-tune the resulting model on real-world data. But if you learn only on generated data, the problem boils down to which nonparametric features you use to claim that the generated data is similar to the real data. What is a complex enough feature to say that two images are equivalent? They might be statistically equivalent according to your features, but are they really? I think this is a very hard problem, because if we had a good answer to this question, Tesla & co. would already be training their models on perfect simulators and we wouldn't see the glitches currently found in autonomous driving applications.
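
The "kickstart on a simulator, then transfer" idea can be sketched with a toy one-parameter-pair model. Everything here is an assumption for illustration: the simulator, the stand-in for real data, the learning rate, and the step counts. It is plain gradient descent on a linear model, nothing like a real vision pipeline.

```python
import random

def mse(w, b, data):
    """Mean squared error of y = w * x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(w, b, data, lr, steps):
    """Full-batch gradient descent on squared error, starting from (w, b)."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

rng = random.Random(0)  # fixed seed so the run is repeatable

def sample(n, slope, offset):
    """Noisy points on a line; slope and offset define the data source."""
    points = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        points.append((x, slope * x + offset + rng.gauss(0.0, 0.1)))
    return points

simulated = sample(500, slope=2.0, offset=0.0)   # hypothetical simulator: roughly right
real_small = sample(20, slope=2.5, offset=0.5)   # scarce real data for fine-tuning
real_test = sample(200, slope=2.5, offset=0.5)   # held-out real data

# Stage 1: learn on plentiful simulated data.
w, b = train(0.0, 0.0, simulated, lr=0.1, steps=200)
error_before = mse(w, b, real_test)

# Stage 2: transfer, i.e. continue training on the small real set.
w, b = train(w, b, real_small, lr=0.1, steps=50)
error_after = mse(w, b, real_test)

print(f"real-data error before fine-tuning: {error_before:.3f}")
print(f"real-data error after fine-tuning:  {error_after:.3f}")
```

Note that both error numbers are measured on held-out real data; the simulator never gets to grade itself, which is the point of the comment above.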

fractionalhare · on Dec 2, 2020

That's what I figured, re: transfer modeling. Thanks for chiming in.

m0nster · on Dec 2, 2020

ARX (see other comment in this thread) also supports data anonymization for privacy-preserving machine learning.

jacquesm · on Dec 2, 2020

There is a German company that specializes in this: https://www.statice.ai/, and that too is a path that you should only walk if you fully understand the subject matter.

So you could instead test on fake data that has the same (or as much as possible the same) statistical properties as the data that you would like to use.