Not sure that synthetic or LLM-generated training data is as useful as human-generated text.

It seems "good enough" (for now), but synthetic data makes up a very small proportion of the training sets used in current models. If that proportion ends up being mostly synthetic, we'll likely see whatever weird hallucinations and biases exist in the dominant backend (GPT-4 or whatever) become amplified.
It's been shown repeatedly that garbage in = garbage out for training data.
Agree about synthetic data. My point is that AI-powered applications deployed in production generate more _real_ data which can be used for training. For example, self-driving cars generate tons of data about how their models perform, simply as a result of driving around. Similarly, code-writing AI applications generate feedback in the form of errors, logs, etc. which can be fed back into the models as training data.
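To make that concrete, here's a minimal sketch of what such a feedback loop might look like for a code-writing application. Everything here (the function name, the JSONL format, running code via a bare subprocess) is an assumption for illustration; a real system would sandbox execution and filter what it keeps:

```python
import json
import subprocess
import sys
import tempfile

def record_execution_feedback(generated_code: str,
                              dataset_path: str = "feedback.jsonl") -> None:
    """Hypothetical sketch: run model-generated code and append the outcome
    (exit code, stdout, stderr) to a JSONL file for later training use."""
    # Write the generated snippet to a temp file so we can execute it.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code)
        script = f.name
    try:
        result = subprocess.run(
            [sys.executable, script],
            capture_output=True, text=True, timeout=30,
        )
        outcome = {
            "code": generated_code,
            "exit_code": result.returncode,
            "stdout": result.stdout,
            "stderr": result.stderr,  # tracebacks are the error signal
        }
    except subprocess.TimeoutExpired:
        outcome = {"code": generated_code, "exit_code": None,
                   "error": "timeout"}
    # Append one example per run; this file becomes real-world training data.
    with open(dataset_path, "a") as out:
        out.write(json.dumps(outcome) + "\n")
```

The point isn't this specific plumbing; it's that every deployment produces ground-truth signal (did the code run, what broke) that no amount of purely synthetic generation provides.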
It seems "good enough" (for now) but synthetic makes up a very small proportion of the training set being used in current models that have been trained on it, if that proportion ends up being mostly synthetic we'll likely see whatever weird hallucinations and biases in the dominant backend (GPT4 or whatever) become amplified.
It's been shown repeatedly that garbage in = garbage out for training data.