> Because so far we've never seen an AI improve based on its own output.
Maybe it's because AI is such an overloaded term, but this is pretty commonplace for (semi-)supervised learning algorithms.
Pseudo-labeling [1,2] is an example of this that has been around for decades. When done properly it does improve the performance of the original model, up to a certain limit (far from the singularity).
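To make the pseudo-labeling loop concrete, here is a minimal sketch in plain Python. It is not taken from [1] or [2]: the nearest-centroid classifier, the 1-D toy data, the `margin` threshold, and every function name are assumptions chosen for illustration. The model labels the unlabeled points it is confident about, adds them to its own training set, and refits.

```python
import random

def centroids(points, labels):
    """Fit a nearest-centroid classifier: the mean of the 1-D points in each class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    """Return (label, margin): the nearest centroid's class and how much
    closer it is than the runner-up. Assumes at least two classes."""
    dists = sorted((abs(x - c), y) for y, c in model.items())
    return dists[0][1], dists[1][0] - dists[0][0]

def pseudo_label(labeled_x, labeled_y, unlabeled_x, margin=1.0, rounds=5):
    """Self-training: repeatedly add confidently predicted points
    (with the model's own labels) to the training set and refit."""
    xs, ys = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    model = centroids(xs, ys)
    for _ in range(rounds):
        confident, rest = [], []
        for x in pool:
            label, m = predict(model, x)
            if m >= margin:
                confident.append((x, label))
            else:
                rest.append(x)
        if not confident:
            break  # nothing new to learn from: this is the "certain limit"
        for x, label in confident:
            xs.append(x)
            ys.append(label)
        pool = rest
        model = centroids(xs, ys)
    return model, len(xs)

# Toy data (an assumption): four labeled points, 200 unlabeled draws
# from two Gaussians centered at 0 and 4.
rng = random.Random(0)
labeled_x, labeled_y = [-0.5, 0.5, 3.5, 4.5], [0, 0, 1, 1]
unlabeled_x = ([rng.gauss(0, 1) for _ in range(100)]
               + [rng.gauss(4, 1) for _ in range(100)])
model, n_used = pseudo_label(labeled_x, labeled_y, unlabeled_x)
print(model, n_used)
```

The confidence threshold is what "done properly" amounts to in this sketch: without it, the model's early mistakes get fed back in as if they were ground truth.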
Moreover, it is apparently possible to improve a model's performance by augmenting its training set with synthetic examples generated by a second model [3].
Finally, boosting [4] can also be seen as iteratively leveraging the output of a model to train a slightly better model. In fact, gradient boosting, a specific type of boosting, often yields state-of-the-art performance on tabular data.
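As a rough illustration of "using a model's output to train a slightly better model", here is a sketch of one concrete variant: gradient boosting with squared-error loss and decision stumps. The toy sine data, the learning rate, and all function names are assumptions for the example, not something from [4]. Each round fits a stump to the current ensemble's residuals, so the next learner is trained directly on what the previous output got wrong.

```python
import math

def fit_stump(xs, residuals):
    """Best single-threshold regressor for 1-D inputs under squared error.
    Needs at least two distinct x values."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right leaf is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return t, lm, rm

def stump_predict(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def boost(xs, ys, rounds=20, lr=0.5):
    """Start from the mean, then repeatedly fit a stump to the residuals
    of the current ensemble and add a damped copy of it."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # errors of the current output
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump_predict(stump, x) for p, x in zip(preds, xs)]
        history.append(mse(ys, preds))
    return base, stumps, history

# Toy regression problem (an assumption): fit sin(x) on [0, 4.9].
xs = [i / 10 for i in range(50)]
ys = [math.sin(x) for x in xs]
base, stumps, history = boost(xs, ys)
base_mse = mse(ys, [base] * len(xs))
print(base_mse, history[0], history[-1])
```

Note that this only tracks training error. Like pseudo-labeling, the improvement flattens out after enough rounds rather than compounding indefinitely.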
This really only works well in resource-limited settings and/or semi-supervised tasks.
I've tried augmentation for LLM domain adaptation, and it yields very modest gains even in the best of situations. Even then, the augmented corpus is a very tiny fraction of the underlying training corpus.
I believe OP's question was getting at whether synthetic data is useful as a substantial corpus for unsupervised training of a language model (given the topic, it's reasonable to disregard other areas of 'AI'). The answer to that appears to be no, or at least unproven and non-intuitive.
[1] https://arxiv.org/abs/2101.06329
[2] https://stats.stackexchange.com/questions/364584/why-does-us...
[3] https://arxiv.org/abs/2304.08466
[4] https://en.m.wikipedia.org/wiki/Boosting_(machine_learning)