
> Because so far we've never seen an AI improve based on its own output.

Maybe it's because AI is such an overloaded term, but this is pretty commonplace for (semi-)supervised learning algorithms.

Pseudo-labeling [1,2] is an example of this that has been around for decades. When done properly, it does improve the performance of the original model, up to a certain limit (far from the singularity).
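For intuition, a minimal pseudo-labeling round looks something like this (a sketch with scikit-learn; the 0.95 confidence threshold and single retraining pass are illustrative choices on my part, not prescribed by [1] or [2]):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=2000, random_state=0)
    X_lab, y_lab, X_unlab = X[:200], y[:200], X[200:]

    # Train on the small labeled set, then let the model label the
    # unlabeled pool and keep only its confident predictions.
    model = LogisticRegression().fit(X_lab, y_lab)
    proba = model.predict_proba(X_unlab)
    keep = proba.max(axis=1) > 0.95

    # Retrain on the real labels plus the model's own pseudo-labels.
    X_aug = np.vstack([X_lab, X_unlab[keep]])
    y_aug = np.concatenate([y_lab, proba[keep].argmax(axis=1)])
    model = LogisticRegression().fit(X_aug, y_aug)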

Moreover, it is apparently possible to improve a model's performance by augmenting its training set with synthetic examples generated by a second model [3].
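A toy, runnable version of that idea, with a per-class Gaussian mixture standing in for the much fancier generative model used in [3]:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def augment(X_lab, y_lab, n_per_class=500):
        # Fit a small generative model per class, sample synthetic
        # examples from it, and append them to the training set.
        X_parts, y_parts = [X_lab], [y_lab]
        for c in np.unique(y_lab):
            gm = GaussianMixture(n_components=3, random_state=0)
            gm.fit(X_lab[y_lab == c])
            X_syn, _ = gm.sample(n_per_class)
            X_parts.append(X_syn)
            y_parts.append(np.full(n_per_class, c))
        return np.vstack(X_parts), np.concatenate(y_parts)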

Finally, boosting [4] can also be seen as iteratively leveraging the output of a model to train a slightly better model. In fact, a specific type of boosting often yields state-of-the-art performance on tabular data.
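That specific type is presumably gradient boosting, the engine behind XGBoost and LightGBM. A bare-bones version of the loop, fitting each new tree to the residuals of the ensemble built so far (squared loss; the hyperparameters here are arbitrary):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost(X, y, n_rounds=100, lr=0.1):
        # Start from the mean prediction; each round fits a shallow
        # tree to the current residuals (i.e. to the errors in the
        # output of the models trained so far) and adds it in.
        pred = np.full(len(y), y.mean())
        trees = []
        for _ in range(n_rounds):
            tree = DecisionTreeRegressor(max_depth=3).fit(X, y - pred)
            pred += lr * tree.predict(X)
            trees.append(tree)
        # Predict on new data with:
        #   y.mean() + lr * sum(t.predict(X_new) for t in trees)
        return trees, y.mean()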

[1] https://arxiv.org/abs/2101.06329

[2] https://stats.stackexchange.com/questions/364584/why-does-us...

[3] https://arxiv.org/abs/2304.08466

[4] https://en.m.wikipedia.org/wiki/Boosting_(machine_learning)




This really only works well in resource-limited settings and/or on semi-supervised tasks.

I've tried augmentation for LLM domain adaptation, and it yields very modest gains in the best of situations; even then, the augmented corpus is a tiny fraction of the underlying training corpus.

I believe the OP's question was getting at whether synthetic data is useful as a substantial corpus for unsupervised training of a language model (given the topic, it's reasonable to disregard other areas of "AI"), and that answer appears to be no, or at least unproven and non-intuitive.


Boosting is reminiscent of the wisdom-of-the-crowd effect.



