This really only works well in resource-limited settings and/or semi-supervised tasks.

I've tried augmentation for LLM domain adaptation and it yields very modest gains in the best of situations, and even then the augmented corpus is a tiny fraction of the underlying training corpus.
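For concreteness, here is a minimal sketch of the kind of setup being described: a small synthetic/augmented domain corpus blended into a much larger general corpus before continued pretraining. The function name, the 2% mixing ratio, and the placeholder documents are all assumptions for illustration, not the commenter's actual pipeline.

```python
import random

def mix_corpora(base_docs, augmented_docs, augmented_fraction=0.02, seed=0):
    """Blend a small augmented domain corpus into a large base corpus.

    augmented_fraction is the share of the final mix taken by augmented
    documents (a hypothetical ratio; the comment only says 'a tiny fraction').
    """
    rng = random.Random(seed)
    # Number of augmented docs needed so they make up augmented_fraction of the mix.
    n_aug = int(len(base_docs) * augmented_fraction / (1 - augmented_fraction))
    # Sample with replacement in case the augmented corpus is smaller than n_aug.
    sampled_aug = [augmented_docs[rng.randrange(len(augmented_docs))] for _ in range(n_aug)]
    mixed = list(base_docs) + sampled_aug
    rng.shuffle(mixed)
    return mixed

# Placeholder documents standing in for real pretraining and synthetic domain text.
base_docs = [f"general web text {i}" for i in range(1000)]
augmented_docs = [f"synthetic domain text {i}" for i in range(50)]
mixed = mix_corpora(base_docs, augmented_docs, augmented_fraction=0.02)
print(len(mixed), sum(d.startswith("synthetic") for d in mixed))
```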

I believe OP's question was getting at whether synthetic data is useful as a substantial corpus for unsupervised training of a language model (given the topic, it's reasonable to disregard other areas of 'AI'), and that answer appears to be no, or at least unproven and non-intuitive.



