> AI generated content that gets posted to the web as original human generated content, with LLMs getting re-trained on this content
Isn't that extrapolating the current trend a bit too far? Clearly, the text corpora[0] amassed before mass LLM content distribution are already big enough to train such models to decent general language fluency. So why would AI creators contaminate those datasets with potentially spurious content?
Sure, you want to keep your model up to date on the state of the world (the GPT corpus ends in mid-2021, as far as I know), but you can be much more careful about which texts you include. That newer training data serves a different purpose than the original corpus; you don't need to bootstrap general language proficiency anymore. OpenAI has already released a product for classifying AI-generated text; why wouldn't they use something like that to filter future training data, for example?
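To make the filtering idea concrete, here's a minimal sketch of what such a pipeline could look like. The `ai_likelihood` function here is a hypothetical stub standing in for a real classifier (OpenAI's actual classifier was an API product, and its name, interface, and accuracy are not assumed here); only the filtering logic is the point.

```python
# Hedged sketch: screen scraped documents with an AI-text classifier
# before adding them to a post-cutoff training set.

def ai_likelihood(text: str) -> float:
    """Hypothetical stub: estimated probability that `text` is
    AI-generated. A real pipeline would call an actual classifier."""
    # Placeholder heuristic, for illustration only.
    return 0.9 if "as an ai language model" in text.lower() else 0.1

def filter_corpus(docs: list[str], threshold: float = 0.5) -> list[str]:
    """Keep only documents the classifier scores as likely human-written."""
    return [d for d in docs if ai_likelihood(d) < threshold]

docs = [
    "Fresh forum post about a kernel bug, written by a person.",
    "As an AI language model, I cannot browse the internet.",
]
print(filter_corpus(docs))  # only the first document survives
```

Of course, this only works as well as the classifier itself, which is exactly the open question: detection accuracy on short or lightly edited text is known to be shaky.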
[0] edited, thanks!