Yes, consider an existing LLM being given “shitpost-y” messages and asked whether there is anything interesting in there. It could probably summarize it well, and that summary could then be used for training another LLM.
This assumes everything in the training data set is accurate. Sometimes people are wrong, obtuse, sarcastic, etc. LLMs don't have any way of detecting or accounting for this, do they?
That output, then being used to train other LLMs, just creates an ouroboros of AI-generated dogshit.
LLMs are state-of-the-art at detecting sarcasm. It won't help if the data is just plain wrong, though.
Edit: https://arxiv.org/abs/2312.03706 — human performance on this benchmark (detecting sarcasm in Reddit comments) was 0.82; a BERT-based LLM scored 0.79.
The training data doesn't need to be strictly accurate. If it were, you'd just be programming a deterministic robot. The whole point is to feed it actual human language. Giving it shitposts and sarcasm is literally what makes it good. Think of it like 100 people guessing the number of marbles in a jar: average their guesses and the result will be very close. The training data is the guesses; the inference is the average.
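The marble-jar analogy can be sketched as a quick simulation (the jar size, error range, and crowd size here are illustrative assumptions, not from any real dataset):

```python
import random

random.seed(42)

TRUE_COUNT = 500  # actual marbles in the jar (made-up value)

# 100 people each guess with large individual error (up to +/-40%),
# modeling noisy, unreliable individual data points.
guesses = [TRUE_COUNT * random.uniform(0.6, 1.4) for _ in range(100)]

average = sum(guesses) / len(guesses)
print(f"average guess: {average:.0f} (true count: {TRUE_COUNT})")
```

Any single guess can be off by 40%, but the error of the mean shrinks roughly with the square root of the number of guessers, so the average typically lands within a few percent of the true count.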
And yet human civilization has survived the fact that many humans are wrong, lying, delusional, etc. There is no assumption that everything in our personal training set is accurate. In fact, things work better when we explicitly reject that idea.
LLMs do not rely on 100% factually accurate inputs. Sure, you'd rather have less BS than more, but this is all statistics. Just as most people realize that flat earthers are nutty, LLMs can ingest falsehoods without reducing output quality (again, subject to statistics).
etc etc