It’s also important to remember that most of what is online is there to sell ads, so it does not reflect reality in proportion. I think people are trying too hard to find deep meaning everywhere; they might want to read more social science instead.
That said, what can be found online does cover a lot of what people do, think, and write offline, since a lot of material that wasn't produced online is being brought online (books, news, ...).
Otoh, it's not clear how (or if) LLM training balances the different sources.
Absolutely. Many sibling comments point out all the ways this is true:
that this massive sample (and it is truly big data) is still a tiny
speck of the human condition and spectrum of thought and experience.
It raises vital questions about where the centroid of this cloud
of random stuff that people decided to input into the machine really
lies. What is not represented in the model? Probably just about
everything! Big new questions about objectivity and normalcy arise.
Is the average of everybody else's intelligence actually any use at all
to an average individual? Does the average of everybody else's
intelligence have a different kind of use to groups, companies, states,
than the common utility of synthesising thought-like speech?