
I get the impression that scraping the web isn't nearly as important a source of LLM training data as it used to be.

Everyone is trimming down their training data based on quality - there are plenty of hints about that in the Llama 3.1 paper and Mistral Large 2 announcement.

OpenAI are licensing data from sources like the Associated Press.

Andrej Karpathy said this: https://twitter.com/karpathy/status/1797313173449764933

> Turns out that LLMs learn a lot better and faster from educational content as well. This is partly because the average Common Crawl article (internet pages) is not of very high value and distracts the training, packing in too much irrelevant information. The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all.



> The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all.

Perhaps we should stop exposing humans to them, as well?


It’s already the case that people don’t see that stuff very much.

The key word in that quote is “average.” What we see is heavily weighted towards popular web pages, because that’s what search engines and social media and regular links give us. We don’t see average.

It might be interesting if there were a way to pick a page at random from the Common Crawl, to get a better idea of what it's like.
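Something like this rough sketch would do it (Python; "CC-MAIN-2024-33" is just an example crawl ID, and warcio is a third-party library — pip install warcio requests):

    import gzip
    import random

    import requests
    from warcio.archiveiterator import ArchiveIterator

    CRAWL = "CC-MAIN-2024-33"  # example; any published crawl works
    BASE = "https://data.commoncrawl.org"

    # Each crawl publishes a gzipped list of its WET (extracted plain-text)
    # segment files.
    paths = gzip.decompress(
        requests.get(f"{BASE}/crawl-data/{CRAWL}/wet.paths.gz").content
    ).decode().splitlines()

    # Stream one randomly chosen segment and reservoir-sample a single record.
    resp = requests.get(f"{BASE}/{random.choice(paths)}", stream=True)
    pick, seen = None, 0
    for record in ArchiveIterator(resp.raw):
        if record.rec_type != "conversion":  # WET text records are 'conversion'
            continue
        seen += 1
        if random.randrange(seen) == 0:  # keep each record with prob 1/seen
            pick = (
                record.rec_headers.get_header("WARC-Target-URI"),
                record.content_stream().read().decode("utf-8", errors="replace"),
            )

    print(pick[0])
    print(pick[1][:500])

It's only approximately uniform over pages — segments are picked uniformly regardless of how many records they hold — but it's close enough to get a feel for the "average" page.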


They are moving to Discord.


Absolutely, at least for learning. If you want to learn something, the average web page is awful, and we should acknowledge that. Try learning something in your own field from it and you'll see just how bad it is.

That's why we're all armchair experts in other domains.


You trim, yes, but AI-generated content is surely seeping into (all?) areas of written material. People are increasingly using AI to assist their writing, even if it's just for light editing or word-choice suggestions.

Even AP doesn't ban the use of LLMs; its standards only prohibit publishing AI-generated content directly. I'm sure its writers leverage LLMs somewhere in their workflow, though, and they would probably keep doing so even if AP tried to ban them (human incentives being what they are).


If the AI-generated content is filtered for quality or corrected, it will still be good data. The phenomenon of model degradation only occurs when there is no outside influence on the generated data.
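As a toy sketch of that filtering step (the scoring heuristic here is a stand-in; real pipelines use trained quality classifiers, reward models, or human review):

    from dataclasses import dataclass

    @dataclass
    class Sample:
        text: str
        verified: bool  # outside influence: facts checked/corrected by a human or tool

    def quality_score(sample: Sample) -> float:
        """Crude proxy: penalize very short or highly repetitive text."""
        words = sample.text.split()
        if len(words) < 10:
            return 0.0
        return len(set(words)) / len(words)  # distinct-word ratio

    def keep(sample: Sample, threshold: float = 0.5) -> bool:
        # The degradation loop only closes when generated text is fed back
        # in unverified and unfiltered; this gate is the "outside influence".
        return sample.verified and quality_score(sample) >= threshold

    corpus = [
        Sample("spam spam spam " * 10, verified=True),
        Sample("A reviewed, corrected explanation of how WARC records are "
               "structured, with each claim checked against the spec.",
               verified=True),
    ]
    print([keep(s) for s in corpus])  # [False, True]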


I think this is extremely important with AI-generated content, but it seems to be given less and less thought as people start to "trust" AI as it seeps further into the public consciousness. AI output needs to be reviewed, filtered, and fixed where appropriate. After that, it isn't any different from reviewing data on your own and wording it in a way that fits the piece you're writing. Unfortunately, there's so much trust in AI now that people will publish content without even reading it to check that the tense is correct!


The same problem exists if you blindly trust any source without verifying it. There is a huge amount of endlessly recycled, incorrect blog spam out there in every domain. Not only that, but this problem has always existed for second-hand information, so it's not as if we were starting from some pristine state of perfect truthfulness. We have the tools we need to deal with the situation, and they were developed hundreds of years ago, empiricism chief among them. Nullius in verba [0]

[0] https://en.wikipedia.org/wiki/Nullius_in_verba


If tail events aren't produced by these models, no amount of human filtering will get them back. People would not just need to filter or adjust AI-generated content, but create novel content of their own.
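A quick numerical toy makes the point (numpy; the 2.5-sigma truncation stands in for a generator that only emits "typical" output, like low-temperature or top-p sampling):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.standard_normal(100_000)          # generation 0: "real" data

    for gen in range(6):
        p999 = np.quantile(np.abs(data), 0.999)  # how far out the tails reach
        print(f"gen {gen}: 99.9th percentile of |x| = {p999:.2f}")
        mu, sigma = data.mean(), data.std()
        draws = rng.normal(mu, sigma, 200_000)
        # Truncated sampling: the "model" only emits typical outputs.
        data = draws[np.abs(draws - mu) < 2.5 * sigma][:100_000]

The reach of the tails shrinks every generation. Filtering afterwards can only remove samples; nothing in the loop can reintroduce the rare events, which is exactly the problem.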


I think this is roughly correct. My 2c is that folks used the initial web data to cold-start and bootstrap the first few models, but much of the performance increase we have seen at smaller sizes reflects a shift towards more conscientious data creation/purchase/curation/preparation and more refined evaluation datasets. I think the practice of scraping random text will diminish over time, except perhaps for the initial language-understanding pre-training phase.

This is understood in the academic literature as well: for months/years now people have been publishing papers showing that a small amount of high-quality data is worth more than a large amount of low-quality data (which tracks with what you can pick up from an ML 101 education/training).
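As a toy version of that claim (ordinary least squares standing in for an LLM, with label noise standing in for "low quality"):

    import numpy as np

    rng = np.random.default_rng(1)
    w_true = np.array([2.0, -1.0, 0.5])

    def make_data(n, noise):
        X = rng.standard_normal((n, 3))
        y = X @ w_true + rng.standard_normal(n) * noise
        return X, y

    def fit(X, y):
        return np.linalg.lstsq(X, y, rcond=None)[0]  # ordinary least squares

    X_small, y_small = make_data(500, noise=0.1)     # small, high quality
    X_big, y_big = make_data(50_000, noise=5.0)      # big, low quality
    X_test, y_test = make_data(2_000, noise=0.0)     # clean held-out data

    for name, w in [("small/clean", fit(X_small, y_small)),
                    ("big/noisy  ", fit(X_big, y_big))]:
        mse = np.mean((X_test @ w - y_test) ** 2)
        print(f"{name}: test MSE = {mse:.5f}")

The small clean set wins despite having 100x fewer points, because the estimator's error scales with noise-squared over sample count, and the quality gap outweighs the quantity gap here.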




