
I get the impression that scraping the web isn't nearly as important a source of LLM training data as it used to be.

Everyone is trimming down their training data based on quality - there are plenty of hints about that in the Llama 3.1 paper and Mistral Large 2 announcement.

OpenAI are licensing data from sources like the Associated Press.

Andrej Karpathy said this: https://twitter.com/karpathy/status/1797313173449764933

> Turns out that LLMs learn a lot better and faster from educational content as well. This is partly because the average Common Crawl article (internet pages) is not of very high value and distracts the training, packing in too much irrelevant information. The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all.



> The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all.

Perhaps we should stop exposing humans to them, as well?


It’s already the case that people don’t see that stuff very much.

The key word in that quote is “average.” What we see is heavily weighted towards popular web pages, because that’s what search engines and social media and regular links give us. We don’t see average.

It might be interesting if there were a way to pick a page at random from the Common Crawl, to get a better idea of what it's like.
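Something like this rough sketch would do it (Python; "CC-MAIN-2024-33" is just an example crawl ID, and warcio is a third-party library — pip install warcio requests):

    import gzip
    import random

    import requests
    from warcio.archiveiterator import ArchiveIterator

    CRAWL = "CC-MAIN-2024-33"  # example; any published crawl works
    BASE = "https://data.commoncrawl.org"

    # Each crawl publishes a gzipped list of its WET (extracted plain-text)
    # segment files.
    paths = gzip.decompress(
        requests.get(f"{BASE}/crawl-data/{CRAWL}/wet.paths.gz").content
    ).decode().splitlines()

    # Stream one randomly chosen segment and reservoir-sample a single record.
    resp = requests.get(f"{BASE}/{random.choice(paths)}", stream=True)
    pick, seen = None, 0
    for record in ArchiveIterator(resp.raw):
        if record.rec_type != "conversion":  # WET text records are 'conversion'
            continue
        seen += 1
        if random.randrange(seen) == 0:  # keep each record with prob 1/seen
            pick = (
                record.rec_headers.get_header("WARC-Target-URI"),
                record.content_stream().read().decode("utf-8", errors="replace"),
            )

    print(pick[0])
    print(pick[1][:500])

It's only approximately uniform over pages — segments are picked uniformly regardless of how many records they hold — but it's close enough to get a feel for the "average" page.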


They are moving to Discord.


Absolutely, at least for learning. If you want to learn something, the average web page is awful, and we should acknowledge that. Try learning something in your own field from it and you'll see just how bad it is.

That's why we're all armchair experts in other domains.


You trim, yes, but AI-generated content is surely seeping into (all?) areas of written material. People are increasingly using AI to assist their writing, even if it's just for light editing or word-choice suggestions.

Even AP doesn't ban the use of LLMs; its standards only prohibit publishing AI-generated content directly. I'm sure its writers leverage LLMs somewhere in their workflow, though, and they would probably keep doing so even if AP tried to ban them (human incentives being what they are).


If the AI-generated content is filtered for quality or corrected, it will still be good data. The phenomenon of model degradation only occurs when there is no outside influence on the generated data.
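As a toy sketch of that filtering step (the scoring heuristic here is a stand-in; real pipelines use trained quality classifiers, reward models, or human review):

    from dataclasses import dataclass

    @dataclass
    class Sample:
        text: str
        verified: bool  # outside influence: facts checked/corrected by a human or tool

    def quality_score(sample: Sample) -> float:
        """Crude proxy: penalize very short or highly repetitive text."""
        words = sample.text.split()
        if len(words) < 10:
            return 0.0
        return len(set(words)) / len(words)  # distinct-word ratio

    def keep(sample: Sample, threshold: float = 0.5) -> bool:
        # The degradation loop only closes when generated text is fed back
        # in unverified and unfiltered; this gate is the "outside influence".
        return sample.verified and quality_score(sample) >= threshold

    corpus = [
        Sample("spam spam spam " * 10, verified=True),
        Sample("A reviewed, corrected explanation of how WARC records are "
               "structured, with each claim checked against the spec.",
               verified=True),
    ]
    print([keep(s) for s in corpus])  # [False, True]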


I think this is extremely important with AI-generated content, but it seems to be given less and less thought as people start to "trust" AI as it seeps further into the public consciousness. AI output needs to be reviewed, filtered, and fixed where appropriate. After that, it isn't any different from reviewing data on your own and wording it in a way that fits the piece you're writing. Unfortunately, there's so much trust in AI now that people will publish content without even reading it to check that the tense is correct!


The same problem exists if you blindly trust any source without verifying it. There is a huge amount of endlessly recycled, incorrect blog spam out there in every domain. Not only that, but this problem has always existed for second-hand information, so it's not as if we were starting from some pristine state of perfect truthfulness. We have the tools we need to deal with the situation, and they were developed hundreds of years ago, empiricism chief among them. Nullius in verba [0]

[0] https://en.wikipedia.org/wiki/Nullius_in_verba


If tail events aren't produced by these models, no amount of human filtering will get them back. People would not just need to filter or adjust AI-generated content, but create novel content of their own.
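A quick numerical toy makes the point (numpy; the 2.5-sigma truncation stands in for a generator that only emits "typical" output, like low-temperature or top-p sampling):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.standard_normal(100_000)          # generation 0: "real" data

    for gen in range(6):
        p999 = np.quantile(np.abs(data), 0.999)  # how far out the tails reach
        print(f"gen {gen}: 99.9th percentile of |x| = {p999:.2f}")
        mu, sigma = data.mean(), data.std()
        draws = rng.normal(mu, sigma, 200_000)
        # Truncated sampling: the "model" only emits typical outputs.
        data = draws[np.abs(draws - mu) < 2.5 * sigma][:100_000]

The reach of the tails shrinks every generation. Filtering afterwards can only remove samples; nothing in the loop can reintroduce the rare events, which is exactly the problem.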


I think this is roughly correct. My 2c is that folks used the initial web data to cold-start and bootstrap the first few models, but much of the performance increase we have seen at smaller sizes reflects a shift towards more conscientious data creation/purchase/curation/preparation and more refined evaluation datasets. I think the practice of scraping random text will diminish over time, except perhaps for the initial language-understanding pre-training phase.

This is understood in the academic literature as well: for months/years now people have been publishing papers showing that a small amount of high-quality data is worth more than a large amount of low-quality data (which tracks with what you can pick up from an ML 101 education/training).
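As a toy version of that claim (ordinary least squares standing in for an LLM, with label noise standing in for "low quality"):

    import numpy as np

    rng = np.random.default_rng(1)
    w_true = np.array([2.0, -1.0, 0.5])

    def make_data(n, noise):
        X = rng.standard_normal((n, 3))
        y = X @ w_true + rng.standard_normal(n) * noise
        return X, y

    def fit(X, y):
        return np.linalg.lstsq(X, y, rcond=None)[0]  # ordinary least squares

    X_small, y_small = make_data(500, noise=0.1)     # small, high quality
    X_big, y_big = make_data(50_000, noise=5.0)      # big, low quality
    X_test, y_test = make_data(2_000, noise=0.0)     # clean held-out data

    for name, w in [("small/clean", fit(X_small, y_small)),
                    ("big/noisy  ", fit(X_big, y_big))]:
        mse = np.mean((X_test @ w - y_test) ** 2)
        print(f"{name}: test MSE = {mse:.5f}")

The small clean set wins despite having 100x fewer points, because the estimator's error scales with noise-squared over sample count, and the quality gap outweighs the quantity gap here.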




