FineWeb: Decanting the web for the finest text data at scale (huggingface.co)
127 points by nmstoker 8 months ago | 14 comments



I'm always happy to see the proliferation of open-source resources for the next generative models. But I strongly suspect that OpenAI and friends are all using copyrighted content from the wealth of shadow book repositories available online [1]. Unless open models are doing the same, I doubt they will ever get meaningfully close to the quality of closed-source models.

Related: I also suspect that this is one reason we get so little information about the exact data used to train Meta's Llama models ("open weights" vs "open source").

[1]: https://www.annas-archive.org/llm


Of course. This isn't even a suspicion. They are using any publicly accessible dataset, especially the piracy hoards.


Am continuing to reflect on this, so may have more to say later, but what struck me first was the commitment to openness with FineWeb; it is impressive and thorough (e.g. all the scripts they actually used are available and linked to, not just the finished data).
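
If you want to poke at the data yourself, a minimal sketch of streaming one dump with the datasets library (the config name "CC-MAIN-2024-10" and the "text"/"url" field names are taken from my reading of the dataset card, so treat them as assumptions):

    from datasets import load_dataset

    # Stream a single FineWeb dump rather than downloading the whole corpus.
    fw = load_dataset("HuggingFaceFW/fineweb",
                      name="CC-MAIN-2024-10",  # one example dump; see the dataset card
                      split="train",
                      streaming=True)

    # Peek at a few documents.
    for i, doc in enumerate(fw):
        print(doc["url"], len(doc["text"]))
        if i >= 4:
            break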


I agree, the information provided in this article is a treasure. Maybe someone will add some "magic sauce" to it?


The deduplication discussion shows they don't filter out ads as part of their cleaning. I appreciate this could be risky and perhaps a huge processing step given the dataset sizes, but intuitively it feels like it would cut the noise dramatically and thus help the signal within datasets.
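
Just to make the idea concrete, here's a toy sketch of the kind of filter I mean (not anything FineWeb actually does; the keyword list and thresholds are made up for illustration):

    import re

    # Toy ad/boilerplate line filter -- illustrative only.
    AD_PATTERNS = re.compile(
        r"sponsored|advertisement|subscribe now|cookie policy|accept all cookies",
        re.IGNORECASE,
    )

    def strip_ad_lines(text: str) -> str:
        kept = []
        for line in text.splitlines():
            words = line.split()
            if not words:
                continue
            if AD_PATTERNS.search(line):
                continue
            # Short all-caps lines are often navigation/ad chrome.
            if len(words) < 4 and line.isupper():
                continue
            kept.append(line)
        return "\n".join(kept)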


Reading the section on synthetic data (https://huggingfacefw-blogpost-fineweb-v1.static.hf.space/di...) was eye-opening for me. The hockey-stick growth of words associated with common ChatGPT output in the Common Crawl data over the past ~18 months is worrying.
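
You can get a rough feel for that kind of measurement with a few lines of Python (the marker phrases below are my own illustrative picks, not the ones tracked in the article):

    # Fraction of documents containing at least one ChatGPT-flavoured phrase.
    SYNTHETIC_MARKERS = [
        "as a large language model",
        "as an ai language model",
        "i cannot fulfill",
        "knowledge cutoff",
    ]

    def marker_rate(docs):
        hits = total = 0
        for text in docs:
            total += 1
            if any(m in text.lower() for m in SYNTHETIC_MARKERS):
                hits += 1
        return hits / max(total, 1)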


It might be worrying, but they also point out that the quality seems to go up. Perhaps people think that random web scrapes are way better than they really are, and so expect ChatGPT output to worsen corpora on average rather than improve them...


I am by no means a data scientist, but if ChatGPT, as a large language model, was trained to optimize the same "quality" metrics that are used to evaluate the models trained on these random web scrapes, and ChatGPT output now makes up a larger proportion of those scrapes, wouldn't the measured "quality" increase as a result? It all seems intertwined.

In other words, are we just overfitting?

It's important to note that the tests that they use appear to be open source, for example https://huggingface.co/datasets/lighteval/mmlu.
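
You can pull that eval data down directly with the datasets library; a quick sketch (the subset name "abstract_algebra", the "test" split, and the field names are assumptions from the dataset card):

    from datasets import load_dataset

    # Load one MMLU subject and look at a sample question.
    mmlu = load_dataset("lighteval/mmlu", "abstract_algebra", split="test")
    row = mmlu[0]
    print(row["question"], row["choices"], row["answer"])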

Again, I could be totally ignorant about how these things work. (edited to add key words associated with ChatGPT output in order to increase the quality of my comment :))


The article contains a recipe worth millions of dollars. And there's silence in the comments... why?


Yep, interesting that massively useful and impactful work like this doesn't seem to get as much interest on HN compared to gimmicky new architectures that have only been shown to work well on tiny datasets like MNIST and CIFAR, rather than at scale.


Takes time to read and evaluate something of high quality.

Likely some good blog posts evaluating and summarizing it will come out and get posted here in the coming days / weeks.



Filtering datasets is not the flashiest and shiniest part of the AI ecosystem, so I'm not surprised it doesn't evoke strong emotions or lots of comments.


Any ideas for a "magic sauce" for this?



