I'm always happy to see the proliferation of open-source resources for the next generative models. But I strongly suspect that OpenAI and friends are all using copyrighted content from the wealth of shadow book repositories available online [1]. Unless open models are doing the same, I doubt they will ever get meaningfully close to the quality of closed-source models.
Related: I also suspect that this is one reason we get so little information about the exact data used to train Meta's Llama models ("open weights" vs "open source").
I'm still reflecting on this, so I may have more to say later, but what struck me first was FineWeb's commitment to openness: it's impressive and thorough (e.g. all the scripts they actually used are available and linked to, not just the finished data).
The deduplication discussion shows they don't filter out ads as part of their cleaning - I appreciate this could be risky, and perhaps a huge processing step given dataset sizes, but intuitively it feels like it would cut the noise dramatically and thus help the signal within datasets.
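To make concrete what I have in mind, here's a rough sketch of a line-level ad filter in the spirit of FineWeb's other line-level cleaning steps. The marker list and threshold are made up for illustration; this is not part of their actual pipeline.

```python
import re

# Hypothetical ad/boilerplate markers -- purely illustrative, not FineWeb's filters.
AD_MARKERS = [
    "sponsored content", "advertisement", "buy now",
    "subscribe to our newsletter", "click here", "limited time offer",
]
AD_PATTERN = re.compile("|".join(re.escape(m) for m in AD_MARKERS), re.IGNORECASE)

def strip_ad_lines(document: str, max_ad_ratio: float = 0.5):
    """Drop ad-like lines; discard the whole document if too much of it was ads."""
    lines = document.splitlines()
    kept = [line for line in lines if not AD_PATTERN.search(line)]
    if not lines or (len(lines) - len(kept)) / len(lines) > max_ad_ratio:
        return None  # mostly ads -> drop the document entirely
    return "\n".join(kept)

print(strip_ad_lines(
    "Ducks are waterfowl in the family Anatidae.\n"
    "Buy now! Limited time offer on duck feed!\n"
    "Most ducks are strong fliers."
))
```

The obvious cost is exactly the one mentioned above: another full pass over a web-scale corpus, plus the risk of false positives deleting legitimate text.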
Reading the section on [synthetic data](https://huggingfacefw-blogpost-fineweb-v1.static.hf.space/di...) was eye-opening for me. The hockey-stick growth of words associated with common ChatGPT output in Common Crawl over the past ~18 months is worrying.
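For context, the measurement behind that plot is conceptually simple: count how often a set of marker phrases associated with ChatGPT output appears in each crawl snapshot and compare snapshots over time. A minimal sketch (the marker list here is my guess, not necessarily the one used in the blog post):

```python
from collections import Counter

# My guess at marker phrases associated with ChatGPT output; the actual
# word list tracked in the blog post may differ.
CHATGPT_MARKERS = [
    "as an ai language model",
    "as of my last knowledge update",
    "i hope this helps",
    "delve",
]

def marker_frequency(documents):
    """Occurrences of each marker per million words in one crawl snapshot."""
    counts = Counter()
    total_words = 0
    for doc in documents:
        text = doc.lower()
        total_words += len(text.split())
        for marker in CHATGPT_MARKERS:
            counts[marker] += text.count(marker)
    return {m: 1e6 * counts[m] / max(total_words, 1) for m in CHATGPT_MARKERS}

# Comparing snapshots over time is what produces the hockey-stick curve, e.g.:
# for snapshot_name, docs in snapshots.items():
#     print(snapshot_name, marker_frequency(docs))
```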
It might be worrying but they also point out that the quality seems to go up. Perhaps people think that random web scrapes are way better than they really are, and so expect ChatGPT output to worsen corpuses on average rather than improve them...
I am by no means a data scientist, but if, as a large language model, ChatGPT was trained to optimize the same "quality" metrics that are used to evaluate the models trained on these random web scrapes, and ChatGPT output now makes up a larger proportion of those web scrapes, wouldn't the measured "quality" increase as a result? It all seems intertwined.
Again, I could be totally ignorant on how these things work. (edited to add key words associated with ChatGPT output in order to increase the quality of my comment :))
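To illustrate what I mean by intertwined: many web-scrape quality filters are heuristics that reward fluent, well-structured text, which is exactly what ChatGPT is tuned to produce. Here's a toy sketch (the scoring is entirely made up, not any real pipeline's filter) where polished, assistant-style text scores higher than a noisy scrape:

```python
import re

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that", "it", "for"}

def toy_quality_score(text: str) -> float:
    """Made-up stand-in for a heuristic quality filter: rewards a natural
    stopword ratio, text ending in punctuation, and reasonable sentence length."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    stopword_ratio = sum(w in STOPWORDS for w in words) / len(words)
    sentences = [s for s in re.split(r"[.!?]", text) if s.strip()]
    ends_cleanly = text.strip().endswith((".", "!", "?"))
    avg_sentence_len = len(words) / max(len(sentences), 1)
    return (0.5 * stopword_ratio
            + (0.3 if ends_cleanly else 0.0)
            + 0.2 * min(avg_sentence_len / 20, 1.0))

noisy_scrape = "home | about | contact    BUY NOW cheap flights cheap flights"
polished_output = ("Certainly! Here is a brief overview of the topic. "
                   "It covers the key points in a clear and structured way.")
print(toy_quality_score(noisy_scrape), toy_quality_score(polished_output))
# The polished, assistant-style text scores higher, so a scrape containing more
# of it looks "better" to the filter even if no new information was added.
```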
Yep, interesting that massively useful and impactful work like this doesn't seem to get as much interest on HN compared to gimmicky new architectures that have only been shown to work well on tiny datasets like MNIST and CIFAR, rather than at scale.
Filtering datasets is not the flashiest and shiniest part of the AI ecosystem, so I'm not surprised it doesn't evoke strong emotions or lots of comments.
[1]: https://www.annas-archive.org/llm