I'm always happy to see the proliferation of open-source resources for the next generative models. But I strongly suspect that OpenAI and friends are all using copyrighted content from the wealth of shadow book repositories available online [1]. Unless open models are doing the same, I doubt they will ever get meaningfully close to the quality of closed-source models.
Related: I also suspect that this is one reason we get so little information about the exact data used to train Meta's Llama models ("open weights" vs "open source").
I'm still reflecting on this, so I may have more to say later, but what struck me first was FineWeb's commitment to openness: it's impressive and thorough (e.g. all the scripts they actually used are available and linked to, not just the finished data).
The deduplication discussion shows they don't filter out ads as part of their cleaning - I appreciate this could be risky, and perhaps a huge processing step given dataset sizes, but intuitively it feels like it would cut the noise dramatically and thus help the signal within datasets.
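To make concrete what I have in mind, here's a rough sketch of a line-level ad filter in the spirit of FineWeb's other line-level cleaning steps. The marker list and threshold are made up for illustration; this is not part of their actual pipeline.

```python
import re

# Hypothetical ad/boilerplate markers -- purely illustrative, not FineWeb's filters.
AD_MARKERS = [
    "sponsored content", "advertisement", "buy now",
    "subscribe to our newsletter", "click here", "limited time offer",
]
AD_PATTERN = re.compile("|".join(re.escape(m) for m in AD_MARKERS), re.IGNORECASE)

def strip_ad_lines(document: str, max_ad_ratio: float = 0.5):
    """Drop ad-like lines; discard the whole document if too much of it was ads."""
    lines = document.splitlines()
    kept = [line for line in lines if not AD_PATTERN.search(line)]
    if not lines or (len(lines) - len(kept)) / len(lines) > max_ad_ratio:
        return None  # mostly ads -> drop the document entirely
    return "\n".join(kept)

print(strip_ad_lines(
    "Ducks are waterfowl in the family Anatidae.\n"
    "Buy now! Limited time offer on duck feed!\n"
    "Most ducks are strong fliers."
))
```

The obvious cost is exactly the one mentioned above: another full pass over a web-scale corpus, plus the risk of false positives deleting legitimate text.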
Reading the section on [synthetic data](https://huggingfacefw-blogpost-fineweb-v1.static.hf.space/di...) was eye-opening for me. The hockey-stick growth of words associated with common ChatGPT output in Common Crawl over the past ~18 months is worrying.
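For context, the measurement behind that plot is conceptually simple: count how often a set of marker phrases associated with ChatGPT output appears in each crawl snapshot and compare snapshots over time. A minimal sketch (the marker list here is my guess, not necessarily the one used in the blog post):

```python
from collections import Counter

# My guess at marker phrases associated with ChatGPT output; the actual
# word list tracked in the blog post may differ.
CHATGPT_MARKERS = [
    "as an ai language model",
    "as of my last knowledge update",
    "i hope this helps",
    "delve",
]

def marker_frequency(documents):
    """Occurrences of each marker per million words in one crawl snapshot."""
    counts = Counter()
    total_words = 0
    for doc in documents:
        text = doc.lower()
        total_words += len(text.split())
        for marker in CHATGPT_MARKERS:
            counts[marker] += text.count(marker)
    return {m: 1e6 * counts[m] / max(total_words, 1) for m in CHATGPT_MARKERS}

# Comparing snapshots over time is what produces the hockey-stick curve, e.g.:
# for snapshot_name, docs in snapshots.items():
#     print(snapshot_name, marker_frequency(docs))
```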
It might be worrying but they also point out that the quality seems to go up. Perhaps people think that random web scrapes are way better than they really are, and so expect ChatGPT output to worsen corpuses on average rather than improve them...
I am by no means a data scientist, but if, as a large language model, ChatGPT was trained to optimize the same "quality" metrics that are used to evaluate the models trained on these random web scrapes, and ChatGPT output now makes up a larger proportion of those web scrapes, wouldn't the measured "quality" increase as a result? It all seems intertwined.
Again, I could be totally ignorant on how these things work. (edited to add key words associated with ChatGPT output in order to increase the quality of my comment :))
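To illustrate what I mean by intertwined: many web-scrape quality filters are heuristics that reward fluent, well-structured text, which is exactly what ChatGPT is tuned to produce. Here's a toy sketch (the scoring is entirely made up, not any real pipeline's filter) where polished, assistant-style text scores higher than a noisy scrape:

```python
import re

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that", "it", "for"}

def toy_quality_score(text: str) -> float:
    """Made-up stand-in for a heuristic quality filter: rewards a natural
    stopword ratio, text ending in punctuation, and reasonable sentence length."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    stopword_ratio = sum(w in STOPWORDS for w in words) / len(words)
    sentences = [s for s in re.split(r"[.!?]", text) if s.strip()]
    ends_cleanly = text.strip().endswith((".", "!", "?"))
    avg_sentence_len = len(words) / max(len(sentences), 1)
    return (0.5 * stopword_ratio
            + (0.3 if ends_cleanly else 0.0)
            + 0.2 * min(avg_sentence_len / 20, 1.0))

noisy_scrape = "home | about | contact    BUY NOW cheap flights cheap flights"
polished_output = ("Certainly! Here is a brief overview of the topic. "
                   "It covers the key points in a clear and structured way.")
print(toy_quality_score(noisy_scrape), toy_quality_score(polished_output))
# The polished, assistant-style text scores higher, so a scrape containing more
# of it looks "better" to the filter even if no new information was added.
```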
Yep, interesting that massively useful and impactful work like this doesn't seem to get as much interest on HN compared to gimmicky new architectures that have only been shown to work well on tiny datasets like MNIST and CIFAR, rather than at scale.
Filtering datasets is not the flashiest and shiniest part of the AI ecosystem, so I'm not surprised it doesn't evoke strong emotions or lots of comments.
[1]: https://www.annas-archive.org/llm