
books3.tar.gz itself is ~37 GB compressed. Often what's really being discussed is the entire "The Pile" dataset (made up of the mostly compressed component archives plus a ~450 GB compressed jsonl compilation of the data). That's around 825 GB.



That is shockingly approachable for a large fraction of English literature.


Out of curiosity, why is jsonl so popular in the ML space?


Not speaking for ML, but I love jsonl as a distribution format. There's a lot of storage overhead vs a more appropriate bulk container, but it makes it trivial to stream, sample, concatenate, etc. Anything else (e.g. csv or parquet) is going to require better tooling and/or just a little shell magic to handle headers.
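To make the "stream, sample, concatenate" point concrete, here's a rough sketch in plain Python. The function names (`stream_jsonl`, `reservoir_sample`, `concat_jsonl`) are made up for illustration and aren't from any library, and it assumes one JSON object per line in UTF-8. The sampling is ordinary reservoir sampling, which is just one reasonable way to sample a stream of unknown length.

```python
import json
import random
from typing import Iterable, Iterator


def stream_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line, never holding the whole file in memory."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def reservoir_sample(records: Iterable[dict], k: int, seed: int = 0) -> list:
    """Uniformly sample up to k records from a stream of unknown length (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(rec)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = rec
    return sample


def concat_jsonl(paths: Iterable[str], out_path: str) -> None:
    """Concatenate files line by line; there is no header row to skip."""
    with open(out_path, "wb") as out:
        for p in paths:
            with open(p, "rb") as src:
                for line in src:
                    # Guard against a final line that lacks a trailing newline.
                    out.write(line if line.endswith(b"\n") else line + b"\n")
```

The same operations on csv would need to drop the header from every file after the first, and on parquet would need a reader library, which is roughly the tooling gap described above.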



