books3.tar.gz itself is ~37 GB compressed. Often really the entire "The Pile" dat...

fbdab103 · on Sept 4, 2023

That is shockingly approachable for a large fraction of English literature.

Tokumei-no-hito · on Sept 4, 2023

Out of curiosity, why is jsonl so popular in the ML space?

fbdab103 · on Sept 4, 2023

Not speaking for ML, but I love jsonl as a distribution format. It carries a lot of storage overhead vs a more appropriate bulk container, but it makes it trivial to stream, sample, concatenate, etc. Anything else (e.g. CSV or Parquet) is going to require better tooling and/or a little shell magic to handle headers.
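
To make "stream, sample, concatenate" concrete, here is a rough Python sketch. The function names (`iter_jsonl`, `reservoir_sample`, `concat_jsonl`) are made up for illustration, and it assumes every file is UTF-8 with one JSON object per line and a trailing newline.

```python
import json
import random
import shutil


def iter_jsonl(path):
    """Stream records one at a time, so memory use does not grow with file size."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # tolerate blank lines
                yield json.loads(line)


def reservoir_sample(records, k, seed=0):
    """Uniformly sample up to k records in a single pass (reservoir sampling)."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample


def concat_jsonl(paths, out_path):
    """Concatenate files byte-for-byte; there is no header row to strip.

    Assumes each input ends with a newline, otherwise two records would
    end up fused on one line.
    """
    with open(out_path, "wb") as out:
        for p in paths:
            with open(p, "rb") as src:
                shutil.copyfileobj(src, out)
```

Because each line is a complete record, the concatenation step is the same thing as `cat a.jsonl b.jsonl > all.jsonl` in a shell. With CSV you would have to drop the header row from every file after the first.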