books3.tar.gz itself is ~37 GB compressed. Often what's actually being discussed is the entire "The Pile" dataset, which comprises both the (mostly compressed) component archives and a ~450 GB compressed JSONL compilation of the data; all told that's around 825 GB.
Not speaking for ML, but I love JSONL as a distribution format. There's a lot of storage overhead versus a more appropriate bulk container, but it makes it trivial to stream, sample, concatenate, etc. Anything else (e.g. CSV or Parquet) requires better tooling and/or a little shell magic to handle headers.
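To make the "trivial to stream and sample" point concrete, here's a minimal Python sketch. The filename pile_shard.jsonl and the "text" field are illustrative assumptions, not part of any official tooling; it just reads records lazily and reservoir-samples from a file too big to fit in memory (concatenation is simply `cat a.jsonl b.jsonl > combined.jsonl`).

```python
import json
import random

# Stream records one at a time -- no need to load the whole file.
def stream_records(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield json.loads(line)

# Reservoir-sample k records in a single pass over an arbitrarily
# large file, holding only k records in memory (Algorithm R).
def sample_records(path, k, seed=0):
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(stream_records(path)):
        if i < k:
            reservoir.append(record)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = record
    return reservoir

if __name__ == "__main__":
    # "text" is a hypothetical field name for illustration.
    for rec in sample_records("pile_shard.jsonl", k=3):
        print(rec.get("text", "")[:80])
```

The same one-pass pattern works on a CSV or Parquet file only with extra machinery (header handling, a columnar reader), which is the tooling gap mentioned above.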