
Another key technique I've used is to lean on pipes and basic command-line tools where possible to pre- or postprocess your data. For example, a `sort | uniq -c | sort -nr | head` pipeline to get only the most frequently occurring lines works in a few kilobytes of RAM no matter how big the input is. Combine this with chunking and you can get a lot of data processed in manageable amounts of memory.
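
A minimal sketch of driving that kind of pipeline from Python, if you want the aggregated result back in your program (the file name and the head count are placeholders I picked):

    import subprocess

    # Pre-aggregate with coreutils, then pull only the small result into Python.
    # "access.log" is a hypothetical input file.
    pipeline = "sort access.log | uniq -c | sort -nr | head -n 20"
    result = subprocess.run(pipeline, shell=True, capture_output=True,
                            text=True, check=True)
    for line in result.stdout.splitlines():
        # uniq -c output looks like "   42 some line"
        count, _, value = line.strip().partition(" ")
        print(int(count), value)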

Of course the next problem is "my data doesn't fit on disk", but with xzcat on the command line and lzma.open in python you can work transparently with compressed files.
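
The Python side is basically a drop-in swap for open(); a minimal sketch, assuming a hypothetical big.log.xz:

    import lzma

    # lzma.open decompresses on the fly, so the file is read line by line
    # and the uncompressed data never has to fit in memory at once.
    errors = 0
    with lzma.open("big.log.xz", mode="rt", encoding="utf-8") as f:
        for line in f:
            if "ERROR" in line:
                errors += 1
    print(errors)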



> Of course the next problem is "my data doesn't fit on disk", but with xzcat on the command line and lzma.open in python you can work transparently with compressed files.

I usually use LZ4 compression for this purpose, because it's ridiculously fast (700 MB/s per core compression and 5 GB/s decompression). Sure, compression ratios aren't as good, but usually good enough.
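
In Python the same streaming pattern works with the third-party lz4 package (that dependency and the file name are assumptions on my part):

    import lz4.frame  # third-party "lz4" package, not in the stdlib

    # Same idea as lzma.open, just a faster codec.
    # "big.log.lz4" is a hypothetical file name.
    with lz4.frame.open("big.log.lz4", mode="rt", encoding="utf-8") as f:
        total = sum(1 for _ in f)
    print(total)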


Completely agree (I do this myself all the time), but keep in mind that the initial sort does not have a constant memory cost.

Still, one only realizes how well written these basic tools are after getting a few bruises from fancier ones.


I don't know what POSIX says on the matter, but at least GNU sort uses some variation of merge sort with temporary files and has for all intents and purposes constant memory use.
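
This isn't what GNU sort literally does internally, but the external-merge idea is easy to sketch; run_size and the helper name are mine:

    import heapq
    import tempfile
    from itertools import islice

    def external_sort(lines, run_size=100_000):
        # Sort fixed-size runs in memory, spill each sorted run to a temp
        # file, then lazily merge the runs. Peak memory is bounded by
        # run_size lines, not by the size of the whole input.
        # Input lines are assumed to be newline-terminated.
        runs = []
        it = iter(lines)
        while True:
            run = sorted(islice(it, run_size))
            if not run:
                break
            tmp = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
            tmp.writelines(run)
            tmp.seek(0)
            runs.append(tmp)
        return heapq.merge(*runs)

    # usage: for line in external_sort(open("big.txt", encoding="utf-8")): ...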


Compression alleviates the problems "my current disks don't fit my data" and "my data is too big for my disk quota". It's comparatively bad at solving the problem "I cannot buy enough disks to fit my data".


zstd is nearly always superior to xz, fyi.
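
For the "transparent in Python" part of the thread, zstd usually means reaching for the third-party zstandard package (that dependency and the file name are assumptions; this is just a sketch):

    import io
    import zstandard  # third-party "zstandard" package, not in the stdlib

    # Stream a .zst file line by line, analogous to the lzma.open example above.
    # "big.log.zst" is a hypothetical file name.
    with open("big.log.zst", "rb") as raw:
        reader = zstandard.ZstdDecompressor().stream_reader(raw)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            pass  # per-line processing goes here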



