Another key technique I've relied on is using pipes and basic command-line tools where possible to pre- or post-process your data. For example, a `sort | uniq -c | sort -nr | head` pipeline to get only the most frequently occurring lines runs in a small, bounded amount of RAM no matter how big the input is. Combine this with chunking and you can process a lot of data in manageable amounts of memory.
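If you'd rather drive that pipeline from Python, here's a minimal sketch (the log file name is just a placeholder):

```python
import subprocess

# Pipe a (hypothetical) large log file through the pipeline above.
# Memory use stays small regardless of input size; sort spills to temp files.
with open("access.log", "rb") as f:
    result = subprocess.run(
        "sort | uniq -c | sort -nr | head",
        shell=True, stdin=f, capture_output=True, check=True,
    )
print(result.stdout.decode())
```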
Of course the next problem is "my data doesn't fit on disk", but with `xzcat` on the command line and `lzma.open` in Python you can work transparently with compressed files.
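For example, something like this streams an xz file line by line without ever decompressing it to disk (the file name is hypothetical):

```python
import lzma

# Stream a (hypothetical) xz-compressed file line by line; lzma.open
# decompresses on the fly, so only one line is in memory at a time.
count = 0
with lzma.open("big_dataset.txt.xz", "rt", encoding="utf-8") as f:
    for line in f:
        count += 1
print(count, "lines")
```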
> Of course the next problem is "my data doesn't fit on disk", but with `xzcat` on the command line and `lzma.open` in Python you can work transparently with compressed files.
I usually use LZ4 compression for this purpose, because it's ridiculously fast (700 MB/s per core compression and 5 GB/s decompression). Sure, compression ratios aren't as good, but usually good enough.
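Assuming the third-party `lz4` package (python-lz4), whose `lz4.frame.open` mirrors `lzma.open`, the pattern looks about the same; this is just a sketch:

```python
import lz4.frame  # third-party: pip install lz4 (not in the stdlib)

# Same transparent-file pattern as lzma.open, just with the faster codec.
# The file name here is hypothetical.
with lz4.frame.open("events.jsonl.lz4", "rt", encoding="utf-8") as f:
    for line in f:
        pass  # per-line processing goes here
```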
I don't know what POSIX says on the matter, but at least GNU sort uses some variation of merge sort with temporary files and has, for all intents and purposes, constant memory use.
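For reference, the same trick (an external merge sort) is easy to sketch in Python; this is just an illustration of the technique, not what GNU sort literally does, and the chunk size is arbitrary:

```python
import heapq
import itertools
import tempfile

def external_sort(in_path, out_path, chunk_lines=1_000_000):
    """Sort a huge text file in bounded memory: sort fixed-size chunks
    in RAM, spill each sorted run to a temp file, then stream-merge."""
    runs = []
    with open(in_path) as src:
        while True:
            chunk = list(itertools.islice(src, chunk_lines))
            if not chunk:
                break
            chunk.sort()
            run = tempfile.TemporaryFile("w+")
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    with open(out_path, "w") as dst:
        # heapq.merge streams the sorted runs, so memory stays bounded.
        dst.writelines(heapq.merge(*runs))
```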
Compression alleviates the problems "my current disks don't fit my data" and "my data is too big for my disk quota". It's comparatively bad at solving the problem "I cannot buy enough disks to fit my data".