
Another key technique I've used is to lean on pipes and basic command-line tools where possible to pre- or postprocess your data. For example, a `sort | uniq -c | sort -nr | head` pipeline to get only the most frequently occurring lines works in a few kilobytes of RAM no matter how big the input is. Combine this with chunking and you can get a lot of data processed in manageable amounts of memory.
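
A minimal sketch of driving that kind of pipeline from Python, if you want the aggregated result back in your program (the file name and the head count are placeholders I picked):

    import subprocess

    # Pre-aggregate with coreutils, then pull only the small result into Python.
    # "access.log" is a hypothetical input file.
    pipeline = "sort access.log | uniq -c | sort -nr | head -n 20"
    result = subprocess.run(pipeline, shell=True, capture_output=True,
                            text=True, check=True)
    for line in result.stdout.splitlines():
        # uniq -c output looks like "   42 some line"
        count, _, value = line.strip().partition(" ")
        print(int(count), value)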

Of course the next problem is "my data doesn't fit on disk", but with xzcat on the command line and lzma.open in python you can work transparently with compressed files.
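
The Python side is basically a drop-in swap for open(); a minimal sketch, assuming a hypothetical big.log.xz:

    import lzma

    # lzma.open decompresses on the fly, so the file is read line by line
    # and the uncompressed data never has to fit in memory at once.
    errors = 0
    with lzma.open("big.log.xz", mode="rt", encoding="utf-8") as f:
        for line in f:
            if "ERROR" in line:
                errors += 1
    print(errors)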



> Of course the next problem is "my data doesn't fit on disk", but with xzcat on the command line and lzma.open in python you can work transparently with compressed files.

I usually use LZ4 compression for this purpose, because it's ridiculously fast (700 MB/s per core compression and 5 GB/s decompression). Sure, compression ratios aren't as good, but usually good enough.
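
In Python the same streaming pattern works with the third-party lz4 package (that dependency and the file name are assumptions on my part):

    import lz4.frame  # third-party "lz4" package, not in the stdlib

    # Same idea as lzma.open, just a faster codec.
    # "big.log.lz4" is a hypothetical file name.
    with lz4.frame.open("big.log.lz4", mode="rt", encoding="utf-8") as f:
        total = sum(1 for _ in f)
    print(total)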


Completely agree (I do this myself all the time), but keep in mind that the initial sort does not have a constant memory cost.

Still, one only realizes how well written these basic tools are after getting a few bruises from fancier ones.


I don't know what POSIX says on the matter, but at least GNU sort uses some variation of merge sort with temporary files and has for all intents and purposes constant memory use.
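
This isn't what GNU sort literally does internally, but the external-merge idea is easy to sketch; run_size and the helper name are mine:

    import heapq
    import tempfile
    from itertools import islice

    def external_sort(lines, run_size=100_000):
        # Sort fixed-size runs in memory, spill each sorted run to a temp
        # file, then lazily merge the runs. Peak memory is bounded by
        # run_size lines, not by the size of the whole input.
        # Input lines are assumed to be newline-terminated.
        runs = []
        it = iter(lines)
        while True:
            run = sorted(islice(it, run_size))
            if not run:
                break
            tmp = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
            tmp.writelines(run)
            tmp.seek(0)
            runs.append(tmp)
        return heapq.merge(*runs)

    # usage: for line in external_sort(open("big.txt", encoding="utf-8")): ...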


Compression alleviates the problems "my current disks don't fit my data" and "my data is too big for my disk quota". It's comparatively bad at solving the problem "I cannot buy enough disks to fit my data".


zstd is nearly always superior to xz, fyi.
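
For the "transparent in Python" part of the thread, zstd usually means reaching for the third-party zstandard package (that dependency and the file name are assumptions; this is just a sketch):

    import io
    import zstandard  # third-party "zstandard" package, not in the stdlib

    # Stream a .zst file line by line, analogous to the lzma.open example above.
    # "big.log.zst" is a hypothetical file name.
    with open("big.log.zst", "rb") as raw:
        reader = zstandard.ZstdDecompressor().stream_reader(raw)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            pass  # per-line processing goes here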



