File Compression in the Multi-Core Era (codinghorror.com)
15 points by ajbatac on March 1, 2009 | 7 comments



This is interesting, but not for the reasons Jeff suggests. bzip uses all 8 cores for 21 minutes to produce a 986M file (or 2 minutes for 1092M), while 7zip doesn't use all 8 cores, and produces a file smaller than anything bzip can produce, in 5 minutes.

So it looks like 7zip is not just slightly better than bzip; it's much better. Ideally you can utilize all your cores by piping data from the DB directly into the compressor -- the compressor will use 2 cores (or whatever), and your database will use the rest.
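
A rough sketch of the piping idea, in Python rather than a shell pipeline: the dump is fed to an incremental LZMA compressor chunk by chunk, so nothing uncompressed has to land on disk first. `compress_stream` and `fake_dump_rows` are hypothetical names made up for illustration; a real setup would read from the database's dump tool on a pipe instead of the fake generator.

```python
import io
import lzma
from typing import BinaryIO, Iterable, Iterator


def compress_stream(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Feed chunks to an incremental LZMA compressor; return bytes written."""
    compressor = lzma.LZMACompressor()  # default .xz container
    written = 0
    for chunk in chunks:
        # compress() may return b"" while the compressor buffers input.
        written += sink.write(compressor.compress(chunk))
    # flush() emits whatever is still buffered and finishes the stream.
    written += sink.write(compressor.flush())
    return written


def fake_dump_rows(n: int) -> Iterator[bytes]:
    """Hypothetical stand-in for a database dump arriving on a pipe."""
    for i in range(n):
        yield f"INSERT INTO t VALUES ({i}, 'row {i}');\n".encode()


if __name__ == "__main__":
    sink = io.BytesIO()
    size = compress_stream(fake_dump_rows(10_000), sink)
    original = b"".join(fake_dump_rows(10_000))
    print(f"{len(original)} bytes in, {size} bytes out")
    assert lzma.decompress(sink.getvalue()) == original
```

The producer and the compressor only share a small buffer here, which is what lets the database and the compressor keep different cores busy when they run as separate processes.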


Bzip2 is a very slow compression algorithm, mostly due, from what I recall, to the Burrows-Wheeler transform that lies at its core. LZMA is pretty much superior in every respect; it is (though quite slowly) on the road to replacing the aging Bzip2.

And as usual the comments on CodingHorror (at least the initial dozen or two) show a relative ignorance about the topic. 7zip (like any compressor) can be trivially parallelized just by running it simultaneously on each solid block. The compression cost of a smaller solid block size is generally near-zero when the dictionary size is much smaller than the input data size.

The included Windows interface doesn't allow this kind of threading AFAIK, but it would be relatively simple to implement in an app using the LZMA libraries.
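
Something along these lines, as a minimal sketch using Python's standard `lzma` module rather than the LZMA SDK itself: split the input into independent blocks, compress each block on its own worker, and concatenate the resulting streams. The block size, worker count, and function names are assumptions picked for illustration. To my knowledge CPython's `lzma` releases the GIL while compressing, so threads should spread across cores; if that does not hold on your build, a process pool is the obvious substitute.

```python
import lzma
from concurrent.futures import ThreadPoolExecutor
from typing import List

BLOCK_SIZE = 4 * 1024 * 1024  # hypothetical block size, tune to taste


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Cut the input into independent fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def compress_parallel(data: bytes, block_size: int = BLOCK_SIZE,
                      workers: int = 4) -> bytes:
    """Compress each block as its own .xz stream and concatenate them."""
    # An empty input still needs one (empty) stream to be decodable.
    blocks = split_blocks(data, block_size) or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the streams stay in sequence.
        streams = list(pool.map(lzma.compress, blocks))
    return b"".join(streams)


if __name__ == "__main__":
    payload = b"some fairly repetitive database dump line\n" * 200_000
    packed = compress_parallel(payload, block_size=1024 * 1024)
    # lzma.decompress() accepts concatenated streams and joins the results.
    assert lzma.decompress(packed) == payload
    print(f"{len(payload)} -> {len(packed)} bytes")
```

Each block starts with an empty dictionary, which is where the (usually small) ratio loss comes from. The output is a plain sequence of .xz streams, so the standard decompressor reads it back without knowing it was produced in parallel.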


This is the exact approach taken by pigz (Parallel GZIP) - http://www.zlib.net/pigz/


> and uses the zlib and pthread libraries

Er, no, thanks. What about a good STL C++ implementation with OpenMP (automagic on STL)?

That's great for us, developers. We'll never run out of things to do :)


What's the problem? pigz works as advertised for me. Almost a 4x speed increase on a 4-core system, assuming your storage I/O can keep up.


Poor guy discovered multi-threading on SMP too late! I would fire a system administrator who does not understand the system in essence. Also one who does not respect history and does not understand that old Unix utilities are not (and in most cases cannot be!) multi-threaded.


Note that the bzip2 implementation he uses is 7zip's; the classic Unix implementation does not make use of multiple cores. But there is also pbzip2, which supposedly uses all available cores:

http://compression.ca/pbzip2/



