
I chuckled at the name, since out-of-order results are a typical output of parallelization. Kudos.



I also thought the name was clever, but your comment made it even more interesting. My first thought, though, was "is this safe to use?" I'd heard of gzip vulnerabilities before, and a parallel implementation sounds a lot easier to get wrong.


Gzip streams support dictionary resets, which means you can concatenate individually compressed blocks together to make a whole stream.

This is what pigz is doing: splitting the input into blocks, spreading the compression of those blocks over different threads so multiple cores can be used, then joining the results together in the right order.
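
pigz itself is a bit more sophisticated (by default it primes each block with the preceding 32 KiB of input to limit the compression-ratio loss), but the stitching idea can be sketched with nothing beyond Python's stdlib. This is an illustration of the concatenation property, not pigz's actual implementation:

    import gzip
    from concurrent.futures import ThreadPoolExecutor

    BLOCK = 128 * 1024  # pigz's default block size is 128 KiB

    def compress_block(block):
        # Each block becomes a complete gzip member; per RFC 1952 a
        # concatenation of members is itself a valid gzip stream.
        # mtime=0 keeps the output deterministic.
        return gzip.compress(block, mtime=0)

    def parallel_gzip(data, workers=4):
        blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # map() yields results in input order, so the members are
            # joined in the right sequence no matter which thread
            # finishes first.
            return b"".join(pool.map(compress_block, blocks))

    # Round-trip check: the stitched stream decompresses as one file.
    payload = b"hello pigz " * 100_000
    assert gzip.decompress(parallel_gzip(payload)) == payload

(CPython's zlib releases the GIL while compressing, so plain threads give real parallelism here.)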

It is the very same property of the format that gzip's own --rsyncable option makes use of, to stop small changes from forcing a full file resend when rsync (or similar) is used to transfer updated files.
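
Here is a toy way to see the locality that rsync benefits from, reusing the fixed-block scheme from the sketch above (the real --rsyncable picks reset points from a rolling hash of the content, so boundaries re-synchronise even after insertions, which fixed-size blocks cannot do):

    import gzip

    BLOCK = 128 * 1024

    def members(data):
        # Independently compressed, deterministic gzip members.
        return [gzip.compress(data[i:i + BLOCK], mtime=0)
                for i in range(0, len(data), BLOCK)]

    original = bytes(1024 * 1024)        # 8 blocks of zeros
    edited = bytearray(original)
    edited[300_000] = 0xFF               # flip one byte in block 2

    changed = [i for i, (a, b)
               in enumerate(zip(members(original), members(bytes(edited))))
               if a != b]
    print(changed)  # [2]: only the block holding the edit differs

Everything before and after that member is byte-identical, which is exactly what rsync's rolling checksum needs to skip the unchanged parts.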

The idea is as simple as it is clever, one of those "why didn't I think of that?" ideas that are obvious once someone else has thought of them, and that simplicity means it adds little or no extra risk. A vulnerability that uses gzip (a "compression bomb") or can cause a gzip tool to errantly run arbitrary code is no more likely to affect pigz than it is the standard gzip builds.


Given that, why wouldn't this just be upstreamed into gzip? If it's a clean, simple solution that's just expanding the use of a technique that's already in the core binary?


gzip is a pretty old, pretty core program, so I imagine it's largely in maintenance mode, and that there is a lot of friction to pushing large changes into it. At one point, pigz required the pthreads library to build. If it still does, the gzip people would need to consider if that was appropriate for them, and if not, rewrite it to be buildable without it.

There are multiple implementations of zlib that are faster than the one that ships with GNU gzip, and yet they haven't been incorporated.

There are also just better algorithms if compatibility with gzip isn't needed. zstd, for example, supports parallel compression, and is both faster and compresses better than gzip.
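
For instance, a minimal multithreaded invocation (assuming the zstd command-line tool is installed; the file name is a placeholder):

    import subprocess

    # -T0 asks zstd to spawn one worker per detected core; -19 is a
    # high compression level (1-19 without --ultra).
    subprocess.run(["zstd", "-T0", "-19", "archive.tar"], check=True)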


> Given that, why wouldn't this just be upstreamed into gzip?

I suspect to keep the standard gzip as simple, small, and stable as possible. It does the job well enough, has done so for many years, and can run on a wide array of systems including very small environments, in part thanks to its minimal dependencies.

Core tools like that typically don't get major updates, just security & stability patches as needed and maybe the occasional safe & easy change for performance reasons or to widen the number of supported environments.


While I'm in agreement on all of those points, I find adoption of new tools extremely difficult. There are modern alternatives to many commands (ls, cat, grep, etc.), but if they're not "the default" it becomes near impossible to switch to them.

Given that almost all desktops, servers, and mobile phones are multi-core these days, if gzip gained multithreading on the desktop it could potentially save time and energy for the whole planet. That seems like a worthwhile benefit?


When reading from media with high random-access latency (for instance traditional hard drives), going parallel could make things much slower; it also reduces the compression achieved (if only by a small amount) and increases the amount of memory consumed during the process, so I wouldn't want it to be the default. Nor would I particularly want it to try to be clever and detect when going parallel is useful, as that could lead to unexpected inconsistency.

I think there is a case for including it as a selectable option, as with --rsyncable, unless this adds extra dependencies (pthreads was mentioned in other comments).


Interesting, I had no idea this would have such an effect, I simply assumed it would be performed in-memory and written out sequentially. I can agree with a flagged option being a nice idea, but what's so heinous about adding a dependency?

I try to avoid it generally in my code too, but for something that could potentially offer a measurable benefit in a world of multi-core, solid-state computing, would it not be "worth it"?


Ah yes, no guarantee of concurrency or ordering (in the headline, lol).

That’d be a pretty funny compression algorithm. You listen to a .mpfoo file, and you’ll hear the whole song, we promise!


Oh, that's the pun. I just saw Parallel Implementation of GZip...



