An ode to pack: gzip’s forgotten decompressor (vidarholen.net)
200 points by ingve on Oct 23, 2017 | 52 comments



I have to say hats off to Paul Eggert, the maintainer who fixed the obscure bug, added a test case, and responded within just a few hours[0]. I admire his dedication to free software. The good news, at least, is that he isn’t some poor bloke but rather well respected and well paid[1].

[0]: see timestamp in the mailing list message http://lists.gnu.org/archive/html/bug-gzip/2017-10/msg00000....

[1]: https://transparentcalifornia.com/salaries/2016/university-o...


Wouldn't it be a good idea to remove pack support from gzip if it obviously doesn't get enough eyeballs to detect bugs in a timely manner? Poorly tested obscure format support is a goldmine for hackers looking for exploitable code.


It's valuable to keep legacy decompression around for opening up old files, lest the data be lost forever.


A nice middle ground would be to require explicit invocation of this unexpected code path via a command line flag, or enablement through an environment variable. This would reduce the attack surface for the vast majority of users while still retaining support (albeit not entirely backwards compatible with scripts) for the folks who want it.

> Poorly tested obscure format support is a goldmine for hackers looking for exploitable code.

IMO the issue is when these code paths are automatically invoked. This is how you get gstreamer running 6502 opcodes for a codec no one has ever heard of when you download a file from the internet.


> This is how you get gstreamer running 6502 opcodes for a codec no one has ever heard of when you download a file from the internet.

Ok, I'll bite: context?



IIRC, it has to do with gstreamer's support for chiptunes in their native format. There is basically a NES emulator built in.


I don't see how a command line flag or environment variable helps to reduce the attack surface. If an exploit is found, any attacker can add the required flag or environment variable anyway.


Only if the attacker controls the command line or the environment.

Suppose you have a shell script which does something like "blah ... | gunzip > somewhere", where input to the "gunzip" step is under control of the attacker. Requiring a command line flag, or even an environment variable, would be enough to avoid exposing the code in question to the attacker-controlled input.
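
To sketch the difference (the flag name below is purely hypothetical; gzip has no such option today):

  # Today: gunzip auto-detects the pack format by its magic bytes, so the
  # attacker-controlled stream reaches the pack code path unconditionally.
  blah ... | gunzip > somewhere

  # Hypothetical opt-in: the pack code path only runs when explicitly requested,
  # so a script that never passes the flag never exposes it.
  blah ... | gunzip --enable-pack > somewhere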

Usually, it would be too late to add a new command line flag, since people might have scripts which depend on being able to unpack files without passing that flag. In this case however, since it was broken for years and nobody else complained, it's very probable that nobody actually depends on this feature working, so requiring a new command line flag or even removing the feature would not cause many problems.


Ah, agreed. Hadn't considered that case. Thanks.


Not in all cases. For example, if you find a bug in some obscure imagemagick code, you can exploit it just by having the user download a malicious file and look at the download folder in a file explorer that produces thumbnails using imagemagick. If support for that obscure format required explicit enabling, the user would most likely see some generic icon instead of getting their hard disk encrypted.


Except that's not at all related. 6502 was added not because it's legacy, but because it was a feature. It wasn't added in 1973 and "kept" for decades.


What of having a pure ANSI C implementation of pack and unpack available?

I, too, am concerned about future data loss due to codec extinction. A solution would seem to take the shape of an archive of codecs written in a simple, pure C subset with an actively maintained compiler. Unused code should be refactored out of mainstream tools.


There's no test suite for "pure ANSI C". Even experts tend to write code that relies on undefined behaviour. I'd sooner trust e.g. Java bytecode to remain executable in the future.


What if it were pure “archival C” whereby the project maintainers defined the undefined behaviors? And I don’t think it would need SIMD optimizations and such, it’s not designed for performance so much as to keep code runnable.

It’d probably require some sort of VM specification too; maybe Java is a better choice. Keeping the code “alive” by keeping it in a project where it’s never used doesn’t seem like a strong guarantee that it will still work when you need it.


The C compatibility test comes when you build your kernel, your toolchain, your browsers, your shells, your desktop applications, and run their test suites after each build. ;- )

"ANSI C" is a bit of a misnomer, since a) there have been updates to the ANSI standard which are not considered "ANSI C", and b) all of the updates are sworn to be backward compatible.


> The C compatibility test comes when you build your kernel, your toolchain, your browsers, your shells, your desktop applications, and run their test suites after each build. ;- )

But if we're talking about obscure applications being preserved for posterity, they're not going to get built.

Really I was talking about the other direction though: a way to confirm that a given program is written using only ANSI C. There's no testsuite for that, all you can do is test with the compilers that exist - and then sometimes your program suddenly stops working in the next version of the compiler.


You might be interested in tis-interpreter:

http://trust-in-soft.com/tis-interpreter/

It can’t prove a program is 100% conformant, but it can detect a much broader range of undefined behaviors than regular compilers, and does other things more strictly (like, IIRC, limiting the declarations exposed by standard library includes to what the standard guarantees they expose).


> But if we're talking about obscure applications being preserved for posterity, they're not going to get built.

Well, they don't really need to be, that's the point.

> Really I was talking about the other direction though: a way to confirm that a given program is written using only ANSI C. There's no testsuite for that, all you can do is test with the compilers that exist - and then sometimes your program suddenly stops working in the next version of the compiler.

See --std=c89 or -ansi flags on GCC or Clang. One other way to prevent other sources of rot is to avoid UB that is frequently subject to change. Most UB is defined, at worst, by the constraints of real ISAs (i.e. the behaviour of idiv when dividing INT_MIN by -1 is an exception on x86, but the ARMv7 equivalent simply produces a number), and most UB-altering characteristics are standard across popular ISAs, and new (general purpose) ISAs tend to go with the status quo when in doubt.
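
To expand on those flags, a minimal conformance check is just a compile invocation (sketch below; the file name is a placeholder, and -pedantic only diagnoses constructs the compiler can see at compile time, not runtime UB):

  # Ask GCC or Clang to compile against C89 and warn about anything outside it.
  gcc   -std=c89 -pedantic -Wall -Wextra -c decoder.c
  clang -std=c89 -pedantic -Wall -Wextra -c decoder.c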

And if all else fails, you can still run your code in an ISA simulator, and compile it with an old compiler, to figure out exactly what it meant in the past.


> See --std=c89 or -ansi flags on GCC or Clang.

They won't help you if your code relies on behaviour not defined in the standard, which is almost always what happens in practice.

> Most UB is defined, at worst, by the constraints of real ISAs (i.e. the behaviour of idiv when dividing INT_MIN by -1 is an exception on x86, but the ARMv7 equivalent simply produces a number), and most UB-altering characteristics are standard across popular ISAs, and new (general purpose) ISAs tend to go with the status quo when in doubt.

ISA extensions aren't always so conservative - x86 family ISAs used to never have alignment requirements, but some of the newer SSE-family instructions do. And modern C compilers will optimise code that invokes UB in surprising ways - codepaths that do signed integer overflow on x86 used to silently wrap around in twos-complement fashion, long shifts used to behave the way the hardware did, but these days compilers will omit to generate code for that codepath since the behaviour is officially undefined.

> And if all else fails, you can still run your code in an ISA simulator, and compile it with an old compiler, to figure out exactly what it meant in the past.

Sure, but in that case you're not gaining anything from using ANSI C, and might as well stick with whatever the code is written in, or just keep the binaries and run them on an emulator.


> Sure, but in that case you're not gaining anything from using ANSI C, and might as well stick with whatever the code is written in, or just keep the binaries and run them on an emulator.

Not being able to compile and run your code with no more than minor changes is an edge case. I frequently run unchanged or minimally changed code from before ANSI C. Sure, if you are definitely going to need a simulator to understand the code, you might as well just use all the original infrastructure; but if you'd like to run it in the interim, C is nice, and sticking to a restricted subset means you'll have fewer things to change if you want to adapt it to a new system.


> Not being able to compile and run your code, with few or no minor changes, is an edge case. I frequently run unchanged or minimally changed code from before ANSI C.

Not my experience. A lot of pre-ANSI code would segfault if you didn't build with -fwriteable-strings, for example, and there was certainly talk of removing that option from GCC entirely.

> if you'd like to run it in the interim, C is nice, and sticking to a restricted subset means you'll have fewer things to change if you want to adapt it to a new system.

I wouldn't call C nice, but if you are using it I would stick with the style today's popular compilers and programs use - popularity is the best guarantee of future compatibility. Using "ANSI C" might be understood to mean e.g. using trigraphs to replace special characters, which would probably be a net negative for long-term maintainability (I think some compilers are already removing support?).


This could be achieved by archiving the source tree of the compression software/version used to create the backups, or at least a version note. Nowadays, with version control, everyone can check out and revert to older versions when they need to.

I'm not convinced it's the compression software author's responsibility.


Without running software, no check is made of the build. And without a checked build, one day you will realize your machine can't build it anymore, and you would need to understand the algorithm, port it, and compile it yourself to open a file.


Removing it from the browser wouldn't remove it from the world. You could always download the compressed file then use a CLI tool, or something.


Except that gzip is both.


Then only enable it with a configure --with-pack option at compile time, disabled by default.


Yep.

I once fired up a TOPS-10 VM on SIMH to bring back an old system for historical research purposes:

https://en.wikipedia.org/wiki/MATHLAB

ditto 4.2BSD and so on, various .tar.Z archives of yore, etc.

Quite a lot of good code that already does what you want to do was written decades before you realized you wanted to do it.


There's an even more obscure Unix compression tool of the same vintage as pack. While pack originated in USG Unix (System V and its predecessors), the early BSD releases contained a program called "compact" which used adaptive Huffman coding. According to the creator of compress [0], compact had a comparable compression ratio to pack but took six times as long to run.

The original versions of both can be found on the Unix Tree [1] which I've found to generally be an incredible resource/time-sink.

[0] https://groups.google.com/d/msg/net.sources/fonve4JCDpQ/39Vk...

[1] http://www.tuhs.org/cgi-bin/utree.pl


The article mentions compress but states that pack is its predecessor. It even gives a comparison of their compression ratios on a kernel release, showing that compress is significantly better:

compress: 302 MiB (-61%) pack: 548 MiB (-30%)


compact != compress

The easily confused names don't help with the obscurity.


I loved it.

- Retrospective on how compression methods have advanced; it's quite interesting to see how much the same file shrinks as you survey the different compression methods.

- nice explanation of Huffman coding

- mention of code golfing

A quintessential HN article. I think each of the 3 items above is worth 10 hacking-nerd bonus points (collect 50 points to get a Devo Energy Dome).

If code golfing is done using Befunge or Brainf*, multiply score by 3.

EDIT: BTW, Brainf____ source code probably compresses nicely with Pack.


It's okay, we're all adults. You can say Brainfuck :)


I feel like code golfers are always just slightly insane in the best possible way.


That resonates with what I think of people making demos. Like, demoscene demos. Never realized that this is actually a modern variant of the scene.


Exactly the type of post I come here for. Great read!


It's unfortunate that xz has become the "default" compressor. It has some issues that make it ill-suited for long term storage. lzip is a similar format that corrects these issues:

http://www.nongnu.org/lzip/xz_inadequate.html


Where the “issues” in question consist of things like “some field is 64 bits wide even though there won’t be 2^64 values in practice”, and “very slightly less likely to detect corrupted data” (but still extremely likely; anyway, since neither xz nor lzip can correct errors, I hope your long term storage ensures integrity at a different level).

Meanwhile, xz is seekable and lzip is not. Both formats use the same compression algorithm under the hood, so size differences are minimal. And lzip is licensed under GPLv3, which means it cannot become a universally adopted standard, which is surely important for long-term storage.


Version 2 or later, not version 3. And to defend its long-term viability:

"The lzip format is as simple as possible (but not simpler). The lzip manual provides the source code of a simple decompressor along with a detailed explanation of how it works, so that with the only help of the lzip manual it would be possible for a digital archaeologist to extract the data from a lzip file long after quantum computers eventually render LZMA obsolete."

Additionally there is a separate public domain version for those allergic to GPL: http://www.nongnu.org/lzip/pdlzip.html

lzip is included in every Linux distro's package repository, and the three major BSDs.


> Version 2 or later, not version 3.

Ah. I wasn't misremembering, though: it seems it was previously GPLv3, but changed to v2+ with lzip 1.16 from late 2014.

In any case, lzip has already lost the popularity war. If XZ were actually "inadequate" and "inadvisable" as the lzip author claims, I might consider that a shame and work to counter it. In reality, the blog post comes off more like sour grapes. Some other things:

- The post spends several paragraphs complaining that some things in XZ are padded to a multiple of four bytes. Seriously? It may be unnecessary, but it's also very common in binary formats and makes very little difference.

- I was mistaken when I interpreted the blog post as claiming that XZ is "very slightly less likely to detect corrupted data". Actually, it's probably more likely, because it uses CRC64 whereas other formats (including lzip) use CRC32. (Although supposedly lzip makes up for this by being more likely to detect errors through the decoding process itself.) The post notes this, but claims that reducing what it calls "false negatives" (i.e. thinking a corrupted file is not corrupted) is not important "if the number of false negatives is already insignificant".

The post does make a decent case that XZ should have some kind of check or redundancy for length fields - though arguably the index serves as that, and there's a case to be made that lzip is worse because it doesn't have one.

But where the post spends most of its time claiming lzip is superior is in reducing false positives - that is, thinking a file is corrupted when it is really intact! How is that even possible? Well, because the checksum itself could get corrupted while the actual data in the file remains intact. The post assumes that each bit written has a given chance of being corrupted, so a larger file will, on average, have more corruption than a smaller file. Because a longer checksum is slightly larger (even though checksums are still a tiny fraction of the overall file size), it's supposedly a greater risk.

Which is such a load of bunk, especially when it comes to long-term archiving. In practice, the probability of error is not independent per bit. Rather, you should have a setup that uses error-correcting codes (whether in the form of RAID parity, built-in ECC on a NAND chip, etc.) to make the probability negligible that you will ever see any errors. Which is not to say that there aren't bad setups that can encounter errors… that there aren't people who stick their treasured photos on some $99 USB hard disk that frequently takes nosedives to the floor. But even then, one error is not just one error; it's often a precursor to the entire drive becoming unreadable. In any case, the solution is not "store very slightly less data to reduce what's exposed to damage", it's "fix your setup"! On the other hand, for anyone who thinks they have a good setup, it's very, very important to detect if an error has occurred, as an error would indicate a problem that needs to be fixed to protect the rest of the data. (Or in other cases, such as corruption during transmission, it could indicate the need to re-copy that file from elsewhere.) So "false negatives" are extremely harmful. Meanwhile, "false positives" aren't necessarily bad at all, because if a given file is only detected as corrupted because the checksum itself was corrupted, it still demonstrates that there was an error! I'd go so far as to say you can never have too many checksums - but at any rate, it's absurd to treat false positives as more important than false negatives.

- Oh, this is rich. Another part of the post makes a big deal about lzip being able to handle arbitrary data appended to the end of the file (because apparently not supporting that means "telling you what you can't do with your files"). But lzip also supports concatenating two lzip files together to get something that decodes to the concatenation, and this is how "multi-member" streams work (which you get e.g. from parallel lzip, or by using an option to limit member size). There's nothing to indicate whether or not a "member" is the last one in the file; the decompressor just checks whether the data that follows starts with the magic bytes "LZIP".

But what if there's a bitflip in one of the magic bytes? How do you know that what follows is a corrupted member, and not some random binary data that someone decided to append, which can be safely ignored?

You don't. I tried it. If you make any change to the magic bytes of a member after the first one, decompressing with `lzip -d` silently truncates the output at that point, without reporting any error. That's right - the supposedly safer-for-archiving lzip can have silent data corruption with a single bit flip.

The manual actually mentions this, but brushes it off:

> In very rare cases, trailing data could be the corrupt header of another member. In multimember or concatenated files the probability of corruption happening in the magic bytes is 5 times smaller than the probability of getting a false positive caused by the corruption of the integrity information itself. Therefore it can be considered to be below the noise level.

So it's not just a question of "the false negative rate is already low, and we should also get the false positive rate as low as possible". Rather, the manual directly compares the probabilities of a bit flip either (a) causing an error when theoretically the data could be fully recovered, or (b) silently corrupting data. Since the chance of (b) is a whole 5 times smaller than (a), its reasoning goes, it can be disregarded. That makes no sense at all! Those are not comparable scenarios! If you have a backup, then any detected error is recoverable no matter what bytes it affects - but an undetected error could be fatal, as backups are rotated over time.

I was curious: if you flip a random bit in an .lz file, what's the chance of it being unprotected? In other words, what fraction of the file is made up of 'LZIP' headers? Using plzip (parallel lzip) with default settings - which makes one member per 16MB of uncompressed data - I measured this as ranging anywhere from ~2 in 10 million for mostly uncompressible data, to over 1 in 1000 (!) for extremely compressible data (all zeroes).
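
For anyone who wants to reproduce a rough version of that measurement, something along these lines should work (assumes plzip with default settings; grep will also count any stray 'LZIP' byte sequences that happen to occur inside compressed data, so it slightly overestimates):

  # Compress highly compressible input, then compare the bytes taken up by
  # member magics against the total file size.
  head -c 1G /dev/zero | plzip -c > zeros.lz
  grep -ao 'LZIP' zeros.lz | wc -l    # number of member magics found
  wc -c < zeros.lz                    # total bytes; unprotected fraction ~= 4 * magics / total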

To be fair, I am not sure whether xz has any comparable vulnerabilities. But if you're going to talk about how superior your format is in the face of bitflips, your format had better be actually resilient to bitflips!


> neither xz nor lzip can correct errors

Speaking of which, is there even a unixy way of adding/applying error correction information to a stream or file?

Ideally, one should be able to do this in a pipe.


I didn't know the answer, but I was interested so I googled it:

https://unix.stackexchange.com/questions/170652/is-it-possib...

Basically, you want the tool "par2".
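
From what I remember of the CLI (worth double-checking against the man page; it operates on files rather than pipes, so a stream would need to hit disk first), basic usage looks like:

  par2 create -r10 archive.tar.xz     # write ~10% recovery data alongside the file
  par2 verify archive.tar.xz.par2     # check the file against the recovery data
  par2 repair archive.tar.xz.par2     # attempt to reconstruct damaged blocks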


The next default compressor might be lrzip [1] by Con Kolivas; I've only seen it a couple of times in the wild so far, but for certain files it can increase the compression ratio quite a bit.

[1] https://github.com/ckolivas/lrzip

  # 151M    linux-4.14-rc6.tar.gz
  # GZIP decompression
  ~$ time gzip -dk linux-4.14-rc6.tar.gz

  real    0m4.518s
  user    0m3.328s
  sys     0m13.422s

  # 787M    linux-4.14-rc6.tar
  # LRZIP compression
  ~$ time lrzip -v linux-4.14-rc6.tar
  [...]
  linux-4.14-rc6.tar - Compression Ratio: 7.718. Average Compression Speed: 13.789MB/s.
  Total time: 00:00:56.37

  real    0m56.533s
  user    5m35.484s
  sys     0m9.422s

  # 137M    linux-4.14-rc6.tar.lrz
  # LRZIP decompression
  ~$ time lrzip -dv linux-4.14-rc6.tar.lrz
  [...]
  100%     786.16 /    786.16 MB
  Average DeCompression Speed: 131.000MB/s
  Output filename is: linux-4.14-rc6.tar: [OK] - 824350720 bytes
  Total time: 00:00:06.35

  real    0m6.524s
  user    0m8.031s
  sys     0m1.766s

  # Results
  ~$ du -hs linux* | sort -h
  137M    linux-4.14-rc6.tar.lrz
  151M    linux-4.14-rc6.tar.gz
  787M    linux-4.14-rc6.tar

tested on WSL (Ubuntu BASH for Windows 10)

edit:

  ~$ time xz -vk linux-4.14-rc6.tar
  linux-4.14-rc6.tar (1/1)
    100 %        98.9 MiB / 786.2 MiB = 0.126   3.0 MiB/s       4:25

  real    4m25.189s
  user    4m23.828s
  sys     0m1.094s
  
  ~$ du -hs linux* | sort -h
  99M     linux-4.14-rc6.tar.xz
  137M    linux-4.14-rc6.tar.lrz
  151M    linux-4.14-rc6.tar.gz
  787M    linux-4.14-rc6.tar
It looks like XZ still has the best compression ratio, but it also took the longest (real) time.


lrzip is a preprocessor that finds matches in the distant past that the backend compressor (xz) couldn't normally find. zstd has a new long range matcher mode inspired by the ideas behind rzip/lrzip, with some extra tricks. It produces data in the standard zstd format, so it can be decompressed by the normal zstd decompressor. There is a short article about it in the latest release notes: https://github.com/facebook/zstd/releases/tag/v1.3.2
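
If anyone wants to try it, the long-range mode is just a flag (sketch below, reusing the file name from the benchmarks upthread):

  zstd -19 --long linux-4.14-rc6.tar      # compress with long-distance matching (default window log 27)
  zstd -d linux-4.14-rc6.tar.zst          # decompresses normally; only larger windows (e.g. --long=30) need the flag again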


I tried a few times with zstd at various compression levels on the linux kernel sources. While I've been impressed with zstd, and have some projects lined up to use it, it seems that in the case of the linux kernel sources it's not a great fit. xz handily beats it, and not by a small margin either. I had to really ratchet up the compression levels (20+) before I could get close to 100 MB.


In general, xz beats zstd in compression ratio, as xz is very committed to providing the strongest compression at the expense of speed, while zstd provides a range of compression-ratio-vs-speed tradeoffs [0]. At the lower levels, zstd doesn't approach xz's compression ratio, but it compresses much, much faster. Additionally, zstd generally massively outperforms xz in decompression speed:

  $ time xz linux-4.14-rc6.tar

  real    4m26.009s
  user    4m24.828s
  sys     0m0.724s

  $ wc -c linux-4.14-rc6.tar.xz
  103705148 linux-4.14-rc6.tar.xz

  $ time zstd --ultra -20 linux-4.14-rc6.tar
  linux-4.14-rc6.tar   : 12.81%   (824350720 => 105564246 bytes, linux-4.14-rc6.tar.zst)

  real    4m34.129s
  user    4m33.484s
  sys     0m0.432s

  $ time cat linux-4.14-rc6.tar.xz | xz -d > out1                                                                                                                                           

  real    0m9.677s
  user    0m6.608s
  sys     0m0.704s

  $ time cat linux-4.14-rc6.tar.zst | zstd -d > out2

  real    0m1.702s
  user    0m1.220s
  sys     0m0.520s
[0]: https://github.com/facebook/zstd/blob/dev/doc/images/DCspeed...


While making no judgement against lrzip, I'll point out that outperforming gzip is pretty much the baseline as far as compression goes. A more interesting comparison would be against modern compressors like zstd: http://facebook.github.io/zstd/


lrzip is a somewhat more polished implementation of the same idea as rzip by Andrew Tridgell. That means: use rsync's rolling-hash algorithm to implement LZ77-style matching with an enormous window, and compress the output of that with some general-purpose compression algorithm (bzip2 in rzip; IIRC in lrzip it is configurable).


Beyond code golf, I believe pack is still used on many platforms for creating .a archives of .o object files.


I'm almost certain that you are confusing pack with ar (which is an uncompressed archive format; in contrast to tar, it is designed such that a file can be extracted without reading the whole archive).

ar is used as the format of .a/.lib files by almost all C compilers (IIRC even by MSVC). Libraries are simply archives of .o files, on some platforms with an additional symbol->file index, which is what gets added by running the .a through ranlib.


Yup! You are correct. I was confusing it due to being recently exposed to `pack`, which is a binary in the Go toolchain that happens to interact with ar archives.


Code golf is not a use case.



