Version 2 or later, not version 3. And to defend its long-term viability:
"The lzip format is as simple as possible (but not simpler). The lzip manual provides the source code of a simple decompressor along with a detailed explanation of how it works, so that with the only help of the lzip manual it would be possible for a digital archaeologist to extract the data from a lzip file long after quantum computers eventually render LZMA obsolete."
Ah. I wasn't misremembering, though: it seems lzip was previously GPLv3, but changed to GPLv2+ with lzip 1.16 in late 2014.
In any case, lzip has already lost the popularity war. If XZ were actually "inadequate" and "inadvisable" as the lzip author claims, I might consider that a shame and work to counter it. In reality, the blog post comes off more like sour grapes. Some other things:
- The post spends several paragraphs complaining that some things in XZ are padded to a multiple of four bytes. Seriously? It may be unnecessary, but it's also very common in binary formats and makes very little difference.
- I was mistaken when I interpreted the blog post as claiming that XZ is "very slightly less likely to detect corrupted data". Actually, it's probably more likely, because it uses CRC64 whereas other formats (including lzip) use CRC32. (Although supposedly lzip makes up for this by being more likely to detect errors through the decoding process itself.) The post notes this, but claims that reducing what it calls "false negatives" (i.e. thinking a corrupted file is not corrupted) is not important "if the number of false negatives is already insignificant".
The post does make a decent case that XZ should have some kind of check or redundancy for length fields - though arguably the index serves as that, and there's a case to be made that lzip is worse because it doesn't have one.
But where the post spends most of its time claiming lzip is superior is in reducing false positives - that is, thinking a file is corrupted when it is really intact! How is that even possible? Well, because the checksum itself could get corrupted while the actual data in the file remains intact. The post assumes that each bit written has a given chance of being corrupted, so a larger file will, on average, have more corruption than a smaller file. Because a longer checksum occupies slightly more bits (even though checksums are still a tiny fraction of the overall file size), it's supposedly a greater risk.
Which is such a load of bunk, especially when it comes to long-term archiving. In practice, the probability of error is not independent per bit. Rather, you should have a setup that uses error-correcting codes (whether in the form of RAID parity, built-in ECC on a NAND chip, etc.) to make the probability negligible that you will ever see any errors. Which is not to say that there aren't bad setups that can encounter errors… that there aren't people who stick their treasured photos on some $99 USB hard disk that frequently takes nosedives to the floor. But even then, one error is not just one error; it's often a precursor to the entire drive becoming unreadable. In any case, the solution is not "store very slightly less data to reduce what's exposed to damage", it's "fix your setup"!

On the other hand, for anyone who thinks they have a good setup, it's very, very important to detect if an error has occurred, as an error would indicate a problem that needs to be fixed to protect the rest of the data. (Or in other cases, such as corruption during transmission, it could indicate the need to re-copy that file from elsewhere.) So "false negatives" are extremely harmful. Meanwhile, "false positives" aren't necessarily bad at all, because if a given file is only detected as corrupted because the checksum itself was corrupted, it still demonstrates that there was an error! I'd go so far as to say you can never have too many checksums - but at any rate, it's absurd to treat false positives as more important than false negatives.
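Just to put the scale in perspective: under the post's own model of independent per-bit errors, the chance that a region of n bits gets hit at all is 1 - (1 - p)^n, which is roughly n times p for small p. A toy calculation (the bit error rate here is an arbitrary assumption, purely to illustrate the relative sizes):

```python
import math

def p_hit(n_bits: int, p: float) -> float:
    """P(at least one of n_bits flips), assuming independent per-bit errors."""
    return -math.expm1(n_bits * math.log1p(-p))

p = 1e-15                            # assumed per-bit error rate (arbitrary, for illustration only)
payload_bits = 16 * 1024 * 1024 * 8  # a 16 MiB compressed member
print(p_hit(payload_bits, p))        # ~1.3e-07: chance the payload is hit somewhere
print(p_hit(32, p))                  # ~3.2e-14: chance a CRC32 field is hit
print(p_hit(64, p))                  # ~6.4e-14: chance a CRC64 field is hit
```

The "extra" false-positive exposure from a 64-bit check over a 32-bit one is the gap between the last two numbers - more than six orders of magnitude below the chance of the payload itself being hit.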
- Oh, this is rich. Another part of the post makes a big deal about lzip being able to handle arbitrary data appended to the end of the file (because apparently not supporting that means "telling you what you can't do with your files"). But lzip also supports concatenating two lzip files together to get something that decodes to the concatenation, and this is how "multi-member" streams work (which you get e.g. from parallel lzip, or by using an option to limit member size). There's nothing to indicate whether or not a "member" is the last one in the file; the decompressor just checks whether the data that follows starts with the magic bytes "LZIP".
But what if there's a bitflip in one of the magic bytes? How do you know that what follows is a corrupted member, and not some random binary data that someone decided to append, which can be safely ignored?
You don't. I tried it. If you make any change to the magic bytes of a member after the first one, decompressing with `lzip -d` silently truncates the output at that point, without reporting any error. That's right - the supposedly safer-for-archiving lzip can have silent data corruption with a single bit flip.
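A sketch of one way to reproduce that experiment, assuming the lzip binary is on your PATH (the member sizes, test data, and helper below are mine; exact diagnostics and exit codes may vary between lzip versions):

```python
import subprocess

def lzip(args, data):
    """Run the lzip binary with the given arguments, feeding data on stdin."""
    return subprocess.run(["lzip", *args], input=data,
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# Build a two-member stream by concatenating two separately compressed members.
member1 = lzip(["-c"], b"A" * 100_000).stdout
member2 = lzip(["-c"], b"B" * 100_000).stdout
stream = bytearray(member1 + member2)

# Flip one bit in the second member's magic bytes ("LZIP" becomes "MZIP").
stream[len(member1)] ^= 0x01

result = lzip(["-d", "-c"], bytes(stream))
print("exit code:", result.returncode)
print("decompressed bytes:", len(result.stdout))  # 200000 if both members decode
print("stderr:", result.stderr.decode() or "(empty)")
```

If the decoder treats the damaged second member as ignorable trailing data, this stops at 100000 bytes with a zero exit code and nothing on stderr, which is the silent truncation described above.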
The manual actually mentions this, but brushes it off:
> In very rare cases, trailing data could be the corrupt header of another member. In multimember or concatenated files the probability of corruption happening in the magic bytes is 5 times smaller than the probability of getting a false positive caused by the corruption of the integrity information itself. Therefore it can be considered to be below the noise level.
So it's not just a question of "the false negative rate is already low, and we should also get the false positive rate as low as possible". Rather, the manual directly compares the probabilities of a bit flip either (a) causing an error when theoretically the data could be fully recovered, or (b) silently corrupting data. Since the chance of (b) is a whole 5 times smaller than (a) (presumably because the 4-byte magic is a fifth the size of the 20-byte trailer holding the CRC and size fields), its reasoning goes, it can be disregarded. That makes no sense at all! Those are not comparable scenarios! If you have a backup, then any detected error is recoverable no matter what bytes it affects - but an undetected error could be fatal, as backups are rotated over time.
I was curious: if you flip a random bit in an .lz file, what's the chance of it being unprotected? In other words, what fraction of the file is made up of 'LZIP' headers? Using plzip (parallel lzip) with default settings - which makes one member per 16MB of uncompressed data - I measured this as ranging anywhere from ~2 in 10 million for mostly uncompressible data, to over 1 in 1000 (!) for extremely compressible data (all zeroes).
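For anyone who wants to sanity-check that kind of figure, a crude estimate is to count the member magics directly (this is my own quick sketch; it assumes every occurrence of the byte string "LZIP" is a member header, which can overcount if the string happens to appear inside compressed data):

```python
import sys

def magic_fraction(path: str) -> float:
    """Rough fraction of a .lz file occupied by 4-byte "LZIP" member magics."""
    data = open(path, "rb").read()
    members = data.count(b"LZIP")  # one per member header, plus any chance matches
    return members * 4 / len(data)

for path in sys.argv[1:]:
    print(f"{path}: {magic_fraction(path):.2e}")
```

If anything this slightly overstates the silently-swallowable region, since corruption in the first member's magic should at least be reported as a bad header rather than ignored.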
To be fair, I am not sure whether xz has any comparable vulnerabilities. But if you're going to talk about how superior your format is in the face of bitflips, your format had better be actually resilient to bitflips!
"The lzip format is as simple as possible (but not simpler). The lzip manual provides the source code of a simple decompressor along with a detailed explanation of how it works, so that with the only help of the lzip manual it would be possible for a digital archaeologist to extract the data from a lzip file long after quantum computers eventually render LZMA obsolete."
Additionally, there is a separate public domain version for those allergic to the GPL: http://www.nongnu.org/lzip/pdlzip.html
lzip is included in every Linux distro's package repository, and in those of the three major BSDs.