Hacker News new | past | comments | ask | show | jobs | submit login
Xz format inadequate for long-term archiving (2017) (nongnu.org)
235 points by pandalicious on April 20, 2018 | hide | past | favorite | 135 comments



This article again? In my opinion, this article is biased. The subtext here is that the author is claiming that his "lzip" format is superior. But xz was not chosen "blindly" as the article claims.

To me, most of the claims are arguable.

To say 3 levels of headers is "unsafe complexity"... I don't agree. Indirection is fundamental to design.

To say padding is "useless"... I don't understand why padding and byte-alignment that is given so much vitriol. Look at how much padding the tar format has. And tar is a good example of how "useless padding" was used to extend the format to support larger files. So this supposed "flaw" has been in tar for dozens of years, with no disastrous effects at all.

The xz decision was not made "blindly". There was thought behind the decision.

And it's pure FUD to say "Xz implementations may choose what subset of the format they support. They may even choose to not support integrity checking at all. Safe interoperability among xz implementations is not guaranteed". You could say this about any software - "oh no, someone might make a bad implementation!" Format fragmentation is essentially a social problem more than a technical problem.

I'll leave it at this for now, but there's more I could write.


> To say 3 levels of headers is "unsafe complexity"... I don't agree. Indirection is fundamental to design.

3 individual headers for one file format is unnecessary complexity.

> To say padding is "useless"

Padding in general is not useless, but padding in a compression format is very counterproductive.

> And it's pure FUD to say "Xz implementations may choose what subset of the format they support. They may even choose to not support integrity checking at all. Safe interoperability among xz implementations is not guaranteed". You could say this about any software - "oh no, someone might make a bad implementation!" Format fragmentation is essentially a social problem more than a technical problem.

This isn't about "someone making a bad implementation!", it's about crucial features being optional. That is, completely compliant implementations may or may not be able to decompress a given XZ archive, and may or may not be able to validate the archive.

XZ may not have been chosen blindly, but it certainly does not seem like a sensible format. There is no benefit to this complexity. We do not need or benefit from a format that is flexible, as we can just swap format and tool if we want to swap algorithms, like we have done so many times before (a proper compression format is just a tiny algorithm-specific header + trailing checksum, so it is not worth generalizing away).

Any and all benefits of XZ lie in LZMA2. We could have lzip2 and avoid all of these problems.

(I have no opinion as to whether LZIP should supersede GZIP/BZIP2, but XZ certainly seems like a poor choice.)


> 3 individual headers for one file format is unnecessary complexity.

So all these file formats are unnecessarily complex?

- all OpenDocument formats

- all MS office formats

- all multimedia container formats

- deb/rpm packages

etc?


It depends on how you count headers, but yes.

Multimedia containers, while too complicated, don't really qualify for a position on that list. These containers are basically just special purpose file containers, and thus the headers of the "files" within should not contribute to the header count.

deb/rpm is also a good example for old and quite obnoxious formats. Deb is an AR archive of two GZIP compressed TAR archives (control and data) and a single file (debian-binary). TAR replaced AR for all but a few ancient tasks long ago, but for some reason, Deb uses both. A tar.gz with 3 files/folders that were not tar'd or compressed would have been much simpler. I believe RPM goes that route, but rather than TAR they use CPIO, and rather than embedding the metadata inside the archive, the RPM package has its own header.

Both RPM and DEB have given support for using a bunch of compression formats, meaning that not only do the content of the DEB/RPM package have dependencies, but there each package can now basically end up having its own dependencies that need to be satisfied before you can even read the package in the first place. Oh, and one of the supported compression formats is XZ now, adding an extra dependency as your version of XZ might not support the contained XZ archive at all.


Aren't MS office formats the poster child for overly complex file formats?


> rpm packages

I recall an article posted here detailing how incredibly bloated and crufty the RPM format was.


"Look at how much padding the tar format has. And tar is a good example of how "useless padding" was used to extend the format to support larger files. So this supposed "flaw" has been in tar for dozens of years, with no disastrous effects at all."

Just because it's in tar doesn't mean that the design is flawless. tar was created a long time ago, when a lot of things we are concerned with now weren't even thought of.

Deterministic, bit-reproduceable archives are one thing that tar has recently struggled with[1], because the archive format was not originaly designed with that in mind. With more foresight and a better archive format, this need not have been an issue at all.

[1] - https://lists.gnu.org/archive/html/help-tar/2015-05/msg00005...


The name tar comes from Tape ARchive. Lots of padding makes sense when you know that tar was originally used to write files to magnetic tape, which is highly block oriented. The use of tar today as a bundling and distribution format is something of a misapplication, as it lacks features one might want of such a program.


Thanks for such an amazing rabbit-hole of a link.


I feel he has made a case for some inadequacies in Xz. Some of the claims seem exaggerated, such as (2.2) the optional integrity checking, assuming the decompressor at least logs the fact that it couldn't do the integrity checking. Some others are clearly more significant issues, such as (2.5) not checksumming the length fields (2.6) the variable length integers being able to cause framing errors. Others still are petty, such as (2.3) too many possible filters.

While I think he made a case, I somewhat doubt that the other formats are flawless, and the real answer would lie in a more open analysis of all of them.


Last time this came up on HN, I did some research, and discovered that lzip was quite non-robust in the face of data corruption: a single bit flip in the right place in an lzip archive could cause the decompressor to silently truncate the decompressed data, without reporting an error. Not only that, this vulnerability was a direct consequence of one of the features used to claim superiority to XZ: namely, the ability to append arbitrary “trailing data” to an lzip archive without invalidating it.

Like some other compressed formats, an lzip file is just a series of compressed blocks concatenated together, each block starting with a magic number and containing a certain amount of compressed data. There’s no overall file header, nor any marker that a particular block is the last one. This structure has the advantage that you can simply concatenate two lzip files, and the result is a valid lzip file that decompresses to the concatenation of what the inputs decompress to.

Thus, when the decompressor has finished reading a block and sees there’s more input data left in the file, there are two possibilities for what that data could contain. It could be another lzip block corresponding to additional compressed data. Or it could be any other random binary data, if the user is taking advantage of the “trailing data” feature, in which case the rest of the file should be silently ignored.

How do you tell the difference? Simply enough, by checking if the data starts with the 4-byte lzip magic number. If the magic number itself is corrupted in any way? Then the entire rest of the file is treated as “trailing data” and ignored. I hope the user notices their data is missing before they delete the compressed original…

It might be possible to identify an lzip block that has its magic number corrupted, e.g. by checking whether the trailing CRC is valid. However, at least at the time I discovered this, lzip’s decompressor made no attempt to do so. It’s possible the behavior has improved in later releases; I haven’t checked.

But at least at the time this article was written: pot, meet kettle.


The maintainer's response when I reported this bug was 'Just use "lzip -vvvv" to see the warning':

https://lists.debian.org/55C0FE82.7050700@gnu.org

Their advocacy in this thread was so good that I removed lzip from my system.


It's that an implementation problem? I would expect a decompressor to warn that there's unidentified trailing data and perhaps dump it out as-is. After all, even if you did put it there on purpose, surely you still want it, not to have it discarded.


If the claims in the article are true who cares if the competing thing that the author is working on is also shit (but good to know that too).


Are these concerns, about error recovery, outdated? If I want to recover a corrupted file, I find another copy. I don't fiddle with the internal length field to fix framing issues. Certainly, if I want to detect corruption, I use a sha256 of the entire file. If that fails, I don't waste time trying to find the bad bit.

To add to that, if you need parity to recover from errors, you need to calculate how much based on your storage medium durability and projected life span. It's not the file format's concern. The xz crc should be irrelevant.


While that's true for most use cases, I think the author's point is that an archival compression format should be as forgiving as possible to the person recovering data because they are not necessarily the person who stored it. There will certainly be plenty of data in the future that was haphazardly stored but which needs to be recovered, possibly centuries after it was originally created, when no other copies may exist. So we should try to be nice to future archivists/librarians by making our data formats as robust as possible (in addition to our storage media, which is what you are correctly implying we should also worry about).


"If I want to recover a corrupted file, I find another copy."

So you've archived two or more copies of each file? That means you're use at least twice as much space (and if you're keeping the original as well, more than twice).

For the likely corruption of the occasional single bit flip here and there, you could do a lot better by using something like par2 and/or dvdisaster (depending on what media you're archiving to).


> So you've archived two or more copies of each file

You haven't?

It took me just one minor "data loss incident" ~20 years ago to very quickly convince me to become a lifetime member of the "backup all the things to a few different locations" club.

> That means you're use at least twice as much space (and if you're keeping the original as well, more than twice).

"Storage is cheap."


Storage is cheap indeed, though it takes some effort to make it cheap.

99% of the digital data I'm keeping for the long term is family photos and videos. All my photos go to Dropbox (easy copy-from-device and access anywhere) and are then backed up to multiple locations by CrashPlan.

It'll be a while yet, but in the next few years I'll be hitting the 1TB Dropbox limit. I'm hoping that Dropbox make a >1TB 'consumer' plan in the next couple of years. There's no way I'm assuming my backups are fine, deleting from Dropbox to make space, then finding out in a few years that some set of photos is missing.

I also sync up to Google Drive - but again, there's a 1TB limit (or a large cost).

In the future, I might have to create a new Dropbox account and keep the old one running. Storage might be cheap, but keeping it cheap is tricky.


Are you using the small business version of CrashPlan? I was using CP too until they discontinued their B2C.


Same here - I'm currently still migrating from CrashPlan B2C to using Arq backing up to Backblaze B2. (Being able to access B2 from Panic's Transmit Mac app made B2 really attractive to me as well, and it looks like I'll save a lot of money compared with CrashPlan.)


I'll look into this. Thanks for the info!


> Storage might be cheap, but keeping it cheap is tricky.

If it's really for pure backup, not continuous sync, Glacier is $4 per TB.


That's $4 per TB-month. Meaning you're effectively paying more than the cost of a 1TB hard drive replaced every year, for every TB you're storing. Plus fees to get your data back out. An 8TB drive, replaced every year, is half the cost per TB, with no additional access cost.

Depending on how price conscious you are, I agree with the GP's "keeping it cheap is tricky". And with things like backup, even if you do it yourself, the time spent maintaining it should be negligible: Occasionally kick off a format shift or failed drive replacement, have scripts running everything else.


> Meaning you're effectively paying more than the cost of a 1TB hard drive replaced every year, for every TB you're storing.

Yes. But what you get in return is not having that data at home. It doesn't matter how many copies you have locally if your home gets robbed, flooded, or burns down.


Glacier is good for dumping data into it but it's absolutely terrible for getting your data out and for full retrievals it's also very expensive. Don't rely on it for anything other than emergency backups of your backups.


"You haven't?"

No I don't, because it's a waste of space and money. Using par2 and/or dvdisaster I can archive a lot more files on to the same archival media and still get enough redundancy to feel secure.

"Storage is cheap."

Cheap is relative. Are you really buying an extra 2 TB's of storage to archive 1 TB of data, because that's what you'd need to do to archive 2 copies of each file. That's a huge waste of space and money that adds up when you're archiving a lot of data.

If your needs are small or your pockets deep, you can afford to do what you're proposing, but for the rest of us who aren't made out of money it's just not practical.

On the other hand, when I have worked at places which could afford to have multiple archives at various locations, I've made sure each of those archives were protected with par2 or dvdisaster, so I could recover from both rather than have one of the archives fail because of a bit flip error.


> Are you really buying an extra 2 TB's of storage to archive 1 TB of data ...

  $ sudo zpool get size zdata
  NAME   PROPERTY  VALUE  SOURCE
  zdata  size      21.8T  -
Yep.

It's fine that you "feel secure" with your current backup regimen -- and I certainly hope you never lose any important data.

After losing data once, though, I promised myself I'd do my best to make sure that it never happened again. The "primary copy" of all my data lives on the individual machines (my workstation, primarily, but there's a bit on my main laptop too) but there's also a copy of it all on a server out in the garage as well as yet another server (see above) that I have in an ISP's facility nearby. There's yet another copy of a small fraction of my files (the "really, really, really important stuff") that's sitting in AWS (via tarsnap) as well.

Some folks are satisfied with a copy of their family photos copied onto a flash drive and tossed into a drawer or an external USB drive permanently sitting on the desk next to their computer. I know of several small companies in my area that thought they were safe with an external USB drive connected to their server... until they got hit with ransomware.

My laptop has a pair of mirrored SSDs, my workstation has a pair of mirrored SSDs and a pair of mirrored "spinners". The server in the garage (my "first backup") has RAID10. That box at the ISP has mirrored SSDs plus a "raidz2" that the backups live on. Some of us just want a little bit more reassurance than others. :-)


> Some of us just want a little bit more reassurance than others

Which gets back to the original point. If you use format that can be more easily recovered, then having the same amount of copies, you're data is more secure.

You've probably been downvoted because it's perceived as showing off, but it is a nice setup.

I've also spend more time than I'm willing to admit with planning researching configuring and maintaining different backup strategies, and just wanted to say that I regret some of that. It's easy to become data hoarder and it's easy to spend more time on preserving it than it is actually worth. I mean, think about how much of this data is worth to people other than you, i.e. what happens to it when you die. Life's short and there are so many things that are more exciting than backups.

Don't get me wrong though. Backups are important.Just know how much exactly are they important to you.


Good for you. If you can afford it, go for it. But I'd still use something like par2 over each of your backups.


I use PAR2, even with multiple copies at different sites, because I look at my photos so rarely that I wouldn't notice a master file had become corrupt before it had mirrored to the other places and the original versions expired (1 year).

5% parity archives is an easy sell, on top of 200% for off site copies.


If you have any responsibility for data protection I urge you to read literally anything on disaster recovery procedures.


What if he reads his own comment? That would be covered by the admonition to read literally anything.

But my troll aside, I agree. If losing the data would cause you harm or make you sad (losing photos of your kids for example), you definitely need to have multiple backups in multiple locations, ideally controlled by different parties (so one bug on your cloud provider's side doesn't wipe out both of the copies they store for you). I've been burned by this with personal data a few times. The stakes get even higher when you are responsible for someone else's data. If they don't want to pay for the extra storage, make sure they understand the risk involved.


If you're using par2, I'd say that's closer to recovering a second copy than trying to extract meaningful data from a corrupted file. (The internal structure of the format is irrelevant, thus concerns about it are outdated.)


Even when using par2, the fact that xz files contain no version number for the format is still troubling.


Yes.

If your data is not in three different places it might as well not exist.


I would generally suggest you're more likely to corrupt/lose your whole backup than to have one corrupted bitflip not addressed by the filesystem or underlying storage.


If you are archiving it may be for long times (think "museums", "data vaults" etc). Finding another copy 200 years later may be difficult.


While finding another copy might be a practical solution for most of us, it seems like a wrongheaded way of designing an archiving data format.


I upvoted this because it seems to make some good points and I think the topic is interesting and important, but I can't understand why the "Then, why some free software projects use xz?" section does not mention xz's main selling point of being better than other commonly used alternatives at compressing things to smaller sizes.

https://www.rootusers.com/gzip-vs-bzip2-vs-xz-performance-co...


> compressing things to smaller sizes.

...relative to ... ? Is it better than lzip? lzip sounds like it would also use LZMA-based compression, right? This [1] sounds like an interesting and more detailed/up-to-date comparison. Also by the same author BTW.

[1] https://www.nongnu.org/lzip/lzip_benchmark.html#xz


Relative to the compression formats people were aware of at the time (which didn't include lzip.)

People began using xz because mostly because they (e.g. distro maintainers like Debian) had started seeing 7z files floating around, thought they were cool, and so wanted a format that did what 7z did but was an open standard rather than being dictated by some company. xz was that format, so they leapt on it.

As it turns out, lzip had already been around for a year (though I'm not sure in what state of usability) before the xz project was started, but the people who created xz weren't looking for something that compressed better, they were looking for something that compressed better like 7z, and xz is that.

(Meanwhile, what 7z/xz is actually better at, AFAIK, is long-range identical-run deduplication; this is what makes it the tool of choice in the video-game archival community for making archives of every variation of a ROM file. Stick 100 slight variations of a 5MB file together into one .7z (or .tar.xz) file, and they'll compress down to roughly 1.2x the size of a single variant of the file.)


that did what 7z did but was an open standard rather than being dictated by some company.

7z is an open standard, and the SDK is public domain:

https://www.7-zip.org/sdk.html


7z, xz, and lzip all use the same compression algorithm (LZMA). The differences between them are in the container format that stores the data, not in the compression of the data. 7z is akin to zip in that it functions as an archive in addition to the compressed data. xz and lzip both accomplish the same goal, which is to store the LZMA-compressed stream while letting some other tool handle archival if desired, as is traditional on unixy systems, where archival is usually handled by tar (though you will sometimes fun into cpio or ar archives) while compression is handled by gzip (same compression algorithm as zip) or bzip2 or whatever else.

Thus, the proposed benefits of the compression ratio apply equally to lzip as they do to 7z and xz. When the article talks about shortcomings of the xz file format compared to the lzip file format, it's talking about file structure and metadata, not compression algorithm. Just running some informal comparisons on my machine, an empty (zero byte) file results in a 36 byte lzip file and a 32 byte xz file, while my hosts file of 1346 bytes compresses to a 738 byte lzip file and a 772 byte xz file. An mtree file listing my home directory comes to 268 Mbytes uncompressed, resulting in an 81M lzip file and an 80M xz file (a difference of 720 Kbytes, less than 0.9% overhead). Suffice it to say, the compression of the two files is comparable. Yet, the lzip file format also has the advantages discussed in the article.

That said, for long-term archival I wouldn't use any of the above: I prefer zpaq, which offers better compression than straight LZMA, along with dedup, journaling, incremental backup, append-only archives, and some other desirable features for archival. Together with an mtree listing (to capture some metadata that zpaq doesn't record) and some error recovery files (par2 or zfec), this makes a good archival solution, though I hesitate to call it perfect.


Can you provide an example of such a .xz file?


I did some tests of my own and xz turned out marginally better than lzip in most of them.

    665472 freebsd-11.0-release-amd64-disc1.iso
    401728 freebsd-11.0-release-amd64-disc1.iso.xz 5m0.606s
    406440 freebsd-11.0-release-amd64-disc1.iso.lz 5m43.375s
    430872 freebsd-11.0-release-amd64-disc1.iso.bz2 1m38.654s
    440400 freebsd-11.0-release-amd64-disc1.iso.gz 0m27.073s
    431740 freebsd-11.0-release-amd64-disc1.iso.zst 0m3.424s
    
Maybe xz is not good for long term archiving but it's both faster and produces smaller files in most scenarios. However, I'm sticking with gz for backups, mainly because of the speed and popularity. If I want to compress anything to the smallest possible size without any regard for CPU time, then I use xz.


You can see discussion of this point, from the last time that this article was on Hacker News, at https://news.ycombinator.com/item?id=12769277 .


(2016)

Previously discussed here on HN back then:

https://news.ycombinator.com/item?id=12768425

The author has made some minor revisions since then. Here are the main differences to the page compared to when it was first discussed here:

http://web.cvs.savannah.nongnu.org/viewvc/lzip/lzip/xz_inade...

And here's the full page history:

http://web.cvs.savannah.nongnu.org/viewvc/lzip/lzip/xz_inade...


It may not be a good choice for long-term data storage, but I disagree that it should not be used for data sharing or software distribution. Different use cases have different needs. If you need long-term storage, it's better to avoid lossless compression that can break after minor corruption. You should also be storing parity/ECC data (I don't recall the subtle difference). If you only need short to moderate term storage, the best compression ratio is likely optimal. Keep a spare backup just in case.


> It may not be a good choice for long-term data storage, but I disagree that it should not be used for data sharing or software distribution. Different use cases have different needs.

I'm not so sure, using tools suitable long-term archiving by default might not be a bad practice. The thing about archiving is that it's often hard to know in advance what exactly you want to keep long-term. Using more robust formats probably won't cost much in the short term, but could pay off in the long term.


For long-term archival I think relying on your compression software to protect data integrity is a fool's errand, protecting against bit-rot should be a function of your storage layer as long as you have control over it (in contrast to say, Usenet, where multiple providers have copies of data and you can't trust them to not lose part of it - hence the inclusion of .par files for everything under alt.binaries).


I keep seeing recommendations for par/par2 but it seems like as software, the project isn't actively maintained? As an aside, that makes me think of dead languages and the use of latin for scientific names because it isn't changing anymore... but do you want that out of archival formats and software?


There's a par2 "fork" under active development - https://github.com/Parchive/par2cmdline

The fork compiled for me this week, when the official 0.3 version on Sourceforge wouldn't. I vaguely remembered par3 being discussed, but couldn't find anything usable. And that's an example of why to be wary of new formats, I guess?


Yes. Learn from what we have already experienced in the world of computing.

* https://www.theguardian.com/uk/2002/mar/03/research.elearnin...


It's probable that PAR2 is essentially feature complete, so no maintenance is really needed.

The program does pretty much the same thing as it did a decade ago.


Nope. You always need end-to-end parity integrity checking. Your data goes through too many layers before reaching the storage medium. E.g. I once got a substantial amount of my pictures filled with bit errors because of a faulty RAM module in my NAS.


This happened to me, and caused me to rethink my approach file management.

Unfortunately, mainstream tooling is largely fire-and-forget and never includes verification (e.g. copying succeeds even if the written data is getting garbled), so one is forced to use multi-step workflows to get around this. It's pretty discouraging that no strong abstractions exist in this space.


Yes, end-to-end checking is a must - but that applies to any method of integrity protection. I could run TrueNAS at home on some old desktop I've retired instead of the used Dell R520 I bought for the task, but I have experienced memory failures before and expect them to happen - this doesn't change if you're using .par files instead.

(People underestimate how frequently memory corruption can actually occur, almost two years ago when Overwatch first came out the game kept crashing - it took me forever to find the cause was a faulty DIMM. Hell, right now the R320 I have in my rack at home has an error indicator because one of my 2 year old Crucial RDIMM's has an excessive amount of correctable errors).


I've used XZ to compress tarballs of backup. XZ was useful so I could store more backups on an external hard drive. I have seen bit rot on some of these files (stored on a magnetic HDD), in the sense that the md5sum of the .tar.xz archive no longer matches when it was created. What do you suggest for creating parity/ECC in this case? I'm aware of parchive, but is that the right choice and in what configuration?


Keep in mind I'm not an archival expert so you should do your own research. That being said, currently I'm using pyFileFixity [1] to generate the hashes and ECC data for my personal backups. I write them to M-Disc Blu-rays using Dvdisaster [2] which can also write additional ECC data. After a lot of googling and reading this useful Super User question [3], and this extensive answer [4] I settled on this setup. I must admit that I am guilty of storing images as JPGs and compressing most most of my files in ZIPs for convenience.

[1]: https://github.com/lrq3000/pyFileFixity

[2]: http://dvdisaster.net/en/index.html

[3]: https://superuser.com/q/374609/52739

[4]: https://superuser.com/a/873260/52739


The whole structural adaptive encoding seems like massive overcomplication. I feel like clever tricks such as that serve only to bite in the ass when you need it the most.

Same goes for the bit jpeg. Sure, it might not be ideal technically, but recommending JPEG2000 (presumably as there is no JPEG2) with its ridiculously poor software support seems weak too. What use is robust file that you can't open?


When you're transferring files and need to cope with corrupted/missing chunks, you should use a parity scheme. Others have mentioned that; it's common for, for example, Usenet.

If you can't control the underlying storage, then ditto. Keeping and maintaining explicit parity chunks is somewhat inconvenient, but it works.

But if you just want to avoid bitrot of your own files, sitting on your own HDD, I'd recommend using a reliable storage system instead. ZFS or, at higher and more complicated levels, Ceph/Rook and its kin. That still offers a posix interface (unlike parity files), while being just as safe.


If I am using a single HDD, can ZFS still add parity data? That's neat if it can. I assumed parity with ZFS was for something like RAID6 where there are multiple HDDs in a set.

Do any other file systems other than ZFS support adding parity in a single HDD config? Last I checked getting ZFS in Linux required lots of side band steps due to licensing issues.


ZFS can do multiple copies of a file on a single hard drive. It is not adding parity.

ZFSOnLinux is developed outside Linux’ tree for 2 reasons. One, it is easier that way and two, Linus does not want it in the main tree. Consequently, you need to install it in addition to the kernel as if it were entirely userspace software. That does not add anymore difficulty than say, installing Google Chrome. :/


I have occasionally had downloaded tarballs that were truncated by network failure. It's nice to be able to get a meaningful error when decompression fails, instead of silently decompressing only part of the data. So built-in integrity checks are also desirable for short-term distribution.


> parity/ECC

Parity is ECC (which is usually Reed-Solomon, which is just a fancy name for a big set of more equations than data chunk you have, so that's how it adds in redundancy) with 1 bit. Usually you should aim for +20-40% redundancy.

Ceph, HDFS and other distributed storage systems implement erasure coding (which is subtly different from error correction coding), which I would recommend for handling backups.


The interesting thing about erasure codes is that you need to checksum your shards independently from the EC itself. If you supply corrupted or wrong shards, you get corrupted data back.

I think for backup (as in small-scale, "fits on one disk") error-correcting codes are not a really good approach, because IME hard disks with one error you notice usually have made many more errors - or will do so shortly. In that case no ECC will help you. If, on the other hand, you're looking at an isolated error, then only very little data is affected (on average).

For example, a bit error in a chunk in a tool like borg/restic will only break that chunk; a piece of a file or perhaps part of a directory listing.

So for these kinds of scenarios "just use multiple backup drives with fully independent backups" is better and simpler.


Vivint published a Go package a while back that does both Reed Solomon and Forward Error Correction: https://innovation.vivint.com/introduction-to-reed-solomon-b...


For small scale, use Dropbox or Google Drive, or whatever, because for small scale the most important part of backup is actually reliably having it done. If you rely on manual process, you're doomed. :)

For large scale in house things: Ceph regurarly does scrubbing of the data. (Compares checksums.) and DreamHost has DreamObjects.

Thanks for mentioning borg/restic, I have never heard of them. (rsnapshot [rsync] works well, but it's not so shiny) Deduplication sounds nice. (rsnapshot uses hardlinks.)

That made me look for something btrfs based, and here's this https://github.com/digint/btrbk seems useful (send btrfs snapshots to a remote somewhere, also can be encrypted), could be useful for small setups.


I think rsync/rsnapshot aren't really appropiate for backups:

(1) They need full support for all FS oddities (xattrs, rforks, acls etc.) wherever you move the data

(2) They don't checksum the data at all.

The newer tools don't have either problem that much: For (1) they pack/unpack these in their own format which doesn't need anything special, so if you move your data twice in a circle you won't lose any (but their support for strange things might not be as polished as e.g. rsync's or GNU coreutils). And for deduplication they have to do (2) with cryptographic hashes.

However (as an ex-dev of one of these) they all have one or the other problem/limitations that won't go away. (Borg has its cache and weak encryption, restic iirc has difficult-to-avoid performance problems with large trees etc.)

Something that nowadays might also need to be discussed is if and how vulnerable your on-line backup is against BREACH-like attacks. E.g. .tar.gz is pretty bad there.


Hm, rsync does MD5 checking automatically. Which doesn't do much against bitrot [0], but it should help with the full circle thing. (And maybe it'll be SHA256+ in newer versions? Though there's not even a ticket in their bugzilla about this. And maybe MD5 is truly enough against random in-transit corruption.)

Yeah, crypto is something that doesn't play well with dedupe, especially if you don't trust the target backup server.

Uh, BREACH was a beast (he-he). I'm a bit still uneasy after thinking about how long these bugs were lurking in OpenSSL. Thankfully the splendid work of Intel engineers quickly diverted the nexus of our bad feels away from such high level matters :|

[0] That's something that the btrfs/ZFS/Ceph should/could fix. (And btrfs supports incremental mode for send+receive.)


Of course there are million variables here, but for compressible data arguably compression+ecc is more robust against damage than uncompressed data. The rationale being that with compression you can afford to use more/bigger ecc


I sent a reasonable amount of data to Cloud Storage. It varies a lot. Usually ~10GB/day, but it can be up to 1TB/day regularly.

xz can be amazing. It can also bite you.

I've had payloads that compress to 0.16 with gzip then compress to 0.016 with xz. Hurray! Then I've had payloads where xz compression is par, or worse. However, with "best or extreme" compression, xz can peg your CPU for much longer. gzip and bzip2 will take minutes and xz -9 is taking hours at 100% CPU.

As annoying as that is, getting an order of magnitude better in many circumstances is hard to give up.

My compromise is "xz -1". It usually delivers pretty good results, in reasonable time, with manageable CPU/Memory usage.

FYI. The datasets are largely text-ish. Usually in 250MB-1GB chunks. So talking JSON data, webpages, and the like.


If you can compress data this much, you seem to have a lot of repetitive data. Have you tried using compression algorithms that support custom dictionaries? ZSTD and DEFLATE support those and can maybe help with compression ratio as well as speed.


If you get compression ratios that good, you should consider if your application might be doing something stupid like storing the same data thousands of times inside it's data file.

If you store enough of the same type of data, invest in redesigning the application. There's a reason we all use jpegs over zipped bitmaps...


> There's a reason we all use jpegs over zipped bitmaps...

It's because it's an appropriate compression - just like xz can be? Not sure what you're actually suggesting here.


The suggestion is to design an application-specific format that avoids storing redundant data in the first place. When that's an option at all it gives you higher compression than any general-purpose compression algorithm can achieve.


HTML is pretty repetitive, but if you want to archive HTML data, you don't get to redefine what HTML is. Compression is useful.


This is what the WARC [0] file format (and/or gzip) is for.

[0] https://en.m.wikipedia.org/wiki/Web_ARChive


and/or xz? because xz gives better compression than gzip or warc?


It sounds like his application scraping data of some kind rather than say generating it.


This is purely anecdotal and could easily be PEBKAC, but I created a bunch of xz backups years ago and had to access them a couple of years later after a disc died. To my panicked surprise, when trying to unpack them, I was informed that something was wrong (sorry at this point I don't remember what it was). I never did get it working. From that point on I went back to gzip and have not had a problem since. Yes xz packs efficiently, but a tight archive that doesn't inflate is worse than worthless to me.


FWIW, PNG also "fails to protect the length of variable size fields". That is, it's possible to construct PNGs such that a 1-bit corruption gives an entirely different, and still valid, image.

When I last looked into this issue, it seemed that erasure codes, like with Parchive/par/par2, was the way to go. (As others have mentioned here.) I haven't tried it out as I haven't needed that level of robustness.


FWIW, xz is also a memory hog with the default settings. I inherited an embedded system that attempts to compress and send some logs, using xz, and if they're big enough, it blows up because of memory exhaustion.


"xz is also a memory hog with the default settings"

Then why use the default settings?

I tend to use the maximum settings, which are much more of a memory hog, but I have enough memory where that's not an issue.

Just use the settings that are right for you.


You'd have to ask the guy who wrote the code in the first place.

I think he saw "'best' compression" and stopped looking there.


I didn't mean to ask why the defaults are defaults, but rather why anyone would use the defaults rather than settings more appropritate to their use case?

It's not like xz is unable to be lighter on memory, if that's what you want. It's an option setting away.


To clarify: you'd have to ask the guy who wrote our code.


When I use xz for archival purposes I always use par2[1] to provide redundancy and recoverability in case of errors.

When I burn data (including xz archives) on to DVD for archival storage, I use dvdisaster[2] for the same purpose.

I've tested both by damaging archives and scratching DVDs, and these tools work great for recovery. The amount of redundancy (with a tradeoff for space) is also tuneable for both.

[1] - https://github.com/Parchive/par2cmdline

[2] - http://dvdisaster.net/


Thank you for sharing this. I am in charge of archiving the family files - pictures, video, art projects, email. I want it available through the aging of standards and protected against the bitrot of aging hard drives. I'll be converting any xz archives I get into a better format.


Mix and match, according to criticity and max affordable data loss: multiple locations, multiple solutions, multiple local copies (e.g. one cloud solution + DVD + NAS). See: https://www.backblaze.com/blog/the-3-2-1-backup-strategy/


You should also write out ECC information.


Requiring userland software to worry about bitrot is a great way to ensure that it is not done well. It is better to let the filesystem worry about it by using a file system that can deal with it.

This article is likely more relevant to tape archives than anything most people use today.



Is this really an issue for this use case? My naive take is that since Arch updates packages so often, "long-term storage" doesn't come up that much in practice.


It's zero issue since the packages are updated regularly and hashed before installing.


Good point, I hadn't even thought of checksums!


Default compression for debian packages is xz as well: http://manpages.ubuntu.com/manpages/xenial/en/man1/dpkg-deb....


this is from 2010, I guess if xz was bad for this use case they'd know by now


The purpose of a compression format is not to provide error recovery or integrity verification.

The author seems to think the xz container file format should do that.

When you remove this requirement, nearly all his arguments become moot.


> The purpose of a compression format is not to provide error recovery or integrity verification.

On the contrary. People archive files to save space, exchange files with each other over unreliable networks able to corrupt data, store them in corrupted ram and corrupted disks, even if just temporary. Compression formats are there to help with that, this is their main purpose. This is why fast and proper checksumming is expected, but not cryptographic, like sha256, that adds nothing to this goal but overhead.


I fail to see why integrity checking is the file format's responsibility. Is this historical? Like when you just dd a tar file directly onto a tape and there is no filesystem? Anyway seems like it should be handled by the filesystem and network layers.

I can understand the concerns about versioning and fragmented extension implementations though.


> you just dd a tar file directly onto a tape

Actually, one uses the tape archive utility, tar, to write directly to the tape. (-:


Perhaps renice your job so that others don't complain about their noisy neighbor.

    renice 19 -p $$ > /dev/null 2>&1
then ...

Use tar + xz to save extra metadata about the file(s), even if it is only 1 file.

    tar cf - ~/test_files/* | xz -9ec -T0 > ./test.tar.xz
If that (or the extra options in tar for xattrs) is not enough, then create a checksum manifest, always sorted.

    sha256sum ~/test_files/* | sort -n > ~/test_files/.sha256
Then use the above command to compress it all into a .tar file that now contains your checksum manifest.


I did some compression tests of the CI build of master branch of zig:

    34M zig-linux-x86_64-0.2.0.cc35f085.tar.gz
    33M zig-linux-x86_64-0.2.0.cc35f085.tar.zst
    30M zig-linux-x86_64-0.2.0.cc35f085.tar.bz2
    24M zig-linux-x86_64-0.2.0.cc35f085.tar.lz
    23M zig-linux-x86_64-0.2.0.cc35f085.tar.xz
With maximum compression (the -9 switch), lzip wins but takes longer than xz:

    23725264 zig-linux-x86_64-0.2.0.cc35f085.tar.xz  63.05 seconds
    23627771 zig-linux-x86_64-0.2.0.cc35f085.tar.lz  83.42 seconds


Why do people use xz anyway? As for me I just use tar.gz when I need to backup a piece of a Linux file system into an universally-compatible archive, zip when I need to send some files to a non-geek and 7z to backup a directory of plain data files for myself. And I dream of the world to just switch to 7z altogether but it is hardly possible as nobody seems interested in adding tar-like unix-specific metadata support to it.


xz has substantially better compression than gz or bz2, especially if using the flags -9e. You can use all your cores with -T0 or set how many cores to use. I find it to be on par with 7-zip.

Perhaps folks are trying to stick with packages that are in their base repo. p7zip is usually outside of the standard base repos.


Substantially is a relative term. There are niche cases but how many people really care, or need to care, about the last bytes that can be compressed?

Packing a bunch of files together as .tgz is a quite universal format and compresses most of the redundancy out. It has some pathological cases but those are rare, and for general files it's still in the same ballpark with other compressors.

I remember using .tbz2 in the turn of the millennium because at the time download/upload times did matter and in some cases it was actually faster to compress with bzip2 and then send over less data.

But DSL broadband pretty much made it not matter any longer: transfers were fast enough that I don't think I've specifically downloaded or specifically created a .tbz2 archive for years. Good old .tgz is more than enough. Files are usually copied in seconds instead of minutes, and really big files still take hours and hours.

None of the compressors really turn a 15-minute download into a 5-minute download consistently. And the download is likely to be fast enough anyway. Disk space is cheap enough that you haven't needed the best compression methods for ages in order to stuff as much data on portable or backup media.

Ditto for p7zip. It has more features and compresses faster and better but for all practical purposes zip is just as good. Eventhough it's slower it won't take more than a breeze to create and transfer, and it unzips virtually everywhere.


I never thought bz2 was worth it over gzip, but xz is much much better in many common cases (particularly text files, but also other things). Source code can often be xz compressed to about half the size as gzip. If you are downloading multiple things at once or a whole operating system or uploading something then even on slower DSL lines it makes a huge difference IMO. I wish more package systems provided deltas.

The only issue I've had with xz is that it doesn't notice if it is not actually compressing the file like other utilities do and then just store the file uncompressed, so if you try to xz a tar file with a bunch of already highly compressed media files then it both takes forever and and you end up with a nontrivially larger file than you started with.

Also, I like that, unlike gzip, xz can sha256 the uncompressed data if you use the -C sha256 option, providing a good integrity check. Yes, I would really like to use a format that doesn't silently decompress incorrect data and I can't understand why the author of this article thinks that is a bad thing. For backups I keep an mtree file inside the tar file with sha512 of each file and then the -C sha256 option to be able to easily test the compressed tar file without needing another file. In some cases I encrypt the txz with the scrypt utility (which stores HMAC-SHA256 of the encrypted data).


> zip is just as good

A major problem of zip is the "codepage hell" (it has been almost eradicated in browsers but still lives in zip archives, e-mails and non-.Net Windows programs). With 7z you just always know nobody is going to have problems decoding the names of the files inside it, whatever languages those are in, regardless to the system locale.


Related: where can I find a thorough step-by-step method for maintaining the integrity of family photos/videos in backups on either Windows or macOS?


The [Koopman] cited throughout is my boss, Phil! At any rate I'm sadly not surprised and a little appalled that xz doesn't store the version of the tool that did the compression..


So long as xz(1) gets insane amounts of compression and there is no compressor which compresses better, people are going to keep preferring it.


What is the probability that a given byte will be corrupted on a hard disk in one year?

What is the probability of a complete HD failure in a year?


Use par2 to generate FEC for your archives and move on with your life.


So what wrong with plain and simple

  tar c foo | gzip > foo.tar.gz
or

  tar c foo | bzip2 > foo.tar.bz2
Been using these for over 20 years now. Why is is so important to change things especially as this article points out for the worse?!


Better (smaller and/or faster) compression.


To read the article:

    document.body.style['max-width'] = '550px'; document.body.style.margin = '0 auto'


or just resize your browser window


What are better file formats for long term archiving? Were any of them designed specifically with that use case in mind?


There's a post on Super User that contains useful information:

"What medium should be used for long term, high volume, data storage (archival)?" https://superuser.com/q/374609/52739

It mostly focuses on the media instead of formats though.


Personally I think the premise of the question is poor. Attempting to build monolithic long term (100+ years) cold storage of significant amount of data is a folly, instead the only reasonable approach is to do it in smaller parts (maybe 10-20 years) and plan for migrations.


It all depends on what your definition of "high-volume" is, and just how "archival" your access patterns really are.

Amazon Glacier runs on BDXL disc libraries (like a tape library). There's nothing truly expensive about producing BDXL media, there just isn't enough volume in the consumer market to make it worthwhile. If you contract directly with suppliers for a few million discs at a time, that's not an issue (you did say high-volume, right?).

https://storagemojo.com/2014/04/25/amazons-glacier-secret-bd...

For medium-scale users, tape libraries are still the way to go. You can have petabytes of near-line storage in a rack. Storage conditions are not really a concern in a datacenter, which is where they should live.

(CERN has about 200 petabytes of tapes for their long-term storage.)

https://home.cern/about/updates/2017/07/cern-data-centre-pas...

If you mean "high-volume for a small business", probably also tapes, or BD discs with 20% parity encoding to guard against bitrot.

Small users should also consider dumping it in Glacier as a fallback - make it Amazon's problem. If you have a significant stream of data it'll get expensive over time, but if it's business-critical data then you don't really have a choice, do you?


> Amazon Glacier runs on BDXL disc libraries ...

This has been a rumor I've heard for quite a while (probably since shortly after Glacier was announced) but has it ever been confirmed?


Thanks, I'll take a look. Though I think I have the media question answered, and I settled on M-DISC for personal stuff (https://en.wikipedia.org/wiki/M-DISC). It only has special requirements for writing, reading can be done on standard drives.


I went with M-Disc too and an LG Blu-ray burner. I think you only need a special burner if you're using the DVDs. I want to say most Blu-ray burners work.


TFA is on the homepage for lzip[1] which is an lzma based compressor designed for this.

You can also use xzip on top of something that can correct errors, such as par2.

1: https://www.nongnu.org/lzip/


According to the article, xz minus extensibility.


That's not what I got from it. Xz has other problems such as:

> According to [Koopman] (p. 50), one of the "Seven Deadly Sins" (i.e., bad ideas) of CRC and checksum use is failing to protect a message length field. This causes vulnerabilities due to framing errors. Note that the effects of a framing error in a data stream are more serious than what Figure 1 suggests. Not only data at a random position are interpreted as the CRC. Whatever data that follow the bogus CRC will be interpreted as the beginning of the following field, preventing the successful decoding of any remaining data in the stream.

> Except the 'Backward Size' field in the stream footer, none of the many length fields in the xz format is protected by a check sequence of any kind. Not even a parity bit. All of them suffer from the framing vulnerability illustrated in the picture above.


Given that there is basically one standard implementation, and virtually nobody has ever had an issue with compatibility with a given file, I don't see how it is "inadequate". Sure, if it's inadequate now, it'll be inadequate if you read it in a decade, but not in any way which would prevent you from reading it.

If your storage fails, maybe you'll have a problem, but you'd have a problem anyway.

Sometimes I feel like genuine technical concerns are buried by the authors being jerks and blowing things way out of proportion. I, for one, tend to lose interest when I hear hyperbolic mudslinging.


> The xz format lacks a version number field. The only reliable way of knowing if a given version of a xz decompressor can decompress a given file is by trial and error.

Wow ... that is inexcusably idiotic. Whoever designed that shouldn't be programming. Out of professional disdain, I pledge never to use this garbage.


Histrionic reactions don't improve the overall quality of software.

We certainly should have environments where we can tell someone code is shit, it's just silly and counterproductive to then leap to attacks on the abilities on the person behind it.


Improving badly designed software that is unnecessary in the first place is foolish; just "rm -rf" and never give it another thought.


Welcome to the world of software I guess. Non idiotic things are rare here.


Not w.r.t. that level of idiotic; and in FOSS, at least, we should be able to eject the idiotic. Thanks in part to articles like this, we can.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: