What are better file formats for long-term archiving? Were any of them designed specifically with that use case in mind?



There's a post on Super User that contains useful information:

"What medium should be used for long term, high volume, data storage (archival)?" https://superuser.com/q/374609/52739

It mostly focuses on the media rather than the formats, though.


Personally, I think the premise of the question is flawed. Attempting to build a monolithic long-term (100+ years) cold store for a significant amount of data is folly; the only reasonable approach is to do it in smaller stages (maybe 10-20 years each) and plan for migrations.


It all depends on what your definition of "high-volume" is, and just how "archival" your access patterns really are.

Amazon Glacier runs on BDXL disc libraries (like a tape library). There's nothing truly expensive about producing BDXL media; there just isn't enough volume in the consumer market to make it worthwhile. If you contract directly with suppliers for a few million discs at a time, that's not an issue (you did say high-volume, right?).

https://storagemojo.com/2014/04/25/amazons-glacier-secret-bd...

For medium-scale users, tape libraries are still the way to go. You can have petabytes of near-line storage in a rack. Storage conditions are not really a concern in a datacenter, which is where they should live.

(CERN has about 200 petabytes of tapes for their long-term storage.)

https://home.cern/about/updates/2017/07/cern-data-centre-pas...

If you mean "high-volume for a small business", probably also tapes, or BD discs with 20% parity encoding to guard against bitrot.
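
Just as a sketch of what that parity step can look like with par2cmdline (the redundancy level and file names here are only examples):

    # create ~20% recovery data before burning
    par2 create -r20 archive.par2 archive.tar

    # after reading the disc back years later
    par2 verify archive.par2
    par2 repair archive.par2   # reconstructs damaged blocks from the recovery data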

Small users should also consider dumping it in Glacier as a fallback - make it Amazon's problem. If you have a significant stream of data, it'll get expensive over time, but if it's business-critical data then you don't really have a choice, do you?
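
For instance, with the AWS CLI (the bucket and vault names here are hypothetical; Glacier is nowadays also exposed as S3 storage classes):

    # upload via S3 with a Glacier storage class
    aws s3 cp archive.tar.xz s3://my-archive-bucket/ --storage-class DEEP_ARCHIVE

    # or to a classic Glacier vault
    aws glacier upload-archive --account-id - --vault-name my-vault --body archive.tar.xz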


> Amazon Glacier runs on BDXL disc libraries ...

This has been a rumor I've heard for quite a while (probably since shortly after Glacier was announced), but has it ever been confirmed?


Thanks, I'll take a look. Though I think I have the media question answered: I settled on M-DISC for personal stuff (https://en.wikipedia.org/wiki/M-DISC). It only has special requirements for writing; reading can be done on standard drives.


I went with M-Disc too, plus an LG Blu-ray burner. I think you only need a special burner if you're using the DVD-sized discs; I want to say most Blu-ray burners can write the BD version.
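
If it's useful to anyone, here's a minimal sketch of mastering and burning such a disc on Linux with genisoimage and growisofs (the device path and file names are assumptions; adjust for your setup):

    # build an ISO image from the archive directory
    genisoimage -r -J -o archive.iso /path/to/archive/

    # burn it to the disc in the Blu-ray writer (here /dev/sr0)
    growisofs -Z /dev/sr0=archive.iso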


TFA is on the homepage for lzip [1], which is an LZMA-based compressor designed for exactly this.

You can also pair xz with something that can correct errors, such as par2 (sketch below).

1: https://www.nongnu.org/lzip/
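
A rough sketch of that combination (file names are examples; lzip slots in the same way if you prefer it over xz):

    # pack and compress
    tar -cf docs.tar docs/
    xz -9 docs.tar                    # produces docs.tar.xz

    # generate ~20% recovery data to keep alongside it
    par2 create -r20 docs.par2 docs.tar.xz
    # archive docs.tar.xz together with the generated .par2 volumes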


According to the article, it's xz minus the extensibility.


That's not what I got from it. Xz has other problems such as:

> According to [Koopman] (p. 50), one of the "Seven Deadly Sins" (i.e., bad ideas) of CRC and checksum use is failing to protect a message length field. This causes vulnerabilities due to framing errors. Note that the effects of a framing error in a data stream are more serious than what Figure 1 suggests. Not only data at a random position are interpreted as the CRC. Whatever data that follow the bogus CRC will be interpreted as the beginning of the following field, preventing the successful decoding of any remaining data in the stream.

> Except the 'Backward Size' field in the stream footer, none of the many length fields in the xz format is protected by a check sequence of any kind. Not even a parity bit. All of them suffer from the framing vulnerability illustrated in the picture above.



