I think articles like this are important... we need a good way of collecting knowledge (and especially frustrations) like this in a compact format. Everyone who thinks, "I'll use tar," in response to some problem involving archives, backups, package distribution, container distribution, etc. will at some point run head-first into one of the things that makes tar weird. Without good collections of complaints, somebody comes in to solve the problem and ends up only solving the corner of it that they personally care about.
Something I personally find frustrating is that I'd like to be able to create a tar archive without recording UIDs, usernames, or timestamps. The command-line tools aren't really built for that, and yet this seems pretty reasonable, and libraries which create tar files do this by default. (This is covered under the reproducible builds section.)
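In case it helps anyone, here's a minimal sketch of doing that with Python's tarfile module (the paths and the pinned mtime are just placeholders):

```python
import os
import tarfile

def normalize(info: tarfile.TarInfo) -> tarfile.TarInfo:
    # Drop everything that varies between machines and build runs.
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    info.mtime = 0  # or a pinned SOURCE_DATE_EPOCH value
    return info

with tarfile.open("out.tar", "w", format=tarfile.PAX_FORMAT) as tar:
    # Walk the tree ourselves so the member order is deterministic too.
    for root, dirs, files in os.walk("rootfs"):
        dirs.sort()  # make os.walk descend in a stable order
        for name in sorted(dirs + files):
            tar.add(os.path.join(root, name), recursive=False, filter=normalize)
```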
The next thing that you're likely to want is some kind of random-access archive, for which you'll use either zip (which sticks the directory at the end) or squashfs, maybe.
Finally, the last thing that I often want is just a very binary key-value format which I can distribute and randomly access with reasonable efficiency.
Looking at things from the opposite perspective, I'm personally frustrated by the cross-platform personal backup options. Currently, I'm using Duplicity or some variant, but I find it a bit of a hassle, and as a result my backup schedule is rather sporadic.
I recently learned about [SQLite archives](https://www.sqlite.org/sqlar.html) which feel like an interesting option to consider, especially with your comment around sometimes wanting a "binary key-value format" in mind.
Because it's "just sqlite" it's possible to use the archive functionality in archive-like ways, while also using the file as a sqlite db (because that's all it is).
It's not a perfect tool for all scenarios (and I'm thinking it wouldn't be good for backups), but I feel there are definitely some very useful scenarios for it.
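To make the "it's just sqlite" point concrete, here's a rough sketch of pulling one file out of a .sqlar by name. My understanding from the sqlar docs (worth double-checking) is that the table is `sqlar(name, mode, mtime, sz, data)` and the blob is zlib-compressed whenever its stored length differs from `sz`:

```python
import sqlite3
import zlib

def read_member(archive: str, name: str) -> bytes:
    """Pull a single file's contents out of an SQLite archive (.sqlar)."""
    con = sqlite3.connect(archive)
    try:
        row = con.execute(
            "SELECT sz, data FROM sqlar WHERE name = ?", (name,)
        ).fetchone()
    finally:
        con.close()
    if row is None:
        raise FileNotFoundError(name)
    sz, data = row
    # Per the sqlar docs, blobs are stored raw when compression wouldn't
    # help and as a zlib stream otherwise; sz is the uncompressed size.
    return data if len(data) == sz else zlib.decompress(data)

# e.g. read_member("stuff.sqlar", "notes/todo.txt")
```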
> Looking at things from the opposite perspective, I'm personally frustrated by the cross-platform personal backup options. Currently, I'm using Duplicity or some variant, but I find it a bit of a hassle, and as a result my backup schedule is rather sporadic.
It's funny you mention backups -- the similarity is not a coincidence, as many of the problems we have in container images are related to problems the backup community had in the past. As for a backup tool recommendation, I use restic[1] which is quite neat.
It may sound esoteric but this stuff is actually Super Damn Important, for at least two big reasons:
1. Security. Right now it is hard to trust a container image. Determining provenance and/or reproducibility is hampered by the image format and especially by the charming little quirks of tar.
2. Performance. There are lots of optimisations that would be available if we didn't need to stream out an entire layer before we could do anything useful with it. The tar family of formats is, as the name suggests, all about creating linear files intended to be saved to a tape. On a tape, random access is bonkers; for a container, random access is a frequent operation. On top of that there's the whole hassle of shipping almost-but-not-quite-identical layers over and over and over. There must be petabytes of wasted bandwidth worldwide by now.
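To make the random-access point concrete: because tar has no index, pulling a single file out of a layer means scanning headers from the start (and, for a gzipped layer, decompressing everything in front of it). A rough illustration with Python's tarfile, with made-up filenames:

```python
import tarfile

with tarfile.open("layer.tar.gz", "r:gz") as layer:
    # There is no index to consult, so finding one member means reading
    # header after header (and decompressing the bytes in between) until
    # we stumble across the right one.
    member = layer.getmember("etc/os-release")
    data = layer.extractfile(member).read()
```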
There are also a few others, and I've spoken to Lennart about using casync (we had a very long thread about it on Twitter). It's a bit of a mixed bag: while casync does give us most of what we want, there are some other concerns I've heard in the past few days from OCI users that make me quite cautious about using a format that cannot easily be turned into a runtime format.
For instance, some folks want to have their runtime format be identical to the image format (so that signatures of the image can be used to verify the running containers). This is something that you cannot currently do with stock OCI (though they have worked around it), but should be a consideration for a future format. I only became aware of this concern after writing my blog post, so I will have to include it in the next one. :P
My last skim of casync suggested that it's heavily block-oriented. That's fine for backups and FS duplication, but as a distribution format it doesn't really fly.
A lot of folks want two levels of insight into the image's provenance or trustworthiness:
1. Has the image been tampered with? This is mostly solved by TUF / Notary.
2. What are the files and where did they come from? This means you need a file-level abstraction that is simple and fast.
Insofar as casync is block-oriented instead of file-oriented, it's a poor fit for the second problem. It doesn't matter how efficient the streaming and storage is if you make people download the entire layer each time they want to check a single file.
IIRC the Unix metadata is not part of the canonical zip format, but rather an extension introduced (I believe) by Info-ZIP, an implementation of zip for Unix systems (if you use the zip/unzip CLI tools on a Unix-style system, it's most likely Info-ZIP).
Because it wasn't designed for Unix-like systems. Perhaps most notably, zip files don't have any support for Unix-style permissions, which makes it a bit of a non-starter for containers at least.
They do---as an optional extra field. This seems reasonable; it's not even desirable for many archives to include e.g. owner/group information. The zip format has its own odd required metadata (a timestamp in local time!), but for the most part the design of relegating platform-specific metadata to the extra fields seems quite sensible.
-UNIX Extra Field (0x000d):
The following is the layout of the Unix "extra" block.
Note: all fields are stored in Intel low-byte/high-byte
order.
Value Size Description
----- ---- -----------
(UNIX) 0x000d 2 bytes Tag for this "extra" block type
TSize 2 bytes Size for the following data block
Atime 4 bytes File last access time
Mtime 4 bytes File last modification time
Uid 2 bytes File user ID
Gid 2 bytes File group ID
(var) variable Variable length data field
The variable length data field will contain file type
specific data. Currently the only values allowed are
the original "linked to" file names for hard or symbolic
links, and the major and minor device node numbers for
character and block device nodes. Since device nodes
cannot be either symbolic or hard links, only one set of
variable length data is stored. Link files will have the
name of the original file stored. This name is NOT NULL
terminated. Its size can be determined by checking TSize -
12. Device entries will have eight bytes stored as two 4
byte entries (in little endian format). The first entry
will be the major device number, and the second the minor
device number.
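If anyone wants to poke at this themselves, the extra field is exposed verbatim by Python's zipfile module and is easy to pick apart. A rough sketch that only handles the fixed-size part of the 0x000d block (note that recent Info-ZIP builds tend to write the newer 0x5455/0x7875 timestamp and UID/GID fields instead, so 0x000d may simply not be there):

```python
import struct
import zipfile

def unix_extra_field(info: zipfile.ZipInfo):
    """Walk the extra-field blocks and decode the Unix one (tag 0x000d)."""
    extra, offset = info.extra, 0
    while offset + 4 <= len(extra):
        tag, size = struct.unpack_from("<HH", extra, offset)
        if tag == 0x000D and size >= 12:
            atime, mtime, uid, gid = struct.unpack_from("<IIHH", extra, offset + 4)
            return {"atime": atime, "mtime": mtime, "uid": uid, "gid": gid}
        offset += 4 + size
    return None

with zipfile.ZipFile("example.zip") as zf:
    for info in zf.infolist():
        print(info.filename, unix_extra_field(info))
```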
zip can solve one of the problems I listed, namely the parallel operations problem, because it has an index at the end of the archive which facilitates random access (though because the index is at the end, you'd have to download the whole archive before you could start a parallel extraction -- so it's only partially ideal).
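For contrast with tar, here's roughly what that index buys you once the archive is local (filenames made up): reading one member touches the central directory and that member's data, and nothing else.

```python
import zipfile

with zipfile.ZipFile("image.zip") as zf:
    names = zf.namelist()             # comes straight from the central directory
    data = zf.read("etc/os-release")  # seeks to that member; no full scan needed
```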
The index at the end of the file is definitely a serious drawback for zipfiles; it privileges the writer over the reader (which makes sense given the hardware limitations at the time). For someone archiving their own stuff that's neither here nor there, but container images, JAR files etc. are written once and read many times, so the trade-off is lopsided in the wrong direction.
Many modern file formats are just zip with something machine- and human-readable inside (XML, JSON, etc.). OpenOffice, MS Office, jar, apk, epub, etc. are all just zip files.
This is what Singularity does (and, to a lesser extent, LXD). The main problem with this is that you don't get de-duplication of transfer (or really of storage) -- any small change in your published rootfs and you'd have to re-download the whole thing. In addition, it requires that the system you're mounting it on supports the filesystem you use (and that the admins are happy using that filesystem).
There is also a potential security risk, since filesystem drivers are generally not hardened against malicious input (plenty of attacks have been found against the big Linux filesystems when you feed them untrusted filesystem data). This is one of the reasons auto-mounting USB drives is generally seen as bad security practice.
Don't get me wrong, there is a _huge_ benefit to using your runtime format as your image distribution format. But there are downsides that are non-trivial to work around. I am thinking about how to bridge that gap though.
Yes, and this is what LXD does. I think I mentioned it in the article, but basically the issue is that it requires one of:
1. A clever server, which asks you which version you have so it can generate a diff for you. This has quite a few drawbacks (storage and processing costs, as well as making it harder to verify that the image you end up with is what was signed by the original developers). But it does guarantee that you always get de-duplication.
2. Or you could pre-generate diffs for a specific set of versions, which means it's a lottery whether or not users actually get transfer de-duplication. If you generate a diff for _every_ version you're back to storage costs (and processing costs on the developer side that grow with each version of the container image). You could make the diffs only step forward one version at a time rather than jumping straight to the latest, but then clients end up having to pull many binary diffs again.
This kind of system has existed for a long time, both in the BSDs and in distributions that ship delta-RPMs (or the equivalent for debs). It works _okay_ but it's far from ideal, and the other negatives of using loopback filesystems only make it less appealing.
I could be technically inaccurate, but my understanding is that it's rsync but with the server serving a metadata file which allows the rsync-diffing to happen from the client side rather than the server side - hence no clever server required.
It also doesn't require diffing particular revisions; only the blocks that differ will be fetched. It does require serving the metadata file, but those aren't very large AFAIK.
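That matches my understanding of zsync too. A back-of-the-envelope sketch of the idea (not zsync's actual algorithm, which also uses rolling checksums so it can match blocks at unaligned offsets): the server publishes a static per-block manifest, and the client works out what it's missing on its own.

```python
import hashlib

BLOCK = 64 * 1024  # fixed block size, purely for illustration

def block_hashes(data: bytes) -> list[str]:
    return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

def blocks_to_fetch(local: bytes, manifest: list[str]) -> list[int]:
    """Indices of remote blocks we don't already have a copy of locally."""
    have = set(block_hashes(local))
    return [i for i, digest in enumerate(manifest) if digest not in have]

# The server only serves the image plus a static manifest (block_hashes of
# the new version); all of the diffing happens on the client.
```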
I thought much the same thing. ZFS scratches most of these itches (sharing common blocks, its metadata is self-verifying, it's able to serialize a minimal set of changes between two snapshots, etc.). Just ship the filesystem images as, well, filesystem images. Plus if you want to go from development to production, you can `zfs-send` your image onto your HA cluster. ZFS makes for a durable and reliable storage subsystem that's been production-grade for many years.
This is essentially what Illumos/SmartOS does, and it seems to work out well for them.
The problem is when you have systems that don't have ZFS, or cases where you want to operate on an image without mounting it.
Also (from memory) ZFS de-duplication is not transmitted over zfs-send which means that you don't really get ZFS's de-dup when transferring (you do get it for snapshots -- but then we're back to layer de-duplication which is busted when it comes to containers).
Don't get me wrong, I'm a huge fan of ZFS but it really doesn't solve all the problems I went through.
ZFS supports both compression and de-duplication on send streams, the behavior on the receiving side depends on the configuration of the pool+dataset where the data is being received.
There used to be some differences in features/behavior depending on the ZFS version in use (Solaris vs FreeBSD vs ZFSoL/OpenZFS) but I believe as of 2018 all ZFS implementations have these features for send streams
As someone who was not programming when many of the customizations described in the article were introduced: this seems like a cautionary tale that goes against the usual wisdom about "competing standards": https://xkcd.com/927/
The desire to unify the formats' competing back-compatibility needs created something that was (because of the standards conflicts/schisms/reunifications) extremely sub-par for most use cases, but (because of the time spent baking common interfaces) just usable enough that it remained the primary basis for storage formats for, if the author is to be believed, far longer than it should have.
I wonder how many other tools that are venerated because of their age and ubiquity are similarly decrepit and broken when you peel back the curtain?
The hoo-ha over tar and cpio in the 1980s when the subject of standardization came up, and the whole raison d'être of pax, is largely forgotten nowadays. But the problems that were brought to light back then live on.
The thing to bear in mind is that we already have attempts to improve upon this. The problematic formats in the 1980s were themselves products of the second-system effect, adding on various things to the original tape archive format. We are now, in the second decade of the 21st century, well into umpteenth-system effect.
This brings up a point where this article is very wrong. The pax utility did not appear in 2001. pax was in the POSIX draft back in the early 1990s, and had become widespread enough to be in reference books by 1991. The article conflates the PAX extensions with the pax utility. This, and the incorrect dates, are errors promulgated by Wikipedia, which I challenged back in 2018 at https://en.wikipedia.org/wiki/Talk:Pax_(Unix)#Incomplete_Inf... and which was probably the source of these errors in the article. Always double-check what Wikipedia says on computing topics with a proper reference.
Before being tempted to reinvent the wheel here, I recommend a look at all of the times during the so-called "archiver wars" where this wheel was already reinvented, and learning from them. Of particular note, given this article, is Rahul Dhesi's ZOO file format from 1986. It allowed for multiple generations of a given file, and archive headers marked deleted files with a flag (allowing them to be undeleted), which could be used for "whiteouts". It suffers far less from an extension mess because support for filesystem features from Unix, MS-DOS, and VMS (at least as they were in 1986) was provided in the base data structures, including 8.3 and long filenames.
But really the basic error here is in using an off-line archive format for an on-line live filesystem mechanism. It's the wrong data structure for the job. Whereas there are right data structures, and have been for years. The history of filesystem formats development includes several cases of addressing the very things mentioned in the article, from deduplication (c.f. ZFS) through generations for deleted files (ODS for Files-11 in VMS, where ZOO got the idea from) to reproducible directory scan orders (c.f. the side-effects of using B-trees in HPFS).
> This, and the incorrect dates, are errors promulgated by Wikipedia [...] which was probably the source of these errors in the article. Always double-check what Wikipedia says on computing topics with a proper reference.
I didn't use Wikipedia, actually. The primary sources were the libarchive documentation, star, bits of GNU tar's docs, and POSIX. The main issue is that it's hard to get a copy of POSIX.1-1988, let alone POSIX drafts from the 1990s.
EDIT: Also, I didn't actually notice there was a rationale section in POSIX.1-2001 which references PAX as existing in earlier standards. I will read through it and update my article accordingly. Thank you!
> Of particular note, given this article, is Rahul Dhesi's ZOO file format from 1986.
Funnily enough, I have heard of ZOO (not sure where) and looked into it. While it does support file versions (and deletion) and has many improvements over tar, there are many other properties it doesn't have that we'd need (and last I checked there's no real support for it in modern Linux and Unix-likes -- so it makes no difference from a ubiquity perspective).
> But really the basic error here is in using an off-line archive format for an on-line live filesystem mechanism.
We don't use tar archives as the live filesystem for containers; they're used as a distribution mechanism.
> from deduplication (c.f. ZFS)
While ZFS has de-duplication, it's not really the kind we need and (from memory) zfs-send doesn't include the de-dup tables so they're all generated on the receiving end. Ideally we'd want content-defined de-duplication because that way you can reproducibly generate the blobs.
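For anyone who hasn't run into the term, "content-defined" means chunk boundaries are picked by the bytes themselves (e.g. wherever a rolling hash hits a magic value), so the same content always chunks the same way no matter what comes before it. A toy sketch (nothing like casync's real chunker, which uses a buzhash plus minimum/maximum chunk sizes):

```python
def chunks(data: bytes, mask: int = 0x3FF):
    """Toy content-defined chunker: cut wherever a rolling sum of the last
    16 bytes has its low 10 bits all zero (roughly 1 KiB average chunks)."""
    window, rolling, start = 16, 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]
        if i - start >= window and (rolling & mask) == 0:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

# Because a boundary only depends on the bytes near it, an edit early in a
# file only disturbs the chunks around the edit; the rest hash to the same
# content-addressed blobs, which is what makes transfers reproducible.
```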