Tar is an ill-specified format (2015)

st_goliath · on March 30, 2021

I once foolishly thought, I'll write a tar parser because, "how hard can it be" [1].

I simply tried to follow the tar(5) man page[2], and got a reference test set from another website posted previously on HN[3].

Along the way I discovered that NetBSD pax apparently cannot handle the PAX format[3] and my parser inadvertently uncovered that git-archive was storing the checksums wrong, but nobody noticed because other tar implementations were more lax about it[4].

As the article describes (as does the man page), tar is actually a really simple format, but there are just so many variants to choose from.

Turns out, if you strive for maximum compatibility, it's easiest to stick to what GNU tar does and favor GNU extensions over PAX header fields. If you think about it, in many ways the GNU project IMO ended up doing "embrace, extend, extinguish" with Unix.

[1] https://github.com/AgentD/squashfs-tools-ng/tree/master/lib/...

[2] https://www.freebsd.org/cgi/man.cgi?query=tar&sektion=5

[3] https://mgorny.pl/articles/portability-of-tar-features.html

[4] https://www.spinics.net/lists/git/msg363049.html

vanviegen · on March 30, 2021

> If you think about it, in many ways the GNU project IMO ended up doing "embrace, extend, extinguish" with Unix.

The "extend" phase from the Microsoft playbook is always something that competitors cannot duplicate. Either for legal reasons (patents) or technical reasons (having the "spec" depend on all of the Windows ecosystem).

I don't think that's what GNU did.

st_goliath · on March 30, 2021

GNU extensions are all over the place. The C library has all kind of GNU functions that are more convenient/practical than what Unix had, the command line programs (e.g. coreutils) have feature flags and extensions that the originals didn't have but that made them more convenient to use, and I already mentioned GNU extensions in the tar format.

POSIX aside, Unix-like systems that are still around nowadays and not directly use GNU user space, have copied many of the GNU extensions out of necessity. People have become accustomed to those extensions and written programs & scripts depending on them, so you now need to implement many of the GNU extensions simply to stay compatible.

Even if not consciously done and without malicious intents, this is IMO still fairly similar to how Microsoft played it: add extensions that are convenient, so people use them, eventually rely on them and become a de facto standard.

vanviegen · on March 30, 2021

But as you say, Unix vendors could just implement these features. That's not how Microsoft would have employed its infamous tactic.

pjmlp · on March 30, 2021

I am yet to see Microsoft sue WINE for example.

Any C compiler vendor (even commercial ones) does language extensions, mob => pushing the language forward.

Microsoft does C compiler with language extensions, mob => get your pitchforks.

stinos · on March 30, 2021

Similar for vendor lock-in. While very real, it wouldn't be the first time reading topics where the general sentiment is 'oh you poor Apple/MS/RedHat/Visual Studio/... user you're getting locked in, all hope lost' being written by people who don't recognize they are themselves following that principle merely by sticking to one particular OS/editor/debugger/...

throw0101a · on March 30, 2021

> GNU extensions are all over the place.

On purpose and on accident: when Debian switched its /bin/sh from Bash to dash all sorts of things broke (for us) because GNUism were leaking into what should have been a standard language/feature set.

We had to tell our users: change your code or change your shebang.

pjmlp · on March 30, 2021

I used to be a Solaris fanboy (Linux was only the way I managed to have UNIX at home), and one of my regular tasks as Solaris admin was to install GNU from Solaris Packages repository due to software expectations.

rualca · on March 30, 2021

> POSIX aside, Unix-like systems that are still around nowadays and not directly use GNU user space, have copied many of the GNU extensions out of necessity.

That reads like a lack of interest to update POSIX to catch up with the de-facto sstandard

I mean, I doubt that GNU would try to block the standardization process to avoid including their extensions.

stonogo · on March 30, 2021

No, this was pretty explicitly the reason GCC was not permitted to have independent front- and back-ends like llvm does today. Stallman was concerned someone would write proprietary extensions that way. So, if you wanted GCC, you got the whole GCC, GNU extensions to C and all.

josefx · on March 30, 2021

And you only got GCC. There was someone who tried to implement refactoring support for emacs based on gcc, since there already was a llvm based implementation and that obviously wasn't GNU enough. Stallman left him hanging for a year before telling him to stop because it would expose too many gcc internals. He ended up killing the discussion when too many people disagreed with his assertion that text base search and replace was good enough.

blihp · on March 30, 2021

They pretty much did it with all of the utilities they produced... and I'm glad they did. I don't think it was nefarious, they just thought their extensions made sense at the time and/or filled gaping holes. Even in the early 90's it was painful to go from a machine with GNU stuff installed on it back to a vanilla System V system... they just felt so limited.

apotheon · on March 30, 2021

That's exactly what GNU did. GNU can take changes from software produced by others under copyfree licenses and wrap it in a copyleft license, then the projects from which it borrowed can't borrow-back the GNU changes because of license incompatibility that runs only one way.

GNU licenses are very much embrace-extend-extinguish licenses.

vanviegen · on March 30, 2021

Perhaps the code itself couldn't be used by other vendors, but the ideas/protocols/interfaces could easily have been implemented by anybody who was so inclined. No trade secrets, no patents, and no dependencies on heaps of legacy cruft were necessary to be compatible. That's an important difference with "Embrace, extend, extinguish".

apotheon · on April 1, 2021

Ideas, protocols, and (except perhaps in the post-Oracle-Java world) interfaces can still be copied. A lot of open source software actually does exactly that with ideas, protocols, and interfaces from closed source software.

What makes open source software "open source" is the source code, which is subject to copyright law. Ideas, protocols, and interfaces are not subject to copyright; only their expressions are subject to copyright.

The Embrace, Extend, Extinguish approach makes things incompatible with the way others do them to cement hold on a large user base that prevents those users from changing their minds later. Sometimes that takes the form of forcing people to reimplement everything in a cleanroom setting, because they can never keep up that way; that's where the GPL excels within the open source world for EEE. Other times, it's by diverging from norms in ways that others cannot follow without breaking their own ecosystems, which is where the GNU project has excelled with its plethora of subtle incompatibilities.

slaymaker1907 · on March 30, 2021

There are similar problems with csv. A deceptively simple format with many variants to handle various edge cases.

brazzy · on March 30, 2021

How many, though?

Off the top of my head, you need to specify text encoding, line endings, a quoting mechanism in case a field contains the separator, and an escaping mechanism in case a field contains multiple lines or the quote character.

If you have those specified and everyone follows the specification, there shouldn't be any problems, right?

Woe be unto you if you have to process input data where those are not used consistently, of course.

dathinab · on March 30, 2021

Besides inconsistencies, that a non defined quoting and escaping mechanism is already quite painful. Let's look at following single column single line csv:

'\0'

Depending on the source this might be:

- quote back-slash zero quote

- zero

- back-slash zero

- a byte with the numeric value of 0

- quote zero quote

- quote back-slash zero quote

- quote byte-with-value-0 quote

- malformed and should be ignored

Worse the same application might mix different of this quoting mechanisms for different columns or even in the same column. Normally it shouldn't but it's defintely a thing you can find in programs which don't use a proper marschalling library/code, and you know. CSV is easy so you surly don't need to add a library dependency just to use it. (sorry sarcasm).

Or with other words "because CSV is easy" it's somteimes not properly specified and sometimes no "proper" marschalling/serialization library/module is used resulting in ad-hoc fixes for quoting where needed which potentially don't follow any specification either.

disgruntledphd2 · on March 30, 2021

The worst part about CSV is the quoting. Oh the quoting. As soon as a human writes something in a plain text field, your nice CSV parsing gets much, much harder.

Salespeople enter things like Company Foo, the One (which used to be Bar). The worst part is that often these are legal names, which means you do need to store them.

When I worked for a FAANG this was one of my major annoyances as there was no canonical number for a customer, so everyone did string matching which broke whenever a company changed their name (like for instance, if they'd just gone public).

Just say no to CSV's with commas in quotes.

probably_wrong · on March 30, 2021

> If you have those specified and everyone follows the specification, there shouldn't be any problems, right?

Well, yes, but I feel that that's the point at which CSV definitely earns its "ill-specified format" hat.

Given that there is no way to extract the file specification from the file itself, you can only follow the specification if someone tells you the specification beforehand. You can even use the "wrong" specification and don't notice it because your specification overlaps with the "true" specification in the types of data you're typically seeing (looking at you, optional quotes).

dkarl · on March 30, 2021

I suspect that most of the CSV issues I've encountered were produced by ad-hoc writers and parsers embedded in applications rather than serious attempts to implement a CSV library. Behaviors I've encountered during various integrations:

- Values in most columns are escaped, but a few are not. (You have to count commas from both ends of the line and then log a warning and make a guess if there are too many commas.)

- ASCII whitespace is url-encoded (why?) but no other characters are.

- Columns without double quotes can only contain digits.

- Double commas are used as the field separator because they didn't want to implement quoting. (Some of the fields came from user input on a web page, and I prayed for them to hit a double comma and learn the error of their ways, but it never happened. I think their front end developers were sanitizing for them.)

- And my favorite, no quoting was supported at all, double quotes were left unquoted, commas were stripped from values when writing, and when reading, fields were compared against a list of mappings like "Acme Inc." -> "Acme, Inc." to restore the commas.

happymellon · on March 30, 2021

Always makes me sad that ASCII delimited was never a thing.

It's almost like they really wanted to avoid this issue back in the 70s so it is built right into ASCII and unicode.

yrro · on March 30, 2021

What I like about this is that ending your data with FS means that you can detect file truncation, yet another thing that CSV makes impossible.

I wonder how you embed US/RS/GS/FS within a field though. Prefix with ESC?

tinus_hn · on March 30, 2021

All you need to do is specify all these unspecified things and it’ll be fine.

You forgot the localized decimal separator symbol.

VMG · on March 30, 2021

if you start going down that route you will have to specify the format for dates, network addresses, postal addresses, phone numbers...

it never ends

tinus_hn · on March 30, 2021

Either you accept that CSV can’t reliably contain fractional numbers or you specify the decimal separator.

brazzy · on March 30, 2021

CSV doesn't contain numbers at all; it contains strings. How those should be parsed into numbers is not part of the format, unless by CSV you additionally mean "to be imported by Excel". Which is certainly a common use case but not always the case.

tinus_hn · on March 30, 2021

By this line of thinking, any file type is specified, because the specification of a file is a sequence of bytes.

Arguably right, practically useless.

brazzy · on March 31, 2021

The distinction between file format and payload format is perfectly normal and not at all useless.

All file formats have limits for how much they define types for the payload. Most have very limited types. JSON for example knows objects, arrays, strings, numbers, booleans and null. It's not that big a difference to have a format that has only strings.

Denvercoder9 · on March 30, 2021

> Off the top of my head, you need to specify text encoding, line endings, a quoting mechanism in case a field contains the separator, and an escaping mechanism in case a field contains multiple lines or the quote character.

You also need to specify the separator itself, which despite the name isn't always a comma, but can also be a semicolon or (less commonly) a tab.

medstrom · on March 30, 2021

The IANA standard for TSV files [1] bans tab characters inside of fields. It's a convenient assumption, so if I'm writing a quick way to sync some data to disk, I choose TSV because the functions can be simpler.

https://www.iana.org/assignments/media-types/text/tab-separa...

mFixman · on March 30, 2021

To this day I don't understand why standard table formats don't use the standard ASCII unit and record separators.

They are there, they are convenient, and they don't require any visible character sacrifices.

marcthe12 · on March 30, 2021

I gues lack of program support. But yeas. That would have been great

stinos · on March 30, 2021

How many, though?

As many as there are localized datetime formats :)

there shouldn't be any problems, right?

That would be reallys awesome, yes. With the aid of the other sibling comments you already figured by now it's not as simple, unfortunately.

_uvvk · on March 30, 2021

First, there's no specification.

Second, this: http://georgemauer.net/2017/10/07/csv-injection.html

brazzy · on March 30, 2021

Quote from the site you linked to: "So who’s fault is all of this anyways? Well it’s not the CSV format’s."

And what I wrote before is basically all you need for a specification. Yes, it sucks that there is no fixed one that everyone uses.

lstamour · on March 30, 2021

Yes, except it's easy to debug a csv problem because it's generally text in some encoding you can hopefully identify without looking at specific byte offsets in a hex editor: http://fileformats.archiveteam.org/wiki/Tape_Archive#Identif...

In fact, one of the biggest questions I've still yet to answer is whether Docker or OCI have ever clarified just which TAR format must be followed? As far as I can tell, it's a mix of "whatever library X supports"? https://www.cyphar.com/blog/post/20190121-ociv2-images-i-tar for example.

I'm not sure I want a more complicated format, but I do wish somebody would pick a new file extension for their next use of tar so we could clearly differentiate one tar format from another the next time someone builds on top of simple tar files for their binary distribution, etc.

tdeck · on March 30, 2021

The other common archive, ZIP, has a standard. But it's a hot mess! To quote Wikipedia:

> Tools that correctly read ZIP archives must scan for the end of central directory record signature, and then, as appropriate, the other, indicated, central directory records. They must not scan for entries from the top of the ZIP file, because (as previously mentioned in this section) only the central directory specifies where a file chunk starts and that it has not been deleted. Scanning could lead to false positives, as the format does not forbid other data to be between chunks, nor file data streams from containing such signatures

You might think that ZIP files would have a header, or at least a fixed-width footer, but they don't. Instead you're supposed to scan backward through the file looking for a magic number indicating you've found the central directory. It's amazing to me that this is the format that won the compression wars and stuck around to this day.

greggman3 · on March 30, 2021

ZIP is a hot mess but that particular feature is not a bad idea, it's just poorly implemented / designed.

The point was you could create a multi-disk zip (zip came out when we had 1.4meg floppy disks). If you zip up 10meg across 7 disks and you want to update the README of your 7 disks, by putting the central directory at the end you just ask the user to insert disk 7, read the central directory, append the new README, write a new central directory. With the directory at the start you'd have to re-write all 7 disks. With it at the end you only have to re-write disk 7 and only a small portion of the file.

The only bad part of that particular feature in the zip implementation is the scanning part. The variable sized underspecified "Zip comment" that you have to scan through should have either come before the central directory OR there should have been a length or pointer to directory after it so no scanning needed.

anyfoo · on March 30, 2021

Also what would have been wrong with having a fixed size trailer/footer at the end of the file, pointing back to the (possibly newly appended) central directory, and even back to the previous trailer/footer if chaining is necessary/useful? No scanning necessary, then.

silvestrov · on March 30, 2021

The PDF format does this correctly. PDF files end with:

    startxref
    12345
    %%EOF

where 12345 is the file offset of the TOC.

wilsonrocks · on March 30, 2021

Many memories of PKZip asking me to insert the last disk of a set :)

Usually when copying a game...

ungamedplayer · on March 30, 2021

And disk (total / 2 + 1) was the corrupted one, you'd only find it mid way through.

wahern · on March 30, 2021

> You might think that ZIP files would have a header, or at least a fixed-width footer

ZIP files do have per file headers, which duplicate information in the directory, such as the file name. And malicious files have used this quirk to circumvent safeguards. See, e.g., WinRAR filename spoofing exploit: https://www.rapid7.com/db/modules/exploit/windows/fileformat...

Any time metadata is duplicated in a system (specifically, an interface or protocol), alarm bells go off for me. Simply duplicating metadata can inadvertently create an explosion of failure cases and implementation complexity. Conversely, avoiding duplication of data is an example of failure-proof design--if it's not duplicated, you don't have to worry about inconsistencies and reconciliation rules.

klodolph · on March 30, 2021

> Conversely, avoiding duplication of data is an example of failure-proof design--if it's not duplicated, you don't have to worry about inconsistencies and reconciliation rules.

I can't agree with this.

The reason you deal with inconsistencies and reconciliation rules is because, despite your efforts to the contrary, the world will give you inconsistent or incomplete copies of your data from time to time.

A sure-fire way to ensure that your data is lost is to only keep one copy of it. At some place in your storage stack, you want some level of redundancy. The general idea for an archive format is that it is somewhat self-synchronizing, and somewhat possible to repair. All media has a non-zero error rate. And then there are all of the various reasons (software, human error, disk full, etc.) that a file might get accidentally truncated (which destroys the central directory in a zip file).

You may not personally have to deal with the consequences of damaged media or similar types of failures, but for those that do, it's nice to have a data format where a stray error or two doesn't render the entire archive useless. If you are building on top of "reliable" storage systems (like cloud storage providers), you're just pushing the problem to lower levels in the stack... but there is less context at lower levels in the stack, which means recovery can't be very sophisticated.

closeparen · on March 30, 2021

This seems like a tension between the needs of genuine archival storage, and packaging of multi-file artifacts for web distribution. The latter use case dominates in the “archives” I actually encounter.

klodolph · on March 30, 2021

You can create multiple formats and expect your users to understand the differences and select the correct one, but the 99% use case is non-technical users--so 99% of us are going to choose the default option. A bit of redundancy (a short signature and copy of file metadata) is a clear win here, at least for popular formats like Zip. In order for a format to BE the default option for non-technical users, it has to be reasonably good for most use cases, rather than optimized for one specific use case.

If you are writing a package manager or backup program you might make a different choice, but how often do you find yourself writing a package manager?

lifthrasiir · on March 30, 2021

> Any time metadata is duplicated in a system (specifically, an interface or protocol), alarm bells go off for me.

You might be very nervous to learn that most filesystems duplicate some metadata in one way or another. FAT, NTFS MFT, ext4 superblocks and group descriptors come to my mind. In some level you ought to have duplicates; it's a matter of where to place them. ZIP's choice made sense at the time of smaller and slower disks. (I do think that the modern standard ZIP should make some fields scrubbed to prevent compatibility problems, though.)

jandrese · on March 30, 2021

However if the metadata is duplicated you have the possibility of recovering data from a corrupted or incomplete archive.

st_goliath · on March 30, 2021

> You might think that ZIP files would have a header, or at least a fixed-width footer, but they don't. Instead you're supposed to scan backward through the file looking for a magic number indicating you've found the central directory.

Yes. ZIP has no header, it has a footer.

What gets really messy is that the footer contains a variable length comment field, with the length stored before the field, so good look finding where the footer starts.

And here we are just talking a about a correctly generated file. This also opens up the door to all kinds of intentional abuse.

viraptor · on March 30, 2021

> It's amazing to me that this is the format that won the compression wars and stuck around to this day.

The header at the end actually helped it a lot. It allowed the self-unpacking archives to exist.

diggernet · on April 1, 2021

And allowed adding a file to a zip without needing to copy the whole zip to add header space at the start. A mere optimization today, but absolutely critical when working with floppies.

tda · on March 30, 2021

So can you technically "hide" data in a ZIP file as a form of steganography? Just add it to the zip file but mark it as deleted. Then any common zip utility will skip it, but a specialized one could extract that extra information. This would be easy to miss as the size of a zip file is always different than the sum of the size of the files it contains.

And of course googling zip file steganography reveals this is not a new idea: https://aroralakshay2014.medium.com/hide-it-lest-find-it-zip...

yjftsjthsd-h · on March 30, 2021

Oh yeah, that's the trick that lets you shove a zip into other file formats and have it be valid, right?

LeoPanthera · on March 30, 2021

zip at least has a de-facto standard, which today is the "Info-ZIP" utility.

greggman3 · on March 30, 2021

not really.

The zip spec says anything can come in before the "zip data". Info-Zip gives this example

    unzip unz552x3.exe unzipsfx.exe                 // extract the DOS SFX stub
    cat unzipsfx.exe yourzip.zip > yourDOSzip.exe  // create the SFX archive
    zip -A yourDOSzip.exe                          // fix up internal offsets

http://infozip.sourceforge.net/FAQ.html#unixSFX

In other words, append any data you want, then fix the pointers in the central directory.

There are tons of various zip utilities that will mess up depending on what you put in that self extractor including MacOS's built in finder based zip support.

PKWare, the keepers of the documentation for the ZIP format refuse to specify that you must read the end of central directory. They do this by claiming you can stream a zip file. But you can't stream a zip file if you have to read the central directory because data you may have already used, "the definition of streaming", maybe not be listed in the central directory in which case you shouldn't have used it.

In other words imagine each file in the zip is a compressed script and you are going to execute them. Imagine there are 2 scripts, the first one is "rm -rf /" and the second "ls /". The central directory only points to the 2nd script but if you stream you'll read the first script an execute it.

The ZIP spec claims you can stream but clearly you can't.

chungy · on March 30, 2021

Zip also has a real bonefide standard: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT

greggman3 · on March 30, 2021

That spec is seriously underspecified

https://games.greggman.com/game/zip-rant/

chungy · on March 30, 2021

Seems kind of a shame to tuck this away in a comments section on a (mostly) unrelated article. I would suggest you submit it as a main article for HN :)

gumby · on March 30, 2021

Random trivia: gnu tar was called gnu tar before it became the base of the GNU project’s tar. Why? Because John Gilmore’s username has been gnu since the early 80s.

I still call him gnu as he of course calls me gumby.

person_of_color · on March 30, 2021

Just how rich did Sun make John Gilmore?

LeoPanthera · on March 30, 2021

This isn't wrong, but it's mildly misleading. Pax is mentioned right at the top, but glosses over that pax creates "ustar" format tar files, which is specified by POSIX.1-2001.

ustar tar files are compatible with almost all tar archivers, and are in fact well-specified.

mixologic · on March 30, 2021

In my 28+ years of working with tar files I have ran into issues exactly zero times with being able to tar or untar something.

Im sure if you're deeper down the stack this ends up being one of those nuanced peas under a mattress, but Im guessing that 99% of people who use tar do not run into issues using tar.

regularfry · on March 30, 2021

So, there is a direct effect of this that's really annoying when you bump into it. Docker images are tarballs, and the layer cache relies on how the tar file is constructed. The same layer content constructed by two different tar implementations will appear different to the cache lookup.

For reasons known only to themselves, Docker originally used different implementations for the docker command-line tool and the docker-compose tool. This means that cache entries produced by two tools that you got from the same place don't match up, and you can't use one tool to cache a layer and have the other pick it up.

At least, not by default - if you know about it, there's a switch you can flip to have one use the other's implementation. But they had to add that in, and do that extra work, because tar isn't well-specified.

rozab · on March 30, 2021

Seems like gzip would be the bigger issue there. There are millions of valid ways to compress a single file

fanf2 · on March 30, 2021

I’ve been using various flavours of unix for a similar length of time. I always installed gnu tar and gzip, and then I had no issues.

dehrmann · on March 30, 2021

I recently discovered that you can't append to a tar file quickly. It caused me a lot of headaches with incremental updates.

http://tiamat.name/blogposts/fast-appending-files-to-tar-arc...

st_goliath · on March 30, 2021

> You’d think that -r option usage forces tar application to append files to the end of the archive, getting the position of the archive’s end from archive’s index. It doesn’t. Tar format is designed in a way that it has no index.

Um, yes. A tar file is just a blob of 512 byte aligned files with 512 byte headers in front of them + an optional 1k null-bytes signifying the end. Technically all there is to appending at the end is to slap on another header and a file blob?

EDIT: Ok, on second thought, you do need a linear scan. There could be ancillary data after the 1k termination blob. Even if not an intentional polyglot nonsense, but if you actually use tar on a main frame tape drive as originally intended, simply seeking to the end doesn't work.

gumby · on March 30, 2021

Yes because the name ‘tar’ means ‘Tape ARchive’. Appending to the end is how you would add a file to a tape. In fact tar will silently overwrite an existing file because that is how you would update a file on a tape (that functionality is arguably useful today for other reasons)

jopsen · on March 30, 2021

Have you tried rolling a custom tar implementation?

That's when you realize how many cases you actually have to handle.

josefx · on March 30, 2021

Did a quick and dirty implementation to read a file from Java. Immediately ran into the "ustar " vs. "ustar" magic strings used to differentiate between format versions as I generated files on several different systems. That lead to some head scratching because I couldn't comprehend why there would be a difference on that last whitespace.

brongondwana · on March 30, 2021

Yeah, maybe... but I wrote https://metacpan.org/pod/Archive::Tar::Stream without too much trouble and it's been happily managing all our backups for over 10 years since (it didn't get released to CPAN straight away).

It helps that we're only ever appending gzipped tar fragments, and we have a tracking database format to know where the offsets in the uncompressed stream are:

https://fastmail.blog/2015/12/21/building-a-backup-system-fo...

But yeah - much worse if you're parsing arbitrary tar files than if you create everything yourself and just have to make it readable by gnu tar and that's about it.

mongol · on March 30, 2021

Sqlite has an archive format that is worthwhile to be aware of

https://sqlite.org/sqlar.html

slaymaker1907 · on March 30, 2021

It unfortunately has major limitations on file size for use cases where that matters. Max file size is on the order of 2^32 bytes since files are stored in a single SQLite blob. Note that this limit is only for input file size. The archive file can basically be of any arbitrary size.

Personally I wish sqlar didn't have this limitation since SQLite is awesome for tooling.

dragonwriter · on March 30, 2021

It seems pretty easy to build big files variant of sqlar; instead of a single sqlar table, have index and filedata tables; the index has all the elements of the normal sqlar table except data, and adds a counter for the number of chunks the file is stored in (a directory is zero chunks, an empty file is zero-length with one, empty, chunk), the filedata has composite PK consisting of an FK to the index and an integer sequence number in the file, and a data blob. The code changes to store/extract from it should be straightforward, and given the size of SQLite integers it would in theory structurally handle file sizes (as stored, which may be compressed) of up to 2^63-1 bytes; but the SQLite 281 TB database size limit would be reached well before this.

xurukefi · on March 30, 2021

Whilst we're at it: does anybody know what's the deal with the weird octal ASCII representation in the tar header? To store a file size of 10 bytes, all common tar implementations typically put literally the string "0000012" into the header, i.e., the size in octal in ASCII representation. I stumbled upon this when I had to write a parser because I wanted to rename files inside a tar without extracting the files first (turns out there is no tool for that). I found this format quite peculiar because I don't see any benefit of doing it this way.

st_goliath · on March 30, 2021

Tar was in Unix very early on. The first versions of Unix were IIRC written for a PDP-8 and later ported to a PDP-11.

Most of the PDP era DEC machines are designed favoring an octal encoding. The PDP-8 was a 12 bit machine (12=4 * 3). They also had few 18 (=6 * 3) and 36 bit (=12 * 3) machines.

The PDP-11 being a 16 bit machine with 8 bit granular memory access was an odd man out, but still had many aspects designed around splitting words into groups of 3 bits, e.g. if you look at the instruction set encoding. Or just take a look at the front panel[1].

Because of the DEC machines that early Unix was developed on (and the people involved in early Unix development being very familiar with them), octal encoding crept into many places where it stayed until today, including the C programing language and some Unix specific file formats.

Using octal, stored as plain text ASCII makes it very easy for a human to debug, for whom reading octal is second hand nature.

[1] https://en.wikipedia.org/wiki/File:Pdp-11-70-panel.jpg

xurukefi · on March 30, 2021

Very interesting, thx!

gorgoiler · on March 30, 2021

It’s a shame you can’t cat tar archives together. It always seemed a neat feature of container formats like mpeg:

    cat chapter*.mpg >movie.mpg

Are there any file archive formats that support this?

hddqsb · on March 30, 2021

For some reason I thought this was a standard feature of tar, and was surprised when it didn't work.

It turns out that GNU tar supports extraction of concatenated tarballs, but it requires the --ignore-zeros (-i) option:

> Normally, tar stops reading when it encounters a block of zeros between file entries (which usually indicates the end of the archive). ‘--ignore-zeros’ (‘-i’) allows tar to completely read an archive which contains a block of zeros before the end (i.e., a damaged archive, or one that was created by concatenating several archives together). -- https://www.gnu.org/software/tar/manual/html_node/Ignore-Zer...

(As a bonus, this also works for concatenated .tar.gz files because gzip supports concatenation.)

Regarding other formats that support concatenation, the Linux kernel's initramfs format is interesting because it is based on CPIO but is explicitly defined to be the concatenation of CPIO archives: https://www.kernel.org/doc/html/latest/driver-api/early-user...

CPIO itself doesn't support concatenation. There is a neat hack to extract such concatenated archives:

    while cpio -i; do :; done < archive.cpio

(source: https://unix.stackexchange.com/a/266090; also mentioned there is the tool skipcpio, which is part of dracut: https://github.com/dracutdevs/dracut/blob/master/skipcpio/sk...)

This hack would also work for extracting concatenating tarballs (without GNU tar's --ignore-zeros option). One annoyance is that it shows a warning when it gets to the end of the archive.

LeoPanthera · on March 30, 2021

You can cat .mpg files together, but you should not. ".mpg" is supposed to refer to an MPEG Program Stream, which does in fact have a header. You can cat Transport Streams, but these are supposed to have the ".ts" extension.

You sometimes get transports streams named ".mpg", which complicates everything.

bestouff · on March 30, 2021

I also have written my own tar parser, in https://github.com/bestouff/genext2fs - oh boy it's not easy ! My parser is of course incomplete and won't handle corner cases, which are really plenty. In the end I added libarchive as an alternative to my hand-rolled code because that's how it works best.

chungy · on March 30, 2021

Archive link: https://web.archive.org/web/20210329212251/https://invisible...

thomashabets2 · on March 31, 2021

cpio is pretty shitty too. After reading the manpage I found a pretty serious bug: https://nvd.nist.gov/vuln/detail/CVE-2019-14866

https://blog.habets.se/2019/11/CVE-2019-14866-gnu-cpio.html

geokon · on March 30, 2021

... is there an accepted better alternative?

yowlingcat · on March 30, 2021

Really? And all this time, I was hoping that it wouldn't end up being a tarpit.

ceratin6 · on March 30, 2021

It’s all these modern smokers with their low-tar cigarettes.

Tar is tar. Let it be.

tar -czvf if you’re gzzy.

tar+gzip beats zip for most uses. I don’t know about whether 7z beats tar+gz, but I don’t think 7z is as ubiquitous, though that may not be true now.

xt00 · on March 30, 2021

And in case you didn't already know, there is a parallel implementation of gzip called "pigz" that functions just like gzip, but is super fast on a machine with lots of cores.. like many modern PC's! Say for example you have a 32 core amd machine, the speed up is awesome using pigz.

yjftsjthsd-h · on March 30, 2021

tar also wins by making it trivial to swap the compression; these days I personally favor .tar.xz or .tar.zstd over gzip.

adgjlsfhk1 · on March 30, 2021

It's a shame zstd is Facebook, because it's really awesome.

Y_Y · on March 30, 2021

Facebook produces tons of great FOSS, it's paid for with very dirty money, but I don't consider it tainted.

rakoo · on March 30, 2021

And Zstd was created before the creator joined Facebook

adrianN · on March 30, 2021

zstd is BSD 3-clause.

QuercusMax · on March 30, 2021

7zip always seemed really shady, almost intentionally so compared to zip and tar+gz. Always gave me the "leet haxor" vibe. 7zip makes me feel like I need to wash my hands.

user-the-name · on March 30, 2021

The 7z format is also a massive, MASSIVE over-engineered mess, that is basically impossible to actually implement full support for.

_abox · on March 30, 2021

7zip was a really old format for compatibility with 7-bit data streams. I used it all the time with packet radio in the early 90s. A bit like zipping and then base64 encoding does.

But I don't know how it got to the wares/hacker scene from there. I assume they will also have dropped the 7 bit thing as it would reduce efficiency by 1/8th on modern systems.

lstamour · on March 30, 2021

Edit: as another commenter rightly points out, it's likely unrelated. Here's the changelog: https://www.7-zip.org/history.txt That said the first entry at the bottom dates back to 2.00, so it's possible there was a 1.0 that predates this changelog:

2.00 Beta 1 1999-01-02 - Original beta version.

I'm pretty sure its popularity increased because it was one of the few options on Windows that was completely free to install with no nag screens, ever. It had a decent right-click menu; excellent usability for its time. And you wouldn't be tempted to use an older copy such as whatever shipped with Windows XP to extract archives. Plus it offered compatibility with all other formats including its own, and eventually apps like Firefox publicly shipped with 7zip Self-Extractors and such. And if you wanted to test whether 7zip really was more efficient, you could quickly try compressing files using different options - in my personal testing, 7zip often won back then. Not sure if results have changed recently given there are a few new compression techniques these days.

Kwpolska · on March 30, 2021

That's likely a different 7zip. The 7z format was invented in 1999, and the Wikipedia page [0] doesn't say anything about 7-bit compatibility. [0] https://en.wikipedia.org/wiki/7z

totetsu · on March 30, 2021

You don't even need the - tar is so tar

qwerty456127 · on March 30, 2021

Tar should have been abandoned a decade ago, in favour of 7z or something like that. It baffles me it's 2021 an we still have to extract an entire archive just to look at the list of files it contains, let alone get a particular file. I've never heard a valid reason to use tar on today servers (let alone PCs) other than "it's guaranteed to be available everywhere".

cestith · on March 30, 2021

> we still have to extract an entire archive just to look at the list of files

GNU tar has -t and --list, while NetBSD, OpenBSD, and FreeBSD all have -t. None of them extract the archive to list the contents as far as I am aware.

fl0wenol · on March 30, 2021

I think they mean that you have to read in the whole archive (incl. decompression) to see all the files; that there's no option for an index that can be immediately seeked.

cestith · on March 30, 2021

Since tar is a streaming format originally intended for tape and compression is done via external programs against the full archive, it's a little difficult to look inside the archive if it's compressed. Other formats accomplish this by compressing individual archive members inline with building the archive then building a TOC that's not compressed. That's all implementation details.

As a user, I can use tar and an outside compression tool together in one command to see the file list.

    tar -Jtf foo.tar.bzip
    tar -xtf example.tar.gz
    tar -I /usr/bin/mydecompressor -tf somethingelse.tar.myc

So unless there's some huge, huge archive out there that takes forever to decompress and read it's not really an issue for the user.

qwerty456127 · on March 30, 2021

> Since tar is a streaming format originally intended for tape

So why are we still using it if ~99% of Linux users and administrators have never seen a digital tape drive? I've built quite a number of Linux, BSD and Windows servers yet the only tape with digital data I have seen in my entire life was for a Spectrum. Do we really need our files stored in a way compatible with tape storage?

> As a user, I can use tar and an outside compression tool together in one command to see the file list.

Which would decompress the whole archive, the seek through it. Ridiculous waste with no benefit.

Tar is the ultimate idiosyncrasy of of Linux.

cestith · on March 31, 2021

> ~99% of Linux users and administrators have never seen a digital tape drive

I would question this assertion. Some of us have been doing this a while, you know.

> Do we really need

Probably not.

> no benefit

Benefits include being able to use the same dictionary across the whole archive, being able to leverage multiple improvements in multiple outside compression programs, backwards compatibility going back decades, and the ability to use a tool that's standard not just on Linux but BSD, macOS, Solaris, and any POSIX compliant system.

> ultimate idiosyncrasy of of Linux

Are you certain that's not systemd? Or the way ptys are handled? Or its sound and video systems? Or the /proc file system? Or devd? Maybe the multiple families of different incompatible package managers? I mean tar isn't originally from Linux or even GNU. It sure beat shar files.

If you want your files inside your archive compressed before being added to the file, you can do that too with tar, but other tools do it for you. You're absolutely free to use those other tools.

qwerty456127 · on April 1, 2021

> Are you certain that's not ...

Yes, because that's the only famous oddity which is actually annoying (sound and video systems just work and do everything a user might want of them) and has no real reason to exist on a today computer.

> If you want your files inside your archive compressed before being added to the file

No, I want to see a list of files inside without decompressing anything and without reading the whole archive.

Being able to extract a particular file could be a very nice bonus - using the same dictionary for all of them doesn't require making this impossible.

ceratin6 · on March 30, 2021

Fuck you to whoever thinks it’s ok to diss on tar.

Your mom was ill-specified.

dang · on March 31, 2021

I don't think this comment should have been flagged. It's obviously not literal.

However, could you please stop creating accounts for every few comments you post? We ban accounts that do that. This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.

You needn't use your real name, of course, but for HN to be a community, users need some identity for other users to relate to. Otherwise we may as well have no usernames and no community, and that would be a different kind of forum. https://hn.algolia.com/?sort=byDate&dateRange=all&type=comme...

Wowfunhappy · on March 31, 2021

> I don't think this comment should have been flagged. It's obviously not literal.

...it’s not at all obvious to me...

dang · on March 31, 2021

It's absurd to get that worked up about tar.