The .zip file specification is flawed

jplasmeier · on Dec 13, 2016

This strikes me of an example of a sort of dualism that exists within software. The .zip spec is flawed, and can be broken in all sorts of different ways. However, after years and years of using computers, I've never had an issue with extracting or using a .zip file.

I wonder if this same pattern can be applied to physical engineering disciplines- e.g. a structural engineer assessing a bridge and finding numerous faults in the design, despite traffic still using the bridge as normal...

asmithmd1 · on Dec 14, 2016

Yes, this happened with a building built in NYC in the late 1970's:

http://www.slate.com/blogs/the_eye/2014/04/17/the_citicorp_t...

An undergraduate doing a class project on the building uncovered the flaw.

mmahemoff · on Dec 14, 2016

"It was only after seeing the documentary that she began to learn about the impact that her undergraduate thesis had on the fate of Manhattan."

Amazing story.

taneq · on Dec 14, 2016

I believe this is an example of the old adage, "In theory, there's no difference between theory and practice. In practice, however, there is."

In theory, the .zip spec is broken. In practice, it's the most reliable format for transferring a group of files. (And don't even get me started on JPEG, where iirc the file format wasn't even specified until after JPEG files had been popular for years.)

> I wonder if this same pattern can be applied to physical engineering disciplines- e.g. a structural engineer assessing a bridge and finding numerous faults in the design, despite traffic still using the bridge as normal...

I would be amazed if this weren't the case. I know I've encountered a few cases where mechanical engineers had cocked up the design but the resulting machine still managed to limp along and mostly perform its function.

netheril96 · on Dec 14, 2016

> In practice, it's the most reliable format for transferring a group of files.

For those who speak only English, yes.

The encoding of filenames have always been a mess for zip files.

taneq · on Dec 14, 2016

True, that's a bit of a blind spot for me, being an only-English-speaker.

niftich · on Dec 13, 2016

Most out-in-the-wild usage of zip files fits the 'Robustness Principle', or, after its author, Postel's Law, found in RFC 1122 [1]:

> Be liberal in what you accept, and conservative in what you send

There are several analyses of this maxim and whether it's a good choice for designing robust, secure systems. This particular Internet Draft doesn't agree [2].

[1] https://tools.ietf.org/html/rfc1122#section-1.2.2 [2] https://tools.ietf.org/html/draft-thomson-postel-was-wrong-0...

user5994461 · on Dec 13, 2016

For user interfaces: Be liberal in what you accept from the user.

For M2M (machine to machine): Be strict in what you accept, and strict in what you send.

That's my take on it now.

For instance, a user will copy/paste URLs to his browser, there will probably be a space too much before or after. It's okay to clean it (and a better experience than a "site not found").

When a machine sends something "weird". Well, it's not possible to know if it's really wrong and it can't be corrected in any meaningful way. Just fail and throw an error so the developer can fix it.

taeric · on Dec 14, 2016

No. You should always be as liberal on what you accept as you can. Primarily because as strict as you try to be in what you send, you are likely to make mistakes. Very very likely.

Obviously, your field of work dictates a lot here. And, if you are accepting something that has severe consequences on acting out, then yes, be more strict. However, the general principal holds. In general.

user5994461 · on Dec 14, 2016

> Primarily because as strict as you try to be in what you send, you are likely to make mistakes. Very very likely.

What you call a human mistake should be called a bug. The only acceptable way to handle a bug is to fix the code.

Not to be dismissive but the way you think is a classic beginner mistake. It's not the responsibility of other software to guess what bugs you'll put in yours. ^^

> Obviously, your field of work dictates a lot here.

large distributed systems, financial exchanges, trading systems, national government projects, aerospace, and even web stuff at times.

Some fields have low standards. That doesn't mean that the good practises don't apply, it just mean that people don't apply them ;)

taeric · on Dec 14, 2016

I conceded it is a bug. I merely claimed you will make some. Because, you will. Period.

You also have an odd misconception. It isn't having lower standards to be liberal in what you accept. It is actually harder. Much harder. And you should do it.

Easy example from finance. You shouldn't just take one currency. You should take in as many currencies as you can understand.

Does this mean to be magical? No. But you should ask, "how many different ways could this be given to me?" And you should instrument this with a marker for "unexpected."

user5994461 · on Dec 14, 2016

> You shouldn't just take one currency. You should take in as many currencies as you can understand.

This is a business feature request.

The "Being liberal about what you accept" is a technical guideline for protocol/format design and input processing/sanitation. Don't apply that rule to feature requests, it's not meant for that :(

taeric · on Dec 14, 2016

I disagree. It is meant in the "if you can understand it, you should act like you do." This is true in life as much as it is in programming.

It is fun to be pedantic with kids about "may I" versus "can I?" However, both work and if something can be understood, it should be understood.

So, should you build in all currencies? Not necessarily. However, you should definitely consider tagged inputs immediately, with the ability to extend the tags at some point. I'm not talking about big bang feature development, but ultimately, useful tools will have a plethora of ways they take in inputs. Because that is useful. And no design constraint should preclude this growth.

manarth · on Dec 14, 2016

  > "if you can understand it, you should act like you do"

For an unanticipated event, it's often less "if you can understand it", than "if you believe you can correctly guess the intention of an ambiguous request".

HTML is a great example of that ambiguity. Take this example markup:

  <h2 title="The <a href="https://en.wikipedia.org/wiki/Fall_of_the_Western_Roman_Empire">Fall of Rome</a>" class="roman_empire">The <a href="https://en.wikipedia.org/wiki/Fall_of_the_Western_Roman_Empire">Fall of Rome</a></h2>

This isn't too naive a markup error, and it's easy to see how a simple error in a CMS template could cause this output. So how should a parser treat it? It's non-conformant to the standard, but should it totally reject it? Recognise that the opening-bracket of the a tag in the title attribute is the start of the incorrect markup, and strip tags within that attribute? How should it appropriately recognise the misuse of double-quotes within a property attribute, that - according to the spec - should close the attribute?

There is some merit in strict behaviour, rejecting non-conformant input. Especially when it comes to computer-to-computer API design, where rejecting non-conformant input may be better than attempting to interpret ambiguous input. Limitations can be an arrow to direct us back on the best path.

taeric · on Dec 14, 2016

I get your point. It is very appealing. I do note that HTML is a definitive example where the strict version lost. Hard.

My argument here would be to do your best to not create ambiguity in what can be fed to you. The currency one is a great example. Never ever make something that takes in untagged integers as a value. It could mean too many things and you would have no method of knowing what was intended. But, once you start accepting strings, "$4" or "4 dollars", or "Four dollars" should all be on the table. (Though, yes, in this case probably best to stick with "4 USD" and complain about "dollars" being ambiguous without another qualifier.)

wolrah · on Dec 14, 2016

I agree with those positions, but what about the middle ground? Browsers for example, taking machine-to-machine data which sometimes is human generated.

I'm a fan of the strict solution as far as that goes, but there's a reason XHTML failed. Somehow asking people who write web sites to do it right or else it doesn't work is a big deal.

taneq · on Dec 14, 2016

The best approach seems to be to be strict in what you accept, and strict in what you emit, and if you want your software to be forgiving to the vagaries of human input, then strictly adhere to a forgiving spec.

chrismorgan · on Dec 16, 2016

If the web had been strict from the start, we would never have had a problem: people would never have been publishing invalid documents because browsers would have been rejecting them outright. XHTML failed because HTML existed and was good enough.

user5994461 · on Dec 14, 2016

Well. How to say that without being dismissive of the average web hobbyist...

Well, when the job is to accept a hugely complex flexible poorly defined input format from 30 years ago that is written by millions of people who have no clue what they are doing, fixing common errors and getting anything to render is part of the spec ^^

And... oh wait!

I said "Be liberal in what you accept from the user" when it comes to user input. Web pages are user input, so yeah, browsers are no exception to the rules, they're actually a perfect example! :D

Too · on Dec 14, 2016

The difference is a user interface translates the non strict version to the strict version instantly once you click the submit button. A hand written html file is stored persistently, retaining the flaw for all future.

flukus · on Dec 14, 2016

I wonder how many security exploits this principle is responsible for?

vacri · on Dec 14, 2016

As an aside, a couple of months ago I found out that the PHP ISO8601 timestamp format is not 8601 compliant. If you want an ISO8601 timestamp, you have to use a different format.

I'm not sure how this fits in with Postel's law, but your comment jogged my memory of it :)

http://php.net/manual/en/class.datetime.php

Terr_ · on Dec 14, 2016

MySQL utf8 isn't utf8 either :p

TAForObvReasons · on Dec 13, 2016

CSV is a great example of this phenomenon. There is a "spec" RFC4180 and there are tools that generate CSV files that do not technically conform to the spec. One such tool is Excel. For most users, Excel is doing the right thing. Blaming Excel for not handling CSV files according to the spec is passing the buck. The CSV tools that are worth using generally take great pains to work with Excel files at the cost of some ideological purity.

IMHO it's a reflection of the software developers involved. The best tools, the ones we turn to time and time again, generally just work.

niftich · on Dec 13, 2016

In fairness, usage of CSV-like formats pre-dates the CSV RFC by almost 30 years, which was authored in 2005 specifically to try to formalize a de-facto spec:

> Surprisingly, while this format is very common, it has never been formally documented. [1]

[1] https://tools.ietf.org/html/rfc4180#section-1

fnord123 · on Dec 13, 2016

> For most users, Excel is doing the right thing. Blaming Excel for not handling CSV files according to the spec is passing the buck.

No. Excel is wrong when it comes to CSV.

Paste a Unicode string into Excel. e.g. Beijing in Simplified Chinese (北京市). Now Save As Windows CSV as beijing.csv. Close the file. Open beijing.csv. The cell now reads `___` (on Excel for Mac 2011 - maybe they deigned to fix it).

Excel just outputs bad data.

jefffan241 · on Dec 13, 2016

I don't know how you do it in excel but if you generate a csv with UT8 data you can add a byte order mark[0] as the first byte and it will render correctly.

Once you add that, excel will open the file with utf8 encoding (if you use the utf8 byte order mark obviously). I haven't tried with other utf-* encodings.

Again don't know how to tell excel how to add that though :/ I've only had to deal with arabic in generated csv's.

[0] https://en.wikipedia.org/wiki/Byte_order_mark

mjevans · on Dec 13, 2016

Byte order marks are wrong and not part of 'clean' 8-bit handling.

All applications and operating systems should assume files WITHOUT a BOM are either ASCII, or the superset there of, UTF-8.

Please remember to say WHY you disagree if you do.

jefffan241 · on Dec 14, 2016

I wasn't disagreeing with anything was just saying how I handle it with generated csv's. I do not disagree in the slightest that Byte order marks are wrong and not part of clean utf 8-bit handling. BUT, when you have a client and you need to generate a csv for them with characters that are only valid in utf-8 and that client's program will only open the file correctly if you add the Byte order mark, then you add the byte order mark.

joncrocks · on Dec 14, 2016

Which, ASCII or UTF-8?

If you have a file without a BOM, you have to pick one.

As every 8 bit combination is an ASCII character of some kind, you can interpret every UTF-8 character as a combination of ASCII characters. And what you output will be different to what was input (unless you restrict yourself to single byte UTF-9 characters).

Without some other way of indicating the encoding format of a file, a BOM can be a tool to indicate "It's probably encoded using UTF-X".

caf · on Dec 14, 2016

What? No. ASCII is a 7-bit encoding: only bytes with the top-bit zero are valid ASCII, and all of those bytes represent exactly the same character in UTF-8. UTF-8 is a strict superset of ASCII and this is not by accident.

jcranmer · on Dec 14, 2016

UTF-8 has many nice properties. One of the nicest is that most random binary strings are not valid UTF-8. In fact, the structure of UTF-8 strings is such that, if a file parses as UTF-8 without error, then it is almost certainly UTF-8.

If it's merely ASCII, it doesn't matter. Nearly every charset contains all valid ASCII texts as a strict subset. UTF-7, UTF-16, UTF-32, and EBCDIC are the major counterexamples, and UTF-7 and EBCDIC aren't going to come up unless you're actually expecting them to. (Technically, ISO-2022 charsets can introduce non-ASCII characters without use of high bit characters, since they switch modes using the ASCII ESC character as part of the sequence. In practice, ESC isn't going to come up in most ASCII text and ISO-2022-JP (the only one still in major use) will frequently use 8-bit characters anyways).

The only useful purpose of a BOM is to distinguish between UTF-16LE and UTF-16BE, and even then it's discouraged in favor of actually maintaining labels (or not storing in UTF-16 in the first place). You can detect UTF-8 in practice without a BOM quite easily, and it's only Microsoft who feels obliged to need them.

LukeShu · on Dec 14, 2016

As caf said, ASCII is a 7-bit encoding.

However, the question "which?" can still apply. There are many encodings that are a superset of ASCII. UTF-8 is a superset of ASCII, but so are ISO-8859-X (for any "X"), Windows-1252, and many others.

joncrocks · on Dec 14, 2016

Gah, yeah, you're right, I was thinking of CP-1252.

When I've had problems in the past with this it's been around windows machines, which love their own encoding formats.

tedunangst · on Dec 13, 2016

Excel for Mac can't process BOM correctly.

jefffan241 · on Dec 14, 2016

I did not know that. I'm on Linux and the client is on windows so only had experience with it on those 2 OS's. Good to know.

alogray · on Dec 14, 2016

IIRC, it was not fixed at release for Excel for Mac 2016 but there is now an option to Save As a "CSV UTF-8 (Comma Separated)".

rspeer · on Dec 14, 2016

That's amazing. Finally.

I don't suppose I should get my hopes up too much that that option is going to be more prominent than the terrible default of saving as Mac OS Roman, right? (Whoever decided that Excel on OS X should export CSVs in an obsolete encoding for Mac OS Classic must have been trying to hurt Mac users.)

fnord123 · on Dec 14, 2016

I tried it out on Office 2016/365/15.25 whatever on OS X. It works when writing data out. That's great news! And they also put it high up in the possible formats to output.

Unfortunately, to open the UTF-8 file again, it doesn't work ("åŒ—äº¬å¸‚"). You have to make a new workbook and then import the csv in, specifying UTF-8 csv. It's pretty messed up but at least it's possible now.

mnarayan01 · on Dec 14, 2016

You can make it work with Unicode (see e.g https://gist.github.com/mnarayan01/61bedce2b55e258d8f8c), it's just an enormous PITA.

manarth · on Dec 13, 2016

  The .zip spec is flawed, but…
  I've never had an issue with extracting or using a .zip file

I'd suggest this is thanks to most software following (to some extent) the robustness principal: be conservative in what you do, be liberal in what you accept.

Most of us will typically encounter fairly well packaged, conforming zip files. Occasionally we may come across something unanticipated - like this example, where HTML content is accidentally appended to the end of a zip file - and I suspect this is where we will find ambiguous behaviour: some package tools may crash, others might "extract" it as though it were content, others might ignore it.

It's this area of ambiguity that lends itself to vulnerabilities and attacks.

Re: physical engineering, I'd recommend a great book: "To Engineer Is Human". It talks about the evolution of engineering, which is a surprising amount of trial and error, with emphasis on the error.

mohaine · on Dec 14, 2016

Sorta like how your keys are always in the last place you look. Everything before good enough is an error. Good enough is good enough. This is engineering.

xg15 · on Dec 14, 2016

This feels kind of like concluding that the y2k bug was overblown because in the end nothing happened - and ignoring the reasons why nothing happened.

Those dualisms can usually be resolved if you realize the vast and complex efforts that go into working around all the spec bugs - in this case, the various "find the magic number" heuristics.

amelius · on Dec 13, 2016

Well, you have to keep in mind that mistakes like this could turn into exploits.

optionalparens · on Dec 14, 2016

I have indeed encountered many broken zip files. In truth it doesn't happen to me much anymore, bit it sure did in the earlier days of pkunzip/pkzip, and in general during the BBS days. Sometimes this was just due to all sorts of weird things going on with transmitting files over networks and modems. Mostly you could prevent these things with CRC checks or any other verification method, but that doesn't mean implementations checked this or that the check itself didn't just fail.

A few things that I've hit in the wild that screwed up zips:

* Viruses/Worms that were rather primitive and start adding weird bytes in all kinds of places in all kinds of files

* Incomplete transmission over a modem. Depending on the protocol, you might even have most of the data, so you could read part of zip header, but not the actual archive or vice-versa. Normally, you knew right away the file was incomplete with a CRC check so not the worst problem, but the check itself was slow on old hardware.

* Weird things that added metadata and screwed with byte order, ends of file marks, etc. For instance there were some early attempts in the BBS scene to add metadata formats similar to what ID3 is to mp3s. Sometimes the writers would ruin the original file. I hit a few cases of strange attempts at steganography with software pirates trying to be "3l337" or whatever.

* Tools people tried to use to fix broken zips, that didn't quite fix them how they thought.

* Not really the fault of zip, but I've seen people rename a zip's extension to another archive format, causing the unarchiver to assume the format based on the extension. Never trust an extension if you can help it. (ex: rename a zip to rar, arc, lzz/lha, tar whatever)

* Floppy and hard disk repair programs. Sometimes these things would end up corrupting the bytes of files instead of fixing them when they tried to be more clever than moving things around. Sometimes moving things around also would result in things being ordered wrong for whatever reason in these programs. Some of the DOS Norton/Fastback/etc. ilk were especially frequent offenders.

user5994461 · on Dec 14, 2016

> I wonder if this same pattern can be applied to physical engineering disciplines

Oh yes. Basically, EVERYTHING is flawed in INFINITE ways.

Or to put it differently, perfection doesn't exist. Luckily for us, the barrier for "being practical and useful" is a lot lower than perfection =)

teaearlgraycold · on Dec 13, 2016

Didn't expect to find you on hackernews, plaz

arh68 · on Dec 13, 2016

I don't understand why this program should fail, though, if other tools can unzip the file just fine. I specifically wonder about this part of the code [0]:

    if (commentLength !== expectedCommentLength) {
      return callback(new Error("invalid comment length. expected: " + expectedCommentLength + ". found: " + commentLength));
    }

Why not just determine this isn't the record, keep looking, and handle the error after the loop?

    if (commentLength !== expectedCommentLength) { continue }

[0] https://github.com/thejoshwolfe/yauzl/blob/master/index.js#L...

x1798DE · on Dec 13, 2016

I think the argument would be that if you write a non-standards-compliant unzip utility, in many cases people generating malformed zip files won't get any warning because the software "knows what you meant to do", then people emitting malformed zip files with a random amount of data at the end won't know that they're doing anything wrong until one day they hit the limit of what is acceptable, way down the road and after they've been emitting malformed zip files for ages.

I think it's the programming equivalent of not telling someone they have some food on their face - yeah, it's no skin off your back if you don't tell them, but who knows who they're going to run into later.

That said, a hybrid approach which fails by default but then offers a way to "force unzip" the file in a non-standards-compliant way might be the best of both worlds here, because the users get an immediate way to fix the problem and the emitters are likely to be notified that they are creating malformed zips.

TAForObvReasons · on Dec 13, 2016

The problem is that the ZIP format is not a full-file specification: it permits data before and after the logical extents of the ZIP archive. Among other things, it allows for self-extracting archives by prepending the extractor code. the library in this case is erroneously assuming that the archive spans the entire file.

jonahrd · on Dec 14, 2016

I think you're missing arh68's point. If it's supposed to find the magic number, and by definition the comment is allowed to contain the magic number, then this section of the code should (correctly) diagnose that this is not the magic number, but (incorrect in the current implementation) then continue to search backward for the REAL magic number.

Drdrdrq · on Dec 14, 2016

The way I understood it magic number is not allowed in the comment.

bonzini · on Dec 13, 2016

This would not fix the problem. The issue is that there is extra content after the comment (for what we know there could be no comment at all, actually). The library expected the comment to end right at the end of the file.

grkvlt · on Dec 15, 2016

Reading that code and seeing the multi-disk (non) support [0] brought back awful memories of .zip files spanning ten or twenty disks which you would feed in, one by one, until you god the dreaded 'zip file corrupted' error on disk N-1, almost inevitably. Meaning you had to try and re-download the data again, over your trusty 19K2 link...

[0] https://github.com/thejoshwolfe/yauzl/blob/master/index.js#L...

twr · on Dec 14, 2016

I almost wrote a zip parser. After studying the format a bit, I gave up that idea.

- A valid zip does not begin with, as you would normally expect, a magic number.

- A valid zip can contain arbitrary prepended data.

- A valid zip can contain sections with no identifier value.

- A valid zip can contain arbitrary data between sections.

- A valid zip is validated starting with a tail section located at the end.

- A valid zip can have a valid tail section that contains a valid tail section.

- A valid zip can contain arbitrary appended data.

I don't get it: Why design a format that's so hard to parse? Implementing a single-pass streaming parser is impossible. It should be a basic requirement for most file formats. /usr/bin/unzip cannot even extract from standard input. I'm sure the implementer didn't feel like receiving user complaints about exhausted memory.

badsectoracula · on Dec 14, 2016

Because the format was originally written by PKWARE for their PKZIP and PKUNZIP programs for DOS in the 80s and used for backup purposes, among others (being able to span multiple floppy disks was an often used feature).

To write the directory at the beginning they'd need to keep all the data in memory or do some extra postprocessing that would make it prohibitively slow on a 4.7MHz machine with a 20MD hard disk and a slow 360K floppy disk drive. So the directory goes at the end.

The files do not begin with a magic number because ZIP files can be embedded in other files - most commonly executable files for the self-extracting feature of PKZIP that placed the ZIP file right after the decompressor. This allowed the executable to be used both by itself and by PKZIP.

As for the data after the magic number, it was used for the comment which was most likely added at a later point and they decided to put it after the magic number so that it remains compatible with existing decompressors. In the 80s and early 90s people didn't had Internet to get the latest version and a lot of old versions of PKUNZIP were floating around for many years.

dkonofalski · on Dec 14, 2016

To further add on to that, I would frequently run into programs (and sometimes games) that would span multiple floppies where almost the entire first part of the archive was just installer data. Sometimes, the actual .zip file data would start on the tail end of the first disk and then span however many disks were needed while other times the first disk was just the installer and then the .zip data was on the other disks. In either case, PKZIP and PKUNZIP would be able to read that file but only in the 2nd case would they be able to read the file without that first disk.

Things got further complicated when RARs made the scene because I remember that some of the early .RAR apps offered their own implementations of .zip support that would create multi-span .zips where the installer data was on the first disk, the .zip data was on the other disks, but the container spanned all the disks.

This meant that you could potentially have 3 different ways of dealing with the same data and there was no safe way to assume which method was used:

1. Installer on disk 1, multi-spanned .zip on disks 2-n 2. Multi-span .zip on disks 1-n 3. Multi-span .zip on disks 1-n but installer data only on disk 1

twr · on Dec 14, 2016

That is an excellent explanation, thank you.

pixelglow · on Dec 14, 2016

Mark Adler, he of zlib/gzip/Info-Zip fame, seems to think that zips cannot contain arbitrary data before and between individual files.

http://stackoverflow.com/a/12393597/60910

Therefore the straightforward way to parse a zip file is to proceed from the beginning and parse out each file sequentially. The End of Central Directory record is then only a redundant convenience to avoid sequentially scanning files e.g. in large zips for random access.

twr · on Dec 14, 2016

Interesting. It makes sense to be strict when parsing stream input. Skipping the redundant central directory section hadn't even occurred to me. Bonus points for eliminating the stupid comment confusion dilemma!

On second thought, not entirely redundant, as the central directory does contain the file permissions. But those can be parsed and set after file extraction, without increasing the overall memory complexity.

ztravis · on Dec 13, 2016

Having browsed through a few different zip parsers in the course of writing my own, most handle a range of malformed/modified input, including prepended/appended garbage, filename encodings (CP437, Unicode "extra field" flags), etc. I found this set of test files useful: https://github.com/Stuk/jszip/tree/master/test/ref

It is annoying that support for these non-standard zips has become de-facto standard (you don't want your tool to fail where others succeed)...

optionalparens · on Dec 14, 2016

Most specs I encounter have seriously flawed implementations in the wild. This is regardless of the quality of the spec.

Here's a short-list off the top of my head of tech with specs and horrific offenders in the wild, including many popular implementors.

* Telnet

* VT-XXX, i.e. VT-100 or any term spec for that matter

* MIDI

* MKV, AVI, etc.

* GIF, JPEG

* A huge amount of tech in any browser

* SAUCE

* ANSI escape parsers/writers/sequences

* LZH/LHA

* Open Document Format

* XMP

* MAPI

* POP, IMAP

As I think of this, I realize I'll be typing all night. In some cases, the implementors all adopt things from the most popular implementors even if they are historical. In other cases, things just don't work or break in awesome ways. Zip bombs were mentioned, but there are plenty of awesome things.

I'll finish by saying I wonder if many authors of anything related to terminals even ever bother to read specs or care. Over the years I've started to wonder how some of these things even work in the wild, but the answer is most people don't notice what actually isn't working, while anyone who needs to care has to draw from what become unofficial new specs.

trothamel · on Dec 13, 2016

Another fairly major flaw with zip is a lack of a defined encoding on filenames. As far as I can tell, it's impossible to 100% reliably include non-ASCII filenames inside a zip. (Probably because the format predates unicode's general acceptance.)

TAForObvReasons · on Dec 13, 2016

APPNOTE.TXT actually has a comment about it:

> D.1 The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437. This limits storing file name characters to only those within the original MS-DOS range of values and does not properly support file names in other character encoding, or languages.

Extra fields are generally used to implement other code pages. There is a direct way to do it with an extension to the GPBF. Problem is that some writers don't implement the proper flags.

niftich · on Dec 13, 2016

The spec says filenames are encoded in CP437. UTF-8 filenames are supported as overlays in two different 'extra headers', which essentially amount to vendor-specified extensions.

There are noncompliant implementations that generate zip files with filenames in ISO 8859-1 and other 8-bit codepages, and conversely, there are decompressors that try to accommodate such malformed files.

Graphon1 · on Dec 14, 2016

> it's impossible to 100% reliably include non-ASCII filenames inside a zip.

Yes, the spec does not accommodate this. However there are conventions among popular ZIP creation tools that allow it.

If the zip spec were governed by an open body, then this is the kind of thing that would get quickly fixed. (basically standardize the common practice already in use). I don't know why PKWARE never fixed the spec. Maybe because there's no money in it for them, and the pain for the industry is really not that severe.

xenadu02 · on Dec 13, 2016

Why can't this algorithm work reliably:

1. Begin linear search backwards from end of file 2. If magic constant found, push onto stack of possible locations 3. Once 32k has been reached stop searching 4. For each location on the stack, validate the rest of the header and that the remaining comment length matches the recorded length. 5. The first location to validate as a correct header is chosen

This should account for the appearance of the magic header in comments or data preceding the end of directory entry.

Have I missed something?

ambrop7 · on Dec 13, 2016

Yes you have missed that a valid header may appear by chance somewhere near the end of main zip data. Because your algorithm will prefer picking up headers toward the front, it might misinterpret a valid file, even one without a magic number in the comment; if the last 64k-(actual header+comment size) happens to contain an apparently valid header.

j_s · on Dec 14, 2016

This happens with nested .ZIPs.

diggernet · on Dec 13, 2016

That should work reasonably reliably, and is essentially what TFA describes as the behavior of 'unzip'. But it then goes on to say, "The zip file spec does not allow this arbitrary data [at the end of a zipfile], so unzip is non-conformantly tolerating this data."

So I think what you missed is pedantry. :)

grkvlt · on Dec 15, 2016

Yes.

Step 4, specifically "the remaining comment length matches the recorded length" will fail in the presence of appended data, as there is no end of comment or end of record marker. This is exactly how the file failed in the article, it used that algorithm and found zero valid headers.

londons_explore · on Dec 13, 2016

Frakenfiles like this are made by PHP coders who have:

Download.php:

if (isset($POST['whatever']) write($file_contents);

print "download page contents here"

mschuster91 · on Dec 13, 2016

When you're done flaming PHP, read the whole Github thread, and especially comment #4. The file was produced by a .NET library.

Graphon1 · on Dec 14, 2016

Yep, it has nothing to do with PHP. It's easy to produce this kind of frankenfile with an incorrect cp command. Or with incorrect stream management using any zip library that emits streams.

TazeTSchnitzel · on Dec 13, 2016

Even so, it could easily be produced by this kind of error in a PHP script.

Including an ending tag (?>) followed by whitespace could have the same effect.

Meic · on Dec 13, 2016

See also zip bombs [0]

[0] https://en.m.wikipedia.org/wiki/Zip_bomb

kazinator · on Dec 14, 2016

What the article doesn't cover is more braindamage in the ZIP spec which prevents a forward scan through the individual file records. This is detailed in the Wikipedia.

Tools that correctly read .ZIP archives must scan for the end of central directory record signature, and then, as appropriate, the other, indicated, central directory records. They must not scan for entries from the top of the ZIP file, because only the central directory specifies where a file chunk starts. Scanning could lead to false positives, as the format does not forbid other data to be between chunks, nor file data streams from containing such signatures.

762236 · on Dec 14, 2016

I wonder who here remembers self-extracting zip files under DOS/Windows? They put the zip file into the executable file. And you could extract with a normal unzip program, I believe.

flukus · on Dec 14, 2016

You're probably thinking if NSIS (http://nsis.sourceforge.net/Main_Page) and a few others.

khedoros1 · on Dec 14, 2016

I still run into those a couple times a year. Admittedly, mostly when I'm going on some kind of retro binge.

jbb555 · on Dec 13, 2016

Presumably this flaw explains why zip files are so obscure and difficult to use. (Yeah it's no great but clearly it's not been an issue)

Dylan16807 · on Dec 14, 2016

So the flaw in the spec is extremely minor and irrelevant to the real complaint. The real complaint is about programs making and accepting slightly invalid files.

Aldo_MX · on Dec 13, 2016

Which is the consensus about better alternatives?

niftich · on Dec 13, 2016

There is no consensus about alternatives. There are plenty of open container file formats that can contain multiple files, random-access them, and optionally compress them -- with some, but not all, supporting filesystem attributes: CFS, 7z, PEA, ZPAQ, and as another commenter writes, DAR.

A combined 'contain & compress' format is perhaps at odds with the unix philosophy of tools doing one thing well. Meanwhile, zip definitely exemplifies 'worse is better', in that despite the format's quirks and darker corners, it's good enough for most casual uses, and network effects and backwards compatibility help it win out.

teilo · on Dec 13, 2016

DAR. GPL, indexed, file permissions, ACLs, alternate file streams, per-file compression, slices, hard and symbolic links, sparse files, encryption, parity records for recovery. Everything you could want. Not sure why it hasn't had much use. Maybe because it is usually thought of as a backup tool (at which it is excellent). But it is also an excellent archival tool.

From the man page: "Dar can also use a sequential reading mode, in which dar acts like tar, just reading byte by byte the whole archive to know its contents and eventually extracting file at each step. In other words, the archive contents is located at both locations, all along the archive used for tar-like behavior suitable for sequential access media (tapes) and at the end for faster access, suitable for random access media (disks). ... Note also that the sequential reading mode let you extract data from a partially written archive (those that failed to complete due to a lack of disk space for example)."

http://dar.linux.free.fr

witty_username · on Dec 14, 2016

From http://dar.linux.free.fr

> Consequently, to use the API, your program must be released under the GPL as well.

GPL is a bad thing for programs like this. It hampers usability. Especially at least the decoder should be BSD/MIT/etc.

teilo · on Dec 14, 2016

A fair point. Lacking good GUI tools, integration libraries, etc., something like this will never gain traction. This looks to be a fairly complex code base. Re-implementing it in a library licensed under BSD/MIT would be non-trivial.

EpicEng · on Dec 14, 2016

Yeah, and then you send your DAR file to Sally in accounting, who can't open it, and you're back to ZIP.

teilo · on Dec 14, 2016

True of any alternative and thus irrelevant to the question I was answering.

wolfgang42 · on Dec 14, 2016

The main problem I see with any alternatives is that ZIP is the only format that seems to work everywhere. On NIX-type systems you also have tar+gzip, and some Windows users have something that can open RAR files, but if you want to send an archive and know that the other party will be able to open it without hassle ZIP is the only way to go. Unfortunately this is impossible to change on any reasonable time scale (less than a decade) since everyone needs to have the archiving tools immediately to hand.

user5994461 · on Dec 13, 2016

Define "better".

If you just want to compress some stuff, zip is perfectly fine and widely supported. (or gzip or tar)

If you want fancy things with better compression, user rights/ACL (either linux or windows style), file dates, encryption, password protection. Well, I don't know what can do that.

optionalparens · on Dec 14, 2016

I'd also add compression time and decompression time to that list. Zip wasn't actually the best compression algorithm I remember seeing in its original days, it just was the best compromise of speed of compression, extraction, and compressed size. The TOC feature was a killer because even the fastest formats decompressed slow, not to mention space was more of a premium.

An additional feature that sometimes is important as well is making itself stream friendly so you can read the metadata without the entire file (or easily before streaming) as it is streamed over some context or using some technology.

The landscape is much different now and so some of features you'd design into a new format didn't exist then, while others that were concerns are less concerns now.

As parent hints, the compatibility alone makes zip still very desirable and actual compression is good enough given size of disks today for most people. Obviously there are exceptions.

aidenn0 · on Dec 14, 2016

7z and rar are probably the most widely available alternatives today (The 7zip GUI utility on windows can decompress RAR files, while I don't think WinRAR can handle 7z files, so RAR is strictly more available than 7z on windows).

I personally also like ZPAQ for generating archives, though I find it a very poor fit for its suggested use as a backup tool (doesn't handle symlinks, owner, xattr, ACL &c.). As an added bonus the reference implementation is completely composed of public domain and MIT/X11 licensed code, which makes it friendly for linking pretty much anywhere.

cwmma · on Dec 14, 2016

For a lot of things zip files are used for tar.gz isn't just an alternative but what you should have been using all along as you can do streaming reads, that being said it dosn't have as good support and can't do random access from disk like zip can so it depends on what you are trying to do.

LeoPanthera · on Dec 14, 2016

My favorite archive format is lzip. It's libre and extremely well documented, and there is a Public Domain unarchiver. And it uses modern LZMA compression.

http://www.nongnu.org/lzip/lzip.html

pkulak · on Dec 13, 2016

tar.gz ?

Aldo_MX · on Dec 13, 2016

tar.gz is not good with random access

duskwuff · on Dec 13, 2016

"Not good" is an understatement. tar.gz is maximally bad at random access -- there is no way to uncompress a single file, or to list files in the archive, without decompressing the entire gzip stream.

Zardoz84 · on Dec 13, 2016

and gzip every file and tar it ? ie gz.tar

duskwuff · on Dec 13, 2016

Gzipping each file individually makes the compressor have to start over with a fresh context on each file. This will result in a much worse compression ratio than if a single compressor is used for the whole archive, especially if the archive contains a lot of small, similar files. (Note that this is essentially what PKZIP does, though, so it's not awful.)

Also, there are no tools I'm aware of that will handle that archive structure.

mjevans · on Dec 13, 2016

While a 'solid' archive wouldn't typically have the compression penalty or the 'listing' penalty it would still have the extraction penalty.

cwmma · on Dec 14, 2016

But it's much better at streaming reads then zip so depending on the context might be better.

BentFranklin · on Dec 13, 2016

There is an issue of POC|GTFO that discusses zip frankenfiles.

GFK_of_xmaspast · on Dec 13, 2016

See also those tricks you can do with a file that's both a zip and a gif at the same time.

klodolph · on Dec 13, 2016

There are a ton of formats you can turn into "both a ZIP and X" at the same time, because most formats have a header at the beginning of the file and ZIP has a header at the end. Self-extracting EXEs are really just some standard EXE which is also a ZIP file, so you can use unzip to extract them without having to install WINE or whatever.

xenadu02 · on Dec 13, 2016

Same applies to PDF. I believe it is possible to construct a file that is an EXE, PDF, and ZIP all in one.

rabidrat · on Dec 13, 2016

I made a tool "mkizo" that generates .iso files that are also .zip files:

https://github.com/century-arcade/src/tree/master/tools/mkiz...

_blrj · on Dec 13, 2016

Fascinating... I wonder how many .zip files in the wild have this issue and if perhaps as a result many unarchiving applications tolerate it.

__jal · on Dec 13, 2016

I exploited this once when designing a puzzle. One clue was "encrypted" text appended to the end of an otherwise-innocent zip file.

(One person found it, but the whole thing was way too hard. I drastically overestimated people's operational cleverness.)

pavel_lishin · on Dec 13, 2016

> I drastically overestimated people's operational cleverness.

Maybe your target audience was just not quite right? People go gaga over ARGs and devote thousands of hours to 'em.

Maybe hint that your game has something to do with HalfLife 3...

user5994461 · on Dec 13, 2016

> Maybe your target audience was just not quite right?

Nah. It's insanely difficult to find this kind of stuff when you have no clue what to look for.

diggernet · on Dec 13, 2016

I'm actually surprised yauzl can get away with rejecting .zip files due to appended data. I recall that back in the BBS days downloaded files frequently had trailing bytes on them (perhaps padding to block size, or something?).

optionalparens · on Dec 14, 2016

You're making me seem old I guess, but I've encountered this a bunch of times. This was often done as a way to add metadata to things that otherwise didn't have metadata or needed additional info. Typical use cases include music players, trackers, drawing programs, media viewers, and anything that scanned files for metadata (ex: BBS).

The SAUCE spec adds trailing bytes after the EOF mark for instance. I wrote a few SAUCE reader/writers embedded in various codebases (and will release a stand-alone good one some day soon).

Generally, a trick was that if you wanted some sort of metadata on a file, you could add it before the contents of the file (as in MKV allows ASCII at the start) or after the EOF marker. The latter made sure you didn't interfere with a lot of existing programs on various operating systems of the time, including but not limited to DOS, Windows, and CP/M. This trick still works for a lot of OSs or in other forms today, we just don't rely on markers for EOF as much so there are other methods, and people have started to bake metadata into specs.

As for the BBS scene, although SAUCE was used, for zips DIZ tended to be used more and was just packed into the archive and could be found using the internal TOC, or otherwise just uploaded separately.

softblush · on Dec 14, 2016

http://www.acid.org/info/sauce/sauce.htm

http://www.textfiles.com/computers/fileid.txt

mark-r · on Dec 13, 2016

CP/M didn't record the file size, so all files were a multiple of 128 bytes. The Ctrl-Z convention to mark the end of a text file came from this restriction, and that convention is still used by Windows today.

winteriscoming · on Dec 13, 2016

I haven't taken a look at the spec, but assuming what's noted in that issue comment is indeed the case, then I find it surprising that something as obvious as this would not be fixed/raised during the RFC and review of the spec itself.

gpcz · on Dec 13, 2016

There was/is no RFC for the ZIP file format (there is for gzip and the DEFLATE algorithm). It was Phil Katz releasing a text file called APPNOTE.TXT describing the file structure (latest version is available at https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT ).

_blrj · on Dec 13, 2016

It almost makes .zip feel much more proprietary than it is!