
Anyone more knowledgeable in assembly and file formats care to expand on this:

>It serves no purpose, except proving that files format not starting at offset 0 are a bad idea

What exactly does it mean to start at offset 0 and why don't these file formats do that? Is there an advantage in not starting at offset 0 or is it simply oversight/indifference? Any kind of background on the problem would be appreciated, I'm really quite intrigued.




Every major file type (or nearly every, anyway) has a set of signature bytes, a "magic number" or something equivalent that identifies it as being of that type. This lets programs identify what kind of object a file represents without requiring this information to be supplied by the user.

Most file types have this magic signature as the initial few bytes of the file. For example, a Windows executable always begins with the ASCII characters "MZ".

The point is that when two formats' magic signatures do not have to occupy the same bytes (because at least one of them is not required at offset 0), a single file can be simultaneously identified as more than one type.


File format trivia:

"MZ" are the initials of Mark Zbikowski, one of the developers of MS-DOS. :)

http://en.wikipedia.org/wiki/DOS_MZ_executable


I'm not an expert on file formats, so I looked it up on Wikipedia. Here's what it says on PNG[1]:

  A PNG file starts with an 8-byte signature.
  The hexadecimal byte values are 89 50 4E 47 0D 0A 1A 0A;
  the decimal values are 137 80 78 71 13 10 26 10. 
So if a file starts with 89 50 4E 47 0D 0A 1A 0A, you know it may be a valid PNG; otherwise you know it's not.

GIF starts with another marker at zero offset, so no valid GIF is a valid PNG, and vice versa.
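To make the "fight for the first bytes" idea concrete, here is a minimal sketch of offset-0 sniffing. The table and the function name sniff_leading_magic are made up for illustration; the PNG bytes are the ones quoted above, "MZ" is the executable marker mentioned elsewhere in the thread, and the two GIF markers ("GIF87a" and "GIF89a") are the standard GIF signatures.

```python
from typing import Optional

# Illustrative table: each of these signatures must sit at offset 0.
LEADING_MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),  # 89 50 4E 47 0D 0A 1A 0A
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"MZ", "DOS/Windows executable"),
]


def sniff_leading_magic(data: bytes) -> Optional[str]:
    """Return the first format whose signature matches the leading bytes."""
    for magic, name in LEADING_MAGIC:
        if data.startswith(magic):
            return name
    return None


print(sniff_leading_magic(b"\x89PNG\r\n\x1a\n" + bytes(8)))
print(sniff_leading_magic(b"GIF89a" + bytes(8)))
print(sniff_leading_magic(b"plain text"))
```

Since none of these signatures is a prefix of another, at most one entry can match any given file, which is why formats that all insist on offset 0 exclude each other.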

Some formats are mutually exclusive because they “fight” for contents of first several bytes.

Other formats are more relaxed about where their signature has to be, which opens up the possibility, exploited here, of carefully engineered ambiguity.

edit: removed a section that was utterly wrong

[1]: http://en.wikipedia.org/wiki/Portable_Network_Graphics


It's a little more complicated than that, actually. Any given application of a file format may use various obfuscation techniques on the file's header or contents that render the file invalid from the perspective of the published standard (if there is one; it is also common in these cases to change the file extension to further disguise what format the file actually uses). Programs that do this may or may not de-obfuscate the file prior to use, depending largely on how and why the file was obfuscated.

For instance, a common obfuscation method is simply removing the magic number from the file; in this case, the program may simply try to use the file as the given format and return an error (or crash; we are talking largely about proprietary software in these cases after all) if the file can't be read.


When a file format starts at offset 0, it simply means that it starts at the first byte of the file.

Other than that, I can't provide any information on file formats allowed to start at offsets other than 0, or why this may or may not be a good idea (I suppose maybe it would allow an enterprising programmer to hide a malicious file by embedding it in an otherwise-innocuous format?), though I am certainly curious as well.


I think you're on to the right answer (though I don't know for sure myself).

It seems to me that if all file format identifiers started at the zero offset, it would be impossible for a single file to identify as more than one format. However, when different formats use different offsets to identify themselves, it is possible to construct the file in such a way that it validly identifies as more than one format.
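A rough sketch of that idea, using GIF and ZIP as the pair (my choice of example, not necessarily the formats in the linked file): GIF is identified by bytes at offset 0, while ZIP readers locate the archive from a record near the end of the file. The helper names matching_formats and make_zip are hypothetical, and the "GIF" here is only a signature plus padding, not a decodable image.

```python
import io
import zipfile
from typing import List

GIF_MAGICS = (b"GIF87a", b"GIF89a")  # checked at offset 0


def matching_formats(data: bytes) -> List[str]:
    """Return every format whose identification check accepts `data`."""
    found = []
    if data.startswith(GIF_MAGICS):
        found.append("gif")
    # The ZIP check works from the end of the file, so leading bytes that
    # belong to another format do not prevent a match.
    if zipfile.is_zipfile(io.BytesIO(data)):
        found.append("zip")
    return found


def make_zip(files: dict) -> bytes:
    """Build a small ZIP archive in memory from a name -> text mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, body in files.items():
            zf.writestr(name, body)
    return buf.getvalue()


# Assumption: a bare GIF signature plus padding stands in for a real image.
gif_stub = b"GIF89a" + bytes(16)
polyglot = gif_stub + make_zip({"hello.txt": "hi"})

print(matching_formats(polyglot))
with zipfile.ZipFile(io.BytesIO(polyglot)) as zf:
    print(zf.read("hello.txt"))
```

The same bytes pass both checks because the two formats never look at the same region of the file.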


I've seen files distributed on 4chan before via a .rar file embedded in an image.


That's kind of a different issue, though. My understanding is that .jpeg allows a footer of unlimited size and .rar allows a header of unlimited size. It gets similar results, though.


A lot of archive formats start at the end because you don't know what is going to be written beforehand. But there is very little reason not to have magic bytes at either the very start or end of a file.
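ZIP is one example of an archive that is read from the end (my example; the comment doesn't name a format). Its index is written last, and a fixed 22-byte "end of central directory" record starting with the bytes PK\x05\x06 points back to it. Below is a minimal sketch of finding that record; find_end_record is a hypothetical helper, and it ignores archive comments that happen to contain the signature, ZIP64, and multi-disk archives, all of which real readers have to handle.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"
# signature, disk no., disk with directory, entries on this disk,
# total entries, directory size, directory offset, comment length
EOCD_STRUCT = struct.Struct("<4sHHHHIIH")  # 22 bytes, little-endian


def find_end_record(data: bytes):
    """Locate the end-of-central-directory record by searching backwards."""
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0 or pos + EOCD_STRUCT.size > len(data):
        return None
    (_sig, _disk, _cd_disk, _entries_here, total_entries,
     cd_size, cd_offset, _comment_len) = EOCD_STRUCT.unpack_from(data, pos)
    return {
        "position": pos,
        "entries": total_entries,
        "cd_size": cd_size,
        "cd_offset": cd_offset,
    }


# Build a two-file archive in memory to try it on.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "first")
    zf.writestr("b.txt", "second")
archive = buf.getvalue()

record = find_end_record(archive)
print(record)
```

The writer only knows the entry count and the directory offset after every file has been written, which is why this record ends up at the tail instead of at offset 0.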



