Anyone more knowledgeable in assembly and file formats care to expand on this:
>It serves no purpose, except proving that files format not starting at offset 0 are a bad idea
What exactly does it mean to start at offset 0 and why don't these file formats do that? Is there an advantage in not starting at offset 0 or is it simply oversight/indifference? Any kind of background on the problem would be appreciated, I'm really quite intrigued.
Every major file type (or nearly every, anyway) has a set of signature bytes, a "magic number" or something equivalent that identifies it as being of that type. This lets programs identify what kind of object a file represents without requiring this information to be supplied by the user.
Most file types have this magic signature as the initial few bytes of the file. For example, a Windows executable always begins with the ASCII characters "MZ".
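To make the "signature at the start of the file" idea concrete, here is a minimal Python sketch. The table and the name `sniff_prefix` are made up for illustration; real tools such as `file` use much larger rule sets.

```python
from typing import Optional

# Hypothetical, tiny signature table; every entry is checked at offset 0.
PREFIX_SIGNATURES = {
    b"MZ": "DOS/Windows executable",
    b"%PDF-": "PDF document",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"PK\x03\x04": "ZIP archive (local file header)",
}

def sniff_prefix(data: bytes) -> Optional[str]:
    """Return the first format whose magic bytes open the data, else None."""
    for magic, name in PREFIX_SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None
```

Calling `sniff_prefix` on the first few bytes of a file is enough here, since nothing past the prefix is inspected.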
The point is that with non-overlapping magic signatures, a single file can be simultaneously identified as more than one type.
It's a little more complicated than that, actually. Any given application of a file format may use various obfuscation techniques on the file's header or contents that render the file invalid from the perspective of the published standard (if there is one; it is also common in these cases to change the file extension to further disguise what format the file actually uses). Programs that do this may or may not de-obfuscate the file prior to use, depending largely on how and why the file was obfuscated.
For instance, a common obfuscation method is simply removing the magic number from the file; in this case, the program may simply try to use the file as the given format and return an error (or crash; we are talking largely about proprietary software in these cases after all) if the file can't be read.
When a file format starts at offset 0, it simply means that it starts at the first byte of the file.
Other than that, I can't provide any information on file formats allowed to start at offsets other than 0, or why this may or may not be a good idea (I suppose maybe it would allow an enterprising programmer to hide a malicious file by embedding it in an otherwise-innocuous format?), though I am certainly curious as well.
I think you're on to the right answer (though I don't know for sure myself).
It seems to me that if all file format identifiers started at the zero offset, it would be impossible for a single file to identify as more than one format. However, when different formats use different offsets to identify themselves, it is possible to construct the file in such a way that it validly identifies as more than one format.
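A rough sketch of that argument in Python, assuming a detector that only compares magic bytes at fixed offsets. The table is illustrative; the one non-zero offset used is the "ustar" marker that tar headers carry at byte 257. Passing this check does not make the buffer a fully valid tar archive or executable, it only shows that both signature tests succeed at once.

```python
# (name, offset, magic) entries; the table itself is an illustrative assumption.
SIGNATURES = [
    ("Windows executable", 0, b"MZ"),
    ("PDF", 0, b"%PDF-"),
    ("tar (ustar)", 257, b"ustar"),
]

def matching_formats(data: bytes) -> list:
    """Return every format whose magic bytes appear at its expected offset."""
    return [
        name
        for name, offset, magic in SIGNATURES
        if data[offset:offset + len(magic)] == magic
    ]

# One 512-byte buffer carrying two signatures at different offsets.
buf = bytearray(512)
buf[0:2] = b"MZ"
buf[257:262] = b"ustar"

print(matching_formats(bytes(buf)))  # ['Windows executable', 'tar (ustar)']
```

The two offset-0 entries can never match the same data because they differ in their first byte, while the offset-257 entry is free to coexist with either.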
That's kind of a different issue, though. My understanding is that .jpeg allows a footer of unlimited size and .rar allows a header of unlimited size. It gets similar results, though.
A lot of archive formats start at the end because you don't know what is going to be written beforehand. But there is very little reason not to have magic bytes at either the very start or end of a file.
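ZIP is one example of the "start at the end" approach: its end-of-central-directory record sits at the tail of the file and points back to the index. A minimal sketch of locating it in Python follows; `find_eocd` is a hypothetical helper, and it ignores ZIP64 and the corner case of an archive comment that happens to contain the signature.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"
# signature, 4 x uint16 (disk numbers and entry counts),
# 2 x uint32 (central directory size and offset), uint16 comment length
EOCD_FMT = "<4s4H2LH"

def find_eocd(data: bytes):
    """Search backwards for the end-of-central-directory record."""
    pos = data.rfind(EOCD_SIG)
    if pos < 0 or pos + struct.calcsize(EOCD_FMT) > len(data):
        return None
    fields = struct.unpack_from(EOCD_FMT, data, pos)
    return {
        "eocd_offset": pos,
        "entries": fields[4],
        "cd_size": fields[5],
        "cd_offset": fields[6],
    }

# Build a small archive in memory to try it on.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("hello.txt", "hi")
info = find_eocd(archive.getvalue())
```

Because the reader begins from the tail, the writer can stream entries first and emit the index last, once it knows what was written.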
Edit: someone posted results for an .exe file inside a .zip, which are a bit different (it seems like some antiviruses don't try to unpack it?), but then deleted the comment. Here's the link for the .exe: https://www.virustotal.com/file/2a9c7a16cdb3c3f2285afaf61072...
Given what it's doing and how it's doing it, those virus alerts are understandable, and if anything I'd have to say kudos to Panda AV for being the most honest about it. Breaking the PE structure and the CRC checksum would probably get it flagged, as it has been by some scanners, and the HTML/EXE flagging also makes sense after reading through how it works.
Still impressive stuff, and given the use of undocumented opcodes and x86 foo, it does raise a new question:
Given that some VMs will fail on some of these instructions where bare metal would run them, is it possible to have a virus that triggers only on bare metal, or only inside a VM, through the use of undocumented opcodes and the like?
Nonetheless, a wonderful example of hacking in its truest sense, and educational on undocumented opcodes and on how, for some things, you can't beat pure assembly for fun and jollies.
It being an .exe and a JAR file doesn't surprise me at all. JAR files follow the ZIP format, and self-extracting ZIP files have always worked by being simultaneously a valid EXE and ZIP file.
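A small Python sketch of that self-extracting layout, under a loud assumption: `fake_stub` is not a real executable, just the two "MZ" bytes plus padding standing in for one. The point is only that a ZIP reader still finds the archive when other data comes first.

```python
import io
import zipfile

# Build an ordinary ZIP archive in memory.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w") as zf:
    zf.writestr("hello.txt", "hi")

# Hypothetical stand-in for an executable stub: the "MZ" signature plus padding.
fake_stub = b"MZ" + b"\x00" * 62
polyglot = fake_stub + zip_buf.getvalue()

# The data starts like an executable...
starts_like_exe = polyglot.startswith(b"MZ")

# ...and still opens as a ZIP, because the reader locates the archive from the end.
with zipfile.ZipFile(io.BytesIO(polyglot)) as zf:
    names = zf.namelist()
    content = zf.read("hello.txt")
```

Python's `zipfile` module tolerates data prepended to the archive, which is what this relies on; other ZIP readers may be stricter.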