
iOS 5 ZIP-compressed install files are 700MB each, so a 500MB Android image wouldn't be terribly surprising.



(Encrypted and then compressed, however, so it's essentially the same as the uncompressed size.)


Just to clarify what you're saying here: are you saying that simply because a file is encrypted, its compressed size would be essentially the same as its uncompressed size?

There are a fair few modes of encryption (such as CFB, OFB and CTS) out there which ensure that the encrypted data is the same size as the input data. Even if Apple uses a mode which requires padding, the padding should not be so large as to increase the size of the iOS image to the size that you are suggesting.
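
For what it's worth, here's a minimal sketch of that point (assuming the third-party `cryptography` package is installed; the key and nonce are throwaway values): with a stream-style mode like CTR, the ciphertext is exactly as long as the plaintext.

  import os
  from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

  plaintext = b"A" * 1_000_000                    # 1 MB of highly regular data
  key, nonce = os.urandom(32), os.urandom(16)     # throwaway key material

  encryptor = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
  ciphertext = encryptor.update(plaintext) + encryptor.finalize()

  # CTR is a stream mode, so there is no padding and the sizes match exactly;
  # even a padded mode like CBC would add at most one 16-byte block.
  print(len(plaintext), len(ciphertext))          # 1000000 1000000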


Encrypted data is not very compressible. Compression algorithms exploit statistical regularities in the bitstream, whereas the whole point of encrypting something is to get a result with no statistical regularity whatsoever.

If you compress first and then encrypt, you can get file sizes smaller than the input, because the compression algorithm gets to work on the plaintext.
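
To make the ordering concrete, here's a quick sketch (again assuming the `cryptography` package; the key and nonce are throwaway values) comparing compress-then-encrypt with encrypt-then-compress on a megabyte of redundant text:

  import os, zlib
  from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

  def encrypt(data):
      # AES-CTR with a random key/nonce, purely for illustration
      enc = Cipher(algorithms.AES(os.urandom(32)), modes.CTR(os.urandom(16))).encryptor()
      return enc.update(data) + enc.finalize()

  plaintext = b"all work and no play makes jack a dull boy " * 25_000   # ~1 MB

  a = encrypt(zlib.compress(plaintext))   # compress, then encrypt
  b = zlib.compress(encrypt(plaintext))   # encrypt, then compress

  # a is a few KB; b comes out marginally *larger* than the input,
  # because the ciphertext has no regularity left for deflate to exploit
  print(len(a), len(b))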


And iOS doesn't do that?

Let me get this conversation straight... Someone says that it's strange that Android is so large. Someone else says iOS is big too. Then someone comes in to say that they're not really large; they just suck at getting the compress/encrypt order right. No one, other than obtino, thinks that it was simply a mistake that Xuzz put 'then' in his/her post?


It's compressed after encryption, so the effect of the compression is minimal. I believe that my post was correct: due to the encryption, the compression is almost useless, rendering the large iOS file size essentially equivalent to the size when installed to disk. That's all.


This is correct, and yet at the same time I believe it is incorrect. I am totally willing to believe I'm wrong here, though (as it has been two years since I actually did this process manually). I will explain. ;P

So, it is my understanding that an IPSW file is a ZIP archive containing a number of files, the largest one being where the main filesystem is stored. This file is encrypted, and does not compress very well at all.

However, that file is itself a dmg (Apple disk image), which is a compressed format: essentially a compressed HFS+ image. Therefore, the encryption is happening after the compression.
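
If anyone wants to sanity-check that, here's a small sketch (the filename is a placeholder for whatever IPSW you happen to have) that lists each member of the archive with its stored vs. compressed size; the big encrypted filesystem dmg should show essentially no savings:

  import zipfile

  # the path is hypothetical: point it at any IPSW you actually have
  ipsw = zipfile.ZipFile("iPhone_Restore.ipsw")

  for info in ipsw.infolist():
      ratio = info.compress_size / max(info.file_size, 1)
      print(f"{info.filename}: {info.file_size} -> {info.compress_size} ({ratio:.0%})")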

Therefore, I do not believe it is accurate to claim that this is key to the problem. While it is humorous that the files are being compressed, encrypted, and then compressed again, that is not what is causing them to fail to compress: the first compression should work.

Instead, if we go one level deeper, we can ask the question "what is Apple even storing on this filesystem", and the answer is "maybe one or two hundred megabytes of executable code, and a few hundred megabytes of graphics".

The images are stored as PNG and JPEG: file formats that are already compressed. We therefore would not expect the version in the final output file to be much smaller than that on the filesystem. These files are, in essence, being compressed, compressed, encrypted, and compressed. ;P
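
The "compressing already-compressed data" part is easy to see with zlib alone, using its own output as a stand-in for PNG/JPEG payloads (which are already deflate/DCT-compressed):

  import zlib

  data = b"some fairly redundant pixel-ish data, repeated a lot " * 20_000

  once  = zlib.compress(data)
  twice = zlib.compress(once)    # compressing the already-compressed output

  # the second pass saves essentially nothing, and can even grow slightly
  print(len(data), len(once), len(twice))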

The executable code, meanwhile, really doesn't compress well with algorithms like deflate: while it has reasonably low entropy, its encoding looks irritatingly random to algorithms that are looking for sequences of bytes (or bits) that are actually identical, especially over small window sizes.

The problem is that you may see "add one, compare, branch if equal" all over the place, but it is "add one (to X), compare (with Y), branch if equal (to Z)", which breaks up the nice sequence. Even just reorganizing the data bits based on the instruction encoding helps /tremendously/.

It is also often the case that X is one of just a few numbers, Z falls within a small range (loops aren't usually that large), and so on; but normal algorithms look for "exactly this", not "something similar to this with an offset" (nor do they switch to a general integer encoder). Again, minor details, but they break deflate.
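
Here's a toy sketch of the reordering idea, using purely synthetic 4-byte "instructions" (not a real ISA): grouping the opcode, register, and immediate fields into separate streams before deflating tends to help, because the repetitive fields are no longer interrupted by near-random immediates.

  import os, random, zlib

  random.seed(0)

  # synthetic 4-byte "instructions": a small opcode set, a small register set,
  # and a 2-byte immediate that looks random to deflate
  instrs = [bytes([random.choice((0x01, 0x04, 0xE5, 0xEA)),   # "opcode"
                   random.randrange(16),                      # "register"
                   *os.urandom(2)])                           # "immediate"
            for _ in range(100_000)]

  interleaved = b"".join(instrs)
  # split-stream layout: all opcodes, then all registers, then all immediates
  split = (bytes(i[0] for i in instrs) +
           bytes(i[1] for i in instrs) +
           b"".join(i[2:] for i in instrs))

  # the split layout usually compresses noticeably better, since the opcode and
  # register streams get their own tight Huffman tables and longer matches
  print(len(zlib.compress(interleaved)), len(zlib.compress(split)))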

...and, indeed, there are better compression algorithms out there already that are designed to handle code well. I swear Google even had some cool stuff for this, but I'm not finding it right now :(. Regardless, a quick (silly) citation for validity:

"""While we have not addressed the compression of machine code, others have shown that it is possible to compress machine code by a factor of 3 using a specially tuned version of a conventional compressor [Yu96] and by as much as a factor of 5 using a compressor that understands the instruction set [EEF +97]."""

-- http://www.usenix.org/event/usenix99/full_papers/wilson/wils...

So, yeah: I think the key problem is that Apple is not wasting disk space on the device. And, when you put it that way, it is obvious: why would Apple waste 700MB of flash on a 32GB device, space the user would probably really love to be storing music in, when they only have 100MB of entropy?

The answer is: 'they wouldn't', and so (modulo the further compressibility of binaries, an interesting and partially open academic problem) most of the data on the filesystem is already-compressed images and audio, and compressing, encrypting, and even compressing again doesn't change the result much.


Here is a small test:

  dd if=/dev/zero of=file count=50k bs=100   # create a ~5MB file of zeros
  aes -e -f file -o file2 -p asdfasdf        # create an AES-encrypted copy of the file
  tar -czf file.tar.gz file                  # compress the original file
  tar -czf file2.tar.gz file2                # compress the encrypted file
  du -sh file*                               # check the size of all the files
Here is the output I got:

  4.9M	file
  5.0M	file2
  8.0K	file.tar.gz
  5.0M	file2.tar.gz
These results pretty much speak for themselves. Just think of it this way: compression works by finding patterns (like every byte being zero) and only storing the patterns. If the encrypted data still had patterns, the plaintext could be recovered more easily.


Compressing an all-zeros file is not representative of anything. Why not do it with actual text?


It depends what you want to show; in this case I was showing how compressible a file is before and after encryption. An all-zeros file has nearly no randomness, so it compresses very well. The test then shows that encryption destroys that regularity, leaving a file that's almost incompressible. I'm really just exaggerating the scale to make it more 'dramatic'.

If I happened to be showing how compressed codecs like jpeg, mp3 or h264 weren't compressible, I would definitely pick something more like an actual text file.
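
For a less dramatic version of the same demo, here's a sketch comparing how well zlib does on zeros, some real program text, and random bytes (a reasonable stand-in for encrypted data):

  import inspect, os, zlib

  samples = {
      "zeros":  b"\x00" * 5_000_000,
      "text":   inspect.getsource(os).encode(),   # the stdlib's os.py, as sample text
      "random": os.urandom(5_000_000),            # statistically similar to ciphertext
  }

  for name, data in samples.items():
      out = zlib.compress(data)
      print(f"{name}: {len(data)} -> {len(out)} ({len(out) / len(data):.1%})")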


(You know you don't have to use tar for a single file, and can just gzip it outright, right? :P)



