Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Unexplored Areas in Data Compression (medium.com/duhroach)
124 points by ingve on Dec 5, 2015 | hide | past | favorite | 46 comments


`Entropy is a broken measurement. The fact that [0,1,2,3] and [0,3,1,2] have the same entropy value is annoying...' This is incorrect. There is no concept of entropy for a specific finite sequence; entropy is a property of random variables. Assuming a finite string is generated (i.i.d.) from a random source, then one way to estimate its entropy is using the histogram but this is not always optimal. Also several other parts of the article are over-simplified or incorrect.


1. Entropy is a broken measurement

2. This is incorrect

3. There is no concept of entropy for a specific finite sequence

Do you see what you did there? Entropy is limited concept if you want to advance compression beyond certain limit.

General note: There is tendency of commentators to nit-pick blog posts as if being in debate where the goal is to win. It would be more valuable if we would read the article and try to interpret the intended meaning behind them favorably. Of course there are parts that are over-simplified or incorrect. This is just writing in the blog, not fully thought out edited and peer reviewed writing.


While I agree with your general comments about the attitude we should have in evaluating/commenting on blog post I have to disagree with the specific critique of the above comment. The favorable interpretation of the article is that the OP understands the basic definition of entropy but didn't realize that it could be extended to other models that account for permutation symmetry. But if you have read about any authoritative accounts of information entropy (Shannon's 1948 expository article, which discusses permutation invariance, is an example) you would be aware that it is more generally applicable.

Entropy is a famous concept, but it is also a famously misunderstood concept, and misunderstood frequently enough that I think we should err on the side of being charitable to the critics as long as their commentary is technically sound and focused. The OP could have probably done without last two sentences. Nevertheless, I would like to think that the article could have been greatly improved or possibly avoided entirely if the auothor had consulted AidanChurch818.


I have a compression issue much closer to my heart that is not being properly addresses: AES-256 encrypted plaintext files. Seriously! If you just figure out the 256-bit key, it would compress well with any compressor - there's a huge class of files which would only need 256 additional bits! Why isn't there any advance on this front? /s

Entropy is not a broken measurement - in fact, it's the perfect measurement (proof: Shannon's source coding theorem and its converse), its only downside is that it is only defined with respect to a model.

If you measure it on the output of one model (permutations) with respect to another model (i.i.d) then, duh, it's wrong. And general permutations are so rare, that they might bug the author but there's no incentive to consider them. Non general permutations - e.g., addition-modulo permutations, often ARE reasonably well compressed by markov/tree/fsmx model algorithms and their delta counterparts. (And they're still too rare to consider on their own).

Kolmogorov complexity is interesting, but it is too vast and complicated to consider as a whole (proof: equivalence of computing Chaitin's omega to the halting problem), other than as a very useful reasoning device. Therefore, like with the halting problem, we focus on specific subsets that seem interesting.

Frankly, this rant reads like "But I don't understand why they don't yet have a car that runs on water". Compression is math, and it's a part of math we understand quite well. BWT, for example, is an amazing algorithm - but if you take a step back, it's just an efficient implementation of a compression scheme that was known before. (IIRC, the last gap in understanding the equivalence was closed by Effros et al in 2002)


Well, that's why you compress then encrypt first

Entropy is a good measurement but it's not the goal. The goal is a smaller file size

Yeah, I guess that for 'generic compression' the advances are going to be slow and few. Without any knowledge of the type of file it's only bytes so you have to work with that. Also disk space and network speeds are getting cheaper, so what might have needed compression in the past might not need today (even though compression got much cheaper)


> Well, that's why you compress then encrypt first

Just be careful of CRIME-style attacks...


wiki: "CRIME ("Compression Ratio Info-leak Made Easy") is a security exploit against secret web cookies over connections using the HTTPS and SPDY protocols that also use data compression." https://en.wikipedia.org/wiki/CRIME


It is a design goal of symmetric crypto algorithms to produce an output that is indistinguishable from random with any reasonable effort.

Thereby what you're asking for is a compression algorithm that is so good that you can use it to attack AES. As long as we consider AES as safe that won't happen. (And I don't expect that AES gets anywhere near being broken within decades to come.)


Yes, that is exactly what I was hinting at. The /s at the end of my first paragraph stands for "and without sarcasm from here on...".


There's a great big unexplored area: Optimising the expression.

I had a hack back in the previous millenium that improved on .tar.gz, compatibly. I think it used find, but the current expression would be:

  tar zcf $(file **/*(.) | sort -T: -k2 | cut -d: -f1)
The use of file is a bit of a hack, but the idea is good: let the compressor compress all of the source code together, without being distracted by dissimilar files. Ditto for other types.

Compressing a byte stream is not the same as compressing the data. Reordering a tar file doesn't change the meaning, and IMO the big, big gap in compression research is to identify such expression transformations.

Compression algorithms regard the input as god-given, but in reality much input can be reordered or otherwise transformed. The order of the files in a .tar could be reordered if the compressor could say anything about what's beneficial; the order of glyphs in a font could be reordered too. That sort of communication between the compression subsystem and the data source could be a very fruitful area of research.


> I had a hack back in the previous millenium that improved on .tar.gz, compatibly.

FWIW, this is known as the 'sort key trick'. It's been in the folklore, as you say, for quite a while. I give some examples in http://www.gwern.net/Archiving%20URLs#sort---key-compression... for WWW archives.

> Reordering a tar file doesn't change the meaning, and IMO the big, big gap in compression research is to identify such expression transformations.

This hasn't been totally ignored. The problem is that you need to know and be able to specify in advance a useful ordering of input files to make up a better stream or a tool (like 'lrzip', designed to look for long-range correlations) must infer it which seems hard to do without requiring multiple passes; and what's the incentive? In a world in which RAR and ZIP and Gzip are still considered the standards and people aren't even using xz, no one wants to use such a niche tool for anything serious.


1. Well, there are people who serve web fonts and CSS sprite assets to about 10¹²³ people per hour and would go to some lengths to save a few bytes on those, even if it involves niche tools that many others don't bother with.

2. "What's the incentive" is IMNSHO a terrible question for a research area.


> 1. Well, there are people who serve web fonts and CSS sprite assets to about 10¹²³ people per hour

Those people have the problem that the delivery mechanisms are extremely limited and hardwired and a lot of the possible advances cannot be shoehorned into a DEFLATE-compatible bytestream or whatever it is IE6 supports...

> 2. "What's the incentive" is IMNSHO a terrible question for a research area.

Researchers are human too. It's demotivating to try to research fancier data compression techniques if you're certain that no one will ever use it.


Wow, I've been thinking the exact same thing for a long time.

In principle there's no need to limit it to file-level granularity. Take for example a JS minifier-compressor where human-readability isn't a goal. If you're compressing a javascript file with a bunch of top-level functions, it shouldn't matter which order the functions are expressed, thanks to hoisting.

Once you make that jump, it's not even limited to re-ordering things. The minifier-compressor could also consider all possible results of swapping singlequote and doublequote, x['this'] vs. x.this when appropriate, and so on.


Now that's a clever trick. I just tried it on a random subdirectory here (a website source) and it actually made the resulting file larger so it may not always work but the idea makes sense. Wonder why it does not work in this case.


Zlib and most others use dictionaries; the dictionary is adjusted as compression (and decompression) proceeds. When the input data type changes drastically, continuing with the formerly well-adjusted dictionary can actually be worse than issuing a flush and starting from scratch.

I saw the same thing happen during the work on RFC 4978 (IMAP compression), which is why that RFC recommends issuing a flush before and after large attachments. A compression state based on typical IMAP chatter isn't good for compressing a random attachment, and one based on the attachment isn't good for compressing the IMAP chatter.


Check if the original was compressed with a higher level than you just did. Decompress the file and recompress and you should get the same file size.


That's sort of what he was going on about with the Burrows-Wheeler transform, right?


And that's all for lossless compression. For lossy compression, there are lots of unexplored areas.

I still like the FrameFree approach to video compression. Motion estimation decomposes sequences of images into a set of layers which move and morph, much like the way 2D movies are converted to pseudo-3D. This is hard, but can be done, and there are several software packages for it, of varying quality. Decompression is done by morphing each layer between keyframes, warping a texture-mapped mesh in a GPU. Decompression can be done to any frame rate, including ones much higher than the original. This approach is often used to generate ultra-slow motion.

(FrameFree was demonstrated around 2006, but the company behind it, Kerner Optical, went bust for unrelated reasons.)


How is this different from motion estimate in HEVC?


It's not block-oriented. Image components are meshes, and they're filled with textures. Between frames, the meshes change position and shape, and the textures dissolve. It's a full morph.[1][2]

This got to the demo stage, and there were a plug-in and an SDK around 2008. The playback side worked fine, but the compression side was really slow at the time.

[1] https://www.youtube.com/watch?v=RHHfNZFuM0g [2] https://www.youtube.com/watch?v=VBfss0AaNaU


Interestingly, all envogue tech of today like ML and Bitcoin converge around this topic. In essence, PoW is just (partly) figuring out the seed of a RNG, which in return is compression at its finest.

Thus, the author has a point in saying "Basically, once we’re done w/ statistics, we’re done. But this mentality may be one of our biggest faults". Even seemingly random data can be represented by incredibly short descriptions (as in MDL). I'm fascinated by compression since I was a child hacking in x86 assembler.


I was expecting this to be an ignorant rant by someone not steeped in the art, and was all ready to puff my chest out and correct someone saying something wrong on the Internet, but actually this is pretty spot on instead ;)

People interested in data compression may like to browse the http://encode.ru forums, where the experts hang out.


Good questions. Especially where all BWT-cousins are. An amazing algorithm that should have spawned more research and variants by now.


Not sure if this is good place to ask, but is there some Unix utility to presort files, based on content, to feed them into tar or zip in correct order so they would compress better by virtue of being similar?

I think it would be very useful if tar or zip could do that automatically, and it seems like low-hanging fruit in compression.

Actually, thinking about it, the tool could later in more general fashion rearrange the content inside the files. So it would map each file into blocks of similar characteristics that would then be compressed in correct order across files.


I usually use something like the following to do it:

find $DIR -type f | rev | sort | rev | tar -cv -T /dev/stdin -f $OUTPUT.tar

Add a -j or -z or any other compressor you want.

This will sort the files on reverse filename order so that similar sounding files will go together. It's not perfect and I had an idea to write something better but since it's usually enough I didn't really bother.

Only thing that is needed is a program that takes a list of files and sets them in order for tar to do the packing and gzip/bzip2 compress the hell out of the pre-ordered files.


Thanks, I am aware of tricks like that, but it gets tiring really fast, so I don't use that in practice. Why can't they just add another flag to tar to call a program that would produce the ordering, just like they have compressor flags?


Different people will want different sorts for different needs, you can already feed tar the list of sorted files so it only takes another program to provide the specific sorted order for you.

I actually do not really know why they bothered with the -z -j and such compression options when you could already pipe to a compressor of choice.


> Why can't they just add another flag to tar [...]?

You are "they" - you can add that flag. If it turns out that it works well, it'll (probably) be taken and added to whichever `tar` implementations you submit patches for.


goodtar:

    #!/bin/bash

    find $3 type -f | rev | sort | rev | tar -cv $2 -T /dev/stdin -f $1
Now put `goodtar` in `/usr/bin`. Whenever you want to compress stuff, just say:

   goodtar -j output.tar mydir
What's making you tired?


I just created a quick hack which is essentially at this level but I would try to improve upon it, it can be found at:

https://github.com/baruch/sortedtar


I think what you're describing is called the knapsack problem, so I'd guess the answer is no. The best you could do is randomly order the files and compress them, and do that a few times and pick the best ordering.


You can do much better than that: you can use summaries or hashes to get a short version of each file and then compare the hashes for similarity. (Much like using perceptual hashes to find similar images.) A tool that does this is 'binsort'. It's O(N^2) IIRC, because of all the pairwise comparisons, but it's usable for up to a few hundred or thousand files.


First, I am not looking for optimal solution, good approximation is perfectly valid solution.

Second, what you describe is not the best you can do.


Wonder if BWT can be improved (i.e. make it sort better) with a help of extra additional information. So it is not a choice of all (complete sort) or pure BWT but something in between. Maybe group things a bit better (close to a sorted order) but needed a little map/error-correction/otherthing..? to help it decode.

Idea came from seeing Kolmogorov right under BWT's section. Maybe the extra information is a small FSM in a compact representation that works before or after BWT.

Another useful thing in practice is ability to quickly recognize already compressed data. There are enough encoded images, sound and are already fairly compress. Outer compressor can be quicker if it quickly detects that.


I tried to add 1 to the first byte of a file, 2 to the second, 3 to the third... and see how it effect the compression rate.

On a small text file that could normally be compressed by 71% the modified file only got 16%.


Now try it other way around. Compress with 7zip first, then XOR with known randomness and try compressing again. There is a random number generator X out there for data Y that will let you produce low entropy this way.


In my childhood I had the wishful idea that some sort of "1d" fractal could compress huge amounts of data based on much shorter inputs. Assuming the inputs to the fractal could be computed effectively, which I was fairly sure could only be done by brute force which was never going to be useful.



I have started to converge on this space from a different perspective -- thinking about how to design a graph database in terms of discrete topology, which has led me to thinking in terms of the holographic principle: "the description of a volume of space can be thought of as encoded on a boundary to the region". Specifically, I have been exploring how to use shape to succinctly and uniquely define an object/relation in terms of the composition of its constituent parts in a way that enables fast decomposition back into its constituent parts.

For example, picture a spatial structure resulting from a the prime decomposition (https://en.wikipedia.org/wiki/Integer_factorization#Prime_de...) of all the positive 64 bit integers [0 to 2^63-1]. In this space, the prime numbers are the base of this discrete topology. Picture the trees resulting from all the decompositions, with the roots jutting up into space like peaks in a mountain range. This gives the numbers shape and provides a foundation to anchor more abstract objects representations.

To further illustrate the point, we could define a discrete topology with each unicode character (https://en.wikipedia.org/wiki/Unicode#Code_point_planes_and_...) as a base element. Now define an algorithm to decompose strings into its constituent parts. Similarly, we could do the same thing with dates and time, decomposing them into trees, and likewise with geo coordinates in space. Or, we could map all of these structures onto a tree of its binary representation.

Now with these discrete elements as anchor points existing in multiple dimensions, define relations betweens them, and use the elements and relations to compose higher-level objects.

For example, in a property graph (https://github.com/tinkerpop/gremlin/wiki/Defining-a-Propert...), each vertex can contain key/value attributes, and traditionally when adding a new vertex, a vertex ID is assigned using a sequence generator. Edges are composed of two vertices, a label, optional key/value properties, and an edge ID is commonly assigned using a sequence generator. Each vertex and edge is traditionally stored as a row in the database, and traversing the graph is a process that begins with looking up the "start vertex", looking up its outgoing edges, traversing the outgoing edges to the incoming vertices, and continuing this process while using constraints to direct the traversal and filter out vertices and paths. For large graphs distributed across multiple machines, finding optimal ways to partition the graph across multiple machines is a challenge.

What if instead of identifying each vertex and edge by a meaningless integer, we uniquely define each vertex and edge by a compact representation of its constituent parts, like the cells of a body compose to form a person?

For example, consider the key/value properties in a vertex, such as:

  {type: "person", handle: "espeed", location: "Dallas, TX", age: 38, updated: 20151205112207}
...using our discrete topology, each key/value pair is uniquely represented as a point in space, and the set of points are grouped together in a cell. Multiple cells compose together to form a cell complex (https://en.wikipedia.org/wiki/CW_complex). When the set points are in general position, there exists a unique Delaunay triangulation (https://en.wikipedia.org/wiki/Delaunay_triangulation).

Use Delaunay triangulation to encode a unique ID for each vertex, and then use the resulting vertex IDs, label, orientation, and optional key/value properties to compose edges. Edges then compose together to form a graph.

Here's the question I'm exploring: Can we encode a graph in such a way that traversing the graph is reduced to a geometric algorithm?


Summary: Kolmogorov complexity for compression is completely unexplored.

   Orig wordcount: 1325
   TlDr wordcount: 7
   Saved: 99.47%


But is it lossless?


yes! (by definition)


(the above comment is) also a successful application of Kolmogorov complexity to compression


It is not successful because it is not lossless (unless you have a "Colt McAnlis in exactly the right state with that initial condition" oracle).


middle out?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: