
Code golf is at least anecdotal evidence that proving lower bounds on the Kolmogorov complexity of a particular datum requires intelligence.
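To make that concrete: finding a shorter program (the code-golf move) only ever witnesses a tighter upper bound; the hard part is arguing that no shorter program exists. A rough sketch, assuming a Python-ish setting, illustrative only:

    # Illustrative only: an upper bound on K(data) is trivial -- emit a
    # program that prints the data verbatim.
    def trivial_program(data: bytes) -> bytes:
        # K(data) <= len(data) + a small constant
        return b"import sys; sys.stdout.buffer.write(" + repr(data).encode() + b")"

    # A golfed or compressed solution is just a shorter program with the
    # same output, i.e. a tighter upper bound. A *lower* bound says no
    # shorter program exists, and that can't be established mechanically
    # in general.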


I don’t dislike this challenge, and I don’t dislike code golf. What I don’t understand is the notion that compressing a 100MB Wikipedia file somehow requires an “intelligent” compression algorithm. If the challenge were to compress any English text down to a certain ratio, then perhaps making the algorithm behave “intelligently” would be a requirement. But as far as I can tell, this challenge is asking for a lossless reproduction of one specific chunk of data, so it seems to me that we just have to hope that, on a given platform, there exists a 15MB executable that outputs that exact chunk of data. One might exist, but one might not. And you probably won’t even be able to prove the latter, because of the halting problem (technically it doesn’t apply to a computer with finite memory, but in practice no one will wait long enough).
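For scale, the “dumb” existence proof is just to embed a conventionally compressed copy of the file and write it back out — a rough sketch, assuming a hypothetical bundled file name:

    # Illustrative only: a program whose output is exactly one specific
    # chunk of data, built by embedding a bz2-compressed copy of it.
    import bz2, sys

    with open("enwik8.bz2", "rb") as f:   # hypothetical bundled file
        sys.stdout.buffer.write(bz2.decompress(f.read()))

General-purpose compressors leave that embedded copy somewhere around a third of the original 100MB, so the open question is whether the gap down to ~15MB (decompressor included) can be closed for this particular chunk.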


> one specific chunk of data

Not just any chunk though. It (presumably) contains a lot of structure that encodes information about [English representations of] the real world.

It's not random, nor is it simple, nor is it an image of a fractal (maybe?) etc.

A system that is able to compress it well makes good predictions about what the data contain, which indicates "intelligence" about the structure of the (encoded) world.
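The technical link, roughly: fed into an (idealized) arithmetic coder, a model that assigns probability p to the symbol that actually occurs pays about -log2(p) bits for it, so "predicts Wikipedia well" and "compresses Wikipedia well" are the same claim up to coder overhead. A minimal sketch, illustrative only:

    import math

    def ideal_compressed_bits(data: bytes, predict) -> float:
        # predict(prefix) returns a dict mapping next byte -> probability
        total = 0.0
        for i, b in enumerate(data):
            p = predict(data[:i]).get(b, 1e-9)  # better prediction => fewer bits
            total += -math.log2(p)
        return total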


But would you expect that program to itself be useful for anything else? It literally just takes no input and outputs that one specific Wikipedia dump. It’s not like I can tweak it slightly and have it output new predictions of things that sound like Wikipedia articles.

Unless the idea is that whatever methodology is required to build this particular program could then be adapted to do other things that seem intelligent. That seems possible, but not at all likely given my reading of the challenge requirements.


See the links in my other comment in the thread for all the gory details.

> But would you expect that program to itself be useful for anything else?

That exact program, no, but the process of creating that program would be educational, both the compressor that makes it and the process of making the compressor.

Another way of putting it is in terms of the "Twenty Questions" game. Which pre-selected set of twenty questions allows you to "cover" the most of the Universe of phenomena? If one set of 20 allows you to distinguish more of the Universe than some other set, then the first set is somehow a better model of the Universe. Does that help?
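Back-of-the-envelope, in case it helps (illustrative numbers only): 20 yes/no answers can distinguish at most 2^20 = 1,048,576 outcomes, and a question set gets closest to that bound when each answer splits the remaining candidates roughly evenly.

    import math

    print(2 ** 20)  # 1048576 distinguishable outcomes at best

    # Bits of information gained by a question that splits candidates p : (1 - p)
    def question_bits(p: float) -> float:
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    print(question_bits(0.5))  # 1.0 bit -- the best a single question can do
    print(question_bits(0.9))  # ~0.47 bits -- a lopsided, less useful question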


> If one set of 20 allows you to distinguish more of the Universe than some other set, then the first set is somehow a better model of the Universe.

Is Akinator’s 20q database available as a dataset for non-commercial research purposes? It’d be neat to attempt to compress it in a way where training on one half predicts the other half at the smallest trained model size.


The point of the challenge is to actually create a model that fits in 15MB together with its compressed data. Not just to prove its existence.

One of the best contenders is the open-source cmix, which, among other tricks, uses LSTM neural networks to predict the next character/bit: http://www.byronknoll.com/cmix.html
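For flavor, the context-mixing idea behind compressors like cmix can be sketched in a few lines — this is a drastically simplified illustration, not cmix's actual code, which is C++:

    import math

    def stretch(p): return math.log(p / (1.0 - p))    # probability -> logit
    def squash(x):  return 1.0 / (1.0 + math.exp(-x)) # logit -> probability

    # Several models each predict P(next bit = 1); a learned mixer combines
    # them in the logistic domain.
    def mix(probs, weights):
        return squash(sum(w * stretch(p) for p, w in zip(probs, weights)))

    # e.g. an n-gram model, a word model, and an LSTM each vote:
    print(mix([0.9, 0.6, 0.8], [0.5, 0.2, 0.3]))

The mixed probability then drives an arithmetic coder, and (roughly speaking) the mixer weights are updated after every bit.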


Seems like cmix compresses better than the winners, but uses far too much memory to be eligible for the prize.


I had to search out the size of the decompressor on a different page, but it proves you right. The compressed file is ~400KB smaller than the record holder and the cmix decompressor is only ~200KB.

I wonder what the memory/speed/efficiency tradeoffs look like for that kind of design.



