Hutter Prize for compressing human knowledge (hutter1.net)
213 points by kelseyfrog on Sept 13, 2023 | 215 comments



I think the mistake here is to require lossless compression.

Humans and LLMs only do lossy compression. I think lossy compression might be more critical to intelligence. The ability to forget, change your synapses or weights, is crucial to being able to adapt to change.


Some background before I explain why your suggestion is "not even wrong". If you can predict the next X bits given the previous Y bits of observation, you don't need to store the next X bits: the decompressor can just run the same prediction algorithm you used and write out its predictions without correction. This is the same as general-purpose AI at a high level. If you can predict the next X bits given the previous Y bits of observation, you can make a reasoned decision ("choose option A now so that the next X bits have a certain outcome").

The above is actually the premise of the competition and the reason it exists. What I've said above is laid out in detail in academic papers by Marcus Hutter et al., who runs this competition. That is, lossless compression can be a scorecard of prediction which is the same as AGI.

Now, saying "they should just make it lossy" misses the point. Do you know how to turn lossy data into lossless? You store some data every time your prediction is wrong, i.e. store data wherever there is loss, to make the result lossless. This is arithmetic coding in a nutshell. You can turn any lossy predictor into a lossless compressor with arithmetic coding, and it will require more data the more loss you have. The lossless requirement gives us a scorecard of how well the lossy prediction worked.

If you ask for this to be lossy compression, you throw out that scorecard and bring in a whole lot of unwanted subjectivity.
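To make the scorecard idea concrete, here is a rough sketch of the accounting (the function names and the toy predictor are made up purely for illustration; real entries are far more sophisticated). The ideal arithmetic-coding cost of a symbol is about -log2(p), where p is the probability the predictor assigned to the symbol that actually occurred, so better predictions mean fewer correction bits:

    import math

    def code_length_bits(text, predict):
        # Ideal arithmetic-coding cost: sum of -log2(p) over the symbols that
        # actually occur, where p comes from the shared prediction algorithm.
        total = 0.0
        for i, ch in enumerate(text):
            p = predict(text[:i], ch)   # P(next char = ch | context so far)
            total += -math.log2(p)
        return total

    def toy_predict(context, ch):
        # Purely illustrative numbers: confident after "q", uninformed otherwise.
        if context.endswith("q") and ch == "u":
            return 0.95               # ~0.07 bits to confirm a good prediction
        return 1.0 / 27               # ~4.75 bits for an uninformed guess

    print(code_length_bits("queue", toy_predict))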


> Some background before I explain why your suggestion is "not even wrong".

The phrase you quote is generally used to imply a stupid or unscientific suggestion. Your subsequent comments about what you think AGI is carry a certitude that isn't warranted.

It's good that you are trying to supply knowledge where you think it is lacking, and I understand there are fora where this sort of public-school lecturing is amusing, but I think your tone is misplaced here.


Years-long compression challenge with dozens of geniuses participating: exists.

Random person on the internet: let me improve this thing I've never heard of by using the one fact I know about compression (there are two kinds).

It's absolute hubris and a waste of everyone's time to chime in with low value, trash comments like "they should make it lossy". It's not unreasonable at all to take a snarky tone in response. "not even wrong" absolutely applies here, and they carefully, patiently, and in great detail explained why.


I often feel the same way when discussions pop up here or on other forums, about topics I'm familiar with. Like randos declaring that researchers in deep learning are "obviously doing it wrong" and they should instead do X, where X is like an entire subfield existing for years with a lot of activity, etc.

So I get where you're coming from. But I'd suggest that a place like HN is in fact a place for random people to inject their half-baked takes. It is just a discussion board where lots of the comments will be uninformed or wrong. Take it or leave it. If you want something else, you need to find more niche communities that are - by their nature - more difficult to find and less public, including IRL discussion, clubs, conferences etc. But it has its use: we (you and me) can jump into any thread and type out what we think after 2 minutes and get some response. But of course someone even more novice might think that we know more than just that 2-minute consideration, and they learn our junk opinion as if it were the result of long experience. It's unavoidable, since nobody knows who the rest of the commenters are.

Online discussions are incredibly noisy, and often even the people who seem to use the jargon and seem knowledgeable to the outsider can be totally off-base and essentially just imitate how the particular science or field "sounds like". Unfortunately, you only learn this gradually and over a long time. If you learn stuff through forums, Reddit, HN, blogs, substacks etc. it can be very misleading from the first-person experience because you will soak up lots of nonsense as well. Reading actual books and taking real courses is still very much relevant.

HN and co. are more like the cacophony of what the guy on the street thinks. Very noisy, and only supposed to be a small treat on top of rigorous study. You shouldn't expect to see someone truly breaking new ground in this comment thread. If it disturbs you, you can skip the comments. But trying to "forbid" it, or gatekeep, is futile. It's like trying to tell people in a bar not to discuss how bad the soccer team's coach is, because they don't really have the relevant expertise. Yeah, sure, but people just wanna chat and throw ideas around. It's on the reader to know not to take it too seriously.


ISWYM but the problem isn't really people making suggestions, but the way they make suggestions that grates. If Mr Internet-random-guy wants to introduce the issue of lossy compression then by all means ask why not, but don't say they should.

It comes across as arrogance, probably because it is, and then it sucks up plenty of the time of others who actually do know the subject, putting things right.

Even more bloody annoying is when people ask questions where even the most cursory web search would get the answer. Wikipedia is usually a very good place to start. I guess that for these people, the cost is externalised to others' wasted time.

Then again, we all take turns at being the stupid one, so who am I to complain.


It's inherent in reading comments. And it's also inherent in encountering mere mortals in the real world. And remember how the most annoying and stupid people keep going on about how all the people they meet are stupid and annoying. There's no point in piling on another layer. Close the tab, or comment constructively and charitably. Else you end up with stuff like the badX subreddits (badhistory, badphilosophy), which get their adrenaline/dopamine fix by seeking out explanations they see as ignorant/naive/arrogant and sneering at them while self-aggrandizing and feeling like they are in the inner circle who know it all.

The other thing is, you never see all the people who do go to Wikipedia, google or check a book. They won't comment "Hello I'm not commenting now because I went to Wikipedia". They just don't comment.

And Cunningham's Law states "the best way to get the right answer on the internet is not to ask a question; it's to post the wrong answer."

People are more prone to comment out of frustration than other feelings.


You may find the book "The Structure of Scientific Revolutions" interesting if you have not read it. The author posits that it isn't the people entrenched in a field who will offer breakthrough advancements, but instead outsiders looking in.

More often than not, they are aggressively rebutted, which leads to the belief that science progresses one funeral at a time. Perhaps it is you who needs to guard against hubris?

https://en.wikipedia.org/wiki/The_Structure_of_Scientific_Re...


I disagree with you; therefore by your own logic, I am right.


I wish I could claim the logic, but I am a nobody who knows nothing. The author of the book, however, has impressed people for over 70 years with this line of thought and I too agree with him.


It depends on your goal.

If your goal is compressing human knowledge, then you do want to avoid wasting bits on the details of wording that were random chance.

The problem is inability to objectively judge such a compression, not the mere fact that it won't be bit-perfect.

It is not "not even wrong".


You didn't even read the thread though?

All the methods in this competition that are competitive DO use lossy compression.


I did read the thread. I'm objecting to the idea that stacking lossless corrections onto lossy compression and then measuring the total is a good way to measure what we want to measure here, wrt human knowledge. It may be the best we have, but it's not good.

See my other comment here, that I probably shouldn't have put in a reply to a reply to a dead comment: https://news.ycombinator.com/item?id=37506099


Why should we care what you think, though? I'm not being nasty, but unless you have a reputation in the field you have to give some cogent argument, or it's just some random, possibly uninformed opinion.


I’m sorry you see it as a trash comment, but I hope the discussion it provoked was to your liking.

I found it enlightening.


There is nothing wrong with jumping in and saying stuff that probably isn't right on an internet forum! That doesn't make it valuable or insightful, it's just how forums work. Your comment was cool with me and you shouldn't feel like you need to change at all.

That said, the response to your comment was insightful and made interesting points. You did in fact kick off a very interesting conversation!


The phrase ["not even wrong"] is generally used to imply a stupid or unscientific suggestion.

An unscientific suggestion is exactly what was offered. Forgive my ignorance, but why was the tone of the message misplaced, and what was the tone?


I feel your tone-policing is misplaced here. When someone is that ignorant or foolish they should be told so, so that they can hopefully recalibrate or something. There's enough inchoate nonsense on the Internet that I appreciate the efforts to keep discussion at a little higher level here.


> There's enough inchoate nonsense on the Internet

I agree 100%. But the top-level comment is not an example of such.

However, the reply in question – and your comment – are certainly examples of the kind of tone-deaf, needlessly aggressive, hostile, confrontational, and borderline malicious posts I wish I could cleanse the Internet of wholesale.


Thanks for the feedback, I disagree.

> tone-deaf, needlessly aggressive, hostile, confrontational, and borderline malicious posts I wish I could cleanse the Internet of wholesale.

I have never said this before, but maybe you're a little too sensitive?

In any event, I feel your characterization of my comment borders on ad hominem and certainly it seems to violate the site guideline to interpret comments charitably.

Good day.


> When someone is that ignorant or foolish ...

Dude. Calling people "ignorant or foolish" isn't exactly great either. :(


first two comments are the classic HN banter I’m here for, next two have stink on them I do not like


Banter shouldn't be insulting for no good reason.


Banter by definition is teasing, which is insulting. That, along with being humorous in context and not meant to be taken seriously, is what makes it banter rather than any other form of exchange.

Let's just go to the dictionary:

Banter (n): the playful and friendly exchange of teasing remarks.

Teasing (adj): intended to provoke or make fun of someone in a playful way.


And in specific, "not even wrong" is a total shutdown, not playful.


Oh I agree 100%, that wasn't banter, it was a grumpy shutdown.


Is there any evidence that arithmetic coding works at the level of concepts?

As a thought experiment, suppose Copernicus came up with this Hutter prize idea and declared that he would provide the award to whoever could compress the text of his book on the epicycle-based movement of planets around the sun (De revolutionibus orbium coelestium).

Today we can explain the actual motion to high accuracy with a single sentence that would have been understandable in that age: "A line drawn from the sun to any planet sweeps out equal areas in equal times."

This however is mostly useless in attempting to win the Copernican Hutter prize. Predicting the wording of some random human's choosing (especially at length) is very far removed from the ability to predict in general.


Arithmetic coding isn't the key thing here. That's just a bit-wise algorithm. You predict the next bit is a 1 with 90% certainty? Then you don't need to store much data with arithmetic coding. That's all arithmetic coding is here.

What you're getting at is the 'predictor' that feeds into the arithmetic coder, and that's wide open and can work any way you want it to. LLMs absolutely have context, which is similar to what you're asking about, and they are good predictors of output given complex input (pass GPT a mathematical series and ask it what comes next; if it's right, that's really helpful for compression, as you wouldn't need to store the whole series in full).
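As a minimal sketch of that split (an adaptive toy predictor standing in for the LLM; the names are hypothetical and not from any winning entry), the decompressor runs the same model with the same updates, so only the per-symbol -log2(p) cost has to be stored:

    import math
    from collections import Counter

    class Order0Predictor:
        # Adaptive order-0 model: P(bit) estimated from counts seen so far.
        def __init__(self):
            self.counts = Counter({0: 1, 1: 1})  # Laplace smoothing

        def prob(self, bit):
            return self.counts[bit] / sum(self.counts.values())

        def update(self, bit):
            self.counts[bit] += 1

    def ideal_compressed_bits(bits):
        model = Order0Predictor()
        total = 0.0
        for b in bits:
            total += -math.log2(model.prob(b))  # what the arithmetic coder pays
            model.update(b)                     # decoder makes the same update
        return total

    print(ideal_compressed_bits([1] * 90 + [0] * 10))  # skewed stream: well under 100 bits
    print(ideal_compressed_bits([0, 1] * 50))          # 50/50 stream: about 1 bit per symbol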


So you don't think that a wiki article about this could be made smaller by encoding this info in a few logical steps, and adding some metadata on what kind of sentence should follow what? The decompressing part is the AI part. Where to place a comma can be added at constant cost for any contending program.


Sure, my point is more that this comma-adding business dwarfs any real prediction.

Suppose there were a superintelligence that figured out the theory of everything for the universe. It's unclear that this would actually help with the task. You could likely derive things like gravitation, chemistry, etc., but the vast majority of your bits would still be spent attempting to match the persona and wording of the various Wikipedia authors.

This superintelligence would be masked by some LLM that is slightly better at faking human wording.


But that comma will have the exact same price between 2 contending lossy compressions. In fact, it is a monotonic function of the difference, so the better your lossy compression is, the smaller your arithmetic correction will be, making you measure the correct thing in an objective way.

It’s like, smart people have spent more than 3 minutes on this problem already.


> But that comma will have the exact same price between 2 contending lossy compressions.

Why do you think that? Do you have proof of that?

> making you measure the correct thing

If we're trying to measure knowledge, then the exact wording is not part of being correct.

Very often you will have to be less correct to match Wikipedia's wording. A better lossy encoding of knowledge would have a higher cost to correct it into a perfect match of the source.


> Why do you think that? Do you have proof of that?

You want to encode “it’s cloudy, so it’ll rain”. Your lossy, intelligent algorithm comes up with “it is cloudy so it will rain”. You save the diff and apply it. If another, worse algorithm can only produce “it’s cloudy so sunny”, it will have to pay more in the diff, which scales with the number of differences between the produced and original string.

You can be less correct if that cumulatively produces better results; that's the beauty of the problem. The last "mile" difference is the same cost for everyone, as a function of the difference.
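A crude way to see that scaling, using the strings from above and a plain text diff as a stand-in for a real arithmetic-coded correction (the helper name is made up):

    import difflib

    original   = "it's cloudy, so it'll rain"
    good_guess = "it is cloudy so it will rain"
    bad_guess  = "it's cloudy so sunny"

    def correction_size(guess, target):
        # Rough proxy for the "last mile" cost: characters a diff must insert
        # or replace to turn the lossy guess back into the original.
        sm = difflib.SequenceMatcher(None, guess, target)
        return sum(j2 - j1 for op, _, _, j1, j2 in sm.get_opcodes()
                   if op in ("replace", "insert"))

    print(correction_size(good_guess, original))  # small correction
    print(correction_size(bad_guess, original))   # noticeably larger correction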


But that's just two random examples.

How about "it is cloudy so it will rain" and "it's cloudy, so sunny"? Then since we're looking at the commas for this argument, the second algorithm is paying less for comma correction even though it's much wronger.

You seem to be assuming that a less intelligent algorithm is worse at matching the original text in every way, and I don't think that assumption is warranted.

I'll rephrase the last line from my earlier post: What if wikipedia is using the incorrect word in a lot of locations, and the smart algorithm predicts the correct word? That means the smart algorithm is a better encoding of knowledge, but it gets punished for it.

In that case the last mile cost is higher for a smart algorithm.

And even when the last mile cost is roughly the same, the bigger of a percentage it becomes, the harder it is to measure anything else.

And it shuns any algorithm that's (for example) 5% better at knowledge and 2% worse at the last mile, even though such a result should be a huge win. There are lots of possible ways to encode knowledge that will drag things just a bit away from the original arbitrary wording. So even if you use the same sub-algorithm to do the last mile, it will have to spend more bits. I don't think this is an unlikely scenario.


> Then since we're looking at the commas for this argument, the second algorithm is paying less for comma correction even though it's much wronger.

And? It will surely have to be on average more correct than another competitor, otherwise its size will be much larger.

> What if wikipedia is using the incorrect word in a lot of locations,

Then you write s/wrongword/goodword for a few more bytes. It won't be a deciding factor, but to beat trivial compressions you do have to be smarter than plainly looking at the data - that's the point.

> And it shuns any algorithm that's (for example) 5% better at knowledge and 2% worse at the last mile

That's not how it works. With all due respect, much smarter people than us have been thinking about this for many years - let's not try to make up reasons why it's wrong after thinking about it badly for 3 minutes.


> And? It will surely have to be on average more correct than another competitor, otherwise its size will be much larger.

It's possible to have an algorithm that is consistently closer in meaning but also consistently gets commas (or XML) wrong and pays a penalty every time.

Let's say both that algorithm and its competitor are using 80MB at this stage, before fixups.

Which one is more correct?

If you say "the one that needs fewer bytes of fixups is more correct", then that is a valid metric but you're not measuring human knowledge.

A human knowledge metric would say that the first one is a more correct 80MB lossy encoding, regardless of how many bytes it takes to restore the original text.

> Then you write s/wrongword/goodword for a few more bytes. It won't be a deciding factor

You can't just declare it won't be a deciding factor. If different algorithms are good at different things, it might be a deciding factor.

> That's not how it works. With all due respect, much smarter people than us have been thinking about this for many years - let's not try to make up reasons why it's wrong after thinking about it badly for 3 minutes.

Prove it!

Specifically, prove they disagree with what I'm saying.


You could still come up with a scorecard for lossy compression, for example area under the rate distortion curve. It would be very hard to calculate.


Lossless compression _is_ a scorecard for lossy compression. Thanks to the ideas from arithmetic coding.


There’s no way I could reproduce a calculus textbook verbatim, but I can probably prove all the important theorems in it.

Even then, given half of any sentence in the book, I don’t rate my chances of reproducing the next half. That’s more a question of knowing the author’s style than knowing calculus itself.

Does arithmetic coding capture all of that?


Let's say you would come up with the exact same text given an initial seed (say, what you ate that day). Then it really is just an arithmetic function of the difference, as it is deterministic.

Now, who will get closer to the original with that added arithmetic coding? You, knowing the proofs, or a random guy who doesn't even speak the language? Doesn't it then measure intelligence, all else being equal?


Yep


Or just require the results to be 99.99 percent accurate (within some edit distance)


If it's 99.99% accurate, arithmetic coding would need to store next to no data.

Arithmetic coding is optimal at turning probabilistic predictions into lossless data. There's provably no way to do it more efficiently than arithmetic coding. The data it needs for corrections is smaller the better the predictions are.

So given this, why even dwell on ways that add any form of subjectivity? Arithmetic coding is there. It's a simple algorithm.


"... lossless compression can be a scorecard of prediction which is the same as AGI."

Having not read the papers, this sentence strikes me as a bit of a leap. Maybe for very constrained definitions of AGI?


There's a section in the link above, "Further Recommended Technical Reading relevant to the Compression=AI Paradigm", and they define it in a reasonably precise mathematical way. It's well accepted at this point. If you can take input and predict what will happen given some options, you can direct towards a certain goal. This ability to direct towards a goal effectively defines AGI. "Make paperclips", and the AI observes the world, works out what decisions need to be made to optimize paperclip output, and then starts taking decisions to output paperclips - that is essentially what we mean by AGI, and prediction is a piece of it.

I have no stake in this btw, I've just had a crack at the above challenge in my younger days. I failed but I want to get back into it. In theory a small LLM without any existing training data (for size) that trains itself on the input as it passes predictions to an arithmetic coder that optimally compresses, with the same process on the decompression side, should work really well here. But I don't have the time these days. Sigh.


> This ability to direct towards a goal effectively defines AGI

No it doesn't, though it may be argued to be a requirement.

That's the point of the previous commenter - that you are making unjustified assertions using an extrapolation of the views of some researchers. Reiterating it with a pointer to why they believe that to be the case doesn't make it more so.

If that's your favoured interpretation, fine, but that's all it is at this point.


Hey, don't pin this on me, it's not my assertion.

Go argue with the scientists who state pretty much what I just said verbatim, including full links with proofs, in http://prize.hutter1.net/hfaq.htm#ai :)

> One can prove that the better you can compress, the better you can predict; and being able to predict [the environment] well is key for being able to act well. Consider the sequence of 1000 digits "14159...[990 more digits]...01989". If it looks random to you, you can neither compress it nor can you predict the 1001st digit. If you realize that they are the first 1000 digits of π, you can compress the sequence and predict the next digit. While the program computing the digits of π is an example of a one-part self-extracting archive, the impressive Minimum Description Length (MDL) principle is a two-part coding scheme akin to a (parameterized) decompressor plus a compressed archive. If M is a probabilistic model of the data X, then the data can be compressed (to an archive of) length log(1/P(X|M)) via arithmetic coding, where P(X|M) is the probability of X under M. The decompressor must know M, hence has length L(M). One can show that the model M that minimizes the total length L(M)+log(1/P(X|M)) leads to best predictions of future data. For instance, the quality of natural language models is typically judged by its Perplexity, which is equivalent to code length. Finally, sequential decision theory tells you how to exploit such models M for optimal rational actions. Indeed, integrating compression (=prediction) into sequential decision theory (=stochastic planning) can serve as the theoretical foundations of super-intelligence (brief introduction, comprehensive introduction, full treatment with proofs).
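A tiny worked restatement of that two-part trade-off, with made-up numbers (this is just the L(M) + log2(1/P(X|M)) formula from the quote, written with log-probabilities so the values stay manageable):

    # Two-part (MDL-style) code length: L(M) + log2(1/P(X|M)).
    def total_code_length(model_bits, log2_prob_of_data):
        return model_bits - log2_prob_of_data

    # Illustrative numbers only: a tiny model that predicts the data poorly
    # versus a bigger model that predicts it much better.
    small_weak = total_code_length(model_bits=1_000, log2_prob_of_data=-50_000)
    big_strong = total_code_length(model_bits=20_000, log2_prob_of_data=-10_000)

    print(small_weak)   # 51000 bits
    print(big_strong)   # 30000 bits: wins despite the larger L(M)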


> Hey, don't pin this on me, it's not my assertion.

But it is your assertion, wherever you've picked up the idea from.

> ...which is the same as AGI...

> ...effectively defines AGI...

No it isn't, and no it doesn't. Your language is too strong in its claims.


Whether or not you agree, a lot of people do. There is a trivial sense in which a perfect compression algorithm is a perfect predictor (if it ever mispredicted anything, that error would make it a sub-optimal compressor for a corpus that included that utterance), and there are plenty of ways to prove that a perfect predictor can be used as an optimal actor (if you ever mispredicted the outcome of an event worse than what might be fundamentally necessary due to limited observations or quantum shenanigans, that would be a sub-optimal prediction and hence you would be a sub-optimal compressor), a.k.a. an AGI.

Where a lot of us get off the fence is when we remove "perfect" from the mix. I don't personally think that performance on a compression task correlates very strongly with what we'd generally consider as intelligence. I suspect good AGIs will function as excellent compression routines, but I don't think optimizing on compression ratio will necessarily be fruitful. And I think it's quite possible that a more powerful AGI could perform worse at compression than a weaker one, for a million reasons.


If you had a perfect lossless compressor (one that could compress anything down to its fundamental Kolmogorov complexity), you would also definitionally have an oracle that could compute any computable function.

Intelligence would be a subset of the capabilities of such an oracle.


A universal Turing machine can compute any computable function. That doesn't make it intelligent, because it won't without direction.


Then expand your target to a universal Turing machine plus instructions to compute any computable function. Do you consider it intelligent then?


No, because instructions to compute any computable function for a UTM only require a tiny set of instructions on how to generate every possible permutation over ever-increasing lengths of tape.

It will run forever, and I would agree that in that set there will be an infinite number of functions that when run would be deemed intelligent, but that does not make the computer itself intelligent absent first stumbling on one of those specific programs.

EDIT: Put another way, if the potential to be made to compute in a way we would deem intelligent is itself intelligence, then a lump of random particles is intelligent because it could be rearranged into a brain.


The analogy is not a random lump of particles but all particles in all configurations, of which you are a subset. Is the set of you plus a rock intelligent?


This is a fundamentally flawed argument, because a computer is not in all the states it can execute at once, and naively iterating over the set of possible states might well not have found a single intelligence before the heat death of the universe.

If we were picking me out of an equally large set of objects, then I'd argue that no, the set is not meaningfully intelligent, because the odds of picking me would be negligible enough that it'd be unreasonable in the extreme to assign the set any of my characteristics.


[flagged]


Sorry, but I hope my long-winded explanation explains it.

In computer science we have a way to score how good lossy compression is. That way is to make it lossless and look at how much data the arithmetic coder needed to correct it from lossy to lossless.

This is a mathematically perfect way to judge (you can't correct lossy data any more efficiently than an arithmetic coder). All the entries here do in fact make probabilistic predictions on the data and they do all use arithmetic coding. So the suggestion misses a key point of CS involved here. I don't mean to be rude about it but the idea does need correcting.


> This is a mathematically perfect way to judge

Only if you're using a very particular and honestly circular-sounding definition of "good".

Some deviations are more important than others, even if you're looking at deviations that take the same amount of data to correct.

Think about film grain. Some codecs can characterize it, remove it when compressing, and then synthesize new visually matching grain when decompressing.

Let's say it takes a billion bytes to turn the lossy version back into the lossless version.

The version with synthetic film grain still needs a billion bytes or maybe even slightly more bytes, even if the synthetic grain is 95% as good as real grain. The cost to turn it lossless is the wrong metric.


Your comment is far worse, lol. His comment actually supports the GP's suggestion, just from a different angle.

It looks like LLMs are like compression algorithms, with strengths and weaknesses in different things.

Losslessness doesn't always equate to usefulness. But yeah, maybe a different competition for this.


Good lossy compression can be used to achieve lossless compression.

(information theory)

The more accurate the lossy compression is, the smaller the difference between the actual data (lossless) and the approximation. The smaller the difference, the fewer bits required to restore the original data.

So a naive approach is to use the LLM to approximate the text (this would need to be deterministic: zero temperature with a preset seed), then subtract that from the original data. Store this diff, then add it back to restore the original data bit-for-bit.
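A minimal sketch of that pipeline (the deterministic_model stub stands in for a zero-temperature LLM with a fixed seed; the diff format here is deliberately naive):

    def deterministic_model() -> str:
        # Stand-in for "run the LLM deterministically and take its output".
        return "The quick brown fox jumps over a lazy dog."

    def compress(original: str, predicted: str):
        # Store only the positions where the prediction is wrong, plus any tail
        # the prediction didn't cover, plus the true length.
        fixes = [(i, c) for i, (c, p) in enumerate(zip(original, predicted)) if c != p]
        return len(original), fixes, original[len(predicted):]

    def decompress(predicted: str, length: int, fixes, tail: str) -> str:
        chars = list(predicted[:length])
        for i, c in fixes:
            chars[i] = c
        return "".join(chars) + tail

    original = "The quick brown fox jumped over the lazy dog!"
    predicted = deterministic_model()
    length, fixes, tail = compress(original, predicted)
    assert decompress(predicted, length, fixes, tail) == original  # bit-for-bit
    print(len(fixes), "character corrections; a better model needs fewer")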


In psychoacoustics, one of the most important aspects of how lossy compression works is that they throw away sounds that a human can't even hear, because it's too short or subtle to be noticed. This is the difference between data and knowledge. Someone with perfect pitch and a good memory knows exactly what Don't Stop Me Now by Queen sounds like, and it's a lot smaller than digitizing an LP record in mint condition.

This person can cover that song, and everyone will be happy, because they have reproduced everything that makes that song what it is. If anything we might be disappointed because it's a verbatim reproduction (for some reason we prefer cover songs to introduce their own flavor).

If I ask you to play Don't Stop Me Now and you sound like alcoholic Karaoke, you haven't satisfied the request. You've lost. Actually we've all lost, please stop making that sound, and never do that again.


But that's the good thing about this competition: it is fair to everyone. The perfect-pitch version can just use a tiny amount of additional data to decompress to the same thing, while the alcoholic will have to correct for loads of differences, making that version much larger.

This difference is the same monotonically increasing function in both cases, so you basically don't have to care about it - you can fairly compare the lossy versions as well, and you will get the same result.

So the more advanced version wins, and the competition remains fair and unambiguous (otherwise, did perfect pitch A or perfect pitch B have the better cover?).


Exactly right. There would be no confusion if the Hutter prize was a competition for compressing human data.

This is a parallel issue to the ones in conversations around understanding. There's a massive gulf between being able to build an ontology about a thing and having an understanding of it: the former requires only a system of categorizing and relating elements, and has nothing to do with correctness per se; the latter is an effective (note: lossy!) compression of causal factors which generate the categories and relations. It's the difference between "Mercury, a star that doesn't twinkle, orbits Earth, the only planet in existence, and we have inferred curves that predict why it goes backwards in its orbit sometimes" and "Mercury and Earth, planets, both orbit the Sun, a star, and all stable orbits are elliptical".


How is the sum total of wikipedia not human data?


I think Wikipedia is human data. But to say it's human knowledge (that is, that data is synonymous with knowledge in any context) is actually a pretty hardline epistemological stance to take, one that needs to be carefully examined and justified.


Completely agree, and I think this is where a lot of the confusion in the thread is coming from. The prize claims to compress “human knowledge” but it’s actually a fairly rigid natural language text compression challenge.

I’d conjecture that the algorithms that do well (as measured by compression ratio) on this particular corpus will also do well on corpuses that contain minimal ‘knowledge’.

Compression algorithms that do well might encode a lot of data about syntax, word triple frequencies, and Wikipedia editorial style, but I really doubt they’ll encode much “knowledge”.


It is not the sum-total of human knowledge surely. But that it is part of that knowledge is not a controversial statement in my opinion.


I didn't say it's the sum total anywhere nor did I imply that in any way. I also didn't say it doesn't represent some aspect of human knowledge.

In set notation, what I am saying in above comments is

    ({data} ∪ {knowledge}) \ ({data} ∩ {knowledge}) ≠ ∅
while you are simply assuming (without a good reason) that I am saying

    {data} ∩ {knowledge} = ∅
If you look up the chain with the example of singing a Queen song, maybe that will help you better understand the angle that that commenter and I share about the differences between data and knowledge.


One of the big problems in this use case is that a dump of wikipedia contains both knowledge and arbitrary noise. And lossless compression has to preserve every single bit of both. It's hard to tease out "good lossy compression" from the mess, because better and better lossy compression doesn't get you arbitrarily close to the original, it only gets you somewhat close.


Who downvoted you? This is correct; the better the lossy compression, the better the lossless compression as well.


That's true until the lossy compression alters the data in a way that actually makes it harder to represent. As an example, a bitstream that 'corrects' a decompressed .MP3 file to match the original raw waveform data would be almost as difficult to compress as the raw file itself would be.

It wouldn't be a matter of simply computing and compressing deltas at each sample, because frequency-domain compression moves the bits around.


The problem with lossy compression in a competition is that in order to compare different approaches to lossy compression you have to define a metric for the quality of different lossy reconstructions of the data. This is practically impossible to do in an unbiased way. As soon as you define a quality metric, the competition becomes gaming the metric instead of useful compression.

So how do you define a metric that can't be cheated? You add an arithmetic coder after your lossy compressor, which turns it into a lossless compressor, and the size of the losslessly compressed data is your metric for the quality of your lossy compressor. It's the only metric that definitively can't be cheated.


Feel free to start your own competition where people are awarded money for a system that correctly answers some set of questions, divided by the amount of storage that system needs.

Sounds like a tough competition to run objectively, not that that makes it less worth doing, but I can see why the parameters were chosen as they were in 2006.


Ted Chiang actually explores this concept in his article on ChatGPT and other LLMs https://www.newyorker.com/tech/annals-of-technology/chatgpt-...


Lossy compression is lossless over the information you decided to care about.

You may decide that the sound outside the human range of hearing isn't interesting. Or that the colours in your camera sensor that are just thermal noise can be safely ignored. Then you can throw that away and do lossy compression.

What you care about, an algorithm can't answer for you. Sometimes you may want to keep the inaudible sounds in a recording to better understand the audible ones. Sometimes you may be glad you kept the noise in a picture as it allowed you to later identify the model of camera.

There's no context-free answer to what matters. There's no subject-free answer to what matters, there's always a "who" it matters to.


True, but then I would argue (as a layman) that AGI would need to be able to guess what matters in a given context to really be AGI.


Matters to who?

To me? In that case, it sometimes does, sometimes doesn't, depending on how lazy I've been in constraining to the algorithm what I might want.

To itself? Then who can say if it isn't doing that already?


See a related benchmark: http://www.mattmahoney.net/dc/text.html

In particular, Fabrice Bellard is leading with a transformer model on "enwik9", which is 10 times larger than the enwik8 file used in the Hutter Prize. It's doing quite well on enwik8 too, but perhaps the issue is that the "economy of scale" of on-the-fly training hasn't kicked in yet on the smaller benchmark.

Transformers are definitely lossy, but afaict Bellard's entry uses the probabilities generated by the model to create an encoding that uses fewer bits.


Lossless text compression is basically something like arithmetic coding[1] on top of a probability model. In fact the loss metric that LLMs directly optimize for is the number of bits required for lossless compression. Loss plotted in papers is generally nats/token.

The only reason LLMs are not on the leaderboard is that this is not a task for compressing human knowledge. It's a task for compressing Wikipedia, which is tiny in comparison, and the model size plays a big role.

[1]: https://en.wikipedia.org/wiki/Arithmetic_coding
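For a rough feel of that equivalence, here is the back-of-the-envelope conversion from reported loss to compressed size (all numbers below are illustrative assumptions, not measurements of any particular model):

    import math

    loss_nats_per_token = 0.9                            # assumed cross-entropy loss
    bits_per_token = loss_nats_per_token / math.log(2)   # ~1.3 bits/token

    chars = 1e9                   # enwik9 is ~10^9 bytes of text
    chars_per_token = 4           # assumed tokenizer average
    tokens = chars / chars_per_token

    compressed_bytes = tokens * bits_per_token / 8
    print(f"~{compressed_bytes / 1e6:.0f} MB, before counting the model weights")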


The compression should be semantically lossless, but lexicographically lossy. It should find ways to say the same thing in fewer characters.

I.e "Sky is blue. Ocean is blue." -> "Sky and ocean are blue."


Lossy text compression has little utility.


> Lossy text compression has little utility

You're describing every book you've ever read and learned from.


Ah, see but what they really meant to say was "Lossy text compression has a little utility" but due to lossy compression the meaning changed to be the opposite, thus proving the author's point.


No, not at all. I don't read a book to regurgitate the text in an imperfect form later. I read it to learn about the ideas and thought. The syntax itself (ie. text) is not that important.


> The syntax itself (ie. text) is not that important

So, you're preferentially discarding information you consider extraneous to your application, distilling it to a smaller representation that retains what you consider important about it.


Most image compression smooths skies. Because most people don’t want a regurgitated sky. If you’re an astronomer, on the other hand, you’ll compress away the trees. These are all lossy retrieval (and, inherent to compression, transformation) functions.


Good books aren't lossy.


> Good books aren't lossy

As someone who recently began rereading books, I heartily agree. The lossless original has value. That doesn’t mean the lossy imprint of that book in my mind “has little utility” or value.


Blinkist would like a word.


Oral tradition has had great utility for people for millennia, and it's very much lossy compression.


Depends on the use of the text. If it's to transfer ideas, then fairly easy to rewrite things so they're more concise. For example, brevity can be used without hurting comprehension.


/s?

lol


Humans can do lossy or lossless. There are plenty of people who can recite the entire Bible or Koran flawlessly.


This is more the equivalent of asking humans to create an exact copy of the text, typesetting and all, including the publishing information, page numbers, and exact linebreaks. Not just recite the text, which would be a lossy encoding of the original.

Humans are terrible at lossless encoding of information, it's what we invented machines for =D


No it is not more like that. This isn't compressing images, just the sequence of characters (or words). It is literally EXACTLY the same task as just memorizing word for word (plus punctuation), which humans are very good at, probably because our brains use a lot of strategies akin to what is used in this challenge.


How do you know it's compressed? Someone who can recite one of these probably knows interpretations, back stories, and related information that help them memorize it. The "storage" required for the recital may be larger than the text itself, if we can even think in those terms.


And there are humans that can jump 8ft in the air. Doesn't mean it's correct to say that "humans can jump 8ft in the air." Very few people are regurgitating verbatim information.


Many can recite the Koran flawlessly; it's short and heavily encouraged in education through rote repetition.

Much, much fewer can recite the bible; it's many times longer.

LLMs can also recite the bible and Koran flawlessly, given how frequently the text appears in their training material.


> LLMs can also recite the bible

Is this something you've seen demonstrated? I have no doubt they would be able to recite a lot of sections of it, but there are many sections that are less often quoted, and it's a long frickin' book. If it's just a hunch you have, I might give it a go and compare, if that's okay - I think it would be interesting to see what a given LLM knows and what it might miss.


Now you've given me the idea of coaxing new bible verses out of an LLM. Maybe we can even get one to confabulate a complete additional book! =)


That’s a strong assertion! Are there prompts that cause llms to regurgitate the bible to demonstrate this?

I'm thinking that the pigeonhole principle says that the current models are orders of magnitude too small to do this.


That's true, but it seems unlikely that that's a particularly important part of intelligence. The vast majority of people do _not_ do that type of memorization, are they still intelligent?


math is a lossless compression


Yeah, it makes no sense to say it's inspired by intelligence and then require losslessness, which is definitionally rote work and not intelligent.


Not true, a smart model could be really good at lossy compression and then you only have to store a small delta to make it lossless.


That's literally arithmetic coding which is used by all winning entries in the above so far.


Which is the exact same function of the difference for any competitor, so you can actually fairly compare them on the lossy part... like, people have been thinking about this problem for much more than the 3 minutes HNers in this comment section have.


yep!


I'm no mathematician but I don't believe this is true. Lossless information encoding requires all the original information to be present.


Arithmetic coding allows you to make a prediction and only provide bits for correction.

Have the decompressor predict the next data based on the output so far (a statistical prediction of the next data will be lossy, as it won't always be correct). If the prediction is correct you need to spend very little to confirm that. If it's incorrect you'll need to spend data to correct it. Arithmetic coding is the best way to make this work.

It's also been used by all winning entries of the Hutter prize so far.


Alright I guess I was wrong.


Technically you’re correct. If you can rebuild the original losslessly then all the original information is present, just in a different configuration.


Or at least reproducible. It could still be compressed.


What


500000 EUR is the prize pool. Each winner has to gain at least a 1% improvement over the previous record to claim a prize that is proportional to the improvement. Getting the full 500000 EUR prize requires a 100% improvement (i.e. compressing 1GB to zero bytes).


Does it or does it just require 1% improvement over the last winner? As opposed to a static additional 1% improvement vs the initial best “score”.


It's 1% over the last winner. The latest winner has a total size of 114156155, compared to previous winner of 115352938. The payout was

   500000 * (1 - 114156155 / 115352938) = 5187
(see table near "Baseline Enwik9 and Previous Records Enwik8")
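The same arithmetic in a couple of lines, using the figures quoted above:

    prize_pool = 500_000        # EUR
    previous = 115_352_938      # bytes, previous record
    latest = 114_156_155        # bytes, latest winner
    payout = prize_pool * (1 - latest / previous)
    print(round(payout))        # about 5187 EUR (and the improvement must be >= 1%)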


Probably if you succeed at this, 500,000 will be worthless to you


Why? How does this improvement translate to more financial gains?


The idea is that sufficiently powerful compression is indistinguishable from artificial intelligence, which can be used to make even more money.


Because with that knowledge, you will be able to decompress 0 dollars into infinite dollars, which the storage mafia will pay you for not publishing your breakthrough that makes them obsolete.


Let's say you really find a new, much much better compression algorithm for texts. I do believe that soon new applications will appear that will make use of the whole storage space that existing data storage devices can offer.


Are you possibly hinting at the endless stream of "new generation iWasteware camera, now with 2 gigapixel sensor" in combination with camera apps that put a (high) limit on the lowest resolution one can choose?


I did not have exactly this application in mind, but I was aware that this kind of waste does exist for many commercial applications when I was writing my reply.


sorry, who exactly is the storage mafia?


Well, if you achieve this, you'll basically have proven that something (a bunch of information) equals nothing (no information). So 1=0.

Once you have that, becoming rich is trivial. Multiplying both sides of the equation by one trillion, 1 trillion = 0. So open a bank account with $0, now you have one trillion. Easy.

A funnier (although grimmer) way: let p be the world's population. Multiplying both sides of the equation by (p-1) you get p-1=0. Thus, p=1. If you assume that you exist (which is a reasonable assumption, following Descartes's reasoning) you now own all the wealth in the world.


I think personally I can compress $1b into 0 within a week, but $1tr would be a real challenge. Another thing I should point out that I don't support compressing people (for now).


Because then you will have beaten Pied Piper in the market.


Being that smart translates to more financial gains.


Form is emptiness, emptiness is form. 1GB == 0B. Full prize money delivered upon successfully achieving enlightenment.


holographic storage, if we can harness the zero point field the way Karl Pribram postulated cognitive memory works (i.e. using the Void for storage) may actually get us to 0 "physical" B ;)


> compressing 1GB to zero bytes

Is this even possible? Don’t we need atleast some data to decompress from?


Not really 0, since any exe file will have some minimum length, and that's defined to be part of the calculation.

Conceptually you could cram it into some dozens of bytes (trivial initial state of the universe + basic physics + iteration). In practice, of course, that's ridiculous.


Depends on your definitions. I could decompress 1GB from zero bytes, with a program that is just over 1GB in length...


It's very well defined on the homepage:

> Create a Linux or Windows compressor comp.exe of size S1 that compresses enwik9 to archive.exe of size S2 such that S:=S1+S2 < L

So obviously your suggestion does not work.


Well, all right, I could use conventional compression such that S2 is zero, by making S1 what would have been S1 + S2. That still leaves the "compressed image" having zero bytes, by hiding the actual compressed image in the executable.

Note well: I am not claiming that there is any reasonable reason to do this. It's just a way to say "well acksually..."


So your way around is by baking the data into the executable instead?

How good of a "solution" is that in reality? Like, what if you were tasked to decompress multiple different files? Would you bake those into the executable as well?

Personally, I would count this way around as invalid.


Notice how you had to change the wording in this comment to make the scenario work. If the problem statement is "compressing 1GB to zero bytes" or "decompress 1GB from zero bytes" then that trick is easily called out as invalid.


Ah... I had professors who graded like that


I mean, come on man. For some reason, the nerd in me sees this and immediately adds it on my 'I really need to do this' list.

Just memories of old times doing some similar (albeit probably less challenging) competitions on TopCoder almost a decade ago, and also the curiosity to see how I would manage it now, with experience. Given that the current scores are also very far from what they estimate the lower bound to be, this is really interesting! The prize is however very misleading - per their own FAQ, the total possible payout is ~223k euros.

Definitely not thanking you for the hours I will put into this!


I think the total prize pool can be won with repeated 1% improvements by different people but never by a single person no?


I have an idea if you want to code it. You know how we can drop the vowels from sentences and still understand the sentence? What if we do that as a first stage? It may not work in every case, so you'd have to identify those. Worse comes to worst, use the full word. Probably not saving much though. Worth a try.


You got a lot of flak for what is clearly a take from someone who isn't versed in compression techniques. But as one might say to a student: you're on the right track! This idea is similar in form to "arithmetic coding", which is what people are using to chip away at this - namely, finding smaller encodings that can be used to predict common parts (maybe a recognisable word, more likely a sequence of bits or characters) of the full encoding, then cycling through storing "hints" for each part it would get wrong until it can predict the exact desired output.


I think you have it backwards, but I like the direction of thinking. There is redundancy in the language itself.

> However, vowel-only sentences were always significantly more intelligible than consonant-only sentences, usually by a ratio of 2:1 across groups. In contrast to written English or words spoken in isolation, these results demonstrated that for spoken sentences, vowels carry more information about sentence intelligibility than consonants for both young normal-hearing and elderly hearing-impaired listeners.

https://pubs.aip.org/asa/jasa/article-abstract/122/4/2365/98...


What you're describing is a form of lossy compression. Yes, it can compress the file, but you're losing some information, such that there's no way to convert it back to its original form without external information. And, as you noted, it may not work for every case, which means some information would be permanently lost.


By your logic, sign language, which mostly communicates via symbols - usually 1-2 symbols per word, no spaces - would be the best compression.


I bet you think of yourself as an idea guy.


> I bet you think of yourself as an idea guy.

What does this even mean?


It means he wants to sit back, write a lazy comment, and have someone else do all the work ("code it up", hah). Toss his brilliant ideas that are "worth a try" over the fence and move on to the next brilliant idea. Without any research or due diligence or even consideration for the comparative amount of effort it takes to have an idea, vs executing on it. Ideas are a dime a thousand.

This is the same kind of person who's got a great business idea, they just need you to write up an AirBnb clone in a couple of weeks, and they'll give you 5% of the company.


While I know it wouldn't qualify for the prize due to the size of the model, I'm curious how well a big modern LLM can compress the prize file (enwiki9) compared to the current record. (I guess ideally it would be an LLM not trained on wikipedia as well, to minimize memorizing).

On that thought, with a modern LLM's weights dwarfing the enwik9 file used by the prize, it feels like this prize was set up with the right idea to advance AGI, but with a problem several orders of magnitude too small.


The site itself says that this is the general idea, proposing that intelligence can actually be defined as the ability to compress information as much as possible and.. I guess either just rehydrate it or also derive inferences from it.

So basically like enthalpy

Interesting - if you play the universe in reverse (reversing entropy) you wind up at the Big Bang, which is the final boss of compressing everything, into a singularity.


> Interesting - if you play the universe in reverse (reversing entropy) you wind up at the Big Bang, which is the final boss of compressing everything, into a singularity.

Only if you believe that running the universe forward did not add any extra information, i.e. everything is deterministic and no randomness occurs.

(That's actually a reasonable stance to take. Quantum mechanics is probabilistic in the Copenhagen interpretation, but completely deterministic in the Many World interpretation.)


A non-local formulation of QM is deterministic. [0] It's in addition to the "many worlds" interpretation: still one universe, but the underlying reality is fully connected (to go forward you need to take into account the effects of everything).

0: https://en.wikipedia.org/wiki/De_Broglie–Bohm_theory


Yes. De Broglie Bohm theory is another example of an interpretation that's deterministic.

In practice, because of deterministic chaos it doesn't matter too much whether your lowest level theory is deterministic. (It only matters philosophically.)


It is more than likely that most physical reactions are not deterministic, and in that case the Big Bang definitely didn’t encode anything about all the state we have now. (Also, simply by Occam’s razor it seems much much more likely)


> The site itself says that this is the general idea, proposing that intelligence can actually be defined as the ability to compress information as much as possible

I don't understand how anyone can define "intelligence" like that...


It's not an unreasonable definition, if you are aware of Kolmogorov complexity and Solomonoff induction. Intelligence is intimately connected to the ability to predict and model things, and it turns out that data compression is _also_ connected to the ability of predict and model things.


It makes sense intuitively right?

If I tell you that this week it's raining on Monday, raining on Tuesday, raining on Wednesday, etc..

Compressing that information into "it's raining every day this week" requires creating an abstraction around it. Finding a pattern. Producing more order from something more chaotic.


> Compressing that information into "it's raining every day this week" requires creating an abstraction around it. Finding a pattern. Producing more order from something more chaotic.

Is there any way to determine if that process of abstraction-finding and pattern-finding will always/actually/sometimes/never result in a more compressed output than simpler approaches though?


A hunch of mine is that it's undecidable in the general case. In simple cases like this, there's likely to be a way to prove that yes, this is the most compressed thing out there.


Interestingly, compression increases entropy. So the first description is more "ordered", or less random.


> It's not an unreasonable definition, if you are aware of Kolmogorov complexity and Solomonoff induction

I was already familiar with Kolmogorov Complexity, but not _Solomonoff's theory of inductive inference_ - I skimmed the Wikipedia article just now and I think/hope I get the general gist of it - so thank you for that.

> Intelligence is intimately connected to the ability to predict and model things

I agree with that statement - but I am unsure how this relates to other aspects of intelligence - or even other notions of intelligence entirely. If we're redefining exactly what the word "Intelligence" means aren't we just moving the goalposts?

...so I'm unsure if the word "Intelligence" in your post can/should be read as a layperson would understand it, or if it's a specific, well-defined term in this field? (Like how "complexity" in algorithms does not refer to how convoluted and unmaintainable a program is, which is what a layperson would think of when an SWE person says "this program has high computational complexity" to them).

> and it turns out that data compression is _also_ connected to the ability of predict and model things.

This is where I have difficulty following your argument:

As a preface: my academic understanding of data compression only covers things like algorithms in the LZ family, Entropy coding, and (lossy and lossless) domain-specific compression schemes (like JPEG/DCT, MP3/FLAC, etc). I am aware of very interesting results coming from present-day AI approaches, like using LLMs fed only on compressed data - but these are completely outside my scope of understanding (and LLMs are still very spooky to me).

Does it matter if a scheme is lossless or lossy? In a lossless system, surely a data compression system needs to be deterministic and mechanical? If so, what room is there for "intelligence" in a rigid, statically-defined, system?

Take entropy coding for example - or even simpler probability-based compression schemes like Morse code (e.g. "E" has a high probability, so it has a very short symbol length). I just can't see the connection from things like entropy encoding to "modelling". Suppose I have a large file (CSV?) of meteorological data for a specific region over time. If it's plaintext, then I expect I can compress it better using gzip than by using a system that can (somehow!) identify some structure in the weather sensor data (the "model", right?) and then use that as the basis for a better domain-specific compression. Doing this means having to add additional metadata to the source data to describe the structure that the system identified, and then hoping that this approach beats a comparatively "dumb" approach like DEFLATE. And even then, assuming that employing that "model" really does result in smaller compressed output, how is that an example of the system having a general "intelligence"?


Sorry, I didn't get back to this until now.

I was thinking of lossless compression; lossy is another can of worms on top of lossless. Lossless compression works, in principle, by the realization that the probability distribution of all possible input strings is not flat. Some strings are more probable, and some are less probable. The reason some strings are more probable is that they result not from random processes, but from some process or algorithm that generates them. Thus, they have inherent structure to them. A random string would be a string that has no structure detectable by any algorithm smaller than the string itself. If the probability distribution of the input strings is not flat, then we can use entropy coding to describe the common case - non-random data - in fewer bits than the input.

However, the difference between specific compression algorithms and a "general compression algorithm" is the assumptions they make. Most compression algorithms don't consider the probability distribution of the "full" set of input strings; rather, they divide the input string into predictably sized chunks and consider the distribution of those chunks. This is way simpler, and lets them compress somewhat well while still using a rather "static" distribution (like Morse code), or only simple algorithms (like adaptive Huffman coding) to adapt the distribution to the input data.

But if we don't restrict ourselves to the world of "compressing a stream, message by message", but allow "intelligent" compression that may use any computable means to achieve a smaller compressed size and can adapt to the whole context of the input, we can see that "message-by-message" entropy coding is only a subset of what we can do. (And we of course _also_ have that subset in our toolbox.) But the true challenge of compression now becomes being able to find and represent approximations to the "full" distribution of the input data. That involves things like identifying structure in weather data! If the input is large enough to start with, the more complex model might be worth it. And if it isn't, then we can intelligently just decide to use DEFLATE.
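
(To make the weather-data point concrete, a minimal sketch - the synthetic temperature series and the trivial "today is roughly yesterday" model are just my own toy stand-ins, not anything from the prize:)

    # A crude model (today's value is roughly yesterday's) lets us store small
    # residuals instead of raw values; a generic backend (DEFLATE) then does
    # much better on them. A real decompressor would also need the first value
    # (and the model itself) to reconstruct the series losslessly.
    import random, zlib

    random.seed(0)
    temps = [200]                                    # tenths of a degree
    for _ in range(9999):
        temps.append(temps[-1] + int(random.gauss(0, 5)))

    raw = "\n".join(str(t) for t in temps).encode()
    residuals = "\n".join(str(b - a) for a, b in zip(temps, temps[1:])).encode()

    print(len(zlib.compress(raw, 9)))        # DEFLATE on the raw values
    print(len(zlib.compress(residuals, 9)))  # DEFLATE on model residuals: smaller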

> what room is there for "intelligence" in a rigid, statically-defined, system?

But as we can see from above, surely intelligence that is tasked with compressing stuff doesn't need to be "rigid"? The intelligence is in the compressor, not in the compressed result. The compressed result might be assembly code that unpacks itself by doing arbitrary computations, and to achieve good results, the generator of that code must be sufficiently intelligent.


There is a different but similar lossless compression benchmark which just had a new front-runner using LLMs, see https://bellard.org


If anyone is interested in a lossless compression competition, but is too intimidated by the 100MB/1GB size and the level of optimization already achieved here - you can try the http://golf.horse/ challenges, which include the Wordle wordlist, Pokemon, and the OEIS database


I don't understand why they sum the size of the compressor with the combined compressed file and decompressor. I think the compressed file and decompressor would make an ungameable challenge on their own. Their FAQ section 'Why is Compressor Length superior to other Regularizations?' is satisfied equally well by the decompressor length.


I had the same question, and I think this FAQ entry answers it fairly well: http://prize.hutter1.net/hfaq.htm#addcomp


Thanks, that one answers my question.


It's interesting to see a new winner of a different nationality beating the multiple previous records set by the same guy, and what appears to be another one who beat him 2 years ago; for some (cultural?) reason, there seems to be a historical association between Soviets and advanced data compression technologies in general, possibly starting with Markov's work in the 19th century.


The causation for this might run along the lines of having plenty (in absolute terms) of well educated smart folks but pretty low wages.

Most of the really smart people in Europe and the US for example already have high paying jobs and thus less time for these relatively low paying competitions.


Why does the formula use the size of the compressor? Intuitively I would expect the size of the decompressor and the compressed data to be the only thing that really matters for knowing how much the total information has been compressed.


Otherwise you could just store the whole text in the decompressor.

Another way is to keep the file secret and just output a score, but if you want to use a public file, you have to include the program in the calculated size.


> Otherwise you could just store the whole text in the decompressor.

I don't think GP was asking about the decompressor, but specifically the compressor.

The FAQ has a fairly detailed answer for this though: http://prize.hutter1.net/hfaq.htm#addcomp


Yes, I didn't read the comment carefully and quickly assumed it was the standard question about including the decompressor.


Thanks, I didn't see this part of the FAQ!


Perhaps because they include metrics for compression time, which could easily be gamed by doing significant precomputation in the compressor?


Related. Others?

Saurabh Kumar's fast-cmix wins €5187 Hutter Prize Award - https://news.ycombinator.com/item?id=36839446 - July 2023 (1 comment)

Hutter Prize Submission 2021a: STARLIT and cmix (2021) - https://news.ycombinator.com/item?id=36745104 - July 2023 (1 comment)

Hutter Prize Entry: Saurabh Kumar's “Fast Cmix” Starts 30 Day Comment Period - https://news.ycombinator.com/item?id=36154813 - June 2023 (5 comments)

Hutter Prize - https://news.ycombinator.com/item?id=33046194 - Oct 2022 (3 comments)

Hutter Prize - https://news.ycombinator.com/item?id=26562212 - March 2021 (48 comments)

500'000€ Prize for Compressing Human Knowledge - https://news.ycombinator.com/item?id=22431251 - Feb 2020 (1 comment)

Hutter Prize expanded by a factor of 10 - https://news.ycombinator.com/item?id=22388359 - Feb 2020 (2 comments)

Hutter Prize: up to 50k € for the best compression algorithm - https://news.ycombinator.com/item?id=21903594 - Dec 2019 (2 comments)

Hutter Prize: Compress a 100MB file to less than the current record of 16 MB - https://news.ycombinator.com/item?id=20669827 - Aug 2019 (101 comments)

New Hutter Prize submission – 8 years since previous winner - https://news.ycombinator.com/item?id=14478373 - June 2017 (1 comment)

Hutter Prize for Compressing Human Knowledge - https://news.ycombinator.com/item?id=7405129 - March 2014 (24 comments)

Build a human-level AI by compressing Wikipedia - https://news.ycombinator.com/item?id=143704 - March 2008 (4 comments)


One would think it ought to be possible to automate generating these overviews. Or has it been automated and is it being posted with a non-service account because it's partially but not fully automated?


Pretty sure Dang has some nice scripts that generate this, and I expect he does a little manual pruning before posting it. He’s very precise in his choices, definitely applies some intuition/judgement. We can’t losslessly compress him — yet!


This is probably a dumb question, but I recall from (way) back in my college days studying Shannon's source coding theorem that there are calculable limits regarding how much a data stream can be compressed. Has this limit been calculated for the sample file in question? And apologies if this was in the linked article or one of the footnotes.


To establish bounds in this way, you have to start with some claim about the distribution of the input data. In this case, the data is natural human language, so it's difficult or impossible to directly state what distribution the input was drawn from. Even worse, the prize is for compressing a particular text, not a sample from a distribution, so tight bounds are actually not possible to compute.

There is some discussion on the Hutter prize page, under "What is the ultimate compression of enwik9?"

http://prize.hutter1.net/hfaq.htm


From this article:

More empirically, Shannon's lower estimate suggests that humans might be able to compress enwik9 down to 75MB, and computers some day may do better.


Notable that enwik8/9 isn't really just human text - a good ~half of the data is random XML markup which may not have the same properties as text.


The current best compresses it to <15MB because it's not a human, which is what that estimate is referring to. :)


That's enwik8. enwik9 is an order of magnitude larger.


No. In fact there are limits, but unless you've already written a 'perfect compression tool' you can't actually know the limit.

https://en.wikipedia.org/wiki/Kolmogorov_complexity is ultimately what you're looking for, btw. Shannon is more about limits on transmission rate given noise, but Kolmogorov dealt with the limits of compression, which are actually uncomputable.


  Q: Why do you restrict to a single CPU core and exclude GPUs?
  A: The primary intention is to limit compute and memory to some generally available amount in a transparent, easy, fair, and measurable way. 100 hours on one i7 core with 10GB RAM seems to get sufficiently close to this ideal
 
Sorry, who are these people that don't have a GPU? Even laptops have GPUs. Why would you spend 100 hours on an "i7" (which generation? 4790K or six times faster 13700k?) CPU when you can achieve orders of magnitude better performance on a consumer GPU that literally everyone has access to?


> Sorry, who are these people that don't have a GPU

Literally millions of people. Especially circa early 2000's when this started.

> GPU that literally everyone has access to?

They don't. (Money is a barrier to access).

> Why would you spend 100 hours on an "i7"

Because it's what they had/have.

This a competition to move understanding forward, not to see who has the biggest budget.


The level of privilege that some people are completely blind to is just incredible.


Sure, in the early 2000s. But today literally every laptop or desktop computer or phone with a touchscreen has a GPU. Even most feature phones have GPUs, I'd actually be interested to see if you could find a currently manufactured one that doesn't. And they have all been programmable for a very long time now. And almost without exception they have more FLOPS than all of the CPUs in the same machine put together, yes, even the integrated GPUs.


In which case: https://news.ycombinator.com/item?id=37502782

And 20 years changed A LOT of things; it turns out the folks who can't afford a desktop or laptop don't need to, they have a cell phone. And even cell phones have GPUs that are more powerful than the CPUs of 20 years ago.


The first winner didn't even have a computer!

    Q: What if I can (significantly) beat the current record?

    A: In this case, submit your code and win the award and/or copyright your code and/or patent your ideas. You should be able to monetize your invention beyond the HKCP. This happened to the first winner, a Russian/Ukrainian who always had to cycle 8km to a friend to test his code because he did not even have a suitable computer, and who now has a lucrative job at QTR in Canada.
http://prize.hutter1.net/hfaq.htm#fame


>lucrative job at QTR in Canada

Google failed me on this one. What company/entity is QTR?


Quantum Technology Recruiting Inc. (https://ca.linkedin.com/company/quantum-technology-recruitin...)

Our guy, Alex Rhatushnyak, is listed as employee.

BTW, this is the first result on DuckDuckGo. (https://duckduckgo.com/?q=qtr+company+canada)

Is it time to switch? ;)


Note that the competition is close to 20 years old

...though I also had a GPU in 2006, so idk. Then again, you need to define something as reference hardware and it doesn't really matter what it is. Better compression should win out over less-good compression no matter if you run both on a 100-core system or a 1-core system, I think?


In the category "then update your FAQ, you've had many, many years to do so" =D

(not to change the rules, but to explain why the rules haven't changed, and to clarify phrasing that was super vague even back when the 3dfx Voodoo was the pinnacle of graphics. Level playing fields are a worthwhile pursuit)


anecdata: for my entire life, I have never, until about 3 weeks ago, owned a laptop with any type of GPU other than integrated graphics built into the motherboard. for most of my life in fact I have never really had any modern or really serviceable hardware, as I've basically inherited all of my compute devices secondhand and never had the money to just go out and buy a decent rig (or to even put one together for that matter - perhaps that will elucidate the economic sector which I have basically always occupied. well, not always, as now that I'm homeless, I somehow went even LOWER!)

though it truly does excite me that I finally have something with some type of GeForce-branded graphics device inside of it - now I can finally run local machine learning tasks and not have to cloud it out


My two-year-old high-end productivity ThinkPad has an Intel Iris Xe display adapter, which I would hesitate to call a GPU.


Maybe commodity servers with integrated and probably never used graphics capabilities that no one would actually describe as a GPU?


I do think it's interesting that recent submissions use nearly the entire 50 hours. I wonder how much better people could do if faster hardware was allowed.


Absolutely they could. All the entries so far can do even better with more RAM. The PAQ entries have command line flags to do this; that's already known, in fact.

What this is looking for is fundamental improvements, not "i brute forced a known way to win this competition".


An element of compression is usually a search problem - 'let's try all these ways to encode the data, and see which is smallest'.

Therefore, to maximize compression, you tweak parameters to search as hard as possible, and so you use all of the allowed time.
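
(In miniature, with stock codecs standing in for the real search over model parameters - just a sketch, not how actual entries work:)

    # Try several encodings of the same data and keep whichever is smallest.
    # Real entries search over model parameters rather than off-the-shelf codecs.
    import bz2, lzma, zlib

    def smallest(data: bytes):
        candidates = {
            "zlib-1": zlib.compress(data, 1),
            "zlib-9": zlib.compress(data, 9),
            "bz2-9": bz2.compress(data, 9),
            "lzma-9": lzma.compress(data, preset=9),
        }
        name, best = min(candidates.items(), key=lambda kv: len(kv[1]))
        return name, len(best)

    print(smallest(b"the rain in spain falls mainly on the plain " * 1000))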


This competition was from 2006.

The consensus among AI experts (really just everyone) during the 1990s/2000s was that:

- AGI could be achieved by giant GOFAI (usually expert systems/knowledge bases) projects (like OpenCog and Cyc).

- ... or that AGI development is limited by lack of knowledge about key insights (mostly symbolic/rational) into intelligence, not computation/data. (IE AGI was viewed similarly to proving that NP = P or other very-hard math/computer science/psychology/philosophy problems).

- ... or that AGI will be achieved through brain scanning/connectomics.

- ... or that AGI is impossible.

Nobody (except for LeCun and Schmidhuber) paid much attention to neural networks until AlexNet (2012) showed that they could be run and trained quickly and could beat the symbolic competition. In the 2000s, only a real, actual psychic would have been able to tell you that LLMs would be a valuable path for AGI research.

Here is a list of various expert (and "expert") perspectives on AGI during the 1990s/2000s (notice how nobody is talking about neural networks, and they are definitely not talking about anything remotely close to a LLM or transformer):

> Copycat is a computer program designed to be able to discover insightful analogies, and to do so in a psychologically realistic way. Copycat's architecture is neither symbolic nor connectionist, nor was it intended to be a hybrid of the two (although some might see it that way); ... [describes a very symbolic system to our modern day eyes, though it was not really symbolic to 1990s AI researchers]

- Douglas Hofstadter and Melanie Mitchell, Fluid Concepts and Creative Analogies (Chapter 5), 1995

> Interviewer: Are you an advocate of furthering AI research?

> Dennett: I think that it’s been a wonderful field and has a great future, and some of the directions are less interesting to me and less important theoretically, I think, than others. I don’t think it needs a champion. There’s plenty of drive to pursue this research in different ways.

> Dennett (cont): What I don’t think it’s going to happen and I don’t think it’s important to try to make it happen; I don’t think we’re going to have a really conscious humanoid agents anytime in the foreseeable future. And I think there’s not only no good reason to try to make such agents, but there’s some pretty good reasons not to try. Now, that might seem to contradict the fact that I work on a Cog project [sic] with MIT, which was of course is an attempt to create a humanoid agent, cogent, cog, and to implement the multiple drafts model of consciousness; my model of consciousness on it.

> Dennett (later): [Cog is intended as a] proof of concept [for AGI]. You want to see what works but then you don’t have to actually do the whole thing.

- Daniel Dennett, Daniel Dennett Investigates Artificial Intelligence, Big Think, 2009

> [Context: Marvin Minsky had a speech where he talked about how expert systems don't work, because they do not have any common sense (and the only solution seems to be to create a giant AGI project (without automatic data gathering)).]

> Only one researcher has committed himself to the colossal task of building a comprehensive common-sense reasoning system, according to Minsky. Douglas Lenat, through his Cyc project, has directed the line-by-line entry of more than 1 million rules into a commonsense knowledge base.

- Mark Baard, AI Founder Blasts Modern Research, Wired, 2003

> Section 1 discusses the conceptual foundations of general intelligence as a discipline, orienting it within the Integrated Causal Model of Tooby and Cosmides; Section 2 constitutes the bulk of the paper and discusses the functional decomposition of general intelligence into a complex supersystem of interdependent internally specialized processes, and structures the description using five successive levels of functional organization: Code, sensory modalities, concepts, thoughts, and deliberation. Section 3 ... [yada yada, this is old, wrong stuff]

- Eliezer Yudkowsky, Levels of Organization in General Intelligence, 2007, Machine Intelligence Research Institute

I could list more examples, but I have spent way too long on this post. What I will say is that Hutter probably had the most correct idea of how modern semi-general AI would work (from the 2000s). He figured out that compression is an extremely important component of intelligence > 10 years before everybody was doing LLMs. That is impressive.

I should probably write a blog post over this.


Trying to figure out who you're replying to, though - none of this has anything to do with compression algorithms.


> Being able to compress well is closely related to intelligence as explained below. While intelligence is a slippery concept, file sizes are hard numbers. Wikipedia is an extensive snapshot of Human Knowledge. If you can compress the first 1GB of Wikipedia better than your predecessors, your (de)compressor likely has to be smart(er). The intention of this prize is to encourage development of intelligent compressors/programs as a path to AGI.

- Marcus Hutter, http://prize.hutter1.net/


okay, but why post that as a reply to a complaint about how GPUs can't be used and how an "i7" is not one specific CPU? You almost certainly wanted to just post that as its own comment instead of a reply to anyone.


> So the ultimate compressor of it should "understand" all human knowledge, i.e. be really smart. enwik9 is a hopefully representative 1GB extract from Wikipedia.

I spotted 2 really compelling examples of "compression is AI" posted recently.

ChatLZMA: https://news.ycombinator.com/item?id=37318810

Ziplm: https://news.ycombinator.com/item?id=36732430


Related:

Homepage of Marcus Hutter:

http://www.hutter1.net/


I worked on this a bit and potentially had an improved result over the current leader, but it felt a bit cheap since I was doing it manually for this data set rather than for data in general.


Does it require participants to give up all their source code, as is usual in these competitions? I wonder what the free-market price for the winning system would be.


Small font — great place to start.


Middle out ftw


I will now compress all of human knowledge: fuck

/s


Easy! Write a compression algorithm that works as follows: if you detect that the input is the first GB of Wikipedia, return nothing. Else return the compressed artifact of your compression algorithm of choice. To decompress, running the decompression algorithm on an empty input returns the first GB of Wikipedia; otherwise you use whatever algorithm you used earlier :)

This is inspired by HQ9++: https://www.dangermouse.net/esoteric/hq9plusplus.html


Not sure if you’re joking or not, but the prize is for the smallest archive + compressor.

The best you could do with this approach is to use the best existing compressor and the compressed text (needed to check the input). With the extra test you’d end up doing slightly worse than the previous winner.
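
(A sketch of why the special-casing doesn't pay off - the hash constant and embedded copy below are placeholders, and the real rules measure the compressor plus a self-extracting archive rather than a separate decompressor, but the size accounting works out the same way:)

    # The "if it's enwik9, output nothing" trick: the empty archive is free,
    # but reproducing enwik9 still requires a full (compressed) copy baked
    # into the program, and program size is part of the measured total.
    import hashlib, lzma

    ENWIK9_SHA256 = "..."   # placeholder: hash used only to detect the special case
    EMBEDDED_COPY = b"..."  # placeholder: lzma-compressed enwik9 shipped with the program

    def compress(data: bytes) -> bytes:
        if hashlib.sha256(data).hexdigest() == ENWIK9_SHA256:
            return b""      # "free" archive for the special case
        return lzma.compress(data)

    def decompress(archive: bytes) -> bytes:
        if archive == b"":
            return lzma.decompress(EMBEDDED_COPY)  # the copy still has to live somewhere
        return lzma.decompress(archive)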


> Rules

> Publish a compression program comp9.exe that outputs archive9.exe given input enwik9.

> If archive9.exe is run with no input, it reproduces 10^9 byte file data9 that is identical to enwik9.

> Total size is measured as S := length(comp9.exe/zip)+length(archive9.exe).

In other words the input is indeed set to zero, but the size of the .exe is actually measured. And even stricter, the thing that made archive9.exe is included as well (which I think is a bit mean)



