I think Wikipedia is human data. But to say it's human knowledge (that is, that data is synonymous with knowledge in any context) is actually a pretty hardline epistemological stance to take, one that needs to be carefully examined and justified.
Completely agree, and I think this is where a lot of the confusion in the thread is coming from. The prize is billed as compressing “human knowledge”, but in practice it’s a fairly narrow natural-language text compression challenge.
I’d conjecture that the algorithms that do well (as measured by compression ratio) on this particular corpus will also do well on corpora that contain minimal ‘knowledge’.
Compression algorithms that do well might encode a lot of data about syntax, word triple frequencies, and Wikipedia editorial style, but I really doubt they’ll encode much “knowledge”.
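To make that concrete, here’s a minimal sketch (plain Python, with a toy sample text I made up, nothing to do with the actual contest corpus) of the kind of word-triple statistics a compressor can exploit: count which word follows each pair of words and charge -log2(probability) bits per word.

    import math
    from collections import Counter, defaultdict

    # Toy corpus; any stream of words would do.
    text = ("the cat sat on the mat . the cat sat on the rug . "
            "the dog sat on the mat .").split()

    # Count trigrams: context = previous two words, target = the word that follows.
    contexts = defaultdict(Counter)
    for a, b, c in zip(text, text[1:], text[2:]):
        contexts[(a, b)][c] += 1

    def bits_for(prev2, word):
        # Cost in bits to encode `word` given the two preceding words,
        # with crude add-one smoothing so unseen words aren't impossible.
        counts = contexts[prev2]
        total = sum(counts.values()) + len(counts) + 1
        p = (counts[word] + 1) / total
        return -math.log2(p)

    cost = sum(bits_for((a, b), c) for a, b, c in zip(text, text[1:], text[2:]))
    print(f"{cost / (len(text) - 2):.2f} bits/word under its own trigram statistics")

Scaled up (longer contexts, better smoothing, an arithmetic coder on the back end), statistics of this kind buy a lot of compression without anything you’d call understanding.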
while you are simply assuming (without a good reason) that I am saying
{data} ∩ {knowledge} = ∅
If you look back up the chain at the example of singing a Queen song, that may help you better understand the angle that commenter and I share on the difference between data and knowledge.
One of the big problems in this use case is that a dump of Wikipedia contains both knowledge and arbitrary noise, and lossless compression has to preserve every single bit of both. It's hard to tease "good lossy compression" out of that mess, because better and better lossy compression doesn't get you arbitrarily close to the original; it only gets you somewhat close.
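A quick way to see the lossless constraint (standard-library zlib, arbitrary sample data of my own choosing): structured text compresses heavily, but the noise-like part of the input has to be carried through essentially bit for bit.

    import os
    import zlib

    text = ("The quick brown fox jumps over the lazy dog. " * 200).encode()
    noise = os.urandom(len(text))  # stands in for the "arbitrary noise" in the dump

    for label, data in [("repetitive text", text), ("random noise", noise)]:
        packed = zlib.compress(data, level=9)
        print(f"{label}: {len(data)} -> {len(packed)} bytes "
              f"({len(packed) / len(data):.0%} of original)")

A lossy scheme could simply drop the noise, but then it's no longer answering the question the prize actually asks.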