I think Wikipedia is human data. But to say it's human knowledge (that is, that data is synonymous with knowledge in any context) is actually a pretty hardline epistemological stance to take, one that needs to be carefully examined and justified.
Completely agree, and I think this is where a lot of the confusion in the thread is coming from. The prize is billed as compressing “human knowledge”, but in practice it’s a fairly narrow natural-language text compression challenge.
I’d conjecture that the algorithms that do well (as measured by compression ratio) on this particular corpus will also do well on corpora that contain minimal ‘knowledge’.
Compression algorithms that do well might encode a lot of data about syntax, word triple frequencies, and Wikipedia editorial style, but I really doubt they’ll encode much “knowledge”.
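To make that concrete, here’s a minimal sketch (plain Python, with a toy sample text I made up, nothing to do with the actual contest corpus) of the kind of word-triple statistics a compressor can exploit: count which word follows each pair of words and charge -log2(probability) bits per word.

    import math
    from collections import Counter, defaultdict

    # Toy corpus; any stream of words would do.
    text = ("the cat sat on the mat . the cat sat on the rug . "
            "the dog sat on the mat .").split()

    # Count trigrams: context = previous two words, target = the word that follows.
    contexts = defaultdict(Counter)
    for a, b, c in zip(text, text[1:], text[2:]):
        contexts[(a, b)][c] += 1

    def bits_for(prev2, word):
        # Cost in bits to encode `word` given the two preceding words,
        # with crude add-one smoothing so unseen words aren't impossible.
        counts = contexts[prev2]
        total = sum(counts.values()) + len(counts) + 1
        p = (counts[word] + 1) / total
        return -math.log2(p)

    cost = sum(bits_for((a, b), c) for a, b, c in zip(text, text[1:], text[2:]))
    print(f"{cost / (len(text) - 2):.2f} bits/word under its own trigram statistics")

Scaled up (longer contexts, better smoothing, an arithmetic coder on the back end), statistics of this kind buy a lot of compression without anything you’d call understanding.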
while you are simply assuming (without a good reason) that I am saying
{data} ∩ {knowledge} = ∅
If you look back up the chain at the example of singing a Queen song, that may help you better understand the angle that commenter and I share on the difference between data and knowledge.
One of the big problems in this use case is that a dump of Wikipedia contains both knowledge and arbitrary noise, and lossless compression has to preserve every single bit of both. It's hard to tease "good lossy compression" out of that mess, because better and better lossy compression doesn't get you arbitrarily close to the original; it only gets you somewhat close.
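A quick way to see the lossless constraint (standard-library zlib, arbitrary sample data of my own choosing): structured text compresses heavily, but the noise-like part of the input has to be carried through essentially bit for bit.

    import os
    import zlib

    text = ("The quick brown fox jumps over the lazy dog. " * 200).encode()
    noise = os.urandom(len(text))  # stands in for the "arbitrary noise" in the dump

    for label, data in [("repetitive text", text), ("random noise", noise)]:
        packed = zlib.compress(data, level=9)
        print(f"{label}: {len(data)} -> {len(packed)} bytes "
              f"({len(packed) / len(data):.0%} of original)")

A lossy scheme could simply drop the noise, but then it's no longer answering the question the prize actually asks.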