This is particularly interesting, as there seems to have been, for decades, a general consensus that the problem of text compression is the same as the problem of artificial intelligence; see for example https://en.wikipedia.org/wiki/Hutter_Prize
"It is well established that compression is essentially prediction, which effectively links compression and langauge models (Delétang et al., 2023). The source coding theory from Shannon’s information theory (Shannon, 1948) suggests that the number of bits required by an optimal entropy encoder to compress a message ... is equal to the NLL of the message given by a statistical model." (https://ar5iv.labs.arxiv.org/html//2402.00861)
I will say again that Li et al 2024, "Evaluating Large Language Models for Generalization and Robustness via Data Compression", which evaluates LLMs on their ability to predict future text, is amazing work that the field is currently sleeping on.
I’m not sure how this generalises to grammar-based compression; SEQUITUR, for example, is grammar-based… and incidentally so is LZW, though it's not advertised as such.
Math seems very limited when it comes to reasoning about generative grammars and their unfolding into text. Had the apparatus been there, we’d probably have had grammar/Prolog-based AI long ago…
Grammars are not AI; they're just another formalism (like regular expressions, Turing machines, etc.), and a formalism alone doesn't solve anything.
In formal language theory you have different classes of grammars. The most general ones correspond to Turing machines, i.e. they are a glorified assembler and you can do anything with them. The most restricted ones (in the Chomsky hierarchy), the "Type 3" grammars, are basically another notation for regular expressions; they are the regular grammars and describe the regular languages.
There are algorithms for learning grammars, but the issue with that is that the induced grammars may not resemble anything that a human may write (in the same way that a clustering algorithm often does not give you the clusters you want).
But to answer your question: we need to separate the discussion of the appropriate representation from the discussion of the method used to solve a problem.
I believe grammar-based compression - if you accept probabilistic grammars - is similar to LLM-based compression at some level, in the sense that highly probable sequences of words get learned (whether by dictionary, grammar, or neural network = LLM could be just an implementation detail). Whichever you choose, you still need to solve the problem you are trying to solve (any grammar formalism still needs a parsing algorithm, and an actual grammar that does something useful - even after you develop a parser generator).
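In case it helps make "grammar-based" concrete, here is a toy sketch of grammar induction by digram replacement (Re-Pair-style rather than SEQUITUR proper, ignoring overlap subtleties; the rule names are made up):

  from collections import Counter

  def induce_grammar(seq, min_count=2):
      # Repeatedly replace the most frequent adjacent pair with a fresh
      # nonterminal. Frequent subsequences become reusable grammar rules,
      # which is the "dictionary" the compressor amortizes over the text.
      seq, rules, next_id = list(seq), {}, 0
      while True:
          pairs = Counter(zip(seq, seq[1:]))
          if not pairs or pairs.most_common(1)[0][1] < min_count:
              break
          pair = pairs.most_common(1)[0][0]
          name = "R%d" % next_id
          next_id += 1
          rules[name] = pair
          out, i = [], 0
          while i < len(seq):
              if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                  out.append(name)
                  i += 2
              else:
                  out.append(seq[i])
                  i += 1
          seq = out
      return seq, rules

  start, rules = induce_grammar("abcabcabcabc")
  print(start)  # ['R2', 'R2']
  print(rules)  # {'R0': ('a', 'b'), 'R1': ('R0', 'c'), 'R2': ('R1', 'R1')}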
[Side rant, not responding specifically to the parent or OP: as a linguist, I'd also warn everybody against using "AI" with an article: *"an AI" (the asterisk marks the wrong use). It wrongly suggests human-like properties when it's actually just a matrix of numbers encoding a model. Here is a test of whether you are using "AI" right: replace it with "Applied Statistics" in a sentence and see if you would still say it.]
AI is just an academic field (ill-named for historical reasons), a subfield of computer science, and while it's fair to talk about useful representations for modeling human-like behaviors, we should focus on what intelligence is and talk about the limits of concrete models and the possibilities for extending them.
The thing about LLMs is that they are a bit like the perfect snake oil salesman: extremely articulate, knows a little about a lot, understands nothing. (Whatever one criticises, they do the one thing they are designed for very well: generate text. Sadly, that misleads a lot of people, even though they are just next-word/next-sentence predictors.)
You are very brave to decide what to call or not call AI, but it is precisely generative grammars (stochastic ones) that were initially considered AI - as a linguist you should know this better than I do.
There's a general consensus that entropy is deeply spooky. It pops up in physics in black holes and the heat death of the universe. The physicist Erwin Schrodinger suggested that life itself consumes negative entropy, and others have proposed other definitions of life that are entropic. Some definitions of intelligence also centre on entropy.
What to make of all that, however, enjoys anything but consensus.
To have entropy, you need a notion of information. To have information, you have to decide which differences matter, i.e. which states you classify as the same.
This isn't a problem for physics, or for computer science. But it is a problem for would-be philosophers (including a few physicists and computer scientists!) who thought information was a shortcut to avoid answering big questions about what matters, what we care about.
> but on the internet you don't have to say anything and if you do it may as well have some substance
Seems like we're using different internets. Which I am glad about. I just wish mine had less of the negativity that's coming over from yours. Guess in the end, the people on your internet realize it's more fun over here.
You could have expressed all of that with less maliciousness towards the person. Thank god, in my internet everyone can say whatever they want, if they want. Because, and more people should apparently remember this, if I don't like it, I just turn off the internet, like grandma!
Wish all the best to you and everyone you care about in real life. I might be just a bot. You might be. We'll never know for certain. Don't let some bits mess with your feels.
I'm sorry for leaking negativity into your internet. I don't think negativity is inherently undesirable, but I don't think it's useful to express it towards people themselves. I meant only to criticize the comment, without further implication.
In fact I went and got some references I really liked because I was hoping to add what I felt was missing from the discussion on entropy. My motivation in the end was to share my personal feeling of awe, and in a way that was accessible to the parent poster as well as other readers. How do you like that internet?
> My motivation in the end was to share my personal feeling of awe, and in a way that was accessible to the parent poster as well as other readers.
Then write it that way:
1. Remove the first paragraph, where you treat the OP like a child by telling them where it is and isn't appropriate to express their idea
2. Remove the first two sentences of the 2nd paragraph
3. Remove the clause "but you can't get that from a quip."
Now we've got the beginnings of a delicious comment! You could even garnish it at the beginning with something like "Not sure if we're talking about the same thing, but..." But you don't even really need it.
That's the difference between playing in a sandbox with others, and unwittingly kicking someone out of one.
> I give up. Delete my account please dang. This site isn't good for my mental health.
While I cannot speak to your conclusion, I can humbly suggest to not put any credence in what some rando says on the Internet. Including myself. :-)
Far better is it to dare mighty things, to win glorious
triumphs, even though checkered by failure... than to rank
with those poor spirits who neither enjoy nor suffer much,
because they live in a gray twilight that knows not victory
nor defeat.[0]
>> This is all weasel words, and you've misspelled "Schroedinger"/"Schrödinger". That sort of comment might be fine for the pub, but on the internet you don't have to say anything and if you do it may as well have some substance.
> ... I can't invalidate your sense of awe.
Actually, yes. Yes, you can.
And so could I, or anyone really, given sufficiently focused vitriol.
For example, your sentence fragment "This is all weasel words" is incorrect English: "This is" should be the plural "These are", since the subject is "words", not "weasel", and the modifier "all" emphasizes the plurality.
The irony of your subsequently pointing out a spelling error and then chastising the OP for same has not been lost.
> At least 50% of posts that point out a spelling or grammatical error contain one as well.
Quite true. While I do not generally claim to be a grammatical wizard, I do know when I hear from one (hello Zortech-C++, it's been too long!).
If you don't mind pointing out my mistake(s) above, I would appreciate it as my goal was to exemplify the social effect of pedantic critique. Being corrected when doing same could serve as an additional benefit.
What's the unconditional rate of errors in posts generally? Without the prior I don't know if whingeing about spelling or grammar makes my posts correcter or incorrecter.
By what standard of English did you reckon my post incorrect? I appreciate your effort to cheer up your parent post, and to improve my language skills, of course.
(I'm not the language usage police, though I am fussy about correctly rendering people's names.)
I didn't understand your gainsaying about invalidating awe. Whether or not the poster's awe was a real and worthwhile feeling seems to me entirely independent of my opinions.
I find your aims admirable. However, I regret to say that for me the irony, and purpose of this comment thread, have indeed been lost.
> The subject was "this", referring to the comment.
While I understand clarification was the intent, in the original context "this" is in its determiner form, not its pronoun form. Had the word "comment" been included, then I believe most (if not all) readers would have understood its use as the pronoun form it is often used as, as well as its association with the noun form of "comment."
More important than my pedantry was an attempt to illustrate how corrections in this medium can be interpreted quite differently based on the person. As you intimate, my example did not affect you adversely (which is great BTW). How the OP responded to your original reply indicated a different effect unfortunately. I am not judging, only providing my observation.
A quote I wish I knew much earlier in my life is:
A sharp tongue is the only edge tool that grows keener with
constant use.[0]
Your comment is excellent, inspiring and quite true.
Please stay; otherwise the rest of us are stuck with the alternative (which is essentially someone saying "read this Wikipedia article and Schrödinger's original talks", with a perplexing pile of unhappiness, pretending to correct things that you didn't get wrong).
I’m not sure this is strictly true. It seems more accurate to say there are deep connections between the two than that they are theoretically equivalent problems. His work is really cool though, no doubt.
In the sense I understand that comparison, or have usually seen it referred to, the compressed representation is the internal latent in a (V)AE. Still, I haven't seen many attempts at compression that would store the latent + a delta to form lossless compression, that an AI system could then maybe use natively at high performance. Or if I have... I have not understood them.
It is true, but I think it's only of philosophical interest. For example, in a sense our physical laws are just humanity's attempt at compressing our universe.
The text model used here probably isn't going to be "intelligent" the same way those chat-oriented LLMs are. You can probably still sample text from it, but you can actually do the same with gzip[1].
Also worth checking out are some of the author's other compressors. For example, another of their neural-network solutions, this one using a transformer (https://bellard.org/nncp/), holds the top spot in the Large Text Compression Benchmark. It's ~3 orders of magnitude slower, though.
If I read this correctly, the largest test reported on this page is the "enwik9" dataset, which compresses to 213 MB with xz and only 135 MB with this method, a 78 MB difference... using a model that is 340 MB (and was probably trained on the test data).
No one would be impressed with saving 78 MB on compression using a 340 MB dictionary so I am not sure why this is good?
"The model is quantized to 8 bits per parameter and evaluated using BF16 floating point numbers" means the model is stored as 1 byte per parameter even though it's using a 2 byte type during compute. This is backed up by checking the size of from the download which comes out as 171,363,973 bytes for the model file.
> and was probably trained on the test data
This is likely a safe assumption (enwik8 is the default training set for RWKV and no mention of using other data was given) however:
> No one would be impressed with saving 78 MB on compression using a 340 MB dictionary so I am not sure why this is good?
The ts_zip+enwik9 size comes out to less than the 197,368,568 bytes for xz+enwik9 listed in the Large Text Compression Benchmark, despite the large model file. Coming in 20,929,618 total bytes smaller while keeping a good runtime speed is not bad, and it puts ts_zip decently high in the list (even when sorted by total size) despite the difference in approach. Keep in mind the top entry, at 107,261,318 total bytes, is nncp by the same author (neural net but not LLM based), so it makes sense to keep an open mind as to why they thought this would be worth publishing.
I wouldn't be surprised if my math was wrong but I can't quite follow yours. ts_zip(171 MB you say)+llm-enwik9(135MB) = 306MB is still larger than xz(0.3MB)+xz-enwik9(213MB) = 213MB.
I done did went and copied the enwik8 value for ts_zip when doing that compare, good catch!
I guess that leaves the question of how well the LLM's predictions work for things we're certain weren't in the training data. If it's truly just the prebuilt RWKV then it is only trained on enwik8, and enwik9 is already a generalization, but there's nothing really guaranteeing that assumption. On the other hand... I can't think of GB-class open datasets of plain English to test with that aren't already in use on the page.
Of the two, nncp uses transformers but isn't an LLM, while ts_zip doesn't use transformers but is an LLM. Remember, LLM just means large language model; it doesn't make any assumptions about how the model is built. Similarly, transformers just relate tokens according to attention; they don't assume those tokens must represent natural language.
I.e. anything you can tokenize can be wrangled with a transformer, not just language. Thankfully the same author also has a handy example of this: transformer-based audio compression, https://bellard.org/tsac/
If you’re compressing 100 or 100k such datasets, presuming that it is not custom tuned for this corpus, then wouldn’t you still save much more than you spend?
I'm not saying the result is completely useless, I am comparing it to the age-old technique of using a dictionary. Does this new LLM-powered technique improve upon the old dictionary technique?
Dictionaries also don't require a GPU or this amount of RAM.
Where I assume LLMs would shine is lossy compression.
Ah ok, I think we made different assumptions about whether the model was specific to the particular dataset so each one would need a new model — a dictionary is specific to the particular dataset being compressed, right? I was thinking the LLM would be a general-purpose text compression model.
AIUI, a dictionary is built during compression to specify the heuristics of a particular dataset and belongs to that specific dataset only. For example, it could be a ranking of the most frequent 10 symbols in the compressed file. That will be different for every input file.
That could be different for every input file, but it doesn't have to be. It could also be a fixed dictionary. For example, ZLIB allows for a user-defined dictionary [1].
In this case, I'd consider the LLM to be a fixed dictionary of sorts. A very large, fixed dictionary with probabilistic return values.
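For what it's worth, here is roughly what a fixed, user-supplied dictionary looks like with zlib in Python (the dictionary contents below are just an illustration, not anything tuned):

  import zlib

  # A fixed dictionary of byte strings we expect to recur across many inputs.
  zdict = b"the quick brown fox jumps over the lazy dog"
  data = b"the quick brown fox jumps over the lazy dog, again and again"

  plain = zlib.compress(data, 9)
  co = zlib.compressobj(level=9, zdict=zdict)
  primed = co.compress(data) + co.flush()
  print(len(plain), len(primed))  # the preset dictionary usually wins on short inputs

  # Decompression must be handed the same dictionary:
  do = zlib.decompressobj(zdict=zdict)
  assert do.decompress(primed) == data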
Admittedly, I don’t think it is common, but I think there was a project a few years ago (Google?) that tried to compress HTML using at least a partially fixed dictionary.
Nowadays, though, it's apparently still something that's being tried. Chrome now supports shared dictionaries for Zstd and Brotli. One idea being, you would likely benefit from having a shared dictionary used to decompress multiple artifacts for a site. But you may not want everything compressed all together, so this way you get the compression benefit but can have those artifacts split into different files.
I believe almost all LLMs are trained using Wikipedia these days. So compressing Wikipedia well without including the size of the LLM in the compression result is a bit of a cheat. I guess one could argue it is a universal dataset representing understanding of the English language and real-world relationships at this point, but it is still a bit of a cheat.
There's a reason compression benchmarks oftentimes include the size of the executable when benchmarking compression ratios. Although Matt Mahoney's Large Text Compression Benchmark[0] does currently have a transformer model at number 1.
Looks like it’s been updated since then; commenters in that thread are saying the decompressor needs to run on the same hardware as the compressor; now the link says:
> “The model is evaluated in a deterministic and reproducible way. Hence the result does not depend on the exact GPU or CPU model nor on the number of configured threads. This key point ensures that a compressed file can be decompressed using a different hardware or software configuration.”
It adds levity to the article and also introduces the reader to the sorts of things that can go wrong if they try it at home.
The last paragraph highlights how they fixed one of the main pitfalls I normally see in this sort of thing, where floating-point operations are mangled in myriad ways in the name of efficiency (almost always acceptable for physics or whatever, but a single incorrect bit will occasionally break this compression scheme).
Mind you, actually doing what they claimed in that last paragraph is usually painful. The easiest approaches re-implement floating-point operations in software using integer instructions, and the complexity increases from there.
Not just efficiency: if you have, e.g., floating point values arriving asynchronously to be accumulated, you'll always have a slightly unpredictable result.
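A tiny illustration of that order dependence:

  # Floating-point addition is not associative, so any accumulation whose
  # order can vary (threads, async arrival, different hardware) can change
  # the result -- fatal when the compressed stream depends on exact bits.
  a, b, c = 1e20, -1e20, 1.0
  print((a + b) + c)  # 1.0
  print(a + (b + c))  # 0.0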
Fun fact: Gemini 2.0 Flash is 100% deterministic with temp 0, unlike most models. This must be related to TPUs somehow, not sure why all previous Gemini versions are not like that, though.