This is particularly interesting, as there seems to have been, for decades, a general consensus that the problem of text compression is the same as the problem of artificial intelligence; see for example https://en.wikipedia.org/wiki/Hutter_Prize
"It is well established that compression is essentially prediction, which effectively links compression and langauge models (Delétang et al., 2023). The source coding theory from Shannon’s information theory (Shannon, 1948) suggests that the number of bits required by an optimal entropy encoder to compress a message ... is equal to the NLL of the message given by a statistical model." (https://ar5iv.labs.arxiv.org/html//2402.00861)
I will say again that Li et al 2024, "Evaluating Large Language Models for Generalization and Robustness via Data Compression", which evaluates LLMs on their ability to predict future text, is amazing work that the field is currently sleeping on.
I’m not sure how this generalises to grammar-based compression; SEQUITUR, for example, is grammar-based… and incidentally so is LZW, though it's not advertised as such.
Math seems very limited when it comes to reasoning about generative grammars and their unfolding into text. Had the apparatus been there, we’d probably have had grammar/Prolog-based AI long ago…
Grammars are not AI; they're just another formalism (like regular expressions, Turing machines, etc.), and a formalism alone doesn't solve anything.
In formal language theory you have different classes of grammars. The most general ones correspond to Turing machines, i.e. they are a glorified assembler and you can do anything with them. The most restricted ones (in the Chomsky hierarchy), the "Type 3" grammars, are basically another notation for regular expressions; they are the regular grammars and describe the regular languages.
There are algorithms for learning grammars, but the issue with that is that the induced grammars may not resemble anything that a human may write (in the same way that a clustering algorithm often does not give you the clusters you want).
But to answer your question: we need to separate the discussion of the appropriate representation from the discussion of the method used to solve a problem.
I believe grammar-based compression - if you accept probabilistic grammars - is similar to LLM-based compression at some level, in the sense that highly probable sequences of words get learned (whether by dictionary, grammar, or neural network = LLM could be just an implementation detail). Whichever you choose, you still need to solve the problem you are trying to solve (any grammar formalism still needs a parsing algorithm, and an actual grammar that does something useful - even after you develop a parser generator).
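In case it helps make "grammar-based" concrete, here is a toy sketch of grammar induction by digram replacement (Re-Pair-style rather than SEQUITUR proper, ignoring overlap subtleties; the rule names are made up):

  from collections import Counter

  def induce_grammar(seq, min_count=2):
      # Repeatedly replace the most frequent adjacent pair with a fresh
      # nonterminal. Frequent subsequences become reusable grammar rules,
      # which is the "dictionary" the compressor amortizes over the text.
      seq, rules, next_id = list(seq), {}, 0
      while True:
          pairs = Counter(zip(seq, seq[1:]))
          if not pairs or pairs.most_common(1)[0][1] < min_count:
              break
          pair = pairs.most_common(1)[0][0]
          name = "R%d" % next_id
          next_id += 1
          rules[name] = pair
          out, i = [], 0
          while i < len(seq):
              if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                  out.append(name)
                  i += 2
              else:
                  out.append(seq[i])
                  i += 1
          seq = out
      return seq, rules

  start, rules = induce_grammar("abcabcabcabc")
  print(start)  # ['R2', 'R2']
  print(rules)  # {'R0': ('a', 'b'), 'R1': ('R0', 'c'), 'R2': ('R1', 'R1')}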
[Side rant, not responding specifically to the parent or OP: as a linguist, I'd also warn everybody against using "AI" with an article: *"an AI" (the asterisk marks the wrong use). It wrongly suggests human-like properties when it's actually just a matrix of numbers encoding a model. Here is a test of whether you are using "AI" right: replace it with "Applied Statistics" in a sentence and see if you would still say it.]
AI is just an academic field (ill-named for historical reasons), a subfield of computer science, and while it's fair to talk about useful representations for modeling human-like behaviors, we should focus on what intelligence is and talk about the limits of concrete models and the possibilities for extending them.
The thing about LLMs is that they are a bit like the perfect snake oil salesman: extremely articulate, knows a little about a lot, understands nothing. (Whatever one criticises, they do the one thing they are designed for very well: generate text. Sadly, that misleads a lot of people, even though they are just next-word/next-sentence predictors.)
You are very brave to decide what to call or not call AI, but it is precisely generative grammars (stochastic ones) that were initially considered AI - as a linguist you should know this better than I do.
There's a general consensus that entropy is deeply spooky. It pops up in physics in black holes and the heat death of the universe. The physicist Erwin Schrodinger suggested that life itself consumes negative entropy, and others have proposed other definitions of life that are entropic. Some definitions of intelligence also centre on entropy.
What to make of all that, however, enjoys anything but consensus.
To have entropy, you need a notion of information. To have information, you have to decide which differences matter, i.e. which states you classify as the same.
This isn't a problem for physics, or for computer science. But it is a problem for would-be philosophers (including a few physicists and computer scientists!) who thought information was a shortcut to avoid answering big questions about what matters, what we care about.
> but on the internet you don't have to say anything and if you do it may as well have some substance
Seems like we're using different internets. Which I am glad about. I just wish mine had less of the negativity that's coming over from yours. Guess in the end, the people on your internet realize it's more fun over here.
You could have expressed all of that with less maliciousness towards the person. Thank god, in my internet everyone can say whatever they want, if they want. Because, and more people should apparently remember this, if I don't like it, I just turn off the internet, like grandma!
Wish all the best to you and everyone you care about in real life. I might be just a bot. You might be. We'll never know for certain. Don't let some bits mess with your feels.
I'm sorry for leaking negativity into your internet. I don't think negativity is inherently undesirable, but I don't think it's useful to express it towards people themselves. I meant only to criticize the comment, without further implication.
In fact I went and got some references I really liked because I was hoping to add what I felt was missing from the discussion on entropy. My motivation in the end was to share my personal feeling of awe, and in a way that was accessible to the parent poster as well as other readers. How do you like that internet?
> My motivation in the end was to share my personal feeling of awe, and in a way that was accessible to the parent poster as well as other readers.
Then write it that way:
1. Remove the first paragraph, where you treat the OP like a child by telling them where it is and isn't appropriate to express their idea
2. Remove the first two sentences of the 2nd paragraph
3. Remove the clause "but you can't get that from a quip."
Now we've got the beginnings of a delicious comment! You could even garnish it at the beginning with something like "Not sure if we're talking about the same thing, but..." But you don't even really need it.
That's the difference between playing in a sandbox with others, and unwittingly kicking someone out of one.
> I give up. Delete my account please dang. This site isn't good for my mental health.
While I cannot speak to your conclusion, I can humbly suggest to not put any credence in what some rando says on the Internet. Including myself. :-)
Far better is it to dare mighty things, to win glorious
triumphs, even though checkered by failure... than to rank
with those poor spirits who neither enjoy nor suffer much,
because they live in a gray twilight that knows not victory
nor defeat.[0]
>> This is all weasel words, and you've misspelled "Schroedinger"/"Schrödinger". That sort of comment might be fine for the pub, but on the internet you don't have to say anything and if you do it may as well have some substance.
> ... I can't invalidate your sense of awe.
Actually, yes. Yes, you can.
And so could I, or anyone really, given sufficiently focused vitriol.
For example, your sentence fragment "This is all weasel words" is incorrect English: "This is" should be the plural "These are", since the subject is "words", not "weasel", and the modifier "all" emphasizes the plurality.
The irony of your subsequently pointing out a spelling error and then chastising the OP for same has not been lost.
> At least 50% of posts that point out a spelling or grammatical error contain one as well.
Quite true. While I do not generally claim to be a grammatical wizard, I do know when I hear from one (hello Zortech-C++, it's been too long!).
If you don't mind pointing out my mistake(s) above, I would appreciate it as my goal was to exemplify the social effect of pedantic critique. Being corrected when doing same could serve as an additional benefit.
What's the unconditional rate of errors in posts generally? Without the prior I don't know if whingeing about spelling or grammar makes my posts correcter or incorrecter.
By what standard of English did you reckon my post incorrect? I appreciate your effort to cheer up your parent post, and to improve my language skills, of course.
(I'm not the language usage police, though I am fussy about correctly rendering people's names.)
I didn't understand your gainsaying about invalidating awe. Whether or not the poster's awe was a real and worthwhile feeling seems to me entirely independent of my opinions.
I find your aims admirable. However, I regret to say that for me the irony, and purpose of this comment thread, have indeed been lost.
> The subject was "this", referring to the comment.
While I understand clarification was the intent, in the original context "this" is in its determiner form, not its pronoun form. Had the word "comment" been included, then I believe most (if not all) readers would have understood its use as the pronoun form it is often used as, as well as its association with the noun form of "comment."
More important than my pedantry was an attempt to illustrate how corrections in this medium can be interpreted quite differently based on the person. As you intimate, my example did not affect you adversely (which is great BTW). How the OP responded to your original reply indicated a different effect unfortunately. I am not judging, only providing my observation.
A quote I wish I knew much earlier in my life is:
A sharp tongue is the only edge tool that grows keener with
constant use.[0]
Your comment is excellent, inspiring and quite true.
Please stay; otherwise the rest of us are stuck with the alternative (which is essentially someone saying "read this Wikipedia article and Schrödinger's original talks", with a perplexing pile of unhappiness, pretending to correct things that you didn't get wrong).
I’m not sure this is strictly true. It seems more accurate to say there are deep connections between the two than that they are theoretically equivalent problems. His work is really cool though, no doubt.
In the sense I understand that comparison, or have usually seen it referred to, the compressed representation is the internal latent in a (V)AE. Still, I haven't seen many attempts at compression that would store the latent + a delta to form lossless compression, that an AI system could then maybe use natively at high performance. Or if I have... I have not understood them.
It is true, but I think it's only of philosophical interest. For example, in a sense our physical laws are just humanity's attempt at compressing our universe.
The text model used here probably isn't going to be "intelligent" the same way those chat-oriented LLMs are. You can probably still sample text from it, but you can actually do the same with gzip[1].
Also worth checking out are some of the author's other compressors. For example, another of their neural-network solutions, this one using a transformer (https://bellard.org/nncp/), holds the top spot in the Large Text Compression Benchmark. It's ~3 orders of magnitude slower, though.
If I read this correctly, the largest test reported on this page is the "enwik9" dataset, which compresses to 213 MB with xz and only 135 MB with this method, a 78 MB difference... using a model that is 340 MB (and was probably trained on the test data).
No one would be impressed with saving 78 MB on compression using a 340 MB dictionary so I am not sure why this is good?
"The model is quantized to 8 bits per parameter and evaluated using BF16 floating point numbers" means the model is stored as 1 byte per parameter even though it's using a 2 byte type during compute. This is backed up by checking the size of from the download which comes out as 171,363,973 bytes for the model file.
> and was probably trained on the test data
This is likely a safe assumption (enwik8 is the default training set for RWKV and no mention of using other data was given) however:
> No one would be impressed with saving 78 MB on compression using a 340 MB dictionary so I am not sure why this is good?
The ts_zip+enwik9 size comes out to less than the 197,368,568 bytes for xz+enwik9 listed in the Large Text Compression Benchmark, despite the large model file. Coming in 20,929,618 total bytes smaller while keeping a good runtime speed is not bad, and it puts ts_zip decently high in the list (even when sorted by total size) despite the difference in approach. Keep in mind the top entry, at 107,261,318 total bytes, is nncp by the same author (neural net but not LLM based), so it makes sense to keep an open mind as to why they thought this would be worth publishing.
I wouldn't be surprised if my math was wrong but I can't quite follow yours. ts_zip(171 MB you say)+llm-enwik9(135MB) = 306MB is still larger than xz(0.3MB)+xz-enwik9(213MB) = 213MB.
I done did went and copied the enwik8 value for ts_zip when doing that compare, good catch!
I guess that leaves the question of how well the LLM's predictions work for things we're certain weren't in the training data. If it's truly just the prebuilt RWKV then it is only trained on enwik8, and enwik9 is already a generalization, but there's nothing really guaranteeing that assumption. On the other hand... I can't think of GB-class open datasets of plain English to test with that aren't already in use on the page.
Of the two, nncp uses transformers but isn't an LLM, while ts_zip doesn't use transformers but is an LLM. Remember, LLM just means large language model; it doesn't make any assumptions about how the model is built. Similarly, transformers just relate tokens according to attention; they don't assume those tokens must represent natural language.
I.e. anything you can tokenize can be wrangled with a transformer, not just language. Thankfully the same author also has a handy example of this: transformer-based audio compression, https://bellard.org/tsac/
If you’re compressing 100 or 100k such datasets, presuming that it is not custom tuned for this corpus, then wouldn’t you still save much more than you spend?
I'm not saying the result is completely useless, I am comparing it to the age-old technique of using a dictionary. Does this new LLM-powered technique improve upon the old dictionary technique?
Dictionaries also don't require a GPU or this amount of RAM.
Where I assume LLMs would shine is lossy compression.
Ah ok, I think we made different assumptions about whether the model was specific to the particular dataset so each one would need a new model — a dictionary is specific to the particular dataset being compressed, right? I was thinking the LLM would be a general-purpose text compression model.
AIUI, a dictionary is built during compression to specify the heuristics of a particular dataset and belongs to that specific dataset only. For example, it could be a ranking of the most frequent 10 symbols in the compressed file. That will be different for every input file.
That could be different for every input file, but it doesn't have to be. It could also be a fixed dictionary. For example, ZLIB allows for a user-defined dictionary [1].
In this case, I'd consider the LLM to be a fixed dictionary of sorts. A very large, fixed dictionary with probabilistic return values.
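For what it's worth, here is roughly what a fixed, user-supplied dictionary looks like with zlib in Python (the dictionary contents below are just an illustration, not anything tuned):

  import zlib

  # A fixed dictionary of byte strings we expect to recur across many inputs.
  zdict = b"the quick brown fox jumps over the lazy dog"
  data = b"the quick brown fox jumps over the lazy dog, again and again"

  plain = zlib.compress(data, 9)
  co = zlib.compressobj(level=9, zdict=zdict)
  primed = co.compress(data) + co.flush()
  print(len(plain), len(primed))  # the preset dictionary usually wins on short inputs

  # Decompression must be handed the same dictionary:
  do = zlib.decompressobj(zdict=zdict)
  assert do.decompress(primed) == data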
Admittedly, I don’t think it is common, but I think there was a project a few years ago (Google?) that tried to compress HTML using at least a partially fixed dictionary.
Nowadays, though, it's apparently still something that's being tried. Chrome now supports shared dictionaries for Zstd and Brotli. One idea being, you would likely benefit from having a shared dictionary used to decompress multiple artifacts for a site. But you may not want everything compressed all together, so this way you get the compression benefit but can have those artifacts split into different files.
I believe almost all LLMs are trained using Wikipedia these days. So compressing Wikipedia well without including the size of the LLM in the compression result is a bit of a cheat. I guess one could argue it is a universal dataset representing understanding of the English language and real-world relationships at this point, but it is still a bit of a cheat.
There's a reason compression benchmarks oftentimes include the size of the executable when benchmarking compression ratios. Although Matt Mahoney's Large Text Compression Benchmark[0] does currently have a transformer model at number 1.
Looks like it’s been updated since then; commenters in that thread are saying the decompressor needs to run on the same hardware as the compressor; now the link says:
> “The model is evaluated in a deterministic and reproducible way. Hence the result does not depend on the exact GPU or CPU model nor on the number of configured threads. This key point ensures that a compressed file can be decompressed using a different hardware or software configuration.”
It adds levity to the article and also introduces the reader to the sorts of things that can go wrong if they try it at home.
The last paragraph highlights how they fixed one of the main pitfalls I normally see in this sort of thing, where floating-point operations are mangled in myriad ways in the name of efficiency (almost always acceptable for physics or whatever, but a single incorrect bit will occasionally break this compression scheme).
Mind you, actually doing what they claimed in that last paragraph is usually painful. The easiest approaches re-implement floating-point operations in software using integer instructions, and the complexity increases from there.
Not just efficiency: if you have, e.g., floating point values arriving asynchronously to be accumulated, you'll always have a slightly unpredictable result.
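A tiny illustration of that order dependence:

  # Floating-point addition is not associative, so any accumulation whose
  # order can vary (threads, async arrival, different hardware) can change
  # the result -- fatal when the compressed stream depends on exact bits.
  a, b, c = 1e20, -1e20, 1.0
  print((a + b) + c)  # 1.0
  print(a + (b + c))  # 0.0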
Fun fact: Gemini 2.0 Flash is 100% deterministic with temp 0, unlike most models. This must be related to TPUs somehow, not sure why all previous Gemini versions are not like that, though.