NNCP: Lossless Data Compression with Neural Networks

cs702 · on April 6, 2019

A natural question I've pondered from time to time is whether Fabrice Bellard is really a time traveler from a more advanced civilization in the future, sent back in time to show us, mere mortals, what humankind will be capable of in the future.

If this sounds far-fetched, consider that he has created FFMPEG, QEMU, LibBF, SoftFP, BPG, TinyEMU, a software implementation of 4G/LTE, a PC emulator in Javascript, the TCC compiler, TinyGL, LZEXE, and a tiny program for computing the biggest known prime number.

And that's just a partial list of his successful projects, which now of course also include software for more efficient lossless compression with deep LSTM and Transformer neural networks.

Any of these projects, on its own, would be considered a notable achievement for an ordinary human being.

Fabrice Bellard deserves some kind of superhuman lifetime achievement award.

Source: https://bellard.org

eismcc · on April 6, 2019

Agree. It’s also a message to the rest of us that we are likely capable of a lot more than we think.

makapuf · on April 6, 2019

Agree. It's also an - involuntary - message to the rest of us that we are likely capable of a lot less than he is.

trollied · on April 6, 2019

I know all of these in isolation. I’d just like to thank you for joining the dots. Wow

feedbeef · on April 7, 2019

> Fabrice Bellard deserves some kind of superhuman lifetime achievement award.

Indeed. The Tesla of software perhaps.

I just launched JSLinux [1] for fun. You gotta love this dmesg line: [ 0.000000] CPU: vendor_id 'AuthenticX86' unknown, using generic init.

[1] https://bellard.org/jslinux/

raphlinus · on April 6, 2019

And this is why Fabrice Bellard is so often cited as an example of a 10x programmer: his relatively simple LSTM-based neural net compressor is 10x faster than one of the more complex state of the art algorithms while delivering comparable compression ratios.

mpais · on April 6, 2019

"his relatively simple LSTM-based neural net compressor is 10x faster than one of the more complex state of the art algorithms while delivering comparable compression ratios."

No, it's 10x faster than an experimental, completely performance-unoptimized LSTM model that is far from state-of-the-art on its own. From the paper:

"Regarding the speed, our results show that our implementation is much faster than lstm-compress[8] with a similar model and gain. The speed improvement comes from the more optimized matrix multiplication implementation and the larger batch size (16 instead of 1)."

Of most interest here is the library used, LibNC, which might allow us to speed up cmix a bit (even making it 2x faster would already help a lot in testing).

Disclaimer: I've worked on cmix with its author

raphlinus · on April 6, 2019

Fair enough. cmix is awesome work, obviously, and I'm looking forward to the ideas becoming production-ready. Probably the most notable contribution by Fabrice Bellard is having something relatively simple.

m0zg · on April 6, 2019

He's like 1000000x programmer if you consider the kinds of projects he successfully completes. An "average" programmer never does anything anywhere near as impactful as ffmpeg throughout the entirety of their career. And he has multiple of those under his belt.

stabbles · on April 6, 2019

Well, to be fair, compression speed is not great: 17400 kB/s for gzip vs 2.75 kB/s for nncp (large network).

tom_ · on April 6, 2019

And that's not all! - "The decompression speed is slower (1.5x slower for the LSTM model, 3x slower for the Transformer model)"...

usefulcat · on April 7, 2019

It would be better to compare it to xz, which is significantly slower (and compresses better) than gzip.

javierluraschi · on April 6, 2019

Where is the reference to the 10X improvement? As in, which benchmark are you comparing against? Thanks!

beagle3 · on April 6, 2019

As in “one Fabrice Bellard is worth 10 or more average programmers”.

javierluraschi · on April 6, 2019

I mean, regarding algorithm speed.

From the conclusion:

“We presented a practical implementation of an LSTM based lossless compressor. Although it is slow, its description is simple and the memory consumption is reasonable compared to compressors giving a similar compression ratio. It would greatly benefit from a dedicated hardware implementation.”

I only skimmed through the paper, but looks to me like this might be a 10X improvement over other LSTM compression methods, not state-of-the-art compression.

That said, I’m far from being an expert in compression and was hoping from someone in HN to explain how relevant this is in their discipline.

stan_rogers · on April 6, 2019

The comment you are replying to was trying to explain the "10X programmer" idea. Nowhere was it claimed that the algorithm would have a 10X performance improvement.

javierluraschi · on April 6, 2019

No, I meant this comment: "His relatively simple LSTM-based neural net compressor is 10x faster than one of the more complex state of the art algorithms"

raphlinus · on April 6, 2019

I meant it as something of a joke (10x compression speed vs 10x the productivity of the average programmer), but also to summarize useful information from the paper.

reubenmorais · on April 6, 2019

The paper linked in the article, table 3: https://bellard.org/nncp/nncp.pdf

I would add that lstm-compress from Byron Knoll, the author of CMIX, already achieved comparable compression ratios at 10x the speed, as is shown in the paper.

tntn · on April 6, 2019

10x the speed of what? lstm-compress achieves a ratio of 1.64 bpb with a speed of 4.4 kB/s, Bellard's LSTM (small) achieves the same ratio with a speed of 41.7 kB/s.

userbinator · on April 6, 2019

Interesting to see another Fabrice project, and great to see compression performance on par with the best available, but unfortunately it still falls short of what's necessary to win this prize:

http://prize.hutter1.net/index.htm

(I've computed that 14,826,395 bytes is required to become the next winner. CMIX comes close, but its huge memory consumption disqualifies it.)

londons_explore · on April 6, 2019

Note that this uses CPU only to run the neural network.

Accelerators are now the norm for this sort of stuff, hence why this is so slow.

Also note this is closed source. Unusual for bellard, and also unusual for this sort of research.

loeg · on April 6, 2019

> Also note this is closed source. Unusual for bellard

Not wholly unusual for Bellard. E.g., https://bellard.org/lte/

rosstex · on April 6, 2019

Could someone explain how these are guaranteed to be lossless?

gwern · on April 6, 2019

Arithmetic encoding https://en.wikipedia.org/wiki/Arithmetic_encoding . The way I would put it is: the statistical model does its best to predict the next bits, assigning short bit strings to what it thinks is likely to come next and long bit strings to less likely next bits, and if it is indeed right, only the short bitstrings need to be emitted; if it's wrong, more long bit strings must be spent to correct it and emit the exact right bitstring. (Like using shorthand with lots of convenient shortcuts, and then falling back to painfully writing everything out explicitly.) So, you can use imperfect lossy probabilistic models, while still getting lossless compression.

ychen306 · on April 6, 2019

Arithmetic coding is a standard compression technique. It uses less bits for more frequent characters (think Huffman code, even though they are not the same).

This one makes arithmetic coding "adaptive". Consider this. The rough frequency of `e' in English is about 50%. But if you just seen this partial sentence "I am going to th", the probability/frequency of `e' skyrockets to, say, 98%. In standard arithmetic coding scheme, you would still parametrize you encoder with 50% to encode the next "e" despite it's very likely (~98%) that "e" is the next character (you are using more bits than you need in this case), while with the help of a neural network, the frequency becomes adaptive.

sreekotay · on April 6, 2019

The NN is used for the probability model for the compression - it's not the compression itself. In simplest terms, think of it as being used to predict the next bit.

meifun · on April 6, 2019

I wish I could ask Fabrice how he structure his day/time and balances life and family.

Edit: typo

thesz · on April 6, 2019

Why did he not used Time Convolutional Networks?

[1] https://openreview.net/pdf?id=rk8wKk-R-

[1] shows that TCN may outperform LSTM on tasks that really needs memory.

317070 · on April 6, 2019

Because LSTM's are proven over time to be industry grade, and are not latest-greatest-but-no-one-has-really-reproduced-the-results-yet.

A paper with inductive results is just a first step in a really long process until ideas are commonplace accepted.

thesz · on April 6, 2019

For every time you say "LSTM is industry grade" I would point you to large companies (Google-level) which train their acoustic models with TCN architectures.

He used Transformer, why not TCN?

Cybiote · on April 6, 2019

It's rare that a model introduced in a paper has fair comparisons. You'll want to see it used by many different authors before one can speak with confidence on claimed improvements.

In this paper, Bellard makes a comparison to the Transformer, some variant of which holds most NLP records. Of note is that in this test, the Transformer did not outperform the LSTM. Is it that Transformers need to be larger to outperform or LSTMs are more suited to compression or that more epochs are required? This deserves further investigation, what are the required conditions for Transformers to underperform LSTMs?

thesz · on April 7, 2019

The problem in the paper is to predict most probable "words". In NLP terms, model must be optimized for perplexity. Transformer holds records not for perplexity (it is too heavy for that) but for other tasks like machine translation or semantic similarity, I believe.