Smaz - short strings compression library

tptacek · on March 31, 2009

This code made me smile, so don't take this the wrong way.

Don't run smaz() on untrusted input. Two things you need to fix, and a general word of advice:

(1) In the 0xFE case, you check for headroom in the output buffer but not the input buffer. If 0xFE is the last character in the input, you run right off the end of it, dumping the contents of memory to the output string.

(2) In the 0xFF case, same thing; if 0xFF 0xFE are the last characters in the buffer, I get at least 254 bytes of surrounding memory in my output.

How big a deal this is depends on what's in memory (and where the buffers are being kept). Generally, these are going to be on the stack, thus, a big deal.

General advice: not in love with using signed integers for byte counts; also, I don't think you have this problem because you're using full width integers, but be aware that your length bytes can theoretically wrap back to 0 if you're ever not careful in the future.

DON'T DATE ROBOTS.

antirez · on March 31, 2009

Hello tptacek, it is not very common with the normal use case to decompress things that you didn't compressed yourself, but I agree with you that is better to add this checks at the cost of some speed. Btw note that this is common in compression code that tries to do fast. in things like Fastlz and similar libs this is a compile time define, I may do the same, if SAFE_DECOMPRESSOR is defined it will do the check, otherwise will just skip the checking code.

Also note that the access is read only, so basically it's not possible to use such thing to run unprivileged code. Also in moder operating systems pages are zeroed before to be used in a different process, so it is not possible to read part of memory about a different process.

Thanks for the hint

tptacek · on March 31, 2009

Three observations:

(1) It is extremely common for network apps to decompress blobs they didn't generate; even in cases where those blobs aren't passed client/server, but are instead roundtripped through a database or shared persistence layer.

(2) Whatever fastlz may do, the industry standard (in Win32 and Unix and in embedded devices) is zlib --- at least from what we've found on app pentests --- and zlib doesn't have this problem (now).

(3) Zlib is itself a minefield of vulnerabilities, which supports the point that your compression codec is a security hotspot for an application.

But my point isn't that your code sucks; it's nice and tight.

antirez · on March 31, 2009

1) ACK, you are right, actually I'm probably going to use in the future this lib for Redis (http://code.google.com/p/redis), and it is important that an untrusted DB will load correctly without memory corruptions or invalid accesses.

2) Right, but a define to don't this checks is useful for environments where data can be trusted but speed is very important.

3) Good point. I'll fix the code. Btw what is nice about that kind of things is that testing against random data helps a lot.

Zlib should do that. For example, do you remember the old security bug of old bind implementations of using a 'ptr' opcode (used for compression) in DNS packets in order to create an infinite loop and mount a DoS? Years ago I tested how this bug were trivial to spot just processing random data. This is true for a lot of applications.

If you know how the application works you can "tune" this random data with biases that will stimulate a lot more the code. This is a worthy testing tool that is underused IMHO.

Thanks for your comment.

tptacek · on March 31, 2009

The testing you're describing is called "fuzzing", and it's the dominant way researchers find vulnerabilities today (I didn't fuzz your code; I just read it, and then wrote a little test doohickey).

David Molnar at Berkeley has done a ton of work on smart fuzzing: http://www.cs.berkeley.edu/~dmolnar/

Ilja van Sprundel wrote what has got to be the most popular "dumb" fuzzer, mangle.c: http://www.digitaldwarf.be/products/mangle.c

Mangle routinely finds horrible, horrible things in modern code; Ilja deserves some kind of anti-Nobel Prize for it.

mustpax · on March 31, 2009

This could be the backend for a cool stateless URL shortener. Not super useful, but stateless things are pretty nifty.

In this case you'd be getting unsafe data for both compression and decompression.

I really like your project though. Thanks for sharing!

antirez · on March 31, 2009

Thanks mustpax! I agree that stateless things are cool :) Btw optimizing the dictionary specifically for urls much better compression ratios are possible.

antirez · on March 31, 2009

Hello, I wrote this lib as an initial test on short strings compression. It seems like it can be useful for things like memcached, Redis, and alike, but may also be a good idea to make your MySQL DB shorter (chats, logs, user messages, news titles, urls, can all be compressed with this lib). In the README there is some info about the lib, the usage is brain-dead simple.

It's code of few hours ago, but should be pretty stable already, with this kind of programming what's nice is that you can encrypt/decrypt millions of strings from /dev/urandom to spot obvious bugs in very short time.

I wonder if this lib can be useful to somebody else.

Some example:

    'This is a small string' compressed by 50%
    'foobar' compressed by 34%
    'the end' compressed by 58%
    'not-a-g00d-Exampl333' enlarged by 15%
    'Smaz is a simple compression library' compressed by 39%
    'Nothing is more difficult, and therefore more precious, than to be able to decide' compressed by 49%
    'this is an example of what works very well with smaz' compressed by 49%
    '1000 numbers 2000 will 10 20 30 compress very little' compressed by 10%

jcl · on March 31, 2009

It looks like your encoding routine could get better compression if it weighed alternative encodings more carefully. For example, "not-a-g00d-Exampl333" could be better compressed as a single run of literal characters rather than as several runs of literal characters interspersed with codebook entries. (In this sense, the single letter codebook entries might not be helping your compression ratio.)

antirez · on March 31, 2009

jcl yes this is a compromise, if you check smaz_compress it does this kind of checks but only in order to glue together consecutive verbatim outputs into a single one (up to 256 bytes), but does not back-check if a single byte interruption is not worth it.

I think this can be an option that can be done in a second-pass way if a given #define is active. Thanks for the useful comment.

rw · on March 31, 2009

"[...] with this kind of programming what's nice is that you can encrypt/decrypt millions of strings from /dev/urandom [...]"

How many of those pseudorandom strings are you able to compress, in general?

antirez · on March 31, 2009

Most of this random strings will "globally" enlarge, but will locally compress. For example sequences of compressible subsequences of two-three bytes are very frequent. The remaining data that can't be compressed will stress the other code that emits verbatim uncompressed sequences.

To stress more the compression with valid sequences it's possible to use a very big HTML file and get ranges at random instead to get this strings from /dev/urandom

chime · on March 31, 2009

I was just looking for something like this last weekend. I am making an app in which the URL contains lots of different parameters and it is generated using JS. Compressing this long query-string would be awesome. I haven't looked at your code yet but do you think it would be possible to have a JS library that does just this? Maybe a jQuery plugin?

antirez · on April 1, 2009

Hello chime, it should be very simple to port the lib to JS. Given that JS supports hashes you get a much shorter implementation compared to the C one.

Luc · on March 31, 2009

Very neat! Could you explain a bit how this would help with memcached and Redis? Is the idea to keep memory usage down (apologies if this is obvious, I haven't used memcached (yet))?

antirez · on March 31, 2009

Hello Luc,

immagine you want to store millions of SMS messages into a memcached or Redis server. Memory usage can be a problem, compressing this data with Smaz you can use half the number of servers needed otherwise. The same works for URLs as well. Compression examples against URLs:

    'http://google.com' compressed by 59%
    'http://programming.reddit.com' compressed by 52%
    'http://github.com/antirez/smaz/tree/master' compressed by 46%

joeyo · on March 31, 2009

What are you doing to achieve the compression? Huffman coding (or a similar symbol code)?

antirez · on March 31, 2009

no, you can "add" Huffman coding to this algorithm to get better compression (but it will be much slower). What I used is a modified version of the LZ algorithm family, but instead of a table assembled from the data I'm compressing I use a fixed one.

It's basically a simple table-lookup, with two special bytes needed to emit verbatim bytes.

gojomo · on March 31, 2009

Seems to me that gzip or lzma with a constant, language-optimized dictionary would both...

(1) compress better; and

(2) be more robust/self-documenting.

EG: "Feed this dictionary, optimized over 1 billion [English|Western language|etc] documents, to zlib deflateSetDictionary() before compressing."

(The zlib/gzip dictionary is only 32KiB, but the lzma dictionary can be up to 32MiB.)

Building a small library of standard dictionaries, keyed by hash (so they can be locally cached or fetched when needed) would offer good coverage of many string domains.

See for example the Google 'Shared Dictionary Compression over HTTP (SDCH)' proposal, apparently a feature of Google Toolbar:

http://sdch.googlegroups.com/web/Shared_Dictionary_Compressi...

http://groups.google.com/group/SDCH

antirez · on March 31, 2009

Sure you can do this, what you get is something that is also able to adapt dynamically after the first 100 bytes or so. My requirement was to have something smaller, simpler, possibly able to run on low memory systems, and especially, with a much faster decompression compared to zlib.

I'm not exploiting an arithmetic encoding basically, and the dictionary is (intentionally) small. Add to this lib an arithmetic encoding and you get near to a zlib with an optimized dictionary, but at the cost of a lot of speed.

jcl · on March 31, 2009

It looks like you put together your codebook by hand (containing phrases like "http://" and ".com"), right? Wouldn't it be better to automatically extract it from a sample text corpus? Otherwise, you'd be introducing your own biases.

antirez · on March 31, 2009

Hello, no I build it with a Ruby script. But I inserted two codes by hand, http:// and .com. All the rest is based on a probability-length based weight, and a test data that is the following: english books in .txt format, and different .html pages from wikipedia.