Stream VByte: breaking new speed records for integer compression (lemire.me)
129 points by ingve on Sept 27, 2017 | hide | past | favorite | 11 comments



Just to give you an idea of how effective this strategy seems to be: it seems very similar to what the "COST in the land of databases" submission[1] the other day was talking about, where choosing interesting algorithms that fit the problem, instead of large but general distributed systems, often yields very good results (enough that a single thread can often beat 100+ core distributed systems).

In one of the prior blog posts[2] linked at the beginning of that article, he said this:

Using the Hilbert curve layout leads to a very simple compressed representation: having sorted the u64 values it uses to represent edges, one can simply delta-encode them (writing the differences between adjacent values). Because the Hilbert curve teases out locality, the gaps are often quite small, and therefore compress nicely. When I combined a variable-length integer encoding with gzip, the resulting file was only 45GB (about 2.8 bits per edge). Decompressed (with just the variable-length delta encoding), it is 154GB, which is the representation I used to get the measurements above.

He then goes on in his posts to comically trounce some large distributed systems.

1: https://news.ycombinator.com/item?id=15332051

2: https://github.com/frankmcsherry/blog/blob/master/posts/2015...


!#$!@# I was going to go try that out. ;) When doing the Chaos/Order stuff, the decompression was about half the CPU cycles, so thinning that could help.

Fwiw, the encoding I used wasn't VByte; because I was doing strict increments, I knew `0` wasn't a valid value to see, and used it to encode variable byte lengths (number of zeros: number of valid data bytes). Not smart, but easy to hack up, and gave me back 128-255 in one byte. I'm curious to check this out, now!
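One plausible reading of the zero-prefix scheme described above (a hedged sketch; the exact byte layout the commenter used isn't spelled out): since a strict increment means the value 0 never occurs, a run of z leading zero bytes can signal that z+1 data bytes follow, so values 1..255 fit in a single byte.

```python
def encode(value: int) -> bytes:
    """Encode a positive integer as z zero bytes followed by z+1 data bytes."""
    assert value > 0  # 0 is reserved as the length marker
    data = value.to_bytes((value.bit_length() + 7) // 8, "big")
    # one zero byte per *extra* data byte beyond the first
    return b"\x00" * (len(data) - 1) + data

def decode(buf: bytes, pos: int = 0):
    """Return (value, next_pos)."""
    z = 0
    while buf[pos + z] == 0:  # count the zero-byte length prefix
        z += 1
    start = pos + z
    value = int.from_bytes(buf[start:start + z + 1], "big")
    return value, start + z + 1
```

With this layout, 300 encodes to three bytes (`00 01 2c`) while anything up to 255 stays a single byte, matching the "gave me back 128-255 in one byte" remark.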


This method of compressing integer arrays as differences and then reducing the number of bits per difference is very efficient for many applications. We used it many years ago at a startup to compress and render vector maps in real time on mobile devices.

Not only did it make the data really small (we could fit all the roads of a country like Germany in 5 MB), it was also blazing fast to decode. Our vector map renderer did the decompression on the fly, while doing coordinate transformation and rendering to the screen. That was really the trick to getting real-time zooming and panning of vector maps on the ~100 MHz ARM CPUs of the time, with their tiny data caches.

We even took that compression scheme a step further, from byte granularity to bit granularity. The above article made me wonder how our old scheme would perform on today's CPUs...
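The delta-plus-varint idea described above can be sketched as follows (a minimal illustration using classic 7-bit VByte groups, not the startup's actual format): sorted values are stored as gaps, and each gap is written in as few bytes as it needs.

```python
def vbyte_encode(n: int) -> bytes:
    """Classic VByte: 7 data bits per byte, high bit set on the last byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

def compress_sorted(values):
    """Delta-encode a sorted sequence, then VByte-encode each gap."""
    out, prev = bytearray(), 0
    for v in values:
        out += vbyte_encode(v - prev)
        prev = v
    return bytes(out)

def decompress_sorted(buf):
    """Invert compress_sorted: rebuild gaps, then prefix-sum them."""
    values, prev, n, shift = [], 0, 0, 0
    for b in buf:
        if b & 0x80:  # terminator byte: finish this gap
            prev += n | (b & 0x7F) << shift
            values.append(prev)
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return values
```

Because sorted coordinates cluster, most gaps fit in one byte, which is where the small file sizes come from.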


One of the problems with SIMD is that we can't do this kind of shuffle at bit granularity, at least not in one instruction.


If you want bit-level granularity, you can use either SIMD-BP128 or SIMD-FastPFOR from https://arxiv.org/pdf/1209.2137.pdf


This is indeed fast and simple, since the pshufb instruction allows unpacking of bytes into multiple zero-padded words in one step. This technique uses a precomputed table of shuffle patterns, so it's a short sequence of instructions to fetch the control information for each 4-word group and apply the given shuffle.
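The control-byte mechanism behind this can be modeled in scalar code (a hedged sketch of the idea, not Lemire's implementation): each control byte packs four 2-bit length codes, one per integer, and a 256-entry table indexed by the control byte plays the role of the precomputed pshufb patterns.

```python
# Precompute, for every possible control byte, the four byte lengths it
# encodes (2 bits per integer, length 1..4) -- the scalar analogue of the
# shuffle-pattern table.
LENGTHS = [[((c >> (2 * i)) & 3) + 1 for i in range(4)] for c in range(256)]

def encode_quad(quad):
    """Pack four uint32 values into (control_byte, data_bytes)."""
    control, data = 0, bytearray()
    for i, v in enumerate(quad):
        length = max(1, (v.bit_length() + 7) // 8)  # 1..4 bytes
        control |= (length - 1) << (2 * i)
        data += v.to_bytes(length, "little")
    return control, bytes(data)

def decode_quad(control, data):
    """Unpack four values; the table lookup replaces per-byte branching."""
    out, pos = [], 0
    for length in LENGTHS[control]:
        out.append(int.from_bytes(data[pos:pos + length], "little"))
        pos += length
    return out
```

In the SIMD version, the same table instead holds 16-byte shuffle masks, so one pshufb expands all four integers to zero-padded 32-bit lanes at once.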


"I should mention that unlike Amazon, we did not patent our approach. We want you to use this in your commercial and open-source projects!"

Thanks Dan!


Could there be a similar scheme for floats?

Also, is there a penalty for using float16s in CPU-centric code?


Using fixed-point representation would allow using this compression method.
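For instance (a minimal sketch, with an arbitrarily chosen scale factor): quantize the floats to integers at a fixed precision, after which delta-plus-varint compression applies directly.

```python
SCALE = 10**4  # hypothetical choice: keep 4 decimal digits of precision

def to_fixed(xs):
    """Quantize floats to integers; these can then be delta/varint coded."""
    return [round(x * SCALE) for x in xs]

def from_fixed(ns):
    """Recover (approximate) floats from the fixed-point integers."""
    return [n / SCALE for n in ns]
```

The trade-off is a known, bounded quantization error (here half of 1/SCALE) in exchange for integer compressibility.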


The example code should replace "datasource" with "databytes".


Na





