On-the-fly encoding and decoding of bitmaps

willglynn · on Dec 12, 2017

There are a lot of tools available in the "compressed bitmap" space. When last I needed something like this, I evaluated several approaches and ended up selecting Roaring Bitmaps:

https://arxiv.org/abs/1402.6407

http://roaringbitmap.org

That application had some particular wins relating to large sparse bitmaps, especially the ability to efficiently perform binary operations directly on their compressed representations. I didn't need random insertions, so this ended up being a win across the board: faster runtime, lower memory footprint, and lower I/O requirements versus a normal bitset.

amallia · on Dec 12, 2017

Roaring Bitmaps is for sure another big thing out there. I will definitely write about it soon.

leni536 · on Dec 12, 2017

> boost::math::binomial_coefficient<double>

Using double here seems to be unsuitable for this kind of algorithm. The boost algorithm seems to be tuned for floating point arithmetic for large n and k values. An integral binomial_coefficient can be written rather easily. Such a function can easily be declared constexpr and can be used to create a lookup table at compile time.

Also I don't see how your algorithm is constant time, unless you can write a bitmap class with a constant time operator[]. I expected the lookup to be constant time. Decoding a known block to be constant time is not such a strong statement.

I can see that it can compress well bitmaps where there are large chunk of 0s or 1s. I wonder what is the benefit compared to RLE. jooster's comment seems to be valid too, you need 3 bits to store the popcount. That makes your "compressed" example 27 bits, which is 3 bits more than the original, making it not a too compelling example for your compression algorithm.

amallia · on Dec 12, 2017

  Using double here seems to be unsuitable for this kind of algorithm.

Agreed! But for the purpose of the post it is fine to use boost. Doing a constexpr function is definitely a plus for a production implementation.

  Also I don't see how your algorithm is constant time, unless you can write a bitmap class with a constant time operator[].

Decoding the block is constant. Accessing can be constant time, but is not if you encode implicitly the all-zeros/ones block. On the other and you can still do AND or OR between compressed bitmaps by paying only the scan.

  I wonder what is the benefit compared to RLE.

Benchmarking with other implementations is something I will definitely do in the future.

  jooster's comment seems to be valid too

Fixed that by using blocks of size 3 instead of 4. Sorry for the mistake. Please check the update, you will see it is even more compressed than before.

willglynn · on Dec 12, 2017

The updated post basically boils down to the following encoding:

    000 => 00
    001 => 01 00
    010 => 01 01
    011 => 10 00
    100 => 01 10
    110 => 10 01
    111 => 11

It compresses all-zeroes and all-ones blocks by one bit, and expands all other blocks by one bit. Compression rates therefore depend greatly on the exact bitmap, but this approach never yields more than 33% savings, and it results in expansion for random bitmaps with densities between 0.21 and 0.79.

joosters · on Dec 12, 2017

I can't see how an algorithm to lookup bit 'i' in a compressed bitmap of size 'n' can ever be constant time.

No matter how the bitmap is compressed, the block storage space for 'b' bits of the bitmap cannot be constant (since it's impossible to always compress 'b' bits of arbitrary data into fewer than 'b' bits).

So, to find bit 'i' in the compressed form, you are going to have to parse the compressed data. A naive approach would just linearly read through the compressed data, giving O(n). Alternatively, the data structure could contain indexes, pointing to various places in the bitmap. You could get the cost down to O(log n) by making your index contain pointers to a binary subdivision of the bitmap. But we're still not constant time.

If you index every 'x' bits, the cost drops to O(x). To reach constant time, you would need to index every bit in the bitmap, which obviously cannot be done without creating a data structure bigger than the original bitmap.

(Edit: fixed my misuse of linear/constant time!)

amallia · on Dec 12, 2017

  I can't see how an algorithm to lookup bit 'i' in a compressed bitmap of size 'n' can ever be constant time.

What you do is decoding a block in constant time, as the block size is constant and you need exactly block size iterations.

Accessing the i-th bit cannot be done in constant time as I already mentioned unless you also store all the offsets (including all-zeros/ones). In that case, all you have to do is moving to the i/b block (where is is the position you want to access and b is the block size) by skipping (len(C)+len(O)) * i/b bits.

joosters · on Dec 12, 2017

But to decode a block of size 'b' takes O(b), i.e. larger blocks take longer to decode. You can pick whatever size block you like, but smaller blocks lead to less compression. It's just another form of indexing.

leni536 · on Dec 12, 2017

I just found the following paper on "random access compression". They do claim constant time lookup.

https://pdfs.semanticscholar.org/1511/11c0c6dec6104bbf70d67d...

But yeah, I agree, O(log n) access time is easily achievable. I wonder what compressed filesystems use for seeking in large files.

joosters · on Dec 12, 2017

Interesting paper, but I'm not sure that I understand its claims sufficiently. In section 3.1, it seems to talk about a table that is effectively an index into the compressed data. So to find the position of codeword 'i' you can look it up in the table select(D,i) This could be practical if the codes being compressed were large, because the index has a chance of being smaller than the original data.

The original article is discussing a bitmap, so the codewords are 1 bit long and there is no chance that an index of positions of individual bits could ever be smaller than the original data.

jobigoud · on Dec 12, 2017

> I can't see how an algorithm to lookup bit 'i' in a compressed bitmap of size 'n' can ever be constant time.

Well for lossy compression we can, texture compression works this way. Every 4x4 block or whatever block size is compressed to a fixed known length like 128 bits. Then you can address any pixel in constant time.

specialist · on Dec 12, 2017

Ages ago, I tried to create a sparse bitmap for ANTLR’s storage of Unicode match sets. The paper’s algorithm I tried to implement was like compressed b-trees. I didn’t get very far.

I’ve been wondering how parser generators now Unicode bitsets.

joosters · on Dec 12, 2017

To store the cardinality of each block we would need 2 bits per block, for a total of 12 bits.

I don’t follow this. Each block, or 4 bits, can have a cardinality of 0-4. Yet 2 bits can only store 0-3. What am I missing?

amallia · on Dec 12, 2017

You are absolutely right! Will fix. Thanks for the feedback