Hacker News new | past | comments | ask | show | jobs | submit login
Building data structures that are smaller than an array and faster (in C) (siganakis.com)
134 points by siganakis on March 1, 2012 | hide | past | favorite | 34 comments



Wavelet trees are awesome. There are several ways to speed up the rank/select operations, which the article doesn't mention. For example, you can create a simple (popcount, offset) index for each bitmap. Then when you need to find the nth bit (the select operation), you can simply do a binary search to find the offset with the nearest count. Likewise for the access (or the rank operation) you can find the popcount for the nearest offset in the index.

I use wavelet tree heavily in my job. They are very useful for compact representations of order preserving collections of strings which supports fast lookups.

Update: Sux4j has some interesting implementations of fast select/rank:

http://sux.dsi.unimi.it/

http://sux4j.dsi.unimi.it/docs/it/unimi/dsi/sux4j/bits/Selec...


Glad you mentioned this. The author linked to my blog post, where I mention fast rank/select structures. There are faster ways to do it without a supporting structure too of course :) but I do like this post.

I'm interested in what you use the WT for in your job... I'm just starting off my PhD in succinct data structures, so it's nice to know that people use them :P


Your blog posts about wavelet trees (http://www.alexbowe.com/wavelet-trees) and RRR (http://www.alexbowe.com/yarrr-me-hearties) are very clear and well written!

I've been thinking of writing some blog posts myself for a long time, now I think I will just link to yours.

EDIT: I think WTs should really have a page on Wikipedia...


Thanks ot! :) I loved writing them and I hope it shows. Where is your blog at? Yes I was tempted to write a wiki page before...


> Where is your blog at?

I don't have a blog yet, I was thinking of opening a blog and then write those posts... :)


Cool, WT and succinct data structures are definitely an interesting field these days.

I'm primarily using as a compact order preserving structure to support fast lookups for both (token -> position) and (position -> token). I'm using an updatable variant to support O(1) appends, and then use it for realtime indexing/compression. I probably should do a blog post at one point...

Would be interesting to hear about faster ways to do rank/select without using support structures... I'll keep an eye on your blog :-)


> I'm primarily using as a compact order preserving structure to support fast lookups for both (token -> position) and (position -> token). I'm using an updatable variant to support O(1) appends, and then use it for realtime indexing/compression. I probably should do a blog post at one point...

That's very interesting, are you using raw or compressed bitmaps (such as RLE or RRR)? A limitation of (updatable) wavelet trees is that you need to know in advance the set of symbols that you will see (otherwise you would need to change the tree structure). Is it a problem for you? I wrote a paper with a solution to this problem, I can't share it until mid march, let me know if you are interested.


My solution is to use a fixed tree with bloated leaf nodes (use more bits per element in the leaf node). A high number of nodes with a low number of elements causes a high overhead, so I simply rebuild the tree at a few given thresholds to keep the overhad and the cost of the bloated leaf nodes down. We can discuss further by email, if you want. I'd love to read your paper! :-)


I'm interested in the paper too of course :D


Yes you should do a blog post :) What is your blog address?


http://erik.gorset.no/

not much content, though..


Thanks! I'll keep visiting ^^ looks interesting as it is


> Update: Sux4j has some interesting implementations of fast select/rank

If you are using C++ you can have a look at my library of succinct data structures:

https://github.com/ot/succinct

It uses the same algorithms as Sux4j for rank/select (https://github.com/ot/succinct/blob/master/rs_bit_vector.hpp) but it has exhaustive unit tests and the resulting files can be memory-mapped instead of loaded in memory, which has a number of advantages.

I'm also going to release soon an optimized implementation of wavelet trees (actually a more generic variant that we are developing).


Thank you, this is exactly why I wrote the article - knew I was missing something!

I'll do a follow up with the rank / select optimizations.

Thanks!


What kind of compressed data structure I would use depends heavily on so much things (the probability distribution of the data, the probability distribution of different access patterns) that it is very hard to design a data structure for a generic case. Designing for a specific use case is a lot of fun though!

For example this: "Use less memory than an array in all circumstances" is just plain impossible if your data distribution is uniform. You just cannot store 1000 32 bit random intergers (preserving their order) in a smaller place than 1000*32 bits; (it comes from from simple combinatorics considerations.)


I agree with your point 100%, but if they're sorted you could probably just store the deltas an then use something like the varints that google protocol buffers to compress those deltas down from 32 bits (with a thousand of them, you are guaranteed that some of the deltas will be less than 22 bits because 1000 * 2^22 > 2^32).

With custom compression, where there's a will there's generally a way :)


So wait... You have 64 bits keys and 32 bits values. Then why not store the values in the keys? Use the lower 32-bits for the value, and the higher 32 bits to count how many duplicates you have.

    int32_t big_array[2^32] = {0};
    
    /* returns the key */
    int64_t write(int_32 value)
    {
        big_array[value]++;
        return big_array[value] << 32 & value;
    }
    
    /* returns the value */
    int32_t read(int64_t key)
    {
        return (int32_t) key;
    }

    /* all indexes between 1 and the return value are valid higher half values 
    for the key. The lower half is the value itself. */
    int32_t seek(int32_t value)
    {
        return big_array[value];
    }
memory: 16GB

access: O(1)

seek: O(1)

Problem solved!

:)

edit: I'm missing the "Use less memory than an array in all circumstances" requirement, but hey, you talk about "billion of records" in your post.


You also get order with a wavelet tree. You can ask "Give me the absolute position for the 4th value 42 in the array". By using a wavelet tree with an index, you can answer that question in almost constant time.


Bit confused here --

The author rewrites the initial values as:

1 0 2 0 2 3

He defines the keys array as:

2 3 4 6

Using the method defined in the article to find the values of each of the elements of the rewritten array, wouldn't the values be:

3 2 4 2 4 6

which does not equal the original values of:

3 2 4 6 2 6


My bad, I got that wrong.

The translated array should be:

1 0 2 3 0 3.

I am correcting the article now. For a non-rushed, accurate description please look at:

http://www.alexbowe.com/wavelet-trees

Edit: Got it wrong again... I guess this is why we have computers and debuggers!


Looks like you still have some bugs:

  rewritten data: 1 0 2 3 0 3
  first row     : 0 0 1 1 0 1
  group a       : 1 0     1
  group b       :     2 2   3
Group a and group b don't seem to match up with the data. Or I've misunderstood something.


I'm in the same boat. siganakis?


Yeah... I'm a bit confused there too. It's not the clearest article out there.


I'm inclined to believe this is an error. I have wasted hours trying to figure something out in a textbook, before I decide to check the errors section and discover that it was just a misprint.


Depending on your data set, there might be a way to use Compressed Bit Vectors: http://sdm.lbl.gov/fastbit/

General intro: http://crd-legacy.lbl.gov/~kewu/ps/LBNL-2164E.pdf

"Word Aligned Hybrid" compression: http://crd-legacy.lbl.gov/~kewu/ps/LBNL-49627.pdf

Bit maps as alternative to inverted index: http://crd-legacy.lbl.gov/~kewu/ps/LBNL-61768.pdf

Bit maps for range queries: http://crd-legacy.lbl.gov/~kewu/ps/LBNL-60891.pdf


A better fit for this use case is probably the paper "Encoded Bitmap Indexing for Data Warehouses" (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.44.7...) by Ming-chuan Wu and Alejandro P. Buchmann. The data is structure similar to bloom filters, but it's not lossy, and not compressed, so random access is O(1).


Bitmaps are very good at "Seek" operations (where I try to find a value), but poor at "Access" (where I try to find the value at an index). This is because "Access" requires that I check all the bitmaps to see which one has the set value.

Also the "Word Aligned Hybrid" method of RLE compression is patented. http://www.freepatentsonline.com/6831575.html


It's true that simple Access is O(N) as is, but you aren't limited to using the same index for forward and inverse lookups. It's even possible that you could use compressed bitmaps for both in less space than a fixed array. Single or multilevel indexing into the bitmap can also speed things up. The "Inverted Index" paper linked above has an interesting take.

Patents are thorny, and it's generally not recommended that developers read them: willful infringement equals treble damages, caveat lector. The license may make better reading <http://crd-legacy.lbl.gov/~kewu/fastbit/src/license.txt>. Search for "software patents pose a constant threat to the existence of any free program".

If it remains a concern, and realizing that any other scheme you choose is also likely to be encumbered, you may be able to substitute a different compression such as PForDelta. The important part is cache awareness and branch prediction -- this is likely where your wavelet code is falling short.

Good luck!


AFAIK these kinds of Compressed Bit Vectors do not have sublinear random-access, they are only fast for linear scans and operations such as union and intersection.


I've seen slab allocation schemes (not sure if that's a standard term for it) work in scenarios like inverted indicies where you want to keep your datum key size small (32 vs 64bit).

The intuition is that you don't need granular memory addressing for storing large chunks. You only need to have addressability at the min(sizeof(doc)) level. So you can usually store the key in a 32bit int or less.

More detail can be found here: https://gist.github.com/1947190


A very interesting data structure. Reading this inspired me to give an implementation a try. It was a lot harder than it seemed at first and it's certainly a naive attempt but if anyone is interested you can take a look at it here:

https://gist.github.com/1947371


So you have 64-bit integer keys and 64-bit integer values, and you need to store these key-value pairs such that you get good lookups from key to list of values and from value to list of keys?

How often are you updating? What is reading from this data structure? What's the query structure?

If you're adding lots of records often, I'd go for a B+tree. There's also a trie (and a B-trie), if your keys share lots of common prefixes.

If you are making consecutive queries, then you'd do well to take advantage of what pages are in L1 and L2 cache, but if you're randomly picking things, that's less important.


What we have is 32 bit integers (or anything really) that we store using bit vectors (which are implemented as 64 bit integers).

These structures are read only. I believe that there is some work on making updatable Wavelet trees but I think that this would be very expensive.

Reads are primarily Seeks (maybe 70%) I think, but Access is also important - so the order must also be encoded.

Storing keys and values is expensive in terms of memory (storing everything twice) so what we are trying to do is to manipulate the data such that we can do these operations quickly while using the minimum number of bits.

Edit: Fixed iPhone related spelling / grammar.


(particularly in the “refine” and “access” methods, where utilizing the popcnt SSE4.2 instruction would no doubt have a massive impact on performance.

)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: