Succinct Data Structures: Cramming 80,000 words into a Javascript file

ot · on March 21, 2011

What a wonderful hack! I thought that succinct data structures were not really feasible in JS because of the lack of unsigned integer arithmetic, but here they are implemented doing arithmetic directly on base64 6 bit "bytes", without passing through JS numbers. Very clever, I wonder how fast this implementation performs compared to jeresig's implementations (I'd say that the bottleneck is the binary search in `select`)

BTW, if anyone is interested in succinct data structures, there are very good libraries for Java (http://sux4j.dsi.unimi.it/) and C++ (http://www.uni-ulm.de/in/theo/research/sdsl).

PS. the job listing at the bottom of the page is awesome, compared to the usual BS:

> Some benefits: You'll probably get a next-generation Blackberry that you have to hide under the table when in restaurants. You get to know what's going on when a call drops or a web page isn't loading. Plus, it's a small team so you get to own a large chunk of the project.

Makes me almost want to apply :)

matt4711 · on March 21, 2011

Succinct data structures are very interesting.

I recently implemented an FM-Index[1] which supports the following operations:

- count() the number of occurrences of a pattern P of length m in O(m) time.

- locate() determine the locations of all occurrences of a pattern P.

- extract() extract any substring from the Text

- recover the original text

The most interesting part is that the size constructed index depends on the compressibility of the text itself. For English text roughly half the size of the original text.

Construction time and memory requirements, however are not that great.

[1] http://portal.acm.org/citation.cfm?id=1082039

ot · on March 21, 2011

Did you compare your implementation with others? There has been a lot of algorithm engineering effort on FM-indexes and CSAs (in general, full text self-indexes), and there is a common testbed to compare the implementations:

http://pizzachili.dcc.uchile.cl/

matt4711 · on March 21, 2011

My implementation is very similar to SSA_RRR on the pizza&chilli website. The index implementations on the website are not easy to understand (research code + 8 years old) so I'm trying to re-implement a few of them (especially CSA).

I used some of the test data on the website to test my implementation [1] but didn't really compare to their other indexes.

Also make sure to always use the chilean mirror as the italian one is outdated (2005).

[1] https://github.com/mpetri/FM-Index

ot · on March 21, 2011

> My implementation is very similar to SSA_RRR on the pizza&chilli website.

Thanks for the code, I've seen you are using libcds's wavelet tree. I was wondering if you had implemented your own, since apart from libcds and sdsl (and some unreadable research code) I couldn't find any other implementation of the WT so I'm always curious to see if there are any developments in the engineering of wavelet trees, the current implementations are still too slow to be really useful, I think.

matt4711 · on March 22, 2011

I'm trying to get myself to implement a wavelet tree. There is however a lot of low level stuff that I'm still not sure how to implement. Especially RRR.

This paper [1] is the most recent practical wavelet tree paper I know of.

[1] http://portal.acm.org/citation.cfm?id=1483966

ot · on March 22, 2011

There is also this one: http://portal.acm.org/citation.cfm?id=1198521

Instead of RRR it uses RLE and gamma codes. Oddly, I have found no comparisons between this approach and RRR. I would expect RLE to compress better, and it is also conceptually simpler, but I don't know how fast it is.

robryan · on March 21, 2011

Would be interesting to see how this method compares to the benchmarks John did in his post.

mryall · on March 21, 2011

Yeah, in particular it would be great to see how the load time compares. John found that even loading literal JSON in a JavaScript interpreter is much slower than loading a comparable length string. Presumably the latter is just a buffer copy operation.

netghost · on March 21, 2011

There are a couple of things to balance. I'm building a word game myself right now, and ran into these same issues. I found that loading a string for each letter of the alphabet worked adequately. It is not clever in the least, and far from optimal, but it ended up with a reasonable size / load time / processing trade off in the browser.