Hacker News new | past | comments | ask | show | jobs | submit login
Succinct Data Structures: Cramming 80,000 words into a Javascript file (stevehanov.ca)
126 points by bumbledraven on March 21, 2011 | hide | past | favorite | 10 comments



What a wonderful hack! I thought that succinct data structures were not really feasible in JS because of the lack of unsigned integer arithmetic, but here they are implemented doing arithmetic directly on base64 6 bit "bytes", without passing through JS numbers. Very clever, I wonder how fast this implementation performs compared to jeresig's implementations (I'd say that the bottleneck is the binary search in `select`)

BTW, if anyone is interested in succinct data structures, there are very good libraries for Java (http://sux4j.dsi.unimi.it/) and C++ (http://www.uni-ulm.de/in/theo/research/sdsl).

PS. the job listing at the bottom of the page is awesome, compared to the usual BS:

> Some benefits: You'll probably get a next-generation Blackberry that you have to hide under the table when in restaurants. You get to know what's going on when a call drops or a web page isn't loading. Plus, it's a small team so you get to own a large chunk of the project.

Makes me almost want to apply :)


Succinct data structures are very interesting.

I recently implemented an FM-Index[1] which supports the following operations:

- count() the number of occurrences of a pattern P of length m in O(m) time.

- locate() determine the locations of all occurrences of a pattern P.

- extract() extract any substring from the Text

- recover the original text

The most interesting part is that the size constructed index depends on the compressibility of the text itself. For English text roughly half the size of the original text.

Construction time and memory requirements, however are not that great.

[1] http://portal.acm.org/citation.cfm?id=1082039


Did you compare your implementation with others? There has been a lot of algorithm engineering effort on FM-indexes and CSAs (in general, full text self-indexes), and there is a common testbed to compare the implementations:

http://pizzachili.dcc.uchile.cl/


My implementation is very similar to SSA_RRR on the pizza&chilli website. The index implementations on the website are not easy to understand (research code + 8 years old) so I'm trying to re-implement a few of them (especially CSA).

I used some of the test data on the website to test my implementation [1] but didn't really compare to their other indexes.

Also make sure to always use the chilean mirror as the italian one is outdated (2005).

[1] https://github.com/mpetri/FM-Index


> My implementation is very similar to SSA_RRR on the pizza&chilli website.

Thanks for the code, I've seen you are using libcds's wavelet tree. I was wondering if you had implemented your own, since apart from libcds and sdsl (and some unreadable research code) I couldn't find any other implementation of the WT so I'm always curious to see if there are any developments in the engineering of wavelet trees, the current implementations are still too slow to be really useful, I think.


I'm trying to get myself to implement a wavelet tree. There is however a lot of low level stuff that I'm still not sure how to implement. Especially RRR.

This paper [1] is the most recent practical wavelet tree paper I know of.

[1] http://portal.acm.org/citation.cfm?id=1483966


There is also this one: http://portal.acm.org/citation.cfm?id=1198521

Instead of RRR it uses RLE and gamma codes. Oddly, I have found no comparisons between this approach and RRR. I would expect RLE to compress better, and it is also conceptually simpler, but I don't know how fast it is.


Would be interesting to see how this method compares to the benchmarks John did in his post.


Yeah, in particular it would be great to see how the load time compares. John found that even loading literal JSON in a JavaScript interpreter is much slower than loading a comparable length string. Presumably the latter is just a buffer copy operation.


There are a couple of things to balance. I'm building a word game myself right now, and ran into these same issues. I found that loading a string for each letter of the alphabet worked adequately. It is not clever in the least, and far from optimal, but it ended up with a reasonable size / load time / processing trade off in the browser.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: