Hacker News new | past | comments | ask | show | jobs | submit login

Succinct data structures are very interesting.

I recently implemented an FM-Index[1] which supports the following operations:

- count() the number of occurrences of a pattern P of length m in O(m) time.

- locate() determine the locations of all occurrences of a pattern P.

- extract() extract any substring from the Text

- recover the original text

The most interesting part is that the size constructed index depends on the compressibility of the text itself. For English text roughly half the size of the original text.

Construction time and memory requirements, however are not that great.

[1] http://portal.acm.org/citation.cfm?id=1082039




Did you compare your implementation with others? There has been a lot of algorithm engineering effort on FM-indexes and CSAs (in general, full text self-indexes), and there is a common testbed to compare the implementations:

http://pizzachili.dcc.uchile.cl/


My implementation is very similar to SSA_RRR on the pizza&chilli website. The index implementations on the website are not easy to understand (research code + 8 years old) so I'm trying to re-implement a few of them (especially CSA).

I used some of the test data on the website to test my implementation [1] but didn't really compare to their other indexes.

Also make sure to always use the chilean mirror as the italian one is outdated (2005).

[1] https://github.com/mpetri/FM-Index


> My implementation is very similar to SSA_RRR on the pizza&chilli website.

Thanks for the code, I've seen you are using libcds's wavelet tree. I was wondering if you had implemented your own, since apart from libcds and sdsl (and some unreadable research code) I couldn't find any other implementation of the WT so I'm always curious to see if there are any developments in the engineering of wavelet trees, the current implementations are still too slow to be really useful, I think.


I'm trying to get myself to implement a wavelet tree. There is however a lot of low level stuff that I'm still not sure how to implement. Especially RRR.

This paper [1] is the most recent practical wavelet tree paper I know of.

[1] http://portal.acm.org/citation.cfm?id=1483966


There is also this one: http://portal.acm.org/citation.cfm?id=1198521

Instead of RRR it uses RLE and gamma codes. Oddly, I have found no comparisons between this approach and RRR. I would expect RLE to compress better, and it is also conceptually simpler, but I don't know how fast it is.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: