Succinct data structures are very interesting. I recently implemented an FM-Inde...

ot · on March 21, 2011

Did you compare your implementation with others? There has been a lot of algorithm engineering effort on FM-indexes and CSAs (in general, full text self-indexes), and there is a common testbed to compare the implementations:

http://pizzachili.dcc.uchile.cl/

matt4711 · on March 21, 2011

My implementation is very similar to SSA_RRR on the Pizza&Chili website. The index implementations on the website are not easy to understand (research code + 8 years old), so I'm trying to re-implement a few of them (especially CSA).

I used some of the test data on the website to test my implementation [1] but didn't really compare it against their other indexes.

Also, make sure to always use the Chilean mirror, as the Italian one is outdated (2005).

[1] https://github.com/mpetri/FM-Index

ot · on March 21, 2011

> My implementation is very similar to SSA_RRR on the Pizza&Chili website.

Thanks for the code. I see you are using libcds's wavelet tree. I was wondering whether you had implemented your own, since apart from libcds and sdsl (and some unreadable research code) I couldn't find any other implementation of the WT. I'm always curious to see whether there are any developments in the engineering of wavelet trees; the current implementations are still too slow to be really useful, I think.

matt4711 · on March 22, 2011

I'm trying to get myself to implement a wavelet tree. There is, however, a lot of low-level stuff that I'm still not sure how to implement, especially RRR.

This paper [1] is the most recent practical wavelet tree paper I know of.

[1] http://portal.acm.org/citation.cfm?id=1483966

ot · on March 22, 2011

There is also this one: http://portal.acm.org/citation.cfm?id=1198521

Instead of RRR it uses RLE and gamma codes. Oddly, I have found no comparisons between this approach and RRR. I would expect RLE to compress better, and it is also conceptually simpler, but I don't know how fast it is.
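
For anyone unfamiliar with the idea, here is a minimal Python sketch of run-length encoding a bitvector and writing the run lengths as Elias gamma codes. It is only an illustration of the two building blocks, not the structure from the linked paper: the function names (`elias_gamma`, `run_lengths`, `encode`, `decode`) are made up for this example, bits are kept as '0'/'1' strings for readability, and there is no rank/select support, which a real index would need on top (for instance via sampled pointers into the code stream).

```python
def elias_gamma(n: int) -> str:
    """Elias gamma code of a positive integer, as a '0'/'1' string."""
    if n < 1:
        raise ValueError("gamma codes are only defined for n >= 1")
    binary = bin(n)[2:]  # binary form, no leading zeros
    # (len - 1) zeros announce how many bits follow the leading 1
    return "0" * (len(binary) - 1) + binary


def run_lengths(bits: str) -> tuple[str, list[int]]:
    """Return the first bit and the lengths of the maximal runs."""
    if not bits:
        return "", []
    runs = []
    count = 1
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return bits[0], runs


def encode(bits: str) -> str:
    """First bit, followed by the gamma code of each run length."""
    first, runs = run_lengths(bits)
    return first + "".join(elias_gamma(r) for r in runs)


def decode(code: str) -> str:
    """Inverse of encode(): rebuild the original bit string."""
    if not code:
        return ""
    bit = code[0]
    i = 1
    out = []
    while i < len(code):
        zeros = 0
        while code[i] == "0":  # unary prefix: number of extra bits
            zeros += 1
            i += 1
        n = int(code[i:i + zeros + 1], 2)
        i += zeros + 1
        out.append(bit * n)
        bit = "1" if bit == "0" else "0"  # runs alternate
    return "".join(out)
```

For example, `encode("0" * 20 + "1" * 12)` needs 17 bits instead of 32, while an alternating input such as `"01" * 8` stays the same size or grows slightly (17 bits for 16), so the gain depends entirely on how long the runs are.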