Cool Algorithm - Fast text search using BWT

bazzargh · on April 24, 2012

BWT is a really neat trick, I first came across it in Andrew Tridgell's thesis on rsync, which is worth a read (http://www.samba.org/~tridge/phd_thesis.pdf). I changed PMD's copy-paste detector (CPD) to use it, which at the time was a massive improvement over its brute-force approach: http://onjava.com/pub/a/onjava/2003/03/12/pmd_cpd.html?page=... ...fairly obviously, the sorted permutations of BWT allow you just to read off duplicates; I was using permutations of tokens not characters.

CPD now uses Rabin-Karp searching, which is faster still. However, writing a copy-paste detector with BWT is fairly trivial and I still keep that script in my head for languages CPD can't handle.

WilhelmJ · on April 24, 2012

As far as I can tell, author is talking about FM-Index. It compresses the search data into a much smaller index memory footprint. I tried using it few times, but never figured out how to use it as a key-value data store. If anybody is interested, here is the code: http://pizzachili.di.unipi.it/indexes/FM-indexV2/fmindexV2.t...

superbobry · on April 24, 2012

I guess FM index is just not the right thing to use when you need a key-value data store. It's a full text index -- a data structure, which allows fast substring queries over a fixed text corpus.

arethuza · on April 24, 2012

Perhaps if you want to store (tag) sub-strings with stored data then it might make sense?

superbobry · on April 24, 2012

Yup, that might work, but still, this is a weird idea for a key-value store, maybe a DAWG or a radix tree would do better.

varun729 · on April 24, 2012

yes this is FM index. The linked paper is authored by Ferragina and Manzini, which is how the name FM index comes.

dunham · on April 24, 2012

Also see Alex Bowe's blog for a description of this:

   http://www.alexbowe.com/tag/datastructures

He describes how you can store the FM-index in less space than the original text.

techfiltered · on April 24, 2012

This is awesome!