Hacker News new | past | comments | ask | show | jobs | submit login
Cool Algorithm - Fast text search using BWT (avadis-ngs.com)
77 points by varun729 on April 24, 2012 | hide | past | favorite | 8 comments



BWT is a really neat trick, I first came across it in Andrew Tridgell's thesis on rsync, which is worth a read (http://www.samba.org/~tridge/phd_thesis.pdf). I changed PMD's copy-paste detector (CPD) to use it, which at the time was a massive improvement over its brute-force approach: http://onjava.com/pub/a/onjava/2003/03/12/pmd_cpd.html?page=... ...fairly obviously, the sorted permutations of BWT allow you just to read off duplicates; I was using permutations of tokens not characters.

CPD now uses Rabin-Karp searching, which is faster still. However, writing a copy-paste detector with BWT is fairly trivial and I still keep that script in my head for languages CPD can't handle.


As far as I can tell, author is talking about FM-Index. It compresses the search data into a much smaller index memory footprint. I tried using it few times, but never figured out how to use it as a key-value data store. If anybody is interested, here is the code: http://pizzachili.di.unipi.it/indexes/FM-indexV2/fmindexV2.t...


I guess FM index is just not the right thing to use when you need a key-value data store. It's a full text index -- a data structure, which allows fast substring queries over a fixed text corpus.


Perhaps if you want to store (tag) sub-strings with stored data then it might make sense?


Yup, that might work, but still, this is a weird idea for a key-value store, maybe a DAWG or a radix tree would do better.


yes this is FM index. The linked paper is authored by Ferragina and Manzini, which is how the name FM index comes.


Also see Alex Bowe's blog for a description of this:

   http://www.alexbowe.com/tag/datastructures
He describes how you can store the FM-index in less space than the original text.


This is awesome!




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: