Show HN: Fast suffix arrays in Python

matt4711 · on July 3, 2017

There are much faster SA construction algorithms than skew (check out divsufsort). The O(n) algorithms using induced sorting are also likely much faster than this work. The constants of recent O(n) algorithms are very low.

here a link for some state-of-the-art benchmarks of non-parallel SA construction algorithms: https://github.com/y-256/libdivsufsort/blob/wiki/SACA_Benchm...

there also now exist linear speedup parallel SA construction algorithms.

Labo333 · on July 3, 2017

OP and author here :)

Python is not the fastest programming language in the world either. As I said, I wanted something simple (less than 20 lines and quite readable) that performs well against random data. It is NOT intended to be used in production. You would have to add some checks for border cases and things like that. Plus it uses a lot of memory (although linear) because of the set and the dict, which is a drawback that the original implementation (suffix_matrix_stanford) does not have.

I clearly took simplicity over efficiency, and the advantage of this particular algorithm is that you compute the whole suffix matrix that you can reuse for the LCP.

Thank you for divsufsort, it seems to be really fast! During my search, I found this algorithm that does really well against the others, including divufsort: https://arxiv.org/abs/1307.1417

Without thinking too much, I think it is quite easy to make the prefix doubling algorithm parallel if you have a good parallel sorting algorithm, but it would of course not be the best solution…

pelario · on July 3, 2017

Hi, do you have some pointers for "linear speedup parallel SA construction algorithms." ? Thanks!

Labo333 · on July 3, 2017

Hi, sorry, I don't have anything in particular. However, it seems that http://snnynhr.github.io/ParallelSuffixArrays/ contains interesting algorithms and links.

lqdc13 · on July 2, 2017

Is there another implementation we can compare this with?

Kind of crazy to say that this is fast without any comparisons with anything else. So far I've been using suffix trees from here: http://www.daimi.au.dk/~mailund/suffix_tree.html

fnl · on July 2, 2017

Most uses are in rather obscure bioinformatics packages, where the exact matching of sequences is useful, including BWT compression of the arrays. Easy to Google, but I don't know if any pure-Python versions not written in C exist, and having seen colleagues argue about the best/most correct implementation, I prefer not to link any (Google is your friend...). But so any interesting performance comparison would probably be against a pure C implementation.

pathsjs · on July 3, 2017

You can test it against https://unicredit.github.io/cello/

dom0 · on July 2, 2017

Article seems to be mostly about efficient, not fast.

lqdc13 · on July 2, 2017

"Suffix arrays : How to compute them fast with Python"

The assumption is it's fast in Python without using Cython etc. My intuition says that looping and iterating in Python is slow. It seems like a great reference implementation, but I don't think anyone should use it if there are implementations that are (possibly) orders of magnitude faster.

I guess my bigger issue with this is that in many cases it's much more time efficient to just drop to Cython earlier on instead of fighting with the language.

Labo333 · on July 3, 2017

I'm not sure about the difference, but yes :) I wanted an implementation that could fit on my screen and still be efficient. Of course Python is slow and even my algorithm is not the best (I explained why I chose it). But I believe this is one of the best implementation you can make of prefix doubling, especially against random data and I have never seen my tricks anywhere.

xfer · on July 3, 2017

When i was doing a derivation of LZ compression using suffix arrays i used induced sorting. Here is a good overview: http://zork.net/~st/jottings/sais.html.

Also i think Doug's C implementation of MM suffix array is worth mentioning : http://www.cs.dartmouth.edu/~doug/sarray/

Labo333 · on July 3, 2017

Cool page! It would be really great to have the code shipped in a single file so that people can use it!

I love Jupyter notebooks because 1) they are beautiful 2) if you download them, you can play :)

Doug's implementation seems indeed really fast.

xfer · on July 4, 2017

It's not mine. Sorry if my comment gave that impression.

visarga · on July 3, 2017

Anyone remember "sary"? It was doing suffix arrays 17 years ago. Last update was in 2005.

http://sary.sourceforge.net/