Building a Regex Search Engine for DNA (benchling.engineering)
109 points by joshma on April 20, 2016 | 35 comments



I don't know if there was any strict requirement to use Python, but you mention the difficulty in finding a library to parse regular expressions.

One of the many great features of the Rust regex crate (https://doc.rust-lang.org/regex/regex/index.html) is that there's actually a regex_syntax crate that provides this parsing support (https://doc.rust-lang.org/regex/regex_syntax/index.html), which can be used for purposes like this, to do alternative kinds of matching engines.

Rust's regex library also has pretty good DFA and NFA matching engines, an Aho-Corasick engine for regexes that can be decomposed into just a set of simple strings to match, and some fairly well-optimized exact substring search to speed up expressions containing a substring that must match exactly. That makes it fairly fast for the kinds of matches in your "TAG(A|C|T|G)GG" example.
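
(To make "decomposed into just a set of simple strings" concrete, here's a toy Python sketch - nothing to do with the crate's internals - showing how that pattern expands into four plain literals that a multi-string matcher can scan for in one pass:)

    # Toy illustration: TAG(A|C|T|G)GG is really just four literal strings,
    # so a multi-string (Aho-Corasick style) scan is enough to match it.
    prefix, alternatives, suffix = "TAG", ["A", "C", "T", "G"], "GG"
    literals = [prefix + alt + suffix for alt in alternatives]
    # -> ['TAGAGG', 'TAGCGG', 'TAGTGG', 'TAGGGG']

    def matches(sequence):
        return any(lit in sequence for lit in literals)

    print(matches("CCTAGTGGAA"))  # True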

Another, even more radical change, would be to build a finite state transducer of your data set, which you can then do regex matching on directly by taking a union of an automaton representing the regex and the FST representing your data (which is itself an automaton). I learned about FSTs from the author of the Rust regex crate: http://blog.burntsushi.net/transducers/. An FST is actually used internally for certain types of searches in Elasticsearch/Lucene (https://www.elastic.co/guide/en/elasticsearch/reference/curr...). I don't know if these would be suited to this problem as is by just building an FST from your set of sequences, or maybe using the overlapping subsequences idea to make FSTs from a larger number of shorter strings, but it would be an interesting project to explore.


Insanely cool, thanks for sharing - it would be interesting to use this to replace our final filter (if I'm interpreting correctly, a trigram index would still speed up the initial search).

And yeah, Python is a strict requirement since the rest of our code is in Python - figured it wasn't worth throwing in another language when a rough parser worked.
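
(For reference, CPython does ship an internal, undocumented regex parser, sre_parse, which can hand back a parse tree without writing your own - though its being undocumented is itself a decent argument for a small hand-rolled parser:)

    import sre_parse  # CPython-internal; undocumented and subject to change

    # Parse a pattern into (opcode, argument) pairs instead of matching it.
    parsed = sre_parse.parse("TAG(A|C|T|G)GG")
    for op, args in parsed:
        print(op, args)
    # Prints LITERAL opcodes for T/A/G and a SUBPATTERN for the alternation;
    # the exact representation varies between Python versions.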


> (if I'm interpreting correctly, a trigram index would still speed up the initial search)

Your trigram approach is roughly isomorphic to the FST approach. The main benefit of an FST is that it can store a huge number of keys in a very small amount of space that is still searchable without a separate decompression phase. In the case of trigrams with a low alphabet size, an FST probably isn't a big win. But say if you wanted to increase to a bigger N-gram for some reason, then perhaps an FST might help you out. It's true that Rust's fst library has a way to apply regexes to the FST for very fast search (Lucene also has this), but its subjects are terms in the FST, which in your case corresponds to trigrams. This means you still have to go through the "extract a set of trigrams from a regex to pre-filter on" dance.
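
(For anyone following along, that dance looks roughly like this - a hand-wavy Python sketch with made-up helper names and a toy dict-of-sets standing in for the real inverted index:)

    import re

    def required_trigrams(literal):
        # Any sequence containing `literal` must also contain all of these.
        return {literal[i:i + 3] for i in range(len(literal) - 2)}

    def candidates(literals, index):
        # `index` is a toy inverted index: trigram -> set of sequence ids.
        # A sequence is a candidate if it has every trigram of some literal.
        ids = set()
        for lit in literals:
            postings = [index.get(t, set()) for t in required_trigrams(lit)]
            if postings:
                ids |= set.intersection(*postings)
        return ids

    def search(pattern, literals, index, sequences):
        # Pre-filter with the index, then confirm with the real regex.
        rx = re.compile(pattern)
        return [i for i in candidates(literals, index) if rx.search(sequences[i])]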

And yeah, using the regex-syntax crate from Python would have been a fair amount of work. In particular, it still needs a C API before you can use it from Python, which would probably be a fair amount of grunt work.

In any case, as a computational biologist in a former life, this is very cool work. Nice job! :-)


Here is such an FST-based sequence search engine: http://78.189.139.203/demos/seqbul

Currently, only viral sequences are supported.


For the union of the automata, do you mean intersection? Maybe when I look over the transducers post less quickly I'll see my mistake.


Yeah, I meant to write intersection. My mistake.


Bioinformatician here, with a couple of years of bench experience. Nice writeup, but what would the use case be here? I would imagine something like automatically annotating motifs/restriction sites and then searching over those annotations would be more useful than searching for a near-exact sequence?

For example, finding all plasmids with a certain primer sequence, specific restriction sites, promoter motif...


Protein Structural Chemist (I guess) here, and I'd like an even higher-level 'regular expression' which would allow one to search for DNA sequences likely to code for protein domains with a likely secondary structure (something like a string to search for "Secstruc" within "Exon Structure" [1]). The "likely" will always be a problem in these sorts of attempts; any genomics regular expression should allow for the option of the statistical best hit.

[1] http://www.rcsb.org/pdb/protein/P00918


Have you looked at T-Coffee? They have a new feature called Expresso which might do what you want.


Can you elaborate on the types of expressions you want a bit more, and how they should match?


Posting on behalf of Hannah, since her new account is getting blocked from replying:

Hi! Resident scientist at Benchling here! Searching for a near-exact sequence can be really useful! For example, sometimes I would like to find out whether any of my plasmids contain a certain signal peptide; it would be so hard without this search algorithm, because so many different DNA sequences can be translated to the same signal peptide. It would be great if I could just paste in the amino acids and identify which plasmids contain the signal peptide.
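
(As a sketch of what that could look like mechanically - reverse-translate the peptide into a regex over every DNA encoding. The codon table below is only a few entries of the standard table, and the helper name is made up:)

    import re

    # Partial standard codon table, for illustration only.
    CODONS = {
        "M": ["ATG"],
        "K": ["AAA", "AAG"],
        "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    }

    def peptide_to_dna_regex(peptide):
        # One non-capturing alternation per amino acid, concatenated in order.
        return "".join("(?:" + "|".join(CODONS[aa]) + ")" for aa in peptide)

    pattern = peptide_to_dna_regex("MKL")             # (?:ATG)(?:AAA|AAG)(?:TTA|...)
    print(bool(re.search(pattern, "GGATGAAGCTCTT")))  # True: ATG AAG CTC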


I see. I guess near exact match functionality would be great for plasmids where there is a custom sequence inserted.

But wouldn't a pre-processing pipeline that either automates annotation of the plasmid based on commonly used motifs (m17 primers, T7 and SP6 promoters, fluorescent protein sequences, antibiotic resistance, etc.) or requires the user to annotate their plasmid during submission be more ideal? I would imagine the more common use case is searching for these elements?


Blastx? Or are your motifs too short?

I know you mention that this isn't a use case for blast, but it's simple to keep a private blast index to use for cases like this.


Regular-expression-based searching of bio sequences has been around for a long time, but I don't think anybody considers regular expressions the right language for this, although I guess it depends a lot on the use case.

I think most people want a sensitive search that has a probabilistic model for substitutions and for indels, ideally with a heuristic so it's fast. BLAST does this, and it also includes support for "profiles", which are probabilistic models (and HMMER has an even more powerful version of those models).


I was going to say there was a faster way than regex, but then I saw you were not actually using regex for parsing in the way I first thought.

I work in a different domain, and I find custom state machines to be the preferred way to do matching over large sets of data. Of course, something like compressed bitmap indices is also helpful.


Yeah, I'm not too happy about how we're using regex inside a Groovy script for the final matching - when we scale this to support larger matches we'll likely hit memory issues trying to parse the entire string at once, and I've been playing around with the idea of streaming the bases into a state machine. Is that similar to what you were referring to?
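
(Something like this, as a toy Python sketch rather than our actual Groovy code - feed one base at a time into an NFA-style matcher that only keeps O(pattern length) state, so the whole sequence never has to be in memory:)

    # TAG, then any base, then GG -- one allowed-base set per position.
    PATTERN = [{"T"}, {"A"}, {"G"}, {"A", "C", "G", "T"}, {"G"}, {"G"}]

    class StreamMatcher:
        # Keeps only a set of partial-match positions, so memory use is
        # bounded by the pattern length, not the sequence length.
        def __init__(self, pattern):
            self.pattern = pattern
            self.active = set()

        def feed(self, base):
            nxt = set()
            for pos in self.active | {0}:        # a new match may start here too
                if base in self.pattern[pos]:
                    if pos + 1 == len(self.pattern):
                        return True              # full match ends on this base
                    nxt.add(pos + 1)
            self.active = nxt
            return False

    m = StreamMatcher(PATTERN)
    print(any(m.feed(b) for b in "CCTAGTGGAA"))  # True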

Do you have a way of generating state machines, or are they hand coded? Is it a perf optimization, or does it help integrating domain-specific logic?


I started with this article when I first looked into the issue of regex performance:

https://swtch.com/~rsc/regexp/regexp1.html

To generate state machines:

For C there is re2c http://re2c.org/

For Go there is https://github.com/opennota/re2dfa

My patterns do not change that often, so I end up hand coding them.

If you are looking to continue using regex for matching, I would look at using the TCL regex library as it has some optimizations that other regex libraries do not have.


Apache Groovy was really just intended for scripting anyway, and not meant to be scaled. Someone in their development group added static compilation and Android executables in Groovy 2.x, but they aren't being used by many people. Best just to use Groovy for its original purpose, for scripting Java classes or as a DSL in Gradle, and convert to Java (or Scala, Kotlin, etc) if you need to scale anything.


Can't speak for tmaly, but you might look at things like the Xerox Finite State Tools, Helsinki Finite State Tools, or KLEENE. They were developed with linguistics in mind, but are in fact general-purpose tools for generating and editing string-matching state machines.

It seems like every time I learn something new about bioinformatics, it just has more and more overlap with computational linguistics!


What were some of the query features that caused you to choose Elasticsearch over pg_trgm? pg_trgm supports regular expressions out of the box.


We already use ES for the rest of Benchling search, so that was the main reason we stuck with ES (n-gram indexing is pretty standard there).

As for why we were using ES for the rest of search: we were using it for things like language stemming, matching terms with "slop", things like {'minimum_should_match': '3<67%'} (require all terms to match if there are 3 or fewer, 67% if more than that), searching _all to match anything in the document, etc. I think a bunch of these (maybe even all) you can do in PG, but it was way easier to get going with ES.
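
(For the curious, a minimum_should_match query body looks roughly like this - illustrative only, not our actual query, and the index name is made up:)

    # "3<67%": with 3 or fewer terms all must match; with more, 67% must.
    query = {
        "query": {
            "match": {
                "_all": {
                    "query": "kanamycin resistance backbone",
                    "minimum_should_match": "3<67%",
                }
            }
        }
    }
    # es.search(index="benchling", body=query)  # via the elasticsearch-py client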

ES is also distributed, which makes scaling up and doing maintenance a lot easier - a bunch of times we just threw new nodes at it and removed old nodes that had gone bad.


I don't quite understand this comment. Isn't a regular expression just a description of a custom finite state machine? Any fast regex implementation will first compile the pattern into a custom DFA before matching it.


By my count, most "regex" engines implement things more powerful than regular expressions. I'd say the majority of regex engines use backtracking instead of a DFA. Backtracking does tend to be slower, but in my experience PCRE2's JIT is definitely competitive with even the fastest DFA engines.


Could HN help me with this?

I remember reading a really wonderful article on the Levenshtein distance on HN 2-3 months ago.

Basically, I have difficulty understanding why the Wagner-Fischer algorithm works - the intuition behind it. I remember it was dynamic programming, but I really want to read that article again.

The format was similar to Medium, with wonderful diagrams :( I tried searching for it on HN, but searching on HN is really difficult, and it was my own fault for not saving it to Pocket.

I would really appreciate the help, since it was one of those "better explained" gems.
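
(For concreteness, this is the dynamic program I mean - a standard Wagner-Fischer sketch, not from that article; I can follow the code, it's the intuition for why it's correct that I'm after:)

    def edit_distance(a, b):
        # d[i][j] = edit distance between a[:i] and b[:j], built from smaller prefixes.
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i                          # delete everything in a[:i]
        for j in range(len(b) + 1):
            d[0][j] = j                          # insert everything in b[:j]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return d[len(a)][len(b)]

    print(edit_distance("GATTACA", "GCATGCU"))  # 4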


Also curious - could it be one of those websites where they explain algorithms visually?

Have you tried https://hn.algolia.com/?query=levenshtein&sort=byPopularity&... ?


Oh wow! I didn't know there was a handy search engine for HN.

Basically, it explained the 1-dimensional Levenshtein distance first.

That blew my mind, since it's much easier to think about the 1D case and then apply it to a 2D matrix.

But I remember reading only halfway through the article before I had to do something else :( Sorry - but I really want to read it again, since it will really help me out at my work!

EDIT -

FOUND IT!

Thanks a lot for that! It was a godsend!

http://davedelong.tumblr.com/post/134367865668/edit-distance...

Here is the link - you just made my afternoon!

EDIT 2 -

Someone should write a book that explains all maths/algorithms concepts using just links to blog posts, ranked by HN popularity - just a thought :)


Partway down, the post links to Russ Cox's excellent writeup of how Google used this general sort of technique (turn a regexp into a bunch of conditions involving three-symbol sequences, search an index for them, work from there) to do regexp searching scalably for Code Search: https://swtch.com/~rsc/regexp/regexp4.html

You can peruse his open-source (re)implementation of the approach at https://github.com/google/codesearch .


You are still searching over chars (256 values, 8 bits) when you only need 4 values (2 bits)? Cache misses dominate the search, so you should compress the chars to 2 bits and search in L1-cache-sized blocks. The inverted index especially will be much faster to search if compressed.
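
(For reference, the 2-bit packing I mean is only a few lines - a sketch that ignores modified bases and ambiguity codes:)

    # Pack A/C/G/T into 2 bits each, four bases per byte.
    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
    BASES = "ACGT"

    def pack(seq):
        out = bytearray((len(seq) + 3) // 4)
        for i, base in enumerate(seq):
            out[i // 4] |= CODE[base] << (2 * (i % 4))
        return bytes(out)

    def unpack(data, n):
        return "".join(BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(n))

    s = "TAGAGGTT"
    assert unpack(pack(s), len(s)) == s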

Furthermore, those search terms are so highly specialized that you could easily JIT the regex.

Why use super-slow Python and not a fast language? I thought DNA matching was a big and important enough problem that you shouldn't toy around with dinosaurs.

[Edit: 3 -> 2 bit]


Hm, unless I'm misunderstanding, the bottleneck here isn't running a regex over all the sequences - the win is in filtering out a large % of sequences before running a regex over the remaining ones. We're storing trigrams (3-byte keys) in an inverted index - compressing them won't get you much other than space savings. Re: caching, we're using Elasticsearch's bitmap caching to speed things up. If you search for something twice, the second time it'll take the cached bitmap from the first search and use that instead.

We use fast languages when we need them. Python is just parsing a regex to generate constraints - for DNA search, parsing a 1000-char regex is super fast and rarely an issue.

In this case, we're already at sub 100ms searches (usually sub 50ms) so I don't see much benefit from playing around with L1 caches and JIT-ing when higher-level structure already gives the perf characteristics we need.


You do need more than 2 bits. DNA can be modified (5-methyl-C is the best known, but there are lots of other known modifications). You also have degeneracy, where the exact base at a position is not known; these cases are represented in the data as additional letters [1]. To cover all of these you need 4 bits, and if you want to cover the modifications as well you need at least 5 bits.

1. https://en.m.wikipedia.org/wiki/Nucleic_acid_notation
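
(Those ambiguity letters also map neatly onto regex character classes when they show up in a query - a sketch using the standard IUPAC table, with a made-up helper name:)

    import re

    # IUPAC nucleotide ambiguity codes as regex character classes.
    IUPAC = {
        "A": "A", "C": "C", "G": "G", "T": "T",
        "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
        "K": "[GT]", "M": "[AC]",
        "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
        "N": "[ACGT]",
    }

    def iupac_to_regex(query):
        return "".join(IUPAC[base] for base in query.upper())

    pattern = iupac_to_regex("GAANNNNTTC")              # GAA[ACGT]...[ACGT]TTC
    print(bool(re.search(pattern, "CCGAATCGATTCAA")))   # True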


I was assuming Python because big data analysts generally use one of two languages: Python and R.


If they can afford to wait 10x longer for their results and use 5x more RAM, OK. But in the normal, non-lazy, technical world you have your backends optimized, and only the frontend is simplified into Python, R, or MATLAB.


Elasticsearch is a wrapper around Lucene, and Lucene does a tremendous amount of compression in its indices.


Nice article!

BTW, Go has a very complete RE2 parser: https://golang.org/pkg/regexp/syntax/ - it even handles simplifications for you. I'm actually using it together with an RDPT in namegrep.com.

And in JavaScript you also have ret.JS: https://github.com/fent/ret.js/.


Absolutely fascinating write-up; stuff like this is why I still love HN. As a web developer who mostly writes line-of-business stuff, this kind of programming is a long way from what I do, but it reminds me just how big the field is.



