Boyer Moore is OK, but there are better algorithms available. The simpler Horspo...

johnboyer · on Aug 1, 2018

I would most definitely be interested. I always assumed that certain algorithms are better suited for certain string sizes, but I was never sure which ones. The ideal implementation is probably combining multiple algorithms with ranges of the string size.

MattPalmer1086 · on Aug 1, 2018

Absolutely true that there isn't a single algorithm that beats all the others. Of the general string search algorithms which don't use specialized CPU instructions to obtain speedups, I would recommend:

1. ShiftOr for short strings. Easy to implement.

This algorithm is not sub-linear like Boyer Moore - it examines every position, but it uses bit-parallelism to validate a match, making it outperform algorithms that rely on jumping then separate verification stages, for short strings.

2. Variants of Wu-Manber for longer strings. Hard to find a good description, but not too hard to implement.

Wu-Manber is a search algorithm designed for multi-pattern searching, based on Boyer-Moore-Horspool and hashes. However, it also performs really well for single-pattern searching. I have encountered variants of this in various places, e.g. Lecroq's in "Fast Exact String Matching Algorithms".

These algorithms use a hash of a q-gram to look up how far it's safe to shift. Q-grams tend to appear less frequently in search patterns vs. single characters, so you get longer jumps, at the cost of reading more characters and producing a hash.

3. Horspool (or Boyer-Moore-Horspool).

This algorithm performs quite well - not as well as ShiftOr for shorter patterns or Wu-Manber variants for longer ones, but still respectable. It's essentially Boyer-Moore but only using one of the shift tables, which makes it much easier to implement.

4. Qgram-Filtering by Branislav Durian, Hannu Peltola, Leena Salmela and Jorma Tarhio

For longer patterns, this algorithm outperforms the others mostly. However, it can be a bit complicated to implement well, and it has some nasty worst-cases (rarely encountered) where the performance becomes absolutely dreadful. For that reason I tend not to use it.

There are hundreds of possible search algorithms available now (see https://github.com/smart-tool/smart for implementations of many of them with a benchmark tool). However, it's hard to figure out exactly which algorithm is the best given your data and pattern. For that reason, I tend to keep things simple.

I would just use ShiftOr for short patterns, and another one for longer patterns. I would tend to use a Wu-Manber variant there, but Horspool would probably give acceptable performance.

The only other consideration is the time it takes to build the pattern indexes. For short searches, or if you have to re-build the pattern index on each repeated search, it can actually be quicker to just do a naive "check the string at each position" search, since it doesn't require building anything before searching.

burntsushi · on Aug 7, 2018

In my experience, it is difficult to beat a very simple heuristic in most settings: when provided a string, pick the most infrequently occurring byte in that string, and use that as your skip loop in a vector optimized routine such as memchr. If you guess the right byte, you'll spend most of your time in a highly optimized routine that makes the most out of your CPU.

Picking the right byte can be tricky, but a static common sense ranking actually works quite well. At least, the users of ripgrep seem to think so. :-)

For some reason, I've never seen this algorithm described in the literature. It doesn't have nice theoretical properties, but it's quite practical.

MattPalmer1086 · on Aug 1, 2018

I'd be remiss not to note that there are better and still simple variants of Horspool that outperform the vanilla algorithm.

In any case, if you'd like to discuss search algorithms when you want to implement them, I'd be happy to help. I'm mattpalms, gmail.com.