
Regexes can be fast and simple, but sometimes it’s faster and simpler to use normal string search. I recently replaced a regex with two finds for a 10x speedup.
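Roughly the shape of the rewrite, as a simplified sketch (made-up log format and field name, not my actual code):

    // Sketch only: instead of a regex like r"latency_ms=(\d+);", use two
    // substring finds and a slice. Format and names are invented.
    fn latency_ms(line: &str) -> Option<&str> {
        let key = "latency_ms=";
        let start = line.find(key)? + key.len();     // first find
        let end = start + line[start..].find(';')?;  // second find
        Some(&line[start..end])
    }

    // latency_ms("status=ok; latency_ms=42;") == Some("42")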



There's pretty wide variation in the performance of regular expressions across different languages, and efficient implementations will often recognize when your regular expression has a literal prefix or suffix and optimize accordingly.

That doesn't help you if you're stuck using a library that doesn't perform those optimizations, but it means you need to be careful about importing your assumptions about regex performance from one language to another.

See also: https://github.com/burntSushi/rebar


Sounds like the regex library was not implemented as Russ describes... a linear find should be equivalent to a well-implemented regex search for a specific string of characters.


The GP doesn't really provide enough details to tell the difference between 1) the programmer did something wrong, or 2) the regex engine is missing optimizations or 3) the GP has a use case that is actually difficult to optimize for in the generality of the regex engine. Any one of these 3 things could be true. It's possible more than one is true simultaneously!

> a linear find should be equivalent to a well implemented regex search for a specific string of characters

This is only true theoretically. If your regex engine is just a DFA scan that does char-at-a-time matching, then any "well implemented" substring search algorithm is going to blow it out of the water because it likely uses SIMD. Of course, many regex engines (including RE2), detect literals in the regex and use SIMD to scan for it. So while "DFA scan that does char-a-time matching" is a fine semantic model to have (to a first approximation) for a finite automata based regex engine, the model breaks down pretty quickly for reasoning about performance in general.
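To make the contrast concrete, here's a rough sketch (illustration only, not RE2's or ripgrep's actual code): the window scan stands in for char-at-a-time matching, while memmem::find from the memchr crate does SIMD candidate detection under the hood.

    // Illustration only. Assumes a non-empty needle.
    use memchr::memmem;

    fn char_at_a_time(haystack: &[u8], needle: &[u8]) -> Option<usize> {
        // Inspect every window, one position at a time.
        haystack.windows(needle.len()).position(|w| w == needle)
    }

    fn simd_substring(haystack: &[u8], needle: &[u8]) -> Option<usize> {
        // memmem::find uses SIMD-accelerated candidate detection internally.
        memmem::find(haystack, needle)
    }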


Depending on the type of string search, this is not completely true. A naive string search is O(kn) (where k is the length of the search string), which is actually slower than the O(n) that a DFA can get.

But KMP and similar algorithms can do better; they can get performance approaching O(n/k) for many cases by not even looking at every character of the input string.


KMP always looks at every position, but it is linear.

There is a variant of Boyer-Moore which is sublinear on average and linear in the worst case, but it's not that fast by modern standards.

The fastest sublinear search algorithms I know that remain linear in the worst case are Linear Weak Factor Recognition (LWFR) and Linear Hash Chain (LHC). LHC is currently unpublished (I'm writing a paper on it right now).
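If the skipping idea is unfamiliar, here's a minimal Boyer-Moore-Horspool sketch, just to show how a long needle lets you jump over most of the input. (This is not LWFR or LHC; Horspool is sublinear only on average and O(nk) in the worst case.)

    // On a mismatch, shift by an amount derived from the last byte of the
    // current window, so a long needle can skip up to needle.len() bytes.
    fn horspool(haystack: &[u8], needle: &[u8]) -> Option<usize> {
        let (n, k) = (haystack.len(), needle.len());
        if k == 0 {
            return Some(0);
        }
        if k > n {
            return None;
        }
        // Shift table: how far the window may jump when its last byte is `b`.
        let mut shift = [k; 256];
        for (i, &b) in needle[..k - 1].iter().enumerate() {
            shift[b as usize] = k - 1 - i;
        }
        let mut pos = 0;
        while pos + k <= n {
            if &haystack[pos..pos + k] == needle {
                return Some(pos);
            }
            pos += shift[haystack[pos + k - 1] as usize];
        }
        None
    }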


In practice, it's apparently often better to avoid these techniques and just use SIMD operations to search. You look at every byte, so it's O(n), but the overhead of the more complicated approaches isn't worth it. See this comment from burntsushi: https://lobste.rs/s/ycydmd/why_gnu_grep_is_fast_2010#c_gpim7....


For short patterns this is probably right. Longer patterns will be faster using a modern sublinear search algorithm.


As the author of ripgrep, I wouldn't necessarily buy this. I suppose I might agree with it in the extremes, but SIMD prefilters are quite exceptionally difficult to beat with scalar code in the common cases. Longer patterns are great for the SIMD algorithms in ripgrep because of its use of background frequency distribution heuristics. That is, the longer the pattern, the less likely your candidate detection is to produce a false positive in its hot path. (I say "less likely" here intentionally. One can of course pick specific needles and haystacks where a false positive is reported at every position for any length N.)
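To give a toy flavor of the frequency idea (illustration only, with a made-up rarity ranking; the real heuristics live inside the memchr crate): pick the needle byte that is rarest in typical text, scan for just that byte, and verify the whole needle only at candidate positions.

    // Toy rare-byte prefilter. Assumes a non-empty needle; the rarity
    // ranking below is invented for illustration.
    fn rare_byte(needle: &[u8]) -> (usize, u8) {
        let rank = |b: u8| match b {
            b' ' | b'e' | b't' | b'a' | b'o' => 0u8, // very common in text
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' => 1,
            _ => 2, // punctuation and friends: comparatively rare
        };
        let (i, &b) = needle
            .iter()
            .enumerate()
            .max_by_key(|&(_, &b)| rank(b))
            .unwrap();
        (i, b)
    }

    fn find_with_prefilter(haystack: &[u8], needle: &[u8]) -> Option<usize> {
        let (off, byte) = rare_byte(needle);
        let mut start = 0;
        while let Some(i) = memchr::memchr(byte, &haystack[start..]) {
            let cand = start + i;
            // Verify the full needle aligned so its rare byte lands on `cand`.
            if cand >= off && haystack[cand - off..].starts_with(needle) {
                return Some(cand - off);
            }
            start = cand + 1;
        }
        None
    }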

I don't mean to 100% disagree with you, but I think it's misleading to suggest a sort of one dimensional view of things where, as the pattern gets larger, SIMD gets worse compared to sublinear search algorithms. There are other factors at play here, and, importantly, what "long" means in this context.

In many practical circumstances, "short" might be a few bytes, while "long" is 16 bytes. But maybe your idea of "long" is actually much longer.

If you're curious how your own algorithm stacks up to ripgrep's, you can plug your implementation into the `memchr` crate's benchmark harness: https://github.com/BurntSushi/memchr

It uses rebar: https://github.com/BurntSushi/rebar


Interesting, thanks for the detailed response. I'll have a look at the benchmark; I'm doing some work on algorithm benchmarking right now by coincidence.

I'd say a long pattern is more like 64 bytes (the benchmarking suite I use defines short patterns as 32 or under).

Edit: will also check out the frequency approach used in ripgrep, sounds fascinating.


You can read more about it at various points in these two blogs I've written:

https://blog.burntsushi.net/ripgrep/

https://blog.burntsushi.net/regex-internals/

64 bytes is decently long, but I'd still put my money on SIMD for "common" cases.


It’s RE2…


Or write a parser!

In most cases the parser will not be shorter (in code) than the regex, but it will most certainly be simpler cognitively (it's hard to understand everything a regex does; they're close to magic) and more testable (you can test the parts of the parser, like the number parser or the string parser, individually), and it's often faster too.
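As a toy example (a made-up key=value format, purely for illustration): each little piece of a hand-rolled parser is an ordinary function you can unit-test on its own, which a single regex doesn't give you.

    // Tiny hand-written parser for lines like "name=alice;age=30".
    // Each piece is independently testable. Format and names are invented.
    fn parse_pair(field: &str) -> Option<(&str, &str)> {
        let eq = field.find('=')?;
        Some((&field[..eq], &field[eq + 1..]))
    }

    fn parse_line(line: &str) -> Option<Vec<(&str, &str)>> {
        line.split(';')
            .filter(|f| !f.is_empty())
            .map(parse_pair)
            .collect() // any malformed field makes the whole line None
    }

    // parse_line("name=alice;age=30")
    //     == Some(vec![("name", "alice"), ("age", "30")])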



Well, if people spend time learning arcane regex syntax, it'd be a shame not to use it!

But yes, you're right. Sometimes simple finds are sufficient.



