Regular expression matching can be simple and fast (2007) (swtch.com)
141 points by momonga 6 months ago | 43 comments



This and the associated series of regex posts from Russ Cox are probably the best content on practical regular expression engine internals you can get. If you have a cursory understanding of NFAs and DFAs, it's some of the best technical writing I've ever had the pleasure to read.

These posts are responsible for my love of regular expressions :)


Many years ago, as a fresh-out-of-college young coder, I had to come up with a way to check types on a text input file in C++ -- long before C++ even had a standard string type, let alone regex libraries like PCRE. The standard conversion functions at the time expected you to have done some type detection beforehand.

Each field in the file could contain a string, an int, or a float. In particular, to figure out the floats, I remember sitting down for a few hours and coming up with a regular expression, testing it in Perl on a bunch of test cases, and then going about the hard work of slowly converting the regex into an NFA, then a DFA, and finally an adjacency graph in a 2D matrix. A little bit of code to read things in a byte at a time, some simple arithmetic, and I had a stupid fast

   bool isFloat(char *)
Nobody at my work understood the code; they just sort of knew that if they submitted a character array with a field value in it, something happened with a big matrix, and it let them know whether it was a float or not, and then they could call the appropriate conversion function -- I think stof() or whatever was in fashion back then.
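
Not the original code, but a minimal sketch of that table-driven style, deliberately simplified to [-+]?digits[.digits] with no exponent part: the DFA lives in a small transition matrix indexed by state and character class, and the match loop is just one table lookup per byte.

    #include <cctype>

    // Character classes and a REJECT sentinel for the transition table.
    enum Cls { DIGIT, SIGN, DOT, OTHER };
    const int REJECT = -1;

    static Cls classify(unsigned char c) {
        if (std::isdigit(c)) return DIGIT;
        if (c == '+' || c == '-') return SIGN;
        if (c == '.') return DOT;
        return OTHER;
    }

    // transition[state][class] -> next state, or REJECT.
    // States: 0 start, 1 after sign, 2 integer digits (accept),
    //         3 after dot, 4 fraction digits (accept).
    static const int transition[5][4] = {
        { 2,      1,      REJECT, REJECT },  // 0: start
        { 2,      REJECT, REJECT, REJECT },  // 1: after sign
        { 2,      REJECT, 3,      REJECT },  // 2: integer digits
        { 4,      REJECT, REJECT, REJECT },  // 3: after dot
        { 4,      REJECT, REJECT, REJECT },  // 4: fraction digits
    };
    static const bool accepting[5] = { false, false, true, false, true };

    bool isFloat(const char *s) {
        int state = 0;
        for (; *s; ++s) {
            state = transition[state][classify((unsigned char)*s)];
            if (state == REJECT) return false;  // no transition: not a float
        }
        return accepting[state];
    }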

I had to make one revision to it for an edge case I missed and then it found its way into pretty much all code for the company after that.

The tuition for the class that taught me how to do that paid for itself with that one function.


And if you don't have a cursory understanding of NFAs and DFAs, they're still great for getting that cursory understanding.

These articles were my introduction to finite automata (and almost the only place I've learned about them aside from practical use), and my intuition for them often seems to be better than my peers'.


NFAs are apparently undecidable since the same input can go to multiple states. DFA states can only go to one state when given a particular input sequence.

The Dragon book contains an NFA-to-DFA algorithm (the powerset construction), which explodes states that have multiple outgoing transitions on the same symbol into sets of states.
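
For reference, a rough sketch of the subset construction, assuming for brevity an NFA whose epsilon transitions have already been eliminated; the types and names here are illustrative, not from any particular library.

    #include <map>
    #include <queue>
    #include <set>
    #include <vector>

    using State = int;
    using StateSet = std::set<State>;
    // nfa[s][c] = set of NFA states reachable from s on symbol c.
    using Nfa = std::vector<std::map<char, StateSet>>;
    // Each DFA state is a set of NFA states; the value is its outgoing moves.
    using Dfa = std::map<StateSet, std::map<char, StateSet>>;

    Dfa subset_construction(const Nfa &nfa, State start) {
        Dfa dfa;
        std::queue<StateSet> work;
        work.push({start});
        while (!work.empty()) {
            StateSet cur = work.front();
            work.pop();
            if (dfa.count(cur)) continue;            // subset already expanded
            std::map<char, StateSet> &out = dfa[cur];
            for (State s : cur)                      // union the moves of every member
                for (const auto &kv : nfa[s])
                    out[kv.first].insert(kv.second.begin(), kv.second.end());
            for (const auto &kv : out)               // every new subset becomes a DFA state
                if (!dfa.count(kv.second)) work.push(kv.second);
        }
        // Accepting DFA states are the subsets that contain an NFA accepting
        // state; that bookkeeping is omitted here.
        return dfa;
    }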

PS: All your DAGs are belong to us.


It is possible to implement NFAs directly, in parallel:

https://github.com/mike-french/myrex

Just fan out messages to all downstream states. Let a fair runtime system evaluate all progress, then tally results along all paths, for example:

https://github.com/mike-french/myrex?tab=readme-ov-file#exec...

The approach can also be adapted to captures and their many, many possible ambiguities. There is an example that generalizes a highly ambiguous match given in Cox: the regex (a?){m}(a*){m} against a string of a{m}. The case m=9 gives 864k possible solutions (traversals), and you may optionally ask for all of them to be returned:

https://github.com/mike-french/myrex?tab=readme-ov-file#ambi...

The processes share nothing; all state is in the messages, so a single network can process multiple input strings simultaneously. It can also generate Monte Carlo strings that match the regex, and those may also proceed simultaneously.
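
For comparison, a sequential sketch of the same fan-out idea: advance every live NFA state in lockstep on each input character, with the fan-out expressed as a set union rather than messages between processes. Types and names are illustrative, not from myrex.

    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    using State = int;
    using StateSet = std::set<State>;
    using Nfa = std::vector<std::map<char, StateSet>>;  // nfa[s][c] = successor states

    bool nfa_match(const Nfa &nfa, State start, const StateSet &accepting,
                   const std::string &input) {
        StateSet current = { start };
        for (char c : input) {
            StateSet next;
            for (State s : current) {                // "fan out" from each live state
                auto it = nfa[s].find(c);
                if (it != nfa[s].end())
                    next.insert(it->second.begin(), it->second.end());
            }
            current.swap(next);
            if (current.empty()) return false;       // every path is dead
        }
        for (State s : current)                      // accept if any live path is accepting
            if (accepting.count(s)) return true;
        return false;
    }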


That's identical to NFA->DFA conversion but with dynamic programming.


NFAs are decidable. As you said, an NFA can be converted to a DFA, which is decidable. Maybe you mean they are non-deterministic (that's in the name)?


You know what I meant. I used the wrong word.


There exist algorithms that deal with state blow-up. In fact, these algorithms project any given DFA onto an equivalent DFA of minimal size, guaranteeing a minimal automaton for the language it accepts. In essence, you build equivalence classes of states with respect to the possible paths they can take into other equivalence classes on further input. State machines for LR grammars are vastly more complex than NFAs, so for most (if not all) applications, a computer that can deal with LR grammars will be able to handle an NFA conversion with a subsequent minimization.


Decidability is whether one can statically determine if an algorithm halts for a given input. NFAs halt for all inputs, by construction.


You may also find this very detailed post interesting: https://devblogs.microsoft.com/dotnet/regular-expression-imp...

It's an in-depth look at where regex in .NET was before and after version 7.


> As mentioned earlier, no one knows how to implement regular expressions with backreferences efficiently, though no one can prove that it's impossible either. (Specifically, the problem is NP-complete, meaning that if someone did find an efficient implementation, that would be major news to computer scientists and would win a million dollar prize.)

It really needs to be noted that NP-completeness and efficiency in practice have nothing to do with each other. If an algorithm takes C * 1.000...(eighty zeros)...1^n time, it is exponential, but for any n you will encounter in real life the running time will be indistinguishable from C.


There's a fun list of problems like this here: https://cstheory.stackexchange.com/questions/6660/polynomial...

On the other hand, it's remarkable that so many algorithms _do_ have reasonable constants/exponents. So the concept works pretty well.


And the converse as well -- an algorithm that is 10,000,000 * n^17 is going to be effectively intractable but falls neatly into P.

And the reality is that most NP-complete problems are well-solved by heuristics except in strange combinatorial corners. It's why we can't just create encryption systems based on knowing the correct answer to an NP-complete problem; most of the problem instances will be too easy to solve.


This is a fine introduction to automata-based regexps, but the framing of comparing "good" automata with "bad" Perl-style engines is misguided. Automata and Perl-style regexps are different beasts that solve different problems. The problem seems to be one of terminology: the Perl style should never have been called regexps. That's not what they are. It's a pattern language that happens to include a variant of regexps as a subset.


I've recently discovered Lua Parsing Expression Grammars (LPEG) [1]. They are built on a different approach to parsing and are much more capable than conventional regexes: they can handle recursive sequences, can include code, have debugging utilities, are often much more performant than regexes, and are an absolute delight to work with. There is also a module called re [2] which uses a syntax similar to regexes.

1: https://www.inf.puc-rio.br/~roberto/lpeg/

2: https://www.inf.puc-rio.br/~roberto/lpeg/re.html


It's easy to miss in the article, but the reason most regex engines exhibit Perl's behavior is features like backreferences (and, I assume, lookbehinds and lookaheads).

PCRE and theoretically pure regular expressions are different things.


FWIW, it seems pretty likely the same multi-NFA algorithm described in the article could be applied to other backtracking scenarios (and therefore to back references and lookaround). Since I’m just speculating, I’ll hazard a guess that it would result in a meaningfully more complex implementation to derive each state machine, and potentially much slower (while still “linear” complexity).


As the OP specifically and explicitly says, matching with back-references is definitively NP-complete. The OP even links to a proof [1]. If you found a way to implement them in worst-case linear time, then I believe you would have discovered a constructive proof of P=NP.

Look-around is a different story, but I still don't believe you can guarantee linear time.

[1]: https://perl.plover.com/NPC/NPC-3SAT.html


I also wanted to post that link, but now that I’m rereading it I’m starting to doubt that it proves what we want it to prove here. The way it’s presented, the size of the 3-SAT formula corresponds to the size of the expression, not the size of the haystack; and that compiling a regex is exponential in its length is not exactly a surprise—the standard determinization procedure used to compile actually regular (no backreferences) regexes for linear matching is also exponential. So what we’d actually want to prove is along the lines of NP-completeness even if we allow for an exponential amount of preprocessing on the expression only.

(I’m willing to believe backreference matching is NP-complete, I just think the linked statement is weaker than that.)


Ah I see. Geoff Langdale (creator of Hyperscan) has a post about this. The comments seem relevant: https://branchfree.org/2019/04/04/question-is-matching-fixed...


SAT solvers have been implemented on top of backtracking regex engines. SAT is NP-complete, so it's game over for finding linear-time solutions to arbitrary backtracking regexes. Sadly.


> SAT solvers have been implemented on top of backtracking regex engines.

Again, if the (not-really-)regular expression grows as the SAT problem does, that doesn’t mean game over yet. Your favourite ahead-of-time regex compiler (the good interactive ones are usually incremental, but lex/flex or re2c certainly fit) is also exponential in the size of the regex, because determinization can have exponential-size output. After the (exponentially large) compiled form is produced, though, matching is linear in the size of the haystack no matter what it is.

Burntsushi's sibling reply links to a blog post and thence to a proof-of-concept regular+backreferences expression matcher [1] that works in (quite miserable, but still) polynomial time with respect to the haystack, with the degree of the polynomial depending on the number of groups that have backreferences to them.

That’s not exactly fantastic, mind you—recall that you can match arbitrary context-free languages in “merely” cubic time. But it’s not exponential.

[1] https://github.com/travisdowns/polyregex


Attempting polynomial time with respect to the string being searched, for a constant regex, is interesting. A finite regex will contain a finite number of back references.

I think a finite number of back references implies bounded stack use for the backtracking approach. A DFA plus a finite-size stack machine is convertible to a DFA with a smaller stack, and onwards to a pure DFA.

That is probably worth trying to implement, thank you for the push.


Look-arounds can be implemented in quadratic time for unbounded expressions (i.e. containing + or *), and in linear time for bounded expressions, quite easily. And I suspect they can be implemented in (super)linear time in general by matching them in parallel to the NFA.


Regexes can be fast and simple, but sometimes it’s faster and simpler to use normal string search. I recently replaced a regex with two finds for 10x speedup.
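
As an illustration of the kind of rewrite being described (the actual regex and speedup aren't given in the comment, so the literals here are made up): when the pattern is really just "literal A, then literal B somewhere after it", two plain finds do the job.

    #include <string>

    // Roughly what matching the regex "ERROR.*timeout" on one line would do,
    // but with two substring searches instead of a regex engine.
    bool has_error_then_timeout(const std::string &line) {
        std::size_t a = line.find("ERROR");
        if (a == std::string::npos) return false;
        // Start the second search just past the first literal.
        return line.find("timeout", a + 5) != std::string::npos;
    }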


There's a pretty wide variety in the performance of regular expression engines across languages, and efficient implementations will often recognize when your regular expression has a literal suffix/prefix and optimize accordingly.

That doesn't help you if you're stuck using a library that doesn't perform those optimizations, but it means you need to be careful about importing your assumptions about regex performance from one language to another.

See also: https://github.com/burntSushi/rebar


Sounds like the regex library was not implemented as Russ describes... a linear find should be equivalent to a well implemented regex search for a specific string of characters.


The GP doesn't really provide enough details to tell the difference between 1) the programmer did something wrong, or 2) the regex engine is missing optimizations or 3) the GP has a use case that is actually difficult to optimize for in the generality of the regex engine. Any one of these 3 things could be true. It's possible more than one is true simultaneously!

> a linear find should be equivalent to a well implemented regex search for a specific string of characters

This is only true theoretically. If your regex engine is just a DFA scan that does char-at-a-time matching, then any "well implemented" substring search algorithm is going to blow it out of the water because it likely uses SIMD. Of course, many regex engines (including RE2) detect literals in the regex and use SIMD to scan for them. So while "DFA scan that does char-at-a-time matching" is a fine semantic model to have (to a first approximation) for a finite automata based regex engine, the model breaks down pretty quickly for reasoning about performance in general.
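
A rough sketch of that literal-prefilter idea: use memchr (typically SIMD-accelerated in libc) to jump to candidate positions of a required byte and only run the full matcher there. The verify callback stands in for whatever full regex/automaton check applies; it is an assumption for illustration, not a real library API.

    #include <cstddef>
    #include <cstring>

    // verify() is a placeholder for the full (slower) matcher.
    bool search_with_prefilter(const char *hay, std::size_t len, char rare_byte,
                               bool (*verify)(const char *pos, const char *end)) {
        const char *p = hay;
        const char *end = hay + len;
        while (p < end) {
            const void *hit = std::memchr(p, rare_byte, (std::size_t)(end - p));
            if (hit == 0) return false;               // literal absent: no match possible
            const char *cand = (const char *)hit;
            if (verify(cand, end)) return true;       // run the slow matcher only here
            p = cand + 1;
        }
        return false;
    }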


Depending on the type of string search this is not completely true. A naive string search is O(kn) (where k is the length of the search string) which is actually slower than the O(n) that a DFA can get.

But KMP and similar algorithms can do better; they can get performance approaching O(n/k) for many cases by not even looking at every character of the input string.


KMP always looks at every position, but it is linear.

There is a variant of Boyer-Moore which is sublinear on average and linear in the worst case, but it's not that fast by modern standards.

The fastest sublinear search algorithms I know of that remain linear in the worst case are Linear Weak Factor Recognition (LWFR) and Linear Hash Chain (LHC). LHC is currently unpublished (I'm writing a paper on it right now).
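
Since LHC is unpublished and LWFR is more involved, here is a plain Boyer-Moore-Horspool sketch (not the parent's algorithm) just to show where the sublinear behaviour comes from: a bad-character skip table lets the search jump over chunks of the haystack without touching every byte.

    #include <cstddef>
    #include <string>

    std::size_t horspool_find(const std::string &hay, const std::string &needle) {
        const std::size_t n = hay.size(), m = needle.size();
        if (m == 0 || m > n) return std::string::npos;
        std::size_t skip[256];
        for (std::size_t c = 0; c < 256; ++c) skip[c] = m;   // default: shift the whole needle length
        for (std::size_t i = 0; i + 1 < m; ++i)
            skip[(unsigned char)needle[i]] = m - 1 - i;
        std::size_t pos = 0;
        while (pos + m <= n) {
            std::size_t i = m;
            while (i > 0 && hay[pos + i - 1] == needle[i - 1]) --i;  // compare right to left
            if (i == 0) return pos;                              // full match at pos
            pos += skip[(unsigned char)hay[pos + m - 1]];        // shift by the window's last byte
        }
        return std::string::npos;
    }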


In practice, it's apparently often better to avoid these techniques and just use SIMD operations to search. You look at every byte, so it's O(n) but the overhead of more complicated approaches isn't worth it. See this comment from burntsushi: https://lobste.rs/s/ycydmd/why_gnu_grep_is_fast_2010#c_gpim7....


For short patterns this is probably right. Longer patterns will be faster using a modern sublinear search algorithm.


As the author of ripgrep, I wouldn't necessarily buy this. I suppose I might agree with it in the extremes, but SIMD prefilters are quite exceptionally difficult to beat with scalar code in the common cases. Longer patterns are great for the SIMD algorithms in ripgrep because of its use of background frequency distribution heuristics. That is, the longer the pattern, the less likely your candidate detection is to produce a false positive in its hot path. (I say "less likely" here intentionally. One can of course pick specific needles and haystacks where a false positive is reported at every position for any length N.)

I don't mean to 100% disagree with you, but I think it's misleading to suggest a sort of one dimensional view of things where, as the pattern gets larger, SIMD gets worse compared to sublinear search algorithms. There are other factors at play here, and, importantly, what "long" means in this context.

In many practical circumstances, "short" might be a few bytes, while "long" is 16 bytes. But maybe your idea of "long" is actually much longer.

If you're curious how your own algorithm stacks up to ripgrep's, you can plug your implementation into the `memchr` crate's benchmark harness: https://github.com/BurntSushi/memchr

It uses rebar: https://github.com/BurntSushi/rebar


Interesting, thanks for the detailed response. I'll have a look at the benchmark; I'm doing some work on algorithm benchmarking right now by coincidence.

I'd say a long pattern is more like 64 bytes (the benchmarking suite I use defines short patterns as 32 or under).

Edit: will also check out the frequency approach used in ripgrep, sounds fascinating.


You can read more about it at various points in these two blogs I've written:

https://blog.burntsushi.net/ripgrep/

https://blog.burntsushi.net/regex-internals/

64 bytes is decently long, but I'd still put my money on SIMD for "common" cases.


It’s re2…


Or write a parser!

In most cases the parser will not be shorter (in code) than the regex, but it will most certainly be simpler cognitively (it's hard to understand everything a regex does; they're close to magic), more testable (you can test the parts of the parser, like the number parser or the string parser, individually), and often faster too.
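
A toy sketch of that point, with a made-up grammar ([-+]?digits[.digits]): each piece of the parser is a small function you can unit-test on its own, which is the testability being described.

    #include <cctype>
    #include <cstddef>
    #include <string>

    // Independently testable building block: consume one or more digits.
    static bool parse_digits(const std::string &s, std::size_t &pos) {
        std::size_t start = pos;
        while (pos < s.size() && std::isdigit((unsigned char)s[pos])) ++pos;
        return pos > start;                         // at least one digit consumed
    }

    bool parse_number(const std::string &s) {
        std::size_t pos = 0;
        if (pos < s.size() && (s[pos] == '+' || s[pos] == '-')) ++pos;  // optional sign
        if (!parse_digits(s, pos)) return false;    // integer part is required
        if (pos < s.size() && s[pos] == '.') {      // optional fraction
            ++pos;
            if (!parse_digits(s, pos)) return false;
        }
        return pos == s.size();                     // whole input must be consumed
    }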



Well if people spend time learning arcane regex syntax it'd be a shame to not use them!

But yes, you're right. Sometimes simple finds are sufficient.


Perl needed a couple of years to catch up. At first they had no idea what Russ was talking about.


Some discussion from 5 years ago:

https://news.ycombinator.com/item?id=20310669


Thanks! Macroexpanded:

Regular Expression Matching Can Be Simple and Fast (2007) - https://news.ycombinator.com/item?id=20310669 - June 2019 (33 comments)

Regular Expression Matching Can Be Simple and Fast (2007) - https://news.ycombinator.com/item?id=16341519 - Feb 2018 (20 comments)

Regular Expression Matching Can Be Simple and Fast (2007) - https://news.ycombinator.com/item?id=9374858 - April 2015 (16 comments)

Regular Expression Matching Can Be Simple And Fast [2007] - https://news.ycombinator.com/item?id=6593331 - Oct 2013 (1 comment)

Regular Expression Matching Can Be Simple And Fast - https://news.ycombinator.com/item?id=820201 - Sept 2009 (15 comments)

Regular Expression Matching Can Be Simple And Fast - https://news.ycombinator.com/item?id=466845 - Feb 2009 (17 comments)



