Regular Expression Matching in the Wild

avar · on March 11, 2010

After reading Russ's original article in 2007 I did some work on bringing the nascent regex API in perl's core up to par so one could lexically replace perl's regex engine with the plan9 engine, PCRE and others.

One interesting product of this work was the ability to compare other regex engines to Perl's in perl's own test suite. See this journal entry for some info about that: http://use.perl.org/~avar/journal/33585

Maybe I'll resurrect some of that work once Perl 5.12 comes out and try dropping in RE2 and see how it compares to PCRE and Perl on the various edge cases involved.

This would be somewhat easier if RE2 supported Perl's syntax for named captures.

rsc · on March 12, 2010

It's easy enough to tweak the parser (re2/parse.cc) to add it. Please send me an email (rsc@swtch.com) if you do run the tests. I'd be interested to see the results.

Thanks.

avar · on March 12, 2010

I'll make sure to do that, no promises that I'll actually get around to this though.

Perl and others could definitely use the test work that was done on RE2 though, the way the tests are being automatically generated is really neat.

crux_ · on March 11, 2010

Interesting fact: If you restrict yourself in certain ways (e.g. no backreferences), it's possible to write an automatic "subtype" checking system for regular expressions.

(By this I mean you can prove, for two regexes, that if the first matches something then the second will always also match that same text. You can also test for exact equivalence by making sure the relation holds in both directions.)

I've used this before in code that dealt with XML schema (which are roughly regular expressions over trees), but it strikes me that it could be highly useful to have for the analysis & optimization described...

Examples might be an optimization database (if you see a part of a regex that's equivalent to X, substitute Y instead), or automatically generating faster-but-coarser regexes.

bdr · on March 11, 2010

Do you know whether the restrictions you're talking about limit regexps to the set of actual regular languages?

crux_ · on March 11, 2010

The algorithm I used should work with anything that's a "Regular Hedge Grammar".

An off-the-cuff understanding is that means a regular hedge grammer is a CFG written such that any recursion in the grammar is tail-recursion. (Hopefully that makes sense, although I'm mixing terms from different domains there.)

So, yes.

(n.b.: I haven't worked much with the theory; just implemented a couple of papers. :) )

tshaddox · on March 12, 2010

I'm not understanding what you're saying. Whether a grammar on one level of the Chomsky hierarchy is also a grammar on a lower level is an undecidable problem.

alttab · on March 12, 2010

I'd think so. Certain types of recursion or undefinable branches leads to unprovable expressions.

tshaddox · on March 12, 2010

Surely this doesn't work in general. If A and B are arbitrary regexes that are "pure" (meaning they describe a regular language and therefore are each equivalent to a unique deterministic finite automaton), I don't think it's possible to prove that if any string S is in A then it is also in B. Obviously for some regexes this may be possible (e.g. if A and B are finite, or regexes with trivial disjunction) but I do not believe two arbitrary DFA's can be shown to be equivalent. Please correct me if I'm wrong.

koenigdavidmj · on March 12, 2010

http://compilers.iecc.com/comparch/article/06-09-021 appears relevant.

EDIT: http://www.utdallas.edu/~dxd056000/cs4384/lect9.ppt shows how to find the minimum DFA as well.

crux_ · on March 12, 2010

Sure it does, and it's actually easy to demonstrate. :) That's the power of limiting yourself to a regular language.

I started typing out psuedo-code, but it's easier to be informal. Consider:

- It's easy to prove that one terminal (character, char-range, etc) is a subset of another.

- It's easy to prove that a sequence of terminals is a subset of another.

- If there's a disjunction, you just need to prove both branches. If the disjunction is on the "less-than" side, check both branches with AND; if it's on the "greater-than" side, check with OR -- this is because `(A|B) > A` and `(A|B) > B`, but `C > (A|B)` iff `C > A` and `C > B`.

- For non-recursive nonterminals, just expand them as they're encountered.

- Which only leaves recursion, which is surprisingly easy to deal with as well: If you're trying to prove `A > B`, it amounts to expanding A and B into their right-hand sides and proving those; at some point due to recursion you'll need to prove `A > B` again -- which seems tricky!

However, recursion can only happen in a limited manner in a regular language. Being very handwavy because explaining it this way seems most intuitive to me: Recursion can only happen in 'tail' position in a regular language. Thus, by the time you encounter `A > B` for the second time, it happens to be the very last thing which needs to be demonstrated in order for `A > B` to be true -- you've already checked, as it were, one full round of the repetition/looping. Thus the rule is easy: if you encounter `A > B` for a second time, you can just accept it as true!

Hopefully that makes sense.

Edit for the fun of it (n.b. that I'm using `>` instead of `>=` out of laziness):

     prove(A > B) ->
       if seen(A > B): True 
       else: add_to_seen(A > B); prove(rhs(A) > rhs(B))
     prove(xA > yB) -> prove_terminal(x > y) && prove(A > B)
     prove( A|B > C ) -> prove(A > C) || prove(B > C)
     prove( C > A|B ) -> prove(C > A) && prove(C > B)

ximeng · on March 11, 2010

A couple of the image links are broken. Try

http://pdos.csail.mit.edu/~rsc/regexp-img/script_Greek.png http://pdos.csail.mit.edu/~rsc/regexp-img/cat_Lu.png

for the trees used to match greek scripts and lower case letters respectively.

Another interesting article from the guy who wrote Google Code Search if I'm not mistaken.

rsc · on March 11, 2010

Image links fixed.

ableal · on March 11, 2010

Thanks for great article, but I'd also suggest s/cacophany/cacophony/

(If theophany is "an appearance of a deity to man", 'cacophany' sounds like "shit happens" ;-)

rsc · on March 12, 2010

That's probably an even more appropriate term for modern regular expression syntax, but you're right, it isn't the sense I was going for. Fixed, thanks.

ableal · on March 11, 2010

As long as I'm picking nits, I'd also object to the final header being a 'Summary' - it's more a 'Conclusions'.

My speed-reading summary (aka tl;dr):

- Your RE2 implementation beats the snot out of sparring-partner PCRE, as graphically demonstrated in the tail end.

- At the cost of leaving out some esoteric features, and a bit more memory (still small for current times).

- Does Unicode.

- Has automatically-generated tests (supplementing a small core of hand-written ones).

- Leaves reader feeling like he saw a demo of fine precision engineering ;-)

andrewcooke · on March 11, 2010

At last - been waiting for this. Will rewrite LEPL's regexp engine as soon as I can get the current release out...

If you haven't read the previous articles, they are classics and really worth the effort (linked at start of this one).

mkinsella · on March 11, 2010

Great article. I've been interested in compilers since my college class and this scratches the itch.

chasingsparks · on March 11, 2010

This should be included by professors teaching, or programmers working through, the Dragon Book. It really does a great job reinforcing some ideas.

andrewcooke · on March 11, 2010

See also http://news.ycombinator.com/item?id=1184971 which links to http://google-opensource.blogspot.com/2010/03/re2-principled... (source for this).