After reading Russ's original article in 2007 I did some work on bringing the nascent regex API in perl's core up to par so one could lexically replace perl's regex engine with the plan9 engine, PCRE and others.
One interesting product of this work was the ability to compare other regex engines to Perl's in perl's own test suite. See this journal entry for some info about that: http://use.perl.org/~avar/journal/33585
Maybe I'll resurrect some of that work once Perl 5.12 comes out and try dropping in RE2 and see how it compares to PCRE and Perl on the various edge cases involved.
This would be somewhat easier if RE2 supported Perl's syntax for named captures.
It's easy enough to tweak the parser (re2/parse.cc) to add it. Please send me an email (rsc@swtch.com) if you do run the tests. I'd be interested to see the results.
Interesting fact: If you restrict yourself in certain ways (e.g. no backreferences), it's possible to write an automatic "subtype" checking system for regular expressions.
(By this I mean you can prove, for two regexes, that if the first matches something then the second will always also match that same text. You can also test for exact equivalence by making sure the relation holds in both directions.)
I've used this before in code that dealt with XML schema (which are roughly regular expressions over trees), but it strikes me that it could be highly useful to have for the analysis & optimization described...
Examples might be an optimization database (if you see a part of a regex that's equivalent to X, substitute Y instead), or automatically generating faster-but-coarser regexes.
The algorithm I used should work with anything that's a "Regular Hedge Grammar".
An off-the-cuff understanding is that means a regular hedge grammer is a CFG written such that any recursion in the grammar is tail-recursion. (Hopefully that makes sense, although I'm mixing terms from different domains there.)
So, yes.
(n.b.: I haven't worked much with the theory; just implemented a couple of papers. :) )
I'm not understanding what you're saying. Whether a grammar on one level of the Chomsky hierarchy is also a grammar on a lower level is an undecidable problem.
Surely this doesn't work in general. If A and B are arbitrary regexes that are "pure" (meaning they describe a regular language and therefore are each equivalent to a unique deterministic finite automaton), I don't think it's possible to prove that if any string S is in A then it is also in B. Obviously for some regexes this may be possible (e.g. if A and B are finite, or regexes with trivial disjunction) but I do not believe two arbitrary DFA's can be shown to be equivalent. Please correct me if I'm wrong.
Sure it does, and it's actually easy to demonstrate. :) That's the power of limiting yourself to a regular language.
I started typing out psuedo-code, but it's easier to be informal. Consider:
- It's easy to prove that one terminal (character, char-range, etc) is a subset of another.
- It's easy to prove that a sequence of terminals is a subset of another.
- If there's a disjunction, you just need to prove both branches. If the disjunction is on the "less-than" side, check both branches with AND; if it's on the "greater-than" side, check with OR -- this is because `(A|B) > A` and `(A|B) > B`, but `C > (A|B)` iff `C > A` and `C > B`.
- For non-recursive nonterminals, just expand them as they're encountered.
- Which only leaves recursion, which is surprisingly easy to deal with as well: If you're trying to prove `A > B`, it amounts to expanding A and B into their right-hand sides and proving those; at some point due to recursion you'll need to prove `A > B` again -- which seems tricky!
However, recursion can only happen in a limited manner in a regular language. Being very handwavy because explaining it this way seems most intuitive to me: Recursion can only happen in 'tail' position in a regular language. Thus, by the time you encounter `A > B` for the second time, it happens to be the very last thing which needs to be demonstrated in order for `A > B` to be true -- you've already checked, as it were, one full round of the repetition/looping. Thus the rule is easy: if you encounter `A > B` for a second time, you can just accept it as true!
Hopefully that makes sense.
Edit for the fun of it (n.b. that I'm using `>` instead of `>=` out of laziness):
prove(A > B) ->
if seen(A > B): True
else: add_to_seen(A > B); prove(rhs(A) > rhs(B))
prove(xA > yB) -> prove_terminal(x > y) && prove(A > B)
prove( A|B > C ) -> prove(A > C) || prove(B > C)
prove( C > A|B ) -> prove(C > A) && prove(C > B)
That's probably an even more appropriate term for modern regular expression syntax, but you're right, it isn't the sense I was going for. Fixed, thanks.
One interesting product of this work was the ability to compare other regex engines to Perl's in perl's own test suite. See this journal entry for some info about that: http://use.perl.org/~avar/journal/33585
Maybe I'll resurrect some of that work once Perl 5.12 comes out and try dropping in RE2 and see how it compares to PCRE and Perl on the various edge cases involved.
This would be somewhat easier if RE2 supported Perl's syntax for named captures.