Perl Incompatible Regular Expressions (github.com/yandex)
62 points by eatitraw on Sept 12, 2015 | 43 comments



If you're interested in regular expressions and their place in automata, Jeff Ullman's Automata course starts today on Coursera: https://www.coursera.org/course/automata

The recent HN discussion of its announcement is here: https://news.ycombinator.com/item?id=10089092

Ullman is also coauthor of "The Dragon Book".


Thanks for the reminder; I thought it was still a few weeks away.


Google has a similar library with similar goals; see https://github.com/google/re2/wiki/CplusplusAPI. It also removes backtracking.

The idea is that backtracking may kill performance, so a specially crafted text that causes a lot of backtracking can be used as a DoS attack.
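A minimal sketch of such a pathological input, using Python's backtracking `re` module (the pattern and input sizes are just an illustration):

```python
import re
import time

# Nested quantifiers like (a+)+ force a backtracking engine to try
# exponentially many ways of splitting the input before it can fail.
pattern = re.compile(r'^(a+)+$')

for n in (16, 18, 20):
    text = 'a' * n + 'b'   # the trailing 'b' guarantees the match fails
    start = time.perf_counter()
    assert pattern.match(text) is None
    print(n, round(time.perf_counter() - start, 3), 'seconds')
```

Each step roughly doubles the running time, even though the input only grows by two characters. A non-backtracking engine such as RE2 or PIRE handles the same input in time linear in its length.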


Wow, really impressive. Sometimes specializing by cutting out functionality is the right approach. In this case eliminating greedy/non-greedy matching (and others) means this can work as a high-level triage and something with more specificity can do the precision work once you have a candidate match.

It looks like this could have a good place in a real-time streaming architecture somewhere.


The readme does say it grew out of Yandex's web crawler.


README.ru has the real documentation; Google Translate does a pretty good job with it. It mentions that the algorithms are from the Dragon book.

I didn't try the code, but I think it's missing full Unicode character class support (for example when you use \w). But I see it handles Russian :-)

https://github.com/yandex/pire/blob/master/pire/classes.cpp#...



What I don't get is that the example given:

     hello\\s+w.+d$
is 100% Perl compatible; it seems more like a "subset" than "incompatible". I've seen comments saying it's a "joke". Can anyone confirm that the title was indeed a joke?

Edit: I know what DFAs/NFAs are and how they relate to formal language theory and regular languages; the question still stands: how can a subset be called "incompatible"?


Perl's regex engine uses a non-deterministic finite automaton [NFA]. Because the readme indicates that each character is examined only once and that PIRE lacks backtracking, lookahead, and capture groups, its engine is almost certainly DFA [deterministic finite automaton] based. Thus the syntax it is likely to choke on or ignore includes anything involving curly braces {}.

Keep in mind that "regular expressions" can denote a notation for describing finite automata and that this is subtly different from the programming language implementation of regex engines in languages like Perl.


Perl's regex engine must use something stronger than plain NFAs. The expressive power of NFAs and DFAs is exactly the same. They both recognize the regular languages, which is less than what can be expressed with Perl "regular expressions".


I'm using Friedl's classification scheme for regex engines from Mastering Regular Expressions; I don't know of a more standard survey of regexes as implemented in various programming languages. Anyway, from a practical standpoint, an NFA has to be implemented as a push-down automaton on the von Neumann machines we currently have, so that backtracking can simulate simultaneous exploration of the arbitrary number of DFA states that a single NFA state may represent. That doesn't make Friedl's classification useless.


https://swtch.com/~rsc/regexp/regexp1.html and the articles it links to are a good detailed explanation of the advantages of the DFA approach.


To sum it up: The NFA and the DFA have exactly the same Big O complexity. The NFA is just slower by some factor. So that's not much of a difference.

The real difference is between dynamic programming and backtracking. When you implement an NFA with dynamic programming, you're still O(N) with respect to the input no matter what. But when you use backtracking, as Perl does, you're prone to combinatorial explosions in some cases.

Also keep in mind that PCRE are not regular expressions. They're more expressive than that, and can express more than just regular grammars. That's possible because when you start with backtracking, it is relatively easy to add other niceties on top of pure regular expressions.
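One concrete way to see that, using Python's `re` module (whose back-references behave like Perl's here): the pattern below recognises the language of doubled words, which is provably not regular, so no pure regular expression engine can express it.

```python
import re

# (.+)\1 matches any string that consists of some word repeated twice.
# No finite automaton can recognise this language, which is why
# back-references require a backtracking implementation.
doubled = re.compile(r'^(.+)\1$')

print(bool(doubled.match('abcabc')))  # True:  'abc' + 'abc'
print(bool(doubled.match('abcabd')))  # False: second half differs
```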


Though there is an equivalent DFA for every NFA, the DFA can have an exponentially larger number of states than the NFA, though my understanding is that's not usually the case in practice. In any event the processing time of an NFA implementation can reflect this explosion since the actual implementation algorithm of a finite state machine on ordinary hardware has to be deterministic.


> the DFA can have an exponentially larger number of states than the NFA

Yep. Which is why some implementations fall back to NFA simulation when the number of states of the DFA reaches some predefined threshold.

> In any event the processing time of an NFA implementation can reflect this explosion

Well, not necessarily.

Trial and error (backtracking) is not the only way to simulate an NFA. A more reliable way to do it is to execute the multiple choices in parallel. This nullifies the combinatorial explosion because the different threads share a great deal of information.

A thread of execution in a finite automaton has exactly 2 pieces of information: the current position in the character stream, and the current state. In the case of a DFA, you stick to a single thread of execution, and if you fail, there's nothing to backtrack to anyway. In the case of an NFA, you can maintain a set of current states. And that set will never be bigger than the number of states in your NFA.

The algorithm looks like this:

  Set = [Start_state]
  Until input is empty:
     Read the next character
     New_set = []  // empty set
     For each state in the old set:
        Add all its possible next states to New_set
     If New_set is empty:
        Fail: the input is not recognised
     Set = New_set
  If Set contains an accepting state:
     Success: the input is recognised
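A sketch of that thread-set algorithm in Python (the NFA below, for the regular expression a(a|b)*b, is just an illustrative example):

```python
def simulate_nfa(transitions, start, accepting, text):
    """transitions maps (state, char) -> set of next states."""
    states = {start}                     # the set of live "threads"
    for ch in text:
        # Advance every live thread by one character; duplicate threads
        # are merged automatically because `states` is a set.
        states = set().union(*(transitions.get((s, ch), set())
                               for s in states))
        if not states:
            return False                 # every thread died: reject early
    return bool(states & accepting)      # accept iff an accepting state survives

# NFA for a(a|b)*b: state 0 is the start, state 2 is accepting
trans = {
    (0, 'a'): {1},
    (1, 'a'): {1},
    (1, 'b'): {1, 2},
}
print(simulate_nfa(trans, 0, {2}, 'aabab'))  # True
print(simulate_nfa(trans, 0, {2}, 'aaba'))   # False
```

Note that `states` can never grow beyond the number of states in the NFA, which is what keeps the whole run linear in the input length.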


Parallel processing won't convert an exponential process into a polynomial one...unless one can run an exponential number of processes simultaneously.


Not in the general case. But in this case, yes it can.

First, I suggest you read the fucking link I summarised in the first place: https://swtch.com/~rsc/regexp/regexp1.html

Second, this fucking course is really good: https://www.coursera.org/course/automata (I have taken it myself.) If you have a dozen hours to spare, it is highly recommended.

Third, you may want to take a look at this… fucking… analogy:

    0
   / \
  |   |
   \ /
    1
   / \
  |   |
   \ /
    2
    .
    .
    .
   N-2
   / \
  |   |
   \ /
   N-1
   / \
  |   |
   \ /
    N
The number of paths from top to bottom is 2^N. If you want to try them all, that will indeed take exponential time, parallelism or not. But when you process an NFA, you don't care which path you have taken. So you can forget about which path you have "actually" taken.

The thing is, when you branch from K, both branches land at the same spot: K+1. So you merge those branches. Then you split again. So when you're going through this graph, the maximum number of actual threads you will ever run in parallel is 2.

That's nowhere near the expected 2^N.

Here lies the beauty of proper NFA processing. By merging the parallel threads at each opportunity, you keep their number down to at most the number of states in the NFA. As a result, the worst-case running time of a good NFA algorithm is O(N*S), where N is the length of the input and S the number of states in the NFA. And the worst-case running space is O(S).
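The diamond diagram above can be checked mechanically. This sketch (the state names are made up for the illustration) builds the chain as an NFA and runs the set-based simulation, counting live threads:

```python
# Build the diamond chain from the diagram: each level splits into a
# left and a right branch, and both branches merge at the next level.
N = 30                                    # 2**30 distinct paths overall
trans = {}
for k in range(N):
    trans[(('lvl', k), 'x')] = {('L', k), ('R', k)}   # split into 2 threads
    trans[(('L', k), 'y')] = {('lvl', k + 1)}         # both branches...
    trans[(('R', k), 'y')] = {('lvl', k + 1)}         # ...merge again

states = {('lvl', 0)}
max_threads = 1
for ch in 'xy' * N:                       # walk the whole chain
    states = set().union(*(trans.get((s, ch), set()) for s in states))
    max_threads = max(max_threads, len(states))

print(max_threads)              # 2: merging keeps the thread count tiny
print(('lvl', N) in states)     # True: the bottom of the chain is reached
```

Despite the 2^N paths, the simulation never tracks more than two threads at once.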

Of course, if you fail to merge the threads (as backtracking does), then your worst case goes back to exponential. On the other hand, backtracking can give you extra powers such as back references.


Real world regular expressions

Regular expression usage in real programs is somewhat more complicated than what the regular expression implementations described above can handle...

I'm unable to see how the approach suggested in the paper could be made idempotent with Perl's regular expression engine while still maintaining linear performance. Though admittedly idempotence is a strong form of "Perl Compatible".

But as I'm not particularly bright, there's no need to try and explain.


(Sorry for the dismissive tone, but you apparently glossed over the algorithm I took pain to write for you. That pissed me off a little.)

PCRE is the wrong trade-off, not worth the backward compatibility. The world will be a better place when people stop using them. I don't care what real programs do; real programs are wrong.

You need to understand that PCRE are not regular expressions. They're top down parsers.

Let that sink in for a moment. PCRE are actually top-down parsers.

By using the syntax of regular expressions, PCRE are less readable than they could be, and their full power comes off as incredibly contrived. If they were called Perl Compatible Parsing Expression Grammars (PCPEG) and used an actual PEG syntax, I wouldn't object.

If you need to parse a regular grammar, use an actual regular expression engine. If you need something more powerful, then you may consider top down parsing. Just don't be shy about it.


Obviously, one of us has learned something since the ancestor comment. Hopefully it's more than how to move goalposts.


I'm not a Perl RegEx expert, but what about Perl regexes isn't idempotent? Which features modify the external environment other than by returning matches? Even variable assignment would be idempotent unless the variable is both modified and affects the behavior of the regex.


How idempotence has to do with anything? A regex match is supposed to return a result, not perform effects…


> How idempotence has to do with anything? A regex match is supposed to return a result, not perform effects…

Correct, which is why I was asking why the GP wrote:

> I'm unable to see how the approach suggested in the paper could be made idempotent with Perl's regular expression engine while still maintaining linear performance.


s/fucking/fine/g;


See https://swtch.com/%7Ersc/regexp/regexp1.html for a fine explanation of DFA vs NFA.


Again, that leaves it as a subset of Perl's regular expressions, not a disjoint/distinct set, which is what incompatibility would imply.


Since regular expressions evaluate to values [that's what makes them expressions], type theory is perhaps a better abstraction than set theory. In particular, since regular expressions accept an input and produce an output, it may be particularly useful to think of them as function types.

If PIRE is an ancestor type of Perl regular expressions, then its less specific input language can be fed into a Perl engine, while Perl's more specific input language cannot be fed into it. Compatibility/incompatibility labels depend on which side of covariance/contravariance [1] the person doing the labeling stands. The authors of the library got their label maker out first.

If they were wrong, file an issue on their Github repository.

[1]: https://en.m.wikipedia.org/wiki/Covariance_and_contravarianc...


Well, since you want to get pedantic, fine....

Regular expressions define/describe a language. They do not, and cannot, evaluate to a value, and it wouldn't even make sense if they did.

Regular expressions also don't, by themselves, accept input. An RE library accepts an input (always a finite string) and an expression (also a finite string), and always returns one of two values: the input string is a member of that language, or it is not. The result is binary, and it isn't returned from the regular expression itself.

Saying that you can't feed PCRE expressions into PIRE is completely false, since the example in my original comment, taken from the linked GitHub, is literally a valid PCRE. That's like saying asm.js isn't compatible with JavaScript, which is obvious nonsense, because it's by definition a strict subset of JavaScript.

The set of all regular expressions that will test valid in this library is strictly a subset of those which are valid in PCRE.

Type theory has nothing to do with anything here, and even if it did, PCRE and PIRE have identical function types. They both accept string expressions (language descriptions) and finite string input, and they both return boolean values (whether the input string is a member of that language).

And enough with the snark already. You aren't the only one on the planet who knows what a DFA/NFA is, and your statements make it seem debatable that you understand them.

But again, my first question was: "is this name really a joke?", which could have been answered by a single damned comment saying "yes".


When we get to the end of a string, the automaton is either in an accepting state or not. If that ain't a boolean, it'll do till one comes along... at least for me, as I have a hard time imagining uses for regular expressions that aren't implemented so as to return values. Then again, I'm not that bright and tend to mumble along writing down my thoughts.

Which is why I'm writing about types... it seems to unroll the possible "compatible/incompatible" conundrum of the name... or at least, after writing about it and heading to bed, I realized a simpler illustration was possible using sets... and isn't that just the way of the maths: one description is isomorphic with another that makes explanation easier? Alas, I despaired of sharing it, but then this very morning I found a place for it.

Anyway, if one tests for compatibility [whatever that may be] between a set S and a superset SS as suggested above, the results of the test depend on both the test chosen and the ordering of the arguments. This is to say:

   For each m in S, m is a member of SS -> true
   For each mm in SS, mm is a member of S -> false
But choosing a different test

    S is subset of SS -> true
    SS is subset of S -> false
Of course, whether we map true to compatibility [whatever that is] or incompatibility is entirely dependent on our arbitrary choice of whether our test is deemed to be for compatibility or incompatibility.

I wish I had an easy answer, but it's turtles, even by any other sweet smelling name, all the way down I'm afraid.


I don't know PCRE nor PIRE. Still…

…I bet you can find some input that would work differently on PIRE and PCRE. Just use an escape sequence that PCRE recognises and PIRE interprets as ordinary characters.

Failing that, we may be able to find some input that is accepted by PCRE and rejected by PIRE; then some input that is accepted by PIRE and rejected by PCRE.

If either of those two hypotheses turns out to be true, then PIRE is neither forward nor backward compatible with PCRE, so it would truly be incompatible. I expect the provided example works on both simply because most simple inputs would work on both. Showing incompatibilities was probably not the goal of this example.


I think it's called "incompatible" because it looks at it from the other direction, so to speak. There are valid Perl regular expressions, i.e. those using backtracking, that are incompatible with this regex engine.


Thank you, that's 100% the type of answer I was hoping for


Edit: I can't edit my post, but how about instead of down-voting, you answer my question? I don't get how my question isn't genuine.


It's a pun on Perl Compatible Regular Expressions. You're being downvoted for bikeshedding over a name.


What was wrong with the GNU basic regex?

If you're going to write a stripped-down string matching syntax strictly for "regular" text, then why bother mentioning Perl?


Seems like a riff on the famous Perl Compatible Regular Expressions library (http://www.pcre.org), which is used in a bunch of high-profile things (PHP, and Apache Server, for starters). Kind of like "less" vs "more".

So a bit of an inside joke I guess, but most people familiar with regexes will probably have heard of PCRE, so it's not a terribly obscure reference. I liked it :)


PCRE is a set of extensions to RE to enable parsing of nonregular grammars. This is just RE. It's like calling some hypothetical language C++--. It doesn't make sense.


Well, if I designed a language that was C++ but stripped of some specific features for a purpose that made it less than C++, but still not quite just C, I might call it C++-- as a joke. In that respect, I think it makes perfect sense.


"Perl Compatible Regex" (pcre.org) is a popular regex library. The name PIRE is presumably a reference to that.


woosh


There's nothing wrong with it. Its implementation, however, is optimized for some use cases and not others. It appears that the folks at Yandex did not feel that crawling the web at their scale was among the tasks for which it was suited.


Scary to think that a major search engine really uses regular expressions heavily. Regexprs are great for quick scripts, but one would expect that in major production applications better and higher level parsing algorithms would be used. It must be a nightmare to debug if you have a lot of reg-exprs interacting in a large code base.


If the language you're trying to parse happens to be regular, then regular expressions are the perfect tool for the job. They're simple, reliable, and impossibly fast.

When you think about it, a search engine is mainly about finding key words in a huge pile of text, without caring much about the structure of the language the text is written in. That use case is totally specifiable by a regular language.



