
Though there is an equivalent DFA for every NFA, the DFA can have exponentially more states than the NFA, though my understanding is that's not usually the case in practice. In any event, the processing time of an NFA implementation can reflect this explosion, since the actual implementation of a finite state machine on ordinary hardware has to be deterministic.
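
For a concrete feel for the blowup, here is a sketch (my own illustration, not from the thread): the NFA for "the n-th character from the end is an a", i.e. (a|b)*a(a|b)^(n-1), has n+1 states, while the subset construction yields 2^n reachable DFA states.

  # NFA: state 0 loops on a/b and also moves to 1 on 'a';
  # states 1..n-1 advance on any character; state n accepts.
  def dfa_state_count(n):
      def step(states, ch):
          out = set()
          for s in states:
              if s == 0:
                  out.add(0)
                  if ch == 'a':
                      out.add(1)
              elif s < n:
                  out.add(s + 1)
          return frozenset(out)

      start = frozenset({0})
      seen, todo = {start}, [start]
      while todo:                        # explore every reachable subset
          cur = todo.pop()
          for ch in 'ab':
              nxt = step(cur, ch)
              if nxt not in seen:
                  seen.add(nxt)
                  todo.append(nxt)
      return len(seen)

  for n in range(1, 8):
      print(n, dfa_state_count(n))       # 2, 4, 8, ... that is, 2^n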



> the DFA can have an exponentially larger number of states than the NFA

Yep. Which is why some implementations fall back to the NFA when the number of states of the DFA reaches some predefined threshold.
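
A minimal sketch of that fallback idea (my own illustration, loosely in the spirit of RE2's bounded DFA cache; `delta`, `start` and `accepting` are hypothetical names, with `delta` mapping (state, character) pairs to sets of next states):

  # Lazily build DFA states (sets of NFA states), but stop memoising
  # once the cache reaches a threshold; past that point each step is
  # plain NFA set simulation, as in the algorithm below.
  def run_hybrid(delta, start, accepting, text, max_dfa_states=1000):
      cache = {}
      states = frozenset({start})
      for ch in text:
          key = (states, ch)
          if key in cache:
              states = cache[key]                # DFA fast path
          else:
              states = frozenset(n for s in states
                                 for n in delta.get((s, ch), ()))
              if len(cache) < max_dfa_states:    # memoise only under the cap
                  cache[key] = states
          if not states:
              return False                       # every thread died
      return bool(states & accepting)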

> In any event the processing time of an NFA implementation *can reflect this explosion*

Well, not necessarily.

Trial and error (backtracking) is not the only way to simulate an NFA. A more reliable way is to execute the multiple choices in parallel. This avoids the combinatorial explosion, because threads that land on the same state can be merged.

A thread of execution in a finite automaton has exactly 2 pieces of information: the current position in the character stream, and the current state. In the case of a DFA, you stick to a single thread of execution, and if you fail, there's nothing to backtrack to anyway. In the case of an NFA, you can maintain a set of current states. And that set will never be bigger than the number of states in your NFA.

The algorithm looks like this:

  Set = [Start_state]
  Until input is empty:
     Read the next character
     New_set = []  // empty set
     For each state in the old set:
        Add all its possible next states on that character to New_set
     Set = New_set
     If Set is empty:
        Fail: the input cannot be recognised
  If Set contains an accepting state:
     You successfully recognised the input
  Else:
     You failed to recognise the input
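
Here is a runnable version of that loop (my own sketch; `delta`, `start` and `accepting` are hypothetical names, and epsilon transitions are ignored for brevity):

  # Parallel NFA simulation: keep a set of live states, advance all of
  # them on each character, accept iff a live state is accepting at
  # the end of the input.
  def nfa_match(delta, start, accepting, text):
      states = {start}
      for ch in text:
          states = {n for s in states for n in delta.get((s, ch), ())}
          if not states:
              return False                 # every thread died
      return bool(states & accepting)

  # Example: the language (ab)* over {a, b}
  delta = {(0, 'a'): {1}, (1, 'b'): {0}}
  print(nfa_match(delta, 0, {0}, 'abab'))  # True
  print(nfa_match(delta, 0, {0}, 'aba'))   # False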


Parallel processing won't convert an exponential process into a polynomial one...unless one can run an exponential number of processes simultaneously.


Not in the general case. But in this case, yes it can.

First, I suggest you read the fucking link I summarised in the first place: https://swtch.com/~rsc/regexp/regexp1.html

Second, this fucking course is really good: https://www.coursera.org/course/automata (I have taken it myself.) If you have a dozen hours to spare, it is highly recommended.

Third, you may want to take a look at this… fucking… analogy:

    0
   / \
  |   |
   \ /
    1
   / \
  |   |
   \ /
    2
    .
    .
    .
   N-2
   / \
  |   |
   \ /
   N-1
   / \
  |   |
   \ /
    N
The number of paths from top to bottom is 2^N. If you want to try them all, that will indeed take exponential time, parallelism or not. But when you process an NFA, you don't care which path you have taken. So you can forget about which path you have "actually" taken.

The thing is, when you branch from K, both branches land at the same spot: K+1. So you merge those branches. Then you split again. So when you're going through this graph, the maximum number of actual threads you will ever run in parallel is 2.

That's nowhere near the expected 2^N.

Here lies the beauty of proper NFA processing. By merging the parallel threads at each opportunity, you keep their number down to at most the number of states in the NFA. As a result, the worst-case running time of a good NFA algorithm is O(N*S), where N is the length of the input and S the number of states in the NFA, and the worst-case space is O(S).
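
To make the merging concrete, here is a sketch of the diamond graph above (my own illustration, reusing the transition-table shape from `nfa_match`; the 'L' and 'R' nodes are the two sides of each diamond):

  # Reading 'a' from level k splits into two branches; reading 'b'
  # merges both branches back at level k+1. Track the peak number of
  # live states while simulating.
  def max_live_states(delta, start, text):
      states, peak = {start}, 1
      for ch in text:
          states = {n for s in states for n in delta.get((s, ch), ())}
          peak = max(peak, len(states))
      return peak

  N = 30
  delta = {}
  for k in range(N):
      delta[(k, 'a')] = {('L', k), ('R', k)}
      delta[(('L', k), 'b')] = {k + 1}
      delta[(('R', k), 'b')] = {k + 1}
  print(max_live_states(delta, 0, 'ab' * N))  # 2, despite 2^30 paths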

Of course, if you fail to merge the threads (as backtracking implementations do), then your worst case goes back to exponential. On the other hand, it can give you extra powers, such as back references.
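
The exponential worst case is easy to reproduce with the pathological pattern from the linked article (a?^n a^n matched against a^n); here Python's `re` module stands in for a generic backtracking engine:

  import re, time

  # Each a? can match 'a' or the empty string; the engine backtracks
  # through roughly 2^n combinations before finding the match.
  for n in (5, 10, 15, 20, 25):
      pattern = re.compile('a?' * n + 'a' * n)
      t0 = time.perf_counter()
      pattern.match('a' * n)
      print(n, f'{time.perf_counter() - t0:.3f}s')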


> Real world regular expressions

> Regular expression usage in real programs is somewhat more complicated than what the regular expression implementations described above can handle...

I'm unable to see how the approach suggested in the paper could be made idempotent with Perl's regular expression engine while still maintaining linear performance. Though admittedly idempotence is a strong form of "Perl Compatible".

But as I'm not particularly bright, there's no need to try and explain.


(Sorry for the dismissive tone, but you apparently glossed over the algorithm I took pains to write for you. That pissed me off a little.)

PCREs are the wrong trade-off, not worth the backward compatibility. The world will be a better place when people stop using them. I don't care what real programs do; real programs are wrong.

You need to understand that PCREs are not regular expressions. They're top-down parsers.

Let that sink in for a moment. PCREs are actually top-down parsers.
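
One concrete way to see it (my own sketch, with Python's `re` standing in for a backtracking, Perl-style engine): back references let a pattern recognise the "square" language ww, which no finite automaton, and hence no true regular expression, can recognise.

  import re

  # ^(\w+)\1$ matches any word repeated twice: a non-regular language.
  square = re.compile(r'^(\w+)\1$')
  print(bool(square.match('abab')))   # True:  'ab' + 'ab'
  print(bool(square.match('abaab')))  # False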

By using the syntax of regular expressions, PCREs are less readable than they could be, and their full power comes off as incredibly contrived. If they were called Perl Compatible Parsing Expression Grammars (PCPEG) and used an actual PEG syntax, I wouldn't object.

If you need to parse a regular grammar, use an actual regular expression engine. If you need something more powerful, then you may consider top down parsing. Just don't be shy about it.


Obviously, one of us has learned something since the ancestor comment. Hopefully it's more than how to move goalposts.


I'm not a Perl RegEx expert, but what about Perl regexes isn't idempotent? Which features modify the external environment other than by returning matches? Even variable assignment would be idempotent unless the variable is both modified and affects the behavior of the regex.


What does idempotence have to do with anything? A regex match is supposed to return a result, not perform effects…


> What does idempotence have to do with anything? A regex match is supposed to return a result, not perform effects…

Correct, which is why I was asking why the GP wrote:

> I'm unable to see how the approach suggested in the paper could be made idempotent with Perl's regular expression engine while still maintaining linear performance.


s/fucking/fine/g;



