
JavaScript regexes are "getting better" feature-wise, but performance-wise not much has improved. The a?a?a? example regex from Russ Cox's 2007 article "Regular Expression Matching Can Be Simple And Fast" still gets exponentially slower as the number of "a?" repetitions grows (https://swtch.com/~rsc/regexp/regexp1.html):

  for (let i = 20; i < 30; i++) {
    let str = "a".repeat(i);
    let regex = new RegExp("a?".repeat(i) + "a".repeat(i));
    let t0 = performance.now();
    str.match(regex);
    let t1 = performance.now();
    console.log(i + " took " + Math.round(t1 - t0) + " milliseconds.");
  }

  20 took 7 milliseconds.
  21 took 14 milliseconds.
  22 took 27 milliseconds.
  23 took 52 milliseconds.
  24 took 102 milliseconds.
  25 took 224 milliseconds.
  26 took 421 milliseconds.
  27 took 814 milliseconds.
  28 took 1604 milliseconds.
  29 took 3470 milliseconds.

Tested today on latest Chrome on a MacBook Pro with a 2.6 GHz Intel Core i7.

Here's an idea for improving performance for regexes that are actually regular, while keeping support for advanced non-regular regexes that e.g. contain back-references:

Add a step immediately after parsing the regex: check whether it is regular and can be handled by an engine like RE2; otherwise, handle it using the normal backtracking engine.
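
A rough sketch of what that dispatch could look like (all function names here are hypothetical, purely for illustration):

  // Hypothetical dispatch: route strictly regular patterns to a
  // linear-time engine, everything else to the backtracker. None of
  // these functions exist in any real engine.
  function compileRegex(source, flags) {
    const ast = parsePattern(source, flags);    // hypothetical parser
    if (usesOnlyRegularFeatures(ast)) {         // no back-references etc.
      return compileToLinearEngine(ast);        // RE2-style NFA/DFA
    }
    return compileToBacktracker(ast);           // existing backtracking engine
  }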

An even more exotic idea would be to let multiple regex engines execute in parallel and return the result from the one that finishes first.
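
In spirit, something like this (again a sketch with made-up function names; real parallelism would need e.g. a Worker for one of the engines):

  // Hypothetical race: whichever engine answers first wins.
  async function raceMatch(str, source) {
    return Promise.race([
      matchWithBacktracker(str, source),   // hypothetical
      matchWithLinearEngine(str, source),  // hypothetical, e.g. in a Worker
    ]);
  }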




I can't source this, but IIRC Chrome devs have tried something like this and couldn't get the match semantics in the RE2 case to exactly match existing JavaScript regex semantics. I heard about this long ago from a friend, and I don't remember the details. But it's quite plausible.

In general, "obvious" optimizations like this are almost always harder than they seem.

One possible path would be to precisely characterize the regexes on which FSMs and backtrackers agree exactly, and use the FSM-based approach only for those. But even then, you're maintaining two regex engines instead of one, and both need to be production-grade and very fast. It is no easy task.


You can also implement linear-time regex matching in user code with WebAssembly. In my tests it was faster than native regex matching (https://jasonhpriestley.com/regex-dfa).
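
To give a flavor of the approach (a plain-JS sketch with a hand-built transition table, rather than the compiled wasm from the article): once a regex is compiled to a DFA, matching is a single pass with one table lookup per character.

  // Hand-built DFA for /^(ab)+$/, showing the single-pass matching
  // loop that the linear-time approach boils down to.
  // States: 0 = start, 1 = saw "a", 2 = saw "ab" (accepting).
  const transitions = {
    0: { a: 1 },
    1: { b: 2 },
    2: { a: 1 },
  };
  function dfaMatch(input) {
    let state = 0;
    for (const ch of input) {
      state = (transitions[state] || {})[ch];
      if (state === undefined) return false; // dead state; never backtracks
    }
    return state === 2; // accept only if we end in the accepting state
  }
  console.log(dfaMatch("ababab")); // true
  console.log(dfaMatch("abba"));   // false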


As someone familiar with JS engine regex internals, those results are totally shocking to me. How can I try your example? In particular I wasn't able to figure out which regexes you are testing with.


I tested with this regex:

  /^abcd(a(b|c)*d)*a(bc)*d$/
I used long strings of repeated "abcdabcd" as the test strings.
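
The native side of the benchmark is essentially this (a sketch; the repeat count is arbitrary, and the wasm side is described in the linked article):

  const regex = /^abcd(a(b|c)*d)*a(bc)*d$/;
  const input = "abcdabcd".repeat(100000);
  const t0 = performance.now();
  regex.test(input);
  const t1 = performance.now();
  console.log("native: " + Math.round(t1 - t0) + " ms");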

It's possible I made a mistake somewhere, and I can put together an open-source repo with the test setup when I get the time. But I'm curious why you find the results shocking?


The results are shocking because the native regular expression engines have thousands of man hours poured into them, with specialized JITs, exotic optimizations, etc. And your page is reporting the native one at 10x slower!

One thing I spotted is that you're asking the native engines to record capture groups, but your wasm implementation doesn't support those. For fairness you might switch to non-capturing groups: `(?:bc)` instead of `(bc)`. However this cannot explain the magnitude of the difference.

I dug into it some more, reducing it to this case:

    /^(?:abc*d)*$/
What happens is that the backtracking engine has to be prepared to backtrack into every iteration of the outermost loop. This means the backtracking stack grows with the length of the input; the engine spends most of its time allocating memory! These nested loops definitely make the NFA approach look good.
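
You can see the growth with a quick timing loop (a sketch; absolute numbers will vary by engine and machine):

  // The match succeeds, but the backtracking engine keeps a backtrack
  // entry per iteration of the outer loop, so time and memory grow
  // with the input length.
  const regex = /^(?:abc*d)*$/;
  for (const n of [1e4, 1e5, 1e6]) {
    const input = "abcd".repeat(n);
    const t0 = performance.now();
    regex.test(input);
    const t1 = performance.now();
    console.log("n=" + n + ": " + Math.round(t1 - t0) + " ms");
  }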

Regardless it's a cool project, thanks for sharing!


There is nothing shocking here: the parent implements strictly regular expressions and compiles them to a DFA, so of course it will be fast, especially using only ASCII characters and hand-chosen regular expressions. Russ Cox's articles cover this very well.


Where can I find the "wat2wast utility" mentioned in your article?


It's part of https://github.com/WebAssembly/wabt

I think I had to build from source.


> Add a step immediately after parsing the regex, check if it is regular and can be handled by an engine like RE2, otherwise handle it using the normal regex engine.

This would be a bad idea. For one thing, it would be slower on non-pathological cases, since backtracking is faster in practice. But more importantly, devs who test on Chrome might unwittingly create regexes that would run exponentially slower in other browsers.


> For one thing, it would be slower on non-pathological cases, since backtracking is faster in practice.

That's most certainly not universally true. See: https://rust-leipzig.github.io/regex/2017/03/28/comparison-o...

The real answer is that it's complicated. Backtrackers can certainly be faster in some cases, but not all. Moreover, as someone who has worked on regex engines for a while now, it's not clear to me how exactly to characterize performance differences (outside of pathological cases or cases that otherwise invoke catastrophic backtracking), so my guess is that it's probably mostly up to implementation quality and the optimizations that are implemented.


Use repetition range matching for O(1) performance:

  /a{1,30}/
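
Concretely, the pathological benchmark from the top comment can be rewritten with a bounded repetition (a sketch, anchored for clarity; "a?" repeated n times followed by n literal a's matches between n and 2n a's):

  for (let i = 20; i < 30; i++) {
    let str = "a".repeat(i);
    let regex = new RegExp("^a{" + i + "," + 2 * i + "}$");
    let t0 = performance.now();
    str.match(regex);
    let t1 = performance.now();
    console.log(i + " took " + Math.round(t1 - t0) + " milliseconds.");
  }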
Can you give an example of a regex that can't be easily re-written to be fast with the backtracking engine?


The point is, with the NFA approach, there are no regexes that exhibit catastrophic backtracking.

So you can't shoot yourself in the foot with expressions that seem fine but send your CPU off into la-la land when asked to match certain input data.


It's a strange property to want. It's not a property of most other parts of programming languages. You can write loops or function calls in any language that take exponential time. So why demand that regexps be immune to it?

It would make sense to worry about it if it were a common mistake, but I've never seen a gratuitously exponential regexp in the wild.


It's a thing: https://en.wikipedia.org/wiki/ReDoS

And it really happens. StackOverflow had a ~30 minute outage a couple years ago because of it: http://stackstatus.net/post/147710624694/outage-postmortem-j...

One of the key points here is that it's not always obvious from looking at a regex whether it will exhibit catastrophic backtracking (at least, to an untrained eye). Even more insidious, catastrophic backtracking may only happen on certain inputs, which means that tests won't necessarily track it either. Therefore, desiring a performance guarantee here is certainly not "strange."
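
For instance (a sketch in the spirit of the Stack Overflow incident, not their exact pattern): a seemingly innocent trailing-whitespace match goes quadratic when a long whitespace run is not actually at the end of the input.

  // For each start position in the run of spaces, the backtracker
  // scans to the final "x", fails the $ assertion, and retries: O(n^2).
  const trailingWs = /\s+$/;
  const input = "x" + " ".repeat(10000) + "x"; // spaces NOT at the end
  const t0 = performance.now();
  trailingWs.test(input);
  const t1 = performance.now();
  console.log(Math.round(t1 - t0) + " ms"); // grows quadratically with run length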

The bottom line is that the various implementation strategies for regex engines involve trade-offs. On Internet forums, it's popular to "proclaim" for one side or the other. The backtracking advocates like to believe that catastrophic cases never happen or are easily avoided, and the linear-time advocates like to believe that so long as matching doesn't take exponential time, it will be fast enough. :-)


Because most of the time you don't need the features that demand backtracking, and it's simple to use an NFA approach that avoids those kinds of issues.

Besides, people make mistakes all the time. Why not prevent the possibility?



