
JavaScript regexes are "getting better" feature-wise, but performance-wise not much has improved. The a?a?a? example regex from Russ Cox's 2007 article "Regular Expression Matching Can Be Simple And Fast" still gets exponentially slower as the number of "a?" repetitions grows (https://swtch.com/~rsc/regexp/regexp1.html):

  for (let i = 20; i < 30; i++) {
    let str = "a".repeat(i);
    let regex = new RegExp("a?".repeat(i) + "a".repeat(i));
    let t0 = performance.now();
    str.match(regex);
    let t1 = performance.now();
    console.log(i + " took " + Math.round(t1 - t0) + " milliseconds.");
  }

  20 took 7 milliseconds.
  21 took 14 milliseconds.
  22 took 27 milliseconds.
  23 took 52 milliseconds.
  24 took 102 milliseconds.
  25 took 224 milliseconds.
  26 took 421 milliseconds.
  27 took 814 milliseconds.
  28 took 1604 milliseconds.
  29 took 3470 milliseconds.

Tested today on latest Chrome on a MacBook Pro with a 2.6 GHz Intel Core i7.

Here's an idea for improving performance for regexes that are actually regular, while keeping support for advanced non-regular regexes that e.g. contain back-references:

Add a step immediately after parsing the regex: check whether it is regular and can be handled by an engine like RE2; otherwise, handle it using the normal backtracking engine.
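
A rough sketch of what that dispatch could look like (all function names here are hypothetical, purely for illustration):

  // Hypothetical dispatch: route strictly regular patterns to a
  // linear-time engine, everything else to the backtracker. None of
  // these functions exist in any real engine.
  function compileRegex(source, flags) {
    const ast = parsePattern(source, flags);    // hypothetical parser
    if (usesOnlyRegularFeatures(ast)) {         // no back-references etc.
      return compileToLinearEngine(ast);        // RE2-style NFA/DFA
    }
    return compileToBacktracker(ast);           // existing backtracking engine
  }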

An even more exotic idea would be to let multiple regex engines execute in parallel and return the result from the one that finishes first.
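
In spirit, something like this (again a sketch with made-up function names; real parallelism would need e.g. a Worker for one of the engines):

  // Hypothetical race: whichever engine answers first wins.
  async function raceMatch(str, source) {
    return Promise.race([
      matchWithBacktracker(str, source),   // hypothetical
      matchWithLinearEngine(str, source),  // hypothetical, e.g. in a Worker
    ]);
  }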




I can't source this, but IIRC Chrome devs have tried something like this and couldn't get the match semantics in the RE2 case to exactly match existing JavaScript regex semantics. I heard about this long ago from a friend, and I don't remember the details. But it's quite plausible.

In general, "obvious" optimizations like this are almost always harder than they seem.

One possible path would be to precisely characterize the regexes on which FSMs and backtrackers agree exactly, and use the FSM-based approach only for those. But even then, you're maintaining two regex engines instead of one, and both need to be production-grade and very fast. It is no easy task.


You can also implement linear-time regex matching in user code with WebAssembly. In my tests it was faster than native regex matching (https://jasonhpriestley.com/regex-dfa).
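
To give a flavor of the approach (a plain-JS sketch with a hand-built transition table, rather than the compiled wasm from the article): once a regex is compiled to a DFA, matching is a single pass with one table lookup per character.

  // Hand-built DFA for /^(ab)+$/, showing the single-pass matching
  // loop that the linear-time approach boils down to.
  // States: 0 = start, 1 = saw "a", 2 = saw "ab" (accepting).
  const transitions = {
    0: { a: 1 },
    1: { b: 2 },
    2: { a: 1 },
  };
  function dfaMatch(input) {
    let state = 0;
    for (const ch of input) {
      state = (transitions[state] || {})[ch];
      if (state === undefined) return false; // dead state; never backtracks
    }
    return state === 2; // accept only if we end in the accepting state
  }
  console.log(dfaMatch("ababab")); // true
  console.log(dfaMatch("abba"));   // false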


As someone familiar with JS engine regex internals, those results are totally shocking to me. How can I try your example? In particular I wasn't able to figure out which regexes you are testing with.


I tested with this regex:

  /^abcd(a(b|c)*d)*a(bc)*d$/
I used long strings of repeated "abcdabcd" as the test strings.
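
The native side of the benchmark is essentially this (a sketch; the repeat count is arbitrary, and the wasm side is described in the linked article):

  const regex = /^abcd(a(b|c)*d)*a(bc)*d$/;
  const input = "abcdabcd".repeat(100000);
  const t0 = performance.now();
  regex.test(input);
  const t1 = performance.now();
  console.log("native: " + Math.round(t1 - t0) + " ms");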

It's possible I made a mistake somewhere, and I can put together an open-source repo with the test setup when I get the time. But I'm curious why you find the results shocking?


The results are shocking because the native regular expression engines have thousands of man hours poured into them, with specialized JITs, exotic optimizations, etc. And your page is reporting the native one at 10x slower!

One thing I spotted is that you're asking the native engines to record capture groups, but your wasm implementation doesn't support those. For fairness you might switch to non-capturing groups: `(?:bc)` instead of `(bc)`. However this cannot explain the magnitude of the difference.

I dug into it some more, reducing it to this case:

    /^(?:abc*d)*$/
What happens is that the backtracking engine has to be prepared to backtrack into every iteration of the outermost loop. This means the backtracking stack grows with the length of the input; the engine spends most of its time allocating memory! These nested loops definitely make the NFA approach look good.
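
You can see the growth with a quick timing loop (a sketch; absolute numbers will vary by engine and machine):

  // The match succeeds, but the backtracking engine keeps a backtrack
  // entry per iteration of the outer loop, so time and memory grow
  // with the input length.
  const regex = /^(?:abc*d)*$/;
  for (const n of [1e4, 1e5, 1e6]) {
    const input = "abcd".repeat(n);
    const t0 = performance.now();
    regex.test(input);
    const t1 = performance.now();
    console.log("n=" + n + ": " + Math.round(t1 - t0) + " ms");
  }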

Regardless it's a cool project, thanks for sharing!


There is nothing shocking here: the parent implements strictly regular expressions and compiles them to a DFA, so of course it will be fast, especially using only ASCII characters and hand-chosen regular expressions. Russ Cox's articles cover this very well.


Where can I find the "wat2wast utility" mentioned in your article?


It's part of https://github.com/WebAssembly/wabt

I think I had to build from source.


> Add a step immediately after parsing the regex, check if it is regular and can be handled by an engine like RE2, otherwise handle it using the normal regex engine.

This would be a bad idea. For one thing, it would be slower on non-pathological cases, since backtracking is faster in practice. But more importantly, devs who test on Chrome might unwittingly create regexes that would run exponentially slower in other browsers.


> For one thing, it would be slower on non-pathological cases, since backtracking is faster in practice.

That's most certainly not universally true. See: https://rust-leipzig.github.io/regex/2017/03/28/comparison-o...

The real answer is that it's complicated. Backtrackers can certainly be faster in some cases, but not all. Moreover, as someone who has worked on regex engines for a while now, it's not clear to me how exactly to characterize performance differences (outside of pathological cases or cases that otherwise invoke catastrophic backtracking), so my guess is that it's probably mostly up to implementation quality and the optimizations that are implemented.


Use repetition range matching for O(1) performance:

  /a{1,30}/
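
Concretely, the pathological benchmark from the top comment can be rewritten with a bounded repetition (a sketch, anchored for clarity; "a?" repeated n times followed by n literal a's matches between n and 2n a's):

  for (let i = 20; i < 30; i++) {
    let str = "a".repeat(i);
    let regex = new RegExp("^a{" + i + "," + 2 * i + "}$");
    let t0 = performance.now();
    str.match(regex);
    let t1 = performance.now();
    console.log(i + " took " + Math.round(t1 - t0) + " milliseconds.");
  }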
Can you give an example of a regex that can't be easily re-written to be fast with the backtracking engine?


The point is, with the NFA approach, there are no regexes that exhibit catastrophic backtracking.

So you can't shoot yourself in the foot with expressions that seem fine but send your CPU off into la-la land when asked to match certain input data.


It's a strange property to want. It's not a property of most other parts of programming languages. You can write loops or function calls in any language that take exponential time. So why demand that regexps be immune to it?

It would make sense to worry about it if it were a common mistake, but I've never seen a gratuitously exponential regexp in the wild.


It's a thing: https://en.wikipedia.org/wiki/ReDoS

And it really happens. StackOverflow had a ~30 minute outage a couple years ago because of it: http://stackstatus.net/post/147710624694/outage-postmortem-j...

One of the key points here is that it's not always obvious from looking at a regex whether it will exhibit catastrophic backtracking (at least, to an untrained eye). Even more insidious, catastrophic backtracking may only happen on certain inputs, which means that tests won't necessarily track it either. Therefore, desiring a performance guarantee here is certainly not "strange."
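
For instance (a sketch in the spirit of the Stack Overflow incident, not their exact pattern): a seemingly innocent trailing-whitespace match goes quadratic when a long whitespace run is not actually at the end of the input.

  // For each start position in the run of spaces, the backtracker
  // scans to the final "x", fails the $ assertion, and retries: O(n^2).
  const trailingWs = /\s+$/;
  const input = "x" + " ".repeat(10000) + "x"; // spaces NOT at the end
  const t0 = performance.now();
  trailingWs.test(input);
  const t1 = performance.now();
  console.log(Math.round(t1 - t0) + " ms"); // grows quadratically with run length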

The bottom line is that the various implementation strategies for regex engines involve trade-offs. On Internet forums, it's popular to "proclaim" for one side or the other. The backtracking advocates like to believe that catastrophic cases never happen or are easily avoided, and the linear-time advocates like to believe that so long as matching doesn't take exponential time, it will be fast enough. :-)


Because most of the time you don't need the features that demand backtracking, and it's simple to use an NFA approach that avoids those kinds of issues.

Besides, people make mistakes all the time. Why not prevent the possibility?



