Russ Cox, in response to a regex benchmark comparing Go against Python and Ruby:
"You assume the benchmark is worth something.
First of all, Ruby and Python are using C implementations
of the regexp search, so Go is being beat by C, not by Ruby.
Second, Go is using a different algorithm for regexp matching
than the C implementations in those other languages.
The algorithm Go uses guarantees to complete in time that is
linear in the length of the input. The algorithm that Ruby/Python/etc
are using can take time exponential in the length of the input,
although on trivial cases it typically runs quite fast."[1]
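Cox's exponential-time warning is easy to reproduce with Python's backtracking engine. A small sketch (the nested-quantifier pattern and input sizes are chosen purely for illustration): each extra `a` roughly doubles the work, because the trailing `b` forces the engine to try every way of partitioning the `a`s between the two quantifiers before failing.

```python
import re
import time

# Nested quantifiers over the same character: the classic
# catastrophic-backtracking shape.
PATTERN = re.compile(r'(a+)+$')

def timed_match(n):
    """Time the backtracking engine on n 'a's plus a 'b'."""
    text = 'a' * n + 'b'   # the trailing 'b' guarantees overall failure
    start = time.perf_counter()
    result = PATTERN.match(text)
    return result, time.perf_counter() - start

# Watch the time roughly double with each step of n.
for n in (16, 18, 20):
    m, elapsed = timed_match(n)
    print(n, m, f'{elapsed:.3f}s')
```

A linear-time engine like Go's rejects the same input in time proportional to its length, which is exactly the guarantee Cox is describing.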
To a member of the masses of programmers who typically deal with truckloads of trivial problems, his reply is worth as much as the benchmark he criticizes. I've never used anything but a trivial regexp (for some values of trivial) in production code (one-shot problems don't count), and I don't give a damn what is being benchmarked here - I only care whether my Python script runs fast.
Actually, I might agree with Go's choice here. Most of my regexes are the trivial kind that don't trigger exponential behavior, but that's only because I also control the input to all those regexes. Go's intended niche appears to be user-facing server-side programs, and in that environment, if you ever run a naive exponential-time regex against user-supplied input, you're setting yourself up for a potential DoS attack.
"But Go's intended niche appears to be for user-facing server-side programs, and in that environment if you ever run a naive exponential-time regex against user-supplied input you're setting yourself up for a potential DoS attack."
Yes, but for larger recognizers, one would probably use something like Ragel anyway. A DSL that provides intersection, union, difference, concatenation, composition, etc. makes it far easier to construct elaborate automata from smaller expressions than the 'compile a big regex string' approach does.
Why not use the potentially exponential algorithm initially, then time out and fall back to the linear algorithm once the sometimes-quick one has taken longer than the linear one would have? That's at most a 2x slowdown for complex cases, and a substantial speedup for the simple ones.
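That two-engine idea can be sketched in a few lines. This is only an illustration, not anyone's shipping design: both "engines" below are stand-in functions (the slow one fakes exponential blowup with a sleep), and in a real system you would plug in a backtracking matcher and an RE2-style linear matcher, with the budget derived from the input length so it tracks what the linear engine would cost.

```python
import concurrent.futures
import time

def match_with_fallback(backtrack_fn, linear_fn, text, budget=0.05):
    """Try the usually-fast backtracking engine under a time budget,
    then fall back to the guaranteed-linear engine.

    Caveat: Python threads cannot be cancelled, so a real implementation
    would run the backtracking engine in a killable process or give it
    an explicit step budget instead of a wall-clock one.
    """
    ex = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        fut = ex.submit(backtrack_fn, text)
        try:
            return fut.result(timeout=budget)  # fast path, usually wins
        except concurrent.futures.TimeoutError:
            return linear_fn(text)             # linear in len(text)
    finally:
        ex.shutdown(wait=False)

# Stand-in engines for the sketch: "backtracking" is fast on short
# inputs and artificially slow on long ones; both accept strings of
# only 'a's.
def fake_backtrack(text):
    if len(text) > 20:
        time.sleep(0.5)   # simulate exponential blowup
    return 'b' not in text

def fake_linear(text):
    return 'b' not in text

print(match_with_fallback(fake_backtrack, fake_linear, 'a' * 5))         # fast path
print(match_with_fallback(fake_backtrack, fake_linear, 'a' * 30 + 'b'))  # falls back
```

The "at most 2x" bound in the comment above comes from setting the budget equal to the linear engine's expected cost: in the worst case you pay for both engines, one after the other.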
It depends: if you have the opportunity to build a DFA once, why not do it? It's only a one-time cost.
I think the more serious problem is that what most people consider to be a 'regular expression' does not express a regular language, and hence cannot be recognized by a finite state automaton.
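The usual culprit is backreferences, which Python, Ruby, and PCRE support but Go's regexp omits precisely because they break the regular-language guarantee. A short demonstration: with a backreference, a single pattern matches the language of doubled strings {ww}, which is not regular, so no finite automaton can recognize it.

```python
import re

# The backreference \1 must remember an unbounded amount of input,
# which a finite-state machine cannot do.
squares = re.compile(r'^(.*)\1$')

print(bool(squares.match('abcabc')))  # the same string twice
print(bool(squares.match('abcabd')))  # not a doubled string
```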
I think his answer is useful. It does explain what's going on. If the performance characteristics of PCRE suit your problem better it's trivial to write a cgo wrapper for it. There's even one on github, though I don't know if it works.