Russ Cox, in response to a regex benchmark comparing Go against Python and Ruby:
"You assume the benchmark is worth something.
First of all, Ruby and Python are using C implementations
of the regexp search, so Go is being beat by C, not by Ruby.
Second, Go is using a different algorithm for regexp matching
than the C implementations in those other languages.
The algorithm Go uses guarantees to complete in time that is
linear in the length of the input. The algorithm that Ruby/Python/etc
are using can take time exponential in the length of the input,
although on trivial cases it typically runs quite fast."[1]
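Cox's exponential-time warning is easy to reproduce with Python's backtracking engine. A small sketch (the nested-quantifier pattern and input sizes are chosen purely for illustration): each extra `a` roughly doubles the work, because the trailing `b` forces the engine to try every way of partitioning the `a`s between the two quantifiers before failing.

```python
import re
import time

# Nested quantifiers over the same character: the classic
# catastrophic-backtracking shape.
PATTERN = re.compile(r'(a+)+$')

def timed_match(n):
    """Time the backtracking engine on n 'a's plus a 'b'."""
    text = 'a' * n + 'b'   # the trailing 'b' guarantees overall failure
    start = time.perf_counter()
    result = PATTERN.match(text)
    return result, time.perf_counter() - start

# Watch the time roughly double with each step of n.
for n in (16, 18, 20):
    m, elapsed = timed_match(n)
    print(n, m, f'{elapsed:.3f}s')
```

A linear-time engine like Go's rejects the same input in time proportional to its length, which is exactly the guarantee Cox is describing.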
To a member of the masses of programmers who typically deal with truckloads of trivial problems, his reply is worth as much as the benchmark he criticizes. I've never used anything but a trivial regexp (for some values of trivial) in production code (one-shot problems don't count), and I don't give a damn what is being benchmarked here - I only care whether my Python script runs fast.
Actually, I might agree with Go's choice here. Most of my regexes are the trivial kind that don't trigger exponential behavior, but that's only because I also control the input to all those regexes. Go's intended niche appears to be user-facing server-side programs, and in that environment, if you ever run a naive exponential-time regex against user-supplied input, you're setting yourself up for a potential DoS attack.
"But Go's intended niche appears to be for user-facing server-side programs, and in that environment if you ever run a naive exponential-time regex against user-supplied input you're setting yourself up for a potential DoS attack."
Yes, but for larger recognizers, one would probably use something like Ragel anyway. A DSL that provides intersection, union, difference, concatenation, composition, etc. makes it far easier to construct elaborate automata from smaller expressions than the 'compile a big regex string' approach does.
Why not use the potentially exponential algorithm initially, then time out and fall back to the linear algorithm once the sometimes-quick one has taken longer than the linear one would have? That's at most a 2x slowdown for complex cases, and a substantial speedup for the simple ones.
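That two-engine idea can be sketched in a few lines. This is only an illustration, not anyone's shipping design: both "engines" below are stand-in functions (the slow one fakes exponential blowup with a sleep), and in a real system you would plug in a backtracking matcher and an RE2-style linear matcher, with the budget derived from the input length so it tracks what the linear engine would cost.

```python
import concurrent.futures
import time

def match_with_fallback(backtrack_fn, linear_fn, text, budget=0.05):
    """Try the usually-fast backtracking engine under a time budget,
    then fall back to the guaranteed-linear engine.

    Caveat: Python threads cannot be cancelled, so a real implementation
    would run the backtracking engine in a killable process or give it
    an explicit step budget instead of a wall-clock one.
    """
    ex = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        fut = ex.submit(backtrack_fn, text)
        try:
            return fut.result(timeout=budget)  # fast path, usually wins
        except concurrent.futures.TimeoutError:
            return linear_fn(text)             # linear in len(text)
    finally:
        ex.shutdown(wait=False)

# Stand-in engines for the sketch: "backtracking" is fast on short
# inputs and artificially slow on long ones; both accept strings of
# only 'a's.
def fake_backtrack(text):
    if len(text) > 20:
        time.sleep(0.5)   # simulate exponential blowup
    return 'b' not in text

def fake_linear(text):
    return 'b' not in text

print(match_with_fallback(fake_backtrack, fake_linear, 'a' * 5))         # fast path
print(match_with_fallback(fake_backtrack, fake_linear, 'a' * 30 + 'b'))  # falls back
```

The "at most 2x" bound in the comment above comes from setting the budget equal to the linear engine's expected cost: in the worst case you pay for both engines, one after the other.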
It depends: if you have the opportunity to build a DFA once, why not do it? It's only a one-time cost.
I think the more serious problem is that what most people consider to be a 'regular expression' does not express a regular language, and hence cannot be recognized by a finite state automaton.
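The usual culprit is backreferences, which Python, Ruby, and PCRE support but Go's regexp omits precisely because they break the regular-language guarantee. A short demonstration: with a backreference, a single pattern matches the language of doubled strings {ww}, which is not regular, so no finite automaton can recognize it.

```python
import re

# The backreference \1 must remember an unbounded amount of input,
# which a finite-state machine cannot do.
squares = re.compile(r'^(.*)\1$')

print(bool(squares.match('abcabc')))  # the same string twice
print(bool(squares.match('abcabd')))  # not a doubled string
```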
I think his answer is useful. It does explain what's going on. If the performance characteristics of PCRE suit your problem better it's trivial to write a cgo wrapper for it. There's even one on github, though I don't know if it works.