
Thanks for the mention! I just want to expand on some things you said:

- Rust's regex library was greatly inspired by RE2, which is a C++ library that also executes regexes in linear time.

- Go's regex library also runs in linear time; however, it is still missing some optimizations, which means it can run more slowly than RE2 and Rust's regex crate even on non-pathological regexes.

- You don't need to use fancy features with a backtracking regex engine in order to shoot yourself in the foot. e.g.,

    >>> import re
    >>> re.search('(a*)*c', 'a' * 30)  # exponential backtracking; appears to hang

- Even with linear time regex engines, you can get big slowdowns. You'll never get exponential (in the size of the text) slowdowns, of course, but regexes like `[01]*1[01]{20}$`[1] can generate large finite state machines, which can be problematic for either memory use or match speed, depending on the implementation.

[1] - https://cyberzhg.github.io/toolbox/min_dfa?regex=KDB8MSkqMSg...
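To see why the `(a*)*c` example above explodes, here is a toy backtracker for that specific pattern that counts its match attempts. This is only an illustration of the backtracking strategy, not how CPython's `re` engine is actually implemented; each extra `a` doubles the work:

```python
# Toy backtracking matcher for the specific pattern (a*)*c, counting
# attempts. Illustrative sketch only, not CPython's real re engine.

def match_astar_star_c(s):
    calls = [0]

    def outer(i):
        # Try to match the rest of (a*)*c with the outer star at position i.
        calls[0] += 1
        # Option 1: stop repeating (a*) and require the literal 'c'.
        if i < len(s) and s[i] == 'c':
            return True
        # Option 2: run (a*) one more time, consuming j >= 1 'a's.
        # (Real engines also need a guard against empty iterations.)
        j = i
        while j < len(s) and s[j] == 'a':
            j += 1
            if outer(j):
                return True
        return False

    return outer(0), calls[0]

for n in (5, 10, 15):
    ok, attempts = match_astar_star_c('a' * n)
    print(n, ok, attempts)   # attempts is 2**n: 32, 1024, 32768
```

On a string of N `a`s with no `c`, every way of splitting the `a`s among iterations of the group gets tried before the match fails, which is where the 2^N comes from.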




Thank you for all your hard work. I've been using Rust for two years now and have invoked your code probably a hundred times over.

I always feel so spoiled with Rust because it seems every core "functionality" library (serde, uuid, rand, chrono, soon to be futures I hope, and innumerable more) is the best implementation of that concept in any language ever so far.

It's so weird to go from working on decade-old PHP or C++ code, where I constantly curse the developers for undefined or illogical behavior, to Rust, where I rarely ever have a hiccup - if I do, the crazy powerful documentation engine often solves it in seconds - if that doesn't work, the lints, compiler, and more and more the RLS point me in the right direction - and finally, in the worst case, I go actually look at the crate source and discover how well the developer thought things out.

It's legitimate programming magic, and I'm worried I'm developing a psychological dependence on Rust where, in my next job, if I can't get built-in backtraces from my errors like failure provides, I might jump out a window. So thank you again (and the rest of the Rust community in general) so much for making me miserable whenever I think about having to write regular expressions (or code, in general) in any other language (except Python; even with the warts it's still a sweetheart who means well, I can't stay mad at it).


Thanks for the kind words. :) Glad Rust is working out for ya!


Does the linear time hold up in the face of capture groups? For example, say you have a bunch of capture groups in a loop:

    /((a)|(b)|(c)|(d))*/

If the string has length N, the loop iterates N times, and each iteration must clear a number of capture groups proportional to the length of the regex. So in this case the time scales with the product of the input and regex lengths, not with each independently.


Yes, it does. Capture groups aren't special here. The issue is a semantic quibble. Namely, when folks say, "a regex engine that guarantees matching in linear time," what they actually mean is, "a regex engine that guarantees matching in linear time with respect to the input searched where the regex itself is treated as a constant." If you don't treat the regex as a constant, then the time complexity can vary quite a bit depending on the implementation strategy.

For example, if you do a Thompson NFA simulation (or, more practically, a Pike VM), then the time complexity is going to be O(mn), where m ~ len(regex) and n ~ len(input), regardless of capturing groups.
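The O(mn) bound falls out of the simulation itself: each input byte advances a set of at most m alive NFA states. Here is a minimal sketch of that simulation, with the NFA for `(a|b)*c` built by hand (no capture tracking, so this is the Pike VM minus its tags; the state layout is just one plausible Thompson construction, not any real engine's output):

```python
# Thompson NFA simulation: process the input once, maintaining the set
# of NFA states alive at the current position. Each input byte touches
# at most m states, giving O(m*n) time overall.

# Hand-built NFA for (a|b)*c.  trans: state -> list of (label, next),
# where label is a character, or None for an epsilon transition.
trans = {
    0: [(None, 1), (None, 3)],    # loop entry: take a branch or leave
    1: [('a', 2), ('b', 2)],      # one iteration of (a|b)
    2: [(None, 0)],               # back to the loop entry
    3: [('c', 4)],                # the final literal 'c'
    4: [],                        # accepting state
}
ACCEPT = 4

def eps_closure(states):
    # Expand a state set across epsilon transitions.
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for label, nxt in trans[s]:
            if label is None and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def matches(text):
    # Anchored (full-match) semantics, for simplicity.
    current = eps_closure({0})
    for ch in text:
        stepped = {n for s in current for label, n in trans[s] if label == ch}
        current = eps_closure(stepped)
    return ACCEPT in current
```

Because the state *set* is recomputed per byte rather than explored per path, there is no backtracking and no exponential blowup, regardless of the input.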

As another example, if you compile the regex to a DFA before matching, then the time complexity is going to be O(n), since every byte of input results in executing a small constant number of instructions, regardless of the size of the regex. However, DFAs typically don't handle capturing groups (although they can certainly handle grouping), with the notable exception of Laurikari's tagged DFAs, though I don't know off-hand whether the O(n) time complexity usually associated with a DFA carries over to tagged DFAs. Of course, the principal downside of building a DFA is that it can use exponential (in the size of the regex) memory. This is why GNU grep, rust/regex and RE2 use a hybrid approach (a "lazy DFA"), which avoids O(2^m) space but falls back to O(mn) matching when the DFA would otherwise exceed some memory budget during matching.
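The exponential blowup in DFA size is easy to demonstrate directly on the `[01]*1[01]{k}$` family from earlier in the thread. The sketch below (plain Python, not any real engine's code) determinizes the hand-built NFA for that pattern via subset construction and counts reachable DFA states; the DFA must remember the last k+1 bits, so the count is 2^(k+1):

```python
# Subset construction for [01]*1[01]{k}$: the NFA has only k+2 states,
# but the resulting DFA has 2**(k+1) states, since it must remember
# which of the last k+1 input bits were 1.

def dfa_size(k):
    # NFA: state 0 loops on 0/1 and also steps to state 1 on a '1';
    # states 1..k step forward on any 0/1; state k+1 accepts.
    def step(states, ch):
        nxt = {0}
        if ch == '1':
            nxt.add(1)
        nxt.update(s + 1 for s in states if 1 <= s <= k)
        return frozenset(nxt)

    # Breadth-first exploration of reachable state sets (= DFA states).
    start = frozenset({0})
    seen, stack = {start}, [start]
    while stack:
        cur = stack.pop()
        for ch in '01':
            n = step(cur, ch)
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return len(seen)

for k in (3, 5, 10):
    print(k, dfa_size(k))   # 16, 64, 2048 -- i.e. 2**(k+1)
```

A lazy DFA sidesteps this by building exactly these state sets on demand as input is consumed, and throwing them away (or bailing out to the NFA) when a memory budget is hit.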


> when folks say, "a regex engine that guarantees matching in linear time," what they actually mean is, "a regex engine that guarantees matching in linear time with respect to the input searched where the regex itself is treated as a constant."

Well the Rust docs say "all searches execute in linear time with respect to the size of the regular expression and search text." Their engine compiles to a DFA, not an NFA or PikeVM; I suppose this is the basis for their claim.

> As another example, if you compile the regex to a DFA before matching, then the time complexity is going to be O(n) since every byte of input results in executing a small constant number of instructions, regardless of the size of the regex. However, DFAs typically don't handle capturing groups

Now you have arrived at my question! Rust compiles to a DFA that supports capture groups. My question is whether capture groups ruin the linearity of the DFA matching.

Thanks for the Laurikari's Tagged DFAs reference, I hadn't heard of that. I'll check it out!


> Well the Rust docs say "all searches execute in linear time with respect to the size of the regular expression and search text."

Yeah, that's ambiguous phrasing on my part. I meant that it was linear time with respect to both the size of the regex and the search text.

> Their engine compiles to a DFA, not an NFA or PikeVM; I suppose this is the basis for their claim.

No, it doesn't. rust/regex uses some combination of the Pike VM, (bounded) backtracking and a lazy DFA. It will compile a DFA ahead of time in some cases where Aho-Corasick can be used.

> Rust compiles to a DFA that supports capture groups.

No, it uses a lazy DFA to answer "where does it match," but it still must use either the Pike VM or the bounded backtracker to resolve capture groups.

> My question is whether capture groups ruin the linearity of the DFA matching.

Yeah I think I would probably look at Tagged DFAs to answer this. You'll want to check out recent papers that cite Laurikari's work, since I think there have been some developments!


First match, then capture. All captures are wasteful unless there’s a match.


There's no way to reconstruct the captures after the match completes. You have to record them as you go in case you successfully match, or run the whole regex again.


> or run the whole regex again

Right, that's what RE2 and rust/regex both do. The lazy DFA finds the match, then something else resolves the capture groups.


Whoa. How does knowing the match ahead of time help the second pass?


Finding the match can use the lazy DFA, which is roughly one or two orders of magnitude faster than the Pike VM, and similarly faster than bounded backtracking. Typically, the span of a match is much much smaller than the entire input you're searching, so it makes sense to find that span as quickly as possible. When it comes time to run the Pike VM (or the bounded backtracker), you can run the regex on just the span of the match, which will be much better than running it on the entire input to find the match if the input is large.

This strategy doesn't always make sense, and it can be better to just run the NFA right away if the input is small.
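The structure of that two-pass strategy can be mimicked in a few lines of Python, using the stdlib `re` module as a stand-in for both engines (so there is no actual speedup here, since `re` backtracks both times; it only illustrates the shape of the approach):

```python
import re

# Pass 1: a capture-free pattern finds the match span (playing the role
# of the fast lazy DFA). Pass 2: the capturing pattern runs on just that
# span (playing the role of the Pike VM / bounded backtracker).
text = '.' * 100_000 + 'abc123' + '.' * 100_000

nocap = re.compile(r'[a-z]+[0-9]+')        # no groups: just locate the span
withcap = re.compile(r'([a-z]+)([0-9]+)')  # groups: resolve the captures

m = nocap.search(text)
span = text[m.start():m.end()]             # 'abc123'
groups = withcap.fullmatch(span).groups()
print(groups)   # ('abc', '123')
```

The payoff in a real engine comes from the second pass only ever seeing the (usually tiny) span of the match, while the scan over the 200,000 filler bytes is handled entirely by the DFA.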

If you find this stuff interesting, you'll definitely want to check out Russ Cox's article series on regexes. In particular: https://swtch.com/~rsc/regexp/regexp3.html


I was under the impression that Go's regex library was RE2. Wasn't RE2 originally written by Russ Cox?


RE2 is written in C++. Go's regex library is written in pure Go. Russ Cox wrote both of them. Both RE2 and Go's regex library have a Pike VM, a bitstate backtracker, and a one-pass NFA, but RE2 has a lazy DFA which Go lacks. The lazy DFA is a key component for performance in a lot of cases.


Any idea why Cox left it out? Is it just a "we intend to add it but haven't gotten to it yet" or "we don't want it because..."?


I don't know specifically, but I'd guess the former over the latter. Adding a lazy DFA is a lot of work. There is an issue tracking the addition of a lazy DFA on the golang tracker. My guess is that it's up to a contributor to do it.



