
Thanks for the mention! I just want to expand on some things you said:

- Rust's regex library was greatly inspired by RE2, which is a C++ library that also executes regexes in linear time.

- Go's regex library also runs in linear time; however, it is still missing some optimizations, which means it can run more slowly than RE2 and Rust's regex crate even on non-pathological regexes.

- You don't need to use fancy features with a backtracking regex engine in order to shoot yourself in the foot. e.g.,

    >>> import re
    >>> re.search('(a*)*c', 'a' * 30)  # exponential backtracking; appears to hang

- Even with linear time regex engines, you can get big slowdowns. You'll never get exponential (in the size of the text) slowdowns, of course, but regexes like `[01]*1[01]{20}$`[1] can generate large finite state machines, which can be problematic for either memory use or match speed, depending on the implementation.

[1] - https://cyberzhg.github.io/toolbox/min_dfa?regex=KDB8MSkqMSg...
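To see why the `(a*)*c` example above explodes, here is a toy backtracker for that specific pattern that counts its match attempts. This is only an illustration of the backtracking strategy, not how CPython's `re` engine is actually implemented; each extra `a` doubles the work:

```python
# Toy backtracking matcher for the specific pattern (a*)*c, counting
# attempts. Illustrative sketch only, not CPython's real re engine.

def match_astar_star_c(s):
    calls = [0]

    def outer(i):
        # Try to match the rest of (a*)*c with the outer star at position i.
        calls[0] += 1
        # Option 1: stop repeating (a*) and require the literal 'c'.
        if i < len(s) and s[i] == 'c':
            return True
        # Option 2: run (a*) one more time, consuming j >= 1 'a's.
        # (Real engines also need a guard against empty iterations.)
        j = i
        while j < len(s) and s[j] == 'a':
            j += 1
            if outer(j):
                return True
        return False

    return outer(0), calls[0]

for n in (5, 10, 15):
    ok, attempts = match_astar_star_c('a' * n)
    print(n, ok, attempts)   # attempts is 2**n: 32, 1024, 32768
```

On a string of N `a`s with no `c`, every way of splitting the `a`s among iterations of the group gets tried before the match fails, which is where the 2^N comes from.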




Thank you for all your hard work. I've been using Rust for two years now and have invoked your code probably a hundred times over.

I always feel so spoiled with Rust because it seems every core "functionality" library (serde, uuid, rand, chrono, soon to be futures I hope, and innumerable more) is the best implementation of that concept in any language ever so far.

It's so weird to go from working on decade-old PHP or C++ code, where I constantly curse the developers for undefined or illogical behavior, to Rust, where I rarely ever have a hiccup - if I do, the crazy powerful documentation engine often solves it in seconds - if that doesn't work, the lints, compiler, and more and more the RLS point me in the right direction - and finally, in the worst case, I go actually look at the crate source and discover how well the developer thought things out.

It's legitimate programming magic, and I'm worried I'm developing a psychological dependence on Rust where, in my next job, if I can't get built-in backtraces from my errors like failure provides, I might jump out a window. So thank you again (and the rest of the Rust community in general) so much for making me miserable whenever I think about having to write regular expressions (or code, in general) in any other language (except Python; even with the warts it's still a sweetheart who means well, I can't stay mad at it).


Thanks for the kind words. :) Glad Rust is working out for ya!


Does the linear time hold up in the face of capture groups? For example, say you have a bunch of capture groups in a loop:

    /((a)|(b)|(c)|(d))*/

If the string has length N, the loop iterates N times, and each iteration must clear a number of capture groups proportional to the length of the regex. So in this case the time scales with the product of the input and regex lengths, not with each independently.


Yes, it does. Capture groups aren't special here. The issue is a semantic quibble. Namely, when folks say, "a regex engine that guarantees matching in linear time," what they actually mean is, "a regex engine that guarantees matching in linear time with respect to the input searched where the regex itself is treated as a constant." If you don't treat the regex as a constant, then the time complexity can vary quite a bit depending on the implementation strategy.

For example, if you do a Thompson NFA simulation (or, more practically, a Pike VM), then the time complexity is going to be O(mn), where m ~ len(regex) and n ~ len(input), regardless of capturing groups.
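The O(mn) bound falls out of the simulation itself: each input byte advances a set of at most m alive NFA states. Here is a minimal sketch of that simulation, with the NFA for `(a|b)*c` built by hand (no capture tracking, so this is the Pike VM minus its tags; the state layout is just one plausible Thompson construction, not any real engine's output):

```python
# Thompson NFA simulation: process the input once, maintaining the set
# of NFA states alive at the current position. Each input byte touches
# at most m states, giving O(m*n) time overall.

# Hand-built NFA for (a|b)*c.  trans: state -> list of (label, next),
# where label is a character, or None for an epsilon transition.
trans = {
    0: [(None, 1), (None, 3)],    # loop entry: take a branch or leave
    1: [('a', 2), ('b', 2)],      # one iteration of (a|b)
    2: [(None, 0)],               # back to the loop entry
    3: [('c', 4)],                # the final literal 'c'
    4: [],                        # accepting state
}
ACCEPT = 4

def eps_closure(states):
    # Expand a state set across epsilon transitions.
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for label, nxt in trans[s]:
            if label is None and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def matches(text):
    # Anchored (full-match) semantics, for simplicity.
    current = eps_closure({0})
    for ch in text:
        stepped = {n for s in current for label, n in trans[s] if label == ch}
        current = eps_closure(stepped)
    return ACCEPT in current
```

Because the state *set* is recomputed per byte rather than explored per path, there is no backtracking and no exponential blowup, regardless of the input.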

As another example, if you compile the regex to a DFA before matching, then the time complexity is going to be O(n), since every byte of input results in executing a small constant number of instructions, regardless of the size of the regex. However, DFAs typically don't handle capturing groups (although they can certainly handle grouping), with the notable exception of Laurikari's tagged DFAs, though I don't know off-hand whether the O(n) time complexity usually associated with a DFA carries over to tagged DFAs. Of course, the principal downside of building a DFA is that it can use exponential (in the size of the regex) memory. This is why GNU grep, rust/regex and RE2 use a hybrid approach (a "lazy DFA"), which avoids O(2^m) space but falls back to O(mn) matching when the DFA would otherwise exceed some memory budget during matching.
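The exponential blowup in DFA size is easy to demonstrate directly on the `[01]*1[01]{k}$` family from earlier in the thread. The sketch below (plain Python, not any real engine's code) determinizes the hand-built NFA for that pattern via subset construction and counts reachable DFA states; the DFA must remember the last k+1 bits, so the count is 2^(k+1):

```python
# Subset construction for [01]*1[01]{k}$: the NFA has only k+2 states,
# but the resulting DFA has 2**(k+1) states, since it must remember
# which of the last k+1 input bits were 1.

def dfa_size(k):
    # NFA: state 0 loops on 0/1 and also steps to state 1 on a '1';
    # states 1..k step forward on any 0/1; state k+1 accepts.
    def step(states, ch):
        nxt = {0}
        if ch == '1':
            nxt.add(1)
        nxt.update(s + 1 for s in states if 1 <= s <= k)
        return frozenset(nxt)

    # Breadth-first exploration of reachable state sets (= DFA states).
    start = frozenset({0})
    seen, stack = {start}, [start]
    while stack:
        cur = stack.pop()
        for ch in '01':
            n = step(cur, ch)
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return len(seen)

for k in (3, 5, 10):
    print(k, dfa_size(k))   # 16, 64, 2048 -- i.e. 2**(k+1)
```

A lazy DFA sidesteps this by building exactly these state sets on demand as input is consumed, and throwing them away (or bailing out to the NFA) when a memory budget is hit.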


> when folks say, "a regex engine that guarantees matching in linear time," what they actually mean is, "a regex engine that guarantees matching in linear time with respect to the input searched where the regex itself is treated as a constant."

Well the Rust docs say "all searches execute in linear time with respect to the size of the regular expression and search text." Their engine compiles to a DFA, not an NFA or PikeVM; I suppose this is the basis for their claim.

> As another example, if you compile the regex to a DFA before matching, then the time complexity is going to be O(n) since every byte of input results in executing a small constant number of instructions, regardless of the size of the regex. However, DFAs typically don't handle capturing groups

Now you have arrived at my question! Rust compiles to a DFA that supports capture groups. My question is whether capture groups ruin the linearity of the DFA matching.

Thanks for the Laurikari's Tagged DFAs reference, I hadn't heard of that. I'll check it out!


> Well the Rust docs say "all searches execute in linear time with respect to the size of the regular expression and search text."

Yeah, that's ambiguous phrasing on my part. I meant that it was linear time with respect to both the size of the regex and the search text.

> Their engine compiles to a DFA, not an NFA or PikeVM; I suppose this is the basis for their claim.

No, it doesn't. rust/regex uses some combination of the Pike VM, (bounded) backtracking and a lazy DFA. It will compile a DFA ahead of time in some cases where Aho-Corasick can be used.

> Rust compiles to a DFA that supports capture groups.

No, it uses a lazy DFA to answer "where does it match," but it still must use either the Pike VM or the bounded backtracker to resolve capture groups.

> My question is whether capture groups ruin the linearity of the DFA matching.

Yeah I think I would probably look at Tagged DFAs to answer this. You'll want to check out recent papers that cite Laurikari's work, since I think there have been some developments!


First match, then capture. All captures are wasteful unless there’s a match.


There's no way to reconstruct the captures after the match completes. You have to record them as you go in case you successfully match, or run the whole regex again.


> or run the whole regex again

Right, that's what RE2 and rust/regex both do. The lazy DFA finds the match, then something else resolves the capture groups.


Whoa. How does knowing the match ahead of time help the second pass?


Finding the match can use the lazy DFA, which is roughly one or two orders of magnitude faster than the Pike VM, and similarly faster than bounded backtracking. Typically, the span of a match is much much smaller than the entire input you're searching, so it makes sense to find that span as quickly as possible. When it comes time to run the Pike VM (or the bounded backtracker), you can run the regex on just the span of the match, which will be much better than running it on the entire input to find the match if the input is large.

This strategy doesn't always make sense, and it can be better to just run the NFA right away if the input is small.
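The structure of that two-pass strategy can be mimicked in a few lines of Python, using the stdlib `re` module as a stand-in for both engines (so there is no actual speedup here, since `re` backtracks both times; it only illustrates the shape of the approach):

```python
import re

# Pass 1: a capture-free pattern finds the match span (playing the role
# of the fast lazy DFA). Pass 2: the capturing pattern runs on just that
# span (playing the role of the Pike VM / bounded backtracker).
text = '.' * 100_000 + 'abc123' + '.' * 100_000

nocap = re.compile(r'[a-z]+[0-9]+')        # no groups: just locate the span
withcap = re.compile(r'([a-z]+)([0-9]+)')  # groups: resolve the captures

m = nocap.search(text)
span = text[m.start():m.end()]             # 'abc123'
groups = withcap.fullmatch(span).groups()
print(groups)   # ('abc', '123')
```

The payoff in a real engine comes from the second pass only ever seeing the (usually tiny) span of the match, while the scan over the 200,000 filler bytes is handled entirely by the DFA.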

If you find this stuff interesting, you'll definitely want to check out Russ Cox's article series on regexes. In particular: https://swtch.com/~rsc/regexp/regexp3.html


I was under the impression that Go's regex library was RE2. Wasn't RE2 originally written by Russ Cox?


RE2 is written in C++. Go's regex library is written in pure Go. Russ Cox wrote both of them. Both RE2 and Go's regex library have a Pike VM, a bitstate backtracker, and a one-pass NFA, but RE2 has a lazy DFA which Go lacks. The lazy DFA is a key component for performance in a lot of cases.


Any idea why Cox left it out? Is it just a "we intend to add it but haven't gotten to it yet" or "we don't want it because..."?


I don't know specifically, but I'd guess the former over the latter. Adding a lazy DFA is a lot of work. There is an issue tracking the addition of a lazy DFA on the golang tracker. My guess is that it's up to a contributor to do it.



