Rust's standard regex library isn't the fastest in the world, and the lack of backtracking can be limiting, but it is highly optimised for Rust's main use case of posting showoff comments on Hacker News.
I also tried this in Java (OpenJDK 1.8.0_161-b14), and it took about 24 seconds per iteration at 29 characters.
For those that don't know, Rust's regex crate descends from RE2 and Go's regexp library, both of which were written by Russ Cox, the author of the OP.
The principle difference between Rust's regex crate and, say, RE2 is its focus on literal optimizations. This speeds up a very large number of regexes, and it is one of the reasons why it does tend to be quite fast.
How good is ecosystem/crate support for more complex regex features? Speed is good, especially when you need it, but man, positive/negative lookahead/lookbehind and other advanced features are pretty damn wonderful once you know how to use them. The ability to choose what you want when you want it would be nice.
The regex crate doesn't support it, and it's not something I'm personally interested in working on. My goal at this point is to make the regex crate as good as possible at what it does, across the board. There's still a long way to go.
Other than that though, there is a WIP crate that actually builds on the `regex` crate and provides fancier features: https://crates.io/crates/fancy-regex (Created by the same author as Xi.)
There are also bindings to PCRE, but they are of varying quality. The regex crate's benchmark suite binds to several other libraries (including C++ and D libraries), but they are very minimal bindings sufficient for rough microbenchmarking: https://github.com/rust-lang/regex/tree/master/bench/src/ffi
Thanks. The info on the other crates is what I was looking for, I wouldn't expect this one to do more than what it aims for, which seems to be a specific type of regex engine which isn't amenable to all features but gets extra speed in that trade-off.
This is an excellent read. I would suggest reading this also (maybe even before), as it provides one of the simplest (incomplete, but still useful) regex implementations you'll ever see: https://www.cs.princeton.edu/courses/archive/spr09/cos333/be...
It suffers from the problems remedied by the original submission (backtracking instead of DFA), but the code is shorter and simpler. They're great to compare.
The biggest gap in the references/history section is the failure to mention V. Glushkov, whose NFA construction has much appeal and certainly predates Thompsons'. Reading the references, one might also draw the conclusion that regex implementation almost ceased between 1968 and 2007.
In our experience, regex matching can be simple and fast, but only sometimes. RE2's approach is straightforward but naive; we built a considerably faster (on some metrics) and more scalable system in Hyperscan. The cost of this turned out to considerable complexity - and there are still many regexes and inputs that don't perform particularly well in either system.
I thought this was going to be about compiling regular expressions, which is how CL-PPCRE (https://edicl.github.io/cl-ppcre/) manages to be faster than C. Better algorithms beat compilation, though.
The comments from the original post[1] of this are still quite relevant. I feel like I've seen this link posted here several times (much more recently than 2009), but I can't find the links.
There's been good discussion quite a few of the times this has been posted. Check the "past" link below the submission info at the top for a quick search of all the submissions.
Yeah, Russ Cox ranted about pattern matching libraries a decade ago, ignoring the fact that they were already doing things far more complex than formal regular expression matching. In the decade since, that hasn't changed.
The vim text editor actually can use the NFA regular expression engine [1] for some searches. I'm not sure if that's an option for some of the languages mentioned in the article.
This should be labeled (2007) and has been on HN several times before. It's a good article.
The article (and the wikipedia) say Thomson put regexps into a version of QED for CTSS, but CTSS was only used at MIT. He must have traveled between MIT and Bell labs as part of the Multics project. The CTSS reference is pretty obscure.
I also tried this in Java (OpenJDK 1.8.0_161-b14), and it took about 24 seconds per iteration at 29 characters.