I've only given a quick read/skim, but this is extremely impressive. BurntSushi has a lot of great stuff out there, but the Rust regex crate is legendary. The fact that Rust has had such a performant and easy-to-use regular expression library for so long is a real treat to the community.
This article as a whole is also a treasure. There's a fairly limited number of people who have written a ton about regular expressions, but they all do so extremely well. Russ Cox's article series (as mentioned in this one) is really great. I used it in college to write a regular expression engine one summer after I had started to fall in love with the perfect cross between theory and practice that regular expressions are.
The changes for more in-depth testing here are also very interesting, especially for a crate that I'm sure is critical to a lot of the ecosystem, and I appreciate the writeup on such a deep-dive topic.
Are regular expressions hard to read? Sometimes they can be, especially if you haven't gone out of your way to take deep dives. Are they not perfect for a lot of parsing tasks? Sure, there are definitely things people shouldn't use regexes for that they do, like validating emails. At their core though, regular expressions are one of the most power-dense tools we have in pretty much every language. The insane stuff you can get done (sometimes in a way that would be frowned upon) in so little time is just incredible.
I'd love for there to be more content on regular expressions as well. Right now I'm only familiar with one book that really does a great job (at a practical level), and that's Mastering Regular Expressions by Jeffrey Friedl. On a theory level, a lot of compiler books will talk about them to varying degrees. The Dragon Book has some good content on regular expressions at an implementation level too.
Does anyone have any other book recommendations for regular expressions?
And if BurntSushi or any other regular expression engine contributors see this, I really appreciate the effort y'all put in to building such powerful software.
I think this exercise is valuable for anyone writing regexes to not only understand that there's less magic than one might think, but also to visualize a bunch of balls bouncing along an NFA - that bug you inevitably hit in production due to catastrophic backtracking now takes on a physical meaning!
Separately re: the OP, https://github.com/rust-lang/regex/issues/822 (and specifically BurntSushi's comment at the very end of the issue) adds really useful context to the paragraph in the OP about niche APIs: https://blog.burntsushi.net/regex-internals/#problem-request... - searching with multiple regexes simultaneously against a text is both incredibly complex and incredibly useful, and I can't wait to see what the community comes up with for this pattern!
> Are regular expressions hard to read? Sometimes they can be, especially if you haven't gone out of your way to take deep dives. Are they not perfect for a lot of parsing tasks? Sure, there are definitely things people shouldn't use regexes for that they do, like validating emails. At their core though, regular expressions are one of the most power-dense tools we have in pretty much every language. The insane stuff you can get done (sometimes in a way that would be frowned upon) in so little time is just incredible.
The real crowning use case for regular expressions (at least in terms of parsing-like tasks) is when you've got formats with varied delimiters. So one format I was parsing this weekend was basically header:field1,field2,field3"data"hash (with a fixed number of fields). Or another format I work with is suite~split/test1,test2@opt1:opt2^hw1^hw2#flags1#flags2 (most of the elements of which are optional). Regular expressions shine here; primitives like split don't quite cut it for these formats.
And I think this is also a major cause of why regular expressions tend to become unreadable quickly. If you look at parsing via regular expressions as a whole, there are basically three things being jammed into one expression: what the delimiters between fields are, what is valid in each field, and which fields are optional. These are basically three separate concerns, but most regex APIs are absolutely terrible at letting you separate these concerns into separate steps (all you can do is provide the string that combines all of them).
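For concreteness, a rough sketch in Rust (the field contents here are guesses, not the real format) of parsing that first format with one expression, where the delimiters, what's valid in each field, and the (lack of) optional parts all get jammed into a single pattern:

    use regex::Regex;

    fn main() {
        // header:field1,field2,field3"data"hash -- the delimiters (:, ,, "),
        // the allowed contents of each field, and the fact that nothing here
        // is optional are all encoded in this one pattern.
        let re = Regex::new(
            r#"^(?P<header>[^:]+):(?P<f1>[^,]+),(?P<f2>[^,]+),(?P<f3>[^"]+)"(?P<data>[^"]*)"(?P<hash>[0-9a-f]+)$"#,
        )
        .unwrap();

        if let Some(caps) = re.captures(r#"hdr:a,b,c"payload"deadbeef"#) {
            println!("header = {}", &caps["header"]);
            println!("data   = {}", &caps["data"]);
            println!("hash   = {}", &caps["hash"]);
        }
    }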
> header:field1,field2,field3"data"hash (with a fixed number of fields).
In a language where regex is just right there, like Perl, I agree that the natural way to parse this is always a regex, although on the other hand in Perl the natural way to do almost everything is a regex so maybe Perl was a bad example.
In a language with split_once (and a proper string reference type) I actually rather like split_once here, splitting away header, and then field1, and then field2, and then field3, and then data, leaving just hash.
I guess this gets to your concern: by writing it with split_once, we're explicit about a lot of your answers. Each delimiter is specified right next to what it's splitting, none of the fields are optional (unless we write that), and (if we write code to check) we state what is valid in each field.
(`results` will be None if the regex didn't match)
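For illustration, a minimal sketch of the split_once chain described above (error handling and per-field validation elided; the function name is made up):

    fn parse(line: &str) -> Option<(&str, &str, &str, &str, &str, &str)> {
        // header:field1,field2,field3"data"hash
        let (header, rest) = line.split_once(':')?;
        let (field1, rest) = rest.split_once(',')?;
        let (field2, rest) = rest.split_once(',')?;
        let (field3, rest) = rest.split_once('"')?;
        let (data, hash) = rest.split_once('"')?;
        Some((header, field1, field2, field3, data, hash))
    }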
Regexes are one of those things where, once you understand it (and capture groups in particular) and it's available in the language you're working in, string-splitting usually doesn't feel right anymore.
The thing that I think sometimes gets lost when people talk about the strengths and weaknesses of regular expressions is that it can matter quite a bit what the language in question is like. If you're using a language that compiles down to a low level, or where expressions map fairly closely to the code that actually runs (C, C++, Java, etc.), then a regular expression library often does a good enough job of what you want to accomplish, performantly enough, that rolling your own optimized parser is hard to justify.
When using an interpreted language like Perl or Python, it's a slightly different story, since often there's no way to get performance as good out of a custom parser written in that language as you can get from the optimized regular expression library included (if you use it competently). The overhead in the interpreted language can make some tasks impossible to do as quickly as the regular expression engine can. In a way, a regular expression library is to parsing what Numpy is to computation.
I remember a task where I had to parse a bunch of very simplistic XML in Perl a decade ago; it consisted of the equivalent of a bunch of records of dictionaries/hashes (solr), and I needed them as Perl hashes to work on them. I surveyed every XML parsing module available to me in Perl, and couldn't quite get the performance I needed, given the file was a few GB and it needed to be done as quickly as possible multiple times a day. I implemented a simple outer loop to grab a working chunk of the file, a regex to split that chunk into individual record text chunks, and another regex to return each key/value as an alternating list I could assign directly to a hash. It was 5-6x faster than the fastest XML streaming solution I found that I could use as a Perl module (and those were all just wrapping X libs).
Could I have coded a custom solution in C that parsed this specialized format faster? Undoubtedly, but I'd still be marshaling data through the C/Perl interface, so I would have taken a hit to get the data easily accessible to the rest of the codebase. That's less of an issue for regular expressions in Perl, since they're a native capability of the language and deeply integrated. Using regular expressions as a parsing toolkit yielded very good results at a fraction of the time and effort, and honestly, in my opinion, that's what regular expressions are all about.
Nope. Usually only backtrackers have a JIT (not strictly true, I believe there is a JIT of a Thompson NFA and there is also redgrep). This library is not a backtracker.
I would be interested in experimenting with a JIT in the future, primarily as a way to speed up capture group extraction. But this regex library uses a lazy DFA, which has excellent throughput characteristics compared to the normal backtracking interpreter (and perhaps even a backtracking JIT).
> If not, then perhaps this is a case where JavaScript beats Rust.
Like you, I skimmed this, having done a bit of work on RegExes recently. I think the language in my case is using the PikeVM, by virtue of there being no error returned, unlike the other engines. So, within the limitations of the language and its copyright-protected status, I had to create some new RegEx functionality for myself, to make it easier to use, as RegExes can be plain and simple voodoo.
I don't have stats on how often the other engines would be used, but if most of the programming languages out there are using a PikeVM, then I can see why Google has not only written their own OS for their servers, but also saved a few more clock cycles by getting other engines into action for specific situations where the PikeVM would be too slow and/or heavy on the clock cycles. I know all too well how a few extra characters in the search string can drastically slow down the pattern matching.
The proverb "take care of the pennies and the pounds will take care of themselves" definitely applies to RegEx and clock cycles. Looking back at some conversations from the '90s, I suspect it's made some coders I know very rich when processing millions of records per second.
My biggest complaint is the little variations between dialects. Especially how each combination of dialect and context expects you to handle quoting or terminating the expression differently. It's so varied that I've given up on even trying to remember and just resign myself to googling an example every time I need one.
Oniguruma is one example where you can actually pick which variation to use. I think Rust `regex` also exposes enough internals to make this possible.
What kind of content are you looking for? From a user POV, I think there is thankfully not much to learn about automata-based regexes like RE2 and rust/regex. The exception is if you're writing a performance-sensitive system entirely around regexes, like an intrusion detection system or search engine.
A regex is a very small concept with a very large syntax :) It's "just" literals, alternation, repetition, and concatenation. So it's mostly being able to see through the syntax IMO.
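For what it's worth, those four constructs fit in a handful of lines as a data type (an illustrative sketch, not the regex crate's actual AST):

    // Literals, concatenation, alternation, and repetition: that's the whole
    // core of regular expressions. Everything else is syntax sugar over this.
    enum Re {
        Literal(char),
        Concat(Box<Re>, Box<Re>),
        Alternate(Box<Re>, Box<Re>),
        Repeat(Box<Re>), // zero or more repetitions of the inner expression
    }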
(On the other hand, from the implementation POV, there is a wealth of material online, but you already know where to find that :) I still think someone should "cliff notes" the Cox RE2 articles, since the ideas are important but there is a lot of detail. There are also dozens of references in those articles for people to read; it was a comprehensive survey at the time.)
---
BTW Friedl's book covers backtracking engines, which makes sense because PCRE, Perl itself, Python, Java, JavaScript, etc. all use that style.
But as far as I remember, most of the book is irrelevant for users of automata-based engines. It's a whole bunch of detail about optimizing backtracking, reordering your clauses, and lots of detail about engine-specific and even version-specific differences.
Personally I just ignore all of that, and it doesn't cause me a problem. I use regexes all the time in every language, and stick to the features that an automata-based engine can handle.
Once you go into the backtracking stuff, then you might want the Friedl book, and that's a huge waste of time IMO. It's way easier to do that kind of thing outside of a regex engine, with normal parsing techniques.
And honestly once you go outside, you'll find you don't even need to backtrack. Perl-style regexes can be an astoundingly inefficient way to write a pattern recognition algorithm.
It makes the syntax smaller, more familiar, and more composable. It's closer to the classic lex tool (e.g. GNU flex). It's funny that lex has a way better syntax for regular expressions than Perl, but it's used probably 10,000x less often because most people want an interpreter and not a compiler to C code. But lex has 90%-100% of what you need, and it's way simpler, with a nicer syntax.
Literals are quoted so you don't have the literal/metachar confusion, and you can reuse patterns by name. The manual is worth skimming:
Story time. At ActiveState, my colleague and I - both fresh out of school - were tasked with building a regular expression debugger for the Komodo editor. We hired the legendary Perl guru Mark Jason Dominus to hack some hooks into Perl’s regex engine and then by exposing these hooks to the UI, we allowed users to watch the execution of their regexes step by step.
These days, various web-based tools offer superior functionality. But in 2001, Komodo’s Rx Debugger was absolutely state of the art and so much fun to work on.
I've found myself in the past in a situation where I needed an offline regex debugger (dealing with air-gapped networks, so the folks that needed to use the tool literally couldn't get at any online site, nor would they be permitted to even think of putting what they were working with into an online one, no matter how it's designed), and it seems almost all of the work has gone into online tools. The offline tools are scarce and somewhat lacking in comparison to, say, https://regex101.com/.
Good question! I use regex101.com a few times a month and I've thought about offline tools too. I just happened to see this GNOME application called Wildcard recently:
Can this be used for non-string lists? I always found it frustrating that we have this amazing machinery for searching and modifying lists of characters, but as soon as I have a list of numbers or dates it all goes out of the window.
Let's say I have a list of login attempt dates, and I want all sequences of 5+ failed login attempts followed by a success. With a regex it's trivial, but I can't use that; I have to roll my own loop, flags, and temporary lists.
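For example, the hand-rolled version looks roughly like this (a sketch assuming a simple bool-per-attempt representation, where true means success):

    fn failures_then_success(attempts: &[bool]) -> Vec<(usize, usize)> {
        let mut results = Vec::new();
        let mut run_start: Option<usize> = None; // where the current failure run began
        for (i, &ok) in attempts.iter().enumerate() {
            match (ok, run_start) {
                (false, None) => run_start = Some(i), // a failure run begins
                (false, Some(_)) => {}                // run continues
                (true, Some(start)) => {
                    if i - start >= 5 {
                        results.push((start, i)); // 5+ failures, then this success
                    }
                    run_start = None;
                }
                (true, None) => {}
            }
        }
        results
    }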
Sure, I could convert my list to a string, process it, and (try to) convert it back, but the downsides are obvious.
Even if the performance is not as good as string-based regex, why shouldn't we have regexes for arbitrary list types?
It's horribly slow, but as an API experiment I'm quite happy. It also gives some tools not available to regexes, like inverting and intersecting patterns, and matched pairs.
Nope. The regex library is tightly coupled with searching strings. That's an intentional design decision. Making a regex engine like this have a generic alphabet is a total non-starter. It's impractically difficult for me to do. Especially the API design and doing it in a way that doesn't impact the perf of the primary use case.
The kind of regex engine you want, especially one where you don't care about perf, is not hard to build. You could take the `regex-lite` crate I published, for example, and code it up to be as generic as you want. In so doing, I expect you'll run into a number of interesting challenges.
Anyway, it's not like these things don't exist. They do. People attempt to build them[1]. They just usually don't gain much traction because I suspect you're overstating their general utility. :-)
> Nope. The regex library is tightly coupled with searching strings. That's an intentional design decision. Making a regex engine like this have a generic alphabet is a total non-starter.
how can this be true? Not trying to pick any type of religious war, but if you had a C library based on char vectors, you could global replace with short or long and it would still work, pointers included, except for any places where you relied on the knowledge that a char was implemented as 8 bits.
or if you're saying "but I rely on the string class", is it somehow impossible to write a utf-32 based string class? would that string class be required to suppress anything that wasn't a valid code point at this particular time? I know professors like to teach that knowledge that a character is stored as an 8-bit number is undefined behavior, but it sure is nice to know you have an 8-bit-clean data type, where 0 can in some contexts be considered a terminator, and utf-8 is considered a special case on top of that.
I'm just trying to grok the type of impossibility you're talking about here. Yes, I could read the code instead of asking :)
I answered basically this question about seven years ago[1]. Re-reading that now, it's actually still quite a good answer.
> but if you had a C library based on char vectors, you could global replace with short or long and it would still work, pointers included, except for any places where you relied on the knowledge that a char was implemented as 8 bits.
Right, and that last sentence is critical. Another way to say "rely on char being 8 bits" is "rely on an alphabet size of no more than 256."
The regex crate and all of its non-standard-library dependencies (all of which I wrote) correspond to about 100K source lines of code. Everything from the concrete syntax on up to the matching engines themselves would need to be parameterized over the alphabet. You can't just replace `u8` (unsigned 8-bit integer) with any other type and hope that it works, because the entire regex crate makes use of the fact that it's searching a sequence of `u8` values specifically. It doesn't treat `u8` as some opaque type with integer-like operations. The literal architecture of the code embeds assumptions about it, such as the fact that every possible value can be enumerated as [0, 1, 2, ..., 255]. The answer I wrote seven years ago touches on this. A concrete place where this occurs is in the lazy DFA (described in the blog). Very loosely, the logical representation of an individual state looks like this:
    struct State {
        next: [*State; 256],
    }
The `u8` type appears nowhere there. Instead, the length of `next` is hard-coded to the total number of unique possible elements of the `u8` type. One could of course define this as part of some generic alphabet type, but this entire representational choice is only feasible in the first place because the alphabet size of `u8` is so small. What's the alphabet size of `u16`? Oops. Now your logical representation is:
    struct State {
        next: [*State; 65536],
    }
You can see how that would be quarrelsome, right?
Of course, you could use something like `HashMap<u16, *State>` instead. But holy moly you really do not want to do that when searching a sequence of `u8` values because a hashmap lookup for every byte in a string would absolutely trash performance. So now your generic alphabet interface needs to know about DFA state representation in-memory so that when you use `u8` you get the fast version and when you use anything else you get the slow-but-feasible version.
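To make that concrete, an illustrative sketch (not the regex crate's actual representation) of the dense-versus-sparse transition lookup being described:

    use std::collections::HashMap;

    // Dense: the alphabet is small enough to enumerate, so a transition is a
    // single index into a fixed-size table.
    struct DenseState {
        next: [usize; 256], // next state id for every possible byte
    }

    // Sparse: the alphabet is too big to enumerate densely, so each lookup
    // goes through a map (or a binary search over a sorted list).
    struct SparseState {
        next: HashMap<u16, usize>,
    }

    fn step_dense(states: &[DenseState], cur: usize, b: u8) -> usize {
        states[cur].next[b as usize] // just an offset and a load
    }

    fn step_sparse(states: &[SparseState], cur: usize, sym: u16) -> Option<usize> {
        states[cur].next.get(&sym).copied() // hash + probe on every symbol
    }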
And then there's the whole UTF-8 automata stuff and various Unicode-aware algorithms sprinkled in at various places. This thread is talking about a generic alphabet, and that goes well beyond Unicode. So now all that Unicode stuff needs to be gated behind your generic alphabet interface.
At some point, you realize this is incredibly impractical, so if you want to expose a generic alphabet API, you do something like, "with a `u8` alphabet, go do all this fast stuff, but with any other alphabet, just do a basic simplistic generic regex engine." But that basic simplistic regex engine is going to wind up sharing very little with the `u8` specific stuff. And thus you get to my conclusion that it has no place in a general purpose regex engine designed for searching strings. For these niche use cases, you should just go build something for what you need and specialize for that. If you want a generic alphabet, then use that as your design criteria from the start and build for it. You'll wind up in a completely different (and likely much simpler) spot than where the regex crate is.
Basically, this line of thinking is a sort of "disease of abstraction" that is common in programming. We take what we think about the theory of regular languages and maybe some cute use cases and think, "well why not just make it generic!" But the thing is, coupling is generally how you make things fast, and making something more general usually involves de-coupling. So they are actually and usually exclusionary goals. But it takes a lot of domain knowledge to actually see this.
There's also a whole mess of code that uses low level vector instructions for accelerating searches. Thousands of lines of code are devoted to this, and they are just yet another part of the regex crate that doesn't really lend itself to being easily generic over arbitrary alphabets. You could probably write different versions of each of these algorithms for u8, u16 and u32 types (and that would in turn require using different vector operations because of the lane size differences), but they certainly aren't generalizable to arbitrary alphabets.
> I know professors like to teach that knowledge that a character is stored as an 8-bit number is undefined behavior, but it sure is nice to know you have an 8-bit-clean data type, where 0 can in some contexts be considered a terminator, and utf-8 is considered a special case on top of that.
I don't think regex engines have been tightly coupled to NUL terminated C strings for quite some time. Not even PCRE2 is. It's just a sequence of bytes and a length. That's it.
There does exist a gaggle of general purpose regex engines that, instead of working on UTF-8, works on UTF-16 code units. These are the Javascript, Java and .NET regex engines. None of them work on anything other than `u16` values. All of them are backtrackers. (Well, .NET does have a non-backtracking regex engine. I don't know its internals and how it arranges things with respect to alphabet size.)
PCRE2 does expose compile time configuration knobs for supporting UTF-8, UTF-16 and UTF-32. I'm not an expert on its internals, but PCRE2 is a backtracker and backtrackers tend to care less about the alphabet size than automata based engines such as the regex crate and RE2. But I'd guess that PCRE2 likely has some UTF-8 specific optimizations inside of it. They just don't show up as much (I assume) in fundamental aspects of its data structure design by virtue of its approach to regex searching.
Still, just supporting UTF-8, UTF-16 and UTF-32 is a very different thing than supporting generic alphabets.
> I'm just trying to grok the type of impossibility you're talking about here. Yes, I could read the code instead of asking :)
It's really all about coupling and being able to make assumptions that the coupling grants you. If you remove the coupling---and that's what abstraction over arbitrary alphabets actually is---then you have to either remove the assumptions or push those assumptions into the abstraction. And whether that's even possible at all depends on the nature of how those assumptions are used in the chosen representations.
This is why Unicode is such a beastly thing to support in a regex engine. It runs roughshod all over ASCII assumptions that are extremely convenient in contexts like this. Something like `\w` that is ASCII aware can be implemented with a single DFA state. But as soon as you make `\w` Unicode-aware, you're now talking about hundreds of DFA states when your alphabet is `u8`. You could switch to making your alphabet be the space of all Unicode codepoints (and thus collapse your DFA back down into a single state), but now your state representation basically demands to use a sparse representation which in turn requires more computational resources for a lookup (whether it's a hashmap or a linear/binary search). If your alphabet size is small enough to use a dense representation, then your lookup is literally just a pointer offset and dereference. That operation is categorically different than, say, a hash lookup and there's really nothing you can do to change that.
The bottom line here is that if you want a regex engine on a generic alphabet, then you probably don't care about perf. (As folks in this thread have come out and said.) And if you don't care about perf, it turns out you don't need something like the regex crate at all. You'd be able to build something a lot simpler. By at least an order of magnitude.
Why hasn't someone built this "simpler" library? Well, they have. I've linked to at least one elsewhere in this thread. It just turns out that they are often not productionized and don't catch on. My thesis for why is, as I've said elsewhere in this thread, that the utility of this sort of generic support is often overstated. I could be wrong. Maybe someone just really needs to do the work to make a great library. I doubt it, but it's not really a question that I want to put the work into answering.
The extent to which this is infeasible is the extent to which Rust is not good enough, sigh.
E.g. it would be very nice to derive Unicode regexes on u8 (as you say) from the serial composition of `user_regex . unicode_regex`. Fusing that composition into a single automaton is a handy generic thing to have in the library's bag of tricks.
I of course believe a better world is possible. That's why I went out and did something that nobody had ever done before: I produced a documented and sprawling API of regex engine internals as reusable abstractions in a separately versioned library. Rust is the only reason I was able to even consider doing that in the first place.
What you're talking about is categorically different. Your idea is so out there that I can't even begin to dream up a programming language that would resolve all of my concerns and challenges with building a general purpose regex library with Unicode support that is one of the fastest in the world, supports arbitrary alphabets and is actually something that I can maintain with reasonablish compile times.
IMO, I'm the optimist here and you're the pessimist. From my perspective, I see and celebrate what Rust has allowed me to accomplish. But from your perspective, what you see (or what you comment on, to be more precise) is what Rust has supposedly held me back from accomplishing. (And even that is something of a stretch, because it isn't at all clear to me that programming language expressivity and the concept of "zero cost abstractions" is arbitrarily powerful in the first place. See, for example, the premise of the 2014 film "Lucy".)
So I am a big fan of your work. It is really admirable what you've done, this latest push recently. Having the stabler `regex` vs more experimental/open-ended/user-assembly-required `regex-automata` is fantastic! I wish many more libraries did this, and I will happily spread around your blog post in arguing that specific ones should do so.
I wouldn't describe the difference between our views here as optimism vs pessimism. I like Rust decently enough to use it a fair amount and contribute to it somewhat. But I'm hungry for something more. It's fine if you aren't.
> But the thing is, coupling is generally how you make things fast, and making something more general usually involves de-coupling. So they are actually and usually exclusionary goals. But it takes a lot of domain knowledge to actually see this.
The biggest difference is that I don't believe this is true. The interface may be extremely gnarly and hard to express, but I do believe the interface can always in principle be sliced. It may not be practical (the interface is far more complicated than the thing before cutting) but it is still in principle possible.
---------------------
Let me put it this way, imagine some alternative world Unicode which was pathological in all the same ways (case, variable width encodings, all the other things I am forgetting). Imagine if someone forked your work to handle that. Clearly there would be lots of small things different --- it would be hard to abstract over --- but the broad brushstrokes would be the same?
I am not interested in supporting things which are not like Unicode; I agree that if you stray too far towards large alphabets etc., all the algorithms one would want would be different anyways. But supporting things like Unicode, or strictly easier than Unicode, should be doable in some language, I think.
> The biggest difference is that I don't believe this is true. The interface may be extremely gnarly and hard to express, but I do believe the interface can always in principle be sliced. It may not be practical (the interface is far more complicated than the thing before cutting) but it is still in principle possible.
Well "practicality" matters. And I used weasel words such as "generally."
It's one thing to make concrete suggestions about how to actually achieve the thing you're looking for. But when you're way out in right field picking daisies, it just comes across (to me) as unproductive.
> but the broad brushstrokes would be the same?
This is basically impossible to answer. I could just as well say that the "broad brushstrokes" are the same between the `regex` and `regex-lite` crates.
Oh it's definitely overstated by suggesting it should live alongside a general purpose regex engine. :-)
I have no doubt that its utility can be great in niche use cases. I've never come across one in my decades of programming that I know of, but I'm sure they exist.
> I've never come across one in my decades of programming that I know of, but I'm sure they exist.
The login attempt example was convoluted, so here are more common scenarios:
- Grouping a list into pairs by matching /../
- Finding or collapsing repeated sequences with /(.)\1+/
- Parsing tag+value binary formats.
- Searching structured logs.
- DSLs for unit tests.
Are there better ways to do each of those? Yes, but either because someone implemented those specific functions, or it's a much longer solution.
Though I'm not recommending list-regexes for production code anytime soon. Prototypes and code golfing, sure, but not until the community understands it better.
One of the uses I had was for identifying particular subsequences of events in a long sequence. I actually at one point considered translating the sequence of events into a string of distinct Unicode code points, so I could define and match patterns as regular expressions!
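That trick is straightforward to sketch: map each event kind to a distinct character, build a string, and let an ordinary regex do the matching (the event names and the encoding here are made up for illustration):

    use regex::Regex;

    #[derive(Clone, Copy)]
    enum Event {
        LoginFailed,
        LoginOk,
    }

    fn encode(events: &[Event]) -> String {
        events
            .iter()
            .map(|e| match e {
                Event::LoginFailed => 'F',
                Event::LoginOk => 'S',
            })
            .collect()
    }

    fn main() {
        let events = [
            Event::LoginFailed, Event::LoginFailed, Event::LoginFailed,
            Event::LoginFailed, Event::LoginFailed, Event::LoginOk,
        ];
        let haystack = encode(&events);
        // Five or more failures followed by a success, as in the earlier example.
        let re = Regex::new(r"F{5,}S").unwrap();
        for m in re.find_iter(&haystack) {
            // Byte offsets map 1:1 back to event indices here because every
            // event is encoded as a single ASCII char.
            println!("5+ failures ending in a success at events {}..{}", m.start(), m.end());
        }
    }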
Someone should go out and build it. It won't be me though, that's for sure. Believe it or not, such a library wouldn't leverage much of my experience. The vast majority of the complexity inside the regex crate is about optimizing for sequences of bytes.
It's not that hard to build a backtracking matching engine, at least if performance is not important. I did that in the past for the C# syntax tree used in a decompiler.
Basically the syntax tree has a bunch of additional (non-syntax) pattern nodes, that support flexible regex-style matching of a pattern-tree against a concrete syntax tree.
That said, it would be a really cool feature to have some of the more complicated but common regexes (URLs, emails, dates) be some kind of "template" in the syntax. E.g. what if I could do '(?D<h=22>)' or something like that to search for all dates in any format that had some version of 10pm represented.
Anyway, super appreciate your work and your writing!
It's an interesting idea, but I think it would be worth trying to actually prototype it as its own library first and see if it catches on. My suspicion is that it probably won't work as nicely as you think it will unfortunately.
The C++ standard library std::basic_regex attempts to do this by exposing a templated class over a custom character type. https://en.cppreference.com/w/cpp/regex/basic_regex You can provide your own trait class to define things for your custom "characters".
Unfortunately performance goes out of the window. And it's likely to work as well as storing arbitrary non-character things into a custom std::basic_string.
You're going to have to define some sort of API for matching on a sliding window of values. This isn't impossible, but most languages don't really have great interfaces for it.
I use Ripgrep daily for searching anything in code or in text files, and I'm grateful every time I use it (on windows, linux, mac, vscode, vim, etc...). It's one of those pieces of software that has changed my life and how I work. Whenever I'm forced to use `grep`, it feels like I've travelled back in time to when everything ran on a single-core CPU and data sat on a slow spinning hard disk on PATA/IDE. Anyway, burntsushi definitely deserves honor amongst other great programmers.
ripgrep comes out of a "lineage", before it there was ag and before that ack, that did similar things to provide a much better interface than just grep.
For a problem at work, I needed to create a RegexSet of over 10 MM really long regexes. No engine out of the box could handle it. Even Rust’s RegexSet wasn’t good enough by default.
However, trying to use and read through regex-automata and regex-syntax (even in 2018) was such a valuable and helpful learning resource. I ended up modeling the project at work off Lucene APIs, but it was all only after learning the basics from the regex crates.
Yeah 10 million regexes is huuuuge. Aho-Corasick can barely handle 10 million literals.
Future work is definitely trying to make the regex engine scale better with more patterns. Currently, it will crap out well before 10 million regexes, and I'm not sure that goal is actually attainable. But it can certainly be better than what it is today.
And of course, Hyperscan is really the gold standard here as far as multi-pattern searching goes. I'm not sure how well it will handle 10 million patterns though.
Yeah, I basically built a competitor to Hyperscan without the proprietary Intel bits.
Quick notes on how to scale past X regexes in the set:
1. Don’t make a single RegexSet. Create multiple internally.
2. Ideally colocate similar patterns together.
3. Use the immutability of a RegexSet to your advantage by generating an inverse index of “must match” portions of each regex.
4. When you get an input to match, first check the inverse index for a “may match set”, then only execute the internal automata that may possibly match on the input.
The inverse index can be as complicated (FSTs, or all of Lucene) or as simple (a hash map) as you'd like it to be. The better it can filter the regexes in the set, the more performance and scale you end up with. And tune the sizes of the internal automata for your use case for some Goldilocks size for extra perf (a rough sketch of the overall shape follows).
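Assuming something like the Rust regex crate's RegexSet and a deliberately simple hash-map prefilter (all names here are mine, not the original code):

    use std::collections::HashMap;
    use regex::RegexSet;

    struct ShardedRegexSet {
        shards: Vec<RegexSet>,
        // Inverse index: a "must match" literal -> shards that could possibly
        // match when that literal appears in the input.
        prefilter: HashMap<String, Vec<usize>>,
    }

    impl ShardedRegexSet {
        /// Returns (shard index, pattern index within that shard) pairs.
        fn matches(&self, haystack: &str) -> Vec<(usize, usize)> {
            // 1. Consult the inverse index to build the "may match" set.
            let mut candidates: Vec<usize> = Vec::new();
            for (literal, shard_ids) in &self.prefilter {
                if haystack.contains(literal.as_str()) {
                    candidates.extend_from_slice(shard_ids);
                }
            }
            candidates.sort_unstable();
            candidates.dedup();

            // 2. Only run the automata that might possibly match.
            let mut hits = Vec::new();
            for &shard in &candidates {
                for pat in self.shards[shard].matches(haystack).iter() {
                    hits.push((shard, pat));
                }
            }
            hits
        }
    }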
Right. I'll also note that this strategy probably assumes something about the search task itself, and may not work for all possible search operations. Think about what happens when your haystack is GBs big for example.
I'm guessing the answer's 'no' because you didn't elaborate to begin with, but on the off-chance - can you share any more about what the problem/project was?
Sure, it has been a while. I worked at Amazon in Brand Protection, specifically on Trademark and Logo infringement.
There were many different strategies at play (e.g. neural nets), but false positives can be very expensive. The one I described above is an expert system of manual overrides that made the final decisions.
Amazon doesn't have much structured content (e.g. “title”, “description”), and bad actors are constantly trying to obfuscate using language-based grammar tricks or N1ke-type attacks.
Also just the law is very nuanced. You can say “a case compatible with iPhone, that is ..” but it is infringement to say “iPhone case that is ..”
Also there are many nuances around licensing. For example, “BurntSushi T-shirts” as a company cannot have the Disney logo on it; however, a “Lego t-shirt” might legally have the Disney logo on it. So a lot of overrides.
Even if you compress all the nuances into a single regex, which you can’t: there are 10 MM estimated brands worldwide (Amazon at the time hosted 1 MM), with on average 10 trademarks each; that is 10 MM regexes.
Then you also need to multiply by the 25 countries Amazon operates in for legal nuances, and by multiple languages (e.g. even in the US Amazon store someone might use Spanish or Italian to sell their fake product). Additionally, there is more fanout for reasons I’ve long forgotten.
Point of the story: 10 MM was a conservative upper bound on the RegexSet, and it had to be fast and cheap. I built it on top of Lucene with some hacks here and there, but it did (does) the job - fast and cheap :)
I hope now generative AI can help them more, but I’m not holding my breath.
> Even if you compress all the nuances into a single regex, which you can’t
You can't; standalone hardware isn't that capable, but distributed computing/cluster computing is.
> I hope now generative AI can help them more, but I’m not holding my breath.
I'm not, from what I have seen.
>Also just the law is very nuanced. You can say “a case compatible with iPhone, that is ..” but it is infringement to say “iPhone case that is ..”
And Amazon would be up against users of Lexis Nexis, who have a head start in all of this RegEx malarkey by virtue of what they sell, without giving too much away about their code's inner workings.
I experimented with the regex-automata crate in the past and it's the only one I found which can be used for text editors because the direct access to the underlying DFA makes it compatible with any text data structure. The usual regex library APIs expect the input to be one contiguous string.
I was in the middle of writing some regex-automata code (using the early 0.2.0 release) when this post dropped. Welp, time to see if I'm going to have to start all over digging into some brand new internals~
(I haven't read the post yet because I have an important call in a few minutes, but it looks like a very interesting and also conveniently-timed blog post.)
(Edit a few minutes later: looks like the answer might be yes, but since this is a polished release, I might be able to simplify my code massively. Wish me luck~)
(Edit 2 about 10 minutes later: well that was pretty painless and the new Builder::patch method is a total upgrade. Awesome~)
P.S. I'm still blocked from all your GitHub repositories and I think that's kind of unfair considering how widespread a lot of your crates are. I don't remember the original incident anymore. I believe the regex crate itself is under the rust-lang organization now, but there are still others I can't interact with.
> I'm still blocked from all your GitHub repositories and I think that's kind of unfair considering how widespread a lot of your crates are. I don't remember the original incident anymore.
I don't either. I block a lot of people for a lot of reasons. I unblocked you.
> The regex-automata 0.2.0 docs did have a giant warning about this
I didn't listen, and it paid off, because a bunch of the knowledge I learned on it is still perfectly applicable to 0.3.0; I already have the same NFA as before. :)
Now my appointment got canceled so I get to read your blog post and play with the new version. Fun!
BioJulia has published Automa.jl, which is a pure-Julia regex engine that can insert arbitrary Julia code at compile time.
Not to take away from Rust's regex, which is far more advanced than Automa, but I just take issue with the claim that that library is the first time regex internals have been exposed as a library :)
These sound like two different things. PCRE2, for example, has support for "callouts" which sound similar to what you're saying? https://www.pcre.org/current/doc/html/pcre2callout.html --- Other things like ragel and re2c have done similar things.
What I'm talking about in this blog is very different from that. This is about taking the internals of the regex library itself and turning them in a separately versioned library that others can compose.
This makes somewhat less sense for a backtracker because most backtrackers just have one engine: the backtracker. But an automata based library usually has many different engines that can be composed in different ways. Still, backtrackers have things they could expose but don't in practice, such as a regex parser and an AST.
You're both correct. In the context of Julia, where compilation and run time are arbitrarily interleaved, something like re2c can be a library and used at run time.
Just to be clear here, I mentioned re2c as an example of what my blog is not about. There is a category difference between "ask a regex engine to run arbitrary code" and "expose internals as a separately versioned library for others to compose as they want." They service different use cases.
burntsushi, the code example after "we can also increase it by decreasing the size of our regex by disabling Unicode mode:" seems to be the same as the previous code example. Copy-paste mistake?
My biggest wish for the Rust regex crate was support for ARM SIMD. Currently the literal optimization is only vectorized on x86. It seems that the author doesn’t own an ARM machine, and therefore can’t test an implementation and so doesn’t want to support it.
I actually have an M2 mac mini in the mail from Apple for exactly this purpose.
My time horizon is very long. It takes me a long time to do things these days.
It has never been true that I don't want to support it. Merely that it is difficult to verify and test if I can't do it myself. There is also the problem that the port from x86 to arm is not straight-forward, due to both my own ignorance and what I believe are important missing vector operations such as movemask.
The current PR for ARM SIMD[1] uses a different instruction mix to achieve the same goals as movemask. I tested the PR and it has a significant speedup over the non-vectorized version.
> It seems that the author doesn’t own an ARM machine, and therefore can’t test an implementation and so doesn’t want to support it.
Tangential (since he answered already), but in this era of cheap cloud instances, I'd be surprised if "not owning a <blank>" is any hindrance to support.
> but in this era of cheap cloud instances, I'd be surprised if "not owning a <blank>" is any hindrance to support.
When you own a device, it doesn't cost you more (other than electricity) if you take longer while playing with it, and there's no extra cost (other than space) to having it ready to be used when necessary. When you rent a cloud instance, there's a direct monetary cost for every extra minute you take while investigating an issue, and it's less guaranteed that it'll be available when you need it.
Or, to put it more simply: unless he has his own personal device to play with, he'll have to spend extra money for every issue he investigates and every change he makes to the ARM support.