Not directly related, but recently I've built a regex engine that was a bit different: I implemented the semantics of Java's java.util.regex.Pattern on top of JavaScript's RegExp. This is for Scala.js' implementation of Pattern. [1]
It turns out that they are wildly incompatible by default, so directly passing through the regular expressions to RegExp is incorrect in most cases. That was what we used to do before, though. Other compilers like GWT use the same technique.
So I implemented a complete compiler from Java regular expressions to JavaScript regexes that preserves semantics. It supports virtually everything, and fails fast for the few things that are not supported (like grapheme clusters or canonical equivalence). Some things are very tricky to get right, like atomic groups or handling of surrogate pairs when the 'u' flag is not natively supported.
The advantages of that strategy over reimplementing a complete engine (which is what TeaVM does) are a) code size and b) we get the performance of native RegExps (which are JIT-compiled to native code, these days).
Not directly related to your not directly related:
Is there any regexp syntax that works across the main language platforms (e.g. JS, C, Rust, Ruby, Go) with identical semantics? Is there a subset of each language's native regexp syntax that is common across all of them? Or perhaps a third-party regexp library that has been implemented for each language?
I've found one: Google's C++ RE2, which has wrappers for many languages.
It's still really early stages, but I am actively working on a regexp engine[0] in Zig which aims to blur the line between regex engines and advanced parsing algorithms used to parse programming languages.
I am quite optimistic that due to Zig's portability and cross compilation, it shouldn't be too hard to expose it as a nice C library and use it from other languages (and WebAssembly) soon.
I'm not an expert in regexes in other platforms, but already if you want a common subset between Java and JS, you're limited to the following constructs:
* Literal characters
* Some escape sequences for literal characters: \t \r \n \xHH \uUUUU as well as the ones for the regex metacharacters such as \[ \( etc.
* Groups, sequences and alternatives
* Look-ahead (positive and negative)
* Look-behind, if supported (ES 2018+)
* Back-references to existing groups opened before the back-reference, but you need to make sure that they're not syntactically followed by a digit; otherwise, parsing differs between JS and Java
* Greedy and lazy quantifiers
* Custom character classes with [], without intersections or inner custom character classes
That's assuming you have 'u' support in JS (ECMAScript 2015) and that you use it, and that you don't use any other flag. (You can use the 'g' flag in JS or not; it affects the behavior of the methods, but not of the regex itself.) If you don't have 'u' support, the semantics only coincide if you restrict your regex and your input to ASCII characters.
Some things that are unexpectedly NOT portable:
* Most pre-defined character classes, such as \s
* Lookahead/lookbehind are not supported in Go and Rust (unless using fancy-regex, from what I understand); see the sketch below
Also beware that character classes like `[]` have differing Unicode support: some operate on ASCII, some on Unicode code points, and some on extended grapheme clusters, so YMMV.
As you start to expand to more languages, the common feature set is _far_ more limited.
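To make the Rust part of that gap concrete, here's a minimal check, assuming recent versions of the `regex` and `fancy_regex` crates as dependencies: the former rejects look-around when the pattern is compiled, while the latter accepts it via backtracking.

    fn main() {
        let pattern = r"(?<=USD)\d+";

        // The standard `regex` crate refuses look-around outright.
        assert!(regex::Regex::new(pattern).is_err());

        // `fancy_regex` compiles and matches it; `is_match` returns a
        // Result because backtracking can fail at runtime (step limits).
        let re = fancy_regex::Regex::new(pattern).unwrap();
        assert!(re.is_match("USD100").unwrap());
    }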
Lookbehinds are annoying to implement, because they require driving the character iteration backwards. Now every regex construct has to be parametrized on whether it is tracking forwards or backwards. It is not something which can be easily tacked on to an existing engine.
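To make the forwards/backwards point concrete, here is a hand-rolled sketch (not code from any particular engine): the primitive that fetches "the next character" has to be parametrized on direction, stepping over whole UTF-8 code points either way.

    #[derive(Clone, Copy)]
    enum Direction {
        Forward,
        Backward,
    }

    // Returns the character adjacent to byte offset `pos` in the given
    // direction, plus the new offset. `pos` must lie on a char boundary.
    fn step(input: &str, pos: usize, dir: Direction) -> Option<(char, usize)> {
        match dir {
            Direction::Forward => {
                let c = input[pos..].chars().next()?;
                Some((c, pos + c.len_utf8()))
            }
            Direction::Backward => {
                let c = input[..pos].chars().next_back()?;
                Some((c, pos - c.len_utf8()))
            }
        }
    }

    fn main() {
        // A lookbehind evaluated at the end of "naïve!" walks backwards.
        let s = "naïve!";
        assert_eq!(step(s, s.len(), Direction::Backward), Some(('!', s.len() - 1)));
        assert_eq!(step(s, 0, Direction::Forward), Some(('n', 1)));
    }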
I will plug my own Rust regex engine "regress", which supports arbitrary lookbehinds. Note fancy-regex only supports fixed-width lookbehinds.
I remember having built my own regex engine (for educational purposes only) following his articles almost a decade ago. Great fun.
Another fun thing you can do (and I have done) is that given a regex, construct strings that fully match the regex. It's not hard at all, but it's quite subtle if you want good results. For example given the regex "a+b+" the code can simply create strings like "ab", "aab", "aaab"... (with one "b" always), but it is slightly more challenging to have it output "aab", "abb", "aabb", etc, which involves forming the Cartesian product of infinite streams in a smart way.
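Here is a minimal sketch of the "smart" part (the names are made up for illustration): enumerate the Cartesian product of two infinite streams along diagonals, so that every combination of repeat counts is eventually produced.

    // Cantor-style diagonal enumeration of the product of two infinite
    // iterators; buffers the elements it has seen so far.
    fn diagonal_product<A: Clone, B: Clone>(
        mut a: impl Iterator<Item = A>,
        mut b: impl Iterator<Item = B>,
    ) -> impl Iterator<Item = (A, B)> {
        let mut xs: Vec<A> = Vec::new();
        let mut ys: Vec<B> = Vec::new();
        let mut diag = 0usize;
        std::iter::from_fn(move || {
            // Make sure diagonal `diag` has its elements buffered.
            while xs.len() <= diag {
                xs.push(a.next()?);
            }
            while ys.len() <= diag {
                ys.push(b.next()?);
            }
            let pairs: Vec<(A, B)> = (0..=diag)
                .map(|i| (xs[i].clone(), ys[diag - i].clone()))
                .collect();
            diag += 1;
            Some(pairs)
        })
        .flatten()
    }

    fn main() {
        // For "a+b+": pair up the matches of "a+" and "b+".
        let a_plus = (1..).map(|n| "a".repeat(n));
        let b_plus = (1..).map(|n| "b".repeat(n));
        // Prints: ab, abb, aab, abbb, aabb, aaab
        for (x, y) in diagonal_product(a_plus, b_plus).take(6) {
            println!("{}{}", x, y);
        }
    }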
They really don't (remain the gold standard, that is).
Don't get me wrong: I like RE2. However, it's 2-3 fairly standard ideas taped together in an idiosyncratic way that happens to work pretty well for Google and many other users.
I will confess (having designed Hyperscan) that it's 200-300 ideas ranging from 'boring as heck' to 'wildly non-standard' taped together in an idiosyncratic way that happens to work pretty well for a different subset of users (mostly in much more heavyweight network IPS cases).
I frankly think we're still waiting for a 'gold standard' in regular expression implementation. I daydream of a system that doesn't have the elephantine complexity of Hyperscan (and isn't inherently tied to x86) but still supports large scale regular expression matching and streaming. Ideally this system would also not have the weird "corners" of RE2 (dynamically constructing a DFA at run-time).
Even nicer: native support for 'difficult' constructs like backtracking - by native I mean "one integrated algorithm" not "first one run approach, then run the other from scratch if the fast path matches".
Hyperscan may be idiosyncratic but is quite an accomplishment, and by all accounts the fastest there is, kudos!
Do you know of any resources describing how to handle 'difficult' constructs in DFAs? Capture groups, zero-width assertions, and high-trip-count loops are examples: these are all straightforward in pcre-style backtracking, but require novel techniques in DFAs. Some of these techniques are research papers (tagged DFAs), some are yet-to-be-written papers (re2 drives the loop "one ahead" for zero-width assertions), some are open questions. What's the path forward here?
These difficult constructs are a brace of unpleasant questions, all quite distinct. Laurikari has some good work on capturing; we also did capturing in Hyperscan in an unreleased branch during the closed-source days (idea: run a trace of states from the NFA backward, then follow the trace forward, emulating what a backtracker would do).
Zero-width assertions are painful for automata-based approaches, especially forward asserts. Backward asserts are easy in a bit-parallel NFA, although I'm not aware of anyone actually doing that (you just need to have special "AND" states rather than the usual relation of being OR'd on if any of your predecessors are on). Even streaming would be doable.
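For anyone who hasn't met the term, a bit-parallel NFA keeps one bit per NFA state in a machine word and updates them all at once. The textbook Shift-And scan for a plain literal gives the flavor (a minimal sketch, nothing Hyperscan-specific; the "AND" states above would be an extension where a state turns on only if all of its predecessors are on, instead of any of them):

    // Textbook Shift-And (bit-parallel NFA) search for a literal pattern.
    // Bit i of `state` means "the pattern matched up to position i here".
    fn shift_and_find(pattern: &[u8], text: &[u8]) -> Option<usize> {
        let m = pattern.len();
        assert!(m > 0 && m <= 64, "sketch limited to one machine word");
        // masks[c]: bit i set iff pattern[i] == c.
        let mut masks = [0u64; 256];
        for (i, &c) in pattern.iter().enumerate() {
            masks[c as usize] |= 1 << i;
        }
        let accept = 1u64 << (m - 1);
        let mut state = 0u64;
        for (pos, &c) in text.iter().enumerate() {
            // Advance every live state, re-inject the start state, and keep
            // only states whose expected byte matches `c`.
            state = ((state << 1) | 1) & masks[c as usize];
            if state & accept != 0 {
                return Some(pos + 1 - m); // start offset of the match
            }
        }
        None
    }

    fn main() {
        assert_eq!(shift_and_find(b"bar", b"foobarbaz"), Some(3));
        assert_eq!(shift_and_find(b"qux", b"foobarbaz"), None);
        println!("ok");
    }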
As a rule, forward asserts are ugly and are "as easy to do as determinizing the two patterns together from that point on" - which can be trivial, or it can be a horrible world-smashing explosion.
High trip count loops were a lot of work in Hyperscan. We special-cased the single-character-width ones to properly handle /foo.{1500}bar/ and the like, but of course even Unicode can screw this up; something as trivial as /foo.{100,200}bar/ with anything but "dot means any single byte" - e.g. UTF-8, or even a fixed 2-byte code unit - is much harder.
Wrapping a fully general regular expression inside a large bounded repeat is a nightmare. I have some ideas kicking around for how to do this that I've toyed with for years.
As might be apparent, I'm thinking of returning to the general regular expression fray (probably with a new project), so I've been thinking a lot about this stuff.
I feel like this is like describing GCC or Clang as "a bunch of ideas taped together". That is more or less true but also fails to capture a pretty big achievement.
Or same with v8 and SpiderMonkey. They have a bunch of different optimization techniques tuned for specific workloads.
That is, regexes are a rich enough language now, with enough different use cases, that I think it will be true for any future implementation.
---
I think the articles themselves are great, but they're also pretty dense, and I wish there was a version for regex users, not implementers. It seems like people still have a lot of problems with the backtracking regex vs. regular language distinction in practice.
GCC and Clang have phases that communicate synergistically. If I'm working on instruction scheduling, I can get the benefits of an upstream pass that did code motion. Conversely, if I'm working on a backend for x86, I can both share the benefit of the front-end generic optimizations with an ARM backend [ edit: and not be concerned about what's going on in the other backend as well ].
By contrast, both Hyperscan and RE2 have a tendency to just throw different, well understood techniques at the problem of matching regular expressions, without necessarily getting much decomposition of the problem. Hyperscan goes further but at great expense and complexity - and many of the decisions we made are very "one size fits all".
One of the consequences of this is that the regex vs regular language distinction you speak of tends to pop up aggressively and annoyingly, even when there are perfectly good possibilities for a solution that integrates automata- and backtracking-based matching. RE2 handles this by taping backtracking to an automata-based approach, and Hyperscan doesn't solve this at all (unless you count Chimera, which tapes libpcre to Hyperscan).
Users probably shouldn't have to know about the underlying engine, but the other problem of the idiosyncratic nature of all of our 'taping' jobs is that each system has weird corners that users can wind up suddenly finding out about (and often can't really control). My dream library would allow users to indicate which tradeoffs they want (potentially telling them to "go away" when they ask for the impossible or unimplemented; e.g. "I want fully general backreferences, streaming and fixed-sized stream state").
Hm decomposing the problem and integrating automata and backtracking sound interesting. Although I'll say you can't claim that the RE2 articles aren't the gold standard if you haven't worked on or published this yet! :) (as code or prose)
Many things may have changed since they were written, but as far as I can tell they haven't been demonstrated in a friendly way, i.e. with the short example code in those articles.
> I frankly think we're still waiting for a 'gold standard' in regular expression implementation.
Is it the implementations, or is it regexes themselves? I was under the impression that SNOBOL/SPITBOL/REBOL was the gold standard... But I guess you're saying that given some (POSIX?) regex syntax, implementations can/should be improved? (while e.g. REBOL/Red-lang goes more towards "a more pragmatic regular language for pattern recognition", I guess?)
I'm fascinated by PEG parsers, but I don't know whether they can be implemented in a way that's remotely as performant as even mediocre regular expression implementations, much less tuned SIMD extravaganzas.
That would be a stretch goal for me: build a regular expression / automata library that's sufficiently modular to keep some components and drop in more expressive constructs (not necessarily PEGs, but maybe pushdown automata).
> Simple PEG (Parsing expression grammar) matching. Uses no memoization, but uses superoperators and symbol inlining to improve performance. Note: Matching performance is hopefully competitive with optimized regular expression engines.
I've written a lot of things that are "hopefully competitive" with something else, but generally benchmarking is the ideal. :-) Hard to benchmark PEG vs regex fairly. Not dismissing it out of hand, as I've had many years to observe the weak spots of regex at quite a short distance.
Creating strings that match a regex can be very useful for testing too, as it allows you to fuzz a regex engine without using another regex engine as oracle.
It is indeed quite subtle to be "productive" (that's the term we use for making sure both "+"s get explored). I wrote a paper about how to do it well a few years ago, with quite a few tricks: https://regex-generate.github.io/regenerate/
I disagree with the sibling comments: regular expressions should still be about regular languages. Just because Larry Wall committed yet another atrocity doesn't mean we should all follow him off the cliff. Perl can keep its monster.
However, that doesn't mean we shouldn't strive for richer features (within the confines of regularity), or that a parsing library automatically does the job better.
For instance: intersection and complement are regular operators that work very well in regular expressions. Lookahead is a regular operator (lookbehind and backreferences are not), etc. Those operators do not necessarily work at all in richer parsing libraries. When added to regular expressions, they should be well handled and optimized by regex engines (which is currently not the case).
There's still work to do, features to add, which require appropriate testing techniques.
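To recall why intersection stays within the confines of regularity, here is the classic product construction on complete, table-driven DFAs (a minimal sketch with illustrative types, ignoring unreachable-state pruning):

    // Table-driven DFA over bytes: delta[state][byte] = next state.
    struct Dfa {
        start: usize,
        accepting: Vec<bool>,
        delta: Vec<[usize; 256]>,
    }

    // Intersection: run both DFAs in lockstep; a pair-state accepts
    // iff both components accept.
    fn intersect(a: &Dfa, b: &Dfa) -> Dfa {
        let (na, nb) = (a.accepting.len(), b.accepting.len());
        let id = |sa: usize, sb: usize| sa * nb + sb; // encode the pair
        let mut accepting = vec![false; na * nb];
        let mut delta = vec![[0usize; 256]; na * nb];
        for sa in 0..na {
            for sb in 0..nb {
                accepting[id(sa, sb)] = a.accepting[sa] && b.accepting[sb];
                for c in 0..256 {
                    delta[id(sa, sb)][c] = id(a.delta[sa][c], b.delta[sb][c]);
                }
            }
        }
        Dfa { start: id(a.start, b.start), accepting, delta }
    }

Complement is even simpler on a complete DFA: flip the accepting bits. The hard part is doing this lazily and efficiently inside a real engine, not the construction itself.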
Regular languages are closed under a host of more interesting operations than those that have 'grown with the capabilities of our pattern matching engines'.
Regex engines don't just parse regular languages - that ship sailed a long time ago. As I understand it, any regex engine supporting backtracking is no longer working with regular languages.
As the author of Perl's regex engine Larry Wall puts it[0]:
> “Regular expressions” […] are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I’m not going to try to fight linguistic necessity here. I will, however, generally call them “regexes” (or “regexen”, when I’m in an Anglo-Saxon mood).
Another interesting exercise is generating a uniformly random string of length N matching a given regular expression (one with no limit on the number of characters it can match).
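One standard approach, sketched here over a hand-built DFA with made-up types (it assumes the `rand` crate's 0.8-style Rng::gen_range; any RNG would do): count, for every state, how many accepted strings of each remaining length exist, then walk forward choosing each symbol with probability proportional to the number of completions it leaves.

    use rand::Rng;

    const ALPHABET: [char; 2] = ['a', 'b'];

    struct Dfa {
        start: usize,
        accepting: Vec<bool>,
        // delta[state][symbol index] = next state, or None if the DFA dies
        delta: Vec<[Option<usize>; 2]>,
    }

    fn sample_uniform(dfa: &Dfa, n: usize, rng: &mut impl Rng) -> Option<String> {
        let states = dfa.accepting.len();
        // counts[k][s] = number of accepted strings of length k from state s
        let mut counts = vec![vec![0u64; states]; n + 1];
        for s in 0..states {
            counts[0][s] = dfa.accepting[s] as u64;
        }
        for k in 1..=n {
            for s in 0..states {
                counts[k][s] = dfa.delta[s]
                    .iter()
                    .filter_map(|t| *t)
                    .map(|t| counts[k - 1][t])
                    .sum();
            }
        }
        if counts[n][dfa.start] == 0 {
            return None; // no accepted string of length n
        }
        let mut out = String::with_capacity(n);
        let mut state = dfa.start;
        for k in (1..=n).rev() {
            // Pick the next symbol weighted by how many completions remain.
            let mut pick = rng.gen_range(0..counts[k][state]);
            for (i, next) in dfa.delta[state].iter().enumerate() {
                if let Some(next) = *next {
                    let c = counts[k - 1][next];
                    if pick < c {
                        out.push(ALPHABET[i]);
                        state = next;
                        break;
                    }
                    pick -= c;
                }
            }
        }
        Some(out)
    }

    fn main() {
        // Hand-built DFA for a*b+ over {a, b}.
        let dfa = Dfa {
            start: 0,
            accepting: vec![false, true],
            delta: vec![[Some(0), Some(1)], [None, Some(1)]],
        };
        let mut rng = rand::thread_rng();
        println!("{:?}", sample_uniform(&dfa, 8, &mut rng));
    }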
Regex is a mini-language and my library compiles it to Rust. Writing it was surprisingly difficult. While working on one bug, I had a moment of doubt about my ability to solve it. I've had such experiences only a few times in my life. Luckily, I woke up the next morning with the solution in my mind.
The work was satisfying. I also wrote a safe Rust async library, like a mini-tokio without all the scary bits: https://crates.io/crates/safina . And I started writing a safe Rust TLS 1.3 implementation and a safe Rust HTTP library. Working on a challenging side project can bring you satisfaction, knowledge, and confidence, even if you never finish the project.
I'm not the author, but it looks like he's using the same general approach as other similar libraries: Rust has a fairly advanced macro system that can be used to parse input strings at compile time and output hygienic Rust code. There are safe database libraries for example that check that the SQL query and the supplied parameters match.
safe-regex uses a Rust Procedural Macro [0] to convert expressions like `regex!(br"ab?c+")` into a Rust code block that defines a new type and returns an instance of it. The type implements a `Matcher` interface with a function that takes a byte array, executes the regular expression against it, and returns true/false and the capture groups. Examples of the generated code are in [1].
The generated function stores capture group info on the stack. It is not recursive. It makes one pass through the input, using an NFA algorithm. It is plain Rust code that starts instantly. This instant startup time makes `safe-regex` faster on small inputs than `regex` [3] which must compile or retrieve from cache at runtime.
The `regex` library has highly optimized code which implements Aho-Corasick with SIMD instructions to find substrings and then matches the regex forward or backward from each substring match, while also tracking UTF-8 character boundaries. It's amazing. On large inputs, it is 20-500 times faster than `safe-regex` in benchmarks.
I expect that the simpler and shorter code from `safe-regex` puts less pressure on the processor instruction caches, improving performance of the rest of the program and other processes running on the machine. I believe Rust's benchmark library does not measure this.
The Rust compiler optimizes the generated rust code. This means that `safe-regex` gets faster as the Rust compiler gets better. It also exercises the Rust compiler like few hand-written programs do. A previous version of `safe-regex` would compile expressions like 'a?a?a?a?a?' to Rust code that rustc could not compile in reasonable time. I let it compile for an hour once and it didn't finish. I think the rustc optimizer was traversing the potential execution paths with an inefficient algorithm. The workaround was to add extra useless intermediate function calls, so the optimizer would reach some internal limit earlier and give up, allowing the compiler to continue and complete. I saved the code that showed this problem [2]. A refactoring eliminated the problem entirely.
The `regex!(br"ab?c+")` syntax is intuitive for Rust users. An invalid expression fails at compile time with a decent error message. The library does not allocate on the heap. The matcher object consumes a single byte when not matching. This is all very idiomatic. Macros are difficult to fuzz test because the Rust compiler must run for each test case. Writing tests by hand is tedious. I wrote a test helper that generates input strings. It uncovered some subtle bugs.
Here's how safe-regex works: every matcher has a variable for each byte of the pattern (b0, b1, etc.), representing states in the NFA. We iterate through the input bytes. In each iteration, we get the next input byte `b` and calculate the transition of every state for consuming that byte.
A matcher with no capturing groups uses the type Option<()> to represent each state. That means each state can be Option::Some(()) "input matched up to this point" or Option::None "no match". When the loop finishes, if the `accept` state is Option::Some, then the input matched. We use `Option::filter` to copy the preceding state only if the byte matches.
The core of the matcher for "a" [1]:
    let mut start = Some(());
    let mut b0: Option<()> = None;
    let mut data_iter = data.iter();
    loop {
        let prev_b0 = b0.clone();
        if let Some(b) = data_iter.next() {
            b0 = start.clone().filter(|_| *b == 97u8);
            start = None;
        } else {
            return prev_b0;
        }
    }
When matching with capturing groups, we must keep track of every capturing group for every state. A capturing group match is two integers representing the start and end of the matching portion of the input. We store that in a Range<usize> struct. There can be multiple capturing groups, so we store the ranges in a tuple. The state variables become type Option<(Range<usize>,)>.
When the input enters a capturing group, we set the Range for that group to the position of the current byte: `.map(|(r0,)| (n..n,))`. Technically, it's the zero-length range between the current byte and the previous byte.
Whenever a byte matches inside a group, we expand the range for that group to include the byte: `.map(|(r0,)| (r0.start..n + 1,))`.
Here's the core of the matcher for "(a)" [1]:
    let mut start = Some((usize::MAX..usize::MAX,));
    let mut b0: Option<(core::ops::Range<usize>,)> = None;
    let mut accept: Option<(core::ops::Range<usize>,)> = None;
    let mut data_iter = data.iter();
    let mut n = 0;
    loop {
        let prev_b0 = b0.clone();
        accept = prev_b0.clone();
        if let Some(b) = data_iter.next() {
            b0 = start
                .clone()
                .map(|(r0,)| (n..n,))
                .clone()
                .filter(|_| *b == 97u8)
                .map(|(r0,)| (r0.start..n + 1,));
            start = None;
        } else {
            break;
        }
        n = n + 1;
    }
The matcher for "(abc)d" [2] is a more thorough example.
I don't get it? In a DFA there's no way to store "the value of a capture group" because each state may have multiple values for any capture group, depending on how it was reached.
WAIT, I think what you've built is NOT a DFA, but an NDFA! Your code has multiple states active at once, and it just tests them all! Ok, now it all makes sense.
In a DFA, it's common for a state to have multiple values for capture groups. For example in /(a*)a*/, state 1 is "in the first loop, or in the second loop with \1={0,0}" and state 2 is "in the first loop, in the second loop with \1={0,0}, or in the second loop with \1={0,1}". So the second state has multiple simultaneous values for group 1; this is why tagged-DFAs usually need more "tags" than there are capture groups and that's why I was asking.
I never filed the rustc bug report. I suspect that the problem is actually in LLVM.
Byte strings are much easier to use than UTF-8. Also, I only needed to execute regular expressions against ASCII HTTP headers. I think `safe-regex` will get String support eventually, or somebody will write another safe regular expression library.
Safe Rust has a bright future. Eventually, modern OSes will simply refuse to run software built from unsafe code, unless it has attached correctness proofs. This will mostly solve the IT security dumpster fire. To get us there, we need to write more useful things in safe Rust or other memory-safe languages.
I was really impressed with just how many fantastic blog posts there were that introduced the key concepts, and also how easy it was to create my own regex engine.
The bigger challenge was testing. I ended up using the PCRE test suite which I transformed into a suitable test suite for my engine:
I'm curious: what is the advantage of a custom engine over calling the JavaScript's `RegExp` via the interop layer of Wasm<->JS? I guess you'd have to convert the Wasm input string to a JS string, but is there any other issue?
I was learning about continuations in LISP and how you can use them to get rid of the call stack and implement pseudo-concurrency. After reading some articles I realised that the same approach can be used to implement a backtracking regex engine in just a few lines of code.
I was reading about LISP, but the code is in Java. Don't judge me.
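Not the original Java, but the idea translates to a few dozen lines of Rust (a hand-rolled sketch over a toy AST): the continuation `k` encodes "what the rest of the pattern should do with the rest of the input", and alternation backtracks simply by trying `k` along each branch.

    // Toy regex AST.
    enum Re {
        Char(char),
        Seq(Box<Re>, Box<Re>),
        Alt(Box<Re>, Box<Re>),
        Star(Box<Re>),
    }

    // `k` is the continuation: given the remaining input, does the overall
    // match succeed from here?
    fn matches(re: &Re, input: &str, k: &dyn Fn(&str) -> bool) -> bool {
        match re {
            Re::Char(c) => match input.chars().next() {
                Some(first) if first == *c => k(&input[first.len_utf8()..]),
                _ => false,
            },
            Re::Seq(a, b) => matches(a, input, &|rest: &str| matches(b, rest, k)),
            Re::Alt(a, b) => matches(a, input, k) || matches(b, input, k),
            Re::Star(inner) => {
                // Greedy: try one more repetition first (requiring progress
                // so we don't loop forever), then fall back to `k`.
                matches(inner, input, &|rest: &str| {
                    rest.len() < input.len() && matches(re, rest, k)
                }) || k(input)
            }
        }
    }

    fn full_match(re: &Re, input: &str) -> bool {
        matches(re, input, &|rest: &str| rest.is_empty())
    }

    fn main() {
        use Re::*;
        // a(b|c)*d
        let re = Seq(
            Box::new(Char('a')),
            Box::new(Seq(
                Box::new(Star(Box::new(Alt(Box::new(Char('b')), Box::new(Char('c')))))),
                Box::new(Char('d')),
            )),
        );
        assert!(full_match(&re, "acbbd"));
        assert!(!full_match(&re, "abx"));
        println!("ok");
    }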
I have a (mostly abandoned, since I couldn't find an organization to be an early adopter in the time I gave myself to work on it) side project with this aim. I spent a lot of time thinking about language design, from the perspective of enabling adoption of a new language when an existing one has so much network effect:
I agree. I would think there should be a better DSL at the finite state machine/automaton/grammar level, but designed such that regular expressions could be implemented trivially and in a readable manner.
Fun with regex: if you're using the GCC C++ compiler and build your application with -D_GLIBCXX_DEBUG you can call the secret `std::regex::_M_dot(std::ostream&)` member function and it will generate a graphviz illustration of the regex state machine.
This was fantastically useful for developing my own regex engine.
[1] https://github.com/scala-js/scala-js/pull/4455