The Regular Expression Visualizer, Simulator and Cross-Compiler Tool (robertelder.org)
197 points by robertelder on July 28, 2020 | 43 comments



Hi, I'm the author of this page. This tool has a lot of complexity in it and a huge number of corner-cases. If you discover any bugs or XSS vulnerabilities, feel free to let me know.


This is fantastic, thanks for sharing it.

I wish I'd had it when I was a CS professor in 2004-2006, teaching Compilers to 2nd-year university students. They would have found it really useful.


I'm having trouble with backtracking and groups. It's possible I'm doing something wrong, but

    [bc]*(cd)+
doesn't seem to match

    cbcdcd
Nor does the simpler

    a*(ab)
match

    aab
But

    a*ab
matches aab.


You found a bug: when computing the animation frames for the visual, after a backtracking event, I forgot to restore the 'progress' values for the PROGRESS nodes to what they were before the stack push took place. I think I fixed this issue, and I just pushed an update, which should appear any second once the cache invalidates.


Cool! Thanks for looking at it/fixing it.

P.S. The whole split() thing you do is an interesting hack to not have to preserve/rewind so much state. :D I've never quite seen fork() used so...


I tried using this expression:

   ^((?:(?:https?|ftps?):\/\/)|irc:\/\/|mailto:)([\w?!~^\/\\#$%&'()*+,\-.\/:;<=@]*[\w~^\/\\#$%&'()*+,\-\/:;<=@])  
and it seems to fail to visualize it.


The (?:stuff) part is a non-capturing group, which I haven't added support for. Maybe in the near future. It would be one of the easiest features to add, since the tool also doesn't support backreferences yet.

Edit: Actually, since the non-capturing group part doesn't add much here, you can just remove the '?:'s and this works:

https://blog.robertelder.org/regular-expression-visualizer/?...
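(Removing the '?:'s just turns the non-capturing groups into capturing ones, which changes group numbering but not which strings are matched. A quick illustration in Python, with a toy pattern of my own:)

    import re

    # Both patterns accept exactly the same strings; only capturing differs.
    assert re.fullmatch(r'(?:ab)+c', 'ababc')
    assert re.fullmatch(r'(ab)+c', 'ababc')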


The first one that I tried also failed for the same reason, FWIW. Cool tool! :)


That's really impressive work. What flavor of regular expressions do you follow there?


I tried to follow a middle-of-the-road, works-mostly-everywhere, made-up flavour. Here is a grammar for it:

https://blog.robertelder.org/regular-expression-parser-gramm...

I may add to it in the future if there is enough interest. At the moment it doesn't support a few common things like word boundaries and backreferences, but I don't think they'd be too hard to add.


Any chance of open sourcing it so we can contribute some regex features to your awesome project? :)


Along these lines, https://regex101.com is consistently a godsend for writing regexes.


Yeah regex101 is probably the best one I've found so far. It allows you to provide sample text and shows the analysis and output you'll get including captured groups.


And https://regexper.com is a different type of regex visualizer from the OP's.


I've used https://regexr.com/ for composing regex.


+1. Regex101 is my go-to whenever I need to whip up a more complex regex.


I would have loved this when I worked on a regex engine a few years back. That was before I took my CS theory course, and I had no idea what an automaton was. Let me tell you, having a formal theory and visualization in front of you while working on something is extremely helpful.

This looks like it would be a great tool for both people looking to learn, and those who are looking to optimize their regular expressions as well.


When I read the title, I half hoped ‘cross-compiler’ referred to translating between different regex varieties. Anyone know of a tool that does this?


Anyone written an AST-aware regex engine?

Examples: replace s/(?ast:comment)foo/bar/ to replace foo with bar, but only in comments. Or s/(\w+): (?ast:openparen)(.*?)(?ast:closeparen)\s+(?ast:openbracket)(.*?)(?ast:closebracket)/function $1($2)\n{\n$3\n}\n/ to convert from

    foo: (args) {
      codeblock
    }
to

    function foo(args) {
      codeblock
    }
etc..?


semgrep will at least do the search portion, not sure about replace. Posted a few days ago:

https://news.ycombinator.com/item?id=23919313

Slightly related:

How to Build Static Checking Systems Using Orders of Magnitude Less Code

https://web.stanford.edu/~mlfbrown/paper.pdf

They use "microgrammars" and found that the technique is surprisingly effective despite not being "correct".


Perl 6 extends the regex system into a full-fledged recursive-descent parser, which lets you do this kind of stuff. But as is pointed out in a sibling comment, a purely regular regex engine won't really do it, since this is a non-regular language.

Of course, some regex engines do let you do this (notably Perl 5), but it requires code of the "just because you can doesn't mean you should" variety.


That's awkward because parsing ASTs can't usually be done with a regular language (for most programming languages), so a language-specific parser would be needed on top anyway.

This stuff is useful, though, and easy to do in lisps and schemes.


Nice graphical explanations! It would be nice to see positive/negative lookahead/lookbehind support, as they're pretty powerful.

The `\K` escape (e.g. Perl flavor), which drops everything matched so far from the reported match, is also a very useful regexp feature absent here (e.g. when testing `grep` commands).


Any suggestions for keeping regexes simple? They often grow to huge lengths, which makes them hard to read and/or debug. It would be nice to keep them modular. Maybe using websites like this is the only way.


In Python or Perl, you can use re.VERBOSE or /x to spread your regex across multiple lines. That helps a little, but the syntax is still pretty obscure.
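For example, a rough sketch in Python (the date pattern is just an illustration):

    import re

    # The same pattern, first written compactly, then spread out with
    # re.VERBOSE, which ignores unescaped whitespace and allows '#' comments.
    date = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

    date_verbose = re.compile(r'''
        (\d{4})   # year
        -
        (\d{2})   # month
        -
        (\d{2})   # day
    ''', re.VERBOSE)

    assert date_verbose.match('2020-07-28')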

I created a new regular language syntax for Oil which looks more like a grammar:

Eggex Example: Recognizing Python Integer Literals http://www.oilshell.org/blog/2019/12/22.html

And the nice thing is that you can modularize and compose patterns (without string concatenation). It is more like AST splicing of regexes. There are details in the article but here's how it looks:

    DecDigit = / [0-9] /
    BinDigit = / [0-1] /
    OctDigit = / [0-7] /
    HexDigit = / [0-9 a-f A-F] /

    DecInt   = / [1-9] ('_'? DecDigit)* | '0'+ ('_'? '0')* /
    BinInt   = / '0' [b B] ('_'? BinDigit)+ /
    OctInt   = / '0' [o O] ('_'? OctDigit)+ /
    HexInt   = / '0' [x X] ('_'? HexDigit)+ /

    Integer  = / DecInt | BinInt | OctInt | HexInt /
To accept 123, 0b10101, 0o755, 0xff but reject non-integers.
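For contrast, here's roughly the same thing in plain Python, composed by string concatenation (a sketch; I haven't checked it against the full CPython grammar):

    import re

    DEC_DIGIT = r'[0-9]'
    BIN_DIGIT = r'[01]'
    OCT_DIGIT = r'[0-7]'
    HEX_DIGIT = r'[0-9a-fA-F]'

    DEC_INT = r'[1-9](?:_?' + DEC_DIGIT + r')*|0+(?:_?0)*'
    BIN_INT = r'0[bB](?:_?' + BIN_DIGIT + r')+'
    OCT_INT = r'0[oO](?:_?' + OCT_DIGIT + r')+'
    HEX_INT = r'0[xX](?:_?' + HEX_DIGIT + r')+'

    INTEGER = re.compile(
        r'^(?:' + '|'.join([DEC_INT, BIN_INT, OCT_INT, HEX_INT]) + r')$')

    assert INTEGER.match('1_000')
    assert INTEGER.match('0b10101')
    assert not INTEGER.match('0b2')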


Note that when you use grep, awk, sed, libc regexec(), RE2 [1], or rust/regex, this isn't how your regexes are executed.

They use automata-based engines whereas this is a backtracking engine, which can blow up on small inputs. (Although I will say the C code here is cool, because it gives you a fork() bomb, so it might exhaust your operating system and not just your CPU!)
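(For example, here's the classic pathological case for a backtracking engine, a sketch in Python:)

    import re

    # Nested repetition plus a forced failure at the end: a backtracking engine
    # tries exponentially many ways to split the a's between the inner and outer
    # repetitions before giving up, so the runtime explodes with the count.
    pat = re.compile(r'(a+)+$')
    print(pat.match('a' * 25 + 'b'))   # None, eventually -- after a very long wait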

Canonical reference: Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)

https://swtch.com/~rsc/regexp/regexp1.html

Archive since it's down at the moment due to some App Engine stuff (!): https://web.archive.org/web/20200624195819/https://swtch.com...

My blog post and sample code that verifies that GNU grep, awk, sed, and libc don't backtrack:

Comments on Eggex and Regular Languages http://www.oilshell.org/blog/2020/07/eggex-theory.html

https://github.com/oilshell/blog-code/tree/master/regular-la...

[1] https://github.com/google/re2

----

Also, the production-quality way of compiling regular expressions to C code looks like this [2], which is completely different from what's shown in the article:

https://www.oilshell.org/release/0.8.pre8/source-code.wwz/_d...

The regex is converted to an NFA, then a DFA, then a bunch of switch/goto statements in C that implement the DFA.

This example is huge, but the switch/goto style and the lack of any notion of "threading" are the important points.

Essentially, the instruction pointer is used as the DFA state. It's a very natural use of "goto" (where you don't have to write or debug the gotos yourself).

[2] via http://re2c.org/ (not related to RE2)


It's worth pointing out that the resource

https://swtch.com/~rsc/regexp/regexp1.html

linked above is also by the same author, Russ Cox,

as the reference I included at the end of the blog post:

https://swtch.com/~rsc/regexp/regexp2.html

All of the insights that I presented through this visualization tool are based upon the knowledge found in Russ Cox's articles. Also, the link above on RE2:

https://github.com/google/re2

is a project that was also (started by|heavily contributed to by) Russ Cox. His writings on the topic of regular expressions are absolutely world-class, and I have never found a better resource for understanding the deep low-level theory of how they work in practice.


But Cox would dislike this formulation of regular expressions. His whole point is that Perl-style regexes are not regular languages, and RE2 goes to great lengths to work around this fact.

I'm taking up his arguments with some more traditional terminology that I think may help:

http://www.oilshell.org/blog/2020/07/eggex-theory.html

The entire O'Reilly book "Mastering Regular Expressions" by Friedl is based on the "debugging regexes by backtracking" approach you present here. It talks about "optimizing" regexes by reordering clauses and optimizing the backtracking.

Russ Cox HATES THAT APPROACH and he wrote the entire RE2 project to advocate that it NOT be used! You do not need to optimize, reorder clauses, debug, or trace backtracking when you're using automata-based engines, because they run in guaranteed linear time and constant space (with respect to the length of the input). There is no backtracking.

(I was at Google when he wrote RE2 and used it shortly afterward on big data, and regular languages are indeed better than regexes for such engineering. Also made clear by the HyperScan work.)

Since his site is down I can't find the original comments, but here's a twitter exchange from 2018 that backs this up:

https://twitter.com/geofflangdale/status/1060313603268501504

https://twitter.com/_rsc/status/1060702201159565313

Also, I believe your article would be improved if you simply take out this sentence:

In computer science theory, this node type should remind you of the difference between DFAs and NFAs.

It is treading close to the confusion that Cox is talking about. So I would just remove it, add a proper explanation, or add a link to the authoritative source.

----

I would love to see a followup that goes through the rest of the article with the Thompson VM and Pike VM approaches! :) I think the whole point is that you don't have to backtrack to implement this VM, and this page confuses that issue.


The goal with this article was never to show the best or most efficient way to match regular expressions, but rather to convey the correct interpretation of what a given regular expression will do when you try to use it. Even discussions of slow and simple backtracking regular expression matchers are a pretty niche topic and likely too much for casual users of regular expressions.

Yes, I was thinking about also allowing it to optionally cross-compile to the more asymptotically efficient matching algorithm, also documented by Russ Cox. Everything takes time though, and I figured it would make sense to put this out there before investing too much time in something that possibly no one would use.


Sure, but automata based engines are also "correct" and used in common tools like grep, awk, and sed.

It would be nice to make that clear in the article.

-----

It's a matter of preference, but to me debugging by backtracking is a pointless exercise. It's easier to reason about regexes as sets of strings, composed by sequence and alternation.

Write unit tests that enumerate what you accept and reject, and build them up by composition. There's no need for debugging or tracing.
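For instance (a sketch in Python; the pattern and test cases are my own):

    import re

    # Build the pattern by composing named pieces, then test it against
    # explicit accept/reject lists instead of tracing any backtracking.
    SCHEME = r'https?'
    HOST = r'[a-z0-9.-]+'
    PORT = r'(?::\d+)?'
    URL = re.compile(r'^' + SCHEME + r'://' + HOST + PORT + r'$')

    ACCEPT = ['http://example.com', 'https://example.com:8080']
    REJECT = ['ftp://example.com', 'https://', 'example.com']

    for s in ACCEPT:
        assert URL.match(s), s
    for s in REJECT:
        assert not URL.match(s), s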

I understand that the imperative style is more natural to many programmers, but I don't think it requires too much of a leap of thinking. And once you get used to the other style, it saves effort overall (as well as bounding your worst case running time).


Agreed on the point that there are better mental models out there. The difficulty is to make content that is simultaneously correct, but also not over-complicated. I finished off the post with a reference to Russ Cox's work so any sufficiently motivated individual will find the right models there.


How does the "sets of strings" model express the difference between:

   /x*/
and

   /x*?/


Update: I tested it out a little more here.

https://github.com/oilshell/blog-code/blob/master/regular-la...

Basically everything you want with regard to greedy-nongreedy can be done through automata-based engines.

They jump through a lot of hoops to make them work!

However, I would also argue that when you're reaching for non-greedy repetition, you're doing it as a performance hack in a Perl-style regex engine. In an automata-based engine, you can just write

    /x*/
    /(x*)/
and it's what you want, and it runs fast.


Good question, and here's something I wrote a couple weeks ago to the author of the Regex Lingua Franca paper [1]:

I admit I occasionally use non-greedy repetition in Python, but I think that is the only construct I use that's not in the shell-ish engines. I believe it's usually pretty easy to rewrite without non-greedy, but I haven't tested that theory.

----

OK so I started testing that theory. I collected all the times I used

    .*?
https://github.com/oilshell/blog-code/blob/master/regular-la...

With a cursory inspection, most of them are unnecessary in an automata-based "regular language" engine.

They are "nice to have" in a Perl-style regex engine for PERFORMANCE. But automata-based engines don't have that performance problem. They don't backtrack.

The distinction between .* and .*? doesn't affect whether it matches or not, by definition. It does affect how much it backtracks. (And it affects where the submatches are, which I need to test out.)
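For example (a toy case in Python): both patterns match, only the submatch spans differ:

    import re

    s = 'xxx'
    print(re.match(r'(x*)', s).group(1))    # 'xxx' -- greedy takes as much as possible
    print(re.match(r'(x*?)', s).group(1))   # ''    -- non-greedy takes as little as possible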

So I think most of my own usages ARE "nice to have" in Python, but unnecessary in awk because it uses an automata-based engine.

But I'd like to look at it more carefully, and I'm interested in answering this question more thoroughly (feel free to contact me about it, e-mail in profile).

I started a demo here. Still needs work.

https://github.com/oilshell/blog-code/blob/master/regular-la...

----

So here's another conjecture: regexes are bad because they force you to make UNNECESSARY DISTINCTIONS in the name of performance.

You have to care about greedy vs. nongreedy, because there is backtracking.

In contrast, with regular languages, you just write what you mean, and they do it quickly for you. You can't write a slow expression for a regular language.

There are some caveats around that, like submatch extraction with nongreedy repetition, which I will test out. If anyone has specific examples or use cases, send them my way.

I can't see where I really rely on nongreedy submatches, but I'll have to look more closely. I think the nongreedy use case is usually just "skip quickly until this other string appears", e.g. the one in pgen2/tokenize.py that I didn't write.

----

As far as I remember, submatch extraction works fine in automata-based engines, and Cox's articles are in large part about the (then) little-known algorithms for doing it. re2c also has a 2017 paper on their algorithm for doing it in DFAs:

http://re2c.org/2017_trofimovich_tagged_deterministic_finite...

----

[1] thread: https://news.ycombinator.com/item?id=23687478


I don't agree it's primarily about performance. /".+"/ will find a quoted string, but you will certainly prefer the non-greedy match instead.

Now you may torture the greedy regex to make it work (/"[^"]+"/), but the point stands. "Did it match" is rarely enough; the regex author usually wishes to know where it matched, and to exercise control over which match is preferred if there are multiple possibilities. It's indeed possible to do this with DFAs but the theory is comparatively hard, as a cursory reading of your linked paper will show!
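For example (a quick sketch in Python):

    import re

    s = 'say "foo" and "bar"'
    print(re.search(r'".+"', s).group())     # '"foo" and "bar"' -- greedy overshoots
    print(re.search(r'".+?"', s).group())    # '"foo"'           -- non-greedy stops at the first closing quote
    print(re.search(r'"[^"]+"', s).group())  # '"foo"'           -- the "tortured" greedy rewrite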


Yeah that is a good example, and I recorded it here:

https://github.com/oilshell/blog-code/blob/master/regular-la...

I don't mind the rewrite, but I'm not surprised that some would be inconvenienced by it. (Though that is part of the reason for the Eggex syntax.)

----

On the other side I'll offer that I wrote a lexer for bash and all its sublanguages without any nongreedy repetitions. It's dozens if not hundreds of regexes.

https://github.com/oilshell/oil/blob/master/frontend/lexer_d...

Though regular languages are more like a building block for the lexer than the lexer itself. As mentioned elsewhere, they're too limited to express the entire lexer. They're too limited to express syntax highlighters (but so are regexes! They also need some logic outside the model, e.g. for lexing shell here-docs).

-----

So I pose the question of whether there's a more powerful abstraction that's still linear time and constant space:

http://www.oilshell.org/blog/2020/07/ideas-questions.html#mo...

Of course there is, but it's a matter of picking one that will address a lot of use cases effectively.

I think re2c has grown something like

    .*?
for example. grep needs something like that too. Often you want to be able to skip input in a linear fashion, without invoking the pattern matching logic at all.

Regexes are OK if you know how to use them, but evidently most people have lots of problems with them (the necessity of the OP's demo being an example of that). I think there's something better out there built on top of regular languages.


You hit upon an important cleavage: features vs. worst-case performance. A generous lexer supports partial input, syntax highlighting, helpful error messages, etc., while a strict one is not so nice but guarantees protection against malicious input. And the ways you write these are very different! I also would like to write one lexer that can operate in both modes, but with one syntax.

BTW, high five from one shell author to another :)


Yeah I've been looking at fish. Definitely has the cleanest code out of any shell I've looked at :) (e.g. bash, dash, zsh, mksh)

And I came to realize what a huge effort it is on top of the language:

http://www.oilshell.org/blog/2020/02/recap.html#fish-oil-is-...

Also I wonder if you knew of the existence of ble.sh, a fish-like UI written in pure bash! I think it's the biggest shell program in the world:

http://www.oilshell.org/cross-ref.html?tag=ble.sh#ble.sh

https://github.com/oilshell/oil/wiki/The-Biggest-Shell-Progr...

----

I did add a little hack in the lexer for autocompletion of partial input. But my hope is to keep most of that logic in the parser and not the lexer. Though it is true I was thinking of breaking up the exact regex we talked about for similar reasons, e.g. a single token:

    ' [^']* '
vs. 3 tokens

    '
    [^']*
    '

I think the latter could be better for completion of paths in quotes, which bash does, but I haven't revisited it yet.


RE2's source code is interesting; it seems to switch between simulating an NFA or a DFA depending on which is faster. However, to my understanding, don't you first need to make an NFA and use subset construction to convert it to a DFA?


With RE2 in particular, the DFA is almost always faster. The most common case where an NFA is used is when the DFA can't handle the request. Typically, that occurs when you need the offsets of submatches. But in some circumstances, if the regex would cause the DFA to grow exponentially, then RE2 will dynamically fall back to the NFA, since it's slightly faster.

Keep in mind that RE2 doesn't really have a true DFA. It has a hybrid NFA/DFA or a "lazy DFA." Namely, the DFA is computed at match time through incremental subset construction. The DFA states are cached in memory, so most bytes that are processed are processed as if it were a bog-standard table based DFA.
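Here's a rough sketch of the idea in Python (not RE2's actual code, and with epsilon transitions left out for brevity): DFA transitions are built from sets of NFA states only when first needed, then cached.

    def lazy_dfa_accepts(nfa, start, accepting, text):
        # nfa maps (nfa_state, char) -> set of next NFA states.
        cache = {}                    # frozenset of NFA states -> {char: next DFA state}
        current = frozenset([start])
        for ch in text:
            row = cache.setdefault(current, {})
            if ch not in row:         # incremental subset construction, on first use
                nxt = set()
                for s in current:
                    nxt |= nfa.get((s, ch), set())
                row[ch] = frozenset(nxt)
            current = row[ch]
            if not current:           # dead state: no match is possible
                return False
        return bool(current & accepting)

    # Hand-built NFA for (a|b)*abb, states 0..3, accepting state 3:
    nfa = {
        (0, 'a'): {0, 1}, (0, 'b'): {0},
        (1, 'b'): {2},
        (2, 'b'): {3},
    }
    assert lazy_dfa_accepts(nfa, 0, {3}, 'babb')
    assert not lazy_dfa_accepts(nfa, 0, {3}, 'abab')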


How do DFAs handle zero-width anchors? For example \b, or ^ and $ in multiline mode?


Not easily. In the special case of one byte look around (which works for the ASCII interpretations of \b, ^ and $), the DFA can be massaged to make it work by doing two things. First is by embedding some look-around state into each DFA state, when appropriate. For example, in the case of \b, a DFA state is marked as "coming from a word byte" if its incoming transition is a word byte and not marked otherwise. Then, when computing the epsilon closure during subset construction, you use the look-around state to compute whether an epsilon transition is something you're allowed to take or not. e.g., a \b is an epsilon transition since it doesn't consume any input, so you only take it if the state you're in and the byte currently under inspection satisfy the \b criteria.

The second aspect of this is that you have to delay your match by one byte, in order to account for \b and $ at the end of the input. This actually requires adding a special "end of text" sentinel to your DFA's alphabet, but is otherwise pretty easy to handle.

Look-around can actually be supported in the general case using finite automata, although I believe it requires something more powerful than a bare DFA. This paper has the details on that, although it is a difficult read: https://re2c.org/2017_trofimovich_tagged_deterministic_finit...

Of course, these techniques can greatly balloon the size of the DFA. The special case of one-byte look-ahead is generally tolerable, but anything more than that can get pretty bad.

A further pain point is that in Rust's regex engine, \b is Unicode aware by default, and the DFA doesn't support that. The tagged DFA paper makes it clear that such things are possible, but it's complex and it's not clear how effective it would be. Especially since a Unicode aware \w is quite large. So Rust's regex crate has a special optimization where it treats a Unicode aware \b as an ASCII \b if and only if all of its input is ASCII. (Of course, it doesn't scan the input ahead of time to determine this. Instead, it constructs the DFA to quit if it observes any non-ASCII byte. Then it falls back to the slower PikeVM or bounded backtracking engines.)

Apologies for the late reply. :-( I wish HN sent email notifications for replies.


neato bandito



