In the "Beautiful Code" book there is a chapter, I think from Rob Pike, which presents a bare-bones regex implementation in C. It doesn't implement alternatives or grouping if I remember correctly, but the implementation is breathtakingly beautiful and not any longer than this one.
EDIT: I see the author links to the article at the beginning of the post. Still, I missed this on my first reading, so I think posting the link here is still worthwhile. Especially because the translation to JS kind of misses the point - the beauty of the Rob's implementation comes from recursion and pointers and JS lacks the latter.
Literally the first sentence in the posted article mentions this:
I stumbled upon an article the other day where Rob Pike implements a rudimentary regular expression engine in c. I converted his code to Javascript and added test specs so that someone can self-guide themselves through the creation of the regex engine. The specs and solution can be found in this GitHub repository.
It's small, but unfortunately due to how ? is implemented with recursive backtracking (look at how matchQuestion() tries both alternatives), has an exponential worst-case runtime. Fortunately, the algorithm to do it in linear time is pretty simple too:
I actually don't think this is the case (I definitely might be wrong). Although the matchQuestion function has the pattern that typically resembles a function of exponential runtime (where a function invokes itself recursively multiple times), there is a slight difference in our scenario. If you look at matchQuestion's invocations of match, you'll notice that on both sides of the OR, the pattern is stripped of two characters (the "_?"). This means that the recursive invocation of match will never invoke matchQuestion a second time, unless there is a second '?', in which case it's entirely appropriate.
unless there is a second '?', in which case it's entirely appropriate.
That's precisely where the exponential behaviour comes from; consider e.g. matching the pattern "a?a?a?aaa" against "aaa". It will try matching "a?" against the first "a", which succeeds, leading to a recursive call to match "a?a?aaa" with "aa". That eventually fails, so it tries matching "a?a?aaa" against "aaa"; and inside those two branches, it also splits into two depending on whether to match the "a?", etc. The result is, for each "a?" and "a" added, the total amount of work involved in matching doubles, so it is exponential.
It have been a few years since I read the code, but it was so beautifully written that I cannot imagine it was screwed up in the meantime: Edi Weitz' CL-PPCRE[1] is a beautiful implementation in CL and highly recommended if one understands one aspect (CL or Perl compatible regular expressions) and wants to learn the other one. IIRC he even discovered some bugs in the original perl implementation while creating this library.
This does not implement grouping "()" or alternatives "|". Hence, looping is only required on individual characters. This is a considerable simplification over full regexp.
Which means that this CAN NOT BE USED TO ACCEPT REGULAR LANGUAGES. The language (a|b)*, any number of as or bs in any order, can not be descibed using that parser.
"considerable simplification" sounds like it's kinda regex. It's not even basic regular expressions/regular languages.
Edit:
The author reimplements code from https://www.cs.princeton.edu/courses/archive/spr09/cos333/be... which acknowledges that the implemented subset does not match all classes but they propose that the classes they can parse, are already useful. The author of the new article makes no such acknowledgement that their implementation just represents a subset of regular languages.
Of course, the regular languages themselves are also just a subset of regexp, which can match more languages due to constructs like references.
Lua is not a language to be used as is (and it isn’t in practice). Simple pattern matches are there for internal needs like constructing package paths, but one can luarocks install any regex implementation at will.
Bad side is that luarocks works fine on Windows only if no build step is involved, otherwise you’re doomed to mess with mingw/msys/msys2 environment that isn’t well-supported by third-parties; often not supported at all. It is not Lua’s fail, but it happens. New complicated build systems like CMake only make things worse since you cannot simply guess flags and gcc .c together anymore.
Edit: this is also true for all languages except maybe perl that includes full mingw system with it. Idk why some package managers do not prebuild windows packages on server-side. Windows actually does a lot to maintain binary compatibility.
I've written a JS regexp parser and engine. It did not fit in 40 lines.
The most obnoxious part is backreferences. The atom \3 is a backreference if the whole regexp contains at least 3 capture groups; otherwise it is an octal (!) escape for char code 3. But you don't know how many capture groups there are until you're done parsing. This is why JS regexp parsers sometimes must make two passes!
Regular expressions are truly elegant. If the regex engine is built in a functional (compositional style), it is even more elegant.
This particular Functional pearl is my favorite, http://sebfisch.github.io/haskell-regexp/regexp-play.pdf
For TLDR on the paper, It slowly builds an abstraction and implementation on regex engine which runs on O(mn) where m is length of the regex and n - length of the text. Then they generalize it to do grouping and even extend it to match context free grammar (using lazy evaluation mostly).
It's nicely coded but leaves out an important part of that algorithm: the regex derivatives never get simplified for comparison, so equal states appear different, so the set of states blows up as you march along the string. If you're OK with an inefficient matcher like that, then a backtracking algorithm probably gives you even simpler code.
Depending on how it’s being counted, I have a regex engine in around 30 lines [1] (the parser is longer). It handles branching, grouping, etc. and it’s run time is proportional to the length of the string being searched (I.e. no infinite loops on certain patterns, etc.).
A bit longer than 40 lines, but back in the day, I was a big fan of the simplicity and clarity of Henry Spencer's regex code: https://github.com/garyhouston/regexp.old
Glad to see that come up in this thread. It was the first clearly explained regex engine for me in _Software Solutions in C_ (Schumacher D., Academic Press, 1994). I still have a copy and the original disk (and disk image). If anyone is interested I could scan that chapter tonight.
Hi, I'm the original author of this article. I just want to say thank you for all the positive and constructive comments. I'm happy to answer questions that anyone has.
I would suggest that since it lacks grouping, like one other commenter pointed out, it's not a regular expression engine, it only implements a useful subset which is not a regular expression language.
For that you need to be able to build an equivalent to the regular language expression
Nice, remind me of a custom Regex engine i built some years ago in JS while trying to build a fast & small lexer, it does not support ? but +, - and sub-rules, it is around 60 LOC if I remind, mostly built by "accident", it work with JSON as input and output a finite state automaton (compiled version ?)
In my experience, it's the edge cases that are a pain when implementing a regex engine. Think about weird regexes like "^*", "$?" (or any quantified zero-width match), anchors that appear in the middle of regexes, nested quantifiers (don't make much sense, but you need to generate good error messages).
Once you add captures and maybe even backreferences, you get a whole new world of weird :-)
I did that with grouping (), alternatives |, and classes [], for Asterisk some decades ago. They had a very special phonenumbering syntax. Less than 100 lines in C.
In lisp it's trivial, less then 10 lines for a simple matcher.
I think it's the same implementation described here: http://www.cs.princeton.edu/courses/archive/spr09/cos333/bea... (not 100% sure as I lost my copy of Beautiful Code :()
EDIT: I see the author links to the article at the beginning of the post. Still, I missed this on my first reading, so I think posting the link here is still worthwhile. Especially because the translation to JS kind of misses the point - the beauty of the Rob's implementation comes from recursion and pointers and JS lacks the latter.