Regex combinators are a much better solution to that problem, for which I gave v...

sillysaurus3 · on Aug 30, 2016

Personally, I use regexes all the time. Every day. I'm constantly typing regexes into emacs and vim to do text manipulation. Sometimes a quick :%s/$foo\|bar$d/baz\1/g, other times a :%v/function /d. I can't imagine what it would be like if I had to write out "begin with either of (number, letter, ..." I'd never get anything done.

For programming it could be nice. It's an interesting idea. But I've found that good highlighting solves most of the problem. When I type "\$" into emacs, it immediately highlights that \\( in a different color, along with \\| and \$. That way you can differentiate instantly between capture groups vs matching literal parens. Here's an example: http://i.imgur.com/b417O2o.png That small snippet would become way, way longer if you use a regex combinator. Maybe good, maybe bad, but it's hard to read a lot of code.

Yeah, it's ugly. And there are a bazillion small variations between regex engines. But for raw productivity, it seems hard to beat.

foxylad · on Aug 30, 2016

I, on the other hand, have to build a regex a few times a year. Every darn time I have to look up the difference between .* and .+, and how to do an "or", and why I get matches when I group a set of characters.

So SRL sounds like a great idea: us mortals can use it to painfully stitch together our poor regexes, and when we get good enough we can skip SRL and become fully fledged regex gods.

Off topic, if SRL is a regex compiler, I'd love a regex decompiler. To be able to turn an impossible jumble of weird characters into a structured description of what it does would be hugely useful.
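As a rough illustration of what a "decompiler" could look like, Python's `re` module can print the parse tree of a pattern when compiled with the `re.DEBUG` flag. The helper name `explain` below is made up for this sketch, and the exact text of the dump varies between Python versions, so treat it as a starting point rather than a finished tool.

```python
import contextlib
import io
import re

def explain(pattern):
    """Return the parse tree that `re` prints when compiling with re.DEBUG.

    `explain` is a hypothetical helper name; only re.DEBUG itself is
    part of the standard library.
    """
    buf = io.StringIO()
    # re.DEBUG writes its dump to stdout, so capture it into a string.
    with contextlib.redirect_stdout(buf):
        re.compile(pattern, re.DEBUG)
    return buf.getvalue()

# Shows one node per line: a literal, then a repeat wrapping another literal.
print(explain(r"ab+"))
```

The dump is structural rather than English prose, but it does make grouping and repetition explicit.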

tokenizerrr · on Aug 30, 2016

RegexBuddy has a regex explainer, debugger, builder and tester. It works very well and supports exporting code to multiple languages and dialects.

jval43 · on Aug 30, 2016

Also works on Linux (WINE), and can simulate different regex flavors.

I can also recommend the web-based regex101.com, as it also supports non-JS regex flavors and explains things quite well.

Drup · on Aug 30, 2016

I think you misunderstand what "regex combinators" means. In particular, SRL is not regex combinators. TAForObvReasons in another answer posted this[1] that would explain it to you.

[1]: https://groups.csail.mit.edu/mac/users/gjs/6.945/psets/ps01/....

About your example ... on the contrary, it would become way cleaner with regex combinators. The string manipulation would be replaced with proper composition and the escaping madness would disappear. For example

  (concat foo "\\|" bar "\\|" baz)

becomes

  (alt foo bar baz)

You seem to be familiar with elisp. The Lisp family is particularly well suited to the combinator approach; please read the link above. :)
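For readers who don't use a Lisp, here is a minimal sketch of the same idea in Python. The function names (`lit`, `seq`, `alt`, `many1`) are invented for this example and do not come from any particular library; the point is only that escaping lives in one place and composition replaces string concatenation.

```python
import re

def lit(text):
    """Match text literally; escaping happens here and nowhere else."""
    return re.escape(text)

def seq(*parts):
    """Match each part in order, grouping each so precedence is safe."""
    return "".join(f"(?:{p})" for p in parts)

def alt(*choices):
    """Match any one of the choices."""
    return "(?:" + "|".join(choices) + ")"

def many1(part):
    """Match one or more repetitions of part."""
    return f"(?:{part})+"

# Hypothetical sub-patterns standing in for foo, bar and baz.
foo = lit("a.b")                      # the dot is a literal dot here
bar = lit("c|d")                      # the pipe is a literal pipe here
baz = seq(lit("#"), many1("[0-9]"))   # '#' followed by digits

pattern = re.compile(alt(foo, bar, baz))
```

With these definitions, `pattern.fullmatch("c|d")` matches while `pattern.fullmatch("c")` does not, and nobody had to count backslashes.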

sillysaurus3 · on Aug 30, 2016

If you like regex combinators, check out the elisp "rx" library. You can end up writing code sort of similar to the article's, with less emphasis on using English grammar: http://i.imgur.com/iD34MqC.png

Combinators are very convenient and precise, but the tradeoff is that the code is longer. And I have to look up what to write every time I want to write one. But that's a personal bias.

Thanks for the reference. I'll study it.

EDIT: One of the good ideas in SRL is "if not followed by". There are too many [not]ahead-[not]behind combinations to warrant special syntax for each of them. I wonder if it could be streamlined, though?
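One possible way to streamline it, sketched in Python: a single function with two flags instead of four separate syntaxes. The name `look` and its keyword arguments are made up for this sketch; the four strings it produces are the standard lookaround forms that Python's `re` accepts.

```python
import re

def look(pattern, *, behind=False, negate=False):
    """Build a zero-width assertion from two flags.

    behind=False, negate=False -> (?=...)   followed by
    behind=False, negate=True  -> (?!...)   not followed by
    behind=True,  negate=False -> (?<=...)  preceded by
    behind=True,  negate=True  -> (?<!...)  not preceded by
    """
    prefix = "(?" + ("<" if behind else "") + ("!" if negate else "=")
    return prefix + pattern + ")"

# "foo" if not followed by "bar"
not_followed = re.compile("foo" + look("bar", negate=True))
```

Here `not_followed.search("foobaz")` finds a match and `not_followed.search("foobar")` does not.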

qwertyuiop924 · on Aug 30, 2016

...wait a minute, that sounds a lot like SRE.

If you don't know, SRE is a DSL in Scheme that is essentially an alternate syntax for regex that does this, originally implemented by SCSH, and now most popularly by irregex. It looks like this:

  (w/nocase (: (=> name (+ (or alnum ("._%+-")))) "@" 
               (=> domain (: (* (or alnum (".-"))) "."
                             (>= 2 alpha))) 
               eos))

I don't know if that's exactly what you were talking about: It's an alternate syntax, not a set of functions, which is what combinators usually imply.

TAForObvReasons · on Aug 30, 2016

I find that most of the problems are resolved with interstitial comments a la perl /x or coffeescript heregex. Merely splitting the regular expression into smaller blocks with a short description of what to match makes it easy for people to verify the regular expression.
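Python has an analogous flag, `re.VERBOSE`, so the same style can be sketched there. The pattern below (an optionally signed decimal number) is just an example chosen for illustration.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and treats "#" as the
# start of a comment, so each piece can carry its own description.
NUMBER = re.compile(r"""
    ^
    [-+]?        # optional sign
    [0-9]*       # integer part, may be empty
    \.?          # optional decimal point
    [0-9]+       # at least one digit
    $
""", re.VERBOSE)
```

This accepts "-3.14", "42" and ".5", and rejects "1." and "abc".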

rplnt · on Aug 30, 2016

I do most of my text manipulation in sublime text using replace. I like the interactivity of it compared to sed/awk, and the easy undo is helpful as well (try, back, edit, try, back, ..).

davexunit · on Aug 29, 2016

There's an MIT CS assignment that goes along with exactly what you are saying: "Regexp Disaster" by Gerald Sussman. It's a good read for those that want to learn more about the combinator approach.

https://groups.csail.mit.edu/mac/users/gjs/6.945/psets/ps01/...

kazinator · on Aug 29, 2016

> You don't need to remember which regex syntax the library is using.

Only for the time being, while only one regex combinator API/implementation exists for your language.

colanderman · on Aug 29, 2016

Unlike a regex string, using incorrect syntax with a parser combinator generally results in a syntax error. Regex parsers generally treat "unknown" characters as characters to match, hiding bugs.

raiph · on Aug 30, 2016

Perl 6 regexes are parsed at compile-time (i.e. they're code just like any other code) so using incorrect regex syntax, e.g. unknown characters, results in a compile-time syntax error.

kazinator · on Aug 30, 2016

A regex language could fix it by some clear prefix convention or whatever for all operators. Though more verbose, it would still be less verbose than alternatives.

Suppose all regex operator characters have to be backslashed (and \\ stands for a single \). Then it's clear. No backslash means it's literal; otherwise it's an operator (and a backslash on a nonexistent operator is a parse error).
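To make the rule concrete, here is a small Python sketch that translates such a syntax into an ordinary `re` pattern. The operator set and the function name `translate` are assumptions made for this example; no existing engine is implied.

```python
import re

# Operators accepted after a backslash in this hypothetical syntax.
OPERATORS = set("^$.*+?|()[]-")

def translate(src):
    """Translate the backslash-every-operator syntax into an `re` pattern."""
    out = []
    i = 0
    while i < len(src):
        ch = src[i]
        if ch != "\\":
            out.append(re.escape(ch))    # bare character: always a literal
            i += 1
            continue
        if i + 1 == len(src):
            raise ValueError("dangling backslash at end of pattern")
        op = src[i + 1]
        if op == "\\":
            out.append(re.escape("\\"))  # \\ is a single literal backslash
        elif op in OPERATORS:
            out.append(op)               # known operator: pass it through
        else:
            # Backslash on a nonexistent operator is a parse error.
            raise ValueError(f"unknown operator \\{op} at position {i}")
        i += 2
    return "".join(out)

# An optionally signed integer, written in the hypothetical syntax.
signed_int = re.compile(translate(r"\^\[-+\]\?\[0\-9\]\+\$"))
```

Under this rule `translate("a.b")` matches only the literal text "a.b", and `translate(r"a\qb")` raises instead of silently matching a "q".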

The ambiguities exist because regex aficionados want common operators to be just one character long.

TylerE · on Aug 30, 2016

That's about as classic a case of 'cure worse than the disease' as I've ever heard.

How readable is this:

`\^\[-+\]\?\[0\-9\]\*\.\?\[0\-9\]\+\$`

kazinator · on Aug 31, 2016

Perfectly, if your name is Donald Knuth.

kazinator · on Aug 30, 2016

A really annoying "feature" of regexes in POSIX is that known regex meta-characters are treated as ordinary characters when they appear in a context that doesn't recognize them.

Some examples are a superfluous closing parenthesis or a [ bracket in a character class.

A sane regex syntax treats these characters as a separate lexical category from literal characters, in all contexts, and only produces literals out of them when they are escaped.
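The same kind of inconsistency can be seen outside POSIX. As an illustration in a different engine, Python's `re` quietly accepts some stray meta-characters as literals and rejects others. The helper name `accepts` is made up for this sketch.

```python
import re

def accepts(pattern):
    """Return True if `re` compiles the pattern without complaint."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# A stray "]" outside a character class is silently a literal.
stray_bracket = re.fullmatch(r"a]", "a]")

# A "{" that never forms a valid repeat is silently a literal too.
stray_brace = re.fullmatch(r"a{1,2", "a{1,2")

# An unbalanced ")" or an unterminated "[" is rejected, however.
paren_ok = accepts(r"a)")
class_ok = accepts(r"[a")
```

Both `stray_bracket` and `stray_brace` are successful matches, while `paren_ok` and `class_ok` are `False`, so whether a typo becomes an error depends on which character was mistyped.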

1ris · on Aug 30, 2016

And usually not at compile time.

junke · on Aug 30, 2016

Here is an example of an "unusual" case where the compiler can detect errors in statically known regexes.

This fails at compilation with the error: Opening paren has no matching closing paren. at position 0 in string "(ab"

    (defun scan-ab (s)
      (ppcre:scan "(ab" s))

By the way, here is the alternative syntax:

    (defun scan-ab (s)
      (ppcre:scan '(:register "ab") s))

The alternative syntax allows embedding string regexes:

    (defun wrap-regex (regex)
      (typecase regex
        (string `(:regex ,regex))
        (t regex)))

This is useful for combining regexes:

    (defun exactly-some (&rest choices)
      (let ((choices (mapcar #'wrap-regex choices)))
        `(:sequence :start-anchor
                    ,@(if (rest choices)
                          `((:alternation ,@choices))
                          choices)
                    :end-anchor)))

    (exactly-some "t.*")
    (:SEQUENCE :START-ANCHOR (:REGEX "t.*") :END-ANCHOR)

    (exactly-some "t.*" "a.a" ':whitespace-char-class)
    (:SEQUENCE :START-ANCHOR
               (:ALTERNATION (:REGEX "t.*")
                             (:REGEX "a.a")
                             :WHITESPACE-CHAR-CLASS)
               :END-ANCHOR)

Drup · on Aug 29, 2016

Sure, but (1) it's pretty easy to expose a similar API, since those are all functions, and (2) the documentation tools for the language naturally document the combinators.

Even if the actual combinators are different, it's still a much better situation than the bazillion regex syntaxes with completely different quoting and escaping mechanisms.

kazinator · on Aug 29, 2016

True; you will never have to wonder whether alternatives(x, y, z) has to be \alternatives. :)

pygy_ · on Aug 30, 2016

https://github.com/pygy/compose-regexp.js

^^^ Regexp combinators in JS. 614 bytes minimized and gzipped.

Once you're proficient with the standard RegExp syntax, you can write pretty complex ones without external tools, but when the time comes to debug and modify them, using combinators makes the task much, much simpler.