
Regex combinators are a much better solution to that problem, for which I gave various arguments here[1]:

- You don't need to remember which regex syntax the library is using. Is it the Emacs one? The Perl one? The JavaScript one? That new "real language" one?

- It's "self-documenting". Your combinators are just functions, so you just expose them and give them type signatures, and the usual documentation/autocompletion/whatever tooling works.

- It composes better. You don't have to mash strings together to compose your regex, you can name intermediary regexes with normal variables, etc.

- Related to the point above: No string quoting hell.

- You stay in your home language. No sublanguage involved, just function calls.

- Capturing is much cleaner. You don't need to conflate "parentheses for capture" and "parentheses for grouping" (since you can use the host language's parens).

[1]: https://news.ycombinator.com/item?id=12293687
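To make the points above concrete, here is a minimal sketch in Python. The helper names (`lit`, `seq`, `alt`, `many1`, `capture`) are hypothetical and not taken from any particular library; they simply build a regex source string for the standard `re` module.

```python
import re

# Hypothetical combinator helpers (illustrative names, not a real library).
# Each returns regex source wrapped in a non-capturing group so that the
# pieces compose without precedence surprises.

def lit(text):
    """Match text literally; escaping happens once, here."""
    return "(?:" + re.escape(text) + ")"

def seq(*parts):
    """Match each part in order."""
    return "(?:" + "".join(parts) + ")"

def alt(*parts):
    """Match any one of the parts."""
    return "(?:" + "|".join(parts) + ")"

def many1(part):
    """Match part one or more times."""
    return "(?:" + part + ")+"

def capture(name, part):
    """Capture part under a name; plain grouping never captures."""
    return "(?P<" + name + ">" + part + ")"

# Intermediate regexes are ordinary variables.
number = many1("[0-9]")
amount = capture("amount", seq(number, lit("."), number))
price = seq(alt(lit("$"), lit("EUR ")), amount)

m = re.fullmatch(price, "$12.50")
print(m.group("amount"))  # prints 12.50
```

Grouping comes from the function calls themselves, so the only capturing group is the one explicitly requested with `capture`.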




Personally, I use regexes all the time. Every day. I'm constantly typing regexes into emacs and vim to do text manipulation. Sometimes a quick :%s/\(foo\|bar\)d/baz\1/g, other times a :%v/function /d. I can't imagine what it would be like if I had to write out "begin with either of (number, letter, ..." I'd never get anything done.

For programming it could be nice. It's an interesting idea. But I've found that good highlighting solves most of the problem. When I type "\\(" into emacs, it immediately highlights that \\( in a different color, along with \\| and \\). That way you can differentiate instantly between capture groups vs matching literal parens. Here's an example: http://i.imgur.com/b417O2o.png That small snippet would become way, way longer if you use a regex combinator. Maybe good, maybe bad, but it's hard to read a lot of code.

Yeah, it's ugly. And there are a bazillion small variations between regex engines. But for raw productivity, it seems hard to beat.


I, on the other hand, have to build a regex a few times a year. Every darn time I have to look up the difference between .* and .+, and how to do an "or", and why I get matches when I group a set of characters.

So SRL sounds like a great idea: us mortals can use it to painfully stitch together our poor regexes, and when we get good enough we can skip SRL and become fully fledged regex gods.

Off topic, if SRL is a regex compiler, I'd love a regex decompiler. To be able to turn an impossible jumble of weird characters into a structured description of what it does would be hugely useful.
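As a rough illustration of what a "decompiler" could start from, Python's standard `re` module has a documented `re.DEBUG` flag that prints the parsed structure of a pattern. The wrapper below (`explain` is a hypothetical name) only captures that printout; the exact output format differs between Python versions, and it is a parse tree rather than a friendly description.

```python
import contextlib
import io
import re

def explain(pattern):
    """Return the debug dump that re.DEBUG prints for pattern, as a string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, re.DEBUG)
    return buffer.getvalue()

# Shows the nesting of repeats, branches and literals for a jumble of characters.
print(explain(r"(foo|bar)+\d"))
```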


RegexBuddy has a regex explainer, debugger, builder and tester. It works very well and supports exporting code to multiple languages and dialects.


Also works on Linux (WINE), and can simulate different regex flavors.

I can also recommend the web-based regex101.com, as it also supports non-JS regex flavors and explains things quite well.


I think you misunderstand what "regex combinators" means. In particular, SRL is not regex combinators. TAForObvReasons in another answer posted this[1] that would explain it to you.

[1]: https://groups.csail.mit.edu/mac/users/gjs/6.945/psets/ps01/....

About your example ... on the contrary, it would become way cleaner with regex combinators. The string manipulation would be replaced with proper composition and the escaping madness would disappear. For example

  (concat foo "\\|" bar "\\|" baz)
becomes

  (alt foo bar baz)
You seem to be familiar with elisp. The Lisp family is particularly well suited to the combinator approach; please read the link above. :)


If you like regex combinators, check out the elisp "rx" library. You can end up writing code sort of similar to the article's, with less emphasis on using English grammar: http://i.imgur.com/iD34MqC.png

Combinators are very convenient and precise, but the tradeoff is that the code is longer. And I have to look up what to write every time I want to write one. But that's a personal bias.

Thanks for the reference. I'll study it.

EDIT: One of the good ideas in SRL is "if not followed by". There are too many [not]ahead-[not]behind combinations to warrant special syntax for each of them. I wonder if it could be streamlined, though?
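One possible way to streamline this, sketched in Python: a single helper taking a direction and a negation flag, instead of four separate bits of syntax. The names `look`, `not_followed_by` and `not_preceded_by` are made up for this sketch. Note that Python's `re` requires lookbehind patterns to have a fixed width.

```python
import re

def look(part, ahead=True, negate=False):
    """Build one of the four lookaround assertions from two flags."""
    direction = "" if ahead else "<"
    sign = "!" if negate else "="
    return "(?" + direction + sign + part + ")"

def not_followed_by(part):
    return look(part, ahead=True, negate=True)

def not_preceded_by(part):
    return look(part, ahead=False, negate=True)

# "foo" if not followed by "bar"
pattern = re.escape("foo") + not_followed_by(re.escape("bar"))
print(re.findall(pattern, "foobar foobaz"))  # prints ['foo']
```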


...wait a minute, that sounds a lot like SRE.

If you don't know, SRE is a DSL in Scheme that is essentially an alternate syntax for regex that does this, originally implemented by SCSH, and now most popularly by irregex. It looks like this:

  (w/nocase (: (=> name (+ (or alnum ("._%+-")))) "@" 
               (=> domain (: (* (or alnum (".-"))) "."
                             (>= 2 alpha))) 
               eos))
I don't know if that's exactly what you were talking about: It's an alternate syntax, not a set of functions, which is what combinators usually imply.


I find that most of the problems are resolved with interstitial comments a la Perl /x or CoffeeScript heregex. Merely splitting the regular expression into smaller blocks with a short description of what to match makes it easy for people to verify the regular expression.
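For comparison, here is the same idea in Python, whose `re.VERBOSE` flag plays the role of Perl's /x. The pattern (a simple signed decimal number) is just an example chosen for this sketch.

```python
import re

FLOAT = re.compile(r"""
    ^
    [-+]?      # optional sign
    [0-9]*     # optional integer part
    \.?        # optional decimal point
    [0-9]+     # at least one digit
    $
""", re.VERBOSE)

print(bool(FLOAT.match("-3.14")))  # prints True
print(bool(FLOAT.match("abc")))    # prints False
```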


I do most of my text manipulation in sublime text using replace. I like the interactivity of it compared to sed/awk, and the easy undo is helpful as well (try, back, edit, try, back, ..).


There's an MIT CS assignment that goes along with exactly what you are saying: "Regexp Disaster" by Gerald Sussman. It's a good read for those that want to learn more about the combinator approach.

https://groups.csail.mit.edu/mac/users/gjs/6.945/psets/ps01/...


> You don't need to remember which regex syntax the library is using.

Only for the time being, while only one regex combinator API/implementation exists for your language.


Unlike a regex string, using incorrect syntax with a parser combinator generally results in a syntax error. Regex parsers generally treat "unknown" characters as characters to match, hiding bugs.
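A small Python illustration of that difference. The `repeat` helper is hypothetical, and since Python is dynamic the combinator mistake shows up as a runtime `TypeError` rather than a compile-time syntax error, but it is still loud instead of silent.

```python
import re

# Intended: "a" repeated exactly twice, but the closing brace is missing.
# Python's re accepts this and treats "{2" as literal characters.
typo = re.compile(r"a{2")
print(typo.fullmatch("aa"))         # prints None
print(bool(typo.fullmatch("a{2")))  # prints True

def repeat(part, count):
    """Hypothetical combinator: match part exactly count times."""
    return "(?:" + part + "){" + str(count) + "}"

print(bool(re.fullmatch(repeat("a", 2), "aa")))  # prints True

# The equivalent mistake with the combinator fails immediately.
try:
    repeat("a")  # forgot the count
except TypeError as err:
    print("rejected:", err)
```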


Perl 6 regexes are parsed at compile-time (i.e. they're code just like any other code) so using incorrect regex syntax, eg unknown characters, results in a compile-time syntax error.


A regex language could fix it by some clear prefix convention or whatever for all operators. Though more verbose, it would still be less verbose than alternatives.

Suppose all regex operator characters have to be backslashed (and \\ stands for a single \). Then it's clear. No backslash means it's literal; otherwise it's an operator (and a backslash on a nonexistent operator is a parse error).
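Here is a rough Python sketch of how such a convention could be checked, by translating it to ordinary `re` syntax. The operator set and the `translate` function are assumptions made for this sketch, not an existing tool.

```python
import re

# Assumed operator set for the sketch; a real design would choose its own.
OPERATORS = set("^$.*+?|()[]-")

def translate(src):
    """Translate the hypothetical all-operators-backslashed syntax to re syntax."""
    out = []
    i = 0
    while i < len(src):
        ch = src[i]
        if ch != "\\":
            out.append(re.escape(ch))    # bare characters are always literal
            i += 1
            continue
        if i + 1 >= len(src):
            raise ValueError("dangling backslash at end of pattern")
        op = src[i + 1]
        if op == "\\":
            out.append(re.escape("\\"))  # \\ stands for a single literal backslash
        elif op in OPERATORS:
            out.append(op)               # backslash + known character is an operator
        else:
            raise ValueError("unknown operator \\" + op)  # parse error
        i += 2
    return "".join(out)

pattern = translate(r"\^\[-+\]\?\[0\-9\]\+\$")
print(bool(re.fullmatch(pattern, "-42")))  # prints True
```

An unknown operator such as `\q` raises `ValueError` here instead of silently matching a literal `q`.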

The ambiguities exist because regex aficionados want common operators to be just one character long.


That's about as classic a case of 'cure worse than the disease' as I've ever heard.

How readable is this:

`\^\[-+\]\?\[0\-9\]\*\.\?\[0\-9\]\+\$`


Perfectly, if your name is Donald Knuth.


A really annoying "feature" of regexes in POSIX is that known regex meta-characters are treated as ordinary characters when they appear in a context that doesn't recognize them.

Some examples are a superfluous closing parenthesis or a [ bracket in a character class.

A sane regex syntax treats these characters as a separate lexical category from literal characters, in all contexts, and only produces literals out of them when they are escaped.


And usually not at compile time.


Here is an example of an "unusual" case where the compiler can detect errors in statically known regexes.

This fails at compile time with the error: Opening paren has no matching closing paren. at position 0 in string "(ab"

    (defun scan-ab (s)
      (ppcre:scan "(ab" s))
By the way, here is the alternative syntax:

    (defun scan-ab (s)
      (ppcre:scan '(:register "ab") s))
The alternative syntax allows embedding string regexes:

    (defun wrap-regex (regex)
      (typecase regex
        (string `(:regex ,regex))
        (t regex)))
This is useful for combining regexes:

    (defun exactly-some (&rest choices)
      (let ((choices (mapcar #'wrap-regex choices)))
        `(:sequence :start-anchor
                    ,@(if (rest choices)
                          `((:alternation ,@choices))
                          choices)
                    :end-anchor)))

    (exactly-some "t.*")
    (:SEQUENCE :START-ANCHOR (:REGEX "t.*") :END-ANCHOR)

    (exactly-some "t.*" "a.a" ':whitespace-char-class)
    (:SEQUENCE :START-ANCHOR
               (:ALTERNATION (:REGEX "t.*")
                             (:REGEX "a.a")
                             :WHITESPACE-CHAR-CLASS)
               :END-ANCHOR)


Sure, but: 1. It's pretty easy to expose a similar API, since those are all functions. 2. The documentation tools for the language naturally document the combinators.

Even if the actual combinators are different, it's still a much better situation than the bazillion regex syntaxes with completely different quoting and escaping mechanisms.


True; you will never have to wonder whether alternatives(x, y, z) has to be \alternatives. :)


https://github.com/pygy/compose-regexp.js

^^^ Regexp combinators in JS. 614 bytes minimized and gzipped.

Once you're proficient with the standard RegExp syntax, you can write pretty complex ones without external tools, but when the time comes to debug and modify them, using combinators makes the task much simpler.



