SRL – Simple Regex Language (simple-regex.com)
279 points by maxpert on Aug 29, 2016 | 133 comments



Regex combinators are a much better solution to that problem, and I gave various arguments for them here[1]:

- You don't need to remember which regex syntax the library is using. Is it using the emacs one? The perl one? The javascript one? That new "real language" one?

- It's "self-documenting". Your combinators are just functions, so you just expose them and give them type signatures, and the usual documentation/autocompletion/whatever tooling works.

- It composes better. You don't have to mash strings together to compose your regexes; you can name intermediate regexes with normal variables, etc.

- Related to the point above: No string quoting hell.

- You stay in your home language. No sublanguage involved, just function calls.

- Capturing is much cleaner. You don't need to conflate "parentheses for capture" with "parentheses for grouping" (since you can use the host language's parens).

[1]: https://news.ycombinator.com/item?id=12293687
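To make the idea concrete, here is a minimal sketch of the combinator style in Python. The names lit/seq/alt/many1/group are made up for illustration, not any particular library's API; each combinator is just a function returning a pattern string, so composition, naming and capture all use ordinary language features.

```python
import re

def lit(s):
    """Match the string s literally (no quoting hell)."""
    return re.escape(s)

def seq(*parts):
    """Match the parts one after another."""
    return "".join(f"(?:{p})" for p in parts)

def alt(*parts):
    """Match any one of the parts."""
    return "(?:" + "|".join(parts) + ")"

def many1(part):
    """Match one or more repetitions of part."""
    return f"(?:{part})+"

def group(name, part):
    """Named capture: no 'parens for capture' vs 'parens for grouping' clash."""
    return f"(?P<{name}>{part})"

# Intermediate pieces get normal variable names, no string mashing:
word_char = alt("[0-9]", "[a-z]", lit("."), lit("-"))
email = seq(group("local", many1(word_char)), lit("@"),
            group("domain", many1(word_char)))

m = re.fullmatch(email, "foo@bar.com")
```

Since the combinators are plain functions, the host language's tooling (docstrings, autocompletion) documents them for free.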


Personally, I use regexes all the time. Every day. I'm constantly typing regexes into emacs and vim to do text manipulation. Sometimes a quick :%s/\(foo\|bar\)d/baz\1/g, other times a :%v/function /d. I can't imagine what it would be like if I had to write out "begin with either of (number, letter, ..." I'd never get anything done.

For programming it could be nice. It's an interesting idea. But I've found that good highlighting solves most of the problem. When I type "\\(" into emacs, it immediately highlights that \\( in a different color, along with \\| and \\). That way you can differentiate instantly between capture groups vs matching literal parens. Here's an example: http://i.imgur.com/b417O2o.png That small snippet would become way, way longer if you use a regex combinator. Maybe good, maybe bad, but it's hard to read a lot of code.

Yeah, it's ugly. And there are a bazillion small variations between regex engines. But for raw productivity, it seems hard to beat.


I, on the other hand, have to build a regex a few times a year. Every darn time I have to look up the difference between .* and .+, and how to do an "or", and why I get matches when I group a set of characters.

So SRL sounds like a great idea: us mortals can use it to painfully stitch together our poor regexes, and when we get good enough we can skip SRL and become fully-fledged regex gods.

Off topic, if SRL is a regex compiler, I'd love a regex decompiler. To be able to turn an impossible jumble of weird characters into a structured description of what it does would be hugely useful.
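As a partial answer to that wish: some engines can already dump a structured form of a pattern. Python's re module, for instance, prints its parse tree when a pattern is compiled with the re.DEBUG flag, which amounts to a crude, built-in regex decompiler:

```python
import re

# Compiling with re.DEBUG prints the pattern's parse tree to stdout,
# turning the jumble of characters back into a nested structure.
re.compile(r"(foo|bar)d", re.DEBUG)
# Prints a tree along the lines of:
#   SUBPATTERN 1 0 0
#     BRANCH
#       LITERAL 102
#       ...
```

It's not English, but the nesting makes the structure of alternations and repeats much easier to see.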


RegexBuddy has a regex explainer, debugger, builder and tester. It works very well and supports exporting code to multiple languages and dialects.


Also works on Linux (WINE), and can simulate different regex flavors.

I can also recommend the web-based regex101.com, as it also supports non-JS regex flavors and explains things quite well.


I think you misunderstand what "regex combinators" means. In particular, SRL is not regex combinators. TAForObvReasons posted this[1] in another answer; it explains the difference.

[1]: https://groups.csail.mit.edu/mac/users/gjs/6.945/psets/ps01/....

About your example ... on the contrary, it would become way cleaner with regex combinators. The string manipulation would be replaced with proper composition and the escaping madness would disappear. For example

   (concat foo "\\|" bar "\\|" baz)
becomes

  (alt foo bar baz)
You seem to be familiar with elisp. The lisp family is particularly well suited to the combinator approach; please read the link above. :)


If you like regex combinators, check out the elisp "rx" library. You can end up writing code sort of similar to the article's, with less emphasis on using english grammar: http://i.imgur.com/iD34MqC.png

Combinators are very convenient and precise, but the tradeoff is that the code is longer. And I have to look up what to write every time I want to write one. But that's a personal bias.

Thanks for the reference. I'll study it.

EDIT: One of the good ideas in SRL is "if not followed by". There are too many [not]ahead-[not]behind combinations to warrant special syntax for each of them. I wonder if it could be streamlined, though?


...wait a minute, that sounds a lot like SRE.

If you don't know it, SRE is a DSL in Scheme that is essentially an alternate syntax for regexes, originally implemented by SCSH and now most popularly by irregex. It looks like this:

  (w/nocase (: (=> name (+ (or alnum ("._%+-")))) "@" 
               (=> domain (: (* (or alnum (".-"))) "."
                             (>= 2 alpha))) 
               eos))
I don't know if that's exactly what you were talking about: It's an alternate syntax, not a set of functions, which is what combinators usually imply.


I find that most of these problems are resolved with interstitial comments à la Perl's /x or CoffeeScript's heregex. Merely splitting the regular expression into smaller blocks with a short description of what each matches makes it easy for people to verify the regular expression.


I do most of my text manipulation in sublime text using replace. I like the interactivity of it compared to sed/awk, and the easy undo is helpful as well (try, back, edit, try, back, ..).


There's an MIT CS assignment that goes along with exactly what you're saying: "Regexp Disaster" by Gerald Sussman. It's a good read for those who want to learn more about the combinator approach.

https://groups.csail.mit.edu/mac/users/gjs/6.945/psets/ps01/...


> You don't need to remember which regex syntax the library is using.

Only for the time being, while only one regex combinator API/implementation exists for your language.


Unlike a regex string, using incorrect syntax with a parser combinator generally results in a syntax error. Regex parsers generally treat "unknown" characters as characters to match, hiding bugs.
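Python's re module, for example, shows both behaviors: a structurally malformed pattern fails when the pattern object is built, while an "unknown" character in an operator-like position silently becomes a literal:

```python
import re

# A structural error is caught as soon as the pattern is built...
try:
    re.compile("(ab")
except re.error as e:
    print("rejected:", e)  # e.g. missing ), unterminated subpattern

# ...but a "{" that doesn't form a valid quantifier silently matches a
# literal "{", which can hide a typo such as a missing repeat count.
assert re.fullmatch("ab{", "ab{") is not None
```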


Perl 6 regexes are parsed at compile-time (i.e. they're code just like any other code) so using incorrect regex syntax, eg unknown characters, results in a compile-time syntax error.


A regex language could fix this with a clear prefix convention for all operators. Though more verbose, it would still be less verbose than the alternatives.

Suppose all regex operator characters had to be backslashed (and \\ stood for a single \). Then it's unambiguous: no backslash means a literal; a backslash means an operator (and a backslash on a nonexistent operator is a parse error).

The ambiguities exist because regex aficionados want common operators to be just one character long.


That's about as classic a case of 'cure worse than the disease' as I've ever heard.

How readable is this:

`\^\[-+\]\?\[0\-9\]\*\.\?\[0\-9\]\+\$`


Perfectly, if your name is Donald Knuth.


A really annoying "feature" of POSIX regexes is that known metacharacters are treated as ordinary characters when they appear in a context that doesn't recognize them.

Some examples are a superfluous closing parenthesis, or a [ bracket inside a character class.

A sane regex syntax treats these characters as a separate lexical category from literal characters, in all contexts, and only produces literals out of them when they are escaped.
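Python's re engine, taken here as a representative example, shows exactly this split lexical treatment:

```python
import re

# A stray closing paren is rejected outright...
try:
    re.compile("a)b")
except re.error:
    print("stray ) rejected")

# ...while a "]" with no opening "[" is quietly a literal, as is a "["
# inside a character class.
assert re.fullmatch("a]b", "a]b") is not None
assert re.fullmatch("[[]", "[") is not None
```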


And usually not at compile time.


Here is an example of an "unusual" case where the compiler can detect errors in statically known regexes.

This fails at compilation with the error: Opening paren has no matching closing paren. at position 0 in string "(ab"

    (defun scan-ab (s)
      (ppcre:scan "(ab" s))
By the way, here is the alternative syntax:

    (defun scan-ab (s)
      (ppcre:scan '(:register "ab") s))
The alternative syntax allows embedding string regexes:

    (defun wrap-regex (regex)
      (typecase regex
        (string `(:regex ,regex))
        (t regex)))
This is useful for combining regexes:

    (defun exactly-some (&rest choices)
      (let ((choices (mapcar #'wrap-regex choices)))
        `(:sequence :start-anchor
                    ,@(if (rest choices)
                          `((:alternation ,@choices))
                          choices)
                    :end-anchor)))

    (exactly-some "t.*")
    (:SEQUENCE :START-ANCHOR (:REGEX "t.*") :END-ANCHOR)

    (exactly-some "t.*" "a.a" ':whitespace-char-class)
    (:SEQUENCE :START-ANCHOR
               (:ALTERNATION (:REGEX "t.*")
                             (:REGEX "a.a")
                             :WHITESPACE-CHAR-CLASS)
               :END-ANCHOR)


Sure, but 1) it's pretty easy to expose a similar API, since those are all functions, and 2) the language's documentation tools naturally document the combinators.

Even if the actual combinators are different, it's still a much better situation than the bazillion regex syntaxes with completely different quoting and escaping mechanisms.


True; you will never have to wonder whether alternatives(x, y, z) has to be \alternatives. :)


https://github.com/pygy/compose-regexp.js

^^^ Regexp combinators in JS. 614 bytes minified and gzipped.

Once you're proficient with the standard RegExp syntax, you can write pretty complex ones without external tools, but when the time comes to debug and modify them, using combinators makes the task much, much simpler.


I barely touch regexes these days, because for the last few years I've been using Rebol/Red parse more and more.

Here's a translation of the first SRL example in the parse dialect:

  [
      some [number | letter | symbol]                 
      "@"
      some [number | letter | "-" ]                   
      some ["." copy tld some [number | letter | "-" ]]
      if (parse tld [letter some letter])
  ]
And here's a full matching example:

  number: charset "0123456789"
  letter: charset [#"a" - #"z"]
  symbol: charset "._%+-"
  
  s: {Message me at you@example.com. Business email: business@awesome.email}
  
  parse s [ 
    any [
        copy local some [number | letter | symbol]
        "@" 
        copy domain [
            some [number | letter | "-" ] 
            some ["." copy tld some [number | letter | "-" ]]
        ]   
        if (parse tld [letter some letter])
        (print ["local:" local "domain:" domain])

        | skip
    ]   
  ]
Some parse links:

* http://blog.hostilefork.com/why-rebol-red-parse-cool/

* https://en.wikibooks.org/wiki/REBOL_Programming/Language_Fea...

* http://www.codeconscious.com/rebol/parse-tutorial.html

* http://www.red-lang.org/2013/11/041-introducing-parse.html


Same for me. About a decade ago I started using Rebol for shell scripts, file management and working with web APIs. It's nice to work with a syntax where I can look at an older parse routine and figure out what I was trying to do just by reading it. The only time I use regex is within my code editor.


I don't mind regular expressions myself, but this does look quite nice. Any idea if this has been implemented for anything other than Rebol (which I'd never heard of)?


There was a Parse-inspired project for JavaScript, but I'm not sure to what extent it was developed. There was also a Parse implementation in Topaz[1] (which itself is implemented in JavaScript) but is as of now unfinished.

It'd be difficult to implement as tightly in another language that doesn't have Rebol's (or a somewhat Lisp-like) free-form code-is-data/data-is-code[2] approach. Rebol and its Parse dialect share the same vocabulary and block structure, which amongst other things makes it easy to: insert progress-dependent Rebol snippets within Parse; build Parse blocks dynamically (mid-Parse if needed!); build Rebol code from parsed data; and develop complex grammar rules very similar to EBNF[3]. The article linked above[4] (and now linked again below :) does a good job of fleshing out these ideas and why this may remain a unique feature for some time.

[1]: http://reb4.me/tt

[2]: http://rebol.info/rebolsteps.html

[3]: http://codereview.stackexchange.com/q/87716/46291

[4]: http://blog.hostilefork.com/why-rebol-red-parse-cool/


There are other Rebol-inspired languages that may have the parse dialect, but the only one that I know for certain is Red.

http://www.red-lang.org/


I like how the marquee example shows how to do something you shouldn't be trying to do[1] anyway.

In more general terms, if a regex is complicated enough that something like this seems to make sense, the problem is that your regex is too complicated, and you should fix that.

[1] https://news.ycombinator.com/item?id=12312574


It's my observation that a lot of the problems people have with regexes come from never having learned to ask questions like "Is regex the right way to do this?", "Can I simplify this regex because I trust the input?" and "What do I really need the regex to do?".

To validate a user-supplied email address, you arguably just need something along the lines of ^.*@.*\..*$ to help catch whitespace and a forgotten @ sign. In a lot of cases, if the email address is wrong, all that happens is either a) user login fails or b) user registration fails. Hence there's no need for an unreadably-complex regex to validate it.
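A loose check along those lines, sketched in Python (the helper name is made up):

```python
import re

# Deliberately permissive: a nonempty local part, an "@", and a dot
# somewhere in the domain. It catches stray whitespace and a forgotten
# @ sign, nothing more; real validation means sending the address mail.
LOOSE_EMAIL = re.compile(r"^\S+@\S+\.\S+$")

def plausible_email(addr):
    return bool(LOOSE_EMAIL.match(addr))
```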


> To validate a user-supplied email address, you arguably just need...

See the earlier discussion. It's not nearly so simple.


I think what he's trying to say is that the best test for a valid email is to send an email with a confirmation URL for the user to click.


> In more general terms, if a regex is complicated enough that something like this seems to make sense, the problem is that your regex is too complicated, and you should fix that.

I disagree. There is no such thing as a "complicated regex" in itself; it's all the same to the underlying engine. It's the maintenance of the regex that's the problem. You could, for instance, compile many small, maintainable patterns into a (technically) complicated regex with guarantees that it will evaluate as expected, which is one possible way the outlined SRL approach could work. Don't confuse process with technology.


> it's all the same to the underlying engine

So what you're saying is that as long as I have a PCRE library linked into my brain, there's no problem. Gotcha.


If you're compiling the regex, there's no need to read it. Just understand the compiler. I'm not sure how what you've said relates to my comment.


What do you do when the compiler breaks, or fails to implement something you need?


Good point. With PEG parsers, YACC/LEX clones, and tutorials online, it's not too hard to build a functional (if primitive) Real Parser nowadays.


“And now, you have two problems”.


Consider changing "either of" to "any of".

The word "either" implies only two choices, making your opening example confusing when the first "either of" was really picking from three possibilities.


Also, 'begin' is vague - it could mean start of word, but here it means start of line. Make it 'start of line'.

I like the general approach, although the 2010-era BDD fake natural language is a turn off.


A really good shorthand for "start of line" is "^"

;o)


A really good shorthand for "yes, however it's not the most discoverable thing in the world" is "& ^ ( ^ ^ % & ^"


Is it less discoverable than the proposed alternative though?

In both cases one looks up the documentation; in one I find that searching for the line start requires a regex with "^", and in the other I find something like "begin with" in Simple Regex Language (SRL). I still need to read (or test) to find out what "begin with" means, and I couldn't have guessed it - why not "start with", "open with", "first character", or a myriad of other options?

Whatever suits the user I suppose.


Since code is read more than it is written, how well do you think a colleague without previous knowledge of regexes would understand '^' vs 'start of line'?


I'm not absolutely sure this isn't a joke that got out-of-hand. This is the COBOL of regular expressions. :)

Whilst the conventional regular expression syntax is arguably overly compact, this is just too far in the opposite direction!

Something more PEG-like, or even Perl 6 regex-like, would make for more readable regular expressions whilst not completely throwing out everything we think things mean. Hell, even /x -- ignore whitespace and comments -- can make things much clearer:

    / ^
      [0-9a-z._%+-]+           # The local part. Mailbox/user name. Can't contain ~, amongst other valid characters.
      \@
      [0-9a-z.-]+ \. [a-z]{2,} # The domain name. We've decided a TLD can never contain a digit, apparently.
    $ /x
Tangentially, there's no point validating email addresses with anything more complicated than /@/. If people want to enter an email address that doesn't work, they can and will. If you want to be sure that the address is valid, send it an email!


> Tangentially, there's no point validating email addresses with anything more complicated than /@/

Well, once you have this in a database, maybe you'll give it to some library, and then maybe this poor little library will just paste it into the SMTP conversation. And maybe some user will be clever enough to exploit that.

I believe that if you want to get X from the user, it's always a good idea to make super sure that it actually is X before passing it further.

And I very much agree that whenever possible and necessary, simply splitting regular expression to multiple lines and using comments seems like a superior approach.


The point is not to parse email addresses with regex because an actual email parser is far more appropriate for the task.


Similarly, Python's re.VERBOSE can push up readability in combination with its named-groups feature.

  https://docs.python.org/2/library/re.html#re.VERBOSE
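For instance, here is a sketch of the article's first example using re.VERBOSE and named groups (the group names are my own):

```python
import re

# re.VERBOSE ignores whitespace outside character classes and allows
# inline comments, so the pattern reads top to bottom.
EMAIL = re.compile(r"""
    ^
    (?P<local>  [0-9a-z._%+-]+ )              # mailbox / user name
    @
    (?P<domain> [0-9a-z.-]+ \. [a-z]{2,} )    # domain, including the TLD
    $
""", re.VERBOSE | re.IGNORECASE)

m = EMAIL.match("you@example.com")
```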


My attempt at the Perl 6 version of the example:

    rx:i/^^
    [ <+ alpha + digit + [._%+-] >+ ] ** 2 % '@'
    '.'
    <alpha>** 2..*
    $$/
Notice that this avoids the repetition of a pattern.

Of course, it'd make much more sense to write a grammar:

    grammar Email {
        token TOP { <name> '@' <domain> }
        token name { <valid_char>+ }
        token domain { <valid_char>+ '.' <alpha>** 2..* }
        token valid_char { <alpha> | <digit> | <[._%+-]> }
    }


Pretty good, though <domain> should end with <alpha>** 2..*

Note that your version is not actually the same, as it accepts all alphabetic characters, not just ASCII ones. That puts it much closer to RFC 6532, but still not exactly there, due to the quoting rules in usernames, which get pretty hairy. See [1] for a Perl 5 regex implementing RFC 822 (a simpler version of the modern standard) to validate email.

[1]: https://metacpan.org/source/RJBS/Email-Valid-1.200/lib/Email... trigger warning: bleeding eyes



Yes, I was talking about this. Irregex is much better than RX, nowadays, however. Shinn did a really good job.


We (the Hyperscan team) have spent a lot of time staring at regular expressions over the years (shameless plug: https://github.com/01org/hyperscan).

I think a better format for regex is long overdue, but this isn't it. It's way too verbose (other commenters also noticed the resemblance to COBOL). I'm picturing a Snort/Suricata rule with a regex in this format; you've now doubled the amount of screen real estate per rule.

The real problems with regex readability are (1) the lack of easily grasped structure, so it's almost impossible to spot the level at which a sequence or alternation operates (PCRE's extended format and creative tabbing can help) and (2) the total lack of abstraction - so if you have a favorite character class or subregex you write it approximately a bazillion times.


I'd love to hear your feedback on my approach, which focuses on the kinds of problems you describe in your comment.

http://www.slideshare.net/AurSaraf/re3-modern-regex-syntax-w... https://github.com/sonoflilit/re2


SRE might help. It's pretty nice, if you like parens.


What's missing in regexps, IMO, is composability, so you can build larger patterns out of smaller ones, giving each a clear name. Replacing '[0-9]' with 'digit' doesn't really help much.
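Even short of full combinators, host-language variables recover some of that composability; a sketch in Python (the variable names are mine):

```python
import re

# Name the small pieces once, then compose larger patterns from them.
digit, letter = "[0-9]", "[a-z]"
local_char  = f"(?:{digit}|{letter}|[._%+-])"
domain_char = f"(?:{digit}|{letter}|[.-])"

email = re.compile(f"^{local_char}+@{domain_char}+\\.[a-z]{{2,}}$", re.I)

assert email.match("you@example.com")
```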


Sane middle ground:

  $ txr
  This is the TXR Lisp interactive listener of TXR 147.
  Use the :quit command or type Ctrl-D on empty line to exit.
  1> (regex-parse ".*a(b|c)?")
  (compound (0+ wild) #\a (? (or #\b #\c)))
  2> (regex-compile *1)
  #/.*a[bc]?/


I think SRE is a bit nicer than TXR's regex format, but it's the same idea.


I designed a similar but terser language in 2012:

The examples give the gist: http://chubot.org/annex/cre-examples.html

More justification: http://chubot.org/annex/intro.html

doc index: http://chubot.org/annex/ (incomplete)

I showed it to some coworkers in 2013 and got some pretty good feedback. Then I got distracted by other things. One of the issues is that I learned Perl regex syntax so well by designing this language that I never needed to use it again :)

I plan on coming back to it since I'm writing a shell now, and I can't remember grep -E / sed -r syntax in addition to Perl/Python syntax.

SRL is the same idea, but I think it is way too verbose, which it appears a lot of others agree with.

If anyone is interested in the source code let me know! It was also bootstrapped with a parsing system, which worked well but perhaps wasn't "production quality". So I think I will reimplement CRE with a more traditional parsing implementation (probably write it by hand).


COBORE: common business-oriented regex.

Bonus: COBORE -> (Japanese) kobore -> こぼれ -> 溢れ/零れ ("spillage").

"Overflowing spillage of verbosity."


Kaz is right.

Now if only I made this joke first...


I cannot upvote this enough.


Looks like the COBOL of pattern matching, and frankly I like it.


I have a (mostly abandoned) side project with the same aim. I spent a lot of time thinking about language design, from the perspective of enabling adoption of a new language when an existing one has so much network effect:

https://www.slideshare.net/AurSaraf/re3-modern-regex-syntax-... https://github.com/sonoflilit/re2

If people like my direction, I may continue to work on it.


Maybe this could work as a learning tool? I think it's much better to learn regex as-is, no matter how ugly or terse you may or may not find it. It's pretty universal across languages (with some annoying variations). There are lots of online tools and programs that can help you decode or create regexes, and after a while it's not so hard to read or create them. But it's also worth knowing a more comprehensive parsing tool or parsing techniques so you don't get too ambitious with regex :)


I fully appreciate the problems with regex, but I don't think this is the right approach.

If you have fantastic tooling regex can actually be a pleasure

Unfortunately the best regex helper ever made seems to still be an old Windows app. But wow is it good: https://www.regexbuddy.com

I've seen online tools but they never seem to measure up.


Jan Goyvaerts, the author of this program, also created this great website [0], which has lots of very clear and detailed regex tutorials. I always come back to it when I get confused about regular expressions.

[0] http://www.regular-expressions.info/tutorial.html


Side note on the API of the PHP lib implementing this: could we please stop using the fluent/query-builder pattern with closures?

It's the most disgraceful code style I've seen, and misleading: it makes your mind think something async could be happening. I know Laravel popularized this, along with other ugly patterns, but let's stop cargo-culting it.



I think this is really valuable. Just today I had a non-tech co-worker who needed to understand regex for a tool we were using. I wrote the regex for him and (very briefly) explained it. Now this might be something he can more easily grok, using a translator (plus regex101.com to verify) to create the more complex regexes he might end up needing.


Sometimes it's difficult to reason out an involved regex. I doubt I'd ever use something like this from code but I might use the translator.

Example based on their example.

https://simple-regex.com/build/57bc5eac74c4d


See also:

Is there a specific reason for the poor readability of regular expression syntax design?

http://programmers.stackexchange.com/q/298564/33157


Demonstrative "that" adjective-connective-subjective "seems" infinitive-marker "to" verb-existential "be" verb-passive-gerund "missing" article-definite "the" noun-subject "point".


If you're dissatisfied with the terseness of regular expressions, it's worth looking at SNOBOL4: http://www.snobol4.org/ which has been around for decades.


SNOBOL, AWK, and TXR are all pretty cool: Pick your swiss army knife.


Reminds me of regex parse trees in CL-PPCRE. S-expressions are the simplest language of all! [1]

[1] http://weitz.de/cl-ppcre/#create-scanner2


The author pits his project against POSIX regular expressions, but personally I feel that it's PCRE that rules the day. I find pcre regex significantly less verbose and easier to read.


Hmm, perhaps useful for a teaching tool.

If used as such, it'd be really nice to be able to go the other way - a regex explainer if you will.


That's what I was thinking too. It'd be a great way to get people to understand the power of regex. But yeah, I wouldn't want to learn a verbose language to use regexes as frequently as I do today.


Very nice. Would love to see versions in other programming languages.

I'm very interested in examples that extrapolate this idea to other areas of programming and even math. And also work in the reverse direction.

Most of the examples I've found are old or not open source.

Another example of English to regex: https://people.csail.mit.edu/regina/my_papers/reg13.pdf https://arxiv.org/abs/1608.03000

English to dates https://github.com/neilgupta/sherlock

English to a graph (network representation) https://github.com/incrediblesound/MindGraph

C to English and vice versa http://www.mit.edu/~ocschwar/C_English.html

English to python: http://alumni.media.mit.edu/~hugo/publications/papers/IUI200...

English to database queries http://kueri.me/


So we're replacing a universally understood syntax with a new one that was just invented, and is painfully verbose? I understood what the first regex was doing just fine.

This is a major step up in readability, so it's nice, and you have to invent a new syntax to do that, so I'll chalk that up as unavoidable. But did it have to be so verbose? SCSH/irregex's SRE had similar readability wins, with way less verbosity. You still have to learn a new syntax, though.


Indeed, and following their lead, I created SBFL (Simple Brainfuck Language) since most find the original specification too esoteric.

So instead of writing this Hello World program in Brainfuck:

  ++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.
You can instead have this much more readable version:

  increment byte increment byte increment byte increment byte increment byte
  increment byte increment byte increment byte jump forward if zero
  increment pointer increment byte increment byte increment byte increment byte
  jump forward if zero increment pointer increment byte increment byte
  increment pointer increment byte increment byte increment byte increment pointer
  increment byte increment byte increment byte increment pointer increment byte
  decrement pointer decrement pointer decrement pointer decrement pointer
  decrement byte jump backward if zero increment pointer increment byte
  increment pointer increment byte increment pointer decrement byte
  increment pointer increment pointer increment byte jump forward if zero
  decrement pointer jump backward if zero decrement pointer decrement byte
  jump backward if zero increment pointer increment pointer output byte
  increment pointer decrement byte decrement byte decrement byte output byte
  increment byte increment byte increment byte increment byte increment byte
  increment byte increment byte output byte output byte increment byte
  increment byte increment byte output byte increment pointer increment pointer
  output byte decrement pointer decrement byte output byte decrement pointer
  output byte increment byte increment byte increment byte output byte
  decrement byte decrement byte decrement byte decrement byte decrement byte
  decrement byte output byte decrement byte decrement byte decrement byte
  decrement byte decrement byte decrement byte decrement byte decrement byte
  output byte increment pointer increment pointer increment byte output byte
  increment pointer increment byte increment byte output byte


How about SCL?:

  include header file stdio.h, searching system paths first.

  describe function main that returns a value of type 
  integer, and has argument of type integer argc, and 
  argument of type pointers to pointers to characters argv.

  begin function body

  call function printf with the single argument of type 
  string "hello world" with a newline appended.

  begin new statement.

  return the integer value 0.


> array of pointers to pointers to characters

Wouldn't that be char ***argv? (with three stars)


For some reason, I thought it was:

  char **argv[]
It isn't. I've been away from C too long. Fixed in GP.


An array of anything decays to a pointer though, if I'm not mistaken.


Yeah, I know. I've already fixed it.


I'd be happy to get a translator that goes the other way, to help decode unfamiliar regex. Other than that though, I pretty much agree with you.


Not really translators, but can be quite useful sometimes – https://regex101.com/ and http://regexr.com/


Whenever I need to do anything in regex, I open regex101.

It's such a great resource.


Nice tools. I often use https://regexper.com


Wasn't this a goal of Light Table, or am I up in the night? There are a lot of useful representations I'd like to see for my foo, but very few of them I'd actually want to work in directly when I'm creating or modifying the underlying code representation. I could see this tool as a quick and useful way to load a complex regex meaning in my head like doc text, but after that, I'll just prefer to work with the regex directly instead of the verbose text.


And I think SRE is a good middle ground. Plus, you can use paredit on it.


I bet there are a fair number of people like me who rarely have to use regexes and, because of that, have to half-relearn them each time. A more verbose syntax is easier to work with in that case, because you're not fighting both the syntax and the semantics simultaneously.


While I agree that this is verbose to the point of almost being absurd, the regex syntax is far from universal. There are several different variations that are different in annoyingly subtle ways.


How so? The differences between PCRE and Python RE are negligible, and one cannot be confused for syntax in the other. Most people use ERE with only the most commonly implemented PCREisms in any case. And BRE is different enough that it's usually detectable right off the bat.


If you don't switch between them frequently enough, it can be very difficult to deal with. For example, when doing a search in vim I often confuse what needs to be escaped and what doesn't versus in a grep. And the escape syntax between Perl-compatible regexes and MySQL regexes is significantly different.

The differences are subtle, but they can cost valuable time while mentally context-switching between them.


The first is BRE vs ERE. Honestly, just use egrep. It behaves more in the way you'd expect.

I'm not really sure what's so different about the escape syntax in MySQL. But I'm sure you know.


MySQL regex is a bit of a nightmare. It contains a lot of character classes that can be confusing if you don't use them often. I guess "escape syntax" was the wrong phrase.

http://dev.mysql.com/doc/refman/5.7/en/regexp.html


Ah. I was confused, because it uses Spencer's regex package, which is pretty standard for parsing and matching ERE.


I agree that SRL is painfully verbose. I'm fairly experienced with regexes, and I found the new syntax to be far less readable. A better (and much simpler) way to make the regex more readable would be to split its construction into multiple lines (possibly using separate strings that are concatenated and given meaningful variable names). I already find myself often creating regex substrings that are then used in the construction of multiple, related regexes.
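The approach described above can be sketched in Python; the pattern fragments and variable names here are invented for illustration:

```python
import re

# Name the pieces, then assemble them into one or more full regexes.
LOCAL = r"[0-9a-z._%+-]+"
DOMAIN = r"[0-9a-z.-]+"
TLD = r"[a-z]{2,}"

EMAIL_RE = re.compile(rf"^{LOCAL}@{DOMAIN}\.{TLD}$", re.IGNORECASE)

# The same DOMAIN fragment can be reused in a related regex.
HOST_RE = re.compile(rf"^{DOMAIN}\.{TLD}$", re.IGNORECASE)
```

Each fragment gets a meaningful name, and the final pattern reads as a composition rather than one opaque string.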


And I think that SRE is far more readable than standard regex syntax, which is in turn more readable than SRL. SRE is lisp-like, though, so it's not for everyone.


The first example:

    /^(?:[0-9]|[a-z]|[\._%\+-])+(?:@)(?:[0-9]|[a-z]|[\.-])+(?:\.)[a-z]{2,}$/i
is a total strawman, needlessly obfuscated. How about writing it like this:

    /^[0-9a-z._%+-]+@[0-9a-z.-]+.[a-z][a-z]+$/i
which, while "scary looking", is at least immediately readable by anyone who knows even the basics about REs. If the argument for "verbose REs" is valid, it ought to stand up against at least a typical standard RE.

Also, it's not clear that "letter" and "[a-z]" mean the same thing. Does "letter" include uppercase? Does it include non-ASCII letters like "[[:alpha:]]" does? Don't forget the weird collation behavior "[a-z]" sometimes encounters.


> which, while "scary looking", is at least immediately readable by anyone who knows even the basics about REs

Nope, I'm mostly a DB guy very fluent in SQL, and I use regex like two dozen times a year. But every time I need to write something non-trivial I must run to a regex cheatsheet website and spend long minutes trying to figure shit out.

It's not that I'm dumb, and taking a MOOC about regex is definitely on my todo list... It's just that I haven't found the damn time yet to learn the monstrosities and exceptions of regex.

And this is especially painful coming from PostgreSQL, which has a good debugger and a clear syntax (even for non-standard functions).


FWIW I'm doing the Accessing Web Data part of the Python for Informatics course at Coursera. First part of that course is regex and after one lesson (~15 mins) it covers enough to read the above expression.

I was already conversant with regex, so I'm perhaps biased, but a simple email-like search seems easy for a novice to read.

Perhaps you have a particular block when approaching regex. The book by Charles Severance that goes with the course (above) is freely available online.


Nit: you need to escape the last dot since it matches the domain separator. (Perhaps this proves the need for saner regex syntax?)


    /^[0-9a-z\._%\+\-]+@[0-9a-z\.\-]+\.[a-z][a-z]+$/i
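The difference the nit points out is easy to demonstrate in Python; the test strings here are made up:

```python
import re

# With the unescaped dot, any character can stand in for the domain separator.
loose  = re.compile(r"^[0-9a-z._%+-]+@[0-9a-z.-]+.[a-z]{2,}$", re.I)
strict = re.compile(r"^[0-9a-z._%+-]+@[0-9a-z.-]+\.[a-z]{2,}$", re.I)

print(bool(loose.match("a@bXcom")))   # True: 'X' satisfied the bare '.'
print(bool(strict.match("a@bXcom")))  # False
```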


"Also, it's not clear that "letter" and "[a-z]" mean the same thing"

"number" and [0-9] are even worse. That should have been called "digit" and, as another commenter already pointed out, in the age of Unicode, it still is confusing.

As to this attempt at simplifying regex writing and reading: nice try, but I think it needs more work. Apart from the Unicode thing, there's the fact that "letter" is only equivalent to [a-z] because of the 'case insensitive' flag.

I think I would go for something that's less grammatical English and more programming language like (alignment of the colons optional)

  Start of text.
  1 or more  : digit, lowercase or one of ._%+-
  Literal    : @
  1 or more  : digit, lowercase or one of .-
  Literal    : .
  2 or more  : lowercase
  End of text.
  All: case insensitive.
My default would be to have 'lowercase' mean the Unicode character class. 'ASCII lowercase' would handle [a-z]

Adding capture groups, look ahead and look behind, comments, etc. is left as an exercise to the reader (they probably would make this look very ugly)

There's also the issue of nesting, like in this botched attempt to write a regex for URLs:

  One or more of:
    Once        : letter or underscore
    One or more : letter, digit or underscore
    Separated by: /
  Optional:
    Literal: ?
    One or more:
      One or more : letter, digit or underscore
      Literal     : =
      One or more : letter, digit or underscore
      Separated by: ,
Of note here is that I think we need to digress from regexes a bit by introducing things like 'Separated by'. Without it, you often need to repeat potentially long phrases (programmatically building your regex can avoid that, but I think you still would need a serialization format, and I also think it makes sense for that to not use a full fledged programming language)

Thinking of things of that complexity, I'm starting to think it would be better to have people write a BNF grammar.


> I think we need to digress from regexes a bit by introducing things like 'Separated by'.

You might be interested to learn that perl6 regexes have this, notated '%'. https://docs.perl6.org/language/regexes#Modified_quantifier:...
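In engines without such a quantifier, the same effect can be had with a small helper that expands "one or more, separated by" into `item(?:sep item)*`. A minimal sketch in Python (the helper and pattern names are invented):

```python
import re

def sep_by(item: str, sep: str) -> str:
    """One or more `item`, separated by `sep` (both already-valid regex fragments)."""
    return f"(?:{item}(?:{sep}{item})*)"

word = r"[A-Za-z_][A-Za-z0-9_]*"
pair = rf"{word}={word}"
query = sep_by(pair, ",")  # matches e.g. a=b,c=d

print(bool(re.fullmatch(query, "a=b,c=d")))  # True
```

The long `pair` fragment is written once, instead of being repeated on both sides of the separator.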


I'd still keep the {2,} instead of repeating the [a-z] twice, but otherwise, I completely agree.


And if you had something supporting verbose regular expressions (e.g. Python’s re.VERBOSE flag):

    ^
    [0-9a-z\._%\+-]+  # Local component
    @
    [0-9a-z\.-]+      # Domain name (subdomains permitted)
    \.[a-z]{2,}       # TLD
    $
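For concreteness, that layout drops straight into Python like this (a sketch; the name is arbitrary):

```python
import re

EMAIL_RE = re.compile(
    r"""
    ^
    [0-9a-z\._%\+-]+  # Local component
    @
    [0-9a-z\.-]+      # Domain name (subdomains permitted)
    \.[a-z]{2,}       # TLD
    $
    """,
    re.IGNORECASE | re.VERBOSE,
)

print(bool(EMAIL_RE.match("jane.doe@mail.example.org")))  # True
```

Under re.VERBOSE, unescaped whitespace in the pattern is ignored and `#` starts a comment, so the annotations live inside the pattern itself.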


I remember learning that \d was not the same as [0-9]. \d is 'digit in any language, not just 0 to 9', so it'll activate on digits other than Arabic numerals.

The SRL documentation doesn't make it clear if they mean 'number' to be 'only numbers 0 to 9' or 'any digit in any language'.
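Python 3's `re` module shows the distinction directly: `\d` matches any Unicode decimal digit by default, and the `re.ASCII` flag restricts it to [0-9]:

```python
import re

# U+0663 is ARABIC-INDIC DIGIT THREE, a Unicode decimal digit.
print(bool(re.fullmatch(r"\d", "٣")))            # True
print(bool(re.fullmatch(r"\d", "٣", re.ASCII)))  # False
print(bool(re.fullmatch(r"[0-9]", "٣")))         # False
```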


Great work, nice site! There's a huge amount of programmer time lost worldwide reading regex expressions and trying to determine what they do. This is a way better option for readability.


Zounds, now that programming languages have basically scrapped the idea of sounding like English, regular expressions pick it up from the trash. And emulate AppleScript of all things, if I am any judge.


AppleScript, SQL, COBOL and others have all made the same mistake.


Cool, it's like COBOL for regexes. Now even managers can write them!


Anyone remember: South. South. West. Look. Pick axe.

For the same reasons I have mixed feelings about Cucumber and similar testing frameworks (BDD), which also rely on semi-English language to do things. It looks cool and enticing, but it's hard to sell (to others), even if I myself am super-excited to see it in action (just because of how crazy it looked the first time I saw it).


Looks like AppleScript: easy to read, impossible to write.


This is really cool. I'm sure it can be improved, but a nicer high-level DSL for regex is something I have been looking for a long time now. Combinator libraries are nice but language-dependent.


My attempt at simplifying (a subset of) regexes:

https://github.com/crdoconnor/simex


Because if there was one thing regexp needed it was COBOL.


Don't get me wrong, but I think that if anything this makes regular expressions harder to understand. The syntax is super verbose.


I'm no fan of regexes, but I'm not a huge fan of this either; I would be interested in seeing existing convoluted regexes expressed like this for me in an IDE, but I don't like it as an input format.

I do wonder if having an EBNF compiler like ANTLR being more accessible would solve the readability & maintainability issues.


I don't think anyone is really a fan of regex(es?) but the task they are for is actually a rather complicated one and, let's be honest, they are pretty good at it. I agree that it would be nice to see better tools to understand and create new regex but I think that they are actually fine on their own.

They might be ugly and everyone had a point where they thought they were just some sort of magic impossible to understand but I think that there is a point where they just... click. After that they can still be ugly and messy but you at least understand that there is a purpose to it.


'Regular Expressions Made Simple'

Regular expressions are simple. It's just a matter of putting a bit of time into learning them.


Simple for various definitions of simple. Readable not included.


Hey, remember that time we gave up regular expressions and went back to writing grammars? Right tool for the job..


I hate languages that try to be like natural language, for the simple reason that I can't actually use natural language. I can't type: "I want something that has at least some letters at the beginning followed by an @ ..." or any other variation except the exact syntax they require. Maybe in 10 years if NLP has come far enough, but not when it's a simple parser like now. I think it's much harder to remember than a syntax that is completely different from anything else you know.

On the other hand, it did work for SQL...


This definitely has some entertainment value. However, if you want to make your code verbose, there is only one way to go. Use a language that needs just the following statement to produce a valid program:

  TRANSFORM THE CURRENT OBJECT INTO THE DESIRED OBJECT.
Any other program statements are redundant.


This is really cool, but my brain keeps getting stuck on the word choice. Every time I see the `literally` keyword, I hear a teenage, valley girl accent in my head.

"Literally, at sign."

"Like, literally, hashtag, guys."

I can't even.


And now that "literally" has been coopted to also mean "figuratively", I don't even know what that clause means. I also got mentally stuck on that phrase, and my eyes literally popped out of my skull.


Yeah, we've got VALGOL for that:

  14 LIKE, Y$KNOW (I MEAN) START
  %% IF
  PI A =LIKE BITCHEN AND
  01 B =LIKE TUBULAR AND
  9  C =LIKE GRODY**MAX
  4K (FERSURE)**2
  18 THEN
  4I FOR I=LIKE 1 TO OH MAYBE 100
  86 DO WAH + (DITTY**2)
  9  BARF(I) =TOTALLY GROSS(OUT)
  -17 SURE
  1F LIKE BAG THIS PROGRAM
  ?  REALLY
  $$ LIKE TOTALLY (Y*KNOW)


For the sake of pedanticity, the regex he proposed as an example doesn't validate all possible email addresses (particularly ones that are simply …@tld).



