The first example: /^(?:[0-9]|[a-z]|[\._%\+-])+(?:@)(?:[0-9]|[a-z]|[\.-])+(?:\.)...

Twisell · on Aug 30, 2016

> which, while "scary looking", is at least immediately readable by anyone who knows even the basics about REs

Nope. I'm mostly a DB guy, very fluent in SQL, and I use regex maybe two dozen times a year. But every time I need to write something non-trivial I have to run to a regex cheat sheet website and spend long minutes trying to figure shit out.

It's not that I'm dumb, and taking a MOOC about regex is definitely on my todo list... It's just that I haven't found the damn time yet to learn the monstrosities and exceptions of regex.

And this is especially painful coming from PostgreSQL, which has a good debugger and a clear syntax (even for non-standard functions).

pbhjpbhj · on Aug 30, 2016

FWIW I'm doing the Accessing Web Data part of the Python for Informatics course at Coursera. The first part of that course is regex, and after one lesson (~15 mins) it covers enough to read the above expression.

I was already familiar with regex, so I'm perhaps biased, but a simple email-like search seems easy for a novice to read.

Perhaps you have a particular block when approaching regex. The book by Charles Severance that goes with the course (above) is freely available online.

nneonneo · on Aug 30, 2016

Nit: you need to escape the last dot since it matches the domain separator. (Perhaps this proves the need for saner regex syntax?)
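
To illustrate the nit, here is a minimal Python sketch. The two patterns are made up for the demonstration (they are not the truncated expression from the article); they differ only in whether the dot is escaped.

```python
import re

# Hypothetical patterns for illustration: identical except for the escape.
unescaped = re.compile(r"^[a-z]+.[a-z]{2,}$")   # '.' matches any character
escaped = re.compile(r"^[a-z]+\.[a-z]{2,}$")    # '\.' matches a literal dot


def matches(pattern, text):
    """Return True if the compiled pattern matches the whole text."""
    return pattern.match(text) is not None


print(matches(unescaped, "example.com"))  # True
print(matches(unescaped, "example-com"))  # True: '.' accepted the hyphen
print(matches(escaped, "example-com"))    # False
print(matches(escaped, "example.com"))    # True
```

The unescaped version still accepts every valid input, which is why this kind of bug tends to go unnoticed.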

adambowles · on Aug 30, 2016

    /^[0-9a-z\._%\+\-]+@[0-9a-z\.\-]+\.[a-z][a-z]+$/i

Someone · on Aug 30, 2016

"Also, it's not clear that "letter" and "[a-z]" mean the same thing"

"number" and [0-9] are even worse. That should have been called "digit" and, as another commenter already pointed out, even that is still confusing in the age of Unicode.

As to this attempt at simplifying regex writing and reading: nice try, but I think it needs more work. Apart from the Unicode thing, there's the fact that "letter" is only equivalent to [a-z] because of the 'case insensitive' flag.

I think I would go for something that's less like grammatical English and more like a programming language (alignment of the colons optional):

  Start of text.
  1 or more  : digit, lowercase or one of ._%+-
  Literal    : @
  1 or more  : digit, lowercase or one of .-
  Literal    : .
  2 or more  : lowercase
  End of text.
  All: case insensitive.

My default would be to have 'lowercase' mean the Unicode character class. 'ASCII lowercase' would handle [a-z]
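
For what it's worth, the description above maps onto a regex fairly mechanically. Here is a rough Python sketch, not a proposal for an actual implementation: the `CLASSES` table and the `one_of` helper are names invented for this example, and 'lowercase' is mapped to the plain [a-z] range (as in the original regex) rather than to a Unicode class.

```python
import re

# Assumed mapping from the names used in the description to character ranges.
CLASSES = {"digit": "0-9", "lowercase": "a-z"}


def one_of(names=(), literals=""):
    """Return a character class for the named ranges plus literal characters."""
    ranges = "".join(CLASSES[name] for name in names)
    return "[" + ranges + re.escape(literals) + "]"


parts = [
    "^",                                            # Start of text.
    one_of(["digit", "lowercase"], "._%+-") + "+",  # 1 or more
    re.escape("@"),                                 # Literal
    one_of(["digit", "lowercase"], ".-") + "+",     # 1 or more
    re.escape("."),                                 # Literal
    one_of(["lowercase"]) + "{2,}",                 # 2 or more
    "$",                                            # End of text.
]
email_like = re.compile("".join(parts), re.IGNORECASE)  # All: case insensitive.

print(email_like.match("User.Name+tag@Mail.Example.org") is not None)  # True
print(email_like.match("user@example") is not None)                    # False
```

Using `re.escape` for the literals means whoever writes the description never has to think about which characters are special.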

Adding capture groups, lookahead and lookbehind, comments, etc. is left as an exercise for the reader (they would probably make this look very ugly).

There's also the issue of nesting, like in this botched attempt to write a regex for URLs:

  One or more of:
    Once        : letter or underscore
    One or more : letter, digit or underscore
    Separated by: /
  Optional:
    Literal: ?
    One or more:
      One or more : letter, digit or underscore
      Literal     : =
      One or more : letter, digit or underscore
      Separated by: ,

Of note here is that I think we need to digress from regexes a bit by introducing things like 'Separated by'. Without it, you often need to repeat potentially long phrases (programmatically building your regex can avoid that, but I think you would still need a serialization format, and I also think it makes sense for that not to use a full-fledged programming language).
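
As a sketch of the "programmatically building your regex" route, assuming Python: `separated_by` is a hypothetical helper, not part of any library. It shows the repetition that plain regex syntax forces on you, since the item has to appear twice in the generated pattern.

```python
import re


def separated_by(item, sep):
    """Build a fragment matching one or more `item`, separated by literal `sep`.

    `item` is a regex fragment; `sep` is treated as literal text.
    """
    return f"(?:{item})(?:{re.escape(sep)}(?:{item}))*"


WORD = r"[A-Za-z0-9_]+"            # letter, digit or underscore
key_value = f"{WORD}={WORD}"       # e.g. a=1
query = re.compile("^" + separated_by(key_value, ",") + "$")

print(query.match("a=1,b=2") is not None)  # True
print(query.match("a=1") is not None)      # True
print(query.match("a=1,") is not None)     # False: trailing separator
```

The caller writes the item once, but the pattern that comes out still contains it twice, which is the readability problem a built-in 'Separated by' would remove.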

Thinking of things of that complexity, I'm starting to think it would be better to have people write a BNF grammar.

philh · on Aug 30, 2016

> I think we need to digress from regexes a bit by introducing things like 'Separated by'.

You might be interested to learn that Perl 6 regexes have this, written as '%'. https://docs.perl6.org/language/regexes#Modified_quantifier:...

lilyball · on Aug 30, 2016

I'd still keep the {2,} instead of repeating the [a-z] twice, but otherwise, I completely agree.

chrismorgan · on Aug 30, 2016

And if you had something supporting verbose regular expressions (e.g. Python’s re.VERBOSE flag):

    ^
    [0-9a-z\._%\+-]+  # Local component
    @
    [0-9a-z\.-]+      # Domain name (subdomains permitted)
    \.[a-z]{2,}       # TLD
    $
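
For completeness, here is how that could be wired up in Python. This is a minimal sketch: `EMAIL_LIKE` and `looks_like_email` are names chosen for the example, and adding `re.IGNORECASE` is an assumption carried over from the `/i` flag on the one-line version earlier in the thread.

```python
import re

EMAIL_LIKE = re.compile(
    r"""
    ^
    [0-9a-z\._%\+-]+  # Local component
    @
    [0-9a-z\.-]+      # Domain name (subdomains permitted)
    \.[a-z]{2,}       # TLD
    $
    """,
    re.VERBOSE | re.IGNORECASE,
)


def looks_like_email(text):
    """Return True if text has the rough shape local@domain.tld."""
    return EMAIL_LIKE.match(text) is not None


print(looks_like_email("John.Doe@Example.co.uk"))  # True
print(looks_like_email("no-at-sign.example"))      # False
print(looks_like_email("a@b.c"))                   # False: TLD needs 2+ letters
```

Note that in verbose mode whitespace and `#` comments are ignored only outside character classes, so a literal space or `#` in the pattern would need escaping.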

vacri · on Aug 30, 2016

I remember learning that \d was not the same as [0-9]. \d is 'digit in any language, not just 0 to 9', so it'll match digits other than Arabic numerals.
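
Whether this holds depends on the engine and its flags. As one data point, a small check in Python 3, where `\d` on `str` patterns is Unicode-aware unless `re.ASCII` is set (the variable names are just for this example):

```python
import re

devanagari_three = "\u0969"  # DEVANAGARI DIGIT THREE

# \d on a str pattern matches any Unicode decimal digit by default.
unicode_digit = re.fullmatch(r"\d", devanagari_three) is not None
# An explicit range only covers the ASCII digits 0 to 9.
ascii_range = re.fullmatch(r"[0-9]", devanagari_three) is not None
# re.ASCII restricts \d to [0-9].
ascii_flag = re.fullmatch(r"\d", devanagari_three, re.ASCII) is not None

print(unicode_digit, ascii_range, ascii_flag)  # True False False
```

Other engines make different choices here, so it is worth checking before relying on either behaviour.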

The SRL documentation doesn't make it clear if they mean 'number' to be 'only numbers 0 to 9' or 'any digit in any language'.