Regex combinators are a much better solution to that problem, for which I gave various arguments here[1]:
- You don't need to remember which regex syntax the library is using. Is it using the emacs one? The perl one? The javascript one? That new "real language" one?
- It's "self documenting". Your combinators are just functions, so you just expose them and give them type signatures, and the usual documentation/autocompletion/whatevertooling works.
- It composes better. You don't have to mash strings together to compose your regex, you can name intermediate regexes with normal variables, etc.
- Related to the point above: No string quoting hell.
- You stay in your home language. No sublanguage involved, just function calls.
- Capturing is much cleaner. You don't need to conflate "parentheses for capture" with "parentheses for grouping" (since you can use the host language's parens).
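A minimal sketch of the idea in Python (the helper names `lit`, `seq`, `alt`, `many1` are made up for illustration; they just compose plain pattern strings):

```python
import re

# Hypothetical combinators that build ordinary pattern strings.
def lit(s):   return re.escape(s)                # literal text, escaping handled for you
def seq(*ps): return "".join(ps)                 # concatenation
def alt(*ps): return "(?:" + "|".join(ps) + ")"  # alternation, grouped with host-language parens
def many1(p): return "(?:" + p + ")+"            # one or more

word = alt(lit("foo"), lit("bar"))     # intermediate pieces get normal variable names
pattern = seq(many1(word), lit("!"))

assert re.fullmatch(pattern, "foobar!")
```

Note how alternation and grouping are just function calls, so there is no quoting hell and no capture-vs-group ambiguity.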
Personally, I use regexes all the time. Every day. I'm constantly typing regexes into emacs and vim to do text manipulation. Sometimes a quick :%s/\(foo\|bar\)d/baz\1/g, other times a :%v/function /d. I can't imagine what it would be like if I had to write out "begin with either of (number, letter, ..." I'd never get anything done.
For programming it could be nice. It's an interesting idea. But I've found that good highlighting solves most of the problem. When I type "\\(" into emacs, it immediately highlights that \\( in a different color, along with \\| and \\). That way you can differentiate instantly between capture groups vs matching literal parens. Here's an example:
http://i.imgur.com/b417O2o.png That small snippet would become way, way longer if you use a regex combinator. Maybe good, maybe bad, but it's hard to read a lot of code.
Yeah, it's ugly. And there are a bazillion small variations between regex engines. But for raw productivity, it seems hard to beat.
I, on the other hand, have to build a regex a few times a year. Every darn time I have to look up the difference between .* and .+, and how to do an "or", and why I get matches when I group a set of characters.
So SRL sounds like a great idea: us mortals can use it to painfully stitch together our poor regexes, and when we get good enough we can skip SRL and become fully fledged regex gods.
Off topic, if SRL is a regex compiler, I'd love a regex decompiler. To be able to turn an impossible jumble of weird characters into a structured description of what it does would be hugely useful.
I think you misunderstand what "regex combinators" means. In particular, SRL is not regex combinators.
TAForObvReasons, in another answer, posted this[1], which should explain it.
About your example ... on the contrary, it would become way cleaner with regex combinators. The string manipulation would be replaced with proper composition and the escaping madness would disappear. For example
(concat foo "\\|" bar "\\|" baz)
becomes
(alt foo bar baz)
You seem to be familiar with elisp. The lisp family is particularly well suited to the combinator approach; please read the link above. :)
If you like regex combinators, check out the elisp "rx" library. You can end up writing code sort of similar to the article's, with less emphasis on using english grammar: http://i.imgur.com/iD34MqC.png
Combinators are very convenient and precise, but the tradeoff is that the code is longer. And I have to look up what to write every time I want to write one. But that's a personal bias.
Thanks for the reference. I'll study it.
EDIT: One of the good ideas in SRL is "if not followed by". There are too many [not]ahead-[not]behind combinations to warrant special syntax for each of them. I wonder if it could be streamlined, though?
If you don't know, SRE is a DSL in scheme that is essentially an alternate syntax for regex that does this, originally implemented by SCSH, and now most popularly by irregex. It looks like this:
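Here's a rough sketch of the article's email example in SRE notation (irregex flavor; treat the class names and details as illustrative rather than authoritative):

```scheme
;; SRE, as accepted by irregex: symbols name character classes,
;; strings are literals, ("...") is a character set.
(: bos
   (+ (or alphanumeric ("._%+-")))     ; local part
   "@"
   (+ (or alphanumeric ("-")))         ; first domain label
   (+ "." (+ (or alphanumeric ("-")))) ; dotted labels / TLD
   eos)
```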
I find that most of the problems are resolved with interstitial comments a la perl /x or coffeescript heregex. Merely splitting the regular expression into smaller blocks with a short description of what to match makes it easy for people to verify the regular expression.
I do most of my text manipulation in sublime text using replace. I like the interactivity of it compared to sed/awk, and the easy undo is helpful as well (try, back, edit, try, back, ..).
There's an MIT CS assignment that goes along with exactly what you are saying: "Regexp Disaster" by Gerald Sussman. It's a good read for those that want to learn more about the combinator approach.
Unlike a regex string, using incorrect syntax with a parser combinator generally results in a syntax error. Regex parsers generally treat "unknown" characters as characters to match, hiding bugs.
Perl 6 regexes are parsed at compile-time (i.e. they're code just like any other code) so using incorrect regex syntax, eg unknown characters, results in a compile-time syntax error.
A regex language could fix it by some clear prefix convention or whatever for all operators. Though more verbose, it would still be less verbose than alternatives.
Suppose all regex operator characters have to be backslashed (and \\ stands for a single \). Then it's clear. No backslash means it's literal; otherwise it's an operator (and a backslash on a nonexistent operator is a parse error).
The ambiguities exist because regex aficionados want common operators to be just one character long.
A really annoying "feature" of regexes in POSIX is that known regex meta-characters are treated as ordinary characters when they appear in a context that doesn't recognize them.
Some examples are a superfluous closing parenthesis or a [ bracket in a character class.
A sane regex syntax treats these characters as a separate lexical category from literal characters, in all contexts, and only produces literals out of them when they are escaped.
Sure, but 1. it's pretty easy to expose a similar API, since those are all functions, and 2. the documentation tools for the language naturally document the combinators.
Even if the actual combinators are different, it's still a much better situation than the bazillion regex syntaxes with completely different quoting and escaping mechanisms.
^^^ Regexp combinators in JS. 614 bytes minimized and gzipped.
Once you're proficient with the standard RegExp syntax, you can write pretty complex ones without external tools, but when the time comes to debug and modify them, using combinators makes the task much, much simpler.
Barely touch regexes these days, because for the last few years I've been using Rebol / Red parse more and more.
Here's a translation of the first SRL example in the parse dialect:
[
some [number | letter | symbol]
"@"
some [number | letter | "-" ]
some ["." copy tld some [number | letter | "-" ]]
if (parse tld [letter some letter])
]
And here's a full matching example:
number: charset "0123456789"
letter: charset [#"a" - #"z"]
symbol: charset "._%+-"
s: {Message me at you@example.com. Business email: business@awesome.email}
parse s [
any [
copy local some [number | letter | symbol]
"@"
copy domain [
some [number | letter | "-" ]
some ["." copy tld some [number | letter | "-" ]]
]
if (parse tld [letter some letter])
(print ["local:" local "domain:" domain])
| skip
]
]
Same for me. About a decade ago I started using Rebol for shell-scripts, file management and working with web API's. It's nice to work with a syntax where I can look at an older parse routine and figure out what I was trying to do just by reading it.
Only time I use regex is within my code editor.
I don't mind regular expressions myself, but this does look quite nice. Any idea if this has been implemented for anything other than Rebol (which I'd never heard of)?
There was a Parse-inspired project for JavaScript, but I'm not sure to what extent it was developed. There was also a Parse implementation in Topaz[1] (which itself is implemented in JavaScript) but is as of now unfinished.
It'd be difficult to implement as tightly in another language that doesn't have Rebol's (or a somewhat Lisp-like) free-form code-is-data/data-is-code[2] approach. Rebol and its Parse dialect share the same vocabulary and block structure that amongst other things: makes it easy to insert progress-dependent Rebol snippets within Parse; build Parse blocks dynamically (mid-Parse if needed!); build Rebol code from parsed data; develop complex grammar rules very similar to EBNF[3]. The article linked above[4] (and now linked again below :) does a good job of fleshing out these ideas and why it may remain a unique feature for some time.
I like how the marquee example is of how to do something you shouldn't be trying to do [1] anyway.
In more general terms, if a regex is complicated enough that something like this seems to make sense, the problem is that your regex is too complicated, and you should fix that.
It's my observation that a lot of problems people have with regex is that they haven't learned tips like "Is regex the right way to do this?", "Can I simplify this regex because I trust the input?" and "What do I really need the regex to do?".
To validate a user-supplied email address, you arguably just need something along the lines of ^.*@.*\..*$ to help avoid whitespace and forgetting the @ sign. In a lot of cases, if the email address is wrong, all that happens is either a) user login fails or b) user registration fails. Hence there's no need for an unreadably complex regex to validate.
> In more general terms, if a regex is complicated enough that something like this seems to make sense, the problem is that your regex is too complicated, and you should fix that.
I disagree. There is no such thing as a "complicated regex" in itself; it's all the same to the underlying engine. It's the maintenance of the regex that's the problem. You could, for instance, compile many small, maintainable patterns into a (technically) complicated regex with guarantees it will evaluate as expected—which is one possible way the outlined SRL approach could work. Don't confuse process with technology.
The word "either" implies only two choices, making your opening example confusing when the first "either of" was really picking from three possibilities.
Is it less discoverable than the proposed alternative though?
In both cases one looks up the documentation, in one I find that search for the line start requires a regex with "^" and in the other I find something like "begin with" of Simple Regex Language (SRL). I still need to read (or test) to find what "begin with" means and I still couldn't guess it - why not "start with", "open with", "first character", or a myriad of other possible options.
Since code is read more than it is written, how well do you think a colleague without previous knowledge of regexes could understand '^' vs 'start of line'?
I'm not absolutely sure this isn't a joke that got out-of-hand. This is the COBOL of regular expressions. :)
Whilst the conventional regular expression syntax is arguably overly compact, this is just too far in the opposite direction!
Something more PEG-like, or even Perl 6 regex-like, would make for more readable regular expressions whilst not completely throwing out everything we think things mean. Hell, even /x -- ignore whitespace and comments -- can make things much clearer:
/ ^
[0-9a-z._%+-]+ # The local part. Mailbox/user name. Can't contain ~, amongst other valid characters.
\@
[0-9a-z.-]+ \. [a-z]{2,} # The domain name. We've decided a TLD can never contain a digit, apparently.
$ /x
Tangentially, there's no point validating email addresses with anything more complicated than /@/. If people want to enter an email address that doesn't work, they can and will. If you want to be sure that the address is valid, send it an email!
> Tangentially, there's no point validating email addresses with anything more complicated than /@/
Well once you have this in database, maybe you'll give it to some library, then maybe this poor little library will just paste it into the SMTP conversation. And maybe some user will be clever enough to exploit it.
I believe that if you want to get X from the user, it's always a good idea to make super sure that it is actually X before passing it further.
And I very much agree that whenever possible and necessary, simply splitting regular expression to multiple lines and using comments seems like a superior approach.
Pretty good, though <domain> should end with <alpha> 2..*
Note that your version is not actually the same, as it accepts all alphabetic characters, not just ASCII ones. Which puts it much closer to RFC 6532, but still not exactly there due to the quoting rules in usernames. Which get pretty hairy. See [1] for an implementation of RFC 822, which is a simpler version of the modern standard, for a regex to validate email in Perl 5.
We (the Hyperscan team) have spent a lot of time staring at regular expressions over the years (shameless plug: https://github.com/01org/hyperscan).
I think a better format for regex is long overdue, but this isn't it. It's way too verbose (other commenters also noticed the resemblance to COBOL). I'm picturing a Snort/Suricata rule with regexes in this format, and you've now doubled the amount of screen real estate per rule.
The real problems with regex readability are (1) the lack of easily grasped structure, so it's almost impossible to spot the level at which a sequence or alternation operates (PCRE's extended format and creative tabbing can help) and (2) the total lack of abstraction - so if you have a favorite character class or subregex you write it approximately a bazillion times.
What's missing in regexps IMO is composability so you can build larger patterns out of smaller ones, giving each a clear name. Replacing '[0-9]' with 'digit' doesn't really help much.
$ txr
This is the TXR Lisp interactive listener of TXR 147.
Use the :quit command or type Ctrl-D on empty line to exit.
1> (regex-parse ".*a(b|c)?")
(compound (0+ wild) #\a (? (or #\b #\c)))
2> (regex-compile *1)
#/.*a[bc]?/
I showed it to some coworkers in 2013 and got some pretty good feedback. Then I got distracted by other things. One of the issues is that I learned Perl regex syntax so well by designing this language that I never needed to use it again :)
I plan on coming back to it since I'm writing a shell now, and I can't remember grep -E / sed -r syntax in addition to Perl/Python syntax.
SRL is the same idea, but I think it is way too verbose, which it appears a lot of others agree with.
If anyone is interested in the source code let me know! It was also bootstrapped with a parsing system, which worked well but perhaps wasn't "production quality". So I think I will reimplement CRE with a more traditional parsing implementation (probably write it by hand).
I have a (mostly abandoned) side project with the same aim. I spent a lot of time thinking about language design, from the perspective of enabling adoption of a new language when an existing one has so much network effect:
Maybe this works as a learning tool? I think it's much better to learn regex as-is, no matter how ugly or terse you may or may not find it. It's pretty universal across languages (with some annoying variations). There are lots of online tools and programs that can help you decode or create regex, and after a while it's not so hard to read or create. But it's also worth knowing a more comprehensive parsing tool or parsing techniques so you don't get too ambitious with regex :)
Jan Goyvaerts, the author of this program also created this great website [0], which has lots of very clear and detailed regex tutorials. I always come back to it when I get confused about regular expressions.
Side note on the API of the PHP lib implementing this: could we please stop using the fluent/query-builder pattern with closures?
It's the most disgraceful code style I've seen, and misleading: it makes your mind think something async could be happening. I know Laravel popularized this, along with other ugly patterns, but let's stop cargo-culting it.
I think this is really valuable. Just today I had a non-tech co-worker who needed to understand regex for some tool we were using. I did the regex for him and (very briefly) explained it. Now this might be something he can more easily grok, using a translator (+ regex101.com to verify) to create the more complex regexes he might end up needing.
If you're dissatisfied with the terseness of regular expressions, it's worth looking at SNOBOL4: http://www.snobol4.org/
which has been around for decades.
The author pits his project against POSIX regular expressions, but personally I feel that it's PCRE that rules the day. I find pcre regex significantly less verbose and easier to read.
That's what I was thinking too. It'd be a great way to get people to understand the power of regex. But yeah, I wouldn't want to learn a verbose language to use regexes as frequently as I do today.
So we're replacing a universally understood syntax for a new one that was just invented, and is painfully verbose? I understood what the first regex was doing just fine.
This is a major step up in readability, so it's nice, and you have to invent a new syntax to do that, so I'll chalk that up as unavoidable. But did it have to be so verbose? SCSH/irregex's SRE had similar readability wins, with way less verbosity. You still have to learn a new syntax, though.
include header file stdio.h, searching system paths first.
describe function main that returns a value of type
integer, and has argument of type integer argc, and
argument of type pointers to pointers to characters argv.
begin function body
call function printf with the single argument of type
string "hello world" with a newline appended.
begin new statement.
return the integer value 0.
Wasn't this a goal of Light Table, or am I up in the night? There are a lot of useful representations I'd like to see for my foo, but very few of them I'd actually want to work in directly when I'm creating or modifying the underlying code representation. I could see this tool as a quick and useful way to load a complex regex meaning in my head like doc text, but after that, I'll just prefer to work with the regex directly instead of the verbose text.
I bet there's a fair number of people like me who rarely have to use regex and because of that have to half relearn it each time I have to use them. A more verbose syntax is easier to work with for that because you're not fighting both the syntax and the semantics simultaneously.
While I agree that this is verbose to the point of almost being absurd, the regex syntax is far from universal. There are several different variations that are different in annoyingly subtle ways.
How so? The differences between PCRE and Python RE are negligible, and cannot be confused for any syntax in the other. Most people use ERE with only the most commonly implemented PCREisms in any case. And BRE is different enough that it's usually detectable right off the bat.
If you don't switch between them frequently enough it can be very difficult to deal with. For example, when doing a search in vim I often confuse what needs to be escaped and what doesn't vs a grep. And the escape syntax between perl compatible regexes and mySQL regexes is significantly different.
They are subtle but they can cost valuable time while mentally context switching between them.
mySQL regex is a bit of a nightmare. It contains a lot of character classes that can be confusing if you don't use them often. I guess "escape syntax" is the wrong phrase.
I agree that SRL is painfully verbose. I'm fairly experienced with regexes, and I found the new syntax to be far less readable. A better (and much simpler) way to make the regex more readable would be to split its construction into multiple lines (possibly using separate strings that are concatenated and given meaningful variable names). I already find myself often creating regex substrings that are then used in the construction of multiple, related regexes.
And I think that SRE is far more readable than standard regex syntax, which is in turn more readable than SRL. SRE is lisp-like, though, so it's not for everyone.
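In Python, that splitting might look something like this (the sub-pattern names are just illustrative):

```python
import re

# Build one regex out of named, reusable pieces instead of a single opaque string.
local_part = r"[0-9a-z._%+-]+"
domain_label = r"[0-9a-z-]+"
tld = r"[a-z]{2,}"

email = re.compile(
    rf"^{local_part}@{domain_label}(?:\.{domain_label})*\.{tld}$",
    re.IGNORECASE,
)

assert email.match("you@example.com")
assert not email.match("no-at-sign.example.com")
```

The pieces can then be reused when constructing other, related regexes.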
is a total strawman, needlessly obfuscated. How about writing it like this:
/^[0-9a-z._%+-]+@[0-9a-z.-]+\.[a-z][a-z]+$/i
which, while "scary looking", is at least immediately readable by anyone who knows even the basics about REs. If the argument for "verbose REs" is valid, it ought to stand up against at least a typical standard RE.
Also, it's not clear that "letter" and "[a-z]" mean the same thing. Does "letter" include uppercase? Does it include non-ASCII letters like "[[:alpha:]]" does? Don't forget the weird collation behavior "[a-z]" sometimes encounters.
> which, while "scary looking", is at least immediately readable by anyone who knows even the basics about REs
Nope, I'm mostly a DB guy, very fluent in SQL, and I use regex like two dozen times a year. But every time I need to write something non-trivial I must run to a regex cheatsheet website and spend long minutes trying to figure shit out.
It's not that I'm dumb, and taking a MOOC about regex is definitely on my todo list... It's just that I haven't found the damn time yet to learn the monstrosities and exceptions of regex.
And this is especially painful coming from PostgreSQL, which has a good debugger and a clear syntax (even for non-standard functions).
FWIW I'm doing the Accessing Web Data part of the Python for Informatics course at Coursera. First part of that course is regex and after one lesson (~15 mins) it covers enough to read the above expression.
I was already familiar with regex, so I'm perhaps biased, but a simple email-like search seems easy for a novice to read.
Perhaps you have a particular block when approaching regex. The book by Charles Severance that goes with the course (above) is freely available online.
"Also, it's not clear that "letter" and "[a-z]" mean the same thing"
"number" and [0-9] are even worse. That should have been called "digit" and, as another commenter already pointed out, in the age of Unicode, it still is confusing.
As to this attempt at simplifying regex writing and reading: nice try, but I think it needs more work. Apart from the Unicode thing, there's the fact that "letter" is only equivalent to [a-z] because of the 'case insensitive' flag.
I think I would go for something that's less grammatical English and more programming language like (alignment of the colons optional)
Start of text.
1 or more : digit, lowercase or one of ._%+-
Literal : @
1 or more : digit, lowercase or one of .-
Literal : .
2 or more : lowercase
End of text.
All: case insensitive.
My default would be to have 'lowercase' mean the Unicode character class. 'ASCII lowercase' would handle [a-z]
Adding capture groups, look ahead and look behind, comments, etc. is left as an exercise to the reader (they probably would make this look very ugly)
There's also the issue of nesting, like in this botched attempt to write a regex for URLs:
One or more of:
Once : letter or underscore
One or more : letter, digit or underscore
Separated by: /
Optional:
Literal: ?
One or more:
One or more : letter, digit or underscore
Literal : =
One or more : letter, digit or underscore
Separated by: ,
Of note here is that I think we need to digress from regexes a bit by introducing things like 'Separated by'. Without it, you often need to repeat potentially long phrases (programmatically building your regex can avoid that, but I think you still would need a serialization format, and I also think it makes sense for that to not use a full fledged programming language)
Thinking of things of that complexity, I'm starting to think it would be better to have people write a BNF grammar.
I remember learning that \d was not the same as [0-9]. \d is 'digit in any language, not just 0 to 9', so it'll activate on digits other than Arabic numerals.
The SRL documentation doesn't make it clear if they mean 'number' to be 'only numbers 0 to 9' or 'any digit in any language'.
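A quick check in Python 3, where `\d` is Unicode-aware by default:

```python
import re

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE

assert re.match(r"\d", arabic_three)                # \d matches any Unicode digit
assert not re.match(r"[0-9]", arabic_three)         # [0-9] is just the ASCII range
assert not re.match(r"\d", arabic_three, re.ASCII)  # re.ASCII narrows \d back to [0-9]
```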
great work, nice site!
There's a huge amount of programmers' time lost worldwide reading regex expressions and trying to determine what the expressions do.
This is a way better option for readability.
Zounds, now that programming languages have basically scrapped the idea of sounding like English, regular expressions pick it up from the trash. And emulate AppleScript of all things, if I am any judge.
For the same reasons I have mixed feelings about Cucumber and similar testing frameworks (BDD), which also rely on semi-English language to do things. It looks cool and enticing, but it's hard to sell (to others), even if I myself am super-excited to see it in action (just because of how crazy it looked the first time I saw it).
This is really cool. I'm sure it can be improved, but a nicer high-level DSL for regex is something I have been looking for a long time now. Combinator libraries are nice but language-dependent.
I'm no fan of regexes, but I'm not a huge fan of this either; I would be interested in seeing existing convoluted regexes expressed like this for me in an IDE, but I don't like it as an input format.
I do wonder if having an EBNF compiler like ANTLR being more accessible would solve the readability & maintainability issues.
I don't think anyone is really a fan of regex(es?) but the task they are for is actually a rather complicated one and, let's be honest, they are pretty good at it. I agree that it would be nice to see better tools to understand and create new regex but I think that they are actually fine on their own.
They might be ugly and everyone had a point where they thought they were just some sort of magic impossible to understand but I think that there is a point where they just... click. After that they can still be ugly and messy but you at least understand that there is a purpose to it.
I hate languages that try to be like natural language, for the simple reason that I can't actually use natural language.
I can't type: "I want something that has at least some letters at the beginning followed by an @ ..." or any other variation except the exact syntax they require. Maybe in 10 years if NLP has come far enough, but not when it's a simple parser like now. I think it's much harder to remember than a syntax that is completely different from anything else you know.
This definitely has some entertainment value. However, if you want to make your code verbose, there is only one way to go. Use a language that needs just the following statement to produce a valid program:
TRANSFORM THE CURRENT OBJECT INTO THE DESIRED OBJECT.
This is really cool, but my brain keeps getting stuck on the word choice. Every time I see the `literally` keyword, I hear a teenage, valley girl accent in my head.
And now that "literally" has been coopted to also mean "figuratively", I don't even know what that clause means. I also got mentally stuck on that phrase, and my eyes literally popped out of my skull.
14 LIKE, Y$KNOW (I MEAN) START
%% IF
PI A =LIKE BITCHEN AND
01 B =LIKE TUBULAR AND
9 C =LIKE GRODY**MAX
4K (FERSURE)**2
18 THEN
4I FOR I=LIKE 1 TO OH MAYBE 100
86 DO WAH + (DITTY**2)
9 BARF(I) =TOTALLY GROSS(OUT)
-17 SURE
1F LIKE BAG THIS PROGRAM
? REALLY
$$ LIKE TOTALLY (Y*KNOW)
For the sake of pedantry, the regex he proposed as an example doesn't validate all possible email addresses (particularly ones that are simply …@tld).
[1]: https://news.ycombinator.com/item?id=12293687