Claiming that regular expressions are too terse is a bit much. There are only three (!) fundamental operators in basic regular expressions (four if you include parentheses), with all other non-language-specific operators being derived from them (ignoring precedence rules):
1. concatenation, to follow regex A with regex B: AB
2. alternation, to match either A or B: A | B
3. Kleene star, to repeat A zero or more times: A*
4. parentheses, to specify a sub-expression: (A)
The following are all derived/syntactic sugar:
[ABCD] -> (A | B | C | D)
A+ -> AA*
A{2} -> AA
A{2,4} -> AA(|A|AA) or A(A|AA|AAA)
A? -> (A|)
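These equivalences are easy to check mechanically. Here is a quick sketch in Python that tests each sugar form against its hand-derived expansion (re.fullmatch is used so the whole string must match):

  import re

  # (sugar form, derivation using only concatenation, alternation,
  #  the Kleene star, and grouping)
  equivalences = [
      ("[ABCD]", "(A|B|C|D)"),
      ("A+",     "AA*"),
      ("A{2}",   "AA"),
      ("A{2,4}", "AA(|A|AA)"),
      ("A?",     "(A|)"),
  ]

  for sugar, derived in equivalences:
      for text in ["", "A", "AA", "AAA", "AAAA", "AAAAA", "B", "AB"]:
          a = bool(re.fullmatch(sugar, text))
          b = bool(re.fullmatch(derived, text))
          assert a == b, (sugar, text)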
Just about everything else is implementation specific (if the choice of special characters and available operators isn't already). That means you either need to be using the features regularly to remember them, or you have to look them up anyway.
Regular expressions are terse not because they are badly designed, but because the description of a regular language is, by definition, inherently minimal. It is part of their beauty. Without this minimalism, every tidy little one-liner we use to perform some simple match becomes a multi-line specification in Backus-Naur form.
The world needs to get over this fear of regular expressions, born of ignorance and continued misinformation. They are not magic or impossible to understand. They are an elegant description of a very simple state machine which steps through a string one character at a time, nothing more.
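To make the "simple state machine" point concrete, here is a toy sketch in Python of the machine for the regex AB* (hand-built for illustration; a real engine would generate this automatically):

  # Toy DFA for the regex AB*.
  # State 0 expects an 'A'; state 1 accepts any number of 'B's.
  TRANSITIONS = {(0, 'A'): 1, (1, 'B'): 1}
  ACCEPTING = {1}

  def matches(s):
      state = 0
      for ch in s:                             # one character at a time
          state = TRANSITIONS.get((state, ch))
          if state is None:                    # no valid transition: reject
              return False
      return state in ACCEPTING

  print(matches("ABBB"))  # True
  print(matches("BA"))    # False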
Edit: corrected derivation of A{2,4} a la Twisol and jbnicolai.
To be fair, you are not including more advanced operators, like positive/negative lookahead/lookbehind (which is the specific example the article uses), capturing and non-capturing groups, greedy vs non-greedy Kleene stars, etc.
As you say, they are implementation specific, but that's part of the problem: the basic regular expression syntax is insufficient for many tasks, so people take to extending it in complicated and syntactically opaque ways. That's the sign of a bad DSL, not a good one.
Maybe I'm arguing semantics here. To be clear, my point is that I do not agree it is reasonable to declare that regular expressions are a bad DSL simply because it is possible (however common) for people to write difficult-to-read or difficult-to-understand regular expressions. It is the responsibility of the author of the expression to ensure that it is readable and understandable - to the extent that they should exercise restraint when use of an available feature would hinder readability and understandability.
There is absolutely no need for the example regex of http-like strings to be written the way that it is - there is only the want of the author, because they have a hammer and they are looking for a nail. If anything, using a regex for such a thing sets a bad precedent because anybody who wishes to come along and add user@password support to it is going to extend it and make it worse.
A more understandable way to process such a string would be to split it into constituent parts and use regexes only for validation. Split at the :// for the scheme, split at the next / for the path, etc. Turn these into functions, and keep the regexes simple.
Regular expressions are notorious because they are abused, not because they are evil.
I like the way Perl 6 handles this with the grammar feature.
(A grammar is just a special type of class, with a regex as just a special type of method.)
It could be simpler, but I want the resulting data structure to be easier to use.
  grammar Url {
    # default regex/token/rule/method to call
    # (token disables backtracking)
    token TOP {
      <protocol> <domain> <path> <query> <fragment>
    }

    token protocol {
      <(
        <[a..z]> ** 3..10
      )>    # don't include :// in the stringified result
      '://' # must be escaped as it isn't alphanumeric
    }

    token domain-segment { <-[?#/.]>+ }
    token domain {
      <domain-segment> ** 2..* # at least 2 domain segments
        % '.'                  # separated by .
      <?{
        # make sure that the last segment is at least 3 chars
        # (using the Boolean result of regular Perl 6 code)
        @<domain-segment>.tail.chars >= 3
      }>
    }

    token path-segment { <-[?#/\\]>+ }
    token path {
      [
        <[/\\]>
        <path-segment>*
          %% <[/\\]> # separated by path separator (allow trailing)
      ]?
    }

    token query-segment {
      # store as named, rather than positional
      $<key> = ( <-[#=&]>+ )
      '='
      $<value> = ( <-[#=&]>+ )

      # run regular Perl 6 code in the regex
      {
        # attach a Pair object as the AST
        make ~$<key> => val(~$<value>)
        # (`val` turns a numeric value into an allomorph)
      }
    }

    token query {
      [
        '?'
        <( # don't include ? in the stringified result
          <query-segment>*
            % '&' # separated by & (no trailing allowed)
        )>
      ]?
      {
        # attach a static associative array of the key value pairs
        # as the AST
        make Map.new: (@<query-segment>».ast if @<query-segment>.elems)
      }
    }

    token fragment {
      [
        '#'
        <( .* )> # don't include '#' in the stringified result
      ]?
    }
  }
Example usage:
  > my $result = Url.parse('http://perl6.org/foo/bar/baz/?a=1&b=2#fragment');
  > say $result;
  「http://perl6.org/foo/bar/baz/?a=1&b=2#fragment」
    protocol => 「http」
    domain => 「perl6.org」
      domain-segment => 「perl6」
      domain-segment => 「org」
    path => 「/foo/bar/baz/」
      path-segment => 「foo」
      path-segment => 「bar」
      path-segment => 「baz」
    query => 「a=1&b=2」
      query-segment => 「a=1」
        key => 「a」
        value => 「1」
      query-segment => 「b=2」
        key => 「b」
        value => 「2」
    fragment => 「fragment」

  > say $result<query>.ast;
  Map.new((:a(IntStr.new(1, "1")),:b(IntStr.new(2, "2"))))

  > my %query := $result<query>.ast;
  > say %query<b> ~~ Int; # True (because of val(…))
  True
A more advanced usage would be with an actions class.
Basically Perl 6 treats regular expressions as code that is written in a domain specific sub-language, with grammars acting as a structure to hang them off of.
That will certainly work; however, for large regular expressions it will become just as unmanageable over time, especially since the concatenation of each part depends on all the previous ones being error-free. I was referring to the idea of breaking the work done by the regex into more manageable parts.
One of the complaints is that a regex is too terse. This is because a regex provides you with no internal context of what you're trying to do. You can add context by leveraging additional regex features, but that may potentially make the expression even more difficult for a human to parse. The alternative is to use the regexes more sparingly, and allow whatever the host language is to provide the context. Just because you are able to parse a whole string and capture each part that matches some particular pattern all in one go doesn't mean that it's a good idea. Consider the following quick piece of pseudocode where the URL is split into smaller pieces first, which does a similar job to the regex above:
  protocol, domain, path, query, fragment =
      explode(url, "<protocol>://<domain>/<path>[?<query>][#<fragment>]")

  if (protocol !~ "[a-z]{3,10}")
    error("invalid protocol")
  if (domain !~ "[a-z]+(\.[a-z]+)*")
    error("invalid domain")
  if (path !~ "[a-z]+(\/[a-z]+)*")
    error("invalid path")
  if (query not nil and query !~ "[a-z]+")
    error("invalid query")
  if (fragment not nil and fragment !~ "[a-z]+")
    error("invalid fragment")

  success("valid URL!")
Granted, this is much longer relative to the regex-only solution - and it will probably take a bit more effort to implement the magical 'explode' function I've imagined here - however the regexes themselves are now simpler, easier to evaluate, and we have context on what they're there for. Ultimately, we've just stopped using 500 lines' worth of features in the regex library in favour of 500 lines of code that do the same thing elsewhere, but arguably we have made it all much more understandable.
You'll notice that I've changed the regular expressions for each of the components of the URL. This is because the original regex essentially only provides the same functionality as the explode() function above, with no validation of the contents of each part. Consider this another argument against using regular expressions for this kind of work. Note, however, that the language provided to the explode() function itself appears to be regular. This is not unexpected - since URIs are defined using BNF, they are either regular or context-free - but it is an example of a scenario where a regular-expression-like language does not have to be cryptic.
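For what it's worth, a rough Python rendering of this idea might look like the following (the explode helper is my guess at the imagined function, and the validation patterns are the toy ones from the pseudocode above, not production-grade URL rules):

  import re

  def explode(url):
      # split on fixed URL punctuation only; no validation here
      protocol, sep, rest = url.partition("://")
      if not sep:
          raise ValueError("missing ://")
      rest, _, fragment = rest.partition("#")
      rest, _, query = rest.partition("?")
      domain, _, path = rest.partition("/")
      return protocol, domain, path, query or None, fragment or None

  def validate(url):
      protocol, domain, path, query, fragment = explode(url)
      if not re.fullmatch(r"[a-z]{3,10}", protocol):
          raise ValueError("invalid protocol")
      if not re.fullmatch(r"[a-z]+(\.[a-z]+)*", domain):
          raise ValueError("invalid domain")
      if not re.fullmatch(r"[a-z]+(/[a-z]+)*", path):
          raise ValueError("invalid path")
      if query is not None and not re.fullmatch(r"[a-z]+", query):
          raise ValueError("invalid query")
      if fragment is not None and not re.fullmatch(r"[a-z]+", fragment):
          raise ValueError("invalid fragment")

  validate("http://example.org/foo/bar")  # passes; bad input raises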
The way Perl 6 makes this more manageable is with grammars.
(See previous post for an example)
Since a grammar is just a special type of class you can put regexes into roles and compose them together. You can even inherit from another grammar if you only need to change parts of it.
Also if there is something that is difficult to do regularly in the regex sub-language, it allows you to use regular Perl 6 code inline.
(A regex is just a special type of method with a domain specific sub-language.)
Also if there is a bug you can use Grammar::Debugger or Grammar::Tracer to help find it.
(I had a bug in the earlier post and used Grammar::Tracer to find and fix it within seconds.)
> I do not agree it is reasonable to declare that regular expressions are a bad DSL simply because it is possible (however common) for people to write difficult-to-read or difficult-to-understand regular expressions.
I would not call REs a bad language - they are simply too useful for that. I would argue, though, that it's a reasonable design criticism to say not that it's possible to write difficult-to-read code, but that it is difficult to write easy-to-read code.
I would argue that risk of mistakes should not limit the expressivity of a language or have it added to the pile of bad ideas. It is better for users of the language to be aware of potential pitfalls, and use the language appropriately.
That introduces problems too. If you try to use sugar like '+' with an implementation that doesn't support it, you don't get any sort of error. Instead you get a different expression.
Unfortunately, there's an inherent tradeoff between encoding efficiency and error detection. Notice that with VerbalExpressions it would be trivial to return a useful error message if the 'at_least_one' pattern did not exist.
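To make that tradeoff concrete, here is a toy fluent builder in Python, in the spirit of VerbalExpressions (the class and method names are invented for illustration):

  import re

  class Pattern:
      def __init__(self):
          self.parts = []

      def literal(self, s):
          self.parts.append(re.escape(s))
          return self

      def at_least_one(self, s):
          self.parts.append("(?:" + re.escape(s) + ")+")
          return self

      def compile(self):
          return re.compile("".join(self.parts))

  p = Pattern().literal("ab").at_least_one("c").compile()
  print(bool(p.fullmatch("abccc")))  # True

  # A typo or an unsupported operation fails loudly and immediately:
  #   Pattern().literal("ab").atleast_one("c")  ->  AttributeError
  # whereas an engine without '+' support would quietly match a literal '+'.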
Perl 6 regexes attempt to improve upon this situation by making regexes more like a regular programming language. That is, they err on the side of error detection rather than encoding efficiency.
(It also adds features that would be difficult to add to Perl 5/PCRE regex design)
For a start, if it didn't support using `+`, then any attempt to use it would generate a compile error, because it is not alphanumeric.
(regex is code in Perl 6)
All non-alphanumeric characters are presumed to be metasyntactic, and so must be escaped in some way to match literally.
Arguably the best way is to quote it like a string literal.
(Uses the same domain specific sub-language that the main language uses for string literals)
  / "+" + / # at least one + character
It really is a significant redesign.
  /A{2,4}/      # Perl 5/PCRE
  /A ** 2..4/   # Perl 6

  /A (?:BA){1,3}/x   # Perl 5/PCRE
  /A [BA] ** 1..3/   # Perl 6: direct translation
  /A ** 2..4 % B/    # Perl 6: 2 to 4 A's separated by B

  /A (?:BA){1,3} B?/x   # Perl 5/PCRE
  /A ** 2..4 %% B/      # Perl 6: %% allows trailing separator

  /\" [^"]* \"/x         # Perl 5/PCRE
  /\" <-["]>* \"/        # Perl 6: direct translation
  /「"」 ~ 「"」 <-["]>*/   # Perl 6: between two ", match anything else
                         # (can be used to generate better error messages)
---

  # Perl 5
  my $foo = qr/foo/;
  'abfoo' =~ /ab $foo/x;

  # Perl 6
  my $foo = /foo/;
  'abfoo' ~~ /ab <$foo>/;

  # or
  my token foo {foo} # treat it as a lexical subroutine
  'abfoo' ~~ /ab <&foo>/;

---

  # Perl 5
  my $foo = 'foo';
  'abfoo' =~ /ab \Q$foo\E/x; # treat as string not regex

  # Perl 6
  my $foo = 'foo';
  'abfoo' ~~ /ab $foo/; # that is the default in Perl 6
What drove me crazy at first was that any character can be used as a delimiter. That's very useful, I admit. But it complicates understanding examples found through searching ;)
Yep! It's equivalent to `AA(A|AA)?`, by way of the parent's reduction for `?`, but it's a nice option when you want to emphasize some kind of symmetry.
1. Don't (unless there's an extremely good reason to do so). Ask yourself: is there a net benefit gained by forcing a developer to learn and use your DSL, ignoring their likely familiarity with general purpose languages that could solve the same problem? For starters, any application where humans won't be reading/writing a large amount of the DSL is probably out.
2. Avoid anything that resembles natural language. If you're writing a true natural-language interpreter, you're not writing a DSL. If you're writing a DSL that looks like natural language, people will be tempted to apply the grammar rules they already know from that language, rather than the strict grammar of your DSL, resulting in frustrating errors. It's a whole lot easier to memorise the rules in a language of keywords and symbols, because you don't have to first banish your existing knowledge of natural language.
3. Don't try to accommodate "non-technical" users performing intrinsically technical tasks. There's no point creating a "friendly" DSL over HTML and CSS if the people authoring in the DSL still require an in-depth knowledge of the box model, responsive web design etc. All you've done is kicked the can down the road and created a false sense of capability.
> 3. Don't try to accommodate "non-technical" users performing intrinsically technical tasks. There's no point creating a "friendly" DSL over HTML and CSS if the people authoring in the DSL still require an in-depth knowledge of the box model, responsive web design etc. All you've done is kicked the can down the road and created a false sense of capability.
In particular, Parse::RecDescent, Parse::Yapp, Parse::Lex, Parse::Flex, Regexp::Common, Net::IP, NetPacket::IP, ... really most of the things you'd want to parse from Apache::ParseLog to Parse::DNS::Zone
The inclusion of (not so) regular expressions in a language doesn't mean one needs to abuse them.
Interesting article, though the title should perhaps be `How not to design a Bad DSL` since the vast majority of the advice is apparently what NOT to do.
If the no-go space is larger and easier to fall into, it makes sense to have a more detailed danger map. It might stop a lot of folks from even taking the journey, which, unlike with real journeys, is the proper choice. If you are making a DSL for your own enjoyment, don't transmute your joy into someone else's pain. If end users have real issues that your DSL will solve, by all means, make it.
(EDIT: I replaced asterisks with <AST> in the pattern below, since HN doesn't have a convenient way to escape asterisks last I checked.)
> First, most of its syntax beyond the very basics like "X+" or "[^X]" is impossible to remember. It’d be nice to know what "(?<!X)" does without having to look it up first.
I realize regex can be hard to remember, but I think this is a little overblown. "(? ...)" is the general form of an extended group -- something for which "special behavior" occurs. The character sequence following that determines what the special behavior is. In this case, "<!" means "negative lookbehind": "<" for "lookbehind" and "!" for "negative". Compact, but mnemonic.
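For instance, in Python's re module (a quick check):

  import re

  # "(?<!foo)bar" matches "bar" only where it is NOT preceded by "foo"
  for m in re.finditer(r"(?<!foo)bar", "foobar bazbar"):
      print(m.start(), m.group())  # prints: 10 bar (only the one in "bazbar")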
Yes, this is very dense. But it's not really meant to be read at a glance -- URLs do not have an exceptionally simple pattern, and they have multiple parts, many of which are optional. You could write a recognizer for URLs explicitly, but I doubt it would be as immediately recognizable in full. I consider it more important to obtain a high-level understanding before a low-level understanding, and (for me at least) I can see that this regex matches URLs up front, setting my expectations for the details later.
I think a PEG or combinator parser would be more self-documenting, so I'm not saying regex can't be pushed too far. But it isn't nearly as unstructured as it looks.
> Many DSLs were designed to reduce amount of non-DSL code to the absolute zero. They try to help too much.
I 100% agree here. A DSL should be laser-focused on doing one thing well. Any language should be composed of orthogonal features, each laser-focused; a DSL is just a language with a very small handful of features geared toward a specific domain. (Of course, ideals are rarely realized in full.) I particularly like GraphQL as of late; it has a really nice feel to it.
I found an interesting paper [0] on design principles for DSLs while composing this comment. I've only skimmed it, but it looks quite nice.
Incidentally, here's how I might write that URL regex with comprehensibility outweighing all else. Notice that I enable free-spacing [0], which can usually be enabled by passing a flag rather than embedding it in the regex. In Python, the re.X (or re.VERBOSE) flag does the job.
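As a small illustration of that mode (a simplified pattern of my own, not the full URL regex):

  import re

  url = re.compile(r"""
      (?P<protocol> [a-z]{3,10} ) ://   # scheme
      (?P<domain>   [^/?#]+     )       # host
      (?P<path>     [^?#]*      )       # optional path
  """, re.X)

  m = url.fullmatch("http://perl6.org/foo/bar")
  print(m.group("protocol", "domain", "path"))
  # ('http', 'perl6.org', '/foo/bar')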