Alternatives to Regular Expressions (c2.com)
58 points by vezzy-fnord on June 21, 2015 | 53 comments



"I am not much of a fan of RegularExpressions. It is too hard to remember what the symbols stand for, for one. Asterisk for 0 or more repetition, plus for 1 or more repetitions, question mark for 0 or 1 occurrences, brackets surrounding a set of characters -- who's got time to memorize this kind of complexity? "

Somebody please tell me this is sarcasm...


Regular expressions have a very obviously inefficient and confusing syntax, full of incidental complexity, and to make matters worse, they're often a bit less powerful than we want them to be.

The best summary of the problems I’ve seen is from Larry Wall, highly recommended if you haven’t read it before.

First half of this page: http://perl6.org/archive/doc/design/apo/A05.html

Unfortunately, I think at this point regular expressions are too firmly entrenched in too many places to properly replace.


From the preamble of the page you linked:

> In fact, regular expression culture is a mess, and I share some of the blame for making it that way. Since my mother always told me to clean up my own messes, I suppose I'll have to do just that.

So Larry aimed to do better in Perl 6. Do you think he is as good at solving problems as he is at identifying them -- or more specifically, what do you think of the "Rules" system[1] he came up with for Perl 6?

[1] https://en.wikipedia.org/wiki/Perl_6_rules


They are not obviously inefficient and confusing. Many of us, especially the Perl geeks, find them relatively clear when well written.

It's a mini language, and like any language it can be clear or opaque in its meaning depending on the skill and intention of the creator.


The main issue I have with regular expressions is that the current syntax mixes "code" and "data".

I.e. it's very easy to write something like /https?:\/\/www.foo.com/ (e.g. you want to whitelist your own domain for some action) and forget that this also matches wwwxfoo.com, which could be owned by anyone (it's also a common mistake to forget the anchors, in which case evil.com/?https://www.foo.com might also match).
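
A quick sketch of both pitfalls (the hostnames are made up for illustration):

    var pattern = /https?:\/\/www.foo.com/;   // unescaped dot, no anchors
    pattern.test("https://www.foo.com");      // true, as intended
    pattern.test("https://wwwxfoo.com");      // also true: the '.' matches any character
    pattern.test("https://evil.com/?https://www.foo.com");  // also true: nothing is anchored
    var safer = /^https?:\/\/www\.foo\.com(\/|$)/;  // anchored, dots escaped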

It is pretty easy to come up with a better syntax, but it's going to be hard to convince everyone to use it. As someone else commented on this thread, the current way of doing things is pretty entrenched.


His examples were bad (asterisk etc.) but the point is valid. There is some hairy regular expression syntax that is pretty difficult to grok right off the bat. It could be simpler and clearer.


You could say that about any syntax. But actually coming up with a better syntax is much harder than just saying it can be done.


There's VerbalExpressions, which simply wraps common PCRE tokens into methods/functions: https://github.com/VerbalExpressions
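
For a flavor, this is roughly the URL example from the JavaScript port's README (method names vary slightly between the language ports):

    var tester = VerEx()
        .startOfLine()
        .then('http')
        .maybe('s')
        .then('://')
        .maybe('www.')
        .anythingBut(' ')
        .endOfLine();
    tester.test('https://www.google.com');  // true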


This is awesome, like life-changing awesome! I always felt an impedance mismatch between how my mind describes string patterns and how regexps are written. VerbalExpressions is the closest thing I've seen to how I naturally think.


This is going to rapidly replace every instance of regex in my code. The readability improvement is immeasurable. It's also conveniently similar enough to Python's verbose regex syntax.


Those are pretty good, if a little verbose.


You could say that about any syntax. But in most cases it would still apply much more to things like Perl regular expressions.


No one forces you to memorize every possible mnemonic: you remember the basic stuff that you use all the time, then you learn a bit more when you need it. And there is Google for the rest, just like with everything else. And I seriously doubt that it could be made simpler and clearer for everyone, because we all come from different backgrounds, work on different things, and our brains are by now hardwired to quickly handle different syntaxes. So it is a huge undertaking to rewrite something as complex as regexp syntax into something completely new that is also clearer and simpler to the majority of devs. Which of course doesn't mean one shouldn't try...


= for equals, or assign, or compare, < for less, > for more, + for adding, - for subtracting -- who's got time to memorize this kind of complexity?


Technically, compare is ==. = means "trash the left argument, then interpret the right argument as a boolean". :)
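
The classic slip, for anyone who hasn't been bitten yet (a made-up snippet):

    var x = 1;
    if (x = 0) {   // assignment, not comparison: x is trashed
        // never runs; the if sees the falsy result of the assignment
    }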



Perl-compatible regular expressions are not semantically equivalent to real regular expressions (as the article seems to claim). In fact, they correspond to completely different classes in the Chomsky hierarchy. PCREs are Turing complete (and for closely related reasons, extremely slow for some expressions), while regular expressions are isomorphic to FSMs (which always run in time linear in the length of their input).
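
The slowness is easy to demonstrate with the classic catastrophic-backtracking pattern (a sketch; exact behaviour depends on the engine):

    var evil = /^(a+)+$/;
    var input = new Array(31).join("a") + "!";  // 30 a's plus a non-matching tail
    evil.test(input);  // exponential backtracking in a PCRE-style engine;
                       // an FSM-based matcher rejects this in linear time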


I never understood why so many developers have problems with regex. In my experience, it's difficult because most people avoid it; once you use it enough times, the logic makes perfect sense.


I do not know if it's correlation or causation, but almost all code using regular expressions I encounter is horribly broken.

1. Regular expressions are often used instead of existing parsers. XML, CSV, file paths, URIs etc. all already have fast, well-tested and correct parsers (see the sketch at the end of this comment).

2. Often the thing worked on should not be a string in the first place. For example, a comma-separated string is used instead of a list, and regular expressions emulate list operations.

3. Some things could be processed much more easily with a different tool - for example, a recursive descent parser. Yet the developer still tries to parse arithmetic expressions with a hammer.

4. They are often developed via trial and error. They are either the first thing that worked for a simple case, or 10-line monsters riddled with exceptions to exceptions.

5. They are often part of hacks and workarounds. For example, the User-Agent is matched to work around bugs in browsers.

6. They give very limited feedback to the user. There is either a match or no match; there is no way to tell what or where something is broken.

There are valid use cases for regular expressions. It's even possible to write correct code with them. It's just a rare sight.
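
As a tiny illustration of point 1, pulling a hostname out of a URL with the standard WHATWG URL parser instead of a hand-rolled pattern:

    // Instead of something like /https?:\/\/([^\/]+)/:
    var host = new URL("https://www.foo.com/path?q=1").hostname;  // "www.foo.com"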


Many of these points are correct, but a little comment on #1: often you don't care about the whole structure, you just need some small piece of data from the middle of a huge document. One common example is a spider that collects the price of some product from a 200+KB webpage. You just need those few digits and don't care about the head or title or the structure of the DOM or anything else. In such cases (and it's a very common task for people working on data extraction) no complex parser can ever compete with the regexp in terms of speed and memory footprint. And if you need to parse a few million products, that performance gain is a huge deal. So don't underestimate the power of regexp when properly used...


That is fine for throwaway scripts. But such "perl duct tape" is not 100% accurate and will break for no reason. There is no place for such solutions in reliable and maintainable software.


Why would it "break for no reason"?! For all I know, a regexp matching one small piece of the page is far less prone to breaking than a parser that has to analyze the whole page. The designer changes one <div> or id/class somewhere in the top of the DOM tree and you can't reach the node that you are looking for anymore. The same goes for the regexp of course, but it's looking at a smaller portion of the html, so it's less likely to be affected by small changes in some unrelated part of the page. And any major redesign will break any dedicated scraper, no matter which parser it uses...


Let's try an example: extract the first link address from https://news.ycombinator.com/.

As DOM query:

    document.getElementsByClassName("title")[0].parentElement.getElementsByTagName("a")[1].href
This will break:

* When title element no longer has "title" class.

* When title is no longer a sibling of link.

* When link is no longer 2nd link of its parent.

As regular expression:

    document.documentElement.innerHTML.match('td class="title">.*a href="([^"]*)"')[1]
This will break:

* On any white space change.

* On any new attributes on td or a.

* When ' is used instead of "

* When href includes escaped "

* In most cases when DOM query will break.

Many of those can happen without any server-side changes. It will sometimes work and sometimes won't, making it hard to test.

There are cases when a regular expression will break less often than a DOM query, but the DOM is easier to reason about, more predictable, and has fewer corner cases.


And the XML parser will fail if the XML is not well formed, while the regex will just keep sailing along. I had an example of this with BlogPoster.py, which uses Python's xmlrpc. There are Wordpress hosts which return invalid XML, and this causes an exception. I reimplemented what I needed with Bash and cURL using regexes and it works fine.


I'm not sure if you mean that as pro or against regular expressions. That is a good example for #5. Regular expressions are great to quickly patch together something that kinda works, but:

1. Those hosts are still broken. The next person will have to jump through the same hoops to support them.

2. Your parser is very permissive. It will encourage people to create even more broken implementations.

3. The specification of this protocol is now worthless. There is no way to safely add new functionality. Any new element or attribute can break those regexes. Everyone has to take every implementation into account.

4. You are probably missing some corner cases like CDATA elements or quoted characters.

From your point of view, it probably makes sense to support even broken sites. But you are helping to create the next HTML - where every implementation works differently and you have to test everything on every browser.


I think it's because few take the time to actually understand them before using them, and usually make do with cobbling together a few expressions from 'tutorials' of varying quality. Reading the Friedl book at least once is essential. Then again, what do we expect when our field is filled with people whose 'education' consists of a 6-week 'hacker bootcamp' and a few weekends of watching Youtube videos...

(also, too many kids on my lawn etc)


That's a patronizing comment. Often the very education may be a stumbling block in the process of learning. Most users of regexps or any other device do not need to read a thick theory book on it. Cobbling together expressions from online tutorials is a perfectly good way to learn effectively. Admittedly, it is not the best way to write production code, but there are often economic considerations that override the expert's desire for perfect design and implementation.


"Often the very education may a stumbling block in the process of learning."

People don't want to learn because it's hard?

"Most users of regexps or any other device do not need to read a thick theory book on it."

Which is why they should at the very least read Friedl.

"Cobbling together expressions from online tutorials is a perfectly good way to learn effectively."

No it's not, otherwise we wouldn't have so many bad regexes and people asking silly questions about them, would we?

"Admittedly, it is not the best way to write production code, but there are often economic considerations that override the expert's desire for perfect design and implementation."

Nobody's talking about 'perfect design and implementation', just 'not be a moron' level. Because car analogies are everybody's favorite, let me use one here: no-one is saying that one should have a Formula 1 licence before driving (using regular expressions); just that one should have more than 20% vision in both eyes and not drive after drinking 5 beers while texting your wife that you're on your way. Which is the car-driving equivalent of the majority of regex uses in the wild.


> People don't want to learn because it's hard?

There is truth in that, but I meant to address your comment about other people's lack of education. Education (a certificate granted by a teaching authority), how good soever it may be, is not necessary to have knowledge or skill. Similarly, reading Friedl, how good soever his work is, is not necessary to learn or use regular expressions.


It's a transferable skill as well. The engine/syntax might differ a little from language to language, but the core concepts remain the same.


Perl 6's regular expressions and grammars appear to be fairly powerful and useful.

http://doc.perl6.org/language/regexes http://doc.perl6.org/language/grammars


> Maybe some use of XML

No. If you're even thinking about defining a language in XML you've probably already screwed up some place.


Lua has an alternative to regex, which is an extension of the C patterns (used in printf etc.).

http://lua-users.org/wiki/PatternsTutorial

At first I found them a bit confusing, but they're actually pretty great. For me, at least 90% of the tasks I want to do with regex can be done with scanf, and Lua's little extension covers the remaining 10% quite well.

I know lots of people often say that it would be nicer to use functional composition for regex instead of strings because strings are too confusing, but I disagree with this. The confusion of regex to me is not from the string representation; it is that some characters are "special" while others are "normal" (including whitespace). At first it appears that most characters are "normal", so you can start from some example and generalize the string until it matches all the things you want - but once you start putting parentheses and such in, you realize that most of the string won't be matched "normally" and it is better to start thinking like a grammar and write it from scratch. This double thinking is pretty annoying.

For this reason, to me the Lua patterns are really the only alternative I've come across to regex that I've liked. They've got nice compact expressive syntax, can really easily do most of the matching tasks I need due to the scanf base, (almost) all the "special" characters begin with %, and the complex cases can still be matched.


I read the tutorial you linked and I see no difference between Lua patterns and regexes, except that '\' has been replaced with '%', and '*?' with '-'. The only thing in common with scanf is the adoption of '%' as the control character; otherwise it's the same old regular expressions: '%d' matches just one digit, not a whole signed integer like in scanf.


OpenBSD's new web server has just started using these [1].

There is still some backtracking behaviour that means you can create expressions that are slow, but for most use cases it is not an issue.

[1] http://marc.info/?l=openbsd-tech&m=143480475721221&w=2


To everybody hating on regexes: remember that the alternative to even a simple regex is typically a poorly-implemented ad-hoc state machine dozens of lines long full of nested loops and conditionals. Without gotos if you're lucky.

_That_ is a bug breeding ground.
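
A tiny made-up illustration, "is this string a signed integer?", first as a regex and then hand-rolled:

    // Regex version:
    var isInt = function (s) { return /^[+-]?\d+$/.test(s); };

    // Hand-rolled version: already longer, and easy to get subtly wrong.
    function isIntLoop(s) {
        var i = 0;
        if (s[i] === '+' || s[i] === '-') i++;
        if (i === s.length) return false;  // a bare sign (or empty string) is not a number
        for (; i < s.length; i++) {
            if (s[i] < '0' || s[i] > '9') return false;
        }
        return true;
    }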

Now, if what you are after are regular expressions with a different syntax.... well, maybe you're onto something here. But I would still call it a regex, personally.


No, an alternative to write-only regexps is a nice, clean, dense yet readable BNF or PEG syntax. And no, those are not regular expressions with a different syntax; they're much higher up in the Chomsky hierarchy.


> To everybody hating on regexes: remember that the alternative to even a simple regex is typically a poorly-implemented ad-hoc state machine dozens of lines long full of nested loops and conditionals.

That's an alternative.

It's not the only alternative. Monadic parser combinators are a thing, after all.
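
A minimal sketch of the idea (hypothetical helpers, not any real library): a parser is a function from an input string to [value, rest] on success, or null on failure.

    var char = function (c) {
        return function (input) {
            return input[0] === c ? [c, input.slice(1)] : null;
        };
    };
    var or = function (p, q) {
        return function (input) { return p(input) || q(input); };
    };
    var seq = function (p, q) {
        return function (input) {
            var r1 = p(input);
            if (!r1) return null;
            var r2 = q(r1[1]);
            return r2 && [[r1[0], r2[0]], r2[1]];
        };
    };
    // 'a' followed by ('b' or 'c'):
    var abOrAc = seq(char('a'), or(char('b'), char('c')));
    abOrAc("ac");  // [['a', 'c'], ""]

The pattern is the combinator structure itself, so it composes and refactors like ordinary code instead of living inside a string.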


I have my own regex alternative, currently in development, with C-like syntax. Here is an approximate sketch of how it looks. The goal is that from this syntax I can get a full, hierarchical AST for parsing programming languages: http://pastebin.com/CFZf306p - until it's done, some smaller details can change.

The idea is that it can build multiple layers of matched items. In this example, layer 0 is "pattern lines"; then, on top of this, a mapped token representation is built as "pattern tokens".

And lastly, over the lines and tokens go AST objects, which are represented by language code.

Hierarchical and context-based pattern referencing and construction can be easily achieved.

Ideally, I would like to build this with a realtime update feature for use in a source code editor. In an IDE, when I write some characters in the code, these changes propagate into the parsed layer objects and update what is needed.


I'd hate to be writing a grammar every time I want to do a substring match. Regular expressions are convenient, and far more compact.


If you're doing a substring match, then use a substring function.
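
E.g. in JavaScript:

    "haystack".indexOf("stack") !== -1;  // true, no pattern language involved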


One alternative in the Wolfram Language is called StringExpressions. For instance, something like

StringReplace["Mad Hatter", "M"|"H"~~a_~~b_..:>"B"<>a<>"g"]

returns "Bag Bager"

The nice thing is that a string expression can use a RegularExpression as part of the pattern, but it's often not necessary.


"The nice thing"

Nothing in this post is a nice thing :)


I rarely have trouble constructing or understanding RegExes, but I do have an awful time getting quoting correct and remembering which special characters have to be escaped in any one particular language.


In Perl 6 the regex character escaping rule is:

> Alphanumeric characters and the underscore _ are literal matches. All other characters must either be escaped with a backslash (for example \: to match a colon), or included in quotes.

From http://doc.perl6.org/language/regexes


I go from BRE to flex to spitbol depending on the situation. Never had a need for PCRE. Snobol can match anything.

k/q also has pattern matching.

Personally, I get more mileage out of BRE than anything else. Simple and effective.


Wouldn't Marpa [1] be an alternative to REs?

[1]: http://marpa-guide.github.io/chapter1.html



I stopped reading at: "Besides, grammars are more complex than regular expressions, so they're simpler."


You are right. The article introduces those ? + * in the grammars again, though grammars are actually simpler in some respects.


Problems arise with regular expressions when they are used against irregular grammars.
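
The canonical irregular case is nested structure, e.g. balanced parentheses (a small sketch):

    // A regex can grab one innermost pair, but no regular expression
    // can balance arbitrary nesting depth:
    "(a (b) c)".match(/\(([^()]*)\)/)[0];  // "(b)" - the outer pair is out of reach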


I'm almost always using PEG where regexps are typically applied.


Perl 6 unifies "regexes", PEGs, and lexical closures.[1]

[1] https://en.wikipedia.org/wiki/Perl_6_rules



