Hacker News
The true power of regular expressions (2012) (nikic.github.io)
163 points by doomrobo on Oct 14, 2018 | 38 comments



I find this to be an interesting side effect of computer science as a branch of mathematics vs applied computer science in "programming" and "development".

I remember touring the computer science building at the university I would end up attending before I had enrolled there - some students in one room were discussing whether there was a regular expression for detecting prime numbers. I laughed to myself - because in my mind, that's not at all what a regular expression was.

Sure enough, there is in fact such a regular expression (or "regex pattern", if you prefer)[1]. Regular expressions in practice encompass BOTH the academic level-3 regular grammar stuff, and a bunch of other stuff that's been tacked on because it's useful in real-world applications.

[1] - This has been posted on HN before, probably a bunch of times: https://iluxonchik.github.io/regular-expression-check-if-num...


This regex doesn't tell whether a number is prime, it tells whether the length of a specially crafted string is prime.

Kind of a big difference as you can make a regex that answers any yes/no question if you preprocess the input enough.
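For instance, the well-known unary primality pattern only works after that preprocessing step. A minimal Python sketch:

```python
import re

# The classic unary primality trick: write n as a string of n ones. A string
# of ones has composite length exactly when it splits into two or more equal
# groups of two or more ones each -- which a backreference can express.
NOT_PRIME = re.compile(r'^1?$|^(11+?)\1+$')

def is_prime(n):
    return NOT_PRIME.match('1' * n) is None

print([n for n in range(2, 20) if is_prime(n)])
# [2, 3, 5, 7, 11, 13, 17, 19]
```

Note that all the number-theoretic work is done by the `'1' * n` encoding; the regex itself only checks a property of string lengths.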


Well, saying that a number presented as unary isn't a number, to me, is the same as saying that a number presented as decimal isn't a number. But you're right that it is "specially crafted" rather than a general solution.


While it's true that PCRE is widely used, there are also several regexp libraries that do not veer far from formal regular language theory. Most notably Google's RE2, which is built on a finite automaton engine.


Yes, there’s a claim in TFA about modern implementations going well beyond regularity, and, well, if you consider PCRE modern, then sure, just as T. S. Eliot is Modern. Now _contemporary_ regexp engines, OTOH...


> So if somebody found a fast solution to a NP-complete problem, pretty much all of the computationally hard problems of humanity would be solved all in one strike.

There are plenty of interesting computationally hard problems that are (probably) harder than NP-complete problems, e.g. PSPACE-complete problems. For example solving chess, or more on topic, the problem of "does this regular expression generate all strings over this alphabet"?


Small but relevant nitpick:

Generalized chess is EXPTIME-complete. Standard chess is played on an 8x8 grid with a finite number of states, so in theory chess is solvable in constant time (because you can provide a finite lookup table).


If I understand correctly, it's only EXPTIME-complete if you allow an exponential number of moves? If you only allow a polynomial number of moves, then it's PSPACE-complete.

But yes good point about specific board sizes :)


My takeaway from this, since the majority of it went over my head: I can continue to parse HTML with regular expressions.


I disagree with that takeaway.

Technically PCRE regexes are powerful and can match anything.

In reality, a complex PCRE regex will almost always be more difficult to maintain than a parser-combinator or hand-rolled parser in a more traditional language.

People saying "Regexes can't match HTML, use an html library" are wrong to say regexes are incapable of it, but they're right to say to use a library meant for the job.

Sure, you can use this regex to match emails (http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html), but using a more normal parsing language than PCRE regex will result in more readable code.

The same is true for almost any regular expression that takes advantage of PCRE features, especially backreferences.
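Backreferences are precisely where the extra power (and the readability cost) comes in; even this tiny stdlib example is already beyond what a true finite automaton can do:

```python
import re

# A backreference makes the pattern non-regular: no finite automaton can
# check that an arbitrary-length word repeats verbatim.
repeated_word = re.compile(r'\b(\w+)\s+\1\b')

m = repeated_word.search('this sentence has has a duplicated word')
print(m.group(1))  # has
```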

In addition, a regexp will only match HTML correctly if you write a very complex one. With a naive regexp for an HTML tag's contents, you'll find that you might still match text inside a <script> tag, even though that is not HTML. So now you need to figure out when you're inside a script tag and exclude that, or when you're inside an HTML attribute string, and before you know it you have a 2000-character regexp that no one else will be able to read, all because you didn't want to use an HTML parsing library where getting a tag's value correctly would be a single XPath expression or CSS selector away.
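By way of contrast, here is a minimal stdlib-only sketch (class name invented for illustration) of the parser approach, which handles the <script> case the naive regexp gets wrong:

```python
from html.parser import HTMLParser

# Collect visible text while skipping <script> content. The parser tracks
# tag state for us, so markup inside a script string is never mistaken
# for real HTML.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script:
            self.chunks.append(data)

p = TextExtractor()
p.feed('<p>hello</p><script>var x = "<p>not text</p>";</script><p>world</p>')
print(''.join(p.chunks))  # helloworld
```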


There's no arguing the fact that regexes are a poor fit for HTML, but maybe this is the wrong time to use that ridiculous email regex as an example, since TFA features a highly readable, fully compliant email matching regex as its main example.


It also doesn't point out that matching email addresses in general is a nightmare because the standard is one of those "we'll just allow everything everybody is doing right now" type standards that have a million different little quibbles.

No matter what language or programming style you use it's going to be ugly because it's an ugly problem.


What the article contains as its main example is a PCRE pattern that matches email addresses, and in comparison to its original EBNF incarnation it is highly unreadable, due to all the syntax noise required to graft backtracking support onto traditional regex syntax (not to mention that its performance is certainly highly suboptimal).


The performance might be better in some cases actually.

In the case of Perl, where regexp performance has been significantly optimized, the regexp actually performs better than a more conventional parser.

From http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html:

> It provides the same functionality as RFC::RFC822::Address, but uses Perl regular expressions rather than the Parse::RecDescent parser. This means that the module is much faster to load as it does not need to compile the grammar on startup.

Of course, if perl were a statically compiled language, the cost of compiling the grammar could be done at compile time.


Perl 5 is not too meaningful for such performance comparisons, because on one hand its regex implementation is very optimized, while on the other hand Perl 5's performance on "normal" procedural code is horrible (e.g. Perl 5 is about an order of magnitude slower than CPython on Gabriel's Takeuchi function benchmark).


This result generalises to most interpreted languages, though. PHP, Python, JavaScript, etc. all have highly optimised regex engines, and regexes can consequently be a good optimisation technique in those languages.
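A rough way to see this in Python (the timings are illustrative and machine-dependent, not a benchmark):

```python
import re
import timeit

# Why regexes can be an optimisation in interpreted languages: the regex
# engine loops in C, while the hand-written scan loops in Python bytecode.
text = 'order 66, aisle 7, item 1234 ' * 1000

def scan_digits(s):
    out, cur = [], ''
    for ch in s:
        if ch.isdigit():
            cur += ch
        elif cur:
            out.append(cur)
            cur = ''
    if cur:
        out.append(cur)
    return out

# Both approaches extract the same runs of digits...
assert re.findall(r'\d+', text) == scan_digits(text)

# ...but the C-backed engine is typically much faster.
print(timeit.timeit(lambda: re.findall(r'\d+', text), number=100))
print(timeit.timeit(lambda: scan_digits(text), number=100))
```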


Using regex to match email addresses isn't actually a good idea either.

https://blog.onyxbits.de/validating-email-addresses-with-a-r...


> People saying "Regexes can't match HTML, use an html library" are wrong to say regexes are incapable of it, but they're right to say to use a library meant for the job.

Plot twist: the HTML library is built upon regexes (at least in part).


Every parser is partially built upon regexes. You have to go all the way to Haskell, Prolog, or similar languages before you get better options than regexes for building them.

But parsers are not built solely of regexes. They always add control structures that complement regexes in the places where they are weakest.
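That division of labour shows up even in a tiny hand-rolled lexer, where regexes do the tokenizing and ordinary control flow does the rest (the token names and toy grammar here are invented for illustration):

```python
import re

# A toy lexer: each token class is a small regex, and surrounding Python
# code handles what regexes are worst at (sequencing, filtering).
TOKEN_RE = re.compile(r'''
    (?P<NUMBER>\d+)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<OP>[+\-*/=])
  | (?P<WS>\s+)
''', re.VERBOSE)

def tokenize(text):
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup != 'WS':          # control structure, not regex
            yield (m.lastgroup, m.group())

print(list(tokenize('x = 40 + 2')))
# [('NAME', 'x'), ('OP', '='), ('NUMBER', '40'), ('OP', '+'), ('NUMBER', '2')]
```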


> Sure, you can use this regex to match emails

Or you can use comments and named groups: https://stackoverflow.com/a/1917982


Even real regular regexes can be used when nesting is limited, which is true for most real-world HTML, XML, and JSON. Still, you're better off using libraries.


The exact wording of the statement matters. You can parse HTML with a powerful regular expression, but it's not a good tool for the job. That said, I find it a wonderful tool for extracting specific portions of an HTML document.

If you actually just care about retrieving a few specific bits of data within a page, I've found parsing libraries (including ones that allow for CSS selectors) to be just as brittle to changes as regular expression extraction, and not all that much easier to use, given a good grasp of both technologies.

That said, if you need to alter an HTML document in some non-trivial way, parsing is probably the way to go.


We had two versions of a particular app once. One used BeautifulSoup to parse the page and pull out the relevant elements. The other used some crusty old Regex patterns. At the end of the day the Regex version required about half of the maintenance the tag soup version did. IMHO the difference was that it took some of the content into consideration unlike the tag only version that was more sensitive to otherwise invisible changes under the hood.


Not to mention the sheer difference in performance between the two. I've found regexes to be orders of magnitude faster than parsers, for extracting data, that is.


The author's stated stance is that you can, but usually shouldn't.

Personally, I think to myself before using regexes to parse HTML:

1. Am I extracting more than the contents of a single element?

2. Is the input HTML prone to change?

3. Will there be issues if the parsing completely fails?

4. Do I plan to use this code for more than a few months?

If the answer to any of the questions is "Yes", then don't use a regex :)
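When all four answers are "no", a throwaway extraction like this (the markup is invented for illustration) is about the scale at which a regex stays defensible:

```python
import re

# A one-off scrape of a page you control: a single element, stable input,
# harmless on failure, short-lived code.
html = '<html><head><title>Weekly report</title></head><body>...</body></html>'

m = re.search(r'<title>(.*?)</title>', html)
print(m.group(1))  # Weekly report
```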


See "Oh Yes You Can Use Regexes to Parse HTML!"[0] on Stack Overflow by Tom Christiansen.

[0]: https://stackoverflow.com/a/4234491/123109


Have you tried using an XML parser instead?


HTML is not valid XML.

Some parsers may survive, but it's different enough to break most of them.


Wow! Fascinating! The portion displaying the regular expression for matching email addresses as per RFC 5322 was excellent. I've not once seen a regular expression written out this way.


Related content on Stack Overflow: "Regex: How to find and extract acronyms and corresponding definition of acronym from a text?"[0], which gives nearly the same RFC 5322 parser.

The reduction of 3CNF-SAT seems to be borrowed from "Reduction of 3-CNF-SAT to Perl Regular Expression Matching"[1], whose author is careful to note that the reduction shows that regex matching with backreferences is NP-hard. To show that it is NP-complete, you'd also have to show that regex matching with backreferences is also in NP, i.e., that any match is checkable in polynomial time.

Readers may also be interested in "Oh Yes You Can Use Regexes to Parse HTML!"[2] on Stack Overflow by Tom Christiansen.

[0]: https://stackoverflow.com/a/18339610/123109

[1]: https://perl.plover.com/NPC/NPC-3SAT.html

[2]: https://stackoverflow.com/a/4234491/123109



Wrapping up: as the article was quite long, here's a summary of the main points:

- The “regular expressions” used by programmers have very little in common with the original notion of regularity in the context of formal language theory.

- Regular expressions (at least PCRE) can match all context-free languages. As such they can also match well-formed HTML and pretty much all other programming languages.

- Regular expressions can match at least some context-sensitive languages.

- Matching of regular expressions is NP-complete. As such you can solve any other NP problem using regular expressions.

I wish he had written that at the beginning of the article. It would have saved me a lot of skimming (I already know grammar theory).
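For what it's worth, the context-free claim in that summary rests on PCRE's recursive subpatterns. An illustrative sketch that matches sequences of balanced parentheses, a textbook context-free but non-regular language:

```
^(\((?1)*\))*$
```

Group 1 matches one balanced group, and `(?1)` recursively re-enters it, which is exactly what no finite automaton (and hence no truly regular expression) can do.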


It's only OK to parse something with a regex if it was defined with a regex. Far too often I see people wanting to match postcodes and other things which are not defined by a regex, so their heuristic could break at any time.


What about postcodes makes heuristics break for regex but not a general parser? I'd assume if the format changed it'd break both all the same.


As a lead developer, I frown upon every instance of regex use. If it's shorter than, say, 20 characters, it may be allowed. Anything longer than that will result in a negative advice and a binding advice to refactor. You implement a language lexer and parser, or use plain string functions.


I’d expect you, “as a lead developer,” to know how the tokenizer in your lexer almost certainly works. Rather than frowning upon “every instance of regex used,” I’d expect you to be familiar with cleaner and more maintainable ways of writing regexes, which are, after all, just declarative ways of specifying finite automata. An NFA is infinitely easier to characterize than some mess of a hand-written recursive descent parser, precisely because NFAs are so restricted.


You might be interested in the parse Python library:

https://github.com/r1chardj0n3s/parse


On the contrary, I find sed quite valuable.



