One of my pet peeves is how PCRE destroyed the definition of "regular" in regular expressions. It has basically made a huge number of programmers illiterate as to what is truly regular in the formal sense.



That wasn't PCRE. That was Perl. PCRE just copied the features that people liked in Perl, and made them available elsewhere.

The same features have also been implemented independently many times. For example, Ruby, Java and Python all have independent implementations of regular expressions with a lot of the same features.


When were back-references introduced in sed?


http://sed.sourceforge.net/sedfaq3.html

> Grouping and backreferences: All versions of sed support grouping and backreferences on the LHS and backreferences only on the RHS.


I saw that. I wasn't sufficiently confident as to whether it meant "all historical versions" or just "all current versions."

If the former, then it predated perl by quite a bit, of course.


Perl started with Henry Spencer's regular expression library, which was written in the mid-1980s. That got used in a lot of places, and may have been used in sed. It includes backreferences. I don't know what predates that.

Perl used that as a starting place and began adding extensions such as the difference between .* and .*?.
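(For anyone who hasn't run into that particular extension: .* matches greedily, .*? lazily. A quick illustration in Python, whose re module inherits the Perl syntax:

  import re

  s = '<a><b>'
  # Greedy: .* grabs as much as it can, so the match runs
  # from the first '<' to the last '>'.
  print(re.search(r'<.*>', s).group())   # -> <a><b>
  # Lazy: .*? grabs as little as it can, stopping at the
  # first '>'.
  print(re.search(r'<.*?>', s).group())  # -> <a>
)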

So using "non-regular thing" as a regular expression predates Perl. But the dialect we standardized on with the features people now expect is due to Perl's influence.


Ah I get you.

I've tried finding some source for early versions but all the links I've found are dead. It's an interesting question!


But why should people care about what is regular in the formal sense? In this context, "regular" would rather mean that the pattern can be recognized with a restricted type of algorithm, one that resembles the formalism.


http://davidvgalbraith.com/how-i-fixed-atom/

http://stackstatus.net/post/147710624694/outage-postmortem-j...

https://swtch.com/~rsc/regexp/regexp3.html

The difference isn't academic. It has huge practical implications, from software that is just annoyingly slow to DoS attacks.

People need to understand that. And stop using non-regular regexes (or better yet: regexes in general) in production code. It increases world-suck.


It destroys mechanical sympathy. Many people can no longer understand why some of their regex calls are so slow or why some are so much quicker.


PCRE doesn't implement that "restricted type of algorithm", and I think that's what the parent is suggesting is wrong, not so much the difference between a formal understanding and an informal intuition of the same subject (I would argue that the latter would be sufficient, but few even have that).

Understanding REs (a little more) formally, one can exploit their properties for performance and/or concision. For an example of what that understanding gives you, I'd recommend checking out the Ragel state machine compiler.


They were named "regular" in the formal sense. Thus if they are not regular, it is a misnomer. It would have been just as useful to call them "Perl Matching Expressions", but without all the additional confusion about what a regular expression can do.


Is there a standard for which additional features one can add on top of regular expressions in the original limited sense and still be considered a regex?


"regular language" is a mathematical term. You can add any features you want to your matcher, but some features allow you to match languages that are not "regular".

However, most programmers are not mathematicians or computer scientists, and so are effectively using lay terminology which is similar to but not precisely the same as the mathematical terminology. Most programmers only care that their regex library allows them to concisely match strings, not that it allows them to match regular languages.

Thus there are two standards you could follow: the precise mathematical definition of "regular", or the lay definition.


I don't buy the layman argument. Programmers might not be mathematicians or computer scientists, but they should be professionals and held to the same standards as professionals in other fields. For example, a mechanical engineer doesn't use a "layman's" definition of tensile strength when choosing materials for a design.


No, but he probably does use layman's terms when describing molecular lattices and electron shells. If he ever does.

The point here is that users of (colloquial) regular expressions, however professional they are or aren't, are just engaging with the surface of a deeper set of mathematical principles that require greater nuances of meaning to discuss than their applications do.


I basically agree, but I would say that a slightly deeper understanding of regular expressions is a really useful thing for basically all professional programmers to have. If for no other reason, they could then recognize the "pathological" cases that backtracking implementations (i.e. almost all modern regex implementations) would need until the heat death of the universe to evaluate.
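A pathological case is easy to construct, for the curious: nest one repetition inside another and feed the engine an input that almost matches. A small timing sketch with Python's backtracking re module:

  import re
  import time

  # '(a+)+$' on 'aaa...ab' forces the engine to try exponentially
  # many ways of splitting the run of a's before it can fail.
  for n in (18, 20, 22, 24):
      t = time.perf_counter()
      re.match(r'(a+)+$', 'a' * n + 'b')
      print(n, 'chars:', round(time.perf_counter() - t, 2), 's')

The running time roughly doubles with every character added; a DFA-based engine would answer the same question in a single linear pass.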


> Programmers might not be mathematicians or computer scientists, but they should be professionals

What? No. Many programmers out there are just working on fun projects in their spare time. I'm sure there exist some aerospace engineers doing that, but it's not many. I don't see why your average bedroom open source programmer should be treated like an aerospace engineer.


Programmers have more freedom from the constraints of reality than mechanical or civil engineers.


What the fuck are you talking about? Those aren't even remotely the same thing at all.


Right. My point (which may not be a particularly great point, but it was what I was trying to convey) is that the lay definition doesn't have any particularly tight borders; presumably different regex libraries are capable of matching different things, use different syntax, etc.

(And this is part of why I find it confusing terminology and would prefer "regular expression" only to mean specifications of regular languages in the original formal sense. But, of course, I'm not the king of language, and it's natural that people would speak the way they speak, given the circumstances.)


The lay definition isn't a definition, per se, it's a misappropriation of the term. Since the lay user only cares about text matching, the term for what they need is "text matching syntax" or some such thing.

And surely you don't think those implementing the regular expression engine are lay people. They should know better.


> there are two standards you could follow: the precise mathematical definition of "regular", or the lay definition

Considering this writeup leads with a giant headline saying "Regular Expressions - The Theory"... I think the complaints about how it's not actually a regular expression are even more valid than usual.


I always understood it as being whatever can be straightforwardly tacked onto a typical DFA implementation. I thought that's how people came up with it: whichever extras were the easiest to implement without mucking anything else up, so in a way they "came for free". (It's possible I misunderstood, I don't know for sure.)


Well, sure, but "whatever can be straightforwardly tacked on" is highly subjective, no?


No, it is not.

You can invent a variety of notations to describe a DFA. But you can't change the formal definition of a DFA, or what kinds of things a DFA can match without making it not a DFA.


I'll say to you what I said above, but even more so: Er... I'm not sure what you're saying here, so I'll clarify what I was saying. What I mean is, we all agree on what DFAs are, and so we all agree on what regular languages in the original, Kleene sense are. But once we start admitting non-regular languages into our notion of "regular expression" because they're "straightforwardly tacked on", it's unclear what the limits are supposed to be.

I'm not actually a fan of using "regular expressions" in a sense broader than that of Kleene's regular, DFA-accepted languages, to be clear. That is exactly what I've been questioning.


You can actually quite easily add features to regular expressions that are typically excluded while still remaining regular. Examples would be an `and` or a `not` operator.
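A minimal sketch of why those stay regular, using a toy DFA encoding (a transition dict, a start state and an accepting set; the names here are made up for illustration): `not` is just flipping the accepting states of a complete DFA, and `and` would be the classic product construction running two machines in lockstep.

  # Sketch: a DFA as (transitions, start state, accepting states).
  def run(dfa, s):
      delta, state, accepting = dfa
      for c in s:
          state = delta[(state, c)]
      return state in accepting

  # 'not': the same machine with acceptance complemented.
  def complement(dfa):
      delta, start, accepting = dfa
      states = {q for q, _ in delta} | set(delta.values())
      return (delta, start, states - accepting)

  # Complete DFA over {a, b} accepting strings that contain 'ab'.
  delta = {(0, 'a'): 1, (0, 'b'): 0,
           (1, 'a'): 1, (1, 'b'): 2,
           (2, 'a'): 2, (2, 'b'): 2}
  has_ab = (delta, 0, {2})
  print(run(has_ab, 'ba'))               # False
  print(run(complement(has_ab), 'ba'))   # True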


What I am saying is that you can come up with a lot of extensions of the basic regular expression language. But there is no subjectivity at all in whether the extensions can be implemented by a DFA. Either you can, or you can't.


Not really. A graph is either a valid DFA or it is not. At that point, "straightforwardness" is a matter of tooling.


Er... I'm not sure what you're saying here, in that it doesn't seem to be engaging with what I was saying, or trying to say, so I'll clarify what I was saying. What I mean is, we all agree on what DFAs are, and so we all agree on what regular languages in the original, Kleene sense are. But once we start admitting non-regular languages into our notion of "regular expression" because they're "straightforwardly tacked on", it's unclear what the limits are supposed to be.

I'm not actually a fan of using "regular expressions" in a sense broader than that of Kleene's regular, DFA-accepted languages, to be clear. That is exactly what I've been questioning.


I think the intent of the parent comments was to say that it's fine to consider a regex a standard "regular expression" in the mathematical sense if it still translates to a DFA and matches a regular language. For instance, mathematical regular expressions typically only include '*' (0 or more), not '+' (1 or more). However, you can still translate a regex including '+' to a DFA. The same goes for arbitrary repetition mechanisms like {m,n}, case-insensitivity, and other features that add convenience without actually matching non-regular languages.
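(A toy desugaring makes that concrete: '+' and '{m,n}' reduce to the core operators, so they add convenience but no power. Single-character atoms assumed; a real implementation would insert grouping:

  def plus(r):          # r+     ==  r r*
      return f'{r}{r}*'

  def repeat(r, m, n):  # r{m,n} ==  m copies, then n-m optional copies
      return r * m + f'{r}?' * (n - m)

  print(plus('a'))          # aa*
  print(repeat('a', 2, 4))  # aaa?a?
)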

On the other hand, backreferences can allow matching non-regular languages.
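For example, a single backreference is enough to match the language of "squares" (ww), a textbook non-regular language that no DFA can recognize:

  import re

  # ^(.+)\1$ matches exactly the strings of the form ww.
  square = re.compile(r'^(.+)\1$')
  print(bool(square.match('abcabc')))  # True  (w = 'abc')
  print(bool(square.match('abcabd')))  # False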


Ah, yeah, I see now that that is indeed probably what they meant; thanks! In that case, yes, I'm sympathetic to that, using regex to just mean "Anything that ultimately describes a regular language".


I don't think so, not too much anyway. I think the "software engineering" aspects of it would place reasonably strong constraints on what is desirable and on what is feasible (for a DFA). That's probably how the (informal?) consensus emerged.


If you can compile any string in the language to a DFA then it's equivalent in power to regular expressions.


Maybe we should stop talking about regular expressions and talk about pattern matching instead...


Sure, but on the pragmatic hand, "regular" isn't super useful unless you're designing a language yourself. If you're parsing something, not understanding "regular" is going to hurt a lot less than attempting to use PCRE to do something it can't.


Regular expressions have implications for space and time complexity. PCRE tosses that away.


Indeed. To support backreferences, "regex" libraries are forced to use algorithms that can be very slow in the worst case.

The sad thing is that the libraries use the same algorithms even if the expression doesn't contain backreferences. A while ago, Stack Overflow had a brief outage because of regular expression performance, although the expression that caused it didn't even use backreferences or other non-regular features:

http://stackstatus.net/post/147710624694/outage-postmortem-j...

In contrast, Google Code Search - when it still existed - supported regular expression searches over the world's public codebases. One key ingredient that made this possible was using only proper regular expressions:

https://swtch.com/~rsc/regexp/regexp4.html
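RE2's approach is available from Python too. A sketch assuming the google-re2 bindings (`pip install google-re2`, imported as re2), which are meant to mirror the stdlib re API:

  import re2

  # Automaton-based matching: linear time even on the kind of
  # input that blows up a backtracking engine.
  print(bool(re2.match(r'(a+)+$', 'a' * 100000 + 'b')))  # False, instantly

  # Non-regular features are rejected at compile time.
  try:
      re2.compile(r'(.+)\1')
  except Exception as e:  # exact exception type depends on the binding
      print('backreferences rejected:', e)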


Regular is actually incredibly useful because it allows you to confidently parse or validate (large amounts of) untrusted data without having to worry about DoS attacks.

I'd argue that this is useful, even if you use it as part of a parser that doesn't just parse regular languages, as you'll have a slightly easier time reasoning about the time and space complexity of the construction.
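One concrete illustration of that easier reasoning: the exponential blowup shown further up is a property of the engine, not the language. '(a+)+$' and 'a+$' denote the same regular language, and the unambiguous form is fast even in a backtracking engine:

  import re
  import time

  s = 'a' * 1_000_000 + 'b'
  t = time.perf_counter()
  re.match(r'a+$', s)  # same language as (a+)+$, but unambiguous
  print(round(time.perf_counter() - t, 3), 's')  # milliseconds, not eons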



