But why should people care about what is regular in the formal sense? In this context, "regular" would rather mean that it can be recognized with a restricted type of algorithm, one that resembles the formalism.
PCRE doesn't implement that "restricted type of algorithm", and I think that's what the parent is suggesting is wrong, not so much the difference between a formal understanding and an informal intuition of the same subject (I would argue that the latter would be sufficient, but few even have that).
By understanding REs (a little more) formally, one can exploit their properties for, e.g., performance and/or concision. For an example of what that understanding gives you, I'd recommend checking out the Ragel state machine compiler.
They were named "regular" in the formal sense. Thus if they are not regular, it is a misnomer. It would have been just as useful to call them "Perl Matching Expressions", but without all the additional confusion about what a regular expression can do.
Is there a standard for which additional features one can add on top of regular expressions in the original limited sense and still be considered a regex?
"regular language" is a mathematical term. You can add any features you want to your matcher, but some features allow you to match languages that are not "regular".
However, most programmers are not mathematicians or computer scientists, and so are effectively using lay terminology which is similar to but not precisely the same as the mathematical terminology. Most programmers only care that their regex library allows them to concisely match strings, not that it allows them to match regular languages.
Thus there are two standards you could follow: the precise mathematical definition of "regular", or the lay definition.
I don't buy the layman argument. Programmers might not be mathematicians or computer scientists, but they should be professionals and held to the same standards as professionals in other fields. For example, a mechanical engineer doesn't use a "layman's" definition of tensile strength when choosing materials for a design.
No, but he probably does use layman's terms when describing molecular lattices and electron shells. If he ever does.
The point here is that users of (colloquial) regular expressions, however professional they are or aren't, are just engaging with the surface of a deeper set of mathematical principles that require greater nuances of meaning to discuss than their applications do.
I basically agree, but I would say that a slightly deeper understanding of regular expressions is a really useful thing for basically all professional programmers to know. If for no other reason than to recognize the "pathological" cases that backtracking implementations (i.e. almost all modern regex implementations) would need until the heat death of the universe to evaluate.
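For a concrete (if hedged) sketch of such a pathological case, here's the classic nested-quantifier pattern run through Python's backtracking `re` module. Exact timings depend on the engine and version, but the runtime roughly doubles with each extra character; an automaton-based engine such as RE2 or Rust's `regex` handles the same input in linear time.

```python
import re
import time

# (a+)+b with no 'b' in the input: a backtracking engine tries exponentially
# many ways to split the a's between the inner and outer '+', all of which fail.
pattern = re.compile(r"(a+)+b")

for n in (18, 20, 22, 24):          # bump n with care: each +2 roughly quadruples the time
    subject = "a" * n               # no 'b', so the match can only fail, slowly
    start = time.perf_counter()
    pattern.match(subject)
    print(f"n={n}: {time.perf_counter() - start:.2f}s")
```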
> Programmers might not be mathematicians or computer scientists, but they should be professionals
What? No. Many programmers out there are just working on fun projects in their spare time. I'm sure there exist some aerospace engineers doing that, but it's not many. I don't see why your average bedroom open source programmer should be treated like an aerospace engineer.
Right. My point (which may not be a particularly great point, but it was what I was trying to convey) is that the lay definition doesn't have any particular tight borders; presumably different regex libraries are capable of matching different things, use different syntax, etc.
(And this is part of why I find it confusing terminology and would prefer "regular expression" only to mean specifications of regular languages in the original formal sense. But, of course, I'm not the king of language, and it's natural that people would speak the way they speak, given the circumstances.)
The lay definition isn't a definition, per se; it's a misappropriation of the term. Since the lay user only cares about text matching, the term for what they need is "text matching syntax" or some such thing.
And surely you don't think those implementing the regular expression engine are lay people. They should know better.
> there are two standards you could follow: the precise mathematical definition of "regular", or the lay definition
Considering this writeup leads with a giant headline saying "Regular Expressions - The Theory"... I think the complaints about how it's not actually a regular expression are even more valid than usual.
I always understood it as being whatever can be straightforwardly tacked onto a typical DFA implementation. I thought that's how people came up with it: whichever extras were the easiest to implement without mucking anything else up, so in a way they "came for free". (It's possible I misunderstood, I don't know for sure.)
You can invent a variety of notations to describe a DFA. But you can't change the formal definition of a DFA, or what kinds of things a DFA can match without making it not a DFA.
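For concreteness, here's a minimal sketch of that formal definition in Python: a DFA is just a finite set of states, an alphabet, a transition function, a start state, and a set of accepting states. The particular machine below (accepting strings with an even number of a's) is only an illustration, not anything canonical.

```python
# A DFA is (Q, Sigma, delta, q0, F): finite states, an alphabet, a transition
# function, a start state, and a set of accepting states. This one accepts
# strings over {a, b} with an even number of 'a's, i.e. the language of the
# regular expression (b*ab*a)*b*.
DELTA = {
    ("even", "a"): "odd",
    ("even", "b"): "even",
    ("odd", "a"): "even",
    ("odd", "b"): "odd",
}
START, ACCEPTING = "even", {"even"}

def accepts(s: str) -> bool:
    state = START
    for ch in s:
        state = DELTA[(state, ch)]  # one table lookup per input character, no backtracking
    return state in ACCEPTING

print(accepts("abba"))  # True  (two a's)
print(accepts("ab"))    # False (one a)
```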
I'll say to you what I said elsewhere in this thread, but even more so: Er... I'm not sure what you're saying here, so I'll clarify what I was saying. What I mean is, we all agree on what DFAs are, and so we all agree on what regular languages in the original, Kleene sense are. But once we start admitting non-regular languages into our notion of "regular expression" because they're "straightforwardly tacked on", it's unclear what the limits are supposed to be.
I'm not actually a fan of using "regular expressions" in a sense broader than that of Kleene's regular, DFA-accepted languages, to be clear. That is exactly what I've been questioning.
You can actually quite easily add features to regular expressions that are typically excluded, while still staying actually regular. Examples would be an `and` or a `not` operator, since regular languages are closed under intersection and complement.
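For what it's worth, a minimal sketch of why an `and` operator stays regular: intersection is handled by the standard product construction (and complement by flipping which states accept), so neither operator ever takes you outside what a DFA can match. The toy dictionary encoding of DFAs below is just an assumption for illustration.

```python
def product_dfa(delta1, start1, accept1, delta2, start2, accept2):
    """Product construction: the result accepts a string iff both inputs accept it."""
    delta = {}
    for (q1, ch), r1 in delta1.items():
        for (q2, ch2), r2 in delta2.items():
            if ch == ch2:
                delta[((q1, q2), ch)] = (r1, r2)
    start = (start1, start2)
    accept = {(q1, q2) for q1 in accept1 for q2 in accept2}
    # (complement, i.e. `not`: keep delta and start, accept exactly the non-accepting states)
    return delta, start, accept

def run(delta, start, accept, s):
    state = start
    for ch in s:
        state = delta[(state, ch)]
    return state in accept

# DFA 1: even number of a's.   DFA 2: string ends in 'b'.
d1 = {("e", "a"): "o", ("e", "b"): "e", ("o", "a"): "e", ("o", "b"): "o"}
d2 = {("n", "a"): "n", ("n", "b"): "y", ("y", "a"): "n", ("y", "b"): "y"}

delta, start, accept = product_dfa(d1, "e", {"e"}, d2, "n", {"y"})
print(run(delta, start, accept, "aab"))  # True: two a's AND ends in 'b'
print(run(delta, start, accept, "aba"))  # False: ends in 'a'
```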
What I am saying is that you can come up with a lot of extensions of the basic regular expression language. But there is no subjectivity at all in whether the extensions can be implemented by a DFA. Either you can, or you can't.
Er... I'm not sure what you're saying here, in that it doesn't seem to be engaging with what I was saying, or trying to say, so I'll clarify what I was saying. What I mean is, we all agree on what DFAs are, and so we all agree on what regular languages in the original, Kleene sense are. But once we start admitting non-regular languages into our notion of "regular expression" because they're "straightforwardly tacked on", it's unclear what the limits are supposed to be.
I'm not actually a fan of using "regular expressions" in a sense broader than that of Kleene's regular, DFA-accepted languages, to be clear. That is exactly what I've been questioning.
I think the intent of the parent comments was to say that it's fine to consider a regex a standard "regular expression" in the mathematical sense if it still translates to a DFA and matches a regular language. For instance, mathematical regular expressions typically only include '*' (0 or more), not '+' (1 or more). However, you can still translate a regex including '+' to a DFA. The same goes for arbitrary repetition mechanisms like {m,n}, case-insensitivity, and other features that add convenience without actually matching non-regular languages.
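To illustrate that "convenience without extra power" point: `x+` and `x{m,n}` can be rewritten mechanically into the primitive operators (concatenation, alternation, star). A hypothetical helper, assuming for simplicity that `x` is a single atom or an already-parenthesized group:

```python
def desugar_plus(x: str) -> str:
    """x+ is sugar: one-or-more == one copy followed by zero-or-more."""
    return f"{x}{x}*"

def desugar_counted(x: str, m: int, n: int) -> str:
    """x{m,n} == m required copies followed by (n - m) optional copies,
    where an optional copy (x or the empty string) is written (x|)."""
    return x * m + f"({x}|)" * (n - m)

print(desugar_plus("a"))            # aa*
print(desugar_counted("a", 2, 4))   # aa(a|)(a|)
```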
On the other hand, backreferences can allow matching non-regular languages.
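A quick illustration of that, using Python's `re`: the pattern below matches exactly the strings aⁿbaⁿ, a textbook non-regular language, because the backreference has to "remember" an unbounded count, which no finite automaton can do.

```python
import re

# ^(a+)b\1$ matches a^n b a^n: \1 must repeat exactly what the group captured.
pattern = re.compile(r"^(a+)b\1$")

print(bool(pattern.match("aaabaaa")))  # True  (n = 3)
print(bool(pattern.match("aaabaa")))   # False (counts differ)
```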
Ah, yeah, I see now that that is indeed probably what they meant; thanks! In that case, yes, I'm sympathetic to that, using regex to just mean "Anything that ultimately describes a regular language".
I don't think so, not too much anyway. I think the "software engineering" aspects of it would place reasonably strong constraints on what is desirable and on what is feasible (for a DFA). That's probably how the (informal?) consensus emerged.