As someone who makes much of his living rehabilitating old Perl scripts, please, if you must use such things, use them like this:
[ -~] #match only printable characters
It takes 5 seconds longer, and with regexes, just knowing what the damn thing is trying to do is half the battle. When you use a regex, use a comment. It's the civil thing to do.
I recommend using the /x suffix to extend your pattern's legibility by permitting whitespace and comments.
/x allows you to break up your regex into its component parts, one part per line, and then comment each part.
Here is what the manual says about /x:
/x tells the regular expression parser to ignore most whitespace that is neither backslashed nor within a character class. You can use this to break up your regular expression into (slightly) more readable parts. The # character is also treated as a metacharacter introducing a comment, just as in ordinary Perl code. This also means that if you want real whitespace or # characters in the pattern (outside a character class, where they are unaffected by /x), then you'll either have to escape them (using backslashes or \Q...\E ) or encode them using octal, hex, or \N{} escapes. Taken together, these features go a long way towards making Perl's regular expressions more readable.
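A minimal sketch of the one-part-per-line style described above. It uses Python's `re.VERBOSE` flag, which plays roughly the same role as Perl's /x; the date pattern and the names `DATE` and `PRINTABLE_RUN` are made up for illustration and are not from the thread.

```python
import re

# Hypothetical pattern for illustration: a simple ISO-style date such as 2024-01-31.
DATE = re.compile(
    r"""
    ^
    (?P<year>  \d{4} )   # four-digit year
    -
    (?P<month> \d{2} )   # two-digit month
    -
    (?P<day>   \d{2} )   # two-digit day
    $
    """,
    re.VERBOSE,
)

# Inside a character class, whitespace is still literal in verbose mode,
# so the space that starts the range below is preserved.
PRINTABLE_RUN = re.compile(
    r"""
    [ -~]+     # one or more printable ASCII characters (space through tilde)
    """,
    re.VERBOSE,
)
```

In Perl the same layout would be written between the delimiters of a pattern that ends in /x.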
Yeah, anytime I use a regex that isn't immediately obvious I put it in a function called get_<something>. Unfortunately, people who write overly complicated and error-prone regexes usually don't choose to document them.
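One way the get_<something> idea might look, sketched in Python rather than Perl. The function name `get_printable_runs` is hypothetical; the point is only that the name and docstring carry the intent that the bare pattern does not.

```python
import re

# Matches a run of printable ASCII characters (space, 0x20, through tilde, 0x7E).
_PRINTABLE_RUN = re.compile(r"[ -~]+")


def get_printable_runs(text):
    """Return every maximal run of printable ASCII characters in text."""
    return _PRINTABLE_RUN.findall(text)
```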
If a regex is going to be reusable, then yeah, I'd agree. But dumping single lines of code into their own functions just for readability isn't practical for real time systems. In those cases you really should be using comments as they get stripped out by the compiler.
Couldn't those functions just be inlined by the compiler if they're simple regex-wrappers anyway?
I do agree that it might be overkill to move regexes to their own functions just for readability's sake but I don't buy the performance argument. Furthermore, regexes are most popular in scripting languages that no sane person would use for real time performance-critical systems anyway.
2. Web sites are a classic example of scripting languages being used for real time performance critical systems (though I'm not arguing that all web sites are real time).
Sometimes the ability to modify code easily is as important to the choice of languages as the raw execution speed of the compiled binaries.
Sometimes C is inappropriate (eg you'd be nuts to build a website in C yet some sites do offer real time services)
Often the data set and/or logic required makes C an inappropriate language (eg you wouldn't use C for AI nor for some types of database operations).
And even in the cases where you're just building a standard procedural system, sometimes the interface lends itself better to other languages (eg C would be possibly the worst language for real time websites.)
But even in the cases where you're building a solution that's suited for C, there are still other performance languages which could be used.
"Real time" is quite a general term and as such, sometimes it makes more sense to use scripting languages which are performance tuned. Which is where writing 'good' PCRE is critical as RegEx can be optimised and compiled - if you understand the quirks of the language well enough to avoid easy pitfalls, eg s/^\s//;s/\s$//; outperforms s/(^\s|\s$)//; despite it being two separate queries as opposed to one.
"Real time" is commonly assumed to mean that you can't use a garbage collected language or need to be extremely careful doing so because random pauses of 100ms break your constraints.
If you're in a situation where the overhead of a couple of function calls is unacceptable, regexes are totally unacceptable and you need to write custom character manipulation.
This situation is really rare and in almost all business cases, using C is inappropriate.
The writer of said script needing rehabilitation probably doesn't have that much insight. Just try to tell me what you were trying to get done and that will be enough.
The worst case is when the original author never really had it clear in his/her mind what exactly that compound regex was trying to accomplish. They just kind of bodged and hacked till the usual input stream started coming out right. Trying to write a clear comment on the purpose of the regex helps with that too.
It's not just Unicode either. I just mentioned EBCDIC because that particular regex has bitten me before when I was translating Perl scripts from Linux to z/OS USS. Take a look at the code page for EBCDIC and you'll quickly see why it's a massive pain to sort through regexes like that.
I'm sure you've heard of the IBM AS/400, which is still firmly entrenched in MANY Fortune 500 companies. Not to mention tons of state and county government installations handling payroll, inventory, tax rolls, etc. I had to deal with a Perl script which dealt with ASCII to EBCDIC to port data to an Oracle database. If you're a Windows-only shop, that's fine, but don't assume that anyone who isn't is ancient.
More generally, it's a character set / collating sequence thing. Specifying a range with a start and end point requires understanding what that range specifies, which can change depending on context, locale, character set, etc.
In the case of EBCDIC, there are several places in the alphabetic collation sequence in which non-alpha characters are interspersed among the letter codes. Most notably between R & S, though it appears that I-J also includes a standout. The fact that there are multiple incompatible forms of EBCDIC doesn't help matters much.
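A small sketch that locates those gaps. It assumes code page 037 (Python's `cp037` codec) as a stand-in for "EBCDIC"; other EBCDIC variants may differ, and the helper names are made up.

```python
# Assumption: code page 037 (Python's "cp037" codec) stands in for "EBCDIC".
import string


def ebcdic_codes(chars, codec="cp037"):
    """Map each character to its single-byte code in the given codec."""
    return {c: c.encode(codec)[0] for c in chars}


def gaps(chars, codec="cp037"):
    """Return (prev_char, next_char, skipped_code_count) wherever codes are not consecutive."""
    codes = ebcdic_codes(chars, codec)
    out = []
    for a, b in zip(chars, chars[1:]):
        skipped = codes[b] - codes[a] - 1
        if skipped:
            out.append((a, b, skipped))
    return out


if __name__ == "__main__":
    print(gaps(string.ascii_uppercase))
    print(gaps(string.ascii_lowercase))
```

Passing `codec="ascii"` to `gaps` returns an empty list, since the ASCII letters are contiguous.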
It's both, but ASCII vs. EBCDIC is worse. Even in Unicode, the regex will still grab the printable characters that also happen to be part of ASCII: you won't see anything wrong until you get to characters outside that range. In EBCDIC (code page 037, at least), things get much hairier: it won't get capital letters, nor digits, nor lowercase letters from s through z (but it will get all the other lowercase letters), nor brackets or braces (though it will get parens).
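To see which characters a space-to-tilde range covers under a given encoding, here is a sketch that models the range as a plain comparison of byte codes. It assumes code page 037 via Python's `cp037` codec and is not a claim about how any particular Perl version compiles `[ -~]` on an EBCDIC platform; `in_range_under` is a made-up helper.

```python
def in_range_under(codec, lo=" ", hi="~"):
    """Return the printable ASCII characters whose code in `codec`
    lies between the codes of `lo` and `hi`, inclusive."""
    lo_code = lo.encode(codec)[0]
    hi_code = hi.encode(codec)[0]
    printable = [chr(c) for c in range(0x20, 0x7F)]
    return "".join(
        c for c in printable if lo_code <= c.encode(codec)[0] <= hi_code
    )


if __name__ == "__main__":
    print(repr(in_range_under("ascii")))
    print(repr(in_range_under("cp037")))
```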
I'm not sure who the PHB is, but I'm certainly not impressed by anything cryptic in a codebase. Deliberately writing code that's hard to understand should be a firing offense.
Are you saying people should Google regular expressions? In my experience (correct me if I'm wrong) that doesn't work; I've never been able to get Google to return relevant results, even with quotation marks.
I'm saying that comments are usually either wrong or out of date. A developer writes a regex and comments it, then fixes a bug later and doesn't update the comment, and now the comment and the code disagree. It's nearly always easier to just Google the code and see what it does if (as in this case) it's not obvious.
Google regex and you'll find plenty of resources, including tools for testing patterns. You won't find much for any specific pattern, but read the docs and it will be apparent what this regex does. Familiarity and competence with regex is a basic component of being a developer.
Hey, might just be me. I'm usually the 'Ben, can you help me with a regular expression' guy over here, but I stumbled, hard, and failed to connect the '-' with a range of characters (probably because I never thought of 'space to .. something').
So I read the snippet, thought 'Yeah, a character class of space, -, ~' and fell on my face in the next couple of lines.
Yeah, I should've known better, I know how to read it. If .. I invest the time and don't glance over a construct and hope to just get it instantly.
I wouldn't want to see this in a code base without proper documentation (be it a comment, a function name or whatever. Something).
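For anyone who trips over the same thing, a short sketch of the difference between the hyphen as a range operator and the hyphen as a literal. It uses Python's `re`, which treats hyphens in character classes the same way here; `matched_chars` is a made-up helper.

```python
import re

RANGE = re.compile(r"[ -~]")      # hyphen between two characters: the range 0x20..0x7E
LITERALS = re.compile(r"[ ~-]")   # hyphen last: exactly three literals: space, '~', '-'


def matched_chars(pattern):
    """Return the ASCII characters (0..127) that the single-character pattern matches."""
    return [chr(c) for c in range(128) if pattern.fullmatch(chr(c))]


if __name__ == "__main__":
    print(len(matched_chars(RANGE)))
    print(matched_chars(LITERALS))
```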
Not that I agree with the "expect people reading your code to Google things" mindset, but to be fair the only ambiguous thing is the ASCII table which is Googleable.
The best code is readable. Readability includes comments. If you're going to comment anything in your code at all, RegExes should be at the very top of that list.
Even if I can figure out what the regex matches (with Google or something else), that doesn't necessarily tell me WHY I'm matching on that particular pattern, or why I needed a RegEx in this spot, or what the intent was at the time of writing it.