As someone who makes much of his living rehabilitating old perl scripts, please,...

atsaloli · on Nov 12, 2012

I recommend using the /x suffix to extend your pattern's legibility by permitting whitespace and comments.

/x allows you to break up your regex into its component parts, one part per line, and then comment each part.

Here is what the manual says about /x:

/x tells the regular expression parser to ignore most whitespace that is neither backslashed nor within a character class. You can use this to break up your regular expression into (slightly) more readable parts. The # character is also treated as a metacharacter introducing a comment, just as in ordinary Perl code. This also means that if you want real whitespace or # characters in the pattern (outside a character class, where they are unaffected by /x), then you'll either have to escape them (using backslashes or \Q...\E ) or encode them using octal, hex, or \N{} escapes. Taken together, these features go a long way towards making Perl's regular expressions more readable.

http://perldoc.perl.org/perlre.html
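
For illustration only, here is a rough sketch of the same idea in Python, whose `re.VERBOSE` flag plays a role similar to Perl's /x. The pattern, the name `KEY_VALUE`, and the key=value input format are all made up for this example and are not from any script discussed here.

```python
import re

# Hypothetical pattern: a "key = value" line whose value is printable ASCII.
# re.VERBOSE ignores whitespace outside character classes and allows # comments.
KEY_VALUE = re.compile(
    r"""
    (?P<key>   [A-Za-z_]\w* )   # key: an identifier-like name
    \s* = \s*                   # equals sign with optional whitespace around it
    (?P<value> [ -~]* )         # value: zero or more chars from space to tilde
    """,
    re.VERBOSE,
)

match = KEY_VALUE.fullmatch("name = hello world")
if match:
    print(match.group("key"), "->", match.group("value"))
```

The space inside `[ -~]` survives because whitespace within a character class is not stripped, which matches what the manual excerpt says about /x.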

sbochins · on Nov 12, 2012

Yeah, anytime I use a regex that isn't immediately obvious I put it in a function called get_<something>. Unfortunately, people who write overly complicated and error-prone regexes usually don't choose to document them.

laumars · on Nov 13, 2012

If a regex is going to be reusable, then yeah, I'd agree. But dumping single lines of code into their own functions just for readability isn't practical for real time systems. In those cases you really should be using comments as they get stripped out by the compiler.

entropy_ · on Nov 13, 2012

Couldn't those functions just be inlined by the compiler if they're simple regex-wrappers anyway?

I do agree that it might be overkill to move regexes to their own functions just for readability's sake but I don't buy the performance argument. Furthermore, regexes are most popular in scripting languages that no sane person would use for real time performance-critical systems anyway.

laumars · on Nov 13, 2012

1. Ahh right. I wasn't aware that happened.

2. Web sites are a classic example of scripting languages being used for real time performance critical systems (though I'm not arguing that all web sites are real time).

Sometimes the ability to modify code easily is as important to the choice of languages as the raw execution speed of the compiled binaries.

jbooth · on Nov 13, 2012

REGEXES aren't practical for real time systems.

If you're using a regex, and certainly if you're using a language other than C, you probably have space for the function call overhead.

laumars · on Nov 13, 2012

I don't really agree with that.

Sometimes C is inappropriate (eg you'd be nuts to build a website in C yet some sites do offer real time services)

Often the data set and/or logic required makes C an inappropriate language (eg you wouldn't use C for AI nor for some types of database operations).

And even in the cases where you're just building a standard procedural system, sometimes the interface lends itself better to other languages (eg C would be possibly the worst language for real time websites.)

But even in the cases where you're building a solution that's suited for C, there are still other performance languages which could be used.

"Real time" is quite a general term and as such, sometimes it makes more sense to use scripting languages which are performance tuned. Which is where writing 'good' PCRE is critical as RegEx can be optimised and compiled - if you understand the quirks of the language well enough to avoid easy pitfalls, eg s/^\s//; s/\s$//; outperforms s/(^\s|\s$)//; despite it being two separate queries as opposed to one.

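A rough sketch of the trimming comparison above, written in Python rather than Perl, so any timing would not transfer directly. The function names are hypothetical. Unlike the Perl snippets, these patterns use `\s+` and replace every match; the Perl alternation as written, without /g, would only strip one end of the string.

```python
import re

LEADING = re.compile(r"^\s+")         # whitespace at the start
TRAILING = re.compile(r"\s+$")        # whitespace at the end
EITHER_END = re.compile(r"^\s+|\s+$") # both cases in one alternation

def trim_two_passes(text):
    """Strip leading, then trailing whitespace with two anchored patterns."""
    return TRAILING.sub("", LEADING.sub("", text))

def trim_one_pass(text):
    """Strip both ends with a single alternation pattern."""
    return EITHER_END.sub("", text)

print(repr(trim_two_passes("  a b \n")), repr(trim_one_pass("  a b \n")))
```

Both functions should return the same string; which one is faster is something to measure (for example with `timeit`) on the actual engine and data rather than assume.
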
jbooth · on Nov 13, 2012

"Real time" is commonly assumed to mean that you can't use a garbage collected language or need to be extremely careful doing so because random pauses of 100ms break your constraints.

If you're in a situation where the overhead of a couple of function calls is unacceptable, regexes are totally unacceptable and you need to write custom character manipulation.

This situation is really rare and in almost all business cases, using C is inappropriate.

hellrich · on Nov 13, 2012

Shouldn't most compilers (jit-)inline it?

yxhuvud · on Nov 12, 2012

or even better # Match only printable ASCII characters.

noonespecial · on Nov 12, 2012

The writer of said script needing rehabilitation probably doesn't have that much insight. Just try to tell me what you were trying to get done and that will be enough.

The worst case is when the original author never really had it clear in his/her mind what exactly that compound regex was trying to accomplish. They just kind of bodged and hacked till the usual input stream started coming out right. Trying to write a clear comment on the purpose of the regex helps with that too.

CountHackulus · on Nov 12, 2012

Thank you for that. You have no idea how annoying it is to port perl scripts from ASCII to EBCDIC when they do that kind of thing.

rmc · on Nov 12, 2012

It's not an ASCII vs EBCDIC thing, it's an ASCII vs Unicode thing.

CountHackulus · on Nov 13, 2012

It's not just Unicode either. I just mentioned EBCDIC because that particular regex has bitten me before when I was translating perl scripts from Linux to z/OS USS. Take a look at the code page for EBCDIC, you'll see quickly why it's a massive pain to sort through regexes like that.

natrius · on Nov 13, 2012

I honestly thought you were being sarcastic. I've never heard of someone who has actually used EBCDIC.

nobleach · on Nov 14, 2012

I'm sure you've heard of the IBM AS/400, which is still firmly entrenched in MANY Fortune 500 companies. Not to mention tons of state and county government installations handling payroll, inventory, tax rolls, etc. I had to deal with a Perl script which dealt with ASCII to EBCDIC to port data to an Oracle database. If you're a Windows-only shop, that's fine, but don't assume that anyone who isn't is ancient.

rmc · on Nov 13, 2012

So did I! Now that's a war story....

dredmorbius · on Nov 12, 2012

More generally, it's a characterset / collate sequence thing. Specifying a range with a start and end point requires understanding what that range specifies. Which can change depending on context, locale, characterset, etc.

tripzilch · on Nov 14, 2012

Also in the 32-127 ASCII range? I thought they just differ in 128-255 with the code pages and such?

dredmorbius · on Nov 17, 2012

In the case of EBCDIC, there are several places in the alphabetic collation sequence in which non-alpha characters are interspersed among the letter codes. Most notably between R & S, though it appears that I-J also includes a standout. The fact that there are multiple incompatible forms of EBCDIC doesn't help matters much.

Makes sorts really tweaky.

http://en.wikipedia.org/wiki/Ebcdic

Millennium · on Nov 13, 2012

It's both, but ASCII vs. EBCDIC is worse. Even in Unicode, the regex will still grab the printable characters that also happen to be part of ASCII: you won't see anything wrong until you get to characters outside that range. In EBCDIC, things get much hairier: it won't get capital letters, nor lowercase letters from s through z (but it will get all the other lowercase letters), nor brackets or braces (though it will get parens).
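
A small sketch for checking this kind of claim, using Python's built-in `cp037` codec as one stand-in for EBCDIC (there are several incompatible EBCDIC code pages, so other variants may give different answers). The helper name `in_space_to_tilde` is hypothetical; it only compares raw byte values and does not model how any particular Perl build treats ranges.

```python
def in_space_to_tilde(ch, codec):
    """Return True if ch's byte lies between the codec's bytes for ' ' and '~'."""
    low = " ".encode(codec)[0]
    high = "~".encode(codec)[0]
    return low <= ch.encode(codec)[0] <= high

# Probe a few characters; "cp037" is an assumed stand-in for "EBCDIC".
for ch in "Aaz(":
    print(repr(ch), in_space_to_tilde(ch, "ascii"), in_space_to_tilde(ch, "cp037"))
```

Under these assumptions, "A" falls inside the space-to-tilde range in ASCII but outside it in cp037, which is the kind of surprise being described.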

perlgeek · on Nov 13, 2012

Or just use [[:print:]]

tomjen3 · on Nov 12, 2012

Unless you want to seem clever and impress the PHB.

It is the selfish (but smart) thing to do.

nathan_long · on Nov 12, 2012

I'm not sure who the PHB is, but I'm certainly not impressed by anything cryptic in a codebase. Deliberately writing code that's hard to understand should be a firing offense.

rada · on Nov 12, 2012

PHB: Pointy Haired Boss

nathan_long · on Nov 13, 2012

In that case, the smart thing to do is not to work for the PHB, rather than pervert your craft in an attempt to impress him/her.

Shorel · on Nov 13, 2012

In this case, the entire post was the comment.

You are right anyway.

hnriot · on Nov 12, 2012

Google is by far the best "comment"

citricsquid · on Nov 12, 2012

Are you saying people should google regular expressions? in my experience (correct me if I'm wrong) that doesn't work, I've never been able to get google to return relevant results even with quotation marks.

boyter · on Nov 12, 2012

Agreed, Google fails at this; however, alternative search engines do better:

http://searchco.de/?q=%5B+-~%5D+ext%3Apod&cs=on

http://symbolhound.com/?q=%5B+-~%5D

Symbolhound gives the answer quite well, and searchco.de has some examples of its use in the results.

hnriot · on Nov 12, 2012

I'm saying that comments are usually either wrong or out of date. Developers write a regex, comment it, then fix a bug later without updating the comment, and now there's a discrepancy between the comment and the code. It's nearly always easier to just google the code and see what it does, if (as in this case) it's not obvious.

cjfont · on Nov 13, 2012

Your response doesn't address what citricsquid said, googling for a regex will almost never return helpful results.

hnriot · on Nov 13, 2012

Google regex and you'll find plenty of resources, including tools for testing patterns. You won't find much for any specific pattern, but read the docs and it will be apparent what this regex does. Familiarity and competence with regex is a basic component of being a developer.

kstenerud · on Nov 12, 2012

or make a function regexMatchingAllPrintableASCIIChars() and have it return the regex.
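
A minimal sketch of that suggestion, in Python for illustration. Both function names are hypothetical, and the pattern assumes "printable ASCII" means space (0x20) through tilde (0x7E).

```python
import re

def regex_matching_all_printable_ascii_chars():
    """Return a compiled pattern for one or more printable ASCII characters."""
    return re.compile(r"[ -~]+")

def is_printable_ascii(text):
    """True if text is non-empty and consists only of printable ASCII."""
    return regex_matching_all_printable_ascii_chars().fullmatch(text) is not None

print(is_printable_ascii("Hello, world!"))
```

The name carries the intent to the call site, so a reader never has to decode the character class to know what is being checked.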

missing_cipher · on Nov 12, 2012

Search "[ -~]" (with or without quotes) to see how good Google's comment is.

logn · on Nov 12, 2012

The only ambiguous thing about this regex is knowing what's between space and tilde. Otherwise this is a pretty ordinary regex.
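
For anyone who would rather check than remember the table, a short Python sketch that lists what falls between space and tilde in ASCII. The names `PRINTABLE_ASCII` and `matched` are just for this example.

```python
import re

PRINTABLE_ASCII = re.compile(r"[ -~]")

# Test every 7-bit code point against the character class.
matched = [chr(code) for code in range(128) if PRINTABLE_ASCII.fullmatch(chr(code))]

print(len(matched))
print("".join(matched))
```

The list starts at space and ends at tilde, and it leaves out control characters such as tab and newline.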

darklajid · on Nov 13, 2012

Hey, might just be me. I'm usually the 'Ben, can you help me with a regular expression' guy over here, but I stumbled, hard, and failed to connect the '-' with a range of characters (probably because I never thought of 'space to .. something').

So I read the snippet, thought 'Yeah, a character class of space, -, ~' and fell on my face in the next couple of lines.

Yeah, I should've known better, I know how to read it. If .. I invest the time and don't glance over a construct and hope to just get it instantly.

I wouldn't want to see this in a code base without proper documentation (be it a comment, a function name or whatever. Something).

wpietri · on Nov 13, 2012

The only thing ambiguous about it is most of it?

masterzora · on Nov 13, 2012

Not that I agree with the "expect people reading your code to Google things" mindset, but to be fair the only ambiguous thing is the ASCII table which is Googleable.

DanBC · on Nov 12, 2012

There is one author of the code, and potentially many readers.

That one author is the only person who knows what s/he is trying to achieve.

That author taking a few minutes to add some comments will save other people the time to search for answers and the time it takes to grok everything.

bmelton · on Nov 13, 2012

The best code is readable. Readability includes comments. If you're going to comment anything in your code at all, RegExes should be at the very top of that list.

Even if I can figure out what the regex matches (with Google or something else), that doesn't necessarily tell me WHY I'm matching on that particular pattern, or why I needed a RegEx in this spot, or what the intent was at the time of writing it.