Thank you for that. You have no idea how annoying it is to port perl scripts fro...

rmc · on Nov 12, 2012

It's not an ASCII vs. EBCDIC thing, it's an ASCII vs. Unicode thing.

CountHackulus · on Nov 13, 2012

It's not just Unicode either. I only mentioned EBCDIC because that particular regex has bitten me before, when I was porting Perl scripts from Linux to z/OS USS. Take a look at the code page for EBCDIC and you'll quickly see why it's a massive pain to sort through regexes like that.

natrius · on Nov 13, 2012

I honestly thought you were being sarcastic. I've never heard of someone who has actually used EBCDIC.

nobleach · on Nov 14, 2012

I'm sure you've heard of the IBM AS/400, which is still firmly entrenched in MANY Fortune 500 companies. Not to mention tons of state and county government installations handling payroll, inventory, tax rolls, etc. I had to deal with a Perl script which converted ASCII to EBCDIC to port data to an Oracle database. If you're a Windows-only shop, that's fine, but don't assume that anyone who isn't is ancient.

rmc · on Nov 13, 2012

So did I! Now that's a war story....

dredmorbius · on Nov 12, 2012

More generally, it's a character set / collating sequence thing. Specifying a range with a start and end point requires understanding what that range specifies, which can change depending on context, locale, character set, etc.

tripzilch · on Nov 14, 2012

Also in the 32-127 ASCII range? I thought they just differ in 128-255 with the code pages and such?

dredmorbius · on Nov 17, 2012

In the case of EBCDIC, there are several places in the alphabetic collation sequence in which non-alpha characters are interspersed among the letter codes. Most notably between R & S, though it appears that I-J also includes a standout. The fact that there are multiple incompatible forms of EBCDIC doesn't help matters much.

Makes sorts really tweaky.

http://en.wikipedia.org/wiki/Ebcdic
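
To make the gaps concrete, here is a minimal Python sketch that looks for breaks in the letter codes. It assumes code page 037 (Python's built-in `cp037` codec) as a stand-in for "EBCDIC"; other variants may place some characters differently. The helper names `ebcdic_code` and `letter_gaps` are made up for this illustration.

```python
import string

CODEC = "cp037"  # assumption: one common EBCDIC code page, not the only one


def ebcdic_code(ch: str) -> int:
    """Return the single-byte code of ch in the chosen EBCDIC code page."""
    return ch.encode(CODEC)[0]


def letter_gaps(letters: str) -> list[tuple[str, str]]:
    """Return adjacent letter pairs whose EBCDIC codes are not consecutive."""
    return [
        (a, b)
        for a, b in zip(letters, letters[1:])
        if ebcdic_code(b) - ebcdic_code(a) != 1
    ]


upper_gaps = letter_gaps(string.ascii_uppercase)
lower_gaps = letter_gaps(string.ascii_lowercase)
print(upper_gaps)
print(lower_gaps)
```

Under that assumption, the only breaks reported are I-J and R-S, in both cases. Anything that treats `A-Z` as one contiguous byte range will sweep up whatever sits in those gaps.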

Millennium · on Nov 13, 2012

It's both, but ASCII vs. EBCDIC is worse. Even in Unicode, the regex will still grab the printable characters that also happen to be part of ASCII: you won't see anything wrong until you get to characters outside that range. In EBCDIC, things get much hairier: it won't get capital letters, nor lowercase letters from r through z (but it will get all the other lowercase letters), nor brackets or braces (though it will get parens).
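
A rough way to check this yourself, as a Python sketch. The regex isn't quoted in this part of the thread, so this assumes it is a literal space-to-tilde range such as `[ -~]`, and it again assumes code page 037 (Python's `cp037` codec). The function name `in_ebcdic_range` is hypothetical.

```python
CODEC = "cp037"  # assumption: one common EBCDIC code page


def in_ebcdic_range(ch: str, lo: str = " ", hi: str = "~") -> bool:
    """True if ch falls between lo and hi when compared by EBCDIC code."""
    code = ch.encode(CODEC)[0]
    return lo.encode(CODEC)[0] <= code <= hi.encode(CODEC)[0]


# The 95 printable ASCII characters, space through tilde.
printable = [chr(c) for c in range(0x20, 0x7F)]

matched = "".join(ch for ch in printable if in_ebcdic_range(ch))
missed = "".join(ch for ch in printable if not in_ebcdic_range(ch))

print("matched:", matched)
print("missed: ", missed)
```

With those assumptions, capitals, brackets and braces land in `missed`, and parens land in `matched`. By this sketch the lowercase cutoff falls just after `r` (so `s` through `z` are the ones lost), and the digits are missed as well.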