If you know something the author doesn't, I'm sure they'd appreciate a (construc...

ephaeton · on March 15, 2021

Setting LC_* to 'C' nukes locale awareness by saying "we only care about ASCII/en-US" and thus falls back to the 'cheap' tolower that's also used in the C and C++ versions, and most likely in the awk version as well - in contrast to, say, the Python version.

Finding the "lower case of a character" is immensely harder in the face of Unicode, because the table is far larger and there are no nice speed hacks based on manipulating an index into the ASCII table. JFGI: "tolower performance unicode".
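To make the difference concrete, here is a minimal Python sketch. The helper `ascii_lower` and the sample string are made up for illustration and are not taken from the article's programs. It contrasts a 256-entry table lookup that only touches A-Z with Python's Unicode-aware `str.lower()`, which has to consult the Unicode case mappings and can even change the length of the string.

```python
def ascii_lower(data: bytes) -> bytes:
    """Lowercase only the bytes A-Z; every other byte passes through unchanged."""
    # A 256-entry lookup table: the kind of index trick that only works for ASCII.
    table = bytes(b + 32 if 65 <= b <= 90 else b for b in range(256))
    return data.translate(table)


# Hypothetical sample text mixing ASCII and non-ASCII capitals.
text = "Hello ÀÉ İstanbul"

# ASCII-only casing applied to the UTF-8 bytes: non-ASCII letters are left alone.
ascii_result = ascii_lower(text.encode("utf-8")).decode("utf-8")

# Unicode-aware casing: every code point goes through the Unicode case mappings.
unicode_result = text.lower()

print(ascii_result)
print(unicode_result)
print(len(text), len(unicode_result))
```

The two results differ: the ASCII version leaves "ÀÉ" and "İ" untouched, while `str.lower()` lowercases them, and "İ" becomes two code points ("i" plus a combining dot above), so the output is one code point longer than the input.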

If you notice that setting LC_ALL makes a difference in performance for you, you ought to be aware that you're no longer comparing the same capabilities.

I'm not sure how to constructively state that except for: One shouldn't compare ASCII and Unicode "tolower" in performance comparisons, as you end up comparing apples and oranges (at best; more like apples and snails) - which I did.

If the author uses code that calls "tolower" in job interviews, i.e., evaluates candidates based on how they handle case normalization - and he even writes ("This is Unicode-aware ...") - he should know that Unicode awareness is not ubiquitous and comes at quite a cost. Given that he knows about Unicode awareness, one would assume he's aware that the comparison isn't a fair one.

burntsushi · on March 15, 2021

Can you state the specific comparison that is not apples-to-apples in the OP?

The only one I see is the comparison between 'simple' and 'optimized'. But that's more about what a "simple idiomatic" solution looks like versus what an "optimized and possibly less simple" solution looks like. That comparison isn't designed to be apples-to-apples in the way you're saying. The simple variant will use Unicode-aware casing in environments where that's the simple and natural thing to do.

IIRC, most of the 'optimized' programs are using ASCII casing. Python doesn't, but casing isn't even close to the bottleneck in that program.

jerf · on March 15, 2021

I believe the author is fully aware of this, and just chose the ASCII version for all languages precisely so that we don't get into Unicode efficiency issues. Doing that wouldn't be objectively bad or anything, it just isn't the question the author is asking. It is still a fine question on its own terms.

benhoyt · on March 15, 2021

Yes, I'm definitely aware of this, thanks -- and as a result, you can't really compare the "simple" and "optimized" versions. Most of the "simple" versions (except C) are Unicode-aware, and most of the optimized versions are not -- though I acknowledge I haven't always stuck to that, e.g., in the C case, when it's just "too hard".