Mostly, we're used to defining tolower() and toupper() to return the lower or upper case variant if one exists, and otherwise to return the input unchanged. For most Unicode codepoints no such variant exists, so you just get back whatever you fed in. A few alphabets distinguish uppercase from lowercase, but most writing systems don't.
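To make that concrete, here is a small Python sketch. It uses Python's built-in str.lower() and str.upper() as stand-ins for tolower() and toupper(); the helper name has_case is made up for illustration.

```python
# Sketch: characters without case variants pass through lower()/upper() unchanged.
# has_case is a hypothetical helper, not a standard library function.

def has_case(ch: str) -> bool:
    """Return True if lower() or upper() changes the character."""
    return ch.lower() != ch or ch.upper() != ch

# Latin letters, a digit, a CJK ideograph, a hiragana character, punctuation.
samples = ["a", "Z", "7", "中", "あ", "!"]

for ch in samples:
    print(repr(ch), "lower:", repr(ch.lower()), "upper:", repr(ch.upper()),
          "has case variant:", has_case(ch))
```

Only the two Latin letters report a case variant here; everything else comes back exactly as it went in.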

However, lower(upper(X)) is not defined to be the same as lower(X), and there's no promise that transforming a string with lower() or upper() does what you hoped, because that isn't how language actually works (e.g. in English case sometimes marks proper nouns, so "May" is the Prime Minister of the UK, but "may" is just an auxiliary verb).
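A quick way to see the round-trip problem, again using Python's str.upper() and str.lower() as stand-ins (other languages and libraries may map these characters differently, and round_trip is just an illustrative name):

```python
# Sketch: lower(upper(x)) can differ from lower(x) for some characters.

def round_trip(s: str) -> str:
    """Uppercase, then lowercase again."""
    return s.upper().lower()

# German sharp s, Turkish dotless i, and the long s.
for s in ["ß", "ı", "ſ"]:
    print(repr(s), "lower:", repr(s.lower()),
          "lower(upper):", repr(round_trip(s)),
          "same:", s.lower() == round_trip(s))
```

In Python, "ß" uppercases to "SS", so the round trip yields "ss" while lower() alone leaves "ß" untouched. The dotless "ı" comes back as a dotted "i".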

Where standards tell you something is case-insensitive, but it's also allowed to be Unicode rather than ASCII, you can and probably should "case crush" it with tolower() and then never worry about this problem. In a few places you have to be careful because a standard says something in particular is case-insensitive, but not everything that goes in that slot is case-insensitive. For example MIME content type names like "text/plain", "TEXT/PLAIN" and "Text/Plain" are case-insensitive, but

  multipart/mixed; boundary="ABCDEFGHIJKL"
  multipart/mixed; boundary="abcdefghijkl"
  multipart/mixed; boundary="AbcDefGhiJkl"

... declare three different boundary tokens, and none of them matches the sequence abCdeFghIjkL.
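As a rough sketch of what that means in code: crush the case of the type name (and the parameter name), but leave the parameter value alone. The function parse_content_type below is hypothetical and deliberately naive (it doesn't handle a ";" inside a quoted value, escapes, or RFC 2231 encoding), so treat it as an illustration rather than a real MIME parser.

```python
# Sketch: case-insensitive media type, case-sensitive boundary value.
# parse_content_type is a hypothetical, simplified helper.

def parse_content_type(header: str):
    """Split a Content-Type value into (media_type, params)."""
    media_type, _, rest = header.partition(";")
    params = {}
    for part in rest.split(";"):
        name, sep, value = part.partition("=")
        if not sep:
            continue  # skip empty or malformed segments
        # Parameter names are crushed to lowercase; values are kept verbatim.
        params[name.strip().lower()] = value.strip().strip('"')
    # The type/subtype is case-insensitive, so it is safe to crush it.
    return media_type.strip().lower(), params

headers = [
    'multipart/mixed; boundary="ABCDEFGHIJKL"',
    'MULTIPART/MIXED; boundary="abcdefghijkl"',
    'Multipart/Mixed; Boundary="AbcDefGhiJkl"',
]

parsed = [parse_content_type(h) for h in headers]
media_types = {media_type for media_type, _ in parsed}
boundaries = {params["boundary"] for _, params in parsed}

print(media_types)             # one media type after case crushing
print(len(boundaries))         # three distinct boundary tokens
print("abCdeFghIjkL" in boundaries)
```

All three headers compare equal on the media type, but the boundary tokens stay distinct because they were never lowercased.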



