
I just found that this applies to Python too.

    >>> 'Straße'.upper()
    'STRASSE'



Now this has me wondering how you could reverse this to get:

    >>> 'Straße'.upper().lower()
    'straße'

instead of:

    >>> 'Straße'.upper().lower()
    'strasse'


The conversion is lossy: a lowercase ss (e.g. Gasse) is valid and common in German, and it uppercases to SS as well. Short of the call learning some German, there's no neat way to reverse this.


There are even cases where it is context-dependent: Maße (measurements) and Masse (mass). Both are written MASSE when uppercased, so how do you know which one is which?
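
A quick way to see the collision for yourself; this is only a sketch, and the word list is just the examples from this thread:

```python
# Pairs of distinct German spellings that collapse to one uppercase form.
pairs = [("Maße", "Masse"), ("in Maßen", "in Massen")]

collisions = {}
for with_eszett, with_ss in pairs:
    # str.upper() maps ß to SS, so both spellings become identical.
    assert with_eszett != with_ss
    assert with_eszett.upper() == with_ss.upper()
    collisions[with_eszett.upper()] = (with_eszett, with_ss)
```

Each key in `collisions` has two valid originals, so no function of the uppercase string alone can pick the right one.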

For additional fun: Swiss German doesn't use ß at all, so there is no difference between "in Maßen genießen" (to enjoy in moderation) and "in Massen genießen" (to enjoy in large amounts).


> Swiss German doesn't use ß at all

TIL! I googled my way to a fun summary of the various factors potentially influencing this:

https://german.stackexchange.com/questions/56567/why-was-%C3...


Making X.upper().lower() return X.lower() is doable with an expanded string type that keeps track of more context, such that modifiers don't apply until the final output. In that case, it would be relatively simple to say that with multiple upper and lower calls, only the last survives.
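
A minimal sketch of that idea, assuming a made-up wrapper class (`LazyCaseStr` and its `pending` field are hypothetical names, not anything in the standard library). It keeps the original text and only remembers the last case operation requested:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LazyCaseStr:
    """Hypothetical wrapper: stores the original text plus the most
    recently requested case operation, applied only on output."""
    text: str
    pending: Optional[str] = None  # None, "upper", or "lower"

    def upper(self) -> "LazyCaseStr":
        # Discard any earlier pending operation; only the last survives.
        return LazyCaseStr(self.text, "upper")

    def lower(self) -> "LazyCaseStr":
        return LazyCaseStr(self.text, "lower")

    def __str__(self) -> str:
        if self.pending == "upper":
            return self.text.upper()
        if self.pending == "lower":
            return self.text.lower()
        return self.text


result = str(LazyCaseStr("Straße").upper().lower())
```

Here `result` is derived from the untouched original, so the ß never passes through SS. This only helps when the string was created inside your own code; it does nothing for an uppercase string that arrives from elsewhere.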

Making 'STRASSE'.lower() return 'straße' requires that the caller know the written language in use and do a lookup through a language dictionary. IIRC, not all German words with two consecutive 's'es are properly written with an 'ß', and I don't know much about other languages that use that character. Blindly changing any SS to ß on lowercasing isn't what anyone wants, but strings are rarely annotated with the written language they contain. It gets worse because a string can contain multiple written languages, and that is annotated even more rarely.
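
Roughly what the dictionary lookup would look like, as a sketch only. The `LEXICON` table and the `lower_german` helper are hypothetical, and a real version would need a full word list plus some way to handle the ambiguous entries:

```python
from typing import Dict, List

# Hypothetical mini-lexicon: uppercase form -> known lowercase spellings.
LEXICON: Dict[str, List[str]] = {
    "STRASSE": ["straße"],
    "GASSE": ["gasse"],
    "MASSE": ["maße", "masse"],  # ambiguous without context
}


def lower_german(word: str) -> List[str]:
    """Return candidate lowercase spellings for an uppercase word.

    Falls back to plain str.lower() for words not in the lexicon.
    """
    return LEXICON.get(word, [word.lower()])
```

Note that `lower_german("MASSE")` returns two candidates, so even with a dictionary the caller still needs context to choose.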


You could plausibly capitalize the ß into one of the homoglyphs for S. Then, lower() could detect the homoglyph and know how to lowercase it. This will have some side effects, though (e.g. a doubled lowercase version of your homoglyph will now not round-trip).
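
Something like this, as a rough sketch. The choice of U+0405 (CYRILLIC CAPITAL LETTER DZE) as the lookalike and the names `MARK`, `marked_upper` and `marked_lower` are my own assumptions for illustration:

```python
# Latin S followed by CYRILLIC CAPITAL LETTER DZE, which renders like "S".
MARK = "S\u0405"


def marked_upper(s: str) -> str:
    # Swap ß for the marker before uppercasing, so str.upper() never
    # gets the chance to expand it to a plain "SS".
    return s.replace("ß", MARK).upper()


def marked_lower(s: str) -> str:
    # Turn the marker back into ß, then lowercase everything else.
    return s.replace(MARK, "ß").lower()


round_trip = marked_lower(marked_upper("Straße"))
plain = marked_lower(marked_upper("Gasse"))
# Side effect: text that really contains Latin s + Cyrillic dze is
# mistaken for the marker and comes back as ß.
collision = marked_lower(marked_upper("s\u0455"))
```

The uppercase output looks like STRASSE but no longer compares equal to it, which is another cost of this trick.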


You can't without out-of-band information.


you can't


Use str.casefold() if you're using str.upper() or str.lower() for comparisons.
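
For example (a small sketch; `same_ignoring_case` is just an illustrative helper name):

```python
def same_ignoring_case(a: str, b: str) -> bool:
    # casefold() maps ß to "ss", so both spellings compare equal.
    return a.casefold() == b.casefold()


matches = same_ignoring_case("Straße", "STRASSE")
# Comparing with lower() fails, because "straße" != "strasse".
lower_matches = "Straße".lower() == "STRASSE".lower()
```

Here `matches` is True and `lower_matches` is False. Note that casefold() is for comparison only: it also treats Maße and Masse as equal, so it doesn't solve the round-trip problem.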



