
Not even close.

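The letter À (A with a grave accent) can be written in UTF-8 as the bytes 0xC3 0x80 (i.e. a single precomposed "character", U+00C0), or as a letter A followed by a combining grave accent (U+0041 U+0300), i.e. 0x41 0xCC 0x80.

The two are canonically equivalent and render identically, but they have different byte representations. If you don't normalize your Unicode you will run into major problems, because a naive byte-by-byte or code-point-by-code-point comparison treats them as different strings.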
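To make that concrete, here is a minimal sketch, assuming Python 3 and its standard `unicodedata` module (the variable names are just for illustration). It builds both spellings of À, shows that they differ as raw strings and bytes, and shows that they compare equal once both are put through the same normalization form.

```python
import unicodedata

precomposed = "\u00C0"   # À as one code point (LATIN CAPITAL LETTER A WITH GRAVE)
decomposed = "A\u0300"   # letter A followed by COMBINING GRAVE ACCENT

# The UTF-8 encodings differ, so the raw strings do not compare equal.
precomposed_bytes = precomposed.encode("utf-8")   # b'\xc3\x80'
decomposed_bytes = decomposed.encode("utf-8")     # b'A\xcc\x80'
raw_equal = precomposed == decomposed             # False

# After normalizing both sides to the same form, they compare equal.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", precomposed)
             == unicodedata.normalize("NFD", decomposed))

print(precomposed_bytes.hex(), decomposed_bytes.hex(), raw_equal, nfc_equal, nfd_equal)
```

Which form you pick (NFC or NFD) matters less than applying the same one consistently on both sides of every comparison.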




There are actually two kinds of normalization in play here.

The one you are talking about is the one Unicode actually calls normalization, and it is dealt with in UTR#15: http://unicode.org/reports/tr15/

You are absolutely right that, in almost any situation taking Unicode input where you're ever going to need to compare strings (and in most where you're ever going to need to display them), you are going to need to apply one of the UTR#15 normalization forms. UTR#15 normalizes different byte representations of what, in ALL circumstances, are indeed identical characters/graphemes. A lot of people don't take account of this.

Then there's the kind of canonicalization that OP talks about, which Unicode actually calls 'folding', and which is about characters/graphemes that really ARE different characters but which, for _some_ but not all contexts, may be treated as 'equivalent' (if not necessarily identical). The simplest example is case insensitivity, but there are other trickier ones in the complete repertoire, like those discussed in the OP.
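As a rough illustration of the simplest kind of folding (case insensitivity), here is a minimal sketch in Python 3. The helper name `fold_key` is made up for this example, and `str.casefold()` is only one locale-independent choice of fold; whether it is the right equivalence depends on your context, which is exactly the point.

```python
import unicodedata

def fold_key(text: str) -> str:
    """Hypothetical helper: build a key for caseless comparison.

    Normalizes to NFD, applies case folding, then normalizes again,
    because folding can produce text that is no longer normalized.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())

# Different characters, treated as equivalent only under this fold.
same_word = fold_key("Straße") == fold_key("STRASSE")
# Precomposed upper-case À versus decomposed lower-case a + combining grave.
same_letter = fold_key("\u00C0") == fold_key("a\u0300")
# Without folding, the strings are simply different.
raw_same = "Straße" == "STRASSE"

print(same_word, same_letter, raw_same)
```

Note that this is a comparison key, not something to store or display: folding throws away distinctions that still matter in other contexts.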

This second kind of 'folding' canonicalization is a lot trickier, because it is contextual, not absolute. Which is maybe why Unicode started out trying to make an algorithm for it in UTR#30 but then abandoned it. Nonetheless, despite its trickiness and contextuality, you often still really do need to do it, as in OP.



