
> Technically the shortest UTF-8 representation is _the_ representation and _correctly_normalized_ Unicode is uniquely represented

Not necessarily the shortest: NFC never composes to precomposed characters that were added in later revisions of the standard, so some sequences stay decomposed even though a shorter composed form exists. And you only get a normalized representation if you've actually normalized it. If you've just accepted and maybe validated some UTF-8 from outside, there's no guarantee it's in normalized form. IMO it's worth having separate types for Unicode strings and normalized Unicode strings, and maybe the latter should expose more of the code point sequence representation, but I don't know if any language implements that.
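A rough sketch of what the separate-type idea could look like in Python, using the standard `unicodedata` module. `NFCString` and its `code_points` method are hypothetical names made up for illustration, not an existing library API. The U+0958 case is a composition exclusion, which is a related reason (not the later-revision one) why NFC can be longer than the composed form.

```python
import unicodedata


class NFCString:
    """Hypothetical wrapper type: text that is known to be in NFC."""

    __slots__ = ("_value",)

    def __init__(self, text: str) -> None:
        # Normalize on construction, so every instance holds NFC text.
        self._value = unicodedata.normalize("NFC", text)

    @property
    def value(self) -> str:
        return self._value

    def code_points(self) -> list:
        # Expose the underlying code point sequence.
        return [ord(ch) for ch in self._value]

    def __eq__(self, other):
        if not isinstance(other, NFCString):
            return NotImplemented
        return self._value == other._value

    def __hash__(self) -> int:
        return hash(self._value)


# Valid UTF-8 from outside is not necessarily normalized.
raw = b"e\xcc\x81".decode("utf-8")  # "e" followed by U+0301 COMBINING ACUTE
print(unicodedata.is_normalized("NFC", raw))
print(NFCString(raw) == NFCString("\u00e9"))

# NFC is not always the shortest form: U+0958 is excluded from composition.
print([hex(cp) for cp in NFCString("\u0958").code_points()])
```

Plain `str` stays the type for unchecked input, and comparing two `NFCString` values then behaves like canonical-equivalence comparison. `unicodedata.is_normalized` needs Python 3.8 or later.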



