> If you split a unicode string on codepoints, the results are always valid unicode strings.

"One of the reasons why the Unicode Standard avoids the term “valid string”, is that it immediate begs the question, valid for what?"

Source: http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0034.h...

The only thing you get by splitting a sequence of codepoints at an arbitrary point is another sequence of codepoints. Because codepoints can map to different glyphs, or be ignored entirely, when taken out of their proper order, the result can be nonsense. You can shuffle a sequence of ASCII characters and still end up with a sequence of ASCII characters. What good is that? I fail to see how it is qualitatively different from splitting a UTF-8 byte string at arbitrary byte offsets. The latter is supposed to induce an error, but the former doesn't necessarily.

The Unicode specification is written to degrade gracefully when text is manipulated or displayed by poorly written software, or by old software encountering newer sequences whose semantics it doesn't know. But that's not the same thing as saying that any sequence of codepoints is valid. Rather, it's more akin to undefined behavior in C, except without a license to unleash nasal daemons.
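A minimal sketch (in Python, with example strings of my own choosing) of the contrast drawn above: cutting between codepoints never raises an error but can scramble meaning, while cutting UTF-8 at an arbitrary byte offset can produce a sequence that isn't UTF-8 at all.

  # 1. Splitting on a codepoint boundary: the pieces are still sequences of
  #    codepoints, but the meaning can change. "e" followed by U+0301
  #    (combining acute accent) renders as "é"; splitting between them
  #    detaches the accent from its base character.
  s = "cafe\u0301"           # 5 codepoints, displays as "café"
  head, tail = s[:4], s[4:]  # "cafe" and a bare combining accent
  print(head, repr(tail))    # no error, but the text is now nonsense

  # 2. Splitting UTF-8 at an arbitrary byte offset: this can land inside a
  #    multi-byte sequence and yield bytes that are not valid UTF-8.
  b = "café".encode("utf-8")  # b'caf\xc3\xa9' -- "é" is the two bytes C3 A9
  try:
      b[:4].decode("utf-8")   # cuts the C3 A9 sequence in half
  except UnicodeDecodeError as e:
      print("invalid UTF-8:", e)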
