IMO, the sin of Unicode is that it didn't just pick local language authorities and give them standardized concepts like lines and characters, plus start-of-language and end-of-language markers.

Lots of Unicode issues come from handling languages that the code isn't expecting, and code currently has no means to select which languages it supports or to report which quirks it handles.

I suppose they didn't like getting national borders involved in technical standardization, but that's just unavoidable. Borders are getting involved anyway, and these problems are popping up regardless.


This doesn't self-synchronize. Removing an arbitrary code unit from the text stream (e.g. a start-of-language or end-of-language marker) would change the meaning of codepoints far away from the site of the corruption.
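
For contrast, here's a minimal Python sketch (example text is my own) of what self-synchronization buys you in UTF-8: lead bytes and continuation bytes occupy disjoint ranges, so a decoder resynchronizes right after a lost byte instead of misreading the rest of the stream.

    # UTF-8 self-synchronizes: dropping a byte corrupts only one codepoint.
    data = "aé直".encode("utf-8")        # b'a\xc3\xa9\xe7\x9b\xb4'
    corrupted = data[:2] + data[3:]      # drop one continuation byte
    print(corrupted.decode("utf-8", errors="replace"))  # 'a�直': damage stays local

With a stateful start-of-language marker, by contrast, losing the marker silently re-labels everything up to the next one.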

What it sounds like you want is an easy way for English-language programmers to skip or strip non-ASCII text without having to reference any actual Unicode documentation. Which is a Unicode non-goal, obviously. And also very bad software engineering practice.

I'm also not sure what you're getting at with national borders and language authorities, but both of those were absolutely involved with Unicode and still are.


> start-of-language and end-of-language markers

Unicode used to have language tags, but they've been (mostly) deprecated:

https://en.wikipedia.org/wiki/Tags_(Unicode_block)

https://www.unicode.org/reports/tr7/tr7-1.html
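
For the curious, a minimal Python sketch of how that deprecated plane-14 scheme worked (per the TR #7 link above; the "ja" example is my own): a tag is U+E0001 LANGUAGE TAG followed by "tag characters", which are just ASCII shifted up by 0xE0000.

    # Deprecated plane-14 language tagging (UTR #7): U+E0001 LANGUAGE TAG
    # followed by ASCII tag characters shifted into U+E0020..U+E007E.
    def language_tag(lang: str) -> str:
        return "\U000E0001" + "".join(chr(0xE0000 + ord(c)) for c in lang)

    tagged = language_tag("ja") + "\u76f4"  # tag the following text as Japanese
    print([hex(ord(c)) for c in tagged])    # ['0xe0001', '0xe006a', '0xe0061', '0x76f4']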


The lack of such markers prevents Unicode from correctly encoding strings of mixed Japanese and Chinese text: because of Han unification, many characters share a single codepoint but are expected to render with different glyph shapes in the two languages. So for a piece of software that must accept both Chinese and Japanese names for different people, Unicode alone is insufficient to encode the correct written forms of those names.
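
A small Python demonstration of the underlying issue (the example character is my pick, not the commenter's): U+76F4 is one Han-unified codepoint, so a Japanese name and a Chinese name containing it are byte-identical, and only metadata outside the text can pick the right glyph.

    # Han unification in practice: one codepoint, two expected glyph shapes.
    # U+76F4 (直) looks different in Japanese vs. Chinese typography, but
    # the encoded text carries no trace of which rendering is intended.
    ch = "\u76f4"
    print(hex(ord(ch)))        # 0x76f4 in both languages
    print(ch.encode("utf-8"))  # b'\xe7\x9b\xb4' -- byte-identical either way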

Just in case this has to be said: the reason this wasn't a problem in the past is that you could solve it by picking a team and completely breaking support for the others.

With rapidly improving single-image i18n in OSes and apps (one build serving all locales), "breaking support for the others" slowly went from non-ideal to hardly viable, and the problem surfaced.


I’m working with Word documents in different languages, and few people take care to properly tag every piece of text with the correct language. What you’re proposing wouldn’t work very well in practice.

The other historical background is that when Unicode was designed, many national character sets and encodings already existed, and Unicode’s purpose was to serve as a common superset of them, since otherwise you’d have needed markers when switching between encodings. The existing encodings therefore needed to be easily convertible to Unicode (and back), without markers, for Unicode to have any chance of being adopted. That was Unicode’s value proposition: to eliminate the special-casing of national character sets as much as possible. As a sibling comment notes, there were originally also optional language markers, which however nobody used.
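
A minimal sketch of that round-trip requirement (Shift_JIS and the sample name are my choice of example): text in a legacy national encoding must map into Unicode and back losslessly, with no markers in the byte stream.

    # Round-trip convertibility, the original value proposition: a legacy
    # encoding (here Shift_JIS) converts to Unicode and back without loss
    # and without any in-band markers.
    legacy = "山田".encode("shift_jis")                # legacy bytes
    unicode_text = legacy.decode("shift_jis")          # into Unicode
    assert unicode_text.encode("shift_jis") == legacy  # and back, losslessly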



