
Idunno, the Han unification wasn't that aggressive. What would really be destructive would be something like 瞭 and 了 sharing a codepoint.

Which glyphs were “unified out of existence”? Isn't it more just a matter of using the appropriate typeface or a variation selector?

In my limited experience, I've noticed some things being non-unified that I would expect to be unified, like 步 and 歩 (e.g. in 散步 [zh-TW] vs 散歩 [ja-JP]). As far as I can tell there isn't any semantic difference between these characters; 新字體 just added a stroke to make one of the radicals more consistent. I guess the idea was that Japanese users mix 新字體 and 舊字體 in text, and JIS character sets had separate codepoints for each from the beginning.

As somebody who is learning both Taiwanese Mandarin and Japanese, I don't think it's common for unified characters to differ enough that they would be unrecognizable. I think that no matter how the Unicode Consortium approached this, they would need to set some limits on what gets a separate codepoint. The question is not whether or not to do the Han Unification; the question is when a glyph is actually its own character.
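For anyone who wants to check the non-unified pair themselves, here is a minimal Python sketch. It assumes the two literals below are the same characters as the ones quoted above and survive copy and paste unchanged; the `codepoint` helper is just an illustrative name.

```python
# Compare the code points of the two "walk" characters quoted above.
# Assumption: these literals are the same characters as in the comment.
zh_step = "步"  # form used in 散步 (zh-TW)
ja_step = "歩"  # form used in 散歩 (ja-JP)


def codepoint(ch: str) -> str:
    """Format a single character as U+XXXX."""
    return f"U+{ord(ch):04X}"


print(codepoint(zh_step), codepoint(ja_step))
print("same code point:", zh_step == ja_step)
```

If the two printed values differ, the pair was kept separate rather than unified, which is the point being made about 步 and 歩.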

As for selecting specific glyphs of a character, if you have a font that even has the variant you're looking for, there is https://en.wikipedia.org/wiki/Variation_Selectors_Supplement
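Mechanically, a variation sequence is just the base character followed by a selector code point. A rough Python sketch, assuming the supplement block runs from U+E0100 to U+E01EF; the helper name is made up, and pairing 直 with the first selector is purely illustrative. Whether that particular sequence is registered, and whether your font draws anything different for it, is not something this code can tell you.

```python
# Build a base character followed by a variation selector from the
# Variation Selectors Supplement block (U+E0100..U+E01EF).
VS_SUPPLEMENT_START = 0xE0100
VS_SUPPLEMENT_END = 0xE01EF


def with_variation_selector(base: str, index: int) -> str:
    """Return base + the index-th selector of the supplement block (0-based)."""
    cp = VS_SUPPLEMENT_START + index
    if len(base) != 1 or not VS_SUPPLEMENT_START <= cp <= VS_SUPPLEMENT_END:
        raise ValueError("need one base character and an index in 0..239")
    return base + chr(cp)


# Hypothetical pairing: U+76F4 (直) plus the first supplement selector.
seq = with_variation_selector("\u76f4", 0)
print([f"U+{ord(c):04X}" for c in seq])  # ['U+76F4', 'U+E0100']
```

The string is two code points long, but a renderer that supports the sequence is expected to show a single glyph.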




Let's look at a specific example then of why sharing codepoints between languages is so silly.

Let's talk about 直. Is that a Japanese character? A Chinese one? Well, let's look it up in a Japanese dictionary [0] and then a Chinese one [1].

You should see that the results look different. They're two different characters drawn in two different ways. However, in this hacker news comment, there's no way for me to indicate to use one font for one, and one font for the other. I can't say "The Japanese glyph 直 is the same Unicode codepoint as the Chinese glyph 直 even though they render differently. They were unified". I can't make them render correctly as Japanese and Chinese respectively. On my computer, it _only_ renders in the Chinese variant (without the extra stroke on the left) on Hacker News. As on most sites, there's no way to indicate in the text input which language that portion of my text is in. Unlike with every Western script, if I don't indicate the language correctly, it will be actively rendered wrong and difficult for a reader to understand.
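For contrast, here is roughly what the language hint looks like on a site that does accept markup. This is only a sketch: it assumes HTML output with the standard `lang` attribute, the `tag_lang` helper is a made-up name, and whether a browser actually picks a different glyph depends on the fonts installed. None of this is available in a plain comment box like this one.

```python
from html import escape


def tag_lang(text: str, lang: str) -> str:
    """Wrap text in a span carrying an HTML lang attribute."""
    return f'<span lang="{escape(lang, quote=True)}">{escape(text)}</span>'


ch = "\u76f4"  # 直, one code point shared by both languages
ja = tag_lang(ch, "ja")
zh = tag_lang(ch, "zh-Hans")

print(ja)
print(zh)
```

The character data inside both spans is identical; only the out-of-band tag differs, which is exactly the information a plain-text field throws away.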

To draw an analogy from another Hacker News comment, this would be like the Unicode Consortium saying that 'colour' always renders as 'color', and you just have to switch fonts for it to look like 'colour' [3].

Okay, so that's why unifying things at all is silly and causes trouble.

As for characters that were unified out of existence: unfortunately, examples of those are hard to give. There are various names that have stylistic choices or use unusual characters which can no longer be rendered "correctly". Arguably, that could be seen as akin to the fact that if you style your name calligraphically in the Western world, Unicode doesn't help you replicate that flair.

[0]: https://jisho.org/search/%E7%9B%B4%20%23kanji

[1]: https://www.mdbg.net/chinese/dictionary?page=worddict&wdrst=...

[3]: https://news.ycombinator.com/item?id=8041288


> However, in this hacker news comment, there's no way for me to indicate to use one font for one, and one font for the other.

Sure you can, the Chinese one is 直 and the Japanese one is 直. They are still the same character though, and it's meant the same thing the whole time. The Japanese got it earlier, so they form it in a way that would be recognizable to the scribes of the Zhou dynasty, and possibly the Shang dynasty. [0]

Part of how you can “tell” it's the same character is that the cousin variations are all used in precisely the same way. Many compounds formed with it are shared directly between Japanese and Chinese. [1] [2] [3] The relationship goes right down to some compounds in which it is interchangeable with 只, the latter being a less dated form of it (i.e. 直中 vs. 只中 in Japanese). And beyond all of this, compounds shared between languages have been written in both orthographies and meant precisely the same thing for a very long time.

I think considering these variations to be separate characters makes about as much sense as considering the s in “stop” to be different in French because it's pronounced slightly differently, and because French penmanship is different from British/American penmanship (i.e. sometimes French people lift the pen while forming a lowercase s [IIRC]).

[0]: http://xiaoxue.iis.sinica.edu.tw/yanbian?char=直

[1]: https://en.wiktionary.org/wiki/直言

[2]: https://en.wiktionary.org/wiki/是非曲直

[3]: https://en.wiktionary.org/wiki/強直性脊椎炎


步 and 歩 are similar for sure, and for people aware of what's in Han unification, it's probably not that bad.

But the problem is that many characters are very similar to each other, differing by just one stroke. 今 and 令 for example, are both simplified Chinese, but with completely unrelated meaning.

So, when you see a character that looks familiar, how can you tell whether you are looking at a character that you don't know or a character you know but rendered in a different language?


While 今 and 令 may be stylized similarly today in Chinese, they are clearly distinct, and those sorts of things are not unified in the Han Unification. In fact 令 is different enough that the version of it no longer commonly used in China now has its own codepoint.


> 今 and 令 for example, are both simplified Chinese, but with completely unrelated meaning.

This isn't the best example of meaning, as neither one of them really means anything standing alone like that.


In the case of 今, it means something on its own in Japanese; and to a lesser extent it means something on its own in standard Chinese (kinda like 這, sometimes like 現在) and definitely in classical Chinese. In Japanese, 令 doesn't tend to mean much on its own (though it can still be read a couple ways), but in Chinese it has many standalone meanings.


How do I look at the sequence of Roman characters “gift” and decide if it is an English-language thing I want vs a German-language thing I don’t?

You can’t code your way out of a context trap.



