The one that still surprises me is Hangul (Korean script). Hangul is written with 24 basic letters (jamo) representing consonant and vowel sounds, which are composed into square blocks, each representing a syllable.
Unicode has a block for Hangul jamo, but they aren't used in typical text. Instead, Hangul is represented using a massive 11K-codepoint block containing every possible precomposed syllable. ¯\_(ツ)_/¯
I believe that was a necessary compromise to use Hangul on any software not authored by Koreans.
"These are characters from a country you've never been to. Each three-byte sequence (assuming UTF-8) corresponds to a square-shaped character." --> Easy for everyone to understand, and less chance of screwup (as long as the software supports any Unicode at all).
"These should be decomposed into sequences of two or three characters, each three bytes long, and then you need a special algorithm to combine them into a square block." --> This pretty much means the software must be developed with Korean users in mind (or someone must heroically go through every part of the code dealing with displaying text), otherwise we might as well assume that it's English-only.
Well, now the equation might be different, as more and more software is developed by global companies and there are more customers using scripts with complicated combining diacritics, but that wasn't the case when Hangul was added to Unicode.
For example: if NFD-style (decomposed) Hangul is rendered properly, the first two characters below should look identical, and the third should show a "defective" character that looks like the first two but without the circle (ㅇ). It doesn't work in gvim (which fails to treat the second and third examples as single characters), in Chrome on Linux, or in Firefox on Linux.
은 은 ᅟᅳᆫ
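For the curious, here's a minimal sketch with Python's standard unicodedata module (codepoints spelled out explicitly, since some tools silently normalize pasted text) confirming that the first two strings are canonically equivalent while the filler-based one is left alone:

    import unicodedata

    precomposed = "\uC740"              # 은 as a single precomposed syllable
    conjoining  = "\u110B\u1173\u11AB"  # ᄋ + ᅳ + ᆫ as conjoining jamo
    defective   = "\u115F\u1173\u11AB"  # choseong filler instead of ᄋ

    print(unicodedata.normalize("NFC", conjoining) == precomposed)  # True
    print(unicodedata.normalize("NFD", precomposed) == conjoining)  # True
    print(unicodedata.normalize("NFC", defective) == defective)     # True: the filler never composes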
Of course, if it were the only method of encoding Korean, then the support would have been better, but it would've still required a lot of work by everyone.
The original version of Unicode was primarily intended to unify all existing character sets as opposed to designing a character database from fundamental writing script principles. That's why most of the Latin accented characters (e.g., à) come in precomposed form.
It is worth noting that precomposed Hangul syllables decompose to the Jamo characters under NFD (and vice versa for NFC). However, most data is sent and used with NFC normalization.
This is primarily because the legacy character set (KS X 1001) already contained tons (2,350 to be exact) of precomposed syllables. Unicode 1.0 and 1.1 had lots of syllables encoded in this way, with no good way to figure out the pattern, and in 2.0 the entire Hangul syllable block was reallocated to a single block of 11,172 correctly [1] ordered syllables.
So yeah, Unicode is not the problem here (compatibility with existing character sets was essential for Unicode's success); it's a problem of legacy character sets :-)
[1] Only correct for South Koreans though :) but the pattern is now very regular and it's much more efficient than heavy table lookups.
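To show how regular: the codepoint of any modern syllable can be computed directly from its jamo with a bit of arithmetic. A minimal sketch following the composition arithmetic described in the Unicode core spec (the constant and function names here are mine, not from any library):

    S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
    V_COUNT, T_COUNT = 21, 28

    def compose_syllable(lead, vowel, trail=None):
        """Map leading/vowel/(optional) trailing jamo codepoints to the precomposed syllable."""
        l_index = lead - L_BASE
        v_index = vowel - V_BASE
        t_index = (trail - T_BASE) if trail is not None else 0
        return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

    print(compose_syllable(0x110B, 0x1173, 0x11AB))  # ᄋ + ᅳ + ᆫ -> 은 (U+C740)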
I would imagine this is a legacy from the Good Old Days when every Asian locale had its own encoding. Unicode imported the Hangul block from ISO-2022-KR/Windows-949 (different encodings of the same charset), which has only Hangul syllables.
The ideographic description characters do provide a way to describe how radicals map into characters, but they don't actually cause the character to be rendered that way.
There is active discussion on actually being able to build up complex grapheme clusters in such a manner, because it's necessary for Egyptian and Mayan text to be displayed properly. U+13430 and U+13431 have already been accepted for Unicode 10.0, for some Egyptian quadrat construction.
Korean doesn't use IDSes; jamo are combined into a syllable block by a fixed algorithm (not specced by Unicode, but fixed nonetheless). Korean syllable blocks are made up of a fixed set of components.
IDSes let you basically do arbitrary table layout with arbitrary CJK ideographs, which is very very different. With Hangul I can say "display these three jamos in a syllable block", and I have no control over how they get placed in the block -- I just rely on the fact that there's basically one way to do it (for modern korean, archaic text is a bit more complicated and idk how it's done) and the font will do it that way.
With IDS I can say "okay, display these two glyphs side by side, place them under this third glyph, place this aggregate next to another aggregate made up of two side-by-side glyphs, and surround the resulting aggregate with this glyph". Well, I can't, because I can't say the word "display" there; IDS is for describing chars that can't be encoded, but isn't supposed to really be rendered. But it could be, and that's a vastly different thing from what existing scripts like Hangul and Indic scripts let you do when it comes to glyph-combining.
Jamo, Emoji (including flag combinators), Arabic, and Indic scripts all combine on an effectively per-character basis. There's not really any existing character that says "display any Unicode grapheme A and grapheme B in the same visual cell with A above B." The proposed additions for Egyptian hieroglyphs would be the first such generic positioning control characters to my knowledge, albeit perhaps limited just to characters in the Egyptian Unicode repertoire.
Research on what to do vis-à-vis Mayan characters (including perhaps reusing Egyptian control characters for layout) is still ongoing, as is better handling of Egyptian.
Somebody involved with Unicode must have had the same idea, because the ideographic description characters exist. However, I've never seen them used in practice because they don't actually render the character. You just get something like ⿰扌足, which corresponds to 捉.
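To make the "doesn't actually render" point concrete, a small sketch with Python's unicodedata: the IDS sequence is just three separate codepoints, and no normalization form maps it to the precomposed ideograph.

    import unicodedata

    ids = "\u2FF0\u624C\u8DB3"  # ⿰ + 扌 + 足
    print([hex(ord(c)) for c in ids])
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        # never equal to 捉 (U+6349); IDSes describe characters, they don't compose them
        print(form, unicodedata.normalize(form, ids) == "\u6349")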
I'm curious. Are interlinear ruby annotation codepoints actually used for their intended purpose anywhere? And what are you supposed to do when you encounter one?
I know that they appear on Unicode's shitlist in UTR #20, a proposed tech report that contained a table of codepoints that should not be used in text meant for public consumption. UTR #20 suggested things you could do when you encounter these codepoints, but it was withdrawn, leaving the status of these codepoints rather confused.
> Are interlinear ruby annotation codepoints actually used for their intended purpose anywhere?
Yes
> And what are you supposed to do when you encounter one?
Nothing. Don't display them, or display some symbolic representation. You probably shouldn't make ruby happen here; if your text is intended to be rendered correctly, use a markup language.
----------
Unicode is ultimately a system for describing text. Not all stored text is intended to be rendered. This is why it has things like lacuna characters.
So when you come across some text using ruby, or some text with an unencodable glyph, what do you do? You use ruby annotations or IDS respectively. It lets you preserve the nature of the text without losing info.
(Ruby is inside Unicode instead of being completely deferred to markup since it is used often enough in Japanese text, especially whenever an irregular kanji (one not in the "common" list) is used. You're supposed to use markup if you actually want it rendered, but if you just want to store the text of a manuscript you can use ruby annotations.)
Can you give an example of text in the wild that uses interlinear ruby annotation codepoints? Because I searched the Common Crawl for them, and every occurrence of U+FFF9 through U+FFFB seems to have been an accident that has nothing to do with Japanese.
Note that I didn't actually ask you about rendering.
I care from the point of view of the base level of natural language processing. Some decisions that have nothing to do with rendering are:
- Do they count as graphemes?
- What do you do when you feed text containing ruby characters to a Japanese word segmenter (which is not going to be okay with crazy Unicode control characters, even those intended for Japanese)?
- Could they appear in the middle of a phrase you would reasonably search for? Should that phrase then be searchable without the ruby? Should the contents of the ruby also be searchable?
Seeing how ruby codepoints are actually used would help to decide how to process them. But as far as I can tell, they're not actually used (markup is used instead, quite reasonably). So I'm surprised that your answer is a flat "Yes".
> Can you give an example of text in the wild that uses interlinear ruby annotation codepoints?
Sadly, no :( You may have luck scraping Wikibooks or some other source of PDFs or plaintext. In general you won't find interlinear annotations on the web because HTML has a better way of dealing with ruby. This is also why they're on the "shitlist": that list is for stuff that's expressly not supposed to be used in markup languages.
Another way to get a good answer here is to ask the Unicode mailing list; they tend to be helpful with this sort of thing. I know that they're used because I've heard that they are, but I have no first-hand experience with them. This isn't a very satisfying answer, I know, but I can't give a better one.
> Do they count as graphemes?
The annotation characters themselves? By UAX 29 they probably do, since UAX 29 doesn't try to handle many of these corner-case things (it explicitly asks you to tailor the algorithm if you care about specifics like these). ICU might deal with them better. The same goes for word segmentation, e.g. UAX 29 will not correctly word-segment Thai text, but ICU will if you ask it to. I haven't tried any of this, but it should be easy enough.
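If you want to poke at the grapheme question concretely, the third-party regex module exposes UAX 29 default grapheme clusters as \X. A rough, untested sketch (the annotated string is made up); since U+FFF9..U+FFFB are format (Cf) characters, the default rules should treat each of them as its own cluster:

    import regex  # pip install regex

    annotated = "\uFFF9\u6F22\u5B57\uFFFA\u304B\u3093\u3058\uFFFB"  # 漢字 annotated with かんじ
    print(regex.findall(r"\X", annotated))  # see how the default rules split the annotation chars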
I guess a lot of this depends on what kind of processing you're doing. Ignoring the annotation sounds like the way to go for NLP, since it's ultimately an _annotation_ (which is kinda a parallel channel of info that's not essential to the text). This certainly applies for when the annotations are used for ruby, though they can be used for other things too. Interlinear annotations were almost used for the Vedic samasvara letter combiners, though they ultimately went with creating new combiners since it was a very restricted set of annotations.
They're not used much so the best way forward is probably to ignore them, really. These are a rather niche thing that never really took off.
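And if "ignore them" is the policy, a minimal preprocessing sketch (assuming well-formed anchor/separator/terminator triples; strip_interlinear is just a made-up name) that drops the annotation and keeps the base text:

    import re

    def strip_interlinear(text):
        # Drop the ruby itself: everything from U+FFFA (separator) to U+FFFB (terminator)...
        text = re.sub("\uFFFA.*?\uFFFB", "", text, flags=re.DOTALL)
        # ...then drop the U+FFF9 anchors, keeping the annotated base text.
        return text.replace("\uFFF9", "")

    print(strip_interlinear("\uFFF9\u6F22\u5B57\uFFFA\u304B\u3093\u3058\uFFFB\u3092\u66F8\u304F"))  # 漢字を書く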
I'm not sure if that would be a good thing or a bad thing.