I'm curious. Are interlinear ruby annotation codepoints actually used for their ...

jcranmer · on March 10, 2017

You could just read PropList.txt to find the list of characters with the Deprecated property:

    0149          ; Deprecated # L&       LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
    0673          ; Deprecated # Lo       ARABIC LETTER ALEF WITH WAVY HAMZA BELOW
    0F77          ; Deprecated # Mn       TIBETAN VOWEL SIGN VOCALIC RR
    0F79          ; Deprecated # Mn       TIBETAN VOWEL SIGN VOCALIC LL
    17A3..17A4    ; Deprecated # Lo   [2] KHMER INDEPENDENT VOWEL QAQ..KHMER INDEPENDENT VOWEL QAA
    206A..206F    ; Deprecated # Cf   [6] INHIBIT SYMMETRIC SWAPPING..NOMINAL DIGIT SHAPES
    2329          ; Deprecated # Ps       LEFT-POINTING ANGLE BRACKET
    232A          ; Deprecated # Pe       RIGHT-POINTING ANGLE BRACKET
    E0001         ; Deprecated # Cf       LANGUAGE TAG

(note that the ruby annotation codepoints aren't on that list).
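If you'd rather check this programmatically, here is a minimal sketch of parsing that file format. The sample lines are copied from the listing above; the function names `parse_property` and `has_property` are made up for illustration, and a real run would read the lines from a local copy of PropList.txt instead of the embedded string.

```python
import re

# A few lines copied from the listing above; swap in the real file contents as needed.
SAMPLE = """\
0149          ; Deprecated # L&       LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
17A3..17A4    ; Deprecated # Lo   [2] KHMER INDEPENDENT VOWEL QAQ..KHMER INDEPENDENT VOWEL QAA
206A..206F    ; Deprecated # Cf   [6] INHIBIT SYMMETRIC SWAPPING..NOMINAL DIGIT SHAPES
E0001         ; Deprecated # Cf       LANGUAGE TAG
"""

# "XXXX" or "XXXX..YYYY", then ";", then the property name.
LINE_RE = re.compile(r"^([0-9A-Fa-f]+)(?:\.\.([0-9A-Fa-f]+))?\s*;\s*(\w+)")


def parse_property(lines, prop):
    """Return inclusive (start, end) code point ranges that carry `prop`."""
    ranges = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group(3) == prop:
            start = int(m.group(1), 16)
            end = int(m.group(2) or m.group(1), 16)  # single code point: end == start
            ranges.append((start, end))
    return ranges


def has_property(cp, ranges):
    """True if code point `cp` falls inside any of the ranges."""
    return any(start <= cp <= end for start, end in ranges)


DEPRECATED = parse_property(SAMPLE.splitlines(), "Deprecated")
print(has_property(0x206C, DEPRECATED))  # inside 206A..206F
print(has_property(0xFFF9, DEPRECATED))  # interlinear annotation anchor
```

Comment lines and blank lines in the real file are skipped because they don't match the pattern.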

The use in XML/HTML is no longer maintained by Unicode; it is maintained by the W3C instead: https://www.w3.org/TR/unicode-xml/.

Manishearth · on March 10, 2017

> Are interlinear ruby annotation codepoints actually used for their intended purpose anywhere?

Yes

> And what are you supposed to do when you encounter one?

Nothing. Don't display them, or display some symbolic representation. You probably shouldn't make ruby happen here; if your text is intended to be rendered correctly, use a markup language.

----------

Unicode is ultimately a system for describing text, and not all stored text is intended to be rendered. This is why it has things like lacuna characters.

So when you come across some text using ruby, or some text with an unencodable glyph, what do you do? You use ruby annotations or IDS (ideographic description sequences) respectively. It lets you preserve the nature of the text without losing info.

(Ruby is inside Unicode instead of being completely deferred to markup since it is used often enough in Japanese text, especially whenever an irregular kanji (one not on the "common" list) is used. You're supposed to use markup if you actually want it rendered, but if you just want to store the text of a manuscript, you can use ruby annotations.)

rspeer · on March 10, 2017

Can you give an example of text in the wild that uses interlinear ruby annotation codepoints? Because I searched the Common Crawl for them, and every occurrence of U+FFF9 through U+FFFB seems to have been an accident that has nothing to do with Japanese.
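For anyone who wants to repeat that kind of search on their own corpus, a minimal sketch follows. It only covers the per-document check; how documents are pulled out of a crawl is left out, and `find_annotation_chars` and its `context` parameter are hypothetical names, not part of any existing tool.

```python
import re

# U+FFF9 ANCHOR, U+FFFA SEPARATOR, U+FFFB TERMINATOR
ANNOTATION_RE = re.compile("[\ufff9-\ufffb]")


def find_annotation_chars(text, context=10):
    """Return (offset, code point label, surrounding text) for each hit."""
    hits = []
    for m in ANNOTATION_RE.finditer(text):
        lo = max(0, m.start() - context)
        hi = min(len(text), m.end() + context)
        hits.append((m.start(), f"U+{ord(m.group()):04X}", text[lo:hi]))
    return hits


for offset, label, snippet in find_annotation_chars("abc\ufff9def"):
    print(offset, label, repr(snippet))
```

Looking at the surrounding snippet is what lets you tell a deliberate annotation from an accident.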

Note that I didn't actually ask you about rendering.

I care from the point of view of the base level of natural language processing. Some decisions that have nothing to do with rendering are:

- Do they count as graphemes?

- What do you do when you feed text containing ruby characters to a Japanese word segmenter (which is not going to be okay with crazy Unicode control characters, even those intended for Japanese)?

- Could they appear in the middle of a phrase you would reasonably search for? Should that phrase then be searchable without the ruby? Should the contents of the ruby also be searchable?

Seeing how ruby codepoints are actually used would help to decide how to process them. But as far as I can tell, they're not actually used (markup is used instead, quite reasonably). So I'm surprised that your answer is a flat "Yes".

Manishearth · on March 10, 2017

> Can you give an example of text in the wild that uses interlinear ruby annotation codepoints?

Sadly, no :( You may have luck scraping Wikibooks or some other source of PDFs or plaintext. In general you won't find interlinear annotations on the web because HTML has a better way of dealing with ruby. This is also why they're in the "shitlist"; that list is for stuff that's expressly not supposed to be used in markup languages.

Another way to get a good answer here is to ask the Unicode mailing list; they tend to be helpful with this sort of thing. I know that they're used only because I've heard that they are; I have no first-hand experience with them. This isn't a very satisfying answer, I know, but I can't give a better one.

> Do they count as graphemes?

The annotation characters themselves? By UAX 29 they probably do, since UAX 29 doesn't try to handle many of these corner-case things (it explicitly asks you to tailor the algorithm if you care about specifics like these). ICU might deal with them better. The same goes for word segmentation, e.g. UAX 29 will not correctly word-segment Thai text, but ICU will if you ask it to. I haven't tried any of this, but it should be easy enough.
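As a quick sanity check on what these characters are, here is a small sketch using Python's standard `unicodedata` module. It only reports the general category and name of each code point; the standard library doesn't implement UAX 29 segmentation, so this says nothing definitive about grapheme boundaries. `describe` is a helper name made up for this example.

```python
import unicodedata


def describe(cp):
    """Return (general category, character name) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.name(ch)


for cp in (0xFFF9, 0xFFFA, 0xFFFB):
    category, name = describe(cp)
    print(f"U+{cp:04X} {category} {name}")
```

All three come back as format characters (Cf), the same category as the deprecated 206A..206F block in the listing upthread.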

I guess a lot of this depends on what kind of processing you're doing. Ignoring the annotation sounds like the way to go for NLP, since it's ultimately an _annotation_ (which is kinda a parallel channel of info that's not essential to the text). This certainly applies for when the annotations are used for ruby, though they can be used for other things too. Interlinear annotations were almost used for the Vedic samasvara letter combiners, though they ultimately went with creating new combiners since it was a very restricted set of annotations.

They're not used much so the best way forward is probably to ignore them, really. These are a rather niche thing that never really took off.