Typesetting Rare Chinese Characters in LaTeX (2003) [pdf]

aeontech · on July 23, 2023

18 years later, building a neural network on top of it - https://arxiv.org/abs/2107.00395

hnfong · on July 23, 2023

The neural network is a new twist, but the idea of generating Han "characters" from rules isn't a new one.

The earliest attempt I know of is from the initial versions of Cangjie:

From https://en.wikipedia.org/wiki/Cangjie_input_method#Early_dev... :

```
Initially, the Cangjie input method was not intended to produce a character in any character set. Instead, it was part of an integrated system consisting of the Cangjie input rules and a Cangjie controller board. This controller board contains character generator firmware, which dynamically generates Chinese characters from Cangjie codes when characters are output, using the hi-res graphics mode of the Apple II computer. In the preface of the Cangjie user's manual, Chu Bong-Foo wrote in 1982:

[in translation] In terms of output: The output and input, in fact, [form] an integrated whole; there is no reason that [they should be] dogmatically separated into two different facilities.… This is in fact necessary.…

In this early system, when the user types "yk", for example, to get the Chinese character 文, the Cangjie codes do not get converted to any character encoding and the actual string "yk" is stored. The Cangjie code for each character (a string of 1 to 5 lowercase letters plus a space) was the encoding of that particular character.

A particular "feature" of this early system is that, if one sends random lowercase words to it, the character generator will attempt to construct Chinese characters according to the Cangjie decomposition rules, sometimes causing strange, unknown characters to appear. This unintended feature, "automatic generation of characters", is described in the manual and is responsible for producing more than 10,000 of the 15,000 characters that the system can handle. The name Cangjie, evocative of the creation of new characters, was indeed apt for this early version of Cangjie.

```
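To make the "the code is the encoding" idea concrete, here is a rough sketch of how such storage could work. It is only an illustration: the real firmware built glyphs from decomposition rules, whereas the `GLYPHS` table and `render` function below are hypothetical stand-ins of mine, and unknown codes are shown in brackets instead of being drawn.

```python
# Hypothetical stand-in for the character generator: a tiny lookup table.
# The real system constructed glyphs from Cangjie decomposition rules.
GLYPHS = {"yk": "文", "a": "日", "b": "月", "ab": "明"}


def render(stored: str) -> str:
    """Render text stored as Cangjie codes, each terminated by a space."""
    out = []
    for code in stored.split():
        # A code is 1 to 5 lowercase ASCII letters, as in the quote above.
        valid = code.isascii() and code.isalpha() and code.islower()
        if not (valid and 1 <= len(code) <= 5):
            raise ValueError(f"not a Cangjie code: {code!r}")
        # Unknown codes are where the original would "generate" a new glyph;
        # this sketch just marks them.
        out.append(GLYPHS.get(code, f"[{code}]"))
    return "".join(out)


print(render("yk "))      # the stored text is literally the string "yk "
print(render("a b ab "))
```

Note that nothing here ever converts "yk" to a code point for storage; the lookup happens only at output time.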

Even Unicode has some "support" for this stuff: https://en.wikipedia.org/wiki/Ideographic_Description_Charac...

Whether it's correctly implemented in popular operating systems is another matter.
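For a feel of what that "support" amounts to, here is a small Python sketch using only the standard `unicodedata` module. The example sequence (⿰女子 as a description of 好) is my own choice, not something taken from the linked article.

```python
import unicodedata

# An Ideographic Description Sequence: "left to right: 女, then 子",
# which describes the existing character 好 (U+597D).
ids = "\u2ff0\u5973\u5b50"  # ⿰女子

for ch in ids:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# The sequence is only a description. Normalization does not turn it
# into the single code point for 好; it stays three code points.
composed = unicodedata.normalize("NFC", ids)
print(len(composed), composed == "\u597d")
```

So at the text level the description and the character are unrelated strings; drawing the sequence as a single glyph is left entirely to the renderer.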

Sadly, the current state of affairs is that half of Unicode is "polluted" with rarely used Han characters that just happened to be included in some dictionary. This is the main reason why 64k code points aren't enough for everybody (and why emoji break in JavaScript and other languages whose strings are UTF-16). And this strategy of adding a code point for every rare character continues to be the path of least resistance - if you only need a handful of rare characters, just submit them for inclusion in the next version of Unicode, and get a font that has the required glyphs. That is easier, by a large margin, than getting major vendors to fix their rendering of constructed Han characters.
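To illustrate the UTF-16 point, here is a short Python sketch (Python strings count code points, so the UTF-16 view is computed explicitly). The helper names `utf16_units` and `surrogates` are made up for this example.

```python
def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to store s in UTF-16."""
    return len(s.encode("utf-16-le")) // 2


def surrogates(ch: str) -> tuple[int, int]:
    """High and low surrogate for a single code point above U+FFFF."""
    offset = ord(ch) - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


bmp = "\u6587"        # 文, fits in the first 64k code points
rare = "\U00020000"   # first code point of CJK Extension B
emoji = "\U0001F600"  # 😀

for ch in (bmp, rare, emoji):
    print(f"U+{ord(ch):04X}: {utf16_units(ch)} UTF-16 unit(s)")

print([hex(u) for u in surrogates(emoji)])
```

Anything past U+FFFF, whether a rare Han character or an emoji, takes two code units, which is why a length or index that counts code units can split such a character in half.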