A Spectre Is Haunting Unicode (2018) (dampfkraft.com)
268 points by EvanAnderson on July 14, 2022 | 180 comments




In the 90s I worked on a project to digitize land registration in Taiwan.

In order to record deeds and property transfers, we needed to enter people's names and official registered addresses into the computer system. The problem was that some people used non-traditional writing variants for their names, and some of their birthplaces were tiny places in China with weird names.

Someone might write their name with a two-dot water radical instead of three-dot radical. We would print it out in the normal font, and the people would lose their minds, saying that it was wrong. Chinese people can be superstitious about the number of strokes in their name, so adding a stroke might make it unlucky, so they would not buy the property.

The customer went to the agency responsible for managing the big character set, CNS 11643 (https://en.wikipedia.org/wiki/CNS_11643). Despite having more characters than anything else on earth, it didn't have those variants. The agency said they would not encode them, because they were not real characters, just printing differences.

The solution was for the staff in the office to use a "font maker" program to create a custom font with these characters. Then they could print out the deeds using a Chinese variant of Adobe Acrobat, and everyone was happy.


That's a great story. The inability to represent a name with standard characters reminds me of when Prince changed his name to a symbol and they had to send floppy disks containing a custom font with a single character to all of the media.

https://nymag.com/intelligencer/2016/04/princes-legendary-fl...


Are you acquainted with Freur (that is, "Underworld 0.5": Rick Smith and Karl Hyde in the '80s)?

"Freur", or, "The squiggle we chose as the name for a band but that CBS Records insisted should at least have a pronunciation".

I see it is not in Unicode (well, you can never really know until you try), nor can I find pieces to reconstruct it.

The "freur" in foreground: https://d4q8jbdc3dbnf.cloudfront.net/user/6885/edb290c6183ac...


Yikes. If somebody hasn’t written a “falsehoods programmers believe about human writing systems” document this would make for a good start.


It deserves its own entry in "falsehoods programmers believe about names" lists too.


It's already there, #11 "People’s names are all mapped in Unicode code points."


The falsehood here is thinking that if you can encode the name into the right code points, and you have a font that can print them, the result will be acceptable to the people whose name it is.

They had that, but needed a font that used a different number of strokes for the characters because of the superstition.


More generally, the notion that human culture, systems, and behaviors can be mapped, losslessly and without causing harm, to something a computer understands.

I think these language examples are so good, as examples, because all aspects of them are clear and easy to follow. I think computerization of business and society, and of the systems that make them work, causes immense amounts of this kind of friction and pain all the time, in ways that are much harder to understand, explain, or catalog (which is precisely why it's such a big problem, though as far as I know it's received little attention).

[EDIT] To distill it, I think that trying to make a computer a "source of truth" rather than a tool, tends to do substantial violence to the "truth".


I feel like there has to be some level of triviality at which the harm is no longer being caused by the attempt to systematize something, but rather by a small group of people refusing to be systematized not out of cultural heritage or the like, but purely out of the (inane) human desire to feel special by intentionally doing something in a way nobody else does it.

Language and writing exist to communicate, using patterns of signals that have shared meaning and recognition; things like alphabets and vocabularies are effectively (loose, overlapping, diasporic) consensus-state autoencoding models. They only work to compress meaning, when there are rules for said compression that generalize, and which don't have as many exceptions with their own separate symbols as there are words/names needing to be encoded.

Most countries don't allow you to just make up your own novel graphemes when writing a name on a birth certificate. And nobody is asking for that, either. (Presumably because living in a world where that was allowed would be horrible: you'd no longer be able to error-correct when reading, because any given mysterious squiggle in the middle of a word or name might be exactly what some unknown-to-you-or-anyone-other-than-the-author character is supposed to look like. Is that "o with a curlicue" written here just a semi-cursive attempt at writing an "o" — or is it an "o" with a novel accent marker, one that appears nowhere else, but which must be preserved nevertheless to properly record this person's name?)

Instead, legal names are (in every country I'm aware of) required to be spelled using the character-set of the country you're entering a legal relationship with by being born / immigrating / etc. America? Legal names using the Latin alphabet. Japan? Legal names using characters from this set: https://en.wikipedia.org/wiki/Jinmeiy%C5%8D_kanji

Note, though, that legal names are representations of names. They aren't encodings of names. Your legal name is a distinct thing from your name, just as your credit-card number is a distinct thing from your name. It's an applied-for + registered + assigned systematic identifier for you — a bit like a domain name, or a vanity license-plate number. Which means that your legal name is not a lossy or lossless encoding of your name. It's, per se, a nickname. It doesn't have to have anything to do with your name. (And it often doesn't; immigrants often choose legal names entirely distinct from what they / their home country thinks of as their name.)


> causing harm

> do substantial violence to the "truth"

I don't think this kind of wild escalatory rhetoric is at all helpful to the (presumably) good cause intended. Probably the opposite actually.

Most of the time people are not in fact having substantial harm inflicted upon them by others out to do violence. That mindset seems incredibly fragile, paranoid, and divisive.

The reality is people are just working away to improve things and sometimes they make mistakes or don't have perfect information ahead of time, and other times making things better simply necessitates that not everybody's last whim can always be accommodated. The healthy mindset to have is that not every real or perceived slight against you is done because you are being persecuted by violent hatemongers, and that people should be more accommodating and accepting of the reality that systems and procedures designed for the benefit of everybody may just not be able to accommodate every unique request they have.


The first (harm) is about as mild as it gets, and the second is a common usage.

Webster's 1913:

> 2. Injury done to that which is entitled to respect, reverence, or observance; profanation; infringement; unjust force; outrage; assault.

> We can not, without offering violence to all records, divine and human, deny an universal deluge. - T. Burnet.

It's a bit more poetic a use, but it's not escalatory in the way you suggest.

[EDIT] Further, my point is (and I think that was clear?) that computerization necessarily does these things if you treat the computer as correct and humans as suspect, not that anyone's doing this on purpose.


I'm not sure why you're quoting the dictionary, I didn't say you are incorrect, I said the rhetoric is escalatory, and it is.

"I'm going to shoot you" is escalatory. It doesn't matter that I could have been talking about photographing you.


And to bureaucratic systems too—governments try to regularize human behavior to make it legible to the state, and it can absolutely cause massive harm to people. Seeing Like a State makes a great case for this idea; I found it absolutely fascinating.


One could argue they're facets of the same issue. Although in the spirit of the original list, they would probably get split into separate line items.

On further review, I think this is also similar to #12 & #13 on the list: "names are case-sensitive," and "names are not case-sensitive." To generalize that to include non-Western alphabets: display variations of the same character are significant, and display variations of the same character are not significant.

This of course goes back to the evergreen philosophical question "what even is a character, anyways?" Since we've found a case where two characters which are the same character are not the same character. Are they distinct characters or typographical variants? Yesn't: one would want them unified for searching, but distinct for printing.

But regardless of what they are, these characters/variants only show up in names. Names tend to retain archaic (or extinct) language variations longer than speech, which is the reason for rule #11, which is at least part of the problem.


I fully agree with this second, expanded take of yours. Some names are both represented and not represented by Unicode simultaneously. This suggests there should be variant versions of characters, but that becomes an even thornier combinatorics (and sorting/collation, and lookalike-character) issue than what already exists.


"If the character isn't in Unicode it's in CNS-11643" apparently is also false.


I've been told that this is also an issue in Japan, except the reason might more often be a matter of pride than superstition. It is supposedly one reason (of a few) why fax machines are still in common use in Japan.

Later versions of Unicode support "Variation Forms" of Han characters as a way to encode different variants. They are encoded as a variation selector code point (U+E0100 and up) after the Han character. The forms are listed separately from Unicode versions in the "Ideographic Variation Database" <https://www.unicode.org/ivd/>. So far, it contains characters from a couple of Japanese collections, a Korean one, and one from Macao/Hong Kong.
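A quick sketch of the mechanism in Python. I'm assuming 葛 (U+845B) as the example base character, since it has variants registered in the IVD; which glyph you actually see depends on an IVD-aware font:

    base = "\u845B"        # 葛, a Han character with registered variants
    vs17 = "\U000E0100"    # VARIATION SELECTOR-17, the first ideographic selector
    seq = base + vs17      # an Ideographic Variation Sequence

    print([f"U+{ord(c):04X}" for c in seq])  # ['U+845B', 'U+E0100']
    print(len(seq))                          # 2 code points, one visible character

To software that doesn't know about variation sequences, the selector is just an invisible default-ignorable character, which is what makes the scheme backwards-compatible.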


I knew someone who added an accent character to their name because everyone pronounced it wrong. She met someone bilingual who shot back that if she wants it pronounced that way she needs to add an accent aigu. So she did, and everyone still pronounced her name wrong.

In fact going any place with her very nearly became an “are we living in a simulation” crisis for me because the number of times she would say her name and the other person would say it back incorrectly was… upsetting. The degree to which some people butchered her name, especially combining half of her first and last name into a completely different name, made us joke about buggy NPCs.

I could imagine how in some cultures writing it incorrectly hurts as much as pronouncing it incorrectly. Or possibly more so in places where multiple plausible pronunciations have to be negotiated via an introduction, which is the case in China, is it not?


In Poland people have a neat life hack for that problem. They have other names for non-Polish folk to use, e.g. Pawek, Tomek, or Bartek, rather than have people mangle their real name.

My name got changed when I moved to Spain and it never bothered me, while I have met people who took great offence at the use of standard nicks that they had not explicitly sanctioned in advance. I know a guy who makes a new name up for everyone he meets. Like it or lump it. If you are too sensitive about your name, you risk people not using it at all.


Apropos Poland: the chess legend Miguel Najdorf.

He was in Argentina for the Chess Olympiad in 1939 when WW2 started, so he stayed and got stuck there. It's unlikely he'd be known as "Miguel" now if this had never happened.


I wonder how far back this practice goes. There are quite a few Polish-American last names which are like that, to the point that the original last name has been forgotten.

EDIT: Aha! This website has a guide on these names, and even dispels the Ellis Island explanation I was told as a kid: https://pgsctne.org/changed-surname-list/

They did it the same way you did.


I had a Polish American GF. I have no idea how to pronounce her last name. Also had an Austrian friend. When people asked how to pronounce his last name he'd say you can't.


It does go both ways though. Take the time to learn how to say and spell someone’s name and it usually goes down well.

I say this while fully aware of my own butchering.


People are just incredibly dense sometimes. My wife has a name that's one letter different from a more common name, but clearly different in pronunciation.

Nevertheless there have been countless times where people automatically substitute the more common name, or even worse in text messages manage to misread it and reply incorrectly.

It sometimes upsets her. The NPC analogy is very apt; I guess many people are just very preoccupied?!


> or even worse in text messages manage to misread it and reply incorrectly

Overzealous autocorrect can happen to names, too. There's a whole thing about Asian names not being in computer spellcheck dictionaries: https://www.abbynews.com/news/youre-not-a-mistake-b-c-group-...


That used to drive me crazy at work, where we have lots of Chinese, Indian and other Asian names. Fortunately it seems that Outlook now checks words against the names in the To/From fields.


Not just Asian names; my SO's (English-language) nickname frequently autocorrects to its common homophone. I can always tell who proofreads their texts by how it ends up spelled.


Try typing ‘Siân’ on iOS (well, maybe Sian) and it autocorrects to Asian.

Unhelpful, though luckily found funny when I did it.


My last name isn't English, and doesn't follow English pronunciation rules. I've long ago accepted the minor annoyance of people neither knowing how to spell or pronounce it.

There's a trick in Chinese to explain which character you are referring to when it could be any number of homophones: you repeat it as part of another well-known word. To use a crap analogy, you might say "Je like Jeep" or "Ge like geography" if your name is Jennifer or Gene. The latter is a bad example since Gene is itself a single syllable, but hopefully it is illustrative enough. On the other hand, if you have a special stroke in a common character (or remove a stroke, as in the post above), I am guessing it's harder to explain.


I tried to learn Thai a few years ago, and people learn the consonants in its alphabet via a mnemonic system (much like the NATO phonetic alphabet), so if you want to spell your name you could use a collection of words like "egg short-a bell long-o chicken" and the other person would immediately understand.

On a perhaps related note, I was once accosted in the street by a Thai tuk tuk driver who wanted to know which football team to bet on ("just ask any white bloke you find" probably isn't the best strategy but I digress). One of the teams was Portsmouth, but I told him unless I knew who was playing I had no idea of the strength of the team. He held up a newspaper written exclusively in Thai, which does not use a roman alphabet or anything like it, and proceeded to read out the names of the players almost perfectly. I'm 99% sure Thai has all of the sounds made in English, but even knowing that it still blows my mind to this day.


Thai has the additional wrinkle that its alphabet originates from Sanskrit and has tons of duplicate letters, so you can't just say "K", you need to specify if it's K-as-in-egg, K-as-in-chicken, K-as-in-bell, etc.


Well, those are slightly different from each other. For example, the "k" (ก) used in "chicken" (ko kai) sounds more like a "g".


Forgot which country (Iran, Turkey...), but one diacritic in a phone text got a girl killed because it altered the meaning of one word, turning the sentence from loving to threatening or insulting.


That sounds terrible; however, it's important to remember that diacritics don't get people killed: the person who decides to kill ultimately needs to stop themselves.


No, "diacritics don't kill people, people kill people" is not an important life lesson. It is a reductive just-so generalization of basic common sense that obscures more than it enlightens.

The important thing for engineers to note is a technical shortcoming caused a tragic misunderstanding. Focusing instead on the well-known fact that some people have poor impulse control, knowing full well that is a non-controllable input, instead makes an excuse for poor engineering and implicitly expresses powerlessness to do anything about the problem.


I am all for good localization efforts. I've been something of a champion for that whenever I've been around user facing code and people working on it. I also am a bit of a language nerd and not monolingual.

But yes, misunderstanding or not, we should not kill people.

The story in the sibling comment is about a man attacking his daughter's ex because the ex came to apologize about a confusion over the Turkish dotless I. That's still a violent attack; the father could have kept his emotions in check. I don't condone calling the daughter names, even accidentally, but it is not a crime and the right response is not attempted murder.


> but it is not a crime and the right response

I don't know who you're arguing with, but it isn't me. Nobody is saying it was.

I'm saying it is an irrelevant non sequitur.

Imagine that Dad instead misunderstood an instruction related to a financial transaction and lost a ton of money. Would you now be discounting the technical problem that caused the misunderstanding and berating Dad for being foolish?


I'm not discounting the technical problem.

If I were on a code review and I spotted an issue affecting Turkish dotless I, I assure you I would rant about it more than is reasonable.


Ya, I don't see that happening in authoritarian countries.

As a contrived example: if you had a symbol for 'happy', you want to be very cautious that it doesn't get converted to 'gay', because while in your language gay and happy mean the same thing, in some repressive regime it means the leadership gets to execute you with the approval of the law.


A recent example is that "Let's go [gun emoji] him" could be interpreted as either harmless fun, or conspiracy to murder, depending on if the recipient's phone displays that as a water pistol or a real gun.

Edit: weirdly HN refuses to display that emoji.


Hacker News doesn't allow emojis because only serious fun or something.


HN does not like displaying emojis, though a few slip through I believe.


Try it... you are right. Or just a (・∀・)


Even to a lesser extent, it's easy to forget how a small mistake can have a butterfly effect in other cultures.



Discussed on HN in 2008:

https://news.ycombinator.com/item?id=226853 (18 comments)


In Spanish, dropping one diacritic (~) changes "How old are you?" to "How many anuses do you have?".


In English, dropping one diacritic changes "Where's the rosé?" to "Where's the rose?", and changes "My maté is cold" to "My mate is cold."


An oddity is that "maté" (meant to indicate that the e is pronounced) is an incorrect spelling in both Spanish and Portuguese, where it would wrongly suggest that the e is stressed.

https://en.wikipedia.org/wiki/Yerba_mate#Name_and_pronunciat...


In Polish "zrób mi łaskę" means "do me a favor" and "zrób mi laskę" means "give me a blowjob".


> rosé

Maybe, though it's still halfway the same word.

> maté

Not a change, both spellings are valid.


Maybe even three-quarters the same word. (4/5ths if you count code points in NFD!)

Malé parties are a lot of fun.

Those are some pretty lamé runners.
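The code point arithmetic is easy to check in Python (unicodedata is in the standard library):

    import unicodedata

    print(len("rose"))                                     # 4
    print(len("ros\u00E9"))                                # 4: precomposed é
    print(len(unicodedata.normalize("NFD", "ros\u00E9")))  # 5: e + combining acute

So "rose" and "rosé" share 3 of 4 code points composed, and 4 of 5 decomposed.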


There was a great thread on HN about names and falsehoods programmers believe.

You’ve added to it, as custom fonts weren’t one of the falsehoods covered.

I think it’s this thread: https://news.ycombinator.com/item?id=18567548

Edit: and it’s there, #11.


That sounds equally fascinating, and a little maddening.


Yep, and with pictographic writing systems it's a lot more common than with the Latin script... but even here we have X Æ A-12 Musk, and Prince's name symbol.

Heck, my initials are totally non-standard.


> Chinese people can be superstitious about the number of strokes in their name, so adding a stroke might make it unlucky

Why am I not surprised in the slightest?


This might be an interesting read for those unfamiliar with CJK, but character bloat(?) isn't remotely a recent thing. It's actually at least a couple hundred years old.

The Kangxi dictionary (1716), an authoritative dictionary of Chinese characters, contains definitions for 47035 characters, even though only a couple thousand are in common use. Quoting from Wikipedia: "The dictionary was the largest of the traditional dictionaries, containing 47,035 characters. Some 40% of them are graphic variants, however, while others are dead, archaic, or found only once. Fewer than a quarter of the characters it contains are now in common use."

All of these archaic (or even bogus in some cases) characters found in the dictionary are now part of the Unicode standard, of course :) The unihan database even has a field that shows the page number where the character appears in the Kangxi dictionary. If you're wondering why 65536 characters isn't enough for everyone, the junk in Kangxi dictionary is a significant contribution.
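(The field in question is kKangXi, which gives the page and position in the 1716 edition; if I recall correctly, the entry for 一, U+4E00, reads 0075.010, i.e. page 75, position 1, right at the start of the dictionary proper.)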


Unicode is a mistake that could only have happened in turn of the century America.

It is the distilled essence of the idea that you need to be inclusive of everyone along with a fundamental ignorance of what anyone who isn't American does.

The idea that Chinese characters are glyphs in the same sense of Latin characters can only have come from someone who has never written Chinese.

It is as stupid as demanding a glyph point for each possible integral, e.g. https://quicklatex.com/cache3/8c/ql_9739884527bd893429657272... and https://quicklatex.com/cache3/18/ql_c51509950f58a52253c696a4....

Solutions for English are not solutions for all languages. You can tell because the solutions that were natively invented by people who spoke those languages were _not_ unicode.


JIS, Big5, UHC, and GB all use a codepoint-to-character approach. You're right to point out that many aspects of "multilingual" support are written by people who do not know another language and so end up being hopelessly misguided but it's not really fair to say that Unicode invented this and thrust it upon the CJK world. Every pre-existing system of representing 漢字 had a codepoint table (which Unicode references in their description of each character).

Han Unification was in my view problematic but was driven by technical limitations (then again, if Simplified Chinese characters had also been unified I suspect there would've been more pushback to come up with a better solution, but ultimately Japanese was stuck with being the only one making a major compromise on that front).

I don't think a stroke based or combination system would've been better for many reasons: https://news.ycombinator.com/item?id=32102093. And if you don't trust Americans who at least tried to learn about the subject matter, how much do you trust any other programmer (who has no interest in other languages) to be able to handle a more complicated system for representing and rendering 漢字?


I don't know about Chinese since I'm barely literate in it.

But copying how every Chinese dictionary renders characters, as trees of simpler characters, seems like a much better approach than the arbitrary letter-to-number mapping of Unicode. That it works well in English is only due to the fact that English has no accents.

I can confidently say that native solutions for scripts with accents were _not_ Unicode-like, but used overstrike. My grandfather has the source code, in Romanian, of a 1960s computer payment system he worked on which had to deal with both Romanian and Hungarian names.

The combinatorial explosion of possible letters and accents made Unicode-like encodings an obvious non-starter. Historically names could pick up any accent (sometimes more than one) on any letter. When you have 26 base letters and 6 possible accents you'd need 26 + 26 × 6 (182) unique representations to cover single-accented letters, and 26 + 26 × 6 + 26 × 6² (1,118) to cover double-accented letters. That makes a language which is fundamentally alphabet-based unusable on a keyboard, something that Unicode is still sweeping under the rug.


> But copying how every Chinese dictionary renders characters as trees of simpler characters seems like a much better approach than the arbitrary letter to number mapping of unicode.

Then why does every natively-developed encoding system in a 漢字-using country not do it that way? For one thing, how would you handle the fact that 食反=飯 but 食耳=餌 (and the correct rendering depends on the language and locale of the text being rendered)? How about 辵 (aka 辶, ⻍, ⻌)? There are many issues on top of this one, but this is among the most obvious. In the end you would end up with having your encoding format look like Ideographic Description Sequences (which exist in Unicode) but every rendering library would need to have its own lookup table anyway to produce the correct character. Overlaying accents on top of latin characters (in most European languages) is nowhere near as complicated as combining components to form 漢字.

> I can confidently say that native solutions for scripts with accents were _not_ Unicode-like, but used overstrike.

Unicode supports combining characters for this reason, though there are separate problems with this approach (some characters look almost identical but semantically should be treated differently -- maybe that is something fonts could deal with, but I suspect "Latin Unification" would've gotten more pushback than Han Unification did). If we want computer systems from different languages and cultures to interoperate there are going to be a few rough edges.


> When you have 26 base letters and 6 possible accents you'd need 26 + 26 × 6 (182) unique representations to cover single-accented letters, and 26 + 26 × 6 + 26 × 6² (1,118) to cover double-accented letters.

No, you don't. Only the most common combinations have their own Unicode number. Most combinations can simply be combined by base and accent ("Mark") numbers. Unicode is not that stupid.
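For example (a minimal sketch in Python, using the standard library's unicodedata):

    import unicodedata

    # Precomposed vs. decomposed "é": same rendering, different code points.
    precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
    decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT
    print(precomposed == decomposed)                                # False
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

    # A rarer combination needs no dedicated code point at all:
    stacked = "z\u0301\u0323"  # z + combining acute + combining dot below
    print([unicodedata.name(c) for c in stacked])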


> No, you don't. Only the most common combinations have their own Unicode number. Most combinations can simply be combined by base and accent ("Mark") numbers. Unicode is not that stupid.

https://en.wikipedia.org/wiki/List_of_Unicode_characters#Lat...

The most common being literally all of them.

Between Latin-1 Supplement, Latin Extended-A, Latin Extended-B and Latin Extended Additional you have some 700 extra characters of which half are some type of accented letter. I only said you'd need 182 for the six most common European accents. Unicode somehow ends up using 300.

The only people who defend Unicode are people who have never looked into the spec.


> Every pre-existing system of representing 漢字 had a codepoint table (which Unicode references in their description of each character).

Even typewriters worked with a giant table: https://en.wikipedia.org/wiki/Chinese_typewriter


Han Unification was invented by Chinese people (in Hong Kong iirc), not by Americans. And even before unification, the national standards also had one code point per kanji/hanzi rather than building them up out of radicals, as opposed to the way eg flag emoji are done.

JIS did this in 1978 for instance.

It does appear that computerizing CJK languages has made them very different from handwriting them; native Chinese speakers now constantly forget how to write hanzi. But they did this to themselves.


> It does appear that computerizing CJK languages has made them very different from handwriting them; native Chinese speakers now constantly forget how to write hanzi.

This is separate from the question of encoding -- phonetic-based input systems (IMEs) are a far more likely cause (there are less-widely-used shape-based IMEs which still spit out a Unicode codepoint). The same is happening to Japanese natives, though it should be noted that it's not the case that they cannot write 漢字 normally; they just might forget how to write a relatively rare one (just like how you might forget how to spell a word in English because of a dependence on autocorrect and spellcheck).


I think 'character bloat' is simply inherent to the writing system when characters are written by hand (now that perhaps most written communication is digital, people can't use characters that are not already supported).

Anyone can invent characters whenever they want, and it's only a question of them sticking or not.

I think this is also one of the reasons for the Chinese tendency to push for unification and uniformity.


When it’s character based instead of alphabet based, I think it’s the equivalent of coming up with a new word in English, which is basically what you’re describing.

Sometimes it’s mashing two previously unrelated ‘words’ together (aka the tons of compound characters in Chinese), other times it’s coming up with something completely new.

Same rules apply though, if it doesn’t add value worth the trouble (or get mandated by the powers that be), it’ll eventually just die out or be a curiosity.

Also, to keep it tech related:

RISC = English; CISC/VLIW = Chinese?


IIUC Old Chinese was a much more “isolating” language, in that words were typically single characters - meaning that to make new words, you typically needed to make new characters. As it evolved through the ages, “compound” words composed of multiple characters became more common. These days, new words are almost always combinations of multiple characters (often 2, occasionally 3-4).


> These days, new words are almost always combinations of multiple characters (often 2, occasionally 3-4).

Yep! For example, the most common Chinese term for "Internet" is 因特网. This is composed of three characters:

互: "mutual"

联: "join", "coupled", "allied"

网: "net" -- carrying both the meaning of a woven net and a computer network


I think your comment got a little mixed up - fairly sure Internet is 互联网 (因特网 seems to be a much less common term, a hybrid of phonetic "Inter" and semantic "net")


Having been around before Unicode (and having had to deal with a lot of the early growing pains of getting it working with non-Latin character sets in production), I have to say:

Despite all its problems, the fact these two messages ‘just worked’ is really awesome. I heart Unicode, despite all that.


Whoops, yeah. Copy/paste error.


Any idea if it was due to things like the Confucian officials’ exam system (and the corresponding increase in prioritization of education)?

More complex characters require more education to understand is my guess. Some of the traditional ones are... obscure, and crazy complex.


I'm not entirely sure what you mean to ask nor am I a Chinese speaker, but I have myself suspected that the massive variety of characters was a side-effect of having a middle class that was differentiated based on their ability to read. You see various in-group signalling systems similar to this in lots of areas.

A good historical example is all the strangely specific words for groups of animals. A history I read of this indicated these terms were first found in books sold to nobility, and they were just made up. But you weren't hip if you weren't reading that literature.


Chinese has a lot of specific characters for various species of birds (https://en.wikipedia.org/wiki/Radical_196), fish (https://en.wikipedia.org/wiki/Radical_195), and "bugs" (which includes small reptiles and certain other non-warm-blooded animals) (https://en.wikipedia.org/wiki/Radical_142).

One of my favorite features of the language is the fact that every element in the periodic table gets its own character - and the characters have radicals that indicate their usual state of matter (钅 for metal, 石 for non-metal solids, 氵 for liquids, 气 for gases) - see https://en.wikipedia.org/wiki/Chemical_elements_in_East_Asia....

There are also lots of specific words for species of trees (https://en.wikipedia.org/wiki/Radical_75) and other plants (https://en.wikipedia.org/wiki/Radical_140), but characters under those radicals also include botanical terms, medicines, things made of wood and plants, etc.


> Sometimes it’s mashing two previously unrelated ‘words’ together (aka the tons of compound characters in Chinese), other times it’s coming up with something completely new.

That's not how it works. Most Chinese characters stem from a character C, having a pronunciation A and referring to a meaning M, being used to note another word of meaning M' with the same pronunciation A (or sometimes a slightly different A'). This of course doesn't scale really well, hence the existence of determiners in logographic scripts, which are characters used without their pronunciation, placed before or after another to give a semantic clue. The innovation of Chinese (which I think is why it's still an efficient script today) was to incorporate the determiner in the character itself, giving birth to a character C' where one part refers to the pronunciation and another acts as the determiner, instead of padding the main text with (a lot of) determiners.


I'm not sure I understand. Most European languages go through cycles where letters are added when languages are mixed together followed by periods of redundant letters disappearing. Old English had something like 39 letters. 'th' used to have its own letter: thorn.

I think character proliferation in CJK languages are a result of each word having its own character. The proliferation isn't fundamentally a proliferation of characters, it's a proliferation of words, which happens all the time in all languages. But only in certain languages does this proliferation of words result in additional characters being added to the language.


> I'm not sure I understand. Most European languages go through cycles where letters are added when languages are mixed together followed by periods of redundant letters disappearing. Old English had something like 39 letters. 'th' used to have its own letter: thorn.

There is a fundamental difference between pictographic languages where glyphs have intrinsic meaning, and alphabetic languages where letters reflect sounds.

Old English was much better spelled than current English because it didn't have a spelling. People wrote what they heard. The current mess is because we have 5 centuries of bad standards that can render ghoti as fish. I think you'll agree that the digraph ti, as in nation, is just as nonsensical as sh for the same sound and we'd be much better served by a single glyph for both.

We in fact have that already: https://en.wikipedia.org/wiki/International_Phonetic_Alphabe... English uses somewhere around 45 of those sounds depending on accent. Th for example renders two distinct sounds: ð and θ. þ is not any better than th, apart from brevity, because it also rendered to ð or θ when spoken depending on context.

Chinese is of course as much a pictographic language as English is an alphabetic one. A substantial number of glyphs come from combinations of simpler glyphs which have the same sound as the word you're trying to write.


>Fewer than a quarter of the characters it contains are now in common use

12K characters in common use is equally impressive to me as a non-Asian.


It's actually way fewer than that IRL. Japan's official list of commonly used Kanji only has 2136 characters. Taiwan's list has 4808, and the PRC's list has 3500 "frequent" characters with another 3000 supplementary "common" ones. Digitization has made it even easier to use these characters without recognizing the actual form or how to write them.


The 常用漢字 (Japanese Common Use Kanji) list does not include many kanji that native speakers can read and newspapers don't always follow the rule that they only should use characters from the list. In addition, you need to include the 人名用漢字 (Personal Name Use Kanji) in the list because basically all of those characters are also used in fairly common words.

Native speakers can probably recognise at least 3-4k kanji if not more but can probably only write around 2k from memory, depending on how well-read they are.

嘘 (lie) is the best example of an incredibly common word whose kanji form (which is used fairly often) is not in any official government list.


If you look at a frequency list of Chinese characters,[0] the top 4800 characters make up about 99.9% of modern texts.

That means that if you know 4800 characters, and you read a text that is 1000 characters (equivalent to around 700 words) long, there's likely one character you won't recognize.
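(The arithmetic: at 99.9% coverage, a 1,000-character text has 1,000 × (1 − 0.999) = 1 expected unfamiliar character.)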

The funny thing is, if you recognize only the top six characters, you already know 10% of the characters in a typical text. The distribution is very top-heavy, but with a long tail that you do have to learn to become literate.

0. https://lingua.mtsu.edu/chinese-computing/statistics/char/li...


A now-vanished Chinese restaurant near us was named 'The Good Earth' in English, but in Chinese even I, with near-zero knowledge, could read 'Three Big <somethings>'; I never found out what that last character was, and couldn't imagine what would make sense in context either!


大三元. It's a Mahjong reference.[0]

By the way, those are the 17th, 125th and 370th most common characters in modern written Chinese.

0. https://zh.m.wikipedia.org/zh/%E5%A4%A7%E4%B8%89%E5%85%83


More like 12k characters currently in use at all. Common use characters are a much smaller set than that. (3k or so?)


Does Unicode really need to store Chinese words? Is it impossible to deconstruct the glyphs into strokes, each stroke effectively being a character?


The problem with that would be that every piece of software must know the intricate rules for combining glyphs, and if they guess wrong, users get garbage characters.

Considering that the majority of code is written by people who don't know Chinese characters, it would result in never-ending issues, pretty much everywhere.

Korean actually has a two-way system in Unicode. Every conceivable character (= syllable) possible in modern Korean has its own codepoint, which allows most software to display them correctly: from their point of view, it's just another CJK character.

On the other hand, there is a Unicode area containing Korean sub-blocks ("jamo") that were used historically. In theory, you can combine them and get some pretty funky archaic syllables. Almost no software renders them right.
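The two-way system is easy to see in Python (a minimal sketch; unicodedata is in the standard library):

    import unicodedata

    syllable = "\uAC00"      # 가, precomposed HANGUL SYLLABLE GA
    jamo = "\u1100\u1161"    # conjoining jamo: CHOSEONG KIYEOK + JUNGSEONG A

    print(syllable == jamo)                                # False as raw strings
    print(unicodedata.normalize("NFC", jamo) == syllable)  # True: NFC composes them
    print(len(unicodedata.normalize("NFD", syllable)))     # 2: decomposes to jamo

For modern syllables the composition is purely algorithmic, which is why round-tripping works; the archaic jamo mentioned above have no precomposed forms to map to, so they stay as sequences and most renderers mangle them.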


They can't even get much simpler things right. Qt incorrectly combines accents with the character to the right instead of the left and has been refusing to fix this bug for years.


In addition to the problems mentioned by yongjik, even with the current system very little software is aware that the same codepoint should be rendered differently in different languages (返す needs to be rendered differently in every CJK locale), which often results in websites and programs using Chinese fonts for Japanese text (even if you've configured your language as Japanese). Having stroke breakdowns would not make this situation better, because there are multiple ways to render the same stroke description and there aren't really systematic rules for how to correctly represent the Japanese (or Taiwanese or Korean) version of a character -- it's generally for historical reasons. If you were to try to actually represent the characters faithfully (in an attempt to avoid making every country unhappy with the way you've butchered their language), many characters would become unusable for text searching, because the same "character" (from the perspective of a CJK native) would have a completely different representation in a way that a computer would not be able to identify as being the same (even a character as simple as 言う would have this issue).

I dread to think what an enormous mess would result if every character was represented as a build-it-yourself instruction manual rather than allowing font authors to correctly represent the characters. This is also ignoring that (depending on the font style), the apparent strokes for a character can change between fonts in the same language (this is because the computer font stroke style and the written font stroke style can be different) -- by putting stroke decisions in the encoding you're introducing a layering violation since fonts should be deciding how characters are styled, not encoding format committees.

Also nobody in China, Japan, nor Korea would switch to an encoding system so incredibly inefficient that more strokes results in more bytes being necessary to store the character (they already compromised with having 3-byte UTF-8 characters when JIS, GB, and Big5 all only required 2 -- and Japan was basically forced to compromise on Han Unification). This would've resulted in the failure of Unicode's mission to be the One True Encoding Format.


In the early days of computers some character systems were stroke-based because that used less memory than a 32x32 bit map. A kilobit of ROM (one character) could cost $10.

Currently stroke-based systems are used for calligraphic effect. You could generate new font styles, e.g. bold, by controlling the shape of strokes.

Stroke systems are important for teaching character writing because the drawing order is rigorously prescribed. Once you learn the first couple hundred, you can pretty much guess future characters. Wrong order characters often look bad and suggest a non-Chinese speaker mis-copied them. (e.g. some tattoos)


Unicode has support for this, in the Ideographic Description Characters block (https://en.m.wikipedia.org/wiki/Ideographic_Description_Char...). However, it’s purely descriptive, and not designed for rendering.

There are somewhat more sophisticated systems which define both the rendering and stroke decomposition of characters (e.g. CDL: http://guide.wenlininstitute.org/wenlin4.3/Character_Descrip...). The general workaround for characters that aren’t on Unicode would be to use one of these stroke description systems to create the character, then render it to an image and insert it.
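For a concrete taste of the purely descriptive flavor, here is an Ideographic Description Sequence in Python (a sketch; the IDS describes 休, U+4F11, but no renderer will compose it for you):

    import unicodedata

    ids = "\u2FF0\u4EBB\u6728"  # ⿰亻木 : "亻 beside 木", a description of 休
    for ch in ids:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch, "?"))

    # A renderer just shows the three characters in a row; mapping the
    # description back to the precomposed code point needs a lookup table.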


Many have attempted it, but nobody has succeeded. The most famous one is `Chu, B.F.: 漢字基因朱邦復漢字基因工程 (Genetic engineering of Chinese characters) (2003), http://cbflabs.com/down/show.php?id=26 `


At first glance of the title I thought this was going to be about something like the massive security problem of homoglyph attacks baked into the standard and currently deployed in stuff like phishing, but this ghost character business is pretty interesting.

Japanese literacy requires you to know 2-4 meanings per each of 2,136 kanji characters (something like 6,000+ total possible meanings between these characters) just to be able to pass a university-level literacy test; it's a massive amount of complexity to get right. Even if you just need basic literacy it's still only about a thousand fewer characters than that, and there are even more beyond these for further literacy competence. Furthermore, each of these characters looks funny if not unreadable if you write them down using the wrong order of strokes. I can see how mistakes might have been made even by native speakers of the language.

The two kana syllabaries are there of course, mixed in with the kanji, but if everything were written in kana you wouldn't achieve the same information density, which is probably part of the reason they never switched over. (I understand that before World War 2 or so, the more rounded hiragana was for women while the more sword-stroke-like katakana was for men.)


The Latin alphabet being boring, I spent some time going through ancient alphabets included in Unicode.

It gets pretty trippy, pretty quick.

As in "We don't have a clear idea what this rune was for, or what it means, but we see it in documents and so added it to Unicode."

https://en.m.wikipedia.org/wiki/Runic_(Unicode_block)


My favorite Unicode glyph is Multiocular O (ꙮ). There is only one recorded usage, by a 15th-century Russian monk, who decided to use it in the phrase “many-eyed seraphim” instead of two regular letters ‘o’. So of course it was added to Unicode.

https://en.wikipedia.org/wiki/Multiocular_O


It gets better: this glyph is bugged. The guy responsible for adding it to Unicode somehow got the number of eyes wrong. Per his description, Unicode fonts represent it with 7 eyes, but after getting called out on Twitter he realized the original manuscript shows 10 eyes.

This bug will be fixed in Unicode 15.


Achieving peak Byzantium there, I guess.


What about modern uses of the character that specifically intended 7 eyes? Unicode needs to add a time or (worse, but probably OK) version datum to glyphs or glyph ranges, I suppose (applying it only at the document level wouldn't suffice, as in the case of quoting).


Going by the release notes PDF, it will "expand" to 10 eyes by adding one more to the end of each horizontal row (if people are curious).


This one has sat in the back of my head for a long time. I wonder if any of the little ligatures or strange letter variations people write today could be preserved in the same way, or if the shorthand systems could be.


> As in "We don't have a clear idea what this rune was for, or what it means, but we see it in documents and so added it to Unicode."

Documents? I had the strong impression that there are no documents written in runes. A rune we only know by its occurrence in documents would be far more interesting for the existence of a document than it would be for its own sake!

Compare what the page about Anglo-Saxon runes says about the corpus:

> The Old English and Old Frisian Runic Inscriptions database project at the Catholic University of Eichstätt-Ingolstadt, Germany aims at collecting the genuine corpus of Old English inscriptions containing more than two runes in its paper edition, while the electronic edition aims at including both genuine and doubtful inscriptions down to single-rune inscriptions.

> The corpus of the paper edition encompasses about one hundred objects (including stone slabs, stone crosses, bones, rings, brooches, weapons, urns, a writing tablet, tweezers, a sun-dial,[clarification needed] comb, bracteates, caskets, a font, dishes, and graffiti). The database includes, in addition, 16 inscriptions containing a single rune, several runic coins, and 8 cases of dubious runic characters (runelike signs, possible Latin characters, weathered characters). Comprising fewer than 200 inscriptions, the corpus is slightly larger than that of Continental Elder Futhark (about 80 inscriptions, c. 400–700), but slightly smaller than that of the Scandinavian Elder Futhark (about 260 inscriptions, c. 200–800).

So across every runic system we know, we have under 600 texts, all of those texts are short inscriptions, and even to reach that number of samples we need to include texts that we aren't even sure contain any runes.


> Documents? I had the strong impression that there are no documents written in runes.

One of the original goals of Unicode was to be able to computerize every document. I still have some old linguistics books in which characters have been handwritten into typed or even typeset text. So these are the types of documents being referred to: academic papers.

Some fancy books have photographs of ancient writing; I’m not sure if Unicode tries to encode such sources, and I rather doubt it (how would you even know what to call the symbols? You touch on this in your comment). However, often they are attached to treatises that order the characters in some way (i.e. index an alphabet), in which case the first case above would apply.

In other words: thanks to some scholars who wrote down and ordered runic alphabets, you can now discuss runes with your friends and colleagues through email.


This is interesting. I’m comparing this to how musical notation is encoded in Unicode. I mean, there is a block dedicated to the symbols, so the symbols are encoded, but you can’t document music using only Unicode. But musical documents are being composed and written all the time. To write music you need additional software which arranges these symbols in a certain way so that they express the author’s intention.

I guess math has a similar representation in unicode as well.

All that said, I think people use runes to express magic and spells (even to this day). I don’t think all the magical runes are expressed in Unicode (and perhaps they shouldn’t be). If you want to use a rune in that way, you might have to draw it out in SVG or something and then email it to your friends.


> I guess math has a similar representation in unicode as well.

It's an ongoing project. As you seem to have guessed, Unicode math symbols are just about as useless for representing math as Unicode music symbols are for representing music. Producing mathematical documents is done using dedicated software, generally LaTeX.

(And what you get is a PDF, because, as I noted in another comment, PDFs already support every notation there is, was, or ever will be.)


> (And what you get is a PDF, because, as I noted in another comment, PDFs already support every notation there is, was, or ever will be.)

I now have a perverse urge to invent a sculptural writing system just so I can break this completely reasonable claim.


> One of the original goals of Unicode was to be able to computerize every document. I still have some old linguistics books in which characters have been handwritten into typed or even typeset text.

That's a weird goal for Unicode to have. We've already accomplished that; a PDF file does the job better (note: PDF documents already support every character existing in the past, present, or future!) while being less complex.


I don’t understand. If there is no computerized way to represent the script, all you can do is include photographs in your PDF. The point of computerization is not simply storage and retrieval (and retrieval is hard if you can’t represent the script) but automated processing, which is meaningless if you can’t represent any semantics.

Separately, PDF felt like a step backwards on the day it was announced and sadly nothing since then has changed that.


How do you search for non-unicode characters in a pdf document?


How do you search for them in a book?


ctrl-F once you have digitized it.


And how do you type the character you are searching for?


On Ubuntu: l-ctrl+l-shift+u, <codepoint>, <enter>

Of course, that sucks, so I've programmed a nearby key to act as l-ctrl+l-shift+u.

Several characters can also be typed with Compose Key.

For characters I use regularly (in my case, generally the elder and younger futharks), I've created a keyboard out of an Elgato StreamDeck XL so I can type any of these runes with a single button press.
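For example (assuming IBus, the default input method on stock Ubuntu): Ctrl+Shift+U, then 16a0, then Enter produces ᚠ (U+16A0, RUNIC LETTER FEHU FEOH FE F).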


The question was

> How do you search for *non-unicode* characters in a pdf document


Eh. That was the original question, but I think the implications of some of the subsequent questions can be interpreted in multiple ways.


I don't think that not having a physical key on the keyboard has ever stopped anybody from inputting Unicode symbols.


> I had the strong impression that there are no documents written in runes.

There are. Such documents are called runestones and thousands survive to this day, most in Sweden.

https://en.wikipedia.org/wiki/Runestone


Huh. https://en.wikipedia.org/wiki/Document says:

> Documents are also distinguished from "realia", which are three-dimensional objects that would otherwise satisfy the definition of "document" because they memorialize or represent thought; documents are considered more as 2-dimensional representations.

I think "realia" - a term I had never heard before - describes runestones better than "document".


Runes continued to be used long past the Elder Futhark period and from the medieval period manuscripts survive that fit the modern conception of a "document", most famously the Codex Runicus https://www.e-pages.dk/ku/579/html5/ (202 pages)


https://www.youtube.com/watch?v=2yWWFLI5kFU describes another side-effect of encoding old scripts/runes.


>Documents? I had the strong impression that there are no documents written in runes.

If a clay tablet counts, why not a runestone?


I'm not knocking runestones for being the wrong medium. I'm knocking them for not being documents. A typical cuneiform record might be analogized to an invoice for delivery of a crate of shirts or whatever. (And of course we also have textbooks, dictionaries, literature, correspondence, business reports, mathematical treatises, and every other type of written work.) A typical runic record would be more like the text "Made in Taiwan" printed on the shirt labels.

One of the biggest problems in the study of these cultures is that they left no written records. We know they had a writing system, the runes, but as far as we can tell they almost never used it for anything. Quite the opposite is true of Mesopotamian cultures, where we're buried in more records than we have the manpower to translate.


I suppose it also matters what you think a rune is. Does futhork count? There's parchment with that written on it. Elder Futhark, none as far as I know.


It's because Unicode was defined by Tolkien fans

https://en.m.wikipedia.org/wiki/Runic_(Unicode_block)#cite_n...


Reminds me of the case of U+237C ⍼ RIGHT ANGLE WITH DOWNWARDS ZIGZAG ARROW [0], also discussed on HN [1].

[0] https://ionathan.ch/2022/04/09/angzarr.html

[1] https://news.ycombinator.com/item?id=31012865


Can we talk about the artwork used?

https://dl.ndl.go.jp/info:ndljp/pid/1312837?itemId=info%3And...

https://philamuseum.org/collection/object/84871

Googling for Tsukioka Yoshitoshi brings up so much SEO spam that it is hard to find information in English. If anyone knows anything about it, I'd appreciate a pointer about its content/subject!


Author here. Nobody has ever asked about the art before. It depicts Maruyama Oukyo, a famous painter of ghosts (and other things), where one of his pieces comes to life and frightens him.

https://en.wikipedia.org/wiki/Maruyama_%C5%8Ckyo


> At this rate they'll presumably be with humanity forever. Ψ

So, that's a really interesting thought. Perhaps our solution to a permanent reminder of nuclear destruction[1] could be hidden inside a plane of Unicode.

[1] https://en.wikipedia.org/wiki/Long-term_nuclear_waste_warnin...


Maybe Unicode will feature the same kind of warnings one day.

> This Unicode range is not a place of honor. No highly-esteemed symbol is registered here.

> What was here represented cultural signs that were considered powerful in our time.


Maybe we can encode instructions on how to restart society in Unicode character names? After all basically every computer contains a list of them.


That did not end well for the Georgia guidestones…


Also thought about posting that this morning, but wasn't sure anyone else would get the reference. (As context for everyone else, some kook blew up some of the guidestones last week in the middle of the night)


Not so kooky, they were a call for genocide.


From what I see, this was the maximally flame-y way to say what you said, and it's still inaccurate to call it "Not so kooky" as these commandments, while disagreeable to me, are not really that violent.

1. Maintain humanity under 500,000,000 in perpetual balance with nature.
2. Guide reproduction wisely – improving fitness and diversity.
3. Unite humanity with a living new language.
4. Rule passion – faith – tradition – and all things with tempered reason.
5. Protect people and nations with fair laws and just courts.
6. Let all nations rule internally resolving external disputes in a world court.
7. Avoid petty laws and useless officials.
8. Balance personal rights with social duties.
9. Prize truth – beauty – love – seeking harmony with the infinite.
10. Be not a cancer on the Earth – Leave room for nature – Leave room for nature.

source: https://en.wikipedia.org/wiki/Georgia_Guidestones#Inscriptio...


The whole thing stinks of of ecofascism. These first two read as pretty clear calls for eugenicism/genocide to me.

> 1. Maintain humanity under 500,000,000 in perpetual balance with nature.

> 2. Guide reproduction wisely – improving fitness and diversity.


#1 trends towards ecofascism, but it's also a major theme in 70s environmentalism (it was known as "the population bomb" then and as "degrowth" now) - it's false, of course, but many people do believe it.

#2 is actually quite unique I think. I mean you could say it's eugenics, but eugenicists are rarely in favor of "diversity".


Sure they are - some diversity is obviously good, so the whole population doesn't get wiped out by one disease. There's a tradeoff between diversity, with its resistance to that sort of disaster, and unity/homogeneity, with its resistance to other problems. You'd be a poor eugenicist not to take that into account!


Speaking of obviously fake characters: a Unicode proposal for the Egyptian Hieroglyphs Extended-A block managed to include a hieroglyph of an ancient Egyptian holding a laptop. (Note that this is a proposal and has not yet made it into the standard.) Presumably it was a copyright trap.

https://www.unicode.org/mail-arch/unicode-ml/y2020-m02/0018....


If you type 彁 into Google Translate and set it to detect the language, it will switch to Chinese and translate it as "lingering". If you switch to Japanese, no translation happens.

Also, if you Google 彁, one of the results is this video [!!!!seizure warning!!!!] https://www.youtube.com/watch?v=EsOU0V2kpUI that seems to play on the theme of a computer ghost character.


Google Translate will hallucinate translations for complete nonsense, so this probably doesn't mean anything.


WWWJDIC/JMdict also claims it's a name "Junko", but it also isn't very reliable. If you want to know what a Japanese word means you should look it up in a JP-JP dictionary.


It looks as if these (at least 妛) are being used in various places on- and offline. It's possible that they will eventually become associated with one or more meanings and perhaps a pronunciation.


In East Asian cultures that use Han characters, people used to make up new characters when the need arose.

These days, we scroll through the Unicode standard, find rarely used characters that were accidentally added, and imbue them with new meaning. (Yes, this is seriously a thing.)


When the article said:

"In the end only one character had neither a clear source nor any historical precedent: 彁."

my instinct was that this character could be retconned to mean "character whose meaning has been lost", thus creating a self-referential paradox.

Presumably someone would have to then separately come up with a pronunciation for it. Perhaps pronouncing it "duangu" would solve another problem:

https://coconuts.co/hongkong/lifestyle/duang-jackie-chan-ins...


Oooh, that sounds fascinating. Any examples that spring to mind? Is the pronunciation (or a reasonable representation thereof) already recorded in the Unicode standard, or is that also a bit of free jazz?


An old one but possibly the earliest and most prominent of obsolete Chinese characters being imbued with new (Internet-based) meanings: https://en.wikipedia.org/wiki/Jiong

There's also 奭 https://en.wiktionary.org/wiki/%E5%A5%AD which is occasionally used as a censorship workaround to mock one of Xi Jinping's gaffes in an early-2000s TV interview, where he bluffed about having carried two hundred catties (~100 kg) of wheat along rural mountain roads. The character is composed of two 百 ("hundred") and one 人 ("human/person/people"), which makes it a pictorial allusion to that line. I can't find any English sources about this one, so please bear with my half-assed explanation.


Both cases are fascinating, thank you!! Side note: of course 奭 is pronounced shi. I only know a bare minimum about Chinese but when in doubt: it's pronounced "shi" (with some license regarding tone). https://en.wikipedia.org/wiki/Lion-Eating_Poet_in_the_Stone_...


I don’t recall specific examples, but as of today, there are quite a lot of Cantonese “words” that don’t have a commonly accepted corresponding character. There’s an unofficial list of word-pronunciation-character mappings, probably this list (https://docs.google.com/spreadsheets/d/1W8Ca3U0YfN-LtZxT1otV...). Some of the character mappings have historical precedent, but a few are probably just adopted in the way I mentioned. It’s quite fascinating even to me (as a native speaker) that some common spoken words have absolutely no standard (or even non standard) way to write down as text.

The Unihan database records pronunciations as well, but I don't think anyone takes it seriously. I'm not aware of any software that "prescribes" specific meanings or pronunciations to characters, so basically, if you want, you can pick a character, assign it a meaning, and convince everyone else to use it. For languages that don't have a fully standardized writing system, it's something you can do, and people have done it. (There's also a bunch of people trying to convince others not to use the de facto standardized characters in favor of archaic ones that they claim are more "authentic"... but that's a story for another day.)


A character usually has a component which hints at the pronunciation, and characters are ordered by radical in the standard. So you would go spelunking for a little-used character in the part of the standard that has characters close in meaning or pronunciation to what you are looking for.

Or you just make something up. If you’re coining a new character, you probably don’t care about whether the pronunciation is already known.


One of the reasons I wish a compositional language had been standardized for Unihan instead of the code-point-for-every-character approach.
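
For what it's worth, Unicode does ship a descriptive (deliberately non-rendering) compositional notation: Ideographic Description Sequences, built from operators at U+2FF0..U+2FFB. A minimal sketch in Rust (the codepoints are real; the printed gloss is mine):

    fn main() {
        // U+2FF0 is IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT, so the
        // sequence ⿰ + 氵 (U+6C35) + 工 (U+5DE5) *describes* 江 (U+6C5F)
        // without claiming to be a rendering instruction.
        let ids = "\u{2FF0}\u{6C35}\u{5DE5}";
        println!("{ids} describes \u{6C5F}");
    }

IDS was kept descriptive rather than generative, roughly for the reasons the reply below points to.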


Unfortunately it wouldn't have been workable for the reasons outlined in this thread's responses: https://news.ycombinator.com/item?id=32098359.


Wiktionary claims this character is in Guangyun (1007-1008, see https://en.wikipedia.org/wiki/Guangyun), and gives the link to Kangxi dictionary (1716), https://www.kangxizidian.com/kangxi/0256.gif which means that this character likely predates the Japanese "Overview of National Administrative Districts".


This is a tangent, but I felt like sharing. In college, I purchased a used copy of the Communist Manifesto. Famously, the first line reads, "A spectre is haunting Europe, ...".

The previous owner had both highlighted and circled the word "spectre" and wrote "ghost?" in the margins. The rest of the text was similarly marked up.

Every time I hear the word "spectre" I see "ghost?" in my mind's eye.


I'm more worried about the inflation of emoji than a couple dozen unused ghost JIS characters.


Godwin's second law: any sufficiently long discussion about Unicode includes a discussion about emoji :)


Yeah, it does seem to come up a lot more often than discussions about U+5350.


Why? Unicode isn't running out of space any time soon.


It's not about the space; it's about the use of what is supposed to represent written glyphs for graphical effects that are becoming increasingly complex. There is nothing in principle limiting the infinite expansion of the space of emoji, or of their capabilities.


The encoding has gotten out of hand with compound emoji. Splitting them on glyph boundaries is non-trivial.


The functionality was always there (think precomposed ê vs. e + combining circumflex, which render identically), so properly segmenting a string into glyphs requires returning a `Vec<Vec<char>>`. That makes these emoji actually very useful: they make it more likely that implementations will do the right thing, by giving people from "predominantly ASCII" locales a "tool" that exercises those codepaths. Widespread emoji adoption is likely the best thing that could have happened to proper text handling for users outside the anglosphere.
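
A minimal sketch of that in Rust, assuming the third-party unicode-segmentation crate (the accent pair and the family emoji are standard Unicode examples, not anything from this thread):

    // cargo add unicode-segmentation
    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        let precomposed = "\u{00EA}";   // ê as a single codepoint
        let combining = "e\u{0302}";    // e + COMBINING CIRCUMFLEX ACCENT
        // A family emoji: five codepoints (three people joined by two
        // ZERO WIDTH JOINERs), but one grapheme cluster to the user.
        let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F466}";

        for s in [precomposed, combining, family] {
            let clusters: Vec<Vec<char>> = s
                .graphemes(true)        // extended grapheme clusters
                .map(|g| g.chars().collect())
                .collect();
            println!("{} codepoint(s) -> {} grapheme(s)",
                     s.chars().count(), clusters.len());
        }
    }

All three strings come out as one grapheme cluster, while naive char iteration sees 1, 2, and 5 codepoints respectively.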


Combining characters are trivial to deal with. Some of the new emoji compounding uses things like the color squares to alter an emoji, and you are forced to keep a table of which emoji are dual-function to know whether they merge with their neighbors or stand alone.

Nothing else in Unicode acts like that. You can't properly parse complex emoji glyphs from a random starting point, because you need previous context to know how to interpret the following codepoints. With combining characters, you just skip ahead to the next non-combiner.
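
A toy illustration of that asymmetry, hand-rolled for the demo rather than a real segmenter (a production version would consult the Grapheme_Extend property instead of my hard-coded range):

    fn main() {
        let s = "e\u{0302}\u{0301}x"; // e + circumflex + acute, then x
        let mut clusters: Vec<String> = Vec::new();
        let mut current = String::new();
        for c in s.chars() {
            // Combining marks announce themselves: "I attach to whatever
            // came before." No table of special-case neighbors is needed.
            let is_combining = matches!(c, '\u{0300}'..='\u{036F}');
            if !is_combining && !current.is_empty() {
                clusters.push(std::mem::take(&mut current));
            }
            current.push(c);
        }
        clusters.push(current);
        println!("{clusters:?}"); // ["ế", "x"]
    }

With ZWJ and modifier sequences, by contrast, whether a codepoint merges depends on what preceded it, so this simple forward scan no longer suffices.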


640K should be enough for anybody.


With Unicode, concerns of running out of space are absurd. There are almost a million unallocated codepoints. At the rate codepoints are being allocated, we won't run out for at least 250 years, and emoji are less than 3% of those allocations. Also, the biggest limitation on the number of codepoints is UTF-16, which we have to pray is dead by the late 2200s (when it dies, we'll have over 2 billion unallocated codepoints.)
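
The arithmetic behind those numbers, as a sketch (the assigned-character count is my assumption, roughly the Unicode 14.0 figure from when this thread was written):

    fn main() {
        let codespace: u64 = 0x11_0000; // 17 planes x 65,536 = 1,114,112
        let surrogates: u64 = 0x800;    // U+D800..U+DFFF, reserved for UTF-16
        let assigned: u64 = 144_697;    // approx. characters in Unicode 14.0
        println!("unallocated ~ {}", codespace - surrogates - assigned);
        // Before UTF-16 imposed the U+10FFFF ceiling, UTF-8's original
        // design reached U+7FFFFFFF, i.e. a 2^31 codespace:
        println!("pre-UTF-16 ceiling: {}", 1u64 << 31); // 2,147,483,648
    }

The first line prints 967,367 - "almost a million", as claimed.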


If Slack/Discord/etc. custom emojis get used enough, do they get incorporated into Unicode? I've seen something like 40 variants of laughing emoji, and closer to 400 variants of Pepe the Frog, and I'm not even in any "alt right" or 4chan-adjacent chat rooms/guilds where I imagine there are even more. Not to mention the countless custom anime face ones.


What worries you about it?


Many (too many) online forms in Japan are unable to process foreign names, sometimes for reasons as brain dead as not allowing more than a few characters in length.

Even a perfect Unicode standard wouldn't be able to mitigate the arrogance of a programmer.


Is it possible for the Unicode standard to deprecate characters? If yes, has it already happened?



As far as I know, Unicode can formally mark characters as deprecated (a handful already are), but it never removes them. Removing one would make it impossible to talk about that character ever again, even in a historical context.

Unicode even contains some ancient and long-forgotten scripts so that historians can keep proper records of them.


Summary:

Some Japanese characters that aren't real were accidentally built into the Unicode code table. This is NOT related to the speculative execution attack at all. It's just "whoa, these kanji don't mean anything, how the hell did they get here?!"

Other than that caveat (not security related), this is a fascinating article, especially if you've studied Japanese as a (foreign) language.


Z̵̘̋̎̕ả̶͓͑l̶̜̈͒g̸̡̧̤̋͆õ̶̡͔̥̓ ̵̱͌̈́͝ẃ̴̫̤͘a̶͈̭̱͒̊i̴̭̪̾͑̕ṫ̵͈͙̻s̵̺͐̅̊ ̶̮̩͒̋


I can't be the only person who thought the character would be , right? (based on the first line of the Communist Manifesto: https://en.wikisource.org/wiki/Manifesto_of_the_Communist_Pa...)

edit: ah the character (hammer and sickle) does not show up


Is there anything similar for Latin characters?

The only circumstance I can imagine is where a Latin character has been erroneously encoded with an unused diacritic, for instance a T with a diaeresis.
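
Combining marks make exactly that encodable today, for what it's worth. A tiny sketch:

    fn main() {
        // T + U+0308 COMBINING DIAERESIS: a pairing no orthography uses,
        // yet a perfectly valid Unicode sequence.
        let t_diaeresis = "T\u{0308}";
        println!("{t_diaeresis}"); // T̈
    }

The difference from the ghost kanji is that no bogus codepoint exists here: the oddity lives only in the sequence, so there is nothing in the standard itself to regret.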


Multiocular O is known only from a single word in a single manuscript.


Link to the wiki article, though this is a variation of a Cyrillic letter, not Latin:

https://en.wikipedia.org/wiki/Multiocular_O


This is funny. The Indo-European (and hence Slavic) roots for "eye" typically have an /o/, and this glyph is round like an eye, so it seems this character and the others linked in that article are just people making little cartoonish drawings in writings involving descriptions of eyes.


Something like the letters V and U.


Or double-U (i.e. UU == W)


(2018)



