
This reminds me of Patrick McKenzie's article "Falsehoods Programmers Believe About Names".

http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-b...



If I'm not allowed to assume names fit into the set of Unicode symbols, what do I use for a name field in my database? An image field? What if a dragon whose name cannot be printed attempts to sign up for my service?

Someone who decides to identify themselves with Klingon (whether their parents gave them that name or not) should expect to have an alias ready...


That article doesn't claim that your software must support all of those use cases; it exists to explain that those use cases exist, because some people genuinely believe otherwise. If you write something highly specialized with a strong need for a universal representation, such as genealogy software, you might need to deal with the strangest corner cases. Most services, however, can get away with two simple rules: don't ask for a name if you don't need it, and if you do, use a single Unicode text field and don't try to parse it.
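As a purely illustrative sketch of the second rule (the table and column names here are hypothetical, and the sqlite3 gem is chosen only for the example): one free-form Unicode text column, never split into given/family parts and never parsed.

    require "sqlite3"  # assumption: sqlite3 gem, used only for illustration

    db = SQLite3::Database.new(":memory:")
    # A single free-form Unicode text column; no given/family split, no format validation.
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, display_name TEXT)")
    db.execute("INSERT INTO users (display_name) VALUES (?)", ["毛利元就"])
    db.execute("INSERT INTO users (display_name) VALUES (?)", ["María-José O'Brien"])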


Should a Japanese octogenarian whose parents had the poor taste to spell their child's name with a character that would not make it into Unicode expect the same problem?


What would they enter into _any_ computer system as their name?


I believe that ISO-2022 allows for the full set of Japanese names (and has some sort of process for introducing new kanji). That's probably a big part of the reason that Ruby's strings are bytes with an encoding attribute, rather than just being Unicode.


Ruby was designed as utility goop for Japanese programmers. Inability to parse/output legacy encodings would have rendered it virtually useless for that, even if legacy encodings were strictly dominated by any available Unicode encoding, which many Japanese programmers would hotly contest.


The Ruby thing is probably due to the fact that EUC and Shift-JIS were then (and to some extent still are) the prevalent encodings. It's not so much about character sets; after all, Unicode includes every kanji defined in ISO-2022. Please see my other comment in this thread.


Japanese computer systems often don't use Unicode but are based on other encodings like Shift-JIS.

Even if it weren't for historical characters that aren't part of Unicode, this will probably stay that way because of the inefficiency of encoding Asian text in e.g. UTF-8.

That's part of the reason the Ruby programming language didn't have proper Unicode support for a long time (and now supports arbitrary encodings for its strings, not just Unicode ones).
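A rough sketch of what that looks like in practice (Ruby 1.9 and later): the bytes are kept as-is, and the encoding is just an attribute on the string.

    # 東京 as raw Shift_JIS bytes; the encoding is an attribute, the bytes are untouched
    sjis = "\x93\x8C\x8B\x9E".force_encoding(Encoding::Shift_JIS)
    sjis.encoding         # => #<Encoding:Shift_JIS>
    sjis.valid_encoding?  # => true

    # Transcoding to Unicode is explicit, never automatic
    utf8 = sjis.encode(Encoding::UTF_8)
    utf8              # => "東京"
    utf8.bytesize     # => 6 (3 bytes per kanji in UTF-8 vs 2 in Shift_JIS)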


> Japanese computer systems often don't use Unicode but are based on other encodings like Shift-JIS.

That's true but doesn't really answer GP's question. Shift-JIS is an encoding for one of the JIS X character sets; Unicode includes all the characters defined in JIS.

http://unicode.org/faq/han_cjk.html#8

Though if we get into the details it becomes a bit messy, due to the different simplifications that occurred in (mainland) China and Japan, questions about different glyphs for the "same" character, etc.


Yeah, they got screwed over by the Unicode Consortium; don't go blaming programmers for that.


If for some weird reason all else failed, they could just write their name in hiragana.


"Jim".


What keyboard would they use?


It's not one character per key; they input phonetically or by the type of strokes in the character.


The name representation in your DB may not match the original. I think it might even be fine to allow only ASCII for name input—most people who use a computer are probably used to writing their names in the Latin alphabet. But you should accept that the name you store is just an alias, a person identifier. The other points are still valid: you can also take #7, #37, #38, and #18 into account.

I think it's more about business process than technology limitations.


"I think it even might be fine if you only allow ASCII to input names—most people who use a computer are probably used to writing their names in latin alphabet." But why? Why tell people they are second-class because of their alphabet preference?


Well, among Patrick's claims are that it's probably impossible to input every possible name even with all the power of Unicode, and that there are other limitations that prevent accurate representation of names in a computer system, including human error.

So my point is, maybe it's better not to implicitly promise that your system has no limitations at all, but to make those limitations obvious instead. This helps avoid surprise errors and ensures a consistent user experience by setting correct expectations.

I admit that most users today expect systems to accept their names in their own language, and that it might be an edge case depending on the particular occasion, but then I can see why most payment systems apparently still accept only Latin letters (the cardholder's name is an example).


For the dragon, you can use a password field to hide the name, which satisfies the requirements.

Actually I think the image field is a good idea, just have everyone use a typed alias and then draw/render what they want to be called. Maybe an image and/or a sound, to be more inclusive.


I think his reference to Unicode code points was actually based on UTF-16/8 mismatches, as in "Don't assume one character == one 16-bit char", since the upper code points (U+10000 and up) actually use two chars in, e.g., Java's UTF-16 representation.

There's a lot of code out there that works fine with one-char code points but breaks on two-char code points.
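For example (a sketch in Ruby, which counts code points rather than UTF-16 units, so the mismatch has to be made visible by re-encoding):

    s = "\u{1F600}"                        # 😀, U+1F600, above the Basic Multilingual Plane
    s.length                               # => 1 code point
    s.encode("UTF-16BE").bytesize / 2      # => 2 sixteen-bit units (a surrogate pair)
    "A".encode("UTF-16BE").bytesize / 2    # => 1 unit for a BMP character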


I don't think that's what he meant. A surrogate pair in UTF-16 is still referring to a single code point. It's a good point that it's a place where people could screw up in dealing with Unicode, but it doesn't match his complaint.

What does match is the fact that you can use a huge number of combining characters [1] to form a single glyph; each combining character and the base are separate code points, so in order to figure out how many glyphs there are you have to iterate with knowledge of which code points are combining characters.

[1] https://en.wikipedia.org/wiki/Combining_character
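A quick sketch of that in Ruby (each_grapheme_cluster needs Ruby 2.5 or later):

    s = "e\u0301"                       # "é" as a base letter plus a combining acute accent
    s.length                            # => 2 code points
    s.each_grapheme_cluster.count       # => 1 user-perceived character
    s.unicode_normalize(:nfc).length    # => 1 after composing to the precomposed U+00E9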


Keep in mind UTF-8 can represent everything in Unicode without the surrogate pairs that make UTF-16 as annoying as it is.

UTF-8 is inherently variable-width, but it's a flat scheme in that one universal mechanism suffices to represent everything you can represent.

UTF-16 appears fixed-width until you realize it isn't, at which point you discover it's segmented and the segmentation scheme is both somewhat complex and likely to be missed entirely in naïve testing.
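To make the trade-off concrete, a small comparison sketch in Ruby of the same characters in both encodings:

    ["A", "é", "東", "\u{1F600}"].each do |ch|
      puts "#{ch.inspect}: #{ch.encode('UTF-8').bytesize} bytes in UTF-8, " \
           "#{ch.encode('UTF-16BE').bytesize} bytes in UTF-16"
    end
    # "A": 1 vs 2, "é": 2 vs 2, "東": 3 vs 2, "😀": 4 vs 4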



