Caterina Fake, co-founder of Flickr, famously had issues with IT systems:
Tim: There’re so many places we could start, but in the process of doing homework for this, I found mentioned, and I wanted to do a fact check on this, of you having plane tickets automatically cancelled, and other issues related to your last name. Is that accurate? Did those things actually happen?
Caterina Fake: This has happened to me many times, in fact. And I discovered that it was actually the systems at KLM and Northwest that would throw my ticket out, my last name being “Fake.” And I have missed flights and have spent way too many hours with customer service trying to fix this problem. Here’s another thing too, is that I was unable for the first two years of Facebook to make an account there also. And probably all of my relatives.
lol so much data gets converted into strings at some point when passed around. Definitely encountered systems where you have to check for both null and "null"
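A minimal sketch of how that ambiguity can arise, assuming a pipeline that stringifies every value before passing it on. The helper names `to_wire_naive` and `to_wire_json` are made up for illustration:

```python
import json

def to_wire_naive(value):
    # Naive serialization: everything becomes a plain string, so a missing
    # value turns into the literal text "null".
    return "null" if value is None else str(value)

def to_wire_json(value):
    # JSON keeps the type: None becomes null, while the string "null"
    # becomes a quoted "null".
    return json.dumps(value)

missing = None
surname = "null"

# After naive stringification the two values are indistinguishable.
assert to_wire_naive(missing) == to_wire_naive(surname)

# A typed encoding round-trips both without confusing them.
assert json.loads(to_wire_json(missing)) is None
assert json.loads(to_wire_json(surname)) == "null"
```

Once the naive form has been written anywhere, no downstream check can tell the two cases apart, which is presumably why those systems end up checking for both.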
This seems like a good spot for the link to @patio11's "Falsehoods Programmers Believe About Names"
> So, as a public service, I’m going to list assumptions your systems probably make about names. All of these assumptions are wrong. Try to make less of them next time you write a system which touches names.
I get what he's doing, but some of these are not actionable:
> People’s names are all mapped in Unicode code points.
So... what? What do I do with this? My program has to use something to represent text, and since I fail to be a large multinational consortium, I can't invent my own character set and expect it to work.
Also:
> Confound your cultural relativism! People in my society, at least, agree on one commonly accepted standard for names.
This is pretty much true in countries with naming laws, yes.
> People have names.
People in a database will have certain fields which will not be NULL. Whether you call one of those fields a 'name' outside the context of that database really isn't my concern.
Unicode is not the only character set (or the best one); this is a falsehood programmers believe about character sets (I wrote a list of these too, but I do not remember whether I published it). However, that is not the most severe issue compared with the other things mentioned: people may not have names, there may be multiple ways to enter them, people sometimes change their names or share a name with other people, etc.
> Unicode is not the only character set (or the best one); this is a falsehood programmers believe about character sets
Unicode is the best if I want to communicate with other people. I lived through the 1990s; you won't convince me that playing "guess the encoding" with dozens of subtly-incompatible standards (and non-standards, and almost-standards) was a good time, or that having to override a web browser's helpful guess was fun.
Try to understand these issues, or rather how they could affect your business processes and software implementations down the line, instead of dismissing them on a technical level.
You can store the Unicode representation just as you normally would. But what you don't do is assume that your Unicode representation is the only representation of the actual name.
More concretely, there are names that have multiple equally valid ways of writing them. You can probably expect that usually the same one is used, but you should absolutely not require this when building your business processes.
Even more concretely: there are transliteration and simplification / shortening rules that allow people with otherwise strange or long names to buy an airline ticket. The actual, real name may not be any of the ones you have in your system. This matters e.g. when searching for someone or in customer support.
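A rough sketch of what "don't assume one representation" could look like in a lookup path. Everything here is hypothetical: the `Person` type, the `other_forms` field, the `search_key` helper, and the example name. The key is used only for matching, never for display or storage:

```python
import unicodedata
from dataclasses import dataclass, field

def search_key(text):
    # Lookup-only normalization (an assumed policy): decompose, drop
    # combining marks, casefold, and collapse runs of whitespace.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())

@dataclass
class Person:
    primary_name: str                                 # stored exactly as entered
    other_forms: list = field(default_factory=list)   # transliterations, ticket spellings, ...

    def matches(self, query):
        # A query matches if it equals any known written form after normalization.
        key = search_key(query)
        forms = [self.primary_name, *self.other_forms]
        return any(search_key(form) == key for form in forms)

person = Person("Zoë Müller", other_forms=["ZOE MUELLER"])
assert person.matches("zoe muller")       # accents dropped by the search key
assert person.matches("Zoe Mueller")      # matches the stored transliteration
assert not person.matches("Zoe Miller")   # a different name still does not match
```

Stripping accents is itself lossy and wrong for some languages, so this only widens search; the stored forms stay untouched, and customer support can add another form when a customer turns up with one.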
As for people without names (or with unknown names), you should probably recognize that the handling might differ by country. E.g. records with "John Doe" in the US might have to be handled differently: analogous to "NULL != NULL" in SQL, John Doe != John Doe. Or maybe even "Jane Doe == John Doe" in some cases. See also "FNU LNU" (First Name Unknown, Last Name Unknown), used in the US.
And although I don't have knowledge about all the countries in the world, it may very well be that this leads to situations where the "no name" has to be handled specially or at least understood to be a special case, completely differently from other cases.
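To make the SQL analogy concrete, here is a small sketch of a three-valued name comparison. The placeholder list is an assumption and certainly incomplete, and `NameRecord` and `names_match` are invented for this example:

```python
from dataclasses import dataclass
from typing import Optional

# Assumed, incomplete set of placeholder names; real rules vary by country
# and by the system that produced the record.
PLACEHOLDER_NAMES = {"john doe", "jane doe", "fnu lnu"}

@dataclass(frozen=True)
class NameRecord:
    text: str

    @property
    def is_placeholder(self) -> bool:
        return self.text.casefold() in PLACEHOLDER_NAMES

def names_match(a: NameRecord, b: NameRecord) -> Optional[bool]:
    # Three-valued result, loosely like SQL: None means "unknown", because
    # a placeholder says nothing about what the real name is.
    if a.is_placeholder or b.is_placeholder:
        return None
    return a.text.casefold() == b.text.casefold()

assert names_match(NameRecord("John Doe"), NameRecord("John Doe")) is None
assert names_match(NameRecord("Ada Lovelace"), NameRecord("ada lovelace")) is True
```

Note the obvious flaw, which is sort of the point: somebody whose legal name really is John Doe gets treated as unknown here, so a real system would need an explicit flag on the record rather than guessing from the text.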
> So... what? What do I do with this? My program has to use something to represent text, and since I fail to be a large multinational consortium, I can't invent my own character set and expect it to work.
Maybe don't rush to remove your "legacy" encoding support because "everyone is using UTF-8"? Or at least check with some Japanese users with obscure names first.
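One way to hedge against that, sketched under the assumption that the source system tells you which encoding it used: keep the original bytes and the declared encoding, and decode on demand instead of converting everything to UTF-8 at the door. `StoredName` and `ingest` are hypothetical names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredName:
    raw: bytes       # bytes exactly as received, never rewritten
    encoding: str    # encoding declared by the source system

    def text(self) -> str:
        return self.raw.decode(self.encoding)

def ingest(raw: bytes, declared_encoding: str) -> StoredName:
    # Decode strictly once at the boundary, so a wrong declaration fails
    # here instead of turning into mojibake somewhere downstream.
    raw.decode(declared_encoding)
    return StoredName(raw, declared_encoding)

raw = "山田".encode("shift_jis")
name = ingest(raw, "shift_jis")
assert name.text() == "山田"
assert name.raw == raw   # the original bytes survive untouched
```

This does not help with characters the legacy codec itself cannot map to Unicode, but it at least means a later, better mapping can be applied to the original data.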