Watch as someone names themselves the bell character, “^G” (ASCII code 7). [1]
When they meet people, they explain that their name is unpronounceable: it’s the sound of a PC speaker from the late 20th century. But you can call them by their preferred nickname, “beep”.
On paper and in online forms, they are probably forced to go by the name “BEL”.
This name, "คุณสมชาย" (Khun Somchai, a common Thai name), appears normal but has a Zero Width Space (U+200B) between "คุณ" (Khun, a title like Mr./Ms.) and "สมชาย" (Somchai, a given name).
In scripts like Thai and Chinese, where words are written without spaces, invisible characters such as the Zero Width Space can be inserted to signal word boundaries or provide a hint to text processing systems.
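To make that concrete, here is a minimal Python sketch of how such a name differs from its visually identical twin. The helper name `invisible_format_chars` is made up for illustration; it relies only on the standard `unicodedata` module, which reports U+200B as general category Cf.

```python
import unicodedata

def invisible_format_chars(name: str) -> list[tuple[int, str]]:
    """Return (index, character name) for every Cf (format) character."""
    return [
        (i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for i, ch in enumerate(name)
        if unicodedata.category(ch) == "Cf"
    ]

title = "คุณ"      # Khun
given = "สมชาย"    # Somchai
plain = title + given
with_zwsp = title + "\u200b" + given   # same rendering, one extra code point

print(plain == with_zwsp)                  # False
print(len(with_zwsp) - len(plain))         # 1
print(invisible_format_chars(with_zwsp))   # the ZERO WIDTH SPACE, at index len(title)
```

Whether a system should strip, keep, or merely flag that character is a policy decision; the sketch only shows that a naive equality check will treat the two spellings as different people.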
> Presumably there aren't any people with control characters in their name, for example.
Of course there are. If you commit to supporting everything anyone wants to do, people will naturally test the boundaries.
The biggest fallacy programmers believe about names is that getting name support 100% right matters. Real engineers build something that works well enough for enough of the population and ship it, and if that's not US-ASCII only then it's usually pretty close to it.
Or unpaired surrogates. Or unassigned code points. Or fullwidth characters. Or "mathematical bold" characters. Though the latter two should probably be solved with NFKC normalization instead.
> unpaired surrogates
That’s just an invalid Unicode string, then. Unicode strings are sequences of Unicode scalar values, not code points.
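A quick sketch of the scalar-value point, in Python, whose `str` type happens to allow lone surrogates in memory. The function name `is_scalar_value_string` is hypothetical; the range U+D800 to U+DFFF is the surrogate block.

```python
def is_scalar_value_string(s: str) -> bool:
    """True if s contains no surrogate code points (U+D800..U+DFFF)."""
    return not any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)

lone = "Jos\ud800"   # a str holding an unpaired surrogate

print(is_scalar_value_string("José"))   # True
print(is_scalar_value_string(lone))     # False

# The string cannot be serialised as UTF-8, which is usually where it blows up.
try:
    lone.encode("utf-8")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)
```

Checking at ingress is cheaper than discovering the problem when some downstream component first tries to encode the value.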
> unassigned code points
Ah, the tyranny of Unicode version support. I was going to suggest that it could be reasonable to check that all code points are assigned at data ingress time, but then you urgently need to make sure that your ingress system always supports the latest version of Unicode. As soon as some part of the system starts depending on old Unicode tables, some data processing may go wrong!
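The version problem is easy to reproduce. Python's `unicodedata` module ships one current table plus a frozen Unicode 3.2 table (`unicodedata.ucd_3_2_0`), so the sketch below asks both about U+20B9 INDIAN RUPEE SIGN, which was added in Unicode 6.0. The helper `unassigned` is hypothetical, and the rupee sign is just a convenient stand-in for any recently added character.

```python
import unicodedata

# Version of the Unicode tables bundled with this interpreter.
print(unicodedata.unidata_version)

def unassigned(s: str, db=unicodedata) -> list[str]:
    """List code points that the given database reports as Cn (unassigned)."""
    return [f"U+{ord(ch):04X}" for ch in s if db.category(ch) == "Cn"]

rupee = "\u20b9"   # INDIAN RUPEE SIGN, added in Unicode 6.0

print(unassigned(rupee))                           # [] with current tables
print(unassigned(rupee, unicodedata.ucd_3_2_0))    # ['U+20B9'] with 3.2 tables
```

The same input is accepted or rejected depending solely on which table the checking component was built with.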
How about Private Use Area? You could surely reasonably forbid that!
> fullwidth characters
I’m not so comfortable with halfwidth/fullwidth distinctions, but couldn’t fullwidth characters be completely legitimate?
(Yes, I’m happy to call mathematical bold, fraktur, &c. illegitimate for such purposes.)
> solved with NFKC normalization
I’d be very leery of doing this on storage; compatibility normalisations are fine for equivalence testing, things like search and such, but they are lossy, and I’m not confident that the lossiness won’t affect legitimate names. I don’t have anything specific in mind, just a general apprehension.
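Here is a small Python sketch of why that apprehension seems reasonable: NFKC folds the decorative forms as hoped, but the same mapping also rewrites other characters, and the original cannot be recovered afterwards. The `search_key` helper is hypothetical and shows the "normalise for comparison, store the original" approach.

```python
import unicodedata

def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)

fullwidth = "\uff22\uff4f\uff42"               # fullwidth "Bob"
math_bold = "\U0001d401\U0001d428\U0001d41b"   # mathematical bold "Bob"

print(nfkc(fullwidth))    # Bob
print(nfkc(math_bold))    # Bob

# Lossy: ligatures and superscripts are folded as well.
print(nfkc("\ufb01"))     # "fi" (two letters, from the single ligature U+FB01)
print(nfkc("x\u00b2"))    # "x2"

def search_key(name: str) -> str:
    """Comparison key only; the stored name stays exactly as entered."""
    return nfkc(name).casefold()

print(search_key(fullwidth) == search_key("bob"))   # True
```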
Cc: Control, a C0 or C1 control code. (Definitely safe to reject.)
Cn: Unassigned, a reserved unassigned code point or a noncharacter. (Safe to reject if you keep up to date with Unicode versions; but if you don’t stay up to date, you risk blocking legitimate characters defined more recently, for better or for worse. The fixed set of 66 noncharacters is definitely safe to reject.)
Cs: Surrogate, a surrogate code point. (I’d put it stronger: you must reject these, it’s wrong not to.)
Co: Private_Use, a private-use character. (About elf names, I’m guessing samatman is referring to Tolkien’s Tengwar writing system, as assigned in the ConScript Unicode Registry to U+E000–U+E07F. There has long been a concrete proposal for its inclusion in Unicode’s Supplementary Multilingual Plane <https://www.unicode.org/roadmaps/smp/>, which gets bumped along from time to time. Fairly recently the linked spec document has been hosted on unicode.org itself; I’m not sure whether that means anything.)
Cf: Format, a format control character. (See the list at <https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[...>. You could reject a large number of these, but some are required by some scripts, such as ZERO WIDTH NON-JOINER (U+200C) in Indic scripts.)
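The category breakdown above could be turned into a check along these lines. This is only a sketch in Python: the function names, the "reject"/"review" labels, and the two-character `ALLOWED_FORMAT` allowlist are my own assumptions, and a real allowlist would need to be worked out per script.

```python
import unicodedata

# Hypothetical allowlist: ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER.
ALLOWED_FORMAT = {"\u200c", "\u200d"}

def is_noncharacter(cp: int) -> bool:
    """The 66 fixed noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def check_name(name: str) -> list[str]:
    """Return one finding per problematic code point; an empty list means no objection."""
    problems = []
    for ch in name:
        cp = ord(ch)
        cat = unicodedata.category(ch)
        if cat in ("Cc", "Cs") or is_noncharacter(cp):
            problems.append(f"reject U+{cp:04X} ({cat})")
        elif cat in ("Cn", "Co"):
            # Cn depends on the Unicode version in use; Co is a policy call.
            problems.append(f"review U+{cp:04X} ({cat})")
        elif cat == "Cf" and ch not in ALLOWED_FORMAT:
            problems.append(f"review U+{cp:04X} ({cat})")
    return problems

print(check_name("\x07"))        # the bell character: ['reject U+0007 (Cc)']
print(check_name("\ue000"))      # private use: ['review U+E000 (Co)']
print(check_name("a\u200cb"))    # ZWNJ is on the allowlist: []
```

Splitting the outcome into "reject" and "review" keeps the definitely-wrong cases (controls, surrogates, noncharacters) separate from the ones that depend on Unicode version or local policy.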