
Presumably there aren't any people with control characters in their name, for example.





Watch as someone names themselves the bell character, “^G” (ASCII code 7) [1]

When they meet people, they tell them their name is unpronounceable, it’s the sound of a PC speaker from the late 20th century, but you can call them by their preferred nickname “beep”.

In paper and online forms they are probably forced to go by the name “BEL”.

[1] https://en.wikipedia.org/wiki/Bell_character


Or Derek <wood dropping on desk>

https://www.youtube.com/watch?v=hNoS2BU6bbQ


The interaction brings to mind Grzegorz Brzęczyszczykiewicz:

https://www.youtube.com/watch?v=AfKZclMWS1U

(from the Polish comedy film "How I Unleashed World War II")


I thought this was going to be a link to the Key & Peele sketch: https://youtu.be/gODZzSOelss?t=180


I can finally change my name to something that represents my personality: ^G^C

https://en.wikipedia.org/wiki/End-of-Text_character


คุณ สมชาย

This name, "คุณสมชาย" (Khun Somchai, a common Thai name), appears normal but has a Zero Width Space (U+200B) between "คุณ" (Khun, a title like Mr./Ms.) and "สมชาย" (Somchai, a given name).

In scripts such as Thai and Chinese, where words are written without spaces, invisible characters can be inserted to mark word boundaries or give hints to text-processing systems.
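For anyone who wants to see where such an invisible character sits, here is a minimal Python sketch. The helper name `invisible_format_chars` is made up for illustration, and the name is spelled with escapes so the U+200B is visible in the source; it assumes the standard `unicodedata` module is enough for the job.

```python
import unicodedata

def invisible_format_chars(text):
    """Return (index, code point, name) for every Cf (format) character in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

# "Khun" + ZERO WIDTH SPACE + "Somchai", written with escapes so nothing is hidden.
khun = "\u0e04\u0e38\u0e13"
somchai = "\u0e2a\u0e21\u0e0a\u0e32\u0e22"
name = khun + "\u200b" + somchai

print(len(name))                     # counts code points, including the invisible one
print(invisible_format_chars(name))  # lists the zero width space and its position
```

This only reports the character; whether to keep or strip it is a separate policy decision.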


This reminds me of a few Thai colleagues who ended up with a legal first name of "Mr." (period included), probably as a result of this.

Buying them plane tickets to attend meetings and so on proved fairly difficult.


But C0 and C1 control codes are out, probably.

> Presumably there aren't any people with control characters in their name, for example.

Of course there are. If you commit to supporting everything anyone wants to do, people will naturally test the boundaries.

The biggest fallacy programmers believe about names is that getting name support 100% right matters. Real engineers build something that works well enough for enough of the population and ship it, and if that's not US-ASCII only then it's usually pretty close to it.


Or unpaired surrogates. Or unassigned code points. Or fullwidth characters. Or "mathematical bold" characters. Though the latter two should probably be solved with NFKC normalization instead.
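A quick sketch of what NFKC does to those last two cases, using Python's standard `unicodedata.normalize`. The sample strings are illustrative only, spelled with escapes so the code points are unambiguous.

```python
import unicodedata

# Fullwidth "Alice" (U+FF21, U+FF4C, U+FF49, U+FF43, U+FF45).
fullwidth = "\uff21\uff4c\uff49\uff43\uff45"
# Mathematical bold "Bob" (U+1D401, U+1D428, U+1D41B).
math_bold = "\U0001d401\U0001d428\U0001d41b"

for raw in (fullwidth, math_bold):
    folded = unicodedata.normalize("NFKC", raw)
    # Both compatibility variants fold to plain ASCII letters.
    print(repr(raw), "->", repr(folded))
```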

> Or unpaired surrogates.

That’s just an invalid Unicode string, then. Unicode strings are sequences of Unicode scalar values, not code points.

> unassigned code points

Ah, the tyranny of Unicode version support. I was going to suggest that it could be reasonable to check all code points are assigned at data ingress time, but then you urgently need to make sure that your ingress system always supports the latest version of Unicode. As soon as some part of the system goes depending on old Unicode tables, some data processing may go wrong!

How about Private Use Area? You could surely reasonably forbid that!

> fullwidth characters

I’m not so comfortable with halfwidth/fullwidth distinctions, but couldn’t fullwidth characters be completely legitimate?

(Yes, I’m happy to call mathematical bold, fraktur, &c. illegitimate for such purposes.)

> solved with NFKC normalization

I’d be very leery of doing this on storage; compatibility normalisations are fine for equivalence testing, things like search and such, but they are lossy, and I’m not confident that the lossiness won’t affect legitimate names. I don’t have anything specific in mind, just a general apprehension.
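To make that apprehension concrete, a small sketch of the "normalise for comparison, not for storage" approach. `match_key` is a hypothetical helper name, and the Roman-numeral example is only an illustration of lossiness, not a claim about any real person's name.

```python
import unicodedata

def match_key(name):
    """Lossy key for equality and search only; never written back over the name."""
    return unicodedata.normalize("NFKC", name).casefold()

# Stored verbatim; ends with U+2167 ROMAN NUMERAL EIGHT, a single code point.
stored = "Henry \u2167"
query = "henry viii"

# The keys match, so search still finds the record.
print(match_key(stored) == match_key(query))

# But NFKC rewrote the numeral into the four letters "VIII", and nothing in
# the normalised form says what the original code point was.
print(unicodedata.normalize("NFKC", stored) == stored)
```

Keeping the raw string and deriving the key on the side means the lossy step can be changed later without touching stored names.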


> > Or unpaired surrogates.

> That’s just an invalid Unicode string, then. Unicode strings are sequences of Unicode scalar values, not code points.

Because surrogates were retrofitted onto UCS-2 to make it into UTF-16, they are both code units and (reserved) code points.
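Python happens to show both sides of this, since its `str` type holds code points rather than scalar values. A rough sketch, with hypothetical helper names: a lone surrogate can sit in a string as a code point, but it cannot be encoded as UTF-8, while in UTF-16 a supplementary character is written as a pair of surrogate code units.

```python
def has_surrogate_code_point(text):
    """True if text contains any code point in the surrogate range U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)

def encodes_as_utf8(text):
    """True if text is a sequence of scalar values, i.e. it can be encoded as UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

broken = "Bob\ud800"    # lone high surrogate stored as a code point
emoji = "\U0001f600"    # one scalar value outside the BMP

print(has_surrogate_code_point(broken), encodes_as_utf8(broken))
print(has_surrogate_code_point(emoji), encodes_as_utf8(emoji))
# In UTF-16 the same emoji becomes two 16-bit code units, a surrogate pair.
print(emoji.encode("utf-16-be").hex())
```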


It's safe to reject Cc, Cn, and Cs. You should probably reject Co as well, even though elves can't input their names if you do that.

Don't reject Cf. That's asking for trouble.


Explanation for those not accustomed, based on <https://www.unicode.org/reports/tr44/#GC_Values_Table> (with my own commentary):

Cc: Control, a C0 or C1 control code. (Definitely safe to reject.)

Cn: Unassigned, a reserved unassigned code point or a noncharacter. (Safe to reject if you keep up to date with Unicode versions; but if you don’t stay up to date, you risk blocking legitimate characters defined more recently, for better or for worse. The fixed set of 66 noncharacters is definitely safe to reject.)

Cs: Surrogate, a surrogate code point. (I’d put it stronger: you must reject these, it’s wrong not to.)

Co: Private_Use, a private-use character. (About elf names, I’m guessing samatman is referring to Tolkien’s Tengwar writing system, as assigned in the ConScript Unicode Registry to U+E000–U+E07F. There has long been a concrete proposal for inclusion in Unicode’s Supplementary Multilingual Plane <https://www.unicode.org/roadmaps/smp/>, from time to time it gets bumped along, and since fairly recently the linked spec document is actually on unicode.org, not sure if that means something.)

Cf: Format, a format control character. (See the list at <https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[...>. You could reject a large number of these, but some are required by some scripts, such as ZERO-WIDTH NON-JOINER in Indic scripts.)
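A minimal sketch of the policy described above (reject Cc, Cn, Cs and Co; let Cf through), using Python's `unicodedata.category`. The names `REJECTED_CATEGORIES` and `rejected_chars` are invented for this example, and whether this is the right policy for a given system is still a judgment call.

```python
import unicodedata

# Categories rejected in this sketch. Cf is deliberately absent.
REJECTED_CATEGORIES = {"Cc", "Cn", "Cs", "Co"}

def rejected_chars(name):
    """Return (code point, category) for each character that would be refused."""
    found = []
    for ch in name:
        category = unicodedata.category(ch)
        if category in REJECTED_CATEGORIES:
            found.append((f"U+{ord(ch):04X}", category))
    return found

print(rejected_chars("Bob\x07"))    # bell character, a C0 control
print(rejected_chars("\ue000"))     # first code point of the Private Use Area
print(rejected_chars("\ud800"))     # lone surrogate code point
print(rejected_chars("\uffff"))     # one of the 66 noncharacters
print(rejected_chars("a\u200cb"))   # ZERO WIDTH NON-JOINER is Cf, so nothing is flagged
```

The Cn answers depend on the Unicode tables bundled with the interpreter (`unicodedata.unidata_version`), which is exactly the version-lag risk mentioned for unassigned code points.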


Challenge accepted, I'll try to put a backspace and a null byte in my firstborn's name. Hope I don't get swatted for crashing the government servers.

That sounds like a reasonable assumption, but probably not strictly correct.

Mandatory reference: https://xkcd.com/327/


