I've stopped using my real name. My last name has an apostrophe, and at this point I'm just sick of websites rejecting it and then forcing me to fill in all of my information again.
So, now I drop the apostrophe.
In high school my school ID was completely broken. It printed <first half of last name> <first letter of last name> <middle name> <first name>.
They ended up putting a quotation mark instead of the apostrophe - somehow the machine handled that fine. That was the workaround.
Hotel wifi systems tend to break if they ask for a last name to connect. Again, training me to reserve the hotel under a 'fake' last name.
I actually notice it every time I write my last name. I'm like "is this a system that will break" and now I'm inclined to just never use the "real" spelling because then I'll get a mismatch later on.
This is something I've really noticed over the last decade. The older I get, and the more systems become a part of my life, the more systems I have to give a fake name. Paper records will show my true name, but I wonder what, given all the digital records at this point, my name would appear to be.
I was at Xerox when they were inventing i18N in the mid 1980s. They were so excited when I started because my real name is Mairé and not Maire. They wanted my login name to include the acute accent.
Over time all of the i18N people left for Apple. My login broke and I removed the accent. When I went to Apple I never bothered to add it back, even though I was friends with the Unicode gang.
Well, my name cannot be accurately represented in Latin characters. Most people in Taiwan adopt an English name because the Mandarin one is hard on computers and communicating with English-speaking businesses.
I think a lot of those writing software just escape [^a-zA-Z0-9-_\.] or similar just to be overly cautious. I've definitely been guilty of this myself when I'm in a rush.
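To make that concrete, here is a minimal sketch of what such a whitelist does to real names. It assumes the pattern is used to strip characters (it could equally be used to reject the input outright), and `strip_unsafe` is a made-up helper name. The character set is the one quoted above, with the hyphen moved to the end so it cannot be read as a range.

```python
import re

# Same character set as the pattern quoted above; the hyphen is placed last
# so it is unambiguously a literal hyphen.
UNSAFE = re.compile(r"[^a-zA-Z0-9_.-]")


def strip_unsafe(name: str) -> str:
    """Hypothetical 'overly cautious' sanitizer: drop everything not whitelisted."""
    return UNSAFE.sub("", name)


for original in ["O'Brien", "Mairé", "L'Oréal", "van den Broek"]:
    # Apostrophes, accented letters and even spaces silently disappear.
    print(f"{original!r} -> {strip_unsafe(original)!r}")
```

Every one of those names comes out changed, which is exactly the mismatch people upthread are describing.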
I'm also the person who gets irrationally annoyed at web forms that barf at my username+tag@example.com email address usage (+ signs being valid in the username part), so I can certainly relate to the ire, too.
It's a special character in HTML, which is the context in which the name is being shown. The fact that it's escaped before being concatenated into the HTML document is fine; the fact that it's triple-escaped is the problem.
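A small sketch of the difference, using Python's standard `html.escape` as a stand-in for whatever templating layer the site actually uses (that part is an assumption):

```python
import html

name = "O'Brien"

# Escaping exactly once, right before the value goes into the HTML document.
once = html.escape(name)

# Escaping the already-escaped value again (e.g. once on input, once in the
# database layer, once in the template) compounds the entities.
twice = html.escape(once)
thrice = html.escape(twice)

print(once)    # what a browser will render back as O'Brien
print(thrice)  # what a browser will render as visible entity garbage
```

A browser undoes exactly one level of escaping, so only the first form displays as the original name.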
>Hotel wifi systems tend to break if they ask for a last name to connect
DD-WRT WPA2 had issues with apostrophes in passwords. I wonder whether OEM firmwares have such issues, which would be ironic considering that special characters improve password strength.
The number of systems failing the "L'Oréal test" is staggering. (This one tests multiple assumptions: nonalphabetic names, nonascii names, capitalization. We have encountered a particularly unexpected name+address a long time ago, and the name kinda stuck ;))
Kwpolska, your account is shadowbanned, most posts from you are not displayed, although they are overwhelmingly useful/factually correct. I am unable to directly reply to your post, so need to go up one level and hope you see this post.
(HN readers, if you don't understand what's going on, then enable "showdead" in your profiles.)
Apostrophe is within ASCII, but breaks sites that a) are prone to SQL injection and/or b) are prone to XSS, generally anything where it can be a control character.
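For the SQL half of that, a minimal sketch with the standard `sqlite3` module. The `guests` table and both helper functions are invented for illustration; the point is only the difference between pasting the name into the statement and passing it as a parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE guests (last_name TEXT)")


def insert_by_concatenation(name: str) -> None:
    # The apostrophe ends the SQL string literal early, so the statement
    # is malformed (and, with a hostile name, injectable).
    conn.execute(f"INSERT INTO guests (last_name) VALUES ('{name}')")


def insert_with_parameter(name: str) -> None:
    # The driver passes the value separately; no character is special.
    conn.execute("INSERT INTO guests (last_name) VALUES (?)", (name,))


try:
    insert_by_concatenation("O'Brien")
except sqlite3.OperationalError as exc:
    print("concatenated query failed:", exc)

insert_with_parameter("O'Brien")
print(conn.execute("SELECT last_name FROM guests").fetchall())
```

With parameters the apostrophe is just data, so there is nothing to reject in the form.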
"é" is not in ASCII, you're thinking of some "extended ASCII" variant, such as Latin-n (where n goes from 1 to 12). Those are insidious, as your "of course é is in ASCII" becomes "of course ř is in ASCII", or something similar. Finding the original encoding then requires out-of-band signalling, or guessing (heuristics ;)).
The technical limitations checked here are in bespoke code, not in the OS.
é is not in ASCII. No accented letters are. You might be thinking of ISO-8859-1, or Windows-1252, which Windows misnames as "ANSI". But not all legacy/broken systems use these, they might be thinking in a different legacy encoding, which lacks an é.
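A quick sketch of the encoding points made in the last few comments, using only Python's built-in codecs. KOI8-R is picked here purely as one example of a legacy encoding with no é; the variable names are illustrative.

```python
name = "Mairé"

# ASCII has no accented letters at all.
try:
    name.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

latin1_bytes = name.encode("latin-1")  # é is the single byte 0xE9
utf8_bytes = name.encode("utf-8")      # é is the two bytes 0xC3 0xA9

# Same bytes, wrong guess about the encoding: classic mojibake.
mojibake = utf8_bytes.decode("latin-1")

# The "of course ř is in ASCII" trap: one byte, two meanings.
r_caron_as_latin1 = "ř".encode("iso8859-2").decode("latin-1")

# A legacy encoding that simply has no é; with errors="replace" the letter
# degrades to a question mark.
koi8_bytes = name.encode("koi8_r", errors="replace")

print(fits_in_ascii, latin1_bytes, utf8_bytes, mojibake, r_caron_as_latin1, koi8_bytes)
```

Without out-of-band information about which encoding produced the bytes, the receiving system can only guess.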
Yep, I too have stopped using my real name. I have an internet nickname which is "close" to my real name that I use for most sites because I am tired of having problems with my real name.
This is really annoying because whenever I call about an account somewhere they will ask my name and I will say it is "AAAA ZZZZZZZ" and they will say, "oh it looks like I don't have an account for you". So then I will have to say, "oh wait, try 'BBBBB ZZZZZZ' or 'AAA BBB ZZZZZ', because that is probably me." This happens all the time when I book hotel rooms or reservations. I will come in and say I have a reservation under one name, they will say they can't find me, then I list several potential names. It causes all sorts of confusion and remarks of suspicion.
(those aren't my real names, just examples for illustrative purposes)
To make matters worse, my first name is broken into two words, plus then my family name. All three words can individually be perceived as "first names" or "last names". So the initial confusion will frequently lead to my name being collected wrongly and its parts being put in the wrong order.
It is so bad, that I have had to register an alias now with the credit bureaus because it causes so many problems (yes you can do this if you call them) when authenticating me under various names.
Another fun/annoying recent story is that I am in the process of selling my house. My realtor told me that there was a problem because the county doesn't have me as the registered homeowner. My mortgage was signed using my real legal name. But apparently the county's computer system can't handle someone with punctuation in their legal name that is not part of a "title". So my name got discombobulated in their system and is reported as a single letter with no punctuation. So the county records for my house do not show me as the homeowner, because their system made too many assumptions about naming conventions and my name got "broken" in the process. So now I have to correct it manually before I can sell my house.
Long story short... you should make no assumptions about names when creating a name field in a system architecture. It should literally allow for the absolute highest amount of flexibility possible. Especially if you are designing a system for say... the county government's title and deed records.
I find this terrifying because I have worked with county-level government organizations in the past, and have seen some truly horrendous and ancient systems. But they have such a high level of control over important things in one's life, such as ability to sell property! To fall victim to this because of something as simple as a name that doesn't fit a legacy system's assumptions is really painful and discouraging I imagine.
There are a lot of places where Know Your Customer rules apply. Further, there are places where identity really does matter, such as insurance. The approach you outlined sounds like a headache when it comes to those things. Not sure who the headache is for... :)
Mostly because of the concept of “insurable interest”. Insurance pricing doesn’t hold up unless the buyer has a reason to avoid the outcome being insured against. The simplest example is life insurance.
Why do you question whether identity would matter?
I would say that NONE need to identify you uniquely. If any engineer is using names as a unique identifier then they need to find a different career.
I can't just call up and say "My name is John Snow, I want to withdraw $100 from my account." Even a small system will quickly run into duplicates. Names are far from unique. Systems might use name for reference purposes, but beyond that, some other unique identifier needs to be used.
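A minimal sketch of that separation: the name is stored for reference only, and a generated identifier does the identifying. `Account`, `open_account` and the in-memory dict are all hypothetical, and a random UUID is just one possible choice of identifier.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Account:
    display_name: str  # free text, never used as a lookup key
    balance: int = 0
    account_id: str = field(default_factory=lambda: uuid.uuid4().hex)


accounts: dict[str, Account] = {}


def open_account(display_name: str) -> Account:
    account = Account(display_name)
    accounts[account.account_id] = account  # keyed by id, not by name
    return account


first = open_account("John Snow")
second = open_account("John Snow")  # a namesake is not a conflict
print(first.account_id != second.account_id, len(accounts))
```

Two customers with the same name simply become two rows, and a name change later touches one field rather than a key.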
The last sign-up system I built you get a single max length utf8 field called 'Name'. This field is not used at all in the UI except on the profile screen where it appears in full and in email templates where it also appears in full. It might not be perfect but it's as good as you are going to get without having to become an expert in names. I'll leave full localisation to companies that can afford to dedicate a team to it.
In my university (TU Delft), there are people from all kinds of countries, so they have a (I suppose legally-mandated) first name / last name entry, but also a "How should we call you" entry, in which you can write anything you want, and that's what will appear in the UI and class lists.
I'm Brazilian, a country where the vast majority of people will have at least two last names (typically one maternal, one paternal, but not uncommonly multiple from either or both parents), separated by spaces.
I've immigrated to the US, where the standard is for people to have a single last name, or if they have multiple, they're separated by hyphens.
To make my situation a little worse, legally I have two first names i.e. the second one is not registered as a middle name in any official document I have.
I've dealt with systems that wouldn't accept my last names. I either had to join them together, or separate them by a hyphen.
Some systems don't accept my legal double first name. When I got my driver's license in WA, the system accepted it but later had trouble generating the license identifier, so they had to manually apply the system's logic and enter it by hand.
Some systems just ask you to enter your full legal name, and then try to be smart about figuring out what your first, middle, and last names are. So I end up with the first of my last names being registered as a middle name, and I can't edit it.
Fortunately I'm close to being able to apply for naturalization, and that'll give me the opportunity to change my name. I'll just drop part of my first name and one of my last names and finally make it simple.
Edit: another thing that's funny wrt names in my life is my family situation. My wife has her own two last names (it's not super common in Brazil for women to take their husband's last name), and she has a daughter from her first marriage. So we all have totally different last names, with only my wife and my stepdaughter sharing one of their last names. That caused us a little bit of trouble crossing the border to Canada once.
Wherein someone named Amr keeps having his first name split into Mr. A by the airline booking system. Which would be bad enough, even if they didn't also insist that a name must be at least 2 characters. Good times.
If you think names are hard, try email addresses. Most complex validator I ever wrote, and then had to switch off lots of it because email addresses in the wild don't actually follow the RFC...
In school we had a guy whose last name was 'Van', a fairly common prefix for names in NL, which tends to be used like this 'van den Broek' or 'van der Berg'.
Teachers would ask for his name, he'd say 'Jan Van' and they'd invariably ask 'Jan van Wat?'. So he eventually gave up and answered 'Jan van Wat' right off the bat, which would leave the teachers even more confused because there was no such person on their lists of students...
Given that we know at least one guy whose first name is Van (Morrison) it is theoretically possible there is a person called Van Van somewhere.
Nope. Email validation is easy: contains an "@", and gets delivered somewhere. Everything else is too clever by half. Names lack this interactive verification...and oh my goodness, the tussenvoegsel thing you mention: the horror...the horror!!! (Oh, you wanted to sort the names? That's a Lovecraftian nightmare of its own.)
During registration flows, I'm also a fan of validating that the domain has MX records published - DNS queries are a pretty cheap option for detecting whether it's at least plausible that your validation email will arrive later.
At scale, it's an easy option for helping users detect typos in their email domain (or for helping guide less technical users, like those who might provide address@gmail or similar instead of address@gmail.com)
If you try to make @ mandatory, you'll soon get an angry mail from hn!verizon!randomisp!somehipster complaining how you're marginalizing people who still use bang paths.
Sorry about that, then. I also have no gopher support and redirect all HTTP to that newfangled HTTPS - so such person would not even get to such a form in the first place ;)
Here's an example regex that implements the official RFC 5322 email validation rules, along with the "preferred" syntax from RFC 1035 (which is one of the recommendations in RFC 5322):
Alternately, you can just use this super basic regex which handles all of the common use cases (while still allowing all of the weird edge cases), and functions as a quick "smoke test" for email addresses:
[^@\s]+@[^@\s]+
Basically, require at least one character (except a space, a line break, or an @) followed by an @, then at least one character (except a space, a line break, or an @).
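In Python that smoke test might look like the sketch below. It is written with the character class `[^@\s]` (anything except whitespace and @), which matches the plain-English description above, and `looks_like_email` is a made-up helper name.

```python
import re

# At least one non-space, non-@ character, an @, then the same again.
SMOKE_TEST = re.compile(r"[^@\s]+@[^@\s]+")


def looks_like_email(candidate: str) -> bool:
    """Cheap plausibility check only; real validation is sending a mail."""
    return SMOKE_TEST.fullmatch(candidate) is not None


for candidate in [
    "username+tag@example.com",  # plus addressing passes
    "address@gmail",             # permissive: no dot required
    "no-at-sign",
    "two@@signs",
    "has space@example.com",
]:
    print(candidate, looks_like_email(candidate))
```

It deliberately says nothing about whether the address exists; that still takes a confirmation mail.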
The main flaw of this simple regex is that it is too permissive and potentially allows weird edge cases like special UTF-8 characters that might not be allowed.
But hey, one might argue that if a user is intentionally entering an invalid email address with wonky characters to create an account, then that's kind of their fault when they never receive the account activation email with the email validation link, right?
The only way to validate an email address is to send something to it and either get a reply or have the recipient use the contents to further interact (e.g. click a link, copy a code).
If you are lucky the regex will just cause maintenance issues. If you are unlucky it will stop users signing up.
I suppose there isn't a whole lot of value even using a quick "smoke test" regex, when the big version is already a solved problem, and can be so easily copied and pasted at this point...
Prediction: it will fail on many email addresses that are out there in the wild and in daily use. The problem with 99% cases is that if you handle enough accounts, that 1% becomes a serious tech support issue. In the end, the best test of whether an email address is valid is whether the mail arrives with the intended recipient. So what you do if an email address does not pass verification is prompt the user to that effect but do not force them to do anything at all.
Exactly. The absolute best test for whether or not an email address is valid or not is if the mail arrives to the intended recipient.
And if you are going to do that email validation step anyways, you could probably get away with just using the simple highly permissible "smoke test" regex for your email address input form validation.
This project ran 5 years ago, sorry I do not have access to that data anymore. But it wasn't just a few. The big providers got it right but there are so many smaller ones with their own tweaked mail servers, those were the ones causing the problems.
Names are associated with roles, not with persons. This is particularly true for prefix and suffixes. Mayor Lyndon Smersh and Lyndon Smersh, Esq. are different roles as politician and lawyer of the underlying person Lyndon Smersh. Many women also have multiple names, e.g. a married name used socially and their original name used professionally.
When dealing with persons, it's necessary not to confuse their roles.
Once I was working in a piece of software that handled people's names. The client wanted exactly a first name and a last name for every person in the database. I tried to explain why that could be a problem, but they weren't interested. It isn't necessarily the programmer's fault.
TBH, I disagree with most of the points in this article, or at least the implied idea that a particular piece of software needs to handle every permutation of how someone wants to be addressed.
I'm pretty sure most US legal documents require some form of given name and some form of family name. If you're trying to interface with a US government system, and that's your primary use case, a single first and last name will probably suffice (assuming it's not too arbitrarily short, people can change their name, etc.)
I’m from the US. My legal name since birth is not 1 word nor my last name. My first name also includes an é.
My birth certificate has all of these as well as my passport. My drivers license doesn’t have the é.
Truncating my name is how I get mail from people states away who have never lived here. One of them being my dad. He’s never lived in this state so it’s not due to an old address.
That’s also how I commit felonies by opening those letters thinking they’re for me.
Spaces and é can easily be represented in a single DB column, and none of what you wrote precludes having a single "first name" and "last name" field to represent your name in a DB.
We didn't name our baby until about two weeks after birth, which probably wouldn't have been a problem, except for the required daily jaundice checkups and blood work, all outpatient.
While delivery and the nursery apparently are set up to deal with this (by barcoding the baby's patient bracelet), computers in other departments were not, at least without navigating multiple menus and screens.
Every computer that required patient name and birthdate to be entered was a 10 minute process, to the point that we were famous by the third day of visits.
At some point, a higher up was able to log in and "fix" the name to be discoverable by just last name and birthday, or via mother's patient info. I'm not exactly sure how, but I was impressed with it being "fixable" so easily (for us at least), especially network wide.
I also understand now why the vital records people were so persistent in trying to get us to finish the birth certificate before discharge. We obviously didn't.
Programming decisions around names get encoded operationally as well: within the last decade I missed a flight when a major airline's system did not accept my hyphenated last name at the check-in kiosk. I was then told by customer service I did not need the hyphen in my name -- their system was correct, and my view of my name was not.
Oh my god names are so, so, difficult to deal with. I'm doing something where I get information / stats about players from different sites, and it's like all of them have different names. For example:
Joseph Smith // nice and generic
Joseph Smith Jr. // gotta make sure we tell people he's the son of another Joseph Smith
Joseph Smith Jr // psh, we don't use periods!
Jose Smith // if from a different country, some sites keep their native name
José Smith // yup, sometimes accents are kept
Joe Smith // nicknames, all about those.
Joey Smith // kid nickname, sure
Jos� Smith // literally even had to deal with a site that had unknown unicode letters
Juan Jose Smith // sometimes, people go by their middle names, but some sites like to use their full name instead
Juan Smith // remember how some people go by their middle names? Maybe a site only gives you their first and last
How, how in the world is someone supposed to go through and match all those names with each other in a somewhat quick manner? I can't imagine it's fully solved because there'll always be other cases.
I've tried a couple ways, like removing periods and accents, removing "Jr", "Sr", "III", removing all vowels even. And then went further and tried some of the string matching libraries that return a number with how close words are to each other. That would help cases where Joe, Joey, Joseph would come back quite close, but then I'd run into the issue of a "Darren Smith" and "Darek Smith" would come back so similar the computer thought they were the same person.
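Roughly what that attempt looks like, as a sketch using only the standard library. `difflib.SequenceMatcher` stands in for "some of the string matching libraries" (an assumption; the comment does not say which ones were used), and the suffix list is just the three mentioned above.

```python
import difflib
import unicodedata

SUFFIXES = {"jr", "sr", "iii"}


def normalize(name: str) -> str:
    """Strip accents, periods and generational suffixes, then lowercase."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = no_accents.lower().replace(".", "").split()
    return " ".join(token for token in tokens if token not in SUFFIXES)


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the normalized forms of two names."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


print(normalize("José Smith Jr."))                 # accent, period, suffix removed
print(similarity("Joe Smith", "Joseph Smith"))     # same person, scores high
print(similarity("Darren Smith", "Darek Smith"))   # different people, also high
```

Both pairs land above 0.8 with this measure, so no single threshold separates "same player, different spelling" from "two different players". That is why the last mile ends up being manual.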
In cases like names, that Patrick writes about here, yeah, they're almost impossible to get right and it ends up being mostly by hand, which is fine since I can get it correct, but eats up so much time.
There is also no such algorithm that can try so hard to be "clever": most fuzzy matches will present their uncertainty, human input is assumed to be certain. The umpteen times some clerk "identified" me as someone who used a similar name (of different location, different position and ACLs, even different gender) were decidedly not entertaining.
The problem is that you're using names as a proxy to ask whether two people are the same. If I ask you: "in this film, are Will Smith and Will Smith the same", you do not have enough information to answer, because there are at least 74 Will Smiths on record in film credits: https://www.imdb.com/find?q=will%20smith&s=nm&exact=true
And this is from an industry with guilds and unions to try to ensure that no two credits have the same written name.
You can make a very good guess, but names can't tell you enough to determine if these two people are the same or different. One person can have many names, and many people can have the same name.
We should, like, give tax breaks or pay people based on how nationally unique a name they give their child. Should be an easy system to implement. Something like inverse log or inverse root of the frequency of the name within the existing population.
The now defunct (pretty sure) Peacock Data Inc. used to sell (~$300) a data set of names including spelling variations and their strength of relationship to other names. It included nicknames, variations across languages (e.g. "john" ~= "juan"), and typos. It even had different relationship types, each with their own dissimilarity measurement to their related names. I've never found anything remotely that comprehensive for names.
A very efficient way of matching strings across typos (aka "fuzzy") is SymSpell, but it just finds nearby lexical matches efficiently.
Johnny Cash recorded a song that's relevant. Wikipedia has a lengthy entry full of unisex names from around the world. Surely, someone, somewhere in the world makes up a new name every day. Based on a person's name alone, there's no way to assign a salutation with unattended automation without substantial and sustained effort and still not eventually fail in a potentially easily embarrassing way. Much easier and safer to provide a field for users to fill in their own desired salutation. Bonus points if it's a text entry field instead of a dropdown list, for the same reasons that #24 through #29 are listed in the parent article.
I'm aware of this, having such name myself. Border control agencies eye me with various degrees of suspicion for this, depending on what gender my name is supposed to signify to them. (Yes, that's in the passport as well, but...well, bureaucracy)
I worked at a place where the convention was to give employee emails being first initial and last name. Well this guy had a last name just the letter "I" and his first name started with "H". So his email became hi@company.com.
Speaking of names, my full name has 6 parts, and I share 4 of them with my brother. We never use the first three, and instead use the 4th and 6th names to make the full name that satisfies the falsehoods.
In the web sites I work on, I have decided to have one field that asks "How would you like to be called", and allow full Unicode range in it. If you want to use Emoji, fine, go for it. I have fewer data to store (privacy and data protection concerns), and users feel better not typing the last name either.
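A sketch of how little code that approach needs. The length cap and both function names are made up for illustration, and rejecting a blank value is itself an assumption some people would object to.

```python
MAX_LENGTH = 200  # hypothetical storage limit, counted in code points


def accept_preferred_name(raw: str) -> str:
    """Accept any Unicode text and return it unchanged.

    No character whitelist, no case folding, no splitting into parts:
    the only checks are "not blank" and "fits in the column".
    """
    if not raw.strip():
        raise ValueError("please tell us what to call you")
    if len(raw) > MAX_LENGTH:
        raise ValueError("that is longer than we can store")
    return raw


def greeting(preferred_name: str) -> str:
    # Used verbatim wherever the person is addressed.
    return f"Hello, {preferred_name}!"


for raw in ["O'Brien", "Mairé", "van den Broek", "🙂"]:
    print(greeting(accept_preferred_name(raw)))
```

Anything cleverer than this (capitalizing, splitting on spaces, guessing a family name) reintroduces the assumptions the thread is complaining about.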
Anything can be pronounced, obvs.; but especially English names are hard to pronounce correctly, as the rules are mostly embedded in the names' ancestry, not in the way they're written.
(And then of course you get The Knights Who...Until Recently Used To Say Ni)
"Preferred name," a UTF-8 string with absolutely no implied formatting, is a great way to deal with a bunch of these. Simply ask people what they want to be called and use that verbatim, always. Unfortunately the absolutely most common misconception about names is that they are easy, so almost every system tries to implement something more "clever," inevitably failing in stupid ways.
There are cultures in which it is considered unlucky to name a child before, say, their second birthday, so it's never done.
In this case, it would be quite normal to have an 18 month old with no name at all. I mean, you could call them "Baby Surname" or something. But they don't have a name.
Typically entered as “Baby Surname”, “Baby Boy Surname”, “Baby Girl Surname”. For twins, yeah I don’t think I’ve seen an index appended to their name. I feel like even if the EHR system had the birth time, I would normally get their birthdate starting at midnight. Let’s hope the EHR system keeps a unique MRN for those twins, because downstream systems are totally going to commingle their records. Same applies to Jrs living in the same house as the parent, since their demographics match the parent’s other than birthdate.
> All MRNs should be unique and everything else can be a duplicate.
> Every system should support name changes and merges. John Does come in all the time, and then you figure out their existing MRN, or their real name.
In theory yes, I’d hope to see that across the board, but fat fingering happens way too often on top of the reuse of MRNs. I work in the IHE space, so I can point fingers at most of the big guys as we accept their HL7/CDA feeds and not even ids from the same system are consistent between the formats. I don’t want to get too off topic with that last line, but yeah we try to consolidate records that are received by leveraging Fellegi-Sunter[0].
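For anyone who hasn't met Fellegi-Sunter: each compared field adds log2(m/u) to a score when it agrees and log2((1-m)/(1-u)) when it disagrees, where m is the chance the field agrees for a true match and u the chance it agrees for a non-match. The sketch below is a bare-bones illustration of that scoring only; the field list and every m/u value are invented, and it is not a description of how the commenter's system is configured.

```python
import math

# Hypothetical (m, u) probabilities per field; real systems estimate these.
FIELDS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.02),
    "birthdate": (0.97, 0.003),
}


def match_weight(record_a: dict, record_b: dict) -> float:
    """Sum of per-field agreement/disagreement weights (higher = more alike)."""
    total = 0.0
    for field_name, (m, u) in FIELDS.items():
        if record_a.get(field_name) == record_b.get(field_name):
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


twin_a = {"last_name": "Surname", "first_name": "Baby Boy", "birthdate": "2020-01-01"}
twin_b = {"last_name": "Surname", "first_name": "Baby Girl", "birthdate": "2020-01-01"}
stranger = {"last_name": "Other", "first_name": "Someone", "birthdate": "1970-05-05"}

print(match_weight(twin_a, twin_a))    # identical records
print(match_weight(twin_a, twin_b))    # twins still score well above zero
print(match_weight(twin_a, stranger))  # unrelated records score negative
```

The twins case shows the commingling risk mentioned upthread: with only demographics to go on, two unnamed newborns look like a probable match.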
People might have names, but they're not always known: you admit an unconscious patient amongst many (e.g. a bus/plane/whatever accident), unclear whose documents are whose (and whether there even are any), now what? USA has the assumed name John/Jane Doe - but that has obvious pitfalls of its own.
Electronic medical records often have to deal with newborns before they’ve been named. This was even a common issue at the community pharmacy where I worked, so not that unusual.
what about Prince when he was in the trademark dispute about his name and he went by that symbol -- or maybe someone in a remote tribe where their script is not in unicode yet?
It's always annoyed me that this blog post above didn't contain examples, and I'm grateful that someone here on HN in another thread posted a later post that does contain examples:
There's hope in functional literacy. The list clearly states that this is not a bunch of checkboxes your app needs to tick, and that some of the requirements are irresolvably contradictory.
Include some ways to record how name is written by attaching image and 3D scan files or directly writing it by scribing with mouse/touchscreen/stylus, and how it's pronounced by attaching sound files and providing record button (video too: don't forget about sign languages; but it is still not a solution for tactile sign languages which are used by deaf blind communities).
I have personally encountered a system that used name + birthdate to uniquely identify its "users" (large % of citizens of a country of non negligible size).
That was absolutely insane. Try applying such logic in China or a big Arab state.
Wouldn't even work for a largish city. I have like 10 namesakes that I know of (none of them related to me), and my full name is not even that common. With birthdate, this is just the birthday effect waiting to bite someone.
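The birthday effect can be put into numbers with a few lines. This is a back-of-the-envelope sketch that assumes birthdates are spread uniformly over the possible dates, which real populations are not, and the 50-year window is an arbitrary choice.

```python
def collision_probability(people: int, slots: int) -> float:
    """Chance that at least two of `people` share one of `slots` equally likely dates."""
    p_all_distinct = 1.0
    for already_taken in range(people):
        p_all_distinct *= (slots - already_taken) / slots
    return 1.0 - p_all_distinct


DATES_IN_50_YEARS = 50 * 365

# The classic case: 23 people, 365 possible birthdays.
print(collision_probability(23, 365))

# People sharing one full name, with birthdates spread over about 50 years.
for holders in (10, 200, 1000):
    print(holders, collision_probability(holders, DATES_IN_50_YEARS))
```

Ten namesakes are still unlikely to collide, but once a name has hundreds or thousands of holders, as common names in a large country do, a name + birthdate key is close to guaranteed to identify two people as one.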
I propose we introduce a clever new concept we can call your "Fizz". You get a first fizz, and a last fizz. You can change it whenever, but you need to have one. But a fizz can only be ascii characters without case and no spaces.
We just need a way to map (with some but not perfect reliability) your fizz to your identity as proven by some authoritative third party.
The point being that 'name' is a term interpreted in culturally varied ways... which, when I design most systems, I don't care about. So to make it clear, let's not use that any more.
In the USA, "John Doe" is used for an unknown name. God help someone with that actual name. But hospitals commonly encounter the problem of a patient with unknown name that must still be entered into the database. Presumably the patient has a name, it's merely unknown. Except babies with no or unknown parents. Really no name.
Unless you have a system which also references dead people. ("Oh! Those exist?") There are administrative databases spanning centuries..."just use a random year in the 19th century" is just a y2k bug in a new suit.
How very USian. For example, what about "people outside Latin-1"? There's the original, canonical name (e.g. in Cyrillic), a Latin transliteration (also canonical, but depends on the country, IIRC Bulgarian transliteration is slightly different from Russian), and both are canonical. As far as string matching, these are completely different; yet they represent the same person.
Not to mention the Irish name duality, if Cyrillic is too exotic for you: those are two equivalent names, both canonical, both official...and saying "use the English version then" gets political right at that point.
How very smart of you to make such assumptions. I'm Eastern European and no stranger to different encodings and alphabets, and your argument is bollocks. Transliteration is by definition canonical form of the name, otherwise it would not be transliteration. There might be more ways to write a name, using different alphabets or using all or only some parts of the name, but only one is canonical.
Nope. And you failed to read the second example, too: both the Gaelic and English versions of the Irish names are canonical: they are not string-equal, even though both fit into Latin-1.
I'm not familiar with this area, but what I gather after cursory google search is that they are equivalent about as much as are names John and Johannes. The simple fact that the name can be translated doesn't make the translation canonical.
Nope. The entire point is "you have two versions of the same name, where both are official and canonical" - not that one is official and the other is a tolerated translation.
Looking through the comments of the original post, there's even a Rory/Ruairí mentioning this.
Thanks to official forms not making it clear whether my name was a signature or data entry, I now have different first name on my driver's license compared to other government forms.
There's this absolutely perfect scene in the movie 'Idiocracy' where just interacting with the system will get your name mangled so the lead character ends up with 'Not Sure' as his legal name.
Many of the counterexamples show up as "???????" seen from here - perhaps a related assumption is "Unicode is always available at its latest version everywhere."