Oh my god, names are so, so difficult to deal with. I'm doing something where I pull information / stats about players from different sites, and it's like every site spells the names differently. For example:
Joseph Smith // nice and generic
Joseph Smith Jr. // gotta make sure we tell people he's the son of another Joseph Smith
Joseph Smith Jr // psh, we don't use periods!
Jose Smith // if from a different country, some sites keep their native name
José Smith // yup, sometimes accents are kept
Joe Smith // nicknames, all about those.
Joey Smith // kid nickname, sure
Jos� Smith // literally even had to deal with a site that had unknown unicode letters
Juan Jose Smith // sometimes, people go by their middle names, but some sites like to use their full name instead
Juan Smith // remember how some people go by their middle names? Maybe a site only gives you their first and last
How, how in the world is someone supposed to go through and match all those names with each other in a somewhat quick manner? I can't imagine it's fully solved because there'll always be other cases.
I've tried a couple of ways, like removing periods and accents, removing "Jr", "Sr", "III", even removing all vowels. Then I went further and tried some of the string matching libraries that return a number for how close two words are to each other. That helps in cases like Joe, Joey, Joseph, which come back quite close, but then I'd run into the issue that "Darren Smith" and "Darek Smith" would come back so similar the computer thought they were the same person.
In cases like names, that Patrick writes about here, yeah, they're almost impossible to get right and it ends up being mostly by hand, which is fine since I can get it correct, but eats up so much time.
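The normalize-then-score approach described above can be sketched roughly like this. It is only an illustration using Python's standard library (`unicodedata`, `difflib`); the suffix list and the function names are assumptions, not what the commenter actually ran.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Assumed suffix list; extend as needed for your data.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}

def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, drop generational suffixes."""
    # NFKD splits "é" into "e" plus a combining accent, which we then drop.
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    tokens = re.sub(r"[^\w\s]", "", no_accents.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

print(normalize("José Smith Jr."))                  # jose smith
print(similarity("Joe Smith", "Joseph Smith"))
print(similarity("Darren Smith", "Darek Smith"))
```

With this particular scorer, "Darren Smith" vs "Darek Smith" actually scores higher than "Joe Smith" vs "Joseph Smith", so no single threshold accepts the nickname pair while rejecting the two different people.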
There's also no algorithm that tries as hard to be "clever" as a person does: most fuzzy matchers will at least present their uncertainty, whereas human input is assumed to be certain. The umpteen times some clerk "identified" me as someone with a similar name (different location, different position and ACLs, even different gender) were decidedly not entertaining.
The problem is that you're using names as a proxy to ask whether two people are the same. If I ask you, "in this film, are Will Smith and Will Smith the same?", you do not have enough information to answer, because there are at least 74 Will Smiths on record in film credits: https://www.imdb.com/find?q=will%20smith&s=nm&exact=true
And this is from an industry with guilds and unions to try to ensure that no two credits have the same written name.
You can make a very good guess, but names can't tell you enough to determine if these two people are the same or different. One person can have many names, and many people can have the same name.
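To make the many-to-many point concrete, here is a small sketch that flags both directions: names shared by several people, and people who appear under several names. It assumes each record already carries some stable identifier (`person_id` here is hypothetical), which is exactly the information a name alone does not give you.

```python
from collections import defaultdict

def name_collisions(records):
    """records: iterable of (person_id, name) pairs; person_id is a hypothetical stable key.

    Returns (shared_names, aliases):
      shared_names: name -> set of ids, for names used by more than one person
      aliases:      id -> set of names, for people listed under more than one name
    """
    ids_by_name = defaultdict(set)
    names_by_id = defaultdict(set)
    for person_id, name in records:
        key = name.lower()
        ids_by_name[key].add(person_id)
        names_by_id[person_id].add(key)
    shared_names = {n: ids for n, ids in ids_by_name.items() if len(ids) > 1}
    aliases = {i: names for i, names in names_by_id.items() if len(names) > 1}
    return shared_names, aliases

# Made-up records for illustration only.
records = [(1, "Will Smith"), (2, "Will Smith"), (1, "Willard Smith")]
shared, aliases = name_collisions(records)
print(shared)   # names that map to several people
print(aliases)  # people that map to several names
```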
We should, like, give tax breaks or pay people based on how nationally unique a name they give their child. Should be an easy system to implement. Something like the inverse log or inverse root of the frequency of the name within the existing population.
The now defunct (pretty sure) Peacock Data Inc. used to sell (~$300) a data set of names including spelling variations and their strength of relationship to other names. It included nicknames, variations across languages (e.g. "john" ~= "juan"), and typos. It even had different relationship types, each with their own dissimilarity measurement to their related names. I've never found anything remotely that comprehensive for names.
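For a sense of how a data set like that gets used, here is a minimal sketch of a variant table with relationship strengths. The table contents, the strength values, and the function name are all made up for illustration; they are not taken from the Peacock data.

```python
# Hypothetical variant table: name -> {related canonical name: strength in [0, 1]}.
# Every entry and number here is invented for illustration.
VARIANTS = {
    "joe": {"joseph": 0.9},
    "joey": {"joseph": 0.8},
    "jose": {"joseph": 0.7},
    "juan": {"john": 0.7},
}

def relation_strength(a: str, b: str) -> float:
    """How strongly two given names are related according to the table (0.0 = unrelated)."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0
    # Direct relationship, in either direction.
    direct = max(VARIANTS.get(a, {}).get(b, 0.0), VARIANTS.get(b, {}).get(a, 0.0))
    if direct:
        return direct
    # Two variants of the same canonical name: combine their strengths.
    shared = set(VARIANTS.get(a, {})) & set(VARIANTS.get(b, {}))
    return max((VARIANTS[a][c] * VARIANTS[b][c] for c in shared), default=0.0)

print(relation_strength("Joe", "Joseph"))    # 0.9
print(relation_strength("Darren", "Darek"))  # 0.0
```

Unlike a pure string-distance score, a lookup like this keeps "Darren" and "Darek" apart while still linking "Joe" and "Joseph", but only for names someone has put in the table.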
A very efficient way of matching strings across typos (aka "fuzzy" matching) is SymSpell, but it only finds nearby lexical matches.
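The idea behind SymSpell is a "symmetric delete" index: precompute every string you get by deleting a few characters from each known name, then do the same to the query and look for overlaps. The following is a simplified sketch of that idea, not the SymSpell library's API. The function names are hypothetical, and a real implementation would also verify each candidate with a true edit-distance check.

```python
from collections import defaultdict

def deletes(word: str, max_distance: int = 1) -> set:
    """The word itself plus every string formed by deleting up to max_distance characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results

def build_index(names, max_distance: int = 1):
    """Map each delete-variant to the set of names that produce it."""
    index = defaultdict(set)
    for name in names:
        for variant in deletes(name, max_distance):
            index[variant].add(name)
    return index

def candidates(query: str, index, max_distance: int = 1) -> set:
    """Names sharing at least one delete-variant with the query (unverified candidates)."""
    found = set()
    for variant in deletes(query, max_distance):
        found |= index.get(variant, set())
    return found

index = build_index(["darren", "darek", "joseph"])
print(candidates("daren", index))  # typo-level neighbours: darren and darek
print(candidates("joe", index))    # empty: a nickname is not a typo of "joseph"
```

The second lookup shows the limitation: nicknames and cross-language variants are not lexically close, so this kind of index will not find them.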