> I find that it's really easy to use AI techniques "the wrong way" NLP-wise, is...

JD557 · on July 17, 2019

Just because there's not a "right way" doesn't mean that there aren't a lot of "wrong ways" ;)

Actually, some years ago I bumped into a similar problem to the one discussed, where someone wanted to use NLP to show gender bias in a dataset, and hit one of the common pitfalls that I mention (I hope that I don't start a flamewar by sharing this story):

Here's what they tried to do:

1. Fetch a list of ~800 attendee names from a Portuguese tech conference (it had an official API with user profiles)
2. Download a dataset of the most common male/female names for newborn babies in Portugal and America for the last 3 years
3. Train a naive Bayes model on the downloaded dataset and use it to classify the attendees into male/female

After doing that, the algorithm returned something like "8 female attendees and 792 male attendees".

I found this particularly strange (considering that I knew more than 8 women who had attended in previous years), so I took a peek at the attendee dataset and found that:

- There were some users using a fake name (including one organization account)
- There were certainly more than 8 female attendees, and at least 6 were named "Inês" (a female name)

After discussing this with the people involved, we found the problem!

- The dataset was not being normalized (the model was trained with "Ines" and tested with "Inês")
- The naive Bayes[1] implementation used, when faced with a completely new input, output the most common class of the training dataset
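To make the two fixes concrete, here is a minimal sketch in Python. It is not the code that was actually used; the function names (`normalize`, `train`, `classify`) and the "unknown" label are made up for illustration, and it assumes the input is a single given name rather than a full profile name.

```python
import unicodedata
from collections import Counter, defaultdict


def normalize(name: str) -> str:
    """Lowercase and strip diacritics, so "Inês" and "Ines" compare equal."""
    decomposed = unicodedata.normalize("NFKD", name.strip().lower())
    # After NFKD, accents are separate combining characters; drop them.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def train(labelled_names):
    """Count how often each normalized name appears with each label."""
    counts = defaultdict(Counter)
    for name, gender in labelled_names:
        counts[normalize(name)][gender] += 1
    return counts


def classify(counts, name):
    """Return the most frequent label for a name, or "unknown" if unseen."""
    seen = counts.get(normalize(name))
    if not seen:
        # Hypothetical choice: refuse to guess instead of falling back
        # to the most common class of the training data.
        return "unknown"
    return seen.most_common(1)[0][0]
```

With a training set containing `("Ines", "female")`, `classify(model, "Inês")` now matches the training entry, and a name that never appeared in training (such as an organization account) comes back as "unknown" instead of being counted as the majority class.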

In the end, the final result was closer to "80 female, 705 male, 15 unknown", which is a much more believable result (closer to the typical distribution of software engineering students in Portugal).

Note that the author wasn't trying to deceive anyone; he just tripped over some common pitfalls (forgot to normalize the data and used an off-the-shelf implementation without looking into the details).

[1] There was only one attribute, so implementing this without a naive Bayes library was actually easier and produced the correct results.