Hacker News
Using Machine Learning and Node.js to detect the gender of Instagram Users (totems.co)
142 points by spolu on Sept 29, 2014 | 53 comments



Neural networks have their place, but they are probably the most complicated and opaque machine learning tool. They are also hard to set up: so many parameters! Given that, I found it really strange that they went straight for a neural network (and then implemented one themselves!). Surely the place to start would be Naive Bayes, followed by regularized logistic regression (through something like glmnet). Heck, I imagine even random forests would do quite well on this task, although that's moving closer to neural networks on the complexity and opaqueness spectrum.

There is also no evidence of cross-validation, and in another comment they say they used the entire data set to do variable selection, a pretty bad mistake. They justify it by saying they aren't in an academic environment, but that's a poor excuse: given the way they've done it, I'm very unsure whether they are actually getting the accuracy they think they are.

I also worry that they sank two man-months into this when they could probably have achieved similar if not better results with off-the-shelf, battle-tested tools. That sets off a lot of warning bells.


I am not sure whether I understood everything correctly, but I think they computed everything just from the data they got via links to Facebook profiles.

Although their computation is very clever, the output can't be better than the input they used. Simply determining the gender by looking up the linked Facebook profile should therefore be a better solution, in my opinion.


Well said - this is a crazily over-engineered and highly problematic approach to the problem at hand.


This is a great example of how privacy is not optional, even in "opt-in" systems such as Instagram and FB. That Instagram does not require you to have a Facebook profile, and Facebook does not require you to list gender means very little in terms of your own privacy.

Merely choosing to withhold information about yourself does not insulate you from a breach of privacy. That others do disclose such information allows 3rd parties to make really good guesses and inferences about you.

There's a strange morality here: at what point is it unethical to voluntarily disclose data about oneself, if it could be used in a way to harm someone else's privacy? Short of drawing a moral boundary (it could very well be impossible), we might do well to at least acknowledge the cost to these methods, alongside their benefits.


> There's a strange morality here: at what point is it unethical to voluntarily disclose data about oneself

That's an interesting question. Especially since the data you disclose may trigger inappropriate inference of characteristics about someone else, possibly eventually causing some form of harm (any time the demo fails to classify someone, we do cause some harm to him/her in a way). In the case where the misclassification is more harmful than the privacy disclosure, one is better off disclosing the information... a weird equilibrium.


In other words: how am I affected by the fact that many of my Facebook friends like drug-related pages?


> There's a strange morality here: at what point is it unethical to voluntarily disclose data about oneself, if it could be used in a way to harm someone else's privacy?

At what point is it unethical to exhale, given that carbon dioxide is toxic to humans and is a greenhouse gas? At what point is it unethical to vote, given that you might influence an election in a way that is bad for society or some subset of society?

It's true that nearly every (and the "nearly" is just a hedge) action we take has some negative externality. I personally don't lose sleep over the ones that are virtually impossible to measure.


I don't get your point. If you post online, do you really expect much privacy? Make your profile private and then the software won't be able to guess your gender. I tried three accounts; one was of a famous female model and apparently she's 0.516 probability male and 0.006 female, so I wouldn't be too worried just yet. Just don't post online if you want privacy; otherwise, learn from this.


It's unusual to see a coherent, from-first-principles explanation of a neural network. Especially one that's commercially valuable (i presume) to Totems.

Mildly alarmed to learn I'm only .039 probability male, though - better bloke it up on Instagram.


that's okay, I'm only male with a probability of 0.01. Hint: I'm male.


What's so alarming about being thought female?


Personally, being male, I'd rather be thought of as male. Not as a slight towards females, but just because it's who I am.

That said, I did get 0.998 female and 0.996 male. Oh well.


If you're looking to avoid disclosing personal data, then it's a positive. On the other hand, if you're actually making use of a service, then it would probably lead to a lot of unwanted targeted information. Then, assuming a future of ambient intelligence where most people don't question machine mined data (I'd say an inevitability), you'd probably have a lot of awkward moments ensue, especially if public and private institutions hold data to be sacrosanct.


That's a very binary take on gender.


Thanks for sharing your experience! Couple of questions

Why implement the training in NodeJS and not use an existing library in R or Python (scikit-learn) and just implement the scoring (feedforward network) in Node?

Did you just use a single test/train split? What is the variation in results if you run cross-validation?

Your article suggests that you used MI to select the 10k best features. Did you perform this MI feature selection before your test/train split? If so, you would already be "using" your test class labels, and the results will be biased; your true generalisation error is likely higher than what you measured.
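For illustration, the leakage-free ordering looks like this: a minimal sketch with made-up toy data, where a crude class-count difference stands in for a real mutual-information score.

```python
import random

# Toy data: (bag-of-words dict, label) pairs. All names and values here
# are illustrative, not the article's actual features.
random.seed(0)
data = [({"w%d" % random.randrange(20): 1 for _ in range(5)}, random.randrange(2))
        for _ in range(200)]

# 1. Split FIRST, so test labels can never influence feature selection.
random.shuffle(data)
train, test = data[:150], data[150:]

# 2. Rank features using ONLY the training fold. A real implementation
#    would compute mutual information here; this crude class-count
#    difference just stands in for it.
def score(feature, fold):
    pos = sum(x.get(feature, 0) for x, y in fold if y == 1)
    neg = sum(x.get(feature, 0) for x, y in fold if y == 0)
    return abs(pos - neg)

vocab = {f for x, _ in train for f in x}
selected = sorted(vocab, key=lambda f: score(f, train), reverse=True)[:10]

# 3. Only now train on `train` and evaluate on `test`, both restricted
#    to the selected features.
```

Selecting on the full data set instead would let test-set labels influence which features survive, inflating the measured accuracy.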


> Why implement the training in NodeJS and not use an existing library in R or Python (scikit-learn) and just implement the scoring (feedforward network) in Node?

We wanted to contribute to the NodeJS ecosystem and build whatever tool was missing to use neural networks directly from NodeJS, or at least as an add-on. We also wanted to come up with a simple and straightforward implementation to serve as an educational example, rather than just bind to an existing library (even though the results might have been better, of course).

> Did you just use a single test/train split? What is the variation in Res if you run cross validation?

We didn't use cross-validation but rather a simple train/test split (though our test set was quite large, ~100k out of 570k). As explained in the intro, we wanted to stay very practical and were OK with dirty shortcuts as long as the result looked OK.

> Did you perform this MI feature selection before your test/train split? If so, you would already be "using" your class labels, and the results will be biased; your true generalisation error is likely higher than what you measured.

Yes, MI selection was made on the overall data set before training. You're totally right that this biases the test results. Nice catch.


Your implementation of momentum seems off: you just add a multiple of the last error term, instead of accumulating exponentially declining contributions from past updates. I think you want

    double dW = alpha_ * val_[l][j] * D_[l+1][i] + beta_ * dW_[l+1][i][j];
    dW_[l+1][i][j] = dW; /* store the update so the momentum term decays geometrically */
    W_[l+1][i][j] += dW;
If you want to get an output class probability, softmax is the standard way. Minimize KL-divergence instead of squared error.
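A minimal sketch of what that suggestion amounts to, in plain Python with illustrative logit values:

```python
import math

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    # For a one-hot target this equals the KL-divergence up to a constant,
    # so minimizing it is the "minimize KL instead of squared error" advice.
    return -math.log(probs[target])

probs = softmax([2.0, 0.5])   # two output units, e.g. male / female
loss = cross_entropy(probs, 0)
```

Because the two outputs come from a softmax, they sum to 1 by construction, unlike two independently squashed output units.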

You don't seem to be doing any regularization. It could maybe give you better generalization.
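The simplest form is L2 weight decay; with made-up numbers, a single weight update becomes:

```python
# L2 regularization adds (lambda/2) * w^2 to the loss for each weight,
# so each weight's gradient picks up an extra lambda * w term.
alpha, lam = 0.1, 0.01   # learning rate and regularization strength (made up)
w, grad = 0.8, 0.05      # a weight and its plain loss gradient (made up)

w_new = w - alpha * (grad + lam * w)   # the lam * w term shrinks w toward 0
```

The shrinkage toward zero penalizes large weights, which is what tends to improve generalization.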

I think you could get a speedup by doing your linear algebra with BLAS. I guess this would complicate the code though, making it a trade-off.

Training on multiple threads and averaging is a nice touch. It would be interesting to hear if (how much) it improved your results.


> Your implementation of momentum seems off

I think we used what is described in Artificial Intelligence: A Modern Approach... But I have to check because what you propose seems better.

> If you want to get an output class probability, softmax is the standard way. Minimize KL-divergence instead of squared error.

Thanks! We'll totally try that.

> You don't seem to be doing any regularization. It could maybe give you better generalization.

Thanks again. Someone mentioned that before as well. We'll have to experiment with that as well.

> Training on multiple threads and averaging is a nice touch. It would be interesting to hear if (how much) it improved your results.

Training was much faster and therefore tractable on a much larger set, but unfortunately, as described in the post, we didn't manage to get our best results using this multi-threaded approach.

Maybe with a bigger training set we could have reached better results using multi-threaded training. That being said, the averaging phase significantly disrupts the overall backpropagation process, so I don't know how efficient it can be... Some advanced experimentation would probably be interesting here.
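For reference, the averaging step being discussed amounts to something like this (illustrative weight values, not the post's actual code):

```python
# Each worker trains its own copy of the weights on a shard of the data,
# then the copies are combined element-wise. Averaging mid-training is
# what "disrupts" backpropagation: the mean of several good weight
# vectors is not necessarily itself a good weight vector.
worker_weights = [
    [0.2, -0.4, 0.1],   # worker 1 (made-up values)
    [0.4, -0.2, 0.3],   # worker 2
    [0.3, -0.3, 0.2],   # worker 3
]

n = len(worker_weights)
averaged = [sum(ws[i] for ws in worker_weights) / n
            for i in range(len(worker_weights[0]))]
```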


>Thanks! We'll totally try that.

Oops, maybe I spoke too soon; allow me to backpedal a little. I still recommend minimizing KL-divergence.


Giving it a go with most of my friends and I'd say the success rate was definitely below .5, and it was pretty sure about it.

What seems odd is that the "test tool" allows you to tweet whether it's wrong or right. Why not just have it make a call to your API or something to tell you directly, so you can look at the profiles and figure out what's gone wrong?


2 more epically failing profiles: @friendzis @algimantas69


Having used /harthur/brain before and being deeply interested in Neural Networks, I have to say that this is one of the most interesting articles about the topic I've ever seen.

Thank you for sharing the C version, I'll use it for sure.


This was submitted 4 days ago [1], and then was deleted. Anyone know what was up with that?

[1] https://news.ycombinator.com/item?id=8368186


I deleted it shortly after submitting it, because the demo crashed and we didn't want to waste such a great opportunity on HN on a failed demo.

I know it's not perfect... But heh. Hope it's ok.


The "post if WRONG" twitter link failed on my iPhone 5.

My instagram name is the same as my HN user id, and you classified me with 99.3% as female... Needs work!


It looks like very few of your photos have captions, meaning that the algorithm doesn't have a lot of text to work with, and among those that do have text there are a few which contain keywords that are probably heavily weighted toward female, such as "pink".

The algorithm could probably be improved to also take the instagram name into consideration. Someone named "arthur" is very unlikely to be female.


> Our platform retrieves or refreshes around 400 user profiles per second (this is managed using 4 high-bandwidth servers co-located with instagram’s API servers on AWS).

Interesting, since Instagram's API only allows 5,000 requests per hour, (http://instagram.com/developer/limits/) and does not support bulk requests of user data. How does this application bypass this limit?


Hi minimaxir. We have a large number of tokens from our clients and from people doing OAuth to access our free demo. Since the data is public, we can use any of these tokens to access hashtags, account followers, etc.

Actually, the Instagram API limit is pretty high compared to other platforms. Today we have something like 100k tokens available to us, which means we can make 12bn+ calls every day. Almost like having a firehose. We don't use all of them, but we're one of the top users, though there are at least 10 bigger users than us on the API (according to them, of course). Hope it helps!
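The arithmetic behind that figure checks out against the 5,000 requests/hour per-token limit quoted upthread:

```python
tokens = 100_000              # tokens claimed to be available
per_token_per_hour = 5_000    # Instagram's quoted per-token rate limit
calls_per_day = tokens * per_token_per_hour * 24
print(calls_per_day)  # 12000000000, the "12bn+ calls every day" figure
```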


> Since the data is public, we can use any of these tokens to access hashtags and account followers

I'm fairly certain this isn't an intended use of the API. Kudos to you for posting your method here, but I wouldn't be surprised if your app gets banned because of this (a lot of people at Instagram read HN).


Interesting, when I was at GoDaddy (Website Builder), it seems Facebook had implemented not only a per token limit, but application limits as well that we hit pretty easily. Does Instagram not have the same kind of limits?

For those curious, IIRC it took a while to get our account's limits raised, and we had to implement some request caching to stay under the limits as much as possible. All around, it was interesting.


Nope they don't... at least not for now :)


@yid, I'd be curious to understand why you think they would ban us. For using the tokens this way? Well, they know everything about our usage of these tokens, and any analytics tool out there behaves similarly, right?


> For using the tokens this way? Well they know everything about our usage of these tokens, and any analytics tool out there behaves similarly, right?

Yes, for using what are intended as per-user activity tokens for public scraping (which the user who has been issued the token has not requested). As you said, you can assemble a firehose using this method, and if they'd wanted apps to access a firehose, they'd have come up with an API for it.


> if they'd wanted apps to access a firehose, they'd have come up with an API for it.

That is quite an idealistic view of the problem. But I have to admit it probably holds some truth.


You only need to use 288 tokens to hit 400 user profiles/second.
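That number follows directly from the per-token limit quoted upthread:

```python
import math

profiles_per_second = 400
per_token_per_hour = 5_000    # Instagram's quoted per-token limit

# Requests needed per hour, divided by what one token allows per hour.
tokens_needed = math.ceil(profiles_per_second * 3600 / per_token_per_hour)
print(tokens_needed)  # 288
```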


Wouldn't a Bayesian filter be better suited? There must be a reason spam filters use them instead of neural networks.


_up: we thoroughly evaluated perceptrons, which are somewhat close to Bayesian classifiers. Basically, a perceptron is a one-layer NN and is therefore quite similar to a Bayesian classifier in that it encodes a linear model.

That being said... studying Bayesian classifiers more thoroughly might yield better results indeed. I don't know, though, whether Gmail uses Bayesian filtering or deep learning?


My guess is that Gmail is using a linear classifier, both because of the scale of the data and because, until very recently, linear classifiers have been state of the art on text classification.

In the few cases where NNs have achieved a new state of the art on text, such as the Stanford sentiment analysis work and a few more recent works, a full sentence parse is needed. Sentence parsers do achieve 95% accuracy, but only on well-structured text in a given domain. Plus, they are hugely time-intensive compared to large-scale linear classifiers like Vowpal Wabbit or sofia-ml.

Regarding perceptrons, a basic perceptron, although it is a linear classifier, will not achieve state of the art. Averaged perceptrons get you closer, but what you really want is a discriminatively trained linear classifier with regularization.

If I had to bet, gmail is probably using something closer to https://code.google.com/p/sofia-ml/ than a NN. Maybe a Googler will surprise me though!
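For concreteness, here is roughly what "averaged perceptron" means, on a made-up linearly separable toy problem (a sketch of the technique, not of any production system):

```python
# Toy data: (feature vector, label) pairs with labels in {+1, -1}.
data = [([1.0, 0.0], 1), ([0.9, 0.2], 1), ([0.0, 1.0], -1), ([0.1, 0.8], -1)]

w = [0.0, 0.0]
total = [0.0, 0.0]   # running sum of weight vectors for averaging
steps = 0
for _ in range(10):  # epochs
    for x, y in data:
        # Standard perceptron update on a misclassified (or boundary) point.
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
            w = [wi + y * xi for wi, xi in zip(w, x)]
        # Accumulate the CURRENT weights at every step, update or not.
        total = [t + wi for t, wi in zip(total, w)]
        steps += 1

# The averaged weights smooth out the perceptron's late oscillations.
w_avg = [t / steps for t in total]
```

Averaging over all intermediate weight vectors is what gets the vanilla perceptron "closer" to regularized discriminative training, as the comment notes.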


> My guess is that gmail is using a linear classifier.

Yup, the Google "Priority Inbox" feature does indeed use a linear classifier, in particular logistic regression [1] for the reason of scale as you point out.

Also, IIRC Gmail's original spam detection used naive Bayes. It may have evolved since then.

[1] http://static.googleusercontent.com/media/research.google.co...


I don't know about GMail specifically, but I suspect your approach of using neural networks is the best one given the state-of-the-art in machine learning today.

You can think of naive Bayes and a perceptron as roughly equivalent in terms of expressiveness--they're both linear models--but a perceptron is usually better since it can account for correlations between input variables.

As you say, a perceptron is a one-layer neural network, so with a large enough training set, a multi-layer neural network will almost certainly perform better since it can recognize combinations of features that work well together.

Bayesian filtering for spam detection is a good starting point since it's easy to implement, and was very popular in the mid-2000s, but with all the advancements in deep learning since 2006, I'd almost certainly bet on a neural network these days.
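For comparison, the Bayesian-filtering baseline being discussed is only a few lines (toy documents and words, purely illustrative):

```python
import math
from collections import Counter

# A minimal multinomial naive Bayes classifier for two classes.
docs = [("buy cheap pills now", "spam"),
        ("cheap pills cheap", "spam"),
        ("meeting agenda attached", "ham"),
        ("see attached notes", "ham")]

counts = {"spam": Counter(), "ham": Counter()}
priors = Counter()
for text, label in docs:
    priors[label] += 1
    counts[label].update(text.split())

vocab = {w for c in counts.values() for w in c}

def log_posterior(text, label):
    # Laplace (+1) smoothing avoids zero probabilities for unseen words.
    total = sum(counts[label].values())
    lp = math.log(priors[label] / len(docs))
    for w in text.split():
        lp += math.log((counts[label][w] + 1) / (total + len(vocab)))
    return lp

def classify(text):
    return max(("spam", "ham"), key=lambda c: log_posterior(text, c))

print(classify("cheap pills"))  # spam
```

The appeal is exactly what the comment says: trivially easy to implement and train, at the cost of the linear, independence-assuming model.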


Arguing that using a neural network instead of a simpler classifier is a good idea because of recent developments in deep learning is like arguing that driving a car from the 50s is safer than taking a train because of recent developments in airbag/cruise control technology.


It's true that it seems to be a lot of implementation work. NNs have a much higher complexity/performance ratio than other algorithms. But hey! The end justifies the means ("la fin justifie les moyens"). I'm quite impressed with the result and had a lot of fun with the demo and the article. Keep it up, guys!


:+1:


It doesn't predict my account correctly:

   PROBABILITY FEMALE: 0.997 
   PROBABILITY MALE: 0.569 
I wonder if the fact that I mostly just post pictures with no text accompanying them skews things.


My account (@matiassingers) got some very interesting numbers, and most of my photos definitely do have a caption and hashtags.

    PROBABILITY FEMALE: 0.003 
    PROBABILITY MALE: 0.001


1.000 probability of being a man. Thank you for affirming my masculinity.

However, my business has a 0.885 probability of being a woman, which is odd for a men's brand.


Not really odd, if your brand is looking to seduce men into buying (;


Hehe, the network was trained mostly on humans (see the data set section)... So all bets are off for a non human account :)


Interesting blog, interesting ideas, but completely bogus results. It's very inaccurate. Just using simple naive Bayes you'd get much better results than this.


PROBABILITY FEMALE: 0.003 PROBABILITY MALE: 0.999

Errr, so it's out of 1.002?


Many machine learning setups don't return true probabilities; the classifier returns a per-class confidence in the range 0 to 1, and these confidences need not sum to 1. The higher confidence wins.


@teganandsara

PROBABILITY FEMALE: 0.003

PROBABILITY MALE: 0.996

I would say this doesn't work very well.


Well, one datapoint means nothing. Also, if this is aimed at advertisers, it's more useful to identify people whose interests skew (stereotypically) male or female.

Also, Tegan and Sara are great singers and artists, but neither of them is an exemplification of what our culture considers stereotypically female.



