I know what they trained on because it's been reported on. They got around 50 million people's FB profiles, and personality test results from a smaller subset (300k, I think).

I use ML models every day in my work, and I understand how they function. It is true that individuals' information is probabilistically encoded into the parameters of the model. However, if the model is any good, the information of the people they trained on is encoded only slightly more strongly than that of the population at large.
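
To make that concrete, here's a toy sketch (synthetic data and scikit-learn, nothing like their actual pipeline): fit a regularized model twice, once on everyone and once with a single person's row left out, and compare the learned weights. On a decently sized training set the difference is tiny, which is the sense in which no single individual is really "in" the model.

  import numpy as np
  from sklearn.linear_model import LogisticRegression

  rng = np.random.default_rng(0)
  n, d = 50_000, 20  # synthetic "profiles": 20 binary like-features each
  X = rng.integers(0, 2, size=(n, d)).astype(float)
  true_w = rng.normal(size=d)
  y = (X @ true_w + rng.normal(size=n) > 0).astype(int)  # synthetic "trait" labels

  # Fit once on everyone, once with a single person left out.
  full = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)
  loo = LogisticRegression(C=1.0, max_iter=1000).fit(X[1:], y[1:])

  # How much did dropping one person move the learned weights?
  delta = np.linalg.norm(full.coef_ - loo.coef_)
  print(f"relative weight change from one person: {delta / np.linalg.norm(full.coef_):.2e}")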

There is sort of a privacy issue in the following sense: The models they've built have learned relationships between preferences and personalities that they wouldn't otherwise have been able to learn. But these relationships are abstract. They are not tethered to any particular, identifiable individual.

A reasonable argument can be made that those learned relationships are, in a sense, stolen property. And I think arguments along those lines are interesting things that we'll have to explore as this sort of thing becomes more common. But the idea that this model invades individuals' privacy just isn't really true.




Is there a reason that people are only talking about the privacy angle?

People very much don't want these models to exist. They don't want a predictive model that can guess their affiliation from nothing but seemingly unrelated activity breadcrumbs.

That's why, I assume, this whole issue has exploded recently.

Not the privacy, but the implications.


But if the resulting model doesn't contain information about individuals, how does this help targeting individuals for the campaign?

Edit: is it that the model is then applied only to strictly public data about the person? If so, I guess the interesting question becomes whether the model is definitely nowhere near overfitting, i.e. retaining enough information to match a person's public data directly, since it was trained on that data (amongst other data). (I'm not an ML developer.)

Edit 2: also, going with your "20 most representative pixels" comparison, it seems interesting that this much information (though I'm not sure exactly how much) can be inferred from a public profile just by also knowing enough about the whole Facebook population. OK, so perhaps a human could infer about as much, but a human doesn't scale, and that's why the model becomes valuable?


> But if the resulting model doesn't contain information about individuals, how does this help targeting individuals for the campaign?

I don't know exactly what they were modeling, but from the published reports, it sounds like they were trying to predict big 5 personality traits (conscientiousness, neuroticism, openness, extraversion, agreeableness) from FB profile data (e.g. likes, dislikes, bio, post content). So in that case, the model would contain weights measuring the strength of the relationship between a feature like "likes punk rock music" and a trait like "openness". That description literally applies only to a linear model, but nonlinear models are, for these purposes, the same.
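
As a rough illustration (synthetic data and made-up feature names, not their pipeline), here's what the linear version looks like. It also shows the standard check for the overfitting worry raised upthread (compare training vs. held-out performance), and what "targeting" amounts to: running the trained model on one person's public like-vector.

  import numpy as np
  from sklearn.linear_model import Ridge
  from sklearn.model_selection import train_test_split

  rng = np.random.default_rng(1)
  features = ["likes_punk_rock", "likes_cooking", "likes_hiking"]
  n = 10_000
  X = rng.integers(0, 2, size=(n, len(features))).astype(float)

  # Pretend "openness" depends mostly on the first feature.
  openness = 0.8 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.5, size=n)

  X_tr, X_te, y_tr, y_te = train_test_split(X, openness, random_state=0)
  model = Ridge(alpha=1.0).fit(X_tr, y_tr)

  # Each weight is the learned strength of one like -> trait relationship.
  for name, w in zip(features, model.coef_):
      print(f"{name}: {w:+.2f}")

  # A big train/test gap would signal memorization; here there is none.
  print("train R^2:", round(model.score(X_tr, y_tr), 3))
  print("test R^2:", round(model.score(X_te, y_te), 3))

  # Targeting: score one person from their public likes alone.
  someone = np.array([[1.0, 0.0, 1.0]])
  print("predicted openness:", round(float(model.predict(someone)[0]), 2))

The model itself stores only those few weights; what gets applied to an individual at targeting time is that individual's own public data, run through the weights.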


> I know what they trained on because it's been reported on.

What reason do you have to think their data set consisted of only what has been reported?

How do you know anything about the models they used?



