I read the paper when it first showed up on HN. [1] The most important thing they did was create a training set with far finer-grained data than anything previously available. Using that training set, their algorithm achieved 85% positive/negative accuracy on sentences, while the previous state-of-the-art algorithms went from 80% to 83% accuracy when adapted to the new data. So although their algorithm appears to be better than everything they tested against, this is fundamentally an incremental improvement, not groundbreaking research. The real win here came from using a better dataset.
It probably linked to [1], but all we can tell is that it was something at python.org. The post was by plessthanpt05, an HN user for about two years with 838 karma. All of their posts from the past year or so are dead (about 60 of them), but they don't appear to be a bot.
Python 3.4.0a1 isn't something that would interest everyone on HN, but it certainly doesn't seem like the kind of thing that should have been killed.
Well, that user also posted over 60 links in the last year, while never receiving a single upvote or comment. I'm not sure why a flesh-and-blood user would still be posting after that.
"Saved links" might be one reason. While I wouldn't advocate using HN as a personal bookmark service it does lessen the burn a bit if nobody else comments/votes on it. The good thing is that if you use saved links for that it makes you think twice about what sort of things you should be submitting.
Their posting frequency dropped significantly over the past year, probably as a result of never receiving any upvotes. Around the time they were (presumably) hellbanned, they were averaging roughly one post per day; it's much less frequent now. They seem to go through a week or two in which they post several links, then forget about HN for a while.
I won't try to defend all of their posts; most of them are things that I don't even find interesting. But it seems that someone basically used banning as a method of cutting down on uninteresting material.
That may even be a good way of maintaining the signal-to-noise ratio on HN: if we ban the users who post a lot of uninteresting links, the community won't have to see them. But it wasn't a tactic I was aware of before today.
According to the statistics site the OP used, 54 million Americans are single. [1] Ignoring the 62 million Americans who are under fifteen, that means only about 21% of Americans aged fifteen and over are available. [2] So if your area's ratio is 9:10 and ~80% of people are taken, you end up with roughly a 7:12 ratio among the singles (ignoring homosexuality).
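To make that last step concrete, here's a quick back-of-the-envelope sketch. It assumes the 9:10 ratio is men to women and that the ~80% who are taken pair off into one-man/one-woman couples:

    # Rough arithmetic behind the 7:12 claim: per 19 people, 9 men and 10 women,
    # with ~80% of everyone paired into one-man/one-woman couples.
    men, women = 9.0, 10.0
    taken = 0.80 * (men + women)      # ~15.2 people in couples
    couples = taken / 2               # ~7.6 couples, each removing one man and one woman
    single_men = men - couples        # ~1.4
    single_women = women - couples    # ~2.4
    print(single_men / single_women)  # ~0.583, i.e. about 7:12 (7/12 = 0.583)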
This sounds eerily familiar. Around a decade ago, a data analytics company called Pharmatrak was actually found guilty of breaking federal wiretapping statutes for doing something very similar. [1] In their case, they had built a network that tracked HTTP GET requests to pharmaceutical companies' websites using a web bug [2] and an attached cookie. But because some of the pharmaceutical companies were using GET as the method on their HTML forms (remember, this was ten years ago), users ended up making GET requests with personally identifying information in the URL-encoded parameters. Since those GET requests were logged by Pharmatrak, and neither party (the users nor the pharmaceutical companies) had consented to giving that personal information away, Pharmatrak was found guilty of wiretapping.
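In case the GET-form problem isn't obvious, here's a minimal sketch of why it leaks data; the URL and field names below are made up for illustration, not taken from the case:

    from urllib.parse import urlencode

    # A form submitted with method="GET" encodes every field into the URL itself,
    # so anything that logs request URLs (the site, or a third-party tracker
    # loaded on the page) ends up storing that data too. Hypothetical fields:
    fields = {"name": "Jane Doe", "email": "jane@example.com", "drug": "examplemab"}
    print("https://pharma.example.com/refill?" + urlencode(fields))
    # https://pharma.example.com/refill?name=Jane+Doe&email=jane%40example.com&drug=examplemab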
Pharmatrak eventually won on appeal though, arguing that they had no intention of collecting personal information, which exonerated them because only intentional eavesdropping is a crime.
The company in the OP's article could make no such argument, though. I suspect the main difference is that they make no assurances of confidentiality to the websites using their software the way Pharmatrak did, which 1) is just really creepy, and 2) sets them up for trouble with users in California, because California's wiretapping statute makes it a crime unless both parties consent. [3]
I don't think so. "Statistically significant" is a relative term, and when testing an entire population is infeasible (as it often is), we instead sample some fraction that we believe is "statistically significant", on the assumption that it will accurately reflect the whole.
The point of this article is that a sample only accurately reflects the whole in some ways. Variability, in particular, scales inversely with the square root of the sample size: De Moivre's equation says the standard deviation of a sample mean is σ/√n, so smaller samples produce much more extreme-looking results. And since misconceptions about variability have been at the heart of many controversies (male vs. female intelligence, school size, cancer risk, etc.), De Moivre's equation is important, even dangerous in the sense that ignorance of it has led to billions of dollars wasted.
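A quick simulation shows the effect; the population numbers here are made up, and the only point is how the spread of sample means shrinks as n grows:

    import random
    import statistics

    # Sketch of De Moivre's equation: the standard deviation of a sample mean
    # is sigma / sqrt(n), so small samples yield much more variable averages.
    random.seed(0)
    mu, sigma = 100.0, 15.0  # hypothetical population mean and standard deviation

    def sd_of_sample_means(n, trials=2000):
        means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
                 for _ in range(trials)]
        return statistics.stdev(means)

    for n in (10, 100, 1000):
        print(n, round(sd_of_sample_means(n), 2), "vs. theory:", round(sigma / n ** 0.5, 2))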
You can also find more stuff that he's done at his website, http://brand.io/