A Statistical Analysis of All Hacker News Submissions (minimaxir.com)
147 points by minimaxir on Feb 24, 2014 | 36 comments



There was a point when we figured out how to stop spam submissions almost completely. That is probably what happened at the end of 2011; the timing would be about right.


I don't suppose there's an explanation we could have? :)


Unfortunately, like most of our anti-abuse measures, it's surprisingly simple and would be easy to circumvent.


Sorry to hijack the thread, but now that you will have more time on your hands... can we have an option to download our data from HN? I mean my submissions, saved articles, and comments. Thanks!


You can use the API to download your submissions and comments extremely easily, too.


Which official API? And the saved articles are not public. If you are referring to the new hn.algolia.com, it is far from being a data-liberation initiative (rate limits?). Even Google and Facebook do much better.


With the Algolia API, you can request 1000 stories or 1000 comments per request. I don't think you'll hit the rate limit. :P

Here are your 1000 out of your 1248 comments: https://hn.algolia.com/api/v1/search_by_date?tags=comment,au...

And here are 1000 out of your 1548 submitted stories: https://hn.algolia.com/api/v1/search_by_date?tags=story,auth...

You can paginate each endpoint on the created_at_i parameter to get the rest. I can write up a data liberation script if you want.
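
Something like this minimal sketch would do it, assuming the public hn.algolia.com Search API; YOUR_USERNAME is a placeholder:

  import requests

  # Walk backwards in time through the Algolia HN Search API, 1000 hits at
  # a time, filtering on created_at_i as described above.
  API = "https://hn.algolia.com/api/v1/search_by_date"

  def fetch_all(tags):
      items, before = [], None
      while True:
          params = {"tags": tags, "hitsPerPage": 1000}
          if before is not None:
              params["numericFilters"] = "created_at_i<%d" % before
          hits = requests.get(API, params=params).json()["hits"]
          if not hits:
              return items
          items.extend(hits)
          before = hits[-1]["created_at_i"]  # oldest hit on this page

  comments = fetch_all("comment,author_YOUR_USERNAME")
  stories = fetch_all("story,author_YOUR_USERNAME")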


Thanks! I would also need to scrape HN to retrieve my saved articles. It would be useful to be able to use user credentials with the API.


Do situations such as this cause you to reflect further upon the role(s) of information asymmetry in systems and society? If so, I'd be curious about your thoughts, should one of your essays ever turn to the subject.

I continue to reflect upon this myself, but you write better, and sometimes rather elucidating, essays.


You have a green heat map of

  #submissions(time)
where time is 1-hour slots across 7 days. You also have a red heat map of

  #successful_submissions(time)
where successful is > 100 points. I think what you want is a third map which is the ratio,

  #successful_submissions / #submissions
which would be the empirical probability of a submission being successful, given the submission time. The raw counts don't tell you this.

(If you have a zero in the #submissions bin at some time, this will give 0/0, so you might want to put in a "Laplace correction" which is to add 1 count to each #submissions bin. There are other adjustments you can use, but this would be good enough for the purpose of the plot.)
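
A minimal sketch of that third map, assuming the 7x24 counts (day-of-week by hour-of-day) have already been tallied; the arrays here are only placeholders:

  import numpy as np

  # Placeholder 7x24 count matrices standing in for the real tallies.
  submissions = np.random.poisson(50, size=(7, 24))
  successful = np.random.binomial(submissions, 0.05)

  # Laplace correction: add 1 to each #submissions bin, so empty bins give
  # 0/1 instead of 0/0.
  p_success = successful / (submissions + 1.0)  # plot this as the third heat map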


I did a similar analysis to the one posted here and computed a similar heat map to the one you describe, but I marked a submission as successful when it went from new -> front page, not when it hit 100 points. The result is roughly in the middle of the post, and it seems that weekends give a story the best chance of making it to the front page.

http://karpathy.ca/myblog/2013/11/27/quantifying-hacker-news...

and the raw IPython notebook with too many details: http://cs.stanford.edu/people/karpathy/hn_analysis.html


Thanks very much. And on a log scale too!

As you noticed, not only do weekends offer a significantly better chance of making it to the front page, but the mid-morning weekday peak also creates enough competition that submissions have a hard time making it.

This contradicts an assertion made in the OP: "Your odds are slightly better when submitting at peak activity (weekdays at 12 PM EST / 9 AM PST)." The problem is that they never calculated the odds.


I also noticed this, and would love to see such a chart.


Obligatory repost - "Hacking Hacker News Headlines" from May 2011, examining the significance of language in story headlines:

http://metamarkets.com/2011/hacking-hacker-news-headlines/


Interesting - I'd love to see the number of stories on the front page about politics over time. Is it really growing, or does it just seem that way?


You have to keep in mind that the biggest tech story of last year was also the biggest political story of last year. So the numbers would show a rise in political stories, but that could simply be due to the overlap.


Presumably you could do stats with and without that one.


Almost lost me with the word clouds, but I'm glad I soldiered on. An interesting look at the patterns behind HN.


> …so Lisp and Erlang are well-liked on HN.

Umm... maybe not? What if a post titled "I don't like Lisp. Go Python!" hit the front page? How exactly do you infer which language is being talked about?


Here's the data set of all submissions containing Lisp or Erlang in their title: https://docs.google.com/spreadsheets/d/1tnYpawKHOg7K1eKMaERw...

There are a few negative mentions, but they're in the minority.


Interesting natural language processing question. It took me a minute to notice that you actually included three languages in there.


http://minimaxir.com/img/hn-points-hist.png

The wealth distribution of HN is awful. The rich get richer: every point a post gets moves it closer to the front page, where it earns exponentially more points.


It does make me wonder what great links I'm missing because they only got a few upvotes.

Upvoting articles on /newest only goes so far; other people have to stop upvoting fluff.

Not sure what the solution would be.


A solution has been proposed here: http://www.bayesianwitch.com/blog/2013/why_hn_shouldnt_use_r...

Basically, move some new articles closer to the front page to give them more exposure, in order to find the ones that are actually best: more exploration and less exploitation, and finding the optimal tradeoff between the two.
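
A rough epsilon-greedy sketch of that idea (not the linked post's exact algorithm; EPSILON and the function names are assumptions for illustration):

  import random

  EPSILON = 0.1  # assumed fraction of front-page slots spent on exploration

  def pick_story(front_page, new_stories):
      # Explore: occasionally surface an untested story from /newest.
      if new_stories and random.random() < EPSILON:
          return random.choice(new_stories)
      # Exploit: otherwise show the current top-ranked story.
      return front_page[0]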


I wonder if there is a bias from the specific type of person who is patient and interested enough to browse /newest and give stories their initial boost.


This is not statistical analysis; it is "descriptive statistics" at best. This:

> One of the infamous memes about Hacker News is programming language elitism, with favoritism for languages such as Lisp and Erlang.

> Lisp and Erlang are indeed obscure, which might discredit the meme.

is the exact opposite of analysis. If it were found that 40% of HNers were left-handed, HN would be notable as a particularly left-handed website, since the base rate in the general population is a fraction of that.
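
To put hypothetical numbers on that base-rate point:

  # Made-up figures for illustration: the raw share means little without
  # the base rate in the wider population.
  hn_share = 0.40    # fraction of HNers who are left-handed (hypothetical)
  base_rate = 0.10   # rough share of left-handers in the general population
  lift = hn_share / base_rate   # 4x over-represented: "particularly left-handed"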


should include "posts about analyzing HN posts" as a category

how meta


Nicely done.

1. What did you use to generate the graphs?

2. While analyzing JavaScript, were submissions related to Angular, Bootstrap, Require, etc. classified as JavaScript?


1. Plots were made using R and ggplot2. (Additionally, charts were rendered on a Mac; rendering line charts on Windows doesn't work very well.)

2. To keep the comparison apples-to-apples, I only checked for the presence of a language, not for any frameworks.


1. Thanks for the answer and the tip about rendering the charts.

2. I guess that is a practical approach - otherwise, it would have gotten too complex with all the frameworks, tools and technologies.


You should check to see if the points per post is power-law distributed. What happens when you also put the y-axis on log-scale? Does it look linear?
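
For example, a quick log-log check in Python, where the zipf draw is only a stand-in for the real points-per-post data:

  import numpy as np
  import matplotlib.pyplot as plt

  points = np.random.zipf(2.0, 100000)  # placeholder for the real HN points

  counts, edges = np.histogram(points, bins=np.logspace(0, 4, 40))
  centers = np.sqrt(edges[:-1] * edges[1:])  # geometric bin centers

  plt.loglog(centers[counts > 0], counts[counts > 0], "o")
  plt.xlabel("points per post")
  plt.ylabel("number of posts")
  plt.show()  # an approximately straight line suggests a power-law tail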


So now our bosses have a pretty chart to show we don't work.

Weekends being dead doesn't help, folks!


With the NSA graph, it's worth noting that HN posts with "NSA" or "Snowden" in the title are known to be downranked by the site's ranking algorithm. I can't remember where the source for this is right now.


NSA is, but Snowden is not (or at least wasn't noted as such in the writeup).

See here: http://www.righto.com/2013/11/how-hacker-news-ranking-really...


It would be interesting to see the distribution of Erlang posts over time - specifically, what portion of the 1,189 submissions came on Erlang Day (and its 1-2 sequels)?


What software did you use for those pretty charts?



