A Statistical Analysis of All Hacker News Submissions (minimaxir.com)
147 points by minimaxir on Feb 24, 2014 | 36 comments



There was a point when we figured out how to stop spam submissions almost completely. That is probably what happened at the end of 2011; the timing would be about right.


I don't suppose there's an explanation we could have? :)


Unfortunately, like most of our anti-abuse measures, it's surprisingly simple and would be easy to circumvent.


Sorry to hijack the thread, but now that you will have more time on your hands... can we have an option to download our data from HN? I mean my submissions, saved articles, and comments. Thanks!


You can use the API to download your submissions and comments extremely easily, too.


Which official API? And the saved articles are not public. If you are referring to the new hn.algolia.com, it is far from being a data-liberation initiative (rate limits?). Even Google and Facebook do much better.


With the Algolia API, you can request 1000 stories or 1000 comments per request. I don't think you'll hit the rate limit. :P

Here are your 1000 out of your 1248 comments: https://hn.algolia.com/api/v1/search_by_date?tags=comment,au...

And here are 1000 out of your 1548 submitted stories: https://hn.algolia.com/api/v1/search_by_date?tags=story,auth...

You can paginate each endpoint on the created_at_i parameter to get the rest. I can write up a data liberation script if you want.
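
Something like this minimal sketch would do it, assuming the public hn.algolia.com Search API; YOUR_USERNAME is a placeholder:

  import requests

  # Walk backwards in time through the Algolia HN Search API, 1000 hits at
  # a time, filtering on created_at_i as described above.
  API = "https://hn.algolia.com/api/v1/search_by_date"

  def fetch_all(tags):
      items, before = [], None
      while True:
          params = {"tags": tags, "hitsPerPage": 1000}
          if before is not None:
              params["numericFilters"] = "created_at_i<%d" % before
          hits = requests.get(API, params=params).json()["hits"]
          if not hits:
              return items
          items.extend(hits)
          before = hits[-1]["created_at_i"]  # oldest hit on this page

  comments = fetch_all("comment,author_YOUR_USERNAME")
  stories = fetch_all("story,author_YOUR_USERNAME")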


Thanks! I would also need to scrape HN to retrieve my saved articles. It would be useful to be able to use user credentials with the API.


Do situations such as this cause you to reflect further upon the role(s) of information asymmetry in systems and society? If so, I'd be curious about your thoughts, should one of your essays ever turn to the subject.

I continue to reflect upon this myself, but you write better, and sometimes rather elucidating, essays.


You have a green heat map of

  #submissions(time)
where time is 1-hour slots across 7 days. You also have a red heat map of

  #successful_submissions(time)
where successful is > 100 points. I think what you want is a third map which is the ratio,

  #successful_submissions / #submissions
which would be the empirical probability of a submission being successful, given the submission time. The raw counts don't tell you this.

(If you have a zero in the #submissions bin at some time, this will give 0/0, so you might want to put in a "Laplace correction" which is to add 1 count to each #submissions bin. There are other adjustments you can use, but this would be good enough for the purpose of the plot.)
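
A minimal sketch of that third map, assuming the 7x24 counts (day-of-week by hour-of-day) have already been tallied; the arrays here are only placeholders:

  import numpy as np

  # Placeholder 7x24 count matrices standing in for the real tallies.
  submissions = np.random.poisson(50, size=(7, 24))
  successful = np.random.binomial(submissions, 0.05)

  # Laplace correction: add 1 to each #submissions bin, so empty bins give
  # 0/1 instead of 0/0.
  p_success = successful / (submissions + 1.0)  # plot this as the third heat map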


I did a similar analysis to the one posted here and computed a similar heat map to the one you describe, but I marked a submission as successful when it went from new -> front page, not when it hit 100 points. The result is roughly in the middle of the post, and it seems that weekends give a story the best chance of making it to the front page.

http://karpathy.ca/myblog/2013/11/27/quantifying-hacker-news...

and the raw IPython notebook with too many details: http://cs.stanford.edu/people/karpathy/hn_analysis.html


Thanks very much. And on a log scale too!

As you noticed, not only do weekends offer a significantly better chance of making it to the front page, but the mid-morning weekday peak also creates enough competition that submissions have a hard time making it.

This contradicts an assertion made in the OP: "Your odds are slightly better when submitting at peak activity (weekdays at 12 PM EST / 9 AM PST)." The problem is that they never calculated the odds.


I also noticed this, and would love to see such a chart.


Obligatory repost - "Hacking Hacker News Headlines" from May 2011, examining the significance of language in story headlines:

http://metamarkets.com/2011/hacking-hacker-news-headlines/


Interesting - I'd love to see the number of stories on the front page about politics over time. Is it really growing, or does it just seem that way?


You have to keep in mind that the biggest tech story of last year was also the biggest political story of last year. So the numbers would show a rise in political stories, but that could simply be due to the overlap.


Presumably you could do stats with and without that one.


Almost lost me with the word clouds, but I'm glad I soldiered on. An interesting look at the patterns behind HN.


> …so Lisp and Erlang are well-liked on HN.

Umm... maybe not? What if a post titled "I don't like Lisp. Go Python!" hit the front page? How exactly do you infer which language is being talked about?


Here's the data set of all submissions containing Lisp or Erlang in their title: https://docs.google.com/spreadsheets/d/1tnYpawKHOg7K1eKMaERw...

There are a few negative mentions, but they're in the minority.


Interesting natural language processing question. It took me a minute to notice that you actually included three languages in there.


http://minimaxir.com/img/hn-points-hist.png

The wealth distribution of HN is awful. The rich get richer: every point a post gets moves it closer to the front page, where it earns exponentially more points.


It does make me wonder what great links I'm missing because they only got a few upvotes.

Upvoting articles on /newest only goes so far; other people have to stop upvoting fluff.

Not sure what the solution would be.


A solution has been proposed here: http://www.bayesianwitch.com/blog/2013/why_hn_shouldnt_use_r...

Basically, move some new articles closer to the front page to give them more exposure, in order to find the ones that are actually best: more exploration and less exploitation, and finding the optimal tradeoff between the two.
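
A rough epsilon-greedy sketch of that idea (not the linked post's exact algorithm; EPSILON and the function names are assumptions for illustration):

  import random

  EPSILON = 0.1  # assumed fraction of front-page slots spent on exploration

  def pick_story(front_page, new_stories):
      # Explore: occasionally surface an untested story from /newest.
      if new_stories and random.random() < EPSILON:
          return random.choice(new_stories)
      # Exploit: otherwise show the current top-ranked story.
      return front_page[0]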


I wonder if there is a bias from the specific type of person who is patient and interested enough to browse /newest and give stories their initial boost.


This is not statistical analysis; it is "descriptive statistics" at best. This:

> One of the infamous memes about Hacker News is programming language elitism, with favoritism for languages such as Lisp and Erlang.

> Lisp and Erlang are indeed obscure, which might discredit the meme.

is the exact opposite of analysis. If it were found that 40% of HNers were left-handed, HN would be notable as a particularly left-handed website, since the base rate in the general population is a fraction of that.
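
To put hypothetical numbers on that base-rate point:

  # Made-up figures for illustration: the raw share means little without
  # the base rate in the wider population.
  hn_share = 0.40    # fraction of HNers who are left-handed (hypothetical)
  base_rate = 0.10   # rough share of left-handers in the general population
  lift = hn_share / base_rate   # 4x over-represented: "particularly left-handed"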


should include "posts about analyzing HN posts" as a category

how meta


Nicely done.

1. What did you use to generate the graphs?

2. While analyzing JavaScript, were submissions related to Angular, Bootstrap, Require, etc. classified as JavaScript?


1. Plots were made using R and ggplot2. (Additionally, charts were rendered on a Mac; rendering line charts on Windows doesn't work very well.)

2. To keep the comparison apples-to-apples, I only checked for the presence of a language, not for any frameworks.


1. Thanks for the answer and the tip about rendering the charts.

2. I guess that is a practical approach - otherwise, it would have gotten too complex with all the frameworks, tools and technologies.


You should check to see if the points per post is power-law distributed. What happens when you also put the y-axis on log-scale? Does it look linear?
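
For example, a quick log-log check in Python, where the zipf draw is only a stand-in for the real points-per-post data:

  import numpy as np
  import matplotlib.pyplot as plt

  points = np.random.zipf(2.0, 100000)  # placeholder for the real HN points

  counts, edges = np.histogram(points, bins=np.logspace(0, 4, 40))
  centers = np.sqrt(edges[:-1] * edges[1:])  # geometric bin centers

  plt.loglog(centers[counts > 0], counts[counts > 0], "o")
  plt.xlabel("points per post")
  plt.ylabel("number of posts")
  plt.show()  # an approximately straight line suggests a power-law tail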


So now our bosses have a pretty chart to show we don't work.

Weekends being dead doesn't help, folks!


With the NSA graph, it's worth noting that HN posts with "NSA" or "Snowden" in the title are known to be downranked by the site's ranking algorithm. I can't remember where the source for this is right now.


NSA is, but Snowden is not (or at least wasn't noted as such in the writeup).

See here: http://www.righto.com/2013/11/how-hacker-news-ranking-really...


It would be interesting to see the distribution of Erlang posts over time - specifically, what portion of the 1,189 submissions came on Erlang Day (and its 1-2 sequels)?


What software did you use for those pretty charts?



