I'm relatively new, so it's difficult for me to see any major change that's happened since I first joined. My personal opinion is that there has been little change since I arrived, but much talk about how much has changed.
That said, I think an article about economics or politics can be more profound and deep than a lightweight article about technology and be better for the tone of the site. My personal preference is for news here to be technical, but an article being technical is not enough for it to be interesting. Overall, however, I want articles with meat, ones that dig a little deeper than "Head First SQL" or "How I made a blog engine with Erlang" (or "I am Philip Greenspun and I don't like people in Northern California").
While I love beautifully designed languages as much as the next guy, I've seen probably 50 blog articles posted here about "Why language design matters" or "Why language design doesn't matter" or "Why there can never be a better Lisp". While these may be technical in nature, they're often very shallow and redundant.
As far as I know, it would be an impressive feat to determine depth automatically, but I think it would give you a better picture of how the tone is changing over time in a relevant way. And maybe it would be a good filter for submitted articles.
If anyone wonders why there was such a high proportion of articles about startups at first, it was because the site was initially called "Startup News." After 6 months or so we changed the name and focus.
Do you think that as a site becomes large enough people start to change their own focus and sort of get lost in the crowd? That eventually people are just disillusioned by the site itself? I noticed this is sort of happening with reddit / digg.
I don't think largeness is the problem in itself so much as the decline in quality and civility that usually accompanies it. If we can avoid decline we're ok. This is mostly uncharted territory, but I'm hopeful. I'm going to be working this year on tweaks to encourage people to be more civil in comment threads.
I'd be interested in a breakdown of the 10 or so most popular source sites on a (say) month-by-month basis. You'd expect to see a lot of articles referencing Google, techcrunch, HN and the like but I'm also surprised by the number of articles from the NY Times, WSJ and such.
Maybe determine "popularity" first by number of articles and then again by score.
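A rough sketch of what that breakdown could look like, ranking source sites per month once by article count and once by total score. The (month, url, score) tuple layout and the function name top_sources are assumptions for illustration, not the format of the actual data set.

```python
from collections import defaultdict
from urllib.parse import urlparse


def top_sources(posts, n=10):
    """Rank source sites per month, by article count and by total score.

    posts: iterable of (month, url, score) tuples (a hypothetical layout).
    Returns two dicts mapping month -> list of at most n host names.
    """
    counts = defaultdict(lambda: defaultdict(int))
    scores = defaultdict(lambda: defaultdict(int))
    for month, url, score in posts:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]  # treat www.example.com and example.com as one site
        counts[month][host] += 1
        scores[month][host] += score

    def rank(table):
        # Highest value first; ties are broken alphabetically by host name.
        return {
            month: sorted(hosts, key=lambda h: (-hosts[h], h))[:n]
            for month, hosts in table.items()
        }

    return rank(counts), rank(scores)
```

Comparing the two rankings for the same month would show which sites are submitted often and which ones actually collect votes.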
One thing this graph can't account for is the quality of submissions. A technology article about 8051 assembly language is a lot different than a technology article about the top 10 SEO tips for bloggers.
When this is all done I'll make all the data available so that people can mine it for what it is worth.
A fair amount of work went into this little project, not all of it automated (unfortunately). A lot of time went into making sure the tags would be somewhat relevant.
If there had been any major trends that I could not explain by looking at samples of the data, I would certainly have investigated.
I hope that any such trends I missed will come out in the follow-up (weighted by votes, see above). If not, the analysis will have to be much more detailed, and that will probably mean a lot more handwork than went into making this graph.
The 'good news' for me is that there is no unbounded growth of the 'unspecified' category; that would be a fairly large indicator of trouble.
To me, "top 10 SEO tips for bloggers" is not about technology at all; it's about marketing. This is the category I feel is growing and it makes HN less interesting to me.
The x axis is the rank number of the posting divided by 1000, so that's a constant sampling interval in blocks of 1,000 but more compressed in time towards the right because of the higher posting frequency.
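For anyone who wants to reproduce that kind of axis, here is a minimal sketch of the binning as described: posts are grouped by rank number in blocks of 1,000 and each category's share within a block is computed. The (rank, category) input layout and the name category_shares are assumptions, not the code that produced the graph.

```python
from collections import Counter, defaultdict


def category_shares(posts, block=1000):
    """Group posts into fixed-size blocks by rank and compute category shares.

    posts: iterable of (rank, category), where rank is the posting's
    sequence number (a hypothetical layout).
    Returns {block_index: {category: fraction of that block}}.
    """
    bins = defaultdict(Counter)
    for rank, category in posts:
        bins[rank // block][category] += 1

    shares = {}
    for index, counter in sorted(bins.items()):
        total = sum(counter.values())
        shares[index] = {cat: n / total for cat, n in counter.items()}
    return shares
```

Because each block holds the same number of posts, the shares in every block sum to 1 without any extra weighting, which is the property that per-month bins would lose.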
It wouldn't make much difference actually, apart from greatly complicating the matching up of the Y axis.
The bigger issue is that this is just everything that is posted and not flagged, so it is, if you wish, a view of the 'new' page. It has nothing to do with the 'home' page. I'll try to address that tomorrow.
As for the labelling and clustering, that was based on keywords in the title from a fair-sized sample, and on the URLs the links pointed to.
What I am specifically searching for is larger trends; smaller trends would be very difficult to catch using this method.
I'm actually quite surprised how even the graphs come out over the longer term. I would have expected more variation in the submissions.
So if there is a problem at this point in time, I would conclude that the problem is not in the submissions; they seem to have roughly the same subjects over the long term as they did in the beginning, with the exception of a shift of focus away from 'startups' in the first year or so of operation.
I think that has to do with an influx of programmers / people interested in technology in general whereas originally most of the people on news.yc were active in the startup scene.
they seem to have roughly the same subjects over the long term as they did in the beginning
That's been my impression for a long time. Do your techniques allow you to measure the trend of people complaining about the site deteriorating? Because that's been going on for a long time too, and in approximately the same way (though possibly in cycles).
I think that has to do with an influx of programmers / people interested in technology in general whereas originally most of the people on news.yc were active in the startup scene.
Pretty clearly that is because the site was originally named Startup News and had a relatively narrow scope, then was renamed to Hacker News as part of explicitly broadening the scope.
Agreed, with the exception that the proportion of articles tagged 'startups' decreased reasonably fast in the first quarter of the graph.
This was probably part of the iterative change from 'Startup News' to 'Hacker News'. I can't recall exactly when the name changed, or whether it was a response to that trend or precipitated the wider focus.
Apart from the settling-down period in the first year, that is my conclusion as well.
Even the cyclic nature that edw519 referred to in the 'new years exchange' seems to be mostly limited to the voting; it has nothing to do with the actual submissions.
FWIW, yesterday I submitted a link to the history of Waite Group publishing by Mitch Waite himself. It was an awesome piece on entrepreneurship, failure, and the history of personal computing and it's still sitting at 1 point:
I'm tempted to create some kind of site that checks the 1 pt submissions automatically. Not sure how much semantic analysis it would take, but even minimal keyword checking could flag possibly good articles. Hidden.HN
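Something like the following is roughly what minimal keyword checking might mean. It is only a sketch: the (title, points) input, the keyword set, and the name flag_overlooked are all made up for illustration, and a real version would need a way to fetch the 1 pt submissions.

```python
import re


def flag_overlooked(submissions, keywords, max_points=1):
    """Return low-scoring submissions whose titles mention any keyword.

    submissions: iterable of (title, points) pairs (hypothetical layout).
    keywords: set of lowercase words considered a hint of a good article.
    """
    flagged = []
    for title, points in submissions:
        if points > max_points:
            continue  # already noticed by voters
        words = set(re.findall(r"[a-z0-9]+", title.lower()))
        matched = sorted(words & keywords)
        if matched:
            flagged.append((title, matched))
    return flagged
```

Matching on whole words rather than substrings keeps a keyword like "history" from firing on unrelated longer words, at the cost of missing plurals and other variants.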
Forgive me if I'm misinterpreting your question. I believe it should be "Ask HN" (Hacker News) and "Ask YC" (Y Combinator). I haven't been around here for too long, but I have seen many Ask HN threads and none specifically addressed to YC. Those threads about the Y Combinator process are usually of the form "Ask PG: " (Paul Graham) or "Ask HN: YC Founders, what do you think about X?"
Thanks, that's interesting. However, I'm a little confused about the categorisation. It looks like the categories add up to 100%. If that is the case, the category "blogs" doesn't make sense in my view. All other categories characterise the subject of the content whereas "blogs" says something about the publication channel. In my view, there is no sensible way to label a blog post about technology either "technology" or "blogs".
Correct, the problem here is that even though most of the blogs are technology blogs it is very hard to categorize the majority of them as something specific. For instance, Bruce Schneier blogs about security, most of the time, so all articles that could be tagged like that are now under hacking,security.
But he also has lots of stuff that is not so easy to categorize, so that ended up depending on the ease with which the title let itself be identified either under 'technology' or, in the worst case under 'blogs'.
A similar problem appears with the 'mainstream' media websites, and it was solved in the same way with the top level category as a catch-all after other matches were ruled out.
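To make the order of the checks concrete, here is a hedged sketch of that kind of fallback: try the title keywords first, and only when nothing matches fall back to a top-level category based on the site. The keyword lists, the host table, and the 'mainstream' label are invented for illustration; the real lists came from the sampled titles and URLs.

```python
from urllib.parse import urlparse

# Hypothetical keyword lists, checked in order; the first match wins.
TITLE_KEYWORDS = [
    ("ask", ("ask hn", "ask yc", "ask pg")),
    ("hacking,security", ("security", "exploit", "vulnerability")),
    ("startups", ("startup", "founder", "funding")),
    ("technology", ("programming", "compiler", "lisp", "erlang")),
]

# Hypothetical catch-all categories keyed by site, used only as a last resort.
CATCH_ALL_HOSTS = {
    "schneier.com": "blogs",
    "nytimes.com": "mainstream",
}


def classify(title, url):
    """Assign one category: title keywords first, then the site catch-all."""
    lowered = title.lower()
    for category, words in TITLE_KEYWORDS:
        if any(word in lowered for word in words):
            return category
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return CATCH_ALL_HOSTS.get(host, "unspecified")
```

With this ordering a Schneier post with "security" in the title lands under 'hacking,security', while one with an uninformative title ends up under 'blogs', which is why that category describes the channel rather than the subject.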
It looks like there are too many "unspecified" articles to learn much from this visualization, other than a moderate decrease in the number of articles in your "startup" category, supplanted largely by "ask" topics--a trend that largely leveled off early in the x-axis on this graph (which would be dramatically more useful if it had some amount of real-time benchmarks to give some sense of scale).
See the text in the article about the 'unspecified'.
As for the scale, it doesn't get much more precise than this. The only concession to legibility is stretching the graph horizontally, because otherwise it would be only 138 pixels wide; vertically it is very close to one posting per pixel.
As the volume of postings on news.ycombinator increases due to increased traffic to the site, the graph will stretch further to the right.
This could be counteracted by changing the algorithm to 'bin' more posts on the right-hand side to get, for instance, one month per bin, but in practice the outcome would be the same; you'd just have another weighting to do to get the Y axis of the bins to line up.
Yes, it was remarked on before, but you're wrong about the areas. If you look closely you'll see that the rows are 'in order' and the green area is actually a one-time occurrence somewhere near the top. The other green area is the one that has unclassified submissions in it.