HN long term change (jacquesmattheij.com)
84 points by jacquesm on Jan 12, 2010 | 54 comments



I'm relatively new, so it's difficult for me to see any major change that's happened since I first joined. My personal opinion is that there has been little change since I arrived, but much talk about how much has changed.

That said, I think an article about economics or politics can be more profound than a lightweight article about technology, and better for the tone of the site. My personal preference is for news here to be technical, but an article being technical is not enough for it to be interesting. Overall, however, I want articles with meat: ones that dig a little deeper than "Head First SQL" or "How I made a blog engine with Erlang" (or "I am Philip Greenspun and I don't like people in Northern California").

While I love beautifully designed languages as much as the next guy, I've seen probably 50 blog articles posted here about "Why language design matters" or "Why language design doesn't matter" or "Why there can never be a better Lisp". While these may be technical in nature, they're often very shallow and redundant.

As far as I know, determining depth automatically would be an impressive feat, but I think it would give you a better picture of how the tone is changing over time in a relevant way. And maybe it would be a good filter for submitted articles.


If anyone wonders why there was such a high proportion of articles about startups at first, it was because the site was initially called "Startup News." After 6 months or so we changed the name and focus.


Do you think that as a site becomes large enough people start to change their own focus and sort of get lost in the crowd? That eventually people are just disillusioned by the site itself? I noticed this is sort of happening with reddit / digg.


I don't think largeness is the problem in itself so much as the decline in quality and civility that usually accompanies it. If we can avoid decline we're ok. This is mostly uncharted territory, but I'm hopeful. I'm going to be working this year on tweaks to encourage people to be more civil in comment threads.


Wonderful idea, but I'd like to see this with:

* Objectively ranked categories based on word usage, not author-provided tags

* Legible graphs with fewer colors

* Analysis that weights comments by karma (or, better, a reasonable non-linear function of karma)

* Open source for reproducibility and better outside critique

Edit: In fact, I'd like to see it enough that I might build it. Any other ideas?
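The karma-weighting idea above could look something like this rough sketch; the log damping is an arbitrary choice of non-linear function, and the tag/karma pairs are hypothetical input, not the real dataset:

```python
import math

def comment_weight(karma):
    # Log damping: one 500-karma comment shouldn't drown out fifty
    # 10-karma ones. The +1 avoids log(0) for zero-karma comments.
    return math.log(max(karma, 0) + 1)

def weighted_tag_counts(tagged_comments):
    # tagged_comments: iterable of (tag, karma) pairs.
    totals = {}
    for tag, karma in tagged_comments:
        totals[tag] = totals.get(tag, 0.0) + comment_weight(karma)
    return totals
```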


I'd be interested in a breakdown of the 10 or so most popular source sites on a (say) month-by-month basis. You'd expect to see a lot of articles referencing Google, techcrunch, HN and the like but I'm also surprised by the number of articles from the NY Times, WSJ and such.

Maybe determine "popularity" first by number of articles and then again by score.
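A month-by-month source breakdown of that kind could be sketched like this; the (url, score, month) tuple shape is an assumption about how the data might be stored, and ranking by score instead would just mean summing scores rather than counting:

```python
from collections import Counter
from urllib.parse import urlparse

def top_sources(posts, n=10):
    # posts: iterable of (url, score, month) tuples, e.g.
    # ("http://techcrunch.com/...", 42, "2009-06").
    # Returns {month: [(domain, article_count), ...]} ranked by count.
    by_month = {}
    for url, score, month in posts:
        domain = urlparse(url).netloc.replace("www.", "")
        by_month.setdefault(month, Counter())[domain] += 1
    return {m: c.most_common(n) for m, c in by_month.items()}
```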


I'll see if I can do something like that.

I have to do all kinds of tricky date parsing to figure out when an item was posted, so this is not as easy as it seems at first glance.


I'll build another graph tomorrow that is weighted by votes, that might give another perspective.

There seems to be some confusion about how to interpret the graph: the y axis is 'per post', the x axis is per block of 1,000 posts.

The graph would have been harder to read at 138x1000 pixels so I've stretched it a bit.


One thing this graph can't account for is the quality of submissions. A technology article about 8051 assembly language is a lot different than a technology article about the top 10 SEO tips for bloggers.


That's absolutely true.

When this is all done I'll make all the data available so that people can mine it for what it is worth.

A fair amount of work went into this little project, not all of it automated (unfortunately); a lot of time went into making sure the tags would be somewhat relevant.

If there had been any major trends that I could not explain by looking at samples of the data, I would certainly have investigated.

I hope that if there are such trends that I missed, they will come out in the follow-up (weighted by votes, see above). If not, the analysis will have to be much more detailed, and that will probably mean a lot more handwork than went into making this graph.

The 'good news' for me is that there is no unbounded growth of the 'unspecified' category; that would be a fairly large indicator of trouble.


To me top 10 SEO tips for bloggers is not about technology at all, it's about marketing. This is the category I feel is growing and it makes HN less interesting to me.


Can you provide more information about how the sampling was done and how you categorized the articles?


The graph also needs better labelling. For example, what is the x-axis? Is that snapshots over time?


The x axis is the rank number of the posting divided by 1000, so that's a constant sampling interval in blocks of 1,000 but more compressed in time towards the right because of the higher posting frequency.
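The binning scheme described here could be sketched as follows; the list-of-tags input (one tag per post, in rank order) is an assumed representation of the data:

```python
def bin_posts(tags_in_rank_order, bin_size=1000):
    # tags_in_rank_order: one tag per post, ordered by item id.
    # Returns, per block of bin_size posts, the fraction of posts
    # carrying each tag -- the 'per post' y-axis described above.
    bins = []
    for start in range(0, len(tags_in_rank_order), bin_size):
        block = tags_in_rank_order[start:start + bin_size]
        counts = {}
        for tag in block:
            counts[tag] = counts.get(tag, 0) + 1
        bins.append({t: n / len(block) for t, n in counts.items()})
    return bins
```

Because the bins are constant in posts rather than in time, each bin to the right covers a shorter time span as posting frequency grows, which is exactly the compression described above.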


It would make more sense to use a constant time x-axis. Also, how did you do the article labeling/clustering?


It wouldn't make much difference actually, apart from greatly complicating the matching up of the Y axis.

The bigger issue is the fact that this is just everything that is posted and not flagged; it is, if you wish, a view of the 'new' page and has nothing to do with the 'home' page. I'll try to address that tomorrow.

As for the labelling and clustering, that was based on keywords in the title from a fair sized sample, and from the urls the links pointed to.
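A minimal sketch of that title-keyword-plus-URL approach might look like this; the keyword and domain tables here are made-up stand-ins (the real tag set had about 200 tags), not the actual rules used:

```python
from urllib.parse import urlparse

# Hypothetical rules for illustration only.
TITLE_KEYWORDS = {
    "startup": "startups",
    "lisp": "programming",
    "seo": "marketing",
}
DOMAIN_TAGS = {
    "techcrunch.com": "startups",
    "nytimes.com": "mainstream",
}

def tag_post(title, url):
    # Try title keywords first, then the source domain,
    # else fall back to 'unspecified'.
    lowered = title.lower()
    for keyword, tag in TITLE_KEYWORDS.items():
        if keyword in lowered:
            return tag
    domain = urlparse(url).netloc.replace("www.", "")
    return DOMAIN_TAGS.get(domain, "unspecified")
```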

What I am specifically searching for is larger trends, smaller trends would be very difficult to catch using this method.

I'm actually quite surprised how even the graphs come out over the longer term, I would have expected more variation in the submissions.

So if there is a problem at this point in time I would conclude that the problem is not in the submissions, they seem to have roughly the same subjects over the long term as they did in the beginning, with the exception of a shift of focus away from 'startups' in the first year or so of operation.

I think that has to do with an influx of programmers / people interested in technology in general whereas originally most of the people on news.yc were active in the startup scene.


> they seem to have roughly the same subjects over the long term as they did in the beginning

That's been my impression for a long time. Do your techniques allow you to measure the trend of people complaining about the site deteriorating? Because that's been going on for a long time too, and in approximately the same way (though possibly in cycles).

> I think that has to do with an influx of programmers / people interested in technology in general whereas originally most of the people on news.yc were active in the startup scene.

Pretty clearly that is because the site was originally named Startup News and had a relatively narrow scope, then was renamed to Hacker News as part of explicitly broadening the scope.


> Do your techniques allow you to measure the trend of people complaining about the site deteriorating?

No, especially not because plenty of those get flagged and die.


I'm pretty sure the number on the x-axis is the number of weeks since the founding of HN.


Eventually I'll release the whole dataset.


That looks surprisingly consistent to my eye.

Any data on how other similar sites (e.g., reddit) have changed?


Agreed, with the exception that the proportion of articles tagged 'startups' decreased reasonably fast in the first quarter of the graph.

This was probably part of the iterative change from 'Startup News' to 'Hacker News'. I can't recall exactly when the name changed, or whether it was a response to that trend or precipitated the wider focus.

Edit: Change was made 14 August 2007.


Apart from the settling down period in the first year that is my conclusion as well.

Even the cyclic nature that edw519 referred to in the 'new years exchange' seems to be mostly limited to the voting; it has nothing to do with the actual submissions.


I think this is interesting, but wonder if the change people talk about could be more in the quality of discussion than type of submitted articles?


FWIW, yesterday I submitted a link to the history of Waite Group publishing by Mitch Waite himself. It was an awesome piece on entrepreneurship, failure, and the history of personal computing, and it's still sitting at 1 point:

http://news.ycombinator.com/item?id=1047482

Meanwhile, less topical but more sensational stories shot to the top.


I'm tempted to create some kind of site that checks the 1-point submissions automatically. I'm not sure how much semantic analysis it would take, but even minimal keyword checking could flag possibly good articles. Hidden.HN
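The minimal-keyword-checking version could be as simple as this sketch; both word lists are hypothetical examples of 'meaty' versus hype-flavored titles, not a tested classifier:

```python
# Hypothetical keyword lists for illustration.
INTERESTING = {"assembly", "compiler", "lisp", "history", "unix"}
HYPE = {"seo", "top 10", "growth hack"}

def looks_hidden_gem(title, points):
    # Surface 1-point posts whose titles hit 'meaty' keywords
    # and avoid obvious hype words.
    lowered = title.lower()
    if points > 1:
        return False
    return (any(kw in lowered for kw in INTERESTING)
            and not any(kw in lowered for kw in HYPE))
```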


That's actually a piece of cake to do.


That is a real problem. I think that has mostly to do with the speed with which new articles are being submitted.


I'm having difficulty matching the key with the graph for the thinner categories, and for some similarly-colored categories.


I believe the layers (stacked lines) and legend are in the same order.


yep.


Technology is listed twice in the legend?


Hey, sharp eye! I must have misspelled it at some point and the 'matcher' then used both labels.

Those two should be summed, but the misspelled one is used very rarely (fortunately).

I'll redo the graph tomorrow when I'm awake; it won't affect any other rows or the shapes though.

There were many subcategories as well, but I've used only the top level of the tags to make the graph legible.

In total I used about 200 different tags.


Seems like the proportion of articles actually focused on hacking has decreased, but everything else is pretty consistent.

Also, what is "as khn" and "as kyc"? It looks like one replaced the other.


Forgive me if I'm misinterpreting your question. I believe it should be "Ask HN" (Hacker News) and "Ask YC" (Y Combinator). I haven't been around here for too long, but I have seen many Ask HN threads and none specifically addressed to YC. Those threads about the Y Combinator process are usually of the form "Ask PG: " (Paul Graham) or "Ask HN: YC Founders, what do you think about X?"


I think "as khn" is "Ask Hacker News" and "as kyc" is "Ask Y Combinator".


I believe it's just the text formatting gone awry. They are:

ask hn (ask hacker news) and ask yc (ask Y Combinator).

ask yc was obviously more popular in the early days.


You're correct. I don't know why the plot headings did that; in the tags they are correct.


Thanks, that's interesting. However, I'm a little confused about the categorisation. It looks like the categories add up to 100%. If that is the case, the category "blogs" doesn't make sense in my view. All other categories characterise the subject of the content whereas "blogs" says something about the publication channel. In my view, there is no sensible way to label a blog post about technology either "technology" or "blogs".


Correct. The problem here is that even though most of the blogs are technology blogs, it is very hard to categorize the majority of them as something specific. For instance, Bruce Schneier blogs about security most of the time, so all articles that could be tagged like that are now under 'hacking, security'.

But he also has lots of stuff that is not so easy to categorize, so that ended up depending on the ease with which the title let itself be identified, either under 'technology' or, in the worst case, under 'blogs'.

A similar problem appears with the 'mainstream' media websites, and it was solved in the same way with the top level category as a catch-all after other matches were ruled out.


It looks like there are too many "unspecified" articles to learn much from this visualization, other than a moderate decrease in the number of articles in your "startup" category, supplanted largely by "ask" topics, a trend that largely leveled off early on the x-axis. The graph would be dramatically more useful if it had some real-time benchmarks to give a sense of scale.


See the text in the article about the 'unspecified'.

As for the scale, it doesn't get much more precise than this. The only concession to legibility is stretching the graph horizontally, because otherwise it would be only 138 pixels wide; vertically it is very close to one posting per pixel.

As the volume of postings on news.ycombinator increases due to increased traffic to the site, the graph will stretch further to the right.

This could be counteracted by changing the algorithm to 'bin' more posts to the right hand side to get for instance one month per bin, but in practice the outcome would be the same, you'd just have another weighting to do to get the Y-axis of the bins to line up.


For unspecified, possibly could run the unspecified URLs through bit.ly to get the meta description/keywords.

Example: (Is Amazon EC2 oversubscribed)

http://api.bit.ly/info?version=2.0.1&hash=83VrYk&log...
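A lookup along those lines could be sketched like this. Only the `version` and `hash` parameters appear in the URL quoted above (the rest of its query string is truncated); the `login` and `apiKey` parameter names are assumptions about bit.ly's v2 API of the time, and fetching the built URL with `urllib.request.urlopen` would be the next step:

```python
from urllib.parse import urlencode

BITLY_INFO = "http://api.bit.ly/info"  # v2 endpoint quoted above

def info_request_url(hash_, login, api_key):
    # Build the /info request for a bit.ly hash. The JSON response
    # could supply page metadata to use as fallback keywords for
    # otherwise-'unspecified' submissions. 'login' and 'apiKey'
    # are assumed parameter names, not confirmed from the comment.
    query = urlencode({"version": "2.0.1", "hash": hash_,
                       "login": login, "apiKey": api_key})
    return BITLY_INFO + "?" + query
```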


Does this data include flagged/dead items?


No.

Those are only available to logged in members directly from HN.

I agree that that would make it a lot better.


The green area is marked 'technlogy' and the orange area is marked 'technology'! Isn't that an error?


Yes, it was remarked on before, but you're wrong about the areas: if you look closely you'll see that the rows are 'in order', and the green area is actually a one-time occurrence somewhere near the top. The other green area is the one that has unclassified submissions in it.


I wonder if tracking just the number of flagged posts would provide similar insight.


That would have to include some points threshold; lots of spam gets flagged as well.

The biggest indicator of something being 'populist' but not 'HN' is when it gets killed after receiving more than 10 upvotes.


Nice one jacquesm. Do you mind sharing the data? I'd like to play around with some graphs too.


All in good time; 3 or 4 more days before it is really done. This is the first bit of usable info that I could extract.

The tagging has been a lot of work, to put it mildly, and it is far from finished. Eventually I hope to crowdsource that part to get it perfect.


[deleted]


One was a misspelling; it was used only once, which is why it doesn't show up in the graphs.


Interesting. Some quantitative proof to support comment rule #7.


I'm not quite ready to draw that conclusion, more work needs to be done for that. Stay tuned :)



