HN long term change (jacquesmattheij.com)
84 points by jacquesm on Jan 12, 2010 | 54 comments



I'm relatively new, so it's difficult for me to see any major change that's happened since I first joined. My personal opinion is that there has been little change since I arrived, but much talk about how much has changed.

That said, I think an article about economics or politics can be more profound than a lightweight article about technology, and better for the tone of the site. My personal preference is for news here to be technical, but an article being technical is not enough for it to be interesting. Overall, however, I want articles with meat: ones that dig a little deeper than "Head First SQL" or "How I made a blog engine with Erlang" (or "I am Philip Greenspun and I don't like people in Northern California").

While I love beautifully designed languages as much as the next guy, I've seen probably 50 blog articles posted here about "Why language design matters" or "Why language design doesn't matter" or "Why there can never be a better Lisp". While these may be technical in nature, they're often very shallow and redundant.

As far as I know, determining depth automatically would be an impressive feat, but I think it would give you a better picture of how the tone is changing over time in a relevant way. And maybe it would be a good filter for submitted articles.


If anyone wonders why there was such a high proportion of articles about startups at first, it was because the site was initially called "Startup News." After 6 months or so we changed the name and focus.


Do you think that as a site becomes large enough people start to change their own focus and sort of get lost in the crowd? That eventually people are just disillusioned by the site itself? I noticed this is sort of happening with reddit / digg.


I don't think largeness is the problem in itself so much as the decline in quality and civility that usually accompanies it. If we can avoid decline we're ok. This is mostly uncharted territory, but I'm hopeful. I'm going to be working this year on tweaks to encourage people to be more civil in comment threads.


Wonderful idea, but I'd like to see this with:

* Objectively ranked categories based on word usage, not author-provided tags

* Legible graphs with fewer colors

* Analysis that weights comments by karma (or, better, a reasonable non-linear function of karma)

* Open source for reproducibility and better outside critique

Edit: In fact, I'd like to see it enough that I might build it. Any other ideas?
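The karma-weighting idea above could look something like this rough sketch; the log damping is an arbitrary choice of non-linear function, and the tag/karma pairs are hypothetical input, not the real dataset:

```python
import math

def comment_weight(karma):
    # Log damping: one 500-karma comment shouldn't drown out fifty
    # 10-karma ones. The +1 avoids log(0) for zero-karma comments.
    return math.log(max(karma, 0) + 1)

def weighted_tag_counts(tagged_comments):
    # tagged_comments: iterable of (tag, karma) pairs.
    totals = {}
    for tag, karma in tagged_comments:
        totals[tag] = totals.get(tag, 0.0) + comment_weight(karma)
    return totals
```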


I'd be interested in a breakdown of the 10 or so most popular source sites on a (say) month-by-month basis. You'd expect to see a lot of articles referencing Google, techcrunch, HN and the like but I'm also surprised by the number of articles from the NY Times, WSJ and such.

Maybe determine "popularity" first by number of articles and then again by score.
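A month-by-month source breakdown of that kind could be sketched like this; the (url, score, month) tuple shape is an assumption about how the data might be stored, and ranking by score instead would just mean summing scores rather than counting:

```python
from collections import Counter
from urllib.parse import urlparse

def top_sources(posts, n=10):
    # posts: iterable of (url, score, month) tuples, e.g.
    # ("http://techcrunch.com/...", 42, "2009-06").
    # Returns {month: [(domain, article_count), ...]} ranked by count.
    by_month = {}
    for url, score, month in posts:
        domain = urlparse(url).netloc.replace("www.", "")
        by_month.setdefault(month, Counter())[domain] += 1
    return {m: c.most_common(n) for m, c in by_month.items()}
```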


I'll see if I can do something like that.

I have to do all kinds of tricky date parsing to figure out when an item was posted, so this is not as easy as it seems at first glance.


I'll build another graph tomorrow that is weighted by votes, that might give another perspective.

There seems to be some confusion about how to interpret the graph: the y axis is 'per post', the x axis is per block of 1,000 posts.

The graph would have been harder to read at 138x1000 pixels so I've stretched it a bit.


One thing this graph can't account for is the quality of submissions. A technology article about 8051 assembly language is a lot different than a technology article about the top 10 SEO tips for bloggers.


That's absolutely true.

When this is all done I'll make all the data available so that people can mine it for what it is worth.

A fair amount of work went into this little project, not all of it automated (unfortunately); a lot of time went into making sure the tags would be somewhat relevant.

If there had been any major trends that I could not explain by looking at samples of the data, I would certainly have investigated.

I hope that if there are such trends that I missed, they will come out in the follow-up (weighted by votes, see above). If not, the analysis will have to be much more detailed, and that will probably mean a lot more handwork than went into making this graph.

The 'good news' for me is that there is no unbounded growth of the 'unspecified' category; that would be a fairly large indicator of trouble.


To me top 10 SEO tips for bloggers is not about technology at all, it's about marketing. This is the category I feel is growing and it makes HN less interesting to me.


Can you provide more information about how the sampling was done and how you categorized the articles?


The graph also needs better labelling. For example, what is the x-axis? Is that snapshots over time?


The x axis is the rank number of the posting divided by 1000, so that's a constant sampling interval in blocks of 1,000 but more compressed in time towards the right because of the higher posting frequency.
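The binning scheme described here could be sketched as follows; the list-of-tags input (one tag per post, in rank order) is an assumed representation of the data:

```python
def bin_posts(tags_in_rank_order, bin_size=1000):
    # tags_in_rank_order: one tag per post, ordered by item id.
    # Returns, per block of bin_size posts, the fraction of posts
    # carrying each tag -- the 'per post' y-axis described above.
    bins = []
    for start in range(0, len(tags_in_rank_order), bin_size):
        block = tags_in_rank_order[start:start + bin_size]
        counts = {}
        for tag in block:
            counts[tag] = counts.get(tag, 0) + 1
        bins.append({t: n / len(block) for t, n in counts.items()})
    return bins
```

Because the bins are constant in posts rather than in time, each bin to the right covers a shorter time span as posting frequency grows, which is exactly the compression described above.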


It would make more sense to use a constant time x-axis. Also, how did you do the article labeling/clustering?


It wouldn't make much difference actually, apart from greatly complicating the matching up of the Y axis.

The bigger issue is the fact that this is just everything that is posted and not flagged; it is, if you wish, a view of the 'new' page and has nothing to do with the 'home' page. I'll try to address that tomorrow.

As for the labelling and clustering, that was based on keywords in the title from a fair sized sample, and from the urls the links pointed to.
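A minimal sketch of that title-keyword-plus-URL approach might look like this; the keyword and domain tables here are made-up stand-ins (the real tag set had about 200 tags), not the actual rules used:

```python
from urllib.parse import urlparse

# Hypothetical rules for illustration only.
TITLE_KEYWORDS = {
    "startup": "startups",
    "lisp": "programming",
    "seo": "marketing",
}
DOMAIN_TAGS = {
    "techcrunch.com": "startups",
    "nytimes.com": "mainstream",
}

def tag_post(title, url):
    # Try title keywords first, then the source domain,
    # else fall back to 'unspecified'.
    lowered = title.lower()
    for keyword, tag in TITLE_KEYWORDS.items():
        if keyword in lowered:
            return tag
    domain = urlparse(url).netloc.replace("www.", "")
    return DOMAIN_TAGS.get(domain, "unspecified")
```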

What I am specifically searching for is larger trends, smaller trends would be very difficult to catch using this method.

I'm actually quite surprised how even the graphs come out over the longer term, I would have expected more variation in the submissions.

So if there is a problem at this point in time I would conclude that the problem is not in the submissions, they seem to have roughly the same subjects over the long term as they did in the beginning, with the exception of a shift of focus away from 'startups' in the first year or so of operation.

I think that has to do with an influx of programmers / people interested in technology in general whereas originally most of the people on news.yc were active in the startup scene.


> they seem to have roughly the same subjects over the long term as they did in the beginning

That's been my impression for a long time. Do your techniques allow you to measure the trend of people complaining about the site deteriorating? Because that's been going on for a long time too, and in approximately the same way (though possibly in cycles).

> I think that has to do with an influx of programmers / people interested in technology in general whereas originally most of the people on news.yc were active in the startup scene.

Pretty clearly that is because the site was originally named Startup News and had a relatively narrow scope, then was renamed to Hacker News as part of explicitly broadening the scope.


> Do your techniques allow you to measure the trend of people complaining about the site deteriorating?

No, especially not because plenty of those get flagged and die.


I'm pretty sure the number on the x-axis is the number of weeks since the founding of HN.


Eventually I'll release the whole dataset.


That looks surprisingly consistent to my eye.

Any data on how other similar sites (e.g., reddit) have changed?


Agreed, with the exception that the proportion of articles tagged 'startups' decreased reasonably fast in the first quarter of the graph.

This was probably part of the iterative change from 'Startup News' to 'Hacker News'. I can't recall exactly when the name changed, or whether it was a response to that trend or precipitated the wider focus.

Edit: Change was made 14 August 2007.


Apart from the settling down period in the first year that is my conclusion as well.

Even the cyclic nature that edw519 referred to in the 'new years exchange' seems to be mostly limited to the voting; it has nothing to do with the actual submissions.


I think this is interesting, but wonder if the change people talk about could be more in the quality of discussion than type of submitted articles?


FWIW, yesterday I submitted a link to the history of Waite Group publishing by Mitch Waite himself. It was an awesome piece on entrepreneurship, failure, and the history of personal computing, and it's still sitting at 1 point:

http://news.ycombinator.com/item?id=1047482

Meanwhile, less topical but more sensational stories shot to the top.


I'm tempted to create some kind of site that checks the 1-point submissions automatically. I'm not sure how much semantic analysis it would take, but even minimal keyword checking could flag possibly good articles. Hidden.HN
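The minimal-keyword-checking version could be as simple as this sketch; both word lists are hypothetical examples of 'meaty' versus hype-flavored titles, not a tested classifier:

```python
# Hypothetical keyword lists for illustration.
INTERESTING = {"assembly", "compiler", "lisp", "history", "unix"}
HYPE = {"seo", "top 10", "growth hack"}

def looks_hidden_gem(title, points):
    # Surface 1-point posts whose titles hit 'meaty' keywords
    # and avoid obvious hype words.
    lowered = title.lower()
    if points > 1:
        return False
    return (any(kw in lowered for kw in INTERESTING)
            and not any(kw in lowered for kw in HYPE))
```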


That's actually a piece of cake to do.


That is a real problem. I think that has mostly to do with the speed with which new articles are being submitted.


I'm having difficulty matching the key with the graph for the thinner categories, and for some similarly-colored categories.


I believe the layers (stacked lines) and legend are in the same order.


yep.


Technology is listed twice in the legend?


Hey, sharp eye! I must have misspelled it at some point and the 'matcher' then used both labels.

Those two should be summed, but the misspelled one is used very rarely (fortunately).

I'll redo the graph tomorrow when I'm awake; it won't affect any other rows or the shapes though.

There were many subcategories as well, but I've used only the top level of the tags to make the graph legible.

In total I used about 200 different tags.


Seems like the proportion of articles actually focused on hacking has decreased, but everything else is pretty consistent.

Also, what is "as khn" and "as kyc"? It looks like one replaced the other.


Forgive me if I'm misinterpreting your question. I believe it should be "Ask HN" (Hacker News) and "Ask YC" (Y Combinator). I haven't been around here for too long, but I have seen many Ask HN threads and none specifically addressed to YC. Those threads about the Y Combinator process are usually of the form "Ask PG: " (Paul Graham) or "Ask HN: YC Founders, what do you think about X?"


I think "as khn" is "Ask Hacker News" and "as kyc" is "Ask Y Combinator".


I believe it's just the text formatting gone awry. They are:

ask hn (ask hacker news) and ask yc (ask Y Combinator).

ask yc was obviously more popular in the early days.


You're correct. I don't know why the plot headings did that; in the tags they are correct.


Thanks, that's interesting. However, I'm a little confused about the categorisation. It looks like the categories add up to 100%. If that is the case, the category "blogs" doesn't make sense in my view. All other categories characterise the subject of the content whereas "blogs" says something about the publication channel. In my view, there is no sensible way to label a blog post about technology either "technology" or "blogs".


Correct. The problem here is that even though most of the blogs are technology blogs, it is very hard to categorize the majority of them as something specific. For instance, Bruce Schneier blogs about security most of the time, so all articles that could be tagged like that are now under 'hacking, security'.

But he also has lots of stuff that is not so easy to categorize, so that ended up depending on the ease with which the title let itself be identified, either under 'technology' or, in the worst case, under 'blogs'.

A similar problem appears with the 'mainstream' media websites, and it was solved in the same way with the top level category as a catch-all after other matches were ruled out.


It looks like there are too many "unspecified" articles to learn much from this visualization, other than a moderate decrease in the number of articles in your "startup" category, supplanted largely by "ask" topics, a trend that largely leveled off early on the x-axis. The graph would be dramatically more useful if it had some real-time benchmarks to give a sense of scale.


See the text in the article about the 'unspecified'.

As for the scale, it doesn't get much more precise than this. The only concession to legibility is stretching the graph horizontally, because otherwise it would be only 138 pixels wide; vertically it is very close to one posting per pixel.

As the volume of postings on news.ycombinator increases due to increased traffic to the site, the graph will stretch further to the right.

This could be counteracted by changing the algorithm to 'bin' more posts to the right hand side to get for instance one month per bin, but in practice the outcome would be the same, you'd just have another weighting to do to get the Y-axis of the bins to line up.


For unspecified, possibly could run the unspecified URLs through bit.ly to get the meta description/keywords.

Example: (Is Amazon EC2 oversubscribed)

http://api.bit.ly/info?version=2.0.1&hash=83VrYk&log...
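A lookup along those lines could be sketched like this. Only the `version` and `hash` parameters appear in the URL quoted above (the rest of its query string is truncated); the `login` and `apiKey` parameter names are assumptions about bit.ly's v2 API of the time, and fetching the built URL with `urllib.request.urlopen` would be the next step:

```python
from urllib.parse import urlencode

BITLY_INFO = "http://api.bit.ly/info"  # v2 endpoint quoted above

def info_request_url(hash_, login, api_key):
    # Build the /info request for a bit.ly hash. The JSON response
    # could supply page metadata to use as fallback keywords for
    # otherwise-'unspecified' submissions. 'login' and 'apiKey'
    # are assumed parameter names, not confirmed from the comment.
    query = urlencode({"version": "2.0.1", "hash": hash_,
                       "login": login, "apiKey": api_key})
    return BITLY_INFO + "?" + query
```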


Does this data include flagged/dead items?


No.

Those are only available to logged in members directly from HN.

I agree that that would make it a lot better.


The green area is marked 'technlogy' and the orange area is marked 'technology'! Isn't that an error?


Yes, it was remarked on before, but you're wrong about the areas: if you look closely you'll see that the rows are 'in order', and the green area is actually a one-time occurrence somewhere near the top. The other green area is the one that has unclassified submissions in it.


I wonder if tracking just the number of flagged posts would provide similar insight.


That would have to include some points threshold; lots of spam gets flagged as well.

The biggest indicator of something being 'populist' but not 'HN' is when it gets killed after receiving more than 10 upvotes.


Nice one jacquesm. Do you mind sharing the data? I'd like to play around with some graphs too.


All in good time; 3 or 4 more days before it is really done. This is the first bit of usable info that I could extract.

The tagging has been a lot of work, to put it mildly, and it is far from finished. Eventually I hope to crowdsource that part to get it perfect.


[deleted]


One was a misspelling; it was used only once, which is why it doesn't show up in the graphs.


Interesting. Some quantitative proof to support comment rule #7.


I'm not quite ready to draw that conclusion, more work needs to be done for that. Stay tuned :)



