Am I really the only person who has dubious feelings about this? I contribute my words to HN, where they can be seen in context and where they are viewed by the same community that I am interacting with. I don't contribute them for arbitrary other uses off the site.
Unless I have missed something, posters who submit their comments here do not automatically release them into the public domain. In fact, I have seen no legal statement anywhere about transfer of copyright as a condition of posting, so it's not clear that posters give anyone any rights at all, other than probably the operators of HN an implicit licence to publish them on the site and visitors to HN an implicit right to read them while browsing the site in the normal way. That would make downloading and sharing the entire HN database in this way an obvious infringement of the copyright of every poster here.
Sorry if this seems a bit OTT, but some of us watched many comments we contributed to the community in the Usenet days being appropriated by long-term Usenet archives that then republished them out of context, covered in advertising, with comments/ratings attached to them that aren't open to the rest of the Usenet community, etc. That is basically profit-making on the back of others' work without their knowledge or consent, and potentially at the expense of the community the poster originally wished to support, and I have a problem with that.
The "The web is considered 'public domain'" argument seems to be said by people who have personal interest in that statement being true.
The "When it's out on the internet..." argument is more a statement of inevitability. The person making the statement may not participate in taking the content; they are just pointing out that it's inevitable that other people will.
It's not surprising that words in a public forum are misappropriated, just like it's not surprising that a Mercedes with the keys in the ignition will be stolen. But that doesn't make either act any less illegal or deplorable. The law is clear in this case. Without any agreement to the contrary, the copyright still remains with the author.
The analogy would have to be, "if you are worried about that then your only option is to refrain from parking it in public places with the keys in the ignition."
This entire thread is people taking my analogy and shooting it in the face. The point is this: just because a crime is trivial and likely to happen doesn't mean we shouldn't blame the criminal. Yes, you're an idiot if you leave the keys in the ignition of your expensive car and you're even more of an idiot if you're surprised your car got stolen. But that doesn't make stealing it any less of a crime or the criminal any less responsible.
There are already more than 60 apps being built on top of HN as a data source. Almost every one of them is trying to make the author's life better, code up an experiment, or give back to the community. Usually they try to do all three. Some use fixed scrapes; others are actively scraping HN daily.
Really, I am not sure what the fuss is over. Like, what profit-making?
I have personally visited each and every app. I could be wrong, but I don't think I have seen so much as a single Adsense ad on any of them. I think the only one that is making money is the iPhone HN app.
For this project, three hackers got together and saw that we had a common need, one shared by the many people hacking together apps. We wanted to help save everyone the trouble of scraping HN. We did this after asking each other, and other people, whether they had the data. We were trying to make things easier for people in the HN community creating mashups, and figured it would save Paul and YC some bandwidth. Ronnie has already been donating his time and server for http://api.ihackernews.com/ and Joseph has been running http://metaoptimize.com/projects/autotag/hackernews/ "To learn more about large-scale NLP + machine learning techniques". I have been getting the data because I have been running analysis/status of HackerNews that I want to share with the community.
Do you have a problem with all the HN mashups out there from mobile apps and twitter bots to search sites?
It probably is, as I acknowledged in my original post. Then again, it's also a matter of principle, and as I also noted in my original post, it's a kind of behaviour that has been widely abused in the past.
> Do you have a problem with all the HN mashups out there from mobile apps and twitter bots to search sites?
I think you have to look at each on its own merits. Clearly we're all happy to volunteer our posts on HN, so while I can't speak for others, I personally have no problem with someone writing, say, a non-commercial app that makes HN more readable on a mobile device. It's still clearly linked to the original site, and is unlikely to harm anyone or compromise the integrity of the posts. I just don't think the same applies to a complete data dump of the entire site, distributed independently and with no inherent link back to the original source.
PG's knowingly permitted people to scrape the site for a long time [1]. This site has been redistributing HN content through their unofficial API for a couple of months now [2], and PG was also aware of that [3].
I agree that it would be nice to have a concrete statement about the copyright status of content we post, but this isn't a large step from what's fairly well-known to have already been done.
> I agree that it would be nice to have a concrete statement about the copyright status of content we post, but this isn't a large step from what's fairly well-known to have already been done.
(a) I've checked everywhere I can find on the site, including the sign-up, and the copyright status of posts seems very clear: it remains with the poster, and no-one else (including HN/pg) has the right to authorise redistribution elsewhere.
(b) This behaviour was only "fairly well-known to have already been done" if you happened to see that discussion a few weeks ago, and in any case wouldn't apply retrospectively to earlier posts.
(c) This isn't really about the legal status anyway. Assuming this is all based in the US, anyone who wanted to register their content with the US Copyright Office and then record evidence of redistribution by this service and anyone else hosting the torrents could theoretically hit each party with a $150,000 lawsuit for each post they copied, with bonus points awarded because they've even uploaded it to every copyright lawyer's favourite service, The Pirate Bay. Of course, that seems rather unlikely. It just seems really disrespectful to me to take content that many people have volunteered their time to contribute to HN and share/reproduce that content elsewhere without permission.
> (a) I've checked everywhere I can find on the site, including the sign-up, and the copyright status of posts seems very clear: it remains with the poster, and no-one else (including HN/pg) has the right to authorise redistribution elsewhere.
You didn't find anything mentioning it, correct? I'm ignorant of the law here, could someone who isn't please clarify what sort of implicit granting of rights takes place when we submit content to a website? This seems to be the crux of the issue. Clearly we're permitting it to be made available here even though that's not explicitly stated.
> (b) This behaviour was only "fairly well-known to have already been done" if you happened to see that discussion a few weeks ago, and in any case wouldn't apply retrospectively to earlier posts.
SearchYC is fairly well-known and it has redistributed HN content for a few years. Perhaps less well-known: it has also made that content available in JSON for most of that time.
> (c) This isn't really about the legal status anyway. Assuming this is all based in the US, anyone who wanted to register their content with the US Copyright Office and then record evidence of redistribution...
I don't think I understand what you're saying here. How is that not about the legal status?
In the discussion some months ago about whether Hacker Magazine would include reprinted versions of comments, grellas thought that it would probably need permission of the commenters to do so. His view was that, absent any explicit license terms, there was an implicit grant of permission to print the comment on HN, but not an implicit grant to reprint it anywhere else.
> I'm ignorant of the law here, could someone who isn't please clarify what sort of implicit granting of rights takes place when we submit content to a website?
The problem with this area is precisely that it is not clear, so clarifying it is hard. :-) But the basic rule would be that the benefit of the doubt is going to go with the copyright holder, and anyone who wants to make a copy without explicit consent would have to defend their position.
> I don't think I understand what you're saying here. How is that not about the legal status?
My own concern is not really about the legal status. It's about whether it is respectful to the community to reuse their contributions wholesale in this way.
Obviously if anyone wanted to take more serious action, their case would be entirely about the legal status, but from what I've seen, that status probably isn't in much doubt anyway.
The usenet example is an interesting one, but I'd be firmly on the other side. If someone is going to the trouble of creating a new method of access (via the web), they should be rewarded for it.
> "That is basically profit-making on the back of others' work without their knowledge or consent"
I disagree. Should a paid usenet client download be similarly thought of? If you make money selling a usenet reader, should you be thought of as making money out of other people's work?
How about a browser? Surely that's making money off other people's work (the website authors').
I'm of the opinion that anything posted in a comment is pretty much public domain.
I don't think it's dubious at all. I think it's a clear abuse of copyright. My comments are my own. They are implicitly licensed to Y Combinator to store and display on Hacker News, but they are not licensed to anyone else.
The internet is not public domain. If you have not been explicitly granted permission, you don't have it. I have not, and will not, grant permission for my posts to be redistributed in a data dump.
The thing is, anybody could just use a crawler and get the data directly from HN. Downloading the bittorrent is just a shortcut.
Since you published your comments here, I think they are free for everyone to consume. Now to republish them somewhere else is a different matter. I guess publishing a torrent could be counted as republishing the data, but what would be the point in refusing it? The publisher is not making any money from it.
I'm more glad that things I've posted are being "backed up" in various ways. Nothing beats backing things up personally but I'm happy Google has posts of mine from Usenet in the mid 90s. Not something I'd have ever kept a copy of and awesome to look back on.
With usenet you did give permission: the inherent nature of usenet required that messages be copied across large numbers of independent servers with different access policies (free, charging, etc) and mechanisms (nntp, email, web). Hence by posting to usenet you implicitly granted permission for your message to be copied (this was actually tested in court a few times in the 90s).
> this was actually tested in court a few times in the 90s
Sure, I imagine the legal departments at places like Deja News were kept busy for a while as the practical nature of Usenet was debated.
However, I'm not aware of any case that was sufficiently broad as to support everything that goes on. For example, has any site that takes Usenet archives and slaps in-line ad links all over the posts, thus actually changing the content, ever successfully defended an action? Or any site that takes posts from Usenet, but keeps replies posted from its own systems on those systems only?
Inline content modification is one where I suspect the legal territory might be murky; however, not propagating replies is almost certainly legal.
While posting on usenet gives an implicit licence to allow servers to copy the message, there's no explicit or implicit legal requirement for an individual Usenet server to propagate replies. Indeed, even historically there have been many usenet servers that were one way only due to the nature of peering arrangements.
> however not propagating replies is almost certainly legal.
On Usenet, possibly, but it is probably reasonable to assume that someone posting to Usenet knows that it is a distributed system and not all servers copy all messages, and therefore that by posting anyway any implicit permission to copy their material takes this into account.
On HN, however, a post is always accompanied by its replies. If someone posts something, and then posts a follow-up, they expect that anyone reading one comment can also see the other. Given that the entire legal basis for reproducing any comment from HN is built on implied consent, taking a snapshot that breaks that connection seems like fairly dangerous ground to walk on.
"That is basically profit-making on the back of others' work without their knowledge or consent, and potentially at the expense of the community the poster originally wished to support."
Not at all. A site that happens to be based on user-generated content is probably using that content with the user's knowledge and consent, just as pg does when HN publishes comments we make here.
I was thinking more about sites like Facebook, where users initially gave their info because the site represented a community they were already part of (like their college), and then Facebook turned around and sold it to advertisers and app developers.
I don't personally use Facebook. I don't like the way they operate, for much the reasons you describe. However, I expect there is at least some small print that users theoretically signed up for when they created an account that gives Facebook the sorts of rights you mentioned, so it's a rather different case.
Also, Facebook has been slapped down several times recently by government authorities/privacy regulators for going too far. If I were running an organisation like Facebook or Google, I'd be pretty concerned today about the privacy backlash that seems to be building in many countries as everyday non-geek users become more aware of how these companies operate and the potential risks associated with such liberal use of personal data. I think that's probably a discussion for another thread, though.
I have no problem with my words being on the Hacker News and available for everyone to see here. Obviously I wouldn't post them otherwise.
I do have a problem with people taking those words and redistributing them elsewhere. Just because it's on the Internet doesn't mean it's free, and I fail to see how what we're talking about here is much different to this case that was doing the rounds on the tech forums the other day:
Just because there is a reasonable reuse of the content doesn't mean that all reuses of the content are reasonable. Also search engines and archive.org preserve the context and content.
Caching is an interesting case, or rather, several interesting cases.
If it's not a faithful representation, i.e., if it does not reproduce anything exactly, omits any part of the material, adds additional material, or is out of date, then I'm not in favour. It's not really a cache at all at that point, and such flaws are obviously potentially damaging to both the visitor and the original host service/content providers. The only likely reasons I can see for not reproducing faithfully are incompetence or active leeching. (I note in passing that I have never seen a cache or archive web site that actually did reliably reproduce content faithfully. All of the major services, including heavyweights like Google Cache and archive.org, failed badly on this criterion last time I checked them out. But real cache proxies tend to pass, as long as they refresh at an appropriate frequency.)
If it's faithful but continues to make content available after the original host has pulled it, then it's potentially valuable to visitors and probably many hosts/content providers would have no objection, but I think it should be opt-in. Not respecting a service that puts content up for a while but then chooses to take it down again for whatever reason could have a chilling effect on willingness to put the content up in the first place, and could undermine business models that would otherwise be useful, reasonably fair to all parties, and financially viable. (This point doesn't really apply to sites like HN, though, as they typically publish posts indefinitely anyway.)
If it's faithful and timely but still robs the original host of valuable meta-information (notably real server logs that have at least two legitimate uses: helping to optimise the site based on real user behaviour, and supporting claims made to third parties about site traffic) then this could also be harmful to the host service, but on the other hand, it's not clear to what extent such meta-information should ever be relied upon anyway given the architecture of the Web today. Ideally, I think we would have some sort of standard proxy notification so that caches and such could forward relevant meta-information to the original host in some sort of digest form. That way, this one becomes a non-issue, as long as any cache/archive service implements the appropriate notification to be fair to the original host.
Basically, I think cache/archive services can be widely useful and probably many hosts/content providers would have no objection, but given that they can have significant downsides, they should always be opt-in and ideally we would have simple conventions based on something like robots.txt to indicate this.
I appreciate your asking the question, but I don't want to make this debate about my personal views just because I started the thread. As I write this, more than 50 people have upvoted my original post, so it appears to be a matter of general interest to the community (which is not to say that all of the upvoters necessarily object to any particular use of the data, nor that I do myself for that matter).
With that said, speaking only for myself, I don't think opt-outs are worth much in this sort of debate. If the question is worth asking, I think it should almost always be an opt-in system.
No. I opted in to having my posts shown on HN, for as long as the operators of HN are willing to host them there. That is the only thing any poster here opts into by signing up.
Cool. Now in XX years' time, after all my expressed opinions are proven to be correct, I can fire up an intelligent program to try to track down everyone who's ever disagreed with me and say 'I told you so'.
Thanks! This opens up a number of really interesting possibilities when mixed with people's expertise in search and machine learning / natural language processing.
A long time ago I got hold of a large chunk of Slashdot's stories and comments. The text and karma ratings for each post led me to try some fun experiments automatically extracting the community's sentiment towards certain topics or trying to mine Slashdot memes.
I've wanted to play around with the comments of Hacker News for some time due to the wealth of knowledge most comments hold but felt that crawling would be a bad idea as I certainly didn't want to cause PG's bandwidth cost/server load to increase.
Think about it - HN's a community full of people like me and if we all crawled HN to get that data it would be somewhat ugly, so thanks for sharing your data ;)
Part of the crawl that Ronnie is sharing in this link is something I shared with him. I've used HN crawls before for my automagically organized HN demo:
One of the problems with Hacker News is that while there is great discussion, whether it is on a short-lived story or evergreen advice, it pretty much fades into obscurity a couple of days after it is posted.
There have been curation efforts in the past, and collections like this make the data more accessible and make it more feasible that someone will apply some good NLP to organise it in a way that surfaces the older content which is still relevant.
Now this is neat. I spent a bit of today reverse engineering the Better HN chrome extension to see how it looked up URLs on HN (http://json-automatic.searchyc.com/domains/find?url=%s, for anybody who wondered; I can't find this API documented anywhere). Now, if this API holds up, it seems like there might be a more sustainable way of doing what I was planning.
It is a pity that the API does not have a way of getting what appears on the front page. I was planning to soon create a script that scanned the front page, and then applied pg's algorithm from "A Plan for Spam" to remove links that I think are off topic. Actually turning that into a desktop client would be rather cool, or maybe a Firefox extension as that is more my kind of thing.
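For anyone curious what that filtering step might look like, here is a rough Python sketch of the per-token scoring and combining rule from "A Plan for Spam", applied to story titles instead of email. The function names, the tokenizer, and the idea of training on hand-labelled on-topic/off-topic titles are assumptions made for illustration; the constants (doubling the good counts, the minimum of 5 occurrences, 0.4 for unknown tokens, the 0.01/0.99 clamp, the 15 most interesting tokens) follow the essay as best I recall it.

```python
import re
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters, digits, ', $ and -.
    return re.findall(r"[a-z0-9'$-]+", text.lower())

def train(titles):
    # Count token occurrences across a list of hand-labelled titles.
    counts = Counter()
    for title in titles:
        counts.update(tokenize(title))
    return counts

def token_prob(tok, good, bad, ngood, nbad):
    # ngood / nbad are the numbers of on-topic / off-topic titles (both > 0).
    g = 2 * good.get(tok, 0)  # good counts doubled to bias against false positives
    b = bad.get(tok, 0)
    if g + b < 5:
        return 0.4  # rare or unseen token: mildly "innocent"
    gr = min(1.0, g / ngood)
    br = min(1.0, b / nbad)
    return max(0.01, min(0.99, br / (gr + br)))

def off_topic_prob(title, good, bad, ngood, nbad, n=15):
    probs = [token_prob(t, good, bad, ngood, nbad) for t in set(tokenize(title))]
    # Keep the n tokens whose probability is furthest from neutral 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod, inv = 1.0, 1.0
    for p in probs[:n]:
        prod *= p
        inv *= 1.0 - p
    return prod / (prod + inv)

# Toy training data, made up for this example.
on_topic = ["python compiler"] * 5
off_topic = ["celebrity gossip"] * 5
good, bad = train(on_topic), train(off_topic)
score = off_topic_prob("celebrity gossip", good, bad, len(on_topic), len(off_topic))
```

A front-page script would then hide any title whose score is above some cutoff (the essay uses 0.9 for mail), which still leaves the problem of getting the front page in the first place.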
Yes, I took it all down. I need to rethink this. Clearly some folks aren't happy with it at all. Not that intentions really matter, but my intentions were only to help all of the people who are making HN apps. I figured we could reduce the need to scrape HN if we just distributed the data.
Based on the front page example, it looks like the database doesn't contain the comment structure (ie, what comments go with what article, and which comments parent others)
Sorry, top level posts don't have a parent. When that's null, it's excluded from the XML. Comments all have a ParentID. I just added ParentID to the example.
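Given that rule (no ParentID on top-level posts, a ParentID on every comment), the thread structure can be rebuilt from the flat records. A minimal Python sketch, assuming each record has already been parsed from the XML into a dict with an "ID" key (that key name is a guess; only ParentID is confirmed above):

```python
from collections import defaultdict

def build_thread_index(records):
    # Top-level posts have no ParentID key at all, so .get() returns None.
    children = defaultdict(list)
    roots = []
    for rec in records:
        parent = rec.get("ParentID")
        if parent is None:
            roots.append(rec["ID"])
        else:
            children[parent].append(rec["ID"])
    return roots, dict(children)

def thread_ids(root_id, children):
    # Depth-first walk returning (id, depth) pairs for one story's thread.
    out = []
    stack = [(root_id, 0)]
    while stack:
        node, depth = stack.pop()
        out.append((node, depth))
        # Push in reverse so siblings come out in their original order.
        for child in reversed(children.get(node, [])):
            stack.append((child, depth + 1))
    return out

# Made-up records: one story (1) with two comments (2, 4) and a reply (3).
records = [
    {"ID": 1},
    {"ID": 2, "ParentID": 1},
    {"ID": 3, "ParentID": 2},
    {"ID": 4, "ParentID": 1},
]
roots, children = build_thread_index(records)
thread = thread_ids(1, children)
```

The depth value is enough to indent comments the way the site does; ordering within a level here is just record order, not HN's ranking.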
If you really want to understand the HN community (or online communities in general) this data doesn't tell you much.
There are many interesting questions this data cannot answer.
When was each of those 39 points gained?
Who upvoted?
How many points did the submission have when each comment was posted?
How does the number of points affect the number of comments?
Are there cascades of votes (several users voting in quick succession)?
For each of the comments, who voted up and who voted down?
I don't think the actual HN "database" has that much info. You would need to reconstruct this from the complete server logs, along with application logs for PG's direct updates (assuming he doesn't use the web interface)
It has to at least know who upvoted, to prevent revotes.
HN has to know who voted, not which way someone voted. HN doesn't let you change your vote once it's been made, but it does let you downvote once you have enough karma.
I don't think this would work well. The occasional false-positives would look like annoying glitches, every now and then making it look like your account had been hijacked by hiding the buttons as if you had already 'voted'. Although maybe I'm not getting the math right?
Also, the existence of the 'saved stories' link from the user page would seem to indicate that for stories at least a full list is being stored.
I think it would have been neat if someone created a browser-extension that let users contribute such data, but I guess preventing abuse could prove to be an impossible task, since there is no way that I can think of for a third-party to check whether the user really performed an action without having to perform those actions for the user, in which case we would have to trust our login-details to that third-party.