I had to use a similar approach when creating a cluster analysis of the amendments in the Italian Senate [0].
The Italian Senate offers a SPARQL endpoint [1], which unfortunately doesn't offer access to the texts of the amendments. So I had to roll my own and create a small spider for them using Scrapy [2].
So what is the legality of this? Apart from the risk of having someone pull the plug on the way one takes the information out, when is something without a proper license able to be used?
In the US, there is no copyright protection for "facts" on their own. However, a compilation/database of facts can have copyright protections based on a 3 part test[0].
1. the collection and assembly of pre-existing material, facts, or data;
2. the selection, coordination, or arrangement of those materials; and
3. the creation, by virtue of the particular selection, coordination, or arrangement of an original work of authorship.
But specifically there is no protection for the underlying facts themselves, and there is no "sweat of the brow" doctrine. So scraping the data, and rearranging the underlying facts into your own arrangement/organization is almost always not copyright infringement. However, if that data is categorized in some non-trivial way, and you keep that organization, then that is likely to be copyright infringement.
However, if what you're scraping are not "facts", but some creative works, such as blog posts, product descriptions, etc, then it is likely to be copyright infringement.
Then on top of that, even if there is copyright infringement, other defenses such as a license to use the data, or fair use may apply.
> So scraping the data, and rearranging the underlying facts into your own arrangement/organization is almost always not copyright infringement.
I'm not so sure. It would definitely be illegal in the US for me to cherry pick data out of Google Maps and add it to OpenStreetMap (and OSM has policies addressing exactly this).
No one in the US can hold copyright in pure 'facts', especially if one demonstrates they invested enough energy to 'creatively reinterpret' them. Scraping hasn't quite seen a Supreme Court ruling yet (@grellas correct me, please), but I'm sure one could make a reasonable argument that the energy invested in re-collating the data is sufficient to pass any barrier. See Feist Publications, Inc. v. Rural Telephone Service Co. (1991) and O'Connor's opinion.
IANAL but in the EU at least, even databases comprised of simple "facts" are protected.
It's a sad state of affairs when I'm not even allowed to scrape data generated using taxpayers' money, like the (required by EU laws) noise maps for cities, which I'd like to use to augment real estate offers, for example.
I'm not fond of the implication at the end that scraping is justifiable because old websites are dinosaurs without APIs, and those websites are jerks for not doing so, and therefore scraping is the moral thing to do.
I've scraped my share of BuzzFeed data and Foursquare data to make data visualizations (with the latter explicitly saying "don't scrape" in their Terms). But if either one told me to stop and take down my results, I would not contest, since data is what drives the Internet ecosystem.
(For the record, neither service did; in fact, both tried to recruit me as a result of the visualizations. The difference is that I am not using the data to create a direct competitor that could cause them to lose business.)
> I'm not fond of the implication at the end that scraping is justifiable because old websites are dinosaurs without APIs, and those websites are jerks for not doing so, and therefore scraping is the moral thing to do.
It was not my intention to give that implication. The main implication behind CityBikes is that public services should already provide this information since, well, it is a public service. On the same line, a private company providing a public service should already do so. See motives [1].
> I've scraped my share of BuzzFeed data and Foursquare data to make data visualizations (with the latter explicitly saying "don't scrape" in their Terms). But if either one told me to stop and take down my results, I would not contest, since data is what drives the Internet ecosystem.
Same as CityBikes is doing. If we receive a cease and desist, we remove their service from our API. As for Foursquare, I do not see Foursquare as a public service. Your tax dollars at work, and all that.
I tried to keep the article balanced but maybe it wasn't clear. There are many transportation companies willing and happy to be scraped, or looking forward to providing their information for people to reuse [2].
> For the record, neither service did; in fact, both tried to recruit me as a result of the visualizations. The difference is that I am not using the data to create a direct competitor that could cause them to lose business.
I do not understand that implication. How is providing bike share information creating a competitor? I can't run a bike sharing service.
Random tangent: your comment about unmanaged non-canonical misrepresented data reminds me of Zillow, who publish "facts" about real estate transactions, with no mechanism for error correction. A few years ago they posted an erroneous sale of my house (a transaction that never took place), listing a sale price 20% lower than we'd paid for it. It directly harmed me when we later tried to sell the house, because potential buyers cited Zillow's "estimates", which of course were artificially and drastically lower because of the phantom transaction. There was no avenue for recourse; angry tweets got a half-baked response from a junior social media person, but it was never resolved. I wonder how many others Zillow must have messed up.
If the service has an API with a fair rate limit (Foursquare does at 5000 requests/hour), I believe that is ok, since that implies their architecture is built for massive data requests. On the other hand, bypassing those rate limits with proxies is definitely bad.
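To make the rate limit point concrete, here is a minimal sketch of a client-side throttle that spaces requests so they stay under an hourly quota. The `Throttle` class and its parameters are my own illustration, not part of any Foursquare client; only the 5000 requests/hour figure comes from the comment above.

```python
import time


class Throttle:
    """Space out calls so that at most `max_per_hour` happen per hour.

    Hypothetical helper for illustration; the clock and sleep functions
    are injectable so the behaviour can be checked without real waiting.
    """

    def __init__(self, max_per_hour, clock=time.monotonic, sleep=time.sleep):
        # Minimum number of seconds between two consecutive calls.
        self.min_interval = 3600.0 / max_per_hour
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` before each request with `Throttle(5000)` leaves at least 0.72 seconds between requests, which is the point of respecting the limit rather than working around it.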
If a website does not have an API (BuzzFeed), I take care to only collect data that I need, and nothing that would damage the business (e.g., entire articles). Consequently, I sanitize the data of such things if I decide to release the dataset.
"Web scraping to create Open Data" is the exact reason why weboob (http://weboob.org/) was created and still thrives today. CityBikes already seems to be doing a big part of the job, and in Python no less, so it should be easy to integrate its data and use it with Boobsize (http://weboob.org/applications/boobsize.html)
It's funny, every time Weboob is presented somewhere, and every time there is a post about the latest version of Weboob, the first comment is a variation of "it's sexist/boobs are unprofessional/grow up", and very very little time is spent talking about the actual thing, what it does and why its only goal is to become irrelevant. Sad thing.
Here's what they have to say about it, and why there's very little chance they will change anything:
> It's funny, every time Weboob is presented somewhere, and every time there is a post about the latest version of Weboob, the first comment is a variation of "it's sexist/boobs are unprofessional/grow up", and very very little time is spent talking about the actual thing
I don't believe it's a problem with the name itself but more a problem with how commenters are more easily concerned about everything sounding professional even though it may very well have nothing to do with a job.
Don't click the link above. It may crash your browser.
------------------------------
Full text:
We often get complaints around Weboob’s name, and the various application names.
There’s no denying they’re childish. What they are not, however, is sexist.
There is “boob” in the main name, and “boob” is a friendly name referring to (mostly female) breasts. We would, for example, avoid using “tits” or “cunt”, because they are often demeaning (1). Though it is a happy accident (our earlier ideas like “woob” and “webob” were taken), we certainly like playing with that.
The idea is the same with application names; it’s all about friendly jokes (like wetboobs the weather tool, which manages to be related to weather and boobs).
If you’re offended, just ask yourself “how is it sexist?”.
As it appears, Weboob is a formidable tool to detect people that are part of the “be offended first, think later” crowd. Interestingly, the crusaders (2) are to date all male, and often assert that women can’t like jokes about breasts or sex in general (3). How fucked up is that?
They will always make a scene (4) on how they’re never going to use Weboob because of names. Guys, here’s the thing: we don’t need you and we certainly don’t want you. I for one am glad we created an Asshole Detector, albeit by accident.
(1) We however are mostly not native English speakers. Mistakes can happen.
(2) This is not an euphemism. They act like they are fighting for a good cause, but it’s only pretend.
(3) And who the hell are they to talk in place of others? That is actual sexism.
(4) So that it is abundantly clear, this is purely about making a scene; I do not care about their opinions or how many penises they may have. All our contributors do not and do not have to like the branding.
It's a fun read. You know, I'm not normally one to jump on the "offended" bandwagon. In this case I felt compelled to speak up because the name makes it pretty much impossible to reference or recommend the project in a business setting. Wearing a t-shirt to work makes me feel like a true rebel; mentioning a project with components such as "boobsize" and "wetboobs" is going to simply be too weird for many people.
> It's funny, every time Weboob is presented somewhere, and every time there is a post about the latest version of Weboob, the first comment is a variation of "it's sexist/boobs are unprofessional/grow up", and very very little time is spent talking about the actual thing, what it does and why its only goal is to become irrelevant. Sad thing.
That in itself is a very good reason why the naming scheme needs a long, hard rethink. It's distracting.
Put another way, I wouldn't blame everyone else if every time my hypothetical company "Natzie" was mentioned the conversation devolved into something unrelated to the reason it was mentioned. If you want to succeed, you need to make decisions based on the real world, not how the world should be (not that you can't push limits, but baby steps are often needed).
I think it probably /is/ an asshole detector, but not quite in the way that the authors of that webpage /think/ it is.
from Wikipedia:
Psychological projection is a theory in psychology in which humans defend themselves against their own unpleasant impulses by denying their existence while attributing them to others.
This script [1] using pybikes would accomplish a similar thing.
The only problem with using pybikes natively is that some systems are what we call asynchronous. These are bike share websites that require actually clicking on a station to get its status information. This means that to get accurate information on the entire feed (let's say 500 stations), you would have to go through each of them, one request per station. In these cases, it's way easier to just use the API.
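As a rough sketch of why those systems are costly to scrape directly: every station needs its own request, so the work grows with the number of stations. The `fetch_all_stations` function and the `fetch_station` callable below are hypothetical stand-ins, not pybikes functions.

```python
def fetch_all_stations(station_ids, fetch_station):
    """Collect station statuses one request at a time.

    `fetch_station` is a hypothetical callable that performs one request
    for one station and returns its status. For a 500-station system this
    loop performs 500 requests.
    """
    statuses = {}
    for station_id in station_ids:
        statuses[station_id] = fetch_station(station_id)
    return statuses
```

A client of an API that has already aggregated the stations gets the same data with a single request, which is the trade-off described above.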
Can anyone tell me which cloud provider they are using? I want to make sure that scrapinghub are on the list. I block the IP addresses of all the major cloud providers to prevent parasites such as this.
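For anyone curious what that kind of blocking looks like, here is a minimal sketch using Python's standard `ipaddress` module. The CIDR ranges are placeholders taken from the reserved documentation blocks, not real cloud provider ranges, and `is_blocked` is a name I made up for the example.

```python
import ipaddress

# Placeholder ranges (reserved documentation blocks), standing in for
# whatever ranges a site operator decides to block.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_blocked(address, networks=BLOCKED_NETWORKS):
    """Return True if `address` falls inside any of the given networks."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in networks)
```

With the placeholder list, `is_blocked("192.0.2.17")` is true and `is_blocked("203.0.113.5")` is false. A real deployment would more likely do this in the firewall or web server than in application code.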
[0]: https://github.com/jacquerie/senato.py/blob/master/analysis....
[1]: http://dati.senato.it/23
[2]: https://github.com/jacquerie/senato.py/blob/master/senato/sp...