Web Scraping to Create Open Data

jnotarstefano · on March 31, 2016

I had to use a similar approach when creating a cluster analysis of the amendments in the Italian Senate [0].

The Italian Senate offers a SPARQL endpoint [1], which unfortunately doesn't offer access to the texts of the amendments. So I had to roll my own and create a small spider for them using Scrapy [2].

[0]: https://github.com/jacquerie/senato.py/blob/master/analysis....

[1]: http://dati.senato.it/23

[2]: https://github.com/jacquerie/senato.py/blob/master/senato/sp...

harperlee · on March 30, 2016

So what is the legality of this? Apart from the risk of having someone pull the plug on the way one takes the information out, when is something without a proper license able to be used?

dsp1234 · on March 30, 2016

In the US, there is no copyright protection for "facts" on their own. However, a compilation/database of facts can have copyright protection based on a three-part test [0].

    1. the collection and assembly of pre-existing material, facts, or data;
    2. the selection, coordination, or arrangement of those materials; and
    3. the creation, by virtue of the particular selection, coordination, or arrangement of an original work of authorship.

But specifically there is no protection for the underlying facts themselves, and there is no "sweat of the brow" doctrine. So scraping the data, and rearranging the underlying facts into your own arrangement/organization is almost always not copyright infringement. However, if that data is categorized in some non-trivial way, and you keep that organization, then that is likely to be copyright infringement.

However, if what you're scraping are not "facts", but some creative works, such as blog posts, product descriptions, etc, then it is likely to be copyright infringement.

Then on top of that, even if there is copyright infringement, other defenses such as a license to use the data, or fair use may apply.

[0] - http://www.pddoc.com/copyright/compilation.htm

toomuchtodo · on March 30, 2016

> So scraping the data, and rearranging the underlying facts into your own arrangement/organization is almost always not copyright infringement.

I'm not so sure. It would definitely be illegal in the US for me to cherry pick data out of Google Maps and add it to OpenStreetMap (and OSM has policies addressing exactly this).

iolothebard · on March 30, 2016

Yet companies like LexisNexis get most of the data they resell this way.

toomuchtodo · on March 30, 2016

Are they scraping copyrighted data? Or public records? Big difference.

iheartmemcache · on March 31, 2016

No one in the US can hold copyright in pure 'facts', especially if one demonstrates they invested enough energy to 'creatively reinterpret' them. Scraping hasn't quite seen a Supreme Court ruling yet (@grellas correct me, please), but I'm sure one could make a reasonable argument that the energy invested in re-collating the data is sufficient to pass any barrier. See Feist Publications, Inc. v. Rural Telephone Service Co. (1991) and O'Connor's opinion.

iolothebard · on March 31, 2016

Facts aren't copyrightable.

They scrape everything in the world they can get their hands on.

toomuchtodo · on March 31, 2016

Collections of facts are: https://www.unc.edu/courses/2006spring/law/357c/001/projects...

ap22213 · on March 30, 2016

What part of the law does this fall under? Do people get arrested for this? (i.e. criminal) What's the worst that can happen?

toomuchtodo · on March 31, 2016

https://en.wikipedia.org/wiki/Copyright_infringement

http://www.copyright.gov/title17/92chap5.html#501

https://www.law.cornell.edu/uscode/text/17/chapter-5

https://www.lib.purdue.edu/uco/CopyrightBasics/penalties.htm...

pimlottc · on March 31, 2016

That's begging the question of whether Google's data on public streets is actually protected by copyright under U.S. law.

toomuchtodo · on March 31, 2016

https://www.google.com/permissions/geoguidelines.html

lazyjones · on March 30, 2016

IANAL but in the EU at least, even databases comprised of simple "facts" are protected.

It's a sad state of affairs when I'm not even allowed to scrape data generated using taxpayers' money, like the (required by EU laws) noise maps for cities, which I'd like to use to augment real estate offers, for example.

Symbiote · on March 31, 2016

"Europe" would like to partially fund that noise database with income from businesses that use it. The result is that less taxpayer money is needed.

I think it's only the UK that has copyrightable fact databases.

techdragon · on March 31, 2016

Except that doesn't happen because the last thing a new business idea needs is more red tape, paperwork and expenditure.

minimaxir · on March 30, 2016

I'm not fond of the implication at the end that scraping is justifiable because old websites are dinosaurs without APIs, and those websites are jerks for not doing so, and therefore scraping is the moral thing to do.

I've scraped my share of BuzzFeed data and Foursquare data to make data visualizations (with the latter explicitly saying "don't scrape" in their Terms). But if either one told me to stop and take down my results, I would not contest, since data is what drives the Internet ecosystem.

(For the record, neither service did; in fact, both tried to recruit me as a result of the visualizations. The difference is that I am not using the data to create a direct competitor that could cause them to lose business.)

kh_hk · on March 30, 2016

Disclaimer, I wrote the article.

> I'm not fond of the implication at the end that scraping is justifiable because old websites are dinosaurs without APIs, and those websites are jerks for not doing so, and therefore scraping is the moral thing to do.

It was not my intention to give that implication. The main implication behind CityBikes is that public services should already provide this information since, well, it is a public service. Along the same lines, a private company providing a public service should already do so. See motives [1].

> I've scraped my share of BuzzFeed data and Foursquare data to make data visualizations (with the latter explicitly saying "don't scrape" in their Terms). But if either one told me to stop and take down my results, I would not contest, since data is what drives the Internet ecosystem.

Same as CityBikes is doing. If we receive a cease and desist, we remove their service from our API. As for Foursquare, I do not see Foursquare as a public service. Your tax dollars at work, and all that.

I tried to keep the article balanced but maybe it wasn't clear. There are many transportation companies willing and happy to be scraped, or looking forward to providing their information for people to reuse [2].

[1]: https://blog.scrapinghub.com/2016/03/30/web-scraping-to-crea...

[2]: http://nabsa.net/current-members/

seanp2k2 · on March 30, 2016

Why does your blog intentionally crash browsers that it thinks are Safari?

mryan · on March 31, 2016

You appear to have replied to the wrong comment. ScrapingHub is not the site that attempts to crash Safari - that is weboob.

fucking_tragedy · on March 31, 2016

What do you mean by intentionally?

kh_hk · on March 30, 2016

> For the record, neither service did; in fact, both tried to recruit me as a result of the visualizations. The difference is that I am not using the data to create a direct competitor that could cause them to lose business.

I do not understand that implication. How is providing bike share information creating a competitor? I can't run a bike sharing service.

minimaxir · on March 30, 2016

Less a competitor, more a non-canonical source of information that they cannot manage.

If your offshoot were to misrepresent data, for example, then you would become a liability, even if you weren't making money.

chrisweekly · on March 30, 2016

Random tangent: your comment about unmanaged non-canonical misrepresented data reminds me of Zillow, who publish "facts" about real estate transactions, with no mechanism for error correction. A few years ago they posted an erroneous sale of my house, a transaction that never took place, listing a sale price 20% lower than we'd paid for it. It directly harmed me when we later tried to sell the house and potential buyers cited Zillow's "estimates", which of course were artificially, drastically lower because of the phantom transaction. There was no avenue for recourse; angry tweets got a half-baked response from a junior social media person, but it was never resolved. I wonder how many others Zillow must have messed up.

kh_hk · on March 30, 2016

That's a fair point. Easily enforceable by a proper license. One example is ETALAB open data license.

jdc · on March 31, 2016

Why do you think scraping needs justifying in the first place?

unsettledtck · on March 30, 2016

Out of curiosity, where are the boundaries of your gray area when it comes to scraping?

minimaxir · on March 30, 2016

If the service has an API with a fair rate limit (Foursquare does at 5000 requests/hour), I believe that is ok, since that implies their architecture is built for massive data requests. On the other hand, bypassing those rate limits with proxies is definitely bad.

If a website does not have an API (BuzzFeed), I take care to collect only the data that I need, and nothing that would damage the business (e.g., entire articles). Consequently, I sanitize the data of such things if I decide to release the dataset.
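
For scale, the 5000 requests/hour figure mentioned above works out to one request every 0.72 seconds. Here is a minimal sketch of a client-side throttle that stays under such a limit. The `RateLimiter` class and its parameters are hypothetical names chosen for illustration, not part of any Foursquare or BuzzFeed tooling.

```python
import time


class RateLimiter:
    """Space calls evenly so that at most `max_per_hour` happen per hour.

    Hypothetical helper for illustration; `clock` and `sleep` are injectable
    so the behaviour can be checked without actually waiting.
    """

    def __init__(self, max_per_hour, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 3600.0 / max_per_hour  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next request may go out

    def wait(self):
        """Block until the next request may be sent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        return delay
```

Calling `limiter.wait()` before every request, with `limiter = RateLimiter(5000)`, keeps a single client at or below the published limit. Rotating proxies to get around the limit is exactly what such a throttle is meant to avoid.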

rakoo · on March 30, 2016

"Web scraping to create Open Data" is the exact reason why weboob (http://weboob.org/) was created and still thrives today. CityBikes already seems to be doing a big part of the job, and in Python no less, so it should be easy to integrate its data and use it with Boobsize (http://weboob.org/applications/boobsize.html)

maxaf · on March 30, 2016

That naming scheme definitely needs a long, hard rethink.

rakoo · on March 30, 2016

It's funny, every time Weboob is presented somewhere, and every time there is a post about the latest version of Weboob, the first comment is a variation of "it's sexist/boobs are unprofessional/grow up", and very very little time is spent talking about the actual thing, what it does and why its only goal is to become irrelevant. Sad thing.

Here's what they have to say about it, and why there's very little chance they will change anything:

http://laurent.bachelier.name/2013/12/weboob-the-asshole-det...

(This comment is not directed at you directly)

WaxProlix · on March 30, 2016

> It's funny, every time Weboob is presented somewhere, and every time there is a post about the latest version of Weboob, the first comment is a variation of "it's sexist/boobs are unprofessional/grow up", and very very little time is spent talking about the actual thing

Seems like a bad name then, no?

JamilD · on March 30, 2016

There is a legitimate argument to be made if the naming was unintentional or if people were making a big deal out of nothing.

But with application names like 'Handjoob', 'Boobsize', and 'Flatboob', I think the naming is unavoidably distracting and inappropriate.

snurk · on March 30, 2016

Uh, exactly. Not much "sad" about it. More like bad choices on the part of the creator.

rakoo · on March 31, 2016

I don't believe it's a problem with the name itself but more a problem with how commenters are more easily concerned about everything sounding professional even though it may very well have nothing to do with a job.

chrisbroadfoot · on March 30, 2016

Don't click the link above. It may crash your browser.

------------------------------

Full text:

We often get complaints around Weboob's name, and the various application names.

There’s no denying they’re childish. What they are not, however, is sexist.

There is “boob” in the main name, and “boob” is a friendly name referring to (mostly female) breasts. We would, for example, avoid using “tits” or “cunt”, because they are often demeaning [1]. Though it is a happy accident (our earlier ideas like “woob” and “webob” were taken), we certainly like playing with that.

The idea is the same with application names; it’s all about friendly jokes (like wetboobs the weather tool, which manages to be related to weather and boobs).

If you’re offended, just ask yourself “how is it sexist?”.

As it appears, Weboob is a formidable tool to detect people that are part of the “be offended first, think later” crowd. Interestingly, the crusaders [2] are to date all male, and often assert that women can’t like jokes about breasts or sex in general [3]. How fucked up is that?

They will always make a scene [4] on how they’re never going to use Weboob because of names. Guys, here’s the thing: we don’t need you and we certainly don’t want you. I for one am glad we created an Asshole Detector, albeit by accident.

[1] We however are mostly not native English speakers. Mistakes can happen.

[2] This is not an euphemism. They act like they are fighting for a good cause, but it’s only pretend.

[3] And who the hell are they to talk in place of others? That is actual sexism.

[4] So that it is abundantly clear, this is purely about making a scene; I do not care about their opinions or how many penises they may have. All our contributors do not and do not have to like the branding.

siegecraft · on March 30, 2016

It may crash your browser >intentionally<

maxaf · on March 30, 2016

It's a fun read. You know, I'm not normally one to jump on the "offended" bandwagon. In this case I felt compelled to speak up because the name makes it pretty much impossible to reference or recommend the project in a business setting. Wearing a t-shirt to work makes me feel like a true rebel; mentioning a project with components such as "boobsize" and "wetboobs" is going to simply be too weird for many people.

mpeg · on March 30, 2016

If you have to write a blog post about how the sexism "crusaders" have it in for you, chances are you are the asshole.

I'm not offended by the name, I get the joke, I just think it's a stupid joke.

kbenson · on March 30, 2016

> It's funny, every time Weboob is presented somewhere, and every time there is a post about the latest version of Weboob, the first comment is a variation of "it's sexist/boobs are unprofessional/grow up", and very very little time is spent talking about the actual thing, what it does and why its only goal is to become irrelevant. Sad thing.

That in itself is a very good reason why the naming scheme needs a long, hard rethink. It's distracting.

Put another way, I wouldn't blame everyone else if every time my hypothetical company "Natzie" was mentioned the conversation devolved into something unrelated to the reason it was mentioned. If you want to succeed, you need to make decisions based on the real world, not how the world should be (not that you can't push limits, but baby steps are often needed).

snurk · on March 30, 2016

Interesting — that link is either down, or it's set up to present several different troll responses depending on the http referrer.

When I click, I'm redirected to a simple image. When I request it in incognito mode, I'm redirected to a "crash safari" url.

mpeg · on March 30, 2016

Hacker news referrer = social justice warrior hipsters it seems.

mirimir · on March 31, 2016

No problem here. RefControl set to block by default.

AgentME · on March 30, 2016

I clicked the link and got some "hipsters not allowed" page. Not exactly helping the point that it's not just childish.

lamontcg · on March 30, 2016

"weboob the asshole detector"

I think it probably /is/ an asshole detector, but not quite in the way that the authors of that webpage /think/ it is.

from Wikipedia:

Psychological projection is a theory in psychology in which humans defend themselves against their own unpleasant impulses by denying their existence while attributing them to others.

Jarwain · on March 30, 2016

Is that supposed to link me to http://no-hipsters-allowed.t28.net/ ?

Edit: it directs me properly in an incognito window.

seanp2k2 · on March 30, 2016

Warning, this link either goes to CrashSafari or http://no-hipsters-allowed.t28.net/ .

snurk · on March 30, 2016

> ... long, hard ...

Heh.

Ok, but seriously, I look forward to seeing their appearance on r/drama when twitter discovers this.

danieltillett · on March 30, 2016

Are you referring to weBOOB.com or scRAPINGhub.com or both?

tn13 · on March 30, 2016

On their homepage you see "anal+" image....

danvoell · on March 30, 2016

agreed

kh_hk · on March 31, 2016

This script [1] using pybikes would accomplish a similar thing.

The only problem with using pybikes natively is the set of systems we call asynchronous. These are bike share websites that require actually clicking on a station to get its status information. This means that to get accurate information on the entire feed (let's say 500 stations), you would have to go through each of them. In these cases, it's way easier to just use the API.

[1]: https://gist.github.com/eskerda/bbd65539048a53eadfccc5d535ad...
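
To make the cost of those "asynchronous" systems concrete, here is a rough sketch that counts requests for the 500-station example. It does not use pybikes or the CityBikes API; `CountingSource` and both scrape functions are hypothetical stand-ins that only simulate a website and tally how often it is hit.

```python
class CountingSource:
    """Hypothetical stand-in for a bike share website; counts requests made."""

    def __init__(self, n_stations):
        self.requests = 0
        self._status = {
            f"station-{i}": {"bikes": i % 10} for i in range(n_stations)
        }

    def station_ids(self):
        """One request: the list of stations, without their status."""
        self.requests += 1
        return list(self._status)

    def station_status(self, station_id):
        """One request per station: the 'click on a station' case."""
        self.requests += 1
        return self._status[station_id]

    def full_feed(self):
        """One request: every station with its status in a single document."""
        self.requests += 1
        return dict(self._status)


def scrape_per_station(source):
    # Asynchronous-style system: the station list plus one request per station.
    return {sid: source.station_status(sid) for sid in source.station_ids()}


def scrape_feed(source):
    # Single-feed system: one request covers everything.
    return source.full_feed()
```

With 500 stations, `scrape_per_station` issues 501 requests where `scrape_feed` issues one, for the same data.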

PlzSnow · on March 31, 2016

Can anyone tell me which cloud provider they are using? I want to make sure that scrapinghub are on the list. I block the IP addresses of all the major cloud providers to prevent parasites such as this.
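
For reference, the kind of range-based blocking described here can be sketched with Python's standard `ipaddress` module. The CIDR blocks below are reserved documentation ranges used as placeholders, not any real provider's addresses, and `is_blocked` is a hypothetical helper name.

```python
import ipaddress

# Placeholder ranges (reserved for documentation); a real list would come
# from whatever address ranges each cloud provider publishes.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_blocked(addr, networks=BLOCKED_NETWORKS):
    """Return True if `addr` falls inside any of the listed networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in networks)
```

A scraper running from a residential connection or an unlisted host would pass this check, so a block list of this kind only covers the ranges it knows about.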

l1n · on March 30, 2016

Heh. I do this with my Student Government data [1].

[1] https://umbc.lin.anticlack.com/finance/

yeukhon · on March 31, 2016

You need to fix the certificate before showing that to the public.