Web scraping case fails under Supreme Court's Dastar doctrine

btilly · on Nov 19, 2018

Here is an attempted translation from legalese.

Company B scraped listings off of Company A's site, and published their own site based off of that data. Then A sued on three grounds:

1. Company B's notices falsely claimed copyright to the listings.

2. Company B ignored company A's copyright notice.

3. Violation of the Lanham Act, which prevents people claiming something is from somewhere other than where it is when they sell it. (This is the interesting one.)

The court ruled as follows.

General copyright notices at the bottom of a web page claim copyright over the site as a whole and not to all of the data that may appear in the site. Therefore 1 fails because company B's notice is not claiming copyright. And 2 fails because A's notice was not specific enough to claim copyright.

As for 3, the fact that there is no physical product means that the precedent the Supreme Court set in the Dastar case applies - the Lanham Act only applies to physical products.

I'm sure that I didn't get it quite right, but that version may be more readable than the original article.

balfirevic · on Nov 19, 2018

> And 2 fails because A's notice was not specific enough to claim copyright.

What does that mean?

I thought you possess copyright on your creation without needing specific copyright notices. And if some part of your website is not copyrightable, how would notice help?

Additionally, if someone copies your work that you have copyright on, they are infringing even if there isn't any copyright notice. Are they not?

caf · on Nov 19, 2018

The plaintiff wasn't claiming a copyright infringement, it was trying to claim specifically for removal of the copyright notice, which is a separate cause of action under the DMCA.

balfirevic · on Nov 20, 2018

Removal of copyright notice from material that's not actually under copyright?

I'm sorry if I'm not understanding something simple, I tried a quick google search and came up with this [0], but that seems to apply to cases where you remove copyright notice from copyrighted work.

[0] - http://www.photolaw.net/did-someone-remove-the-copyright-not...

caf · on Nov 20, 2018

It appears that no argument was made as to whether the material was actually under copyright or not, because the plaintiff wasn't alleging copyright infringement. You could perhaps read into this that it was something of an ambit claim.

kijin · on Nov 20, 2018

I don't know exactly which website the case is about, but maybe the plaintiff couldn't claim copyright infringement because they don't own any copyright over the scraped content in the first place, i.e. the listings were posted by users.

andjd · on Nov 19, 2018

To clarify, the Lanham Act covers trademarks and unfair competition, not copyright. That makes the title of this blogpost confusing, because it seems to be conflating the copyright and trademark arguments that were both rejected.

jrochkind1 · on Nov 19, 2018

To be fair, the original case itself, and the law, are confusing before the blogpost/title even get to it. :)

So the case in question was making separate copyright and trademark claims, and the court rejected _both_ of them, is that right?

kevin_thibedeau · on Nov 20, 2018

Except that aggregated data can be treated as having a copyright. Why plaintiff didn't present that argument boggles the mind. Second rate lawyers there.

polm23 · on Nov 20, 2018

Aggregated data doesn't really have any copyright in the US.

https://en.wikipedia.org/wiki/Sui_generis_database_right

fjsolwmv · on Nov 20, 2018

Maybe the professionals know their job better than you do? Mere aggregations of facts are generally not copyrightable.

TheOtherHobbes · on Nov 20, 2018

But unauthorised re-use can be challenged as misappropriation - effectively unfair competition.

https://www.ipiustitia.com/2013/12/retrospective-passing-off...

Which does suggest the copyright angle really wasn't the best one to take.

There may be a good reason why that didn't apply here, but - not being a lawyer - I don't know what it is.

kbutler · on Nov 20, 2018

Because of the conflict between the public interest in allowing copying facts and the misappropriation doctrine, that doctrine is generally limited to "hot news", and citing the source is generally sufficient even there.

The article you linked includes various restrictions on the doctrine:

"highly time-sensitive", "in direct competition", "rend[ering] [the] publication profitless"

The Wikipedia page on the misappropriation doctrine has various links and commentary about the narrowing of its application: https://en.wikipedia.org/wiki/Misappropriation_doctrine

Particularly, from the Second Circuit NBA vs Motorola (1997) "Such concepts are virtually synonymous for wrongful copying and are in no meaningful fashion distinguishable from infringement of a copyright." https://www.law.cornell.edu/copyright/cases/105_F3d_841.htm Other circuits have followed similar principles.

Novashi · on Nov 19, 2018

So if you have data that you can legitimately claim copyright to mixed with data that you can’t, how would you proceed?

Retric · on Nov 19, 2018

Data and factual information, such as rainfall amounts, are not protected by copyright.

This includes a lot of things like prices which seem to be creative endeavors making it somewhat confusing.

baroffoos · on Nov 20, 2018

I thought that collections of facts are still under copyright so things like Google Maps data is copyrighted even if the things in it are facts. You can create an identical copy as long as you verify and collect the facts yourself, and if a mistake on Google Maps is found on another map then it's obvious it was copied.

lmkg · on Nov 20, 2018

An arrangement of facts can be copyrighted, but the facts themselves cannot. Collecting together facts and bundling them does not make a copyrighted work, but organizing facts can create a copyrighted work.

The litmus test for this in the US is literally the phone book, based on Supreme Court rulings about copyright on phone books. The white pages is an exhaustive list of phone numbers sorted alphabetically by last name, and is not copyrightable because that's considered obvious. The yellow pages sort businesses based on category according to someone's judgement, and emphasize certain ones, which is considered just barely 'original' enough that it can be copyrighted. The same phone numbers in the same categories would violate copyright of the yellow pages; the same phone numbers with a different organizing principle would not violate copyright.

jrochkind1 · on Nov 20, 2018

Not in the US. Google's choice of colors and widths of lines and other layout might be. Not the map data. Possibly in Europe and other countries that aren't the US

fjsolwmv · on Nov 20, 2018

Nope. Facts are not copyrightable. You can copy all of Google's factual map data, but you can't copy their artsy rendering of it in a picture.

baroffoos · on Nov 20, 2018

Then why is copying data from Google Maps absolutely banned for OpenStreetMap editors? Also if Google includes some fake data then that's not a fact and possibly a creative work.

yellowbkpk · on Nov 20, 2018

OSM editors are prevented from using Google Maps because OSM prefers that we have explicit permission to use data sources. Since we don't have explicit permission to use Google Maps, we can't use it.

Separately, Google Maps has a terms of use that prevent reuse of Google Maps data. You agree to those terms when visiting Google Maps or using Google Maps API.

You wouldn't be breaking copyright law when copying from Google Maps to OSM, you'd be breaking the terms of use contract with Google and community norms and expectations in OSM.

codedokode · on Nov 20, 2018

> You agree to those terms when visiting Google Maps or using Google Maps API.

Those terms are not even presented on the screen when visiting Google Maps. And even if they were, I didn't sign anything or agree to anything. It is ridiculous if those terms are legally binding. Because then I will make a site and make you pay $1 for every page viewed.

yorwba · on Nov 20, 2018

> And even if they were, I didn't sign anything or agree to anything. It is ridiculous if those terms are legally binding.

Contracts don't need to be signed to be valid (when was the last time you signed a contract with an online shop). And if you don't agree to the terms, then you're using the site without permission and can be sued for damages. But the exact amount to pay can't just be any made-up number.

Similarly, if you walk into a shop, take an apple and eat it; then when the shopkeeper demands a million dollars, you can refuse to agree. Then the shopkeeper is free to sue you for destroying his property and you will be ordered to compensate him, although probably not for the full amount.

db48x · on Nov 20, 2018

Except that Google has a credible threat: they can delete your account, including your email, your Android applications, your Youtube videos, etc.

__david__ · on Nov 20, 2018

That's only a threat if you use Google for those things.

rmc · on Nov 21, 2018

Fair enough. But the OSM project would rather not try to find the legal limits of this idea, and would rather play it safe.

em-bee · on Nov 19, 2018

at least austria recognizes a separate copyright for list compilations that apply to the whole list but not the individual entries. a phonebook for example, or a pricelist. you can use the individual entries (based on their own copyright if one applies) but not duplicate the list as a whole.

greetings, eMBee.

sgc · on Nov 20, 2018

but can you recreate your own list which will by definition be very similar to the original? If not it is first past the post rights to public domain data, which has always been my major objection to database/digitization rights.

em-bee · on Nov 20, 2018

i suppose it depends on where you get the data from. like implementing your own version of a function, if you can demonstrate that you haven't looked at the other implementation you should be fine.

greetings, eMBee.

fjsolwmv · on Nov 20, 2018

Please don't sign your comments. Your username is displayed above your comment.

pfortuny · on Nov 20, 2018

Stating the fact explicitly, not just with a “generic notice” at the end.

devy · on Nov 19, 2018

What's the precedent the Supreme Court set in the Dastar case?

torstenvl · on Nov 19, 2018

The "origin" under the Lanham Act refers to the physical product. If I repackage a music CD pressed and published by Sony BMG and pass it off as my own, that could be a violation of the Lanham Act. If I copy the content but procure the media and press my own CDs, that cannot be a violation, because I, not Sony BMG, am the origin of the physical good.

https://caselaw.findlaw.com/us-supreme-court/539/23.html

burtonator · on Nov 19, 2018

Scraping public web content is a really confusing situation.

My company, Datastreamer, has been in business for ten years indexing public web content (news + blogs). We focus primarily on "live" content. Content that publishes often.

The main challenge we've always had is that just because the content CAN be indexed doesn't necessarily mean you MAY index it.

A recent situation was around Craigslist vs 3taps:

https://en.wikipedia.org/wiki/Craigslist_Inc._v._3Taps_Inc.

Basically the issue doesn't revolve around WHO has copyright but who has copyaccess to the content.

So if you create an account on Acme.com... You still own the copyright to the content you post but Acme controls access. Not only that but the ToS that you sign gives them rights to your content including bulk sales.

This means that Acme can monetize the content that YOU create while actively preventing people from indexing it even though that may be your intention.

This means that in 2018 a company like Google COULD NOT get started because websites would just not allow them to access your content.

I believe that when most people post public content on the Internet they intend it to be public including accessed by other search engines crawlers, etc.

Now we're in a horrible situation where just a few companies essentially own the Internet.

This is why Google can't index Facebook content or Twitter content even though it's public - they can't access it.

meritt · on Nov 19, 2018

> Google can't index Facebook content or Twitter content even though it's public

They index plenty of their content [1][2]. What they don't index is content not explicitly marked as "Public". e.g. Facebook posts with visibility settings or protected tweets. FB, Twitter, and now LinkedIn have plenty of content that's not publicly accessible: that is, content which requires a logged-in account that explicitly agreed to TOS/EULA, but they have tons of publicly-accessible content, too. The latter is fair game.

That said, the Craigslist lawsuit is still bewildering to me. That content is explicitly public, does not require a login, and the only agreements are the automatic unenforceable browsewrap ones. The LinkedIn v. HiQ case is very similar to Craigslist v. 3Taps; however, the decisions are in opposite directions.

[1] https://www.google.com/search?q=site%3Afacebook.com

[2] https://www.google.com/search?q=site%3Atwitter.com

fjsolwmv · on Nov 20, 2018

What's bewildering?

Published does not mean uncopyrightable. A radio station broadcasts a song; that doesn't remove the song's copyright protection. Receiving a copy is not making a copy.

3Taps settled out of court; there was no decision.

7j · on Nov 20, 2018

I'm not sure the radiostation analogy holds. Radiostations can't claim copyright to the song.

bduerst · on Nov 19, 2018

Yep - I specifically remember using Padmapper (3Taps) around the time that Craigslist yanked their content.

Padmapper is just a much better interface for craigslist data. Even though craigslist pulled their data, craigslist didn't create a new interface to match or even bother to improve their experience, even though they have the user mindshare for posting this type of data.

Essentially it's as you said - innovation is being killed to preserve walled gardens like Craigslist, and users are the ones being hurt here.

kapitalx · on Nov 20, 2018

> craigslist didn't create a new interface to match or even bother to improve their experience,

That's not completely true. After lots of bashing online [1], craigslist started adding some features such as searches on maps[2]. Even though it's not as nice and easy to use as padmapper was, it would still be considered an improvement.

[1] - https://www.inc.com/abigail-tracy/craigslist-quietly-makes-c...

[2] - https://sfbay.craigslist.org/d/apts-housing-for-rent/search/...

alkonaut · on Nov 19, 2018

What part of the indexing isn’t allowed, is it the scraping or the publishing of the scraped material? Or something else such as a “use” of the material?

Example: Does it matter whether the data is ever stored? Say I make a listing notifier that pings people when a house is listed in their desired area. I’d need to crawl real estate listings, but I’d never store or republish any content from the site.

bigiain · on Nov 19, 2018

The hardline copyright view is that to "index" content from a website, you need to make a copy of that content in your indexing server, which is an action copyright law legally prohibits without an explicit exception. (The people who argue that somehow also seem to argue that publishing the webpage legally gives "explicit permission" for your web browser to make a copy, without also giving permission for your web scraping software to do so. Not sure how they get to _that_ position logically...)

alkonaut · on Nov 20, 2018

Yeah but to e.g. make an app that notifies people when a house is published in area X really doesn't need to index anything. I can walk through all the listings once per week and all that is in my database is the table of what user wants what notification. If the data on the site matches my notification table I send an email (or whatever) and after the weekly run is done I still have zero data copied or "indexed" from the target site.
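A minimal sketch of the notifier described above, assuming each listing is a dict with hypothetical `area`, `price`, and `url` fields and that subscriptions are the only thing kept in a database. The names `Subscription` and `matching_notifications` are made up for illustration. Listings are consumed one at a time from an iterator and nothing from the target site is written anywhere, although each listing still passes through memory while it is being checked.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Subscription:
    """The only persisted record: who wants to hear about what."""
    email: str
    area: str
    max_price: int


def matching_notifications(
    listings: Iterable[dict], subscriptions: Iterable[Subscription]
) -> Iterator[Tuple[str, str]]:
    """Yield (email, listing_url) pairs for every listing that matches a subscription.

    Each listing is inspected once and then dropped; nothing is stored.
    """
    subs = list(subscriptions)
    for listing in listings:
        for sub in subs:
            if listing["area"] == sub.area and listing["price"] <= sub.max_price:
                yield sub.email, listing["url"]
```

The weekly run would feed a crawler's output into `matching_notifications` and send one message per yielded pair. Whether this transient processing counts as "copying" is a legal question the code does not settle.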

jval43 · on Nov 20, 2018

Makes sense from a technical standpoint, but it's the legal people you'll need to convince.

jscissr · on Nov 20, 2018

Actually, you still have a copy of the data, in memory.

alkonaut · on Nov 20, 2018

yes. And I suspect it's not legal. Which is why I'm wondering what the actual definition is. Because it can't be that I "keep" or "index" the data, nor that I "republish" the data.

It's actually that I "read" the data, regardless of my intention and regardless of what I intend to do with it. But taken to the extreme there has to be something that makes it legal for me as a consumer to scrape ONE item from their data, while at the same time making it illegal for a company to scrape ALL their data in this way.

The worst situation to be in is one where it's "legal until it isn't" i.e., when you actually cause a problem (bandwidth, revenue) you will be sued - and until then you have no idea.

bigiain · on Nov 20, 2018

It's that you "copy" it. That's the "right" that copyright law extends to creators.

Technically (and it's complicated these days because copyright law was mostly written well before computers and the internet existed) any time you create a copy of someone else's creative work, you need an explicitly granted right to do so. You can _read_ a book, but you cannot write down all or part of what you just read. You can listen to a song, but you cannot perform all or part of that song (An Australian band called Men At Work lost a copyright case for a flute solo that contained 2 bars/11 notes of a 1930's folk tune "Kookaburra" whose copyright had been sold/bought in 2000.)

There are some explicit legal exceptions to requiring a license, most commonly people talk about "fair use", but that's again a "pre internet" body of law, and is extremely open to interpretation as to how it applies to digital copies in computers.

https://www.copyright.gov/fair-use/more-info.html

https://fairuse.stanford.edu/overview/fair-use/what-is-fair-...

hummingurban · on Nov 20, 2018

post 3taps, craigslist won't be able to stop scrapers.

https://www.eff.org/deeplinks/2018/04/dc-court-accessing-pub...

wildmusings · on Nov 20, 2018

That is about the CFAA, not copyright.

hummingurban · on Nov 20, 2018

how do you claim ownership of data you release publicly at no charge to the users? how do you enforce anyone who copies it by hand or programmatically, does the minimum work to derive some insight and then copyrighting that? You can't. Once it's on the internet, anything you release will simply be relinquishing control.

It's foolish to assume that the lawyers and the justice system will help you out. This is simply not a case of somebody pirating a movie and slapping on their own copyright.

CFAA was how craigslist won but that's been thrown out thanks to the EFF lawyers.

Going forward, there will be little to no recourse for "web scraping public data".

mc32 · on Nov 19, 2018

There are a few forums where I don’t expect nor would prefer indexing. They may be niche forums, etc. Forums like reddit, of course I expect to be indexed. So, basically, it depends.

jimktrains2 · on Nov 19, 2018

Isn't this what robots.txt was created to convey?
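For illustration, here is a small sketch of how a well-behaved crawler could honor robots.txt using Python's standard `urllib.robotparser`. The rules, the `ExampleBot` user agent, and the `forum.example` URLs are made up; a real crawler would fetch the site's actual robots.txt instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for a niche forum that does not want
# its private section crawled.
RULES = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A crawler would call `allowed(...)` before each request and skip any URL for which it returns `False`. The check is voluntary: nothing stops a scraper from ignoring it.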

mc32 · on Nov 19, 2018

Yes. But not all scrapers and indexers respect that? So it’s still a problem.

baroffoos · on Nov 20, 2018

Require a login/invite then

DoctorOetker · on Nov 20, 2018

I would like to know your opinion regarding the content problem that people prefer sharing their content on the established platform (simply for widest reach) which prevents new players from emerging (since they are intimidated from scraping said content, even if the original poster intended the widest reach).

Suppose there was an intermediary platform so that the user uploads his content/video once to the intermediary platform, and can then select which platforms may use the content under what conditions (i.e. perhaps for free, perhaps for a user-set minimal amount of remuneration, ...), or perhaps instead of selecting which platforms, selecting the rules to which a platform must adhere in order to distribute the content (so that new players can instantly start hosting a diverse set of content as long as they adhere to these rules). The user thus only needs to upload once.

Obviously it would require youtube etc, to provide API access for the intermediary so it can post the content without being able to hijack a user's account on youtube etc...

This would not be in their interest of course, so anyone attempting to create this intermediary upload service would face a never-ending cat and mouse game, somewhat reminiscent of youtube-dl...

Perhaps instead of silly link tax rules, it would be better for governments to force platforms above a certain size/usage to provide API access such that these intermediary upload services can flourish.

They might effectively become some kind of consumer/producer protection organizations, by proposing multiple usage standards (like my video MUST NOT have ads, or perhaps the opposite for another user, my video MUST be remunerated at x cents per y views, otherwise the platform may no longer distribute my content, ...) from which their users can select...

apart from "pure format" content like music, videos, ... as soon as there is more compound structured content, like a blog post with images, it would be hard to standardize the format.

then there is the problem of authentication, especially if a user uploads a video through the intermediary, and 1 year later a new platform wishes to distribute the content on its network too, and after another year the user wants to be active on this platform, there is already an account, and how does he get his login credentials?

In theory the intermediary could have a messagebox for each user, and the platform can send the credentials to the user in this messagebox, so that if interested a producer can decide to make use of his account on a new platform...

Edit: so while currently all the platforms act as both publisher and printer, this would separate the market into publishers (intermediary rights managers) and printers/distributors (big platforms like YT etc). Similar to how humans decided it was better to split up doctors and pharmacists to prevent de facto quackery of selling whatever you happen to have on hand, or cheaper deals for...

rossmachinery · on Nov 19, 2018

My company, Alan Ross Machinery, is the plaintiff in this case. Happy to answer questions to the extent legal counsel will permit it. I can tell you the case remains active, stay tuned...

meritt · on Nov 19, 2018

How was Machinio impacting your business? They appear to act as a metasearch engine and were scraping listings/imagery without permission. As you don't publicize the dealer contact info, presumably buyers still ended up on your site and you were still able to broker the deal?

My guess is by acting as a central authority for all used equipment deals (bootstrapped by scraping listings from many sources), they were able to use the existence of the listings to better convince future dealers to just list directly with them instead of your business? e.g. increased exposure for dealers, Machinio doesn't take a commission, so dealers just pay a flat rate to gain exposure?

At the end of the day, I have to assume the litigation was over them stealing dealers by disintermediating you? Because otherwise I'm guessing they'd just argue they were providing you free exposure.

Rjevski · on Nov 20, 2018

Just wondering, if you don't want the information to be public, then why make it public? Put your website behind a login form, after which you can enforce (meaningless) terms of service as well as real technical solutions to prevent abusive scraping like rate limits or captchas.

Seems like you're trying to have your cake and eat it too. You're presumably happy for Google to scrape your website (and collect your customers' data for their purposes like ad tracking), but when a smaller site does it you're not happy? Make up your mind.
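As a rough illustration of the "rate limits" idea mentioned above, here is a minimal token-bucket sketch. The class name `TokenBucket` and its parameters are assumptions for this example, not something any site in this thread is known to use. A real deployment would typically keep one bucket per client (for instance keyed by account or IP address).

```python
import time


class TokenBucket:
    """Allow short bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the behavior can be tested deterministically
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        """Return True and spend one token if the request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A server would call `allow()` once per incoming request and answer with an error (commonly HTTP 429) when it returns `False`.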

pavel_lishin · on Nov 19, 2018

It seems like y'all are sort of an ebay/craigslist for industrial machinery, right?

What benefit does scraping your site give someone else? If someone posts your listings on their site, how do they connect would-be buyers with the sellers, anyway?

rossmachinery · on Nov 19, 2018

Without products to list eBay has no business. So, there's a strong incentive for companies who want to be in the space to acquire listings through a variety of means, including some means they ought not to.

donaltroddyn · on Nov 19, 2018

I quickly browsed your site, and you don't seem to publish contact details for sellers, so back to Pavel's question: What does the site that scraped your listing do if someone wants to buy a scraped product? Send them to your site?

rossmachinery · on Nov 19, 2018

They seemed to do several things, but click through to our site was not one of them. They capture(ed) visitor data much of which I assume they retain for their purposes.

donaltroddyn · on Nov 19, 2018

Interesting - thanks for your answer.

Varcht · on Nov 19, 2018

Probably there are some differences but there are hundreds of sites doing this to eBay, why haven't they stopped it?

edoceo · on Nov 20, 2018

All the ones I've seen (domain listings) route me to eBay to close the deal, just a nicer/aggregated listing of items for sale across multiple sites

aw3c2 · on Nov 19, 2018

Isn't copyright implicit, not requiring any kind of notice to "exist"? I don't understand what those website footers matter at all.

dragonwriter · on Nov 19, 2018

> Isn't copyright implicit, not requiring any kind of notice to "exist"?

Yes, but notices matter for purposes beyond whether copyright exists, and therefore falsifying, removing, or modifying them also matters.

This wasn't about infringement (oddly enough) but about falsified or removed “copyright management information”, specifically the copyright notices.

Which seems odd, because if the original notices were valid, then I can't see why actual infringement wasn't alleged. OTOH, if there was no infringement, how could the thing presented have been the thing covered by the notices, such that either the removal or falsification claims had substance?

data_spy · on Nov 19, 2018

Were you on the Gary Vee podcast as a 4D's guest? Business sounds familiar

rossmachinery · on Nov 19, 2018

Nope. Sounds fun.

docker_up · on Nov 19, 2018

You really should delete the post and not answer questions because it could inadvertently jeopardize your court case.

untog · on Nov 19, 2018

> Happy to answer questions to the extent legal counsel will permit it.

They are literally checking with their lawyers about what they can and cannot answer. I think they're OK.

curiousgal · on Nov 19, 2018

This brings up a question that I've always had. There are plenty of companies that offer e-commerce "insights" by scraping all merchants' products and prices and then sell that data to a particular merchant. Is that legal?

stoic_heimdall · on Nov 19, 2018

Also interested in the answer to this question.

gammateam · on Nov 19, 2018

Misleading title: this is a Trial court case in the lowest Federal Court. Northern District of Illinois specifically. Nobody cares about trial court and this is mildly informative if we want to discuss the rationale anyway.

AznHisoka · on Nov 19, 2018

Whatever happened to the HiQ labs case? Did HiQ labs ultimately win and can continue scraping Linkedin?

igolden · on Nov 19, 2018

Was thinking the same thing, thanks for asking.

manurandon · on Nov 19, 2018

Yes, they won

perpetualpatzer · on Nov 19, 2018

Any idea where I can read the opinion? I know they won a preliminary injunction, which LinkedIn challenged in the 9th Circuit in March '18, but hadn't heard anything since then.

comex · on Nov 20, 2018

There is none; the appeal is still open. You can check the status of the case on PACER – at least if you don't mind paying a few cents for every request due to the judiciary being stuck in the 90's:

https://ecf.ca9.uscourts.gov/n/beam/servlet/TransportRoom?se...

(The case number is 17-16783.)

There was oral argument on March 15, and the 9th Circuit posts video recordings of all hearings on YouTube:

https://www.youtube.com/watch?v=tvLdJujOp8k

(I haven't watched it yet, so I don't know how it went.)

Since then, the only filings have been a few citations of supplemental authorities, the last one in June. If I'm not mistaken, the case is just waiting for the judge to write an opinion. According to the 9th Circuit's FAQ:

> 18. How long does it take from the time of argument to the time of decision?

> The Court has no time limit, but most cases are decided within 3 months to a year.

...It's currently been 8 months since the date of the argument, so hopefully that won't take too much longer.

AznHisoka · on Nov 20, 2018

That's what I thought... Most people assumed the verdict earlier this year was an automatic win, and they keep citing it. But it wasn't the final verdict, which is what I'm interested in. Hope it comes out soon.

sam0x17 · on Nov 19, 2018

So now if they add a specific copyright notice on the page that was getting scraped, the court might come back later and rule differently if scraping continues? Or am I misunderstanding.

elliekelly · on Nov 19, 2018

I think the issue was more that the plaintiff claimed a copyright on each page but the defendant had copied the photographs and descriptions. Having only read this one decision it sounds like the Judge had dismissed the case once without prejudice to allow the plaintiff to restate their claim to include copyright violations for the specific material that was copied (the photographs and descriptions) but the plaintiff failed to do so. The opinion then says that a photograph merely appearing on a site doesn't mean the website claims ownership of the photo. Since the plaintiff made no claims of ownership over the copied material they can't sustain a claim of copyright infringement.

phkahler · on Nov 19, 2018

>> So now if they add a specific copyright notice on the page that was getting scraped, the court might come back later and rule differently if scraping continues? Or am I misunderstanding.

IANAL but you are not misunderstanding. From all the rulings I've read, the courts tend to address "this case" particularly when they cite specifics. In this case the defendant used the argument that the copyright notice was not on the page the data was taken from and the court ruled in their favor. Nothing appears to be said about how it would have gone if such a notice were present because no other arguments were made (based on this article). I suspect there are other arguments to be made if such a notice were present (data/facts can not be copyrighted etc) but they didn't rule on any of that.

echelon · on Nov 19, 2018

Is this a strong ruling that establishes precedent and is unlikely to be overturned? (Forgive me for not having a great understanding of the legal world.)

Here are a couple of cases that I'm especially interested in:

1. Does this mean that it is legal to scrape "database"-type websites for statistics and provide them on your own website? Could one use this ruling and copy all of IMDB's film data? Repackage that data into a Creative Commons website? Or a better set of tools for casting agents?

2. What about social or community-curated websites? Could you mirror all Reddit comments (which used to be Creative Commons anyway) to a more dev-friendly site? Don't force a mobile app down people's throats? Make it ad-free, donation-supported like Wikipedia?

3. What about big media? Could you bootstrap a new video site by scraping all existing (or popular) YouTube videos? Provide a means for owners to "claim ownership" of their account on the new site? Then market it as YouTube "but grown up" (18+)?

4. Kind of getting off-topic, but could you build a new music service by temporarily ignoring copyright, copying pirated music, then pivot to something that does collect money (via ads or subscriptions) for the music rights holders? I'm thinking Spotify, but built for music connoisseurs. Rich APIs with tagging, smart playlists, etc.

How likely would any of these be to avoid lawsuits until they're big enough to hire a legal team? Are certain behaviors less legal than others?

I'd really appreciate feedback on this. (Thanks in advance!)

weinzierl · on Nov 19, 2018

> Could one use this ruling and copy all of IMDB's film data?

IMDb forbids scrapers in their conditions[1] but you can freely download their datasets[2].

Similar to what you described: omdbapi[3] is a third party API for the free IMDb data.

[1] https://www.imdb.com/conditions

[2] https://datasets.imdbws.com/

[3] http://www.omdbapi.com/
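For anyone who goes the dataset route instead of scraping, here is a rough Python sketch of reading one of those tab-separated files. The sample rows are made up for illustration, the column names are only the subset I am assuming for a title file (the real files have more columns), and `read_titles` is a hypothetical helper, not anything IMDb provides.

```python
import csv
import io

# Made-up rows in the tab-separated layout; "\N" marks a missing value.
SAMPLE = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\n"
    "tt9000001\tmovie\tExample Film\t2001\n"
    "tt9000002\tshort\tExample Short\t\\N\n"
)


def read_titles(handle):
    """Yield one dict per row, turning the "\\N" placeholder into None."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {key: (None if value == "\\N" else value) for key, value in row.items()}


titles = list(read_titles(io.StringIO(SAMPLE)))
```

For a downloaded file you would presumably pass `gzip.open(path, "rt", encoding="utf-8", newline="")` as the handle instead of the in-memory sample.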

hummingurban · on Nov 20, 2018

It's not legally binding and not the law. The TOS can forbid all they want.

https://www.eff.org/deeplinks/2018/04/dc-court-accessing-pub...

suggests there is little to no recourse for IMDb and the like. Craigslist was able to win their case against 3Taps, arguing that the scraping was putting a load on their servers (typical Craig Newmark bullshit) and that 3Taps continued scraping even after the IP ban, which is a violation of the Computer Fraud and Abuse Act or something like that. That is a draconian response, the likes of which was used against that guy who killed himself after he got caught scraping academic journals.

interknot · on Nov 19, 2018

Pushshift.io is pretty close to what you were talking about for Reddit comments; they make their archive available periodically at http://files.pushshift.io/reddit/

IIRC Pushshift is what powers sites like https://removeddit.com/

(to anyone reading this, please consider donating--this project is so cool)

cookiecaper · on Nov 20, 2018

Scenarios 1 and 2:

1. I did something very similar and got a legal threat from a Fortune 100. My attorneys had successfully litigated similar matters but told me in my case, the law was clear-cut in the F100's favor. My attorneys attempted to negotiate a license agreement that would allow me to continue, but it fell on deaf ears and I had to shut down my business.

2. Like most things on the internet, sure, you could. But someone could stop you with legal force if they were inclined to do so.

Even though cases like Feist supposedly say that facts pulled from fact-collections aren't subject to copyright, that doesn't really work out on the internet for a couple of reasons.

The biggest reason is the CFAA. Anyone has the right to tell you to stop talking to their servers at any time. Such notice doesn't even have to be explicit; in Craigslist v. 3Taps, the judge indicated that IP blocks should've been sufficient notice that 3Taps's presence was unwelcome, but companies frequently go even farther and say that the fine print buried in the Terms of Service banning "automated access", "spidering", "crawling", or similar is sufficient notice. Judges frequently accept that argument.

Failure to heed these notices very likely constitutes "exceeding authorized access" under the CFAA, which is both a tort and a crime (meaning, you can both be sued and be put in jail for it). Companies use this to bludgeon upstarts all the time.

The other thing here is:

a) very few facts on the internet are really alone; interesting data is almost always encased in HTML or some other copyrightable superstructure.

b) a legal philosophy called "the RAM Copy Doctrine", which is very well established at this point, means that storing any copyrightable work in system memory for any amount of time, even if it's just long enough to extract a non-copyrightable fact and then throw the surrounding copyrightable content away, meets the legal definition of a copy and is thus potentially infringing. This effectively allows rightsholders to instantly double any infringement claim, and attorneys actually have tried to posit that each instance of a copyrighted work getting loaded into memory for e.g. network transmission to another source should count as an independent act of infringement.

The 9th Circuit ruled that memory copies meet the Copyright Act's definition for fixed, non-transitory copies in MAI v. PEAK (1993) and that has been (mis)applied to most digital copyright controversies since. This is the effective equivalent of stating that the reflection of an image from the human retina onto the optic nerve is a separate potentially-infringing copy.

Scraping is most certainly still a legally dubious business. You have to take the Google route and get too big to fail before you'll start to see inconsistent outcomes like Perfect 10 v. Amazon, where the judges effectively gave Google a pass because they were scared of the widespread ramifications of shutting down Image Search, despite the clear legal reasoning that would've mandated this.

IANAL

gnopgnip · on Nov 20, 2018

The RAM copy doctrine is not current case law since Cartoon Network v. CSC Holdings.

cookiecaper · on Nov 21, 2018

Good reference. I've definitely seen cases where it was referenced since then (2008), but this is good support for a SCOTUS case -- the case you reference appears to have occurred in the Second Circuit whereas MAI v. Peak is Ninth Circuit. The competing precedents will probably need settlement at some point (assuming people don't just wise up and surrender the RAM Copy doctrine organically).

chime · on Nov 19, 2018

Grooveshark was #4 on your list. Didn’t work out for them.

dotdi · on Nov 19, 2018

Can somebody translate?

elliekelly · on Nov 19, 2018

This case was dismissed on procedural grounds. The opinion's analysis applying Dastar is just the Judge saying how they would have applied the standard to this set of facts but it doesn't actually set any precedent.

randomerr · on Nov 19, 2018

The plaintiff had a bad copyright notice. The notice would need to be changed to be specific about the data and not 'everything under the sun' on this website. This would be like someone trying to copyright the word 'crayon' on his website. Crayon in most countries means 'writing device', so you can't copyright the word 'crayon' on your website. But you could copyright 'Crayola Crayons' since that is specific.
In this case the copyright was 'This is our website. We don't have anything specific about our data.' So no specified claims to sales or specific claims to sales data puts everything in the public domain as long as the scraper uses the data for 'informational purposes' and does not make a copyright claim to the data itself.

FYI: a more specific definition for crayon is 'a writing device where the writing material wears off to leave a meaningful mark.' Examples are: crayons (duh), pencils, charcoal sticks, etc. Since pens and markers have ink or pigment reserves and do not wear down they are not considered crayons.

rz2k · on Nov 19, 2018

Though you can't copyright "Crayola Crayon", you could trademark the name.

However, you could copyright an article describing the merits of a specific type of writing instrument, regardless of who owns the trademark.

elliekelly · on Nov 19, 2018

This is not at all what the case says. You're either not an attorney or didn't read the opinion. I suspect both.

kylnew · on Nov 19, 2018

How might this apply to content on a personal website? What I'm gathering from the article/discussion is that a copyright message in my footer may not be enough to copyright all my works. Therefore, I should generate a small copyright notice at the end of each blog post as well that more explicitly claims copyright over the article. Do I understand that correctly or can someone clarify? (Thanks in advance)

PuffinBlue · on Nov 19, 2018

I wasn't sure it would be enough on my own site so I put a notice in the footer notifying that some rights are reserved and then linking to a page that specifically breaks down content by section and the license terms applied:

https://www.josharcher.uk/copyright/

Perhaps as you mention it's necessary to publish more explicit notices alongside content. But perhaps not. This ruling seems to relate to data, possibly distinct from creative works.

baroffoos · on Nov 20, 2018

I have noticed that automated bots always scrape and repost my blog posts on a variety of websites. I don't really care but something to keep in mind. No copyright notice will stop a script that can't understand them.

kylnew · on Nov 20, 2018

You’re right but should it come down to a lawsuit your butt is covered. You’re probably never going to go after someone unless they reach financial success with it anyway. Also, what about disincentivizing anyone who might know how to exploit content not properly copyrighted online?

It’s probably a lot like the piracy problem — Don’t spend a lot of time fighting it because those people don’t pay, but secure the legal rights to your works well enough to fight anyone with more sinister plans than making just a copy or two.

nprateem · on Nov 20, 2018

I've always wondered whether my famous "number defence" would work in copyright cases. It argues that no digital file can be copyrighted. It goes like this:

Is the number 1 copyrightable? No, because it exists and was "discovered".

Is the number 2 copyrightable? No, because it exists and was "discovered".

Is it reasonable to assume therefore that this very large number X, that represents this disputed file, cannot in fact be copyrighted because it already existed and was in fact just "discovered"?

If it's argued that there was some effort to "discover" this number, then I'll write a generator that produces each number between -1 million and +1 million and claim "copyright" over each of them. After all, regardless of the process for arriving at a particular mp4 of a movie (people just dancing round on a set, running the output through various software, etc.), the final output is simply a number. Just because someone went to a lot of effort shouldn't prevent me from going through some effort to write files containing each number between -1 million and +1 million and charging royalties for anyone to use them.

In fact, this raises the interesting point that anything that can be represented as a number already fundamentally exists in the range 0 to +infinity. It's just down to us to discover them. Think about that for a moment: Somewhere out there in the range 0-infinity is a good Star Wars Episode 7 just waiting to be discovered.

We could therefore write algorithms that search all the numbers in some search space (e.g. whatever 0-3mb is in decimal), scan them to see if they're valid files, e.g. mp3s and then run them through AI to see how they compare to known music, etc.
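To make the "a file is just a number" idea concrete, here is a minimal Python sketch. The function names are made up for illustration. It reads a byte string as one big integer, converts it back, and estimates how many decimal digits it takes just to write down the count of candidates for a given file size.

```python
import math


def bytes_to_number(data: bytes) -> int:
    """Interpret a byte string as one big-endian unsigned integer."""
    return int.from_bytes(data, "big")


def number_to_bytes(n: int, length: int) -> bytes:
    """Turn the integer back into a byte string of the given length."""
    return n.to_bytes(length, "big")


def search_space_digits(size_in_bytes: int) -> int:
    """Decimal digits in 256 ** size_in_bytes, the count of files of exactly that size."""
    # Work with logarithms so the huge count never has to be built or printed.
    return math.floor(size_in_bytes * 8 * math.log10(2)) + 1


digits_for_3mb = search_space_digits(3 * 1024 * 1024)
```

Note that the length has to be carried alongside the number, since leading zero bytes vanish when a file is read as an integer. The digit count for the 3 MB case also gives a feel for how long "discarding the dross" would take.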

Thus, a new tech "Big Random" is born... I thank you :-)

SwellJoe · on Nov 20, 2018

This is an absurd argument. Software (and written work) is not generated by random number generators...even though it theoretically could be.

If you can randomly generate works that have market value (copyright is intended to encourage creation of works with value, not merely a sequence of random words or bits), then we can talk about whether they're individually copyrightable (maybe the generator itself is the only copyrightable work in the picture, I dunno).

But, no one is going to take an argument seriously that because the number 1 (or the letter "a") cannot be copyrighted then no sequence of numbers or letters can be copyrighted.

nprateem · on Nov 20, 2018

I can generate works with market value. In pseudocode:

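i = 0
while True:  # loops forever
    # write the number i out as the raw bytes of the file "<i>.bin"
    with open("%s.bin" % i, "wb") as f:
        f.write(i.to_bytes((i.bit_length() + 7) // 8 or 1, "big"))
    i += 1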

Of course the effort is in discarding the dross.

How are we to determine "market value"? And let's not forget, the copyright crowd don't want to claim rights over just 1 specific number, but any number that happens to render (when run through a movie player) anything that resembles e.g. Star Wars. I don't think this argument is as absurd as it first appears...

SwellJoe · on Nov 21, 2018

"And let's not forget, the copyright crowd don't want to claim rights over just 1 specific number, but any number that happens to render (when run through a movie player) anything that resembles e.g. Star Wars. I don't think this argument is as absurd as it first appears..."

It makes it more absurd, as it indicates clearly that it's not a random sequence of numbers that is being protected by copyright, but a specific, recognizable, creative work. If I watch an mpeg of Star Wars and then watch a laser disc version of Star Wars, I will recognize them as the same work (well, ignoring Lucas' retconning nonsense and CGI shitshow). Most humans would, including most judges and juries in a copyright case.

"The copyright crowd" includes the leadership (and presumably the populace) of nearly every developed nation on earth, so you're arguing against a pretty solid majority (which is fine, I have some unpopular ideas that I hold pretty strongly...the majority seems to love and support war and killing more than I ever will).

Anyway, I'm all for reasonable copyright terms (and the US has absurd and abusive copyright law that punishes and inhibits new creators at the behest of old corporations and billionaires), but it's just not a sound basis to argue that because one can randomly generate any creative work given infinite monkeys and infinite time that no creative work should be able to be copyrighted. It says that there is no difference between a creative work, like Star Wars, and a random series of bits, because Star Wars can be recorded as a (non-random) series of bits.

edoo · on Nov 20, 2018

I made good money back in the day scraping Fortune 100 companies for Fortune 500 companies. Everyone does it for market intelligence. It is rarely made public though.

anigbrowl · on Nov 20, 2018

I hope this leads to improvements in scraping software. I find it perplexing there aren't more tools to help strip away the cruft from structured data.
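As a rough idea of what "stripping away the cruft" looks like with nothing but the Python standard library, here is a small sketch that pulls the cell text out of HTML tables and ignores everything else on the page. `TableCells` and `extract_rows` are names I made up, and it assumes simple, non-nested tables.

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect the text of every <td>/<th> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Text outside table cells (ads, navigation, etc.) is dropped.
        if self._in_cell:
            self.rows[-1][-1] += data


def extract_rows(html):
    """Return the table contents as a list of rows of stripped cell strings."""
    parser = TableCells()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]
```

Real pages are messier than this (nested tables, data rendered by JavaScript), which is probably part of why general-purpose tools are hard to build.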

prirun · on Nov 21, 2018

They should have sued on trespassing grounds. That's what eBay did to prevent scrapers. (eBay v. Bidder's Edge)

wrycoder · on Nov 19, 2018

Prior law asserts raw data can't be copyrighted. The particular display of the data can be.

gymshoes · on Nov 20, 2018

Can anyone please ELI5?