Matt Cutts is looking for scraper sites

spindritf · on Feb 28, 2014

It's a funny quip but it's getting more attention than the important piece of news it highlights. Google is finally doing something about scraping sites doing better in search results than original creators. Good.

Many people don't write for money, to put ads on their website, or as part of some "content marketing" campaign. All they want is a little recognition. A boost in positioning on the SERP means we will be getting useful stuff at no cost.

And there are genuine replies there. Ryan Jones[1] even got the scrapers to confess their sins[2].

[1] https://twitter.com/RyanJones/status/439123533349015553

[2] https://www.google.com/search?q=%20%22istwfn%22+%22stole+thi...

leephillips · on Feb 28, 2014

"Google is finally doing something about scraping"

I hope this is genuine and not a disingenuous diversion on Google's part. The fact that the Huffington Post still ranks very high for trendy searches makes me wonder.

As usual, follow the money: the scraping sites exist to make money, often through Google's advertising; Google gets a cut. The original content is often on sites with no advertising or real traffic, from which Google profits nothing.

EDIT: To expand on this: Google-search for any hot topic in the news, say the name of some misbehaving pop star. See the HuffPo result near the top of the page. Look down to see several results from real newspapers. This is where the original content can be found. Most of these newspapers are about to die because they're not making any money. HuffPo investors are filthy rich because they're gaming the search engines to profit from copy-and-paste.

ANOTHER EDIT: I apologize for my characterization of the Huffington Post. I was describing, accurately, the nature of that site as it was the last time I visited it some time before its purchase by AOL three years ago. The HuffPo I see today is utterly transformed. They use wire services, do plenty of their own reporting, and many of the links on the front page go directly to other news sites. They are no longer a copy-and-paste site.

grey-area · on Feb 28, 2014

Google-search for any hot topic in the news, say the name of some misbehaving pop star. See the HuffPo result near the top of the page. Look down to see several results from real newspapers.

Many newspapers get a lot of their content from syndication services like Reuters. You may be seeing similar content because lazy editorial assistants just copied out a reuters story verbatim, slapped a pic on it and put it up at multiple organisations, not because HuffPo is scraping other sites. Do you have an example of this sort of thing you can point to? It'd be interesting to trace the origin of the content.

acjohnson55 · on Feb 28, 2014

Huffington Post isn't a scraper site. Aside from the original content they produce, they republish blog posts with permission from the authors. If you have an example of Huffington Post literally cut-and-pasting content from someone without attribution, please share.

I also assume that by "HuffPo investors" you mean AOL? Huffington Post is a fully owned subsidiary.

(Disclosure: I consult for Huffington Post)

leephillips · on Feb 28, 2014

You are right. Please see my second edit in my comment.

perbu · on Feb 28, 2014

That would be a pretty blatant copyright violation. Can you provide an example to substantiate the claim?

thehal84 · on Feb 28, 2014

I covered this a few months back while it all went down with the Verge and HuffPo and how our social search engine algorithm accounted for this while Google did not.

http://theenginuity.com/blog/how-a-copied-excerpt-of-a-story...

AJ007 · on Feb 28, 2014

You are correct, and the parent post before you is also correct.

Google's algorithm put a great deal of value on domain names, which provided a strong incentive for owners of a strong domain name to pump out low quality content. Low quality content can be the example you give, it can be "re-authoring" someone elses article, or is can be blatant copy and paste which is generally avoided due to the obviousness.

When a media property, such as the Huffington Post, pumps out this volume of low quality content the advertising revenue can subsidize the cost of paying for original journalist content.

On another note, take a look at The Daily Mail. They pump out timely news pieces so quickly that they are covered in typos and sometimes can't even keep left and right straight in photo captions.

untog · on Feb 28, 2014

There's a big difference between writing an article based on another article you read, and web scrapers.

Not that I particularly feel like defending the Huffington Post, but they're not a web scraper.

BrownBuffalo · on Feb 28, 2014

This is not Google activing on the community's behalf. It's Google doing a little CYA. It's one of those situations where Google's PR is trying to quell problems that dings their bottom line of advertising revenue. This was a mirror of the FB "official" advertising vs. "bot" advertising through things like Fiver.com. Still a sham on both sides rather than really make a big stink about it.

ezequiel-garzon · on Feb 28, 2014

Is their copy and paste verbatim?

triplepoint217 · on Feb 28, 2014

HuffPo definitely has original reporters, my friend is one of them:

http://www.huffingtonpost.com/betsy-isaacson/

She writes good articles for the general public about tech in general and things like net neutrality and aaron swartz

pinakothek · on Feb 28, 2014

Here's hoping they get it right, because the way it stands now, building a good aggregator is very tricky. The "this is why we can't have nice things" attitude is really putting a damper on valuable developments, and it's very close to stifling competition. Why is it anyone's problem that their algo is not smart enough to see potential value in new kinds of services? Even human vs. software curation should be irrelevant here.

loceng · on Feb 28, 2014

If this is the case, depending on how deep they follow this philosophy, it's important because the people or places that actually gather the content or lead to the content creation have invested and are investing more resources than those who simply scrape the data - and they the opportunity to recover costs, otherwise you damage that positive system's ability to continue on their successful path. That matters if you care how quickly we reach a better and better world for everyone. The source of where content is generated (in form of data or other) is also a strong signal of value, and that seems to be a leading metric of what Google wants to offer - the highest value search results.

robomartin · on Feb 28, 2014

I've always wondered about news organizations. The vast majority of them have websites full of regurgitated content. They don't usually discover and report their own news, they feed off Associated Press, Reuters and often times individuals posting videos on youtube or articles in their blogs. It would be nice to see them not dominate news on the internet by simply engaging in re-writing what someone else originated.

Is it content curation? Don't know. Everyone is reporting pretty much exactly the same things. They can't all be the original source. Who gets the juice?

pointillistic · on Feb 28, 2014

Scrapping is only possible because the value of original content creation is 0% and the value of the toll collecting aggregation is 100% via ads. This is a structural problem in the universe Goggle created, not some anomaly that can be handled by ranking.

psykovsky · on Feb 28, 2014

Yes, because only Google can earn advertising money by scrapping sites.

debacle · on Feb 28, 2014

It is highlighting the more important aspect - Google is the largest scraping site in the world.

VikingCoder · on Feb 28, 2014

Scrapers lift the full content, wholesale, without attribution.

You may as will just show http://images.google.com and complain that it's scraping. Or http://news.google.com.

In general, do you think Wikipedia gets more traffic because Google exists, or do you think Google gets more traffic because Wikipedia exists? Meaning, which affect is larger? I'm pretty sure the answer to this is obvious.

And if more scrapers donated millions to the site they scrape from, the world would be a much better place.

http://wikimediafoundation.org/wiki/Press_releases/Wikimedia...

josefresco · on Feb 28, 2014

One man's "scrapper" is another man's "aggregator".

How do you think Google would view my site if I wrapped Wikipedia's content, with back link and ran my own ads alongside that content? I would imagine not very positively.

Also, is it okay that a bigger entity scrapes my content just because they send me traffic? You might not want to bite the hand that feeds you, but it still doesn't make it right.

ds9 · on Feb 28, 2014

Google does not reproduce whole articles, only short excerpts to help searchers decide whether it's relevant to what they're looking for - and with clear indication of the source and in a context where it's understood that Google is showing the blurb only to pointing to the source where it was found.

This is technically scraping but it's hardly comparable to the bottom-feeders that plagiarize for money. (Edit: according to 'pud' on this page, Google uses a Wikipedia index so it's not scraping, but it is in the case of other sites that Google indexes.)

And yes, it's OK both legally and ethically if you do the same to Wikipedia - like Google that is, just for indexing purposes and not using whole articles.

Silhouette · on Feb 28, 2014

Google does not reproduce whole articles, only short excerpts to help searchers decide whether it's relevant to what they're looking for

And what about, for example, Google's image search tool, where the image itself might be what their user is searching for, and where Google controversially changed their system a little while ago to show full-size images in-SERP and de-emphasize forwarding search users to the original source? Or Google Cache, if it's reproducing material that has since been taken down deliberately from the original source?

To add insult to injury, some Google services still appear to rely on the original source's bandwidth to serve things like images (not to mention avoiding a certain legal argument about copyright infringement), thus violating the basic principle of netiquette that has been good manners ever since people actually used the word netiquette that you don't hotlink other people's stuff on your site.

josefresco · on Feb 28, 2014

You're comparing what Google does to another extreme when you say things like "bottom-feeders that plagiarize for money"

Surely you don't believe that all "scapers" are bottom feeders? It's like saying every criminal is a murderer. There's a whole bunch of grey area in between, and this is where the criticism of Google's harsh penalties is valid.

AJ007 · on Feb 28, 2014

You are at least partially incorrect.

Last year Google was testing reproducing entire Wikipedia articles within their site for their mobile site. You could read the full article within going to Wikipedia (allowed by Creative Commons, of course.) Between that and what they did with Google Images, I would say this reveals intention and is the direction web publishers should expect Google to be headed in.

In order for Google to continue to meet their growth targets they must increase the percentage of outgoing click from free to paid.

ezequiel-garzon · on Feb 28, 2014

The de facto standard robots.txt is pretty likely to be respected by Google, so it's fairly easy to stop their scraping your site. Yes, it is opt-out, bit I'd expect it to be.

It may be quite frustrating for an upstart to be denied access while Google is explicitly allowed, but that's another matter.

zone411 · on Feb 28, 2014

Wikipedia's top rankings are actually a big problem. I know of a site that was the first to put up high-quality reference-type content on the Web and for a while getting reasonable traffic from Google. Wikipedia's editors copied that content into thousands of articles in various ways. Thousands with attribution or copying just the facts and thousands without and copying more than just the facts.

This original site is now getting so little traffic from Google that more people visit it from the trickle of these bottom-of-the-page Wikipedia links than from Google itself. Its traffic was also badly hurt by Google's Panda algorithm, which I think clearly proves how flawed it is since this algorithm was supposed to do the exact opposite.

Because of this situation, if somebody thinks of spending money to create high quality reference-type content, I would strongly advise against it. You have no chance vs. Wikipedia's poorly-written articles repurposing your content and Google's flawed algorithms.

lisper · on Feb 28, 2014

It seems a bit odd for you to be so cagey about the identity of this "original site" while at the same time lamenting that they aren't getting the traffic they deserve. Why don't you tell us who they are?

zone411 · on Feb 28, 2014

It's because I don't speak for the owners of the site and I'd rather make sure they don't mind me putting it out there like this. I could let you know privately, if you'd like to check my story for yourself, though.

lisper · on March 1, 2014

Why on earth would they mind?

zone411 · on March 1, 2014

I'm not sure if they do mind. I do know that their relationship with Google is important to them when it comes to their much larger and more successful projects and that this site has been mostly left behind, so they may not want to bring it up in the context of this Hacker News post, even in the unlikely case that it resulted in this site getting its traffic back. Why not just email me and I'll show you a simple content site with minimal traffic, not using any black or gray-hat SEO tactics, with high-quality, original (to the Web) content, referenced in thousands of Wikipedia articles and you can decide for yourself if my post was truthful.

dennisgorelik · on March 2, 2014

Could you get permission from the owners and publish it here?

danielbarla · on Feb 28, 2014

That's one way of looking at it, on the other hand, they link to the original URL, passing traffic back to the original source. Most "scraper" sites take the content, wrap it in their own similar outer layer, and try to take ad revenue. E.g. I've seen my own StackOverflow answers copied, word for word, to a scraper site and presented under a made-up name.

dangrossman · on Feb 28, 2014

StackOverflow actually allows this; all their data is Creative Commons licensed, and they publish the full database dump on the Internet Archive.

https://archive.org/details/stackexchange

jbinto · on Feb 28, 2014

Do the terms of the license allow for this kind of abuse?

Just because something is CC doesn't mean you can do whatever you want with it.

dangrossman · on Feb 28, 2014

Yes, they do; it's not abuse when you're given explicit permission. CC BY-SA means you can do whatever you want with it as long as you attribute the source as specified.

leephillips · on Feb 28, 2014

"as long as you attribute the source"

danielbarla said that they presented the material under a false name; this goes beyond copying and becomes plagiarism, which I can't imagine is an intended result of the CC license.

aroch · on Feb 28, 2014

Is the source 'User X' or 'StackOverflow'? When you reference CC BY-SA code you don't reference the people who, say, checked it into git but rather the whole repo.

Flimm · on Feb 28, 2014

CC BY-SA is short for Creative Commons Attribution Share-Alike. BY means you must attribute, and SA means you must license any distributed derivative works under the same license (copyleft). Attribution on its own is not enough.

grey-area · on Feb 28, 2014

No, attribution is required.

jliptzin · on Feb 28, 2014

Interesting, from the file sizes you can quickly gauge the relative popularity of each subject.

tobehonest · on Feb 28, 2014

By having a tl;dr about the actual Wikipedia page, there is no need for the user to click on the link. Following what you're saying, Google as wrapped it in their own layer, and trying to take ad revenue.

smoyer · on Feb 28, 2014

Actually, I find that having a tl;dr will rarely answer the question(s) I have on a topic, but it will commonly show me whether I've found the right wikipedia page. I usually either click-through or refine my search.

bushido · on Feb 28, 2014

They don't actually link to the wikipedia URL. They mask a link that leads to another Google page "/url?sa=t&rct=j&q=&...." which in turn responds with a 200 OK page that redirects to Wikipedia.

Sure it passes the keywords etc. But this likely reduces the number of people visiting Wikipedia, while increasing Google's ad revenues, if anyone but Google did this they'd be potential blacklisted by Google.

VikingCoder · on Feb 28, 2014

Actually, they do link to the wikipedia URL.

href="http://en.wikipedia.org/wiki/Scraper_site" appears directly in the source code of that web page.

It also has an onmousedown handler that rewrites the URL to point at Google, so they can tell which link you clicked, to improve their ranking system. And Google works very closely with sites to make sure the sites know how to understand the referrals.

jpeterson · on Feb 28, 2014

Relax, it's a joke.

MWil · on Feb 28, 2014

If Google only needs to visit Wikipedia's "scraper" page once a day or less, but serves it out to others with attribution, isn't that helping Wikipedia by lowering traffic COSTS?

baldfat · on Feb 28, 2014

BUT it gives full attribution to http://en.wikipedia.org/wiki/Scraper_site???

DanBC · on Feb 28, 2014

That is what paren comment says.

xuki · on Feb 28, 2014

This is pretty funny https://twitter.com/danbarker/status/439125570115223552

nmeofthestate · on Feb 28, 2014

Hah yep. See also this: https://news.ycombinator.com/item?id=7318203

nissehulth · on Feb 28, 2014

ARGH! INFINITE RECURSION!

mkr-hn · on Feb 28, 2014

Every search engine does this.

jjoonathan · on Feb 28, 2014

Bah, what would I possibly need with a scraped definition that

1) Hasn't been chunked into 20 pieces of varying grammatical structure which are automatically matched to corresponding questions

2) Hasn't been subsequently pasted over a slideshow of completely irrelevant stock photos in bold, white font

3) Isn't accompanied by a grid of ~30 vaguely related questions helpfully linked to similar pages and tastefully decorated with more irrelevant stock photos

4) Only occupies ~1.5 rather than 3 or 4 of the front page search results

5) Contains only closely related textual ads rather than a melange of casino, fast food, and online college banners

6) Has fewer than 25 trustworthy stock faces smiling back at me from any given scroll position

If this is the best google can do then I don't think wiki.answers.com has anything to fear.

------------

Seriously, how the hell does wiki.answers.com manage to pollute half of the searches I make with their algorithmically generated garbage (multiple times, at that)?! What kind of SEO catapulted them to the top despite 0 viewer retention and what surely must be about 0 reputable backlinks? How haven't they been sent to the 1000th page with manual penalties already? They show up before wikipedia itself, for crying out loud!

Google, if you aren't going to let users maintain a manual blacklist, you need to be on top of this kind of thing. It's seriously degrading my search experience and I suspect I'm not alone. This kind of inattention is the type of thing that can push even the most inattentive users to change default search engines.

dangrossman · on Feb 28, 2014

The backlinks and SEO were gained before they did that to the site. Turning every Q&A they used to host as simple text into a terrible 30-click slideshow was a pretty recent change.

robryk · on Feb 28, 2014

If you use Chrome, there is a "Personal Blacklist" extension that does essentially what the manual blacklist used to do.

6cxs2hd6 · on Feb 28, 2014

"If you use Chrome" is Internet Explorer ActiveX controls redux

pud · on Feb 28, 2014

Wikipedia's database is public and used by Google with permission. You can probably use it for your projects, too.

So this is neither scraping, nor against the rules.

Here are dumps in SQL and XML format:

http://dumps.wikimedia.org/enwiki/

Ps- Yes the original post was meant to be funny and it was; I do have a sense of humor. :)

solve · on Feb 28, 2014

He's talking about outranking the true original source of the content in search results. You most certainly cannot create your own site that consists only of excerpts from Wikipedia, if you wish to remain on Google's search results. Copyrights are irrelevant to this.

What's bad though, is that Google isn't just lowering the rankings of non-original content pages now (including any kind of legitimate curation sites.) They're marking the entire domains of new curation sites as "pure spam" and de-listing them from Google entirely, and punishing anyone who's linked to them.

This is having the effect of sending a clear message to developers -- stay far away from Google's territory of recommending third party content to people, no matter how you do it.

pjc50 · on Feb 28, 2014

Could you show an example of such a legitimate new curation site please?

dreamfactory2 · on Feb 28, 2014

Not new but a legitimate curation site - http://hypem.com/

_wmd · on Feb 28, 2014

Cue damage control explaining the indiscernibly subtle difference between what Google does and what these evil, spammy scraper sites are doing

mkr-hn · on Feb 28, 2014

The difference isn't subtle.

There's this site called "News360" that sends a lot of traffic to my site every time I post something. It copies the post in full. Apparently it's a popular app for iPhone and Google Play. This is an aggregator.

Google copies my site so it can send people to my writing. This is a search engine.

Then there's the legion of sites that copy my stuff and send no traffic even though they link back. Most of these are scrapers, meaning they're adswill garbage dumps that get no traffic after recent algorithm updates by Google, but some are attempts to build new aggregators like Huffington Post or that News360 thing.

The scrapers are a nuisance, but don't harm me in any way. Google is free, relevant traffic. Aggregators find an audience and provide useful content to them with credit, probably using the RSS feed I publish for that purpose.

level09 · on Feb 28, 2014

Google is taking Cognitive Dissonance to a new level: It's okay for them to scrape every single site, download its content and images, and cache it on their servers, and run their AD platform on top of it. but that's not enough, they would still like to impose their rules and punish people who do the same thing.

eli · on Feb 28, 2014

I'll concede that there's probably a middleground that's kinga gray, but are you really going to defend bona fide scraper sites? Like ones that simply grab all the text off some other site and repost it, adding no value? Google is obviously adding value by providing snippets from Wikipedia.

cruise02 · on Feb 28, 2014

"Cognitive dissonance" doesn't mean the same thing as "hypocrisy."

blumkvist · on Feb 28, 2014

You can tell google not to download your site with your robots.txt file.

Also google can impose whatever rules they fancy, because it's their own site. Do you have a website? If I find some that you govern it with some rule that I don't like, should I rage on forums about it? Should you care if I rage about it?

fear91 · on Feb 28, 2014

Google seems less and less connected to reality the bigger they grow.

It's a shame that the search engine market share isn't split evenly by several different engines. I think it would be beneficent both to the users and website owners. Right now everyone tries to court Google and they seem to do whatever the fuck they want.

jerf · on Feb 28, 2014

It's also worth remembering they're under continuous, distributed assault by human-intelligent agents (at least to a first approximation) trying to game them specifically. The miracle is that Google works at all.

smoyer · on Feb 28, 2014

There are lots of places where Google decides to "help" me, but sometimes I just want search results. Other times, I actually like getting the curated content (e.g. search for "delta 3810"). Is there a way to disable this?

EDIT: I should also note that I'm one of those who switched over to DuckDuckGo for privacy reasons, so I don't see these results as often now.

7952 · on Feb 28, 2014

The content they do provide is often so bad it is almost embarrassing. Search for "Russia" and you get a completely useless map and a list of random facts. It may be useful to a child researching geography but for me it is just annoying.

I want content that is curated by people who actually understand the subject. I would pay for a search engine designed by someone who understands my industry. The Google algorithm only manages to grab at the low hanging fruit. I am a professional working on real stuff, I want something better than coffee shop suggestions.

josefresco · on Feb 28, 2014

I'm curious, what should they show you if you type on "Russia" the term by itself seems pretty generic and open ended.

"I would pay for a search engine designed by someone who understands my industry"

What industry is that? And how would Google guess or know your industry unless you tell them?

DanBC · on Feb 28, 2014

I'm experimenting with paying someone to do a bit of research for me.

To give some idea I've asked for a list of URLs to documents covering best current practice for suicide prevention in Gloucestershire and Herefordshire; to include national level NHS and NICE guidance, DoH guidance, anything from Gloucestershire and Herefordshire, and anything recognised as excellent from anywhere else in the country. If possible I also want a list of protocols used in schools, care homes, etc.

It's probably something you could risk on MTurk. Perhaps Bountify.com could expand to this kind of simple websearching.

Obvious drawbacks include delay between starting the search and getting the results, and cost, and having to trust some random person to not miss stuff.

I don't know if there's anything similar to "clippings services" either where you'd provide them with list of types of stories you'd want, and they'd read all the newspapers and clip any relevant stories and post them to you.

mattstreet · on Feb 28, 2014

I think most people would be a little unsettled and at least occasionally annoyed if when they googled something as broad as "Russia" it gave them all very specific to their industry.

UweSchmidt · on Feb 28, 2014

Try www.startpage.com.

An option to disable the personalization of search and going back to seeing the top 10 results for a search term that everyone sees would go against the core strategy that Google and others pursue these days.

The idea is to gain information about you and give you personalized advertisement and services. This has been criticized with the term Filter Bubble [1]. Consider your phrasing "I just want search results", similarly the terms "have you googled it" or "let me google that for you".

Quaint.

https://en.wikipedia.org/wiki/Filter_bubble

basisword · on Feb 28, 2014

I'd love to see a response from Matt to that. If they think the Wikipedia article is most important and they will scrape it and put it to the top why not just put the wikipedia article as the top link and leave out the Google box.

300bps · on Feb 28, 2014

I wonder how Google chooses which Wikipedia articles they scrape and which ones they don't.

In testing, they definitely don't seem to scrape every article:

http://i.imgur.com/ujDqZhB.png

danso · on Feb 28, 2014

This is a good question...I've long since surmised that Google has a set of heuristics for every site that has an API that allows for easy domain-specific ranking. With Wikipedia, you have number of page edits, frequency of page edits, and (to an extent) quality of recent page edits. StackOverflow provides an even easier metric for what's considered high quality, and Google appears to apply its own layer on top of that (and in my non-scientific perception, looking something up by Google is almost always more fruitful on the first search than by going directly to SO)

habosa · on Feb 28, 2014

Hey guys you know this is meant to be humorous right? I honestly can't believe that people here are saying Google is a scraper site and complaining about "hypocrisy". No more caching! When I search Google I want them to freshly crawl the web and get back to me in a day or two with my results.

</rant>

ITB · on Feb 28, 2014

Google is most certainly crossing the line here.

1. They are not only doing this with wikipedia, but with many, many sites: "what is the smallest cell in the human body", "what is the biggest planet in the solar system".

2. The sites they chose to link are not always the highest quality sites, such as the two examples above- why are these websites being featured?

3. Many times, the user will get their answer right then and there, and be done with the search process. The site misses a visitor. In spite of these type of questions being "facts", someone took the time to organize and give context to these "facts". Turning facts into useful, consumable, content costs money. Google should not be taking visitors away from these sites.

4. There should be public information on the CTR of these snippets. See if it helps or hurts the user.

5. Google is abusing its power as a major search engine to reinforce structuring rules, such as microformats. With these rules, webmasters are giving more and more semantic meaning to their content, which means Google has an easier time completing their knowledge graph. They might link to the source site for a while, but there is no good argument for linking back to wikipedia to attribute the fact that Jupiter is the largest planet, since it's a fact, just like 2+2 is 4 (no attribution).

6. Google is all about ML/NLP/AI driven knowledge. But in reality they are turning all of the internet content creators into a giant sweat shop for their knowledge graph. This is not fair, and sooner or later it will come back to bite them.

Shooti · on Feb 28, 2014

All of your arguments are based on the underlying assumption that being a "pure web search engine" is inherently better than their current striving towards being an "knowledge engine" modeled after the Star Trek computer (of which web results are a just a subset). I'm not sure that can be taken as a given, if only because the later presents a much clearer model/metaphor in the mobile first technology climate.

higherpurpose · on Feb 28, 2014

Google should be very careful with this. They don't want someone in power to get the idea like "wait a minute...isn't Google mining whole websites too and profiting from it? Maybe we should do something about that!"

altcognito · on Feb 28, 2014

Scraper sites usually don't reference the source material, but yeah, you might want to get some ice for that burn.

Angostura · on Feb 28, 2014

Of course, its not just Wikipedia these days - Movie theatres etc all suffer it.

theepauk · on Feb 28, 2014

Indeed. Google is constantly doing things they punish other people for.

A happy DDG user, who still uses !g too often though.

air · on Feb 28, 2014

https://duckduckgo.com/?q=scraper+site

theepauk · on Feb 28, 2014

I'm not sure which point you're trying to make. Did you look at op's submission?

air · on Feb 28, 2014

I assumed you thought scraping wikipedia and putting it on top of the search results was unethical. (The other alternative - that punishing scraper sites is unethical - seemed unlikely). So the fact that you prefer DDG because of this seemed weird, considering DDG does the same thing.

ldng · on Feb 28, 2014

DDG is clearly attributing to Wikipedia and is not cloaking the link behind a redirect.

sp332 · on Feb 28, 2014

Google's page wouldn't get indexed in a search engine to try to drive traffic to itself.

tobehonest · on Feb 28, 2014

"All therefore whatsoever they bid you observe, that observe and do; but do not ye after their works: for they say, and do not." --Matthew 23:3

"Do as I say, not as I do" -- Google

Grue3 · on Feb 28, 2014

Cue Bing, DuckDuckGo and any other search engine (except Google, of course) being Google-killed for "scraping". It's the perfect plan!

solve · on Feb 28, 2014

Google recently flagged my content-curation startup as "Pure Spam", even though it only takes small snippets from the original sources, is 100% human curated, and always links back to the true original source.

Not only are the curated pages blocked, but the entire domain is blocked as "pure spam". People who use Google to find a domain instead of typing the full URL now can't find it anywhere.

These assholes are just being anti-competitive now.

lauradhamilton · on Feb 28, 2014

That's brutal.

Honestly you might consider switching to a new domain given that you haven't launched yet. It can take a long time to get out of the Google doghouse.

troels · on Feb 28, 2014

Curious - Which site is that?

solve · on Feb 28, 2014

You won't find it on Google :) We're making a products recommendation site, focusing on goal-oriented decisions. We'll make the full announcement within the next couple of weeks.

This reminds me. Google is really killing the "release early and release often" approach, if people will now have to do a ton of SEO learning and tweaking to avoid having your MVP permanently banned at launch day.

pinakothek · on Feb 28, 2014

"Google is really killing the "release early and release often" approach, if people will now have to do a ton of SEO learning and tweaking to avoid having your MVP permanently banned at launch day."

Judging by the downvotes this is not a legitimate concern? Why?

PaulHoule · on Feb 28, 2014

Actually, from an SEO perspective, the #1 principle Google follows is preventing "release often" from being effective for SEO.

Talk to anyone who makes money at PPC and they will tell you one thing. You make a campaign, measure the results, change it a little, measure the results, and make incremental improvements to make a profitable campaign.

If you could do that with SEO, SEO would be a lot easier. Google, therefore, has a number of mechanisms (some patented) that cause all hell to break loose if you make the kind of changes to your site that you'd use to incrementally improve it's SEO.

It's one of the reason we are stuck with crappy sites like answers.com, w3schools, and wrongdiagnosis, because once a site like that is successful, the operators are loathe to make any changes lest their rankings drop.

troisx · on Feb 28, 2014

Depending on what products you're talking about, you should see if it competes in any way with Google's paid results on product searches. If there is any crossover, the USDOJ might be happy to hear what you have to say.

trillium · on Feb 28, 2014

Your site sounds really interesting! Look forward to learning more about it.

ricg · on Feb 28, 2014

Easy. Search for a programming related question. After the result from stackoverflow you'll find dozens of scraper sites.

ntaso · on Feb 28, 2014

If you read correctly, the task is to report scraper sites that rank HIGHER than the original site. Not the case in your scenario.

tobehonest · on Feb 28, 2014

And in OPs context, not only does it rank higher, it's above ALL links. So you don't even need to visit the target page and in affect stealing traffic (and potential ad revenue) from the source.

dangrossman · on Feb 28, 2014

Those sites didn't actually scrape anything. StackExchange publishes their full database dump, and it's CC licensed.

https://archive.org/details/stackexchange

corresation · on Feb 28, 2014

Do we need to differentiate from mechanical scraping versus manual scraping, though? Because realistically a hefty percentage of StackOverflow content are users manually scraping other sites, all in hopes of earning some imaginary internet points.

sebii · on Feb 28, 2014

More evidence: http://shadyseo.com/

gwu78 · on March 1, 2014

I do not understand the Wikipedia definition of "scraper site".

By this definition webcache.googleusercontent.com qualifies.

It is a full copy of every site GoogleBot scrapes.

Google gives attrition to the original source, but if this isn't "scraping", what is?

They have been sued for this, and they've won. The benefits of a decent search engine outweigh the burden of infringing the copyrights of others. At least where Google and other search engines that cache websites are concerned.

gwu78 · on March 1, 2014

s/attrition/attribution/

baldfat · on Feb 28, 2014

Not funny since it is a double post for the same wikipedia: http://en.wikipedia.org/wiki/Scraper_site

Seriously that was just a stretch, but they both say the full url. So all of Google News is a scraper site and any other summery given is a scrapper site then. Sad.

return0 · on Feb 28, 2014

Joking aside, I see these people as the waste collectors of the web. Respect their work, but i wouldn't want to do it.

cousin_it · on Feb 28, 2014

That SERP should show only one result from Wikipedia instead of two. It should be on top, have a blue title link to Wikipedia, and look like an answer to the user's question. That could be done by a general mechanism that lets every site customize their representation on the SERP, or by a special case for Wikipedia.

bhartzer · on Feb 28, 2014

Who is deemed to be the scraper? The site that get crawled and indexed first, and ranks better, or the site that ranks well but has sites with scraped content that doesn't rank as well?

Matt is looking for scapers that rank better than the original, basically meaning that they have higher PageRank and more links.

lazyjones · on Feb 28, 2014

I would report Google to him, but I'm afraid he's not planning to act fairly/consistently ...

MitziMoto · on Feb 28, 2014

Google should offer some kind of revenue sharing (Like Youtube) to the sites it's "stealing" visitors from by showing information directly. And you should have to opt into it through something like webmaster tools.

rip747 · on Feb 28, 2014

why they don't just integrate this into the results page? what's wrong with having up and down votes or a report this link button for the results?

globalpanic · on Feb 28, 2014

I thought this was largely taken care of by Google Panda?

motyar · on March 1, 2014

I know one, Google.com

iamabraham · on Feb 28, 2014

Sensational.

pearjuice · on Feb 28, 2014

Ah, a whole thread filled with pseudo-intellectual discussion about what scraping is (or isn't) due to some silly snarky-joke which Matt is probably laughing at, too. Hacker News to the rescue!