
I think this is pretty fair on Google's part. How could you possibly figure out who owned content?

What if I published a book, it was copy-pasted in blogs, and then later I put it somewhere crawlable by Google? You certainly can't just say "first time we saw it, that's the proper owner". It would either require a massive amount of manual QA to get right (and even then, there are going to be interminable copyright battles), or have a super high error rate.

I think Google's best value is letting proper content owners easily find violators via normal searches, and letting them deal with those violators via takedown notices or the court system -- which is where it should be done, not in a pseudo-court run by Google, which does not want that responsibility.



So back when Blekko was a consumer search engine we could 100% figure out who owned content on sites we crawled often. And even when we didn't, we could guess correctly more often than not based on the domain registration dates (not to mention registry owners). That is because few people who rip off content rip off just one web site; they will rip off dozens of web sites, and those sites will all share the same AdSense IDs and the same domain registrar. This is easy stuff to spot when you crawl the web regularly.
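
That kind of clustering is simple to sketch. Purely as an illustration (the regex, domain names, and data structures here are my own assumptions, not anything Blekko or Google actually runs), grouping crawled domains by the AdSense publisher ID embedded in their pages looks roughly like this:

    # Hypothetical sketch: cluster crawled domains by shared AdSense publisher IDs.
    import re
    from collections import defaultdict

    ADSENSE_ID_RE = re.compile(r"ca-pub-\d{10,16}")

    def extract_adsense_ids(html):
        """Pull AdSense publisher IDs (ca-pub-...) out of a page's HTML."""
        return set(ADSENSE_ID_RE.findall(html))

    def group_sites_by_publisher(crawled_pages):
        """Map each AdSense publisher ID to the set of domains it appears on."""
        groups = defaultdict(set)
        for domain, html in crawled_pages.items():
            for pub_id in extract_adsense_ids(html):
                groups[pub_id].add(domain)
        return dict(groups)

    if __name__ == "__main__":
        # Toy crawl data: two scraper domains sharing one publisher ID.
        pages = {
            "original-blog.example": '<script>google_ad_client="ca-pub-1111111111111111";</script>',
            "scraper-one.example": '<script>google_ad_client="ca-pub-2222222222222222";</script>',
            "scraper-two.example": '<script>google_ad_client="ca-pub-2222222222222222";</script>',
        }
        for pub_id, domains in group_sites_by_publisher(pages).items():
            if len(domains) > 1:
                print(pub_id, "appears on", sorted(domains))

A publisher ID that shows up across dozens of otherwise unrelated domains, especially ones registered at the same registrar around the same time, is exactly the kind of signal I mean.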

I suspect that Google simply doesn't care. They get ad revenue regardless, and given their laissez-faire editorial position it doesn't matter. What are you going to do, use another search engine?


Or, more likely, they can't get involved for legal reasons. If they took steps to block the easy stuff, an arms race would ensue, and the content providers would never be satisfied with the enforcement Google provided for free. The content providers would always demand stricter enforcement, and could threaten to sue for copyright infringement regardless of merit.


I think if that were the case they would not have pushed out the "Panda" updates which penalized content farms so heavily. If their past behavior (with content farms and other "low value" content sites) is a guide they will not do anything until enough people complain about it.

In the meantime it isn't even Google's content, so it's not a hosting issue; they are just the "neutral" third party providing their 10 blue links (oh, and supplying the advertising engine those sites are using).


Do you know if any search engine is actively filtering for this?


Sadly no. There are only a few actual, independently derived general search indexes (English-language search) in the world: Microsoft's, Google's, Baidu's, and Yandex's. They are expensive to build and maintain, and the only way to monetize them requires driving search traffic your way. Google is paying $4B/year to third parties to send search traffic their way.

My guess, having been at both Google and Blekko, is that "whitelisted" search will be the next wave in the industry. For those old enough to remember Yahoo!'s original "directory" model, once Yahoo!'s contract with Microsoft is up one could hope they rebuild their search team and technology into something with a strong editorial bias for "quality" content.


Did you get that $4 billion number from their quarterly results? Does that include things like their payments to Opera and Apple? Does it include search rev-share deals with entities like AOL and Ask (& soon to be Yahoo)? Does it include paying for Chrome distribution bundled with Flash security updates & such? I have never seen the overall numbers broken down in terms of what percentage goes where on the different sorts of syndication deals.

Three things would perhaps be a major issue for Yahoo! with that sort of search: first, they themselves rely heavily on content syndication to power their various verticals; second, they keep losing search market share (especially as more search happens on mobile devices, and Google has mobile locked down with its Android contracts); and third, they screwed up their old directory before they moved it to Yahoo! Small Business as part of the Alibaba share spinco.

I also don't see how Yahoo would effectively differentiate their search engine enough to be able to (profitably) buy share at prices set by Google, particularly if they over-promote their internal results & rely on a smaller search index.


Prior to restructuring their reporting, Google reported paid distribution as a cost. I left in 2010 and started tracking the number in Q1 2011, when it was $337M for the quarter. By Q4 2014 that number had ballooned to $968M for the quarter. In 2015 they changed the way they reported this number, making future comparisons problematic.

I expect it does include fees paid to Apple so that Apple would send search traffic to Google, and fees paid to browser vendors.

Our experience as a search results provider was that there was demand for a more 'functional' search capability (not casual searching). Many of the techniques we used have been adopted by Microsoft in their Bing engine, which has improved both their recall and quality with respect to Google's results on highly contested searches.

I certainly agree that Yahoo! has made a number of missteps with their search technology. I talked with them once (post Marissa's arrival), and in many ways they were as confused as ever about how search engines generate value for the parent company, but such things are rarely permanent.


Thanks for sharing that :)

One interesting bit from the most recent IAC investor conference call: they mentioned that their search deal with Google was renewed for another 4 years and that the rev share on mobile was lower than it was in the past. An analyst asking a question mentioned that both Google and Yahoo! were lowering revenue share on mobile.


> "whitelisted" search will be the next wave in the industry

> ... strong editorial bias ...

Interesting. Care to elaborate?

When you use words like 'whitelisted' and 'editorial', I imagine humans adding sites to a database one by one. But the volume of useful pages (and the number of sites) is really large now, so I guess that's not what you mean.

One thing I like about search today is that it's almost comprehensive. If I know something exists on the (open) web, I can usually find it with a few searches, even if it's very recent or obscure. I don't want to go back to the days when I browsed gopher directories, or even to the days when finding good quality content meant a hierarchical journey from a directory to a site, to a site map, to an individual page.


Also all of this is assuming that any duplicated content is inherently stolen, when it could in fact be public domain, fair use, legitimately licensed, distributed under Creative Commons, etc.


True, but in these cases I would still want the original/canonical/fastest/best source first, while the others are probably only valuable as backups.


People have been complaining about this problem for a LONG time.

Duplicate removal is essential for making a web search engine that works. For instance, together with a CS research group, I built a search engine for a major university library that had more than 80 web sites. We found huge amounts of duplicate content produced by various mechanisms (for instance, multiple people posted the same stuff to the web.) If your ranking is content-based, all of the duplicate documents are going to rank the same and form a "plug" that excludes other documents.
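
For what it's worth, the core of duplicate detection is not exotic. A minimal sketch, assuming simple word shingles and Jaccard similarity (the shingle size and threshold here are illustrative, not what any production engine actually uses):

    # Near-duplicate detection with k-word shingles and Jaccard similarity.
    def shingles(text, k=5):
        """Return the set of k-word shingles in a document."""
        words = text.lower().split()
        return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

    def jaccard(a, b):
        """Jaccard similarity of two shingle sets."""
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)

    def is_near_duplicate(doc_a, doc_b, threshold=0.8):
        """Flag document pairs whose shingle overlap exceeds the threshold."""
        return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold

At web scale you would replace the pairwise comparison with MinHash or simhash sketches, but the idea is the same: collapse the cluster of copies so it occupies one slot instead of plugging up the whole results page.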

It has long (post-2006) been a common story that "I wrote a blog post but somebody else ranks for it." For instance, I made a blog post that got a huge amount of traffic back in the day, but right now if you search for it you find a presentation from some fresher at Oracle that is based on those ideas.

There are many factors that make this hard to control, including: (1) for one "real" origin there are probably ten or a hundred fakes, so if you are picking at random you strike out -- you have to outrank not just one fake but all of the fakes; (2) freshness -- copies are fresher than the original, and they can also be updated years later; (3) the bad guys think a lot more seriously about indexation, PageRank, and the other variables they control than most content creators do.


Even if you don't care about trying to identify the original source of some piece of content, it seems like the content farm site which is plagiarizing is more likely to be a lower quality site than the original content producer.

The behavior does seem weird in any case: it's as if there is a certain slot for a given piece of content, and Google is swapping different domains in and out to fill that slot. It seems like Google is trying to identify the original content, failing, and then inadvertently penalizing the original producer.


Well, Google has already indexed article x. When article y appears and Google sees that y is an almost verbatim repeat of x, it shouldn't be that hard to figure out that x is the original, should it? Especially if they both have time/date stamps...
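
If you already have a cluster of near-duplicates, the heuristic is trivial to state. A toy sketch, assuming a first-crawled timestamp per URL (the field names are hypothetical, and real systems surely weigh far more signals than first-seen time):

    # Pick the earliest-seen document in a cluster of near-duplicates.
    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class Doc:
        url: str
        first_crawled: datetime  # when the crawler first saw this URL

    def pick_original(duplicates):
        """Treat the earliest-crawled member of the cluster as the original."""
        return min(duplicates, key=lambda d: d.first_crawled)

The hard part is that crawl date and publication date are gameable and not always in the original's favor -- a scraper on a frequently crawled site can get indexed before a small blog does.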


One man's stolen content is another man's mirror. There are countless times when original content is region-blocked, behind a paywall, or expired, but accessible via "stolen" links.


While most of what you say is true, they could at the very least reject AdSense applicants based on how often they copy-paste content from established publishers. I’m sure they have the means to figure something like that out.



