Matt Cutts surely can't ignore this debate for much longer. The least he could do is chime in with Google's official perspective on the situation, whether it's for or against the Mahalo approach.
There is a large incentive for Google to send traffic to Mahalo. The low quality of the pages makes users want to leave. What do they click on when leaving? The AdSense ads! I'm not saying Google is going out of its way to support this, but it is easy to see why it may be hesitant to penalize Mahalo.
They're also important. If Mahalo, Demand Media, etc. all continue to do what they're doing, then they'll have built a publisher-paid paywall around most of the content online.
That is bad news for anyone who tries to get long-tail organic search traffic. It's good news for sites with great brand names, but terrible news for anyone else.
It's not fair to lump Demand in with Mahalo. Demand's business model is filling in holes with its own original content, not scraping (or "aggregating") other people's content.
AFAIK, Google views their relationship as symbiotic, not parasitic.
Demand Media does not produce high-quality content. They produce content that is good enough to rank (i.e., it's written in English) and unique. But they have a strong incentive to have bad content! If your article on "how to make pancakes" tells someone how to make pancakes, they close their tab and make pancakes; if it's 300 words of "original content" that makes no sense, you'll end up clicking through to another site (one that has to pay for the privilege).
When you think of how many struggling freelancers use those long-tail guides to build their business ("How to shoot a commercial for a gym" or "How to write brochure copy for life insurance"), you can see the magnitude of this problem. People who could trade their time for traffic now have to trade their money for traffic. When they're just getting started, money is harder to come by than time. The result: fewer people creating this kind of content, and more of them joining organizations that pay for the traffic instead.
As I understand it, their algorithm looks for keywords with little to no competition, but at least ~$10/year potential in AdSense revenue. If they can find such a keyword, they pay someone ~$5 to write an article or make a video on the topic.
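In other words, something like this (my guess at the logic; the $10 and $5 figures are the ones above, and the data shape and field names are entirely hypothetical):

```python
# Rough sketch of the arbitrage filter as I understand it. The $10/year
# revenue floor and ~$5 article cost come from this comment; everything
# else here is made up for illustration.
ARTICLE_COST = 5.00           # one-time payment to a freelancer
MIN_ANNUAL_REVENUE = 10.00    # projected AdSense revenue per year

def worth_commissioning(kw: dict) -> bool:
    """Commission an article when competition is thin and the projected
    ad revenue covers the writing cost within the first year."""
    return kw["competing_pages"] < 10 and kw["annual_revenue"] >= MIN_ANNUAL_REVENUE

candidates = [
    {"term": "how to make pancakes", "competing_pages": 3, "annual_revenue": 12.50},
    {"term": "best laptop", "competing_pages": 900, "annual_revenue": 40.00},
]
for kw in candidates:
    if worth_commissioning(kw):
        print(f"pay ${ARTICLE_COST:.2f} for an article on {kw['term']!r}")
```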
The result isn't Pulitzer-worthy, but it's a lot better than you're making it out to be.
Using your example, here's the wikiHow article on "how to make pancakes":
But doesn't Demand only occupy one slot in the search engine results pages for a given keyword search? Even if Demand created billions of articles, one for each conceivable search term, the other nine spots in the first 10 search results would still be available to others, wouldn't they?
All in all, I still think this is primarily a Google problem, not Demand's. They can publish anything they want; it's their right, protected under free speech. It's Google that should be concerned about the quality of the pages it ranks.
The #1 search result gets about 40% of all organic clicks. And Demand Media has more than one property (eHow, Wikihow, Cracked, livestrong.com). So I wouldn't be surprised if there were some searches for which Demand Media got more than half of all traffic.
You are correct about this being Google's problem. These guys exist to exploit an arbitrage opportunity: Google's search algorithm picks them, and the average searcher's which-engine-do-I-choose algorithm picks Google. In the long run, one of these things will stop being true.
There's something about their content that feels fake, though. When I run across one of their pages in a search, it's not always terrible, but if I don't initially notice where I am, I sometimes get this weird feeling that the text I'm reading isn't quite right and might not actually contain any information. It frequently has this pumping-out-words-to-fill-the-page feeling, strongly reminiscent of high-school students' essays.
They are about "original content" in the Google sense. It's not quality content, just well-SEO'd stuff that is 1) good enough for Google to think it's grammatically-correct English, but 2) bad enough that contextual ads on the article will still get clicks.
Guys, draw a big circle around point #2 here and remember it: Demand Media understands the user psychology of CPC ads better than any Valley startup you've ever heard of. They want their content to suck because if it satisfies the users' needs then the users will not click the ads. If, on the other hand, it introduces the topic and leaves the user feeling "Where do I go next?", then the only options from a Demand Media page are either a) other worthless articles or b) CPC ads. And people choose door #2 quite frequently.
This frustrates me to the extent that I compete with Demand Media (limited, but real), because there is absolutely no circumstance under which a Demand Media article is a better result for the user than a page on my site when we're in competition. But Demand Media has the economics nailed such that they can produce content for some genres at scales I can't possibly keep up with without duplicating their core strategy (automated keyword selection -> cheap freelancer -> post without substantial quality control), pushing my related content out of the SERPs and then charging me for the clicks I would otherwise be getting for free.
How can Google NOT look at that sample page Aaron has displayed? It is truly terrible :( Also, Matt Cutts is generally very honest and public, but in this case he's been uncharacteristically quiet.
Some of these discussions make a lot of sense for HN, as many entrepreneurs here are building businesses around content (i.e., "what is spam?"). This one does not.
"I talked to him, and so I said what software do you use to power your search engine? And he said we use Twika or MediaWiki. You know, wiki software, not C++ not Perl not Python. And at that point it really does move more into a content play. And so it is closer to an About.com than to a Powerset or a Microsoft or Yahoo! Search."
How much original content is there in Google Reader? I'm not defending Mahalo, especially if it's stealing copyrighted content from other sites, but not having original content doesn't mean you don't provide value. Google Reader has value and no original content.
If Google started giving people public Google Reader pages, would you like it if Scoble's Google Reader outranked your blog when people searched for your blog's name?
There is value in duplicate content. But there's no value in a search query taking you to a search page, where the site you wanted to land on has to pay Mahalo for adding an extra click between you and what you wanted.
All of Google Reader is blocked by Google's robots.txt, which is exactly what the article says Mahalo needs to do for its scraped content:
http://www.google.com/robots.txt
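The mechanism is trivial, too. The relevant part looks something like this (an illustrative excerpt, not the actual contents of the file):

```
User-agent: *
Disallow: /search
Disallow: /reader/
```

One Disallow line covering Mahalo's auto-generated search pages would accomplish the same thing.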
The internet only moves in the direction Google points, and Google favors quantity over quality. All of the internet's problems flow from this unstated Google rule.
Google would rather make a cent each on a million stolen pages of crap than $10 on a page of original content; plus, the spam "author" is not going to ask for anything in return for the "content".
This whole thing is insane.... we have "stub" pages just like Wikipedia.
These are topic pages that people are working on, and THEY DON'T RANK in search engines until we get the word count to around 300-500 words.
We are in the process of NOINDEXING the pages that are below 300 words just to make Aaron happy... we actually had these noindexed before our last version and that got lost in the shuffle of the new launch (really, it did... when you do new code you might leave something out of the old code).
I'm also getting a list of every page under 300 words and having the page managers build them out within 30 days or delete them.
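Roughly, the audit is just this (a simplified sketch; the page data and word counter are illustrative, not our production code):

```python
# Simplified sketch of the audit described above: flag every page under
# 300 words so it can be noindexed, built out within 30 days, or deleted.
import re

MIN_WORDS = 300
NOINDEX_TAG = '<meta name="robots" content="noindex">'

def word_count(html: str) -> int:
    """Count visible words after stripping markup."""
    text = re.sub(r"<[^>]+>", " ", html)
    return len(text.split())

# Hypothetical page store for illustration.
pages = {
    "/answers/thin-stub": "<p>Two sentences of placeholder text.</p>",
    "/how-to-play-guitar": "<p>" + "word " * 4000 + "</p>",
}

for url, html in pages.items():
    n = word_count(html)
    if n < MIN_WORDS:
        print(f"{url}: {n} words -> add {NOINDEX_TAG}; build out or delete")
```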
Anyway, I thank Aaron for busting our chops and making us better!
The claims that we are "scraping" are absurd... we're using the Google, Bing, Twitter, etc. APIs to build a comprehensive search page.
I don't know everything about SEO, but I don't understand this claim by Aaron. I think he is trying to start trouble for us... and maybe it will work. Thanks pal!
- If you don't want those pages indexed in Google then why are you submitting them in an XML sitemap?
- I have already shown examples of pages with zero original content ranking, so how can you claim that they do not rank?
- You are not scraping directly; you are pulling from 3rd-party sites and using it as content on your own site. Which is worse, because there is no way to opt out of it.
- My problem is not just with what you call stub pages, but with most of your pages. When you give people embed code to put your content in their site, you give them an iframe AND a direct link back to you. If you want me to stop highlighting the absurdity of it, then perhaps you should hold yourself to the same standards you offer others. But you do just the opposite when you embed 3rd-party content in your site: you slap a nofollow on the links and embed the content directly into the page (rather than in an iframe).
- It is worth noting that every time I mention the above point you end up talking about stub pages or experiments or some other strategy to redirect attention. But what I am talking about is what you do on almost every page of your website.
1. Everything in the site is in the sitemap... it's not selective. It will be shortly.
2. My point is they don't get traffic... we look at any page that gets over 100 page views in a month and we build those pages out. So even if you find a page that ranks, it will not have traffic. If it has traffic, it gets built out.
3. We are not scraping, we are using search APIs.
4. I don't understand this issue with our widgets (which don't get used, to be honest... it's a failed program).
5. This is simply false... our traffic comes from how-to articles, walkthroughs, and Q&A. If you want to know what the top 10 pages are, they are things like how to play guitar and Call of Duty walkthrough pages. Those things are 3-5k words!
- 1. Everything in the site is in the sitemap... it's not selective. It will be shortly.
Ah, so now you admit it was intentional. But good on you for (eventually? hopefully?) fixing it.
- 2. My point is they don't get traffic... we look at any page that gets over 100 page views in a month and we build those pages out. So even if you find a page that ranks, it will not have traffic. If it has traffic, it gets built out.
If a person has a quarter million pages that are each getting 5 visits, that is still a lot of traffic: 1.25 million page views a month. Especially when the pages have zero editorial cost.
- 3. We are not scraping, we are using search APIs.
The end result is what people would typically call a "scraper site". It is irrelevant how it is created (whether you scrape directly or syndicate from somewhere else that is scraping). The issue is a lack of editorial control (see your page about 13-year-old rape) and a lack of citing sources with links.
- 4. I don't understand this issue with our widgets (which don't get used, to be honest... it's a failed program).
Search engines have duplicate-content filters. If the content is within the page as HTML (as you do on Mahalo), then you can often outrank the original source for their own content. You would bypass this issue, and me mentioning it, if you only used an iframe to embed the content in your pages. But if you embed it directly into the HTML (as you are doing right now), then of course it is bogus.
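To make the distinction concrete, the two embed styles look roughly like this (simplified markup; the URL and class name are made up):

```html
<!-- Inline embed: the copied text becomes part of this page's HTML, so
     search engines index it here and this page can outrank the source. -->
<div class="third-party-content">
  <p>Full text copied from the original article...</p>
</div>

<!-- Iframe embed: the text stays on the source's domain, so crawlers
     attribute it to the original page rather than to this one. -->
<iframe src="https://example.com/original-article"></iframe>
```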
- 5. This is simply false... our traffic comes from how-to articles, walkthroughs, and Q&A. If you want to know what the top 10 pages are, they are things like how to play guitar and Call of Duty walkthrough pages. Those things are 3-5k words!
I am not talking about your top 10 pages. I am talking about the bottom 300,000 pages, which in aggregate get far more traffic than the top 10 pages do. :D
- Just lay off, dude... go troll someone else.
Not trolling at all. Just trying to give you valuable feedback, which is what you have publicly called it multiple times (unless you were lying when you said that) :D
Does that mean that (for the remaining pages on the site)...
a.) the other scraped content which exists on the remaining pages will be put in an iframe (rather than as text on the page)
- OR -
b.) that you will be removing the nofollow from your links to the pages you are scraping content from?
Either you trust the content enough that you should link to it directly, or you should put it in an iframe such that search engines don't see it. Either route would likely be more akin to fair use than what you are currently doing (automatically scraping 3rd party content into your pages and using it to rank against the content creators, without permission, and without a way of opting out).
After this many lies, do you actually believe he is planning on taking that step? In my book he's already used up his credibility. I won't believe anything he says until he says what he has done and someone else publicly verifies it. (He has lied enough that I don't think it worth my time to bother verifying anything he says. The bozo bit is well and truly flipped.)
What fraction of your traffic comes from those 3-5K-word articles? And what's your return on investment from those compared to, e.g., one of the pages that Aaron linked?
I'm curious! I'm in the content-creation business, and if what you're doing works, I'll either need to radically change what I do or to start copying you.
This is not a small matter; this destroys the internet for the rest of us. The internet becomes unusable and untrustworthy. And you are doing this in profitable collusion with Google. You are cynically going where Google pushes you.
> We are in the process of NOINDEXING the pages that are below 300 words just to make Aaron happy... we actually had these noindexed before our last version and that got lost in the shuffle of the new launch (really, it did... when you do new code you might leave something out of the old code).
It's been a few weeks: have we seen any evidence of this happening?
Not only are they NOT noindexing them (which they said they would do a couple of years ago and a couple of weeks ago), but they are still submitting them to Google via an XML sitemap.
I understand that many of us disagree with what Jason is saying in his comments, but I don't think that we should be downvoting his comments below 1. He is at least making an attempt at a reasoned argument for his actions, and taking the time to post it. It makes the discussion hard to follow when all the detracting comments are barely visible grey.
>We are in the process of NOINDEXING the pages that are below 300 words just to make Aaron happy... we actually had these noindexed before our last version and that got lost in the shuffle of the new launch (really, it did... when you do new code you might leave something out of the old code).
>
>I'm also getting a list of every page under 300 words and having the page managers build them out within 30 days or delete them.
It's not just Aaron Wall that sees something fishy.
But for those of us who are just tired of the whole drama: just change how it's done, or don't do it at all. Adding nofollow and not submitting auto-generated content in Mahalo's sitemap does not seem like a great amount of development work if you really want to change it.
Funny. By those standards applied strictly, Google's own search result pages would be considered spam. Jason can argue that there is value in the SELECTION of content, value that is not explicitly visible but is nonetheless significant.
Similarly, the Hacker News homepage might be considered spam because it's mostly titles of articles, until you take into account the value of the votes, which is value these Google guidelines don't clearly address.
And you'll notice that Google does not index its own search result pages, nor encourage anyone else to do so. Perhaps "spam" is too pejorative a word and "things that should not be in our search engine" would be a better description.
Well, the debate really is how much Google should reward Mahalo for its practices. Namely, how much ranking should Google (and other search engines) give Mahalo on their SERPs (search engine results pages)?
I don't think Google's search results or even Hacker News really show up too often in SERPs except for their respective brands... so it doesn't really apply. They are destination websites, not sites focused on search traffic.
I don't think so. It looks like the policy is in place to keep you from having to perform multiple clicks to get where you want. In the common use case (I visit Google.com, type a search, get a result), I do not have to click on anything more than the search result to get where I am going. The difference is in a use case where my search result takes me to a search results page, which means I am only halfway (or less) to my target page after clicking on the Google result. This doesn't necessarily mean that all Mahalo content is bad, but it does mean that Google shouldn't be indexing Mahalo's search results pages and should instead link directly to the resultant content pages (scraped or not).
Similarly, Hacker News articles should not be the result of a search on the news title (as HN's link to the original article actually increases the article's score, right?), but should perform better if the matched term is found in HN-specific content (e.g., the commentary we create only on HN).
I'm sure there are (hopefully) edge cases in which a search for the article title returns the HN link instead of the original article, but that should be the exception to the rule. Of course, if HN were to have an absurdly high PageRank and link to a relatively unknown (or new) blog, it could lead to a result here instead of the other way 'round.