Wow, I'm seeing tons of false positives. Why on earth is the co-written blog of a Nobel laureate and a 7th Circuit appeals court judge making blekko's Bayesian classifier freak out?
A site in the UK that gets a huge amount of traffic and is backed by a large offline national advertising campaign. Why has this been banned? I've never seen spam on there, nor have I seen them spamming.
Don't be fooled: that advertising campaign only used a 'sensible', similarly-named insurance business to introduce an unusual niche company[1] to the market. It's a clever ploy; you can read about it in Aleksandr Orlov's book that came out a month ago.
Did you try adding /reviews? Sometimes we can deliver better results without the user doing anything different from what they're used to doing, but not always.
What exactly is the best refrigerator test? Searching for it on other sites (Google, Wikipedia) just returns refrigerator results on the former and nothing really useful on the latter...
Two big problems with automated spam blocking are false positives and changing domain names.
For the second one, how often do you revise your blocked list? What if a domain changes owners and the new owner doesn't serve spam?
For the first one, is even one false positive tolerable? Will you deny someone presence in your index because you failed? And if so, how do you handle challenges?
It looks like they have a considerable number of false positives. More examples: jet2.com is the website of a low-cost airline, and wp-plugins.net hosts WordPress plugins.
* This list is generated by our algorithm; it's the most important sites that our algorithm thinks are spam. The point of making the list public is so that you guys can tell us when we're wrong. Google has a list like this, but they don't show it to anyone. Transparency in action.
Thank you all for pointing out false positives in the list. That is what we hoped would happen.
* The "nocrawl" sites are human-picked by us. Geocities is on the list because it was a very spammy domain. Even though they've (finally) removed the data, we still have old data indexed, and will remove them from the spam list once all that old data ages out.
* BT is the hosting company for comparethemarket.com
There are quite a few false positives on that list. Also, labeling sites marked nocrawl as spam is pretty lame. dshield.org is listed as mfa, but there isn't an AdSense ad on it.
I have been using Blekko as my primary search engine for a couple of days now, and in my experience their search results are very decent.
Maybe not in terms of falsely blocked sites, but certainly in terms of having fewer false positives (e.g. spam/useless pages) in the search results.
Mom & pop users (and even more advanced searchers, such as students looking for book reviews, or torrents) might very well forgive them the few false blocks for this.
Zittrain, in his 'The Future of the Internet and How to Stop It', already wrote about this trade-off in terms of spam being made possible by the generativity of the Internet, and people increasingly preferring controlled environments over those full of viruses and spam (wonder why Apple's locked-down devices are so popular?).
Of course this has big downsides too, and in my opinion is even a bad thing. But Blekko, by allowing people to create their own slashtags (categories, much more flexible and quick than Google's domain search), and with Google/Yahoo/Bing always being only one click away, might have arrived at a good middle ground...
Imho Blekko might very well be able to beat Google at their own game. Give them a try, or at least turn to them sometimes when Google doesn't do it for you, I'd say...
> Maybe not in terms of falsely blocked sites, but certainly in terms of having fewer false positives (e.g. spam/useless pages) in the search results.
A false positive is when a good site is mistakenly identified as a spammy site.
In spite of the (justified) complaints you get about the false positives, I think that's a great way to go. Unlike with email, where missing a message might be critical, in search I'd rather have even as many as 10-20% false positives than deal with the spam sites Google delivers.
More generally, concerning the front-page search examples:
"cure for headaches" works very well indeed compared to google.
However, "global warming /liberal" is a bit irritating. I understand the rationale behind it, however there is this slight difference between finding only what one is looking for and hearing only what one wants to hear. To find anything non-mainstream might necessitate a technique like this in Google where you otherwise don't see anything else in the first 50 results... But maybe you can strive to find for me what's really going on and not merely what's mainstream and politically correct. Thinking about it, your blocking of domains like Answer.com might be a great step in that direction anyway.
Really? I'd rather have spam sites than 20% of my legitimate results missing.
I guess it depends on what you're searching for. "cure for headaches" will probably be just fine with some missing sites, but "Deadmau5 tour dates" will definitely be affected by the false-positive block on deadmau5.com.
Can anyone explain what bayes and mfa mean? I picked the site http://www.basemetals.com/ (bayes (spam 8.6 > 5.3)) at random, and although it won't win any design awards, I can't see what the problem with it is. Am I missing something?
mfa = Made for AdSense. That means we believe the domain is designed more to show ads than to provide content.
bayes = our Bayesian analysis gizmo thinks bad things about this site's content. Too much Viagra, not enough content. Like all artificially-intelligent things, it can sometimes be hard to see why it's upset.
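To make that a bit more concrete, here's a toy sketch in Python of how a Bayesian-style score like "(spam 8.6 > 5.3)" could be produced. The word probabilities and the 5.3 threshold here are made up for illustration; this isn't blekko's actual model:

    import math

    # Hypothetical per-word probabilities, as if learned from a labeled
    # corpus (numbers invented for illustration).
    WORD_PROBS = {
        # word:      (P(word|spam), P(word|ham))
        "viagra":    (0.05,  0.0001),
        "refinance": (0.03,  0.0005),
        "research":  (0.002, 0.02),
        "review":    (0.004, 0.015),
    }

    SPAM_THRESHOLD = 5.3  # assumed cutoff: above this, mark the page spam

    def spam_score(words):
        """Sum of per-word log-likelihood ratios; positive terms push toward spam."""
        score = 0.0
        for w in words:
            # crude smoothing so unseen words contribute nothing
            p_spam, p_ham = WORD_PROBS.get(w, (1e-4, 1e-4))
            score += math.log(p_spam / p_ham)
        return score

    page = ["viagra", "refinance", "viagra", "review"]
    s = spam_score(page)
    print(f"spam {s:.1f} > {SPAM_THRESHOLD}" if s > SPAM_THRESHOLD
          else f"ok {s:.1f} <= {SPAM_THRESHOLD}")

Because the final score is a sum of thousands of tiny word-level contributions, it really can be hard to point at any one thing the classifier is "upset" about.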
Omitting these domains from results is "automated".
Managing the list isn't. It's based partly on how many users report a domain as being spam; at least, that's one of the reasons for inclusion. And don't Bayesian filters, when newly implemented and with little data to work with, always have false positives?
Maybe some are labelling valid stuff as spam out of spite.
When blekko has millions of users labelling stuff as spam instead of very few, the system will be harder to abuse and the list much better.
From all the comments and what I have noticed, it sounds like a good question for the spam problem is: "Is it better to have false positives or false negatives?" I personally think that it's better to have false negatives than false positives, and a lot of the comments here seem to reflect that.
That depends on how many false positives and negatives we have -- and you don't really know how many of either. There are 100 million hosts in our crawl; how can you estimate whether we have a false positive problem from looking at the list of the top 100 marked as spam?
Wouldn't it be better to let people maintain their own blacklists, and then incorporate a site into the official list once a certain percentage of people have it marked as spam?
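Something like that could work. Here's a rough Python sketch of the aggregation step; the 40% promotion threshold and the data shapes are my own assumptions, not anything blekko has announced:

    # Hypothetical rule: promote a domain to the shared spam list once
    # at least PROMOTION_FRACTION of users have personally blacklisted it.
    PROMOTION_FRACTION = 0.4  # assumed threshold, purely illustrative

    def build_official_list(user_blacklists):
        """user_blacklists: dict of user_id -> set of blacklisted domains."""
        total_users = len(user_blacklists)
        counts = {}
        for domains in user_blacklists.values():
            for domain in domains:
                counts[domain] = counts.get(domain, 0) + 1
        return {d for d, n in counts.items()
                if n / total_users >= PROMOTION_FRACTION}

    users = {
        "alice": {"spamfarm.example", "mfa-site.example"},
        "bob":   {"spamfarm.example"},
        "carol": {"spamfarm.example", "jet2.com"},  # one spiteful report
        "dave":  set(),
        "erin":  {"mfa-site.example"},
    }
    # spamfarm.example (3/5) and mfa-site.example (2/5) make the cut;
    # jet2.com (1/5) does not.
    print(build_official_list(users))

Requiring agreement from a decent fraction of users is exactly what would blunt the spite-labelling problem mentioned above.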
Google counts them as different websites. It is typically a wise SEO move to use your .htaccess file to set up a 301 redirect to/from the www. version of your site.
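For reference, the non-www-to-www version of that redirect in .htaccess usually looks something like the following (Apache with mod_rewrite enabled; example.com is a placeholder for your own domain):

    RewriteEngine On
    RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
    RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

The 301 tells crawlers the move is permanent, so the two hostnames stop competing with each other in the index.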
#88 - http://www.becker-posner-blog.com - bayes (spam 31.6 > 5.3)