Blekko already has a list of domains to block from results (blekko.com)
46 points by aj700 on Jan 7, 2011 | hide | past | favorite | 39 comments



Wow, I'm seeing tons of false positives. Why on earth is the co-written blog of a Nobel laureate and a 7th Circuit appeals court judge making Blekko's Bayesian classifier freak out?

#88 - http://www.becker-posner-blog.com - bayes (spam 31.6 > 5.3)


My guess is that it's the two dashes in the domain name, which are almost always indicative of spammers.


Really? A seemingly content-heavy site with no ads to serve; why would it be flagged?


The first item on the list currently which is not some sort of false positive appears to be wareseeker.com at #10 (and that's debatable).


I'm seeing:

http://www.deadmau5.com -> Really, spam?

http://www.comparethemarket.com -> Price comparison site with no known biases in the UK and not owned by British Telecommunications, as it is listed.

Even ONE false positive is enough to make me think this listing is a load of bullshit. How is geocities' "closing down" page spam?


A site in the UK that gets a huge amount of traffic and is backed by a large offline national advertising campaign. Why has this been banned? I've never seen spam on there, nor have I seen them spamming.


Don't be fooled: that advertising campaign only used a 'sensible', similarly-named insurance business to introduce an unusual niche company[1] to the market. It's a clever ploy; you can read about it in Aleksandr Orlov's book that came out a month ago.

[1] http://www.comparethemeerkat.com/


They're not passing the "best refrigerator" test yet:

http://blekko.com/ws/best+refrigerator


Did you try adding /reviews ? Sometimes we can deliver better results without the user doing anything different from what they're used to doing, but not always.


What exactly is the "best refrigerator" test? Searching for it on other sites (Google, Wikipedia) just returns refrigerator results in the former and nothing really useful in the latter...


Two big problems with automated spam blocking are: false positives and changing domain names.

For the second one, how often do you revise your blocked domains? What if a domain changes owners and the new owner doesn't serve spam?

For the first one, is even one false positive tolerable? Will you deny someone presence in your index because you failed? And if so, how do you handle challenges?


We don't mark a domain as spam until many of the pages we've seen look spammy.

Our ideal is to recrawl everything every 14 days, but during our launch we have not been achieving that.


Why is http://www.deadmau5.com/ marked as spam? It's obviously not.


It looks like they have a considerable number of false positives. Among other sites, jet2.com is the website of a low-cost airline, and wp-plugins.net hosts WordPress plugins.


Maybe it's looking for numbers in the domain name? That's probably a really good signal overall.


* This list is generated by our algorithm; it's the most important sites that our algorithm thinks are spam. The point of making the list public is so that you guys can tell us when we're wrong. Google has a list like this, but they don't show it to anyone. Transparency in action.

Thank you all for pointing out false positives in the list. That is what we hoped would happen.

* The "nocrawl" sites are human-picked by us. Geocities is on the list because it was a very spammy domain. Even though they've (finally) removed the data, we still have old data indexed, and will remove them from the spam list once all that old data ages out.

* BT is the hosting company for comparethemarket.com


There are quite a few false positives on that list. Also, labeling "nocrawl" sites as spam is pretty lame. dshield.org is listed as mfa, but there isn't an AdSense ad on it.


I have been using Blekko as my primary search engine for a couple of days now, and in my experience their search results are very decent.

Not, maybe in terms of falsely blocked sites, but certainly in terms of having fewer false positives (e.g. spam/useless pages) in the search results.

Mom-and-pop users (and even more advanced searchers, such as students looking for book reviews, or torrents) might very well forgive them the few false blocks for this.

Zittrain, in 'The Future of the Internet and How to Stop It', already wrote about this trade-off: spam is made possible by the generativity of the internet, and people increasingly prefer controlled environments over those full of viruses and spam (wonder why Apple's locked-down devices are so popular?).

Of course this has big downsides too, and in my opinion it's even a bad thing. But Blekko, by allowing people to create their own slashtags (categories, much more flexible and quick than Google's domain search), with google/yahoo/bing always only one click away, might have arrived at a good middle ground...

Imho Blekko might very well be able to beat Google at their own game. Give them a try, or at least try them sometimes when Google doesn't do it for you, I'd say...


> Not, maybe in terms of falsely blocked sites, but certainly in terms of having fewer false positives (e.g. spam/useless pages) in the search results.

A false positive is when a good site is mistakenly identified as a spammy site.


In spite of the (justified) complaints you get about the false positives, I think that's a great way to go. Unlike with email, where missing a message might be critical, in search I'd rather have even as much as 10-20% false positives than deal with the spam sites Google delivers.

More generally, concerning the front-page search examples: "cure for headaches" works very well indeed compared to Google. However, "global warming /liberal" is a bit irritating. I understand the rationale behind it; however, there is a slight difference between finding only what one is looking for and hearing only what one wants to hear. Finding anything non-mainstream might necessitate a technique like this on Google, where you otherwise don't see anything else in the first 50 results... But maybe you can strive to find for me what's really going on, and not merely what's mainstream and politically correct. Thinking about it, your blocking of domains like Answer.com might be a great step in that direction anyway.


Really? I'd rather have spam sites than 20% of my legitimate results missing.

I guess it depends on what you're searching for. "cure for headaches" will probably be just fine with some missing sites, but "Deadmau5 tour dates" will definitely be affected by the false positive block on deadmau5.com


Can anyone explain what bayes and mfa mean? I picked the site http://www.basemetals.com/ (bayes (spam 8.6 > 5.3)) at random, and although it won't win any design awards, I can't see what the problem with it is. Am I missing something?


mfa = Made for AdSense. That means we believe that the domain seems designed more to show ads than to provide content.

bayes = our Bayesian analysis gizmo thinks bad things about this site's content. Too much Viagra, not enough content. Like all artificially-intelligent things, it can sometimes be hard to see why it's upset.
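Blekko hasn't published how its classifier works, but the "spam 31.6 > 5.3" notation in the list suggests a score compared against a threshold. A minimal sketch of that idea, using a naive-Bayes-style sum of log-likelihood ratios (all word probabilities below are made-up illustrative numbers, not Blekko's model):

```python
import math

# Hypothetical per-word probabilities learned from labeled pages.
# Format: word -> (P(word | spam), P(word | ham)).
word_probs = {
    "viagra":       (0.05,   0.0001),
    "casino":       (0.03,   0.0002),
    "refrigerator": (0.001,  0.002),
    "judge":        (0.0005, 0.003),
}

def spam_score(words, probs=word_probs):
    """Sum of log-likelihood ratios over known words; higher = spammier."""
    score = 0.0
    for w in words:
        if w in probs:
            p_spam, p_ham = probs[w]
            score += math.log(p_spam / p_ham)
    return score

THRESHOLD = 5.3  # the cutoff shown next to each list entry

print(spam_score(["viagra", "casino", "viagra"]) > THRESHOLD)  # -> True
print(spam_score(["judge", "refrigerator"]) > THRESHOLD)       # -> False
```

A page that repeats high-ratio words clears the threshold quickly, which is also why an innocent page that happens to use unlucky vocabulary can become a false positive.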


Maybe try using something like a decision tree where the classification steps are much more obvious.

There are parallelizable implementations out there...
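The appeal of a decision tree here is explainability: each classification is a traceable chain of explicit splits. A toy sketch, with entirely hypothetical feature names and thresholds:

```python
def classify(features):
    """Toy hand-written decision tree over hypothetical domain features.
    Every split is explicit, so a human can read off exactly why a
    domain was flagged -- unlike an opaque Bayesian score."""
    if features["ad_blocks"] > 5:
        if features["unique_words"] < 200:
            return "spam"   # ad-heavy page with thin content
        return "ham"        # ad-heavy but substantial content
    if features["hyphens_in_domain"] >= 2:
        return "spam"       # e.g. cheap-pills-online.example
    return "ham"

verdict = classify({"ad_blocks": 8, "unique_words": 50, "hyphens_in_domain": 0})
print(verdict)  # -> spam
```

In practice one would learn the splits from labeled data (e.g. with CART) rather than write them by hand, but the traceability argument is the same.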


I guess Bayes refers to running a Bayesian classifier[1] on a site in order to infer the probability that it is a "bad"/spam one.

On the other hand, I have no idea what mfa means.

[1] http://en.wikipedia.org/wiki/Bayesian_spam_filtering


Omitting these domains from results is "automated".

Managing the list isn't. It's based partly on how many users report a domain as spam; at least, that's one of the reasons for inclusion. And don't Bayesian filters, newly implemented and with little data to work with, always have false positives?

Maybe some users are labelling valid stuff as spam out of spite.

When blekko has millions of users labelling stuff as spam instead of very few, the system will be harder to abuse and the list much better.


From all the comments and what I have noticed, it sounds like a good question is: is it better to have false positives or false negatives in the spam problem? I personally think that it's better to have false negatives than positives, and a lot of the comments here seem to reflect that.


That depends on how many false positives and negatives we have -- and you don't really know how many of either. There are 100 million hosts in our crawl; how can you estimate whether we have a false-positive problem from looking at the list of the top 100 marked as spam?


You're right. Would there be a way to get a sample of the database and test whether there is a considerable number of false positives?


If they put johnchow.com on the list, they must be doing something right.


I actually only saw johnCOW.com (not CHOW).


Wouldn't it be better to let people do their own blacklists, and then incorporate that into their official list if a percentage of people have that site down as spam?


No, that would actually be an easy system to exploit: create a million accounts that blacklist your competitors, and watch the search engine block them.


Blekko has individual blacklists -- that's what the "spam" button by every result is.

This topspam list is algorithmic, and is not affected by user "spam" clicks.


Doesn't include swik.net, which is a crappy link-aggregation / search-tag spam site -- I'd expect that one to be removed...


I wonder why they have both the www and the non-www version in there for domains?


Google counts them as different websites. It is typically a wise SEO move to use your .htaccess file to set up a 301 redirect to/from the www. version of your site.
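For anyone who hasn't done this, a minimal .htaccess sketch for canonicalizing on the non-www host (the domain is a placeholder, and this assumes Apache with mod_rewrite enabled and `AllowOverride` permitting rewrites):

```apache
# Redirect www.example.com/* permanently to example.com/*
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]
```

The same effect can be achieved the other way around (redirecting to the www host); what matters for search engines is picking one canonical host and redirecting the other with a 301.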


Nice. If I had to guess, in a couple of years Google will attempt to acquire Blekko and integrate it with their webspam team.


Wow, the search results are terrible on blekko. I think someone's gone crazy with the ban hammer.

My suggestion to Blekko: look for signals of relevance to determine SERPs, instead of flagging every other website as spam.



