
The height of the HN middlebrow dismissal.

Just do some Bayesian filtering, SEO solved. Can't believe those idiots at Google never thought of it.

Do you seriously think spam filtering is this easy? Just one Bayesian machine away from solved?




I do a search for "how to barbaz a foo" on Google.

I get the following domains as the top 5 results:

  1. allaboutfoo.com
  2. fooexperts.com
  3. infoonfoo.com
  4. foonation.com
  5. thefooblog.com

The linked pages above all have:

  1. A table of contents section
  2. About 30 MB of AI-generated filler ("Why should you foo?", "How does foo affect your dog?", "Can you foo a foo?", "Why you should foo twice a day?")
  3. The same content plagiarized from each other's sites, just slightly reformatted/edited
  4. NO ACTUAL INFORMATION ON HOW TO BARBAZ THE FOO

It's 2022, and there's still no way to filter this garbage out?


I'm sure there is, but there is an enormous corpus of data, and any filtering you add can end up impacting the legitimate sites you're trying to get to.

SEO spam is constantly adapting to changes in filtering. If you start filtering sites that have a table of contents, SEO spammers will remove their ToC in no time, but authentic bloggers probably will not.
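
A toy sketch of that dynamic, in Python. Everything in it is made up for illustration: the `Page` fields, the `adapts` flag, the ToC rule, and the `honest-foo-blog.example` domain are assumptions, and this is in no way how Google actually ranks or filters anything.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Page:
    domain: str
    has_toc: bool
    is_spam: bool  # ground truth, which the filter cannot see
    adapts: bool   # hypothetical: does the owner chase filter changes?


def toc_rule_blocks(page: Page) -> bool:
    """Hypothetical hard-coded rule: block any page with a table of contents."""
    return page.has_toc


def adapt(page: Page) -> Page:
    """Owners who chase the filter drop the ToC; everyone else changes nothing."""
    return replace(page, has_toc=False) if page.adapts else page


def blocked(pages: list[Page]) -> list[str]:
    """Domains the ToC rule would filter out, sorted for stable output."""
    return sorted(p.domain for p in pages if toc_rule_blocks(p))


pages = [
    Page("allaboutfoo.com", has_toc=True, is_spam=True, adapts=True),
    Page("fooexperts.com", has_toc=True, is_spam=True, adapts=True),
    Page("honest-foo-blog.example", has_toc=True, is_spam=False, adapts=False),
]

before = blocked(pages)                     # the day the rule ships
after = blocked([adapt(p) for p in pages])  # once spammers have reacted

print("blocked before adaptation:", before)
print("blocked after adaptation: ", after)
```

On day one the rule catches all three pages. After one round of adaptation the only page still blocked is the legitimate one, which is the whole problem with rules keyed on surface features.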

Real content producers don't have the time to chase every change to Google's algorithm, but SEO spammers do. How do you filter out the spammers who adapt to changes without affecting the real content producers who can't afford to?

I promise you, the problem is harder than you realize. And sure as shit a lot harder than "just add a Bayesian filter and these 3 hard-coded rules I just came up with."



