
You're definitely right -- this is an issue. I could very well believe that we tripped some FB spam measures.

We have a very manual anti-spam process right now that relies on humans to detect it and act on it. We have a couple of very dedicated folks who end up checking every few hours, but it's not automated, and we don't have full timezone coverage.

It's definitely something I'd like to see us improve, but we've been focused on other projects (like switching from mid-90s HTML to a responsive design, which is a slow rewrite of the entire site). That said, if you have any advice on reasonably scalable ways of doing this in-house that don't involve sending our user content to a third party, I'd love to take any recommendations!

Feel free to email me, mark@dreamwidth.org, if you would rather do that. And if not, don't worry about it, I appreciate the comment anyway :)




The simplest spam filtering algorithm would be a naive Bayes filter. Essentially, you keep a count of words that appear in all posts, words that appear in spam posts, and words in non-spam posts. Those counts plus Bayes' rule let you figure out the probability of spam given a word. It's called naive Bayes because you assume each word in a post is independent of the others, so the probability that the whole post is spam is just the product of the per-word probabilities.
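
If it helps, here's a minimal sketch of that idea in plain Python with no dependencies; `spam_posts` and `ham_posts` are hypothetical lists of example posts you'd supply from your own data:

    import math
    from collections import Counter

    def tokenize(text):
        return text.lower().split()

    def train(spam_posts, ham_posts):
        spam_counts = Counter(w for p in spam_posts for w in tokenize(p))
        ham_counts = Counter(w for p in ham_posts for w in tokenize(p))
        return spam_counts, ham_counts, len(spam_posts), len(ham_posts)

    def spam_probability(post, spam_counts, ham_counts, n_spam, n_ham):
        # Work in log space so the product of many small numbers doesn't underflow.
        log_spam = math.log(n_spam / (n_spam + n_ham))
        log_ham = math.log(n_ham / (n_spam + n_ham))
        vocab_size = len(set(spam_counts) | set(ham_counts))
        spam_total = sum(spam_counts.values())
        ham_total = sum(ham_counts.values())
        for w in tokenize(post):
            # +1 (Laplace) smoothing so an unseen word doesn't zero out the product
            log_spam += math.log((spam_counts[w] + 1) / (spam_total + vocab_size))
            log_ham += math.log((ham_counts[w] + 1) / (ham_total + vocab_size))
        # Bayes' rule: turn the two log scores into P(spam | post)
        diff = min(log_ham - log_spam, 700.0)   # clamp to avoid math overflow
        return 1.0 / (1.0 + math.exp(diff))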

The nice thing about this is that it's computationally light and straightforward to implement in any language. I have no clue as to your stack, but if you have Python on your backend then sklearn is a good library that has a naive Bayes classifier (plus a lot of other, better options). Any post with a high probability of being spam I'd automatically flag and remove by default, with the option for the user to ask for manual review. The main thing you'd need for this or any fancier approach is a dataset of spam/non-spam posts. If you have an easy way of retrieving past posts that were labelled spam, that should let you make a fine dataset. If you don't want to train on your own user posts (although the only information kept here is word counts), you can look online for spam datasets and use one of those to train your classifier.
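
For the sklearn route, the whole pipeline is only a few lines. This is a rough sketch assuming you can build `posts` (a list of post bodies) and `labels` (1 for spam, 0 for ham) from your existing data; the 0.95 threshold and the moderation hook are placeholders you'd tune and wire up yourself:

    # Rough sklearn sketch; `posts` (list of post bodies) and `labels`
    # (1 = spam, 0 = ham) are hypothetical names for your training data.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(posts)      # word-count matrix
    clf = MultinomialNB()
    clf.fit(X, labels)

    # Score a new post; column order follows clf.classes_, so with 0/1 labels
    # column 1 is P(spam).
    new_X = vectorizer.transform(["cheap pills, click here"])
    p_spam = clf.predict_proba(new_X)[0][1]
    if p_spam > 0.95:                        # threshold you'd tune on held-out data
        flag_for_removal()                   # placeholder for your moderation hook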


I used SpamBayes a few years ago http://www.spambayes.org/ (Is the project dead now?) (It has a PSF licence https://en.wikipedia.org/wiki/Python_Software_Foundation_Lic... https://en.wikipedia.org/wiki/Comparison_of_free_and_open-so...)

The nice part is that SpamBayes gives you two numbers, the spam "probability" and the ham "probability". When one of them is very close to 1 (like > .99) and the other is very close to 0 (like < .01), there is a good chance that the message really is spam or ham, and this classifies almost all the messages. But from time to time you get a message where the numbers are not so clear, or both are big or both are small; that means the classifier is confused and you really must take a look at the message.
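
In case it's useful, the triage rule I'm describing is basically this (spam_score and ham_score stand in for the two numbers, however you obtain them -- this is not the actual SpamBayes API, just the decision logic):

    # Sketch of the triage rule described above; spam_score and ham_score are
    # whatever two numbers your classifier reports.
    def triage(spam_score, ham_score):
        if spam_score > 0.99 and ham_score < 0.01:
            return "spam"    # confident: handle automatically
        if ham_score > 0.99 and spam_score < 0.01:
            return "ham"     # confident: deliver normally
        return "unsure"      # classifier is confused: send to a human for review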


Wow, when this came out (I think this was the ‘original’) it felt quite groundbreaking. Perhaps it was the early 2000s?

Then Google started doing that, or something similar, at scale, and it has effectively eliminated spam from my mailbox ever since. (With the curious recent exception of some highly similar bitcoin spams.)



