Autospam and Naive Bayes (pixelfed.blog)
135 points by pixelfed on Aug 14, 2023 | 38 comments



I have some history with Naive Bayes for email filtering[1][2] and it doesn't surprise me that it's still a viable technique. It's incredibly fast to execute, and for spam it was fascinating to see that as spammers tried to evade Naive Bayes they were actually increasing the number of tokens that made their spam look like spam. And it generalized to multiple categories. To this day I receive messages from people saying they are using POPFile for multi-category email filtering.

As I pointed out in 2004[3], one technique around Naive Bayes (and other machine learning systems) was to pit machine learning against machine learning[4].

[1] https://getpopfile.org/

[2] https://en.wikipedia.org/wiki/POPFile

[3] https://blog.jgc.org/2023/07/how-to-beat-adaptivebayesian-sp...

[4] https://en.wikipedia.org/wiki/Adversarial_machine_learning


Yea, I remember using Bayes filtering back in the early 2000s and it was very effective. Pretty sure it was in the SpamAssassin Perl application.

One thing we saw that was more problematic was spam getting shorter and the use of image-only spam. It turns out a lot of human-to-human communication is just short, single-sentence blurbs between people. We saw a lot of junk spam that didn't really have a call to action, but seemed to exist just for the purpose of making Bayes classification more difficult and error prone.


Had been using popfile for years before gmail and it had always been great.

I think that's also my earliest encounter with the concept of a "local web UI", while everything else at the time needed some this for configuration. It's also truly cross-platform as it's written in Perl.


I remember someone saying to me in about 2005 that the "true genius" of the UI was that it had a localhost web UI. I did that because I wanted it to be cross-platform and I didn't want to use Tk or something. It just seemed natural that apps would use the web browser for configuration. And yeah, I used Perl, and I (and others) worked hard to make sure it coped with all sorts of platforms.


s/this/gui/, sometimes gboard makes bizarre auto corrections.


I was working in NLP from about 2014 to 2020. At the start, NB was indeed generally the best-performing baseline model you could use, and we would train it for every task. However, we came to realise that Facebook's FastText model[1] was almost always the better choice as soon as your training set was more than a few hundred samples. Some advantages:

- accuracy generally a few points better than NB

- better generalisability, due to the word embeddings acting as a bottleneck on expressiveness compared to NB or logistic regression, which essentially model all words / bigrams as independent

- trained with cross-entropy, meaning that model scores can be used more effectively as a 'confidence' - e.g. for spam, if you want to say something like "if prediction score > X, then filter", Naive Bayes is not ideal due to the 'naive' assumption, which makes the scores poorly calibrated (it tends to give extremely high or low confidence scores, and this gets worse with document length).

- is completely linear (or at least log-linear like NB), so explainability is super simple.

disclaimer: I haven't really thought about NLP for about 3 years so there may be something better than this now

[1] https://github.com/facebookresearch/fastText
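
For anyone who hasn't used it, a minimal sketch of training such a baseline with the fastText Python bindings might look roughly like this (the file name, hyperparameters and example text are made up; fastText expects one example per line prefixed with __label__):

    import fasttext

    # train.txt contains one labelled example per line, e.g.
    #   __label__spam WIN FREE MONEY NOW
    #   __label__ham minutes from yesterday's meeting
    model = fasttext.train_supervised(
        input="train.txt",
        epoch=25,        # more passes help on small datasets
        wordNgrams=2,    # include bigrams as features
        dim=50,          # embedding dimension (the "bottleneck")
    )

    labels, probs = model.predict("free money offer, click now")
    print(labels[0], probs[0])   # e.g. ('__label__spam', 0.97)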


Theoretically, any discriminative classifier that directly optimises the conditional distribution, rather than modelling the underlying joint distribution, will do better. Logistic regression is an example of a discriminative classifier.

Even better, however, is if you maximize the margin, so a large margin discriminative classifier should be your baseline. I believe that the classification layer in fasttext is equivalent to logistic regression (but not large-margin).

All that said, I would probably use vowpalwabbit as a baseline, as it doesn't use word vectors underneath, is extremely fast and easy to use, and has many optimization options. This way you can determine whether word vectors help your particular problem or not.


All bag-of-words models use some kind of word vector - for a normal logistic regression / naive bayes, say "spam" and "spammer" are the first and second indexed words in the vectorizer; then their word vectors are one-hot, like [1, 0, 0, ....] and [0, 1, 0, ....] (length = vocab size). For both logistic regression and fasttext you get the 'document vector' by adding up the respective word vectors before applying a final linear projection + softmax to get the class predictions [1]

The insight of fasttext is to notice that these high-dimensional, one-hot word vectors aren't an ideal way to learn - since they are orthogonal, at training time a datapoint giving signal between the "spam" word vector and the label gives you no information about "spammer", or indeed any other word. Explicitly modelling and learning word vectors helps with this, as now those are two points in a vector space that can be moved relative to one another.

[1] this remains true for LR if you rescale the vectors using e.g. TFIDF or use the hashing trick (a la vowpalwabbit).
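
A rough numpy sketch of the difference (vocabulary, dimensions and weights are made up): with one-hot vectors the "embedding matrix" is just the identity, so summing word vectors gives back the raw counts; with learned embeddings the same sum lands in a low-dimensional space where related words can share signal.

    import numpy as np

    vocab = {"spam": 0, "spammer": 1, "hello": 2}
    doc = ["spam", "spammer", "spam"]

    # One-hot / bag-of-words: summing word vectors just recovers the counts.
    one_hot = np.eye(len(vocab))
    bow_doc = sum(one_hot[vocab[w]] for w in doc)    # -> [2., 1., 0.]

    # fastText-style: average learned d-dimensional embeddings instead,
    # then apply a linear layer + softmax to get class scores.
    d, n_classes = 4, 2
    rng = np.random.default_rng(0)
    E = rng.normal(size=(len(vocab), d))             # learned word embeddings
    W, b = rng.normal(size=(n_classes, d)), np.zeros(n_classes)

    doc_vec = np.mean([E[vocab[w]] for w in doc], axis=0)
    logits = W @ doc_vec + b
    probs = np.exp(logits) / np.exp(logits).sum()    # softmax over classes
    print(bow_doc, probs)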


Yes, when I said word vectors I meant rich embeddings, not one-hot representations.

That said, in reviewing the fastText "bag of tricks" paper's description of their classification module, I'm now second-guessing my assumption that they use complex embeddings. Their architecture is otherwise exactly reproducible in vowpalwabbit, and in fact in the paper they claim that it is equivalent to a specific combination of vowpalwabbit flags.

In particular, the vowpalwabbit neural network flag is needed. However, vowpalwabbit only uses one-hot vectors for its n-gram features. The neural network flag just adds a hidden layer.

I had assumed that fasttext uses rich word embeddings in its classifier because it has another module to train them.

If it is actually the same as vowpalwabbit, then I can say that I’ve never had the extra hidden layer really help, though as they note it does make vowpalwabbit quite slow.


fasttext's word embedding is equivalent to adding a hidden layer, as long as you DON'T put a nonlinearity on it, if that helps.


IMHO one of the most overlooked features is that it's the best eXplainable AI out there. Spam scores are easily understandable and correctable. You can even build quite digestible nomograms [1]. Still, everyone uses decision trees as the go-to XAI example, while in my experience they are very unstable and quite indigestible after a depth of 3. I also believe Naïve Bayes is quite undervalued. What many people ignore, however, is that you also need to estimate a density function for continuous variables, which, like with many kernel-based methods, might be the rather interesting and sometimes not-so-naïve part.

[1] https://orange3.readthedocs.io/projects/orange-visual-progra...
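
On the continuous-variable point: scikit-learn's GaussianNB, for example, sidesteps proper density estimation by assuming each feature is Gaussian within each class. A tiny sketch with made-up features (message length, link count), not the Orange nomogram workflow:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    X = np.array([[1200, 9], [950, 7], [80, 0], [120, 1]], dtype=float)
    y = np.array(["spam", "spam", "ham", "ham"])

    gnb = GaussianNB().fit(X, y)
    # Per-class feature means are the (very naive) density estimates:
    print(gnb.classes_, gnb.theta_)
    print(gnb.predict_proba([[1000.0, 8.0]]))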


The issue with NB for explainability is that the model scores can be very badly calibrated due to the "naive" assumption. You find, especially on longer documents, that the NB scores basically clump around 0 and 1, due to multiplying a bunch of dependent scores together as if they were independent. This means you can't really use them to assess how 'confident' the model is in its decision.

IMO logistic regression, or better fasttext (rank-limited logistic regression) should be the "default" baseline models you should use for NLP classification. They are trained with cross-entropy loss, so the scores are generally fairly well calibrated to an actual confidence score. Moreover since everything is linear (or at least logit-linear), you can do all the same explainability tricks as with NB (a little more effort for fasttext, but it is doable).
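
A toy illustration of the saturation problem (made-up data): repeating the same evidence doesn't change what a human would conclude, but Naive Bayes multiplies the per-token likelihood ratios again each time, so its "probability" races towards 1 as the document grows.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train = ["win free money now", "cheap pills free offer",
             "meeting at noon tomorrow", "lunch with the team today"]
    y = [1, 1, 0, 0]  # 1 = spam, 0 = ham

    vec = CountVectorizer()
    nb = MultinomialNB().fit(vec.fit_transform(train), y)

    short = "free offer today"
    for doc in (short, " ".join([short] * 10), " ".join([short] * 50)):
        p_spam = nb.predict_proba(vec.transform([doc]))[0, 1]
        print(len(doc.split()), round(p_spam, 6))   # P(spam) climbs towards 1.0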


No mention of the spammers' counter-attack, which was to stuff their spam with keywords to cause Naive Bayes to misclassify important emails as spam, and thereby get it turned off due to unacceptable false positive rates. Spam filtering is tough because it is an adversarial problem.


It's tough because broadcasting on the internet is essentially free.

They solve this in the RF world by licensing who gets to broadcast. The only spammers on newspapers, radio, tv, post, satellite are people with a lot of cash.


The Fediverse is slightly better than email because there is a cost attached to the domain from which you send spam. Vanilla SMTP provides zero authentication, so anyone can impersonate any domain. Technologies like DKIM and SPF are relatively new and not everyone uses them.

ActivityPub, however, requires all POST requests to be signed by the server that sends them. The signature is made with an RSA key of the actor that performed the activity. The public key of that pair is inside the JSON object of the actor. So, to verify the signature for an incoming activity, you'd make a request back to the originating server to fetch the actor object. This makes spoofing impossible and domain bans practical.
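
A hedged sketch of what that verification looks like in practice. This is my own illustration of the draft-cavage HTTP Signatures scheme most implementations use, not Pixelfed's actual code, and it assumes the caller has already parsed the Signature header and added the (request-target) pseudo-header:

    import base64
    import requests
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import padding

    def verify_inbox_post(headers, sig):
        # sig is the parsed Signature header, e.g.
        #   {"keyId": "https://example.social/users/alice#main-key",
        #    "headers": "(request-target) host date digest",
        #    "signature": "<base64>"}
        # 1. Fetch the actor document named by keyId to get publicKeyPem.
        actor_url = sig["keyId"].split("#")[0]
        actor = requests.get(
            actor_url, headers={"Accept": "application/activity+json"}
        ).json()
        public_key = serialization.load_pem_public_key(
            actor["publicKey"]["publicKeyPem"].encode()
        )

        # 2. Rebuild the signing string from the headers listed in the signature.
        signing_string = "\n".join(
            f"{name}: {headers[name]}" for name in sig["headers"].split()
        )

        # 3. Verify RSA-SHA256; raises InvalidSignature if it doesn't check out.
        public_key.verify(
            base64.b64decode(sig["signature"]),
            signing_string.encode(),
            padding.PKCS1v15(),
            hashes.SHA256(),
        )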


Naive Bayes is also surprisingly powerful for deduplication of large datasets of messy data (and the related problem of record linkage).

It's still competitive with cutting edge approaches if carefully applied, and usually much faster so it can be applied to big data.

It's the engine behind Splink, the free Python library that I develop: https://github.com/moj-analytical-services/splink


> Naive Bayes is also surprisingly powerful for deduplication of large datasets

Thanks for this! I implemented Entity Resolution from scratch 2 years back and was mostly able to do deterministic linkages using identifiers or using TF-IDF on names as a suggestion for human intervention. Will be writing about my learnings in upcoming months [1].

Was mostly looking at research papers before but will take a look at your project - your documentation looks good!

[1] https://www.sheshbabu.com/posts/entity-resolution-challenges...


This is a very cool tool. I was testing fastlink recently but found some issues with the docs and implementation. I will check out your tool.

Congrats on winning all those awards.


Thanks, appreciate it.

Fastlink was the inspiration for Splink - the fundamental statistical model is very similar. The first version of Splink was essentially a port to make it work faster and at greater scale, but we've subsequently added quite a bit of additional functionality.

Feel free to ask questions if you run into any issues - we're usually fairly good at responding: https://github.com/moj-analytical-services/splink/discussion...


I had a friend who had a record linkage problem which interested me, and so I dove down a rabbit-hole of Markov logic, but I didn't see any mention of the Naive Bayes approach when doing so. Although I don't really have any expertise in the area anyway, so it would be easy for me to miss.


The model is usually called the Fellegi Sunter model, but once you get into the maths, it's actually the same as Naïve Bayes. If you're interested I've got a blog post that explains this here: https://www.robinlinacre.com/maths_of_fellegi_sunter/
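
The core of the equivalence, in my own notation (see the post for the full derivation): for comparison outcomes gamma_1, ..., gamma_n on the n fields,

    \log \frac{P(\text{match} \mid \gamma_1,\dots,\gamma_n)}
              {P(\text{non-match} \mid \gamma_1,\dots,\gamma_n)}
      = \log \frac{P(\text{match})}{P(\text{non-match})}
      + \sum_{i=1}^{n} \log \frac{m_i}{u_i},
    \quad m_i = P(\gamma_i \mid \text{match}),\;
          u_i = P(\gamma_i \mid \text{non-match})

which is exactly the Naive Bayes log posterior odds under the conditional independence assumption; the per-field "match weights" are just the log likelihood ratios.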


Thanks for this, been meaning to read up on the Fellegi Sunter paper for a long time!


I predicted this more than 2 years ago when I was running a fedi node.

Back in 2004 I used to build spam-filtering proxies using Bayesian filters in SpamAssassin. And my experience in the fedi instantly made me realize they would be a quick and easy solution for a lot of fedi spam.

So it's fun to see pixelfed taking the leap. I can't wait to see how it evolves.

The fedi, being very similar to e-mail, is most likely going to see a lot of the same developments as e-mail. For example, domain verification would also be good. And I also predict some sort of reputation system for instance domains.


I'm still a happy SpamAssassin user (still self-hosting email) and this is the first thing I thought of as well, especially as I more often consider spinning up a Pixelfed instance.

The URIBL addition really helped cut down junk; I wonder if it would help here too.


The Bayes algorithm was the core of one of my first shared software projects - a local spam filter plugin for the blog engine Serendipity. The surprising thing back then was how well it worked in practice; the surprise now is that it still works just fine.

It has been used to filter blog comments for over a decade now.

And I mean that with genuine surprise. It was taught to us as the simple method used before, and I was just testing whether it could maybe still be of some use. Mainly because it was easy to understand - it just counts - and I liked the approach with the a priori probabilities. «Definitely still useful» was not the expected outcome.


See also the Paul Graham classics "A Plan For Spam" (2002) [0] and "Better Bayesian Filtering" (2003) [1].

Always love to see "simple" 20yo ideas find a new use case.

0: http://www.paulgraham.com/spam.html

1: http://www.paulgraham.com/better.html


Back in the day, I ran my own email server, and dealing with spam was a constant pain in the butt.

I wrote a spam filter, based on source addresses, "spammy" keywords and such, but it was about as effective as you would expect. Most spam had to be classified manually.

When I first read "A Plan For Spam", I sat down and wrote a Bayes classifier (pretty much a straightforward implementation of his proposal) in a couple of hours. Due to running my own server, I had a ready-made corpus of hand-curated spam and ham on which to train the classifier.

It worked astonishingly well, and I ran that classifier unmodified for another couple of years, until SpamAssassin became good.


"There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies." (Tony Hoare)

Naive Bayes is the former, me thinks.


Pixelfed is doing remarkable work in the fediverse; this just appears to be one of those things that other fediverse projects will adopt pretty quickly.


Just as background for people who don't work with this stuff much, this is my understanding.

Naive Bayes is an application of Bayes' theorem to classification problems in machine learning. The problem it solves is classifying something when you have a fairly large number of features. The example that is frequently used in the ML community is: imagine you have an image and you want to classify it as containing a cat (or not)[1]. The conditional probability you want is P(cat|pixels), or more specifically P(cat|p_1, p_2, ...., p_n) where p_i is a given pixel[2]. The problem is that to apply Bayes' theorem directly you would need the full joint likelihood of all those features together, which you can't realistically estimate, and multiplying thousands of tiny per-feature probabilities together underflows to zero in an actual floating point calculation.

To work around this you make the "naive" (aka very obviously wrong) assumption that the features are conditionally independent given the class, so the joint likelihood factorises into the product of the individual P(p_i|cat). Working in log space, you then just sum log-probabilities rather than multiply raw ones, and the problem goes away. The Naive Bayes classifier turns out to be very useful in spite of the fact that this assumption is definitely wrong. The slightly different problem that arises, which a few other people have mentioned, is that since you are summing a per-word contribution, if your features are words then as your text gets longer the score gets more extreme, so long ham emails could end up pushed over a fixed spam threshold.

[1] The computer vision community seems weirdly obsessed with classifying cat photos

[2] in the spam filter case you're looking at P(spam|w_1, ..., w_n) where the w_i are a bag of words.
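
For the text case, a from-scratch sketch of the above in log space (toy data; Laplace smoothing stands in for a proper estimate of the token probabilities):

    import math
    from collections import Counter

    def train(docs, labels, alpha=1.0):
        # docs: list of token lists; labels: parallel list of class names.
        classes = set(labels)
        vocab = {w for doc in docs for w in doc}
        token_counts = {c: Counter() for c in classes}
        class_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            token_counts[label].update(doc)
        model = {}
        for c in classes:
            total = sum(token_counts[c].values()) + alpha * len(vocab)
            model[c] = {
                "log_prior": math.log(class_counts[c] / len(docs)),
                # Laplace smoothing: an unseen word shouldn't zero the whole product.
                "log_lik": {w: math.log((token_counts[c][w] + alpha) / total)
                            for w in vocab},
                "log_unseen": math.log(alpha / total),
            }
        return model

    def classify(model, doc):
        def log_posterior(c):
            m = model[c]
            # Summing logs == multiplying the (naively independent) P(w|class).
            return m["log_prior"] + sum(m["log_lik"].get(w, m["log_unseen"])
                                        for w in doc)
        return max(model, key=log_posterior)

    docs = [["free", "money", "now"], ["cheap", "pills"],
            ["lunch", "tomorrow"], ["meeting", "notes"]]
    labels = ["spam", "spam", "ham", "ham"]
    print(classify(train(docs, labels), ["free", "pills", "now"]))   # -> spam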


If folks are curious about what it was like fighting spam back in the early 2000s, I used to work at a firm that was contracted by some of the big email providers.

Our job was to take spam, identify who was sending it or, at a minimum, who it was advertising for and then forward that on to the law firms of the email providers.

This was right around the time that spam legislation was coming out so it was interesting to see how it was being applied directly to people we found.

I write a lot more detail about it here: https://twitter.com/alexpotato/status/1208948480867127296


The cost of implementing a model that can do P(Y|X) can, unsurprisingly, be much lower than for something that can model the joint P(X, Y), and that's not just because it's an easier problem but also a matter of economics and what just works easily.

But the future likely looks like an "LLM of everything spoken/written", and given that you can recover both P(Y|X) and P(X|Y) from a model that can do P(X, Y), LLMs could just win because of integration with everything.
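
For reference, the factorisation being relied on is just

    P(Y \mid X) = \frac{P(X, Y)}{\sum_{y'} P(X, y')},
    \qquad
    P(X \mid Y) = \frac{P(X, Y)}{\sum_{x'} P(x', Y)}

so anything that models the joint well enough gives you both conditionals, at least in principle.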


Maybe a good place to ask why my Thunderbird spam filter seems broken? It used to work fine, then something happened (maybe I deleted the emails in Junk but that shouldn't affect the corpus). Now it doesn't seem to learn anything and I keep seeing the same Spam in the Inbox. This is with v102 on Win 10. Also with a few previous versions.


TL;DR, without the dramatic narration:

Naive Bayes is an old spam-filtering technique that treats words independently of each other, i.e. without assuming any grammatical or semantic structure in what it processes.

That makes it simple, yet efficient as a first line of defense against "dumb keyword-stuffing spam".

(No LLM used, if that ever means anything these days).


True. It can also be extended to work on bigrams instead of individual words; I'm not sure how often this is done in spam filters. For an in-depth explanation of how it works, I suggest the excellent book by Jurafsky and Martin: https://web.stanford.edu/~jurafsky/slp3/4.pdf


> Naive Bayes is an old spam filter that works on words independently from each other

It should be pointed out that a lot of ML models operate on a 'bag of words' approach, where words (tokens) are considered without regard for structure/order. Bag of words is surprisingly powerful on its own, and it can be expanded to include n-grams (n words in order), which makes it even more effective.

Punctuation (where punctuation is counted as a token) is a crazy powerful predictor in text classification including spam.
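
A small scikit-learn sketch of that kind of pipeline (made-up data; the token_pattern is just one way to keep selected punctuation marks as tokens of their own):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    clf = make_pipeline(
        CountVectorizer(
            ngram_range=(1, 2),              # unigrams + bigrams
            token_pattern=r"[\w']+|[!?$%]",  # keep !, ?, $, % as tokens
        ),
        MultinomialNB(),
    )
    clf.fit(["FREE $$$ money!!!", "WIN NOW!!!",
             "see you at lunch", "minutes from the meeting"],
            ["spam", "spam", "ham", "ham"])
    print(clf.predict(["free money now!!!"]))   # -> ['spam']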


Yeah a good next easy step could be to use word2vec and build a quick linear regression model.


Just wait until they discover Bayes on n-grams. Or even better FastText.



