The funny rules of SpamAssassin in 2023 (updown.io)
230 points by alexis2b 9 months ago | hide | past | favorite | 90 comments



I've been using SpamAssassin for at least 15 years and it's sadly gotten less useful as the spam arms race has moved on. We regularly see people on here post about deliverability issues with Gmail/Outlook but the truth is that sender reputation is by far the biggest indicator of whether a message will be spam - these types of rules are just counting deckchairs on the Titanic in comparison.

And this plays into the strengths of the big mail networks in detection. It's a bonus to them that every time they block a smaller host there is a good chance that sender will consider a move to office365 or Google Workspace for their mail.

As an aside, not sure if OP is related to them but updown.io is a nice service and I appreciate the simple PAYG pricing! For what it's worth their mails seem to get through successfully to me too.

Also for those facing mail delivery issues (or just practicing good email hygiene) - I recommend www.mail-tester.com - they give you an email address to send a mail to and carry out a heap of tests - including checking against SpamAssassin + blacklists, SPF/DNS/etc testing.


> It's a bonus to them that every time they block a smaller host there is a good chance that sender will consider a move to office365 or Google Workspace for their mail.

The irony is that a substantial amount of the spam I receive comes from those platforms.


Are you certain the spam is actually coming from IP addresses controlled by those platforms? It's common for spammers to fake the SMTP headers.


50% of my spam are from a single Google Groups source - unsolicited job applications, all ending:

  You received this message because you are subscribed to the Google Groups "jan-09" group.
  To unsubscribe from this group and stop receiving emails from it, send an
  email to jan-09+unsubscribe@googlegroups.com.
  
  To view this discussion on the web visit https://groups.google.com/d/msgid/jan-09/014101da368d$0bdc8160$23958420$@gmail.com.

I cannot email that unsubscribe link because it says I am not subscribed. I cannot visit that page, I have not subscribed to that group. I've had to set up a special filter to look for that footer.

I am not the only one with this issue. See https://support.google.com/groups/thread/68075070/i-get-goog... .

... Wait! You've indirectly helped solve the issue!

They are being sent to "info@" my domain, an alias that forwards to my real account. I set up a new outgoing account with that From address, sent from there, and managed to get Google to unsubscribe something I never agreed to in the first place.

It's been like this for a year, and with multiple attempts to fix it.

Thank you!


This reminds me, it would be nice if Google had an easy way to blacklist a sender and/or subject such that it wouldn't even go to spam. I have enough spam false positives that I needed to scan my spam folder periodically, and I'd love to have a way to permanently filter out regular crap in there. I've created explicit filters for some of the more prevalent ones, but a one click blacklist button on spam emails would be great. (Along with some way to edit the blacklist in case of mistakes.)


Quite a bit is "genuine", at least with Gmail, because infinite monkeys can sign up for infinite accounts.


I had spam bounces coming from Microsoft. Someone had convinced MS that they owned my domain and was apparently sending spam via an MS SMTP server (failing SPF was apparently not a problem for them), and any that bounced were being sent to my server: the real mail server for that domain. I reported the malicious org and explained what I had found out, but they obviously denied that it could possibly actually come from themselves and misunderstood what was happening (took me a bit to puzzle that out as well, those mystery emails showing up in my inbox). MS sending out spam isn't my problem, but I figured I'd be nice. Alas.

Few months later, they started bouncing my server's new IP address and that, too, wasn't their fault of course: "we're not seeing a block for your IP address so there cannot be bounces". Denying reality is super effective. The punchline was that they had blocked the new ISP's whole range rather than just my IP, so they weren't getting any hits when searching for my IP address. I found this out through some back-and-forth with a friendly sysadmin at the ISP, who was also banging their head against MS' wall...

These people must be so underpaid they're probably giving MS money for the privilege to work for such a correct business


I get plenty of spam from Gmail accounts with SPF and DKIM passing.


Plenty of dumps of stolen personal Gmail usernames+passwords, that anyone can feed into a bot that will use browser automation to sign into Gmail on those accounts and “hand write” some spam messages to send.

(If you haven’t realized, this is why Gmail has SMTP message origination disabled by default — these days requiring not only enabling it for your Gmail account, but also fiddling with app passwords to get it working. If it was enabled by default, the “spam from stolen credentials” problem would be so, so much worse. Whereas, at least with the webapp route, Google can block you if you look like a bot [i.e. if you’re doing an insufficiently good job at fooling them.])


I sometimes get legitimate Google or MS dev newsletter emails going into their own spam folders :) .


I've seen mail from my work Google Workspace that I sent to my own personal Gmail get flagged as spam. It's me sending to me. Google to Google. Logged into both accounts on the same computer.

If anything I'm nervous to recommend Google because they flag too many legitimate emails as spam. After years of not checking, I'm checking spam again.


> mail from my work Google Workspace that I sent to my own personal Gmail

Does your company do outbound marketing/sales?

I've seen multiple companies spin up outbound email marketing campaigns where someone compiles a list of 5000 email addresses based on certain demographics, and then send automated emails (that look not automated) over the course of a month, rinse, repeat. Google Workspace will let you do this, but if you're too aggressive with email volume it can kill the reputation, and therefore deliverability of any email from that domain.

(Which is why most companies send outbound sales emails from a domain other than their primary domain to separate out the sending domain reputation)


> Does your company do outbound marketing/sales?

Good guess, but we don't. I also checked DKIM/SPF when this happened and all appeared in order.


There is a significant amount of spam coming from Google accounts, yes. Just think about all the “sales automation” junk that businesses use.


Because of course, it's an arms race.


> Because of course, it's an arms race.

Somewhat out of context, but greylisting works as well as the day it came out.


Probably should put "works" in quotes...


I've been using it for over a decade and I have only one domain I've had to make a rule in Postfix for because their admins don't know how to configure their racks of SMTP servers.


I like rspamd much more (performance and redis) than SpamAssassin, and as you mentioned:

-https://www.mail-tester.com

-https://www.learndmarc.com

-https://mecsa.jrc.ec.europa.eu/en/

are excellent tools to check your "deliverability".


Surprised to not see https://mxtoolbox.com in this list too


Truetrue sorry, I forgot ;)


I switched from GMail to a personal Microsoft 365 domain when Google decided they didn't want to give me free email/domain services anymore. 365 was cheaper. I got about 10x the amount of spam to my 365 Junk folder than I did to the Junk folder in GMail. I would spend 10 minutes a day going through the junk folder to pick out false positives. I would have inexplicable issues with missing email with 365, where the root cause was always SPF issues from a third party sender. The big issue was event tickets mailed from a third party ticket service provider using the venue's domain name rather than the ticket provider's domain.

I switched back to GMail a few months ago, and not only do I see less stuff in my Junk folder (indicating Google is blocking stuff rather than identifying it) but also I have not seen a single false positive. Hopefully that means Google is more effective, but there's no way to tell if I'm missing legitimate email. So far, no complaints.


Microsoft's spam filter is fundamentally broken. It's been that way for decades. There's an entire cottage industry of snakeoil salesmen that want to sell in-line antispam gateways to bolt onto 365, and the worst part is that they have a very good reason to exist...


Strange, while I keep my GMail address I don't use it for anything new anymore since roughly 50% of the positives are false (no false negatives, though).


> As an aside, not sure if OP is related to them but updown.io is a nice service and I appreciate the simple PAYG pricing! For what it's worth their mails seem to get through successfully to me too.

Not related in any way except as a happy customer. They added a blog recently and this article caught my eye because of the nightmare that mail delivery issues are for everyone.

I found it particularly ironic that you now have to think like a spammer (i.e. look at spam detection engine source code to find a way to circumvent their heuristics) in order to get your totally valid email delivered (^_^).

edit: typo


Thank you


There needs to be like a Mozilla vs Chrome thing here, no? What's the best attempt so far at something like Let's Encrypt or the Mozilla Foundation for email not owned by big tech, so that instead of "will consider a move to office365 or Google Workspace for their mail" the sender has this other awesome option?


If you wanted to operate a haven for independent email hosting, where you want to assure deliverability in the face of Gmail's sender reputation system, you would need to classify your outbound traffic, and have a death penalty for spammers. If you tolerate any activity that peers classify as spam, that would tank your reputation.


> ... the truth is that sender reputation is by far the biggest indicator of whether a message will be spam

I couldn't agree with this more. I want people to remember this whenever the topic of decentralization or federation comes up. People see this as a technical problem. It's not. It's a political and organizational problem. Even with email, which is fully decentralized (other than the ICANN TLDs), running your own node is still incredibly difficult. And those reasons aren't technical at all.


I've kinda given up on reputation scores to indicate spam/ham, personally, and rely more heavily on textual analysis rules. Going by "reputation" caused me far too many false positives.


Reputation works well because of those other rules. If every office365/gmail email got through and everything else was blocked, spammers would just move to those platforms. Thus email inspection is a critical component enabling reputation-based filtering.


I love the analysis. But I hate that the 'fixed' email ends up being wordier for no reason at all.

Brevity has value. Having to bloat content (an email to get past anti-spam; a cooking blog to rank better within Google SEO; ...) brings back memories of high-school English papers, or the modern equivalent, ChatGPT.


100% agree, I also hate that I had to do this.


Another piece of feedback: the link doesn’t look like a link any more. It wasn’t great before, but the verbiage made it adequately clear. But now it’s terrible, because the wording doesn’t suggest an action, and it doesn’t look like a link or a button. You should either restore its underline and lean into “link”, or give a background colour or (generally better) gradient and lean into “button”. But when it’s just a border, it doesn’t look like a button, especially when there’s a tick after it. And change the wording again.


Thanks for this feedback, I actually changed this because some of my clients complained of the opposite, that the link was a bit too "dim" and didn't look like the obvious Call To Action in the email. But it's all very debatable I agree and I may change this again in the future.


The link style wasn’t great before, just darker black with a faint underline. For best results, links should be underlined blue.


Couldn’t you add some “hidden” text instead, e.g. white on white or display:none?


I could but I don't want to, it's even more of a dark pattern and looks way too "spammish" IMO. I don't want my users to find this in their email and think that I'm trying to trick their system. Also I wouldn't be surprised if some antispam tries to detect this as a spam criterion.


Putting your outbound emails through SpamAssassin as part of a regression test sounds like a really good idea - would have never thought of doing that myself!


Having the rules public seems to take away most of the benefits...

Any smart spammer will just tweak his spam to not hit these rules... And if he hasn't, it's because the vast majority of people don't use SpamAssassin


The vast majority of operations in these fields are mind-bogglingly simplistic.

Well-known rules will block most spam, some with occasional collateral damage but many with no realistic chance of collateral damage.

Entity-encoding @ as &#64; in email addresses in HTML will block the vast majority of email address harvesters, with no collateral damage.

Adding a honeypot field to an HTML form, with the label “If you are human, leave this field blank” and hidden by CSS, will catch practically all spam submissions, with no collateral damage.
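
A minimal Python sketch of both tricks, for illustration only. The regex, the function names and the honeypot field name `website` are assumptions made up for this example, not anything from a particular framework.

```python
import re

# Loose pattern for things that look like email addresses (an assumption,
# deliberately simple rather than RFC-complete).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def obfuscate_emails(html_text: str) -> str:
    """Replace the @ in every email-looking string with the entity &#64;.

    Browsers decode the entity, so humans still see and click a normal
    address, while naive harvesters grepping the raw HTML for "@" miss it.
    """
    return EMAIL_RE.sub(lambda m: m.group(0).replace("@", "&#64;"), html_text)


def is_probably_bot(form: dict, honeypot_field: str = "website") -> bool:
    """Return True if the CSS-hidden honeypot field was filled in.

    Humans never see the field and leave it empty; form-filling bots
    tend to populate every input they find.
    """
    return bool(form.get(honeypot_field, "").strip())


if __name__ == "__main__":
    print(obfuscate_emails('<a href="mailto:info@example.com">info@example.com</a>'))
    print(is_probably_bot({"name": "Alice", "website": ""}))
    print(is_probably_bot({"name": "Bot", "website": "http://spam.example"}))
```

Submissions flagged by `is_probably_bot` can simply be dropped server-side without telling the client why.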


>smart spammer

I am sure there are plenty of smart spammers, but it also seems like a lot of spam comes from folks using scripts and email lists they use without fully understanding. It appears SpamAssassin would help with those operations.


I'm starting to think the smart spammers are the ones selling worthless spam tools to the dumb spammers, because so much spam is entirely unactionable.


Part of the smart spammer approach is to condition people to what spam looks like, so you're more likely to let through the ones they really care about.


Spammers seem to be really lazy. The only time I’ve received spam on my own domain was when I changed servers and forgot to enable the Postgrey service. Grey-listing has been around for long enough (a couple of decades) so I would have expected spammers to be resending emails that are rejected with a temporary error.

So I wasn’t expecting Postgrey to provide much benefit. As it happens, in 10 years of running my own mail server, it’s the only anti-spam measure I’ve had to bother with.
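
For anyone who hasn't met greylisting, here is a rough Python sketch of the idea. It is not Postgrey's actual code: the class name, the 300-second delay and the reply strings are assumptions chosen for illustration.

```python
import time


class Greylist:
    """Temporarily reject the first delivery attempt from an unknown triplet.

    A triplet is (client IP, envelope sender, envelope recipient). Real mail
    servers retry after a temporary error; much spam software never does.
    """

    def __init__(self, min_delay: float = 300.0):
        self.min_delay = min_delay  # seconds a new triplet must wait (assumed value)
        self.first_seen = {}        # triplet -> timestamp of first attempt

    def check(self, client_ip: str, sender: str, recipient: str, now=None) -> str:
        now = time.time() if now is None else now
        key = (client_ip, sender.lower(), recipient.lower())
        # Record the first attempt; later attempts reuse the stored timestamp.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "250 OK"
        return "450 4.7.1 Greylisted, please try again later"


if __name__ == "__main__":
    gl = Greylist()
    args = ("198.51.100.4", "alice@example.org", "bob@example.com")
    print(gl.check(*args, now=0))    # first attempt: temporary rejection
    print(gl.check(*args, now=60))   # retried too soon: still rejected
    print(gl.check(*args, now=600))  # retried after the delay: accepted
```

A production setup would also expire old triplets and whitelist senders that have already passed once.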


Hi, Adrien here, author of this article (and of updown.io). That is true and I actually hesitated to write the article for this reason, because it could make the spammer's life easier. But after seeing some of the legacy and nonsense in here I thought it was still worth it so people at least understand what they are using.


> Any smart spammer will just...

Spam is all about high-volume/~no-cost delivery of crap. Time spent tweaking the spam - to evade $Defense_1, $Defense_2, etc. - is added cost. Especially if $Defense_n is only used by a few of the prospective victims (folks too savvy or paranoid to be suckered do not count), then tweaking to get around $Defense_n is a losing strategy for the spammer.


> it's because the vast majority of people don't use SpamAssassin

Bingo. Not that there aren't a lot of people running SA, but spammers want to be able to deliver to the big players(1) (gmail, o365, etc), not the small folks out there running SA. It's not worth their time to devote effort to optimizing for a rounding error in the deliverability equation.

(1) Unless they're selling 'targeting' services where you're paying to deliver to a specific domain/user which might be behind SA. Plenty do, but that's a little bit farther down the criminality spectrum and vastly less volume than shilling peener pills or warranty extension scams.

edit: formatting


Spam is a minimal effort endeavor. A lead generator for scams. Even a 0.1% or 0.01% response rate is good. They only want the unsophisticated, naive recipients. Expending effort to tweak rules is not worth it. (Although if they can get AI to do it, then that's back to near zero effort)


It’s effectively no different from a spammer running their spam mails through a local SpamAssassin to see how it goes, and tweaking them until they pass.


I tried using SpamAssassin (via Proxmox Mail Gateway, which makes it much easier to set up) to replace a Barracuda email appliance (it was destined to get a *6x* service price increase in 2024!), and after several months of trying to get the number of FPs down, I gave up.

The problem wasn't just the number of FPs (which were much higher than the 'Cuda) -- it was that they came from real people, who were often common senders. This is not corporate email, or anything that was even remotely spam (except as SA's crazy ruleset determined). These all required whitelisting, and it became a real chore for all my users to keep up with all the whitelisting.

So back to the Barracuda for another year. It lets a little more spam through, but virtually no FPs. I just couldn't make SA get the same performance, even with many tweaks to the weights and rulesets.


It's been a very long time since I ran a mail server, but for a decade or more I pumped all our outgoing mail through Hashcash because it gave a good boost to the Spam Assassin score. We'd crank it through the largest one, and it would add ~60sec to the mail delivery, unless we had a bunch of outgoing mail, but it was worth it I felt.


I've been using SpamAssassin since, well, forever, in internet terms. My recent facepalm moment was when I noticed that E-mails from the Playdate developer forum (Playdate is a really cool tiny gaming console) land in my spam folder, because anything in the .date domain (and the forum uses play.date as the domain) is assumed to be "dating spam".


Given the double meaning of “play date”, it’s not surprising that it would cause a higher score, even if it used a different TLD.


I have deny-listed most of those new funny TLDs, as that's indeed a good indication of spam. Here the facepalm should be Playdate's, because they should have realized their domain looks like a spam domain.


Yes, I do the same. It's very useful to me because there is no non-generic TLD that I would be getting legitimate email from, but it may not work well for people who do want to get emails from such TLDs.


I couldn't help but think about mechanistic interpretability research on large neural models reading this — I guess this is what happens when humans do something similar, adding and removing tweaks here and there to better fit this or that case, over a long period of time.


I won the war a while ago now.

I basically trash all emails not in my contact lists. Easy.


Spamassassin is doing its job here, and doing a good job!

Most spammers and marketing/sales sleezoids never think they are doing anything wrong. They are totally empathy incapable. Or they know they are scum and don't care. Either way.

OP talks about adding "invisible text" and other such common spammer tactics to get around some of the rules. Zero self-awareness.

At no point did this person ever think "did I do something wrong?". No, it's that shitty Spamassassin!


Some of those highlighted rules, such as using CC or having the string “can help” being used to decide if something is spam or not, are so absurd I’ll make sure to never use SpamAssassin.


> being used to decide if something is spam or not

Each rule has a score associated with it. By default a message needs to reach 5.0 to be marked as "spam":

* https://spamassassin.apache.org/full/3.0.x/dist/doc/Mail_Spa...

The threshold is configurable. A header is added post-processing, e.g.:

    X-Spam-Status: Yes, score=21.6 required=4.0 […]
* https://cwiki.apache.org/confluence/display/SPAMASSASSIN/X+S...

One can then choose what to do with this information (via procmail or Sieve). There is another header as well:

> X-Spam-Level: This displays your spam level with asterisks, with one asterisk displayed per point, rounded down. For example, if your overall SpamAssassin score is 4.3, it will display ****. If you score less than 1, for example, 0.5, it will display nothing.

* https://www.mailercheck.com/articles/spamassassin-score
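
As a rough illustration of how a downstream script might read those two headers, here is a small Python sketch. The regex only covers the `Yes/No, score=… required=…` prefix shown above, and the function names are made up for this example.

```python
import math
import re

# Matches the leading part of an X-Spam-Status value, e.g.
# "Yes, score=21.6 required=4.0 tests=..."
STATUS_RE = re.compile(
    r"^(Yes|No),\s*score=(-?\d+(?:\.\d+)?)\s+required=(-?\d+(?:\.\d+)?)"
)


def parse_spam_status(value: str):
    """Return (is_spam, score, required) from an X-Spam-Status header value."""
    m = STATUS_RE.match(value.strip())
    if m is None:
        raise ValueError(f"unrecognised X-Spam-Status value: {value!r}")
    return m.group(1) == "Yes", float(m.group(2)), float(m.group(3))


def spam_level(score: float) -> str:
    """One asterisk per whole point, rounded down, as the quoted text describes."""
    return "*" * max(0, math.floor(score))


if __name__ == "__main__":
    print(parse_spam_status("Yes, score=21.6 required=4.0 tests=EXAMPLE_RULE"))
    print(repr(spam_level(4.3)))
    print(repr(spam_level(0.5)))
```

A procmail or Sieve rule does the same job declaratively; the point here is only that the decision is a plain comparison of the summed score against the threshold.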


It is not and has never been a good classifier. If open AI fans want to contribute something of value to society, they would train a spam classifier on a large, manually-labeled corpus of mail, where the features include envelope data. That would get open source maybe 10% of the way to Gmail quality, or 100x better than SA.


FWIW I've been using SpamAssassin for over a decade personally (partly to avoid Google dependence), and it's been pretty darn good once I ran the Bayesian learning thing a few times many years ago. I get like 3-5 spams per week in my inbox. Do others really consider SA that bad?


FastMail uses SpamAssassin, and I get less than one spam email in my inbox a month, with essentially zero false positives (which is the tricky bit where gmail seems to fail – I'd rather have the occasional spam in my email than false positives).

In short: you can probably do better than 3-5 spams per week with SA.

The big problem is the entire thing is a beast to configure with all the documentation of a Babylonian cuneiform stone tablet.


I hate to agree here, but configuring SpamAssassin is pretty rough. That being said, once it's done, it's pretty bulletproof and doesn't usually require messing with it all the time.


> I get like 3-5 spams per week in my inbox

I'm more curious about the opposite metric: how many non-spam emails a week aren't getting delivered to you? Because that seems to be the real flaw in spamassassin: the false positive rate.

And the spamassassin users don't usually have much visibility into this, so when emails don't get to them they just blame the sender.


> how many non-spam emails a week aren't getting delivered to you?

My false-positive rate is very low, maybe a couple per month. However, I can predict with a high degree of accuracy when a piece of email is likely to land in the spam folder. Things like confirmation emails, registration emails, etc. are guaranteed to land in the spam folder. It's pretty hard for any system to accommodate those without allowing spam to get by.

That's fine by me, though, because I know when to check my spam folder.


Fastmail user here, so SpamAssassin I assume: virtually no false positives. My GMail spam folder is generally 50/50 false and true positives. I really can't use that email for anything as having to go to the spam folder every day defeats the purpose of an anti-spam filter.


On Gmail I get maybe 1 spam a month max in my inbox (and it blocks many per day)


That's pretty good. I have no doubt that Gmail spam protection is better than my self-hosted SA protection. For me, independence from a somewhat suspicious large company for something as important as e-mail is worth it.


I have a gmail honeypot where I fetchmail junk email straight to my junk folder and have a scheduled sa-learn cronjob. Ever since I started this I essentially stopped getting junk email in my selfhosted inbox.

I also have dovecot set to learn Ham every time I file an email from the inbox to a folder for good measure.


So...statistically insignificant difference from SA for most mail users.


3-5 spams per week vs. 1 per month is a big difference


I got (if I'm reading this right) 5500 emails (junk+delivered) to my personal mail account from 12/01/2023 to 12/31/2023. So that's a minimum, since I don't see the ones that get flushed out before I even see them.

1 spam a month would be .018% of emails and 5 x 4 spams a month would be .364% of emails

So I would have gotten about .346% more spam based on the number of emails. In reality, because I don't see all of the mails, it's less. Is a touch more than a third of a percent a 'big difference'? YMMV.
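
The arithmetic above, redone in Python as a quick check. The inputs (5500 emails, 1 spam versus 5 x 4 spams a month) come from the comment; the variable names are only for this sketch.

```python
total_emails = 5500      # junk + delivered, from the comment
spam_low = 1             # about 1 spam a month
spam_high = 5 * 4        # up to 5 a week, over 4 weeks


def percent(part: float, whole: float) -> float:
    """Share of `whole` made up by `part`, in percent."""
    return 100.0 * part / whole


low_pct = percent(spam_low, total_emails)
high_pct = percent(spam_high, total_emails)

print(f"low:  {low_pct:.3f}% of all mail")
print(f"high: {high_pct:.3f}% of all mail")
print(f"absolute gap: {high_pct - low_pct:.3f} percentage points")
print(f"relative gap: {spam_high / spam_low:.0f}x as many spams in the inbox")
```

Computed from the unrounded values, the absolute gap is about 0.345 percentage points (the .346 above comes from subtracting the already-rounded figures), while the count of spam messages actually seen differs by a factor of 20.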


It's a 20x difference... My annoyance by spam is unrelated to the detection rate and is entirely related to the number of spam emails I see


I find SA works excellently, personally.


We'll get there eventually, but it will be a bit. Spam classification at scale is already a compute-bound, or at least compute-starved, operation. Spam classification systems already do what they can to avoid so much as invoking a virus scanner if they can avoid it, because at scale it's so expensive. LLM-based spam classification is another order of magnitude more expensive and would require hardware that current spam systems do not have.

But that's a problem that will resolve itself over time, in a variety of ways. And the spam systems can play the same tricks with only invoking it on a fraction of emails too, of course. It's just at current expense levels, that would be a very small fraction indeed. I'd hazard that trying to use modern AI on spam classification at scale could easily consume 10x-100x of all current AI hardware and still make less of a dent than you'd hope.


It doesn't need to be computationally costly because, as you seem to imply, there are tiers of cost tradeoffs. You can invoke a very cheap classifier at SMTP time, that is biased to have few false positives, that will temporarily reject all that which is highly likely to be spam. You can do this without even glancing at the body. Of course, having signals about peer reputation is the strong suit of Gmail or Microsoft, and the distributed, open community would need to solve the problem of promptly updating and distributing such reputation signals. And by "promptly" I mean within seconds of the leading edge of an attack.

Then there are increasing tiers of cost that you would only run after it becomes likely that the message is acceptable. As you say, you would only run an antivirus on a message on the verge of delivery, because decoding the attachment and running the AV (in an expensive sandbox) is so costly.
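
A toy Python sketch of that tiering, assuming three made-up checks ordered from cheapest to most expensive. The peer list, the keyword and the ".exe" test are placeholders for real reputation feeds, content classifiers and sandboxed antivirus runs.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical reputation data; a real system would pull this from a shared,
# promptly updated feed.
BAD_PEERS = {"203.0.113.7"}


def reputation_tier(msg: dict) -> Optional[str]:
    """Cheapest check: looks only at the connecting peer, never at the body."""
    return "reject" if msg.get("peer_ip") in BAD_PEERS else None


def body_tier(msg: dict) -> Optional[str]:
    """Mid-cost check: a stand-in for real content classification."""
    return "reject" if "cheap pills" in msg.get("body", "").lower() else None


def antivirus_tier(msg: dict) -> Optional[str]:
    """Most expensive check: a stand-in for a sandboxed antivirus run."""
    names = msg.get("attachments", [])
    return "reject" if any(n.lower().endswith(".exe") for n in names) else None


TIERS: List[Tuple[str, Callable[[dict], Optional[str]]]] = [
    ("reputation", reputation_tier),
    ("body", body_tier),
    ("antivirus", antivirus_tier),
]


def classify(msg: dict, tiers=TIERS) -> Tuple[str, List[str]]:
    """Run tiers in cost order and stop at the first rejection.

    Returns the verdict and the names of the tiers that actually ran, so the
    expensive ones are only paid for on mail that survived the cheap ones.
    """
    ran = []
    for name, check in tiers:
        ran.append(name)
        if check(msg) == "reject":
            return "reject", ran
    return "accept", ran


if __name__ == "__main__":
    print(classify({"peer_ip": "203.0.113.7", "body": "hello"}))
    print(classify({"peer_ip": "198.51.100.4", "body": "meeting at noon"}))
```

The message from the listed peer is rejected after a single tier, which is the cost saving being described.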


I hope against hope that AI spam detection never becomes a thing. At least with today's methods, I can tell a person why their message was marked as spam. If AI detection becomes the norm, all I can do is shrug and say, "Sorry, it's the algorithm."


Gmail has used machine learning to classify spam since its creation.

https://workspace.google.com/blog/identity-and-security/an-o...


Gmail is successful because it naturally is the biggest honeypot. Most antispam API filters are like accumulators. When a trend is detected, the rest are protected. But overall, it's about scale.


SpamAssassin has a Bayes filter that you can train with ham and spam. This has basically been a thing since forever.


"A plan for Spam" (2002) - https://paulgraham.com/spam.html


How do you think SpamAssassin/gmail/outlook created their spam rules?


Correct, Spamassassin will not classify emails by default. But it does include a Bayesian classifier and tooling that can be used to train it with curated ham and spam emails. It does require extra steps to set this up and feed it selected maildirs (for example, exclude inbox, spam and trash for training ham).

https://spamassassin.apache.org/full/3.0.x/dist/doc/sa-learn... https://cwiki.apache.org/confluence/display/spamassassin/Bay...
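
To make the "train it with ham and spam" idea concrete, here is a tiny Python sketch of a word-count Bayesian filter. It is plain naive Bayes with add-one smoothing, not SpamAssassin's actual implementation, and the class name and training phrases are invented for the example.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy naive Bayes spam filter trained on labelled text."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}  # token counts per class
        self.docs = {"spam": 0, "ham": 0}                    # messages seen per class

    @staticmethod
    def tokens(text: str):
        return re.findall(r"[a-z0-9']+", text.lower())

    def learn(self, text: str, label: str) -> None:
        """Add one labelled message ("spam" or "ham") to the training data."""
        self.counts[label].update(self.tokens(text))
        self.docs[label] += 1

    def spam_probability(self, text: str) -> float:
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        total_spam = sum(self.counts["spam"].values())
        total_ham = sum(self.counts["ham"].values())
        # Prior odds from the number of training messages (add-one smoothed).
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for tok in self.tokens(text):
            p_spam = (self.counts["spam"][tok] + 1) / (total_spam + vocab + 1)
            p_ham = (self.counts["ham"][tok] + 1) / (total_ham + vocab + 1)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so exp() cannot overflow on very long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    nb = TinyBayes()
    nb.learn("cheap pills buy now", "spam")
    nb.learn("win money now", "spam")
    nb.learn("meeting notes attached", "ham")
    nb.learn("lunch tomorrow at noon", "ham")
    print(nb.spam_probability("buy cheap pills"))
    print(nb.spam_probability("notes from the meeting"))
```

With real mail the quality depends almost entirely on how clean the curated ham and spam folders are, which is why the choice of maildirs to feed it matters.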


I find it far more likely that AI will be used (indeed, is used already) to generate spam, rather than filter it.


Really any of the open source language models might work well enough for the job. If you could manage to get a classifier that runs with TensorFlow to take advantage of a Coral TPU it would certainly be a major step up with manageable performance.


Does SpamBayes still work?


The Python project that's client side? To the best of my knowledge it hasn't been updated in years; SpamAssassin's built-in Bayesian classifier works just fine.


I still used spambayes up until I basically abandoned my self-hosted e-mail setup (in favor of an @gmail).

It occurred to me recently that LLM-style tokenization + bayesian classification would be a sweet upgrade for spambayes, which always struggled with ad-hoc tokenization rules.

(I don't think of it as "client side"; it was integrated with my system via procmail on what I'd call the "server side". You could use it in other ways, including as an Outlook plugin, way back in the day. Or it could connect to an IMAP mailbox and filter messages it found, etc. Really versatile tool for its time)



