This tangentially relates to something I was thinking about yesterday. Does anyo...

dochtman · on Nov 20, 2016

I had similar problems. On top of that, I forward my email through my own server to GMail (so I control the domain, but can use the GMail ecosystem as UX), and this was posing problems because GMail would greylist my server quite a bit for sending in too much spam.

I now run rspamd on my own server, which does a pretty great job. After properly training its Bayes filters, I now receive on the order of 3 spam messages per day in GMail. Actually, rspamd seems to have fewer false positives than the GMail spam filter -- I guess this could be because it has more information as the original receiver, though?

Getting these results did take some very limited tweaking of the rspamd configuration; I lowered the threshold for what's "definitely" spam (that is, just gets discarded), and I bumped the weight of the BAYES_SPAM rule.

tempestn · on Nov 20, 2016

Actually, that's the exact situation I'm in as well. So to clarify, the spam that gets through rspamd lands in gmail's spam folder, so you still need to manually check that, but all the obviously-spam stuff has been cut out before it ever got to gmail (solving both your problems). Sounds like exactly what I asked for!

Interesting that you mention being grey-listed too. How did you determine that happened? Presumably the same thing could happen to me as well.

Guess I should also check out SRS as mentioned by emilburzo.

dochtman · on Nov 20, 2016

rspamd has three levels of handling, depending on the spam score: (1) ham, which gets passed through, (2) spam, which does not, and (3) "not sure", which gets passed through but gets headers attached with the spam score and how the score is built up. So I get (1) and (3) in my GMail account, but all the stuff for which rspamd is confident it's spam no longer makes it into GMail.
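
A minimal sketch of that three-way split, written in Python rather than rspamd's own configuration. The symbol names other than BAYES_SPAM, all the weights, and both thresholds are made-up values for illustration, not rspamd defaults.

```python
from typing import Dict, Iterable

# Hypothetical rule weights; only BAYES_SPAM is named in the thread.
WEIGHTS: Dict[str, float] = {
    "BAYES_SPAM": 5.0,
    "FORGED_SENDER": 3.0,
    "BAYES_HAM": -3.0,
}

# Hypothetical thresholds for the "not sure" and "spam" levels.
ADD_HEADER_AT = 6.0
DISCARD_AT = 12.0


def score(symbols: Iterable[str]) -> float:
    """Sum the weights of every rule that fired for one message."""
    return sum(WEIGHTS.get(name, 0.0) for name in symbols)


def action(total: float) -> str:
    """Map a total score onto one of the three handling levels."""
    if total >= DISCARD_AT:
        return "discard"      # (2) confident spam, never forwarded
    if total >= ADD_HEADER_AT:
        return "add_header"   # (3) not sure, forwarded with score headers
    return "pass"             # (1) ham, forwarded untouched
```

In this toy model, lowering `DISCARD_AT` or raising the `BAYES_SPAM` weight both move messages from "add_header" to "discard", which is the kind of tweak I made.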

Of course, rspamd lets you tweak the thresholds for these levels. For example, after a while I lowered the threshold for "spam", increasing the amount of stuff that gets discarded by rspamd, because I noticed that rspamd was doing a pretty good job of scoring, and the false positives I was seeing had a lower score anyway.

I'm actually not sure grey-listing is the correct term, but I noticed in my server's MTA log that Google was rate limiting me a lot because my server was sending through significant amounts of email. This was also noticeable sometimes because it would take quite some time for email to get through, which I found annoying.

Yes, you probably want to run SRS as well, otherwise GMail will be unable to correctly understand your headers. However, this effectively puts your server on the hook for any email forwarded; this is why I don't think you want to go there without also putting some kind of spam filter in place, otherwise I assume your server's reputation will deteriorate.
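
For anyone wondering what SRS does to the envelope sender, here is a simplified Python sketch of the `SRS0` rewrite. The function names, the secret handling, and the exact hash input are assumptions made for illustration; this is not postsrsd's code and is not guaranteed to interoperate with it.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional

BASE32 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"


def srs_timestamp(now: Optional[float] = None) -> str:
    """Two base32 characters encoding the day number modulo 1024."""
    seconds = time.time() if now is None else now
    days = int(seconds // 86400) % 1024
    return BASE32[days >> 5] + BASE32[days & 31]


def srs_forward(sender: str, forwarder_domain: str, secret: bytes,
                now: Optional[float] = None) -> str:
    """Rewrite an envelope sender so it belongs to the forwarding domain."""
    local, _, domain = sender.rpartition("@")
    ts = srs_timestamp(now)
    # Assumed hash input: timestamp, original domain, original local part.
    message = (ts + domain + local).lower().encode()
    digest = hmac.new(secret, message, hashlib.sha1).digest()
    tag = base64.b64encode(digest).decode()[:4]  # short tamper-check tag
    return f"SRS0={tag}={ts}={domain}={local}@{forwarder_domain}"
```

The rewritten address is in the forwarder's domain, so SPF checks at GMail are evaluated against the forwarding server instead of the original sender's domain. That is exactly why the forwarder's reputation is on the line for whatever it passes along.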

emilburzo · on Nov 20, 2016

Have you ever had a legit email marked as spam by rspamd?

And is there a way to (manually) double-check those?

dochtman · on Nov 20, 2016

I don't think it has ever marked any legit mail as spam. The rspamd web UI has an overview of recent history which has the date/time, message ID and score, but there are no full headers/content for things that qualify as outright spam.

emilburzo · on Nov 20, 2016

Just wondering, do you have SRS set up on your email server?

I ask because I had the same problem with the same setup (own domain forwarding to gmail); adding SRS (besides the obvious SPF/DKIM/DMARC) has really improved things for me.

dochtman · on Nov 20, 2016

Yeah, I have postsrsd set up now, although I only set it up after setting up rspamd.

I'm not sure why it would improve things without also setting up a spam filter? In that case, you're just lowering the reputation of your own server by passing on a lot of spam while acting as if you sent it.

emilburzo · on Nov 20, 2016

> I'm not sure why it would improve things without also setting up a spam filter?

I don't know the rules by which emails are judged, I'm just saying what I've noticed.

Before SRS, I occasionally got legit mail marked as spam.

After that, I don't think I've ever had one marked as spam.

As I said previously: with my own domain, forwarded to my gmail.com address, no spam filtering at all on my side.

Maybe they can tell it's SRS and skip some penalties?

EDIT: I'm also monitoring my domain at https://postmaster.google.com, but I'm probably not reaching significant traffic thresholds because I've never seen anything other than:

"No data to display at this time. Please come back later. Postmaster Tools requires that your domain satisfies certain conditions before data is visible for this chart. "

haimez · on Nov 20, 2016

Maybe, but you have to start with a better definition of spam. Currently, you're describing it as "something that you wouldn't want to read", where even a fellow human might not hit the mark 100% of the time.

Machine learning requires large data sets and training, and mushy targets like your inbox are tricky because it's hard to tell a computer what its score was on a given attempt, even with human scoring.

tempestn · on Nov 20, 2016

To clarify, I'm not expecting something to identify the exact messages I'd want to read. But if it could take the 3000 messages in my spam folder, and separate out 500 "probably spam" from 2500 "definitely spam", it would cut my manual spam-scanning time down significantly.

haimez · on Nov 20, 2016

You need a deterministic way to explain to the machine how to score itself, which is the problem. If a precise win condition exists without human intervention, then the software can iterate on itself.

ikeboy · on Nov 20, 2016

Can you identify the top 5-10 keywords in spam that never appear in legitimate emails, and set up a filter to discard those immediately or move them somewhere else?
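
Something like this rough Python sketch is what I have in mind. The function names and the tokenizer are my own assumptions, and it expects you to supply the message bodies as plain strings.

```python
import re
from collections import Counter
from typing import Iterable, List, Set

TOKEN = re.compile(r"[a-z0-9']+")


def tokens(text: str) -> Set[str]:
    """Lowercased set of words in one message (each word counted once)."""
    return set(TOKEN.findall(text.lower()))


def spam_only_keywords(spam: Iterable[str], ham: Iterable[str],
                       top: int = 10) -> List[str]:
    """Words seen in the most spam messages that never occur in any ham."""
    ham_vocab: Set[str] = set()
    for message in ham:
        ham_vocab |= tokens(message)

    counts: Counter = Counter()
    for message in spam:
        counts.update(tokens(message) - ham_vocab)

    return [word for word, _ in counts.most_common(top)]
```

The resulting list only reflects the mail you fed it, so it would need regenerating as the wording of the spam changes.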

haimez · on Nov 23, 2016

Welcome to the arms race.