This tangentially relates to something I was thinking about yesterday.
Does anyone have a sense of how difficult it would be to create a service that scans your gmail spam folder and categorizes the contents into 'definitely spam' and 'maybe spam'? I'm probably somewhat of an edge case, but I get over 100 spam emails per day in my spam folder. Almost none make it through to my inbox. However, every month, one or two legitimate emails land in spam. Usually these are there for an obvious reason - "cold call" emails from advertisers and that kind of thing, but still stuff that I want to see, and obviously (to a human) not on the same level as penis enlargement garbage or whatever. (Although occasionally there are real head-scratchers, like a thread where I've already replied to someone twice, and then their third message goes to spam. I guess gmail's filters only operate on the current message and don't look at history.)
Anyway, I get enough false positives that I need to scan through the thousands of spam messages I get each month to try to find them, which is obviously a huge waste of time. If something could go through there, identify the least spammy fraction, and label them, it would save me a ton of time. It would be lovely if gmail offered this themselves, since they already have the spam score for each message. But barring that it seems conceivable that you could do it with a browser add-on, perhaps using a neural network. Still seems kind of like reinventing the wheel though, since you'd basically be building a reverse spam filter. So I'm wondering if there's an easier way...
I had similar problems. On top of that, I forward my email through my own server to GMail (so I control the domain, but can use the GMail ecosystem as UX), and this was posing problems because GMail would greylist my server quite a bit for sending in too much spam.
I now run rspamd on my own server, which does a pretty great job. After properly training its Bayes filters, I now receive on the order of 3 spam messages per day in GMail. Actually, rspamd seems to have fewer false positives than the GMail spam filter -- I guess this could be because it has more information as the original receiver, though?
Getting these results did take some very limited tweaking of the rspamd configuration; I lowered the threshold for what's "definitely" spam (that is, just gets discarded), and I bumped the weight of the BAYES_SPAM rule.
Actually, that's the exact situation I'm in as well. So to clarify, the spam that gets through rspamd lands in gmail's spam folder, so you still need to manually check that, but all the obviously-spam stuff has been cut out before it ever got to gmail (solving both your problems). Sounds like exactly what I asked for!
Interesting that you mention being grey-listed too. How did you determine that happened? Presumably the same thing could happen to me as well.
Guess I should also check out SRS as mentioned by emilburzo.
rspamd has three levels of handling, depending on the spam score: (1) ham, which gets passed through, (2) spam, which does not, and (3) "not sure", which gets passed through but gets headers attached with the spam score and how the score is built up. So I get (1) and (3) in my GMail account, but all the stuff for which rspamd is confident it's spam no longer makes it into GMail.
Of course, rspamd lets you tweak the thresholds for these levels. For example, after a while I lowered the threshold for "spam", increasing the amount of stuff that gets discarded by rspamd, because I noticed that rspamd was doing a pretty good job of scoring, and the false positives I was seeing had a lower score anyway.
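To make the three levels concrete, here's a rough sketch in Python of how a score maps to a handling level. The threshold values and the function name are made up for illustration; they are not rspamd's defaults or its actual API.

```python
# Hypothetical thresholds for illustration only; not rspamd defaults.
ADD_HEADER_THRESHOLD = 6.0   # at or above this: "not sure", deliver with score headers
DISCARD_THRESHOLD = 15.0     # at or above this: confident spam, do not deliver


def classify(score: float) -> str:
    """Map a spam score to one of the three handling levels."""
    if score >= DISCARD_THRESHOLD:
        return "spam"       # never reaches GMail
    if score >= ADD_HEADER_THRESHOLD:
        return "not sure"   # passed through with spam-score headers attached
    return "ham"            # passed through untouched


if __name__ == "__main__":
    for s in (1.2, 8.5, 22.0):
        print(s, classify(s))
```

Lowering DISCARD_THRESHOLD in this sketch corresponds to what I did: more of the "not sure" band gets discarded instead of forwarded.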
I'm actually not sure grey-listing is the correct term, but I noticed in my server's MTA log that Google was rate limiting me a lot because my server was sending through significant amounts of email. This was also noticeable sometimes because it would take quite some time for email to get through, which I found annoying.
Yes, you probably want to run SRS as well, otherwise GMail will be unable to correctly understand your headers. However, this effectively puts your server on the hook for any email forwarded; this is why I don't think you want to go there without also putting some kind of spam filter in place, otherwise I assume your server's reputation will deteriorate.
I don't think it has ever marked any legit mail as spam. The rspamd web UI has an overview of recent history which has the date/time, message ID and score, but there are no full headers/content for things that qualify as outright spam.
Just wondering, do you have SRS set up on your email server?
I ask because I had the same problem with the same setup (own domain forwarding to gmail). Adding SRS (besides the obvious SPF/DKIM/DMARC) has really improved things for me.
Yeah, I have postsrsd set up now, although I only set it up after setting up rspamd.
I'm not sure why it would improve things without also setting up a spam filter? In that case, you're just lowering the reputation of your own server by passing on a lot of spam while acting as if you sent it.
> I'm not sure why it would improve things without also setting up a spam filter?
I don't know the rules by which emails are judged, I'm just saying what I've noticed.
Before SRS, I occasionally got legit mail marked as spam.
After that, I don't think I've ever had one marked as spam.
As I said previously: with my own domain, forwarded to my gmail.com address, no spam filtering at all on my side.
Maybe they can tell it's SRS and skip some penalties?
EDIT: I'm also monitoring my domain at https://postmaster.google.com, but I'm probably not reaching significant traffic thresholds because I've never seen anything other than:
"No data to display at this time. Please come back later.
Postmaster Tools requires that your domain satisfies certain conditions before data is visible for this chart."
Maybe, but you have to start with a better definition of spam. Currently, you're describing it as "something that you wouldn't want to read", where even a fellow human might not hit the mark 100% of the time.
Machine learning requires large data sets and training, and mushy targets like your inbox are tricky because it's hard to tell a computer what its score was on a given attempt, even with human scoring.
To clarify, I'm not expecting something to identify the exact messages I'd want to read. But if it could take the 3000 messages in my spam folder, and separate out 500 "probably spam" from 2500 "definitely spam", it would cut my manual spam-scanning time down significantly.
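Something like this is all I mean, as a minimal Python sketch. It assumes I already have a numeric spam score per message from somewhere (say, a local filter); the message IDs, scores, and function name below are invented for illustration.

```python
from typing import List, Tuple


def split_spam_folder(
    scored: List[Tuple[str, float]], keep_fraction: float
) -> Tuple[List[str], List[str]]:
    """Split (message_id, score) pairs into (least spammy, definitely spam).

    Messages are ranked by score, lowest first; the lowest-scoring
    keep_fraction of them are the ones worth scanning by hand.
    """
    ranked = sorted(scored, key=lambda pair: pair[1])
    cutoff = int(len(ranked) * keep_fraction)
    to_review = [msg_id for msg_id, _ in ranked[:cutoff]]
    definitely = [msg_id for msg_id, _ in ranked[cutoff:]]
    return to_review, definitely


if __name__ == "__main__":
    # Made-up scores: higher means more spammy.
    folder = [("a", 21.0), ("b", 7.5), ("c", 30.2), ("d", 9.1)]
    print(split_spam_folder(folder, keep_fraction=0.5))
```

With 3000 messages and a keep_fraction of about one sixth, that gives the 500/2500 split I described. It doesn't have to be right about every message, it just has to rank them.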
You need a deterministic way to explain to the machine how to score itself, which is the problem. If a precise win condition exists without human intervention, then the software can iterate on itself.
Can you identify the top 5-10 keywords in spam that never appear in legitimate emails, and set up a filter to discard those immediately or move them somewhere else?
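A quick Python sketch of that idea, in case it helps. The keyword list is a placeholder (you'd pick your own from your spam folder), and the function name is just for illustration.

```python
import re

# Placeholder keywords; replace with words that never show up in your legit mail.
SPAM_KEYWORDS = frozenset({"enlargement", "casino", "lottery"})


def is_obvious_spam(subject: str, body: str, keywords=SPAM_KEYWORDS) -> bool:
    """Return True if any keyword appears as a whole word in the subject or body."""
    text = f"{subject} {body}".lower()
    words = set(re.findall(r"[a-z0-9']+", text))
    return not words.isdisjoint(keywords)


if __name__ == "__main__":
    print(is_obvious_spam("Win the LOTTERY now", "Click here"))
    print(is_obvious_spam("Re: advertising proposal", "Following up on our thread"))
```

Matching on whole words keeps a keyword from firing on a longer word that merely contains it. Anything that doesn't match stays in the pile you still have to look at.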