The content in this article about anti-spam techniques is fascinating. But the title should be edited to indicate its vintage: 2014.
Also, by the by, there is some interesting context about the author (Mike Hearn) and his recent activities that is not related to the topic of spam.
Warning: off-topic digression below.
Mike Hearn became one of the most visible Bitcoin core developers, working in that community for 5 years, until a well-publicized departure where he declared Bitcoin a failure[0].
He then joined R3CEV, a startup venture that is building a private blockchain platform for a consortium of 70 of the world's largest banks. Hearn's departure was criticized by members of the cryptocurrency community, such as Bram Cohen, who famously called his exit a "whiny ragequit" [1].
Regardless of how one feels about the internal politics of the bitcoin dev community, the R3 project is technically interesting (to me, at least) because it uses Kotlin [2], a JVM-based functional language, and also because it has some interesting design approaches that depart from the established bitcoin blockchain model [3].
I still think that private blockchain platforms have an uphill battle if they want to compete for developer attention with rapidly evolving broad-based platforms like Ethereum and Bitcoin, but the R3 Corda platform is nevertheless worth tracking.
This would allow a client to combine a server-provided function that calculates a spam score with their private key such that the resulting function calculates a spam score on encrypted email. The client could then hand that function back to the server so it can perform server-side spam detection.
There are a number of drawbacks, including performance and general questions about the security of such a system. That said, I think this is probably the biggest problem (from the OP):
"The third problem is that spam filters rely quite heavily on security through obscurity, because it works well. Though some features are well known (sending IP, links) there are many others, and those are secret. If calculation was pushed to the client then spammers could see exactly what they had to randomise and the cross-propagation of reputations wouldn't work as well."
Using functional encryption to provide server-side spam detection would still require handing a spam scoring function to the client so they can apply that function to their private key and hand the server a result. This would expose the internals of the spam detection routine to all clients, including spammers.
The problem with functional encryption is, as you say, that you need to hand over the "function" somehow to the server (presumably they use machine learning and tools that aren't feasible client-side), and there's no guarantee the private key is hidden unless you use something like indistinguishability obfuscation, which isn't really practical at all right now.
Did you mean fully homomorphic encryption? (https://en.wikipedia.org/wiki/Homomorphic_encryption#Fully_h...) The server can compute the spam score under the encryption of an email, and client side decrypts and sorts it from there, so not even the server knows if a given email is spam or not. Of course, not that FHE is feasible, but perhaps this special case is...
No, I didn't mean FHE, because FHE does not meet the criteria given in the post, namely that it must happen as quickly as possible and cannot rely on the liveness of the client. The OP practically rules out schemes that involve looping in the client.
What? With FHE the client just gets an additional encrypted metadata that is the encryption of whether the attached file is spam or not. No looping required, whereas your functional encryption scheme seems to necessitate the client being "live."
One of the requirements given in the OP (the one I was referencing in my previous post) is that the server can tell spam from non-spam without the client being online. The FHE solution doesn't work for this requirement.
The functional encryption scheme only requires a client to bootstrap it. Once the client has calculated the appropriate function based on their private key, they can give it to the server, who can thereafter apply it to incoming emails regardless of whether the client is online or offline.
Okay so to fix mine: create a circuit that decrypts a ciphertext using the private key, returning 1, 0, or Bottom depending if it's an encryption of spam marking or not, or not valid, and run it through iO. So both solutions still require iO...
Also, in order to build the function, the service needs to collect data about both spam and non-spam emails. So simply providing a function that allows you to calculate a score isn't enough; they would need (at least a large percentage of) users to send back information about their email contents as well. (Like embedded links, as mentioned in the post.)
The people who use encrypted messaging, while greatly increased in number, are still not the target demographic who would click on FREE VIAGRA ads. That said, I wouldn't be surprised to see a small niche industry arise around unsolicited encrypted bulk email sending for tech recruiters...
The fundamental problem is not spam messages. It's unlimited free identities. Encrypted email bodies would play well with spam detection, as long as there's some effort or cost associated with creating a new sender identity. Google tries to do this by tying your entire life to a Google account. So do Facebook and LinkedIn. It takes a while for a new account on those systems to develop a life history, so there's a reputation anchor of sorts that doesn't involve money. If you could send an encrypted email that was signed with your Facebook, LinkedIn, Google, or Github ID, that would be a reasonable way to tie messages to a reputation.
This tangentially relates to something I was thinking about yesterday.
Does anyone have a sense of how difficult it would be to create a service that scans your gmail spam folder and categorizes the contents into 'definitely spam' and 'maybe spam'? I'm probably somewhat of an edge case, but I get over 100 spam emails per day in my spam folder. Almost none make it through to my inbox. However, every month, one or two legitimate emails land in spam. Usually these are there for an obvious reason - "cold call" emails from advertisers and that kind of thing, but still stuff that I want to see, and obviously (to a human) not on the same level as penis enlargement garbage or whatever. (Although occasionally there are real head-scratchers, like a thread where I've already replied to someone twice, and then their third message goes to spam. I guess gmail's filters only operate on the current message and don't look at history.)
Anyway, I get enough false positives that I need to scan through the thousands of spam messages I get each month to try to find them, which is obviously a huge waste of time. If something could go through there, identify the least spammy fraction, and label them, it would save me a ton of time. It would be lovely if gmail offered this themselves, since they already have the spam score for each message. But barring that it seems conceivable that you could do it with a browser add-on, perhaps using a neural network. Still seems kind of like reinventing the wheel though, since you'd basically be building a reverse spam filter. So I'm wondering if there's an easier way...
I had similar problems. On top of that, I forward my email through my own server to GMail (so I control the domain, but can use the GMail ecosystem as UX), and this was posing problems because GMail would greylist my server quite a bit for sending in too much spam.
I now run rspamd on my own server, which does a pretty great job. With its Bayes filters properly trained, I now receive on the order of 3 spam messages per day in GMail. Actually, rspamd seems to have fewer false positives than the GMail spam filter -- I guess this could be because it has more information as the original receiver, though?
Getting these results did take some very limited tweaking of the rspamd configuration; I lowered the threshold for what's "definitely" spam (that is, just gets discarded), and I bumped the weight of the BAYES_SPAM rule.
Actually, that's the exact situation I'm in as well. So to clarify, the spam that gets through rspamd lands in gmail's spam folder, so you still need to manually check that, but all the obviously-spam stuff has been cut out before it ever got to gmail (solving both your problems). Sounds like exactly what I asked for!
Interesting that you mention being grey-listed too. How did you determine that happened? Presumably the same thing could happen to me as well.
Guess I should also check out SRS as mentioned by emilburzo.
rspamd has three levels of handling, depending on the spam score: (1) ham, which gets passed through, (2) spam, which does not, and (3) "not sure", which gets passed through but gets headers attached with the spam score and how the score is built up. So I get (1) and (3) in my GMail account, but all the stuff for which rspamd is confident it's spam no longer makes it into GMail.
Of course, rspamd lets you tweak the thresholds for these levels. For example, after a while I lowered the threshold for "spam", increasing the amount of stuff that gets discarded by rspamd, because I noticed that rspamd was doing a pretty good job of scoring, and the false positives I was seeing had a lower score anyway.
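For anyone trying to picture the three levels, here is a minimal Python sketch of score-based handling. It is not rspamd's code or configuration format; the `Thresholds` class, the `classify` function, and the numeric values are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Illustrative values only (assumption); not rspamd's defaults.
    add_header: float = 6.0   # at or above this: deliver, but annotate
    reject: float = 15.0      # at or above this: discard outright


def classify(score: float, t: Thresholds = Thresholds()) -> str:
    """Map a spam score to one of the three handling levels."""
    if score >= t.reject:
        return "spam"       # never forwarded
    if score >= t.add_header:
        return "not sure"   # forwarded with score headers attached
    return "ham"            # forwarded untouched


if __name__ == "__main__":
    for s in (1.2, 8.0, 22.5):
        print(s, classify(s))
    # Lowering the "spam" threshold discards more mail.
    strict = Thresholds(reject=10.0)
    print(12.0, classify(12.0), "->", classify(12.0, strict))
```

Lowering `reject` is the kind of tweak described above: a message that used to be passed through with headers gets discarded instead.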
I'm actually not sure grey-listing is the correct term, but I noticed in my server's MTA log that Google was rate limiting me a lot because my server was sending through significant amounts of email. This was also noticeable sometimes because it would take quite some time for email to get through, which I found annoying.
Yes, you probably want to run SRS as well, otherwise GMail will be unable to correctly understand your headers. However, this effectively puts your server on the hook for any email forwarded; this is why I don't think you want to go there without also putting some kind of spam filter in place, otherwise I assume your server's reputation will deteriorate.
I don't think it has ever marked any legit mail as spam. The rspamd web UI has an overview of recent history which has the date/time, message ID and score, but there are no full headers/content for things that qualify as outright spam.
Just wondering, do you have SRS set up on your email server?
I ask because I had the same problem, with the same setup (own domain forwarding to gmail), adding SRS (besides the obvious SPF/DKIM/DMARC) has really improved things for me.
Yeah, I have postsrsd set up now, although I only set it up after setting up rspamd.
I'm not sure why it would improve things without also setting up a spam filter? In that case, you're just lowering the reputation of your own server by passing on a lot of spam while acting as if you sent it.
> I'm not sure why it would improve things without also setting up a spam filter?
I don't know the rules by which emails are judged, I'm just saying what I've noticed.
Before SRS, I occasionally got legit mail marked as spam.
After that, I don't think I've ever had one marked as spam.
As I said previously: with my own domain, forwarded to my gmail.com address, no spam filtering at all on my side.
Maybe they can tell it's SRS and skip some penalties?
EDIT: I'm also monitoring my domain at https://postmaster.google.com, but I'm probably not reaching significant traffic thresholds because I've never seen anything other than:
"No data to display at this time. Please come back later.
Postmaster Tools requires that your domain satisfies certain conditions before data is visible for this chart. "
Maybe, but you have to start with a better definition of spam. Currently, you're describing it as "something that you wouldn't want to read" where even a fellow human might not hit the mark 100% of the time.
Machine learning requires large data sets and training, and mushy targets like your inbox are tricky because it's hard to tell a computer what its score was on a given attempt- even with human scoring.
To clarify, I'm not expecting something to identify the exact messages I'd want to read. But if it could take the 3000 messages in my spam folder, and separate out 500 "probably spam" from 2500 "definitely spam", it would cut my manual spam-scanning time down significantly.
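A rough Python sketch of that split, assuming each message already carries some numeric spam score (which Gmail does not expose, so the score would have to come from your own filter). The `Message` type and `least_spammy` function are hypothetical names for illustration.

```python
from typing import NamedTuple


class Message(NamedTuple):
    subject: str
    spam_score: float  # assumption: higher means more spammy


def least_spammy(messages, fraction=1 / 6):
    """Split messages into (review_these, ignore_these).

    The lowest-scoring `fraction` goes into the review pile;
    everything else is treated as definitely spam.
    """
    ranked = sorted(messages, key=lambda m: m.spam_score)
    cut = max(1, round(len(ranked) * fraction)) if ranked else 0
    return ranked[:cut], ranked[cut:]


if __name__ == "__main__":
    folder = [
        Message("Re: our call on Tuesday", 5.1),
        Message("CHEAP PILLS", 31.0),
        Message("Partnership proposal", 7.4),
        Message("You won the lottery", 28.2),
    ]
    review, ignore = least_spammy(folder, fraction=0.5)
    for m in review:
        print("check:", m.subject)
```

With 3000 messages and the default fraction, the review pile would hold 500 messages, matching the numbers above.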
You need a deterministic way to explain to the machine how to score itself, which is the problem. If a precise win condition exists without human intervention, then the software can iterate on itself.
Can you identify the top 5-10 keywords in spam that never appear in legitimate emails, and set up a filter to discard those immediately or move them somewhere else?
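As a sketch of what such a keyword filter could look like in Python: the keyword set below is a placeholder (you would pick your own from your spam folder), and `has_blocked_keyword` is a hypothetical helper, not a Gmail feature.

```python
import re

# Placeholder keywords (assumption); choose ones that never appear
# in your legitimate mail.
BLOCKED = {"viagra", "enlargement", "lottery"}


def has_blocked_keyword(text, blocked=BLOCKED):
    """Return True if any whole word in `text` is a blocked keyword."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return not words.isdisjoint(blocked)


if __name__ == "__main__":
    print(has_blocked_keyword("FREE VIAGRA today only"))
    print(has_blocked_keyword("Lunch tomorrow?"))
```

It matches whole words only, so it is easy for spammers to dodge with misspellings; it is meant to thin out the obvious stuff, nothing more.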
I like the approach Facebook Messenger takes - I can receive messages from anyone who can view my profile, but first time messages from non-friends end up in a special bucket and require approval.
Spam is a problem everywhere except Gmail. People have an illusion that the battle against spam is won because Gmail did the job. All the other email providers struggle everyday. SpamAssassin is worthless, it is a piece of shit that does nothing.
We desperately need a service that helps filtering email -- like, for example, something simple that just accepts reports about some email being spam or not, and creates a list of spam addresses.
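A minimal Python sketch of the "something simple" described here, just to make the idea concrete. The `ReportList` class and its thresholds are assumptions for illustration, and it ignores the hard parts (forged senders, abusive reporters).

```python
from collections import defaultdict


class ReportList:
    """Collect spam/not-spam reports and derive a list of spam addresses."""

    def __init__(self, min_reports=3, min_spam_ratio=0.8):
        # Both thresholds are illustrative assumptions.
        self.min_reports = min_reports
        self.min_spam_ratio = min_spam_ratio
        self._counts = defaultdict(lambda: [0, 0])  # address -> [spam, ham]

    def report(self, address, is_spam):
        """Record one user report about a sender address."""
        self._counts[address.lower()][0 if is_spam else 1] += 1

    def blocklist(self):
        """Addresses with enough reports, most of them saying spam."""
        listed = set()
        for addr, (spam, ham) in self._counts.items():
            total = spam + ham
            if total >= self.min_reports and spam / total >= self.min_spam_ratio:
                listed.add(addr)
        return listed


if __name__ == "__main__":
    reports = ReportList()
    for _ in range(4):
        reports.report("Bulk@Spammer.example", True)
    reports.report("friend@example.org", True)   # a single report is not enough
    print(reports.blocklist())
```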
> SpamAssassin is worthless, it is a piece of shit that does nothing.
This is not entirely true. Spamassassin, dspam, and all the Bayesian-based ones need constant training and a feedback loop to work. The time a round of training lasts for is getting shorter and shorter, but it's not entirely inefficient. (I'm running dspam on my mailbox.)
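To show what "training and a feedback loop" means for a Bayesian filter, here is a toy naive Bayes classifier in Python. It is a sketch of the general technique only, not how SpamAssassin or dspam are implemented; `TinyBayes` and the sample messages are invented for illustration.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy naive Bayes spam filter with Laplace smoothing."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    @staticmethod
    def _tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        """Feedback step: label is 'spam' or 'ham'."""
        self.tokens[label].update(self._tokenize(text))
        self.docs[label] += 1

    def spam_probability(self, text):
        vocab = len(set(self.tokens["spam"]) | set(self.tokens["ham"])) or 1
        total_docs = sum(self.docs.values())
        words = self._tokenize(text)
        log_score = {}
        for label in ("spam", "ham"):
            # Smoothed class prior plus smoothed per-token likelihoods.
            lp = math.log((self.docs[label] + 1) / (total_docs + 2))
            n = sum(self.tokens[label].values())
            for tok in words:
                lp += math.log((self.tokens[label][tok] + 1) / (n + vocab))
            log_score[label] = lp
        # Turn the two log scores into P(spam); clamp to avoid overflow.
        diff = max(-700.0, min(700.0, log_score["ham"] - log_score["spam"]))
        return 1.0 / (1.0 + math.exp(diff))


if __name__ == "__main__":
    f = TinyBayes()
    f.train("free viagra cheap pills", "spam")
    f.train("cheap pills free offer", "spam")
    f.train("meeting tomorrow at noon", "ham")
    f.train("lunch meeting notes", "ham")
    print(f.spam_probability("free cheap pills"))
    print(f.spam_probability("meeting notes"))
```

Every call to `train` is one round of feedback. Once spammers change their vocabulary, the token counts go stale, which is why the training has to keep happening.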
Combine this with weighted blacklists ( postscreen in my case ), add dkim and dmarc checks and it's fine for a small provider. Far from ideal, but working.
I'd love to see an open source implementation of what google was referring to as domain based trust, but it's also a nasty thing.
I've recently tried to change my mail address from a .eu domain to a .net, and most of my mail landed in the recipients' spam folder. It's a fresh domain, no one ever sent anything from it, so I'd assume fresh domains are untrusted by default, which is really bad and is generally wrong. If the trust is not OK by default, and spam is whatever users consider spam, that's going to kill domain-based mailing, which is horrible, and gmail will be the one to blame when we have no alternatives to a few providers.
Fascinating insight into one of the biggest email operations. I'd be very interested to hear how the other companies in this space differ or are similar.
Question: Is there a curve where pushing this processing back to the phone will become possible? The most powerful counterpoint at the moment is battery life, but I do see that improving to a plausible point where this sort of continual processing is feasible.
> Botnets appeared as a way to get around RBLs, and in response spam fighters mapped out the internet to create a "policy block list" - ranges of IPs that were assigned to residential connections and thus should not be sending any email at all.
While I understand that _some_ residential ISPs don't let you run services on your connection, policies like this make me sad because it means the web is becoming more-and-more something you need other people to do for you.
As far as I am concerned the PBL is a tool to facilitate net neutrality violations, not by ISPs, but by e-mail service providers.*
Another effort by anti-spam people to encourage network neutrality violations is the SUBMISSION port, port 587 as opposed to port 25, for the authenticated submission of mail by MUAs. As far as I can tell, the only possible utility of this split is to make it more politically feasible for ISPs to block port 25 out, i.e. to facilitate network neutrality violations. As such I do not view the SUBMISSION port RFC well.
(*Some people argue that PBL networks have no business sending mail because they have dynamic IPs. This is false; there are PBL-listed networks which assign static IPs, and it is at any rate a moot point. No specification requires that MTAs for domain outgoing mail and MXs (that is, MTAs for incoming mail) be one and the same, and there is only a technical need for the latter to have fixed IPs.)
This is in my postfix config. The invalid hostname and non_fqdn tests are working quite well against residential hosts: they don't have valid reverse DNS or an fqdn, and so they get eliminated fast.
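To illustrate the kind of test involved, here is a rough Python sketch of a syntactic check on the hostname a client announces. It is not Postfix's implementation and does no DNS lookups, so it covers only the "invalid hostname" and "not an fqdn" parts, not reverse DNS. The function names are made up, and it deliberately ignores address literals such as `[192.0.2.1]`.

```python
import re

# One DNS label: letters, digits, hyphens; no leading or trailing hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def _strip_root(name):
    """Drop a single trailing dot (the DNS root), if present."""
    return name[:-1] if name.endswith(".") else name


def is_valid_hostname(name):
    """Syntactic check only: every dot-separated label must be well formed."""
    name = _strip_root(name)
    if not name or len(name) > 253:
        return False
    return all(_LABEL.fullmatch(label) for label in name.split("."))


def is_fqdn(name):
    """Valid hostname with at least two labels, e.g. mail.example.com."""
    return is_valid_hostname(name) and "." in _strip_root(name)


if __name__ == "__main__":
    for helo in ("mail.example.com", "localhost", "bad_host!.example", "-x.example.com"):
        print(helo, is_valid_hostname(helo), is_fqdn(helo))
```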
One of the issues mentioned is that bulk mail and spam are not necessarily the same thing. However, bulk mail has much less use for e2e than directed mail.
Personalized bulk mail (e.g. bulk mail with a small personalized offer) is an issue here. For that, partial encryption seems like a nice solution. As the template remains plaintext, reputation can include the template. As such, you could levy a lower 'tax' on such email as opposed to fully confidential email.
The reason why spam is an issue with email is that anyone can send emails and it's not always possible to identify the sender. Once encryption is deployed then the sender is associated with a public key and it's possible to establish a web of trust. Gmail could also manage identity-based scores based on all its users' trusted connections.
> The other reason it sucks is that it confuses bulk mail with spam. This is a very common confusion. Lots of companies send vast amounts of mail that users want to receive. Think Facebook, for example.
I challenge this. In 2016 I think most companies don't need to rely on massive bulk mails anymore.
> Botnets appeared as a way to get around RBLs, and in response spam fighters mapped out the internet to create a "policy block list" - ranges of IPs that were assigned to residential connections and thus should not be sending any email at all.
Your residential connection might not, but mine does.
(not because I want to dismiss conversation here - did want to read up on it myself)