You might be right (I don't actually know what a HELO string is, I don't know an...

Brian_K_White · on Nov 7, 2021

I do know about SMTP, and you were right regardless, because the author was talking about 4 database fields, not SMTP. The details of SMTP are irrelevant.

twic · on Nov 8, 2021

Normalisation depends on the semantics of the data, and so the details of SMTP are very much relevant.

Brian_K_White · on Nov 8, 2021

incorrect

GauntletWizard · on Nov 7, 2021

It's (somewhat) because the HELO is forged that there's no relationship between HELO and IP. The first command a client sends in SMTP is "HELO <hostname>". The hostname can be either a unique identifier (server1.company.com, etc.) or a system-level identifier (mta.company.com for all of your company's outbound mail agents, or, in the case of bulk mailers, bulk.client1.com when sending as client1, bulk.client2.com when sending as client2, etc.). But there is/was no authentication on what you send as HELO (now you can verify it against the TLS certificate, though many implementations don't do that at all, or don't do it well), so correlating based on the hostname in HELO was questionable at best. Thus, the combination (HELO, IP) was a single value, a tuple.
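
To make the tuple point concrete, here's a rough sketch in Python. The rows, hostnames, and function names are all made up for illustration (the IPs are from the documentation ranges); it isn't the schema from the article. It just shows that grouping by HELO alone lumps unrelated senders together, while the (HELO, IP) pair stays distinct.

```python
from collections import Counter

# Hypothetical log rows: (helo, ip) as recorded by the receiving server.
rows = [
    ("mta.company.com", "203.0.113.10"),
    ("mta.company.com", "203.0.113.11"),
    ("foobar", "198.51.100.7"),
    ("foobar", "192.0.2.44"),
    ("foobar", "198.51.100.7"),
]


def count_by_pair(rows):
    """Count messages per (helo, ip) tuple, treating the pair as one value."""
    return Counter(rows)


def ips_per_helo(rows):
    """Map each HELO string to the set of IPs that claimed it."""
    seen = {}
    for helo, ip in rows:
        seen.setdefault(helo, set()).add(ip)
    return seen


if __name__ == "__main__":
    for pair, n in count_by_pair(rows).items():
        print(pair, n)
    for helo, ips in ips_per_helo(rows).items():
        print(helo, sorted(ips))
```

Since the HELO is unauthenticated, "foobar" showing up from two different IPs tells you nothing about whether those are the same sender, which is why the pair is the thing you key on.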

Izkata · on Nov 7, 2021

> But there is/was no authentication on what you send as HELO

Yep, and that explains the "foobar" rows. Those should have resolved to the same IP, but because nothing authenticates the HELO value, you could put gibberish there and the SMTP server would accept it.

> so correlating based on the hostname in HELO was questionable at best

Eh, spambots from two different IPs could have both hardcoded "foobar" because of the lack of authentication, so I could see this working to filter legitimate/illegitimate emails from a compromised IP.

GauntletWizard · on Nov 7, 2021

Right, it's most certainly useful as a signal in a Bayesian filter, but there's no strong general rule.
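
For what "a signal in a Bayesian filter" could look like, here's a minimal sketch, assuming you have labeled counts of HELO strings from spam and non-spam mail. The function name, the counts, the 0.5 prior, and the add-one smoothing are all assumptions for illustration; a real filter would combine this with many other features.

```python
def spam_probability(helo, spam_counts, ham_counts, prior_spam=0.5, alpha=1.0):
    """Estimate P(spam | helo) with Bayes' rule and additive smoothing.

    spam_counts / ham_counts map HELO strings to how often each was seen
    in spam and non-spam mail (hypothetical training data).
    """
    n_spam = sum(spam_counts.values())
    n_ham = sum(ham_counts.values())
    # Number of distinct HELO strings, plus one slot for unseen values.
    vocab = len(set(spam_counts) | set(ham_counts)) + 1

    # Smoothed likelihoods P(helo | spam) and P(helo | ham).
    p_helo_spam = (spam_counts.get(helo, 0) + alpha) / (n_spam + alpha * vocab)
    p_helo_ham = (ham_counts.get(helo, 0) + alpha) / (n_ham + alpha * vocab)

    numerator = p_helo_spam * prior_spam
    return numerator / (numerator + p_helo_ham * (1.0 - prior_spam))


# Made-up counts: "foobar" mostly appears in spam.
spam_counts = {"foobar": 8, "mta.company.com": 1}
ham_counts = {"foobar": 1, "mta.company.com": 9}

if __name__ == "__main__":
    for helo in ("foobar", "mta.company.com", "never-seen.example"):
        print(helo, round(spam_probability(helo, spam_counts, ham_counts), 3))
```

A HELO string never seen before lands close to the prior, which matches the "no strong general rule" point: it nudges the score, it doesn't decide it.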