
People in the comments are getting (rightfully) outraged about the poor understanding of indexing, but I'm a little surprised that everyone here doesn't seem to understand normalization either. The original schema is perfectly normalized and is already in 3NF: none of the columns shown has a dependence on any of the other columns outside of the primary key (in other words, if you knew eg the values of the ip, helo, and from columns, you'd still have no information about the "to" column).

Normalization does not mean "the same string can only appear once". Mapping the string "address1@foo.bar" to a new value "id1" has no effect on the relations in your table. Now instead of "address1@foo.bar" in 20 different locations, you have "id1" in 20 different locations. There's been no actual "deduplication", but that's again not the point. Creating the extra tables has no impact on the normalization of the data.
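For concreteness, a minimal sketch of the kind of single-table schema being described (PostgreSQL-flavored; the table and column names are my guesses, not the article's):

  -- All four attributes are mutually independent, so there are no
  -- non-key dependencies to factor out; 3NF holds trivially.
  CREATE TABLE graylist (
      ip        text NOT NULL,
      helo      text NOT NULL,
      mail_from text NOT NULL,
      rcpt_to   text NOT NULL,
      PRIMARY KEY (ip, helo, mail_from, rcpt_to)
  );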




Thank you. Was thinking that I was the clueless one... (probably a little) because that is not my understanding of what normalization is.

I thought perhaps swapping the strings for integers might make them easier to index, or that it did indeed help with deduplication in that the implementation didn't "compress" identical strings in a column, saving space and perhaps helping performance. But both issues appear to be implementation issues with an unsophisticated MySQL circa 2000, rather than a fundamentally wrong schema.

I agreed with her comment at the end about not valuing experience, but proper database design should be taught to every developer at the undergraduate level, and somehow to the self-taught. Looking at CompTIA... they don't seem to have a DB/SQL test, only an "IT Fundamentals" exam which touches on it.


Came here to say this. Whatever may have been true of InnoDB 20 years ago, don't follow this article's advice for any modern relational database.

To add: Not sure about MySQL, but `varchar`/`text` in PostgreSQL for short strings like those in the article is very efficient. It basically just takes up space equaling the length of the string on disk, plus one byte [1].

[1] https://www.postgresql.org/docs/current/datatype-character.h...
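If you want to sanity-check that on your own install, pg_column_size reports the stored size of a value (the exact number can vary by a few bytes depending on header layout):

  -- Short text values are stored as the string bytes plus a small
  -- length header; expect a result close to the string's length.
  SELECT pg_column_size('address1@foo.bar'::text) AS bytes_used;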


>>> Mapping the string "address1@foo.bar" to a new value "id1" has no effect on the relations in your table.

How do you mean? If id1 is unique on table A, and table B has a foreign key dependency on A.id, then yeah you still have id1 in twenty locations but it's normalized in that altering the referenced table once will alter the joined value in all twenty cases.

This might not be important in the spam-graylisting use case, and very narrowly it might be 3NF as originally written, but it certainly wouldn't be if there were any other data attached to each email address, such as a weighting value.
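Roughly what I mean, as a sketch (hypothetical table and column names, not the article's actual schema):

  CREATE TABLE email_address (
      id   bigserial PRIMARY KEY,
      addr text NOT NULL UNIQUE
  );

  CREATE TABLE graylist_entry (
      from_id bigint NOT NULL REFERENCES email_address (id),
      to_id   bigint NOT NULL REFERENCES email_address (id)
  );

  -- One UPDATE on the referenced row changes what every joined row sees:
  UPDATE email_address
     SET addr = 'new-address@foo.bar'
   WHERE addr = 'address1@foo.bar';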


>> Mapping the string "address1@foo.bar" to a new value "id1" has no effect on the relations in your table.

>How do you mean? If id1 is unique on table A, and table B has a foreign key dependency on A.id, then yeah you still have id1 in twenty locations but it's normalized in that altering the referenced table once will alter the joined value in all twenty cases.

Ok, but that's not what normalization means.

If you have a table as described in the article that looks like:

  foo  | bar  | baz
  -----+------+-----
  foo1 | bar1 | baz1
  foo2 | bar2 | baz2
  foo3 | bar3 | baz3
  foo4 | bar1 | baz2
  foo3 | bar1 | baz4

and then you say "ok, bar1 is now actually called id1", you now have a table

  foo  | bar  | baz
  -----+------+-----
  foo1 | id1  | baz1
  foo2 | bar2 | baz2
  foo3 | bar3 | baz3
  foo4 | id1  | baz2
  foo3 | id1  | baz4

you haven't actually changed anything about the relationships in this data. You've just renamed one of your values. This is really a form of compression, not normalization.

Normalization is fundamentally about the constraints on data and how they are codified. If you took the author's schema and added a new column for the country where the IP address is located (so the columns are now ip, helo, from, to, and country), then the table is no longer normalized, because there is an implicit relationship between ip and country: if IP 1.2.3.4 is located in the USA, every row with 1.2.3.4 as the ip must have USA as the country. If you know the IP for a row, you know its country. This is what 3NF is about. Here you'd be able to represent invalid data by inserting a row with 1.2.3.4 and a non-US country, and you normalize this schema by adding a new table mapping IP to country.
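Sketched out (hypothetical names, PostgreSQL-flavored): the denormalized version can hold contradictory rows, and the fix is to move the dependent attribute into its own table keyed by ip.

  -- Not normalized: country is determined by ip alone, so nothing stops
  -- ('1.2.3.4', ..., 'USA') and ('1.2.3.4', ..., 'France') from coexisting.
  CREATE TABLE graylist_denormalized (
      ip        text NOT NULL,
      helo      text NOT NULL,
      mail_from text NOT NULL,
      rcpt_to   text NOT NULL,
      country   text NOT NULL
  );

  -- Normalized: the ip -> country dependency lives in its own table.
  CREATE TABLE ip_country (
      ip      text PRIMARY KEY,
      country text NOT NULL
  );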

But none of that is what's going on in the article. The author described several fields that have no relationship at all between them--IP is assumed to be completely independent of helo, from address, and to address. And the second schema they propose is in no way "normalizing" anything. The four new tables don't establish any relationship between any of the data.

>This might not be important in the spam-graylisting use case, and very narrowly it might be 3NF as originally written, but it certainly wouldn't be if there were any other data attached to each email address, such as a weighting value.

It's not "very narrowly" 3NF. It's pretty clear-cut! A lot of commenters here are referring to the second schema as the "normalized" one and to me that betrays a fundamental misunderstanding of what the term even means. And sure, if you had a different schema that wasn't normalized, then it wouldn't be normalized, but that's not what's in the article.


Your point is well taken. In the bare description provided for the first schema in the article, the data is already normalized, and all that can be achieved is compression.


> This is really a form of compression, not normalization.

First of all, the index hashes of the emails depend on the email strings, hence the indexed original schema is not normalised.

Secondly, it would not effectively be compression unless there were in fact dependencies in the data. But we can make many fair assumptions about the statistical dependencies. For example, certain emails/ips occur together more often than others, and so on. In so far as our assumptions of these dependencies are correct, normalisation gives us all the usual benefits.


I think I decided long ago that the only normal form worth keeping in mind is the Boyce-Codd normal form.

While you may technically be correct, I think there are few people who think of anything else when talking about database normalization.

That said, I cannot quickly figure out whether it’s actually true here.


Okay, this will be my last comment in this thread because I don't have any new way of presenting this and checking this is not a good use of my time.

This schema is also in Boyce-Codd normal form. It's normalized in every usual sense of the word. Trivially so, even. It's not a question of being "technically" correct. If you think the second schema is more normalized than the first one, you need to re-evaluate your mental model of what normalization means. That's all there is to it.


Plus, depending on the number of emails they get, that optimisation could be unnecessary. That database schema was perfectly fine in some cases.


This is very slightly not quite true, because the HELO string and the remote address should go hand-in-hand.


You might be right (I don't actually know what a HELO string is, I don't know anything about SMTP :). I was just going off how the author presented the data, as a tuple of four completely independent things.

Of course the main point still stands, that the two schemas are exactly as normalized as each other.

Edit: rereading the original post, the author mentions that "they forged...the HELO"--so perhaps there was indeed no relationship between HELO and IP here. But again, I don't know anything about SMTP, so this could be wrong.


I do know about SMTP and you were right regardless, because the author was talking about 4 database fields, not SMTP. The details of SMTP are irrelevant.


Normalisation depends on the semantics of the data, and so the details of SMTP are very much relevant.


incorrect


It's (somewhat) because the HELO is forged that there's no relationship between HELO and IP. The very first message of SMTP is "HELO <hostname>", where hostname can be either a unique identifier (server1.company.com, etc.) or a system-level identifier (mta.company.com for all of your company's outbound mail agents; bulk mailers might use bulk.client1.com when sending as client1, bulk.client2.com, etc.). But there is/was no authentication on what you send as HELO (now you can verify it against the TLS certificate, though many implementations don't do that at all, or don't do it well), so correlating based on the hostname in HELO was questionable at best. Thus, the combination of (HELO, IP) was best treated as a single value, a tuple.
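In schema terms (hypothetical names again), that just means correlating on the pair together rather than on either column independently, e.g. a composite index:

  -- Look up (helo, ip) as one composite value, since neither column
  -- is trustworthy on its own.
  CREATE INDEX graylist_helo_ip_idx ON graylist (helo, ip);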


> But there is/was no authentication on what you send as HELO

Yep, and that explains the "foobar" rows - those should have resolved to the same IP, except that, because there's no authentication blocking it, you could put gibberish there and the SMTP server would accept it.

> so correlating based on the hostname in HELO was questionable at best

Eh, spambots from two different IPs could have both hardcoded "foobar" because of the lack of authentication, so I could see this working to filter legitimate/illegitimate emails from a compromised IP.


Right, certainly useful as a signal in a Bayesian filter, but there's no strong general rule.


Only if the SMTP client is following the RFC, but being spammers, they probably sent anything but the actual domain they were sending from.


> Normalization does not mean "the same string can only appear once".

It can, if you want to easily update an email address, or easily remove all references to an email address because it is PII.


That's not what normalization normally means, no. See e.g. Wikipedia: https://en.wikipedia.org/wiki/Database_normalization



