My biggest pet peeve in the world of data storage is that the most powerful ideas from relational data theory are still mostly locked up in books and papers, and most programmers who think they know what "relational" means aren't aware of the full picture.
The people who did all the pioneering work on the subject (mostly E.F. Codd and Chris Date) seem to have made several blunders in trying to popularize their ideas. For example, one of Date's later books ("Go Faster! The TransRelational(tm) Approach to DBMS Implementation") contains some very useful ideas. But it was written in 2001 and not published until 2011, because he was working with a now-failed startup that was trying to keep it all a trade secret.
Most of their writing is not available online. You have to buy their books. Which is an author's prerogative, but seriously limits the reach of the ideas.
The world thinks it already has relational databases that are good enough. Convincing it otherwise requires a web-savvy marketing approach that has so far been lacking.
I just checked out that book. The results are presented as some kind of revolution, but:
1. The system presented is simply an inefficient way of doing a single-column index on each column.
2. The book contains NO benchmark results. It only claims that it's efficient because of X, Y and Z without any numbers to back that up. Meanwhile the method involves a lot of pointer chasing which is extremely slow, especially on disk. Queries that require multi-column indices will also be extremely inefficient of course, because they will require scans.
I wouldn't go so far as saying that the book is worthless, but its claims are certainly dubious at best.
"Queries that require multi-column indices will also be extremely inefficient of course, because they will require scans."
If a query uses just the columns that are in a multi-column index, and doesn't sort in any order other than the index order, then the query may be much more efficient than with a single-column index - since all the data may be retrieved from the index, rather than seeking to the row.
The conditions you state are by far not the only ones under which a multi-column index helps you. Even if you do not sort by the index order, and you do not have all the data in the index, the index may still be way more efficient. For example, SELECT A FROM T WHERE X=3 AND Y=5 ORDER BY B with an index on (X,Y).
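To make that concrete, here is a minimal sketch (the table and column names are illustrative, not from the thread):

CREATE TABLE t (
    x INTEGER NOT NULL,
    y INTEGER NOT NULL,
    a TEXT NOT NULL,
    b INTEGER NOT NULL
);
CREATE INDEX idx_t_x_y ON t ( x, y );

-- The composite index narrows the scan to exactly the rows with
-- x = 3 AND y = 5; only those rows are fetched and then sorted by b.
SELECT a FROM t WHERE x = 3 AND y = 5 ORDER BY b;

-- Covering variant: appending b and a lets a B-tree answer the query
-- from the index alone, already ordered by b for any fixed (x, y).
CREATE INDEX idx_t_x_y_b_a ON t ( x, y, b, a );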
I actually got a chance to chat about nosql with one of the people who coined the term "nosql" (here on hn, no less). It sounds like it was a term that got away from them a bit. I don't think anyone really intended to suggest that sql should be completely abandoned. notalwayssql, or maybeconsidersomethingotherthansql might have been better terms (well, not better, but more specific to what was actually intended).
Relational databases and sql are an exceptionally sound approach to a lot of data storage and retrieval problems. Some of the "nosql" technologies emerged from genuinely new problems (such as full-text indexing or network/graph data).
Progress is an interesting thing, and doesn't always happen on an even keel. These days, I think you're far less likely to encounter knee-jerk resistance to an rdbms or sql. The dust is settling a bit, and in this case, that's a good thing.
Can't you summarise all of that simply as: "a database schema should be able to represent only valid data"?
e.g. a single (employee, department, location) table is not good because that schema can represent invalid data: a single department can't be in two locations, but the schema can represent that.
The same rule of thumb applies to types in programming: a data type should be able to represent only valid values. This is why sum types are superior to error codes: a function can return a result AND an error code, when what we usually want is that the function returns a result OR an error code. When invalid values are unrepresentable that makes a lot of errors impossible. Though sometimes your type system can't deal with all the constraints that you have, e.g. an array should be sorted (an unsorted array is representable but not valid => bad).
The various normal forms are defined by the kinds of invalid data they reject, so they can be used as a design process to ensure the higher-level goal (that "a database schema should be able to represent only valid data").
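As a minimal sketch of the employee/department/location example above (table and column names are illustrative): splitting the table so the department-to-location dependency is stored exactly once makes the invalid state unrepresentable.

-- One location per department, enforced by the key
CREATE TABLE departments (
    department VARCHAR PRIMARY KEY,
    location VARCHAR NOT NULL
);

-- Employees reference the department, so rows can no longer
-- disagree about where a department is located
CREATE TABLE employees (
    employee VARCHAR PRIMARY KEY,
    department VARCHAR NOT NULL REFERENCES departments ( department )
);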
A repeating field is a field that contains multiple values of the defined type - e.g. the column is of type integer, but the field contains multiple integer values.
Mainstream relational databases do not support repeating fields, so any table will be in 1NF by default. Some non-relational databases support repeating fields.
You can simulate a repeating field in a relational database by e.g. having the column be of type string, and then having multiple comma-separated values in the field. But strictly speaking this is not a repeating field, since the column type is string, and there is only one string. It is still bad design though!
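A minimal sketch of both shapes (names are illustrative): the comma-separated simulation versus the normalized one-row-per-value table.

-- Simulated repeating field: many values hiding inside one string,
-- invisible to the type system and hard to query or index
CREATE TABLE users_bad (
    id INTEGER PRIMARY KEY,
    phone_numbers VARCHAR NOT NULL -- e.g. '555-0101,555-0102'
);

-- Normalized: one row per value, queryable and indexable
CREATE TABLE user_phones (
    user_id INTEGER NOT NULL,
    phone VARCHAR NOT NULL,
    PRIMARY KEY ( user_id, phone )
);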
If you find it hard to read, set the body style e.g. width:600px; margin:auto; font: 18px/24px Calibri,Arial,Helvetica; color:#333; background:#F3F3F3;
About the Author

William Kent (1936-2005) was a renowned researcher in the field of data modeling. Author of Data and Reality, he wrote scores of papers and spoke at conferences worldwide, posing questions about database design and the management of information that remain unanswered today. Though he earned a bachelor's degree in chemical engineering and a master's in mathematics, he had no formal training in computer science. Kent worked at IBM and later at Hewlett-Packard Laboratories, where he helped develop prototype database systems. He also served on or chaired several international standards committees. Kent lived in New York City and later Menlo Park, Calif., before retiring to Moab, Utah, to pursue his passions of outdoor photography and protecting the environment.
While 69 is fairly young, his death does not appear to be tragic in the sense of unfulfilled ambitions.
> While 69 is fairly young, his death does not appear to be tragic in the sense of unfulfilled ambitions.
My comment was from the perspective of the course of his life. After retirement, he had only 5 years to pursue "his passions of outdoor photography and protecting the environment". (Note that he did not continue working on his research.)
He was under pressure to deliver results (school, work, etc.) for the majority of his life, and he had only 5 years of complete, independent (within financial constraints) freedom. That is sad.
It's worth noting that this model works quite well for the vast majority of client-facing applications out there.
I.e. things categorized by tags or groups where 1-to-n relationships are necessary. Of course, the extreme of this is EAV (Entity Attribute Value), which I would limit to meta/taxonomy data for performance reasons.
But information where the labels (or quantity) aren't known before input is easier to deal with in normalized databases.
E.g. getting metadata on a specific entry with no prior knowledge of said data except the parent id:
SELECT id, label, content FROM meta WHERE id IN (
    SELECT meta_id FROM posts_meta WHERE post_id = :id
);
The following is an excerpt from the actual schema I used on a very simple forum that ran on SQLite for years before switching to Postgres fairly recently. The normalization (varying forms) afforded a lot of flexibility. Maybe someone will find it useful.
CREATE TABLE posts (
id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME DEFAULT NULL,
title VARCHAR NULL,
summary TEXT NOT NULL,
body TEXT NOT NULL,
plain TEXT NOT NULL,
quality FLOAT NOT NULL DEFAULT 0,
status INTEGER NOT NULL DEFAULT 0,
reply_count INTEGER NOT NULL DEFAULT 0,
auth_key VARCHAR NOT NULL
);
CREATE INDEX idx_posts_on_status ON posts ( status );
CREATE INDEX idx_posts_on_created_at ON posts ( created_at );
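-- Full-text search index over the plain-text post bodies,
-- kept in sync by the triggers below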
CREATE VIRTUAL TABLE posts_search USING fts4 ( search_data );
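-- Reply threading: links each child post to its parent(s)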
CREATE TABLE posts_family (
parent_id INTEGER NOT NULL,
child_id INTEGER NOT NULL,
PRIMARY KEY ( child_id, parent_id )
);
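-- Tags and categories; terms can be nested via taxonomy_family below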
CREATE TABLE taxonomy (
id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
label VARCHAR NOT NULL,
term VARCHAR NOT NULL,
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME DEFAULT NULL,
status INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE posts_taxonomy (
post_id INTEGER NOT NULL,
taxonomy_id INTEGER NOT NULL,
PRIMARY KEY ( post_id, taxonomy_id )
);
CREATE TABLE taxonomy_family (
parent_id INTEGER NOT NULL,
child_id INTEGER NOT NULL,
PRIMARY KEY ( child_id, parent_id )
);
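-- Free-form key/value metadata, attached to posts through posts_meta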
CREATE TABLE meta (
id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
label VARCHAR NOT NULL,
parse_as VARCHAR NOT NULL DEFAULT 'text',
content TEXT NOT NULL
);
CREATE TABLE posts_meta (
post_id INTEGER NOT NULL,
meta_id INTEGER NOT NULL,
PRIMARY KEY ( post_id, meta_id )
);
CREATE UNIQUE INDEX idx_taxonomy_on_terms ON taxonomy ( label ASC, term ASC );
CREATE INDEX idx_taxonomy_on_status ON taxonomy ( status );
-- Triggers
-- Post create procedures
CREATE TRIGGER post_after_insert AFTER INSERT ON posts FOR EACH ROW
BEGIN
INSERT INTO posts_search ( docid, search_data )
VALUES ( NEW.rowid, NEW.plain );
UPDATE posts SET updated_at = CURRENT_TIMESTAMP WHERE id = NEW.rowid;
END;
-- Post update procedures
CREATE TRIGGER post_before_update BEFORE UPDATE ON posts FOR EACH ROW
BEGIN
DELETE FROM posts_search WHERE docid = OLD.rowid;
END;
CREATE TRIGGER post_after_update AFTER UPDATE ON posts FOR EACH ROW
BEGIN
INSERT INTO posts_search ( docid, search_data )
VALUES ( NEW.rowid, NEW.plain );
UPDATE posts SET updated_at = CURRENT_TIMESTAMP WHERE id = NEW.rowid;
END;
-- Post deletion procedure
CREATE TRIGGER post_before_delete BEFORE DELETE ON posts FOR EACH ROW
BEGIN
UPDATE posts SET reply_count = ( reply_count - 1 )
WHERE id != OLD.rowid AND id IN (
SELECT parent_id FROM posts_family WHERE child_id = OLD.rowid
);
DELETE FROM posts_family WHERE parent_id = OLD.rowid OR child_id = OLD.rowid;
DELETE FROM posts_search WHERE docid = OLD.rowid;
DELETE FROM posts_taxonomy WHERE post_id = OLD.rowid;
DELETE FROM meta WHERE id IN (
SELECT meta_id FROM posts_meta WHERE post_id = OLD.rowid
);
DELETE FROM posts_meta WHERE post_id = OLD.rowid;
END;
-- Post parent insert procedures
CREATE TRIGGER posts_family_after_insert AFTER INSERT ON posts_family FOR EACH ROW
BEGIN
-- NEW.parent_id is the post gaining a reply; NEW.rowid here would be
-- the posts_family row's own rowid, not a post id
UPDATE posts SET reply_count = ( reply_count + 1 ) WHERE id = NEW.parent_id;
END;
-- Taxonomy procedures
CREATE TRIGGER taxonomy_after_insert AFTER INSERT ON taxonomy FOR EACH ROW
BEGIN
UPDATE taxonomy SET updated_at = CURRENT_TIMESTAMP WHERE id = NEW.rowid;
END;
CREATE TRIGGER taxonomy_after_update AFTER UPDATE ON taxonomy FOR EACH ROW
BEGIN
UPDATE taxonomy SET updated_at = CURRENT_TIMESTAMP WHERE id = NEW.rowid;
END;
CREATE TRIGGER taxonomy_before_delete BEFORE DELETE ON taxonomy FOR EACH ROW
BEGIN
DELETE FROM posts_taxonomy WHERE taxonomy_id = OLD.rowid;
DELETE FROM taxonomy_family WHERE parent_id = OLD.rowid OR child_id = OLD.rowid;
END;
That's not an associative array, it's just an array of a composite type. The hstore type is probably the closest you'd get to an associative array in PostgreSQL, but the keys and values are always strings.
> The hstore type is probably the closest you'd get to an associative array in PostgreSQL
No, a table (possibly a temp table with the results of a particular query) with an appropriately defined primary key (or other unique index) would be the closest you can get to an associative array -- since a relation with n candidate keys is exactly identical to n different associative arrays with the same data but different key/value splits.
But an array of a composite type drawn in such a way that some set of columns is guaranteed unique is pretty much the same thing from a data perspective, even though you may need to do work on the client side to load it into a structure that supports associative array operations (e.g., efficient key/value lookups).
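A minimal sketch of both approaches in PostgreSQL (the table and key names are illustrative):

-- hstore: string keys, string values
CREATE EXTENSION IF NOT EXISTS hstore;
SELECT 'a => 1, b => 2'::hstore -> 'a'; -- returns '1' as text

-- A keyed table: the primary key makes it behave like an
-- associative array with typed values
CREATE TEMP TABLE prices (
    sku VARCHAR PRIMARY KEY,
    price NUMERIC NOT NULL
);
SELECT price FROM prices WHERE sku = 'ABC-123'; -- key lookup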
Keep in mind that preserving space was of the utmost importance in 1982. Eliminating data redundancy was considered the height of good design, even at the expense of maintainability, intelligibility, and real application behavior.
Please do not allow your data model designs to be artificially constrained by the requirements of the past.
Wow, what an incredibly dense thing to presuppose, and to advise others that the idea of data normalisation is antiquated because its primary purpose, when it was "thought of", was to save disk space!
Data normalisation is all about correctness - mathematical correctness. It's about reducing inconsistencies in your abstracted domain model so that you do not suffer side effects when using your data (such as in your programs). It is not optional.
This is poor advice to others. Especially if the advised are uninformed, new, or finding the subject hard.
Normalization might actually take up more space, e.g. if you have to break out a column into a separate table and introduce a new primary key.
The reason for normalization is to preserve integrity of the data. If the same information occurs multiple times, it is possible to change it in one place and not the other, thereby introducing inconsistent data. This might have disastrous results depending on the application.
As far as I know, data integrity is still pretty important even today.
Redundancy should be a cause for concern regardless of the amount of storage available. Redundancy might be used deliberately, e.g. for caching or robustness, but even then you usually want one canonical version of the data.
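To make the update anomaly concrete, a minimal sketch (names are illustrative): duplicating a customer's address on every order lets the copies drift apart, whereas a foreign key keeps one canonical copy.

-- Redundant: the address is repeated on every order, and the
-- copies can be updated inconsistently
CREATE TABLE orders_bad (
    id INTEGER PRIMARY KEY,
    customer_name VARCHAR NOT NULL,
    customer_address VARCHAR NOT NULL
);

-- Normalized: one canonical address per customer
CREATE TABLE customers (
    id INTEGER PRIMARY KEY,
    name VARCHAR NOT NULL,
    address VARCHAR NOT NULL
);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers ( id )
);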
For the love of (insert small thing here). NoSQL is not necessarily about throwing data integrity out the window. It's about re-organizing your data so that the things you need to keep consistent are stored together, such as a document with a sub-document or list. Pretty much every data store promotes some kind of data integrity.
It might not be ACID data integrity. It might not care to enforce referential integrity outside of the document, or value, or whatever the modeling paradigm is. There is still integrity. If you don't like that, fine. Fair enough. Some might be okay with such a system. Name calling is of little benefit. It is a form of violence. "Violence is the last refuge of the incompetent" - Isaac Asimov.
If you would like to learn more about the new data stores, see my short book, New Data: a Field Guide. https://leanpub.com/NewDataAFieldGuide. It's targeted at managers and others with little time who want to get a good overview.
However, it was more apt as a response to the parent's comment. That comment did not say why NoSQL users are problematic, or even wrong. It just flat out lambasts anyone who walks away from the orthodoxy of ACID-SQL with an epithet. It truly was an incompetent comment.
I'm using ArangoDB on my project. My reason is that honestly what I'm doing really is document-y. I could use MySQL or Postgres, but I'm opting for something different for the sake of something different. I do want to have integrity, though. Arango has integrity at the document level as well as distributed transactions. It doesn't enforce references though. I'm fine with that.
What annoys me is when Internet commentators attack something new without at least a cursory show of reasoning.
No, it was calling name-callers incompetent, which caused an unintended self-reference :-) I suppose it's not irony, but I don't think it's as far from it as Alanis.
That's not what the parent did. He called them hipsters. Hipster, at least in HN circles, is not a compliment. It's a disparaging term. He called them hipsters without any force of argument.
Two questions: 1) where did I lie? 2) How is insulting not a form of violence? To critique a work as incompetent is not to disparage the person. It is to say they didn't think something through.
In my post I said the NoSQL has a different definition of integrity. As a result, one has to use its terms in order to come to a conclusion about the validity of its claims. I took offense at the OP equivocating on integrity and then insulting people.
Eliminating data redundancy isn't about saving space and can increase space requirements depending on the data. It's really just the database equivalent of the DRY[1] principle. Which is itself about maintainability, intelligibility, and real application behaviors. Although I will admit that the 4th and 5th normal forms can trade intelligibility for flexibility.
The brain-dead perception that "relationships are bad" came from the old brain-dead MySQL query optimizer (which has since been fixed, as far as I know). They did actually make things universally slower in that database engine and were a universally bad idea if you were using that specific database engine.
Relationships (and by association the normal forms) provide extremely important information/hints to the query optimizer and allows it to do stuff like join elimination[1]. It is my belief that SQL's bad name came from the widespread use of MySQL's partially functional engine and the poor performance caused by the lack of a proper query optimizer.
Aside: sometimes normal forms do cause too much additional overhead during queries, which is when you can use materialized views[2].
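For example, a minimal PostgreSQL sketch (reusing names from the forum schema quoted earlier, purely for illustration): precompute an expensive aggregate once, and refresh it on a schedule.

-- Precompute a join/aggregate that is too slow to run per request
CREATE MATERIALIZED VIEW post_tag_counts AS
SELECT t.label, COUNT(*) AS post_count
FROM taxonomy t
JOIN posts_taxonomy pt ON pt.taxonomy_id = t.id
GROUP BY t.label;

-- Re-run the underlying query when fresher data is needed
REFRESH MATERIALIZED VIEW post_tag_counts;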
Not to mention that a normalized database is ripe for translation into analysis forms (cubes etc.), and that advanced tooling generally has a better understanding of normalized databases.
Basically, there are many more reasons (above the valid ones that you give) to have normalized forms and describe referential integrity.
>came from the old brain-dead MySQL query optimizer (which has since been fixed, as far as I know).
It came from the fact that mysql doesn't support joins properly. I do not believe this has been fixed, as far as I know it still only supports nested loop joins.
Yep, choosing the correct joining method is one of the tasks that a query optimizer (a.k.a. join engine) performs. I think someone around here pointed me toward a document that indicated that there was, at the very least, an experimental competent query optimizer in the MySQL codebase.
Even then it's a far cry from "the big guns." Things like statistics, DPVs, etc. simply aren't (edit: or weren't) present in MySQL, and those are more examples of critical components of efficient database design at massive load scales.
TL;DR: MySQL simply isn't competitive; it's being improved (especially MariaDB) but is a long way off.
MySQL's optimizer isn't best of breed but your comment is massively ignorant. MySQL supports transactions, row level locking, nested SQL, and even replication -- how would you ever get the idea that it only ever just supported nested loop joins?
MySQL always did more than nested loop joins -- which is literally the worst kind of join. It's the join you get if you have no indexes on anything in any DBMS. I fail to see what your reply here is trying to refute.
Even your comment stating that MySQL doesn't maintain statistics is totally wrong and easily checked with 5 seconds of Googling.
To your edit: The idea that a DBMS would support so many advanced features and do nested loop joins is laughably ridiculous.
It doesn't do hash joins, which is the biggest limitation, but MySQL is pretty good at joins and certainly was never bad enough to warrant its reputation. Its toolbox is smaller than other RDBMS products', but it's not empty.
You claim that MySQL always did more than nested loop joins, and yet the features cited in that blog post were introduced in version 5.6, which was released in 2013... Several years after the other advanced features that supposedly preclude MySQL from doing something as stupid as nested loop joins.
No, my comment is a simple statement of fact. None of what you just mentioned is even remotely relevant. I get the idea that it only supports nested loop joins because it only supports nested loop joins. "MySQL resolves all joins using a nested-loop join method": http://dev.mysql.com/doc/refman/5.6/en/explain-output.html
In my experience, if you don't at least adhere to third normal form, you eventually regret it. Normalized data is more flexible when you need to extend a data model, at the cost of a little more effort during initial development.
I agree with you on the third normal form, but let's consider the fifth too. In TFA the example is agents working for car makers. Let's use a more realistic example: agents working for car dealers.
So Frank sells cars for Festive Ford. We store that in fifth normal form by saying that Frank sells Fords, Frank works at Festive Ford, and Festive Ford sells Fords.
But one day Eve walks into Festive Ford and wants to trade in her Eagle. Frank cuts her a sweetheart deal and Eve drives away in a brand new Focus. Festive is now left with an Eagle to sell. Maybe we'll add a record saying that Festive sells Eagle cars too, though that seems silly when they've only ever had one to sell, and Eagle has been out of business for a decade already. And even if we do that, Frank is a Ford man, so if he manages to sell the Eagle, we'll need to add a record for Frank being an Eagle salesman too.
After a few years of this, everyone is selling every brand according to the database, and the business people wonder why the website says Festive AMC-BMW-Chrysler-Dodge-Eagle-Ford-GMC-Honda-Isuzu-Jeep-Kia-Lexus...oh you silly programmers!
I'm not totally sure what you are getting at, but are you suggesting that the business rules described in the example (that salesmen are limited to selling certain products for certain brands) are not realistic in a real-world car dealership? That might be true, since it is just a simple example intended to demonstrate a concept.
If everybody can sell every product, obviously there is no reason to keep a list of who can sell what. You just keep a list of actual sales. But that is a question of designing the database to correspond to the business.
It has nothing to do with the merits of any of the normal forms. The normal forms don't put any requirements on your business rules.
That's not what the database says. It doesn't say "x has a current contract with every brand" or "x offers every brand for sale", it says "x has offered every brand at some point in history".
I think this is perhaps the biggest problem with 3NF and above: People construct sloppy queries (such as confusing "sells" with "has ever sold in the past"), verify that they give the expected results on a seven-row test table, and then deploy them in production. 3NF and above demand clear thinking.
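A minimal sketch of the three binary relations from the Frank/Festive Ford example (names are illustrative), plus a query that asks "currently sells" rather than "has ever sold":

CREATE TABLE agent_brand (
    agent VARCHAR NOT NULL,
    brand VARCHAR NOT NULL,
    PRIMARY KEY ( agent, brand )
);
CREATE TABLE agent_dealer (
    agent VARCHAR NOT NULL,
    dealer VARCHAR NOT NULL,
    PRIMARY KEY ( agent, dealer )
);
CREATE TABLE dealer_brand (
    dealer VARCHAR NOT NULL,
    brand VARCHAR NOT NULL,
    PRIMARY KEY ( dealer, brand )
);

-- "Who currently sells Fords at Festive Ford?" -- joins all three
-- relations instead of mining a log of past sales
SELECT ab.agent
FROM agent_brand ab
JOIN agent_dealer ad ON ad.agent = ab.agent
JOIN dealer_brand db ON db.dealer = ad.dealer AND db.brand = ab.brand
WHERE db.dealer = 'Festive Ford' AND ab.brand = 'Ford';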
I am an ardent fan of Postgres, and think it's important to properly store data for all the reasons mentioned elsewhere.
That said, there are cases where people get a bit too architecture astronaut with the DB relations, making simple things difficult before those complications are really necessary.
I once had an interview with an "architect" who was, in fact, a trained architect (someone who designs buildings; he probably got into IT during the .com bubble), and who had zero training in DB theory.
We were chatting about database design, and he said the same line about normalization being about saving space in the old days.
Naturally, I declined the job offer. In fact, I made that one of my litmus tests for job hunting, along with "do your databases have referential integrity?" and "what's your opinion of primary keys?".
More likely he got out of architecture with the S&L crisis (or more fully, the Reagan tax reforms that reduced the attraction of passive losses on real estate, which triggered it). He may have come into IT from CAD management, which was somewhat on the cutting edge for small businesses in the early 1990s [e.g. Novell networks, megs of RAM, 80387s and multiple monitors... AutoCAD with an Nth Engine and Hercules card was better than can't-touch-this baggy pants.]
I have to admit that the changes brought about by the S&L crisis in the US pale relative to the changes to South Africa's landscape in the early 1990's.
>> Please do not allow your data model designs to be artificially constrained by the requirements of the past.
I'm currently dealing with a large piece of software where many of the high-profile bugs stem from data duplication errors. This isn't even using a DBMS; it's happening in classes (which are very similar to tables, BTW). I gave up DB programming as my primary job about 15 years ago, but the lessons of 3NF are applicable way beyond the DBMSs of the world. It's still quite relevant advice.
It's problematic to view the evolution of data modeling in the same way as the evolution of programming languages or paradigms associated with programming languages (including SQL).
So while languages and language-oriented paradigms that assist in abstractions (e.g. OOP) have changed a lot over the last few decades, the concept of data abstraction and modeling (relational data modeling in particular) has seen less dramatic evolutionary change.
I wonder whether this in itself is the reason why some folks believe there ought to be a change from an ostensibly outmoded and antiquated paradigm!