Fascinating. I haven't finished reading the entire presentation yet, but I already stumbled on something.
The author lists a drawback of the first antipattern ("metadata tribbles"): a table split into multiples to keep its size down requires tricky querying, specifically, it requires a union across all splits (slide 10). It also requires extra effort to keep table structures synchronized (slide 11). Then, in a solution to his second antipattern ("entity-attribute-value"), he suggests doing something quite similar to his first antipattern: creating multiple tables with nearly identical columns and using a union to query across them (slides 27 and 28).
Hello? So to resolve antipattern 2 you just apply antipattern 1? This is why precisely why relational databases for most problems look like square pegs being hammered into round holes.
[EDIT] Not sure if it's worth finishing reading this presentation. Slide 56, a "solution" to the problem of storing a hierarchy in a relational database. Breadcrumbs. Not bad, except that the query now requires a "like" clause. Whoops, can't use an index on that column anymore! (Feel free to correct me if there exists a RDBMS which can use an index on a "like.") Hope the author doesn't mind a nice table scan on that query. He doesn't even mention this problem.
Anti-pattern one was critiquing using values in table definitions, the second was using attributes in table definitions. These are not the same thing. There isn't a similarity between values and attributes in the relational model, which is the whole point of avoiding key-value strategies which are confusing an attribute with a value.
A big part of why the modelling of data that you don't know the shape of in advance in a relational database is hard is because reasoning about data you don't know in advance is hard, and the model exposes that difficulty up front.
(Also, if like has its wildcard at the end of a string not the beginning, indexes can help you as they can look into the sorted trees.)
> Feel free to correct me if there exists a RDBMS which can use an index on a "like."
huh? mysql does that for 'like' queries which don't start with a wildcard (that's exactly what you need for breadcrumbed trees).
also (i didn't read TFA, though) you don't have to use 'like'. you can use BETWEEN or just "path >= 'a' and path < 'b'", which, i suppose, uses index in even the most stupid RDBMS in the world).
prooflink log from my live real production database with a breadcrumbed tree: http://pastie.org/541611
Well, certainly in Oracle, it can use an index on a like if it is structured:
where col like 'str%'
However, doing
where col like '%str'
Means it cannot use any index. So in the example given, it would be able to use an index because there is a leading 'path' before the %, assuming MySQL can do the same thing.
In Oracle you can do hierarchy query's using the 'connect by' syntax, which means you can structure your table with id, parent_id and get all the rows out efficiently in a single piece of SQL, which is nice!
You then use a Common Table Expressions query, like this:
WITH CommentTree (ParentID, ID, Level)
AS
(
SELECT ParentID, ID, 0 as [Level],
FROM Comments
WHERE ParentID = [ROOT COMMENT ID] OR ParentID IS NULL
UNION ALL
SELECT c.ParentID, c.ID, ct.Level + 1
FROM Comments c
JOIN CommentTree ct ON ct.ID = c.ParentID
)
-- Now actually query the data from the CTE Expression
SELECT ParentID, ID, Level FROM CommentTree
ORDER BY ParentID, Level, ID
This returns a table with a calculated column called Level.
The like search will still be faster than hitting the db repeatedly to get each node's children which is how most people seem to implement retrieving a tree.
This "path" method is straight out of SQL for Smarties - Hierarchies & Trees" and though the nested set method is probably better for most uses, the path column has its uses.
There's a few tricks you can do with Oracle function based indexes to optimize out like clause in this case. Can't speak for other RDBMS but I imagine they have similar facilities.
I disagree with slides 20 - 30 (calling "entity-attribute-value" tables an antipattern). Variable attributes table not only gives you a lot of application flexibility but also helps performance in that you're not polluting your primary table with extraneous metadata columns that are hardly ever referenced. It doesn't surprise me that DB programmer would look at that and be offended that it's not purely referential -- not everything that happens in a database has to be referential to everything else. Sometimes databases are just dumb data stores. Not all problems are solved by "referential integrity"
The Entity-Attribute-Value idiom (tuplestores, RDF, BigTable, etc.) is by far my favorite data model. The App Engine 'Datastore' sugar on top of BigTable is pretty goddamn awesome.
RDF would have been great if the W3C wasn't full of idiot blowhards.
There are some application domains that have not lived up to their potential because of RDBMS abuse where EAV would fit perfectly -- Desktop Search for example. All of [Spotlight, Google Desktop, Beagle, Gnome Tracker] use RDBMSs with a pre-determined set of 'attributes' baked in, crippling any creative use. Spotlight at least has the decency to let you add your own non-indexed properties, but it's not very useful. The contents of people's filesystems and metadata about them will never fit into some spec-writer's schema.
Compared to RDBMSs I think sugared EAV is lot more amenable to being abused in the "when all you have is a hammer" way -- they start a bit slow but scale linearly with more resources without getting any slower.
The author lists a drawback of the first antipattern ("metadata tribbles"): a table split into multiples to keep its size down requires tricky querying, specifically, it requires a union across all splits (slide 10). It also requires extra effort to keep table structures synchronized (slide 11). Then, in a solution to his second antipattern ("entity-attribute-value"), he suggests doing something quite similar to his first antipattern: creating multiple tables with nearly identical columns and using a union to query across them (slides 27 and 28).
Hello? So to resolve antipattern 2 you just apply antipattern 1? This is why precisely why relational databases for most problems look like square pegs being hammered into round holes.
[EDIT] Not sure if it's worth finishing reading this presentation. Slide 56, a "solution" to the problem of storing a hierarchy in a relational database. Breadcrumbs. Not bad, except that the query now requires a "like" clause. Whoops, can't use an index on that column anymore! (Feel free to correct me if there exists a RDBMS which can use an index on a "like.") Hope the author doesn't mind a nice table scan on that query. He doesn't even mention this problem.