A well-thought-out post and a good read, but both of the author's points about the limitations of relational databases are a little misleading.
The first, "Every single element in a relation (aka table) has to be exactly the same type" is true in many relational databases, but isn't necessary: SQLite, for example, is implemented with a nice form of dynamic typing where individual columns can take on multiple data types.
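To make the SQLite point concrete, here is a minimal sketch using Python's built-in `sqlite3` module: a single column with no declared type happily stores an integer, a string, and a float side by side (SQLite columns have type *affinity*, not strict types).

```python
import sqlite3

# In-memory database; the column has no declared type at all, which is legal.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (val)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), ("two",), (3.0,)])

# Each row comes back as whatever type was stored in it.
rows = [type(v).__name__ for (v,) in
        conn.execute("SELECT val FROM t ORDER BY rowid")]
print(rows)  # → ['int', 'str', 'float']
```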
Second, "SQL isn't even Turing-complete" is sort of a folk theorem in computer science that isn't necessarily true anymore, especially given all of the proprietary SQL dialects in existence. The original standard, SQL92, is basically equivalent to relational algebra, which has the expressive power of first-order logic; that's so weak it can't even express concepts like graph reachability. But constructs like recursive queries have been added to the SQL standard since SQL92, and they may make even standard SQL Turing-complete.
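Graph reachability — the very thing plain SQL92 can't express — falls straight out of a recursive common table expression. A sketch, run through SQLite via Python (the `edge` table and its contents are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
# 1→2→3→4 form one component; 5→6 is deliberately unreachable from 1.
conn.executemany("INSERT INTO edge VALUES (?, ?)",
                 [(1, 2), (2, 3), (3, 4), (5, 6)])

# Transitive closure from node 1, via WITH RECURSIVE (added after SQL92).
reachable = [row[0] for row in conn.execute("""
    WITH RECURSIVE reach(node) AS (
        SELECT 1
        UNION
        SELECT e.dst FROM edge e JOIN reach r ON e.src = r.node
    )
    SELECT node FROM reach ORDER BY node
""")]
print(reachable)  # → [1, 2, 3, 4]
```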
AFAIK PL/SQL is not "Oracle's SQL". It's executed by a different engine, and you get context switches when you execute SQL code from it (I don't know about T-SQL).
Second, "SQL isn't even Turing-complete" is sort of a folk theorem
And it doesn't mean anything unless you specify the exact limitations. If they are merely theoretical limitations that nobody runs into in practice: who cares?
This seems like an aimless rant. Rails (and Hibernate to my knowledge) uses SQL for exactly what he recommends: filtering, aggregation, and joining. The RDBMS is a very real and concrete concept in ActiveRecord. If you don't know SQL, it's going to make it difficult to get beyond the most basic ActiveRecord functionality.
ActiveRecord and Hibernate can certainly do everything SQL can do. But the modeling and data access patterns they encourage are inconsistent with the strengths of relational databases and the relational model.
The point of using a relational database system is to think in sets of tuples because doing that makes life simpler (for some types of tasks). You apply one of a handful of set operations to a set and you get back another set of tuples.
Once you start assigning types and custom operations to particular combinations of attributes, you lose this recursive transformability and hence the simplicity of the model. What you get is a horribly complex monster.
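The "sets of tuples" point can be illustrated directly: every set operation consumes relations and yields another relation, so queries compose without any per-type plumbing. A small sketch (the `employees` and `managers` tables are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept TEXT);
    INSERT INTO employees VALUES ('alice','eng'), ('bob','eng'), ('carol','sales');
    CREATE TABLE managers (name TEXT);
    INSERT INTO managers VALUES ('alice'), ('carol');
""")

# Restriction and intersection each return a relation, so they nest freely:
# "engineers who are also managers" is just one set fed into another operator.
rows = [r[0] for r in conn.execute("""
    SELECT name FROM employees WHERE dept = 'eng'
    INTERSECT
    SELECT name FROM managers
""")]
print(rows)  # → ['alice']
```

The result is again a set of tuples, so a further join, union, or aggregation could be applied to it unchanged — that closure property is the simplicity being argued for.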
The whole OO idea is a tangled mess. ActiveRecord and Hibernate do their best to help you view the database as a tangled mess as well.
I've written at length about this in the past, and basically my opinion comes down to there being two ways to approach this. One is driven by the application, the other is driven by the database.
The application-driven approach is, I believe, the vastly more common use case: you have some things in your code (objects, data structures, whatever) and want some way to persist them and fetch them back later according to their properties. So you use a SQL-based DB because, hey, you can get one for zero monetary cost (PostgreSQL, MySQL, SQLite, etc.), every language has ways to speak SQL, and every developer on your team knows SQL.
The database-driven approach is, in my experience, less common: you start with the database, design it meticulously, take full advantage of the relational model and the features it and your DB offer, build as much as possible in the DB layer and then let people write applications which talk to it.
Both you, and the author of the reply in the email this thread links to, seem to inhabit the latter use case. And that's fine; I know people who do that stuff and I respect it. Where the problem comes in is that you have a tendency to assume that your use case and your approach are or should be the only use case and approach. Which is, frankly, wrong. There are plenty of situations where the application-driven approach does just fine and will take you a very, very long way (and when it breaks down, it probably will not break down because you need to switch to the database-driven approach; it'll break down because of other things), and where the database-driven approach really isn't a great fit (for various reasons).
To be fair to the author of the email, he didn't entirely assume his approach was the only good one, since he did suggest:
"either use the data, and structure your data (at that layer) to take advantage of it, or don't use a database."
E.g. if what you want is to persist objects, then use something other than a relational database. Perhaps one problem is that most programming environments don't have an obvious choice for this, so MySQL or equivalent just jumps out as the default candidate. A very brief amount of googling for various permutations of "scalable object store" or "object oriented database" didn't seem to return anything useful.
I use Erlang which has an integrated DBMS that can store arbitrarily complex types without marshalling. Are there some equivalents on other platforms?
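A rough analogue on the Python side is the standard-library `shelve` module, which persists arbitrary picklable objects keyed by string without any hand-written marshalling (though it offers nothing like Mnesia's distribution or transactions). A minimal sketch:

```python
import os
import shelve
import tempfile

# shelve stores arbitrary picklable Python objects with no explicit
# (de)serialization code; nested structures round-trip as-is.
path = os.path.join(tempfile.mkdtemp(), "store")
with shelve.open(path) as db:
    db["session"] = {"user": "alice", "carts": [[1, 2], [3]]}

with shelve.open(path) as db:
    restored = db["session"]
print(restored["carts"])  # → [[1, 2], [3]]
```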
I don't disagree with your application-driven vs. database-driven distinction, and I agree that it's impossible to say in general which one is better. My experience is that it is rare for an application to stand on its own for long; data has a longer life span than application code. But anyway, both scenarios do exist and both are legitimate.
But making this distinction says little about what are good design principles in order to reduce the complexity of a system. My opinion is that data and relationships among data items should be represented uniformly on the application level as well as on a broader data management level.
Creating one API per combination of attributes leads to a combinatorial explosion, an increase in coupling and unnecessary mental load. It makes generic transformation of data and querying very difficult indeed. We need to work with few general purpose data structures that are easy to reason about.
And no, I'm not assuming that my approach should or even could be the only one. But I think the standard reply of "one size doesn't fit all" or "use the right tool for the job" has become all too fashionable. There are many tools for the same job, so I have to have an opinion and I have to choose. I'm not an authoritarian person at all, so I don't care the least if you make a different choice ;-)
You seem to be arguing for databases which exist and are structured independently of any applications that happen to access them. Personally, I think this makes about as much sense as arguing for, say, electrons to have well-defined properties independently of anyone trying to measure them, which is the same as saying that there is no such thing.
You also seem to have trouble accepting that there may be lots of situations where there is exactly one application, and if that application goes away, then so does the company (or the department, or the project). In those cases, I don't see much value in trying to make the database be independent of the application; the database exists to serve that application, and if they happen to be tightly coupled to each other, so be it: sometimes that's how you get something to work.
Agreed. It's complete ignorance on the author's part to suggest that Rails and Hibernate serialize objects into a single cell. acts_as_nested_set even provides a decent way of dealing with tree structured data in a relational store.
His second part is off base as well: the challenges of serialization in strongly typed languages have little to do with wire protocols. He seems to be suggesting that adding a translation layer would solve the problem, but that would still fail once deserialization is involved, because the translation layer would then become brittle. It would be wiser to step away from a serialization-based persistence approach entirely; serialization may be a good idea for wire protocols, since that data is more ephemeral.
About the serialization, I understood he basically said "don't use that", not that an explicit, manual translation layer would solve any problem. It just makes you aware of it. Translation layers are brittle even if they are automatic. The only solution is to have a protocol (or file format) that changes as little as possible, and an automatic marshalling facility doesn't encourage you to do that.
I see one case where automatic serialization is harmless, though: when you have the guarantee that the program producing the data and the program receiving it are of the same version and don't talk to the outside world. Temporary files used to recover crashed sessions are a good example.
> Thanks for your reply. It was quite interesting, though I get the feeling you used my question solely as a trigger to share with us a long-held dissatisfaction with the current state of affairs concerning the use of databases, regardless of whether it actually applies to my particular problem.
For beginners, this is definitely good advice. Particularly the first point - if you're using a relational database, and don't structure your data around its strengths, you'll take a profound performance hit. And sure, don't pass around data structures if you don't know what you're doing, and MySQL is a pretty crummy place to put them.
However, once you're doing real work, sometimes translating to XML and back is extraordinarily expensive.
The best approach is a hybrid. Use the database to store things relationally if possible, and go through defined APIs in any case. And if your translation stage is expensive, use something like memcached to make sure you do that translation as infrequently as possible. _This_ is the layer where it's appropriate to store serialized data structures: it's not permanent storage, you can blow it away when you change the internal structures, and nobody external is relying on it. In most programs you end up with even more speed benefit than if you'd stored the data in the database that way to begin with — the initial build can be expensive, but then you're reading basically from memory. (Not all data is well suited to this approach, but if your data is read frequently and written infrequently, there are few things you can do that will increase performance more.)
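The pattern described is usually called cache-aside. A minimal sketch, with a plain dict standing in for memcached and a hypothetical `expensive_translate()` standing in for the costly row-to-object translation stage:

```python
import time

cache = {}  # stands in for memcached in this sketch

def expensive_translate(key):
    """Hypothetical stand-in for the costly DB-row → object translation."""
    time.sleep(0.01)  # simulate the expensive build step
    return {"id": key, "rendered": f"object-{key}"}

def get_object(key):
    if key not in cache:                   # miss: translate once...
        cache[key] = expensive_translate(key)
    return cache[key]                      # ...then serve from memory

def invalidate(key):
    # Blow the entry away whenever the underlying data or the internal
    # structure changes; nothing external depends on the cached form.
    cache.pop(key, None)

first = get_object(42)
second = get_object(42)   # cache hit, no second translation
print(first is second)    # → True
```

The key design point is the one made above: because the cache is disposable and private, the serialized form can change freely between versions without breaking anything external.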