SQLAlchemy and You

espeed · on July 20, 2011

This article is by Armin Ronacher, the author of Flask. I use Flask for most of my projects because it's decoupled from the ORM and provides a clean way for you to use whatever database tool is right for the job.

But lately I have moved away from using the relational database as the primary datastore. When I do need a relational database, it's simple to use SQLAlchemy in Flask.

But I have been finding graph databases as a more elegant alternative to relational databases because you don't have to mess with tables or joins -- everything is explicitly joined.

Neo4j is one of the leading open-source graph databases, and it's pretty sweet -- it can store 32 billion nodes while doing 2 million traversals per second -- and you can use Gremlin with it (Gremlin is a graph query language, like SQL for graphs).

The graph-database world is starting to emerge, but most graph databases are Java based so non-JVM bindings are not always available.

However, there is a project called TinkerPop (http://www.tinkerpop.com) that has made most of the open-source software stack for graph DBs. TinkerPop is the project that developed Gremlin and Blueprints, and it released a REST service called Rexster that is tied to Blueprints so it lets you connect to any Blueprints-enabled graph database, including Neo4j, OrientDB, Dex, OpenRDF, and an InfiniteGraph implementation is being released next month.

Because the graph-DB world lacked Python support, I decided to write a Python persistence framework that connects to Rexster so you can now use any of those graph DBs from Python, and binary bindings are in the works. The framework is called Bulbflow, and you can find out more about it out at http://bulbflow.com/overview.

In a few weeks I'm going to release Flask-based libraries that do authentication and authorization in a graph-DB way.

zzzeek · on July 20, 2011

You've been posting quite a bit about graph DBs lately including one the other day which stated that "the relational DB is dying as the primary datastore" as well as "Graphs are a much more elegant way of storing relational data". (source: http://news.ycombinator.com/item?id=2778945).

I have to disagree with you here. Graph DBs are a much more elegant way to store graph data, i.e. social media data, some forms of scientific data. But for the vast majority of relational database usage, i.e. finance, accounting, retail, medical records, human resources, academia, reservation databases, etc. the truly vast amount of databases we interact with indirectly all day long, these systems are firmly in the relational camp and that is not changing anytime soon. SQLAlchemy would very much like Python to be prevalent throughout all these industries, and is not just about social media applications.

espeed · on July 20, 2011

Yes, that was in the context of Django as well. Look at the trend for MongoDB compared to PostgreSQL (http://www.google.com/trends?q=mongodb%2C+postgresql&cta...).

Because Flask is decoupled, it let's you use whatever tool is right for the job. If you need to store financial data, it's easy to use SQLAlchemy, but graph databases are much better fit for work working with stuff like the modern-day social graph.

And don't get me wrong, I like SQLAlchemy -- I have used it for years and continue to use it -- thank you for all the work you've put into it.

St-Clock · on July 20, 2011

Look at the trend for Britney Spears compared to PostgreSQL (http://www.google.com/trends?q=britney+spears%2C+postgresql&...). I'm not sure what to make of it because Britney Spears and mongodb do not offer a tenth of the features of postgresql, yet they are clearly more popular.

In computer science, I believe everything can be seen as a graph so of course, graph databases are a great fit for every problem. Object databases were also a great fit for objects. But where are object databases now compared to relational databases?

It's great to have so many options now and it's great to see nosql databases giving ideas to sql databases (http://momjian.us/main/blogs/pgblog/2010.html#April_10_2010_...). Although relational databases might be less elegant for immediate programming problems, they are far more flexible when new needs emerge and their ACID properties have a far better track record than most (if not all given their age) nosql databases.

bretthoerner · on July 20, 2011

> Because Flask is decoupled, it let's you use whatever tool is right for the job.

(Derail alert)

How is it any more difficult to use a graph database in Django? (Hint: if you say "but then you can't use X" my next question is "are you implying Flask includes X and you can automagically use X with a graph database?")

espeed · on July 20, 2011

You absolutely can use a graph DB in Django. But if you're not using a relational DB as your primary datastore (for stuff like authentication and authorization) and not using the ORM, how much of Django would you really use?

Django was built around an ORM. Now you have more database options that have less of an impedance mismatch, but because Django is so ORM-centric, forgoing the RDBMS in favor of another option means you now have a framework mismatch.

Starting with Flask, which was designed to be decoupled, simplifies things.

simonw · on July 20, 2011

"How much of Django would you really use?"

The URL mapper, the view abstraction, the middleware, the debugging tools, the generic views concept, the httprequest and httpresponse classes, the templating system, the caching framework, the syndication library and a bunch of the third party ecosystem tools.

espeed · on July 20, 2011

Ahh...what you describe sounds a lot like Flask.

parfe · on July 20, 2011

It sounds like every halfway decent web framework ever written.

espeed · on July 20, 2011

The point is why not just start with Flask if that's the feature set you're going to end up with.

hello_moto · on July 21, 2011

Becaus I'm more familiar and have experienced in using it. Because I decided to use Django for the other part of my big web application. Because I want to learn more and get better with my tool as opposed to learn the Python web app framework of the week.

jbellis · on July 20, 2011

MongoDB is not by any stretch of the imagination a graph database.

espeed · on July 20, 2011

You're right -- I was just using Mongo as an example to show that people are moving away from the relational database as the primary datastore, esp for Web stuff.

espeed · on July 23, 2011

You've been posting quite a bit about graph DBs lately

I have been a strong advocate for SQLAlchemy too (http://news.ycombinator.com/item?id=2456994). It's just that I discovered the power of graph DBs a few months ago and have grown quite fond of them.

andrewcooke · on July 20, 2011

thanks for that. also, it's worth noting that the licence for ne04j changed recently - the community edition is now GPL rather than AGPL, so it can be used to bootstrap web services for free (if you ever do need to upgrade to the paid-for version then you are hopefully at a scale where you can start charging customers and paying for software).

[i'm not connected to neo4j in any way; i was interested in using it a while back (for what might have become a commercial project), but was discouraged by the agpl licence, so thought i'd share the good news]

[ps and sqlalchemy is awesome too!]

espeed · on July 20, 2011

Yes. The Neo4j community edition is always free, and as long as your code is open source, the enterprise edition is also free even for commercial use (http://neo4j.org/licensing-guide/).

wulczer · on July 20, 2011

SQLAlchemy supports multiple-column primary keys, Django ORM does not. What's more to think about?

cturner · on July 20, 2011

If I'm designing a schema, I always have a primary key column called "id" with an autoincrement sequence against it. Sometimes I'll have a domain table, with key varchar, description varchar.

Perhaps you deal with existing systems you didn't design. Aside from that - where is the cause to use a multi-column primary key?

wulczer · on July 20, 2011

There are strong opinions on it, but I firmly believe that "always have an autoincremental id" is a very wrong approach.

Read http://it.toolbox.com/blogs/database-soup/primary-keyvil-par... (an often-quoted series of blog posts on why surrogate numeric keys are evil).

In general, if you have an entity with a well-defined natural key (users and their logins, shipments and their tracking codes), why would you use an autogenerated key that's meaningless?

Case in point, my current schema, designed from scratch. Users can have multiple dashboards, each dashboard has a label. A user cannot have multiple dashboards with the same label, but the two users can of course use the same label for their dashboards (like "sales" or "mydashboard"). The dashboard's table primary key is (username, label) - that pair uniquely identifies a dashboard, so it's a perfect PK candidate.

Also, as already noted in one response, many-to-many relationships are typically modelled by an intermediate table that has foreign keys to both sides of the many-to-many, and its foreign key is the sum of these foreign keys.

EDIT: added a concrete example

zzzeek · on July 20, 2011

DBAs will emphasize surrogate integer primary keys largely for performance reasons. On SQL Server and others, their performance and size usage in indexes and such is well defined and well optimized. I acutally had a few natural PKs suggested in my current schema and the DBAs insisted that every table primary key on a surrogate.

Surrogate PKs also allow you to not have to worry about ON UPDATE cascades. A gig I did a few years ago we decided to use some natural PKs, but as is reality, these PKs had to change all the time, not as much as part of regular operation, but more as we modified how the site worked and referenced information as well as when we had to go in and correct data that was mis-labeled (it was a content aggregation site sucking in many gigs of data per day). Plus the natural PKs made our indexes, primary key as well as all the referencing foreign key indexes, huge and poorly performing.

wulczer · on July 20, 2011

Yeah, these might all be valid reasons to use surrogate keys. I'm not saying there are none, just that always using one often leads to design mistakes.

Incidentally, part III of the keyvil blog posts talks about reasons to use surrogates :)

As always, nothing's black or white, sometimes surrogates are OK, sometimes they're not. But I see schemas abusing surrogates much more often than schemas abusing natural keys...

zzzeek · on July 20, 2011

After my experience in not using them for a large schema, using an int pk is pretty much a default decision for me, unless the table's PK is meaningful compared to another table like an association table. Doing big data migration jobs where the primary keys have to change is very painful. An identity-map ORM like SQLAlchemy is keying everything on primary key - SQLA supports mutable PKs and cascading updates (both via the DB, as well as in memory if you're on sqlite/MyISAM) fully, but the changing in-memory identities is fundamentally painful to deal with when you're shoveling around a lot of data and mutations.

wladimir · on July 21, 2011

well-defined natural key (users and their logins,

What if you want to offer an option to change user login name later? This is a headache if the rest of your database points to your UID by name.

Same thing if you are making a book database. Sure, in theory the ISBN is a unique book identifier, but what if this somehow doesn't hold (ie, same ISBN but different covers). When you use only the 'natural' identifier you're stuck. Or what if someone typed the ISBN wrong? In a distributed system it can be incredibly difficult to update all the references, for example in users's shopping carts.

For these reasons the canonical DB design advice seems to be to use "meaningless" unique identifiers for objects (doesn't matter if this is autoincrement or UUID etc).

wulczer · on July 21, 2011

That's what ON UPDATE CASCADE is for.

If you have a distributed system, generating unique identifiers and keeping referential consistency across the distributed parts is already a difficult problem. But that's a digression. If I ever get so many users I'll have to partition the user table across databases, I'll have much more trouble that just their IDs (and I'll be rich :) )

wladimir · on July 21, 2011

Yes, but in practice it frequently happens that you refer to records in various different places. Either client-side in API-using software, or in a completely separate database, etc. In such cases, "Update all references in the world" will be entirely impractical and bug-prone.

cturner · on July 20, 2011

Thanks to you and others for respectful responses on what I didn't realise was a hot-button issue.

You ask, "In general, if you have an entity with a well-defined natural key (users and their logins, shipments and their tracking codes), why would you use an autogenerated key that's meaningless?"

I like my approach because of reduced cog. overload. When I'm designing my database, or writing queries that jump through several tables, I always know that my foreign key in table Person [1] is going to be of the form fk_organistaion_id. All I have to do is remember table names and relationships. If I also had to remember arbitrary keys, that would increase memory overload. When I'm doing consulting and working on several applications in parallel that's less practical. Similarly, I always know I can store a reference to an ID column and know I'm it will be there and uniquely identify a row.

I don't think the many-to-many example really answers to my question. The issue of whether it is commonly done is different to whether it is good practice.

I can imagine there might be performance advantages to having a composite primary key for a join table. But - I expect you could get equivalent performance on id|fk_a_id|fk_b_id by adding an index. This comes back to the principle of - do you write your code principally to be run, or to be read. I have a memory like a sieve and write to be read.

I think composite primary keys are done for the wrong reason at times.

The problem in the example in the article linked to is not a poor use of keys, it's a poor design of system. They're trying to enforce type at the schema level. Though he doesn't spell it out, I expect the reason they're getting bizarreness is because they either have people interacting with the database at too-low a level, or because they have multiple applications hung off it. More on this in a second. [2]

Similarly, I think your attempt to use the database to enforcing typing rules will work at some levels but runs out. For example - imagine if a user was only alowed to have three labels. Or that the label musn't have any spaces in it.

While you can delve into triggers [3] I think it's misguided to think you can enforce a general sense of business logic at the database level. That stuff should be done by an application surrounding the database. Then all interaction with the database should go through that one-and-only-one system that owns the database.

Databases have some type information but it's very primitive and inadequate for all but the most simple of scenarios. I've found that once you acknowledge that you start designing schemas and the systems around them in a way that is very different to what you'd learn in the Oracle course.

--

[1] Another quirk of my style - singular table names - because it makes it easier to wrap ORMs around it without having code that reads as bizarre

[2] The second example is riduculous. They assigned an int to the wrong place. That's not a problem with schema design, that's just a stupid mistake.

[3] I've done plenty of this work on hairy enterprise systems

wulczer · on July 20, 2011

I think I see the fundamental point where we disagree. I try to keep as much business logic in the database as possible. That's more of a DBA perspective, I guess: don't trust these pesky app developers, always assume they'll try to put broken data in your DB.

It's often not possible to model all business rules in SQL, but the more you manage to cram into the database, the less probable it is you'll end up with inconsistent data.

Often there are dozens of applications (or independent modules) using the database. Keeping the business rules (like "every user has to have a first name" or "no two customers can rent the same car at the same time") in a centralised place helps a lot with keeping everything in order. It's often not possible to limit the interaction with the database to a one-and-only system (imagine different teams writing different parts of the system, app changes, outsourced integrations, etc)

So, if a label can't have spaces in it, that's easy - slap on a check constraint that enforces it. Then you know that even if the new hire writes a "change my label" funcionality and forgets to enforce the no spaces rule, she won't screw up your data.

If a user can have at most three labels, things get more interesting. Offhand I'm not sure how I'd model that and it actually is a difficult problem, best solved with triggers IMHO, as you have better control over locking and can guard yourself against concurrency issues.

Anyway, the discussion on "all logic in DB" vs "all logic in the app" is part of the neverending struggle between DBAs and app programmers...

cturner · on July 20, 2011

haha I can see why we disagree. Same problems, different solutions.

    Often there are dozens of applications (or independent
    modules) using the database

I know! I always assume that someone will try to pull this on me, some shortsighted bugger always does, and I end up on crusde protecting my platform against this!

It is fundamentally bad design to have more than one application dependent on a database, and you're right, it's used everywhere. If I do a PhD one day, it will be on this topic. It's the number one stupid design in enterprise computing.

I've spent years trying to write a logic framework to surround schemas to make it possible to do the functionality of pl/sql, but with stateful sessions. Every eighteen months I have a new go at it and rewrite.

I have the ideas working, and use the patterns now, but haven't managed to create a framework of it that would be usable to anyone else.

    Anyway, the discussion on "all logic in DB" vs "all
    logic in the app" is part of the neverending struggle
    between DBAs and app programmers...

Yeah. When I was doing a lot of relational work, I'd enforce these checks by having business logic layers around my ORM.

IgorPartola · on July 20, 2011

The canonical example is a when you have a many-to-many relationship. You have table A, table B and a join table C, with columns (a_id, b_id). This table by itself is meaningless and does not need it's on primary key. Thus, you want to set your primary key to be (a_id, b_id) or (b_id, a_id) depending on how you want the data clustered on disk. You then also want to add a unique key with the opposite order of columns to make lookups fast when going the other way.

cturner · on July 20, 2011

Could you use a table design of id|fk_a_id|fk_b_id, and then complement with indexes to get equiv. performance of the direct composite primary key?

Assuming so, I think composite key is premature optimisation.

IgorPartola · on July 20, 2011

Sort of. First of all, this is all about not creating unnecessary entities in your DB. The performance benefits usually follow, but are not necessarily the goal. What happens (at least in MySQL) is that now you have three indecies instead of two: (id), (a_id, b_id) and (b_id, a_id). So, for a large table you are taking up more RAM than you need to, since the index is likely to be stored in RAM. Second, often times the DB will try to optimize lookups against primary keys by clustering the records that are close on the index tree to be close on your HDD. This can lead to increases in performance for large tables. See: http://dev.mysql.com/doc/refman/5.1/en/innodb-index-types.ht...

zzzeek · on July 20, 2011

A many-to-many table as IgorPartola mentioned, other examples include versioned or temporal storage, i.e. a "history" table where the primary key is the PK of the parent table + a version id.

SQLAlchemy can also map to views or selects - if a view or select is against multiple tables, now you have a composite primary key by nature. A core piece of SQLAlchemy is that we select rows from many things besides physical tables.

parfe · on July 20, 2011

>A core piece of SQLAlchemy is that we select rows from many things besides physical tables.

Your sentence is a key part about the differences between Django ORM and SqlALchemy. Every article should include it.

pyre · on July 21, 2011

This was just explained to me by our DBA for a data warehouse (technically a 'data mart' not a data warehouse if you want to pedantic) that we're building. Say you've got two tables:

  table a (
    id integer primary key,
    data text
  )

  table b (
    id integer primary key,
    postal_code text
  )

In reality, postal_code is a natural key, so there's no need to the surrogate key b.id but if you insist on using it that way you'll end up with this:

  select *
  from a inner join b on a.id = b.id
  where b.postal_code = '97000'

The query plan for this will end up something like:

  NESTED LOOP
    TABLE B
      WHERE postal_code = '97000'
    TABLE A

That means that it will do a full table scan of table a for each record in table b where postal_code = '97000'. There are a couple of ways to mitigate this:

1. Use postal_code as the key.

2. Perform two queries:

  select * from b where postal_code = '97000';

and use the result of that to filter table a

  select * from a where id in (*); -- or similar

So now it only does a single pass table scan of table a

3. Use Oracle which has something called a 'star join' which basically does option #2 automatically for you.

Obviously this is the simple case. Things get more complex the more tables you add in.

j-kidd · on July 21, 2011

> That means that it will do a full table scan of table a for each record in table b where postal_code = '97000'.

The query plan would be either:

  NESTED LOOP
    TABLE B
      WHERE postal_code = '97000'
    TABLE A (index scans)

or:

  TABLE A (1 full table scan)
  TABLE B
      WHERE postal_code = '97000'

What database do you use that would perform multiple full table scans?

ismarc · on July 20, 2011

When you're dealing with non-third normal form db schemas for performance reasons where materialized views won't provide the gains you need. That said, I've only seen a need for them about 3 times in my career, but I imagine they're much more applicable in data warehousing scenarios.

brown9-2 · on July 21, 2011

Even if your design doesn't have a use case for it - isn't it reassuring to know that your ORM has the functionality available if it is ever needed?

cbs · on July 20, 2011

>SQLAlchemy supports multiple-column primary keys, Django ORM does not. What's more to think about?

Exactly. If it wasn't for that restriction we would have begun a test adoption. As soon as I saw that multiple-column keys were a mystery to it, not only did it send me packing but made me question the fitness of django in general, I feel like we may have dodged a serious bullet there (if they can't even get that right, what other surprises lie in wait?)

madewulf · on July 20, 2011

It is possible to specify that some column values must be unique together, which makes this point pretty moot for any new project.

That said, if you are planning to work on existing schemas, this can really be a headache.

About other nasty surprises, I have been using Django a lot extensively over the last 4 years, and I can honestly say I haven't find any and I dare you to find a web framework that will allow you to bootstrap a project much faster than Django .

j-kidd · on July 21, 2011

I think SQLAlchemy is designed for people who are comfortable dealing with SQL directly, while other ORMs want their users to be oblivious about SQL.

In Chinese, there are 3 stages of life:

1. Looking at a mountain, it's a mountain - this is when you start learning about SQL

2. Looking at a mountain, it's not a mountain - this is when you start using tools that shell you from SQL

3. Looking at a mountain, it's still a mountain - this is when you start using SQLAlchemy

sc68cal · on July 20, 2011

Many people's first experience with a Python ORM (or maybe an ORM altogether) is the Django one.

Useless anecdote: I started out using SQLObject since I was importing data from flat files into a relational database, then doing some analysis. When the analysis became useful to others, that's when I started building a web front-end with Django.

nerd_in_rage · on July 20, 2011

SQLAlchemy is pretty much the only ORM I've used that DIDN'T suck. It's great, especially after seeing Java / Hibernate hell.