Hacker News
Database Versioning (heroku.com)
72 points by ph0rque on March 2, 2009 | 21 comments



I'm the original author and current maintainer of DataMapper's migrations. I had a reply written up, but it ended up being really long, so I wrote a reply blog post instead. I'm not trying to drive traffic to my blog, because I don't care, but here's a link: http://www.theamazingrando.com/blog/?p=86

Essentially, DataMapper already provides the solution that Adam outlines in his post: replace schema.yml with DataMapper model definitions, and have the discipline not to write data migrations. Write specs for your migrations, like everything else, and use DM migrations' sane versioning rather than AR's irritating one, and you should be fine. There are definitely improvements to be made to DM migrations, to be sure, but I feel like I got the underlying design mostly right.


That's what I was thinking while I was reading the original post. DataMapper solves the problem nicely, provided you think the schema should be driven by the model properties instead of being magically defined in the schema and automatically mapped.

DataMapper migrations are perfect for this since they focus on one-time runnable data migrations, or one bulk schema change per release, instead of a huge chain of tiny schema tweaks that can go back and forth.


Database migration has been solved for a long time. This author is having trouble getting his head around it because he's starting from flawed assumptions, and is limited by his choice of tools.

Here's how you do it:

  - every database has a build number stamped into it
  - every schema change goes into a change script called build_105-106.sql
  - changes are applied to the dev database as they are made
  - the automated build picks up the change script and runs it against an integration database
  - a schema diff tool is used to compare dev and integration schemas. If they don't match, the build breaks.
  - the build creates a file called build_106-107.sql and sticks it into source control

There's simply no way you can get your schemas out of sync this way. If you screw up and commit a change straight to dev without scripting out the change, it will break the next build and it will hurt to fix it. You'll learn quickly not to try to shortcut the process. (It's really not that hard a process anyway).
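
To make that concrete, here's a minimal sketch of the stamping part (sqlite3 and the table/file names are just for illustration, not part of any particular tool):

  # Minimal sketch, assuming a schema_version table holds the stamped build number.
  import sqlite3

  conn = sqlite3.connect("app.db")
  conn.execute("CREATE TABLE IF NOT EXISTS schema_version (build INTEGER NOT NULL)")
  if conn.execute("SELECT COUNT(*) FROM schema_version").fetchone()[0] == 0:
      conn.execute("INSERT INTO schema_version (build) VALUES (105)")
  conn.commit()

  # build_105-106.sql then holds the schema change plus the version bump, e.g.:
  #   ALTER TABLE users ADD COLUMN last_login TIMESTAMP;
  #   UPDATE schema_version SET build = 106;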

But the big thing the author is getting wrong is that he's trying to define schema from his application. His ORM seems to be generating both code and schema from config files, which seems a bit silly from an outsider's viewpoint. It's causing him a lot of pain and he doesn't understand why.

Define the schema in the schema. Generate your code from that. Problem solved.


I'm interested to hear more about how this works when you have multiple developers working in parallel on several branches. What you describe sounds like it would only work in a very linear development model.


If you mean "multiple" to be "more than a dozen", then yeah, you're right that you might need to think harder. For small projects though (which could be defined as "anything small enough to conceivably work with an ORM"), it works great.

I guess that one necessary assumption is that your shop doesn't have a strong requirement for Code Ownership (in capitals). Since anybody can change the schema, and thus the object model, anybody needs to be able to propagate out their changes as far as necessary to keep the tests running and the build unbroken.


It's not about code ownership or problems with who is or isn't allowed to touch the schema. The problems come from things like even just numbering the migrations, keeping track of dependencies between migrations and the order they need to be applied in, possibly needing to roll back certain migrations, etc.


That's actually the problem that this system addresses. Each change script has from and to versions, so it's dead simple to move from one schema version to another. So when you go to roll out to production, your build will notice that it needs to go from version 104 to 116, and will run the script 104-105, 105-106, etc. until the version number is where it belongs.

Basically, you can know with certainty that you can move from any numbered version to another by simply running the intervening scripts.
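
A hypothetical runner for that step, assuming the build_N-(N+1).sql naming and a schema_version table holding the stamped number (both assumptions, not prescriptions):

  # Hypothetical upgrade runner: read the stamped build number, then apply the
  # intervening change scripts in order until the target build is reached.
  import sqlite3

  def upgrade(db_path, scripts_dir, target_build):
      conn = sqlite3.connect(db_path)
      current = conn.execute("SELECT build FROM schema_version").fetchone()[0]
      for build in range(current, target_build):
          path = f"{scripts_dir}/build_{build}-{build + 1}.sql"
          with open(path) as f:
              conn.executescript(f.read())  # each script ends by bumping schema_version
      conn.commit()

  upgrade("app.db", "migrations", 116)  # e.g. take a copy of production from 104 to 116

The point is that each script is responsible for bumping the stamped number, so the runner never has to guess where the database is.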


I understand that part, that's what I meant about it working in a linear fashion. What happens when two developers both start with schema 104 and develop in parallel? Which one of them becomes schema 105?


Ah. Since the scripts are under source control, it's simply a matter of resolving conflicts when you check in.

But then, since you'll generally have a continuous build running on 5-minute intervals, it's pretty rare to have two independent schema changes in the same script.


Only one migration can be applied to a database at a time, so the migrations will need to be arranged sequentially at some point. With the scheme above, I guess this involves lots of file renaming...
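
One way to surface that early would be a build check that fails when two branches both claim the same starting build number (the file layout here is assumed, not prescribed):

  # Hypothetical build check: fail if two scripts start from the same build number,
  # e.g. build_104-105.sql and build_104-106.sql created on different branches.
  import collections, glob, re, sys

  froms = collections.Counter()
  for path in glob.glob("migrations/build_*.sql"):
      m = re.search(r"build_(\d+)-(\d+)\.sql$", path)
      if m:
          froms[int(m.group(1))] += 1

  dupes = sorted(b for b, n in froms.items() if n > 1)
  if dupes:
      sys.exit(f"conflicting change scripts for build(s) {dupes}; renumber one branch")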


That's for people using SQL. ORMs screw the entire process up. Another reason why ORMs are a bad idea.

ORMs are basically trying to recreate SQL in their own special non-portable immature bug-ridden manner. Thus, even the schema has to be defined in the ORM and that is where the problem comes in.


None of this is true. Some ORMs demand complete control over the schema, but this is an implementation detail and not something intrinsic to ORMs in general. Try DBIx::Class sometime, it can handle almost any schema you throw at it.

With that in mind, most people don't want the availability of methods and classes in their application to depend on external things like the database schema. That makes error reporting and debugging tough. So, you dump the DB schema to some format your ORM understands, and then consider that authoritative.

Many of us consider the database a tool for storing and querying data, rather than a way of life. In our case, the way the ORM works is perfect. The database is abstracted away, and we can write our software in our language of choice.


Yeah, I never understood why most ORMs do things that way. Why the extra config file to define everything? You can infer everything you need to build an object model directly from the database schema. It's just as easy to read programmatically as XML, if not easier. And it's every bit as descriptive.
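
Roughly what I mean, with sqlite3 standing in for whatever database you actually use (real ORMs do a much more complete reflection than this sketch):

  # Rough illustration: pull column names and types straight out of the catalog.
  import sqlite3

  def introspect(db_path, table):
      conn = sqlite3.connect(db_path)
      cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
      # each row is (cid, name, type, notnull, default, pk)
      return {name: coltype for _, name, coltype, _, _, _ in cols}

  print(introspect("app.db", "users"))  # e.g. {'id': 'INTEGER', 'email': 'TEXT'}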


If you're using a config file to define things, it's generally the case that it's defining the schema as well and the database is built programmatically on top of it. So it's really just a matter of taste: some frameworks have you manipulate the DB schema directly and then build the ORM representation off of the DB, whereas others have you directly define the ORM representation and then build the database off of it. Either way there's no real duplication, and there's still just one place to modify if you need to change the schema/ORM mapping.

In our case we have a config file because A) it makes it easier to abstract the schema across different databases and B) we have extra metadata in there above and beyond just the schema information.
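
As a made-up illustration of that direction (none of these names come from a real framework), one definition from which both the DDL and the model could be generated; extra metadata like display labels can sit alongside the column specs without affecting the generated schema:

  # Hypothetical config-driven schema: generate DDL (and, in principle, model
  # classes) from a single definition. Names here are invented.
  TABLES = {
      "users": {
          "id":    "INTEGER PRIMARY KEY",
          "email": "TEXT NOT NULL",
      },
  }

  def ddl_for(table):
      cols = ", ".join(f"{name} {spec}" for name, spec in TABLES[table].items())
      return f"CREATE TABLE {table} ({cols})"

  print(ddl_for("users"))  # CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)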


I guess that's the luxury of being a shop that builds software vs. a shop that builds open source ORMs. We can target a single DB and technology, and don't need to cater to thousands of angry devs with different naming conventions and religious views as to the virtues of Postgres vs. MySQL.

Wrapping a relational database with an object layer is really a simple, straightforward problem. It's surprising that so many smart people feel compelled to spend so much time and effort engineering overly complex solutions to it.


Well, I work at a shop that builds software, but it's installed enterprise software, which means we don't fully control the deployment platform and have to support multiple databases. Control of the deployment model is definitely a huge, huge plus of a SaaS model, but the economics don't really work out for that in our industry. We also happen to write our own (currently closed-source) ORM framework to build on top of. One of the nice things about building our own framework is that we only have to deal with the complexities that we care about (i.e. our sort of performance requirements, our sort of object graphs, the databases our customers need).

The problem with pretty much any kind of framework, ORM or not, is that eventually it becomes bloated by being flexible enough to meet the requirements of thousands of different developers, meaning that any given one of them only ever uses a small percentage of the overall functional footprint. Keeping things simple yet powerful is definitely the hardest part about it all, especially given that different users have vastly different opinions about what's simple or what power they need.


Indeed. Building frameworks for other people just doesn't sound like any fun at all. We also rolled our own thing in house, and every once in a while we start thinking thoughts about polishing it up and releasing it to the world.

But the amount of polish that you need to turn an in-house thing into an open source project is just immense. And worse, by the time you have it flexible enough to handle every possible use case, you've bloated it out to the point where it's no fun to use anymore. Pick pretty much any off the shelf ORM to see the end result of that path.


I think ORMs don't infer everything from the database because they want the "hard break" of knowing when the data model changes (an attitude influenced by statically typed languages).


ActiveRecord and Sequel (both ruby ORMs) read the schema from the database. The generated schema.rb file is dumped from the actual database and only used for bootstrapping new installations. ActiveRecord even supports dumping the schema as straight SQL if you like. This is required if your database makes use of db-specific features like views/stored procedures, etc.


What he's imagining is somewhat close to what we do with our currently-internal (hopefully open source some day) ORM framework (which is written in Java). Our schema is defined as metadata in XML, and we build a checksum off the XML to compare against what's in the database in order to know when to upgrade. To ensure the stability of the checksum, it's built off of a filtered, sorted version of the XML DOM tree, rather than the actual files. Not all changes to the files will affect the checksum, since certain bits of the metadata don't affect the data or the schema.
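
In rough Python terms (our framework is Java, and the attribute names here are invented), the checksum step is something like:

  # Sketch of a stable checksum over the metadata: drop attributes that don't
  # affect the schema, sort what's left, and hash the canonical form.
  import hashlib
  import xml.etree.ElementTree as ET

  IGNORED_ATTRS = {"comment", "display-name"}  # doesn't affect data or schema

  def canonical(elem):
      attrs = sorted((k, v) for k, v in elem.attrib.items() if k not in IGNORED_ATTRS)
      children = sorted(canonical(child) for child in elem)
      return (elem.tag, tuple(attrs), tuple(children))

  def schema_checksum(xml_path):
      root = ET.parse(xml_path).getroot()
      return hashlib.sha1(repr(canonical(root)).encode()).hexdigest()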

We do still use explicit version numbering for version triggers (our name for migrations), since it's just too hard if the numbers aren't meaningful. If you just used a checksum, how would you know if A248B5FC comes before or after 3F56EB2? It's a little easier if you can say "the DB is at version 23, and the latest is version 26, so we need to run these three triggers." The database stores both the metadata checksum as a hash and a version number; if the hash changes without the version number changing, we at least know that's an error and can detect it. Certain types of changes (like adding a nullable column) we handle automatically by diffing the current schema against the actual DB schema, while anything more complicated (and anything that touches the data) requires an explicit trigger.

Since we build deployed enterprise software that gets upgraded on the scale of years, not days or weeks, we also follow the course of using triggers only on production databases. It's still not really a perfect solution for all sorts of reasons I could go into, but it's interesting that we ended up pretty close to what he's proposing.


When I was working with migrations and ActiveRecord a couple of years ago, we just treated the production database as authoritative. We basically used migrations as disposable chunks of code that would transform the database from state a to state b. If a migration from 3 months ago didn't work anymore it really didn't matter, as the production database was well beyond that point anyway. When we needed to test the migrations, we'd load the most recent backup of the production database into our dev dbs and try it out.

But we did have the luxury of running a (fairly small-scale) website rather than distributing an application.



