What is the best approach to web app development in Python without a RDBMS?
8 points by dood on Sept 2, 2007 | hide | past | favorite | 18 comments
Having read a couple of threads [1] on dropping the RDBMS in favor of keeping data in RAM and logging to disk, I'm wondering what a good setup for a typical Python web app would look like.

There seem to be a number of options: memcached, BDB, ZODB, Metakit, serialising into SQLite...

Any ideas?

1. [http://news.ycombinator.com/item?id=14605], [http://news.ycombinator.com/item?id=16098]




http://itamarst.org/software/cog/

Cog is the Checkpointed Object Graph object database, providing semi-transparent persistence for large sets of interrelated Python objects. It handles automatic loading of objects on reference, and saving of modified objects back to disk. Reference counting is used to automatically remove no-longer-referenced objects from storage, and objects will automatically be attached to the database if a persistent object references them.

======

I looked this package over a few years ago, and I think it got an awful lot of things right... but not enough.

It's worth examining the design if you want to understand the intricacies of non-RDBMS approaches. A lot of thought went into it.


It sounds to me like the closest thing to what you want is a library for Object Prevalence <http://en.wikipedia.org/wiki/Object_prevalence>. There is one that I know of for Python called Pypersyst <http://pypersyst.org/>. There is also an IBM DeveloperWorks article about this library by one of the authors, as well <http://www.ibm.com/developerworks/library/l-pypers.html>. Hope that helps :)


Depends, what are your requirements or what does your program do?

RDBMSes are popular because they're easy and they perform decently for most answers to that question. It may well be that an RDBMS is the best answer for your application.


I should have been clearer in the question: I'm not looking for a solution to a specific problem, but for a general approach to building this kind of system, or an understanding of the benefits and trade-offs of the different methods. I understand why RDBMSes are normally used, and I'm comfortable with SQL, but I'm interested in the alternatives.


Start simple: store the data in-memory in whatever way makes sense, and just write a log file that contains pickled transactions that can be played back. Conceptual sketch of how I would do it follows. This isn't "the" way to do it, just "a" way to do it.

  # this is the utility method used by the rest of your code
  # to register a new user. maybe this is at module scope in
  # the User.py module
  def register_user(user):
      run_transaction(AddUserTransaction(user))

  # this is the actual object that represents the
  # transaction
  class AddUserTransaction(Transaction):
      def __init__(self, user):
          Transaction.__init__(self)
          self.user = user

      # all Transaction objects have an apply() method
      def apply(self):
          Transaction.apply(self)  # invoke base method

          # the following MyGlobalUserTable probably just
          # stores a couple of dicts as indexes over the user info,
          # e.g. it will have a dict by username, by user ID, etc.
          MyGlobalUserTable.add_user(self.user)

  def run_transaction(t):
      # apply the transaction first, then write to the log,
      # since if it crashes while running the transaction,
      # you don't want to crash again when you play back the log.
      t.apply()
      transaction_log.append(t)

Here transaction_log.append(t) is some method that will pickle the transaction and append it to some log file. You'll have multiple classes like AddUserTransaction, all derived from Transaction. When you crash and want to play back the transaction log, all you have to do is unpickle the Transaction-derived objects and call apply() on them in the same order. http://docs.python.org/lib/module-pickle.html

Caveats:

-- Replaying every transaction at startup will eventually become too slow once you have a huge number of transactions. You can fix that when it happens, and as a bonus, replaying the full log gives you a way to import your data into whatever new format you move to.

-- The above approach is horrid if you want to launch a new process for every single request: you'd have to replay the transaction log for each incoming request, you'd have a data coherency nightmare, and so on. So if you want to use this approach, make sure you're using something that shares one process amongst all requests.
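The startup-cost caveat above has a standard mitigation: periodically checkpoint the whole in-memory state and truncate the log, so startup becomes "load last snapshot, replay the short log on top." A hypothetical sketch (these function names and the single-blob-pickle approach are mine, purely for illustration):

```python
import os
import pickle


def snapshot(state, snap_path, log_path):
    # write the snapshot to a temp file first, then atomically rename,
    # so a crash mid-snapshot never leaves a corrupt snapshot behind
    tmp = snap_path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())      # make sure the snapshot is on disk
    os.replace(tmp, snap_path)    # atomic swap: all-or-nothing
    open(log_path, "wb").close()  # now safe to start a fresh, empty log


def load_snapshot(snap_path):
    # on startup: load the last snapshot (then replay the short log on top)
    try:
        with open(snap_path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None  # no snapshot yet: replay the log from scratch
```

Note the ordering: the log may only be truncated after the snapshot is durably on disk, or a crash between the two steps loses data.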

You might also need to worry about whether your server is multithreaded and in that case deal with locking and other crap. I'd suggest going simple with something like Twisted's single-request-at-a-time approach. This isn't as bad as it sounds at first; you just write everything to finish ASAP, and if there's something you need to come back to, you just request the Twisted Reactor to schedule an event. If you have some big blocking thing to do, do it in another thread then notify the Reactor when it's complete (and have the Reactor schedule an event for your code to be notified that the big blocking thing finished).

As you can see, it's not totally trivial and general-purpose like the RDBMS + ThingThatGeneratesNastySQLFromObjects approach (aka ORM). But you get more flexibility in your interfaces, and lower overhead. While it can be rewarding and simplifying to work this way, it can also lead to over-engineering and/or lost time and effort if you aren't already intimately familiar with the steps in this approach.


The great thing about this post is that it doesn't hide the fact that "building a site without an RDBMS" is perilously similar to "writing your own buggy, half-implemented, slow, nonstandard DBMS".

Building a custom DBMS has been done, and done well, but I think a good general approach to the problem is to read the line about "over-engineering and/or lost time and effort" out loud, to your entire team, at dawn and noon and sunset on every day of the project.


> is perilously similar to "writing your own buggy, half-implemented, slow, nonstandard DBMS".

People have used files and in-memory data structures just fine for a long time. I don't think their code had bugs owing solely to the fact that RDBMSes hadn't been invented yet.

I also don't see how this is slow; it's all in-memory. Why bring a big honking DBMS into the picture when all you wanted was a hash table?

Storing data in-memory doesn't amount to a "buggy, half-implemented, slow, nonstandard DBMS" -- it serves an entirely different set of goals. Storing data in in-memory data structures is how programming is done. If you actually started to make the interface to your data layer as horrid as the interface to most DBMSes ($DEITY help you), you'd definitely end up with a shitty, buggy, half-implemented DBMS. But if you just want to store a couple of hash tables with a sane interface dictated by the software design, not by your DBMS, just store the hash tables! Don't go wrestle with SQL just because it's the in-vogue buzzword.

You should measure your DBMS sometime, and consider it in terms of a hash table lookup. It starts looking like a comedy of horrors: first, generate a string in an obtuse language to perform the lookup, and oh wait, don't put a quotation mark in the wrong spot! Got that string generated? Now send it over a socket to a server. The server is going to parse your string into an AST, turn the AST into an internal representation, and then guess how to optimize your query (this is also expensive in terms of cycles). When it's done optimizing, it hands the plan to the query evaluator, which looks in memory (since your table is so tiny anyway), pulls out your hash table value, encodes it, and ships it back over the socket to you.
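If you want to see the gap yourself, here's a rough, unscientific micro-benchmark using the stdlib's in-process sqlite3 module, which if anything understates the case, since there's no socket or server round-trip involved:

```python
import sqlite3
import timeit

# build the same tiny "user table" both ways
users = {i: "user%d" % i for i in range(1000)}

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO users VALUES (?, ?)", users.items())


def via_dict():
    # plain hash table lookup
    return users[500]


def via_sql():
    # parse, plan, execute -- even with no socket in the way
    return db.execute("SELECT name FROM users WHERE id = ?", (500,)).fetchone()[0]


assert via_dict() == via_sql() == "user500"

# the dict lookup is typically orders of magnitude faster
print("dict:", timeit.timeit(via_dict, number=10000))
print("sql: ", timeit.timeit(via_sql, number=10000))
```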

I agree and concede that these days you'll fit into the ecosystem better if you do use an RDBMS with no rationale (other than that it's the de facto standard), generate giant SQL strings from objects, and most worryingly from my perspective: deal with the big impedance mismatch between the two paradigms of OO/anything and RDBMS. http://en.wikipedia.org/wiki/Object-relational_impedance_mis... I haven't seen a single ORM implementation that really makes it truly simple. You end up manipulating your structure at a lower-level than you'd like to naturally.

This is partly why my next project isn't being done with an RDBMS, and I'm not storing it just in memory because of the volume of data involved. I'm going to be using Erlang and its bundled Mnesia database. There's no impedance mismatch there. The whole thing, including the database interface, works the way Erlang works.

I'll throw a party when RDBMSes die.


> I also don't see how this is slow; it's all in-memory.

Are you not writing to disk on every single transaction, then? My bad. And good luck keeping that power cord plugged in.

> Why bring a big honking DBMS into the picture when all you wanted was a hash table?

I didn't bring that DBMS into the picture. You did, in the second half of your first sentence:

"Start simple: store the data in-memory in whatever way makes sense, and just write a log file that contains pickled transactions that can be played back."

Indeed, an in-memory cache is really fast. But it's unfair to compare it with an RDBMS, which is handicapped by the need to write every transaction to disk before it can be committed. The hash table is great only up to the point where a stray cosmic ray crashes the server and makes the whole thing disappear, after which you realize that you need a log file.

In the general case, writing a transaction log file is a hard problem. If you write a really robust tool for managing that log file - a tool which is efficient at reading and writing even when the number of transactions grows large; one that lets you specify when the log gets written, and how often, and whether the system can be queried during the write, and how long those queries will block, and what they will return; one which allows multiple threads and multiple machines to read and write the data without concurrency problems; one which prevents the in-memory cache from getting out of sync with the filesystem - you will have implemented a substantial portion of MySQL, and probably memcached as well.
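To make that concrete, here is a sketch of just the durability corner of the problem; the length-prefix framing and function names are invented for illustration. Every "committed" record needs an fsync, and the reader has to tolerate a torn final record from a crash mid-write:

```python
import os
import pickle
import struct


def durable_append(f, transaction):
    # length-prefixed record: without framing, one torn write
    # corrupts everything that comes after it in the log
    payload = pickle.dumps(transaction)
    f.write(struct.pack("<I", len(payload)) + payload)
    f.flush()
    os.fsync(f.fileno())  # the expensive part -- and an RDBMS pays it too


def read_records(f):
    # stop cleanly at a torn (partial) final record instead of crashing
    while True:
        header = f.read(4)
        if len(header) < 4:
            break
        (n,) = struct.unpack("<I", header)
        payload = f.read(n)
        if len(payload) < n:
            break  # torn write: discard the incomplete tail
        yield pickle.loads(payload)
```

And this still handles only single-process, single-threaded appends; none of the concurrency or query-during-write questions above are even touched.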

It is easy to start out designing a fast, simple, non-transactional DB and end up reinventing MySQL. If you don't believe me, ask the folks who invented MySQL!

In certain special cases (e.g. Google), rolling your own persistent storage system is a big, big win. You may know, in advance, that your website is one of those cases. If you are correct, you will be a superstar - you will build a relatively untested, nonstandard data storage system with a tiny subset of PostgreSQL's features, but all of that will be worth it because the system will be fast. If you are wrong, you will work for a month or six and then end up installing PostgreSQL anyway. In fact, even if you are right, you will end up installing PostgreSQL after your customer changes the spec at the last minute to require some boring standard feature - like a shopping cart - which any CRUD jockey can build in a day but which your "simple and elegant" database doesn't support because it wasn't in the original spec.

So it's no surprise that the "de facto standard" is to build your site around an RDBMS, get the damn thing working, and optimize later - and that, as a result, the typical in-memory data structure ends up being backed by an RDBMS instead of by a "simple" log file.

As for your ire at SQL... if you think you're in pain, just imagine how the designers of SQL must have felt back in the 1970s, when string parsing and query planning were tens of thousands of times slower than they are now, an extra database server cost more than a coder's daily salary, and RDBMS software was very, very non-free. It was a dark time. And yet for some reason those guys abandoned their efficient hand-rolled binary databases for SQL. In fact, they did it so fast that Larry Ellison grew richer than the Beatles. Why did they do that, I wonder? It must have been the drugs.


Very instructive comment, thank you. The need for a single process seems to be the biggest problem for me, since it prohibits hooking this system up with a framework like Pylons. Practically, then, this is probably over my head right now, and overkill for my needs, but it's something I hope to explore as I learn more.


You might want to ask yourself, "Why am I choosing to not use an RDBMS?"

If the answer is "because I know it will save time" or "because I'd like to explore the possibility", more power to you. I wholeheartedly encourage it.

If the answer is anything else then you should just use one, because that path will save you time (possibly a lot of time).

Check out Rails. The ORM is so well done that it feels like you aren't using a database at all.


Yes, I'm pretty much checking out the solution-space as a whole. I'll probably end up slinking back to SQL, but at least I'll know why.


What kind of app are you building?

I don't think you'll find a pre-packaged framework out there that works without a DB. I guess if I were doing this I'd use memcached, and then write to a shared file system somewhere for persistence.

We use Pylons and SQLAlchemy/Elixir/PostgreSQL though and we're very pleased.


I wasn't necessarily after a pre-packaged framework. I'm also very pleased with Pylons so far; getting a non-RDBMS thing to play well with Pylons could work out very neatly.


Zope, QP - those work without RDBMSs.


Twisted is an interesting framework, though a little lower-level than others. I think you could build a self-contained, in-memory solution around it, as DBs aren't particularly central to its operation.


Schevo (schevo.org) integrates well with Pylons.


Schevo does look interesting, I'll have to give it a closer look. Though as far as I can tell it isn't widely used... what has your experience with it been?


The documentation is limited, but it is very powerful and easy to use. The support on IRC is excellent, although the response time is around a day.



