All valid points if compared against a single-instance RDBMS. Except that modern RDBMS used in any large scale web service will also be guilty of these points. When you shard/cluster your data across various master/slave DB instances you essentially lose the ability to store your data in a non-atomic normal form, just like SimpleDB. In both a sharded RDBMS and SimpleDB your data is denormalized.
With a DB setup like this (and all major web sites are) you cannot feasibly join, group by (or do associated aggregations), or order by anymore anyway. Constraints are not really a good idea either. It would be nice if SQL was still useful in a situation like this but it just is not.
The days of hugely massive single machine databases is coming to and end if not already dead. Even if you are reporting on massive amounts of data and need to do aggregations across large datasets, then OLAP is increasingly becoming the way to go (again, with fact tables and cube partitions sharded across multiple instances to increase query performance).
Your argument is good, but it also serves to advance the argument that using SimpleDB for a new web app is premature optimization.
Yeah, when your app has millions of users and massive amounts of data you might need to adopt sharding or clustering, perhaps giving up the advantages of a huge single-machine database relative to SimpleDB. And, once you have this problem, it might be a good time to contemplate rearchitecting your entire system around something like SimpleDB. But few web apps start off with millions of users and massive amounts of data, and to deliberately avoid using SQL tools just because they won't scale to Facebook-esque levels of traffic is the very definition of "premature optimization".
Finally, someone took time and said it. Moreover, I'd recommend for someone who didn't have exposure to "big iron databases" to look up performance benchmarks for Oracle, Microsoft and DB2, they're very popular not only among these companies but also among big-iron manufacturers like Sun, IBM and HP. It's crazy what they can do if you use them right, good luck building a business that will need such transactional capabilities of your DB.
I assume the real appeal behind SimpleDB is that you don't have to buy servers, disk arrays, backup systems, a data center, 24-hour techs to swap RAID drives, etc. You just put your data in there and forget about that stuff.
Exactly. And its not just hardware you can forget about. No longer having to worry about indexing, scheduled backups and database shrinks, temp table sizes, connection pooling, and all the other nasty stuff that goes along with maintaining a DBMS is equally as nice.
I've switched my project from a local MySQL instance to SimpleDB and I'm never looking back. It took some work to simplify my existing schema to be purely atomic (no joins), but after that you can forget just about everything else.
I would be interested in hearing more about your migration from MySQL to SimpleDB.
Basically, I am a big fan of normalization (since none of the other ways seem elegant to me). For example, if I were going to write the schema for a Blog, I would have an articles table, a tags table, and an article<->tag mapping table. This means that "show me articles with tag foo" is indexed (and "faster" than an O(n) search over the articles table).
The only approach I can see with SimpleDB is to tolerate scanning all articles to search tags (and authors, etc.) or to maintain a separate "document" for each tag (that links to the articles with the tag). Then I worry about concurrent updates to a tag ( {foo:[1,2]} gets updated to {foo:[1,2,3]} while someone else immediately changes it to {foo:[1,2,4]}; the correct result would be {foo:[1,2,3,4]} but adding 4 overwrites adding 3.)
Anyway, like I said, I would be interested in hearing people's experiences. Perhaps I am missing something obvious.
With SimpleDB, you can still accomplish many-to-one and even one-to-many. So in your example schema you described, you could have an attribute (aka column) in your tags domain (aka table) which holds the ids of the article items (aka rows) in your article domain that the tag is associated with. Since your data is denormalized though, you will have many duplicate items if a tag applies to many different articles. Thats ok though, SimpleDB was designed for this dont worry about large domains.
So this is how you could get your many-to-one mapping. To determine what articles have tag foo, you would just do the equivalent of select from tags where your attribute value matches the item id for the desired article.
It seems unintuitive, but you can actually do one-to-many mappings too with SimpleDB. So you could say "show me all the tags for article A" without a join table or duplicate items. The reason this is possible is because SimpleDB can attach up to 256 values to a single attribute. This obviously isnt possible in SQL. You just assign references to each tag id for that article to one attribute. Kinda like putting a list of Strings in a single cell in SQL. Be very careful with this though, as you cannot exceed 256 values bound to 1 attribute. So no more than 256 tags per article in your example.
But the problem with all this is keeping track of changes. updates and deletes do not cascade so you have to handle that yourself. And of course you have to consider eventual consistency if making frequent changes on top of each other.
My suggestion to help reduce the complexity of managing your domain references is to use this library (if you are using Java that is). It makes SimpleDB so much more usable: http://code.google.com/p/simplejpa/
I hacked together a SimpleDB-alike tonight and played with various index-on-write schemes. I feel a bit better about the situation now. Of course, now I just need to make it as scalable as SimpleDB :)
I don't want to bash something I personally not very familiar with, especially on this podium. But I don't know what the appeal of SimpleDB-like services is. I know why Google/Amazon needed them, but those reasons don't apply to 99.9% of startups.
You don't need to buy your own RAID/SAN and powerful servers to use RDBMS, nearly every modern scalable hosting provider will be happy to rent them to you just like Amazon: get an account on Slicehost for smaller projects and move up to EngineYard or Joyent for high-end stuff: they will all let you scale up as needed, use DB of your choice.
P.S. I also think the author forgot about yet another important advantage of RDBMS: there are tons of various tools for them. Yes, he mentioned import/export/backup, but the list is much longer than that.
True, but the reason that SimpleDB is so inexpensive is because all of the items you mention have gotten so cheap. A managed server, where the provider does all the backup and RAID stuff, costs ~$100/month. That's nothing, given the tremendous benefits (like, joins that allow for normalization, your own backup, no vendor lock-in, etc).
I understand the desire for something that's maintained by someone else and makes life easy, but if you're a tech startup, this stuff ought to be second nature.
if you think in SQL every project seems to require the "advantages" of an RDBMS. But the critics are forgetting that:
a) when you use sharding your RDBMS looses most of the advantages you seem to like so much
b) SimpleDB and friends not only distribute your data but they allow you to parallelize your queries. think map/reduce
c) as a programmer I would much rather solve summing and grouping in my code then learn the subtleties of properly configuring and managing a particular RDBMS (and re-learn everything when I switch vendors)
I'll agree that SimpleDB is not appropriate for everything, but most of the perceived problems are really just a lack of creativity.
I'm not quite sure I understand what advantages are lost by sharding -- would you mind elaborating?
As far as being able to write parallel queries, this is perhaps true, but you have to manage the parallelization, synchronization, and all other manner of nastiness. For data that doesn't require frequent updates (like Google), this may be a better route, but for systems where consistent data is important, I would much rather rely on years' worth of RDBMS research. It's not a matter of a lack of creativity; rather, I would prefer to spend my time solving problems as they are related to my application, not duplicating the bits of an RDBMS that aren't present in SimpleDB.
I find point (c) to be a bit surprising -- SQL is based around the concepts of set theory, and works in a fairly straightforward fashion. While there are differences between implementations, you certainly don't have to 're-learn everything' when you switch vendors, and the basic tasks of database management are not terribly complicated.
And yes, I regularly bounce between about three different SQL implementations (Postgres, MySQL, and SQLlite). Spent some time with Sybase as well, and I am not a DBA. There are differences, definitely, but it never took me more than a day or two to get up to speed, and I am not a genius by any stretch of the imagination.
If you needed to, you could solve most of the problems with SimpleDB by writing a server that wraps it and provides consistency checks, sorting and such. I was considering writing that for a little while.
Then they announced that they were going to support mountable drives on EC2 images, and it stopped being necessary. When they release that, you'll be able to load up a more straightforward MySQL or PostgreSQL setup, and use traditional clustering mechanisms.
Honestly, I find SimpleDB interesting, but I have no idea what I could use it for without writing quite a bit of middleware to wrap it.
Good article. Like anything, there are pros and cons to each approach. I tested SimpleDB pretty thoroughly for the next version of Gridjit and decided against using it.
One of the drawbacks to working with these sorts of storage services is the temptation to create overly-abstract models too early. Abstraction and metadata-based schemas are great but can also add needless complexity (ironically) early on.
I was about to do similar research before committing our next project of 5TB data space with SimpleDB, ouch! It would have been a disaster. I will probably use the S3 just for mass storage nothing else.
a lot of his problems seem to suggest that SimpleDB require more coding to contol the data.
the counter argument is, that okay, this is true, but you also don't do any maintenance ... which seems like a fair trade-off, plus you can always add an abstraction layer in code that simplify the cooding
so my question becomes, can't one add an abstraction layer on top of simpledb that makes it act more like a traditional rdbms ?
With a DB setup like this (and all major web sites are) you cannot feasibly join, group by (or do associated aggregations), or order by anymore anyway. Constraints are not really a good idea either. It would be nice if SQL was still useful in a situation like this but it just is not.
The days of hugely massive single machine databases is coming to and end if not already dead. Even if you are reporting on massive amounts of data and need to do aggregations across large datasets, then OLAP is increasingly becoming the way to go (again, with fact tables and cube partitions sharded across multiple instances to increase query performance).