
I had a similar change of perspective in 1997 when I worked at a well-known Internet retailer. The entire catalog of items we sold (as well as customer reviews and other data) was stored in a number of key-value stores (Berkeley DBs) that were routinely built and pushed out to each web front end. This was very fast and, for our purposes, much better than storing this information in a centralized SQL database.
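
As a rough illustration of the build-and-ship pattern described here, the sketch below uses Python's standard-library dbm module as a stand-in for Berkeley DB; the file name and toy catalog are made up, and none of this reflects the retailer's actual code.

    import dbm, json

    # Build step (run centrally): dump a toy catalog into a single file.
    items = [{"asin": "B000000001", "title": "Example Book", "price": 9.99}]
    with dbm.open("catalog.db", "n") as db:
        for item in items:
            db[item["asin"].encode()] = json.dumps(item).encode()

    # Push step: the finished file is copied to every web front end, which
    # opens it read-only and never touches a central SQL database per request.
    with dbm.open("catalog.db", "r") as db:
        detail = json.loads(db[b"B000000001"])
        print(detail["title"])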



I worked at said retailer as well; as a matter of fact, I was in charge of said millions of customer reviews. And you know what? BDBs were mostly used just because of distribution. They were pushed to all live servers nightly, so there was no dependency on one giant server.

This eventually changed as everything moved to a service architecture, and those BDBs eventually ended up in an Oracle database as well.

Basically, BDBs, although great for some things, were more a product of how to scale initially, and the pain quickly became big enough that they moved away from them for most things.

So I wouldn't consider said retailer's experience a terribly pro-NoSQL one.


If you'd had viable open-source SQL databases in 1997, would you have spent the engineering time on BerkeleyDB, or would you simply have replicated the master database onto each server? In 1997, you weren't choosing NoSQL vs. SQL; you were choosing open-source vs. commercial.

Anyway, you're essentially storing pre-generated pages, which isn't a use case that I think anyone considers particularly database-appropriate (SQL or NoSQL). Using memcached to cache the data that is in active use seems more efficient and faster, and gives you the option to force through out-of-sequence updates (though again, that wasn't available to you off-the-shelf in 1997). Complete re-generation of data might have worked for Amazon in 1997, but is this what they're using today?
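
For what it's worth, the cache-aside pattern being suggested looks roughly like the sketch below. It assumes a memcached instance on localhost:11211 and the third-party pymemcache client; fetch_from_db is a hypothetical stand-in for the authoritative (and slower) lookup.

    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))

    def fetch_from_db(asin):
        # placeholder for the real, slower data source
        return "<detail page for %s>" % asin

    def get_detail(asin, ttl=300):
        page = cache.get(asin)          # None on a cache miss
        if page is None:
            page = fetch_from_db(asin)
            # an out-of-sequence update can simply overwrite this key
            cache.set(asin, page, expire=ttl)
        return page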


With open-source SQL databases, there is no "simply" when it comes to replication. Even today, MySQL replication is brittle, and master/slave inconsistencies are the rule rather than the exception. Slave crashes often cause replayed transactions due to the lack of atomicity in writing master.info and relay-log.info. The replication landscape with PostgreSQL is varied and essentially a bag on the side; last I counted, there were more than 10 different ways of doing it, a number involving trigger-based log shipping. It wasn't about open-source vs. commercial; it was about scaling the reads.

The detail pages weren't pre-generated; they were based on read-only catalog data, which I think is entirely database-appropriate. I imagine that complete re-generation of data is no longer done, but I'd be willing to bet that Berkeley DBs are still used in production somewhere.


A read-only database isn't a database in my book.

I agree that built-in replication can be difficult to administer even today, but you're being completely revisionist here. Replication wasn't introduced into MySQL until 2000. In 1997, you would by necessity have rolled your own replication system tailored to your needs (much simpler than solving the general-case problem). That's basically what you did anyway, but you solved it in the most trivial way possible: you 'replicated' by doing a complete database dump and re-distributing the entire DB. If you'd had a viable open-source relational database, you could have scaled the reads and gotten more developer productivity by distributing a SQL database (e.g., SQLite) rather than a key-value database (BDB).
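
To make the alternative concrete: shipping a read-only SQLite file to each front end would look something like the sketch below (schema and names invented for illustration). Python's built-in sqlite3 module is used here, though any SQLite binding would do.

    import sqlite3

    # Build step (central): write the catalog into a single .sqlite file.
    con = sqlite3.connect("catalog.sqlite")
    con.execute("CREATE TABLE IF NOT EXISTS item (asin TEXT PRIMARY KEY, title TEXT)")
    con.execute("INSERT OR REPLACE INTO item VALUES (?, ?)", ("B000000001", "Example Book"))
    con.commit()
    con.close()

    # Front-end step: open the pushed file read-only and run ad-hoc SQL over it.
    ro = sqlite3.connect("file:catalog.sqlite?mode=ro", uri=True)
    title, = ro.execute("SELECT title FROM item WHERE asin = ?", ("B000000001",)).fetchone()
    print(title)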

I appreciate your standing up and giving a concrete example of NoSQL usage; nobody else has been brave enough to do so. But it seems that the reasons for it were highly specific to the time: there were no viable open-source databases; Amazon was just introducing the idea of customer reviews (i.e., pre-Web 2.0), so data was primarily read-only; memory was comparatively expensive and memcached didn't exist; and you had a comparatively small product catalog where complete re-generation was an option. I don't think you can carry forward the optimizations you made in that framework into today's world.


See my reply to the grandparent.

I actually was responsible for that system, and for moving away from BDBs being pushed to servers sometime in '00 or so.

As you said, these weren't really databases by any stretch of the imagination, simply snapshots built for a very specific type of query (by ASIN, by time, reverse ordered).
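
For readers who haven't built one of these: a common way to support "by ASIN, by time, reverse ordered" lookups in a BTREE-style key-value store is to pack an inverted timestamp into the key, so a plain ascending scan under one ASIN yields newest-first. The snippet below is only an illustration of that general technique, not the actual key format that was used.

    MAX_TS = 10**10   # larger than any Unix timestamp we care about

    def review_key(asin, unix_ts):
        # Lexicographic order of these keys == reverse chronological order per ASIN.
        return "%s:%010d" % (asin, MAX_TS - unix_ts)

    keys = [review_key("B000000001", ts) for ts in (946684800, 1200000000, 1600000000)]
    print(sorted(keys))   # the newest review (1600000000) sorts first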

Building the DBs was a pain in the ass, because they were so big that you had to do clean builds (instead of incrementals) fairly often to keep them from wasting space. There were also all sorts of voodoo magic going on to work around various BDB issues.

The system did eventually move to a service architecture (as all of AMZN did), for a few main reasons:

1) pushing that much data to more and more servers was getting insane, even on their internal networks.

2) we wanted faster turnaround for new reviews

3) rebuilding the BDBs was becoming more and more cumbersome with scale

All that said, the original system did take us pretty darn far, both in scalability of traffic and scalability of data, farther than most websites will ever reach.

Fun times working there; you really get to work on some unique problems.



