Buzzword headline aside, the Spanner paper is great and worth your time. As is the BigTable paper, the Dremel paper, and the Paxos Made Live paper.
I read the Google whitepapers and wonder, is there anywhere else one can go to work on real solutions to distributed systems problems? At smaller scales you can cheat -- you don't need Paxos, you can get away with non-consensus-based master / slave failover. You can play the odds with failure modes. At Google's scale, you can't: behavior under normally unlikely failures matters, probability matters, CAP matters.
Look at Titan (http://thinkaurelius.github.com/titan), a new distributed OLTP graph database whose storage layer adds distributed transactions to pluggable backends such as HBase and Cassandra. It's by the team that created Tinkerpop Blueprints and Gremlin, the graph traversal language.
"Calvin can run 500,000 transactions per second on 100 EC2 instances in Amazon’s US East (Virginia) data center, it can maintain strongly-consistent, up-to-date 100-node replicas in Amazon’s Europe (Ireland) and US West (California) data centers---at no cost to throughput."
"Calvin is designed to run alongside a non-transactional storage system, transforming it into a shared-nothing (near-)linearly scalable database system that provides high availability and full ACID transactions. These transactions can potentially span multiple partitions spread across the shared-nothing cluster. Calvin accomplishes this by providing a layer above the storage system that handles the scheduling of distributed transactions, as well as replication and network communication in the system. The key technical feature that allows for scalability in the face of distributed transactions is a deterministic locking mechanism that enables the elimination of distributed commit protocols."
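The idea behind that last sentence can be sketched in a few lines. This is not Calvin's actual code, just an illustrative toy: if every node sees the same global transaction order and grants locks strictly in that order, every replica makes identical scheduling decisions, so no distributed commit protocol (e.g. two-phase commit) is needed to agree on outcomes.

```python
# Toy sketch of deterministic locking: lock requests are enqueued in a
# globally agreed transaction order (here, txn_id doubles as the sequence
# number), and a transaction runs only when it heads every queue it needs.
from collections import defaultdict, deque

class DeterministicLockManager:
    def __init__(self):
        self.waiters = defaultdict(deque)  # key -> FIFO queue of txn ids

    def request(self, txn_id, keys):
        """Enqueue lock requests in global transaction order."""
        for k in sorted(keys):
            self.waiters[k].append(txn_id)

    def runnable(self, txn_id, keys):
        """A txn may execute once it is at the head of every queue it needs."""
        return all(self.waiters[k][0] == txn_id for k in sorted(keys))

    def release(self, txn_id, keys):
        for k in sorted(keys):
            assert self.waiters[k].popleft() == txn_id

lm = DeterministicLockManager()
lm.request(1, {"a", "b"})
lm.request(2, {"b", "c"})
print(lm.runnable(1, {"a", "b"}))  # True: txn 1 is first on both keys
print(lm.runnable(2, {"b", "c"}))  # False: waits behind txn 1 on "b"
lm.release(1, {"a", "b"})
print(lm.runnable(2, {"b", "c"}))  # True
```

Because the order is fixed up front, every replica reaches the same answer independently, which is what lets Calvin drop the commit protocol.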
You could come work at Sense (http://www.senseplatform.com) or email me tristan@senseplatform.com. Not Google scale, but the same technological challenges without the legacy baggage.
At lanl.gov, we've been developing a ton of stuff related to distributed problems. Recently we wrote a library that abstracts self-stabilization for treewalking single-mount petascale filesystems, extending Dijkstra's original design: https://github.com/hpc/libcircle
We have a detailed paper on it coming out at SC12.
We created Arakoon (http://arakoon.org) in-house to solve a real distributed systems problem. At small engineering scale (only a handful of contributors), using (Multi-)Paxos.
Lots of big sites roll their own solutions for this problem. Often there will even be multiple in-house technologies with different characteristics. Google, LinkedIn, and Facebook for sure.
I haven't read TFA, so this is out of context I'm sure. Your note on failover just caught my eye. It's not a counter to your thought-provoking comment.
"At smaller scales you can cheat -- you don't need Paxos, you can get away with non-consensus-based master / slave failover."
IME, not so much. GitHub's recent outages are a solid example. Automating slave promotion for master/slave systems based on a single point of data -- IP failover, process failure, heartbeat, etc. -- is an inherently risky strategy.
More often than not, no matter the amount of hardening or testing I put into my scripts, it just doesn't work when you really need it to.
TravisCI on Heroku seems like another good example.
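The core problem with single-signal promotion can be shown in a toy simulation (all names here are illustrative, not from any real system): a network partition is indistinguishable from a dead master, so a naive watchdog happily promotes the slave while the old master is still accepting writes.

```python
# Toy split-brain demo: the watchdog's only input is "did I see the
# master's heartbeat?". A partition makes that signal false even though
# the master is alive, so promotion yields two writable masters.

class Node:
    def __init__(self, name, role):
        self.name, self.role, self.alive = name, role, True

def watchdog_promote(master_heartbeat_seen, slave):
    """Naive rule: missed heartbeat => promote the slave."""
    if not master_heartbeat_seen:
        slave.role = "master"

master = Node("db1", "master")
slave = Node("db2", "slave")

# Partition between watchdog and master: master is alive but unseen.
watchdog_promote(master_heartbeat_seen=False, slave=slave)

masters = [n for n in (master, slave) if n.alive and n.role == "master"]
print(len(masters))  # → 2: split-brain
```

Consensus protocols like Paxos avoid this by requiring a majority to agree on who the master is, rather than trusting one observer's single data point.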
What does work incredibly well IME is systems designed with HA built in. HA-Proxy and CARP is a seriously solid combination. You can synchronize your configuration nightly to add some meager level of automation to the process. I'm also pretty excited about CouchDB (and its derivatives) since AFAIK it's the only free, *nix, peer-to-peer replicating database system I've found (I know of no RDBMS that qualifies; Riak is not free, and RavenDB is Microsoft Server only).
In my experience, the two pieces of the HA "last mile" that few systems overcome are:
* Server Affinity is bad, so bad
* Failover MUST be transparent to the clients
The great thing about a database over HTTP is that it's stateless. As long as the failover doesn't happen mid-request, you don't have to worry about connection errors, reconnecting, etc. So pair that up with something like HA-Proxy and you can increase both read scalability and availability. Then stick HA-Proxy on a box using CARP, and you've just automated the failover of your entire database stack, with as close to zero downtime as you can get, all with simple, reliable, free tools.
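For concreteness, a setup like that might look something like the fragment below. This is an illustrative haproxy.cfg sketch, not a production config: the addresses, ports, and node names are made up, and the health-check path varies by CouchDB version (a plain GET / works on the classic releases).

```
# Illustrative haproxy.cfg fragment: round-robin reads across two CouchDB
# nodes, dropping a node from rotation the moment its health check fails,
# so clients never notice a failover.
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend couchdb_front
    bind *:5984
    default_backend couchdb_back

backend couchdb_back
    balance roundrobin
    option httpchk GET /
    server couch1 10.0.0.11:5984 check
    server couch2 10.0.0.12:5984 check
```

Run the same config on two HA-Proxy boxes sharing a CARP virtual IP and the load balancer itself stops being a single point of failure.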
Put several application servers behind HA-Proxy as well, and use something like Infinispan for Session Storage and Caching, with CouchDB for file storage and now you have a system that can support highly dynamic web sites, the sort that can't typically be easily cached so availability can't easily be achieved through serving stale responses, with no single point of failure.
Then use FreeBSD for your CouchDB servers, have at least two actively load-balanced systems for "production" and a third that's just a replication-consumer backup. Then have all three independently zfs-snapshot for your backups. Now your massive database files are safe as well, and it doesn't impact availability. In the case of critical corruption (ie: an application bug that propagated bad data throughout the system), just a few minutes of manual effort cloning snapshots until you pinpoint the issue can have you back up and running.
No lengthy database restore commands to run. If you know when the bad data was persisted, you could have the correct snapshot identified, a clone created, a server instance loaded on another port, and the records in the live system batch-updated. If you document and run drills on the different application failure scenarios, you could recover from what would otherwise be catastrophic losses in just a few minutes.
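The snapshot/clone flow above boils down to a handful of commands. This is an illustrative sketch only: the pool and dataset names are invented, and the exact way you point a throwaway CouchDB instance at the clone depends on your setup.

```shell
# Nightly snapshot (e.g. from cron); pool/dataset names are made up.
zfs snapshot tank/couchdb@2012-09-20

# After discovering corruption: list snapshots, find the last good one.
zfs list -t snapshot -r tank/couchdb

# Clone it (instant, copy-on-write -- no lengthy restore) and run a
# second CouchDB instance on another port with database_dir pointed at
# the clone to inspect or extract the good records.
zfs clone tank/couchdb@2012-09-19 tank/couch-recover

# Discard the clone when you're done.
zfs destroy tank/couch-recover
```

Because clones are copy-on-write, trying several candidate snapshots to pinpoint when the bad data appeared costs almost nothing in time or disk.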
You could even have the backup system running on the last snapshot; then, if you're really on the ball, you could be back online simply by disabling the health checks (run the health check through an nginx proxy, so you can just "sv stop healthcheck") on the primary systems, assuming you could develop a reliable way to suspend the clone / server-process-restart loop on the backup server.
As long as your monitoring is up to snuff, you should be able to sleep peacefully.
Got a bit carried away. Hopefully someone finds the rambling useful though. ;-)