Buzzword headline aside, the Spanner paper is great and worth your time. As is the BigTable paper, the Dremel paper, and the Paxos Made Live paper.
I read the Google whitepapers and wonder, is there anywhere else one can go to work on real solutions to distributed systems problems? At smaller scales you can cheat -- you don't need Paxos, you can get away with non-consensus-based master / slave failover. You can play the odds with failure modes. At Google's scale, you can't: behavior under normally unlikely failures matters, probability matters, CAP matters.
Look at Titan (http://thinkaurelius.github.com/titan), a new distributed OLTP graph database whose storage layer adds distributed transactions to pluggable backends such as HBase and Cassandra. It's by the team that created Tinkerpop Blueprints and Gremlin, the graph traversal language.
"Calvin can run 500,000 transactions per second on 100 EC2 instances in Amazon’s US East (Virginia) data center, it can maintain strongly-consistent, up-to-date 100-node replicas in Amazon’s Europe (Ireland) and US West (California) data centers---at no cost to throughput."
"Calvin is designed to run alongside a non-transactional storage system, transforming it into a shared-nothing (near-)linearly scalable database system that provides high availability and full ACID transactions. These transactions can potentially span multiple partitions spread across the shared-nothing cluster. Calvin accomplishes this by providing a layer above the storage system that handles the scheduling of distributed transactions, as well as replication and network communication in the system. The key technical feature that allows for scalability in the face of distributed transactions is a deterministic locking mechanism that enables the elimination of distributed commit protocols."
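The idea behind that last sentence can be sketched in a few lines. This is not Calvin's actual code, just an illustrative toy: if every node sees the same global transaction order and grants locks strictly in that order, every replica makes identical scheduling decisions, so no distributed commit protocol (e.g. two-phase commit) is needed to agree on outcomes.

```python
# Toy sketch of deterministic locking: lock requests are enqueued in a
# globally agreed transaction order (here, txn_id doubles as the sequence
# number), and a transaction runs only when it heads every queue it needs.
from collections import defaultdict, deque

class DeterministicLockManager:
    def __init__(self):
        self.waiters = defaultdict(deque)  # key -> FIFO queue of txn ids

    def request(self, txn_id, keys):
        """Enqueue lock requests in global transaction order."""
        for k in sorted(keys):
            self.waiters[k].append(txn_id)

    def runnable(self, txn_id, keys):
        """A txn may execute once it is at the head of every queue it needs."""
        return all(self.waiters[k][0] == txn_id for k in sorted(keys))

    def release(self, txn_id, keys):
        for k in sorted(keys):
            assert self.waiters[k].popleft() == txn_id

lm = DeterministicLockManager()
lm.request(1, {"a", "b"})
lm.request(2, {"b", "c"})
print(lm.runnable(1, {"a", "b"}))  # True: txn 1 is first on both keys
print(lm.runnable(2, {"b", "c"}))  # False: waits behind txn 1 on "b"
lm.release(1, {"a", "b"})
print(lm.runnable(2, {"b", "c"}))  # True
```

Because the order is fixed up front, every replica reaches the same answer independently, which is what lets Calvin drop the commit protocol.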
You could come work at Sense (http://www.senseplatform.com) or email me tristan@senseplatform.com. Not Google scale, but the same technological challenges without the legacy baggage.
At lanl.gov, we've been developing a ton of stuff related to distributed problems. Recently we wrote a library that abstracts self-stabilization for treewalking single-mount petascale filesystems, extending Dijkstra's original design: https://github.com/hpc/libcircle
We have a detailed paper on it coming out at SC12.
We created Arakoon (http://arakoon.org) in-house to solve a real distributed systems problem. At small engineering scale (only a handful of contributors), using (Multi-)Paxos.
Lots of big sites roll their own solutions for this problem. Often there will even be multiple in-house technologies with different characteristics. Google, LinkedIn, and Facebook for sure.
I haven't read TFA, so this is out of context I'm sure. Your note on failover just caught my eye. It's not a counter to your thought-provoking comment.
"At smaller scales you can cheat -- you don't need Paxos, you can get away with non-consensus-based master / slave failover."
IME, not so much. GitHub's recent outages are a solid example. Automating slave promotion for master/slave systems based on a single point of data -- IP failover, process failure, heartbeat, etc. -- is an inherently risky strategy.
More often than not, no matter the amount of hardening or testing I put into my scripts, it just doesn't work when you really need it to.
TravisCI on Heroku seems like another good example.
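The core problem with single-signal promotion can be shown in a toy simulation (all names here are illustrative, not from any real system): a network partition is indistinguishable from a dead master, so a naive watchdog happily promotes the slave while the old master is still accepting writes.

```python
# Toy split-brain demo: the watchdog's only input is "did I see the
# master's heartbeat?". A partition makes that signal false even though
# the master is alive, so promotion yields two writable masters.

class Node:
    def __init__(self, name, role):
        self.name, self.role, self.alive = name, role, True

def watchdog_promote(master_heartbeat_seen, slave):
    """Naive rule: missed heartbeat => promote the slave."""
    if not master_heartbeat_seen:
        slave.role = "master"

master = Node("db1", "master")
slave = Node("db2", "slave")

# Partition between watchdog and master: master is alive but unseen.
watchdog_promote(master_heartbeat_seen=False, slave=slave)

masters = [n for n in (master, slave) if n.alive and n.role == "master"]
print(len(masters))  # → 2: split-brain
```

Consensus protocols like Paxos avoid this by requiring a majority to agree on who the master is, rather than trusting one observer's single data point.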
What does work incredibly well IME is systems designed with HA built in. HA-Proxy and CARP is a seriously solid combination. You can synchronize your configuration nightly to add some meager level of automation to the process. I'm also pretty excited about CouchDB (and its derivatives) since AFAIK it's the only free, *nix, peer-to-peer replicating database system I've found (I know of no RDBMS that qualifies; Riak is not free, and RavenDB is Microsoft Server only).
In my experience, the two pieces of the HA "last mile" that few systems overcome are:
* Server Affinity is bad, so bad
* Failover MUST be transparent to the clients
The great thing about a database over HTTP is that it's stateless. As long as the failover doesn't happen mid-request, you don't have to worry about connection errors, reconnecting, etc. So pair that up with something like HA-Proxy and you can increase both read scalability and availability. Then stick HA-Proxy on a box using CARP, and you've just automated the failover of your entire database stack, with as close to zero downtime as you can get, all with simple, reliable, free tools.
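For concreteness, a setup like that might look something like the fragment below. This is an illustrative haproxy.cfg sketch, not a production config: the addresses, ports, and node names are made up, and the health-check path varies by CouchDB version (a plain GET / works on the classic releases).

```
# Illustrative haproxy.cfg fragment: round-robin reads across two CouchDB
# nodes, dropping a node from rotation the moment its health check fails,
# so clients never notice a failover.
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend couchdb_front
    bind *:5984
    default_backend couchdb_back

backend couchdb_back
    balance roundrobin
    option httpchk GET /
    server couch1 10.0.0.11:5984 check
    server couch2 10.0.0.12:5984 check
```

Run the same config on two HA-Proxy boxes sharing a CARP virtual IP and the load balancer itself stops being a single point of failure.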
Put several application servers behind HA-Proxy as well, and use something like Infinispan for Session Storage and Caching, with CouchDB for file storage and now you have a system that can support highly dynamic web sites, the sort that can't typically be easily cached so availability can't easily be achieved through serving stale responses, with no single point of failure.
Then use FreeBSD for your CouchDB servers, have at least two actively load-balanced systems for "production" and a third that's just a replication-consumer backup. Then have all three independently zfs-snapshot for your backups. Now your massive database files are safe as well, and it doesn't impact availability. In the case of critical corruption (ie: an application bug that propagated bad data throughout the system), just a few minutes of manual effort cloning snapshots until you pinpoint the issue can have you back up and running.
No lengthy database restore commands to run. If you know when the bad data was persisted, you could have the correct snapshot identified, a clone created, a server instance loaded on another port, and the records in the live system batch-updated. If you document and run drills on the different application failure scenarios, you could recover from what would otherwise be catastrophic losses in just a few minutes.
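The snapshot/clone flow above boils down to a handful of commands. This is an illustrative sketch only: the pool and dataset names are invented, and the exact way you point a throwaway CouchDB instance at the clone depends on your setup.

```shell
# Nightly snapshot (e.g. from cron); pool/dataset names are made up.
zfs snapshot tank/couchdb@2012-09-20

# After discovering corruption: list snapshots, find the last good one.
zfs list -t snapshot -r tank/couchdb

# Clone it (instant, copy-on-write -- no lengthy restore) and run a
# second CouchDB instance on another port with database_dir pointed at
# the clone to inspect or extract the good records.
zfs clone tank/couchdb@2012-09-19 tank/couch-recover

# Discard the clone when you're done.
zfs destroy tank/couch-recover
```

Because clones are copy-on-write, trying several candidate snapshots to pinpoint when the bad data appeared costs almost nothing in time or disk.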
You could even have the backup system running on the last snapshot; then, if you're really on the ball, you could be back online simply by disabling the health checks (run the health check through an nginx proxy, so you can just "sv stop healthcheck") on the primary systems, assuming you could develop a reliable way to suspend the clone / server-process-restart loop on the backup server.
As long as your monitoring is up to snuff, you should be able to sleep peacefully.
Got a bit carried away. Hopefully someone finds the rambling useful though. ;-)