This is a beautiful site with tons of information, graphs, lists, examples, etc. and yet omits the one thing that underpins it all: how it actually works.
The claims are: serializable ACID cross-machine transactions "without performance penalty." MVCC with optimistic concurrency control.
Optimistic concurrency control means that the server has to check the version on all modified data before the transaction commits. Cross-machine transactions mean that this version check has to happen on multiple machines. ACID means that both machines have to either commit (if all the version checks succeed) or roll back. How are you going to reconcile all of these requirements without resorting to two-phase commit, which most certainly has a performance penalty? And how are you going to get serializable transactions across machines without processing these cross-machine commits one at a time (waiting for two-phase commit each time)? I'm just not seeing it.
I'll try to address your questions. This isn't a complete explanation of how our system works!
Most people trying to solve this problem have approached it by taking multiple machines, each capable of processing transactions, and doing distributed transactions between them. This leads to systems that handle local transactions quickly but global transactions much more slowly, and to a tradeoff between the quality of fault tolerance (Paxos = good, two-phase commit = bad) and commit performance.
Our approach is instead to decouple the first job you mention (checking if transactions conflict to provide isolation) from the second job (making transactions durable). These can then be scaled independently.
The first job, conflict resolution, is a small fraction of the total work and needs access only to the key ranges used by each transaction, not the existing contents of the database or the actual values being written. This allows it to be very fast.
The rest of the system is a consistent, durable distributed store that has to preserve a predetermined ordering of writes, and provide a limited window of MVCC snapshots for reads.
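To make that concrete, here is a minimal sketch of what a decoupled, range-based conflict resolver might look like. The names and data structures below are a toy illustration, not our actual implementation:

    # Illustrative sketch of range-based optimistic conflict resolution.
    class ConflictResolver:
        def __init__(self):
            self.version = 0
            # (begin, end, commit_version) of recently committed writes; a real
            # implementation would use an interval tree and expire old entries.
            self.recent_writes = []

        def try_commit(self, read_version, read_ranges, write_ranges):
            # Return a commit version if the transaction passes, else None.
            for begin, end, committed_at in self.recent_writes:
                if committed_at <= read_version:
                    continue  # the transaction already saw this write
                for r_begin, r_end in read_ranges:
                    if r_begin < end and begin < r_end:  # half-open ranges overlap
                        return None  # a newer write touched our reads: conflict
            self.version += 1
            for w_begin, w_end in write_ranges:
                self.recent_writes.append((w_begin, w_end, self.version))
            return self.version

Note that the resolver sees only key ranges and version numbers; it never needs the values being written or the existing contents of the database, which is what makes it cheap.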
I still don't see it. Say you have conflicting transactions T1 and T2 that both update data on two machines M1 and M2. If these transactions race to commit, how do you guarantee consistency? Even if you have a perfect, zero-latency oracle that can tell you that T1 and T2 conflict, you still need M1 and M2 to form consensus about which transaction commits first, and to make sure that both machines either commit or roll back (I am assuming that both machines have their own transaction log). It still sounds to me like 2PC is required.
Specific questions: does each machine have its own commit log? Is each machine authoritative for its range of the keyspace, in that it can serve reads without having to consult other machines?
The conflict resolution service assigns a global ordering to transactions as well as pass/fail. Transactions that fail don't do any writes. Transactions that pass still aren't durable; they can be rolled back by a subsequent failure until they get on disk in a few places.
Each machine doesn't have its own commit log at the level of the distributed database. (The key/value store where we locally store data on each node happens to also use write ahead logging, but this is not an architectural requirement). Ideally transactions just have to be made durable in some N places to be considered committed. In practice, we elect specific machines to a transaction logging role, among other reasons because SSDs do better at providing fast durable commits if they aren't also serving random reads.
Each machine is authoritative for reads for some ranges of the keyspace and for the range of versions it knows about. It proactively fetches writes for newer versions. If it gets a request for a version it hasn't managed to get the writes for yet, the read has to wait while that storage server catches up. In practice this lag is very small, as you can see from our latency measurements.
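Roughly, you can picture a storage server like this toy model (illustrative only; these names are not our internals):

    import threading

    class StorageServer:
        # Toy model: authoritative for one key range over a window of versions.
        def __init__(self):
            self.latest_version = 0
            self.history = {}  # key -> list of (version, value), ascending
            self.caught_up = threading.Condition()

        def apply_writes(self, version, writes):
            # Called as we proactively fetch committed writes, in version order.
            with self.caught_up:
                for key, value in writes.items():
                    self.history.setdefault(key, []).append((version, value))
                self.latest_version = version
                self.caught_up.notify_all()

        def read(self, key, version):
            # Answer authoritatively at `version`, waiting briefly if we lag.
            with self.caught_up:
                while self.latest_version < version:
                    self.caught_up.wait()
            value = None
            for v, val in self.history.get(key, []):
                if v <= version:
                    value = val  # newest value at or below the read version
                else:
                    break
            return value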
I can only speculate, because I have never used Datomic.
Like FoundationDB, Datomic does appear to separate transaction processing from storage.
A few relevant differences:
- Datomic provides transaction isolation through stored procedures which run on its single "transactor", while FoundationDB provides interactive transactions.
- To my knowledge, Datomic does not aim for high write scalability, which we definitely do.
- Datomic is designed to keep multiversion history indefinitely, while FoundationDB keeps it just long enough to execute short transactions.
Serialized transactions over a broad distribution of keys aren't a huge problem, but you're right: this bodes poorly for hot data. I'm more concerned about their CAP semantics: I'm sorry, but claiming multi-datacenter availability and ACID transactions is not gonna work.
As any ACID database must, we choose Consistency over Availability in the CAP theorem sense. This means that when a datacenter is disconnected or poorly connected to the internet, the database is not available for writes in that datacenter. If you are serving a web site to the internet, and need it to stay available if one Amazon AZ fails, this isn't too bad.
Some applications really do need disconnected operation, and thus 'AP' semantics. (For example, a grocery list for a mobile phone, or an application that must be available everywhere if the entire Internet partitions.)
So you require a quorum over datacenters to stay up? How do you provide transactional consistency between datacenters? Multi-Paxos over blocks of transactions with delayed failure?
We require a quorum over coordination servers to stay up. For example, if you only have two datacenters with actual DB servers, a third coordination server out in the cloud somewhere can act as a tiebreaker (it will be lightly loaded).
These coordination servers do paxos, but are only needed when there are failures or role changes - they don't participate at all in committing individual transactions. Normally, we only need a single geographic ping time to do a durable transaction if it originates in the current 'primary' datacenter.
For further improved latency, we plan to allow each individual transaction to decide whether it needs multi-datacenter durability. The commit process is the same, but we can notify the client of success earlier if it is willing to take the risk that a WAN partition or meteor strike violates its 'D' guarantee. ACI are guaranteed either way.
Gotcha. That sounds solid to me; would be a good explanation to put in your feature list.
Sounds like your coordinators are authoritative masters for transaction ordering. Does that imply a single-machine throughput limit in a given datacenter? Presumably the ordering process is much less expensive than the KV store itself, so this might not be a practical issue.
What happens when nodes are asymmetrically partitioned from the coordinator and peers? E.g. a node is unreachable by a peer, but reachable by a coordinator, or vice versa?
Hmm. Okay. I open a transaction for three days (let's say indefinitely), maybe reading a few tuples here and there and updating a few, but never committing. How does this system not fall over? I guess it could abort my transaction...
Still, I think their performance numbers can be legit, because they are much, much slower than what a handful of machines (far fewer than 24) could do if fully partitioned and without coordination. 500K reads/second on such small values is not really fantastic performance over 24 machines. I also don't understand the initial burst capacity on the read side, although I can guess: on writes it can make sense because some work is deferred, but I'm trying to understand how that can happen on reads as well. My guess is a transaction ID allocation on read that hits a wall somewhere.
Another common use case is I want to take a consistent backup/copy. For large databases, this is a very, very long snapshot to maintain, if MVCC, or a lot of locks to acquire, if 2PL. How does the system act then?
All in all, I think it's pretty neat, and I like that someone is dealing with OLTP database problems with a mind towards easing one's burden via transactions.
As other posters have already guessed, we don't support long-running transactions (we only keep multi-version information for a short time). So if you keep a transaction open for a long time it will not commit (after a while reads will start to fail, too).
It's not architecturally impossible for us to support long-running read snapshots, but it is expensive for our storage servers to keep snapshots alive for a long time. So our backup solution instead works by backing up transaction logs while doing a non-isolated read of the database. At restore time we will replay the logs to get back a consistent point-in-time snapshot.
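Sketched very roughly, the restore path combines the two like this (the record format here is hypothetical, just to show the idea):

    # Rebuild a consistent point-in-time snapshot from a fuzzy scan plus logs.
    # Assumes the log covers at least the window during which the scan ran.
    def restore(fuzzy_scan, log_records, target_version):
        # fuzzy_scan: {key: value} read without isolation during the backup.
        # log_records: (version, key, value) tuples, ascending by version.
        db = dict(fuzzy_scan)
        for version, key, value in log_records:
            if version > target_version:
                break
            # Replaying every logged write up to the target version overwrites
            # whatever inconsistent mix of versions the scan captured.
            db[key] = value
        return db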
The reason that we see higher burst than steady state performance has nothing to do with transactions. We have to do reads from SSD as the application requests them, and we have to make writes durable by logging them right away, but as you surmise we can defer the hard work of doing random writes to our btrees for a while. Even a workload that is 90% reads benefits a lot from deferring writes because writes have to be 3x replicated and because consumer SSDs are comparatively slow at mixed read/write workloads. Read only workloads will not see any initial burst (and might see a ramp-up time from cold cache effects).
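As a toy illustration of that write deferral (my sketch, not our actual storage engine):

    import os

    class DeferredWriteStore:
        # Make writes durable immediately via sequential log appends, but
        # batch the expensive random btree updates for later. Illustrative.
        def __init__(self, log_file):
            self.log = log_file
            self.pending = {}  # durable but not yet applied to the btree
            self.btree = {}    # stand-in for the on-disk btree

        def write(self, key, value):
            self.log.write(f"{key}={value}\n")
            self.log.flush()
            os.fsync(self.log.fileno())  # durable before we acknowledge
            self.pending[key] = value    # the random I/O is deferred

        def read(self, key):
            if key in self.pending:
                return self.pending[key]
            return self.btree.get(key)

        def background_flush(self, batch_size=1000):
            # Applying deferred writes in sorted batches turns random I/O into
            # something closer to sequential; this buffering is why burst write
            # throughput can exceed steady state for a while.
            for key in sorted(self.pending)[:batch_size]:
                self.btree[key] = self.pending.pop(key)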
> It's not architecturally impossible for us to support long-running read snapshots, but it is expensive for our storage servers to keep snapshots alive for a long time. So our backup solution instead works by backing up transaction logs while doing a non-isolated read of the database. At restore time we will replay the logs to get back a consistent point-in-time snapshot.
Sure, it's all technically possible -- Postgres will simply stop collecting garbage via VACUUM, for example -- but the results are not usually very pleasant. But it sounds like instead your solution to the snapshot-read is more like how the online binary backups work, whereby you combine transaction logs with inconsistent reads. That system works very well. On the other hand, "logical" backups have been a major problem for me; sidestepping this by using lower level transaction log mechanics and dirty reads I think is a much better idea.
> The reason that we see higher burst than steady state performance has nothing to do with transactions. We have to do reads from SSD as the application requests them, and we have to make writes durable by logging them right away, but as you surmise we can defer the hard work of doing random writes to our btrees for a while. Even a workload that is 90% reads benefits a lot from deferring writes because writes have to be 3x replicated and because consumer SSDs are comparatively slow at mixed read/write workloads. Read only workloads will not see any initial burst (and might see a ramp-up time from cold cache effects).
I see. It's not read-only, it's 90/10; I misread. That makes more sense. What's your read-only performance? Presumably it would be impacted by requiring fencing to deal with network partitions, which is one of the problems with non-partitioned, consistent systems, as far as I can tell: reads need to check for liveness, and that's much harder than doing nothing.
Random, uncached reads on the same dataset as the 90/10 - 1.6M/sec, limited by SSD throughput. Cached reads (from the same dataset, with a different distribution of read keys) you can see on our website at 3.2 M/sec. I should mention that the test clients also run on the same cluster, and in that workload they are a significant fraction of CPU.
Read requests come to a storage server with a specific version number attached, so the storage server can reply authoritatively without any further communications. Starting a transaction requires selecting a version number for the read snapshot, and that incurs some latency to deal with the concerns you mention (but scales fine).
> Okay. I open a transaction for three days (let's say indefinitely), maybe reading a few tuples here and there and updating a few, but never committing. How does this system not fall over?
They say they use optimistic concurrency control, so your writers aren't taking any locks; instead, the transaction will just fail at commit if any of your read-modify-writes were modified in the meantime.
> Coming soon. An integrated backup system provides a true "moment-in-time" snapshot backup of the entire distributed database stored to a remote file system on a schedule.
Sure it is. But I would say, more to the point, it's a problem for systems that support consistent snapshots. The question is: how does it deal with the failure condition?
Yes, we provide the strongest level of ACID semantics. Although proving things about large computer programs is pretty hard, we have spent much of the past three years building testing and validation systems to ensure this is true. We run tens of thousands of nightly simulations to check system correctness and properties in the face of machine failures, partitions, etc. We also run long-running tests on real-world clusters using programmable network and power switches to simulate these same cases in the real world.
So, we've convinced ourselves. What would you like to see on the site to help provide the kind of incredible evidence you're looking for?
Honestly... if you have figured out how to scale with full ACID... I have a hard time believing that having SQL support in there is going to "kill" your ability to scale.
I'm guessing that NoSQL documents are written at a coarser granularity than SQL table rows. One document could well represent 3, 4, 5 rows in a normalized SQL database.
OTOH, if you put all your "documents" in 3rd normal form, you might not see much gain.
You are right. Our API is strong enough that you could implement a SQL database on top of it efficiently. I have done it (using sqlite4 as the front end) as a proof of concept.
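The usual trick is to map rows into the ordered keyspace so that table scans become range reads. A sketch of one possible encoding (not necessarily what the sqlite4 proof of concept does):

    # Lay SQL rows onto an ordered key-value store; illustrative encoding only.
    def row_key(table, primary_key, column):
        # '\x00' separators keep key ordering aligned with (table, pk, column)
        # tuple ordering for simple string components.
        return f"{table}\x00{primary_key}\x00{column}"

    def insert_row(kv, table, primary_key, row):
        # Each column becomes its own key-value pair, so a whole-row insert
        # can ride inside one ACID transaction.
        for column, value in row.items():
            kv[row_key(table, primary_key, column)] = value

    def scan_table(kv, table):
        # Because keys are ordered, a full-table scan is one range read.
        prefix = f"{table}\x00"
        return {k: v for k, v in sorted(kv.items()) if k.startswith(prefix)}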
ACID represents one design philosophy at the consistency end of the consistency-availability spectrum. According to CAP, to get the maximum Consistency of ACID, you'd need to trade off against lower Availability and/or Partitionability.
"Trading off against partinonability" is kind of a strange concept -- not only can you not prevent network partitions, you usually can't tell when they've happened until it's too late.
The real upshot of the CAP theorem is that you have to choose what goes in the face of a partition. An ACID system says you lose availability (writes, and possibly reads, fail if you can't reach a majority of the participants); an Eventually Consistent system may pick either.
You can use Paxos. There will be some partitions that can cause an outage, but this is only the case when there is no majority of nodes that can communicate with each other, which in practice means you are exposing yourself to a vanishingly small risk.
3 rounds of communication to reach consensus for every transaction? Off the cuff, it seems like that would hinder performance. I haven't tested it, though.
It doesn't have to be per transaction, it could be for a batch of transactions. There are also optimizations like multi-paxos that can reduce the number of round trips.
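A rough sketch of the batching idea, with the consensus round stubbed out (this isn't any particular Paxos library):

    import queue

    def run_batcher(submit_queue, consensus_round, max_batch=64):
        # Amortize consensus: drain whatever transactions have queued up and
        # commit them all in a single round.
        while True:
            batch = [submit_queue.get()]  # block for the first txn
            while len(batch) < max_batch:
                try:
                    batch.append(submit_queue.get_nowait())
                except queue.Empty:
                    break
            consensus_round(batch)  # one round, many transactions

The rounds-per-transaction cost shrinks as load grows, which is what makes this viable.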
Presumably they sacrifice availability? If your network is only partitioned 1/1000 of the time and you only need 99.9% uptime, that would be a viable approach.
That system works by giving up interactive transactions. But the FoundationDB page claims to have interactive transactions, so it must be doing something else.
Calvin requires all transactions to be executed fully server-side and sacrifices the freedom to non-deterministically abort or reorder transactions on-the-fly during execution. In return, Calvin gets scalability, ACID-compliance, and extremely low-overhead multi-shard transactions over a shared-nothing architecture.
The comments on that post are pretty interesting, too.
FoundationDB transactions are true interactive sessions, unlike distributed databases that require stored procedures. This means that client code can make an iterative series of reads and writes over the network to execute complex transactions.
Looking at what they are providing, it's a shared-nothing architecture with data stored in ordered form. Technically, you can make this work in a very scalable fashion with an SSTable-style store.
Think of it as LevelDB with a distributed B+ tree (or even just a few extra levels) handling the partitioning between nodes. That can scale quite well, and wrap updates and reads with snapshots at very low overhead to provide all the key bits to handle ACID in a distributed database.
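For instance, the partitioning layer can be as simple as a sorted list of split keys (my own sketch, not anything this product documents):

    import bisect

    class PartitionMap:
        # Toy range partitioning: sorted split keys route each key, and each
        # ordered range scan, to the node(s) owning it.
        def __init__(self, split_keys, nodes):
            assert len(nodes) == len(split_keys) + 1
            self.split_keys = split_keys  # e.g. ["g", "p"]
            self.nodes = nodes            # e.g. ["node0", "node1", "node2"]

        def node_for(self, key):
            return self.nodes[bisect.bisect_right(self.split_keys, key)]

        def nodes_for_range(self, begin, end):
            # A half-open range [begin, end) may span several nodes.
            first = bisect.bisect_right(self.split_keys, begin)
            last = bisect.bisect_left(self.split_keys, end)
            return self.nodes[first:last + 1]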
What's problematic is when you have a transaction that spans several nodes and you need to reconcile several transactions (the commit phase).
The usual way to handle it is to refuse the transaction if a conflict is detected, but this is a real performance and usability issue for a NoSQL database.
What I'm implying is that if you have ACID transactions but they can fail very easily, you don't offer much...
It's not that hard to retry a transaction until some deadline is reached. Many programming languages have standard library facilities to make this extremely easy.
If the database is fast enough and transaction failure is at least moderately unlikely, it's not really an issue.
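For example (illustrative Python; the names are mine, not any particular client library's):

    import random
    import time

    class ConflictError(Exception):
        pass  # raised when an optimistic transaction fails its commit-time check

    def run_transactionally(txn_body, deadline_seconds=5.0):
        # Retry an optimistic transaction until it commits or the deadline
        # passes; txn_body should be safe to re-run from the top.
        deadline = time.monotonic() + deadline_seconds
        backoff = 0.01
        while True:
            try:
                return txn_body()
            except ConflictError:
                if time.monotonic() >= deadline:
                    raise
                time.sleep(backoff + random.uniform(0, backoff))
                backoff = min(backoff * 2, 1.0)  # exponential backoff with jitter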
SSTable-style databases are very cheap to snapshot, so you could totally handle this MVCC-style. I'll grant the logic could get nutty in some pathological cases, but performance-wise it shouldn't be much of a problem.
Perhaps I'm being a designer-snob, but Twitter's Bootstrap has quickly become too popular. I honestly have a hard time taking a site seriously when it uses the default Bootstrap look.
Having a unique visual identity is extremely important.
Honestly, if someone wants to sell me a database, I'd prefer they spend money on database engineers rather than on designers. To each his own, I guess.
I don't think it's a designer-snob thing. Taking the bootstrap provided website and a lot of overreaching claims of db goodness gives me an immediate sense of there being no there, there.
I was nervous writing that sentence, because it's hard to be sure of the truth of any claim to be 'first'! But none of the examples you mention, to my knowledge, provide multi-key ACID transactions. A compare and set facility for an individual key/document/row/entity group/etc is a very useful feature, but cannot be used to provide atomicity or isolation for transactions that read or write more than one key.
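To see why, consider moving 10 units from key "a" to key "b" using only single-key compare-and-set (an illustrative sketch; `cas` here is a hypothetical per-key CAS primitive, not any real API):

    def transfer(kv, cas):
        a = kv["a"]
        b = kv["b"]
        cas("a", expected=a, new=a - 10)  # succeeds on its own
        # A concurrent reader here sees "a" debited but "b" not credited, so
        # isolation is gone; and if the second CAS fails there is no rollback,
        # so atomicity is gone too.
        cas("b", expected=b, new=b + 10)

No sequence of independent single-key operations closes that window.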
We plan to write something longer talking about the different levels of support that various products provide for A,C,I, and D.
> But none of the examples you mention, to my knowledge, provide multi-key ACID transactions
If you see it that way, you are right. The guarantee is just for one key/document pair (the key can be a key vector, though). There is no way to commit over several key/document pairs with these databases.
If FoundationDB can do that, it is a big plus. If it can do it in a cluster, your product will have a great future.