It seems every few years I read about another KVS (half of the time from academia) purporting to have desirable properties not previously found.
As someone who's never had to work with distributed data structures, I'd love to see an article comparing and contrasting them. There seems to be a whole lot of these things, so they can't all be the same. Off the top of my head, there's RocksDB, Aerospike, FoundationDB, ZippyDB, Harvard's "Monkey" (which is apparently an "optimal" key value store), and I'm sure many more. What is it that keeps improving?
I'd also love to read something about other distributed systems/data structures. Surely we need more than key value stores?
The post touches on the underlying problem. Distributed stuff is fundamentally very difficult. As Jepsen has shown so definitively, getting these protocols right is very hard even for very experienced developers. Small changes can destroy correctness, or influence the assumed consistency / isolation level in surprising ways.
The NoSQL fad embraced one extreme: the simplicity and performance of Eventual Consistency, essentially pushing whatever complexity that creates onto the application. At the other extreme, the Spanner paper explicitly argues for prioritizing correctness first, using a conservative transaction design that is more straightforward to implement and use, at the cost of potentially much worse performance under high contention.
Both approaches have merits, but I think it's clear they aren't one-size-fits-all solutions. Or even fits-most. Hence the explosion in systems trying to find some balance between these extremes. With time the picture should resolve: the early days of RDBMSs went through a similar explosion, then consolidated down to the most effective algorithms.
> Following the Dynamo design, Anna applies a CRC32 hash on the id to assign the actor to a position on the hash ring and applies the same hash function to a key in order to determine the actors responsible for storing the key.
I think this choice is what led to the "hot-partition" hell that early versions of DynamoDB suffered from, is that correct?
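For concreteness, here's a minimal sketch of the placement scheme the quoted passage describes: actor ids and keys are hashed onto the same CRC32 ring, and a key is owned by the next few actors clockwise from its position. The actor names and replica count are made up; this isn't Anna's actual code.

```python
import bisect
import zlib

RING_SIZE = 2 ** 32

class HashRing:
    def __init__(self, actor_ids, replicas=3):
        self.replicas = replicas
        # Each actor is placed on the ring at CRC32(actor_id).
        self.ring = sorted((zlib.crc32(a.encode()) % RING_SIZE, a) for a in actor_ids)

    def owners(self, key):
        # Hash the key onto the same ring, then walk clockwise to find the
        # actors responsible for storing it.
        pos = zlib.crc32(key.encode()) % RING_SIZE
        idx = bisect.bisect(self.ring, (pos,))
        return [self.ring[(idx + i) % len(self.ring)][1] for i in range(self.replicas)]

ring = HashRing(["actor-1", "actor-2", "actor-3", "actor-4"])
print(ring.owners("user:42"))  # e.g. ['actor-3', 'actor-1', 'actor-2'], depending on the hashes
```

The hot-partition problem falls out of this: whichever actor the hash assigns a popular key to eats all of that key's traffic, regardless of how evenly the ring is balanced overall.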
Does anyone know how DynamoDB handles this nowadays? What I know is that they eliminated the hot-partition issue, at least from the user/developer perspective.
Fundamentally they can do it because partitions are relatively small, so you'll have hundreds of partitions from hundreds of different tables on a single machine and a handful of hot partitions are only a tiny blip. (There is of course a limit to how hot a partition can be, they'll cut you off before you start consuming a large amount of resources on the host.)
Unfortunately single-tenant systems are more constrained in this, because you don't have millions of partitions to spread across thousands of machines.
DynamoDB differs from the Dynamo architecture quite a bit. DDB's partitioning of the hash ring is contiguous, and each partition is backed by a set of replicas. When a partition gets hot for whatever reason, the partition is split and each half gets its own set of replicas. This historically led to two issues:
1. Splitting also split RCUs/WCUs proportionally, so if you split a bunch and then scaled down, your DB would have a very low amount of RCUs/WCUs available per partition, leading to surprising throttling (see the rough arithmetic after this list). If you did a load test right before your launch, you would be better off trashing the DB to avoid this.
2. Responsiveness to hot spotting. DynamoDB wasn't as good about burst capacity management, so between when you hit a hot spot and when DynamoDB reacted, you'd be up a creek.
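To make issue 1 concrete, here's a back-of-the-envelope sketch. The numbers are invented, and the real split/allocation rules were more involved than an even division, but this is the shape of the problem as I understand the historical behavior.

```python
# Issue 1, roughly: provisioned throughput is divided evenly across partitions,
# and partitions don't merge back after a split, so a big load test followed by
# a scale-down leaves each partition with very little headroom.
def per_partition_wcu(provisioned_wcu, partitions):
    return provisioned_wcu / partitions

partitions_after_load_test = 40  # splits triggered by the load test (made-up number)

print(per_partition_wcu(40_000, partitions_after_load_test))  # 1000.0 WCU per partition during the test
print(per_partition_wcu(1_000, partitions_after_load_test))   # 25.0 WCU per partition after scaling down
```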
I also have a similar question: how could a consistent hash ring "infinitely scale"? As the number of nodes increases, the chances that some nodes flip-flop their membership will increase, and the system will become less reliable. Such temporary instability is very annoying for operations (alert mitigation in particular), especially when the request rate is high.
I think there's a reason that large-scale KV stores use range-based partitions, such as FB's Tectonic and Google's Colossus (Colossus uses Spanner for metadata, and Spanner uses hierarchical range-based partitioning).
Colossus doesn’t use Spanner for metadata unless the design has changed (always possible). Colossus metadata is self-hosted, except for the “ring zero” metadata needed to bring up the cluster, which is in Chubby.
Thanks. I must've got it wrong then. I remember the Colossus paper said Colossus built a scalable metadata system, but someone from Google told me that the metadata system was based on Spanner, and I took it for granted.
It's entirely possible they've re-architected Colossus to run on top of Spanner by now (I left in 2018). A lot of distributed systems problems just disappear when you throw Spanner into the mix, so it would make sense. The only argument against it would be adding a dependency.
Spanner stores its data in Colossus, so there would be some bootstrapping issues to resolve to move it to Spanner over Bigtable. (Bigtable also has the bootstrap issues but has solved them already and there are additional difficulties due to some details that I probably am not at liberty to share)
Spanner is used for metadata for many other very very large storage systems, though.
Yeah, that was my understanding as well. Colossus stores metadata in Bigtable on top of a smaller Colossus, which stores its metadata in Bigtable on top of an even smaller Colossus, which… [insert more stacks of turtles here] ends up in Chubby.
With respect to key value stores, I'm not sure why FoundationDB isn't the clear winner: strongly consistent, scalable, and ACID. Sure, it's not the fastest, but it seems like for most use cases that's really all you need, given it's still pretty fast.
* Operationally, there's no turn-key, hosted solution. Getting it set up locally for testing and CI is not trivial. You have to really want to use it. This criticism 100% applies to Anna too.
* FoundationDB provides serializable transactions (yay!), but it doesn't provide the tools to make this work well in the face of contention. FoundationDB uses fully optimistic concurrency control. Consider a workload where, say, you want users to do something as simple as incrementing a counter. In FoundationDB, you're going to have a bad time if there are lots of concurrent writes. You might instead need to reorganize the problem so that each increment just writes its own row, and then, in the background, a single-threaded process reads the older rows and folds them into the counter (a sketch of this pattern follows below). In Anna, you can define your counter data structure inside the system. That's cool. More generally though, Anna pushes the problem of consistency further into the database, which, while cool, is also not super convenient. It does scale. Other databases use pessimistic locking to deal with contention. That is going to be lower throughput than the uncontended case because the work has to serialize, but it'll still be higher throughput and lower latency than a fully optimistic approach.
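Here's a rough sketch of that "row per increment, fold in the background" pattern. It's written against a plain in-memory dict standing in for a transactional ordered KV store, not the real FoundationDB API, and all names are made up.

```python
# Each writer appends its own uniquely-keyed row (no two writers touch the same
# key, so optimistic concurrency control never sees a write-write conflict), and
# a single background folder sums the pending rows into the materialized total.
import uuid

kv = {}  # stand-in for the key/value store

def record_increment(counter, amount=1):
    # Unique key per increment -> no conflicts under OCC.
    kv[(counter, "pending", uuid.uuid4().hex)] = amount

def fold_increments(counter):
    # Run by a single background worker, so it only serializes with itself.
    pending = [k for k in kv if k[:2] == (counter, "pending")]
    total = kv.get((counter, "total"), 0) + sum(kv[k] for k in pending)
    for k in pending:
        del kv[k]
    kv[(counter, "total")] = total

def read_counter(counter):
    # The materialized total plus anything not yet folded in.
    unfolded = sum(v for k, v in kv.items() if k[:2] == (counter, "pending"))
    return kv.get((counter, "total"), 0) + unfolded

for _ in range(100):
    record_increment("page_views")
fold_increments("page_views")
print(read_counter("page_views"))  # 100
```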
FoundationDB is operationally very complex; there's a multitude of processes, no reference architectures for at-scale deployments (although those deployments do exist, at least at Snowflake and Apple), no Docker images with Helm charts, and no cloud managed solution.
It's also extremely low level: at the query layer, it's simply an ordered map of (byte[], byte[]). The sibling comment also mentions optimistic concurrency control as a built-in primitive that can scale poorly with high contention. These two issues and similar others are left for the query layer to resolve. FoundationDB is not really intended to be used directly as a general purpose key/value store, but to provide a distributed transaction and storage layer for a query layer implemented on top of it.
I think it's interesting to think of FoundationDB as a layer above RocksDB, which has quickly become the de facto standard for the single-node storage layer. For building highly scalable, globally distributed ACID systems, FoundationDB could very well be the clear winner like you say, although systems in this category have opted to implement their own consensus and transaction layers. At least, all the Spanner and Calvin derivatives I know of (Cloud Spanner, YugabyteDB, CockroachDB, FaunaDB) implement their own. But for use cases where strict serializability is not required, there's ample room for innovation, and I think this is where things like Anna can potentially fit in.
To clarify, FoundationDB has some documented multi-region features, but it's not at all clear that anybody who runs it at scale relies on those multi-region features [1]. Even if they do, it's not obvious they run this Fearless DR mode at any appreciable latency.
Geo-replication is an area where Anna really shines, and FoundationDB pays severe costs. All of the concurrency control problems have contention footprints on the order of round trips. Fully optimistic concurrency control needs to expose backoffs of quite a few round trips to be live. Even pessimistic concurrency control requires some number of round trips, probably at least 1.5 optimally, but in practice in most tuned systems, probably 3. A heck of a lot of use cases make sense at 0 global RTTs and don't at 1+. The ability to tell the database how to manage concurrency, and then have it provide causal consistency, is ultimately the best you can do. That's awesome.
At the end of the day, I have to believe there's a whole big mess of applications we'll build, one day, on systems in the portion of the design space Anna is choosing. This only recently stopped being cutting-edge research, and it's definitely still research. We don't know how to model these concepts at a high level and in a composable way where the masses of software developers can engage with them.
It's interesting to think about how long ago we were graced with Bayou [2]; that thing was ahead of its time. I suspect it's going to take a little while longer before these sorts of techniques make their way into data storage primitives we think of as part of the standard vernacular, but I believe we'll get there eventually.
Are there existing, production-ready KV stores that take advantage of CALM properties? I'm currently having to layer CALM optimizations on top of other DBs because my data model is built on associative and commutative operations, but databases seem totally split between "transactional" and "eventually consistent", with seemingly no native support for strongly eventually consistent operations.
From the docs (spoiler): Anna can give you only up to causal consistency; it cannot provide strong consistency at the key level, and nothing stronger than read-committed at the multi-key level.
Plenty of user facing apps do need strong consistency. We use Cassandra at Ripple, for API access to the XRP ledger, and strong consistency is a requirement. Records refer to other records, and not being able to reason about when things are going to show up in the database would lead to a ton of complexity. For instance, I read a transaction, then I want to read the account that sent or received that transaction. I need to know the record for the account is going to be there in the database, and is consistent with the transaction, as opposed to stale. Or, a client might be paging through data over several calls, where we return a marker to allow the client to resume. This wouldn't work if some nodes might not know about the marker.
When do you need anything stronger than causal consistency? E.g., when I was at JPMorgan, actual customer financial transactions in practice just needed causal consistency. Anything stronger than that was really about consistent snapshot reads for reporting purposes.
I disagree. Most user-facing apps I've seen in my career don't really need strong consistency.
The question is how quickly the eventual consistency propagates changes. If it's at the sub-second level, I'd say 99% of apps (of those which don't need strong consistency) can live with that. Single-digit seconds could still serve many purposes.
This means it is highly available and eventually consistent, but with additional consistency guarantees.
Anna is strongly eventually consistent. Meaning that, regardless of the order or number of times operations are replayed, the answer will always converge to the same value given time. For example, imagine you have a default value of "False", and a convergent value of "True". You may observe "False" after the value has been set to True, but you will never observe True and then have to worry about it flipping back to False.
Another example might be an integer that only grows. You can write to that integer:
1, 2, 3, 4
In any order, in parallel, or 100x each. But eventually the value will be 4 - not 1, 2, or 3. It may be 1, 2, or 3 at any time before it is 4, but the system will eventually converge to 4.
This can be very useful. Let's say you have a query "X < 3". This query happens after a write of 4, but you observe 3. You know that, given that the value only ever grows, X could be 3 or higher, but it definitely isn't less than 3.
So you can answer that query with the stale read. In an eventually consistent system without strong eventual consistency, after 4 gets written, another write may get replayed afterwards and make the value go back to 2.
This has some obvious benefits. You can cache values forever, without invalidation logic. If I had read '3' from a cache I could answer the query directly. If the read had returned '2' I would have to go fetch the latest value to know if it had grown since then.
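A toy model of the grow-only integer described above, written as a max-register whose merge is just max(). This is my own illustration of the convergence property, not Anna's lattice API.

```python
# A grow-only integer: writes only raise the value, and merging replicas is
# max(), which is commutative, associative, and idempotent. Any order and any
# duplication of the writes 1, 2, 3, 4 converges to 4, and a stale read can
# only be an underestimate -- which is what makes the "X < 3" reasoning safe.
import itertools
import random

class MaxRegister:
    def __init__(self, value=0):
        self.value = value

    def write(self, x):
        self.value = max(self.value, x)            # writes only grow the value

    def merge(self, other):
        self.value = max(self.value, other.value)  # convergent merge

writes = [1, 2, 3, 4]
for perm in itertools.permutations(writes):
    replicas = [MaxRegister() for _ in range(3)]
    for w in perm:
        random.choice(replicas).write(w)           # writes land on arbitrary replicas...
        random.choice(replicas).write(w)           # ...possibly more than once
    for a in replicas:
        for b in replicas:
            a.merge(b)                             # anti-entropy: everyone converges
    assert all(r.value == 4 for r in replicas)
print("every order and duplication converged to 4")
```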
You may be asking "but what if I need to know the value right then". The answer is that you can put a strongly consistent store behind your strongly eventually consistent store. This is what is proposed in the CURP paper that I can't find this very moment.
I am very confused by Table I claiming that Redis and HStore have a multi-key consistency of "Serializable". What meaning is it trying to convey? Serializable is a level of isolation, not of consistency, so how are they in the same category?