Why Cassandra Doesn't Need Vector Clocks (datastax.com)
118 points by nickmbailey on Sept 4, 2013 | 68 comments



Since they're not likely to approve my comment on their blog, here's what I said:

"Way to misrepresent[1] vector clock usage in Riak! LWW deliberately ignores the vector clock. No one would use that in production without a strong assurance that they will never have concurrent writes. Also note that later in the post[2] Kyle shows how using them properly leads to zero data-loss.

[1] https://yourlogicalfallacyis.com/strawman [2] http://aphyr.com/posts/285-call-me-maybe-riak"

To add on,

> Cassandra addresses the problem that vector clocks were designed to solve by breaking up documents/objects/rows into units of data that can be updated and merged independently. This allows Cassandra to offer improved performance and simpler application design.

This is not solving the problem vector clocks solve; it is punting on the resolution issue. Perhaps LWW partial updates result in greater performance, but they only address performance, not resolution.

Listen to or watch http://thinkdistributed.io/blog/2012/08/28/causality.html

To be fair, both designs are valid choices, but jbellis should be honest about his tradeoffs and not simply cast aspersions on other valid designs because they aren't the one that C* chose.


I concur; this is punting on the resolution problem.

As far as I can determine in testing with Jepsen, there are no cases where one can safely (e.g. in a way which guarantees some causal connection of your write to a future state of the system) update a cell in Cassandra without a strong timestamp coordinator: either an external system like Zookeeper, or Cassandra 2.0 paxos transactions.

Most of the production users and datastax employees I've spoken with recommend structuring your data as a write-once immutable log of operations, and to merge conflicts on read. This is exactly the same problem that Riak siblings have--except that you don't get the benefit of automatic pruning of old versions, because there's no causality tracking. GC is up to the user.
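As a rough illustration of that "write-once immutable log, merge on read" pattern: each write goes to a fresh cell keyed by a UUID so no two writes ever conflict, and reads fold all logged operations through a merge function. This is a minimal in-memory sketch (the `store` dict and function names are hypothetical stand-ins for a real Cassandra schema); note the merge function must tolerate arbitrary ordering and duplicates.

```python
import uuid

# Hypothetical in-memory stand-in for a table keyed by (key, op_id).
store = {}  # key -> {op_id: operation}

def log_operation(key, operation):
    """Append an operation under a unique, never-overwritten cell."""
    op_id = str(uuid.uuid4())          # unique id -> no two writes collide
    store.setdefault(key, {})[op_id] = operation

def read(key, merge):
    """Merge all logged operations into a current value at read time."""
    value = None
    for op in store.get(key, {}).values():  # arbitrary order!
        value = merge(value, op)
    return value

# Example merge: build a set from "add" operations. Set union is
# commutative, associative, and idempotent, so ordering and
# duplicate replays are harmless.
def merge_adds(acc, op):
    return (acc or set()) | {op}

log_operation("cart:42", "apples")
log_operation("cart:42", "bananas")
log_operation("cart:42", "apples")   # duplicate replay is harmless
print(sorted(read("cart:42", merge_adds)))  # ['apples', 'bananas']
```

The catch, per the comment above: nothing here tells you when it is safe to discard old cells, so garbage collection is left entirely to the application.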


> As far as I can determine in testing with Jepsen, there are no cases where one can safely (e.g. in a way which guarantees some causal connection of your write to a future state of the system) update a cell in Cassandra without a strong timestamp coordinator: either an external system like Zookeeper, or Cassandra 2.0 paxos transactions.

It depends on what you mean by "guarantees". In most real world systems, if you can actually have two concurrent writes to the same record, it is non-deterministic which write wins... which is precisely what happens in Cassandra if you have multiple concurrent writes to a cell without a strong timestamp coordinator.

There is a somewhat more complex case where multiple cells tied to the same record are updated together, but even then the result looks identical to allowing both writes to happen with a non-deterministic ordering of which happened first.

While you can create a timestamp authority that arbitrates precisely what happens when, the reality is that if there wasn't already an observable mechanism for determining this, who is to say which one happened first?

Really, the only problem you run into is if there is atomic validation logic you need tied in to whether the updates happen at all, which is what Cassandra 2.0's lightweight transactions really address. Without them, you generally do need some kind of external authority or a "log concurrently and then resolve serially later" strategy. That sounds like a pain, but the latter in particular is again closer to how most of the real world operates (in particular, the world of financial transactions).


The world of consistency is rich: not all systems require serializability, linearizability, or any one write winning. It might be interesting to skim http://pmg.csail.mit.edu/papers/adya-phd.pdf, http://ftp.research.microsoft.com/pub/tr/tr-95-51.pdf, and http://pagesperso-systeme.lip6.fr/Marc.Shapiro/papers/RR-6956.pdf for a taste.


(asking for myself and other readers of this thread)

is LWW = Last Write Wins?


Yes.


time to meet with VoltDB.


That requires your dataset to fit into RAM. Which is becoming more and more feasible. But Cassandra apparently can use disks while keeping performance (giving up on other things).


I explained my reasoning in more detail in the paragraphs starting with "Vector clocks are good at helping clients with simple merges like the above user object, but it’s important to understand that vector clocks only tell you that a conflict occurred, and not how to resolve it" and "What Cassandra gives up here is the ability to create custom behavior for updates-based-on-existing-values. However, as counters illustrate, vector clocks are often inadequate for this as well. Usually you end up having to store the entire version history, which Cassandra supports by clustering with uuids."

I think what you and aphyr are getting hung up on is this last part.

In any case, I wrote this because I got tired of people who read the Dynamo paper but haven't thought it through reacting to Cassandra with ZOMG YOU'LL SURELY LOSE DATA WITHOUT VECTOR CLOCKS. When in fact, this decision represents one that we made deliberately, for the reasons I've tried to describe. Maybe some not casting aspersions on other valid designs might be in order all around.

Thanks for reading!


But... you will lose data if your conflict resolution strategy is LWW, unless you do all writes to unique objects. Siblings+vclocks are provably equivalent to that strategy; they just allow you to garbage-collect unnecessary parts of the causal history more efficiently. Neither strategy frees you from having to define a commutative, associative, and idempotent merge function to perform a read.
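To make the data-loss claim concrete, here's a toy LWW register versus a siblings approach; names and timestamps are invented for illustration. Under LWW, a writer whose clock is a hair behind has its acknowledged write silently dropped; with siblings, both values survive to be merged at read time.

```python
# Toy last-write-wins register. Timestamps come from the writers'
# (possibly skewed) clocks, so "higher" need not mean "causally later".
def lww_write(reg, ts, value):
    if ts > reg["ts"]:
        reg["ts"], reg["value"] = ts, value

reg = {"ts": 0, "value": None}
# Two clients concurrently write their own view of a shared list:
lww_write(reg, ts=1000, value=["apples"])    # client A
lww_write(reg, ts=999,  value=["bananas"])   # client B, clock slightly behind
print(reg["value"])  # ['apples'] -- B's acknowledged write is gone

# Siblings instead: keep both values, merge on read. No acknowledged
# write is lost, at the cost of defining a merge function.
siblings = [["apples"], ["bananas"]]
merged = sorted({item for s in siblings for item in s})
print(merged)  # ['apples', 'bananas']
```

The sibling path still needs a commutative, associative, idempotent merge (set union here), which is the point of the comment above: neither strategy lets you skip defining one.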


Agreed in general.

I think the statement that "you will lose data" is a bit simplistic. Given enough chances, all systems will lose data. One could make a pretty strong argument that the default LWW approach used by Riak and Voldemort is quantifiably far less safe than the default approach in Cassandra, which works more like an LWW-Element-Set. I know this is changing with the CRDT work in Riak 2.0, which is very exciting.

There ARE a large number of use cases, many of which are driving the demand for scale-out distributed DBs, where data IS immutable, with a requirement for ordered traversal over subsets of the data. The key/value+vclock approaches that I've seen make this either very difficult or very slow.


Sorry, I was speaking loosely. More formally:

In a system which uses LWW as the conflict resolution strategy, there exist no circumstances under which you can guarantee that a value written to a given key will be causally connected to any future state of the system, unless all values written to that key are identical, or a strong external coordinator (e.g. Zookeeper) orders timestamps.

If you have siblings and vclocks, you can recover that causal connection guarantee for arbitrary write patterns--at least over CRDTs. Since Cassandra did not (until today) offer transactional isolation for any type of multi-cell update, this means that--and we're speaking strictly in terms of safety here, not performance--Riak and Voldemort's consistency models were, prior to 2.0, a strict superset of Cassandra's. For instance, you can guarantee the visibility and transactional isolation of a write making multiple changes to a Riak object; I'm reasonably confident that you cannot achieve those guarantees in, say, a Cassandra collection without a Paxos transaction.

You can certainly emulate Riak's consistency model by storing a distinct object for every write, and this is, as I understand it, what many Cassandra users do. The difference is in space consumption. Consider making four updates to an object. In Cassandra, you could write each update to a separate cell. In Riak, you might write them all to the same key:

    Cassandra    Riak
    [update1]    [update1|update2|update3|update4]
    [update2]
    [update3]
    [update4]
To read from both Cassandra and Riak you need a merge function. Since neither provides ordering constraints, our merge must be associative, commutative, and idempotent in both cases.

    Cassandra    Riak
    [update1]+   [update1|update2|update3|update4]
    [update2]+      |        |       |       |
    [update3]+      +--------+-------+-------+
    [update4]+                  |
             |                  |
             V                  V
     [current value]    [current value]
The difference is in space. Vector clocks allow you to prune the causal history, meaning we can write back [current value], and as soon as a node sees that write, it can discard updates 1-4. In Cassandra, there is no causality tracking: you have to figure out how to do GC yourself, or punt.

    Cassandra    Riak
    [update1]    [merged value|update5]
    [update2]
    [update3]
    [update4]
    [update5]
You can see how unbounded space might be a problem. From my conversations with DataStax, it sounds like users tend to write reducers which apply their merge function to compact some portion of the history. Which portion? Well, without causality tracking we'll leave that as an exercise to the reader.

    Cassandra      Riak
    [update1-4]    [merged value|update5]
    [update5]
Does this look familiar? Yeah. It's the same concurrency model as the vector clocks this post is arguing against. You just have to do more work.
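A sketch of the user-side reducer compaction just described, under the assumption (stated above) that it's the application's job to pick which prefix of the log is safe to fold; `compact` and `merge_fields` are hypothetical names.

```python
# Fold a prefix of the logged updates into one merged cell, then drop
# the originals. Without causality tracking, choosing a safe prefix
# (one no in-flight writer will still touch) is left to the application.
def compact(log, merge, upto):
    """Replace log[:upto] with a single merged entry."""
    if upto <= 1:
        return log
    merged = log[0]
    for update in log[1:upto]:
        merged = merge(merged, update)
    return [merged] + log[upto:]

# Example merge: updates are dicts of changed fields; later entries in
# the fold win per field.
def merge_fields(a, b):
    out = dict(a)
    out.update(b)
    return out

log = [{"name": "kyle"}, {"email": "k@x"}, {"name": "aphyr"}, {"age": 30}]
log = compact(log, merge_fields, upto=3)
print(log)  # [{'name': 'aphyr', 'email': 'k@x'}, {'age': 30}]
```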

Now, there are all sorts of practical efficiency constraints at play! For instance, Riak has ~50-100 bytes of overhead per key, and will start barfing if you go over 10 megabytes per key or so. And without being able to call list-keys, you wind up having to play all kinds of games with predictable keys, splitting datasets between multiple objects, and so on. Cassandra's IO throughput generally seems much higher than Riak's, and Cassandra has a much more efficient representation for wide values. It also offers better key ranges--but you also pay a per-cell overhead for every atomic chunk of state. Not so efficient if you were looking to store, say, big blocks of integers for your CRDTs.

The great thing is--again speaking purely in terms of consistency--Cassandra 2.0 is now capable of a superset of Riak's operations! If correctly implemented, their Paxos operations support linearizable reads and writes, which is a way stronger class of consistency than the CRDT operations described above. I don't understand why jbellis is so upset when folks point out that LWW provides weak safety constraints--when their strongly-consistent operations now offer the highest level of transactional safety. Seems like we should be celebrating that achievement, because it opens up large classes of operations which were previously unsafe. :)


+1

I don't think he's upset about LWW being characterized as a weak safety constraint, but about the perception that what Cassandra provides is equivalent to per-key LWW. While it doesn't completely eliminate the chance of data loss caused by conflicts, breaking a complex data structure into atoms that resolve independently vastly improves the average and P99 (and probably many more 9s) case. The argument being made is that while not as correct as vclock+sibling resolution, this is within the threshold many real-life use cases are willing to tolerate.

The other thing I think is mischaracterized is that the choice to use timestamps over vector clocks was done out of ignorance or that there is nothing gained. This was a conscious choice and made with the trade-off of performance in mind. We should strive for the largest amount of correctness given the constraints of performance and/or availability. While the CAS operations in C* 2.0 are useful, they sacrifice a lot on those fronts to gain that correctness. Systems that needlessly trade correctness without returning serious dividends (I'm sure we can all name a few) add no value.


Good summary; thanks!


> Since Cassandra did not (until today) offer transactional isolation for any type of multi-cell update

I guess it depends on what you mean by "transactional isolation" and "multi-cell update". Certainly there is nothing like ACID, but a single multi-cell update to a given record is guaranteed to be _atomic_, and if you have two concurrent multi-cell updates to a single record, they are guaranteed to eventually resolve to a consistent ordering of those operations (though without a strong clock/timestamp it is non-deterministic from the callers' POV).

For a wide variety of use cases, that is actually a more accurate reflection of how reality works than the traditional ACID model.

> but you also pay a per-cell overhead for every atomic chunk of state. Not so efficient if you were looking to store, say, big blocks of integers for your CRDTs

The theory goes that compression tends to wipe out much of that inefficiency, and of course if your columns are sparsely populated it is actually more efficient. I'm sure that isn't always true, but I'd bet it is far more of a trivial side issue than one might think.


...a single multi-cell update to a given record is guaranteed to be _atomic_, and if you have two concurrent multi-cell updates to a single record, they are guaranteed to eventually resolve to a consistent ordering of those operations (though without a strong clock/timestamp it is non-deterministic from the callers' POV).

I disagree. https://gist.github.com/aphyr/6402464


Okay, you do raise a good point about what happens if the timestamps happen to be precisely identical. In most of the scenarios I've had where the precise same timestamp was at all likely, the updates would also have been identical. If you want to have overlapping cells resolving highly concurrent writes (without even using wide rows to make precisely concurrent writes go to different cells anyway), Cassandra is probably not the right tool for you.

Of course, if that were considered a likely scenario (generally, microsecond collisions at the row level are only probable if you have high concurrency on a record), you have a number of paths open for resolving it. The one I've usually ended up with is that the two concurrent updates actually should be to two different records ANYWAY (usually you add a client ID to the key, for example) because you want to have a record of them which is later resolved when any partitioning issues are addressed. So you write with ANY consistency to a log, have sloppy real-time reads at consistency ONE, and then have another process which does ALL consistency reads on the log and resolves any conflicts using application logic, before writing with QUORUM consistency to the "source of truth".

Alternatively, you can simply provide a client-generated timestamp with a different scale/resolution, with the lower-order bits being truly random values. For example, if you have that kind of high concurrency, you probably don't need to handle a range of timestamps beyond ~50 days. You can then use a client-generated timestamp which combines 32 high-order bits of milliseconds since the epoch with a random 32-bit value for the low-order bits, which makes the odds of a collision on the timestamp pretty slim even for highly concurrent cases.
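A minimal sketch of that composite-timestamp scheme, assuming an application-chosen epoch (the constant below is invented for illustration). 32 bits of milliseconds wrap after roughly 49.7 days, which is the range trade-off mentioned above.

```python
import random
import time

# Hypothetical application epoch; any recent fixed point works.
EPOCH_MS = 1378252800000  # 2013-09-04 00:00:00 UTC, for illustration

def composite_timestamp(now_ms=None):
    """32 high bits: ms since EPOCH_MS (wraps ~49.7 days);
    32 low bits: random, to break ties between concurrent writers."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    millis = (now_ms - EPOCH_MS) & 0xFFFFFFFF
    return (millis << 32) | random.getrandbits(32)

# Two writes in the same millisecond share the high bits but (with
# probability 1 - 2**-32) differ in the low bits, so LWW still picks
# a single winner instead of colliding.
t1 = composite_timestamp(now_ms=EPOCH_MS + 12345)
t2 = composite_timestamp(now_ms=EPOCH_MS + 12345)
print(t1 >> 32 == t2 >> 32)  # True: same millisecond component
```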

I'm curious about the use case where you'd have all the concurrency with different but overlapping values, but you'd not want to record them separately and then have some custom app logic for resolving them.


if that were considered a likely scenario

When timestamps are selected by the Cassandra nodes, I can replicate this failure in 2% to 5% of writes. When timestamps collide, I can replicate this failure in 99.9% of writes. Given that the whole point of isolation is to provide invariants during concurrent modification, it doesn't make any sense to claim that a write is transactionally isolated only insofar as it is not concurrent with other writes.


> I can replicate this failure in 2% to 5% of writes.

Yeah, I'm curious about how you achieved those numbers.

Your test that gets that 2-5% of writes (though your docs say 7.5%) messed up... what it is really measuring is the probability that, out of 5 concurrent clients writing to 4 servers, at least two will finish writing to a row with the exact same timestamp, AND that they will be the LAST ones to write to that row. If just one of those clients ends up just a hair behind the other four, then you should register 0 collisions.

What is even weirder is your benchmark takes 100 seconds to complete what amounts to 5000 writes, or averaging a rate of 50 writes per second, 10 writes per client per second. Those are pathetic numbers for a one node Cassandra cluster, let alone a four node one. WTF is going on here?

Even more confusing, you are writing with ANY consistency, which means that in many cases those writes will be stamped and committed on different nodes, yet somehow getting the same timestamp. Odds on this seem... highly suspect. It almost seems like your clock only has 1 second resolution, which is weird. Have you checked the writetime timestamps on your records?

I've done writes at much higher rates where we recorded the timestamps of every single write operation. We've yet to get the same timestamp on two operations.

I also see that you are writing with consistency ANY, yet are still somehow getting timeouts with this operation. That really screams to me that the cluster is truly messed up.

Now, as you say, if you control the timestamps, you get collisions 99.9% of the time. I don't even get why it isn't just straight up 100% for that case.

> Given that the whole point of isolation is to provide invariants during concurrent modification

I think it is fair to say that you don't have transaction isolation if the timestamps are exactly the same. That is just an exceedingly low probability event unless you have a LOT of transactions per second.

I'd dump the "writetime(a), writetime(b)" values to get an idea of what is going on there.... something smells and there is a lot less cardinality in those timestamps than I'd expect.


Yeah, I'm curious about how you achieved those numbers.

Jepsen's code is open source (http://github.com/aphyr/jepsen) and I've written extensively about the techniques involved; see http://aphyr.com/tags/jepsen for details. Cassandra work is upcoming; no formal writeup yet.

Your test that gets that 2-5% of writes (though your docs say 7.5%) to be messed up...

Sorry, 2-5% was my mistake. Been playing around with the parameter space; numbers vary a bit.

what is really is measuring is the probability that out of 5 concurrent clients writing to 4 servers, at least two will finish writing to a row with the exact same timestamp... AND that they will be the LAST ones to write to that row. If just one of those clients ends up just a hair behind the other four, then you should register 0 collisions.

In a Dynamo system, a.) there is no such thing as "time", b.) there is no such thing as "last", and c.) causality tracking beats everything. Doesn't matter what order you do the writes in; timestamps (and/or vclocks in Voldemort/Riak) take precedence.

What is even weirder is your benchmark takes 100 seconds to complete what amounts to 5000 writes, or averaging a rate of 50 writes per second, 10 writes per client per second. Those are pathetic numbers for a one node Cassandra cluster, let alone a four node one. WTF is going on here?

Each client in the Jepsen test harness is (independently) scheduling n writes per second. Jepsen schedules its writes this way to a.) avoid measuring an overloaded system, b.) produce results which are somewhat comparable between runs, and c.) measure results over changing underlying dynamics--in this case, a network partition.

Even more confusing, you are writing with ANY consistency, which means that in many cases those writes will be stamped and committed on different nodes, yet somehow getting the same timestamp. Odds on this seem... highly suspect. It almost seems like your clock only has 1 second resolution, which is weird.

There's an interesting probability anecdote called the Birthday Paradox, which says that if you get 30 people in a room, chances are good that 2 will share the same birthday. At ten uniformly distributed writes a second, the probability of a timestamp collision is 0.44%... in any given second. Chances of a collision after a thousand seconds of runtime are 99.9999%. If you push 100 writes per second, collision probability is 50% in any second. If you push only 2 writes every second, you should expect to see a collision once every few days. How long-lived is that collision? It depends on the distribution of writes over time, and on the network, but you can work out a mean free path.

TL;DR: microsecond timestamps do not provide sufficient entropy for uniqueness constraints over common workloads.
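For anyone who wants to check these figures themselves, the standard uniform-arrival approximation of the birthday bound is easy to compute. Actual collision odds for a real workload depend on timestamp resolution and how bursty the writes are; this is only the uniform sketch.

```python
import math

def collision_probability(n_writes, n_slots):
    """Approximate P(at least one collision) when n_writes land
    uniformly at random in n_slots distinct timestamp slots:
    1 - exp(-n(n-1) / (2N))."""
    return 1 - math.exp(-n_writes * (n_writes - 1) / (2 * n_slots))

# Sanity check against the classic birthday problem:
# 23 people, 365 days -> about a 50% chance of a shared birthday.
print(round(collision_probability(23, 365), 2))  # 0.5

# Microsecond slots in one second, at various write rates:
for rate in (10, 100, 1000):
    print(rate, f"{collision_probability(rate, 1_000_000):.4%}")
```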

I also see that you are writing with consistency ANY, yet are still somehow getting timeouts with this operation. That really screams to me that the cluster is truly messed up.

The timeouts in this case are, I think, a Cassandra bug (or expected behavior) when partitions occur. Last I heard from jbellis, it wasn't clear what Cassandra should do under these conditions, but I think he was leaning towards allowing the local hint to count as success always.

Now, as you say, if you control the timestamps, you get collisions 99.9% of the time. I don't even get why it isn't just straight up 100% for that case.

The reason not all writes result in conflict with identical timestamps is, I suspect, due to that transitional period during the beginning of the network partition.


> Each client in the Jepsen test harness is (independently) scheduling n writes per second.

Umm... that's kind of an important detail. I'm betting that your mechanism for achieving this effectively synchronizes your clients to act in concert, or at least as close to "in concert" as is possible for your clock to measure. That explains the probability.

> There's an interesting probability anecdote called the Birthday Paradox

Yeah, I thought of the Birthday Paradox with this problem, but this is a different variant. The probability that two people in the room have the same birthday and no one else in the room has a birthday later in the year follows different probabilities.

Try writing a program that spawns 5000 threads and has them get the current time in microseconds, then write it to a file. You won't have any collisions unless you do some kind of precise coordination between them. In fact, you likely only have a shot at getting the same timestamp if you call from different threads, because just executing the instructions to read the current time takes long enough that two calls in a row will get different values.
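A scaled-down version of that experiment is easy to run (500 threads rather than 5000, to keep the sketch light; the shape is the same). Results will vary by interpreter, clock resolution, and scheduler, and note it only exercises a single machine's clock; the distributed case adds inter-node clock skew, which is what drives the collision rates being debated above.

```python
import threading
import time

N = 500
samples = []
lock = threading.Lock()

def record():
    # Microsecond-resolution wall-clock sample.
    ts = time.time_ns() // 1000
    with lock:
        samples.append(ts)

threads = [threading.Thread(target=record) for _ in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

collisions = len(samples) - len(set(samples))
print(f"{N} samples, {collisions} duplicate timestamps")
```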

> TL;DR: microsecond timestamps do not provide sufficient entropy for uniqueness constraints over common workloads.

See, that's the part I have a problem with, because I've had quite the opposite experience (without even having Cassandra involved).


Like I said, mean free paths will vary depending on your write profile. None of this alters my original assertion, which is that row operations are not isolated.


You've got a corner case that is way harder to hit than your measurements suggest, and the scenarios where it may happen would almost certainly not have the design required to cause it. Even so, it is addressable.

Based on my own experiences with this scenario, I'd be surprised if you managed to experience any problems if you turned off throttling (and didn't force everything to the same timestamp).

So yeah, you have a scenario that can happen, and I'd recommend anyone who absolutely cannot have that happen either not use Cassandra or design their schema accordingly. Absent that scenario though, row operations are isolated.


You're presuming that all reads occur after the system has quiesced. This is not always the case. I'm happy your write pattern works for you, and that you measure your consistency; I'm just trying to keep folks honest about what their systems actually provide; and give them tools to analyze those constraints.


> You're presuming that all reads occur after the system has quiesced.

It sure looked like they did, but maybe I misread the code.

> This is not always the case.

Even if that weren't the case, you might check out the probabilities on your birthday problem. What you've got is effectively a calendar with 100 million days (microseconds in the benchmark's 100 seconds) and 5 people (5 writes to the same record). You've managed to end up with those 5 people sharing a birthday well over 1% of the time.

> I'm just trying to keep folks honest about what their systems actually provide

I appreciate that you've found an interesting corner case that I'd not considered.

Actually, it's not that I hadn't considered it. When creating client side timestamps I do tend to think about this scenario, but with server side timestamps, I tend to think of it much like I think of collisions on Type 1 UUID's, but in truth the probabilities are higher.

I do think Cassandra ought to consider either a) using Type 1 UUIDs or similar strategies to make the risk of these collisions all but ridiculous, or b) resolving ties in the timestamps on cells by choosing a winner based on the node, thread, or something else tied to the update operation, rather than the value in the cell. That would avoid this scenario in a fashion more fitting with the rest of the semantics of the system.

> give them tools to analyze those constraints

I think unfortunately in this case the analysis of those constraints is flawed.


Finding corner cases is the point of Jepsen :)


> Finding corner cases is the point of Jepsen :)

And with that it obviously did a great job. The probabilities of finding those corner cases is unfortunately completely misrepresented.

I'd worry though that this distortion of the probabilities might mean it also doesn't find other kinds of corner cases.


bq. you are writing with consistency ANY, yet are still somehow getting timeouts with this operation. That really screams to me that the cluster is truly messed up.

This was an issue with atomic batches and CL.ANY that we've fixed since the test.

https://issues.apache.org/jira/browse/CASSANDRA-5967


> Given enough chances, all systems will lose data.

Well, in this case it seems it is not chance but bad architectural decisions. Or, say, bad default options for Riak.

Saying all systems lose data given enough chances is like saying all people will eventually die, so why not just stop wearing seat belts and stop going to the doctor when you are sick?

> There ARE a large number of use cases, many of which are driving the demand for scale-out distributed DBs

That is true. This sounds like Datomic to me? What are you thinking about?


The whole point is that "conflicting" updates to a single column is supposed to cause overwrites ("data loss"). If I wanted to keep multiple values in a column around I'd use a Map or a Set instead!

Maybe the disconnect is that in Riak, you have to fetch the existing document before modifying it anyway, so there is a lot of "update-based-on-existing-document" code around. Cassandra is designed to encourage "blind," idempotent updates instead.


If I wanted to keep multiple values in a column around I'd use a Map or a Set instead!

[edit: thanks iamalesky, correction on cell keys]

If you're following along at home, Lists are implemented by choosing a UUID cell key for each distinct element of the collection; Maps and Sets use the key and value, respectively. The conflict resolution function applied at read time is, for Sets, set union, plus tombstone elements for deletion. Depending on the resolution strategy used, similar consistency anomalies to LWW may be present; e.g. clock skew allows for various possible outcomes of logically concurrent mutation of a collection.
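A toy model of that read-time resolution, with made-up replica states: each element carries a timestamp, deletions are timestamped tombstones, and merge is per-element last-write-wins. It also shows the clock-skew anomaly mentioned above, where a "later" add can lose to an "earlier" delete.

```python
def merge_sets(*replicas):
    """Each replica: {element: (timestamp, alive?)}.
    Per-element LWW union across replicas."""
    merged = {}
    for replica in replicas:
        for elem, (ts, alive) in replica.items():
            if elem not in merged or ts > merged[elem][0]:
                merged[elem] = (ts, alive)
    return merged

def live_elements(state):
    return {e for e, (ts, alive) in state.items() if alive}

r1 = {"a": (1, True), "b": (3, False)}  # b deleted at t=3 (skewed clock?)
r2 = {"a": (1, True), "b": (2, True)}   # b added at t=2
# The add of b loses to the delete, purely on timestamp order:
print(sorted(live_elements(merge_sets(r1, r2))))  # ['a']
```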


In case you are talking about Cassandra collections (maps and sets), and not abstract maps/sets, then you are wrong. Only CQL3 Lists work like that - there are no UUIDs anywhere in CQL3 maps/sets implementations.


Doesn't this assume that both writers want their value for the column to win, instead of one of them possibly deciding that it shouldn't write it's value if one already exists for the field?


Correct. But in that case vector clocks do not help you either [unless you are okay with both writers thinking they have "won" for an unbounded period of time]; what you really need is true linearizability: http://www.datastax.com/dev/blog/lightweight-transactions-in...


> Usually you end up having to store the entire version history, which Cassandra supports by clustering with uuids."

You need the history to be able to reduce the entropy by resolving and eliminating conflicts, unless CRDTs are used (and I think Riak supports CRDT counters). Otherwise merging is custom to an application: merging two {X,Y} position updates is not the same as merging two shopping cart updates, or the same as merging an address-and-email tuple. Sounds like Cassandra is just sweeping the problem under the rug.


Cassandra counters once were implemented as a blazing-fast vector-clock patch [1,2] that did not involve read-before-write on update. The biggest (IMO) downside is that it did not support decrements. So Cassandra tech leadership decided [2] to implement counters as "read, increment, write." However, that came with two major downsides: 1) counter updates were now no longer replayable--so if you experience an update timeout on the Cassandra side under heavy load or due to a lengthy compaction or some network fluke, you just can't replay it. You need to hope that it has made it over. Counters run on hope. But you can decrement them now. 2) read-before-write can severely limit performance in the case of a large number of counter writes (as the workload becomes much more read-intensive, unlike regular blazing-fast Cassandra writes).

1 - https://issues.apache.org/jira/browse/CASSANDRA-580 2 - https://issues.apache.org/jira/browse/CASSANDRA-1072

However, in many, many use cases for counters, you just don't need to decrement. And if you need to decrement, you can often get away with two counters that you subtract from each other. So it's a shame, as restricting counters to be monotonically increasing would have been just the right trade-off for many users.
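The "two counters, subtract" workaround is essentially the PN-counter idea: keep separate grow-only "increments" and "decrements" counters and report their difference. Each half only ever grows, so increment operations stay replayable. A minimal single-node sketch (class and method names invented here):

```python
class PNCounter:
    """Two grow-only counters; value is their difference."""
    def __init__(self):
        self.incs = 0  # stands in for a replayable increment-only counter
        self.decs = 0  # a second increment-only counter

    def increment(self, n=1):
        self.incs += n

    def decrement(self, n=1):
        self.decs += n  # a decrement is an increment of the other half

    def value(self):
        return self.incs - self.decs

c = PNCounter()
c.increment(5)
c.decrement(2)
print(c.value())  # 3
```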


I think you've misread the 580 patch. By definition, you need to read the existing vector clock value before you can issue an update against it.

/Member of the presumably incompetent Cassandra tech leadership


Based on the algorithm described in the link here: http://www.slideshare.net/kakugawa/distributed-counters-in-c... -- the insert path does not perform any reads, unless the column is already in the memtable (so no disk read). Am I missing something?


That is describing the 1072 approach aka the one in Cassandra today. Incrementing without read-before-write was one of the reasons Twitter liked it.

Unfortunately, while you can increment on a single node without read-before-write, you need to merge it with the existing value before you can replicate to the others. Read-before-replicate, if you will.

There's a more recent, quite long discussion at https://issues.apache.org/jira/browse/CASSANDRA-4775 that boils down to, "this is actually the best we can do."


Hypertable has crazy fast increments/decrements (without reads). If the column is in memory, it is updated on the spot; else multiple values are inserted and merged on read or compaction. But it is a little different from Cassandra, more similar to HBase.


That's great, but what if you want vector clocks in Cassandra? For example, storing opaque blobs encrypted by a remote client. You might want to store both versions. Does Cassandra offer the ability to store lists of primitive data types and atomically insert/remove elements? This would allow clients to atomically append and then (with less need for guaranteed ordering) asynchronously attempt to remove the ancestor record. Even if the removal fails, future gets would pull the list and again attempt to synchronize to a single record.

I'd suppose that under very high contention you could have some problems, but at least you'd be able to implement vector clocks as a library without touching the Cassandra core.

Apologies if my question is common knowledge. And for anyone who is well versed in other key-value stores: do any of them support vector clocks, or something like the above (list elements with atomic append/prepend)?


If you want to manage opaque blobs then you should probably use Riak or Voldemort. Cassandra gives you a set of one-size-fits-most tools that by and large work very well together, but if you want more of a build-your-own-database toolkit, it is a bad fit.


I don't see what's so scary about vector clocks. I have no trouble with them in Riak. Sure, there's a little extra logic to deal with certain types of siblings, but honestly I haven't encountered any difficult edge cases.

But then again, maybe it's highly dependent on what you use Riak for. Anyone out there have any type of application they've built with Riak that did encounter issues with using vector clocks?
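For concreteness, the "little extra logic" is mostly a partial-order comparison between clocks. A rough sketch of what sibling resolution has to check (clocks as dicts of node -> counter, not Riak's actual vclock encoding):

```python
def descends(a, b):
    """True if clock a has seen everything clock b has (a is a successor)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a, b):
    """Neither clock descends the other -> conflicting siblings to resolve."""
    return not descends(a, b) and not descends(b, a)


ancestor = {"node1": 1}
left     = {"node1": 2}              # node1 updated after the ancestor
right    = {"node1": 1, "node2": 1}  # node2 updated the same ancestor

assert descends(left, ancestor)      # plain causal successor: keep `left`
assert concurrent(left, right)       # true conflict: app must merge siblings
```

If `descends` holds in either direction you can silently discard the older version; only genuinely concurrent writes surface as siblings.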


It's worth noting that HBase has the same method of using LWW on column-level updates. While this is usually what you want, and like Cassandra it gives you the ability to do fast blind writes, there are times when you need to make sure you aren't having conflicting writes.

The solution that HBase employs is checkAndPut functionality. Basically, this lets you write a value and only successfully save the update if a given column has the value you expect.

So, for example, you could have a "records" table with a column called "check". Whenever you update a record, you pull the old one, do whatever processing you want on it, set the "check" column to a new UUID, and then save it with a checkAndPut where you specify that the "check" column in HBase has to be the old check value you read. If any other process wrote to this row, it would have updated the check value with a new UUID, so this checkAndPut will fail, thereby detecting the conflict. Now you can re-pull the row and handle conflict resolution without blindly overwriting the changes.
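The pattern is plain optimistic concurrency control. Here's a sketch of the retry loop in Python, against a hypothetical in-memory store whose compare-and-set mimics checkAndPut semantics (this is not the HBase client API):

```python
import uuid

class Store:
    """In-memory stand-in for a table with checkAndPut-style semantics."""
    def __init__(self):
        self.rows = {}

    def get(self, key):
        return self.rows.get(key)

    def check_and_put(self, key, expected_check, row):
        """Write only if the stored 'check' column matches what we read."""
        current = self.rows.get(key)
        current_check = current["check"] if current else None
        if current_check != expected_check:
            return False  # someone else wrote in between: conflict detected
        self.rows[key] = row
        return True


def update(store, key, fn, retries=10):
    for _ in range(retries):
        old = store.get(key) or {"check": None}
        new = fn(dict(old))
        new["check"] = str(uuid.uuid4())  # fresh token for this write
        if store.check_and_put(key, old.get("check"), new):
            return new
        # conflict: re-pull the row and try again
    raise RuntimeError("too much contention")


s = Store()
update(s, "records/1", lambda r: {**r, "count": r.get("count", 0) + 1})
update(s, "records/1", lambda r: {**r, "count": r.get("count", 0) + 1})
assert s.get("records/1")["count"] == 2
```

As you note, under heavy contention the loop can spin, but no update ever blindly overwrites another.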


Cassandra exposes functionality similar to checkAndPut as Lightweight Transactions: www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
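The check-column pattern maps onto a CQL conditional update (the table, columns, and UUID values here are illustrative, not from the post):

```sql
-- Applied only if the stored token still matches the one we read;
-- otherwise the result row reports [applied] = false and nothing is written.
UPDATE records
SET data = 'processed',
    check_token = 2f4c2a9e-1f7d-4c3b-9a10-8e4d5b6c7a00
WHERE id = 42
IF check_token = 7d0a1b2c-3d4e-5f60-7182-93a4b5c6d7e8;
```

Under the hood each conditional write runs a Paxos round, so it costs noticeably more than a plain write.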


Ah, good to know that functionality was added. I haven't used Cassandra since the 0.6.x days so my knowledge of what's possible there is rusty.


This isn't anything dramatically new. Essentially, there is a tension between data normalization and requirements for strong consistency.

If my data is more segmented, I can afford to be less stringent about it. I think that forgoing vector clocks entirely in a distributed database, though, is a recipe for disaster, no matter how denormalized my data is.


Riak/C* noob here. Couldn't I just store data as column cells in Riak?

    bucket: 'users'
    'jbellis/email', 'jbellis@example.com'
    'jbellis/phone', '555-5555'
What would I lose by doing this?


Performance, mostly. To read both the email and the phone you'll have to make two requests (or a single multiget). The keys have a good chance of belonging to different replicas, too, and even if they don't, you are still going to have 2x disk reads.


You're also giving up atomicity on writes when you actually do want to update multiple fields at once.


I believe that is precisely what Riak does in the "Riak now does this" link provided, and that's effectively what is going on under the hood with Cassandra. This means more overhead per column and the loss of the ability to just encode the data fields in an application native serialization structure.

In effect, you throw out a lot of the less talked about advantages of NoSQL and end up with something more like traditional SQL databases (not necessarily a bad thing, just different).


It's not how Riak does this, and it's not what's going on under the hood with Cassandra, either.


Riak has something pretty close to this, and if your table in Cassandra looked like:

create table bucket.users ( id uuid, username text, email text, phone text, primary key (id, username) );

That would be exactly how the data was encoded.


The difference is that all the cells with the same `id` would belong to the same partition and stored together, so you'd be able to write them atomically and read them together cheaply in a single operation.

What marshray suggested would look like this:

create table bucket.users (name text, field text, value text, primary key ((name, field))); - with a composite partition key. Then you'd have two separate single-cell partitions "jbellis:email" : "jbellis@example.com" and "jbellis:phone" : "555-5555". You don't want to do that in either Riak or Cassandra.


> What marshray suggested would look like this: create table bucket.users (name text, field text, value text, primary key ((name, field))); - with a composite partition key. Then you'd have two separate single-cell partitions "jbellis:email" : "jbellis@example.com" and "jbellis:phone" : "555-5555". You don't want to do that in either Riak or Cassandra.

Ah I see what you mean. I thought the intent was to group by some unique ID for the user.


> you'd be able to write them atomically

To clarify, this is only the case if you use Cassandra 2.0 transactions; normal batched writes are not atomic in the sense you're probably thinking.


They are atomic, they are not isolated. Those are two different properties. The "A" vs the "I" in ACID. http://en.wikipedia.org/wiki/ACID

Edit: Actually, they are isolated in the sense of ACID: no matter what order you apply the operations, the answer will be the same. But not isolated in the way you want isolation described.


The scenario that he's found is when the two timestamps are actually identical. In that scenario, Cassandra cannot maintain its atomicity guarantee.


No, they are not isolated in the sense of ACID. This behavior violates P0.


No batching here, just a single-partition, two-cell update. Atomic, but not isolated in the way you define isolation in your gist.


Actually, even in Cassandra 1.2, batch operations were atomic unless otherwise specified.


Avinash Lakshman made a decision early in Cassandra's life cycle not to use vector clocks, for performance reasons. You gain performance; you lose the ability to safely update an existing key concurrently.

This is a reasonable tradeoff.

Cassandra has since evolved around this limitation: use write-once or immutable data, transform updates into appending new columns rather than modifications, handle all of this automagically through CQL datatypes, etc.

Really cool stuff. A nice way to evolve a product around a limitation and still be immensely useful.

But, seriously, it's a tradeoff. Defending that choice against those who question Cassandra is fine. But, jbellis has a history of claiming Cassandra is a "one-size-fits-most tool", proclaiming that vector clocks aren't needed in 99% of cases, etc. That's opinion presented as fact. Without omniscience no one really knows what all users/the market actually needs, what 100% of all use cases look like, etc. Let's cut the rhetoric and realize engineering is about tradeoffs.

I still wonder if Cassandra won't someday add vector clocks, just like they eventually added vnodes.

In Riak land, we have vector clocks. Do we pay a performance hit? Yes. Must we? Maybe.

In theory, vector clocks should only be expensive when updating existing data. If you are writing data once (the standard Cassandra approach), vector clocks should be free. They're not currently, but that's an implementation detail that Riak can fix, and one we're considering fixing in the near future.

What about when you have multiple updates to the same key? You must pay a penalty there, right? Maybe. We're actively looking at approaches to reduce that penalty in the future as well. In summary: keep multiple versions of an object, just append a new version on write, read all versions on read, and roll up siblings / LWW-resolve. Basically identical to the "just append a column" approach Cassandra uses.
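That append-on-write / resolve-on-read shape can be sketched like this (illustrative Python, not Riak internals; resolution strategy is pluggable, LWW shown here):

```python
import time

class VersionedKV:
    """Append every write as a new version; resolve siblings on read."""
    def __init__(self, resolve):
        self.versions = {}    # key -> list of (timestamp, value) siblings
        self.resolve = resolve

    def put(self, key, value, ts=None):
        # Writes never read or overwrite: just append a new version.
        stamp = ts if ts is not None else time.time()
        self.versions.setdefault(key, []).append((stamp, value))

    def get(self, key):
        siblings = self.versions.get(key, [])
        if not siblings:
            return None
        winner = self.resolve(siblings)
        self.versions[key] = [winner]  # roll up: prune resolved siblings
        return winner[1]


# LWW resolution: keep the sibling with the highest timestamp.
lww = VersionedKV(resolve=lambda sibs: max(sibs, key=lambda tv: tv[0]))
lww.put("k", "old", ts=1)
lww.put("k", "new", ts=2)
assert lww.get("k") == "new"
assert len(lww.versions["k"]) == 1  # siblings rolled up after the read
```

The write path stays append-only and fast; the read path pays for the merge and prunes the version list, much like a read-triggered compaction.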

It's easy to extend Riak to support the same operations Cassandra does, with the same performance characteristics, while still supporting conflict detection for concurrent modifications. Best of both worlds. We'll likely do it at some point too, as that's what engineers do -- we make products better over time.

As an aside, a benefit of per-key conflict detection and siblings is that it doesn't require a sorted backend. The multi-version approach Cassandra uses requires sorted data to be efficient on reads.

While most users of Riak today use the LevelDB backend, which is a sorted backend similar to Cassandra's (both log-structured SSTable systems), Riak also supports the non-sorted Bitcask backend for folks who want a high-performance, purely K/V store. Bitcask is about 10x faster than LevelDB in many workloads, because it doesn't spend time sorting/rewriting data. Write amplification really hurts for certain workloads.

Yet another tradeoff. Engineering is always about those pesky tradeoffs.


I don't know much about CAP but it doesn't sound right to me.

Let's see: say you store a data tuple, for example a position {X,Y}, as two coordinates, X in one column and Y in another.

Now when it gets updated concurrently and the two conflicting positions {X1,Y1} and {X2,Y2} need to merge, you could instead end up with a non-existing/impossible/broken position {X1,Y2}.

Is that really that easily broken? Or am I not thinking about it straight.
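Concretely, here's the scenario I'm worried about, assuming each column carries its own timestamp and cells merge independently by last-write-wins (a toy model, not Cassandra code):

```python
def merge_cell(a, b):
    """Per-cell last-write-wins: each cell is a (timestamp, value) pair."""
    return a if a[0] >= b[0] else b

def merge_row(r1, r2):
    return {col: merge_cell(r1[col], r2[col]) for col in r1}


# Two concurrent position updates whose per-column timestamps interleave.
replica1 = {"x": (2, "X1"), "y": (1, "Y1")}
replica2 = {"x": (1, "X2"), "y": (2, "Y2")}

merged = merge_row(replica1, replica2)
assert merged == {"x": (2, "X1"), "y": (2, "Y2")}  # the mixed {X1,Y2}!
```

Each cell individually picked the "latest" value, yet the row as a whole is a position neither client ever wrote.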


It wasn't really mentioned in the article, but if you do both {X1,Y1} and {X2,Y2} as full-row updates, then no, you can't end up with a mixed row.

http://www.datastax.com/dev/blog/row-level-isolation


In Cassandra, the second update agreed upon by the cluster would "win", so you'd get either (X1,Y1) or (X2,Y2). Since there is no clock mechanism, there is no way for the cluster to see that the data has changed since the requester decided it wanted to update the value.


But didn't the article say that the last update to a column wins? To me it seems that if the X and Y values are in different columns, you could end up with invalid data.

Someone else (a sibling comment) pointed me to row-level isolation. That is what I was wondering about; I am looking at it at the moment.


Rows are not isolated. You might see (X1, Y1), (X1, Y2), (X2, Y1), or (X2, Y2) if timestamps happen to conflict. https://gist.github.com/aphyr/6402464


Oh wow, that is bad. I was assuming that guarantee of row isolation was valid.



