
> ...a single multi-cell update to a given record is guaranteed to be _atomic_, and if you have two concurrent multi-cell updates to a single record, they are guaranteed to eventually resolve to a consistent ordering of those operations (though without a strong clock/timestamp it is non-deterministic from the callers' POV).

I disagree. https://gist.github.com/aphyr/6402464




Okay, you do raise a good point about what happens if the timestamps happen to be precisely identical. In most of the scenarios I've had where the precise same timestamp was at all likely, the updates would also have been identical. If you want overlapping cells to resolve highly concurrent writes (without even using wide rows to make precisely concurrent writes go to different cells), Cassandra is probably not the right tool for you.

Of course, if that were considered a likely scenario (generally, microsecond collisions at the row level are only probable when you have high concurrency on a record), you have a number of paths open for resolving it. The one I've usually ended up with is that the two concurrent updates actually should be to two different records ANYWAY (usually you add a client ID to the key, for example), because you want a record of both that is later resolved once any partitioning issues are addressed: you write with ANY consistency to a log, serve sloppy real-time reads at consistency ONE, and then have another process do ALL-consistency reads on the log and resolve any conflicts using application logic before writing with QUORUM consistency to the "source of truth".
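
A minimal sketch of that log-and-resolve pattern in Python, using the DataStax driver; the keyspace, the table names, and the resolve() merge function are hypothetical placeholders, not anything from the actual system under discussion:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(['127.0.0.1']).connect('app')  # hypothetical keyspace

    def resolve(rows):
        # Placeholder: real conflict resolution is application-specific.
        return max(rows, key=lambda r: r.ts).payload

    record_id, client_id, payload = 'order-42', 'client-7', '{"qty": 3}'

    # 1. Each writer appends to its own partition (client ID in the key), so
    #    concurrent updates land in different records instead of colliding.
    log_write = SimpleStatement(
        "INSERT INTO update_log (record_id, client_id, ts, payload) "
        "VALUES (%s, %s, now(), %s)",
        consistency_level=ConsistencyLevel.ANY)
    session.execute(log_write, (record_id, client_id, payload))

    # 2. Sloppy real-time reads hit the log at ONE (not shown); a background
    #    resolver reads the whole log at ALL, merges conflicts, and writes
    #    the winner to the source of truth at QUORUM.
    log_read = SimpleStatement(
        "SELECT client_id, ts, payload FROM update_log WHERE record_id = %s",
        consistency_level=ConsistencyLevel.ALL)
    merged = resolve(session.execute(log_read, (record_id,)))
    truth_write = SimpleStatement(
        "INSERT INTO truth (record_id, payload) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM)
    session.execute(truth_write, (record_id, merged))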

Alternatively, you can simply provide a client-generated timestamp with a different scale/resolution, with the lower-order bits being truly random values. For example, if you have that kind of high concurrency, you probably don't need to handle a range of timestamps beyond ~50 days. You can then use a client-generated timestamp that combines 32 high-order bits of milliseconds since the epoch with a random 32-bit value for the low-order bits, which makes the odds of avoiding a timestamp collision pretty good even for highly concurrent cases.
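
A minimal sketch of that timestamp layout, using only the Python standard library (2^32 milliseconds is about 49.7 days, hence the ~50 day range); you would then supply this as the client-side write timestamp:

    import random
    import time

    def hybrid_timestamp() -> int:
        # High 32 bits: milliseconds since the epoch, truncated to 32 bits
        # (wraps roughly every 49.7 days).
        millis = int(time.time() * 1000) & 0xFFFFFFFF
        # Low 32 bits: random, so concurrent writers almost never collide.
        return (millis << 32) | random.getrandbits(32)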

I'm curious about the use case where you'd have all the concurrency with different but overlapping values, but you'd not want to record them separately and then have some custom app logic for resolving them.


> if that were considered a likely scenario

When timestamps are selected by the Cassandra nodes, I can replicate this failure in 2% to 5% of writes. When timestamps collide, I can replicate this failure in 99.9% of writes. Given that the whole point of isolation is to provide invariants during concurrent modification, it doesn't make any sense to claim that a write is transactionally isolated only insofar as it is not concurrent with other writes.


> I can replicate this failure in 2% to 5% of writes.

Yeah, I'm curious about how you achieved those numbers.

Your test that gets 2-5% of writes (though your docs say 7.5%) to be messed up... what it is really measuring is the probability that out of 5 concurrent clients writing to 4 servers, at least two will finish writing to a row with the exact same timestamp... AND that they will be the LAST ones to write to that row. If just one of those clients ends up just a hair behind the other four, then you should register 0 collisions.

What is even weirder is that your benchmark takes 100 seconds to complete what amounts to 5000 writes, averaging 50 writes per second, or 10 writes per client per second. Those are pathetic numbers for a one-node Cassandra cluster, let alone a four-node one. WTF is going on here?

Even more confusing, you are writing with ANY consistency, which means that in many cases those writes will be stamped and committed on different nodes, yet somehow get the same timestamp. The odds on this seem... highly suspect. It almost seems like your clock only has 1-second resolution, which is weird. Have you checked the writetime timestamps on your records?

I've done writes at much higher rates, recording the timestamps of every single write operation, and we've yet to see the same timestamp on two operations.

I also see you are somehow getting Cassandra timeouts while writing with consistency ANY. That really screams to me that the cluster is truly messed up.

Now, as you say, if you control the timestamps, you get collisions 99.9% of the time. I don't even get why it isn't just straight up 100% for that case.

> Given that the whole point of isolation is to provide invariants during concurrent modification

I think it is fair to say that you don't have transaction isolation if the timestamps are exactly the same. That is just an exceedingly low probability event unless you have a LOT of transactions per second.

I'd dump the "writetime(a), writetime(b)" values to get an idea of what is going on there... something smells, and there is a lot less cardinality in those timestamps than I'd expect.
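
Something along these lines would do it, with the caveat that the keyspace and table names here are guesses rather than the actual test schema:

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('jepsen')  # hypothetical keyspace
    for row in session.execute("SELECT id, writetime(a), writetime(b) FROM rows"):
        print(row)  # repeated values across cells => low timestamp cardinality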


> Yeah, I'm curious about how you achieved those numbers.

Jepsen's code is open source (http://github.com/aphyr/jepsen) and I've written extensively about the techniques involved; see http://aphyr.com/tags/jepsen for details. Cassandra work is upcoming; no formal writeup yet.

> Your test that gets 2-5% of writes (though your docs say 7.5%) to be messed up...

Sorry, 2-5% was my mistake. Been playing around with the parameter space; numbers vary a bit.

> what it is really measuring is the probability that out of 5 concurrent clients writing to 4 servers, at least two will finish writing to a row with the exact same timestamp... AND that they will be the LAST ones to write to that row. If just one of those clients ends up just a hair behind the other four, then you should register 0 collisions.

In a Dynamo system, a.) there is no such thing as "time", b.) there is no such thing as "last", and c.) causality tracking beats everything. Doesn't matter what order you do the writes in; timestamps (and/or vclocks in Voldemort/Riak) take precedence.

> What is even weirder is that your benchmark takes 100 seconds to complete what amounts to 5000 writes, averaging 50 writes per second, or 10 writes per client per second. Those are pathetic numbers for a one-node Cassandra cluster, let alone a four-node one. WTF is going on here?

Each client in the Jepsen test harness is (independently) scheduling n writes per second. Jepsen schedules its writes this way to a.) avoid measuring an overloaded system, b.) produce results which are somewhat comparable between runs, and c.) measure results over changing underlying dynamics--in this case, a network partition.

> Even more confusing, you are writing with ANY consistency, which means that in many cases those writes will be stamped and committed on different nodes, yet somehow get the same timestamp. The odds on this seem... highly suspect. It almost seems like your clock only has 1-second resolution, which is weird.

There's an interesting probability anecdote called the Birthday Paradox, which says that if you get 30 people in a room, chances are good that 2 will share the same birthday. At ten uniformly distributed writes a second, the probability of a timestamp collision is 0.44%... in any given second. Chances of a collision after a thousand seconds of runtime are 99.9999%. If you push 100 writes per second, collision probability is 50% in any second. If you push only 2 writes every second, you should expect to see a collision once every few days. How long-lived is that collision? It depends on the distribution of writes over time, and on the network, but you can work out a mean free path.

TL;DR: microsecond timestamps do not provide sufficient entropy for uniqueness constraints over common workloads.
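
For anyone who wants to check the arithmetic, a small Python script; slots is the number of distinct timestamp values available per second (the effective clock resolution), and since the figures above depend entirely on it, treat the values used here as assumptions to vary:

    def p_collision_per_second(writes, slots):
        # Birthday problem: P(at least two of `writes` uniform picks collide).
        p_distinct = 1.0
        for i in range(writes):
            p_distinct *= (slots - i) / slots
        return 1.0 - p_distinct

    def p_collision_over_run(writes, slots, seconds):
        # Treating seconds as independent trials over the whole run.
        return 1.0 - (1.0 - p_collision_per_second(writes, slots)) ** seconds

    print(p_collision_per_second(10, 10_000))      # ~0.45% in any given second
    print(p_collision_over_run(10, 10_000, 1000))  # ~98.9% over 1000 seconds
    print(p_collision_per_second(10, 1_000_000))   # ~0.0045% at true microseconds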

> I also see you are somehow getting Cassandra timeouts while writing with consistency ANY. That really screams to me that the cluster is truly messed up.

The timeouts in this case are, I think, a Cassandra bug (or expected behavior) when partitions occur. Last I heard from jbellis, it wasn't clear what Cassandra should do under these conditions, but I think he was leaning towards allowing the local hint to count as success always.

> Now, as you say, if you control the timestamps, you get collisions 99.9% of the time. I don't even get why it isn't just straight up 100% for that case.

The reason not all writes result in conflict with identical timestamps is, I suspect, due to that transitional period during the beginning of the network partition.


> Each client in the Jepsen test harness is (independently) scheduling n writes per second.

Umm... that's kind of an important detail. I'm betting that your mechanism for achieving this effectively synchronizes your clients to act in concert, or at least as close to "in concert" as is possible for your clock to measure. That explains the probability.

> There's an interesting probability anecdote called the Birthday Paradox

Yeah, I thought of the Birthday Paradox with this problem, but this is a different variant. The probability that two people in the room share the same birthday while no one else in the room has a later birthday in the year is a different quantity.

Try writing a program that spawns 5000 threads, has each get the current time in microseconds, and then write it to a file. You won't have any collisions unless you do some kind of precise coordination between them. In fact, you likely only have a shot at getting the same timestamp when calling from different threads, because just executing the instructions to read the current time takes long enough that two calls in a row will get different values.
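
A quick sketch of that experiment in Python; the Barrier is the "precise coordination" in question, so it's instructive to run it both with and without the wait() call (and note that 5000 OS threads is itself a heavyweight ask):

    import time
    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor
    from threading import Barrier

    N = 5000
    barrier = Barrier(N)

    def stamp(_):
        barrier.wait()                 # release all threads at once
        return time.time_ns() // 1000  # current time in microseconds

    with ThreadPoolExecutor(max_workers=N) as pool:
        stamps = list(pool.map(stamp, range(N)))

    dupes = sum(c - 1 for c in Counter(stamps).values() if c > 1)
    print(f"{dupes} duplicate timestamps out of {N}")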

> TL;DR: microsecond timestamps do not provide sufficient entropy for uniqueness constraints over common workloads.

See, that's the part I have a problem with, because I've had quite the opposite experience (without even having Cassandra involved).


Like I said, mean free paths will vary depending on your write profile. None of this alters my original assertion, which is that row operations are not isolated.


You've got a corner case that is way harder to hit than you think you've measured it to be, and the scenarios where it may happen would almost certainly not have the design required to cause it. Even so, it is addressable.

Based on my own experiences with this scenario, I'd be surprised if you managed to experience any problems if you turned off throttling (and didn't force everything to the same timestamp).

So yeah, you have a scenario that can happen, and I'd recommend anyone who absolutely cannot have that happen either not use Cassandra or design their schema accordingly. Absent that scenario though, row operations are isolated.


You're presuming that all reads occur after the system has quiesced. This is not always the case. I'm happy your write pattern works for you, and that you measure your consistency; I'm just trying to keep folks honest about what their systems actually provide, and to give them tools to analyze those constraints.


> You're presuming that all reads occur after the system has quiesced.

It sure looked like they did, but maybe I misread the code.

> This is not always the case.

Even if that weren't the case, you might check out the probabilities on your birthday problem. What you've got is effectively a calendar with 100 million days (microseconds in the benchmark's 100 seconds) and 5 people (5 writes to the same record). You've managed to end up with those 5 people sharing a birthday well over 1% of the time.
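
The back-of-the-envelope version of that calculation:

    from math import comb

    slots = 100_000_000           # microseconds in a 100-second run
    writes = 5                    # concurrent writers on one record
    p = comb(writes, 2) / slots   # approx. P(at least two share a microsecond)
    print(p)                      # 1e-07, i.e. 0.00001%, versus the >1% observed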

> I'm just trying to keep folks honest about what their systems actually provide

I appreciate that you've found an interesting corner case that I'd not considered.

Actually, it's not that I hadn't considered it. When creating client side timestamps I do tend to think about this scenario, but with server side timestamps, I tend to think of it much like I think of collisions on Type 1 UUID's, but in truth the probabilities are higher.

I do think Cassandra ought to consider either a) using Type 1 UUIDs or similar strategies to make the risk of these collisions vanishingly small, or b) resolving timestamp ties between cells by choosing a winner based on something tied to the update operation (a winning node or thread, say) rather than the value in the cell. That would avoid this scenario in a fashion more fitting with the rest of the semantics of the system.
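
For option a), the Python standard library makes the idea easy to demonstrate (Cassandra's timeuuid type is the analogous server-side construct):

    import uuid

    a, b = uuid.uuid1(), uuid.uuid1()
    # A type 1 UUID embeds a 60-bit, 100ns-resolution timestamp plus a clock
    # sequence and node ID, so "simultaneous" IDs are still distinct values.
    print(a.time, b.time, a != b)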

> give them tools to analyze those constraints

I think unfortunately in this case the analysis of those constraints is flawed.


Finding corner cases is the point of Jepsen :)


> Finding corner cases is the point of Jepsen :)

And with that it obviously did a great job. The probabilities of finding those corner cases are unfortunately completely misrepresented.

I'd worry though that this distortion of the probabilities might mean it also doesn't find other kinds of corner cases.


> I also see you are somehow getting Cassandra timeouts while writing with consistency ANY. That really screams to me that the cluster is truly messed up.

This was an issue with atomic batches and CL.ANY that we've fixed since the test.

https://issues.apache.org/jira/browse/CASSANDRA-5967



