Call me maybe: final thoughts

ChuckMcM · on May 20, 2013

So a long time ago, in a galaxy far far away, when the network was purportedly the computer, these questions kept me awake at night.

When Remote Procedure Calls were a new thing technologists got into an interesting debate about whether or not you could create a distributed infrastructure which had the same semantics as local execution. It was after all the holy grail to be able to write code and them seamlessly pull it apart into as many pieces as you wanted for scaling (caveat Amdahl's law).

There were great flame wars over something called "At most once" semantics, where a remote procedure call would either succeed or fail, and if it succeeded it would have done so exactly one time. And marshalling between systems, with "network canonical order" vs "receiver makes it right" order for bit fields, integers, floats, or strings.

I have a lot of scars from those days.

It left me with one really strong belief though. That was that to reason about distributed systems at all you had to be ruthlessly simple in your designs. You have to be able to know exactly what the lowermost piece of the puzzle is going to do before you can reason about what pieces that depend on that may or may not do. And at each layer every alternative cuts like a scalpel and make reasoning about the next layer that much harder.

rdtsc · on May 20, 2013

> So a long time ago, in a galaxy far far away, when the network was purportedly the computer, these questions kept me awake at night.

Interesting. That was a bit before "my time". It was just passed down as folk wisdom -- "never make the network transparent, it cannot be done easily" and never quite knew where that came from.

I remember asking a senior developer about RPC after seeing a a bunch of RPC daemons on a Sun machine, and he said, it is some really complicated stuff that you don't really want to know about and unless you absolutely need to use it. I was just a young intern then.

But there are a whole bunch of technologies from that time that try to abstract the network away (I can think of NFS, or I guess pretty much any remote storage that makes itself appear as a local file system).

Then I've seen RPC API philosophies that take 2 sides. Some magically proxy method calls and marshal data between object, others want users to always marshal by hand and send via the sockets.

> You have to be able to know exactly what the lowermost piece of the puzzle is going to do before you can reason about what pieces that depend on that may or may not do.

So it is possible to abstract any of that away? Maybe it is not a horizontal abstraction as in "hide the network away", but maybe a vertical abstraction, as in the creating socket-like objects that have extra features (like ZeroMQ). Connecting, sending, receiving is still there, but there is extra helping on top at each step.

On another side note. I think network topologies and characteristics changed. It used to be that creating consistent distributed systems was easiest even though networks speed were slower. Networks were more likely to be local in a data center. So chance of a network split was much lower. (And I've learned there are fewer more evil things in a general distributed system than a network split). Today's distributed systems are more likely so experience network splits (multi-zone, multi-datacenter clusters I guess are more common + unpredictability added by using VMs). So maybe paradoxically distributed system really got harder not easier.

ChuckMcM · on May 20, 2013

So it is possible to abstract any of that away? Maybe it is not a horizontal abstraction as in "hide the network away", but maybe a vertical abstraction, as in the creating socket-like objects that have extra features (like ZeroMQ). Connecting, sending, receiving is still there, but there is extra helping on top at each step.

Its possible to design bits of it away (and that becomes necessary) in order to build larger systems. Protobufs were, in a lot of ways, one response.

The key takeaway for me from that time, and going forward, was that distributed services are effectively messages in flight. The more messages in flight, the more combinations of arrival times, the fewer messages in flight, the less scaling. This became what is known as the CAP theorem. (Wonderfully articulated by Bowers, when I read it I said, "Yeah that!")

For things like RAID arrays, where you can related the state of the data in a RAID array mathematically with the data that you know, you can abstract it away to a pretty straight forward open/read/write interface. For things where the data set mutates along a path (imagine a state machine of 'n' states, where its 'path' is the sequence of states it is in between time 0 and time t) if correctness is a function of the path then you have a lot harder time of it. The canonical example of this is two writers to a file. The sequence Insert A -> Insert B -> Insert C -> Delete Last leaves behind a different file if it sees Insert A -> Insert B -> Delete Last -> Insert C.

Unwinding those sorts of things seems (or at least illuminating them) seems to be what Jepsen is trying to solve.

_pmf_ · on May 21, 2013

Damn these Byzantine generals!

voidmain · on May 20, 2013

I love that someone is actually testing database fault tolerance in public, instead of just assuming that anything with replication is fault tolerant...

Stateful distributed systems are really, really hard to get right. The only hope is to do a LOT of simulation. Not that testing is a substitute for designing things right, but no one is smart enough to be sure they've thought of all cases.

At FoundationDB we do this sort of testing millions of times over: http://foundationdb.com/white-papers/testing/

so that we can let people plug and unplug our servers at trade shows without fear (video): http://blog.foundationdb.com/post/50924738464/foundationdb-f...

and ultimately so that applications on top of our system face a somewhat less scary world.

shin_lao · on May 20, 2013

Nice to see you are serious about testing, however this PR page has got something that hits me as strange:

As a rough measure of the effectiveness of our simulation testing framework, Quicksand has only ever found one bug.

To me, this means your test is useless, not that your software is reliable.

Generally speaking, simulating hardware faults is not very useful because you're testing the software layer below you: you don't talk to the hardware, you talk to the kernel which talks to the hardware.

(Unless your filesystem and network code is in kernel land in which case I will retract this comment and go die in a dark corner of Earth. ;) )

voidmain · on May 20, 2013

Quicksand (the physical fault testing environment) doesn't find our bugs because our simulation-based testing finds them first!

Our deterministic simulations cover all of the failure cases that quicksand can test and many more. The main reason we built quicksand is precisely to find out if our assumptions about the layers below us are wrong, since the simulator can't test its own assumptions. For example, there are hardware/software combinations out there where fsync() can't be trusted...

malkia · on May 21, 2013

Can this piece of software be used to test other systems/db's in combination with OS?

jamwt · on May 21, 2013

I'm late to the party on this, but unless I'm misreading things... the author used Riak the wrong way and it failed, and then he used it the right way, and it succeeded.

Granted, as the author says, Basho tries to delay the necessity of "breaking the news" about the flaws of LWW to prospective Riak users, but the author references the Dynamo paper. The Dynamo paper is chock full of information about about siblings and the key role they play. If you read the Dynamo paper, you would know immediately that using a system like Riak without thinking through sibling resolution is akin to lighting your data on fire. This complexity of siblings is the tradeoff you pay for high availability (no silver bullet, etc).

And if your scaling/availability problems are serious enough to even begin saying "we a distributed database like Riak!", hopefully it's not asking too much to have you read the Dynamo paper to understand how these sorts of databases actually work.

( note - we admin a largish riak cluster here at bu.mp )

brown9-2 · on May 21, 2013

It makes sense to me to test the default configuration that Riak ships, and even with the caveat of "you probably shouldn't use this", it's failures are an instructive lesson (which is the underlying goal of the series of blog posts, not just benchmarking various databases).

cs648 · on May 20, 2013

Slightly off topic, but I didn't understand the first line: "Notorious computer expert Joe Damato explains: “Literally no one knows.”". I guess it's a joke I didn't get?

aphyr · on May 20, 2013

In-joke. You gotta know the guy. :)

tripzilch · on May 20, 2013

I don't know who Joe Damato is, but I found it hilarious, from the execution alone.