Reliability, availability and scale – an interlude

jorangreef · on July 25, 2011

I have this idea in mind of a system where nodes are symmetrical and each node contains a vertically integrated stack, all within a single, evented process.

So instead of having asymmetrical services (database, web server, workers, logging, messaging) spread across clusters of symmetrical nodes, one would have all of these services within a single process on a single core machine.

Each node would be responsible for itself: from the time it boots to the time it dies, continuously integrating from the master trunk, accepting data partitions from other nodes, checking up on the health of other nodes etc.

One advantage would be that data and workload are evenly distributed, for example each node would have a few workers running which would only operate on local data. So all services would automatically be sharded, not just the database. If each node's workers only operate on local data this would simplify locking across workers throughout the system.

There would be no master nodes, rather replica sets, all multi-master similar to Dynamo. Expanding capacity across all services would mean booting a new instance in EC2.

Also, having the services vertically integrated within a single evented process would eliminate network overhead, deserialization/serialization overhead, parsing overhead, queue contention, locking etc.

Everything could be written in the same language, to make for easier extensibility, more in-house understanding of core components, and more opportunities to reduce waste.

noonat · on Aug 1, 2011

I think the main problem with this approach is that different parts of your stack have different requirements. One part of your stack might require large amounts of RAM for efficient response times, but may be used infrequently, for instance. With a separate server, you can scale that part of the stack independently. If the whole stack lives in every server, you must sacrifice performance or money for that consistency.

IgorPartola · on July 25, 2011

Somewhat OT: I have been lately toying with the idea of removing a part of the cluster logic from the DB, and moving it the client's drivers instead. Basically, it would work something like this:

You have 3 database nodes. Your application requests write access. The DB driver opens an event loop and tries to connect to all three at once. As soon as it has two live connections, it starts a distributed transaction (I know, these are unreliable, but bear with me, and assume that they work for now. This is the same thing as used on the server anyways). The application completes the work, and it is committed to two servers, which is the majority. The third server will replicate the changes as soon as it is able, verifying that the majority of the servers agree that this is in fact a valid changeset.

I think that this approach, combined with a good locking system (e.g.: the client would declare all locks upfront), could result in a robust system. It also makes it easy to change where your system lies on the CAP triangle: just tune your driver to require a connection to all servers, not just the majority to make the system lean more towards consistency, and make downed nodes that are recovering refuse any connections until they are caught up.

Does anyone have any thoughts on this?

Robin_Message · on July 25, 2011

I pretty sure this has been tried (generalised to getting a write lock on more than N/2 of the servers) but I can't find a citation right now.

IgorPartola · on July 25, 2011

Any idea if there was any success with it?

Robin_Message · on July 25, 2011

I've found a reference to it in Jean Bacon's Concurrent Systems. She calls it "quorum assembly", but I can't find any other references with that name (slides on cl.cam.ac.uk are by her, or influenced by her work.) I'm afraid it is probably an idea that has been had many times.

The write quorum (number of nodes you must talk to to do a write) must be > n/2. The read quorum plus the write quorum must be > n (otherwise you can read from outside the write quorum).

So the n=3 case is simple, RQ=WQ=2 as you suggested. I think it wasn't very successful in larger cases because n/2 isn't much better than n, and you have to do some kind of synchronisation between the nodes which is tricky to get write (consider how the write nodes know the other write nodes have done the write.)

In practice, hierarchical schemes where you are reading from a slave which might be behind the master are more common. There you can have 1 write node and n read nodes. Since reads are generally more common than writes, this is great.

IgorPartola · on July 26, 2011

Thanks for the details and the explanation. Makes a lot of sense.

cx01 · on July 25, 2011

> the client would declare all locks upfront

But if the client crashes after acquiring the locks, wouldn't your system be deadlocked?

IgorPartola · on July 25, 2011

Breaking the connection would release the locks.

damienkatz · on July 25, 2011

TCP doesn't "break" connections like that. Cleanly closing sockets breaks connections, but machines that crash or drop off the network won't be noticed until the connection times out, which is typically on the order of many minutes.

IgorPartola · on July 25, 2011

I realize that. My point is that existing locking systems work this way. For example, MySQL releases any table locks, rolls back the transaction and drops all temporary tables whenever the connection is closed or times out. This is no worse than what we already have.