> Considering that more than one txn will, I imagine, often hit a single specific node, at large scale with a single node down, even if that means only 5% of transactions block, you are basically growing a queue of waiting work indefinitely, with the only upper bound being how much RAM you have. Meanwhile 5% of clients will be waiting, and this node may take a while to come back if it needs something.
Ahh, no! For each obj, there are 2F+1 replicas, each replica on a different node. For each txn that hits an obj, you only need F+1 of those replicas to vote on the txn. So, provided F > 0, a single failure will never cause anything to block.
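(To make that arithmetic concrete, here's a tiny illustrative sketch in Go; it's not GoshawkDB code, just the quorum sums spelt out.)

```go
package main

import "fmt"

// Illustrative sketch only, not GoshawkDB code: for a chosen failure
// tolerance F, each object has 2F+1 replicas and a txn touching that
// object needs votes from any F+1 of them.
func replicasFor(f int) (total, quorum int) {
	return 2*f + 1, f + 1
}

// canProceed reports whether a txn on an object can still gather a
// quorum when `down` of that object's replica nodes are unreachable.
func canProceed(f, down int) bool {
	total, quorum := replicasFor(f)
	return total-down >= quorum // i.e. down <= f
}

func main() {
	fmt.Println(canProceed(1, 1)) // F=1: 3 replicas, quorum 2, so one failure never blocks
	fmt.Println(canProceed(1, 2)) // two failures exceed the tolerance, so that object blocks
}
```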
> I would suggest that you have, at minimum, strict timeouts for the transactions, in conjunction with an immediate fail if the node that has the answer is not available right now. So a client would never wait more than X seconds, or the transaction is aborted and, if necessary, rolled back.
I agree. I think it would be very difficult for a client to do anything sensible with such information, but even if all I'm doing is getting the client to resubmit the txn verbatim, at least it clears up the resource usage on the server, which is the most important thing.
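For illustration, here's roughly what that looks like from the client side. The runTxn helper and errServerBusy error below are hypothetical stand-ins, not the GoshawkDB client API - the point is just the bounded wait plus verbatim resubmission.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errServerBusy stands in for whatever error a client library surfaces
// when the server aborts a txn to free up resources; it, and runTxn
// below, are hypothetical, not the GoshawkDB client API.
var errServerBusy = errors.New("txn aborted by server")

// runTxn runs the txn body but gives up once the context's deadline
// passes, so a client never waits more than X seconds.
func runTxn(ctx context.Context, body func() error) error {
	done := make(chan error, 1)
	go func() { done <- body() }()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	for attempt := 1; attempt <= 3; attempt++ {
		a := attempt
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		err := runTxn(ctx, func() error {
			if a < 3 {
				return errServerBusy // simulate the server shedding load
			}
			return nil
		})
		cancel()
		if err == nil {
			fmt.Println("committed on attempt", attempt)
			return
		}
		fmt.Println("attempt", attempt, "failed:", err, "- resubmitting verbatim")
	}
}
```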
> Furthermore, what if a node never comes back? There seems to be a need for a transaction failure that is handed back to the client, whether its node is down, gone forever, or has simply timed out.
Well, this only becomes a problem if > F nodes fail and never come back - the whole design of consensus systems is to cope with failures up to a certain threshold. Provided <= F nodes fail, the failures are detected and any txns that are in-flight are safely aborted (or, if a txn has actually committed, that information is propagated) - this is all just usual Paxos stuff. But yes, again, I completely agree: if you have a massive failure and you lose data, then you are going to have to recover from that. For GoshawkDB, that's going to require changing the topology, which is not supported in 0.1, but is the main goal for 0.2.
> At the end of the day, even if your system is able to handle N thousands of transactions pleasantly waiting and can still answer other requests indefinitely, that is a great accomplishment, but in practice it may not be ideal for many workloads. People and computers both tend to retry interactions with data that are slow or have failed, and the combination of just taking on more and more work and hoping everything flushes out when things become healthy is a recipe for a thundering herd; it is better served by something like an async work-queue pattern.
Oh absolutely. In a previous life I did much of the core engineering on RabbitMQ. It was there that I slowly learnt that the chief problem tends to be that under heavy load you end up spending more CPU per event than under light load, so as soon as you go past a certain tipping point, it's very difficult to come back. I certainly appreciate that human interaction with a data store is going to require consistent and predictable behaviour.
At a high level it would be interesting to make it as durable as possible in situations where, say, F=2 but you have 30 nodes.
As I understand it, in this hypothetical case the majority of the system would work fine if 3 nodes fail, but there could also be some percentage of transactions which are unfulfillable in this scenario, while there should be many that can still be answered with confidence. (If I'm understanding correctly, with a configuration like this the data does not go onto every node but is consistently hashed across them.)
So I was just curious, while trying to digest your page on it, how a prolonged period of this might turn into an unbounded problem.
Ultimately it sounds like it's headed down a good path, and embracing this partial-fail-but-that-doesn't-mean-the-show-is-over behavior with specifics is a good thing to get ironed out; it would certainly be a key thing to get right early, even if it's largely semantics and client behavior.
> As I understand it, in this hypothetical case the majority of the system would work fine if 3 nodes fail, but there could also be some percentage of transactions which are unfulfillable in this scenario, while there should be many that can still be answered with confidence. (If I'm understanding correctly, with a configuration like this the data does not go onto every node but is consistently hashed across them.)
Yes, you understand it perfectly. :)
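To put a rough number on it: here's a back-of-envelope sketch, assuming purely for illustration that each object's 2F+1 replicas are placed uniformly at random across the nodes (the real placement scheme may differ), of what fraction of objects lose their quorum when 3 of 30 nodes fail with F=2.

```go
package main

import "fmt"

// binom computes n choose k as a float.
func binom(n, k int) float64 {
	if k < 0 || k > n {
		return 0
	}
	r := 1.0
	for i := 0; i < k; i++ {
		r = r * float64(n-i) / float64(i+1)
	}
	return r
}

// blockedFraction estimates the fraction of objects that lose their
// F+1 quorum when `failed` of n nodes die, assuming each object's 2F+1
// replicas are spread uniformly at random (an assumption, for
// illustration only). An object is blocked iff at least F+1 of its
// replicas sit on failed nodes.
func blockedFraction(n, f, failed int) float64 {
	replicas := 2*f + 1
	blocked := 0.0
	for i := f + 1; i <= replicas && i <= failed; i++ {
		blocked += binom(failed, i) * binom(n-failed, replicas-i)
	}
	return blocked / binom(n, replicas)
}

func main() {
	// The hypothetical case: 30 nodes, F=2 (5 replicas, quorum 3), 3 nodes down.
	fmt.Printf("%.3f%% of objects blocked\n", 100*blockedFraction(30, 2, 3))
	// Prints roughly 0.246%: the vast majority of txns still complete.
}
```

So under those assumptions only around a quarter of a percent of objects would be blocked in that hypothetical, which matches your intuition that most transactions can still be answered with confidence.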
> Ultimately it sounds like it's headed down a good path, and embracing this partial-fail-but-that-doesn't-mean-the-show-is-over behavior with specifics is a good thing to get ironed out; it would certainly be a key thing to get right early, even if it's largely semantics and client behavior.
Indeed. Semantics are very important and I want to make sure GoshawkDB is easy to understand and reason about at all times.
Thanks for your input.