
Very good post! Well, all of Aphyr's "Call Me Maybe" series is great.

A lot of modern systems are distributed, and quite often it seems they are designed and built without much thought given to these kinds of issues. In the "Call Me Maybe" series, for example, he shows how a lot of popular NoSQL, "webscale", etc. databases really fall on their faces when network partitions hit.

And there is clearly enough literature written on this, but it is scholarly articles or research papers that don't necessarily get read by the architects of most distributed systems. (Heck, I imagine a lot of distributed systems are created in an ad hoc way. Something like, "Hey Jim, we need to have a hot spare fail-over machine in the US-EAST zone"... and now they have a distributed system.)

Also, I am not sure how much this is taught in schools. When I was in school, distributed systems meant MPI, mesh topologies, and so on. HPC-type things. Nobody said anything at all about network partitions.

Speaking of network partitions, the question I have is: can one make the chance of partitions happening low enough to be discounted? Or does it not even make sense to talk about that? For example, imagine the classic scenario of a single data center. Switches do fail sometimes. What if there is a redundant network, running on a separate physical interface (say eth1, not the default eth0)? Now partitions can be detected, and two network switches have to fail simultaneously for a node to be cut off completely. This is not even a hypothetical scenario. There is a Japanese NoSQL database out there that probably nobody has heard of: Hibari. It features such a "partition detector" application.

http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en....

And the code:

https://github.com/hibari/partition-detector

I saw that a while back and just remembered it now. So is that a practical solution, or are there fundamental theoretical issues with this approach?
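To make the two-network idea concrete, here is a minimal sketch of the decision logic, assuming each node records when it last heard a heartbeat from a peer on each network. This is not Hibari's implementation; the names `classify_peer` and `LinkState` and the 3-second timeout are made up for illustration.

```python
from enum import Enum


class LinkState(Enum):
    HEALTHY = "healthy"                          # peer heard on both networks
    PARTITION_SUSPECTED = "partition_suspected"  # peer heard on exactly one network
    PEER_UNREACHABLE = "peer_unreachable"        # peer heard on neither network


def classify_peer(last_seen_a, last_seen_b, now, timeout=3.0):
    """Classify a peer from its last heartbeat time on network A and network B.

    Times are in seconds on the same clock as `now`; None means no heartbeat
    has ever arrived on that network. The timeout is a hypothetical value.
    """
    def alive(last_seen):
        return last_seen is not None and (now - last_seen) <= timeout

    a_ok, b_ok = alive(last_seen_a), alive(last_seen_b)
    if a_ok and b_ok:
        return LinkState.HEALTHY
    if a_ok or b_ok:
        # One network is silent while the other still works, so the peer is
        # alive and the silent network is the likely culprit.
        return LinkState.PARTITION_SUSPECTED
    # Both networks are silent: a crashed peer and a double network failure
    # look the same from here.
    return LinkState.PEER_UNREACHABLE


if __name__ == "__main__":
    print(classify_peer(100.0, 100.5, now=101.0))
    print(classify_peer(90.0, 100.5, now=101.0))
    print(classify_peer(None, None, now=101.0))
```

The last branch is where the hard part lives: with both networks silent, this logic cannot tell a dead peer from a partition.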




On the partition side of things... not really. Usually you end up with coordinated failures of various types.

For example, let's say you run redundant networks. Are the physical paths diverse? What happens if someone trips on BOTH network cables? What happens if the redundant network is on the SAME switch hardware? What happens if, yes, you use two network switches, but they are hooked up to the same UPS? Or both UPSs are on the same circuit breaker?

The classic reference on this is from Google: http://www.catonmat.net/blog/wp-content/uploads/2008/11/that...

There are a number of theoretical issues with these approaches, starting with the FLP result.

Also, realistically, all practical solutions eventually fail, often due to user error.
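To put rough numbers on the shared-UPS point, here is a small sketch comparing two switches that fail independently with two switches that also depend on one common UPS. The probabilities are invented for illustration, not measured failure rates, and the function names are hypothetical.

```python
def p_both_down_independent(p_switch):
    """Probability that both switches are down if their failures are independent."""
    return p_switch * p_switch


def p_both_down_shared_ups(p_switch, p_ups):
    """Probability that both switches are down when they share one UPS.

    Either the UPS fails and takes both switches with it, or the UPS stays up
    and both switches fail on their own.
    """
    return p_ups + (1.0 - p_ups) * p_switch * p_switch


# Made-up per-period failure probabilities, for illustration only.
p_switch = 0.01
p_ups = 0.005

independent = p_both_down_independent(p_switch)
shared = p_both_down_shared_ups(p_switch, p_ups)

print(f"independent switches: {independent:.6f}")
print(f"shared UPS:           {shared:.6f}")
print(f"ratio:                {shared / independent:.1f}x")
```

With these made-up inputs the shared UPS dominates: the chance of losing both networks is about 50 times higher than the independent case suggests, and the second switch barely helps.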


OK, let's say mostly diverse physical paths, both tied up nicely into bundles, different switch hardware, different UPSs, different power grids (some nice data centers will even have that!).

At this point other kinds of odd failures are more likely than the failure of both network paths at the same time: let's say meteor strikes, or a disgruntled employee wiping out backups and data... Can one even begin to consider an AC system then? And if so, what would it look like?



