
> And now they become the machine with a bad failure mode.

What is the failure mode the recipients have here?



If the recipient fails after telling the deduping machine what it has seen, the recipient's failover will be in an unknown, different state. And now the failover machine has to try to figure out what its state should be, what is going on, and so on. You can add ways to address each possibility here, and you'll get ever more obscure chains that again can result in failure. At the cost of ever-increasing complexity, you'll make failures ever more complicated.
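To make that concrete, here is a minimal sketch of the failure, not anyone's real implementation. The names `DedupStore`, `Recipient`, and `simulate` are hypothetical, and the "crash" is just an exception raised between marking the message as seen and applying it.

```python
class DedupStore:
    """Hypothetical deduping machine: remembers message IDs it was told about."""

    def __init__(self):
        self.seen = set()

    def mark_seen(self, msg_id):
        self.seen.add(msg_id)

    def has_seen(self, msg_id):
        return msg_id in self.seen


class Recipient:
    """Hypothetical recipient whose local state is lost when it crashes."""

    def __init__(self, dedup):
        self.dedup = dedup
        self.applied = []  # local state, not shared with the failover

    def handle(self, msg_id, payload, crash_after_mark=False):
        if self.dedup.has_seen(msg_id):
            return "skipped"
        # Tell the deduping machine first...
        self.dedup.mark_seen(msg_id)
        if crash_after_mark:
            # ...and die before the message is actually applied.
            raise RuntimeError("recipient crashed before applying message")
        self.applied.append(payload)
        return "applied"


def simulate():
    dedup = DedupStore()
    primary = Recipient(dedup)
    try:
        primary.handle("m1", "debit $10", crash_after_mark=True)
    except RuntimeError:
        pass  # primary is gone; its local state went with it

    # The failover starts empty and sees the redelivered message.
    failover = Recipient(dedup)
    outcome = failover.handle("m1", "debit $10")
    return outcome, failover.applied


if __name__ == "__main__":
    print(simulate())
```

The failover skips the message because the deduping machine says it was already seen, yet nothing was ever applied. Swapping the order (apply first, then mark) just trades the lost message for a possible duplicate, which is the point being made above.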

Frequently, developers who implement these manage to do two bad things. First, they introduce a lot of complex code that rarely triggers except in disaster, so its bugs tend to survive. Second, they manage to convince themselves that they have accomplished the impossible, and make reliability promises that other developers unwisely believe. Exactly how unreliable most systems were, and how little the documentation could be trusted, was underappreciated until https://jepsen.io/ came along and started proving how bad most distributed software was.

Now it may seem bizarrely unlikely that you'll ever see this kind of situation. But failures often start from network congestion due to a packet storm. And the failures lead to chatty Byzantine fault tolerance protocols adding significant traffic. This causes cascading failures. And so a small, simple outage can escalate into a series of outages as ever more confused servers continue overwhelming the network with their futile attempts to discover what is supposed to be true. So complex combinations of failures occur together more often than most of us would expect.


Certainly the recipient can fail; this seems obvious. If a user's phone dies, then your app is not going to work. Perhaps I have the wrong model/framing here, but I was thinking from the perspective of being resilient to any failure outside of the recipient device.


Ah, but distributed systems tend to be hooked together. So the recipient device may itself feed into something else. And in the case of a message queue, generally does.



