This seems to pose an interesting question that's out of my pay grade. The funda...

simonh · on Jan 12, 2023

It’s a fundamental limitation of identical redundant systems that they are vulnerable to some of the same threats, particularly bad inputs and capacity issues. It’s important to understand that this only gives you physical redundancy, for example if one data centre goes down. The same software bugs, the same bad input data, even the same memory overruns are likely to hit both systems.

It’s not bad design; you just have to understand what resiliency you actually have and plan against each of these threats according to your risk appetite.

phaedrus · on Jan 12, 2023

It's very much like the advice that mirror RAID is not a backup solution. That seems to be almost literally what went wrong here.

atoav · on Jan 12, 2023

My intuition on that is that two is also a bad number to choose in that case. One could go full lunar mission on the thing and have three models, and in case of inconsistency the majority wins.

ilyt · on Jan 12, 2023

It is, but it doesn't matter if it is the input message that crashes it. You just need 3 bad messages (sooo a message and 2 retries) to crash the whole thing.
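A minimal sketch of that failure mode, assuming three identical replicas that share one parsing bug and a failover loop that retries the message on the next replica. The names `Replica`, `parse`, and `deliver_with_failover` are made up for illustration and are not taken from any real system.

```python
class ReplicaCrashed(Exception):
    """Raised when a replica dies while handling a message."""


def parse(message: str) -> int:
    # The same bug lives in every replica: non-numeric input is not handled.
    return int(message)


class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True

    def handle(self, message: str) -> int:
        try:
            return parse(message)
        except ValueError as exc:
            self.alive = False  # simulate the whole process going down
            raise ReplicaCrashed(self.name) from exc


def deliver_with_failover(replicas, message):
    """Try each live replica in turn; return the first result, or None."""
    for replica in replicas:
        if not replica.alive:
            continue
        try:
            return replica.handle(message)
        except ReplicaCrashed:
            continue  # the retry goes to the next replica
    return None


replicas = [Replica("a"), Replica("b"), Replica("c")]
good = deliver_with_failover(replicas, "42")
bad = deliver_with_failover(replicas, "not-a-number")
survivors = [r.name for r in replicas if r.alive]
```

The good message is handled by the first replica. The bad one is retried twice and takes down every replica in turn, so `survivors` ends up empty.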

atoav · on Jan 12, 2023

In the end you never get around having to make some parts of a system bulletproof and reliable. The question is just how big those parts are.

hgsgm · on Jan 12, 2023

or 1 message that crashes all three on the first try because you are voting.

UniverseHacker · on Jan 12, 2023

Ideally three models developed to the same spec independently, with majority voting
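A rough sketch of what majority voting over independently written implementations could look like. The three toy parsers and the `majority_vote` helper are hypothetical; a real system would vote on much richer outputs than a single integer.

```python
from collections import Counter


def impl_a(text: str) -> int:
    return int(text.strip())


def impl_b(text: str) -> int:
    return int(float(text))


def impl_c(text: str) -> int:
    # Independently written, with its own bug: rejects a leading "+".
    if not text.strip().isdigit():
        raise ValueError(f"unsupported input: {text!r}")
    return int(text)


def majority_vote(implementations, value):
    """Return the answer that a strict majority of implementations agree on."""
    answers = []
    for impl in implementations:
        try:
            answers.append(impl(value))
        except Exception:
            pass  # a crashed implementation casts no vote
    if not answers:
        raise RuntimeError("no implementation produced an answer")
    winner, count = Counter(answers).most_common(1)[0]
    if count * 2 <= len(implementations):
        raise RuntimeError("no majority among implementations")
    return winner


impls = [impl_a, impl_b, impl_c]
outvoted = majority_vote(impls, "+7")  # impl_c fails, the other two agree

try:
    majority_vote(impls, "abc")  # input that every implementation rejects
    all_failed = False
except RuntimeError:
    all_failed = True
```

Independent implementations help when only one of them has the bug. Input that all three reject still leaves the voter with nothing to return, so voting does not remove the need to handle bad input somewhere.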

weaksauce · on Jan 12, 2023

you could go to a model that verifies the integrity of the data coming in and makes sure the limits on the data are sane before committing it to the db, using a language with strong safety principles (that are not very "hip") like Ada or Fortran. or you design it so that the system is robust to failure and expects failure, like something from telecom such as Erlang. redundant hardware is fine and great, but having the systems verify the data and monitor their own limits is pretty important too.
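As a sketch of the "verify before committing" idea, assuming a toy record with an identifier and a numeric value: the field names, the limits, and the in-memory list standing in for the db are all hypothetical.

```python
from dataclasses import dataclass

MAX_IDENT_LEN = 16       # hypothetical limit on identifier length
VALUE_RANGE = (0.0, 1000.0)  # hypothetical sane range for the value


class RejectedInput(ValueError):
    """Input failed validation and must not reach the database."""


@dataclass(frozen=True)
class Record:
    ident: str
    value: float


def validate(raw: dict) -> Record:
    ident = raw.get("ident")
    if not isinstance(ident, str) or not (1 <= len(ident) <= MAX_IDENT_LEN):
        raise RejectedInput("ident missing or out of bounds")
    value = raw.get("value")
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise RejectedInput("value is not a number")
    low, high = VALUE_RANGE
    # NaN fails both comparisons, so it is rejected here as well.
    if not (low <= value <= high):
        raise RejectedInput("value outside sane limits")
    return Record(ident, float(value))


def ingest(db: list, rejected: list, raw: dict) -> bool:
    """Commit validated records; set bad input aside instead of crashing."""
    try:
        record = validate(raw)
    except RejectedInput:
        rejected.append(raw)
        return False
    db.append(record)
    return True


db, rejected = [], []
ok = ingest(db, rejected, {"ident": "A1", "value": 10})
too_long = ingest(db, rejected, {"ident": "X" * 40, "value": 10})
not_sane = ingest(db, rejected, {"ident": "A2", "value": float("nan")})
```

Bad records end up in `rejected` for someone to inspect while the system keeps running, which is the opposite of letting one message take down every redundant copy.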