"On December 1st, at 8:52am PST, a box dropped offline; inaccessible. And then, instead of automatically coming back after failover — it didn’t. Our primary on-call engineer was alerted for this and dug in. While digging in, another box fell offline and didn’t come back"
This makes no sense. A machine restarted and you had catastrophic failure? VMs reboot from time to time. But if you design your setup to completely destroy itself in this scenario, I don't think you will like a move to AWS, or god forbid, your own colo.
Read the article more carefully. The article (the text you quoted, even) clearly states that the machine didn't "restart". It crashed and didn't come back online.
And nowhere in the article do they state that this was a "catastrophic failure" - Railway itself didn't go down entirely. But Railway is a deployment company, so they are re-selling these compute resources to their customers to deploy applications. So when one of those VMs goes down and doesn't automatically fail over, that's downtime for the specific customer who was running their service on that machine.
As they state:
> During manual failover of these machines, there was a 10 minute per host downtime. However, as many people are running multi-service workloads, this downtime can be multiplied many times as boxes subsequently went offline.
TFA is a bit too light on details. Boxes die, it's a fact of life. I don't really follow what "didn't come back online" is supposed to mean. Nodes aren't 100% durable. A lot depends on your particular configuration.
In any case, there's no world where all VM failures trigger automatic reboots. Expecting that to be the case just makes no sense. Automatically failing over should be handled at another layer, for which there are many possibilities.
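To make "another layer" concrete, here's a rough sketch of one of those possibilities: a supervisor that counts missed heartbeats and moves workloads off a node once it has been silent for too long. Everything in it (`Node`, `fail_over`, `max_missed`) is a made-up name for illustration. It says nothing about how Railway or any particular orchestrator does this.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical model of a host; not taken from any real scheduler.
    name: str
    missed_heartbeats: int = 0
    workloads: list[str] = field(default_factory=list)


def fail_over(nodes: list[Node], alive: set[str], max_missed: int = 3) -> dict[str, str]:
    """Run one supervision round.

    `alive` holds the names of nodes that sent a heartbeat this round.
    Returns a mapping of workload -> node it was moved to.
    """
    # Update the missed-heartbeat counters first.
    for node in nodes:
        if node.name in alive:
            node.missed_heartbeats = 0
        else:
            node.missed_heartbeats += 1

    healthy = [n for n in nodes if n.missed_heartbeats == 0]
    moved: dict[str, str] = {}
    if not healthy:
        return moved  # nowhere to move anything; leave workloads in place

    for node in nodes:
        if node.missed_heartbeats < max_missed:
            continue  # still within tolerance, don't touch it
        for workload in node.workloads:
            # Place each workload on the least-loaded healthy node.
            target = min(healthy, key=lambda n: len(n.workloads))
            target.workloads.append(workload)
            moved[workload] = target.name
        node.workloads = []
    return moved
```

The point is only that the decision to move work lives outside the VM itself: the dead box never needs to "come back" for its workloads to be running again somewhere else.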
Manually restoring nodes sounds like a "pets, not cattle" problem.
Long ago, we used to run into this on AWS all the time before we started automatically aging them out.
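For anyone unfamiliar with aging out, the idea is roughly this: pick a maximum node age and replace anything older on a schedule, so no box lives long enough to become a pet. A minimal sketch, where `nodes_to_retire` and the 7-day default are assumptions for illustration and not what we actually ran:

```python
from datetime import datetime, timedelta, timezone


def nodes_to_retire(
    launch_times: dict[str, datetime],
    now: datetime,
    max_age: timedelta = timedelta(days=7),  # assumed threshold, tune to taste
) -> list[str]:
    """Return the names of nodes that have reached max_age, sorted by name."""
    return sorted(
        name for name, launched in launch_times.items() if now - launched >= max_age
    )


if __name__ == "__main__":
    now = datetime(2023, 12, 1, tzinfo=timezone.utc)
    fleet = {
        "node-old": now - timedelta(days=10),
        "node-new": now - timedelta(days=1),
    }
    print(nodes_to_retire(fleet, now))
```

Whatever comes out of that list gets drained and replaced with a fresh instance, which is the "cattle" half of the analogy.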