"On December 1st, at 8:52am PST, a box dropped offline; inaccessible. And then, ...

wavemode · on Dec 2, 2023

Read the article more carefully. The article (the text you quoted, even) clearly states that the machine didn't "restart". It crashed and didn't come back online.

And nowhere in the article do they state that this was a "catastrophic failure" - Railway itself didn't go down entirely. But Railway is a deployment company, so they are re-selling these compute resources to their customers to deploy applications. So when one of those VMs goes down and doesn't automatically fail over, that's downtime for the specific customer who was running their service on that machine.

As they state:

> During manual failover of these machines, there was a 10 minute per host downtime. However, as many people are running multi-service workloads, this downtime can be multiplied many times as boxes subsequently went offline.

> For all of our users, we’re deeply sorry.

xyzzy_plugh · on Dec 2, 2023

TFA is a bit too light on details. Boxes die, it's a fact of life. I don't really follow what "didn't come back online" is supposed to mean. Nodes aren't 100% durable. A lot depends on your particular configuration.

In any case, there's no world where all VM failures trigger automatic reboots. Expecting that to be the case just makes no sense. Automatically failing over should be handled at another layer, for which there are many possibilities.
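
To make "another layer" a bit more concrete, here is a rough sketch of one possibility: a control loop that stops trusting a node after a few missed health checks and moves its services to the least-loaded healthy node. Everything in it (`Node`, `reconcile`, `max_missed`) is made up for illustration, and it says nothing about how Railway or any particular scheduler actually does it.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical in-memory model of a host; not any real scheduler's API.
    name: str
    healthy: bool = True
    missed_checks: int = 0
    services: list[str] = field(default_factory=list)


def reconcile(nodes, max_missed=3):
    """Run one pass of the loop and return (service, from_node, to_node) moves."""
    moved = []
    for node in nodes:
        if node.healthy:
            node.missed_checks = 0
            continue
        node.missed_checks += 1
        if node.missed_checks < max_missed:
            continue  # could be a blip; wait for more failed checks
        while node.services:
            candidates = [n for n in nodes if n.healthy]
            if not candidates:
                return moved  # nowhere to fail over to
            # Pick the healthy node currently running the fewest services.
            target = min(candidates, key=lambda n: len(n.services))
            service = node.services.pop()
            target.services.append(service)
            moved.append((service, node.name, target.name))
    return moved
```

The point is only that the dead box never needs to come back: the layer above notices it is gone and reschedules the work elsewhere.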

Manually restoring nodes sounds like a "pets, not cattle" problem.

Long ago, we used to run into this on AWS all the time, before we started automatically aging instances out.
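
By aging out I mean something along these lines: periodically pick the instances that have been up longer than some maximum age and replace them before they get a chance to rot. This is a minimal sketch with hypothetical names (`instances_to_retire`, the instance IDs, the 30-day limit), not what we actually ran and not an AWS API.

```python
from datetime import datetime, timedelta, timezone


def instances_to_retire(launch_times, max_age, now=None):
    """Return the names of instances older than max_age, sorted by name.

    launch_times maps an instance name to its timezone-aware launch time.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        name for name, launched in launch_times.items()
        if now - launched > max_age
    )


# Example with made-up instances and a made-up 30-day limit.
now = datetime(2023, 12, 2, tzinfo=timezone.utc)
fleet = {
    "i-old": now - timedelta(days=40),
    "i-new": now - timedelta(days=3),
}
retire = instances_to_retire(fleet, timedelta(days=30), now=now)
```

Whatever comes back gets drained and replaced with a fresh instance, so no node lives long enough to become a pet.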