Agreed that the default policy should be manual intervention, but currently I don't see how I can set a different policy.

For example, in a large cluster where that kind of degradation is expected and monitored for.

How do you add a second drive? Is there a btrfs RAID FAQ/howto?
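(For what it's worth, the usual recipe I've seen is roughly the following, assuming the filesystem is mounted at /mnt and the new disk is /dev/sdb; both paths are just placeholders:

  # attach the new disk to the existing filesystem
  btrfs device add /dev/sdb /mnt
  # rewrite existing data and metadata as RAID1 across both disks
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt

The balance step is what actually mirrors the existing data; until it finishes, old chunks are still single-copy. The btrfs wiki's "Using Btrfs with Multiple Devices" page covers this in more detail.)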

In a large cluster, you'd probably plan on replacing every drive that failed rather than reconfigure to use less redundancy. So in that case, you'd probably want the hot spare feature to be stabilized and upstreamed. Then the FS could automatically copy over (or reconstruct from parity) data for a missing disk, without modifying data on the surviving disks.
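In the meantime, the manual equivalent is btrfs replace, which reconstructs the missing disk's contents directly onto the new one without touching the surviving disks. A sketch, where devid 2 and /dev/sdc are placeholders:

  # find the devid of the missing disk
  btrfs filesystem show /mnt
  # rebuild its contents onto the replacement from the remaining copies
  btrfs replace start 2 /dev/sdc /mnt
  # watch progress
  btrfs replace status /mnt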

In environments with a smaller budget that want to get the system back up and running before a replacement drive is available, it could be valuable to be able to pre-specify that the system should rebalance with less redundancy when a drive goes missing. I'm not aware of any work to implement this kind of feature. No enterprise customer would want or use this feature, and even a home user on a shoestring budget wouldn't necessarily want this rebalancing to happen automatically. (What if the drive was only temporarily missing, such as from a failed or loose SATA cable? You wouldn't want to do a ton of re-writing of data only to have to reverse it on the next boot.)
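(The manual version of that recovery does exist today, for what it's worth. Roughly, with placeholder device names:

  # mount with the failed disk absent
  mount -o degraded /dev/sda /mnt
  # convert to profiles that fit on the surviving disk(s)
  btrfs balance start -dconvert=single -mconvert=dup /mnt
  # drop the missing disk from the filesystem
  btrfs device remove missing /mnt

The point stands, though: nothing triggers this automatically, and for the loose-cable reason above you probably wouldn't want it to.)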


A hot spare is expensive. I want pure redundancy, that is, turning a one-drive failure into a still perfectly operational server/node/box.

That's what RAID1 usually means.

And sure, you can't survive the next failure without replacing the failed drive and resyncing.

But the remount-read-only thing is something different. It's a useful failure mode, but it doesn't help with operational simplicity.

(If the SATA cable is loose, then it'll cause intermittent failures, you'll see it in the log, and there will be a lot of resync events. There will probably also be degraded performance, a lot of ATA errors (or SCSI errors in the case of SAS), and other bus/command errors that go away on retry. And with SMART it's possible to at least guess that it's not the drive. It'd be great to have an error notification interface from the kernel, so a tool could dig into the relevant subsystem's performance and health data and guess exactly which component is faulty.)
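(On the SMART point: the usual telltale is attribute 199, UDMA_CRC_Error_Count, climbing while the media-health attributes stay clean. A quick check, with /dev/sda as a placeholder:

  # -A prints the vendor attribute table
  smartctl -A /dev/sda | grep -iE 'crc|reallocated|pending'

CRC errors rising while reallocated/pending sector counts stay at zero typically points at the cable or backplane rather than the disk itself.)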
