Agreed that the default policy should be manual intervention, but currently I don't see how I can set a different policy.

For example, in a large cluster where that kind of degradation is expected and monitored for.

How do you add a second drive? Is there a btrfs RAID FAQ/howto?
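(For what it's worth, the usual recipe I've seen is roughly the following, assuming the filesystem is mounted at /mnt and the new disk is /dev/sdb; both paths are just placeholders:

  # attach the new disk to the existing filesystem
  btrfs device add /dev/sdb /mnt
  # rewrite existing data and metadata as RAID1 across both disks
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt

The balance step is what actually mirrors the existing data; until it finishes, old chunks are still single-copy. The btrfs wiki's "Using Btrfs with Multiple Devices" page covers this in more detail.)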

In a large cluster, you'd probably plan on replacing every drive that failed rather than reconfigure to use less redundancy. So in that case, you'd probably want the hot spare feature to be stabilized and upstreamed. Then the FS could automatically copy over (or reconstruct from parity) data for a missing disk, without modifying data on the surviving disks.
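In the meantime, the manual equivalent is btrfs replace, which reconstructs the missing disk's contents directly onto the new one without touching the surviving disks. A sketch, where devid 2 and /dev/sdc are placeholders:

  # find the devid of the missing disk
  btrfs filesystem show /mnt
  # rebuild its contents onto the replacement from the remaining copies
  btrfs replace start 2 /dev/sdc /mnt
  # watch progress
  btrfs replace status /mnt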

In environments with a smaller budget that want to get the system back up and running before a replacement drive is available, it could be valuable to be able to pre-specify that the system should rebalance with less redundancy when a drive goes missing. I'm not aware of any work to implement this kind of feature. No enterprise customer would want or use this feature, and even a home user on a shoestring budget wouldn't necessarily want this rebalancing to happen automatically. (What if the drive was only temporarily missing, such as from a failed or loose SATA cable? You wouldn't want to do a ton of re-writing of data only to have to reverse it on the next boot.)
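(The manual version of that recovery does exist today, for what it's worth. Roughly, with placeholder device names:

  # mount with the failed disk absent
  mount -o degraded /dev/sda /mnt
  # convert to profiles that fit on the surviving disk(s)
  btrfs balance start -dconvert=single -mconvert=dup /mnt
  # drop the missing disk from the filesystem
  btrfs device remove missing /mnt

The point stands, though: nothing triggers this automatically, and for the loose-cable reason above you probably wouldn't want it to.)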


A hot spare is expensive. I want pure redundancy, that is, turning a one-drive failure into a still perfectly operational server/node/box.

That's what RAID1 usually means.

And sure, you can't survive the next failure without replacing the failed drive and resyncing.

But the remount-read-only thing is something different. It's a useful failure mode, but it doesn't help with operational simplicity.

(If the SATA cable is loose, then it'll cause intermittent failures, you'll see it in the log, and there will be a lot of resync events. There will probably also be degraded performance, a lot of ATA errors (or SCSI errors in the case of SAS), and other bus/command errors that go away on retry. And with SMART it's possible to at least guess that it's not the drive. It'd be great to have an error notification interface from the kernel, so a tool could dig into the relevant subsystem's performance and health data and guess exactly which component is faulty.)
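(On the SMART point: the usual telltale is attribute 199, UDMA_CRC_Error_Count, climbing while the media-health attributes stay clean. A quick check, with /dev/sda as a placeholder:

  # -A prints the vendor attribute table
  smartctl -A /dev/sda | grep -iE 'crc|reallocated|pending'

CRC errors rising while reallocated/pending sector counts stay at zero typically points at the cable or backplane rather than the disk itself.)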
