> I wish that if a drive fails, the btrfs filesystem would still mount rw and leave the OS running, but warn the user of a failing disk and easily allow the addition of a new drive to reintroduce redundancy.
This might make sense for you, but is insane as a default policy. Manual intervention should be required before the FS will accept writes with less than the configured degree of redundancy. Silently mounting and hoping the user notices something in their logs is too dangerous.
> I created a raid1 btrfs filesystem by converting an existing single btrfs instance into a degraded raid1, then added the other drive
This seems backwards. Why not add the second drive, then convert to RAID1?
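For reference, the usual order looks something like this. A hedged sketch only: the device name (/dev/sdb) and mountpoint (/mnt) are assumptions, and the commands rewrite every extent, so don't run them blindly.

```shell
# Assumptions: /dev/sdb is the new (empty) drive, /mnt is where the
# existing single-device btrfs filesystem is mounted.

# 1. Add the second device to the mounted filesystem:
btrfs device add /dev/sdb /mnt

# 2. Convert both data and metadata to RAID1; the balance mirrors
#    existing extents onto the newly added device:
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
```

Done in this order, the filesystem is never degraded: it goes from single-device to two-device RAID1 without any window where a mount refusal can bite you.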
-
Note that the patch to work around the refusal to mount is extremely simple, and quite safe if used properly. But it's not really an acceptable solution for upstreaming, because it will lead to bigger problems in more complicated situations. There are several potential solutions that would be safe and widely deployable, but all of them involve changing far more than two lines of code.
In a large cluster, you'd probably plan on replacing every drive that failed rather than reconfigure to use less redundancy. So in that case, you'd probably want the hot spare feature to be stabilized and upstreamed. Then the FS could automatically copy over (or reconstruct from parity) data for a missing disk, without modifying data on the surviving disks.
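Without the hot spare feature, the manual version of "replace the failed drive without touching the survivors" already exists as `btrfs replace`. A rough sketch, with assumed identifiers: devid 2 is the missing disk and /dev/sdc is its replacement.

```shell
# Assumptions: the filesystem on /mnt has lost devid 2, and /dev/sdc
# is the new drive. Reconstruction reads only from surviving devices
# (or parity) and writes only to the replacement.
btrfs replace start 2 /dev/sdc /mnt

# Monitor progress; the operation runs in the background.
btrfs replace status /mnt
```

A stabilized hot spare feature would essentially trigger this automatically when a device drops out.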
In environments with a smaller budget that want to get the system back up and running before a replacement drive is available, it could be valuable to be able to pre-specify that the system should rebalance with less redundancy when a drive goes missing. I'm not aware of any work to implement this kind of feature. No enterprise customer would want or use this feature, and even a home user on a shoestring budget wouldn't necessarily want this rebalancing to happen automatically. (What if the drive was only temporarily missing, such as from a failed or loose SATA cable? You wouldn't want to do a ton of re-writing of data only to have to reverse it on the next boot.)
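Today that shoestring-budget recovery has to be done by hand, roughly like this. Illustrative only: the device and mountpoint are assumptions, and dropping redundancy rewrites a lot of data.

```shell
# Assumptions: /dev/sda is the surviving drive of a two-device RAID1,
# /mnt is the mountpoint. A degraded rw mount must be requested
# explicitly (and on older kernels may only succeed rw once):
mount -o degraded /dev/sda /mnt

# Rebalance to profiles that need only one device: single-copy data,
# duplicated metadata. After this the FS is fully writable alone.
btrfs balance start -dconvert=single -mconvert=dup /mnt
```

Pre-specifying this policy would amount to the kernel running that balance on its own, which is exactly why the temporarily-missing-drive case makes it a questionable default.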
A hot spare is expensive. I want pure redundancy: a single drive failure should leave the server/node/box perfectly operational.
That's what RAID1 usually means.
And sure, you can't survive the next failure without replacing the failed drive and resyncing.
But remounting read-only is something different. It's a useful failure mode, but it doesn't help with operational simplicity.
(If the SATA cable is loose, it'll cause intermittent failures: you'll see them in the log, and there will be a lot of resync events. Probably also degraded performance, a lot of ATA (or, with SAS, SCSI) errors, and other bus/command errors that go away on retry. And with SMART it's possible to at least guess that it's not the drive. It'd be great to have an error notification interface from the kernel, so a tool could dig into the relevant subsystem's performance and health data and try to guess exactly which component is faulty.)
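In the meantime, the cable-vs-drive triage can be done by hand. A rough sketch, assuming /dev/sda on a SATA link:

```shell
# Link resets and bus errors in the kernel log point at cabling or
# the controller rather than the platters:
dmesg | grep -iE 'ata|link is slow|hard resetting'

# A rising UDMA_CRC_Error_Count SMART attribute is the classic
# signature of a bad or loose cable:
smartctl -a /dev/sda | grep -i crc

# Whereas reallocated/pending sectors and a failing overall health
# verdict implicate the drive itself:
smartctl -H /dev/sda
```

This is exactly the kind of correlation a notification-driven tool could automate.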