There are plenty of instances of this bug being brought up on the mailing list. One of them is already linked elsewhere in this discussion, and the btrfs status page (also linked from this discussion) has further mailing list links.
Basically, btrfs doesn't want to allow a writeable mount when it might be missing some data. If there's some data on the FS that isn't stored with the RAID1 profile, then the kernel can't safely assume that the missing drive didn't have more chunks like that, holding data that wasn't mirrored on one of the surviving drives. But it's currently not possible to convert from RAID1 to non-RAID or to rebuild the array with a replacement without mounting the degraded array as writeable, which leads to non-RAID data being written. That puts the FS in a state that cannot be automatically judged safe at mount time, and the FS remains in that state until the recovery is complete (either converting from RAID1 to non-RAID, or replacing the failed drive).
There's no easy way to require the user to specify at the time of the `mount -o degraded,rw` whether they intend to resolve the situation by ceasing to use RAID1 or by replacing the failed drive. That leaves users with the opportunity to do neither and instead make the situation worse.
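For concreteness, the two recovery paths look roughly like this; device names, the devid, and the mount point are placeholders, and exact flags can vary with kernel and btrfs-progs versions:

```
# mount the degraded filesystem writeable (surviving drive assumed to be /dev/sdb)
mount -o degraded,rw /dev/sdb /mnt

# path 1: stop using RAID1 -- convert data to single and metadata to dup
btrfs balance start -dconvert=single -mconvert=dup /mnt

# path 2: replace the failed drive (assumed to be devid 1) with a new one
btrfs replace start 1 /dev/sdc /mnt
```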
Thanks for the explanation. I was hoping for a GitHub issue number (or Bugzilla or whatever) to easily track this bug, but perhaps the Btrfs dev team doesn't work with issue numbers?
At least for RAID1, it seems that implementing N-way mirroring would make it easier to recover from a failed drive.
In case of drive failure, we could use the remaining drive in read-only mode to copy the data to a new drive, creating a RAID1 array with two working drives and one failed drive.
The OS should then allow booting in rw mode, and from there it would be easy to remove the failed drive from the RAID1 array.
However it seems that RAID1 N-way mirroring (with N > 2) is not even on the roadmap at this moment.
Have I misunderstood something, or does this approach make sense?
You can do RAID1 with more than two drives, but you'll only get two copies of each chunk of data. In this scenario, when one drive dies you can still write new data in RAID1 to the remaining space on the surviving drives, so mounting the FS writeable in degraded mode doesn't risk leaving the FS in a state where the safety is hard to determine on the next mount. If space permits, you can also rebalance before even shutting down to remove the failed drive, also avoiding the corner case.
Being able to do N-way mirroring with three or more copies of the data would be nice, but it's not necessary; 2-way mirroring across 3 or more drives is sufficient, and the hot spare feature will be more widely useful.
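For what it's worth, 2-way mirroring across three drives is just the ordinary raid1 profile created on (or grown onto) three devices. A minimal sketch, with placeholder device names and mount point:

```
# create a filesystem with two copies of everything spread across three drives
mkfs.btrfs -d raid1 -m raid1 /dev/sda /dev/sdb /dev/sdc

# or grow an existing two-drive RAID1 by adding a third drive and rebalancing
btrfs device add /dev/sdc /mnt
btrfs balance start /mnt
```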
I was referring to this sequence of events:
1) 2-way mirroring across 2 drives
2) one drive fails
3) buy and plug a new drive
4) rebalance to have 3-way mirroring across 3 drives (with one being out): this is currently not possible
5) remove the failed drive, ending with 2-way mirroring across 2 drives
But it seems that you are referring to:
1) 2-way mirroring across 3 drives
2) one drive fails
3) rebalance to have 2-way mirroring across the 2 working drives
4) remove the failed drive, ending with 2-way mirroring across 2 drives
I assume that people don't/won't start the initial RAID1 with 3 drives.
Anyway, I would find 3-way mirroring across 3 drives very useful, as it gives a simple, identical, foolproof process for replacing a faulty hard drive, whether it has just some corrupted (but still readable) data or has completely failed: just plug in a new drive, rebalance, reboot, and remove the defective drive.
> rebalance to have 3-way mirroring across 3 drives (with one being out): this is currently not possible
I'm not sure this even has meaning. But anyway, it's probably pointless to try to kick off a rebalance when the FS is still trying to use a dead drive. Either use the device replace command (which isn't stable yet), or tell btrfs to delete the dead drive and then add the replacement drive. If the problem drive is failing but not completely dead yet, the device replace command is supposed to move data over with a minimum of excess changes to drives other than the ones being removed and added. But the device replace command doesn't properly handle drives with bad sectors yet, so the separate remove and add actions are more reliable, albeit slower, and put more work on the other drives in the array.
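Roughly, with placeholder device names, devid, and mount point (and assuming the FS is already mounted degraded), the two approaches look like:

```
# approach 1: device replace (missing/failing drive assumed to be devid 2)
btrfs replace start 2 /dev/sdd /mnt

# approach 2: separate add and delete
# (on a two-device RAID1 the add usually has to come before the delete,
#  because raid1 can't go below two devices)
btrfs device add /dev/sdd /mnt
btrfs device delete missing /mnt
btrfs balance start /mnt   # optional: spread chunks evenly across the drives
```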