
Use mirrored RAID and have mdadm do a full-disk compare/check every month (this is the default on Debian).
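On Debian that monthly check is driven by a cron job shipped with the mdadm package; roughly, it boils down to this (the md0 array name below is an assumption):

    # Debian ships /etc/cron.d/mdadm, which periodically runs:
    /usr/share/mdadm/checkarray --all
    # under the hood that just pokes sysfs, e.g. for an array md0:
    echo check > /sys/block/md0/md/sync_action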

Additionally, use smartmontools and configure it to run a short self-test each night and a long self-test (i.e. a full-disk read) each week.
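A minimal smartd.conf sketch for that schedule (the exact times are my choice; adjust to taste):

    # /etc/smartd.conf: monitor all devices; short self-test nightly at 02:00,
    # long self-test (full surface read) every Sunday at 03:00
    DEVICESCAN -a -s (S/../.././02|L/../../7/03)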

This will catch/flag errors early, and mdadm can then detect and correct them from the redundant copy.



Yes, it can detect errors, but it can't continue to function correctly (read: return the correct data), because without checksums it doesn't know which of the differing copies is the damaged one.

Moreover, if it doesn't always read both copies of the data (which it may well not, for performance reasons), damaged data can be silently propagated to all mirrors: the damaged data is returned to an application, the application rewrites it, and the rewrite lands on every mirror.

Compare that to a filesystem with checksums, which, in addition to detecting such a problem, can continue to function completely correctly in the face of it.
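To make that concrete, here's a toy sketch of checksum-based arbitration (the file names and the recorded-checksum file are made up for illustration; real filesystems do this per block, inside the I/O path):

    # copy_a/copy_b stand in for the two halves of a mirrored block;
    # good.sha256 holds the checksum recorded when the block was written
    good=$(cut -d' ' -f1 good.sha256)
    for copy in copy_a copy_b; do
        if [ "$(sha256sum < "$copy" | cut -d' ' -f1)" = "$good" ]; then
            echo "$copy matches the recorded checksum; serve it and repair the other from it"
        fi
    done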


Yep. "What happens if you read all the disks successfully but the redundancy doesn't agree?" is a great question.

Mirrors and RAID5: there's obviously no way that `md` software RAID can help, since it doesn't know which is correct. What about RAID6 though? Double parity means `md` would have enough information to determine which disk has provided incorrect data. Surely it does this, right?

Wrong. In the event of any parity mismatch, `md` assumes the data disks are correct and rewrites the parity to match. See the "Scrubbing and Mismatches" section of `man 4 md`:

https://linux.die.net/man/4/md

If you scrub a RAID6 array containing a disk that returns bad data, `md` helpfully overwrites your two disks of redundancy to agree with the one disk that's wrong. Array consistent, job done, data... eaten.
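You can watch this behaviour through the sysfs knobs the man page describes (assuming an array at /dev/md0):

    echo check > /sys/block/md0/md/sync_action   # read-only scrub
    cat /sys/block/md0/md/mismatch_cnt           # nonzero: copies/parity disagree
    # 'repair' resolves mismatches by trusting the data blocks and
    # regenerating the parity to match, i.e. the behaviour described above
    echo repair > /sys/block/md0/md/sync_action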


That's incredible! Thanks for the insight.

Any recommendations for detecting/correcting bitrot with RHEL 7.4 at the filesystem or lower levels?


Disk A and disk B both contain file SomeFile.

On disk B this file has rotted.

When reading the file SomeFile into memory, the read will be distributed among the disks for performance reasons (and it will probably need to span a multiple of the stripe size).

Ok, the file is read into memory, including the bit-rotted part from disk B. Now we write the file blocks back - as one does.

Voila! Both disks now contain the bit rot. And mdadm will not complain - disks A and B are identical for the area of file SomeFile.


Moreover, even if you don't read the file and the bit rot is only discovered during the monthly compare, at least on Linux the disk that is considered correct is chosen essentially at random. So you need at least three disks to have some semblance of protection. Have you guys seen many laptops that come with three or more drives?
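For reference, a three-way md mirror is easy to create (device names below are assumptions), though as far as I know md's repair pass still doesn't take a majority vote among the copies, so this only improves the odds:

    mdadm --create /dev/md0 --level=1 --raid-devices=3 \
        /dev/sda1 /dev/sdb1 /dev/sdc1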

Just use ZFS. Even on a single disk setup you will at least not get silent bit rot.
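A minimal sketch (the pool name and device are assumptions):

    zpool create tank /dev/sdb
    zfs set copies=2 tank    # optional: store two copies of every block, so ZFS
                             # can even self-heal some rot on a single disk
    zpool scrub tank         # re-verify every checksum on demand
    zpool status -v tank     # reports any files with unrecoverable errors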


"Never go to sea with two chronometers; take one or three."

- adage cited in The Mythical Man-Month


Or just do raidz2 (double parity) in ZFS and call it a day.


Actually, it's better to just do mirrors. Avoid RAIDZ at all costs if you care about performance and the ability to resilver in a reasonable amount of time.


Sure, I agree. But nested mirrors still suffer from the same issue: lose a drive and you lose everything.


> But nested mirrors still suffer from the same issue of losing a drive and you lose everything.

Are you referring to mirroring a volume or dataset on a single disk? Why would you want to do that instead of mirroring among multiple drives?


How would you set up a large pool?

Two sets of, say, 5 disks each in raidz1, mirrored, would still fail if a disk in one set failed and a disk in the other set failed. I guess you could do a striped setup of 5 sets of 2 disks in mirrors, but it still seems wicked risky to me. I do agree, though, that mirroring has been the best for speed, but a lot of that changes with nicer SSDs, especially NVMe ones.


I was curious what a "nested" mirror really is. What exactly is nested?

I'd set up a large pool with mirror vdevs, i.e. n sets of 2 disks per mirror.
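Concretely, something like this (device names are assumptions):

    # a pool of three two-disk mirror vdevs; ZFS stripes across the vdevs
    zpool create tank \
        mirror /dev/sda /dev/sdb \
        mirror /dev/sdc /dev/sdd \
        mirror /dev/sde /dev/sdf

Losing a disk degrades only its own mirror, and the resilver reads from just that disk's partner.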

My half-remembered reasoning was that backups manage the risk you'll lose data. But replacing a disk in a mirror vdev is much easier, and faster, than doing so with RAIDZ.

The risk with RAIDZ is that resilvering has to read every remaining disk in the vdev, is much more intensive than a simple mirror resilver, and thus the probability that additional drives fail during it is much higher.

Here's a blog post that I definitely read the last time I was reading up on this:

- [ZFS: You should use mirror vdevs, not RAIDZ. – JRS Systems: the blog](http://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs-...)


A Reddit post about that blog post in my other reply:

- [You should use mirror vdevs, not RAIDZ. : DataHoarder](https://www.reddit.com/r/DataHoarder/comments/2v0quc/you_sho...)


I wonder if resilvering is still an issue with SSDs. But I cede your point: vdevs of two-disk mirrors make sense. It still doesn't sit well with me, but it makes sense.


According to OpenZFS's changelog for v0.7, resilvering is smarter now: https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.7.0



