> If you suffer a total disk failure of one of those disks in the array, you hav...

InvaderFizz · on May 29, 2022

> any read error on any of the remaining disks is a lost/corrupted file.

That is the meat of it. With traditional RAID it is the same issue, except you never know it happens because as long as the controller reads something, it's happy to replicate that corruption to the other disks. At least with ZFS, you know exactly what was corrupted and can fix it, with traditional RAID you won't know it happened at all until you one day notice a corrupted file when you go to use it.

RAID-Z1 is better than traditional RAID-5 in pretty much every conceivable dimension, it just doesn't hide problems from you.

I have encountered this literal scenario where someone ran ZFS on top of a RAID-6(don't do this, use Z2 instead). Two failed drives, RAID-6 rebuilt and said everything was 100% good to go. A ZFS scrub revealed a few hundred corrupted files across 50TB of data. Overwrote the corrupted files from backups, re-scrubbed, file system was now clean.

KennyBlanken · on May 29, 2022

You don't need to fix anything.

ZFS automatically self-heals an inconsistent array (for example if one mirrored drive does not agree with the other, or if a parity drive disagrees with the data stripe.)

ZFS does not suffer data loss if you "suffer a total disk failure."

I have no idea where you're getting any of this from.

lazide · on May 29, 2022

If the data on disk (with no redundant copies) is bad, you’ve (usually) lost data with ZFS. It isn’t ZFS’s fault, it’s the nature of the game.

The poster built a (non redundant) zfs pool on top of a hardware raid6 device. The underlying hardware device had some failed drives, and when rebuilt, some of the underlying data was lost.

ZFS helped by detecting it instead of letting the bad data though like would normally have happened.

fulafel · on May 30, 2022

The parity cannot be used in the degraded scenario that was under discussion.

See eg here where the increasing disk size vs specified unrecoverable read error rate is explored in relation to the question at hand: https://queue.acm.org/detail.cfm?id=1670144 (in the article Adam Leventhal from Sun, the makers of ZFS, talks about the need for triple parity).

Also, the conclusion "ensure your backups are really working" is an important point irrespective of this question, since you'll also risk losing data due to buggy software, human errors, ransomware, etc.

KennyBlanken · on May 29, 2022

You're not missing anything. They're completely wrong.

In RAID-Z, you can lose one drive or have one drive with 'bit rot' (corruption of either the parity or data) and ZFS will still be able to return valid data (and in the case of bit rot, self-heal. ZFS "plays out" both scenarios, checking against the separate file checksum. If trusting one drive over another yields a valid checksum, it overwrites the untrusted drive's data.)

Regular RAID controllers cannot resolve a situation where on-disk data doesn't match parity because there's no way to tell which is correct: the data or parity.

InvaderFizz · on May 29, 2022

The situation I laid out was a degraded Z1 array with the total loss of a single disk(not recognized at all by the system), plus bitrot on at least one remaining disk during resilver. Pairity is gone, you have checksum to tell you that the read was invalid, but even multiple re-reads don't give valid checksum.

How does Z1 recover the data in this case other than alerting you of which files it cannot repair so that you can overwrite them?

viraptor · on May 29, 2022

Why do you have bitrot to begin with? That's what scheduled scrubbing is for. You could of course by very unlucky and have a drive fail and the other get corruption on the same day, but check how often you find issues with scrubbing and tell me how likely that scenario is.

InvaderFizz · on May 29, 2022

I've had hundreds of drives in hundreds of terabytes of appliances over years. URE and resilver is a common occurrence, as in every monthly scrub across 200+ drives. This isn't 200 drives in a single array, this is over 4 appliances geographically distributed.

The drives have been champs overall, they're approaching an average runtime of about 8 years. During that 8 years we've lost about 20% of the drives in various ways.

It is almost guaranteed that when a drive fails, another drive will have a URE during the resilver process. This is a non-issue as we run RAID-Z3 with multiple online hotspares.

viraptor · on May 30, 2022

> monthly scrub

Are they used 24/7 at high iops? Why not nightly scrub?

InvaderFizz · on May 30, 2022

We could do weekly. The volume of data is large enough that even sequential scrubbing when idle is about an 18 hour operation. As it is, we're happy with monthly scrubbing on the Z3 arrays. We don't bother pulling drives until they run out of reallocatable sectors, this extends the service lifetime by a year in most cases.

I intentionally provisioned one of the long term archive only appliances with 12 hot spares. This was to prevent the need for a site visit again before we lifecycle the appliance. Currently down to seven hot spares.

That replacement will probably happen later this year. Should reduce the colo cost by power requirement reduction enough that the replacement 200TB appliance pays for itself in 18 months.

1500100900 · on May 29, 2022

They mean: lose one drive and have another with bit rot.

anarcat · on May 29, 2022

ah. right. that's the bit I was missing (pun intended).

thanks for the clarification.

in that sense, yes, of course, if you have bit rot and another disk failing, things go south with just two disk. ZFS is not magic.