> If you suffer a total disk failure of one of those disks in the array, you have likely lost some data. [...] The reason is, with a total loss of a single disk, any read error on any of the remaining disks is a lost/corrupted file.

Wait, what? If a RAID-(z)1 ZFS array loses one disk, there's data loss? I've run so many RAID-1 and RAID-10 arrays with mdadm that I can't even begin to count them, and I've had many drive failures. If any of those arrays had ended up with corrupted data, I would have been mad as hell.

What am I missing here? How is this even remotely acceptable?




> any read error on any of the remaining disks is a lost/corrupted file.

That is the meat of it. Traditional RAID has the same issue, except you never know it happened: as long as the controller reads something, it's happy to replicate that corruption to the other disks. At least with ZFS you know exactly what was corrupted and can fix it; with traditional RAID you won't know it happened at all until you one day notice a corrupted file when you go to use it.

RAID-Z1 is better than traditional RAID-5 in pretty much every conceivable dimension; it just doesn't hide problems from you.

I have encountered this literal scenario, where someone ran ZFS on top of a RAID-6 (don't do this, use Z2 instead). Two drives failed; the RAID-6 rebuilt and said everything was 100% good to go. A ZFS scrub revealed a few hundred corrupted files across 50TB of data. Overwrote the corrupted files from backups, re-scrubbed, and the file system was clean.


You don't need to fix anything.

ZFS automatically self-heals an inconsistent array (for example, if one mirrored drive does not agree with the other, or if a parity drive disagrees with the data stripe).

ZFS does not suffer data loss if you "suffer a total disk failure."

I have no idea where you're getting any of this from.


If the data on disk (with no redundant copies) is bad, you’ve (usually) lost data with ZFS. It isn’t ZFS’s fault, it’s the nature of the game.

The poster built a (non-redundant) ZFS pool on top of a hardware RAID-6 device. The underlying hardware device had some failed drives, and when it was rebuilt, some of the underlying data was lost.

ZFS helped by detecting it instead of letting the bad data through, as would normally have happened.


The parity cannot be used in the degraded scenario that was under discussion.

See e.g. this article, which explores increasing disk size versus the specified unrecoverable read error rate in relation to the question at hand: https://queue.acm.org/detail.cfm?id=1670144 (in the article, Adam Leventhal from Sun, the makers of ZFS, talks about the need for triple parity).
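
To put rough numbers on the disk size versus URE rate argument, here is a minimal back-of-the-envelope sketch. It assumes a spec-sheet rate of one unrecoverable read error per 10^14 bits (an illustrative figure often quoted for consumer drives, not a value taken from this thread), independent bit errors, and a resilver that reads every surviving disk in full. Real drives and real pools will differ, so treat the output as an order-of-magnitude guide only.

```python
import math

def p_at_least_one_ure(bytes_read: float, ure_per_bit: float = 1e-14) -> float:
    """Probability of hitting at least one URE while reading `bytes_read` bytes.

    Assumes independent bit errors at `ure_per_bit` (an assumed spec-sheet
    figure). Computes 1 - (1 - p)^n via log1p/expm1 for numerical stability.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-ure_per_bit))

TB = 1e12  # decimal terabyte, as drive vendors count it

# Hypothetical 4-disk single-parity array: 3 surviving disks are read in full.
for disk_tb in (1, 4, 12):
    p = p_at_least_one_ure(3 * disk_tb * TB)
    print(f"3 x {disk_tb} TB read during rebuild: P(at least one URE) ~ {p:.0%}")
```

Under these assumptions the risk climbs quickly with disk size, which is the motivation for double and triple parity in the linked article.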

Also, the conclusion "ensure your backups are really working" is an important point irrespective of this question, since you'll also risk losing data due to buggy software, human errors, ransomware, etc.


You're not missing anything. They're completely wrong.

In RAID-Z, you can lose one drive or have one drive with 'bit rot' (corruption of either the parity or data) and ZFS will still be able to return valid data and, in the case of bit rot, self-heal. ZFS "plays out" both scenarios, checking against the separately stored block checksum. If trusting one drive over another yields a valid checksum, it overwrites the untrusted drive's data.

Regular RAID controllers cannot resolve a situation where on-disk data doesn't match parity because there's no way to tell which is correct: the data or parity.
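
The "play out each scenario and check the checksum" idea can be sketched with a toy single-parity stripe. This is a simplified model, not ZFS's actual code: the XOR parity, the SHA-256 checksum, and the names `xor_blocks`, `checksum`, and `read_with_self_heal` are all assumptions made for illustration.

```python
import hashlib
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def checksum(data_blocks):
    """Checksum over the logical data, stored separately from the stripe."""
    return hashlib.sha256(b"".join(data_blocks)).digest()

def read_with_self_heal(data, parity, expected):
    """Return (good_data, repaired_index).

    repaired_index is None if nothing was wrong, len(data) if only the
    parity was stale, or the index of the data column that was rebuilt.
    """
    if checksum(data) == expected:
        # Data is good; the parity is the only thing that can be wrong.
        if xor_blocks(data) != parity:
            return list(data), len(data)
        return list(data), None
    # Distrust each data column in turn and rebuild it from the others.
    for i in range(len(data)):
        others = [b for j, b in enumerate(data) if j != i]
        candidate = list(data)
        candidate[i] = xor_blocks(others + [parity])
        if checksum(candidate) == expected:
            return candidate, i
    raise IOError("no single-column reconstruction matches the checksum")

good = [b"AAAA", b"BBBB", b"CCCC"]
parity, expected = xor_blocks(good), checksum(good)
rotted = [good[0], b"BxBB", good[2]]  # silent corruption on column 1
healed, repaired = read_with_self_heal(rotted, parity, expected)
print(healed == good, repaired)
```

A controller without the separate checksum only sees that data and parity disagree, with no way to tell which candidate is the right one.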


The situation I laid out was a degraded Z1 array with the total loss of a single disk (not recognized at all by the system), plus bitrot on at least one remaining disk during resilver. Parity is gone; you have the checksum to tell you that the read was invalid, but even multiple re-reads don't give a valid checksum.

How does Z1 recover the data in this case other than alerting you of which files it cannot repair so that you can overwrite them?
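
The degraded case described above can be sketched with the same kind of toy model (XOR parity plus a SHA-256 checksum, both assumptions for illustration, as is the name `resilver_missing_column`). With one column gone, the parity is fully consumed rebuilding it, so a rotted survivor produces a rebuilt stripe that fails the checksum and there is nothing left to repair from.

```python
import hashlib
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def resilver_missing_column(survivors, parity, expected, missing_index):
    """Rebuild the lost column from the survivors plus parity.

    Returns the full stripe if it verifies against the checksum, or None
    if the damage is detected but cannot be repaired from this stripe.
    """
    rebuilt = xor_blocks(list(survivors) + [parity])
    full = list(survivors)
    full.insert(missing_index, rebuilt)
    if hashlib.sha256(b"".join(full)).digest() != expected:
        return None  # known bad, so the affected file can be reported
    return full

good = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(good)
expected = hashlib.sha256(b"".join(good)).digest()

# Disk 0 is gone entirely; the survivors are clean.
clean = resilver_missing_column([b"BBBB", b"CCCC"], parity, expected, 0)
# Disk 0 is gone and a survivor has silently rotted.
rotten = resilver_missing_column([b"BxBB", b"CCCC"], parity, expected, 0)
print(clean == good, rotten)
```

In this model the only remaining options are exactly the ones named in the thread: report the damaged data and restore it from a backup or another copy.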


Why do you have bitrot to begin with? That's what scheduled scrubbing is for. You could of course be very unlucky and have a drive fail and the other get corruption on the same day, but check how often you find issues with scrubbing and tell me how likely that scenario is.


I've had hundreds of drives in hundreds of terabytes of appliances over years. UREs and resilvers are a common occurrence, as in every monthly scrub across 200+ drives. This isn't 200 drives in a single array; this is across 4 geographically distributed appliances.

The drives have been champs overall; they're approaching an average runtime of about 8 years. During those 8 years we've lost about 20% of the drives in various ways.

It is almost guaranteed that when a drive fails, another drive will have a URE during the resilver process. This is a non-issue as we run RAID-Z3 with multiple online hotspares.


> monthly scrub

Are they used 24/7 at high iops? Why not nightly scrub?


We could do weekly. The volume of data is large enough that even sequential scrubbing when idle is about an 18 hour operation. As it is, we're happy with monthly scrubbing on the Z3 arrays. We don't bother pulling drives until they run out of reallocatable sectors; this extends the service lifetime by a year in most cases.

I intentionally provisioned one of the long-term, archive-only appliances with 12 hot spares. This was to prevent the need for another site visit before we lifecycle the appliance. Currently down to seven hot spares.

That replacement will probably happen later this year. Should reduce the colo cost by power requirement reduction enough that the replacement 200TB appliance pays for itself in 18 months.


They mean: lose one drive and have another with bit rot.


ah. right. that's the bit I was missing (pun intended).

thanks for the clarification.

in that sense, yes, of course, if you have bit rot and another disk failing, things go south with just two disks. ZFS is not magic.



