
RAID generally does not protect against bit-rot (i.e. undetected errors); it only protects against detectable errors / catastrophic disk failure.

ZFS does protect against undetected (at the disk level) errors. To a first approximation, it does this by keeping block checksums alongside pointers so that it can verify that a block has not been changed since it was referenced, by keeping multiple checksummed root blocks, and by having enough redundancy to reconstruct blocks that fail their checksums. Naturally, there are information-theoretical limits to the number of corruptions that may occur for detection/correction to be guaranteed.
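To make the "checksums alongside pointers" idea concrete, here is a toy sketch in Python. It is not how ZFS is actually implemented; `BlockPointer`, `write_block` and `read_block` are names I made up for illustration, and the "disks" are just in-memory byte arrays. The point is only that the expected checksum lives with the pointer rather than with the data, so a bad copy can be detected and then repaired from a good one.

```python
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 hex digest, standing in for whatever checksum the filesystem uses."""
    return hashlib.sha256(data).hexdigest()


class BlockPointer:
    """Hypothetical pointer that records the checksum of the block it references."""

    def __init__(self, copies: list, expected: str):
        self.copies = copies      # redundant copies (think: the sides of a mirror)
        self.expected = expected  # kept with the pointer, not with the data


def write_block(data: bytes, n_copies: int = 2) -> BlockPointer:
    """Store n_copies of the data and remember its checksum in the pointer."""
    return BlockPointer([bytearray(data) for _ in range(n_copies)], checksum(data))


def read_block(ptr: BlockPointer) -> bytes:
    """Return verified data, overwriting any bad copy with a good one."""
    good = None
    for copy in ptr.copies:
        if checksum(bytes(copy)) == ptr.expected:
            good = bytes(copy)
            break
    if good is None:
        # Every copy is corrupt: this is the limit on guaranteed correction.
        raise IOError("all copies failed their checksum; unrecoverable")
    for i, copy in enumerate(ptr.copies):
        if checksum(bytes(copy)) != ptr.expected:
            ptr.copies[i] = bytearray(good)  # self-heal the bad copy
    return good


ptr = write_block(b"hello")
ptr.copies[0][0] ^= 0x01          # flip one bit in one copy
assert read_block(ptr) == b"hello"
assert bytes(ptr.copies[0]) == b"hello"  # the bad copy was repaired
```

If every copy is corrupted before a read happens, the sketch raises instead of returning bad data, which is the detection-without-correction case.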

You should refer to the relevant Wikipedia pages if you want more detail than that :p

edit: To answer your actual question: yes, you should be able to leave a ZFS box in the corner unattended for years with reasonable confidence that your data is safe and uncorrupted (tune redundancy settings to taste on the space efficiency vs. error tolerance trade-off). Two caveats: 1) The machine must have an effective means of communicating a catastrophic disk failure for you to resolve (this should hold for any made-for-purpose NAS device, but you'll need to do some work if you're DIY). 2) ZFS does not actively patrol for corruption; it fixes problems when it encounters them. If your data is not being periodically accessed, there needs to be some provision for walking the pool and correcting any accumulated errors (the zpool scrub command exists for this purpose, but it has to be run).
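For the second caveat, a minimal sketch of a wrapper you could call from cron or a systemd timer. The only real command it relies on is `zpool scrub <pool>`; the pool name `tank` and the helper names `scrub_command` and `start_scrub` are my own assumptions. As far as I know, a plain `zpool scrub` kicks off the scrub and returns, and it exits non-zero if it cannot start one, which is why the sketch uses `check=True`.

```python
import subprocess


def scrub_command(pool: str) -> list:
    """Build the argv for starting a scrub on the given pool."""
    return ["zpool", "scrub", pool]


def start_scrub(pool: str, run=subprocess.run) -> None:
    """Start a scrub; with the default runner, a non-zero exit raises CalledProcessError.

    `run` is injectable so the function can be exercised without a real pool.
    """
    run(scrub_command(pool), check=True)


if __name__ == "__main__":
    start_scrub("tank")  # "tank" is a placeholder pool name
```

How often to run it is a judgment call; the idea is just that every block gets read and verified on a schedule rather than only when something happens to access it.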




One thing I didn't consider until just now (sigh) is that if you're using ZFS as a backup medium (I am), unless your source drive is also ZFS (mine isn't), you're still exposed to whatever bit-rot happens on the source drive, since the corrupted data would then get faithfully backed up to ZFS.


> Naturally, there are information-theoretical limits to the number of corruptions that may occur for detection/correction to be guaranteed.

To my practical point: does it tell me when it approaches that limit, or do I have to put in more maintenance? Can it be fixed by swapping one of the drives?


If you are using ZFS with redundancy (a mirror or raidz), the flipped bit (or "bit rot") problem is effectively solved. In normal operation you need not give it another thought.

You still need to worry about failing drives and about the integrity of your raidz arrays (or whatever), but that has nothing to do with the flipping bits.

That being said, you can see statistics about error corrections (which should typically be at or near zero), and if you see a lot of them it might be advance warning of a drive dying. But the actual bit errors themselves would not be a problem, and you would not need to take any action specifically related to them.
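If you want to watch those statistics automatically, something like the following sketch could work. It assumes the usual `zpool status` layout with a `NAME STATE READ WRITE CKSUM` table; the sample text is hand-written rather than captured from a real pool, the exact layout can differ between versions, and `checksum_errors` is a hypothetical helper name.

```python
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

    NAME        STATE     READ WRITE CKSUM
    tank        ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sda     ONLINE       0     0     0
        sdb     ONLINE       0     0     3

errors: No known data errors
"""


def checksum_errors(status_text: str) -> dict:
    """Map device name -> CKSUM count, parsed from `zpool status`-style text."""
    counts = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True  # device rows follow the header
            continue
        if in_table:
            if len(fields) < 5:
                in_table = False  # a blank or short line ends the table
                continue
            if fields[4].isdigit():  # skip anything that is not a plain count
                counts[fields[0]] = int(fields[4])
    return counts


# Report only devices with a non-zero checksum error count.
suspects = {dev: n for dev, n in checksum_errors(SAMPLE).items() if n > 0}
assert suspects == {"sdb": 3}
```

In practice you would feed it the output of `zpool status` and alert when `suspects` is non-empty or keeps growing.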


The limits are on the number of simultaneous failures. Data is safe provided that too many errors do not accumulate before they can be detected and corrected. This is a fundamental fact: no real storage system can tolerate an unbounded number of simultaneous errors without an unbounded amount of space for replicas. You can control the number of allowable simultaneous errors by tuning redundancy settings (the trade-off is space efficiency vs. probability of data loss). It is straightforward to put in place an automated process to guarantee that errors are detected within some finite period of time.
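A back-of-the-envelope way to see the trade-off, under a deliberately crude model that is my assumption and not anything ZFS computes: each copy of a block is corrupted independently with probability `p` during one detection interval (say, the time between scrubs), and the block is lost only if all copies go bad within the same interval. Real failures are often correlated, so treat the numbers as illustrative only.

```python
import math


def loss_probability(p: float, copies: int, intervals: int = 1) -> float:
    """Chance that at least one interval sees every copy of a block corrupted.

    p         -- assumed per-copy corruption probability within one interval
    copies    -- number of redundant copies of the block
    intervals -- number of detect-and-repair intervals considered
    """
    if not 0.0 <= p <= 1.0 or copies < 1 or intervals < 0:
        raise ValueError("need 0 <= p <= 1, copies >= 1, intervals >= 0")
    per_interval = p ** copies  # all copies bad in the same interval
    if per_interval >= 1.0:
        return 1.0 if intervals > 0 else 0.0
    # 1 - (1 - q)^n, computed stably for very small q
    return -math.expm1(intervals * math.log1p(-per_interval))


# More copies or shorter intervals (smaller p per interval) both push the risk down.
assert loss_probability(1e-6, 3) < loss_probability(1e-6, 2)
```

Shortening the interval lowers `p`, and adding copies raises the exponent, which are the two knobs the comment above describes.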



