
Storage is a really weak area of mine, but it is important, and I do not see much technical discussion here anyway, so I am going to expose myself:

How does integrity protection work in practice? I know that the bits on the HDD itself do not map one to one to the bits you can actually use, and that the extra bits are there to protect us from flipped bits.

I know that beyond that, RAID/RAIDZ is supposed to help. I guess redundancy alone does not help: two copies would not let you determine which bit is the correct one, you would need three. Once you have those, I guess you would be able to detect corruption. Am I guessing correctly that duplication beyond 3 copies would just help with read speed and with the case where a drive fails while another is already dead (e.g. when you are rebuilding)?

What then? Does RAID/ZFS automatically fix it? I imagine at some point the drive would have to be replaced, do you have to check that manually or does it detect that by itself? I imagine after that you would have to run a rebuild.

So I guess my question would be: could I put a FreeNAS box in a corner of a room and leave it there? Would it just blink when I need to put a new drive in, or would it need more maintenance? Of course I am talking about worst-case scenarios here, so say I want this to live for 10 years sitting in the corner.




RAID generally does not protect against bit-rot (i.e. undetected errors); it only protects against detectable errors / catastrophic disk failure.

ZFS does protect against undetected (at the disk level) errors. To a first approximation, it does this by keeping block checksums alongside pointers so that it can verify that a block has not been changed since it was referenced, by keeping multiple checksummed root blocks, and by having enough redundancy to reconstruct blocks that fail their checksums. Naturally, there are information-theoretical limits to the number of corruptions that may occur for detection/correction to be guaranteed.
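
To make the block-checksum idea a bit more concrete, here is a toy Python sketch of the read path (my own illustration, not ZFS's actual on-disk format or code): the checksum travels with the pointer, so a silently corrupted copy can be detected and healed from a good replica.

    # Toy sketch only, not ZFS's real layout: the parent keeps the child's
    # checksum, so a silently corrupted copy is detected and healed.
    import hashlib

    def checksum(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()   # ZFS offers several algorithms

    def read_block(replicas, expected_sum):
        good = next((d for d in replicas if checksum(d) == expected_sum), None)
        if good is None:
            raise IOError("unrecoverable: no replica matches its checksum")
        for i, d in enumerate(replicas):
            if checksum(d) != expected_sum:
                replicas[i] = good             # "self-healing" rewrite
        return good

    # Two mirrored copies, one with a silent bit flip:
    original = b"important data"
    replicas = [b"importAnt data", original]
    assert read_block(replicas, checksum(original)) == original
    assert replicas[0] == original             # the bad copy was repaired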

You should refer to the relevant wikipedia pages if you want more detail than that :p

edit: To answer your actual question, yes, you should be able to leave a ZFS box in the corner unattended for years with reasonable confidence that your data is safe and uncorrupted (tune redundancy settings to your taste on the space efficiency vs. error tolerance trade-off). Two caveats: 1) the machine must have an effective means of communicating a catastrophic disk failure for you to resolve (this should hold for any made-for-purpose NAS device, but you'll need to do some work if you're DIY). 2) ZFS does not actively patrol for corruption; it fixes problems when it encounters them. If your data is not being periodically accessed, there will need to be some provision for walking the filesystem and correcting any accumulated errors (the ZFS scrub util exists for this purpose, but it has to be used).
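
For caveat 2, here is a rough Python sketch of what "periodically walk the filesystem" can look like in practice. The pool name "tank" and the notify_admin hook are assumptions for illustration; it just shells out to the standard zpool CLI and could be run from cron or a systemd timer.

    # Rough sketch, not a polished tool.
    import subprocess

    def start_scrub(pool: str = "tank") -> None:
        # Kicks off a scrub; it runs in the background and may take hours.
        subprocess.run(["zpool", "scrub", pool], check=True)

    def check_health() -> None:
        # "zpool status -x" prints "all pools are healthy" when nothing is wrong;
        # run this after the scrub has had time to finish.
        out = subprocess.run(["zpool", "status", "-x"],
                             capture_output=True, text=True).stdout
        if "all pools are healthy" not in out:
            notify_admin(out)

    def notify_admin(message: str) -> None:
        print("ALERT:", message)   # stand-in for e-mail/SMS/whatever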


One thing I didn't consider until just now (sigh) is that if you're using ZFS as a backup medium (I am), unless your source drive is also ZFS (mine isn't), you're still exposed to the same bit-rot your source drive is, since that change would then get backed up to ZFS.


> Naturally, there are information-theoretical limits to the number of corruptions that may occur for detection/correction to be guaranteed.

To my practical point: does it tell me when it approaches that limit, or do I have to put in more maintenance? Can it be fixed by swapping one of the drives?


If you are using ZFS, the flipped bit (or "bit rot") problem is completely solved. You need never give it another thought.

You still need to worry about failing drives and about the integrity of your raidz arrays (or whatever), but that has nothing to do with the flipping bits.

That being said, you can see statistics about error corrections (which should typically be near zero), and if you see a lot of them it might be advance warning of a drive dying. But the actual bit errors themselves would not be a problem and you would not need to take any action specifically related to them.


The limits are on the number of simultaneous failures. Data is safe provided that too many errors do not accumulate before they can be detected and corrected. This is a fundamental fact: no real storage system can tolerate an unbounded number of simultaneous errors without an unbounded amount of space for replicas. You can control the number of allowable simultaneous errors by tuning redundancy settings (the trade-off is space efficiency vs. probability of data loss). It is straightforward to put an automated process in place to guarantee that errors are detected within some finite period of time.
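
To put rough numbers on that trade-off, here is a back-of-envelope Python calculation under a deliberately simplified model (independent failures, one per-disk failure probability per detect-and-repair window; the 0.001 figure is just an assumption for illustration):

    from math import comb

    def p_data_loss(n_disks, tolerated, p_fail_in_window):
        """P(more than `tolerated` of n_disks fail before repair)."""
        return sum(comb(n_disks, k) * p_fail_in_window**k *
                   (1 - p_fail_in_window)**(n_disks - k)
                   for k in range(tolerated + 1, n_disks + 1))

    p = 0.001   # assumed per-disk failure probability within one scrub/repair window
    print(p_data_loss(2, 1, p))   # 2-way mirror:   ~1e-6
    print(p_data_loss(6, 1, p))   # 6-disk raidz1:  ~1.5e-5
    print(p_data_loss(6, 2, p))   # 6-disk raidz2:  ~2e-8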


> Two copies would not let you determine which bit is the correct one, you would need three. Once you have those, I guess you would be able to detect corruption. Am I guessing correctly that duplication beyond 3 copies would just help with read speed and with the case where a drive fails while another is already dead (e.g. when you are rebuilding)?

You're confusing two topics: integrity vs. availability. 3x copies, RAID4/5 XOR parity, RAIDZ/RAIDZ2 (roughly RAID5/RAID6 analogues), etc. are about increasing the availability of your data in the face of device failures/partitions.

The integrity of your data is computed in different ways in different systems, but you can safely think of it as a pile of checksums stacked on top of each other. There is a half-decent blog entry[0] on Oracle's website, but you can mentally model it as having a checksum on disk for each block, which is then combined with other blocks into some larger object that is also checksummed, and on and on until you have "end-to-end integrity". It is a feature of all serious storage systems.
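
As a toy illustration of that mental model (a generic Merkle-tree-style hierarchy in Python, not the layout of ZFS or any particular product):

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    blocks = [b"block-0", b"block-1", b"block-2", b"block-3"]
    leaf_sums = [h(b) for b in blocks]
    group_sums = [h(leaf_sums[0] + leaf_sums[1]), h(leaf_sums[2] + leaf_sums[3])]
    root_sum = h(group_sums[0] + group_sums[1])

    # Any flipped bit in any block changes its leaf checksum, which changes
    # its group checksum, which changes the root: end-to-end integrity.
    blocks[2] = b"block-2 with a flipped bit"
    assert h(blocks[2]) != leaf_sums[2]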

I'd love it if someone who is more familiar with ZFS could respond with how ZFS's internals work for you. I worked at a direct competitor of Sun's, and the two companies sued each other over this stuff, so I actively tried not to learn about ZFS. I regret this.

[0]: https://blogs.oracle.com/bonwick/entry/zfs_end_to_end_data


It seems that you might have edited your penultimate sentence, but just to make sure that the record is unequivocally clear on this: NetApp and Sun didn't simply "sue each other", NetApp initiated patent litigation against Sun in East Texas, and Sun countersued as a defensive maneuver.[1]

As for your ZFS question, Bonwick's blog is certainly a good source -- though if one is looking for a thorough treatment, I might recommend "Reliability Analysis of ZFS" by Asim Kadav and Abhishek Rajimwale.[2]

[1] https://news.ycombinator.com/item?id=9129784

[2] http://pages.cs.wisc.edu/~kadav/zfs/zfsrel.pdf


I did not edit that part IIRC, I try not to edit anything substantially that I post online.

As far as NetApp v. Sun, it was very unfortunate. I could rant for days about how crappy the business reasons were for NetApp going after Sun (zomg coraid!).

I use your quote about Oracle being a lawn mower from your usenix presentation all the time. Thank you for your reply.


So many levels to this. Drives themselves remap sectors to handle bad sectors, and some have integrated ECC bits... there are also things at the sensor level to help recover from errors. Once you are at the block level, you have simple parity checking with some redundancy models, but the good ones use ECC bits. Usually this is done at the block-device layer, but ZFS and some other filesystems do it at the filesystem layer (sometimes you have multiple layers of this). You're right that beyond a few copies the main win tends to be read speed, but there is one other factor: surviving rebuilds. As drives get larger, the failure rate is such that the chance of failure during a rebuild gets disturbingly high, so you need additional redundancy to avoid catastrophic failure during recovery.

As for "automatically fix it", the short answer is there is a lot of stuff that automatically fixes problems, but it is a leaky abstraction. RAID-5 rebuilds are notoriously terrible for performance, and often it is easier to have logic for dealing with failures & redundancy at the application layer.

FreeNAS and similar projects are definitely intended to be turnkey storage solutions. They have their strengths & weaknesses, but the notion that you just plug it in and go isn't too far off. Usually you don't go with the blinky light, but with an alerting mechanism (e-mail, SMS, whatever) that you integrate with it for notifications about problems. In principle, it is a fire & forget kind of thing.


An HDD/SSD without ECC would not work, and would not have worked for a long time now. Many of the reads performed require some ECC use.

ECC is always written alongside the actual block, and the overhead for ECC is the reason for the move from 512-byte sectors to 4 KB sectors in HDDs. For SSDs the data is already written in different block sizes depending on the NAND and the internal representation, and ECC is done over units larger than 512 bytes.

The probability of failure during rebuild is not really directly linked to drive size; the usual interpretation of the drive BER is wrong (media BER is specified across a large population of drives, not just one drive).


> An HDD/SSD without ECC would not work, and would not have worked for a long time now. Many of the reads performed require some ECC use.

Yeah, I expressed that badly. I always forget about the ECC bytes that are in the firmware.

> The probability of failure during rebuild is not really directly linked to drive size; the usual interpretation of the drive BER is wrong (media BER is specified across a large population of drives, not just one drive).

Regardless of interpretations of BER, I can't agree about drive sizes. The phenomenon of failures during rebuild is well documented and the driving principle behind double-parity RAID. Adam Leventhal (who ought to know this stuff better than either of us) wrote a paper several years back on the need for triple-parity RAID, and it was entirely driven by increased drive densities: http://queue.acm.org/detail.cfm?id=1670144

The reality is that higher drive densities mean you lose more bytes at a time when you have a drive failure, and that means more bytes you want to have "recovered".


Adam Leventhal uses the wrong interpretation of the BER value, so I wouldn't take his words at face value. Adam's whole argument is based on the disk BER, not on density.

I do assume, though, that the RAID array does a periodic media scrub (BMS); if you don't, you are at risk anyway, and I'd call that negligence in maintaining your RAID.

If you do use scrubbing, the risk that another drive has a bad media spot is low, since it must have developed in the time since the last scan, and that is a bounded time (a week, two, four), so the risk of two drives having a bad sector at once is even lower (though never zero; backups are still a thing). If you couple that with TLER and proper media handling instead of dropping the disk on a media error, the risk to the data becomes very low, since it is not very likely that two disks will have a bad sector in the same stripe.

I've been working with HDDs and SSDs and developing software for enterprise storage systems for a number of years now. I worked at XIV, and across all the thousands of systems and hundreds of thousands of disks of many models, I've never seen two disks fail to read the same stripe; RAID recovery was always possible. Other problems are likely to hit you sooner than an actual RAID failure (a technician shutting down the system by pressing the UPS buttons, or a software bug).

I did learn of several failure modes that can increase the risk, but they depend on specific workloads that are not generally applicable. One of them is that if you write to one track far too often you may affect nearby tracks, and if the workload is high enough you don't give the disk time to fix this (the disk tracks this failure mode and will work around it in idle time). In such a case the same stripe can be affected on multiple disks, and the time for this to develop may (or may not) be shorter than the media scrub interval. Even then, a thin-provisioned RAID structure would reduce the risk of this failure mode, and giving disks some rest time (even just a few seconds) would allow the drive to fix this and the other problems it knows about.

All in all, RAID is not dead (yet).


So why would one ever bother with RAID-6 (or its moral equivalent, RAID-Z2)? I've yet to hear someone justify it on the basis of an actual requirement to survive two drives failing simultaneously.


The main problem is not two drives failing simultaneously but rather one failing shortly after the other. The more drives you have, the more that likelihood increases; the rebuild time also increases with large drives, so you are exposed to such a second disk failure for longer. If you look at the MTTF of the drive (or the mostly equivalent probability of failure), the risk increases because the MTTF does not increase at all with size; it is roughly constant, at least as specified by the vendors (HDDs are usually 1.2M hours). So the more drives you have, the more likely one of them is to fail, and once one fails, since the rebuild time grows with drive size, your chance of a second failure during the rebuild increases as well.
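
A quick back-of-envelope in Python shows the effect, using the vendor-style constant MTTF from above and a simple exponential failure model; the drive size, rebuild throughput, and array width are assumptions for illustration:

    from math import exp

    mttf_hours  = 1.2e6       # vendor-spec MTTF per drive
    capacity_tb = 10          # assumed drive size
    throughput  = 150e6       # assumed sustained rebuild rate, bytes/s
    surviving   = 7           # drives that must not fail during the rebuild

    rebuild_hours = capacity_tb * 1e12 / throughput / 3600
    p_one_drive   = 1 - exp(-rebuild_hours / mttf_hours)
    p_any_drive   = 1 - (1 - p_one_drive) ** surviving

    print(f"rebuild ~{rebuild_hours:.0f} h, "
          f"P(second failure during rebuild) ~{p_any_drive:.2%}")
    # Bigger drives -> longer rebuild_hours -> higher p_any_drive, while
    # mttf_hours stays the same: exactly the effect described above.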

Some systems mitigate that by having a more evenly distributed RAID such that the rebuild time doesn't increase that much as the drive size increases and is actually rather low. XIV systems are like that.


> The more drives you have, the more likely one of them is to fail, and once one fails, since the rebuild time grows with drive size, your chance of a second failure during the rebuild increases as well.

That was exactly what I was saying. Given the same sustained transfer rate, the bigger the drive, the longer the rebuild time, hence the greater the chance you'll have a failure during the rebuild. While you might think it would, throughput has not grown to match the increases in bit density & storage capacity.

SSDs have helped a bit in this area because their failure profiles are different and under the covers they are kind of like a giant array of mini devices, but AFAIK they still present challenges during RAID rebuilds.


Yes, those are the two key things: the population of drives (per model/version possibly, but ultimately the sample size is not publicly known), and the fact that the rate is generally a maximum, i.e. a "less than" symbol in front of the rate. Therefore it's wrong to say that every x bytes you should expect an unrecoverable read error. We have no idea how good the drives actually are; the reporting is in orders of magnitude anyway, so they could be 8 or 9 times more reliable than the reported rate, or even multiple orders of magnitude better. It's not untrue to say < 1 URE in 10^14 bits while the population actually experiences 1 URE in 10^16 bits. Why not advertise that? Well, maybe there's a product that costs a little more and advertises 1 URE in 10^15 bits. And other product classes that promise better.
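
For concreteness, here is the arithmetic behind why people over-read that spec (the drive size is an assumption for illustration):

    # What "< 1 URE in 1e14 bits" would mean if you (wrongly, per the
    # comment above) read it as an exact per-drive rate.
    ber_bits  = 1e14                      # advertised: fewer than 1 URE per this many bits
    drive_tb  = 10                        # assumed drive size
    bits_read = drive_tb * 1e12 * 8       # one full read of the drive

    print(bits_read / ber_bits)           # 0.8 "expected" UREs per full read

    # Taken literally that sounds alarming for rebuilds, but the spec is an
    # upper bound over a drive population; real drives may do far better.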


Give your computer ECC memory. Configure ZFS to scrub data regularly. It can be configured to email you about impending (SMART) or actual failures. You can label drives by ID/SN, or, if your chassis or controller supports it, it can blink to tell you which drive to replace.

It can be configured with a hot spare so the "resilver" starts immediately on failure detection, not whenever you happen to get around to it.


Let me add that you really want to have a hot spare, and if you can't do that (e.g. no extra bay in the chassis), then you should not delay on replacing a failed disk. Have a spare disk on hand, and take the time to replace it ASAP.

We lost everything on a FreeNAS system once because we were alerted to a failed-disk error, but decided to wait until the next week to replace it. But then a second disk failed, and we lost the storage pool. Lesson learned!

But this was an operator error, not a system problem. FreeNAS has been tremendously stable and reliable for us.


Everyone makes that mistake once. It is an underappreciated problem in building fault-tolerant storage arrays.

We buy a batch of 20 drives all at the same time, and they are all the same manufacturer, model, size etc... Possibly even from the same batch or date of manufacture. Then we put them in continuous use in the same room, at the same temperature, in the same chassis. Finally they have an almost identical amount of reads/writes.

Then we act shocked that two drives fail within a short interval of each other :)


Just a couple of weeks ago, I upgraded my home storage system from 9 disks (4x3 raidz) from the same manufacturer with partially consecutive serials.

Now it's 18 disks (3x6 raidz2) from 3 different manufacturers, and every vdev has 2 of each. The vdevs are also physically evenly spread throughout the case.

I sleep so much better. It was kind of a miracle the first setup survived the 4.5 years it did.


FYI for those running ZFS on Linux or FreeNAS (and maybe FreeBSD, as well):

The nice zfs built-in autoreplace functionality doesn't work on Linux or FreeNAS. You need some scripting/external tooling to do the equivalent. See

https://github.com/zfsonlinux/zfs/issues/2449

https://forums.servethehome.com/index.php?threads/zol-hotspa...

I've had a few drives fail on my ZFS on Linux fileserver and wondered why my hot spares weren't automatically kicking in, and this is why.

On Linux, if you don't use the zed script that's referenced in that Github issue above and just replace a failing drive manually, a hot spare is worse than useless, because you need to remove the hot spare from the array before you can use it with a manual replace operation.
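
For reference, the manual sequence looks roughly like this (a hedged sketch only; the pool and device names are made up, so check them against your own zpool status output first):

    import subprocess

    def run(*cmd: str) -> None:
        print("#", " ".join(cmd))
        subprocess.run(cmd, check=True)

    pool, failed_disk, spare = "tank", "sda", "sdf"    # hypothetical names

    run("zpool", "remove", pool, spare)                # detach the hot spare from the pool first
    run("zpool", "replace", pool, failed_disk, spare)  # then use it in a manual replace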


> How does integrity protection work in practice? I know that the bits on the HDD itself do not map one to one to the bits you can actually use, and that the extra bits are there to protect us from flipped bits.

Checksums.

If you get big enough checksums (aka: Reed-Solomon Codes), you can not only detect errors but correct them as well. See https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_cor... for the math.

Now that you have this "error-correcting checksum", where do you put it? Parity RAID (RAID4/5) means you place the error-correcting checksum on a different disk from the data it protects.

If you have three disks, A, B, and C, you'll put the data on A & B and the checksum on C. This is RAID4 (which is rarely used).

RAID5 is much like RAID4, except you also cycle between the drives, so the checksum information rotates among A, B, and C. Sometimes the data is on A&B, sometimes it's on B&C, and sometimes it's on A&C.
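
Here's a toy Python illustration of that single-parity idea with three "disks" (my own example, heavily simplified):

    a = bytes([1, 2, 3, 4])
    b = bytes([9, 8, 7, 6])
    parity = bytes(x ^ y for x, y in zip(a, b))   # stored on the third disk

    # Disk B dies; XOR of the survivors reconstructs it.
    recovered_b = bytes(x ^ p for x, p in zip(a, parity))
    assert recovered_b == b

    # RAID5 is the same math, but the parity chunk rotates among the disks
    # from stripe to stripe instead of living on one dedicated parity disk.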

-----------------------

> Could I put a FreeNAS box in a corner of a room and leave it there? Would it just blink when I need to put a new drive in, or would it need more maintenance? Of course I am talking about worst-case scenarios here, so say I want this to live for 10 years sitting in the corner.

Maybe, maybe not. If all the drives fail in those 10 years, of course not (hard drive arms may lose lubricant, and if the arms stop moving you won't be able to read the data).

"Good practice" means that you want to boot up the ZFS box and run a "scrub" on it every few months, to ensure all the hard drives are actually working. If one fails, you replace it and rebuild all the checksums (or the data from the checksums).

RAID / ZFS isn't a magic bullet. It just buys you additional time: time in which your rig has begun to break but is still in a repairable state.

ZFS has more checksums everywhere to check for a few more cases than simple RAID5. But otherwise, the fundamentals remain the same. You need to regularly check for broken hard drives and then replace them before too many hard drives break.

---------

This also means that no single box can protect you from a natural disaster: fires, floods, earthquakes... these can destroy your backup all at once. If all hard drives fail at the same time, you lose your data.


If you intend to try this, I would recommend getting SpinRite at grc.com and running it a couple of times every year, if you're going to be doing a lot of I/O on it.

I run it on all my HDDs and it's kept them alive and running smoothly for years at a time.




