
RAID5 is not dead yet (https://www.cafaro.net/2014/05/26/why-raid-5-is-not-dead-yet...). The problem with failures during rebuilds is overblown, IMO. The manufacturer-quoted URE rate (the probability of a failed read) is overstated: instead of 1 in 10^14, real drives are mostly more like 1 in 10^15 or better. Full disclosure: we're actually doing erasure coding in HDFS over RAID5 on servers (double insurance: if the RAID array goes down, we can recover from other servers in HDFS). But our expectation for 6x4TB arrays is not a 70%+ chance of a URE during a rebuild, rather a couple of percent. With ZFS or btrfs it won't actually matter for us, as we'll only lose a block on a URE, and that we can recover from the rest of the cluster.
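
For context, here's the standard back-of-the-envelope calculation as a minimal sketch. It assumes UREs are independent per bit read and that rebuilding a 6x4TB RAID5 array means reading the 5 surviving disks in full; the URE rates are illustrative:

    import math

    def p_ure_during_rebuild(surviving_disks, disk_bytes, ure_per_bit):
        """Chance of at least one URE while reading every surviving disk in full."""
        bits_read = surviving_disks * disk_bytes * 8
        # P(>=1 URE) = 1 - (1 - p)^n, approximated with exp() since n is huge
        return 1 - math.exp(-ure_per_bit * bits_read)

    # 6x4TB RAID5: a rebuild reads the 5 surviving 4 TB disks end to end.
    for label, rate in [("1 in 10^14 (spec sheet)", 1e-14),
                        ("1 in 10^15", 1e-15),
                        ("1 in 10^16", 1e-16)]:
        print(f"{label}: {p_ure_during_rebuild(5, 4e12, rate):.1%}")

    # 1 in 10^14 (spec sheet): 79.8%
    # 1 in 10^15: 14.8%
    # 1 in 10^16: 1.6%

The scary headline numbers rest almost entirely on taking the 1-in-10^14 spec figure at face value.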



    The problem with failures during rebuilds is overblown
I thought I was the only one who believed that. I've said this on Reddit before and ended up at something like -20 votes, with people flat-out arguing I'm falsifying an "impossibility".

I've got roughly 30 arrays in production, with between 4 and 12 disks in each. All are RAID5 + hot spare. If you believe the maths people keep quoting, the odds of seeing a total failure in a given year are close to 100%. I started using this configuration, across varying hardware, over 15 years ago, and the number of arrays has only grown since.

I'm not pretending one example proves the rule, or that it's totally safe, or that I would run a highly critical environment this way (before anyone comments: these environments do not meet that definition). But people have tried to show me maths giving a six-nines likelihood of failure, and I just don't for a second believe I'm that lucky.
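
For what it's worth, the maths people quote goes roughly like this (every input below is an illustrative assumption: ~2% annual failure rate per drive, 8 drives per array, and the oft-quoted ~70% chance of a URE per rebuild):

    # Naive "RAID5 is dead" arithmetic, with illustrative assumptions.
    arrays, drives_per_array = 30, 8
    afr = 0.02            # assumed annual failure rate per drive
    p_ure_rebuild = 0.70  # oft-quoted chance that a rebuild hits a URE

    p_rebuild_per_array = 1 - (1 - afr) ** drives_per_array  # ~14.9% per array per year
    p_loss_per_array = p_rebuild_per_array * p_ure_rebuild   # ~10.4%
    p_loss_any = 1 - (1 - p_loss_per_array) ** arrays        # ~96.3% across 30 arrays
    print(f"{p_loss_any:.1%}")

On those inputs the maths really does say near-certain loss every year, so 15 years without one suggests at least one of the inputs (most obviously the per-rebuild URE figure) is far too pessimistic.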


Well, at least on Linux, almost everyone using consumer drives has their array in a very common misconfiguration by default, and this leads to RAID5 collapsing much sooner than it should.

The misconfiguration is that the drive's SCT ERC timeout is greater than the kernel's SCSI command timer. So what happens on a URE is that a consumer drive goes into "deep recovery" and keeps trying to recover the bad sector well beyond the kernel's default command timer of 30 seconds. At 30 seconds the kernel assumes something is wrong and does a link reset. On SATA drives this obliterates the command queue and any other state in the drive. The drive never reports a read error, never reports which sector had the problem, and so RAID can't do its job of reconstructing the missing data from parity and writing it back to the bad sector.

So it's inevitable that these bad sectors pop up here and there, and then if there's a single drive failure, you effectively get one or more full stripes with two or more missing strips, and those whole stripes are lost just as if it were a 2-disk failure. It is possible to recover from this, but it's really tedious, and as far as I know there are no user-space tools to make such recovery easy.

I wouldn't be surprised if lots of Linux-based NASes were configured this way, with the user not buying the recommended drives because "FU vendor, those drives are expensive", etc.
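
If you want to check your own setup for this mismatch, here's a rough sketch (run as root; it assumes smartctl is installed, and device names, thresholds, and output format will vary by drive):

    # Compare the kernel's per-device command timer with the drive's SCT ERC setting.
    import glob, subprocess

    for dev in sorted(glob.glob("/sys/block/sd*")):
        name = dev.rsplit("/", 1)[-1]
        with open(f"{dev}/device/timeout") as f:
            kernel_timer = int(f.read().strip())  # kernel SCSI command timer, default 30s
        # SCT ERC is reported in tenths of a second; "Disabled" means the drive may
        # spend minutes in deep recovery before giving up.
        scterc = subprocess.run(["smartctl", "-l", "scterc", f"/dev/{name}"],
                                capture_output=True, text=True).stdout
        print(f"/dev/{name}: kernel command timer = {kernel_timer}s")
        print(scterc)

    # The usual fix is one of:
    #   smartctl -l scterc,70,70 /dev/sdX         # drive gives up after 7s, if supported
    #   echo 180 > /sys/block/sdX/device/timeout  # or raise the kernel timer instead

The rule of thumb is that the drive's ERC limit should sit comfortably below the kernel's command timer, so the drive reports the read error before the kernel resets the link.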


Don't forget the part where many consumer drives won't let you play with the SCT ERC settings at all, and some of them just completely crap out on a URE and won't come back.

(My personal favorite was when I discovered a certain model of "consumer" drives we had thousands of in production claimed to not support SCT ERC configuration, but if you patched smartctl to ignore the response to "do you support this", the drives would happily configure and honor it.)


Most enterprise-class drives are just consumer drives packaged with a bit more software, but I guess you know that.


Yeah, I was just entertained by how lazily the removal was implemented in the consumer drive FW.


Follow the money: who is selling the "RAID5 is dead" story? The main worry is correlated failures, if you have the same types of drives in an array and they all reach end of life at the same time.


Note that the manufacturers aren't actually saying the URE rate is X. They are saying it's less than X; it's a cap, so it isn't really a rate. The actual rate for two drives could be very different, maybe even more than an order of magnitude apart, but so long as both are below the spec's cap for such errors, it's considered normal operation.

So yeah, I agree: the idea in some circles that you will get a URE every ~12TB of data read is obviously b.s. We don't know what the real-world rate is, because of that little less-than sign that appears in all of these specs. We only know there won't be more errors than that, and not for a specific drive, but across a (virtual) sample of that make/model of drive.
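
For reference, the ~12TB figure is just the <1-in-10^14 cap read as if it were an exact per-bit rate:

    # One URE per 10^14 bits, taken literally, works out to one per ~12.5 TB read.
    print(1e14 / 8 / 1e12)   # 12.5 (TB) -- hence "a URE every ~12TB"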



