
RAID5 is not dead yet (https://www.cafaro.net/2014/05/26/why-raid-5-is-not-dead-yet...). The problem with failures during rebuilds is overblown, IMO. The manufacturer-quoted URE rate (the probability of a failed read) is overstated: instead of 1 in 10^14, real drives are mostly more like 1 in 10^15 or better. Full disclosure: we're actually doing erasure coding in HDFS over RAID5 on servers (double insurance: if the RAID array goes down, we can recover from other servers in HDFS). But our expectation for 6x4TB arrays is not a 70%+ chance of a URE during a rebuild, rather a couple of percent. With ZFS or btrfs it won't actually matter for us, as we'll only lose a block on a URE, and that we can recover from the rest of the cluster.
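
For context, here's the standard back-of-the-envelope calculation as a minimal sketch. It assumes UREs are independent per bit read and that rebuilding a 6x4TB RAID5 array means reading the 5 surviving disks in full; the URE rates are illustrative:

    import math

    def p_ure_during_rebuild(surviving_disks, disk_bytes, ure_per_bit):
        """Chance of at least one URE while reading every surviving disk in full."""
        bits_read = surviving_disks * disk_bytes * 8
        # P(>=1 URE) = 1 - (1 - p)^n, approximated with exp() since n is huge
        return 1 - math.exp(-ure_per_bit * bits_read)

    # 6x4TB RAID5: a rebuild reads the 5 surviving 4 TB disks end to end.
    for label, rate in [("1 in 10^14 (spec sheet)", 1e-14),
                        ("1 in 10^15", 1e-15),
                        ("1 in 10^16", 1e-16)]:
        print(f"{label}: {p_ure_during_rebuild(5, 4e12, rate):.1%}")

    # 1 in 10^14 (spec sheet): 79.8%
    # 1 in 10^15: 14.8%
    # 1 in 10^16: 1.6%

The scary headline numbers rest almost entirely on taking the 1-in-10^14 spec figure at face value.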



    The problem with failures during rebuilds is overblown
I thought I was the only one who believed that. I've said this on Reddit before and ended up at something like -20 votes, with people flat-out arguing I'm falsifying an "impossibility".

I've got roughly 30 arrays in production, with between 4 and 12 disks in each. All are RAID5 + hot spare. If you believe the maths people keep quoting, the odds of seeing a total failure in a given year are close to 100%. I started using this configuration, across varying hardware, over 15 years ago, and the number of arrays has only grown since.

I'm not pretending one example proves the rule, or that it's totally safe, or that I would run a highly critical environment this way (before anyone comments: these environments do not meet that definition). But people have tried to show me maths giving a six-nines likelihood of failure, and I just don't for a second believe I'm that lucky.
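
For what it's worth, the maths people quote goes roughly like this (every input below is an illustrative assumption: ~2% annual failure rate per drive, 8 drives per array, and the oft-quoted ~70% chance of a URE per rebuild):

    # Naive "RAID5 is dead" arithmetic, with illustrative assumptions.
    arrays, drives_per_array = 30, 8
    afr = 0.02            # assumed annual failure rate per drive
    p_ure_rebuild = 0.70  # oft-quoted chance that a rebuild hits a URE

    p_rebuild_per_array = 1 - (1 - afr) ** drives_per_array  # ~14.9% per array per year
    p_loss_per_array = p_rebuild_per_array * p_ure_rebuild   # ~10.4%
    p_loss_any = 1 - (1 - p_loss_per_array) ** arrays        # ~96.3% across 30 arrays
    print(f"{p_loss_any:.1%}")

On those inputs the maths really does say near-certain loss every year, so 15 years without one suggests at least one of the inputs (most obviously the per-rebuild URE figure) is far too pessimistic.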


Well, at least on Linux, almost everyone using consumer drives has their array in a very common misconfiguration by default, and this leads to RAID5 collapsing much sooner than it should.

The misconfiguration is that the drive's SCT ERC timeout is greater than the kernel's SCSI command timer. So what happens on a URE is that a consumer drive goes into "deep recovery" and keeps trying to recover the bad sector well beyond the kernel's default command timer of 30 seconds. At 30 seconds the kernel assumes something is wrong and does a link reset. On SATA drives this obliterates the command queue and any other state in the drive. The drive never reports a read error, never reports which sector had the problem, and so RAID can't do its job of reconstructing the missing data from parity and writing it back to the bad sector.

So it's inevitable that these bad sectors pop up here and there, and then if there's a single drive failure, you effectively get one or more full stripes with two or more missing strips, and those whole stripes are lost just as if it were a 2-disk failure. It is possible to recover from this, but it's really tedious, and as far as I know there are no user-space tools to make such recovery easy.

I wouldn't be surprised if lots of Linux-based NASes were configured this way, with the user not buying the recommended drives because "FU vendor, those drives are expensive", etc.
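
If you want to check your own setup for this mismatch, here's a rough sketch (run as root; it assumes smartctl is installed, and device names, thresholds, and output format will vary by drive):

    # Compare the kernel's per-device command timer with the drive's SCT ERC setting.
    import glob, subprocess

    for dev in sorted(glob.glob("/sys/block/sd*")):
        name = dev.rsplit("/", 1)[-1]
        with open(f"{dev}/device/timeout") as f:
            kernel_timer = int(f.read().strip())  # kernel SCSI command timer, default 30s
        # SCT ERC is reported in tenths of a second; "Disabled" means the drive may
        # spend minutes in deep recovery before giving up.
        scterc = subprocess.run(["smartctl", "-l", "scterc", f"/dev/{name}"],
                                capture_output=True, text=True).stdout
        print(f"/dev/{name}: kernel command timer = {kernel_timer}s")
        print(scterc)

    # The usual fix is one of:
    #   smartctl -l scterc,70,70 /dev/sdX         # drive gives up after 7s, if supported
    #   echo 180 > /sys/block/sdX/device/timeout  # or raise the kernel timer instead

The rule of thumb is that the drive's ERC limit should sit comfortably below the kernel's command timer, so the drive reports the read error before the kernel resets the link.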


Don't forget the part where many consumer drives won't let you play with the SCT ERC settings at all, and some of them just completely crap out on a URE and won't come back.

(My personal favorite was when I discovered a certain model of "consumer" drives we had thousands of in production claimed to not support SCT ERC configuration, but if you patched smartctl to ignore the response to "do you support this", the drives would happily configure and honor it.)


Most enterprise-class drives are just consumer drives packaged with a bit more software, but I guess you know that.


Yeah, I was just entertained by how lazily the removal was implemented in the consumer drive FW.


Follow the money: who is selling the "RAID5 is dead" story? The main worry is correlated failures, if you have the same types of drives in an array and they all reach end of life at the same time.


Note that the manufacturers aren't actually saying the URE rate is X. They are saying it's less than X; it's a cap, so it isn't really a rate. The actual rate for two drives could be very different, maybe even more than an order of magnitude apart, but so long as both are below the spec's cap for such errors, it's considered normal operation.

So yeah, I agree: the idea in some circles that you will get a URE every ~12TB of data read is obviously b.s. We don't know what the real-world rate is, because of that little less-than sign that appears in all of these specs. We only know there won't be more errors than that, and not for a specific drive, but across a (virtual) sample of that make/model of drive.
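
For reference, the ~12TB figure is just the <1-in-10^14 cap read as if it were an exact per-bit rate:

    # One URE per 10^14 bits, taken literally, works out to one per ~12.5 TB read.
    print(1e14 / 8 / 1e12)   # 12.5 (TB) -- hence "a URE every ~12TB"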



