Has RAID5 stopped working? (2013) (zdnet.com)
41 points by xearl on June 1, 2014 | 30 comments


The analysis in this article is horrifically poor. Among other things, just using the expected error rate as a pass/fail criterion is not a good idea. Even if the "typical" RAID5 array doesn't fail, isn't it still bad if 1/10th or 1/100th of them fail because of randomness in the error distribution?

There are also lots of ways to mitigate this problem even with a high error rate. In particular, if you get a bad sector, don't just kick the disk immediately out of the RAID; try recovering that one sector first, using the other disks in the RAID. The drive's built-in sector remapping should then bypass the newly-bad sector.

I wrote about this in detail in 2008: http://apenwarr.ca/log/?m=200809#08
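
For a rough sense of the numbers being argued about, here is a minimal back-of-the-envelope sketch (assuming the spec-sheet rate of one URE per 10^14 bits and independent, uniformly distributed errors, which real drives don't actually follow):

    # Chance of hitting at least one unrecoverable read error (URE) while
    # reading back an entire array during a rebuild. Idealized: assumes the
    # quoted 1e-14 URE/bit spec and independent, uniform errors.
    def p_rebuild_hits_ure(bytes_to_read, ure_per_bit=1e-14):
        bits = bytes_to_read * 8
        return 1 - (1 - ure_per_bit) ** bits

    # The article's example: ~12 TB of surviving data to read back.
    print(p_rebuild_hits_ure(12e12))   # ~0.62, per the spec sheet

The same formula is why the distribution matters: a configuration with a "small" per-rebuild probability still loses some fraction of arrays in the field.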


What you describe is known as disk scrubbing, and indeed if you are operating a RAID device and it doesn't do disk scrubs every so often, you are doing it wrong. Linux mdadm has this feature as well, and in most distributions that I'm aware of, disk scrubbing is performed once a month by default.

In addition, the reading of the BER and the immediate conclusion that RAID stands no chance is not quite correct. If you do your disk scrubs, your disks are mostly clean and the likelihood of failure is not that great. I was responsible for tens of thousands of disks in a storage system that did do disk scrubs; it did occasionally find a bad sector[1], and I've never even once seen the double failure that is so often warned about in the "RAID is dead" articles.

Please do your disk scrubs to avoid silently accumulating bad sectors, and remember that RAID is not backup; an extra copy is still needed if you really care about your data.

[1] I didn't have the presence of mind to compare the rate of error detection to the official BER.
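
For reference, on Linux md a scrub is just a write to sysfs; a minimal sketch along these lines (assuming an array at /dev/md0 and root privileges -- most distributions wrap the same thing in a monthly cron job or timer):

    # Kick off an md consistency check ("scrub") and report the mismatch
    # counter afterwards. Assumes the array is md0 and we are running as root.
    import pathlib, time

    md = pathlib.Path("/sys/block/md0/md")

    (md / "sync_action").write_text("check\n")          # start the check

    while (md / "sync_action").read_text().strip() != "idle":
        time.sleep(60)                                  # poll until it finishes

    print("mismatch_cnt:", (md / "mismatch_cnt").read_text().strip())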


Ok, so one thing that is bugging me: this is totally a software-based problem. At the moment, if I have a big hard disk, I'm almost certainly going to have some kind of read error in at least a couple of sectors - which practically means that I'll get some garbled bytes on some reads, and I hope that they're in a cache file or swap file, and not my registry, programs, etc.

So here, they're saying that if you have a RAID 5 and one disk fails, then when the rebuild hits such an error on one of the remaining disks, it's suddenly game over for the entire array - the rest of the data must be copied to a new array, because the firmware says so. Apparently that's why RAID 5 is a bad idea.

Surely the same argument can be applied to hard drives on their own? If Windows checksummed every file before using it, and forced copying every byte onto a new hard drive whenever an error was found, hard drives themselves would have stopped working as a meaningful method of storage in 2009. But it doesn't: you (likely) get some garbled data, and it's usually not a problem. File systems get corrupted, etc. (unless you use something like ZFS), and RAID 5 handling it in a particularly non-graceful way is just an implementation detail.

So really, this is just a problem with controller firmware forcing you to build a whole new array. It's not a problem with RAID 5.

Or have I missed something?

I know that RAID 5 is not a good choice in this modern age (cloud, big SANs, SSDs, etc.), but surely the 'a sector will be bad, and you will not be able to rebuild after a failure' problem could easily be fixed by the RAID controller vendors if their clients complained enough?


If it's only a sector that died on you in one disk, the likelihood that the equivalent sector in the RAID stripe also failed is fairly small[1].

The most often cited issue is a full disk failure combined with a sector already dead on another disk that you weren't aware of. Do disk scrubs and you will mostly avoid this risk. Your array (software or hardware) should be capable of disk scrubs; most if not all are, and will do them.

[1] There is a failure mode whereby frequent writes to one sector/track cause damage to nearby sectors, in which case, if your RAID stripes are static, you may have multiple sector failures in multiple disks in the same stripe. As far as I know it requires hundreds or thousands of rewrites, so your workload needs to be extreme to reach that case.



I don't believe that contradicts what I've said at all. His claim is that when one disk fails completely, we expect a sector or two on each of the two remaining drives to result in a 'bad sector' error.

> So the read fails. And when that happens, you are one unhappy camper. The message "we can't read this RAID volume" travels up the chain of command until an error message is presented on the screen. 12 TB of your carefully protected - you thought! - data is gone. Oh, you didn't back it up to tape? Bummer!

So, at this point, we've got two hard drives containing millions of sectors, of which one or two are bad, and the software claims that the whole thing is broken and we have to find a new array?

As far as I know, RAID 5 works at the block level - a block-level failure should only destroy one block. All of the others (apart from the other dead ones) are fine. This sort of thing happens in all scenarios with a single disk - eventually an operating system will hit a bad sector which it will have to deal with.

In other words, why does the RAID controller crap itself when it can't read a sector (with no recourse at all, according to these articles), when it could just do what every other hard drive does and return 'sector unreadable'? Then the operating system can just remap around it, etc.

I know in some situations one would want to be notified of any minuscule error, but it should be possible to ignore the warnings.


I think the assumed issues here are the long rebuild times and the expectation of an additional failure happening during the rebuild, which could then trigger another disk error. Then the entire array of data would be taken offline and possibly corrupted. I have built many RAID5/6 arrays and have rarely lost any data, but I am leery now and tend to just stick with smaller RAID 1 or 10 setups due to the large size of current disks. We really need native ZFS (BTRFS?) on everything now. Data should be automatically distributed to multiple disks, and the file system should be able to guarantee via checksum that the data read is what I wrote.
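
To put a rough number on that rebuild window, a quick sketch (assuming an optimistic sustained rebuild rate of ~100 MB/s; a busy array will be slower):

    # Rough estimate of how long the array spends exposed to a second
    # failure while rebuilding. The 100 MB/s sustained rate is an assumption.
    def rebuild_hours(disk_bytes, rate_bytes_per_s=100e6):
        return disk_bytes / rate_bytes_per_s / 3600

    for tb in (1, 4, 8):
        print(f"{tb} TB disk: ~{rebuild_hours(tb * 1e12):.1f} h of rebuild")
    # -> ~2.8 h, ~11.1 h, ~22.2 h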


Are there any consumer NAS devices that implement ZFS?

Sounds like ZFS's proactive sector sweeps across all managed drives would handily solve the problem the article raises with conventional RAID.


So I built a ZFS RAID NAS box with 6x3TB drives. I have it scrub every 2 weeks. So far, NOTHING has failed checksums in almost 2 years. So while the maths behind a 3-4TB drive returning incorrect data is, I'm sure, technically correct, I've not seen issues.

If you want "proof" i can dump out zpool status/info/log/etc to show i'm not lying. Note the pool is ~50% in use so its not a great example. Also its raidz2 (raid6) so not a direct comparison. I also bought each drive from different lots to hopefully ensure if a drive failed i'd have 2ish days to get a replacement.


iXsystems sells the 4-bay FreeNAS mini on Amazon: http://www.ixsystems.com/storage/freenas/ That probably counts as "consumer".


What happens when a URE is encountered and all the disks are online? It seems that they could be detected early and fixed before a rebuild is necessary by doing a weekly sweep of the entire array, reading every data block.
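
A naive version of such a sweep might look like the sketch below (assuming a Linux md array at /dev/md0 and root; a real scrub also verifies parity, which this doesn't -- it only forces every sector to be read, so latent errors surface while redundancy is still available):

    # Naive "patrol read": read every block of the device sequentially and
    # report offsets that return I/O errors. /dev/md0 and root are assumptions.
    CHUNK = 1024 * 1024  # read 1 MiB at a time

    def sweep(device="/dev/md0"):
        bad_offsets = []
        with open(device, "rb", buffering=0) as dev:
            offset = 0
            while True:
                try:
                    data = dev.read(CHUNK)
                    if not data:              # end of device
                        break
                    offset += len(data)
                except OSError:               # unreadable region: note it, skip ahead
                    bad_offsets.append(offset)
                    offset += CHUNK
                    dev.seek(offset)
        return bad_offsets

    print(sweep())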


With RAID 5, if you get a parity block mismatch, you don't know which drive was wrong. You could compare parity bit by bit to find which bits were wrong (it could be anywhere from 1 bit up to the stripe size), but you won't be able to figure out how to fix those bit(s) on the stripe without additional information, either a hash or FEC data.

Checksumming filesystems let you find the faulty drive by reconstructing data from each n-choose-(n-1) drive set and finding the set with the correct hash.

Filesystems using FEC instead of parity RAID (5/Z1, 6/Z2, ...) can also correct data errors, but I'm not aware of any consumer-level filesystems that implement it. I'm not sure why. Doesn't Amazon use it for S3? Data block and FEC data layout on a disk array has to be a solved problem.
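
A toy illustration of that n-choose-(n-1) search (plain XOR parity plus a stored SHA-256 of the stripe's data, standing in for the checksums a filesystem like ZFS keeps in its block pointers):

    # Checksum-guided recovery on an XOR-parity stripe. Parity alone says
    # "something is wrong"; the stored hash says which rebuild is the right one.
    import hashlib

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    data = [b"AAAA", b"BBBB", b"CCCC"]        # one stripe, one chunk per drive
    parity = b"\x00" * 4
    for chunk in data:
        parity = xor(parity, chunk)           # a fourth drive holds this parity
    good_hash = hashlib.sha256(b"".join(data)).digest()

    data[1] = b"BxBB"                         # silent corruption on drive 1

    def recover(data, parity, good_hash):
        for suspect in range(len(data)):      # rebuild each drive in turn...
            rebuilt = parity
            for i, chunk in enumerate(data):
                if i != suspect:
                    rebuilt = xor(rebuilt, chunk)
            candidate = data[:suspect] + [rebuilt] + data[suspect + 1:]
            if hashlib.sha256(b"".join(candidate)).digest() == good_hash:
                return suspect, candidate     # ...keep the set whose hash matches
        return None, data

    print(recover(data, parity, good_hash))   # -> (1, [b'AAAA', b'BBBB', b'CCCC'])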


In my anecdotal (not technical) experience, here are the issues with RAID5 failing (from losing 4TB of data and 2TB of backups):

- Consumer devices: the later Buffalo TeraStations (consumer NAS) and other 'low'-priced devices for some stupid reason tie the RAID array to their own hardware (e.g., the firmware), so if the unit fails, the array fails.

- The size and time to rebuild place great stress on the disks, potentially causing a second drive loss. Rebuilding a 250GB-500GB drive took 90 minutes or so; a 2TB rebuild takes several hours, with all drives grinding to recover data at the same time, and that is when you are likely to lose more than the one drive, and your array with it.

- Consumer and SMB NAS devices tend to buy the hard drives from the same manufacturer at the same time, so when one drive fails, the other drives are just as likely to fail in the same time period. I lost two 1TB backups when the HDs had a 'suicide pact'; when I opened the hard drives, I saw they were literally made the same day. To be extra paranoid, I used to mix 2 or 3 different brands of HDs when I had my NAS.

- I recommend Synology units, and constantly monitor the health of your hard drives. Totally agreed that RAID 1 should be the golden standard. I've used several brands and Synology was rock solid.


Perhaps a better argument to make is that, since 2009, "Cloud Storage" has taken off alongside Amazon's S3 service.

So while folks who really do need to reliably store 500GB of stuff might still be looking at RAID, the 'average' end-user now has a far better option than complicated multi-disk setups.


The "average" end user also doesn't want to spend weeks uploading 1TB of files to the cloud.

Imagine a typical American or Canadian "broadband" consumer internet connection with 10Mbps down and 1Mbps up. If you keep your computer on 8 hours a day, it will take just over 9 months to upload 1TB of data to the cloud at that speed. A multi-TB RAID array would take years to back up. (This is why I think Backblaze et al. can afford to offer "unlimited backup" for a flat monthly fee. The customer's ISP is doing all the limiting for them!)
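
The arithmetic, for anyone who wants to plug in their own connection (same assumptions as above: 1 Mbps up, online 8 hours a day):

    # How long a consumer uplink takes to push a backup to the cloud.
    uplink_bps = 1e6                # 1 Mbps upstream
    online_s_per_day = 8 * 3600     # computer on 8 hours a day

    def upload_days(data_bytes):
        return data_bytes * 8 / uplink_bps / online_s_per_day

    print(f"1 TB: ~{upload_days(1e12) / 30:.1f} months")   # just over 9 months
    print(f"4 TB: ~{upload_days(4e12) / 365:.1f} years")   # multi-TB takes years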

It was a major PITA when I first started using online backup services while in Canada. (TekSavvy FTW!) Despite paying for one of their faster plans, I had to keep my computer on for a couple of weeks 24/7 to make the initial backup.


Is there an opportunity for a well-known backup vendor to have shopfronts in malls that are wired into a high-speed backbone? Take them your external USB, go have lunch in the food court, come back later when the backup is complete.

'YOUR PHOTOS BACKED UP TO CLOUD WHILE U WAIT'


You don't even need the high-speed backbone. You could back up the files to a local drive. The latency might be bad, but trucks still have higher bandwidth than the internet.

http://what-if.xkcd.com/31/


Absolutely true. But there's something more appealing to me about handing a drive across a counter, knowing it will be backed up in a few hours (assuming that is practical) and confirmed to me back home via my higher-download-speed broadband, rather than having the drive rattle across the country in a truck, sit in a mail room, and finally get processed X days/weeks later...


So Mall Backup Co takes two copies immediately. One on their local node and one that gets shipped back to base.

As soon as you get home you can access your backup (transparently) from the mall node. Then when the drive gets back to base it does another verification run and instructs the node that it can re-use that space.


It could be feasible in the UK; we can do same-day courier across the country, although I suspect it'd be expensive at low volume. Next day post gets there the next day pretty much every time, though.


AWS allows you to mail drives in for import/export:

http://aws.amazon.com/importexport/


Sure, network bandwidth is an issue, and getting a fast enough connection at a mall might be interesting.

What's also fun: USB speeds. I found a transfer time calculator - http://techinternets.com/copy_calc - and, well, 4 terabytes at USB 2.0 speeds (480 Mb/s) would take over 21 hours, or 3 hours at USB 3.0 speeds. Those are optimal wire speeds, less 10% for overhead.


RAID enhances online availability and performance. It isn't backup. So if you are the consumer addressed in the article, the answer is no: you don't need a RAID array, you need offline backup.


No, it isn't. You just need to do proactive maintenance on your array... and/or back up your data.

If you use RAID5 on an array of any size without making backups, you will lose data. Does he really think that, with the arrival of 1TB+ drives, this is the first time two drives have failed at once?


Back in the days when all 10 of your Seagate 120GB drives would die in a span of 2 months, sometimes two or three popping off in the same day.


Drives are so cheap these days that I wouldn't consider anything other than RAID-1. For applications requiring cheaper storage or better density, I'd consider application-level redundancy coding.


RAID 1 doesn't offer any performance benefits, though. You could go with RAID 0+1 (or 10), but I believe read performance is still better in a RAID 5 configuration, while write performance is lower.

In the age of really fast SSDs, RAID 1 might be the better choice, though, unless you really need a huge amount of fast storage.


RAID 1 does have read performance benefits; you can read from both disks in parallel. Write time is max(disk0, disk1).


Oh really? I thought that to get read benefits you'd go RAID 10, and that's the main reason it exists in the first place. But it seems you are right, though it depends on the RAID controller and drivers.




