
> RAID or storage replication in distributed storage <..> is not only useless, but actively undesirable

I guess I'm different from most people, good news! When building my new "home server" half a year ago, I made a RAID-1 (based on ZFS) with 4 NVMe drives. I'm rarely in that city, so I brought a fifth drive and put it into an empty slot. Well, one of the 4 NVMe drives lasted 3 months and then stopped responding. One "zpool replace" and I'm back to normal, without any downtime, disassembly, or even a reboot. I think that's quite useful. Next time I'm there I'll replace the dead one, of course.
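For anyone curious, the swap is roughly this (pool and device names are made up):

    # the dead drive shows up as FAULTED or UNAVAIL here
    zpool status tank
    # swap the dead device for the spare already sitting in the chassis
    zpool replace tank nvme1n1 nvme4n1
    # watch the mirror resilver; no downtime or reboot needed
    zpool status tank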




This article is talking about large-scale, multi-node distributed systems: hundreds of rack-sized systems. In those systems, you often don't need explicit disk redundancy, because you have data redundancy across nodes with independent disks.

This is a good insight, but you need to be sure the disks are independent.


Well, most often HBAs and RAID controllers are just another thing that increases latency and drives up maintenance costs (more stuff to update), and they're another part that can break.

That's why they're not recommended when running Ceph.


I'm pretty sure discrete HBAs / hardware RAID controllers have effectively gone the way of the dodo. Software RAID (or ZFS) is the common, faster, cheaper, more reliable way of doing things.


Don’t lump HBAs and RAID controllers together. The former is just PCIe to SATA or SCSI or whatever (otherwise it is not just an HBA, but indeed a RAID controller). Such a thing is still useful, and perhaps necessary for software RAID if there aren’t enough ports on the motherboard.


Hardware RAID doesn't seem to be going away quickly. Since the controllers are almost all made by the same company, and they can usually be flashed to be dumb HBAs, it's not too bad. But it was pretty painful when using managed hosting: the menu options with lots of disks all came with RAID controllers that are a pain to set up, and I'm not going to reflash their hardware (although I did end up doing some SSD firmware updates myself, because firmware bugs were causing issues and their firmware upgrade scripts weren't working well and were tremendously slow).


ZFS needs HBAs. Those get your disks connected but otherwise get out of the way of ZFS.

But yes, hardware RAID controllers and ZFS don't go together.


Hardware caching RAID controllers do have the advantage that if power is lost, the cache can still be written out without the CPU/software to do it. This lets you safely run with a write-back cache instead of forcing every fsync through to the disks. This was a common spec for the provisioned bare-metal MySQL servers I'd worked with.
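As a rough sketch of the kind of setup that relies on this (a hypothetical my.cnf fragment, not the actual spec from those servers):

    [mysqld]
    # full durability at the SQL layer; the controller's battery-backed
    # write-back cache is what keeps each fsync cheap
    innodb_flush_log_at_trx_commit = 1
    sync_binlog = 1
    innodb_flush_method = O_DIRECT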


The entire comment thread on this article is on-prem, low-scale admins and high-scale cloud admins talking past each other.

You can build in redundancy at the component level, at the physical computer level, at the rack level, at the datacenter level, and at the region level. Having all of them is almost certainly unnecessary at best.


Sometimes. Other times they may make things worse by lying to the filesystem (and thereby also the application) about writes being completed, which may confound higher-level consistency models.


It does seem to me that it's much easier to reason about the overall system's resiliency when the capacitor-protected caches are in the drives themselves (standard for server SSDs) and nothing between that and the OS lies about data consistency. And for solid state storage, you probably don't need those extra layers of caching to get good performance.


Since my experience was from a number of years back, I tried searching for more recent reports: "mysql ssd fsync performance". The top recent one I found was from DigitalOcean[0] in 2020. It says "average of about 20ms which matches your 50/sec" and mentions battery-backed controllers, which weren't even in my search terms.

[0] https://www.digitalocean.com/community/questions/poor-fsync-...
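If you want to reproduce that kind of number, a crude per-write sync latency test with dd looks something like this (path and sizes are just examples):

    # 1000 x 4 KiB writes, each with O_DSYNC, so every write waits for stable storage
    dd if=/dev/zero of=./fsync-test bs=4k count=1000 oflag=dsync
    # at ~20 ms per synced write, that's roughly 50 writes/sec, i.e. about 200 kB/s
    rm ./fsync-test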


I would be worried about my data being held hostage by a black-box proprietary RAID controller from a hostile manufacturer (unless you're paying them millions to build & design you a custom product, at which point you may have access to internal specs & a contact within their engineering team to help you).

I'd rather have ZFS or something equivalent in software. Software can be inspected, is (hopefully) battle-tested for years by many different companies with different workloads & requirements, and in the worst-case scenario, because it's software, you can freeze the situation in time by taking byte-level snapshots of the underlying drives along with a copy of the software for later examination/reverse-engineering. You can't do that with a hardware black box, where you're bound to the physical hardware and often have a single shot at a recovery attempt (as it may change the state of the black box).

Have you heard of the SSD failures about a decade ago, where the SSD controller's firmware had a bug that bricked the drive past a certain lifetime? The data is technically still there, and would be recoverable if you could bypass the controller or fix its firmware, but unless you had a very good relationship with the SSD manufacturer, giving you access to the internal tools and/or source code needed to tinker with the controller, you were SOL.


It was RAID-1, so there's no data manipulation going on: a simple mirror copy with double the read bandwidth.


> > RAID or storage replication in distributed storage <..> is not only useless, but actively undesirable

> I guess I'm different from most people, good news!

The earlier part of the sentence helps explain the difference: "That is, because most database-like applications do their redundancy themselves, at the application level..."

Running one box I'd want RAID on it for sure. Work already runs a DB cluster because the app needs to stay up when an entire box goes away. Once you have 3+ hot copies of the data and a failover setup, RAID within each box on top of that can be extravagant. (If you do want greater reliability, it might be through more replicas, etc. instead of RAID.)

There is a bit of universalization in how the blog post phrases it. As applied to databases, though, I get where they're coming from.


You omitted the context from the rest of the sentence:

> most database-like applications do their redundancy themselves, at the application level …

If that’s not the case for your storage (doesn’t sound like it), then the author’s point doesn’t apply to your case anyway. In which case, yes, RAID may be useful.


What setup do you use to put 4 NVMe drives in one box? I know it’s possible, I’ve just heard of so many different setups. I know there are some PCIe cards that take 4 NVMe drives, but you have to match that with a motherboard/CPU combo with enough PCIe lanes to not lose bandwidth.


For distributed storage, we use this: https://www.slideshare.net/Storage-Forum/operation-unthinkab...

We then install SDS software: Cloudian for S3, Quobyte for file, and we used to use Datera for iSCSI. Maybe Lightbits in the future, I don't know.

These boxen get purchased with 4 NVMe devices, but can grow to 24 NVMe devices. Currently 11 TB Microns, going for 16 or more in the future.

For local storage, multiple NVMe devices hardly ever make sense.


I've been looking into building some small/cheap storage, and this is one of the enclosures I've been looking at.

https://www.owcdigital.com/products/express-4m2


That’s exactly what they are doing. Everyone else is using proprietary controllers and ports for a server chassis.


I’ve recently converted all my home workstation and NAS hard drives over to OpenZFS, and it’s amazing. Anyone who says RAID is useless or undesirable just hasn’t used ZFS yet.


The article’s author only said RAID was useless in a specific scenario, not generally, and the post you’re replying to omitted this crucial context.


Compare your solution with having 4 SBCs, each with 1 NVMe drive, at different locations. The network client would handle replication and checksumming.

The total cost might be similar, but you get increased resilience to SBC/controller/uplink failure.

Of course there are tradeoffs on performance and ease of management...


You think that building 4 systems in 4 locations is likely to have a similar cost to one system at one location? For small systems, the fixed costs are a significant portion of the overall system cost.

This is doubly true for physical or self-hosted systems.


My environment is not a home environment.

It looks like this: https://blog.koehntopp.info/2021/03/24/a-lot-of-mysql.html


We are currently converting our SSD-based Ganeti clusters from LVM on RAID to ZFS, to prepare for our NVMe future without RAID cards (1). I was hoping to get the second box in our dev Ganeti cluster reinstalled this morning to do further testing, but the first box has been working great!

1: LSI has an NVMe RAID controller for U.2 chassis, preparing for a non-RAID future, just in case.


Does zpool not automatically promote the hot spare like mdadm?


It can, if you set a disk as the hot spare for that pool.

But a disk can only be a hot spare in one pool, so to have a "global" hot spare it has to be done manually. That may be what that poster was doing.
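For reference, wiring up a hot spare looks roughly like this (pool and device names are made up):

    # register a disk as a hot spare for the pool "tank"
    zpool add tank spare nvme4n1
    # it should now show up under the "spares" section
    zpool status tank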


Also, if I understand it correctly, there are a few other caveats with hot spares: a spare will only activate when another drive completely fails, so you can't decide to replace a drive when it's merely close to failure (probably not an issue in this case, though, with the unresponsive drive). Second, with the hot spare activated, the pool is still degraded and the original drive still needs to be replaced; once that's done, the hot spare is removed from the vdev and goes back to being a hot spare.

For these reasons, I've decided to just keep a couple of cold spares ready that I can swap into my system as needed, although I do have access to the NAS at any time. If I were remote like GP, I might decide to use a hot spare.



