I wish Apple and Oracle had just sorted things out and made ZFS the main filesystem for the Mac. Way back in the day when they first designed Time Machine, it was supposed to just be a GUI for ZFS snapshots.
How cool would it be if we had a great GUI for ZFS (snapshots, volume management, etc.). I could buy a new external disk, add it to a pool, have seamless storage expansion.
httm has been at 9XX GitHub stars for what seemed like forever! Time to get over that 1k hump.
As a Rust project, it's something of a duty to hype and self-promote, and to troll C++, of course. So let it be known: If you're not using httm, you're not using modern ZFS. Or -- httm is the cppfront of ZFS utilities? Or -- httm is the 'auto' keyword for filesystems, but less confusing (to me)?
Yes, this will always depress me, and to me personally it is one of the ultimate evils of the court-invented idea of "software patents". I've used ZFS with Macs since 2011, but without Apple onboard it's never been as smooth as it should have been, and has gotten more difficult in some respects. There was a small window where we might have had a really universal, really solid FS with great data guarantees and features. All sorts of things with the OS could be so much better: trivial (and even automated) update rollbacks, for example. Sigh :(. I hope zvols finally someday get some performance attention so that, if nothing else, running APFS (or any other OS of course) on top of one is a better experience.
> How cool would it be if we had a great GUI for ZFS (snapshots, volume management, etc.). I could buy a new external disk, add it to a pool, have seamless storage expansion.
Something like TVS-h474 seems quite nice for a wide range of uses.
I have a TVS-h1288X and I am really happy with it; maybe it's a bit overkill for most.
That said, I could not use zfs send from my previous FreeBSD server, as there are some compat checks in QNAP's version. I used NFS to migrate from my decade-plus-old pool.
I'm using Tailscale for access; they have QNAP builds on GitHub.
QNAP forked from OpenZFS a long time ago, before the ZoL project was renamed as the canonical OpenZFS project IIRC, and has been reimplementing whatever they wanted for some time.
So it's not just a check for its own sake; I would suspect that in practice their send/recv is just completely incompatible at this point.
Yeah I wonder if it was really caused by Jonathan Schwartz bragging to the media about it and pissing Steve Jobs off, as was rumoured at the time. Or if there was more behind it.
I would certainly not put it past Jobs. While he had some genius qualities, his personality was pretty deeply flawed. But on the other hand I wonder if his sense for business would not have prevailed.
If only Apple had absorbed Sun instead of Oracle, their legacy would have been better off. Of course, it couldn't have been worse.
Which also made me wonder: of course Jobs and Ellison were big friends, so perhaps Jobs just left it for him to take? Oracle obviously always wanted to get their hands on Java.
PS: the whole ZFS-on-Mac adventure came a little after Time Machine was already in production.
ZFS is indeed an amazing desktop filesystem. I'm using it now on FreeBSD, I moved away from Mac because it became ever more closed down. I wanted more control, not less.
Apple buying Sun would have been a disaster. Oracle at least needed some (arguably most) of what Sun made - Apple did not and do not.
Apple would have killed Java in half a second. They don't need it, it was built to do the opposite of what they want their tech to do. Same for Solaris, burdensome duplicate. As for all the desktop-oriented FOSS projects (OpenOffice, VirtualBox etc), they would not have just been killed - Apple would have viciously bullied anyone to preclude them from forking them. The stuff they'd keep (MySql, probably), they would have made osx-only, and then strangled them quietly after they got out of the server game. They would have ripped anything technologically interesting from the hardware divisions and then shut them down - because they were no match, in terms of supply efficiency and value, for the Apple equivalents; and anyway they were never seriously interested in the server space.
Oracle did what Oracle does, and realistically the culture clash was never going to result in a smooth transition; but from a commercial perspective they kept around the best of what Sun was making (Java, servers) and largely let the community get on with the stuff they weren't interested in (if rebranded/forked). The only real crime they committed, IMHO, was trying too hard to make money from VirtualBox, effectively spooking the market; there is a different timeline where VBox ends up being what Docker became. But everything else was par for the course.
Not sure why you limit that to desktop. It's more important that server files can have an instant snapshot, compression and performant incremental backup to remote servers.
> How cool would it be if we had a great GUI for ZFS
Not in a workstation OS, and maybe not "great", but tools like FreeNAS have had a ZFS GUI sorted out[1] pretty decently and user-friendly for some time. I've seen storage laymen set these up, learn on the go, and not screw up.
TrueNAS? You can manage the volumes with a web UI. It has a thing you can enable for SMB shares that lets you see the previous versions of files on Windows. Or perhaps I don't understand what you're after.
Yeah, imagine a world where you could use Time Machine to back up 500GB Parallels volumes where only the diff was stored between snapshots, rather than needing to back up the whole 500GB volume every single time.
As a workaround, you can create a sparse bundle to store your Parallels volume. Sparse bundles are stored in bands, and only bands that change get backed up. It might be slightly more efficient.
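If it helps, a sparse bundle can be created from the command line; roughly this (size, filesystem, and names are just examples, so double-check the flags against `man hdiutil`):

```
hdiutil create -size 500g -type SPARSEBUNDLE -fs APFS \
    -volname "Parallels" ~/Parallels.sparsebundle
hdiutil attach ~/Parallels.sparsebundle   # typically mounts at /Volumes/Parallels
```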
Correct. I lied when I said you can't add single disks; I meant you shouldn't add single disks. It's not like e.g. Synology, where you can add a single disk and it will expand your underlying RAID5/6 arrays.
>How cool would it be if we had a great GUI for ZFS (snapshots, volume management, etc.)
I believe that was something built into the file manager in OpenSolaris, and then into illumos OSes. OpenIndiana has it in their file manager within their themed MATE desktop environment.
The big news here seems to be that iXsystems (the company behind FreeNAS/TrueNAS) is sponsoring this work now. This PR supersedes one that was opened back in 2021.
What worries me the most is them doubling down on it after it was shown to be a mess and totally insecure. I imagine it was to protect the corporate image. But such corporate interests don't belong in FreeBSD. Just admit you hired the wrong guy (or the right guy at the wrong time) and take the hit. Because it wasn't even the company's fault, really.
The lack of serious review on the FreeBSD side was also a big one, but I find the 'strings attached' the biggest issue myself, because it made mitigation of serious issues such a problem.
The large amount of corporate kernel work in Linux is why I went for FreeBSD, and seeing what little of it there is go so wrong was worrying.
While I have some disagreements with iXsystems' pivot to ZFS-on-Linux offerings and the like, they're not the clowns that Netgate / pfSense are. 10cm to my right are TrueNAS and OPNsense boxes - you won't catch me dead using pfSense in my network after the crap they've pulled.
> After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data to 2 parity).
Does anyone know why this is the case? When expanding an array which is getting full, this will result in a far smaller capacity gain than desired.
Let's assume we are using 5x 10TB disks which are 90% full. Before the process, each disk will contain 5.4TB of data, 3.6TB of parity, and 1TB of free space. After the process and converting it to 6x 10TB, each disk will contain 4.5TB of data, 3TB of parity, and 2.5TB of free space. We can fill this free space with 1.66TB of data and 0.83TB of parity per disk - after which our entire array will contain 36.96TB of data.
If we made a new 6-wide Z2 array, it would be able to contain 40TB of data - so adding a disk this way made us lose over 3TB in capacity! Considering the process is already reading and rewriting basically the entire array, why not recalculate the parity as well?
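For what it's worth, here's the arithmetic above as a quick back-of-the-envelope script (same assumptions as my example, exact fractions rather than per-disk rounding):

```
awk 'BEGIN {
  raw_used = 5 * 10 * 0.9       # 45 TB of raw space already allocated (3:2 data:parity)
  old_data = raw_used * 3 / 5   # 27 TB data + 18 TB parity, kept as-is after expansion
  free_raw = 6 * 10 - raw_used  # 15 TB of raw space free once the 6th disk is added
  new_data = free_raw * 4 / 6   # new writes land at 4:2
  printf "usable after expansion: %.1f TB (vs 40 TB for a fresh 6-wide Z2)\n",
         old_data + new_data    # ~37 TB, i.e. roughly 3 TB short of a fresh array
}'
```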
> Does anyone know why this is the case?
> Considering the process is already reading and rewriting basically the entire array, why not recalculate the parity as well?
IANA expert but my guess is -- because, here, you don't have to modify block pointers, etc.
ZFS RAIDZ is not like traditional RAID, as it's not just a sequence of arbitrary bits, data plus parity. RAIDZ stripe width is variable/dynamic, data is written in blocks (imagine a 128K block, compressed to ~88K), and there is no way to quickly tell where the parity data is within a written block, where the end of any written block is, etc.
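To make that concrete, here's a rough sketch of how I understand the per-block layout on a 6-wide RAIDZ2 with 4K sectors (illustrative only; it ignores skip/padding sectors and other details):

```
awk -v width=6 -v nparity=2 -v sector=4096 'BEGIN {
  for (bytes = 16384; bytes <= 131072; bytes *= 2) {
    data = int((bytes + sector - 1) / sector)   # data sectors needed for this block
    rows = int((data + (width - nparity) - 1) / (width - nparity))
    printf "%4dK block -> %2d data + %2d parity sectors\n",
           bytes / 1024, data, rows * nparity
  }
}'
```

The point being that the data-to-parity ratio is a property of each block as it was written, not of the vdev, so changing it would mean rewriting the blocks (and everything that points at them).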
If you instead had to modify the block pointers, I'd assume you'd also have to change each block in the live tree and all dependent (including snapshot) blocks at the same time? That sounds extraordinarily complicated (and this is the data-integrity FS!), and much slower than just blasting through the data in order.
To do what you want, you can do what one could always do -- zfs send/recv from the old filesystem to a new one.
> To do what you want, you can do what one could always do -- zfs send/recv from the old filesystem to a new one.
Sure, but that involves having enough spare disks, enough places to put them, and enough places to connect them.
This way, while the initial expansion is not ideal, it works. If you really need the space gains from a wider distribution, you can do this expansion and then do the make-a-new-copy-then-replace-the-old-copy dance... although that's counterproductive if you have snapshots.
> Sure, but that involves having enough spare disks
So long as we are pointing out things: this presumes that the user doesn't have enough space to make the copies on their own disks? If one has a 3GB dataset and 9GB of free space, one can easily zfs send/recv that dataset and destroy the old copy.
> This way, while the initial expansion is not ideal, it works.
Yep.
> If you really need the space gains from a wider distribution, you can do this expansion and then do the make-a-new-copy-then-replace-the-old-copy dance... although that's counterproductive if you have snapshots.
That's why you do a zfs send/recv, instead of a copy. ZFS will copy your snapshots for you!
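For anyone who hasn't done that dance, it's roughly this (dataset names are placeholders; it needs enough free space for a full second copy until the destroy):

```
zfs snapshot -r tank/data@rewrite
zfs send -R tank/data@rewrite | zfs receive -u tank/data-new   # -R carries snapshots, -u skips mounting
zfs destroy -r tank/data
zfs rename tank/data-new tank/data
```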
Again, not an expert -- but if you look at the presentations re: this feature, I presume you wouldn't. Why? Because another feature, adopted concurrently, provides for time-independent geometry? So, if you look at the pool after time X, you have the same block numbers, etc., but with a different geometry.
The reason the parity ratio stays the same is that all of the references to the data are by DVA (Data Virtual Address, effectively the LBA within the RAID-Z vdev).
So the data will occupy the same amount of space and parity as it did before.
All stripes in RAID-Z are dynamic, so if your stripe is 5 wide and your array is 6 wide, the 2nd stripe will start on the last disk and wrap around.
So if your 5x10 TB disks are 90% full, after the expansion they will contain the same 5.4 TB of data and 3.6 TB of parity, and the pool will now be 10 TB bigger.
New writes will be 4+2 instead, but the old data won't change (that is how this feature is able to work without needing block-pointer rewrite).
The linked pull request says "After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distributed among the larger set of disks". That'd mean that the disks do not contain the same data, but it is getting moved around?
Regardless, my entire point is that you still lose a significant amount of capacity due to the old data remaining as 3+2 rather than being rewritten to 4+2, which heavily disincentivizes the expansion of arrays reaching capacity - but that is the only time people would want to expand their array.
It just seems to me like they are spending a lot of effort on a feature which you frankly should not ever want to use.
I don't use raidz for my personal pools because it has the wrong set of tradeoffs for my usage, but if I did, I'd absolutely use this.
Yes, your data has the old data:parity ratio for older data, but you now have more total storage available, which is the entire goal. Sure, it'd be more space-efficient to go rewrite your data, piecemeal or entirely, afterward, but you now have more storage to work with, rather than having to remake the pool or replace every disk in the vdev with a larger one.
> So the data will occupy the same amount of space and parity as it did before.
So you lose data capacity compared to "dumb" RAID6 on mdadm.
If you expand RAID6 from 4+2 to 5+2, you go from spending 33.3% of raw capacity on parity to 28.5%.
If you expand RAIDZ from 4+2 to 5+2, your new data will use 28.5%, but your old data (which is the majority, because if it wasn't you wouldn't be expanding) would still use 33.3% on parity.
Could you force a complete rewrite if you wanted to? That would be handy. Without copying all the data elsewhere of course. I don't have another 90TB of spare disks :P
Edit: I suppose I could cover this with a shell script needing only the spare space of the largest file. Nice!
> Could you force a complete rewrite if you wanted to?
On btrfs that's a rebalance, and part of how one expands an array (btrfs device add + btrfs balance)
(Not sure if ZFS has a similar operation, but from my understanding resilvering would not be it)
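For reference, the btrfs version of "add a disk and rewrite everything" is roughly this (device and mountpoint are placeholders):

```
btrfs device add /dev/sdX /mnt/array
btrfs balance start /mnt/array   # rewrites existing block groups across all devices
```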
Not that it matters much, though, as RAID5 and RAID6 can't be depended upon, and the array failure modes are weird in practice, so in the context of expanding storage it really only matters for RAID0 and RAID10.
The easiest approach is to make a new subvolume, and move one file at a time. (Normal mv is copy + remove which doesn't quite work here, so you'd probably want something using find -type f and xargs with mv).
> Considering the process is already reading and rewriting basically the entire array, why not recalculate the parity as well?
Because snapshots might refer to the old blocks. Sure you could recompute, but then any snapshots would mean those old blocks would have to stay around so now you've taken up ~twice the space.
AFAIK in ZFS snapshots are just a pointer to (an old) merkle tree[1], if you go about changing the blocks then you need to update that tree, but you can't due to copy-on-write (without paying for the copy, like I mentioned).
The key issue is that you basically have to rewrite all existing data to regain that lost 3TB. This takes a huge amount of time and the ZFS developers have decided not to automate this as part of this feature.
You can do this yourself though when convenient to get those lost TB back.
The RAIDZ vdev expansion feature actually was quite stale and wasn't being worked on, AFAIK, until this sponsorship.
Was it confirmed anywhere that this space-inefficiency-until-rewrite issue was why it stagnated? I remember looking up the progress a few months ago and being perplexed by the radio silence, given there were celebrations that we were apparently on the home straight back in 2021
I imagine there would eventually be a way / option to automatically rewrite old blocks. Or at least I would hope so, because usually when you're adding new disks the array is going to be near full.
Almost certainly not in any useful fashion that doesn't duplicate data versus old snapshots.
ZFS really deeply assumes you're not gonna be rewriting history for a bunch of features, and you'd have written a good chunk of a new filesystem to reimplement everything without those assumptions.
IIRC one of the developers was asked about this and said it would be about as much additional effort as the whole expansion feature so far. There is of course the user-level workaround of (roughly) `for file in *; do cp -p "$file" "$file.new" && mv "$file.new" "$file"; done`, but if I understand correctly you'd need to do this with no snapshots, otherwise the space from $file wouldn't be freed, so you'd need an array that's at least half empty.
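A slightly more careful sketch of that workaround, for anyone who wants it (the path is a placeholder; as said above, any snapshot still referencing the old blocks will keep them allocated, and note it breaks hard links):

```
find /tank/data -type f -print0 | while IFS= read -r -d '' f; do
  cp -p -- "$f" "$f.rewrite" \
    && mv -- "$f.rewrite" "$f"   # the fresh copy gets written at the new 4+2 geometry
done
```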
You can do a zfs send/zfs recv to send a dataset (including all snapshots) to yourself which is effectively rewriting the whole dataset, including all of its history, by duplicating it.
Not hard, but it does require sufficient free space. Once it's done you can destroy the original dataset and reclaim the space.
"This feature allows disks to be added one at a time to a RAID-Z group, expanding its capacity incrementally. This feature is especially useful for small pools (typically with only one RAID-Z group), where there isn't sufficient hardware to add capacity by adding a whole new RAID-Z group (typically doubling the number of disks)."
I’m not an expert whatsoever but what I’ve been doing for my NAS is using mirrored VDEVs. Started with one and later on added a couple more drives for a second mirror.
Coincidentally one of the drives of my 1st mirror died few days ago after rebooting the host machine for updates and I replaced it today, it’s been resilvering for a while.
Mirrored vdevs resilver a lot faster than zX vdevs. Much less chance of the remaining drive dying during a resilver if it takes hours rather than days.
Is the amount of data read/written the same? I'm not clear why wall time is the relevant metric, unless that's the critical driver of failure likelihood. I would have guessed it's not time but bytes.
A mirror resilver is a relatively linear and sequential rewrite, so it's very fast. RAIDZ resilvers require lots of random reads and writes across all the drives, and require waiting for all drives to read data before it can be written to the replaced drive - "herding cats" sounds appropriate here.
That makes sense. Usually the received wisdom I hear is "RAIDZ is slower due to the parity calculations", which has always seemed dubious to me on modern CPUs, given we can do much more complex compression and encryption at way faster than disk speed (for HDDs, at least)
Usually what makes RAID-Z slower is not really the parity calculations, it's the fact that each RAID-Z vdev only has the read IOPS of the slowest disk in it.
So for example, a pool with 2 RAID-Z vdevs each containing 5 disks effectively has only 2x the IOPS of a single disk, while a pool with 5 mirror vdevs (i.e. RAID-10) has 10x the IOPS of a single disk.
It's not a big problem if you mostly do sequential I/O but it's a huge difference if/when you do small random reads (e.g. traversing a non-cached directory tree).
In the context of resilvers, currently RAID-Z resilvers can only be done by traversing the block tree which, due to the fragmentation of CoW filesystems, usually leads to a lot of random reads/writes, while a RAID-10 pool can basically resilver while doing almost fully sequential I/O, which can be much, much faster.
Do random reads and writes put more stress on the drive mechanism? I'm wondering if this increase in resilver time adds risk of data loss, which is probably the more important factor than clock time. (But I'm open to learning if I am wrong.)
I am backing up the whole NAS in case such a failure happens, and as was mentioned in other replies getting the replacement drive in the pool is way quicker on a mirrored vdev.
However RAIDZ2 can survive up to one of those stressed disks failing, while a 2-mirror has everything riding on that single survivor. It seems intuitive that mirrors would be safer, yeah, but if you run the binomial distribution numbers there are some realistic combinations of array size and individual drive failure probabilities where RAIDZ2 really does appear to be safer
Imagine a gambler telling you "you can either throw one k-faced die and you lose if it comes up on 1, or throw n k-faced dice and you lose if two come up on 1". Depending on k and n, the second really can be the better choice
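To put rough numbers on that dice analogy, here's my own toy model (8 drives, one already dead, each survivor independently fails with probability p before the rebuild finishes; real drive failures are of course correlated, so treat it as illustrative):

```
awk -v p=0.01 'BEGIN {
  mirror = p                            # 4x2 mirrors: data lost if the lone partner also dies
  raidz2 = 1 - (1-p)^7 - 7*p*(1-p)^6    # 8-wide Z2: lost if 2+ of the 7 survivors die
  printf "p=%.2f -> mirror loss %.4f, raidz2 loss %.4f\n", p, mirror, raidz2
}'
```

Which of the two comes out ahead flips around depending on p and the width, which is exactly the point.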
Often people cite the faster rebuild of mirrors as a safety advantage, but the same amount of rebuild IO will occur regardless of how long it takes. Yes, there will be more non-rebuild IO in a larger time window, but unless that routine load was causing disks to fail weekly then I doubt it will change the numbers non-negligibly. It will of course affect array performance though, so mirrors for performance is a good argument
If we were to only compare survival after a fixed number of drives failed vs. storage efficiency, RAIDzN should always come out ahead in any configuration - with mirrors you can get unlucky with which drives fail, with RAIDzN any choice is a good choice. The only way to have RAIDz fail sooner than a mirror is to have a comparatively less redundant setup (your choice of N and K).
Realistically though, RAIDz recovery is longer and more stressful, so more of your drives can fail in the critical period, and, assuming you have backups, your storage is there for usability - mirroring gives you a performant, usable system during a fast recovery for the price of a small chance of complete data loss (but you have backups?), vs RAIDz which gives you long recovery pains on a degraded system, but I expect a smaller chance of data loss on a lightly loaded system.
Well, from the sample of 500 servers I have at work, I have seen a RAID1 rebuild fail zero times and a RAID6 rebuild fail once (but we did recover it).
Granted, that was a bit of an uncommon case where:
* someone forgot to order spares after taking the last one
* we still had our consumables-buying pipeline going through the helpdesk
* the helpdesk didn't have any importance communicated about it, and because of some accounting bullshit the purchase got delayed long enough
* the drives in question were all from some Seagate fuckup of a model with much higher failure rates.
One disk failed with some media errors; the remaining 2 got kicked out of the array for the same reason during resilvering.
We ddrescue'd the 2 onto a pair of fresh ones, and the bad blocks didn't land in the same place on both drives, so it made a full recovery. But we did learn many lessons from that.
However, with RAID1 you stress the drive that has your only copy of the data. With 5/6 (RAIDz w/ 2 spares as well) you can have multiple copies of the data, so although you are overall increasing stress on your fleet, wouldn't it have a lower probability of data loss?
Drive stress (for me) is mostly a concern about data loss.
With RAID5 you'd have a higher probability of data loss. Remember, you just lost the only redundancy you had, bringing your redundancy to the same level as RAID1 with a failed drive.
... except now you have more drives that can fail.
With RAID6 yes, you can still fail one more time and still keep your data, which is why it is recommended.
Data-safety-wise I'd go RAID6 -> RAID1/10 (particularly the Linux implementation, which can do RAID10 on an odd number of drives, which is nice) -> RAID5.
Note that the listed survival probabilities for N-disk failures in the 8-drive/4-vdev mirror don't come out correct after calculation, though the differences may not be that important.
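For what it's worth, here's how I'd compute them (my own numbers, assuming the N failed drives are uniformly random among the 8; a 4-vdev mirror pool survives only if no vdev loses both members):

```
awk 'function C(n, k,   r, i) { r = 1; for (i = 1; i <= k; i++) r *= (n - k + i) / i; return r }
BEGIN {
  for (n = 1; n <= 4; n++)   # survives n random failures iff each failure hits a distinct vdev
    printf "survive %d failed drives: %5.1f%%\n", n, 100 * C(4, n) * 2^n / C(8, n)
}'
# prints 100.0%, 85.7%, 57.1%, 22.9%
```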
We just used RAID6... expanding by more drives is easy while keeping data waste low, and the "we want to put bigger drives in now" case isn't all that problematic, considering that what you save you waste on the low efficiency of a RAID1 setup.
We did have a big case, so we just ran a 2xRAID6 setup, and that one time where it was needed we just swapped in bigger drives one by one, while using the removed ones as spares for other machines. But that's a benefit of scale bigger than "a NAS server".
Yeah mine is a home NAS, not a professional deployment. At first I was a bit disappointed at losing half the storage due to using mirrors but honestly it’s not a big deal considering the prices I managed to find on the drives. And I also wanted to expand at my own pace. The drive I recently used to replace the dead 4TB drive cost me 110€ and it’s a 10TB drive.
I got 2 so I’ll also replace the sibling of the dead 4TB one.
This has been floating around for 2 years at this point, so it might be a long while until it gets in. Interestingly, QNAP somehow added this feature to the code that their QuTS Hero NASes use. I'm not sure how solid or tested the QNAP code is, but it's solid enough that they're shipping it in production.
I would expect any storage array to recommend a full backup any time you are messing with the physical disks. Even with "production ready" features, one would not add, remove, or do anything with the array without a full backup.
"Production" systems should not even be considered production unless you have a backup of them.
QNAP's source drops are also kind of wild, in that they branched a looooooooong time ago and have been implementing their own versions of features they wanted since, AFAICT.
I’m frustrated because this feature was mentioned by Schwartz when it was still in beta. I thought a new era of home computing was about to start. It didn’t, and instead we got The Cloud, which feels like decentralization but is in fact massive centralization (organizational, rather than geographical).
Some of us think people should be hosting stuff from home, accessible from their mobile devices. But the first and to me one of the biggest hurdles is managing storage. And that requires a storage appliance that is simpler than using a laptop, not requiring the skills of an IT professional.
Drobo tried to make a storage appliance, but once you got to the fine print it had the same set of problems that ZFS still does.
All professional storage solutions are built on an assumption of symmetry of hardware. I have n identical (except not the same batch?) drives which I will smear files out across.
Consumers will never have drive symmetry. That’s a huge expenditure that few can justify, or much afford. My Synology didn’t like most of my old drives so by the time I had a working array I’d spent practically a laptop on it. For a weirdly shaped computer I couldn’t actually use directly. I’m a developer, I can afford it. None of my friends can. Mom definitely can’t.
A consumer solution needs to assume drive asymmetry. That day it is first plugged in, it will contain a couple new drives, and every hard drive the consumer can scrounge up from junk drawers - save two: their current backup drive and an extra copy. Once the array (with one open slot) is built and verified, then one of the backups can go into the array for additional space and speed.
From then on, the owner will likely buy one or two new drives every year, at whatever price point they're willing to pay, and swap out the smallest or slowest drive in the array. Meaning the array will always contain 2-3 different generations of hard drives. Never the same speed and never the same capacity. And they expect that if a rebuild fails, some of their data will still be retrievable. Without a professional data recovery company.
which rules out all RAID levels except 0, which is nuts. An algorithm that can handle this scenario is consistent hashing. Weighted consistent hashing can handle disparate resources, by assigning more buckets to faster or larger machines. And it can grow and shrink (in a drive array, the two are sequential or simultaneous).
Small and old businesses begin to resemble consumer purchasing patterns. They can’t afford a shiny new array all at once. It’s scrounging and piecemeal. So this isn’t strictly about chasing consumers.
I thought ZFS was on a similar path, but the delays in sprouting these features make me wonder.
Unraid "solves" that althought needs some user knowledge (IIRC parity drives must be the biggest one in array and for some reason that isn't automatic)
Ceph actually works quite well for that, althought obviously far more complex than anything for home use. You just tell it "x chunks with N parity" and it will spread it over available drives. Just need more drives than x + n and not too egregious size differences.
> Weighted consistent hashing can handle disparate resources, by assigning more buckets to faster or larger machines. And it can grow and shrink (in a drive array, the two are sequential or simultaneous).
It would need to be more complex than that. Putting chunks on, say, 2, 6, 8, 12, and 22 TB drives (which might be what you'd get if you just buy "cheapest GB/$" for your NAS over the last 10 years!) takes more than simple weighting.
I can afford it, but have a hard time justifying the costs, not to mention scrapped (working) hardware and inconvenience (of swapping to a whole new array).
I started using snapraid [1] several years ago, after finding zfs couldn't expand. Often when I went to add space the "sweet spot" disk size (best $/TB) was 2-3x the size of the previous biggest disk I ran. This was very economical compared to replacing the whole array every couple years.
It works by having "data" and "parity" drives. Data drives are totally normal filesystems, and joined with unionfs. In fact you can mount them independently and access whatever files are on it. Parity drives are just a big file that snapraid updates nightly.
The big downside is it's not realtime redundant: you can lose a day's worth of data from a (data) drive failure. For my use case this is acceptable.
A huge upside is rebuilds are fairly painless. Rebuilding a parity drive has zero downtime, just degraded performance. Rebuilding a data drive leaves it offline, but the rest work fine (I think the individual files are actually accessible as they're restored though). In the worst case you can mount each data drive independently on any system and recover its contents.
I've been running the "same" array for a decade, but at this point every disk has been swapped out at least once (for a larger one), and it's been in at least two different host systems.
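For anyone curious, the config is pleasantly boring; it looks roughly like this (paths are examples, check the snapraid manual for specifics), driven by a nightly `snapraid sync` from cron:

```
# /etc/snapraid.conf (sketch)
parity  /mnt/parity1/snapraid.parity
content /var/snapraid/snapraid.content
content /mnt/disk1/snapraid.content
data d1 /mnt/disk1/
data d2 /mnt/disk2/
exclude /tmp/
```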
I like Snapraid, but there are a few downsides worth mentioning
* Rebuilds are semi-offline, as you said. Almost every other solution, even Unraid, will immediately emulate the data from a failed drive. On Snapraid you have to wait for each file to be restored, and depending on your union setup this may mean you have directories with half the files missing
* Whenever you modify or delete a file between syncs, your parity is now out of sync. This means some other files may fail to restore if you have a failure now. However, using more than single parity will make this far less likely to happen. Another way to fix this completely is the "snapraid-btrfs" tool, which runs Snapraid on snapshots of independent btrfs disks (somewhat like Synology?), meaning the old data is still available
* It saves almost no file metadata, not even owner and mode. Restored files just use the umask of the user running the restore command. A minor one, but surprisingly annoying
However one big advantage over Unraid it has is how transactional and rigorous it is. A power failure can cause Unraid parity to desync, with no clear way to know which disk is right. Snapraid OTOH is designed to survive interruptions gracefully, and checksums all files so should never accidentally restore corrupt data. And can detect silent drive failure ("bitrot") as a bonus
And for a typical home NAS storing movies and family photos (mostly append-only), those downsides are probably no big deal anyway
I gave snapraid a serious look a few months back and decided it might not be for me because the act of balancing writes out to the "array" member disks appeared to be manual.
I didn't want to point applications at 100T of "free" space only for writes to start blocking after 8.
Sorry I said unionfs but it's actually handled by mergerfs [1], and it's all automatic. There are a whole boatload of policies [2] to control writes.
I use "existing path, least free space". Once a path is created, it keeps using it for new files in that path. If it runs out of space, it creates that same path on another drive. If the path exists on both drives for some reason, my rationale is this keeps most of the related files (same path) together on the same drive.
I see there's some newer "most shared path" options I don't remember that might even make more sense for me, so maybe that's something I'll change next time I need to touch it.
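For reference, the policy is just a mergerfs mount option; something like this (option names from memory, so double-check against the mergerfs docs):

```
# one-off mount; in fstab use the same options with fstype fuse.mergerfs
mergerfs -o allow_other,category.create=eplfs \
    /mnt/disk1:/mnt/disk2:/mnt/disk3 /mnt/pool
```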
> If it runs out of space, it creates that same path on another drive.
That's not how it works.
The policy picks what branch to use and then once selected mergerfs will clone the relative path as needed. With "ep" policies it will never select a branch that doesn't have the full relative path. "msp" will always rerun the check one level up in the hierarchy if nothing is found at the current level.
> which rules out all RAID levels except 0, which is nuts.
RAID 1, you mean? Because that way you still have a complete copy of your data if one drive fails.
BTRFS is excellent for the use case of a wide variety of mismatched drives, because it supports adding and removing drives and rebalancing the array. But for the moment only the RAID 1 modes are really trustworthy. I have a NAS consisting of drives whose advertised capacities in GB are: 1000, 1000, 1920, 2000, 2048, 2050, 3840, 3840, 4000. But it started as a pile of infamously unreliable 3TB hard drives.
The BTRFS raid1 feature is very badly named; it's best called two-copy. Skipping some details, two copies of data are written to the two drives with the most free space. So you can have multiple mismatched drives.
However you get no striping, and data is only read from one drive, so performance is limited to that of one drive for reads and writes. Plus with mismatched drives, smaller drives go unused unless you write enough data.
You're going to have to support that claim a bit better. The core idea of RAID 1 is mirroring data, which BTRFS RAID 1 mode definitely does. Striping is not an essential part of RAID 1 (hence RAID 10), and reading data from two disks in parallel is an optional performance optimization that is not performed by all RAID 1 implementations (but could be implemented for BTRFS RAID 1: https://stackoverflow.com/questions/55408256/btrfs-raid-1-wh... ).
> Plus with mismatched drives, smaller drives go unused unless you write enough data.
Yes, the allocation is suboptimal from a performance perspective, as I've already said. But it is simple and straightforward and reasonably good at avoiding putting you into a situation where manually issuing a rebalance command is necessary. If you do need better performance, there's a RAID 10 mode. But since my NAS is currently a motley pile of SSDs, I don't need to do anything extra to have decent performance.
RAID1 is about mirroring disks, BTRFS RAID1 mirrors block groups. Plus traditionally a RAID1 of 3 disks will mirror the same data on all disks, which is different to how the RAID1 mode on BTRFS acts. So the name leads to misunderstandings since it doesn't act like RAID1 at all.
It'd be way easier to talk about if it had a unique name, and you could say "It's like RAID1".
Despite all that I do like the mode, and use it in a few places.
BTRFS RAID 1 works just fine with heterogeneous disks with no extra configuration required; I would not have been able to provide examples from experience otherwise. When writing new data, it allocates two blocks on two separate devices, preferring to allocate space on whatever devices have the most free space. This may not be optimal from a performance perspective, but it does mean there's minimal wasted capacity in most configurations. (The RAID1c3 and RAID1c4 modes work the same, but with three or four copies of each block, all on separate devices.)
Were you assuming that BTRFS RAID 1 meant every block of data gets mirrored across every device? I've seen that assumption before, from people who don't realize there are two ways to generalize RAID1 to more than two devices.
Eh, it can kinda work. If you have 1, 2, 3, and 4TB drives, you can make a RAID10 out of the first TB of each, then stitch a 2TB RAID0 from the second and third and RAID1 it with 2TB of the 4th drive, then a RAID1 out of the remaining TB on the 3rd and 4th.
Until the BTRFS filesystem gets to some indeterminate threshold of "almost full", at which point you typically need to reformat and restore from backups…
I haven't been following BTRFS development closely for a few years, so I don't know how much more progress they've made on eliminating those corner cases. But I'm still careful about not letting the filesystem get close to full. Earlier this month, I noticed one of my machines was over 80% full, so I grabbed a few drives I had lying around that weren't in use at the time and added them to the array. The flexibility of BTRFS gives you a lot more options than eg. ZFS for getting out of a tight spot without the full wipe and restore hassle. But both can still leave you stuck, especially if you ignore the need for monitoring and preventive maintenance.
Disk full is not a “corner case”. It should be one of the first test cases a filesystem test suite should have, and that test should always pass in every configuration.
The fact that total data loss happened regularly on full volumes many years after a “stable” release means BTRFS is either fundamentally flawed in design or incompetently implemented. I have no idea how it was accepted into mainline Linux.
The fact that a certain large NAS vendor used BTRFS by default (without documentation at the time of purchase) has cost my org and others lots of downtime and money.
I didn't say it was. There are plenty of disk full scenarios that can be handled by BTRFS without trouble, especially when adding more disks (even temporarily) is an option. There are ways to get stuck with a full fs, but it's not a guarantee you'll end up in a "wipe and restore from backup" situation. (And to be fair, ZFS also gets very problematic when completely full; this is a general problem for CoW filesystems.)
Sooo basically if you were using XFS you wouldn't need to do anything at all till it got 99% full, and if you got it there you'd just get a "disk is full" message instead of some catastrophe? I don't need that kind of stress in my life...
And with plain RAID6 I can still add hard drives if needed. Yeah, the case of buying bigger drives is still a bit iffy, but btrfs RAID1 is gonna lose me more space than RAID6+LVM+XFS anyway...
> I’m a developer, I can afford it. None of my friends can. Mom definitely can’t.
The only thing that can work for your mom and your friend is, in my opinion, a pair of disks in a mirror. When the space runs out, buy another box with two other disks in a mirror. Anything more than this is not only too complex for the average user but also too expensive.
I think I would advise against a direct mirroring -- instead, I'd do sync-every-24-hours or something similar.
Both schemes are vulnerable to the (admittedly rarer) errors where both drives fail simultaneously (e.g. mobo fried them) or are just ... destroyed by a fire or whatever.
A periodic sync (while harder to set up) will occasionally save you from deleting the wrong files, which mirroring doesn't.
Either way: Any truly important data (family photos/videos, etc.) needs to be saved periodically to remote storage. There's no getting around that if you really care about the data.
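The periodic sync really can be as small as one cron line plus rsync; something like this (paths are examples, and note that --delete propagates deletions, so leave it off if the second disk should keep deleted files around):

```
# crontab entry: mirror the primary disk onto the second one every night at 03:00
0 3 * * * rsync -a --delete /mnt/primary/ /mnt/backup/
```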
I completely agree. To build my array I had to buy several drives at the same time. To expand I had to buy a new drive, move the data onto the array, and then I'm left with the extra drive I had to buy to temporarily store the data, because I can't add it to the array.
I would love to have more options for expandable redundancy.
We may get a successor filesystem before that particular situation is sorted out..
By all accounts mainline is at best not interested, if not actively against ZFS on Linux. The last few kerfuffles around symbols used by the out-of-tree module laid out the position rather unambiguously.
It’s sad to see that free software under a license (and movement) that was born out of someone’s frustration with closed-source printer drivers (acting as DRM, albeit inadvertently) appears to include similar DRM whose sole purpose is to restrict usage of a (seemingly arbitrary) selection of symbols.
OpenZFS is a fork of the Sun/Oracle release of ZFS. The OpenZFS maintainers cannot change the license without consent of the copyright owners, and Oracle is unlikely to consent. Oracle could potentially change the terms in a future release of CDDL, and that might work, too.
Similarly, Linux probably has too many contributors, some of which are no longer living, to come to an agreement on a fixed license either.
Although, license changes do happen; OpenSSL did one recently, but I think there are fewer contributors.
Because it uses code released by Sun under the CDDL. Oracle has since closed sourced the whole thing again so they're not going to change the license of this old code, they don't even want it out there but they can't take it back.
https://lwn.net/Articles/939842/ was the most recent one (August 2023). The article focuses on the NVIDIA proprietary driver but the bulk of the comments are about the impact on ZFS.
ZFS is already a very awkward citizen on Linux (e.g. can't use the page cache so it has to duplicate all that in the ARC; won't be able to take advantage of multi-page folios because of the lowest-common-denominator SPL; ...) and it will get worse, not better.
bcachefs has had time to actually mature, instead of being kneecapped early on by an angry Linus Torvalds the way btrfs was when its on-disk format changed and broke his Fedora install.
Has the whole license incompatibility thing actually been tested/litigated in court? I heard Canonical has (at least at some point) shipped prebuilt ZFS. I understand that in-tree inclusion brings its own set of problems, but I’m just asking about redistribution of binaries - the same binaries you are allowed to build locally.
It would be nice to have a precedent deciding on this bullshit argument once and for all so distros can freely ship prebuilt binary modules and bring Linux to the modern times when it comes to filesystems.
The whole situation is ridiculous. I'd understand if this was about money, but who exactly gets hurt by users getting a prebuilt module from somewhere, vs building exactly the same thing locally from freely-available source?
The only entity that could start a litigation cycle is Oracle, and they've either been uninterested or know they can't win. Canonical is the only entity they have any chance of going after; Canonical's lawyers already decided there was no license conflict.
Linus Torvalds doesn't feel like being the guinea pig by risking ZFS in the mainline kernel. A totally reasonable position while the CDDL+GPL resolution is still ultimately unknown. (And honestly, with OpenZFS supporting a very wide range of Linux versions and FreeBSD at the same time, I have the feeling that mainline inclusion in Linux might not be the best outcome anyway.)
I wonder what Oracle would litigate over though? My understanding is that licenses are generally used by copyright holders to restrict what others can do with a work so the holder can profit off the work and/or keep a competitive advantage.
Here I do not see this argument applying since the source is freely available to use and extend; the license explicitly allows someone to compile it and use it. In this case providing prebuilt binaries is more akin to providing a "cache" for something you can (and are allowed to) build locally (using ZFS-DKMS for example) using source you are once again allowed to acquire and use.
What prejudice does it cause to Oracle that the "make" command is run on Ubuntu's build servers as opposed to users' individual machines? Have similar cases been litigated before where the argument was about who runs the make command, with source that either party otherwise has a right to download & use?
Playing legal chicken with Oracle is a really dumb thing to do because catching their customers in obscure license violations and then suing the hell out of them is their entire business model. They specialise in this stuff.
Copying and distributing those binaries is covered by copyright. You have to follow the license if you want to do it. It doesn't matter that end users could legally get the same files in some other manner. Distribution itself is part of the legal framework.
It isn’t compatible because it adds additional restrictions regarding patented code. The CDDL was made to be used in mixed license distributions and only affects individual files, but the GPL taints anything linked (which is why LGPL exists). Since the terms of the CDDL can’t be respected in a GPL’d distribution, I can’t see a way for it to ever be included in the kernel repo.
I don't think there's any issue with Canonical shipping a kmod, but similar to 3rd-party binary drivers, it would need to be treated as a different "work".
It's a little unfair to blame it on GPL. If ZFS was MIT/BSD/etc licensed, it could be brought into the Linux repository just fine. CDDL explicitly doesn't let that happen, by saying "Source Code form must be distributed only under the terms of this License". That only is harsh.
If the kernel was MIT/BSD/etc licensed, ZFS would be allowed in the kernel just fine too. But only one of the licenses is infectious and requires things of the other sources.
It's definitely a GPL issue more than a CDDL issue.
GPL predates CDDL; CDDL authors were well aware of GPL; CDDL authors chose to make the license different from existing commonly-used open source licenses. Ask why...
I'm aware of the surrounding arguments, I promise; my remark was that I doubted Linux would like to ship a large mass of code with any additional encumbrances even if it didn't somehow fail the GPL-compatibility test. (I'm aware that this is an impossible condition barring court precedent that ignores the definition of the constraint.)
Can you elaborate on what kind of bugs? I've been following bcachefs casually because I am very excited for its eventual stabilization. But I am very conservative with my main storage array.
Not GP, but I tried it out a few weeks ago. I experienced it working for about 15-20 minutes, then it stopped doing any I/O to the disks until I rebooted. After it happened a few times in one evening, I switched to something else.
If you don't need real-time redundancy, you may be better served by something like SnapRAID. It's more flexible, can mismatch disk sizes, performance requirements are much lower, etc.
Yeah, the non-parity RAID modes have been safe for a pretty long time, as long as you RTFM when something goes wrong instead of assuming the recovery procedures match what you'd expect coming from a background of traditional RAID or how ZFS does it. I've been using RAID-1 (with RAID1c3 for metadata since that feature became available) on a NAS for over a decade now without loss of data despite loss of more drives over the years than the array started out with.
This is a really neat addition to RAID-Z. I recall setting up my ZFS pool in the early 2000s and grappling with disk counts because of how rigid expansion was. Good times. This would've made things so much simpler. Small nitpick: in the "during expansion" bit, I thought he could have elaborated a touch on restoring the "health of the raidz vdev" part; didn't really follow his reasoning there. But overall, looking forward to this update. Nice work.
I remember in the decade+ before zfs was introduced and in several of the years since then trying to figure out the optimal stripe size for RAID 10 volumes and the optimal stripe size and count for RAID 5 volumes with Solaris Disk Suite, VxVM, Linux md, etc. Fixing bad decisions has typically meant back up twice, test a restore, fix the volume, and pray one of your backups is good. :)
ZFS has generally been easier to go from zero to working well once you accept that life doesn’t need to be so complicated as legacy md, lvm, and fs layers made it. It also has a pretty big footgun that causes me to carefully read the man page before using “zpool add” or “zpool attach”. Vdev removal is a big help for this.
Oh, the memories of trying to optimize stripe sizes in the pre-ZFS days! I can totally relate to the struggles with Solaris Disk Suite and Linux md. The pain of fixing poor decisions and the nail-biting backup processes are all too familiar. ZFS really did simplify things in many ways. Though I agree, it's always a good idea to tread carefully with commands like `zpool add` and `zpool attach`. Glad vdev removal has made that process a bit smoother. It's pretty cool to see how storage management has evolved over the years. Cheers for the trip down memory lane. :)
Lack of this feature has kept me on Synology and their SHR-system.
I can just slap in any drives of any size and it'll just adapt. As long as the new drives go in an empty slot or are bigger than the previous one, I get more space. Zero CLI commands needed.
Now get this feature in unraid or TrueNAS and I'm switching.
> How cool would it be if we had a great GUI for ZFS (snapshots, volume management, etc.). I could buy a new external disk, add it to a pool, have seamless storage expansion.
It would be great. Ah, what could have been.