
> After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data to 2 parity).

Does anyone know why this is the case? When expanding an array which is getting full this will result in a far smaller capacity gain than desired.

Let's assume we are using 5x 10TB disks which are 90% full. Before the process, each disk will contain 5.4TB of data, 3.6TB of parity, and 1TB of free space. After the process and converting it to 6x 10TB, each disk will contain 4.5TB of data, 3TB of parity, and 2.5TB of free space. We can fill this free space with 1.67TB of data and 0.83TB of parity per disk - after which our entire array will contain 37TB of data.

If we made a new 6-wide Z2 array, it would be able to contain 40TB of data - so adding a disk this way made us lose 3TB in capacity! Considering the process is already reading and rewriting basically the entire array, why not recalculate the parity as well?
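
A quick back-of-the-envelope check of those numbers (same assumptions as above: 10TB disks, 90% full, hypothetical figures only):

    awk 'BEGIN {
      old_data = 5 * 9 * 3/5        # 27 TB already stored at 3+2
      free_raw = 6 * 10 - 5 * 9     # 15 TB of raw space left after adding a disk
      new_data = free_raw * 4/6     # 10 TB more data, written at 4+2
      fresh    = 6 * 10 * 4/6       # 40 TB for a freshly built 6-wide Z2
      printf "expanded: %d TB, fresh: %d TB, lost: %d TB\n", old_data + new_data, fresh, fresh - old_data - new_data
    }'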




> Does anyone know why this is the case?

> Considering the process is already reading and rewriting basically the entire array, why not recalculate the parity as well?

IANA expert but my guess is -- because, here, you don't have to modify block pointers, etc.

ZFS RAIDZ is not like traditional RAID, as it's not just a sequence of arbitrary bits, data plus parity. RAIDZ stripe width is variable/dynamic, written in blocks (imagine a 128K block, compressed to ~88K), and there is no way to quickly tell where the parity data is within a written block, where the end of any written block is, etc.

If you instead had to modify the block pointers, I'd assume you'd also have to change each block in the live tree and all dependent (including snapshot) blocks at the same time? That sounds extraordinarily complicated (and this is the data-integrity FS!), and much slower than just blasting through the data in order.

To do what you want, you can do what one could always do -- zfs send/recv between an old and a new filesystem.
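
Roughly, with made-up pool/dataset names, that looks something like this (zfs send -R carries the snapshots along with the data):

    zfs snapshot -r oldpool/data@migrate
    zfs send -R oldpool/data@migrate | zfs recv newpool/data
    # once the copy checks out:
    zfs destroy -r oldpool/data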


> To do what you want, you can do what one could always do -- zfs send/recv between an old and a new filesystem.

Sure, but that involves having enough spare disks, enough places to put them, and enough places to connect them.

This way, while the initial expansion is not ideal, it works. If you really need the space gains from a wider distribution, you can do this expansion and then do the "make a new copy, then replace the old copy with the new copy" dance... although that's counterproductive if you have snapshots.


> Sure, but that involves having enough spare disks

So long as we are pointing out things: this presumes that the user doesn't have enough space to make the copies on their own disks? If one has a 3GB dataset and 9GB of free space, one can easily zfs send/recv that dataset and destroy the old copy.

> This way, while the initial expansion is not ideal, it works.

Yep.

> If you really need the space gains from a wider distribution, you can do this expansion and then do the "make a new copy, then replace the old copy with the new copy" dance... although that's counterproductive if you have snapshots.

That's why you do a zfs send/recv, instead of a copy. ZFS will copy your snapshots for you!


But you are moving the blocks to other disks anyway, surely that would require a change of the block pointers too?


Again, not an expert -- but if you look at the presentations re: this feature, I presume you wouldn't. Why? Because another feature, adopted concurrently, provides for time-independent geometry. So if you look at the pool after time X, you have the same block numbers, etc., but with a different geometry.


That is not how this will work.

The reason the parity ratio stays the same is that all of the references to the data are by DVA (Data Virtual Address, effectively the LBA within the RAID-Z vdev).

So the data will occupy the same amount of space and parity as it did before.

All stripes in RAID-Z are dynamic, so if your stripe is 5 wide and your array is 6 wide, the 2nd stripe will start on the last disk and wrap around.

So if your 5x10 TB disks are 90% full, after the expansion they will contain the same 5.4 TB of data and 3.6 TB of parity, and the pool will now be 10 TB bigger.

New writes will be 4+2 instead, but the old data won't change (that is how this feature is able to work without needing block-pointer rewrite).

See this presentation: https://www.youtube.com/watch?v=yF2KgQGmUic


The linked pull request says "After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distributed among the larger set of disks". That'd mean that the disks do not contain the same data, but it is getting moved around?

Regardless, my entire point is that you still lose a significant amount of capacity due to the old data remaining as 3+2 rather than being rewritten to 4+2, which heavily disincentivizes the expansion of arrays reaching capacity - but that is the only time people would want to expand their array.

It just seems to me like they are spending a lot of effort on a feature which you frankly should not ever want to use.


I don't think that's true.

I don't use raidz for my personal pools because it has the wrong set of tradeoffs for my usage, but if I did, I'd absolutely use this.

Yes, older data keeps the old data:parity ratio, but you now have more total storage available, which is the entire goal. Sure, it'd be more space-efficient to go rewrite your data, piecemeal or entirely, afterward, but you now have more storage to work with, rather than having to remake the pool or replace every disk in the vdev with a larger one.


> So the data will occupy the same amount of space and parity as it did before.

So you lose data capacity compared to "dumb" RAID6 on mdadm.

If you expand RAID6 from 4+2 to 5+2, you go from spending 33.3% of raw capacity on parity to 28.6%.

If you expand RAIDZ from 4+2 to 5+2, your new data will use 28.6%, but your old data (which is the majority, because if it wasn't you wouldn't be expanding) will still use 33.3% on parity.
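
Quick check of those figures (parity fraction = parity disks / total disks):

    awk 'BEGIN { printf "4+2: %.1f%% parity overhead   5+2: %.1f%%\n", 100*2/6, 100*2/7 }'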


Could you force a complete rewrite if you wanted to? That would be handy. Without copying all the data elsewhere of course. I don't have another 90TB of spare disks :P

Edit: I suppose I could cover this with a shell script needing only the spare space of the largest file. Nice!


> Could you force a complete rewrite if you wanted to?

On btrfs that's a rebalance, and part of how one expands an array (btrfs device add + btrfs balance)
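
Roughly (device and mountpoint are just examples):

    # grow the filesystem onto the new disk, then redistribute existing data
    btrfs device add /dev/sdX /mnt/pool
    btrfs balance start --full-balance /mnt/pool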

(Not sure if ZFS has a similar operation, but from my understanding resilvering would not be it)

Not that it matters much, though, as RAID5 and RAID6 aren't something you can depend on, and the array failure modes are weird in practice, so in the context of expanding storage it really only matters for RAID0 and RAID10.

https://arstechnica.com/gadgets/2021/09/examining-btrfs-linu...

https://www.unixsheikh.com/articles/battle-testing-zfs-btrfs...


ZFS does not, and fundamentally is never going to get one without rewriting so much you'd cry.


The easiest approach is to make a new subvolume, and move one file at a time. (Normal mv is copy + remove which doesn't quite work here, so you'd probably want something using find -type f and xargs with mv).
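
As a rough sketch (made-up paths; a while-read loop instead of xargs to keep the quoting simple, and --reflink=never so the copy is a real rewrite rather than a clone):

    cd /mnt/pool/old
    find . -type f -print0 | while IFS= read -r -d '' f; do
        mkdir -p "/mnt/pool/new/$(dirname "$f")"
        cp -a --reflink=never "$f" "/mnt/pool/new/$f" && rm "$f"
    done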


> Considering the process is already reading and rewriting basically the entire array, why not recalculate the parity as well?

Because snapshots might refer to the old blocks. Sure you could recompute, but then any snapshots would mean those old blocks would have to stay around so now you've taken up ~twice the space.


Or you could just rewrite the snapshots while you are at it.


How would you go about doing that?

AFAIK in ZFS snapshots are just a pointer to (an old) Merkle tree[1]. If you go about changing the blocks then you need to update that tree, but you can't due to copy-on-write (without paying for the copy, like I mentioned).

[1]: https://openzfs.readthedocs.io/en/latest/introduction.html#d...


The key issue is that you basically have to rewrite all existing data to regain that lost 3TB. This takes a huge amount of time and the ZFS developers have decided not to automate this as part of this feature.

You can do this yourself though when convenient to get those lost TB back.

The RAIDZ vdev expansion feature actually was quite stale and, afaik, wasn't being worked on until this sponsorship.


Was it confirmed anywhere that this space-inefficiency-until-rewrite issue was why it stagnated? I remember looking up the progress a few months ago and being perplexed by the radio silence, given there were celebrations that we were apparently on the home straight back in 2021.


No, that’s not what I meant. It’s unrelated. After Matthew Ahrens' initial commit years ago, basically nothing happened.

The stuff about rewriting data was just something I wrote to clarify.


I imagine there would eventually be a way / option to automatically rewrite old blocks. Or at least I would hope so, because usually when you're adding new disks the array is going to be near full.


Almost certainly not in any useful fashion that doesn't duplicate data versus old snapshots.

ZFS really deeply assumes you're not gonna be rewriting history for a bunch of features, and you'd have written a good chunk of a new filesystem to reimplement everything without those assumptions.


IIRC one of the developers was asked about this and said it would be about as much additional effort as the whole expansion feature so far. There is of course the user-level workaround of (roughly) `for file in *; do cp "$file" "$file.new" && mv "$file.new" "$file"; done`, but if I understand correctly you'd need to do this with no snapshots, otherwise the space from $file wouldn't be freed, so you'd need an array that's at least half empty.


It'd probably be worse than that.

I cannot stress enough how expensive and invasive that would be.


You can do a zfs send/zfs recv to send a dataset (including all snapshots) to yourself, which effectively rewrites the whole dataset, including all of its history, by duplicating it.

Not hard, but it does require sufficient free space. Once it's done you can destroy the original dataset and reclaim the space.
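
With made-up dataset names, the dance is roughly:

    zfs snapshot -r tank/data@rewrite
    zfs send -R tank/data@rewrite | zfs recv tank/data_new
    # after verifying the new copy:
    zfs destroy -r tank/data
    zfs rename tank/data_new tank/data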



