A look at VDO, the new Linux compression layer (redhat.com)
146 points by lima on April 18, 2018 | 63 comments



At first I was confused because I thought that someone miraculously implemented compression with no context switching via VDSO. Funny to see how wrong I was ;)

Could someone explain how VDO works to me? Based on the example, it looks like another DM backend, like LUKS and such. It exposes a virtual device backed by a real one, adding a layer of compression, am I reading it right? I can also see that we're specifying a "logical size" that is exposed to the user. How much space is really used and how is it allocated? Can I grow the logical size later?

Apart from this - what is the status of the patch? Is it Red Hat-only, or is it on its way upstream?


> Apart from this - what is the status of the patch? Is it Red Hat-only, or is it on its way upstream?

I had a conversation with some of the RH kernel devs a few weeks ago and, from my (hazy) memory, this comes from Red Hat's acquisition of Permabit, so it was closed source. It's now GPL'd, but needs a fair amount of tidying up and bits rewritten to get it into a state that the kernel developers would accept (they reimplemented some standard kernel features so they didn't have to link to GPL'd code). So it will be upstreamed, but it will take time.

In the meantime it's released for RHEL only, not even in Fedora. IIRC it could be ported to other distros (it's 100% GPL'd); it just hasn't been.


> It exposes a virtual device backed by a real one, adding a layer of compression, am I reading it right?

And deduplication, yes.

>Can I grow the logical size later?

Yes; the man page for the program discusses the growLogical subcommand briefly: https://github.com/dm-vdo/vdo/blob/6.1.1.24/vdo-manager/man/...

>I can also see that we're specifying a "logical size" that is exposed to the user. How much space is really used and how is it allocated?

There's a vdostats command mentioned in the article that exposes how much actual space is available and used.
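
For reference, a rough sketch of the commands involved (device and size values here are made up; check the vdo and vdostats man pages for your version):

    # create a VDO volume on /dev/sdb, exposing a 10T logical device
    vdo create --name=vdo1 --device=/dev/sdb --vdoLogicalSize=10T

    # later, grow the logical size exposed to the filesystem
    vdo growLogical --name=vdo1 --vdoLogicalSize=20T

    # see how much physical space is actually allocated and used
    vdostats --human-readable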

>Apart from this - what is the status of the patch? Is it Red Hat-only, or is it on its way upstream?

It has not yet been submitted as a kernel patch.


If I specify a logical size, and that's larger than the physical backing size, then what happens if someone attempts to write unique high-entropy data to fill their logical space?


VDO will return ENOSPC when it gets a write and has nowhere to store it. The filesystem or other writer is responsible for passing ENOSPC up to the user, just like if dm-thin were to run out of space.
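
A hedged way to see this for yourself on a throwaway volume (paths are made up): write unique, incompressible data until the backing store runs out.

    # dd should eventually fail with "No space left on device" (ENOSPC)
    dd if=/dev/urandom of=/mnt/vdo-test/filler bs=1M status=progress

    # meanwhile, watch the real backing store fill up
    watch vdostats --human-readable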


OK, that sounds like a convincing reason not to use this feature - it really looks like the wrong abstraction. Consider the reasons why you could get ENOSPC before:

1. Ran out of disk space,

2. Ran out of inodes,

3. Media reported ENOSPC and it gets propagated,

4. (something else I don't know about?)

So now, #3 gets more common. Is it actually something that FS devs think about? "What if I get this weird error message?" Or is it something that could trigger some destructive edge case in the filesystem and lead to serious data loss? You mention dm-thin doing the same, so I guess that at least the popular filesystems should handle it well.

Anyway, when you think about it, how do you debug #3? You checked the file size, you checked the number of inodes, and other than that the FS gives you no feedback. There's no central API for signalling this kind of problem and suggesting solutions, and I would say that this makes systems much less flexible. You could of course run vdostats (and whatever dm-thin uses to report its resources), but just imagine the amount of delicate code you would need to automatically solve this kind of issue. It's insane, it really looks like crappy engineering to me.


#3 is certainly not a historically common error from a disk, but thin provisioning has been around for a while -- dm-thin was introduced in Linux 3.2 in 2012, and filesystems have, in my opinion, worked extensively to handle ENOSPC correctly since then. Consider that storage returning ENOSPC at a particular write and all thereafter is roughly equivalent to storage suddenly refusing to write, and storage refusing to write is equivalent to a system crash and reboot. Filesystems do work hard to recover from a crash without losing data that had been flushed, assuming the storage was properly processing flushes, and this case should be very similar.

Filesystems (and VDO) both log extensively to the kernel log in an out-of-space situation, so inspection via dmesg or journalctl hopefully leads swiftly to identification of the problem. The 'dmeventd' daemon provides automatic monitoring of various devices (thin, snapshot, and raid) and emits warnings when the thin-provisioned devices it is aware of are low on space; there's a bug filed against VDO to emit similar warnings [1]. Careful monitoring is definitely important with thin-provisioned devices, though.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1519307
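
As a rough illustration of that kind of monitoring (the exact log strings will differ by kernel and VDO version):

    # look for out-of-space noise from the filesystem / device-mapper
    journalctl -k | grep -iE 'no space|enospc|vdo'

    # check how full the VDO backing store actually is
    vdostats --human-readable

    # for dm-thin / LVM thin pools, data and metadata usage show up here
    lvs -o lv_name,data_percent,metadata_percent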


Inodes are a similarly weird reason to be "out of space". The number of new-to-Linux users that I've asked "have you checked inodes?" is huge. Once you know that, you know that there are more things than just bits on the drive that affect available "space".
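
For anyone who hasn't run into it, the check is simply:

    df -h    # free blocks, i.e. "space" as most people think of it
    df -i    # free inodes; at 100% IUse% you get ENOSPC with blocks free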


For data compression it uses the LZ4 format [1]: a real-time, LZ77-like data compressor (no entropy encoding, just string references and literals), with small blocks so its LUT-based O(1) search always stays in the data cache.

[1] https://rhelblog.redhat.com/2018/02/05/understanding-the-con...


This is great news!

At my employer, we're currently using btrfs in production for deduplication. It's relatively stable nowadays, but fragmentation and maintenance are massive issues for our use case.


On Red Hat? Are you aware that Red Hat has deprecated btrfs? I believe btrfs is still used heavily in SUSE.


Exactly!

_" The Btrfs file system has been in Technology Preview state since the initial release of Red Hat Enterprise Linux 6. Red Hat will not be moving Btrfs to a fully supported feature and it will be removed in a future major release of Red Hat Enterprise Linux. The Btrfs file system did receive numerous updates from the upstream in Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat Enterprise Linux 7 series. However, this is the last planned update to this feature."_

Source: https://access.redhat.com/documentation/en-us/red_hat_enterp...


Yes, we build and test our own kernels and userland tools and figured out which kernel versions work and which ones don't. You've got to keep a close eye on the IRC channel and mailing lists. We never had any data loss but a lot of unplanned downtime until it got sufficiently stable - for our use case - a few months ago.

Works well, but we'd rather not have to worry about these kinds of things.

(even before it was deprecated, it was marked a "technology preview", rightfully so - if you want to use btrfs, you need to stay very close to mainline)


If you plan to use it for your personal traditional storage HDD, you have to keep in mind that it will render most (if not all) data recovery tools useless. Tools such as PhotoRec rely on the file headers to determine the file type to recover the data.

I have a long-term storage HDD that I thought I would never need data recovery on, until I fucked it up good with a wrong partitioning command. Normally those types of mistakes can be recovered by TestDisk, but not that time. I realized my only hope was to use PhotoRec to recover what I could from the garbage that I had overwritten on the disk. Thanks to PhotoRec I could recover most of the files, albeit with messed-up names. I was so thankful I didn't use any fancy compression algorithm.

Tools such as LVM, SSD backing storage, btrfs, compression and stuff are all nice when you understand their limitations. For now, I don't use all of my storage so I just create ext3, ext4, and exFAT partitions to store my data for the maximal chance of recovery, whether it is due to hardware or software.


I'm not saying you're wrong, but "what about data recovery" has been used to argue against every advance in storage, even massive ones like SSDs. Have backups and unshackle yourself from fear.


I'm also old enough to remember how buggy DoubleSpace was and, truth be told, that still kinda puts me off disk compression.


NTFS compression is pretty transparent and robust. I usually have 60% of my main disk compressed (laptop; I need the space, especially with multiple Visual Studios installed), and it's been many years with no failures of any kind.


As I understand it you've never had to recover your compressed files.


I run off Google Drive or GitHub for all files. I think most people should. Local data is ephemeral.


On the other hand, you should make backups anyway.


On the other hand if you are using full-disk encryption then you will be fucked anyways.


Depends. If you mean software encryption (not ATA-level crypto) you may be able to back up the header externally and still decrypt it. The data will be scrambled where it would have been scrambled anyway, and unmodified regions will remain unmodified.


Using something like VDO for personal storage would be pretty pointless.

Unless, of course, you have a ton of similar virtual machines or other data with a high amount of 4 kB block-aligned redundancy.


If your data directories have a lot of redundancy (so deduplication can help), or are nicely compressible, then VDO can also make sense for home setups. Like companies, you would consider things like "does it make sense for my kind of data", "do I have enough CPU cycles/RAM to spend on getting storage usage down to some degree" and "am I OK with the additional layer, i.e. a more complicated setup".


IMO VDO makes sense where you deal with high-speed, low-capacity media. I've used BTRFS compression to hold huge disk images in RAM or on an SSD while I fully strip/TRIM and repack them (say from raw to qcow2 or similar). Of course this is usually a case in which I've imported a physical disk image of a 500GB drive, but the actual data on it was only ~50GB or less.

Edit: Sorry not the disk image, the temporary overlay for zero blanking. But sometimes smaller disk images for the whole thing.

Recently I've also experimentally been holding some older games in RAM on a compressed image.
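
A rough sketch of that kind of scratch setup, with made-up sizes and paths (btrfs needs a block device, so a sparse loop file on tmpfs stands in for "RAM"):

    # sparse file on tmpfs; it only consumes RAM as data is written
    truncate -s 200G /dev/shm/scratch.img
    losetup -f --show /dev/shm/scratch.img   # prints e.g. /dev/loop0

    # compressed btrfs on top of the loop device
    mkfs.btrfs /dev/loop0
    mount -o compress=lzo /dev/loop0 /mnt/scratch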


Considering that it is based on blazingly fast LZ4, using twice as much time as non-VDO in this simple test seems... bad? I suppose the dedup stage is really expensive here. On a setup like this with simple compression, I would have expected the results to be inverted.


I would have thought they would also consider zstd, because that would be even better and more easily tunable. But I guess there are still some license problems.


Reading some blog posts and such about ZFS compression, I got the impression that simpler algorithms such as LZ4 and LZO are typically preferred for transparent storage compression. Presumably a balance must be struck between the benefits of writing less data and the cost of taking CPU time away from other code.


VDO probably predates the zstd 1.0 release by a fair bit.


I wonder if you can run VDO on top of ZFS.

Or is there a better way to handle snapshots and data integrity (block level checksums)?

Without block level checksums, a single corrupted data block could corrupt half of your virtual machine images...


Nothing prevents you from running VDO on top of ZFS, apart from that combination not being a focus of, or tested at, Red Hat. I do not expect issues from purely running both layers, but things like proper mounting/unmounting via fstab might not be in place.

If ZFS gives you block-level checksums, then you could use compression/dedup from VDO. Just activating e.g. compression on both layers would be a waste of cycles.
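
An untested sketch of what that stacking could look like (pool and volume names are made up):

    # a 1T block device from ZFS, keeping ZFS checksums but leaving
    # compression/dedup to VDO
    zfs create -V 1T -o compression=off tank/vdo-backing

    # put VDO on top of the zvol, thin provisioned to 3T
    vdo create --name=vdo-on-zfs --device=/dev/zvol/tank/vdo-backing \
        --vdoLogicalSize=3T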


LVM provides CoW snapshots. But why would you use this instead of ZFS's native compression or dedup?


How about block level checksums? Can LVM do that as well? Or is there some other way?

ZFS dedup takes 320 bytes of RAM per unique block "record". So 1TB of RAM is enough to dedup only a bit over 6 TB worth of unique blocks, when using a block size that works well with virtual machines — 4 kB.

One can of course use a larger ZFS record size than 4 kB. But virtual machine dedup savings drop very sharply as a result if the record size does not match the virtual machine filesystem block size and alignment. This happens because there are exponentially more combinations of how 4 kB blocks can be arranged inside a bigger record.

It's painful to format all virtual machine images to use say 64/128 kB filesystem block size to be able to efficiently use larger ZFS record size dedup.

I understand VDO dedup requires significantly less memory and uses only 4 kB blocks for dedup. This is ideal for VM storage application.


> How about block level checksums? Can LVM do that as well? Or is there some other way?

LVM by itself doesn't do anything; it's a management layer on top of device-mapper. But yes, block-level checksumming has existed since cryptsetup v2.0 (very recent), via dm-integrity: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
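
A minimal sketch with the cryptsetup 2.0 tooling, standalone dm-integrity without dm-crypt on top (device name is made up, and formatting wipes it):

    # add per-sector checksum metadata to the device
    integritysetup format /dev/sdb1

    # open it; reads that fail the checksum come back as I/O errors
    integritysetup open /dev/sdb1 secured
    mkfs.xfs /dev/mapper/secured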


320 bytes? Doesn't that seem excessive, or am I missing something?


How about taking block level checksum a few levels higher and finally build a filesystem that has a Merkle tree checksum?


You mean like ZFS?


ZFS’ dedup is essentially useless, since it requires that the entire dedup table is held in RAM.

This is one area where btrfs has a huge advantage, dedup can be done offline, rather than in real time.


ZFS’ deduplication does not require the deduplication table to be held in RAM. However, the performance is IOPS bound. The only way to improve on that is to keep it in RAM or have a very fast L2ARC. Many modifications also require quite a large number of writes, which also slows things down.

My point is that the system will not stop working if the DDT is not in cache. It would still run, but it would not be fast.


ZFS's dedupe is, I believe, 64k block dedupe; VDO's is 4k block dedupe.


ZFS dedup block size is whatever you set dataset record size to be. Powers of two from 0.5 kB to 256 kB are allowed at least.

So you can set ZFS to dedup 4 kB blocks. It just requires a boatload of RAM.
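
For reference, both are per-dataset properties (dataset name is made up):

    zfs set recordsize=4k tank/vm-images
    zfs set dedup=on tank/vm-images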


FreeNAS supports lz4 compression on ZFS, but I'm not sure if that's a ZFS feature or what.


It is ZFS, and is on by default. ZFS also offers gzip compression, and you can "turn up" the LZ4 from the default if you wish.
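
Concretely, the knobs are per-dataset properties (pool/dataset names are made up):

    zfs set compression=lz4 tank/data        # the usual default choice
    zfs set compression=gzip-9 tank/backups  # heavier, slower option
    zfs get compressratio tank/data          # see what it's saving you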

I've almost always left it at the default; on my home file server (slightly long-in-the-tooth 6 core xeon, enterprise-grade spinning rust), throughput is noticeably faster on compressible data.

(ZFS dedupe should only be considered for weird cases, like if you somehow have a ton of RAM but very limited storage. Frankly, at this point I think it is an attractive nuisance that leads beginners down a dangerous path and should be removed, or at least the commands to enable it should be given loud, scary confirmation messages.)


LZ4 is a ZFS feature as far as I know, and works really well. So well, that I think it should be default.

But dedup is why I'm interested in VDO.


For block level checksums there is dm-integrity :)

It's dm layers all the way down...


That writes all data twice. If it really does emulate hard drives that have per-sector tags, it should have the same weakness: a misdirected write will put data with a correct checksum in the wrong location, which will then be served without anyone knowing.

Its documentation suggests that it can detect on journal replay whether the data was written where it was supposed to be, but it is not going to catch sectors clobbered by a misdirected write every time, because sometimes the wrong sector is perfectly overwritten with no overlap with other sectors.

This is not a replacement for ZFS zvols, which do checksum each block.


Compression usually has checksumming already.


Compression and checksumming have nothing to do with each other.

I can't think of even one compression algorithm that implements checksums. That's why archive formats (like ZIP, 7z, etc.) need to implement checksums separately: the inflate and deflate algorithms ZIP uses don't have any kind of built-in checksums.


Applications use libraries that implement data compression. Such libraries usually have checksumming in their algorithms, because compression is not very practical without it. LZ4, which VDO uses, has checksumming too.


The LZ4 frame format (called LZ4F), a stream container for LZ4-compressed data, has checksums. But the LZ4 algorithm itself doesn't have any checksums.

A block device layer would use the "raw" algorithm, not any frame/container format, and would use something like SHA-256 for checksums.

Take a look at LZ4 source: https://github.com/lz4/lz4/tree/dev/lib

VDO LZ4 source:

https://github.com/dm-vdo/kvdo/blob/master/vdo/base/lz4.c

No LZ4F frame format or checksum in sight.


Yes, you are right, they use raw lz4 and don't do checksumming there. And I can only find checksumming in superblock code and volume geometry code in vdo repository. So it doesn't look like they do it on regular data blocks at all, probably just assume you have a properly working FTL.

(By the way, SHA-256 is a slow cryptographic hash, not a checksum like CRC32.)


It should work.


So it's a layer under the filesystem and should work with any filesystem in theory? So would XFS + VDO be better than let's say using compression in BTRFS?


Yes, VDO is transparent to the layers above it, like filesystems. XFS+VDO might be preferable to BTRFS if it has the features you need:

- LVM could be used for snapshots, but the snapshots from BTRFS might be preferable in corner cases where many snapshots are taken (snapshots in XFS are being worked on AFAIK; these could then be comparable)

- the XFS+VDO combination might be more reliable; it's a completely supported combination on RHEL, whereas BTRFS is just a tech preview

- the only missing feature that comes to mind is checksums over data blocks; XFS only provides checksums over metadata
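
For the XFS-on-VDO part, the stacking roughly looks like this (device and mount point are made up; treat it as a sketch rather than a recipe):

    # -K skips discarding the whole (huge) logical device at mkfs time
    mkfs.xfs -K /dev/mapper/vdo1

    # discard lets the filesystem hand freed blocks back to VDO
    mount -o discard /dev/mapper/vdo1 /mnt/data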


How do filesystems cope with the apparent capacity of the disk changing? If you put a lot of data on that compresses really well, suddenly your disk will appear to have become larger.


The apparent capacity does not actually change; instead, the whole VDO device is thin provisioned. Let's assume a 1 TB hard disk: when creating a VDO device on top, you can have it appear directly as 3 TB, i.e. 'thin provisioned'.

The filesystem on top always sees 3 TB, unless you explicitly modify the VDO device. Of course, you have to monitor the VDO status tightly: if you happen to store data on the filesystem which is absolutely unique, has no zeros and is incompressible, then dedup/compression cannot do anything. The VDO device can only hold up to ~1 TB of such data. Your monitoring should detect the low space on the VDO backend beforehand, and you should then either stop writing or extend the backing device.
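
Roughly, with made-up names, the mismatch you have to watch looks like this:

    df -h /mnt                  # filesystem view: always the 3 TB logical size
    vdostats --human-readable   # reality: usage of the 1 TB backing device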


What happens if you go over the limit? Does VDO reject writes? Or is your filesystem lost.


With the current code, the filesystem becomes unmountable in that case. There is a workaround for getting the data accessible: leave a bit of the underlying block device unused, i.e. when creating VDO on top, do not give it the full block device. Then, in case the filesystem on top completely fills up, the last part of the block device can be used to grow the VDO device, and the filesystem can be mounted.

This should be tried out before relying on it. The current behaviour seems to be considered a bug; https://bugzilla.redhat.com/show_bug.cgi?id=1519377 has details.
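
A sketch of that workaround, with made-up names and sizes (and, as said, test it before relying on it):

    # create VDO on a partition that deliberately leaves part of the
    # disk unused, so there is headroom to grow into later
    vdo create --name=vdo1 --device=/dev/sdb1 --vdoLogicalSize=10T

    # if the backing store fills up: enlarge /dev/sdb1 (or the LV under
    # it), then tell VDO to use the extra physical space
    vdo growPhysical --name=vdo1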


I’ve been wanting to try Cassandra on top of ZFS to see if the file system compression worked better than Cassandra’s internal impl. This would be much easier to try (and to sell to my manager) with the Redhat support.


I'm assuming so, because Redhat is getting rid of btrfs. That would explain why they announced the two in the same release notes.


The opening question made me laugh because I actually do feel I have too much storage.

I haven't bought a new hard drive in years, and I'm only using maybe 1.5 TB of ~3 TB, and most of that space is used up by RAW images from my DSLR. At the rate I'm going, it'll be 2 or 3 years before I need a new drive.


Personally I do agree about hard drives. They're so enormous these days that I'll never use even a moderately sized one. I'm spoiled, though, and love my SSDs; they're fast, smaller, and more expensive, and just a couple of VMs quickly eat up a bunch of expensive space. With VDO, I personally run 20 VMs instead of 8.


Remember when the N in O(N) was the size of the problem, rather than the quantity of O(lg N') indirection layers? Pepperidge Ph4rm remembers.




