Why Virtual Machines suck when you run them from BTRFS files system (fedoraproject.org)
82 points by sagarun on July 14, 2011 | 18 comments



This rambles a bit. Here's the summary:

btrfs is currently optimized for normal applications that do open("foo", O_RDWR). In this mode, POSIX integrity semantics are quite loose.

Because VMs emulate physical hardware with strong integrity semantics, they usually do either open("foo", O_DIRECT) or open("foo", O_SYNC).

btrfs sucks for O_SYNC. It's not just VMs: databases also tend to make heavy use of O_SYNC.
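For the unfamiliar, a minimal C sketch of the difference (the file names and the 4K write size are just illustration):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        char buf[4096] = {0};

        /* Buffered I/O: writes land in the page cache and get flushed
           later; POSIX promises little about when they reach the disk. */
        int buffered = open("foo", O_RDWR | O_CREAT, 0644);
        if (buffered < 0) return 1;
        write(buffered, buf, sizeof buf);   /* returns almost immediately */

        /* O_SYNC: every write() blocks until the data is on stable
           storage, which is what a VM emulating a physical disk needs.
           (O_DIRECT additionally bypasses the page cache, but requires
           aligned buffers.) */
        int synced = open("bar", O_RDWR | O_SYNC | O_CREAT, 0644);
        if (synced < 0) return 1;
        write(synced, buf, sizeof buf);     /* waits on the device */

        close(buffered);
        close(synced);
        return 0;
    }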


Ironic that BTRFS is sponsored by Oracle then!


A lot of filesystems don't optimize O_SYNC heavily until it becomes necessary. ext4 had really bad O_SYNC performance until pretty recently, FWIW.

Given where BTRFS is right now development-wise, it's not at all surprising that O_SYNC hasn't been optimized yet.


So BTRFS is very efficient at big sequential reads (which you generally don't care much about, because they're pretty fast in any case) and dies when subjected to small random reads (which are the bane of platters in the first place)... isn't that dumb for a general-purpose FS?


What I got from that is that BtrFS sucks at doing lots of small synchronous writes, something that's relatively unique to VMs, and is a major improvement over ext4 in just about everything else (feature set and performance). In fact, the pattern is so unusual it never popped up in the tests they run regularly on every patch.


Huh, that's funny. I've been running a VM out of a btrfs partition for months and haven't seen these problems. It's not blindingly fast, but (a) the partition is encrypted, and (b) the VM is running Windows with antivirus software, so there are a couple of things other than btrfs slowing down the write path. But I certainly haven't seen freezes such as those described in this post.


I'm glad there are about a dozen different file systems that don't suck for VM workloads, and quite relieved the BtrFS developers are actively working on improving the case that hurts VM performance.

Having said that, I'd love to know if there are automated tests within the kernel that could verify integrity/correctness/performance of things like filesystem drivers in a simple and automated way. Something like that could prevent surprising performance regressions like this one and provide a better mapping between what you want to do and how you should do it.


The closest thing I'm aware of for doing that currently is the Phoronix Test Suite, or whatever they're calling it now. It's certainly not complete, but it's the only thing I know of that can do any kind of regression testing like that. In fact, it's recently been worked on to make it really easy to do testing with git bisect on the kernel.


Further down in the thread they mention xfstests, and that they run it against every patch.

http://xfs.org/index.php/Getting_the_latest_source_code#XFS_...

and

http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/xfstests.gi...


I think the problem also depends on what kind of virtual disk you end up using. Let me elaborate:

I don't think the problem is purely buffered vs. unbuffered I/O. The guest operating system will have performed some block coalescing anyway, so the block requests will often NOT be 4K chunks but should have slightly larger granularity. However, if you use a COW-based virtual disk layout like QCOW2, which I guess is standard in KVM, you may see additional scattered I/O.

I think it is weird to use a COW virtual disk layout on a file system that natively supports COW, as BTRFS does. I would be curious to see how raw sparse files on BTRFS perform vs. qcow2 etc.
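If anyone wants to try that comparison, here's a rough, hypothetical C sketch of the raw-sparse side: it creates a sparse backing file, and optionally sets the btrfs NOCOW attribute (the same flag `chattr +C` sets) as a third data point against COW-on-COW. The file name and size are made up, and FS_NOCOW_FL needs a fairly recent kernel:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>   /* FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_NOCOW_FL */

    int main(void) {
        /* NOCOW can only be set while the file is still empty. */
        int fd = open("disk.raw", O_CREAT | O_WRONLY | O_EXCL, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Optional: disable COW for this file (what `chattr +C` does),
           to compare against both plain-sparse-on-btrfs and qcow2. */
        int flags = 0;
        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
            flags |= FS_NOCOW_FL;
            if (ioctl(fd, FS_IOC_SETFLAGS, &flags) != 0)
                perror("FS_IOC_SETFLAGS");  /* fails on non-btrfs filesystems */
        }

        /* 10 GiB sparse file: no blocks are allocated until written,
           so the guest's own layout drives allocation, not qcow2's. */
        if (ftruncate(fd, 10LL * 1024 * 1024 * 1024) != 0)
            perror("ftruncate");

        close(fd);
        return 0;
    }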


It sounds as though the same issues that make it perform suboptimally under VM hypervisors would also make it perform suboptimally for OLTP databases -- in both, the I/O patterns generally involve high numbers of small writes.
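A hypothetical sketch of that pattern: each transaction commit appends a small record to a log and forces it to stable storage (via fsync() here, though O_SYNC amounts to the same thing per-write); the file name and record are invented:

    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        int log = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (log < 0) return 1;

        const char record[] = "BEGIN; UPDATE ...; COMMIT\n";
        for (int txn = 0; txn < 1000; txn++) {
            write(log, record, sizeof record - 1);  /* tiny append */
            fsync(log);  /* commit isn't acknowledged until this returns */
        }

        close(log);
        return 0;
    }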


This problem is similar to (and exacerbated by) the IO bottlenecks VMs experience when using traditional hard disk drives, due to high levels of random IO operations. For this reason, many new virtual setups are using solid state drives, which have no seek time. This keeps the high level of random IO operations from significantly impacting performance.


> For this reason, many new virtual setups are using solid state drives, which have no seek time. This keeps the high level of random IO operations from significantly impacting performance.

Except for btrfs, where that would make the whole thing even less efficient, relatively speaking (because now the only cost is the waiting around for threads, without even the random seek on your platters to hide behind).

And as a result, I disagree with your "and exacerbated by". BTRFS's problem becomes qualitatively worse on SSDs: the random read itself is almost free, so the context switching done by the FS is all of the cost, instead of only 80~90% of it.


Thanks for posting this. This is really important information when setting up a new host to run VMs.


The quoted text is painful to read. I don't know why so many mailing list pages have to look like this. At the very least, could the line breaks be taken out?


You could have clicked the "Previous message" link and viewed the original message without quotes: http://lists.fedoraproject.org/pipermail/devel/2011-July/154...


I'm with funkah. Mailing list archive pages haven't evolved in a decade.


Observing the dynamics of the list, I have to ask: who is JB and why is he/she so worried about VM performance under BtrFS?

Fedora is not a Linux you recommend to someone who doesn't know what they are doing, and if you know VM performance sucks with BtrFS then, by all means, add another partition and use ext4 (or 3, or 2, or XFS, or anything you think may offer better performance).



