Don't use BTRFS for OLTP

simoncion · on Sept 22, 2015

Oh. He's using Linux 4.0? That's an old kernel if you're running btrfs. I hope that he re-tests with 4.2 [0] or 4.3. I also hope that he posts his benchmark settings so I can run it on my systems at home. :) I'm very curious about the ENOSPC errors.

Also, the linux-btrfs mailing list thread about the rant is somewhat worth reading, if you're into this sort of thing: http://thread.gmane.org/gmane.comp.file-systems.btrfs/48248

[0] There's a mv/rm deadlock that I can trigger from time to time that was only fixed in 4.2 and later. (His description of the test stall doesn't indicate that this deadlock is the cause of his problem, mind.) Happily, this deadlock only blocks operations on the file being operated on, rather than the whole FS. Also, a reboot clears up the issue. (An umount, then mount might also clear up the issue, but btrfs is my rootfs, so I can't test that. :P)

semi-extrinsic · on Sept 22, 2015

The author adresses this issue very eloquently: if your file system requires a Linux kernel newer than 4.0 to get decent performance, it is very hard to argue that it is "mature" and "production-ready".

simoncion · on Sept 22, 2015

> The author adresses this issue very eloquently...

That would be one of the reasons why I never made the claim that btrfs is ready for general use in all situations. :) I mean, in my footnote I mention that I'm running into a deadlock triggered by perfectly ordinary filesystem operations.

pgaddict · on Sept 22, 2015

Yes, I'll publish the data soon. I've planned to do that before the talk at pgconf.eu (end of October in Vienna) where I'll talk about the results. But given the current interest, I'll probably do that sooner, given the interest.

I've used kernel 4.0 because that's what was latest in May when I started with the benchmark - it takes ~3-4 days to test a single configuration and you can't simply change the kernel halfway through.

Also, as pointed out by semi-extrinsic, if you have to use bleeding-edge kernel, it probably is not mature enough for general use.

FWIW, I haven't really expected the rant to get discussed on linux-btrfs, but I'm nicely surprised by how polite and factual the posts are. They got most of the facts right and the inaccuracies are mostly minor and probably a consequence of the rant not providing all the data.

simoncion · on Sept 22, 2015

> ...it takes ~3-4 days to test a single configuration and you can't simply change the kernel halfway through.

No worries! I've done very-long-running testing before. Moreover, your blog post made it very clear that you're not some n00b with an axe to grind. :)

I wonder if it'll take substantially longer than 3->4 days for my systems to complete a test. Guess I'll find out within a few months! :D :D :D

> ...I'm nicely surprised by how polite and factual the posts are.

Yeah, the btrfs devs strike me as a really nice bunch of folks.

Anyway, thanks for the info and the comment! :D

notacoward · on Sept 22, 2015

I'm pretty sure that I was at a conference where Chris Mason explicitly commented on the irony of working at a database company (Oracle) but designing a file system that's fundamentally bad for database workloads. I think it was a Linux Foundation event, maybe an EUS in NYC. In any case, it's pretty well known that COW file systems are not a good fit for this kind of workload. A more interesting question is whether this level of (in)stability and (un)predictability is acceptable for any workload other than scratch storage (which doesn't benefit from snapshots and CRCs very much). My takeaway from this article is basically that F2FS is worth another look.

pgaddict · on Sept 22, 2015

While I'm obviously interested in benchmarks and performance (I'm author of the blog post referenced here), I'm perfectly OK with sacrificing some of the performance in exchange for advanced features provided by the filesystem.

For example built-in snapshotting, additional data integrity guarantees thanks to checksums (e.g. resiliency to torn pages) etc. Because you either can't get that with the traditional filesystems or it'll come at a cost (e.g. LVM adds complexity and has impact on performance).

What I'm not quite OK with is getting very unstable performance - with OLTP workloads you really want smooth behavior, not the jitter or random issues you get with BTRFS. Especially when the other COW filesystems like ZFS perform so much more sensibly.

I don't think comparing F2FS and BTRFS is entirely fair, though. Those are filesystems with very different goals, F2FS is mostly designed to work with single SSD devices (so no RAID-like stuff like BTRFS) and lacks many of the advanced features (you can't even do snapshots).

Also, it was not my intention to say that BTRFS is somehow conceptually wrong and unusable for database workloads. But the current state is not really something I'd recommend for OLTP in production - that's what the rant is essentially about.

notacoward · on Sept 22, 2015

I also take a dim view of that performance instability, with any workload. In fact, it's one of the points I address on the slides I just sent in for a mini-tutorial on storage performance (for LISA'15 in case anyone wants to see it). All I'm saying is that, even under the best of circumstances, I would consider any COW file system a dubious choice for OLTP.

pella · on Sept 22, 2015

wiki.archlinux.org / PostgreSQL + Btrfs

"Warning: If the database resides on a Btrfs file system, you should consider disabling Copy-on-Write for the directory before creating any database."

https://wiki.archlinux.org/index.php/PostgreSQL

simoncion · on Sept 22, 2015

Fun fact: I have a multi-TB, largely-write-only Postgres 9.4 database on a force-compress multi-device btrfs volume. Sadly, you must enable CoW to use transparent compression.

The performance is... not the best, and not all of that can be blamed on either my shitty choice of indexes long ago, or my decision to use firewire to attach the devices.

However, btrfs hasn't eaten any of my data, compression has given me ~2x the space to work with, and btrfs handled the sudden and unexpected loss of a device like a champ.

pgaddict · on Sept 22, 2015

So what's the point of using a COW-based filesystem, and then just disabling the COW and sacrificing features that require that. Sure, you can still do snapshots (which will do COW on the modified data), but you lose compression and checksums.

The compression is not really that interesting on OLTP I guess, but losing checksums is a major PITA because it means the filesystem is no longer resilient to torn pages (at least that's my understanding). Which means you have to enable full_page_writes in PostgreSQL, which has impact on performance.

So how is BTRFS+nodatacow better than EXT4+LVM, for example?

pella · on Sept 22, 2015

other benchmark from the same author ( 4month ago ) :

Tomas Vondra: "PostgreSQL on EXT4, XFS, BTRFS and ZFS"

http://www.slideshare.net/fuzzycz/postgresql-on-ext4-xfs-btr...

pgaddict · on Sept 22, 2015

That's a first version of the benchmark, with some minor differences (newer kernel, a bit more sensible PostgreSQL configuration).

While the end results are mostly the same, the PostgreSQL changes (significant increase of checkpoint_segments) significantly improved the COW case.