The State of ZFS on Linux (clusterhq.com)
201 points by ferrantim on Sept 11, 2014 | 122 comments



At a previous job, we built a proof-of-concept Sinatra service (i.e., an HTTP/RESTful service) that would, on a certain API call, clone from a specified snapshot and also create an iSCSI target for that new clone. This was on OpenIndiana initially, then some other variant of that OS as a second attempt.

The client making the HTTP request was iPXE; so every time the machine booted, you'd get yourself a fresh clone + iSCSI target. We'd then mount that iSCSI target in iPXE, which would hand it off to the OS and away you'd go.

The fundamental problem we hit was that there was a linear delay for every new clone; the delay seemed to be 'number of clones * .05 second' or so. This was on extremely fast hardware. It was the ZFS clone command itself that was running too slowly.

Around 500 clones, we'd notice 10-20 second delays. The reason that hurt so badly is that, to our understanding, it wasn't safe to run ZFS or iSCSI commands in parallel; the Sinatra service was responsible for serializing all ZFS/iSCSI commands.

So my question to the author:

1) Does this 'delay per clone' ring familiar to you? Does ZFS on Linux have the same issue? It was a killer for us, and I eventually found a thread implying it would never get fixed in Solaris-land.

2) Can you execute concurrent ZFS CLI commands on the OS? Or is that dangerous like we found it to be on Solaris?


1. I am not aware of this specific issue. However, I am aware of an issue involving slow pool import with large numbers of zvols. Delphix has developed a fix for it that implements prefetching. It should be merged into various Open ZFS platforms soon. It could resolve the problem that you describe.

2. Matthew Ahrens' synctask rewrite fixed this in Open ZFS. It took a while for the fix to propagate to tagged releases, but all Open ZFS platforms should now have it. ZoL gained it with the 0.6.3 release. Here is a link to a page with links to the commits that added this to each platform, as well as the months in which they were added:

http://open-zfs.org/wiki/Features#synctask_rewrite


Thanks for the reply.

Regarding #2: On OpenIndiana, we first started with concurrent zfs commands and ruined, I think, the whole pool (maybe it wasn't that drastic, but it was still a disaster scenario where key data would be lost). I couldn't believe it.

I was asking anyone who knew anything... 'so if two admins were logged in at the same time and made two zvols, they could basically ruin their filesystem'? No one knew for sure. Crazy stuff.

Anyway, I'm quite glad that's safe now.


I can't recall the exact details from memory, but I believe #1 has to do with the fact that ZFS creates/clones/snapshots/etc. are done in "syncing" context. Thus, each command has to wait for a full pool sync to complete, limiting the rate at which these can be done.

This is a known problem, and likely to be fixed in the not too distant future.
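
For anyone who wants to observe this themselves, a rough sketch that times each clone in sequence (the pool and snapshot names are hypothetical):

    zfs snapshot tank/base@gold
    for i in $(seq 1 500); do
        /usr/bin/time -f "clone $i: %e s" zfs clone tank/base@gold tank/clone$i
    done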


I am the author. Feel free to respond with questions. I will be watching for questions throughout the day.


Is there a plan at some point to include a daemon or a cron job to run automatic zpool scrubbing? I believe this is a feature that is available in other OSes' packages, but not currently with ZoL. For now, I use two cron jobs, like so:

    18 * * * * /sbin/zpool list | grep ztank | grep ONLINE > /dev/null || /sbin/zpool status
    35 1 * * 4 /sbin/zpool scrub ztank
This way cron simply emails me if there are errors. However, I'd like a lot more communication from my storage array: if there are detected errors, I want to know right away.

My other question is much more specific to my case. Stupidly, I bought Western Digital Green drives for my server (running as a mirror of two drives). They are normally under no heavy load: just occasional file access to store/retrieve pictures, documents, or stream video. How likely am I to run into problems, and should I replace these drives ASAP?


ZoL 0.6.3 introduced the ZFS Event Daemon. This functionality could likely be implemented in it. Please file an issue requesting this:

https://github.com/zfsonlinux/zfs/issues/new


I very much adore ZoL. Thank you for your efforts. Everything critical works and works very well.

While I get the sense that this is probably not the focus of your own work, do you have any thoughts on the maturity of the "share" facilities when using ZoL, and as the project matures will these become more of a priority? These are "nice to have" features that are obviously of relatively low importance.

You mentioned shareiscsi is unimplemented, but I also find that its friends sharenfs and especially sharesmb have some rough edges as well. When I moved a pool from an OI machine to a Linux machine, I had to massage all of my share attributes to make them function.


You can obtain the commands that you used to massage your sharenfs and sharesmb settings from `zpool history`. Please file issues with them in the issue tracker:

https://github.com/zfsonlinux/zfs/issues/new

Please include information describing your distribution, the distribution release, your kernel version, the ZoL release and also the Samba version.
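
For reference, pulling those commands back out looks something like this (pool and dataset names here are hypothetical):

    zpool history tank | grep -E 'sharenfs|sharesmb'
    zfs get sharenfs,sharesmb tank/export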


I guess I am curious about how this functionality is viewed more generally / philosophically and what state it is considered to be in. I wonder whether it is perceived to be good enough for production use, and whether it's expected to work as smoothly as it does under Illumos.

In my case I encountered situations where valid configurations from OI weren't supported in ZoL, but I remember coming to the conclusion that these limitations were already known and addressing them simply wasn't a priority at that time. Since this was some months ago, the situation may very well have changed already!


The share support is one of the few things in ZoL that I do not actually use, so I cannot make a definitive statement on this topic. At present, the project does not have regression tests for it in the LLNL buildbot. The majority of the enabling work in this area was done by Turbo Fredriksson.

https://github.com/zfsonlinux/zfs/commits?author=FransUrbo

I would suggest emailing the zfs-discuss mailing list with Turbo on CC to ask:

http://zfsonlinux.org/lists.html

Additionally, I am aware of some SMB improvements by Turbo that are currently pending review. They are scheduled for inclusion in 0.6.4 provided that no issues are found during review:

https://github.com/zfsonlinux/zfs/pull/1476


I don't really have a question, but as a ZoL user I'd just like to say thanks for all the hard work.

It makes management of my disk arrays pretty painless and has some fantastic migration/recovery stuff going on. All of which I'm sure you know!


I use ZFS on a dual-boot Mac to cross-mount the Linux partitions on OS X. ZFS is pretty much the only file system that allows this: The XFS OSXFUSE plugin is read-only, the ext plugin only supports ext2 with unstable write support and BtrFS can't be mounted on OS X at all.

The ZoL-derived OpenZFSonOSX port inherits ZoL's maturity, runs in kernel space and the OS X integration is really nice (Notification Center integration in ZED, custom icons, etc).

Lovin' it!


Seconded. ZFS is the only filesystem I trust with my children's baby pictures, as well as to store the git repos for my personal projects (stuff I don't want on GitHub for a variety of reasons).


I am happy to hear that. While I certainly think ZFS is the best filesystem available for storing this kind of data, I would like to add a word of caution that ZFS is not a replacement for backups. I elaborated on this in one of the supplementary blog posts:

https://clusterhq.com/blog/file-systems-data-loss-zfs/#disk-...


Absolutely. I have a regular backup strategy that also backs up to a ZFS system :).


I followed you and Brian's contributions quite closely at my previous job. We had some pretty extreme backup targets for our VPSes and the old tools were starting to become bothersome, particularly for keeping a couple hundred million files synced between data centers. I knew about ZFS's send/receive but we were a Linux shop. About the time we (I) were going to make some major changes to the backup systems, I gave ZFS another Google, as I like to re-check my assumptions every now and then, and discovered ZoL had gone "stable" just that month! I immediately pushed to give it a spin and the rest was history. Learning ZFS was fantastic fun. It challenged everything I thought a filesystem was capable of. L2ARC, snapshots (cloning and shared data, wut?!), zvols, checksums, and on and on and on. Thanks for all your hard work and making this possible!


Can you talk about performance of virtual machine disks on ZFS? How is ZFS better/worse than BTRFS?


Before I answer that, let me say that ZFS has two options for storing virtual disks for virtual machines. The first is the traditional file. The second is the zvol, a virtual block device that has lower overhead than a traditional file. The default internal record (block) sizes vary for each. On datasets where files are stored, this is called the recordsize and it is 128KB by default; it is a per-file property that is set to the value of the dataset's recordsize at the time the file is created. On zvols, the volblocksize is 8KB by default and is set at the time the zvol is created.

In both cases, partial block writes cause a read-modify-write penalty, and typically zvols are more performant by default. At the same time, LZ4 compression seems to have the counterintuitive consequence of making the larger recordsize about equal in performance in filebench tests that I have done, so it is hard to say which is ultimately better. It is also worth noting that there are several improvements in the pipeline for ZoL zvols that should improve their performance:

https://github.com/zfsonlinux/zfs/pull/2484
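
To make the two options concrete, a quick sketch (pool and dataset names are hypothetical):

    # file-backed VM disks: recordsize is a dataset property (128K default)
    zfs create -o recordsize=128K tank/vmimages
    # zvol-backed VM disk: volblocksize is fixed when the zvol is created (8K default)
    zfs create -V 20G -o volblocksize=8K tank/vm1-disk0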

With that introduction out of the way, the actual answer to your first question is that it is very dependent on your workload, so I cannot provide a solid answer, but I can discuss my own personal experience. I became involved with ZoL because I was interested in the performance of virtual storage for a home server in 2011. The technology at the time could not compare to ZoL and, as far as I know, still cannot. In specific, I had 6x Samsung HD204UI drives connected to an AMD Phenom X6 1090T. A configuration with MD RAID 6 + LVM + ext4 did not manage more than 20MB/sec, regardless of whether I used KVM or Xen. A raidz2 configuration using ZFSOnLinux did 220MB/sec. Someone with 4 disks had a similar experience about a week ago, where MD RAID 5 + LVM + XFS could not get more than 44MB/sec while a ZFS raidz1 configuration managed 210MB/sec, if I recall correctly.

As for how ZFS is better/worse than btrfs, ZFS has several advantages in terms of its implementation. In specific, it has the ARC, which provides a scan-resistant cache that keeps performance consistent. It has the L2ARC for using flash to extend that cache. It has the ZFS Intent Log, which allows it to avoid blocking on expensive full merkle tree updates. This has allowed ZFS to outperform btrfs in ways that amazed the btrfs developers:

http://comments.gmane.org/gmane.comp.file-systems.btrfs/2754...

It has SLOG devices to allow flash to be used to accelerate ZIL. It also has a custom IO elevator that does a very good job of ensuring performance consistency:

https://twitter.com/lmarsden/status/383938538104184832/photo...

Quite honestly, here are the 5 hypothetical areas into which performance can fall in any given comparison, and what I expect the distribution to be:

1. Areas where ZFS significantly outperforms btrfs. I expect there are many of these.

2. Areas where ZFS slightly outperforms btrfs. I expect that there are many of these.

3. Areas where ZFS and btrfs are equal. I expect that there are some of these.

4. Areas where btrfs slightly outperforms ZFS. I think some of these probably exist.

5. Areas where btrfs significantly outperforms ZFS. I do not expect any of these to exist. If they do, they indicate bugs in the ZFS kernel driver that need to be corrected.

The areas where I think btrfs might significantly outperform ZFS today are:

1. Uncached directory lookups (getdents performance). ZFS does not currently have directory prefetching and btrfs might. This only affects cold cache performance, so it does not affect production usage and has been a low priority. It will likely be fixed in the next 12 months.

2. Small file performance. btrfs does block suballocation while ZFS does not. This should change in 0.6.4 when ZFS will begin storing small files in the dnode (the ZFS equivalent of the inode).

That said, there is nothing I or anyone can say that is a valid substitute for your own testing and I encourage you to run your own tests.

I hope that this answers your question.


What's performance like on ZoL compared to Solaris/OpenIndiana/FreeBSD?


Performance is a difficult question to answer because it is very workload dependent. However, I can say a few things on performance:

1. Solaris's EULA prohibits the publication of benchmarks without the explicit permission of Oracle. I have not done any benchmarks of Solaris and if I had done any benchmarks, I could not publish them without placing myself and/or my employer at risk of a lawsuit from Oracle.

2. Block device drivers are known to influence performance. In situations where Linux has superior driver support, ZoL will outperform its counterparts on other platforms.

3. There is a key performance fix in a pull request that is planned to be included in 0.6.4. I have seen benchmarks where it enables ZoL to outperform its counterparts on other platforms, although the other platforms were subject to inferior block device drivers. Before the fix, ZoL performed worse:

https://github.com/zfsonlinux/spl/pull/369

4. I have posted some information on performance to the ClusterHQ blog:

https://clusterhq.com/blog/state-zfs-on-linux/#comment-15844... https://clusterhq.com/blog/state-zfs-on-linux/#comment-15880...

5. A list of performance improvements in Open ZFS is available from the Open ZFS wiki:

http://open-zfs.org/wiki/Features#Performance

6. It is quite likely that I will post a future blog post discussing performance. However, performance is a difficult topic to discuss because it is extremely workload dependent. No matter what I or anyone says, or any benchmarks you see, the best measure of a storage stack's performance will ultimately be a test of your own workloads.


Tried using ZFS in earnest and got spooked; felt it was not production ready. Wanted to use ZFS for MongoDB on Amazon Linux (primarily for compression, but also for snapshot functionality for backups). Tried 0.6.2.

Ended up running into a situation where a snapshot delete hung and none of my ZFS commands were returning. The snapshot delete was not killable with kill -9. https://github.com/zfsonlinux/zfs/issues/1283

Also, under load I encountered a kernel panic or a hang (I forget which); it turns out it's because the Amazon Linux kernel comes compiled with no preemption. It seems that "voluntary preemption" is the only setting that's reliable. https://github.com/zfsonlinux/zfs/issues/1620

That left a bad taste in my mouth. Might be worth trying out 0.6.3 again.

I am still leafing through the issues closed in 0.6.3, but based on what I see, 0.6.2 did not seem production-ready-enough for me:

https://github.com/zfsonlinux/zfs/issues?page=2&q=is%3Aissue...


Your deadlock was likely caused by the sole regression to get by us in the 0.6.2 release:

https://github.com/zfsonlinux/zfs/commit/a117a6d66e5cf1e9d4f...

This occurred because it was rare enough that neither we nor the buildbots caught it back in February. George Wilson wrote a fix for it in Illumos rather promptly. However, the Illumos and ZoL projects had different formats for the commit titles of regression fixes. In specific, the Illumos developers would reuse the same exact title while the ZoL developers would generally expect a different title, so we missed it when merging work done in Illumos. I caught it in November when I was certain that George had made a mistake and noticed that our code and the Illumos code were different. It is fixed in 0.6.3. The fix was backported to a few distribution repositories, but not to all of them.

The 0.6.3 release was notable for having a very long development cycle. As I described in the blog post, the project will begin doing official bug fix releases when 1.0 is tagged. That should ensure that these fixes become available to all distributions much sooner. In the mean time, future releases are planned to have much shorter development cycles than 0.6.3 had, so fixes like this will become available more quickly.

That being said, I was at the MongoDB office in NYC earlier this year to troubleshoot poor performance with MongoDB. I will refrain from naming the MongoDB developer with whom I worked lest he become flooded with emails, but my general understanding is that 0.6.3 resolved the performance issues that the MongoDB developers had observed. Future releases should further increase performance.


Thank you so much for the information! This is very encouraging. I will definitely give 0.6.3 a whirl!


For anyone interested, over the past couple weeks I have heavily researched ZFS, and created a couple screencasts about my findings [1, 2].

[1] https://sysadmincasts.com/episodes/35-zfs-on-linux-part-1-of...

[2] https://sysadmincasts.com/episodes/37-zfs-on-linux-part-2-of...


Great blog post! Something from personal experience. OpenZFS on FreeBSD feels mostly like a port of illumos ZFS where most of the non-FreeBSD-specific changes happen in illumos and then get ported downstream. On the other hand, OpenZFS on Linux feels like a fork. There is certainly a stream of changes from illumos, but there's a rather non-trivial amount of changes to the core code that happen in ZoL.


This is because Martin Matuška of FreeBSD has been focused on upstreaming changes made in FreeBSD's ZFS port into Illumos. At present, the ZFSOnLinux project has had no one dedicated to that task and code changes mostly flow from Illumos to Linux. This is starting to change. A small change went upstream to Illumos earlier this year and more should follow in the future.

That being said, there are commonalities between Illumos and FreeBSD that make it easier for the FreeBSD ZFS developers to collaborate with their Illumos counterparts:

1. FreeBSD and Illumos have large kernel stacks (4 pages and 6 pages respectively) while Linux's kernel stacks are limited to 2 pages.

2. In-kernel virtual memory is well supported in FreeBSD and Illumos while Linux's in-kernel virtual memory is crippled for philosophical reasons.

3. FreeBSD and Illumos have both the kernel and userland in the same tree. FreeBSD even maintained Illumos' directory structure in its import of the code while ZoL's project lead decided to refactor it to be more consistent with Linux.

Difficulties caused by these differences should go away as changes made in ZoL to improve code portability are sent back to Illumos.


FWIW, if there's anybody interested in learning about the ZFS code base, we'd love help porting patches from ZoL into Illumos and vice versa. That's a good way to get a new developer integrated with the code and process surrounding each platform.


I'm interested in point 2, can you clarify how and why Linux in-kernel virtual memory is crippled or provide a link?


There are two issues:

1. Page table allocations use GFP_KERNEL, even when done for an allocation that used GFP_ATOMIC. This means that allocations that are needed to do pageout and other things to free memory can deadlock. There is a workaround in the SPL that will switch it to kmalloc when this issue occurs. There is also a new mechanism in Linux 3.9 and later that should render this unnecessary.

2. The kernel deals with kernel virtual address space exhaustion in vmalloc() by spinning until memory becomes available. This is not a problem on current 64-bit systems where the virtual address space used by vmalloc is larger than physical memory, but it is a problem on most 32-bit systems.


I may have read the article too fast, but what about cryptography in ZoL? Is there a way to encrypt data on ZoL? Regards and thanks for the article.


At present, you need to either encrypt the block devices beneath ZFS via LUKS or the filesystem on top of ZFS via ecryptfs. There are some guides on how to do this for each distribution.
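
For the LUKS route, a minimal sketch (device names are hypothetical, and this omits key management details):

    cryptsetup luksFormat /dev/sdb
    cryptsetup luksOpen /dev/sdb crypt0
    cryptsetup luksFormat /dev/sdc
    cryptsetup luksOpen /dev/sdc crypt1
    zpool create tank mirror /dev/mapper/crypt0 /dev/mapper/crypt1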

There is an open issue for integrating encryption into ZoL itself:

https://github.com/zfsonlinux/zfs/issues/494

This will likely be added to ZoL in the future, but no one is actively working on it at this time.


Thanks


Oracle extended ZFS to be able to encrypt specific filesystems, but this method has been heavily scrutinized for being susceptible to watermarking attacks



I'm using ZFS right now, because I need something that cares for data integrity, but the fact that it will never be included in Linux is a very big issue for me. Every time you upgrade your kernel, you have to upgrade the separate modules as well - this is the point where bad things can happen. I will definitely be looking into Btrfs once it is more reliable. For now I'm having a bit of a problem with SSD caching and performance, but don't care about it enough for it to be relevant, I just use the filesystem to store data safely and ZFS does an OK job.


From what I understand, aside from certain RAID levels, btrfs is production-ready. RAID5 and RAID6 don't have recovery code finished yet, but RAID0, RAID1, "dup" which just keeps 2 copies of each chunk, and "single" mode all work fine.


I've set up btrfs on software mdraid (RAID 6) as a backup system (not the only backup!). You still get the checksums and snapshotting, but not the flexibility of the btrfs RAID system. It has the advantage of being easy to grow, unlike zfs, which can't be resized once created. We've encountered no problems, even though it has been running for around three years of rsyncing and snapshotting.


zfs can grow - new vdevs can be added, and existing vdevs can have their disks replaced one at a time with larger disks.
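
For example (device names are hypothetical):

    # grow by adding another vdev to the pool
    zpool add tank mirror /dev/sdd /dev/sde
    # or grow in place by swapping each disk in a vdev for a larger one
    zpool set autoexpand=on tank
    zpool replace tank /dev/sdb /dev/sdf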

A bigger downside of ZFS, IMO, is lack of defragmentation and similar larger scale pool management. If you ever push a ZFS pool close to its space limit, you can end up with fragmentation that never really goes away, even if you delete lots of files. The recommended solution is to recreate the pool and restore from backup, or create a new pool and stream a snapshot across with zfs send | zfs receive. Not terribly practical for most home users.
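
The send/receive route looks roughly like this (pool and dataset names are hypothetical):

    zfs snapshot -r tank/data@migrate
    zfs send -R tank/data@migrate | zfs receive -F newtank/data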


zfs can sort of grow by adding vdevs, as you say; however, it's pretty wasteful due to the new parity drives. It was much more efficient to expand the mdraid RAID 6 and expand the btrfs onto that. The other backup server does use zfs (albeit the user-mode FUSE version). I set up that system in a similar way, putting zfs-fuse on mdraid 6.


Out of curiosity, what distro are you using that's giving you problems? My Ubuntu box is humming along with no problems: kernel and ZFS updates happen at the same time.


That is a feature of DKMS. Not all distributions use this, but most distributions on which ZFS is available have ways of avoiding this problem.


Which distribution are you using and does this involve / on ZFS? Most distributions have ways of avoiding this problem. If you are using a distribution where this is a problem, I would like to know so that I can try to address it. I can do nothing about it without knowing more.


I've used ZFSOnLinux on my laptop and workstation for a couple of years now, with Ubuntu, without any major problems. When I tried to use it in production, I didn't get data loss but I hit problems:

* Upgrading is a crapshoot: Twice, it failed to remount the pool after rebooting, and needed manual intervention.

* Complete pool lockup: in an earlier version, the pool hung and I had to reboot to get access to it again. If you look through the issues on github, you'll see weird lockups or kernel whoopsies are not uncommon.

* Performance problems with NFS: This is partially due to the Linux NFS server sucking, but ZFS made it worse. Used a lot of CPU compared to Solaris or FreeBSD, and was slow. It's even slow looping back to localhost.

* Slower on SSDs: ZFS does more work than other filesystems, so I found that it used more CPU time and had more latency on pure SSD-backed pools.

* There are alternatives to L2ARC/ZIL on Linux that are built in and work with any filesystem, such as "flashcache" on Ubuntu.

For these reasons, I think ZoL is good for "near line" and backup storage, where you have a large RAID of HDDs and need stable and checksummed data storage, but not mission-critical stuff like fileservers or DBs.


I mentioned most of these issues in the supplementary blog posts. Here is where each stands:

* There are issues when upgrading because the initramfs can store an old copy of the kernel module and the /dev/zfs interface is not stabilized. This will be addressed in the next 6 months by a combination of two things. The first is /dev/zfs stabilization. The second is bootloader support for dynamic generation of initramfs archives. syslinux does this, but it does not at this time support ZFS. I will be sending Peter Anvin patches to add ZFS support to syslinux later this year. Systems using the patched syslinux will be immune to this problem while systems using GRUB2 will likely need to rely on the /dev/zfs stabilization.

* There are many people who do not have problems, but this is certainly possible. Much of the weirdness should be fixed in 0.6.4. In particular, I seem to have fixed a major cause of rare weirdness in the following pull requests, which had the side benefit of dramatically increasing performance in certain workloads:

https://github.com/zfsonlinux/spl/pull/369 https://github.com/zfsonlinux/zfs/pull/2411

* The above pull requests have a fairly dramatic impact on NFS performance. Benchmarks shown to me by SoftNAS indicate that all performance metrics have increased anywhere from 1.5 to 3 times. Those patches have not yet been merged as I need to address a few minor concerns from the project lead, but those will be rectified in time for 0.6.4. Additional benchmarks by SoftNAS have shown that the following patch that was recently merged increases performance another 5% to 10% and has a fairly dramatic effect on CPU utilization:

https://github.com/zfsonlinux/zfs/commit/cd3939c5f06945a3883...

* There is opportunity for improvement in this area, but it is hard for me to tell what you mean. In particular, I am not certain if you mean minimum latency, maximum latency, average latency or the distribution of latency. In the latter case, the following might be relevant:

https://twitter.com/lmarsden/status/383938538104184832/photo...

That said, I believe that the kmem patches that I linked above will also have a positive impact on SSDs. They reduce contention in critical code paths that affect low latency devices.

Additionally, there is at least one opportunity to improve our latencies. In particular, ZIL could be modified to use Force Unit Access instead of flushes. The problem with this is that not all devices honor Force Unit Access, so making this change could result in data loss. It might be possible to safely make it on SLOG devices as I am not aware of any flash devices that disobey Force Unit Access. However, data integrity takes priority. You can test whether a SLOG device would make a difference in latencies by setting sync=disabled temporarily for the duration of your test (see the short sketch at the end of this comment). All improvements in the area of SLOG devices will converge toward the performance of sync=disabled. If sync=disabled does not improve things, the bottleneck is somewhere else.

* These alternatives operate on the block device level and add opportunity for bugs to cause cache coherence problems that are damaging to a filesystem on top. They are also unaware of what is being stored, so they cannot attain the same level of performance as a solution that operates on internal objects.
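
Regarding the sync=disabled test mentioned above, the sketch is simply (dataset name hypothetical):

    zfs set sync=disabled tank/test
    # ... run the latency benchmark ...
    zfs set sync=standard tank/test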


Great reply :)

Your last two points are a little weak. Ultimately ZFS does more stuff, so on an SSD it's going to be slower than a filesystem which doesn't have all these extra features if you aren't using them. I think there's a trade-off to be had.

All software has bugs, especially ZFS. You could argue it is easier to develop, test, and maintain individual device-mapper building blocks.


I have been inundated with feedback from a wide number of channels. If I did not reply to a comment today, I will try to address it tomorrow.


ZFS, and most* other file systems are all about _one_ computer system.

While ZFS data integrity features may be useful, they don't prevent the wide variety of things that can go wrong on a _single_ computer. You still need site redundancy, multiple physical copies, recovery from user errors etc.

Large, modern enterprises are better off keeping data on application layer "filesystems" or databases, since they can more easily aggregate the storage of hundreds or thousands of physical nodes. ZFS doesn't help with anything special here.

For the average home user, ZoL modules are a hassle to maintain. You are better off setting up FreeNAS on a 2nd computer if you really want to use ZFS. Otherwise there is nothing much over what XFS, ext4 or btrfs can offer.

The 'ssm' set of tools to manage LVM, and other built-in file systems, is easier for home users with regular needs.

GlusterFS and others are distributed file systems, but they suffer from additional complexity at the OS and management layer.


The Lustre filesystem is able to use ZFSOnLinux for its OSDs. This gives it end to end checksum capabilities that I am told enabled the Lustre developers to catch buggy NIC drivers that were silently corrupting data.

Alternatively, there is a commercial Linux distribution called SoftNAS that implements a proprietary feature called snap replicate on top of ZFS send/recv. This allows it to maintain backups across availability zones and is achieved by its custom management software running the zfs send/recv commands per user requests.

In the interest of full disclosure, my 2014 income tax filing will include income from consulting fees that SoftNAS paid me to prioritize fixes for bugs that affected them. I received no money for such services in prior tax years.


I love ZFS, and I love working with Linux, but I can't help but worry about using ZFS on Linux. Without the needed support from the kernel side, I don't see how it can be useful for production. I can see using it on personal workstations, but for any situation where data loss is critical, you just won't see any uptake. Because of the licensing, ZFS can never be anything more than a second-class citizen on Linux.

That said, I run a FreeBSD ZFS file server just to host NFS that is exported over to a Linux cluster. At least on FreeBSD, there is first-class integration of ZFS into the OS. (I used to also maintain a Sun cluster that had a Solaris ZFS storage server that exported NFS over to Linux nodes, which is where I first got a taste for ZFS).

So, I guess my main question is: In what use cases is ZFS on Linux so useful when native FreeBSD/ZFS support exists?

I'm not saying it can't be done - I just don't understand why.


  Without the needed support from the kernel side
Can you clarify what you mean by that?


Basically I mean integration with the kernel's code base and all of the testing that entails. After their initial development, file systems all end up migrating to the kernel's code base.

So, I'm not thinking in terms of technical API support, but more development/testing/integration support.


That is not a requirement. ZoL has the most sophisticated build system of any Linux kernel module in the world to enable it to live outside of the main tree. ZoL relies on autotools' API checks to do this. In addition, the project has an automated buildbot that helps us to detect regressions in pull requests before they are merged. It is similar in principle to how lustre filesystem development is done. Lustre is also (primarily) outside of the tree and lived entirely outside of the tree for years.

That being said, being inside the kernel source tree is not necessarily a good thing. As I wrote in the blog post, other filesystems on Linux generally do not provide the latest code to older kernels, but ZoL provides the latest code to all supported kernels and distributions. This ranges from Linux 2.6.26.y to the 3.16.y in 0.6.3 and will include 3.17.y in 0.6.4. Something as important as a filesystem should be updated to fix bugs, even if the kernel proper cannot be. The inability of all Linux systems to update to the latest kernel is an issue that Linus Torvalds mentioned at LinuxCon North America 2014 and ZoL is one of the few filesystems that can deal with it on systems where it is deployed.


You know... I hadn't given much thought to Lustre. And that is a very good comparison. Lustre is completely out of kernel (in terms of development), so it's not like you're the first to try this.

I guess it really depends on where you want ZoL to be deployed. Lustre is typically deployed on HPC clusters and is managed by people who understand it, how to set it up, and how to manage it. It's not a trivial system to get working, and requires a pretty large budget. It just isn't set up on your typical server. Lustre also has a few big names behind it to provide development resources and support (Whamcloud, now Intel).

What is the target market for ZoL? Do you want it to work on servers? Clusters? Personal workstations? They are very different markets.

I'm using ZFS right now for a single storage server that doesn't need to support a large cluster, so ZFS/NFS works great. But I'm using FreeBSD. For my Linux servers, I wouldn't think of using a filesystem that wasn't natively supported by Red Hat. I just don't want the management headache. I'm okay running a single FreeBSD box for my storage needs, but I'm fearful of what happens when I'd need to scale things out to multiple servers.

I wish you luck, I really do. It seems like you're not going into this blind and you're making good decisions. But it will be an uphill battle and you'll always have that specter of Oracle looming over you. It doesn't matter if you're technically/legally right on the licensing front - you'll still have it looming over you until Larry signs off on it (like you said below).

If you can get that sign off, then all bets are off, and you'd be golden!


The previous poster clearly does not understand that ZFS on Linux is essentially a native port with very careful dancing around licensing borders.


The previous poster is well aware of what ZFS on Linux is.

Licensing matters. It may not matter in terms of what I can do with my servers. I can compile ZFS and run it just fine on Linux. But you won't find Red Hat doing it. You won't find IBM doing it. Linus won't be adding it into the mainline tree anytime soon.

This means that there may be "technical" solutions to getting ZFS running on Linux, but that also means that there will be artificial barriers in place that will make it more difficult to run, test, deploy, and support.

These things matter.

I might use ZFS for Linux on a personal workstation, but there is no way that I'd add it to a production cluster.


I am working on tearing down barriers to this. Expect to see the fruit of that effort later this year.

That said, I do not consider absence from Linus' tree to be a barrier because being in Linus' tree limits the ability of end users to obtain bug fixes. That is a problem other filesystems have that I am happy to see ZFSOnLinux avoid.

Also, Linus Torvalds and I discussed the possibility of merging ZFS into his tree because of feedback like yours. In short, Linus has no interest in merging ZFS into his tree. Linus thinks that it might enable Oracle to sue people unless Oracle provides a sign-off to confirm that the code is under the CDDL. He also thinks that the kernel sign-off is what keeps Oracle from suing people over btrfs, although software patents are a likely loophole.

Lastly, I realize that my statement about the sign-off confirming that the code is under the CDDL might sound strange. However, there are two cases to consider. The first case involves source code. The GPL does not restrict the redistribution of source code, so there would be no problem. The second case is binaries, which the GPL does restrict in situations that a court of law would consider to be a derived work. Being a module-only option should be sufficient to deal with that. Linus did not seem to have any problem with that idea. However, he insists on Oracle's sign-off, with the preference that it be Larry Ellison.


Right, but if you read the discussion this thread is about, you can see that Debian is working on a roadmap that includes binary drivers, shipped with the OS, available during the install.

Linus has traditionally done a poor job of ensuring that filesystems will work well in production, ext4 being a _fantastic_ example of that. I appreciate the work he does, but I don't quite give a fuck if he is scanning the source for my filesystem. I've seen more ext4 filesystems go corrupt than all others combined.

Red Hat and IBM can, equally, kiss my ass, I've never found either to be reliable.

I don't see how DKMS is really an artificial barrier to run, test, deploy, and support something. It is, in fact, the opposite of a barrier - it's a tool.


Debian is working on a roadmap that includes shipping source drivers to be slipstreamed in as dynamic modules.

DKMS is a tool, but it's hardly the same as a good, in tree module. I remember having tons of issues with DKMS WiFi drivers back when I still ran a Linux laptop (10-ish years ago). If you get an upstream kernel upgrade and the ZFS DKMS module doesn't compile against it, you're screwed. Can you imagine trying to debug DKMS issues with your main filesystem? I hope you leave an ext2/3/4 volume around for booting and root.

I'm glad that you feel like you can discount the work that IBM and Red Hat have done to advance Linux, but they've done a ton to make sure that Linux could take over data centers.

Good to know... Thanks!


I've used ZFS (FreeNAS) for quite a few years and find it pretty flawless. Trust it's not too dumb a question but what advantage is there to running ZFS on Linux when you can run it on variants of Solaris or BSD just fine?


I've used ZoL since it was created, and zfs-fuse before that. I ran it on my workstation for a few years (managing a 4x750GB RAID-Z (= ZFS's RAID-5 impl), with ext3 on mdadm RAID 1 2x400GB root), and then swapped to BTRFS for 2x2TB BTRFS native RAID 1 (which was Oracle's ZFS competitor that seems to be largely abandoned, although I see commits in the kernel changelog periodically), and now I'm back to ZFS on a dedicated file server using 2x128GB Crucial M550 SSD + 2x2TB, set up as mdadm RAID 1 + XFS for the first 16GB of the SSDs for root[2], 256MB on each for ZIL[1], and the rest as L2ARC[3], and the 2x2TB as a ZFS mirror. I honestly see no reason to use any other FS for a storage pool, and if I could reliably use ZFS as root on Debian, I wouldn't even need that XFS root in there.

All of this said, I get RAID 0'ed SSD-like performance with very high data reliability and without having to shell out the money for 2TB of SSD. And before someone says "what about bcache/flashcache/etc", ZFS had SSD caching before those existed, and ZFS imo does it better due to all the strict data reliability features.
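
A pool with roughly that shape can be created in one go, e.g. (device names are hypothetical; two log devices and two cache devices alongside the mirrored data disks):

    zpool create tank mirror /dev/sdc /dev/sdd \
        log /dev/sda2 /dev/sdb2 \
        cache /dev/sda3 /dev/sdb3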

[1]: ZFS treats multiple ZIL devs as round robin (RAID 0 speed without increased device failure taking down all your RAID 0'ed devices). You need to write multiple files concurrently to get the full RAID 0-like performance out of that because it blocks on writing consecutive inodes, allowing no more than one in flight per file at a time. ZIL is only used for O_SYNC writes, and it is concurrently writing to both ZIL and the storage pool, ie, ZIL is not a write-through cache but a true journal.

The failure of a ZIL device is only "fatal" if the machine also dies before ZFS can write to the storage pool, and the mode of the failure cannot leave the filesystem in an inconsistent state. ZFS does not currently support RAID for ZIL devices internally, nor is it recommended to hijack this and use mdadm to force it. It only exists to make O_SYNC work at SSD speeds.

[2]: /tank and /home are on ZFS, the rest of the OS takes up about 2GB of that 16GB. I oversized it a tad, I think. If I ever rebuild the system, I'm going for 4GB.

[3]: L2ARC is a second-level store for ZFS's in-memory cache, called ARC. ARC is a highly advanced caching system that is designed to increase performance by caching often-used data obsessively instead of being just a blind inode cache like the OS's usual cache, and it is independent of the OS's disk cache. L2ARC is sort of like a write-through cache, but is more advanced in that it makes a persistent version of ARC that survives reboots and is much larger than system memory. L2ARC is implicitly round robin (like how I described ZIL above), and survives the loss of any L2ARC dev with zero issues (it just disables the device; no unwritten data is stored here). L2ARC does not suffer from the non-concurrent writing issue that ZIL "suffers" (by design) from.


> (which was Oracle's ZFS competitor that seems to be largely abandoned although I see commits in the kernel changelog periodically)

Under heavy development, officially supported by most of the major commercial distros, and still designated by Linus as the replacement for ext* as the standard Linux filesystem.


I'm curious where the idea that it's largely abandoned comes from. This isn't the first time I've heard it, but each time I've heard it and looked, the project looks far from dead or abandoned.


It seems to be a talking point amongst folks who want everyone to ignore Sun's (and subsequently Oracle's) determination to keep ZFS out of Linux. If there's a healthy btrfs project, then the constant stream of suggestions that Linux distros should just ignore the huge legal problems associated with bundling ZFS seems like a completely insane idea.


Can you speak more about why ZFS is better than BTRFS?


Working RAID 5 and 6, RAID "7" (Z3[1], triple parity, where Z/5 is single, and Z2/6 is dual), tiered RAID setups (such as JBOD'ed RAID-Z3s), zvols (as in, but not limited to, swap partitions in ZFS), more nuanced multi-controller and failover/spare setups, write-only journals (ZIL), second tier caching (L2ARC), LZ4 transparent compression, better designed snapshot and snapshot cloning support, more mature CLI tools, support on other OS's (with shared code bases for bonus points), configurable checksum algos, and a few other things that I'm forgetting at the moment.

[1]: Seen that 90-drive Supermicro drive chassis[2]? Three 29-drive RAID-Z3s with 3 hot spares (which are shared across the three RAIDs) in a single storage pool (round-robin-esque), and you could plug that into a 2U with 16 small SSDs (like those Crucial M550 128GBs I use now) for the ZIL/L2ARC farming, and then have, well, near-infinite IO performance. Good luck trying to assemble that with BTRFS.

[2]: A wet dream for CEPH users, too.
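
A scaled-down sketch of the layout in [1] (device names are hypothetical and far fewer disks are used per vdev):

    zpool create bigtank \
        raidz3 sdb sdc sdd sde sdf sdg sdh \
        raidz3 sdi sdj sdk sdl sdm sdn sdo \
        spare sdp sdq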


Re: [1], isn't that far too many drives in one RAIDZ? Referencing this: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Prac...


Yes and no. The optimal number for RAIDZ/Z1/Z2, claimed by many benchmarks, is (power-of-two count of data drives) plus (number of drives for parity)[1] (for example, 8 data + 2 parity = 10 drives for a RAIDZ2), with the upper bound being the likelihood of another drive failing before a rebuild can complete.

There is a tradeoff of, for example, using 3 RAIDZs each with a third of the drives in your pool, or 1 RAIDZ3 using all the drives in your pool.

In the first case, a rebuild takes less time, but the chances of all data in that RAIDZ being lost is (I think) 4 times higher (but statistics is not my strong point; it may be higher).

In the second case, a rebuild takes longer, the chances of another drive failing before rebuild is complete is higher, however, you require three failures before you have a problem; the array is still fully protected from another drive failure after two drive failures.

The numbers in the example I gave, however, are not quite optimum for performance reasons (you'd be better off with 16 or 24), but most people don't like tons of hotspares (even though that case likely eats drives like old Sun thumpers would).

All of that said, it depends entirely on how much you want data reliability. And, also, it was just an example. Not necessarily a good one. I was just illustrating that ZFS can handle that many drives with ease, but BTRFS would not be a good fit.

[1]: There seems to be no difference between power of two and half way steps in real world performance when the number of drives are big enough, as in, 8, 12, and 16 will all perform similarly.


You're pretty wrong regarding performance here. In ZFS, performance is directly bounded by the number of vdevs given to a pool. If you make one 90 drive vdev, you are more or less limiting yourself to the iops of a single disk. It's a bit more complicated than that, as your throughput increases.

At work where we use ZFS extensively we more or less use # of vdevs * I/O performance of a single drive = total "worst-case" IOPS prior to caching. Caching of course is where ZFS will start to shine - it will win no performance crowns when put up against a "traditional" RAID6 array with the same number of spindles.

This is also why most production implementations of ZFS utilize many sets of mirrors. For example your 90 drive example would be 45 sets of mirrors in our environment, granting roughly 45*120 iops (if NL SAS) total peak capacity. We avoid RAIDZ largely for the reason it would drastically lower the total # of vdevs, and performance falls off a cliff in that situation.

ZFS (not on Linux) seems to handle up to around 300 drives fairly handily before you start to run into issues I've found. 280 is our current max cluster design.


I actually mentioned that, but maybe not as well as I should have. I did say 8, 12, and 16 drives in an array will perform similarly.

However, in cases where you are using SSDs for ZIL/L2ARC, vdev spam isn't as big of a factor. In the example I gave I was illustrating the amount of usable storage you could get in the context of having it reliable, not comparing performance of RAID-Z vs non-ZFS RAID.


I'm not the op but I'm running ZoL on a small cluster with 200 HDDs and 50 machines. I've also tested btrfs.

Why is ZFS better (IMHO):

+ Tooling. The zpool and zfs commands are clear, easy to use and well documented. btrfs, for example, has no way to get a list of files with checksum problems. Also, automounting and flexible mount points, as well as the whole zfs get/set concept, feel like a really concise way to configure all aspects easily (a small example follows at the end of this list).

+ Architecture. ZFS uses merkle hash trees with multiple copies spread over the disk (ditto blocks). This means that disk failures likely won't affect metadata and every possible disk problem is accounted for (except main memory). btrfs only uses crc32c and has 2 copies of the metadata tree on disk (but not spread over the disk afaik) or simple RAID1 in case of multiple disks. So in case of disk problems btrfs tends to corrupt faster. I've actually tested this and it's almost impossible to corrupt ZFS metadata even with hundreds of bad blocks. btrfs tends to switch faster to read-only mode with bad disks.

+ Compression: LZ4 is slightly better than LZO

+ L2ARC/ARC - easy to plug in a SSD cache. ARC is better than the pagecache for most workloads.

+ It's really fun to work with. A lot of good documentation. Dealing with btrfs felt very different...

+ RAIDZ(2-3) and copies=n for datasets. Neither exists in btrfs (yet).
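
The tooling point above in practice (pool and dataset names are hypothetical):

    # list files affected by permanent (checksum) errors
    zpool status -v tank
    # the uniform get/set configuration model
    zfs get compression,mountpoint,sharenfs tank/data
    zfs set mountpoint=/srv/data tank/data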

Reasons against ZoL:

- Memory. I've gotten out-of-memory errors because ARC memory is not freed fast enough. Also, ZFS hogs a lot of memory: at least 750MB to 1GB of kernel memory with several disks. btrfs needs far less. There is upcoming work to fix the ARC out-of-memory issues though; it's on the developers' roadmap.

- Stability. This also applies to btrfs, but it's not as rock-solid as ext4. Especially with broken disks and usage of advanced features, you'll eventually run into bugs. But the "normal" operations were 100% stable for me with ZoL.

- It's not native Linux. You have some Solaris porting layer modules, and integration into the Linux kernel is not as tight as you may like it to be. E.g. cgroups-based IO throttling was not possible, last time I looked. This may change in the future, but it's work to integrate it, and if e.g. the module does not compile for a new kernel you are screwed.

- btrfs will be supported and be a default choice in 1-2 years in likely all major Linux enterprise distributions. SLES already made it the default. For most workloads ZFS is likely not worth it. At least it's tough to sell against a default option.

- rootfs on a ZoL volume is quite a lot of work and difficult to get working. But this is largely a distribution problem.


Below are a few remarks:

To expand on your LZ4 comment, LZ4 is 3 times faster than LZO at decompression:

https://code.google.com/p/lz4/

However, LZ4's real innovation is on incompressible data. I do not have a reference at the moment, but there have been benchmarks on the userland lz4 tool on incompressible data that show it processing 10GB/sec. This is because LZ4 uses a hash table that enables it to give up when trying to compress data far sooner than other lempel-ziv implementations do.

As for stability, it is hard for me to tell what you mean by that, but I just wrote a series of blog posts on the topic. If there are any outstanding issues not listed there, please file an issue on github:

https://github.com/zfsonlinux/zfs/issues/new

I also encourage you to file an issue requesting support for cgroup-based IO throttling. You are the first person to mention it.

As for not compiling with new kernels, the 0.6.3 release supports Linux 2.6.26 to Linux 3.16 while HEAD adds support for Linux 3.17 release candidates.

I briefly touched on the rootfs-on-ZFS issue in my blog post. I plan to add ZFS support to syslinux in the near future as per a discussion that I had with Peter Anvin at LinuxCon North America 2014. I expect this to resolve the issue for those willing to use syslinux as their bootloader.


Wow! Thanks for your reply. I've experienced some minor issues with severely broken disks where the SPL layer discards the disk but the zpool is still online. It's not a major issue and I've yet to encounter it again. I've already filed a related bug: https://github.com/zfsonlinux/zfs/issues/2508 but I'm not an expert. If I can gather more data, I'll try to submit as many details and hints as I can find.

As for cgroups blkio - there is already a bug report: https://github.com/zfsonlinux/zfs/issues/1952

Sorry. It was not my intention to suggest that ZFS does not compile on new kernel versions. I've never had problems with that. I just wanted to point out that it's not in the mainline kernel and for some people that might be important.

Thanks for your great work on ZoL! It made my life a lot easier :)


I am also a Linux kernel contributor, so running the latest kernels is important for me. This prompted me to switch from Nvidia graphics to Intel graphics a couple of months ago. ZoL's future kernel version support should reflect this.

As for the issues you linked, the project will likely not be able to tackle them until next year, but being in the tracker means that they will receive attention.


Is it possible to integrate the ZoL modules with DKMS so that kernel upgrades are handled automagically?


Yes. It is done this way on EPEL distributions (i.e. Fedora, CentOS, Scientific Linux) and Ubuntu. However, it is not presently done on Debian and Gentoo. I am hopeful that this will change on Debian in the future. Gentoo makes kernel upgrades the responsibility of the system administrator, so this does not really apply to it.

WRT Gentoo, Gentoo does have a tool called genkernel that I consider to make such upgrades fairly easy. It can handle any of initramfs generation, kernel compilation, kernel configuration and bootloader configuration (GRUB2). However, it does have a moderate learning curve. I am also one of the genkernel developers, so I likely have an implicit advantage over users in understanding genkernel. In particular, I find some users have difficulty discovering its functionality, despite things being fairly well documented in the man page. There is likely more that needs to be done in this area.


That's exactly how it is set up. Just install a couple deb files or set up the repo and DKMS does the rest on every distro I've tried it on.
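
For example, on Ubuntu at the time this looked roughly like the following (the PPA and metapackage names are from memory, so treat them as an assumption):

    add-apt-repository ppa:zfs-native/stable
    apt-get update
    apt-get install ubuntu-zfs    # pulls in the DKMS-built modules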


A btrfs versus ZFS comparison probably deserves a blog post of its own, but I will try to address your question. I wrote the following on this topic last year:

https://groups.google.com/d/msg/funtoo-dev/g9OY_vqVpCM/VTKF8...

However, significant time has passed and it requires some corrections to be current:

1. I have not heard of any recent data corruption issues in btrfs, although I have not looked into them lately.

2. btrfs now has experimental RAID 5/6 support, but is neither production ready nor as refined as ZFS' raidz.

3. I should have said "inline block-based data deduplication". You can (ab)use reflinks to achieve file-level data deduplication in btrfs, but it is not quite the same. btrfs now has a bedup tool that makes using reflinks somewhat easier:

https://btrfs.wiki.kernel.org/index.php/Deduplication

4. btrfs now has some kind of incremental send/recv operation. However, it is not clear to me how it handles consistency issues from having "write-able snapshots":

https://btrfs.wiki.kernel.org/index.php/Incremental_Backup

5. Illumos' ZFS implementation is now able to store small files in the dnode, which improves its efficiency when storing small files in a manner similar to btrfs' block suballocation. This feature will likely be in ZoL 0.6.4.

Aside from those corrections, what I wrote in that mailing list email should still be relevant today. However, there are a few advantages that ZFS has over btrfs that I recall offhand that I do not see there or in nisa's reply:

0. ZFS uses 256-bit checksums with algorithms that are still considered to be good today. btrfs uses checksum algorithms that are known to be weak. In specific, btrfs uses CRC32 on 32-bit processors and CRC64 on 64-bit processors. CRC32 is the same algorithm used by TCP/IP. Its deficiencies are well documented:

http://noahdavids.org/self_published/CRC_and_checksum.html

I have not examined CRC64, but I am not particularly confident in it. btrfs should have room in its on-disk data structures that would allow it to implement 256-bit checksums in a future disk format extension, but until then, its checksum implementations are vastly inferior.

1. The ztest utility that I described in the blog allows ZFS developers to catch issues that would have otherwise gone into production and debug them from userland. No other filesystem has anything quite like it.

2. ZFSOnLinux is the only kernel filesystem driver that is kernel version-independent, so if you are unable to upgrade your kernel, you can still get fixes. The inability of people to always update their kernels is an issue Linus mentioned at LinuxCon North America 2014.

3. The CDDL gives ZFSOnLinux a patent grant for the ZFS patent portfolio. This is something that btrfs does not have and will likely never have unless Oracle decides to provide one. Consequently, Oracle is the only company in the world that I know is able to ship products incorporating the btrfs source code without being at risk should btrfs infringe on one of the dozens if not hundreds of patents in the ZFS patent portfolio. A small subset of them can be accessed from the Jeff Bonwick Wikipedia page:

https://en.wikipedia.org/wiki/Jeff_Bonwick


Metadata/data checksums are CRC-32C on all platforms and are per 4KB fs block. And ext4's (optional) checksumming also uses it. TCP's checksum is 16-bit and not a CRC, although ethernet makes use of CRC-32. While SHA-2 is a cryptographic function and CRC-32C is not, and is therefore inferior, as a checksum in the context of mostly-but-not-entirely-trusted hardware it is adequate. It's also fast, to the degree that on modern hardware there's no point using nodatasum to disable it. Even 32-bit hardware handles it.

Btrfs send/receive works on read-only snapshots. You first take a read-only snapshot and send it; on the receiving side it is also read-only. To make it read-write, snapshot it again (without -r), and optionally delete the read-only snapshot.
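A rough sketch of that workflow (the paths and names are made up):

    btrfs subvolume snapshot -r /mnt/data /mnt/data-ro     # read-only snapshot to send
    btrfs send /mnt/data-ro | btrfs receive /backup        # /backup must be on btrfs; copy arrives read-only
    btrfs subvolume snapshot /backup/data-ro /backup/data  # writable snapshot of the received copy
    btrfs subvolume delete /backup/data-ro                 # optionally drop the read-only one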

I agree that needing a current kernel to get urgent fixes is a problem for friendly filesystem usage, particularly while development is heavy and backporting is non-trivial.


You are correct. Ethernet frames use CRC32 while the TCP/IP checksums are 16-bit and not CRCs. While the checksums need not be of cryptographic quality (and not all checksum algorithms in ZFS are), there are multiple trivial ways in which the wrong data can match a CRC32 checksum, and that is inadequate for the purposes of data integrity. It does help, but it does not go far enough.

As for my assertion that btrfs uses 64-bit CRC on 64-bit processors, this was the result of a misunderstanding between myself and the btrfs developers in #btrfs on freenode last year. I double checked with them and it turns out that you are correct that btrfs uses 32-bit CRC even on 64-bit processors. That makes btrfs much worse in this area than I previously thought.

Thanks for clearing some of the confusion about btrfs send/receive on its snapshots. To be clear, does this mean that btrfs will refuse to send a "writeable snapshot"?


I think the triviality and inadequacy are overstatements, for two reasons. First, it's rare for errors in normally operating hardware to go undetected by the hardware and simultaneously go undetected by Btrfs. Second, if that doesn't go far enough, then I'm hard pressed to characterize the vast majority of the world's data on NTFS, HFS+, ext2/3/4, and XFS as anything but fragile and tenuous, which would itself be an overstatement. What counts as going far enough is a use case question rather than something absolutely determinable. For a general purpose filesystem, I think CRC-32C is adequate, although it will be nice when there's a choice of another algorithm to meet other use case requirements.

Yes, it won't send writable snapshots. e.g. ERROR: /mnt/root is not read-only.


As others have commented, CRC32 or CRC64 being "weak" in a cryptographic sense doesn't mean it's not suitable for detecting disk errors. x86 has a crc32 instruction, making it far faster than any other option. By the way, TCP doesn't use CRC32... it uses a simple additive checksum.

Chris Mason wrote most of btrfs while he was at Oracle. Hence, the GPLv2 grants an implicit patent license to whatever patents Oracle might hold that bear on btrfs.


Saying that a checksum is weak in a cryptographic sense would mean that it is possible for a malicious intelligence to generate collisions. This is not what I meant when I said that CRC was weak. Instead, I meant that it is trivial for ordinary glitches to corrupt data in ways that the checksums fail to detect. Things like byte swaps and non-adjacent double bit flips are all that is necessary. The link I provided elaborates on this:

http://noahdavids.org/self_published/CRC_and_checksum.html

As for patent grants, you are thinking of the GPLv3. The GPLv2 does not provide any patent grant. That is one of the reasons why the GPLv3 was made. If you place code under the GPLv2, you do not necessarily grant a license to any software patents you hold that cover it. People who wish to use code encumbered by such patents must license them or be at risk of a lawsuit.


This is not what I meant when I said that CRC was weak. Instead, I meant that it is trivial for ordinary glitches to corrupt data in ways that the checksums fail to detect. Things like byte swaps and non-adjacent double bit flips are all that is necessary. The link I provided elaborates on this ...

Since we're talking about data corruption that has managed to not be detected by the hard drive's own firmware (a read error not detected as a read error), I now wonder what sorts of data errors might also slip by the 32-bit CRC used by btrfs.

It would be hilarious (in a bad way) if the same algorithm was used at both levels. So that the same byte swap (for example) would exactly bypass both the drive firmware and the check done by btrfs.

If the algorithms are different though, then I'd expect the chances are reduced that a particular error can slip by both.


It's pretty scary; data on ordinary filesystems just evaporates over time. Undetected bit-flips every few terabytes from the drive itself, flaky cables and controllers, bit flips in your non-ecc ram, ghost writes, misaligned reads, firmware bugs, cheap port multipliers, etc, etc.

I've had a zfs filesystem for a few years now and twice it's detected and corrected an error that would have been silent data corruption in a lesser filesystem.
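For anyone curious, a periodic scrub plus a look at zpool status is how these show up (the pool name is made up):

    zpool scrub tank       # read every block and verify checksums, repairing from redundancy
    zpool status -v tank   # reports checksum errors and any affected files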


Serious servers use ECC RAM so bit-flips in memory are not an issue. CRC32 is also pretty robust against single-bit errors. If you think about it, you can't really make any guarantees for non-ECC systems. Flip the wrong bit, and the kernel will delete all your data. There is an if statement somewhere waiting to make this happen, and if the program code can randomly change, eventually it will.

Typical disk corruption patterns are a whole sector of zeroes being read, or reading data that belonged somewhere else. If you want to prove that ZFS's checksum is better, you need to prove that it's better against the patterns of corruption that actually occur.


As for patent grants, you are thinking of the GPLv3. The GPLv2 does not provide any patent grant.

This is incorrect. See http://en.swpat.org/wiki/GPLv2_and_patents


I got the impression somewhere that btrfs RAID 5/6 support allows, or will allow, new devices to be added to an existing RAID group.

That's an important feature for home and small business users. Having to replace every drive in a RAID group to grow it, as ZFS requires, is painful and expensive.

Fortunately, drives are cheap enough these days that you can just way oversize your pool to begin with. But anyone switching to ZFS should be aware of the need to do this.


You can add/delete devices from raid5/6 volumes now. The raid5/6 code is still experimental; in particular, while detected problems are fixed on the fly and reported to userspace, the fixes aren't written back to the drives. That limitation applies to normal usage and to scrubbing. A balance detects and fixes these.
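For reference, that looks something like this (the mount point is made up):

    btrfs balance start /mnt/pool    # rewrites block groups; errors found along the way get fixed on disk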

Also, determining that a drive is "faulty" (in the md/mdadm sense) and communicating that to userspace isn't in place yet. If that is in place for ZoL (?), that'd be a considerable difference, a bigger one than the checksum algorithms in my opinion.


Can you link to anything about oversizing the pool? I think I vaguely remember a trick with creating a sparse file and adding that to the pool until you can replace that with a drive.
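If I'm remembering it right, it was roughly something like this (the device names and size are made up):

    truncate -s 2T /tank-sparse.img                        # sparse file, takes no real space yet
    zpool create tank raidz /dev/sda /dev/sdb /dev/sdc /tank-sparse.img
    zpool offline tank /tank-sparse.img                    # run degraded until a real drive is available
    zpool replace tank /tank-sparse.img /dev/sdd           # swap in the real disk later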


BTRFS is still maturing. Kernel bugs involving BTRFS are still popping up pretty regularly.


Debian is the last major Linux distribution where it is not easy to do / on ZFS. It might be possible to implement a module for Debian's initramfs generator in ZoL upstream. I suggest filing an issue to inquire about this possibility:

https://github.com/zfsonlinux/zfs/issues/new


They have it for Ubuntu, but Ubuntu and Debian have somewhat diverged on how it is done. It is not a big issue for me on this build, but if I wanted to do large scale file servers and didn't want to stash a USB dongle in the box for root, it would be useful.

Although, I guess I could use FreeBSD if I didn't need KVM and/or Xen....


How about FreeBSD guest on Xen with Linux dom0?

https://wiki.freebsd.org/FreeBSD/Xen


> I guess I could use FreeBSD if I didn't need KVM and/or Xen....

Have you looked at BHYVE?


There is bhyve now...


I've heard mixed reviews of that. It is probably not quite ready for production, but it will be interesting in the future.


Just to add to the reviews - I've had difficulty in getting Bhyve to consistently work. Sometimes the VM will boot fine, others it will just crash halfway into the boot cycle. When it works, it's good. It has a few years before it gets to Xen/KVM levels of stability, but it's a really good start.


And then how do you use this file server to take advantage of the RAID 0'ed SSD-like performance? Do you export it as NFS? iSCSI?


It isn't solely a file server. I made it to take over Linux workstation duties from my workstation (which now runs Windows 8.1, which, tbh, really isn't nearly as bad as everyone says it is), and it runs VMs for clustered software testing and development.

I export /home and /tank via Samba[1] to the now-Windows workstation (and via sshfs to my OS X MBP, which I may transition to Samba now) and get about 85% of Gigabit in performance, which is good enough for purely file storage tasks. I'm considering dropping 10gbit cards into both machines via http://www.provantage.com/~7MLNX1KC.htm (the cheapest 10gbit cards I've seen that aren't used pulls on ebay) if I start making bigger test clusters that run on both my workstation and my server.

Workstation has 2x Crucial M500 960GB RAID 1 (was BTRFS RAID 1, now is Intel Matrix FakeRAID RAID 1 in Windows), for the record.

[1]: For each share: aio read size = 1, aio write size = 1, use sendfile = yes, min receivefile size = 1, socket options = SO_KEEPALIVE TCP_NODELAY IPTOS_THROUGHPUT

Do not mess with SO_SNDBUF and SO_RECVBUF for socket options, the default seems to already be the maximum size allowable on Linux.


You need to tell the kernel if you need larger-than-default SO_SNDBUF/SO_RCVBUF, e.g. for 4 MByte (default is 128 kByte):

  sysctl -w net.core.rmem_max=4194304
  sysctl -w net.core.wmem_max=4194304
Documentation: http://git.kernel.org/cgit/linux/kernel/git/stable/linux-sta...


The default seems to be 128 kB for both, and that seems to be enough. It is already bigger than the buffers on most NICs.


I created the script below a while (a year) ago. It (deb)bootstraps a working Debian Wheezy with ZFS on root (rpool) using only 3 partitions: /boot (128M), swap (calculated automatically), and rpool (mirrored or raidz'ed according to the number of your disks).

All comments are in Brazilian Portuguese. I didn't have time to translate them to English. Someone could do it and file a pull request.

https://github.com/turrini/scripts/blob/master/debian-zol.sh

Hope you like it.


Thanks for sharing. I will let Debian users interested in / on ZFS know that this is available as they ask me about this sort of thing.


When swap doesn't work, mmap is unlikely to work correctly either.

Figuring out why that is so is left as an exercise for the poster.


Putting production data on a driver maintained outside the mainline Linux kernel is a bad idea.

That isn't a licensing argument - I'm happy to use a proprietary nvidia.ko for gaming tasks, for example, because I won't be screwing up anyone's data if it breaks.


Maybe the odds of it doing so are smaller, but if you think a broken nvidia.ko can't screw up anyone's data I think you're simply mistaken.

Say Nvidia's driver has a use-after-free bug: it kmalloc()s a buffer, kfree()s it, then a filesystem kmalloc()s something and gets allocated the same buffer. If nvidia.ko then decides it still wants to use that buffer and writes something into it...kablooie.

Unless you start running some microkernel-ish thing with drivers each running in their own distinct address spaces, you're going to have a hard time avoiding this possibility.


You could be "screwing up" someone's data if an in-tree filesystem breaks. If you had read the supplementary blog posts, you would have seen the following:

http://lwn.net/Articles/437284/

Nearly all in-tree filesystems can fail in the same way described there. ZFS cannot. That being said, no filesystem is a replacement for backups. This applies whether you use ZFS or not. If you care about your data, you should have backups.


Well, to be honest, all filesystems that are currently in the kernel tree started out as being maintained outside the tree. Inclusion into the kernel is normally one of the end goals. It's part of the standard progression - 1) rapid development outside of the tree, 2) once the filesystem is stable, it negotiates for inclusion, 3) inclusion into the kernel, 4) maintenance / updates as part of the main kernel development process.

ZoL isn't even trying to get included into the kernel, so it's a bit of an odd duck here.


While I like most parts of ZFS, these days BTRFS is both stable and performs well with a decent feature set. We moved from ZFS and EXT4 to BTRFS for a good portion of our production servers last year - and we haven't looked back.


Do you run RAID5/6? I had that running for half a year, and it crashed often.

Now I'm on ZFS (raidz) and it works flawlessly.


We run it on iSCSI LUNs straight from our SAN so this hasn't been an issue for us.


I have a laptop running ubuntu with a single SSD. Does it make sense to run it with ZFS to get compression and snapshots? If I add a hard drive, again does it make sense (perhaps using SSD as cache (arc?) )


I've never heard of someone using ZFS with a single disk. You're probably better off with ext4.

The compression and deduplication features of ZFS are terrific on network filers. Compression could possibly improve performance slightly on a single disk system.

With two disks, I'd say you'd probably be better off with running RAID0 (or no RAID at all) and having a great backup plan. Using another SSD to cache writes to another SSD doesn't make a whole lot of sense to me.


I use it as my laptop's root filesystem. It works well for me. Here is a link to notes on how I installed it:

https://github.com/ryao/zfs-overlay/blob/master/zfs-install


Honestly, I recommend XFS over ext4. It seems to be a much more mature file system, and Redhat-family distros (RHEL, Centos, Scientific, etc) have switched to XFS as the default filesystem (instead of moving to ext4 from 6.x's default of ext3; 6.x does not support XFS or ext4 for root).

XFS performs better out of the box on a wide range of hardware, while seemingly giving stronger data reliability guarantees (but not anywhere near ZFS's).

ZFS on a single disk, however, will still give you data checksumming, so you can detect silent data corruption. XFS's sole missing feature as a basic filesystem, imo, is data checksumming.


RHEL6 uses ext4 as default[1] filesystem. It certainly does support ext4 as root.

[1]: https://access.redhat.com/documentation/en-US/Red_Hat_Enterp...


Weird, I had to install RHEL6 for a customer, and it defaulted to ext3 and ext4 was not selectable.


Well, with ZFS you could squirt snapshots of /home to another box for backups. Since they're just the changed blocks, they'd be fairly small.
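Something along these lines (the dataset and host names are made up):

    zfs snapshot tank/home@monday
    zfs send tank/home@monday | ssh backupbox zfs receive backup/home
    # later sends only ship the blocks changed since the previous snapshot
    zfs snapshot tank/home@tuesday
    zfs send -i tank/home@monday tank/home@tuesday | ssh backupbox zfs receive backup/home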


Yes to using ZFS for your root drive. Probably no to using L2ARC in a laptop.


I've looked into ZFS before for distributions like FreeNAS. Is there any solution on the horizon for the massive memory requirements?

For example, needing 8-16 GB of RAM for something like an xTB home NAS is high.


The "massive memory requirements" only exist if you use data deduplication and care about write performance. Otherwise, ZFS does not require very much memory to run. It has a reputation to the contrary because ARC's memory usage is not shown as cache in the kernel's memory accounting, even though it is cache. This is in integration issue that would need to be addressed in Linus' tree.



