Apple kicks ZFS in the butt

tvon · on Aug 30, 2009

Summary: There is still no ZFS in Snow Leopard, we still don't know why. Let's speculate.

mmphosis · on Aug 30, 2009

I'll speculate: Testing.

* maybe with ZFS, there were problems like slow speed, unreliability, poor resource use, etc... more speculation of Apple's testing benchmarks...

* maybe without ZFS, no problems were found in testing. If this was the case, and remember I am speculating, I would ship without ZFS, easy.

pohl · on Aug 30, 2009

I was looking forward to ZFS, but there was a recent article about BTRFS that caught my eye.

What I would really like to see, though, is some modern filesystem with a license that is neither offensive to a commercial vendor like Apple or Microsoft, nor offensive to the GNU crowd, or the BSD crowd. I'm not sure that's even possible, but it would be nice to know that I could format a flash drive in a universal way that isn't FAT32.

rbanffy · on Aug 30, 2009

"format a flash drive in a universal way that isn't FAT32."

Microsoft will never, ever support a technology they don't control. FAT32 is just fine by them and you can't count them out when you say "universal".

They could, of course, use BSD licensed stuff, but they would not be able to put pressure and extract licensing fees like they do with FAT.

glymor · on Aug 30, 2009

Anything out of BSD land would have a compatible licence with both commercial vendors and the GPL. For example http://en.wikipedia.org/wiki/HAMMER

If your talking about writing something and being sure to be able to read it at some future time UFS is read-compatible pretty much everywhere.

barrkel · on Aug 30, 2009

I'm running ZFS on OpenSolaris kernel (Nexenta, for GNU userland) for home storage, with my main zpool at 9.06TB spread over 8 hard drives, for almost 7TB worth of ZFS storage.

ZFS isn't quite what's billed to be, but it's (mostly) better than every other current choice. You can read the positives everywhere else, but let me point out some of the negatives.

* Space wastage: block sizes for files grow up to 128K, but they never shrink, and the last block in a file is currently always the same as the overall file block size. This means that files e.g. 129K in size will have 127K of wasted space on disk, and files that were once 128K and truncated back down to 1 byte will still have 128K block size. Depending on the kinds of files, you can see space wastage rates in the region of 25-45%. This isn't theoretical: I had to reduce max block size to 8K on one filesystem to avoid huge wastage.

* Streaming latency: as a media server, I often play movies from data stored on this server, but there's a problem somewhere in the stack (network, OS file cache, filesystem implementation). If I'm playing a movie or TV episode from a ripped DVD (i.e. VIDEO_TS directory) I can guarantee that there'll be at least one glitch in every 40 minutes of playback, where network transfer has dropped to zero for 3 or 4 seconds.

* Device removal: you can add more storage to a ZFS pool, and one can replace any given device with a drive equal or larger in size, but it's not possible to remove a device. If you start out with 8 disks, you're stuck with 8 disks unless you want to backup and restore the entire FS - but at least FS streaming is easy with e.g. zfs send <snapshot> | ssh <remote> zfs receive <filesystem>.

* Fragmentation: there is no defragmentation solution currently. Sure, people say it's "not an issue", that "ZFS is not pathalogical", but the same was said about NTFS, and NTFS fragments like crazy when small incremental writes are used. ZFS performance is known to fall off a cliff (e.g. less than 1% of prior perf) as you approach capacity (symptomatic of fragmented free space) and, more importantly, stay abysmal until you back off quite a bit and preferably drop a whole file system or two. Recommendation is to stay under 80%, which isn't very full.

* Performance: performance is a bit of a red herring; you can make ZFS as fast as you want by adding mirroring and striping. The more interesting case is when you don't have mirrors and stripes out the wazoo, and are relying on RAIDZ or RAIDZ2, and here the performance depends on access patterns. The way the parity and checksumming works, you need to burn a lot of cache to get decent small random read/write performance, because entire blocks need to get read and written even for small updates. So performance looks great for a while, then falls off.

I still think ZFS is one of the best choices for this kind of application - the software RAID, multiple-volume handling is terrific - but it's far from the last word, and not 100% baked (needs defragmentation and data recovery tool ecosystem). BTRFS looks like it could be a very worthy contender, but from what I've read, I haven't seen that it values a storage-unit/pool/filesystem stack, where storage-unit can be a disk, a file or a partition, and still seems stuck on the old device/filesystem approach.

dlsspy · on Aug 30, 2009

"better than every other current choice"

That's all that really matters to me.

My media system is using ZFS (currently 98.7% full), and I keep a small zpool in my backpack with a bunch of data on it I don't want wasting space on my SSD.

There's nothing else I trust my data on at this point.

I have a FreeBSD machine with a raidz for all my other data (databases, mail, a copy of my music, my photo album, etc...) It was RAID5 before that and it was pretty terrible.

My current plan is to stick with ZFS until something better comes along, and there isn't all that much better showing up in this neighborhood. That does mean my media system can't run Snow Leopard, though. :(

ciupicri · on Aug 30, 2009

What case are you using? Also, what are you using for cooling?

barrkel · on Aug 30, 2009

Just an average case, Gigabyte 3D Aurora full tower. Here's a review:

http://www.cluboverclocker.com/reviews/cases/gigabyte/aurora...

HDs are both in the internal HD bays, which have a dedicated fan, and in 3.5/5.25 adapter in the 5.25 bays up top, with active cooling, it's something like this:

http://www.pcstats.com/articleview.cfm?articleID=2313

philwelch · on Aug 30, 2009

Apple has a history of working on features and then dropping them. They had one in the works for years where your home directory would be sync'd to your iPod so you could plug into any Mac anywhere and get it all back.

It's more than likely that ZFS meant biting off more than they could chew, so they deferred it for another release.

kylec · on Aug 30, 2009

You can still get at the home sync feature if you manually enable it - it can be found at /System/Library/CoreServices/Menu Extras/HomeSync.menu

andrewtj · on Aug 31, 2009

If I recall correctly that's actually used to sync network home directories to a local drive on machines that are bound to a directory service. Some blogger mistakenly thought it was related to the iPod home directory feature and the connection's just never died.

duskwuff · on Aug 30, 2009

The menu extra stub is still present, but there's no way to configure it -- the Accounts preference pane doesn't have the options that the menu extra refers to.

kylec · on Aug 30, 2009

You can get at them if you enable the menu item, then select "Mobile Account Preferences" from the dropdown. However, this is as far as I've gotten because I don't have the ability to test to see if this works.

duskwuff · on Aug 31, 2009

Oh, wow, you're right. (Tip: You have to unlock the Accounts pref pane first to make it show up.) I haven't tried this out, but I almost want to now.

dhess · on Aug 31, 2009

I don't think that's what you think it is. The Home Sync menu I'm familiar with is for mirroring a network home directory on your local disk, in conjunction with another machine running Mac OS X Server. As far as I know, this feature has nothing to do with portable home directories on iPods.

jeromewbrock · on Aug 30, 2009

2 things:

1. Sun's shared source license sucks and this is why ZFS isn't natively supported in Linux. 2. btrfs will soon have better features/performance/stability than ZFS, and a better architecture to boot: http://lwn.net/Articles/342892/

jodrellblank · on Aug 30, 2009

"Soon", maybe, but it will take a long time for it to be Generally Recognised As Reliable. (For all I argue against buying I to a brand just because, I do like ZFS more with Sun's name behind it)

jsz0 · on Aug 30, 2009

I don't really understand the hype around ZFS. What's so great about it?

I've tried it out on OpenSolaris a few times in an attempt to learn why it's so hyped up. The setup process is a bit cryptic and confusing. Even after the initial setup I was really confused about what commands were destructive and which ones were not. Based on that alone I would not consider using ZFS yet. The flexibility of ZFS was a bit lost on me. It seems like there's a ton of limitations and caveats to consider. I can't really comment on performance or reliability since this was a short lab test but in the last 5 years or so I've never lost a HFS+, EXT3 or NTFS file system so I'm not sure just how much more reliable ZFS can be. Is all this hype just acronym lust? I feel like a good RAID card combined with a semi-modern FS is still a better solution. It's certainly easier to setup IMO.

jodrellblank · on Aug 30, 2009

The hype is that with ZFS you don't need an expensive/extra point of failure RAID card and RAID card config, plus a logical volume manager with it's own utilities plus a filesystem plus a quota management system - you plug disks in, ZFS does the RAID and the volume management and is the filesystem and the quotas all in one set of tools (quite few / simple commands for basic usage, at least).

On top of that, ZFS stores all data along with checksums, so when you read a file it will pick up errors and correct them. How do you know the photos you haven't accessed recently are OK? That they will not be corrupt? You don't - you assume that if the disk shows up and the directory list shows up then they will be there. Wih ZFS, you set up scheduled scrubs and it verifies all the data is present, readable and correct - and because of this regular check, it picks up hardware failures more quickly.

Also, it has instant (well, small constant time) snapshots, which doubles as a time-machine/filesystem versioning system so you can see files as they were, and you can clone snapshots to take backups / send them over a network to keep a clone in sync.

Also, it was going to be Zettabyte File System (rumour) because it was designed to take filesystem size limits out of consideration.

Did I mention it writes data, then updates a pointer to the data in one operation so as long as your disks /controller do not cheap out and misrepresent what they are doing, data is always in the old or the new state - never half written.

Also, because of the lack of RAID controller, you can move the disks to another system and attach the zpool with leas requirements in an emergency.

That's pretty awesome - not just acronyms and buzzwords.

enneff · on Aug 30, 2009

ZFS allows RAID-5 without the "write hole" (google it). That's the killer advantage for me. The rest is just gravy.

I'm amazed you found set-up cryptic. Once I understood the concepts of ZFS pools and filesystems, I found the toolset among the most elegant and well-designed out there.

I think it might actually be a little daunting in it's simplicity, maybe. ("that's it?!")

jsz0 · on Aug 31, 2009

I think that might be it. I found it a bit hard to visual what was actually happening with these commands since they are all pretty simple.

lsc · on Aug 31, 2009

block level checksumming. http://storagemojo.com/2007/09/19/cerns-data-corruption-rese...

If you have never lost a file system, I'm guessing you haven't had very many.

Also, ZFS gracefully handles cheap disk. RAID cards don't handle the case where you are using consumer grade disk and it retries for hours rather than failing. (I recently tested this on a rather nice 3ware 9550SX. It handled the problem no better than linux md. Performance was worse on the 3ware, too. I'm sending it back.)

I am personally switching to 'enterprise' (read, 20% more cost for 30% less capacity) sata just so I get drives that actually fail rather than retrying when they are failed.

And this is ignoring the 'silent corruption' issue the CERN guys were on about. Block level checksumming fixes that too.

Also, cheap snapshots are awesome. LVM snapshots work, and they can save your ass in a pinch, but they are expensive performance-wise. In my case, where I'm pushing my disk to it's performance limits already, LVM snapshots are nearly unusable. ZFS snapshots, I understand, have almost zero performance overhead. (space is cheap. Performance, not so much.)

dhess · on Aug 31, 2009

This has nothing to do with ZFS, but FYI: you can configure consumer-grade Western Digital drives to report failures immediately, rather than retrying, just like their more expensive "enterprise" counterparts. See http://en.wikipedia.org/wiki/Time-Limited_Error_Recovery for details. A bit of Googling will turn up the WDTLER.exe binary that you need; make a bootable FreeDOS CD or USB stick, then run WDTLER from the command line to toggle the feature on or off.

Possibly there are other good reasons to buy the enterprise models, but using the cheaper drives is working out for me. I'm running 6 WD15EADS drives with TLER disabled in a software RAID6 configuration on a GNU/Linux machine, and except for the fact that two of the drives failed within hours of first use (the replacement drives work fine), I've had no problems running the array 24/7 for over a month now.

lsc · on Aug 31, 2009

Yeah, I have 2 wd 'black' consumer grade drives fixed with WDTLER paired with some seagate ES drives right now. (I mix makes to avoid firmware plagues) I read that WDTLER only works on 1TB and smaller drives. I will have to find some 1.5TB WD drives and experiment.

As for the relation to ZFS, it is supposed to timeout ill-behaving consumer drives like this. In fact, from what I understand, you can configure ZFS's behavior, for example, to fail the first drive that times out, and then hang on the second drive (which would perhaps be a reasonable configuration for a ZRAID volume)

dhess · on Aug 31, 2009

WDTLER works on the WD15EADS (1.5TB), as that's what I'm using.

mustpax · on Aug 30, 2009

Maybe Oracle is less willing than Sun to license their core differentiating server feature to competitors.

I believe Time Machine is still built on top of ZFS snapshot though. So obviously some ZFS code remains.

Edit: Apparently Time Machine is not based on ZFS. So it seems ZFS is completely off the map after all.

Locke1689 · on Aug 30, 2009

Actually,Time machine is not and has never been based on zfs. It is implemented using hfs+, journaling, and a kernel/fs modification for journaling and hard links. A time machine snapshot is basically modified hard link journals.

rbanffy · on Aug 30, 2009

The sad part is that Time Machine would be a one-liner had OSX supported ZFS since Leopard...

glhaynes · on Aug 31, 2009

Which one line (heck, doesn't have to just be one) command would you issue to get Time Machine-like functionality from ZFS?

ZFS contains a lot of features that sound like they overlap with Time Machine, but in practice they really don't in a way that would make Time Machine a "free" feature.

furyg3 · on Aug 31, 2009

I implemented a simple personal backup solution a long time ago which is (essentially) what Time Machine implements now (without the eye-candy application, of course). It's not that hard, it only requires hard-linking, which the OS X kernel didn't support, so they added it.

Steps: First run:

* Rsync data from source drive to backup drive (can be done over network, ssh, whatever). We'll call this backup "0"

Second run:

* On the backup drive, copy the last backup folder (backup "0") to a new backup folder (backup "1"), using hardlinks. Hardlink copies don't take any additional space, since they're pointing to the original resource on the disk (folders do add some negligible space).

* Rsync data from source drive to backup drive (backup "1"). Rsync, of course, only updates what's new/changed.

Repeat, saving as far back as you'd like, and deleting old backups if the drive is full, after N backups, or so many days. Now you've got an easy incremental backup bash script.

Hardlinks are beautiful, because they don't add extra space, and once all the hardlinks to a resource are deleted the resource becomes free space. This isn't an enterprise-level backup solution, but it's a great quick-and-dirty way to do incremental backups, and is exactly how Time Machine works.

glhaynes · on Aug 31, 2009

One way this isn't exactly like Time Machine: since it doesn't hook into the FSEvents journal of file system modifications, rsync has to check the modified dates of every file in the backup set. With FSEvents, Time Machine knows what's changed (or at least which folders contain changed items) and can deal with only those.

But, yeah, same basic concept for sure.

vsync · on Aug 31, 2009

rsnapshot works exactly that way as well and is a supported package.

rbanffy · on Aug 31, 2009

The gain I was referring to would be the use of ZFS snapshots instead of the convoluted and costly link-based mechanism Time Machine uses on HFS+ file systems.

Obviously the rest of Time Machine would not fit in one line, but allow me the right to exaggerate a bit.

I bet the timeline feature in Nautilus on OpenSolaris, while not a one-liner, must be quite simple an add-on.

glhaynes · on Aug 31, 2009

But that's exactly what I mean: ZFS snapshots make copy-on-write branches of a single volume. Time Machine backs up files to an external disk. Unless there's some proposal to make Time Machine-managed external volumes be part of a pool that's only attached some of the time and thus get some ZFS benefit from that, I don't see how TM would be any different in a ZFS world.

Now, local snapshots would be useful for sure (and could be integrated into the TM system in various ways), but they're totally orthogonal to separate-disk backup/versioning systems like TM.

rbanffy · on Sept 1, 2009

It would not be just different - it would be much better.

With ZFS, Time Machine would not require an extra disk and could maintain nearly continuous history of the whole file system even in single disk configurations. The minimum time between snapshots could be easily adjusted so that not much disk space is consumed in the process.

glhaynes · on Sept 1, 2009

It would be different.

To say it would be "much better" ignores that one of the primary usages of Time Machine is that it provides easy whole-disk backup in case of catastrophe. In this scenario (the primary reason why many people use it), ZFS is completely irrelevant.

Now, perhaps one could imagine a future version of Time Machine that would read from source file system snapshots while making backups to the external target in order to capture versions that were made while the user was unplugged from the external disk; and perhaps ZFS would be appropriate for Time Machine target disks in order to save space via block copy-on-write. I'd say I expect both of these features at SOME point in the future ("expect" inasmuch as they're obvious wins and one has to think that the future direction of file systems is to include constant time low cost snapshots via copy-on-write), though I don't know if the file system(s) will actually be ZFS-derived or not.

But this is a far cry from saying that Time Machine functionality would be free on top of ZFS. Any future version of Time Machine will include replication to an external disk (or, at least, "the Cloud") and that's NOT a feature of ZFS as it stands right now.

Locke1689 · on Aug 30, 2009

Actually,Time machine is not and has never been based on zfs. It is implemented using hfs+, journaling, and a kernel/fs modification for journaling and hard links. A time machine snapshot is basically modified hard link journals.

Herring · on Aug 30, 2009

You clearly have a solid backup infrastructure.

Locke1689 · on Aug 30, 2009

By that I mean that it saves hard links to "bundles" which haven't changed.

P.S. Woops, iPhone + airport == spotty internet.

wmf · on Aug 30, 2009

Apple dropped ZFS from Snow Leopard before Oracle bought Sun (and the acquisition hasn't even closed yet; AFAIK companies are supposed to behave as if nothing is going to happen until the deal closes).

dmaz · on Aug 30, 2009

If the point of 10.6 was to add improvements to the OS without major end-user changes, then ZFS could have been tested in the development cycle and then put on hold for the next major release.

miracle · on Aug 30, 2009

Well, they can integrate it in their next service pack then and charge another 50$! :-)

Herring · on Aug 30, 2009

It's going to cost $50 regardless, so why work on ZFS? I wonder if they've thought of just charging money for no features.