Despite what the article says about illumos not having something analogous, we do actually have something and have had it for more than a decade: https://illumos.org/man/1M/syseventadm
It allows programs to be run in response to sysevents, some of which are generated by ZFS and some of which are generated by other parts of the system (e.g., device hotplug).
That seems the right approach. Why would you want anything ZFS-specific -- other than events not being documented, which presumably means you dig them out of ZED?
Insightful article. ZFS is the best file system: you know you have no silent file corruption, you can run without RAID controllers, snapshots are near-instant and take little extra space, you can use a cache SSD for read acceleration of physical disks, and you get transparent file compression and a good command-line interface. Now, with ZED, you can also act on events, for example by sending a notification to an alerting system like Slack or a ticket system when a disk fails, or by starting a disk scrub/rebuild.
If you are on Linux I can highly recommend ZFS and Minio for S3 like storage. ZFS for local storage.
I don't understand where minio suddenly comes from?
Minio is barely documented. I had to ask in Slack to interpret what it meant when minio said "your cluster is 5 red and 7 yellow" as the colours aren't even documented.
Every minio cluster I hosted had data loss. Each one has led to a reported issue on their GitHub that hasn't been closed to date. Nothing about recovery is documented. Documentation is slim in general. Really confused why you're naming it next to ZFS, which is sublimely documented and has withstood the test of time.
I'd advise something battletested through time like Ceph for object storage instead.
I'm super confused how anyone could recommend Minio for the foundation of a production system as well. I've understood it to be primarily for development purposes and have used it for some local mock tests for S3 compatibility behind a firewall or to simulate error test cases more reliably in my applications.
Even Ceph has its warts for distributed object storage, as does basically... anything in existence worth considering that's OSS-ish (GlusterFS, HDFS, Lustre), but comparing Minio to ZFS is confusing given how drastically the two projects differ in purpose, functionality, and engineering hardening. With that said, the AWS S3 team really is impressive in what they've built out and deserves more shout-outs from people outside Amazon.
Could you expand more re: your minio experience? We're about to enter production with minio as our S3 interface-providing file storage system, and have found documentation to be sufficient.
However, I'm slightly concerned that we haven't tested it enough, and that doc and support may prove to be lacking (like in your case) when we hit edge cases and failure scenarios.
No LTS version. Bugs will be fixed in a new version, which may include new bugs. Some upgrades need a full cluster restart. If you ask too many questions you may get "it's better start a subscription".
We've been running for about a year now and had zero problems with it. (~20TB total, few single instances, serving as S3 backend to Restic - handmade HA - backup service)
It's a bit of a non-sequitur, but I am also looking at using MinIO as an S3 interface to a ZFS filesystem. Would be interested to hear from others about this use case for MinIO and possible alternatives.
The thing I never see mentioned when the pros and cons of ZFS are discussed is that ZFS is not zero-copy for things like sendfile. This is one reason why we (Netflix) use UFS for serving content rather than ZFS.
This is because ZFS is cached by the ARC, not the normal page cache. ARC is weird, and operates in 8K blocks (like sparc page size), rather than 4K pages. Zero-copy things like sendfile depend on referencing pages in the page cache, and have never been adapted to deal with ARC. So making sendfile zero-copy with ZFS is a hard project that would involve either teaching ARC to "loan" pages to sendfile, or ripping out ARC caching from ZFS and making it use the same page cache that all other filesystems use.
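For anyone who hasn't used sendfile directly, here is a minimal sketch of the call pattern being discussed. It is only an illustration in Python, not how any production server does it: `socket.sendfile()` uses `os.sendfile()` where the platform supports it and falls back to ordinary `send()` otherwise. Whether the kernel can actually avoid a copy depends on the filesystem underneath, which is the whole point of the comment above. The helper names and the payload are made up for the demo.

```python
import os
import socket
import tempfile


def send_file_over_socket(path: str, sock: socket.socket) -> int:
    """Send a whole file through a connected socket; return bytes sent.

    socket.sendfile() uses os.sendfile() when the platform supports it
    and silently falls back to plain send() when it does not.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)


def demo():
    """Round-trip a small temporary file through a local socket pair."""
    payload = b"video chunk " * 100  # 1200 bytes, fits in the socket buffer
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    sender, receiver = socket.socketpair()
    try:
        sent = send_file_over_socket(path, sender)
        sender.close()  # signal EOF to the reading side
        received = b""
        while True:
            chunk = receiver.recv(4096)
            if not chunk:
                break
            received += chunk
    finally:
        sender.close()
        receiver.close()
        os.unlink(path)
    return sent, received
```

The application never touches the file contents here; it only hands the kernel a file and a socket. That is why the cache the file's pages live in (page cache versus ARC) decides whether the transfer is really zero-copy.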
Thanks for the comment. I think it's a bigger issue: people advocating ZFS only promote the good features and aren't open about the downsides (or even try to downplay them with clear bullshit).
It all depends on the circumstances and requirements: (small) business application or some home-build NAS?
In your case, how much does it matter that some node experiences bitrot, and how big are those risks?
Our use case is far different than a home NAS (hundreds of TB of disk), I replied to that simply because it was also talking about potential downsides to ZFS.
For our use, bit rot is pretty low risk. We have tooling to catch corrupted files (and it happens surprisingly rarely). We don't care about any of the raid like features (if a drive dies, we tell clients to get their video elsewhere).
For our use case, ZFS would be attractive mainly because of the ability to keep metadata in the L2 ARC. One of our bigger sources of P99 latency is uncached metadata reads from mechanical drives. Our FS guys are currently solving that problem in other ways.
I use zfs in the home on one machine that runs nfsd and samba.
I started doing that because I saw corruptions on magnetic disks at home. Some files were silently corrupted. I had no redundancy. I didn't know which files were "good" either.
So now I have one multi disk machine running FreeBSD with zfs. Works well. Hardware isn't especially fancy. In the time since I have seen it catch hardware failures. I have seen it call out specific files as corrupt. This is a huge improvement over how I saw bad disks surface in the past.
I've now seen two SATA controller failures thanks to ZFS. One on my own machine and one on my parents'. Both presented as if a drive had failed, but it was actually the port on the controller that failed. It'd start up fine, but randomly reads (and probably writes) would just be corrupted and the disk would eventually stop talking. Changing disks wouldn't fix it, which is how I knew it was the controller, but replacing the controller made everything happy. It then resilvered with no issues.
Memory has much fewer “moving parts” than disk. You cannot get bad “memory cable”, and (AFAIK) there are no cases when overloaded power supply caused memory errors,
Obviously it's not connected by cable, but you certainly get the equivalent to cable errors, and I've been plagued by them on certain systems. It's revealing when you have a lot of systems with monitoring of ECC errors. I wouldn't like to say whether multiple DIMMs are necessarily more reliable even than rotating disks.
Silent data corruption is a real thing; I have a few thousand files as evidence.
If you are building a home NAS from quality rackmounted server parts, then maybe you are fine. But this was not an option for me, as I did not have dedicated server space and needed something quiet. And once you have to start messing with desktop cases full of hard drives, it is very easy to get corrupted data.
I run ZFS on my home NAS. Yes, it (probably) eats too much RAM, and it's (probably) not the fastest thing, but at least my data is intact. I had to piece together my photo collection from multiple backups, and it was not fun at all.
It provides another example that silent data corruption is a thing, and that it can happen. While SATA protocol has error detection, it is pretty weak (32 bit CRC) and it does not always help, especially since there is no way to tell how often the packets are retried.
I had a few cases of data damage. One of the worst ones was when I moved to a different place, and had to leave much of my stuff behind. I had half a dozen or so smaller drives in my PC (SATA + IDE) which were working just fine. I got about three new drives (I believe SATA 1TB?), installed them into PC, and copied all the files to the new drives. I then left the old PC, and only took the new drives with me.
This was Linux, ext3 and JBOD (no RAID or anything). I did not have a good filing system, so some data got copied multiple times.
Once I got into the new place, I bought a new PC and installed the hard drives I had. I noticed that some files were damaged. I had some checksumming scripts, and I was recording checksums, and found out that some checksums would not match - and each copy had a different set of damaged files. I ended up cherry-picking files from multiple copies to assemble a good set.
I don't know the exact reason, but I am fairly sure they were transfer errors. The original PC was working fine for a long time, so source data was likely clean. The new PC did not show any more data corruptions, and it was reading the same data every time. So my theory is either transfer errors while copying files, or silent data corruption on disk.
I don't know of any solutions that would have helped here except custom data checksum tools or ZFS (I suppose btrfs might have helped too, but I heard too many horror stories about it).
I actually had this come up the second time: when I moved again, and built another NAS box (desktop motherboard, 5x 4TB drives with ZFS), I started copying files off the old SATA drives (ext3) and saw the data transfer mismatch. It was pretty freaky: "rsync" the file, flush caches, md5sum source and destinations -- and they are different. kernel log was quiet, memtest was not showing any errors, so I got a beefier power supply and replaced all the SATA cables. This helped.
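The "copy, then checksum both sides" check described above can be sketched roughly like this. It is a minimal illustration, not the commenter's actual script: it uses SHA-256 instead of md5sum, the function names are made up, and it does not flush caches, so the re-read of the destination may be served from RAM rather than from the disk.

```python
import hashlib
import shutil


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def copy_and_verify(src, dst):
    """Copy src to dst, then re-read both files and compare digests.

    Caveat: right after a copy the destination is usually still in the
    OS cache, so this can miss errors that only exist on the disk itself.
    The comment above flushed caches before hashing for that reason.
    """
    shutil.copyfile(src, dst)
    return file_digest(src) == file_digest(dst)
```

A False result means the two reads disagreed, but it does not tell you whether the source read, the transfer, or the destination was at fault. In the story above, finding that out took swapping the power supply and cables.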
> ZFS is fine, but it is overkill for most home applications
Yeah, not having silent data corruption is "overkill", sure. /s Why not use ZFS? It takes 15 seconds to install, and its CLI is fairly intuitive. Works fine. Costs $0. Why not, even for "home" applications?
I could see how it could be unsuitable for "enterprise" applications where there are strict performance requirements etc, but for home, I wish I could use ZFS everywhere.
You can add them to RAID0/1 vdevs; you can't add them to RAIDZ vdevs. Which you don't have if you don't run ZFS. You might have RAID5, but then you also have a write hole.
Strongly disagree that the CLI is intuitive. It’s very easy to kick off unintended actions or back oneself into a corner while doing things that seem reasonable. It’s like using Git in that it’s very hard to do well without a good understanding of the fundamentals, which feel almost nothing like managing disks normally. Lots of things require multiple steps that aren’t obviously related and you’d better not screw them up.
It’s way at the far end of the “must RTFM to use safely, and then probably brush up on it again before actually doing anything unless you use it daily” spectrum of intuitiveness.
I like my ZFS mass storage volumes for my home server. I worry I’ll screw them up and/or burn an hour googling and reading the manual every time I have to touch them, though.
I've been burned by BTRFS too many times to trust it.
At work, our servers will get into a state where they just hang for a span from minutes to hours while BTRFS does "something". I'm not one of the admins though, so I don't know the exact details. I just know that this is a vendor-supported configuration and the vendor has been unable to tell us why this happens or offer any solution that makes it not happen. Our answer to this issue has been to rebuild servers with ext4 when things get bad. This has happened on multiple servers hosting different applications - the only commonality seems to be that write-heavy loads get it into this state. Servers that just have their OS on BTRFS but do all of their work on NFS volumes are fine.
At home, I once rebooted my OpenSuse Tumbleweed laptop and ended up with a BTRFS filesystem that couldn't be mounted RW. Fortunately I was able to mount it RO after booting off installation media and copy my data off, but I couldn't get the filesystem back into a state where it could be mounted RW. I ended up reinstalling. I never did figure out the root cause, but I suspect that some BTRFS-related process was running when I rebooted.
On the flip side, ZFS has never let me down in this way, but to be fair I've never subjected it to the same use-cases. Unfortunately, the inability to resize/reshape the filesystem is an issue for me. I believe that it's being worked on, but I don't think that work is production-ready yet.
Last I checked BTRFS RAID5/6 was a dumpster fire and unusable in production. Have they actually open sourced the ability to fix bitrot detection with mdraid? If not, it's kind of irrelevant.
So... once again down votes without response - BTRFS raid still isn't recommended and the file healing isn't compatible with MDRAID I assume and you just don't like the fact I pointed it out? The "I'm downvoting because you pointed out a flaw in my logic" @HN is disappointing.
If you're using RAID-Z on zfs, your comparison isn't fair. Rather than use RAID56 with btrfs, the equivalent would be to get 1 or 2 disk redundancy with raid1 or raid1c3.
RAID-Z is the equivalent of RAID-5. RAID-Z2 is the equivalent of RAID-6. RAID-Z3 would be the equivalent of RAID-7 (or whatever the standard is named for 3-disk parity).
This is strictly speaking to how it deals with data and parity, the implementations are obviously different.
BTRFS raid1 isn't mirroring drives though; it means there are two copies of each extent across the whole set of 2+ drives. BTRFS raid1c3 and raid1c4 keep 3 and 4 copies respectively.
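To put rough numbers on the comparison above, here is a back-of-the-envelope sketch of usable capacity. It assumes equal-size disks and ignores metadata, padding, and reserved space, so real pools will come out somewhat lower. The function names are made up for illustration.

```python
def raidz_usable_tb(disks: int, disk_tb: float, parity: int) -> float:
    """Approximate usable space of one RAID-Z vdev: (disks - parity) * size.

    parity is 1, 2 or 3 for RAID-Z1, RAID-Z2 and RAID-Z3.
    """
    if parity not in (1, 2, 3):
        raise ValueError("parity must be 1, 2 or 3")
    if disks <= parity:
        raise ValueError("need more disks than parity devices")
    return (disks - parity) * disk_tb


def replicated_usable_tb(disks: int, disk_tb: float, copies: int) -> float:
    """Approximate usable space when every extent is stored `copies` times.

    copies=2 models btrfs raid1, copies=3 raid1c3, copies=4 raid1c4,
    under the simplifying assumption that all disks are the same size.
    """
    if disks < copies:
        raise ValueError("need at least as many disks as copies")
    return disks * disk_tb / copies
```

With six 4 TB disks, `raidz_usable_tb(6, 4.0, 2)` gives 16.0 while `replicated_usable_tb(6, 4.0, 3)` gives 8.0. Both layouts survive two disk failures, so the extra copies cost half the usable space in this example.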
Can you comment on btrfs by comparison? Also, what is the status of the license incompatibility/integration of ZFS with distros? Canonical seems to think that shipping it with ubuntu is legal, but other distros seem less sure.
Btrfs has pretty much the same features listed above minus the cache drive handling. (You can still get the cache drive behaviour over bcache device) The raid has different modes, so you'll have to decide if what's available is enough for you.
None of the options are marked experimental. (Specific features are marked unstable on the status page)
"Under heavy development" does not mean anything about stability. The kernel itself is under heavy development. ZoL is under heavy development. The disk format is stable, which is what matters.
SUSE provides commercial support for btrfs and uses it as the default. That's pretty much as non-experimental as it gets.
In the past, when data integrity issues emerged, btrfs devs have stated that it is not ready for production. Has there been an announcement to the contrary? If btrfs is production ready, is this clearly stated somewhere? Solaris has defaulted to ZFS at least since version 11 first released eight years ago.
Update - also it appears Suse enterprise uses btrfs for the root os filesystem but xfs for everything else including by default /home. To me, this seems telling? If it’s so solid why not use it for /home?
It's trivial to use ZFS on NixOS, including on the root partition.
The legal position is that ZFS is open source under a copyleft license but that many people think that it's illegal to bundle it with the Linux kernel because of some (I think unintended) incompatibilities between its license and the Linux kernel's license. Canonical (and some others) disagree, and think that it's legal. It's only the bundling that's at issue - everyone agrees that it's legal to use with Linux once you have both.
The incompatibility was very much intended, Sun needed a way to compete with linux and didn't want to be assimilated into the linux ecosystem, so they released opensolaris under CDDL instead of GPL or MIT. Oracle haven't re-licensed it for their own reasons.
Nope. Simon Phipps, Sun's Chief Open Source Officer at the time, and the boss of Danese Cooper—who is the source of the claims it was intended—has stated it was not:
I think there's also a general suspicion that Sun could have just chosen the GPL if they cared about compatibility. Although, for various reasons, it's probably at least somewhat more complicated than that because of patent protection, etc.
> I think there's also a general suspicion that Sun could have just chosen the GPL if they cared about compatibility.
There were 'technical' reasons why they did not go with GPL, and specifically GPLv2 (GPLv3 was not out yet). IIRC, they did consider waiting for GPLv3, but it was unknown when it would be out, and one thing they desired was a patent grant, which v2 does not have.
Another condition was that they wanted a file-based copyright rather than a work-based copyright (i.e., applies to any individual files of ZFS as opposed to "ZFS" in aggregate).
I had forgotten about some of the reasons they specifically wanted file-based copyright. Sun were clients at the time and I spoke fairly frequently with the open source folks there. But I didn't remember all the details and was certainly not privy to all the internal discussions.
"Sun could have just chosen the GPL if they cared about compatibility."
That's a very loaded statement. I've seen it said quite a lot over the years. But, have you thought about its implications?
The implicit assumption here is the primacy of the GPL over all other open source licences. Why should other companies and organisations treat it as "more special" than any other free/open source licence when it comes down to interoperability?
When it comes down to compatibility, the GPL is one of the last licences you should choose. Because by its very nature it is deliberately and intentionally incompatible with everything other than the most permissive licences. The problem with "viral" licences like the GPL is that "there can only be one" because they are mutually incompatible by nature. Why should the MPL/Apache/CDDL licences make special exemptions to lessen their requirements so that they can be GPL-compatible?
I should have written compatibility with the GPL (or really the Linux kernel which was what was most relevant from the perspective of Solaris). And, of course, Sun could have chosen a fully permissive license but AFAIK nothing like that was seriously considered.
Nit: Apache 2.0 is compatible with GPLv3 (but not v2).
I agree it was intended and I remember Sun talking about that intention at the time. Sun specifically removed the multiple license compatibility language (section 13) from the Mozilla MPL when creating the CDDL:
FWIW, CDDL is pretty much just Mozilla license. The incompatibility is caused by GPL, which, according to FSF, cannot be linked against anything that's not a subset of GPL. You'd get the same incompatibility if ZFS was covered by GPLv3, for example.
Except then you'd lose the patent protection provided by CDDL. And you'd cause licensing problems for literally everyone else but Linux. And you probably wouldn't gain anything in the long run anyway; AdvFS was released under GPL and went nowhere.
Or Linux Kernel people. If you are distributing their work, you can do it because GPL allows it. But for GPL to allow it, you cannot break it or ignore it, otherwise you will lose the distribution rights.
I have a home NAS that has an SSD on it for the OS and four HDDs in RAIDZ. Does anyone know if/how I can use a small part/partition of the SSD for the cache? I don't need an entire SSD's worth of cache, and I'd rather not have to buy an extra one.
Most of the time you don't need the extra SSD cache. If you do use one for the ZIL (writes), it will only help synchronous writes, it should be redundant if possible, and it should be extremely low latency like Intel Optane. L2ARC (reads) on an SSD is not as good as just adding more RAM.
You can add partitions to a pool, either for storage, for SLOG or L2ARC.
Unless you're ok with data loss in the case of a power outage, you'll want to use a mirror for SLOG at the very least. You can do that by making a mirror from two partitions on separate SSDs and then adding that as the SLOG. The partitions do not have to be very large, just enough to handle a minute or so of writes, so 10GB or so is often plenty.
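The "a minute or so of writes" rule of thumb above works out as simple arithmetic. A minimal sketch, where the 150 MB/s write rate is just an assumed example figure and not something from this thread:

```python
def slog_size_gb(write_mb_per_s: float, seconds: float = 60.0) -> float:
    """Size needed to absorb `seconds` of writes at a sustained rate.

    Uses decimal units (1 GB = 1000 MB), matching how drives are sold.
    """
    return write_mb_per_s * seconds / 1000.0


# Assumed example: 150 MB/s of synchronous writes for one minute.
print(slog_size_gb(150))  # 9.0
```

That lands right around the 10GB figure, so a small partition really is enough unless your sync write rate is much higher than this assumed one.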
Also keep in mind that ZFS does a lot of shuffling for the L2ARC. I had <1% L2ARC hits on my pool with a 128GB L2ARC partition, but almost a TB of writes per day to the L2ARC due to ZFS rotating data in and out of it.
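If you want to check your own L2ARC hit rate, here is a rough sketch. It assumes ZFS on Linux, where the ARC counters are exposed as text in /proc/spl/kstat/zfs/arcstats and include `l2_hits` and `l2_misses`. The parser takes the file contents as a string so it can be tried without a ZFS box. The function names are made up.

```python
from typing import Dict


def parse_kstat(text: str) -> Dict[str, int]:
    """Parse 'name type data' rows of a kstat text dump into a dict.

    Header lines are skipped because they do not have exactly three
    fields ending in an integer.
    """
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def l2arc_hit_ratio(stats: Dict[str, int]) -> float:
    """Return l2_hits / (l2_hits + l2_misses), or 0.0 with no lookups."""
    hits = stats.get("l2_hits", 0)
    misses = stats.get("l2_misses", 0)
    total = hits + misses
    return hits / total if total else 0.0
```

On a Linux machine with the module loaded, something like `l2arc_hit_ratio(parse_kstat(open("/proc/spl/kstat/zfs/arcstats").read()))` should give the ratio since boot. A value under 0.01 would match the "<1% hits" experience described above.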
Probably not, since it's a home NAS the reads won't be very repeatable. I wanted to prevent the hard disks from waking up, but it's mostly writes, so it doesn't help there either.
Has anyone noticed really slow deletes on ZFS? I have a home NAS with RAIDZ and deleting a 1 GB file takes about 30 seconds. I asked on IRC repeatedly but the disks just aren't very busy during the delete (or at all) and nobody managed to figure out why this is happening.
It's been that way for years throughout various ZFS versions, and it's driving me crazy.
I have been using Ubuntu + LXD + RaidZ for as long as it has been available. Recently (a year ago?) I first noticed some kind of unclear performance issue with a FLEX container. Heavy file streaming/download would hose the entire server and make it unusable. I had no idea what to do, so I pretty much randomly converted the LXD storage to BTRFS, and everything was fine. In the last few months I ran into the same problem with a bunch of PostgreSQL containers doing lots of writes, and the same solution helped. Not sure if it's a ZFS problem or LXD, but I will not use ZFS in a high-load scenario again.
Qnap will hopefully have a ZFS-based home NAS out soon.
Are there any simple-to-use tools that automatically compare the same files from different sources and tell you if they differ? I have multiple copies of my data, but not knowing which file is corrupted and working through it is a pain in the bottom.
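Not a polished tool, but the comparison being asked about can be sketched in a few lines of Python. This assumes every copy uses the same relative directory layout. The "majority wins" rule is only a heuristic for guessing which version is good, not proof. All names here are made up for illustration.

```python
import hashlib
from collections import Counter
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_copies(roots):
    """Return {relative_path: {root: digest or None}} for disagreements.

    A file is reported when its copies differ, or when it is missing
    (digest None) from at least one of the roots.
    """
    rel_paths = set()
    for root in roots:
        root = Path(root)
        rel_paths.update(p.relative_to(root) for p in root.rglob("*") if p.is_file())
    report = {}
    for rel in sorted(rel_paths):
        digests = {}
        for root in roots:
            candidate = Path(root) / rel
            digests[str(root)] = sha256_of(candidate) if candidate.is_file() else None
        if len(set(digests.values())) > 1:
            report[str(rel)] = digests
    return report


def majority_digest(digests):
    """Return the digest most copies agree on, or None on a tie or no data."""
    counts = Counter(d for d in digests.values() if d is not None).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]
```

With three or more copies, `majority_digest` gives a reasonable guess at the intact version of each reported file. With only two copies you can see that they differ, but not which one is right.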
That's neat, certainly, but I'm struggling to think what to actually use this for? The article mentions a couple cases, of which "taking action if devices fail" seems the more concrete; AFAIK that would make it probably straightforward to replace a drive with a spare? Anybody want to share any other concrete uses?
Mail on certain events (scrub start/finish, warnings, errors, failures), automatically replace failed drives, and all sorts of other things. ZED, for example, can be used to create systemd mounts at boot.
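As a rough sketch of the notification case: my understanding is that ZED runs small executables ("zedlets") and passes event details in `ZEVENT_*` environment variables such as `ZEVENT_SUBCLASS`, `ZEVENT_POOL`, `ZEVENT_VDEV_PATH` and `ZEVENT_VDEV_STATE_STR`. Treat those variable names as assumptions and check them against zed(8) on your system. The formatting function is hypothetical and the delivery step is left as a print.

```python
import os
from typing import Mapping


def format_zevent(env: Mapping[str, str]) -> str:
    """Build a one-line message from ZEVENT_* style variables.

    Missing variables fall back to 'unknown' so the zedlet never fails
    just because an event carries fewer fields than expected.
    """
    subclass = env.get("ZEVENT_SUBCLASS", "unknown")
    pool = env.get("ZEVENT_POOL", "unknown")
    parts = [f"ZFS event '{subclass}' on pool '{pool}'"]
    vdev = env.get("ZEVENT_VDEV_PATH")
    if vdev:
        state = env.get("ZEVENT_VDEV_STATE_STR", "unknown state")
        parts.append(f"vdev {vdev} is {state}")
    return "; ".join(parts)


def main() -> None:
    # Placeholder delivery: swap the print for a chat webhook, a mail
    # command or a ticket-system call of your choice.
    print(format_zevent(os.environ))


if __name__ == "__main__":
    main()
```

Keeping the formatting separate from the delivery makes the message easy to test with a plain dict before wiring it into ZED.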
... thank you, that was the example that made me realize I can/should go use this at work because it will solve a problem we have:) And a good point in general.
I think you are referring to one of two HPE components: Either the iLO, or the iLO combined with a Smart Array controller.
(Dell has equivalent products: For this discussion, iDRAC is the equivalent of iLO, and PERC is the equivalent to Smart Array.)
With those products, the RAID controller (Smart Array or PERC) will be connected to internal and/or external drives, will handle RAID in hardware (ideally with a battery backup write cache), and (through the iLO or iDRAC) generate alerts when a drive fails (or is close to failure).
In the context of ZFS, you don't have that. Your drive controller is either on-motherboard, or a PCIe card like (for example) the (Broadcom) LSI SAS 9300-8e. Those cards do have a RAID option (MegaRAID), but they are often used without.
The rest of the ZFS storage setup is pretty similar to the setup you are used to: Internal drives will have a SAS expander (if needed) on the motherboard, or will use a SAS expander card (for example, an HPE part #870549-B21 SAS expander card). External drives will be in a JBOD that has one or two expanders, and which is connected back to the server using SAS cables. One ZFS difference is that if you have many JBODs, instead of daisy-chaining arrays, you might choose to use a SAS switch (for example, an A54812-SW-01).
With all that I described with ZFS, I haven't mentioned how RAID is handled: ZFS handles RAID in software. RAID Z-1 is equivalent to RAID 5, Z-2 to 6, and Z-3 to 7. ZFS also supports RAID 1 (two-drive mirrors), as well as a RAID 10 equivalent (striping across mirrored pairs of drives).
Since RAID is handled in software, and with the physical equipment I described, it is left to the OS to handle almost all monitoring and alerting. The one exception is that JBODs and rack-mount SAS switches (like the Astek) often have an Ethernet connection for monitoring and (basic) hardware control. But even that can often be handled within the OS, using SCSI Enclosure Services (SES, where the enclosure/switch itself is a device the OS can see and query).
I'd love to switch to ZFS, but the RAM requirements are absurd. I don't have a separate storage server, and I'm not ready to sacrifice 10GB of RAM (1GB/TB of storage if I'm to believe what I find through Google) on my home desktop just for it when the vast majority of my data could probably handle a rotted bit or two.
I did briefly try ZFS on my laptop a year or so ago, and it ate up half of my RAM permanently. Since it was already fairly limited, that wasn't a sacrifice I was willing to make either when I have plenty of backups anyway.
That RAM is needed only when you run deduplication (you have to store the checksums of blocks that you deduplicate somewhere). If you don't, the RAM requirements are similar to other filesystems.
On your home desktop, you don't have to run dedup. You will still get the bitrot protection.
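To see where dedup RAM figures come from, here is a rough worst-case estimate. The roughly 320 bytes per dedup-table entry is a commonly quoted approximation that I'm assuming here, not a number from this thread. The calculation also assumes every block is unique, so treat the result as an order-of-magnitude sketch only.

```python
def dedup_table_ram_gb(pool_tb: float,
                       avg_block_kib: float = 128.0,
                       bytes_per_entry: int = 320) -> float:
    """Worst-case RAM needed to hold the whole dedup table in memory.

    pool_tb          -- amount of stored data in decimal terabytes
    avg_block_kib    -- average block size in KiB (assumed)
    bytes_per_entry  -- assumed in-memory size of one table entry
    """
    blocks = pool_tb * 1e12 / (avg_block_kib * 1024)
    return blocks * bytes_per_entry / 1e9
```

Under these assumptions, 1 TB stored in 64 KiB blocks needs about 4.9 GB for the table. The cost scales with the number of blocks rather than with raw capacity, and none of it applies when dedup is off.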
Why wouldn't I want deduplication though? "You can use ZFS just fine, just turn off one of its most useful features." Really? And I'm being downvoted for it. Thanks, guys. You realize my home desktop is doubling as my storage, right? Which goes back to RAM requirements being an issue.
Chances are that you don't have many users saving the same or slightly modified versions of a file on the same storage. For a single person, it doesn't make much sense.
> "You can use ZFS just fine, just turn off one of its most useful features."
ZFS has many useful features. They come with a price though, because there's no free computation (see also laws of thermodynamics). It is then a matter of deciding, which features you want or need, and are willing to pay the price for.
You obviously are not willing to pay the price for dedup (lvmvdo asks for a similar price, so it is not ZFS-specific), so why are you complaining that you cannot use it? ZFS still has many more useful features.
You also have another option: add RAM to your desktop. It is cheap. Then you will be able to use that one feature.
> I hate the cargoculting on this fucking site.
Sigh. I'm actually a btrfs fan; all my data is on a btrfs volume (at work, we do use ZFS though, so I do have the experience). But that doesn't mean I won't point out something that the other club does well.
> I'd love to switch to ZFS, but the RAM requirements are absurd. I don't have a separate storage server, and I'm not ready to sacrifice 10GB of RAM (1GB/TB of storage if I'm to believe what I find through Google) on my home desktop just for it when the vast majority of my data could probably handle a rotted bit or two.
Please explain where this number comes from. I run ZFS on boxes with as little as 4GB of RAM, which are also doing all sorts of other things in addition to ZFS.
As with all filesystems on Linux, more RAM means more cache, and if you have free RAM you will benefit from a dynamically resized filesystem cache. That RAM is however not required and the cache can be evicted under memory pressure.
> I did briefly try ZFS on my laptop a year or so ago, and it ate up half of my RAM permanently.
ZFS will use all available RAM for caching, unless you tell it otherwise (zfs_arc_max[1] or similar[2]), but it should release it when the system requires it[3].
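As a small illustration of "telling it otherwise": on Linux, `zfs_arc_max` is a module parameter given in bytes, typically set with an `options` line in a modprobe config file. The file location varies by distribution, /etc/modprobe.d/zfs.conf being a common choice. The helper below only builds that line. The function name and the 4 GiB limit are made-up examples.

```python
def arc_max_modprobe_line(limit_gib: float) -> str:
    """Return a modprobe options line capping the ARC at limit_gib GiB.

    zfs_arc_max takes a byte count, so the GiB figure is converted.
    """
    if limit_gib <= 0:
        raise ValueError("limit must be positive")
    limit_bytes = int(limit_gib * 1024 ** 3)
    return f"options zfs zfs_arc_max={limit_bytes}"


# Example: cap the ARC at 4 GiB.
print(arc_max_modprobe_line(4))  # options zfs zfs_arc_max=4294967296
```

Other platforms use different knobs (the "or similar" above), so take this as the Linux flavour only.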