I've been a ZFS user for maybe 6 years. I had data loss this year.
It was quite interesting: writes would fail on multiple disks at the same time, which is why it caused data loss. A normal disk failure wouldn't look like that; failures would be spread out across drives, and the redundancy would absorb them.
It turned out to be a bad power supply. I could predictably reproduce the corrupted writes with the old PSU; after I replaced it, the failures stopped.
I wouldn't have guessed to suspect the PSU. It was a frustrating experience, but in the end ZFS did help me detect it. On ext4 or ffs I wouldn't even have been aware it was happening, let alone been able to confirm a fix.
Jesus, the same thing happened to me, so frustrating! Everything working OK, then a slow increase in write failures and power resets, and strange clicking, until eventually all the disks dropped. Scary every time it happened.
I swapped every server part before the PSU. Updated Linux, downgraded it, tried kernel options... At some point I thought btrfs was the problem, so I created an mdadm ext4 raid, but got the same problem there.
It was a btrfs raid 1: lots of fs errors, and files missing until restart since some disks were down. But I didn't lose any data (take that, ZFS!) besides the file being transferred as the disks/array went down anyway.
A bad PSU is not at all comparable to a hard reset. A bad PSU can cause individual writes to fail without the entire computer shutting down, and these failures can and will happen simultaneously across multiple disks since they are all connected to the same faulty PSU. If a filesystem wanted to guard against this kind of failure, I suppose it could theoretically stagger the individual disk write operations for a given RAID stripe so they don't happen simultaneously. (Implementation is left as an exercise to the reader.)
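Since the parent left it as an exercise: here's a toy C sketch of the "staggered" idea, purely illustrative (the function and its arguments are made up, and no real RAID layer works this way). Flush each replica before touching the next, so one transient power glitch can corrupt at most the copy in flight rather than every copy of the stripe:

    /* Hypothetical: write one stripe block to each replica in turn,
     * with an fsync barrier between them, so replicas are never
     * in flight simultaneously. */
    #include <unistd.h>
    #include <sys/types.h>

    int write_stripe_staggered(int *fds, int ndisks,
                               const void *block, size_t len, off_t off) {
        for (int i = 0; i < ndisks; i++) {
            if (pwrite(fds[i], block, len, off) != (ssize_t)len)
                return -1;          /* this copy failed; stop here   */
            if (fsync(fds[i]) != 0) /* barrier: copy i is durable    */
                return -1;          /* before copy i+1 is attempted  */
        }
        return 0;                   /* every replica written in turn */
    }

Of course you'd pay a full cache flush per disk per stripe, which is exactly why nobody does this.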
I don't mean to be critical of you, restic, or your backup strategies, but it doesn't seem like you know ZFS. ZFS is the "I really need to know" filesystem. I think it's pretty great and I use it where I can. But if it's not for you, it's not for you.
And FYI, a power supply failure does not manifest the same way as a power outage.
As mentioned, it usually did not manifest as a power outage. Random components would fail, presumably because the PSU was able to keep the system nominally "running" without delivering the right power to individual components.
I also experienced some random reboots and symptoms that looked like bad memory; I suspected the RAM at one point. But swapping the PSU did the trick.
Yep, designing around the write hole is hard, especially with non-enterprise equipment. Lots of firmware does unsafe things with cached data and will tell you data has hit the disk when it hasn't. The filesystem can't really do anything about this either, other than tell you after the fact that the data that should be there isn't (which ZFS is very good at).
You can disable write caches for safety, but note that this is very hard on performance.
I was a little imprecise with my words. It would lose recent writes seemingly at random, and reading those writes back would fail. It seemed that caches could mask this for a while.
POSIX systems are pretty lax about this sort of failure. write(2) and close(2) can succeed as soon as the data reaches the cache. If the actual write fails later, there is typically no way for your process to find out unless it calls fsync(2) and checks the result.
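To make that concrete, a minimal C demo (the path is just an example): write() and close() can both return success with the data still sitting in the page cache; fsync() is the call that actually forces it toward the disk, and is where a deferred failure surfaces as EIO.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/tank/testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Succeeds as soon as the bytes are in the page cache. */
        if (write(fd, "hello\n", 6) != 6)
            perror("write");

        /* Only here must the kernel face the disk; a failed flush
         * (flaky PSU, dying drive) comes back as EIO. */
        if (fsync(fd) != 0)
            perror("fsync");

        /* Skip the fsync and close() will usually still return 0,
         * with the error lost and the process none the wiser. */
        if (close(fd) != 0)
            perror("close");
        return 0;
    }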
What did that look like in terms of error messages, etc.? I'm guessing ZFS would try to write the checksum, which wouldn't work, and it would then throw an error? I assume it never impacted data that already resided on disk?
zpool status showed an identical number of checksum failures across drives, and status -v would list certain files as corrupt. Reads on those files would return EIO. It was always recently written files.
A large file copy would predictably trigger it. Other times it was random.