
There is no fsck for ZFS, because the ZFS developers at Sun/Oracle believe that ZFS is robust enough to never corrupt (untrue), and that if it does, you should have a backup.

I got screwed by ZFS, and I was able to recover the filesystem by learning its on-disk structure and fixing it directly on-disk. Something that a decent fsck tool could do. But no, ZFS never breaks. Go figure.




fsck doesn't make sense for ZFS. If you want to check for fs integrity, you can scrub the pool. If you want to get the fs back to a state before there was corruption, you use the transaction history. If you want to import degraded arrays, there are commands for that. There's no magic in ZFS. What exactly is missing that you want to see?
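For example (pool name 'tank' is just a placeholder):

  zpool scrub tank        # walk every block and verify checksums, repairing from redundancy where possible
  zpool status -v tank    # show scrub progress and list any files with unrecoverable errors
  zpool import -F tank    # rewind import: discard the last few transactions to reach an importable state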


> If you want to get the fs back to a state before there was corruption, you use the transaction history.

How? ZFS refuses to mount/import a corrupt filesystem.

In my case, the latest superblock (or some internal bookkeeping structures that the superblock points to) was corrupted in a way that made ZFS completely give up. So what I ended up doing was manually invalidating the latest superblocks until ZFS was able to mount the filesystem. I may have lost the changes written in the few minutes before the corruption, but that's still way better than losing everything.

Before I decided to poke around the raw disk with dd (to invalidate the superblocks by overwriting them with zeros), I googled around and I wasn't the only one with that problem. One other guy asked on the ZFS mailing list and the response was along the lines of 'Your filesystem is FUBAR, restore from backup'.
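(For reference, the vdev labels and their uberblocks can be inspected non-destructively with zdb before touching anything with dd; the device path below is only a placeholder:)

  zdb -l /dev/sdX     # dump the four vdev labels on that device
  zdb -lu /dev/sdX    # also print the uberblock array with transaction group numbers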

You may argue that ZFS itself should do what I did (drop a few minutes of transaction history and roll back to the latest non-corrupt state) upon mounting. Fair enough. I don't really care whether that functionality is built into ZFS or an external fsck binary. The fact is that ZFS wasn't able to recover from the situation, one that I would argue is trivial to recover from if you know the internal ZFS on-disk structure.


"How? ZFS refuses to mount/import a corrupt filesystem."

zpool clear -F $POOLNAME

http://www.c0t0d0s0.org/archives/6067-PSARC-2009479-zpool-re...


So, your argument then is that this is something an fsck would normally do?

You may have found a corner case in the fs and perhaps this sort of thing should be added to the import command, but I'm not sure simply having an "fsck" fixes this. I just think the import command appears to have a bug/needs a feature.


How's zdb different from fsck? Nobody cares whether the tool is called fsck, scrub or zdb. You just need to have a tool which can recover from corruption, and for a very long time the ZFS developers didn't think end-users needed it. Maybe zdb was there all the time, but it wasn't advertised or documented. People were told their filesystem was FUBAR when it wasn't.


"How's zdb different from fsck?"

That's answered very well in the article connected to this other currently active HN discussion:

https://news.ycombinator.com/item?id=5460988

In short, fsck simply checks to see that the metadata makes sense, and that all inodes belong to files, and that all files belong to directories, and if it finds any that don't, it attaches them to a file with a number for a name in lost+found.

It's pretty crude compared to a filesystem debugger.

If you want to compare apples-to-apples, you'd be better off asking how zdb compares to debugfs (for ext2/3/4) as both are filesystem debuggers.
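A rough, illustrative side-by-side (pool, dataset and device names are placeholders):

  zdb -C tank                        # dump the cached pool configuration
  zdb -dddd tank/home                # walk a dataset's objects and their metadata in detail
  debugfs -R 'stat <2>' /dev/sda1    # inspect the root inode of an ext2/3/4 filesystem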

You could also ask "How's zfs scrub different from fsck?" and the answer to that would be: zfs scrub checks every bit of data and metadata against saved checksums to ensure integrity of everything on-disk. In comparison, fsck cannot detect data corruption at all, and can only detect metadata corruption when an inode points to an illegal disk region (for example).

Even that comparison shows fsck is crude when compared to scrub.

The tool to recover from corruption is a rollback: zpool clear [-nF] <pool> [device]


My employer has thousands of large machines with ZFS on them. We've seen corruption happen once, five years ago.

Maybe we've just been really lucky.


It's more likely you are underestimating your corruption because you are not monitoring it thoroughly. If you have thousands of machines, running for five years, you're gonna see corruption occasionally. No FS can truly protect you (but a well designed FS will reduce the probability that a corruption will become a user-visible event).
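Even a trivial check run across the fleet would surface it (just a sketch):

  zpool status -x    # prints "all pools are healthy", or the details for any pool with errors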


Of course there are problems with the disks -- corrupt sectors, phantom writes, misdirected reads, etc. The point is that ZFS handles those problems for us.

I won't discuss the nature of the business, but it's unlikely that actual corruption that isn't automatically repaired would go undetected for any amount of time.


The idea that you can have corruption and nothing to fix it with already sounds scary enough to me.

Especially for those of us that don't have thousands of machines and can therefore be badly screwed by one issue.


Given the way ZFS works, you'll have to elaborate on what an fsck would do that ZFS doesn't automatically do already.

What I find scary is silently corrupt data, something which is a problem for most other filesystems. I've seen ZFS catch and fix that error orders of magnitude more often than I've seen ZFS flake out. If we're talking risk analysis, I feel you're worried about a mouse in the corner, while a starved tiger is hungrily licking its chops while staring at you.


Just look at the recent KDE git disaster. A lot of things went wrong there, but fundamentally the issue was ext4 silently returning bad data.

The thing about fsck-like recovery tools is you need to have a failure mode in mind when you write them. ZFS can fix most of those types of errors on the fly thanks to its checksums and on-disk redundancy, or at least tell you that something is going wrong and which files are affected.
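(That "which files are affected" report is exactly what zpool status gives you; the pool name is a placeholder:)

  zpool status -v tank    # lists any files affected by permanent, unrepairable errors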


I think it's unfortunate that Sun called it 'scrub' instead of 'fsck', because the two are largely equivalent in functionality (to the extent supported by the filesystem). Both fix filesystem corruption if they can; it's just that scrub can be run while the filesystem is mounted, whereas the traditional fsck must be run while the filesystem is offline.

However, scrub does not make ZFS perfect. There are still ways the filesystem can become corrupted without scrub noticing. Or corrupted in a way so that ZFS fails to recover from, even though recovering would be dead simple.

The attitude of the ZFS developers only works in the enterprise market: Your data is safe (checksummed, scrubbed, replicated using RAID-Z), but if a bit flips in the superblock just restore from your backup, because we won't provide tools to recover from that.


While I concur with most of your points, I must point out that there are four uberblocks, not one; a flipped bit will not impair the pool.

Let's say that bits flip in all four uberblocks, though; then ZFS will use the previous valid commit, which also has four uberblocks (ZFS is a CoW FS). And so on backwards for quite a few transactions. All these uberblocks are spread out at the beginning and end of each disk.

ZFS has a policy that the more important the data (and the top of the tree is most important), the more copies there are, although a user can also define how many duplicates there should be at the leaves of the tree.
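(The per-dataset knob for those leaf-level duplicates is the copies property; the dataset name is just an example:)

  zfs set copies=2 tank/important    # store two copies of every block in this dataset
  zfs get copies tank/important      # verify the setting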

Basically, you'd need a very trashed disk to render an entire pool dead. You're not going to recover from that, regardless of filesystem.


Well, ZFS clearly didn't use the extra copies, nor did it try to use the previous uberblocks. Otherwise it would've been able to mount the filesystem, don't you think? And I disagree that the disk was very trashed, if all it took was invalidating a few uberblocks to get the filesystem mounted again.


Did you use zpool import -F (are you sure)? I'm surprised that'd be necessary though, given my experiences. How long ago was this?


As others have stated, if the current state has corruption, you roll it back. As I posted in response to another post above, this is as easy as "zpool clear -F $POOLNAME".

The fact that the vast majority of other filesystems have no way to detect silent corruption of data (only metadata inconsistencies) is far more frightening to me.

Here is a nice article written by someone who discovered just how unreliable disks are, after switching to ZFS (because other filesystems couldn't detect the corruption). Quite an eye-opener. http://www.oracle.com/technetwork/systems/opensolaris/data-r...

If you use Chrome and are getting the same error as I am, here's the Google cache link (Firefox will load it): http://webcache.googleusercontent.com/search?q=cache:caEwhGD...

The article references sources of studies of hard disk corruption, if you'd want something with even more detail and statistics:

http://research.cs.wisc.edu/wind/Publications/pointer-dsn08....

http://static.usenix.org/events/fast08/tech/full_papers/bair...


ummm.. last I checked, "zpool scrub tank" used to do the trick. Has that changed or become unavailable for Linux?


Nothing inspires confidence like a filesystem with a 0.61 version number.


zpool import -fFX, but sssh, don't tell anyone :)



