btrfs only uses CRC32c which is weakish. ZFS is great but not exactly portable. ...

rleigh · on March 12, 2017

Not portable? In comparison to which filesytems? I can easily export a ZFS pool on Linux, physically transport the discs to a FreeBSD or IllumOS server and import the pool. Or to MacOS X, which is also supported (though I haven't tried it unlike the others where it worked perfectly).

That's already far ahead of ext, ufs, xfs, jfs, btrfs etc. The only ones offhand which are possibly more portable are fat, hfs, udf and ntfs, and you're not exactly going to want to use them for any serious purpose on a Unix system. ZFS is the most portable and featureful Unix filesystem around at present IMHO.

ak217 · on March 12, 2017

crc32c is not weakish, and was chosen for a reason: crc32c has widespread hardware acceleration support that remains faster than any hash, and crc32c can be computed in parallel (unlike a hash, it has no hidden state, so you can sum independently computed block checksums to get the overall blob checksum). Bitrot detection doesn't need a cryptographic hash. You may want a hash for other purposes (like if you somehow trust your metadata volume more than your data volume), but that's a separate and slower use case.

bascule · on March 13, 2017

Bitrot detection doesn't need a cryptographic hash.

Not only is a cryptographic hash unnecessary, under certain circumstances it will actually do a worse job.

Cryptographic hashes operate under a different set of constraints than error detecting code. With an error detecting code, it's desirable to guarantee a different checksum in the event of a bitflip.

With a cryptographic random oracle, this is not the case: we want all outcomes to have equal probability, even potentially producing the same digest in the event of a bitflip. As an example of a system which failed in this way: Enigma was specifically designed so the ciphertext of a given letter was always different from its plaintext. This left a statistical signature on the ciphertext which in turn was exploited by cryptanalysts to recover plaintexts. (Note: a comparison to block ciphers is apt as most hash functions are, at their core, similar to block ciphers)

Though the digests of cryptographic hash functions are so large it's statistically improbable for a single bitflip to result in the same digest as the original, it is not a guarantee the same way it is with the CRC family.

Cryptographic hash functions are not designed to be error detecting codes. They are designed to be random oracles. Outside a security context, using a CRC family function will not only be faster, but will actually provide guarantees cryptographic hash functions can't.

rwiggins · on March 13, 2017

In practice, the chance of a digest collision between two messages that differ in a single bit is exceedingly small for any secure cryptographic hash function. It's so small that it's practically not worth considering. Cryptographers are incredibly careful in building and ensuring proper diffusion in cryptographic hash functions.

mrb · on March 13, 2017

"Not only is a cryptographic hash unnecessary, under certain circumstances it will actually do a worse job."

A cryptographic hash is unnecessary, but as you point out in the next to last paragraph, it is statistically improbable that it will do a worse job. Because collisions are statistically improbable.

jepler · on March 13, 2017

Specifically, I wrote a program to search for single-bit-flip collisions in sha1 truncated to 16 bits. The program didn't need to search for long before finding two messages with the same 16-truncated sha1 with a single bit flip at bit 1 of byte 171 of a 256-byte message. 376 1 171 be44b935e7ecfc81d1fe2cddcd7c1d7e04338fd83fa994cd6a877732ca5d8db83346bd9ccbfc4c8770682bd307c782421a512a80a106be87825d5c13f3156e23ffaacdfc1651f88f775507d1175542def2ccf084271ebd4ead175c8a448be0d50b26f59d970301ebc5a7f672d3ea870d9a1e02f8f5fd01c38297b8aa264a3f07fec32f9a91aa359784d2d9ce0e4649465c705f50feed23dcbefc0a726cfadb5e47ee577ed45203f90d6e2e650d42ddb10cba49d06bd4cdad4e6eaf5cfcb062de2539fc847ce0c104f2e667369080eaaab5934ae5f7f1ba733c3d1bfbda87bfa72ef12475b9ff0edc4deb99e6a5cf387c7f6b9c71ea62b4db4bb67c92d36460dd be44b935e7ecfc81d1fe2cddcd7c1d7e04338fd83fa994cd6a877732ca5d8db83346bd9ccbfc4c8770682bd307c782421a512a80a106be87825d5c13f3156e23ffaacdfc1651f88f775507d1175542def2ccf084271ebd4ead175c8a448be0d50b26f59d970301ebc5a7f672d3ea870d9a1e02f8f5fd01c38297b8aa264a3f07fec32f9a91aa359784d2d9ce0e4649465c705f50feed23dcbefc0a726cfadb5e47ee577ed45203f90d6e2e640d42ddb10cba49d06bd4cdad4e6eaf5cfcb062de2539fc847ce0c104f2e667369080eaaab5934ae5f7f1ba733c3d1bfbda87bfa72ef12475b9ff0edc4deb99e6a5cf387c7f6b9c71ea62b4db4bb67c92d36460dd

https://gist.github.com/jepler/96d1e779dc95b8941b208887e10a8...

On the other hand, any CRC will detect all such errors; a well-chosen one such as CRC32C will detect all messages with up to 5 bits flipped at this message size.

This is quite appropriate for the error model in data transmission, of uncorrelated bit errors. https://users.ece.cmu.edu/~koopman/networks/dsn02/dsn02_koop... is a pretty good paper, though there are probably better ones for readers without an existing background in how CRC works.

mrb · on March 13, 2017

You are not testing a crypto hash. "Crypto hash" means it is cryptographically strong, not truncated to 16 bits. For example ZFS with checksum=sha256 will use the full 256-bit hash for detecting data corruption.

jepler · on March 14, 2017

Yup, you're right. if you use a full size cryptographic hash then the number of undetected errors can be treated as 0 regardless of hamming distance. On the other hand, it has 8x the storage overhead of a 32-bit CRC.

jepler · on March 13, 2017

Less so if you assume the cryptographic hash will be truncated to 32bits so that its size matches crc32c. Furthermore, some of those collisions will be on message pairs with small hamming distance, probably including messages with a single bit flipped which CRC will always detect.

dom0 · on March 12, 2017

CRC32c -> I saw this many times fail to detect corruption on message lengths anywhere between a couple kB and a few MB. btrfs blocks are 16 kB iirc, so in range. The longer hashes of ZFS, Borg and so on mean that if it's corrupted I _definitely_ know. Not so confident with CRC32 from experience.

ak217 · on March 13, 2017

I'm curious about the setting in which you saw these failures, could you elaborate?

Unlike a plain checksum, CRC-32C is hardened against bias, which means its distribution is not far from that of an ideal checksum. This means if your bitrot is random and you're using 16KB blocks, you will need to see on the order of ((2 * * 32) * 16KB)=64TB of corrupted data to get a random failure. Modern hard drives corrupt data at a rate of once every few terabytes. TCP without SSL (because of its highly imperfect checksum) corrupts data at a rate of once every few gigabytes. Assuming an extremely bad scenario of a corrupt packet once every 1 GB, in theory you'd need to read more than a zettabyte of data to get a random CRC-32C failure. I'm not doubting that real world performance could be much worse, but I'd like to understand how.

vardump · on March 13, 2017

> This means if your bitrot is random and you're using 16KB blocks, you will need to see on the order of ((2 * * 32) * 16KB)=64TB of corrupted data to get a random failure.

No, that's what you need to generate one guaranteed failure, when enumerating different random corruption possibilities. Simply because 32-bit number can at most represent 2^32 different states.

In practice, you'd have 50% probability to have a collision for every 32 TB... assuming perfect distribution.

By the way, 32 TB takes just 4-15 hours to read from a modern SSD. Terabyte is just not so much data nowadays.

leni536 · on March 13, 2017

Just a nit: you don't get a guaranteed failure at 64TB, you get a failure with approx 1-1/e ~= 63% probability. At 32TB you get a failure with approx 1-1/sqrt(e) ~= 39% probability.

I do agree that tens of TBs are not too much data, but mind that this probability means that you need to feed your checksum 64TB worth of 16KB blocks, every one of them being corrupt, to let at least one of them go trough unnoticed with 63% probability. So you don't only need to calculate with the throughput of your SSD, but the throughput multiplied with the corruption rate.

marcosdumay · on March 13, 2017

For disk storage CRC-32C is still non-broken. You can't say the same about on-board communication protocols or even some LANs.

When people started using CRC-32 it was because with the technology of the time it was virtually impossible to see collisions. Nowadays we are discussing if it's reasonable to expect a data volume that gives you 40% or 60% of collision chance.

CRC32 end is way overdue. We should standardize on a CRC64 algorithm soon, or will have our hands forced and probably stick with a bad choice.

speleo_engr · on March 13, 2017

Posts like this are what keep me coming back to HN.

cmurf · on March 13, 2017

Not exactly, blocks are 4KiB so the vast majority of the CRC32C's apply to 4KiB block size; the metadata uses 16KiB nodes and those have their own checksum also. From what I see in the on disk format docs, it's a 20 byte checksum.

johnramsden · on March 13, 2017

ZFS is probably one of your better options if you want portability. It works on most Unix like operating systems, and due to its zfs send/recv capability you can send your data all over the place - and can trust that it will actually end up the same on the other end. If that's not portability I don't know what is.

If what you're saying is it's not portable to Windows then sure, but compared to most filesystems it's extremely portable.

X86BSD · on March 12, 2017

ZFS not portable? Have you ever used it?

"ZFS export" on the host system, remove drives, insert drives in new server, "ZFS import" on the new server, It really is that simple.

dpedu · on March 12, 2017

I think he means portability of the software. What I want to 'ZFS import' on a Windows or macOS machine?

X86BSD · on March 12, 2017

Oh, well it's ported from Solaris to illumos and FreeBSD and to a lesser extent OS X and Linux. So I'm still confused about the portability claim.

tux1968 · on March 12, 2017

The license incompatibility has kept it out of the mainline kernel for Linux, so it's not really a viable option there in many situations. Linux is definitely lacking in this department.

rleigh · on March 12, 2017

That's a minor concern and unrelated to the portability claim. That comes down to choice, and is not a technical consideration. I'm using it with Ubuntu 16.04 LTS and 16.10 where it works out of the box. It's most certainly portable to and from Linux and other systems; I've done it personally, and it works a treat.

tux1968 · on March 12, 2017

Minor for some, major for others. And it is technical too, because there are maintenance ramifications for it not being in the mainline kernel, for example how quickly a security patch can be applied.

And for others a constriction on which distribution they can move to etc. It just reduces the number of situations where it can be used, even if you find yourself in one where it can.

justincormack · on March 12, 2017

ZFS on OSX has been revived I believe.

drvdevd · on March 12, 2017

I wonder if the recent Linux syscall emulation on Windows would somehow make it possible or easier to port ZFS on Linux to Windows.

I know you have the SPL anyway, so maybe with the addition of the Linux POSIX-ish layer in there this could be the case...

chungy · on March 12, 2017

Short answer: No it wouldn't make it easier to port.

Longer answer: The Linux subsystem in Windows 10 only deals with userspace. It doesn't support kernel modules nor changes anything about making Windows drivers. Porting ZFS to Windows is certainly possible, but it will take quite a lot of effort, and the Linux subsystem is irrelevant in that situation.

drvdevd · on March 12, 2017

Yeah, I figured as much, but was hoping there might be something about the Linux subsystem that would be helpful in porting drivers around, beyond userspace.

Obviously, I've yet to actually use it myself.

gjjrfcbugxbhf · on March 12, 2017

Would a FUSE implementation of ZFS not be possible? (Just wondering I've no idea what's possible here)

Dylan16807 · on March 12, 2017

A FUSE implementation of ZFS exists and works well, and adding FUSE support to the Windows 10 Linux subsystem appears to be reasonably high up on the priority list.

That doesn't get you access from Windows programs, but there are some other ways to do FUSE or FUSE-like things on Windows..

problems · on March 13, 2017

It may be fairly straightforward to port the ZFS FUSE to Windows if you use things like Dokan or WinFsp which have the FUSE interface supported fairly well - these would give full access via standard Windows tools.

rincebrain · on March 12, 2017

Last I knew, the zfs-fuse codebase hadn't been updated since before feature flags were added to any of the OpenZFS targets, so it's not a particularly well-supported solution...

e12e · on March 12, 2017

Fwiw I've successfully shared my luks-encrypted usb3 zfs-formatted disk to Windows pro on my Surface 4 via a hyper-v vm running Ubuntu and samba. It won't work for all external drives - you need to be able to set the drive as "offline" in device manager under Windows in order to pass it through to the hyper-v vm (and sadly this doesn't appear possible with the sdcard - I had hoped to install Ubuntu on the sdcard and have the option to boot from the sdcard and also boot into the same filsystem under hyper-v).