> JuiceFS, written in Go, can manage tens of billions of files in a single namespace.
At that scale, I care about integrity. Can someone working in this space please have a real integrity story as part of the offering? Give each object (object, version pair, perhaps) a cryptographic hash of the contents, and make that hash be part of the inventory. Allow the entire bucket to opt in to mandatory hashing. Let me ask the system to do a scrub in which it verifies those cryptographic hashes.
If this blows up metadata to 164 bytes per object, so be it. But the hash can probably get away with being stored with data, not metadata, as long as there is a mechanism to inventory those hashes. Keeping them in memory doesn’t seem necessary.
Even S3 has only desultory support for this. A lot of competitors have nothing of the sort.
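To make the request concrete, here is a minimal sketch in Go of what such a scrub could look like, assuming a hypothetical `ObjectStore` interface and an inventory of (key, SHA-256) pairs recorded at write time. It illustrates the idea, not any particular vendor's API:

```go
package integrity

import (
	"crypto/sha256"
	"encoding/hex"
	"io"
)

// ObjectStore is a hypothetical minimal read interface; a real system would
// use the S3 API or the filesystem's own client.
type ObjectStore interface {
	Get(key string) (io.ReadCloser, error)
}

// InventoryEntry pairs an object (or object+version) with the hash recorded
// at write time.
type InventoryEntry struct {
	Key            string
	ExpectedSHA256 string // hex-encoded
}

// Scrub re-reads every object in the inventory, recomputes its SHA-256, and
// returns the keys whose contents no longer match the recorded hash.
func Scrub(store ObjectStore, inventory []InventoryEntry) []string {
	var bad []string
	for _, e := range inventory {
		r, err := store.Get(e.Key)
		if err != nil {
			bad = append(bad, e.Key)
			continue
		}
		h := sha256.New()
		_, copyErr := io.Copy(h, r)
		r.Close()
		if copyErr != nil || hex.EncodeToString(h.Sum(nil)) != e.ExpectedSHA256 {
			bad = append(bad, e.Key)
		}
	}
	return bad
}
```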
JuiceFS relies on the object store to provide integrity for data. Beyond that, JuiceFS stores the checksum of each object as a tag in S3 and verifies it when downloading the object.
Inside the metadata service, it uses a Merkle tree (hash of hashes) to verify the integrity of the whole namespace (including the IDs of data blocks) across Raft replicas. Once we store the hash (4 bytes) of each object in the metadata, that should cover the integrity of the whole namespace.
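A simplified illustration of the "hash of hashes" idea in Go: fold every object's hash into one deterministic digest so two replicas can compare a single value. A real implementation would keep a proper tree so the root can be updated incrementally; this flat version is only a sketch, not JuiceFS's actual structure:

```go
package integrity

import (
	"crypto/sha256"
	"sort"
)

// NamespaceRoot folds per-object hashes into one digest ("hash of hashes").
// Replicas holding identical namespaces compute identical roots, so a single
// comparison detects divergence.
func NamespaceRoot(objectHashes map[string][]byte) [32]byte {
	keys := make([]string, 0, len(objectHashes))
	for k := range objectHashes {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order across replicas

	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write(objectHashes[k])
	}
	var root [32]byte
	copy(root[:], h.Sum(nil))
	return root
}
```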
Surely you mean that the FS should calculate the hash on file creation/update, not take some arbitrary value from the user. But I agree that an FS which maintains file-content hashes should allow clients to query them.
No, the FS should verify the hash on creation/update. Otherwise corruption during creation/update would just cause the hash to match the corrupted data.
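A sketch of that write-path check in Go: the server recomputes the hash over the bytes it actually received and compares it with the hash the client computed before sending, so corruption during the transfer is rejected rather than sealed in with a "matching" digest. The `store` callback is hypothetical:

```go
package integrity

import (
	"crypto/sha256"
	"errors"
)

var ErrHashMismatch = errors.New("content hash does not match supplied hash")

// Put recomputes the hash over the received bytes and compares it with the
// client-supplied hash; a mismatch rejects the write instead of storing
// corrupted data under a hash that "matches" it.
func Put(data []byte, clientHash [32]byte, store func([]byte, [32]byte) error) error {
	serverHash := sha256.Sum256(data)
	if serverHash != clientHash {
		return ErrHashMismatch
	}
	return store(data, serverHash)
}
```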
The Quobyte DCFS does end-to-end CRC32 for each 4k block of data. All metadata and communication is also CRC-protected, although on different frame boundaries.
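For illustration, per-block CRCs at that 4 KiB granularity are easy to express in Go; the polynomial Quobyte actually uses isn't stated, so Castagnoli is assumed here:

```go
package integrity

import "hash/crc32"

const blockSize = 4096 // 4 KiB, matching the granularity described above

// BlockCRCs computes one CRC32 per 4 KiB block. Per-block checksums let a
// reader verify exactly the range it touches instead of re-hashing the file.
func BlockCRCs(data []byte) []uint32 {
	table := crc32.MakeTable(crc32.Castagnoli)
	var sums []uint32
	for off := 0; off < len(data); off += blockSize {
		end := off + blockSize
		if end > len(data) {
			end = len(data)
		}
		sums = append(sums, crc32.Checksum(data[off:end], table))
	}
	return sums
}
```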
All distributed networks are vulnerable to Sybil attacks (unless you can ensure provenance somehow, out-of-band), but unless you can break the hash function, all that gets you with BitTorrent is denial of service (and traffic interception, I suppose, but that should already be part of your threat model).
There is absolutely a risk of downloading malicious data, but the protocol says to reject it for failing integrity checks. That doesn't mean every client implementation actually will.
Are you saying mainstream torrent clients don't check the hash? As far as I know, not only do they, but they ban peers who have sent them bad data more than once. So you could DoS them for a bit with lots of peers sending bad data, but you need a lot of IPs to do that, because you'll quickly get all of them banned. And unless you are doing this through residential proxies, people will learn your ranges and block you by default.
Maybe there's a DoS you could do with uTP by spoofing someone else's IP and getting them to ban a real peer, but you'd presumably have to get in between them requesting blocks and reply with bad ones, which realistically means you are a MitM, so you could DoS them more directly by just dropping their traffic.
Or if you mean more generally that a malicious packet could reach a client and exploit a memory bug or something, that applies to literally anything on a network.
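For reference, piece verification in the original (v1) BitTorrent protocol is just a SHA-1 comparison against the digest from the .torrent's `pieces` field; the ban-after-repeated-failures behaviour is client policy, not protocol. A rough Go sketch:

```go
package torrent

import "crypto/sha1"

// VerifyPiece checks a downloaded piece against the 20-byte SHA-1 digest
// taken from the .torrent "pieces" field (one digest per piece, v1 protocol).
func VerifyPiece(piece []byte, expected [20]byte) bool {
	return sha1.Sum(piece) == expected
}

// RecordResult is a hypothetical helper illustrating the client-side policy
// described above: after more than one bad piece from the same peer, ban it.
func RecordResult(badPieces map[string]int, peerAddr string, ok bool) (ban bool) {
	if ok {
		return false
	}
	badPieces[peerAddr]++
	return badPieces[peerAddr] > 1
}
```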
Suppose you have a torrent client that saves chunks to the filesystem before performing integrity checks. Suppose also that you have an antivirus program that scans every newly-created file for malware… and someone sent you 42.zip. Sure, the torrent client will reject it later, but the damage has already been done.
This specific scenario is unlikely (most antivirus programs can cope with zip bombs these days), but computers are complex. Other edge cases might exist. Torrenting is safer than downloading something from your average modern website, but in practice it's nowhere near as safe as the theoretical limit.
RocksDB supports checksumming at multiple levels (keys, values, whole files) because Meta also realized the importance of integrity. It also supports verifying them in bulk.
Presumably filesystems built on top of RocksDB also support this.