How does it compare to other players in the area, e.g. Ceph, Gluster, or Seaweed? (I'm no expert myself; I've only used those as a consumer of already set-up systems.)
Bit weird comparison. Like sure CephFS doesn't support S3-like access... because the object store is a separate service that also runs on top of Ceph/RADOS store
It is weird, but it's also a valid use case. I can imagine someone wanting to pull files from a FS that was populated as a regular POSIX filesystem through an S3 api. I'm not sure if you can access the CephFS files from the underlying Ceph store easily.
It's doable to run a MinIO gateway on top of a CephFS mount point, but that will have performance issues, especially for multipart upload and copy. That's why we put MinIO and the JuiceFS client together and use some internal APIs to do zero-copy uploads.
It seems to be a specific part of SeaweedFS, i.e. the "filer+client" components. It will use a database or key-value store for metadata and a blob store for data, and expose that as a filesystem.
The difference is that SeaweedFS has its own blob store ("volume server") while JuiceFS uses S3 (or some other protocols). SeaweedFS also decouples the server ("filer server") from the client ("mount" command or client libraries) while JuiceFS only has a single process, so the machine where you mount the filesystem talks to the metadata and data backends directly; this means you can't mount a filesystem on an untrusted machine if I understand correctly (you need full R+W access to the backends from the machine where you mount).
You can see it as similar to `rclone mount`, which allows you to mount a remote S3 bucket locally. The difference is that JuiceFS is much faster and filesystem-like, by storing the metadata in a separate faster database and by chunking your files in the backend rather than storing files unchanged in the bucket.
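To make the "metadata in a database, chunks in the bucket" idea concrete, here is a toy sketch. It is not JuiceFS's actual on-disk format: the names `write_file`, `read_file`, `CHUNK_SIZE`, and the key layout are all made up for illustration, and two plain dicts stand in for the metadata database and the object store.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical and tiny on purpose, so the split is easy to see


def write_file(path, data, metadata, object_store):
    """Split data into fixed-size chunks, store each chunk as its own object,
    and record the ordered chunk keys in the metadata store."""
    prefix = hashlib.sha256(path.encode()).hexdigest()[:8]
    keys = []
    for offset in range(0, len(data), CHUNK_SIZE):
        key = f"chunks/{prefix}/{offset // CHUNK_SIZE}"
        object_store[key] = data[offset:offset + CHUNK_SIZE]
        keys.append(key)
    metadata[path] = {"size": len(data), "chunks": keys}


def read_file(path, metadata, object_store):
    """Look up the chunk list in metadata, then fetch and join the chunks."""
    entry = metadata[path]
    return b"".join(object_store[key] for key in entry["chunks"])


metadata, object_store = {}, {}  # stand-ins for the database and the bucket
write_file("/notes.txt", b"hello world", metadata, object_store)
restored = read_file("/notes.txt", metadata, object_store)
```

Listing a directory or checking a file size only touches `metadata`, which is why a fast metadata store makes the whole thing feel more like a filesystem than a bucket of unchanged files.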
What I really want is a filesystem I can span across geographically remote nodes that's transparently compatible. I should just be able to chuck files into it from my NAS like any other. I think Mayastor [1] might get some of the way there?
Yes, JuiceFS uses the Apache 2 fork [1] directly (master branch), but also provides a full-featured S3 gateway (gateway branch) under AGPL, so people can choose.
You may want to specify reasons for wanting a “free as in beer” license, otherwise it just sounds greedy. (Not a swipe at you; I’ve fallen into this trap more than once.)
I love this project and I'd love to switch to it. Hopefully constructive feedback: The big issue I always run into with this stuff is what am I supposed to do if something goes wrong? I think the project documentation people would be wise to document procedures to do when certain things go wrong and how you should deal with them, such as if a server or two fail, or there's some unexpected corruption. Without that, "distributed storage" systems really feel incomplete to me. Storage is usually "mission critical" and they had a procedure for every single thing that could go wrong on the Apollo mission.
Ah I've been quite happy with LizardFS (which is a fork of MooseFS) and I found Ceph to be a bit of a letdown (too complex to manage). Well, time to try Seaweed then :)
Important to note that S3 does not have any Durability SLA. We promise Durability and take it extremely seriously, but there is no SLA. Much more of an SLO
Also, “durability” is not a property you can delegate to another service. Plenty of corruption is caused in-transit, not just at rest.
If your system handles the data in any way, you must compute and validate checksums.
If you do not have end to end checksums for the data, you do not get to claim your service adopts S3’s Durability guarantees.
S3 has that many 9s because your data is checksummed by the SDK. Every service that touches that data in any way recomputes and validates that (or a bracketed) checksum. Soup to nuts. All the way to when the data gets read out again.
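A minimal sketch of what "every hop recomputes and validates" can look like. This is not S3's or JuiceFS's implementation; `seal`, `verify`, `hop`, and `ChecksumMismatch` are hypothetical names, and SHA-256 is just one reasonable choice of checksum.

```python
import hashlib


class ChecksumMismatch(Exception):
    """Raised when the payload no longer matches the checksum it travelled with."""


def seal(payload: bytes) -> dict:
    """Client side: compute the checksum once, before the data leaves the caller."""
    return {"payload": payload, "sha256": hashlib.sha256(payload).hexdigest()}


def verify(envelope: dict) -> bytes:
    """Recompute the checksum and compare it with the one that came along."""
    actual = hashlib.sha256(envelope["payload"]).hexdigest()
    if actual != envelope["sha256"]:
        raise ChecksumMismatch(f"expected {envelope['sha256']}, got {actual}")
    return envelope["payload"]


def hop(envelope: dict) -> dict:
    """Any service that handles the data validates it before passing it on."""
    verify(envelope)
    return dict(envelope)


sealed = seal(b"customer data")
delivered = verify(hop(hop(sealed)))  # two intermediaries, then the final read
```

The point is that the checksum is created by the original writer and carried with the data, so corruption introduced in transit is caught at the next hop instead of being stored durably.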
And there is a lot more to Durability than data corruption. Protections against accidental deletions, mutations, or other data loss events come into play too. How good is your durability SLO when you accidentally overwrite one customer’s data with another’s?
Check out some of the talks S3 has on what Durability actually means, then maybe you investigate how durable your service is.
ps: I haven’t looked at the code yet, but plan to. Maybe I’m being presumptuous and your service is fully secured. I’ll let you know if I find anything!
pps: I work for amazon but all my opinions are my own and do not necessarily reflect my employer’s. I don’t speak for Amazon in any way :D
As you allude to in your response, that's usually referred to as durability, not reliability. The home page could probably use an update there to reflect that terminology.
It's an average; presumably they don't smear files across disks byte by byte, since that would be insane. But with drives randomly breaking, at some point every copy of at least one file will go at once. With, say, a terabyte of files over a thousand years, you'd expect to lose a total number of files equal to 100Kb. So probably not even one, with some small chance of losing half a drive.
It's unavoidable that too many disk failures in quick succession lead to data-loss. For example if you store two copies, your durability rests on being able to detect a disk failure and create another copy, before the sole remaining version dies as well.
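As a back-of-the-envelope illustration of the two-copy case, the sketch below estimates the chance that the sole remaining copy dies before re-replication finishes. It assumes independent disks with a constant failure rate, which real fleets do not strictly obey, and the function name and the numbers passed in are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365


def second_copy_loss_probability(annual_failure_rate: float, repair_hours: float) -> float:
    """Probability that the surviving disk fails within the repair window.

    Assumes a constant hazard rate, so survival over t years is (1 - AFR) ** t.
    """
    years = repair_hours / HOURS_PER_YEAR
    return 1.0 - (1.0 - annual_failure_rate) ** years


# Hypothetical inputs: 2% annual failure rate, one hour vs. one week to re-replicate.
fast_repair = second_copy_loss_probability(0.02, repair_hours=1)
slow_repair = second_copy_loss_probability(0.02, repair_hours=24 * 7)
```

Under these assumptions the risk grows roughly in proportion to the repair window, which is why fast failure detection and re-replication matter as much as the number of copies.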
Do you know if strong read-after-write consistency is supported (as in s3)?
Is an atomic put-if-absent method supported in JuiceFS (as in Azure blob storage)?
If so, this could be a really cool platform for formats like Delta.io :)
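For reference, on a POSIX-style filesystem put-if-absent is usually expressed as an exclusive create (`O_CREAT | O_EXCL`, or mode `"x"` in Python). The sketch below shows that pattern on a local directory; whether a JuiceFS mount makes it atomic across clients is not stated in this thread, so treat that part as an assumption to check against the docs. `put_if_absent` is a hypothetical helper name.

```python
import os
import tempfile


def put_if_absent(path: str, data: bytes) -> bool:
    """Create the file only if it does not exist yet; return False if it does."""
    try:
        # Mode "x" maps to an exclusive create and fails if the path exists.
        with open(path, "xb") as handle:
            handle.write(data)
    except FileExistsError:
        return False
    return True


with tempfile.TemporaryDirectory() as directory:
    commit = os.path.join(directory, "00000000000000000001.json")
    first_writer_won = put_if_absent(commit, b"writer A")
    second_writer_won = put_if_absent(commit, b"writer B")
    with open(commit, "rb") as handle:
        stored = handle.read()
```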
Regarding the topic of "cloud storage" - could someone tell me if Juice or maybe MinIO would be a good solution to:
1. Storing multimedia data (image/video) uploaded by a user - here I would guess it can either hit it directly or via the backend for auth
2. Should be accessible by a URL exposed outside of the docker-compose so it doesn't need to go through the backend REST API
3. Some form of authentication based on the JWT token in the Header - or maybe as this is an MVP simply generating a long enough random string will be enough
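For the "long enough random string" option in point 3, a minimal sketch using only the Python standard library could look like this. The helper names are hypothetical, and it only covers generating and comparing the token, not storing it or wiring it into nginx or the backend.

```python
import hmac
import secrets


def new_access_token() -> str:
    """Generate an unguessable, URL-safe token from 32 random bytes."""
    return secrets.token_urlsafe(32)


def token_matches(presented: str, expected: str) -> bool:
    """Compare tokens in constant time to avoid leaking information via timing."""
    return hmac.compare_digest(presented.encode(), expected.encode())


token = new_access_token()
accepted = token_matches(token, token)
rejected = token_matches(token + "x", token)
```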
Or should I simply use nginx + filesystem and not overcomplicate?
I hear everywhere S3 but as it's a pet project don't want to go the AWS route, instead maybe a Hetzner VPS with docker-compose to run the whole setup with an external Postgres instance.
Is that a use case they are really targeting though? Their splash page mentions big data with model generation and genomic sequencing as examples. I can really only speak to genetic sequencing. The IO pattern for these workflows is almost all streaming reads/writes. Random access takes too long when you are reading/writing 100-500GB files.
Postgres doesn’t like running on NFS either to be fair.
Is it 'Posix Compatible' or 'Posix' aka 'Posix compliant'?
It's incredibly hard to make a distributed posix compatible filesystem since you run into CAP. I believe (but am not certain) you are either caching locally in violation of Posix or signing up for arbitrarily long stalls and a ton of latency on every read/write. (I'm not certain because I'm not sure what Posix specifies wrt stale reads and other cache consistency requirements between syncs)
It would be interesting to hear what the tradeoffs are here, but assuming they are explicit and can be designed around this seems very useful.
It is not posix anything. It provides a compatibility layer that makes open, close, read, and write work but other than that does not provide the type of features that would allow you to deliver mail on it with qmail or whatever. It is incredibly misleading to advertise it that way.
As you say there is no free lunch with distributed filesystems. Application programmers have to program their way around the fact that something like posix atomic writes with multiple writers is never going to work, and that the only way to get reasonable efficiency out of the thing is to defer work until the file is closed.
Agreed, it's very hard, that's why GFS and HDFS had to give up some parts of POSIX compatibility.
Per CAP, it's addressed by different meta engines (CP systems: Redis, MySQL, TiKV) and different object stores (AP systems). When the meta engine is not available, operations on JuiceFS will block for a while and finally return EIO. When the object store returns 404 (object not found), which means it's not yet consistent with the meta engine, the read will be retried for a while, and may return EIO if it doesn't recover.
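The retry-then-EIO behaviour described above can be sketched roughly as follows. This is an illustration of the pattern, not JuiceFS's code: `ObjectNotFound`, `read_with_retry`, and the retry parameters are all hypothetical.

```python
import errno
import time


class ObjectNotFound(Exception):
    """Stand-in for an object store's 404 response."""


def read_with_retry(fetch, key, attempts=3, delay_seconds=0.0):
    """Retry a read that 404s, since the object store may simply lag behind
    the metadata; give up with EIO if it never shows up."""
    for attempt in range(attempts):
        try:
            return fetch(key)
        except ObjectNotFound:
            if attempt + 1 < attempts:
                time.sleep(delay_seconds)
    raise OSError(errno.EIO, f"object {key!r} still missing after {attempts} attempts")


# A fake store that becomes consistent on the third request.
requests_seen = {"count": 0}


def eventually_consistent_fetch(key):
    requests_seen["count"] += 1
    if requests_seen["count"] < 3:
        raise ObjectNotFound(key)
    return b"chunk bytes"


data = read_with_retry(eventually_consistent_fetch, "chunks/abc/0")
```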
The file format is carefully designed to work around the consistency issues of the object store and local cache. Every piece of data is written into the object store and local cache under a unique ID, so you will not get stale data as long as the metadata is correct [1].
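A toy model of why unique IDs sidestep stale caches: objects are never overwritten, so a cached entry can only be outdated in the sense that metadata no longer points to it. The `Client` class and its fields are hypothetical, and plain dicts stand in for the meta engine and object store.

```python
import uuid


class Client:
    """One mount point: shared metadata and objects, private local cache."""

    def __init__(self, metadata: dict, objects: dict):
        self.metadata = metadata  # path -> object ID (the source of truth)
        self.objects = objects    # object ID -> bytes, written once, never mutated
        self.cache = {}           # local cache keyed by the same object IDs

    def write(self, path: str, data: bytes) -> str:
        object_id = uuid.uuid4().hex      # fresh ID for every write
        self.objects[object_id] = data
        self.cache[object_id] = data
        self.metadata[path] = object_id   # publishing the new ID is the commit
        return object_id

    def read(self, path: str) -> bytes:
        object_id = self.metadata[path]   # always ask metadata first
        if object_id not in self.cache:
            self.cache[object_id] = self.objects[object_id]
        return self.cache[object_id]


metadata, objects = {}, {}
writer, reader = Client(metadata, objects), Client(metadata, objects)
writer.write("/report.csv", b"v1")
first_read = reader.read("/report.csv")   # reader now caches the v1 object
writer.write("/report.csv", b"v2")
second_read = reader.read("/report.csv")  # new ID, so the cached v1 is never served
```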
Within a mount point, JuiceFS provides read-after-write consistency. Across clusters, JuiceFS provides open-after-close consistency, which should be enough for most applications and also provides a good balance between consistency and performance.
I was actually building something similar to Juice using S3 as an object store and optionally using redis(fast) or s3(slow) for metadata storage. Basically a log structured filesystem using rolling hash chunk encoding and delegations. I kinda stopped when I found juice (and to some extent seaweed) as they were much further along. If you need shared storage and don't have crazy performance requirements it makes a lot of sense to separate out metadata and just throw blobs into object storage.
> Usually the meta engine or object storage can scale horizontally by itself, JuiceFS is middleware to talk to these two services.
Thanks for confirming this -- I spent a bunch of time reading and was wondering why I couldn't find anything... this answers my question.
I think I misunderstood JuiceFS -- it's more like a better rclone than it is a distributed file system. It's a swiss army knife for mounting remote filesystems.
Assuming you're using a large object service (S3, GCP, Backblaze, etc) then the scale issue is expected to be solved. If you're using filesystem or local minio for example, then you have to solve the problem yourself.
> To serve S3 request, you can setup multiple S3 gateway and put a load director in front of them.
This is exactly the question I had -- it occurred to me that if I make 2 s3 gateways, even if they share the metadata store they might be looking at resources that only one can serve.
In that situation, then if a request came in to s3-0 for data that was stored @ s3-1, the request would fail, correct? Because s3-0 has no way of redirecting the read request to s3-1.
This could work if you had an intelligent router sitting in front of the S3 gateways (so you could send reads/writes to the one that is known to have the data), but by default, your writes would fail, I'm assuming.
Oh I have one more question -- can you give options to the sshfs module? It seems like you can just append `?SSHOptionHere=x` to `--bucket` but I'm not sure (ex. `--bucket user@box?SshOption=1`)
Atomic file/directory renames/moves are a fundamental feature of JuiceFS, which makes it truly a file system rather than a proxy to S3. Please check the docs for all the compatibility details [1].
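Atomic rename is what makes the common "write to a temp file, then rename into place" pattern safe, so readers see either the old file or the new one, never a half-written one. The sketch below runs against a local temporary directory; that it behaves the same on a JuiceFS mount is an assumption based on the claim above, and `atomic_write` is a hypothetical helper.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write data to a temp file in the same directory, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # make sure the bytes are persisted first
        os.replace(tmp_path, path)   # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)      # do not leave temp files behind on failure
        raise


with tempfile.TemporaryDirectory() as directory:
    target = os.path.join(directory, "config.json")
    atomic_write(target, b'{"version": 1}')
    atomic_write(target, b'{"version": 2}')
    with open(target, "rb") as handle:
        final_content = handle.read()
    leftover_files = os.listdir(directory)
```

A plain S3 bucket has no rename at all (it is a copy plus a delete), which is the contrast the comment above is drawing.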
About Juicedata Inc.
Founded in April 2017, Juicedata is a globally oriented, innovative distributed file system company. The team consists of senior architects, genius engineers, and consulting experts who have worked in the field of distributed systems for many years. The team members, located across Hangzhou, Shanghai, Xiamen, and other cities, previously worked at Facebook, Databricks, Tencent, Alibaba, Zhihu, Xiaohongshu, Douban, and other well-known high-tech enterprises around the world.
Juicedata was jointly invested by China Growth Capital and Foothill Ventures.
EDIT: There is a whole comparison section in the docs that I missed: https://juicefs.com/docs/community/comparison/juicefs_vs_cep...