> Cloud Storage FUSE can only write whole objects at a time to Cloud Storage and does not provide a mechanism for patching. If you try to patch a file, Cloud Storage FUSE will reupload the entire file. The only exception to this behavior is that you can append content to the end of a file that's 2 MB and larger, where Cloud Storage FUSE will only reupload the appended content.
I didn’t know GCS supported appends efficiently. Correct me if I’m wrong, but I don’t think S3 has an equivalent way to append to a value, which makes it clunky to work with as a log sink.
Append workloads are common in distributed systems. It turns out that nearly every time you think you need random read/write to a data structure (e.g. a hard drive/block device for a Windows/Linux VM), you can instead emulate it with an append-only log of changes and a set of append-only indexes.
Doing so has huge benefits: Write performance is way higher, you can do rollbacks easily (just ignore the tail of the files), you can do snapshotting easily (just make a new file and include by reference a byte range of the parent), etc.
The downside is from time to time you need to make a new file and chuck out the dead data - but such an operation can be done 'online', and can be done during times of lower system load.
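For concreteness, here's a minimal sketch of the pattern (hypothetical names, a local file standing in for the object store): every write is an appended record, reads go through an in-memory index of byte offsets, and rollback really is just ignoring/truncating the tail.

```python
import json

class AppendOnlyKV:
    """Sketch: random-access reads/writes emulated with an append-only log
    plus an in-memory index of byte offsets (bitcask-style)."""

    def __init__(self, path):
        self.index = {}                 # key -> (offset, length) of latest value
        self.f = open(path, "a+b")
        self._rebuild_index()

    def _rebuild_index(self):
        # Replay the log from the start to rebuild the index.
        self.f.seek(0)
        offset = 0
        for line in self.f:
            rec = json.loads(line)
            self.index[rec["k"]] = (offset, len(line))
            offset += len(line)

    def put(self, key, value):
        # Every "update" is just an append; the index points at the newest record.
        self.f.seek(0, 2)
        offset = self.f.tell()
        line = (json.dumps({"k": key, "v": value}) + "\n").encode()
        self.f.write(line)
        self.f.flush()
        self.index[key] = (offset, len(line))

    def get(self, key):
        offset, length = self.index[key]
        self.f.seek(offset)
        return json.loads(self.f.read(length))["v"]
```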
> Append workloads are common in distributed systems.
Bringing back the topic to what the parent was saying; since S3 is a pretty common system, and a distributed system at that, are you saying that S3 does support appending data? AFAIK, S3 never supported any append operations.
I’ll add some nuance here. You can implement append yourself in a clunky way: create a new multipart upload for the file, copy its existing contents as the first part, upload a new part with the data you want to append, and then complete the upload.
Not as elegant / fast as GCS’s and there may be other subtleties, but it’s possible to simulate.
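Roughly, the sequence looks like this with boto3 (hypothetical bucket/key; note that non-final parts have a 5 MiB minimum, so the existing object generally has to be at least that large):

```python
import boto3

s3 = boto3.client("s3")

def s3_append(bucket, key, new_bytes):
    # Start a multipart upload that will replace the object in place.
    mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
    upload_id = mpu["UploadId"]

    # Part 1: server-side copy of the current contents (no download needed).
    copy = s3.upload_part_copy(
        Bucket=bucket, Key=key, UploadId=upload_id, PartNumber=1,
        CopySource={"Bucket": bucket, "Key": key},
    )

    # Part 2: the appended data (the last part may be smaller than 5 MiB).
    part = s3.upload_part(
        Bucket=bucket, Key=key, UploadId=upload_id, PartNumber=2,
        Body=new_bytes,
    )

    s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload_id,
        MultipartUpload={"Parts": [
            {"PartNumber": 1, "ETag": copy["CopyPartResult"]["ETag"]},
            {"PartNumber": 2, "ETag": part["ETag"]},
        ]},
    )
```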
Well, I'd argue that it's just two different names for similar concepts, but applied at different levels. A WAL is a low-level implementation detail, usually for durability, while Event Sourcing is an architecture applied to solve business problems.
A WAL would usually disappear or be truncated after a while, and you'd only rerun things from it if you absolutely have to. Changes in business requirements shouldn't require you to do anything with a WAL.
In contrast, an Event Sourcing log would be kept indefinitely, so when business requirements change, you could (if you want to, it's not required) re-run the previous N events and apply the new logic to old data in your data storage.
But, if you really want to, it's basically the same, but in the end, applied differently :)
A WAL is just one moving piece of a database, which might also use SSTables, LSM trees, or some other append-only immutable data structure, while Event Sourcing is an enterprise architecture pattern applied to business applications that relies heavily on immutable append-only data structures as its backbone.
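As a toy illustration of the Event Sourcing side (hypothetical events, not tied to any particular framework): state is never updated in place, it's derived by replaying the log, which is exactly why re-running old events under new rules is possible.

```python
from dataclasses import dataclass

@dataclass
class Deposited:
    amount: int

@dataclass
class Withdrew:
    amount: int

def apply(balance, event):
    # The "business rule"; changing it and replaying the log reinterprets history.
    if isinstance(event, Deposited):
        return balance + event.amount
    if isinstance(event, Withdrew):
        return balance - event.amount
    return balance

def replay(events):
    balance = 0
    for e in events:
        balance = apply(balance, e)
    return balance

log = [Deposited(100), Withdrew(30), Deposited(5)]   # append-only event log
assert replay(log) == 75
```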
You can use S3 versioning, assuming you have enabled this on the bucket. It would be a little clunky. It would also be done in batches and not continuous append.
Basically if your data is append only (such as a log), buffer whatever reasonable amount is needed, and then put a new version of the file with said data (recording the generated version ID AWS gives you). This gets added to the "stack" of versions of said S3 object. To read them all, you basically get each version from oldest to newest and concatenate them together on the application side.
Tracking versions would need to be done application side overall.
You could also do "random" byte ranges if you track the versioning and your object has the range embedded somewhere in it. You'd still need to read everything to find what is the most up to date as some byte ranges would overwrite others.
Definitely not the most efficient but it is doable.
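A rough sketch of what that looks like with boto3 (hypothetical bucket/key; pagination and application-side version tracking are omitted, and ordering by LastModified is a simplification of recording version IDs yourself as described above):

```python
import boto3

s3 = boto3.client("s3")

def append_chunk(bucket, key, chunk):
    # On a versioned bucket, each "append" is just a new version of the same key.
    resp = s3.put_object(Bucket=bucket, Key=key, Body=chunk)
    return resp["VersionId"]   # record this application-side to track ordering

def read_all(bucket, key):
    # Reassemble the log by concatenating versions oldest-to-newest.
    versions = s3.list_object_versions(Bucket=bucket, Prefix=key)["Versions"]
    versions = [v for v in versions if v["Key"] == key]
    versions.sort(key=lambda v: v["LastModified"])
    return b"".join(
        s3.get_object(Bucket=bucket, Key=key, VersionId=v["VersionId"])["Body"].read()
        for v in versions
    )
```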
You can set up lifecycle policies. For example, auto-delete or auto-archive versions older than a given date; that's a single lifecycle rule. With custom naming schemes it wouldn't scale as well.
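For reference, a noncurrent-version expiry rule is a small bit of configuration, something like this (hypothetical bucket and prefix):

```python
import boto3

s3 = boto3.client("s3")

# Permanently delete noncurrent versions 30 days after they are superseded,
# i.e. trim the old tail of the versioned "append log".
s3.put_bucket_lifecycle_configuration(
    Bucket="my-log-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-versions",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            }
        ]
    },
)
```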
gcsfuse uses compose [1] to append. Basically it uploads the new data to a temp object, then performs a compose operation to make a new object in place of the original with the combined content.
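With the Python client the same trick looks roughly like this (hypothetical names; compose is limited to 32 sources per call, and GCS allows the destination to be one of its own sources):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")   # hypothetical bucket name

def gcs_append(key: str, new_bytes: bytes):
    # Upload the appended data as a temporary object...
    temp = bucket.blob(key + ".append-tmp")
    temp.upload_from_string(new_bytes)

    # ...then compose [original, temp] back into the original object's name.
    original = bucket.blob(key)
    original.compose([original, temp])

    temp.delete()
```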
Building a storage service like these today and not having "append" would be very silly indeed. I guess S3 is kind of excused since it's so old by now. Although I haven't read anything about them adding it, so maybe less excused...
For appends, the normal way is to apply the append operations in an arbitrary order if there are multiple concurrent writers. That way you can have 10 jobs all appending data to the same 'file', and you know every record will end up in that file when you later scan through it.
Obviously, you need to make sure no write operation breaks a record midway while doing that. (unlike the posix write() API which can be interrupted midway).
It's also sensible to have high quality record markers that make identifying and skipping broken records easy. For example, recordio, a record-based container format used at Google in the early mapreduce days (and probably still used today), was explicitly designed to skip corruption efficiently (you don't want a huge index build to fail halfway through because a single document got corrupted).
Even so, I avoid parallel appenders to a single file; it can be hard to reason about and debug in ways that having each process append to its own file isn't.
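Not recordio itself, but the general idea of resync-able record framing looks something like this (made-up magic marker, length + CRC per record, skip ahead to the next marker on corruption):

```python
import struct
import zlib

MAGIC = b"\xde\xad\xbe\xef"   # hypothetical sync marker

def write_record(f, payload: bytes):
    # Frame: [magic][length][crc32][payload] -- enough to resync after corruption.
    f.write(MAGIC)
    f.write(struct.pack("<I", len(payload)))
    f.write(struct.pack("<I", zlib.crc32(payload)))
    f.write(payload)

def read_records(f):
    data = f.read()
    i = 0
    while True:
        j = data.find(MAGIC, i)
        if j < 0:
            return
        try:
            (length,) = struct.unpack_from("<I", data, j + 4)
            (crc,) = struct.unpack_from("<I", data, j + 8)
            payload = data[j + 12 : j + 12 + length]
            if len(payload) == length and zlib.crc32(payload) == crc:
                yield payload
                i = j + 12 + length
                continue
        except struct.error:
            pass
        # Corrupt or truncated record: skip past this marker and resync.
        i = j + 1
```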
Objects have three fields for this: Version, Generation, and Metageneration. There's also a checksum. You can be sure that you were the writer / winner by checking these.
You can also send a x-goog-if-generation-match[0] header that instructs GCS to reject writes that would replace the wrong generation (sort of like a version) of a file. Some utilities use this for locking.
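With the Python client that looks roughly like this (hypothetical object name); the precondition makes the write fail with a 412 Precondition Failed if someone else won the race:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")   # hypothetical bucket
blob = bucket.blob("state.json")

blob.reload()                         # fetch current metadata, including generation
seen_generation = blob.generation

# Only succeeds if the object still has the generation we read;
# otherwise GCS rejects the write (412), which is the basis for simple locking.
blob.upload_from_string(b"new contents", if_generation_match=seen_generation)
```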
I do scientific computing in google cloud. When I first got started, I heavily relied on GCSFuse. Over time, I have encountered enough trouble that I no longer use it for the vast majority of my work. Instead, I explicitly localize the files I want to the machine that will be operating on them, and this has eliminated a whole class of slowdown bugs and availability bugs.
The scale of data for my work is modest (~50TB, ~1 million files total, about 50k files per "directory").
FUSE does not work well with a large number of small files (due to high metadata ops such as inode/dentry lookups).
ExtFUSE (optimized FUSE with eBPF) [1] can offer you much higher performance. It caches metadata in the kernel to avoid lookups in user space. Disclaimer: I built it.
ExtFUSE seems really cool and great for implementing performant drivers in userspace for local or lower latency network filesystems, but I doubt FUSE is the bottleneck in this case since S3/GCS have 100ms first byte latency.
I am curious to hear what solutions people have found for this- for example, does anybody cache S3 in cloudfront and then point S3 clients at cloudfront?
This comes up because we store a lot of image data where time-to-first-byte affects the user experience but the access patterns preclude caching unless we are willing to spend $$$.
If the reason you were unable to use a CDN cache was because your access patterns require a lot of varying end serializations (due to things like image manipulation, resizing, cropping, watermarking, etc.), then this API could be a huge money saver for you. It was for me.
OTOH if the cost was because compute isn't free and the corresponding cloudflare worker compute cost is too much, then yeah, that's a tough one... I don't have a packaged answer for you, but I would investigate something like ThumbHash: https://evanw.github.io/thumbhash/ - my intuition is that you can probably serve some highly optimized/interlaced/"hashed" placeholder. The advantage of thumbhash here could be that you can optimize the access pattern to be less spendy by simply storing all of your hashes in an optimized way, since they will be extremely small, like small enough to be included in an index for index-only scans ("covering indexes"). (I have not actually tried this.)
I would consider making S3 the source of truth and mirroring your data to a regular server with a bunch of SSDs. If egress is substantial this may save you money. (When I built encodeproject.org AWS egress fees made it worth putting in a stateless proxy server so we only paid the lower Direct Connect fees.)
A 4U server with 24 8TB 2.5” $400 consumer SATA SSDs is probably the best bang for the buck. Probably about $20k plus hosting fees. That’s 192TB of storage.
I guess I should have explained better: we're running batch science jobs, or applications that serve up images to users, already in AWS. The user experience is already "fast enough" with an EC2 server with a bunch of SSDs, except that the dataset size is hundreds of terabytes and we don't know the access pattern ahead of time. S3 has a huge savings in terms of total byte cost, and for batch science jobs that already speak S3 natively or for workflows where the engine itself can do S3 to local staging we have good performance.
See https://www.nature.com/articles/s41592-021-01326-w figure 1 a and b for a demonstration of the distinct time-to-first-byte for uncached data (local POSIX, S3, and nginx/http). I'm trying to find a way to accelerate that TTFB for a dataset that lives in S3 and we don't know which parts could be cached before a user shows up and clicks on a dataset.
(I know it's a weird use case. It's not one I particularly want to support, as my own interests are more in the large-scale batch compute than the interactive user).
Looking at your figure, almost a second to download a 128KB chunk with HDF5 seems extremely high. How many requests are being made here? Is the HDF5 metadata being re-read from scratch for each iteration of the benchmark?
It might be interesting to trace log each http request being made and its timing during an iteration of the benchmark.
Using AWS CloudShell I see range requests for the last 128KB of a 512MB S3 file without authentication taking about 25ms when reusing a connection and 35ms when not.
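For anyone wanting to reproduce that kind of measurement, a suffix range request with boto3 looks like this (hypothetical bucket/key):

```python
import boto3

s3 = boto3.client("s3")

# Fetch just the last 128 KiB of the object via an HTTP Range header.
resp = s3.get_object(
    Bucket="my-bucket",
    Key="big-file.bin",
    Range="bytes=-131072",   # suffix range: final 131072 bytes
)
chunk = resp["Body"].read()
print(len(chunk))
```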
> I'm trying to find a way to accelerate that TTFB for a dataset that lives in S3 and we don't know which parts could be cached before a user shows up and clicks on a dataset.
If latency from S3 turns out to be the problem, I'm not sure there's any alternative but paying for the space your data uses on fast storage. The cheapest option on AWS will likely be EBS SSD volumes. That runs to $0.08/GB/month, or $8,000/month for 100TB (4x more than S3), before IOPS charges. You could try the EBS HDD volumes, but they do not advertise latency figures and are still 2x S3 pricing.
We don't use HDF5. Just look at the numbers for OME-NGFF and TIFF.
We have already tried EFS and it worked for our needs but cost about the same as your EBS. We had to enable the "EFS go fast" button (at least it exists!) which greatly increases the cost.
A high-latency network will certainly become the bottleneck, but ONLY for file reads/writes. Metadata attributes (e.g., symlinks, dentries, inodes) are maintained by the file system (FUSE). Caching them in the kernel with ExtFUSE will yield faster metadata ops (e.g., lookups).
What userspace operations do these metadata attribute lookups map to? Listing a directory will incur an S3 list request, so that's another network call, though I’m not sure how fast that is.
Note, it’s not that the network is slow, it’s that object storages aren’t designed for low latency access (but with parallel reads can serve data at high bandwidth.)
These file systems are not a good fit for large numbers of small files. Their sweet spot is working with large (~GB+) files which are mostly read from beginning to end. I’ve mostly used them for bioinformatics stuff.
I had a similar experience with S3 Fuse. It was slower, more complex and expensive than using S3 directly. I had feared refactoring my code to use the API, but it went quickly. I’ve never gone back to using or recommending a cloud filesystem like that for a project.
For workloads with many small files, it usually is better to store many files in a single object. Filesystems with regular POSIX semantics, such as atomic directory renames etc, also makes it easier to integrate with existing software. We have seen a lot of scientific computing usage of our filesystem (https://objectivefs.com) and as you mentioned localized caching of the working set is key to great performance.
Very strongly agree with your point. In my case the real number of files is in the billions, and they are already aggregated into grouped files to reduce that overall count. But for a period of time I tried to operate on the individual unaggregated files, and that was totally untenable. (Also expensive, because the normally negligible per-request fetch costs add up at that scale.)
These codes aren't talking HTTP. They are talking POSIX to a real filesystem. The problem is that cloud-based FUSE mounts are never as reliable (they will "just hang" at random times and you need some sort of external timeout to kill the process and restart the job, and possibly the host) as a real filesystem (either a local POSIX one or NFS or SMB).
I've used all the main FUSE cloud FS (gcsfuse, s3-fuse, rclone, etc) and they all end up falling over in prod.
I think a better approach would be to port all the important science codes to work with file formats like parquet and use user-space access libraries linked into the application, and both the access library and the user code handle errors robustly. This is how systems like mapreduce work, and in my experience they work far more reliably than FUSE-mounts when dealing with 10s to 100s of TBs.
> From reading the docs, it looks very similar to `rclone mount` with `--vfs-cache-mode off` (the default). The limitations are almost identical.
> However rclone has `--vfs-cache-mode writes` which caches file writes to disk first to allow overwriting in the middle of a file and `--vfs-cache-mode full` to cache all objects on a LRU basis. They both make the file system a whole lot more POSIX compatible and most applications will run using `--vfs-cache-mode writes` unlike `--vfs-cache-mode off`.
It is not how you would want to do it for a typical ML workload, where the samples have to get randomly permuted each epoch.
Instead, tar up the files in some random order and put the tar file on a web server or in a bucket, then stream them in during the first epoch while keeping track of their byte offsets in the tar file, which you cache locally, assuming ample local flash storage. Then permute the list of offsets and use those when reading samples for the next epoch.
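A rough sketch of the first-epoch/later-epoch split with Python's tarfile in streaming mode (hypothetical function names; assumes the tar stream gets cached to local flash during epoch one):

```python
import tarfile

def first_epoch(fileobj, train_step):
    """Stream samples from a tar (e.g. an HTTP/GCS byte stream) and build an offset index."""
    index = {}
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:   # "r|" = pure streaming, no seeks
        for member in tar:
            if not member.isfile():
                continue
            index[member.name] = (member.offset_data, member.size)
            train_step(member.name, tar.extractfile(member).read())
    return index

def later_epoch(local_tar_path, index, permuted_names, train_step):
    """Read samples from the locally cached tar in a permuted order via byte offsets."""
    with open(local_tar_path, "rb") as f:
        for name in permuted_names:
            offset, size = index[name]
            f.seek(offset)
            train_step(name, f.read(size))
```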
If you only have local HDD then you will need a more advanced data structure like the one provided by https://github.com/jacobgorm/mindcastle.io , which will allow you to write out permuted samples at close to disk sequential write bandwidth. See my talk at USENIX Vault 2019 for a full explanation, linked from https://vertigo.ai/mindcastle/
gcsfuse worked great for me on a couple of projects, but YMMV for production use. As with all distributed storage systems, make sure you can handle timeouts, retries, high latency periods and outages.
A few months ago I paired gcsfuse with Seafowl [0][1]. Was a lot of fun balancing tradeoffs that are usually not possible with classical databases e.g. Postgres. Thank you gcsfuse contributors for making it possible!
[0] Seafowl is an early stage open source database written in Rust. https://seafowl.io/
One thing I don’t fully understand is whether data is cached locally or whether I would have to handle that myself (for example if I have to read a configuration file)? And if it is cached, how can I control how often it refreshes?
It should be possible to use (2) and (3) for better performance, but it might need changes to the underlying FUSE library they use to expose those options.
You'd have to use your own cache otherwise. IME the OS-level page cache is actually quite effective at caching reads and seems to work out of the box with gcsfuse.
Does anyone have any experience on how this works at scale?
Let’s say I have a directory tree with 100MM files in a nested structure, where the average file is 4+ directories deep. When I `ls` the top few directories, is it fast? How long until I discover updates?
Reading the docs, it looks like it’s using this API for traversal [0]?
What about metadata like creation times, permission, owner, group?
Hi, Brandon from GCS here. If you're looking for all of the guarantees of a real, POSIX filesystem, you want to do fast top level directory listing for 100MM+ nested files, and POSIX permissions/owner/group and other file metadata are important to you, Gcsfuse is probably not what you're after. You might want something more like Filestore: https://cloud.google.com/filestore
Gcsfuse is a great way to mount Cloud Storage buckets and view them like they're in a filesystem. It scales quite well for all sorts of uses. However, Cloud Storage itself is a flat namespace with no built-in directory support. Listing the few top level directories of a bucket with 100MM files more or less requires scanning over your entire list of objects, which means it's not going to be very fast. Listing objects in a leaf directory will be much faster, though.
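For what it's worth, the "directory" illusion is built on prefix + delimiter listings, roughly like this with the Python client (hypothetical bucket/prefix):

```python
from google.cloud import storage

client = storage.Client()

# Objects directly "inside" a pseudo-directory, one level deep.
iterator = client.list_blobs("my-bucket", prefix="top-level-dir/", delimiter="/")
blobs = list(iterator)          # objects directly under the prefix
subdirs = iterator.prefixes     # "subdirectory" prefixes one level down

print(len(blobs), sorted(subdirs))
```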
Our theoretical usecase is 10+ PB and we need multiple TB/s of read throughput (maybe a fraction of that for writing). So I don’t think Filestore fits this scale, right?
As for the directory traversals, I guess caching might help here? Top level changes aren’t as frequent as leaf additions.
That being said, I don’t see any (caching) proxy support anywhere other than the Google CDN.
Brandon, I know why this was built, and I agree with your list of viable uses; that said, it strikes me as extremely likely to lead to gnarly support load, grumpy customers, and system instability when it is inevitably misused. What steps across all of the user interfaces is GCP taking to warn users who may not understand their workload characteristics at all as to the narrow utility of this feature?
If you really expect a file system experience over GCS, please try JuiceFS [1], which scales to 10 billion files pretty well with TiKV or FoundationDB as the metadata engine.
Blobstores are O(n) to perform a directory operation. You are forced to serialize / lock when these expensive operations happen to maintain consistency which limits the maximum size.
The gcsfuse k8s CSI also works well if you build to expect occasional timeouts. It is a shame that a reliable S3-compatible alternative does not yet exist in the open source realm.
I’d also look at goofys, which I’ve found to be quite performant for reads. It’s also nice that it’s a Golang binary, which is easily passed around to hosts.
Hah nice! I developed https://github.com/ahmetb/azurefs back in 2012 when I was about to join to Azure. I'm glad Azure actually provides a supported and actively-maintained tool for this.