Personally I wasn't too impressed with fsspec, though I am using it. I had to wrap files with Python's io wrappers to get acceptable performance, and found that different fsspec implementations still have significant differences that you can't ignore. They don't seem interested in supporting or documenting use cases outside of Pandas and Dask.
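For anyone wondering what "wrapping files with Python's io wrappers" can look like: below is a minimal sketch using only the standard library. The comment doesn't say which wrappers were used, so `io.BufferedReader` is an assumption, and `CountingRaw` is a hypothetical stand-in for an unbuffered remote file (it is not an fsspec class). It only illustrates why buffering many small reads into a few large ones can matter.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical stand-in for an unbuffered remote file; counts low-level reads."""

    def __init__(self, data: bytes):
        self._data = data
        self._pos = 0
        self.calls = 0

    def readable(self) -> bool:
        return True

    def readinto(self, buf) -> int:
        self.calls += 1  # in a real backend this could be one network round trip
        chunk = self._data[self._pos:self._pos + len(buf)]
        buf[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


def read_in_small_pieces(f, size: int = 16) -> bytes:
    """Read the whole stream in tiny pieces, as a naive parser might."""
    out = bytearray()
    while True:
        piece = f.read(size)
        if not piece:
            break
        out += piece
    return bytes(out)


payload = b"x" * 64_000

# Unbuffered: every 16-byte read hits the underlying object.
raw = CountingRaw(payload)
unbuffered = read_in_small_pieces(raw)
unbuffered_calls = raw.calls

# Buffered: io.BufferedReader fetches large blocks and serves small reads from memory.
raw2 = CountingRaw(payload)
buffered = read_in_small_pieces(io.BufferedReader(raw2, buffer_size=16_384))
buffered_calls = raw2.calls

print(f"unbuffered low-level reads: {unbuffered_calls}")
print(f"buffered low-level reads:   {buffered_calls}")
```

Whether a given fsspec backend needs this depends on how it buffers internally, so treat it as something to measure rather than a guaranteed win.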
What use cases would you like to see? fsspec is also used by xarray, Intake, DVC (see other comments) and others, and supported by, e.g., pyarrow. Please share your performance concerns as issues.
I'd give a thumbs up to smart_open. We've happily used it to have our servers parse data dumps of up to several hundred megabytes on user-configurable remote storage without drama. It seems like it could handle larger files without issue, but we haven't needed to push it.
I can't speak to fsspec, but their filesystem functionality (eg. `ls`, `cp`) isn't something that smart_open has and it does look cool.
I'm a big fan of smart_open -- it makes moving stuff around in Airflow DAGs so much simpler, esp. with its transparent compression and S3 multipart upload handling (which you can tweak, but it comes with sensible defaults!).
Also, I'm no 10x developer, but I found smart_open's source code a pleasure to read through and grok -- it's nicely organised and very easy to reason about!
How does it compare to PyFilesystem [1]? Our startup uses PyFilesystem as a generic interface for our webapp, which makes it cloud agnostic: the app can use Amazon, GCS, NFS, gzip, tar, Azure or a regular file system.
PyFilesystem is a Python module that provides a common interface to any filesystem. It supports many filesystems [2].
Our team has been migrating DVC to fsspec. We've even started working on creating fsspec compatible wrappers for SSH, Alibaba cloud, etc.
There were challenges indeed, and some storages will require more work (e.g. things like GDrive) but I would say maintainers are responsive and helpful.
One feature of a filesystem library I would like is to convert a filename into one that only includes permitted characters for a given filesystem type (for example, ":" and "?" are allowed in xfs but not exfat). For instance, mystring = fs.util.convert_name(u"What time is it?.txt", fs.fstype.exfat, "_") where "_" is the character to be substituted.
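A rough sketch of what such a helper could look like in plain Python. The name `convert_name` mirrors the hypothetical call above; it is not an existing function in any of the libraries discussed here. The forbidden set is my understanding of the exFAT rules (the characters `" * / : < > ? \ |` plus control characters 0x00-0x1F); a real implementation would need one table per filesystem type and would also have to consider length limits and reserved names.

```python
# Characters assumed to be disallowed in exFAT file names.
# Control characters (0x00-0x1F) are handled separately below.
EXFAT_FORBIDDEN = frozenset('"*/:<>?\\|')


def convert_name(name: str, forbidden=EXFAT_FORBIDDEN, substitute: str = "_") -> str:
    """Replace every forbidden or control character in `name` with `substitute`."""
    if any(ch in forbidden or ord(ch) < 0x20 for ch in substitute):
        raise ValueError("substitute must itself be a permitted string")
    return "".join(
        substitute if (ch in forbidden or ord(ch) < 0x20) else ch
        for ch in name
    )


print(convert_name("What time is it?.txt"))  # -> What time is it_.txt
```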
Just a heads up - fsspec has an (optional) dependency on s3fs which has a requirement on aiobotocore, which in turn is currently locked to an ancient boto3 version. Versioning fsspec[s3fs] in projects with other boto3 dependencies is a nightmare.
Note that this is a pip issue specifically - updating boto3 might update botocore to be incompatible with aiobotocore. Users of conda are always fine. Or you can pin your version requirements and be fine however you install.
Conda - yes, because it has less strict versioning for setuptools-based packages. Pinning versions - no, because you end up with fsspec[s3fs] requiring a version incompatible with any other recent package requiring boto3.
My own experience is that nameless temporary files (eg tempfile.TemporaryFile()) are _considerably_ faster than (c)StringIO and worth using whenever the "file" is bigger than a few bytes.
Odd, why would this be? Read and write operations on temporary files need to go through the kernel and thus lead to frequent context switches. The same shouldn’t be true for a string buffer except on reallocation.
Maybe the page cache (the thing tmpfs relies on) doesn't do a full copy on reallocation, but rather, roughly speaking, appends extra pages to some sort of linked list? That would explain why it's faster.
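This is easy enough to measure rather than guess. Here is a minimal timing sketch, assuming Python 3, where `io.BytesIO` stands in for the old (c)StringIO; the chunk size and count are arbitrary values picked for illustration. Results will depend heavily on OS, filesystem and data size, so no expected numbers are given.

```python
import io
import tempfile
import timeit

CHUNK = b"x" * 4096
N_CHUNKS = 256  # about 1 MiB in total; arbitrary, chosen only for illustration


def write_then_read(factory) -> bytes:
    """Write N_CHUNKS chunks to a fresh file-like object, rewind, and read it all back."""
    with factory() as f:
        for _ in range(N_CHUNKS):
            f.write(CHUNK)
        f.seek(0)
        return f.read()


def compare(repeats: int = 5) -> dict:
    """Time both backends; returns total seconds per backend over `repeats` runs."""
    return {
        "BytesIO": timeit.timeit(
            lambda: write_then_read(io.BytesIO), number=repeats
        ),
        "TemporaryFile": timeit.timeit(
            lambda: write_then_read(tempfile.TemporaryFile), number=repeats
        ),
    }


if __name__ == "__main__":
    for name, seconds in compare().items():
        print(f"{name}: {seconds:.4f}s")
```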
It's super convenient, but watch out for bugs and performance issues. Its caching mechanism, how it uploads to S3, and how it seeks unseekable files conceal pretty big performance bottlenecks.
You may want to make issues out of this comment. fsspec tries to provide sensible defaults and lots of options for how you cache bytes, file listings and connection objects.
This is what the web's upcoming File System Access[1] really enables, but stealthily so. It's advertised as an implementation, as the capabilities to interact with the filesystem, but it's also an interface. JS having an interface for filesystems is going to be extremely great.