Fsspec: Filesystem Interfaces for Python (filesystem-spec.readthedocs.io)
129 points by gilad on June 16, 2021 | 28 comments



See also smart_open: https://github.com/RaRe-Technologies/smart_open which might be more user-friendly? Never used it myself but it was on HN before. Discussion on their bugtracker: https://github.com/RaRe-Technologies/smart_open/issues/579

Personally I wasn't too impressed with fsspec, though I am using it. I had to wrap files with Python's io wrappers to get acceptable performance, and found that different fsspec implementations still have significant differences that you can't ignore. They don't seem interested in supporting or documenting use cases outside of Pandas and Dask.
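To illustrate the io-wrapper point, here is a minimal stdlib-only sketch. `CountingRaw` is a hypothetical stand-in for an unbuffered remote file (it is not an fsspec class); it only counts how often the underlying "backend" is hit. How much buffering a real fsspec implementation already does varies, so treat this as a picture of the technique rather than a measurement of fsspec.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical unbuffered 'remote' file that counts low-level reads."""

    def __init__(self, data):
        self._src = io.BytesIO(data)
        self.reads = 0

    def readable(self):
        return True

    def readinto(self, b):
        # Every call here stands for one round trip to the backend.
        self.reads += 1
        return self._src.readinto(b)


data = b"x" * 10_000

# Unbuffered: each 1-byte read reaches the backend.
raw = CountingRaw(data)
for _ in range(len(data)):
    raw.read(1)
unbuffered_reads = raw.reads

# Wrapped: BufferedReader fetches large chunks and serves small reads from memory.
raw = CountingRaw(data)
buffered = io.BufferedReader(raw, buffer_size=4096)
for _ in range(len(data)):
    buffered.read(1)
buffered_reads = raw.reads

print(unbuffered_reads, buffered_reads)
```

For text, `io.TextIOWrapper(buffered, encoding="utf-8")` can be layered on top in the same way.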


What use cases would you like to see? fsspec is also used by xarray, Intake, DVC (see other comments) and others, and supported by, e.g., pyarrow. Please share your performance concerns as issues.


I'd give a thumbs up to smart_open. Happily used it to have our servers parse data dumps of up to several hundred megabytes on user-configurable remote storage without drama. Seems like it could handle larger files without issue, but we haven't had need to push it.

I can't speak to fsspec, but their filesystem functionality (e.g. `ls`, `cp`) isn't something that smart_open has and it does look cool.


I'm a big fan of smart_open -- it makes moving stuff around in Airflow DAGs so much simpler, esp. with its transparent compression and S3 multipart upload handling. (which you can tweak, but comes with sensible defaults!)

Also, I'm no 10x developer, but I found smart_open's source code a pleasure to read through and grok -- it's nicely organised and very easy to reason about!


fsspec is used in dask (https://github.com/dask/dask) if you want to see it in action.

The author of fsspec also created fastparquet (https://github.com/dask/fastparquet), a native Python implementation of the Parquet file format.

I'm really appreciative of Martin's extensive contributions to the PyData ecosystem.


FsSpec is also the old name for a file specification on Macintosh (Classic OS).

I came here because of the naming :-)


How does it compare to PyFilesystem [1]? At our startup we use PyFilesystem as a generic interface for our webapp, which makes it cloud-agnostic: the app can use Amazon, GCS, NFS, gzip, tar, Azure or a regular file system.

PyFilesystem is a Python module that provides a common interface to any filesystem. It supports many filesystems [2].

[1] https://www.pyfilesystem.org/

[2] https://www.pyfilesystem.org/page/index-of-filesystems/


It addresses pyfilesystem in the introduction: https://filesystem-spec.readthedocs.io/en/latest/intro.html#...


Not very convincingly IMO. There's also the approach taken by s3path: https://github.com/liormizr/s3path


Our team has been migrating DVC to fsspec. We've even started working on creating fsspec compatible wrappers for SSH, Alibaba cloud, etc.

There were challenges indeed, and some storages will require more work (e.g. things like GDrive) but I would say maintainers are responsive and helpful.


One feature of a filesystem library I would like is to convert a filename into one that only includes permitted characters for a given filesystem type (for example, ":" and "?" are allowed in xfs but not exfat). For instance, mystring = fs.util.convert_name(u"What time is it?.txt", fs.fstype.exfat, "_") where "_" is the character to be substituted.


How would you do that without potentially creating colliding names?
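One possible answer, as a rough sketch: substitute the forbidden characters, then add a numeric suffix whenever the result is already taken. The names `convert_name` and `convert_unique` are hypothetical (they are not part of fsspec or PyFilesystem), and the forbidden-character set below is my assumption for exFAT (control characters plus `" * / : < > ? \ |`). Names are compared case-insensitively, on the assumption that the target filesystem treats case variants as the same file.

```python
import os
import re

# Assumption: characters not permitted in exFAT file names.
EXFAT_FORBIDDEN = re.compile(r'[\x00-\x1f"*/:<>?\\|]')


def convert_name(name, replacement="_"):
    """Replace every forbidden character with `replacement`."""
    return EXFAT_FORBIDDEN.sub(replacement, name)


def convert_unique(name, taken, replacement="_"):
    """Sanitize `name`, then append ~1, ~2, ... until it is not in `taken`.

    `taken` is a set of case-folded names already in use; it is updated
    in place with the name that gets returned.
    """
    candidate = convert_name(name, replacement)
    stem, ext = os.path.splitext(candidate)
    n = 1
    while candidate.casefold() in taken:
        candidate = f"{stem}~{n}{ext}"
        n += 1
    taken.add(candidate.casefold())
    return candidate


taken = set()
first = convert_unique("What time is it?.txt", taken)
second = convert_unique("What time is it:.txt", taken)
print(first)
print(second)
```

The mapping is no longer reversible, so if the original names matter you would have to store them somewhere alongside the files.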


Just a heads up - fsspec has an (optional) dependency on s3fs which has a requirement on aiobotocore, which in turn is currently locked to an ancient boto3 version. Versioning fsspec[s3fs] in projects with other boto3 dependencies is a nightmare.


Note that this is a pip issue specifically - updating boto3 might update botocore to be incompatible with aiobotocore. Users of conda are always fine. Or you can pin your version requirements and be fine however you install.


First: hi, and thanks for all the dask :waves:

Conda - yes, because it has less strict versioning for setuptools-based packages. Pinning versions, no - because you end up with fsspec[s3fs] requiring a version incompatible with any other recent package requiring boto3.


If there is an in-memory filesystem library for this, it could be useful for unit testing things that want to write files.


I don't know if it was a question to fsspec; we have MemoryFileSystem, addressed with "memory://" URLs.
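A small sketch of what that could look like in a test, assuming fsspec is installed. The path `/demo/hello.txt` is made up for the example. Note that the in-memory store is shared across the process, so a test should clean up after itself.

```python
import fsspec

# "memory" is the protocol name behind memory:// URLs.
fs = fsspec.filesystem("memory")

# Write a file without touching disk.
with fs.open("/demo/hello.txt", "wb") as f:
    f.write(b"hello")

# Read it back through the URL form.
with fsspec.open("memory://demo/hello.txt", "rb") as f:
    content = f.read()

# Filesystem-style operations work too.
listing = fs.ls("/demo", detail=False)

# Clean up: the store is global to the process.
fs.rm("/demo/hello.txt")

print(content, listing)
```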


What’s wrong with /tmp?


StringIO is probably faster than temp files.


My own experience is that nameless temporary files (eg tempfile.TemporaryFile()) are _considerably_ faster than (c)StringIO and worth using whenever the "file" is bigger than a few bytes.


Odd, why would this be? Read and write operations on temporary files need to go through the kernel and thus lead to frequent context switches. The same shouldn’t be true for a string buffer except on reallocation.


Maybe the page cache (the thing tmpfs relies on) doesn't do a full copy on reallocation, but rather, roughly speaking, appends extra pages to some sort of linked list? It would explain why it's faster.


If that’s the case, there’s a bug somewhere.


I'm sure I must have made a mistake as I wasn't able to replicate this with some quick local benchmarking.
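For anyone who wants to check on their own machine, a quick stdlib-only benchmark sketch. The chunk size and count are arbitrary choices of mine, and results will depend on OS, filesystem and Python version, so no numbers are claimed here.

```python
import io
import tempfile
import timeit

CHUNK = b"x" * 4096
N_CHUNKS = 2048  # 8 MiB in total; arbitrary


def write_and_read(factory):
    """Write N_CHUNKS chunks to a fresh file-like object, then read it all back."""
    with factory() as f:
        for _ in range(N_CHUNKS):
            f.write(CHUNK)
        f.seek(0)
        return len(f.read())


def bench(factory, repeat=3):
    """Best wall-clock time in seconds over `repeat` runs."""
    return min(timeit.repeat(lambda: write_and_read(factory), number=1, repeat=repeat))


print("BytesIO:      ", bench(io.BytesIO))
print("TemporaryFile:", bench(tempfile.TemporaryFile))
```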


It's used in pandas and I love it!


It's super convenient, but watch out for bugs and performance issues. Its caching mechanism, how it uploads to S3, and how it seeks unseekable files conceal pretty big performance bottlenecks.


You may want to make issues out of this comment. fsspec tries to provide sensible defaults and lots of options for how you cache bytes, file listings and connection objects.


This is what the web's upcoming File System Access [1] really enables, but stealthily so. It's advertised as an implementation, as the capabilities to interact with the filesystem, but it's also an interface. JS having an interface for filesystems is going to be extremely great.

[1] https://wicg.github.io/file-system-access/



