Show HN: Zero – Local file system transparently swapping to the cloud (github.com/konstantinschubert)
252 points by konschubert on Sept 9, 2018 | 96 comments



Very cool project! Close to my heart. I write ExpanDrive

http://www.expandrive.com

which operates on the same idea: it keeps up to a 10GB local cache and streams out larger files. It supports a huge number of storage backends (Dropbox, Google Drive/GCS, Amazon S3/Drive, OneDrive, SFTP, etc.) and can also pin files and trees into the cache.


Bought a license maybe 4-5 years ago because the website said that Linux support was coming very soon. It's been coming ever since. I tried to sign up for the beta but didn't even get an answer. :(


Rclone can mount 20+ backends and has caching support, so I guess you can do something similar with it. Plus it's FOSS.


Originally said "coming soon" to see if anyone was interested. Turns out they are!


Can you explain a little more what you mean by this?


It means that it was not actually "coming soon". The site said that just to see whether many people were interested in the feature. If enough people were interested, development of that feature would start; if not, I guess the text would just have been removed. Makes sense?


Pretty convoluted to me...


No, it's simple, it's just a lie.

(Sorry, don't mean to attack anyone or put too much blame, but let's call it what it is.)


Technically the lie really only extends to the "soon" part, since it is coming. And software regularly has a list of future features that often don't ship. Would love to chat if you were ever interested!


Hi, I am always up for a chat, my email is mail@konstantinschubert.com. :)

To get back to our argument: at the time you said "coming soon", you didn't know whether it was coming at all, so I would call that a bit of a lie.


But at least they've been working on it for a good minute:

https://twitter.com/expandrive/status/1000012200612950017


We set up a beta list to gauge interest. No sense in doing all the work if nobody cares.


Yeah, but why not just ask whether people were interested, instead of saying something that wasn't true?


Because everyone and their dog would say they want it, or are interested in it, even if they don't actually want it. This puts up a (small) barrier that helps gauge true interest.

It's the same idea as setting up a landing page for a product that doesn't exist and seeing if people click "pricing" or "buy now" to determine if there is a market for it.


Anywhere I can sign up to be notified when the Linux version is out?


linux@expandrive.com


Why is the local cache limited to 10 GB?

I am building Zero to store my personal pictures and videos and I feel like 10GB local cache means that only the last month of pictures and videos is local, which seems very limited.

Are there any technical reasons why the cache cannot be 500GB?


It used to be unlimited, but most users ultimately wanted to offload data and access it on demand. It should be configurable but currently isn't.


Very pedantic but I think "Built with in Boston." is intended to be "Built in Boston." in the footer?


There is a red heart after "with" - or at least there is on every machine I've tried.


Script blocker bit me there, you are right :). Looks like a very interesting project!


Based on "I write ExpanDrive", I thought the author was Chinese.


A bit rude. I like that phrasing.


I use expandrive everyday. Thanks man.


Is that similar to the Drive FS client then (except that it's obviously multi-cloud and multi-backend)?

https://support.google.com/a/answer/7491144?hl=en


Any progress on the Linux version?



Does everything get stored in the cloud - even files that are in the 10GB local cache?


Yes, unless you're offline - in which case it stays in the cache until a connection is available.


Happy customer of Expandrive for quite a while. Got a lifetime license.


I should also switch to the lifetime license sometime, Expandrive is indeed really an awesome product!


Thanks!


Sounds good. At what stage would I need to buy the license? I could not find any info on that.


“ExpanDrive runs unlicensed and fully featured for 7 days giving you a chance to try everything out. Once the trial period expires you’ll need to buy a license or you will be limited to 20 minutes of use.”


Author here. Very glad to see that nobody has noticed how slow it is, and nobody has made a comment about how messy the code is in some places :D

Working on both issues.

But first I will add some instructions on how to run it.


I think you should try funding your project by encouraging users to follow an affiliate link to buy storage from Backblaze.


Interesting idea. But is the cloud version a complete image (perhaps out of sync)? If so, it's a performance disaster; if not, it's very fragile.

It seems to me what we really want is a cloud file system with a local cache (like Dropbox or iCloud, conceptually), so that if our local device is vaporized we have a pretty much up-to-date logical store alive and well (and we can work on any number of machines). The word "swapping" seems to be based on the virtual memory model, which means that if anything goes wrong you have two disconnected piles of crap.

At a file level you could theoretically have a giant file that is never wholly local, but how useful is this as a feature in real terms?


I think Borg or Tarsnap use the right approach here: a map of blocks, where updating a file updates only the changed block(s). It balances the efficiency of updates against the completeness of the copy. Sort of like the FAT filesystem, only with block-level deduplication built in.
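A toy sketch of that block-map idea in Python (the chunk size and hash are arbitrary choices for illustration, not Borg's or Tarsnap's actual format):

    import hashlib

    BLOCK_SIZE = 4096
    store = {}   # content-addressed block store: hash -> bytes

    def put_file(data):
        # Split a file into blocks and return its "map" (a list of
        # block hashes). Unchanged blocks hash to keys that already
        # exist, so they are stored only once.
        blockmap = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            h = hashlib.sha256(block).hexdigest()
            store.setdefault(h, block)   # dedup: skip known blocks
            blockmap.append(h)
        return blockmap

    def get_file(blockmap):
        return b"".join(store[h] for h in blockmap)

Editing one block of a big file adds only one new entry to the store, and files that share most of their content automatically share most of their blocks.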

Of course you don't get a nice mirror of your files right in the cloud, unless you run a separate server that reconstructs it and makes it available as traditional buckets.


restic and duplicacy are the newer implementations of block-level deduplicated, encrypted backup.

From what I tested, restic has friendlier command-line options, but duplicacy is technically superior at this point (restore works way faster).


Restic's restore isn't parallelized at all, whereas its backup is. It should be straightforward to improve the restore performance.

https://github.com/restic/restic/pull/1719


I use a Rubrik appliance, which does block-level dedupe and extends to the cloud. I was able to instantiate a multi-TB DB from the backup to a physical server in minutes. Extremely impressed.


I decided against a block-level system with Zero because I'm trying to make predictions about which files will be needed next locally, and that's hard on a block level, I think.


I am wondering if there is a backup solution that works that way but without requiring a manual, time-consuming invocation.

Something like inotify to record changed files, plus a worker in the background that syncs immediately. Like Dropbox.
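That kind of watcher is easy to prototype with the Python watchdog library, which uses inotify on Linux (the upload function here is a hypothetical stand-in for the actual sync):

    import time
    from watchdog.observers import Observer            # pip install watchdog
    from watchdog.events import FileSystemEventHandler

    def upload(path):
        # Hypothetical stand-in: push one changed file to the backend.
        print("syncing", path)

    class SyncHandler(FileSystemEventHandler):
        # Called from the observer thread whenever a watched file changes.
        def on_modified(self, event):
            if not event.is_directory:
                upload(event.src_path)

    observer = Observer()
    observer.schedule(SyncHandler(), "/home/me/data", recursive=True)
    observer.start()                # background worker, Dropbox-style
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()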



Yes, the cloud version is a complete image (without the file names though) that should be eventually consistent.

And yes, performance is a disaster right now simply because the code is not optimized at all. But the sync to the cloud happens in the background so it should not affect your performance unless you have a "cache miss".


Isn't it a fusion of HSM https://en.wikipedia.org/wiki/Hierarchical_storage_management and continuous backup?

What about often-locally-changed data that is part of a coherent set, the classic case being a file used by a database engine to store data? We nearly always need to mirror/backup a consistent version of it (just after a successful outermost transaction, in the SQL world the top-level "COMMIT"), but AFAIK, for the time being, HSM+backup software cannot detect such a state. One could trap existing system calls (fsync and co.) in order to copy data to the remote storage in a synced state, but this is not robust, because their semantics is not "upon return of this call, the whole dataset (in all files) is consistent".

Moreover, if the application using the DB engine is not perfect, such inconsistency may reside at the application level: after a COMMIT the file is consistent for the DB engine, but not for the application.

I wonder if some users of such HSM+backup software have felt some major disappointment after restoring an inconsistent version of such a file. Even a minor loss (a garbled index) may be hard to detect and can lead to a "fork" of the data.

A dedicated system function called to signal "in my set of opened files, the data are consistent" would be useful but is AFAIK missing, and even if someone adds it to some libc/kernel, it will only help once application code actually calls it.

The kludge is a procedure: "order the engine to sync its data; throttle the engine into a no-write mode; create a read-only snapshot; back up the snapshot; unthrottle the engine; delete the snapshot", which is not exactly "transparent".
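A minimal sketch of that kludge in Python, assuming a MySQL engine and an LVM-backed data volume (the volume names, sizes and paths are hypothetical):

    import subprocess

    # Keep one client session open: FLUSH TABLES WITH READ LOCK only
    # holds for as long as the session that took it.
    mysql = subprocess.Popen(["mysql", "-u", "root", "--batch"],
                             stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE, text=True)

    # 1. Sync the data and block all writes.
    mysql.stdin.write("FLUSH TABLES WITH READ LOCK; SELECT 'locked';\n")
    mysql.stdin.flush()
    mysql.stdout.readline()            # wait until the lock is held

    # 2. Create a snapshot while writes are frozen.
    subprocess.run(["lvcreate", "-s", "-L", "5G", "-n", "dbdata_snap",
                    "/dev/vg0/dbdata"], check=True)

    # 3. Unthrottle the engine.
    mysql.communicate("UNLOCK TABLES;\n")

    # 4. Back up the frozen snapshot, then delete it.
    subprocess.run(["mount", "-o", "ro", "/dev/vg0/dbdata_snap",
                    "/mnt/snap"], check=True)
    subprocess.run(["rsync", "-a", "/mnt/snap/", "/backups/db/"],
                   check=True)
    subprocess.run(["umount", "/mnt/snap"], check=True)
    subprocess.run(["lvremove", "-f", "/dev/vg0/dbdata_snap"], check=True)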


In such a case, you’re better off with a database engine that streams its journal or transaction log to an object store.

Don’t perform data operations at the wrong layer.


Indeed, and this is my point: such tools cannot be generic ("works with any file") and also transparent ("plug & play").


Yes, but those are the preconditions to user adoption.


Author here. Thanks for the Wikipedia link. I think that the software is trying to implement HSM but I didn't know that this is what it's called.

With Zero, all local data is eventually synced to the cloud, but usually this only happens after the local file has been idle for a while.
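A sketch of that idle-then-upload policy (the threshold and the upload function are hypothetical stand-ins, not Zero's actual code):

    import os, time

    IDLE_SECONDS = 600                 # hypothetical idle threshold

    def upload(path):
        # Stand-in for the real cloud upload.
        print("uploading", path)

    def sync_idle_files(cache_dir):
        # Upload every cached file that hasn't been written for a while.
        now = time.time()
        for root, _dirs, files in os.walk(cache_dir):
            for name in files:
                path = os.path.join(root, name)
                if now - os.path.getmtime(path) > IDLE_SECONDS:
                    upload(path)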


I've been using SyncThing [1] recently, which does a similar thing but between your own devices (anything from Android to desktop to servers in the cloud). I've been using it on Linux and Windows, and it seems pretty good.

[1] https://syncthing.net


I don't think Syncthing is the same. As far as I recall, it stores the full sync folder to disk, similar to traditional sync services. This project only stores a cache on the disk.


You are correct. With SyncThing each device gets a full copy of whatever folders you've asked it to sync there. The FUSE aspect of Zero lets it do just-in-time file transfer and have no solid upper limit on storage, saving bandwidth and disk space at the cost of portability, latency, and redundancy.
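For a feel of the just-in-time part, here is a minimal read-through FUSE sketch with fusepy (the cache path and remote_fetch are hypothetical stand-ins, not Zero's actual code):

    import os
    from fuse import FUSE, Operations   # pip install fusepy

    CACHE = "/var/cache/zerofs"         # hypothetical local cache dir

    def remote_fetch(path):
        # Hypothetical stand-in: download file contents from the cloud.
        raise NotImplementedError

    class ReadThroughFS(Operations):
        # Reads are served from a local cache; a cache miss triggers a
        # one-time fetch from the backend. Writes, directory listing,
        # eviction and proper metadata handling are omitted for brevity.

        def _local(self, path):
            return os.path.join(CACHE, path.lstrip("/"))

        def getattr(self, path, fh=None):
            # Real code would answer this from a metadata store
            # instead of assuming the file is already cached.
            st = os.lstat(self._local(path))
            return {k: getattr(st, k) for k in
                    ("st_mode", "st_nlink", "st_size", "st_mtime")}

        def read(self, path, size, offset, fh):
            local = self._local(path)
            if not os.path.exists(local):    # cache miss: fetch once
                with open(local, "wb") as f:
                    f.write(remote_fetch(path))
            with open(local, "rb") as f:
                f.seek(offset)
                return f.read(size)

    if __name__ == "__main__":
        FUSE(ReadThroughFS(), "/mnt/zero", foreground=True)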


I built syncthingfuse to do partial syncs. Currently unmaintained, though.

https://github.com/burkemw3/syncthingfuse


Gcloud also has a FUSE adapter for cloud storage buckets

https://cloud.google.com/storage/docs/gcs-fuse


Just wondering: would anybody want a block device backed by S3 or other object storage? A local cache, with snapshots that can be rolled back? Maybe giving you an exabyte of addressable storage?

This would be an actual block device, not a FUSE file system.


I was actually thinking of doing this as a block device instead of a file system and I agree that it may be the more "natural" solution.

However, the software tries to predict which files will be accessed next, and doing this at the block level would be much harder.

Plus, I had to learn FUSE to write this, and I thought I'd start off easy :D
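As an illustration of a file-level heuristic that would be hard to express in blocks (purely hypothetical, not Zero's actual predictor): when a file is read, prefetch the siblings that sort right after it, since pictures in an album tend to be viewed in sequence.

    import os

    def prefetch_candidates(accessed_path, n=5):
        # Files likely to be read next: the directory siblings that
        # sort directly after the file just accessed
        # (e.g. IMG_0042.jpg -> IMG_0043.jpg, IMG_0044.jpg, ...).
        folder = os.path.dirname(accessed_path)
        siblings = sorted(os.listdir(folder))
        i = siblings.index(os.path.basename(accessed_path))
        return [os.path.join(folder, s)
                for s in siblings[i + 1:i + 1 + n]]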


You mean like iSCSI in the cloud?


Yeah, like iSCSI backed by the cloud, but with a local cache for recently used data. And the ability to roll the entire device back to a state in the past (depending on how big you want your S3 bill to be).
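A rough sketch of the storage side of such a device, using boto3 (the bucket name and key scheme are hypothetical): each fixed-size block is written under an immutable per-generation key, so rolling back just means pointing the block map at older generations.

    import boto3

    BLOCK_SIZE = 4096
    BUCKET = "my-block-device"         # hypothetical bucket
    s3 = boto3.client("s3")

    def write_block(index, generation, data):
        # Immutable key per (block, generation): old generations stay
        # around, which is what makes point-in-time rollback possible
        # (and what drives the S3 bill).
        s3.put_object(Bucket=BUCKET,
                      Key=f"blocks/{index}/{generation}",
                      Body=data)

    def read_block(index, generation):
        obj = s3.get_object(Bucket=BUCKET,
                            Key=f"blocks/{index}/{generation}")
        return obj["Body"].read()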


So, what all these filesystems do is keep a local cache (in RAM or on SSD for better performance). The successors of NFS in the Linux kernel have been doing that all along.

Not at all interesting, IMO. WebDAV and Nextcloud already work (just like rsync). What's interesting is applying some kind of encryption on top of it. For that, I use Cryptomator [1], which also works on mobile devices.

[1] https://cryptomator.org


I built something similar years ago when I had a laptop with a really tiny hard drive. I would move a bunch of files to a cloud server and leave a bunch of empty placeholder files on the hard drive. Then I wrote a file system filter driver which would watch for those files being opened, catch the open request, and quickly download the contents.

It was better than walking around with an external drive. But it was super slow. After I upgraded the laptop, I had no further use for it.


This is one of the reasons I use pCloud. It adds a huge disk device that is actually remote, and only the recent files are cached locally.

The other reason is Linux support.


Lifetime prices for online services? Sounds really sketchy.


Lifetime of the service perhaps.


Nice work! I also made a similar file system, Zbox: https://github.com/zboxfs/zbox. The difference is that Zbox is an in-app file system focused on privacy, so FUSE is intentionally not supported. It already supports a key-value store, and I am currently trying to extend it to cloud storage.


Zbox looks pretty interesting, thanks for linking it!

Have you done any performance testing to compare Zbox vs xfs/ext4/zfs/whatever on the same system? I saw the benchmark in the readme, but that doesn't necessarily show how much overhead or performance loss there is compared to the native filesystem.


No, I didn't. And I don't think it is necessary, as Zbox is much more like an "application-level" fs, which obviously can't match a system-level fs.


Awesome idea! Recently I've been looking for a solution to automatically back up ~GBs of scraped data that is updated daily. Is this solution trustworthy enough? I was burned by OneDrive silently deleting data on a previous attempt.


The README states "Do not use in production" so I wouldn't trust it yet.

For ~GBs of data per day, is it necessary to use something that avoids having a full local copy? I'd have thought you could keep a full local mirror backed up with Dropbox, Backblaze, rsync, rclone, etc.


Pricing and reliability are my two main problems. I have access to 1TB of OneDrive, but after it silently deleted 40GB of irretrievable data I will never trust it again. Pricing-wise, Dropbox, GDrive and others seem too expensive. Backblaze is the current frontrunner for sure, particularly with its deduplication facilities.


Please don't use it in prod, this is work in progress.


I would recommend Borg (which has deduplication) + rclone.


Reminds me of https://github.com/Azure/azure-storage-fuse

The code is good, maybe you'll find inspiration there.


Why would one prefer this over s3ql (https://bitbucket.org/nikratio/s3ql/)?


Zero has a cache where it keeps the most recently accessed files. For example, your latest 100GB of raw video recordings. Does s3ql do that? I skimmed over their documentation and could not see it but maybe I didn't look long enough?


"S3QL splits file contents into smaller blocks and caches blocks locally." -- http://www.rath.org/s3ql-docs/about.html#features

It allows you to configure an arbitrary cache size, I've been using it with 60GB local cache.


But are writes and reads really 100% local, or do they require synchronous networking?


I can't see how this is used. Can we get a "here's how to set up the file system" guide somewhere? I see the bit about the config file. Then what?

I'd primarily like to use this to back up a couple of Proxmox hosts.


Yes, I'll add a guide and more description but please don't use it yet, it's work in progress.


I'm not seeing it, but does it support encrypting files/folders on cloud storage?


Worst case, you could layer it with encfs or equivalent. Be very careful to understand the exact threat model that covers (for starters, it leaves you painfully exposed to metadata issues), but it would work easily enough.


My plan is to use it with fusecrypt for now and eventually include encryption directly.


Keybase does something similar with their file system kbfs (which is encrypted); they use FUSE too.

Local caching is limited to 10% of your disk space (if I remember correctly).

Cool project though. Will definitely keep it on my radar.


Does Keybase sell cloud storage packages?

Btw, I really love the idea of Keybase, I hope they take off.


No, they have 250GB per user. IMO that is good for backups for now, until they maybe start selling more storage?

Same here, they're doing some really good work.


Been using rclone mount for something similar; curious how this compares.


I think with rclone you need as much local space as you take up in the cloud?


Rclone has a cache backend in the latest version which works exactly like OP's project.


Cool! Do you have a link?


Very cool idea. I will be keeping an eye on this one. 1 TB of practical storage for 5 bucks a month; at that rate you could have a petabyte of storage for $5,000 a month!


rclone.org is a great tool for syncing to/from cloud storage.


Are there any plans to support S3?


Hi, the backends are in principle pluggable, so I'm very happy to incorporate a PR for an S3 backend. I may also do it myself eventually.

I just find S3 very expensive for long-term storage of personal data.


They should support providers such as Wasabi, which has unlimited egress, so you can feel safe with fixed per-GB pricing.


MinIO can do disk caching.


NFS again?



