Hacker News new | past | comments | ask | show | jobs | submit login
Casync – A tool for distributing file system images (0pointer.net)
138 points by Nekit1234007 on June 20, 2017 | hide | past | favorite | 33 comments



I'm not sure I buy the embedded/IoT use case; OSTree is a really good model there and is more featureful. The "well, if your filesystem image delta happens to be in the form of a lot of very small files it's not so great for CDNs" doesn't strike me as a terribly good reason to give up everything OSTree gives you (especially with stuff like the meta-updater [1] Yocto integration).

[1] https://github.com/advancedtelematic/meta-updater

(Full disclosure: I work for Advanced Telematic, the creators and maintainers of the meta-updater Yocto layer.)


Well, I am pretty sure IoT devices should be designed with security in mind, and that means that they need to be protected against offline modification. And that's something OSTree can't really deliver, but dm-crypt can. And casync works pretty well for delivering dm-crypt enabled disk images.

I think OSTree is great — but for embedded devices that are installed in the wild, humm, uh, I don't think so? I am pretty sure there are better options than that.


I'm with the open source project Mender.io (OTA for embedded Linux) and we think Casync is a very interesting building block and may look into this and evaluate whether it makes sense to incorporate it into our project.

We had looked into OSTree before but given the use case of embedded devices in the wild, we concluded it was too risky as OSTree relies on the filesystem to protect from power failures. And rollback was not built-in and is quite challenging to implement reliably.


Please elaborate on 'need to be protected against offline modification'?


Think of cell towers or wind power turbines: both are primary hacking targets in today's world, and they are placed in the wild, in uncontrolled and unprotected locations. This means more or less anybody can walk by, temporarily cut the power source, take the harddisk out, plug it into their hacking laptop, install an OS trojan on it, place it back into the original device and restore the power. From the PoV of the cell company or the power company this was just a short power cut, and nothing changed. In reality the system was just hacked. OSTree can't help you protect against that, because disk accesses aren't validated; the only validation takes place during downloading. dm-verity OTOH will validate every single access, and if deployed properly such "offline" modifications to the OS will result in the device not booting anymore, which is much preferable over accepting that the device was hacked with no scheme to detect it.

And it's not just cell towers or wind power turbines: pretty much any device which is around people not unconditionally trusted needs to be protected against such offline modifications. In fact, if people today build cars, TVs, surveillance cameras or anything else like that and do not deploy dm-verity in some form to make sure the devices cannot be modified offline without it being noticed, they are just participating in turning IoT into the Internet of Shit.
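To make the dm-verity idea concrete, here is a minimal sketch (not dm-verity's actual on-disk format, which uses a multi-level hash tree) of the principle: every data block is hashed, a root hash covers all block hashes, and every read is verified against that tree. Offline tampering with any block is detected on the first read.

```python
import hashlib

BLOCK_SIZE = 4096  # dm-verity's default data block size


def build_hash_tree(image: bytes):
    """Hash every block; a root hash covers all block hashes.
    (Real dm-verity builds a multi-level tree; one level shown here.)"""
    blocks = [image[i:i + BLOCK_SIZE] for i in range(0, len(image), BLOCK_SIZE)]
    block_hashes = [hashlib.sha256(b).digest() for b in blocks]
    root = hashlib.sha256(b"".join(block_hashes)).hexdigest()
    return block_hashes, root


def verify_block(block: bytes, index: int, block_hashes, root) -> bool:
    """Check a single read against the tree. The root hash itself is
    provided out-of-band (kernel command line, signed), so an attacker
    with disk access can't simply replace it."""
    if hashlib.sha256(block).digest() != block_hashes[index]:
        return False
    return hashlib.sha256(b"".join(block_hashes)).hexdigest() == root


image = bytes(range(256)) * 64  # 16 KiB toy "disk image"
hashes, root = build_hash_tree(image)
assert verify_block(image[:BLOCK_SIZE], 0, hashes, root)

# Offline tampering flips one byte -> the first read of that block fails
tampered = b"\xff" + image[1:BLOCK_SIZE]
assert not verify_block(tampered, 0, hashes, root)
```

With a properly anchored root hash, the device refuses to use tampered data instead of silently booting a trojaned OS.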


But physical access == game over? Whatever software layer you add imho.

Wouldn't it be easier to simply dunk the whole device in some epoxy preventing access to the hardware with some anti-tamper deadman switch?


Trusted boot and TPMs with remote attestation exist precisely to ensure that physical access does not mean game over. It's all there; people just need to make use of it in their systems. And yes, trusted boot and TPMs have issues, but without all this the attack surface is massive, and I think needlessly so.


(Trusted boot and TPM are afaik already compromised, albeit you need to be a near rocket scientist to pull it off.)

I will always think physical access is game over, whatever 'rocket science' or re-invented old principles people come up with software-wise. I'm not sure, but probably hardware-wise too; software is just easier to mangle.

And indeed, yes: security is layers, layers that make attacks more difficult, and having many options of layers to choose from is great.

Also, I hadn't really heard about OSTree before; reading up on both for some future project.


He probably means modification by those who have physical access, which often means the users, though sometimes they are not the owners.

If you have devices like a cable box or a water meter, the real owners do not want you to modify the device. That's where mechanisms like dm-verity step in.


If you read the internals description, it could just as well be about Borg: very similar principles, though the application is very different.

By the way, both buzhash and SHA-256 are kinda poor choices for a new system, especially one that targets servers.


Yep, Borg is the first thing I thought of. It already does a lot of this and more: encryption, configurable encoding, a rolling hash computed by the buzhash algorithm and so on.

Maybe it wasn't geared for CDN delivery during restores but otherwise I've been impressed by borg so far (haven't deployed it in production, only played with it locally though).

https://github.com/borgbackup/borg

This is a description of the internal design:

http://borgbackup.readthedocs.io/en/stable/internals.html


The "latest" version of that page has seen significant additions: http://borgbackup.readthedocs.io/en/latest/internals.html


Thanks. That is a better reference indeed. It's got nice diagrams as well. Can't edit my post any longer, so hopefully others will just see your message.


> both buzhash and SHA-256 are kinda poor choices for a new system

Why?


Low software performance


Considering that it uses xz for compression, does the performance of SHA-256 matter? (Well, using a faster hash function can speed up finding duplicate blocks that were already packed.)

I'm more interested to hear about buzhash, though.


I assume™ that xz won't stay the only choice. I think it's important to understand that in deduplication you'll pass all data through your hashes once or twice. Regarding buzhash: it can break with byte granularity, but it has a dependency chain that prohibits parallelization. You'll likely never see it go faster than 700-750 MB/s on a desktop CPU (~3.8 GHz Haswell), and it won't profit from non-clock improvements of CPUs. Giving up byte granularity allows significant improvements in performance, but I don't think anyone has comprehensively analysed the impact on deduplication performance. I didn't.

(OTOH if your storage is faster than ~200-300 MB/s (buzhash and a hash, naively combined) then there is likely no issue using higher degrees of I/O concurrency, so you can work around these problems).


Thanks for the explanation. Do you know of any implementations?


SHA-256 seems to score well on https://www.cryptopp.com/benchmarks.html .


Depends on the context. If there is a lot of hashing, then a faster alternative like BLAKE2b is better.


What would you suggest instead? Any numbers?


BLAKE2b, mentioned above, would be more than twice as fast on 64-bit CPUs. But I think we'll soon see SHA-256 CPU instructions on most processors (ARM, lower-cost Intel, and the latest AMD already ship them: https://neosmart.net/blog/2017/will-amds-ryzen-finally-bring...), so I guess it's not important. For numbers, see blake2.net or bench.cr.yp.to.

For IoT devices, hashes that work on 32-bit words, like SHA-256, actually make more sense and will be faster, so BLAKE2s would work well.

What I'd like to hear from the above commenter is about a faster replacement for buzhash, which I'm also interested in.
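The relative speeds are easy to check yourself, since Python's hashlib ships SHA-256, BLAKE2b and BLAKE2s. A crude single-threaded micro-benchmark sketch (actual numbers depend heavily on your CPU, and on machines with SHA-256 instructions SHA-256 may well win):

```python
import hashlib
import time


def throughput(name: str, data: bytes, rounds: int = 20) -> float:
    """Return MB/s for hashing `data` `rounds` times (crude measurement)."""
    hash_ctor = getattr(hashlib, name)
    start = time.perf_counter()
    for _ in range(rounds):
        hash_ctor(data).digest()
    elapsed = time.perf_counter() - start
    return len(data) * rounds / elapsed / 1e6


data = b"\x00" * (8 * 1024 * 1024)  # 8 MiB buffer
for name in ("sha256", "blake2b", "blake2s"):
    print(f"{name:8s} {throughput(name, data):8.1f} MB/s")
```

On typical 64-bit desktop CPUs without SHA extensions this shows BLAKE2b well ahead of SHA-256, with BLAKE2s (32-bit words) in between, matching the comment above.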


Great. The chunked model (inspired by Borgbackup/Tarsnap) seems preferable to Docker layering and diff-based approaches.

As far as I can tell, the advantages compared to Borgbackup seem to be:

* casync offers control over which FS metadata is included

* casync, the server, exposes chunks over HTTP

* casync, the library, is written in C, so it is more easily used by systems software.

I'm betting we'll see machinectl integration. Excellent!


Oops. Just realised my comment contains a mistake.

casync does not act as a server. Its on-disk representation and client behaviour is designed in such a way that the server need only serve static files. This makes deployment easy.
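The reason a dumb static file server suffices is content addressing: chunks are stored under their own hash, and an index file lists the chunks making up an image. A minimal sketch of the idea (fixed-size chunks and hypothetical function names for illustration; casync uses variable-size, content-defined chunking and its own .caidx/.castr formats):

```python
import hashlib
import os
import tempfile

CHUNK_SIZE = 64 * 1024  # fixed-size chunks, for illustration only


def store_image(image: bytes, store: str) -> list:
    """Split an image into chunks, write each under its content hash,
    and return the index: an ordered list of chunk hashes."""
    index = []
    for i in range(0, len(image), CHUNK_SIZE):
        chunk = image[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        path = os.path.join(store, digest)
        if not os.path.exists(path):  # dedup: identical chunks stored once
            with open(path, "wb") as f:
                f.write(chunk)
        index.append(digest)
    return index


def fetch_image(index: list, store: str) -> bytes:
    """Reassemble the image by fetching each chunk by name. A client
    would GET these paths over HTTP; the server only serves files."""
    parts = []
    for digest in index:
        with open(os.path.join(store, digest), "rb") as f:
            parts.append(f.read())
    return b"".join(parts)


store = tempfile.mkdtemp()
image = os.urandom(128 * 1024) + b"\x00" * (128 * 1024)  # repeated zero chunks
idx = store_image(image, store)
assert fetch_image(idx, store) == image
assert len(set(idx)) < len(idx)  # the identical zero chunks deduplicated
```

A client that already has some chunks locally simply skips fetching them, which is where the delta/update savings come from.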


> I'm betting we'll see machinectl integration. Excellent!

systemd-nspawn integration is what I was thinking about too, so yeah! Nice work.


All these great Unix based tools make me wish I did not have to work on Windows Servers all day.


What is the difference between Casync & rclone (https://rclone.org)?


rclone is a nice cloud syncing solution for files and folders. casync is intended to clone and delta entire filesystems, in a way that makes them nicely deployable. rclone is very cloud-focused, while casync says nothing about the details of how images are served.

casync also has fs composition, a multitude of recorded file attributes, automatic reflinking/hardlinking, uid/gid shifting, and so much more.

tl;dr: rclone is for files, casync is for entire filesystems/deployments.


Thanks for explaining


I think this is a very useful tool if working with arbitrary file changes on block devices, but it's still very low level and would need a crapton of modification/wrapping to make it useful in a complex system. I would rather use Kickstart or the like to distribute changes intelligently, or barring that, rsyncing hardlinked directory trees, or zsync (but RPM/Yum would really be ideal, due to the features gained).


I'd be really interested in creating full disk images and then cloning them to another disk. If that is a use case (I think I skimmed the post correctly), it could be very useful and more performant than dd and similar disk-cloning tools.


It sounds like a sort of half-done distributed filesystem; there are a lot of similarities.


This sounds like a centralized BitTorrent.



