I'm not sure I buy the embedded/IoT use case; OSTree is a really good model there and is more featureful. The "well, if your filesystem image delta happens to be in the form of a lot of very small files it's not so great for CDNs" doesn't strike me as a terribly good reason to give up everything OSTree gives you (especially with stuff like the meta-updater [1] Yocto integration).
Well, I am pretty sure IoT devices should be designed with security in mind, and that means that they need to be protected against offline modification. And that's something OSTree can't really deliver, but dm-crypt can. And casync works pretty well for delivering dm-crypt enabled disk images.
I think OSTree is great, but for embedded devices that are installed in the wild? I don't think so. I am pretty sure there are better options than that.
I'm with the open source project Mender.io (OTA updates for embedded Linux), and we think casync is a very interesting building block; we may look into it and evaluate whether it makes sense to incorporate it into our project.
We had looked into OSTree before, but given the use case of embedded devices in the wild, we concluded it was too risky, as OSTree relies on the filesystem to protect against power failures. Rollback was also not built in, and it is quite challenging to implement reliably.
Think of cell towers or wind turbines: both are primary hacking targets in today's world, and both are placed in the wild, in uncontrolled and unprotected locations. This means more or less anybody can walk by, temporarily cut the power, take the hard disk out, plug it into their hacking laptop, install an OS trojan on it, put it back into the original device, and restore the power. From the PoV of the cell company or the power company this was just a short power cut, and nothing changed. In reality the system was just hacked. OSTree can't protect you against that, because disk accesses aren't validated; the only validation takes place during downloading. dm-verity, OTOH, protects every single access, and if deployed properly, such "offline" modifications to the OS will result in the device not booting anymore, which is much preferable to accepting that the device was hacked with no scheme to detect it.
And it's not just cell towers or wind turbines: pretty much any device that is around people who aren't unconditionally trusted needs to be protected against such offline modifications. In fact, people who today build cars, TVs, surveillance cameras or anything else like that, and who do not deploy dm-verity in some form to make sure the devices cannot be modified offline without it being noticed, are just participating in turning IoT into the Internet of Shit.
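The per-access verification idea is easy to demonstrate. This is a toy sketch of the principle behind dm-verity, not its actual on-disk hash-tree format or block sizes: every block read is re-hashed and checked against a precomputed hash before the data is returned, so an offline edit is caught at access time rather than at download time.

```python
import hashlib

BLOCK = 4096  # illustrative block size

def hash_blocks(image: bytes):
    """Hash every data block. Real dm-verity builds a multi-level tree
    whose root hash is what the verified boot chain actually trusts."""
    leaves = [hashlib.sha256(image[i:i + BLOCK]).digest()
              for i in range(0, len(image), BLOCK)]
    root = hashlib.sha256(b"".join(leaves)).digest()
    return leaves, root

def read_block(image: bytes, n: int, leaves) -> bytes:
    """Every read verifies the block before returning it, unlike
    download-time-only validation."""
    data = image[n * BLOCK:(n + 1) * BLOCK]
    if hashlib.sha256(data).digest() != leaves[n]:
        raise IOError(f"block {n} failed verification")
    return data

image = bytes(BLOCK * 4)                 # pristine image
leaves, root = hash_blocks(image)
read_block(image, 2, leaves)             # verifies fine

# Simulate an "offline" edit: flip one byte in block 2.
tampered = image[:BLOCK * 2] + b"\x01" + image[BLOCK * 2 + 1:]
try:
    read_block(tampered, 2, leaves)
except IOError as e:
    print(e)
```

A real deployment also has to anchor the root hash in something the attacker can't rewrite (a signed kernel command line, a TPM, etc.), otherwise they just recompute the tree.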
Trusted boot and TPMs with remote attestation exist precisely to ensure that physical access does not mean game over. It's all there; people just need to make use of it in their systems. And yes, trusted boot and TPMs have issues, but without all this the attack surface is massive, and I think needlessly so.
(Trusted boot and TPMs are AFAIK already compromised, although you need to bring in a near rocket scientist.)
I will always think physical access is game over, whatever 'rocket science' or re-invented old principles people come up with on the software side. I'm not sure about hardware, but probably that too; software is just easier to mangle.
And indeed, yes, security is layers: layers that make attacks more difficult. Having many layer options to choose from is great.
Also, I hadn't really heard about OSTree before; reading up on both for some future project.
He probably means modification by those who have physical access, which often means the users, though they are not always the owners.
If you have devices like a cable box or a water meter, the real owners do not want you to modify the device. That's where mechanisms like dm-verity step in.
Yep, borg is the first thing I thought of. It already does a lot of this and more: encryption, configurable encoding, a rolling hash computed with the buzhash algorithm, and so on.
Maybe it wasn't geared toward CDN delivery during restores, but otherwise I've been impressed by borg so far (I haven't deployed it in production, only played with it locally).
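For the curious, the rolling property is what makes buzhash cheap to evaluate at every byte position: sliding the window by one byte is a couple of rotations and XORs instead of rehashing the whole window. A minimal sketch (the table seed and window size here are arbitrary, not borg's or casync's actual parameters):

```python
import random

WINDOW = 48          # sliding-window size in bytes (illustrative)
MASK32 = 0xFFFFFFFF  # keep everything in 32 bits

# Buzhash needs a fixed table of 256 random 32-bit values.
random.seed(0)
TABLE = [random.getrandbits(32) for _ in range(256)]

def rol(v, n):
    """Rotate a 32-bit value left by n bits."""
    n %= 32
    return ((v << n) | (v >> (32 - n))) & MASK32

def buzhash(window: bytes) -> int:
    """Hash a full window from scratch."""
    h = 0
    for i, b in enumerate(window):
        h ^= rol(TABLE[b], len(window) - 1 - i)
    return h

def buzhash_update(h: int, out_byte: int, in_byte: int, size: int) -> int:
    """Slide the window one byte: remove out_byte, add in_byte."""
    return (rol(h, 1) ^ rol(TABLE[out_byte], size) ^ TABLE[in_byte]) & MASK32

data = bytes(range(200))
h = buzhash(data[:WINDOW])
for i in range(WINDOW, len(data)):
    h = buzhash_update(h, data[i - WINDOW], data[i], WINDOW)
# The rolled hash must equal hashing the final window from scratch.
assert h == buzhash(data[-WINDOW:])
```

Note that each update depends on the previous hash value, which is the serial dependency chain mentioned elsewhere in this thread.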
Thanks. That is a better reference indeed. It's got nice diagrams as well. Can't edit my post any longer, so hopefully others will just see your message.
Considering that it uses xz for compression, does the performance of SHA-256 matter? (Well, using a faster hash function can speed up finding duplicate blocks that were already packed.)
I'm more interested to hear about buzhash, though.
I assume™ that xz won't stay the only choice. I think it's important to understand that in deduplication, you'll pass all data through your hashes once or twice. Regarding buzhash: it can cut with byte granularity, but it has a dependency chain that prohibits parallelization. You'll likely never see it go faster than 700-750 MB/s on a desktop CPU (~3.8 GHz Haswell), and it won't profit from non-clock improvements in CPUs. Giving up byte granularity allows significant improvements in performance, but I don't think anyone has comprehensively analysed the impact on deduplication performance. I haven't.
(OTOH, if your storage is faster than ~200-300 MB/s (buzhash and a hash, naively combined), there is likely no issue with using higher degrees of I/O concurrency, so you can work around these problems.)
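To illustrate why byte-granular cutting matters for deduplication: with content-defined boundaries, inserting a few bytes only disturbs the chunks around the edit, and everything after it realigns. A self-contained toy chunker; the window size, boundary mask, and minimum chunk length are made-up parameters, not casync's or borg's defaults:

```python
import hashlib
import random

WIN, MASK32 = 48, 0xFFFFFFFF
random.seed(0)
TABLE = [random.getrandbits(32) for _ in range(256)]

def rol(v, n):
    """Rotate a 32-bit value left by n bits."""
    n %= 32
    return ((v << n) | (v >> (32 - n))) & MASK32

def chunk_digests(data, boundary_mask=0x03FF, min_len=256):
    """Cut wherever the low bits of a windowed buzhash are all zero
    (~1 KiB average chunks here), then hash each resulting chunk."""
    h = 0
    for i, b in enumerate(data[:WIN]):       # hash the first window
        h ^= rol(TABLE[b], WIN - 1 - i)
    cuts, last = [], 0
    for i in range(WIN, len(data)):          # roll the window forward
        h = (rol(h, 1) ^ rol(TABLE[data[i - WIN]], WIN) ^ TABLE[data[i]]) & MASK32
        if (h & boundary_mask) == 0 and i + 1 - last >= min_len:
            cuts.append(i + 1)
            last = i + 1
    bounds = [0] + cuts + [len(data)]
    return [hashlib.sha256(data[a:b]).hexdigest()
            for a, b in zip(bounds, bounds[1:])]

a = bytes(random.getrandbits(8) for _ in range(64 * 1024))
b = a[:1000] + b"0123456789" + a[1000:]      # insert 10 bytes near the start

da, db = chunk_digests(a), chunk_digests(b)
# Only the chunks around offset 1000 change; later boundaries re-sync,
# so most chunks are shared and need not be re-transferred.
assert len(set(da) & set(db)) > len(da) // 2
```

A fixed-block chunker would fail this test badly: the 10-byte insertion shifts every subsequent block boundary, so almost nothing downstream of the edit deduplicates.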
BLAKE2b, mentioned above, would be more than twice as fast on 64-bit CPUs. But I think we'll soon see SHA-256 CPU instructions on most processors (ARM, lower-cost Intel, and the latest AMD already ship them: https://neosmart.net/blog/2017/will-amds-ryzen-finally-bring...), so I guess it's not important. For numbers, see blake2.net or bench.cr.yp.to.
For IoT devices, hashes that work on 32-bit words, like SHA-256, actually make more sense and will be faster, so BLAKE2s would work well.
What I'd like to hear about from the commenter above is a faster replacement for buzhash, which I'm also interested in.
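All three hashes discussed here ship in Python's hashlib, so the trade-off is easy to poke at directly. A quick sketch; the digest sizes are chosen so all three are drop-in comparable 256-bit digests:

```python
import hashlib

data = b"some chunk of an image" * 4096

# BLAKE2b works on 64-bit words: the fast pick on 64-bit desktop/server CPUs.
b2b = hashlib.blake2b(data, digest_size=32).hexdigest()

# BLAKE2s works on 32-bit words: a better fit for small 32-bit IoT cores.
b2s = hashlib.blake2s(data).hexdigest()

# SHA-256: wins wherever the CPU ships dedicated SHA instructions.
sha = hashlib.sha256(data).hexdigest()

# All three yield a 256-bit digest (64 hex characters).
assert len(b2b) == len(b2s) == len(sha) == 64
```

For actual throughput numbers on your own hardware, wrapping each call in `timeit` is more honest than quoting benchmark pages.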
Oops. Just realised my comment contains a mistake.
casync does not act as a server. Its on-disk representation and client behaviour are designed in such a way that the server need only serve static files. This makes deployment easy.
rclone is a nice cloud cloning solution for files and folders. casync is intended to clone and delta entire filesystems, in a way that makes them nicely deployable. rclone is very cloud-focused, while casync says nothing about the details of how images are served.
casync also has filesystem composition, a multitude of recorded file attributes, automatic reflinking/hardlinking, uid/gid shifting, and much more.
tl;dr: rclone is for files, casync is for entire filesystems/deployments.
I think this is a very useful tool for working with arbitrary file changes on block devices, but it's still very low-level and would need a crapton of modification/wrapping to make it useful in a complex system. I would rather use Kickstart or the like to distribute changes intelligently, or, barring that, rsync with hardlinked directory trees, or zsync (but RPM/Yum would really be ideal, due to the features gained).
I'd be really interested in creating full disk images and then cloning them to another disk.
If that is a use case (I think I skimmed the post correctly), it could be very useful and more performant than dd and similar disk-cloning tools.
[1] https://github.com/advancedtelematic/meta-updater
(Full disclosure: I work for Advanced Telematic, the creators and maintainers of the meta-updater Yocto layer.)