OSTree – Robust OS upgrades for Linux (verbum.org)
72 points by alexlarsson on Aug 26, 2013 | 38 comments



The point that nobody seems to be making is that this degree of rigour is essentially a requirement for some classes of systems, i.e. the OS platform upon which you deploy applications becomes a named, versioned, tested entity with its own repo and changelog, which can be rapidly provisioned at any of its versioned states, rolled back, and have specific versions of specific applications tested against it. This is not always required, but is definitely good practice in all cases.

This is the part where I suggest docker people broaden their scope to include such... "devopsy" concerns around virt deployment. In my own system, used internally within my company, we have this distinction: the platforms are called PEs (platform environments) and the applications are SEs (service environments). Combining both produces a SIN (service instance node) on a particular CP (cloud provider).

While I really support docker and they probably get annoyed at me always commenting on their project and coming across as being slightly critical, in all honesty docker irritates me because it leaves all this sort of business out of scope. But I think they are possibly also heading in this direction. :)


That stuff is being built by CoreOS, which is technically a different project, but I suspect many people will end up using CoreOS and Docker together for maximum DevOps. Since ChromeOS/CoreOS and OSTree have different philosophies it might be worth exploring both ways to do it.


Ahh yes, I forgot about that. CoreOS looks interesting but doesn't seem to handle paravirt, which is a serious limitation if you need to run non-Linux systems, plus it's based on vagrant, which falls into the class of IMHO misconstrued sysadmin-automation software I refer to as PFCTs (post-facto configuration tinkerers). PFCT-based instantiation is IMHO epic-fail versus cleaner methods such as cloning a blockstore, as it opens large classes of potential bugs that are otherwise avoidable.

Basically, it seems like CoreOS might replace Ubuntu as the default host environment for the docker userbase, however the issue of supporting more exotic environments is apparently not being tackled.


> plus it's based on vagrant, which falls into the class of IMHO misconstrued sysadmin-automation software I refer to as PFCTs (post-facto configuration tinkerers).

Huh? I thought the point of vagrant was to just boot up a known image. Then you have veewee which can build an image suitable for vagrant from kickstart or whatever. Can you elaborate on your concerns? Thanks!


There is a fallacy in here...

Provisioners in Vagrant allow you to automatically install software, alter configurations, and more on the machine as part of the vagrant up process.

This is useful since boxes typically aren't built perfectly for your use case. Of course, if you want to just use vagrant ssh and install the software by hand, that works. But by using the provisioning systems built-in to Vagrant, it automates the process so that it is repeatable.

It's the last part of the final sentence. You can't prove something is repeatable if it's potentially accessing the internet, date and time logic, etc. Cloning an image is a lot more repeatable (and a known quantity) than re-massaging one into existence.

I term the class of IMHO misdirected systems-administration tools that attempt the latter "PFCTs" (post-facto configuration tinkerers). Vagrant is on the better end of these (at least it focuses on one-time instantiation versus constant modification). Other tools such as puppet could be construed as encouraging long-term modification without a clean reference at all, thus potentially resulting in configuration drift.
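
A rough illustration of the distinction, with made-up box and volume names:

  # PFCT-style: boot a base box, then massage it into shape at instantiation
  # time; each run depends on the network, package mirrors, clocks, etc.
  vagrant up                # boots the box, then runs its provisioners

  # clone-style: instantiate from a prebuilt, versioned golden image, so
  # every instance starts out byte-identical to that image
  lvcreate --snapshot --name web01 --size 2G /dev/vg0/golden-image-v42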

It seems clear to me that this class of tool grew somewhat organically from classical systems administration and is not the most rigorous of approaches in this era of cheap virtualization and instant cloud-based provisioning.


CoreOS runs on Xen, QEMU/KVM, and we will do bare metal soon too.

Vagrant is just one of the options for running CoreOS that lots of people like because it is easy to test out. It isn't the only option though. What do you mean by paravirt?


http://en.wikipedia.org/wiki/Paravirtualization

My comment was about running arbitrary things on CoreOS, not the other way around. (Sorry if perhaps I wasn't clear enough, IMHO it goes without saying that Linux will run on any standard platform.)


(disclaimer: I'm a developer on the Image Packaging System project: https://java.net/projects/ips/)

I think there's been some editorialising of the title as I don't see that the author of OSTree claims that it's a "robust" solution (alone).

With that caveat, without filesystem snapshot support, OSTree really isn't a complete solution (as the original author points out).

On Solaris 11+, there are generally two update scenarios for package upgrades:

1) pkg update [name1 name2 ...]

  no packages have new or updated items tagged with
    reboot-needed=true
  pkg will create a zfs snapshot
  pkg will create a backup boot environment
  perform update in place on live root
  if update fails, will exit and tell admin name
    of snapshot so they can revert to it if desired;
    will also print name of backup BE
  if update succeeds, will destroy snapshot and exit
2) pkg update [name1 name2 ...]

  one or more packages have new or updated items tagged
  with reboot-needed=true
  pkg will create a zfs snapshot
  pkg will clone the live boot environment
  pkg will perform update on clone
  if update fails, will tell admin name
    of clone BE so they can inspect it and exit
  if update succeeds, will activate clone BE,
    destroy snapshot, and exit telling admin name
    of new clone
Put another way, on Solaris 11+, the "default practice is the best practice." This is the advantage of integrating the package system with the native features the OS itself supports.
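
As a hedged approximation, scenario 2 corresponds roughly to the following manual commands (dataset and BE names are illustrative; pkg drives all of this itself):

  zfs snapshot rpool/ROOT/solaris@pre-update   # cheap safety snapshot
  pkg update --require-new-be --be-name s11-2  # update a clone BE, not the live root
  beadm list                                   # new BE is marked active for next boot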


If you think OSTree needs filesystem snapshot support, you aren't understanding how it works. Or perhaps we don't have a shared definition of "robust" - for me, assuming correct behaviour at the filesystem and block layer, I believe deployment changes to be atomic.

How do you determine reboot-needed=true? Is that something assigned by the package developer statically (i.e. kernel is reboot-needed=true)? Determined dynamically at update time (like in the package metadata for a specific revision?)

Do you attempt to control for local configuration?

Is the X server package reboot-needed=true?


  If you think OSTree needs filesystem snapshot support, you
  aren't understanding how it works. Or perhaps we don't
  have a shared definition of "robust" - for me, assuming
  correct behaviour at the filesystem and block layer, I
  believe deployment changes to be atomic.
I'll put this differently; how can OSTree reliably get a system from StateSRC -> StateDEST without filesystem-level snapshots?

If the answer is that an administrator must define what StateSRC is every time they upgrade their system, I don't think that's a usable mechanism.

If it does it based on the live system (the best answer), then without filesystem-level snapshots, how can OSTree reliably define StateSRC? There's a non-trivial set of race conditions OSTree would encounter while attempting to create its own "snapshot" of the state of the system.

Having filesystem-level snapshots allows you to get an accurate view of the live state of the system so that you can then modify a copy.

With ZFS, snapshots are basically "zero cost". Or rather, they have very low overhead for the base snapshot, and you only incur additional expense from divergences in the snapshot from its parent.


> I'll put this differently; how can OSTree reliably get a system from StateSRC -> StateDEST without filesystem-level snapshots?

Your comment strongly implies to me you don't understand how it works. It's really really simple - the physical / of the filesystem no longer stores the OS. Instead, you have multiple chroots in /ostree/deploy/osname/{checksum1, checksum2} etc.

To make a new checkout, it's just a hardlink farm.

This kind of thing has been done to some degree before - it's not entirely new. OSTree just polishes it, makes it more efficient, documents the semantics of /var, etc.
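
For the curious, a minimal sketch of the layout being described (OS name and checksums are made up, and abbreviated):

  /ostree/repo                        # content-addressed object store
  /ostree/deploy/osname/checksum1/    # one complete, bootable chroot
  /ostree/deploy/osname/checksum2/    # another; files unchanged between the
                                      #   two are hardlinks into the repo,
                                      #   not copies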


You're right that I don't fully understand how it works, which is why I was asking; there's no summary of how it works on the main page and I don't have the time to read the source code or attempt to digest the entire manual.

How do you define the initial chroot? Who is responsible for creating that?

How would you migrate an existing system to use OSTree?

If an administrator has to manually create the initial chroot link farm, I don't think that's a very usable solution. In particular, it suggests a lot of process overhead that assumes most systems are "templatable" representations.

It's not immediately clear whether OSTree deals with necessary in-place upgrades or if it only helps you with reboot-based ones.

Even if you don't feel in-place upgrades are safe, again, most enterprise-level customers are demanding update solutions that don't require them. There's a reason why ksplice, et al. are becoming popular.


  If you think OSTree needs filesystem snapshot support, you
  aren't understanding how it works. Or perhaps we don't
  have a shared definition of "robust" - for me, assuming
  correct behaviour at the filesystem and block layer, I
  believe deployment changes to be atomic.
While OSTree itself may not need filesystem snapshot support, for "robust OS upgrades" (the original title of the ycombinator story) you do need some sort of snapshotting mechanism. Whether that's in the form of ZFS-style snapshots, or boot environments is up to the implementer.

Also, I'd argue that assuming correct operation at the filesystem and block layer is a pretty big assumption :-) Even if the software itself works correctly, time has proven that trusting the hardware is not a good thing.

With that said, yes, I think OSTree needs filesystem snapshot support if admins are to rely on it to provide "robust" OS upgrades.

  How do you determine reboot-needed=true? Is that
  something assigned by the package developer statically
  (i.e. kernel is reboot-needed=true)? Determined
  dynamically at update time (like in the package metadata
  for a specific revision?)
Currently, reboot-needed is determined by the package creator at the individual item level by tagging individual "actions" in a package manifest. It's a way for a package creator to say "this component can't be safely updated on a live system, so force the creation and use of a boot environment clone when updating it."
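
As a sketch, the tag sits on individual actions in the manifest, something like this (payload hashes and most attributes omitted, paths invented):

  file path=kernel/drv/amd64/foo owner=root group=sys mode=0755 reboot-needed=true
  file path=usr/bin/bar owner=root group=bin mode=0755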

Further refinement of this functionality is planned for specific types of actions found in a package manifest, such as "driver" actions, but it will be handled by the package system transparently.

  Do you attempt to control for local configuration?
In what sense? pkg(5) allows an admin to force the creation and use of a clone boot environment as part of any package-modifying operation.

If you're talking about service configuration, that's handled via SMF (the service management facility) which has its own snapshot system that services use.

If you're talking about legacy service configuration that uses flat files on-disk, then package creators have the option of using service actuators to trigger a restart or refresh of the service whenever the configuration file changes.

You'd have to be more specific.

  Is the X server package reboot-needed=true?
No, because any files that the X server needs are already loaded into memory so updates to the on-disk files won't generally affect it. Any changes made to the X server binaries and libraries on-disk won't take effect until the X server is restarted.

An incompatible change in the X server itself (such as a protocol change, which is very rare) will likely come with a related kernel change which would require a reboot. But again, the package creator is in control of that.

Also, the "reboot-neededed" tag doesn't force a reboot per se; it just forces the update of a given component to be performed in a clone boot environment.


> With that said, yes, I think OSTree needs filesystem snapshot support

I'm happy with the fact that it allows admins to choose whatever they want at the filesystem and block level, the same as dpkg/rpm do. For example in a cloud environment, block level redundancy may be more easily provided at the infrastructure level for guests.

For those who do need redundancy, you're free to choose BTRFS, XFS+LVM, or hardware raid, whatever suits you.

As for the rest of your reply around the reboot-needed flag; it basically sounds like it's fairly manual, but maybe that's "good enough". I have a particular paranoia about race conditions though, so were I to design a system that attempted to live-update applications, it'd be a whitelist, not a blacklist as reboot-needed effectively is.


  I'm happy with the fact that it allows admins to choose
  whatever they want at the filesystem and block level, the
  same as dpkg/rpm do. For example in a cloud environment,
  block level redundancy may be more easily provided at the
  infrastructure level for guests.
While it may provide admins with choice, it doesn't provide them with "robust OS upgrades". Again, that's my only quibble here.

  For those who do need redundancy, you're free to choose
  BTRFS, XFS+LVM, or hardware raid, whatever suits you.
There's a large variance in those solutions that doesn't provide the same end-user experience (or even result), and without integration with the package system, doesn't really provide the sort of safety net most administrators are looking for.

  As for the rest of your reply around the reboot-needed
  flag; it basically sounds like it's fairly manual, but
  maybe that's "good enough".
It's only "manual" in some sense for package creators; not administrators. For administrators, it's a transparent decision about what's safe to update and what's not.

As for actually being manual, not really. The package system provides package creators with a tool called pkgmogrify, which allows them to set rules that cause transformation of actions based on patterns at package publication time. As an example, Solaris has a set of rules that say that, by default, any files delivered to /kernel should be tagged with reboot-needed=true.
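
I believe the rule looks something like this (pattern simplified, from memory):

  # at publication time, tag anything delivered under /kernel
  <transform file path=kernel/.* -> default reboot-needed true>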

I won't claim it gets the OS 100% coverage, but as has been said before "perfect is the enemy of good."

  I have a particular paranoia about race conditions though,
  so were I to design a system that attempted to live-update
  applications, it'd be a whitelist, not a blacklist as
  reboot-needed effectively is.
While I understand the paranoia, a whitelist is not actually a practical option. The vast majority of content delivered onto a modern UNIX or UNIX-like system doesn't require a reboot when being updated. man pages, header files, and most userspace binaries and libraries can all be safely updated without a reboot.

For example, on my desktop workstation, ~289,806 items have been installed by the package system. Of those, only 857 have been determined to require a reboot if they are updated.

That means that roughly 0.3% (yes, < 1%) of the items on my system have been determined to require updates to be performed in a clone boot environment (i.e. require a reboot).
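
(If you want to check your own system, I believe a query along these lines does it, though I'm typing from memory:)

  pkg contents -H -a reboot-needed=true -t file -o path | wc -l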

For enterprise-level customers, minimising the number of reboots needed to enact change is paramount. Any downtime at all can often cost them millions of dollars.


This would not be complete without a mention of Nix: http://nixos.org/


Absolutely true; fortunately the article already provides one.


While I like the idea of atomic upgrades, it really should be made to work without requiring a full reboot. The update procedure could be something like this (a rough LVM sketch follows the steps):

1) updater makes a private snapshot of the filesystem

2) updater writes its changes to the private snapshot while the rest of the system gets served by unchanged FS

3) updater stops affected services

4) the updated snapshot is swapped in (atomically)

5) services are restarted
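
A very rough sketch of how 1), 2) and 4) might look with LVM (volume names invented; the atomic swap is the genuinely hard part and is hand-waved here):

  # 1) private snapshot of the filesystem
  lvcreate --snapshot --name update-work --size 5G /dev/vg0/root

  # 2) mount it privately and let the updater write its changes there
  mount /dev/vg0/update-work /mnt/update-work

  # 4) fold the changes back into the origin volume; note that with classic
  #    LVM2 snapshots the merge is deferred while the origin is in use,
  #    which is exactly where "atomic, no reboot" gets hard in practice
  lvconvert --merge vg0/update-work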


You outline a good approach; however, the filesystem is not the only state. You also have RAM and network connections to deal with. The kernel people have been working on this using the freezer driver for control groups (cgroups). Their ultimate holy-grail-type goal is portability to the point where "freeze on host A", "thaw on host B" is possible near-instantly for arbitrary code without issue. In reality, they may never get there. For now, the easiest way to reliably approach the issue is to use applications that can rapidly shut down to a known, consistent on-disk state, thus limiting the scope of state to block data.

Designing applications/services for systems with read-only root filesystems is a good way to enforce a known location of on-disk state.

The above is the strategy I am taking with a personal project used internally by my company which predates docker, works for arbitrary OSs and has far wider scope.

With regards to block data, I took a different approach to docker, namely using LVM2 LVs instead of aufs. The limitation of LVM2 LVs is that snapshots are possible, but not yet snapshots-of-snapshots. That means whatever you want to spin up either has to be spun up slowly (i.e. not from a snapshot) or has to be non-snapshottable when running (due to LVM2's snapshot-of-snapshot limitation). We opt for the latter, which requires that state is not stored within the snapshot-powered system. This is good because you can power them off without fear, like conventional PXE-style diskless systems.
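
Concretely, with classic LVM2 snapshots (volume names made up):

  lvcreate --snapshot --name instance1 --size 2G /dev/vg0/template   # works
  lvcreate --snapshot --name nested --size 2G /dev/vg0/instance1     # refused:
                        # classic LVM2 snapshots can't themselves be snapshotted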


A wise man once said there is no such thing as a bootable system, only systems that have booted. Many OS updates will change something in the boot process (maybe the kernel, maybe some init scripts, etc.) and you need to reboot to be sure that stuff is still working. Likewise any change that would require the user session to logout/login might as well force a full reboot.


> Many OS updates will change something in the boot process (maybe the kernel, maybe some init scripts, etc.) and you need to reboot to be sure that stuff is still working

That's precisely why a manual upgrade process should not exist for system environments. Instead of live upgrades, you name, version, and test a newer, known-good image of the environment, with the required applications on it, on test infrastructure prior to final staging/deployment.


Keep in mind that some of this stuff is designed for near-zero-admin desktops, not just the cloud.


Fair point. Strange that anyone would bother, though; I always thought PXE was the best desktop solution... maybe it's for often-offline nodes or laptops.


Sure, PXE imaging is great for enterprise desktops but people don't use it at home.


How does this relate to what I've been reading about lately with docker and "containers"?


Docker applies similar concepts (change management at the file and directory level, a chrootable filesystem as the unit of delivery, atomic and revertable deployments), except it applies them 1 level higher in the stack.

Instead of rebooting machines into a new filesystem, docker spawns processes directly into them, using the sandboxing capabilities of the linux kernel. So you get a much more powerful and flexible deployment mechanism. But of course, you need a machine to exist in the first place, which docker isn't designed to provide.

So OSTree is a good companion for Docker, because you need a machine to run docker. A good workflow is probably: pack the bare minimum on a physical machine, using ostree. Then put everything else into docker containers. That's the approach of new "just enough" distros like CoreOS.


Right, this is a pretty good summary (that docker is 1 level above).

One fundamental difference is that OSTree has a custom serialization format for trees (inspired by git), whereas from what I can tell from the code, docker is just tar (I assume whatever the host /usr/bin/tar serializes to). For example, OSTree explicitly supports extended attributes, so it can support SELinux (and SMACK). Fedora ships a patched tar but...the tar format is a really serious mess.

I would further add though that OSTree does, providing the OS is compatible with it, allow booting a separate "deployment" as a container. So for example if you have Debian in /ostree/deploy/debian/90cd266 while you're booted into /ostree/deploy/fedora/562d0a, you can easily just systemd-nspawn /ostree/deploy/debian/90cd266 and boot the same OS as a container.
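
Spelled out, that's roughly (using the checksum path from the example above; -b asks nspawn to boot the container's init rather than run a single command):

  systemd-nspawn -D /ostree/deploy/debian/90cd266 -b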

But the emphasis right now of OSTree is indeed on bare metal deployments, and I'd like to push hard to integrate with package systems.


> you can easily just systemd-nspawn /ostree/deploy/debian/90cd266 and boot the same OS as a container

Perhaps we should be looking at integrating ostree and docker then? :)

> I'd like to push hard to integrate with package systems.

How would that integration work exactly?


> Perhaps we should be looking at integrating ostree and docker then? :)

It might make sense for docker to be able to store containers as OSTree commits in addition to tarfiles. But I haven't used it myself. On the HTTP side, OSTree may be less efficient or more efficient than what docker does on the wire for updates; I don't know. Static deltas will help significantly.

> How would that integration work exactly?

This section describes it very briefly: https://people.gnome.org/~walters/ostree/doc/adapting-packag...

Basically this is something you can do on a build server or on a client.


In my very limited understanding (could be wrong, but suspect not significantly) docker doesn't really seem to approach this issue... the docker user base seems to almost exclusively use Ubuntu, of ~latest version only.


From https://people.gnome.org/~walters/ostree/doc/ostree-package-...:

> OSTree only supports recording and deploying complete (bootable) filesystem trees. It has no built-in knowledge of how a given filesystem tree was generated or the origin of individual files, or dependencies, descriptions of individual components.

Um-m, isn't this what snapshots (e.g. LVM snapshots) are for? Correct me if I'm wrong.


The snapshot + "live upgrade in place" has quite different semantics. First, while I am actually looking at LVM snapshots for a different project, making them actually atomic would require integrating with the bootloader in a manner similar to what OSTree has done.

So...what you end up with is a bit of a layering mess because you have to keep track of both block and filesystem level bits.

While OSTree does pay a penalty in complexity in other ways (mainly the chains of indirect symbolic links for boot=), it's extremely flexible in that you can do whatever you want at the FS and block layer.

You can absolutely use LVM snapshots and OSTree together, and that's very much intended. For example, in the OSTree model, if one object gets corrupted, all trees using that object are corrupted.

So it's very much still recommended with OSTree to add redundancy at the block layer with LVM/RAID/BTRFS - again, you can do whatever you want there.

Finally, LVM snapshots don't directly help you parallel install independent operating systems with their own /var, and optionally /home.


What he's saying is that if you had (say) a FS image with nginx, mysql, and rails installed on it, the system has no idea of what files belong to mysql and what files belong to rails. So you wouldn't just be able to take said image and tell OSTree "remove mysql and replace it with postgresql".

Docker works in a really similar way, actually -- it creates file system images and tars them up, and then uses AUFS to separate changes in each container from the base image. Docker added metadata to describe how to build images; if you want to change the components (AFAICT) the "right" way is to change the metadata and then do a full rebuild.


> Docker added metadata to describe how to build images; if you want to change the components (AFAICT) the "right" way is to change the metadata and then do a full rebuild.

Correct. And Docker implements caching of build layers, so you get the semantics of a full rebuild, but in practice are only rebuilding the interesting parts.
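
For example (image name made up):

  docker build -t myapp .   # first build: every step runs
  # edit one line of the Dockerfile, e.g. swap mysql-server for postgresql
  docker build -t myapp .   # rebuild: the unchanged leading steps come from
                            # the layer cache; only steps after the edit re-run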


As far as I understand, this tool works at a higher level - with directories and files.

I will try to explain the concept loosely. For example, you have your favorite OS versioned in git (kernel, X server, apps, init daemon, session management...). This tool will take all that stuff, compile it, create the directory structure, and place the binaries in the appropriate places. In the end it will create a bootable directory structure (filesystem tree) that you can chroot into. But maybe you want to test a different revision of systemd. So ostree will take the chosen revision from git, compile it, and add the binaries to the filesystem while preserving the old structure. It will add only the new things that are different. Both versions will be bootable. It's basically a binary mapping of a git tree.
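
In command terms, that's roughly (repo path and branch name invented):

  # record a built filesystem tree as a commit on a branch
  ostree --repo=/ostree/repo commit -b myos/trunk -s "build" /path/to/built-tree

  # check out any recorded revision as its own bootable tree; files shared
  # between revisions are not duplicated
  ostree --repo=/ostree/repo checkout myos/trunk /ostree/deploy/myos/new-checksum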


This is closer to using a DVCS to manage the files on disk than to block-level snapshots.

See also the Solaris 10 IPS, which appears on the surface to have many similarities (checksums, HTTP delivery of files, etc.) to OSTree.


IPS and Boot Environments using ZFS are an OpenSolaris thing, available in Solaris 11 and some Illumos distributions like OmniOS. And someone has ported BEs to FreeBSD.

But of course, all the cool new features in Linux have been in use in production by Solaris people for years. :-)




