Flat Docker Images (3ofcoins.net)
96 points by mpasternacki on Sept 22, 2013 | 28 comments



Very cool. Here's the corresponding GH issue for docker itself to get this sort of functionality: https://github.com/dotcloud/docker/issues/332


I have even linked to it in the article. I'm going to comment there and suggest this kind of approach - I wanted to flesh out the idea and validate the proof of concept in this script first.


I think the intermediate images are a deliberate feature, at least during Dockerfile development. They serve as a cache of the results at each step, so that expensive steps don't rerun between invocations of the Dockerfile, even if you've been editing later steps.

I'm basing this on the documentation I've read; I haven't tried it myself.


Yes, this is convenient and speeds up Dockerfile development. At the same time, this is an issue when you consider using Docker as a part of your production deployment toolchain. I think both ways should be supported: incremental multi-layered images for development or exploration, and then the possibility of creating a single, compact image for lower overhead in deployment. In a perfect world, it would be a switch for `docker build`, but I'm not fluent enough in Go to propose a solution on that side.

Some people have considered another approach: flattening an existing stack of images. Scripts for that are linked from the Docker issue on GitHub. I wasn't able to get any of these working, and the logic behind these seemed quite convoluted.

Still, my script is just a proof of concept - I tested whether it's possible to take the approach I use internally for Docker build scripts and use it to build images from Dockerfiles. It seems it's possible, and it delivers good results. Time and actual usage will show whether it's a good idea; if this approach makes sense, it will hopefully make its way into the Docker core, and my hack won't stay relevant for too long.


Can you not just `docker export [imageid] > myimage.tar` and `docker import - < myimage.tar`?

EDIT: from the [github issue](https://github.com/dotcloud/docker/issues/332#issuecomment-2...):

"Currently the only way to "squash" the image is to create a container from it, export that container into a raw tarball, and re-import that as an image. Unfortunately that will cause all image metadata to be lost, including its history but also ports, env, default command, maintainer info etc. So it's really not great."


I want to inherit from the base image, to keep the shared files actually shared. This is a feature. It's just the dozen layers of that inheritance that bothers me.


Well, sometimes you do want a layered image, just not so heavily layered. For example, most images will be layered on top of a small number of base OS images which you don't want to download all over again.


In practice, I find that docker repeats intermediate run steps every time you run docker build.

Which makes sense to me, because you have no idea if an arbitrary shell command is deterministic or not.


A shell command is indeed not really deterministic, but Docker won't repeat a RUN step as long as it isn't preceded by a step that Docker considers non-deterministic, such as ADD.

So a good way to optimize your Dockerfile is to put commands in an order like the following (see the sketch after the list):

* dependencies, e.g. apt-get, useradd...

* container config instructions (EXPOSE, ENV, USER), from least likely to change to most likely to change

* ADD commands

* final RUN commands to set up your image
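
A sketch of that ordering (package, user and path names are made up) - the slow dependency layers stay cached, while the frequently-changing ADD and final RUN steps are the only ones that get rebuilt:

    FROM ubuntu
    MAINTAINER Example Author <author@example.com>
    # dependencies: slow, but rarely edited, so the cache usually hits
    RUN apt-get update && apt-get install -y nginx
    RUN useradd -m deploy
    # container config, from least to most likely to change
    EXPOSE 80
    ENV APP_ENV production
    USER deploy
    # ADD invalidates the cache whenever the added files change
    ADD . /home/deploy/app
    # final RUN commands that depend on the added files
    RUN cd /home/deploy/app && ./setup.sh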


Docker 0.6.2 added -rm as a builder parameter to delete intermediate containers.


Doesn't it just remove the containers, but still keep the generated images layered? I'll take a look (running 0.6.1 here), but it seems to solve a different issue.


-rm is useful but orthogonal to this -- the issue addressed here is the depth of the image stack, not the temporary build containers.


As a graphic designer who clicked: Darn.


Nice, thanks!

That lets me work around a 'blocker' [1] right now: I can write a normal Dockerfile and use this ready-made tool to test/build the image until my 'docker build' issue is resolved one way or another. Cool!

1: https://github.com/dotcloud/docker/issues/1916


That's a pretty perl script. Well done.


It comes as a bit of a surprise to some, but a huge amount of this kind of "advanced sysadmin" work is carried out in Perl.

The kind of people who use Perl like this are generally much more advanced users of the language than the developers of the clumsy CGI scripts that might have formed your first impressions of Perl.


Yup. Matt's Script Archive was a long time ago. Modern Perl is about a million times better and easier to use than it used to be.

Dancer, Moose, DBIx::Class (a bit more advanced), Plack/PSGI, etc etc. Not to mention the language changes they've made with 5.10/5.12/5.14/etc (I'm in love with the defined-or operator).


>Yup. Matt's Script Archive was a long time ago...

Argh. FormMail. FormMail. It's a near-PTSD-like flashback.


"Docker itself is intentionally limited: when you start a container, you’re allowed to run only a single command, and that’s all. "

- Not exactly true: you can launch a shell as that one command, in interactive mode, and then run as many commands in that shell as you'd like.


Did you read the rest of the post? He goes on to say you can do that. In fact, the entire message of the post is that entering the container, running commands, and committing at the end is more efficient (in terms of number of layers and disk space, at the cost of what else I don't know yet) than using a Dockerfile and 'docker build'.
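
For anyone who hasn't tried the manual route, it looks roughly like this (image names are made up):

    # start an interactive shell in a fresh container
    docker run -i -t ubuntu /bin/bash
    # ...inside the container, run as many commands as you like, then exit...
    # back on the host: commit the stopped container as a single new layer
    docker ps -a                            # find the container's ID
    docker commit <container-id> my/flat-image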


The main cost is less clarity: the build steps aren't isolated anymore, so it's harder to pinpoint issues. There's also the obvious risk of my script not interpreting all options correctly.

Actually, I've just disabled the VOLUME statement in the script, as it seems to be a no-op in Docker. The only trace it leaves in the image is setting the image's command to '/bin/sh -c "#(nop) VOLUME [\"/data\"]"'.


TLDR: Docker uses aufs to provide copy-on-write snapshots, integral to docker container-image builds. aufs is not that widely used, and reportedly has a depth limit of 42. This script flattens an entire build process to a single snapshot to avoid said issue.

Context: docker people have already announced an intention to work to unlink themselves from aufs dependency.

Alternatives/Reality-check: LVM2 can provide snapshots at the block layer, either through the normal approach with a single depth limit (though you can un-snapshot a snapshot through a process known as a merge, and then snapshot again as required), or through the new/experimental thin provisioning driver to get arbitrary depth (but 16GB max volume size). In both cases it's filesystem-neutral, and the first approach is very widely deployed, which means no roll-thy-own-kernel requirement. zfs and btrfs also provide snapshots, but historically the former has been poorly supported/slow (userspace driver or build your own kernel for zfs) and the latter unfinished/in development (btrfs). Linux also supports the snapshot-capable filesystems fossil (from plan9), gpfs (from IBM), and nilfs (from NTT). A related set of options are cluster filesystems with built-in replication, see https://en.wikipedia.org/wiki/Clustered_file_system#Distribu... Overall, the architectural perspective on various storage design options can be hard to grasp without digging, and higher-layer solutions such as NoSQL distributed datastore applications remain strong options in many cases.
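
For the curious, the classic LVM2 flow looks roughly like this (volume group and LV names are made up):

    # take a copy-on-write snapshot of an existing logical volume
    lvcreate --size 1G --snapshot --name base-snap /dev/vg0/base
    # ...use /dev/vg0/base-snap...
    # later, fold the snapshot back into its origin ("merge"); once merged
    # it disappears, and you can snapshot again as required
    lvconvert --merge vg0/base-snap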

Trend/future?: Containers in general are moving towards formalizing the "here's what I need: x-depth snapshots with y-availability and z-redundancy" environment requirements specifications for software. In the nearish future I predict that we'll see this in terms of all types of resources (network access at layers 2 and 3, CPU, memory, disk IO, disk space, etc.) for complex, multi-component software systems as CI/CD processes mature and container-friendly software packaging becomes normalized (we're already much of the way there for single hosts - eg. with Linux cgroups). Infrastructure will become 'smarter', and the historical disconnect between network gear and computing hosts will begin to break down. Systems and network administration will tend to merge, and the skillsets will become rarer as a result of automation.


LVM snapshots have some issues:

* you have to preallocate the size of the snapshot backing storage

* if you create N snapshots of the same base block device, then for each block changed in the base, each of the snapshots gets its own copy-on-write block added to its backing storage

* you cannot resize a snapshot (I mean the logical volume size, not the storage area for CoW data)

* you cannot shrink the snapshot backing storage

Snapshot-aware filesystems solve these issues. The slowness of ZFS you mention is only true for the FUSE-based toy driver. The license incompatibility between ZFS and the Linux kernel is a source of much confusion. All it means is that you cannot distribute Linux kernel binaries linked with ZFS code (where a kernel module can be seen as parts of the Linux kernel API linked with ZFS code). However, nothing prevents you from compiling the module on your own machine, and there is a nicely packaged solution for doing this, with support for several distributions:

http://zfsonlinux.org/

There is also a new place for promoting ZFS: http://open-zfs.org

AuFS seems to me a rather pragmatic approach for those who don't need the advanced features and performance of an advanced filesystem, yet don't want to waste IO bandwidth just to provision a lightweight container.


All good points. I guess in response the only two things I would add are: (1) If snapshots are for backup (most frequent use case? I guess so!) then LVM2 can do it for you without an exotic FS already. Sure, you may have to preallocate. But it's generic (not filesystem-linked), so if you're an infrastructure provider it future-proofs your backup implementation. Sometimes that's worth a lot more due to engineering and testing cycles. (2) You probably can shrink the snapshot backing storage if you remove the snapshots, for example after a snapshot is complete and the data has been copied elsewhere to long-term storage (cheaper/slower/remoter/more geographically dispersed disks?). You can make a new one next time you need it. That said, people who are that short on disk space are few and far between these days... it's cheap.


The issue here is not only the depth limit, but also:

* performance overhead of each layer, however small

* disk space for files removed in intermediate steps (scenario: ADD huge-ass source tarball, commit, RUN compile+install+remove, commit - the user still has to download the huge-ass source tarball to use the final image, which doesn't even contain it; see the sketch after this list)

* there's often just no need to publish intermediate layers; there may even be a good reason to not publish them (say, I distribute a program compiled with a proprietary compiler as a step of the build, but can't distribute the compiler itself)

* simplicity of having just one image for the user to download and for the publisher to distribute, rather than a whole chain (this will be more important when we are able to use anything other than the registry to distribute images)
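
To make the second point concrete, a sketch (file and path names are made up; assumes build tools are already in the base image) where the tarball's bytes live on in the ADD layer even though the final filesystem no longer contains them:

    FROM my/build-base
    # this layer permanently stores (and unpacks) the tarball
    ADD huge-ass-source-1.0.tar.gz /build/
    # this layer records the build products and the removal of /build, but
    # users still have to download the ADD layer above to use the image
    RUN cd /build/huge-ass-source-1.0 && ./configure && make \
        && make install && rm -rf /build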


All valid points.

I guess a lot depends on other aspects of your project. For example, if you are looking at distributing frequently, and rsync is an option, then bandwidth concerns are effectively nullified. Likewise, disk space diffs for a few installs on a base filesystem are not big and thus not really expensive to keep. But I agree with you.

One aspect is crypto: signing a tarball is easier than a bunch of files.
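
E.g. something along these lines for an exported image tarball (file names are made up):

    gpg --armor --detach-sign myimage.tar      # publisher: produces myimage.tar.asc
    gpg --verify myimage.tar.asc myimage.tar   # user: verify before importing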


Off-topic, but pertains to Docker images:

Do people usually roll out their own images from source/based on verified binaries from the parent distribution's repositories or are base images provided by the community?


I've seen both; Docker's main registry provides some base images (the most used is named `ubuntu` and has base systems for Precise and Raring), and I've seen many images descending from the author's own base - it's quite easy to prepare a base image using debootstrap or other distros' equivalents. I can't speak for non-Debian-ish distributions, but debootstrap does verify its downloads.
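
For example, a Debian-ish base can be built and imported roughly like this (suite, paths and image name are made up):

    # build a minimal Precise rootfs, then feed it to docker as a base image
    sudo debootstrap precise ./precise-root http://archive.ubuntu.com/ubuntu
    sudo tar -C precise-root -c . | docker import - my/precise-base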

The place of trust here is the registry - usually, for convenience, tags are used rather than hashes (and I'm still not quite sure whether the long hex IDs are hashes or just unique random names). The registry returns a hex ID for a given tag, and is trusted to deliver the correct files for that ID.

I believe the main index/registry runs over HTTPS and provides basic security, but it would be a huge issue if it were compromised. It's quite easy to run your own registry, too. What I'd love to see on top of that is some kind of GPG-based verification of downloaded images (Debian has this problem basically solved in APT).



