For the security enthusiasts out there, Docker 1.10 comes with some really cool Security focused additions. In particular:
- Seccomp filtering: you can now use BPF to filter exactly what system calls the processes inside of your containers can use (a short usage sketch follows this list).
- Default Seccomp Profile: Using the newly added Seccomp filtering capabilities we added a default Seccomp profile that will help reduce the attack surface exposed by your kernel. For example, last month's use-after-free vuln in join_session_keyring was blocked by our current default profile.
- User Namespaces: root inside of the container isn't root outside of the container (opt-in, for now).
- Authorization Plugins: you can now write plugins for allowing or denying API requests to the daemon. For example, you could block anyone from using --privileged.
- Content Addressed Images: The new manifest format in Docker 1.10 is a full Merkle DAG, and all the downloaded content is finally content addressable.
- Support for TUF Delegations: Docker now has support for read/write TUF delegations, and as soon as notary 0.2 comes out, you will be able to use delegations to provide signing capabilities to a team of developers with no shared keys.
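As a rough usage sketch (the profile path is just a placeholder, and exact flag syntax may vary slightly between releases):

# run a container under a custom seccomp profile instead of the default one
docker run --security-opt seccomp:/path/to/profile.json -ti ubuntu bash

# opt out of seccomp filtering entirely (not recommended)
docker run --security-opt seccomp:unconfined -ti ubuntu bash

# opt in to user namespace remapping when starting the daemon
docker daemon --userns-remap=default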
These are just a few of the things we've been working on, and we think these are super cool. Check out more details here: http://blog.docker.com/2016/02/docker-engine-1-10-security/ or let me know if you have any questions.
It's "funny" yesterday RKT made the announcement of their version 1.0 (with emphasis on security) and today we have 2 news about Docker at the top of HN with your comment about security.
By the way, you can use DockerSlim [1] to auto-generate custom seccomp profiles (in addition to shrinking your image). They are already usable, but they can be improved. Any enhancements or ideas are appreciated.
Disclaimer: I work for SUSE, specifically on Docker and other container technologies.
Docker containers /in principle/ do work with systemd. They are implemented as transient units when you use --exec-opt native.cgroupdriver=systemd (in your daemon's cmdline). I've been working on getting this support much better (in runC and therefore in Docker), however systemd just has bad support for many of the new cgroups when creating transient units.
So really, Docker has systemd support. Systemd doesn't have decent support for all of the cgroup knobs that libcontainer needs (not to mention that systemd has no support for namespaces). I'd recommend pushing systemd to improve their transient unit knobs.
But I'd rather like to know why the standard cgroupfs driver doesn't fulfil your needs? The main issue we've had with systemd was that it seems to have a mind of its own (it randomly swaps cgroups and has its own ideas about how things should be run).
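For reference, a minimal sketch of the daemon invocations being discussed (only the relevant flag shown; everything else is whatever you normally pass):

# systemd cgroup driver: containers are created as transient systemd units
docker daemon --exec-opt native.cgroupdriver=systemd

# cgroupfs driver (the default): libcontainer manages the cgroup hierarchy itself
docker daemon --exec-opt native.cgroupdriver=cgroupfs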
I'm not sure if we are talking about the same thing here. I'm talking about systemd inside a container (as pid 1). I think that's the part that's not working.
Every few days someone comes up with a new run script for docker (baseimage "my_init", etc). I personally use supervisord. Since systemd is already universal, might as well use that.
> I'm talking about systemd inside a container (as pid 1). I think that's the part that's not working.
Ah sorry, I misunderstood. I'm not sure why you'd want to use systemd as your process manager in a container. systemd has a very "monolithic" view of the system and I'm not sure you gain much by using systemd over supervisord (I'd argue you lose simplicity for useless complexity).
> Overlayfs is still causing some problems though.
I've been looking into overlayfs and I really encourage you not to use it. There has been an endless stream of bugs that the Docker community has discovered in overlayfs, and as far as I can see the maintainer is not particularly responsive. There are also some other issues (it isn't fully POSIX-compliant) which are pretty much unresolvable without redesigning it.
Whoa... Thank you so much for pointing out the issue with overlays. There seems to be no real consensus on what should be used. Could you talk about what should be used?
Devicemapper works okay (make sure you don't use loop devices). Unfortunately it's slow to warm up, but it's probably the most tested storage driver (it's the default on a bunch of systems).
btrfs works pretty well and is quite a bit faster. It's the default for SLE and openSUSE (as well as other distros which use btrfs by default). I'd recommend it (but I can't remember if it requires you to use btrfs on your / partition, which might be an issue for you).
ZFS, while being an awesome filesystem, I doubt has had much testing under Docker, so I'd be wary about using it.
And I've already told you what I thought about overlay. I'd like to point out that it's a good filesystem for what it was designed for (persistent or transient livecds) but the hacks in Docker in order to use it for their layering keeps me up at night.
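If it helps, a quick sketch for checking and switching drivers (the thin-pool device name is just a placeholder):

# see which storage driver the daemon is currently using
docker info | grep -i 'storage driver'

# devicemapper backed by a real thin pool instead of loop devices
docker daemon --storage-driver=devicemapper --storage-opt dm.thinpooldev=/dev/mapper/docker-thinpool

# btrfs (as far as I know it needs /var/lib/docker on a btrfs filesystem, not necessarily /)
docker daemon --storage-driver=btrfs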
Yeah - all of which mean that I have to move away from Linode.
They can't back up anything other than a vanilla ext4 volume. I set up a direct-LVM Docker VM on Linode (it was surprisingly easy), but Linode is refusing to back it up.
If Linode supports ZFS, you could use ZFS send/receive to make backups to your local machine. But like I said, the ZFS storage driver probably hasn't been well tested.
Nope, no backups for anything other than ext4. It's really weird and limiting.
I mean, LVM volumes have to be pretty standard, right? (I would argue more standard than ZFS.)
Because "only one process in a container" is a dangerous rule (because it has so many exceptions). In certain cases, that idea makes sense, but you shouldn't contort your app such that you only have one process in every container. Not to mention that there are other issues (PID 1 has unique problems that databases and your app aren't used to having to deal with).
I see, so it looks like if your process spawns other processes and doesn't reap them when they die, you end up in trouble with zombie processes in docker.
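Right. A rough sketch of the kind of wrapper people use (illustrative only; tini, dumb-init and the baseimage my_init mentioned above handle the corner cases more carefully):

#!/bin/bash
# minimal PID 1 wrapper: run the real app as a child of bash so that bash,
# as PID 1, reaps orphaned processes that get reparented to it
"$@" &
child=$!

# forward termination signals to the real workload
trap 'kill -TERM "$child" 2>/dev/null' TERM INT

# while blocked here, bash also collects any other children that exit
wait "$child"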
> The new manifest format in Docker 1.10 is a full Merkle DAG, and all the downloaded content is finally content addressable.
Can someone elaborate on this a bit more? From a CS point-of-view, sounds like a problem where a data structure came in handy but I'm not sure what it solves. Thanks!
A simple immutable data structure can be implemented with a Merkle DAG. Nodes in a Merkle tree store the hashes of the nodes they reference, and DAGs are directed graphs that don't loop around. Simple blockchains are one example. These structures provide immutable, versioned control of information. Containers are immutable, or like to think they are at least, so blockchains are an obvious thing to use in conjunction with deployments of said containers. At least that's what I keep telling everyone.
Do you literally mean a proof-of-work backed blockchain (like Bitcoin), or something more like git, which has a similar structure to a blockchain without the consensus mechanism?
I don't see how the former would be useful to someone deploying containers, but interested to hear your thoughts in either case.
I was referring to the second part of the comment:
"Containers are immutable, or like to think they are at least, so blockchains are an obvious thing to use in conjunction with deployments of said containers. At least that's what I keep telling everyone."
I'm pretty sure they mean that they are using Merkle DAGs. A blockchain is a Merkle DAG. The proof-of-work algorithm in Bitcoin is an algorithm for deciding how a node gets added to the blockchain. Depending on how you look at it, that algorithm is not part of what makes it a "blockchain".
Admittedly people are sloppy about how they use the term "blockchain". I would prefer that people use the term Merkle DAG and forget the term "blockchain" altogether, but I think we are stuck with "blockchain" ;-)
"People" are also sloppy about how they use the term "cloud", yet the world goes on with that concept in hand applying it to everything in site, often times in irritating ways. "Blockchain" is now a thing people can hold in their hand as a way to visualize the concept of a nearly immutable data store. That idea of storing something in an immutable way represents a shift in the way we can think about system's design. Calling it a "Merkle DAG" isn't going to kick off that insight any better than using "blockchain", but remembering what it really is and drawing the distinction with the right people can be immensely useful when trying to implement the insight.
Blockchains don't have to contain proof-of-work as long as the values in the chain itself aren't valued in and of themselves over longer time periods. In Bitcoin, a cryptocurrency built using a blockchain, the values represent debt owed to someone in exchange for a real world item, and that debt stays active for the life of the entry. There are a slew of proof-of-somethings that allow blockchains to become cryptocurrencies. I don't exactly subscribe to these ideas of value store as related to compute provisioning, but I suppose there could be some actions which might benefit from it, such as certain types of licensing.
That said, triggering provisioning using cryptocurrencies is likely to be a thing at some point.
> Docker 1.10 uses a new content-addressable storage for images and layers.
This is really interesting.
Sounds like the up/download manager has improved too. I did some early work adding parallel stuff to that (which was then very helpfully refactored into actually decent go code :), thanks docker team) and it's great to see it improved. I remember some people looking at adding torrenting for shunting around layers, I guess this should help along that path too.
IIRC, Docker has used content-addressable storage for layers for a very long time (in the form of filesystem directories whose names looked like md5 hashes). I'm not sure what's changed. Maybe just the hash function?
Correct, layers were hashed for verification at upload and download, then stored on a "regular" uuid-addressed storage. Now they are stored in a true content-addressed store end-to-end.
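A quick illustration of what that buys you (digest value elided):

# pull and run an image by its content digest rather than a mutable tag
docker pull ubuntu@sha256:<digest>
docker run ubuntu@sha256:<digest> echo hello

# list the digests of images you already have
docker images --digests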
Network-scoped aliases are really handy when dealing with a multi-container setup, so I'm really happy that they implemented this!
In previous versions, only the name of a container would be aliased to its IP address, which can make it hard to deploy a setup with multiple containers in a given network group that should address each other using their names (e.g. "api" host connects to "postgres") and then have multiple instances of those groups on the same server (as container names need to be unique).
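A sketch of that setup (image names are placeholders):

# each group gets its own network; containers resolve each other by alias
docker network create group1
docker run -d --net=group1 --net-alias=postgres --name group1_db postgres
docker run -d --net=group1 --net-alias=api --name group1_api myapp

# a second copy of the same group can reuse the same aliases on another network
docker network create group2
docker run -d --net=group2 --net-alias=postgres --name group2_db postgres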
I ran into an issue with files in a volume being created by users in the container that don't exist on the host. I was trying to figure out if this fixes that issue, and if so how? I played around with it for an afternoon but left confused.
Use case:
- Using Compass within a container for development
- Compass creates Sass files as whatever user it runs under within the container (likely root)
- The host must chmod them to do stuff with them
As a workaround, I've been building images and creating a user:group that matches my host. Obviously this destroys portability for other developers.
In Unix, a uid is just an integer (chown <random_int> file will work). In your case, the container created a file in a volume with a uid. This uid makes no sense on the host, but it leaks out to the host anyway since it's a 'volume'.
I think with the userns map, you can map container uid to a host uid. The files created in the volume will then be visible on the host as the mapped uid.
This is my understanding, I have to play with it :-)
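Roughly (the user name and id range are the defaults the daemon sets up, shown for illustration):

# /etc/subuid and /etc/subgid hold the subordinate id range, e.g.:
# dockremap:100000:65536

# start the daemon with remapping enabled
docker daemon --userns-remap=default

# root (uid 0) inside a container then shows up as uid 100000 on the host,
# so files it writes into a volume are owned by 100000 on the host side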
I think the canonical way (at least a while ago) was to only manipulate volumes using containers. There might be other solutions available now but this one seems to be the easiest, as it does not require to change the permissions or ownership on the files/volumes being manipulated.
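Something along these lines (container name, paths and the uid are placeholders):

# fix ownership from inside a throwaway container instead of on the host
docker run --rm --volumes-from compass_container busybox chown -R 1000:1000 /project/css

# or copy the generated files out without touching permissions at all
docker cp compass_container:/project/css ./css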
I think I might have caused a misunderstanding of the issue (I was trying to be brief). It really doesn't matter who I'm running Compass as (root or otherwise); the issue is that the container is writing files to my host with a different UID:GID than the one I'm using on my host machine.
I wouldn't normally run compass as root, it was incidental to the actual issue.
> 2. 'docker update' command, although I would have preferred 'docker limit'. Ability to change container limits at runtime:
- CPUQuota
- CpusetCpus
- CpusetMems
- Memory
- MemorySwap
- MemoryReservation
- KernelMemory
This is not correct. You cannot change kernel memory limits on Linux after processes have been started (disclaimer: I've done a bunch of work with the runC code that deals with this in order to add support for the pids cgroup). You can update everything else though.
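For what it's worth, a quick sketch of the new command (container name and values are illustrative):

# adjust limits on a running container
docker update --memory 512m --memory-swap 1g my_container
docker update --cpu-quota 50000 --cpuset-cpus 0-3 my_container

# kernel memory is the exception: per the comment above, it can only be set
# before any process has started in the container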
I love the ability to specify IPs, but I just want to give my containers static IPs from my private network, and attaching to my already existing bridge does not work. I started the daemon accordingly, but no help:
> ./docker-1.10.0 run --ip 172.16.0.130 -ti ubuntu bash
docker: Error response from daemon: User specified IP address is supported on user defined networks only.
But my KVM VMs work fine with that bridged network. I know I could just port forward, but I don't want to. Yes, it seems I am treating my containers as VMs, but this worked just fine in plain LXC; we could even use an Open vSwitch bridge for advanced topologies.
But don't user-defined networks create new bridges? I want to use my already existing network. Over SSH, I executed the following and lost my connection, because my eth1 and the newly created bridge probably conflict in the routing table.
Yes. `--ip` is supported only on user-defined networks. That is because the subnet for the default bridge (docker0) can be changed (via --bip), or the user can change the user-specified default bridge (via -b) on daemon restarts. If the subnet backing these default bridges changes, a container with an assigned `--ip` would fail as well.
Having said that, with Docker 1.9 & above, IP address management and Network plumbing are separate concerns and both are pluggable.
One could implement a network plugin with any logic and continue to enjoy all the built-in IPAM features (including --ip). Hence, if you have a specific network plumbing requirement, you could easily spin up a plugin (or use one of the many network plugins that are available out there).
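To make that concrete, the supported path looks roughly like this (note it still creates its own bridge rather than attaching to your existing one, which is where a network plugin would come in):

# create a user-defined network with an explicit subnet, then hand out a static IP
docker network create --subnet 172.16.0.0/24 mynet
docker run --net mynet --ip 172.16.0.130 -ti ubuntu bash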
Docker 1.9 brought in Native multi-host networking support using overlay driver. Proper multicast support for the overlay driver would require proper multicast control-plane (L2 & L3). Contributions welcome.
Yeah, but I needed this for scenarios where I managed the Docker bridge directly - i.e., running a set of streaming servers that are an insane hassle to set up and required frequent upgrades. Docker was perfect for building, upgrading and deploying them.
Weave would just get in the way in this scenario (and has a tendency to over-complicate simple stuff like running an ElasticSearch cluster with auto discovery)
> Weave ... has a tendency to over-complicate simple stuff like running an ElasticSearch cluster with auto discovery
Elasticsearch autodiscovery relies on multicast, AFAIK the only way to get it working with Docker is to use Weave (or another overlay network that gives you multicast). Is that not correct?
That is precisely my point. I want to do it without weave. As long as I can control the Docker bridge (and be responsible about it), I should be able to just do it.
It's the danger of running against "latest" all the time... but it's been a day of chasing my own tail when creating a new cluster (Mesos, but that really isn't the issue) and using some tools built against the prior version (volume manager plugin, etc.) that break with updates to Docker.
It seems like if one piece gets an upgrade, every moving component relying on some APIs may need to be looked at as well.
Did a PR on one issue.
Currently chasing my tail to see if a third party lib is out of whack with the new version or it's something I did.
The whole area is evolving and the cross pollination of frameworks, solutions (weave, etc), make for a complicated ecosystem. Most people don't stay "Docker only". I'm curious to see the warts that pop up.
I'm also running Mesos and Docker as a containerizer, and experienced the same problems (i.e. API change on the Docker volumes leads to broken volume driver implementations).
Even within the Mesos environment, there are so many nuts and bolts which have to fit together that sometimes I'm just fed up with the complexity. Furthermore, releases of Mesos and Marathon are not necessarily synched... Stateful apps? No go... Persistent volumes in Marathon? Maybe in v0.16... Graceful shutdown of Docker containers? No go...
The --tmpfs flag is a huge win for applications that use containers as unit of work processors.
In these use cases, I want to start a container, have it process a unit of work, clear any state, and start over again. Previously, you could orchestrate this by (as an example, there are other ways) mounting a tmpfs filesystem into any directories needed at runtime, starting the container, stopping it once the work is done, cleaning up the tmpfs, and then starting the container again.
Now, you can create everything once with the --tmpfs flag and simply use "restart" to clear any state. Super simple. Awesome!
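A sketch of that pattern (names are placeholders):

# scratch state lives in tmpfs, so a restart wipes it without recreating the container
docker run -d --name worker --tmpfs /scratch my_worker_image
docker restart worker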
I've had that problem too, and found that you can implement a "wait" command.
If you're using Docker Compose, add this to your environment:
environment:
  - WAIT_COMMAND=[ $(curl --write-out %{http_code} --silent --output /dev/null http://elastic:9200/_cat/health?h=st) = 200 ]
  - WAIT_START_CMD=python /code/lytten/main.py
  - WAIT_SLEEP=2
  - WAIT_LOOPS=10
Then, create a 'wait' bash script in your app's source code that looks like this:
#!/bin/bash
echo $WAIT_COMMAND
echo $WAIT_START_CMD

is_ready() {
    eval "$WAIT_COMMAND"
}

# wait until the dependency is ready
i=0
while ! is_ready; do
    i=`expr $i + 1`
    if [ $i -ge $WAIT_LOOPS ]; then
        echo "$(date) - still not ready, giving up"
        exit 1
    fi
    echo "$(date) - waiting to be ready"
    sleep $WAIT_SLEEP
done

# start the main process
exec $WAIT_START_CMD
Then, finally, for your nginx container, add:
command: sh /code/wait_to_start.sh
Apparently no one else has been paying an ounce of attention... And you get downvoted for it. The HN way!
https://github.com/docker/docker/issues/19474
Not least, you're forced to go through their DNS server, which doesn't support TCP.
Boy, this is absolutely going to fuck people. Because I bet a bunch of people are going to run Go containers in 1.10 engine. And guess what happens when you send a Go app a DNS response, in UDP format, that is larger than 4096 bytes?
You get a panic and crash! Woohoo! And yes, there are DNS servers that incorrectly throw out UDP DNS responses larger than 4096 bytes.
Can't wait for my containers to fail because of fucking Docker putting a DNS service in Engine. Unacceptable. Docker should've realized they needed to think about this stuff, all the while shykes was too busy picking fights with people as Kubernetes encroached on what he saw as "his" territory.
There's a reason that everyone is very excited about the rkt announcement today. Particularly amongst some Kubernetes users...
(In the interest of not tainting the waters, I do NOT work for Google)
It uses DNS only for discovering the other containers within the same custom network; if a query can't be answered there, the embedded DNS server forwards it to your own DNS server. I don't see how it can be worse than the /etc/hosts solution.
Exactly, obviously /etc/hosts will still take precedence, but instead of munging your /etc/hosts when starting a container you can just use their DNS server. I don't see a problem with this.
Because /etc/nsswitch.conf exists to do exactly what's needed here. Now there's an extra layer of need-to-know that adds confusion. I can almost guarantee that there's going to be a major outage somewhere, someplace because of this change.
Docker assumes it cannot trust the container's DNS resolver to respect TTL and cache in a compliant way. So it guarantees stable name->ip mapping for each container. As a result, when you point a service endpoint at another container, it's the IP/container mapping that is changed, which is a much more reliable and atomic change. I would definitely never rely on changing DNS records to orchestrate changes to my production stack, that would be way too brittle.
The real problem is that you can't trust the program's resolver, either. Java will behave differently than Go, which behaves differently from Python, and so on.
Nothing, since no one has used that backend in years. Note that the LXC command-line utility is not the same thing as Linux containers, which Docker still uses.
It's not deprecated anymore. It's been removed. People have been using the native driver (libcontainer) for 2 years. LXC was deprecated in 1.8 and the code was completely removed in 1.10.
Well ... it's an alternate backend which has been known to be "a bad idea to use" since 0.11. LXC stopped being the default a long time ago, and anybody using it right now REALLY shouldn't be.
Not to mention that I'm not really sure that Docker Inc has strong feelings about semantic versioning (I don't work for Docker).