This exploit is interesting, but if you are doing container security correctly it's actually not a big deal. In particular, if you are setting per-container user namespaces, like you ought to be, then this exploit doesn't do anything. In fact you can actively give a user-namespaced container any CAPs you want, because they are isolated to that container's uid:gid offset.
Obviously, giving containers unnecessary CAP privileges is unwise, but if you are practicing sound security best practices then there would be multiple layers of defense between you and this CVE. I think a strong AppArmor profile and seccomp profile would also make this CVE moot.
Edit:
Also, this exploit relies on you being able to fork up to a certain pid value. You can and should take advantage of Linux's per-cgroup limits (the pids controller); see the sketch below. No container needs more than 255 threads (even if some do, you can make special exceptions for such applications).
Edit2: Additionally, this CVE relies on the getuid syscall being available. There is no reason to give a container this syscall; you should block it, a la this guide: https://rhelblog.redhat.com/2016/10/17/secure-your-container...
I have to say I'm more than a little disappointed in Twistlock for not pointing out what countermeasures you can employ against this and other CVEs.
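For the per-cgroup cap, a minimal sketch using the cgroup v1 pids controller ("mycontainer" is a hypothetical group name; Docker exposes the same knob as docker run --pids-limit):

    /* pids_cap.c - a minimal sketch, not production code.
     * Caps a cgroup at 255 tasks via the v1 pids controller; the path
     * assumes cgroup v1 mounted at /sys/fs/cgroup. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/fs/cgroup/pids/mycontainer/pids.max", "w");
        if (!f) {
            perror("pids.max");
            return 1;
        }
        fprintf(f, "255\n");  /* fork()/clone() beyond this fail with EAGAIN */
        return fclose(f) == 0 ? 0 : 1;
    }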
1) User namespaces don't magically protect you from a vulnerability that allows writing to kernel memory. Neither would AppArmor. seccomp could theoretically, but waitid is a pretty fundamental Unix syscall and blocking it would break a lot of basic software (a sketch of such a filter follows at the end of this list).
The author devised a particular exploit, but his example was hardly the only way to leverage the vulnerability. Being able to write to kernel memory is about as huge a vulnerability as you can get. Just because you can't think of a way to leverage a vulnerability doesn't mean an attacker can't; your failure of imagination is not evidence that it cannot be done.
2) Plenty of containers need more than 255 threads. Like, pretty much any Java server. In any event, this particular exploit doesn't necessarily require hundreds or thousands of simultaneous processes.
3) Blocking getuid is even worse than blocking waitid. Block getuid and you'll break glibc and god knows what else. In any event, it would be futile, as the real and effective UIDs are passed to the process through the auxiliary vector (auxv) when the kernel executes it.
4) You're missing the forest for the trees. The real moral of the story is this: "In 2017 alone, 434 linux kernel exploits were found". Unless you're prepared to pore over every published exploit, 24/7, meticulously devise countermeasures, and run effectively crippled software, you really shouldn't be relying on containers to isolate high-value assets. I wouldn't rely on VMs, either, as the driver infrastructure of hypervisors has also proven fertile ground for exploits.
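To make the seccomp point in 1) concrete: mechanically, blocking waitid is trivial; the cost is everything it breaks. A minimal sketch using libseccomp (link with -lseccomp; as far as I know no shipped runtime profile actually does this):

    /* deny_waitid.c - a minimal sketch, not a recommendation.
     * Installs a default-allow filter in which waitid() fails with
     * EPERM; a contained workload would then be exec'd under it. */
    #include <errno.h>
    #include <seccomp.h>

    int main(void)
    {
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
        if (!ctx)
            return 1;
        if (seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM),
                             SCMP_SYS(waitid), 0) < 0 ||
            seccomp_load(ctx) < 0) {
            seccomp_release(ctx);
            return 1;
        }
        seccomp_release(ctx);
        /* exec the workload here: anything that reaps children through
         * waitid() (or a libc wrapper built on it) now breaks */
        return 0;
    }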
It's not a full kernel-memory CVE; you have ±255 bytes of access to kernel memory from the cred pointer. I have no idea if that extends to userns or not. Also, I think you're confusing Java threads with system threads; they are not the same.
I think you're being overly alarmist. You have to trust someone else's code at some point; otherwise you'll be paralyzed by non-productivity.
> It's not a full kernel-memory CVE; you have ±255 bytes of access to kernel memory from the cred pointer. I have no idea if that extends to userns or not.
As I understand it, a kuid_t is the UID in the root namespace, so setting your cred->uid to 0 gets you treated as equivalent to root on the container host.
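You can watch that translation from userspace: /proc/PID/uid_map is where a namespace's UIDs get converted. A minimal sketch (run it with and without a user namespace to see the offset):

    /* show_uid_map.c - a minimal sketch.
     * Each line of uid_map reads: <uid inside the ns> <uid outside> <count>.
     * Read from the init namespace, the middle column is effectively the
     * kuid; the exploit's trick of writing 0 straight into cred->uid
     * skips this mapping entirely. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/self/uid_map", "r");
        if (!f) {
            perror("uid_map");
            return 1;
        }
        char line[128];
        while (fgets(line, sizeof(line), f))
            fputs(line, stdout);
        fclose(f);
        return 0;
    }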
Also, don't think that limited exposure to kernel memory saves you - take a look at the sudo "vudo" exploit from 2001, in which a single byte that was erroneously overwritten with 0, and then put back, turned out to be exploitable. http://phrack.org/issues/57/8.html (And in general, don't confuse the lack of public existence of an exploit with a proof that a thing isn't exploitable in a certain way.)
> Also, I think you're confusing Java threads with system threads; they are not the same.
Current versions of the HotSpot JVM (where by "current" I mean "since about 1.1") create one OS thread per Java thread: http://openjdk.java.net/groups/hotspot/docs/RuntimeOverview.... "The basic threading model in Hotspot is a 1:1 mapping between Java threads (an instance of java.lang.Thread) and native operating system threads. The native thread is created when the Java thread is started, and is reclaimed once it terminates." Plus there are some other OS threads for the runtime itself.
> I think you're being overly alarmist. You have to trust someone else's code at some point; otherwise you'll be paralyzed by non-productivity.
Sure, but you can choose which code to trust, and how to structure your systems to take advantage of the code you trust and not the code you don't. Putting mutually-distrusted things on physically separate Linux machines on the same network is a pretty good architecture: I trust that the Linux kernel is relatively low on CVEs that let TCP packets from a remote machine overwrite kernel memory.
255 bytes is huge, though IIUC it's actually less than that. Nonetheless, it's much more than is typical. Sometimes these holes are limited to a single word, and only a single value for that word (like NULL), and attackers still come up with marvelously devious exploits.
The critical vulnerability is that the cred pointer address is entirely under your control, so you get to poke at whatever kernel memory you want. The limitations are 1) locating the address of what you want to poke, and 2) the smallish range of values you can write out.
Also, I'm not confusing Java threads with system threads. Most JVMs use a 1:1 threading model. And because on Linux a thread is just a process (which unfortunately still causes headaches with things like signals, setuid, etc), each thread has its own PID.
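That's easy to verify from userspace; every pthread shows up as its own kernel task. A minimal sketch (link with -pthread):

    /* tids.c - a minimal sketch showing 1:1 threading on Linux. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static void *worker(void *arg)
    {
        /* each thread is a separate kernel task with its own TID */
        printf("worker tid=%ld\n", (long)syscall(SYS_gettid));
        return NULL;
    }

    int main(void)
    {
        printf("main pid=%d tid=%ld\n", getpid(), (long)syscall(SYS_gettid));
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        pthread_join(t, NULL);
        return 0;
    }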
I'm not being alarmist, just realistic. Nobody is going to stop using Linux anytime soon. Nor am I. But the fact of the matter is that the Linux kernel is riddled with vulnerabilities. Something like the waitid vulnerability comes along at least 3 or 4 times a year, and that's just the published ones. (IMO, part of the reason is precisely because of complex features like user namespaces, which add tremendous complexity to the kernel. But that's a contentious point.)
At least for high-value assets (however you want to define that), people should just treat Linux as if it lacks secure process isolation entirely, absent a commitment to herculean efforts--extremely locked down seccomp, PNaCl-like sandboxing, etc for all your code that juggles tainted data. Even then, vulnerabilities like recvmmsg come along and ruin your day, but those are rare enough that it would be unfair to single out Linux.
Not only is that pragmatic and reasonable, after 25 years of endless vulnerabilities of this sort I wouldn't trust the judgment of anyone who thought otherwise. And for what it's worth, I'd make much the same point about Windows, although I have much less experience on that platform.
Empirically, among the general purpose, commodity platforms OpenBSD has one of the best track records. Professionally I've had success placing OpenBSD in roles where Linux boxen were constantly hacked. But IT departments in particular, and application developers generally, dislike OpenBSD or would dislike it if it were forced upon them.
More importantly, while nowhere near as bad as Linux, macOS, or Windows, OpenBSD has at least one published severe local kernel vulnerability every year or two. In many cases those OpenBSD boxen I mentioned survived _despite_ being neglected by IT and not kept up-to-date; I know for a fact some were in a known, locally exploitable state for a not insignificant period of time. That makes me think a big part of their relative security is simply related to OpenBSD not being a common target of rootkits and worms. I have little doubt a sophisticated attacker could root an OpenBSD box from the shell, for example, if he was targeting that box. (My rule of thumb when isolating services is that anything running a web service using a non-trivial framework (PHP, NodeJS, etc) provides at least the equivalent to shell-level access to a targeted attacker. Among other things, that means even if I'm writing a privilege separated, locked-down, formally-verified backend daemon, I assume as a general rule that any data it's trying to protect isn't safe from that front-end web application unless it's running on separate hardware.)
While I don't think that security and convenience are necessarily mutually exclusive, as a practical matter they are largely mutually exclusive. Unless you're prepared to accept the burden and cost of using a specialized OS like seL4--and in particular use it in a way that preserves and leverages the stronger security guarantees--your best bet is simply to use separate servers when you want a significant degree of isolation assurance. Separate hardware is not sufficient (if all your boxes have Intel ME cards, or have firmware pushed from a puppet server, or share an engineer's account whose hacked desktop is logging SSH keys, passwords, and Yubikey PINs), but it's largely necessary. This is true whether you're concerned with targeted or opportunistic attacks, but _especially_ opportunistic attacks, which are by far the most common and, in many respects, an important element to targeted attacks.
Separate hardware is a uniquely simple, high-dividend solution. But the point is to be realistic about the actual robustness of your solutions, to be able to easily identify and mitigate risks, so you can make more informed security decisions. And it all depends on what you're protecting and what sort of investment you're capable of making. Just endeavor to predicate your decisions on accurate and informed assessments. Among other things, that means being honest about and understanding your own lack of knowledge and experience.
Similarly, continuity and long-term maintenance are critical to security, which means you need to be honest about institutional capabilities (or for a private server, what you're prepared to track and upgrade 3 years from now.)
Linux, OpenBSD, co-location, and cloud hosting can all be part of a perfectly robust strategy. And HSMs probably should be, too, which is basically just a way to attach a tiny, isolated, dedicated piece of hardware to a computer. But none of these options alone are likely to be reasonable choices, all things considered, especially in the organizational context.
> No container needs more than 255 threads
> Additionally, this CVE relies on the getuid syscall being available. There is no reason to give a container this syscall; you should block it
The problem with MAC schemes is that, in practice, they lead to security people imposing random and arbitrary restrictions on general APIs in the name of least privilege. In doing so, they break the orthogonality of general-purpose platform concepts and break the reductive mental model necessary to get anything done. It's a misunderstanding of what least privilege actually means.
Security is better achieved by creating clear, principled security domains and boundaries, then controlling access to these domains in a general and transparent way. Saying "you, unix process, you can call system call X, but not system call Y, because in my opinion, Y is risky", when neither X nor Y breaks through a security domain, is bad practice. So is arbitrarily capping the number of threads in a container.
> Edit2: Additionally, this CVE relies on the getuid syscall being available
This exploit relies on it. The vulnerability does not. The exploit happens to use getuid() along the way to using heap spraying, but the writeup is pretty clear that neither getuid() nor heap spraying is required.
Yeah, and I'm wrong about that part anyway. You can't restrict or block getuid without breaking glibc. I meant setuid, but that call isn't used in this exploit. I got confused.
Hmm. Perhaps this is a difference between dev and ops, but almost every tool we use comes out of the box with settings unfit for production. Instead, they're tuned for development, and in some cases, deploying to a staging environment. At least, this has been the case in my experience.
Ah, dev... ops... How about DevOps? As a (originally) dev I bring my app to production. How do I stand a fighting chance to reconfigure the defaults in the way you suggest, without suddenly gaining a whole new set of skills? Good defaults would be helpful, even if they're very conservative. I can break things open, but at least then I know what to read up on.
100% agree here. Ship with secure settings by default and have simple "developer guides" that show what to crack open for easier use in non-production environments.
I would argue the opposite. As a developer I want to have tools that make my life easier to - you guessed it - develop. Enabling unnecessary secure defaults that either hinder or don't apply to my use case is silly.
There's a reason most users choose Ubuntu over OpenBSD as their workstation. I would put good money on the reason being that it's "secure enough" without getting too heavy-handed on production use cases.
However, I do agree that there has to be a balance. Most tooling I write tends to lean more towards the "good user experience" side first, and then document the production use case. Either that or release two separate (but similar) products; one for developers, one for operations teams. Docker's doing that with the Community Edition/Enterprise Edition, but I still think the Community Edition is far too heavy-handed when it comes to things like pulling images from "insecure" registries.
> As a developer I want to have tools that make my life easier to - you guessed it - develop. Enabling unnecessary secure defaults that either hinder or don't apply to my use case is silly.
The problem is that a lot of these insecure defaults will be rolled out to production by a "developer", only to get hacked later, because marketing and frictionless dev are more important than sane default security.
> Over a quarter of MongoDB databases left open to the internet have been ransacked by online extortionists.[0]
They "forgot"/"did not know" the password is not set. This was for db tech.
Now you want your average dev, who is being "forced" to use docker more and more, to know how to set up container tech securely?
But hey, I need to get this on my CV and it is just click click install. So what can go wrong?
You're touching upon a natural tension, where developers take certain shortcuts that operations people then have to deal with, which, I suppose, DevOps practices try to resolve. The developers pay the price of less freedom for that, or if you will, the price of (production) reality. If that's not what your shop is doing, then what you say makes sense to me, and clearly DevOps practices aren't for everyone.
I don't think the tension is necessary in most cases. You can make something developer-friendly and secure at the same time; you just have to allow for both. Simple things, like taking any credentials from environment settings, can make it easy for both dev and ops/prod to work with and be secure.
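For instance, a minimal sketch (the variable names are hypothetical):

    /* a minimal sketch: pull credentials from the environment so dev,
     * staging, and prod can each inject their own without code changes.
     * APP_DB_USER / APP_DB_PASSWORD are made-up names. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *user = getenv("APP_DB_USER");
        const char *pass = getenv("APP_DB_PASSWORD");
        if (!user || !pass) {
            fprintf(stderr, "APP_DB_USER / APP_DB_PASSWORD must be set\n");
            return 1;  /* fail closed instead of falling back to a default */
        }
        printf("connecting as %s\n", user);  /* never log the password */
        return 0;
    }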
> In particular if you are setting per-container usernamespaces, like you ought to be, then this exploit doesn’t do anything.
User namespacing in docker is enabled at the daemon level, not per container, so all containers share the same offset. This would ensure that a root user in the container would escape to a different uid on the host, but doesn't prevent someone from moving sideways through the containers on the same host.
Note that enabling this will break the developer workflow of mounting files from the host into the container. I believe files will show up with the wrong ownership inside the container.
Sure. User namespacing is a container security feature that allows you to grant a process root access to a filesystem while the process itself is not root. To the running process it appears that it is or can run as root, but on the host it actually isn't root, just some uid:gid offset. Here's the man page explaining more:
http://man7.org/linux/man-pages/man7/user_namespaces.7.html
The gist is that the container is further sandboxed by the kernel, independently of the higher-level security precautions. It's not perfect by itself, but used in conjunction with other features like AppArmor or SELinux and seccomp it can make a container virtually sandboxed.
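A minimal sketch of the mechanism from userspace, assuming unprivileged user namespaces are enabled (no container runtime involved):

    /* userns_demo.c - a minimal sketch of user-namespace uid mapping.
     * Become "root" inside a new user namespace while remaining an
     * unprivileged uid on the host. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        uid_t outer = getuid();
        if (unshare(CLONE_NEWUSER) == -1) {
            perror("unshare");
            return 1;
        }
        /* map uid 0 inside the namespace onto our unprivileged outer uid */
        char map[64];
        int n = snprintf(map, sizeof(map), "0 %u 1", outer);
        int fd = open("/proc/self/uid_map", O_WRONLY);
        if (fd == -1 || write(fd, map, n) != n) {
            perror("uid_map");
            return 1;
        }
        close(fd);
        printf("inside: uid=%u (was %u on the host)\n", getuid(), outer);
        return 0;
    }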
Because the Docker project doesn’t make money off of security. It is actually quite infuriating, because they have become the de facto container image standard. Most of their security has actually come from Twistlock (I am not a Twistlock employee, FYI). My recommendation to most Admins or Devs that are serious about container security is to let your developers use docker, but run your images with CRI-O on your servers:
http://cri-o.io/
The docker project has done quite a bit of work on container security, so I'm not sure it's entirely fair to say that their not enabling user namespaces by default is for that reason.
For example the work that was done on their seccomp filtering and apparmor profiles.
My guess would be that it's because user namespacing can introduce some issues (e.g. when mounting host filesystems), and they've decided the trade-offs aren't worth it.
Viewing docker containers as anything more than a bundling and deployment system is a mistake. While they might help with security they will never be completely secure and you should architect your deployments with that in mind.
Unless you are a giant enterprise shop with the resources to staff a decent sized K8s team, you should use the hosted solutions.
There are trade-offs to using userns, and many people don't like the current set of trade-offs. In addition, changing a default like this is a breaking change.
Admins can enable userns by default in a daemon, but making it a hard-coded default is much more difficult.
It's not just a matter of enabling user ns. There is no support at the vfs layer for uid/gid mapping. This means that in order to use it, images must be chowned with the remapped IDs.
Per-container mappings are not supported for this reason (it would require copying and chowning the entire image for each container mapping).
Do you care to qualify your statement about CRI-O?
I recall seeing some patches submitted to make it possible to pass a uid/gid offset to the mount syscall at one point, when people were implementing user namespaces for container runtimes like docker. So is this fixable without having to make every filesystem implement this feature, or is there something else holding back better support for uid shifting for use with user namespaces?
Nah, Docker has taken lots of strides in the right direction for security over the years, Twistlock or not, albeit with a few weird dangling remainders. They'd love to turn user ns on by default but it'd break lots of existing stuff. Many more users would be mad about having that on by default than leaving it off.
CRI-O is bleeding edge. I'm not sure it's ready for production usage. But it looks very promising. The sooner we can all dump docker in kubernetes the better.
Docker excels at image building, especially now with multi-stage Dockerfile support. It seems all those alternatives to Docker just gave up on providing anything of their own. The documentation typically starts with "let's pull a docker image".
On the other hand, the container runtime is straightforward. I recently discovered that one can run a docker image with a bash script and the unshare command and get a very tight security setup. That explains the proliferation of various alternatives to Docker for running its images.
I just enabled user namespaces after reading this post. Broke Jenkins and there doesn't appear to be an easy solution. I mount the docker socket in the Jenkins container, which is not an option with user namespaces as the user Jenkins now runs as does not have permission to access the socket.
It seems to be possible to provide this user access to the socket through a socket proxy, but since all containers use the same user this seems to defeat the purpose of using namespaces in the first place.
Cherry on top: although `docker run` supports running containers with custom userns settings, docker swarm, which I use to run Jenkins, does not.
So as far as I can tell my only options are:
1. Go back to not using user namespaces
2. Make the docker daemon on the host available over HTTP, which is really something I was trying to avoid...
Mm, if you're bind mounting in the Docker socket, enabling user namespaces won't help much. You just have to deal with the fact that you have a privileged container (Docker API access == root, at least unless you're using authz). It'd be nice if we could see more RBAC around Docker API so you could do things like "grant only permission to run this one container".
Totally. But the vast majority of containers I use do not get a bind mount to the Docker socket... for which user namespaces would be a very nice feature.
Yeah, definitely turn it on where possible, just important to realize that it's not a panacea (some people really hyped it up to be before the feature was released and criticized Docker for not having it at all). As always, gotta try and find the right sweet spot between convenience and attack surface.
We're working on a solution for docker containers and services that should please most people, called Docker Entitlements: https://github.com/moby/libentitlement
These Entitlements are high-level privileges for containers and services that could be baked into images, the same way as for macOS/iOS apps. These permissions would allow creating custom {seccomp+capabilities+namespaces+apparmor+...} profiles (effectively security profiles) for better granularity in app sandbox configuration by app developers and ops.
The current POC has `docker run`, `docker service create` and even the build mechanism working. The integration is actively being worked on and PRs are being prepared.
As far as I remember things, because it breaks overlay filesystems, which are a major space saver in Docker world. Something might have changed, but last time I checked, you couldn't "offset" uids/gids on a filesystem overlay, so every layer of the container would have to be copied and chowned (slowly).
This would obviously only work for minimal containers (i.e. ones that don't contain a distribution), and software has to be pretty much built for such a case (e.g. statically linked, no dependencies on common tooling — popular with Go, but your Python application won't work; edit: unless you copy all the layers, that is).
Tl;dr: user namespaces are inherently incompatible with many of the usability features Docker brings over other solutions, while they're not particularly useful for many popular use cases (no shared hosting, minor differences in consequence between escalating to the root of the container and its host - though that's an assumption frequently wrongly made).
Also, people hold their bind mounts to the host near and dear, and user namespaces would break all kinds of things people expect to "just work" with bind mounts. Having user namespaces on by default would break tons of existing scripts, Compose/Kube files, etc. that do things like mount /var/lib/mysql into the container for persistence.
I tried it a couple of months ago. It immediately broke the build of one of the images. It was a known bug. So I guess I'll just wait one more year to try.
In the meantime, I make sure that all my containers run as non-root with maximum security restrictions. The exception so far has been sshd from OpenSSH, mostly due to incorrect porting from OpenBSD in portable ssh.
> In 2017 alone, 434 linux kernel exploits were found, and as you have seen in this post, kernel exploits can be devastating for containerized environments. This is because containers share the same kernel as the host, thus trusting the built-in protection mechanisms alone isn't sufficient. Make sure your kernel is always updated on all of your production hosts.
Personally, I'd reasonably trust Xen or KVM or something else with hardware-based virtualization and the like to protect me in a multi-tenancy scenario. Much less so in the case of Docker. Sharing a full kernel with potentially malicious actors is riskier than sharing a hypervisor; there is much more surface area for attack.
The code is significantly different but I still see a lack of access_ok(), so was the checking performed somewhere else that I didn't notice (I haven't looked closely at this part of the kernel before)?
Apparently Linux privilege escalation bugs are now "Docker container escapes"? Thanks to Twistlock for a detailed article but call it what it is, a Linux vulnerability not specific to Docker.
Since people rely on docker containers as an isolation layer between potentially unfriendly services, a linux bug that allows breaking that isolation barrier is a relevant thing. It’s worth being called “docker container escape”
People should not rely on that to any degree more than they would rely on colocated processes on a VM being isolated. The easiest way to be safe is to assume that all containers are already broken out of - what would you do then? Make sure processes are running as non-root, use various protection layers (pick your poison - SELinux, gresecurity, etc.), take away capabilities, and don't run workloads you don't trust.
sure, any Xen guest escape receives equal amounts of press for exactly that reason: It's an isolation barrier breaking down. However, trivial exploits breaking VM isolation have been relatively rare lately.
- A vulnerability is a software bug with particular behaviors and ramifications that allow it to be used maliciously.
- An exploit is a crafted piece of input data that is designed to trigger a vulnerability to execute arbitrary code, crash the target (Denial-of-Service), etc.
> In 2017 alone, 434 linux kernel exploits were found, and as you have seen in this post, kernel exploits can be devastating for containerized environments.
There are a few places in the article like this one where the correct terminology is vulnerability not exploit. cvedetails.com aggregates vulnerabilities. Places like exploit-db.com aggregate exploits people have written to take advantage of vulnerabilities to enable them to perform some unintended action against the target.
Any ideas why this is branded as "Docker"? Are the same namespacing constructs not being used by other Linux container runtimes? I think this should be titled "Escaping Linux containers" as docker is not at fault here?
There's a long tradition by enterprise vendors large and small to market someone else's product as insecure, in order to create demand for their "improved, secured" version.
In this particular instance, Twistlock is selling Docker security by amplifying the meme of "insecure Docker". The Docker brand has visibility with the target audience (Enterprise IT), so it's a good target for this kind of piggybacking.
This type of FUD marketing happens all the time in many different markets, it's not specific to Docker.
The author shows a concrete exploit of the kernel bug described in CVE-2017-5123 as he has developed it in the context of the docker container environment.
He shows how to use this bug to break out of docker, so he calls the blog post "Escaping Docker ...".
Which is IMHO the most interesting container runtime to write such an exploit for first because it is very widely deployed, but it might also just have been what the author is most familiar with or what was easiest to develop for him.
"Ubuntu container" is just not a name typically used for anything. "Docker container" is.
Let's make it realistic and say he had used RedHat OpenShift as his target and example for the exploit. I'd be completely fine with the title referencing that exact product by name.
Why would he have to dance around what he is using in his demo? Maybe that concrete product has multiple layers of security or lacks them, or uses a certain version etc. He can only speak to what he himself was using and testing. "Escaping Docker container..." is the best short description (as you would need it for a title) of this demo exploit I can think of.
It is a reasonable _assumption_ that other container runtimes on linux might be affected by the same kernel bug. The article does not explore that and the author has no duty to do so just to avoid using a branded technology name.
How would you reasonably talk about "Linux containers" without having a very exhaustive list of all existing implementations and testing all of them? If one of them is not affected you are now factually wrong.
"Docker" isn't at fault here either way, as Docker isn't its own "execution driver" (in Docker parlance) any more; that would be https://github.com/opencontainers/runc.
But to answer the spirit of your question, each container runtime uses its own peculiar combination of such constructs. It's helpful to know that this attack allows you to break out of the combination used specifically by runc, and thereby to break out of any system relying on Docker (with the default runc execution driver).
I doubt any runtime would have prevented this bug; some have seccomp-bpf profiles to blacklist some kernel operations, but I doubt any block a function as basic as waitid().
If anything, this points out that the use case of Docker for security isolation, such as in a multi-tenant architecture, is probably still not a good one.
In most use cases I see containers used for rapid and consistent deployment. The isolation benefit with multiple containers on a host is that if you install things with different library dependencies you don't run into conflicts. As such, the comparison for the common use case is just software installed directly on the host, which also is subject to this vuln.
> In 2017 alone, 434 linux kernel exploits were found, and as you have seen in this post, kernel exploits can be devastating for containerized environments. This is because containers share the same kernel as the host, thus trusting the built-in protection mechanisms alone isn't sufficient.
More than one kernel exploit _per day_. Exploiting Linux is just a matter of finding one such vulnerability and using it. This can be done in a single day.
There's just no fixing megabytes of buggy kernel code.
It really drives home the need for a proper OS based on a verified, capability-enabled microkernel such as seL4.
I'll surely get a lot of flak for this, but these kinds of bugs would be trivial to avoid in C++. All you need is to make the pointer arguments to syscalls be some other data type (say, user_ptr<T>) that performs an access-check upon conversion to a raw pointer. Then the compiler simply wouldn't let you bypass the access-check, so you simply could not forget to do so. That's the fundamental difference between C++ and C: one of them actually lets you write code that cannot contain many classes of mistakes, and the other, well, doesn't. For the life of me I don't understand the stubbornness behind sticking to the same languages and tools from decades ago.
Well, the exploited code used unsafe interfaces in an unsafe manner. It is effectively equivalent to calling something like reinterpret_cast<T*>(user_ptr.get()) to bypass the safety provided by the compiler. How do you avoid that with C++ alone? I guess you would need some external static analyzer. The Linux kernel does have one: Sparse. IIRC, it can report casts that discard __user annotations on pointers.
As for stubbornness, C can be used safely with proper discipline. Kernel development does require a certain amount of experience and discipline, so arguably C can be used by kernel developers in a mostly safe manner. That's why some view it as a feature: if you don't have the required discipline then just stick to userspace development.
Thanks for the explanation! __user and sparse seem to be almost exactly the kinds of tools I had in mind, with the caveat that __user would be the default for a pointer argument to a syscall, so that it wouldn't need to be specified.
I'm not sure I understand what interface you're referring to that was "unsafe" and subsequently "used in an unsafe manner". What is the "unsafe interface" here that was being used in an unsafe manner? It seems to me that the problem was that the pointer was not marked as __user? Which is awful, because shouldn't __user be the implicit default behavior for a pointer argument to a syscall? Why should the default behavior be the unsafe one you pretty much never want?
System call pointer arguments are usually marked with __user annotations (I cannot recall if there are some weird calls that may need a kernel pointer, none should need it, but there may be some legacy one). In particular, the infop argument to waitid() is marked as user-space pointer [1].
Before using a pointer to user space, one should check it with access_ok(). The usual safe interfaces — copy_{to,from}_user(), put_user(), get_user() — always perform this check and fail with an error if the pointer is not an okay user-space pointer.
The commit that introduced the vulnerability [2] replaced the safe interface with unsafe ones, possibly for performance improvements. The code used put_user() function to set individual fields of a struct. Multiple calls to put_user() were replaced with multiple calls to unsafe_put_user() which does not perform access_ok() check every time. A check for NULL pointer was added before the stores. unsafe_put_user() still checks whether the address points to an actually mapped memory location, but does not verify whether the location is in user-space.
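For reference, the shape of the code in question, paraphrased and abridged from the 4.13-era kernel/exit.c (the field list is shortened and exact details vary by version):

    	if (!infop)
    		return err;

    	/* the fix: reject any pointer outside the user address range */
    	if (!access_ok(VERIFY_WRITE, infop, sizeof(*infop)))
    		goto Efault;

    	user_access_begin();
    	/* unsafe_put_user() skips the per-call access_ok() check, so the
    	 * one above is all that stands between this and a write through
    	 * an attacker-chosen (mapped) kernel address */
    	unsafe_put_user(signo, &infop->si_signo, Efault);
    	unsafe_put_user(info.status, &infop->si_status, Efault);
    	user_access_end();
    	return err;
    Efault:
    	user_access_end();
    	return -EFAULT;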
The commit was not really discussed in depth on LKML [3], as it came from Al Viro, who should know better, is one of the Sparse maintainers, etc. Some projects require human justification for any usage of unsafe interfaces during code review (like flagging a review with 'needs-check' or something that requires a sign-off by another human that the unsafe thing is actually safe). This may have been a case where that could have mattered, as the static analysis tool should not produce bogus warnings for interfaces which are designed to perform unsafe stuff. Though it may also be useful to add a check to Sparse which verifies that unsafe_{get,put}_user() calls are preceded by an access_ok() call in the same function.
According to Linus, programmers who prefer C++ are so bad that he would have chosen C solely to avoid dealing with their "total and utter crap" code, and C++ is only good for kernel development if you limit yourself to a C-like subset anyway [1].
Only if you're lucky. Most of these exploits probably took weeks to find and analyze properly; it's not like one person found more than one a day. They're found because whole teams are working with the Linux kernel at the same time and either happen upon them or actively look for them.
Linux kernels in production (since we all now like to run docker there :)) without grsec/seccomp have always been pretty dangerous. What I dislike about docker is their feature creep and lack of proactively steering their users to accepting more secure defaults. The mindset towards security in the Linux kernel community remains shockingly stubborn compared to the shift to "better security", which is taking over the rest of the industry.
Actually, the most affected by this CVE would be medium-sized companies that don't invest enough in internal development to pump out services fast with secure defaults (maybe startups rushing to their MVP, too). Companies running totally automated farms with Kubernetes or Docker Swarm usually don't have containers with long uptimes.
> The vulnerability is that the highlighted access_ok() check was missing in the waitid() syscall.
Why in the world does this class of vulnerabilities still exist in 2017? Why are kernel maintainers not writing some kind of C linter that makes sure every single pointer argument to every syscall is passed to a well-known function like access_ok (Linux) or ProbeForRead (Windows)? Literally all you need is a syntactic check; you don't even need to do any kind of semantic analysis... since all you want is to flag the code so someone can inspect each spot manually. Why is this not done?!
C++ would also make it harder to get it wrong. Its type system is powerful enough to enforce rules like "you must call access_ok before writing through a pointer": you just have access_ok transform an inaccessible pointer token of some sort, passed in as a syscall parameter, into a different kind of object through which you can write into memory.
The generated machine code would be identical to what's in the kernel today, but it'd be both safer and cleaner. C++ still has to get over the bad gang-of-four-1990s-era-object-goo reputation it has among systems people.
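A hypothetical sketch of such a type, in user-level C++ with a stand-in for the kernel's real range check (none of these names exist in the kernel):

    // user_ptr.hpp - a hypothetical sketch, not real kernel code.
    // The raw pointer is private; the only way out is through the check.
    #include <cstddef>
    #include <cstdint>
    #include <optional>

    template <typename T>
    class user_ptr {
    public:
        explicit user_ptr(T *p) : raw_(p) {}

        // Equivalent of access_ok(): yields a writable pointer only if
        // the range lies in user space. Forgetting the check is now a
        // compile error, not a CVE.
        std::optional<T *> check(std::size_t len = sizeof(T)) const {
            auto addr = reinterpret_cast<std::uintptr_t>(raw_);
            if (addr + len > kUserTop || addr + len < addr)  // stand-in check
                return std::nullopt;
            return raw_;
        }

    private:
        static constexpr std::uintptr_t kUserTop = 0x00007fffffffffffULL;
        T *raw_;
    };

    // usage inside a hypothetical syscall:
    //   auto p = infop.check();
    //   if (!p)
    //       return -EFAULT;          // the compiler forced us through this
    //   (*p)->si_signo = signo;      // only reachable after the check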
Does this escape only work if they have root inside of the container? I usually try to make it so my containers always contain a non-root process as an extra layer of security.
Randomized pids wouldn't necessarily help that much in this situation, especially if the getuid syscall is available. However, I agree with your general sentiment that there are basic security features Linux could implement to make a lot of CVEs impotent. I think the community is coming around, but this stuff takes more work than most people may realize.
This might be a bit off topic but I wonder why the vulnerability has been patched this way:
if (!access_ok(VERIFY_WRITE, infop, sizeof(*infop)))
goto Efault;
Why doesn't the if use curly brackets? I thought it had been established that it is best practice to always use curly brackets, even when they are optional, especially after Apple's infamous goto bug[1].
Secondly, why does it use goto at all? I thought it had also been established not to use goto unless it is the only performant solution (and performance is important in that case). Sure, Efault will probably kill the program, but wouldn't it still be better to use a function call, considering that the desired resolution should be the same?
1. Linux kernel coding style is documented here [1], and contains the line:
> Do not unnecessarily use braces where a single statement will do.
2. There is no built-in exception handling in C. `goto ERROR_HANDLING_CODE` is a common and well established pattern to handle exceptions in C, see e.g. [2].
And there is quite a consensus that it is the best solution to this problem.
Goto being considered harmful is a generally true statement. However, this usage is a more specific exception to the rule, with objective benefits vs alternatives.
From your link: Maybe the coding style contributed to this by allowing ifs without braces, but one can have incorrect indentation with braces too, so that doesn't seem terribly convincing to me.
goto is kind of a necessity in any complex C codebase unless you want to duplicate tons of code. Sometimes you need to jump way out of the context, especially to handle errors, and C does not have "exceptions" (although you could do them with setjmp/longjmp)
goto is the best way to do early exit and cleanup from a C function. The alternatives are 1) deeply nested if's, one level for each function call that can return an error code, or 2) repeat the same cleanup code over and over again at every exit point.
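The idiom, as a minimal sketch:

    /* a minimal sketch of the goto-cleanup idiom */
    #include <stdio.h>
    #include <stdlib.h>

    int do_work(const char *path)
    {
        int err = -1;
        char *buf = NULL;
        FILE *f;

        f = fopen(path, "rb");
        if (!f)
            goto out;
        buf = malloc(4096);
        if (!buf)
            goto out_close;
        if (fread(buf, 1, 4096, f) == 0)
            goto out_free;   /* every failure path unwinds the same way */
        err = 0;
    out_free:
        free(buf);
    out_close:
        fclose(f);
    out:
        return err;
    }

    int main(void) { return do_work("/etc/hostname") ? 1 : 0; }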