
This exploit is interesting, but if you are doing container security correctly it’s actually not a big deal. In particular, if you are setting up per-container user namespaces, like you ought to be, then this exploit doesn’t do anything. In fact, you can actively give a user-namespaced container any CAPs you want, because they are isolated to that container’s uid:gid offset.

Obviously, giving containers unnecessary CAP privileges is unwise, but if you are practicing sound security best practices then there would be multiple layers of defense between you and this CVE. I think a strong AppArmor profile and seccomp profile would also make this CVE moot.

Edit: Also, this exploit relies on you being able to fork up to a certain pid value. You can and should take advantage of Linux’s per-cgroup pid-limit functionality (the pids controller). No container needs more than 255 threads (and even if one does, you can make special exceptions for such applications).
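
Roughly, a minimal sketch of that kind of cap, assuming the cgroup v1 pids controller is mounted at /sys/fs/cgroup/pids; the group name "web-container" and the 255 limit are just example values, not anything a runtime sets up for you:

    /* Cap the number of tasks (processes and threads) a cgroup may hold,
     * assuming the v1 pids controller at /sys/fs/cgroup/pids.
     * Group name and limit are arbitrary examples; run as root. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        const char *group = "/sys/fs/cgroup/pids/web-container";
        char path[256];

        mkdir(group, 0755);                       /* create the cgroup */

        snprintf(path, sizeof(path), "%s/pids.max", group);
        FILE *f = fopen(path, "w");
        if (!f) { perror("pids.max"); return 1; }
        fprintf(f, "255\n");                      /* per-cgroup task cap */
        fclose(f);

        snprintf(path, sizeof(path), "%s/cgroup.procs", group);
        f = fopen(path, "w");
        if (!f) { perror("cgroup.procs"); return 1; }
        fprintf(f, "%d\n", getpid());             /* children inherit the group */
        fclose(f);
        return 0;
    }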

Edit 2: Additionally, this CVE relies on the getuid syscall being available; there is no reason to give a container this syscall, and you should block it, per this guide: https://rhelblog.redhat.com/2016/10/17/secure-your-container...

I have to say I’m more than a little disappointed in Twistlock for not pointing out what countermeasures you can employ against this and other CVEs.




1) User namespaces don't magically protect you from a vulnerability that allows writing to kernel memory. Neither would AppArmor. seccomp could theoretically, but waitid is a pretty fundamental Unix syscall and blocking it would break a lot of basic software.
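
For the record, here is roughly what such a filter looks like with libseccomp (a sketch, and explicitly not a recommendation, since plenty of software reaps children through waitid, directly or via libc wrappers):

    /* Sketch: allow everything, but make waitid fail with EPERM.
     * Uses libseccomp (link with -lseccomp); not a recommendation. */
    #include <errno.h>
    #include <seccomp.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);  /* default: allow */
        if (!ctx) return 1;

        if (seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(waitid), 0) < 0 ||
            seccomp_load(ctx) < 0) {
            seccomp_release(ctx);
            return 1;
        }
        seccomp_release(ctx);

        /* From here on, waitid(2) in this process returns -1 with EPERM. */
        printf("seccomp filter loaded in pid %d\n", getpid());
        return 0;
    }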

The author devised a particular exploit, but his example was hardly the only way to leverage the vulnerability. Being able to write to kernel memory is about as huge a vulnerability as you can get. Just because you can't think of a way to leverage a vulnerability doesn't mean an attacker can't; your failure of imagination is not evidence that it cannot be done.

2) Plenty of containers need more than 255 threads. Like, pretty much any Java server. In any event, this particular exploit doesn't necessarily require hundreds or thousands of simultaneous processes.

3) Blocking getuid is even worse than blocking waitid. Block getuid and you'll break glibc and god knows what. In any event, it would be futile, as the real and effective UIDs are passed to the process through the auxiliary vector (auxv) when the kernel executes it.
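
(To make that concrete: glibc even exposes the aux vector directly, so a process can learn its IDs without ever issuing a getuid syscall. A quick sketch:)

    /* The kernel puts the real and effective IDs into the ELF auxiliary
     * vector at exec time; no getuid/geteuid syscall needed to read them. */
    #include <stdio.h>
    #include <sys/auxv.h>

    int main(void) {
        printf("AT_UID=%lu AT_EUID=%lu\n",
               getauxval(AT_UID), getauxval(AT_EUID));
        return 0;
    }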

4) You're missing the forest for the trees. The real moral of the story is this: "In 2017 alone, 434 linux kernel exploits were found". Unless you're prepared to pore over every published exploit, 24/7, meticulously devise countermeasures, and run effectively crippled software, you really shouldn't be relying on containers to isolate high-value assets. I wouldn't rely on VMs, either, as the driver infrastructure of hypervisors has also proven fertile ground for exploits.


It’s not a full kernel memory CVE; you have ±255 bytes of access to kernel memory from the cred pointer. I have no idea if that extends to userns or not. Also, I think you’re confusing Java threads with system threads; they are not the same.

I think you’re being overly alarmist. You have to trust someone else’s code at some point, otherwise you’ll be paralyzed by non-productivity.


> It’s not a full kernel memory CVE; you have ±255 bytes of access to kernel memory from the cred pointer. I have no idea if that extends to userns or not.

As I understand it, a kuid_t is the UID in the root namespace, so setting your cred->uid to 0 gets you treated as equivalent to root on the container host.
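
For context, the relevant kernel structures look roughly like this (simplified from include/linux/uidgid.h and include/linux/cred.h; the real definitions have many more fields):

    /* Simplified sketch of the kernel's credential structures. kuid_t/kgid_t
     * hold IDs in the initial user namespace, so a task whose cred->uid.val
     * is 0 is root as far as the host kernel is concerned, whatever user
     * namespace it thinks it lives in. */
    #include <sys/types.h>

    typedef struct { uid_t val; } kuid_t;
    typedef struct { gid_t val; } kgid_t;

    struct cred {
        /* ... */
        kuid_t uid;   /* real UID of the task */
        kgid_t gid;   /* real GID of the task */
        kuid_t suid;  /* saved UID */
        kgid_t sgid;  /* saved GID */
        kuid_t euid;  /* effective UID */
        kgid_t egid;  /* effective GID */
        kuid_t fsuid; /* UID used for VFS operations */
        kgid_t fsgid; /* GID used for VFS operations */
        /* ... capabilities, keyrings, user_ns pointer, etc. */
    };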

Also, don't think that limited exposure to kernel memory saves you - take a look at the sudo "vudo" exploit from 2001, in which a single byte erroneously overwritten with 0, and then put back, turned out to be exploitable: http://phrack.org/issues/57/8.html (And in general, don't confuse the lack of a public exploit with proof that a thing isn't exploitable in a certain way.)

> Also, I think you’re confusing Java threads with system threads; they are not the same.

Current versions of the HotSpot JVM (where by "current" I mean "since about 1.1") create one OS thread per Java thread: http://openjdk.java.net/groups/hotspot/docs/RuntimeOverview.... "The basic threading model in Hotspot is a 1:1 mapping between Java threads (an instance of java.lang.Thread) and native operating system threads. The native thread is created when the Java thread is started, and is reclaimed once it terminates." Plus there are some other OS threads for the runtime itself.

> I think you’re being overly alarmist. You have to trust someone else’s code at some point, otherwise you’ll be paralyzed by non-productivity.

Sure, but you can choose which code to trust, and how to structure your systems to take advantage of the code you trust and not the code you don't. Putting mutually-distrusted things on physically separate Linux machines on the same network is a pretty good architecture: I trust that the Linux kernel is relatively low on CVEs that let TCP packets from a remote machine overwrite kernel memory.


> I trust that the Linux kernel is relatively low on CVEs that let TCP packets from a remote machine overwrite kernel memory.

You know, you say that...

https://lkml.org/lkml/2017/12/29/137


255 bytes is huge, though IIUC it's actually less than that. Nonetheless, it's much more than is typical. Sometimes these holes are limited to a single word, and only a single value for that word (like NULL), and attackers still come up with marvelously devious exploits.

The critical vulnerability is that the cred pointer address is entirely under your control, so you get to poke at whatever kernel memory you want. The limitations are 1) locating the address of what you want to poke, and 2) being limited to a smallish range of values that you can write out.

Also, I'm not confusing Java threads with system threads. Most JVMs use a 1:1 threading model. And because on Linux a thread is just a process (which unfortunately still causes headaches with things like signals, setuid, etc), each thread has its own PID.
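
You can see this directly with a few lines of C (a quick sketch; gettid has no glibc wrapper on older systems, hence the raw syscall):

    /* Each pthread on Linux is its own kernel task with its own task id,
     * even though they all share one thread-group id (the "PID" ps shows).
     * Build with: gcc threads.c -pthread */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static void *worker(void *arg) {
        (void)arg;
        printf("tgid=%d tid=%ld\n", getpid(), (long)syscall(SYS_gettid));
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }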

I'm not being alarmist, just realistic. Nobody is going to stop using Linux anytime soon. Nor am I. But the fact of the matter is that the Linux kernel is riddled with vulnerabilities. Something like the waitid vulnerability comes along at least 3 or 4 times a year, and that's just the published ones. (IMO, part of the reason is precisely because of complex features like user namespaces, which add tremendous complexity to the kernel. But that's a contentious point.)

At least for high-value assets (however you want to define that), people should just treat Linux as if it lacks secure process isolation entirely, absent a commitment to herculean efforts--extremely locked-down seccomp, PNaCl-like sandboxing, etc. for all your code that juggles tainted data. Even then, vulnerabilities like recvmmsg come along and ruin your day, but those are rare enough that it would be unfair to single out Linux.

Not only is that pragmatic and reasonable, after 25 years of endless vulnerabilities of this sort I wouldn't trust the judgment of anyone who thought otherwise. And for what it's worth, I'd make much the same point about Windows, although I have much less experience on that platform.


> people should just treat Linux as if it lacks secure process isolation entirely

Can you state what OS/software you believe has secure process isolation?

Would one of the BSDs have it or would I have to use some OS that's more exotic to run a process in isolation?


Empirically, among the general-purpose, commodity platforms, OpenBSD has one of the best track records. Professionally, I've had success placing OpenBSD in roles where Linux boxen were constantly hacked. But IT departments in particular, and application developers generally, dislike OpenBSD or would dislike it if it were forced upon them.

More importantly, while nowhere near as bad as Linux, macOS, or Windows, OpenBSD has at least one published severe local kernel vulnerability every year or two. In many cases those OpenBSD boxen I mentioned survived _despite_ being neglected by IT and not kept up-to-date; I know for a fact some were in a known, locally exploitable state for a not insignificant period of time. That makes me think a big part of their relative security is simply related to OpenBSD not being a common target of rootkits and worms. I have little doubt a sophisticated attacker could root an OpenBSD box from the shell, for example, if he was targeting that box. (My rule of thumb when isolating services is that anything running a web service using a non-trivial framework (PHP, NodeJS, etc) provides at least the equivalent to shell-level access to a targeted attacker. Among other things, that means even if I'm writing a privilege separated, locked-down, formally-verified backend daemon, I assume as a general rule that any data it's trying to protect isn't safe from that front-end web application unless it's running on separate hardware.)

While I don't think that security and convenience are necessarily mutually exclusive, as a practical matter they are largely mutually exclusive. Unless you're prepared to accept the burden and cost of using a specialized OS like seL4--and in particular use it in a way that preserves and leverages the stronger security guarantees--your best bet is simply to use separate servers when you want a significant degree of isolation assurance. Separate hardware is not sufficient (if all your boxes have Intel ME cards, or have firmware pushed from a puppet server, or share an engineer's account whose hacked desktop is logging SSH keys, passwords, and Yubikey PINs), but it's largely necessary. This is true whether you're concerned with targeted or opportunistic attacks, but _especially_ opportunistic attacks, which are by far the most common and, in many respects, an important element to targeted attacks.

Separate hardware is a uniquely simple, high-dividend solution. But the point is to be realistic about the actual robustness of your solutions, to be able to easily identify and mitigate risks, so you can make more informed security decisions. And it all depends on what you're protecting and what sort of investment you're capable of making. Just endeavor to predicate your decisions on accurate and informed assessments. Among other things, that means being honest about and understanding your own lack of knowledge and experience. Similarly, continuity and long-term maintenance are critical to security, which means you need to be honest about institutional capabilities (or for a private server, what you're prepared to track and upgrade 3 years from now.)

Linux, OpenBSD, co-location, and cloud hosting can all be part of a perfectly robust strategy. And HSMs probably should be, too, which is basically just a way to attach a tiny, isolated, dedicated piece of hardware to a computer. But none of these options alone are likely to be reasonable choices, all things considered, especially in the organizational context.


> Also, I think you’re confusing Java threads with system threads; they are not the same.

Oh? On mainstream JVMs, a Java thread is the same as the thing you could create with pthreads. What do you mean by "system threads"?


Java threads are system threads.

I can confirm hundreds of threads for networked Java applications is normal behaviour.


> No container needs more than 255 threads

> Additionally, this CVE relies on the getuid syscall being available; there is no reason to give a container this syscall

The problem with MAC schemes is that, in practice, they lead to security people imposing random and arbitrary restrictions on general APIs in the name of least privilege. In doing so, they break the orthogonality of general-purpose platform concepts and break the reductive mental model necessary to get anything done. It's a misunderstanding of what least privilege actually means.

Security is better achieved by creating clear, principled security domains and boundaries, then controlling access to these domains in a general and transparent way. Saying "you, unix process, you can call system call X, but not system call Y, because in my opinion, Y is risky", when neither X nor Y breaks through a security domain, is bad practice. So is arbitrarily capping the number of threads in a container.


> Additionally, this CVE relies on the getuid syscall being available; there is no reason to give a container this syscall, and you should block it,

Huh? Lots of legitimate things will break without working getuid().

> you should block it, per this guide

getuid() doesn't require any capabilities, so it can't be blocked by taking them away.


Oops, good call. I meant setuid, but either way I was wrong.


> Edit 2: Additionally, this CVE relies on the getuid syscall being available

This exploit relies on it. The vulnerability does not. The exploit happens to use getuid() along the way to using heap spraying, but the writeup is pretty clear that neither getuid() nor heap spraying is required.


Yeah, and I’m wrong about that part anyway. You can’t restrict getuid via capabilities or block it without breaking glibc. I meant setuid, but that call isn’t used in this exploit. I got confused.


> but if you are doing container security correctly

Doing it correctly should be the case using the default settings. Defaulting to an insecure setup is a bug.


Hmm. Perhaps this is a difference between dev and ops, but almost every tool we use comes out of the box with settings unfit for production. Instead, they're tuned for development, and in some cases, deploying to a staging environment. At least, this has been the case in my experience.


Ah, dev... ops... How about DevOps? As (originally) a dev, I bring my app to production. How do I stand a fighting chance of reconfiguring the defaults in the way you suggest, without suddenly gaining a whole new set of skills? Good defaults would be helpful, even if they're very conservative. I can break things open, but at least then I know what to read up on.


100% agree here. Ship with secure settings by default and have simple "developer guides" that show what to crack open for easier use in non-production environments.


I would argue the opposite. As a developer I want to have tools that make my life easier to - you guessed it - develop. Enabling unnecessary secure defaults that either hinder or don't apply to my use case is silly.

There's a reason most users choose Ubuntu over OpenBSD as their workstation. I would put good money on the reason being that it's "secure enough" without getting too heavy-handed on production use cases.

However, I do agree that there has to be a balance. Most tooling I write tends to lean more towards the "good user experience" side first, and then document the production use case. Either that or release two separate (but similar) products; one for developers, one for operations teams. Docker's doing that with the Community Edition/Enterprise Edition, but I still think the Community Edition is far too heavy-handed when it comes to things like pulling images from "insecure" registries.


> As a developer I want to have tools that make my life easier to - you guessed it - develop. Enabling unnecessary secure defaults that either hinder or don't apply to my use case is silly.

The problem is that a lot of these insecure defaults will be rolled to production by a "developer", only to get hacked later, because marketing and frictionless dev are more important than sane default security.

> Over a quarter of MongoDB databases left open to the internet have been ransacked by online extortionists.[0]

They "forgot"/"did not know" the password is not set. This was for db tech.

Now you want your average dev, who is getting "forced" to use Docker more and more, to know how to set up container tech securely?

But hey, I need to get this on my CV and it's just click-click-install. So what could go wrong?

[0]: http://www.zdnet.com/article/mongodb-ransacked-now-27000-dat...


You're touching upon a natural tension, where developers take certain shortcuts that operations people then have to deal with, which, I suppose, DevOps practices try to solve. The developers pay the price of less freedom for that, or if you will, the price of (production) reality. If that's not what your shop is doing, then what you say makes sense to me, and clearly DevOps practices aren't for everyone.


I don't think the tension is necessary in most cases. You can make something developer-friendly and secure at the same time; you just have to allow for both. Simple things, like taking credentials from environment variables, can make it easy for both dev and ops/prod to work with and be secure.


> In particular, if you are setting up per-container user namespaces, like you ought to be, then this exploit doesn’t do anything.

User namespacing in Docker is enabled at the daemon level, not per container, so all containers share the same offset. This ensures that a root user escaping the container ends up as a different uid on the host, but it doesn't prevent someone from moving sideways through the containers on the same host.

Note that enabling this will break the developer workflow of mounting files from the host into the container. I believe files will show up with the wrong ownership inside the container.


You don’t need to use containerd. Other runtimes make it possible (per-container offsets have been possible in runc for over a year).


> but if you are doing container security correctly ...

The container that wasn't!

(I get the gist of it, just tongue in cheek)


As somewhat of a container noob, could you expand on "per-container user namespaces"?


Sure. User namespacing is a container security feature that lets you grant a process apparent root privileges (over its own filesystem and namespace) while it is not actually root on the host. To the running process it appears that it is, or can run as, root, but on the host it is really some offset uid:gid. Here’s the man page explaining more: http://man7.org/linux/man-pages/man7/user_namespaces.7.html
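
If you want to see the mechanism with no container runtime involved, a bare-bones sketch looks something like this (the 100000 offset is an arbitrary example, and mapping a uid other than your own needs privilege in the parent namespace, which is why runtimes normally go through newuidmap and /etc/subuid):

    /* Bare-bones user-namespace demo (not how Docker does it): the child is
     * uid 0 inside its new namespace but uid 100000 on the host. Run as root,
     * since mapping an arbitrary host uid requires privilege. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int pipefd[2];
    static char stack[1024 * 1024];

    static int child(void *arg) {
        (void)arg;
        char c;
        read(pipefd[0], &c, 1);   /* wait until the parent wrote uid_map */
        printf("inside namespace: uid=%d\n", getuid());  /* 0 if mapped */
        return 0;
    }

    int main(void) {
        pipe(pipefd);
        pid_t pid = clone(child, stack + sizeof(stack),
                          CLONE_NEWUSER | SIGCHLD, NULL);
        if (pid < 0) { perror("clone"); return 1; }

        /* "0 100000 1": uid 0 in the new namespace is uid 100000 on the host. */
        char path[64];
        snprintf(path, sizeof(path), "/proc/%d/uid_map", pid);
        FILE *f = fopen(path, "w");
        if (f) { fprintf(f, "0 100000 1\n"); fclose(f); }

        write(pipefd[1], "x", 1); /* let the child continue */
        waitpid(pid, NULL, 0);
        return 0;
    }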

The gist is that the container is further sandboxed by the kernel, independent of the higher-level security precautions. It’s not perfect by itself, but used in conjunction with other features like AppArmor or SELinux and seccomp, it can make a container virtually sandboxed.


Follow-up question: why doesn't Docker do that by default?


Because the Docker project doesn’t make money off of security. It is actually quite infuriating, because they have become the de facto container image standard. Most of their security has actually come from Twistlock (I am not a Twistlock employee, FYI). My recommendation to most Admins or Devs that are serious about container security is to let your developers use docker, but run your images with CRI-O on your servers: http://cri-o.io/


The Docker project has done quite a bit of work on container security, so I'm not sure it's entirely fair to say that their not enabling user namespaces by default is for that reason.

For example the work that was done on their seccomp filtering and apparmor profiles.

My guess would be that, as user namespacing can introduce some issues (e.g. when mounting host filesystems), they've decided the trade-offs aren't worth it.

Also, looking at the CRI-O Trello (https://trello.com/c/Ak2yMcpf/714-epic-cri-o-support-for-use...), it seems like user namespaces aren't even an option there yet?


Viewing Docker containers as anything more than a bundling and deployment system is a mistake. While they might help with security, they will never be completely secure, and you should architect your deployments with that in mind.

Unless you are a giant enterprise shop with the resources to staff a decent sized K8s team, you should use the hosted solutions.


There are trade-offs to using userns and many people don't like the current set of trade-offs. In addition, changing a default like this is a breaking change. Admins can enable userns by default in the daemon, but making it a hard-coded default is much more difficult.

It's not just a matter of enabling user ns. There is no support at the vfs layer for uid/gid mapping. This means that in order to use it, images must be chowned with the remapped IDs. Per-container mappings are not supported for this reason (it would require copying and chowning the entire image for each container mapping).

Do you care to qualify your statement about CRI-O?


I recall seeing some patches submitted at one point to make it possible to pass a uid/gid offset to the mount syscall, when people were implementing user namespaces for container runtimes like Docker. So is this fixable without having to make every filesystem implement this feature, or is there something else holding back better support for uid shifting with user namespaces?


That has not been accepted into the kernel. It's called "shiftfs", which basically lets you perform the uid/gid shift on mount.


Nah, Docker has taken lots of strides in the right direction for security over the years, Twistlock or not, albeit with a few weird dangling remainders. They'd love to turn user ns on by default but it'd break lots of existing stuff. Many more users would be mad about having that on by default than leaving it off.

Disclaimer: I worked at Docker for 3 years.


CRI-O is bleeding edge. I'm not sure it's ready for production usage. But it looks very promising. The sooner we can all dump docker in kubernetes the better.


Docker excels at image building, especially now with multi-stage Dockerfile support. It seems all those alternatives to Docker just gave up on providing anything on their own. The documentation typically starts with "let's pull a Docker image".

On the other hand, the container runtime is straightforward. I recently discovered that one can run a Docker image with a bash script and the unshare command and get a very tight security setup. That explains the proliferation of various alternatives to Docker for running its images.


Maybe because it breaks things?

I just enabled user namespaces after reading this post. It broke Jenkins, and there doesn't appear to be an easy solution. I mount the Docker socket in the Jenkins container, which is not an option with user namespaces, as the user Jenkins now runs as does not have permission to access the socket.

It seems to be possible to provide this user access to the socket through a socket proxy, but since all containers use the same user this seems to defeat the purpose of using namespaces in the first place.

Cherry on top: although `docker run` supports running containers with custom userns settings, docker swarm, which I use to run Jenkins, does not.

So as far as I can tell my only options are:

1. Go back to not using user namespaces

2. Make the docker daemon on the host available over HTTP, which is really something I was trying to avoid...

Anyone have a more elegant solution?


Mm, if you're bind mounting in the Docker socket, enabling user namespaces won't help much. You just have to deal with the fact that you have a privileged container (Docker API access == root, at least unless you're using authz). It'd be nice if we could see more RBAC around Docker API so you could do things like "grant only permission to run this one container".


Totally. But the vast majority of containers I use do not get a bind mount to the Docker socket... for which user namespaces would be a very nice feature.


Yeah, definitely turn it on where possible, just important to realize that it's not a panacea (some people really hyped it up to be before the feature was released and criticized Docker for not having it at all). As always, gotta try and find the right sweet spot between convenience and attack surface.


Disclaimer: I'm a member of the Docker Security team.

We're working on a solution for Docker containers and services that would please most people, called Docker Entitlements: https://github.com/moby/libentitlement

These Entitlements are high-level privileges for containers and services that could be baked into images, the same way as macOS/iOS apps. These permissions would allow creating custom {seccomp+capabilities+namespaces+apparmor+...} profiles (effectively security profiles) for better granularity in app sandbox configuration by app developers and ops.

The current POC has `docker run`, `docker service create`, and even the build mechanism working. The integration is actively being worked on and PRs are being prepared.

The issue you mentioned is already opened here: https://github.com/moby/libentitlement/issues/44

Feel free to have a look at it, open issues, and participate, or reach out through GitHub, as I'm the lead and would love to discuss use cases :)


As far as I remember, it's because it breaks overlay filesystems, which are a major space saver in the Docker world. Something might have changed, but last time I checked, you couldn't "offset" uids/gids on a filesystem overlay, so every layer of the container would have to be copied and chowned (slowly).

This would obviously only work for minimal containers (i.e. ones that don't contain a distribution), but software has to be pretty much built for such a case (e.g. statically linked, no dependencies on common tooling - popular with Go, but your Python application won't work; edit: unless you copy all the layers, that is).

You can read the docs here: https://docs.docker.com/engine/security/userns-remap/#prereq..., and note that it stores image/container layers in subdirectories under /var/lib/docker.

Tl;dr: user namespaces are inherently incompatible with many of the usability features Docker brings over other solutions, while not being particularly useful for many popular use cases (no shared hosting, and little difference in consequence between escalating to root of the container and root of its host - though that last assumption is frequently wrong).


Also, people hold their bind mounts to the host near and dear, and user namespaces would break all kinds of things people expect to "just work" with bind mounts. Having user namespaces on by default would break tons of existing scripts, Compose/Kube files, etc. that do things like mount /var/lib/mysql into the container for persistence.


I tried it a couple of months ago. It immediately broke the build of one of the images. It was a known bug. So I guess I'll just wait one more year to try.

In the meantime, I make sure that all my containers run as non-root with maximum security restrictions. The exception so far has been sshd from OpenSSH, mostly due to incorrect porting from OpenBSD in portable ssh.



