This exploit is interesting, but if you are doing container security correctly it's actually not a big deal. In particular, if you are setting per-container user namespaces, like you ought to be, then this exploit doesn't do anything. In fact you can actively give a user-namespaced container any CAPs you want, because they are isolated to that container's uid:gid offset.
Obviously, giving containers unnecessary CAP privileges is unwise, but if you are practicing sound security best practices then there would be multiple layers of defense between you and this CVE. I think a strong AppArmor profile and seccomp profile would also make this CVE moot.
Edit:
Also, this exploit relies on you being able to fork up to a certain pid value. You can and should take advantage of Linux's per-cgroup limits (the pids controller); see the sketch below. No container needs more than 255 threads (even if some do, you can make special exceptions for such applications).
Edit2: Additionally, this CVE relies on the getuid syscall being available. There is no reason to give a container this syscall; you should block it, a la this guide: https://rhelblog.redhat.com/2016/10/17/secure-your-container...
I have to say I'm more than a little disappointed in Twistlock for not pointing out what countermeasures you can employ against this and other CVEs.
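For the per-cgroup cap, a minimal sketch using the cgroup v1 pids controller ("mycontainer" is a hypothetical group name; Docker exposes the same knob as docker run --pids-limit):

    /* pids_cap.c - a minimal sketch, not production code.
     * Caps a cgroup at 255 tasks via the v1 pids controller; the path
     * assumes cgroup v1 mounted at /sys/fs/cgroup. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/fs/cgroup/pids/mycontainer/pids.max", "w");
        if (!f) {
            perror("pids.max");
            return 1;
        }
        fprintf(f, "255\n");  /* fork()/clone() beyond this fail with EAGAIN */
        return fclose(f) == 0 ? 0 : 1;
    }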
1) User namespaces don't magically protect you from a vulnerability that allows writing to kernel memory. Neither would AppArmor. seccomp could theoretically, but waitid is a pretty fundamental Unix syscall and blocking it would break a lot of basic software (a sketch of such a filter follows at the end of this list).
The author devised a particular exploit, but his example was hardly the only way to leverage the vulnerability. Being able to write to kernel memory is about as huge a vulnerability as you can get. Just because you can't think of a way to leverage a vulnerability doesn't mean an attacker can't; your failure of imagination is not evidence that it cannot be done.
2) Plenty of containers need more than 255 threads. Like, pretty much any Java server. In any event, this particular exploit doesn't necessarily require hundreds or thousands of simultaneous processes.
3) Blocking getuid is even worse than blocking waitid. Block getuid and you'll break glibc and god knows what else. In any event, it would be futile, as the real and effective UIDs are passed to the process through the auxiliary vector (auxv) when the kernel executes it.
4) You're missing the forest for the trees. The real moral of the story is this: "In 2017 alone, 434 linux kernel exploits were found". Unless you're prepared to pore over every published exploit, 24/7, meticulously devise countermeasures, and run effectively crippled software, you really shouldn't be relying on containers to isolate high-value assets. I wouldn't rely on VMs, either, as the driver infrastructure of hypervisors has also proven fertile ground for exploits.
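To make the seccomp point in 1) concrete: mechanically, blocking waitid is trivial; the cost is everything it breaks. A minimal sketch using libseccomp (link with -lseccomp; as far as I know no shipped runtime profile actually does this):

    /* deny_waitid.c - a minimal sketch, not a recommendation.
     * Installs a default-allow filter in which waitid() fails with
     * EPERM; a contained workload would then be exec'd under it. */
    #include <errno.h>
    #include <seccomp.h>

    int main(void)
    {
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
        if (!ctx)
            return 1;
        if (seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM),
                             SCMP_SYS(waitid), 0) < 0 ||
            seccomp_load(ctx) < 0) {
            seccomp_release(ctx);
            return 1;
        }
        seccomp_release(ctx);
        /* exec the workload here: anything that reaps children through
         * waitid() (or a libc wrapper built on it) now breaks */
        return 0;
    }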
It's not a full kernel-memory CVE; you have ±255 bytes of access to kernel memory from the cred pointer. I have no idea if that extends to userns or not. Also, I think you're confusing Java threads with system threads; they are not the same.
I think you're being overly alarmist. You have to trust someone else's code at some point; otherwise you'll be paralyzed by non-productivity.
> It's not a full kernel-memory CVE; you have ±255 bytes of access to kernel memory from the cred pointer. I have no idea if that extends to userns or not.
As I understand it, a kuid_t is the UID in the root namespace, so setting your cred->uid to 0 gets you treated as equivalent to root on the container host.
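You can watch that translation from userspace: /proc/PID/uid_map is where a namespace's UIDs get converted. A minimal sketch (run it with and without a user namespace to see the offset):

    /* show_uid_map.c - a minimal sketch.
     * Each line of uid_map reads: <uid inside the ns> <uid outside> <count>.
     * Read from the init namespace, the middle column is effectively the
     * kuid; the exploit's trick of writing 0 straight into cred->uid
     * skips this mapping entirely. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/self/uid_map", "r");
        if (!f) {
            perror("uid_map");
            return 1;
        }
        char line[128];
        while (fgets(line, sizeof(line), f))
            fputs(line, stdout);
        fclose(f);
        return 0;
    }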
Also, don't think that limited exposure to kernel memory saves you - take a look at the sudo "vudo" exploit from 2001, in which a single byte that was erroneously overwritten with 0, and then put back, turned out to be exploitable. http://phrack.org/issues/57/8.html (And in general, don't confuse the lack of public existence of an exploit with a proof that a thing isn't exploitable in a certain way.)
> Also, I think you're confusing Java threads with system threads; they are not the same.
Current versions of the HotSpot JVM (where by "current" I mean "since about 1.1") create one OS thread per Java thread: http://openjdk.java.net/groups/hotspot/docs/RuntimeOverview.... "The basic threading model in Hotspot is a 1:1 mapping between Java threads (an instance of java.lang.Thread) and native operating system threads. The native thread is created when the Java thread is started, and is reclaimed once it terminates." Plus there are some other OS threads for the runtime itself.
> I think you're being overly alarmist. You have to trust someone else's code at some point; otherwise you'll be paralyzed by non-productivity.
Sure, but you can choose which code to trust, and how to structure your systems to take advantage of the code you trust and not the code you don't. Putting mutually-distrusted things on physically separate Linux machines on the same network is a pretty good architecture: I trust that the Linux kernel is relatively low on CVEs that let TCP packets from a remote machine overwrite kernel memory.
255 bytes is huge, though IIUC it's actually less than that. Nonetheless, it's much more than is typical. Sometimes these holes are limited to a single word, and only a single value for that word (like NULL), and attackers still come up with marvelously devious exploits.
The critical vulnerability is that the cred pointer address is entirely under your control, so you get to poke at whatever kernel memory you want. The limitations are 1) locating the address of what you want to poke, and 2) the smallish range of values you can write out.
Also, I'm not confusing Java threads with system threads. Most JVMs use a 1:1 threading model. And because on Linux a thread is just a process (which unfortunately still causes headaches with things like signals, setuid, etc), each thread has its own PID.
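That's easy to verify from userspace; every pthread shows up as its own kernel task. A minimal sketch (link with -pthread):

    /* tids.c - a minimal sketch showing 1:1 threading on Linux. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static void *worker(void *arg)
    {
        /* each thread is a separate kernel task with its own TID */
        printf("worker tid=%ld\n", (long)syscall(SYS_gettid));
        return NULL;
    }

    int main(void)
    {
        printf("main pid=%d tid=%ld\n", getpid(), (long)syscall(SYS_gettid));
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        pthread_join(t, NULL);
        return 0;
    }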
I'm not being alarmist, just realistic. Nobody is going to stop using Linux anytime soon. Nor am I. But the fact of the matter is that the Linux kernel is riddled with vulnerabilities. Something like the waitid vulnerability comes along at least 3 or 4 times a year, and that's just the published ones. (IMO, part of the reason is precisely because of complex features like user namespaces, which add tremendous complexity to the kernel. But that's a contentious point.)
At least for high-value assets (however you want to define that), people should just treat Linux as if it lacks secure process isolation entirely, absent a commitment to herculean efforts--extremely locked down seccomp, PNaCl-like sandboxing, etc for all your code that juggles tainted data. Even then, vulnerabilities like recvmmsg come along and ruin your day, but those are rare enough that it would be unfair to single out Linux.
Not only is that pragmatic and reasonable, after 25 years of endless vulnerabilities of this sort I wouldn't trust the judgment of anyone who thought otherwise. And for what it's worth, I'd make much the same point about Windows, although I have much less experience on that platform.
Empirically, among the general purpose, commodity platforms OpenBSD has one of the best track records. Professionally I've had success placing OpenBSD in roles where Linux boxen were constantly hacked. But IT departments in particular, and application developers generally, dislike OpenBSD or would dislike it if it were forced upon them.
More importantly, while nowhere near as bad as Linux, macOS, or Windows, OpenBSD has at least one published severe local kernel vulnerability every year or two. In many cases those OpenBSD boxen I mentioned survived _despite_ being neglected by IT and not kept up-to-date; I know for a fact some were in a known, locally exploitable state for a not insignificant period of time. That makes me think a big part of their relative security is simply related to OpenBSD not being a common target of rootkits and worms. I have little doubt a sophisticated attacker could root an OpenBSD box from the shell, for example, if he was targeting that box. (My rule of thumb when isolating services is that anything running a web service using a non-trivial framework (PHP, NodeJS, etc) provides at least the equivalent to shell-level access to a targeted attacker. Among other things, that means even if I'm writing a privilege separated, locked-down, formally-verified backend daemon, I assume as a general rule that any data it's trying to protect isn't safe from that front-end web application unless it's running on separate hardware.)
While I don't think that security and convenience are necessarily mutually exclusive, as a practical matter they are largely mutually exclusive. Unless you're prepared to accept the burden and cost of using a specialized OS like seL4--and in particular use it in a way that preserves and leverages the stronger security guarantees--your best bet is simply to use separate servers when you want a significant degree of isolation assurance. Separate hardware is not sufficient (if all your boxes have Intel ME cards, or have firmware pushed from a puppet server, or share an engineer's account whose hacked desktop is logging SSH keys, passwords, and Yubikey PINs), but it's largely necessary. This is true whether you're concerned with targeted or opportunistic attacks, but _especially_ opportunistic attacks, which are by far the most common and, in many respects, an important element to targeted attacks.
Separate hardware is a uniquely simple, high-dividend solution. But the point is to be realistic about the actual robustness of your solutions, to be able to easily identify and mitigate risks, so you can make more informed security decisions. And it all depends on what you're protecting and what sort of investment you're capable of making. Just endeavor to predicate your decisions on accurate and informed assessments. Among other things, that means being honest about and understanding your own lack of knowledge and experience.
Similarly, continuity and long-term maintenance are critical to security, which means you need to be honest about institutional capabilities (or for a private server, what you're prepared to track and upgrade 3 years from now.)
Linux, OpenBSD, co-location, and cloud hosting can all be part of a perfectly robust strategy. And HSMs probably should be, too, which is basically just a way to attach a tiny, isolated, dedicated piece of hardware to a computer. But none of these options alone are likely to be reasonable choices, all things considered, especially in the organizational context.
> No container needs more than 255 threads
> Additionally, this CVE relies on the getuid syscall being available. There is no reason to give a container this syscall; you should block it
The problem with MAC schemes is that, in practice, they lead to security people imposing random and arbitrary restrictions on general APIs in the name of least privilege. In doing so, they break the orthogonality of general-purpose platform concepts and break the reductive mental model necessary to get anything done. It's a misunderstanding of what least privilege actually means.
Security is better achieved by creating clear, principled security domains and boundaries, then controlling access to these domains in a general and transparent way. Saying "you, unix process, you can call system call X, but not system call Y, because in my opinion, Y is risky", when neither X nor Y breaks through a security domain, is bad practice. So is arbitrarily capping the number of threads in a container.
> Edit2: Additionally, this CVE relies on the getuid syscall being available
This exploit relies on it. The vulnerability does not. The exploit happens to use getuid() along the way to using heap spraying, but the writeup is pretty clear that neither getuid() nor heap spraying is required.
Yeah, and I'm wrong about that part anyway. You can't restrict or block getuid without breaking glibc. I meant setuid, but that call isn't used in this exploit. I got confused.
Hmm. Perhaps this is a difference between dev and ops, but almost every tool we use comes out of the box with settings unfit for production. Instead, they're tuned for development, and in some cases, deploying to a staging environment. At least, this has been the case in my experience.
Ah, dev... ops... How about DevOps? As a (originally) dev I bring my app to production. How do I stand a fighting chance to reconfigure the defaults in the way you suggest, without suddenly gaining a whole new set of skills? Good defaults would be helpful, even if they're very conservative. I can break things open, but at least then I know what to read up on.
100% agree here. Ship with secure settings by default and have simple "developer guides" that show what to crack open for easier use in non-production environments.
I would argue the opposite. As a developer I want to have tools that make my life easier to - you guessed it - develop. Enabling unnecessary secure defaults that either hinder or don't apply to my use case is silly.
There's a reason most users choose Ubuntu over OpenBSD as their workstation. I would put good money on the reason being that it's "secure enough" without getting too heavy-handed on production use cases.
However, I do agree that there has to be a balance. Most tooling I write tends to lean more towards the "good user experience" side first, and then document the production use case. Either that or release two separate (but similar) products; one for developers, one for operations teams. Docker's doing that with the Community Edition/Enterprise Edition, but I still think the Community Edition is far too heavy-handed when it comes to things like pulling images from "insecure" registries.
> As a developer I want to have tools that make my life easier to - you guessed it - develop. Enabling unnecessary secure defaults that either hinder or don't apply to my use case is silly.
The problem is that a lot of these insecure defaults will be rolled out to production by a "developer", only to get hacked later, because marketing and frictionless dev are more important than sane default security.
> Over a quarter of MongoDB databases left open to the internet have been ransacked by online extortionists.[0]
They "forgot"/"did not know" the password is not set. This was for db tech.
Now you want your average dev, who is being "forced" to use docker more and more, to know how to set up container tech securely?
But hey, I need to get this on my CV and it is just click click install. So what can go wrong?
You're touching upon a natural tension, where developers take certain shortcuts that operations people then have to deal with, which, I suppose, DevOps practices try to resolve. The developers pay the price of less freedom for that, or if you will, the price of (production) reality. If that's not what your shop is doing, then what you say makes sense to me, and clearly DevOps practices aren't for everyone.
I don't think the tension is necessary in most cases. You can make something developer-friendly and secure at the same time; you just have to allow for both. Simple things, like taking any credentials from environment settings, can make it easy for both dev and ops/prod to work with and be secure.
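For instance, a minimal sketch (the variable names are hypothetical):

    /* a minimal sketch: pull credentials from the environment so dev,
     * staging, and prod can each inject their own without code changes.
     * APP_DB_USER / APP_DB_PASSWORD are made-up names. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *user = getenv("APP_DB_USER");
        const char *pass = getenv("APP_DB_PASSWORD");
        if (!user || !pass) {
            fprintf(stderr, "APP_DB_USER / APP_DB_PASSWORD must be set\n");
            return 1;  /* fail closed instead of falling back to a default */
        }
        printf("connecting as %s\n", user);  /* never log the password */
        return 0;
    }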
> In particular if you are setting per-container usernamespaces, like you ought to be, then this exploit doesn’t do anything.
User namespacing in docker is enabled at the daemon level, not per container, so all containers share the same offset. This would ensure that a root user in the container would escape to a different uid on the host, but doesn't prevent someone from moving sideways through the containers on the same host.
Note that enabling this will break the developer workflow of mounting files from the host into the container. I believe files will show up with the wrong ownership inside the container.
Sure. User namespacing is a container security feature that allows you to grant a process root access to a filesystem while the process itself is not root. To the running process it appears that it is or can run as root, but on the host it actually isn't root, just some uid:gid offset. Here's the man page explaining more:
http://man7.org/linux/man-pages/man7/user_namespaces.7.html
The gist is that the container is further sandboxed by the kernel, independently of the higher-level security precautions. It's not perfect by itself, but used in conjunction with other features like AppArmor or SELinux and seccomp it can make a container virtually sandboxed.
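A minimal sketch of the mechanism from userspace, assuming unprivileged user namespaces are enabled (no container runtime involved):

    /* userns_demo.c - a minimal sketch of user-namespace uid mapping.
     * Become "root" inside a new user namespace while remaining an
     * unprivileged uid on the host. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        uid_t outer = getuid();
        if (unshare(CLONE_NEWUSER) == -1) {
            perror("unshare");
            return 1;
        }
        /* map uid 0 inside the namespace onto our unprivileged outer uid */
        char map[64];
        int n = snprintf(map, sizeof(map), "0 %u 1", outer);
        int fd = open("/proc/self/uid_map", O_WRONLY);
        if (fd == -1 || write(fd, map, n) != n) {
            perror("uid_map");
            return 1;
        }
        close(fd);
        printf("inside: uid=%u (was %u on the host)\n", getuid(), outer);
        return 0;
    }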
Because the Docker project doesn’t make money off of security. It is actually quite infuriating, because they have become the de facto container image standard. Most of their security has actually come from Twistlock (I am not a Twistlock employee, FYI). My recommendation to most Admins or Devs that are serious about container security is to let your developers use docker, but run your images with CRI-O on your servers:
http://cri-o.io/
The docker project has done quite a bit of work on container security, so I'm not sure it's entirely fair to say that their not enabling user namespaces by default is for that reason.
For example the work that was done on their seccomp filtering and apparmor profiles.
My guess would be that it's because user namespacing can introduce some issues (e.g. when mounting host filesystems), and they've decided the trade-offs aren't worth it.
Viewing docker containers as anything more than a bundling and deployment system is a mistake. While they might help with security they will never be completely secure and you should architect your deployments with that in mind.
Unless you are a giant enterprise shop with the resources to staff a decent sized K8s team, you should use the hosted solutions.
There are trade-offs to using userns, and many people don't like the current set of trade-offs. In addition, changing a default like this is a breaking change.
Admins can enable userns by default in a daemon, but making it a hard-coded default is much more difficult.
It's not just a matter of enabling user ns. There is no support at the vfs layer for uid/gid mapping. This means that in order to use it, images must be chowned with the remapped IDs.
Per-container mappings are not supported for this reason (it would require copying and chowning the entire image for each container mapping).
Do you care to qualify your statement about CRI-O?
I recall seeing some patches submitted to make it possible to pass a uid/gid offset to the mount syscall at one point, when people were implementing user namespaces for container runtimes like docker. So is this fixable without having to make every filesystem implement this feature, or is there something else holding back better support for uid shifting for use with user namespaces?
Nah, Docker has taken lots of strides in the right direction for security over the years, Twistlock or not, albeit with a few weird dangling remainders. They'd love to turn user ns on by default but it'd break lots of existing stuff. Many more users would be mad about having that on by default than leaving it off.
CRI-O is bleeding edge. I'm not sure it's ready for production usage. But it looks very promising. The sooner we can all dump docker in kubernetes the better.
Docker excels at image building, especially now with multi-stage Dockerfile support. It seems all those alternatives to Docker just gave up on providing anything of their own. The documentation typically starts with "let's pull a docker image".
On the other hand, the container runtime is straightforward. I recently discovered that one can run a docker image with a bash script and the unshare command and get a very tight security setup. That explains the proliferation of various alternatives to Docker for running its images.
I just enabled user namespaces after reading this post. Broke Jenkins and there doesn't appear to be an easy solution. I mount the docker socket in the Jenkins container, which is not an option with user namespaces as the user Jenkins now runs as does not have permission to access the socket.
It seems to be possible to provide this user access to the socket through a socket proxy, but since all containers use the same user this seems to defeat the purpose of using namespaces in the first place.
Cherry on top: although `docker run` supports running containers with custom userns settings, docker swarm, which I use to run Jenkins, does not.
So as far as I can tell my only options are:
1. Go back to not using user namespaces
2. Make the docker daemon on the host available over HTTP, which is really something I was trying to avoid...
Mm, if you're bind mounting in the Docker socket, enabling user namespaces won't help much. You just have to deal with the fact that you have a privileged container (Docker API access == root, at least unless you're using authz). It'd be nice if we could see more RBAC around Docker API so you could do things like "grant only permission to run this one container".
Totally. But the vast majority of containers I use do not get a bind mount to the Docker socket... for which user namespaces would be a very nice feature.
Yeah, definitely turn it on where possible, just important to realize that it's not a panacea (some people really hyped it up to be before the feature was released and criticized Docker for not having it at all). As always, gotta try and find the right sweet spot between convenience and attack surface.
We're working on a solution for docker containers and services that should please most people, called Docker Entitlements: https://github.com/moby/libentitlement
These Entitlements are high-level privileges for containers and services that could be baked into images, the same way as for macOS/iOS apps. These permissions would allow creating custom {seccomp+capabilities+namespaces+apparmor+...} profiles (effectively security profiles) for better granularity in app sandbox configuration by app developers and ops.
The current POC has `docker run`, `docker service create` and even the build mechanism working. The integration is actively being worked on and PRs are being prepared.
As far as I remember things, because it breaks overlay filesystems, which are a major space saver in Docker world. Something might have changed, but last time I checked, you couldn't "offset" uids/gids on a filesystem overlay, so every layer of the container would have to be copied and chowned (slowly).
This would obviously only work for minimal containers (i.e. ones that don't contain a distribution), and software has to be pretty much built for such a case (e.g. statically linked, no dependencies on common tooling — popular with Go, but your Python application won't work; edit: unless you copy all the layers, that is).
Tl;dr: user namespaces are inherently incompatible with many of the usability features Docker brings over other solutions, while they're not particularly useful for many popular use cases (no shared hosting, minor differences in consequence between escalating to the root of the container and its host - though that's an assumption frequently wrongly made).
Also, people hold their bind mounts to the host near and dear, and user namespaces would break all kinds of things people expect to "just work" with bind mounts. Having user namespaces on by default would break tons of existing scripts, Compose/Kube files, etc. that do things like mount /var/lib/mysql into the container for persistence.
I tried it a couple of months ago. It immediately broke the build of one of the images. It was a known bug. So I guess I'll just wait one more year to try.
In the meantime, I make sure that all my containers run as non-root with maximum security restrictions. The exception so far has been sshd from OpenSSH, mostly due to incorrect porting from OpenBSD in portable ssh.
> In 2017 alone, 434 linux kernel exploits were found, and as you have seen in this post, kernel exploits can be devastating for containerized environments. This is because containers share the same kernel as the host, thus trusting the built-in protection mechanisms alone isn't sufficient. Make sure your kernel is always updated on all of your production hosts.
Personally, I'd reasonably trust Xen or KVM or something else with hardware-based virtualization and the like to protect me in a multi-tenancy scenario. Much less so in the case of Docker. Sharing a full kernel with potentially malicious actors is riskier than sharing a hypervisor; there is much more surface area for attack.
The code is significantly different but I still see a lack of access_ok(), so was the checking performed somewhere else that I didn't notice (I haven't looked closely at this part of the kernel before)?
Apparently Linux privilege escalation bugs are now "Docker container escapes"? Thanks to Twistlock for a detailed article but call it what it is, a Linux vulnerability not specific to Docker.
Since people rely on docker containers as an isolation layer between potentially unfriendly services, a linux bug that allows breaking that isolation barrier is a relevant thing. It’s worth being called “docker container escape”
People should not rely on that to any degree more than they would rely on colocated processes on a VM being isolated. The easiest way to be safe is to assume that all containers are already broken out of - what would you do then? Make sure processes are running as non-root, use various protection layers (pick your poison - SELinux, gresecurity, etc.), take away capabilities, and don't run workloads you don't trust.
sure, any Xen guest escape receives equal amounts of press for exactly that reason: It's an isolation barrier breaking down. However, trivial exploits breaking VM isolation have been relatively rare lately.
- A vulnerability is a software bug with particular behaviors and ramifications that allow it to be used maliciously.
- An exploit is a crafted piece of input data that is designed to trigger a vulnerability to execute arbitrary code, crash the target (Denial-of-Service), etc.
> In 2017 alone, 434 linux kernel exploits were found, and as you have seen in this post, kernel exploits can be devastating for containerized environments.
There are a few places in the article like this one where the correct terminology is vulnerability not exploit. cvedetails.com aggregates vulnerabilities. Places like exploit-db.com aggregate exploits people have written to take advantage of vulnerabilities to enable them to perform some unintended action against the target.
Any ideas why this is branded as "Docker"? Are the same namespacing constructs not being used by other Linux container runtimes? I think this should be titled "Escaping Linux containers" as docker is not at fault here?
There's a long tradition by enterprise vendors large and small to market someone else's product as insecure, in order to create demand for their "improved, secured" version.
In this particular instance, Twistlock is selling Docker security by amplifying the meme of "insecure Docker". The Docker brand has visibility with the target audience (Enterprise IT), so it's a good target for this kind of piggybacking.
This type of FUD marketing happens all the time in many different markets, it's not specific to Docker.
The author shows a concrete exploit of the kernel bug described in CVE-2017-5123 as he has developed it in the context of the docker container environment.
He shows how to use this bug to break out of docker, so he calls the blog post "Escaping Docker ...".
Which is IMHO the most interesting container runtime to write such an exploit for first because it is very widely deployed, but it might also just have been what the author is most familiar with or what was easiest to develop for him.
"Ubuntu container" is just not a name typically used for anything. "Docker container" is.
Let's make it realistic and say he had used RedHat OpenShift as his target and example for the exploit. I'd be completely fine with the title referencing that exact product by name.
Why would he have to dance around what he is using in his demo? Maybe that concrete product has multiple layers of security or lacks them, or uses a certain version etc. He can only speak to what he himself was using and testing. "Escaping Docker container..." is the best short description (as you would need it for a title) of this demo exploit I can think of.
It is a reasonable _assumption_ that other container runtimes on linux might be affected by the same kernel bug. The article does not explore that and the author has no duty to do so just to avoid using a branded technology name.
How would you reasonably talk about "Linux containers" without having a very exhaustive list of all existing implementations and testing all of them? If one of them is not affected you are now factually wrong.
"Docker" isn't at fault here either way, as Docker isn't its own "execution driver" (in Docker parlance) any more; that would be https://github.com/opencontainers/runc.
But to answer the spirit of your question, each container runtime uses its own peculiar combination of such constructs. It's helpful to know that this attack allows you to break out of the combination used specifically by runc, and thereby to break out of any system relying on Docker (with the default runc execution driver).
I doubt any runtime would have prevented this bug; some have seccomp-bpf profiles to blacklist some kernel operations, but I doubt any block a function as basic as waitid().
If anything, this points out that the use case of Docker for security isolation, such as in a multi-tenant architecture, is probably still not a good one.
In most use cases I see containers used for rapid and consistent deployment. The isolation benefit with multiple containers on a host is that if you install things with different library dependencies you don't run into conflicts. As such, the comparison for the common use case is just software installed directly on the host, which also is subject to this vuln.
> In 2017 alone, 434 linux kernel exploits were found, and as you have seen in this post, kernel exploits can be devastating for containerized environments. This is because containers share the same kernel as the host, thus trusting the built-in protection mechanisms alone isn't sufficient.
More than one kernel exploit _per day_. Exploiting Linux is just a matter of finding one such vulnerability and using it. This can be done in a single day.
There's just no fixing megabytes of buggy kernel code.
It really drives home the need for a proper OS based on a verified, capability-enabled microkernel such as seL4.
I'll surely get a lot of flak for this, but these kinds of bugs would be trivial to avoid in C++. All you need is to make the pointer arguments to syscalls be some other data type (say, user_ptr<T>) that performs an access-check upon conversion to a raw pointer. Then the compiler simply wouldn't let you bypass the access-check, so you simply could not forget to do so. That's the fundamental difference between C++ and C: one of them actually lets you write code that cannot contain many classes of mistakes, and the other, well, doesn't. For the life of me I don't understand the stubbornness behind sticking to the same languages and tools from decades ago.
Well, the exploited code used unsafe interfaces in an unsafe manner. It is effectively equivalent to calling something like reinterpret_cast<T*>(user_ptr.get()) to bypass the safety provided by the compiler. How do you avoid that with C++ alone? I guess you would need some external static analyzer. The Linux kernel does have one: Sparse. IIRC, it can report casts that discard __user annotations on pointers.
As for stubbornness, C can be used safely with proper discipline. Kernel development does require a certain amount of experience and discipline, so arguably C can be used by kernel developers in a mostly safe manner. That's why some view it as a feature: if you don't have the required discipline then just stick to userspace development.
Thanks for the explanation! __user and sparse seem to be almost exactly the kinds of tools I had in mind, with the caveat that __user would be the default for a pointer argument to a syscall, so that it wouldn't need to be specified.
I'm not sure I understand what interface you're referring to that was "unsafe" and subsequently "used in an unsafe manner". What is the "unsafe interface" here that was being used in an unsafe manner? It seems to me that the problem was that the pointer was not marked as __user? Which is awful, because shouldn't __user be the implicit default behavior for a pointer argument to a syscall? Why should the default behavior be the unsafe one you pretty much never want?
System call pointer arguments are usually marked with __user annotations (I cannot recall if there are some weird calls that may need a kernel pointer, none should need it, but there may be some legacy one). In particular, the infop argument to waitid() is marked as user-space pointer [1].
Before using a pointer to user space, one should check it with access_ok(). The usual safe interfaces — copy_{to,from}_user(), put_user(), get_user() — always perform this check and fail with an error if the pointer is not an okay user-space pointer.
The commit that introduced the vulnerability [2] replaced the safe interface with unsafe ones, possibly for performance improvements. The code used put_user() function to set individual fields of a struct. Multiple calls to put_user() were replaced with multiple calls to unsafe_put_user() which does not perform access_ok() check every time. A check for NULL pointer was added before the stores. unsafe_put_user() still checks whether the address points to an actually mapped memory location, but does not verify whether the location is in user-space.
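For reference, the shape of the code in question, paraphrased and abridged from the 4.13-era kernel/exit.c (the field list is shortened and exact details vary by version):

    	if (!infop)
    		return err;

    	/* the fix: reject any pointer outside the user address range */
    	if (!access_ok(VERIFY_WRITE, infop, sizeof(*infop)))
    		goto Efault;

    	user_access_begin();
    	/* unsafe_put_user() skips the per-call access_ok() check, so the
    	 * one above is all that stands between this and a write through
    	 * an attacker-chosen (mapped) kernel address */
    	unsafe_put_user(signo, &infop->si_signo, Efault);
    	unsafe_put_user(info.status, &infop->si_status, Efault);
    	user_access_end();
    	return err;
    Efault:
    	user_access_end();
    	return -EFAULT;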
The commit was not really discussed in depth on LKML [3], as it came from Al Viro, who should know better, is one of the Sparse maintainers, etc. Some projects require human justification for any usage of unsafe interfaces during code review (like flagging a review with 'needs-check' or something that requires a sign-off by another human that the unsafe thing is actually safe). This may have been a case where that could have mattered, as the static analysis tool should not produce bogus warnings for interfaces which are designed to perform unsafe stuff. Though it may also be useful to add a check to Sparse which verifies that unsafe_{get,put}_user() calls are preceded by an access_ok() call in the same function.
According to Linus, programmers who prefer C++ are so bad that he would have chosen C solely to avoid dealing with their "total and utter crap" code, and C++ is only good for kernel development if you limit yourself to a C-like subset anyway [1].
Only if you're lucky. Most of these exploits probably took weeks to find and analyze properly; it's not like one person found more than one a day. They're found because whole teams are working with the Linux kernel at the same time and either happen upon them or actively look for them.
Linux kernels in production (since we all now like to run docker there :)) without grsec/seccomp have always been pretty dangerous. What I dislike about docker is their feature creep and lack of proactively steering their users to accepting more secure defaults. The mindset towards security in the Linux kernel community remains shockingly stubborn compared to the shift to "better security", which is taking over the rest of the industry.
Actually, the most affected by this CVE would be medium-sized companies that don't invest enough in internal development to pump out services fast with secure defaults (maybe startups rushing to their MVP, too). Companies running totally automated farms with Kubernetes or Docker Swarm usually don't have containers with long uptimes.
> The vulnerability is that the highlighted access_ok() check was missing in the waitid() syscall.
Why in the world does this class of vulnerabilities still exist in 2017? Why are kernel maintainers not writing some kind of C linter that makes sure every single pointer argument to every syscall is passed to a well-known function like access_ok (Linux) or ProbeForRead (Windows)? Literally all you need is a syntactic check; you don't even need to do any kind of semantic analysis... since all you want is to flag the code so someone can inspect each spot manually. Why is this not done?!
C++ would also make it harder to get it wrong. Its type system is powerful enough to enforce rules like "you must call access_ok before writing through a pointer": you just have access_ok transform an inaccessible pointer token of some sort, passed in as a syscall parameter, into a different kind of object through which you can write into memory.
The generated machine code would be identical to what's in the kernel today, but it'd be both safer and cleaner. C++ still has to get over the bad gang-of-four-1990s-era-object-goo reputation it has among systems people.
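A hypothetical sketch of such a type, in user-level C++ with a stand-in for the kernel's real range check (none of these names exist in the kernel):

    // user_ptr.hpp - a hypothetical sketch, not real kernel code.
    // The raw pointer is private; the only way out is through the check.
    #include <cstddef>
    #include <cstdint>
    #include <optional>

    template <typename T>
    class user_ptr {
    public:
        explicit user_ptr(T *p) : raw_(p) {}

        // Equivalent of access_ok(): yields a writable pointer only if
        // the range lies in user space. Forgetting the check is now a
        // compile error, not a CVE.
        std::optional<T *> check(std::size_t len = sizeof(T)) const {
            auto addr = reinterpret_cast<std::uintptr_t>(raw_);
            if (addr + len > kUserTop || addr + len < addr)  // stand-in check
                return std::nullopt;
            return raw_;
        }

    private:
        static constexpr std::uintptr_t kUserTop = 0x00007fffffffffffULL;
        T *raw_;
    };

    // usage inside a hypothetical syscall:
    //   auto p = infop.check();
    //   if (!p)
    //       return -EFAULT;          // the compiler forced us through this
    //   (*p)->si_signo = signo;      // only reachable after the check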
Does this escape only work if they have root inside of the container? I usually try to make it so my containers always contain a non-root process as an extra layer of security.
Randomized pids wouldn't necessarily help that much in this situation, especially if the getuid syscall is available. However, I agree with your general sentiment that there are basic security features Linux could implement to make a lot of CVEs impotent. I think the community is coming around, but this stuff takes more work than most people may realize.
This might be a bit off topic but I wonder why the vulnerability has been patched this way:
if (!access_ok(VERIFY_WRITE, infop, sizeof(*infop)))
goto Efault;
Why doesn't the if use curly brackets? I thought it had been established that it is best practice to always use curly brackets, even when they are optional, especially after Apple's infamous goto bug[1].
Secondly, why does it use goto at all? I thought it had also been established not to use goto unless it is the only performant solution (and performance is important in that case). Sure, Efault will probably kill the program, but wouldn't it still be better to use a function call, considering that the desired resolution should be the same?
1. Linux kernel coding style is documented here [1], and contains the line:
> Do not unnecessarily use braces where a single statement will do.
2. There is no built-in exception handling in C. `goto ERROR_HANDLING_CODE` is a common and well established pattern to handle exceptions in C, see e.g. [2].
And there is quite a consensus that it is the best solution to this problem.
Goto being considered harmful is a generally true statement. However, this usage is a more specific exception to the rule, with objective benefits vs alternatives.
From your link: Maybe the coding style contributed to this by allowing ifs without braces, but one can have incorrect indentation with braces too, so that doesn't seem terribly convincing to me.
goto is kind of a necessity in any complex C codebase unless you want to duplicate tons of code. Sometimes you need to jump way out of the context, especially to handle errors, and C does not have "exceptions" (although you could do them with setjmp/longjmp)
goto is the best way to do early exit and cleanup from a C function. The alternatives are 1) deeply nested if's, one level for each function call that can return an error code, or 2) repeat the same cleanup code over and over again at every exit point.
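The idiom, as a minimal sketch:

    /* a minimal sketch of the goto-cleanup idiom */
    #include <stdio.h>
    #include <stdlib.h>

    int do_work(const char *path)
    {
        int err = -1;
        char *buf = NULL;
        FILE *f;

        f = fopen(path, "rb");
        if (!f)
            goto out;
        buf = malloc(4096);
        if (!buf)
            goto out_close;
        if (fread(buf, 1, 4096, f) == 0)
            goto out_free;   /* every failure path unwinds the same way */
        err = 0;
    out_free:
        free(buf);
    out_close:
        fclose(f);
    out:
        return err;
    }

    int main(void) { return do_work("/etc/hostname") ? 1 : 0; }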