Copying my comment from the earlier submission that didn't gain much traction here:
What an absolutely amazing tour-de-force of a devastating design flaw in all versions of macOS and iOS and tvOS and watchOS!
The negotiations detailed in the bug report timeline about meetings between "senior apple and google leadership" for keeping this secret past the general deadline really underlines that.
Yeah - the failed mitigations followed by a "long term" fix were interesting as well. Apple literally had to change execve() this late in the OS's development cycle to allocate new task and thread structs (that's two extra allocations and copies in a hot path!) to fix it for good. That this design problem lingered for so long doesn't look good for Apple - it's one thing for a use-after-free bug in some obscure piece of code to linger, but a bad design affecting a ton of their own frequently used code standing for so long is somewhat brown-bag category!
I wonder what other fallout we might experience from this.
Business as usual with macOS. The other day I was browsing the ocspd source code. Turns out it calls openssl using system(). So openssl is officially deprecated on macOS and yet they're using it internally to handle certificates?! And there's an enlightening comment:
/* Given a path to a DER-encoded CRL file and a path to a PEM-encoded
* CA issuers file, use OpenSSL to validate the CRL. This is a hack,
* necessitated by performance issues with inserting extremely large
 * numbers of CRL entries into a CSSM DB (see <rdar://8934440>). */
ocspd was introduced with 10.4. A decade ago. And that's really the problem with macOS: There's no refactoring of old hacks, but rather just bolting on of ever more new stuff.
Apple needs to take a bit of those tens of billions of dollars they have sitting around and spend it on starting from scratch with something that's not horrifically crufty. The quality of their software is lagging so far behind the quality of their hardware right now. Realistically, I think we may just be at the point where operating systems and all the stuff the companies put on top of them are too complicated to keep developing in the traditional way with traditional tools. Formal verification might be the cheapest way forward at this point.
So far as the current state of the art in computer engineering goes, we don't know how to completely rewrite a system as complicated as XNU without creating fresh batches of implementation errors. So this is a little like suggesting Apple use its hundreds of billions of dollars to build an iPhone battery that only needs to be recharged once a month.
We may someday get an XNU rewrite, but probably not until software engineering produces a new approach to building complex systems reliably that works at the scale (here: number of developers and shipping schedule) Apple needs.
This is so, so true that I wish there were enough beer in this world to gift you with. There's a lot of cruft in XNU, and there's even more of it in the rest of the system, but all this heap of hacks isn't just useless cruft that we'll be better off without. That heap of code also contains almost twenty years' worth of bugfixes and optimizations from more smart engineers than Apple can hope to hire and get to work together in a productive and meaningful manner. All this unpleasant cruft is what keeps the system alive and well and the users happy enough to continue using it.
More often than not, systems that get boldly rewritten from scratch end up playing catch-up for years. Frankly, I can't remember a single case when a full rewrite with an ambitious timetable wasn't a full-scale disaster. The few success stories, like (what eventually became) Firefox, have taken a vastly different approach and taken a lot longer than users would have wanted.
A lot of idealistic (I was about to write naive) engineers think it's all a matter of throwing everything away. That's the easy part. Coming up with something better is the really hard part, and it's not achieved by just throwing the cruft away. If you innocent souls don't believe me, come on over to the Linux side, we have Gnome 3 cookies. You'll swear you're never going to touch anything that isn't xterm or macOS again.
A lot of macOS/iOS was written from scratch, though: Core Graphics (vs. Cairo and FreeType), Core Animation, Core Text (vs. pango), WindowServer (vs. X11), UIKit (vs. Cocoa), IOKit (vs. the native BSD driver framework), Cocoa Finder (vs. Carbon Finder), LLVM/clang/Swift (if you count Chris Lattner's work on it at UIUC)...
Of those, the last one is very impressive: it's a decade-long from-scratch project that has succeeded in competing with a very entrenched project (GCC) in a mature market.
Regarding GNOME 3, the delta between GNOME 2 and GNOME 3 is far less than the delta between NeXTSTEP+FreeBSD and the first version of Mac OS X.
> This is so, so true that I wish there were enough beer in this world to gift you with. There's a lot of cruft in XNU, and there's even more of it in the rest of the system, but all this heap of hacks isn't just useless cruft that we'll be better off without. That heap of code also contains almost twenty years' worth of bugfixes and optimizations from more smart engineers than Apple can hope to hire and get to work together in a productive and meaningful manner. All this unpleasant cruft is what keeps the system alive and well and the users happy enough to continue using it.
This whole premise is a false dichotomy. Apple does not have to throw away Mac OS X, and it does not have to keep piling crap on without fixing things. If you stop the excuses and rationalizations and commit to code quality you can ship an operating system with quality code and minimal bugs. The OpenBSD project has been doing this for two decades with minimal resources. There is no valid excuse other than "we are too lazy and incompetent."
Oh, "too much code", "bad code", "we inherited it", "throwing it away won't work", etc. are baloney excuses without meat. All it takes is the will to hire and commit the right resources with the objective of increasing code quality. I mean, take this bug itself - Apple did fix it, but only after GPZ was on their arse. There's no reason they couldn't have reviewed it themselves and fixed it.
Hasn't the vulnerable code been here for over a decade? Why do people think this was an easy bug to spot? There are dozens of extremely qualified people looking for these things. I think there's a reason there isn't a Nemo Phrack article about this bug: it was hard to spot, and required a flash of insight about the competing lifecycles of objects in two different domains (POSIX and Mach).
> Apple needs to take a bit of those tens of billions of dollars they have sitting around and spend it on starting from scratch with something that's not horrifically crufty.
They certainly don't have to throw everything away. Not having thrown everything away is one of the reasons why OpenBSD is a good example here. Remember all that quality code that was in place before Cranor's UVM? (Edit: actually, the fact that UVM is an improvement over it should say something, too...)
And, at the risk of sounding bitter, in my experience, very few companies have the capability to "commit to code quality", and I don't think Apple is one of them.
Edit: BTW, I really like your blog. You should write more often :-).
Hold on, because I'm pretty familiar with pre- and post-UVM OpenBSD: Arbor Networks shipped on OpenBSD (against medical advice) and ran into a number of really bad VM bugs that Theo couldn't fix because of the UVM rewrite!
> We may someday get an XNU rewrite, but probably not until software engineering produces a new approach to building complex systems reliably that works at the scale (here: number of developers and shipping schedule) Apple needs.
It's conceivable to perform a gradual transition away, though. They could demote Mach to a fast IPC system that just augments BSD, similar to the way the kdbus/bus1 proposal for Linux does. That would be difficult and a long-term project, but it would fix the underlying issue in a way that mostly retains userspace compatibility. Driver compatibility would be more difficult, of course…
That's true, but if you undertake a difficult and long-term project, you want the outcome to be decisive. Mach is ugly and a nest of bugs, but kernels implemented in C/C++ are bug magnets with several orders of magnitude more force.
My prediction is that we don't ever see an XNU refactor/redesign/rewrite so long as C/C++ is the kernel implementation language.
seL4 is an indicator of things to come. We can build complicated OSs with extreme reliability; the up-front cost is just higher than most companies are willing to spend right now, because customers don't yet realize that it's technically possible to avoid the huge costs associated with software failure in exchange for slightly higher amortized software costs.
Until we have a way to extend seL4 or something like it to full multiprocessor operation (without the multiple kernels with separate resources limitation that is currently the only way to use multiple processors with seL4) I'd disagree that we can build general-purpose OSes with verification. Our techniques for verifying concurrent programs are still very primitive and cumbersome, and I don't think many would take an OS where processes can't use multiple hardware threads seriously.
Also, seL4 (being a microkernel) leaves out a huge swath of kernel-level facilities that need to be implemented with the same standard of verification (resource management, network stack, drivers, etc.). Running on a verified microkernel provides a great foundation, but these still add a ton of code that needs to be verified. Plus the concurrency problem will strike again at this level.
L4 is incredibly simple. It is essentially (a word I chose carefully) the opposite of a complicated OS. It also doesn't really do anything.
If you have just a few extremely simple applications you'd like to run in an enclave, L4 is a good way to minimize the surface area between the applications themselves and the hardware.
If you'd like to host a complicated operating system on the simplest possible hosting layer: again, L4 is your huckleberry.
Otherwise: not so useful.
Note that if you just host XNU on top of L4, you might rule out a very small class of bugs, but the overwhelming majority of XNU bugs are contained entirely in the XNU layer itself; having XNU running on an adaptor layer doesn't do much to secure it.
This is kind of a "perfect storm" situation for Apple. At least three vectors are converging:
1. Apple inherited OSX from NeXT, and with it the Mach subsystem. Mach overcomplicates XNU.
2. XNU has become incredibly popular by dint of being shipped in the iPhone. Avie Tevanian probably did not see that coming when they designed the original BSD/Cocoa/XNU/whatever architecture. Regardless: it is now difficult to make sweeping architectural changes in XNU, because of the enormous installed base.
3. Ian Beer is simultaneously very clever and also willing to wade into the XNU Mach fire swamp.
I think it's fair to criticize Apple for designing the XNU frankenkernel. I think it's less legit to say that the presence of this bug class "looks bad" --- it's 2016 and this is just getting published. This is one of those bug classes that is sort of obvious in retrospect, and you wonder why people didn't catch it earlier.
Keep in mind that this is a dangling pointer problem rooted in a stupid design mistake, and that Apple's own kexts used the same vulnerable pattern for years until Project Zero had to disclose it to Apple - after which there were two failed mitigations and then a fix that changes core OS code!
That's not very justifiable no matter how many storms and twisters you throw into the mix ;) I'm not going to argue, but for OS development it looks bad that this wasn't even looked at, much less fixed, for so long - especially given that many of Apple's own kexts had the same issue.
[Also, this isn't an isolated problem, by the way - look up the thread and you'll find two other egregious errors, including one involving execve that was pointed out to Apple in vain. Not long ago I could just remove and insert a kmod and that would lock up the system - go ahead and argue that's not a hot path or that nobody does that, but it does speak to a general lack of code and testing quality. To their credit, Apple did fix that particular one when I reported it.]
The dangling pointer isn't the interesting part of the bug --- dangling pointers are, to a first approximation, the key component of all UAFs, which are a 15-year-old bug class. The interesting bit about this bug is how they arrived at the dangling pointer: they had to manipulate at least three different object lifecycles in the kernel, one of which involved a credential-passing trick that is not super common in normal code.
I disagree that this is a simple bug that anyone should have been able to find. If you'd like to put money on whether or not this is going to win a 2017 Pwnie Award, I would be happy to take your money.
When you take money for writing and maintaining an OS, you have to be competent enough to avoid fundamental design issues like this - especially when you're going to have a bunch of downstream users of the code!
Read the TLDR - exploitation of dangling pointers aside, you can't write code and/or design APIs that do (or make it easy to do) what the TLDR warns about: holding or using a task struct pointer and expecting the euid of that task to stay the same.
Many many places in the kernel do this and there are a great many very exploitable bugs as a result.
Ian Beer certainly is talented but that doesn't excuse Apple being so sloppy!
Aha! I think I understand what has you confused. You seem to think that the TLDR describes some basic rule of XNU programming that people were already aware of and expected to follow. No. Ian Beer invented that rule. In this post. That's why the bug is such a big deal; it's why we call it a "new bug class". It's also why it's the TLDR of the post.
Wow straight up conclusion - I had it confused, right!
It's not a, pardon the expression, fucking XNU-specific programming rule - it's a general rule that was invented long before Ian got to it! You don't hold a reference-counted pointer and operate on it without taking a :gasp: reference first - having that shit in your sample code is just, well, extra shitty!
Also, separately from the dangling pointer issue, the first sentence of the post is literally "This post discusses a design issue at the core of the XNU kernel"!
AFAIU, the dangling pointer problem wasn't a defect in the kernel; it was a problem introduced by some module authors who misunderstood the ownership semantics of the API. It might not even have been a pointer, per se, but I guess that's beside the point.
The larger problem was an inherent TOCTTOU bug in the interface semantics between the BSD subsystem and Mach. AFAIU that wasn't a dangling-anything problem; the reference was still valid. It was a logic and design problem that could happen in any language, even in Rust, and even without resorting to unsafe code.
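To see the shape in a setting everyone knows (this is just an analogue, not the XNU bug itself): the classic access()/open() race in C has exactly this structure, where the identity you check isn't the identity you use:

    #include <fcntl.h>
    #include <unistd.h>

    /* Time-of-check: access() consults the *real* uid against whatever
       'path' names right now. */
    int open_if_allowed(const char *path) {
        if (access(path, R_OK) != 0)
            return -1;
        /* Window: between these two calls, 'path' can be re-pointed
           (e.g. a symlink swap), so the object checked is not the
           object used. */
        return open(path, O_RDONLY);   /* time-of-use */
    }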
>(that's two extra allocations and copies in hot path!)
I've never really thought of process spawning as a hot-path (in the hundreds+ of calls per second sense). What software so heavily relies on spawning so many processes so quickly that the overhead of malloc would be noticeable?
On UNIX systems fork+execve is a very commonly used code path - build a piece of software with make, for example, and the compiler process gets forked/exec'ed a lot of times; web servers like Apache used a multiprocess model for a long time; etc. You could decide not to care about fork+exec* performance, but in the land I'm familiar with (Linux) a lot of optimization work goes into making fork/exec faster.
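If you want a rough feel for the rate, here's a toy spawn loop you can time yourself (my own sketch; it assumes /usr/bin/true exists, which it does on macOS and most Linux boxes):

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        const int n = 1000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < n; i++) {
            pid_t pid = fork();
            if (pid == 0) {
                execl("/usr/bin/true", "true", (char *)NULL);
                _exit(127);                    /* exec failed */
            }
            waitpid(pid, NULL, 0);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d fork+exec pairs in %.2fs (%.0f/s)\n", n, s, n / s);
        return 0;
    }

A parallel make or a pre-fork server keeps a path like this busy all day, so per-spawn allocations aren't free.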
Also, my bigger point wasn't that it's in a hot code path (although I would prefer not to be in a position where I need to change execve to add two allocations and copies if I can avoid it) - it was that they had to change execve() this late in the OS's development cycle to fix this long-standing bug. Typically you get to a point where you don't really need to touch core OS code, and when you do, you risk adding new issues to a central piece of code.
Yeah, I don't see the hot path problem either. I think the bigger issue is doing deep brain surgery on XNU as a hotfix; it's the kind of thing you want to put off for a next major release.
Incidentally, the huge set of problems with execve was very old news - I know a developer who tried to get them fixed (because they affected a piece of Apple software) and failed. Presumably this was because execve bugs were considered unimportant.
Ever since installing 10.12.1, I've been having a bunch of processes randomly entering a quasi-paused, SIGSTOP-ish state (not closable, apps not "bouncing" (loading), just not responding). Running Instruments, correlating logs and such doesn't identify any clear cause. I'm having to `sudo kill -CONT -1` in order to get things moving again. I'm wondering if it's related to XNU mitigations or just some spurious "system configuration entropy" on my box.
I did exactly this when my Mac ran out of memory yesterday. Safari hung with a 'your computer is running out of memory' warning (168 tabs open!) and I didn't want to lose them all by force quitting. But the Safari process itself wasn't "Not Responding" and we were back to 0% CPU.
So I quit everything else, SIGCONT'd Safari, and it started responding again, so I tried unsuccessfully to close some tabs. Of course, Safari somewhat isolates pages in separate processes, so I ran `ps aux | grep WebContent | grep -v grep | cut -d' ' -f11 | xargs kill -SIGCONT` as well.
It all sprang back to life, and all the tabs I'd shut in vain zipped away. Got that one saved for later. It's probably easier just to use -1 now I've learned what that is!
I do wonder what's suspending these processes indefinitely. I should have done more inspection to see what state they were in. I'm not familiar with how WebKit content threads communicate though, so that's for another day.
I was going to say that it signals launchd's process group, i.e. every process spawned by launchd, which always runs as PID 1. However, the `kill` manpage confirms your hunch:
-1 If superuser, broadcast the signal to all processes; otherwise
broadcast to all processes belonging to the user.
(This is on macOS/iOS, Linux might have slightly different semantics.)
It's funny you mentioned processes entering a quasi-paused SIGSTOP-ish state. I swear I've been having tons of problems with Java/Tomcat the past week or so I've been on 10.12.1 (betas), and I keep thinking I broke my config by updating my Java version, changing my Tomcat config, or some other "system configuration entropy"! Nice to know I'm not alone and that I'm probably not crazy.
Tomcat 7.0.72 from Homebrew, Oracle Java 8u112, PostgreSQL 9.3 and 9.5.
It's funny, but I think it's written that way on purpose, not just as snark.
It's a little tricky to keep track of what happened here. There are 4 bugs in this post, and (I think) 2 different timelines: the UAF timeline for the first bug, and the TOCTTOU timeline for the 3 subsequent bugs. What's important to understand about the three TOCTTOU bugs is that there's a "right" fix for that bug, and a series of wrong fixes that delay the inevitable. Ian Beer and GPZ probably go into this whole process knowing what the right fix is, and with predictions on how they'll defeat any of the wrong fixes.
So it looks like GPZ reported a bug and then found flaws in the mitigations, but really all three of the flaws they found were known, at least conceptually, when GPZ reported the TOCTTOU race to Apple.
In the TOCTTOU timeline, Apple got an extension. Subtextually, it sounds like Tim Cook called Sundar Pichai. GPZ does not want to give extensions. They have a 90 day disclosure timeline, it's very well known, and probably the healthiest disclosure process in the industry. It's problematic for GPZ to give extensions because next time Tavis Ormandy finds a vulnerability in Norton Antivirus, Symantec is going to try to play chicken, and GPZ doesn't want to be at day 89 having to decide whether to drop zero-day versus being held hostage by a patch schedule.
But if a bug escalates all the way to Tim Cook, GPZ is probably pretty OK just with the degree to which that raises the profile of their bug --- it's hard to look at that and think Apple isn't taking your bug extremely seriously. So they'll trade the raised profile for the 5 week extension.
So they include a bunch of fuck-yous to Apple in the disclosure timeline, messaging to other vendors that GPZ is not going to budge even if your dumb original fix turns out to have a flaw that Ian Beer will notice and exploit. If you want the extension, you'd better have a Tim Cook.
Or maybe they're just having fun. Either way, a good read!
I'm still on 10.11. I don't plan to update soon, since the benefit of Siri, Photos and the other major features is quite small compared to the risk that I might lose working days if something goes wrong (I'm a freelancer).
As far as I read in the article, there will be a 10.12.1 (the final fix) which will have that part of the kernel refactored. I hope Apple will also support 10.11 and issue an update with the same fix.
I got an update to 10.11 El Capitan yesterday, which probably fixes the vulnerability. You can see the fix on Apple's support page; have a look at the bottom of the page, at the entry about "System Boot":
I would argue that the original underlying problem here is the idea that having execve() increase privilege is acceptable. It's necessary for legacy reasons (sudo, anyone?), but even then, it's barely necessary. "sudo foo" could be implemented by asking a privileged daemon to run foo and handing off access to the console to the daemon.
On Linux, you can do PR_SET_NO_NEW_PRIVS to turn off this type of privilege gain, and it's even required for certain purposes. I would love to see someone develop a distribution that enables no_new_privs for all processes.
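For the curious, opting in is a one-liner; here's a minimal sketch (Linux-specific; PR_SET_NO_NEW_PRIVS has been in the kernel since 3.5):

    #include <stdio.h>
    #include <sys/prctl.h>
    #include <unistd.h>

    int main(void) {
        /* Once set, this bit is inherited across fork/exec and can
           never be cleared. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
            perror("prctl");
            return 1;
        }
        /* From here on, execve() of a setuid/setgid or file-capability
           binary runs it *without* the privilege bump - sudo fails
           instead of escalating. */
        execl("/usr/bin/sudo", "sudo", "id", (char *)NULL);
        perror("execl");
        return 1;
    }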
It's a pattern of privilege escalation bugs. If you run untrusted code on your machine, that code can obtain root or alter the kernel, potentially even if it's running as nobody.
There is a relatively long sequence of attempts to band-aid the bug, all of which failed, because Ian Beer found a systemic flaw, not just a single point flaw. So, the other implication for users is a general sense of foreboding.
Bug 1: Many XNU drivers save task_t's on the heap without bumping their refcount.
1. Attacker creates process A and B
2. B->A send task port Bt
3. A->XNU request IOKit framebuffer client for Bt
4. A ditches Bt, retains client
5. Kill B; Bt in client now dangling
6. Trigger creation of privileged C, unrelated to A & B
7. C inherits memory once used by Bt
8. A use retained framebuffer client to write C's memory
What's important to understand is that this is not just a single UAF, but a pattern of UAFs scattered throughout XNU.
Fix: at step 3, check to make sure the task being given to IOKit is owned by the task making the IOKit request.
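The underlying pattern is simple enough to sketch. task_reference() and task_deallocate() are real XNU KPIs; the client struct and function names below are invented for illustration:

    /* Stand-in declarations so the shape is visible outside XNU. */
    typedef struct task *task_t;
    extern void task_reference(task_t t);
    extern void task_deallocate(task_t t);

    struct io_client {
        task_t owner;               /* cached across calls */
    };

    /* Vulnerable: stashes the task without taking a reference. If the
       task dies, 'owner' dangles, and a later task allocation (step 7
       above) can land in the freed memory. */
    void client_init_bad(struct io_client *c, task_t t) {
        c->owner = t;
    }

    /* Fixed: hold a reference for as long as the pointer is cached... */
    void client_init_good(struct io_client *c, task_t t) {
        task_reference(t);
        c->owner = t;
    }

    /* ...and drop it when the client goes away. */
    void client_destroy(struct io_client *c) {
        task_deallocate(c->owner);
    }

Note the reference only cures the dangling pointer. As bug 2 shows, a perfectly valid, properly referenced task_t can still change identity underneath you.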
Bug 2: IOKit drivers cache task details on the heap; the lifetime of that cached task_t is the lifetime of the IOKit kernel object, not of the program that made the request. In particular: if you execve() an SUID, the task_t is repurposed.
1. Attacker creates process A and B
2. B->XNU request IOKit framebuffer for Bt, Bc
3. B->A send client Bc
4. B execve /bin/su. B is now running as root.
5. A use retained framebuffer client to write B's memory
The tricky thing here is that this isn't just one bug, but a pattern of bugs: every place where a driver stashes a task_t on the heap and exposes functionality through a passable object is a place where colluding processes can potentially take advantage of SUIDs to raise privileges.
Fix: Lifetime of IOKit clients now tied to lifetime of creating process.
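A sketch of why the refcount alone couldn't save you here (names invented again):

    /* The reference is valid the whole time - nothing dangles. But an
       execve() of an SUID binary changes what the task *is*. */
    typedef struct task *task_t;
    extern int task_write_memory(task_t t, const void *buf, int len); /* invented */

    struct io_client {
        task_t owner;               /* properly reference-counted this time */
    };

    int client_write(struct io_client *c, const void *buf, int len) {
        /* Credentials were vetted once, when the client was created.
           If 'owner' has since exec'ed /bin/su, this now writes into a
           root-owned address space with no re-check. */
        return task_write_memory(c->owner, buf, len);
    }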
Bug 3: Even if a driver doesn't save a task_t on the heap, task_ts are held on the stack while system calls and kernel Mach message handlers are serviced, so there are race conditions.
1. Attacker creates process A and B
2. B->A send task port Bt
3a. A->XNU task_threads(Bt), retrieving thread ports for Bt
3b. (simultaneously) B execve /bin/su. B is now running as root.
4a. task_threads converts Bt to a task_t
4b. execve modifies the same task_t to replace thread ports
4c. task_threads retrieves the (now privileged) thread ports.
5. A uses thread ports to overwrite registers and take control of B.
Fix: Kernel objects now check to see if a task_t has been touched by execve before returning them to userland. Even if you win the race, that failsafe prevents the kernel from giving you privileged objects.
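My guess at the shape of that failsafe - the summary only says kernel objects check whether the task was touched by execve, so the field and function names below are invented:

    typedef struct task {
        unsigned exec_gen;              /* hypothetically bumped by execve() */
        /* ... */
    } *task_t;

    extern int collect_thread_ports(task_t t, void *out);

    int task_threads_checked(task_t t, void *out) {
        unsigned gen = t->exec_gen;     /* snapshot before the slow work */
        int err = collect_thread_ports(t, out);
        if (err)
            return err;
        if (t->exec_gen != gen)         /* an exec raced us: fail closed */
            return -1;                  /* don't hand back the ports */
        return 0;
    }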
Bug 4: You don't need the kernel to give you a privileged object directly; all you need is to be able to influence a privileged object.
1. Attacker creates process A and B
2. B->A send task port Bt
3a. A->XNU task_set_exception_port(Bt), wiring A to B's exceptions
3b. (simultaneously) B execve /bin/su with rlimited stack. B is now running as root, briefly.
4a. task_set_exception_port converts Bt to a task_t
4b. execve modifies the same task_t to replace thread ports
4c. task_set_exception_port rewrites the exception port.
5. stack access in B, running /bin/su as root, causes a SEGV
6. XNU generates an exception message, passing with it the thread ports, to A
7. A uses thread ports to overwrite registers and take control of B.
Fix: table flip. Rewrite execve so it generates entirely new task_ts when loading binaries, rather than repurposing the old ones.
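In sketch form (stand-in names, not XNU's real internals), the table flip looks something like:

    typedef struct task *task_t;
    extern task_t task_create_from(task_t old);   /* hypothetical */
    extern void   task_mark_inactive(task_t old); /* hypothetical */

    /* Called from execve(): instead of mutating 'old' in place, build a
       fresh task (the two extra allocations and copies complained about
       upthread) and retire the old one. Any stale task_t someone kept
       now denotes a dead, pre-exec identity instead of root. */
    task_t exec_replace_task(task_t old) {
        task_t fresh = task_create_from(old);
        task_mark_inactive(old);
        return fresh;
    }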
This is all pretty magnificent. What's best about it is that it totally justifies the title of the post: pretty much every place in XNU where they save a task_t creates a TOCTTOU bug.
In particular, about the "pattern of UAFs scattered throughout XNU": the memory management of task_t references was missing even in the sample code for kext drivers. So it wouldn't be enough to just add the missing retain calls in the Apple XNU kexts, because there may be an unknown number of third-party kexts out there. Perhaps not as many as Windows has device drivers, but it's still the same type of thing. Can you imagine if every Windows device driver turned out to have copy-pasted privesc bugs?
Because human brains are pattern-matching engines; parent comment saw the phrase "considered harmful", didn't read the article, and linked a previously-read article that they presumed was related based on the title alone.
GPZ finds an entirely new class of vulnerability; Apple takes 4 months to patch and resolve.
And you claim this has been exploited for years. There is 0 evidence of this, and such a claim demands proof.
I would be happy to apologise if you could find one example of exploitation prior to a few days ago when it became public.
The point of the 0-day black market is to not reveal these attacks publicly. If there were public proof of this in the past it would have been fixed in the past.
Take my word for it when I say there are upper echelons of black hats that are stockpiling unknown 0-day exploits like this and presently using them in the wild.
Or dismiss me as irrational and continue with the belief that all bugs are unknown until white hats share them with Apple.