
Yeah - the failed mitigations followed by a "long term" fix were interesting as well. Apple literally had to change execve() this late in the OS's development cycle to allocate new task and thread structs (that's two extra allocations and copies in a hot path!) to fix it for good. That this design problem lingered for so long doesn't look good for Apple - it's one thing for a use-after-free bug in an obscure piece of code to linger, but a bad design affecting a ton of their own frequently used code standing for so long is somewhat in the brown-bag category!

I wonder what other fallout we might experience from this.




Business as usual with macOS. The other day I was browsing the ocspd source code. Turns out it calls openssl using system(). So openssl is officially deprecated on macOS and yet they're using it internally to handle certificates?! And there's an enlightening comment:

    /* Given a path to a DER-encoded CRL file and a path to a PEM-encoded
     * CA issuers file, use OpenSSL to validate the CRL. This is a hack,
     * necessitated by performance issues with inserting extremely large
     * numbers of CRL entries into a CSSM DB (see <rdar://8934440>).
http://opensource.apple.com/source/security_ocspd/security_o...

ocspd was introduced with 10.4. A decade ago. And that's really the problem with macOS: there's no refactoring of old hacks, just the bolting-on of ever more new stuff.


Linking against openssl was what got deprecated (because of the lack of a stable binary interface) - not the command-line tools.


I don't see a major problem with using the openssl command for this, but using system() to do it is completely insane.
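
For illustration, a minimal sketch of the safer shape (hypothetical wrapper name and illustrative openssl arguments - this is not ocspd's actual code): posix_spawn with a fixed argv never involves /bin/sh, so a hostile path can't be parsed as shell syntax the way it can inside a string handed to system().

    /* Hedged sketch: run the openssl CLI without a shell. */
    #include <spawn.h>
    #include <sys/wait.h>

    extern char **environ;

    static int verify_crl(const char *crl_path, const char *ca_path)
    {
        char *const argv[] = {
            "/usr/bin/openssl", "crl",
            "-inform", "DER", "-in", (char *)crl_path,
            "-CAfile", (char *)ca_path, "-noout",
            NULL
        };
        pid_t pid;
        int status;

        /* Execs the binary directly; no "/bin/sh -c" step, so shell
         * metacharacters in the paths are inert. */
        if (posix_spawn(&pid, argv[0], NULL, NULL, argv, environ) != 0)
            return -1;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
    }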


Apple needs to take a bit of those tens of billions of dollars they have sitting around and spend it on starting from scratch with something that's not horrifically crufty. The quality of their software is lagging so far behind the quality of their hardware right now. Realistically, I think we may just be at the point where operating systems and all the stuff the companies put on top of them are too complicated to keep developing in the traditional way with traditional tools. Formal verification might be the cheapest way forward at this point.


So far as the current state of the art in computer engineering goes, we don't know how to completely rewrite a system as complicated as XNU without creating fresh batches of implementation errors. So this is a little like suggesting Apple use its hundreds of billions of dollars to build an iPhone battery that only needs to be recharged once a month.

We may someday get an XNU rewrite, but probably not until software engineering produces a new approach to building complex systems reliably that works at the scale (here: number of developers and shipping schedule) Apple needs.


This is so, so true, that I wish there were enough beer in this world to gift you with. There's a lot of cruft in XNU, and there's even more of it in the rest of the system, but all this heap of hacks isn't just useless cruft that we'll be better off without. That heap of code also contains almost twenty years' worth of bugfixes and optimizations from more smart engineers than Apple can hope to hire and get to work together in a productive and meaningful manner. All this unpleasant cruft is what keeps the system alive and well and the users happy enough to continue using it.

More often than not, systems that get boldly rewritten from scratch end up playing catch-up for years. Frankly, I can't remember a single case where a full rewrite with an ambitious timetable wasn't a full-scale disaster. The few success stories, like (what eventually became) Firefox, took a vastly different approach and still took far longer than users would have liked.

A lot of idealistic (I was about to write naive) engineers think it's all a matter of throwing everything away. That's the easy part. Coming up with something better is the really hard part, and it's not achieved by just throwing the cruft away. If you innocent souls don't believe me, come on over to the Linux side, we have Gnome 3 cookies. You'll swear you're never going to touch anything that isn't xterm or macOS again.


A lot of macOS/iOS was written from scratch, though: Core Graphics (vs. Cairo and FreeType), Core Animation, Core Text (vs. pango), WindowServer (vs. X11), UIKit (vs. Cocoa), IOKit (vs. the native BSD driver framework), Cocoa Finder (vs. Carbon Finder), LLVM/clang/Swift (if you count Chris Lattner's work on it at UIUC)...

Of those, the last one is very impressive: it's a decade-long from-scratch project that has succeeded in competing with a very entrenched project (GCC) in a mature market.

Regarding GNOME 3, the delta between GNOME 2 and GNOME 3 is far less than the delta between NeXTSTEP+FreeBSD and the first version of Mac OS X.


> This is so, so true, that I wish there were enough beer in this world to gift you with. There's a lot of cruft in XNU, and there's even more of it in the rest of the system, but all this heap of hacks isn't just useless cruft that we'll be better off without. That heap of code also contains almost twenty years' worth of bugfixes and optimizations from more smart engineers than Apple can hope to hire and get to work together in a productive and meaningful manner. All this unpleasant cruft is what keeps the system alive and well and the users happy enough to continue using it.

This whole premise is a false dichotomy. Apple does not have to throw away Mac OS X, and it does not have to keep piling crap on without fixing things. If you stop the excuses and rationalizations and commit to code quality you can ship an operating system with quality code and minimal bugs. The OpenBSD project has been doing this for two decades with minimal resources. There is no valid excuse other than "we are too lazy and incompetent."


Bingo! That's a beer shot :)

Oh, "too much code, bad code, we inherited it, throwing it away won't work", etc. are baloney excuses without meat. All it takes is the will to hire and commit the right resources with the objective of increasing code quality. I mean, take this bug itself - Apple did fix it, but only after GPZ was on their arse. There's no reason they couldn't have reviewed it themselves and fixed it.


Hasn't the vulnerable code been here for over a decade? Why do people think this was an easy bug to spot? There are dozens of extremely qualified people looking for these things. I think there's a reason there isn't a Nemo Phrack article about this bug: it was hard to spot, and required a flash of insight about the competing lifecycles of objects in two different domains (POSIX and Mach).


I was (obviously...) responding to this:

> Apple needs to take a bit of those tens of billions of dollars they have sitting around and spend it on starting from scratch with something that's not horrifically crufty.

They certainly don't have to throw everything away. Not having thrown everything away is one of the reasons why OpenBSD is a good example here. Remember all that quality code that was in place before Cranor's UVM? (Edit: actually, the fact that UVM is an improvement over it should say something, too...)

And, at the risk of sounding bitter, in my experience, very few companies have the capability to "commit to code quality", and I don't think Apple is one of them.

Edit: BTW, I really like your blog. You should write more often :-).


> Remember all that quality code that was in place before Cranor's UVM?

So much before my time I was not even aware of it. For the uninitiated: https://www.usenix.org/legacy/events/usenix99/full_papers/cr...

> Edit: BTW, I really like your blog. You should write more often :-).

Thank you. :) Just this week I started thinking of getting back into it.


Hold on, because I'm pretty familiar with pre- and post-UVM OpenBSD: Arbor Networks shipped on OpenBSD (against medical advice) and ran into a number of really bad VM bugs that Theo couldn't fix because of the UVM rewrite!


But I am on the Linux side & wouldn't want to touch anything that is xterm or macOS again (suckless's st ftw)

Also currently running on a nice pure wayland system, no need for that X11 cruft


> We may someday get an XNU rewrite, but probably not until software engineering produces a new approach to building complex systems reliably that works at the scale (here: number of developers and shipping schedule) Apple needs.

It's conceivable to perform a gradual transition away from Mach, though. They could demote it to a fast IPC system that merely augments BSD, similar to what the kdbus/bus1 proposals do for Linux. That would be a difficult, long-term project, but it would fix the underlying issue in a way that mostly retains userspace compatibility. Driver compatibility would be harder, of course…


That's true, but if you undertake a difficult and long-term project, you want the outcome to be decisive. Mach is ugly and a nest of bugs, but kernels implemented in C/C++ are bug magnets with several orders of magnitude more force.

My prediction is that we don't ever see an XNU refactor/redesign/rewrite so long as C/C++ is the kernel implementation language.


No argument there. :)


seL4 is an indicator of things to come. We can build complicated OSes with extreme reliability; the up-front cost is just higher than most companies are willing to spend right now, because customers don't yet realize that it's technically possible to avoid the huge costs associated with software failure in exchange for slightly higher amortized software costs.


Until we have a way to extend seL4 or something like it to full multiprocessor operation (without the multiple-kernels-with-separate-resources limitation that is currently the only way to use multiple processors with seL4), I'd disagree that we can build general-purpose OSes with verification. Our techniques for verifying concurrent programs are still very primitive and cumbersome, and I don't think many would take seriously an OS whose processes can't use multiple hardware threads.

Also, seL4 (being a microkernel) leaves out a huge swath of kernel-level facilities that need to be implemented with the same standard of verification (resource management, network stack, drivers, etc.). Running on a verified microkernel provides a great foundation, but these still add a ton of code that needs to be verified. Plus the concurrency problem will strike again at this level.


L4 is incredibly simple. It is essentially (a word I chose carefully) the opposite of a complicated OS. It also doesn't really do anything.

If you have just a few extremely simple applications you'd like to run in an enclave, L4 is a good way to minimize the surface area between the applications themselves and the hardware.

If you'd like to host a complicated operating system on the simplest possible hosting layer: again, L4 is your huckleberry.

Otherwise: not so useful.

Note that if you just host XNU on top of L4, you might rule out a very small class of bugs, but the overwhelming majority of XNU bugs are contained entirely in the XNU layer itself; having XNU running on an adaptor layer doesn't do much to secure it.


I don't think I've ever seen a complicated OS based on seL4, and it is the opposite of complicated itself.

I don't think seL4 means much for macOS/iOS.


Hundreds of billions?



o.O


This is kind of a "perfect storm" situation for Apple. At least three vectors are converging:

1. Apple inherited OS X from NeXT, and with it the Mach subsystem. Mach overcomplicates XNU.

2. XNU has become incredibly popular by dint of being shipped in the iPhone. Avie Tevanian probably did not see that coming when he designed the original BSD/Cocoa/XNU/whatever architecture. Regardless: it is now difficult to make sweeping architectural changes in XNU because of the enormous installed base.

3. Ian Beer is simultaneously very clever and also willing to wade into the XNU Mach fire swamp.

I think it's fair to criticize Apple for designing the XNU frankenkernel. I think it's less legit to say that the presence of this bug class "looks bad" --- it's 2016 and this is just getting published. This is one of those bug classes that is sort of obvious in retrospect, and you wonder why people didn't catch it earlier.


Keep in mind that this is a dangling pointer problem stemming from a stupid design mistake, and that Apple's own kexts used the same vulnerable pattern for years before Project Zero had to disclose it to Apple - after which there were two failed mitigations and a fix that changes core OS code!

That's not very justifiable no matter how many storms and twisters you throw into the mix ;) I'm not going to argue, but for OS development it looks bad that this wasn't even looked at, much less fixed, for so long, given that many of Apple's own kexts had the same issue.

[Also, this isn't an isolated problem, by the way - look up the thread and you'll find two other egregious errors, including one involving execve that was pointed out to Apple in vain. Not long ago I could just remove and insert a kmod and that would lock up the system - go ahead and argue that's not a hot path or that nobody does that, but it does speak to a general lack of code and testing quality. To their credit, Apple did fix that particular one when I reported it.]


The dangling pointer isn't the interesting part of the bug --- dangling pointers are, to a first approximation, the key component of all UAFs, which are a 15-year-old bug class. The interesting bit about this bug is how they arrived at the dangling pointer: they had to manipulate at least three different object lifecycles in the kernel, one of which involved a credential-passing trick that is not super common in normal code.

I disagree that this is a simple bug that anyone should have been able to find. If you'd like to put money on whether or not this is going to win a 2017 Pwnie Award, I would be happy to take your money.


When you take money for writing and maintaining an OS, you have to be competent enough to avoid fundamental design issues like this, especially when you're going to have a bunch of downstream users of the code!

Read the TLDR - exploitation of dangling pointers aside, you can't write code and/or design APIs that do (or make it easy to do) what the TLDR warns about: hold or use a task struct pointer and expect the euid of that task to stay the same. Many, many places in the kernel do this, and there are a great many very exploitable bugs as a result.
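
To spell the pattern out, here's a compilable toy of it (the names are made up, not real XNU symbols; in the real kernel the identity change happens when the process execve()s a setuid binary):

    /* Toy sketch of the bug class - illustrative names only. */
    #include <stdio.h>

    struct task { int euid; };

    static struct task *saved;        /* pointer captured at check time */

    static void capture_if_unprivileged(struct task *t)
    {
        if (t->euid != 0)             /* time of check: looks harmless */
            saved = t;
    }

    int main(void)
    {
        struct task t = { .euid = 501 };
        capture_if_unprivileged(&t);  /* decision made about euid 501 */
        t.euid = 0;                   /* stand-in for an execve() of a
                                         setuid-root binary changing the
                                         task's identity under the pointer */
        /* time of use: same pointer, now naming a root task */
        printf("saved task euid is now %d\n", saved->euid);
        return 0;
    }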

Ian Beer certainly is talented but that doesn't excuse Apple being so sloppy!


Aha! I think I understand what has you confused. You seem to think that the TLDR describes some basic rule of XNU programming that people were already aware of and expected to follow. No. Ian Beer invented that rule. In this post. That's why the bug is such a big deal; it's why we call it a "new bug class". It's also why it's the TLDR of the post.


Wow, straight-up conclusion - I had it confused, right!

It's not a, pardon me for the expression, fucking XNU-specific programming rule - it's a general rule that was invented long before Ian got to it! You don't hold a reference-counted pointer and operate on it without taking a :gasp: reference first - having that shit in your sample code is just, well, extra shitty!

Also, separately from the dangling pointer issue, the first sentence of the post is literally - This post discusses a design issue at the core of the XNU kernel!


You are describing only the first bug in this document. There are four. The timeline we're commenting on is for the last three, which are not UAFs.


AFAIU, the dangling pointer problem wasn't a defect in the kernel; it was a problem introduced by some module authors who misunderstood the ownership semantics of the API. It might not even have been a pointer, per se, but I guess that's beside the point.

The larger problem was an inherent TOCTTOU bug in the interface semantics between the BSD subsystem and Mach. AFAIU that wasn't a dangling-anything problem; the reference was still valid. It was a logic and design problem that could happen in any language, even in Rust, and even without resorting to unsafe code.


>(that's two extra allocations and copies in hot path!)

I've never really thought of process spawning as a hot-path (in the hundreds+ of calls per second sense). What software so heavily relies on spawning so many processes so quickly that the overhead of malloc would be noticeable?


On UNIX systems fork+execve is a very commonly used code path - build a piece of software with make, for example, and the compiler process gets forked/exec'ed a lot of times; web servers like Apache used a multiprocess model for a long time, etc. Yeah, you could decide not to care about fork+exec* performance, but in the land I'm familiar with (Linux), a lot of optimization effort goes into making fork/exec faster.
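
As a toy illustration of the shape of that loop (hypothetical file names, nothing make-specific):

    /* Toy make-like driver: one fork+execve per compile job.
       A large build repeats this thousands of times. */
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        const char *files[] = { "a.c", "b.c", "c.c" };  /* hypothetical */
        for (int i = 0; i < 3; i++) {
            pid_t pid = fork();
            if (pid == 0) {
                execl("/usr/bin/cc", "cc", "-c", files[i], (char *)NULL);
                _exit(127);           /* reached only if exec failed */
            }
            waitpid(pid, NULL, 0);
        }
        return 0;
    }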

Also, my bigger point was not that it's in a hot code path (although I'd prefer not to be in a position where I need to change execve to add two allocs and copies if I can avoid it) - it was that they had to change execve() this late in the OS's development cycle to fix this long-standing bug. Typically you get to a point where you don't really need to touch core OS code, and when you do, you risk adding new issues to a central piece of code.


Yeah, I don't see the hot path problem either. I think the bigger issue is doing deep brain surgery on XNU as a hotfix; it's the kind of thing you want to put off for a next major release.


Incidentally, the huge set of problems with execve was very old news - I know a developer who tried to get them fixed (because they affected a piece of Apple software) and failed. Presumably that was because execve bugs were considered unimportant.

Oops!


Pardon my ignorance, could you explain what "hot path" means in this context?



