VirtualBox hotfix now available for OS X 10.8.2 problem (virtualbox.org)
134 points by isaacsu on Sept 21, 2012 | 13 comments


Let me go over the anatomy of the actual bug, to the best of my understanding, so that people can better understand what is going on here. Note that I'm the Vagrant creator, not a VirtualBox hacker or a kernel hacker (though I've had my fair share of both in the past few years).

There is a feature of Intel CPUs called VT-x extensions. Without going into detail: VT-x is a set of features built natively into some Intel processors to improve virtualization. Any recent desktop/laptop Intel processor has these. For reasons unknown to me, these extensions are typically disabled by default. VirtualBox, VMware, and Parallels all contain code to enable these automatically for you. Enabling VT-x extensions requires ring-0 (kernel level) API calls. Therefore, it is up to the kernel extension to enable these.

Mac OS X 10.6 and greater provides native kernel APIs for doing this[1]. Prior to 10.6, you'd have to directly query the CPUs and modify CPU registers yourself, an icky business prone to some massive failure. Native APIs are pretty nice. In Darwin (the OS X kernel), these APIs are `host_vmxon` and `host_vmxoff`. These are very easy to use: you just call them. Mac OS X does all the internal accounting to verify that the CPU supports VT-x, that VT-x isn't already enabled, etc. It also supports a feature called _exclusive_ access to VT-x. By passing `true` to `host_vmxon`, you're requesting _exclusive_ access to the VT-x extensions. If this succeeds, then until you call `host_vmxoff` again, no other application can call `host_vmxon` (a `VMX_IN_USE` error is returned).
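To make the accounting and exclusivity rules above concrete, here is a minimal user-space sketch in C. It is a toy model, not the XNU implementation: the `toy_` names, the struct layout, and the numeric status values are all assumptions made for illustration. Only the idea of an "in use" error for exclusive access comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status codes for this toy model. Only the concept of an
   "in use" error comes from the comment above; the values are made up. */
enum toy_vmx_status {
    TOY_VMX_OK = 0,
    TOY_VMX_UNSUPPORTED = 1,
    TOY_VMX_IN_USE = 2
};

/* Toy bookkeeping state; a real kernel would track this internally. */
struct toy_vmx_state {
    bool cpu_supports_vmx; /* assumed to come from a CPU query */
    int  use_count;        /* number of callers currently holding VT-x */
    bool exclusive;        /* true if the current holder asked for exclusivity */
};

/* Model of "turn VT-x on", optionally requesting exclusive access. */
enum toy_vmx_status toy_vmxon(struct toy_vmx_state *s, bool exclusive)
{
    assert(s != NULL);
    if (!s->cpu_supports_vmx)
        return TOY_VMX_UNSUPPORTED;
    /* Refuse if someone already holds VT-x exclusively, or if this caller
       wants exclusivity while other holders exist. */
    if (s->exclusive || (exclusive && s->use_count > 0))
        return TOY_VMX_IN_USE;
    s->use_count++;
    s->exclusive = exclusive;
    return TOY_VMX_OK;
}

/* Model of "turn VT-x off": drop one holder and clear exclusivity
   once nobody is left. */
void toy_vmxoff(struct toy_vmx_state *s)
{
    assert(s != NULL);
    if (s->use_count > 0)
        s->use_count--;
    if (s->use_count == 0)
        s->exclusive = false;
}
```

In this model, a second caller is refused for as long as an exclusive holder has not called `toy_vmxoff`, which is the behavior described for `host_vmxon` with `true`.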

It turns out that Mac OS X 10.8.2 on Ivy Bridge CPUs has a bug where this accounting is broken. No application is actually using the VT-x extensions, but `host_vmxon` returns `VMX_IN_USE` anyway. This is what broke VirtualBox in 10.8.2. Now, I do want to note that this is 100% an Apple issue. VirtualBox was using a publicly exposed API at the kernel level and assuming that such an API would be stable. I would say this is a safe assumption. Unfortunately, here we are with the situation we have today.

Now, you must be asking: But I heard (or saw) that VMware and Parallels were not affected by this issue! How did that happen? In time, friends, in time. I will explain this soon.

Next, on to how VirtualBox worked around this issue. The changeset[2] is pretty simple. As part of the kernel driver initialization process, VirtualBox now calls the new function `vboxdrvDarwinResolveSymbols`. This function breaks across kernel module boundaries and searches the kernel space for a named symbol, even if that symbol is not exported for public access. Specifically, it searches for the symbols "vmx_resume," "vmx_suspend," and "vmx_use_count." The first two are functions, the last is a global variable. These are the exact same internals that the _publicly_ exposed `host_vmxon` and `host_vmxoff` call, but without the accounting or exclusivity feature.
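The lookup-by-name idea can be sketched in user space as follows. This is a toy illustration only, not the VirtualBox code: the symbol table, the `toy_` functions, and what the stand-in functions do (flipping a flag) are all invented. Only the three symbol names are taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Invented stand-ins for unexported kernel symbols. Their behavior here
   (flipping a flag) is made up so the lookup has something to find. */
int toy_vmx_active = 0;
int toy_vmx_use_count = 0;
void toy_vmx_resume(void)  { toy_vmx_active = 1; }
void toy_vmx_suspend(void) { toy_vmx_active = 0; }

/* One entry of a hypothetical symbol table: a name plus either a
   function pointer or a data pointer. */
struct toy_symbol {
    const char *name;
    void (*func)(void); /* non-NULL for function symbols */
    int *data;          /* non-NULL for data symbols */
};

static const struct toy_symbol toy_symtab[] = {
    { "vmx_resume",    toy_vmx_resume,  NULL },
    { "vmx_suspend",   toy_vmx_suspend, NULL },
    { "vmx_use_count", NULL,            &toy_vmx_use_count },
};

/* Linear search of the table by symbol name; NULL if not found. */
const struct toy_symbol *toy_lookup(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof toy_symtab / sizeof toy_symtab[0]; i++) {
        if (strcmp(toy_symtab[i].name, name) == 0)
            return &toy_symtab[i];
    }
    return NULL;
}

/* The resolved entry points a driver would keep around after init. */
struct toy_vmx_hooks {
    void (*resume)(void);
    void (*suspend)(void);
    int *use_count;
};

/* Resolve all three symbols; returns 0 on success, -1 if any is missing
   or has the wrong kind. On failure, *out is left untouched. */
int toy_resolve_symbols(struct toy_vmx_hooks *out)
{
    assert(out != NULL);
    const struct toy_symbol *r = toy_lookup("vmx_resume");
    const struct toy_symbol *s = toy_lookup("vmx_suspend");
    const struct toy_symbol *c = toy_lookup("vmx_use_count");
    if (!r || !s || !c || !r->func || !s->func || !c->data)
        return -1;
    out->resume = r->func;
    out->suspend = s->func;
    out->use_count = c->data;
    return 0;
}
```

Resolving everything up front and failing as a unit means the caller never ends up with a half-populated set of hooks, which seems like the safer choice when the symbols are not part of any stable interface.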

So now, as part of the VMM (Virtual Machine Manager), which sits in ring-0, it will call these methods directly rather than using the `host_vmxon` function. This avoids the accounting bug that is in the kernel of 10.8.2, and we have a functional VirtualBox.

So how did VMware and Parallels continue to function properly? Since they're not open source and I don't have the energy to DTrace them right now, I'll just say there are only two options. First, they can use the approach VirtualBox is now using, where they search for kernel symbols of unexposed APIs and call those. Second, they can directly query and modify the CPU registers that have to do with VT-x support.

Based solely on my conversations with hypervisor developers at VMware, I'm going to go with #2. The Fusion hypervisor is the same code as the vSphere hypervisor, Workstation hypervisor, etc. It is all one big awesome hypervisor that is meant to run on all sorts of hardware out there. Because of this, I imagine that they have had built-in support for years for detecting various CPU models and manually enabling VT-x extensions, rather than relying on kernel-specific APIs. This is just more portable and flexible for them.
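For a sense of what "querying the CPU registers directly" involves, here is a small sketch that only decodes register values passed in as plain integers. The bit positions are from Intel's documentation, not from this thread, and the function names are hypothetical. Actually reading the feature-control MSR or writing CR4 needs ring-0 instructions, so none of that is attempted here, and nothing below is claimed to be what VMware or Parallels do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions per Intel's documentation (not from this thread). */
#define CPUID1_ECX_VMX            (1u << 5)    /* CPUID leaf 1, ECX bit 5: VMX supported */
#define FEATURE_CONTROL_LOCK      (1ull << 0)  /* IA32_FEATURE_CONTROL (MSR 0x3A), bit 0 */
#define FEATURE_CONTROL_VMX_NOSMX (1ull << 2)  /* bit 2: VMX allowed outside SMX */
#define CR4_VMXE                  (1ull << 13) /* CR4 bit 13: VMX enable */

static_assert(CR4_VMXE == 0x2000, "CR4.VMXE is bit 13");

/* True if the given CPUID.1:ECX value advertises VMX support. */
bool cpu_has_vmx(uint32_t cpuid1_ecx)
{
    return (cpuid1_ecx & CPUID1_ECX_VMX) != 0;
}

/* True if the feature-control value is locked with VMX (outside SMX)
   left disabled, in which case software cannot turn it on. */
bool vmx_locked_off(uint64_t feature_control)
{
    return (feature_control & FEATURE_CONTROL_LOCK) != 0 &&
           (feature_control & FEATURE_CONTROL_VMX_NOSMX) == 0;
}

/* Returns the CR4 value a ring-0 caller would write to set VMXE.
   This only computes the value; it does not touch any register. */
uint64_t cr4_with_vmxe(uint64_t cr4)
{
    return cr4 | CR4_VMXE;
}
```

A kernel component going this route would read the real values with privileged instructions and feed them to checks like these before trying to enter VMX operation.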

Anyways, the issue appears fixed, and huge credit to the Oracle VirtualBox team, which got this out the door in no time.

[1]: http://www.opensource.apple.com/source/xnu/xnu-2050.9.2/osfm...

[2]: https://www.virtualbox.org/changeset/43379/vbox


I don't see how this explanation can be correct given the fact that the code using host_vmxon / host_vmxoff in the revision you point to is guarded by #if VBOX_WITH_HOST_VMX, which the changeset also shows is false. It wasn't turned on until the following revision: <https://www.virtualbox.org/changeset/43380/vbox>. I can't see how any change in behavior of host_vmxon / host_vmxoff in 10.8.2 could be the cause if they weren't even being called prior to this.
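As a reminder of how such a guard behaves, here is a tiny sketch with a hypothetical flag `TOY_WITH_HOST_VMX` standing in for `VBOX_WITH_HOST_VMX`. It is not taken from the VirtualBox source. When the flag is 0, the guarded branch is removed by the preprocessor and never compiled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag standing in for VBOX_WITH_HOST_VMX; 0 mirrors the
   "false" state described in the comment above. */
#ifndef TOY_WITH_HOST_VMX
#define TOY_WITH_HOST_VMX 0
#endif

static_assert(TOY_WITH_HOST_VMX == 0 || TOY_WITH_HOST_VMX == 1,
              "flag must be 0 or 1");

/* Reports which branch was compiled in. */
bool toy_uses_host_vmx_api(void)
{
#if TOY_WITH_HOST_VMX
    return true;   /* the guarded path is compiled in */
#else
    return false;  /* the guarded path is removed by the preprocessor */
#endif
}
```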


I'm curious to understand why only Ivy Bridge CPUs are affected, especially if this is a software accounting/refcounting bug. Also, there was talk about this not being a 10.8.2 bug, but a result of an EFI firmware upgrade. Anyone have any insight?


I'm not sure why only Ivy Bridge.

I can say though that the EFI firmware didn't cause the bug, but in fact the 10.8.2 software upgrade did.


By a quirk of how the updates were delivered, I have 10.8.2 on my Air, but the EFI firmware is still to be installed. VirtualBox is very dead (but I can't wait to get it going again!).


Thank you (x3) for the explanation of how this works and what went wrong. Your comment alone is more insightful than any of the articles that have been submitted on this problem. You win my Internets, mitchellh.


I suspect that VT-x is off by default on many configurations due to poor reliability. I had to manually go in and enable VT-x on my laptop to use Vagrant, and doing so caused machine-wide hard locks every 30 minutes or so (only when the VirtualBox VM was running) that I had never seen before.


Great explanation, thank you Mitchell.


Credit where it's due: that was pretty fast.


Yeah, amazing work. I actually had to use MAMP for my web development work as a temporary workaround instead of Vagrant, and it was painful.


Superb work! I just got prompted today to update my firmware and declined, wondering when I would be able to update it again.


VirtualBox, delivering Kernel Panics since 2007.


And tainting the kernel ;)



