Cute but extremely risky. The kernel has access to all memory-mapped devices and all the weird privileged-mode instructions, and the nested kernel has to ensure that the IOMMU is safe and that no instructions that can turn off paging, modify the page tables, etc. appear at any offset within the code, including unaligned ones. This also means that even if working around coincidental unaligned instances of such instructions is easy in practice, as stated in the paper, there's always a chance some (code, compiler) version combination will randomly fail to compile and (even if this is hypothetically made to work automatically most of the time) possibly require manual intervention...
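To make the "any offset, including unaligned" point concrete, here is a minimal sketch of a byte-level scan. It is not the paper's scanner: the function name, the set of protected control registers, and the choice to match raw opcode bytes instead of disassembling are all assumptions made for illustration. It relies only on the standard encodings `0F 30` for `wrmsr` and `0F 22 /r` for `mov` to a control register, where the ModRM reg field selects the register.

```python
# Hypothetical sketch, NOT the paper's scanner: flag privileged opcode
# byte patterns at every byte offset, aligned to an instruction or not.

WRMSR = b"\x0f\x30"       # wrmsr
MOV_TO_CR = b"\x0f\x22"   # mov crN, r64 (ModRM byte follows)
PROTECTED_CRS = {0, 3, 4}  # assumed policy for this example


def find_privileged_patterns(code: bytes):
    """Return (offset, description) for each suspicious byte pattern."""
    hits = []
    for off in range(len(code) - 1):
        pair = code[off:off + 2]
        if pair == WRMSR:
            hits.append((off, "wrmsr"))
        elif pair == MOV_TO_CR and off + 2 < len(code):
            # Bits 5:3 of the ModRM byte select the control register.
            cr = (code[off + 2] >> 3) & 7
            if cr in PROTECTED_CRS:
                hits.append((off, f"mov to cr{cr}"))
    return hits


# "mov eax, 0x300f" assembles to b8 0f 30 00 00: a harmless instruction
# whose immediate happens to contain the wrmsr opcode at offset 1.
print(find_privileged_patterns(bytes.fromhex("b80f300000")))
```

The example input shows why a coincidental match can force a recompile: the instruction stream is innocent when decoded from offset 0, but a jump to offset 1 would execute `wrmsr`.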
So, have they missed any such instructions? To be fair, they are assuming the original kernel modules are trustworthy but have been compromised at runtime, so these would have to appear either for another reason or by chance in the kernel, but ... off the top of my head:
- Hardware breakpoints/watchpoints in the inner kernel or at the CR0 writing instruction.
- - Or hoping/somehow ensuring the CPU gets an IRQ or something right after executing the PG0 write. Problematic, but not out of the question.
- - This could be partially negated by ensuring that like the PG0 enabler, the fault handlers are mapped at an address that corresponds to a physical address containing a trap. But I don't think this is sufficient, 'cause... now that paging is disabled and so you can write to code, you should be able to set the stack pointer in the TSS to overwrite the handler!
- Hardware VM support. I checked the Intel manual: the VMCS structure points to some addresses that map guest physical addresses to real physical addresses, so this really needs to be disabled. (Also, the VMCLEAR instruction, which requires VMXON first, writes directly to physical addresses.) There's also whatever AMD does.
And with a bit more manual reading - but I could be wrong on these:
- Switching to 32-bit mode and confusing the inner kernel that way.
- Switching to 32-bit mode and then using hardware task switching to load CR3.
Those two only require the iret instruction!
Based on scanner-objdump.py, they seem to only be checking for movs to cr{0,2,3} and wrmsr, so all of the above should work. I'm not an x86 expert, so (a) some of the above may be wrong, but (b) there are probably more potential avenues I don't know about. :)
Correction (self-reply): Switching to 32-bit mode won't work - or could be easily prevented at any rate: you can't switch directly to real 32-bit mode ("legacy mode"), only to "compatibility mode" where the page table format stays the same, and task switching isn't supported. Switching to legacy mode requires a wrmsr, which is blocked. Even if there was a benefit to going into compatibility mode with kernel privileges, once there you're limited to the bottom 4GB of the address space, which would normally be reserved for user mode, and since SMEP is forced on you can't execute user pages in kernel mode. So the inner kernel just needs to prevent any executable kernel pages from being mapped below 4GB.
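As a rough illustration of that last rule, the check below models a single leaf page-table entry. It is only a sketch under assumptions: the function name is made up, and it looks at one entry in isolation, whereas real effective permissions combine the bits from every level of the paging hierarchy. The bit positions used (present = bit 0, user/supervisor = bit 2, no-execute = bit 63) are the standard x86-64 ones.

```python
# Hypothetical sketch: would this mapping put executable kernel code
# below 4 GiB? Looks at a single leaf entry only.

PTE_PRESENT = 1 << 0   # page is mapped
PTE_USER = 1 << 2      # set: user-accessible, clear: supervisor-only
PTE_NX = 1 << 63       # set: not executable
FOUR_GIB = 1 << 32


def violates_low_exec_policy(vaddr: int, pte: int) -> bool:
    """True if the entry maps an executable supervisor page below 4 GiB."""
    if not pte & PTE_PRESENT:
        return False
    is_kernel = not (pte & PTE_USER)
    is_executable = not (pte & PTE_NX)
    return vaddr < FOUR_GIB and is_kernel and is_executable


# A present, supervisor, executable page at 1 MiB would be rejected.
print(violates_low_exec_policy(0x100000, PTE_PRESENT))
```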
Told you I wasn't an x86 expert. But the other points stand.
...though in the current implementation the inner kernel doesn't block the relevant MSR write, so it may be possible anyway, although the manual says you're supposed to disable paging first, which is blocked. In any case, this, along with the hardware VM thing (since VMX can be disabled in cr4), would be easy to fix. I'm not so sure about the interrupt approach.
I'm not being sarcastic; this is a serious question, because the specifics of this are over my head.
Are you saying that the "nested kernel" architecture technically increases the surface area for attack, or is the problem that it gives a false sense of security compared to what it actually provides?
The latter. There's no obvious reason it would decrease security, but if the main kernel is compromised, the inner kernel's security model is really hard to enforce correctly against an opponent which is also running in ring 0.
You sound like a jealous and/or competing university professor. I myself have received a number of paper reviews in a tone like yours.
Snarky, superficial dismissals from "I think I know everything" maleficent people like you are a real problem in any community.
So I'll take the liberty of being condescending myself and explain the contribution of this paper.
Multics had hardware facilities to help in implementing multi-ring privileges. In x86-64 there are 4 rings, but segment-level protection is largely gone, thus effectively collapsing 4 ring levels into just 2: supervisor and user.
Retrofitting nested kernels in a secure way [0] on top of HW where each process's address space is flat and there are only page-based protections with just 2 levels is the real research contribution.
[0] They use code scanning (disassembly) to prevent loading of subversive modules. Given the complexity of x86-64 instruction encodings (for one, they're not unique), I have some doubts about the robustness of this approach.
If you want to blame anybody for the current situation, blame Intel and their CPU design, don't throw stones at people who work within constraints of the available HW.
Maybe you're tempted to denigrate this as "engineering", but the paper solves a real problem in a novel, practical and rather performant way. It qualifies as research.
The frustrating thing about rings of protection is that there have been machines which had the right hardware, but no OS used it. DEC's VAX line had all the hardware for that. Nobody used it. IA-32 has lots of machinery, such as call gates and segment-level memory protection, for fine-grained control over memory access. Nobody used that. AMD left all that stuff out of AMD-64 because nobody used it in IA-32. (I once asked the designer of AMD-64 about that, when he spoke at Stanford.) C and UNIX/Linux want a big flat address space and a vanilla CPU.
Protecting the OS's code and the MMU's state is nice, but it prevents only a few classes of attacks. The "untrusted" kernel still has read and write access to most of the kernel's data structures. Attacks can still mess with networking, files, login, etc.
One can go further with the protection hardware on AMD-64. See the KeyKos->EROS->Coyotos development, which continued until 2008. That project appears to be dead. The original KeyKos system was quite successful, with machines running for decades. Good concept, killed by partnering with the wrong hardware vendors. (Omron? Anybody remember Omron? No?)
> The frustrating thing about rings of protection is that there have been machines which had the right hardware, but no OS used it. [...] C and UNIX/Linux want a big flat address space and a vanilla CPU.
C per se does not want a flat address space: it is, for example, UB to subtract pointers that do not point within the same object. IOW, "far pointers" from the DOS era and segmented 286-style pointers would absolutely be within the bounds of the C standard, if they hadn't used special syntax.
With 386 segment-level protection you could define one segment per object, with byte-level granularity limit checks for segments smaller than 1MB. Each string could have had its own segment with precise length. No more buffer overruns.
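A toy model of that idea, for illustration only: the class and function names are invented, and real hardware does this check in the segmentation unit against a descriptor in the GDT or LDT. The one fact borrowed from the architecture is that a byte-granular segment has a 20-bit limit, which is where the 1MB ceiling comes from.

```python
# Hypothetical sketch of a per-object segment with a byte-granular limit.
from dataclasses import dataclass

MAX_BYTE_GRANULAR_LIMIT = 0xFFFFF  # 20-bit limit field, so at most 1 MiB


@dataclass(frozen=True)
class Segment:
    base: int   # linear address where the object starts
    limit: int  # last valid offset within the segment


def fits(seg: Segment, offset: int, size: int = 1) -> bool:
    """True if [offset, offset + size) lies entirely inside the segment."""
    return offset >= 0 and size >= 1 and offset + size - 1 <= seg.limit


def translate(seg: Segment, offset: int, size: int = 1) -> int:
    """Return the linear address, or raise where hardware would fault."""
    if not fits(seg, offset, size):
        raise MemoryError("offset beyond segment limit")
    return seg.base + offset


# A 5-byte string gets a segment whose limit is exactly 4.
s = Segment(base=0x1000, limit=4)
print(hex(translate(s, 4)))  # last byte of the string
print(fits(s, 5))            # one past the end is rejected
```

With one such segment per string, an overrun by even a single byte faults instead of silently writing into a neighbouring object.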
I think we ended where we are now because of several factors:
1. early hardware -- 286 -- was too limiting. Each segment could be at most 64k long, and
2. programmers (naturally) needed arrays larger than 64k even back then,
3. OS-es, UNIX and Win32, but not VMS, targeted the lowest common HW denominator, which is flat address space with only U/S page protections,
4. rather bad tooling.
The situation is changing slowly, though. Intel's MPX extension offers much of the old segment-level protection, but is opt-in for software and needs tooling support (compiler, linker, loader, etc). This is being worked on, e.g., in gcc: http://gcc.gnu.org/wiki/Intel%20MPX%20support%20in%20the%20G...