Does anyone have links that would explain this concept in greater detail? I get the meaning of W^X and I even understand the danger of executable memory, and why one wouldn't want it to be write-able (or at least not write-able by untrusted writers), but, beyond that I don't really understand the broader implications.
Also, it sounds like it would be a massive undertaking, even for a small(ish) kernel like OpenBSD, if the kernel wasn't always written with this goal in mind. Is that the case? (I don't see a patch referenced, so I'm not able to judge for myself.)
OpenBSD's kernel virtual/physical memory system, uvm(9), already had the ability to set permissions on mappings. But the recent work done here is still substantial: APIs were cleaned up, and the W^X policy was applied in the places where memory was being mapped with both permissions (writable and executable).
I'm not overly familiar with all that went into this, but reading the commit logs by both Theo and Mike might help clarify some of it.
Stack smashing attacks are no longer possible, since:
1. ProPolice detects most attempts to clobber the return address.
2. You can't set the return address to memory you control the contents of, such as user input, since that memory is writable and therefore not executable.
3. The remaining way to get code into executable memory is to write a file and have the program mmap(2) it. Address space layout randomisation makes finding this code difficult, even if you have the ability to smash the stack in a way that bypasses ProPolice.
The difficult cases (like the trampoline case mentioned) are usually problematic because they are programmatically writing small functions in machine code, then executing them; basically this requires the discipline to write the function and then immediately flip the page from writable to executable. Implementing a JIT compiler like the JVM would encounter similar difficulties.
OpenBSD has switched its platforms to using PIE (position-independent executables) by default. The 5.7 release will also introduce self-relocating static PIE.
I think you're confusing the kernel and userland, as the post mentions that OpenBSD's kernel ASLR support (i.e. position independence) is currently limited. When it comes to userland, W^X is a much older feature, which OpenBSD pioneered, but which by now is essentially ubiquitous (except when a JIT is in use, e.g. in web browsers), along with ASLR.
Such features have helped make exploits more difficult over time, but far from impossible - it all depends on the type of vulnerability, as well as things like how much interactivity exists between the attacker/the attacker's code and the target (potentially allowing them to gather data about ASLR, stack canaries, etc. before sending the final code execution bit). For example, web browsers are a very good case for the attacker, where not only is there a lot of interactivity in the form of JavaScript method calls, but a JIT usually ensures RWX pages exist; on the other side, an inetd server that spawns a new process for every request, with new ASLR offsets and stack canaries, would be pretty bad for the attacker, since there is little interactivity.
When it comes to the kernel, an important attack source is userland programs (already compromised or run by a malicious user in a multiuser system) trying to abuse the system call interface. In this case, not only is there a lot of interactivity (many system calls + complex low-level device drivers, if applicable + weird CPU features + high level of control over multiple cores/threads and timing + sharing the same CPU caches etc. with the kernel), on pre-Haswell x86-64 processors, there is actually no performant way for the kernel to prevent the memory of the currently running user process from being directly accessible from it (not executable as of Ivy Bridge though), making any kind of ASLR much less useful. So while kernels can and do get pretty far by having well-written code that avoids vulnerabilities, they usually only need to give an inch for userland to take a mile. There are, however, other, less favorable attack scenarios, e.g. remote attacks on network stacks, and in any case W^X can't hurt.
It comes from the early x86 processors not having a "no-execute" permission bit for page table entries. In the AMD64 architecture, an NX bit was added.
It's fairly rare that you want to write and execute a page somewhat simultaneously, as a high-level goal. Most of the time you're loading some existing executable from disk, and once it's loaded you're executing it, so you can just have the kernel write to it, switch it to executable and nonwritable, and pass it back to the userspace process. Sometimes you have a JIT or a VM doing binary translation, but even there you typically JIT some code once and then call it one or more times, and if you JIT some other code it's a different page. You don't typically intersperse writing code to a page and executing that page. So you just want userspace to be a little rigorous about separating those two steps, and tell the kernel when it's switching between those two, instead of requesting a page that's simultaneously writable and executable.
As another commenter said, one of the reasons this isn't done at the outset is that some architectures (notably x86-32) don't implement W^X in hardware, so there's no pressure to be 100% clean about this. But it's rare that you have code that isn't straightforward, at least conceptually, to rework into W^X compatibility.
One of the more annoying things is fixing up relocations: if you have a call to a function in a dynamic library, and you don't know where that library is going to be loaded until it's loaded, the most obvious way to implement this is to map your program code writable and fill in addresses once you know what they are. Each place where an address needs to be filled in is called a "relocation". So when you load a dynamic library, you loop over all relocations and fill in any addresses for symbols contained in that dynamic library.
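As a rough illustration of the "loop over all relocations and fill in addresses" step, here is a minimal C sketch. The struct layouts and the names toy_reloc, toy_symbol and apply_relocs are made up for this example; real ELF relocations carry more information (a type, an addend) and this is not how any particular loader is written.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy relocation record: "store the address of `symbol`
 * at byte `offset` inside the loaded image". */
struct toy_reloc {
    size_t offset;
    const char *symbol;
};

/* Hypothetical symbol table entry for a library that was just loaded. */
struct toy_symbol {
    const char *name;
    uintptr_t address;
};

/* Fill in every relocation whose symbol the library defines and return
 * the number of slots patched. The image has to be writable while this
 * runs, which is exactly what makes relocations awkward under W^X. */
static size_t apply_relocs(unsigned char *image, size_t image_len,
                           const struct toy_reloc *relocs, size_t nrelocs,
                           const struct toy_symbol *syms, size_t nsyms)
{
    size_t patched = 0;

    for (size_t i = 0; i < nrelocs; i++) {
        size_t off = relocs[i].offset;

        /* Skip records that would write outside the image. */
        if (off > image_len || image_len - off < sizeof(uintptr_t))
            continue;

        for (size_t j = 0; j < nsyms; j++) {
            if (strcmp(relocs[i].symbol, syms[j].name) == 0) {
                memcpy(image + off, &syms[j].address, sizeof(uintptr_t));
                patched++;
                break;
            }
        }
    }
    return patched;
}
```

Relocations for symbols the library does not define are left untouched, so a later library load can fill them in.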
There are lots of reasons this is awful; one is that you have to go update every place in the program code. So you indirect that through a thing called the "procedure linkage table" (PLT), which contains a bunch of tiny functions that just go call your real dynamic functions, and you hard-code references to the PLT. The PLT still has to be writable, though. If you don't want that, you make a separate section of the program called the "global offset table" (GOT) that contains addresses, and you have each stub function in the PLT do an indirect function call to a matching entry in the GOT. So the PLT is executable and doesn't need to be writable, and the GOT is writable and doesn't need to be executable.
If you want to be super paranoid, you resolve all your dynamic libraries at startup and mark the GOT as read-only ("bind now" and read-only relocations aka "relro", respectively), so that nothing is writable. But that's a different discussion from W^X.
(This telling is not very historically accurate about how the PLT and GOT came to exist, but hopefully the explanation of what they do is close enough to correct to convey the general ideas.)
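The PLT/GOT split described above can be mimicked in plain C with a table of function pointers. This is only a toy model under stated assumptions: got, plt_double, lazy_resolver and real_double are hypothetical names, and a real dynamic linker does the lookup by symbol name rather than hard-coding the target.

```c
#include <stddef.h>

typedef int (*fn_t)(int);

static int resolve_count = 0;   /* how many times the slow path ran */

/* Stand-in for the "real" function living in a dynamic library. */
static int real_double(int x) { return 2 * x; }

static int lazy_resolver(int x);

/* Toy GOT: writable data that is never executed. Each slot starts out
 * pointing at the resolver. */
static fn_t got[1] = { lazy_resolver };

/* The first call lands here: "look up" the symbol, patch the GOT slot,
 * then forward the call to the real function. */
static int lazy_resolver(int x)
{
    resolve_count++;
    got[0] = real_double;
    return got[0](x);
}

/* Toy PLT stub: fixed code whose only job is an indirect call through
 * the matching GOT slot. Nothing ever writes to it. */
static int plt_double(int x) { return got[0](x); }
```

Callers only ever reference plt_double. After the first call the GOT slot points at real_double, so the resolver is not entered again. Resolving every slot up front and then making the table read-only would be the analogue of "bind now" plus relro.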
Correct me if I'm wrong, but aren't amd64 kernels always going to have this implemented in hardware? I mean, W^X is a replication of the NX bit, which is (as far as I know) mandatory for the x86 instruction set.
You're correct, the protection is implemented in hardware, but the pages have to be marked appropriately. This message describes a patchset that correctly marks the kernel pages as writable xor executable.
It'd be quite possible for a JIT to have the memory first writable but not executable when creating the code, then the other way around when running it. No need to be both at the same time.
As I understand it, the NX bit is just a new permission bit to say "Don't execute this page". The i386 architecture only had read and write bits, and assumed that read also meant execute.
W^X is a policy that a kernel can choose to implement, that if the W bit is set on a page table entry, so is the NX bit. You need the NX bit to be available in hardware for this to be useful, but hardware support for NX doesn't mean that you have to use it, let alone implement W^X.
This means that amd64 processors are backwards-compatible with kernel and userspace designs that require W|X, even in long (64-bit) mode.
Not a lot of context here: I take it this means mapping pages as writable or executable, but never both? And this is being applied to the kernel itself and the pages it maps into kernel space?
That's right. W^X is a policy that memory is either writable or executable, but not both. OpenBSD uses this model in userspace; now it's being taken a step further and applied to kernel space.
According to [1] it's W xor X. Interpreting that literally suggests that read only memory is disallowed as well. I'd be surprised if that's actually the case.
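To make the distinction concrete: the policy as it is usually described forbids only the combination of W and X, so read-only (and no-access) mappings are fine, whereas a strictly literal XOR would reject them. A small C sketch comparing the two readings, using the PROT_* flags from <sys/mman.h>; the function names are made up for this example.

```c
#include <stdbool.h>
#include <sys/mman.h>

/* The policy as usually described: reject only W and X together. */
static bool wx_policy_allows(int prot)
{
    return !((prot & PROT_WRITE) && (prot & PROT_EXEC));
}

/* A strictly literal XOR: exactly one of W and X must be set. */
static bool literal_xor_allows(int prot)
{
    return ((prot & PROT_WRITE) != 0) != ((prot & PROT_EXEC) != 0);
}
```

The two functions agree on read/write, read/execute and read/write/execute mappings; they differ only when neither W nor X is set, for example PROT_READ alone.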
There is much mention of JITs and the workaround being to switch page permissions, but here is an example of an SMC pattern that W^X would really not work with; a function that does something the first time it is called, and collapses into a single RETurn instruction thereafter:
once:
    mov byte [once], 195    ; 195 = 0xC3, the x86 RET opcode: overwrite our own first byte
    ; ...do something here...
    ret
I have used this technique in application-level code, where it is significantly more efficient (both smaller and faster) than the alternatives when this "once" function will be called many times. I think it is always important to remember that while W^X and other restrictions have security benefits, they also have downsides in limiting some interesting creativity and the potential to exploit the full abilities of the machine.
First thought offhand is to have the code be writeable and executable the first time through -- W^X allows for this for JITs and the like -- and then set the page the code lives on to be executable only after this instance. Alternatively, have the once function call a function in memory space that's already set to execute only, to minimize the space attackers can perform shenanigans in.
However, this does kinda ignore one of the big focuses of the OpenBSD project. They tend to shy away from such clever hacks in the name of readability and auditability. While it's definitely a neat way of ensuring your code is only executing once, it becomes a hassle when you have to port it to other platforms. Keep in mind that OpenBSD ports to as many platforms as possible because the subtle quirks of various platforms will often tickle out rare bugs to become more repeatable. In this case, your replacing the once function with a return is dependent on x86, so wouldn't work on the many other platforms that OpenBSD runs on.
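For comparison, one portable way to get a similar "collapses after the first call" effect without touching code pages is to route the call through a function pointer held in ordinary writable data. This is a sketch of that substitute technique, not the self-modifying version from the parent comment; all names are hypothetical, and it costs an indirect call on every invocation.

```c
static int init_count = 0;      /* counts how often the real work ran */

static void once_noop(void) { }
static void once_first(void);

/* Writable data pointer; the code pages themselves never change. */
static void (*once_impl)(void) = once_first;

static void once_first(void)
{
    once_impl = once_noop;      /* later calls go straight to the no-op */
    init_count++;               /* ...do something here... */
}

/* Entry point that callers use. */
static void once(void) { once_impl(); }
```

As written this is not thread-safe. Where several threads may race on the first call, pthread_once(3) is the standard POSIX facility for the same job.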
This is a pretty neat idea and interesting change! I wonder how having a security model where each page of memory can only be one of writable/executable impacts JITs, though? (I guess that's perhaps why JITs often have those landing pad spots at the top of functions/methods?)
You can have multiple threads generating code if necessary, you just need to ensure that each has its own pages to write to. Once the machine code is written the page can be flipped to rx and it's safe to share across multiple threads.
Exactly, yes. It's a *BSD thing, not strictly just an OpenBSD one.
Perhaps it would help to see them "in action": here is a link to the Linux emulation layer in FreeBSD. The MD (machine-dependent) chapter four specifically describes i386 (this is how you put syscall parameters on, and take them off, the stack on an i386), and the MI (machine-independent) chapter five is a pile of structs that would be used by any emulation layer (NPTL, TLS, the joy of futexes (a Linux thing that is kind of a mutex cache for speed, sorta kinda), and good luck with the ioctls).
OpenBSD/amd64 does have SMEP/SMAP support; this was committed by Jonathan Gray (jsg@) in 2012, using QEMU. It's still hard to come by in actual hardware.
So this means you can't have features like ftrace (and kpatch), BPF, or a kernel that re-configures itself at runtime like x86 Linux does at boot once it detects the hardware features. Of course you can work around all that by switching the page W/X bits as appropriate, but it's a bit more complex.
I'm not sure I see the link between it being a character device and the fact that memory pages are W^X. But you are correct that I was wrong when saying you can't have it; I also said it was more complex, since you have to be careful when switching a page from W to X.