At the meta level this is just a special case of "complexity is evil" in security. CPUs have been getting more and more complex, and the relationship between complexity and bugs (of all types) is exponential: each new CPU feature multiplies the opportunities for errata.
A major underlying cause is that we're doing things in hardware that ought to be done in software. We really need to stop shipping software as native blobs and start shipping it as pseudocode, allowing the OS to manage native execution. This would allow the kernel and OS to do tons and tons of stuff the CPU currently does: process isolation, virtualization, much or perhaps even all address remapping, virtual memory handling, etc. CPUs could just present a flat 64-bit address space and run code in it.
These chips would be faster, simpler, cheaper, and more power efficient. It would also make CPU architectures easier to change. Going from x64 to ARM or RISC-V would be a matter of porting the kernel and core OS only.
Unfortunately nobody's ever really gone there. The major problem with Java and .NET is that they try to do way too much at once and solve too many problems in one layer. They're also too far abstracted from the hardware, imposing an "impedance mismatch" performance penalty. (Though this penalty is minimal for most apps.)
What we need is a binary format with a thin (not overly abstracted) pseudocode that closely models the processor. OSes could lazily compile these binaries and cache them, eliminating JIT overhead at program launch except on first launch or after a code change. If the pseudocode contained rich vectorization instructions, etc., there would be little if any performance cost. In fact performance might be better, since the lazy AOT compiler could apply CPU-model-specific optimizations and always use the latest CPU features for all programs.
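For a rough sense of the loader path I have in mind -- a sketch only, where the cache location, helper names, and the stub "compiler" are all hypothetical:

```cpp
#include <filesystem>
#include <fstream>
#include <functional>
#include <sstream>
#include <string>

namespace fs = std::filesystem;

// Placeholder content hash; a real loader would use a cryptographic hash.
static std::string hash_file(const fs::path& p) {
    std::ostringstream buf;
    buf << std::ifstream(p).rdbuf();
    return std::to_string(std::hash<std::string>{}(buf.str()));
}

// Stand-in for the real AOT backend: bytecode in, native image out,
// scheduled and vectorized for this exact CPU model.
static fs::path compile_to_native(const fs::path& bytecode, const std::string& cpu) {
    fs::path out = bytecode;
    out += "." + cpu + ".native";
    // ... run the backend here ...
    return out;
}

// On exec(), consult a per-machine cache keyed by (bytecode hash, CPU model),
// so compilation happens only on first launch or after a code change.
fs::path load_native_image(const fs::path& bytecode, const std::string& cpu_model) {
    fs::path cached = fs::path("/var/cache/aot") /
                      (hash_file(bytecode) + "-" + cpu_model + ".bin");
    if (fs::exists(cached))
        return cached;                                   // warm start: zero JIT cost
    fs::rename(compile_to_native(bytecode, cpu_model), cached);
    return cached;
}
```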
Instead we've bloated the processor to keep supporting 1970s operating systems and program delivery paradigms.
It's such an obvious thing I'm really surprised nobody's done it. Maybe there's a perverse hardware platform lock-in incentive at work.
A lot of these ideas were in the back of our heads in designing WebAssembly, but to keep expectations low, we don't make too much noise about them. However I personally believe that we are on the right track with WASM and am very excited about the future!
It also made me think of PICK (and the hardware PICK CPU implementations), though I never learned enough about PICK's internals when I last used it 20+ years ago, so I could be wildly off-base.
Tao/Intent/Elate (which I think is defunct nowadays) would also qualify, and I'd argue .NET on Windows with the GAC would, too (although there's a legitimate argument about whether that's "simple and closely models the processor").
Tao is long defunct, yes (went under a decade ago). It turns out that people don't really want runtime-portable OSes and apps (IIRC the biggest uptake it got was as a Java runtime for mobile, because the competition at that time was all interpreted). There was no security model in VP, though -- it was a single flat address space, and bytecode could turn any integer into a pointer and dereference it (loads just got translated into host CPU load instructions), so there was no isolation between processes or between processes and the OS.
AS/400 and its descendants have a security model, but they rely at least partially on a trusted runtime code generator (and, transitively, on trusted boot). The systems have HW assist to tag real pointers, but that's mainly for performance reasons. Pointer validity checks are performed in software (or at least they were as of ten years ago), automatically inserted by the bytecode translator. If you subverted the code generator, your malicious code could get a bit further by forging pointers.
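Roughly the shape of such a translator-inserted check -- purely illustrative, with a made-up tag layout, not the real AS/400 scheme (and note Tao's VP, by contrast, just emitted the bare host load):

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical layout: the trusted translator prefixes every load it emits
// with a check like this; forge a pointer's bits yourself and the software
// check (not the MMU) stops you.
constexpr uint64_t kTagMask  = 0xFFFF000000000000ull;  // made-up tag bits
constexpr uint64_t kValidTag = 0x4D49000000000000ull;  // made-up "valid" tag

uint64_t checked_load(uint64_t ptr) {
    if ((ptr & kTagMask) != kValidTag)  // translator-inserted validity check
        std::abort();                   // a real system would raise an exception
    return *reinterpret_cast<uint64_t*>(ptr & ~kTagMask);
}
```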
> We really need to stop shipping software as native blobs and start shipping it as pseudocode, allowing the OS to manage native execution.
What we really need to do is to start shipping all software as source code. This is exactly what JavaScript does, and why it is the most successful method of software distribution ever. WebAssembly is a huge step backward.
> What we need is a binary format with a thin (not overly abstracted) pseudocode that closely models the processor. OSes could lazily compile these binaries and cache them, eliminating JIT overhead at program launch except on first launch or after a code change. If the pseudocode contained rich vectorization instructions, etc., there would be little if any performance cost. In fact performance might be better, since the lazy AOT compiler could apply CPU-model-specific optimizations and always use the latest CPU features for all programs.
> A major underlying cause is that we're doing things in hardware that ought to be done in software. We really need to stop shipping software as native blobs and start shipping it as pseudocode, allowing the OS to manage native execution. This would allow the kernel and OS to do tons and tons of stuff the CPU currently does: process isolation, virtualization, much or perhaps even all address remapping, virtual memory handling, etc. CPUs could just present a flat 64-bit address space and run code in it.
The overall idea has a lot of merit (and, for example, Apple is moving towards this model with the iOS App Store) - but I don't see how it solves the current problem.
Across a variety of architectures, the market has come down firmly in favor of hardware address translation and protection. There are various implementations, many not subject to the current side-channel, but all of them do most of the heavy lifting in hardware: TLBs and related things "just work".
Let's say you had some intermediate format and executed everything in a single 64-bit address space after a final JIT compilation step (your suggestion, as I understand it). How would you implement process and kernel memory protection? It amounts to a bounds check on every memory access. Certainly you can use techniques common in bounds-checking JITs today to eliminate many of the checks via proof methods, hoisting and combining bounds checks, etc - but the cost would still be large in many cases.
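For concreteness, here's the shape of the check the JIT would have to emit, and the standard hoisting trick (a sketch, not any particular JIT's codegen):

```cpp
#include <cstdint>
#include <cstdlib>

// Naive form: every load pays a compare+branch against the sandbox bounds.
inline uint8_t load_checked(const uint8_t* base, uint64_t len, uint64_t i) {
    if (i >= len) std::abort();          // out-of-sandbox access traps
    return base[i];
}

// Hoisted form: prove once, before the loop, that all n accesses are in
// bounds, then run the body check-free. This is the kind of elimination a
// bounds-checking JIT does -- but only when the access pattern is analyzable.
uint64_t sum(const uint8_t* base, uint64_t len, uint64_t n) {
    if (n > len) std::abort();           // one check replaces n checks
    uint64_t total = 0;
    for (uint64_t i = 0; i < n; ++i)
        total += base[i];                // no per-iteration check needed
    return total;
}
```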
Maybe you want a hardware assist for this bounds checking, then? Well, follow that to its logical conclusion and you end up with hardware protection support: maybe in a slightly different form than we have today, but hardware support nonetheless.
There are a lot of things we could do differently with a clean-slate design, and I think intermediate representations have a lot of merit (e.g., the radical performance improvements in the GPU space, enabled partly by the radical architecture changes that intermediate formats made possible, are evidence this works) - but hardware address translation doesn't seem like the problem here.
Bounds checking in hardware is an awesome idea. It's still simpler than full protection modes and is more versatile. Not only does it allow efficient software JIT implementation of protection but it also allows pervasive bounds checking to eliminate buffer overflows and other common errors. It eliminates the performance incentive for a lot of unsafe code.
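To make "hardware assist" concrete: imagine fat pointers carrying bounds that the hardware checks in parallel with the access. A software-only rendering of the idea, with a made-up layout:

```cpp
#include <cstdint>
#include <cstdlib>

// Software rendering of a fat pointer. A hardware assist would keep
// base/length in bound registers and fuse the compare into the load
// itself, making the check effectively free.
struct FatPtr {
    uint8_t* base;
    uint64_t length;
};

uint8_t load(FatPtr p, uint64_t i) {
    if (i >= p.length) std::abort();  // what the hardware would do for us
    return p.base[i];
}
```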
What I'm suggesting is not a total clean slate. It could be done easily with current processors and instruction sets, and would be more an omission of features than a change to the core architecture.
I wonder if doing it on current chips and just ignoring all the protection and remapping logic would have a performance benefit? Look at the boost you get on some databases with transparent hugepages, which kind of do that.
Okay so take this bug for example. It seems to have to do with the CPU speculatively performing a load before checking that it won't generate a page fault due to user code trying to access kernel memory. Say you get rid of process isolation, etc. How do you protect kernel code from user code? You can't do any sort of static analysis I'm aware of that'll still allow you to run C code (which lets you manufacture pointers from arbitrary integers). And if you insert dynamic checks instead, you're talking about turning each memory access into many (memory accesses that in a modern CPU are hidden by the TLB).
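Schematically, the bug's shape is the well-known speculative-read gadget (illustrative; the probe array setup and the timing step are omitted):

```cpp
#include <cstdint>

// Architecturally the first load faults, but a CPU that issues it before
// the permission check completes can leak *kernel_ptr into the cache state
// of probe[] during the speculative window, to be recovered later by
// timing accesses to probe[].
extern uint8_t probe[256 * 4096];

void gadget(const uint8_t* kernel_ptr) {
    uint8_t secret = *kernel_ptr;   // user code reading a kernel address: faults
    (void)probe[secret * 4096];     // may still execute speculatively, leaving
                                    // a secret-dependent cache footprint
}
```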
> And if you insert dynamic checks instead, you're talking about turning each memory access into many (memory accesses that in a modern CPU are hidden by the TLB).
You only have to check that the memory address is not negative (kernel pointers are negative on x86-64). No extra memory access needed.
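That is, the guard the code generator would emit is a single sign test, something like:

```cpp
#include <cstdint>
#include <cstdlib>

// Sketch of the emitted guard: x86-64 kernel addresses live in the upper
// canonical half, so interpreted as a signed integer they're negative --
// one sign test, no extra memory access.
uint64_t user_load(uint64_t addr) {
    if (static_cast<int64_t>(addr) < 0)   // negative => kernel half
        std::abort();                     // deny the access
    return *reinterpret_cast<uint64_t*>(addr);
}
```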
> You can't do any sort of static analysis I'm aware of that'll still allow you to run C code (which lets you manufacture pointers from arbitrary integers).
Good point about the ease of checking kernel pointers. That doesn't address process isolation generally, however, unless you're willing to segment virtual memory in the same way.
As to NaCl, it relies on various CPU protection mechanisms, and also makes some major trade-offs: https://static.googleusercontent.com/media/research.google.c.... On x86, NaCl uses the segmentation mechanism. On x86-64, which effectively lacks segmentation, it masks addresses and requires all memory references to be in a 4GB space. To handle various edge cases, and to speed up stack references, it relies on huge guard areas on either side of the module heap and stack, thus relying on the virtual memory system. Finally, likely to mitigate the overhead of masking, it does not check reads at all, and relies on the virtual memory system to protect secret browser information from the sandboxed process. Even with these limitations, on about half the SPEC benchmarks the overhead is 15-45%.
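Schematically, the x86-64 write sandboxing amounts to something like this (simplified; real NaCl also constrains control flow, alignment, and register usage):

```cpp
#include <cstdint>

// Writes are clamped into a 4GB window above a fixed sandbox base, and the
// multi-gigabyte guard regions around the window absorb base+offset
// arithmetic that would otherwise escape. Reads, as noted, go unchecked.
void sandboxed_store(uint8_t* sandbox_base, uint64_t addr, uint8_t value) {
    uint64_t offset = addr & 0xFFFFFFFFull;   // mask into the 4GB window
    sandbox_base[offset] = value;             // guard pages catch the rest
}
```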
Good point about static analysis and C code. That means you would not be able to toss memory protection unless you introduce fast hardware support for bounds checking and bounds check everything in JITed code. I guess you could also have the JIT do more elaborate guarding of memory but that would probably have a performance penalty.
You could still toss a lot: virtualization, complex multi-layered protection modes, address remapping, and essentially every hardware feature that exists to support legacy binary code. All deprecated instructions and execution modes could go, etc.
Finally you would maintain the benefit of architecture flexibility. Switching from x86 to ARM, etc., would be easy.
Out of curiosity, which parts of .NET bytecode do you believe to be "too far abstracted from the hardware"? The object model, certainly, but you don't need to use that. On the other hand, the basic instruction set for arithmetic and pointers seems to be on the same abstraction level as WebAssembly to me.
You can build C++ as .NET, absolutely. So far as I know, it can handle everything in the Standard except for setjmp/longjmp. All it takes is compiling with /clr:pure.
What you're referring to is probably C++/CLI, which wasn't removed, but it hasn't really been updated for a while. C++/CLI is a set of language extensions that make it possible to interface with the .NET object model.
If we go feature by feature, the .NET type system and bytecode have (see the sketch after this list):
- unsigned types
- raw (non-GC) data pointers with pointer arithmetic
- raw function pointers (distinct from delegates)
- structs and unions
- dynamic memory allocation on the stack (like alloca)
- vararg functions
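A tiny bit of plain C++ touching each item; per the sibling comment, /clr:pure compiles code like this straight to CIL, so the bytecode must be able to express all of it (illustrative only):

```cpp
#include <cstdarg>
#include <cstdint>

struct Pair { int a; int b; };            // structs
union Word { uint32_t u; int32_t s; };    // unions + unsigned types

static int square(int x) { return x * x; }

static int sum(int count, ...) {          // vararg function
    va_list ap;
    va_start(ap, count);
    int total = 0;
    for (int i = 0; i < count; ++i)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}

int demo() {
    Word w;
    w.u = 7u;
    Pair pr{1, 2};
    int buf[3] = {1, 2, 3};
    int (*fp)(int) = &square;             // raw function pointer, not a delegate
    int* p = buf + 1;                     // raw (non-GC) pointer, pointer arithmetic
    // (alloca/_alloca would cover the stack-allocation item)
    return fp(*p) + w.s + sum(2, pr.a, pr.b);
}
```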
While what you're saying sounds nice, your theory has nothing to do with practice.
In reality, the ultimate source of this problem is the mismatch in speed between silicon logic and silicon memory. This is why your CPU ends up doing all sorts of tricks like caching, branch prediction, and speculative execution to compensate for slow memory.