This is super cool. This exploit will be one of the canonical examples showing that just running something in a VM does not mean it's safe. We've always known about VM breakout, but this is a no-breakout massive exploit that is simple to execute and gives big payoffs.
Remember: just because this one bug gets fixed in microcode doesn't mean there's not another one of these waiting to be discovered. Many (most?) 0-days are known about by black-hats-for-hire well before they're made public.
The problem is, VMs aren't really "Virtual Machines" anymore. You're not parsing opcodes in a big switch statement, you're running instructions on the actual CPU, with a few hardware flags that the CPU says will guarantee no data or instruction overlap. It promises! But that's a hard promise to make in reality.
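(For contrast, a classic software VM really is just a dispatch loop executing guest opcodes entirely in host code. A toy sketch in C, with the opcodes and encoding invented purely for illustration:)

    /* Toy bytecode interpreter: every guest "instruction" is decoded and
       executed by host code, so guest code never runs raw on the real CPU.
       Opcode numbering and semantics here are made up for illustration. */
    #include <stdint.h>
    #include <stddef.h>

    enum { OP_HALT = 0, OP_LOAD_IMM = 1, OP_ADD = 2 };

    int64_t run(const uint8_t *code, size_t len) {
        int64_t regs[4] = {0};
        for (size_t pc = 0; pc + 2 < len; ) {
            switch (code[pc]) {
            case OP_LOAD_IMM:                 /* regs[a] = imm8 */
                regs[code[pc + 1] & 3] = (int8_t)code[pc + 2];
                pc += 3;
                break;
            case OP_ADD:                      /* regs[a] += regs[b] */
                regs[code[pc + 1] & 3] += regs[code[pc + 2] & 3];
                pc += 3;
                break;
            case OP_HALT:
            default:
                return regs[0];
            }
        }
        return regs[0];
    }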
Looking at IBM's tech from the sixties is somehow weirdly depressing: it's unbelievable how much of this architectural stuff they had already invented by 1970.
Not depressing, but inspiring. So many great architectural ideas can be made accessible to millions of consumers, not limited to a few thousand megacorps.
In the early days of virtualization on PCs (things like OS/2's DOS box) the VM was 100% a weird special-case VM that wasn't even running in the same mode (virtual 8086 vs 286 / 386 mode), and that second-class functionality continued through the early iterations of "modern" systems (vmware / kvm / xen).
"PC" virtualization's getting closer to big iron virtualization, but likely will never quite get there.
Also -- I was running virtual machines on a 5150 PC when it was a big fast machine -- the UCSD p-System ran a p-code virtual machine to run p-code binaries, which would run equally well on an Apple II. In theory.
IMO, it’s only a special case for commercial support reasons. Almost every engineer, QE, consultant, solution architect I know runs or has run nested virtualization for one reason or another.
Just how many times is the average operating system workload (with or without a virtual machine also running a second average operating system workload) context switching a second?
Like... unless I'm wrong... the kernel is the main process, and then it slices up processes/threads, and each time those run, they have their own EAX/EBX/ECX/ESP/EBP/EIP/etc. (I know it's RAX, etc. for 64-bit now)
How many cycles is a thread/process given before it context switches to the next one? How is it managing all of the pushfd/popfd, etc. between them? Is this not how modern operating systems work, am I misunderstanding?
> How many cycles is a thread/process given before it context switches to the next one?
Depends on a lot of things. If it's a compute heavy task, and there's no I/O interrupts, the task gets one "timeslice", timeslices vary, but typical times are somewhere in the neighborhood of 1 ms to 100 ms. If it's an I/O heavy task, chances are the task returns from a syscall with new data to read (or because a write finished), does a little bit of work, then does another syscall with I/O. Lots of context switches in network heavy code (io_uring seems promising).
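If you want an actual number out of a running system, one rough way (a minimal sketch, Linux-specific, and only really meaningful for SCHED_RR tasks - CFS timeslices are computed dynamically) is to ask the kernel for the round-robin quantum:

    /* Print the round-robin scheduling quantum for the calling thread.
       Sketch only: applies to SCHED_RR; CFS doesn't use a fixed timeslice. */
    #include <sched.h>
    #include <time.h>
    #include <stdio.h>

    int main(void) {
        struct timespec ts;
        if (sched_rr_get_interval(0, &ts) == 0)
            printf("RR quantum: %ld.%09ld s\n", (long)ts.tv_sec, (long)ts.tv_nsec);
        return 0;
    }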
> How is it managing all of the pushfd/popfd, etc. between them?
The basic plan is: when the kernel takes an interrupt (or gets a syscall, which is an interrupt on some systems and other mechanisms on others), the kernel (or the CPU) loads the kernel stack pointer for the current thread and pushes all the (relevant) CPU registers onto that stack. Then the kernel business is taken care of, the scheduler decides which userspace thread to return to (which might or might not be the same one that was interrupted), the destination thread's kernel stack is switched to, registers are popped, the thread's userspace stack is switched to, and userspace execution resumes.
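As a very rough sketch of what that saved state amounts to (structure and field names invented for illustration; real kernels do the actual save/restore in assembly on the interrupt/syscall entry and exit paths):

    /* Rough model of per-thread saved context: filled in on kernel entry,
       restored on the way back out to userspace. Names are illustrative,
       not any real kernel's layout. */
    #include <stdint.h>

    struct saved_regs {
        uint64_t rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp;
        uint64_t r8, r9, r10, r11, r12, r13, r14, r15;
        uint64_t rip, rflags;
    };

    struct thread {
        struct saved_regs regs;   /* pushed here on entry to the kernel   */
        void *kernel_stack;       /* each thread has its own kernel stack */
    };

    /* The scheduler just picks which thread's registers get popped back
       into the CPU on the way out; a trivial placeholder policy: */
    struct thread *pick_next(struct thread *runnable[], int n, struct thread *cur) {
        return n > 0 ? runnable[0] : cur;
    }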
Why do comments like this just make a bold claim and then wander off as if the claim stands for itself? No explanation. No insight. I mean why should we just take your word for it?
I'd like to be educated here why a big switch statement wouldn't necessarily protect us from these CPU vulnerabilities? Anyone willing to help?
The question should rather be: why would it protect you? This switch statement also runs on a CPU, which is still vulnerable. This CPU still speculates the execution of the switch statement. No amount of software will make hardware irrelevant.
Hence my choice of phrasing: 'wouldn't necessarily protect you'.
So, yes, the switch statement might be safe, but you would need to prove that your switch statement doesn't use those instructions. You don't get to claim that for free just because you are using a switch statement.
Conversely, even if you execute bare metal instructions for the user of the VM, you could also deny those instructions to the user. Eg by not allowing self-modifying code, and statically making sure that the relevant code doesn't contain those instructions.
So the switch statement by itself does not do anything for your security.
Tangent: To deny those bare-metal instructions with static analysis, you might also have to flat out deny certain sequences of instructions that, when jumped to "unaligned" would also form the forbidden instruction. That might break innocent programs, no?
Simple: don't allow unaligned jumps. Google's NaCl already figured out how to do that ages ago. (Eg you could only allow jumps after a bit-masking operation. Details depend on the architecture.)
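Roughly, the idea is that every indirect jump target gets masked down to an aligned "bundle" boundary before the jump. A minimal sketch (the bundle size and names here are illustrative; NaCl's real rules are per-architecture and enforced by a static validator):

    /* Sketch of NaCl-style jump-target masking: indirect control flow may
       only land on bundle-aligned addresses, so it can never jump into the
       middle of an instruction sequence. BUNDLE_SIZE is illustrative. */
    #include <stdint.h>

    #define BUNDLE_SIZE 32u

    static inline uintptr_t mask_jump_target(uintptr_t target) {
        return target & ~((uintptr_t)BUNDLE_SIZE - 1);  /* force alignment */
    }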
But yes, unless you solve the halting problem, anything that bans all bad programs will also have false positives. It's the same with type systems in programming languages.
Even if we pretend docker is a VM, building an image can happen on as many cores as you like in this hypothetical, it's the running of it that should be restricted.
The comparison to Meltdown/Spectre is a bit misleading though - those were a whole new form of attack based on timing, where the CPU did exactly what it should have done. This Zenbleed case is a good old-fashioned bug: data in a register that shouldn't be there.
Running untrusted code whether in a sandbox, container, or VM, has not been safe since at least Rowhammer, maybe before. I believe a lot of these exploits are down to software and hardware people not talking. Software people make assumptions about the isolation guarantees, hardware people don't speak up when said assumptions are made.
Hardware people are the ones making those promises, so I don't think that's right at all. And Rowhammer is a way overstated vulnerability - there are all sorts of practical issues with it, especially if you're on modern, patched hardware.
In the end, I'm thinking most of these are related to branch prediction?
It strikes me that either branch prediction is so inherently complex that it's always going to be vulnerable to this, and/or it just so defies the way most of us intuitively think about code paths / instruction execution that it's hard to conceive of the edge cases until it's too late?
At what point does the complexity of CPU architectures become so difficult to reason about that we just accept the performance penalty of keeping it simpler?
More generally, most of them are related to speculative execution, where branch mis-prediction is a common gadget to induce speculative mis-execution.
Speculation is hard, it's sort of akin to the idea of introducing multithreading into a program, you are explicitly choosing to tilt at the windmill of pure technical correctness because in a highly concurrent application every error will occur fairly routinely. Speculation is great too, in combination with out-of-order execution it's a multithreading-like boon to overall performance, because now you can resolve several chunks of code in parallel instead of one at a time. It's just also a minefield of correctness issues, but the alternative would be losing something like the equivalent of 10 years of performance gains (going back to like ARM A53 performance).
The recent thing is that "observably correct" needs to include timings. If you can just guess at what the data might be, and the program runs faster if you're correct, that's basically the same thing as reading the data by another means. It's a timing oracle attack.
(in this case AMD just fucked up though, there's no timing attack, this is just implemented wrong and this instruction can speculate against changes that haven't propagated to other parts of the pipeline yet)
The cache is the other problem, modern processors are built with every tenant sharing this single big L3 cache and it turns out that it also needs to be proof against timing attacks for data present in the cache too.
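For a concrete feel of what such a timing oracle looks like, here's a heavily simplified flush+reload-style probe of the kind used in the Spectre-family attacks - a sketch assuming x86 intrinsics, not a working exploit:

    /* Flush+reload probe sketch: after (mis)speculated code has touched
       probe_array[secret * 4096], the one cache line that loads fast
       reveals the secret byte. Assumes GCC/Clang x86 intrinsics. */
    #include <stdint.h>
    #include <x86intrin.h>

    static uint8_t probe_array[256 * 4096]; /* one cache line per byte value */

    void flush_probe(void) {
        for (int i = 0; i < 256; i++)
            _mm_clflush(&probe_array[i * 4096]);
    }

    int recover_byte(void) {
        int best = -1;
        uint64_t best_time = UINT64_MAX;
        for (int i = 0; i < 256; i++) {
            unsigned aux;
            uint64_t t0 = __rdtscp(&aux);
            (void)*(volatile uint8_t *)&probe_array[i * 4096];
            uint64_t t1 = __rdtscp(&aux);
            if (t1 - t0 < best_time) { best_time = t1 - t0; best = i; }
        }
        return best;  /* index of the fastest (cached) slot = guessed byte */
    }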
> At what point does the complexity of CPU architectures become so difficult to reason about that we just accept the performance penalty of keeping it simpler?
Never for branch prediction. It just gets you too much performance. If it becomes too much of a problem, the solution is greater isolation of workloads.
In certain cases isolation and simplicity overlap, I suspect for example that the dangers of SMT implementation complexity are part of why Apple didn't implement it for their respective CPUs. Likely we'll see this elsewhere too, for example Amazon may not ever push to have SMT in their Graviton chips (the early generations are off the shelf cores from ARM where they didn't have a readily available choice).
I could be mistaken, but I don't think Zenbleed has anything to do with SMT, based on my reading of the document. There is a mention of hyperthreads sharing the same physical registers, but you can spy on anything happening on the same physical core, because the register file is shared across the whole core.
It even says so in the document:
> Note that it is not sufficient to disable SMT.
Apple's chips don't have this vulnerability, but it's not because they don't have SMT. They just didn't write this particular defect into their CPU implementation.
Correct, I was responding to parent writing "At what point does the complexity of CPU architectures become so difficult to reason about that we just accept the performance penalty of keeping it simpler?"
I think we may be seeing an industry-wide shift away from SMT because the performance penalty is small and the complexity cost is high, if so that fits parent's speculation about the trend. In a narrow sense Zenbleed isn't related to SMT but OP's question seems perfectly relevant to me. I come from a security background and on average more complicated == less secure because engineering resources are finite and it's just harder and more work to make complicated things correct.
Not really if that's an attack you're concerned about, because guests can attack the hypervisor via the same mechanisms. You would need to gang schedule to ensure all threads of a core were only either in host or guest.
>At what point does the complexity of CPU architectures become so difficult to reason about that we just accept the performance penalty of keeping it simpler?
Basically never for anything that's at all CPU-bound; that growth in complexity is really the only thing that's been powering single-threaded CPU performance improvements since Dennard scaling stopped in about 2006 (and by that time they were already plenty complex: by the late 90s and early 2000s x86 CPUs were firmly superscalar, out-of-order, branch-predicting, speculatively executing devices). If your workload can be made fast without needing that stuff (i.e. no branches and easily parallelised), you're probably using a GPU instead nowadays.
You can rent one of the Atom Kimsufi boxes (N2800) to experience first-hand a CPU with no speculative execution. The performance is dire, but at least it hasn't gotten worse over the years - they are immune to just about everything.
We demanded more performance and we got what we demanded. I doubt manufacturers are going to walk back on branch prediction no matter how flawed it is. They'll add some more mitigations and features which will be broken-on-arrival.
I didn't demand more performance. My 2008-era AthlonX2 would still be relevant if web browsers hadn't gotten so bloated. I still use it for real desktop applications, i.e. everything that isn't in Electron.
There's VLIW / 'preprediction' / some other technical name I forget for architectures which instead ask you to explicitly schedule instruction/data/branch prediction. If I remember, the two biggest examples I can think of were IA64 and Alpha. I wanna think HP-PA did the same but I'm not clear on that one.
For various reasons, all these architectures eventually lost out due to market pressure (and cost/watt/IPC, I guess).
Yup! I worked at a few companies that would co-mingle Internet-facing/DMZ VMs with internal VMs. When pointing this out and recommending we air-gap those VMs onto their own dedicated hypervisor, it always fell on deaf ears. Joke's on them, I guess.
You can pay AWS a premium to make sure you're the only tenant on the physical machine. You can also split your own stuff into multiple tenants, and keep those separate too.
Eric Brandwine (VP/DE @ AWS) said publicly in 2019 that EC2 had never scheduled different tenants on the same physical core at the same time, even before we learned about these kinds of side-channel attacks.
Even before then, the sufficiently paranoid (but still bound to AWS for whatever reason) would track usage/steal/IO reporting along with best guesses for Amazon hardware expenditure, and use that information to size instances to attempt to coincide with 1:1 node membership.
Yes (lowest vCPU seems to be 2 everywhere), and that protects against this attack. However, this thread was talking about airgapping hosts, which is needed for the general threat of VM escapes.
Yes, but the Firecracker VMs are pinned to specific cores, so no two tenants ever share a CPU core. Other than Rowhammer, has there been a hardware vulnerability of this nature that has worked cross-core? I don't recall.
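(For reference, core pinning on Linux is just an affinity call; hypervisors do the equivalent for their vCPU threads. A minimal sketch pinning the calling process to an arbitrarily chosen core:)

    /* Pin the calling process to one CPU core via the Linux affinity API. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);               /* core number chosen arbitrarily */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        puts("pinned to core 2");
        return 0;
    }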
Still, I think that if your company is handling user data it's worth seriously considering dedicated instances for any service that encounters plaintext user information.
That sounds like it's leaking across user/process boundaries on a single EC2 instance, which presumably also requires the processes to be running on the same core.
Leaks between different EC2 instances would be far more serious, but I suppose that wouldn't happen unless two tenants / EC2 instances shared SMT cores, or the contents of the microarchitectural register file was persisted across VM context switches in an exploitable manner.
The problem is that the logical registers don't have a 1:1 relation to the physical registers.
For example, let's imagine a toy architecture with two registers: r0 and r1. We can create a little assembly snippet using them:

    r0 = load(addr1)
    r1 = load(addr2)
    r0 = r0 + r1
    store(addr3, r0)

Pretty simple.
Now, what happens if we want to do that twice? Well, we get something like:

    r0 = load(addr1)
    r1 = load(addr2)
    r0 = r0 + r1
    store(addr3, r0)
    r0 = load(addr4)
    r1 = load(addr5)
    r0 = r0 + r1
    store(addr6, r0)

Because there is no overlap between the accessed memory sections, they are completely independent. In theory they could even execute at the same time - but that is impossible because they use the same registers.
This can be solved by adding more physical registers to the CPU, let's call them R0-R6. During execution the CPU can now analyze and rewrite the original assembly into:

    R1 = load(addr1)
    R4 = load(addr4)
    R2 = load(addr2)
    R5 = load(addr5)
    R3 = R1 + R2
    R6 = R4 + R5
    store(addr3, R3)
    store(addr6, R6)

This means we can now start the loads for the second addition before the first addition is done, which means we have to wait less time for the data to arrive when we finally want to actually do the second addition. To the user nothing has changed and the results are identical!
The issue here is that when entering/exiting a VM you can definitely clear the logical registers r0&r1, but there is no guarantee that you are actually clearing the physical registers. On a hardware level, "clearing a register" now means "mark logical register as empty". The CPU makes sure that any future use of that logical register results in it behaving as if it has been clear, but there is no need to touch the content of the physical register. It just gets marked as "free for use". The only way that physical register becomes available again is after a write, after all, and that write would by definition overwrite the stale content - so clearing it would be pointless. Unless your CPU misbehaves and you run into this new bug, of course.
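A toy model of that rename behaviour (entirely illustrative - the register counts, free-list policy and names are made up) shows why "clearing" a logical register leaves stale data sitting in the physical file:

    /* Toy register renaming: 2 logical registers, 7 physical ones.
       "Clearing" a logical register only changes the mapping; the old
       physical entry keeps its stale contents until it gets reused. */
    #include <stdint.h>
    #include <stdio.h>

    #define N_LOGICAL  2
    #define N_PHYSICAL 7

    static uint64_t phys[N_PHYSICAL];          /* the physical register file  */
    static int      map[N_LOGICAL] = {0, 1};   /* logical -> physical mapping */
    static int      next_free = 2;             /* trivial free "list"         */

    /* Every write allocates a fresh physical register for the logical one. */
    static void rename_write(int logical, uint64_t value) {
        int p = next_free;
        next_free = (next_free + 1) % N_PHYSICAL;  /* simplistic reuse policy */
        phys[p] = value;
        map[logical] = p;
    }

    /* "Clear" a logical register: remap it to a zeroed entry. The previously
       mapped physical register is NOT zeroed - its old value lingers. */
    static void clear_logical(int logical) {
        rename_write(logical, 0);
    }

    int main(void) {
        rename_write(0, 0xdeadbeef);   /* a "secret" lands in the file       */
        clear_logical(0);              /* architecturally r0 is now zero...  */
        for (int p = 0; p < N_PHYSICAL; p++)  /* ...but the stale value remains */
            printf("phys[%d] = %llx\n", p, (unsigned long long)phys[p]);
        return 0;
    }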
The problem is the freed entries in the register file. A VM can, at least, use this bug to read registers from a non-VM thread running on the adjacent SMT/HT of a single physical core. I suspect a VM could also read registers from other processes scheduled on the same SMT/HT.
Not only do people do this, it's generally how VPS providers work. Most machines barely use the CPU most of the time (web servers etc.) so reserving a full CPU core for a VPS is horribly inefficient. It doesn't matter anyway, because SMT isn't relevant for this particular bug.
With SMT effectively giving twice the logical cores per CPU for most workloads, disabling it would double the cost for most providers!
There are VPS providers that will let you rent dedicated CPU cores, but they often cost 4-5x more than a normal virtual CPU. Overprovisioning is how virtual servers are available for cheap!
SMT is relevant in the VM case of this bug because it determines whether this bug is restricted to data outside the VM or not.
Providers usually won't disable SMT completely; they'd run a scheduler which only allows one VM to use both SMT threads of a core. Ultra-cheap VPS providers may still find that not worth the pennies, though: if you sell mostly single-core VPSes, then the majority of your SMT threads are still unavailable even with the scheduler approach.
Fully dedicated cores aren't necessarily required because in the timesliced case the registers are unloaded and reloaded when different VMs are shuffled on and off the core. That said, they definitely prevent the cross-vm-data-leak case of this bug.
> Fully dedicated cores aren't necessarily required because in the timesliced case the registers are unloaded and reloaded when different VMs are shuffled on and off the core. That said, they definitely prevent the cross-vm-data-leak case of this bug.
Registers are unloaded and reloaded when different processes / threads are scheduled within a running VM too. That should protect the register contents, but because of this issue, it doesn't, so I don't see why it would if it's a hypervisor switching VMs instead of an OS switching processes. If you're running a vulnerable processor on a vulnerable microcode, it seems like you can potentially read things put into the vulnerable registers by anything else running on the same physical core, regardless of context.
Context switching for processes is done in software (i.e. the OS) via traps, because the TSS does not store all the registers and doesn't offer a way to be selective about what the process actually needs to load (= slower). This limits its visibility to what's in the actively mapped registers, and doesn't guarantee the procedure even tries to reload all the registers. In this case, even if the OS does restore certain registers, it has no way to know the processor left specific bits of one speculatively set in the register file.
On the other hand, "context switching" for VMs is done via hardware commands like VMSAVE/VMLOAD or VMWRITE/VMREAD, which do save/load the entire guest register context, including the hidden context not accessible by software which this CVE is relying on. Not that it's impossible for this to be broken as well, but it's a completely different procedure, and one the hardware is actually responsible for completely clearing instead of "supposed to be reset by software".
So while the CVE still affects processes inside of VMs the loading/unloading behavior inter VM should actually behave as a working sandbox and protect against cross-VM leaks, barring the note by lieg on SMT still possibly being a problem (I don't know enough about how the hardware maintains the register table between SMT threads of different VMs to say for sure but I'm willing to guess it's still vulnerable on register remappings).
There may well be other reasons I'm completely mistaken here but they'd have to explain why the inter-VM context restore is broken not why it works for inter-process restore. The article already explains why the latter happens, but it doesn't make a claim about the former.
I can't easily find good documentation on the instructions you mentioned; but are you sure those save and load the whole register file, and not just the visible registers? There are some registers that are not typically explicitly visible, that I'd expect to also be saved or at least manipulable in a hypervisor, but just like the cache state isn't saved, I wouldn't expect the register file to be saved.
If we assume the register file isn't saved, just the visible registers, what's happening is the visible registers are restored, but the speculative dance causes one of the other values in the register file to become visible. If that's one of the restored registers, no big deal, but if it was someone else's value, there's the exploit.
If you look at the exploit example, the trick is that when the register rename happens you are re-using a register file entry, but the upper bits aren't physically cleared - there's just a flag indicating that they should be treated as zero. Then, when the mispredicted vzeroupper is rolled back, the flag is unset and the upper bits of the register file entry are revealed.
Reading more, the VM* command sets definitely load/save more than just the normally visible registers; the descriptions in the AMD ASM manual are very explicit about that. However, it looks like (outside the encrypted-guest case where everything is done in one command) the hypervisor still calls the typical XRSTOR for the float registers, which is no different than the normal OS case. If that's true then I can see how the register file is still contaminated in the non-SMT case.
Well you don't have to reserve any CPU Cores per VM. There's no law saying you can't have more VMs than logical cores. They're just processes after all and we can have thousands of them.
Of course not, but the vulnerability works by exploiting the shared register file so to mitigate this entire class of vulnerabilities, you'd need to dedicate a CPU core and as much of its associated cache as possible to a single VM.
In the context of this conversation, SMT on/off is relevant to what scope the vulnerability has with VMs, beyond the claim in the article that the issue is in some way present inside VMs.
CPU vulnerabilities found in the past few years: