I can't easily find good documentation on the instructions you mentioned, but are you sure they save and load the whole register file, and not just the visible registers? There are some registers that aren't normally explicitly visible that I'd expect to also be saved, or at least be manipulable by a hypervisor. But just as the cache state isn't saved, I wouldn't expect the register file to be saved.
If we assume the register file isn't saved, just the visible registers, what's happening is the visible registers are restored, but the speculative dance causes one of the other values in the register file to become visible. If that's one of the restored registers, no big deal, but if it was someone else's value, there's the exploit.
If you look at the exploit example, the trick is that when the register rename happens, a register file entry is re-used but its upper bits aren't actually cleared; a flag just marks them as cleared. Then, when the mispredicted vzeroupper is rolled back, the flag is unset and the upper bits of that register file entry are revealed.
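To make that concrete, here's a rough toy model in Python of the flag-instead-of-clear behaviour described above. It is only a sketch of the idea, not how the real hardware is built: `PhysReg`, `ToyRegisterFile`, the method names, and the values are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class PhysReg:
    """One hypothetical register file entry, split into two halves."""
    low: int = 0
    high: int = 0             # stale data can linger here
    zero_upper: bool = False  # flag: "treat the upper half as zero"

    def read(self):
        # The upper half is masked by the flag, not physically cleared.
        return (self.low, 0 if self.zero_upper else self.high)


class ToyRegisterFile:
    def __init__(self, size=4):
        self.entries = [PhysReg() for _ in range(size)]

    def write_full(self, idx, low, high):
        # A full-width write by whoever owned the entry previously.
        self.entries[idx] = PhysReg(low, high, False)

    def reuse_with_zeroed_upper(self, idx, low):
        # Rename re-uses the entry: only the low half is written and the
        # flag is set. The old upper bits are NOT scrubbed.
        entry = self.entries[idx]
        entry.low = low
        entry.zero_upper = True

    def rollback_vzeroupper(self, idx):
        # Rolling back the mispredicted vzeroupper just unsets the flag.
        self.entries[idx].zero_upper = False


def demo():
    rf = ToyRegisterFile()
    rf.write_full(0, 0x1111, 0xDEADBEEF)   # someone else's value
    rf.reuse_with_zeroed_upper(0, 0x2222)  # entry re-used after rename
    before = rf.entries[0].read()          # upper half reads as zero
    rf.rollback_vzeroupper(0)
    after = rf.entries[0].read()           # stale upper half is visible
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print("before rollback:", [hex(v) for v in before])
    print("after rollback: ", [hex(v) for v in after])
```

In this model the leak only matters if the stale upper half belonged to someone else; if the entry previously held one of your own restored registers, you just see your own data again.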
Reading more, the VM* instructions definitely load/save more than just the normally visible registers; the descriptions in the AMD ASM manual are very explicit about that. However, it looks like (outside the encrypted guest case, where everything is done in one instruction) the hypervisor still calls the usual XRSTOR for the float registers, which is no different from the normal OS case. If that's true, then I can see how the register file is still contaminated in the non-SMT case.