We could get a >50% performance boost by ignoring security.
Think about that. A 1.5x - 2x boost, on a single core. That's like doubling a CPU's clock speed, except more powerful, since the gain involves memory caching (and memory is usually the bottleneck, not raw horsepower).
What would the TempleOS of CPUs look like, I wonder?
We ought to have the option of running workloads on such CPUs. Yes, they're completely compromised, but if you're running a game server you really don't have to care very much. Most data is ephemeral, and the data that isn't, isn't usually an advantage to know.
Being able to run twice the workload as your competitors on a single core is a big advantage. You saw what happened when performance suddenly dropped thanks to the patch.
This brings to mind articles and discussions I read in the early 90s about differences in performance between running in real (or "unreal") mode and protected mode --- yes, the extra permission checks, paging, segmentation, etc. definitely add some overhead:
Thus they decided to do permissions checking in parallel with other operations, leading to Meltdown. It's interesting to note that the original intent of protected mode was not as a true "security feature" immune to all attacks, but more to provide some isolation to guard against accidents and to expand the available address space. The entire Win9x series of OSs embodies this principle.
> What would the TempleOS of CPUs look like, I wonder?
x86 running in unreal mode might be close. I wonder if anyone has done any benchmarking of recent x86 CPUs in that mode...
Spectre variant 2 is not fundamental to speculative execution. It can be fixed by tagging the indirect branch predictions with the full virtual address and the ASID, and flushing those predictions together with the TLB.
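A minimal sketch of that idea, with the hardware behavior written out as C purely for illustration (the structure and field names here are invented): a prediction is only reused when both the full branch address and the ASID match the entry's tag, so code in one address space cannot train indirect-branch targets for another.

    #include <stdbool.h>
    #include <stdint.h>

    struct btb_entry {
        uint64_t branch_va;   /* full virtual address of the branch, not a truncated hash */
        uint16_t asid;        /* address-space ID; entries are flushed alongside the TLB  */
        uint64_t target;      /* predicted target address                                 */
        bool     valid;
    };

    /* Return true and emit a prediction only if the same context trained this entry. */
    static bool predict_indirect(const struct btb_entry *e,
                                 uint64_t branch_va, uint16_t asid,
                                 uint64_t *target_out) {
        if (e->valid && e->branch_va == branch_va && e->asid == asid) {
            *target_out = e->target;
            return true;
        }
        return false;   /* no prediction: stall rather than speculate on attacker-trained state */
    }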
you could certainly extend the unused transactional support to do a chandy-lamport thing, with the difference that when you run out of space for isolated cache versions, you just can't speculate any more.
it would be a lot of machinery
you could also do latency hiding with smt instead of trying to fight it head-on with speculation. ultimately probably more productive, but either the compiler or the programming model or the user has to expose that concurrency.
> This brings to mind articles and discussions I read in the early 90s about differences in performance between running in real (or "unreal") mode and protected mode
I remember a roughly ten-year-old Microsoft Research project that implemented an OS using the .NET managed runtime to enforce security. IIRC, they measured interesting performance differences with the CPU's memory isolation turned off.
I like the idea that you don't need hardware barriers to isolate programs when they are lobotomized.
You're speaking of Singularity and its "software isolated processes", which amounted to static verification of IL before AOT-compiling it to x86. From the perspective of Sing#, the only way to express IPC was in the form of protocols, which were essentially formalized function-call dances between two processes.
Singularity would be just as vulnerable to the recent bugs as contemporary OSes are, possibly more so because there is even less timing uncertainty when crossing privilege domains, making the attacks even easier
Plenty of embedded CPUs don't have MMUs. Having done multi-"process" programming on them, I can say it is hard to overstate how nice basic address-space protection is from a debugging perspective.
Not when the bootloader is a UEFI application. UEFI puts the processor into 32-bit protected mode or, more typically, long mode (64-bit). So when your bootloader that is implemented as a UEFI application is started, you are already out of real mode.
> We could get a >50% performance boost by ignoring security.
What makes you believe that? The (crude, emergency, not designed-in) workaround dropping performance by 50% doesn't mean that fixing the issue in the design requires reducing performance by 50%.
Performance Details

Overhead

Naturally, protecting an indirect branch means that no prediction can occur. This is intentional, as we are "isolating" the prediction above to prevent its abuse. Microbenchmarking on Intel x86 architectures shows that our converted sequences are within cycles of a native indirect branch (with branch prediction hardware explicitly disabled).

For optimizing the performance of high-performance binaries, a common existing technique is providing manual direct branch hints. I.e., comparing an indirect target with its known likely target and instead using a direct branch when a match is found.

Example of an indirect jump with manual prediction:

    cmp %r11, known_indirect_target
    jne retpoline_r11_trampoline
    jmp known_indirect_target

One example of an existing implementation for this type of transform is profile-guided optimization, which uses run-time information to emit equivalent direct branch hints.

-- https://support.google.com/faqs/answer/7625886
Note the parenthetical in the first paragraph. The way I read the above, they're merely saying that the performance is the same as disabling branch prediction for the section of code. Which would be really slow. But, presumably, faster overall than _actually_ disabling branch prediction as there's probably no good way to explicitly do that in as localized a manner.[1]
And as I interpret this, the reference to FDO is unfortunately confusing and perhaps better placed under the heading of correctness. They're merely saying that the transformation of indirect branches to direct branches is something compilers already do to steer speculative execution down the correct path. Except in the case of a retpoline it's steering speculative execution into a trap.
[1] I'm no expert in this area (nor do I do much assembly programming) but AFAIU branch predictors these days have substantial caches of their own. Disabling branch prediction or flushing the predictor's caches is probably much more costly than simply steering the predictor into a trap. The single pipeline bubble you create is better than the many bubbles that would be created if the branch predictor had to warm up again for later code.
So they lost 5% of performance, then applied FDO, a generic optimization technique, and gained back about that 5%. In other words, they could be ~5% ahead without the bug (at least 2%, if you consider there are some helpful interactions between FDO and retpoline mitigations).
Not to my understanding. They were already applying FDO to all of their performance sensitive programs. They found that after FDO, the retpolines and the other work done to mitigate meltdown and spectre had negligible impact. This is because FDO replaces indirect branches with direct branch hints, so only in uncommon cases do you even have to execute the retpoline. That's my understanding, at least.
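For what it's worth, a rough C sketch of the "direct branch hint" shape being described (the handler names here are hypothetical): profile data identifies the hot target, the compiler guards the indirect call with a compare, and only the cold path ever reaches the retpoline thunk.

    /* handler_fast is assumed to be the profile-identified common target. */
    typedef void (*handler_t)(int);
    extern void handler_fast(int);

    void dispatch(handler_t fp, int msg) {
        if (fp == handler_fast) {
            handler_fast(msg);   /* direct call: statically predictable, no retpoline needed */
        } else {
            fp(msg);             /* rare indirect call: compiled through the retpoline thunk */
        }
    }

The exact flags and mechanisms vary by compiler, but the effect is the same: with the guard in place, the retpoline's cost is only paid on the cold path, which is why the measured overhead after FDO is small.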
Why limit yourself? Just run everything in ring 0. No kernel calls overhead at all. Lightning-fast network stack. Use a unikernel, run your trusted code only, don't store secrets.
Beyond game servers, this could be the mode of operation of e.g. compute cluster nodes, relying on external firewall for security, and running zero untrusted code. Of course they won't care about meltdown or spectre either.
I’ve heard some rumblings that this is what some HFT shops were doing years ago to minimize latency and to reduce context switches. Because they already run bare metal and don’t share anything besides a physical facility with their competitors (if even that) the attack vectors are not the same as those running on shared machines either.
You can already get a 1,000x speedup in JavaScript if you e.g. avoid the CORS slow path on images where you read all pixels one by one. What should take a few ns easily takes 30 s across such an image. Security and performance are adversarial; we can have almost-perfectly secure systems that would make an 8th-gen i7 slower than a Z80, or superfast CPUs where anybody can do anything.
It's to prevent reading cross-origin images through JavaScript.
A canvas is considered tainted if it contains data from a cross origin image that was loaded without CORS approval. Tainted canvases cannot be read from by JavaScript.
I don't think this requires a 1000x slowdown but I'm not super familiar with this area of the spec. It should be possible to cache the tainted state of the canvas in a single bit, only updating it when adding new images to the canvas, and checking the bit when converting the canvas to an array.
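A sketch of that single-bit bookkeeping (hypothetical browser-internal structures written as C just to illustrate the idea; real engines obviously differ): the bit flips once when a non-CORS-approved cross-origin image is drawn, and readback only consults the bit rather than re-authorizing anything per pixel.

    #include <stdbool.h>

    struct canvas {
        bool origin_clean;    /* true until a tainting draw occurs */
        /* ... pixel buffer, size, etc. ... */
    };

    /* Called when an image is drawn onto the canvas. */
    void canvas_draw_image(struct canvas *c, bool image_is_cors_approved) {
        if (!image_is_cors_approved)
            c->origin_clean = false;     /* taint once, at draw time */
    }

    /* Called for getImageData()-style readback; -1 models a SecurityError. */
    int canvas_read_pixels(const struct canvas *c) {
        if (!c->origin_clean)
            return -1;                   /* tainted canvas: refuse to read */
        /* copy pixels out with no per-pixel origin checks */
        return 0;
    }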
For some reason every single pixel access must be authorized through CORS. I am not sure if this is in the standard or just shoddy implementation in all browsers. Try it yourself: read pixels in a non-CORSed image, then in a CORSed one, and wonder what went wrong on the drawing board. Some people bypass it via WebGL.
There's a trade-off, but I seriously doubt it's anywhere near that. For one, if you could get super fast by ignoring safety, that would surely have shown up in consoles. Also, a good old 486 or early Pentium has none of these issues, yet runs at a multiple of Z80 speed even without the decades of progress we had in the interim.
I meant you could execute every instruction with a cryptographic check, i.e. an AES check at every single step of whether this instruction may be executed. That would bring speeds below a Z80's, and it would be overwhelmingly secure if the protocol is correct. I hope nobody tries to add it to normal CPUs in the future, but security guys are sometimes too idealistic.
Doesn't it also mean Intel kind of cheated their way to the performance crown by ignoring security, while AMD does not seem to have these issues? It certainly brings Ryzen closer again, though future improvements to the patch might make the gap negligible in the end.
Meltdown looks pretty much like cheating. Intel simply took security verification out of the critical path, and their public image ought to suffer for that.
Spectre is a much more complex issue. It is more of an "I would never think of that" thing. It looks like there were people in the security community talking about it even 10 years ago, but it is certainly hard for that knowledge to spread to chip designers - most of what security people study would be noise for them, and even the security people mostly ignored those discussions.
AMD has the same issues. The difference in Meltdown, according to the paper, comes down to idiosyncratic peculiarities of timing differences, not anything fundamentally more sound in other processors.
WRONG. You fell for the Intel PR strategy. AMD is completely unaffected by Meltdown, since AMD checks permissions before fetching memory into the cache.
The toy example referenced in section 6.4 is actually closer to Spectre. It's missing the crucial distinction that the speculatively loaded address should belong to a page inaccessible from the current ring level.
To see why that toy example is insufficient, consider that you could simply execute the load directly without putting an exception in front of it and you would be able to read the value.
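For context, a hedged sketch of the gadget shape under discussion, with the crucial property that the loaded address is not accessible from the current ring (names like kernel_addr and probe are illustrative, and this simplified form of the published Meltdown technique is not expected to work on patched or unaffected systems):

    #include <stdint.h>

    /* One probe slot per possible byte value, each on its own 4 KiB page. */
    static uint8_t probe[256 * 4096];

    void transient_leak(const volatile uint8_t *kernel_addr) {
        /* This load faults because the page is not user-accessible; on affected
           CPUs the out-of-order core has already forwarded the loaded byte... */
        uint8_t secret = *kernel_addr;
        /* ...so this dependent access transiently touches exactly one cache line,
           which a later timing pass over probe[] can detect. */
        volatile uint8_t sink = probe[secret * 4096];
        (void)sink;
    }

If kernel_addr were replaced with an ordinary user-accessible address, as in the toy example, the exception and the transient window add nothing: you could just read the value architecturally, which is exactly the point made above.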
Some Amiga people still believe that "virtual memory" is a passing fad, and when the Amiga comes again in glory we'll all realize how futile and self-defeating it was to prevent people from accessing the raw hardware and OS.
> One service pr. VM means no need for virtual address spaces, and no overhead due to address translation. Everything happens in a single, ring 0 address space.
> if you're running a game server you really don't have to care very much
Think also about scientific computing, which usually doesn't need to be done in a secure multi-user environment (strictly speaking). I.e. you could limit security to the entrance gate (log in).
It's more than that even - given that a high portion of the code being run now is using virtual machines, a lot of that protection is redundant. If all code is run inside VMs and zero 'native' code is allowed, then you could run without needing protection rings, system call overhead, memory mapping, etc - which in theory could more than make up for the virtual machine overhead.
This was being explored with Microsoft's Singularity and Midori projects but seems to be a dormant idea.
I don’t think it works like that. You still need protection between the rings inside the VM, and the CPU is providing that protection. The VM is not only a user process for the host; it is executing its kernel code in the virtualized ring 0 on the CPU (where it's still the CPU that provides protection).
As I understand it, with this approach you wouldn't have a separation between kernel code and the virtual machine - everything runs in a single address space and you rely on the virtual machine verifying the code as it JITs it.
"What would the TempleOS of CPUs look like, I wonder?"
In the old days, probably the Burroughs B5000, for being high-level with security and programming-in-the-large support built in before those were a thing. If about hackability, the LISP machines like the Genera OS that you still can't quite imitate with modern stuff. If a correct CPU and you want it today, then I've been recommending people build on VAMP, which was formally verified for correctness in the Verisoft project. Just make it multicore, 64-bit, and with microcode. Once that's done, use the AAMP7G techniques for verifying firmware and privileged assembly on top of it. Additionally, microcode will make it easier to turn it into a higher-level CPU for cathedral-style CPU+OS combos such as Oberon, a LISP/Scheme machine, or Smalltalk machines. Heck, maybe even native support for the Haskell runtime for House OS.
I assume those safe languages would still fail to defend against Spectre. The problem is that the "code path with data" executed in a speculative branch is not considered by the compiler but "fantasized" by the CPU.
Remember that Spectre can be exploited through a JavaScript JIT. The JS sandbox is quite comparable to the models used for safe languages.
There are some plausible ways it could still be exploited, but I would say those deserve mitigations higher in the stack.
As a contrived example, a game server that accepts maps over the network, where maps can contain a scripting language like Lua/JavaScript, could theoretically leak data.
By mitigations higher in the stack, I mean things like Chrome's reduction of JavaScript timer accuracy as a fix for Meltdown/Spectre.
You can get a performance of this magnitude because the baseline secure system is a hurried patch and wasn't designed for it. We're far from the optimal performance-security trade-off.
Nice thoughts, but the vulnerabilities of Spectre and Meltdown do not directly threaten code which runs on the vulnerable CPU. Instead, the vulnerabilities allow malicious code running on the vulnerable CPUs to attack the system and other apps. A subtle difference, but it reduces the job to identifying the applications which are not trustworthy and avoiding running them on the performant but potentially vulnerable cores.
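If one did take that route, the scheduling side is mundane; a minimal Linux sketch (assuming a hypothetical split where cores 0-3 are kept fast/unmitigated for trusted work and everything untrusted is pinned to cores 4-7):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/types.h>

    /* Pin the given process onto cores 4-7, keeping it off the cores
       reserved for trusted, unmitigated workloads. */
    int confine_untrusted(pid_t pid) {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        for (int cpu = 4; cpu <= 7; cpu++)
            CPU_SET(cpu, &mask);
        if (sched_setaffinity(pid, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }

Of course, affinity alone does nothing about shared caches or SMT siblings, so this is only the bookkeeping half of such a scheme.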
If every application were signed (like on the apple store), does that help reduce the risk here? If it were a game server then all apps run would presumably be provided by some official third party who could be relied upon to sign the binary.
If it were all signed, then is the risk of the box becoming compromised fairly small and worst case, replacing the entire box (rebuild) if there were an issue could address problems quickly?
It is my understanding that any sort of cloud setup would be completely out. Also, anything with a web-browser or other platform that runs third party code. But that still leaves an awful lot of servers that are under the control of a single company or individual and don't run any untrusted code (sandboxed or otherwise), doesn't it?
Even dedicated supercomputer processors (e.g. A2, SW26010, VIIIfx) use virtual memory. In fact, most GPUs nowadays include an MMU as well, something they historically never had.
(SC applications tend to be huge fans of using enormous page sizes, like 1 GB, which mitigates page-table-walk costs)
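A minimal sketch of how that looks in practice on Linux (assuming 1 GB hugepages have been reserved, e.g. via hugepagesz=1G hugepages=N on the kernel command line):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #ifndef MAP_HUGE_1GB
    #define MAP_HUGE_1GB (30U << 26)   /* 26 == MAP_HUGE_SHIFT */
    #endif

    int main(void) {
        size_t len = 1UL << 30;        /* one 1 GB page */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                       -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        /* ... large dense arrays live here, with far fewer TLB misses ... */
        munmap(p, len);
        return 0;
    }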
Yeah, you split your business into those whose customers need >1 physical machine, and those that need <1 physical machine. Drop security for the former, increase it for the latter. Both can phrase their application as a unikernel.
The performance boost is not about ignoring security, but about ignoring an edge case of one particular security check on i386 which ended up being the only critical one for "modern" Unix-derived OSes (which for the purposes of this discussion include Windows NT).
In the original x86 protected-mode model (i.e. real security flags at the descriptor level, with page protection bits used only for scheduling of paging) this would be a non-issue.
No, it just means that it's better to design chips and software architectures with security in mind from day one. If you don't, then every single patch you add afterwards will bloat up the whole system and the (software) performance will drop over time.
It's just another way to say you're building technical debt into your chip/software if you don't consider security from day one. Because sooner or later you will need that security, if you have tens of millions or more people using that product. So it's not like you can just ignore it. You're just postponing it for later.
We just had tech sales folks from a major hardware vendor approach us with a pitch to strongly consider AMD EPYC-based servers for our next compute purchase. They highlighted:
* Value for money
* Memory performance on inter-core access
* More memory channels per core
* Still within the x86 ISA, negating the need for any software rewrites.
Our workloads are memory access bound. So the above points hit home.
We're going to try AMD servers for the first time at this research group. If they do hold that promise, Intel finally has some active competition in our realm!
If you weren't considering Intel alternatives before this, I'd argue that's a real failure of imagination and risk management. I'm sure some of the really small cloud providers weren't, but all the big players keep tabs on the path to and pain of migration, at a minimum. Just because they weren't actually using PowerPC/ARM/AMD does not mean they did not know how.
Are PowerPC and ARM real alternatives? Most server software is developed for x86 only; if you only offer ARM and PowerPC machines, who will your clients be?
A lot of server software is actually open source and runs on most commonly available hardware architectures, including ARM and PPC. You are right that everybody develops assuming x86, though, and there is some friction in that transition regardless of software portability.
In the sense that those processors can compute things which are computable, yes, they are a real alternative. The correct question is, "How much worse would the situation have to be with Intel before the cost of these alternatives was less than the benefit of switching to one of them." The answer, I guess, is: "much worse than things are now." However, if you're running at billions in revenue per quarter, you can afford to spend a few million to limit your downside risk by keeping yourself apprised of what the costs would be of moving to another platform.
More important is whether PPC or ARM is competitive in computation per watt and computation per dollar. POWER8 and POWER9 are pretty good at number-crunching but also pretty expensive, and I don't know about their power efficiency. ARMs used to be pretty power-efficient, but likely cannot offer very high single-threaded performance (they should still be great for IO-heavy applications, which are hit hardest on Intel).
I've been wondering if this will be the thing that brings a wave of new users to open source RISC-V chips, starting a virtuous cycle of new chip development and then more users for those new chips.
The Qualcomm CPU only has the 128-bit wide NEON SIMD, while Broadwell has 256-bit wide AVX2, and Skylake has 512-bit wide AVX-512. This explains the huge lead Skylake has over both in single core performance. In the all-cores benchmark the Skylake lead lessens, because it has to lower the clock speed when executing AVX-512 workloads. When executing AVX-512 on all cores, the base frequency goes down to just 1.4GHz---keep that in mind if you are mixing AVX-512 and other code.
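To make the width difference concrete, a small hedged example with the standard intrinsics (compile with something like -mavx2 -mavx512f; the pointers are assumed to reference enough valid, in-bounds floats):

    #include <immintrin.h>

    /* AVX2: 8 single-precision floats per instruction. */
    void add8(const float *a, const float *b, float *c) {
        __m256 va = _mm256_loadu_ps(a);
        __m256 vb = _mm256_loadu_ps(b);
        _mm256_storeu_ps(c, _mm256_add_ps(va, vb));
    }

    /* AVX-512: 16 single-precision floats per instruction, but sustained use
       of 512-bit ops can lower the core's clock frequency. */
    void add16(const float *a, const float *b, float *c) {
        __m512 va = _mm512_loadu_ps(a);
        __m512 vb = _mm512_loadu_ps(b);
        _mm512_storeu_ps(c, _mm512_add_ps(va, vb));
    }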
Though not an unquestionable or ultimately reliable one, using comparatively rare hardware and software is usually a valid security measure. Now that Linux has gained solid market share, the "no viruses for Linux" era is coming to an end and Linux-capable ransomware is emerging, so I am considering moving to OpenBSD, not only because security is among the top priorities of its design but because it's way more exotic too.
I don't think throwing Intel out entirely is the solution, but rather diversification. What if another vulnerability later comes out granting arbitrary code execution given a specific byte string in memory, one that only Intel processors are immune to? Better to have as many different CPUs as possible, just like, e.g., Backblaze has done with HDD manufacturers.
I mean, we know they are all saying "ARM" and "AMD" both as a negotiating tactic and as a way of strategically diversifying microarchitectures. That said, I'm not sure AMD can actually deliver more instructions per second per dollar.
I wonder if it would make sense for some workloads to prepare for Itanium usage, or even older Atom architectures? Does anyone deploy Itanium in the cloud?
It's worth noting that only pre-2013 Itanium CPUs are not vulnerable to meltdown (The same as with Atom CPUs). Intel has also said that the Itanium chips released in 2017 would be the last Itanium chips they will develop, so I don't think there's any reason to bother switching to Itanium. I would wager it's guaranteed you'll be better-off with the latest AMD instead of a 2013 Itanium, and AMD supports x86-64 so it doesn't require making your full software stack support Itanium (Which it likely doesn't).
I am sure you didn’t mean it this way, but it struck me as funny saying AMD supports x86-64. Of course AMD supports x86-64, they invented it. That is why Microsoft and Linux refer to that architecture as AMD64.
Aren't ARM processors vulnerable to some of the same security flaws? What are the alternatives? AMD got lucky in a sense, but the reality is you can't bet on any one horse.
AMD is no more lucky than ARM, though Intel was ‘unlucky’: Spectre affects Intel, AMD, and ARM alike, while Meltdown affects Intel CPUs disproportionately (AMD and ARM CPUs have been mostly unaffected by it).
That's not entirely accurate. There are a lot of ARM CPUs that are not vulnerable to Spectre, and that includes a lot of them that are in use in actual devices. For example, my phone happens to use a Cortex-A53, which is not vulnerable. It is however easy to miss this detail from their 'security update' [0] because the table doesn't list CPUs that aren't affected.
Also, I don't believe the single ARM CPU that's vulnerable to meltdown (Cortex-A75) has actually been included in any devices at this point, so for now it is safe to assume any ARM-based device you have is not vulnerable to meltdown.
Probably because it's a much simpler core, like all the other chips that are immune. It simply doesn't do the logic that would get the chip in trouble.
There won't be a hardware fix; you can't patch CPUs (aside from microcode, which is just more software). Since this is a pretty fundamental issue with how certain things work, yes, you'll have to wait for a new generation, and I couldn't expect anything in less than a year at the absolute earliest. It's a multi-year process from design to tape-out and fabrication. Honestly, I'm not sure how far back in the design process fixing this will reach: is it something that can be added in as permanent microcode after fabrication/during packaging? Can current designs be "patched" before moving on to tape-out? Do we need to go back to square one and rethink a couple of key premises? I don't know. But it's not quick.
I was coming at it more from a "having alternatives is good" comment instead of saying it's not vulnerable to the new problems. The specs look interesting enough to benchmark on, and perhaps if it actually does perform better, then even with patches it will stand out.
In response to recently reported security vulnerabilities, this firmware update is being released to address Common Vulnerabilities and Exposures issue numbers CVE-2017-5715, CVE-2017-5753 and CVE-2017-5754. Operating System updates are required in conjunction with this FW level for CVE-2017-5753 and CVE-2017-5754
They have vastly more PCIe lanes, which makes them exceptionally suitable for high-density GPU nodes (or, similarly, high-density NVMe storage nodes).
AMD also offers superior value for the money on a per-CPU basis, even after discounts (which both Intel and AMD offer generously). This makes them exceptionally suitable for high-memory nodes.
My funniest thought of this whole affair is that whatever performance upsizing happens up front will probably come with manufacturer discounts from Intel, after Amazon and Google complain, and the cloud providers may end up with a windfall in the gap.