The rowhammer "attack" is successful only because the hardware is just plain broken, and I consider it in the same category as things like a CPU which will calculate 1+1=3 if the computation of 1+1 is done enough times --- nothing software should even try to fix, because the problem is at a lower level. The solution is to demand that the hardware manufacturers make memory which actually works like memory should; and it should be possible, since apparently previous generations of RAM don't have this problem at all. In the early 90s Intel recalled and replaced, free of charge, CPUs which didn't divide correctly. Perhaps the memory manufacturers today should do the same for rowhammer-affected modules and chips.
Memory errors are particularly disturbing because they are often highly dependent on data and access patterns, and can be extremely difficult to pinpoint without special testing tools. I've personally experienced a situation where a system which otherwise appeared to work perfectly well would always corrupt one specific bit of a file when extracting one particular archive.
As a testing tool, MemTest86+ has always worked well for me, and the newer versions can detect rowhammer, although there is this interesting discussion about whether it is actually a problem (to which I say a resounding YES!!!) or if there's some sort of cover-up by the memory industry:
http://www.passmark.com/forum/memtest86/5903-rowhammer-probl...
http://www.passmark.com/forum/memtest86/5475-memtest86-v6-2-...
Run it on your hardware and if it fails, I think you should definitely complain and get it fixed.
> The rowhammer "attack" is successful only because the hardware is just plain broken
I too am of this opinion and am surprised this view isn't widely shared. With DDR4, we should be asking for a refund and/or starting a class-action suit, yet we're putting up with software 'mitigations' instead.
This isn't like the 2008 Phenom TLB bug [1] where the CPU was locking up so AMD released a workaround that kept it from freezing at the expense of a 14% performance penalty. This is like the floating point division bug [2] where the device no longer meets basic operational and accuracy guarantees. RAM cells bleeding into each other ought to be considered a fatal flaw, not some intellectual curiosity.
> I too am of this opinion and am surprised this view isn't widely shared. With DDR4, we should be asking for a refund and/or starting a class-action suit, yet we're putting up with software 'mitigations' instead.
I extensively test all the hardware I buy (CPU: LINPACK, RAM: MemTest86+) and if it fails any of those tests, it gets returned as "not fit for purpose". I've done this successfully a few times. A lot of other enthusiasts/power users do the same too, especially if they're overclocking, and searches on other forums show plenty of users testing and finding (mostly other, not rowhammer) errors in newly-bought RAM even when not overclocking. But as noted in the threads I linked to, manufacturers may be trying to cover this up and downplay its severity. Even in the original paper on rowhammer, the authors didn't disclose which manufacturers and which modules were affected, although I think this should really be treated like the FDIV bug: name and shame. I blame political correctness...
The Intel LINPACK distribution contains, besides the library, a sample benchmarking application using it, and that happens to be a very intense and "real" workload (solving systems of equations, i.e. scientific computation.) There are plenty of posts on various PC enthusiast forums about how to run it correctly. (And plenty arguing that it's irrelevant, mostly because their insane overclock seems fine but instantly fails this test. There's a good reason most people doing "real" scientific computing don't overclock; a lot of CPUs just barely pass this absolutely realistic test at stock speeds and voltages.)
FDIV was really not, technically, a serious erratum in the grand scheme of errata. The Phenom TLB bug was worse. Intel basically denied and sat on the issue for half a year, stopped just short of slandering Dr. Nicely, etc.; they made it into a complete PR disaster. If they had come out the week after it was reported and just said, here's a workaround, here's an opt-in replacement program (which they finally did, but by then it was too late), you would probably never have heard about the FDIV bug -- like the countless other errata we have software workarounds for.
In retrospect I regret bringing up the Phenom because my argument could've stood without it, and I could realistically argue either way.
But my original intention was to point out that the failure mode of the Phenom wasn't exploitable for anything other than, potentially, denial of service; it was just inconvenient, and a firmware workaround let the CPU run fine with the affected subsystem disabled.
Though you don't expect your CPU to halt and lock up, I believe it's far more insidious when you feed a device inputs and get the wrong output without any obvious indication that something went wrong, like in the case of rowhammer-vulnerable memory and FDIV.
I think that is the reason for the misunderstanding. FDIV was not really insidious in the way you describe. It was 100% predictable: certain bit patterns always gave the wrong answer in the quotient on the affected hardware, and it had a very straightforward software fix (with a performance cost, sure). You could demonstrate it immediately, but it really wasn't severe.
(Q9 and Q10 http://www.trnicely.net/pentbug/pentbug.html)
Rowhammer is a much more complex erratum, and one I don't feel qualified to comment on, especially regarding the safety of the published mitigations, but it is in a class of bugs whose outcomes are not generally predictable because more variables are involved.
My reason for replying initially, though, is that I don't think the line between which types of hardware defects are open to software workarounds and which are not is so cut and dried, and I don't think many people outside of kernel/OS development realize how many errata exist in the chips they use every day, with workarounds they never notice.
I don't agree; there is software that is designed to run on faulty hardware, often in high-radiation environments (see: outer space). I agree this is not an area where much hardening has been done in conventional security models, but in other environments it is common to use CRC error detection, parity information, or other means to ensure that even if data is partially corrupted, the original can be restored.
I see no reason to prevent someone from implementing this sort of error correction for GPG and other important cryptography.
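For illustration, a minimal detection-only sketch of that idea in Python (the helper names are hypothetical, and a bare CRC can only detect corruption, not restore it; actual restoration would need a real error-correcting code such as Reed-Solomon or the parity schemes mentioned above):

```python
import zlib

def protect(data: bytes) -> bytes:
    # Append a CRC32 so later corruption of the in-memory copy can be noticed.
    return data + zlib.crc32(data).to_bytes(4, "big")

def verified(blob: bytes) -> bytes:
    # Return the payload only if the CRC still matches; otherwise refuse to use it.
    data, crc = blob[:-4], int.from_bytes(blob[-4:], "big")
    if zlib.crc32(data) != crc:
        raise ValueError("in-memory corruption detected")
    return data

# Hypothetical usage: re-verify immediately before every security-sensitive use of the key.
key_blob = protect(b"-----BEGIN PGP PUBLIC KEY BLOCK----- ...")
public_key = verified(key_blob)
```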
Hostile environments attack your software without intelligence. (When working with them, it may seem otherwise, but that's just cynicism.) Hostile people attack intelligently. Whatever mitigation you may imagine is possible by checking CRCs or something after the fact, you must account for the possibility that the software, the OS, or the CRC has also been attacked by a hostile intelligent adversary. The fact that we can make reliable software in the face of unintelligent attacks is not evidence that we can make secure software in the face of intelligent ones.
Rowhammer is too powerful a technique to expect secure software to run on machines affected by it. This is an attack based on using rowhammer to change bits in other VM's memory. The only sane response to that, from the perspective of writing secure software, is despair. You can't deal with attackers in possession of that primitive.
Rowhammer is largely random. You don't get to target specific bits of physical RAM. You find scarce weak bits and work to get the data located there. In this case that means you can only pick a couple of bits per 4KB to attack. That won't let you fake out a CRC.
That's where I'm getting a little hazy. The paper says the attacker can "induce bit flips over arbitrary physical memory in a fully controlled way." Sounds a little more advanced than "largely random" to me, and based on the article it sounds like FFS is a step up from "vanilla" Rowhammer...am I missing something?
ECC RAM makes it harder, but three bit flips will still survive. It depends on whether the system actually acts properly when it sees a huge number of ECC errors happening.
They can pick a bit or two per page to attack, but then they're stuck with those bits.
In theory they could attack a new bit every few minutes, but that requires a system that allows the victim page to be remapped multiple times. KSM does not; any other memory-merging system could work the same way to mitigate things.
Even if they could keep remapping, it's a very slow attack that way. Reloading the checksum every ten minutes would keep you safe.
This does not make sense. If an attacker can alter your data, he can alter your CRC codes as well. Or just redirect the pointer to the checkCRC function to something that does "return true".
Without being an expert in this area, my gut feeling is that the fix to this problem is likely going to be funded by the end user. Given that competition continues to drive prices down, would 'secure RAM' be viable? Would you pay more for it?
> Given that competition continues to drive prices down, would 'secure RAM' be viable? Would you pay more for it?
It's funny you mention this, since the problem only affects newer DDR3 and DDR4 modules, and older RAM (EDO modules are apparently still in limited production and being sold) does tend to be significantly more expensive. Unfortunately the rest of the hardware needs to be compatible.
This also means all the older hardware that gets scrapped in massive quantities daily is likely to contain RAM immune to this problem, which is somewhat ironic... maybe it's just a (sad) continuation of the "newer is more volatile" trend that can be traced back to thousand-year-old stone tablets which remain readable today.
Not stone, but clay tablets are probably more volatile than your USB drive. There's a huge sampling bias here.
On the main point, why isn't ECC fixing this for everybody? I'd gladly take cheaper, more volatile RAM and spend some of it on redundancy, so that it ends up working better than the more expensive, less volatile kind.
I have the same opinion as you: I would pay more, but most people wouldn't. The root cause of the problem is that the trend in RAM is "the bigger the better" (in terms of GB), so we cram tons of capacitors onto a small surface. I'm no expert either, but I don't think there's a simple hardware fix for this short of going back to lower-capacity RAM, and most people won't accept that. Maybe we're hitting the limits of the current technology and we should switch to another one. On a side note, two years ago one of my professors mentioned ongoing research at my university into RAM that forms crystals instead of storing electrons, but I don't know any other details about it.
There are CPUs that do memory integrity checking to contain attacks. They're designed mainly for stopping software and peripheral attacks, but they treat RAM as untrusted. They could probably be modified to deal with the new attacks.
Encrypted RAM is offered by the newest Intel server-grade CPUs (SGX, Skylake) and the next AMD server-grade CPUs (SME, Zen).
One of the main use-cases for these technologies is trusted computing in a cloud environment: the customer can be assured that the hardware is securing the program state from the eyes of the computer's owner!
However, the cloud is actually made from cheap commodity boxes without server-grade anything! ;)
Encrypting RAM pages would prevent the hypervisor from deduping pages between virtual machines, and this would be very negative for cloud providers who want to up the occupancy on each box as much as possible...
In a few years, or perhaps longer, proper DDR4 and other rowhammer-immune memory may become mainstream in clouds. But until then, it seems we'll have clouds fitted out with increasingly aging cheap machines with no rowhammer immunity.
> However, the cloud is actually made from cheap commodity boxes without server-grade anything! ;)
You know, I refuse to buy anything that does not support ECC for my home desktops (and I don't even pay much for it). Only my laptop got a pass from this because there was literally no ECC option available for it.
Good to know cloud providers are not as careful... But honestly, shouldn't be a surprise.
Same here. It helps to sell it if you don't say ECC = RAM + extra cash. That's the normal method. I instead say you have two options:
1. RAM that works at this price.
2. RAM that allows more crashes or corruption of your files, for a slightly lower price.
The Right Thing suddenly looks more obvious, except to cheapskates. Now I just need one with ChipKill built in. That's the next level of ECC. I haven't heard whether Intel or AMD has something similar.
Encrypted RAM as AMD is implementing it (SME) protects nicely from "cold-boot attacks" but is otherwise largely a feel-good feature. It also probably doesn't help a whole lot against rowhammer-style attacks because it's merely encrypted, not authenticated. The result is that a bit flip will effectively randomize 64 bytes or whatever the block size is but will not be otherwise detected by the hardware. I bet that clever attackers will find a nice way to take over by randomizing 64 bytes.
Intel's encrypted RAM is authenticated quite nicely, but it's not (yet?) designed for general purpose use -- it's for SGX only right now. Using it for everything would (if I understand correctly) add considerable space overhead and possibly considerable latency.
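To make the encrypted-but-not-authenticated distinction concrete, here is a rough Python sketch (AES-ECB is used purely as a stand-in; SME's and SGX's actual constructions are different): a single flipped ciphertext bit silently scrambles a block under plain encryption, while an authenticated mode detects and rejects it.

```python
# pip install cryptography -- illustration only; SME/SGX use different constructions
import os
from cryptography.exceptions import InvalidTag
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = os.urandom(16)
plaintext = b"A" * 32                                   # two 16-byte blocks

# Encryption without authentication (ECB as a stand-in for a block-tweak mode):
ct = bytearray(Cipher(algorithms.AES(key), modes.ECB()).encryptor().update(plaintext))
ct[3] ^= 0x01                                           # simulate one Rowhammer bit flip in "RAM"
pt = Cipher(algorithms.AES(key), modes.ECB()).decryptor().update(bytes(ct))
print(pt)                                               # first block is now random garbage, no error raised

# Authenticated encryption: the same single-bit flip is detected and rejected.
aead, nonce = AESGCM(os.urandom(16)), os.urandom(12)
ct2 = bytearray(aead.encrypt(nonce, plaintext, None))
ct2[3] ^= 0x01
try:
    aead.decrypt(nonce, bytes(ct2), None)
except InvalidTag:
    print("bit flip caught by the authentication tag")
```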
Don't think I've seen any non-server-grade processors in even the cheapest bargain-basement VPS hosts. (Low-end dedicated is different.) Cramming as many VMs into a big server as possible seems to be too important to their cost structure for that.
We perhaps only disagree on what is "server-grade" vs what is sold for servers.
Google, for example, is famous for making big data centres out of cheap commodity boxes, and I doubt Amazon is any different. I certainly know the Rackspace blades I've played with didn't make my grade of "server-grade" either! :)
I can't make any claims to the contrary about other providers, but I know at the very least that at one point in the not-too-distant past the primary systems used for Rackspace Cloud hypervisors were Dell R720 rackmount servers. Maybe not the most amazing hardware, but considering how common they are you can hardly deny they're "server-grade". The newer OpenCompute stuff is also clearly well-made hardware.
Everything I've read implies that cheap commodity servers like Open Compute are just as reliable as name brand Intel servers (not surprising considering that they're made from the same parts), and ~95% of the market appears to be satisfied with that level of reliability.
I figured it would end up in security-oriented, bare-metal hosting first. Or racks people rent out for their own boxes. I didn't know something like that was on new Intel/AMD CPUs. Thanks for the tip.
I know what the root problem is. I also know it comes from an oligopoly of companies that only care about money, probably have patents on key features, and operate in a price-sensitive market. Fixing the root cause might be tricky unless you could lock in, via contracts, volume deals from cloud and other big buyers.
Meanwhile, small teams in academia are building CPUs that knock out those and other issues. Worth bringing up given that the fix you want isn't doable for most hardware designers. RAM vendors might eventually use it as a differentiator, but that's not guaranteed.
You can't entirely blame the providers for only caring about money; the consumers that choose the budget hosting options for critical applications must surely share some of it.
Server grade hardware is certainly available to cloud/VPS providers, but it turns out people are unwilling to pay $2 for a VM if there's one going elsewhere for $1.50.
"the consumers that choose the budget hosting options for critical applications must surely share some of it."
The customers expect the RAM they bought to work correctly. They might even have read papers on ASIC verification where the hardware companies brag about all the techniques they use to prevent recalls like the one Intel had. The issue is that the companies stopped doing, or reduced, verification on specific components to cut costs. What they bring in on the chips is far more than that verification would take. So the reason must be greed driving the profits up a little bit.
This one is the companies' fault. I'd have assigned blame differently if we were talking security of regular, consumer products or even operating systems. Verification of repeating pieces of hardware circuits is an industry-standard practice, though. Except for RAM providers apparently.
This blame placed on HW stems from a lack of understanding of RAM physics/electronics. As dimensions scale down, these things happen.
The market has chosen to adopt the cost benefits of smaller transistors and higher capacity for the same $. It's a mix of physics and market forces, not malfunctioning hardware.
It seems the HN title and original title are both pretty wrong, at least according to the article content. The attack vector is really the ability to, if you have a known public key and a server using it, perform a pre-calculated bit flip such that the new public key is much easier to factor, and thus obtain a corresponding private key.
So you're not obtaining original private keys, you're altering original public keys so that you can more quickly factor a private key that will be accepted.
If this is an SSH public key, then you can obtain SSH access. If it's a PGP key trusted by the package manager, then you can craft signatures on packages that would be accepted as valid, assuming you can also get the target machine to download said package.
I think SSH is probably the most interesting attack vector assuming you can get network access to the host once you've jumped through the myriad hoops to perform this attack.
It's a serious issue that should be addressed (probably via forced from-disk reads or at minimum integrity checks), but I think the authors are perhaps a little too eager on the practical implications of corrupting in-memory public keys.
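As a toy illustration of the corrupt-then-factor idea (not the paper's actual procedure, and using a deliberately tiny modulus so SymPy can factor it instantly; assumes sympy is installed):

```python
# Toy illustration only: real keys are 2048+ bits, where the point is that a
# bit-flipped modulus n' = n XOR (1 << i) is very unlikely to still be a product
# of two large primes, so it is usually far easier to factor than n itself.
from sympy import factorint, randprime

p = randprime(2**31, 2**32)
q = randprime(2**31, 2**32)
n = p * q                                   # the genuine (tiny) RSA modulus

n_flipped = n ^ (1 << 17)                   # one Rowhammer-style bit flip

print("original :", factorint(n))           # {p: 1, q: 1} -- infeasible at real key sizes
print("corrupted:", factorint(n_flipped))   # typically splits into several small factors
```

Once the corrupted modulus factors, the attacker can derive a private key that the victim's now-corrupted public key will accept, which is what the SSH and package-signing scenarios above rely on.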
Rowhammer is such a subtle effect and very easily blamed on many other things that it's not hard for the more paranoid among us to imagine the NSA deliberately sabotaging memories with it to use as a backdoor. When it was first discovered I wrote my thoughts on it here:
In the paper, they compromise not only the PGP key but also the Debian update server address, so all that's needed to finish the compromise is for the victim to fetch a software update.
Another attack would be to flip bits in code pages...
That the attackers illustrated it by changing public keys so they could push updates or SSH into a box doesn't mean those are the only ways a system could be compromised. You can't say "I don't use SSH so I'm safe!" or anything like that.
Here's the crux of the memory issue, from one of the links in the article:
> DDR memory is laid out in an array of rows and columns, which are assigned in large blocks to various applications and operating system resources. To protect the integrity and security of the entire system, each large chunk of memory is contained in a "sandbox" that can be accessed only by a given app or OS process. Bit flipping works when a hacker-developed app or process accesses two carefully selected rows of memory hundreds of thousands of times in a tiny fraction of a second. By hammering the two "aggressor" memory regions, the exploit can reverse one or more bits in a third "victim" location. In other words, selected zeros in the victim region will turn into ones or vice versa.
Rowhammer itself has been around for a while, and is only half of the attack that has been posted here.
The other bit is the newer idea (well, an old idea, but a newer actual implementation): memory deduplication by your hypervisor leaves a very minor timing fingerprint when you write to a page of memory that had previously been deduplicated, i.e. the same physical page was shared amongst multiple VMs because it was identical... until you wrote to it and the OS/hardware had to copy-on-write it out to your own dedicated copy, which has a higher latency than writing to a page that is already exclusively yours.
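A rough Python sketch of just the measurement side (whether a slow first write really indicates a page that KSM or the hypervisor had merged depends entirely on the host; this only shows how the latency fingerprint would be observed):

```python
import mmap
import time

PAGE = mmap.PAGESIZE
NPAGES = 256

# Fill anonymous memory with content the attacker hopes the host has deduplicated
# (e.g. a page it expects to exist in a victim VM). If the host merged a page,
# the first write to it triggers a copy-on-write break, which costs noticeably
# more than an ordinary write.
buf = mmap.mmap(-1, NPAGES * PAGE)
buf[:] = b"\x00" * (NPAGES * PAGE)
time.sleep(1)            # in reality: wait long enough for KSM / the hypervisor to scan and merge

def first_write_latency(page: int) -> int:
    start = time.perf_counter_ns()
    buf[page * PAGE] = 0x41                  # first write to this page
    return time.perf_counter_ns() - start

samples = [(first_write_latency(i), i) for i in range(NPAGES)]
baseline = sorted(t for t, _ in samples)[NPAGES // 2]        # median write latency
slow = [i for t, i in samples if t > 10 * baseline]
print("pages whose first write was suspiciously slow:", slow)
```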
There is still significant sharing that can be achieved inside a VM; plus, a lot of the sharing comes from zero pages (pages full of zeros), which is still performed across VMs.
Another benefit of the salting mechanism is that it allows the administrator to define groups of VMs that are trusted in which sharing will be performed.
disclaimer: I work at VMware and wrote the salting code.
I would guess if you're a big VM hosting provider and you have thousands of VMs all running the same version of Windows or Linux distro, that it could add up to some real savings to have them share common pages.
Conceptually, it's safe. UNIX distributions routinely do the equivalent operation within single machines, it's a fundamental part of their operating model.
It's just that in the face of defective hardware, it's not safe. But this is not surprising, because nothing is safe, so it isn't particularly a criticism of page sharing. This specific attack may have used it, but Rowhammer is a powerful tool. This is not the only way it can be used; it is merely an exemplar.
People are focusing too much on the exact specific attack shown here: deduplication, modifying a public key, etc. (And proposing solutions like turning off deduplication, checksums, etc.)
But that's just this attack - the fact that they have that much control over memory means there are FAR FAR FAR more possible attacks.
If you can control memory to that level then you are limited only by your imagination.
The only mitigation I can think of at the moment is ECC memory. And shame on Intel for only supporting that on Xeon.
It is more costly, but this is a good reason to use a dedicated chunk of memory for every Xen PV domU. No oversubscription!
Allowing multiple domU VMs on the same dom0 (or the equivalent in other hypervisor platforms) to re-use memory and balloon/contract memory on the fly is what enables this.
Can you point me to some services that provide, specifically, Xen PV VMs with non-oversubscribed memory?
I'm considering deploying a custom unikernel for protecting the private key data for my app[1], until I have enough money for a Hardware Security Module.
Sorry, I can't; we use Debian stable + Xen on our own bare-metal hardware with 256 GB to 1 TB of RAM. I've never tried to buy a rental VM using the same dom0+PV setup. All of my off-site VMs are for testing: cheap $4/mo OpenVZ instances that are basically glorified jails.
I'm not sure if anyone actually oversubscribes ram with Xen. But we (prgmr.com) still allow you to order PV VMs, mostly because NetBSD performance is abysmal in HVM mode.
"For the attacks to work, the cloud hosting the VM must have deduplication enabled so that physical pages are shared between customers."
But the vendor's cloud will not disable sharing pages of physical memory because ____.
This is a great counterpoint to the salesman trying to sell you on "cloud" anything.
Why is it less expensive to use the "cloud"?
One reason is because you do not get your own physical server, including your own RAM.
When the "cloud" buzz began to gain momentum years ago I raised the issue of not knowing who your "neighbors" were on these physical servers that customers are sharing with other customers in datacenters.
As usual, these concerns will just fade into the background... again.
Ouch. Before reading this article I was seriously considering deploying a signing service as a HaLVM (Haskell) Xen PV unikernel running on EC2. The service would receive its private key after startup, such that the key never touches disk. Now I'm a lot less inclined to pretend that the Xen interface actually protects me...
Xen has had page-table and interrupt vector related security vulnerabilities. But I don't think EC2 would use non-ECC RAM, so I don't think it's vulnerable to this "rowhammer" technique. (I also don't think EC2 would do cross-VM page deduplication, another necessary condition.)
AFAIK Xen does not use memory deduplication. KVM aside, one should be worried about things running inside a Linux host/VM, like containers. Maybe I am missing something.
> For the attacks to work, the cloud hosting the VMs must have deduplication enabled so that physical pages are shared between customers.
This seemingly is an attack where two VMs on the same host can read each other's memory, if a deduplication flag is set on the VM controller. This seems to offer cloud hosters some easy (paid-for) upgrades, to be honest.
It's not (AFAIK) Heartbleed time. It's bad, but the effort required is high, and AFAIK the attacker will replace your key with their key, making it clear you are compromised.
The abstract says the attack allows "flips over arbitrary physical memory in a fully controlled way." If I'm understanding that correctly, it would be trivial to then restore the old key alongside it, leaving the victim none the wiser.
Also, as others have pointed out, this is a hardware issue and the clear solution is to swap out the vulnerable RAM. Yeah, paying more is an "easy" way to have peace of mind (if that's even an option for you as a "cloud hoster"), but that's just backwards IMHO: a security vulnerability on the host's side should not translate into an upsell.
If your threat model now includes 'the attacker can at arbitrary times make arbitrary alterations to the working memory of my process' then no, this won't help. You can't trust the checksum, you can't trust that the data you just check summed hasn't subsequently changed, and you can't trust that data which passes a checksum wasn't previously different. Also you can't trust the checksum code itself. Or the operating system you're running on. Or anything.
That's security by obscurity (which may work to delay the attacker). If an attacker can modify your public keys he can modify your checksums as well.
Seems to me that a public key should be identified by a cryptographic hash of it, rather than by the public key itself. Then the attacker would need to replace the entire hash, rather than just a few bits, because the hash changes completely if you flip even a single bit in the input.
The attacker isn't making targeted modifications to your public keys, though: they're randomly glitching it, and using the page sharing implemented by the hypervisor to read out and factor the glitched version.
Even with, say, a 64-bit checksum, there's only a 1 in 2^64 chance of a randomly modified key/checksum pair matching. But you could use a cryptographic hash as your checksum if you wanted.
I only suggest this not because I think it would be a complete defence against all Rowhammer attacks - it wouldn't - but because the general fragility of the RSA construction means that doing it with any potentially corrupted input gives me the willies. There are sources of bit flips other than Rowhammer, and it just strikes me as a generally good idea not to leak the results of RSA operations performed on potentially bit-flipped inputs.
No, much simpler - store a checksum alongside public keys in places like .ssh/authorized_keys, and have the software like sshd recompute the checksum of the in-memory key each time it uses it for authentication.
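A minimal sketch of that scheme in Python (the file format and helper names here are hypothetical; stock sshd has no such feature):

```python
import hashlib

# Hypothetical on-disk format: one "<key-material> <sha256-hex-of-key>" entry per line.
def load_entries(path: str) -> list[tuple[bytes, str]]:
    entries = []
    with open(path, "rb") as f:
        for line in f:
            key_material, digest = line.strip().rsplit(b" ", 1)
            entries.append((key_material, digest.decode()))
    return entries

def key_if_intact(key_material: bytes, expected_sha256: str) -> bytes:
    # Re-verify the in-memory copy immediately before each authentication decision.
    if hashlib.sha256(key_material).hexdigest() != expected_sha256:
        raise RuntimeError("authorized key corrupted in memory; refusing to authenticate")
    return key_material
```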
This doesn't sound correct. What if the attacker times the operation so that the bit corruption occurs after the checksumming but before being actually used for cryptographic operations?