ECC memory can't eliminate the chances of these failures entirely. They can still happen. Making software resilient against bit flips in memory seems very difficult though, since flips affect not only data but also code. So in theory the behavior of software under random bit flips is, well... random. You probably would have to use multiple computers doing the same calculation and then take the answer from the quorum. I could imagine that doing so would still have been cheaper than using ECC RAM, at least around 2000.
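A minimal sketch of the quorum idea, assuming the calculation is pure and replicated across independent workers with a simple majority vote picking the answer; separate processes stand in for separate machines here, and the names are illustrative:

    from collections import Counter
    from concurrent.futures import ProcessPoolExecutor

    def compute(task):
        # Stand-in for the real calculation we want to protect from bit flips.
        return sum(task)

    def quorum_run(task, replicas=3):
        # Run the same calculation on several workers and take the majority answer.
        # In a real setup each replica would run on different hardware so that
        # faults are independent.
        with ProcessPoolExecutor(max_workers=replicas) as pool:
            results = list(pool.map(compute, [task] * replicas))
        answer, votes = Counter(results).most_common(1)[0]
        if votes <= replicas // 2:
            raise RuntimeError(f"no majority among results: {results}")
        return answer

    if __name__ == "__main__":
        print(quorum_run(list(range(1_000_000))))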
Generally this goes against software engineering principles. You don't try to eliminate the chances of failure and hope for the best. You need to create these failures constantly (within reasonable bounds) and make sure your software is able to handle them. Using ECC RAM is the opposite: you make failures so unlikely that you will generally not encounter these errors at scale anymore, but nonetheless they can still happen, and now you will be completely unprepared to deal with them, since you chose to ignore this class of errors and sweep it under the rug.
Another interesting side effect of quorum is that it also makes certain attacks more difficult to pull off, since now you have to make sure that a quorum of machines gives the same "wrong" answer for an attack to work.
I don't think ECC is going to give anyone a false sense of security. The issue at Google's scale is they had to spend thousands of person-hours implementing in software what they would have gotten for "free" with ECC RAM. Lacking ECC (and generally using consumer-level hardware) compounded scale and reliability problems, or at least made them more expensive than they might otherwise have been.
Using consumer hardware and making up for reliability with redundancy and software was not a bad idea for early Google, but it did end up with an unforeseen cost. Even a thousand machines in a cosmic-ray-proof bunker will end up with memory errors that ECC would correct for free. It's just reducing the surface area of "potential problems".
When I said consumer hardware I meant early Google literally using consumer/desktop components mounted on custom metal racks. While Intel does artificially separate "enterprise" and "consumer" parts, there's still a difference between SuperMicro boards with ECC, LOM features, and data-center-quality PSUs and the off-the-shelf consumer stuff Google was using for a while.
I don't know if AMD really intended to break Intel's pricing model. Their higher end Ryzen chips you'd use in servers and capital W Workstations don't seem to have a huge price difference from equivalent Xeons. Even if they're a bit cheaper you still need a motherboard that supports ECC so it seems at first glance to be a wash as far as price.
That being said if I was putting together a machine today it would be Ryzen-based with ECC.
Small nit: the PRO versions of Ryzen APUs do support ECC[0], and ASRock has been quoted as saying that all of their AM4 motherboards support ECC, even the low-end offerings with the A320 chipset.
The CPU chip can do it. Some motherboards bring out the pins to do it, but they're often called "workstation" boards and cost 2x the price of a standard desktop motherboard.
ECC memory itself is overpriced. $60 for 16 GB DDR4 without ECC, $130 for 16 GB DDR4 with ECC.
Because if you give consumers a choice between having ECC or LEDs on otherwise identical boards with identical price, most will go for the LEDs. In reality the price isn't even the same because ECC realistically adds to the BOM (board, modules) more than LEDs do. So the price goes up with seemingly no benefit for the user.
As such, features that are unattractive to the regular consumer go into workstation/enterprise offerings, where the buyer understands what they're buying and why.
It really isn't. It was a hypothetical choice between 2 models, with ECC or LEDs, at the same price. Hypothetical because most boards don't offer the ECC support at all, and certainly not at the same price.
> LEDs are a great opportunity to increase profit margin, so I'm not sure about your price conclusions
You confused manufacturing costs, price of the product, and profit margins. LEDs cost far less to integrate than ECC but command a higher price premium (thus better profit margins) from the regular consumer. Again supporting my statement that even if presented with 2 absolutely identical parts save for ECC vs. LEDs the vast majority of consumers will go for LEDs because they don't care or know about ECC.
> It really isn't. It was a hypothetical choice between 2 models, with ECC or LEDs, at the same price. Hypothetical because most boards don't offer the ECC support at all, and certainly not at the same price.
You're making a claim about what people would choose. If you have no related data, and logic could support multiple outcomes, then a claim like that is basically useless.
> You confused manufacturing costs, price of the product, and profit margins.
I'm not sure why you think this.
> Again supporting my statement that even if presented with 2 absolutely identical parts save for ECC vs. LEDs the vast majority of consumers will go for LEDs because they don't care or know about ECC.
Sure, if you don't tell them that it's ECC they won't pick the ECC part.
If you actually do a fair test, and put them side by side while explaining that one protects them from memory errors and the other looks cooler, you can't assume they'll all pick the LED.
When people never even think of ECC, that is not evidence that they wouldn't care or know about it in a head-to-head competition.
My claims are common sense and supported by real life: regular people don't know what ECC is, and those who do find the problem's impact too minor to get palpable benefits from fixing it. Why are you being pedantic if you aren't actually going to bring arguments at the same level you expect from me?
> If you actually do a fair test, and put them side by side while explaining that one protects them from memory errors and the other looks cooler, you can't assume they'll all pick the LED.
Isn't this exactly the kind of claim you yourself characterize one paragraph above as "useless" because "you have no related data, and logic could support multiple outcomes"? Sure, if people were more tech-educated then my assumption might be wrong. But people aren't more educated so...
The benefits of LEDs are hard to miss (light) all the time. The benefits of ECC are hard to observe even in the tiny fraction of the time when it matters. Human cellular "bitflips" happen every hour, but they don't visibly affect you, so you also consider them not an issue that demands more attention, like constant sun protection. People aren't keen on paying to solve problems they never suffered from, or even noticed, especially when you tell them these problems happen all the time with no obvious impact. Unless they have no choice, like OEMs selling ECC-RAM-only devices.
Sell me ECC memory when my (actual real life) 10 year old desktop or 5 year old phone never glitched. Sell me ECC RAM when my Matlab calculations come back different every time. See the difference?
> When people never even think of ECC, that is not evidence that they wouldn't care or know about it in a head-to-head competition.
Well then, I guess none of us has any evidence, except that today people buy LEDs, not ECC RAM. Educate people, or wait until manufacturing processes and designs are so susceptible to bitflips that people notice, and it will be a different conversation.
> My claims are common sense and supported by the real life: regular people don't know what ECC is, and those who do find the problem's impact is too minor to get palpable benefits from fixing it. Why are you being pedantic if you aren't actually going to bring arguments at the same level you expect from me?
Regular people aren't given the choice! The things you're quoting about the real world to support your argument are incompatible with a scenario where someone is actually looking at ECC and LED next to each other. And I'm not being "pedantic" to say that, it's a really core point.
> Isn't this exactly the kind of claim you yourself characterize one paragraph above as "useless" because "you have no related data, and logic could support multiple outcomes"?
A claim of a specific outcome is useless. "you can't assume" is another way of phrasing the lack of knowledge of specific outcomes.
> Sure, if people were more tech-educated then my assumption might be wrong. But people aren't more educated so...
It's the kind of thing that can go on a product page. But first someone has to actually make a consumer-focused sales page for ECC memory, and the ECC has to be plug-and-play without strong compatibility worries.
And just like when LEDs spread over everything, it's something that you can teach people about and create demand for with a bit of advertising.
> Sell me ECC memory when my (actual real life) 10 year old desktop or 5 year old phone never glitched. Sell me ECC RAM when my Matlab calculations come back different every time. See the difference?
That's a clear picture of one person. But "never glitched" is a very dubious claim, and you can't blindly extrapolate that to how everyone would act.
I think they may have been referring to the actual mainstream retail availability of ECC RAM. I can buy non-ECC RAM at almost any retailer that sells computers. If I need non-ECC RAM right now I can have it in my hands in 30 minutes. ECC on the other hand I pretty much have to buy online. Microcenter stocks a single 4GB stick of PC4-21300, and I can't think of a single use case where I'd want ECC but not more than 4GB.
You're right, rereading the parent post with that angle makes it clearer that they were complaining about the unavailability of memory and other hardware.
It would definitely be great to have more reliable hardware generally available and at less of a price premium.
1. Single-bitflip correction, along with Google's metrics, could help them identify algorithms of their own or customers' VMs that are causing bitflips via rowhammer, and machines which have errors regardless of workload
2. Double-bitflip detection lets Google decide if they, say, want to panic at that point and take the machine out of service, and they can report on what software was running and why. Their SREs are world-class and may be able to deduce whether this was a fluke (orders of magnitude less likely than a single bit flip), whether a workload caused it, or whether hardware caused it.
The advantage the 3 major cloud providers have is scale. If a Fortune 500 were running their own datacenters, how likely would it be that they have the same level of visibility into their workloads, the quality of SREs to diagnose, and the sheer statistical power of scale?
I sincerely hope Google is not simply silencing bitflip corrections and detections. That would be a profound waste.
ECC seems like a trivial thing to log and keep track of. Surely any Fortune 500 could do it and would have enough scale to get meaningful data out of it?
It's not just tracking ECC errors, which as you point out is not hard, but correlating it with the other metrics needed to determine the cause, and having the scale to reliably root-cause bitflips to software (workloads that inadvertently rowhammer), hardware, or even malicious users (GCP customers that may intentionally run a rowhammer attack).
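For what it's worth, the raw counts are easy to get at: on Linux the EDAC subsystem exposes corrected and uncorrected error counters per memory controller under sysfs. A rough sketch of the "trivial to log" part (the sysfs layout depends on the kernel and the platform's EDAC driver; the correlation work is where the real effort goes):

    import glob
    import os

    def read_edac_counts():
        # Read corrected (ce) and uncorrected (ue) error counts per memory
        # controller from the Linux EDAC sysfs interface. Availability and
        # layout depend on the kernel and the platform's EDAC driver.
        counts = {}
        for mc in glob.glob("/sys/devices/system/edac/mc/mc*"):
            try:
                with open(os.path.join(mc, "ce_count")) as f:
                    ce = int(f.read())
                with open(os.path.join(mc, "ue_count")) as f:
                    ue = int(f.read())
            except OSError:
                continue
            counts[os.path.basename(mc)] = {"corrected": ce, "uncorrected": ue}
        return counts

    if __name__ == "__main__":
        for mc, c in read_edac_counts().items():
            print(mc, c)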
There was an interesting challenge at DEF CON CTF a while back that tested this, actually. It turns out that it is possible to write x86 code that is 1-bit-flip tolerant; that is, a bit flip anywhere in its code can be detected and recovered from with the same output. Of course, finding the sequence took (or so I hear) something like 3600 cores running for a day ;)
Nit: not for a day, more like 8 hours, and that's because we were lazy and somebody said he "just happened" to have a cluster with unbalanced resources (mainly used for deep learning, but all GPUs occupied with quite a lot of CPU and RAM left), so we decided to brute force the last 16 bits :)
Also, the challenge host left useful state (which bit was flipped) in registers before running teams' code; without this I'm not sure it would even be possible.
Sure, all's fair in a CTF. That story came to me through the mouths of at least a handful of people, who might have a bit of an incentive to exaggerate given that they hadn't quite been able to get to zero and might be just a little sour :P
The state was quite helpful, yes; for x86 it seems like a "clean slate" shellcode would be quite difficult, if not impossible, to achieve, as we saw. However, I am left wondering how other ISAs would fare… perhaps worse, since x86 is notoriously dense. But maybe not? The fixed-width ones would probably be easy to try out, at least.
Maybe being notoriously dense is not a bad thing? While those ModRM bytes popping up everywhere are annoying as f* (too easy to flip an instruction into a form with an almost-guaranteed-to-be-invalid memory access), at least thanks to the density there won't be reserved bits. For example, in AArch64, if bits 28 and 27 are both zero the instruction will almost certainly be an invalid one (hitting an unallocated area), and with a single bit flip all branch instructions will have [28:27] = b'00...
Right, I was saying that the other ISAs would do worse because they aren't as dense and will hit something undefined much more readily. But the RISCs in general are much less likely to touch memory (only if you do a load/store from a register that isn't clean, maybe). At a glance, MIPS looks like it might work, since the opcode field seems to use all the bits and the remaining bits just encode reg/func/imm in various ways. The one caveat I see is that the top bit of the opcode seems to encode memory accesses, so you may be forced to deal with at least one.
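To make the brute-force idea concrete, the core check is just "flip each bit of the candidate code and see whether it still decodes" (the real challenge of course also required the mutant to run and produce the same output). A toy sketch of the decode half using the Capstone disassembler; running each mutant safely is the hard part and is omitted here:

    from capstone import Cs, CS_ARCH_X86, CS_MODE_64

    md = Cs(CS_ARCH_X86, CS_MODE_64)

    def decodes_fully(code):
        # True if Capstone can decode the entire buffer as x86-64 instructions.
        return sum(insn.size for insn in md.disasm(code, 0)) == len(code)

    def surviving_flips(code):
        # Yield (byte_index, bit_index) for every single-bit flip that still
        # decodes. A real search would also execute each mutant in a sandbox
        # and compare its output against the original.
        for i in range(len(code)):
            for bit in range(8):
                mutant = bytearray(code)
                mutant[i] ^= 1 << bit
                if decodes_fully(bytes(mutant)):
                    yield i, bit

    snippet = bytes.fromhex("4831c0c3")  # xor rax, rax ; ret
    print(sum(1 for _ in surviving_flips(snippet)), "of", len(snippet) * 8)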
> Making software resilient against bitflips in memory seems very difficult though, since it not only affects data, but also code.
There is an OS that pretty much fits the bill here. There was a show where Andrew Tanenbaum had a laptop running Minix 3 hooked up to a button that injected random changes into module code while it was running, to demonstrate its resilience to random bugs. Quite fitting that this discussion was initiated by Linus!
Although it was intended to protect against bad software, I don't see why it wouldn't also go a long way in protecting the OS against bitflips. Minix 3 uses a microkernel with a "reincarnation server", which means it can automatically reload any misbehaving code not part of the core kernel on the fly (which for Minix is almost everything). This even includes disk drivers. In the case of misbehaving code there is some kind of triple-redundancy mechanism much like the "quorum" you suggest, but that is where my crude understanding ends. AFAIR userland software could in theory also benefit, provided it was written in such a way that it could continue gracefully after a reload.
At some point, whatever's watching the watchers is going to be vulnerable to bitflip and similar problems.
Even with a triple-redundant quorum mechanism, slightly further up the stack you're going to have some bit of code running that processes the three returned results - if the memory that code is sitting on gets corrupted, you're back where you started.
> At some point, whatever's watching the watchers is going to be vulnerable to bitflip
One advantage of microkernels is that the "watcher" is so small that it could be run directly from ROM, instead of loaded into RAM. QNX has advocated that route for robotics and such in the past.
Minix may not be the best example of the type. While it is a microkernel, its real-world reliability has been poor in the past. More mature microkernel operating systems like QNX and OpenVMS are better examples.
> While it is a microkernel, it's real world reliability has been poor in the past.
Nitpick/clarification: it currently supervises the security posture, attestation state and overall health of several billion(?) Intel CPUs as the kernel used by the latest version of the Management Engine.
If the ME is shut down completely, the CPU apparently switches off within 20 minutes. Presumably this applies across the full uptime of the processor, not just immediately after boot, and if that's the case, the percentage of Intel CPUs that randomly switch off is a good proxy for the instability/unreliability of Minix in a tightly controlled industrial setting.
Anyone have any idea why there haven't been any open-source QNX clones, at least not any widely known ones? Even before their Photon MicroGUI patents expired, the clones could have used X11.
I used to occasionally boot into QNX on my desktop in college. It was a very responsive and stable system.
Hypervisors are, to a first approximation, microkernels with a hardware-like interface. All of this kernel bypass work being done by RDBMSes, ScyllaDB, HFTs, etc. is, to a first approximation, making a monolithic kernel act a bit like a microkernel.
There are well known open source microkernels, like Minix 3 and L4. Probably not that attractive.
Why something hasn't been done is always a hard question to answer, since to succeed a lot of things have to go right, and by default none of them do. But one thing is that microkernels were more trendy in the 90s; since then R&D people have mostly been doing things like "the cluster is the computer", unikernels, exokernels, rump kernels, embedded (e.g. Tock), and remote attestation (I'm not up to date on the latest).
Thinking about it a bit more, QNX clones might suffer from something akin to second system syndrome. There's a simple working design, and it likely strongly invites people to jump right to their own twist on the design before they get very far into a clone.
> Minix may not be the best example of the type. While it is a microkernel, it's real world reliability has been poor in the past. More mature microkernel operating systems like QNX and OpenVMS are better examples.
You might be referring to the previous versions. Minix 3 is basically a different OS, and it's more than an educational tool - in fact it's probably running inside your computer right now if you have an Intel CPU (it runs on Intel's ME, for better or worse).
Yes, but this is the entire principle around which microkernels are designed: making the last critical piece of code as small and reliable as possible. Minix 3's kernel is <4000 lines of C.
As far as bitflips are concerned, having the critical kernel code occupy fewer bits reduces the probability of a bitflip causing an irrecoverable error.
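Back-of-the-envelope version of that argument, assuming flips land uniformly over resident memory (the numbers are purely illustrative):

    kernel_bytes = 4_000 * 30          # ~4k LoC at a rough ~30 bytes of machine code per line
    total_bytes = 16 * 2**30           # 16 GiB of RAM
    print(kernel_bytes / total_bytes)  # ~7e-6, so the vast majority of flips miss the kernel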
Yes, I understand this -- basic risk mitigation by reducing the size of your vulnerability.
(I'll brag a bit, old-school, by mentioning I used to be a heavy user of Minix - my floppy images came in over an X25 network - and that I saw Andy Tanenbaum give his Minix 3 keynote at FOSDEM about a decade ago. I'm a big fan.)
Anyway, while reducing risk this way is laudable and will improve your fleet's health, as per TFA it's a poor substitute, with bad economics and worse politics behind it, for simply stumping up for ECC.
I'll also note that, for example, Google's sitting on ~3 million servers so that ~4k LoC just blew out to 12,000,000,000 LoC -- and that's for the hypervisors only.
Multiply that out by ~50 to include the VMs' microkernels, and the amount of memory you've now got that is highly susceptible to undetected bit flips is well into the mind-blowing range.
Oh, I'm not saying it's the single best solution; I guess I got carried away in the argument. It's simply a scenario where the concept shines, yet it's an entirely artificial scenario, and I agree ECC is the correct way.
I'm surprised that the other replies don't grasp this. This is the proper level to do the quorum.
Doing quorum at the computer level would require synchronizing parallel computers, and unless that synchronization were to happen for each low-level instruction, it would have to be written into the software to take a vote at critical points. This is going to be greatly detrimental to both throughput and software complexity.
I guess you could implement the quorum at the CPU level... e.g. have redundant cores each with their own memory. But unless there was a need to protect against CPU cores themselves being unreliable, I don't see this making sense either.
At the end of the day, at some level, it will always come down to probabilities. "Software engineering principles" will never eliminate that.
My first employer out of Uni had an option for their primary product to use a NonStop for storage -- I think HP funded development, and I'm not sure we ever sold any licenses for it.
You need two alpha particles hitting the same rank of memory for a failure to happen. Although super rare, even then it is still correctable. You need three before it is silent data corruption. Silent corruption is what you get with non-ECC memory with even a single flip.
Where are you getting this from? My understanding is that these errors are predominantly caused by secondary particles from cosmic rays hitting individual memory cells, and I've never heard something so precise as "you need two alpha particles". Aren't the capacitances in modern DRAM chips extremely small?
The structure of the ECC is at the rank level. This allows for correcting single-bit flips in a rank and detecting double-bit flips in a rank. So when you grab a cache line, each 64-bit word is corrected and verified.
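For anyone curious what "correct single, detect double" (SECDED) looks like mechanically, here's a toy extended-Hamming-code sketch over a list of bits. Real DIMMs do this in hardware over 64 data bits plus 8 check bits per rank, not in software like this:

    def secded_encode(data_bits):
        # Extended Hamming (SECDED): parity bits at power-of-two positions,
        # data bits at the rest, plus an overall parity bit at index 0.
        m = len(data_bits)
        r = 0
        while (1 << r) < m + r + 1:
            r += 1
        code = [0] * (m + r + 1)
        j = 0
        for pos in range(1, len(code)):
            if pos & (pos - 1):            # not a power of two -> data slot
                code[pos] = data_bits[j]
                j += 1
        for i in range(r):                 # parity bit p covers every position
            p = 1 << i                     # whose index has bit p set
            for pos in range(1, len(code)):
                if pos & p and pos != p:
                    code[p] ^= code[pos]
        code[0] = sum(code[1:]) % 2        # overall parity over the whole word
        return code

    def secded_decode(code):
        # Returns 'ok', 'corrected', or 'double-error' plus a (possibly fixed) copy.
        syndrome = 0
        for pos in range(1, len(code)):
            if code[pos]:
                syndrome ^= pos            # XOR of the positions of all set bits
        overall = sum(code) % 2
        if syndrome == 0 and overall == 0:
            return "ok", code[:]
        if overall == 1:                   # odd number of flips: assume one and fix it
            fixed = code[:]
            fixed[syndrome] ^= 1           # syndrome 0 means the overall parity bit flipped
            return "corrected", fixed
        return "double-error", code[:]     # even flips, nonzero syndrome: detect only

    word = secded_encode([1, 0, 1, 1, 0, 0, 1, 0])
    word[5] ^= 1                           # one flip: corrected
    print(secded_decode(word)[0])
    word[9] ^= 1                           # two flips: detected, not corrected
    print(secded_decode(word)[0])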
Bit flips can happen, but regardless of whether they can be repaired by the ECC code or not, the OS is notified, IIRC. It will signal a corruption to the process that is mapped to the faulty address. I suppose that if the memory contains code, the process is killed (if ECC correction failed).
> I suppose that if the memory contains code, the process is killed (if ECC correction failed).
Generally, it would make the most sense to kill the process if the corrupted page is data, but if it's code, then maybe re-load that page from the executable file on non-volatile storage. (You might also be able to rescue some data pages from swap space this way.)
If you go that route, you should be able to avoid the code/data distinction entirely, as data pages can also be completely backed by files. I believe the kernel already keeps track of which pages are a clean copy of data from the filesystem, so I would think it would be a simple matter of essentially paging out the corrupted data.
What would be interesting is if userspace could mark a region of memory as recomputable. If the kernel is notified of memory corruption there, it triggers a handler in the userspace process to rebuild the data. Granted, given the current state of hardware, I can't imagine that is anywhere near worth the effort to implement.
> What would be interesting is if userspace could mark a region of memory as recomputable.
I believe there's already some support for things like this, but intended as a mechanism to gracefully handle memory pressure rather than corruption. Apple has a Purgeable Memory mechanism, but handled through higher-level interfaces rather than something like madvise().
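Purely to illustrate the shape of the idea a couple of comments up (nothing like this exists as described; the registry and callback interface here are hypothetical): userspace registers a rebuild callback for a region, and whatever delivers the corruption notification looks up the region and asks the owner to recompute it instead of killing the process.

    # Hypothetical sketch of a userspace "recomputable region" registry.
    # Nothing here corresponds to a real kernel interface; a real version
    # would hook into however the OS reports the corrupted address.

    class RecomputableRegions:
        def __init__(self):
            self._regions = []                  # (start, length, rebuild_fn)

        def register(self, start, length, rebuild_fn):
            # Mark [start, start+length) as cheap to regenerate via rebuild_fn.
            self._regions.append((start, length, rebuild_fn))

        def handle_corruption(self, addr):
            # Called with a corrupted address; returns True if it was rebuilt.
            for start, length, rebuild_fn in self._regions:
                if start <= addr < start + length:
                    rebuild_fn(start, length)   # owner regenerates the contents
                    return True
            return False                        # not recomputable: caller must bail

    # Usage sketch: a cache of derived values that can simply be regenerated.
    regions = RecomputableRegions()
    regions.register(0x7f0000000000, 4096,
                     lambda start, length: print(f"rebuilding cache at {start:#x}"))
    print(regions.handle_corruption(0x7f0000000123))  # True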
> You probably would have to use multiple computers doing the same calculation and then take the answer from the quorum.
The Apollo missions (or was it the Space Shuttle?) did this. They had redundant computers that would work with each other to determine the “true” answer.
The Space Shuttle had redundant computers. The Apollo Guidance Computer was not redundant (though there were two AGCs onboard-- one in the CM and one in the LEM). The aerospace industry has a history of using redundant dissimilar computers (different CPU architectures, multiple implementations of the control software developed by separate teams in different languages, etc) in voting-based architectures to hedge against various failure modes.
In aerospace, where this is common, you often had multiple implementations, since you wanted to guard against software bugs made by humans. The problem was that different teams often made the same error in the same place, so it wasn't as effective as it might have seemed.
Forgive my ignorance, but wouldn't the computer actually reacting to the calculation (and sending a command or displaying the data) still be very vulnerable to bit-flips? Or were they displaying the results from multiple machines to humans?
If you use multiple computers doing the same calculation and then take the answer from the quorum, how do you ensure the computer that does the comparison is not affected by memory failures? Remember that all queries have to go through it, so it has to be comparable in scale and power.
Raft, Paxos, and other consensus algorithms add even more overhead. Imagine running every Google query through Raft and think how long it will take and how much extra hardware would be needed.
ECC memory is just as fast as non-ECC memory, and only costs a little more.
Your comment sounded like "your recursive definition is impossible".
I am totally for ECC and was flabbergasted when it went away. But the article makes sense since I remember Intel pushing hard to keep it out of the consumer space. The freaking QX6800 didn't support ECC and it retailed for over a grand.