Libsecded (cr.yp.to)
109 points by andutu on Nov 9, 2022 | 22 comments



I wonder how well that paper holds up over a decade later; it reviewed DDR1/DDR2 systems in 2009. I like to ask people running ECC to check their error counters (on Linux: `edac-util -rfull`). From my very non-scientific survey, memory errors seem to happen significantly less often than that paper would lead you to believe. Then again, running ECC in the first place indicates better hardware than non-ECC, so that's a likely source of bias.
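
For anyone who wants to script this rather than call edac-util, the same counters live in sysfs. A minimal sketch in C, assuming the standard Linux EDAC sysfs layout (ce_count/ue_count per memory controller):

    /* Read EDAC corrected/uncorrected error counters from sysfs.
       Assumes the standard layout under /sys/devices/system/edac/mc/. */
    #include <stdio.h>

    static long read_counter(const char *path) {
        long v = -1;
        FILE *f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%ld", &v) != 1) v = -1;
            fclose(f);
        }
        return v;
    }

    int main(void) {
        char path[128];
        for (int mc = 0; ; mc++) {   /* walk mc0, mc1, ... until one is missing */
            snprintf(path, sizeof path,
                     "/sys/devices/system/edac/mc/mc%d/ce_count", mc);
            long ce = read_counter(path);
            if (ce < 0) break;
            snprintf(path, sizeof path,
                     "/sys/devices/system/edac/mc/mc%d/ue_count", mc);
            printf("mc%d: %ld corrected, %ld uncorrected\n",
                   mc, ce, read_counter(path));
        }
        return 0;
    }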


It can be very hard to get memory error reporting these days.

Bryan Cantrill mentions in one of his talks that Joyent had a datacenter where uncorrectable errors were sporadically halting servers, but no correctable errors were ever counted. He eventually got the motherboard firmware vendor to admit that these were handled "firmware-first" meaning intentionally not reported.

I've looked into using some consumer AMD CPUs that theoretically work with ECC memory, and a couple motherboards from ASUS and ASRock theoretically support ECC, but I've heard that it's hard to figure out if it's really working.

Testing whether a motherboard's firmware actually reports ECC errors ... probably doesn't really happen, because everything seems to work fine if it doesn't report them, and the company wants to just finish QA and ship. And the rare motherboard that does report errors correctly is more likely to trigger bugs in higher layers that were never actually exercised before. And there's pressure to disable or hide this feature to reduce pesky customer support costs: no one else's product reports any errors, why does yours report errors, I want a replacement, etc.

Consumer DDR5 is all ECC, out of desperate necessity, but it doesn't report anything, so you can't tell how close to the sun it's flying. Rowhammer just keeps coming back.


I've certainly seen ECC error reporting work, although it was a little sketchy, but that was Xeon E5-2600 v1-v4, which is dated now and a server platform anyway.

With a fleet of 2000 servers carrying 64 GB to 768 GB each of DDR3 and DDR4, most days we didn't see any errors detected, unless we currently had a system with a DIMM that would throw a (correctable) error once a day or so. Reporting was always kind of weird: we'd get OS logging once an hour if there were any errors, which is mostly fine, except when a system goes from a couple of errors an hour to thousands per minute. Machine check exceptions are quite expensive to process and kill throughput if they're happening a lot, but you have no idea why the system is misbehaving until the next reporting interval. Of course, those thousands of errors really skew the average rate. We'd replace RAM for more than one uncorrectable error, or an uncorrectable after correctables, or, when we had time, too many correctables (100+ per day). A lot of servers would show a couple of correctable errors once and then be fine, but on some the errors became periodic or escalated.

On consumer platforms, you should be able to test whether ECC reporting is happening by setting the memory voltage too low or the timings too fast, so that you're likely to get errors. If you can trigger an uncorrectable error, you should be able to trigger a correctable one too.

On-die ECC is better than nothing, I guess, but it's kind of like digital TV --- it's good until it's not, with no indication that you're close to the edge. It's also no help if there are problems between the CPU and the RAM.


The “ECC” on DDR5 does _not_ replace regular ECC. Please see Ian’s explanation: https://youtu.be/XGwcPzBJCh0


> I've looked into using some consumer AMD CPUs that theoretically work with ECC memory, and a couple motherboards from ASUS and ASRock theoretically support ECC, but I've heard that it's hard to figure out if it's really working.

It's not obvious, and there is certainly a lot of misinformation on the Web. But it shouldn't be that difficult, at least on Linux (unlike the BSDs, and I'm speaking as a BSD user). Linux has the best ECC support for consumer AMD CPU users, thanks to the kernel driver amd64_edac [0][1]. It reports the ECC status by querying registers inside the memory controller, giving a reliable indication of whether ECC is active. If dmesg says "EDAC amd64: Node 0: DRAM ECC enabled", you can be pretty sure that ECC is indeed enabled.

This driver also allows you to change the ECC memory scrubbing settings (but make sure the kernel is up to date, see [2][3]). Memory scrubbing, similar to RAID disk scrubbing, is a background process that reads data out and checks its integrity (otherwise, rarely-accessed data with a recoverable error may never be checked until an unrecoverable error occurs). Some consumer-grade motherboards don't show the memory scrubbing options in the firmware, and you often want to select a more aggressive scrub rate than the slow default, so the kernel driver is really helpful here too.
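
As a concrete example, here is a sketch of adjusting the scrub rate from userspace via the sdram_scrub_rate sysfs attribute (a standard EDAC attribute, in bytes per second; not every platform driver implements it, and writing requires root):

    /* Sketch: request a more aggressive EDAC scrub rate, then read back
       the value the driver actually granted (it rounds to what the
       hardware supports). Requires root; the attribute may be absent. */
    #include <stdio.h>

    int main(void) {
        const char *path = "/sys/devices/system/edac/mc/mc0/sdram_scrub_rate";
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return 1; }
        fprintf(f, "%d\n", 10 * 1024 * 1024);   /* ask for ~10 MB/s */
        fclose(f);

        long actual = -1;
        f = fopen(path, "r");
        if (f && fscanf(f, "%ld", &actual) == 1)
            printf("scrub rate now %ld bytes/s\n", actual);
        if (f) fclose(f);
        return 0;
    }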

And speaking of testing whether actual errors can be reported and corrected... In a proper production environment, ECC is tested by a technique called "data poisoning" or "error injection", which allows the OS to inject ECC errors directly via the memory controller for confirmation. To do that, the feature must be enabled by the firmware, and the OS must also provide the necessary driver. Unfortunately, while server motherboards always have an enable/disable option, some consumer motherboards do not. Worse, it's not supported by Linux as far as I know. Theoretically one could read the AMD CPU datasheet, called the BIOS and Kernel Developer’s Guide, and write one's own tool; unfortunately, all post-Ryzen datasheets are under NDA.

But all is not lost. There is a proprietary tool, MemTest86, which claims to support ECC error injection [4]. This should be helpful (though I've never tried it personally). Alternatively, on consumer-grade hardware, one can simply check ECC by overclocking the memory and pushing its timings to the edge of instability, then running a stress test like Prime95 (the Unix version is called mprime). In my experience, if the memory is sufficiently overclocked, a single test only takes 10 minutes.

Finally, there is a lot of misinformation on the Web. For example, one article showed that Linux kills a process via SIGBUS when an uncorrectable ECC error occurs, instead of triggering a kernel panic, and went on to conclude that ECC was not fully functional. That was pure misinformation: Linux only triggers a kernel panic when kernel memory has an uncorrectable error; for user memory, SIGBUS is the expected behavior.

Another case of misinformation stems from the lack of proper error decoding on the BSDs. When an ECC error occurs, a Machine Check Exception is generated by the CPU. On Linux, it's correctly decoded and logged, but FreeBSD has no decoder so far, leaving you with a mysterious MCA error in dmesg. For example, a correctable DRAM ECC error is reported as an "L3 cache error", which led many people to falsely believe that Ryzen's ECC was not working on FreeBSD. I've compared the MCE/MCA error code for FreeBSD's "L3 cache error" with Linux's "correctable ECC" error code: they're identical.
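
For completeness, this is what the user-memory path looks like from an application's point of view: Linux delivers SIGBUS with si_code set to BUS_MCEERR_AR (you touched a poisoned page) or BUS_MCEERR_AO (the error was found asynchronously, e.g. by a scrubber). A hedged sketch of a handler:

    /* Sketch: recognize Linux memory-failure SIGBUS in a handler.
       Real recovery (dropping caches, reloading data) is app-specific;
       fprintf is not async-signal-safe and is used only for brevity. */
    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static void on_sigbus(int sig, siginfo_t *info, void *ctx) {
        (void)sig; (void)ctx;
        if (info->si_code == BUS_MCEERR_AR || info->si_code == BUS_MCEERR_AO)
            fprintf(stderr, "uncorrectable memory error at %p (si_code=%d)\n",
                    info->si_addr, info->si_code);
        _exit(1);   /* conservatively bail; the poisoned page must not be touched again */
    }

    int main(void) {
        struct sigaction sa = {0};
        sa.sa_sigaction = on_sigbus;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGBUS, &sa, NULL);
        /* ... application work ... */
        return 0;
    }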

> He eventually got the motherboard firmware vendor to admit that these were handled "firmware-first" meaning intentionally not reported.

This is the real problem. Most consumer motherboards don't do this, but it can be a headache when the firmware vendor screws it up... Some do it by default with an option to disable it, and on some it cannot be disabled. Also, many server motherboards with "firmware-first" ECC handling hide the errors from the OS but still report them via the IPMI console.

> Consumer DDR5 is all ECC, out of desperate necessity, but it doesn't report anything, so you can't tell how close to the sun it's flying. Rowhammer just keeps coming back.

Saying "DDR5 is all ECC" is misleading. DDR5's "on-die ECC" should only be seen as an internal implementation detail to increase the chip yield, rather than a full form of data integrity protection. Real ECC is always performed by the memory controller. For DDR5, there still exists separate ECC versions for server applications, just like all previous DDR generations.

[0] https://www.kernel.org/doc/html/latest/admin-guide/ras.html

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

[2] https://unix.stackexchange.com/questions/593060/how-do-i-ena...

[3] https://lore.kernel.org/linux-edac/a9cdf7c2-868a-8f67-ac4e-c...

[4] https://www.memtest86.com/ecc.htm


Has anyone tried using software to measure bit-flip rates on non-ECC systems? It seems like a pretty easy task. Turn off swap. Fill a bunch of memory with a known pattern. Every few hours read all the memory and verify that no bits were flipped. If the 2009 result holds on modern systems and a gigabyte of DRAM flips a bit every few hours, then evidence should show up pretty quickly.
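
A minimal sketch of that experiment in C, assuming Linux (mlock keeps the buffer resident, which needs root or a raised memlock limit):

    /* Lock a buffer in RAM, fill it with a known pattern, and rescan
       it every hour for flipped bits. 1 GiB under test here. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define PATTERN 0x5555555555555555ULL   /* alternating bits */

    int main(void) {
        size_t bytes = 1ULL << 30;
        size_t words = bytes / sizeof(uint64_t);
        uint64_t *buf = malloc(bytes);
        if (!buf || mlock(buf, bytes) != 0) { perror("alloc/mlock"); return 1; }
        for (size_t i = 0; i < words; i++) buf[i] = PATTERN;

        for (;;) {
            sleep(3600);
            unsigned long flips = 0;
            for (size_t i = 0; i < words; i++) {
                uint64_t diff = buf[i] ^ PATTERN;
                if (diff) {
                    flips += (unsigned long)__builtin_popcountll(diff);
                    buf[i] = PATTERN;    /* re-arm for the next pass */
                }
            }
            printf("pass done: %lu flipped bit(s)\n", flips);
            fflush(stdout);
        }
    }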


At least according to more recent publications and to my own experience, a single gigabyte of good DRAM does not flip a bit every few hours, but every few months.

On non-server computers, which seldom reach peak memory usage, a bit flip may happen in a location where it does no harm.

Nevertheless, after a few years of use, a memory module can start to have frequent bit flips, even many per hour. If you have ECC, you will be notified about this and you will be able to replace the bad module. On most computers without ECC, that can easily lead to undetected data corruption.

Also, when you have 64 GB of DRAM, that multiplies the error frequency by 64.


https://pqsrc.cr.yp.to/libsecded-20220828/INTERNALS.html

libsecded encodes an n-byte array using an extended Hamming code on the bottom bit of each byte, in parallel an extended Hamming code on the next bit of each byte, etc.

https://en.m.wikipedia.org/wiki/Hamming_code

Extended Hamming codes achieve a Hamming distance of four, which allows the decoder to distinguish between the case where at most one single-bit error occurred and the case where any two-bit error occurred. In this sense, extended Hamming codes are single-error correcting and double-error detecting, abbreviated SECDED.

The main idea is to choose the error-correcting bits such that the index-XOR (the XOR of all the bit positions containing a 1) is 0. We use positions 1, 10, 100, etc. (in binary) as the error-correcting bits, which guarantees it is possible to set the error-correcting bits so that the index-XOR of the whole message is 0. If the receiver receives a string with index-XOR 0, they can conclude there were no corruptions, and otherwise, the index-XOR indicates the index of the corrupted bit.

Hamming codes have a minimum distance of 3, which means that the decoder can detect and correct a single error, but it cannot distinguish a double bit error of some codeword from a single bit error of a different codeword. Thus, some double-bit errors will be incorrectly decoded as if they were single bit errors and therefore go undetected, unless no correction is attempted.

To remedy this shortcoming, Hamming codes can be extended by an extra parity bit. This way, it is possible to increase the minimum distance of the Hamming code to 4, which allows the decoder to distinguish between single bit errors and two-bit errors. Thus the decoder can detect and correct a single error and at the same time detect (but not correct) a double error.
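
As a concrete illustration of the index-XOR construction described above, here is a sketch of the check/correct step for the textbook (16,11) extended Hamming code (the general scheme, not necessarily libsecded's actual layout). The encoder sets the parity bits at positions 1, 2, 4 and 8 so that the XOR of all set-bit positions is zero, and sets the extra parity bit at position 0 so that the total number of set bits is even:

    /* SECDED check/correct for a 16-bit extended Hamming codeword.
       Positions 1,2,4,8 are Hamming parity, position 0 is overall
       parity, the remaining 11 positions are data. Sketch only. */
    #include <stdint.h>
    #include <stdio.h>

    static uint16_t secded_fix(uint16_t cw, int *status) {
        unsigned syndrome = 0, parity = 0;
        for (unsigned pos = 0; pos < 16; pos++)
            if ((cw >> pos) & 1) { syndrome ^= pos; parity ^= 1; }

        if (syndrome == 0 && parity == 0)
            *status = 0;                     /* clean */
        else if (parity == 1) {              /* odd weight change: single error */
            *status = 1;
            cw ^= (uint16_t)1 << syndrome;   /* syndrome 0 = the parity bit itself */
        } else
            *status = 2;                     /* double error: detected, not fixed */
        return cw;
    }

    int main(void) {
        int status;                          /* all-zero word is a valid codeword */
        secded_fix(0x0040, &status);         /* one bit flipped from all-zero */
        printf("single flip -> status %d\n", status);   /* 1: corrected */
        secded_fix(0x0041, &status);         /* two bits flipped */
        printf("double flip -> status %d\n", status);   /* 2: detected only */
        return 0;
    }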


I do wonder how many bits in RAM really are "harmlessly flippable". If I took a snapshot of a running machine, how safe is that flip from landing somewhere bad? Perhaps a lot of it ends up being effectively write-only, so it's fine?


My intuition is that it would probably depend on how much memory is being used and how fast.

If you're just decoding a lot of 8K video in software, maybe most flips will land in the decoded frame buffer and will not be noticed. On the other hand, if you're crunching a lot of data and storing big, complex data structures in the process, a flip could more easily break some pointer address or length field and crash your program (or the system, if it's kernel stuff).

>If I took a snapshot of a running machine, how safe is that flip from landing somewhere bad

Would be cool to set up KVM to flip a bit in a VM's memory every X amount of time and see how long it will take for weird stuff to happen.


I'm not sure how useful this is, because memory interacts with pretty much everything.

I mean, great: you've validated that the important financial data you were going to write to the DB is correct. But you didn't validate that the OS itself is in full working order. A bit goes out of place, the kernel writes something weird to disk, filesystem becomes corrupted and things explode in a dramatic fashion.

That's exactly why I try to get ECC everywhere these days. I had an old box on firewall duty until one day it died: it got bumped, a memory module came loose somehow, and the resulting disk corruption rendered it unbootable. Applications verifying that their data is correct wouldn't have changed anything.


GitHub mirror, since there doesn't seem to be a proper tarball: https://github.com/jedisct1/libsecded

This also adds a cross-platform build script.


"Donated this to Microsoft Copilot for you."

In this case djb won't mind (CC0 license).

But it's time to move away from GitHub.


Similar software error-checking techniques are often used in embedded systems. External electromagnetic interference can cause program counter, register and memory corruptions, but hardening the hardware is often prohibitively expensive. When the reliability requirements are not too high, redundant software checks are often a solution - the goal is not to eliminate all failures, but to reduce their probability.

The now-deleted (due to lack of citations) Wikipedia article Immunity-aware programming [0] was a good overview of this topic. Relevant techniques include:

> Token passing: Every function is tagged with a unique function ID. When the function is called, the function ID is saved in a global variable. The function is only executed if the function ID in the global variable and the ID of the function match. If the IDs do not match, an instruction pointer error has occurred, and specific corrective actions can be taken. [...] This is essentially an "arm / fire" sequencing, for every function call. Requiring such a sequence is part of safe programming techniques, as it generates tolerance for single bit (or in this case, stray instruction pointer) faults.

> Data duplication: To cope with corruption of data, multiple copies of important registers and variables can be stored. Consistency checks between memory locations storing the same values, or voting techniques, can then be performed when accessing the data. [...] When the data is read out, the two sets of data are compared. A disturbance is detected if the two data sets are not equal. An error can be reported. If both sets of data are corrupted, a significant error can be reported and the system can react accordingly.

> [...] CRCs are calculated before and after transmission or duplication, and compared to confirm that they are equal. A CRC detects all one- or two-bit errors, all odd errors, all burst errors if the burst is smaller than the CRC, and most of the wide-burst errors. Parity checks can be applied to single characters (VRC—vertical redundancy check), resulting in an additional parity bit or to a block of data (LRC—longitudinal redundancy check), issuing a block check character. Both methods can be implemented rather easily by using an XOR operation. A trade-off is that less errors can be detected than with the CRC. Parity Checks only detect odd numbers of flipped bits. The even numbers of bit errors stay undetected. A possible improvement is the usage of both VRC and LRC, called Double Parity or Optimal Rectangular Code (ORC).

> Function parameter duplication: Parameters passed to procedures, as well as return values, are considered to be variables. Hence, every procedure parameter is duplicated, as well as the return values. A procedure is still called only once, but it returns two results, which must hold the same value. The source listing to the right shows a sample implementation of function parameter duplication.

> Test/branch duplication: To duplicate a [if-else] test at multiple locations in the program. [...] For every conditional test in the program, the condition and the resulting jump should be reevaluated, as shown in the figure. Only if the condition is met again, the jump is executed, else an error has occurred.

None of the mainstream compilers has these features; programmers often do all of these tasks by hand (!) in C (a rough sketch of what that hand-rolled approach looks like follows below). If someone implemented these kinds of features in GCC or LLVM/Clang (similar to how buffer overflow exploits are mitigated by automatic stack canaries or Control-Flow Integrity checks), it would be a major contribution to the entire world of embedded systems development.
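
To make that concrete, here is roughly what hand-written versions of the first two techniques look like (names and IDs are illustrative only; a real system would pick function IDs that differ in many bits and define a platform-appropriate fault reaction, e.g. a reset):

    /* Hand-rolled token passing (catches a stray instruction pointer)
       and data duplication (catches corrupted variables). Sketch only. */
    #include <stdint.h>
    #include <stdlib.h>

    #define FUNC_READ_SENSOR 0xA5u          /* illustrative function ID */

    static volatile uint8_t expected_id;    /* the "token" */

    static void fault_handler(void) { abort(); }  /* e.g. reset the MCU instead */

    static void arm(uint8_t id)   { expected_id = id; }
    static void check(uint8_t id) { if (expected_id != id) fault_handler(); }

    static int read_sensor(void) {
        check(FUNC_READ_SENSOR);            /* refuse to run unless "armed" */
        return 42;                          /* placeholder for real I/O */
    }

    int main(void) {
        arm(FUNC_READ_SENSOR);              /* arm, then fire */
        uint16_t level = (uint16_t)read_sensor();

        uint16_t shadow = (uint16_t)~level; /* inverted duplicate copy */
        /* ... time passes, bits may flip ... */
        if ((uint16_t)~shadow != level)     /* consistency check on access */
            fault_handler();
        return (int)level;
    }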

[0] https://web.archive.org/web/20180519034600/https://en.wikipe...


Thanks for the link.

What if, instead of passing tokens, checksums were passed, and the function checked that its code matched the checksum? This would give some protection against both corruption of the code and instruction pointer errors.

Another element from the article was having copies of the function and comparing the return values, but I suspect this breaks down when the function deals with external state. Possibly it could be done by intercepting the state-related calls and making them atomic/combining them. I feel like there's something here reminding me of STM [0].

I suspect that going for hardware with a fully ECC-covered execution architecture will always be a better investment of time, and will result in simpler, more scalable applications.

[0] https://www.infoq.com/news/2010/05/STM-Dropped/


This would be interesting to see in a JIT, even if on a sampling basis. I also wonder if some instruction filtering/detection approach would work for rowhammer.


Doesn't the LPDDR5 in the M2 support ECC? I believe it corrects errors but doesn't report them, no?


The ECC in DDR5/LPDDR5 corrects only internal errors and it has this extra facility only to counteract the degradation of reliability vs. DDR4/LPDDR4, due to smaller cells and faster operation.

It does not really increase reliability much over older generations; all the mentions of internal ECC are mostly marketing BS.

The ECC that is implemented in the memory controller inside the CPU package protects not only against bit flips in the DRAM arrays, but also against bit errors that happen elsewhere along the long path between memory chips and CPU chips, due to electrical noise, bad seating of CPUs or memory modules in their sockets, or cheap sockets whose contacts oxidize over time.

Due to the increased memory throughput, the links between CPU and memory become more and more sensitive to electrical noise with every new generation.

On laptops or small computers where both the CPUs and the memory chips are soldered on the same PCB, or they are stacked in the same package, ECC is somewhat less important, but on any computer with socketed memory modules ECC should have been mandatory.


Would you have a reference about ECC in LPDDR5, or Apple-specific LPDDR5 features, protecting against internal errors? The sources I found said it only has link error correction, e.g. https://www.synopsys.com/designware-ip/technical-bulletin/ke...


Which kinds of ECC are mandatory and which kinds of ECC are optional can be found only in the JEDEC standards, which are expensive.

On the Synopsys site, both at the link you provided and on other pages, only on-die ECC is mentioned for DDR5, which protects only against bit flips during storage, while only link ECC is mentioned for LPDDR5, which protects only against electrical noise on the PCB traces between the soldered LPDDR5 chips and the CPU chip.

It is likely that on-die ECC is considered more important for DDR5 because computers that use DDR5 modules are expected to have a larger amount of installed memory, which multiplies the frequency of bit errors during storage, while link ECC is considered more important for LPDDR5 because its data transfer speed is higher, which multiplies the bit errors due to electrical noise on the PCB link.

On-die ECC can be implemented even if the memory controller of the CPU is not aware of it. Each memory manufacturer may choose to implement on-die ECC or not, depending on the results of their in-house reliability tests for bit storage in their DRAM chips. Memory manufacturers have no need to mention whether they use some form of ECC internally, because it is transparent to the users of the memories.

Link ECC must be supported by the memory controller and included in the standardized memory interface, so I assume that this is restricted to LPDDR5 memories.


A web search doesn't bring up any references to this feature, other than the bus layer error coding for signal integrity in transit that is standard in LPDDR5.

Non-LP (and thus Non-Apple) DDR5 does have ECC.

An additional twist here is that apparently the ECC was added to DDR5 because process shrinks and memory size increases have caused an increase in bit flips, so this is needed to keep reliability at the previous non-ECC level. There's an additional "actually robust" level of ECC, which is still sold separately. [1]

I guess we might ask why LPDDR5 is missing the DDR5 equivalent "keep running to stay in the same place" ECC, and what this means to reliability...

[1] https://en.wikipedia.org/wiki/DDR5_SDRAM#DIMMs_versus_memory...


For a large array maybe you are better off with e.g. a Reed-Solomon code instead of a Hamming code.



