It's not even ECC price/availability that bothers me so much, it's that getting CPUs and motherboards that support ECC is non-trivial outside of the server space. The whole consumer class ecosystem is kind of shitty. At least AMD allows consumer class CPUs to kinda sorta use ECC, unlike Intel's approach where only the prosumer/workstation stuff gets ECC.
288-pin ECC UDIMMs are, I believe, supported on any X670E/X870E board so long as the motherboard builder hasn’t expressly interfered with it (and probably other chipsets as well?). Windows 10+ reports it as full ECC (multi-bit / 72-bit). AMD enabled that in an AGESA update three or four years ago, iirc. The CAS latency for ECC is about double what gaming RAM offers, but in practice other, more costly factors tend to limit performance first. Any motherboard released before that AGESA update is harder to predict, but that’s baseline uncertainty for PCs, so no surprises there.
>The CAS latency for ECC is about double what gaming RAM offers
Ironically, overclocking ECC memory is much easier than overclocking non-ECC DIMMs, because you know exactly at which point you start encountering instability and need to dial back, instead of relying on application crashes and BSODs to know that you're running way too optimistic clocks/timings.
Meanwhile I overclocked 'low clock / loose timing' ECC DIMMs on a Ryzen 7 platform with no issues at all: kept increasing clocks and tightening timings until ECC started reporting errors, then dialed it back a couple notches, and now it is not just stable, I also have exact reporting of it being stable.
Yeah! A stick of 5600 can generally reach 6000 with geardown off, and that’s as far as I’ve seen cause to dial it. But certain parameters that are popular to tweak for latency reduction can be, how would I put it, slightly less flexible. tREFI comes to mind: nearly any adjustment (on the enterprise sticks I’m using, anyways) tends to cause DFE/MBIST training failures no matter what, even with direct airflow, before it ever boots far enough for memtest to expose ECC errors.
(For those out there following along with PCs, if you aren’t tuning with MBIST maxed out in your BIOS, you might want to revisit that.)
I've been honestly amazed people actually buy stuff that's not "workstation" gear, given how much more reliably and consistently it works in my experience, but I guess even a generation or two used can be expensive.
Very few applications scale with cores. For the vast majority of people single core performance is all they care about, it's also cheaper. They don't need or want workstation gear.
I have come to doubt that single-core or CPU performance in general, other than maybe specialty applications like CAD and some games, is all that noticeable for most computer users in the last decade. I can take relatively pedestrian users like my parents or my wife and put them in front of a decade-old high-end Haswell system or a brand-new mega-$$$ Threadripper/Epyc, and for almost all intents and purposes they don't notice a difference. What they do notice is when things die. I'm sure consumer hardware might be OK for 2-3 years (maybe), but my parents are happier to keep using the same computer, and honestly the same Dell Precision system I gave them almost 10 years ago works great today. I have a suspicion that the hardware, outside of maybe the SSD finally wearing out, will probably work right a decade from now too.
Compilers and test suites do scale (at least for C/C++ and Rust, which is what I work with). But I think the parent comment referred to consumer applications: games, word processing, light browsing, ...
(Though games these days scale better than they used to, but only up to a point.)
I find that most tools I write for my own use can be made to scale with cores, or run so fast that the overhead of starting threads is longer than the program runtime. But I write that in Rust which makes parallelism easy. If I wrote that code in C++ I would probably not bother with trying to parallelize.
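As a minimal sketch of what that looks like (hypothetical code, not anyone's actual tool): summing chunks of a buffer with the standard library's scoped threads (Rust 1.63+), which can borrow local data directly, so no external crates are needed.

```rust
use std::thread;

// Hypothetical example: sum chunks of a slice on several worker threads.
// thread::scope lets workers borrow `data` without 'static bounds or unsafe.
fn parallel_sum(data: &[u64], workers: usize) -> u64 {
    let workers = workers.max(1);
    let chunk = ((data.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().copied().sum::<u64>()))
            .collect();
        // Join each worker and combine the partial sums.
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let data: Vec<u64> = (1..=100).collect();
    println!("{}", parallel_sum(&data, 8)); // prints 5050
}
```

For tiny inputs the thread spawn overhead dominates, which is the "runs faster than the threads start" situation described above.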
It's confusing because a few comments up is "for the vast majority of people single core performance is all they care about, it's also cheaper" which is unrelated to ECC.
I think it's coherent -- it's an argument for why most people don't want to buy Workstation class products just to get ECC. (Prices scale with core count. Not linearly, but still.)
I disagree with your handwaving bitflips away as a minor annoyance. Consumers don't love software crashing, even if they don't have any data they care about.
Imagine ECC was free -- would you rather have free ECC and no bitflips, or no ECC and bitflips? It's hard to imagine choosing bitflips.
Test suites often don't scale, actually. Unit tests frequently run single-threaded by default (Rust's cargo test is a notable exception), and also relatively often have side effects on the system that mean they're unsafe to run in parallel. (Sure, sure, you could definitely argue the latter thing is a skill issue.)
In theory, do you need a single machine for any of that, or would it be cheaper to use a low-availability cloud cluster? Tests are totally independent, and builds are probably parallel enough.
There were several years where used cheese grater Mac Pros could be bought and upgraded for very cheap, and were still not too outdated. I only replaced my MacPro4,1 when the M1 mini came out, mainly cause of wattage.
If I don't know about it, then how does it affect me / why should I care? My home server does what it is supposed to do and has done so for a decade. If bit rot / bit flips in memory don't affect my day-to-day life, I much prefer cheaper hardware.
I do hope the nuclear power plant next door uses more fault-tolerant hardware, though.
Eventually you might notice the pictures or other documents you were saving on your home server have artifacts, or no longer open. This is undesirable for most people using computer storage.
> I much prefer cheaper hardware.
The cost savings are modest; order of magnitude 12% for the DIMMs, and less elsewhere. Computers are already extremely cheap commodities.
12% for the DIMMs only, but with Intel you need Xeon and its accompanying motherboard for it. Someone said AMD "kinda" lets you do ECC on consumer hardware, not sure what the caveats are besides just being unbuffered.
Assuming that's more due to intentional market segmentation than actual cost, yeah I would pay 12% more for ECC. But I'm with the other guy on not valuing it a ton. I have backups which are needed regardless of bitrot, and even if those don't help, losing a photo isn't a huge deal for me.
> Someone said AMD "kinda" lets you do ECC on consumer hardware, not sure what the caveats are besides just being unbuffered.
That was me. It isn't "officially" supported by AMD, but it should work. You can enable EDAC monitoring in Linux and observe detected correction events happening.
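For the curious, a rough sketch of what monitoring that looks like, reading the counters the Linux EDAC driver exposes under sysfs (paths are the standard EDAC layout; this is an illustration, not a hardened tool):

```rust
use std::fs;
use std::path::Path;

// Rough sketch: walk the EDAC sysfs tree and collect per-memory-controller
// corrected (ce_count) and uncorrected (ue_count) error totals.
// Returns an empty list when the EDAC driver isn't loaded or the path
// doesn't exist, so it's safe to run anywhere.
fn edac_counts(base: &Path) -> Vec<(String, u64, u64)> {
    let mut out = Vec::new();
    let Ok(entries) = fs::read_dir(base) else {
        return out;
    };
    for entry in entries.flatten() {
        let p = entry.path();
        // Only memory-controller directories (mc0, mc1, ...) carry counters.
        if !p.join("ce_count").exists() {
            continue;
        }
        let read = |name: &str| -> u64 {
            fs::read_to_string(p.join(name))
                .ok()
                .and_then(|s| s.trim().parse().ok())
                .unwrap_or(0)
        };
        out.push((
            entry.file_name().to_string_lossy().into_owned(),
            read("ce_count"),
            read("ue_count"),
        ));
    }
    out
}

fn main() {
    let counts = edac_counts(Path::new("/sys/devices/system/edac/mc"));
    if counts.is_empty() {
        println!("no EDAC memory controllers reported");
    } else {
        for (mc, ce, ue) in counts {
            println!("{mc}: corrected={ce} uncorrected={ue}");
        }
    }
}
```

A steadily climbing ce_count is exactly the "dial the overclock back a couple notches" signal mentioned earlier in the thread.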
> Assuming that's more due to intentional market segmentation than actual cost
> ECC should have become standard around the time memories passed 1GB.
Ironically, that's around the time Intel started making it difficult to get ECC on desktop machines using their CPUs. The Pentium 3 and 440BX chipset, maxing out at 1GB, were probably the last combo where it pretty commonly worked with a normal desktop board and normal desktop processor.
As I understand it, DDR5's on-die ECC is mostly a cost-saving measure. Rather than fab perfect DRAM that never flips a bit in normal operation (expensive, lower yield), you can fab imperfect DRAM that is expected to sometimes flip, but then use internal ECC to silently correct it. The end result to the user is theoretically the same.
Because you can't track on-die ECC errors, you have no way of knowing how "faulty" a particular DRAM chip is. And if there's an uncorrected error, you can't detect it.
That doesn't help when the bit is lost between the CPU and the memory, unfortunately. It mostly helps pass off poor-quality DRAM, since single-bit flips get corrected internally; it's not that reliable either. It's a yield/density enabler rather than a system-reliability feature.
It's "ECC", but not the ECC you want. Marketing garbage.
DDR5 on-die ECC detects and corrects one-bit errors. It cannot detect two-bit errors, so it will miscorrect some of them into three-bit errors. However, the on-die correction scheme is specifically designed such that the resulting three-bit errors are mathematically guaranteed to be detected as uncorrectable by a standard system-level ECC running on top of the on-die ECC.
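To make the correct-one/detect-two behavior concrete, here's a toy SECDED (single-error-correct, double-error-detect) code in Rust: Hamming(7,4) plus an overall parity bit. Real memory ECC protects a wider word (64 data bits with 8 check bits), and this is not the actual DDR5 code, just the same principle in miniature.

```rust
// Toy SECDED code on 4 data bits: corrects any single flipped bit,
// detects (but cannot fix) any two flipped bits.

#[derive(Debug, PartialEq)]
enum Decoded {
    Clean(u8),     // no error seen
    Corrected(u8), // single-bit error found and fixed
    Uncorrectable, // double-bit error: detected but not fixable
}

// Encode 4 data bits into an 8-bit codeword.
// Bit 0 = overall parity; bits 1, 2, 4 = Hamming parity; bits 3, 5, 6, 7 = data.
fn encode(data: u8) -> u8 {
    let d = [data & 1, (data >> 1) & 1, (data >> 2) & 1, (data >> 3) & 1];
    let p1 = d[0] ^ d[1] ^ d[3]; // covers codeword bits 1, 3, 5, 7
    let p2 = d[0] ^ d[2] ^ d[3]; // covers codeword bits 2, 3, 6, 7
    let p3 = d[1] ^ d[2] ^ d[3]; // covers codeword bits 4, 5, 6, 7
    let cw = (p1 << 1) | (p2 << 2) | (d[0] << 3) | (p3 << 4)
        | (d[1] << 5) | (d[2] << 6) | (d[3] << 7);
    cw | (cw.count_ones() as u8 & 1) // bit 0 makes total parity even
}

fn decode(mut cw: u8) -> Decoded {
    let bit = |w: u8, i: u8| (w >> i) & 1;
    // Each syndrome bit re-checks one parity group; together they
    // spell out the position of a single flipped bit.
    let syndrome = (bit(cw, 1) ^ bit(cw, 3) ^ bit(cw, 5) ^ bit(cw, 7))
        | (bit(cw, 2) ^ bit(cw, 3) ^ bit(cw, 6) ^ bit(cw, 7)) << 1
        | (bit(cw, 4) ^ bit(cw, 5) ^ bit(cw, 6) ^ bit(cw, 7)) << 2;
    let parity_ok = cw.count_ones() % 2 == 0;
    let data = |w: u8| bit(w, 3) | bit(w, 5) << 1 | bit(w, 6) << 2 | bit(w, 7) << 3;
    match (syndrome, parity_ok) {
        (0, true) => Decoded::Clean(data(cw)),
        (s, false) => {
            // Overall parity failed: exactly one bit flipped. The syndrome
            // points at it (s == 0 means the parity bit itself flipped).
            if s != 0 {
                cw ^= 1 << s;
            }
            Decoded::Corrected(data(cw))
        }
        // Two flips: overall parity looks fine but the syndrome doesn't.
        (_, true) => Decoded::Uncorrectable,
    }
}

fn main() {
    let cw = encode(0b1011);
    println!("{:?}", decode(cw));               // Clean(11)
    println!("{:?}", decode(cw ^ 0b0010_0000)); // one flip -> Corrected(11)
    println!("{:?}", decode(cw ^ 0b0110_0000)); // two flips -> Uncorrectable
}
```

Flipping one bit of a codeword decodes back to the original data; flipping two lands in Uncorrectable, which is the "detected but not fixed" case the thread is discussing.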
ECC also reports error-recovery statistics to the operating system. That lets you know if any unrecoverable errors happened, and lets you calculate the error rate, which means you can try to predict when your memory modules are going bad.
I think this sort of reporting is a pretty basic feature that should come standard on all hardware. No idea why it's an "enterprise" feature. This market segmentation is extremely annoying and shouldn't exist.
I am not sure I've ever seen a laptop that has ECC memory. I'm sure they exist but I don't think I've seen it.
I would definitely like to have a laptop with ECC, because obviously I don't want things to crash and I don't want corrupted data or anything like that, but I don't really use desktop computers anymore.
ECC modules are traditionally slower and somewhat more complex, and they don't completely eliminate the problem (most memories correct 1 bit per word and detect 2 bits per word). They make sense where environmental factors such as flaky power, temperature, or RF interference can be easily ruled out, such as a server room. But yeah, I agree with you, as ECC solves like 99% of the cases.
Thing is, every reported bug can be a bit flip. You can actually, in some cases, have successful execution but bit flips in the instrumentation reporting errors that don't exist.
The amount of overhead a few bits of ECC has is basically a rounding error, and even then, the only time the hardware is really doing extra work is when bit errors occur and correction has to happen.
The main overhead is simply the extra RAM required to store the extra bits of ECC.
ECC are "slower" because they are bought by smart people who expect their memory to load the stored value, rather than children who demand racing stripes on the DIMMs.
The actual RAM chips on an ECC DIMM are exactly the same as on a non-ECC DIMM; there are just 1/2/4 extra chips to extend to 72-bit words.
The main reason ECC RAM is slower is that it's not (by default) overclocked to the edge of stability; the JEDEC standard speeds are used.
The other much smaller factors are:
* Refresh usually runs at double rate (halved tREFI, the refresh interval) on ECC RAM, so that it handles high-temperature operation.
* The register chip buffers the command/address/control/clock signals, adding a clock of latency to every command (<1ns, much smaller than the typical memory latency you'd measure from the memory controller).
* ECC calculation (AMD states 2 UMC cycles, <1ns).
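One nit on comparing CAS latencies across speed grades: CL is counted in clock cycles, so it only means something once converted to nanoseconds. A quick back-of-the-envelope (the speed/CL pairs below are just illustrative, not any specific SKU):

```rust
// CAS latency in nanoseconds = CL cycles / memory clock (MHz) * 1000,
// where the memory clock is half the MT/s data rate (DDR = 2 transfers/clock).
fn cas_ns(mt_per_s: f64, cl: f64) -> f64 {
    cl / (mt_per_s / 2.0) * 1000.0
}

fn main() {
    // Illustrative numbers: a JEDEC-ish ECC stick vs a tuned gaming kit.
    println!("DDR5-5600 CL46: {:.1} ns", cas_ns(5600.0, 46.0)); // ~16.4 ns
    println!("DDR5-6000 CL30: {:.1} ns", cas_ns(6000.0, 30.0)); // ~10.0 ns
}
```

So a big gap in CL cycles shrinks once the higher data rate is factored in, which is why the JEDEC-speed penalty is smaller in wall-clock terms than the raw CL numbers suggest.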
ECC keeps your bits safe from random flips by a ridiculously large factor. You can run the memory at high consumer speeds, giving up some of that safety margin, while still being more reliable than everything else in your computer.
And there's non-random bit errors that can hit you at any speed, so it's not like going slow guarantees safety.
ECC is actually slower. The hardware that verifies every transaction is correct does add a slight delay, but nothing compared to the delay of working on corrupted data.
Looking back, I actually think the older the RAM, the more likely you were to notice bit flips harming your workflow. EDO RAM was the worst in my experience (my first computer), SDRAM was a bit better, and random bit flips, at least under load, got very rare after DDR2. I think Google even had a paper comparing DDR1 vs DDR2 (link: https://static.googleusercontent.com/media/research.google.c...).
That said, as DIMM capacities increase, even a small per-bit chance of flips means lots of people will still be affected.
ECC is standard at this point (current RAM flips so many bits it's basically mandatory). Also, most CPUs have "machine checks" that are supposed to detect incorrect computations + alert the OS.
However, there are still gaps. For one thing, the OS has to be configured to listen for + act on machine check exceptions.
On the hardware level, there's an optional spec to checksum the link between the CPU and the memory. Since it's optional, many consumer machines do not implement it, so then they flip bits not in RAM, but on the lines between the RAM and the CPU.
It's frustrating that they didn't mandate error detection / correction there, but I guess the industry runs on price discrimination, so most people can't have nice things.
It's seriously annoying that ECC memory is hard to get and expensive, but memory with useless LEDs attached is cheap.