ECC should have become standard around the time memories passed 1GB. It's seriou...

loeg · 2026-03-06T00:05:23 1772755523

It's not even ECC price/availability that bothers me so much, it's that getting CPUs and motherboards that support ECC is non-trivial outside of the server space. The whole consumer class ecosystem is kind of shitty. At least AMD allows consumer class CPUs to kinda sorta use ECC, unlike Intel's approach where only the prosumer/workstation stuff gets ECC.

altairprime · 2026-03-08T15:31:32 1772983892

288-pin ECC is, I believe, available on any X670E/X870E platform boards so long as the motherboard builder hasn’t expressly interfered with it (and probably other chipsets as well?). Windows 10+ reports it as full ECC (multi-bit / 72-bits). AMD pushed that enable in an AGESA three or four years ago iirc. The CAS latency for ECC is about double what gaming RAM offers, but in practice other more costly factors tend to limit performance first. Any motherboard released before the AGESA update would be more difficult to predict, but that’s baseline uncertainty for PCs so no surprises there.

antonkochubey · 2026-03-09T10:37:30 1773052650

>The CAS latency for ECC is about double what gaming RAM offers

Ironically, overclocking ECC memory is much easier than overclocking non-ECC DIMMs, because you know exactly at which point you start encountering instability and need to dial back, instead of relying on.. application crashes and BSOD's to know that you're running way too optimistic clocks/timings.

Meanwhile I overclocked 'low clock / loose timing' ECC DIMMs on Ryzen 7 platform with no issues at all – kept increasing clocks and lowering timings until ECC started reporting errors, then dialed it back a couple notches, and now it is not just stable, but I also have exact reporting of it being stable.

altairprime · 2026-03-09T15:46:41 1773071201

Yeah! A stick of 5600 can generally reach 6000 with geardown off and that’s as far as I’ve seen cause to dial it. But certain parameters that are popular to lower for latency reduction can be, how would I put it, slightly less flexible — tREFI comes to mind as one that nearly any lowering of (on the enterprise sticks I’m using anyways) tends to cause DFE/MBIST training failures no matter what, even with direct airflow, before it ever boots far enough for memtest to expose ECC errors.

(For those out there following along with PCs, if you aren’t tuning with MBIST maxed out in your BIOS, you might want to revisit that.)

cromka · 2026-03-09T12:55:51 1773060951

Same experience here. I have a feeling ECC in gaming would soon become a thing if it wasn't for the pricing crisis.

rpcope1 · 2026-03-06T00:43:06 1772757786

I've been honestly amazed people actually buy stuff that's not "workstation" gear given IME how much more reliably and consistently it works, but I guess even a generation or two used can be expensive.

throwaway85825 · 2026-03-06T03:04:51 1772766291

Very few applications scale with cores. For the vast majority of people single core performance is all they care about, it's also cheaper. They don't need or want workstation gear.

rpcope1 · 2026-03-06T17:20:35 1772817635

I have come to doubt that single core or CPU performance in general, other than maybe specialty applications like CAD and some games, is all that noticeable for most computer users in the last decade. I can take relatively pedestrian users like my parents or my wife and put them in front of a decade old high end Haswell system or a brand new mega-$$$ threadripper/epyc and for almost all intents and purposes they don't notice a different. What they do notice is when things die. I'm sure consumer hardware might be OK for 2-3 years (maybe), but like for my parents, they're happier to keep using the same computer, and honestly the same Dell Precision system I gave them almost 10 years ago works great today, and I have a suspicion that the hardware, outside of maybe the SSD finally wearing out, will probably work right a decade from now too.

rafaelmn · 2026-03-06T09:23:52 1772789032

> Very few applications scale with cores

You mean like compilers and test suites ? Very few professional workloads don't parallelize well these days.

VorpalWay · 2026-03-06T11:47:13 1772797633

Compilers and test suits do scale (at least for C/C++ and Rust, which is what I work with). But I think the parent comment referred to consumer applications: games, word processing, light browsing, ...

(Though games these days scale better than they used to, but only up to a to a point.)

I find that most tools I write for my own use can be made to scale with cores, or run so fast that the overhead of starting threads is longer than the program runtime. But I write that in Rust which makes parallelism easy. If I wrote that code in C++ I would probably not bother with trying to parallelize.

rafaelmn · 2026-03-06T14:47:48 1772808468

But those tools aren't really compute bound anyway - you're not buying a workstation to do them, you're getting a consumer laptop or a tablet.

loeg · 2026-03-06T16:58:59 1772816339

And that consumer device should have ECC! That's the whole discussion here.

zadikian · 2026-03-06T17:21:57 1772817717

It's confusing because a few comments up is "for the vast majority of people single core performance is all they care about, it's also cheaper" which is unrelated to ECC.

loeg · 2026-03-06T18:07:47 1772820467

I think it's coherent -- it's an argument for why most people don't want to buy Workstation class products just to get ECC. (Prices scale with core count. Not linearly, but still.)

rafaelmn · 2026-03-07T08:56:32 1772873792

Why ? If your device is a thin client for web services/gaming the risk of bitflips/bad ram is a minor annoyance.

loeg · 2026-03-07T16:20:32 1772900432

I disagree with your handwaving bitflips away as a minor annoyance. Consumers don't love software crashing, even if they don't have any data they care about.

Imagine ECC was free -- would you rather have free ECC and no bitflips, or no ECC and bitflips? It's hard to imagine choosing bitflips.

throwaway85825 · 2026-03-09T07:31:24 1773041484

ECC would save an unbelievable amount of labor. A shocking number of people have jobs looking at various logs.

loeg · 2026-03-06T17:00:07 1772816407

Test suites often don't scale, actually. Unit tests usually run single-threaded by default, and also relatively often have side effects on the system that mean they're unsafe to run in parallel. (Sure, sure, you could definitely argue the latter thing is a skill issue.)

zadikian · 2026-03-06T17:12:09 1772817129

In theory, do you need a single machine for any of that, or would it be cheaper to use a low-availability cloud cluster? Tests are totally independent, and builds probably parallel enough.

throwaway85825 · 2026-03-06T16:23:23 1772814203

Only a small percentage of computer users are programmers.

zadikian · 2026-03-06T07:05:54 1772780754

There were several years where used cheese grater Mac Pros could be bought and upgraded for very cheap, and were still not too outdated. I only replaced my MacPro4,1 when the M1 mini came out, mainly cause of wattage.

loeg · 2026-03-06T00:45:30 1772757930

I've had zero issues with AMD's consumer tier of non-WX Threadripper and Ryzen models, FWIW.

thousand_nights · 2026-03-06T01:17:29 1772759849

overblown? billions of users use consumer tier hardware just fine. i have servers at home with years of uptime without any ECC memory

conception · 2026-03-06T04:32:41 1772771561

But how much bit rot? You’ll never know.

Maxion · 2026-03-06T08:19:05 1772785145

If I don't know about it, then how does it affect me / why should I care? My home server does what it is supposed to do and has done so for a decade. If bit rot /bit flips in memory does not affect my day-to-day life I much prefer cheaper hardware.

I do hope the nuclear powerplant next door uses more fault tolerant hardware, though.

loeg · 2026-03-06T17:01:51 1772816511

Eventually you might notice the pictures or other documents you were saving on your home server have artifacts, or no longer open. This is undesireable for most people using computer storage.

> I much prefer cheaper hardware.

The cost savings are modest; order of magnitude 12% for the DIMMs, and less elsewhere. Computers are already extremely cheap commodities.

zadikian · 2026-03-06T17:18:21 1772817501

12% for the DIMMs only, but with Intel you need Xeon and its accompanying motherboard for it. Someone said AMD "kinda" lets you do ECC on consumer hardware, not sure what the caveats are besides just being unbuffered.

Assuming that's more due to intentional market segmentation than actual cost, yeah I would pay 12% more for ECC. But I'm with the other guy on not valuing it a ton. I have backups which are needed regardless of bitrot, and even if those don't help, losing a photo isn't a huge deal for me.

loeg · 2026-03-06T18:10:36 1772820636

> Someone said AMD "kinda" lets you do ECC on consumer hardware, not sure what the caveats are besides just being unbuffered.

That was me. It isn't "officially" supported by AMD, but it should work. You can enable EDAC monitoring in Linux and observe detected correction events happening.

> Assuming that's more due to intentional market segmentation than actual cost

That's the argument, yeah.

zadikian · 2026-03-06T17:09:54 1772816994

I'm more concerned how the Mac filesystems don't have payload checksums.

deepsun · 2026-03-06T07:51:56 1772783516

I hate my workstation desktop I assembled 15 years. It just doesn't break! I have no excuses to buy a new one (except for video card).

justin66 · 2026-03-06T06:39:02 1772779142

> ECC should have become standard around the time memories passed 1GB.

Ironically, that's around the time Intel started making it difficult to get ECC on desktop machines using their CPUs. The Pentium 3 and 440BX chipset, maxing out at 1GB, were probably the last combo where it pretty commonly worked with a normal desktop board and normal desktop processor.

WatchDog · 2026-03-06T01:09:13 1772759353

All DDR5 ram has some amount of error correction built in, because DDR5 is much more prone to bit flipping, it requires it.

I'm not really sure if this makes it overall more or less reliable than DDR2/3/4 without ECC though.

jml7c5 · 2026-03-06T12:20:34 1772799634

As I understand it, DDR5's on-die ECC is mostly a cost-saving measure. Rather than fab perfect DRAM that never flips a bit in normal operation (expensive, lower yield), you can fab imperfect DRAM that is expected to sometimes flip, but then use internal ECC to silently correct it. The end result to the user is theoretically the same.

Because you can't track on-die ECC errors, you have no way of knowing how "faulty" a particular DRAM chip is. And if there's an uncorrected error, you can't detect it.

himata4113 · 2026-03-06T04:04:34 1772769874

that doesn't help when the bit is lost between the cpu and the memory unfortunately, it only really helps passing poor quality dram as it gets corrected for single bit flips, not that reliable either it's a yield / density enabler rather than a system reliability thing.

it's "ECC" but not the ecc you want, marketing garbage.

jcalvinowens · 2026-03-06T18:14:52 1772820892

DDR5 on-die ECC detects and corrects one-bit errors. It cannot detect two-bit errors, so it will miscorrect some of them into three-bit errors. However, the on-die error correction scheme is specifically specially designed such that the resulting three-bit errors are mathematically guaranteed to be detected as uncorrectable two-bit errors by a standard full system-level ECC running on top of the on-die ECC.

matheusmoreira · 2026-03-06T22:39:41 1772836781

ECC also reports error recovery statistics to the operating system. Lets you know if any unrecoverable errors happened. Lets you calculate the error rate which means you can try to predict when your memory modules are going bad.

I think this sort of reporting is a pretty basic feature that should come standard on all hardware. No idea why it's an "enterprise" feature. This market segmentation is extremely annoying and shouldn't exist.

tombert · 2026-03-06T04:09:41 1772770181

I am not sure I've ever seen a laptop that has ECC memory. I'm sure they exist but I don't think I've seen it.

I would definitely like to have a laptop with ECC, because obviously I don't want things to crash and I don't want corrupted data or anything like that, but I don't really use desktop computers anymore.

bpye · 2026-03-06T07:48:36 1772783316

There are 16" laptops with ECC, you can get a ThinkPad P16 with it for example. I've yet to find any 14" devices with ECC though.

tombert · 2026-03-06T07:54:33 1772783673

Interesting, I actually have a thinkpad p16s, surprised I didn’t notice ECC availability.

oybng · 2026-03-06T00:55:49 1772758549

For the unaware, Intel is to blame for this

johanyc · 2026-03-06T08:32:04 1772785924

Can you explain

samus · 2026-03-06T10:28:39 1772792919

It makes economic sense to keep selling non-ECC hardware to maintain market segmentation.

aforwardslash · 2026-03-06T00:18:54 1772756334

ECC are traditionally slower, quite more complex, and they dont completely eliminate the problem (most memories correct 1 bit per word and detect 2 bits per word). They make sense when environmental factors such as flaky power, temperature or RF interference can be easily discarded - such as a server room. But yeah, I agree with you, as ECC solves like 99% of the cases.

indolering · 2026-03-06T00:40:12 1772757612

Being able to detect these issues is just as important as preventing them.

aforwardslash · 2026-03-06T00:55:20 1772758520

Thing is, every reported bug can be a bit flip. You can actually in some cases have successful execution, but bitflips in the instrumentation reporting errors that dont exist.

russdill · 2026-03-06T11:02:01 1772794921

The amount of overhead a few bits of ECC has is basically a rounding error, and even then, the only time the hardware is really doing extra work is when bit errors occur and correction has to happen.

The main overhead is simply the extra RAM required to store the extra bits of ECC.

jeffbee · 2026-03-06T00:58:02 1772758682

ECC are "slower" because they are bought by smart people who expect their memory to load the stored value, rather than children who demand racing stripes on the DIMMs.

matja · 2026-03-06T10:57:01 1772794621

The actual RAM chips on a ECC DIMM are exactly the same as a non-ECC DIMM, there's just an extra 1/2/4 chips to extend to 72 bit words.

The main reason ECC RAM is slower is because it's not (by default) overclocked to the point of stability - the JEDEC standard speeds are used.

The other much smaller factors are:

* The tREFi parameter (refresh interval) is usually double the frequency on ECC RAM, so that it handles high-temperature operation. * Register chip buffers the command/address/control/clock signals, adding a clock of latency the every command (<1ns, much smaller than the typical memory latency you'd measure from the memory controller) * ECC calculation (AMD states 2 UMC cycles, <1ns).

Dylan16807 · 2026-03-06T08:13:31 1772784811

ECC keeps your bits safe from random flips to a ridiculously large factor. You can run the memory at high consumer speeds, giving up some of that safety margin, while still being more reliable than everything else in your computer.

And there's non-random bit errors that can hit you at any speed, so it's not like going slow guarantees safety.

undersuit · 2026-03-06T03:45:23 1772768723

ECC is actually slower. The hardware to compute every transaction is correct does add a slight delay, but nothing compared to the delay of working on corrupted data.

throwaway85825 · 2026-03-06T03:06:08 1772766368

There's just no demand for high speed ECC aside from a few people making their own dimms.

ece · 2026-03-06T15:15:55 1772810155

Looking back, I actually think the older the RAM the more likely you're able to notice bit-flips and they harm your workflow. EDO RAM was the worst in my experience (my first computer), SDRAM was a bit better, and random bit-flips atleast under load got very rare after DDR2. I think Google even had a paper comparing DDR1 vs DDR2 (link: https://static.googleusercontent.com/media/research.google.c...).

That said, memory DIMM capacity increases with even a small chance of bit-flips means lots of people will still be affected.

hedora · 2026-03-06T02:37:11 1772764631

ECC is standard at this point (current RAM flips so many bits it's basically mandatory). Also, most CPUs have "machine checks" that are supposed to detect incorrect computations + alert the OS.

However, there are still gaps. For one thing, the OS has to be configured to listen for + act on machine check exceptions.

On the hardware level, there's an optional spec to checksum the link between the CPU and the memory. Since it's optional, many consumer machines do not implement it, so then they flip bits not in RAM, but on the lines between the RAM and the CPU.

It's frustrating that they didn't mandate error detection / correction there, but I guess the industry runs on price discrimination, so most people can't have nice things.