Imaging a Hard Drive with Non-ECC Memory – What Could Go Wrong? (robertelder.org)
169 points by robertelder on March 4, 2023 | hide | past | favorite | 76 comments



Bit flips are totally real; at scale you will definitely see them in large queries. There was a fun talk at DEFCON on bitsquatting, the practice of buying domain names that are one bit off from popular ones and then accepting all incoming connections. Attacks like Rowhammer similarly abuse erroneous bit flips. Supposedly Microsoft can detect solar activity based on the number of Windows crash logs they receive.

DEFCON Talk: https://www.youtube.com/watch?v=aT7mnSstKGs

https://en.wikipedia.org/wiki/Bitsquatting

https://en.wikipedia.org/wiki/Row_hammer
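For a sense of what "one bit off" means in practice, here's a small sketch (the function name and the validity filter are my own, not from the talk) that enumerates the strings one bit flip away from a domain and keeps those that are still plausible hostnames:

```python
def bitsquats(domain):
    """Hostnames exactly one bit flip away from `domain` (lowercased, deduped)."""
    variants = set()
    raw = bytearray(domain.encode("ascii"))
    for i in range(len(raw)):
        original = raw[i]
        for bit in range(8):
            raw[i] = original ^ (1 << bit)  # flip one bit of one byte
            candidate = bytes(raw).decode("ascii", errors="ignore").lower()
            # keep only flips that still look like a legal hostname
            if len(candidate) == len(domain) and all(
                c.isalnum() or c in "-." for c in candidate
            ):
                variants.add(candidate)
        raw[i] = original  # restore before moving to the next byte
    variants.discard(domain.lower())
    return sorted(variants)
```

Most flips are filtered out because they don't produce a legal hostname character; registering a few of the survivors for a popular domain is essentially what the talk describes.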


There was an interesting post from Mozilla about this. There's a certain class of telemetry errors which seem to only be caused by cosmic rays

https://blog.mozilla.org/data/2022/04/13/this-week-in-glean-...


I remember reading somewhere about that talk being debunked. Maybe someone more resourceful than me can find it.

It was something about being more likely to be a human typo or a config change that rolled out to a bunch of machines. The statistics didn't add up, and it wasn't plausible that bit flips caused it.


There are many places where memory exists. Your processor cache is memory. Registers are memory-ish. Hard drives have memory. If a bit flip happens in your hard drive's memory before being written to disk, then it's not unreasonable to think the bit flip would persist, even through reboots.

I have run queries at large companies and found mistakes most easily explained as bit flips in domain names written to disk. Imagine an environment variable configuring the use of a proxy without a proper whitelist, and it's not unimaginable to me that a production machine would be able to talk to machines on the internet at large.

I am open to the idea that what I think is happening might not be the mechanics of what is happening, but I find the talk believable, not based on theory, but actually seeing persisted (and non-persisted) bit flips in domain names queried from data warehoused logs at world scale companies.


That was the gist of the response, as far as I remember. It was repeated access from the same set of machines, so it was more likely that one bit flip persisted in a config and was subsequently rolled out to others. So the study grossly overestimated the number of actual bit flips occurring by like 20000x or something.


Unsure about the talk, but I read a research report from someone at, I think, Cisco who purchased a second-level domain of a bitsquatted US state (think statenXX.us instead of state.XX.us). The amount of email they received was staggering, and I'd have trouble believing that many people would make that mistake while typing.


That's more "typo-squatting" than "bitsquatting", but it's definitely the same broad concept.


I don't know of a debunking, but more numbers can be found in the associated blog post[1].

[1]: http://dinaburg.org/bitsquatting.html


"Supposedly" is false. The sun doesn't produce particles at cosmic-ray energies, and soft errors aren't affected by sunspot activity. The rate is affected by altitude and space weather, but not by solar activity.

https://en.wikipedia.org/wiki/Soft_error

I was in the audience at the talk. All devices should be required to use ECC, because going without it is a security risk. Not as much of a risk as in the http:// era, but silent corruption across networks and systems is a real thing.


> space weather, but not solar activity

Space weather is solar activity.


Not entirely. Extrasolar activity is the most likely cause for energetic cosmic rays at the Earth's surface.


Well, mostly...


The link you provided contradicts your comment.

From the link in your comment: "The average rate of cosmic-ray soft errors is inversely proportional to sunspot activity. That is, the average number of cosmic-ray soft errors decreases during the active portion of the sunspot cycle and increases during the quiet portion."


One of my favorite DEFCON talks of all time.


ECC is good, and I genuinely wish it were more common. Thankfully, Ryzen CPUs support ECC by default (except for pre-7000 series with integrated graphics that aren't "Pro" versions), so long as the motherboard does, too (like all ASRock that I've seen). I'm running several Ryzen servers with ECC.

On the other hand, there are many, many systems out there that don't have ECC, nor even the option of it. While every video on YouTube wants us to believe that the difference between 580 and 585 frames per second in some silly game or another makes all the difference in the world, for me the difference between a system that runs 10% slower and one that crashes in the middle of the night is what's actually significant. I test all my systems at a certain memory frequency, then back off to the next slower frequency just to be sure.

That doesn't stop memory errors from happening, but most of my systems have lived their entire lives without random crashes or random segfaults. I consider that worthwhile.


Crashes in the middle of the night are not what worries me. Who cares. It's silent data loss that can go unnoticed for a very long time. And not just a single bit. If the flip hits file system structures or file layout you can have massive silent data loss.


Yep. It's why ZFS, BTRFS, Ceph and Gluster matter. Being able to detect that data at rest has gone wrong, and being able to reconstruct the original state is a big deal.

I'd like to think that as NAND continues to scale up in capacity and lower in cost, that we'll see some real shakeup to filesystems and storage where self-healing mass storage can be genuinely commoditized -- not something that's only accessible to businesses (and computing enthusiasts) due to cost and complexity.


Absolutely, but these are only half the solution. You still have to be sure that the data you're passing to the filesystem is not already corrupted in memory.


Indeed. I have a very low power storage server with tons of ECC, and the better consumer-grade NASes also tend to have it. Again, the issue here is that it's functionally limited to enthusiasts today, because of cost and the average person being completely unaware of the impact.

My hope as we move into more advanced fabrication nodes is the increasing shift to HBM in the data center space starts to at least create an HBM option in the consumer side of things. I expect at least AMD to try that push in 2027 and beyond, and I’m sure Intel is looking hard at it too. Granted, Apple is already there with its higher end silicon.

Granted, I still expect there to be product line segmentation with ECC, as it’s a good lever to push a buyer into a higher end product. Though when it’s done on package, you at least eliminate the need for a main board to actually have the traces, and the external modules to have the extra memory. So it might be the easier route to get to more ECC in the consumer space, at least for mid-range and up personal computers.


My most precious personal data are my family pictures. I protect them with par2. Par2 files are basically checksums for your files, but they carry enough added information that you can also repair your files if they are damaged.

At one point I wrote a shell script that verified all my photos, but I don't bother with that anymore. I just back everything up, and if there's ever a problem, the parity files provide an additional safety net.
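As a sketch of the "verify all my photos" part (par2 handles the repair side; this only does detection, and the function names are my own):

```python
import hashlib
import pathlib

def sha256_file(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so huge photos don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(root, manifest):
    """Compare every file under `root` against a {relative_path: digest} manifest.

    Returns the list of paths whose current digest no longer matches."""
    bad = []
    for path in sorted(pathlib.Path(root).rglob("*")):
        if path.is_file():
            rel = str(path.relative_to(root))
            if manifest.get(rel) != sha256_file(path):
                bad.append(rel)
    return bad
```

A cron job running `verify` against a manifest stored with the backups would flag silent corruption before it propagates.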


How do you protect/ensure the integrity of your par2 files?


In case you’re not joking: I don’t. It’s the one and only line of defense.


> so long as the motherboard does, too (like all ASRock that I've seen).

I built a home server last year with an ASRock X570M Pro4 [0] with a Ryzen 4750 PRO (which I had to source OEM from Aliexpress as it's not sold direct). I'm not sure what's the current situation, but the only RAM I could find for it was the Kingston Server Premier KSM32ED8 [1], and the ECC premium was not fun to pay.

[0] https://www.asrock.com/MB/AMD/X570M%20Pro4/index.asp

[1] https://www.kingston.com/en/memory/server-premier/ddr4-3200m...


Intel finally started supporting ECC again on their consumer CPUs; if only W680 boards didn't start at ~$400.


Depending on your scale, software running on the machine is more likely to cause crashes in the middle of the night :)


Not just platform support; availability of modules is bad too.

I got an HP that has both an AMD Pro APU and DDR5 slots, with no soldered RAM, i.e. all the requirements.

It was $500 to $1,500 depending on configuration. Then 16 or 32 GB of ECC SODIMMs runs over $2,000 for regular consumers! And that's if you can find them in stock!


I think you overstate the problem here. Chances are, unless you’ve addressed other more pertinent issues, simply using ECC memory isn’t going to stop systems from crashing in the middle of the night.


I've had random crashes before that were one-off.

Like, my desktop just froze, and then it never happened again. The only explanation that makes sense to me is a random bit flip.

The actual RAM speed never mattered; you can't tell the difference between 150 FPS and 165 FPS (even though my screen's refresh rate is 280 Hz).


It's pretty easy for those to be race conditions, too. Plenty of one time crashes in a fleet of thousands of machines with ECC. ECC lets you know it's almost certainly not a memory issue.

10% more fps doesn't matter at 150 fps, but it's nice when your FPS is lower. 60 -> 66 might mean you don't dip below 60 as often. 55 -> 60.5 is pretty nice too. Maybe less of a deal if you've got VRR etc.


What game runs at only 60 FPS because of a RAM issue? I know I only have a 3600, but if a game is running at 60 FPS it's 99% because of my GPU, not the RAM.

Most games still run at 90+ FPS, I would love to have ECC RAM to prevent a potential one-off crash or just to know that the RAM didn't report an error when it happened. I would pay money for this!

Better yet, the 3d cache CPUs don't care about RAM speed as much, according to benchmarks


Play games on APUs and RAM speed will make a big difference and could get you from playable to not.


Or aim for extremely high FPS (250+) on games that are not GPU-bound, Counter-Strike being the typical example here.


> you can't tell the difference with 150 FPS and 165 FPS

First byte latency makes cache misses significantly slower which in turn makes 99%ile latency (which is perceived as microstutter by humans) significantly higher even if it doesn't affect throughput (fps). This was well documented way back when the first DDR5 sticks came out and they performed like crap compared to overclocked B-Die DDR4.


ECC is now baked into the DDR5 spec. Great news!


On-die ECC. It's an improvement, but full ECC memory, which you can also get for DDR5, additionally protects your data in transit at 6400 MT/s.

Also, the on-die ECC of DDR5 won't report errors to your OS. With real ECC memory, corrected errors can be handled by the OS, and you'll even be informed of uncorrectable 2-bit errors.

Want protection against 2-bit errors? Make sure your platform supports ECC chipkill.
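On Linux, whether corrected errors are actually being reported can be checked via the EDAC subsystem in sysfs; a quick sketch (the paths exist only when the platform and driver really expose ECC counters):

```shell
# List corrected-error counts per memory controller, if the kernel's EDAC
# driver is loaded; on non-ECC systems the directory simply won't exist.
if ls /sys/devices/system/edac/mc/mc*/ce_count >/dev/null 2>&1; then
    grep -H . /sys/devices/system/edac/mc/mc*/ce_count
else
    echo "no EDAC memory controllers found (no ECC, or driver not loaded)"
fi
```

A steadily climbing ce_count is the early-warning signal: the module is failing but ECC is still papering over it.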


While it's technically true that DDR5 comes with "on-die ECC", that's only because the memory is so unreliable it would not work properly without it. Even then, it is not the same as true ECC, which has an extra error-correction chip on the memory module and also protects against transmission errors on the way to the CPU.


10% performance difference in exchange for maybe crashing slightly more often would be huge for people who only really use their PCs for gaming.

HN readers seem to have a skewed idea of how useful ECC is while pretending the downsides don't exist. Not everyone is primarily using their system as a workstation.


A bit over 20 years ago I had a PC with a memory stick that had gone bad, but not bad enough to crash all the time... it crashed often enough running Windows 98 apps that I attributed all crashes to software nonsense.

Back then it was recommended to run a defragger every so often, so I set up a cron job to run it every Saturday night or something like that. The net result was that every file block that got moved made a trip through memory with some small probability of getting corrupted. Often the errors were in files that weren't used that often so I didn't immediately notice. The net result is that after many months of this, I started noticing PDF files that were corrupted, or mp3 files that would hiccup in the middle even though it used to play perfectly before. Sadly, I had ripped my 500-ish CD collection and then had gotten rid of the physical CDs.


That reminds me of how I accidentally traced a memory issue to a failing power supply.

I noticed (after some Windows bluescreen) in memtest that the memory was showing some errors. I ordered another 16 GB pair, replaced it and... the problem persisted.

Suspecting the motherboard, I pretty much said "well, I'm not replacing the mobo now; it will have to wait for the next hardware refresh. It's a gaming PC, so no big deal." And now I had 32 GB of RAM in the PC.

Weirdly enough, the problem only happened when running the multi-core memory test.

Cut to about a year later, and my power supply just... died. Guessing bad caps, I ordered another and thought nothing of it. On a whim I ran memtest and...

nothing. All fixed. I repeated it a few times and it was just fine; no bluescreens for about two years now either.

I definitely want my next machine to have ECC, but the DDR5 consumer ECC situation looks... weird. I'm not sure whether I should be happy with on-die ECC; I'd really prefer to have the whole CPU-to-memory path ECCed.


Two things. First, I don't think any conclusions can be drawn about whether dd or ddrescue is more susceptible to bit flips. It could be that both allocated a buffer, and ddrescue just happened to be handed the region of memory with the fault in it, which it reused multiple times, whereas when dd was run that region was used by something else. Memory mapping and usage in a real operating system is highly non-deterministic due to the sheer number of things that affect it.

Second, once memtest has produced a good list of known-faulty memory addresses, you can tell the operating system not to use them. Then you can keep using your old hardware without the reliability problems. It is possible, though, that further areas of memory will subsequently fail, and without ECC you'll still be vulnerable to random (cosmic-ray-induced) bit flips.
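On Linux, one common route for the "tell the OS not to use them" step is GRUB's BadRAM filtering; recent memtest86+ builds can print the pattern in exactly this format. A config sketch, where the address/mask pair is a placeholder, not a real value:

```shell
# /etc/default/grub
# Placeholder pair; substitute the BadRAM string memtest86+ reports for
# your machine, then regenerate the boot config (e.g. `sudo update-grub`).
GRUB_BADRAM="0x7ddf0000,0xffffc000"
```

Each pair is an address and a mask; GRUB withholds every page matching the pattern from the kernel, so the faulty cells are simply never allocated.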


I ran a cluster of ~30k blade based computers booting entirely off iPXE. They didn't have any onboard ssd/disk storage or ECC memory. Every day, a few of them would randomly lock up, they'd reboot with a fresh network image and keep on humming.


> Every day, a few of them would randomly lock up, they'd reboot with a fresh network image and keep on humming.

The same ones, or random new machines every time?


Totally random.


Could easily be software or some other marginal hardware bug though.


Indeed. Although, sometimes the machine wouldn't fully crash. It was like the disk was corrupted, but apps were still running, which makes me suspect it was the lack of ECC.


How do you even get that many computers without ECC? I think all the blades I've seen have ECC as baseline spec.


Google started with "standard PCs", though I'm not finding any info on whether they used ECC memory.

https://blog.codinghorror.com/building-a-computer-the-google...

https://patents.google.com/patent/US6549988


They were purpose built and didn't require 100% uptime.


I've had a lot of really strange bugs and data loss with my current build (Ryzen with G.Skill memory). After running memtest for 24 hours I finally saw that two of the four RAM sticks were faulty (two bit flips on each, only rarely, and only on a specific test). The company replaced them, but now, a year later without any issues, I have another one that failed in exactly the same way. This is the last time I build a non-ECC system for myself.


My motherboard isn't rated for more than DDR4-3200 with my old CPU, a Ryzen 7 2700. I could set my memory's XMP profile and run at DDR4-3466, and memtest would be stable for more than 24 hours but would error before 48. I backed off: DDR4-3400, DDR4-3333, DDR4-3266... finally stable in memtest for 96 hours, then I'd boot into Windows, run the Prime95 Blend workload, and 3266 would crash within hours. I finally found a little note in my motherboard manual that older CPUs are limited to DDR4-3200. Set that speed: rock solid. I was even able to tighten the JEDEC timings with guidance from the second XMP profile for DDR4-3133.

Gigabyte really did mean DDR4-3200 was the limit for Pinnacle Ridge and older AMD cpus.


In my case I tried at 2666 and 3200 but they were still failing exactly at the same address.


Heard a similar story from a friend last week: a faulty RAM stick as well. I'm glad I bought a Threadripper with ECC instead (worth waiting for a Lenovo sale and buying the RAM separately).


You might want to take a look at your PSU. That seems like a suspicious amount of RAM failure to me. How old is it and what model, if you don't mind me asking?

PSU is something I never cheap out on. Always pays for itself in the end. A bad PSU can kill your whole system.


Five years old, and the current RAM sticks (the three that work) can sustain a three-day memtest with zero errors (at 3600). They were all able to handle that when I received them back from warranty.


5 years could be getting up there if it's a mid range to low end PSU in terms of reliability. You might want to see if you find your PSU in this list:

https://cultists.network/140/psu-tier-list/


Amazing technical write-up. But if there's no cause for alarm based on SMART, I would just run the memtest right away, because that's always my go-to for weird undiagnosed problems. I find it's usually not the problem, although when it has been, I've ended up wasting a silly amount of time on it (just like in this case!).

And if there was cause for alarm, I would think long and hard about imaging from the original computer at all. With certain failure modes in drives, just reading could cause more corruption; each failed attempt could lose data.

But yeah, happy you did it this way in the end, because I learned a ton from the resulting blog post!
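The end-to-end check the post relies on can be sketched like this, using an ordinary file in place of a real /dev/sdX so it's safe to run anywhere:

```shell
# Work somewhere disposable; source.img stands in for the drive being imaged.
cd "$(mktemp -d)"
dd if=/dev/urandom of=source.img bs=1M count=4 2>/dev/null

# conv=noerror keeps going past read errors; sync pads failed reads with zeros.
dd if=source.img of=copy.img bs=1M conv=noerror,sync 2>/dev/null

# The two digests must match. Re-checking the image on a second machine is
# what rules out bad RAM on the imaging host itself.
sha256sum source.img copy.img
```

With a genuinely failing drive, ddrescue with a mapfile is the better tool, since it retries and records which sectors were unreadable rather than silently zero-filling them.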


AFAICT, no current Mac comes with ECC - do they have the same issues? If so, one doesn't hear about them too often.


The Mac Pro has ECC and is still sold.

My iMac Pro has it as well.


Ah, I see, I never considered one of them, thanks. One is discontinued, the other is expected to be replaced soon, but good to know.


Well, the reason why ECC mattered here is because the RAM was bad, but modern Mac computers do not come with user-serviceable RAM at all, so if you have a problem like this, it's a support ticket anyways, and I'm not even sure there's a true equivalent to Memtest86 for modern Mac computers in the first place. So basically, if it was a RAM problem, there's no point in diagnosing it even if you could; just send your Mac in when you start having issues that seem to be bad RAM.

Even with ECC, it's incredibly hard to know that a given one-off issue isn't a memory error, because even ECC can't detect 100% of memory issues. But without ECC, it's also nearly impossible to know if something is a memory error. If it's bad RAM, the same address will likely continue to exhibit bad behavior, but if it's a solar flare, you're never going to know the difference; you will just get incorrect behavior that may or may not crash, and it will be completely impossible to reproduce.

One big reason you don't hear it as much is there are not nearly as many data centers filled with Macs. There are definitely a few, and I bet if you got an experience report from them, they could give some idea of how visible memory errors are on Macs (although it's hard, because again, if you don't have ECC, there's not really a good way to know if something is a memory error; you can only really postulate.)


ECC is error correcting. A bit gets flipped and it not only detects it but fixes it. Two bits get flipped and it can at least detect it and panic the machine immediately instead of corrupting your data.

Without it the corruption is silent. Then this kind of thing happens:

https://news.ycombinator.com/item?id=35026440

Which is another reason not to solder the storage either.

Suppose you have a system board with bad soldered memory and you want to copy your data off of it onto the new one. Well, the memory is flipping random bits as it's copying, but the flash chips are permanently attached to the same board as the bad memory.

Otherwise it would have been just a support ticket; now it's something worse.
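The single-correct/double-detect behavior described above comes from SECDED codes. A toy sketch on 4-bit words (real DIMMs protect 64-bit words with 8 check bits, but the mechanics are the same; the function names are mine):

```python
from functools import reduce
from operator import xor

def secded_encode(nibble):
    """Hamming(7,4) plus an overall parity bit: 4 data bits -> 8-bit codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]            # covers codeword positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]            # covers positions 2,3,6,7
    p4 = d[1] ^ d[2] ^ d[3]            # covers positions 4,5,6,7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return bits + [reduce(xor, bits)]  # overall parity makes the total even

def secded_decode(bits):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s4 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 | (s2 << 1) | (s4 << 2)  # 1-based position of a single error
    overall = reduce(xor, bits)            # 0 if zero or two bits flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:                     # single-bit error: fix it in place
        bits = list(bits)
        if syndrome:
            bits[syndrome - 1] ^= 1        # else the parity bit itself flipped
        status = "corrected"
    else:                                  # syndrome != 0, overall 0: two flips
        return None, "uncorrectable"
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, status
```

Any single flip lands back on the original data; any double flip is reported rather than silently corrected, which is exactly when a machine should panic instead of writing the word to disk.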


>ECC is error correcting. A bit gets flipped and it not only detects it but fixes it. Two bits get flipped and it can at least detect it and panic the machine immediately instead of corrupting your data.

I did neglect to mention that ECC by definition can correct errors, but I wonder if what's making people upset with my comment is the implication that ECC can't detect all errors.

But it's true: ECC can't detect all bitflips, and in fact there's at least one study[1] that suggests quite a lot of memory errors go entirely undetected even with ECC.

Silent corruption does in fact occur even with ECC and it may not even be particularly rare, even though it is rarer than typical single/double-bit flips. Of course, the majority of desktops use non-ECC RAM and it's mostly fine, so I assume this is only ever going to matter in production workloads, and exactly what impact it has is hard to gauge.

[1]: https://pages.cs.wisc.edu/~remzi/Classes/739/Fall2018/Papers...


Maybe the issue is that undetectable errors are possible, but if the system is in such a bad way that they're happening at any rate, you'll also be getting quite a lot of the detectable ones and then get prompt notice that something is wrong.

Whereas without ECC you could have silent data corruption for years and only discover it after it gets severe enough to warrant a manual investigation, after the damage has already propagated to your backups.

> Of course, the majority of desktops use non-ECC RAM and it's mostly fine, so I assume this is only ever going to matter in production workloads, and exactly what impact it has is hard to gauge.

There are two reasons it's useful. One is the cosmic ray random bit flip that happens even to hardware in good condition, and then ECC can usually detect and correct it, but that's less common and more important for production workloads.

The other is, your hardware is experiencing a higher than average number of random bit flips, and then ECC gives you immediate notice when this starts happening instead of letting it sow chaos until something crashes so hard you take notice.


https://youtu.be/aPd8MaCyw5E ("ShmooCon 2014: You Don't Have The Evidence - Forensic Imaging Tools") was quite an eye-opening talk about common tools, like the article-mentioned `dd` (and its cousin `ddrescue`), and how they deal with I/O errors.

To be clear, I do not believe that the tools are at fault - rather, the SATA/SAS/IDE controllers have a different design goal, and software tools can only do so much.

Tools like DeepSpar (HW+SW) and PC-3000 (also HW+SW) allow a scary level of nitty-gritty access to the hardware, including flashing SSD/HDD controller firmware in case it went pear-shaped. For data recovery, be it in a forensic context or in the context of retrieving important irreplaceable data, I have always had a nerd-lust for those tools. Used them at a previous job, but can't ever justify the price for personal and very infrequent use. :)


>Does increased heat increase the likelihood of memory errors? I think it does.

I just got through a round of overclocking my memory. Yes, heat does.

>tRFC is the number of cycles for which the DRAM capacitors are "recharged" or refreshed. Because capacitor charge loss is proportional to temperature, RAM operating at higher temperatures may need substantially higher tRFC values.

https://github.com/integralfx/MemTestHelper/blob/oc-guide/DD...


This reminds me of a bug in Google Chrome that was attributed to a flipped bit.

If anyone has the link, it's missing from my collection...


This wasn't run with a large enough sample size to be statistically valid.


Moral of the story?

Upgrade to DDR5, the latest standard, which has on-die ECC, though it is not as good at spotting bit flips as proper ECC memory with a separate extra error-correction chip.

https://en.wikipedia.org/wiki/DDR5_SDRAM#:~:text=Unlike%20DD....

While proper ECC RAM and motherboards exist, I'm surprised that a cheaper but equally good solution doesn't, although I know some would argue that DDR5 is a step in the right direction of a marathon.

I guess the markets know best and chase the numbers, assuming they are also using proper ECC memory and binary-coded decimal rather than floating-point arithmetic (which introduces errors); BCD is something central banks have been using for decades.

https://en.wikipedia.org/wiki/Floating-point_error_mitigatio...
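The floating-point error the comment alludes to is easy to demonstrate; decimal (fixed-point) arithmetic of the kind financial systems use avoids it:

```python
from decimal import Decimal

# Binary floating point cannot represent 0.10 exactly, so repeated sums drift:
print(sum([0.10] * 3) == 0.30)                        # False on IEEE-754 doubles

# Decimal arithmetic keeps cents exact, which is why ledgers avoid floats:
print(sum([Decimal("0.10")] * 3) == Decimal("0.30"))  # True
```

The drift is deterministic rounding, not a random bit flip, but both end the same way: stored numbers that no longer mean what you think they do.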


Also from your link:

“There still exist non-ECC and ECC DDR5 DIMM variants; the ECC variants have extra data lines to the CPU to send error-detection data, letting the CPU detect and correct errors that occurred in transit.”


Intel and ASRock released a NUC with in-band ECC, which is equally good at protecting your data but comes with a performance hit.

https://www.anandtech.com/show/18732/asrock-industrial-nucs-...


DDR5 has enough on-die ECC to make errors within the chips effectively impossible. It doesn't provide error data to the CPU, though, so errors in transit can still occur. This is really unlikely, though, and anything not mission-critical will no longer need the extra ECC computation on the CPU side. (DDR5 encapsulates the memory controller.)


> This is really unlikely, though

It happens quite often as a result of dust in the contacts when the memory was installed or weak solder on the chips or sockets or bad capacitors etc.

None of which is that likely on machines in good working order, but many are not. And you can go from one to the other at any time as a result of a power spike or a cooling failure.


Source on that? Has anyone tested it?

> This is really unlikely, though, and anything not mission-critical will no longer need the extra ECC computation on the CPU-side.

ECC computation is done in hardware anyway.


I meant the memory controller on the CPU side won't need to implement it. Obviously, full DDR5-ECC hardware exists, but the onchip ECC as a whole makes bit flips far less likely than DDR4. There's not much of a need for the complete set on consumer hardware.

Of course this is assuming random cosmic ray bit flips, not faulty hardware. And it's speaking cost-wise from the manufacturer's perspective. I'd personally like full ECC to just be the standard.


> This is really unlikely, though,

I think you can only say that because people are not routinely monitoring their surroundings for ionizing radiation.

If that were to change, I think we could start to identify some of those military locations that interfere with equipment, which would then expose the weakness of DDR5.


> To even detect this, I needed the patience and discipline to verify the checksum on a 500GB file! Imagine how much more time I could have wasted if I didn't bother to verify the checksum and made use of an important business document that contained one of the 14 bit flips?

Unpopular-opinion counterpoint - the odds of this actually happening are vanishingly unlikely. Many file formats have built-in integrity checks and tons of redundancies and waste. I wouldn't want to risk handling extremely valuable private keys or conducting high value cryptocurrency transactions or something, I suppose, on a machine without ECC memory, but that just doesn't really come up in most knowledge worker or end consumer scenarios.

The odds of actually getting bit by this in a way that matters to you are really low, which is why nobody cares.



