Should I buy ECC memory? (2015) (danluu.com)
300 points by colinprince on April 26, 2017 | 224 comments



While I was at Google, someone asked one of the very early Googlers (I think it was Craig Silverstein, but it may've been Jeff Dean) what was the biggest mistake in their Google career, and they said "Not using ECC memory on early servers." If you look through the source code & postmortems from that era of Google, there are all sorts of nasty hacks and system design constraints that arose from the fact that you couldn't trust the bits that your RAM gave back to you.

It saved a few bucks at a time when Google's hardware costs were rising rapidly, but the knock-on effects on system design cost much more than that in lost engineer time. Data integrity is one engineering constraint that should be pushed as low down in the stack as is reasonably possible, because as you get higher up the stack, the potential causes of corrupted data multiply exponentially.


Google had done extensive studies[1]. There is roughly a 3% chance of error in RAM per DIMM per year. That doesn't justify buying ECC if you have just one personal computer to worry about. However, if you are in a data center with 100K machines, each with 8 DIMMs, you are looking at about 6K machines experiencing RAM errors each day. Now if data is being replicated, these errors can propagate corrupted data in unpredictable, unexplainable ways even when there are no bugs in your code! For example, you might encounter bad line items in your logs which get aggregated into a report showing bizarre numbers because 0x1 turned into 0x10000001. You can imagine that debugging this happening every day would be a huge nightmare, and developers would eventually end up inserting lots of asserts for data consistency all over the place. So ECC becomes important if you have a large-scale distributed system.

1: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf


That data set covers 2006-2009, and the RAM consisted of 1-4GB DDR2 DIMMs running at 400-800 MT/s. Back when 4GB was considered a beefy desktop, consumers could get away with a few bit flips during the lifetime of the machine. Now my phone has that much RAM, and a beefy desktop has 16-32 GB of RAM running at 3000 MT/s.

It's time we start trading off the generous speed and capacity gains for some error correction.


Note that the error rate is not proportional to the amount of RAM; it is proportional to the physical volume of the RAM chips. (The primary mechanism that causes errors is highly energetic particles hitting the chips, and the chance of that happening is proportional to the volume of the chips.) This means that the error rate per bit goes down as density goes up.


Cosmic rays causing the errors has got me thinking about whether the error rates vary with time.

Do you get more/fewer errors when it's daytime (due to the Sun)? Does the season affect it (axial tilt means you're more/less "in view" of the galactic core)?


Wouldn't it go up as the density increases? If a particle hits the chip, there are more bits at the spot where it hits.

So while the chance of a hit is lower (per GB), if it does hit, the effect will be bigger (more bits flipped).


It is an interesting question but I think the parent poster did not mean density in the pure physical sense.

That is, more memory but less mass, which is not physical density. Also, I am not sure gamma rays need to hit the physical bits directly to mess things up. If other things can be hit as well, then surface area might correlate more strongly, but probably not.

I don't know what the answer is but I would imagine that the error rate would be the same percentage assuming orientation is kept the same.

Of course, if you go to the extreme macro scale (think of the computers in Asimov's last question [1]), then density probably does play a role, as gravity starts to cause an enormous number of collisions. This actually happens in stars, which is why photons take a long time to escape a star, and at the edge of black holes, where collisions are happening extremely frequently.

[1]: http://multivax.com/last_question.html


An alpha particle, for instance, is at least an order of magnitude smaller than the smallest transistor. The maximum damage it can do is effectively 1 bit.


An alpha particle won't penetrate that far; it will be stopped at the building, or at the enclosure. A piece of paper blocks it.

Beta and gamma are the ones that can do damage (not sure about beta), and gamma can pass through the entire chip, so it can hit multiple transistors, depending on the angle and how they are laid out.


Actually these high-energy particles tend to be on the order of the size of a proton or less, so make that 6 orders of magnitude smaller than the smallest transistor.


That's a 3% per DIMM per year chance of at least one error. Most memory faults are persistent and cause errors until the DIMM is replaced. Also, the error rate was only that low for the smallest DDR2 DIMMs.


I have hit soft errors in every desktop machine that used ECC. Either I have bad luck, ECC causes the errors, or it's some third thing. I think ECC should be mandated for anything except toys and video players.


> I have hit soft errors in every desktop machine that used ECC.

Not sure if I should start getting nervous or if your RAM just sucks ;) I get ECC errors only if I overclock too much, and I run the RAM overclocked all the time. It's actually one of the reasons I wanted ECC.


Different RAM; more soft errors the older a system gets. Heh, the system should auto-overclock until it starts to get correctable soft errors and then back off. Or reduce the refresh rate until soft errors appear and then bump it back up. Max speed at the lowest power.


How much more expensive is ECC RAM? I don't have it and I've never experienced obvious issues. If it's a lot more expensive, it's not really worth it for the once or twice the desktop will likely experience an actual issue.


Should be about 1/8th more, since it's just a 72-bit bus carrying 64 bits of data and 8 check bits. Or rather, your DIMM will have 9 chips instead of 8.
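
For intuition, here's a toy sketch of the SECDED (single-error-correct, double-error-detect) idea those check bits implement. Real DIMMs protect 64 data bits with 8 check bits; this Python sketch (names made up, not how a memory controller literally does it) uses 8 data bits and 5 check bits to stay readable:

  DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]   # non-power-of-two positions hold data
  PARITY_POS = [1, 2, 4, 8]                 # power-of-two positions hold Hamming parity

  def encode(data_bits):                    # data_bits: list of 8 ints (0/1)
      word = [0] * 13                       # index 1..12 used; index 0 = overall parity
      for pos, bit in zip(DATA_POS, data_bits):
          word[pos] = bit
      for p in PARITY_POS:                  # parity p covers every position with bit p set
          word[p] = sum(word[i] for i in range(1, 13) if i & p) % 2
      word[0] = sum(word[1:]) % 2           # overall parity enables double-error detection
      return word

  def decode(word):
      syndrome = 0
      for i in range(1, 13):                # XOR of positions of set bits = error location
          if word[i]:
              syndrome ^= i
      overall_ok = sum(word) % 2 == 0
      if syndrome == 0 and overall_ok:
          status = "clean"
      elif not overall_ok:                  # odd number of flips: assume one, correct it
          word[syndrome if syndrome else 0] ^= 1
          status = "corrected"
      else:                                 # even number of flips with nonzero syndrome
          status = "uncorrectable (double error)"
      return [word[p] for p in DATA_POS], status

  word = encode([1, 0, 1, 1, 0, 0, 1, 0])
  word[6] ^= 1                              # simulate a cosmic-ray bit flip
  print(decode(word))                       # original data back, status "corrected"

Flip any single bit of the 13 and the data comes back intact; flip two and it at least gets flagged instead of being silently accepted.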

How they get you is that Intel will sell you a Xeon, which is the exact same die as an i5, in a different package for more money.


Depends what you need - you can pick up older gen Xeon chips for cheap and the performance often isn't that much worse than modern consumer grade stuff. If you're looking to build a consumer-level NAS or home server, Avoton is pretty cheap and takes ECC RAM.


Unfortunately, Avoton might just suddenly stop working on you.

https://www.servethehome.com/intel-atom-c2000-series-bug-qui...



It should be 1/8th more, plus a bit for the scrubber. But in practice ECC memory is "enterprise priced" so it's more like double.


Should we do a Kickstarter to manufacture our own DIMMs? It's an easy design and I hate donating to some corporation's gross margins. Maybe enough people feel the same.


It's significantly more expensive, usually around 30-100% more, depending on capacity. IMO not worth it on a desktop, possibly worth it on a home server or a serious workstation. Plus your CPU and motherboard have to support it, which is a pain with Intel's consumer lineup.


Good thing Ryzen supports ECC out of the box. Just waiting on motherboard support for it.



I think I may go AMD (again) for this very reason.

(Generally, I don't think ECC actually does matter that much for us casual/home users, but I like to reward the people who actually do make it easy to "do the right thing". Same deal as only purchasing AMD graphics cards since 2005-ish(?).)


If you're not worried about certain chip features and power draw, last gen server equipment is very cheap.


Usually it's cheaper because of surplus from the server market's forced upgrade cycle. The problem is it's mostly buffered/registered ECC, which can't be used in desktop motherboards.


> There is roughly a 3% chance of error in RAM per DIMM per year. […] with 100K machines, each with 8 DIMMs, you are looking at about 6K machines experiencing RAM errors each day.

Can you work out the math? I don't follow it. 3%×100K×8÷365=66 per day by my reasoning…


they've multiplied by 3 instead of 0.03
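
A quick sanity check of the arithmetic, using the grandparent's own numbers:

  # expected machines hit per day, assuming 3% per DIMM per year and independence
  p_dimm_year = 0.03
  machines = 100_000
  dimms_per_machine = 8
  print(p_dimm_year * machines * dimms_per_machine / 365)   # ~66, not ~6000

Still plenty to make debugging miserable at that scale, just not 6K a day.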


> There is roughly a 3% chance of error in RAM per DIMM per year. That doesn't justify buying ECC if you have just one personal computer to worry about.

How do you make that leap?


It's an inappropriate leap. Consumers should have ECC memory too.

However the consumer market has long decided to settle for ECC nowhere and cheap everywhere.

ECC hardware comes at a premium that can easily be +100%. You need support in the memory, the motherboard, and the CPU.

Given the price difference, personal computers will have to live with the memory errors. People will not pay double for their computers. Manufacturers will not sacrifice their margin while they can segment the market and make a ton of money off ECC.


AMD has modestly priced hardware that supports ECC.


Was that the case before Ryzen? I know their new CPUs support ECC, but I'm not sure for earlier generations.


I think it was common for AM3 for example too.


ECC is officially supported by all AM2/3(+) CPUs and AFAIK all corresponding motherboards from ASUS. As in, you have it guaranteed on the spec sheet.

There are also reports of BIOS support in some boards which don't have ECC advertised. And you can try to enable it in the OS even without BIOS support, though some level of hardware support is still necessary. As Linux documentation puts it: "may cause unknown side effects" :)


It was technically supported by the hardware, but not by many motherboards and BIOSes.


Yep.


Bristol Ridge does support ECC BTW, but one problem is that you can't use ECC with x16 chips (because ECC is 72-bit), so with 8GB of RAM and 8Gbit chips you have to choose between non-ECC/ECC single channel with x8 chips and non-ECC dual channel with x16 chips. 4Gbit chips don't have this problem, but they will become obsolete, especially when 18nm ramps up, and while DRAM prices should decline when that happens...


What's the matter with x8/x16 chips and dual channel? I don't think it should matter.

Or do you mean that if you want exactly 8GB then it's hard to find a pair of 4GB DDR4 ECC modules? Well, just get 2x8GB if you are a performance nut.


Yes, what I am saying is that it is impossible with 8Gbit chips, but possible with 4Gbit.


I'd like to know this, too.

I am guessing it's because, if RAM errors increase linearly with the number of computers, then RAM errors will be a greater and greater proportion of total errors. This assumes other kinds of errors don't scale linearly. Someone looking through logs is looking for errors, they'd like to find fixable logic errors, not inevitable RAM errors.


A cost/benefit analysis for a system where non-critical operations are performed would seem to favor the non-ECC memory. I suspect this is the case for the majority of people who have computers for their personal use, without taking into account that they might not even be aware such a thing exists. Although, I haven't compared ECC prices lately.


Your game machine can live without ECC.

Your NAS had better have it, though.


Probably assumptions about the uses of a PC. I'd imagine most of the bits are media-related.


Because the market.


This makes me wonder how banks deal with this issue.


> If you look through the source code & postmortems from that era of Google, there are all sorts of nasty hacks and system design constraints that arose from the fact that you couldn't trust the bits that your RAM gave back to you.

Details of this would be very interesting, but obviously I understand if you cannot provide such details due to NDAs, etc.

I mean, I can imagine a few mitigations (pervasive checksumming, etc), but ultimately there's very little you can actually do reliably if your memory is lying to you[1]. I can imagine that probabilistic programming would be an option, but it's hardly "mainstream" nor particularly performant :)

I'm also somewhat dismayed at the price premium that Intel are charging for basic ECC support. This is a case where AMD really is a no-brainer for commodity servers unless you're looking for single-CPU performance.

[1] Incidentally also true of humans.


You need ECC /and/ pervasive checksumming. There are too many stages of processing where errors can occur. For example, disk controllers or networks. The TCP checksum is a bit of a joke at 16 bits (it will fail to detect 1 in 65000 errors), and even the Ethernet CRC can fail - you need end to end checksums.

http://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html
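
A minimal sketch of what an end-to-end check can look like (Python; the framing format here is made up, and for data at rest you'd probably reach for a cryptographic hash rather than CRC32):

  import struct, zlib

  def frame(payload: bytes) -> bytes:
      # sender: append a CRC32 computed over the payload itself
      return payload + struct.pack(">I", zlib.crc32(payload) & 0xFFFFFFFF)

  def unframe(msg: bytes) -> bytes:
      # receiver: recompute and compare, independent of whatever TCP/Ethernet did
      payload, crc = msg[:-4], struct.unpack(">I", msg[-4:])[0]
      if zlib.crc32(payload) & 0xFFFFFFFF != crc:
          raise ValueError("end-to-end checksum mismatch")   # retry or alert upstream
      return payload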


I did a bunch of protocol-level design in the '90s, and one of the handful of things that taught me was to _ALWAYS_ use at least a CRC with a standard polynomial. It's just not worth skipping. In the 2000s I relearned the lesson when it comes to data at rest (on disk, etc.). If nothing else, both of those will catch "bugs" rather than silently corrupting things and leading to mysteries long after the initial data was corrupted.

I just had this discussion (about why TCP's checksum was a huge mistake) a couple days ago. That link is going to be useful next time it comes up.


Too many stages... for what? You haven't stated what the criteria for 'recovery' (for lack of a better word) are. What is the (intrinsic) value of the data?

Personally, I'm a bit of a hoarder of data, but honestly, if X-proportion of that data were to be lost... it probably wouldn't actually affect my life substantially even though I feel like it would be devastating.


CRC checksums can fail to detect some errors if you have multiple bit errors, like runs of zeros (these reset the polynomial computation): http://noahdavids.org/self_published/CRC_and_checksum.html

But CRC is good for catching single-bit errors.


> ultimately there's very little you can actually do reliably if your memory is lying to you

1. Implement everything in terms of retry-able jobs; ensure that jobs fail when they hit checksum errors.

2. if you've got a bytecode-executing VM, extend it to compare its modules to stored checksums, just before it returns from them; and to throw an exception instead of returning if it finds a problem. (This is a lot like Microsoft's stack-integrity protection, but for notionally "read-only" sections rather than read-write sections.)

3. Treat all such checksum failures as a reason to immediately halt the hardware and schedule it for RAM replacement. Ensure that your job-system handles crashed nodes by rescheduling their jobs to other nodes. If possible, also undo the completion of any recently-completed jobs that ran on that node.

4. Run regular "memtest monkey" jobs on all nodes that attempt to trigger checksum failures. To get this to work well, either:

4a. ensure that jobs die often enough, and are scheduled onto nodes in random-enough orders, that no job ever "pins" a section of physical memory indefinitely;

4b. or, alternately, write your own kernel memory-page allocation strategy, to map physical memory pages at random instead of linearly. (Your TLBs will be very full!)

Mind you, steps 3 and 4 only matter to catch persistent bit-errors (i.e. failing RAM); one-time cosmic-ray errors can only really be caught by steps 1 and 2, and even then, only if they happen to affect memory that ends up checksummed.
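
A tiny sketch of what point 1 can look like (Python, hypothetical names; the scheduler that retries failed jobs on another node is assumed to exist elsewhere):

  import hashlib

  class ChecksumError(Exception):
      pass

  def run_with_integrity_check(job, payload: bytes, expected_sha256: str):
      # fail loudly on corrupted input so the scheduler reschedules elsewhere
      if hashlib.sha256(payload).hexdigest() != expected_sha256:
          raise ChecksumError("input corrupted; flag this node for a RAM check")
      result = job(payload)
      # hand the next stage a checksum of our output so it can verify in turn
      return result, hashlib.sha256(result).hexdigest()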


How do you calculate those checksums without relying on the memory?


The chances of the memory erroring in such a way that the checksum still matches become quite small.


You can't really, but you are now requiring the error to occur specifically in the memory containing your checksum, rather than anywhere in your data.


It's deeper than that. What are you calculating the checksum of? Is it corrupted already?

If you can't trust your RAM, you have no hard truth to rely on. It's only probabilistic programming or living with the errors.

(Although, rereading the GP, he seems to be talking about corrupted binaries. Yes, you can catch corrupted binaries, but only after they corrupted some data.)


It's even worse than that: where's the code that's doing all the checksumming and checking of checksums? Presumably it came from memory at some point...

Maybe it was read fine from the binary the first time, but the second time...

At some point you just have to hope.


Pervasive checksumming is going to cost a lot of CPU and touch a lot of memory. The data could be right, the checksum wrong as well. ECC double bit errors are recognized and you can handle them how you'd like, including killing the affected process.


I agree, which is why I used the word "mitigation", as in: not a solution.

Probabilistic programming is a theoretical possibility, but not really practical.


it was indeed Craig


Given that cosmic radiation is one source of memory errors, shouldn't just better computer cases reduce memory errors?

Basically a tin-foil (or lead-foil) hat over my computer?


Can people here please stop posting that ZFS needs ECC memory. Every filesystem, be it FAT, NTFS, or EXT4, runs more safely with ECC memory. ZFS is actually one of the few that can still be safer even if you don't run with ECC memory. Source: Matthew Ahrens himself: https://arstechnica.com/civis/viewtopic.php?f=2&t=1235679&p=...


In an old discussion regarding ECC/ZFS (in particular, whether hitting bad RAM while scrubbing could corrupt more and more data), user XorNot kindly took a look at the ZFS source and wrote:

"In fact I'm looking at the RAID-Z code right now. This scenario would be literally impossible because the code keeps everything read from the disk in memory in separate buffers - i.e. reconstructed data and bad data do not occupy or reuse the same memory space, and are concurrently allocated. The parity data is itself checksummed, as ZFS assumes it might be reading bad parity by default."

His full comment can be found here:

https://news.ycombinator.com/item?id=8294434


Indeed. It's true that the data may be corrupted before hitting any disk[1], but once it has hit the disks (>1), it's extremely unlikely that you'll ever hit a similar bit error where it'll mistakenly choose the wrong disk block to recover from.

The main point of e.g. ZFS or Btrfs checksumming is that a) at least it isn't getting worse, and b) I can tell if it's getting worse.

[1] ... but if the bits are not generated by the machine that is actually saving them to disk, how do you know they weren't corrupted along the way? The number of people who religiously check PGP signatures/SHA256sums or whatever is minuscule.


> The number of people who religiously check PGP signatures/SHA256sums or whatever is minuscule.

• If you transfer things around using BitTorrent, it'll ensure you always end up with a file that hashes correctly to the sum it originally had when the .torrent file was constructed.

• Many archive formats (zip, rar, and 7z, at least) contain checksums, and archival utilities validate those checksums during extraction, refusing to extract broken files. "Self-extracting archive" executables that use these formats inherit this property.

• Some common disk-image formats (dmg, wim) embed a checksum that checks the whole disk-image during mount, and will refuse to mount a bad one. (I believe you can then try to "repair" the disk image with your OS's disk-repair utility, if you have no other copies.)

• Web pages increasingly use Sub-Resource Integrity attributes on things like .css and .js files, protecting them (though not the page itself) from errors.

• ISO files don't embed checks, but all the common package formats (Windows .cab and .msi; Linux .deb and .rpm; macOS .pkg) on installer ISOs embed their own checksums and often signatures.

• git repos are 'protected' insofar as you won't be able to sync mis-hashed objects from a remote, so they won't spread.

Really, looking over all that, it's only 1. plain binary executables, and 2. "media files" (images, audio, video)—and only when retrieved over a "dumb" protocol, rather than a pre-baked-manifest protocol like BitTorrent or zsync—that are "risky" and in need of explicit checksum comparison.


Both macOS/iOS and Windows use code signing for executables, which should guard against most types of corruption.

Web pages (and anything else) transmitted over HTTPS are protected from corruption in transit by TLS's hashing (which is vastly stronger than the checksums at lower levels of the network stack), though that doesn't help if the server has faulty memory or storage.

PNG has built-in checksums, though other image formats don't (JPEG). Not sure about video.


> Web pages (and anything else) transmitted over HTTPS are protected from corruption in transit by TLS's hashing (which is vastly stronger than the checksums at lower levels of the network stack), though that doesn't help if the server has faulty memory or storage.

I didn't bring this one (or any other transport-level checksums) up, because we were talking about whether you can trust something "across the whole process"—from its origin developer's disk (where it might get an initial explicit checksum generated), to origin memory, across the network to a server's memory, to that server's disk, over the network again to a CDN reverse-proxy's memory, maybe its disk, then the network again to you, then your memory, your disk, and finally your memory again as you verify it. Oh, and a bunch of routers and switches in between, of course.

Static checksums that are baked into file formats or manifest files protect the file across that whole chain. Transport-level checksums only ensure that the one part they're involved in happened correctly.


True. Still mostly immaterial in the grand scheme of things, I think?

EDIT: Should say, as a point of interest: Even though .zip's were protected back in the Good Old Days, that didn't really matter because we all got corrupted (expanded from .zip) .mp3's because of those fucking RTL3xxsomething cards that would just transmit things perfectly and then corrupt the checksum (or whichever way 'round). Ugh. One of the few times I've actually hated engineers.

(Don't get me wrong. I really do want more of these checks to be pervasive. We start with our local file systems.)


What are you doing where you're actually checking checksums periodically and detecting when things get worse? That seems like a lot of work to set up.


They are using ZFS, scrubbing is one command.


zpool scrub

(This may be a myth: it's not something you should actually do that often, because actually reading the media may degrade it.)


or for btrfs users out there:

btrfs scrub start /mnt/volume_name


Either way, it'd be fine on SSDs, then?


Hm? As long as the error correction technology on your SSD of choice stands up, I guess... yes? What, exactly, are you asking?


My thought was "reading SSDs doesn't degrade them, so there's no disadvantage to constant scrubbing." Unless I've misunderstood what you mean by "reading."


Ah, right, fair point. AFAIUI SSD storage does degrade a tiny(!) amount when reading, but magnetic storage degrades quite a bit more.

(That's where that came from, draw your own conclusions :).)


No. ZFS is in much greater need for ECC than most other filesystems.

1. ZFS doesn't come with any disk repair tools, and the ones that exist are not nearly as capable as for other filesystems (the ZFS motto is that it is too costly to repair filesystems, just recover from tape instead (here we can sense the intended audience of ZFS)). If the wrong bit gets flipped your entire pool might be gone (you can of course spend months of your spare time to debug it yourself if you want to). This is not the case (to the same extent) for FAT, NTFS or EXT.

2. The more you use memory, the more likely you are to get hit. I'd argue that ZFS is a quite resource-heavy filesystem and is thus more likely to actually attract bit flips. This is similar to using an encrypted filesystem on an overclocked CPU. There is nothing inherently riskier about encrypting your filesystem with an overclocked CPU, but overclocking your CPU increases the risk of miscalculations, and enabling encryption increases the CPU usage when accessing the filesystem by several orders of magnitude. So, in practice, on a machine that is ever so slightly too overclocked, you quickly notice how filesystem data on encrypted drives gets corrupted but not on regular drives.

So, if you care about your filesystem, then yes - saying that ZFS needs ECC is quite sensible. (if you care about your data you should have backups regardless)


Well; I don't think 'lack of repair tools' for ZFS is the reason that ECC is the suggested good practice; but I can sort of agree that recovering from a badly farked pool isn't fun having been down that rabbithole...

Regardless of ZFS, this is why we architect our storage to cope with such potential borkage (which has, incidentally, only happened to me _ONCE_ in ~8 years of zfs in prod, had nothing to do with silent md corruption in ram and had everything to do with a nasty hw sas hba bug) -- If losing a single storage node/pool causes you problems then you're "doing it wrong" (sorry) and it makes no difference if you're using XFS, ZFS, BTRFS or whatever else...

So... I'm not sure about the ECC stuff, to me ECC really matters not at all for the simple reason that any significant deployment is using ECC anyway: even deploying a cheapo 10k pair of jbods and a tiny 1U head to run NFS or something, you'll be unlikely to even have the option of non ECC from whoever you're buying the kit from (dell? hp? bla) right?

Yes, it might work without it. It might work better somehow with it.. What does it matter when even the cheap gear comes with it anyway?

I've built a reasonable slew of ZFS backed storage (well, a 10 PB prod or so anyway, nothing compared to what some of the folks who comment here have done) and besides some hardware compat issues if you're building storage this is currently your best option to back your objstore/dfs.

ZFS as the storage backend for your DC/Cloud? Good pick. ZFS as the 'local' FS's in your VMs? I wouldn't bother, unless you need some features (it works pretty well with docker, as it goes, but I prefer to run apps on plain ext backed by zvols instead of 'zfs-on-zfs')....


The whole point of the ZFS-needs-ECC statement is that it holds true even for non-significant deployments, such as a home NAS. And at that scale it comes with a quite significant cost increase. Rather than reusing your old workstation, you need a new server-class machine with lots of RAM, even though the CPU requirements aren't that high.

If your backend is a zvol you get the integrity advantages and cheap snapshots regardless of your VM filesystem, so it isn't really a fair comparison with an "EXT all the way" scenario.


ZFS is not the cheap option, it was never intended to be - so why bother skimping on ECC? If you're worried about ECC prices - ZFS is probably not for you.

It's not that you need ECC for ZFS, but when you're at a point where you're willing to throw money at a storage system where ZFS makes sense, the extra cost of ECC is minuscule. The most expensive hardware requirement of ZFS is that you need disks of the same size anyway, which means you're not just throwing a random amount of disks together, and if you want to expand, you need to add another full zpool, or replace disks one by one.

On my home NAS, the difference was about 120 EUR for 32GB (80eur/dimm vs 50eur/dimm), on a grand total of over 2500 EUR. One of the reasons for choosing ZFS was storage reliability, and then skimping out on ECC is imho a bit silly.


You can have disks of differing sizes with ZFS, though you are making things difficult for yourself so your point still stands.

However the cost of ECC is not negligible because you need your CPU and motherboard to support it.

The total budget for my NAS was 1000 EUR in 2013, 500 for the 5 3TB disks, 350 for the motherboard+CPU+8GB ECC RAM, and 150 for the case, PSU, system SSD, and accessories. In reality I salvaged 2 3TB disks, lowering the cost to 800. By using non-ECC I could have used a cheaper motherboard and CPU, in addition to cheaper RAM. In fact I would probably have used hardware from an older desktop PC. It would have been a 15-20% saving, or 45% if I take reuse into account. Not negligible.

My previous NAS, running Linux soft-RAID and entirely made of salvaged parts except for some of the disks, had a few corruption problems. One of them was caused by a defective disk. ZFS would have caught it, so even on cheap systems, ZFS has its use.

I also had defective DRAM, rebuilds not going smoothly, etc... That system caused me too many scares, so I decided that the next system would be cheap, but not so cheap as to endanger my sanity. I also got a proper backup solution.


I'm so glad to see this comment high up.


It's not that it NEEDS it, it's that if you DON'T use it you are introducing potential data errors into an otherwise checksummed data path, which would completely negate the rest of the path.


I reproduced this by bit-squatting cloudfront.net after reading about it. So many memory errors!

http://dinaburg.org/bitsquatting.html

Loved the variety as well. Sometimes, though, the requests that came to me even had the correct Host header!
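
If anyone wants to try the same experiment, a small sketch of the enumeration (Python; nothing official, just an illustration):

  import string

  ALLOWED = set(string.ascii_lowercase + string.digits + "-.")

  def bitsquats(domain):
      # every single-bit flip of every character that still looks like a hostname
      out = set()
      for i, ch in enumerate(domain):
          for bit in range(8):
              flipped = chr(ord(ch) ^ (1 << bit)).lower()
              if flipped in ALLOWED and flipped != ch:
                  out.add(domain[:i] + flipped + domain[i + 1:])
      return sorted(out)

  print(bitsquats("cloudfront.net"))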


Wait so when someone typoes cnn.com as con.com, that is ipso facto a memory error? I guess I could see that if the characters are radically far apart on the keyboard? But doesn't a simpler explanation like "one person out of billions with Internet access typed the wrong thing" seem a lot more likely?


Domains that are only used as CDNs, like cloudfront.net, are almost never typed into an address bar. Errors in the domain name are more frequently the result of a bit flip.


Also, these typos are typically not easy ones, since flipping a bit changes the letter in ways that are unlikely typos. With cloudfront.net, a negligible number of people would be typing it at all. Close to 100% of the errors that I saw were loads of images, CSS, or JavaScript files that some other page depended on.


This seems like a pretty weak argument. OTOH, 3% of the HTTP requests made to a bitsquatted domain in the linked articles had the original domain in the Host: header; those sound like actual memory errors.


This Defcon 21 presentation from Robert Stucke did something similar with Google's domains, plus other stuff. A great watch if you've got a spare 40 minutes!

https://www.youtube.com/watch?v=yQqWzHKDnTI


Some Macs do use ECC memory (specifically some of the most popular varieties, like the Mac Pro), which is probably why you saw lower numbers on bit-squatted domains.


Fascinating article. Did you ever find a reason for the different iOS results?


Based on source IPs I would say that cheaper RAM = more error prone RAM.


Yes. Everybody reading this should use ECC RAM, and non-ECC RAM should be called "error-propagating RAM".

Random bit flips aren't cool, and they happen regularly. Most computers that have ECC RAM can report whether errors happen. I see them at least once a year or so. For instance, here are 2 ECC-correctable memory errors that occurred just last month.

Cosmic rays? Fukushima phantom? Who knows. You'll never know why they happen (unless it's like a bad RAM module and they happen a lot), but if you don't rock ECC you will never know they happened at all. You'll be left guessing when, years later, some encrypted file can no longer decrypt, and all the backups show the same corruption...

[1]: https://www.dropbox.com/s/zndvy3nkv1jipri/2017-03-20%20FUCK%...

[2]: https://www.dropbox.com/s/6yeoedc7ajzq4u9/2017-03-20%20FUCK%...
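
On Linux, assuming the EDAC driver for your memory controller is loaded, even a trivial script in a cron job will surface these (a sketch, not a polished tool):

  from pathlib import Path

  # EDAC exposes corrected (ce) and uncorrected (ue) error counters per memory controller
  for counter in sorted(Path("/sys/devices/system/edac/mc").glob("mc*/[cu]e_count")):
      count = int(counter.read_text())
      if count:
          print(f"{counter}: {count}")      # mail this, page on it, etc.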


I remember the one time I bought ECC memory, for a PII-400. It was only 512MB or so I think, but in the 12 years that server ran I saw a grand total of 1 corrected error in the logs. Given how much of a premium that ECC memory was it felt like a waste.


Nice file names, gonna have to start naming my bug reports similarly


An old article from DJB worth a perusal: http://cr.yp.to/hardware/ecc.html

It's also worth noting that not all ECC (SECDED) is created equal: ChipKill™ and similar might not survive physical damage, because that will likely short the data bus, but recovering from a single malfunctioning chip producing/experiencing a higher hard error rate is possible.

Also, it'd be really cool if some shop à la Backblaze blogged about large-scale monitoring for soft and hard RAM errors across chip/module models (+ motherboards & CPUs). Without someone collecting and revealing years of data from real use, the conversation devolves into opinion and conjecture.

Finally, not all use cases can benefit from ECC (e.g. Angry Birds), but there are some obvious/non-obvious ones that can (e.g. routers exposed to non-ECC DNS bitsquatting, or processing bank transactions).


PS: Random crazy thought: it's curious, given the reduction of costs via Moore's law improvements, that there aren't yet formally verified, zero-knowledge systems which can prove end-to-end that they performed a computation/real-world side effect and/or continue to safely store data. Why blindly trust anyone or any company with data that can be seized, lost or misused when distributed computation, communication and storage can be end-to-end with only limited participants knowing the operations/plaintext? Perhaps: homomorphic encryption, a blockchain-like ledger or proof-of-work, and periodic, authenticated hash challenge queries. Mix in relaying and other idle phony traffic to make triangulation more difficult. I think that, in order to assure sufficient distributed system resources are made available, μpayments à la AWS but just covering costs would make it possible to have a persistent, anonymous computation and storage collective that would survive outages, FBI raids, single nodes going offline, etc.


Storage, yes: https://storj.io/. Computation ... sure, if you don't mind the server viewing the contents of your computation and you can verify the results. Sadly, fully homomorphic systems incur waaaay too much overhead, so you are constrained in what you can do (i.e. specialized DBs, zkSNARKs, etc).

Then, of course, there is the problem of network latency and bandwidth costs vs just keeping it all on one datacenter.


It wasn't necessary because we've already had systems whose hardware and/or software reliability reached decades between events of unplanned downtime.

https://lobste.rs/s/jea4ms/paranoid_programming_techniques_f...

http://www.hpl.hp.com/techreports/tandem/TR-86.2.pdf

http://h71000.www7.hp.com/openvms/whitepapers/high_avail.htm...

http://www.enterprisefeatures.com/why-are-iseries-system-i-a...

Now I'm not including all the anonymous, zero-knowledge stuff since the market won't buy that. All kinds of costs come with it that they don't want. Besides, most consumers and enterprises love products with lots of surveillance built in. ;)


A better question is why /shouldn't/ you use ECC memory?

Generally the answer to this is any context where you legitimately do NOT care about your data at all, but you still care about costs. This predominantly devolves into consumption-only gaming systems.

In all other cases everyone would be better served (in the long run) by buying ECC RAM.


My main issue is that it isn't just a choice between ECC memory and not, but I'd also need a different motherboard and processor, right?


A common network topology is to have a load balancer distribute load to a number of cheap HTTP servers which internally connect to a centralized, powerful database server. In this case only the database server really needs ECC RAM. The system is designed to be fault-tolerant to the loss of any individual HTTP server node, so the increased cost vs. the problem it solves doesn't make sense.

I guess you could argue that a random bit flip could somehow make the HTTP server vulnerable and able to compromise the network, however that risk is impossibly small. If we take IBM's estimate that a bit flip occurs at an approximate rate of 3.7 × 10^-9 per byte per month and scale it by the number of bytes in the system, you can see that the odds of randomly corrupting a byte in memory that triggers a vulnerability are too small.


What about memory-error corrupted application data (or application logic) where the corruption occurred on load balancers or web application servers? There's more to data integrity than security holes.


If you write code that detects stack smashing and illegal dereferences, then you can terminate the webservice and either have a watchdog restart it or, if it crashes multiple times, have it taken out of service by the load balancer. There are plenty of ways to handle hardware errors without throwing out the hardware and getting "better" hardware. Technically, you could have a faulty component somewhere between the RAM and CPU, and then what is your expensive RAM going to do? What if the CPU cache has errors? For many small businesses, often the difference between success and failure is their ability to make things work without throwing cash at the problem.


Even with your proposed checks there remains a high probability of just getting silent application data corruption, not crashes.

Regarding faulty components, that is one part of ECC's job, but the other part is correcting the regular bit flips that happen with nominally operating DRAM.

Flagging faulty components is more useful than you propose. There are not that many places where this corruption can occur, so being able to rule out RAM is very useful. The example you used, CPU caches, is actually already covered by ECC in most CPUs, including reasonably recent x86/amd64.

The tradeoff would be more worthy of thought if ECC were much more expensive.


ECC RAM is not a RAID. If corrosion on a trace causes a bit flip on an adjacent line, then the RAM will receive the corrupt data as valid. There is no parity RAM stick to recover from. I never said ECC RAM doesn't have a purpose. I'm saying you are wasting your money if you think it's essential to running a web server. Let's be real here, like 80% of computers on the internet stream porn. They don't need ECC RAM.


> In this case only the database server really needs ECC ram.

That's only true if the database is read only. Otherwise, you will still insert corrupt data into it.


Assuming you don't do any input validation checking, which is foolish for a database server, particularly one dealing with SQL.


So, I've sent an int32 representing a payment amount. One of the low order bits gets flipped. Can you explain to me how I'd validate it?


Compare the cost of the transaction to the price paid. First, parse the payment amount from the client and store it in a bad region of memory. Then compare that variable with the transaction amount. When you read from the variable it will be corrupt and not match. Log the error as high priority because it shouldn't occur. Just saved you a ton with a simple if statement.


> Compare the cost of the transaction to the price paid.

The cost of the transaction is, by definition, the price paid. `if x != x: raise_error()` only works if x is NaN.


No, the price of the transaction is the sum of the items in your shopping cart, which is usually stored as a session variable or cookie. The price paid is the value charged to the customer, which is gathered from an input parameter on form submission. If you are attempting to bill someone $1000 for an item that costs $10, then you have a problem. You should know what the value of every item in your inventory is, right? And you should also know when said item is being purchased if you are charging someone for it. If you didn't do this check, what's to stop someone from submitting a payment of $10 for a product that costs $1000? Your fancy system with ECC RAM would let it go through, and you just lost $990 because you thought hardware could fix your software mistakes.

This is a ridiculous conversation because data corruption could happen in the CPU cache, the QPI, or a number of micro-components in between the RAM and CPU that could cause errors ECC RAM can't fix. ECC RAM is not a catch-all for poor programming and poor validation checking, period.


This article is gold in so many ways. It contains interesting bits of information on ECC, company history that I didn't know (Sun's and Google's namely), filesystem reliability (I never knew!), the physics of RAM (50 electrons per capacitor)...

It's a must read, even if only to get you thinking about some of these things.


Depends on what you are doing.
ZFS storage servers: Hell yes
High-value data in my DB? Hell yes
Email server: Nope
Super cool gaming rig: Nope
Cluster: Hell yes

General office workstation: maybe.

I don't have the budget for 20 redundant copies. I do have the budget for slightly more expensive RAM. Especially on my ZFS storage arrays.

ECC memory is like insurance: you hope you never need it. One real downside that I have found is finding out _when_ that memory correction has saved your ass. RAID arrays can alert you when a disk is dead. SMART mostly tells you when disks are failing. I haven't found a reliable tool to notify me when I am getting ECC errors/corrections.


I agree on gaming, but e-mail often contains important information that I wouldn't want to suffer from random corruption.


I don't understand why anyone would run their own email server. Cloud offerings work so well and are cheap.


Maybe they are under contract not to pass information to third parties or maybe the company policy is to not let internal email off the network.

That you don't understand it is likely because you're looking at it from the perspective of an individual, possibly a private user. For those applications you can't beat the cloud. For business use, every business needs to weigh its own needs.

Even then, though, many businesses think they need to have their own server when they really don't, and vice versa.


Sure, but those are rare and usually include enough budget to include sysadmins and definitely enough to buy ECC memory. For anyone weighing whether ECC is worth it, they are wasting time managing their email server.


Those are not rare at all. Every lawyer's office has this problem, every journalist, every banker, every insurance company, every notary public, every administration, and so on.


Most of these are not contractually obliged to run their own email servers so whatever problem they have, it's not that specific one.


No, but they are contractually required to keep their customers (and their own) data confidential. And that can lead to them deciding to run their own mailservers as well as other infrastructure. Whether that's a good decision or not is another matter, that mostly depends on execution.


Right, what I mean is that your average legal practice, notary public, journalist doesn't actually have this problem like you said. Cloud services cover them just fine.

Somewhat unrelated, your comment gave me the idea to look up the MX records of the last few law firms I've interacted with: mostly cloud, as expected. The biggest and fanciest one probably has their own servers. Their terminating MX is some middling cheapo hosting company. Disturbing.


Yes, I would agree that for most companies that are in this position rolling your own could easily end up being more problematic than going with gmail or office365. That doesn't mean it does not happen and when it happens they are usually sitting ducks.

The chances of your average law office having an IT staff with capabilities comparable to Google are nil. At the same time the legacy of Snowden has caused a lot of companies to wonder if they're wise to put anything off-premises. And then there's dropbox, weshare and a million other 'handy' services that could easily hoover up and analyze everything that passes through (or whoever hacked them).


If you are using the cloud for email they can usually see all of your activity. Email is also how you typically reset passwords. It can expose corporate secrets.

In practice, nobody encrypts their email. And even if they do, the cloud still gets all the metadata.

Running your own trades the above issues for other issues, but depending on your priorities and fears it might be worth doing.


> in practice nobody encrypts their email.

I work in the defence industry. All attachments must be encrypted. Also, all customer data must be stored in the same country.


Cloud offering here: we run ECC memory on all our servers, natch. It's not turtles quite all the way down.


> cheap

Hard to beat free. Couple this with the fact that I learn something by setting it up, and this is a win for me.


NSA agrees with you, for one.


NSA is not the primary threat here. More conventional legal mechanisms are. Home serving is just as vulnerable to the NSA.


Conventional legal mechanisms against your home server cabinet can be handled via full disk encryption and a reed switch on your cabinet door connected to your power strip.


Sounds fun until you have to open your cabinet door for legit reasons like swapping a faulty hard drive in your RAID array.


Huh? If you're going to swap a faulty hard drive you want to power off anyway.


I hotswap drives all the time, it's not a problem and makes a harddrive swap a 30 second task instead of a 10 minute task and doesn't incur downtime either.


Most consumer hard drives (and indeed bays) are not designed for hotswapping and it can cause damage (though maybe modern build quality is good enough that you'd be lucky most of the time). "Downtime" on your home server in your closet is a minor inconvenience at worst.


The SATA connectors are designed for hotswapping, the ground leads are longer than the others so you get nice properties when connecting and disconnecting. I'd be mildly concerned about properly stopping the drive that's being disconnected, except it's probably being disconnected to be replaced. I don't see much difference between connecting a drive and turning the power on to an already connected drive.


I use NAS hard drives, which are built for hotswapping. I have no idea why anybody would use a consumer hard drive in a RAID array; the price difference is 10€ at best, AFAIK.


For a home server what's the benefit you're paying for though? I don't need max performance (I use RAID for redundancy rather than anything else), and a little downtime when I replace a disk isn't an issue.


If you have several drives in the same bay, you're going to get vibrations that severely reduce the lifetime of the hard drive. NAS drives also have much better electronics/mechanics to help them not crash all your data while in use. They won't try to heroically save that one sector and will report to your RAID controller instead, meaning you get a much better overview of hard drive defects.

Lastly, NAS drives have a much lower error rate than desktop drives due to the use of higher-quality heads that increase error resistance and lifetime.


You want NAS drives for TLER and vibration tolerance.


You should be getting MCA (machine check architecture?) notifications in syslog/dmesg if there are ECC-correctable errors, and an MCE (machine check exception) on the console for an uncorrectable error, based on my experience with SuperMicro Xeon servers running FreeBSD. A lot of our servers see a few correctable errors once in a while, and it doesn't affect the usability of the system; but sometimes the number of correctable errors is very high and the system is very sluggish.


Thanks!


There is a hidden cost of ECC with regards to the chipset. None of the cheap chipsets support it, so on any home build, it's going to be expensive.


Fortunately the new AMD Ryzen processors support ECC in the memory controller, unfortunately none of the boards seem to be testing/certifying it yet and the UEFI on a lot of the boards is a mess right now.

Hopefully more consumer boards support/certify it since it is already there on the memory controller.


The Asrock 370 boards officially support ECC, though I've read that various BIOS/UEFI versions don't: https://www.reddit.com/r/Amd/comments/655e7v/all_asrock_am4_... People are reporting that the Asrock 350 boards do too. Gigabyte lists some ECC modules on the compatibility list but some report that it isn't working right.


Good to know. I'm waiting for all of these boards' UEFI stuff to stabilize before I go shopping. Seems a bit messy right now.


Not true with Ryzen, as long as you find unregistered ECC acceptable.

Somewhat not true with Intel, as some of the lower end Xeons now support it.


Ryzen ECC support is a mess; no AM4 motherboard currently on the market has implemented ECC support fully and properly (not even ASRock). It's better than nothing, but you would be a fool to rely on it.

http://www.hardwarecanucks.com/forum/hardware-canucks-review...

"Kinda sorta works but the manufacturer won't stand behind it" is bunch of bullshit. If your data is worth using ECC in the first place - it's worth using a platform that has fully-implemented support, that has passed validation, that you know is going to work properly when you need it.

Until that happens - this is an application where Ryzen is simply not appropriate.

All of the modern i3s and Pentiums support ECC, but you do need the server chipset instead of the cheap consumer stuff. Good news though - those "expensive server boards" are roughly the same price as say, an AM4 motherboard with an X370 chipset.

Heck, you can buy a basic off-lease ThinkServer TS140 for only about $300. You'll only have about 4 GB of RAM but it's a shell to start building out (which is cheaper than having an OEM assemble it for you anyway).


> Ryzen ECC support is a mess; no AM4 motherboard currently on the market has implemented ECC support fully and properly (not even ASRock). It's better than nothing, but you would be a fool to rely on it.

Ryzen motherboard support is what is admittedly a "mess", not the processor itself, but at least it's functional on ASRock and select Gigabyte boards. As for "a fool to rely on it", I'm not sure what you mean by that. The error correction itself is done by the hardware. Other than calling the initialization routines and providing logging/halt, the BIOS/UEFI isn't responsible for anything AFAIK.

I'm well aware that this isn't the full grade of ECC support offered by higher-end Xeons and chipset combos, but it's better than nothing and it's affordable.

Also, no offense, but I'm not going to rely on hardwarecanucks as an authority on this subject.

> All of the modern i3s and Pentiums support ECC, but you do need the server chipset instead of the cheap consumer stuff. Good news though - those "expensive server boards" are roughly the same price as say, an AM4 motherboard with an X370 chipset.

The goal isn't ECC alone, at least not for me, the goal is an 8-core system with good single-threaded performance and ECC at a reasonable price. As far as I know, only Ryzen offers that.

So for me, I'm looking at the possibility of getting a single system that can give me decent gaming performance, good development performance, ECC support, and more, all at a price that leaves me with money for other components.


> Also, no offense, but I'm not going to rely on hardwarecanucks as an authority on this subject.

Fine then. AMD says it's unvalidated and unsupported, is that good enough for you?

> I'm well aware that this isn't the full grade of ECC support offered by higher-end Xeons and chipset combos, but it's better than nothing and it's affordable.

So would you be OK with running Xeon engineering samples then? After all - they certainly pass the same "best effort" test. Personally since these are server ES hardware - I'd tend to trust it more than consumer hardware like Ryzen, especially given their comparative age/maturity.

I just picked up a 10-core Haswell Xeon engineering sample for $140 last week. 40% more multi-threaded performance than a Ryzen 1700. The X99 mobo I picked up from Microcenter for $60 doesn't have ECC support but a bunch of them do.

Or if you want something that's official and you know works, there are surplus Sandy Bridge Xeons very cheap nowadays. A decent bit more multithreaded performance than a Ryzen 1700 - but you'll be giving up single-threaded performance. http://natex.us/intel-s2600cp2j-motherboard-dual-e5-2670-sr0...

Or really - a full retail E5-2630 v3 is under $500 now on eBay. That's not really that bad if you just have to have everything in one box.

> So for me, I'm looking at the possibility of getting a single system that can give me decent gaming performance, good development performance, ECC support, and more, all at a price that leaves me with money for other components.

What it comes down to: if you want everything in one box then be prepared to shell out. Everyone has this market segmented out, including AMD (after all they won't stand behind Ryzen's ECC either). If you feel you need ECC, that's really not a valid solution.

If a Xeon doesn't cut it for you - sounds like you might be in the market for two boxes here. A server/workstation with ECC and good multi-thread performance, and a gaming machine that you can overclock and get the best single-thread performance out of.

(Also, in general, overclocking seems kind of counterproductive to the aim of running ECC RAM, although I guess I haven't looked into that.)


> Fine then. AMD says it's unvalidated and unsupported, is that good enough for you?

No, that's not what AMD said, they said it isn't validated by motherboard partners. The functionality is there, it's up to their partners to use it.

> So would you be OK with running Xeon engineering samples then? After all - they certainly pass the same "best effort" test. Personally since these are server ES hardware - I'd tend to trust it more than consumer hardware like Ryzen, especially given their comparative age/maturity.

That's not even a remotely accurate comparison.

> What it comes down to: if you want everything in one box then be prepared to shell out. Everyone has this market segmented out, including AMD (after all they won't stand behind Ryzen's ECC either). If you feel you need ECC, that's really not a valid solution.

Sorry, but so far all of your proposed "solutions" are summed up as: "If you give up significant performance, functionality, buy second-hand, or completely ignore official support statements, X competitor is the better deal!"

> If a Xeon doesn't cut it for you - sounds like you might be in the market for two boxes here. A server/workstation with ECC and good multi-thread performance, and a gaming machine that you can overclock and get the best single-thread performance out of.

No, the goal is to have one system, and at this point, Ryzen looks like the best option. If a competitor decides to release something equivalent, I'll consider them too.



AFAIK all of the Xeon chips support ECC, every Xeon E3 chip (which uses the desktop socket) I've looked at includes it.


Sorry, I was a little obtuse. What I was implying was that historically, only higher-priced server chips from Intel had ECC support. In 2013, Intel launched the lower-end Xeon E3 v3 server chips, which were closer to the price of the consumer Intel chips and offer ECC at comparable clock speeds. Of course, all of those only have 4 cores instead of 8.

Yes, all of the E3s support ECC, but not all Xeons supported ECC before the launch of the Xeon E3, as far as I can tell.


Xeon has implied ECC support for as long as Xeons have had integrated memory controllers, which is just a generation or two further back than the first Xeon E3 product line. Before that, ECC support was a function of the northbridge.


Hmm, perhaps this is a quirk of ark.intel.com then; it shows (as an example) that the Xeon X5690 supported ECC, but the Xeon L5638 did not.

The list on Wikipedia also seems to imply that not all models did historically; perhaps this reflects the northbridge change?


There was the Xeon 3400 even before that. Trivia: it supported registered ECC, but only x8 chips and not x4.


Only ASRock currently has BIOS/UEFI support for ECC.


BIOSTAR specs list ECC support, but I don't know if it has specific BIOS/UEFI options for it.


No chipset has supported ECC for quite a while: Flipping some configuration bits depending on the chipset used is purely an Intel money extraction-engine (Intel ME technology®©™).

Server / workstation class boards normally all do support ECC, though, so no real issue in practice.


There are cheap chipsets that support it: the entry-level workstation ones from Intel. You can get a motherboard with ECC support for $65; you don't need to go to X99.

I have a few home storage servers running low-end Pentiums with ECC support on boards like these.


I just last week bought an ASRock C236 WSI [1] for £170. 8GB of ECC RAM was £80. Granted, five years ago I would have needed to pay more than twice that amount, so I skipped the ECC :p

[1] http://asrockrack.com/general/productdetail.asp?Model=C236%2...


If you use your computer in a way that makes you think about the potential benefit of ECC, the price you're likely to target for a rig that fits your needs is very probably high enough to get some nice ECC...


Some older AMD desktop chips support ECC.

I built a home NAS from an old board and a Phenom II 545 CPU I had lying around; fortuitously, they happen to support ECC. DDR2 unregistered ECC RAM was a bit of a pain to find, though.


Not only cost, it can also be harder to find a motherboard with ECC support and with all the components/inputs/outputs that you would want in a desktop computer.


See sibling comment: I just picked up an ASRock C236 WSI


Yes.

Bit errors are uncommon and range from benign to crash-inducing.

Your storage has them, memory has them, network has them.

Non-error-correcting memory significantly increases that risk.

And this is the kind of risk you don't notice until you do, and when you do, it's often subtle, insidious, and nearly impossible to track down.

Servers, absolutely. It's debatable on the desktop, but we have huge amounts of RAM now; might as well error-correct. The per-bit error risk is small, but bigger RAM only adds to the odds.


Does it make any difference if you're using your desktop to compile stuff?


If you want to be sure it's right, yes.

Here is the thing:

Without ECC, or even simple parity on the RAM, the CPU cannot validate a data transfer.
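(For the curious, a minimal sketch of what even a single parity bit buys you; illustrative Python, not how a memory controller actually implements it. Real ECC DIMMs use a SECDED code: roughly 8 check bits per 64 data bits, which corrects single-bit flips and detects double-bit flips.)

    # Even parity over a word: one extra bit, detects (but cannot
    # correct) any single-bit flip.
    def parity(word: int) -> int:
        return bin(word).count("1") & 1

    stored = 0xDEADBEEF
    stored_parity = parity(stored)

    flipped = stored ^ (1 << 13)               # a single bit flips in "RAM"
    assert parity(flipped) != stored_parity    # the flip is detected
    # With no parity bit at all, the flipped word is silently returned
    # to the CPU as if nothing happened.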

In the 90s, a place I worked for had a server running non-parity, non-ECC RAM. That machine was fast and cheap.

But it would demonstrate the most bizarre problems, from time to time.

A fresh OS install would fix it. Then, a year or so later, it would go off the farm again.

I saw it had no error correction, had it replaced with a very similar machine, and there were no issues.

The argument was, it's only the possibility, and only once in a blue moon...

The bigger the RAM, the faster we do stuff, the sooner "once in a blue moon" tends to happen.

I did put that box on my personal network, and under Linux (was win NT before), seemed fine. In the syslog, after a year, there were various kernel messages, each recovered, but there was something to recover from... Win NT would blue screen a lot. That's different today with better kernel software from Microsoft, but the point is no error correction comes with no real way to understand where some trouble may have come from.

And that was doing light duty stuff. Didn't trust it for a build, frankly.

We get fast, quality, cheap. Pick two :D

More generally, the fact that the CPU cannot know whether its transactions with RAM make any sense, unless ECC or even simple parity is present, should be a worry today.

Our processes are small, clocks fast, density high. We are pushing it on all fronts!

Best employ error correction.

And, back in the day, the Apple II had no parity on its RAM; the first IBM PC did. Even those much larger, more robust circuits, clocked slowly, would throw bit errors.

The IBM guys knew that from their experiences.


> I did put that box on my personal network, and under Linux (was win NT before), seemed fine. In the syslog, after a year, there were various kernel messages, each recovered, but there was something to recover from... Win NT would blue screen a lot. That's different today with better kernel software from Microsoft, but the point is no error correction comes with no real way to understand where some trouble may have come from.

Haha, I remember my first steps with Linux in 1998/99. I had a faulty 100MiB hard disk that was constantly getting errors on Windows. However, when I tried to use it on Linux, I found it worked without any issue.


Yeah, good times back then. :D

I ran a Win, IRIX, Linux network in my cube. Had, like 5 machines all doing various things.

Here's another similar thing:

Someone handed me a 33MHz SGI Indigo. "What can we do with it?"

I compiled a little program called "amp" to play MP3 files, just wondering...

That thing could actually play 256kbps files, shared over NFS, while also offering a desktop. Someone else made a little app that could select tunes, start, stop. I could tell the bitrate based on the CPU load.

Put that and the SGI mixer on the screen, and it was the department tunes. I took it home at one point, where it continued to do that task well into the 00s.

CPU utilization was 95 percent, but it ran all day long for weeks without a stutter.

The general point being: a Unix or Linux box could do magic on trashed, old, odd, slow gear and not miss a beat.

An old Pentium 90 running RH 5.2 served up the web pages while also acting as firewall and doing mail.

That thing was literally a dumpster dive. It had NT on it, and it just would not run no matter what. Linux did, with a stream of kernel chatter in syslog. A console window (tail -f) showed this stream of scary-looking text the whole time. Crazy!

That was a stunt. It worked. It should not have. It did a few months' duty, with the real machine queued up, just in case.


Today, one can buy good hardware and get a seriously long run time on it.

A spend for a fast, robust, ECC machine is worth it.

During the rapid ramp up early on, price arguments were stronger because replacement came much sooner.

Today, particularly on desktop, one can get a killer machine and run it more than long enough to factor out the cost of ECC.


I wouldn't bother on a desktop or laptop. Servers absolutely.


I won't bother on a desktop. I've been using 4 machines for the last 17 years with storage varying from 10GB to 2TB and RAM varying from 128MB to 16GB and haven't personally seen any kind of data corruption in motion (or at rest for that matter). Only had 2 mechanical drives fail (though predictably).

ECC is costly: the memory modules themselves, and the board required to support them properly.


The only reason ECC is costly is that Intel has a monopoly on the desktop/server chip market and they refuse to enable ECC on consumer chips. The hardware is there; it's just fused off.

If ECC carried only the cost of the extra memory chips (8 check bits per 64 data bits), and we assumed a linear relationship, then it should cost about 1/8 more than non-ECC DRAM. Unfortunately, Intel's decisions have knock-on effects that ripple through the rest of the market.

IIRC I saw somewhere that JEDEC expects a future standard will require ECC to achieve acceptable error rates for all memory. At that point Intel won't have any choice.


> haven't personally seen any kind of data corruption in motion

Ever had a program crash, hang, or act oddly? That's how data corruption in memory surfaces.

Of course, non-perfect programs (i.e. all of them) act the same way, which means that differentiating memory corruption from misbehaving programs is hard.

Fixing the memory errors will result in a more stable system, but it still won't be perfect.


> haven't personally seen any kind of data corruption

How would you know? Unless your computer use has been literally trouble-free (and all your archived data has been verified for correctness somehow), you can't know that none of your glitches over the past 17 years has been due to memory errors.


This is a bit like an inverse magic stone argument ("This stone repels tigers — How do I know it works? — I'm not seeing any tigers around here, do you?").


I have seen numerous memory corruptions on many of my computers, including end-of-life memory sticks and motherboards that died before my eyes.

At work, including my sysadmin years, and going back a very long time to summer jobs fixing computers, I have spent countless man-months debugging issues that were ultimately caused by memory errors. All of that could have been avoided by using ECC.


Have you not had a computer do something unexpected in the last 17 years? A bit flip might look like a kernel or application crash. Have you ever saved a file that couldn't later be opened by an application? You probably blamed the application for being buggy, but it could have been an upset. Bit flips / upsets cause all sorts of odd behavior.


The article makes no mention of single event upsets (SEUs). These occur randomly when cosmic rays cause a bit flip anywhere in the chip. ECC is a good way to mitigate SEU effects.


Sorry for nitpicking, but it's not the cosmic rays themselves, it's the secondary cascade shower from cosmic rays (produced high up in the atmosphere when a cosmic ray interacts with a particle there).


I am typing this (finally!) on my new desktop build. I did mull over the decision for a while but finally went with Xeon and ECC. So the memory cost more - perhaps even twice as much - so what? I use my computer pretty heavily for my work - with several VMs running at a time. If ECC saves me a headache once a year, it will have paid for itself. If it never provides ANY benefit I will still not regret the peace of mind.


The parameters of the desktop ECC decision have changed massively with today's glacial replacement cycles. Today you make a one time payment for many years of avoided headaches and peace of mind, whereas back then any sign of unreliability would have been a welcome excuse for a cheap upgrade.


No-one's mentioned it yet, but we're in a post-Rowhammer world and ISTM this is relevant to the discussion: while not all non-ECC DIMMs are susceptible, the cheaper ranges generally are, and if your purchasing decisions are driven by hardware cost, that's probably what you'll end up with. Corruption due to malice is a rather different beast to corruption due to random cosmic rays...


Rehashing an old comment:

IEC 61508 documents an estimate of 700 to 1200 FIT/Mbit (FIT = "failures in time", i.e. failures per 10^9 hours of operation) and gives the following sources:

a) Altitude SEE Test European Platform (ASTEP) and First Results in CMOS 130 nm SRAM. J-L. Autran, P. Roche, C. Sudre et al. Nuclear Science, IEEE Transactions on Volume 54, Issue 4, Aug. 2007 Page(s):1002 - 1009

b) Radiation-Induced Soft Errors in Advanced Semiconductor Technologies, Robert C. Baumann, Fellow, IEEE, IEEE TRANSACTIONS ON DEVICE AND MATERIALS RELIABILITY, VOL. 5, NO. 3, SEPTEMBER 2005

c) Soft errors' impact on system reliability, Ritesh Mastipuram and Edwin C Wee, Cypress Semiconductor, 2004

d) Trends And Challenges In VLSI Circuit Reliability, C. Costantinescu, Intel, 2003, IEEE Computer Society

e) Basic mechanisms and modeling of single-event upset in digital microelectronics, P. E. Dodd and L. W. Massengill, IEEE Trans. Nucl. Sci., vol. 50, no. 3, pp. 583–602, Jun. 2003.

f) Destructive single-event effects in semiconductor devices and ICs, F. W. Sexton, IEEE Trans. Nucl. Sci., vol. 50, no. 3, pp. 603–621, Jun. 2003.

g) Coming Challenges in Microarchitecture and Architecture, Ronen, Mendelson, Proceedings of the IEEE, Volume 89, Issue 3, Mar 2001 Page(s):325 – 340

h) Scaling and Technology Issues for Soft Error Rates, A Johnston, 4th Annual Research Conference on Reliability Stanford University, October 2000

i) International Technology Roadmap for Semiconductors (ITRS), several papers.

If that's correct, the math is simple: you have bit flips in your PC about once a day.

It's just that (a) you often won't notice those transient errors (one pixel in your multi-megapixel photo is one bit off) and (b) a lot of your RAM is probably unused.
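(Back-of-the-envelope, using an assumed mid-range figure of ~1000 FIT/Mbit from the ranges above; illustrative only, since real rates depend heavily on process, altitude and vendor:)

    FIT_PER_MBIT = 1000            # assumed mid-range: failures per 10^9 device-hours
    ram_gb = 8                     # assumed desktop RAM size
    mbits = ram_gb * 8 * 1024      # 8 GB = 65,536 Mbit
    flips_per_hour = mbits * FIT_PER_MBIT / 1e9
    print(flips_per_hour * 24)     # ~1.6 expected bit flips per day

So "about once a day" for a few GB of RAM is the right order of magnitude under those figures.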


Same topic, same conclusion, even more hard facts.

http://perspectives.mvdirona.com/2009/10/you-really-do-need-...


In the late nineties, the Intel desktop chipsets such as 440LX and 440BX offered ECC functionality, all you had to do was spend ten or fifteen bucks extra on the memory. Great hardware.

I'm unhappy that Intel made things more expensive and complicated with their market differentiation, but from their POV it was logical. PC users were screwing up the reliability of their systems in so many ways via overclocking, and were habituated to accept crappy reliability via pre-NT Windows. PC users could have demanded ECC and they didn't. I'm sure that even when the chipsets made it easy, only a tiny fraction bothered to use ECC.


For servers this is more or less a no-brainer: it's not a huge extra cost, and a single failure will cost you more than that.

For a regular desktop system for personal use it's not so easy. The data volumes are much smaller, the temperature environments are usually better, they aren't running (other than maybe idling) 24/7, and most of the stuff that is in RAM isn't mission critical (i.e. you don't have 32GB of RAM filled with customer database records; you have it filled with read-only FPS textures, compiler caches, etc.).

Unlike a business that has tons of data being mutated, my data is mostly immutable, such as photos. It's not a continuously changing dataset where a bit flip in memory is likely to find its way into my data and then into my backups, which would be the case for, e.g., databases or big creative work (movie editing etc.).


I've spent the last two weeks looking at Memtest86+ trying to figure out if either one of my memory modules is damaged, or if it is the motherboard. These tests take a long time, and yield different results from day to day.

I've decided never again to buy non-ECC memory, at least not for 24/7 servers or workstations.

In a gaming machine / visual typewriter? Sure, non-ECC memory is ok.


I think that, given the personal importance of computing devices and storage, no filesystem should exist without checksums on metadata+data, and no RAM should be without ECC. The slight increase in cost is not worth the risk of going without.
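(A minimal userspace approximation of the "checksum your data" half, assuming you only want to detect, not repair, silent corruption in files at rest; the directory and manifest filename here are made up for illustration:)

    import hashlib, json, pathlib, sys

    def sha256_of(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def build_manifest(root, manifest="checksums.json"):
        # record a checksum for every file under root
        # (keep the manifest itself outside root)
        sums = {str(p): sha256_of(p)
                for p in pathlib.Path(root).rglob("*") if p.is_file()}
        with open(manifest, "w") as f:
            json.dump(sums, f, indent=2)

    def verify_manifest(manifest="checksums.json"):
        # later: re-hash everything and report silent changes
        for path, expected in json.load(open(manifest)).items():
            if sha256_of(path) != expected:
                print("MISMATCH:", path, file=sys.stderr)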


I searched recently for a good Linux laptop with ECC but didn't find much, so I settled on a Kaby Lake i5. Does anyone make them?


For example, the Lenovo P51 seems to support ECC (if equipped with a Xeon processor). I don't know about its Linux support, but I understand at least some other Lenovo models work OK with Linux.

http://psref.lenovo.com/Product/ThinkPad_P51


If you can afford it, sure. That's one reason why I'm so happy Ryzen supports it on consumer processors: It makes ECC cheap.


What are the odds of memory errors causing hard disk corruption / boot failure?


Yes. Everyone does.


Someone already mentioned Rowhammer, so ECC, yes :)


Yes. Are we done :)


Altitude is also a factor in random memory corruption.

From the wikipedia article on ECC Ram, "Hence, the error rates increase rapidly with rising altitude; for example, compared to the sea level, the rate of neutron flux is 3.5 times higher at 1.5 km and 300 times higher at 10–12 km (the cruising altitude of commercial airplanes).[3] As a result, systems operating at high altitudes require special provision for reliability."


Data centres near Amsterdam are below sea level, which has been known to worry some of their foreign customers. They should just start advertising that as ECC error resistant :-)


so Numpy feature request: airplane mode


I wonder if this has anything to do with Microsoft's plans to build an underwater data center.


If I remember correctly, that research venture was mostly due to the potential of easy heat exchange and "free" energy via geothermal/tidal. Now that you mention this, though, it's clear that such a datacenter would also be naturally shielded from many things!


Can adequate shielding help?


Shielding against high-energy radiation is heavy, no matter what you use. The tenth-thickness (i.e., the thickness needed to attenuate the radiation flux by a factor of 10) of lead is 2 inches. For water and other light materials, it's about a foot. Commercial power reactors use lots of concrete, just because it's cheaper to make, form, etc.

So to go from "rare" to "almost never" you need maybe three or four tenth-thicknesses of shielding material. That's an impractical amount of mass to suspend around your datacenter (rack aisles, whatever).

Remember too, that you've already got ~30 feet of water equivalent in shielding (the atmosphere).
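(Rough arithmetic using the tenth-thickness figures above; a sketch, not a shielding design:)

    import math

    def thickness_needed(attenuation_factor, tenth_thickness_inches):
        # each tenth-thickness cuts the flux by 10x, so n = log10(factor)
        return math.log10(attenuation_factor) * tenth_thickness_inches

    # Cutting the flux by ~1000x (three tenth-thicknesses):
    print(thickness_needed(1000, 2.0))    # ~6 inches of lead
    print(thickness_needed(1000, 12.0))   # ~36 inches (3 feet) of water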


"<pubDate>Fri, 27 Nov 2015 00:00:00 +0000</pubDate>"

Needs (2015) added to the Title I think.


Thanks! Updated.


I want to thank Jeff for assisting Dan in writing this article.



That post is linked in the first sentence of the submission.


Do you use ZFS? If yes, then you should use ECC memory.

Do you have a use case where you would want your computer to alert you when the RAM is failing? If yes, then you should use ECC memory.

Otherwise it's a nicety and probably not worth the money.


Do you use ZFS? If no, then you should use ECC memory.

Now the half-truth becomes a full truth.


Here is an article detailing how ZFS is virtually unaffected by random bit flips: for a scrub to overwrite a valid block with a corrupt one, the corruption would effectively have to produce a SHA-256 collision with the block's stored checksum. Furthermore, it argues that only a highly specific, large-scale RAM corruption could possibly cause damage, and by that point it's almost certain the OS wouldn't even boot.

http://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-yo...
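(A toy sketch of that scrub logic, hypothetical and heavily simplified, not actual ZFS code; real ZFS defaults to fletcher4 rather than SHA-256 and stores checksums in the parent block pointers:)

    import hashlib

    def checksum(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def scrub_block(primary: bytes, redundant: bytes, stored_sum: bytes):
        """Return (good_data, repaired) or raise if neither copy verifies."""
        if checksum(primary) == stored_sum:
            return primary, False        # primary copy is fine, leave it alone
        if checksum(redundant) == stored_sum:
            return redundant, True       # rewrite primary from the verified copy
        raise IOError("both copies fail checksum: report, repair nothing")

    # A random bit flip (in RAM or on disk) makes a copy *fail* its checksum;
    # it cannot make corrupt data pass, short of a hash collision.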


I don't get this association between ZFS and ECC. The recommendation to use ECC with ZFS basically comes down to "all that fancy data integrity checking that ZFS does won't protect you from memory errors, so you'll effectively lose that feature."

Are you OK with silent data corruption? If so, don't bother with ECC. If not, use it.


History. The ZFS folks, back when, were the only folks making much noise about the association between non-ECC RAM and corrupt data landing on disk.

The truth is, if you care about the notion that your disk should return the same data that software thought it was writing, you should use ECC with any file system. But the ZFS folks made noise about the issue, I think lots of people assumed the reason was that there was something special about ZFS that needed it, and now you have something sort of like an urban legend.


Kind of, there are two main things that give ZFS this false reputation.

First is an academic paper testing whether modern filesystems still needed ECC RAM. They tested ZFS and concluded horrible things could happen to your data without ECC RAM. They found the same about ext2, but that was just a small paragraph people overlooked. So nothing new, but many people are unaware that other filesystems have the same issue.

Second is a moderator on the FreeNAS forums coming up with a scenario where a ZFS scrub would wipe out your data. Developers and other people who have read the code said it couldn't happen as described, but the story was perpetuated on the FreeNAS forums and spread across the net.


> should return the same data that software thought it was writing

Hint: in an OS using a page cache (i.e. every OS), I/O errors are not reliably propagated to applications unless they explicitly sync their dirty pages.
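(A minimal sketch of what "explicitly sync" means in practice, in Python; the filename is made up, and exact error-reporting behaviour varies by OS and filesystem:)

    import os

    fd = os.open("important.dat", os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, b"payload")
        os.fsync(fd)      # force dirty pages to disk; write errors surface here
    except OSError as e:
        print("write or sync failed:", e)
    finally:
        os.close(fd)
    # Without the fsync, the data may still be sitting in the page cache when
    # write() returns, and a later I/O error during writeback never reaches us.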


I'm aware of that, but I'm not sure what I'm supposed to take away from it in this context.


That it's difficult to accurately define "what the application thought it wrote" when considering corruption at various abstraction layers; somewhat similar to calculating checksums over already corrupted data.


> I don't get this association between ZFS and ECC.

Because ZFS was the ONLY file system that would actually catch some memory failures even if you didn't have ECC. So, ZFS got a reputation for being snotty when in reality the hardware it was running on was broken.


One of the major reasons for using ZFS is ensuring data integrity.

If you implement ZFS for that purpose and cheap out on RAM, you're at odds with that purpose.


Because with ZFS, bit rot can be cumulative. With most file systems, a memory error will corrupt a file if the format can't handle errors; with ZFS, over time the entire volume can get corrupted, especially when you are doing recovery or expansion, and even in normal operation data is moved around quite a bit. For the most part, with other common file systems, once a file is written it stays put, even in RAID.


The only detailed explanations I've ever seen for how memory errors can snowball into whole-filesystem loss on ZFS have relied on the assumption that you have a deterministically stuck bit in a region of memory that the OS is re-using for different parts of the FS data structures, but never anything that could cause the machine itself to crash (thereby cluing you in to a hardware reliability issue).

Do you have a source for a more plausible analysis that takes into account how memory actually tends to fail?


I don't know, I wouldn't say ZFS moves files around any more than a typical filesystem.


Correct. ZFS won't do unnecessary file reordering unless a scrub has been initiated.



