Attack of the cosmic rays: Undetected memory errors can happen to you (ksplice.com)
118 points by nelhage on June 24, 2010 | 28 comments



Since that incident, I’ve had several other, similar problems. Something would start failing mysteriously, but flushing my cache restored it to normal.

This seems like a bit of a red flag that in reality something else is actually going wrong with his computer.


Yeah, while cosmic-rays are cool and all it sounds like his RAM is failing.


It doesn't really matter if the errors were caused by cosmic rays or the ram failing -- it's still dangerous and relatively undetectable.


It does. How dangerous are they, actually? If it were that common for them to cause serious disruptions in the operation of our desktops and laptops, where are these faults? I mean, in the last 10 years, I simply can't remember any sudden and irreproducible computer failure that couldn't be quite convincingly attributed to something else.

I'm sure that this can happen to me, that at any particular moment my memory can get corrupted by these rays, or something else, and then the computer can misbehave or crash. And if I really wanted to be sure that it would not, I'd need some kind of protection against it. But compared to many other possible faults, is it really anything more than a very, very rare and minor cause of computer failure that I can simply disregard on my laptop, which is nowhere near a "critical and vital" system?


Going by the 12GB figure, I guess he's running 6x2GB DIMMs in a Core i7 box and is pushing his memory controller slightly over its limits.

I used to run 6x2GB at 1333MHz with an i7 920. I swapped the CPU with one that could properly drive my ECC memory (W3520), and quickly found my choices were either an unusable system, 3x2GB at 1333, or 6x2GB at 800.


To give you an idea of the density/frequency of this occurring: my wife's CCD for her PhD experiments routinely (roughly 1 exposure in 5) picks up huge spikes from cosmic rays during her 30-second exposures. The CCD is less than an inch square and she's 2 floors down from ground level.
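As a rough back-of-the-envelope sketch, those numbers can be turned into a hit rate per unit area. This assumes hits arrive independently (a Poisson process), treats "less than an inch square" as exactly one square inch (which makes the result a lower bound on the rate per area), and counts only the "huge spikes" mentioned above. The variable names are illustrative only.

```python
import math

# Figures taken from the comment above; everything else is an assumption.
hit_fraction = 0.2        # roughly 1 in 5 exposures shows a spike
exposure_s = 30.0         # exposure length in seconds
area_cm2 = 2.54 ** 2      # one square inch, an upper bound on the CCD area

# Under a Poisson model, P(at least one hit) = 1 - exp(-mean_hits),
# so the mean number of hits per exposure is -ln(1 - hit_fraction).
hits_per_exposure = -math.log(1.0 - hit_fraction)

# Dividing by the largest plausible area gives a lower bound on the flux.
flux_lower_bound = hits_per_exposure / exposure_s / area_cm2

print(f"mean hits per exposure: {hits_per_exposure:.3f}")
print(f"spikes per cm^2 per second (lower bound): {flux_lower_bound:.5f}")
```

A smaller CCD would imply a proportionally higher rate per square centimeter for the same observed hit fraction.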


Is she certain that it is not another PhD student one floor above her experimenting with X-Rays?


Yes, she uses that specific room in the building because it's farthest away from noisy experiments.


Reading this, I remember how hard NASA works to get their satellites and probes secure against cosmic rays, because out there in space, cosmic rays cause your memory to become pretty unpredictable. Error correcting codes and redundancy suddenly become really important, even though you are crammed into this little embedded system which has less processing power than some input devices these days.


The main reason for redundancy is because NASA can't send someone out to fix the probe/satellite, so the hardware has to be designed to work flawlessly for decades. Astronauts use consumer laptops just fine on the shuttle and ISS, and I doubt most of those even have ECC RAM.

Airplane avionics aren't radiation-hardened, but they work fine at altitude where the incidence of cosmic rays is about 1/4th that in space. Planes aren't dropping out of the sky from bit flips. People don't get tons of parity or ECC errors on their laptops while flying. Based on all this, I think the threat of cosmic rays is vastly overstated.


Worse yet, if your probe's OS crashes at the wrong time it could lead to loss of the probe or missing out on a lot of data that would otherwise have been gathered. If a probe's electronic brain crashes due to cosmic rays and goes into a "safe" state it may fail to maintain attitude control, potentially pointing its solar arrays away from the Sun and/or its high gain antenna away from Earth. This can lead to "bad things" such as the probe draining its batteries and dying before controllers have a chance to fix it (this has happened quite frequently in the past, though not always due to cosmic rays). Worse, the probe could reset during a critical course correction maneuver and end up failing to go into orbit around a target planet, burning up in a planet's atmosphere, or merely ending up in the wrong location on the wrong trajectory.

Avionics systems use ECC memory and other techniques to avoid being impacted by cosmic rays causing single event upsets (SEUs). As for parity and ECC errors on laptops, I don't believe ECC RAM is common on laptops.

That being said, this particular article uses a gross overestimate of SEUs for memory, which is really only applicable if your 4GB of RAM fills an entire room in multiple full sized racks (the studies he bases these figures on come from the 80s and the author fails to adjust for physical size when extrapolating to modern memory sizes).


Do any laptops support ECC RAM? Lots of machines will run with ECC RAM but only a few can perform SECDED (single error correction, double error detection). I had SECDED in my previous desktop and the machine seemed more stable than the one I had before. Here's Daniel Bernstein's rant on the subject: http://cr.yp.to/hardware/ecc.html
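To make the SECDED idea concrete, here is a minimal sketch using an extended Hamming (8,4) code: 4 data bits, 3 Hamming parity bits, and 1 overall parity bit. This is a toy illustration, not how any particular memory controller is implemented (real ECC DIMMs typically protect 64 data bits with 8 check bits), and the function names `encode` and `decode` are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8  # index 0: overall parity; indexes 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d      # data bits sit at non-power-of-two positions
    code[1] = code[3] ^ code[5] ^ code[7]       # parity over positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]       # parity over positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]       # parity over positions with bit 2 set
    code[0] = sum(code[1:]) % 2                 # make the total number of 1s even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of set positions is 0 for a valid codeword
    odd_parity = sum(code) % 2       # 1 means an odd number of bits flipped

    if syndrome == 0 and odd_parity == 0:
        status = "ok"
    elif odd_parity == 1:
        code[syndrome] ^= 1          # single error: syndrome is its position (0 = parity bit)
        status = "corrected"
    else:
        return "uncorrectable", None  # even parity but nonzero syndrome: double error

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
word[5] ^= 1                 # one flipped bit
print(decode(word))          # ('corrected', 11)
word[2] ^= 1                 # a second flipped bit in the same word
print(decode(word))          # ('uncorrectable', None)
```

The point of the extra overall parity bit is the second case: without it, a double flip would be silently "corrected" into the wrong value.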

Microsoft's datacenters have a straightforward policy for dealing with software errors: first restart the service, if that doesn't fix it then restart the OS, and if that doesn't fix it then swap out the machine.


For those interested, an example of the trouble cosmic rays cause the Hubble Space Telescope: http://archive.stsci.edu/cgi-bin/mastpreview?mission=hst&...

Of course, this is cosmic rays hitting the focal plane and not a memory corruption issue.

The Space Telescope Science Institute schedules multiple exposures so they can be stacked on the ground to eliminate the traces.
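A minimal sketch of why stacking works, assuming a per-pixel median as the combining rule (the actual STScI pipeline is not described here and is surely more sophisticated). A cosmic-ray hit lands on a given pixel in only one of the exposures, so the median across exposures ignores it. The function name `stack_exposures` and the sample pixel values are invented for illustration.

```python
from statistics import median


def stack_exposures(frames):
    """Combine equally sized frames (lists of rows) by taking the per-pixel median."""
    rows, cols = len(frames[0]), len(frames[0][0])
    return [
        [median(frame[r][c] for frame in frames) for c in range(cols)]
        for r in range(rows)
    ]


# Three made-up 2x2 exposures of the same field; the second has a cosmic-ray spike.
frames = [
    [[10, 11], [12, 10]],
    [[11, 10], [900, 11]],
    [[10, 12], [11, 10]],
]

stacked = stack_exposures(frames)
print(stacked)  # [[10, 11], [12, 10]]
```

A plain average would smear the 900 into the result; the median discards it as long as most exposures of that pixel are clean.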


Inspiring for the seeming ease with which he moves between package managers and debuggers...


I'm not saying that cosmic rays can't happen (they absolutely do; I mean whether they can cause memory corruption that actually makes a difference in a running system), but this is quite strange. No such faults were happening before this single incident, and now there are many similar faults happening regularly? Why should I suspect cosmic rays (was there any reason for such a sudden change in their activity and its visibility?) and not a hardware fault?


These kinds of memory errors are more often caused by alpha particles emitted by radioactive elements in the chip package: http://en.wikipedia.org/wiki/Soft_error


For those who want to know more about cosmic rays, Wikipedia is filled with goodness on the subject. (http://en.wikipedia.org/wiki/Cosmic_ray) I was looking for stats on average density per m2 to determine just how prevalent this effect might be in ground-based electronics. It's been a major problem with high-altitude and satellite equipment for a long, long time.


From my experience, I think it is unlikely to be due to cosmic rays. The most likely culprit could be the power supply or data buffers. Those non-tantalum capacitors tend to reach end of life faster if you're operating in high-humidity conditions.

This reminded me of a number of random crashes that a client of my previous company had. Stack dumps just showed random errors. We had about a year's worth of crash logs from a couple thousand network switches (they were an ISP). We initially suggested that this might be a problem with cosmic rays. We even checked the frequency of the random crashes against sunspot cycles. No relationship found. Turns out it was due to another component failing because of a design error.


Great work digging into this issue. A memory test is probably in order.

I learned about ECC RAM when I was trying to figure out why some server lease deals were so inexpensive relative to others. For instance, the last I checked, hetzner.de's hardware does not support ECC RAM. I am of course not calling out Hetzner, and there are other factors in such deals.


Idea: Use pieces of lead sheeting to shield the RAM chips from cosmic radiation.


While lead is a good choice for photon radiation, it is a very bad choice for shielding against charged particles. Because lead is very dense (which is why it is good for X-rays or gamma rays), it slows down the protons/electrons very quickly, producing brehmsstraulangw. Low-density materials such as acrylic, wood, or concrete might be better.

Although, because the protons are such high energy, I can't imagine how much material it would take. This stuff is very high energy and very nasty to stop. To get an idea of what kind of shielding you'd need, I'd look at the shielding used at the LHC, Tevatron, and SLAC. They all produce particles of comparable energy.

I think ECC RAM might be cheaper, up to a point; maybe shielding would be better for data centers.


Wow, it's been a long time since I've seen that word. It was a favorite spelling challenge between my office mate and me: http://en.wikipedia.org/wiki/Bremsstrahlung


One small correction: Bremsstrahlung.


D'oh


> Idea: Use pieces of lead sheeting to shield the RAM chips from cosmic radiation.

It's really difficult to find lead that doesn't emit various particles itself. (Yup, lead has radioactive isotopes.)

IIRC, there have been memory/logic errors traced to radioactive particles from solder, perhaps even the lead. (This was back in the days when solder contained lead.)


Italy is using ancient Roman lead ingots, which have lost most of their natural radioactivity, in their CUORE neutrino experiment.

http://www.physorg.com/news190646406.html


A common saying in the medical industry:

sometimes a zebra is just a horse.


Segfaults from Outer Space !! Duck and cover



