I don't work in an environment where I get to deal with hardware failures, so pardon my ignorance, but has anyone seen a CPU that failed during normal operation? I am under the impression that it is very rare for a CPU itself to fail so badly that it needs to be replaced.
The only times I've even heard about failing CPUs have been when they've been overclocked or insufficiently cooled (add in overvolting and you get both :)), or physically damaged during mounting/unmounting or otherwise handling the hardware. And even then the failure has usually been somewhere other than the CPU itself.
Of course I'm not saying it'd be unheard of, but frankly, for me, right now it is.
" has anyone seen a failed CPU piece which has failed during normal operation?"
Several. But I'm lucky in that I worked for NetApp for 5 years, which has several million filers in the field that were all calling home when they had issues, and at Google, which has a very large number of CPUs all around the planet doing its bidding. With visibility into a population like that you see faults that are one-in-a-billion happen about once a month :-).
There are two general kinds of failures, though. The more common one is a machine check: the internal logic detected a fault condition and put the CPU into the machine-check state, which happens when 3 or more bits go sideways in the various RAM inside the mesh of execution units. That RAM is nominally ECC-protected, so it can detect, but not correct, multi-bit errors. Power it off, power it on, restart from a known condition, and it's good as new.
The rarer occurrence is that something in the CPU fails outright, so the CPU never even comes out of RESET, or immediately goes into a machine-check state. When you find those, Intel often wants you to send the part back so they can do failure analysis on it. The most common root cause for those is moderate electrostatic damage that took a while to finally finish the process of failing.
Some of the more interesting papers at ISSCC are on the life expectancy of small-geometry transistors. They are a lot more susceptible to damage and disruption from cosmic rays and other environmental agents.
In the best case your app crashes or misbehaves randomly, not unlike with bad RAM (yes, it can happen even with ECC). In the worst case there is a subtle numeric error that could percolate all the way to users. Usually that means floating-point issues: since you don't normally use floats to index arrays or do pointer arithmetic, a fault in that unit may not cause an early, obvious failure and hence goes undetected.
The simple case is that your machine just halts. If you have a way of looking at them, it might put a BIOS code on the LEDs. On servers the baseboard management controller (BMC) will usually record a machine check event as well.
I had a Celeron that had its cache go bad once. It would work "fine" in Windows, but Linux would report that the CPU was throwing some kind of exception. If you went into the BIOS and disabled the cache it would be stable, but with the cache on it would crash after a day or two. Swapped out the CPU and the machine lived a great life for a long time.
What a crazy coincidence. I was actually just helping a friend troubleshoot his dead PC, and I told him to test his RAM, video card, motherboard, and power supply, in that order, and not to bother with the CPU since I've never had one fail on me. It ended up being the CPU that went out. I get on Hacker News a couple hours later and see this post. Heh.
Me too, but I had always figured it as survivorship bias. That is, there are many such things that I newly hear of, but most of them don't get repeated mentions right away. The few that do catch my attention, and it seems like it happens a lot - but it actually happens only for a few of the many, many new things I hear about all the time. Similar to my friend who thought that every time he looked at a dead street lamp, it would turn on. :)
Another possible factor is that I pay closer attention to things I've recently learned about when I come across them, but ignore them when I either know them very well or don't know them at all. I don't know if this is true for anyone else.
The first would include the Kardashians, for example. But just because they get mentioned a lot, you don't feel anything strange about hearing about them a lot.
Whereas the very essence of the phenomenon is that encountering something multiple times seems strange to you.
Well, after you learn about (or focus on) something, you have an increased awareness of it being mentioned, whereas at other times in your life you could encounter it two or three times in the same day without paying much attention (just one more unknown word).
This will, of course, be getting replaced shortly, preferably before it does any real damage. Given it's a 2002-era chip, 11 years of service isn't exactly terrible.
I've seen quite a few UltraSPARC chips (especially IIIs) go over the years at work, and often had a shit of a time trying to get Sun to accept them as faulty and replace them.
In summary, some data centers are run hotter than recommended, which leads to a lot of mostly ignored domain resolution errors, which leads to a security risk.
My dad had a laptop which would not boot unless he put it in the fridge first for half an hour. As long as he didn't reboot everything then worked "fine". Does that count?
Thanks for clearing that up - it had always bothered me what it could possibly be (that, and "how on earth did he figure that out in the first place?").
The cold would shrink the motherboard to re-connect the contacts; it might have been fixed by doing the old Xbox/nvidia trick of putting it in the oven (which would soften the solder, causing it to shift and re-connect). With the Xbox, you could apparently even just wrap it in a towel and let its own heat do the trick.
I had this problem once when overclocking an AMD Phenom. The short story is (I don't know the whole cause) that the on-board crypto units stopped being random.
Which wasn't a real problem for 'some' day-to-day use. This was in the mid-to-late '00s, so HTTPS wasn't quite everywhere yet.
The problem manifested slowly. Whenever I'd connect to an HTTPS site, my browser would crash. My sound card software would phone home for an update, and my computer would crash. Random games would crash whenever their anti-cheat software tried to run.
It was just odd, and took a few days of hunting to find out what was actually going wrong.
Yes, CPUs can fail just like any other hardware component. On desktop systems the most common case is you'll try and boot the system and just be presented with blank video or a beep code. On server systems with multiple CPUs there usually will be an error reported via blinking light or the little info LCD on the front of the system. In some cases the damage is actually visible on the CPU (e.g. discoloration of some of the gold contact points on the bottom). CPUs under normal operation fail less frequently than most other components in my experience.
I think the MTBF is generally longer than people would normally go without replacing their CPU. Also, CPUs are generally designed to degrade more gracefully. For instance, they may have circuitry that scales the frequency down as delays get longer. Also, in multicore CPUs, there are generally some spare cores that will get swapped in if a previously in-use core breaks.
> Also, in multicore CPUs, there are generally some spare cores that will get swapped in if a previously in-use core breaks.
That sounds like a huge cost to bear. Looking at e.g. a Haswell die photo [1], there are just four physical cores present for a four-core part. With that die area per core, you would take a ~15-20% area hit (that translates to 15-20% cost) just to have a spare core in case one failed some years later.
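Roughly where that 15-20% figure comes from, as a sketch (the 60-80% core-area assumption is my guess from eyeballing die photos, not a measured number):

    # Rough cost of carrying one spare core on a four-core die.
    # Assumption (mine): the four cores together occupy ~60-80% of the die,
    # with cache/uncore taking the rest.
    for cores_fraction in (0.60, 0.80):
        spare_core_area = cores_fraction / 4   # one extra core's share of the die
        print(f"cores = {cores_fraction:.0%} of die -> spare adds ~{spare_core_area:.0%} area")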
I have heard of manufacturers selling otherwise "defective" parts where a core or cache slice has a defect by relabeling them as parts with fewer cores. But that's a manufacture-time decision, not a dynamic reconfiguration in the field.
I think the GP was on the right track, but somewhat confused. They obviously don't do this on all models, but some dual-core models are disabled quad-cores. Remember the Athlon X3? That was a binned chip that usually was created from X4s with a broken core. Most buyers didn't mind, and some of them got lucky and were able to re-enable the disabled core.
It seems like the GP might be suggesting that a quad-core CPU will swap in another core when one dies. That doesn't happen. But the binning process allows them to still sell slightly defective silicon with disabled parts (cores, cache), which saves money.
On a related note, a lot of GPUs actually do have a few dozen execution units that are disabled by default and can be swapped in after stress testing at the factory. I believe some can even do that in the wild, but I could be wrong.
Oh whoops, I guess my VLSI professor lied to me. Maybe it only happens with certain multiprocessors. But yeah, there are definitely disabled cores in a lot of processors for the reasons you mentioned.
The cell processor had a main PowerPC core and eight floating-point SIMD co-processors called synergistic processing elements (SPE). For the PS3, one of the eight SPEs was disabled and another was reserved for the operating system, leaving the other six for developers.
A machine here at work logged cache-related machine check exceptions at a rate of roughly one per day, but not regularly or deterministically. It was not related to load or temperature, and persisted even after clocking lower than spec'd. Changing the CPU fixed it.
Those were correctable errors, prime95 or memtest did not detect anything.
I used to work in situations where we had to account for failures. We had lab equipment that would just run all the time, with terrible wiring, and we were horrible to it too. We left covers off, piled stuff on top of it, and just Frankensteined the hell out of all of it. I even managed to flash a new OS/app right as the power failed, but it still lived....
Failures were more prominent in memory... but they did happen. We also sent equipment through environmental testing that would force failures. I don't recall hearing of any CPU failures, although most of our equipment was DSP- and FPGA-based, with only some tiny 'lil CPUs in there.
I have seen it happen one time that I can remember, where I was sure it was the CPU. We had 8 dual socket 5400-era Xeon servers in a VMware cluster. Whenever 64bit Windows 2008+ virtual machines were started on or were vmotioned to one of the hosts, they would bluescreen. We did not experience this behavior at all, then one day we did. I replaced both CPUs and the problem disappeared. I have to assume it was one of the CPUs.
It's entirely possible that they overheated, but if they did it was due to poor cooling; we did not overvolt or overclock these machines.
It's not easy to get a CPU chip failure reliably diagnosed in the field. Even if you manage to do the trial-and-error component-swapping dance pointing to the CPU, you don't get very good confidence. It might be that the new CPU taxes the power feed less, or there was misapplied cooling paste, or a bad contact in the pins, etc.
Not even mentioned here is metastability. When signals cross clock domains within traditional clocked logic, and the clocks are not carefully organized to be multiples of each other, a signal can end up being sampled just as it changes. The result is a value inside a flip-flop that's neither a 1 nor a 0 - sometimes an analog value somewhere in between, sometimes an oscillating mess at some unknown frequency. In the worst case this bad value propagates into the chip causing havoc, a buzzing mess of chaos.
In the real world this doesn't happen very often, and there are techniques to mitigate it when it does (usually at a performance or latency cost). Core CPUs are probably safe - they're all one clock - but display controllers, networking, anything that touches the real world has to synchronize with it.
For example, I was involved with designing a PC graphics chip in the mid '90s. We did the calculations around metastability (we had 3 clock domains and 2 crossings) and figured our chip would suffer a metastability event (anything from a burble on one frame of a screen to a complete breakdown) about once every 70 years. We decided we could live with that - they were running on Win95 systems; no one would ever notice.
Everyone who designs real-world systems should be doing that math - more than one clock domain is a no-no in life-support-rated systems, your pacemaker for example.
If a failure mode is likely to happen once every 70 chip-years of operation, and you sold a few hundred thousand chips, wouldn't you expect several instances of that failure mode to occur across the population every day?
Simply, yes - but as I mentioned, in our case by far the most common result was going to be pixel burbles; you'd likely see one in the lifetime of your video card. The chances of the more serious jabbering-core sort of meltdown are much lower - we design against them - but they are, one has to stress, not impossible.
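For concreteness, the arithmetic in the question above works out roughly like this (300,000 is just a stand-in for "a few hundred thousand"):

    # Back-of-the-envelope check of the fleet-wide event rate.
    chips = 300_000                       # assumed fleet size
    chip_years_per_event = 70             # one event per chip every ~70 years

    events_per_year = chips / chip_years_per_event
    events_per_day = events_per_year / 365
    print(f"~{events_per_year:.0f} events/year, ~{events_per_day:.1f} events/day fleet-wide")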
You can design to be metastability-tolerant: use high-gain, fast clk->Q flops as synchronizers, use multiple synchronizers in a row (trading latency for reliability), and do things to reduce the frequencies involved (run multiple synchronizers in parallel, synchronize edges rather than absolute values, etc.). But in the end, if you're synchronizing an asynchronous event you can't engineer metastability out of your design - you just have to make it "good enough" for some value of good enough that will keep marketing and legal happy.
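The math behind that trade-off is the standard first-order synchronizer MTBF estimate; the parameter values below are made up purely for illustration:

    import math

    def sync_mtbf_seconds(t_res, tau, t_window, f_clk, f_data):
        """First-order synchronizer MTBF estimate:
        MTBF = exp(t_res / tau) / (t_window * f_clk * f_data)."""
        return math.exp(t_res / tau) / (t_window * f_clk * f_data)

    # Illustrative numbers: 100 MHz sampling clock, 1 MHz async toggle rate,
    # ~8 ns of settling slack before the next flop samples,
    # tau = 200 ps, metastability window = 100 ps.
    mtbf = sync_mtbf_seconds(t_res=8e-9, tau=200e-12, t_window=100e-12,
                             f_clk=100e6, f_data=1e6)
    print(f"~{mtbf / 3.156e7:.1e} years between events")

Each extra synchronizer flop in the chain buys roughly one more clock period of resolution time, which multiplies the MTBF by exp(T_clk/tau) - that's the latency-for-reliability trade mentioned above.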
It's our dirty little secret (by 'our' I mean the whole industry)
It would be awesome if companies like Google would calculate MTBF statistics on components. They've done it for disks and it would be great to extend it to CPUs and memory modules. They're probably in a better position than even Intel to calculate these things with precision.
Found it here, which also goes into some testing the Guild Wars guys did on their population of gamer PCs: http://www.codeofhonor.com/blog/whose-bug-is-this-anyway (scroll down to "Your computer is broken", around 1% of the systems they tested failed a CPU-to-RAM consistency stress test)
Both of them indicate intermittently defective components in running systems are way more common than anybody assumes.
They'd have to be careful with how they quoted the numbers though.
As Linus accurately points out, MTBF varies wildly depending on the usage pattern. If you want to quote it in a unit of time, e.g. "years", then you have to specify the usage the part has been under, which will be very different for a server part compared to a desktop part.
You could quote it per instruction or equivalent, I suppose, taking into account how hard the component is used, but even that isn't perfect.
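For reference, chip reliability is usually quoted as a FIT rate (failures per 10^9 device-hours) at specified operating conditions, which is just MTBF written as a rate; a trivial sketch of the conversion:

    def fit_to_mtbf_years(fit):
        """Convert FIT (failures per 1e9 device-hours) to MTBF in years."""
        return 1e9 / fit / 8766          # ~8766 hours in an average year

    print(f"100 FIT ~ {fit_to_mtbf_years(100):,.0f} years MTBF per device")

The usage-dependence caveat still applies, though: the quoted FIT is only valid at the stated temperature, voltage, and duty cycle.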
MTBF is usually tied to the purchasing contract when you buy large quantities of components for manufacturing. Intel doesn't know where people are going to use its products, and the range of environments they get exposed to can't easily be accounted for.
I know Intel has been working for some time on the idea of high-temperature data centres; this will impact the MTBF of all components, but you can always weigh the cost of the losses against the cost of the cooling: http://www.datacenterdynamics.com/focus/archive/2012/08/inte...
Data of the sort shown in Table 3 certainly exists. However, it's often inaccessible to the public because companies tend to treat it as a trade secret.
Any large company that makes things employs a bunch of reliability engineers, who are usually EEs or MEs who make Weibull plots and bathtub curves all day (to set the warranty duration, mostly). These guys have all the data you could ever want on this topic, but they're not sharing. Especially at Intel.
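As a sketch of what those engineers are modelling (all parameters invented, not anyone's real data): a bathtub curve is often built from Weibull hazard terms, with shape < 1 for infant mortality and shape > 1 for wear-out.

    import numpy as np

    def weibull_hazard(t, shape, scale):
        """Weibull hazard rate h(t) = (k/lam) * (t/lam)**(k-1)."""
        return (shape / scale) * (t / scale) ** (shape - 1)

    # Toy bathtub curve: early-life defects (shape < 1), a constant background
    # rate, and wear-out (shape > 1). All parameter values are invented.
    t = np.linspace(0.1, 10, 100)              # years in service
    hazard = (weibull_hazard(t, 0.5, 1.0)      # infant mortality
              + 0.02                           # random failures
              + weibull_hazard(t, 5.0, 12.0))  # wear-out
    print(f"year 0.1: {hazard[0]:.2f}, year 5: {hazard[49]:.3f}, year 10: {hazard[-1]:.3f}")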
I'm almost sure that components without moving parts will become technologically obsolete long before they start to fail. When I buy a used laptop I always change the HDD and the DVD drive, and its reliability jumps sharply.
That may very well be true on average but I'd bet there are plenty of CPUs and memory modules that fail in the first year of usage for example. After all CPUs are tested and sorted into high/low performance parts, so sample variation itself would be enough to generate some early failures.
As a consumer it's hard enough to keep up with what's reliable in hard drives. Keeping the manufacturers honest with good stats for the most common parts would be great.
Even for things with moving parts, it would be nice to know that model X of brand Y has an MTBF of 4.5 years - but hunting for the same model X 4.5 years later isn't likely to yield the exact same hardware, just some later revision of the same specced hardware.
There's an interesting anecdote Joe Armstrong likes to tell about people who claim they've built a reliable or fault-tolerant service. They'll say "This is fault tolerant, there are multiple hard drives in there, I have done formal verification of my code and so on..." and then someone trips over the power cord and that's the end of the fault tolerance. It's a silly example - of course they'd properly provide power to an important rack of hardware - but the point is that, in the simplest case, the system is only as fault tolerant as its weakest component. It's that one bad capacitor from Taiwan that might take the whole thing down, or just a silly cosmic ray.
One needs redundant hardware to provide certain guarantees about the service being up. This means load balancers, multiple CPUs running the same code in parallel and comparing results, running on separate power buses, different data centers, different parts of the world.
Yeah, but with the number of 9s you see, you realize that asteroids are NOT taken into account. For example, Amazon advertises 99.999999999% durability for a given year for S3 objects. This is just stupid. An extinction-level event (asteroid, global thermonuclear war, black hole) could easily wipe out ALL data on S3. We know that mass extinctions have occurred about once every 100 million years. That means that if we expect a 10^-8 chance of a mass extinction event in a given year, Amazon would need a 99.9% chance of surviving a mass extinction in order to meet average durability ratings for S3.
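A sketch of that arithmetic:

    # Back-of-the-envelope version of the parent's point.
    annual_loss_budget = 1 - 0.99999999999   # ~1e-11 chance of losing an object per year
    p_extinction_per_year = 1e-8             # ~one mass extinction per 100 million years

    # If an extinction-level event destroys everything unless it is somehow
    # survived, the required survival probability is:
    p_survive_needed = 1 - annual_loss_budget / p_extinction_per_year
    print(f"{p_survive_needed:.3%}")         # ~99.900%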
After a certain number of 9s you just have to smile, nod, and truncate the number.
> That means that if we expect a 10^-8 chance of a mass extinction event in a given year, Amazon would need a 99.9% chance of surviving a mass extinction in order to meet average durability ratings for S3.
On the other hand, it's unlikely that many people would be around to file a claim. It's more likely that Amazon would go out of business before anything like this happened, anyway.
At one of my last jobs someone put the following ticket in the bug tracker "Following the end of the world on 12/21/2012 the system Blah will stop working"
After the date it was closed with "As the end of the world didn't happen, this ticket is no longer needed."
If MTBF is such a big issue, would it ever be possible to build a spacecraft that travels between the stars and retains the ability to communicate? I guess hats off to the designers of Voyager and the other spacecraft whose MTBF seems to have crossed 36+ years for many components, including CPU and power supply. But for interstellar craft that MTBF still seems VERY low. And, seriously, an MTBF of 5 years seems like a joke for a desktop when lots of mechanical components with moving parts last longer.
Spacecraft and rovers use ridiculously armoured, redundant systems to get past the fact that they would otherwise fail quite regularly in such a hostile environment. The Curiosity rover (launched in 2011) uses what would normally be quite an outdated 132 MHz CPU that's been specially shielded to achieve the reliability the program needs; even then, there are two redundant systems that do health checks on each other to avoid bit flips. Even with all of that, they're running on only one CPU and trying to diagnose why the first one failed.
It's probably not fair to compare the MTBF of specialised hardware to the $35 CPU I bought at the retailer down the street either, the RAD750 processors in Curiosity cost almost a quarter of a million dollars each.
http://history.nasa.gov/computers/Ch6-2.html
A very interesting article on Voyager's computer system. It turns out most of the system is not powered most of the time - even the component doing the health checks, which is called the CCS.
"The frequency of the heartbeat, roughly 30 times per minute, caused concern [176] that the CCS would be worn out processing it. Mission Operations estimated that the CCS would have to be active 3% to 4% of the time, whereas the Viking Orbiter computer had trouble if it was more than 0.2% active15. As it turns out, this worry was unwarranted."
They use DMA a lot: instruments write to memory, and occasionally the CPU is turned on and picks up the new values. They also had to cope with the fact that memory degrades, so the system needs to adapt to working with less memory. The bus is 16 bits wide, but they actually process 4 bits at a time, so an addition takes 4 cycles. CPU registers are stored in RAM, so presumably they can be reassigned if a memory cell fails.
Parts of the system were reused from the Viking mission. They were also reprogramming the system in flight during the eighties! That's why they could start the mission without having the full software on board, and the mission was extended thanks to reprogramming. Just for the Jupiter visit they made 18 software updates - think about that next time a software update breaks something on your system.
Also, it's all a distributed system with several CPUs and some elements of redundancy - awesome tech. I guess one day alien hackers will have fun reverse engineering this system.
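To illustrate the 4-bits-at-a-time point (this is not Voyager's actual instruction set, just a sketch of why a 16-bit add on 4-bit-wide hardware takes four passes):

    def add16_nibble_serial(a, b):
        """Add two 16-bit values 4 bits at a time, propagating the carry,
        the way a 4-bit-wide ALU needs 4 passes for a 16-bit word."""
        result, carry = 0, 0
        for step in range(4):                      # one "cycle" per nibble
            na = (a >> (4 * step)) & 0xF
            nb = (b >> (4 * step)) & 0xF
            s = na + nb + carry
            result |= (s & 0xF) << (4 * step)
            carry = s >> 4
        return result & 0xFFFF, carry              # 16-bit sum plus carry-out

    print(add16_nibble_serial(0x1234, 0x0FCD))     # (0x2201, 0)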
I thought about another very reliable system: deep under the sea, the NSA has a big switch that splits deep underwater communication lines.
Now, that thing has to work 24/7 in a hostile environment, has to be hidden, has to deal with enormous quantities of data, and it costs a lot to replace or repair, so it must be very reliable.
What is driving technological progress? Instead of a space program, we now have political control of the Internet as the driving force. I guess that's what they mean when they say that civilization is turning inwards ;-)
Yes, in many areas the NSA and Google are pushing the envelope: long-term data storage, map-reduce over large data sets, AI - you name it, they have it.
I imagine under the sea is actually quite a nice place to be, if you assume perfect waterproofing. There's little radiation penetrating the water, so I imagine there's less chance of bit flips. You don't need to worry so much about cooling, as the whole ocean is your heatsink. Accessibility would suck, but a bunch of redundant hardware wouldn't be awful.
These switches are usually fiber splits so they are slightly less complex than you are envisioning.
The equipment doesn't have to actually duplicate L2 frames. It just uses standard fiber repeaters (already a common component in undersea cables) to get it back to a more friendly environment where they can actually decode and process it.
He's talking about desktop/server CPUs where people care about performance. If you don't care so much about performance, you can increase the transistor sizes, reduce the clock speed, and achieve totally insane MTBF... as space-rated hardware tends to do. Kind of like how server CPUs are underclocked to increase MTBF, but more so.
(Conventional) solid-state devices are very unlikely to fail - the exception being flash memory.
Apart from electron migration issues and failures from excess voltage/temperature, they're pretty long-lasting.
It's much easier to have a failure because of something else: capacitors failing, oxidation, or mechanical failure (for example, from thermal expansion/contraction).
I've seen people complaining about a dead CPU, but I can't find it right now.
You are correct. I want to clarify that the failure process is electromigration, not electron migration. It is caused by electrons but it is ions in metal that migrate. Wikipedia has a good description: https://en.wikipedia.org/wiki/Electromigration.
I design integrated circuits and one of the constraints in selecting the width of wires is to make sure that the maximum current density is below the electromigration threshold.
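In practice that constraint boils down to a simple sizing rule: the design rules give a maximum current per unit width for each metal layer (at a given temperature), and the wire has to be at least wide enough for the current it carries. A minimal sketch with invented numbers:

    def min_wire_width_um(i_avg_ma, em_limit_ma_per_um):
        """Minimum metal width so the average current stays under the process's
        electromigration limit, expressed as mA per micron of width."""
        return i_avg_ma / em_limit_ma_per_um

    # Invented numbers: a wire carrying 4 mA on a layer whose EM rule allows
    # about 1 mA per micron of width at the target junction temperature.
    print(f"minimum width ~ {min_wire_width_um(4.0, 1.0):.1f} um")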
I actually just returned my CPU (Phenom II X4) to AMD, and they've replaced it, but they didn't say exactly why it died. I've asked them for more details, hopefully they can tell me.
Overall though, given with how many computers I've worked with, CPU failures still seem rarer than Memory, Disk, Mobo, or Graphics failures. Of course it ends up being the CPU in _my_ computer that fails -.-
There is still some variance in silicon, so yours may have had a defect that manifested itself after some time. I'm not sure whether they evaluate returned defective chips to see what happened (or whether that's public info).
Also, the packaging is extremely complex and prone to the same kind of defects as other PCBs in the system.
I'd like to throw in my experience:
I was in charge of 300+ x86 rack servers and around 50 desktops for 3 years and never saw a single CPU fail, even on old Pentium 4s with dusty fans.
Disk failures are very common, followed by the much rarer RAM and motherboard failures.
I suspect server chips are rated for a 10-15 year average lifespan.
Soft errors are a very real property of low-voltage digital electronics. I personally observed what could only realistically be explained as a soft error in a customer unit running in the field. A single bit was flipped in the program memory of the embedded application and was causing the system to malfunction in an obvious and repeatable manner. We've since added CRC checking of the program memory and some of the static data sections to flag this and reset in the future.
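A minimal sketch of that kind of check, assuming a hypothetical image layout where the last 4 bytes hold the CRC32 of everything before them (not the poster's actual format): compute the CRC at build time, store it alongside the image, and periodically re-verify at runtime.

    import zlib

    def program_image_ok(image: bytes) -> bool:
        """Check an image whose last 4 bytes hold the CRC32 of everything
        before them - enough to catch a single flipped bit."""
        payload, stored = image[:-4], int.from_bytes(image[-4:], "little")
        return (zlib.crc32(payload) & 0xFFFFFFFF) == stored

    payload = bytes(range(256)) * 16
    image = payload + (zlib.crc32(payload) & 0xFFFFFFFF).to_bytes(4, "little")
    corrupted = bytes([image[0] ^ 0x01]) + image[1:]   # simulate a single bit flip

    print(program_image_ok(image), program_image_ok(corrupted))   # True False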
Long-term failure rates are usually measured not in real time but in deliberately heat-elevated environments that simulate many years of stress in a few months. This work is essential to ensure the design decisions they've made don't accidentally cause their chips to fail after 2 years (which might be outside the warranty lifetime but would still result in class-action lawsuits and horrible publicity).
My immediate reaction is to ask how this reliability characteristic of CPUs affects critical software applications. Certainly some space missions and medical devices out in the field must have surpassed the MTBF mark for their CPUs.
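The usual model behind those heat-accelerated tests is Arrhenius scaling of thermally activated failure mechanisms; the activation energy and temperatures below are illustrative assumptions, not real qualification data:

    import math

    BOLTZMANN_EV_PER_K = 8.617e-5

    def arrhenius_acceleration(ea_ev, t_use_c, t_stress_c):
        """Acceleration factor for a thermally activated failure mechanism:
        AF = exp((Ea/k) * (1/T_use - 1/T_stress)), temperatures in kelvin."""
        t_use, t_stress = t_use_c + 273.15, t_stress_c + 273.15
        return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1 / t_use - 1 / t_stress))

    # Illustrative values only: 0.7 eV activation energy, 55 C use, 125 C stress.
    af = arrhenius_acceleration(0.7, 55, 125)
    print(f"AF ~ {af:.0f}x: 1000 h of stress ~ {af * 1000 / 8766:.1f} years of use")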