I don't work in an environment where I get to deal with hardware failures, so pardon my ignorance, but has anyone seen a CPU that failed during normal operation? I am under the impression that it is very rare for a CPU itself to fail so badly that it needs to be replaced.
The only times I've even heard about failing CPUs have been when they've been overclocked or insufficiently cooled (add in overvolting and you get both :)), or physically damaged during mounting/unmounting or otherwise handling the hardware. And even then the failure has usually been somewhere other than the CPU itself.
Of course I'm not saying it'd be unheard of, but frankly, for me, right now it is.
" has anyone seen a failed CPU piece which has failed during normal operation?"
Several. But I'm lucky in that I worked for NetApp for 5 years, which has several million filers in the field that were all calling home when they had issues, and at Google, which has a very large number of CPUs all around the planet doing its bidding. With visibility into a population like that you see faults that are one-in-a-billion happen about once a month :-).
There are two general kinds of failures, though. The more common one is a machine check: the internal logic detected a fault condition and put the CPU into the machine-check state, which happens when 3 or more bits go sideways in the various RAM inside the mesh of execution units. That RAM is nominally ECC-protected, so it can detect, but not correct, multi-bit errors. Power it off, power it on, restart from a known condition, and it's good as new.
The rarer occurrence is that something in the CPU fails outright, so the CPU never even comes out of RESET, or immediately goes into a machine-check state. When you find those, Intel often wants you to send the part back so they can do failure analysis on it. The most common root cause for those is moderate electrostatic damage that took a while to finally finish the process of failing.
Some of the more interesting papers at ISSCC are on the life expectancy of small-geometry transistors. They are a lot more susceptible to damage and disruption from cosmic rays and other environmental agents.
In the best case your app crashes or misbehaves randomly, not unlike with bad RAM (yes, it can happen even with ECC). In the worst case there is a subtle numeric error that could percolate all the way to users. Usually that means floating-point issues: since you don't normally use floats to index arrays or do pointer arithmetic, a fault in that unit may not cause an early, obvious failure and hence goes undetected.
The simple case is that your machine just halts. If you have a way of looking at them, it might put a BIOS code on the LEDs. On servers the baseboard management controller (BMC) will usually record a machine check event as well.
I had a Celeron that had its cache go bad once. It would work "fine" in Windows, but Linux would report that the CPU was throwing some kind of exception. If you went into the BIOS and disabled the cache it would be stable, but with the cache on it would crash after a day or two. Swapped out the CPU and the machine lived a great life for a long time.
What a crazy coincidence. I was actually just helping a friend troubleshoot his dead PC, and I told him to test his RAM, video card, motherboard, and power supply, in that order, and not to bother with the CPU since I've never had one fail on me. It ended up being the CPU that went out. I get on Hacker News a couple hours later and see this post. Heh.
Me too, but I had always figured it as survivorship bias. That is, there are many such things that I newly hear of, but most of them don't get repeated mentions right away. The few that do catch my attention, and it seems like it happens a lot - but it actually happens only for a few of the many, many new things I hear about all the time. Similar to my friend who thought that every time he looked at a dead street lamp, it would turn on. :)
Another possible factor is that I pay closer attention to things I've recently learned about when I come across them, but ignore them when I either know them very well or don't know them at all. I don't know if this is true for anyone else.
The first would include the Kardashians, for example. But just because they get mentioned a lot, you don't feel anything strange about hearing about them a lot.
Whereas the very essence of the phenomenon is that encountering something multiple times seems strange to you.
Well, after you learn about (or focus on) something, you have an increased awareness of it being mentioned, whereas at other times in your life you could encounter it two or three times in the same day without paying much attention (just one more unknown word).
This will, of course, be getting replaced shortly, preferably before it does any real damage. Given it's a 2002-era chip, 11 years of service isn't exactly terrible.
I've seen quite a few UltraSPARC chips (especially IIIs) go over the years at work, and often had a shit of a time trying to get Sun to accept them as faulty and replace them.
In summary, some data centers are run hotter than recommended, which leads to a lot of mostly ignored domain resolution errors, which leads to a security risk.
My dad had a laptop which would not boot unless he put it in the fridge first for half an hour. As long as he didn't reboot everything then worked "fine". Does that count?
Thanks for clearing that up - it had always bothered me what it could possibly be (that, and "how on earth did he figure that out in the first place?").
The cold would shrink the motherboard to re-connect the contacts; it might have been fixed by doing the old Xbox/nvidia trick of putting it in the oven (which would soften the solder, causing it to shift and re-connect). With the Xbox, you could apparently even just wrap it in a towel and let its own heat do the trick.
I had this problem once when overclocking an AMD Phenom. The short story is (I don't know the whole cause) that the on-board crypto units stopped being random.
Which wasn't a real problem for 'some' day-to-day use. This was in the mid-to-late '00s, so HTTPS wasn't quite everywhere yet.
The problem manifested slowly. Whenever I'd connect to an HTTPS site, my browser would crash. My sound card software would phone home for an update, and my computer would crash. Random games would crash whenever their anti-cheat software tried to run.
It was just odd, and took a few days of hunting to find out what was actually going wrong.
Yes, CPUs can fail just like any other hardware component. On desktop systems the most common case is you'll try and boot the system and just be presented with blank video or a beep code. On server systems with multiple CPUs there usually will be an error reported via blinking light or the little info LCD on the front of the system. In some cases the damage is actually visible on the CPU (e.g. discoloration of some of the gold contact points on the bottom). CPUs under normal operation fail less frequently than most other components in my experience.
I think the MTBF is generally longer than people would normally go without replacing their CPU. Also, CPUs are generally designed to degrade more gracefully. For instance, they may have circuitry that scales the frequency down as delays get longer. Also, in multicore CPUs, there are generally some spare cores that will get swapped in if a previously in-use core breaks.
> Also, in multicore CPUs, there are generally some spare cores that will get swapped in if a previously in-use core breaks.
That sounds like a huge cost to bear. Looking at e.g. a Haswell die photo [1], there are just four physical cores present for a four-core part. With that die area per core, you would take a ~15-20% area hit (that translates to 15-20% cost) just to have a spare core in case one failed some years later.
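Roughly where that 15-20% figure comes from, as a sketch (the 60-80% core-area assumption is my guess from eyeballing die photos, not a measured number):

    # Rough cost of carrying one spare core on a four-core die.
    # Assumption (mine): the four cores together occupy ~60-80% of the die,
    # with cache/uncore taking the rest.
    for cores_fraction in (0.60, 0.80):
        spare_core_area = cores_fraction / 4   # one extra core's share of the die
        print(f"cores = {cores_fraction:.0%} of die -> spare adds ~{spare_core_area:.0%} area")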
I have heard of manufacturers selling otherwise "defective" parts where a core or cache slice has a defect by relabeling them as parts with fewer cores. But that's a manufacture-time decision, not a dynamic reconfiguration in the field.
I think the GP was on the right track, but somewhat confused. They obviously don't do this on all models, but some dual-core models are disabled quad-cores. Remember the Athlon X3? That was a binned chip that usually was created from X4s with a broken core. Most buyers didn't mind, and some of them got lucky and were able to re-enable the disabled core.
It seems like the GP might be suggesting that a quad-core CPU will swap in another core when one dies. That doesn't happen. But the binning process allows them to still sell slightly defective silicon with disabled parts (cores, cache), which saves money.
On a related note, a lot of GPUs actually do have a few dozen execution units that are disabled by default and can be swapped in after stress testing at the factory. I believe some can even do that in the wild, but I could be wrong.
Oh whoops, I guess my VLSI professor lied to me. Maybe it only happens with certain multiprocessors. But yeah, there are definitely disabled cores in a lot of processors for the reasons you mentioned.
The cell processor had a main PowerPC core and eight floating-point SIMD co-processors called synergistic processing elements (SPE). For the PS3, one of the eight SPEs was disabled and another was reserved for the operating system, leaving the other six for developers.
A machine here at work logged cache-related machine check exceptions at a rate of roughly one per day, but not regularly or deterministically. It was not related to load or temperature, and persisted even after clocking lower than spec'd. Changing the CPU fixed it.
Those were correctable errors, prime95 or memtest did not detect anything.
I used to work in situations where we had to account for failures. We had lab equipment that would just run all the time, with terrible wiring, and we were horrible to it too. We left covers off, piled stuff on top of it, and just Frankensteined the hell out of all of it. I even managed to flash a new OS/app right as the power failed, but it still lived....
Failures were more prominent in memory... but they did happen. We also sent equipment through environmental testing that would force failures. I don't recall hearing of any CPU failures, although most of our equipment was DSP- and FPGA-based, with only some tiny 'lil CPUs in there.
I have seen it happen one time that I can remember, where I was sure it was the CPU. We had 8 dual socket 5400-era Xeon servers in a VMware cluster. Whenever 64bit Windows 2008+ virtual machines were started on or were vmotioned to one of the hosts, they would bluescreen. We did not experience this behavior at all, then one day we did. I replaced both CPUs and the problem disappeared. I have to assume it was one of the CPUs.
It's entirely possible that they overheated, but if they did it was due to poor cooling; we did not overvolt or overclock these machines.
It's not easy to get a CPU chip failure reliably diagnosed in the field. Even if you manage to do the trial-and-error component-swapping dance pointing to the CPU, you don't get very good confidence. It might be that the new CPU taxes the power feed less, or there was misapplied cooling paste, or a bad contact in the pins, etc.
Not even mentioned here is metastability. When signals cross clock domains within traditional clocked logic, and the clocks are not carefully organized to be multiples of each other, a signal can end up being sampled just as it changes. The result is a value inside a flip-flop that's neither a 1 nor a 0 - sometimes an analog value somewhere in between, sometimes an oscillating mess at some unknown frequency. In the worst case this bad value propagates into the chip causing havoc, a buzzing mess of chaos.
In the real world this doesn't happen very often, and there are techniques to mitigate it when it does (usually at a performance or latency cost). Core CPUs are probably safe - they're all one clock - but display controllers, networking, anything that touches the real world has to synchronize with it.
For example, I was involved with designing a PC graphics chip in the mid '90s. We did the calculations around metastability (we had 3 clock domains and 2 crossings) and figured our chip would suffer a metastability event (anything from a burble on one frame of a screen to a complete breakdown) about once every 70 years. We decided we could live with that - they were running on Win95 systems; no one would ever notice.
Everyone who designs real-world systems should be doing that math - more than one clock domain is a no-no in life-support-rated systems, your pacemaker for example.
If a failure mode is likely to happen once every 70 chip-years of operation, and you sold a few hundred thousand chips, wouldn't you expect several instances of that failure mode to occur across the population every day?
Simply, yes - but as I mentioned, in our case by far the most common result was going to be pixel burbles; you'd likely see one in the lifetime of your video card. The chances of the more serious jabbering-core sort of meltdown are much lower - we design against them - but they are, one has to stress, not impossible.
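For concreteness, the arithmetic in the question above works out roughly like this (300,000 is just a stand-in for "a few hundred thousand"):

    # Back-of-the-envelope check of the fleet-wide event rate.
    chips = 300_000                       # assumed fleet size
    chip_years_per_event = 70             # one event per chip every ~70 years

    events_per_year = chips / chip_years_per_event
    events_per_day = events_per_year / 365
    print(f"~{events_per_year:.0f} events/year, ~{events_per_day:.1f} events/day fleet-wide")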
You can design to be metastability-tolerant: use high-gain, fast clk->Q flops as synchronizers, use multiple synchronizers in a row (trading latency for reliability), and do things to reduce the frequencies involved (run multiple synchronizers in parallel, synchronize edges rather than absolute values, etc.). But in the end, if you're synchronizing an asynchronous event you can't engineer metastability out of your design - you just have to make it "good enough" for some value of good enough that will keep marketing and legal happy.
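The math behind that trade-off is the standard first-order synchronizer MTBF estimate; the parameter values below are made up purely for illustration:

    import math

    def sync_mtbf_seconds(t_res, tau, t_window, f_clk, f_data):
        """First-order synchronizer MTBF estimate:
        MTBF = exp(t_res / tau) / (t_window * f_clk * f_data)."""
        return math.exp(t_res / tau) / (t_window * f_clk * f_data)

    # Illustrative numbers: 100 MHz sampling clock, 1 MHz async toggle rate,
    # ~8 ns of settling slack before the next flop samples,
    # tau = 200 ps, metastability window = 100 ps.
    mtbf = sync_mtbf_seconds(t_res=8e-9, tau=200e-12, t_window=100e-12,
                             f_clk=100e6, f_data=1e6)
    print(f"~{mtbf / 3.156e7:.1e} years between events")

Each extra synchronizer flop in the chain buys roughly one more clock period of resolution time, which multiplies the MTBF by exp(T_clk/tau) - that's the latency-for-reliability trade mentioned above.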
It's our dirty little secret (by 'our' I mean the whole industry)
It would be awesome if companies like Google would calculate MTBF statistics on components. They've done it for disks and it would be great to extend it to CPUs and memory modules. They're probably in a better position than even Intel to calculate these things with precision.
Found it here, which also goes into some testing the Guild Wars guys did on their population of gamer PCs: http://www.codeofhonor.com/blog/whose-bug-is-this-anyway (scroll down to "Your computer is broken", around 1% of the systems they tested failed a CPU-to-RAM consistency stress test)
Both of them indicate intermittently defective components in running systems are way more common than anybody assumes.
They'd have to be careful with how they quoted the numbers though.
As Linus accurately points out, MTBF varies wildly depending on the usage pattern. If you want to quote it in a unit of time, e.g. "years", then you have to specify the usage the part has been under, which will be very different for a server part compared to a desktop part.
You could quote it per instruction or equivalent, I suppose, taking into account how hard the component is used, but even that isn't perfect.
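For reference, chip reliability is usually quoted as a FIT rate (failures per 10^9 device-hours) at specified operating conditions, which is just MTBF written as a rate; a trivial sketch of the conversion:

    def fit_to_mtbf_years(fit):
        """Convert FIT (failures per 1e9 device-hours) to MTBF in years."""
        return 1e9 / fit / 8766          # ~8766 hours in an average year

    print(f"100 FIT ~ {fit_to_mtbf_years(100):,.0f} years MTBF per device")

The usage-dependence caveat still applies, though: the quoted FIT is only valid at the stated temperature, voltage, and duty cycle.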
MTBF is usually tied to the purchasing contract when you buy large quantities of components for manufacturing. Intel doesn't know where people are going to use its products, and the range of environments they get exposed to can't easily be accounted for.
I know Intel has been working for some time on the idea of high-temperature data centres; this will impact the MTBF of all components, but you can always weigh the cost of the losses against the cost of the cooling: http://www.datacenterdynamics.com/focus/archive/2012/08/inte...
Data of the sort shown in Table 3 certainly exists. However, it's often inaccessible to the public because companies tend to treat it as a trade secret.
Any large company that makes things employs a bunch of reliability engineers, who are usually EEs or MEs who make Weibull plots and bathtub curves all day (to set the warranty duration, mostly). These guys have all the data you could ever want on this topic, but they're not sharing. Especially at Intel.
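As a sketch of what those engineers are modelling (all parameters invented, not anyone's real data): a bathtub curve is often built from Weibull hazard terms, with shape < 1 for infant mortality and shape > 1 for wear-out.

    import numpy as np

    def weibull_hazard(t, shape, scale):
        """Weibull hazard rate h(t) = (k/lam) * (t/lam)**(k-1)."""
        return (shape / scale) * (t / scale) ** (shape - 1)

    # Toy bathtub curve: early-life defects (shape < 1), a constant background
    # rate, and wear-out (shape > 1). All parameter values are invented.
    t = np.linspace(0.1, 10, 100)              # years in service
    hazard = (weibull_hazard(t, 0.5, 1.0)      # infant mortality
              + 0.02                           # random failures
              + weibull_hazard(t, 5.0, 12.0))  # wear-out
    print(f"year 0.1: {hazard[0]:.2f}, year 5: {hazard[49]:.3f}, year 10: {hazard[-1]:.3f}")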
I'm almost sure that components without moving parts will become technologically obsolete long before they start to fail. When I buy a used laptop I always change the HDD and the DVD drive, and its reliability jumps sharply.
That may very well be true on average but I'd bet there are plenty of CPUs and memory modules that fail in the first year of usage for example. After all CPUs are tested and sorted into high/low performance parts, so sample variation itself would be enough to generate some early failures.
As a consumer it's hard enough to keep up with what's reliable in hard drives. Keeping the manufacturers honest with good stats for the most common parts would be great.
Even for things with moving parts, it would be nice to know that model X of brand Y has an MTBF of 4.5 years - but hunting for the same model X 4.5 years later isn't likely to yield the exact same hardware, just some later revision of the same specced hardware.
There's an interesting anecdote Joe Armstrong likes to tell about people who claim they've built a reliable or fault-tolerant service. They'll say "This is fault tolerant, there are multiple hard drives in there, I have done formal verification of my code and so on..." and then someone trips over the power cord and that's the end of the fault tolerance. It's a silly example - of course they'd properly provide power to an important rack of hardware - but the point is that, in the simplest case, the system is only as fault tolerant as its weakest component. It's that one bad capacitor from Taiwan that might take the whole thing down, or just a silly cosmic ray.
One needs redundant hardware to provide certain guarantees about the service being up. This means load balancers, multiple CPUs running the same code in parallel and comparing results, running on separate power buses, different data centers, different parts of the world.
Yeah, but with the number of 9s you see, you realize that asteroids are NOT taken into account. For example, Amazon advertises 99.999999999% durability for a given year for S3 objects. This is just stupid. An extinction-level event (asteroid, global thermonuclear war, black hole) could easily wipe out ALL data on S3. We know that mass extinctions have occurred about once every 100 million years. That means that if we expect a 10^-8 chance of a mass extinction event in a given year, Amazon would need a 99.9% chance of surviving a mass extinction in order to meet average durability ratings for S3.
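A sketch of that arithmetic:

    # Back-of-the-envelope version of the parent's point.
    annual_loss_budget = 1 - 0.99999999999   # ~1e-11 chance of losing an object per year
    p_extinction_per_year = 1e-8             # ~one mass extinction per 100 million years

    # If an extinction-level event destroys everything unless it is somehow
    # survived, the required survival probability is:
    p_survive_needed = 1 - annual_loss_budget / p_extinction_per_year
    print(f"{p_survive_needed:.3%}")         # ~99.900%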
After a certain number of 9s you just have to smile, nod, and truncate the number.
> That means that if we expect a 10^-8 chance of a mass extinction event in a given year, Amazon would need a 99.9% chance of surviving a mass extinction in order to meet average durability ratings for S3.
On the other hand, it's unlikely that many people would be around to file a claim. It's more likely that Amazon would go out of business before anything like this happened, anyway.
At one of my last jobs someone put the following ticket in the bug tracker "Following the end of the world on 12/21/2012 the system Blah will stop working"
After the date it was closed with "As the end of the world didn't happen, this ticket is no longer needed."
If MTBF is such a big issue, would it ever be possible to build a spacecraft that travels between the stars and retains the ability to communicate? I guess hats off to the designers of Voyager and the other spacecraft whose MTBF seems to have crossed 36+ years for many components, including CPU and power supply. But for interstellar craft that MTBF still seems VERY low. And, seriously, an MTBF of 5 years seems like a joke for a desktop when lots of mechanical components with moving parts last longer.
Spacecraft and rovers use ridiculously armoured, redundant systems to get past the fact that they would otherwise fail quite regularly in such a hostile environment. The Curiosity rover (launched in 2011) uses what would normally be quite an outdated 132 MHz CPU that's been specially shielded to achieve the reliability the program needs; even then, there are two redundant systems that do health checks on each other to avoid bit flips. Even with all of that, they're running on only one CPU and trying to diagnose why the first one failed.
It's probably not fair to compare the MTBF of specialised hardware to the $35 CPU I bought at the retailer down the street either, the RAD750 processors in Curiosity cost almost a quarter of a million dollars each.
http://history.nasa.gov/computers/Ch6-2.html
A very interesting article on Voyager's computer system. It turns out most of the system is not powered most of the time - even the component doing the health checks, which is called the CCS.
"The frequency of the heartbeat, roughly 30 times per minute, caused concern [176] that the CCS would be worn out processing it. Mission Operations estimated that the CCS would have to be active 3% to 4% of the time, whereas the Viking Orbiter computer had trouble if it was more than 0.2% active15. As it turns out, this worry was unwarranted."
They use DMA a lot: instruments write to memory, and occasionally the CPU is turned on and picks up the new values. They also had to cope with the fact that memory degrades, so the system needs to adapt to working with less memory. The bus is 16 bits wide, but they actually process 4 bits at a time, so an addition takes 4 cycles. CPU registers are stored in RAM, so presumably they can be reassigned if a memory cell fails.
Parts of the system were reused from the Viking mission. They were also reprogramming the system in flight during the eighties! That's why they could start the mission without having the full software on board, and the mission was extended thanks to reprogramming. Just for the Jupiter visit they made 18 software updates - think about that next time a software update breaks something on your system.
Also, it's all a distributed system with several CPUs and some elements of redundancy - awesome tech. I guess one day alien hackers will have fun reverse engineering this system.
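To illustrate the 4-bits-at-a-time point (this is not Voyager's actual instruction set, just a sketch of why a 16-bit add on 4-bit-wide hardware takes four passes):

    def add16_nibble_serial(a, b):
        """Add two 16-bit values 4 bits at a time, propagating the carry,
        the way a 4-bit-wide ALU needs 4 passes for a 16-bit word."""
        result, carry = 0, 0
        for step in range(4):                      # one "cycle" per nibble
            na = (a >> (4 * step)) & 0xF
            nb = (b >> (4 * step)) & 0xF
            s = na + nb + carry
            result |= (s & 0xF) << (4 * step)
            carry = s >> 4
        return result & 0xFFFF, carry              # 16-bit sum plus carry-out

    print(add16_nibble_serial(0x1234, 0x0FCD))     # (0x2201, 0)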
I thought about another very reliable system: deep under the sea, the NSA has a big switch that splits deep underwater communication lines.
Now, that thing has to work 24/7 in a hostile environment, has to be hidden, has to deal with enormous quantities of data, and it costs a lot to replace or repair, so it must be very reliable.
What is driving technological progress? Instead of a space program, we now have political control of the Internet as the driving force. I guess that's what they mean when they say that civilization is turning inwards ;-)
Yes, in many areas the NSA and Google are pushing the envelope: long-term data storage, map-reduce over large data sets, AI - you name it, they have it.
I imagine under the sea is actually quite a nice place to be, if you assume perfect waterproofing. There's little radiation penetrating the water, so I imagine there's less chance of bit flips. You don't need to worry so much about cooling, as the whole ocean is your heatsink. Accessibility would suck, but a bunch of redundant hardware wouldn't be awful.
These switches are usually fiber splits so they are slightly less complex than you are envisioning.
The equipment doesn't have to actually duplicate L2 frames. It just uses standard fiber repeaters (already a common component in undersea cables) to get it back to a more friendly environment where they can actually decode and process it.
He's talking about desktop/server CPUs where people care about performance. If you don't care so much about performance, you can increase the transistor sizes, reduce the clock speed, and achieve totally insane MTBF... as space-rated hardware tends to do. Kind of like how server CPUs are underclocked to increase MTBF, but more so.
(Conventional) solid-state devices are very unlikely to fail - the exception being flash memory.
Apart from electron migration issues and failures from excess voltage/temperature, they're pretty long-lasting.
It's much easier to have a failure because of something else: capacitors failing, oxidation, or mechanical failure (for example, from thermal expansion/contraction).
I've seen people complaining about a dead CPU, but I can't find it right now.
You are correct. I want to clarify that the failure process is electromigration, not electron migration. It is caused by electrons but it is ions in metal that migrate. Wikipedia has a good description: https://en.wikipedia.org/wiki/Electromigration.
I design integrated circuits and one of the constraints in selecting the width of wires is to make sure that the maximum current density is below the electromigration threshold.
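In practice that constraint boils down to a simple sizing rule: the design rules give a maximum current per unit width for each metal layer (at a given temperature), and the wire has to be at least wide enough for the current it carries. A minimal sketch with invented numbers:

    def min_wire_width_um(i_avg_ma, em_limit_ma_per_um):
        """Minimum metal width so the average current stays under the process's
        electromigration limit, expressed as mA per micron of width."""
        return i_avg_ma / em_limit_ma_per_um

    # Invented numbers: a wire carrying 4 mA on a layer whose EM rule allows
    # about 1 mA per micron of width at the target junction temperature.
    print(f"minimum width ~ {min_wire_width_um(4.0, 1.0):.1f} um")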
I actually just returned my CPU (Phenom II X4) to AMD, and they've replaced it, but they didn't say exactly why it died. I've asked them for more details, hopefully they can tell me.
Overall though, given with how many computers I've worked with, CPU failures still seem rarer than Memory, Disk, Mobo, or Graphics failures. Of course it ends up being the CPU in _my_ computer that fails -.-
There is still some variance in silicon, so yours may have had a defect that manifested itself after some time. I'm not sure whether they evaluate returned defective chips to see what happened (or whether that's public info).
Also, the packaging is extremely complex and prone to the same kind of defects as other PCBs in the system.
I'd like to throw in my experience:
I was in charge of 300+ x86 rack servers and around 50 desktops for 3 years and never saw a single CPU fail, even on old Pentium 4s with dusty fans.
Disk failures are very common, followed by the much rarer RAM and motherboard failures.
I suspect server chips are rated for a 10-15 year average lifespan.
Soft errors are a very real property of low-voltage digital electronics. I personally observed what could only realistically be explained as a soft error in a customer unit running in the field. A single bit was flipped in the program memory of the embedded application and was causing the system to malfunction in an obvious and repeatable manner. We've since added CRC checking of the program memory and some of the static data sections to flag this and reset in the future.
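A minimal sketch of that kind of check, assuming a hypothetical image layout where the last 4 bytes hold the CRC32 of everything before them (not the poster's actual format): compute the CRC at build time, store it alongside the image, and periodically re-verify at runtime.

    import zlib

    def program_image_ok(image: bytes) -> bool:
        """Check an image whose last 4 bytes hold the CRC32 of everything
        before them - enough to catch a single flipped bit."""
        payload, stored = image[:-4], int.from_bytes(image[-4:], "little")
        return (zlib.crc32(payload) & 0xFFFFFFFF) == stored

    payload = bytes(range(256)) * 16
    image = payload + (zlib.crc32(payload) & 0xFFFFFFFF).to_bytes(4, "little")
    corrupted = bytes([image[0] ^ 0x01]) + image[1:]   # simulate a single bit flip

    print(program_image_ok(image), program_image_ok(corrupted))   # True False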
Long-term failure rates are usually measured not in real time but in deliberately heat-elevated environments that simulate many years of stress in a few months. This work is essential to ensure the design decisions they've made don't accidentally cause their chips to fail after 2 years (which might be outside the warranty lifetime but would still result in class-action lawsuits and horrible publicity).
My immediate reaction is to ask how this reliability characteristic of CPUs affects critical software applications. Certainly some space missions and medical devices out in the field must have surpassed the MTBF mark for their CPUs.
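The usual model behind those heat-accelerated tests is Arrhenius scaling of thermally activated failure mechanisms; the activation energy and temperatures below are illustrative assumptions, not real qualification data:

    import math

    BOLTZMANN_EV_PER_K = 8.617e-5

    def arrhenius_acceleration(ea_ev, t_use_c, t_stress_c):
        """Acceleration factor for a thermally activated failure mechanism:
        AF = exp((Ea/k) * (1/T_use - 1/T_stress)), temperatures in kelvin."""
        t_use, t_stress = t_use_c + 273.15, t_stress_c + 273.15
        return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1 / t_use - 1 / t_stress))

    # Illustrative values only: 0.7 eV activation energy, 55 C use, 125 C stress.
    af = arrhenius_acceleration(0.7, 55, 125)
    print(f"AF ~ {af:.0f}x: 1000 h of stress ~ {af * 1000 / 8766:.1f} years of use")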