
Sorry, unable to believe: 2-4% failure rate for CPUs?

That's for detected/known failures: what about the random, unable-to-reproduce, hardly-noticed data-skip failures?

Have I been living in a fantasy bubble where CPUs do exactly what you asked of them (and errors come from not holding it right)?




This is consistent with their report for 2019-2021, preceding the issues with the 13th/14th gen: https://www.pugetsystems.com/labs/articles/most-reliable-pc-...

Apparently it was much better in 2018: https://www.pugetsystems.com/labs/articles/most-reliable-pc-...

On the other hand, 2011 did show 1.5%: https://www.pugetsystems.com/labs/articles/most-reliable-pc-...

GPU failure rates also weren't great 10-15 years ago, in particular for AMD: https://www.pugetsystems.com/labs/articles/video-card-failur...


Had three ATI Radeons back in the day, and that's only because the first two died under warranty one after the other lmao. It was even worse before AMD bought them.


> "Sorry, unable to believe: 2-4% failure rate for CPUs?"

It seems high to me too. I can't recall ever having a CPU fail and I must have used/owned hundreds of them in my lifetime. But presumably, failure rates in data centres where CPUs are run 24/7 at high temperatures etc are higher than in consumer applications?

Interesting that failure rates seem to peak in the summer months, too, and this didn't seem to be explained in the article. Perhaps the data center's cooling is working less effectively in summer?


How many CPUs have you owned? Compare this to rolling a die in a tabletop game. How often do you roll a natural one or snake eyes?

On a 20-sided die a natural one only comes up 5% of the time.

I don't think I'm quite at having owned 40 (desktop) CPUs, but I've had CPUs fail a number of times. An old Athlon XP 3000+ had bad CPU cache, and I was able to take it from crashing repeatedly to working by disabling a large swath of the L2 cache in a BIOS setting. I am just now retiring a workstation with an AMD Ryzen 5950X that was generally unstable: it would pass memtest and every diagnostic I know how to run, but about once a month it would print a message on all my open consoles about an MCE (machine check exception) detected on some random CPU number. One time I had one of those budget triple-core CPUs that had the 4th core turned on in the BIOS by accident, which caused a ton of issues; once I figured out it was being detected as a quad-core, I went in and disabled that last core and it went back to working just fine.

I'm sure at least one or two of my Intel machines failed similarly; I've had a number of dead CPUs, including a few Pentium 4s back when I used to fool around with overclocking settings, and at least a few dead server chips. And I know I've had a few systems that simply wouldn't boot even with all parts fresh out of the box, but would when I swapped out one part or another, and sometimes that failed part was a CPU, whether AMD or Intel.

Oh, I also get to cheat: I have no clue how many dead Solaris CPUs I've seen! Those big mainframe-imitation systems could hot-swap CPUs, and it was pretty rare to have a cabinet-sized computer where every piece of hardware was fully operational, at least in my role as a legacy maintainer of such things. I bet they worked much better when they rolled off the factory line.
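Those once-a-month MCE messages can be fished out of the kernel log. A minimal sketch of the kind of scan you'd run on `dmesg` output (the sample log lines below are made up for illustration; on a real system you'd feed it actual kernel log text):

```python
import re

# Scan kernel log text for Machine Check Exception reports.
MCE_RE = re.compile(r"mce|machine check", re.IGNORECASE)

def find_mce_lines(log_text: str) -> list:
    """Return the log lines that mention an MCE."""
    return [line for line in log_text.splitlines() if MCE_RE.search(line)]

# Illustrative sample, not real output:
sample = """\
[12345.678] usb 1-2: new high-speed USB device number 4
[12346.001] mce: [Hardware Error]: Machine check events logged
[12346.002] mce: [Hardware Error]: CPU 7: Machine Check: 0 Bank 5
"""
print(find_mce_lines(sample))
```

In practice, tools like `mcelog` or `rasdaemon` decode these records properly; this just shows that the errors are visible in the log if you look.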


Outside of overclocking it's very rare to see a failed CPU. Validation testing at the fab almost always catches the lemons, and it usually takes special circumstances for one to degrade after fabrication. OC'ing with higher voltages is the most common culprit. The number of genuinely bad CPUs I've seen in my IT career I can count on one hand. Intel's current issue is due to a manufacturing error and definitely qualifies as extraordinary circumstances.


At some point CPU manufacturers started treating overclocking as a feature rather than something hobbyists do, and computer OEMs started tuning things for it. But with quick generation cycles and no long-term testing, it was only a matter of time until we started seeing these issues, since Moore's law hasn't helped much with single-core performance for years.

My current laptop was getting uncomfortably hot when random browser pages started stressing the CPU. After some searching I noticed the default setting enabled some kind of "Boost Mode" (basically overclocking in the classic sense); disabling it made a world of difference. Looking at the failure rates of the Ryzen 5000 series in the article, I'm not a single bit surprised.

Googling the laptop family, you get tons of Reddit hits: https://www.reddit.com/r/ZephyrusG14/comments/gho535/importa...


Oh yeah, not trying to say this level of failure is normal. Puget getting 5% failures, or thereabouts, is a typical historical failure rate, and they're getting it while being more conservative than others. Everyone else is running more aggressive defaults.

I was just trying to provide a few examples of real first-hand failures. Most OC'ing doesn't break anything, but every once in a while you set some voltage and the part never works again; it's hard not to conclude it was the OC when the failure coincides so perfectly. I suppose it could be coincidence, but that stretches credulity.


"How many CPUs have you owned?"

Another issue: as an end user, a lot of the time I don't have the resources to prove it was the CPU. I've had laptop failures that could have been the CPU, the motherboard, or the power supply; all I really know is that it doesn't boot, and I can't diagnose it given the level of integration and the inability to get replacement parts even to try. As a poor college student I had the time and desire to carefully swap parts and exert maximum diagnostic effort, because I couldn't just buy a new complete setup; not everyone goes to that level of effort, and even then they may not be correct.

End users really can't pick up on these trends; the data set is too noisy. Sure, "end users" may have had suspicions about this, but I've also seen communities come to consensus that certain things were broken when I had very good reason to believe they were wrong: just internet forums amplifying random loud voices being confidently wrong until it became "common consensus" through nothing but the confident, loud repetition of the error.


I experienced problems with a 5950X as well, and nobody (including myself) believed it was the CPU until I got another 5950X which just worked with the same setup.

The issue wasn't easy to reproduce: all standard checks and torture tests ran just fine, but a much more random workload crashed the unit when all cores were maxed for a few seconds at a time (e.g. during compiling). Sometimes it happened twice a day, sometimes once a week.


I'm in the process of troubleshooting my son's desktop and have swapped out every bit of hardware except the motherboard and the ~2-year-old 5600X itself. I just assumed the motherboard had failed at that point and RMA'd it, but the OEM tested the board and said it checked out. At this point, the CPU is the only thing left. :-/


Try disabling C states in the BIOS. And verify they're disabled within the OS.
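That second step matters because firmware settings don't always stick. A best-effort way to check from Linux, via the kernel's standard cpuidle sysfs files (a sketch; these paths are absent in containers or VMs without a cpuidle driver, in which case it reports nothing):

```python
from pathlib import Path

def cstate_status(cpu: int = 0) -> dict:
    """Map each C-state's name to 'enabled'/'disabled' for one CPU."""
    base = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpuidle")
    status = {}
    try:
        for state in sorted(base.glob("state*")):
            name = (state / "name").read_text().strip()
            disabled = (state / "disable").read_text().strip()
            status[name] = "disabled" if disabled == "1" else "enabled"
    except OSError:  # files missing or unreadable
        return {}
    return status

print(cstate_status() or "no cpuidle sysfs on this system")
```

If the BIOS setting took effect, the deeper states (C6 etc.) should either be gone from the list or show as disabled.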


Unfortunately, the system was running fine for 2 years and then deteriorated to not-quite-fully-dead in the span of about 20 minutes.

It would only attempt to POST about 1 time in 10 power-on attempts, and wouldn't finish enough of the POST even to make it into the BIOS/UEFI setup. When it failed to even attempt to POST, it wouldn't initialize the USB peripherals (the keyboard didn't light up).

I swapped everything (GPU, RAM, power supply, etc.) before getting to the point of suspecting the motherboard was faulty, because it would not Q-FLASH via a USB thumb drive, so I sent it in for warranty service. Since the manufacturer said the motherboard passed their quality checks, there is nothing left that it could be besides the CPU.

Even given all that, I refused to believe the CPU could be the culprit until I saw the failure rate graph for AMD 5000-series processors in the article, which far exceeds what I figured would have been in the 0.5% range.

Live and learn I guess but it certainly adds a new fun dimension to troubleshooting because I don't exactly keep spare CPUs laying around "just in case" like I do with a spare power supply.


Oof. Still, hopefully other people read this and try disabling C states. I was about to spend the $25 for a spare CPU when the C state issue came up and...

For me it was simple. Disable C states: stable for a week (at first it'd be weeks between crashes, then days, then daily, and at that point a couple of hours). Enable them again: crashed within an hour. Disable again: stable for a week. Flip back and forth a handful of times in the same day... same as above.

I can only guess that this might be something that was fixed in the mysterious "frequency voltage curve change" stepping.


On the 5950x, try disabling C states (not P states; C states, the ones where it 'sleeps' CPU cores). And verify they're disabled within the OS.

It took most of a year to nail that one down.

AMD didn't ask for anything besides proof of ownership once I told them disabling C states fixed it. And, 5950x's have a 5 year warranty...

PS: if they were Northwood P4's, see: https://www.overclockers.com/forums/threads/the-official-sud... (I didn't name it. I agree, terribly insensitive name)


Oh, glad to know I'm not the only one with the 5950x MCE errors. I probably haven't seen one in a year though, so maybe it was finally fixed.


I had a 5950x with similar issues to other posters here. It failed outright after just over two years.


This was common with AMD CPUs in the past; the Ryzen 1000 range had a widespread problem where many made in 2017 would randomly segfault from time to time under Linux. It was a whole drama, and you had to RMA them until you got lucky.


> I can't recall ever having a CPU fail

How do you know? I thought one of the main ways these were failing resulted in blue screens. I'm sure you've had a bunch of blue screens in your lifetime.


4% seems very high to me, but CPU errors happen with relative frequency, and design mistakes are common.

If you ever run “cat /proc/cpuinfo” on, say, Skylake, Linux will happily tell you it has 5-6 workarounds active for hardware mistakes.
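On kernels from 4.15 onward, the same information is also exposed per vulnerability under sysfs, including what mitigation is active. A small sketch that reads it (Linux-only; prints nothing on systems without that directory):

```python
from pathlib import Path

def cpu_vulnerabilities() -> dict:
    """Map each known hardware vulnerability to the kernel's mitigation status."""
    base = Path("/sys/devices/system/cpu/vulnerabilities")
    try:
        return {f.name: f.read_text().strip() for f in sorted(base.iterdir())}
    except OSError:  # directory absent (old kernel, non-Linux, etc.)
        return {}

for name, status in cpu_vulnerabilities().items():
    print(f"{name}: {status}")
```

This is the same list as the "bugs" line in /proc/cpuinfo, but with the mitigation details spelled out (e.g. "Mitigation: Retpolines" vs "Vulnerable").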

CPUs are still pretty darn reliable. Think about how many GHz your CPU runs at, multiplied by how many instructions per cycle it executes, and then calculate the failure rate if there were just 1 mistake per minute. Nothing on earth would compare.
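Running that back-of-envelope calculation with assumed figures (a 4 GHz clock and 4 instructions retired per cycle; both are illustrative, not measured):

```python
# What per-instruction error rate would "one mistake per minute" imply?
clock_hz = 4e9           # assumed 4 GHz clock
ipc = 4                  # assumed instructions retired per cycle
ops_per_minute = clock_hz * ipc * 60
error_rate = 1 / ops_per_minute   # if it made one mistake per minute

print(f"{ops_per_minute:.1e} instructions per minute")
print(f"error rate ~{error_rate:.1e} per instruction")
```

Roughly 10^12 instructions per minute, so even a CPU that erred once a minute would be correct about 999,999,999,999 times out of a trillion.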


Mechanical hard drives are absolutely on the same, or higher, level of reliability. It's mind boggling what we can achieve when we really focus on quality outcomes.


Wow, you weren't joking:

> apic_c1e spectre_v1 spectre_v2 spec_store_bypass swapgs taa itlb_multihit srbds mmio_stale_data retbleed eibrs_pbrsb gds bhi


I mean, a bunch of those are timing issues in speculative execution; you could argue it's working as designed and people just didn't anticipate the existence of timing exploits. I'd call that different from computation errors.


As the original comment suggested, about 5-6 of these are not related to timing exploits (or at least, not the Meltdown/Spectre variants which claim so many patches to their name). Summaries from LKML and Kernel.org:

> apic_c1e

Both ACPI and MP specifications require that the APIC id in the respective tables must be the same as the APIC id in CPUID.

The kernel retrieves the physical package id from the APIC id during the ACPI/MP table scan and builds the physical to logical package map.

There exist Virtualbox and Xen implementations which violate the spec. As a result the physical to logical package map, which relies on the ACPI/MP tables does not work on those systems, because the CPUID initialized physical package id does not match the firmware id. This causes system crashes and malfunction due to invalid package mappings.

The only way to cure this is to sanitize the physical package id after the CPUID enumeration and yell when the APIC ids are different. If the physical package IDs differ use the package information from the ACPI/MP tables so the existing logical package map just works.

> taa

TAA is a hardware vulnerability that allows unprivileged speculative access to data which is available in various CPU internal buffers by using asynchronous aborts within an Intel TSX transactional region.

> itlb_multihit

iTLB multihit is an erratum where some processors may incur a machine check error, possibly resulting in an unrecoverable CPU lockup, when an instruction fetch hits multiple entries in the instruction TLB. This can occur when the page size is changed along with either the physical address or cache type. A malicious guest running on a virtualized system can exploit this erratum to perform a denial of service attack.

> srbds

SRBDS is a hardware vulnerability that allows MDS techniques to infer values returned from special register accesses. Special register accesses are accesses to off core registers. According to Intel's evaluation, the special register reads that have a security expectation of privacy are RDRAND, RDSEED and SGX EGETKEY.

> mmio_stale_data

Processor MMIO Stale Data Vulnerabilities are a class of memory-mapped I/O (MMIO) vulnerabilities that can expose data. The sequences of operations for exposing data range from simple to very complex. Because most of the vulnerabilities require the attacker to have access to MMIO, many environments are not affected. System environments using virtualization where MMIO access is provided to untrusted guests may need mitigation. These vulnerabilities are not transient execution attacks. However, these vulnerabilities may propagate stale data into core fill buffers where the data can subsequently be inferred by an unmitigated transient execution attack. Mitigation for these vulnerabilities includes a combination of microcode update and software changes, depending on the platform and usage model. Some of these mitigations are similar to those used to mitigate Microarchitectural Data Sampling (MDS) or those used to mitigate Special Register Buffer Data Sampling (SRBDS).


itlb_multihit is the only one that sounds like an actual bug, just like F00F and FDIV were on the original Pentium. Timing and other data side-channels are arguably not bugs as Intel has long maintained the stance that CPU protection rings are not security boundaries but only meant to protect against accidents instead of deliberate maliciousness.


> There exist Virtualbox and Xen implementations which violate the spec. As a result the physical to logical package map, which relies on the ACPI/MP tables does not work on those systems, because the CPUID initialized physical package id does not match the firmware id. This causes system crashes and malfunction due to invalid package mappings.

You can argue that the system shouldn't crash (although at that low a level, not sure what else can happen)...

but beyond that, how is "VirtualBox and Xen implementations violate the spec" a failing of a CPU?


Here is for my Ryzen 7700X: sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso


FWIW, “sysret_ss_attrs” is a workaround for a design error in AMD’s x86_64 implementation. One might argue that AMD is right because they designed AMD64 in the first place, but IMO this is silly, and AMD’s design is unjustifiable.

(I’m the one who characterized this issue on Linux and wrote the test case and the workaround.)


It's pretty widely publicised that even on older Intel CPUs there is a non-zero number of servers giving unexpected results at scale: https://arxiv.org/pdf/2102.11245


I wonder if this buckets things like motherboard problems under the same causes. Those numbers do seem very high in general.

I guess it would be a pretty useless comparison if they weren't carefully filtering for CPU-only failure though..


I broke the first CPU I bought with my own money: a 180 MHz Pentium Pro, back in early 1997 (two days before Intel introduced MMX). I ran it overclocked at 200 MHz for a good while until I started getting stability issues; after that I had to downclock to 133 MHz to use it at all (even 180 was unstable), and I bought a new computer once I started my first "real" job.

I've seen other HW issues too: memory or motherboard (soldered memory) on my ex's laptop that affected only certain address ranges. Since then, memtest86 has been my go-to for checking computer health when random crap starts happening, and I've replaced at least one memory stick on another machine thanks to it.



> Have I been living in a fantasy bubble where CPUs do exactly what you asked of them (and errors come from not holding it right)?

These are gaming CPUs clocked right up to the threshold of stability (and in some cases, past it)

Server CPUs with ECC RAM are significantly better.

However, if you haven’t experienced much CPU instability, you may not have operated at scale where it appears. Get into the scale where operations occur across 100,000s of CPUs and weird things happen constantly.


One of the previous discussions of this instability noted that it was happening to the server equivalents (which are after all the same die) at stock settings as well.


I feel like the entire fiasco has been multiple issues lumped together as one, and muddied to the point that even a bluescreen from attempting to run XMP at extremely high MT/s is now being claimed as degradation. From what I can make out of this mud, there seem to be (1) failures caused by high current due to some boards unlocking IccMax/PL1/PL2 by default, and (2) high voltage during single-core boost (TVB). The former is caused by overclocking; the latter seems to be Intel's failure to validate the CPUs under low-load/long-duration single-core boost, where IccMax/PL no longer matters as much (since single-core boost never exceeds PL1 anyway).

Most Raptor Lake "server boards" right now are W680 with client CPUs, because the C266/Xeon E-2400 took a long time to come out. The ones intended for workstations typically have overclockable settings, or are even overclocked by default, which means they're likely to hit failure (1). The ones intended for servers do have more conservative settings, but can still hit failure (2) under some conditions.

Buildzoid released a video a while ago on Supermicro W680 blades that were having issues after running a single-core load 24x7, which is essentially 24x7 boost[1] (i.e. issue (2)). The Xeon E-2400 _could_ be affected in this scenario, although even the highest-clocked E-2400 SKU (the E-2488) only runs at 5.6 GHz without Thermal Velocity Boost, and most others range from 4.5 to 5.2 GHz boost (rather than the 5.8 to 6 GHz boost some client SKUs do). I feel the actual B0 Xeon E-2400 would be a lot less prone to both failures (1) and (2) because of this (though it could happen; there are no reports of it so far).

But then the conversation gets muddied enough that "even servers and Xeons are affected" becomes the common narrative (while the former is true, the circumstances need to be noted; and for Xeons it's a _maybe_ at most, since right now there are no reports of Xeon E-2400s failing).

[1]: https://www.youtube.com/watch?v=yYfBxmBfq7k
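Since so much of this hinges on what IccMax/PL1/PL2 a given board actually enforces, it's worth checking rather than trusting the spec sheet. A best-effort sketch reading the package power limits via the Linux powercap/RAPL sysfs interface (paths are the standard kernel ones, but absent on non-Intel hardware, older kernels, or inside containers):

```python
from pathlib import Path

def rapl_power_limits() -> dict:
    """Read the enforced package power limits (PL1/PL2) in watts, if exposed."""
    base = Path("/sys/class/powercap/intel-rapl:0")
    limits = {}
    try:
        for f in sorted(base.glob("constraint_*_power_limit_uw")):
            idx = int(f.name.split("_")[1])          # constraint_0 -> PL1, etc.
            limits[f"PL{idx + 1}"] = int(f.read_text()) / 1e6  # uW -> W
    except OSError:  # path absent, or not readable without root
        return {}
    return limits

print(rapl_power_limits() or "no RAPL sysfs on this system")
```

On a board that has "unlocked" the limits you'd see absurdly high values here (e.g. 4096 W), versus Intel's spec numbers on a conservatively configured server board.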


Looking around, I'm seeing reports of 1.4-1.5V core voltages using Intel's stock profiles, with some even going to 1.7V. That's insanely high for a 10nm process, and I'm not surprised about the degradation. For comparison, in the 45/32nm days 1.2-1.3V was the norm; some extreme overclockers (who don't expect CPUs to survive more than a few minutes, using liquid nitrogen etc.) hit ~1.5V, and 1.4V was a commonly quoted safe upper limit for 24x7 operation.


This is why I think it's going to be much harder for Xeons to be affected by this: they normally run with more conservative voltage settings. I don't have a Xeon E-2400 to look at, but Raptor Lake should be able to do 5.6 GHz at 1.3-1.4V-ish, which should be within a safe voltage range. (Even the "power hungry" w9-3495X only runs at ~1.25V during 4.8 GHz TVB, and ~1.15V at its non-TVB 4.6 GHz boost.)


I remember hearing that server motherboards also played a role in overclocking out of the box, which is frankly fucking stupid. I don't recall anything about Raptor Lake-based Xeons suffering from degradation.


Unless they disable Turbo Boost (which is horrible for performance, but great if you want benchmark consistency), the CPU will automatically overclock until it reaches the limits, adjusting both voltage and frequency.

All the evidence I've seen points to electromigration as a cause of this degradation, and IMHO excessively aggressive automatic overvolting by Intel's microcode is to blame.

There is actually a simple experiment which can determine whether that is true --- remove the fan from the heatsink, or even let the CPU run without a heatsink. As the CPU will automatically throttle once it reaches its designed maximum temperature (and AFAIK that is a hardcoded limit), it will lower its frequency and voltage to maintain that temperature. If this results in a stable CPU, while the one that has great cooling becomes more unstable, it confirms the hypothesis.

There are numerous stories of machines where the heatsink was not in contact with the CPU for some reason, yet they remained perfectly stable (but slow) for many years. I can also say that I've had an 8th-gen i7 running at 100% 24x7 with all power and turbo limits disabled, with its temperature constantly at the design limit of 100C, and it has also remained stable for over 5 years.


> There are numerous stories of machines where the heatsink was not in contact with the CPU for some reason, yet they remained perfectly stable (but slow) for many years.

I once had a laptop which came from the factory missing the four screws that hold the heatsink to the CPU. It was very slow, and shut down after a few minutes (the BIOS event log listing thermal shutdown as the reason helped diagnose the issue). After the four screws were replaced (each screw came in its own large individual cardboard box), it worked fine for many years, BUT after a couple of years (still under warranty), the motherboard failed with a short in the power input. I suspect all the extra heat from when the CPU was without a working heatsink went to the power supply components through the motherboard ground plane and cooked them, significantly shortening their useful life.


>Unless they disable Turbo Boost (which is horrible for performance, but great if you want benchmark consistency), the CPU will automatically overclock until it reaches the limits, adjusting both voltage and frequency.

Turbo Boost (and Thermal Velocity Boost if applicable) frequencies are according to specifications, it's not an overclock.


That's a lot of CPU time. Maybe they just don't make them like they used to. If there's some crazy complicated numerical or combinatorial problem you've been trying to crack, do tell.



