But if you are creating a chip that other people will create software for...then you don’t really know.
This is in the context of assessing usage duty cycles of circuits for aging purposes.
I can imagine an era of exploits that rely on "aging out" paths that were assumed to be rarely used. Like rowhammer, but persistent -- fire up a process on a random cloud instance, run a tight loop of code to wear out an exploitable path in e.g. SGX, rinse lather repeat until you have the coverage you need...
Possible but unlikely IMO. For one thing it would be trivial to detect. To be a suitable attack surface it would need to wear out ~100x faster than the most robust pathways (months instead of years). I am skeptical that manufacturing allows for that degree of flexibility. I would be more worried about two related issues:
1. If someone figures out how to focus execution that you'd normally assume distributed in a single area. The simplest example would be to write the first line of cache over and over again; that kind of thing has extensive protection but the principle is reasonable. I'm not sure what you'd use for this.
2. Aging related issues can be worked around; if a subunit loses performance you can give it more time to settle and reorder the pipeline around it. That would open up entirely new classes of timing attacks. You run a quick scan of every combination of execution paths (to find the biggest overlap of underperforming transistors) and you'll be able to make attacks using any number of extremely hard to predict timing combinations.
I think in the long term, we'll get away from the concept of general purpose processing units towards some units that have the right architecture and wear prevention to execute untrusted code and others which require less power but wear out more quickly. There will be domains for all those types of computing.
You can have the CPU share the die with the secure cores, but you'd need to figure out a lot of complexity, like dealing with memory access pattern leakage due to a shared memory bus. The fewer things you share, the less the performant parts need to care about security and the less the secure parts need to care about performance.
It would be unfortunate if future process improvements resulted in fragile CPUs and GPUs. I can sort of imagine Nvidia rubbing their hands with glee at the prospect of non-overclocked GPUs aging prematurely, killing used sales and forcing data centers to upgrade to more expensive compute cards.
I think that professional tier hardware generally comes with some sort of guarantees on how long they last, so data centers shouldn't be affected much.
There is a market for used cards. It is one way to get a relatively slow but high RAM compute card without paying extreme prices. Killing that market would force a lot of non-corporate users to start coughing up for extremely expensive new hardware.
That's correct. And it makes sense. If you're buying HW in bulk, you take into account how much power it will draw, what the expected life expectancy is, and how much work a given part will do, not just the price.
If you get an amazing price on a part that fails often, it might cost you way more in the long run.
In general I would expect a bulk purchaser to be less sensitive to failure.
If I'm buying one drive or CPU, I might pay a premium to drop the failure rate from 4% to 1%. If I'm buying dozens to hook together in a fault-tolerant system, I'll go for the cheap one and buy a few extras.
When you buy 1000 CPUs with a failure rate of 4% every year,
that gives you, statistically, 40 dead servers each year for 5 years (the average guaranteed component life). That's 200 dead CPUs.
Let's say $1000 per CPU: that's $1,000,000 in cost and $200,000 lost to failures.
At 1% it's 50 dead CPUs in 5 years, i.e. $50,000 in losses.
In this scenario you have $150,000 to save or spend to get better equipment.
Also an important note. On top of that you suffer downtime losses and manpower cost of taking server out and swapping parts. If your eCommerce goes offline that might cause significant monetary loss.
For any large-scale purchase it's all about the numbers game.
For single purchases, paying extra to get from 4% to 1% seems excessive (depending on the cost increase).
At these percentage levels it's a roll of the dice whether it dies or not.
That wasn't supposed to be a per year failure rate. 20% per 5 years is a crazy amount. Divide the numbers by 5 to get a per-year failure closer to my intent.
So if it's $850 for the 4% failure chip, and $1000 for the 1% failure chip, I'll probably buy the cheaper one in bulk. $850k upfront and $34k in replacement, vs. $1000k upfront and $10k in replacement.
There's extra manpower, sure, but even if it costs a hundred dollars of labor per replacement the numbers barely budge.
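If you want to poke at the numbers, here's a minimal sketch of that comparison, assuming the per-year failure rates, prices, and the ~$100 labor figure from this thread (all illustrative round numbers, not real data):

    # Rough total-cost-of-ownership comparison for a bulk CPU purchase.
    # All inputs are the made-up round numbers from this thread.

    def total_cost(units, unit_price, annual_failure_rate, years, labor_per_swap=100):
        expected_failures = units * annual_failure_rate * years
        return units * unit_price + expected_failures * (unit_price + labor_per_swap)

    cheap   = total_cost(1000,  850, 0.008, 5)   # 0.8%/yr ~= 4% over 5 years
    premium = total_cost(1000, 1000, 0.002, 5)   # 0.2%/yr ~= 1% over 5 years

    print(f"cheap:   ${cheap:,.0f}")    # ~ $888,000
    print(f"premium: ${premium:,.0f}")  # ~ $1,011,000

With these inputs the cheap part still comes out roughly $120k ahead over five years, which is the point about bulk buyers being able to tolerate failures.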
> downtime losses
In a big system you shouldn't have those from a single server failure! Downtime losses are the point I was making. As someone buying just one drive or chip, failure costs me massive amounts beyond the part itself. But if I can buy enough for redundancy, then those problems become much much smaller.
If you're running things off one server, then apply the single device analysis, not the bulk analysis.
Most of the pro grade hardware does come with those assurances, usually with support and replacement contracts as well. And make no mistake, they may replace things a lot -- at my old gig we had a lot of visits from Dell. To Dell's credit, we had a LOT of gear to cover, in several different colo spaces.
Point is though, they price the cost of replacement into those guarantees. It doesn't mean the hardware will last longer, just that support & replacements are baked into what you pay.
Maybe 5 years?
I have a gtx970 in my pc which is 5 years old by now. While the card is fine by itself, it is too slow and thus getting replaced in the near future and moved into an office pc.
But which data center uses 5 year old graphics cards?
It is safe to assume that a dedicated compute card gets replaced from time to time anyway.
I'm typing this from a ~6 year old laptop (used as a desktop OFC) for example. My phone is ~5 years old (unbelievable, I know). When it fails in a few years I'll happily buy a "new" ~4 year old refurbished phone again.
Regarding Moore's law, there's only so many possible shrinks left to go. Once we hit that wall the incentive to be on the latest node is significantly reduced. Combine that with associated lifetime reductions and I think larger nodes might even end up preferable in many cases.
Don't forget preservation efforts. I know most people don't care about it but many enthusiasts like giving old systems a spin every now and then. You can still build your dream 386 from used parts off of eBay and play wing commander on a CRT. For the upcoming generation of hardware that might just be impossible then.
A 5-10 year old machine can still be perfectly usable. I have a 2012 laptop with a high end i7 3-series CPU, high end Quadro GPU and I would hate if any of them failed because it's not something I can easily (if at all) fix, the whole laptop would become a doorstop.
An i7 6700 is already 5 years old. That's most certainly not an outdated "can throw away" CPU. Neither will a 3rd gen. Ryzen 4 years from now.
Since the wear is exponentially dependent on temperature, better cooling (e.g. water cooling) could extend the life of chips significantly. So if someone is worried about chip ageing from continuous use, they can just install a water cooler, which is less pricey than an enterprise card.
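As a rough illustration of how strongly temperature enters, here's the standard Arrhenius acceleration-factor formula that reliability models typically use; the 0.7 eV activation energy and the two junction temperatures are just plausible example values, not numbers for any specific chip:

    import math

    def arrhenius_af(t_hot_c, t_cool_c, ea_ev=0.7):
        """Acceleration factor between two junction temperatures (Arrhenius model).

        ea_ev: activation energy in eV; ~0.7 eV is a common assumption for many
        thermally activated wear-out mechanisms, but it varies by mechanism.
        """
        k = 8.617e-5  # Boltzmann constant, eV/K
        t_hot = t_hot_c + 273.15
        t_cool = t_cool_c + 273.15
        return math.exp(ea_ev / k * (1 / t_cool - 1 / t_hot))

    # Dropping junction temperature from 85 C to 60 C (e.g. better cooling):
    print(arrhenius_af(85, 60))  # ~5.5x slower aging under these assumptions

The exact factor depends heavily on the mechanism and the activation energy, but the exponential shape is why a modest temperature drop buys a disproportionate amount of lifetime.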
Maybe industrial water cooling has a better track record, but consumer water cooling is an extremely fiddly and expensive process. My main source is Linus Tech Tips. Unless you're willing to spend a lot of money and effort, air cooling is more effective and much much cheaper. Simple (small) water cooling solutions tend to not perform better than air cooling. Plus, water cooling requires a lot of maintenance, because it's more complex (e.g. there's pumps that can fail) and because the cooling liquid can get contaminated and cause cooling performance to drop.
I can see how it could be cheaper to use cheap air cooling on the chips and efficient, central room cooling.
Spilled some coffee in my open computer case once. Wasn't running faster at all...
But seriously, I think it would be a viable solution for server farms, but it didn't really catch on there yet. Probably still a matter of price. There are some theoretical applications with heat exchangers though. If we could recycle some of that heat, computing would be much more efficient in general.
I assume you're talking about centralized cooling for server farms? The solution I like for that is to turn the entire rear door of each cabinet into a water-fed heat exchanger, with no change to the servers. Then your piping is orders of magnitude simpler and safer.
Probably even better. Water has a nice heat capacity (I think about 10x as much as copper), but maybe that isn't that important for such a solution as long as the heat gets used. Even if we would just get 10% of the invested energy back, it would be a huge boon already.
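For what it's worth, the "about 10x" figure checks out per unit mass with textbook specific heats (per unit volume the gap is much smaller, since copper is so dense):

    # Specific heat capacities, J/(kg*K), standard textbook values.
    water_cp  = 4186
    copper_cp = 385
    print(water_cp / copper_cp)  # ~10.9, so "about 10x" per unit mass holds up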
Heat capacity doesn't really matter unless you'll be using the device for less time than it takes to heat up. If you have two materials with equal thermal conductivity but different heat capacities, their cooling properties will be the same once both reach thermal steady state.
One problem with any form of cooling is that you have to get the heat away from the silicon that's generating it and into the cooling system in the first place. For a lot of complex components (like a CPU), that's very hard to do, since there's dozens (or more, in the case of 3d circuits) of layers of heat-sensitive silicon and metal between any given component and the surface of the heat sink.
“However, densely packed active devices with minimum safety margins are required to realize advanced functionality requirements. This makes them more susceptible to reliability issues caused by self-heating and increasing field strengths."
We're going to have very short-lived electronics.
The Ford EEC IV engine control unit from the mid-1980s was designed for a 30 year lifespan. Which it delivered. Can the industry even make 30-year parts any more?
For automotive, a key issue may be to turn stuff off. All the way off. Vehicle lifespans are only around 6,000 hours of on time. But too much vehicle electronics runs even when the vehicle is off.
This clearly spells the beginning of the end for transistor scaling. If every node makes the reliability worse, eventually the lifetime gets too short and you're through.
The interesting question is going to be how much faster the unreliable parts are than the more reliable nodes once we're at the limit. If they're only 30% faster then in many cases that's a small price to pay for reliability. If they're 30 times faster then what you want is an increase in modularity and standardization, so that the chip that wears out every year can be a cheap fungible commodity part that you can replace without having to buy a new chassis, screen or battery.
>This clearly spells the beginning of the end for transistor scaling. If every node makes the reliability worse, eventually the lifetime gets too short and you're through.
Transistor scaling (dennard scaling) already ended over a decade ago. Even up to that point there have been a variety of reliability challenges at each shrink, some revisited again and again requiring different techniques at each level.
We haven't been making significant progress over the last decade in terms of process size, instead we're just shuffling towards the edge of a cliff (single atoms), the closer we get the more extreme the challenges will get with less return.
We don't have a long road ahead with gradually decreasing reliability, we are already at the end of the road, we aren't going to get much closer to the crumbly bits if it's not worth it.
You're mixing up terms here. Dennard scaling is the claim that power density stays constant with node shrinks. This has failed, but transistor scaling, as in the Moore's Law progression of transistors getting smaller in size, has not, and will continue to run for at least a little while longer.
I think you are correct, technically the end of Dennard scaling is separate.
> This has failed, but transistor scaling [...] has not
Not entirely true; around the same time Dennard scaling ended, transistor channel length scaling also slowed down significantly (although I don't know if they are directly related). It's not a short topic and I'm no expert, but the summary is that process node name values no longer directly relate to transistor dimensions, and scaling has become more "strategic".
In my mind the end of Dennard scaling marked the beginning of the end of the road. The challenges are becoming more fundamental and yet process node reduction no longer yields the same benefits - meanwhile it's becoming realistic to count cross sectional areas in terms of numbers of atoms... the road really is short.
Although that is still a proxy for transistor scale - what's interesting is your plot shows that in terms of density transistor scale is still managing to follow a log scale trend, in spite of the fact transistor scaling itself stopped being uniform long ago.
Transistor count is the wrong metric because we've been increasing die area as well as shrinking process nodes... It's not sustainable to keep increasing die areas, so it's not a pivot for moores law.
Moore's law is about transistor count. Not only that but you are saying the exact opposite in reply to someone else, claiming that process node sizes don't matter.
Moore's law is about the rate of transistor count AND the implications. If you read Moore's paper you will find that the context in which the definition resides is based on transistor scale, and the implication was explicitly expressed in the very same paper by Moore: he observed that transistor doubling via reduced transistor scale gave a faster clock at the same power density for the same die area.
This is no longer true, we are barely still reducing transistor scale (but no longer uniformly), and no longer gain any of the other benefits due to the break down of Dennard scaling, coincidentally at the same time it became difficult to continue to reduce transistor channel scale.
That's Dennard scaling as someone already mentioned to you
You said:
> Transistor scaling (dennard scaling)
Those are two different things.
You also said:
> We haven't been making significant progress over the last decade in terms of process size,
This is false by any reasonable definition, since as has already been said by multiple people, transistors are a fraction of the size they were ten years ago and the density has gone up considerably.
> coincidentally at the same time it became difficult to continue to reduce transistor channel scale.
Frequencies did not go up, but transistors shrunk, I'm not sure why you keep trying to state otherwise. How do you explain the enormous rise in transistor count and process shrinkage over the last decade? You are literally stating something that is false and not even backing up what you are saying with any information at all.
I've no wish to be so adversarial with you, but there are clearly some misunderstandings between us so I'll try to answer your questions. but then I'm done.
As I have already admitted in the sibling thread, it is not as simple as transistor scaling stopping outright, I was clearly _wrong_ to suggest that... but it's also untrue to suggest transistor scaling has not stopped in any way - this is essentially the point I am still trying to make for you: Features are getting stuck due to various fundamental limits, and we no longer get the same benefits as a result, and this all started at the same time we stopped getting significant speed improvements (the breakdown of Dennard scaling).
>> rate of transistor count
> That doesn't really make sense. It was actually about transistor count and cost.
Yes, but that is a by-product. It is fundamentally about the exponential growth rate of transistor count per unit area, which is only achieved sustainably (until fundamental limits) via transistor scaling. When that scaling is uniform we get not only reduced cost per transistor but also higher speeds:
> Moore's law is the observation that the number of transistors in a dense integrated circuit (IC) doubles about every two years. [0]
--
>> faster clock at the same power density
> That's Dennard scaling as someone already mentioned to you
Dennard scaling is a formalization of what Moore already observed in the very same paper a decade prior on page 3 under "Heat problem":
> shrinking dimensions on an integrated structure makes it possible to operate the structure at higher speed for the same power per unit area. [1]
This is the context of Moore's law, the only mechanism at the time that he was alluding to, uniform transistor scaling with all the benefits (including what is known as Dennard scaling today). This context is commonly lost by people who quote it today.
--
> You said:
>> Transistor scaling (dennard scaling)
> Those are two different things.
Yes they are, as I already admitted in the sibling subthread; I was technically incorrect to mix them. Nevertheless they are closely related: Dennard scaling has always been tied to transistor scale, and it broke down at the same time uniform transistor scaling, aka "classic transistor scaling", stopped.
--
> You also said:
>> We haven't been making significant progress over the last decade in terms of process size,
> This is false by any reasonable definition, since as has already been said by multiple people, transistors are a fraction of the size they were ten years ago and the density has gone up considerably.
I've already admitted this is inaccurate in the sibling thread. You can read my response there. However progress has been stifled to say the least.
--
>> coincidentally at the same time it became difficult to continue to reduce transistor channel scale.
> Frequencies did not go up, but transistors shrunk, I'm not sure why you keep trying to state otherwise. How do you explain the enormous rise in transistor count and process shrinkage over the last decade? You are literally stating something that is false and not even backing up what you are saying with any information at all.
I am not disputing that transistors have shrunk, but not all features of transistors have shrunk at the same rate. I'll add emphasis: channel lengths have become more difficult to reduce in scale.
Here's a random source I found:
> when we approach the direct source-drain tunneling limit, we could move to recessed channel devices and use channel lengths longer than the minimum feature size. This could allow us to continue miniaturization and increase component density. [2]
i.e channel lengths will _not_ be 5nm
Densities continue to increase while transistors can no longer be uniformly shrunk, in the same way a square can be made into a rectangle and have a smaller area while not reducing the maximum edge length. However it's intuitive to see that while this will continue to increase density, it will not necessarily increase speed - and it will not be long until we hit limits on scaling the other features.
We make ICs and target a 20-year lifespan. A lot of cost and over-design goes into achieving that MTBF, especially when the device's typical usable lifespan is much shorter. An ECU should last 30 years, but a GPU probably not. Will you be gaming on a 30-year-old card?
It would suck if you own a special console/electronic device and it just expires. The number of people affected is not large, but it will still suck. Also, these days TVs and other electronics have some computer inside, and this "expiration" would affect the second-hand market and increase e-waste.
I've seen some YouTube videos where people buy old consoles, clean them up, maybe replace some parts, and then they can use them. I come from a poor family and my first 4 computers were old second-hand ones. That's a double benefit: the computers/laptops did not become e-waste, and a poor family/person can access expensive tech that, in my case, helped me find a job and finally afford buying new stuff for me and my child.
The big issue is when the mini computer inside your device is a small part of the device. For example, my parents' TV now shows a red message in the corner all the time because the software fails to validate something; the solution appears to be to have it re-flashed with updated firmware to remove those checks, or to replace the shitty non-essential component that is not used at all. So I am afraid that in the future you will need to replace/repair your TV/fridge/car because a chip in it, like the memory, is one of the ones that "expire" (we had the case where Tesla cars' SSDs would go bad because of extreme logging).
It's great that you can repair old consoles / tech, but I seriously doubt they were designed with durability in mind, and by that I mean I doubt they sacrificed performance for repairability. I believe that if they could have made it 10% faster knowing it would break in 5 years instead of 10, they would have made it faster.
Yes, I agree. They used existing components/manufacturing, so most of the time they inherited durability. Now imagine a chip that can consume itself: an infinite loop bug, some logging going wrong, or maybe intentional sabotage could destroy your stuff with an update.
They are still OK; IC aging only occurs when it's active. So a chip designed for 10 years expires after 10 years of use. If you occasionally plug in that old console, it will be fine. It's likely other things will expire first in the system, like the power supply, or something mechanical if there's a DVD drive, etc.
You won't use a 2020 card for 30 years, but what about a 2100 card? By that point, computing hardware will be a fully matured technology, and you certainly won't be seeing the yearly performance increases we've seen for the last fifty years. In a world with no improvement, you can expect customer preferences will move to durability.
Rendering is a textbook example of implicit parallelism, so scaling with transistor count will keep going even if CPU single-core performance and synchronisation bottlenecks slow down progress.
The microcontrollers in an ECU are built on far, far different processes to modern CPUs and GPUs. Microcontrollers are usually >90nm, and if designing for robustness probably a lot larger than that. These processes are optimised for many different things, but speed is usually lower on the list than cost, power consumption, and even lifetime (which the automotive industry are still extremely keen on). And given the strong (generally exponential) dependence such aging processes have on temperature, time spent in a low power mode while the engine isn't running as opposed to off is basically irrelevant as far as lifetime is concerned.
This affects servers, personal computers, and phones. That's pretty much it. <10 nm is only used for the applications most demanding of computer performance.
You know what the limiting factor on your phone and laptop's lifespans are? Because it's certainly not the CPU.
Even for the relatively narrow applications in which aging will have noticeable impacts, I doubt it will be a big deal. >90% of people will get a new laptop, phone, or PC long before they see clock reduction due to aging. There are plenty of people who are still on CPUs as old as sandy bridge, but even that is inflated due to the awkward phase where it seemed like parallel utilization would never improve. 15-20 years from now the number of devices using CPUs from the 2020s will be at least as small as the number of people currently using devices from 2010. Not that those devices aren't important, but "we're going to have very short-lived electronics" is just not true (or at minimum, it's already true).
On top of that aging in future processors will be a gradual reduction in performance and efficiency, not sudden failure, and it'll still take many years. Design issues are the real concern here; something like underspec'd AVX instructions that suddenly see massive increases in utilization. Aside from that CPUs will not be noticeably different, and the vast majority of consumer electronics and infrastructure will not even be on a relevant technology.
The short-lived electronics could be passable for the manufacturers if it was just about that (hey, obsolescence with plausible deniability for free!)
But I think their real problem will be that this will cause even more chips to be discarded during quality assurance, and that cost will bleed into electronics prices. If we're looking at 20% or 30% more expensive electronics _beyond_ the pricing warranted by the performance improvements themselves (otherwise it eats into their profit margins, R&D costs etc.), will consumers keep purchasing these without batting an eye?
But a sense of scales is also necessary here. If a modern CPU lasts for 15 years and this might be a reduction to 8-10 years, I think many will swallow that. If we're however talking reducing lifespans from 10 years to four, well then we are going to see complaints.
> Can the industry even make 30-year parts any more?
Absolutely. Will they? Absolutely not.
Planned obsolescence is not a conspiracy theory, it's a widely popular business strategy.
The spare-parts business is an additional revenue stream for car manufacturers. They have fought hard and dirty trying to monopolize it, lobbying for laws that are harmful to the public.
This seems to be a fundamental economics issue: Things are only valued when they are sold. It's much more profitable for companies to sell me the same thing over and over again every couple years, than to sell me something repairable that will last for decades. Most measures, like GDP, don't capture the inefficiency of this, because they don't capture the value of the things in my house that don't need to be replaced, they only capture the value of the new stuff I buy when the old crap breaks. I think this has many detrimental side effects: There's the opportunity cost, because the money I spend on planned obsolescence could have been spent on other products that added more value for me and the world. It wastes natural resources with more stuff ending up in landfills and CO2 ending up in the atmosphere to manufacture and ship the replacements. And finally it helps reinforce the need to work stressful jobs to have enough money to keep buying these things that keep breaking - it's harder to have 'enough' money. How do we as a society encourage products that last?
> How do we as a society encourage products that last?
People are attracted to novelty. While you and I might value the old and functional things that we have and use every day, there is a substantial number of people who will get rid of things like toasters, food mixers, etc., simply because they don't match the colour scheme of the kitchen; the suppliers only care about those people, not us, because we don't make any money for them.
If you want products that last you will somehow have to moderate the human desire for novelty. I think that this could be done but it would mean dismantling our educational systems and replacing them with education in the old fashioned sense of producing people who can think and analyse, people who have a sense of history and an understanding of how the world came to be the way it is.
But what use are such people to a society based on, not merely conspicuous, but also excessive consumption?
I don't think people are as attracted to novelty as you think. Sure, everyone has a gadget (or garment) or three that they buy just for novelty, but the vast majority of household items and appliances are bought just to fulfill their purpose. And they won't get replaced unless they break, deteriorate, or reveal design flaws or something else that makes them unfit for the purpose.
What most people do care about is price. And what most suppliers care about is.. yeah, it's a race to the bottom.
I would assume that people would also care about quality and durability if it were something they were informed about. Like, if you're buying a toaster, one sells for $30 and the label says that will break in two years.. the other sells for $40 and the label says it will last for five years or more. I'm pretty sure most people would pick the latter, unless they're exceptionally poor or there's some other major aspect of the design or functionality that draws them to the cheaper option.
Of course, this is not the world we live in, and toasters in $25 to $100 range can last a while or not. Quality or durability is not on the label, and price is not an indication of quality. The trend seems to be that lots of "race to the bottom" companies fill their lineup with premium priced products that are made of the same crap quality as their bottom tier, but they have some silly gimmick (bluetooth in a toothbrush? goodness gracious).
> Like, if you're buying a toaster, one sells for $30 and the label says that will break in two years.. the other sells for $40 and the label says it will last for five years or more.
There's a proxy for that: the manufacturer warranty period.
My current laptop has a 5 year warranty with on-site repair. I chose it over a cheaper laptop from the same manufacturer with only a 3 year warranty with on-site repair. Part of the reason I chose the more expensive model is that the longer warranty means I won't have to replace it for another 5 years.
FWIW I think you are right, we are trained and raised by society to buy and buy all the time.
I know people who think I'm odd for having a Nokia 6 which is nearing 3 years old (but has current Android) and no intention of replacing it - it's partially the circles I move in via work but everyone earns decent money and replaces their phone every year when the new models come out.
More generally I don't replace something unless it's uneconomic to repair, I'm rocking my 2012 road bike, I just stripped and rebuilt it for about 50 quid - I've friends who buy a new one every 18mths.
I like fixing things maybe that is the difference.
> now they are made and branded to be used for less than 2 years.
Happily using a 5 year old iPhone, which is still receiving software updates. Just because companies release new phones doesn't make your old one obsolete.
You don't actually have to play that game. I'm still using a dumb phone and perfectly happy with it. Doesn't distract as much as a smartphone and much better battery life.
Phones never really lasted 6 years. A phone being unusable after a few years is usually a matter of battery wearing out combined with super slow and bloated apps and web pages.
I would expect the problems to become both exponentially harder and exponentially more expensive to solve as we scale down. But vice versa, using the process knowledge we gain from 5 nm chips, we might improve the lifetime of 7 nm chips by orders of magnitude. And again, using the 4/3/2 nm process knowledge we might obtain very durable 5 nm chips in a few years.
> But vice versa, using the process and knowledge we gain from 5 nm chips might improve the lifetime of 7 nm chips by orders of magnitude
Probably not. The problems at each scale are very different and tend not to have much overlap. Process limitations also exist regardless- say you figure out that patterning transistors at 3 nm increases lifetime in a way that is applicable to 7 nm. 7 nm still can't take advantage of it as long as it requires 3 nm scale patterning, if there is even overlapping technology. Even if it requires adding a new step, say a wafer treatment that is necessary at 3 nm but optional at 7 nm, the economic benefit from increased lifespan is probably not worth retooling the old equipment.
Is there much of a market for medium-detail processes? We have the desktop/server/workstation/smartphone market using the very smallest detail processes, with maybe some categories using the previous-generation node. Then we have the embedded market which is everything from cheap-as-dirt 350nm microcontrollers to 28nm ARM chips. Nobody really wants the five-years-ago chips, the currently 18-22nm node. They're too expensive to buy by the million and not shiny enough to compete against the newest stuff.
There is plenty of market for nodes in between 28nm and 12nm. It is simply a natural progression of cost, features, performance and economics of chips, especially for the last few nodes before you move up to EUV.
The A10 / T2 used in many of the Apple appliances are 16nm, along with dozens of WiFi, modem, Ethernet controller, ASIC / FPGA parts, etc. These are easily 100M units a year. As long as the cost-benefit fits their volume they will move to the next node.
I've always wanted to know whether server CPUs have lower base/boost frequencies than HEDT CPUs precisely because they are designed to work 24/7 at 100% load for a decade, while consumer-grade CPUs have looser reliability requirements and would fail in that regime at a high rate (even without overclocking), so they are allowed higher frequencies provided it's not 24/7 100% load.
The conventional explanation is that server CPUs are designed for power efficiency and higher frequencies are wasteful/inefficient. But I wonder if reliability is also a factor, since there are applications for high single-threaded performance, yet even special high-frequency server CPUs never come close to consumer-grade ones.
“For example, microprocessor degradation may lead to lower performance, necessitating a slowdown, but not necessary failures. In mission-critical AI applications, such as ADAS, a sensor degradation may directly lead to AI failures and hence system failure.”
This is probably a superficial analogy, but this made me think of people suffering from dementia in old age.
Dementia is actually a decent metaphor for what happens to an aging chip.
Internally, a chip is extremely dependent on gate timings. As the chip decays, certain gates or wires will start to slow down or speed up, and the chip gets sloppier.
Often, you can address the issue by slowing the chip clock rate down, because this gives you a much wider margin for error on your gate timings.
Certain operations will be impacted sooner and more heavily than others. Eventually, the timings get bad enough that certain operations (or even the whole chip) just break altogether.
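A toy model of that last point, with invented delay numbers: if aging stretches the slowest (critical) path by some percentage, the highest stable clock drops by roughly the same factor.

    # Toy model of clock-rate headroom versus path-delay degradation.
    # The baseline delay, margin, and degradation figures are made up for illustration.

    def max_frequency_ghz(critical_path_ps, margin_ps=20):
        """Highest clock where the slowest path still settles before the next edge."""
        return 1000.0 / (critical_path_ps + margin_ps)

    fresh = max_frequency_ghz(230)          # new chip: 230 ps critical path -> ~4.0 GHz
    aged  = max_frequency_ghz(230 * 1.10)   # 10% slower transistors -> ~3.7 GHz

    print(f"fresh: {fresh:.2f} GHz, aged: {aged:.2f} GHz")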
The dream of cellphone manufacturers: the phone dies beyond repair right after the 2-year contract. What could be better!?
Edit: my wife still uses an iPhone 6, and I recently built a dedicated Linux machine with my 8-year-old FX-8350 CPU and its ancient mainboard. The world of disposable electronics is closer and closer. I bet the exact lifetime can be precisely simulated with software from the companies mentioned in the article.
> I also like my old car and I'll keep it as long as I can but I'm aware a new one would be much more fuel efficient.
From what I see of comparable-model cars, efficiency gains in engines, aerodynamics, etc. have mostly been offset by safety, emissions, and QoL improvements that have increased weight. For instance, by spec, the most fuel-efficient Corolla was a 1984 model.
Do note the note at the bottom of the table: "Note: the EPA tweaked their testing procedure, starting with the 2008 model year, with the end result being that the 2008 MPG estimates are now lower than previous years"
Other compact cars have followed a similar trend where they've gotten much heavier and safer, but fuel economy (in terms of fuel per distance) peaked or stagnated.
That's only really relevant for 24/7 operations with considerable load. For home computers and probably even servers with very little load the price of a new CPU won't really offset lower electricity cost.
Depends on whether you switch them on and off to follow load. Newer chips/systems have better idle characteristics, so if the system is always on, it's probably worth it to upgrade for that reason alone.
Talking about cars: my first car was the most economical one; the second one consumed more because of the catalytic converter. The third one is a car for fun, not for commuting, with massive fuel consumption.
The old FX-8350 was available for this one-shot Linux project. Normally a virtual machine is OK. I had lots of thoughts about what to do with the good old hardware when upgrading to Ryzen, and I delayed it for more than a year. Luckily it was solved this fortunate way and I do not need to throw working things away.
Do they? I've never experienced that so early. Most mid-range phones are so slow after two years that they can't handle modern apps, which is the most common reason people I know buy new phones. Higher-spec smartphones tend to last much longer and can fetch good prices on eBay after two years.
The world of disposable electronics is definitely something we would be moving away from if we were smart enough to cooperate a bit as a species. We should be managing complete product life cycles for the best efficiency by way of recycling, reuse, durable designs, etc.
They don't need to do this, they already just put out an update that makes your existing phone too slow and buggy to use the week they release a new one.
I would gladly pay 100k+ for a single 10 GHz processor with a comparable bus speed / RAM combo. Got a single-threaded legacy DB that runs way too many things for a large company. Upgrading to a new system is going to cost a million+. Running a liquid-cooled 5 GHz CPU on a gaming rig right now with an identical box on standby in case of H/W issues. Could easily justify spending 200k every year if it doubled the DB performance.
A cryogenic cooling system that could let a CPU run at 7 GHz would be doable for about $100K.
But more realistically, there are often a long list of other things that can be done to a legacy DB that have a much bigger bang-for-buck, and are also lower risk.
Try NVMe storage. Databases love low-latency storage.
Are you virtualising this in any way? Don't.
Use a better NIC to cut down latency. Think 200 Gbps Mellanox cards with jumbo frames instead of the built-in Broadcom chip that probably cost $1.50.
Try co-locating the app(s) on the same box with the database to really cut the latency. This works great with modern many-core CPUs like an AMD EPYC. You can often dump everything onto one machine and your latency will go from 1ms to 10μs. That's a 100-fold difference!
Turn off the CPU vulnerability mitigations. This is safe as long as nothing else runs on the same hardware. Boosts some databases up to 50%.
Pin the process threads to specific CPU cores (see the sketch after this list). Some newer Intel Xeons have "preferred cores" that will turbo boost higher than any other core.
Upgrade to the latest CPU to get more instructions per clock. This can have surprisingly beneficial effects.
Or... I dunno... fix the database. Does it even use indexes?
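For the pinning item above, a minimal sketch of what that looks like from Python on Linux (equivalent to `taskset` on the command line); the core numbers are arbitrary, and on a real box you'd pick the vendor-reported preferred cores and keep other work off them:

    import os

    # Linux-only: pin this process (pid 0 = self) to two specific cores so the
    # scheduler stops migrating it; which cores are "preferred" is machine-specific.
    os.sched_setaffinity(0, {2, 3})
    print(os.sched_getaffinity(0))  # -> {2, 3}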
Wouldn’t it cost less than $100k to fix the project to use any commodity SQL database (presumably over ODBC)?
Or are we talking about an unsalvageable 4GL system that had its last update 15 years ago that does everything (storage engine, OLTP, forms UI framework, security, and reports)?
This raises quite a few interesting questions and tradeoffs - how do you handle upgrades (i.e. yearly upgrade to latest/fastest CPU) without taking down the standby box? Have 3 boxes?
How about instant handoff? I guess that would require running the second box as a slave, so both boxes are running harder than they need to...
Does running one of the old FX-8350/70 at 7+ GHz have any benefit, or is it offset by the per-tick improvements of newer CPUs?
It's weird for me to think of heavily overclocking stuff in a non-gaming workload... stuff like chilled coolant and instability suddenly has completely different factors to consider.
I'd suppose it doesn't need zero downtime outside of some well-defined business hours. Actually most things don't, and I think they would even benefit from a planned monthly 5-hour outage instead of an unplanned 10-hour one once per year.
This. In terms of single core / thread performance we have reached the end of the S-curve. We haven't had much IPC improvement since Skylake in 2015. We have reached the max clock speed achievable with normal cooling.
Although current rumours suggest 7nm Ocean Cove coming in 2022 with an 85% increase in IPC compared to Skylake. I wish we could also make a performance node that pushes past 6 GHz.
Well, there is 25% from Skylake to Willow Cove, so you are looking at a 20% increase from Willow Cove to Golden Cove, and another 20% from Golden Cove to Ocean Cove.
Not entirely impossible, given that some of these improvements have been sitting within Intel for years due to the delay of 10nm.
I can make an ALU in an FPGA that only takes in one bit data and adds it. That would take the IPC crown. Is it a useful statement? No. Comparing ARM IPC to x86 IPC is far more complicated than measuring length.
See SPECint2006 in [1] (ctrl-f “absolute performance”). SPECfp is maybe more like a 60% IPC lead, but less relevant to typical users.
I've taken some measurements of the A13's microarchitecture (mostly, its ROB size), and Apple has a bigger lead here than you'd expect. I don't want to share his data without permission, but the bulk of the code is at [2] and I'd be happy to help anyone able to run code on an iDevice repeat the same measurements.
Could the next computer virus be one which repeatedly adds 1111111 and 1111111 and wears out the wires that do the carries, since they were only designed for typical adding, not worst-case carrying on every operation?
If CPUs become a spare part, computing devices should be designed to allow fast repair or even hot-swap. Similar to HDDs, one could maintain a constant supply of CPUs in store, allowing worn-out parts to be changed fast without devices being rebooted or service interrupted, anytime. This would solve all the longevity issues while supporting sales, at the expense of extra work/labour. Win-win.
I doubt it. It's way more likely that the computers get more disposable and we replace everything. If for no other reason, because manufacturers won't see a point in making the rest of the electronics outlast the CPU.
Mainframes? Well, maybe, but in a bit of a different way.
Motherboards and cables do not wear out, and bridges and network interfaces most likely don't either. CPUs and RAM do. Consider a symmetric multiprocessor system with many multi-core CPU and RAM units (maybe even fused together in one IC) that are susceptible to wear-out. The operating system maintains a set of depletion counters for each unit. When a counter reaches some threshold, the system automatically (or by request) takes such units out of operation and reports it. An operator walks through the server rooms daily with a bunch of new units and replaces them. Only depleted silicon is replaced. It can be easily and safely disposed of, or even recycled and reused. Green. Efficient. High power. Low cost. Constant sales. What else could we dream of? ;-)
I think you're better off having a spare machine and your data in the cloud than to have 2 different CPUs on the same machine. It would massively increase the price and complexity, reduce reliability, etc. It's 2 machines in one box somehow. In the end your machine will not serve its purpose with the backup CPU or you would have bought one with that as a main CPU from the start.
We're going in the direction where the hardware is a commodity to replace as you see fit and the data stays somewhere safe to be used regardless of device. Phone, tablet, PC, console, etc. can consume the same data.
It's not ideal because I'd like hardware to be reliable, not something I can expect to die on me when I need it most but it's pretty clear we're going that way outside of niches, with most devices being exceedingly hard or impossible to upgrade or repair.
My way of thinking was about high-load servers mostly, whereas your idea works well for personal computing. In both cases the whole computing system should be designed to allow easy replacement of depleted silicon. Nowadays electronics are designed in such a way that maintenance is either extremely difficult (for an average consumer) or not possible at all, both technically and legally.
I think the usual reply to that question is that it depends on which transistor.
Not quite the same thing, but there was a Raspberry Pi board on the homepage earlier today which had been hacksawed in half and still worked. The person who did that has also cut some microprocessors in half successfully, and they still work because he was cutting off bits he does not plan on using and which are not required for the rest of the device or chip to function.
I am sure an AMD Ryzen CPU would work without a core or two. In fact they often disable cores before shipping by zapping a fuse. But if the same transistor on every core somehow blew, then you would probably be left with a dead CPU.
To be fair, that guy who cut the RPi merely cut off the USB ports, RJ45 jack, and the ethernet controller and maybe a couple caps. I don't think chopping off a couple of low pin count peripherals far away from the SoC and DRAM counts as "cutting the board in half"
I did say it wasn't exactly the same thing but at least with some previous *Lake Intel CPUs the gfx took up quite a lot of the die. If you are using an external GPU quite a lot of transistors could fail and you could still use the CPU, just like hacksawing off the ethernet chip.
All on chip caches these days have ECC, so the literal answer to your question - if it's a single transistor, it would likely be fine.
However electronics doesn't really fail like that. A single transistor might be "zapped" by a cosmic ray, but that's a transient error. Electromigration causes the copper interconnects between parts of the circuit to break (https://en.wikipedia.org/wiki/Electromigration#Practical_imp...), especially parts that carry higher current for power distribution around the chip. I had an Intel C2000 fail in a server after 3 years because of this (https://www.theregister.com/2017/02/06/cisco_intel_decline_t...).
Synology were actually great about it. The server was indeed a Synology 8 bay NAS from ~2016, and even though it failed after about 3½ years they replaced it with a refurbished one (which I couldn't tell the difference from new) within a few days. I didn't pay a penny. I hope they're charging all their costs back to Intel.
My thoughts exactly. If a CPU could self-test reliably, and turn off parts of cache, cores, bus lines, etc. I don't think random failures would be a major problem.
Of course it might be too expensive to design such a feature.
So what? I look at the MTBF/TBW(terabytes written) of solid state disks and shrug.
From a very zoomed-out view (and a layman's at that!) the whole semiconductor industry seems like system gastronomy. While the few main players equal McDonald's and Burger King, there is only so much you can do with similar equipment arranged in the same ways. While I'm sure the two could produce the same things if given access to the same ingredients, they aren't allowed to. Same for Coke vs. Pepsi.
Anyways, if you want to have it different, then you either need different systems arranged in different ways processing different ingredients, or you are stuck wailing "oy vey!"
I'd rather prefer some cheering for alternatives like
Some Intel Atom processors die after sending more than a few terabytes over USB over their lifetime. You can easily kill a laptop by leaving the webcam on for a few weeks, and then magically all USB stops working and there is no fix other than soldering on a new CPU.
The USB issues were related to a critical flaw in the LPC clock, according to the Intel errata. USB expected lifetime traffic for the affected processors was 50 TB while active at most 10% of the time. The errata implies lower voltage systems aren't affected.
To me that says simple design flaw. Something like overdriving a transistor to get more performance out of it, without realizing what relied on it. That will cause slightly different failure conditions from electromigration.
Possible, though Atom had its share of non-aging-related problems too. Ugly race conditions, etc., though I believe the older Atoms were immune to the Meltdown & Spectre vulnerabilities.
I do similar reliability work for power electronics (first job after BSEE). I wouldn’t have expected lifetimes at the working stress to be a concern, but I had not fully appreciated how large the E-fields were getting in these devices (about 4x larger than a 1 kVrms isolator).
The article mentions in passing "high elevation (data servers in Mexico City)", but doesn't say why. Low air pressure because that makes cooling systems less efficient?
I kinda suspect more ambient radiation. Less atmosphere to catch stray particles, and the smaller gates are more susceptible. Smaller interactions cause random errors.
Anyway, far far outside the scope of my expertise. But that's my guess.
Could that actually affect the life of the chip though? It's annoying to have your system crash, but that isn't a longevity problem by itself. Does a stray ray cause the gate to wear prematurely?
yeah. I've done a little bit (tiny, inconsequential, not an expert) with satellites.
When cosmic rays have enough energy they'll move atoms around. That's bad for gates. A 486, with old big gates, doesn't get hurt much; maybe you have to reboot every few days. Tiny little 2-atom gates are more delicate. Think plastic vs. glass glassware: with plastic you can have a few incidents and it'll still be a glass; with glass, it's a lot easier to chip an edge or shatter and not be usable anymore.
Again, not my area. But I suspect altitude is really bad for tiny systems because they lose so much "free" shielding.
I think you're right. I remember reading a 'bit flips can and do happen' war story about server crashes that only occurred at a client located at a high elevation.
NOAA has a supercomputing center in Boulder, CO that has produced some analysis of this, but if memory serves they also have to deal with some low intensity radiation from the Flatirons formation they're up by. Get enough computers with enough RAM and all bit flip sources start counting.
I keep getting the feeling that we’re reaching the end of the line for our current CPU technology, and that new fundamental research is needed if we wish to continue improving.
Even then physics has some real limits. If 14nm is ~1,000 atoms wide you could at most double chip density 20 times. But that’s really optimistic, physical limitations rather than design or manufacturing ones are likely well before then. Such as what’s the resistance of a wire 1 atom wide?
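Taking the ~1,000-atoms-wide premise at face value, the "20 times" figure falls out of the fact that each density doubling only shrinks linear dimensions by a factor of √2; a quick check:

    import math

    atoms_wide = 1000          # the hand-waved width of a 14nm feature above
    # Each density doubling shrinks linear feature size by sqrt(2),
    # so the number of doublings until features are ~1 atom wide is:
    n = 2 * math.log2(atoms_wide)
    print(round(n))            # ~20 doublings before hitting single atoms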
14 nm in actual distance is only ~50 silicon atoms wide (roughly 2.8 angstroms per atom, i.e. a ~1.4 angstrom atomic radius). 14 nm process finFETs have a fin width of ~8 nm, and in general 14 nm process transistors have a gate length of ~20 nm.
> Such as what’s the resistance of a wire 1 atom wide?
Quite high; you also get a lot of leakage since electrons are basically scattering elastically all the time. You can't use copper for a wire like this, you need special low-scattering conductors.
Unless I am missing something, the general transistor density for 14nm is vastly worse than that.
If an A9 has 2 billion transistors on a 96 mm² chip, that's ~45,000 transistors in a row ≈ 10 mm = 10,000,000 nm, or ~35,000,000 atoms. That's 1 transistor per ~777×777 atoms, except that's across multiple layers, so hand-wave ~1,000 atoms.
Not least because transistors are not nice neat npn regions. They have multiple gates, all different gate sizes, and any number of inputs, outputs and regions.
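Here's that back-of-the-envelope estimate written out, using the same hand-waved ~0.29 nm atom spacing and square-die, square-pitch assumptions as above:

    import math

    transistors   = 2e9          # Apple A9, roughly
    die_area_mm2  = 96
    atom_spacing  = 0.286e-6     # mm, i.e. ~0.29 nm per silicon atom (hand-wave)

    die_side_mm     = math.sqrt(die_area_mm2)        # ~9.8 mm
    transistors_row = math.sqrt(transistors)         # ~44,700 in a row
    pitch_mm        = die_side_mm / transistors_row  # ~2.2e-4 mm per transistor
    atoms_per_pitch = pitch_mm / atom_spacing        # ~770 atoms per transistor pitch

    print(round(atoms_per_pitch))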
Intel manages to cram 20 million SRAM cells per mm^2 with 14nm; each cell has 6 transistors. That's three times higher than their reported density of 45 million per mm^2. More to the point, the transistor density really isn't that important. For one thing there are three regions and four terminals in every transistor, so it doesn't make much sense to collapse all that to a single atom.
It also doesn't make much sense because that wouldn't offer much benefit: those regions and the space between transistors are pretty minor issues compared to the increased switching efficiency from shrinking the gate, which is the truly important part and the limiting feature. Electricity moves at a significant fraction of c, which moves 30 millimeters every clock @10 GHz. Enough to completely cross a CPU multiple times, which it should never need to do.
It's weird that I've read so many people say "your computer isn't slow because of transistors degrading, but because of other thing like software/driver/OS stuff".
As long as you mean "slower than it was", that statement holds mostly true. Your CPU, RAM and GPU should perform the same on day 1 and day 10000 as long as they are still functional. Any "degradation" won't make them slower, just non-functional.
The complication in this comes from the SSD, where the flash cells have a feedback loop for operations such as erasing that can take longer as these cells degrade.
CPU have error correction, which will mitigate transistor aging and make the CPU work slowly instead of not at all.
It will not "perform the same". At some point there is a noticeable slowdown, and even though Wirth's law is at work, it's not the entire story. Heat will also make any chip age faster.
This article talks about aging under 5nm, but aging is already an issue above 5nm. Read the article.
Someone else has addressed your other points, but for ECC:
This will not detect an error in computation, only a bit flip in data. The current way to mitigate computation errors is to have two processors to detect an error in computation, or more to do a voting system if mere detection is not good enough. Since you cannot detect a computation error with a single CPU (without overhead somewhere, and thus lower performance), you can’t slow down to fix it.
The ECC systems I have worked with can fix a 1 bit error and detect two or more bit errors. They do this by using an algorithm to convert the original data in to an output that is bigger than the original (e.g. 64 bytes is now 72 bytes). This output data does not make sense until passed through the reversing algorithm. So basically the overhead is zero since the memory controller is running the algorithm in hardware every time anyway, so no slow down.
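For the curious, here's a toy single-error-correct / double-error-detect (SECDED) Hamming code over one data byte; it's the same idea as the 64-data-bit + 8-check-bit layout the parent describes, just scaled down, and purely an illustrative sketch rather than how any particular memory controller implements it:

    # Toy SECDED: extended Hamming code over 8 data bits (13-bit codeword).

    def encode(data: int) -> list:
        """Encode 8 data bits into a 13-bit extended Hamming codeword."""
        assert 0 <= data < 256
        code = [0] * 13                      # positions 1..12; index 0 unused
        data_positions = [3, 5, 6, 7, 9, 10, 11, 12]
        for pos, i in zip(data_positions, range(8)):
            code[pos] = (data >> i) & 1
        for p in (1, 2, 4, 8):               # parity bit p covers positions with bit p set
            code[p] = sum(code[i] for i in range(1, 13) if i & p) % 2
        overall = sum(code[1:]) % 2          # extra parity bit turns SEC into SECDED
        return code[1:] + [overall]          # 13 bits total

    def decode(word: list):
        """Return (status, corrected_data)."""
        code = [0] + word[:12]
        overall = word[12]
        syndrome = 0
        for p in (1, 2, 4, 8):
            if sum(code[i] for i in range(1, 13) if i & p) % 2:
                syndrome |= p
        parity_ok = (sum(code[1:]) % 2) == overall
        if syndrome == 0 and parity_ok:
            status = "ok"
        elif not parity_ok:                  # single-bit error: flip the bad position
            status = "corrected"
            if syndrome:
                code[syndrome] ^= 1
        else:                                # two-bit error: detected, not correctable
            return "uncorrectable", None
        data_positions = [3, 5, 6, 7, 9, 10, 11, 12]
        data = sum(code[pos] << i for i, pos in enumerate(data_positions))
        return status, data

    word = encode(0xA7)
    word[5] ^= 1                             # simulate a single bit flip
    print(decode(word))                      # ('corrected', 167)

The point is that the check bits are generated and verified in hardware on every access, so a correctable single-bit flip costs essentially nothing extra.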
Error correction in CPUs is generally limited to the cache, and its incidence is recorded: if something had failed permanently such that the error correction path was being taken constantly, you would be able to record it.
Absent a mechanism which reduces the clock speed of the CPU when it becomes unstable, there's no reasonable way in which failures in the CPU will result in it running slower. Such a mechanism doesn't generally exist: modern CPUs regulate their clock but only in response to a fixed power and temperature envelope. The recent iphone throttling is the only notable case where anything was done automatically in response to an unstable CPU, and that consisted of applying a tighter envelope if the system reset.
This is reflected in the experiences of those who run older hardware with contemporary software: it generally still works just fine at the speed that it used to.
It may be necessary for the micro to run slower in order to be stable, but to my knowledge no system for making that adjustment automatically exists in the vast majority of systems. The main problem being it's hard to detect. How do you tell if the CPU is on the margin of failing without a huge amount of extra circuitry? It can be hard enough to detect that it has had a fault. It's not due to lack of interest: such sensing approaches have been patented before, but don't seem to have made it out of the R&D lab.
CPU technology is quite arcane, very high level, there are so many patents, IP money and a lot of secrecy involved, since CPU tech is quite a strategic one for geopolitical power. Do you work as an engineer at intel, ARM, AMD? On chip design?
> How do you tell if the CPU is on the margin of failing
It's not about failing, it's about error detection. Redundancy is a form of error detection. If several gates disagree on a result, they have to start again what they worked on. That's one simple form of error detection.
CPU never really fail, they just slow down because gates generate more and more errors, requiring recalculation until they finally correct the detected error. An aged chip will just have more and more errors, that will slow it down. Which is the reason why old chip are slower, independently of software.
Although a CPU that is very old will be very slow, or will just crash the computer again and again, so hardware people will just toss the whole thing, since they're not really trained or taught to diagnose whether it's the CPU, the RAM, the capacitors, the GPU, the motherboard, etc. In general they will tell their customers "it's not compatible with new software anymore". In the end, most CPUs get tossed out anyway.
It's also a matter of planned obsolescence. Maintaining sales is vital, so having a product that a limited lifespan is important if manufacturers want to hold the market.
> CPU technology is quite arcane, very high level, there are so many patents, IP money and a lot of secrecy involved, since CPU tech is quite a strategic one for geopolitical power. Do you work as an engineer at intel, ARM, AMD? On chip design?
If such a mechanism existed it would be documented at least at a high level and its effects would be observable under controlled tests. Neither is the case, in contrast to the power and temperature envelopes I mentioned. There is no actual evidence that aged chips operating at the same clock rate perform computation more slowly; your subjective experience that hardware 'slows down' does not count.
> It's not about failing, it's about error detection. Redundancy is a form of error detection. If several gates disagree on a result, they have to start again what they worked on. That's one simple form of error detection.
> CPU never really fail, they just slow down because gates generate more and more errors, requiring recalculation until they finally correct the detected error. An aged chip will just have more and more errors, that will slow it down. Which is the reason why old chip are slower, independently of software.
This is not how consumer CPUs work. It's not even how high-reliability CPUs necessarily work (some work through a high level of redundancy, but they don't generally retry operations automatically when a failure happens: that's a great way of getting stuck). Such redundancy is so incredibly expensive from a power and chip-area point of view that no CPU vendor would be competitive in the market with a CPU which worked like you describe. If a single gate fails in a CPU, the effects can range from unnoticeable to halt-and-catch-fire.
The only error correction which is present is memory based, where errors are more common and ECC can be implemented relatively cheaply compared to error checking computations.
> If such a mechanism existing it would be documented
Why would it? It's an internal functionality, and CPU usually have a 1 year warranty or so, and I'm not sure they really have guaranteed FLOPS, only frequency I guess. If it's tightly coupled to trade secrets, I would not expect this to be documented. I also doubt that you could find everything you want to know in a CPU documentation.
> There is no actual evidence
The wikipedia article I mentioned, physics is enough evidence.
> If a single gate fails in a CPU
I did not say fail, I meant "miscalculated". There is a very low probability of it happening, but it can still happen because of the high quantity of transistors, hence error correction.
> Such redundancy is so incredibly expensive from a power and chip area point of view
Sure it is, so what? At one point all CPU need it and it becomes necessary. There are billions (I think?) of transistors on a CPU.
Documentation is light on details, but both major CPU vendors give extensive documentation on the performance attributes of their processors, such as how many cycles an instruction may take to complete, and none see fit to mention that instructions 'may take an arbitrary amount longer as the CPU ages'. Not to mention, these performance attributes are frequently measured by researchers and engineers, and such an effect as instructions taking more cycles on one sample compared to another from the same batch has yet to be observed (and it's notable and noted when cycle counts do differ, e.g. from different steppings or microcode versions). At least one of the many, many people who investigate this in great detail would have commented on it.
The wikipedia article you linked makes zero mention of redundant gates as a workaround for reliability issues. The only thing close is that designers must consider it, but this is design at the level of the geometry of the chip, not its logic. It doesn't even make good sense as a strategy: the extra cost of redundant logic to work around reliability issues on a smaller node will outweigh the advantages of that node.
One of the greatest things about modern CPUs is how reliably they do work given that you need such a high yield on individual transistors.