SSD will fail at 40k power-on hours (2021) (cisco.com)
723 points by dredmorbius on July 10, 2022 | hide | past | favorite | 254 comments



Possibly related to recent HN issues, see: https://news.ycombinator.com/item?id=32031243


Wow, thanks for sharing. I didn't realize how closely related they were.

(TL;DR for anyone wondering: "recent HN issues" means HN very likely went down yesterday because of this same bug, when two (edit: two pairs, four total) enterprise SSDs with old firmware died after 40,000 hours close together. An admin of HN and its host both like this theory. See details in that thread.)

Edit: If you want to discuss that theory, it's probably better to do it in that other thread directly instead... dang and a person from M5 Hosting (HN's previous host) are both participating there.


Not two SSDs, four: two in the main server, and two in the backup server.


Yowch. The old "stagger your drive replacements, stagger your batches" thing might not be quite as outdated as we'd like to think...


I have definitely seen RAID arrays where the drives were all part of a single manufacturing batch, and multiple drives all failed in rapid succession. I think this can be caused by several things:

- Unless you periodically do full-drive reads, you may silently accumulate bad blocks across multiple drives in an array. When you finally detect a failed drive, you discover that other drives have also been failing for months.

- A full RAID rebuild is a high-stress event that tries to read every disk block on every drive, as rapidly as possible.

- And finally, some drive batches are just dodgy, and it may not take much to push them over. And if identically dodgy drives are all exposed to exactly the same thermal stress and the same I/O operations, then I guess they might fail close together?

Honestly, RAID arrays only buy you so much reliability. Hardware RAID controllers are another single point of failure. I once lost two drives and a RAID controller all together during Christmas, which was not a fun time.

I do like the modern idea of S3-like storage, where data is replicated over several independent machines, and the controlling software can recover from losing entire servers (or even data centers). It's not a perfect match for everything, but it works great for lots of things.


You are spot on with everything, especially RAID controllers.

I used to help manage a large fleet of database servers. We found that blocks could "rot" on the underlying storage, yet if they were read often enough they would be held in memory for months and never re-read from the underlying drive. Until you rebooted!


Yes, bitrot is a huge problem with mechanical hard drives and media that hasn't been read in a long time. What you write to the drive might not be what you read back five years later. That's why ZFS is critical for systems like this: you have checksums for each block of data and can rebuild from parity if there is a mismatch.


100% spot on about RAID. I actually had a second disk fail once while under the load of rebuilding the array after the first disk failure (not related to the SSD issue).


ZFS is the last file system you will ever need


ZFS is probably the best filesystem tech of this generation. The best filesystem tech of the next generation... I suspect Ceph.


Yeah true, if ZFS could span multiple servers it would be perfect


I had no idea anyone thought this was outdated. We certainly never stopped doing it. I think it is a timeless failsafe.


Yeah, Backblaze and DigitalOcean both talk about it a bunch in their sysops stuff.


About 20 years ago I worked for a small storage company. The person who managed the returns of disks from customers was very strongly of the opinion that Seagate drives with odd firmware versions were returned way more often than those with even ones.


That's the kind of superstition that's only brought about by deep trauma.


Back then, working with Seagate drives was a trauma in itself. In 2008/2009, I set up more than 3000 1 TB Barracuda ES drives; 1800 of them failed in the following 3 years (they came with a 5-year warranty). I stopped keeping track of Barracuda failures at some point.

Unsurprisingly, 14 years later I still wouldn't recommend Seagate drives to anyone.


Thinking back, mixing firmware versions on new units was avoided where possible, and we would also try to replace like-for-like firmware on RMAs.


Plus you mitigate getting a bad batch of the same drive. See IBM Deathstars and the Seagate drives.

May have to go check the power-on hours on my drives now; I must have a few nearing that sort of figure.


That's basically sysadmin scripture. Ignore it at your own peril.


Or, update the firmware on your switches, servers, drives, etc on a regular basis.


Thanks for the correction!


The chance of two SSDs failing at the same time under normal circumstances is extremely slim, so this might actually be a plausible cause of this incident.


True. But it's more about the probability of things being "normal", isn't it? I had multiple 970 Evos fail within a very short time. It turned out to be a systematic problem with drives produced in one specific month.

Just how much difference is enough to be safe is the real question...


It seems more likely it was four drives (though dang and Mike both refer to "two" in the earlier thread).

Both primary and failover servers had RAID arrays. I suspect RAID 10 (striped mirror), which would mean two drives would have to fail to take down a single server.

Four drives of the same manufacturer spec and batch would do that.


This is what always worries me about our home server. It's running ZFS with multiple redundant drives but the supplier refused (when I explicitly asked) to supply it with disks known to be from different batches claiming that the odds of multiple failures close together were negligible. Obviously we have backups as well but the time and cost to restore a full server from online backups can be significant.


Within an earlier thread "Tell HN: HN Moved from M5 to AWS", there's an excellent comment by loxias about risk diversification across multiple factors. Well worth reading:

https://news.ycombinator.com/item?id=32031655

I've increasingly come to view systems operations / SRE as a risk management exercise, where the goal is to reduce the odds of a catastrophic failure. Total system outage is one level, unrecoverable total system outage is even worse.

Having multiple redundant backups / storage systems, in different locations, with different vendor hardware / stacks, all helps reduce risk of a single-factor outage. Though complexity risk is its own issue.


> I've increasingly come to view systems operations / SRE as a risk management exercise, where the goal is to reduce the odds of a catastrophic failure.

s/reduce/find an appropriate level for/

It's a common misconception that risk management and risk reduction are synonyms. Risk management is about finding the right level of risk given external factors. Sometimes that means maintaining the current level of risk or even increasing it in favour of other properties.


My point is that risk is central to systems management. If you look at earlier standard texts on the subject, e.g., Nemeth or Frisch, the concept of risk is all but entirely missing. I have numerous disagreements with Google, but one place where I agree is that the term SRE, site reliability engineer, puts the notion of managing for stability front and centre, and inherently acknowledges the principle of risk. I've since heard from others that this is in fact how the practice is presented and taught there.

Quibbling over whether the proper term is risk management or risk reduction rather spectacularly misses the forest for the trees.


A friend of mine who was once at Google mentioned that in a meeting once their SRE group was told by their senior manager roughly "you've broken stuff in $time_period far less than your outage budget, and that probably means you should've been rolling out features faster".

Risk -levels- are a choice and also a trade-off.


Okay, fair enough. To me, the risk involvement is so obvious (to any serious business function!) that I find the management/reduction distinction a more important point. But I can see it your way too.


Thanks.

I don't know how long you've been in the business, but the change seems a relatively recent one, one that wasn't manifestly obvious to me, and one that has pretty much always seemed difficult to communicate to management.

Whether that's because business management is often about ignoring risks or treating it as inconvenient, or if I've just had a long string of bad bosses, I'm not sure.

I did make a point of looking through several of the books that were formative for me (mostly 1990s and 2000s publication dates), and there's little addressing the point. Limoncelli's book on time management for sysadmins was a notable departure from the standard when it came out, in 2008. I'd say that marked the shift toward structured and process-oriented practices.

That was about the time of the transition from "pets" to "cattle" (focus on individual servers vs. groups / farms), but pre-dates the cloud transition.


You know what? You're right again!

The stuff I've read that touches on this idea is almost all from 2006 and onwards, mainly 2010s. The earliest example is a bit of an outlier: Douglas Hubbard's 1985 How to Measure Anything -- but it's also only tangentially related.

The other real exceptions are books on statistics (where the idea of risk management -- at least in my collection -- seems to have gotten popular in the 1950s, probably as a result of the second World war) and financial risk management (which seems to really have taken off in the 1980s, probably in conjunction with options becoming a thing.) Statisticians and finance people (and by extension e.g. poker and bridge players) have known this stuff for a while.

Of course, hydrologists have been doing this stuff since the early 1900s at least, but extreme value theory has always been a kind of niche so I'm not sure I should count that.

----

That said, I did mention it was obvious to me. I still find it hard to convince management and colleagues of its importance...


> The stuff I've read that touches on this idea is almost all from 2006 and onwards, mainly 2010s. The earliest example is a bit of an outlier: Douglas Hubbard's 1985 How to Measure Anything -- but it's also only tangentially related.

I didn’t keep good track of such things but a lot of my early reading in the 70’s was in operations research and decision support systems, mostly sort of what we call operational analytics these days with a big helping of statistical process control too. World War 2 logistics practices and ‘50s and ‘60s “scientific management” fads generated a lot of material, some insightful. Many medium-sized businesses could afford significant R&D then, so you’ll find e.g. furniture factories developing their own computer systems from PCBs to custom ASIC components, just to manage statistical process control and decision support systems.

> That said, I did mention it was obvious to me. I still find it hard to convince management and colleagues of its importance...

I think the reason I keep having to justify this every few years is the tendency towards abstractions in management which try to simplify things into “anecdotal analytics”, e.g. preferring a persuasive narrative over reality...for a good cynical perspective from the ‘50s I recommend C.M. Kornbluth’s “The Marching Morons” (<https://en.wikipedia.org/wiki/The_Marching_Morons>).


> It's a common misconception that risk management and risk reduction are synonyms. Risk management is about finding the right level of risk given external factors. Sometimes that means maintaining the current level of risk or even increasing it in favour of other properties.

What’s funny is I seem to have to explain this to senior management anew every 6-7 years. I know they teach it in management school, but in the real world somehow people fall into the false equivalence when they get promoted. Often they adopt a cartoonish view of things because they can’t get the quantitative signals, and every decision effectively reduces to what I ironically term anecdotal analytics.

I have this amusing heuristic for risk acceptance which I often use to help people approach decisions: you should kick the decision up to someone with higher authority if your signing authority is less than: risk coefficient times quantified exposure, less mitigation cost where mitigation is within signing authority AND/OR budgeted and authorized spend. I like to view mitigation and opportunity cost/benefit in a similar way so I have some idea of equivalences when evaluating tradeoffs.
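For concreteness, a minimal sketch of that heuristic as I read it; the function name, the linear form, and the numbers are mine, not the parent's:

    # Hypothetical reading of the escalation heuristic above; not a standard formula.
    def should_escalate(signing_authority, risk_coefficient, quantified_exposure,
                        mitigation_cost, mitigation_authorized):
        residual = risk_coefficient * quantified_exposure
        # Only credit the mitigation cost if the mitigation itself is within
        # signing authority and/or already budgeted and authorized.
        if mitigation_authorized:
            residual -= mitigation_cost
        return signing_authority < residual

    # e.g. 10% chance of a $500k exposure, $20k authorized mitigation, $25k signing authority:
    should_escalate(25_000, 0.10, 500_000, 20_000, True)   # True -> kick it upstairs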

It’s not original with me, I must have lifted it from some decades-past HBR article or 60’s rant on quantitative business management.

I could rant on various aspects of risk management application all day, thank goodness I've managed to quit before I really got started sharing...in my experience it’s been very helpful when applied in real world engineering implementations.


One of the advantages startups have is higher risk tolerance than branded megacorporations: data, customers, brand, employees, lawsuits, or, cynically, human lives.


The risk-tolerance of startups is illusory.

The risk is being managed at the VC/investor level, by diversifying investment bets over numerous early ventures.

The death of any one of those isn't a concern for the VC, if the portfolio performance is sufficient. Of course, for the individual venture and employees, that risk is disaggregated.

More rigorous systems practices are seen as an impediment to early growth with any potential problems either something that can be ironed out later, or simply a post-liquidation concern that doesn't factor into the investors' interests at all.


The risk tolerance is the result of not potentially taking out $1T of additional value if you f-- up a small project, the way you could at a megacorporation.

At a megacorporation, you can't take risks which would make complete sense for the project you're on if considered independent of the uncapped liability to the mothership.

There are misalignments like the one you describe but that's not what I'm talking about.


> Total system outage is one level, unrecoverable total system outage is even worse.

Ha, I used to suggest people consider “total failure of business” in exposure quantification...


My solution is to use a different manufacturer for each drive in a mirror. The prices are usually pretty similar and you get to make sure that one firmware bug doesn't kill your entire pool.


This is the way.

For even more peace of mind, (and only when you can afford it, obviously) try decoupling your disk purchases a bit from when you're going to need them.

When you see a good price or a sale on a particular disk, grab it and add it to your own personal "prebought disk pool". When it's time to either replace a disk or spin up a whole new array, you now have the benefit of diversification across time.


My current procedure is to keep an external drive for backups, and if any of the drives in my RAID fails, I'll just shuck the external and stick it in the RAID. The advantage is that the drive is already known to be good through running badblocks (which takes like a week), and I don't need to wait a week for the Amazon man to get here. The disadvantage is that I need to recreate my backup from scratch, which loses my version history, or restore it from an online copy, which is slow and cumbersome.


I do the same. Ages ago I had to build a server at work and picked three vendors for the RAID 5. Got funny looks from coworkers; apparently they found the idea super strange. One drive (Seagate, of course) failed after a year, and since we had a matching-size WD lying around, we used that. Now there were two WDs in the setup. Some years later the PSU blew up and killed both WDs; the Toshiba survived.


I decided to employ this tactic when I was setting up my new NAS and needed two drives.

Upside was that I could definitely know they weren't from the same batch.

Downside was that I had to buy a Seagate, and I don't have good experiences with Seagate since my only Seagate drive had died an early death at the tender age of 3. Turns out that this was very much a downside since the Seagate drive died at the tender age of 16 months.


I had a series of WD drives that failed, and I managed to get them all replaced under warranty since they died within 3-5 years. I don't buy WD drives anymore but it wasn't the end of the world since I had spares while I waited for the replacement drives to be shipped.

Anecdotally, I haven't had issues with Seagate but I'm sure it really boils down to which exact drives you're using and what batch they were in.


Pick a manufacturer and you'll be able to find plenty of horror stories.

Some may be worse than others but diversification is the right answer anyway.


But how do you color match your drives in your spiffy NAS?

This is why my bicycle drivetrain should be a frankenstein combination of parts from different manufacturers?

/s


I hate absolutely everything about this comment.

If we're ever both at the same conference show me a link to this and I'll buy the first round.


> But how do you color match your drives in your spiffy NAS?

haha, pimp your ride!


> I had a series of WD drives that failed, and I managed to get them all replaced under warranty since they died within 3-5 years.

I find hard drive warranties to be mostly an illusion. It's better now that full-disk encryption is becoming better supported and potentially available on personal devices, not just corporate ones managed by IT professionals. However, until recently the number of drives I've had in any personal/home system that I would have returned under warranty, instead of securely destroying to prevent the risk of data leakage, was zero. The number of phones I have ever traded in is similarly zero. It's horribly wasteful, but until there are cast-iron guarantees that all the private data we keep on these devices is going to be securely deleted, it's the only sane policy IMHO (apart from never using these devices for anything remotely sensitive in the first place, but that's all but impossible in modern society).


All of my drives have FDE. I wouldn't have shipped the drives if that wasn't the case (also, luckily, the issue was that writes only failed on part of the disk, so I could wipe the LUKS metadata section).


Possibly the supplier was talking of hardware failures.

The issue here (as it was several years ago with the well-known Seagate 7200.11 issue [0]) is not about the odds of multiple (hardware) failures together (which may actually be a very rare case); in this case it is essentially a software failure, a counter that crashes the on-disk operating system (if we can call it that), be it an overflow of the counter or the counter hitting a certain value.

The chances of having almost simultaneous failures are near certainty for drives that are booted the same number of times and have been powered on for the same number of hours, if the affected counters are related to those events.

[0] Some reference:

https://msfn.org/board/topic/128807-the-solution-for-seagate...

>Root Cause

This condition was introduced by a firmware issue that sets the drive event log to an invalid location causing the drive to become inaccessible.

The firmware issue is that the end boundary of the event log circular buffer (320) was set incorrectly. During Event Log initialization, the boundary condition that defines the end of the Event Log is off by one. During power up, if the Event Log counter is at entry 320, or a multiple of (320 + x*256), and if a particular data pattern (dependent on the type of tester used during the drive manufacturing test process) had been present in the reserved-area system tracks when the drive's reserved-area file system was created during manufacturing, firmware will increment the Event Log pointer past the end of the event log data structure. This error is detected and results in an "Assert Failure", which causes the drive to hang as a failsafe measure. When the drive enters failsafe, further updates to the counter become impossible and the condition will remain through subsequent power cycles. The problem only arises if a power cycle initialization occurs when the Event Log is at 320 or some multiple of 256 thereafter. Once a drive is in this state, there is no path to resolve/recover existing failed drives without Seagate technical intervention. For a drive to be susceptible to this issue, it must have both the firmware that contains the issue and have been tested through the specific manufacturing process.
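To make that concrete, here is a toy reconstruction of that kind of boundary off-by-one; the real firmware is not public, so the structure below is invented purely for illustration:

    # Toy model only: the buffer size (320) is from the advisory, everything else is made up.
    EVENT_LOG_SIZE = 320            # valid entries are 0..319

    def advance_event_log(index):
        index += 1
        if index > EVENT_LOG_SIZE:  # buggy boundary check: should be >=, so 320 slips through
            index = 0
        # The firmware's failsafe: an assert failure here hangs the drive.
        assert index < EVENT_LOG_SIZE, "event log pointer past end of buffer"
        return index

    advance_event_log(318)   # fine -> 319
    advance_event_log(319)   # AssertionError: the power-up hang described above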


If you’re willing to wait, you can always order half, wait a month or two, then order the other half.


Unfortunately this particular server was a replacement for another that had failed suddenly so that wasn't really an option in my case. If it had been one of many at work then it would have been a sensible option, I agree.


Repurposing several of the initial drives with later purchases (or exchanges) might be another option.


If it's really 4 drives, bought at the same time, failing simultaneously, that's pretty damning evidence.

Reminds me of my laptop that I bought with 2 SSDs. It's not from HP or Dell, but still, I now wonder if I should replace one of them with a more recent SSD and give the other to my son (he currently has an anemic SSD that's too small to install Genshin Impact on).


Assuming you have the budget for it, a worst-case scenario of "I made my son happy while achieving nothing -else-" doesn't strike me as terrible at all.


It costs money and there's no guarantee it will actually make him happy. It could lead to him playing the game in some dark corner where no one can find him. There are advantages to him having to use the desktop PC.


Especially since one pair was a nearly unused backup server that had a totally different use profile.


If both SSDs are from the same lot number and one fails, the chances of the second failing go up considerably. Both failing at the same time, though, is extremely rare.


We (as an industry) went through this bad batch madness with the IBM DeskStar 75GXP hard drives, which were affectionately referred to as "IBM Deathstar"[1].

It's rare, but it's not _that_ rare. You have to make the effort to understand why it failed.

I had a situation where I deployed Toshiba SLC SSDs (that were purchased over the course of several months) and a piece of software that synchronized to disk frequently, resulting in about 1GB of writes per hour.

After ~11 months in service, most of the drives died in the same 4 week period. We were astounded that everything failed so close to each other, including instances where both drives in a RAID 1 set were toast.

We did extensive troubleshooting between the failed servers and the remaining servers and figured out that write volume (by proxy of in-service date) was the one predictor of failure. Shortly thereafter, wear leveling and TRIM became things we sought out mentions of when spec'ing out hardware.

1: https://en.wikipedia.org/wiki/Deskstar


There was also more recently the case of the Seagate 7200.11, see my previous comment:

https://news.ycombinator.com/item?id=32053477


It usually takes something really bad happening before things get better.

The absolute best mechanical drives available until quite recently can be traced back to the Deathstar. The Deskstar 7K4000 was absolute best in class.

https://en.wikipedia.org/wiki/Deskstar

Hitachi bought IBM's hard drive business in 2003 for $2B. Sadly, it's now owned by WD.


Of the ~20 assorted Deskstar and related Ultrastar drives of that vintage I've had in regular use over the past decade, exactly one has failed...because I dropped it.

HGST SAS SSDs (the ones that pair Intel NAND and Hitachi SAS controllers) have also been reliable performers in my home office experience, even without (RAID controller support for) UNMAP (SCSI "TRIM"). Incidentally, these now appear to be selling on eBay for more than I paid for them "lightly used" several years ago.



Back then people still used Slashdot:

Wondering if it's real... https://m.slashdot.org/story/20680

Years later, it's a widespread phenomenon: https://m.slashdot.org/story/43312

It was affectionately called the "click of death".


>But don't we all love them now because they support linux?

Ah, inventing an opinion to get angry about: some things never change.


Why would there be submissions? These drives predate HN.


HN occasionally discusses issues pre-dating itself.


... and the lifetime (deathtime?) award for the Deathstar only predated HN by a few months:

May 26, 2006: https://www.pcworld.com/article/535838/worst_products_ever.h...

October 9, 2006: https://news.ycombinator.com/item?id=1

Memory would still have been reasonably green.


It does however come up in comments regularly. I know I've brought it up more than once, because I had a week long ordeal replacing all the drives in an array as they died one by one back in the day.


The deathstars were fantastic, they almost always failed on the outer edges of the platters.

So if you only formatted them (filesystem-wise) out to capacity minus 2 GB, they were a really cheap option at the time.


That’s a very different definition of fantastic than the one I use.


Once I'd worked it out - which was after the problems were public and therefore the price had utterly cratered - they were by far the cheapest storage per GB available at the time (think "by a factor of two").

I would not have let a normal business user near one, but the developers I was supporting were most pleased about their larger than expected scratch disks for test databases and intermediate compilation artifacts.

Everything breaks. Things that at least break predictably make me happier than the alternative.


The way I always put it is: you have identical drives, with identical usage, powered for identical times, and you are still surprised when a second drive fails under the high-stress environment of rebuilding after the first drive fails.


Heh, no. We had a fleet of HPE Cloudline (CL3100) servers failing at the same time because the SSDs had exhausted their write endurance.


Perhaps they were the same model. IIRC people recommend not using the same model of hardware to provide redundancy.


I generally replace HDDs in my personal zpool a few days apart, for this reason. I also order them from different suppliers, so I can get different manufacture dates.


This gives me a strong feeling of general unease and flashbacks to the days of WD hard drives.


Miniscribe RLL disks: destroyer of early PC building firms.


WD and Seagate had abysmal quality in the eighties. Amstrad was badly burned by both and sued them. Amstrad won $90M from Seagate, but failed to secure a $141M win on appeal from WD.


Check your power-on hours:

    $ sudo smartctl -a /dev/sda | grep -e Power_On_Hours -e ^ID
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       9743
Just looking at the raw value, it seems to be 9'743 hours in my case


I seem to have the world's oldest SSD (or am I misinterpreting the output?)

  (shell 1) ~# smartctl -a /dev/sda | grep -e Power_On_Hours -e ^ID
  ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours_and_Msec 0x0032   000   000   000    Old_age   Always       -       933932h+27m+33.940s


More often than not, SMART attributes are completely undocumented and the interpretation of their raw values is pure guesswork on the smartctl devs' part. For your SSD make/model they just have it wrong.


That would make for 106 years of power on time, so it's probably not right...


That is one of the most careful uses of "probably" that I have ever seen.


Fellow time traveller?


    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
    9 Power_On_Hours          0x0032   055   055   000    Old_age   Always       -       39676
I'm about 300 hours from 40K. Time to buy a new SSD? Is this real?!


I mean, is it the affected model?

Have you applied the appropriate firmware update?


I did not find the model info, but thanks!


smartctl -i /dev/sda


0. Have backups

1. Check your backups


The two above rules are true regardless of hours on your storage, make, model, or technology. ;)


iMac mid-2010. Original disk drive.

"9 Power_On_Hours 0x0032 001 001 000 Old_age Always - 74233"

12 years old. More than 8 years of run time. It keeps on purring.

Yes, I have redundant backups. I also have a replacement drive ready. I just want to see how far I can take it.


Luckily it is old enough that you can replace it.


Just FYI for anyone for whom this didn't work by default: I needed to use the --all flag with smartctl (and install smartmontools if you don't have it).


The -a flag from my example should be an alias for --all (man smartctl | grep -A1 ' -a' | head -2). Is that not the case in your version?


Pro tip: when writing out commands for people to read it helps to use the long form arguments. In this case passing '--all' instead of '-a' to smartctl. It makes it easier to read and more clear what specific options do. Same with calling things in scripts. Short form is for quick and dirty typing things, but not great for reading or comprehension :)


Fair point, yes, I should have done that!


Derp you are correct, sorry. Hopefully my comment is still useful for anyone who didn't know they needed smartmontools


I had to adapt a bit:

    sudo pacman -S gsmartcontrol
    sudo smartctl -a /dev/nvme0n1p4 | grep -e "Power On"

I'm at only 2727 hours.


Checked my Samsung 970 Evo 2TB and it says 487 even though it’s been on continuously for years.


Receiving power isn’t the same as being “on”. I’d assume the drive has a sleep state that it goes into after inactivity, and those hours don’t count as “power on” hours.


Interesting, thanks. The spec does leave it almost uselessly underspecified:

"Power On Hours: Contains the number of power-on hours. This may not include time that the controller was powered and in a non-operational power state."

The same drive reports only 329 "controller busy time" minutes.


My 960 Pro 1TB, which I paid way too much for back in the day, is at just shy of 5,000 hours, and I've used the hell out of it since it was new.


The 960 Pro was released in October 2016. Even if we assume you purchased it the same month it was released and have been running it 24/7 non-stop, that's 50,591 hours counting from October 1st to this hour. I can assume the number is way below 1/3 of that even for people who use it 8 hours a day.



How to do this on a Windows PC?


In an elevated PowerShell prompt:

Get-PhysicalDisk | Get-StorageReliabilityCounter | Select-Object PowerOnHours


Getting blanks on some of my disks. Two show 64422h and 73318h with no signs of failing :)


I'm getting a blank value back as well, even with admin privileges.


getting a blank value


Make sure to run PS with admin privileges


I did, but still got no SMART values. Speccy works though and I'm sure I've also run smartctl successfully on this laptop in the past.


install ubuntu linux first


Thanks


Mine is above 53,000 hours ... time to check my backups!


Sounds like you're in the clear for this particular bug...

...but always check your backups regularly for data that is dear to you!

Protip of the day: that includes things on someone else's server. I remember when Grooveshark went offline from one day to the next and I lost nearly my whole library because I remembered only some artists and had to go through thousands of songs to find which ones I actually liked from them. My browser's localStorage object had the playlists but I didn't use those much. Or when 000webhost cancelled my account because I was using the 100MB(?) to back up some files that were most important to me, rather than for actual webhosting (in my defense, I was 15 at the time), and so when I returned from a holiday with my parents with an actual crashed hard drive, that turned double sour. Backing up things from what they now call the "cloud" is something I learned early, as I have virtually no code I wrote before that summer, only some of the music, only essays with WordArt if they were printed, etc.

If you use Telegram, Spotify, Netflix, maybe you have videos uploaded to YouTube and no longer have local copies... my recommendation is to have backups of things that are important to you, onto a medium that you own (it's only a disaster-case copy anyway), and for copyrighted content like Spotify/Netflix it would simply be enough to just have a list of songs/videos. Maybe Netflix doesn't go offline from one day to the next, but your account might be hacked, or that friend you share it with might be hacked, or the right set of hard disks fail at their datacenter, etc. GDPR data exports are your friend, particularly when they're automated and you don't have to bother support. (They might also reveal, as in my case, that Spotify knows how often you shower because I then connect it to a particular waterproof bluetooth speaker at wakeup time. Data exports are also fun to browse!)


Cisco is not an SSD manufacturer. They write "industry-wide bug". Does that mean that more than one SSD manufacturer is affected (because they partially use the same firmware)? Further down they mention only SanDisk. Or is "industry-wide" just their newspeak for any SanDisk drive of an affected model, regardless of whether it's installed in a Cisco box or somewhere else?


I suspect that "industry-wide bug" in this context is simply Cisco pointing out to their customer base that this isn't Cisco's fault and please don't blame Cisco.


I'm interested here too. I've got a Crucial SSD from 2015 that's been on about:

* 100% of 2015-2017, let's add 2 years here

* Aboutish 50% of days since 2018 to 2020

* On and off again (5%?) since then until now.

So it's about 3 years of full use? I'm eyeballing the use here. So it may be close to the numbers that were given, but I'm not sure. Guess I could check the SMART stats to get a precise number and from there decide what to do about it.

Searching a bit, it seems it's a well-known bug in "enterprise SSDs" [0, 1] (which my drive certainly isn't), but there aren't any real details about what causes it, other than "a firmware bug".

[0] https://www.servethehome.com/hpe-issues-hpd7-fix-for-ssds-th...

[1] https://www.anandtech.com/show/15673/dell-hpe-updates-for-40...


Enterprise SSDs are commonly made by one of a handful of dominant companies and then rebranded by server vendors, so that you can see a SanDisk or Samsung SSD sold as a Dell EMC or Cisco or HPE drive.


The problem seems to be widely experienced.

The Cisco report turned up in response to a post I'd made of the HN issue on the Fediverse:

https://mastodon.infra.de/@galaxis/108622795822100862


Dang also listed a few previous submissions on the topic.

None of which gained traction at the time:

https://news.ycombinator.com/item?id=32038993


If you google "SSD 40000 hours" you will find many box shifters affected: Dell, HP, IBM, etc.


Would companies be willing to contribute to the OpenSSD project?

OCP (Open Compute Project) has shown that customer-operators can cooperate on open hardware designs, successfully influencing enterprise hardware supply chains. Commercial DPUs and SmartNICs were preceded by a decade of open hardware and research by the NetFPGA project (https://netfpga.org). Why not DiskFPGA?

2017 OpenSSD overview, based on Xilinx: https://github.com/Cosmos-OpenSSD/Cosmos-plus-OpenSSD/blob/m...

2022 status, http://www.openssd-project.org/

> OpenSSD platforms are still being actively used in many academic institutions. As of June 2022, we have renewed the homepage hoping that this site will be a forum to share various simulators, tools, traces, etc. not only for the conventional SSDs but also for the upcoming storage devices such as KVSSD, ZNS SSD, and Computational Storage (CSX). This site is being maintained by Systems Software and Architecture Lab. at Seoul National University as a part of the SW STAR Lab. project.


An open source SSD is also a lot more feasible than an open source hard drive. Even if you managed to get an open source HDD controller, you still need the precision mechanical parts that are impossible for the average person to make. With SSDs, however, it’s just a PCB with ICs.

Edit: this obviously ignores any troubles one would have sourcing the ICs (such as possible NDAs)


Is there anything special about making SSDs that the average person would not be able to do or is it a "if you can outsource PCB printing and maybe solder you can make one" situation?


The limiting factor would be the memory chips themselves and any firmware required for them (if any). I also don't know how well they are spec'd and if full documentation is available without NDA's and lawyers.


Looking on Mouser and Digikey, it doesn't seem like flash chips, even at fairly high density on a single chip [e.g. 1], are all that difficult to get and get info on, though they all have very high minimum-volume orders. So if a person wanted to try to do this on their own, they'd probably be best off finding like 50 friends to go in on the order with them.

[1] https://www.mouser.ca/datasheet/2/671/micron_technology_mict...


The flash you linked to the datasheet for is over a decade old. Any SSD built from it would fall short of adequate by an order of magnitude in every important metric. The per-die capacity of typical current-generation NAND flash is 16x larger, the interface speeds are 8x higher, erase blocks are 24x larger, program latency can be 5-10x higher. And most importantly, all mainstream SSDs now use flash that stores three or four bits per physical memory cell, rather than one.

So using that flash, you could build something that is recognizably an SSD. But it would be almost entirely useless: too expensive and too small and slow for production use, and too far removed from the current state of the art to serve as a research platform for the most important challenges the SSD industry has been dealing with for the past several generations (error correction strategies for TLC/QLC, and SLC caching).


A couple things:

- I just picked one at random. I'm sure the bleeding edge is harder to get and datasheets are harder to get, but I wasn't trying to find the newest or best.

- the specific subthread here is about the diy-ishness of ssds vs. spinning rust, where the difficulties are of a fundamentally different kind. I feel like it goes without saying that a home built ssd is not going to perform to the level of mass production devices, the question was just can you.


The new, fast, high density flash chips from the big name flash chip vendors are not generally even listed on the vendor websites. You have to talk to a sales person and convince them you're actually going to buy in volume to even get data on the latest generation of flash ICs. You will also likely need more than 50 friends to meet the order minimums, unless your 50 friends each want to buy about an ExaByte worth of flash chips.

Also, with these multiple level flash technologies (quad level is current tech, triple is still used in some SSD/NVMe) the read, write, and ECC algorithms are non-trivial to the point where last I checked even mainline Linux's raw flash driver support won't do anything beyond single level cell flashes (and very few new embedded designs are choosing raw parallel NAND flash, instead opting for things like eMMC or UFS which have built-in controllers to handle this).


Like I said in another branch of this subthread, the question wasn't "can you build a high perf flash drive yourself" but "can you build an SSD more easily than a magnetic drive."

The comparison here is that no matter how much you hunt on digikey you won't find a disk platter or drive head or any of the other precision machined parts that go into a hard drive (never mind putting them together and keeping dust out etc).


Taobao has many sellers of pcb/controller and flash chips. There are even sections dedicated to DIY SSDs at Chinese forums.

But they all rely on leaked manufacturer firmware production tools, not open source firmware.


An FPGA-based SSD will always be too expensive for people with truly large scale. The controller ASICs are a lot cheaper.


Yeah, if we want an open SSD the path would be for a hyperscaler to strong-arm a controller vendor like Microsemi or Marvell into opening up their SDK. This worked with various Broadcom ASICs so it's not impossible.


It's been over two years since this was first identified... since this apparently affected many makes and models of SSDs, it would be nice to know if my laptop could be affected and if there's anything I could do about it.


This will not affect your laptop, all of the models affected by this are enterprise SAS SSDs.

Of course your SSD might have some other firmware bug that would eat your data, all you can do is search for the model number and see if the manufacturer has issued any notices/firmware updates.


There was at least one consumer SSD with a similar failure mode, the Crucial M4 SATA drive: unless you updated the firmware, it would crash after 5,200 cumulative power-on hours.

That drive launched in 2011 though, so there probably aren't that many still in active use that haven't yet reached ~7 months of uptime.


Yes. This one: https://www.reddit.com/r/buildapc/comments/1z2rm5/crucial_m4...

That problem became known a decade ago, so it's somewhat surprising to see such a similar bug now.

This new one is worse because the drive cannot be used after reaching the magic number of hours. In the Crucial M4 case the firmware could be updated even after the bug struck.


> This will not affect your laptop

That’s just your presumptive opinion, right?

Edit: sorry, probably put that offensively. mikiem said about the HN drives: “These were made by SanDisk (SanDisk Optimus Lightning II) and the number of hours is between 39,984 and 40,032...” - https://news.ycombinator.com/item?id=32031428 Without knowing whether parts of a codebase are shared between SanDisk devices, it is hard to say that enterprise SAS devices have absolutely no code shared with consumer devices. So it is just the commenter’s opinion unless the commenter has knowledge of writing SanDisk firmware. “HPE and Dell both used the same upstream supplier (believed to be SanDisk) for SSD controllers” https://www.anandtech.com/show/15673/dell-hpe-updates-for-40...


> Without knowing parts of a codebase are shared between SanDisk devices, it is hard to say that enterprise SAS devices have absolutely no code shared with consumer devices.

Even if the code containing this bug was shared between consumer and enterprise drives, it's not reasonable to assume that it would take SanDisk multiple years to check whether their consumer drives are also affected. The lack of a follow-up report from SanDisk is good evidence that their other products are not affected.


Yeah, I agree it is very unlikely to affect someone’s laptop.

However I dislike a black and white “This will not” absolute fact statement: even if based on reasonable assumptions which is what appears to be the case versus detailed knowledge.

Most laptops don’t run their SSDs 24/7, and unless a manufacturer’s error affects a lot of consumers, we often don’t find out the cause of consumer equipment errors in my experience.

If the OP has a laptop older than 2020, with an SSD with a Crucial chipset (especially if SATA), and they leave it on most of the time, then maybe check SMART hours.


How likely is it that they're using an enterprise SAS SSD in their laptop?


I've been searching "40000 hour SSD" since the HN downtime. There's a lot of bug reports besides this one and I'm fairly confident it only affects enterprise too.


One thing everyone could and should be doing is backups.


Two things: Test restores or you don't actually have backups. Just saying.


I got bit by this with iPhone backups. I did a phone trade in and followed the backup before trading in instructions. Problem is after the trade in the backup failed to restore due to an unknown error. The whole manual syncing and backing up with a cable workflow with Apple is super fickle and riddled with bugs.

Luckily I had Time Machine backups of my iOS backups and I managed to avoid losing too much data.

As a sidenote it seems like Apple has pretty much neglected their offline backup and syncing workflow to drive more people to just pay for iCloud storage. Half the time my iPhone takes hours just to get detected by the mac when plugged in.


Man, Time Machine can fail just as badly. Unknown errors and there is no help or documentation or way to fix it. Carbon Copy Cloner [0] is the way to go for retaining sanity. Absolutely excellent documentation for pretty much any use case. And it works reliably. Not affiliated but after having had terrible experiences with Time Machine I feel compelled to bring it up every time I come across the topic.

[0] https://bombich.com/


While I don't like how annoying Apple is with service upselling (iCloud, Music, Arcade), at least they moved iPhone backup from iTunes to Finder. So their local iPhone backup process is being maintained over time.

I don't have issues with my computer (PC or Mac) detecting my iPhone. Generally need to make sure iPhone is unlocked after plugging it in. What is tough is the large size of my iPhone (X gb) and how small my Mac's HD is (2X gb).


I’ve changed iPhones many times and the issue still persists for me. The only reliable way to get photos synced or the iPhone detected in Finder is to turn on airplane mode, for some reason. Must be a bug with Wi-Fi syncing.

You actually bring up another issue: there is no obvious way to back up an iPhone locally to an external hard drive. So either pay the Mac SSD storage tax or the iCloud tax.


Depending on your use case, you can integrate occasionally using your backups into your normal data processing.

Again, it depends on the use case, but then it becomes integral to your existing workflow instead of an addendum that you end up forgetting to do.

The whole purpose is to make the failure of one extremely noisy and irritating.

It's like what I do with raid. I have a script that will shut the machine down on drive failure and then will use dialog(1) to say something like "hey bozo replace the fucking drive first" when you boot it up and then it will shutdown again and be unusable.

Make the complaining show stopping, loud, rude, and disruptive. Because if the next one fails you're screwed


Absolutely! Twice in my career, in huge failures, the backups turned out to be garbage! You don't want this!


We were eviscerated by this (or something just like it) a few years ago. Drives started failing by the dozens.

Had to rebuild from HDD backups, down for a week. I still have nightmares.


They should have used a Free BIOS Language in their hardware like Open Firmware FORTH from the OpenBIOS project, to go with the Bias Free Language in their documentation.

https://en.wikipedia.org/wiki/OpenBIOS

>OpenBIOS is a project aiming to provide free and open source implementations of Open Firmware. It is also the name of such an implementation.

>Most of the implementations provided by OpenBIOS rely on additional lower-level firmware for hardware initialization, such as coreboot or Das U-Boot.

https://en.wikipedia.org/wiki/Open_Firmware

>Open Firmware is a standard defining the interfaces of a computer firmware system, formerly endorsed by the Institute of Electrical and Electronics Engineers (IEEE). It originated at Sun Microsystems, where it was known as OpenBoot, and has been used by vendors including Sun, Apple, IBM and ARM. Open Firmware allows the system to load platform-independent drivers directly from a PCI device, improving compatibility.


40000 (or even 40960) seems an odd number to fail at. 64k or 32k would make the cause pretty obvious, but 40000 doesn't seem all that round in binary. Perhaps a 12-bit counter incrementing every 10h? This is puzzling.

Of course, I am also entertaining the possibility that no one thought they would be in use for this long, which would certainly be evidence of planned obsolescence.


Very strange understanding of the word "evidence".

No sane SSD manufacturer would do such a thing on purpose. You do it and you lose business, that's it.

The simplest explanation is that somebody made an honest engineering mistake.


When you purchase a server (fleet), you get a long warranty with it. Generally 3 to 5 years. So you expect this fleet to stay in service for <=5 years mostly.

Unless you burn through your SSDs, you're very unlikely to hit this event.

When these servers continue to be used and the disks all start to fail at the same time, this will obviously stink.

The bathtub curve is not like this. You can feel that.


40k hours is a little more than 4.5 years. These drives deterministically fail at that uptime (unless firmware is updated) and most servers are on 24x7, so if you run your servers for 5 years, it's highly likely you'd run into this. If you run your SSDs hard and they fail early as a result, then you'd be spared from this mass death. Or if you use three year leases and replace on a schedule.
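For reference, the arithmetic, in the same spirit as the 2^57 ns calculation further down the thread:

    >>> round(40_000 / 24 / 365, 2)   # 40k power-on hours in years of 24x7 uptime
    4.57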

Now more than ever, five year old server hardware isn't that far behind the curve unless you're on the bleeding edge. I've been looking for bottom of the barrel hosting lately, and there's lots of dedicated servers available with 10+ year old cpus, and probably most of the rest of the machine is a similar age.


you have any recs for lower-end bare metal providers?


I haven't used them yet, but take a look at dedicated server offers at LowEndTalk and Web Hosting Talk. Obviously, there's some offers that are sketchier than others, and a lot of resale, but if your needs are low, there's some neat stuff.


Given the power dynamic between a single customer and large corporations, the smart thing to do is to assume malice until proven otherwise. This puts the onus on the corporations and, if we're lucky, creates an environment where they compete with each other to be seen as the most honest. The worst thing that happens is the single customer has to buy an SSD from someone they don't trust.

If we do the opposite, as you say, and assume everything is an honest mistake, that puts pressure on the single customer to prove that the organization with a huge marketing budget is doing something wrong. In this situation, the worst thing that happens is we all get taken advantage of.

Our collective distrust is the only power we have against massive marketing/PR budgets. It doesn't have to be angry, or sour, or cranky, we just collectively need to not take their word until we have a reason to do so.


Are you seriously saying that by default we should believe they intentionally planned to cause their customers to lose all of their data?


Considering the immoral practices adopted by corporations, such as vendor lock-in, use of slave labour (directly or indirectly), bending the law for their own interests, and supporting and conducting biased research towards their own interests, among others, I would say that it is quite sensible to believe it. Big corporations, per se, are not evil entities. The people running them might or might not be, and when you have evil/immoral people running things, unless there are good control measures in place, they might make bad decisions.


If spinning rust can run for ~8 years without any problems, a consumer SSD can hit beyond 40K hours reliably, and everything is checked and tested tens of times because of the complexity of flash storage, I'd get suspicious too.

Also, enterprise drives get firmware updates (spinning or not), and this firmware is automatically applied via the RAID controller, so it could have been remedied easily before it got this big if it's an actual error.


planned obsolescence is quite a thing...?


In some cases, but a product must fulfill its core purpose. If a SSD intentionally dumped data and self destructed at a set time, that would be disastrous for the brand. Same way a car doesn't adopt planned obsolescence by blowing up after 200k miles.


> If a SSD intentionally dumped data and self destructed at a set time, that would be disastrous for the brand.

Other than "intentionally" (which we cannot know and makes no difference to whether you lose your data or not) that is literally what these SSDs are doing, and no SSD brand has been destroyed over it.


You are not a used car aficionado?

'This insulation prematurely disintegrates under normal use causing the wires it is designed to protect and insulate, to short causing many problems.'

http://www.mercedesdefects.com/2008/04/wire-harness-defect.h...


What more could a manufacturer do to be "disastrous for the brand" than literally build an SSD that stops working after 40k hours? Because this does not seem to qualify for you.


Someone pointed out on the other thread that it could be 2^57 nanoseconds:

  >>> 2**57/10**9/3600
  40031.996687737745


If it were 53, I'd wonder "are they storing the time in the integer part of a double precision float?" That wouldn't go negative, it'd just start absorbing increments without changing the value.

Though that might cause a divide by zero?

What could cause unexpected behavior at 57 bits?

Perhaps storing fractions of an hour, like incrementing it every 1/16th of an hour and calculating a relative rate of change, causing a divide by zero?
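The "absorbing increments" behaviour at 53 bits is easy to demonstrate, e.g. in Python:

    >>> x = float(2**53)
    >>> x + 1 == x          # increments smaller than the spacing between doubles vanish
    True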


My overactive imagination thinks it went something like this:

Engineer A: Gee, I need to store a few flags with each block, but there's nowhere to put them. Ah! We're storing timestamps as 64-bit microseconds. I can borrow a few of those bits and there'll still be enough to go for thousands of years without overflowing.

Engineer B: Gee, our SSDs are getting so fast, soon we'll be able to hit 1M writes/sec. But we're storing timestamps as microseconds. How can we generate unique timestamps for each write? Ah! I'll switch to nanoseconds. It's a good thing we have plenty of space in this 64-bit int.

BOOM!


Packing a type flag into the upper bits of a 64 bit value is a reasonably common optimisation in dynamic language implementations (because it lets you use unboxed number arithmetic).
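A bare-bones sketch of the trick, not any particular runtime's actual layout:

    # Schematic only: 3 tag bits in the top of a 64-bit word, value in the low 61 bits.
    TAG_BITS = 3
    TAG_SHIFT = 64 - TAG_BITS
    VALUE_MASK = (1 << TAG_SHIFT) - 1

    def pack(tag, value):
        return (tag << TAG_SHIFT) | (value & VALUE_MASK)

    def unpack(word):
        return word >> TAG_SHIFT, word & VALUE_MASK

    unpack(pack(0b101, 42))   # (5, 42): small integers stay unboxed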


Or sometimes the lower bits, as at least used to be the case for integers in V8. (Also OCaml, but that's not dynamically typed. It simplifies the garbage collector to at least sometimes not require a pointer map for each type, just a flag in the object header to indicate if it contains any pointers, and then everything that isn't ints or pointers needs to be boxed.)


Do embedded CPUs like the one in an SSD have floating point units? It seems more likely to me that the upper bits in a 64 bit integer counter were used for something else.


I think it is more likely they shifted a power of two over implicitly by a base 10 place value instead of a binary one. Or multiplied by 10. Unsure why. But, seems simpler.

52 is notable as 2^4 + 2^3 = 24, 24 + 24 = 48, 48 + 2^2 = 52. But 57?


From a related issue with a different vendor:

"The fault fixed by the Dell EMC firmware concerns an Assert function which had a bad check to validate the value of a circular buffer’s index value. Instead of checking the maximum value as N, it checked for N-1. The fix corrects the assert check to use the maximum value as N."

https://www.anandtech.com/show/15673/dell-hpe-updates-for-40...

Why the MAX value would be in a circular buffer, or what was being stored in N-1? No idea.


From my reading, it's checking the maximum index into the circular buffer. That is, when it hits the end of the circular buffer, there's an assertion to check that they're properly wrapping the index back to the start of the buffer, but the assertion has an off-by-one error.

I presume you find a lot of circular buffers in SSD firmware, for wear-leveling reasons. Samsung's NILFS and NILFS2 are structured as circular buffer append-only logs, at least partly to avoid trusting the firmware wear-leveling.


> circular buffer

The infamous Seagate firmware bug was due to the same thing.


2^57 nanoseconds is ~40032 hours. I wouldn't be surprised if someone out there was counting intervals in a 64-bit value and masking off some of the higher bits for flagging.

Any time I see these sorts of issues (odometers that kill themselves, for example) I think of smaller units at higher bit depths. That's not the only way to get to this kind of concern, but it's a way that pretty-darn-competent engineers can leave ticking time bombs due to estimation failures.

Always check types for overflow and/or precision loss. Always.
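
A quick way to sanity-check those horizons for a few unit/width combinations (the 57-bit row assumes the borrowed-flag-bits guess above; the 32-bit-milliseconds row is the classic ~49.7-day surprise):

  # Hours until a free-running tick counter wraps, for a few unit/width combos.
  combos = [("ms", 1e3, 32), ("us", 1e6, 64), ("ns", 1e9, 57)]
  for unit, ticks_per_second, bits in combos:
      hours = 2**bits / ticks_per_second / 3600
      print(f"{bits}-bit {unit} counter wraps after {hours:,.0f} hours")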


It's a power of two shifted by a decimal place value instead of a binary one. Unsure why.


Backblaze have a great blog about things they learn about hard drives. It's been going for years, less about firmware issues and more about general usage. https://www.backblaze.com/blog/backblaze-drive-stats-for-q1-...


Yikes. Cisco claims this is "an industry wide firmware index bug". Is there any validity to this claim? Are any consumer drives affected by the same issue?


Yes, HPE is one of the SSD OEMs affected by it:

https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na...

https://support.hpe.com/hpesc/public/docDisplay?docLocale=en...

> Are any consumer drives affected by the same issue?

As far as I know it doesn't affect consumer drives, but I wouldn't be surprised if some have the same defective firmware.


Is there any information about the provenance of this SSD controller? Sounds like enterprisey vendors all rebranded some upstream supplier's hardware.

edit: apparently sandisk: http://forum.hddguru.com/viewtopic.php?f=3&t=39964 - also a clue about the magic 40k hour significance: "the SSD alters its performance in some way as it approaches end of life. This appears to shine some light on the reason for a trigger at 40K Power On Hours."


This apparently also happened two years ago?

https://www.anandtech.com/show/15673/dell-hpe-updates-for-40...


HPE had the same issue on some of their SSDs. We received an advisory months before it would have been a problem, and had time to upgrade all our customers ... except 2 servers that we missed. Luckily, when I saw all disks in error in the iLO on one of the servers, I remembered this issue, googled the model, confirmed it was affected, and was able to shut down the second server and start the upgrade. Not sure the client was happy, but at least it was only a 1h outage instead of maybe a day to get new disks plus restoring from backup. HPE did replace the disks under warranty.


From what I remember reading, this affected SanDisk only, is that correct?

I have a Samsung EVO and an OCZ SSD. Would these be affected too? Perhaps some shared component?

Cisco has written "industry-wide" here, which is confusing.


One of the dumbest things I have done in my life is buying an SSD and new HDDs to farm Chia. After about a month of farming, the SSD died due to the constant read/writes.


Did you make any money?


Nice one. For your enjoyment have a look at the "all-time" chart. https://coinmarketcap.com/currencies/chia-network/

While the concept of Chia was interesting at the time, and also reminded me of the "smart fridges" of Silicon Valley (the show), filling up gigabytes with trash data to prove a technical point made me lose interest. Just glad I didn't invest more.


"the SSD will report that 0 GB of available storage space remains. The drive will go offline and become unusable."

Considering this is a firmware "bug" that bricks the drive, due to a misplaced index, not a physical wear issue, it appears to be a 4.5-year planned obsolescence feature.


I’m moving all my storage to vellum with papyrus backups.


Alexandria's got an excellent hosting facility.


Sadly rendered effectively write-only


The original argument for geographically-distributed backups!


Reading the Wikipedia entry on Power-on hours [1] says that:

"...Once a [SSD] drive has surpassed the 43,800 hour mark (5 years), it may no longer be classed as in "perfect condition" "

And that SSDs generally have a 5-year life expectancy.

So with this bug, should we simply think of it as a hard 40,000-hour lifetime limit? Well, that's roughly 10% less than the design figure.

I'm just not sure how realistic it will be to obtain SSD firmware updates given that it's "an industry wide firmware index bug".

How could I even know if a particular SSD has an affected firmware?

[1]: https://en.m.wikipedia.org/wiki/Power-on_hours
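
Back-of-envelope on those two figures, assuming the drive is powered on 24/7:

  >>> 5 * 365 * 24                  # the Wikipedia "5 years", in hours
  43800
  >>> round(40000 / 8760, 2)        # years of 24/7 power-on until the bug
  4.57
  >>> round(1 - 40000 / 43800, 3)   # shortfall vs. the 5-year figure
  0.087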


Nothing has eternal life.

Especially not electronics. And certainly not more advanced semiconductors.

The rule of thumb is that the lifespan of planar CMOS processes, in years, is roughly proportional to the "node size" in nanometers. So we are right now at 2-10 years. Specific ICs, applications, or designs can come in longer or shorter than this, but it's the average.

If you've ever worked on "antique" electronics, this is no surprise.


Here's how to check the power on hours on a Mac:

I didn't have `smartctl` installed on my Mac, so here's how to install it via `brew`:

    brew install smartmontools 
My SSD is `disk0` (check `Disk Utility.app`).

Running `smartctl` for my `disk0`:

    sudo smartctl -a /dev/disk0 | grep -e "Power On Hours"
    Power On Hours:                     6,334
So I have only ~15% of those 40k hours used (6,334/40,000).
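
And if you want to turn that into a rough time-to-40k estimate (plug in your own number; a machine that isn't on 24/7 accumulates power-on hours much more slowly than calendar time):

    power_on_hours = 6334                     # my reading from smartctl above
    remaining = 40000 - power_on_hours
    print(f"{power_on_hours / 40000:.1%} used, "
          f"about {remaining / (24 * 365):.1f} years left even at 24/7 uptime")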


Related topic: leaving SMART control tests ON for a (non-SSD) drive apparently interferes with sleep; the drive will wake up to test itself. For some drives I would prefer that not to happen and for them to just stay quiet. Yet testing for this behavior seems elusive: querying the disk wakes it, and most Linux disk tools seem unaware of sleep state. I just listen for the disk spinning, or notice a long pause before an operation.


What exactly is causing the bug though? If the same area is written to some x number of times, the solid-state device in that location permanently fails, is that correct? If so, in an always-on device, how can this bug be escaped? They would need to randomize the writes, and for that the storage size should be a multiple of what is needed for regular operation. Even then the disk will fail eventually. What am I missing here?


There is a bug where a software counter overflows, which a firmware upgrade will fix, is what it says.


Ah cool, thanks


Since this is "an industry wide firmware index bug", is there a (complete) list somewhere of all SSD models that are affected?


So, has anyone opened one of these SSDs and tried to get at the firmware and find out WTAF code was written?


You would just find a firmware update file, which is much easier; there's a hddguru thread linked above somewhere where people have been having a gander.


A Modest Proposal: all $LARGE business insurance policies specify that, to the extent that any insured loss was caused or worsened by reliance upon the correct functioning of SSD or related drive technologies...YOYO, and any & all losses are solely on you.


Seems that I also need a replacement SSD. Using CrystalDiskInfo on my Samsung 850 EVO 1TB, it shows Health 99%, 5450 Power On Count, 27463 Power On Hours, BUT [FF] Remaining Life 10, which is at the lower end...


4-byte integer rollover: 2^32 = 4,294,967,296 ticks spread over 4.5 years (4.5 x 365 x 24 x 3600 seconds) works out to about 30 ticks/sec.


Unlikely; a clock running at 30 Hz overflows a 32-bit (unsigned) integer in 39,768 hours, while the HN disk failed after at least 39,984 hours [1], and the vendors wouldn't issue warnings about 40,000 hours if drives actually failed about 230 hours before that.

[1] https://news.ycombinator.com/item?id=32031428
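
Working that out, in the same style as the 2^57 guess upthread:

  >>> round(2**32 / 30 / 3600)      # hours until a 32-bit counter wraps at 30 Hz
  39768
  >>> 40000 - 39768                 # roughly the "230 hours" mentioned above
  232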


Not sure about the mfgr - could be rounding for memory saliency... but re HN they're counting clock time, not power-on time. Could have easily been turned off for a total of 10 days for maintenance etc. (and a good routine could have both servers off about the same number of hours). That's only 0.6% downtime.


> Not sure about the mfgr - could be rounding for memory saliency...

In that case, I would expect them to be rounding down, not up, and to at least mention the exact number somewhere.

> but re HN they’re counting clock time not power on time. Could have easily been turned off total 10 days for maintenance etc.

I seriously doubt HN has been down for 10 days over the last five years. As discussed in the other thread I linked, this has been one of the longest HN outages.


Yeah, no, I don’t mean crashed altogether, just off for routine maintenance on a rotating basis. But no sense in prolonging this. Anyway, agree that 1% seems high.


Somewhat unrelated, but I recently had a motherboard fried by power instability, which has given me a healthy respect for the difference between spinning rust and SSDs.

My SSDs were scragged; my HDDs were just fine. I guess it's time to figure out how to get a realistic write-through cache setup going, because from now on, if it ain't on spinnin' rust, it ain't hard enough yet.


Of what quality was the mobo? I thought the higher-end stuff typically has power protection.


ServalWS B450. Was not at all amused. Oddest bloody thing, because I had a pair of Samsung EVO M.2 NVMes in there, and it still managed to let the magic smoke out.

CPU/GPU and RAM lived, but the corruption of the drives (even if the data was largely recoverable) and their being rendered unviable for further writes really took me by surprise.

That, combined with the way the HDD apparently did not care one lick, illustrated for me a difference in tolerance to operating conditions that I'd not had the chance to witness first-hand yet.

Just figured I'd share while we were talking about SSD weirdness and firmware nonsense.


I have a 128GB OCZ Agility (or something of that era) disk that is still in use in my mom's laptop :)


This sounds like the exact same bug that affected HPE products a couple of years ago.


.


[flagged]


My reaction was "If you want to write some documentation with bias-free language, just write the documentation with bias-free language." Why the need for a long paragraph explaining "Look how great and sensitive we are!"

I understand, and agree with, the desire to use inclusive language, but so much of this has just devolved into performative nonsense.

Edit: Thought I'd leave my original statement above up, but after reading some of the other comments below I at least understand now the purpose of this notice. Basically, when it comes to Cisco's products, they previously used "master/slave" and "whitelist/blacklist" in their terminology, but no longer do. However, of course some older networking products still use that terminology in UI software, for example. So this notice is essentially saying "We got the memo about updated language, but if you see the terms master/slave or whitelist/blacklist in our docs, it's because it's essentially referencing something out of our control to fix, so don't yell at us."


Else you get questions, like, "why don't you say master/slave like everyone else?!@!!"


At this stage I think such questions can just be ignored.


Saying that and usernames like "DoneWithAllThat" and "hn_throwaway", yeah, it checks out.


Not sure exactly what point you're trying to make, but if it's "the risk of saying anything even remotely critical of DEI tactics is a huge, gargantuan, giant career risk these days", then I wholeheartedly agree.


It's a device to identify certain kinds of people who would have a problem with less loaded language without any loss in clarity.


I would love to see some examples of Cisco documentation that ever offended anyone.


It's a warning that the documentation may refer to master/slave or something like that because Cisco cares enough about DEI to update documentation but not enough to actually update out-of-support firmware.


I miss old Cisco documentation, with IP addresses and router names like SanJose3 and 408 phone numbers on PRIs etc.


Firepower is a good start, although for different reasons.


Other than DEI administrators, trainers and people in positions with DEI in them, who is actually getting offended?


People that have to move their mouse a bit to hit the x button, apparently.


Having a statement like that shows people that they are open to suggestions on improvements. Since a lot of people are not so open to suggestions, it makes sense to me to include this language. They added a little X button so you can close it easily.


[flagged]


If someone thinks inclusive language is a sign of hostility, then I think they have misunderstood the situation. I’m happy to have open source contributions from anyone, but I do share my pronouns in posts and on videos and I can tell you that the very small number of people who have gotten upset by that were angry unhelpful people who were more interested in complaining about language than contributing to the community.


[flagged]


It’s a space that is inclusive of queer folks. If someone can’t handle that, then yeah, they probably aren’t mature enough to safely interact with those people. Every space needs to have some rules in the event that someone gets nasty. Someone who gets upset when they encounter a non-binary person is probably going to make those people feel unwelcome. So you can either protect the vulnerable people or you can allow the bully to push other people out. But whoever is organizing that space can consciously decide, or they can let the bully decide for them.

By the way, you seem to have this really charged, negative view of queer spaces. Which I respectfully want to suggest is a misunderstanding of those spaces. I mean, if you go in all angry and complaining about other people, then sure, they’re gonna (rightfully) kick you out. But that’s true of most places if you show up and act like a jerk. The truth about these places, and being queer in general, is that this is a beautiful space of happiness for so many people. Being able to express gender openly without judgment is a positive, empowering experience for so many people. All you have to do to fit in is accept that. You don’t have to like it, though that is encouraged. But if you get so upset by someone existing as their true self that you have to argue with others, you’re probably not a good fit for that space.

I want to suggest you listen to the YouTube channel Beau of the Fifth Column. He’s got a lot of great takes he explains in a way I think you’d understand. And sometimes he talks about stuff like this.

Anyway I’m non-binary. I’m taking hormones and expressing my gender in ways I never knew I could (wearing clothes I “wasn’t supposed” to wear). It feels great. All I really ask for is that you accept me for who I am. And I hope you see that I’m able to be kind to you even though I know you think I’m a “pronoun person”.

Edit: Here’s a couple videos from Beau of the Fifth column talking about gender. He’s a white guy from the south who explains things in a non judgmental way and I think he has a good perspective, but I’ll let you decide:

https://youtu.be/vQ53lVyi4so

https://youtu.be/FaFK9uqbqrY


Thanks for standing up to that unpleasantness. I know you aren't really supposed to feed into that, but it's refreshing to see pushback.


Thanks for recognizing it. I have enough privilege that I am insulated from this kind of thing, so I have the energy to be patient. And it helps me develop my rhetorical skills. Cheers.


[flagged]


Barry I appreciate that you recognize my good faith efforts. But I want to highlight that queer and gender nonconforming people are regularly marginalized and othered in this country and around the world. It is genuinely tiring to them to be dismissed regularly in their daily life and then to encounter people online who want to play this up as some culture war with two legitimate sides. I personally would not label you an "unbeliever". As I have said, these people just want to exist and have people like you deal with that without getting upset.

Some people like me have a little extra energy to sit down and explain it, but you cannot regularly expect marginalized people who have to deal with so much day to day to then spend energy educating you on the facts of the lives of queer and gender nonconforming people. They are far, far too tired from everything else to spend that energy educating you.

But I have given you a choice. A couple of videos from a respectful white southern man who I wholeheartedly endorse for his explanations. I watch his videos every day and to be honest I so far think he always has great takes.

So I have given you an opportunity to learn another perspective, and I encourage you to check it out. But if you don't, and you continue to act as the victim of a culture war that does not actually harm you in any way, then I promise you you will continue to be downvoted and kicked out of any space with open minded people.

I genuinely wish you the best. But I do not endorse this notion of dogma and unbelievers. This is about common decency and respect. This is about the lives of real people I respectfully think you haven't yet had the opportunity to understand. But consider checking out those video links. It would be an act of good faith to help your fellow people who just want to exist in peace.


[flagged]


Barry I’m just being honest. If you don’t change your behavior you can expect more of the same. Whether it’s being downvoted or being accused of being a jerk, I can see there is something about this that bothers you.

But let’s stay focused. Would you be willing to watch one or both videos I posted? Or no? There’s a lot of people out there that could benefit from your understanding, and all it would take is a few minutes of your time. Please consider it. I’d appreciate it too. Thanks.


The docs may include "master/slave", and they don't want to get sued or bad PR, so this generic notice says "we don't like bad words but sometimes the industry uses bad words and that's unfortunate". If you click the Learn More link in the paragraph, you'll learn more.


[flagged]


There is -- it's using words other than those, which is both easy and considerate.


Some of us believe it's a mistake to give mere words that much power.


It is possible to believe that and act to defuse this power anyway. I believe password authentication is crap and people should use WebAuthn, but I don't say to myself, "Since I believe password authentication is crap I don't need a password manager".


You've chosen a good analogy to back up the point I'm making. Passwords aren't rhetorical devices, they're functional ones. Not only that, but they're imperative. If you present the proper password, the computer has no choice but to accept it and grant access, consequences be damned. Technical writers have appropriated the term 'privilege' in this context -- should that be submitted for revision as well?

Verbal offense, on the other hand, cannot be given, only taken. The choice to be hurt by words like "master" and "slave" is entirely up to the listener. Any other position literally disempowers that listener. There is, or should be, no obligation on a writer's part to avoid such terminology. To borrow from another comment that probably got its poster banned, we are bordering on indulging mental illness here.

Anyway, it's OT for the article at hand. This whole debate just seems like a goofy distraction from real injustice, is all.


> Passwords aren't rhetorical devices, they're functional ones.

All symbols are also concrete things. Did you notice that we have words for the letters that we use to spell words?

Super Mario Maker troll design does a good job of showing this off. Mario Maker doesn't have distinct signifiers, so the way you tell a player "You will need a POW block to pass this" is to actually put a real POW block, just like the one they need, inside some solid blocks nearby. Likewise for Mario's powerup mushroom, for example. But because the signifier is the signified, troll makers will build puzzles where, e.g., the correct solution is to use Yoshi's tongue to grab the signifying POW out of the sign, and then you can blow that one up. With the powerup mushroom, if we're already Big Mario, a progressive powerup becomes Fire Mario's powerup: when we saw this sign before, it told us to be Big Mario, so we got ourselves Big Mario, but now, since we are Big Mario, the same sign says Fire Mario, and sure enough, Big Mario doesn't help after all.

[[ Ordinary Mario courses shouldn't do this, because it's annoying, but Troll Mario is supposed to be annoying, it's a delicate art form, like Stand Up comedy ]]

> There is, or should be, no obligation on a writer's part to avoid such terminology.

That is not how language actually works, the clear distinction you're relying on is instead blurred because symbols are not just symbols. Because language is a co-operative activity, choosing to do things you know will offend others is your fault. Now, if you're a comedian, too bad, some of the audience didn't like the joke. But if you write technical documentation this is a failure, your goal was to inform, not to make some people angry as the price to make other people laugh.


> Because language is a co-operative activity, choosing to do things you know will offend others is your fault

Sorry, no. It's absurd to expect me to take responsibility for someone else's offense at my use of "master" and "slave" in a technical document. This isn't a question of semiotics. At this point it's more like a Monty Python sketch. I'm not familiar enough with the Mario franchise to grasp your analogy, unfortunately.

The way I think of it is in terms of power: if you demand the right to control and revise the language we share, you're claiming an incredible privilege, and you're doing so without consulting many who have no voice to object. (By 'demanding' I mean claiming an unearned right to the moral high ground, as was done by the people who added that paragraph to the Cisco documentation. That degree of sanctimony normally requires religious backing.)

That doesn't mean you're wrong -- there are hurtful words that have few or no benign uses, after all, and it's easy to make the case that we're better off without them. But the burden of proof is a heavy one in the general case, due to the power required to shame the rest of society into compliance. It isn't met here.


Whatever happened to "sticks and stones may break my bones..."? Perhaps it's time to reconsider whether society has lost, or is losing, a very, very important thing here.

If younger generations grow up thinking that a simple word is violence, as we see some do, that it is akin to breaking out a knife and stabbing someone, how are they properly equipped to function in a world that isn't absolutely perfect?

You know what generally happens to snowflakes? They melt. Seriously, sticks and stones.


Goes well with the legal disclaimer that follows.

The legal or whatever-not-technical department wanted to leave their mark.


Tell that to my 2014 MacBook that's been constantly running neural nets for the past 5 years.


Crap, so it's certainly HP laptops. So which laptops are safe from this?


My HP laptop has a Toshiba SSD. I'm not sure about other models. But I think only enterprise SSDs are affected.


This appears to be talking about Cisco enterprise drives; where do you see anything about HP laptops?

Edit: If it's a problem in Cisco's upstream vendor then it could affect others, but probably still just enterprise stuff.


https://news.ycombinator.com/item?id=32052757

> HPE is one of the SSD OEMs affected by it: ...


Are you looking at the same post we are? Because your comment makes no sense whatsoever.



