SSD will fail at 40k power-on hours (2021) (cisco.com)
723 points by dredmorbius on July 10, 2022 | hide | past | favorite | 254 comments



Possibly related to recent HN issues, see: https://news.ycombinator.com/item?id=32031243


Wow, thanks for sharing. I didn't realize how closely related they were.

(TL;DR for anyone wondering: "recent HN issues" means HN very likely went down yesterday because of this same bug, when two (edit: two pairs, four total) enterprise SSDs with old firmware died after 40,000 hours close together. An admin of HN and its host both like this theory. See details in that thread.)

Edit: If you want to discuss that theory, it's probably better to do it in that other thread directly instead... dang and a person from M5 Hosting (HN's previous host) are both participating there.


Not two SSDs, four: two in the main server, and two in the backup server.


Yowch. The old "stagger your drive replacements, stagger your batches" thing might not be quite as outdated as we'd like to think...


I have definitely seen RAID arrays where the drives were all part of a single manufacturing batch, and multiple drives all failed in rapid succession. I think this can be caused by several things:

- Unless you periodically do full-drive reads, you may silently accumulate bad blocks across multiple drives in an array. When you finally detect a failed drive, you discover that other drives have also been failing for months.

- A full RAID rebuild is a high-stress event that tries to read every disk block on every drive, as rapidly as possible.

- And finally, some drive batches are just dodgy, and it may not take much to push them over. And if identically dodgy drives are all exposed to exactly the same thermal stress and the same I/O operations, then I guess they might fail close together?

Honestly, RAID arrays only buy you so much reliability. Hardware RAID controllers are another single point of failure. I once lost two drives and a RAID controller all together during Christmas, which was not a fun time.

I do like the modern idea of S3-like storage, where data is replicated over several independent machines, and the controlling software can recover from losing entire servers (or even data centers). It's not a perfect match for everything, but it works great for lots of things.


You are spot on with everything, especially RAID controllers.

I used to help manage a large fleet of database servers. We found that blocks could "rot" on the underlying storage, yet if they were read often enough they would be held in memory for months and never re-read from the underlying drive. Until you rebooted!


Yes, bitrot is a huge problem with mechanical hard drives and media that hasn't been read in a long time. What you write to the drive might not be what you read back five years later. That's why ZFS is critical for systems like this: you have checksums for each block of data and can rebuild from parity if there is a mismatch.


100% spot on about RAID. I actually had a second disk fail once while under the load of rebuilding the array after the first disk failure (not related to the SSD issue).


ZFS is the last file system you will ever need


ZFS is probably the best filesystem tech of this generation. The best filesystem tech of the next generation... I suspect Ceph.


Yeah true, if ZFS could span multiple servers it would be perfect


I had no idea anyone thought this was outdated. We certainly never stopped doing it. I think it is a timeless failsafe.


Yeah, Backblaze and DigitalOcean both talk about it a bunch in their sysops stuff.


About 20 years ago I worked for a small storage company. The person who managed the returns of disks from customers was very strongly of the opinion that Seagate drives with odd firmware versions were returned way more often than those with even ones.


That's the kind of superstition that's only brought about by deep trauma.


Back then, working with Seagate drives was a trauma in itself. In 2008/2009, I set up more than 3000 1 TB Barracuda ES drives; 1800 of them failed in the following 3 years (they came with a 5-year warranty). I stopped keeping track of Barracuda failures at some point.

Unsurprisingly, 14 years later I still wouldn't recommend Seagate drives to anyone.


Thinking back, mixing firmware versions on new units was avoided where possible, and we would also try to replace like-for-like firmware on RMAs.


Plus you mitigate getting a bad batch of the same drive. See IBM Deathstars and the Seagate drives.

May have to go check the power-on hours on my drives now; I must have a few nearing that sort of figure.


That's basically sysadmin scripture. Ignore it at your own peril.


Or, update the firmware on your switches, servers, drives, etc on a regular basis.


Thanks for the correction!


The chance of two SSDs failing at the same time under normal circumstances is extremely slim, so this might actually be a plausible cause of this incident.


True. But it's more about the probability of things being "normal", isn't it? I had multiple 970 Evos fail within a very short time. It turned out to be a systematic problem with drives produced in one specific month.

Just how much difference is enough to be safe is the real question...


It seems more likely it was four drives (though dang and Mike both refer to "two" in the earlier thread).

Both primary and failover servers had RAID arrays. I suspect RAID 10 (striped mirror), which would mean two drives would have to fail to take down a single server.

Four drives of the same manufacturer spec and batch would do that.


This is what always worries me about our home server. It's running ZFS with multiple redundant drives but the supplier refused (when I explicitly asked) to supply it with disks known to be from different batches claiming that the odds of multiple failures close together were negligible. Obviously we have backups as well but the time and cost to restore a full server from online backups can be significant.


Within an earlier thread "Tell HN: HN Moved from M5 to AWS", there's an excellent comment by loxias about risk diversification across multiple factors. Well worth reading:

https://news.ycombinator.com/item?id=32031655

I've increasingly come to view systems operations / SRE as a risk management exercise, where the goal is to reduce the odds of a catastrophic failure. Total system outage is one level, unrecoverable total system outage is even worse.

Having multiple redundant backups / storage systems, in different locations, with different vendor hardware / stacks, all helps reduce risk of a single-factor outage. Though complexity risk is its own issue.


> I've increasingly come to view systems operations / SRE as a risk management exercise, where the goal is to reduce the odds of a catastrophic failure.

s/reduce/find an appropriate level for/

It's a common misconception that risk management and risk reduction are synonyms. Risk management is about finding the right level of risk given external factors. Sometimes that means maintaining the current level of risk or even increasing it in favour of other properties.


My point is that risk is central to systems management. If you look at earlier standard texts on the subject, e.g., Nemeth or Frisch, the concept of risk is all but entirely missing. I have numerous disagreements with Google, but one place where I agree is that the term SRE, site reliability engineer, puts the notion of managing for stability front and centre, and inherently acknowledges the principle of risk. I've since heard from others that this is in fact how the practice is presented and taught there.

Quibbling over whether the proper term is risk management or risk reduction rather spectacularly misses the forest for the trees.


A friend of mine who was once at Google mentioned that in a meeting once their SRE group was told by their senior manager roughly "you've broken stuff in $time_period far less than your outage budget, and that probably means you should've been rolling out features faster".

Risk -levels- are a choice and also a trade-off.


Okay, fair enough. To me, the risk involvement is so obvious (to any serious business function!) that I find the management/reduction distinction a more important point. But I can see it your way too.


Thanks.

I don't know how long you've been in the business, but the change seems a relatively recent one, one that wasn't manifestly obvious to me, and one that has pretty much always seemed difficult to communicate to management.

Whether that's because business management is often about ignoring risks or treating it as inconvenient, or if I've just had a long string of bad bosses, I'm not sure.

I did make a point of looking through several of the books that were formative for me (mostly 1990s and 2000s publication dates), and there's little addressing the point. Limoncelli's book on time management for sysadmins was a notable departure from the standard when it came out, in 2008. I'd say that marked the shift toward structured and process-oriented practices.

That was about the time of the transition from "pets" to "cattle" (focus on individual servers vs. groups / farms), but pre-dates the cloud transition.


You know what? You're right again!

The stuff I've read that touches on this idea is almost all from 2006 and onwards, mainly 2010s. The earliest example is a bit of an outlier: Douglas Hubbard's 1985 How to Measure Anything -- but it's also only tangentially related.

The other real exceptions are books on statistics (where the idea of risk management -- at least in my collection -- seems to have gotten popular in the 1950s, probably as a result of the second World war) and financial risk management (which seems to really have taken off in the 1980s, probably in conjunction with options becoming a thing.) Statisticians and finance people (and by extension e.g. poker and bridge players) have known this stuff for a while.

Of course, hydrologists have been doing this stuff since the early 1900s at least, but extreme value theory has always been a kind of niche so I'm not sure I should count that.

----

That said, I did mention it was obvious to me. I still find it hard to convince management and colleagues of its importance...


> The stuff I've read that touches on this idea is almost all from 2006 and onwards, mainly 2010s. The earliest example is a bit of an outlier: Douglas Hubbard's 1985 How to Measure Anything -- but it's also only tangentially related.

I didn’t keep good track of such things but a lot of my early reading in the 70’s was in operations research and decision support systems, mostly sort of what we call operational analytics these days with a big helping of statistical process control too. World War 2 logistics practices and ‘50s and ‘60s “scientific management” fads generated a lot of material, some insightful. Many medium-sized businesses could afford significant R&D then, so you’ll find e.g. furniture factories developing their own computer systems from PCBs to custom ASIC components, just to manage statistical process control and decision support systems.

> That said, I did mention it was obvious to me. I still find it hard to convince management and colleagues of its importance...

I think the reason I keep having to justify this every few years is the tendency towards abstractions in management which try to simplify things into “anecdotal analytics”, e.g. preferring a persuasive narrative over reality...for a good cynical perspective from the ‘50s I recommend C.M. Kornbluth’s “The Marching Morons” (<https://en.wikipedia.org/wiki/The_Marching_Morons>).


> It's a common misconception that risk management and risk reduction are synonyms. Risk management is about finding the right level of risk given external factors. Sometimes that means maintaining the current level of risk or even increasing it in favour of other properties.

What’s funny is I seem to have to explain this to senior management anew every 6-7 years. I know they teach it in management school, but in the real world somehow people fall into the false equivalence when they get promoted. Often they adopt a cartoonish view of things because they can’t get the quantitative signals, and every decision effectively reduces to what I ironically term anecdotal analytics.

I have this amusing heuristic for risk acceptance which I often use to help people approach decisions: you should kick the decision up to someone with higher authority if your signing authority is less than: risk coefficient times quantified exposure, less mitigation cost where mitigation is within signing authority AND/OR budgeted and authorized spend. I like to view mitigation and opportunity cost/benefit in a similar way so I have some idea of equivalences when evaluating tradeoffs.
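For concreteness, a minimal sketch of that heuristic as I read it; the function name, the linear form, and the numbers are mine, not the parent's:

    # Hypothetical reading of the escalation heuristic above; not a standard formula.
    def should_escalate(signing_authority, risk_coefficient, quantified_exposure,
                        mitigation_cost, mitigation_authorized):
        residual = risk_coefficient * quantified_exposure
        # Only credit the mitigation cost if the mitigation itself is within
        # signing authority and/or already budgeted and authorized.
        if mitigation_authorized:
            residual -= mitigation_cost
        return signing_authority < residual

    # e.g. 10% chance of a $500k exposure, $20k authorized mitigation, $25k signing authority:
    should_escalate(25_000, 0.10, 500_000, 20_000, True)   # True -> kick it upstairs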

It’s not original with me, I must have lifted it from some decades-past HBR article or 60’s rant on quantitative business management.

I could rant on various aspects of risk management application all day, thank goodness I've managed to quit before I really got started sharing...in my experience it’s been very helpful when applied in real world engineering implementations.


One of the advantages startups have is higher risk tolerance than branded megacorporations: data, customers, brand, employees, lawsuits, or, cynically, human lives.


The risk-tolerance of startups is illusory.

The risk is being managed at the VC/investor level, by diversifying investment bets over numerous early ventures.

The death of any one of those isn't a concern for the VC, if the portfolio performance is sufficient. Of course, for the individual venture and employees, that risk is disaggregated.

More rigorous systems practices are seen as an impediment to early growth with any potential problems either something that can be ironed out later, or simply a post-liquidation concern that doesn't factor into the investors' interests at all.


The risk tolerance is the result of not potentially taking out $1T of additional value if you f-- up a small project, the way you could at a megacorporation.

At a megacorporation, you can't take risks which would make complete sense for the project you're on if considered independent of the uncapped liability to the mothership.

There are misalignments like the one you describe but that's not what I'm talking about.


> Total system outage is one level, unrecoverable total system outage is even worse.

Ha, I used to suggest people consider “total failure of business” in exposure quantification...


My solution is to use a different manufacturer for each drive in a mirror. The prices are usually pretty similar and you get to make sure that one firmware bug doesn't kill your entire pool.


This is the way.

For even more peace of mind, (and only when you can afford it, obviously) try decoupling your disk purchases a bit from when you're going to need them.

When you see a good price or a sale on a particular disk, grab it and add it to your own personal "prebought disk pool". When it's time to either replace a disk or spin up a whole new array, you now have the benefit of diversification across time.


My current procedure is to keep an external drive for backups, and if any of the drives in my RAID fails, I'll just shuck the external and stick it in the RAID. The advantage is that the drive is already known to be good through running badblocks (which takes like a week), and I don't need to wait a week for the Amazon man to get here. The disadvantage is that I need to recreate my backup from scratch, which loses my version history, or restore it from an online copy, which is slow and cumbersome.


I do the same. Ages ago I had to build a server at work and picked three vendors for the RAID 5. Got funny looks from coworkers; apparently they found the idea super strange. One drive (Seagate, of course) failed after a year, and since we had a matching-size WD lying around, we used that. Now there were two WDs in the setup. Some years later the PSU blew up and killed both WDs; the Toshiba survived.


I decided to employ this tactic when I was setting up my new NAS and needed two drives.

Upside was that I could definitely know they weren't from the same batch.

Downside was that I had to buy a Seagate, and I don't have good experiences with Seagate since my only Seagate drive had died an early death at the tender age of 3. Turns out that this was very much a downside since the Seagate drive died at the tender age of 16 months.


I had a series of WD drives that failed, and I managed to get them all replaced under warranty since they died within 3-5 years. I don't buy WD drives anymore but it wasn't the end of the world since I had spares while I waited for the replacement drives to be shipped.

Anecdotally, I haven't had issues with Seagate but I'm sure it really boils down to which exact drives you're using and what batch they were in.


Pick a manufacturer and you'll be able to find plenty of horror stories.

Some may be worse than others but diversification is the right answer anyway.


But how do you color match your drives in your spiffy NAS?

This is why my bicycle drivetrain should be a frankenstein combination of parts from different manufacturers?

/s


I hate absolutely everything about this comment.

If we're ever both at the same conference show me a link to this and I'll buy the first round.


> But how do you color match your drives in your spiffy NAS?

haha, pimp your ride!


> I had a series of WD drives that failed, and I managed to get them all replaced under warranty since they died within 3-5 years.

I find hard drive warranties to be mostly an illusion. It's better now that full-disk encryption is becoming better supported and potentially available on personal devices, not just corporate ones managed by IT professionals. However, until recently the number of drives I've had in any personal/home system that I would have returned under warranty, instead of securely destroying to prevent the risk of data leakage, was zero. The number of phones I have ever traded in is similarly zero. It's horribly wasteful, but until there are cast-iron guarantees that all the private data we keep on these devices is going to be securely deleted, it's the only sane policy IMHO (apart from never using these devices for anything remotely sensitive in the first place, but that's all but impossible in modern society).


All of my drives have FDE. I wouldn't have shipped the drives if that wasn't the case (also, luckily, the issue was that writes only failed on part of the disk, so I could wipe the LUKS metadata section).


Possibly the supplier was talking of hardware failures.

The issue here (as it was several years ago with the well-known Seagate 7200.11 issue [0]) is not about the odds of multiple (hardware) failures together (which may actually be a very rare case); in this case it is essentially a software failure, a counter that crashes the on-disk operating system (if we can call it that), be it an overflow of the counter or the counter hitting a certain value.

The chances of having almost simultaneous failures are near certainty for drives that are booted the same number of times and have been powered on for the same number of hours, if the affected counters are related to those events.

[0] Some reference:

https://msfn.org/board/topic/128807-the-solution-for-seagate...

>Root Cause

This condition was introduced by a firmware issue that sets the drive event log to an invalid location causing the drive to become inaccessible.

The firmware issue is that the end boundary of the event log circular buffer (320) was set incorrectly. During Event Log initialization, the boundary condition that defines the end of the Event Log is off by one. During power up, if the Event Log counter is at entry 320, or a multiple of (320 + x*256), and if a particular data pattern (dependent on the type of tester used during the drive manufacturing test process) had been present in the reserved-area system tracks when the drive's reserved-area file system was created during manufacturing, firmware will increment the Event Log pointer past the end of the event log data structure. This error is detected and results in an "Assert Failure", which causes the drive to hang as a failsafe measure. When the drive enters failsafe, further updates to the counter become impossible and the condition will remain through subsequent power cycles. The problem only arises if a power cycle initialization occurs when the Event Log is at 320 or some multiple of 256 thereafter. Once a drive is in this state, there is no path to resolve/recover existing failed drives without Seagate technical intervention. For a drive to be susceptible to this issue, it must have both the firmware that contains the issue and have been tested through the specific manufacturing process.
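To make that concrete, here is a toy reconstruction of that kind of boundary off-by-one; the real firmware is not public, so the structure below is invented purely for illustration:

    # Toy model only: the buffer size (320) is from the advisory, everything else is made up.
    EVENT_LOG_SIZE = 320            # valid entries are 0..319

    def advance_event_log(index):
        index += 1
        if index > EVENT_LOG_SIZE:  # buggy boundary check: should be >=, so 320 slips through
            index = 0
        # The firmware's failsafe: an assert failure here hangs the drive.
        assert index < EVENT_LOG_SIZE, "event log pointer past end of buffer"
        return index

    advance_event_log(318)   # fine -> 319
    advance_event_log(319)   # AssertionError: the power-up hang described above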


If you’re willing to wait, you can always order half, wait a month or two, then order the other half.


Unfortunately this particular server was a replacement for another that had failed suddenly so that wasn't really an option in my case. If it had been one of many at work then it would have been a sensible option, I agree.


Repurposing several of the initial drives with later purchases (or exchanges) might be another option.


If it's really 4 drives, bought at the same time, failing simultaneously, that's pretty damning evidence.

Reminds me of my laptop that I bought with 2 SSDs. It's not from HP or Dell, but still, I now wonder if I should replace one of them with a more recent SSD and give the other to my son (he currently has an anemic SSD that's too small to install Genshin Impact on).


Assuming you have the budget for it, a worst-case scenario of "I made my son happy while achieving nothing -else-" doesn't strike me as terrible at all.


It costs money and there's no guarantee it will actually make him happy. It could lead to him playing the game in some dark corner where no one can find him. There are advantages to him having to use the desktop PC.


Especially since one pair was a nearly unused backup server that had a totally different use profile.


If both SSDs are from the same lot number and one fails, the chances of the second failing go up considerably. Both failing at the same time, though, is extremely rare.


We (as an industry) went through this bad batch madness with the IBM DeskStar 75GXP hard drives, which were affectionately referred to as "IBM Deathstar"[1].

It's rare, but it's not _that_ rare. You have to make the effort to understand why it failed.

I had a situation where I deployed Toshiba SLC SSDs (that were purchased over the course of several months) and a piece of software that synchronized to disk frequently, resulting in about 1GB of writes per hour.

After ~11 months in service, most of the drives died in the same 4 week period. We were astounded that everything failed so close to each other, including instances where both drives in a RAID 1 set were toast.

We did extensive troubleshooting between the failed servers and the remaining servers and figured out that write volume (by proxy of in-service date) was the one predictor of failure. Shortly thereafter, wear leveling and TRIM became things we sought out mentions of when spec'ing out hardware.

1: https://en.wikipedia.org/wiki/Deskstar


There was also more recently the case of the Seagate 7200.11, see my previous comment:

https://news.ycombinator.com/item?id=32053477


It usually takes something really bad happening before things get better.

The absolute best mechanical drives available until quite recently can be traced back to the Deathstar. The Deskstar 7K4000 was absolute best in class.

https://en.wikipedia.org/wiki/Deskstar

Hitachi bought IBM's hard drive business in 2003 for $2B. Sadly, it's now owned by WD.


Of the ~20 assorted Deskstar and related Ultrastar drives of that vintage I've had in regular use over the past decade, exactly one has failed...because I dropped it.

HGST SAS SSDs (the ones that pair Intel NAND and Hitachi SAS controllers) have also been reliable performers in my home office experience, even without (RAID controller support for) UNMAP (SCSI "TRIM"). Incidentally, these now appear to be selling on eBay for more than I paid for them "lightly used" several years ago.



Back then people still used Slashdot:

Wondering if it's real... https://m.slashdot.org/story/20680

Years later, it's a widespread phenomenon: https://m.slashdot.org/story/43312

It was affectionately called the "click of death".


>But don't we all love them now because they support linux?

Ah, inventing an opinion to get angry about: some things never change.


Why would there be submissions? These drives predate HN.


HN occasionally discusses issues pre-dating itself.


... and the lifetime (deathtime?) award for the Deathstar only predated HN by a few months:

May 26, 2006: https://www.pcworld.com/article/535838/worst_products_ever.h...

October 9, 2006: https://news.ycombinator.com/item?id=1

Memory would still have been reasonably green.


It does however come up in comments regularly. I know I've brought it up more than once, because I had a week long ordeal replacing all the drives in an array as they died one by one back in the day.


The deathstars were fantastic, they almost always failed on the outer edges of the platters.

So if you only formatted them (filesystem-wise) out to capacity minus 2 GB, they were a really cheap option at the time.


That’s a very different definition of fantastic than the one I use.


Once I'd worked it out - which was after the problems were public and therefore the price had utterly cratered - they were by far the cheapest storage per GB available at the time (think "by a factor of two").

I would not have let a normal business user near one, but the developers I was supporting were most pleased about their larger than expected scratch disks for test databases and intermediate compilation artifacts.

Everything breaks. Things that at least break predictably make me happier than the alternative.


The way I always put it is: you have identical drives, with identical usage, powered for identical times, and you are still surprised when a second drive fails under the high-stress environment of rebuilding after the first drive fails.


Heh, no. We had a fleet of HPE Cloudline (CL3100) servers failing at the same time because the SSDs had exhausted their write endurance.


Perhaps they were the same model. IIRC people recommend not using the same model of hardware to provide redundancy.


I generally replace HDDs in my personal zpool a few days apart, for this reason. I also order them from different suppliers, so I can get different manufacture dates.


This gives me a strong feeling of general unease and flashbacks to the days of WD hard drives.


Miniscribe RLL disks: destroyer of early PC building firms.


WD and Seagate had abysmal quality in the eighties. Amstrad was badly burned by both and sued them. Amstrad won $90M from Seagate, but failed to secure a $141M win on appeal from WD.


Check your power-on hours:

    $ sudo smartctl -a /dev/sda | grep -e Power_On_Hours -e ^ID
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       9743
Just looking at the raw value, it seems to be 9'743 hours in my case


I seem to have the world's oldest SSD (or am I misinterpreting the output?)

  (shell 1) ~# smartctl -a /dev/sda | grep -e Power_On_Hours -e ^ID
  ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours_and_Msec 0x0032   000   000   000    Old_age   Always       -       933932h+27m+33.940s


More often than not, SMART attributes are completely undocumented and the interpretation of their raw values is pure guesswork on the smartctl devs' part. For your SSD make/model they just have it wrong.


That would make for 106 years of power on time, so it's probably not right...


That is one of the most careful uses of "probably" that I have ever seen.


Fellow time traveller?


    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
    9 Power_On_Hours          0x0032   055   055   000    Old_age   Always       -       39676
I'm about 300 hours from 40K. Time to buy a new SSD? Is this real?!


I mean, is it the affected model?

Have you applied the appropriate firmware update?


I did not find the model info, but thanks!


smartctl -i /dev/sda


0. Have backups

1. Check your backups


The two above rules are true regardless of hours on your storage, make, model, or technology. ;)


iMac mid-2010. Original disk drive.

"9 Power_On_Hours 0x0032 001 001 000 Old_age Always - 74233"

12 years old. More than 8 years of run time. It keeps on purring.

Yes, I have redundant backups. I also have a replacement drive ready. I just want to see how far I can take it.


Luckily it is old enough that you can replace it.


Just FYI for anyone for whom this didn't work by default: I needed to use the --all flag with smartctl (and install smartmontools if you don't have it).


The -a flag from my example should be an alias for --all (man smartctl | grep -A1 ' -a' | head -2). Is that not the case in your version?


Pro tip: when writing out commands for people to read it helps to use the long form arguments. In this case passing '--all' instead of '-a' to smartctl. It makes it easier to read and more clear what specific options do. Same with calling things in scripts. Short form is for quick and dirty typing things, but not great for reading or comprehension :)


Fair point, yes, I should have done that!


Derp you are correct, sorry. Hopefully my comment is still useful for anyone who didn't know they needed smartmontools


I had to adapt a bit:

    sudo pacman -S gsmartcontrol
    sudo smartctl -a /dev/nvme0n1p4 | grep -e "Power On"

I'm at only 2727 hours.


Checked my Samsung 970 Evo 2TB and it says 487 even though it’s been on continuously for years.


Receiving power isn’t the same as being “on”. I’d assume the drive has a sleep state that it goes into after inactivity, and those hours don’t count as “power on” hours.


Interesting, thanks. The spec does leave it almost uselessly underspecified:

"Power On Hours: Contains the number of power-on hours. This may not include time that the controller was powered and in a non-operational power state."

The same drive reports only 329 "controller busy time" minutes.


My 960 Pro 1TB, which I paid way too much for back in the day, is at just shy of 5,000 hours, and I've used the hell out of it since it was new.


The 960 Pro was released in October 2016. Even if we assume you purchased it the same month it was released and have been running it 24/7 non-stop, that's 50,591 hours counting from October 1st to this hour. I can assume the number is way below 1/3 of that even for people who use it 8 hours a day.



How to do this on a Windows PC?


In an elevated PowerShell prompt:

Get-PhysicalDisk | Get-StorageReliabilityCounter | Select-Object PowerOnHours


Getting blanks on some of my disks. Two show 64422h and 73318h with no signs of failing :)


I'm getting a blank value back as well, even with admin privileges.


getting a blank value


Make sure to run PS with admin privileges


I did, but still got no SMART values. Speccy works though and I'm sure I've also run smartctl successfully on this laptop in the past.


install ubuntu linux first


Thanks


Mine is above 53,000 hours ... time to check my backups!


Sounds like you're in the clear for this particular bug...

...but always check your backups regularly for data that is dear to you!

Protip of the day: that includes things on someone else's server. I remember when Grooveshark went offline from one day to the next and I lost nearly my whole library because I remembered only some artists and had to go through thousands of songs to find which ones I actually liked from them. My browser's localStorage object had the playlists but I didn't use those much. Or when 000webhost cancelled my account because I was using the 100MB(?) to back up some files that were most important to me, rather than for actual webhosting (in my defense, I was 15 at the time), and so when I returned from a holiday with my parents with an actual crashed hard drive, that turned double sour. Backing up things from what they now call the "cloud" is something I learned early, as I have virtually no code I wrote before that summer, only some of the music, only essays with WordArt if they were printed, etc.

If you use Telegram, Spotify, Netflix, maybe you have videos uploaded to YouTube and no longer have local copies... my recommendation is to have backups of things that are important to you, onto a medium that you own (it's only a disaster-case copy anyway), and for copyrighted content like Spotify/Netflix it would simply be enough to just have a list of songs/videos. Maybe Netflix doesn't go offline from one day to the next, but your account might be hacked, or that friend you share it with might be hacked, or the right set of hard disks fail at their datacenter, etc. GDPR data exports are your friend, particularly when they're automated and you don't have to bother support. (They might also reveal, as in my case, that Spotify knows how often you shower because I then connect it to a particular waterproof bluetooth speaker at wakeup time. Data exports are also fun to browse!)


Cisco is not an SSD manufacturer. They write "industry-wide bug". Does that mean that more than one SSD manufacturer is affected (because they partially use the same firmware)? Further down they mention only SanDisk. Or is "industry-wide" just their newspeak for any SanDisk drive of an affected model, regardless of whether it's installed in a Cisco box or somewhere else?


I suspect that "industry-wide bug" in this context is simply Cisco pointing out to their customer base that this isn't Cisco's fault and please don't blame Cisco.


I'm interested here too. I've got a Crucial SSD from 2015 that's been on about:

* 100% of 2015-2017, let's add 2 years here

* Aboutish 50% of days since 2018 to 2020

* On and off again (5%?) since then until now.

So it's about 3 years of full use? I'm eyeballing the use here. So it may be close to the numbers that were given, but I'm not sure. Guess I could check the SMART stats to get a precise number and from there decide what to do about it.

Searching a bit, it seems it's a well-known bug in "enterprise SSDs" [0, 1] (which my drive certainly isn't), but there aren't any real details about what causes it, other than "a firmware bug".

[0] https://www.servethehome.com/hpe-issues-hpd7-fix-for-ssds-th...

[1] https://www.anandtech.com/show/15673/dell-hpe-updates-for-40...


Enterprise SSDs are commonly made by one of a handful of dominant companies and then rebranded by server vendors, so that you can see a SanDisk or Samsung SSD sold as a Dell EMC or Cisco or HPE drive.


The problem seems to be widely experienced.

The Cisco report turned up in response to a post I'd made of the HN issue on the Fediverse:

https://mastodon.infra.de/@galaxis/108622795822100862


Dang also listed a few previous submissions on the topic.

None of which gained traction at the time:

https://news.ycombinator.com/item?id=32038993


If you google "SSD 40000 hours" you will find many box shifters affected: Dell, HP, IBM, etc.


Would companies be willing to contribute to the OpenSSD project?

OCP (Open Compute Project) has shown that customer-operators can cooperate on open hardware designs, successfully influencing enterprise hardware supply chains. Commercial DPUs and SmartNICs were preceded by a decade of open hardware and research by the NetFPGA project (https://netfpga.org). Why not DiskFPGA?

2017 OpenSSD overview, based on Xilinx: https://github.com/Cosmos-OpenSSD/Cosmos-plus-OpenSSD/blob/m...

2022 status, http://www.openssd-project.org/

> OpenSSD platforms are still being actively used in many academic institutions. As of June 2022, we have renewed the homepage hoping that this site will be a forum to share various simulators, tools, traces, etc. not only for the conventional SSDs but also for the upcoming storage devices such as KVSSD, ZNS SSD, and Computational Storage (CSX). This site is being maintained by Systems Software and Architecture Lab. at Seoul National University as a part of the SW STAR Lab. project.


An open source SSD is also a lot more feasible than an open source hard drive. Even if you managed to get an open source HDD controller, you still need the precision mechanical parts that are impossible for the average person to make. With SSDs, however, it’s just a PCB with ICs.

Edit: this obviously ignores any troubles one would have sourcing the ICs (such as possible NDAs)


Is there anything special about making SSDs that the average person would not be able to do or is it a "if you can outsource PCB printing and maybe solder you can make one" situation?


The limiting factor would be the memory chips themselves and any firmware required for them (if any). I also don't know how well they are spec'd and if full documentation is available without NDA's and lawyers.


Looking on Mouser and Digikey, it doesn't seem like flash chips, even at fairly high density on a single chip [e.g. 1], are all that difficult to get and get info on, though they all have very high minimum-volume orders. So if a person wanted to try to do this on their own, they'd probably be best off finding like 50 friends to go in on the order with them.

[1] https://www.mouser.ca/datasheet/2/671/micron_technology_mict...


The flash you linked to the datasheet for is over a decade old. Any SSD built from it would fall short of adequate by an order of magnitude in every important metric. The per-die capacity of typical current-generation NAND flash is 16x larger, the interface speeds are 8x higher, erase blocks are 24x larger, program latency can be 5-10x higher. And most importantly, all mainstream SSDs now use flash that stores three or four bits per physical memory cell, rather than one.

So using that flash, you could build something that is recognizably an SSD. But it would be almost entirely useless: too expensive and too small and slow for production use, and too far removed from the current state of the art to serve as a research platform for the most important challenges the SSD industry has been dealing with for the past several generations (error correction strategies for TLC/QLC, and SLC caching).


A couple things:

- I just picked one at random. I'm sure the bleeding edge is harder to get and datasheets are harder to get, but I wasn't trying to find the newest or best.

- the specific subthread here is about the diy-ishness of ssds vs. spinning rust, where the difficulties are of a fundamentally different kind. I feel like it goes without saying that a home built ssd is not going to perform to the level of mass production devices, the question was just can you.


The new, fast, high density flash chips from the big name flash chip vendors are not generally even listed on the vendor websites. You have to talk to a sales person and convince them you're actually going to buy in volume to even get data on the latest generation of flash ICs. You will also likely need more than 50 friends to meet the order minimums, unless your 50 friends each want to buy about an ExaByte worth of flash chips.

Also, with these multiple level flash technologies (quad level is current tech, triple is still used in some SSD/NVMe) the read, write, and ECC algorithms are non-trivial to the point where last I checked even mainline Linux's raw flash driver support won't do anything beyond single level cell flashes (and very few new embedded designs are choosing raw parallel NAND flash, instead opting for things like eMMC or UFS which have built-in controllers to handle this).


Like I said in another branch of this subthread, the question wasn't "can you build a high perf flash drive yourself" but "can you build an SSD more easily than a magnetic drive."

The comparison here is that no matter how much you hunt on digikey you won't find a disk platter or drive head or any of the other precision machined parts that go into a hard drive (never mind putting them together and keeping dust out etc).


Taobao has many sellers of pcb/controller and flash chips. There are even sections dedicated to DIY SSDs at Chinese forums.

But they all rely on leaked manufacturer firmware production tools, not open source firmware.


An FPGA-based SSD will always be too expensive for people with truly large scale. The controller ASICs are a lot cheaper.


Yeah, if we want an open SSD the path would be for a hyperscaler to strong-arm a controller vendor like Microsemi or Marvell into opening up their SDK. This worked with various Broadcom ASICs so it's not impossible.


It's been over two years since this was first identified... since this apparently affected many makes and models of SSDs, it would be nice to know if my laptop could be affected and if there's anything I could do about it.


This will not affect your laptop, all of the models affected by this are enterprise SAS SSDs.

Of course your SSD might have some other firmware bug that would eat your data, all you can do is search for the model number and see if the manufacturer has issued any notices/firmware updates.


There was at least one consumer SSD with a similar failure mode, the Crucial M4 SATA drive: unless you updated the firmware, it would crash after 5,200 cumulative power-on hours.

That drive launched in 2011 though, so there probably aren't that many still in active use that haven't yet reached ~7 months of uptime.


Yes. This one: https://www.reddit.com/r/buildapc/comments/1z2rm5/crucial_m4...

That problem became known a decade ago, so it's somewhat surprising to see such a similar bug now.

This new one is worse because the drive cannot be used after reaching the magic number of hours. In the Crucial M4 case the firmware could be updated even after the bug struck.


> This will not affect your laptop

That’s just your presumptive opinion, right?

Edit: sorry, probably put that offensively. mikiem said about the HN drives: “These were made by SanDisk (SanDisk Optimus Lightning II) and the number of hours is between 39,984 and 40,032...” - https://news.ycombinator.com/item?id=32031428 Without knowing whether parts of a codebase are shared between SanDisk devices, it is hard to say that enterprise SAS devices have absolutely no code shared with consumer devices. So it is just the commenter’s opinion unless the commenter has knowledge of writing SanDisk firmware. “HPE and Dell both used the same upstream supplier (believed to be SanDisk) for SSD controllers” https://www.anandtech.com/show/15673/dell-hpe-updates-for-40...


> Without knowing parts of a codebase are shared between SanDisk devices, it is hard to say that enterprise SAS devices have absolutely no code shared with consumer devices.

Even if the code containing this bug was shared between consumer and enterprise drives, it's not reasonable to assume that it would take SanDisk multiple years to check whether their consumer drives are also affected. The lack of a follow-up report from SanDisk is good evidence that their other products are not affected.


Yeah, I agree it is very unlikely to affect someone’s laptop.

However I dislike a black and white “This will not” absolute fact statement: even if based on reasonable assumptions which is what appears to be the case versus detailed knowledge.

Most laptops don’t run their SSDs 24/7, and unless a manufacturer’s error affects a lot of consumers, we often don’t find out the cause of consumer equipment errors in my experience.

If the OP has a laptop older than 2020, with an SSD with a Crucial chipset (especially if SATA), and they leave it on most of the time, then maybe check SMART hours.


How likely is it that they're using an enterprise SAS SSD in their laptop?


I've been searching "40000 hour SSD" since the HN downtime. There's a lot of bug reports besides this one and I'm fairly confident it only affects enterprise too.


One thing everyone could and should be doing is backups.


Two things: Test restores or you don't actually have backups. Just saying.


I got bit by this with iPhone backups. I did a phone trade in and followed the backup before trading in instructions. Problem is after the trade in the backup failed to restore due to an unknown error. The whole manual syncing and backing up with a cable workflow with Apple is super fickle and riddled with bugs.

Luckily I had Time Machine backups of my iOS backups and I managed to avoid losing too much data.

As a sidenote it seems like Apple has pretty much neglected their offline backup and syncing workflow to drive more people to just pay for iCloud storage. Half the time my iPhone takes hours just to get detected by the mac when plugged in.


Man, Time Machine can fail just as badly. Unknown errors and there is no help or documentation or way to fix it. Carbon Copy Cloner [0] is the way to go for retaining sanity. Absolutely excellent documentation for pretty much any use case. And it works reliably. Not affiliated but after having had terrible experiences with Time Machine I feel compelled to bring it up every time I come across the topic.

[0] https://bombich.com/


While I don't like how annoying Apple is with service upselling (iCloud, Music, Arcade), at least they moved iPhone backup from iTunes to Finder. So their local iPhone backup process is being maintained over time.

I don't have issues with my computer (PC or Mac) detecting my iPhone. Generally need to make sure iPhone is unlocked after plugging it in. What is tough is the large size of my iPhone (X gb) and how small my Mac's HD is (2X gb).


I’ve changed iPhones many times and the issue still persists for me. The only reliable way to get photos synced or the iPhone detected in Finder is to turn on airplane mode, for some reason. Must be a bug with Wi-Fi syncing.

You actually bring up another issue: there is no obvious way to back up an iPhone locally to an external hard drive. So either pay the Mac SSD storage tax or the iCloud tax.


Depending on your use case, you can integrate occasionally using your backups into your normal data processing.

Again, it depends on the use case, but then it becomes integral to your existing workflow instead of an addendum that you end up forgetting to do.

The whole purpose is to make the failure of one extremely noisy and irritating.

It's like what I do with raid. I have a script that will shut the machine down on drive failure and then will use dialog(1) to say something like "hey bozo replace the fucking drive first" when you boot it up and then it will shutdown again and be unusable.

Make the complaining show stopping, loud, rude, and disruptive. Because if the next one fails you're screwed


Absolutely! Twice in my career, in huge failures, the backups turned out to be garbage! You don't want this!


We were eviscerated by this (or something just like it) a few years ago. Drives started failing by the dozens.

Had to rebuild from HDD backups, down for a week. I still have nightmares.


They should have used a Free BIOS Language in their hardware like Open Firmware FORTH from the OpenBIOS project, to go with the Bias Free Language in their documentation.

https://en.wikipedia.org/wiki/OpenBIOS

>OpenBIOS is a project aiming to provide free and open source implementations of Open Firmware. It is also the name of such an implementation.

>Most of the implementations provided by OpenBIOS rely on additional lower-level firmware for hardware initialization, such as coreboot or Das U-Boot.

https://en.wikipedia.org/wiki/Open_Firmware

>Open Firmware is a standard defining the interfaces of a computer firmware system, formerly endorsed by the Institute of Electrical and Electronics Engineers (IEEE). It originated at Sun Microsystems, where it was known as OpenBoot, and has been used by vendors including Sun, Apple, IBM and ARM. Open Firmware allows the system to load platform-independent drivers directly from a PCI device, improving compatibility.


40000 (or even 40960) seems an odd number to fail at. 64k or 32k would make the cause pretty obvious, but 40000 doesn't seem all that round in binary. Perhaps a 12-bit counter incrementing every 10h? This is puzzling.

Of course, I am also entertaining the possibility that no one thought they would be in use for this long, which would certainly be evidence of planned obsolescence.


Very strange understanding of the word "evidence".

No sane SSD manufacturer would do such a thing on purpose. You do it and you lose business, that's it.

The simplest explanation is that somebody made an honest engineering mistake.


When you purchase a server (fleet), you get a long warranty with it. Generally 3 to 5 years. So you expect this fleet to stay in service for <=5 years mostly.

Unless you burn through your SSDs, you're very unlikely to hit this event.

When these servers continue to be used and the disks all start to fail at the same time, this will obviously stink.

The bathtub curve is not like this. You can feel that.


40k hours is a little more than 4.5 years. These drives deterministically fail at that uptime (unless firmware is updated) and most servers are on 24x7, so if you run your servers for 5 years, it's highly likely you'd run into this. If you run your SSDs hard and they fail early as a result, then you'd be spared from this mass death. Or if you use three year leases and replace on a schedule.
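For reference, the arithmetic, in the same spirit as the 2^57 ns calculation further down the thread:

    >>> round(40_000 / 24 / 365, 2)   # 40k power-on hours in years of 24x7 uptime
    4.57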

Now more than ever, five year old server hardware isn't that far behind the curve unless you're on the bleeding edge. I've been looking for bottom of the barrel hosting lately, and there's lots of dedicated servers available with 10+ year old cpus, and probably most of the rest of the machine is a similar age.


you have any recs for lower-end bare metal providers?


I haven't used them yet, but take a look at dedicated server offers at LowEndTalk and Web Hosting Talk. Obviously, there's some offers that are sketchier than others, and a lot of resale, but if your needs are low, there's some neat stuff.


Given the power dynamic between a single customer and large corporations, the smart thing to do is to assume malice until proven otherwise. This puts the onus on the corporations and, if we're lucky, creates an environment where they compete with each other to be seen as the most honest. The worst thing that happens is the single customer has to buy an SSD from someone they don't trust.

If we do the opposite, as you say, and assume everything is an honest mistake, that puts pressure on the single customer to prove that the organization with a huge marketing budget is doing something wrong. In this situation, the worst thing that happens is we all get taken advantage of.

Our collective distrust is the only power we have against massive marketing/PR budgets. It doesn't have to be angry, or sour, or cranky, we just collectively need to not take their word until we have a reason to do so.


Are you seriously saying that by default we should believe they intentionally planned to cause their customers to lose all of their data?


Considering the immoral practices adopted by corporations, such as vendor lock-in, use of slave labour (directly or indirectly), bending the law for their own interests, and supporting and conducting biased research towards their own interests, among others, I would say that it is quite sensible to believe it. Big corporations, per se, are not evil entities. The people running them might or might not be, and when you have evil/immoral people running things, unless there are good control measures in place, they might make bad decisions.


If spinning rust can run for ~8 years without any problems, a consumer SSD can hit beyond 40K hours reliably, and everything is checked and tested tens of times because of the complexity of flash storage, I'd get suspicious too.

Also, enterprise drives get firmware updates (spinning or not), and this firmware is automatically applied via the RAID controller, so it could have been remedied easily before it got this big if it's an actual error.


planned obsolescence is quite a thing...?


In some cases, but a product must fulfill its core purpose. If a SSD intentionally dumped data and self destructed at a set time, that would be disastrous for the brand. Same way a car doesn't adopt planned obsolescence by blowing up after 200k miles.


> If a SSD intentionally dumped data and self destructed at a set time, that would be disastrous for the brand.

Other than "intentionally" (which we cannot know and makes no difference to whether you lose your data or not) that is literally what these SSDs are doing, and no SSD brand has been destroyed over it.


You are not a used car aficionado?

'This insulation prematurely disintegrates under normal use causing the wires it is designed to protect and insulate, to short causing many problems.'

http://www.mercedesdefects.com/2008/04/wire-harness-defect.h...


What more could a manufacturer do to be "disastrous for the brand" than literally build an SSD that stops working after 40k hours? Because this does not seem to qualify for you.


Someone pointed out on the other thread that it could be 2^57 nanoseconds:

  >>> 2**57/10**9/3600
  40031.996687737745


If it were 53, I'd wonder "are they storing the time in the integer part of a double precision float?" That wouldn't go negative, it'd just start absorbing increments without changing the value.

Though that might cause a divide by zero?

What could cause unexpected behavior at 57 bits?

Perhaps storing fractions of an hour, like incrementing it every 1/16th of an hour and calculating a relative rate of change, causing a divide by zero?
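The "absorbing increments" behaviour at 53 bits is easy to demonstrate, e.g. in Python:

    >>> x = float(2**53)
    >>> x + 1 == x          # increments smaller than the spacing between doubles vanish
    True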


My overactive imagination thinks it went something like this:

Engineer A: Gee, I need to store a few flags with each block, but there's nowhere to put them. Ah! We're storing timestamps as 64-bit microseconds. I can borrow a few of those bits and there'll still be enough to go for thousands of years without overflowing.

Engineer B: Gee, our SSDs are getting so fast, soon we'll be able to hit 1M writes/sec. But we're storing timestamps as microseconds. How can we generate unique timestamps for each write? Ah! I'll switch to nanoseconds. It's a good thing we have plenty of space in this 64-bit int.

BOOM!


Packing a type flag into the upper bits of a 64 bit value is a reasonably common optimisation in dynamic language implementations (because it lets you use unboxed number arithmetic).
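A bare-bones sketch of the trick, not any particular runtime's actual layout:

    # Schematic only: 3 tag bits in the top of a 64-bit word, value in the low 61 bits.
    TAG_BITS = 3
    TAG_SHIFT = 64 - TAG_BITS
    VALUE_MASK = (1 << TAG_SHIFT) - 1

    def pack(tag, value):
        return (tag << TAG_SHIFT) | (value & VALUE_MASK)

    def unpack(word):
        return word >> TAG_SHIFT, word & VALUE_MASK

    unpack(pack(0b101, 42))   # (5, 42): small integers stay unboxed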


Or sometimes the lower bits, as at least used to be the case for integers in V8. (Also OCaml, but that's not dynamically typed. It simplifies the garbage collector to at least sometimes not require a pointer map for each type, just a flag in the object header to indicate if it contains any pointers, and then everything that isn't ints or pointers needs to be boxed.)


Do embedded CPUs like the one in an SSD have floating point units? It seems more likely to me that the upper bits in a 64 bit integer counter were used for something else.


I think it is more likely they shifted a power of two over implicitly by a base 10 place value instead of a binary one. Or multiplied by 10. Unsure why. But, seems simpler.

52 is notable as 2^4 + 2^3 = 24, 24 + 24 = 48, 48 + 2^2 = 52. But 57?


From a related issue with a different vendor:

"The fault fixed by the Dell EMC firmware concerns an Assert function which had a bad check to validate the value of a circular buffer’s index value. Instead of checking the maximum value as N, it checked for N-1. The fix corrects the assert check to use the maximum value as N."

https://www.anandtech.com/show/15673/dell-hpe-updates-for-40...

Why the MAX value would be in a circular buffer, or what was being stored in N-1? No idea.


From my reading, it's checking the maximum index into the circular buffer. That is, when it hits the end of the circular buffer, there's an assertion to check that they're properly wrapping the index back to the start of the buffer, but the assertion has an off-by-one error.

I presume you find a lot of circular buffers in SSD firmware, for wear-leveling reasons. Samsung's NILFS and NILFS2 are structured as circular buffer append-only logs, at least partly to avoid trusting the firmware wear-leveling.


> circular buffer

The infamous Seagate firmware bug was due to the same thing.


2^57 nanoseconds is ~40032 hours. I wouldn't be surprised if someone out there was counting intervals in a 64-bit value and masking off some of the higher bits for flagging.

Any time I see these sorts of issues (odometers that kill themselves, for example) I think of smaller units at higher bit depths. That's not the only way to get to this kind of concern, but it's a way that pretty-darn-competent engineers can leave ticking time bombs due to estimation failures.

Always check types for overflow and/or precision loss. Always.
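
A quick way to sanity-check those horizons for a few unit/width combinations (the 57-bit row assumes the borrowed-flag-bits guess above; the 32-bit-milliseconds row is the classic ~49.7-day surprise):

  # Hours until a free-running tick counter wraps, for a few unit/width combos.
  combos = [("ms", 1e3, 32), ("us", 1e6, 64), ("ns", 1e9, 57)]
  for unit, ticks_per_second, bits in combos:
      hours = 2**bits / ticks_per_second / 3600
      print(f"{bits}-bit {unit} counter wraps after {hours:,.0f} hours")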


It's a power of two shifted by a decimal place value instead of a binary one. Unsure why.


Backblaze have a great blog about things they learn about hard drives. It's been going for years, less about firmware issues and more about general usage. https://www.backblaze.com/blog/backblaze-drive-stats-for-q1-...


Yikes. Cisco claims this is "an industry wide firmware index bug". Is there any validity to this claim? Are any consumer drives affected by the same issue?


Yes, HPE is one of the SSD OEMs affected by it:

https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na...

https://support.hpe.com/hpesc/public/docDisplay?docLocale=en...

> Are any consumer drives affected by the same issue?

As far as I know it doesn't affect consumer drives, but I wouldn't be surprised if some have the same defective firmware.


Is there any information about the provenance of this SSD controller? Sounds like enterprisey vendors all rebranded some upstream supplier's hardware.

edit: apparently sandisk: http://forum.hddguru.com/viewtopic.php?f=3&t=39964 - also a clue about the magic 40k hour significance: "the SSD alters its performance in some way as it approaches end of life. This appears to shine some light on the reason for a trigger at 40K Power On Hours."


This apparently also happened two years ago?

https://www.anandtech.com/show/15673/dell-hpe-updates-for-40...


HPE had the same issue on some of their SSDs. We received an advisory months before it would have been a problem, and had time to upgrade all our customers ... except 2 servers that we missed. Luckily, when I saw all disks in error in the iLO on one of the servers, I remembered this issue, googled the model, confirmed it was affected, and was able to shut down the second server and start the upgrade. Not sure the client was happy, but at least it was only a 1h outage instead of maybe a day to get new disks plus restoring from backup. HPE did replace the disks under warranty.


From what I remember reading, this affected SanDisk only, is that correct?

I have a Samsung EVO and an OCZ SSD. Would these be affected too? Perhaps some shared component?

Cisco has written "industry-wide" here, which is confusing.


One of the dumbest things I have done in my life is buying an SSD and new HDDs to farm Chia. After about a month of farming, the SSD died due to the constant read/writes.


Did you make any money?


Nice one. For your enjoyment have a look at the "all-time" chart. https://coinmarketcap.com/currencies/chia-network/

While the concept of Chia was interesting at the time, and also reminded me of the "smart fridges" of Silicon Valley (the show), filling up gigabytes with trash data to prove a technical point made me lose interest. Just glad I didn't invest more.


"the SSD will report that 0 GB of available storage space remains. The drive will go offline and become unusable."

Considering this is a firmware "bug" that bricks the drive, due to a misplaced index, not a physical wear issue, it appears to be a 4.5-year planned obsolescence feature.


I’m moving all my storage to vellum with papyrus backups.


Alexandria's got an excellent hosting facility.


Sadly rendered effectively write-only


The original argument for geographically-distributed backups!


Reading the Wikipedia entry on Power-on hours [1] says that:

"...Once a [SSD] drive has surpassed the 43,800 hour mark (5 years), it may no longer be classed as in "perfect condition" "

And that SSDs generally have a 5-year life expectancy.

So with this bug, should we simply think of it as a hard 40,000-hour lifetime limit? Well, that's roughly 10% less than the design figure.

I'm just not sure how realistic it will be to obtain SSD firmware updates given that it's "an industry wide firmware index bug".

How could I even know if a particular SSD has an affected firmware?

[1]: https://en.m.wikipedia.org/wiki/Power-on_hours
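
Back-of-envelope on those two figures, assuming the drive is powered on 24/7:

  >>> 5 * 365 * 24                  # the Wikipedia "5 years", in hours
  43800
  >>> round(40000 / 8760, 2)        # years of 24/7 power-on until the bug
  4.57
  >>> round(1 - 40000 / 43800, 3)   # shortfall vs. the 5-year figure
  0.087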


Nothing has eternal life.

Especially not electronics. And certainly not more advanced semiconductors.

The rule of thumb is that the lifespan of planar CMOS processes, in years, is roughly proportional to the "node size" in nanometers. So we are right now at 2-10 years. Specific ICs, applications, or designs can come in longer or shorter than this, but it's the average.

If you've ever worked on "antique" electronics, this is no surprise.


Here's how to check the power on hours on a Mac:

I didn't have `smartctl` installed on my Mac, so here's how to install it via `brew`:

    brew install smartmontools 
My SSD is `disk0` (check `Disk Utility.app`).

Running `smartctl` for my `disk0`:

    sudo smartctl -a /dev/disk0 | grep -e "Power On Hours"
    Power On Hours:                     6,334
So I have only ~15% of those 40k hours used (6,334/40,000).
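
And if you want to turn that into a rough time-to-40k estimate (plug in your own number; a machine that isn't on 24/7 accumulates power-on hours much more slowly than calendar time):

    power_on_hours = 6334                     # my reading from smartctl above
    remaining = 40000 - power_on_hours
    print(f"{power_on_hours / 40000:.1%} used, "
          f"about {remaining / (24 * 365):.1f} years left even at 24/7 uptime")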


Related topic: leaving SMART control tests ON for a (non-SSD) drive apparently interferes with sleep; the drive will wake up to test itself. For some drives I would prefer that not to happen and for them to just stay quiet. Yet testing for this behavior seems elusive: querying the disk wakes it, and most Linux disk tools seem unaware of sleep state. I just listen for the disk spinning, or notice a long pause before an operation.


What exactly is causing the bug though? If the same area is written to some x number of times, the solid-state device in that location permanently fails, is that correct? If so, in an always-on device, how can this bug be escaped? They would need to randomize the writes, and for that the storage size should be a multiple of what is needed for regular operation. Even then the disk will fail eventually. What am I missing here?


There is a bug where a software counter overflows, which a firmware upgrade will fix, is what it says.


Ah cool, thanks


Since this is "an industry wide firmware index bug", is there a (complete) list somewhere of all SSD models that are affected?


So, has anyone opened one of these SSDs and tried to get at the firmware and find out WTAF code was written?


You would just find a firmware update file, which is much easier; there's a hddguru thread linked above somewhere where people have been having a gander.


A Modest Proposal: all $LARGE business insurance policies specify that, to the extent that any insured loss was caused or worsened by reliance upon the correct functioning of SSD or related drive technologies...YOYO, and any & all losses are solely on you.


Seems that I also need a replacement SSD. Using CrystalDiskInfo on my Samsung 850 EVO 1TB, it shows Health 99%, 5450 Power On Count, 27463 Power On Hours, BUT [FF] Remaining Life 10, which is at the lower end...


4-byte integer rollover: 2^32 = 4,294,967,296 ticks spread over 4.5 years (4.5 x 365 x 24 x 3600 seconds) works out to about 30 ticks/sec.


Unlikely; a clock running at 30 Hz overflows a 32-bit (unsigned) integer in 39,768 hours, while the HN disk failed after at least 39,984 hours [1], and the vendors wouldn't issue warnings about 40,000 hours if drives actually failed about 230 hours before that.

[1] https://news.ycombinator.com/item?id=32031428
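
Working that out, in the same style as the 2^57 guess upthread:

  >>> round(2**32 / 30 / 3600)      # hours until a 32-bit counter wraps at 30 Hz
  39768
  >>> 40000 - 39768                 # roughly the "230 hours" mentioned above
  232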


Not sure about the mfgr - could be rounding for memory saliency... but re HN they're counting clock time, not power-on time. Could have easily been turned off for a total of 10 days for maintenance etc. (and a good routine could have both servers off about the same number of hours). That's only 0.6% downtime.


> Not sure about the mfgr - could be rounding for memory saliency...

In that case, I would expect them to be rounding down, not up, and to at least mention the exact number somewhere.

> but re HN they’re counting clock time not power on time. Could have easily been turned off total 10 days for maintenance etc.

I seriously doubt HN has been down for 10 days over the last five years. As discussed in the other thread I linked, this has been one of the longest HN outages.


Yeah, no, I don’t mean crashed altogether, just off for routine maintenance on a rotating basis. But no sense in prolonging this. Anyway, agree that 1% seems high.


Somewhat unrelated, but I recently had a motherboard fried by power instability, which has given me a healthy respect for the difference between spinning rust and SSDs.

My SSDs were scragged; my HDDs were just fine. I guess it's time to figure out how to get a realistic write-through cache setup going, because from now on, if it ain't on spinnin' rust, it ain't hard enough yet.


Of what quality was the mobo? I thought the higher-end stuff typically has power protection.


ServalWS B450. Was not at all amused. Oddest bloody thing, because I had a pair of Samsung EVO M.2 NVMes in there, and it still managed to let the magic smoke out.

CPU/GPU and RAM lived, but the corruption of the drives (even if the data was largely recoverable) and their being rendered unviable for further writes really took me by surprise.

That, combined with the way the HDD apparently did not care one lick, illustrated for me a difference in tolerance to operating conditions that I'd not had the chance to witness first-hand yet.

Just figured I'd share while we were talking about SSD weirdness and firmware nonsense.


I have a 128GB OCZ Agility (or something of that era) disk that is still in use in my mom's laptop :)


This sounds like the exact same bug that affected HPE products a couple of years ago.


.


[flagged]


My reaction was "If you want to write some documentation with bias-free language, just write the documentation with bias-free language." Why the need for a long paragraph explaining "Look how great and sensitive we are!"

I understand, and agree with, the desire to use inclusive language, but so much of this has just devolved into performative nonsense.

Edit: Thought I'd leave my original statement above up, but after reading some of the other comments below I at least understand now the purpose of this notice. Basically, when it comes to Cisco's products, they previously used "master/slave" and "whitelist/blacklist" in their terminology, but no longer do. However, of course some older networking products still use that terminology in UI software, for example. So this notice is essentially saying "We got the memo about updated language, but if you see the terms master/slave or whitelist/blacklist in our docs, it's because it's essentially referencing something out of our control to fix, so don't yell at us."


Else you get questions, like, "why don't you say master/slave like everyone else?!@!!"


At this stage I think such questions can just be ignored.


Saying that and usernames like "DoneWithAllThat" and "hn_throwaway", yeah, it checks out.


Not sure exactly what point you're trying to make, but if it's "the risk of saying anything even remotely critical of DEI tactics is a huge, gargantuan, giant career risk these days", then I wholeheartedly agree.


It's a device to identify certain kinds of people who would have a problem with less loaded language without any loss in clarity.


I would love to see some examples of Cisco documentation that ever offended anyone.


It's a warning that the documentation may refer to master/slave or something like that because Cisco cares enough about DEI to update documentation but not enough to actually update out-of-support firmware.


I miss old Cisco documentation, with IP addresses and router names like SanJose3 and 408 phone numbers on PRIs etc.


Firepower is a good start, although for different reasons.


Other than DEI administrators, trainers and people in positions with DEI in them, who is actually getting offended?


People that have to move their mouse a bit to hit the x button, apparently.


Having a statement like that shows people that they are open to suggestions on improvements. Since a lot of people are not so open to suggestions, it makes sense to me to include this language. They added a little X button so you can close it easily.


[flagged]


If someone thinks inclusive language is a sign of hostility, then I think they have misunderstood the situation. I’m happy to have open source contributions from anyone, but I do share my pronouns in posts and on videos and I can tell you that the very small number of people who have gotten upset by that were angry unhelpful people who were more interested in complaining about language than contributing to the community.


[flagged]


It’s a space that is inclusive of queer folks. If someone can’t handle that, then yeah, they probably aren’t mature enough to safely interact with those people. Every space needs to have some rules in the event that someone gets nasty. Someone who gets upset when they encounter a non-binary person is probably going to make those people feel unwelcome. So you can either protect the vulnerable people or you can allow the bully to push other people out. But whoever is organizing that space can consciously decide, or they can let the bully decide for them.

By the way, you seem to have this really charged, negative view of queer spaces. Which I respectfully want to suggest is a misunderstanding of those spaces. I mean, if you go in all angry and complaining about other people, then sure, they’re gonna (rightfully) kick you out. But that’s true of most places if you show up and act like a jerk. The truth about these places, and being queer in general, is that this is a beautiful space of happiness for so many people. Being able to express gender openly without judgment is a positive, empowering experience for so many people. All you have to do to fit in is accept that. You don’t have to like it, though that is encouraged. But if you get so upset by someone existing as their true self that you have to argue with others, you’re probably not a good fit for that space.

I want to suggest you listen to the YouTube channel Beau of the Fifth Column. He’s got a lot of great takes he explains in a way I think you’d understand. And sometimes he talks about stuff like this.

Anyway I’m non-binary. I’m taking hormones and expressing my gender in ways I never knew I could (wearing clothes I “wasn’t supposed” to wear). It feels great. All I really ask for is that you accept me for who I am. And I hope you see that I’m able to be kind to you even though I know you think I’m a “pronoun person”.

Edit: Here’s a couple videos from Beau of the Fifth column talking about gender. He’s a white guy from the south who explains things in a non judgmental way and I think he has a good perspective, but I’ll let you decide:

https://youtu.be/vQ53lVyi4so

https://youtu.be/FaFK9uqbqrY


Thanks for standing up to that unpleasantness. I know you aren't really supposed to feed into that, but it's refreshing to see pushback.


Thanks for recognizing it. I have enough privilege that I am insulated from this kind of thing, so I have the energy to be patient. And it helps me develop my rhetorical skills. Cheers.


[flagged]


Barry I appreciate that you recognize my good faith efforts. But I want to highlight that queer and gender nonconforming people are regularly marginalized and othered in this country and around the world. It is genuinely tiring to them to be dismissed regularly in their daily life and then to encounter people online who want to play this up as some culture war with two legitimate sides. I personally would not label you an "unbeliever". As I have said, these people just want to exist and have people like you deal with that without getting upset.

Some people like me have a little extra energy to sit down and explain it, but you cannot regularly expect marginalized people who have to deal with so much day to day to then spend energy educating you on the facts of the lives of queer and gender nonconforming people. They are far, far too tired from everything else to spend that energy educating you.

But I have given you a choice. A couple of videos from a respectful white southern man who I wholeheartedly endorse for his explanations. I watch his videos every day and to be honest I so far think he always has great takes.

So I have given you an opportunity to learn another perspective, and I encourage you to check it out. But if you don't, and you continue to act as the victim of a culture war that does not actually harm you in any way, then I promise you you will continue to be downvoted and kicked out of any space with open minded people.

I genuinely wish you the best. But I do not endorse this notion of dogma and unbelievers. This is about common decency and respect. This is about the lives of real people I respectfully think you haven't yet had the opportunity to understand. But consider checking out those video links. It would be an act of good faith to help your fellow people who just want to exist in peace.


[flagged]


Barry I’m just being honest. If you don’t change your behavior you can expect more of the same. Whether it’s being downvoted or being accused of being a jerk, I can see there is something about this that bothers you.

But let’s stay focused. Would you be willing to watch one or both videos I posted? Or no? There’s a lot of people out there that could benefit from your understanding, and all it would take is a few minutes of your time. Please consider it. I’d appreciate it too. Thanks.


The docs may include "master/slave", and they don't want to get sued or bad PR, so this generic notice says "we don't like bad words but sometimes the industry uses bad words and that's unfortunate". If you click the Learn More link in the paragraph, you'll learn more.


[flagged]


There is -- it's using words other than those, which is both easy and considerate.


Some of us believe it's a mistake to give mere words that much power.


It is possible to believe that and act to defuse this power anyway. I believe password authentication is crap and people should use WebAuthn, but I don't say to myself, "Since I believe password authentication is crap I don't need a password manager".


You've chosen a good analogy to back up the point I'm making. Passwords aren't rhetorical devices, they're functional ones. Not only that, but they're imperative. If you present the proper password, the computer has no choice but to accept it and grant access, consequences be damned. Technical writers have appropriated the term 'privilege' in this context -- should that be submitted for revision as well?

Verbal offense, on the other hand, cannot be given, only taken. The choice to be hurt by words like "master" and "slave" is entirely up to the listener. Any other position literally disempowers that listener. There is, or should be, no obligation on a writer's part to avoid such terminology. To borrow from another comment that probably got its poster banned, we are bordering on indulging mental illness here.

Anyway, it's OT for the article at hand. This whole debate just seems like a goofy distraction from real injustice, is all.


> Passwords aren't rhetorical devices, they're functional ones.

All symbols are also concrete things. Did you notice that we have words for the letters that we use to spell words?

Super Mario Maker troll design does a good job of showing this off. Mario Maker doesn't have distinct signifiers, so the way you tell a player "You will need a POW block to pass this" is to actually put a real POW block, just like the one they need, inside some solid blocks nearby. Likewise for Mario's powerup mushroom, for example. But because the signifier is the signified, troll makers will build puzzles where, e.g., the correct solution is to use Yoshi's tongue to grab the signifying POW out of the sign, and then you can blow that one up. With the powerup mushroom, if we're already Big Mario, a progressive powerup becomes Fire Mario's powerup: when we saw this sign before, it told us to be Big Mario, so we got ourselves Big Mario, but now, since we are Big Mario, the same sign says Fire Mario, and sure enough, Big Mario doesn't help after all.

[[ Ordinary Mario courses shouldn't do this, because it's annoying, but Troll Mario is supposed to be annoying, it's a delicate art form, like Stand Up comedy ]]

> There is, or should be, no obligation on a writer's part to avoid such terminology.

That is not how language actually works, the clear distinction you're relying on is instead blurred because symbols are not just symbols. Because language is a co-operative activity, choosing to do things you know will offend others is your fault. Now, if you're a comedian, too bad, some of the audience didn't like the joke. But if you write technical documentation this is a failure, your goal was to inform, not to make some people angry as the price to make other people laugh.


> Because language is a co-operative activity, choosing to do things you know will offend others is your fault

Sorry, no. It's absurd to expect me to take responsibility for someone else's offense at my use of "master" and "slave" in a technical document. This isn't a question of semiotics. At this point it's more like a Monty Python sketch. I'm not familiar enough with the Mario franchise to grasp your analogy, unfortunately.

The way I think of it is in terms of power: if you demand the right to control and revise the language we share, you're claiming an incredible privilege, and you're doing so without consulting many who have no voice to object. (By 'demanding' I mean claiming an unearned right to the moral high ground, as was done by the people who added that paragraph to the Cisco documentation. That degree of sanctimony normally requires religious backing.)

That doesn't mean you're wrong -- there are hurtful words that have few or no benign uses, after all, and it's easy to make the case that we're better off without them. But the burden of proof is a heavy one in the general case, due to the power required to shame the rest of society into compliance. It isn't met here.


Whatever happened to "sticks and stones may break my bones..."? Perhaps it's time to reconsider whether society has lost, or is losing, a very, very important thing here.

If younger generations grow up thinking that a simple word is violence, as we see some do, that it is akin to breaking out a knife and stabbing someone, how are they properly equipped to function in a world that isn't absolutely perfect?

You know what generally happens to snowflakes? They melt. Seriously, sticks and stones.


Goes well with the legal disclaimer that follows.

The legal or whatever-not-technical department wanted to leave their mark.


Tell that to my 2014 MacBook that's been constantly running neural nets for the past 5 years.


Crap, so it's certainly HP laptops. So which laptops are safe from this?


My HP laptop has a Toshiba SSD. I'm not sure about other models. But I think only enterprise SSDs are affected.


This appears to be talking about Cisco enterprise drives; where do you see anything about HP laptops?

Edit: If it's a problem in Cisco's upstream vendor then it could affect others, but probably still just enterprise stuff.


https://news.ycombinator.com/item?id=32052757

> HPE is one of the SSD OEMs affected by it: ...


Are you looking at the same post we are? Because your comment makes no sense whatsoever.



